Python get the x first words in a string - python

I'm looking for code that takes the first 4 (or 5) words of a string.
I tried this:
import re
my_string = "the cat and this dog are in the garden"
a = my_string.split(' ', 1)[0]
b = my_string.split(' ', 1)[1]
But I can't take more than 2 strings:
a = the
b = cat and this dog are in the garden
I would like to have:
a = the
b = cat
c = and
d = this
...

You can use slice notation on the list created by split:
my_string.split()[:4] # first 4 words
my_string.split()[:5] # first 5 words
N.B. these are example commands. You should use one or the other, not both in a row.

The second argument of the split() method is the limit. Don't use it and you will get all words.
Use it like this:
my_string = "the cat and this dog are in the garden"
splitted = my_string.split()
first = splitted[0]
second = splitted[1]
...
Also, don't call split() every time when you want a word, it is expensive. Do it once and then just use the results later, like in my example.
As you can see, there is no need to add the ' ' delimiter since the default delimiter for the split() function (None) matches all whitespace. You can use it however if you don't want to split on Tab for example.
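To make the effect of the limit and of the default delimiter concrete, here is a small sketch. The `messy` string, with a tab and a double space, is made up for illustration and is not from the question.

```python
my_string = "the cat and this dog are in the garden"

# With a limit of 1, split() cuts only once: one word, then the rest.
print(my_string.split(' ', 1))   # ['the', 'cat and this dog are in the garden']

# Without a limit, every word becomes its own list item.
words = my_string.split()
print(words[:4])                 # ['the', 'cat', 'and', 'this']

# Hypothetical input containing a tab and a double space.
messy = "the\tcat  and"
print(messy.split())             # ['the', 'cat', 'and']
print(messy.split(' '))          # ['the\tcat', '', 'and']
```

Note how splitting on an explicit ' ' keeps the tab inside a word and produces an empty string for the doubled space.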

You can split a string on whitespace easily enough, but if your string doesn't have enough words in it, the unpacking assignment will fail with a ValueError because the list holds fewer items than there are names on the left.
a, b, c, d, e = my_string.split()[:5] # May fail
You'd be better off keeping the list as is instead of assigning each member to an individual name.
words = my_string.split()
at_most_five_words = words[:5] # terrible variable name
That's a terrible variable name, but I used it to illustrate the fact that you're not guaranteed to get five words – you're only guaranteed to get at most five words.
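If you really do want separate names, one possible workaround is to pad the list before unpacking. This is only a sketch: `first_words` and `first_words_padded` are hypothetical helper names, and using None as the filler is an assumption you may want to change.

```python
def first_words(text, n):
    """Return at most n leading words of text (hypothetical helper)."""
    return text.split()[:n]


def first_words_padded(text, n, fill=None):
    """Like first_words, but pad with fill so the result always has n items."""
    words = first_words(text, n)
    return words + [fill] * (n - len(words))


print(first_words("the cat and this dog", 4))   # ['the', 'cat', 'and', 'this']

# Unpacking is now safe even though the string has only two words.
a, b, c, d, e = first_words_padded("the cat", 5)
print(a, b, c)                                   # the cat None
```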

Related

How to get only the strings that are >4 in a data frame

I have this script that reads an Excel file which contains a bunch of quotes in each line, and then I set and filter by keyword. I need to filter by the word length of the strings, e.g. the length of 'one dog' is 2. I want to select the data frame to point me only to the strings that are > 3 words long, like 'i have a dog.'
keywords = ['dogs']
solution_category = []
# Lower keyword.
keywords_lower = [x.lower() for x in keywords]
if not solution_category:
    print("Words:", keywords_lower)
# Search for occurrence of keyword.
df_data_clean = df[(df['Conversa'].str.lower()).str.contains('|'.join(keywords_lower))]
IIUC, pandas.Series.str.len could be used instead of a custom function.
# Search for occurrence of keyword.
df_data_clean = df[df['Conversa'].str.contains('|'.join(keywords_lower), case=False)]
# Filter data where column `Conversa` has > 3 words
df_data_clean = df_data_clean.loc[df_data_clean['Conversa'].str.split(' ').str.len() > 3]
You could define the following function which (naively) counts the number of words in a given string:
def word_count(string):
    # removes leading/trailing whitespace, partitions string
    # by whitespace and counts the number of elements in partition
    return len(string.strip().split())
Here are some examples.
example1 = "I have a dog."
print(word_count(example1) > 3)
# True
example2 = "My dog."
print(word_count(example2) > 3)
# False
You can then filter using this function.
An intuitive way would be to create your own function which splits the string on spaces and takes the length of the resulting list:
def get_num_words(string):
    words = string.strip().split(' ')
    num_words = len(words)
    return num_words
Next apply this to the text column of your DataFrame along with your specified condition:
filtered = df[df['Conversa'].apply(get_num_words) > 3]
Edit: As the other answer demonstrated, string.strip() should also be used to avoid overcounting.
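To see where the overcounting comes from, here is a short pure-Python sketch. The sample string with extra spaces is made up for illustration.

```python
text = "  I have  a dog. "   # made-up sample with extra spaces

# Splitting on a literal space yields an empty string for every extra space.
print(len(text.split(' ')))          # 8

# strip() removes the leading/trailing ones, but the inner double space remains.
print(len(text.strip().split(' ')))  # 5

# split() with no argument collapses any run of whitespace.
print(len(text.split()))             # 4
```

So strip() helps with the ends of the string, while calling split() without an argument also copes with repeated spaces inside it.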

Remove only trailing whitespace from output

I have a task that was assigned to me for homework. Basically the problem is:
Write a program that can get rid of the brand names and replace them with the generic names.
The table below shows some brand names that have generic names. The mapping has also been provided to you in your program as the BRANDS dictionary.
BRANDS = {
    'Velcro': 'hook and loop fastener',
    'Kleenex': 'tissues',
    'Hoover': 'vacuum',
    'Bandaid': 'sticking plaster',
    'Thermos': 'vacuum flask',
    'Dumpster': 'garbage bin',
    'Rollerblade': 'inline skate',
    'Asprin': 'acetylsalicylic acid'
}
This is my code:
sentence = input('Sentence: ')
sentencelist = sentence.split()
for c in sentencelist:
    if c in BRANDS:
        d = c.replace(c, BRANDS[c])
        print(d, end=' ')
    else:
        print(c, end=' ')
My output:
Sentence: I bought some Velcro shoes.
I bought some hook and loop fastener shoes.
Expected output:
Sentence: I bought some Velcro shoes.
I bought some hook and loop fastener shoes.
It looks the same, but in my output there was an extra whitespace after 'shoes.' when there isn't supposed to be a whitespace. So how do I remove this whitespace?
I know you could do rstrip() or replace() and I tried it, but it would just jumble everything together when I just need to remove the trailing whitespace and not remove any other whitespace. If the user put the brand name in the middle of the sentence, and I used rstrip(), it would join the brand name and the rest of the sentence together.
The key is to use a string's join method to concatenate everything for you. For example, to put a space between a bunch of strings without putting a space after the last bit, do
' '.join(bunch_of_strings)
The strings have to be in an iterable, like a list, for that to work. You could make the list like this:
edited_list = []
for word in sentence_list:
    if word in BRANDS:
        edited_list.append(BRANDS[word])
    else:
        edited_list.append(word)
A much shorter alternative would be
edited_list = [BRANDS.get(word, word) for word in sentence_list]
Either way, you can combine the edited sentence using the join method:
print(' '.join(edited_list))
This being Python, you can do the whole thing as a one-liner without using an intermediate list at all:
print(' '.join(BRANDS.get(word, word) for word in sentence_list))
Finally, you could do the joining in print itself using splat notation. Here, you would pass in each element of your list as a separate argument, and use the default sep argument to insert the spaces:
print(*edited_list)
As an aside, d = c.replace(c, BRANDS[c]) is a completely pointless equivalent of just d = BRANDS[c]. Since strings are immutable, any time you do c.replace(c, ...), you are just returning the replacement in a somewhat illegible manner.
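As a quick check that neither join nor the splat form leaves a trailing space, here is a small sketch. The word list is made up, and io.StringIO is used only to capture what print writes so it can be inspected.

```python
import io

words = ['I', 'bought', 'some', 'tissues.']   # made-up, already-edited word list

# join puts the separator only between items.
joined = ' '.join(words)
print(repr(joined))            # 'I bought some tissues.'

# print(*words) does the same via its sep argument, then adds a newline.
buf = io.StringIO()
print(*words, file=buf)
print(repr(buf.getvalue()))    # 'I bought some tissues.\n'
```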
The problem is that print(c, end=' ') will always print a space after c. Here is a pretty minimal change to fix that:
sentence = input('Sentence: ')
sentencelist = sentence.split()
is_first = True
for c in sentencelist:
    if not is_first:
        print(' ', end='')
    is_first = False
    if c in BRANDS:
        d = c.replace(c, BRANDS[c])
        print(d, end='')
    else:
        print(c, end='')
As others have pointed out, this can be tidied up, e.g., d = c.replace(c, BRANDS[c]) is equivalent to d = BRANDS[c], and if you change it to c = BRANDS[c], then you could use a single print call and no else clause.
But you also have to be careful with your approach, because it will fail for sentences like "I bought a Hoover." The sentence.split() operation will keep "Hoover." as a single item, and that will fail the c in BRANDS test due to the extra period. You could try to separate words from punctuation, but that won't be easy. Another solution would be to apply all the replacements to each element, or equivalently, to the whole sentence. That should work fine in this case since you may not have to worry about replacement words that could be embedded in longer words (e.g., accidentally replacing 'cat' embedded in 'caterpillar'). So something like this may work OK:
new_sentence = sentence
for brand, generic in BRANDS.items():
    new_sentence = new_sentence.replace(brand, generic)
print(new_sentence)
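If embedded matches ever did become a concern, one alternative (not what the answer above does) is a regular expression with word boundaries. The following is a sketch under that assumption: `replace_brands` is a hypothetical helper name and the dictionary is a shortened copy of BRANDS.

```python
import re

# Shortened copy of the BRANDS dictionary, just for this sketch.
BRANDS = {'Velcro': 'hook and loop fastener', 'Hoover': 'vacuum'}

# Build one pattern like \b(?:Velcro|Hoover)\b; re.escape guards special characters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, BRANDS)) + r')\b')


def replace_brands(sentence):
    # Look up each matched brand name in the dictionary.
    return pattern.sub(lambda m: BRANDS[m.group(0)], sentence)


print(replace_brands("I bought a Hoover."))   # I bought a vacuum.
print(replace_brands("Hoovering is fun"))     # Hoovering is fun
```

The \b anchors let "Hoover." match despite the period, while a longer word such as "Hoovering" is left alone.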
Your end=' ' unconditionally appends extra spaces to your output. There is no consistent way to undo this (echoing a backspace character only works for terminals, seeking only works for files, etc.).
The trick is to avoid printing it in the first place:
sentence = input('Sentence: ')
sentencelist = sentence.split()
result = []
for c in sentencelist:
    # Perform replacement if needed
    if c in BRANDS:
        c = BRANDS[c]  # c.replace(c, BRANDS[c]) is a weird way to spell BRANDS[c]
    # Append possibly replaced value to list of results
    result.append(c)
# Add spaces only in between elements, not at the end, then print all at once
print(' '.join(result))
# Or as a trick to let print add the spaces and convert non-strings to strings:
print(*result)
You don't have to loop over the words and print them one at a time.
Try this code; it avoids the trailing whitespace problem:
sentence = ' '.join(str(BRANDS.get(word, word)) for word in input_words)
Here, input_words is a list that holds the words you want to process.
Happy Learning!

print after a certain word in python

In Python, I would like to read the input and then only print after a certain point. I would like for it to work like this
humaninput = raw_input("Please enter:")
breakdown = humaninput.split()
say = "say"
if say in breakdown:
    print (all words after say)
I have everything except for the last part
Here's a neat alternative that doesn't use split.
string = "string blah say foo bar"
say = "say"
after = string[string.index(say) + len(say):] # +1 if you're worried about spaces
print(after)
>> foo bar
And if there are multiple instances of "say", it will take the first.
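One caveat: string.index raises a ValueError when "say" is missing. A possible variation that avoids this uses str.partition; this is a sketch, and `words_after` is a hypothetical helper name, not something from the answer above.

```python
def words_after(text, marker):
    """Return the text following the first occurrence of marker, or None if absent."""
    before, found, after = text.partition(marker)
    # partition returns an empty separator when the marker was not found.
    return after.strip() if found else None


print(words_after("string blah say foo bar", "say"))  # foo bar
print(words_after("nothing here", "say"))             # None
```

Like the index version, this matches "say" as a substring, so it would also trigger inside a longer word.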
Since you converted all the entries in to a list you can find the first instance of "say", then make a new list with everything after it.
humaninput = "This is me typing a whole bunch of say things with words after it"
breakdown = humaninput.split()
say = "say"
if say in breakdown:
    split = breakdown.index(say)
    after = breakdown[split+1:]
    print(after)
This is pretty easy to do with split() if you're just using a string.
if say in humaninput:
    saysplit = humaninput.split(say, 1)
    print(saysplit[1])
split() accepts a whole string as the separator, not just a single character; with no separator at all it splits on whitespace. If you've got a list, the other answer is correct.

how to get the last part of a string before a certain character?

I am trying to print the last part of a string before a certain character.
I'm not quite sure whether to use the string .split() method or string slicing or maybe something else.
Here is some code that doesn't work but I think shows the logic:
x = 'http://test.com/lalala-134'
print x['-':0] # beginning at the end of the string, return everything before '-'
Note that the number at the end will vary in size so I can't set an exact count from the end of the string.
You are looking for str.rsplit(), with a limit:
print x.rsplit('-', 1)[0]
.rsplit() searches for the splitting string from the end of input string, and the second argument limits how many times it'll split to just once.
Another option is to use str.rpartition(), which will only ever split just once:
print x.rpartition('-')[0]
For splitting just once, str.rpartition() is the faster method as well; if you need to split more than once you can only use str.rsplit().
Demo:
>>> x = 'http://test.com/lalala-134'
>>> print x.rsplit('-', 1)[0]
http://test.com/lalala
>>> 'something-with-a-lot-of-dashes'.rsplit('-', 1)[0]
'something-with-a-lot-of'
and the same with str.rpartition()
>>> print x.rpartition('-')[0]
http://test.com/lalala
>>> 'something-with-a-lot-of-dashes'.rpartition('-')[0]
'something-with-a-lot-of'
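One difference worth knowing shows up when the delimiter is missing altogether. The sketch below is written with Python 3's print function, and the `no_dash` value is a made-up input.

```python
with_dash = 'http://test.com/lalala-134'
no_dash = 'http://test.com/lalala'   # hypothetical input without a dash

# With a dash present, both approaches agree.
print(with_dash.rsplit('-', 1)[0])       # http://test.com/lalala
print(with_dash.rpartition('-')[0])      # http://test.com/lalala

# Without a dash, rsplit returns the whole string, rpartition an empty head.
print(repr(no_dash.rsplit('-', 1)[0]))   # 'http://test.com/lalala'
print(repr(no_dash.rpartition('-')[0]))  # ''
```

Which behaviour you want depends on whether a string without the delimiter should pass through unchanged or be treated as having no prefix.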
The difference between split and partition is that split returns a list without the delimiter and splits wherever the delimiter occurs in the string, i.e.
x = 'http://test.com/lalala-134-431'
a,b,c = x.split('-')
print(a)
"http://test.com/lalala"
print(b)
"134"
print(c)
"431"
whereas partition divides the string only at the first delimiter and always returns a tuple of 3 values (head, delimiter, tail)
x = 'http://test.com/lalala-134-431'
a,b,c = x.partition('-')
print(a)
"http://test.com/lalala"
print(b)
"-"
print(c)
"134-431"
So, since you want to split at the last delimiter, you can use rpartition. It works the same way, but it searches for the delimiter from the end of the string
x = 'http://test.com/lalala-134-431'
a,b,c = x.rpartition('-')
print(a)
"http://test.com/lalala-134"
print(b)
"-"
print(c)
"431"

remove specific whitespaces from list items

I have a huge list of lines, each of which looks as follows
1 01 01 some random text
The 1 01 01 part is a reference number that changes from line to line. I want to remove the two whitespaces between the three reference numbers, so that the lines look as follows.
10101 some random text
Obviously, this calls for a for loop. The question is what I should write inside the loop. I can't use strip,
for i in my_list:
    i.strip()
because that, if anything, would remove all whitespaces, giving me
10101somerandomtext
which I don't want. But if I write
for i in my_list:
    i.remove(4)
    i.remove(1)
I get the error message 'str' object has no attribute 'remove'. What is the proper solution in this case?
Thanks in advance.
If the number is always at the beginning, you can use the fact that str.replace function takes an optional argument count:
for l in mylist:
    print l.replace(' ', '', 2)
Note that I'm doing print here for a reason: you can't change the strings in-place, because strings are immutable (this is also why they don't have a remove method, and replace returns a modified string, but leaves the initial string intact). So if you need them in a list, it's cleaner to create another list like this:
newlist = [l.replace(' ', '', 2) for l in mylist]
It's also safe to overwrite the list like this:
mylist = [l.replace(' ', '', 2) for l in mylist]
Use the count argument for replace, to replace the first 2 spaces.
a = "1 01 01 some random text"
a.replace(" " , "", 2)
>>> '10101 some random text'
split takes a second argument: the maximum number of splits to make
for i in my_list:
    components = i.split(" ", 3)
    refnum = ''.join(components[:3])
    text = components[3]
Or in Python 3:
for i in my_list:
    *components, text = i.split(" ", 3)
    refnum = ''.join(components)
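For a single line, the split-based approach plays out roughly as follows. This is a Python 3 sketch using the sample line from the question; the `result` name is only for illustration.

```python
line = "1 01 01 some random text"

# Split at most 3 times: three reference parts, then the untouched remainder.
*components, text = line.split(" ", 3)
print(components)          # ['1', '01', '01']

# Glue the reference parts together and re-attach the text.
refnum = ''.join(components)
result = refnum + " " + text
print(result)              # 10101 some random text
```

Because the remainder is never split, any spaces inside the text part are preserved exactly.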
