If I have a string and want to return a word that includes a whitespace how would it be done?
For example, I have:
line = 'This is a group of words that include #this and #that but not ME ME'
response = [word for word in line.split() if word.startswith('#') or word.startswith('ME ')]
print response
# ['#this', '#that', 'ME']
So ME ME does not get printed because of the whitespace.
Thanks
You could just keep it simple:
line = 'This is a group of words that include #this and #that but not ME ME'
words = line.split()
result = []
pos = 0
try:
    while True:
        if words[pos].startswith('#'):
            result.append(words[pos])
            pos += 1
        elif words[pos] == 'ME':
            result.append('ME ' + words[pos + 1])
            pos += 2
        else:
            pos += 1
except IndexError:
    pass
print result
Think about speed only if it proves to be too slow in practice.
From the Python documentation:
string.split(s[, sep[, maxsplit]]): Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
So the first problem is in the call to split itself:
print line.split()
['This', 'is', 'a', 'group', 'of', 'words', 'that', 'include', '#this', 'and', '#that', 'but', 'not', 'ME', 'ME']
I recommend using re for splitting the string; see re.split(pattern, string, maxsplit=0, flags=0).
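For this particular case, re.findall may be even more direct than re.split. A minimal sketch, assuming the literal two-word token 'ME ME' is the only whitespace-containing word you need to keep:

```python
import re

line = 'This is a group of words that include #this and #that but not ME ME'
# One possible pattern: hashtags, or the literal two-word token "ME ME".
# (The "ME ME" alternative is specific to this example sentence.)
tokens = re.findall(r'#\w+|ME ME', line)
print(tokens)  # ['#this', '#that', 'ME ME']
```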
Related
As simple as it sounds, can't think of a straightforward way of doing the below in Python.
my_string = "This is a test.\nAlso\tthis"
list_i_want = ["This", "is", "a", "test.", "\n", "Also", "this"]
I need the same behaviour as with string.split(), i.e. remove any type and number of whitespaces, but excluding the line breaks \n in which case I need it as a standalone list item.
How could I do this?
Split String using Regex findall()
import re
my_string = "This is a test.\nAlso\tthis"
my_list = re.findall(r"\S+|\n", my_string)
print(my_list)
How it Works:
"\S+": "\S" = non whitespace characters. "+" is a greed quantifier so it find any groups of non-whitespace characters aka words
"|": OR logic
"\n": Find "\n" so it's returned as well in your list
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']
Here's a code that works but is definitely not efficient/pythonic:
my_string = "This is a test.\nAlso\tthis"
l = my_string.splitlines()  # Splitting lines
list_i_want = []
for i in l:
    list_i_want.extend(i.split())  # Extending list with the words of each line
    list_i_want.extend('\n')       # Adding newline character after each line
list_i_want.pop()  # Removing last newline character
print(list_i_want)
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']
I'm trying to tokenize sentences using re in python like an example mentioned here:
I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]
I wish to tokenize by splitting them using whitespace but without affecting the bracket set.
For example, I want the split list as:
["I", "want", "a", "(hot chocolate)[food]", "and", "(two)[quantity]", "boxes", "of", "(crispy bacon)[food]"]
How do I write the re.split expression to achieve the same.
You can do this with the regex pattern: \s(?!\w+\))
import re
s = """I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"""
print(re.split(r'\s(?!\w+\))',s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
\s(?!\w+\))
The above pattern will NOT match any space that is followed by a word and a ), i.e. it skips the spaces inside a ( ) group.
Test regex here: https://regex101.com/r/SRHEXO/1
Test python here: https://ideone.com/reIIcU
EDIT: Answer to the question from your comment:
Since your input has multiple words inside ( ), you can change the pattern to [\s,](?![\s\w]+\))
Test regex here: https://regex101.com/r/Ea9XlY/1
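A quick check of the modified pattern on a made-up sentence with multi-word groups (the sentence is illustrative, not from the post):

```python
import re

# Illustrative input with multi-word ( ) groups.
s = "add (two scoops)[quantity] of (ice cream)[food]"
parts = re.split(r'[\s,](?![\s\w]+\))', s)
print(parts)  # ['add', '(two scoops)[quantity]', 'of', '(ice cream)[food]']
```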
Regular expressions, no matter how clever, are not always the right answer.
def split(s):
    result = []
    brace_depth = 0
    temp = ''
    for ch in s:
        if ch == ' ' and brace_depth == 0:
            result.append(temp)
            temp = ''
        elif ch == '(' or ch == '[':
            brace_depth += 1
            temp += ch
        elif ch == ']' or ch == ')':
            brace_depth -= 1
            temp += ch
        else:
            temp += ch
    if temp != '':
        result.append(temp)
    return result
>>> s="I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
>>> split(s)
['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
The regex for a whitespace character is \s. Using this with re.split (note the raw string):
import re
print(re.split(r"\s", "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"))
The output is ['I', 'want', 'a', '(hot', 'chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy', 'bacon)[food]']
I am trying to split the sentences into words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
If you look at the word "morningthe" in the list, it used to have "--" between the two words. Is there any way I can split it into two words, "morning" and "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words: runs of word characters (letters, digits, and underscore), ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer, which returns a generator object, is probably better, as you don't have to store the whole list of words at once.
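For example, a sketch of the finditer variant:

```python
import re

text = "The morning-the evening"
# re.finditer yields match objects lazily, so the full word list
# never has to be materialized.
for match in re.finditer(r'\w+', text):
    print(match.group(0))
```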
Alternatively, you may also use itertools.groupby along with str.isalpha to extract alphabetic-only words from the string as:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner; I mention this as a possible alternative way to achieve it.
Specific to OP: If all you want is to also split on -- in the resultant list, then you may first replace hyphens '-' with a space ' ' before performing split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will send you crazy e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
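If pulling in nltk is more than you need, a regex that allows internal apostrophe groups gets reasonably close (still imperfect on edge cases like quoted words):

```python
import re

# Allow any number of internal apostrophe groups so contractions
# and possessives stay whole.
words = re.findall(r"\w+(?:'\w+)*", "Don't read O'Rourke's books!")
print(words)  # ["Don't", 'read', "O'Rourke's", 'books']
```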
Besides the solutions given already, you could also improve your clean_up_list function to do a better job.
def clean_up_list(word_list):
    clean_word_list = []
    # Move the symbols string out of the loop so that it doesn't
    # have to be created every time.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the body of the for word in word_list: loop to the whole sentence at once and get the same result.
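A rough sketch of that idea, scanning one sentence directly (the symbols string here is trimmed for readability):

```python
# Apply the same character scan to the whole sentence at once.
# The symbols string and the sample sentence are illustrative.
symbols = "~!@#$%^&*()_+`{}|\"?><-=][';/.,"
sentence = "evening, and there was morning--the first day."
words, current = [], ''
for ch in sentence:
    if ch in symbols or ch == ' ':
        # A separator ends the current word, if any.
        if current:
            words.append(current)
        current = ''
    else:
        current += ch
if current:
    words.append(current)
print(words)  # ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```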
You could also do this:
import re
def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
i have this code:
def negate_sequence(text):
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not " + stripped if negation else stripped
        result.append(negated)
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result
text = "i am not here right now, because i am not good to see that"
sa = negate_sequence(text)
print(sa)
What this code does, basically, is add 'not' to the words that follow a negation, and it keeps adding 'not' until it reaches one of "?.,!:;", which act as a sort of break. For example, if you run this code you'll get:
['i', 'am', 'not', 'not here', 'not right', 'not now', 'because', 'i', 'am', 'not', 'not good', 'not to', 'not see', 'not that']
What I want is for a space to act as the break instead of "?.,!:;", so that running the code would give this result instead:
['i', 'am', 'not', 'not here', 'right', 'now', 'because', 'i', 'am', 'not', 'not good', 'to', 'see', 'that']
That is, the code should add 'not' only to the next word and stop after finding the space. I have tried everything but nothing has worked for me; if anyone has an idea how to do this, I would appreciate it.
Thanks in advance.
ipsnicerous's excellent code does exactly what you want, except it misses out the very first word. This is easily corrected by using is_negative(text[i-1]) and changing enumerate(text[1:]) to enumerate(text[:]), which gives you:
def is_negative(word):
    if word in ["not", "no"] or word.endswith("n't"):
        return True
    else:
        return False

def negate_sequence(text):
    text = text.split()
    # remove punctuation
    text = [word.strip("?.,!:;") for word in text]
    # Prepend 'not' to each word if the preceding word contains a negation.
    text = ['not ' + word if is_negative(text[i-1]) else word for i, word in enumerate(text[:])]
    return text

if __name__ == "__main__":
    print(negate_sequence("i am not here right now, because i am not good to see that"))
I'm not entirely sure what you are trying to do, but it seems like you want to turn every negation into a double negative?
def is_negative(word):
    if word in ["not", "no"] or word.endswith("n't"):
        return True
    else:
        return False

def negate_sequence(text):
    text = text.split()
    # remove punctuation
    text = [word.strip("?.,!:;") for word in text]
    # Prepend 'not' to each word if the preceding word contains a negation.
    text = ['not ' + word if is_negative(text[i]) else word for i, word in enumerate(text[1:])]
    return text

print(negate_sequence("i am not here right now, because i am not good to see that"))
I have a string a, and I would like to return a list b containing the words in a that do not start with @ or # and do not contain any non-word characters.
However, I'm having trouble keeping words like "they're" as a single word. Please notice that words like "Okay....so" should be split into two words, "okay" and "so".
I think problem could be solved by just revising the regular expression. Thanks!
a = "#luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
a = a.split()
b = []
for word in a:
if word != "" and word[0] != "#" and word[0] != "#":
for item in re.split(r'\W+\'\W|\W+', word):
if item != "":
b.append(item)
else:
continue
else:
continue
print b
It's easier to combine all these rules into one regex:
import re
a = "#luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
b = re.findall(r"(?<![##])\b\w+(?:'\w+)?", a)
print(b)
Result:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
The regex works like this:
Checks to make sure that it's not coming right after @ or #, using (?<![@#]).
Checks that it's at the beginning of a word using \b. This is important so that the @/# check doesn't just skip one character and go on.
Matches a sequence of one or more "word" type characters with \w+.
Optionally matches an apostrophe and some more word type characters with (?:'\w+)?.
Note that the fourth step is written that way so that they're will count as one word, but only this, that, and these from this, 'that', these will match.
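A small demonstration of that apostrophe behaviour:

```python
import re

# The optional (?:'\w+)? keeps the internal apostrophe in "they're"
# but drops the quoting apostrophes around 'that'.
matches = re.findall(r"(?<![@#])\b\w+(?:'\w+)?", "they're 'that'")
print(matches)  # ["they're", 'that']
```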
The following code (a) treats .... as a word separator, (b) removes trailing non-word characters, such as question marks and exclamation points, and (c) rejects any words that start with @ or # or otherwise contain non-alpha characters:
a = "@luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
a = a.replace('....', ' ')
a = re.sub('[?!@#$%^&]+( |$)', ' ', a)
result = [w for w in a.split() if w[0] not in '@#' and w.replace("'", '').isalpha()]
print result
This produces the desired result:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
import re
v = re.findall(r'(?:\s|^)([\w\']+)\b', a)
Gives:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now',
'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
From what I understand, you don't want words with digits in them and you want to disregard all the other special characters except the single quote. You could try something like this:
import re
a = re.sub(r"[^0-9a-zA-Z']+", " ", a)
b = a.split()
I haven't been able to try the syntax, but hopefully it should work. What I suggest is to replace every character that is not alphanumeric or a single quote with a single space. This results in a string where your required words are separated by runs of whitespace. Simply calling split with no argument then splits the string into words, taking care of multiple whitespaces as well. Hope it helps.
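For what it's worth, here is the idea tried on a small sample (a runnable version; the sample string is mine, not the question's):

```python
import re

# Every run of characters that is neither alphanumeric nor an apostrophe
# collapses to a single space; split() then does the rest.
sample = "okay....so they're rich!"
cleaned = re.sub(r"[^0-9a-zA-Z']+", " ", sample)
print(cleaned.split())  # ['okay', 'so', "they're", 'rich']
```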