i have this code:
def negate_sequence(text):
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not " + stripped if negation else stripped
        result.append(negated)
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result
text = "i am not here right now, because i am not good to see that"
sa = negate_sequence(text)
print(sa)
What this code does, basically, is add 'not' to the following words, and it doesn't stop adding 'not' until it reaches one of the characters "?.,!:;", which act as a sort of break. For example, if you run this code you'll get:
['i', 'am', 'not', 'not here', 'not right', 'not now', 'because', 'i', 'am', 'not', 'not good', 'not to', 'not see', 'not that']
What I want to do is use the space as the break instead of all these "?.,!:;", so if I run the code I will get this result instead:
['i', 'am', 'not', 'not here', 'right', 'now', 'because', 'i', 'am', 'not', 'not good', 'to', 'see', 'that']
So the code should add the 'not' to the next word only and stop after finding the space. I have tried everything but nothing worked for me. Please, if anyone has an idea how to do that, I would appreciate it.
Thanks in advance.
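For what it's worth, here is a minimal sketch of one way to do this, keeping the structure of the loop above but resetting the flag right after it has been used once, so only the word immediately following a negation gets the prefix:

```python
def negate_next_word(text):
    delims = "?.,!:;"
    result = []
    negation = False
    for word in text.split():
        stripped = word.strip(delims).lower()
        result.append("not " + stripped if negation else stripped)
        # A negation only affects the single next word...
        negation = any(neg in word for neg in ["not", "n't", "no"])
        # ...and never crosses a punctuation break.
        if any(c in word for c in delims):
            negation = False
    return result

print(negate_next_word("i am not here right now, because i am not good to see that"))
# ['i', 'am', 'not', 'not here', 'right', 'now', 'because', 'i', 'am', 'not', 'not good', 'to', 'see', 'that']
```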
ipsnicerous's excellent code does exactly what you want, except that it misses the very first word. This is easily corrected by using is_negative(text[i-1]) and changing enumerate(text[1:]) to enumerate(text[:]), which gives you:
def is_negative(word):
    if word in ["not", "no"] or word.endswith("n't"):
        return True
    else:
        return False

def negate_sequence(text):
    text = text.split()
    # remove punctuation
    text = [word.strip("?.,!:;") for word in text]
    # Prepend 'not' to each word if the preceding word contains a negation.
    text = ['not ' + word if is_negative(text[i - 1]) else word for i, word in enumerate(text[:])]
    return text

if __name__ == "__main__":
    print(negate_sequence("i am not here right now, because i am not good to see that"))
I'm not entirely sure what you are trying to do, but it seems like you want to turn every negation into a double negative?
def is_negative(word):
    if word in ["not", "no"] or word.endswith("n't"):
        return True
    else:
        return False

def negate_sequence(text):
    text = text.split()
    # remove punctuation
    text = [word.strip("?.,!:;") for word in text]
    # Prepend 'not' to each word if the preceding word contains a negation.
    text = ['not ' + word if is_negative(text[i]) else word for i, word in enumerate(text[1:])]
    return text

print(negate_sequence("i am not here right now, because i am not good to see that"))
Related
Write a function called getWords(sentence, letter) that takes in a sentence and a single letter, and returns a list of the words that start or end with this letter, but not both, regardless of the letter case.
For example:
>>> s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
>>> getWords(s, "t")
['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
My attempt:
import re

def getWords(sentence, letter):
    regex = r'[\w]*' + letter + r'[\w]*'
    return re.findall(regex, sentence, re.I)
My Output:
['The', 'TART', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'until', 'next']
\b detects word breaks. Verbose mode allows multi-line regexes and comments. Note that [^\W] is the same as \w, but to match \w except a certain letter, you need [^\W{letter}].
import re

def getWords(s, t):
    pattern = r'''(?ix)          # ignore case, verbose mode
        \b{letter}               # starts with the letter
        \w*                      # zero or more additional word characters
        [^{letter}\W]\b          # ends with a word character that isn't the letter
        |                        # OR
        \b[^{letter}\W]          # starts with a word character that isn't the letter
        \w*                      # zero or more additional word characters
        {letter}\b               # ends with the letter
        '''.format(letter=t)
    return re.findall(pattern, s)
s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(getWords(s,'t'))
Output:
['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
Doing this is much easier with the startswith() and endswith() methods.
def getWords(s, letter):
    target = letter.lower()
    return [word for word in s.split()
            if (word.lower().startswith(target) or word.lower().endswith(target))
            and not (word.lower().startswith(target) and word.lower().endswith(target))]

mystring = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(getWords(mystring, 't'))
Output
['The', 'Tuesdays', 'Thursdays,', 'but', 'it', 'not', 'start', 'next']
Update (using regular expression)
import re

mystring = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
result1 = re.findall(r'\b[t]\w+|\w+[t]\b', mystring, re.I)
result2 = re.findall(r'\b[t]\w+[t]\b', mystring, re.I)
print([x for x in result1 if x not in result2])
Explanation
The regular expressions \b[t]\w+ and \w+[t]\b find words that start or end with the letter t, and \b[t]\w+[t]\b finds words that both start and end with the letter t.
After generating the two lists of words, just take the difference of those two lists: keep the words from the first list that are not in the second.
If you want the regex for this, then use:
regex = r'\b(#\w*[^#\W]|[^#\W]\w*#)\b'.replace('#', letter)
The replace is done to avoid the verbose repetition of + letter + in the pattern.
So the code looks like this then:
import re
def getWords(sentence, letter):
    regex = r'\b(#\w*[^#\W]|[^#\W]\w*#)\b'.replace('#', letter)
    return re.findall(regex, sentence, re.I)
s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
result = getWords(s, "t")
print(result)
Output:
['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
Explanation
I have used # as a placeholder for the actual letter, and that will get replaced in the regular expression before it is actually used.
\b: word break
\w*: zero or more word characters (letters, digits, or underscores)
[^#\W]: a letter that is not # (the given letter)
|: logical OR. The left side matches words that start with the letter, but don't end with it, and the right side matches the opposite case.
Why are you using regex for this? Just check the first and last character.
def getWords(s, letter):
    words = s.split()
    # word.lower()[::len(word) - 1] picks out the first and last characters;
    # the set has length 2 only when they differ.
    return [a for a, b in ((word, set(word.lower()[::len(word) - 1])) for word in words)
            if letter in b and len(b) == 2]
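To see why this works: the slice word[::len(word) - 1] steps from the first character straight to the last, so the set holds exactly the lowercased first and last letters. (Note it raises ValueError on one-letter words, since the step becomes 0.) A quick illustration:

```python
word = "Tuesdays"
# Step of len(word) - 1 jumps from index 0 directly to the last index.
print(word[::len(word) - 1])               # 'Ts'
print(set(word.lower()[::len(word) - 1]))  # {'t', 's'}: length 2 and contains 't'
```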
You can try the built-in startswith and endswith string methods.
>>> string = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
>>> [i for i in string.split() if i.lower().startswith('t') or i.lower().endswith('t')]
['The', 'TART', 'Tuesdays', 'Thursdays,', 'but', 'it', 'not', 'start', 'next']
I am trying to split the sentences into words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
If you see the word "morningthe" in the list, it used to have "--" between the words. Now, is there any way I can split it into two words, like "morning", "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words: groups of word characters (letters, digits, and underscores), ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer, which returns an iterator of matches, is probably better, as you don't have to store the whole list of words at once.
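A sketch of that generator variant, assuming the same \w+ pattern:

```python
import re

def iter_words(text):
    # Yields one word at a time instead of building the full list in memory.
    for match in re.finditer(r'\w+', text):
        yield match.group()

print(list(iter_words("The morning--the evening")))
# ['The', 'morning', 'the', 'evening']
```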
Alternatively, you may also use itertools.groupby along with str.isalpha to extract alphabetic-only words from the string:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner; I mention this as a possible alternative way to achieve the same thing.
Specific to OP: If all you want is to also split on -- in the resulting list, then you may first replace hyphens '-' with spaces ' ' before performing the split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will send you crazy, e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
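Short of pulling in nltk, one stdlib-only workaround (my own suggestion, not a full tokenizer) is to allow internal apostrophes in the pattern:

```python
import re

def to_words(text):
    # \w+ runs, optionally joined by apostrophes, so contractions stay whole.
    return re.findall(r"\w+(?:'\w+)*", text)

print(to_words("Don't read O'Rourke's books!"))
# ["Don't", 'read', "O'Rourke's", 'books']
```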
Besides the solutions already given, you could also improve your clean_up_list function to do a better job.
def clean_up_list(word_list):
    clean_word_list = []
    # Move the symbols string out of the loop so that it doesn't
    # have to be created every time.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append the possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the body of the for word in word_list: loop to the whole sentence and get the same result.
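Applying the same character scan to the whole sentence rather than pre-split words might look like this (a sketch; whitespace is treated as a separator alongside the symbols):

```python
def clean_up_sentence(sentence):
    # Same per-character scan as above, run over the full sentence.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
    clean_word_list = []
    current_word = ''
    for ch in sentence:
        if ch in symbols or ch.isspace():
            if current_word:
                clean_word_list.append(current_word)
                current_word = ''
        else:
            current_word += ch
    if current_word:
        clean_word_list.append(current_word)
    return clean_word_list

print(clean_up_sentence('evening, and there was morning--the first day.'))
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```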
You could also do this:
import re
def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
I have a string a, and I would like to build a list b containing the words in a that do not start with @ or #, with all non-word characters stripped out.
However, I'm having trouble keeping words like "they're" as a single word. Please notice that words like "okay....so" should be split into two words, "okay" and "so".
I think the problem could be solved by just revising the regular expression. Thanks!
import re

a = "@luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
a = a.split()
b = []
for word in a:
    if word != "" and word[0] != "@" and word[0] != "#":
        for item in re.split(r'\W+\'\W|\W+', word):
            if item != "":
                b.append(item)
            else:
                continue
    else:
        continue
print(b)
It's easier to combine all these rules into one regex:
import re
a = "@luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
b = re.findall(r"(?<![@#])\b\w+(?:'\w+)?", a)
print(b)
Result:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
The regex works like this:
Checks to make sure that it's not coming right after @ or #, using (?<![@#]).
Checks that it's at the beginning of a word using \b. This is important so that the @/# check doesn't just skip one character and go on.
Matches a sequence of one or more "word" type characters with \w+.
Optionally matches an apostrophe and some more word type characters with (?:'\w+)?.
Note that the fourth step is written that way so that they're will count as one word, but only this, that, and these from this, 'that', these will match.
The following code (a) treats .... as a word separator, (b) removes trailing non-word characters, such as question marks and exclamation points, and (c) rejects any words that start with @ or # or otherwise contain non-alphabetic characters:
import re

a = "@luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
a = a.replace('....', ' ')
a = re.sub('[?!@#$%^&]+( |$)', ' ', a)
result = [w for w in a.split() if w[0] not in '@#' and w.replace("'", '').isalpha()]
print(result)
This produces the desired result:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
import re
v = re.findall(r'(?:\s|^)([\w\']+)\b', a)
Gives:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
From what I understand, you don't want words with digits in them, and you want to disregard all the other special characters except the single quote. You could try something like this:
import re
a = re.sub(r"[^0-9a-zA-Z']+", ' ', a)
b = a.split()
I haven't been able to try the syntax, but hopefully it should work. What I suggest is replacing every character that is not alphanumeric or a single quote with a single space. This results in a string where the required words are separated by runs of whitespace. Simply calling split with no argument then splits the string into words, taking care of multiple whitespace characters as well. Hope it helps.
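A tested version of that idea follows. One caveat worth flagging: this only deletes the @/# characters themselves, so the body of a handle or hashtag still survives as a word; filtering those out would need an extra step.

```python
import re

a = "@luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
# Replace every character that is not alphanumeric or an apostrophe with a
# space, then split on the resulting runs of whitespace.
b = re.sub(r"[^0-9a-zA-Z']+", ' ', a).split()
print(b)
# ['luke5sos', 'are', 'you', 'awake', 'now', 'me', 'hashtag', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
```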
I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw

picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many, picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
    hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey)  # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter += 1
        number += 1
        already.append(word.lower())
    if counter == many:
        break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
1. 'Python'
2. 'PyPy'
3. 'code'
4. 'use'
5. '136'
6. '181'
7. 'd...'
8. 'IPython'
9. '133'
10. '158'
How can I make this script not return numbers and garbage words like 'd...'? The first four results are acceptable, but I would like to replace the rest with words that make sense. Extending the common_words list is unreasonable and doesn't filter these errors out. I'm relatively new to writing code, and I appreciate the help.
I disagree. Making a list of common words is correct; there is no easier way to filter out the, for, I, am, etc. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(); since your list is long, this should speed things up. The in operation for sets is O(1), while for lists it is O(n).
2) Getting rid of all number strings is trivial. One way you could do it is:
all(c.isdigit() for c in word)
Where if this returns True, then the word is just a series of numbers.
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
    # This returns True only if a word is all letters
    return all([c.isalpha() for c in word])
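For example, filtering a sample like the output above with this predicate (the helper is repeated so the snippet is self-contained, and the sample list is made up for illustration):

```python
def check_wordiness(word):
    # True only when every character in the word is a letter.
    return all(c.isalpha() for c in word)

sample = ['Python', 'PyPy', '136', 'd...', 'IPython', 'code']
print([w for w in sample if check_wordiness(w)])
# ['Python', 'PyPy', 'IPython', 'code']
```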
4) In the for word in top_words: block, are you sure you have not mixed up counter and number? Also, counter and number are largely redundant; you could rewrite the last bit as:
for word in top_words:
    # Since you are calling .lower() so much,
    # you probably want to do it once up here
    w = word.lower()
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        print "%i. '%s'" % (number, word)
        number += 1
        # This goes under the if statement. You only want to add
        # words that could be added again. Why add words that are
        # being filtered out anyway?
        already.append(w)
    # this wasn't indented correctly before
    if number == many:
        break
Hope that helps.
If I have a string and want to return a word that includes a whitespace, how would it be done? For example, I have:
For example, I have:
line = 'This is a group of words that include #this and @that but not ME ME'
response = [word for word in line.split() if word.startswith("#") or word.startswith('@') or word.startswith('ME ')]
print response
# ['#this', '@that', 'ME']
So ME ME does not get printed because of the whitespace.
Thanks
You could just keep it simple:
line = 'This is a group of words that include #this and @that but not ME ME'
words = line.split()
result = []
pos = 0
try:
    while True:
        if words[pos].startswith(('#', '@')):
            result.append(words[pos])
            pos += 1
        elif words[pos] == 'ME':
            result.append('ME ' + words[pos + 1])
            pos += 2
        else:
            pos += 1
except IndexError:
    pass
print(result)
Think about speed only if it proves to be too slow in practice.
From python Documentation:
string.split(s[, sep[, maxsplit]]): Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
So the first problem is in the call to split itself:
print line.split()
['This', 'is', 'a', 'group', 'of', 'words', 'that', 'include', '#this', 'and', '#that', 'but', 'not', 'ME', 'ME']
I recommend using re to split the string, with re.split(pattern, string, maxsplit=0, flags=0).
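The answer stops short of showing an actual pattern. One possible sketch (the pattern is my own guess at the requirement, not from the answer) uses re.findall rather than re.split, since the targets here are easier to describe than the separators:

```python
import re

line = 'This is a group of words that include #this and #that but not ME ME'
# Match either a #/@-prefixed word, or the two-word phrase "ME ME".
response = re.findall(r'[@#]\w+|ME\s+ME', line)
print(response)
# ['#this', '#that', 'ME ME']
```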