Python: Intersecting lists of strings gone wrong

I am trying to intersect lists of sentences divided into strings:
from itertools import chain
from nltk.corpus import gutenberg

user = ['The', 'Macbeth', 'Tragedie'] # this list
plays = []
hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')
plays.append(hamlet)
plays.append(macbeth)
plays.append(caesar)
shakespeare = list(chain.from_iterable(plays)) # with this list
'shakespeare' prints as follows:
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ['Scoena', 'Prima', '.'], ['Enter', 'Barnardo', 'and', 'Francisco', 'two', 'Centinels', '.']...['FINIS', '.'], ['THE', 'TRAGEDIE', 'OF', 'IVLIVS', 'CaeSAR', '.']]
bestCount = 0
for sent in shakespeare:
    currentCount = len(set(user).intersection(sent))
    if currentCount > bestCount:
        bestCount = currentCount
        answer = ' '.join(sent)
return ''.join(answer).lower(), bestCount
The return value, however, is not the right intersection; that is, the "Macbeth" query ends up matching a "Hamlet" sentence:
('the tragedie of hamlet , prince of denmarke .', 3)
where is the bug?

Doesn't sound like you should be using sets here. The most obvious problem is that you care about the number of occurrences of a word in a sentence (which starts life as a list), and by converting to a set you collapse all repeated words down to one occurrence, losing that information.
I would suggest rather converting each sentence's members into lowercase, like so:
mapped = map(str.lower, sentence) # may want list(map(...)) if on Py3
Initialize a dict of counts like this:
In [6]: counts = {word.lower(): 0 for word in user}
In [7]: counts
Out[7]: {'macbeth': 0, 'the': 0, 'tragedie': 0}
Then as you loop over the sentences, you can do something like this:
In [8]: for word in counts:
   ...:     counts[word] = max(counts[word], mapped.count(word))
   ...:
In [9]: counts
Out[9]: {'macbeth': 0, 'the': 1, 'tragedie': 1}
I just used one example sentence, but you get the idea. At the end you'll have the maximum number of times each of the user's words appeared in a sentence. You can make the data structure a little more complex, or add an if test, if you also want to keep the sentence it occurred the most times in.
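Pulling those pieces together, a rough sketch along these lines (assuming the same NLTK gutenberg corpus as in the question, and a (count, sentence) tuple as the slightly richer data structure) might look like:
from itertools import chain
from nltk.corpus import gutenberg

user = ['The', 'Macbeth', 'Tragedie']
plays = [gutenberg.sents('shakespeare-hamlet.txt'),
         gutenberg.sents('shakespeare-macbeth.txt'),
         gutenberg.sents('shakespeare-caesar.txt')]
shakespeare = list(chain.from_iterable(plays))

# For each of the user's words, track (highest count seen, sentence it occurred in)
counts = {word.lower(): (0, None) for word in user}
for sentence in shakespeare:
    mapped = list(map(str.lower, sentence))
    for word in counts:
        occurrences = mapped.count(word)
        if occurrences > counts[word][0]:
            counts[word] = (occurrences, ' '.join(mapped))

for word, (best, sent) in counts.items():
    print(word, best, sent)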
Good luck!

Related

Replace text in a list with formatted text in another list

I am attempting to replace text in a list with text from another list. Below, lst_a has the string length I need for another script, but none of the formatting from lst_b. I want to give lst_a the correct spelling, capitalization, and punctuation from lst_b.
For example:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
I'm not 100% sure the best way to approach this problem.
I have tried breaking lst_a into a smaller sub_lst_a and taking the difference from each list, but I'm not sure what to do when entire items exist in one list and not the other (e.g. 'it' and 'is' rather than 'it's').
Regardless, any help/direction would be greatly appreciated!
Solution attempt below:
I thought it may be worth trying to break lst_a into a list of just words. Then I thought to enumerate each item, so that I could more easily identify its counterpart in lst_b. From there I wanted to take the difference of the two lists, and replace the values in lst_a_diff with lst_b_diff. I had to sort the lists because my diff script wasn't consistently ordering the outputs.
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# splitting lst_a into a smaller sub_lst_a
def convert(lst_a):
    return [i for item in lst_a for i in item.split()]
sub_lst_a = convert(lst_a)
# getting the position values of sub_lst_a and lst_b
lst_a_pos = [f"{i}, {v}" for i, v in enumerate(sub_lst_a)]
lst_b_pos = [f"{i}, {v}" for i, v in enumerate(lst_b)]
# finding the difference between the two lists
def Diff(lst_a_pos, lst_b_pos):
    return list(set(lst_a_pos) - set(lst_b_pos))
lst_a_diff = Diff(lst_a_pos, lst_b_pos)
lst_b_diff = Diff(lst_b_pos, lst_a_pos)
# sorting lst_a_diff and lst_b_diff by the original position of each item
lst_a_diff_sorted = sorted(lst_a_diff, key = lambda x: int(x.split(', ')[0]))
lst_b_diff_sorted = sorted(lst_b_diff, key = lambda x: int(x.split(', ')[0]))
print(lst_a_diff_sorted)
print(lst_b_diff_sorted)
Desired Results:
final_lst_a = ['It\'s an', 'example of', 'an English simple sentence.']
Solution walkthrough
Assuming, as you say, that the two lists are essentially always in the same order, then to properly align the indexes in both, a word with an apostrophe should really count as two.
One way to do that is to expand the list by adding an empty element after each such word:
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if item.find("'") != -1:
        lst_c.append('')
print(lst_c)
>> ["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
Now it is a matter of expanding lst_a on a word-by-word basis, and then grouping the words back as in the original list. Essentially, we align the lists like this:
['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence']
["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
then we create new_item slices like these:
["It's", "", "an"]
["example of"]
["an English simple sentence"]
The code looks like this:
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    # take each item in lst_a and count how many words it has
    count = len(item.split())
    # then use ptr and count to correctly map a slice off lst_c
    new_item = lst_c[ptr:ptr+count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    # print('new[{}:{}]={}'.format(ptr,count,new_item))
    # join the words by single space and append to final list
    final.append(' '.join(new_item))
    # advance the ptr
    ptr += count
>> ["It's an", 'example of', 'an English simple sentence.']
Complete code solution
This seems to handle other cases well enough. The complete code would be something like:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# This is another example that seems to work
# lst_a = ['tomorrow I will', 'go to the movies']
# lst_b = ['Tomorrow', 'I\'ll', 'go', 'to', 'the', 'movies.']
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if item.find("'") != -1:
        lst_c.append('')
print(lst_c)
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    count = len(item.split())
    # print(ptr, count, item)
    new_item = lst_c[ptr:ptr+count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    # print('new[{}:{}]={}'.format(ptr,count,new_item))
    ptr += count
    final.append(' '.join(new_item))
print(final)
You can try the following code:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
lst_a_split = []
end_indices_in_lst_a_split = []
# Construct "lst_a_split" and "end_indices_in_lst_a_split".
# "lst_a_split" is supposed to be ['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence'].
# "end_indices_in_lst_a_split" is supposed to be [3, 5, 9].
end = 0
for s in lst_a:
    s_split = s.split()
    end += len(s_split)
    end_indices_in_lst_a_split.append(end)
    for word in s_split:
        lst_a_split.append(word)
# Construct "d" which contains
# index of every word in "lst_b" which does not include '\'' as value
# and the corresponding index of the word in "lst_a_split" as key.
# "d" is supposed to be {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}.
d = {}
start = 0
for index_in_lst_b, word in enumerate(lst_b):
    if '\'' in word:
        continue
    word = word.lower().strip('.').strip(',').strip('"') # you can add other strip()'s as you want
    index_in_lst_a_split = lst_a_split.index(word, start)
    start = index_in_lst_a_split + 1
    d[index_in_lst_a_split] = index_in_lst_b
# Construct "final_lst_a".
final_lst_a = []
start_index_in_lst_b = 0
for i, end in enumerate(end_indices_in_lst_a_split):
    if end - 1 in d:
        end_index_in_lst_b = d[end - 1] + 1
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    elif end in d:
        end_index_in_lst_b = d[end]
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    else:
        # It prints the following message if it fails to construct "final_lst_a" successfully.
        # That would happen if the words in "lst_b" on both sides of a boundary contain '\'', which seems unlikely.
        print(f'Failed to find corresponding words in "lst_b" for the string "{lst_a[i]}".')
        break
print(final_lst_a)
which prints
["It's an", 'example of', 'an English simple sentence.']
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
for word in lst_b:
    # If a word is capitalized, look for it in lst_a and capitalize it
    if word[0].upper() == word[0]:
        for idx, phrase in enumerate(lst_a):
            if word.lower() in phrase:
                lst_a[idx] = phrase.replace(word.lower(), word)
    if "'" in word:
        # if a word has an apostrophe, look for it in lst_a and change it
        # Note here you can include other patterns like " are",
        # or maybe just restrict it to "it is", etc.
        for idx, phrase in enumerate(lst_a):
            if " is" in phrase:
                lst_a[idx] = phrase.replace(" is", "'s")
                break
print(lst_a)
I know you already have a few responses to review. Here's something that should help you expand the implementation.
In addition to lst_a and lst_b, what if you could provide all the lookup items like 'It's', 'I'll', and 'don't', and outline what each should represent? Then the code below would take care of that lookup as well.
# original lst_a. This list does not have the punctuation marks
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn', 'this weekend', 'but do not', 'count on me']

# desired output with correct spelling, capitalization, and punctuation
# but includes \' that need to be replaced
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,', 'I\'ll', 'learn', 'this', 'weekend', 'but', 'don\'t', 'count', 'on', 'me']

# lookup list to replace the contractions
ch = {'It\'s': ['It', 'is'], 'I\'ll': ['I', 'will'], 'don\'t': ['do', 'not']}

# final list will be stored into lst_c
lst_c = []

# enumerate through lst_b to replace all words that are contractions
for i, v in enumerate(lst_b):
    # for this example, i am considering that all contractions are 2 part words
    for j, k in ch.items():
        if v == j:  # here you are checking for contractions
            lst_b[i] = k[0]  # for each contraction, you are replacing the first part
            lst_b.insert(i+1, k[1])  # and inserting the second part

# now stitch the words together based on length of each word in lst_b
c = 0
for i in lst_a:
    j = i.count(' ')  # find out number of words to stitch together
    # stitch together only the number of size of words in lst_a
    lst_c.append(' '.join([lst_b[k] for k in range(c, c+j+1)]))
    c += j+1

# finally, I am printing lst_a, lst_b, and lst_c. The final result is in lst_c
print(lst_a, lst_b, lst_c, sep='\n')
Output for this is as shown below:
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn', 'this weekend', 'but do not', 'count on me']
lst_b = ['It', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,', 'I', 'will', 'learn', 'this', 'weekend', 'but', 'do', 'not', 'count', 'on', 'me']
lst_c = ['It is an', 'example of', 'an english simple sentence.', 'If time permits,', 'I will learn', 'this weekend', 'but do not', 'count on me']

Splitting the sentences in python

I am trying to split sentences into words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
If you see the word "morningthe" in the list, it used to have "--" between the words. Now, is there any way I can split it into two words like "morning", "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words (groups of word characters), ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're just looping over the words, using re.finditer, which returns an iterator, is probably better, as you don't have to store the whole list of words at once.
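For example, a lazy variant using re.finditer might look like this (a minimal sketch):
import re

def iter_words(text):
    # Yield one word at a time instead of building the whole list first
    for match in re.finditer(r'\w+', text):
        yield match.group(0)

for word in iter_words("The morning-the evening"):
    print(word)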
Alternatively, you may also use itertools.groupby along with str.isalpha() to extract alphabets-only words from the string as:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner. I have mentioned this as a possible alternative.
Specific to OP: If all you want is to also split on -- in the resultant list, then you may first replace hyphens '-' with a space ' ' before performing the split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
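For instance, on your example sentence this gives the following (punctuation is still attached here, which your clean_up_list pass then strips):
content = "evening, and there was morning--the first day."
words = content.lower().replace('-', ' ').split()
print(words)
# ['evening,', 'and', 'there', 'was', 'morning', 'the', 'first', 'day.']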
Trying to do this with regexes will send you crazy, e.g.:
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
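For example, NLTK's own tokenizer handles the contractions much more sensibly (a quick sketch, assuming the NLTK tokenizer data is installed):
from nltk import word_tokenize

print(word_tokenize("Don't read O'Rourke's books!"))
# Roughly: ['Do', "n't", 'read', "O'Rourke", "'s", 'books', '!']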
Besides the solutions given already, you could also improve your clean_up_list function to do a better work.
def clean_up_list(word_list):
    clean_word_list = []
    # Move symbols out of the loop so that it doesn't
    # have to be initiated every time.
    symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the body of the for word in word_list: loop to the whole sentence to get the same result, as sketched below.
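For illustration, here is a sketch of that idea applied to a whole sentence string instead of a pre-split word list (whitespace has to be treated as a separator too):
def clean_up_text(text):
    # Same character-by-character scan, but over the full sentence;
    # whitespace now also ends the current word.
    symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
    clean_word_list = []
    current_word = ''
    for ch in text:
        if ch in symbols or ch.isspace():
            if current_word:
                clean_word_list.append(current_word)
            current_word = ''
        else:
            current_word += ch
    if current_word:
        clean_word_list.append(current_word)
    return clean_word_list

print(clean_up_text('evening, and there was morning--the first day.'))
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']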
You could also do this:
import re
def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']

Checking superset of list in given order

I have a list of tuples in format (float,string) sorted in descending order.
print sent_scores
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.'),
(0.078586381821416265,'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
(0.072031446647399661, '- Emergency personnel help victims.')]
If two entries in the list have four words in common in continuity, I want to remove the tuple with the lesser score from the list. The new list should also preserve order.
The output of above:
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.')]
This will certainly first involve tokenization of the words, which can be done with the code below:
from nltk.tokenize import TreebankWordTokenizer
def tokenize_words(text):
    tokens = TreebankWordTokenizer().tokenize(text)
    contractions = ["n't", "'ll", "'m", "'s"]
    fix = []
    for i in range(len(tokens)):
        for c in contractions:
            if tokens[i] == c:
                fix.append(i)
    fix_offset = 0
    for fix_id in fix:
        idx = fix_id - 1 - fix_offset
        tokens[idx] = tokens[idx] + tokens[idx+1]
        del tokens[idx+1]
        fix_offset += 1
    return tokens
tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]
I earlier tried converting the words of each sentence into groups of 4 contained in a set and then using issuperset against the other sentences, but that doesn't check continuity.
I suggest taking sequences of 4 tokens in a row from your tokenized list, and making a set of those tokens. By using Python's itertools module, this can be done rather elegantly:
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
print zip(i1, i2, i3, i4)
Output of the above code (nicely formatted for you):
[('The', 'quick', 'brown', 'fox'),
('quick', 'brown', 'fox', 'jumps'),
('brown', 'fox', 'jumps', 'over'),
('fox', 'jumps', 'over', 'the'),
('jumps', 'over', 'the', 'lazy'),
('over', 'the', 'lazy', 'dog')]
Actually, even more elegant would be:
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)
Same output as before.
Now that you have your list of four consecutive tokens (as 4-tuples) for each list, you can stick those tokens in a set, and check whether the same 4-tuple appears in two different sets:
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))
other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))
print set1.intersection(set2) # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"
third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))
print set1.intersection(set3) # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"
NOTE: If you're using Python 3, just replace all the print "Something" statements with print("Something"): in Python 3, print became a function rather than a statement. But if you're using NLTK, I suspect you're using Python 2.
IMPORTANT NOTE: Any itertools.islice objects you create will iterate through their original list once, and then become "exhausted" (they've returned all their data, so putting them in a second for loop will produce nothing, and the for loop just won't do anything). If you want to iterate through the same list multiple times, create multiple iterators (as you see I did in my examples).
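A quick demonstration of that exhaustion behaviour:
import itertools

data = ['The', 'quick', 'brown', 'fox']
sl = itertools.islice(data, 0, None)
print(list(sl))   # ['The', 'quick', 'brown', 'fox']
print(list(sl))   # [] -- the islice object is exhausted after one pass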
Update: Here's how to eliminate the lesser-scoring words. First, replace this line:
tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]
with:
tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]
Now what you have is a list of (score, sentence) tuples. Then we'll construct a list called scores_and_sentences_and_sets, which will be a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is a set of four-word slices like in the example above):
scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]
That one-liner may be a bit too clever, actually, so let's unpack it to be a bit more readable:
scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)
Go ahead and experiment with those two code snippets, and you'll find that they do exactly the same thing.
Okay, so now we have a list of (score, sentence, set_of_four_word_groups) tuples. So we'll go through the list in order, and build up a result list consisting of ONLY the sentences we want to keep. Since the list is already sorted in descending order, that makes things a little easier, because it means that at any point in the list, we only have to look at the items that have already been "accepted" to see if any of them have a duplicate; if any of the accepted items are a duplicate of the one we've just looked at, we don't even need to look at the scores, because we know the accepted item came earlier than the one we're looking at, and therefore it must have a higher score than the one we're looking at.
So here's some code that should do what you want:
accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)

print accepted_items  # Prints a whole bunch of tuples

sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only  # Prints just the sentences

Is there a better way to get just 'important words' from a list in python?

I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw
picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
    hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter += 1
        number += 1
    already.append(word.lower())
    if counter == many:
        break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
1. 'Python'
2. 'PyPy'
3. 'code'
4. 'use'
5. '136'
6. '181'
7. 'd...'
8. 'IPython'
9. '133'
10. '158'
How can I make this script not return numbers, and error words like 'd...'? The first 4 results are acceptable, but I would like to replace the rest with words that make sense. Making a common_words list is unreasonable, and doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.
I disagree. Making a list of common words is correct; there is no easier way to filter out the, for, I, am, etc. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(); since your list is long, this should speed things up. The in operation for sets is O(1) on average, while for lists it is O(n).
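For example (just the start of your list, as a set literal):
common_words = {'its', 'am', 'ago', 'took', 'got', 'will', 'been', 'get'}  # ...and so on
print('will' in common_words)   # True; average O(1) lookup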
2) Getting rid of all number strings is trivial. One way you could do it is:
all([w.isdigit() for w in word])
If this returns True, then the word is just a series of digits.
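Used as a filter, that check could look like this (a small sketch with made-up tokens):
words = ['Python', '136', 'PyPy', '181', 'code']
non_numeric = [w for w in words if not all(c.isdigit() for c in w)]
print(non_numeric)   # ['Python', 'PyPy', 'code']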
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
    # This returns True only if a word is all letters
    return all([c.isalpha() for c in word])
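If you prefer the looser "more letters than non-letters" definition mentioned above, a sketch could be:
def mostly_letters(word):
    # Keep a word when its letters outnumber its non-letter characters
    tf = [c.isalpha() for c in word]
    return tf.count(True) > tf.count(False)

print(mostly_letters('d...'))   # False (1 letter vs 3 dots)
print(mostly_letters("isn't"))  # True  (4 letters vs 1 apostrophe)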
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant; you could rewrite the last bit as:
for word in top_words:
    # Since you are calling .lower() so much,
    # you probably want to define it up here
    w = word.lower()
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        print "%i. '%s'" % (number, word)
        number += 1
    # This could go under the if statement. You only want to add
    # words that could be added again. Why add words that are being
    # filtered out anyways?
    already.append(w)
    # this wasn't indented correctly before
    if number > many:
        break
Hope that helps.

Extract words surrounding a search word

I have this script that does a word search in text. The search works pretty well and the results are as expected. What I'm trying to achieve is to extract the n words close to the match. For example:
The world is a small place, we should try to take care of it.
Suppose I'm looking for place and I need to extract the 3 words on the right and the 3 words on the left. In this case they would be:
left -> [is, a, small]
right -> [we, should, try]
What is the best approach to do this?
Thanks!
import re

def search(text, n):
    '''Searches for text, and retrieves n words either side of the text, which are returned separately'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]
This allows you to specify how many words either side you want to capture. It works by constructing the regular expression dynamically. With
t = "The world is a small place, we should try to take care of it."
search(t,3)
(('is', 'a', 'small'), ('we', 'should', 'try'))
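To see what actually gets built, you can print the assembled pattern; for n=3 it expands like this:
word = r"\W*([\w]+)"
pattern = r'{}\W*{}{}'.format(word * 3, 'place', word * 3)
print(pattern)
# \W*([\w]+)\W*([\w]+)\W*([\w]+)\W*place\W*([\w]+)\W*([\w]+)\W*([\w]+)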
While a regex would work, I think it's overkill for this problem. You're better off with a comprehension and some slicing:
sentence = 'The world is a small place, we should try to take care of it.'.split()
indices = (i for i,word in enumerate(sentence) if word=="place")
neighbors = []
for ind in indices:
neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
Note that if the word that you're looking for appears multiple times consecutively in the sentence, then this algorithm will include the consecutive occurrences as neighbors.
For example:
In [29]: neighbors = []
In [30]: sentence = 'The world is a small place place place, we should try to take care of it.'.split()
In [31]: sentence
Out[31]:
['The',
'world',
'is',
'a',
'small',
'place',
'place',
'place,',
'we',
'should',
'try',
'to',
'take',
'care',
'of',
'it.']
In [32]: indices = [i for i,word in enumerate(sentence) if word == 'place']
In [33]: for ind in indices:
   ....:     neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
   ....:
In [34]: neighbors
Out[34]:
[['is', 'a', 'small', 'place', 'place,', 'we'],
['a', 'small', 'place', 'place,', 'we', 'should']]
import re
s='The world is a small place, we should try to take care of it.'
m = re.search(r'((?:\w+\W+){,3})(place)\W+((?:\w+\W+){,3})', s)
if m:
    l = [x.strip().split() for x in m.groups()]
    left, right = l[0], l[2]
    print left, right
Output
['is', 'a', 'small'] ['we', 'should', 'try']
If you search for The, it yields:
[] ['world', 'is', 'a']
Handling the scenario where the search keyword appears multiple times: for example, below is an input text where the search keyword place appears 3 times.
The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible
Here is the function
import re
def extract_surround_words(text, keyword, n):
    '''
    text : input text
    keyword : the search keyword we are looking for
    n : number of words around the keyword
    '''
    # extract all the words from the text
    words = re.findall(r'\w+', text)
    # iterate through all the words
    for index, word in enumerate(words):
        # check if the search keyword matches
        if word == keyword:
            # fetch the left-side words
            left_side_words = words[index-n : index]
            # fetch the right-side words
            right_side_words = words[index+1 : index+n+1]
            print(left_side_words, right_side_words)
Calling the function
text = 'The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible'
keyword = "place"
n = 3
extract_surround_words(text, keyword, n)
Output:
['is', 'a', 'small'] ['we', 'should', 'try']
['of', 'this', 'small'] ['by', 'planting', 'trees']
['trees', 'in', 'every'] ['wherever', 'is', 'possible']
Find all of the words:
import re
sentence = 'The world is a small place, we should try to take care of it.'
words = re.findall(r'\w+', sentence)
Get the index of the word that you're looking for:
index = words.index('place')
And then use slicing to find the other ones:
left = words[index - 3:index]
right = words[index + 1:index + 4]
