Extract words surrounding a search word - python

I have this script that does a word search in text. The search works well and the results are as expected. What I'm trying to achieve is to extract the n words on either side of the match. For example:
The world is a small place, we should try to take care of it.
Suppose I'm looking for place and I need to extract the 3 words on the right and the 3 words on the left. In this case they would be:
left -> [is, a, small]
right -> [we, should, try]
What is the best approach to do this?
Thanks!

import re

def search(text, n):
    '''Searches for text, and retrieves n words either side of the text, which are returned separately.'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word * n, 'place', word * n), text).groups()
    return groups[:n], groups[n:]
This allows you to specify how many words either side you want to capture. It works by constructing the regular expression dynamically. With
t = "The world is a small place, we should try to take care of it."
search(t,3)
(('is', 'a', 'small'), ('we', 'should', 'try'))

While a regex would work, I think it's overkill for this problem. You're better off splitting the sentence and slicing around each match:
sentence = 'The world is a small place, we should try to take care of it.'.split()
indices = (i for i, word in enumerate(sentence) if word == "place")
neighbors = []
for ind in indices:
    neighbors.append(sentence[ind - 3:ind] + sentence[ind + 1:ind + 4])
Note that if the word that you're looking for appears multiple times consecutively in the sentence, then this algorithm will include the consecutive occurrences as neighbors.
For example:
In [29]: neighbors = []
In [30]: sentence = 'The world is a small place place place, we should try to take care of it.'.split()
In [31]: sentence
Out[31]:
['The',
'world',
'is',
'a',
'small',
'place',
'place',
'place,',
'we',
'should',
'try',
'to',
'take',
'care',
'of',
'it.']
In [32]: indices = [i for i,word in enumerate(sentence) if word == 'place']
In [33]: for ind in indices:
   ....:     neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
In [34]: neighbors
Out[34]:
[['is', 'a', 'small', 'place', 'place,', 'we'],
['a', 'small', 'place', 'place,', 'we', 'should']]
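If you would rather not have the keyword itself show up among the neighbors, a small sketch of one option (my addition, not part of the original answer) is to filter the keyword out of each window:
sentence = 'The world is a small place place place, we should try to take care of it.'.split()
keyword = 'place'
neighbors = []
for ind in (i for i, word in enumerate(sentence) if word == keyword):
    window = sentence[max(0, ind - 3):ind] + sentence[ind + 1:ind + 4]
    # Drop exact repeats of the keyword; 'place,' survives because split()
    # leaves punctuation attached.
    neighbors.append([w for w in window if w != keyword])
print(neighbors)
# [['is', 'a', 'small', 'place,', 'we'], ['a', 'small', 'place,', 'we', 'should']]
Note that the filtered windows can come back with fewer than 3 words per side.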

import re

s = 'The world is a small place, we should try to take care of it.'
m = re.search(r'((?:\w+\W+){,3})(place)\W+((?:\w+\W+){,3})', s)
if m:
    l = [x.strip().split() for x in m.groups()]
    left, right = l[0], l[2]
    print(left, right)
Output
['is', 'a', 'small'] ['we', 'should', 'try']
If you search for The, it yields:
[] ['world', 'is', 'a']
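The pattern above hard-codes place; here is a sketch of how it could be generalised to an arbitrary single-word keyword (the surrounding_words helper is my own, not part of the answer):
import re

def surrounding_words(text, keyword, n=3):
    # Same idea as above, but the keyword and window size are parameters.
    pattern = r'((?:\w+\W+){{,{n}}}){kw}\W+((?:\w+\W+){{,{n}}})'.format(
        n=n, kw=re.escape(keyword))
    m = re.search(pattern, text)
    if not m:
        return [], []
    left, right = (g.strip().split() for g in m.groups())
    return left, right

text = 'The world is a small place, we should try to take care of it.'
print(surrounding_words(text, 'place'))
# (['is', 'a', 'small'], ['we', 'should', 'try'])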

Handling the scenario where the search keyword appears multiple times. For example, below is input text in which the search keyword place appears 3 times:
The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible
Here is the function:
import re

def extract_surround_words(text, keyword, n):
    '''
    text    : input text
    keyword : the search keyword we are looking for
    n       : number of words around the keyword
    '''
    # extract all the words from the text
    words = re.findall(r'\w+', text)
    # iterate through all the words
    for index, word in enumerate(words):
        # check if the search keyword matches
        if word == keyword:
            # fetch the left-side words (max() keeps the slice start from going
            # negative when the keyword sits within the first n words)
            left_side_words = words[max(0, index - n):index]
            # fetch the right-side words
            right_side_words = words[index + 1:index + n + 1]
            print(left_side_words, right_side_words)
Calling the function
text = 'The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible'
keyword = "place"
n = 3
extract_surround_words(text, keyword, n)
output:
['is', 'a', 'small'] ['we', 'should', 'try']
['of', 'this', 'small'] ['by', 'planting', 'trees']
['trees', 'in', 'every'] ['wherever', 'is', 'possible']

Find all of the words:
import re
sentence = 'The world is a small place, we should try to take care of it.'
words = re.findall(r'\w+', sentence)
Get the index of the word that you're looking for:
index = words.index('place')
And then use slicing to find the other ones:
left = words[index - 3:index]
right = words[index + 1:index + 4]
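One caveat worth adding: if the match sits among the first three words, index - 3 goes negative, so the left slice wraps around to the end of the list and comes back empty. Clamping the start avoids that (my addition):
index = words.index('place')
left = words[max(0, index - 3):index]   # clamp so a keyword near the start still works
right = words[index + 1:index + 4]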

Related

To include '#' in regex of words around certain words

I want to get the 3 words on either side of a certain word.
For this example, the goal is to return the 3 words to the left and the 3 words to the right of "to".
import re

sentence = "#allows us to be free from the place"
key = "to"
left = []
right = []
m = re.search(r'((?:\w+\W+){,3})' + key + r'\W+((?:\w+\W+){,3})', sentence)
if m:
    l = [x.strip().split() for x in m.groups()]
    # l = two arrays of left and right
    left, right = l[0], l[1]
    print(left, right)
Output:
['allows', 'us'] ['be', 'free', 'from']
As you can see from the output, '#' symbol was not included.
Expected output:
['#allows', 'us'] ['be', 'free', 'from']
Note:
Since there are only 2 words to the left of "to", it will return both of them even though the regex asks for up to 3 words
In some cases, the key might be more than one word
What seems to be the problem, and how to solve it? Thank you
No need to do this with regex. You can use a list slice.
sentence = '#allows us to be free from the place'
search_word = 'to'
context = 3

words = sentence.split()
try:
    word_index = words.index(search_word)
    start = max(0, word_index - context)
    stop = min(word_index + 1 + context, len(words))
    context_words = words[start:stop]
    print(context_words)
except ValueError:
    print('search_word not in the sentence')
prints
['#allows', 'us', 'to', 'be', 'free', 'from']
Use two slices if you want separate "before" and "after" lists.
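For completeness, a sketch of that two-slice variant (my addition, reusing the same names as the answer above):
sentence = '#allows us to be free from the place'
search_word = 'to'
context = 3

words = sentence.split()
word_index = words.index(search_word)
before = words[max(0, word_index - context):word_index]
after = words[word_index + 1:word_index + 1 + context]
print(before, after)
# ['#allows', 'us'] ['be', 'free', 'from']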

How do I detect whether a string with no whitespace has any English words?

For example,
text = 'huwefggthisisastringhef'
I'd like to return a True or False depending on the string. E.g.
if detectEnglish(text) == True:
    print('contains english')
Finds all English words at least three characters long in text:
import enchant

d = enchant.Dict('en_US')
text = 'huwefggthisisastringhef'
l = len(text)
for i in range(l):
    for j in range(i + 3, l + 1):
        if d.check(text[i:j]):
            print(text[i:j])
It does that by testing all possible substrings (only 231 combinations for a 23-character string).
There are probably better methods to do this, but if you don't need any information about the words that are found, you can do the following.
This project on GitHub has over 466K words in a simple text file; open the file, read its contents into memory, and look up each combination of letters.
If you wanted to, you could sort this file into multi-dimensional dictionaries, but to be honest, if the text is very random it may be computationally hungry.
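A minimal sketch of that idea, assuming the word list has been saved locally as words.txt with one word per line (the filename and the detect_english name are placeholders of mine):
# Load the word list into a set for fast membership tests.
with open('words.txt') as f:
    english_words = {line.strip().lower() for line in f if len(line.strip()) >= 3}

def detect_english(text):
    # True if any substring of length >= 3 is a known word.
    n = len(text)
    return any(text[i:j] in english_words
               for i in range(n)
               for j in range(i + 3, n + 1))

print(detect_english('huwefggthisisastringhef'))   # True (e.g. 'string' is in the list)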
I hope this answer was a bit helpful.
A trie regex could help you. You could filter the wordbook by length first, in order to avoid matching ['h', 'u', 'we', 'f', 'g', 'g', 'this', 'is', 'as', 't', 'ring', 'he', 'f']:
# encoding: utf-8
import re
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook if len(word.strip()) >= 3]

trie = Trie()
for word in english_words:
    trie.add(word)

test_word = "huwefggthisisastringhef"
print(re.findall(trie.pattern(), test_word))
# ['this', 'string']
It takes a few seconds to create the regex but the search itself is extremely fast, and should be more efficient than simply looping over every substring.
print(re.findall(trie.pattern(), "sdgfsdfgkjslfkgjsdkfgjsdbbqdsfghiddenwordsadfgsdfgsdfgsdfgsdtqtrwerthg"))
# ['hidden', 'words']
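Note that trie.Trie is not in the standard library; the answer presumably imports a trie-to-regex helper along the lines of the well-known Stack Overflow recipe. Here is a minimal sketch of such a class (my own approximation of that interface, not the exact module the answer uses):
import re

class Trie:
    """Builds a regex that matches any of the added words, preferring longer matches."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        node = self.data
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True   # end-of-word marker

    def pattern(self):
        return self._to_pattern(self.data)

    def _to_pattern(self, node):
        # Leaf: only the end-of-word marker is left.
        if '' in node and len(node) == 1:
            return ''
        alternatives = []
        has_end = False
        for ch in sorted(node):
            if ch == '':
                has_end = True
            else:
                alternatives.append(re.escape(ch) + self._to_pattern(node[ch]))
        result = alternatives[0] if len(alternatives) == 1 else '(?:' + '|'.join(alternatives) + ')'
        if has_end:
            # A word may stop here, but the greedy '?' tries the longer continuation first.
            result = '(?:' + result + ')?'
        return result

trie = Trie()
for word in ('this', 'string', 'ring'):
    trie.add(word)
print(re.findall(trie.pattern(), 'huwefggthisisastringhef'))
# ['this', 'string']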
Based on the accepted answer, here is a minor modification I thought could be valuable to share:
import enchant

d = enchant.Dict('en_US')
text = 'huwefggthisisastringhef'
l = len(text)
words = {text[i:j]: range(i, j)
         for i in range(l) for j in range(l + 1)
         if len(text[i:j]) >= 3 and d.check(text[i:j])}
print(words)
This returns a dictionary with the words and their ranges, which can, for instance, be used to check which words intersect, and so on.
{'this': range(7, 11),
'his': range(8, 11),
'sis': range(10, 13),
'string': range(14, 20),
'ring': range(16, 20)}

Splitting the sentences in python

I am trying to split sentences into words.
words = content.lower().split()
This gives me a list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
If you see the word "morningthe" in the list, it used to have "--" between the words. Now, is there any way I can split it into two words like "morning", "the"?
I would suggest a regex-based solution:
import re

def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words - groups of word characters - ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer, which returns an iterator, is probably better, as you don't have to store the whole list of words at once.
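A quick sketch of that variant (my addition):
import re

text = "The morning-the evening"
for match in re.finditer(r'\w+', text):
    print(match.group())   # one word per iteration, without building the full list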
Alternatively, you may also use itertools.groupby along with str.isalpha() to extract letters-only words from the string:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner. I have mentioned this as a possible alternative way to achieve the same result.
Specific to the OP: if all you want is to also split on -- in the resultant list, then you may first replace hyphens '-' with spaces ' ' before performing the split. Hence, your code would be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will drive you crazy, e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
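For example, nltk's word tokenizer handles contractions and possessives more sensibly than a bare \w+ (my addition; the exact token boundaries depend on the NLTK version, and the tokenizer data must be downloaded first):
import nltk
# nltk.download('punkt')   # one-time download of the tokenizer data

print(nltk.word_tokenize("Don't read O'Rourke's books!"))
# e.g. ['Do', "n't", 'read', "O'Rourke", "'s", 'books', '!']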
Besides the solutions given already, you could also improve your clean_up_list function to do a better job.
def clean_up_list(word_list):
    clean_word_list = []
    # Move the list out of the loop so that it doesn't
    # have to be initiated every time.
    symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append the possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the body of the for word in word_list: loop to the whole sentence and get the same result.
You could also do this:
import re

def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']

Python :: Intersecting lists of strings gone wrong

I am trying to intersect lists of sentences divided into strings:
user = ['The', 'Macbeth', 'Tragedie'] #this list
plays = []
hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')
plays.append(hamlet)
plays.append(macbeth)
plays.append(caesar)
shakespeare = list(chain.from_iterable(plays)) # with this list
'shakespeare' prints as follows:
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ['Scoena', 'Prima', '.'], ['Enter', 'Barnardo', 'and', 'Francisco', 'two', 'Centinels', '.']...['FINIS', '.'], ['THE', 'TRAGEDIE', 'OF', 'IVLIVS', 'CaeSAR', '.']]
bestCount = 0
for sent in shakespeare:
    currentCount = len(set(user).intersection(sent))
    if currentCount > bestCount:
        bestCount = currentCount
        answer = ' '.join(sent)
return ''.join(answer).lower(), bestCount
The return value, however, is not the right intersection; that is, "hamlet" intersects with "macbeth"...
('the tragedie of hamlet , prince of denmarke .', 3)
where is the bug?
Doesn't sound like you should be using sets here. The most obvious problem is that you care about the number of occurrences of a word in a sentence (which starts life as a list), and by converting to a set you collapse all repeated words down to one occurrence, losing that information.
I would suggest rather converting each sentence's members into lowercase, like so:
mapped = map(str.lower, sentence) # may want list(map(...)) if on Py3
Initialize a dict of counts like this:
In [6]: counts = {word.lower(): 0 for word in user}
In [7]: counts
Out[7]: {'macbeth': 0, 'the': 0, 'tragedie': 0}
Then as you loop over the sentences, you can do something like this:
In [8]: for word in counts:
   ...:     counts[word] = max(counts[word], mapped.count(word))
   ...:
In [9]: counts
Out[9]: {'macbeth': 0, 'the': 1, 'tragedie': 1}
I just used one example sentence, but you get the idea. At the end you'll have the maximum number of times each of the user's words appeared in a single sentence. You can make the data structure a little more complex, or use an if statement, if you want to also keep the sentence in which it occurred the most times.
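A sketch of what that might look like (my addition, assuming user and shakespeare as defined in the question):
# For each of the user's words, keep the highest per-sentence count
# together with the sentence that produced it.
best = {word.lower(): (0, None) for word in user}

for sent in shakespeare:
    lowered = [w.lower() for w in sent]
    for word in best:
        count = lowered.count(word)
        if count > best[word][0]:
            best[word] = (count, ' '.join(sent))

print(best)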
Good luck!

Is there a better way to get just 'important words' from a list in python?

I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw

picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')

print 'Getting the top %d most common words from r/%s:' % (many, picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)

hey = []
for x in submissions:
    hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey)  # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()

common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1

print '-----------------------'
for word in top_words:
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter += 1
        number += 1
        already.append(word.lower())
    if counter == many:
        break
print '-----------------------\n'
so inputting subreddit 'python' and asking for the top 10 words returns:
1. 'Python'
2. 'PyPy'
3. 'code'
4. 'use'
5. '136'
6. '181'
7. 'd...'
8. 'IPython'
9. '133'
10. '158'
How can I make this script not return numbers and error words like 'd...'? The first 4 results are acceptable, but I would like to replace the rest with words that make sense. Maintaining the common_words list is unreasonable, and it doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.
I disagree. Making a list of common words is correct; there is no easier way to filter out the, for, I, am, etc. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(). Since your list is long, this should speed things up: the in operation for sets is O(1), while for lists it is O(n).
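For instance (my addition):
common_words = set(common_words)   # one-time conversion; each membership test is now O(1)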
2) Getting rid of all number strings is trivial. One way you could do it is:
all([w.isdigit() for w in word])
If this returns True, then the word is just a series of digits.
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, or as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
    # This returns True only if a word is all letters
    return all([c.isalpha() for c in word])
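A couple of quick checks (my addition):
check_wordiness('IPython')   # True
check_wordiness('136')       # False
check_wordiness('d...')      # False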
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant; you could rewrite the last bit as:
for word in top_words:
    # Since you are calling .lower() so much,
    # you probably want to define it up here
    w = word.lower()
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        print "%i. '%s'" % (number, word)
        number += 1
        # This could go under the if statement. You only want to add
        # words that could be added again. Why add words that are being
        # filtered out anyways?
        already.append(w)
        # this wasn't indented correctly before
        if number == many:
            break
Hope that helps.
