I want to determine the 3 words that appear near a certain word.
For this example, the goal is to return the 3 words to the left and the 3 words to the right of "to".
import re

sentence = "#allows us to be free from the place"
key = "to"
left = []
right = []
m = re.search(r'((?:\w+\W+){,3})' + key + r'\W+((?:\w+\W+){,3})', sentence)
if m:
    l = [x.strip().split() for x in m.groups()]
    # l = two lists of left and right words
    left, right = l[0], l[1]
    print(left, right)
Output:
['allows', 'us'] ['be', 'free', 'from']
As you can see from the output, the '#' symbol was not included.
Expected output:
['#allows', 'us'] ['be', 'free', 'from']
Note:
Since there are only 2 words to the left of "to", it should return both of them even though the regex asks for up to 3 words.
In some cases, the key might be more than one word.
What seems to be the problem, and how can I solve it? Thank you.
No need to do this with regex. You can use a list slice.
sentence = '#allows us to be free from the place'
search_word = 'to'
context = 3

words = sentence.split()
try:
    word_index = words.index(search_word)
    start = max(0, word_index - context)
    stop = min(word_index + 1 + context, len(words))
    context_words = words[start:stop]
    print(context_words)
except ValueError:
    print('search_word not in the sentence')
prints
['#allows', 'us', 'to', 'be', 'free', 'from']
Use two slices if you want separate "before" and "after" lists.
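For example, a minimal sketch of that variant, reusing the names from the snippet above (note that it keeps the '#' the question asked about):

left = words[start:word_index]
right = words[word_index + 1:stop]
print(left, right)
# ['#allows', 'us'] ['be', 'free', 'from']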
Related
So this is a smaller version of my sentences/phrase list:
search = ['More than words', 'this way', 'go round', 'the same', 'end With', 'be sure', 'care for', 'see you in hell', 'see you', 'according to', 'they say', 'to go', 'stay with', 'Golf pro', 'Country Club']
What I would like to do is remove any terms that are more or fewer than 2 words long. I basically just want another list containing only the 2-word terms. Is there a way to do this in Python? From my searching, I have only found how to remove words with a certain number of characters, not entire phrases.
You can get 2 word phrases using this:
ans = list(filter(lambda s: len(s.split()) == 2, search))
Another way is using list comprehension:
ans = [w for w in search if len(w.split()) == 2]
As the OP asked in a comment about removing duplicates, adding it here:
ans = list(set(filter(lambda s: len(s.split()) == 2, search)))
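Note that set() does not preserve the original order. If order matters, one possible alternative (a sketch, not from the answer above) relies on dicts keeping insertion order in Python 3.7+:

ans = list(dict.fromkeys(w for w in search if len(w.split()) == 2))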
Well if you take a simple approach and say every word must be separated by a space, then any two-word strings will only have 1 space character, three-word strings have 2 space characters, and so on (general rule, an n-word string has (n-1) spaces).
You can split the list into two like this (one holding the strings with exactly two words, and one holding everything else):
twoWords = []
moreThanTwo = []
for s in search:
    if s.count(" ") == 1:
        twoWords.append(s)
    else:
        moreThanTwo.append(s)
Or to simplify with a lambda expression and just extract a single list of all two-word strings:
twoWords = list(filter(lambda s: s.count(" ") == 1, search))
As you want a new list containing only the phrases that are exactly 2 words long, use a list comprehension:
new_list = [sentence for sentence in search if len(sentence.split(' ')) == 2]
I have an exclusion_list = ['and', 'or', 'not']
My string is 'Nick is not or playing football and or not cricket'
If words from exclusion_list appear next to each other in the string, only the first one should be kept.
My expected output is 'Nick is not playing football and cricket'
My code is below:
search_value = 'Nick is not or playing football and or not cricket'
exclusion_list = ['and', 'or', 'not']
search_value_2 = search_value.split(' ')
for i in range(0, len(search_value_2) - 2):
    if search_value_2[i] in exclusion_list and search_value_2[i+1] in exclusion_list:
        search_value_2.remove(search_value_2[i+1])
' '.join(search_value_2)
My output: 'Nick is not playing football and not cricket'
Expected output: 'Nick is not playing football and cricket'
Basically I need to keep applying this until no two exclusion_list words appear next to each other in the string.
First let's use a more meaningful name for search_value_2. What does this list contain? Words!
So we call this list words.
Since the code tries to access the element at i + 1 (where i is the index of the word in the list), it shouldn't run up to len(words) - 2 but to len(words) - 1.
search_value = 'Nick is not or playing football and or not cricket'
exclusion_list = ['and', 'or', 'not']
words = search_value.split(' ')
for i in range(len(words) - 1):
    if words[i] in exclusion_list and words[i + 1] in exclusion_list:
        words[i + 1] = ''
print(' '.join(words))
Now we get Nick is not playing football and not cricket. At least we have a result, but it's still not correct. We can avoid the extra space by not joining empty words. This means changing ' '.join(words) to ' '.join((word for word in words if word)).
Then we have to take into account that more than two words from the exclusion list may follow each other. The simple words[i] in exclusion_list check doesn't work then, because that word might already have been overwritten with an empty string in an earlier iteration.
So this has to be changed to (words[i] == '' or words[i] in exclusion_list).
search_value = 'Nick is not or playing football and or not cricket'
exclusion_list = ['and', 'or', 'not']
words = search_value.split(' ')
for i in range(len(words) - 1):
    if (words[i] == '' or words[i] in exclusion_list) and words[i + 1] in exclusion_list:
        words[i + 1] = ''
print(' '.join((word for word in words if word)))
Final result: Nick is not playing football and cricket
Another approach is as follows:
Go through all the words in the text
If the word is in the exclusion_list, either
keep it, if the previous word wasn't in the exclusion_list, OR
skip it otherwise.
Technically, we use a variable skip that works as a switch to store whether or not we want to keep a word. What's nice about doing it this way is that it can handle sequences of arbitrary length.
exclusion_list = ['and', 'or', 'not']
txt = 'Nick is not or playing football and or not cricket'

words = txt.split()
output = []
skip = False

for w in words:
    if w in exclusion_list:
        if skip:
            # previous word was in the exclusion list,
            # so skip this one
            continue
        else:
            # starts a new sequence of one or more
            # exclusion_list words: keep this one
            output.append(w)
            # skip the following ones
            skip = True
    else:
        # keep normal words and reset skip
        output.append(w)
        skip = False

# Output: Nick is not playing football and cricket
print(' '.join(output))
You can modify your code this way:
search_value = 'Nick is not or playing football and or not cricket'
exclusion_list = ['and', 'or', 'not']
search_value_2 = search_value.split(' ')
selected = []
for i in range(len(search_value_2)):
    # keep a word unless it is in the exclusion list and the previous word is too
    if search_value_2[i] not in exclusion_list or i == 0 or search_value_2[i - 1] not in exclusion_list:
        selected.append(search_value_2[i])
print(' '.join(selected))
Regex approach: this has several drawbacks, such as handling more than 3 consecutive terms from exclusion_list, and it scales poorly if exclusion_list is large. See the next approach with groupby.
With a regex approach: generate "all" possible combinations of the elements of exclusion_list, of length 2 and of length 3, merge them together (in decreasing order of length! this is important for the regex pattern), build the pattern with | to denote "or", and perform the substitution.
Remark: the combinations are made with product, so 'and and', ... are also included; it is a somewhat greedy method.
import re
import itertools as it
search_value = 'Nick is not or playing football and or not cricket'
exclusion_list = ['and', 'or', 'not']
p2 = it.product(exclusion_list, repeat=2)
p3 = it.product(exclusion_list, repeat=3)
ps = it.chain(p3, p2) # <-- first the longest!
ps_as_strs = map(' '.join, ps)
regex = '|'.join(map('({})'.format, ps_as_strs))
new_search_value = re.sub(regex, lambda match: match.group(match.lastindex).split()[0], search_value)
print(new_search_value)
EDIT: more robust solution with groupby
import itertools as it

search_value = 'Nick is not or playing football and or not cricket'
exclusion_list = ['and', 'or', 'not']

new_str = ''
for check, grp in it.groupby(search_value.split(), lambda word: word in exclusion_list):
    if check:
        # a run of exclusion words: keep only the first one
        new_str += ' '.join([next(grp)])
    else:
        new_str += ' '.join(grp)
    new_str += ' '

new_str = new_str.strip()
print(new_str)
I have a list of specific words
['to', 'with', 'in', 'for']
I want to write a function which takes a sentence and, if a word from my list appears in it, selects the next two words after that word and appends them, each joined to it, to a list (I need it for a part of my sentence generator). For example:
sentence = 'In the morning I went to the store and then to the restaurant'
I want to get
['tothe', 'tostore', 'tothe', 'torestaurant']
I wrote this code:
preps = ['to', 'with', 'in', 'for']

def nextofnext_words_ofnegs(sentence):
    list_of_words = sentence.split()
    next_word = []
    for i in list_of_words:
        for j in preps:
            if i == j:
                next_word.append(j + list_of_words[list_of_words.index(i) + 1])
                next_word.append(j + list_of_words[list_of_words.index(i) + 2])
    return next_word
However, I get this:
['tothe', 'tostore', 'tothe', 'tostore']
Instead of this:
['tothe', 'tostore', 'tothe', 'torestaurant']
This should work to give you what you want. Note that you can use the in operator in Python to check whether the word exists in your list; there is no need to loop over the list here. Also, as mentioned above, using .index is not suitable here since it always returns the first occurrence; you can use enumerate to get the index along with each item in the list.
preps = ['to', 'with', 'in', 'for']

def nextofnext_words_ofnegs(sentence):
    list_of_words = sentence.split()
    next_word = []
    for idx, word in enumerate(list_of_words):
        # the extra bounds check avoids an IndexError when the preposition is one of the last two words
        if word in preps and idx + 2 < len(list_of_words):
            next_word.append(word + list_of_words[idx + 1])
            next_word.append(word + list_of_words[idx + 2])
    return next_word
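For reference, calling it on the sentence from the question should give the expected list:

sentence = 'In the morning I went to the store and then to the restaurant'
print(nextofnext_words_ofnegs(sentence))
# ['tothe', 'tostore', 'tothe', 'torestaurant']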
I have a script that does a word search in text. The search works pretty well and the results are as expected. What I'm trying to achieve is to extract n words close to the match. For example:
The world is a small place, we should try to take care of it.
Suppose I'm looking for place and I need to extract the 3 words on the right and the 3 words on the left. In this case they would be:
left -> [is, a, small]
right -> [we, should, try]
What is the best approach to do this?
Thanks!
import re

def search(text, n):
    '''Searches for the text and retrieves n words either side of it, which are returned separately'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word * n, 'place', word * n), text).groups()
    return groups[:n], groups[n:]
This allows you to specify how many words on either side you want to capture. It works by constructing the regular expression dynamically. With

t = "The world is a small place, we should try to take care of it."
search(t, 3)

the result is

(('is', 'a', 'small'), ('we', 'should', 'try'))
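The target word is hardcoded as 'place'; if you would rather pass it in, one possible generalisation (a sketch, with re.escape so that regex metacharacters in the target are treated literally) could look like this:

import re

def search(text, target, n):
    '''Retrieves n words either side of target, returned separately (sketch).'''
    word = r"\W*([\w]+)"
    pattern = r'{}\W*{}{}'.format(word * n, re.escape(target), word * n)
    groups = re.search(pattern, text).groups()
    return groups[:n], groups[n:]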
While regex would work, I think it's overkill for this problem. You're better off with a generator expression and a bit of list slicing:
sentence = 'The world is a small place, we should try to take care of it.'.split()
indices = (i for i, word in enumerate(sentence) if word == "place")

neighbors = []
for ind in indices:
    neighbors.append(sentence[ind-3:ind] + sentence[ind+1:ind+4])
Note that if the word that you're looking for appears multiple times consecutively in the sentence, then this algorithm will include the consecutive occurrences as neighbors.
For example:
In [29]: neighbors = []
In [30]: sentence = 'The world is a small place place place, we should try to take care of it.'.split()
In [31]: sentence
Out[31]:
['The',
'world',
'is',
'a',
'small',
'place',
'place',
'place,',
'we',
'should',
'try',
'to',
'take',
'care',
'of',
'it.']
In [32]: indices = [i for i,word in enumerate(sentence) if word == 'place']
In [33]: for ind in indices:
....: neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
In [34]: neighbors
Out[34]:
[['is', 'a', 'small', 'place', 'place,', 'we'],
['a', 'small', 'place', 'place,', 'we', 'should']]
import re

s = 'The world is a small place, we should try to take care of it.'
m = re.search(r'((?:\w+\W+){,3})(place)\W+((?:\w+\W+){,3})', s)
if m:
    l = [x.strip().split() for x in m.groups()]
    left, right = l[0], l[2]
    print(left, right)
Output
['is', 'a', 'small'] ['we', 'should', 'try']
If you search for The, it yields:
[] ['world', 'is', 'a']
Handling the scenario where the search keyword appears multiple times. For example, below is an input text in which the search keyword place appears 3 times:
The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible
Here is the function
import re

def extract_surround_words(text, keyword, n):
    '''
    text    : input text
    keyword : the search keyword we are looking for
    n       : number of words around the keyword
    '''
    # extract all the words from the text
    words = re.findall(r'\w+', text)

    # iterate through all the words
    for index, word in enumerate(words):
        # check if the search keyword matches
        if word == keyword:
            # fetch the words on the left side; max() stops the slice going negative
            # when the keyword sits within the first n words
            left_side_words = words[max(0, index - n):index]
            # fetch the words on the right side
            right_side_words = words[index + 1:index + n + 1]
            print(left_side_words, right_side_words)
Calling the function
text = 'The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible'
keyword = "place"
n = 3
extract_surround_words(text, keyword, n)
Output:
['is', 'a', 'small'] ['we', 'should', 'try']
['of', 'this', 'small'] ['by', 'planting', 'trees']
['trees', 'in', 'every'] ['wherever', 'is', 'possible']
Find all of the words:
import re
sentence = 'The world is a small place, we should try to take care of it.'
words = re.findall(r'\w+', sentence)
Get the index of the word that you're looking for:
index = words.index('place')
And then use slicing to find the other ones:
left = words[index - 3:index]
right = words[index + 1:index + 4]
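One caveat (an addition, not part of the answer above): if the word you are looking for sits within the first three words, index - 3 goes negative and the left slice comes back empty instead of shorter, so it is safer to clamp the start:

left = words[max(0, index - 3):index]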
I'm trying to format this string below where one row contains five words. However, I keep getting this as the output:
I love cookies yes I do Let s see a dog
First, I am not getting 5 words in one line, but instead, everything in one line.
Second, why does "Let's" get split? I thought that when splitting the string into words, it would only split where there was a space in between?
Suggestions?
string = """I love cookies. yes I do. Let's see a dog."""
# split string
words = re.split('\W+',string)
words = [i for i in words if i != '']
counter = 0
output=''
for i in words:
if counter == 0:
output +="{0:>15s}".format(i)
# if counter == 5, new row
elif counter % 5 == 0:
output += '\n'
output += "{0:>15s}".format(i)
else:
output += "{0:>15s}".format(i)
# Increase the counter by 1
counter += 1
print(output)
As a start, don't call a variable "string" since it shadows the module with the same name
Secondly, use split() to do your word-splitting
>>> s = """I love cookies. yes I do. Let's see a dog."""
>>> s.split()
['I', 'love', 'cookies.', 'yes', 'I', 'do.', "Let's", 'see', 'a', 'dog.']
From the re module documentation:
\W
Matches any character which is not a Unicode word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_] (but the flag affects the entire regular expression, so in such cases using an explicit [^a-zA-Z0-9_] may be a better choice).
Since the ' is not listed in the above, the regexp used splits the "Let's" string into two parts:
>>> words = re.split('\W+', s)
>>> words
['I', 'love', 'cookies', 'yes', 'I', 'do', 'Let', 's', 'see', 'a', 'dog', '']
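If you still want a regex split but need the apostrophe to survive, one option (a sketch, not from the answer above) is to include it in the character class and drop the empty strings:

>>> [w for w in re.split(r"[^\w']+", s) if w]
['I', 'love', 'cookies', 'yes', 'I', 'do', "Let's", 'see', 'a', 'dog']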
This is the output I get using the split()-approach above:
$ ./sp3.py
I love cookies. yes I
do. Let's see a dog.
The code could probably be simplified, since the counter == 0 case and the else clause do the same thing. I threw in an enumerate as well to get rid of the counter:
#!/usr/bin/env python3

s = """I love cookies. yes I do. Let's see a dog."""
words = s.split()

output = ''
for n, i in enumerate(words):
    if n % 5 == 0:
        output += '\n'
    output += "{0:>15s}".format(i)

print(output)
words = string.split()
while len(words):
    for word in words[:5]:
        print(word, end=" ")
    print()
    words = words[5:]
That's the basic concept: split the string using the split() method,
then use slice notation to take the first 5 words and print them,
then slice off those first 5 words, and loop again.