How can I match an exact string or word while searching a list? I have tried, but the result is not correct. Below are a sample list, my code, and the test results.
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
my code:
for str in list:
    if s in str:
        print str
test results:
s = "hello" ~ expected output: 'Hi, hello' ~ output I get: 'Hi, hello'
s = "123" ~ expected output: *nothing* ~ output I get: 'hi mr 12345'
s = "12345" ~ expected output: 'hi mr 12345' ~ output I get: 'hi mr 12345'
s = "come" ~ expected output: *nothing* ~ output I get: 'welcome sir'
s = "welcome" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
s = "welcome sir" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
My list contains more than 200K strings
It looks like you need to perform this search more than once, so I would recommend converting your list into a dictionary:
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> d = dict()
>>> for item in l:
...     for word in item.split():
...         d.setdefault(word, list()).append(item)
...
So now you can easily do:
>>> d.get('hi')
['hi mr 12345']
>>> d.get('come') # nothing
>>> d.get('welcome')
['welcome sir']
P.S. You will probably have to improve item.split() to handle commas, periods, and other separators, for example with a regex based on \w.
P.P.S. As cularion mentioned, this won't match "welcome sir". If you want to match the whole string, that is just one additional line in the proposed solution. But if you have to match part of a string bounded by spaces and punctuation, regex should be your choice.
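One way to follow up on the P.S. is to tokenize with \w+ instead of split(), so that punctuation no longer sticks to the words. The sketch below is only an illustration: the function name build_index and the choice to keep the index case sensitive are assumptions, not part of the answer above.

```python
import re
from collections import defaultdict


def build_index(strings):
    """Map each word (a run of word characters) to the strings containing it."""
    index = defaultdict(list)
    for item in strings:
        # set() avoids listing the same string twice when a word repeats in it
        for word in set(re.findall(r"\w+", item)):
            index[word].append(item)
    return index


strings = ['Hi, hello', 'hi mr 12345', 'welcome sir']
index = build_index(strings)

print(index.get('Hi'))     # the comma no longer sticks to 'Hi'
print(index.get('12345'))
print(index.get('123'))    # None: '123' is not a whole word
```

Building the index costs one pass over the list; after that, each single-word lookup is a dictionary access instead of a scan over all strings.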
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> search = lambda word: list(filter(lambda x: word in x.split(), l))
>>> search('123')
[]
>>> search('12345')
['hi mr 12345']
>>> search('hello')
['Hi, hello']
If you are searching for an exact word match:
for str in list:
    if set(s.split()) & set(str.split()):
        print(str)
Provided s only ever consists of just a few words, you could do
s = s.split()
n = len(s)
for x in my_list:
    words = x.split()
    if s in (words[i:i+n] for i in range(len(words) - n + 1)):
        print(x)
If s consists of many words, there are more efficient, but also much more complex, algorithms for this.
Use a regular expression here to match the exact word with the word boundary \b:
import re
.....
for str in list:
    if re.search(r'\b' + re.escape(wordToLook) + r'\b', str):
        print(str)
\b matches only at a word boundary, that is, between a word character and a non-word character (such as a space or punctuation), or at the start or end of the string.
Or do something like this to avoid typing the search word again and again:
import re
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
listOfWords = ['hello', 'Mr', '123']
reg = re.compile(r'(?i)\b(?:%s)\b' % '|'.join(listOfWords))
for str in list:
    if reg.search(str):
        print(str)
(?i) makes the search case-insensitive; remove it if you want a case-sensitive search.
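To check the word-boundary approach against the test cases from the question, here is a minimal sketch. The helper name find_exact is hypothetical, and re.escape is added on the assumption that the search term should be treated as literal text rather than as a pattern.

```python
import re


def find_exact(needle, strings):
    # \b on both sides ensures the needle is not part of a longer word
    pattern = re.compile(r"\b" + re.escape(needle) + r"\b")
    return [s for s in strings if pattern.search(s)]


strings = ['Hi, hello', 'hi mr 12345', 'welcome sir']
for needle in ["hello", "123", "12345", "come", "welcome", "welcome sir"]:
    print(needle, "->", find_exact(needle, strings))
```

Each call scans the whole list, so with 200K strings and many queries a precomputed index may be the faster option.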
Related
I would like to split a string into separate sentences in a list.
example:
string = "Hey! How are you today? I am fine."
output should be:
["Hey!", "How are you today?", "I am fine."]
You can use Python's built-in regular expression library, re.
import re
string = "Hey! How are you today? I am fine."
output = re.findall(".*?[.!\?]", string)
output>> ['Hey!', ' How are you today?', ' I am fine.']
Update:
You may use the split() method, but it will not return the characters used for splitting.
import re
string = "Hey! How are you today? I am fine."
output = re.split(r"[!?]", string)
output>> ['Hey', ' How are you today', ' I am fine.']
If this works for you, you can use replace() and split().
string = "Hey! How are you today? I am fine."
output = string.replace("!", "?").split("?")
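If the punctuation should stay attached and the leading spaces should go, as in the expected output of the question, one option is to split on the whitespace that follows a sentence-ending character. This is a sketch that assumes sentences end in '.', '!' or '?' followed by whitespace; abbreviations such as "e.g." would be split too.

```python
import re

string = "Hey! How are you today? I am fine."
# (?<=[.!?]) is a lookbehind: split on whitespace only when it follows . ! or ?
sentences = re.split(r"(?<=[.!?])\s+", string)
print(sentences)  # ['Hey!', 'How are you today?', 'I am fine.']
```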
You can try:
>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']
I found it here.
You can use the split() method:
import re
string = "Hey! How are you today? I am fine."
yourlist = re.split(r"[!?]", string)
You don't need regex for this. Just create your own generator:
def split_punc(text):
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # Alternatively, can use:
    # from string import punctuation
    j = 0
    for i, x in enumerate(text):
        if x in punctuation:
            yield text[j:i+1]
            j = i + 1
    # yield any trailing text that does not end in punctuation
    if j < len(text):
        yield text[j:]
Usage:
list(split_punc(string))
# ['Hey!', ' How are you today?', ' I am fine.']
I have a string like this:
sentence = 'This is   a nice   day'
I want to have the following output:
output = ['This is', 'a nice', 'day']
In this case, I split the string on n=3 or more whitespace characters, which is why it is split as shown above.
How can I efficiently do this for any n?
You may try using Python's regex split:
import re
sentence = 'This is   a nice day'
output = re.split(r'\s{3,}', sentence)
print(output)
['This is', 'a nice day']
To handle this for an actual variable n, we can try:
n = 3
pattern = r'\s{' + str(n) + ',}'
output = re.split(pattern, sentence)
print(output)
['This is', 'a nice day']
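The same pattern can be wrapped in a small function. The name split_on_gaps and the sample string (three spaces after "is", four after "nice") are assumptions made for illustration. Unlike str.split(' ' * n), the {n,} quantifier consumes the whole run of whitespace, so a longer gap does not leave stray spaces behind.

```python
import re


def split_on_gaps(sentence, n):
    # \s{n,} matches a run of n or more whitespace characters
    return re.split(r"\s{%d,}" % n, sentence.strip())


# Built by concatenation so the number of spaces is explicit
sentence = "This is" + " " * 3 + "a nice" + " " * 4 + "day"

print(split_on_gaps(sentence, 3))  # ['This is', 'a nice', 'day']
print(sentence.split(" " * 3))     # ['This is', 'a nice', ' day']
```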
You can use the basic .split() function:
sentence = 'This is   a nice day'
n = 3
sentence.split(' '*n)
>>> ['This is', 'a nice day']
You can also split by n spaces, strip the results and remove empty elements (if there are several such long spaces that would produce them):
sentence = 'This is   a nice day'
n = 3
parts = [part.strip() for part in sentence.split(' ' * n) if part.strip()]
x = 'This is   a nice day'
result = [i.strip() for i in x.split(' ' * 3)]
print(result)
['This is', 'a nice day']
What is the best way to split a string like
text = "hello there how are you"
in Python?
So I'd end up with an array like such:
['hello there', 'there how', 'how are', 'are you']
I have tried this:
liste = re.findall('((\S+\W*){'+str(2)+'})', text)
for a in liste:
    print(a[0])
But I'm getting:
hello there
how are
you
How can I make the findall function move only one token when searching?
Here's a solution with re.findall:
>>> import re
>>> text = "hello there how are you"
>>> re.findall(r"(?=(?:(?:^|\W)(\S+\W\S+)(?:$|\W)))", text)
['hello there', 'there how', 'how are', 'are you']
Have a look at the Python docs for re: https://docs.python.org/3/library/re.html
(?=...) Lookahead assertion
(?:...) Non-capturing regular parentheses
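Because a lookahead consumes no characters, the regex engine advances only one position after each match, which is what allows overlapping pairs. A shorter pattern built on the same idea is sketched below; it assumes words are runs of \w characters separated by whitespace.

```python
import re

text = "hello there how are you"
# The capture group sits inside the lookahead, so each match is zero-width
# and the next search can start inside the previous pair.
pairs = re.findall(r"(?=\b(\w+\s+\w+)\b)", text)
print(pairs)  # ['hello there', 'there how', 'how are', 'are you']
```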
If regex isn't required, you could do something like:
l = text.split(' ')
out = []
for i in range(len(l)):
    try:
        out.append(l[i] + ' ' + l[i+1])
    except IndexError:
        continue
Explanation:
First split the string on the space character. The result will be a list where each element is a word in the sentence. Instantiate an empty list to hold the result. Loop over the list of words, adding the two-word combinations separated by a space to the output list. This will throw an IndexError when accessing the element after the last word in the list; just catch it and continue, since you don't seem to want that lone word in your result anyway.
I don't think you actually need regex for this.
I understand you want a list, in which each element contains two words, the latter also being the former of the following element. We can do this easily like this:
string = "Hello there how are you"
liste = string.split(" ")
for i in range(len(liste)-1):
    liste[i] = liste[i] + " " + liste[i+1]
# we remove the last element, as otherwise we'd have an element with only one word
liste.pop(-1)
I don't know if it's mandatory for you to use regex, but I'd do it this way.
First, you can get the list of words with the str.split() method.
>>> sentence = "hello there how are you"
>>> splited_sentence = sentence.split(" ")
>>> splited_sentence
['hello', 'there', 'how', 'are', 'you']
Then, you can make pairs.
>>> output = []
>>> for i in range(1, len(splited_sentence)):
...     output += [splited_sentence[i-1] + ' ' + splited_sentence[i]]
...
>>> output
['hello there', 'there how', 'how are', 'are you']
An alternative is just to split, zip, then join like so...
sentence = "Hello there how are you"
words = sentence.split()
[' '.join(i) for i in zip(words, words[1:])]
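The zip approach generalizes to groups of any size. A minimal sketch, where the function name ngrams and the parameter n are assumptions for illustration:

```python
def ngrams(sentence, n=2):
    words = sentence.split()
    # zip over n copies of the word list, each shifted one word further
    shifted = (words[i:] for i in range(n))
    return [" ".join(group) for group in zip(*shifted)]


print(ngrams("hello there how are you"))
# ['hello there', 'there how', 'how are', 'are you']
print(ngrams("hello there how are you", n=3))
# ['hello there how', 'there how are', 'how are you']
```

Since zip stops at the shortest input, no special handling is needed for the last word.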
Another possible solution using findall.
>>> liste = list(map(''.join, re.findall(r'(\S+(?=(\s+\S+)))', text)))
>>> liste
['hello there', 'there how', 'how are', 'are you']
I am trying to erase specific words found in a list. Let's say that I have the following example:
a= ['you are here','you are there','where are you','what is that']
b = ['you','what is']
The desired output should be the following:
['are here', 'are there', 'where are', 'that']
I created the following code for that task:
import re
def _find_word_and_remove(w, strings):
    """
    w:(string)
    strings:(string)
    """
    temp = re.sub(r'\b({0})\b'.format(w), '', strings).strip()  # removes word from string
    return re.sub(r"\s{1,}", " ", temp)  # removes double spaces
def find_words_and_remove(words, strings):
    """
    words:(list)
    strings:(list)
    """
    if len(words) == 1:
        return [_find_word_and_remove(words[0], word_a) for word_a in strings]
    else:
        temp = [_find_word_and_remove(words[0], word_a) for word_a in strings]
        return find_words_and_remove(words[1:], temp)
find_words_and_remove(b,a)
>>> ['are here', 'are there', 'where are', 'that']
It seems that I am over-complicating things by using recursion for this task. Is there a simpler and more readable way to do it?
You can use list comprehension:
def find_words_and_remove(words, strings):
    return [" ".join(word for word in string.split() if word not in words) for string in strings]
That will work only when there are single words in b, but because of your edit and comment, I now know that you really do need _find_word_and_remove(). Your recursive approach isn't really too bad, but if you don't want recursion, do this:
def find_words_and_remove(words, strings):
    strings_copy = strings[:]
    for word in words:
        # apply each removal to the result of the previous one
        strings_copy = [_find_word_and_remove(word, string) for string in strings_copy]
    return strings_copy
The simple way is to use regex:
import re
a= ['you are here','you are there','where are you','what is that']
b = ['you','what is']
Here you go:
def find_words_and_remove(b,a):
    return [ re.sub("|".join(b), "", x).strip() if len(re.sub("|".join(b), "", x).strip().split(" ")) < len(x.split(' ')) else x for x in a ]
find_words_and_remove(b,a)
>> ['are here', 'are there', 'where are', 'that']
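Another option is to combine all entries of b into a single compiled pattern with word boundaries, so that each string is processed once. This is a sketch under assumptions: the function name remove_phrases is hypothetical, re.escape treats the entries as literal text, and longer phrases are tried first so that a phrase wins over a shorter entry it starts with.

```python
import re


def remove_phrases(phrases, strings):
    # Longest phrases first, so a full phrase is preferred over a shorter overlap
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(p) for p in ordered))
    # split() + join() collapses the whitespace left behind by the removals
    return [" ".join(pattern.sub(" ", s).split()) for s in strings]


a = ['you are here', 'you are there', 'where are you', 'what is that']
b = ['you', 'what is']
print(remove_phrases(b, a))
# ['are here', 'are there', 'where are', 'that']
```

The \b anchors keep a word such as "your" intact, which a plain alternation without boundaries would shorten to "r".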
I have a list as shown below:
exclude = ["please", "hi", "team"]
I have a string as follows:
text = "Hi team, please help me out."
I want my string to look as:
text = ", help me out."
effectively stripping out any word that might appear in the list exclude
I tried the below:
if any(e in text.lower()) for e in exclude:
print text.lower().strip(e)
But the above if statement returns a boolean value and hence I get the below error:
NameError: name 'e' is not defined
How do I get this done?
Something like this?
>>> from string import punctuation
>>> ' '.join(x for x in (word.strip(punctuation) for word in text.split())
...          if x.lower() not in exclude)
'help me out'
If you want to keep the trailing/leading punctuation with the words that are not present in exclude:
>>> ' '.join(word for word in text.split()
...          if word.strip(punctuation).lower() not in exclude)
'help me out.'
First one is equivalent to:
>>> out = []
>>> for word in text.split():
...     word = word.strip(punctuation)
...     if word.lower() not in exclude:
...         out.append(word)
...
>>> ' '.join(out)
'help me out'
You can use this (remember it is case sensitive):
for word in exclude:
    text = text.replace(word, "")
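Note that str.replace also removes matches inside longer words (for example the "hi" in "this"). A regex with word boundaries and re.IGNORECASE avoids both that and the case-sensitivity issue. The following is only a sketch; swallowing the whitespace after each removed word with \s* is a design choice assumed here in order to reproduce the output requested in the question.

```python
import re

exclude = ["please", "hi", "team"]
text = "Hi team, please help me out."

# \b...\b matches whole words only; \s* also removes the spaces that follow them
pattern = re.compile(
    r"\b(?:%s)\b\s*" % "|".join(re.escape(word) for word in exclude),
    re.IGNORECASE,
)
result = pattern.sub("", text)
print(result)  # , help me out.
```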
This replaces everything that is not alphanumeric, or that belongs to the stopwords list, with spaces, and then splits the result into the words you want to keep. Finally, the list is joined into a string with the words separated by single spaces. Note: this is case sensitive.
' '.join ( re.sub('\W|'+'|'.join(stopwords),' ',sentence).split() )
Example usage:
>>> import re
>>> stopwords=['please','hi','team']
>>> sentence='hi team, please help me out.'
>>> ' '.join ( re.sub('\W|'+'|'.join(stopwords),' ',sentence).split() )
'help me out'
Using simple methods:
import re
exclude = ["please", "hi", "team"]
text = "Hi team, please help me out."
l = []
te = re.findall(r"[\w]*", text)
for a in te:
    b = ''.join(a)
    # skip excluded words (case-insensitively) and the empty matches from [\w]*
    if b.upper() not in (name.upper() for name in exclude) and a:
        l.append(b)
print(" ".join(l))
Hope it helps
If you are not worried about punctuation:
>>> import re
>>> text = "Hi team, please help me out."
>>> text = re.findall("\w+",text)
>>> text
['Hi', 'team', 'please', 'help', 'me', 'out']
>>> " ".join(x for x in text if x.lower() not in exclude)
'help me out'
In the above code, re.findall finds all words and puts them in a list.
\w matches word characters such as A-Z, a-z, 0-9, and the underscore
+ means one or more occurrences
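A quick way to see what \w+ treats as a word is to run it on a string containing an underscore and a hyphen. The sample string is made up for illustration.

```python
import re

sample = "snake_case x-1 out."
# The underscore is a word character; the hyphen, space, and period are not
words = re.findall(r"\w+", sample)
print(words)  # ['snake_case', 'x', '1', 'out']
```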