What is an efficient way to match words in a string? - python

Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names) # returns False, 'James' alone is not in the list
is_name_in_text(text2, names) # returns 'James John'
is_name_in_text(text3, names) # returns 'Paul'
is_name_in_text() checks whether any name from the list appears in the text.
The easy way would be to check each name against the text with the in operator, but the list has 5,000 items, so it is not efficient. I could instead split the text into words and check whether each word is in the list, but that will not work when a name consists of more than one word; the text2 example above ('James John') would fail in that case.

Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'

Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
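A minimal sketch of how this combined pattern could back the is_name_in_text() function from the question (the wrapper itself is not part of the answer above, just one way to use the pattern):

import re

names = ['James John', 'Robert David', 'Paul']
names_re = re.compile(r'\b' + r'\b|\b'.join(re.escape(n) for n in names) + r'\b')

def is_name_in_text(text):
    # search() returns a match object, or None when no listed name occurs
    match = names_re.search(text)
    return match.group(0) if match else False

print(is_name_in_text('I saw James today'))       # False
print(is_name_in_text('I saw James John today'))  # James John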

You may use Python's set in order to get good performance while using the in operator.

If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
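As a rough illustration of the difference, here is a throwaway timing sketch with 5,000 placeholder names (the generated names and the numbers are only for illustration and will vary by machine):

import timeit

# 5,000 placeholder names standing in for the real list
names_list = ['Name %d Surname %d' % (i, i) for i in range(5000)]
names_set = set(names_list)

# membership test for a name that is not present (worst case for the list scan)
print(timeit.timeit("'James John' in names_list", globals=globals(), number=1000))
print(timeit.timeit("'James John' in names_set", globals=globals(), number=1000))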

Related

How to remove strings from a list which contain a sub-string?

I have a list of strings that's structured as follows:
['3M', 'Saint Paul, Minnesota', 'A. O. Smith', 'Milwaukee, Wisconsin', 'Abbott Laboratories',...]
I want to remove the strings corresponding to cities, which all contain a comma ,.
So far my code is:
for name in names:
    if '</a>' in name:
        names.remove(name)
    if re.search(', Inc.', name) != None:
        name = name.replace(',', "")
        names.append(name)
    if ',' in name:
        names.remove(name)
But I get an error ValueError: list.remove(x): x not in list at names.remove(name).
I can't seem to understand why the first block, which drops a string if it contains </a>, works fine, but the one checking for commas does not.
Going off my comment about how, in Python, we generally want to "retain things we like" rather than "purge things we don't like". This is preferable because we can avoid changing the size of a list as we're iterating over it, which is never a good idea. We achieve this by filtering the original list based on a predicate (is_desirable in this case). A predicate is a function/callable that accepts a single parameter and returns a boolean. When used in conjunction with filter, we can create an iterator that yields only those items that satisfy the condition of the predicate. We then consume the contents of that iterator to build up a new list:
names = [
    '3M',
    'Saint Paul, Minnesota',
    'A. O. Smith',
    'Milwaukee, Wisconsin',
    'Abbott Laboratories'
]

def is_desirable(name):
    return "</a>" not in name and "," not in name

print(list(filter(is_desirable, names)))
Output:
['3M', 'A. O. Smith', 'Abbott Laboratories']
However, this does not take into account the other operation you're performing: If the current name contains the substring ", Inc.", we still want to retain this name, but with the comma removed.
In this situation, I would define a generator that only yields the items we want to retain. If we come across the substring ", Inc.", we modify the current name and yield it. The generator's contents are then consumed to build up a new list:
def filtered_names(names):
    for name in names:
        if ", Inc." in name:
            name = name.replace(",", "")
        if "</a>" in name or "," in name:
            continue
        yield name
print(list(filtered_names(names)))
This is by no means the only way of doing this. This is my personal preference.
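To make the underlying hazard concrete, here is a toy example (not the asker's data) showing how removing items from a list while iterating over it skips elements:

items = ['keep', 'drop,1', 'drop,2', 'keep too']
for item in items:
    if ',' in item:
        items.remove(item)

# 'drop,2' survives: removing 'drop,1' shifted the list left,
# so the loop skipped the element that moved into its place
print(items)  # ['keep', 'drop,2', 'keep too']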
You can use a list comprehension to filter out the invalid elements (i.e. the ones containing a comma).
>>> l = ['3M', 'Saint Paul, Minnesota', 'A. O. Smith', 'Milwaukee, Wisconsin', 'Abbott Laboratories']
>>> result = [e for e in l if "," not in e]
>>> result
['3M', 'A. O. Smith', 'Abbott Laboratories']

Picking phrases containing specific words in python

I have a list with 10 names and a list with many phrases. I only want to select the phrases containing one of those names.
ArrayNames = [Mark, Alice, Paul]
ArrayPhrases = ["today is sunny", "Paul likes apples", "The cat is alive"]
In the example, is there any way to pick only the second phrase, given the fact that it contains Paul, using these two arrays?
This is what I tried:
def foo(x,y):
    tmp = []
    for phrase in x:
        if any(y) in phrase:
            tmp.append(phrase)
    print(tmp)
x is the array of phrases, y is the array of names.
This is the output:
if any(y) in phrase:
TypeError: coercing to Unicode: need string or buffer, bool found
I'm very unsure about the syntax I used concerning the any() construct. Any suggestions?
Your usage of any is incorrect, do the following:
ArrayNames = ['Mark', 'Alice', 'Paul']
ArrayPhrases = ["today is sunny", "Paul likes apples", "The cat is alive"]
result = []
for phrase in ArrayPhrases:
    if any(name in phrase for name in ArrayNames):
        result.append(phrase)
print(result)
Output
['Paul likes apples']
You are getting a TypeError because any returns a bool and you're trying to search for a bool inside a string (if any(y) in phrase:).
Note that any(y) itself runs without error because it uses the truthiness of each of the strings in y.
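A tiny reproduction of the failing expression makes this concrete (the exact wording of the TypeError differs between Python 2, quoted in the question, and Python 3, shown here, but the cause is the same):

>>> y = ['Mark', 'Alice', 'Paul']
>>> any(y)                        # non-empty strings are truthy, so this collapses to True
True
>>> True in 'Paul likes apples'   # which is what the original condition ends up evaluating
Traceback (most recent call last):
  ...
TypeError: 'in <string>' requires string as left operand, not bool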

splitting first and last name regex [duplicate]

This question already has answers here: Regular expression for first and last name (28 answers). Closed 3 years ago.
Hello, I have a string of full names.
string='Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
I would like to split it by first and last name to have an output like this
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
I tried using this code:
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', string))
that returns this result
['Christof', 'Koch', 'Jonathan', 'Harel', 'Moran', 'Cerf', 'Wolfgang', 'Einhaeuser']
I would like to have each full name as an item.
Any suggestions? Thanks
You can use a lookahead after any lowercase to see if it's followed by an immediate uppercase or end-of-line such as [a-zA-Z\s]+?[a-z](?=[A-Z]|$) (more specific) or even .+?[a-z](?=[A-Z]|$) (more broad).
import re
string = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
print(re.findall(r".+?[a-z](?=[A-Z]|$)", string))
# -> ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
Having provided this answer, definitely check out Falsehoods Programmers Believe About Names; depending on your data, it might be erroneous to assume that your format will be parseable using the lower->upper assumption.
For your list of strings in this format from the comments, just add a list comprehension. The regex I provided above happens to be robust to the middle initials without modification (but I have to emphasize that if your dataset is enormous, that might not hold).
import re
names = ['Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser', 'Zaïd HarchaouiCéline Levy-leduc', 'David A. ForsythDuan Tran', 'Arnold SmeuldersSennay GhebreabPieter Adriaans', 'Peter L. BartlettAmbuj Tewari', 'Javier R. MovellanPaul L. RuvoloIan Fasel', 'Deli ZhaoXiaoou Tang']
result = [re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]
for name in result:
    print(name)
Output:
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
['Zaïd Harchaoui', 'Céline Levy-leduc']
['David A. Forsyth', 'Duan Tran']
['Arnold Smeulders', 'Sennay Ghebreab', 'Pieter Adriaans']
['Peter L. Bartlett', 'Ambuj Tewari']
['Javier R. Movellan', 'Paul L. Ruvolo', 'Ian Fasel']
['Deli Zhao', 'Xiaoou Tang']
And if you want all of these names in one list, add
flattened = [x for y in result for x in y]
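The nested comprehension reads left to right like the equivalent nested loops; a throwaway example of the idiom (unrelated to the name data above):

nested = [['a', 'b'], ['c'], ['d', 'e']]
flattened = [x for y in nested for x in y]
print(flattened)  # ['a', 'b', 'c', 'd', 'e']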
It will most likely produce some false positives and false negatives, but it may be OK as a starting point:
[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*
Test
import re
expression = r"[A-Z][^A-Z]*\s+[A-Z][^A-Z]*"
string = """
Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser
"""
print(re.findall(expression, string))
Output
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com, where you can also watch how it would match against some sample inputs.

Efficient way to replace substring from list

Hi, I have a large document saved as a sentence, and a list of proper names that might appear in it.
I would like to replace occurrences of those names with the tag [PERSON].
ex: sentence = "John and Marie went to school today....."
list = ["Maria", "John"....]
result = [PERSON] and [PERSON] went to school today
As you can see, there might be variations of a name that I still want to catch, like Maria and Marie, since they are spelled differently but are close.
I know I can use a loop, but since the list and the sentence are large, there might be a more efficient way to do this. Thanks
Use fuzzywuzzy to check whether each word in the sentence closely matches one of the names (with a match score of 80 or above) and, if so, replace it with [PERSON]:
>>> from fuzzywuzzy import process, fuzz
>>> names = ["Maria", "John"]
>>> sentence = "John and Marie went to school today....."
>>>
>>> match = lambda word: process.extractOne(word, names, scorer=fuzz.ratio, score_cutoff=80)
>>> ' '.join('[PERSON]' if match(word) else word for word in sentence.split())
'[PERSON] and [PERSON] went to school today.....'
You can use regex patterns inside your input list to match spelling variations. For example, if you need to match both Marie and Maria, you can use Mari(e|a) as the pattern. Here is the corresponding code:
import re
mySentence = "John and Marie and Maria went to school today....."
myList = ["Mari(e|a)", "John"]
myNewSentence = re.compile("|".join(myList)).sub('[PERSON]', mySentence)
print(myNewSentence) # [PERSON] and [PERSON] and [PERSON] went to school today.....

Replace all the occurrences of specific words

Suppose that I have the following sentence:
bean likes to sell his beans
and I want to replace all occurrences of specific words with other words. For example, bean to robert and beans to cars.
I can't just use str.replace because in this case it'll change the beans to roberts.
>>> "bean likes to sell his beans".replace("bean","robert")
'robert likes to sell his roberts'
I need to change whole words only, not occurrences of the word inside another word. I think I can achieve this with regular expressions but don't know how to do it right.
If you use regex, you can specify word boundaries with \b:
import re
sentence = 'bean likes to sell his beans'
sentence = re.sub(r'\bbean\b', 'robert', sentence)
# 'robert likes to sell his beans'
Here 'beans' is not changed (to 'roberts') because the 's' on the end is not a boundary between words: \b matches the empty string, but only at the beginning or end of a word.
The second replacement for completeness:
sentence = re.sub(r'\bbeans\b', 'cars', sentence)
# 'robert likes to sell his cars'
If you replace each word one at a time, you might replace words several times (and not get what you want). To avoid this, you can use a function or lambda:
d = {'bean':'robert', 'beans':'cars'}
str_in = 'bean likes to sell his beans'
str_out = re.sub(r'\b(\w+)\b', lambda m:d.get(m.group(1), m.group(1)), str_in)
That way, once bean is replaced by robert, it won't be modified again (even if robert is also in your input list of words).
As suggested by georg, I edited this answer with dict.get(key, default_value).
Alternative solution (also suggested by georg):
str_out = re.sub(r'\b(%s)\b' % '|'.join(d.keys()), lambda m:d.get(m.group(1), m.group(1)), str_in)
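Either variant produces the same result on the example sentence; a quick check, assuming the d and str_in defined above:

>>> str_out
'robert likes to sell his cars'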
This is a dirty way to do it, using folds (you need import re, and in Python 3 also from functools import reduce):
reduce(lambda x, y: re.sub(r'\b(' + y[0] + r')\b', y[1], x),
       [("bean", "robert"), ("beans", "cars")],
       "bean likes to sell his beans")
"bean likes to sell his beans".replace("beans", "cars").replace("bean", "robert")
This will replace all instances of "beans" with "cars" and "bean" with "robert". It works because .replace() returns a modified copy of the original string, so you can think of it in stages. It essentially works this way:
>>> first_string = "bean likes to sell his beans"
>>> second_string = first_string.replace("beans", "cars")
>>> third_string = second_string.replace("bean", "robert")
>>> (first_string, second_string, third_string)
('bean likes to sell his beans', 'bean likes to sell his cars', 'robert likes to sell his cars')
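The order of the chained calls is what makes this work: "beans" must be replaced before "bean", otherwise you reintroduce the very problem described in the question. A quick check of the reversed order:

>>> "bean likes to sell his beans".replace("bean", "robert").replace("beans", "cars")
'robert likes to sell his roberts'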
