What's a more efficient way of looping with regex? - python

I have a list of names which I'm using to pull matching strings out of a target list. For example:
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
output = ['Chris Smith', 'Kim', 'CHRIS']
So the rules so far are:
Case insensitive
Cannot match a partial word (i.e. Christmas/hijacked shouldn't match Chris/Jack)
Other words in the string are fine, as long as a name is found per the above criteria.
To accomplish this, another SO user suggested this code in this thread:
[targ for targ in target if any(re.search(r'\b{}\b'.format(name), targ, re.I) for name in names)]
This works very accurately so far, but very slowly, given that the names list is ~5,000 entries long and the target list ranges from 20-100 lines, with some strings up to 30 characters long.
Any suggestions on how to improve performance here?
SOLUTION: Both of the regex-based solutions suffered from OverflowErrors, so unfortunately I could not test them. The solution that worked (from @mgilson's answer) was:
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
This provided a tremendous increase in performance from 15 seconds to under 1 second.

Seems like you could combine them all into one super regex:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex_string = '|'.join(r"(?:\b"+re.escape(x)+r"\b)" for x in names)
print regex_string
regex = re.compile(regex_string,re.I)
print [t for t in target if regex.search(t)]
A non-regex solution which will only work if the names are a single word (no whitespace):
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
The any expression could also be written as:
any(x in new_names for x in t.lower().split())
or
any(x.lower() in new_names for x in t.split())
or, another variant which relies on set.intersection (suggested by @DSM below):
[ t for t in target if new_names.intersection(t.lower().split()) ]
You can profile to see which performs best if performance is really critical; otherwise, choose the one that you find easiest to read and understand.
If you're using Python 2.x, you'll probably want to use itertools.imap instead of map (if you go that route above) so that it evaluates lazily. It also makes me wonder whether Python provides a lazy str.split that would have performance on par with the non-lazy version ...
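For example, a rough timing sketch (my code, assuming Python 3; the absolute numbers will vary with the real ~5,000-name list) comparing the combined-regex approach with the set-based one:
import re
import timeit

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']

# build the combined regex once, outside the loop
regex = re.compile('|'.join(r'\b' + re.escape(n) + r'\b' for n in names), re.I)
new_names = set(n.lower() for n in names)

def with_regex():
    return [t for t in target if regex.search(t)]

def with_set():
    return [t for t in target if any(w in new_names for w in t.lower().split())]

print(timeit.timeit(with_regex, number=10000))
print(timeit.timeit(with_set, number=10000))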

This is the simplest one I can think of:
[item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
All together:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
results = [item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
print results
>>>
['Chris Smith', 'Kim']
And to make it more efficient, you can compile the regex first.
regex = re.compile( r'\b(%s)\b' % '|'.join(names) )
[item for item in target if regex.search(item)]
Edit
After considering the question and looking at some comments, I have revised the 'solution' to the following:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex = re.compile(r'\b(%s)\b' % '|'.join(re.escape(name) for name in names), re.I)
results = [item for item in target if regex.search(item)]
results:
>>>
['Chris Smith', 'Kim', 'CHRIS']

You're currently doing one loop inside another, iterating over two lists. That's always going to give you quadratic performance.
One local optimisation is to compile each name regex (which will make applying each regex faster). However, the big win is going to be combining all of your regexes into one regex which you apply to each item in your input. See @mgilson's answer for how to do that. After that, your code's performance should scale as O(M+N) rather than O(M*N).


splitting first and last name regex [duplicate]

Hello, I have a string of full names.
string='Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
I would like to split it by first and last name to have an output like this:
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
I tried using this code:
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', string)).split()
which returns this result:
['Christof', 'Koch', 'Jonathan', 'Harel', 'Moran', 'Cerf', 'Wolfgang', 'Einhaeuser']
I would like to have each full name as an item.
Any suggestions? Thanks
You can use a lookahead after any lowercase letter to check whether it's immediately followed by an uppercase letter or the end of the line, such as [a-zA-Z\s]+?[a-z](?=[A-Z]|$) (more specific) or even .+?[a-z](?=[A-Z]|$) (broader).
import re
string = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
print(re.findall(r".+?[a-z](?=[A-Z]|$)", string))
# -> ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
Having provided this answer, definitely check out Falsehoods Programmers Believe About Names; depending on your data, it might be erroneous to assume that your format will be parseable using the lower->upper assumption.
For your list of strings in this format from the comments, just add a list comprehension. The regex I provided above happens to be robust to the middle initials without modification (but I have to emphasize that if your dataset is enormous, that might not hold).
import re
names = ['Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser', 'Zaïd HarchaouiCéline Levy-leduc', 'David A. ForsythDuan Tran', 'Arnold SmeuldersSennay GhebreabPieter Adriaans', 'Peter L. BartlettAmbuj Tewari', 'Javier R. MovellanPaul L. RuvoloIan Fasel', 'Deli ZhaoXiaoou Tang']
result = [re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]
for name in result:
    print(name)
Output:
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
['Zaïd Harchaoui', 'Céline Levy-leduc']
['David A. Forsyth', 'Duan Tran']
['Arnold Smeulders', 'Sennay Ghebreab', 'Pieter Adriaans']
['Peter L. Bartlett', 'Ambuj Tewari']
['Javier R. Movellan', 'Paul L. Ruvolo', 'Ian Fasel']
['Deli Zhao', 'Xiaoou Tang']
And if you want all of these names in one list, add
flattened = [x for y in result for x in y]
It'll most likely produce some false positives and false negatives, yet it may be OK to start with:
[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*
Test
import re
expression = r"[A-Z][^A-Z]*\s+[A-Z][^A-Z]*"
string = """
Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser
"""
print(re.findall(expression, string))
Output
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
If you wish to explore, simplify, or modify the expression, it's explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.

Search strings using regular expression in Python

When I try to use a regular expression to find strings inside other strings, it does not work as expected. Here is an example:
import re
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']
regex = re.compile("|".join(keywords))
regex.findall(message.lower())
Result:
['beer', 'beer', 'german beer']
But the expected result would be:
['beer', 'beer', 'german beer', 'german']
Another way to do that could be:
results = []
for k in keywords:
    regex = re.compile(k)
    for r in regex.findall(message.lower()):
        results.append(r)
['beer', 'beer', 'beer', 'german beer', 'german']
It works the way I want, but I don't think it is the best way to do it. Can somebody help me?
re.findall cannot find overlapping matches. If you want to use regular expressions, you will have to create separate expressions and run them in a loop, as in your second example.
Note that your second example can also be shortened to the following, though it's a matter of taste whether you find this more readable:
results = [r for k in keywords for r in re.findall(k, message.lower())]
Your specific example doesn't require the use of regular expressions. You should avoid using regular expressions if you just want to find fixed strings.
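For instance, a minimal non-regex sketch (my variable names) that reproduces your loop's output with plain substring counting via str.count:
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

lowered = message.lower()
results = []
for k in keywords:
    # count the non-overlapping occurrences of the fixed string
    results.extend([k] * lowered.count(k))
print(results)
# -> ['beer', 'beer', 'beer', 'german beer', 'german']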
re.findall is described in http://docs.python.org/2/library/re.html
"Return all non-overlapping matches of pattern in string..."
Non-overlapping means that for "german beer" it will not find "german beer" AND "german", because those matches are overlapping.
My cleaner (to me) version of your last solution:
results = []
for key in keywords:
    results.extend(re.findall(key, message, re.IGNORECASE))

Python Disambiguation

I am currently building a MUD (Multi-User Dungeon) for an RPG, doing it entirely in Python to both make a game I enjoy and learn Python. I am running into a problem, and due to the extreme specificity of the question, I've been unable to find the right answer.
So, here's what I need, in a nutshell. I don't have a good snippet of code that fully shows what I need, as I'd have to paste about 50 lines to make the 5 lines I'm using make sense.
targetOptions = ['Joe', 'Bob', 'zombie', 'Susan', 'kobold', 'Bill']
A cmd in our game is attack, where we type 'a zombie' and we then proceed to kill the zombie. However, I want to just type 'a z'. We've tried a few different things in our code, but they're all unstable and often just wrong. One of our attempts returned something like ['sword', 'talisman'] as matches for 'get sword'. So, is there a way to search this list and have it return a matched value?
I also need to just return value[0] if there are, say, 2 zombies in the room and I type 'a z'. Thanks for all your help ahead of time, and I hope I was clear enough about what I'm looking for. Please let me know if more info is needed. And don't worry about the whole attacking thing; I just need to send 'zo' and get 'zombie' or something similar. Thanks!
Welcome to SO and Python! I suggest you take a look at the official Python documentation and spend some time looking around what's included in the Python Standard Library.
The difflib module contains a function, get_close_matches(), that can help you with approximate string comparisons. Here's how it looks:
from difflib import get_close_matches
def get_target_match(target, targets):
    '''
    Approximates a match for a target from a sequence of targets,
    if a match exists.
    '''
    source, targets = targets, map(str.lower, targets)
    target = target.lower()
    matches = get_close_matches(target, targets, n=1, cutoff=0.25)
    if matches:
        match = matches[0]
        return source[targets.index(match)]
    else:
        return None
target = 'Z'
targets = ['Joe', 'Bob', 'zombie', 'Susan', 'kobold', 'Bill']
match = get_target_match(target, targets)
print "Going nom on %s" % match # IT'S A ZOMBIE!!!
>>> filter(lambda x: x.startswith("z"), ['Joe', 'Bob', 'zombie', 'Susan', 'kobold', 'Bill'])
['zombie']
>>> cmd = "a zom"
>>> cmd.split()
['a', 'zom']
>>> cmd.split()[1]
'zom'
>>> filter(lambda x: x.startswith(cmd.split()[1]), ['Joe', 'Bob', 'zombie', 'Susan', 'kobold', 'Bill'])
['zombie']
Does that help?
filter filters a list (the 2nd arg) for things that the 1st arg accepts. cmd is your command, and cmd.split()[1] gets the part after the space. lambda x: x.startswith(cmd.split()[1]) is a function (a lambda expression) that asks "does x start with the part of the command after the space?"
For another test: if cmd is "a B", then there are two matches:
>>> cmd = "a B"
>>> filter(lambda x: x.startswith(cmd.split()[1]), ['Joe', 'Bob', 'zombie', 'Susan', 'kobold', 'Bill'])
['Bob', 'Bill']
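Building on that, here's a minimal sketch of a helper (the find_target name is hypothetical, not from the question) that does the prefix match case-insensitively and returns the first hit, per the value[0] requirement:
def find_target(prefix, options):
    """Return the first option whose name starts with the given
    prefix (case-insensitive), or None if nothing matches."""
    prefix = prefix.lower()
    matches = [opt for opt in options if opt.lower().startswith(prefix)]
    return matches[0] if matches else None

target_options = ['Joe', 'Bob', 'zombie', 'Susan', 'kobold', 'Bill']
print(find_target('z', target_options))  # -> zombie
print(find_target('B', target_options))  # -> Bob (first of Bob, Bill)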

Parse out elements from a pattern

I am trying to parse the result output from a natural language parser (Stanford parser).
Some of the results are as below:
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
The results I am trying to get are:
['dep', 'Company', 'rent']
['conj_or', 'rent', 'share']
['amod', 'information', 'personal']
...
['amod', 'companies', 'non-affiliated']
...
['aux', 'requested', "'ve"]
First I tried to directly get these elements out, but failed.
Then I realized regex should be the right way forward.
However, I am totally unfamiliar with regex. With some exploration, I got:
m = re.search('(?<=())\w+', line)
m2 =re.search('(?<=-)\d', line)
and stuck.
The first one can correctly get the first elements, e.g. 'dep', 'amod', 'conj_or', but I actually haven't fully figured out why it works...
The second line is trying to get the second elements, e.g. 'Company', 'rent', 'information', but I can only get the number after the word. I cannot figure out how to match what comes before the hyphen rather than looking behind it...
BTW, I also cannot figure out how to deal with exceptions such as 'non-affiliated' and "'ve".
Could anyone give some hints or help? Highly appreciated.
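On the "look before" point: you don't actually need a lookbehind to get the word before the number; a lookahead works, because it asserts what follows without consuming it. A minimal sketch of just that idea (my example, not a full solution):
import re

line = "amod(companies-20, non-affiliated-19)"
# [\w'-]+ grabs word characters plus apostrophes and hyphens; the
# (?=-\d) lookahead then requires "-<digit>" to follow, so backtracking
# trims each match to the bare word before its index.
print(re.findall(r"[\w'-]+(?=-\d)", line))
# -> ['companies', 'non-affiliated']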
It is difficult to give an optimal answer without knowing the full range of possible outputs; however, here's a possible solution:
>>> [re.findall(r'[A-Za-z_\'-]+[^-\d\(\)\']', line) for line in s.split('\n')]
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
It works by finding all groups of contiguous letters ([A-Za-z] represents the ranges between capital A and Z and lowercase a and z) plus the characters "_", "'" and "-" on the same line.
Furthermore, it enforces the rule that the matched string's last character must not be one of a given list of characters ([^...] is the syntax for "must not be any of the characters ...").
The character \ escapes characters like "(" or ")" that would otherwise be parsed by the regex engine as instructions.
Finally, s is the example string you gave in the question...
HTH!
Here is something you're looking for:
([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)
The parentheses around [\w-]* are for grouping, so that you can access the data as:
ex = r'([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)'
m = re.match(ex, line)
print(m.group(1), m.group(2), m.group(3))
Btw, I recommend using the "Kodos" program, written in Python+PyQt, to learn and test regular expressions. It's my favourite tool for testing regexes.
If the results from the parser are as regular as suggested, regexes may not be necessary:
from pprint import pprint
source = """
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
"""
items = []
for line in source.splitlines():
    head, sep, tail = line.partition('(')
    if head:
        item = [head]
        head, sep, tail = tail.strip('()').partition(', ')
        item.append(head.rpartition('-')[0])
        item.append(tail.rpartition('-')[0])
        items.append(item)
pprint(items)
Output:
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]

What is efficient way to match words in string?

Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names) # returns False: 'James' is not in the list
is_name_in_text(text2, names) # returns 'James John'
is_name_in_text(text3, names) # returns 'Paul'
is_name_in_text() checks whether any name from the list occurs in the text.
The easy way is to just check each name against the text with the in operator, but the list has 5,000 items, so it is not efficient. I could split the text into words and check whether each word is in the list, but that is not going to work when a matching name is more than one word. Line 7 above will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
        for possible_name in set(findnames.findall(text)):
            if possible_name in names:
                return possible_name
        return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print names_re.search('I saw James today')
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
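As a rough illustration of why (a sketch with a hypothetical 5,000-name list, sized like the question's):
import timeit

names_list = ['name%d' % i for i in range(5000)]
names_set = set(names_list)

# worst case for the list: the sought item is at the end, so the whole
# list is scanned; the set does a constant-time hash lookup either way
print(timeit.timeit(lambda: 'name4999' in names_list, number=1000))
print(timeit.timeit(lambda: 'name4999' in names_set, number=1000))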
