Search strings using regular expression in Python

When I try to use a regular expression to find strings within other strings, it does not work as expected. Here is an example:
import re
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']
regex = re.compile("|".join(keywords))
regex.findall(message.lower())
Result:
['beer', 'beer', 'german beer']
But the expected result would be:
['beer', 'beer', 'german beer', 'german']
Another way to do that could be:
results = []
for k in keywords:
    regex = re.compile(k)
    for r in regex.findall(message.lower()):
        results.append(r)
['beer', 'beer', 'beer', 'german beer', 'german']
It works the way I want, but I don't think it is the best way to do it. Can somebody help me?

re.findall cannot find overlapping matches. If you want to use regular expressions you will have to create separate expressions and run them in a loop as in your second example.
Note that your second example can also be shortened to the following, though it's a matter of taste whether you find this more readable:
results = [r for k in keywords for r in re.findall(k, message.lower())]
Your specific example doesn't require the use of regular expressions. You should avoid using regular expressions if you just want to find fixed strings.
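For example, a minimal sketch using only str.count (reusing message and keywords from the question):
lowered = message.lower()
results = [k for k in keywords for _ in range(lowered.count(k))]
print(results)  # ['beer', 'beer', 'beer', 'german beer', 'german']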

re.findall is described in http://docs.python.org/2/library/re.html
"Return all non-overlapping matches of pattern in string..."
Non-overlapping means that for "german beer" it will not find "german beer" AND "german", because those matches are overlapping.

My cleaner (to me) version of your last solution:
results = []
for key in keywords:
    results.extend(re.findall(key, message, re.IGNORECASE))
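Note that with re.IGNORECASE the matches keep their original casing from message, so for the example this yields ['beer', 'beer', 'beer', 'German beer', 'German'].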

Related

Multiline text splitting

Sooo, I have this problem in which I have to create a list of lists that contains, for each line, every word with a length greater than 4. The challenge is to solve this with a one-liner.
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
So far I managed this: res = [i for ele in text.splitlines() for i in ele.split(' ') if len(i) > 4], but it returns ['candle', 'burns', 'ends;', 'night;', 'foes,', 'friends—', 'gives', 'lovely', 'light!'] instead of [['candle', 'burns', 'ends;'], ['night;'], ['foes,', 'friends—'], ['gives', 'lovely', 'light!']]
Any ideas? :D
So in this case I would use a regular expression to find your results.
By doing a list comprehension over the lines, as you did, combined with a regular expression, each line's matches automatically end up in their own new list.
This particular search pattern looks for runs of 4 or more word characters (letters of either case, digits, or underscore).
import re
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
results = [re.findall(r'\w{4,}', line) for line in text.split('\n')]
print(results)
Output:
[['candle', 'burns', 'both', 'ends'], ['will', 'last', 'night'], ['foes', 'friends'], ['gives', 'lovely', 'light']]
If you wish to keep the special characters you might want to look into expanding the regular expression so it includes all characters except whitespace.
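For instance, a minimal sketch along those lines (my variant, not the answer's code), where \S matches any character except whitespace:
import re
results = [re.findall(r'\S{4,}', line) for line in text.splitlines()]
# [['candle', 'burns', 'both', 'ends;'], ['will', 'last', 'night;'], ['foes,', 'friends—'], ['gives', 'lovely', 'light!']]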
There are great tools to play around with if you look for "online regular expression tools" so you get some more feedback when trying to build your own patterns.
IIUC, this oneliner should work for you (without the use of additional packages):
[[w.strip(';,!—') for w in l.split() if len(w)>=4] for l in text.split('\n')]
Output:
[['candle', 'burns', 'both', 'ends'],
['will', 'last', 'night'],
['foes', 'friends'],
['gives', 'lovely', 'light']]

Python matching various keyword from dictionary issues

I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that contain a space are not recognized and therefore not categorized.
I was not able to make the pattern case-insensitive: words like MEDICINE are not recognized. I tried to add (?i) to the pattern, but it doesn't work.
The categorized keywords go into a pandas df, but they are printed inside []. I tried looping over the script again to take them out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern, soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
Website Sector Sub-sector Therapeutical Area Focus URL status
0 url3.com [med tech] [] [] [] []
1 www.url1.com [med tech, services] [] [oncology, gastroenterology] [] []
2 www.url2.com [med tech, services] [] [orthopedy] [] []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints about the problems that can readily be spotted:
Why don't keywords like "Drug Delivery" that contain a space match? Because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you want it to also match a space. However, if you want to support other kinds of whitespace (e.g. \t, \n), you need to change this regex pattern further.
Why isn't the matching case-insensitive? Your code fragment any(x in re.findall(pattern, text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest you convert them all to the same case before checking, for example all to lower case before matching: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here we added .lower() to both text and x.
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need to use regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
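A small refinement of this approach (a sketch of mine, not part of the original answer): lower-case the text and the keyword lists once up front, so the work isn't repeated on every membership check:
text_lower = text.lower()
sector_lower = {cat: [kw.lower() for kw in kws] for cat, kws in sector.items()}
[cat for cat in sector_lower if any(kw in text_lower for kw in sector_lower[cat])]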
Can you try a different approach other than regex?
I would suggest difflib when you have two similar matching words.
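For example, a minimal difflib sketch (my illustration of that idea; the keyword list and cutoff here are assumptions, not from the question):
import difflib

keywords = ['drug delivery', '3d printing', 'medicine']
words = 'Making Bio Implants and 3D Printing in Medicine'.lower().split()
for w in words:
    # get_close_matches returns the keywords whose similarity ratio exceeds cutoff
    close = difflib.get_close_matches(w, keywords, n=1, cutoff=0.8)
    if close:
        print(w, '->', close[0])
# printing -> 3d printing
# medicine -> medicine
Note that this fuzzy approach compares single words, so a two-word keyword like 'drug delivery' still needs separate handling.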
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
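One caveat worth adding (my note, not from the original answer): if a keyword could contain regex metacharacters, escape it first so it is matched literally:
[cat for cat in sector if any(re.search(re.escape(word), text, re.I) for word in sector[cat])]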

Find Pattern in Textfile From Several Elements In Several Lists?

I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']
noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b'
with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print w
    else:
        pass
So at this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brains out, trying all sorts of for loops and if statements in functions, to try to find a way to replicate the regex pattern using the lists.
The limitation with the regex is that the \b\w+\b construct, which appears a number of times in noun_list_pattern1, finds any word at all, not specific nouns. This could raise false positives. I want to narrow things down more by using the elements in the lists above instead of the regex.
Since I actually have 4 different regexes in the pattern (separated by |), I will just go with one of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'
Obviously, the above quoted line is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; ', ' is a literal match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no other. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
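For the example lists (assuming nouns and conjunctions hold the question's noun_list and CC_list), patt expands to:
(\b(bacon|cheese|eggs|milk|list|dog)\b,? )+\b(and|or)\b \b(bacon|cheese|eggs|milk|list|dog)\b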
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
# This assumes the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print match
The reason the match number is 4 is that each of the example lines contains exactly three nouns and one conjunction. (Note that repeated nouns or conjunctions could also produce this count.)
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    # a line is only kept if none of its matched words occurs more than once
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)
for line, matches in zip(matching_lines, words_matched):
    print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
import itertools

# The number of permutation choices is 3 (as revealed by your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = [noun for noun in nouns]
    matches.append(conj)
    # matches[:2] is the sublist containing the first 2 items, matches[-1] is the last
    # element, and matches[2:-1] is the element before the last (if there were more
    # than 3 nouns, this would be the elements between the 2nd and the last).
    regex_string = '\s,\s'.join(matches[:2]) + '\s' + matches[-1] + '\s' + '\s,\s'.join(matches[2:-1])
    print regex_string
    # ... do regex related matching here
The caveat of this method is that it is pure brute force: it generates all the possible combinations (read: permutations) of both lists, each of which can then be tested against every line. Hence it is horrendously slow, but for examples like the ones given (no comma before the conjunction), it will generate exact matches perfectly.
Adapt as required.

What's a more efficient way of looping with regex?

I have a list of names which I'm using to pull matching strings out of a target list. For example:
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
output = ['Chris Smith', 'Kim', 'CHRIS']
So the rules so far are:
Case insensitive
Cannot match partial word (i.e. Christmas/hijacked shouldn't match Chris/Jack)
Other words in string are okay as long as name is found in the string per the above criteria.
To accomplish this, another SO user suggested this code in this thread:
[targ for targ in target_list if any(re.search(r'\b{}\b'.format(name), targ, re.I) for name in first_names)]
This works very accurately so far, but very slowly, given that the names list is ~5,000 entries long and the target list ranges from 20-100 lines, with some strings up to 30 characters long.
Any suggestions on how to improve performance here?
SOLUTION: Both of the regex-based solutions suffered from OverflowErrors, so unfortunately I could not test them. The solution that worked (from @mgilson's answer) was:
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
This provided a tremendous increase in performance from 15 seconds to under 1 second.
Seems like you could combine them all into 1 super regex:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex_string = '|'.join(r"(?:\b"+re.escape(x)+r"\b)" for x in names)
print regex_string
regex = re.compile(regex_string,re.I)
print [t for t in target if regex.search(t)]
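For the example data, the two print statements should produce:
(?:\bChris\b)|(?:\bJack\b)|(?:\bKim\b)
['Chris Smith', 'Kim', 'CHRIS']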
A non-regex solution which will only work if the names are a single word (no whitespace):
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
the any expression could also be written as:
any(x in new_names for x in t.lower().split())
or
any(x.lower() in new_names for x in t.split())
or, another variant which relies on set.intersection (suggested by #DSM below):
[ t for t in target if new_names.intersection(t.lower().split()) ]
You can profile to see which performs best if performance is really critical, otherwise choose the one that you find to be easiest to read/understand.
*If you're using Python 2.x, you'll probably want to use itertools.imap instead of map if you go that route above, to get it to evaluate lazily -- it also makes me wonder whether Python provides a lazy str.split which would have performance on par with the non-lazy version ...
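For instance, a sketch of that Python 2 variant (reusing names and target from the question):
from itertools import imap  # Python 2 only; in Python 3, map is already lazy

new_names = set(name.lower() for name in names)
print [t for t in target if any(imap(new_names.__contains__, t.lower().split()))]
# ['Chris Smith', 'Kim', 'CHRIS']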
This one is the simplest one I can think of:
[item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
all together:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
results = [item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
print results
>>>
['Chris Smith', 'Kim']
And to make it more efficient, you can compile the regex first:
regex = re.compile( r'\b(%s)\b' % '|'.join(names) )
[item for item in target if regex.search(item)]
Edit
After considering the question and looking at some comments, I have revised the 'solution' to the following:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex = re.compile( r'\b((%s))\b' % ')|('.join([re.escape(name) for name in names]), re.I )
results = [item for item in target if regex.search(item)]
results:
>>>
['Chris Smith', 'Kim', 'CHRIS']
You're currently doing one loop inside another, iterating over two lists. That's always going to give you quadratic performance.
One local optimisation is to compile each name regex (which will make applying each regex faster). However, the big win is going to be to combine all of your regexes into one regex which you apply to each item in your input. See @mgilson's answer for how to do that. After that, your code performance should scale linearly as O(M+N), rather than O(M*N).
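As a sketch of that local optimisation (my illustration, reusing names and target from the question):
import re

# Compile each name's pattern once, up front, instead of on every search.
compiled = [re.compile(r'\b%s\b' % re.escape(name), re.I) for name in names]
output = [t for t in target if any(p.search(t) for p in compiled)]
This still loops over every name for every target string, though; the combined single regex avoids that inner loop entirely.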

Parse out elements from a pattern

I am trying to parse the result output from a natural language parser (Stanford parser).
Some of the results are as below:
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
The results I am trying to get are:
['dep', 'Company', 'rent']
['conj_or', 'rent', 'share']
['amod', 'information', 'personal']
...
['amod', 'companies', 'non-affiliated']
...
['aux', 'requested', "'ve"]
First I tried to directly get these elements out, but failed.
Then I realized regex should be the right way forward.
However, I am totally unfamiliar with regex. With some exploration, I got:
m = re.search('(?<=())\w+', line)
m2 = re.search('(?<=-)\d', line)
and then got stuck.
The first one can correctly get the first elements, e.g. 'dep', 'amod', 'conj_or', but I actually have not totally figured out why it is working...
The second line is trying to get the second elements, e.g. 'Company', 'rent', 'information', but I can only get the number after the word. I cannot figure out how to look ahead rather than look behind...
BTW, I also cannot figure out how to deal with exceptions such as 'non-affiliated' and "'ve".
Could anyone give some hints or help? Highly appreciated.
It is difficult to give an optimal answer without knowing the full range of possible outputs, however, here's a possible solution:
>>> [re.findall(r'[A-Za-z_\'-]+[^-\d\(\)\']', line) for line in s.split('\n')]
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
It works by finding all the groups of contiguous letters ([A-Za-z] represents the range between capital A and Z and lowercase a and z) or the characters "_", "'" and "-" on the same line.
Furthermore, it enforces the rule that the matched string must not end with any of a given list of characters ([^...] is the syntax for "must not be any of these characters"; replace "..." with the list of characters).
The character \ escapes characters like "(" or ")" that would otherwise be parsed by the regex engine as instructions.
Finally, s is the example string you gave in the question...
HTH!
Here is something you're looking for:
([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)
The parentheses around [\w-]* are for grouping, so that you can access the captured data as:
ex = r'([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)'
m = re.match(ex, line)
print(m.group(1), m.group(2), m.group(3))  # group(0) would be the whole match
Btw, I recommend using the "Kodos" program, written in Python+PyQt, to learn and test regular expressions. It's my favourite tool for testing regexes.
If the results from the parser are as regular as suggested, regexes may not be necessary:
from pprint import pprint
source = """
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
"""
items = []
for line in source.splitlines():
    head, sep, tail = line.partition('(')
    if head:
        item = [head]
        head, sep, tail = tail.strip('()').partition(', ')
        # rpartition('-') splits on the last hyphen, so hyphenated words
        # like 'non-affiliated' survive intact
        item.append(head.rpartition('-')[0])
        item.append(tail.rpartition('-')[0])
        items.append(item)
pprint(items)
Output:
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
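For completeness, a single-regex sketch (my variant, not from either answer) that also tolerates the stray apostrophes in tokens like rent-5' and 've-28, reusing source from the answer above:
import re

# relation(word-number, word-number) -- each number may carry a trailing apostrophe
pat = re.compile(r"([\w-]+)\((.+)-\d+'?, (.+)-\d+'?\)")
results = [list(pat.match(line).groups()) for line in source.splitlines() if line]
# same output as above, e.g. ['dep', 'Company', 'rent'] ... ['aux', 'requested', "'ve"]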
