Phrase matching using regex and Python


I have some short phrases that I want to match on. I used a regex as follows:
(^|)(piston|piston ring)( |$)
Using the above, regex.match("piston ring") matches on "piston". If I change the regex so that the longer phrase "piston ring" comes first, then it works as expected.
I was surprised by this behavior as I was assuming that the greedy nature of regex would try to match the longest string "for free."
What am I missing? Can somebody explain this? Thanks!

When using alternation (|) in regular expressions, each option is attempted in order from left to right until a match can be found. So in your example since a match can be made with piston, piston ring will never be attempted.
A better way to write this regex would be something like this:
(^|)(piston( ring)?)( |$)
This will attempt to match 'piston', and then immediately attempt to match ' ring', with the ? making it optional. Alternatively just make sure your longer options occur at the beginning of the alternation.
You may also want to consider using a word boundary, \b, instead of (^|) and ( |$).
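The effect of ordering can be checked directly. This is a minimal sketch, assuming the leading group in the question was meant to be `(^| )` (start of string or a space); the variable names are only illustrative.

```python
import re

text = "piston ring"

# Alternatives are tried left to right, so the shorter phrase wins here.
short_first = re.compile(r"(^| )(piston|piston ring)( |$)")
# Listing the longer phrase first lets it match in full.
long_first = re.compile(r"(^| )(piston ring|piston)( |$)")
# Same result with an optional suffix instead of reordering.
optional_suffix = re.compile(r"(^| )(piston( ring)?)( |$)")

for name, pattern in [
    ("short_first", short_first),
    ("long_first", long_first),
    ("optional_suffix", optional_suffix),
]:
    # group(2) is the phrase itself, without the surrounding delimiters.
    print(name, repr(pattern.match(text).group(2)))
```

Only the first pattern stops at "piston"; the other two return "piston ring".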

from http://www.regular-expressions.info/alternation.html (first Google result):
the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters
one exception:
the POSIX standard mandates that the longest match be returned, regardless if the regex engine is implemented using an NFA or DFA algorithm.
possible solutions:
piston( ring)?
(piston ring|piston) (put the longest before)

That's the behaviour of alternation. It tries to match the first alternative, that is "piston"; if that succeeds, it is done.
That means it will not try all alternatives, it will finish with the first that matches.
You can find more details here on regular-expressions.info
What could also be interesting for you are word boundaries \b. I think what you are looking for is
\bpiston(?: ring)?\b
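When there are many phrases, the "longest first" rule can be applied automatically. The following is a sketch under the assumption that the phrases are plain strings held in a list; `build_phrase_pattern` is a hypothetical helper name, not something from the question.

```python
import re

def build_phrase_pattern(phrases):
    """Build one alternation with word boundaries, longest phrase first."""
    # Sorting by length (descending) keeps a short phrase such as "piston"
    # from shadowing a longer one such as "piston ring".
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape guards against regex metacharacters inside a phrase.
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

pattern = build_phrase_pattern(["piston", "piston ring"])
found = pattern.findall("a piston ring and a piston, not pistons")
print(found)  # ['piston ring', 'piston']
```

The trailing \b is what rejects "pistons" here.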

Edit2: It wasn't clear if your test data contained pipes or not. I saw the pipes in the regex and assumed you are searching for pipe-delimited text. Oh well, not sure if the below helps.
Using regex to match text that's pipe delimited will need more alternations to pick up the beginning and ending columns.
What about another approach?
text='start piston|xxx|piston ring|xxx|piston cast|xxx|piston|xxx|stock piston|piston end'
j=re.split(r'\|',text)
k = [ x for x in j if x.find('piston') >= 0 ]
['start piston', 'piston ring', 'piston cast', 'piston', 'stock piston', 'piston end']
k = [ x for x in j if x.startswith('piston') ]
['piston ring', 'piston cast', 'piston', 'piston end']
k = [ x for x in j if x == 'piston' ]
['piston']
j=re.split(r'\|',text)
if 'piston ring' in j:
    print(True)
> True
Edit: To clarify - take this example:
text2='piston1|xxx|spiston2|xxx|piston ring|xxx|piston3'
I add '.' to match anything to show the items matched
re.findall('piston.',text2)
['piston1', 'piston2', 'piston ', 'piston3']
To make it more accurate, you will need to use look-behind assertion.
This guarantees you match '|piston' but doesn't include the pipe in the result
re.findall('(?<=\|)piston.',text2)
['piston ', 'piston3']
To stop at the first delimiter instead of matching greedily, use the lazy form .*? followed by the stop character.
Add capturing parentheses around the part you want so that the pipe is excluded from the result; when a pattern has one group, findall returns only that group's text. This seems to work, but it ignores the last column.
re.findall('(?<=\|)(piston.*?)\|',text2)
['piston ring']
Once the group is in place, the look-behind is no longer needed; the pattern can simply start with an escaped pipe
re.findall('\|(piston.*?)\|',text2)
['piston ring']
To search the last column as well, add this non-capturing group (?:\||$), which means match either a pipe (it needs to be escaped) or the end ($) of the string.
The non-capturing group (?:x1|x2) doesn't get included in the result.
re.findall('\|(piston.*?)(?:\||$)',text2)
['piston ring', 'piston3']
Finally, to handle the beginning of the string, add another alternation much like the previous one for the end of the string
re.findall('(?:\||^)(piston.*?)(?:\||$)',text2)
['piston1', 'piston ring', 'piston3']
Hope it helps. :)

Related

Trying to find the regex for this particular case? Also can I parse this without creating groups?

text to capture looks like this..
Policy Number ABCD000012345 other text follows in same line....
My regex looks like this
regex value='(?i)(?:[P|p]olicy\s[N|n]o[|:|;|,][\n\r\s\t]*[\na-z\sA-Z:,;\r\d\t]*[S|s]e\s*[H|h]abla\s*[^\n]*[\n\s\r\t]*|(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)(?P<policy_number>[^\n]*)'
this particular case matches with the second or case.. however it is also capturing everything after the policy number. What can be the stopping condition for it to just grab the number. I know something is wrong but can't find a way out.
(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)
current output
ABCD000012345othertextfollowsinsameline....
expected output
ABCD000012345

You may use a more simple regex, just finding from the beginning "[P|p]olicy\s*[N|n]umber\s*\b([A-Z]{4}\d+)\b.*" and use the word boundary \b
import re
pattern = re.compile(r"[P|p]olicy\s*[N|n]umber\s*\b([A-Z0-9]+)\b.*")
line = "Policy Number ABCD000012345 other text follows in same line...."
matches = pattern.match(line)
id_res = matches.group(1)
print(id_res) # ABCD000012345
And if there's always 2 words before you can use (?:\w+\s+){2}\b([A-Z0-9]+)\b.*
Also \s is for [\r\n\t\f\v ] so no need to repeat them, your [\n\r\s\t] is just \s

You don't need to spell out the upper- and lower-case p and n, since you're already specifying case-insensitive matching. (Inside square brackets the | is a literal pipe, not an "or".)
Also \s already covers \n, \t and \r.
(?i)policy\s+number\s+([A-Z]{4}\d+)\b
Another Solution:
^[\s\w]+\b([A-Z]{4}\d+)\b
I like this better, in case your text changes from policy number
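For reference, here is a short runnable sketch of the stopping condition discussed above, using the case-insensitive pattern with the four-letters-then-digits shape. That shape is an assumption taken from the single sample line; `POLICY_RE` is just an illustrative name.

```python
import re

# The capture group stops at the first word boundary after the digits,
# so the rest of the line is not swallowed.
POLICY_RE = re.compile(r"policy\s+number\s+([A-Z]{4}\d+)\b", re.IGNORECASE)

line = "Policy Number ABCD000012345 other text follows in same line...."
match = POLICY_RE.search(line)
policy_number = match.group(1) if match else None
print(policy_number)  # ABCD000012345
```

Using search rather than match means the label does not have to sit at the very start of the line.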

Why does this regular expression to match two consecutive words not work?

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.
I'm using python regex, and I'm trying to match all consecutively repeated words, such as the bold in:
I am struggling to to make this this work
I tried:
[A-Za-z0-9]* {2}
This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".
How is my logic flawed here? Why does this regex not work?

[A-Za-z0-9]* {2}
Quantifiers in regular expressions will always only apply to the element right in front of them. So a \d+ will look for one or more digits but x\d+ will look for a single x, followed by one or more digits.
If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
So, going back to your regular expression, you would have to do it like this:
([A-Za-z0-9]* ){2}
However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:
([A-Za-z0-9]*) \1
The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:
\b(\w+)\s\1\b
There are a few things that are different to the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches basically when a word starts or ends. This will prevent the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.
Then, the regular expression requires at least one character. So empty words won’t be matched. I’m also using \w here which is a more flexible way of including alphanumerical characters. And finally, instead of looking for an actual space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+ to allow multiple whitespace characters.
Of course, whether this works better for you, depends a lot on your actual requirements which we won’t be able to tell just from your one example. But this should give you a few ideas on how to continue at least.
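The difference between "quantify a group" and "repeat what the group captured" is easy to see side by side. A small sketch, using the sentence from the question:

```python
import re

s = "I am struggling to to make this this work"

# A quantified group matches any two words in a row, not a repeated word.
any_two_words = re.search(r"(?:[A-Za-z0-9]+ ){2}", s)
print(repr(any_two_words.group(0)))  # 'I am '

# A backreference requires the second word to equal the captured one.
repeated = re.findall(r"\b(\w+)\s+\1\b", s)
print(repeated)  # ['to', 'this']

# The word boundaries reject partial overlaps such as "foo fooo".
print(re.findall(r"\b(\w+)\s+\1\b", "foo fooo"))  # []
```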

You can match a previous capture group with \1 for the first group, \2 for the second, etc...
import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)
>>> ['to', 'this']
If you want both occurrences, add a capture group around \1:
matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)
>>> [('to', 'to'), ('this', 'this')]

At a glance it looks like this will match any two words, not repeated words. If I recall correctly asterisk (*) will match zero or more times, so perhaps you should be using plus (+) for one or more. Then you need to provide a capture and re-use the result of the capture. Additionally the \w can be used for alphanumerical characters for clarity. Also \b can be used to match empty string at word boundary.
Something along the lines of the example below will get you part of the way.
>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']
These pages may offer some guidance:
Python regex cheat sheet
RegExp match repeated characters
Regular Expression For Consecutive Duplicate Words.

This should work: \b([A-Za-z0-9]+)\s+\1\b
\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.
>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']

Here is a simple solution not using RegEx.
sentence = 'I am struggling to to make this this work'
def find_duplicates_in_string(words):
    """ Takes in a string and returns any duplicate words
    i.e. "this this"
    """
    duplicates = []
    words = words.split()
    for i in range(len(words) - 1):
        prev_word = words[i]
        word = words[i + 1]
        if word == prev_word:
            duplicates.append(word)
    return duplicates
print(find_duplicates_in_string(sentence))

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
line = "24036 -977"
values = re.search(r"(.*?)\s(.*)", line)
x = values.group(1)
y = values.group(2)
This does the job, however I was curious about why using (.*?) in the second group causes the regex to fail. I tested it in an online regex tester (https://regex101.com/r/bM2nK1/1), and adding the ? causes the second group to return nothing. As far as I know, .*? means to take any character an unlimited number of times, as few times as possible, and .* is just the greedy version of that. What I'm confused about is why the non-greedy version .*? takes that definition to mean capturing nothing.

Because the ? makes the preceding quantifier, the *, match as few times as possible, which is 0 times when nothing after it forces it to take more. If you would like it to extend to the end of the string, add a $, which matches the end of the string. If you would like it to match at least one character, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.
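The variations described above can be compared on the sample line. This is only a small demonstration; the variable names are illustrative.

```python
import re

line = "24036 -977"

# Greedy second group: takes everything after the whitespace.
greedy = re.search(r"(.*?)\s(.*)", line)
# Lazy second group with nothing after it: zero characters is enough.
lazy = re.search(r"(.*?)\s(.*?)", line)
# Lazy second group forced to reach the end of the string by $.
lazy_anchored = re.search(r"(.*?)\s(.*?)$", line)
# Lazy +: at least one character, so exactly one is taken.
lazy_plus = re.search(r"(.*?)\s(.+?)", line)

print(greedy.groups())         # ('24036', '-977')
print(lazy.groups())           # ('24036', '')
print(lazy_anchored.groups())  # ('24036', '-977')
print(lazy_plus.groups())      # ('24036', '-')
```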

@iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.

Nongreedy Regex with Repetition

I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))
I need to do the following with regex:
Find FFD8FF
Find the last FFD9 that comes before the next FFD8FF
Stop at the last FFD9 and not include any content after
What I've got does what I need except it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?
Here's the string that I'm searching with this expression:
asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9
Thanks a lot for your help.
More info:
I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). Because of this, I'm using re.compile to build the expression dynamically in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:
regExp = re.compile("FD8FF(?:[^F]|F(?!FD8FF))*FFD9")
matchObj = re.findall(regExp, contents)
In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])
Any other ideas why there aren't any matches?
EDIT:
I figured out that I forgot to include flags. Flags are now included to ignore case and multiline. I now have
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)
Although now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall expression above)

an easy way is to use this:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9
explanation:
FFD8FF
(?: # this group describe the allowed content between the "anchors"
[^F] # all that is not a "F"
| # OR
F(?!FD8FF) # a "F" not followed by "FD8FF"
)* # repeat (greedy)
FFD9 # until the last FFD9 before FFD8FF
Even if a greedy quantifier is used for the group, the regex engine will backtrack to find the last "FFD9" substring.
If you want to ensure that FFD8FF is present, you can add a lookahead at the end of the pattern:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)
You can optimize this pattern by emulating an atomic group, which limits the backtracking and allows a quantifier to be used inside the group:
FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9
This trick uses the fact that the content of a lookahead is naturally atomic once the closing parenthesis is reached. So if you put a capture group inside a lookahead, you only have to put the backreference after it to obtain an "atom" (an indivisible substring).
When the regex engine needs to backtrack, it will backtrack atom by atom instead of character by character, which is much faster.
If you need a capture group before this trick, don't forget to update the number of the backreference, examples:
(FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)
(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)
working example:
>>> import re
>>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
>>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
>>> re.findall(p, yourstr)
[('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]
variant:
(FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)
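Since the question builds the expression from a list of start/end pairs, the basic pattern can be generated by a small helper. This is a sketch, not the asker's code: `build_span_pattern` is a hypothetical name, and it assumes each start marker is a non-empty literal string. Note that the lookahead contains the start marker minus its first character (FD8FF for FFD8FF), matching the hand-written pattern above.

```python
import re

def build_span_pattern(start, end, flags=0):
    """start(?:[^c]|c(?!rest))*end, where start == c + rest."""
    first, rest = start[0], start[1:]
    # Body: any character that is not `first`, or `first` when it does
    # not begin another occurrence of the start marker.
    body = "(?:[^{c}]|{c}(?!{rest}))*".format(
        c=re.escape(first), rest=re.escape(rest)
    )
    return re.compile(re.escape(start) + body + re.escape(end), flags)

pattern = build_span_pattern("FFD8FF", "FFD9")
sample = "xxFFD8FFaaFFD9bbFFD9ccFFD8FFddFFD9"
spans = pattern.findall(sample)
print(spans)  # ['FFD8FFaaFFD9bbFFD9', 'FFD8FFddFFD9']
```

Each span runs from a start marker to the last end marker before the next start marker, and the junk after that last end marker ("cc" here) is left out.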

Since you are not restricted to one regexp by your application's architecture, break it down into steps:
You want to break up the text in units that begin at each FFD8FF. Just use non-greedy search that ends just before the next FFD8FF: re.findall(r"FFD8FF.*?(?=FFD8FF)", contents). (This uses look-ahead, which is in my opinion overused; but it lets you save the final FFD8FF for the next string.)
You then want to trim each such string so that it ends at the last FFD9. Easiest way to do this is with greedy search: re.search(r"^.*FFD9", part). Like this:
for part in re.findall(r"FFD8FF.*?(?=FFD8FF)", contents):
print(re.search(r"^.*FFD9", part).group(0))
Simple, maintainable and efficient.

This is how I would do it:
>>> re.search(r'((FFD8FF).+?(FFD9))(?:((?!FFD9).)+FFD8FF)', s).groups()
('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9',
'FFD8FF',
'FFD9',
'f')
The second part just searches for a string not containing FFD9 that ends with FFD8FF.
It includes your search components, so you can still substitute them in your regex. However for something rather complicated like this I would avoid regex.
btw, thanks for posting a regex question that is high-quality and not the usual spam.

Python regex: how to match anything up to a specific string and avoid backtracking when failing

I'm trying to craft a regex able to match anything up to a specific pattern. The regex then will continue looking for other patterns until the end of the string, but in some cases the pattern will not be present and the match will fail. Right now I'm stuck at:
.*?PATTERN
The problem is that, in cases where the string is not present, this takes too much time due to backtracking. In order to shorten this, I tried mimicking atomic grouping using positive lookahead as explained in this thread (btw, I'm using the re module in python-2.7):
Do Python regular expressions have an equivalent to Ruby's atomic grouping?
So I wrote:
(?=(?P<aux1>.*?))(?P=aux1)PATTERN
Of course, this is faster than the previous version when STRING is not present, but the trouble is that it doesn't match STRING anymore, as the . matches everything to the end of the string and the previous states are discarded after the lookahead.
So the question is: is there a way to do a match like .*?STRING and also be able to fail faster when the match is not present?

You could try using split
If the results are of length 1 you got no match. If you get two or more you know that the first one is the first match. If you limit the split to size one you'll short-circuit the later matching:
"HI THERE THEO".split("TH", 1) # ['HI ', 'ERE THEO']
The first element of the results is up to the match.

One-Regex Solution
^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN
Explanation
You wanted to use the atomic grouping like this: (?>.*?)PATTERN, right? This won't work. Problem is, you can't use lazy quantifiers at the end of an atomic grouping: the definition of the AG is that once you're outside of it, the regex won't backtrack inside.
So the regex engine will match the .*?, because of the laziness it will step outside of the group to check if the next character is a P, and if it's not it won't be able to backtrack inside the group to match that next character inside the .*.
What's usually used in Perl are structures like this: (?>(?:[^P]|P(?!ATTERN))*)PATTERN. That way, the equivalent of .* (here (?:[^P]|P(?!ATTERN))) won't "eat up" the wanted pattern.
This pattern is easier to read in my opinion with possessive quantifiers, which are made just for these occasions: (?:[^P]|P(?!ATTERN))*+PATTERN.
Translated with your workaround, this would lead to the above regex (added ^ since you should anchor the regex, either to the start of the string or to another regex).
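A quick check of the one-regex solution with Python's re module, as a sketch. The sample strings are made up for illustration; PATTERN stands in for the real search string, as in the question.

```python
import re

# Lookahead + named backreference emulates an atomic group: once the
# prefix is captured, the engine cannot backtrack into it.
UP_TO_PATTERN = re.compile(
    r"^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN"
)

hit = UP_TO_PATTERN.search("abc P xyz PATTERN rest")
print(repr(hit.group("aux1")))  # 'abc P xyz '

# Without PATTERN the prefix is consumed once and the match fails
# without retrying shorter prefixes.
miss = UP_TO_PATTERN.search("abc P xyz, nothing here")
print(miss)  # None
```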

The Python documentation includes a brief outline of the differences between the re.search() and re.match() functions http://docs.python.org/2/library/re.html#search-vs-match. In particular, the following quote is relevant:
Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.
Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.
In your case, it would be preferable to define your pattern simply as:
pattern = re.compile("PATTERN")
And then call pattern.search(...), which will not backtrack when the pattern is not found.
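If the text before the pattern is what is actually needed, it can be sliced out of the string after the search. A minimal sketch, where `text_before` is a hypothetical helper name:

```python
import re

pattern = re.compile("PATTERN")

def text_before(compiled, s):
    """Return everything before the first match, or None if no match."""
    m = compiled.search(s)
    # m.start() is the index where the match begins.
    return None if m is None else s[:m.start()]

print(repr(text_before(pattern, "abc PATTERN rest")))  # 'abc '
print(text_before(pattern, "no match in here"))        # None
```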
