Python regular expression question mark operator not working?

Python regular expression question mark operator not working? - python

import re
str='abc defg'
m1 = re.match(".*(def)?",str)
m2 = re.match(".*(def)",str)
print (m1.group(1),m2.group(1))
The output of the above is:
(None, 'def')
What is going on? Even with a non-greedy repetition operator, the optional capture group (def)? is not matched.

Here's what happens when the regex engine tries to match .*(def) against abc defg:
First, the engine starts trying to match the regex at the beginning of the string.
The greedy subpattern .* initially tries to match as many times as it can, matching the entire string.
Since this causes the rest of the match to fail, the regex engine backtracks until it finds a way to match the (def), which happens when the .* matches only abc .
However, if we change the regex to .*(def)?, the following happens instead:
First, the regex engine again starts at the beginning of the string.
Next, it again tries to match .* as many times as possible, matching the entire string.
But at that point, since all the rest of the regex is optional, it has found a match for the entire regex! Since (def)? is greedy, the engine would prefer to match it if it could, but it's not going to backtrack earlier subpatterns just to see if it can. Instead, it just lets the .* gobble up the entire string, leaving nothing for (def)?.
Something similar happens with .*?(def) and .*?(def)?:
Again, the engine starts at the beginning of the string.
The ungreedy subpattern .*? tries to match as few times as it can, i.e. not at all.
At that point, (def) cannot match, but (def)? can. Thus, for (def) the regex engine has to go back and consider longer matches for .*? until it finds one that lets the full pattern match, whereas for (def)? it doesn't have to do that, and so it doesn't.
For more information, see the "Combining RE Pieces" section of the Perl regular expressions manual (which matches the behavior of Python's "Perl-compatible" regular expressions).

Related

Python - Extract the words in a sentence that occur before a set of words [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Add Chinese brackets before and after two matches in a string in Python [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Python multiline regex groups with finditer only returns last match [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
values = re.search("(.*?)\s(.*)")
x = values.group(1)
y = values.gropu(2)
This does the job, however I was curious about why using (.*?) in the second group causes the regex to fail? I tested it in the online regex tester(https://regex101.com/r/bM2nK1/1), and adding the ? in causes the second group to return nothing. Now as far as I know .*? means to take any value unlimited times, as few times as possible, and the .* is just the greedy version of that. What I'm confused about is why the non greedy version.*? takes that definition to mean capturing nothing?

Because it means to match the previous token, the *, as few times as possible, which is 0 times. If you would it to extend to the end of the string, add a $, which matches the end of string. If you would like it to match at least one, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.

#iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.

Python regex: how to match anything up to a specific string and avoid backtraking when failin

I'm trying to craft a regex able to match anything up to a specific pattern. The regex then will continue looking for other patterns until the end of the string, but in some cases the pattern will not be present and the match will fail. Right now I'm stuck at:
.*?PATTERN
The problem is that, in cases where the string is not present, this takes too much time due to backtraking. In order to shorten this, I tried mimicking atomic grouping using positive lookahead as explained in this thread (btw, I'm using re module in python-2.7):
Do Python regular expressions have an equivalent to Ruby's atomic grouping?
So I wrote:
(?=(?P<aux1>.*?))(?P=aux1)PATTERN
Of course, this is faster than the previous version when STRING is not present but trouble is, it doesn't match STRING anymore as the . matches everyhing to the end of the string and the previous states are discarded after the lookahead.
So the question is, is there a way to do a match like .*?STRING and alse be able to fail faster when the match is not present?

You could try using split
If the results are of length 1 you got no match. If you get two or more you know that the first one is the first match. If you limit the split to size one you'll short-circuit the later matching:
"HI THERE THEO".split("TH", 1) # ['HI ', 'ERE THEO']
The first element of the results is up to the match.

One-Regex Solution
^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN
Explanation
You wanted to use the atomic grouping like this: (?>.*?)PATTERN, right? This won't work. Problem is, you can't use lazy quantifiers at the end of an atomic grouping: the definition of the AG is that once you're outside of it, the regex won't backtrack inside.
So the regex engine will match the .*?, because of the laziness it will step outside of the group to check if the next character is a P, and if it's not it won't be able to backtrack inside the group to match that next character inside the .*.
What's usually used in Perl are structures like this: (?>(?:[^P]|P(?!ATTERN))*)PATTERN. That way, the equivalent of .* (here (?:[^P]|P(?!ATTERN))) won't "eat up" the wanted pattern.
This pattern is easier to read in my opinion with possessive quantifiers, which are made just for these occasions: (?:[^P]|P(?!ATTERN))*+PATTERN.
Translated with your workaround, this would lead to the above regex (added ^ since you should anchor the regex, either to the start of the string or to another regex).

The Python documentation includes a brief outline of the differences between the re.search() and re.match() functions http://docs.python.org/2/library/re.html#search-vs-match. In particular, the following quote is relevant:
Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.
Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.
In your case, it would be preferable to define your pattern simply as:
pattern = re.compile("PATTERN")
And then call pattern.search(...), which will not backtrack when the pattern is not found.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.