What is the difference between .* and .*? in a regular expression? - python

I am trying to learn about regular expressions. While investigating the difference between re.match and re.search, I saw a (disputed) claim that re.match('(.*?)word(.*?)',string) was faster than re.search("word",string). I do not see the difference between .*? and .*, nor do I see a need for the trailing (.*?).

See the documentation. That ? makes * non-greedy, i.e., it'll try to match as few repetitions as possible instead of as many as possible.
In your example re.match('(.*?)word(.*?)',string), that means as few leading . as possible, so try to find the earliest word instead of the last. The trailing (.*?) is indeed pointless.
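A minimal sketch of the greedy versus non-greedy behaviour described above. The sample string and variable names are made up for illustration:

```python
import re

# Hypothetical sample text that contains "word" twice.
string = "first word, second word, end"

# Greedy: .* grabs as much as possible, so "word" matches its LAST occurrence.
greedy = re.match(r"(.*)word(.*)", string)

# Lazy: .*? grabs as little as possible, so "word" matches its FIRST occurrence.
lazy = re.match(r"(.*?)word(.*?)", string)

print(repr(greedy.group(1)))  # 'first word, second '
print(repr(lazy.group(1)))    # 'first '
print(repr(lazy.group(2)))    # '' (the trailing lazy group matches nothing)
```

The empty second group in the lazy version is why the trailing (.*?) adds nothing to the match.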

To understand any regex, the first place you go should always be https://regex101.com/. In this case, here's what it says is the only difference between the two:
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
And from there, you can then enter in example text in order to test out the expression in realtime and see what the practical difference is.

Related

regex to match coordinates

I am trying to write a regex expression in python that can match the following lines - I am just able to match the very first number by doing something like this
re.compile(r'\d.\d{14}\s+')
but could not do the rest. I also tried [^-\d] to catch the negative sign, but that does not seem to work.
Any help? Thanks!
First, let's start by looking at the numbers. You've already got a decent expression for finding a single number (\d.\d{14}\s+), but there are a couple of things wrong with it:
1. In regex, an unescaped . matches any single character. This means that your expression will accept any character after the first digit.
2. It doesn't take into account the possibility that there could be a negative sign at the beginning.
Both of these problems are really easy to fix. The first can be fixed by simply escaping the period (\.). The second can be fixed by adding the negative sign to the pattern and giving it a quantifier. In this case, the ? quantifier will be the best option because it matches between 0 and 1 times. All this means is that it won't care if the symbol is there, but if it is it will match it. After these 2 changes, the pattern looks like this: -?\d\.\d{14}\s+.
Next, we need to tell it to match more than once. This can be done very easily by putting the pattern in a group and applying a quantifier to said group. Now the question is which quantifier should be used. In your example, there are only 3 numbers before the single character at the end of the line. You can match this pattern exactly 3 times by using the {3} quantifier. If you know there will be at least 1 but don't know how many in total there will be, you can use the + quantifier. For this example I will be using the {3} quantifier just so it's more specific to your question. After adding this, the pattern will look something like this: (-?\d\.\d{14}\s+){3}
Now all that's left is to match the character at the end. You can use \S to match any single non-whitespace character (use \w instead if it must be a word character). You can add a quantifier to it, but again, for the purposes of your question, I won't, since there's only a single character. The final expression would look like (-?\d\.\d{14}\s+){3}\S.
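As a rough sketch, the final expression could be used like this in Python. The input line is invented, since the original sample lines are not shown, and the names line_re, number_re, coords and label are assumptions:

```python
import re

# Hypothetical input: three coordinates with 14 decimal places, then one character.
line = "0.12345678901234 -1.23456789012345 9.87654321098765 C"

# The pattern from the answer: validates the overall shape of the line.
line_re = re.compile(r"(-?\d\.\d{14}\s+){3}\S")

# A repeated group only keeps its last repetition, so the individual
# numbers are extracted with a separate single-number pattern.
number_re = re.compile(r"-?\d\.\d{14}")

if line_re.fullmatch(line):
    coords = [float(n) for n in number_re.findall(line)]
    label = line.split()[-1]
    print(coords, label)
```

Validating with the full pattern first and extracting with findall afterwards avoids relying on group(1), which would only hold the third number.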

Add Chinese brackets before and after two matches in a string in Python [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will capture all of file path/level1/level2" xxxx some="xxx, i.e. everything up to the last quotation mark on the line.
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
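A short sketch of the over-match and the lazy fix, using the tag from the question (the variable names are illustrative):

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: runs to the LAST quotation mark in the string.
greedy = re.search(r'location="(.*)"', tag).group(1)

# Lazy: stops at the FIRST quotation mark after location=".
lazy = re.search(r'location="(.*?)"', tag).group(1)

print(greedy)  # file path/level1/level2" xxxx some="xxx
print(lazy)    # file path/level1/level2
```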
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
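One way to see why the negated character class can be the safer choice: a lazy .*? will still expand across quotation marks if the rest of the pattern forces it to, whereas [^"]* can never cross one. The sketch below uses an invented tag and an invented trailing condition (" some=) purely to illustrate that difference:

```python
import re

# Hypothetical tag where location= is NOT directly followed by some=.
tag = '<a location="x" other="y" some="z">'

# Lazy dot-star: keeps expanding until '" some=' is found, crossing quotes.
lazy = re.search(r'location="(.*?)" some=', tag)

# Negated class: cannot cross a quote, so the pattern simply does not match.
negated = re.search(r'location="([^"]*)" some=', tag)

print(lazy.group(1))  # x" other="y
print(negated)        # None
```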
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using the lazy modifier ? on the quantifier, with no global flag, is the answer.
If you had the global flag /g, it would have matched every one of the shortest possible matches instead of only the first.
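Python has no /g flag; as a rough equivalent, re.search returns only the first match and re.findall returns all of them. A sketch with an invented two-tag input:

```python
import re

# Hypothetical input with two tags.
text = '<a location="one/two" some="x"> <b location="three/four" some="y">'
pattern = r'location="(.*?)"'

first = re.search(pattern, text).group(1)  # comparable to matching without /g
every = re.findall(pattern, text)          # comparable to matching with /g

print(first)  # one/two
print(every)  # ['one/two', 'three/four']
```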
Here's another way.
Here's the one you want. This is lazy: [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy: [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only one difference between these two regular expressions, and that is the ?
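A sketch of the same first-versus-last idea in Python. It assumes a capture group around [^"]* in both patterns so the replacement has something to refer to, and it uses \1 because Python's re.sub does not understand $1. The input string is invented:

```python
import re

# Hypothetical input with two location attributes.
text = 'x location="first" y location="last" z'

lazy_prefix = r'[\s\S]*?(?:location="([^"]*)")[\s\S]*'
greedy_prefix = r'[\s\S]*(?:location="([^"]*)")[\s\S]*'

# Each pattern consumes the whole string and keeps only the captured value.
first = re.sub(lazy_prefix, r"\1", text)
last = re.sub(greedy_prefix, r"\1", text)

print(first)  # first
print(last)   # last
```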
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc.) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
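Python does support lazy quantifiers, so the following is only a sketch to show that the two forms agree on a simple input. The sample text is made up, and a capture group is added to the negated-class version so the results can be compared:

```python
import re

# Hypothetical input with two a...b spans.
text = "xx a123b yy a456b"

lazy = re.findall(r"a(.*?)b", text)
negated = re.findall(r"a([^ab]*)b", text)  # no lazy quantifier required

print(lazy)     # ['123', '456']
print(negated)  # ['123', '456']
```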
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern, and as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
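import regex  # third-party module; the standard-library re module behaves the same here
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'  # case-insensitive, dot matches newlines, lazy capture
for match in regex.finditer(p, text):
    print(match.group(1))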
Output:
Mary

Python regex examples

I suck at Python regex and would love to see some solved examples to help me gain understanding. I am experimenting using http://pyregex.com/ which is great but need some 'good' examples to get me started.
I try to create a set of rules like so:
rules = [('name', r'[a-z]+'),
         ('operator', r'[+\-*/]')]
which I have found, but I am not confident enough to create my own regexes for cases like the ones listed below:
1. match only the = or += or *= characters
2. match the + character (i.e. the operator as seen above) separately from the ++ characters
3. match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs. [edited - initially had followed which was wrong]
For 1. I have tried [\+=|=], for 2. I know the order in the rules is important and for 3. I am completely lost with the [] and on how I can generalize that case to work not just for int, but for float as well.
Any code examples will be greatly appreciated since I am only just starting with Python and coding!
match only the = or += or *= characters
r'[+*]?='
The [+*]?= pattern consists of an optional atom, the character class [+*] that matches either a + or a * (the ? means one or zero times), followed by a literal = symbol. Why not r'\+=|\*=|='? Not only is the optional character class solution shorter, it is also usually more efficient: alternation tends to involve more redundant backtracking. With alternation you also need to place the alternatives in the correct order, so that the longest appears first (although that does not always guarantee that the longest will match, since it depends on the branch subpatterns; the order does not matter if there are anchors on both sides of the alternation group).
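A quick sketch comparing the two spellings on an invented snippet of source text:

```python
import re

# Hypothetical source text containing the three operators.
code = "a = 1; b += 2; c *= 3"

ops = re.findall(r"[+*]?=", code)          # optional character class
ops_alt = re.findall(r"\+=|\*=|=", code)   # alternation, longest branches first

print(ops)      # ['=', '+=', '*=']
print(ops_alt)  # ['=', '+=', '*=']
```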
match the + character (i.e the operator as seen above) separately from the ++ characters
r'(?<!\+)\+(?!\+)'
This pattern matches a literal + (as it is escaped), and only if it is neither preceded by another plus (see the negative lookbehind (?<!\+)) nor followed by another plus (see the negative lookahead (?!\+)). The lookarounds are non-consuming, i.e. the regex index remains right before a plus when it checks for a plus in front of it, and right after the plus when it checks for a plus after it. The characters (or start/end of string positions) they inspect are not returned as part of the match (that is why they are called zero-width assertions).
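To illustrate, here is a small sketch that lists where a lone + is found in an invented line of code:

```python
import re

# Hypothetical source text mixing + and ++.
code = "a + b; i++; c+d; ++j"

single_plus = re.compile(r"(?<!\+)\+(?!\+)")

# Only the + at index 2 ("a + b") and index 13 ("c+d") qualify;
# both characters of each ++ are rejected by one of the lookarounds.
positions = [m.start() for m in single_plus.finditer(code)]
print(positions)  # [2, 13]
```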
match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs.
r'\bint\b(?=\s+\w+\s+)'
If you read the explanation above, you will recognize another zero-width assertion here: (?=\s+\w+\s+) is a positive lookahead that checks if a whole word int (as \b matches word boundary positions) is followed with 1+ whitespaces, then 1+ word characters, and then again 1+ whitespaces.
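Note that the match itself is only the keyword int. If the goal is to get the word that follows, and to accept float as well, one possible variant (an assumption about what the asker wants, not part of the answer above) is to alternate the keywords and capture the next word:

```python
import re

# Hypothetical source text.
source = "int count = 0; float ratio = 1.5; print(count)"

# The answer's pattern: matches the keyword itself, only when a word follows.
keyword_only = re.findall(r"\bint\b(?=\s+\w+\s+)", source)

# Assumed variant: int or float, then spaces/tabs, then capture the next word.
names = re.findall(r"\b(?:int|float)\b[ \t]+(\w+)", source)

print(keyword_only)  # ['int']
print(names)         # ['count', 'ratio']
```

The \b anchors keep the int inside print from matching.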
The examples provided in the documentation and in the previous answers should get you started on the right path. An additional consideration, since you said you are new to programming and Python, is that regular expressions are an intermediate to advanced topic (depending on what you want to do with them) and are best tackled once you have a better grasp of good programming practices and Python's fundamentals.
In any case more information and examples can be found at:
Python Regular Expressions module.

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
import re
line = "24036 -977"  # sample line in the format described above
values = re.search(r"(.*?)\s(.*)", line)
x = values.group(1)
y = values.group(2)
This does the job; however, I was curious about why using (.*?) in the second group causes the regex to fail. I tested it in the online regex tester (https://regex101.com/r/bM2nK1/1), and adding the ? causes the second group to return nothing. As far as I know, .*? means to match any character an unlimited number of times, as few times as possible, and .* is just the greedy version of that. What I'm confused about is why the non-greedy version .*? takes that definition to mean capturing nothing.
Because it means to match the previous token, the ., as few times as possible, which is 0 times, since nothing after the second group forces it to take more. If you would like it to extend to the end of the string, add a $, which matches the end of the string. If you would like it to match at least one character, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.
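A small sketch of the three variants on the sample line from the question (the variable names are illustrative):

```python
import re

line = "24036 -977"

lazy_second = re.search(r"(.*?)\s(.*?)", line)    # nothing forces group 2 to grow
anchored = re.search(r"(.*?)\s(.*?)$", line)      # $ forces group 2 to reach the end
at_least_one = re.search(r"(.*?)\s(.+?)", line)   # +? takes exactly one character

print(lazy_second.groups())   # ('24036', '')
print(anchored.groups())      # ('24036', '-977')
print(at_least_one.groups())  # ('24036', '-')
```

As the last line shows, switching to + alone only guarantees a single character; anchoring with $ is what makes the lazy group cover the whole second number.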
@iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.
