How to use ? and ?: and : in REGEX for Python? - python

I understand that
* = "zero or more"
? = "zero or more" ...what's the difference?
Also, ?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!

As Manu already said, ? means "zero or one time". It is the same as {0,1}.
And by ?:, you probably meant (?:X), where X is some other string. This is called a "non-capturing group".
Normally when you wrap parenthesis around something, you group what is matched by those parenthesis. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you do: .(?:.).(.) only the fourth character is stored in group 1, everything bewteen (?:.) is matched, but not "remembered".
A little demo:
import re
m = re.search('.(.).(.)', '1234')
print m.group(1)
print m.group(2)
# output:
# 2
# 4
m = re.search('.(?:.).(.)', '1234')
print m.group(1)
# output:
# 4
You might ask yourself: "why use this non-capturing group at all?". Well, sometimes, you want to make an OR between two strings, for example, you want to match the string "www.google.com" or "www.yahoo.com", you could then do: www\.google\.com|www\.yahoo\.com, but shorter would be: www\.(google|yahoo)\.com of course. But if you're not going to do something useful with what is being captured by this group (the string "google", or "yahoo"), you mind as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo" then your app/script will run faster. Of course, it wouldn't make much difference with relatively small strings, but when your regex and string(s) gets larger, it probably will.
And for a better example to use non-capturing groups, see Chris Lutz's comment below.

?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
If that’s indeed what your book says, then I advise getting a better book.
Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:
a(.)c
This is a regular expression that matches any three-letter string starting with a and ending with c. The middle character is (more or less) aribtrary. Since you put it in parentheses, you can capture it:
m = re.search('a(.)c', 'abcdef')
print m.group(1)
This will print b, since m.group(1) captures the content of the first parentheses (group(0) captures the whole hit, here abc).
Now, consider this regular expression:
a(?:.)c
No capture is made here – this is what ?: after an opening parenthesis means. That is, the following code will fail:
print m.group(1)
Because there is no group 1!

? = zero or one
you use (?:) for grouping w/o saving the group in a temporary variable as you would with ()

? does not mean "zero or more", it means "zero or one".

Related

Add Chinese brackets before and after two matches in a string in Python [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

How to match digits only after a particular string, stop matching if non-digit is found - Python 27

I have huge string like this dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2
I want to get the number after ludocid, only consecutive numbers.
I have tried this regex (ludocid).*(?=\d+\d+) and many more but no luck.
You can try ludocid=(\d+):
s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
import re
re.findall(r"ludocid=(\d+)", s)
# ['15878284988193842600']
You can use this regex:
ludocid\D*(\d+)
RegEx Demo
This will match literal ludocid followed by 0 or more non-digits and then it will match 1 or more digits in captured group #1
Code:
>>> s = 'dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2'
>>> print re.search(r'ludocid\D*(\d+)', s).group(1)
15878284988193842600
It looks like you just threw a bunch of regex bits together... Let's work through that.
First, this is the correct regex: ludocid.(\d+)
(You would want to use it with re.search instead of re.match, by the way. Match requires the regex to match the entire string.)
But let's look at yours and see what went wrong and how we can get to the correct regex.
(ludocid).*(?=\d+\d+)
Imagine a regex as a function. You pass it the right things, and it gives you the appropriate result. When you wrap things in parentheses, you're saying "Find this and give it back to me." You don't need the ludocid given back to you, I'm guessing... so remove those paren.
ludocid.*(?=\d+\d+)
Now you've got a .*. This is dangerous in regular expressions because it literally says "Grab as many of anything as you possibly can!" Often I use the non-greedy version (.*?), but in this case it looks like we're just expecting a single extra character there. If you know the literal character you can use that, but to be safe I'll leave it as ., which says "Grab any one character."
ludocid.(?=\d+\d+)
Now let's go inside the parentheses. You've got \d+\d+, which says "Find a sequence of one or more digits, and then find another sequence of one or more digits." This equates to "Find a sequence of two or more digits." I don't think this is what you wanted (it's not how you described the problem, anyway), so let's reduce that:
ludocid.(?=\d+)
Okay, great. Now... what is (?=...) for? It's called a lookahead assertion. It says "If you find this string, match things in front of it." The example given in the Python 2.7 documentation is:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Essentially this means that your regex will never return the digits. Instead, it looks to see if digits exist, and then it returns things from the rest of the regex. Remove the lookahead assertion and we're there:
ludocid.(\d+)
When you use this with re.search, you'll get the group you want:
>>> s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
>>> import re
>>> re.search(r"ludocid.(\d+)", s).group(1)
'15878284988193842600'
To match only the digits that follow, stopping at the first non-numeric char, try a positive look behind:
(?<=ludocid=)(\d+)
So:
re.findall(r"(?<=ludocid=)(\d+)", s)
The positive look behind will look for what you want, and only match if it is preceded by the 'flag' string.
**Note: **You may need to escape that second = sign like this: (?<=ludocid\=)(\d+)

Capturing groups and greediness in Python

Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print m.group() # abcdefghi
print m.groups() # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print m.group() # abc
print m.groups() # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print re.findall(r"(\w{3})+", S) # ['ghi']
Now I am truly perplexed, I always thought that function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match abcdefghi (according to common sense and bullet 1), so I expected to have a list containing this one item. Why only ghi was returned?
4. Non-greedy findall
print re.findall(r"(\w{3})+?", S) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.
You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+" you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. You have the case where, like the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the w3 block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but with PyPi regex, you can. Its match object has a captures method. However, with re, you can just match them with re.findall(r'\w{3}', S). However, in this case, you will match all 3-word character chunks from the string, not just those consecutive ones. With the regex module, you can get all the 3-character consecutive chunks from the beginning of the string with the help of \G operator: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.

Regex - Using * with a set of characters

I'm fairly new at regex, and I've run into a problem that I cannot figure out:
I am trying to match a set of characters that start with an arbitrary number of A-Z, 0-9, and _ characters that can optionally be followed by a number enclosed in a single set of parentheses and can be separated from the original string by a space (or not)
Examples of what this should find:
_ABCD1E
_123FD(13)
ABDF1G (2)
This is my current regex expression:
[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}
It's finding everything just fine, but a problem exists if I have the following:
_ABCDE )
It should only grab _ABCDE and not the " )" but it currently grabs '_ABCDE )'
Is there some way I can grab the (#) but not get extra characters if that entire pattern does not exist?
If possible, please explain syntax as I am aiming to learn, not just get the answer.
ANSWER: The following code is working for what I needed so far:
[A-Z_0-9]+(\s*\([\d]+\)){0,1}
# or, as has been mentioned, the above can be simplified
# and cleaned up a bit to be
[A-Z_0-9]+(\s*\(\d+\))?
# The [] around \d are unnecessary and {0,1} is equivalent to ?
Adding the parentheses around the (#) pattern allows for the use of ? or {0,1} on the entire pattern. I also changed the [\d]* to be [\d]+ to ensure at least one number inside of the parentheses.
Thanks for the fast answers, all!
Your regex says that each paren (open & closed) may or may not be there, INDEPENDENTLY. Instead, you should say that the number-enclosed-in-parens may or may not be there:
(\([\d]*\)){0,1}
Note that this allows for there to be nothing in the parens; that's what your regex said, but I'm not clear that's what you actually want.
how about
^[A-Z0-9_]+\s*(\([0-9]+\))?$
btw, from your example, the first part accepts not only [A-Z_], but also [0-9]
This seems to do the job.
[1-9A-Z_]+\s*(?:\([1-9]*\))?
It seems like you want the following regex:
^[A-Z\d_]+(\s*\(\d+\))?$
I used a non-capturing group to avoid grouping matching in results:
>>> pattern = r'[A-Z_]+\s*(?:\(\d+\)|\d*)'
>>> l = ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )', 'A_B (15)', 'E (345']
>>> [re.search(pattern , i).group() for i in l]
['_ABCD1', '_123', 'ABDF1', '_ABCDE ', 'A_B (15)', 'E ']

Python findall, regexp with "|" and grouping

I'm trying to implement Pig Latin with Python.
I want to match strings which begins by a consonant or "qu" (no matter the case) so to find the first letters, so at first I was doing :
first_letters = re.findall(r"^[^aeiou]+|^[qQ][uU]", "qualification")
It didn't work (finds only "q") so I figured that i had to add the q in the first group :
first_letters = re.findall(r"^[^aeiouq]+|^[qQ][uU]", "qualification")
so that works (it finds "qu" and not only "q") !
But playing around I found myself with this :
first_letters = re.findall(r"{^[^aeiou]+}|{^[qQ][uU]}", "qualification")
that didn't work because it is the same as the first expression I tried I think.
But finally this also worked :
first_letters = re.findall(r"{^[^aeiou]+}|(^[qQ][uU])", "qualification")
and I don't know why.
Someone can tell me why ?
You should put qu before [^aeuio], because otherwise "q" gets captured by the class and fails to match. Besides that, [Qq][Uu] is not needed, just provide the case insensitive flag:
first_letters = re.findall(r"^(qu|[^aeiou]+)", "qualification", re.I)
Given that you're probably going to match the rest of the string as well, this would be more practical:
start, rest = re.findall(r"^(qu|[^aeiou]+)?(.+)", word, re.I)[0]
Reverse the order of the rules:
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "qualification")
['qu']
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "boogie")
['b']
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "blogie")
['bl']
In your first case, the first regex ^[^aeiou]+ matches the q. In the second case, since you've added q to the first part, the regex engine examines the second expression and matches qu.
In your other cases, I don't think the first expression does what you think it does (i.e. the ^ character inside the braces), so it's the second expression which matches again.
The first part of your 3rd and 4th patterns, {^[^aeiou]+} is trying to match a literal curly brace { followed by a start-of-line followed by one or more non-vowel characters, followed by a literal closing curly brace }. Since you don't have re.MULTILINE enabled, I'd assume that your pattern will be technically valid but unable to match any input.
The | runs left to right, and stops at the first success. So, that's why you found only q with the first expression, and qu with the second.
Not sure what your final regex does, particularly with regard to the {} expression. The part after the | will match in qualification, though. Perhaps that's what you are seeing.
You might find the re.I (ignore case) flag useful.

Categories