Regex disjunction in Python findall does not match full substrings - python

In several online testers, the regex a(b|c)z matches both 'abz' and 'acz' in the string 'abz acz', but Python's re.findall() only matches 'b' and 'c'.
What am I missing?
In[42]: re.findall(r'a(b|c)z', 'abz acz')
Out[42]: ['b', 'c']

With findall, the captured groups are returned:
As stated in the documentation ...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
You can simply use a character class here instead.
>>> re.findall(r'a[bc]z', 'abz acz')
['abz', 'acz']

That's because you are using capturing parentheses. Use a non-capturing group instead:
re.findall(r'a(?:b|c)z', 'abz acz')
The ?: makes the parentheses non-capturing, so findall returns the whole match.
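The variants can be compared side by side. This is a minimal sketch using only the pattern and input from the question. The re.finditer version with group(0) is an extra option not mentioned in the answers, for cases where the capturing group has to stay in the pattern:

```python
import re

text = 'abz acz'

# Capturing group: findall returns only the group's contents.
captured = re.findall(r'a(b|c)z', text)

# Non-capturing group: findall returns each whole match.
whole = re.findall(r'a(?:b|c)z', text)

# Keep the capturing group, but ask each match object for group 0
# (the whole match) instead of relying on findall.
via_finditer = [m.group(0) for m in re.finditer(r'a(b|c)z', text)]

print(captured)      # ['b', 'c']
print(whole)         # ['abz', 'acz']
print(via_finditer)  # ['abz', 'acz']
```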

Related

Apart from returning string and iterator in re.findall() and re.finditer() in python do their working also differ?

I wrote the following code to get all variable-length runs matching str_key.
import re
line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"
regex = rf"({str_key})+"
find_all_found = re.findall(regex, line)
print(find_all_found)
find_iter_found = re.finditer(regex, line)
for i in find_iter_found:
    print(i.group())
Output I got:
['ABCD', 'ABCD', 'ABCD']
ABCDABCDABCD
ABCD
ABCDABCDABCD
The intended output is the last three lines, printed by finditer(). I was expecting both functions to give me the same output (list or iterator does not matter). Why does findall() differ? As far as I understood from other posts on Stack Overflow, these two functions differ only in their return types and not in how they match patterns. Do they work differently, and if not, what have I done wrong?
findall returns the contents of the capturing group, which corresponds to groups() on a match object rather than group():
>>> find_iter_found = re.finditer(regex, line)
>>> for i in find_iter_found:
...     print(i.groups()[0])
The difference between the two methods is explained here.
As far as the matching process is concerned, the two functions behave in much the same way, as per the documentation:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping
matches for the RE pattern in string. The string is scanned
left-to-right, and matches are returned in the order found. Empty
matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
For re.findall, change your regex
regex = rf"({str_key})+"
into
regex = rf"((?:{str_key})+)"
The quantifier + has to be inside the capturing group.
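A short sketch of the fix, using the question's data. The call to re.escape is an addition of mine, on the assumption that str_key might one day contain regex metacharacters. It changes nothing for "ABCD":

```python
import re

line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"
key = re.escape(str_key)  # assumption: treat the key as literal text

# Quantifier outside the group: the group is overwritten on every
# repetition, so findall reports only the last repetition of each run.
last_only = re.findall(rf"({key})+", line)

# Quantifier inside the capturing group, with a non-capturing group
# doing the repetition: findall now reports each whole run.
whole_runs = re.findall(rf"((?:{key})+)", line)

# finditer's group() (group 0) is the whole match with either pattern.
from_iter = [m.group() for m in re.finditer(rf"({key})+", line)]

print(last_only)   # ['ABCD', 'ABCD', 'ABCD']
print(whole_runs)  # ['ABCDABCDABCD', 'ABCD', 'ABCDABCDABCD']
print(from_iter)   # ['ABCDABCDABCD', 'ABCD', 'ABCDABCDABCD']
```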

python regex: capturing group within OR

I'm using Python and the re module to parse some strings and extract a 4-digit code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes, and 1234, 5678, 0123 are the digits I need to grab. tokenA and tokenB are just examples here. The prefix can be something like an address http://domain.com/ (tokenA) or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a tuple.
How can I modify the regex to achieve that? Thanks in advance.
You get tuples as a result since you have more than 1 capturing group in your regex. See re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
Only the tokens differ, while the ([0-9]{4}) part is common to both alternatives. So put the tokens into a non-capturing group with an alternation operator between them, and follow it with a single capturing group for the digits:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but not capture tokenA or tokenB
([0-9]{4}) - match and capture into Group 1 four digits
IDEONE demo:
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']
Simply do this:
re.findall(r"token[AB](\d{4})", s)
Put [AB] inside a character class, so that it would match either A or B
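Applied to the two strings from the question, the single-group pattern can be checked like this. The second part is only a sketch: it plugs in the real-world prefixes the question mentions, and the sample string there is made up for illustration:

```python
import re

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

# One capturing group only, so findall returns a flat list of strings.
pattern = re.compile(r"(?:tokenA|tokenB)([0-9]{4})")
codes1 = pattern.findall(str1)
codes2 = pattern.findall(str2)
print(codes1)  # ['1234']
print(codes2)  # ['5678', '0123']

# Same idea with the prefixes described in the question.
# The sample text below is hypothetical.
real_pattern = re.compile(r"(?:http://domain\.com/|[Ii]d:?\s?)([0-9]{4})")
sample = "see http://domain.com/1234 or Id: 5678"
codes3 = real_pattern.findall(sample)
print(codes3)  # ['1234', '5678']
```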

regular expression: may or may not contain a string

I want to match a floating-point number that might be in the form of 0.1234567 or 1.23e-5.
Here is my Python code:
import re
def main():
    m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
    for svs_elem in m2:
        print(svs_elem)
main()
It prints blank lines. Based on my tests, the problem is in the (e-\d+)? part.
See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
['ab']
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
['b']
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
['ab']
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final cases, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, the capturing group itself doesn't match any text, though, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.
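If you need both the whole match and the optional part, one alternative (not from the answers above) is re.finditer, which yields match objects instead of group contents. A minimal sketch on the question's input with an exponent added:

```python
import re

pattern = re.compile(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?')
text = '1:0.00003 3:0.123456 8:-0.12345e-6'

# group(0) is always the whole match. group(1) is None when the optional
# group did not take part in the match (findall reports that case as '').
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

for whole, exponent in pairs:
    print(whole, exponent)
# 1:0.00003 None
# 3:0.123456 None
# 8:-0.12345e-6 e-6
```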

Python regexp: get all group's sequence

I have a regex like '^(a|ab|1|2)+$' and want to get every piece it matched, in sequence.
For example, for re.search(reg, 'ab1') I want to get ('ab', '1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know how many blocks will be matched by (pattern)+.
Is this possible, and if so, how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of matched groups. In this simple example you'll get back just a list of strings, one string per match. If you had more groups, you'd get back a list of tuples, each containing the strings for each group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I removed the ^ and $ anchors from the pattern because if findall has to find more than one substring, it must be able to search anywhere in str. I also removed the +, so that each match is a single occurrence of one of the options in the pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
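The overwriting, and one way around it, can be sketched on the question's 'ab1' example. The sketch first validates the whole string with the original pattern, then splits it with findall using the same alternatives (with ab listed before a, as in the other answer). One assumption to keep in mind: findall scans left to right and does not backtrack across pieces, so for other sets of alternatives the split may differ from the one the anchored pattern used, or may skip characters.

```python
import re

text = 'ab1'

# The repeated group matched twice ('ab', then '1'),
# but only the last repetition is kept in group 1.
m = re.search(r'^(a|ab|1|2)+$', text)
whole = m.group(0)
last = m.group(1)
print(whole)  # ab1
print(last)   # 1

# If the whole string is valid, split it into pieces with findall.
pieces = re.findall(r'ab|a|1|2', text) if m else []
print(pieces)  # ['ab', '1']
```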
I think you don't need regexes for this problem;
you need some recursive graph search function.

Regular Expression in python

When the parentheses are used in the program below, the output is
['www.google.com'].
import re
teststring = "href=\"www.google.com\""
m = re.findall('href="(.*?)"', teststring)
print(m)
If the parentheses are removed from the findall call, the output is ['href="www.google.com"'].
import re
teststring = "href=\"www.google.com\""
m = re.findall('href=".*?"', teststring)
print(m)
It would be helpful if someone explained how this works.
The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:
0 capturing groups in the pattern (no (...) parentheses): the whole matched string ('href="www.google.com"' in your second example).
1 capturing group in the pattern: return the captured group ('www.google.com' in your first example).
more than 1 capturing group in the pattern: return a tuple of all matched groups.
Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:
>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]
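The three result shapes, plus the non-capturing variant, can be checked in one place. A small sketch using the question's test string:

```python
import re

teststring = 'href="www.google.com"'

# No capturing group: each list item is the whole match.
no_group = re.findall(r'href=".*?"', teststring)

# One capturing group: each list item is that group's contents.
one_group = re.findall(r'href="(.*?)"', teststring)

# Two capturing groups: each list item is a tuple of both groups.
two_groups = re.findall(r'(href=)"(.*?)"', teststring)

# A non-capturing group does not count, so this behaves like one group.
non_capturing = re.findall(r'(?:href=)"(.*?)"', teststring)

print(no_group)       # ['href="www.google.com"']
print(one_group)      # ['www.google.com']
print(two_groups)     # [('href=', 'www.google.com')]
print(non_capturing)  # ['www.google.com']
```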
