Regular Expression in python - python

When the parenthesis were used in the below program output is
['www.google.com'].
import re
teststring = "href=\"www.google.com\""
m=re.findall('href="(.*?)"',teststring)
print m;
If parenthesis is removed in findall function output is ['href="www.google.com"'].
import re
teststring = "href=\"www.google.com\""
m=re.findall('href=".*?"',teststring)
print m;
Would be helpful if someone explained how it works.

The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:
0 capturing groups in the pattern (no (...) parenthesis): the whole matched string ('href="www.google.com"' in your second example).
1 capturing group in the pattern: return the captured group ('www.google.com' in your first example).
more than 1 capturing group in the pattern: return a tuple of all matched groups.
Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:
>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]

Related

Apart from returning string and iterator in re.findall() and re.finditer() in python do their working also differ?

Wrote the following code so that i get all variable length patterns matching str_key.
line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"
regex = rf"({str_key})+"
find_all_found = re.findall(regex,line)
print(find_all_found)
find_iter_found = re.finditer(regex, line)
for i in find_iter_found:
print(i.group())
Output i got:
['ABCD', 'ABCD', 'ABCD']
ABCDABCDABCD
ABCD
ABCDABCDABCD
The intended output is last three lines printed by finditer(). I was expecting both functions to give me same output(list or callable does not matter). why it differs in findall() as far i understood from other posts already on stackoverflow, these two functions differ only in their return types and not in matching patterns. Do they work differently, if not what have i done wrong?
You want to access groups rather than group.
>>> find_iter_found = re.finditer(regex, line)
>>> for i in find_iter_found:
... print(i.groups()[0])
The difference between the two methods is explained here.
The behaviour of the two functions is pretty much the same as far as the matching process is concerned as per:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping
matches for the RE pattern in string. The string is scanned
left-to-right, and matches are returned in the order found. Empty
matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
For re.findall change your regex
regex = rf"({str_key})+"
into
regex = rf"((?:{str_key})+)".
The quantifier + have to inside the capture group.

Python regex: unclear difference between repeating qualifier {n} and equivalent tuple

Why does xx yield something different from x{2}?
Please have a look at the following example:
import re
lines = re.findall(r'".*?"".*?"', '"x""y"')
print(lines) # yields: ['"x""y"']
lines = re.findall(r'(".*?"){2}', '"x""y"')
print(lines) # yields: ['"y"']
As per the documentation of findall, if you have a group in the regex, it returns the list of those groups, either as a tuple for 2+ groups or as a string for 1 groups. In your case, your two regexes are not merely xx versus x{2}, but rather the second one is (x){2}, which has a group, when the first regex has no groups.
Hence, "x" matches the group the first time, then "y" matches the group the second time. This fulfills your overall regex, but "y" overwrites "x" for the value of group 1.
The easiest way to solve this in your example is to convert your group to a non-matching group: (?:".*?"){2}. If you want two groups, one for "x" and one for "y", you need to repeat the group twice: (".*?")(".*?"). You can potentially use named groups to simplify this repetition.
The first expression is "X and then Y, where Y accidentally matches the same thing as X".
The second expression is "(X){repeat two times}". Group 1 cannot contain XX, because group 1 does not match XX. It matches X.
In other words: Group contents does not change just because of a quantifier outside of the group.
One way to remedy the second expression is to make an outer group (and make the inner group non-capturing)
lines = re.findall(r'((?:".*?"){2})', '"x""y"')
About your second pattern (".*?"){2}:
A cite from the rules of matching
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
And findall does the following:
If one or more groups are present in the pattern, return a list of groups;
Your pattern (".*?"){2} means that (".*?") should match twice in a row, and according to the first rule, only the content of the last match is captured.
For your data findall finds the sequence (".*?"){2} only once, so it returns a list consisting of the last captured group for a single match: ['"y"'].
This example would make it more obvious:
import re
print (re.findall(r'(\d){2}', 'a12b34c56'))
# ['2', '4', '6']
You can see that findall finds the sequence (\d){2} three times and for each it returns the last captured content for the group (\d).
Now about your first pattern: ".*?"".*?".
This one does not contains subgroups, and, according to findall again, in this case it returns:
all non-overlapping matches of pattern in string, as a list of strings.
So for your data it is ['"x""y"'].
AFAIK, findall() is capture group first, if there is any capture group in the applied regex, then findall() returns only capture group values.
And only when there is no capture group in the applied regex, findall() returns fullmatch values.
Therefore, if you want findall() returns fullmatch value, then you must not use capture group in the regex like this
(?:".*?"){2}
in which (?: ... ) indicate non-capture group.
Thus, in python
print(re.findall(r'(?:".*?"){2}', '"x""y"'))

python regex: capturing group within OR

I'm using python and the re module to parse some strings and extract a 4 digits code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes and 1234, 5678, 0123 are the digits I need to grab. token A and B are just an example here. The prefix can be something like an address http://domain.com/ (tokenA) or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a tuple.
How can I modify the regex to achieve that? Thanks in advance.
You get tuples as a result since you have more than 1 capturing group in your regex. See re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
Since you have tokens in your regex, you can use them inside a group. Since only tokens differ, ([0-9]{4}) part is common for both, just use an alternation operator between tokens put into a non-capturing group:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but not capture tokenA or tokenB
([0-9]{4}) - match and capture into Group 1 four digits
IDEONE demo:
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']
Simply do this:
re.findall(r"token[AB](\d{4})", s)
Put [AB] inside a character class, so that it would match either A or B

regular expression: may or may not contain a string

I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
for svs_elem in m2:
print svs_elem
main()
It prints blank... Based on my test, the problem was in (e-\d+)? part.
See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here's are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
["ab"]
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
["b"]
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
["ab"]
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final case, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, the capturing group itself doesn't match any text, though, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get all sequence for this...
for example for re.search(reg, 'ab1') I want to get ('ab','1')
Equivalent result I can get with '^(a|ab|1|2)(a|ab|1|2)$' pattern,
but I don't know how many blocks been matched with (pattern)+
Is this possible, and if yes - how?
try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
print i
The ab option has been moved to be first, so it will match ab in favor of just a.
findall method matches your regular expression more times and returns a list of matched groups. In this simple example you'll get back just a list of strings. Each string for one match. If you had more groups you'll get back a list of tuples each containing strings for each group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I'm removing the ^$ signs from the pattern because if findall has to find more than one substring, then it should search anywhere in str. Then I've removed + so it matches single occurences of those options in pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
I think you don't need regexpes for this problem,
you need some recursial graph search function

Categories