Python regex: unclear difference between repeating qualifier {n} and equivalent tuple

Why does xx yield something different from x{2}?
Please have a look at the following example:
import re
lines = re.findall(r'".*?"".*?"', '"x""y"')
print(lines) # yields: ['"x""y"']
lines = re.findall(r'(".*?"){2}', '"x""y"')
print(lines) # yields: ['"y"']

As per the documentation of findall, if the regex contains groups, it returns a list of those groups: tuples when there are two or more groups, plain strings when there is exactly one. Your two regexes are not merely xx versus x{2}; the second one is really (x){2}, which contains a capturing group, whereas the first has none.
Hence, "x" matches the group the first time, then "y" matches the group the second time. This fulfills your overall regex, but "y" overwrites "x" for the value of group 1.
The easiest way to solve this in your example is to convert your group to a non-capturing group: (?:".*?"){2}. If you want two groups, one for "x" and one for "y", you need to write the group out twice: (".*?")(".*?"). You can potentially use named groups to make the repeated groups easier to tell apart.
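Both fixes can be checked directly. This is a minimal sketch using the string from the question; the variable names are only illustrative.

```python
import re

text = '"x""y"'

# Non-capturing group: with no capturing groups in the pattern,
# findall falls back to returning the whole match.
whole = re.findall(r'(?:".*?"){2}', text)
print(whole)  # ['"x""y"']

# Two explicit capturing groups: findall returns one tuple per match.
pairs = re.findall(r'(".*?")(".*?")', text)
print(pairs)  # [('"x"', '"y"')]
```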

The first expression is "X and then Y, where Y accidentally matches the same thing as X".
The second expression is "(X){repeat two times}". Group 1 cannot contain XX, because group 1 does not match XX. It matches X.
In other words: the contents of a group do not change just because there is a quantifier outside the group.
One way to remedy the second expression is to add an outer group (and make the inner group non-capturing):
lines = re.findall(r'((?:".*?"){2})', '"x""y"')
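To see what each group actually holds, it can help to look at the Match object rather than at findall's output. A small sketch with the question's input (variable names are illustrative):

```python
import re

text = '"x""y"'

# Original pattern: group 0 is the whole match, group 1 only keeps
# the text from the last repetition of the group.
m = re.search(r'(".*?"){2}', text)
whole_match = m.group(0)
last_iteration = m.group(1)
print(whole_match)     # "x""y"
print(last_iteration)  # "y"

# Remedied pattern: the single capturing group now wraps both repetitions.
outer = re.findall(r'((?:".*?"){2})', text)
print(outer)  # ['"x""y"']
```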

About your second pattern (".*?"){2}:
A quote from the matching rules:
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
And findall does the following:
If one or more groups are present in the pattern, return a list of groups;
Your pattern (".*?"){2} means that (".*?") should match twice in a row, and according to the first rule, only the content of the last match is captured.
For your data findall finds the sequence (".*?"){2} only once, so it returns a list consisting of the last captured group for a single match: ['"y"'].
This example would make it more obvious:
import re
print (re.findall(r'(\d){2}', 'a12b34c56'))
# ['2', '4', '6']
You can see that findall finds the sequence (\d){2} three times and for each it returns the last captured content for the group (\d).
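If you need both the full two-digit match and the last captured digit, re.finditer exposes the Match objects that findall hides. A short sketch on the same input:

```python
import re

pattern = re.compile(r'(\d){2}')

# group(0) is the whole match; group(1) is the last repetition of (\d).
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer('a12b34c56')]
print(pairs)  # [('12', '2'), ('34', '4'), ('56', '6')]
```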
Now about your first pattern: ".*?"".*?".
This one does not contain any groups, and, according to the findall documentation again, in this case it returns:
all non-overlapping matches of pattern in string, as a list of strings.
So for your data it is ['"x""y"'].

AFAIK, findall() gives capture groups priority: if the regex contains any capture group, findall() returns only the capture group values.
Only when the regex contains no capture group does findall() return the full match values.
Therefore, if you want findall() to return the full match, you must not use a capture group in the regex. Use this instead:
(?:".*?"){2}
where (?: ... ) denotes a non-capturing group.
Thus, in Python:
print(re.findall(r'(?:".*?"){2}', '"x""y"'))

Related

difference between regular expression with and without group '( )'?

There are two different codes which produce two different result but I don't know how those differences arise.
>>>re.findall('[a-z]+','abc')
['abc']
and this one with group:
>>> re.findall('([a-z])+','abc')
['c']
why the second code yield character c ?

In your last regex pattern (([a-z])+), you are repeating a capturing group (()). A repeated capturing group keeps only its last iteration, so you get the last letter, which is c.
In your first pattern ([a-z]+), you are repeating a character class ([]) and there is no capturing group at all, so findall returns the whole match, which covers all the iterations.
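The difference comes down to where the quantifier sits relative to the parentheses. A brief sketch comparing the variants (variable names are illustrative):

```python
import re

s = 'abc'

# No group: findall returns the whole match.
no_group = re.findall(r'[a-z]+', s)
# Quantifier outside the group: only the last iteration is kept.
repeated_group = re.findall(r'([a-z])+', s)
# Quantifier inside the group: the group captures the whole run.
group_around_repeat = re.findall(r'([a-z]+)', s)

print(no_group)             # ['abc']
print(repeated_group)       # ['c']
print(group_around_repeat)  # ['abc']
```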

Extract substring using python re.match

I have a string as
sg_ts_feature_name_01_some_xyz
From this, I want to extract the two words that come after the prefix sg_ts, keeping the underscore separator between them.
It must be,
feature_name
This regex,
import re
st = 'sg_ts_my_feature_01'
a = re.match('sg_ts_([a-zA-Z_]*)_*', st)
print(a.group())
returns,
sg_ts_my_feature_
whereas, i expect,
my_feature

The problem is that you are asking for the whole match, not just the capture group. From the manual:
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.
and you asked for a.group() which is equivalent to a.group(0) which is the whole match. Asking for a.group(1) will give you only the capture group in the parentheses.

You can ask for the group surrounded by the parentheses, 'a.group(1)', which returns
'my_feature_'
In addition, if your string is always in this form, you could also anchor the pattern with the end-of-string character $ and make the inner match lazy instead of greedy (so it doesn't swallow the trailing _).
a = re.match('sg_ts_([a-zA-Z_]*?)[_0-9]*$',st)
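A quick check of the anchored pattern is sketched below. Note that it relies on everything after the captured words being digits or underscores, so it does not match the longer string from the question. The last pattern is a hypothetical alternative, not taken from the answers, that assumes each word consists of letters only.

```python
import re

st = 'sg_ts_my_feature_01'
m = re.match(r'sg_ts_([a-zA-Z_]*?)[_0-9]*$', st)
print(m.group(0))  # sg_ts_my_feature_01
print(m.group(1))  # my_feature

# The anchored pattern finds nothing when letters follow the digits.
long_st = 'sg_ts_feature_name_01_some_xyz'
anchored = re.match(r'sg_ts_([a-zA-Z_]*?)[_0-9]*$', long_st)
print(anchored)  # None

# Hypothetical alternative: capture exactly two letter-only words.
two_words = re.match(r'sg_ts_([a-zA-Z]+_[a-zA-Z]+)', long_st)
print(two_words.group(1))  # feature_name
```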

regular expression: may or may not contain a string

I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
    m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
    for svs_elem in m2:
        print(svs_elem)
main()
It prints blank lines... Based on my tests, the problem is in the (e-\d+)? part.

See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')

Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
['ab']
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
['b']
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
['ab']
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final cases, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, though, the capturing group itself doesn't match any text, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.
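One related detail: on a Match object, a group that did not participate in the match is reported as None, whereas findall substitutes an empty string for it. A small sketch:

```python
import re

m = re.search(r'a([bc])?', 'a')
print(m.group(0))  # a
print(m.group(1))  # None

# findall reports the same unmatched group as an empty string.
found = re.findall(r'a([bc])?', 'a')
print(found)  # ['']
```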

Python regexp: get all group's sequence

I have a regex like '^(a|ab|1|2)+$' and want to get the whole sequence of matched pieces.
For example, for re.search(reg, 'ab1') I want to get ('ab','1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if so, how?

Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back just a list of strings, one string per match. If you had more groups, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I'm removing the ^ and $ anchors from the pattern because if findall has to find more than one substring, it needs to search anywhere in str. I've also removed the + so that it matches single occurrences of the options in pattern.

Your original expression does match the way you want it to; it just matches the entire string and doesn't capture a separate group for each repetition. With a repetition operator ('+', '*', '{m,n}'), the group gets overwritten on each iteration, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
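Putting the two answers together, one possible approach is to validate the whole string with the anchored, repeated pattern and then tokenize it with findall. The sketch below assumes the alternatives are ordered longest-first ('ab' before 'a'); the helper name split_tokens is hypothetical. For other token sets, the left-to-right findall split may differ from the one the backtracking full match would use, so treat this as an illustration for this specific pattern.

```python
import re

# Alternatives from the question, with 'ab' listed before 'a'.
TOKEN = r'ab|a|1|2'

def split_tokens(s):
    """Return the list of tokens if s consists only of tokens, else None."""
    # Validate the whole string first (equivalent to ^(?:...)+$).
    if re.fullmatch(rf'(?:{TOKEN})+', s) is None:
        return None
    # No capturing groups here, so findall returns the whole matches.
    return re.findall(TOKEN, s)

print(split_tokens('ab1'))    # ['ab', '1']
print(split_tokens('a12ab'))  # ['a', '1', '2', 'ab']
print(split_tokens('abx'))    # None
```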

I don't think you need regexes for this problem;
you need some kind of recursive graph search function.

Regular Expression in python

When parentheses are used in the program below, the output is
['www.google.com'].
import re
teststring = "href=\"www.google.com\""
m = re.findall('href="(.*?)"', teststring)
print(m)
If the parentheses are removed from the findall pattern, the output is ['href="www.google.com"'].
import re
teststring = "href=\"www.google.com\""
m = re.findall('href=".*?"', teststring)
print(m)
It would be helpful if someone could explain how this works.

The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:
0 capturing groups in the pattern (no (...) parentheses): the whole matched string ('href="www.google.com"' in your second example).
1 capturing group in the pattern: return the captured group ('www.google.com' in your first example).
more than 1 capturing group in the pattern: return a tuple of all matched groups.
Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:
>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]
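The three cases are easier to compare side by side on an input with more than one match. In this sketch the second URL (www.python.org) is made up purely for illustration:

```python
import re

# The second href is an invented example value.
html = 'href="www.google.com" href="www.python.org"'

# 0 groups: list of whole matches.
no_groups = re.findall(r'href=".*?"', html)
# 1 group: list of the captured strings.
one_group = re.findall(r'href="(.*?)"', html)
# 2 groups: list of tuples, one tuple per match.
two_groups = re.findall(r'(href=)"(.*?)"', html)

print(no_groups)   # ['href="www.google.com"', 'href="www.python.org"']
print(one_group)   # ['www.google.com', 'www.python.org']
print(two_groups)  # [('href=', 'www.google.com'), ('href=', 'www.python.org')]
```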

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
