Python regexp: get all group's sequence - python

I have a regex like '^(a|ab|1|2)+$' and want to get the whole sequence of substrings matched by the group.
For example, for re.search(reg, 'ab1') I want to get ('ab', '1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if so, how?

Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back a list of strings, one per match. If the pattern had more than one group, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
s = '7325189087'
res = re.compile(pattern).findall(s)
print(pattern, s, res)
I removed the ^ and $ anchors from the pattern because findall has to find more than one substring, so it must be able to search anywhere in s. I also removed the +, so that each match is a single occurrence of one of the alternatives in the pattern.

Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
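A minimal sketch of that behaviour and one way around it, assuming it is acceptable to validate the whole string first and then tokenize it with a second pattern. The names validator and tokenizer are made up for this illustration. Note that findall tokenizes greedily from left to right, which agrees with the repeated group here only because 'ab' is listed before 'a'.

```python
import re

text = 'ab1'

# The repeated group keeps only its final iteration.
m = re.search(r'^(a|ab|1|2)+$', text)
last_only = m.group(1)

# Validate the whole string without capturing, then split it into tokens.
validator = re.compile(r'(?:ab|a|1|2)+')   # hypothetical helper name
tokenizer = re.compile(r'ab|a|1|2')        # hypothetical helper name
tokens = tokenizer.findall(text) if validator.fullmatch(text) else []

print(last_only)  # 1
print(tokens)     # ['ab', '1']
```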

I don't think you need regexes for this problem;
you need a recursive graph search function.
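One possible reading of that suggestion, sketched as a plain recursive search over prefixes. The function name segmentations and the token lists are assumptions made for this example, not something the answer specified. It returns every way of splitting the text into the allowed tokens, which a single greedy regex pass cannot do.

```python
def segmentations(text, tokens):
    """Return every way to split text into a sequence of the given tokens."""
    if not text:
        return [[]]  # one way to split the empty string: no tokens
    results = []
    for tok in tokens:
        if text.startswith(tok):
            # Try this token as the next block, then split the remainder.
            for rest in segmentations(text[len(tok):], tokens):
                results.append([tok] + rest)
    return results

print(segmentations('ab1', ['a', 'ab', '1', '2']))
# [['ab', '1']]
print(segmentations('7325189087', ['7325189', '7325', '9087', '087', '18']))
# [['7325189', '087'], ['7325', '18', '9087']]
```

For long inputs with many tokens this can revisit the same suffix repeatedly, so caching results per suffix may be worth considering.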

Related

Regex group doesn't capture all of matched part of string [duplicate]

I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.
Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:
>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'
Why isn't the whole expected group being captured?
Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that would be returned from window.location.pathname in javascript, only without file extensions.
This will capture multiple repeated groups:
(/[a-zA-Z]+)*
However, as already discussed in another thread, quoting @ByteCommander:
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / with the pattern (/.*)/:
regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
You don't need the * between the two expressions here; also move the first / into the brackets:
>>> regex = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
>>>
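Another option, sketched here as an assumption rather than something the answers proposed: keep the original segment-by-segment structure, make the repeated group non-capturing, and wrap the whole repetition in one outer capturing group. The variable names parent and last are illustrative.

```python
import re

# The inner group is repeated but non-capturing; the outer group
# captures the entire run of "/segment" pieces in one go.
regex = re.compile(r'((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$')

match = regex.match('/foo/bar/baz')
parent, last = match.group(1), match.group(2)

print(parent)  # /foo/bar
print(last)    # baz
```

Unlike ([/a-zA-Z]+), this still requires every segment to be a slash followed by at least one letter, so a path with doubled slashes does not match.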
In this case, you may not need a regex.
You can simply use the split function.
text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])
output:
/foo/bar
text.split("/", 3) splits your string at the first three occurrences of /, and then you can join the desired elements.
As suggested by Niel, you can use a negative index to extract anything but the last part from a url (or a path).
In this case the generic approach would be :
text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])
Output:
/foo/bar/baz/boo
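For the generic case, splitting once from the right gives the same result without rebuilding the string. This is a small sketch using str.rsplit; the variable names are illustrative.

```python
text = "/foo/bar/baz/boo/bye"

# Split once, starting from the right: everything before the last "/"
# and the final component.
parent, last = text.rsplit("/", 1)

print(parent)  # /foo/bar/baz/boo
print(last)    # bye
```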

Python regex: unclear difference between repeating qualifier {n} and equivalent tuple

Why does xx yield something different from x{2}?
Please have a look at the following example:
import re
lines = re.findall(r'".*?"".*?"', '"x""y"')
print(lines) # yields: ['"x""y"']
lines = re.findall(r'(".*?"){2}', '"x""y"')
print(lines) # yields: ['"y"']
As per the documentation of findall, if you have a group in the regex, it returns a list of those groups, either as tuples for two or more groups or as strings for one group. In your case, your two regexes are not merely xx versus x{2}; the second one is (x){2}, which has a group, whereas the first regex has no groups.
Hence, "x" matches the group the first time, then "y" matches the group the second time. This fulfills your overall regex, but "y" overwrites "x" for the value of group 1.
The easiest way to solve this in your example is to convert your group to a non-capturing group: (?:".*?"){2}. If you want two groups, one for "x" and one for "y", you need to repeat the group twice: (".*?")(".*?"). You can potentially use named groups to simplify this repetition.
The first expression is "X and then Y, where Y accidentally matches the same thing as X".
The second expression is "(X){repeat two times}". Group 1 cannot contain XX, because group 1 does not match XX. It matches X.
In other words: a group's contents do not change just because of a quantifier outside the group.
One way to remedy the second expression is to make an outer group (and make the inner group non-capturing)
lines = re.findall(r'((?:".*?"){2})', '"x""y"')
About your second pattern (".*?"){2}:
A quote from the rules of matching:
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
And findall does the following:
If one or more groups are present in the pattern, return a list of groups;
Your pattern (".*?"){2} means that (".*?") should match twice in a row, and according to the first rule, only the content of the last match is captured.
For your data findall finds the sequence (".*?"){2} only once, so it returns a list consisting of the last captured group for a single match: ['"y"'].
This example would make it more obvious:
import re
print (re.findall(r'(\d){2}', 'a12b34c56'))
# ['2', '4', '6']
You can see that findall finds the sequence (\d){2} three times and for each it returns the last captured content for the group (\d).
Now about your first pattern: ".*?"".*?".
This one does not contain any groups and, according to the findall documentation again, in this case it returns:
all non-overlapping matches of pattern in string, as a list of strings.
So for your data it is ['"x""y"'].
AFAIK, findall() puts capture groups first: if there is any capture group in the regex, findall() returns only the capture group values.
Only when there is no capture group in the regex does findall() return the full match values.
Therefore, if you want findall() to return the full match, you must not use a capture group in the regex, like this:
(?:".*?"){2}
in which (?: ... ) indicates a non-capturing group.
Thus, in python
print(re.findall(r'(?:".*?"){2}', '"x""y"'))
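The three cases discussed in these answers can be put side by side. This is a small sketch using the same input as the question; the variable names are made up for illustration.

```python
import re

text = '"x""y"'

# No capturing group: findall returns the full matches.
no_group = re.findall(r'(?:".*?"){2}', text)

# One repeated capturing group: only the last iteration is kept.
one_group = re.findall(r'(".*?"){2}', text)

# Two separate capturing groups: one tuple per match.
two_groups = re.findall(r'(".*?")(".*?")', text)

print(no_group)    # ['"x""y"']
print(one_group)   # ['"y"']
print(two_groups)  # [('"x"', '"y"')]
```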

using regular expression to split string in python

With
re.compile(r"(.+?)\1+").findall('44442(2)2(2)44')
I get
['4','2(2)','4']
but how can I get
['4444','2(2)2(2)','44']
using a regular expression?
Thanks
No change to your pattern is needed. You just need to use the right function for the job. re.findall will return a list of groups if there are capturing groups in the pattern. To get the entire match, use re.finditer instead, so that you can extract the full match from each actual match object.
import re
pattern = re.compile(r"(.+?)\1+")
[match.group(0) for match in pattern.finditer('44442(2)2(2)44')]
With minimal change to OP's regular expression:
[m[0] for m in re.compile(r"((.+?)\2+)").findall('44442(2)2(2)44')]
findall will give you the full match if there are no groups, or groups if there are some. So given that you need groups for your regexp to work, we simply add another group to encompass the full match, and extract it afterwards.
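To make the difference concrete, here is a short sketch that prints what findall actually returns for the wrapped pattern next to the finditer approach. The variable names are illustrative.

```python
import re

s = '44442(2)2(2)44'

# With an extra outer group, findall returns (outer, inner) tuples.
pairs = re.findall(r'((.+?)\2+)', s)

# finditer exposes match objects, so group(0) is the full match
# and no extra group is needed.
full = [m.group(0) for m in re.finditer(r'(.+?)\1+', s)]

print(pairs)  # [('4444', '4'), ('2(2)2(2)', '2(2)'), ('44', '4')]
print(full)   # ['4444', '2(2)2(2)', '44']
```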
You can do:
[i[0] for i in re.findall(r'((\d)(?:[()]*\2*[()]*)*)', s)]
Here the Regex is:
((\d)(?:[()]*\2*[()]*)*)
This produces a list of tuples containing the two captured groups, and we are only interested in the first one, hence i[0].
Example:
In [15]: s
Out[15]: '44442(2)2(2)44'
In [16]: [i[0] for i in re.findall(r'((\d)(?:[()]*\2*[()]*)*)', s)]
Out[16]: ['4444', '2(2)2(2)', '44']

Python parentheses and returning only certain part of regex

I have a list of strings that I'm looping through. I have the following regular expression (item is the string I'm looping through at any given moment):
regularexpression = re.compile(r'set(\d+)e', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
What I want it to do is return numbers that have the word set before them and the letter e after them.
However, I also want it to return numbers that have set before them and x after them. If I use the following code:
regularexpression = re.compile(r'set(\d+)(e|x)', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
Instead of returning just the number, it also returns e or x. Is there a way to use parentheses to group my regular expression into bits without it returning everything in the parentheses?
Your example code seems fine already, but to answer your question, you can make a non-capturing group using the (?:) syntax, e.g.:
set(\d+)(?:e|x)
Additionally, in this specific example you can just use a character class:
set(\d+)[ex]
It appears you are looking at more than just .group(1); you have two capturing groups defined in your regular expression.
You can make the second group non-capturing by using (?:...) instead of (...):
regularexpression = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)
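A quick sketch of the character-class version applied to a list. The sample strings in items are invented for this example, and the None check is an addition, since re.search returns None when an item does not match.

```python
import re

regularexpression = re.compile(r'set(\d+)[ex]', re.IGNORECASE)

# Hypothetical sample data, not from the question.
items = ['set12e', 'SET7x', 'set3y', 'reset45E']

numbers = []
for item in items:
    match = regularexpression.search(item)
    # Guard against items that do not match at all.
    numbers.append(match.group(1) if match else None)

print(numbers)  # ['12', '7', None, '45']
```

Note that search also matches inside a longer word such as 'reset45E'; add a word boundary to the pattern if that is not wanted.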

Find the indexes of all regex matches?

I'm parsing strings that could have any number of quoted strings inside them (I'm parsing code, and trying to avoid PLY). I want to find out if a substring is quoted, and I have the substring's index. My initial thought was to use re to find all the matches and then figure out the range of indexes they represent.
It seems like I should use re with a regex like \"[^\"]+\"|'[^']+' (I'm avoiding dealing with triple quoted and such strings at the moment). When I use findall() I get a list of the matching strings, which is somewhat nice, but I need indexes.
My substring might be as simple as c, and I need to figure out if this particular c is actually quoted or not.
This is what you want: (source)
re.finditer(pattern, string[, flags])
Return an iterator yielding MatchObject instances over all
non-overlapping matches for the RE pattern in string. The string is
scanned left-to-right, and matches are returned in the order found. Empty
matches are included in the result unless they touch the beginning of
another match.
You can then get the start and end positions from the MatchObjects.
e.g.
[(m.start(0), m.end(0)) for m in re.finditer(pattern, string)]
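Applied to the question, the spans can be used to test whether a given index falls inside a quoted string. This is a minimal sketch using the simple pattern from the question (no escapes or triple quotes); the helper names quoted_spans and is_quoted and the sample text are assumptions for illustration.

```python
import re

# Pattern from the question: double-quoted or single-quoted runs.
QUOTED = re.compile(r'"[^"]+"|\'[^\']+\'')

def quoted_spans(text):
    """Return (start, end) for each quoted string; end is exclusive."""
    return [(m.start(), m.end()) for m in QUOTED.finditer(text)]

def is_quoted(text, index):
    """True if index lies strictly between a pair of quote characters."""
    return any(start < index < end - 1 for start, end in quoted_spans(text))

text = 'a = "c" + c'
print(quoted_spans(text))   # [(4, 7)]
print(is_quoted(text, 5))   # True
print(is_quoted(text, 10))  # False
```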
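To get the indices of all occurrences:
import re

S = input()  # source string
k = input()  # pattern to search for
pattern = re.compile(k)
r = pattern.search(S)
if not r:
    print("(-1, -1)")
while r:
    # end() is exclusive, so subtract 1 to report the last matched index
    print("({0}, {1})".format(r.start(), r.end() - 1))
    # resume one character past the previous start so overlapping matches are found
    r = pattern.search(S, r.start() + 1)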
This should solve your issue:
pattern=r"(?=(\"[^\"]+\"|'[^']+'))"
Then use the following to get the indices of all matches, including overlapping ones (text is the string being searched):
indicesTuple = [(mObj.start(1), mObj.end(1) - 1) for mObj in re.finditer(pattern, text)]
