Regex group doesn't capture all of matched part of string [duplicate] - python

This question already has an answer here:
Why Does a Repeated Capture Group Return these Strings?
(1 answer)
Closed 1 year ago.
I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.
Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:
>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'
Why isn't the whole expected group being captured?
Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that would be returned from window.location.pathname in JavaScript, only without file extensions.

This will capture multiple repeated groups:
(/[a-zA-Z]+)*
However, as already discussed in another thread, quoting from @ByteCommander:
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / via the pattern (/.*)/:
regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
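To make the difference between repeating a capture group and capturing a repetition concrete, here is a minimal sketch. The variable names are illustrative, and the second pattern is just one possible alternative to the .* approach: it wraps the repetition in a non-capturing group and captures the whole run, assuming path segments consist only of ASCII letters as in the question.

```python
import re

path = "/foo/bar/baz"

# Repeated capture group: group 1 is overwritten on every iteration,
# so only the text of the last iteration is kept.
repeated = re.match(r"(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$", path)
print(repeated.group(1))   # /bar

# Capture the repetition itself: the inner group is non-capturing (?:...),
# and the outer group spans all iterations.
wrapped = re.match(r"((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$", path)
print(wrapped.group(1))    # /foo/bar
print(wrapped.group(2))    # baz
```

Unlike (/.*)/, the wrapped variant still rejects paths whose leading segments contain characters other than letters.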

You don't need the * between the two expressions here; also move the first / into the brackets:
>>> regex = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
>>>

In this case, you may not need a regex at all.
You can simply use the split function.
text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])
output:
/foo/bar
text.split("/", 3) splits your string at most three times, i.e. up to the third occurrence of /, and then you can join the desired elements.
As suggested by Niel, you can use a negative index to extract anything but the last part from a url (or a path).
In this case the generic approach would be :
text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])
Output:
/foo/bar/baz/boo
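As a variation on the split/join idea above, str.rpartition splits once at the last separator, which avoids rebuilding the string. This is only a sketch; the variable names head, sep, and tail are illustrative.

```python
text = "/foo/bar/baz/boo/bye"

# rpartition splits at the LAST "/" and returns (before, separator, after).
head, sep, tail = text.rpartition("/")
print(head)  # /foo/bar/baz/boo
print(tail)  # bye
```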

Related

Return all matches for a capturing group in Python [duplicate]

This question already has answers here:
Capturing repeating subpatterns in Python regex
(4 answers)
Closed 3 years ago.
I am implementing a method that takes in a regex pattern like r'(\w+/)+end' and a string 'ab/cd/ef/end'. Note that I cannot request the caller of the method to update their pattern format. Within the method, I need to perform an operation that requires extracting all matches of the first capturing group, i.e. ab/, cd/, and ef/.
How do I accomplish this in Python? Something like below returns a tuple of last-matches for each of capturing groups. We have just one in this example, so it returns ('ef/',).
re.match(r'(\w+/)+end', 'ab/cd/ef/end').groups()
By the way, in C#, every capturing group can match multiple strings, e.g. Regex.Match("ab/cd/ef/end", @"(\w+/)+end").Groups[1].Captures will return all three matches for the first capturing group (\w+/).
If you just want to capture all path names which are followed by a separator, then use the pattern \w+/ with re.findall:
inp = "ab/cd/ef/end"
matches = re.findall(r'\w+/', inp)
print(matches)
['ab/', 'cd/', 'ef/']
If instead you want all path components, whether or not they be preceded by a path separator, then we can try:
inp = "ab/cd/ef/end"
matches = re.findall(r'[^/]+', inp)
print(matches)  # ['ab', 'cd', 'ef', 'end']
r = r"(\w+/)(?<!end)"
s = "ab/cd/ef/end"
m = re.finditer(r, s, re.MULTILINE)
for g in m:
    print(g.group())
Example:
https://regex101.com/r/VJ6knI/1

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I could work around this by creating a group for each expression in the "or", but I want the regex to output a single group (to keep the code independent of the regex).
I could also use a second regex step to extract the timestamp from the group that includes the quotation marks, but again that would break the code abstraction.
I could omit the "" in the regex, but that would also match a timestamp in the middle of the message; I want it to capture the timestamp only when it is the value of a key or at the beginning of the document, followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noted by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (which I haven't figured out), probably with the same solution!
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
See the regex demo.
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a " char immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
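A quick check of the lookaround pattern against the two document kinds from the question might look like the sketch below. The samples list is illustrative; only the pattern itself comes from the answer.

```python
import re

rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

samples = [
    '1545994641 INFO: ...',
    '{"deliveryDate":"1545994641","error"..."}',
]

# group(0) is the whole match; the lookarounds check the surrounding
# characters without including them in the match.
timestamps = [rx.search(s).group(0) for s in samples]
print(timestamps)  # ['1545994641', '1545994641']
```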
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
Considering that if there is an opening " there must also be a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))

python: re.search doesn't start at beginning of string?

I'm working on a Flask API, which takes the following regex as an endpoint:
([0-9]*)((OK)|(BACK)|(X))*
That means I'm expecting a series of numbers, and the OK, BACK, X keywords multiple times in succession after the numbers.
I want to split this regex and do different stuff depending which capture groups were present.
My approach was the following:
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]*)", str(endp), re.I)
if match:
    n = match.groups()
    logging.info('nums: ' + str(n[0]))
match = re.search(r"((OK)|(BACK)|(X))*", str(endp), re.I)
if match:
    s1 = match.groups()
    for i in s1:
        logging.info('str: ' + str(i[0]))
Using the /12OK endpoint, getting the numbers works just fine, but for some reason capturing the rest of the keywords is unsuccessful. I tried reducing the second capture group to only
match = re.search(r"(OK)*", str(endp), re.I)
I constantly find the following in s1 (using the reduced regex):
(None,)
originally (with the rest of the keywords):
(None, None, None, None)
Which I suppose means the regex pattern does not match anything in my endp string (why does it have 4 Nones? 1 for each keyword, but what is the 4th there for?). I validated my endpoint (and the regex against the same string) with a regex validator, and it seems fine to me. I understand that re.match is supposed to match from the beginning, so I used re.search instead, as the documentation says it matches anywhere in the string.
What am I missing here? Please advise, I'm a beginner in the python world.
Indeed it is a bit surprising that searching with * returns None:
>>> re.search("(OK|BACK|X)*", u'/12OK').groups()
(None,)
But it's "correct", since * matches zero or more, and any pattern matches zero times in any string, that's why you see None. Searching with + somewhat solves it:
>>> re.search("(OK|BACK|X)+", u'/12OK').groups()
('OK',)
But now, searching with this pattern in /12OKOK still returns only one captured value: + lets the group repeat, so the whole match is OKOK, but a repeated group keeps only its last capture. To find all occurrences you need to use re.findall:
>>> re.findall("(OK|BACK|X)", u'/12OKOK')
['OK', 'OK']
With those findings, your code would look as follows: (note that you don't need to write i[0] since i is already a string, unless you want to log only the first char of the string):
import logging
import re
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]+)", str(endp))
if match:
    n = match.groups()
    logging.info('nums: ' + str(n))
match = re.findall(r"(OK|BACK|X)", str(endp), re.I)
for i in match:
    logging.info('str: ' + str(i))
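To see why the * version yields None, it can help to look at where the match actually happens. The following sketch (the variable name m is illustrative) shows that re.search succeeds immediately at position 0 with an empty match, so the group never participates:

```python
import re

# At position 0 the character is "/", which none of the alternatives match.
# "*" allows zero repetitions, so the search succeeds right there with an
# empty match and never reaches the "OK" further along the string.
m = re.search(r"(OK|BACK|X)*", "/12OK")
print(m.span())          # (0, 0)
print(repr(m.group(0)))  # ''
print(m.groups())        # (None,)
```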
If you want to match at least ONE of the groups, use + instead of *.
>>> endp = '/12OK'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> if match:
...     print(match.groups())
...
('OK', 'OK', None, None)
>>> endp = '/12X'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('X', None, None, 'X')
Notice that you have 4 capturing groups in your expression, one for each pair of parentheses. The first group is the outer pair of parentheses and the second is the first of the nested groups. In the second example, you still get a value for the outer group, and the last value belongs to the third of the nested groups.
"((OK)|(BACK)|(X))*" will search for OK or BACK or X, 0 or more times. Note that the * means 0 or more, not more than 0. The above expression should have a + at the end not * as + means 1 or more.
I think you're having two different issues, and their intersection is causing more confusion than either of them would cause on their own.
The first issue is that you're using repeated groups. Python's re library is not able to capture multiple matches when a group is repeated. Matching with a pattern like (X)+ against 'XXXX' will only capture a single 'X' in the first group even though the whole string will be matched. The regex library (which is not part of the standard library) can do multiple captures, though I'm not sure of the exact commands required.
The second issue is using the * repetition operator in your pattern. The pattern you show at the top of the question will match an empty string. Obviously, none of the groups will capture anything in that situation (which may be why you're seeing a lot of None entries in your results). You probably need to modify your pattern so that it requires some minimal amount of valid text to count as a match. Using + instead of * might be one solution, but it's not clear to me exactly what you want to match against, so I can't suggest a specific pattern.
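Putting the points from these answers together, one possible approach is to validate the whole endpoint once and then list the keywords with findall. This is a sketch under assumptions: parse_endpoint, ENDPOINT_RE, and KEYWORD_RE are hypothetical names, the leading slash is treated as optional, and the endpoint is assumed to be a plain str.

```python
import re

# Digits first, then the whole run of keywords captured as ONE group
# (the alternation inside is non-capturing, so nothing gets overwritten).
ENDPOINT_RE = re.compile(r"/?([0-9]*)((?:OK|BACK|X)*)", re.I)
KEYWORD_RE = re.compile(r"OK|BACK|X", re.I)

def parse_endpoint(endp):
    """Return (digits, [keywords]) or None if endp has an unexpected shape."""
    m = ENDPOINT_RE.fullmatch(endp)
    if m is None:
        return None
    # Split the captured keyword run into its individual keywords.
    return m.group(1), KEYWORD_RE.findall(m.group(2))

print(parse_endpoint("/12OKBACKX"))  # ('12', ['OK', 'BACK', 'X'])
print(parse_endpoint("/12"))         # ('12', [])
print(parse_endpoint("/abc"))        # None
```

Using fullmatch means trailing garbage is rejected instead of being silently ignored, which is usually what you want when routing on the result.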

How to match length of variable size backreference but not content

Currently I'm trying to write a regular expression (using Python's re module) that will find occurrences of 'a' in a string of a given length. There are a few different patterns I'm trying to match, but the ones that are giving me trouble look like this:
a.a.a
a..a..a
a...a...a
Basically I'm trying to find matches that contain at least three occurrences of 'a', but they must be equally spaced apart. So far I've tried regexes:
regex1 = r'a(.|..|...)a\1a'
regex2 = r'a(.{1,3})a\1a'
But the problem I'm having is that the backreference repeats the matched text. So, for example, my regex will match #1 but not #2,
1. aoooaoooa
2. aoooabbba
when in actuality I don't care about the content between occurrences of 'a', simply the distance.
I know backreferences can be used to match the same unknown text multiple times, but I suppose I don't know enough to tell whether there's just a different way to use them, or whether I should be using some other method/pattern entirely. Tips?
Thanks in advance!
If you install the regex module from PyPI, you can use its subpattern recursion feature. Just wrap the repeating part in a capture group, and then use (?n), where n is the capture group number.
>>> import regex
>>> a = "aoooaoooa"
>>> b = "aoooabbba"
>>> rx = r"a(.{1,3})a(?1)a"
>>> print(regex.search(rx, a).group(0))
aoooaoooa
>>> print(regex.search(rx, b).group(0))
aoooabbba
>>> print(regex.search(rx, "abacca").group(0))
abacca
Explanation:
a - matches a literal a
(.{1,3}) - matches and captures into Group 1 one to three characters other than a newline
a - matches a literal a
(?1) - a recursive construct telling the regex engine to reuse the pattern of Group 1 (i.e. .{1,3}) rather than the value it captured
a - matches a literal a
The PyPI regex module does not support balanced constructs (.NET does), so you will have to add more code to check whether the matched groups are of equal length. Fortunately, the regex module keeps all captured submatches and exposes them through the captures() method. So, all you need to do to exclude abacca from the valid matches is:
c = "abacca"
m = regex.search(rx, c)
lengths = {len(s) for s in m.captures(1)}
if len(lengths) == 1:  # all captures have the same length?
    print(m.group(0))
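If installing a third-party module is not an option, a different technique works with the standard re module: since the gap can only be 1 to 3 characters, build one pattern per gap length and try each in turn. This is a sketch, not the answer's method; find_equally_spaced and its parameters are hypothetical names, and the maximum gap of 3 is taken from the question's examples.

```python
import re

def find_equally_spaced(text, char="a", max_gap=3):
    """Return (gap, matched_text) for the first char.{n}char.{n}char hit, else None."""
    c = re.escape(char)
    for gap in range(1, max_gap + 1):
        # For gap == 3 this builds: a.{3}a.{3}a
        # The gap length is fixed per pattern, but the content is free.
        pattern = f"{c}.{{{gap}}}{c}.{{{gap}}}{c}"
        m = re.search(pattern, text)
        if m:
            return gap, m.group(0)
    return None

print(find_equally_spaced("aoooaoooa"))  # (3, 'aoooaoooa')
print(find_equally_spaced("aoooabbba"))  # (3, 'aoooabbba')
print(find_equally_spaced("abacca"))     # None
```

Note that . also matches the target character itself, so a string like "aaaaa" counts as a gap-1 match.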

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get all sequence for this...
for example for re.search(reg, 'ab1') I want to get ('ab','1')
Equivalent result I can get with '^(a|ab|1|2)(a|ab|1|2)$' pattern,
but I don't know how many blocks will be matched by (pattern)+
Is this possible, and if yes - how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back a plain list of strings, one per match. If the pattern had more groups, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I'm removing the ^$ signs from the pattern because if findall has to find more than one substring, then it should search anywhere in str. Then I've removed + so it matches single occurrences of those options in pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
I think you don't need regular expressions for this problem;
you need some kind of recursive graph search function.
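The recursive search suggested above could look roughly like the following backtracking sketch. The function name segment and its parameters are hypothetical, and the options are assumed to be non-empty literal strings (not arbitrary regexes). Unlike findall, it backtracks when an early choice leads to a dead end, so the order of the options does not have to be tuned by hand.

```python
def segment(text, options):
    """Return one list of options that concatenates to text, or None if impossible."""
    if not text:
        return []
    for opt in options:
        if text.startswith(opt):
            # Tentatively take this option, then try to segment the remainder.
            rest = segment(text[len(opt):], options)
            if rest is not None:
                return [opt] + rest
    # No option leads to a full segmentation from this position.
    return None

print(segment("ab1", ["a", "ab", "1", "2"]))  # ['ab', '1']
print(segment("7325189087", ["7325189", "7325", "9087", "087", "18"]))  # ['7325189', '087']
print(segment("abx", ["a", "ab", "1", "2"]))  # None
```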
