python: re.search doesn't start at beginning of string? - python

I'm working on a Flask API, which takes the following regex as an endpoint:
([0-9]*)((OK)|(BACK)|(X))*
That means I'm expecting a series of numbers, and the OK, BACK, X keywords multiple times in succession after the numbers.
I want to split this regex and do different stuff depending which capture groups were present.
My approach was the following:
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]*)", str(endp), re.I)
if match:
    n = match.groups()
    logging.info('nums: ' + str(n[0]))
match = re.search(r"((OK)|(BACK)|(X))*", str(endp), re.I)
if match:
    s1 = match.groups()
    for i in s1:
        logging.info('str: ' + str(i[0]))
Using the /12OK endpoint, getting the numbers works just fine, but for some reason capturing the rest of the keywords is unsuccessful. I tried reducing the second capture group to only
match = re.search(r"(OK)*", str(endp), re.I)
I constantly find the following in s1 (using the reduced regex):
(None,)
originally (with the rest of the keywords):
(None, None, None, None)
Which I suppose means the regex pattern does not match anything in my endp string (why does it have 4 Nones? 1 for each keyword, but what the 4th is there for?). I validated my endpoint (the regex against the same string too) with a regex validator, it seems fine to me. I understand that re.match is supposed to get matches from the beginning, therefore I used the re.search method, as the documentation points out it's supposed to match anywhere in the string.
What am I missing here? Please advise, I'm a beginner in the python world.

Indeed it is a bit surprising that searching with * returns None:
>>> re.search("(OK|BACK|X)*", u'/12OK').groups()
(None,)
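But it's "correct": * matches zero or more repetitions, so the pattern can match the empty string anywhere, and re.search stops at the first position where the pattern succeeds, which is position 0. The group takes part in zero repetitions there, which is why you see None. Searching with + somewhat solves it: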
>>> re.search("(OK|BACK|X)+", u'/12OK').groups()
('OK',)
But now, searching with this pattern in /12OKOK still only reports one 'OK' in the group: + matches both occurrences, but a repeated capture group keeps only its last repetition. To find all occurrences you need to use re.findall:
>>> re.findall("(OK|BACK|X)", u'/12OKOK')
['OK', 'OK']
With those findings, your code would look as follows (note that you don't need to write i[0] since i is already a string, unless you want to log only the first character of the string):
import logging
import re
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]+)", str(endp))
if match:
    n = match.groups()
    logging.info('nums: ' + str(n))
match = re.findall(r"(OK|BACK|X)", str(endp), re.I)
for i in match:
    logging.info('str: ' + str(i))
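As a rough sketch of how the two steps could be combined, the helper below validates the whole endpoint first and then lists the keywords. The name parse_endpoint, the optional leading slash and the exact patterns are assumptions made for illustration, not part of the original code.

```python
import re

# Hypothetical helper: names and patterns are assumptions for illustration.
# The keyword run is wrapped in a non-capturing group so that the second
# capture holds the whole run instead of only its last repetition.
ENDPOINT_RE = re.compile(r"/?([0-9]+)((?:OK|BACK|X)*)", re.I)
KEYWORD_RE = re.compile(r"OK|BACK|X", re.I)

def parse_endpoint(endp):
    """Return (digits, [keywords]), or None if endp has a different shape."""
    m = ENDPOINT_RE.fullmatch(endp)
    if m is None:
        return None
    digits, tail = m.groups()
    # findall returns every keyword occurrence in order.
    return digits, KEYWORD_RE.findall(tail)

print(parse_endpoint("/12OKBACKX"))
```

Validating with fullmatch first means a stray keyword elsewhere in the string is not picked up by findall.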

If you want to match at least ONE of the groups, use + instead of *.
>>> endp = '/12OK'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> if match:
...     print(match.groups())
...
('OK', 'OK', None, None)
>>> endp = '/12X'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('X', None, None, 'X')
Notice that you have 4 matching groups in your expression, one for each pair of parentheses. The first match is the outer parenthesis and the second one is the first of the nested groups. In the second example, you still get the first match for the outer parenthesis and then the last one is the third of the nested ones.
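To make the group numbering concrete, here is a small sketch that compares the original pattern with a variant whose alternatives are non-capturing. The variable names are made up for the example.

```python
import re

# Four pairs of capturing parentheses, numbered by their opening bracket.
capturing = re.compile(r"((OK)|(BACK)|(X))+", re.I)
# Same alternatives, but only the outer pair captures.
non_capturing = re.compile(r"((?:OK|BACK|X))+", re.I)

print(capturing.groups)       # number of capturing groups in the pattern
print(non_capturing.groups)

m = non_capturing.search("/12X")
print(m.groups())
```

With only one capturing group, groups() has a single entry, so there are no extra None values to interpret.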

"((OK)|(BACK)|(X))*" will search for OK or BACK or X, 0 or more times. Note that the * means 0 or more, not more than 0. The above expression should have a + at the end not * as + means 1 or more.

I think you're having two different issues, and their intersection is causing more confusion than either of them would cause on their own.
The first issue is that you're using repeated groups. Python's re library is not able to capture multiple matches when a group is repeated. Matching with a pattern like (X)+ against 'XXXX' will only capture a single 'X' in the first group even though the whole string will be matched. The regex library (which is not part of the standard library) can do multiple captures, though I'm not sure of the exact commands required.
The second issue is using the * repetition operator in your pattern. The pattern you show at the top of the question will match on an empty string. Obviously, none of the groups will capture anything in that situation (which may be why you're seeing a lot of None entries in your results). You probably need to modify your pattern so that it requires some minimal amount of valid text to count as a match. Using + instead of * might be one solution, but it's not clear to me exactly what you want to match against so I can't suggest a specific pattern.
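Both issues can be reproduced with the standard re module alone. The snippet below is only an illustration; the variable names are invented, and re.finditer is shown as one possible way to visit each keyword separately.

```python
import re

# Issue 1: a repeated group keeps only its last repetition.
m = re.search(r"(X)+", "XXXX")
whole = m.group(0)   # the full match
last = m.group(1)    # only the final 'X'

# Issue 2: * lets the search succeed with an empty match at position 0,
# so the group never takes part in the match.
empty = re.search(r"(OK)*", "/12OK")

# One match object per keyword occurrence, with its start offset.
hits = [(h.start(), h.group()) for h in re.finditer(r"OK|BACK|X", "/12OKBACKX")]

print(whole, last)
print(repr(empty.group(0)), empty.groups())
print(hits)
```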

Related

Regex group doesn't capture all of matched part of string [duplicate]

This question already has an answer here:
Why Does a Repeated Capture Group Return these Strings?
I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.
Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:
>>> import re
>>> regex = re.compile('(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$');
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'
Why isn't the whole expected group being captured?
Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that would be returned from window.location.pathname in javascript, only without file extensions.
This group is repeated, so it matches several path segments in a row:
(/[a-zA-Z]+)*
However, as already discussed in another thread, quoting from @ByteCommander
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / via the pattern (/.*)/
regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
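A quick check of the greedy version, as a sketch. The second call is an added observation rather than something from the thread: because this pattern needs two slashes, a single-segment path no longer matches.

```python
import re

regex = re.compile(r"(/.*)/([a-zA-Z]+)\.?$")

m = regex.match("/foo/bar/baz")
print(m.group(1))   # everything before the last slash
print(m.group(2))   # the last path segment

# A path with one segment has no second slash for the pattern to use.
single = regex.match("/baz")
print(single)
```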
Don't need the * between the two expressions here, also move the first / into the brackets:
>>> regex = re.compile('([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
>>>
In this case, you may not need a regex.
You can simply use the split function.
text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])
output:
/foo/bar
text.split("/", 3) splits your string at most three times, at the first three occurrences of /, and then you can join the desired elements.
As suggested by Niel, you can use a negative index to extract anything but the last part from a url (or a path).
In this case the generic approach would be :
text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])
Output:
/foo/bar/baz/boo
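A related option, not mentioned in the answers above, is str.rsplit with a maximum of one split, which separates the last segment from everything before it in one step. This is a small sketch with made-up variable names.

```python
text = "/foo/bar/baz/boo/bye"

# Split once, starting from the right: parent path and last segment.
parent, last = text.rsplit("/", 1)

print(parent)
print(last)
```

This assumes the string contains at least one slash; otherwise rsplit returns a single element and the unpacking fails.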

Which pattern was matched among those that I passed through a regular expression?

I am using regexp with python and the library re. The regular expression I am passing contains many possible variations of a string, such as:
myRExp = ("aaaaa|bbbbb|ccccc|ddddd")
This is what I am doing to match the full regular expression
# read a file with two columns
df = pd.read_csv('a_file.csv')
# get second column and create a unique regular expression
myRExp = "|".join(df[df.columns[1]])
# now test if line contains myRExp
if re.match(myRExp, line):
    # get the actual matching pattern and do something with it
What I need to do is to know which substring from myRExp was actually matching the line, i.e. which one between "aaaaa", "bbbbb", "ccccc" or "ddddd", matched?
EDIT:
Let's go with the example. This is my regular expression:
>>> linE = 'zzzzbbdbbxxx'
>>> myRExp = "(aa[a|b]a)|(bb[c|d]bb)|(ccc[d|c]c)"
by re.match() I can now match it and get this output (note that I am using search to make my point here):
# do we have a match? (yes)
>>> matched = re.search(myRExp, linE)
# show groups: I partially care
>>> matched.groups(0)
(0, 'bbdbb', 0)
At this point, what I need is the index of the regular expression that matched: the match was (bb[c|d]bb), then the output should be 2, i.e. the index of that regular expression group in myRExp:
index of matched.groups(0) in myRExp
Is there any way of obtaining the index?
Grab the "match object" returned by the regex call, and you can examine it:
m = re.match(myRExp, line)
if m:
print("Matched", m.group(0))
This will show you the part of your string that matched, which in this case is the simplest way to get what you are after.
If your regex contains groups and you want to know exactly which of the groups matched, use m.groups() instead:
>>> probe = "(orange)|(or)|(or.*)"
>>> m = re.match(probe, 'order')
>>> m.groups()
(None, 'or', None)
There should only be one value that is not None, so you can take its index and look up the regex in your list of regex substrings. Here's one way to find the index with a one-liner:
>>> match_index = list(map(bool, m.groups())).index(True)
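Another way to get the index is the match object's lastindex attribute. The sketch below assumes every alternative is wrapped in exactly one capturing group and contains no capturing groups of its own; the names alternatives and which_alternative are hypothetical.

```python
import re

# Assumption: no alternative contains its own capturing group.
# [ab] is used instead of [a|b], since | inside a character class is literal.
alternatives = ["aa[ab]a", "bb[cd]bb", "ccc[dc]c"]
pattern = re.compile("|".join("(" + alt + ")" for alt in alternatives))

def which_alternative(line):
    """Return (1-based index of the alternative, matched text) or None."""
    m = pattern.search(line)
    if m is None:
        return None
    # lastindex is the number of the last capturing group that matched.
    return m.lastindex, m.group(0)

print(which_alternative("zzzzbbdbbxxx"))
```

Unlike the bool-based lookup, this also works when the matching group captured an empty string.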
I would suggest using this website.
There you can tinker with and adapt your regular expressions and get visual feedback on what is matched when you provide test strings. The syntax is also documented, for the rare case you forget some commands ;)

How to print substring using RegEx in Python?

These are two texts:
1) 'provider:sipoutilp1.ym.ms'
2) 'provider:sipoutqtm.ym.ms'
I would like to print ilp when it reaches the first line and qtm when it reaches the second line.
This is my solution, but it is not working.
RE_PROVIDER = re.compile(r'(?P<provider>\((ilp+|qtm+)')
Or, for the line below,
182938,DOMINICAN REPUBLIC-MOBILE
to get DOMINICAN REPUBLIC, can I use the same approach with re.compile?
Thank you for any help.
Your regex is not correct because it has an escaped open parenthesis before your keywords, and there is no such character in your lines.
As a more general approach, you can capture the alphabetical characters after sipout or provider:sipout.
>>> s1 = 'provider:sipoutilp1.ym.ms'
>>> s2 = 'provider:sipoutqtm.ym.ms'
>>> RE_PROVIDER = re.compile(r'(?P<provider>(?<=sipout)(ilp|qtm))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
(?<=sipout) is a positive look-behind, which makes the regex engine match the pattern only where it is preceded by sipout.
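The sketch below illustrates two properties of look-behind in Python's re module: the look-behind text is not part of the match, and the look-behind must have a fixed width. The variable names are made up for the example.

```python
import re

s1 = "provider:sipoutilp1.ym.ms"

# The look-behind checks for 'sipout' but does not consume it.
m = re.search(r"(?<=sipout)(?P<provider>ilp|qtm)", s1)
print(m.group(0))   # only the keyword, without the prefix
print(m.start())    # offset of the keyword in s1

# A look-behind whose alternatives differ in width is rejected by re.
try:
    re.compile(r"(?<=sipout|\d+,)(ilp|qtm)")
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)
```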
After edit:
If you want to match multiple strings with different structures, you have to allow alternative preceding patterns before your keywords, and because you cannot use variable-length patterns within a look-behind, you cannot use one for this. So instead you can use a capture-group trick.
You can define the alternative preceding patterns within a non-capturing group and your keyword within a capture group, then after matching get the second group (group(1); group(0) is the whole match).
>>> RE_PROVIDER = re.compile(r'(?:sipout|\d+,)(?P<provider>(ilp|qtm|[A-Z\s]+))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
>>> s3 = "182938,DOMINICAN REPUBLIC-MOBILE"
>>> RE_PROVIDER.search(s3).groupdict()
{'provider': 'DOMINICAN REPUBLIC'}
Note that groupdict doesn't work in this case because it will return

Searching for multiple substrings of unknown size in string in python

I've seen lots of RE stuff in python but nothing for the exact case and I can't seem to get it. I have a list of files with names that look like this:
summary_Cells_a_01_2_1_45000_it_1.txt
summary_Cells_a_01_2_1_40000_it_2.txt
summary_Cells_bb_01_2_1_36000_it_3.txt
The "summary_Cells_" is always present. Then there is a string of letters, either 1, 2 or 3 long. Then there is "_01_2_1_" always. Then there is a number between 400 and 45000. Then there is "it" and then a number from 0-9, then ".txt"
I need to extract the letter(s) piece.
I was trying:
match = re.search('summary_Cells_(\w)_01_2_1_(\w)_it_(\w).txt', filename)
but was not getting anything for the match. I'm trying to get just the letters, but later might want the it number (last number) or the step (the middle number).
Any ideas?
Thanks
You're missing repetitions, i.e.:
re.search('summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt', filename)
\w will only match a single character
\w+ will match at least one
\w* will match any amount (0 or more)
Reference: Regular expression syntax
You were almost there. All you need to do is repeat the \w inside each capture group:
summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt
Example usage
>>> filename="summary_Cells_a_01_2_1_45000_it_1.txt"
>>> match = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt', filename)
>>> match.group()
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(0)
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(1)
'a'
>>> match.group(2)
'45000'
>>> match.group(3)
'1'
Note
match.group(n) returns the value captured by the nth capture group.
You don't need a regex, there is nothing complex about the pattern and it does not change:
s = "summary_Cells_a_01_2_1_45000_it_1.txt"
print(s.split("_")[2])
a
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
print(s.split("_")[2])
bb
If you want both sets of letters:
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
spl = s.split("_")
a,b = spl[2],spl[7]
print(a,b)
('bb', 'it')
Since you only want to capture the letters at the beginning, you could do:
re.search('summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9].txt', filename)
Which doesn't bother giving you the groups you don't need.
[0-9] matches a single digit and [0-9]{3,6} allows for 3 to 6 digits.
You're on the right track with your regex, but as everyone else forgets, \w includes alphanumerics and the underscore, so you should use [a-z] instead.
re.search(r"summary_Cells_([a-z]+)_\w+\.txt", filename)
Or, as Padraic mentioned, you can just use str.split("_").
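Putting the suggestions together, one possible sketch uses named groups and stricter character classes so that the later fields are available too. The helper name parse_filename and the group names are assumptions, as is the choice to accept lowercase letters only (as in the sample file names).

```python
import re

# Hypothetical pattern: group names and the lowercase-only class are assumptions.
FILENAME_RE = re.compile(
    r"summary_Cells_(?P<letters>[a-z]{1,3})_01_2_1_"
    r"(?P<step>[0-9]+)_it_(?P<it>[0-9])\.txt"
)

def parse_filename(filename):
    """Return (letters, step, iteration) or None if the name does not fit."""
    m = FILENAME_RE.fullmatch(filename)
    if m is None:
        return None
    return m.group("letters"), int(m.group("step")), int(m.group("it"))

print(parse_filename("summary_Cells_bb_01_2_1_36000_it_3.txt"))
```

Escaping the dot before txt keeps names such as "..._it_3Xtxt" from matching.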

Regex related to * and + in python

I am new to python. I didn't understand the behaviour of these programs in python.
import re
sub="dear"
pat="[aeiou]+"
m=re.search(pat,sub)
print(m.group())
This prints "ea"
import re
sub="dear"
pat="[aeiou]*"
m=re.search(pat,sub)
print(m.group())
This doesn't print anything.
I know + matches 1 or more occurrences and * matches 0 or more occurrences. I am expecting it to print "ea" in both programs. But it doesn't.
Why this happens?
This doesn't print anything.
Not exactly. It prints an empty string, which of course you didn't notice, as it's not visible. Try using this code instead:
l = re.findall(pat, sub)
print(l)
this will print:
['', 'ea', '', '']
Why this behaviour?
This is because when you use the * quantifier, as in [aeiou]*, the pattern also matches an empty string before every non-matching character, as well as the empty string at the end. So, for your string dear, it matches like this:
*d*ea*r* // * where the pattern matches.
All the *'s denote the position of your matches.
d doesn't match the pattern. So match is the empty string before it.
ea matches the pattern. So next match is ea.
r doesn't match the pattern. So the match is empty string before r.
The last empty string is the empty string after r.
Using [aeiou]*, the pattern matches at the beginning. You can confirm that using MatchObject.start:
>>> import re
>>> sub="dear"
>>> pat="[aeiou]*"
>>> m=re.search(pat,sub)
>>> m.start()
0
>>> m.end()
0
>>> m.group()
''
+ matches at least one of the character or group before it. [aeiou]+ will thus match at least one of a, e, i, o or u (vowels).
The regex engine tries each position in the string in turn until it finds the one vowel it needs as a minimum, so it does what you expect.
* however means at least 0, which also means it can match nothing. When the regex engine starts looking for a match at the beginning of the string, it finds no vowel there, but zero vowels already satisfy the pattern, so it stops with an empty match, and that is the result you obtain.
If you had used the string ear, note that you would have ea as match.
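The difference can be checked by comparing the spans of the two matches. This is just an illustrative sketch; the variable names are made up.

```python
import re

star = re.search(r"[aeiou]*", "dear")
plus = re.search(r"[aeiou]+", "dear")

# * succeeds immediately with an empty match in front of 'd'.
print(star.span(), repr(star.group()))
# + has to move on until it finds at least one vowel.
print(plus.span(), repr(plus.group()))

# When the string starts with vowels, * matches them at position 0.
ear = re.search(r"[aeiou]*", "ear")
print(repr(ear.group()))
```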
