Regex related to * and + in python - python

I am new to python. I didnt understand the behaviour of these program in python.
import re
sub="dear"
pat="[aeiou]+"
m=re.search(pat,sub)
print(m.group())
This prints "ea"
import re
sub="dear"
pat="[aeiou]*"
m=re.search(pat,sub)
print(m.group())
This doesnt prints anything.
I know + matches 1 or more occurences and * matches 0 or more occurrences. I am expecting it to print "ea" in both program.But it doesn't.
Why this happens?

This doesnt prints anything.
Not exactly. It prints an empty string which you just of course you didn't notice, as it's not visible. Try using this code instead:
l = re.findall(pat, sub)
print l
this will print:
['', 'ea', '', '']
Why this behaviour?
This is because when you use * quantifier - [aeiou]*, this regex pattern also matches an empty string before every non-matching string and also the empty string at the end. So, for your string dear, it matches like this:
*d*ea*r* // * where the pattern matches.
All the *'s denote the position of your matches.
d doesn't match the pattern. So match is the empty string before it.
ea matches the pattern. So next match is ea.
r doesn't match the pattern. So the match is empty string before r.
The last empty string is the empty string after r.

Using [aeiou]*, the pattern match at the beginning. You can confirm that using MatchObject.start:
>>> import re
>>> sub="dear"
>>> pat="[aeiou]*"
>>> m=re.search(pat,sub)
>>> m.start()
0
>>> m.end()
0
>>> m.group()
''

+ matches at least one of the character or group before it. [aeiou]+ will thus match at least one of a, e, i, o or u (vowels).
The regex will look everywhere in the string to find the minimum 1 vowel it's looking for and does what you expect it to (it will relentlessly try to get the condition satisfied).
* however means at least 0, which also means it can match nothing. That said, when the regex engine starts to look for a match at the beginning of the string to be tested, it doesn't find a match, so that the 0 match condition is satisfied and this is the result that you obtain.
If you had used the string ear, note that you would have ea as match.

Related

python: re.search doesn't start at beginning of string?

I'm working on a Flask API, which takes the following regex as an endpoint:
([0-9]*)((OK)|(BACK)|(X))*
That means I'm expecting a series of numbers, and the OK, BACK, X keywords multiple times in succession after the numbers.
I want to split this regex and do different stuff depending which capture groups were present.
My approach was the following:
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]*)", str(endp), re.I)
if match:
n = match.groups()
logging.info('nums: ' + str(n[0]))
match = re.search(r"((OK)|(BACK)|(X))*", str(endp), re.I)
if match:
s1 = match.groups()
for i in s1:
logging.info('str: ' + str(i[0]))
Using the /12OK endpoint, getting the numbers works just fine, but for some reason capturing the rest of the keywords are unsuccessful. I tried reducing the second capture group to only
match = re.search(r"(OK)*", str(endp), re.I)
I constantly find the following in s1 (using the reduced regex):
(None,)
originally (with the rest of the keywords):
(None, None, None, None)
Which I suppose means the regex pattern does not match anything in my endp string (why does it have 4 Nones? 1 for each keyword, but what the 4th is there for?). I validated my endpoint (the regex against the same string too) with a regex validator, it seems fine to me. I understand that re.match is supposed to get matches from the beginning, therefore I used the re.search method, as the documentation points out it's supposed to match anywhere in the string.
What am I missing here? Please advise, I'm a beginner in the python world.
Indeed it is a bit surprising that searching with * returns `None:
>>> re.search("(OK|BACK|X)*", u'/12OK').groups()
(None,)
But it's "correct", since * matches zero or more, and any pattern matches zero times in any string, that's why you see None. Searching with + somewhat solves it:
>>> re.search("(OK|BACK|X)+", u'/12OK').groups()
('OK',)
But now, searching with this pattern in /12OKOK still only finds one match because + means one or more, and it matched one time at the first OK. To find all occurrences you need to use re.findall:
>>> re.findall("(OK|BACK|X)", u'/12OKOK')
['OK', 'OK']
With those findings, your code would look as follows: (note that you don't need to write i[0] since i is already a string, unless you want to log only the first char of the string):
import re
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]+)", str(endp))
if match:
n = match.groups()
logging.info('nums: ' + str(n))
match = re.findall(r"(OK|BACK|X)", str(endp), re.I)
for i in match:
logging.info('str: ' + str(i))
If you want to match at least ONE of the groups, use + instead of *.
>>> endp = '/12OK'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> if match:
... s1 = match.groups()
... for i in s1:
... print s1
...
('OK', 'OK', None, None)
>>> endp = '/12X'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('X', None, None, 'X')
Notice that you have 4 matching groups in your expression, one for each pair of parentheses. The first match is the outer parenthesis and the second one is the first of the nested groups. In the second example, you still get the first match for the outer parenthesis and then the last one is the third of the nested ones.
"((OK)|(BACK)|(X))*" will search for OK or BACK or X, 0 or more times. Note that the * means 0 or more, not more than 0. The above expression should have a + at the end not * as + means 1 or more.
I think you're having two different issues, and their intersection is causing more confusion than either of them would cause on their own.
The first issue is that you're using repeated groups. Python's re library is not able to capture multiple matches when a group is repeated. Matching with a pattern like (X)+ against 'XXXX' will only capture a single 'X' in the first group even though the whole string will be matched. The regex library (which is not part of the standard library) can do multiple captures, though I'm not sure of the exact commands required.
The second issue is using the * repetition operator in your pattern. The pattern you show at the top of the question will match on an empty string. Obviously, none of the gropus will capture anything in that situation (which may be why you're seeing a lot of None entries in your results). You probably need to modify your pattern so that it requires some minimal amount of valid text to count as a match. Using + instead of * might be one solution, but it's not clear to me exactly what you want to match against so I can't suggest a specific pattern.

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
stuff, word = m.groups()
return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and you want to exclude them or etc. you need to remember that most of the regex engines (most of them use traditional NFA) analyze the strings by characters. And here since you want to exclude two word and want to use a negative lookahead you need to define the allowed strings as words (using word boundary) and since in sub it replaces the matched patterns with it's replace string you can't just pass the _ because in that case it will replace a part like I am with 3 underscore (I, ' ', 'am' ). So you can use a function to pass as the second argument of sub and multiply the _ with length of matched string to be replace.

Find and replace symbols with regex python

I have such sample:
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print p.match(sample)
in position of xx may be any from [a-z] in quantity of 2:
TEXT/qq_271802_1A TEXT/sg_271802_1A TEXT/ut_271802_1A
How can I find this xx and f.e. replace it with 'WW':
TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A
my code returns None
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print p.search(sample).group()
Your code return None as you are using match which matches from start.You need search or findall as you are finding anywhere in string and not at start.
For replacement use
re.sub(r'(?<=/)[a-z]{2}','WW',sample)
You can try the following Regular expression :
>>> sample = 'TEXT/xx_271802_1A'
>>> import re
>>> re.findall(r'([a-z])\1',sample)
['x']
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A'
>>> sample = 'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
The RegEx ([a-z])\1 searches for 1 letter and then matches it if it repeats immediately.
you only need to do this:
sample = re.sub(r'(?<=/)[a-z]{2}', 'WW', sample)
No need to check the string before with match. re.sub makes the replacement when the pattern is found.
(?<=..) is a lookbehind assertion and means preceded by, it's only a check and is not part of the match result. So / is not replaced.
In the same way, you can add a lookahead (?=_) (followed by) at the end of the pattern, if you want to check if there is the underscore.

Python regular expression pattern * is not working as expected

While working through Google's 2010 Python class, I found the following documentation:
'*' -- 0 or more occurrences of the pattern to its left
But when I tried the following
re.search(r'i*','biiiiiiiiiiiiiig').group()
I expected 'iiiiiiiiiiiiii' as output but got ''. Why?
* means 0 or more but re.search would return only the first match. Here the first match is an empty string. So you get an empty string as output.
Change * to + to get the desired output.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i+','biiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
Consider this example.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i*','iiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
Here i* returns iiiiiiiiiiiiii because at first , the regex engine tries to match zero or more times of i. Once it finds i at the very first, it matches greedily all the i's like in the second example, so you get iiiiiiii as output and if the i is not at the first (consider this biiiiiiig string), i* pattern would match all the empty string before the every non-match, in our case it matches all the empty strings that exists before b and g. Because re.search returns only the first match, you should get an empty string because of the non-match b at the first.
Why i got three empty strings as output in the below example?
>>> re.findall(r'i*','biiiiiiiiiiiiiig')
['', 'iiiiiiiiiiiiii', '', '']
As i explained earlier, for every non-match you should get an empty string as match. Let me explain. Regex engine parses the input from left to right.
First empty string as output is because the pattern i* won't match the character b but it matches the empty string which exists before the b.
Now the engine moves to the next character that is i which would be matched by our pattern i*, so it greedily matches the following i's . So you get iiiiiiiiiiiiii as the second.
After matching all the i's, it moves to the next character that is g which isn't matched by our pattern i* . So i* matches the empty string before the non-match g. That's the reason for the third empty string.
Now our pattern i* matches the empty string which exists before the end of the line. That's the reason for fourth empty string.
try this
re.search(r'i+','biiiiiiiiiiiiiig').group()
hope it helps.
Update:
Seems I misunderstood the question. T_T

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?
You could try the below code which uses re.findall function,
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) Matches the string in the format q followed by one or more digits and again followed by ) symbol.
(?:(?!q\d+).)* Negative look-ahead which matches any char not of q\d+ zero or more times.

Categories