Python regular expression pattern * is not working as expected - python

While working through Google's 2010 Python class, I found the following documentation:
'*' -- 0 or more occurrences of the pattern to its left
But when I tried the following
re.search(r'i*','biiiiiiiiiiiiiig').group()
I expected 'iiiiiiiiiiiiii' as output but got ''. Why?

* means 0 or more but re.search would return only the first match. Here the first match is an empty string. So you get an empty string as output.
Change * to + to get the desired output.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i+','biiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
Consider this example.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i*','iiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
In the second example, i* returns iiiiiiiiiiiiii because the regex engine starts at the first position and tries to match zero or more i's. Since the string begins with i, it greedily matches all the i's. If the string does not begin with i (consider the string biiiiiiig), i* still succeeds at the first position, but what it matches there is the empty string before the b. More generally, i* matches an empty string before every character it cannot match, in our case before b and before g. Because re.search returns only the first match, you get an empty string whenever the string starts with a non-matching character such as b.
Why did I get three empty strings as output in the example below?
>>> re.findall(r'i*','biiiiiiiiiiiiiig')
['', 'iiiiiiiiiiiiii', '', '']
As I explained earlier, you get an empty string as a match for every character the pattern cannot match. Let me explain. The regex engine scans the input from left to right.
The first empty string appears because the pattern i* cannot match the character b, but it does match the empty string that exists before the b.
Now the engine moves to the next character, i, which our pattern i* matches, so it greedily matches all the following i's. That gives iiiiiiiiiiiiii as the second item.
After matching all the i's, it moves to the next character, g, which our pattern i* does not match. So i* matches the empty string before the g. That's the reason for the third item, which is the second empty string.
Finally, our pattern i* matches the empty string at the end of the line. That's the reason for the fourth item, which is the third empty string.
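To see where each of those matches sits, here is a small sketch using re.finditer, which reports the same matches as re.findall but as match objects with positions. The helper name match_spans is made up for this illustration.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for each match re.finditer reports."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

# Same shape as the string in the question: b, fourteen i's, g.
text = "b" + "i" * 14 + "g"

for start, end, matched in match_spans(r"i*", text):
    # An empty match has start == end.
    print(start, end, repr(matched))
```

The empty matches show up with start equal to end: one before the b, one before the g, and one at the very end of the string.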

try this
re.search(r'i+','biiiiiiiiiiiiiig').group()
hope it helps.
Update:
Seems I misunderstood the question. T_T

Related

How to match and replace this pattern in Python RE?

s = "[abc]abx[abc]b"
s = re.sub(r"\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string s, I want to match 'abc' when it's enclosed in [] and followed by an 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above, which reads:
match anything starting with a '[', followed by any number of characters that are not ']', then followed by ']' and the character 'a'.
However, after the replacement, I want the string to be:
[ABC]abx[abc]b  // NOT ABCbx[abc]b
Namely, I don't want the whole match to be replaced, only the text inside the brackets []. How can I achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub(r"\[([^\]]*)\]a", "[ABC]a", s)
There is more than one method; one of them is exploiting groups.
import re
s = "[abc]abx[abc]b"
out = re.sub(r'(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in parentheses) in the first argument of re.sub. I refer to the 1st and 3rd (note that indexing starts at 1) so they remain unchanged, and in place of the 2nd group I put ABC. The second argument of re.sub is a raw string, so I do not need to escape \.
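On the question of how to use match.group() inside re.sub: re.sub also accepts a function as the replacement, and that function receives the match object. A minimal sketch, assuming you want the bracket content upper-cased (the function name upper_inside_brackets is only for illustration):

```python
import re

def upper_inside_brackets(match):
    # group(1) is "[", group(2) is the text inside the brackets, group(3) is "]a".
    return match.group(1) + match.group(2).upper() + match.group(3)

s = "[abc]abx[abc]b"
out = re.sub(r"(\[)([^\]]*)(\]a)", upper_inside_brackets, s)
print(out)  # [ABC]abx[abc]b
```

This is handy when the replacement depends on what was matched, rather than being a fixed string like ABC.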
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
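For completeness, a sketch of that lookaround pattern used with re.sub in Python. I wrote the character class as [^\]] instead of [^]]; both should mean "any character except ]". The name bracket_content is only for illustration.

```python
import re

s = "[abc]abx[abc]b"

# The lookbehind and lookahead only assert context; they are not part of the match.
bracket_content = re.compile(r"(?<=\[)[^\]]*(?=\]a)")

first = bracket_content.search(s)
print(first.span(), first.group())  # (1, 4) abc

out = bracket_content.sub("ABC", s)
print(out)  # [ABC]abx[abc]b
```

Because the match text is only abc, the replacement string does not have to repeat the brackets or the trailing a.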

Python Regex results longer than original string

I have python code like this:
import re
a = 'xyxy123'
b = re.findall('x*',a)
print(b)
This is the result:
['x', '', 'x', '', '', '', '', '']
How come b has eight elements when a only has seven characters?
There are eight "spots" in the string:
|x|y|x|y|1|2|3|
Each of them is a location where a regex could start. Since your regex includes the empty string (because x* allows 0 copies of x), each spot generates one match, and that match gets appended to the list in b. The exceptions are the two spots that start a longer match, x; as in msalperen's answer,
Empty matches are included in the result unless they touch the beginning of another match,
so the empty matches at the first and third locations are not included.
According to python documentation (https://docs.python.org/2/library/re.html):
re.findall returns all non-overlapping matches of pattern in string,
as a list of strings. The string is scanned left-to-right, and matches
are returned in the order found. If one or more groups are present in
the pattern, return a list of groups; this will be a list of tuples if
the pattern has more than one group. Empty matches are included in the
result unless they touch the beginning of another match.
So it returns all the results that match x*, including the empty ones.
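If it helps, here is a short sketch that lists where each match starts, using re.finditer instead of re.findall. The variable names are mine, not from the question.

```python
import re

a = "xyxy123"

# Pair each match with the position where it starts.
matches = [(m.start(), m.group()) for m in re.finditer(r"x*", a)]
for start, matched in matches:
    print(start, repr(matched))

# Seven characters give eight positions (0 through 7), and in this
# string every position is the start of exactly one match.
print(len(matches), len(a) + 1)

# Dropping the empty matches leaves only the real runs of x.
non_empty = [matched for _, matched in matches if matched]
print(non_empty)
```

If the empty strings are not wanted in the first place, using x+ instead of x* avoids them.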

Regular expression result

I have below code:
import re
line = "78349999234"
searchObj = re.search(r'9*', line)
if searchObj:
    print("searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")
However the output is empty. I thought * means: Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s. Why am I not able to see any result in this case?
I think the regular expression is matched from left to right, so the first match it finds is the empty string before the 7. If it finds a 9, it will indeed match greedily and try to "eat" (that's the correct terminology) as many characters as possible.
If you query for:
>>> print(re.findall(r'9*',line));
['', '', '', '', '9999', '', '', '', '']
It matches all empty strings between the characters and as you can see, 9999 is matched as well.
The main reason is probably performance: if you search for a pattern in a string of 10M+ characters, you're very happy if the pattern is already in the first 10k characters. You don't want to waste effort on finding the "nicest" match...
EDIT
"0 or more occurrences" means that the preceding element (in this case 9) is repeated zero or more times. In an empty string, the character is repeated exactly 0 times. If you want to match patterns where the character is repeated one or more times, you should use
9+
This results in:
>>> print(re.search(r'9+', line));
<_sre.SRE_Match object; span=(4, 8), match='9999'>
Using re.search with a pattern that accepts the empty string is probably not very helpful, since it will return the empty string at the very start of the string unless the string happens to begin with a longer match.
The main reason is that the re.search function stops searching once it finds a match. 9* means match the digit 9 zero or more times. Because an empty string exists before each and every character, re.search stops after finding the first empty string, the one before the 7. That's why you got an empty string as output.
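A quick way to check this is to compare the spans of the two searches. The last line is just one possible way to get the longest run if you want to keep 9* for some reason; it is a sketch, not a recommendation.

```python
import re

line = "78349999234"

star = re.search(r"9*", line)  # succeeds immediately with an empty match
plus = re.search(r"9+", line)  # has to find at least one 9

print(star.span(), repr(star.group()))  # (0, 0) ''
print(plus.span(), repr(plus.group()))  # (4, 8) '9999'

# Longest match among everything findall reports for 9*.
longest = max(re.findall(r"9*", line), key=len)
print(longest)  # 9999
```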

Regex related to * and + in python

I am new to Python. I don't understand the behaviour of these programs.
import re
sub="dear"
pat="[aeiou]+"
m=re.search(pat,sub)
print(m.group())
This prints "ea"
import re
sub="dear"
pat="[aeiou]*"
m=re.search(pat,sub)
print(m.group())
This doesn't print anything.
I know + matches 1 or more occurrences and * matches 0 or more occurrences. I am expecting it to print "ea" in both programs, but it doesn't.
Why does this happen?
This doesn't print anything.
Not exactly. It prints an empty string, which of course you didn't notice because it's not visible. Try using this code instead:
l = re.findall(pat, sub)
print(l)
this will print:
['', 'ea', '', '']
Why this behaviour?
This is because when you use the * quantifier, as in [aeiou]*, the pattern also matches an empty string before every non-matching character, as well as the empty string at the end. So, for your string dear, it matches like this:
*d*ea*r* // * where the pattern matches.
All the *'s denote the position of your matches.
d doesn't match the pattern. So match is the empty string before it.
ea matches the pattern. So next match is ea.
r doesn't match the pattern. So the match is empty string before r.
The last empty string is the empty string after r.
With [aeiou]*, the pattern matches at the beginning. You can confirm that using MatchObject.start:
>>> import re
>>> sub="dear"
>>> pat="[aeiou]*"
>>> m=re.search(pat,sub)
>>> m.start()
0
>>> m.end()
0
>>> m.group()
''
+ matches at least one of the character or group before it. [aeiou]+ will thus match at least one of a, e, i, o or u (vowels).
The regex will look everywhere in the string to find the minimum 1 vowel it's looking for and does what you expect it to (it will relentlessly try to get the condition satisfied).
* however means at least 0, which also means it can match nothing. So when the regex engine starts looking for a match at the beginning of the string, it finds no vowel there, but the zero-repetition condition is already satisfied, so the empty match is the result you obtain.
If you had used the string ear, note that you would have ea as match.
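Here is a small sketch that puts the two patterns and the two strings side by side. The helper first_match is a made-up name; it just returns the start position and text of whatever re.search finds.

```python
import re

def first_match(pattern, text):
    """Return (start, matched_text) for the first match, or None if there is none."""
    m = re.search(pattern, text)
    return (m.start(), m.group()) if m else None

print(first_match("[aeiou]*", "dear"))  # (0, '')    empty match before d
print(first_match("[aeiou]*", "ear"))   # (0, 'ea')  string starts with vowels
print(first_match("[aeiou]+", "dear"))  # (1, 'ea')  skips d to find a vowel
print(first_match("[aeiou]+", "dr"))    # None       + cannot match nothing
```

Note that with * the match always starts at position 0, while + is free to start later in the string.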

Why is there an extra result handed back to me during this Python regex example?

Code:
re.findall(r'(/\d\d\d\d)?','/2000')
Result:
['/2000', '']
Code:
re.findall(r'/\d\d\d\d?','/2000')
Result:
['/2000']
Why is the extra '' returned in the first example?
I am using the first example in a Django URL configuration. Is there a way I can prevent '' from matching?
Because the parentheses define a group, and then with ? you ask for 0 or 1 repetitions of the group. Thus the empty string and /2000 both match.
The ? operator matches 0 or 1 repetitions of the preceding expression. In the first case the preceding expression is (/\d\d\d\d), while in the second it is the last \d.
Therefore, in the first case the empty string "" is matched, as it contains zero repetitions of the expression (/\d\d\d\d).
Here is what is happening: The regex engine starts off with its pointer before the first char in the target string. It greedily consumes the whole string and places the match result in the first list element. This leaves the internal pointer at the end of the string. But since the regex pattern can match nothingness, it successfully matches at the position at the end of the string too. Thus, there are two elements in the list.
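The two matches can be made visible with re.finditer. The rest of this sketch shows two plain-Python ways to ignore the empty match; whether either one fits a Django URL configuration is something I have not checked, so treat it as an assumption. The name year is only for illustration.

```python
import re

year = re.compile(r"(/\d\d\d\d)?")

# The second span is the empty match at the end of the string.
spans = [m.span() for m in year.finditer("/2000")]
print(spans)  # [(0, 5), (5, 5)]

# Option 1: skip matches whose text is empty.
non_empty = [m.group(1) for m in year.finditer("/2000") if m.group()]
print(non_empty)  # ['/2000']

# Option 2: match the whole string once instead of scanning for all matches.
whole = year.fullmatch("/2000")
print(whole.group(1))  # /2000
```

With fullmatch, an empty input still matches, but group(1) is None in that case, so the two situations can be told apart.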
