Python Regex results longer than original string - python

I have Python code like this:
import re
a = 'xyxy123'
b = re.findall('x*', a)
print(b)
This is the result:
['x', '', 'x', '', '', '', '', '']
How come b has eight elements when a only has seven characters?

There are eight "spots" in the string:
|x|y|x|y|1|2|3|
Each of them is a location where a regex could start. Since your regex includes the empty string (because x* allows 0 copies of x), each spot generates one match, and that match gets appended to the list in b. The exceptions are the two spots that start a longer match, x; as in msalperen's answer,
Empty matches are included in the result unless they touch the beginning of another match,
so the empty matches at the first and third locations are not included.
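One way to see the eight spots is to ask re.finditer for the span of every match. This is only a small sketch for Python 3; the variable names are chosen for illustration and are not from the question.

```python
import re

a = 'xyxy123'

# Record where each match of x* starts and ends, plus the matched text.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer('x*', a)]

for start, end, text in spans:
    print(start, end, repr(text))
```

An empty match has start == end. The two 'x' matches begin at spots 0 and 2, which is why no empty match is reported at those two spots.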

According to the Python documentation (https://docs.python.org/2/library/re.html):
re.findall returns all non-overlapping matches of pattern in string,
as a list of strings. The string is scanned left-to-right, and matches
are returned in the order found. If one or more groups are present in
the pattern, return a list of groups; this will be a list of tuples if
the pattern has more than one group. Empty matches are included in the
result unless they touch the beginning of another match.
So it returns all the results that match x*, including the empty ones.

Related

Apart from returning a list and an iterator, do re.findall() and re.finditer() in Python also work differently?

I wrote the following code to get all variable-length patterns matching str_key.
import re
line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"
regex = rf"({str_key})+"
find_all_found = re.findall(regex, line)
print(find_all_found)
find_iter_found = re.finditer(regex, line)
for i in find_iter_found:
    print(i.group())
Output i got:
['ABCD', 'ABCD', 'ABCD']
ABCDABCDABCD
ABCD
ABCDABCDABCD
The intended output is the last three lines, printed by finditer(). I was expecting both functions to give me the same output (list or iterator does not matter). Why does findall() differ? As far as I understood from other posts on Stack Overflow, these two functions differ only in their return types, not in how they match patterns. Do they work differently? If not, what have I done wrong?
findall returns the contents of the capture group, which is what groups() gives you on each match object, rather than group():
>>> find_iter_found = re.finditer(regex, line)
>>> for i in find_iter_found:
...     print(i.groups()[0])
The difference between the two methods is explained here.
The behaviour of the two functions is pretty much the same as far as the matching process is concerned as per:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping
matches for the RE pattern in string. The string is scanned
left-to-right, and matches are returned in the order found. Empty
matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
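For re.findall, change your regex from
regex = rf"({str_key})+"
to
regex = rf"((?:{str_key})+)"
The quantifier + has to be inside the capture group. Otherwise the group captures only the last repetition of str_key, and that is what findall returns.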
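The effect of moving the quantifier can be checked side by side. The following is a minimal sketch using the data from the question; the names last_repeat, whole_runs and via_finditer are made up for this illustration.

```python
import re

line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"

# Group wraps a single copy of the key: findall returns only the last repetition.
last_repeat = re.findall(rf"({str_key})+", line)

# Group wraps the whole repeated run: findall returns the full run.
whole_runs = re.findall(rf"((?:{str_key})+)", line)

# finditer with group() yields the full match whether or not groups are present.
via_finditer = [m.group() for m in re.finditer(rf"({str_key})+", line)]

print(last_repeat)
print(whole_runs)
print(via_finditer)
```

With the group around the whole run, findall and finditer agree on this input.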

How to comprehend the python regex compile matching result: `re.compile(r'a*')`

import re
pattern = re.compile(r'a*')
pattern.findall("aba")
result:
['a', '', 'a', '']
Why are there empty matches in the result? How should I understand this?
To be more specific, what do the two empty matches--'' in the result stand for in the string "aba"?
findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
You are searching for a*. * matches zero or more repetitions of the character, so a* also matches the empty string. That is what it finds at the position of b and at the end of the string, where no a is available. It seems like you want a+ instead, which matches one or more repetitions of the character.
Let me try to explain, as I also could not find good information on the outputs. The documentation states that
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a previous empty match.
import re
text = 'abcaad'
print(f"'a' matches {re.findall('a' , text)}")
print(f"'a+' matches {re.findall('a+', text)}")
print(f"'a*' matches {re.findall('a*', text)}")
print(f"'z*' matches {re.findall('z*', text)}")
The output is
'a' matches ['a', 'a', 'a']
'a+' matches ['a', 'aa']
'a*' matches ['a', '', '', 'aa', '', '']
'z*' matches ['', '', '', '', '', '', '']
a matches exactly the character a, which occurs three times.
a+ matches one or more occurrences of the character a.
a* matches zero or more occurrences of the character a.
Besides matching a and aa, it produces an empty match wherever no a starts: before b, before c, before d, and at the end of the string.
z* matches zero or more occurrences of the character z.
Since there is no z, it produces only empty matches: before each of a, b, c, a, a, d, and at the end of the string.
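Two consequences of the explanation above can be checked directly. This is a sketch under the assumption of Python 3; the variable names are invented for the illustration.

```python
import re

text = 'abcaad'

# z never occurs, so z* can only match the empty string:
# once before each character and once at the end.
z_matches = re.findall('z*', text)

# a+ requires at least one a, so it never produces an empty match.
a_plus_matches = re.findall('a+', text)

# Dropping the empty strings from the a* result leaves the same list.
a_star_nonempty = [m for m in re.findall('a*', text) if m]

print(len(z_matches), len(text) + 1)
print(a_plus_matches, a_star_nonempty)
```

So for this input, a+ is a* with the empty matches removed, and a pattern that can only match the empty string yields len(text) + 1 results.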

Finding and extracting multiple substrings in a string?

After looking at a few similar questions, I have not been able to successfully implement a substring split on my data. For my specific case, I have a bunch of strings, and each string has a substring I need to extract. The strings are grouped together in a list and my data is NBA positions. I need to pull out the positions (either 'PG', 'SG', 'SF', 'PF', or 'C') from each string. Some strings will have more than one position. Here is the data.
text = ['Chi\xa0SG, SF\xa0\xa0DTD','Cle\xa0PF']
The code should ideally look at the first string, 'Chi\xa0SG, SF\xa0\xa0DTD', and return ['SG','SF'] the two positions. The code should look at the second string and return ['PF'].
Leverage (zero width) lookarounds:
(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)
(?<!\w) is a zero width negative lookbehind, making sure the desired match is not preceded by a word character
(?:PG|SG|SF|PF|C) is a non-capturing group that matches any of the desired positions; without the group, the lookbehind would apply only to PG and the lookahead only to C
(?!\w) is a zero width negative lookahead, making sure the match is not followed by a word character
Example:
In [6]: import re
In [7]: s = 'Chi\xa0SG, SF\xa0\xa0DTD'
In [8]: re.findall(r'(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)', s)
Out[8]: ['SG', 'SF']
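To run the lookaround approach over the whole list from the question, something like the following could work. It is a sketch only: extract_positions and POSITION_RE are hypothetical names, and the alternation is wrapped in a non-capturing group so that both lookarounds apply to every alternative.

```python
import re

# Non-capturing group so both lookarounds apply to every alternative.
POSITION_RE = re.compile(r'(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)')

def extract_positions(strings):
    """Return one list of position codes per input string."""
    return [POSITION_RE.findall(s) for s in strings]

text = ['Chi\xa0SG, SF\xa0\xa0DTD', 'Cle\xa0PF']
result = extract_positions(text)
print(result)
```

The leading C in 'Chi' and 'Cle' is not reported, because it is followed by a word character.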
heemayl's response is the most robust, but you could probably get away without a regex. The fields are separated by non-breaking spaces (\xa0), so take the second field and split it on commas.
s = 'Chi\xa0SG, SF\xa0\xa0DTD'
fin = [p.strip() for p in s.split('\xa0')[1].split(',')]
This gives ['SG', 'SF'] for the first string and ['PF'] for the second, assuming the positions are always the second field. Slicing the last two characters of each comma-separated piece does not work here, because the last piece ends in 'DTD'.

Python regular expression pattern * is not working as expected

While working through Google's 2010 Python class, I found the following documentation:
'*' -- 0 or more occurrences of the pattern to its left
But when I tried the following
re.search(r'i*','biiiiiiiiiiiiiig').group()
I expected 'iiiiiiiiiiiiii' as output but got ''. Why?
* means 0 or more but re.search would return only the first match. Here the first match is an empty string. So you get an empty string as output.
Change * to + to get the desired output.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i+','biiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
Consider this example.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i*','iiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
Here i* returns iiiiiiiiiiiiii because the regex engine first tries to match zero or more i's at the very start of the string. If it finds an i there, it greedily matches all the i's that follow, as in the second example. If the string does not start with i (as in biiiiiiiiiiiiiig), i* still matches at the start, but only the empty string before b. Because re.search returns only the first match, you get an empty string.
Why did I get three empty strings in the output of the example below?
>>> re.findall(r'i*','biiiiiiiiiiiiiig')
['', 'iiiiiiiiiiiiii', '', '']
As explained above, the pattern matches an empty string wherever it cannot match an i. The regex engine scans the input from left to right.
The first element is an empty string because i* cannot match the character b, but it does match the empty string just before b.
The engine then moves to the next character, i, which i* does match, so it greedily consumes all the following i's. That gives iiiiiiiiiiiiii as the second element.
After the i's, the engine is at g, which i* cannot match, so it matches the empty string just before g. That is the third element (the second empty string).
Finally, i* matches the empty string at the end of the string, after g. That is the fourth element (the third empty string).
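The walk-through above can be turned into a small loop. This is a simplified model for single-character patterns like i*, not how the re module is actually implemented; scan_matches is a hypothetical helper, and a shorter string is used to keep the output readable.

```python
import re

def scan_matches(pattern, text):
    """Simplified model of the left-to-right scan for patterns like i*."""
    compiled = re.compile(pattern)
    found = []
    pos = 0
    while pos <= len(text):
        m = compiled.match(text, pos)   # try to match exactly at pos
        if m is None:
            pos += 1                    # nothing matches here, move on
            continue
        found.append(m.group())
        # A non-empty match moves the scan to its end; an empty match
        # moves it one character forward so the scan makes progress.
        pos = m.end() if m.end() > m.start() else pos + 1
    return found

text = 'biiig'
print(scan_matches(r'i*', text))
```

For these inputs the loop returns the same list as re.findall: an empty string before b, the run of i's, an empty string before g, and one at the end.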
try this
re.search(r'i+','biiiiiiiiiiiiiig').group()
hope it helps.
Update:
Seems I misunderstood the question. T_T

Regular expression result

I have the code below:
import re
line = "78349999234"
searchObj = re.search(r'9*', line)
if searchObj:
    print("searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")
However the output is empty. I thought * means: Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s. Why am I not able to see any result in this case?
A regular expression is matched from left to right, so the first match is the empty string before the 7. When the engine does find a 9, it matches greedily and tries to "eat" (that's the correct terminology) as many characters as possible.
If you query for:
>>> print(re.findall(r'9*', line))
['', '', '', '', '9999', '', '', '', '']
It matches all empty strings between the characters and as you can see, 9999 is matched as well.
The main reason is probably performance: if you search for a pattern in a string of 10M+ characters, you're very happy if the pattern is already in the first 10k characters. You don't want to waste effort on finding the "nicest" match...
EDIT
With "0 or more occurrences" one means that the group (in this case 9) is repeated zero or more times. In an empty string, the character is repeated exactly 0 times. If you want to match patterns where the character is repeated one or more times, you should use
9+
This results in:
>>> print(re.search(r'9+', line))
<_sre.SRE_Match object; span=(4, 8), match='9999'>
Using re.search with a pattern that accepts the empty string is rarely helpful, since it always matches at the very start of the string first, and that match is empty unless a longer one begins there.
The main reason is that re.search stops searching once it finds a match. 9* means match the digit 9 zero or more times. Because an empty string exists before every character, re.search stops after finding the first empty string. That's why you got an empty string as output.
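That re.search simply reports the leftmost match can be checked against re.finditer. This is a short sketch assuming Python 3; the variable names are illustrative.

```python
import re

line = "78349999234"

# re.search reports only the leftmost match; for 9* that is the empty
# string at position 0, the same as the first match finditer yields.
first_search = re.search(r'9*', line)
first_iter = next(re.finditer(r'9*', line))

# 9+ cannot match the empty string, so the leftmost match is the run of 9s.
nonempty = re.search(r'9+', line)

print(first_search.span(), first_iter.span())
print(nonempty.span(), nonempty.group())
```

Both 9* results have the span (0, 0), while 9+ skips ahead to the first position where at least one 9 is present.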
