Trouble understanding re.findall() behavior

I'm having some trouble understanding the behavior of re.findall. Quoting from the documentation:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Based on this, I would expect the following line
re.findall(r"(a)|(b)|(c)","c")
to produce the result
[(c)]
However, it produces the result
[('', '', 'c')]
I don't understand why the two empty strings are included, since I don't see an empty match anywhere.

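It's because the pattern has three capturing groups. findall returns one tuple per match with one slot per group, and a group that did not take part in the match is reported as an empty string:
import re
print(re.findall(r"(a)|(b)|(c)","d"))
print(re.findall(r"(a)|(b)|(c)","c"))
print(re.findall(r"(?:a)|(?:b)|(?:c)","c"))
print(re.findall(r"(?:a)|(b)|(c)","c"))
print(re.findall(r"(?:a|b|c)","c"))
print(re.findall(r"a|b|c","c"))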
Output
[]
[('', '', 'c')]
['c']
[('', 'c')]
['c']
['c']
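If the goal is to keep the three groups but still recover the matched text, one possible sketch is to iterate with re.finditer instead. It reuses the pattern from the question; the variable names are only illustrative.

```python
import re

pattern = r"(a)|(b)|(c)"

# findall reports one slot per capturing group; a group that did not
# take part in the match shows up as ''.
print(re.findall(pattern, "c"))        # [('', '', 'c')]

# finditer yields match objects, so the whole match is always available
# through group(0), regardless of how many groups the pattern has.
whole_matches = [m.group(0) for m in re.finditer(pattern, "c")]
print(whole_matches)                   # ['c']

# lastindex is the number of the group that matched (1-based).
which_group = [m.lastindex for m in re.finditer(pattern, "abc")]
print(which_group)                     # [1, 2, 3]
```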

Related

Why does my regular expression return tuples for every character in a string? [duplicate]

I am making a simple project for my math class in which I want to verify if a given function body (string) only contains the allowed expressions (digits, basic trigonometry, +, -, *, /).
I am using regular expressions with the re.findall method.
My current code:
import re
def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")
    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"
    z = re.findall(chars, exp)
    return "".join(z) == exp
However, when I test this with any expression, re.findall(chars, exp) returns a list of tuples. Each character in the string produces a tuple of three empty strings, ('', '', ''), unless there is a trig function, in which case the tuple contains the trig function and two empty strings.
Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
I don't understand why it does this. I have tested the regular expression on regexr.com and it works fine. I understand that it uses JavaScript, but normally there should be no difference, right?
Thank you for any explanation and/or fix.

Short answer: If the result you want is ['cos', '(', 'x', ')'], you need something like
'(cos|sin|tan|[-+*/x()]|\d+)' (the hyphen comes first in the class so that it is literal instead of forming a range):
>>> re.findall(r'(cos|sin|tan|[-+*/x()]|\d+)', "cos(x)")
['cos', '(', 'x', ')']
From the documentation for findall:
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of (, x, and ), none of the three capture groups matched, although the bracket expression did. Since it isn't inside a capture group, anything it matches isn't included in your output.
As an aside, [\d+/*x)(-] doesn't match a multidigit integer as a single token. Inside [...], + is a literal plus sign rather than a quantifier, so \d+ there means "one digit or one +". (In Python's re, \d keeps its meaning inside a character class.) As a result, the class matches exactly one character: a single digit or one of the following seven characters:
+
/
*
x
)
(
-
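A short check of how a bracket expression treats \d and + in Python's re, as a sketch with a made-up input string ("12+x" is not from the question):

```python
import re

char_class = r"[\d+/*x)(-]"

# Each match is exactly one character; '+' inside [...] is a literal plus,
# and \d still means "any digit".
single_chars = re.findall(char_class, "12+x")
print(single_chars)                        # ['1', '2', '+', 'x']

# Moving the quantifier outside the class keeps multi-digit numbers together.
tokens = re.findall(r"\d+|[+/*x)(-]", "12+x")
print(tokens)                              # ['12', '+', 'x']
```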

You have three groups (subexpressions in parentheses) in your regex, so you get tuples with three items. You also get four results, one for each substring that matches your regex: the first for 'cos', the second for '(', the third for 'x', and the last for ')'. But the last part of your regex isn't marked as a group, so those matches don't appear in the tuples. If you change your regex to r"(cos)|(sin)|(tan)|([\d+/*x)(-])", you will get tuples with four items, and every tuple will have exactly one non-empty item.
Unfortunately, this fix doesn't help you verify that the input contains no prohibited lexemes. It only helps to understand what's going on.
I would suggest converting your regex to a negative form: instead of checking that some allowed lexemes are present, check that nothing except allowed lexemes is present. I guess this should work for simple cases. But I am afraid that for more sophisticated expressions you will have to use something other than regex.
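One way to sketch the "nothing except allowed lexemes" check is re.fullmatch with a repeated non-capturing group. This is an assumption about what the asker needs, not the answerer's exact method; the name ALLOWED is hypothetical, and the check only validates the alphabet, not whether the expression is well formed.

```python
import re

# A repeated non-capturing group of allowed tokens. Digits are matched one
# at a time inside the class, which is enough for a pure validity check.
ALLOWED = re.compile(r"(?:cos|sin|tan|[\dx+\-*/()])*")

def valid_expression(exp: str) -> bool:
    """Return True if exp contains only allowed tokens (no syntax check)."""
    exp = exp.replace(" ", "")
    return ALLOWED.fullmatch(exp) is not None

print(valid_expression("cos(x) + 12*x"))   # True
print(valid_expression("exp(x)"))          # False
```

Note that an empty string also passes; add an explicit check if that should be rejected.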

findall returns tuples because your regular expression has capturing groups. To make a group non-capturing, add ?: after the opening parenthesis:
r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
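A quick check of that pattern against the input from the question, as a minimal sketch: without capturing groups, findall returns plain strings, so joining them reproduces the input.

```python
import re

chars = r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
z = re.findall(chars, "cos(x)")
print(z)                       # ['cos', '(', 'x', ')']

# With plain strings, the join-and-compare test from the question works.
print("".join(z) == "cos(x)")  # True
```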

How to comprehend the python regex compile matching result: `re.compile(r'a*')`

import re
pattern = re.compile(r'a*')
pattern.findall("aba")
result:
['a', '', 'a', '']
Why are there empty matches in the result? How should I understand this?
To be more specific, what do the two empty matches ('') in the result stand for in the string "aba"?

findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

You are searching for a*. * matches zero or more repetitions of the character, so a* also matches the empty string. That is why you get an empty match at b (where zero a's are found) and another at the end of the string. It seems like you want a+ instead, which matches one or more repetitions of the character.
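The difference can be seen side by side on the string from the question. The last line is an extra sketch for the case where a* has to stay and the empty matches should just be dropped:

```python
import re

star = re.findall(r"a*", "aba")
plus = re.findall(r"a+", "aba")
print(star)  # ['a', '', 'a', '']
print(plus)  # ['a', 'a']

# Empty strings are falsy, so a simple filter removes the empty matches.
non_empty = [m for m in star if m]
print(non_empty)  # ['a', 'a']
```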

Let me try to explain, as I also could not find good information on the outputs. The documentation states that
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a previous empty match.
import re
text = 'abcaad'
print(f"'a' matches {re.findall('a' , text)}")
print(f"'a+' matches {re.findall('a+', text)}")
print(f"'a*' matches {re.findall('a*', text)}")
print(f"'z*' matches {re.findall('z*', text)}")
The output is
'a' matches ['a', 'a', 'a']
'a+' matches ['a', 'aa']
'a*' matches ['a', '', '', 'aa', '', '']
'z*' matches ['', '', '', '', '', '', '']
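a matches the single character a, which occurs three times.
a+ matches one or more occurrences of the character a.
a* matches zero or more occurrences of the character a.
Besides matching a and aa, it also produces an empty match at every position where no a starts: before b, before c, before d, and at the end of the string.
z* matches zero or more occurrences of the character z.
Since text contains no z, it produces only empty matches: one before each of the six characters a, b, c, a, a, d, and one at the end of the string.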

Python Regex results longer than original string

I have python code like this:
import re
a = 'xyxy123'
b = re.findall('x*', a)
print(b)
This is the result:
['x', '', 'x', '', '', '', '', '']
How come b has eight elements when a only has seven characters?

There are eight "spots" in the string:
|x|y|x|y|1|2|3|
Each of them is a location where a regex could start. Since your regex includes the empty string (because x* allows 0 copies of x), each spot generates one match, and that match gets appended to the list in b. The exceptions are the two spots that start a longer match, x; as in msalperen's answer,
Empty matches are included in the result unless they touch the beginning of another match,
so the empty matches at the first and third locations are not included.
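To see which spot each element of b comes from, a small sketch with re.finditer can print the start position next to each match (the names starts and a are illustrative, and the output shown is what Python 3.7+ produces):

```python
import re

a = 'xyxy123'

# Pair every match with the position where it starts.
starts = [(m.start(), m.group()) for m in re.finditer('x*', a)]
print(starts)
# [(0, 'x'), (1, ''), (2, 'x'), (3, ''), (4, ''), (5, ''), (6, ''), (7, '')]

# One match per spot: a string of length 7 has 7 + 1 = 8 spots.
print(len(starts) == len(a) + 1)  # True
```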

According to python documentation (https://docs.python.org/2/library/re.html):
re.findall returns all non-overlapping matches of pattern in string,
as a list of strings. The string is scanned left-to-right, and matches
are returned in the order found. If one or more groups are present in
the pattern, return a list of groups; this will be a list of tuples if
the pattern has more than one group. Empty matches are included in the
result unless they touch the beginning of another match.
So it returns all the results that match x*, including the empty ones.

Regex disjunction in Python findall does not match full substrings

In several online testers, the regex a(b|c)z matches both 'abz' and 'acz' in the string 'abz acz', but Python's re.findall() only matches 'b' and 'c'.
What am I missing?
In[42]: re.findall(r'a(b|c)z', 'abz acz')
Out[42]: ['b', 'c']

With findall, the captured groups are returned:
As stated in the documentation ...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
You can simply use a character class here instead.
>>> re.findall(r'a[bc]z', 'abz acz')
['abz', 'acz']

That's because you are using capturing parentheses:
re.findall(r'a(?:b|c)z', 'abz acz')
?: will cause it to be non-capturing parentheses
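For comparison, here are the three variants on the same input, as a brief sketch. The third one, wrapping the whole pattern in an extra group, is an addition for the case where both the full match and the inner alternative are wanted:

```python
import re

text = 'abz acz'

capturing = re.findall(r'a(b|c)z', text)
print(capturing)        # ['b', 'c']  (one group: only the group is returned)

non_capturing = re.findall(r'a(?:b|c)z', text)
print(non_capturing)    # ['abz', 'acz']  (no groups: the whole match)

both = re.findall(r'(a(b|c)z)', text)
print(both)             # [('abz', 'b'), ('acz', 'c')]  (two groups: tuples)
```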

Use of findall and parenthesis in Python

I need to extract all letters after the + sign or at the beginning of a string like this:
formula = "X+BC+DAF"
I tried the following, but I do not want to see the + sign in the result. I wish to see only ['X', 'B', 'D'].
>>> re.findall("^[A-Z]|[+][A-Z]", formula)
['X', '+B', '+D']
When I grouped with parenthesis, I got this strange result:
re.findall("^([A-Z])|[+]([A-Z])", formula)
[('X', ''), ('', 'B'), ('', 'D')]
Why does it create tuples when I try to group? How can I write the regexp directly so that it returns ['X', 'B', 'D']?

If there are any capturing groups in the regular expression then re.findall returns only the values captured by the groups. If there are no groups the entire matched string is returned.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
How to write the regexp directly such that it returns ['X', 'B', 'D'] ?
Use a non-capturing group for the prefix (start of string or +) and keep a single capturing group for the letter, so that findall returns only the letter:
>>> re.findall(r"(?:^|\+)([A-Z])", formula)
['X', 'B', 'D']
Or for this specific case you could try a simpler solution using a word boundary:
>>> re.findall(r"\b[A-Z]", formula)
['X', 'B', 'D']
Or a solution using str.split that doesn't use regular expressions:
>>> [s[0] for s in formula.split('+')]
['X', 'B', 'D']
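Two more variations, offered only as sketches. The first keeps the two-group pattern from the question and collapses each tuple, relying on exactly one slot per tuple being non-empty. The second uses a lookbehind so that the + is checked but not included in the match.

```python
import re

formula = "X+BC+DAF"

# Two groups give tuples such as ('X', '') and ('', 'B'); pick the filled slot.
pairs = re.findall(r"^([A-Z])|[+]([A-Z])", formula)
collapsed = [first or second for first, second in pairs]
print(collapsed)        # ['X', 'B', 'D']

# (?<=\+) asserts that a '+' precedes the letter without consuming it.
lookbehind = re.findall(r"(?:^|(?<=\+))[A-Z]", formula)
print(lookbehind)       # ['X', 'B', 'D']
```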
