Regex/Python missed encompassed pattern - python

I have tried to research answers to this question online, but nothing seems to describe the problem I have here. If I missed something, please close the question and redirect it to where it has already been answered.
That being said, my python regex doesn't seem to want to recognize a pattern if it is already encompassed in another captured pattern. I tried to run the code and here are the results:
>>> import re
>>> string = 'NNTSY'
>>> m = re.findall('N[^P][ST][^P]',string)
>>> m
['NNTS']
I don't understand why it didn't yield this output:
>>> m
['NNTS','NTSY']
Thanks!

re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
https://docs.python.org/3/library/re.html#re.findall
If you're not just trying to understand why, but actually need to get overlapping matches, you can use lookahead with a capturing group as described in this question's answers.

This is in fact possible, using a lookahead assertion.
(?=pattern)
will match at any position directly followed by pattern without consuming the string, and
(?=(pattern))
will capture the group that matched.
import re
string = 'NNTSY'
m = re.findall(r'(?=(N[^P][ST][^P]))',string)
print(m)
#['NNTS', 'NTSY']
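If you also need to know where each overlapping match starts, the same lookahead trick works with re.finditer. This is only a minimal sketch: the helper name overlapping_matches is made up for illustration, and it assumes the pattern you pass in has no capturing groups of its own (otherwise group(1) would no longer be the whole match).

```python
import re

def overlapping_matches(pattern, text):
    """Return (start, matched_text) for every position where pattern matches.

    Hypothetical helper: wraps the pattern in a capturing lookahead so the
    engine never consumes characters and can match again one position later.
    """
    lookahead = re.compile('(?=(' + pattern + '))')
    return [(m.start(), m.group(1)) for m in lookahead.finditer(text)]

print(overlapping_matches(r'N[^P][ST][^P]', 'NNTSY'))
# [(0, 'NNTS'), (1, 'NTSY')]
```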


Regex for search specific substring [duplicate]

This question already has answers here:
How to find overlapping matches with a regexp?
(4 answers)
Closed 4 years ago.
I tried this code:
re.findall(r"d.*?c", "dcc")
to search for substrings with first letter d and last letter c.
But I get output ['dc']
The correct output should be ['dc', 'dcc'].
What did i do wrong?
What you're looking for isn't possible using any built-in regexp functions that I know of. re.findall() only returns non-overlapping matches. After it matches dc, it looks for another match starting after that. Since the rest of the string is just c, and that doesn't match, it's done, so it just returns ["dc"].
When you use a quantifier like *, you have a choice of making it greedy, or non-greedy -- either it finds the longest or shortest match of the regexp. To do what you want, you need a way of telling it to look for successively longer matches until it can't find anything. There's no simple way to do this. You can use a quantifier with a specific count, but you'd have to loop it in your code:
d.{0}c
d.{1}c
d.{2}c
d.{3}c
...
If you have a regexp with multiple quantified sub-patterns, you'd have to try all combinations of lengths.
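The loop over d.{0}c, d.{1}c, ... could look roughly like the sketch below. The function name and the max_gap parameter are invented for illustration, and wrapping each pattern in a lookahead is an addition on my part so that matches starting at different positions can overlap.

```python
import re

def matches_for_each_gap(text, max_gap=None):
    """Try d.{0}c, d.{1}c, ... in turn and collect every match found."""
    if max_gap is None:
        # A match needs 'd' and 'c' plus the gap, so longer gaps cannot fit.
        max_gap = max(len(text) - 2, 0)
    found = []
    for gap in range(max_gap + 1):
        # The lookahead is an assumption added here, not part of the answer above.
        pattern = re.compile(r'(?=(d.{%d}c))' % gap)
        found.extend(m.group(1) for m in pattern.finditer(text))
    return found

print(matches_for_each_gap('dcc'))  # ['dc', 'dcc']
```

Results come back grouped by gap length rather than by position in the string.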
Your two problems are that .* is greedy while .*? is minimal, and that re.findall() only returns non-overlapping matches. Here's a possible solution:
import re

def findall_inner(expr, text):
    # Start from the ordinary (non-overlapping) matches.
    explore = list(re.findall(expr, text))
    matches = set()
    while explore:
        word = explore.pop()
        if len(word) >= 2 and word not in matches:
            explore.extend(re.findall(expr, word[1:]))  # try more removing first letter
            explore.extend(re.findall(expr, word[:-1]))  # try more removing last letter
            matches.add(word)
    return list(matches)

found = findall_inner(r"d.*c", "dcc")
print(found)  # contains 'dc' and 'dcc'; order is arbitrary because a set is used
This is a little bit of overkill, using findall instead of search and using >= 2 instead of > 2, as in this case there can only be one non-overlapping match of d.*c and one-character strings cannot match the pattern. But there is some flexibility in it depending on what other kinds of patterns you might want.
Try this regex:
^d.*c$
Essentially, you are looking for the start of the string to be d and the end of the string to be c.
This is a very important point to understand: a regex engine always returns the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail does the engine continue with the second character in the text. So once it finds 'dc', the engine moves past 'dc' and continues with the second 'c'. That is why the same scan cannot also return 'dcc'.
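A quick way to see this is to print the span of every match for the lazy and the greedy version of the pattern. This is just an illustrative sketch using re.finditer; each version reports a single match starting at position 0, and neither reports both substrings.

```python
import re

text = 'dcc'

# Each entry is (start, end, matched_text).
lazy = [(m.start(), m.end(), m.group()) for m in re.finditer(r'd.*?c', text)]
greedy = [(m.start(), m.end(), m.group()) for m in re.finditer(r'd.*c', text)]

print(lazy)    # [(0, 2, 'dc')]  - scanning resumes at index 2, where only 'c' is left
print(greedy)  # [(0, 3, 'dcc')] - the whole string is consumed by one match
```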

Python regular expression findall *

I am not able to understand the following code behavior.
>>> import re
>>> text = 'been'
>>> r = re.compile(r'b(e)*')
>>> r.search(text).group()
'bee' #makes sense
>>> r.findall(text)
['e'] #makes no sense
I have read some existing questions and answers about capturing groups, but I am still confused. Could someone please explain this to me?
The answer is simplified in the Regex Howto
As you can read here, group returns the string matched by the Regular Expression.
group() returns the substring that was matched by the RE.
But the action of findall is justified in the documentation
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group
So you are getting the matched part of the capture group.
Some experiments include :
>>> r = re.compile(r'(b)(e)*')
>>> r.findall(text)
[('b', 'e')]
Here the regex has two capturing groups, so the returned values are a list of matched groups (in tuples)
When a pattern contains a capture group, findall returns only the content of the capture group and no longer the whole match.
This behaviour may look strange, but it is very useful for easily extracting parts of a string in a particular context (a substring before or after something), especially since the Python re module doesn't support variable-length lookbehinds.
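To make the difference concrete, here is a small sketch comparing the pattern from the question with two variants (the variants are mine, added for illustration): one with a non-capturing group, and one with an extra outer group around the whole pattern.

```python
import re

text = 'been'

capturing = re.findall(r'b(e)*', text)        # only the (last) content of group 1
non_capturing = re.findall(r'b(?:e)*', text)  # no groups, so the whole match
outer_group = re.findall(r'(b(e)*)', text)    # two groups, so a list of tuples

print(capturing)      # ['e']
print(non_capturing)  # ['bee']
print(outer_group)    # [('bee', 'e')]
```

Note that a repeated group such as (e)* only keeps its last repetition, which is why a single 'e' comes back rather than 'ee'.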

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
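As a small check of the raw-string advice, the sketch below compares the escaped pattern from the question with the raw, triple-quoted version. The two literals produce exactly the same pattern text; the html sample string is made up for illustration.

```python
import re

# Pattern as written in the question: every backslash and quote is escaped.
escaped = "<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>"

# Same pattern as a raw, triple-quoted literal: no escaping needed.
raw = r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>"""

print(escaped == raw)  # True

html = '<div id="header">Some random text</div>'  # hypothetical sample input
print(re.findall(raw, html))  # [('"', 'Some random text')]
```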
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have a capture group in your regex pattern, the findall method returns only the captured groups for each match: a list of strings for a single group, or a list of tuples when there is more than one group. The whole match, group(0), is not included.
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that it is not re-compiled on every iteration. Use re.compile for that.
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, as Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they refer back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return the matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print(p.search(s).group(2))
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
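For completeness, here is a rough sketch of what the backreference actually buys you compared with the version that repeats the character class. The variable names and the mismatched-quote sample string are made up for illustration.

```python
import re

with_backref = re.compile(r'''<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>''')
without_backref = re.compile(r'''<div[^>]*id\s*=\s*["']header["'][^>]*>(.*?)</div>''')

good = "<div id='header'>Some random text</div>"
bad = "<div id='header\">Some random text</div>"  # opening ' but closing "

# \1 must repeat the exact quote character captured by group 1.
print([m.group(2) for m in with_backref.finditer(good)])  # ['Some random text']
print(with_backref.findall(bad))                          # []

# Without the backreference, mismatched quotes are accepted.
print(without_backref.findall(bad))                       # ['Some random text']
```

So ([abc]) ... \1 is stricter than [abc] ... [abc]: the backreference matches the same text that the group matched, not just any character from the class.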

Issues with Python re.findall when matching variables [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
I am trying to match two string variables, and would like to catch multiple matches. re.findall seems like the obvious choice for this task, but it doesn't appear to be working the way I would expect it to. The following is an example:
a = 'a(pp)?le'
b = 'ale, apple, apol'
match = re.findall(a,b)
match
['','pp']
However, when I apply the same variables to re.search, it recognizes the embedded regular expression within the string, and picks up the first match:
match = re.search(a,b)
match.group()
'ale'
Can anyone explain why re.findall is not working in this instance? I would expect the following:
match = re.findall(a,b)
match
['ale','apple']
Thanks!
You are using a capturing group, whereas you want a non-capturing group:
a = 'a(?:pp)?le'
As stated in the docs (...) in a regex will create a "capturing group" and the result of re.findall will be only what is inside the parens.
If you just want to group things (e.g. for the purpose of applying a ?) use (?:...), which creates a non-capturing group. With no capturing group left in the pattern, re.findall returns the whole match.
The key part of the re.findall docs is:
If one or more groups are present in the pattern, return a list of groups
this explains the difference in results between re.findall and re.search.
Let me quote the Python docs about re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
And this is what your expression a(pp)?le does. findall returns only the content of your group, i.e. pp, or an empty string when the optional group did not take part in the match. You can always disable this special behavior of a group by using a non-capturing group (?:...).
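If you would rather keep the capturing group, re.finditer gives you match objects, so you can ask for the whole match and the group separately. A minimal sketch using the strings from the question:

```python
import re

pattern = re.compile(r'a(pp)?le')
text = 'ale, apple, apol'

whole = [m.group(0) for m in pattern.finditer(text)]   # the full match each time
groups = [m.group(1) for m in pattern.finditer(text)]  # just the optional group

print(whole)                  # ['ale', 'apple']
print(groups)                 # [None, 'pp']
print(pattern.findall(text))  # ['', 'pp']
```

Note the small difference: a group that did not participate is None on a match object, but shows up as an empty string in the findall result.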

Finding Regex Pattern after doing re.findall

This is in continuation of my earlier question where I wanted to compile many patterns as one regular expression and after the discussion I did something like this
REGEX_PATTERN = '|'.join(self.error_patterns.keys())
where self.error_patterns.keys() would be patterns like
: error:
: warning:
cc1plus:
undefine reference to
Failure:
and do
error_found = re.findall(REGEX_PATTERN,line)
Now when I run it against a file that might contain one or more of these patterns, how do I know exactly which pattern matched? I can of course read the line manually and work it out, but I want to know whether, after calling re.findall, I can find out the pattern with something like re.group().
Thank you
re.findall will return all portions of text that matched your expression.
If that is not sufficient to identify the pattern unambiguously, you can still do a second re.match/re.search against the individual subpatterns you have join()ed. By the time the combined regular expression is applied, however, the matcher is no longer aware that you composed it from several subpatterns, so it cannot tell you which subpattern matched.
Another, equally unwieldy option would be to enclose each pattern in a group (...). Then re.findall will return a list of tuples in which every group that did not take part in the match is an empty string, and only the group for the pattern that matched is non-empty.
MatchObject has a lastindex property that contains the index of the last capturing group that participated in the match. If you enclose each pattern in its own capturing group, like this:
(: error:)|(: warning:)
...lastindex will tell you which one matched (assuming you know the order in which the patterns appear in the regex). You'll probably want to use finditer() (which creates an iterator of MatchObjects) instead of findall() (which returns a list of strings). Also, make sure there are no other capturing groups in the regex, since they would throw your indexing out of sync.
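Put together, the lastindex approach might look something like this sketch. The patterns list stands in for self.error_patterns.keys(), the which_patterns helper and the sample line are hypothetical, and it assumes none of the sub-patterns contains a capturing group of its own.

```python
import re

# Stand-in for self.error_patterns.keys(); order matters for the index lookup.
patterns = [': error:', ': warning:', 'cc1plus:', 'Failure:']

# Wrap every sub-pattern in its own capturing group: (p1)|(p2)|...
combined = re.compile('|'.join('(' + p + ')' for p in patterns))

def which_patterns(line):
    """Return (pattern, matched_text) for every match found in line."""
    # lastindex is 1-based, so subtract 1 to index into the patterns list.
    return [(patterns[m.lastindex - 1], m.group()) for m in combined.finditer(line)]

line = 'main.c:3: error: undefined; cc1plus: some warnings being treated as errors'
print(which_patterns(line))
# [(': error:', ': error:'), ('cc1plus:', 'cc1plus:')]
```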
