?= and ?P combined in a regex - python

in short :
I want to use the Lookahead technique in Python with the ?P<name> convention (details here) to get the groups by name.
more details :
I discovered the Lookahead trick here; e.g. the following regex...
/^(?=.*Tim)(?=.*stupid).+
... makes it possible to detect strings like "Tim stupid" or "stupid Tim", the order not being important.
I can't figure out how I can combine the ?= "operator" with the ?P one; the following regex obviously doesn't do the trick but gives an idea of what I want :
/^(?=?P<word1>.*Tim)(?=?P<word2>.*stupid).+

The ?P<word1> in your regex resembles the syntax of a named capture group:
The syntax for a named group is one of the Python-specific extensions: (?P<name>...). *name* is, obviously, the name of the group. Named groups also behave exactly like capturing groups, and additionally associate a name with a group.
So, most probably you are looking for a way to put named capture groups inside positive lookaheads anchored at the start of the string, so that the string must satisfy both patterns and each lookahead captures its own substring:
^(?=(?P<word1>.*Tim))(?=(?P<word2>.*stupid)).+
    ^^^^^^^^^^     ^    ^^^^^^^^^^        ^
See the regex demo
Note that if you do not need the string itself, .+ is redundant and can be removed. You might want to re-adjust the borders of the named capture groups if necessary.
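A minimal Python sketch of the pattern above, to show what the named groups actually hold. The second pattern, `words_only`, is not from the answer: it is an assumed variant that moves `.*` outside the named groups, in case only the words themselves are wanted.

```python
import re

# Named groups placed inside the lookaheads, as in the answer.
both = re.compile(r'^(?=(?P<word1>.*Tim))(?=(?P<word2>.*stupid)).+')

m = both.search("stupid Tim")
print(m.group('word1'))  # text from the start up to and including "Tim"
print(m.group('word2'))  # text from the start up to and including "stupid"

# Assumed variant: keep .* outside the named groups so that each
# group captures only the word itself.
words_only = re.compile(r'^(?=.*(?P<word1>Tim))(?=.*(?P<word2>stupid))')
m2 = words_only.search("stupid Tim")
print(m2.groupdict())
```

Because lookaheads do not consume characters, both groups are matched starting from position 0, which is why the order of the two words does not matter.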

Python regex conditional, don't match if

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that strips out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 letters (a-z or A-Z) followed by 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
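A short sketch of how this pattern behaves in Python. The helper name `normalize` is hypothetical. One assumption to note: the lookahead can only see the hyphen if the regex runs on the raw identifier, before hyphens are stripped.

```python
import re

# Pattern from the answer above; the negative lookahead rejects "-<digit>"
# directly after the 2-letter, 4-digit prefix.
ident = re.compile(r'^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?')

def normalize(raw):
    """Return the normalized identifier, or None if the pattern rejects it."""
    m = ident.match(raw)
    if m is None:
        return None
    # Group 2 is optional, so it may be None.
    return m.group(1) + (m.group(2) or '')

for raw in ["AB1201", "AB1201L1", "AB1201T", "AB1201-02"]:
    print(raw, '->', normalize(raw))
```

Identifiers that are rejected come back as `None`, so the caller can raise or log them as exceptions instead of treating them as matches.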
Try this regular expression:
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.

mapping regular expression group matches

I want to match the regular expressions \(.*\), \[.*\], \{.*\}, and \<.*\>. Is there a way to combine these regular expressions?
For example, I had in mind something like:
([\(\[\{\<]).*\1, but of course this matches \(.*\(, \[.*\[, \{.*\{, and \<.*\<.
My goal is to be able to match a previous regular expression group, but apply a function to the group before matching it.
Consider:
import re

def match_pairs(pairs):
    # Escape the delimiters so that "(" or "[" are matched literally.
    return '|'.join("({begin}.*{end})".format(begin=re.escape(beg), end=re.escape(end))
                    for (beg, end) in pairs)
I'm considering using something similar to the above function for now, but ideally this function wouldn't return a really long regex. Do let me know if you think this question doesn't have any practical merit. I'm still curious to know if Python3 supports any feature like this, sort of how like re.sub can take a function as the replacement. If no such feature exists, how can I write match_pairs so that it can take in ["()", "[]", "[]", "{}"] as an argument?
The obvious (and shortest) regex for this task is \(.*\)|\[.*\]|\{.*\}|\<.*\>.
The downside is that you have four copies of the .* subpattern, so if you ever need to change it, you'll have to change it in 4 places. Luckily we can work around this problem with some use of capture groups:
(?:\(()|\[()|\{()|<()).*(?:\1\)|\2\]|\3\}|\4>)
Online demo.
This may look confusing, but it's actually very simple. The pattern is built like this:
(?:opening_char_1()|opening_char_2()|...).*(?:\1closing_char_1|\2closing_char_2|...)
This uses a fairly straightforward little trick: Each opening character ((, [, {, <) is accompanied by a capture group like so: \[(). This allows us to "remember" which opening character was matched - if capture group 1 matched, we know the opening character was (. If capture group 2 matched, the opening character was [, and so on. So we simply use backreferences (\1, \2, etc) to find out what the opening character was, and then match the corresponding closing character.
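The empty-group trick can also be generated from a list of pairs. The following is only a sketch: this version of `match_pairs` is my own construction rather than code from the answer, and it assumes each pair is a two-character string whose closing character is not a digit (a digit would run into the backreference number).

```python
import re

def match_pairs(pairs):
    # One "opener + empty capture group" alternative per pair, e.g. \(()
    openers = "|".join("{}()".format(re.escape(p[0])) for p in pairs)
    # One "backreference + closer" alternative per pair, e.g. \1\)
    closers = "|".join(
        "\\{}{}".format(i, re.escape(p[1])) for i, p in enumerate(pairs, 1)
    )
    return "(?:{}).*(?:{})".format(openers, closers)

pattern = re.compile(match_pairs(["()", "[]", "{}", "<>"]))

print(pattern.search("f(x) = [1, 2]").group())
print(pattern.search("a <b> c").group())
print(pattern.search("(abc]"))  # mismatched pair
```

This works in Python's `re` because a backreference to a group that did not participate in the match fails, so only the closer belonging to the opener that actually matched can succeed.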

Named non-capturing group in python?

Is it possible to have named non-capturing group in python? For example I want to match string in this pattern (including the quotes):
"a=b"
'bird=angel'
I can do the following:
import re
s = '"bird=angel"'
myre = re.compile(r'(?P<quote>[\'"])(\w+)=(\w+)(?P=quote)')
m = myre.search(s)
m.groups()
# ('"', 'bird', 'angel')
The result captures the quote group, which is not desirable here.
No, named groups are always capturing groups. From the documentation of the re module:
Extensions usually do not create a new group; (?P<name>...) is the
only exception to this rule.
And regarding the named group extension:
Similar to regular parentheses, but the substring matched by the group
is accessible within the rest of the regular expression via the
symbolic group name name
Where regular parentheses means (...), in contrast with (?:...).
You do need a capturing group in order to match the same quote: there is no other mechanism in re that allows you to do this, short of explicitly distinguishing the two quotes:
myre = re.compile('"{0}"' "|'{0}'".format(r'(\w+)=(\w+)'))
(which has the downside of giving you four groups, two for each style of quotes).
Note that one does not need to give a name to the quotes, though:
myre = re.compile(r'([\'"])(\w+)=(\w+)\1')
works as well.
In conclusion, you are better off using groups()[1:] in order to get only what you need, if at all possible.
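A small sketch of that advice. The group names `key` and `value` are my own additions for illustration; naming the wanted groups is an alternative to slicing `groups()`.

```python
import re

# The quote still has to be captured so it can be backreferenced;
# the parts we care about get their own (hypothetical) names.
myre = re.compile(r'(?P<quote>[\'"])(?P<key>\w+)=(?P<value>\w+)(?P=quote)')

m = myre.search('"bird=angel"')
print(m.groups()[1:])           # skip the quote group by slicing
print(m.group('key', 'value'))  # or pick the wanted groups by name

# Mismatched quotes do not match, because (?P=quote) must repeat
# whatever the quote group captured.
print(myre.search('"bird=angel\''))
```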

not returning the whole pattern in regex in python

I have the following code:
import re

haystack = "aaa months(3) bbb"
needle = re.compile(r'(months|days)\([\d]*\)')
instances = list(set(needle.findall(haystack)))
print(str(instances))
I'd expect it to print months(3) but instead I just get months. Is there any reason for this?
needle = re.compile(r'((?:months|days)\([\d]*\))')
fixes your problem.
You were capturing only the months|days part.
In this specific situation, this regex is a bit better:
needle = re.compile(r'((?:months|days)\(\d+\))')
This way you will only get results that contain a number; previously, a string like months() would also have matched. If you want to ignore case for variants like Months or Days, also add the re.IGNORECASE flag, like this:
re.compile(r'((?:months|days)\(\d+\))', re.IGNORECASE)
Some explanation for the OP:
A regular expression is made up of many elements, and one of the most important is the capturing group, written "()". Sometimes we want to group without capturing, so we use "(?:)". There are other kinds of groups, but these are the most common.
In this case, we surround the entire regular expression with a capturing group, because you are trying to capture everything. When a pattern contains no capturing groups, findall returns the whole match; because your pattern explicitly contained one group, findall returned only what that group captured.
Now that the entire regular expression is surrounded by a capturing group, we turn the inner group into a non-capturing group by adding ?: to its beginning, as shown above. We could also have left out the outer group and only made the inner group non-capturing, since findall returns the whole match when no capturing group is present. I personally prefer explicit coding.
Further information about regular expressions can be found here: http://docs.python.org/library/re.html
Parens are not just for grouping, but also for forming capture groups. What you want is re.compile(r'(?:months|days)\(\d+\)'). That uses a non-capturing group for the or condition, and will not get you a bunch of subgroup matches you don't appear to want when using findall.
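A quick side-by-side of the three patterns discussed in this thread, to show how `findall` results depend on the capturing groups in the pattern. The variable names are only illustrative.

```python
import re

haystack = "aaa months(3) bbb"

# One capturing group: findall returns only what that group captured.
only_unit = re.findall(r'(months|days)\(\d+\)', haystack)

# No capturing groups: findall returns the whole match.
whole = re.findall(r'(?:months|days)\(\d+\)', haystack)

# Explicit outer capturing group around a non-capturing inner group.
explicit = re.findall(r'((?:months|days)\(\d+\))', haystack)

print(only_unit)
print(whole)
print(explicit)
```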

Finding Regex Pattern after doing re.findall

This is in continuation of my earlier question where I wanted to compile many patterns as one regular expression and after the discussion I did something like this
REGEX_PATTERN = '|'.join(self.error_patterns.keys())
where self.error_patterns.keys() would be pattern like
: error:
: warning:
cc1plus:
undefine reference to
Failure:
and do
error_found = re.findall(REGEX_PATTERN,line)
Now when I run it against a file that might contain one or more of these patterns, how do I know exactly which pattern matched? I can of course inspect the line manually and find out, but I want to know whether, after calling re.findall, I can retrieve the pattern with something like re.group().
Thank you
re.findall will return all portions of text that matched your expression.
If that is not sufficient to identify the pattern unambiguously, you can still do a second re.match/re.search against the individual subpatterns you have join()ed. By the time your combined regular expression is applied, however, the matcher is no longer aware that you composed it from several subpatterns, so it cannot tell you which subpattern matched.
Another, equally unwieldy option would be to enclose each pattern in a group (...). Then re.findall will return a list of tuples, one per match, in which every group that did not take part in the match is an empty string and only the group that matched holds text.
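A brief sketch of what `findall` returns once each subpattern has its own group, using just two of the patterns from the question. In Python's `re`, groups that did not participate appear as empty strings in the `findall` result.

```python
import re

combined = re.compile(r'(: error:)|(: warning:)')

hits = combined.findall("a.c:1: error: x\nb.c:2: warning: y")
print(hits)

# The position of the non-empty entry in each tuple identifies the
# subpattern that matched.
which = [[i for i, text in enumerate(t) if text][0] for t in hits]
print(which)
```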
MatchObject has a lastindex property that contains the index of the last capturing group that participated in the match. If you enclose each pattern in its own capturing group, like this:
(: error:)|(: warning:)
...lastindex will tell you which one matched (assuming you know the order in which the patterns appear in the regex). You'll probably want to use finditer() (which creates an iterator of MatchObjects) instead of findall() (which returns a list of strings). Also, make sure there are no other capturing groups in the regex, as they would throw your indexing out of sync.
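The `lastindex` approach might look like the sketch below. The helper name `matched_patterns` is hypothetical, and `re.escape` is used on the assumption that the patterns are meant as literal text; if they are real regular expressions, drop `re.escape` and make sure they contain no capturing groups of their own.

```python
import re

patterns = [": error:", ": warning:", "cc1plus:", "undefine reference to", "Failure:"]

# Wrap every subpattern in its own capturing group, in a known order.
combined = re.compile("|".join("({})".format(re.escape(p)) for p in patterns))

def matched_patterns(line):
    """Return the source pattern behind each match found in line."""
    # lastindex is 1-based, so subtract 1 to index into the list.
    return [patterns[m.lastindex - 1] for m in combined.finditer(line)]

print(matched_patterns("foo.c:3: error: x; cc1plus: warnings"))
```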
