regex to dictionary using group label ?P<> - python

I'm using the regex module instead of the default re module in Python:
https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails
I'm trying to do the following:
>>> regex.compile('(?P<heavy>heavily|heavy)').search("My laptop is heavy or heavily").groupdict()
{'heavy': 'heavy'}
I expect it returns
{'heavy': ['heavy', 'heavily']}
regex.findall will match both heavy and heavily, but it doesn't work with a group label.
I have to solve it with regex alone, so solutions that iterate through the string are not acceptable.

Have you read the Python documentation on regexes?
Relevant portion:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
This means that your regex:
(?P<heavy>heavily|heavy)
Will find the first match, which is "heavy", and save that string. It then says "congrats, I'm done!" and stops: search() returns after the first successful match.
It saves that string, heavy, into a group (as your regex requests) also called heavy. Your groupdict() call then returns this information. So you have a group named heavy with one match, also heavy, which gives you the return result of
{"heavy": "heavy"}
You need an approach that will capture both strings.
To resolve your issue, there are two options.
Use the findall method, which returns a list that you can then turn into a dictionary. This is the easier route.
Craft a regex that will actually find both terms and place them into the same group. While doable, this is very convoluted.
I highly recommend you use the findall method if you wish to collect multiple matches.
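For reference, here is a minimal sketch of that easier route; it uses the stdlib re module (findall behaves the same way in both modules), and the dictionary key is simply chosen by hand rather than coming from a group label:

```python
import re

text = "My laptop is heavy or heavily"

# findall returns every non-overlapping match, scanning left to right
matches = re.findall(r'heavily|heavy', text)

# build the dictionary yourself instead of relying on groupdict()
result = {'heavy': matches}
print(result)  # {'heavy': ['heavy', 'heavily']}
```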

Related

Python Regex: How to get all matches of the _entire_ regex in a string with different occurrences and multiple regexes involved

I extracted text from a PDF and I am using re.finditer() at the moment, but as the documentation on re.match() says, the latter already returns a match object if "zero or more characters at the beginning of string match the regular expression pattern".
re.finditer() also behaves that way: it is evidently enough for some portion of the beginning of two quite similar parts of the string to count as an occurrence or "match" of the same compiled regular expression, which is NOT what I want/need.
In order to correctly "parse" the full text extracted from the PDF, I will need to employ multiple regular expressions, and I must employ them in their entirety: either a block of text (whose size is unknown beforehand) fully satisfies one unique, specific pattern type or it does not.
Sadly, re.fullmatch is not an alternative, because it wants to match the whole text, but as I said the whole text is a composition of different regexp patterns, partly with multiple occurrences differing only at a very individual level (such as the name of the store where I purchased things), and that detail still sits inside the respective regexp type as a capturing group, which I need to process further.
Hence, the question is: what else can I use besides re.finditer() if I don't know the start and end position of each possible block? Finding out where the borders of each type-block instance lie is one reason I want to test the text against multiple regexes.

Is a single big regex more efficient than a bunch of smaller ones?

I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?
It depends on what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
This pattern is very inefficient because it starts with an alternation; in the worst case (no match), each alternative needs to be tested at every position in the string.
(Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with a literal character.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
Most of the positions where a match isn't possible are quickly discarded before the alternation is even tried.
Also, when you write a single pattern, the string is obviously scanned only once.
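As a sanity check, a small sketch (with made-up sample text) showing that the prefiltered variant finds exactly the same matches as the plain alternation:

```python
import re

text = "noise UKabcdefghij filler PLabcdefg end"

# plain alternation: every branch may be tried at every position
naive = re.compile(r'UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}')

# same alternation, but a cheap two-character lookahead discards
# most non-matching positions before any branch is tried
prefiltered = re.compile(
    r'(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})')

print(naive.findall(text))        # ['UKabcdefghij', 'PLabcdefg']
print(prefiltered.findall(text))  # ['UKabcdefghij', 'PLabcdefg']
```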

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I have figured out so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy, which means they try to match as many characters as possible. You can add ? after a quantifier to make it lazy, so that it matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
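To make the difference concrete, a small sketch on an invented two-entry snippet:

```python
import re

html = "<dd><p>abs(x) doc</dd></dl> filler <dd><p>all(it) doc</dd></dl>"

# greedy: [\D3]+ swallows the first </dd></dl> and only stops at the last one
greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)
# lazy: [\D3]+? stops at the first </dd></dl> it can reach
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(greedy)  # ['abs(x) doc</dd></dl> filler <dd><p>all(it) doc']
print(lazy)    # ['abs(x) doc', 'all(it) doc']
```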

Matching both conditions with regex

I'm trying to match an input string against both of two given conditions. For example, if I give '000011001000' as input and want to match it against both '1001' and '0110', what would the regex I need look like?
I tried different combinations, but couldn't find the correct one. The closest I got was using
re.match("(1001.*0110)+?")
but that one doesn't work when the input is, for example, '0001100100'.
This pattern makes use of "look-arounds" which you should learn about for regex.
(?=[01]*1001[01]*)(?=[01]*0110[01]*)[01]+
in response to the comments:
Look-arounds in regex are a simple way of checking the match against extra conditions. What the engine essentially does is pause the current match cursor when it hits the (?= token (there are also others, such as ?!, ?<=, and ?<!) and read the next characters using the pattern inside of the look-around statement. If that inner pattern is not fulfilled, the match fails; if it is, the original cursor keeps matching. Imagine it as a probe that goes ahead of an explorer to check the environment ahead.
If you want more reference material, RexEgg is probably my favourite site for learning regex syntax and nifty tricks.
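A quick sketch of the pattern in action (the stdlib re module is enough here):

```python
import re

# two lookaheads, both anchored at the start, each require one substring
both = re.compile(r'(?=[01]*1001[01]*)(?=[01]*0110[01]*)[01]+')

print(bool(both.match('000011001000')))  # True: contains '1001' and '0110'
print(bool(both.match('0000')))          # False: '1001' never appears
```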

Python regex: how to match anything up to a specific string and avoid backtracking when failing

I'm trying to craft a regex able to match anything up to a specific pattern. The regex will then continue looking for other patterns until the end of the string, but in some cases the pattern will not be present and the match will fail. Right now I'm stuck at:
.*?PATTERN
The problem is that, in cases where the pattern is not present, this takes too much time due to backtracking. To mitigate this, I tried mimicking atomic grouping using a positive lookahead, as explained in this thread (btw, I'm using the re module in python-2.7):
Do Python regular expressions have an equivalent to Ruby's atomic grouping?
So I wrote:
(?=(?P<aux1>.*?))(?P=aux1)PATTERN
Of course, this is faster than the previous version when PATTERN is not present, but the trouble is, it doesn't match PATTERN anymore, as the .* matches everything to the end of the string and the previous states are discarded after the lookahead.
So the question is: is there a way to do a match like .*?PATTERN and also be able to fail fast when the pattern is not present?
You could try using split.
If the result has length 1, you got no match. If you get two or more elements, you know the first one is the text before the first match. If you limit the split to a single split, you'll short-circuit the later matching:
"HI THERE THEO".split("TH", 1) # ['HI ', 'ERE THEO']
The first element of the results is up to the match.
One-Regex Solution
^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN
Explanation
You wanted to use the atomic grouping like this: (?>.*?)PATTERN, right? This won't work. The problem is, you can't use a lazy quantifier at the end of an atomic group: the definition of the AG is that once you're outside of it, the regex won't backtrack inside.
So the regex engine will match the .*? lazily: it steps outside of the group to check whether the next character is a P, and if it's not, it is unable to backtrack inside the group to consume that next character with the .*.
What's usually used in Perl are structures like this: (?>(?:[^P]|P(?!ATTERN))*)PATTERN. That way, the equivalent of .* (here (?:[^P]|P(?!ATTERN))) won't "eat up" the wanted pattern.
This pattern is easier to read in my opinion with possessive quantifiers, which are made just for these occasions: (?:[^P]|P(?!ATTERN))*+PATTERN.
Translated with your workaround, this leads to the regex above (I added ^ since you should anchor the regex, either to the start of the string or to another regex).
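A sketch of the final regex, keeping PATTERN as a literal placeholder token and using invented sample strings:

```python
import re

# lookahead + backreference emulate the atomic group (?>...)
fast = re.compile(r'^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN')

m = fast.match("some prefix PATTERN and the rest")
print(m.group(0))  # 'some prefix PATTERN'

# when the token is absent, the engine fails without retrying shorter
# prefixes, because lookaheads are never backtracked into once satisfied
print(fast.match("no token here"))  # None
```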
The Python documentation includes a brief outline of the differences between the re.search() and re.match() functions: http://docs.python.org/2/library/re.html#search-vs-match. In particular, the following quote is relevant:
Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.
Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.
In your case, it would be preferable to define your pattern simply as:
pattern = re.compile("PATTERN")
And then call pattern.search(...), which will not backtrack when the pattern is not found.
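That is, assuming PATTERN is a plain literal (the sample text is invented):

```python
import re

pattern = re.compile("PATTERN")
text = "a long prefix before PATTERN and a long suffix after"

m = pattern.search(text)  # scans forward itself; no leading .*? needed
print(m.start(), m.group(0))
```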
