If I have:
statement = "(2*(3+1))*2"
I want to be able to handle multiple parentheses within parentheses for a math reader I'm writing. Perhaps I'm going about this the wrong way, but my goal was to recursively go deeper into the parentheses until there were none, and then I would perform the math operations. Thus, I would first want to focus on
"(2*(3+1))"
then focus on
"(3+1)"
I hoped to do this by assigning the focus value to the start and end indices of the regex match. I have yet to figure out how to find the end index, but I'm more interested in why the regex
r"\(.+\)"
failed to match. I wanted it to read as "any one or more characters contained within a set of parentheses". Could someone explain why the above expression will not match the above statement in Python?
I love regular expressions. I use them all the time.
Don't use regular expressions for this.
You want an actual parser that will actually parse your math expressions. You might want to read this:
http://effbot.org/zone/simple-top-down-parsing.htm
Once you have actually parsed the expression, it's trivial to walk the parse tree and compute the result.
EDIT: @Lattyware suggested pyparsing, which should also be a good way to go, and might be easier than the EFFBot solution posted above.
https://github.com/pyparsing/pyparsing
Here's a direct link to the pyparsing sample code for a four-function algebraic expression evaluator:
http://pyparsing.wikispaces.com/file/view/fourFn.py
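To make the "parse it, then evaluate it" advice concrete, here is a minimal hand-rolled sketch for just +, * and parentheses. It is my own illustration (not the effbot or pyparsing code), and it evaluates while it parses rather than building an explicit tree first:

import re

def tokenize(s):
    # numbers plus the only symbols this toy grammar cares about
    return re.findall(r'\d+|[()+*]', s)

def parse_expr(tokens):          # expr := term ('+' term)*
    value = parse_term(tokens)
    while tokens and tokens[0] == '+':
        tokens.pop(0)
        value += parse_term(tokens)
    return value

def parse_term(tokens):          # term := factor ('*' factor)*
    value = parse_factor(tokens)
    while tokens and tokens[0] == '*':
        tokens.pop(0)
        value *= parse_factor(tokens)
    return value

def parse_factor(tokens):        # factor := NUMBER | '(' expr ')'
    tok = tokens.pop(0)
    if tok == '(':
        value = parse_expr(tokens)
        tokens.pop(0)            # consume the closing ')'
        return value
    return int(tok)

print(parse_expr(tokenize("(2*(3+1))*2")))   # 16

Note how the nesting of parentheses is handled simply by parse_factor calling back into parse_expr; no regex has to "count" anything.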
for what it's worth, here's a little more context:
regular expressions are called "regular" because they're associated with regular grammars, and regular grammars cannot describe (an unlimited number of) nested parentheses (they can describe a bunch of random parentheses, but cannot make them match in neat pairs).
one way to understand this is to understand that regular expressions can (modulo some details which i will explain at the end) be converted to deterministic finite automatons. which sounds intimidating but really just means that they can be converted into lists of "rules", where the rules depend on what you matched, and describe what you can match.
for example, the regular expression ab*c can be converted to:
1. at the start, you can only match a. then go to 2.
2. now, you can match b and go back to 2, or match c and go to 3.
3. you're done! the match was a success!
and that is a "deterministic finite automaton".
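as a concrete (and entirely optional) illustration, here's roughly what those rules look like as a little table-driven matcher in python. this is just my own sketch, not part of the argument above:

transitions = {
    (1, 'a'): 2,   # rule 1: at the start you can only match a, then go to 2
    (2, 'b'): 2,   # rule 2: match b and stay at 2 ...
    (2, 'c'): 3,   # ... or match c and go to 3 (the "done" state)
}

def matches(s):
    state = 1
    for ch in s:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state == 3

print(matches("abbbc"), matches("ac"), matches("abca"))   # True True False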
anyway, the interesting part of this is that if you sit down and try to make something like that for matching pairs of parentheses you can't! try it. you can match a finite number by making more and more rules, but you can't write a general set of rules that match an unlimited number of parentheses (i should add that the rules have to be of the form "if you match X go to Y").
now obviously you could modify that in various ways. you could allow more complex rules (like extending them to let you keep a count of the parentheses), and you could then get something that worked as you expect. but it wouldn't be a regular grammar.
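to make the counting idea concrete (again, just my own sketch of what such an extended rule set buys you):

def balanced(s):
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:        # a ')' with no matching '('
                return False
    return depth == 0            # every '(' was closed

print(balanced("(2*(3+1))*2"), balanced("(()"))   # True False

the counter is exactly the thing a finite list of "if you match X go to Y" rules cannot express.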
given that regular expressions are limited in this way, why are they used rather than something more complex? it turns out that they're something of a sweet spot - they can do a lot, while remaining fairly simple and efficient. more complex grammars (kinds of rules) can be more powerful, but are also harder to implement, and have more problems with efficiency.
final disclaimer and promised extra details: in practice many regular expressions these days actually are more powerful than this (and should not really be called "regular expressions"). but the above is still the basic explanation of why you should not use a regexp for this.
ps jesse's suggested solution gets round this by using a regexp multiple times; the argument here is for a single use of the regexp.
I probably agree with steveha, and don't recommend regex for this, but to answer your question specifically, you need unescaped parens to pull out results groups (your pattern only has escaped parens):
>>> re.match(r"\((.+)\)", "(2*(3+1))*2").group(1)
'2*(3+1)'
If you go that route, you could iterate over the match results until you run out of matches, and then reverse the results list to work inside out.
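For what it's worth, a minimal sketch of that inside-out iteration might look like this (my own illustration; it only collects the nested levels and leaves the arithmetic to you):

import re

statement = "(2*(3+1))*2"
paren = re.compile(r"\((.+)\)")     # unescaped inner parens capture the contents

levels = []
text = statement
while True:
    m = paren.search(text)
    if m is None:
        break
    text = m.group(1)               # descend one level: '2*(3+1)', then '3+1'
    levels.append(text)

levels.reverse()                    # innermost first
print(levels)                       # ['3+1', '2*(3+1)']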
I'm trying to use python PLY (lex/yacc) to parse a language called 'GRBL'.
GRBL looks something like this:
G00 X0.0 Y0.0 Z-1.0
G01 X1.0
..
The 'G' Codes tell a machine to 'go' (or move) and the coordinates say where.
LEX requires us to specify a unique regular expression for every possible 'token'.
So in this case I need a regex that will clearly define 'G00' and one that will clearly define 'G01' etc.
Obviously one's first thought would be r'G00' etc.
However G code is imprecise. The G can be upper or lower case, there can be leading zeros etc.
(g0, G00, g001 etc.)
So something for G00 may be as simple as:
r'[Gg]{1}0*'
And for G01 we could have
r'[Gg]{1}0*1'
But this does not work. G00 parses correctly, but G01 gives:
LexToken(G00,'G0',3,21)
Illegal character '1'
That is, lex thinks that G01 is a G0 token and doesn't know what to do with the '1'.
Which is clearly some sort of greedy matching problem.
Unfortunately I can't use the "$" terminator to specify that the string must 'end' with a "1"
I realise this might seem simple to some, but I've been at this for 3 hours and can't get it to work! Does anyone know how to address this problem?
Note: There's pretty well no reason to write {1} in a regular expression. It means that the previous element should be repeated exactly once, which is what would have happened without the repetition operator. So all it does is to obfuscate the regular expression (and slow down matching).
But that's not your problem. Your problem is likely the order in which Ply applies the regular expressions. Ply creates a single massive Python regular expression by concatenating all patterns into a set of alternatives:
(pattern1)|(pattern2)|(pattern3)|...|(patternz)
The order in which the patterns are inserted is important because Python "regular" expressions use an ordered alternation operator (making them actually irregular in mathematical terms, but that's a side issue). So once some alternative matches, the following ones are not even tried.
The Ply manual defines the ordering:
All tokens defined by functions are added in the same order as they appear in the lexer file.
Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
I'm guessing that you're using functions, so that the patterns are in order of appearance in the file, because your second pattern, which is longer, would be applied first if they were defined as strings. But without seeing your actual file, it's very hard to know for sure.
In any case, conventional wisdom for Ply lexers is to use as few patterns as possible, preferring to map keywords to tokens with dictionaries. In the case of GRBL one possibility might be to use [Gg][0-9]+(\.[0-9]+)? as the pattern and then extract the index in the semantic action.
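For illustration, a lexer along those lines might look like the sketch below. The token names GCODE and COORD and the normalization done in the actions are my own assumptions, not something from the Ply manual or the question:

import ply.lex as lex

tokens = ('GCODE', 'COORD')

def t_GCODE(t):
    r'[Gg][0-9]+(?:\.[0-9]+)?'
    # one pattern for every G word; normalize the code number in the action
    t.value = int(float(t.value[1:]))          # 'g001' -> 1, 'G00' -> 0
    return t

def t_COORD(t):
    r'[XxYyZz]-?[0-9]*\.?[0-9]+'
    t.value = (t.value[0].upper(), float(t.value[1:]))
    return t

t_ignore = ' \t'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("G00 X0.0 Y0.0 Z-1.0\nG01 X1.0\n")
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)

Because a single rule matches every G word, the ordering problem disappears; the parser (or a lookup table) can then decide what G0, G1, etc. mean.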
Your pattern [Gg]{1}0* is too general: it matches G00 and also the G0 at the start of G01 https://regex101.com/r/gkz0Wb/1
In that second case you are left with the single character 1.
You will have to make this pattern more specific, for example by adding a whitespace character at the end of the pattern: [Gg]{1}0*\s
https://regex101.com/r/dsSFaG/1
I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?
It depends what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
This pattern is very inefficient because it starts with an alternation; this means that in the worst case (no match) each alternative needs to be tested at every position in the string.
(Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
then most of the positions where a match isn't possible are quickly discarded before the alternation is tried.
Also, when you write a single pattern, obviously, the string is parsed only once.
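If you want to see the effect for yourself, a rough (and admittedly artificial) comparison could look like this; the haystack string and iteration count here are arbitrary assumptions, not measurements from the answer above:

import re
import timeit

# long haystack with a single candidate code at the very end
text = "x" * 200000 + "UKabcdefghij"

plain = re.compile(r"UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}")
guarded = re.compile(r"(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})")

for name, pat in (("plain", plain), ("guarded", guarded)):
    # the guarded pattern can reject most positions with a single class check
    print(name, round(timeit.timeit(lambda: pat.search(text), number=50), 4))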
I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I have figured out so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fname = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fname)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
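A quick way to see the difference between the two quantifiers (the HTML string below is a made-up snippet, not the real built-in functions page):

import re

html = "<dd><p>abs(x)</dd></dl> filler <dd><p>all(iterable)</dd></dl>"

greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(greedy)   # one long match spanning both entries
print(lazy)     # ['abs(x)', 'all(iterable)']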
I have multiple filters for files (I'm using python). Some of them are glob filters some of them are regular expressions. I have both case sensitive and case insensitive globs and regexes. I can transform the glob into a regular expression with translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression? The combined expression would, of course, have to remain case sensitive overall.
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Given regexes R1, R2, R3, I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (insensitive glob). But I have another glob that is like this: "*abc*" (case sensitive). How can I combine, in code, the two regexes produced by fnmatch.translate when one is case insensitive and the other is case sensitive?
Unfortunately, the regex ability you describe is either ordinal modifiers or a modifier span. Python's re did not support either at the time this was written (modifier spans such as (?i:...) were added in Python 3.6); here is what they would look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
In older Pythons, both fail to parse in re. The closest thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that works equivalent without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
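For illustration only, here is a heavily simplified standalone sketch of the same idea. Unlike the real fnmatch.translate it handles only *, ? and literal characters (no [...] classes) and omits the final anchoring, and the function name is my own:

import re

def translate_nocase(pat):
    out = []
    for c in pat:
        if c == '*':
            out.append('.*')
        elif c == '?':
            out.append('.')
        elif c.upper() != c.lower():
            # a cased character: match either form
            out.append('[' + c.lower() + c.upper() + ']')
        else:
            out.append(re.escape(c))
    return ''.join(out)

print(translate_nocase('*.txt'))    # .*\.[tT][xX][tT]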
I have tried to properly wrap my head around the below, but I still have a big hole in my reasoning. What is ?::, and could someone explain it properly for me?
rule_syntax = re.compile('(\\\\*)'\
'(?:(?::([a-zA-Z_][a-zA-Z_0-9]*)?()(?:#(.*?)#)?)'\
'|(?:<([a-zA-Z_][a-zA-Z_0-9]*)?(?::([a-zA-Z_]*)'\
'(?::((?:\\\\.|[^\\\\>]+)+)?)?)?>))')
There are two tools that you may wish to look into to help with your understanding:
Regexper creates a visual representation of a regex, so you can paste yours in and see how it breaks down.
Regexpal is a tool that allows you to input a regex and various test strings and see what matches, so you can try yours against some example inputs.
(?:expr) is just like normal parentheses (expr), except that for purposes of retrieving groups later (backreferences, re.sub, or MatchObject.group), parenthesized groups beginning with ?: are excluded. This can be useful if you need to capture a complex expression in parentheses to apply another operator like * to it, but don't want to get it mixed in with groups that you'll actually need to retrieve later.
?:: is simply ?: (marking the group as non-capturing) followed by a literal : that the group then matches.
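A tiny demonstration of the difference between capturing and non-capturing groups, and of a group whose first matched character is a literal colon (the patterns here are toy examples, not taken from the rule_syntax above):

import re

m = re.match(r'(?:foo)(bar)', 'foobar')
print(m.group(1))                        # 'bar': the (?:...) group gets no number

m = re.match(r'(?::([a-z]+))', ':name')  # same shape as the '(?::' piece above
print(m.group(1))                        # 'name'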