competing regular expressions (race condition) - python

I'm trying to use python PLY (lex/yacc) to parse a language called 'GRBL'.
GRBL looks something like this:
G00 X0.0 Y0.0 Z-1.0
G01 X1.0
..
The 'G' Codes tell a machine to 'go' (or move) and the coordinates say where.
LEX requires us to specify a unique regular expression for every possible 'token'.
So in this case I need a regex that will clearly define 'G00' and one that will clearly define 'G01' etc.
Obviously one's first thought would be r'G00' etc.
However G code is imprecise. The G can be upper or lower case, there can be leading zeros etc.
(g0, G00, g001 etc.)
So something for G00 may be as simple as:
r'[Gg]{1}0*'
And for G01 we could have
r'[Gg]{1}0*1'
But this does not work. G00 parses correctly, but G01 gives:
LexToken(G00,'G0',3,21)
Illegal character '1'
That is, lex thinks that G01 is a G0 token and doesn't know what to do with the '1'.
Which is clearly some sort of greedy matching problem.
Unfortunately I can't use the "$" terminator to specify that the string must 'end' with a "1"
I realise this might seem simple to some, but I've been at this for 3 hours and can't get it to work! Does anyone know how to address this problem?

Note: There's pretty much no reason to write {1} in a regular expression. It means that the previous element should be repeated exactly once, which is what would have happened without the repetition operator. So all it does is obfuscate the regular expression (and slow down matching).
But that's not your problem. Your problem is likely the order in which Ply applies the regular expressions. Ply creates a single massive Python regular expression by concatenating all patterns into a set of alternatives:
(pattern1)|(pattern2)|(pattern3)|...|(patternz)
The order in which the patterns are inserted is important because Python "regular" expressions use an ordered alternation operator (making them actually irregular in mathematical terms, but that's a side issue). So once some alternative matches, the following ones are not even tried.
The Ply manual defines the ordering:
All tokens defined by functions are added in the same order as they appear in the lexer file.
Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
I'm guessing that you're using functions, so that the patterns are in order of appearance in the file, because your second pattern --which is longer-- would be applied first if they were defined as strings. But without seeing your actual file, it's very hard to know for sure.
In any case, conventional wisdom for Ply lexers is to use as few patterns as possible, preferring to map keywords to tokens with dictionaries. In the case of GRBL one possibility might be to use [Gg][0-9]+(\.[0-9]+)? as the pattern and then extract the index in the semantic action.
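For concreteness, here is a minimal sketch of what that single-pattern lexer might look like in Ply. The token names (GCODE, COORD) and the coordinate pattern are illustrative assumptions, not something taken from the question:

import ply.lex as lex

tokens = ('GCODE', 'COORD')

def t_GCODE(t):
    r'[Gg][0-9]+(?:\.[0-9]+)?'
    # Normalise in the action instead of multiplying patterns:
    # g0, G00 and G001 all become 0.0 here, while G38.2 stays 38.2.
    t.value = float(t.value[1:])
    return t

def t_COORD(t):
    r'[XxYyZz]-?[0-9]*\.?[0-9]+'
    return t

t_ignore = ' \t'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("G00 X0.0 Y0.0 Z-1.0\ng01 X1.0")
for tok in lexer:
    print(tok)

With a single GCODE rule there is nothing for G01 to compete with, so the ordering problem disappears entirely.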

Your pattern [Gg]{1}0* is too general: it matches both G00 and G01 (https://regex101.com/r/gkz0Wb/1), and in the second case you are left with the single character 1.
You will have to make this pattern more specific, for example by adding a whitespace character at the end of the pattern: [Gg]{1}0*\s
https://regex101.com/r/dsSFaG/1
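To make the behaviour concrete, here is what the plain re module (outside of Ply) does with those patterns:

import re

print(re.match(r'[Gg]0*', 'G01').group())         # 'G0'  -- only the prefix is consumed
print(re.match(r'[Gg]0*1', 'G01').group())        # 'G01'
print(re.match(r'[Gg]0*\s', 'G01 X1.0'))          # None  -- \s refuses the following '1'
print(re.match(r'[Gg]0*\s', 'G00 X1.0').group())  # 'G00 ' (note the consumed space)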

Related

Is a single big regex more efficient than a bunch of smaller ones?

I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?
It depends on what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
This pattern is very inefficient because it starts with an alternation, which means that in the worst case (no match) each alternative needs to be tested at every position in the string.
(* Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
Most of the positions where the match isn't possible are quickly discarded before the alternation.
Also, when you write a single pattern, obviously, the string is parsed only once.
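If you want to check this on your own data, a quick harness along these lines can compare the two formulations. The test string below is an assumption (a worst-case non-matching input); real timings will depend on the regex engine and your strings:

import re
import timeit

haystack = "x" * 10000   # worst case: no match anywhere

plain = re.compile(r"UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}")
prefixed = re.compile(r"(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})")

for name, rx in (("plain", plain), ("prefixed", prefixed)):
    print(name, timeit.timeit(lambda: rx.search(haystack), number=1000))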

How to match a string against a set of wildcard strings efficiently?

I am looking for a solution to match a single string against a set of wildcard strings. For example
>>> match("ab", ["a*", "b*", "*", "c", "*b"])
["a*", "*", "*b"]
The order of the output is of no importance.
I will have in the order of 10^4 wildcard strings to match against and I will do around ~10^9 match calls. This means I will probably have to rewrite my code like so:
>>> matcher = prepare(["a*", "b*", "*", "c", "*b"])
>>> for line in lines: yield matcher.match("ab")
["a*", "*", "*b"]
I've started writing a trie implementation in Python that handles wildcards and I just need to get those corner cases right. Despite this I am curious to hear; How would you solve this? Are there any Python libraries out there that make me solve this faster?
Some insights so far:
Named (Python, re) regular expressions will not help me here since they'll only return one match.
pyparsing seems like an awesome library, but is sparsely documented and does not, as I see it, support matching multiple patterns.
You could use the FilteredRE2 class from the re2 library with help from an Aho-Corasick algorithm implementation (or similar). From the re2 docs:
Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.
The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kind of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.
Seems like the Aho-Corasick algorithm would work. esmre seems to do what I'm looking for. I got this information from this question.
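For comparison, here is a naive standard-library baseline for the prepare()/match() interface from the question: each wildcard is translated once with fnmatch.translate, and every call tests all of them. It is linear in the number of patterns per lookup (exactly the cost the FilteredRE2/Aho-Corasick route avoids), but it is a handy reference implementation; the Matcher class name is my own:

import fnmatch
import re

class Matcher(object):
    def __init__(self, wildcards):
        # Compile each wildcard once, up front.
        self._compiled = [(w, re.compile(fnmatch.translate(w))) for w in wildcards]

    def match(self, s):
        # Return every wildcard whose translated regex matches s.
        return [w for w, rx in self._compiled if rx.match(s)]

def prepare(wildcards):
    return Matcher(wildcards)

matcher = prepare(["a*", "b*", "*", "c", "*b"])
print(matcher.match("ab"))   # ['a*', '*', '*b']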

Use regular expression to handle nested parenthesis in math equation?

If I have:
statement = "(2*(3+1))*2"
I want to be able to handle multiple parentheses within parentheses for a math reader I'm writing. Perhaps I'm going about this the wrong way, but my goal was to recursively go deeper into the parentheses until there were none, and then I would perform the math operations. Thus, I would first want to focus on
"(2*(3+1))"
then focus on
"(3+1)"
I hoped to do this by assigning the focus value to the start index and end index of the regex match. I have yet to figure out how to find the end index, but I'm more interested in why the regex
r"\(.+\)"
failed to match. I wanted it to read as "any one or more characters contained within a set of parentheses". Could someone explain why the above expression will not match the above statement in Python?
I love regular expressions. I use them all the time.
Don't use regular expressions for this.
You want an actual parser that will actually parse your math expressions. You might want to read this:
http://effbot.org/zone/simple-top-down-parsing.htm
Once you have actually parsed the expression, it's trivial to walk the parse tree and compute the result.
EDIT: @Lattyware suggested pyparsing, which should also be a good way to go, and might be easier than the EFFBot solution posted above.
https://github.com/pyparsing/pyparsing
Here's a direct link to the pyparsing sample code for a four-function algebraic expression evaluator:
http://pyparsing.wikispaces.com/file/view/fourFn.py
for what it's worth, here's a little more context:
regular expressions are called "regular" because they're associated with regular grammars, and regular grammars cannot describe (an unlimited number of) nested parentheses (they can describe a bunch of random parentheses, but cannot make them match in neat pairs).
one way to understand this is to understand that regular expressions can (modulo some details which i will explain at the end) be converted to deterministic finite automatons. which sounds intimidating but really just means that they can be converted into lists of "rules", where the rules depend on what you matched, and describe what you can match.
for example, the regular expression ab*c can be converted to:
1. at the start, you can only match a. then go to 2.
2. now, you can match b and go back to 2, or match c and go to 3.
3. you're done! the match was a success!
and that is a "deterministic finite automaton".
anyway, the interesting part of this is that if you sit down and try to make something like that for matching pairs of parentheses you can't! try it. you can match a finite number by making more and more rules, but you can't write a general set of rules that match an unlimited number of parentheses (i should add that the rules have to be of the form "if you match X go to Y").
now obviously you could modify that in various ways. you could allow more complex rules (like extending them to let you keep a count of the parentheses), and you could then get something that worked as you expect. but it wouldn't be a regular grammar.
given that regular expressions are limited in this way, why are they used rather than something more complex? it turns out that they're something of a sweet spot - they can do a lot, while remaining fairly simple and efficient. more complex grammars (kinds of rules) can be more powerful, but are also harder to implement, and have more problems with efficiency.
final disclaimer and promised extra details: in practice many regular expressions these days actually are more powerful than this (and should not really be called "regular expressions"). but the above is still the basic explanation of why you should not use a regexp for this.
ps jesse's suggested solution gets round this by using a regexp multiple times; the argument here is for a single use of the regexp.
I probably agree with steveha, and don't recommend regex for this, but to answer your question specifically, you need unescaped parens to pull out result groups (your pattern only has escaped parens):
>>> re.match(r"\((.+)\)", "(2*(3+1))*2").group(1)
'2*(3+1)'
If you go that route, you could iterate over the match results until you run out of matches, and then reverse the results list to work inside out.
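If you do go the regex route, the usual trick is to match only the innermost parentheses (groups that contain no further parentheses), substitute their value back into the string, and repeat until no parentheses remain. Here is a sketch of that idea with a deliberately minimal evaluator that handles only + and * (the helper names are mine, not the poster's code):

import re

INNER = re.compile(r"\(([^()]+)\)")   # a parenthesised group with no nested parens

def eval_flat(expr):
    # Minimal evaluator for flat expressions containing only + and *.
    total = 0.0
    for term in expr.split("+"):
        product = 1.0
        for factor in term.split("*"):
            product *= float(factor)
        total += product
    return total

def evaluate(statement):
    m = INNER.search(statement)
    while m:
        value = eval_flat(m.group(1))
        statement = statement[:m.start()] + str(value) + statement[m.end():]
        m = INNER.search(statement)
    return eval_flat(statement)

print(evaluate("(2*(3+1))*2"))   # 16.0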

Regular expression to parse word structure

I'm trying to build my first non-trivial regular expression (for use in Python), but struggling.
Let us assume that a word in language X (NOT English) is a sequence of minimal 'structures'. Each 'structure' could be:
An independent vowel (basically one letter of the alphabet)
A consonant (one letter of the alphabet)
A consonant followed by a right-attaching vowel
A left-attaching vowel followed by a consonant
(Certain left-attaching vowels) followed by a consonant followed by (certain right-attaching vowels)
For example this word of 3 characters:
<a consonant><a left-attaching vowel><an independent vowel>
is not a valid word, and should not match the regex, because there is no consonant to the right of the left-attaching vowel.
I know all the Unicode ranges - the Unicode ranges for consonants, independent vowels, left-attaching vowels and so on.
Here is what I have so far:
WordPattern = (
    ur'('
    ur'[\u0985-\u0994]|'
    ur'[\u0995-\u09B9]|'
    ur'[\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]|'
    ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9]|'
    ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]'
    ur')+'
)
It's not working. Apart from getting it to work, I have three specific problems:
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges, for code readability and to prevent typing Unicode ranges multiple times.
(This seems very difficult) The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
Any help would be appreciated. This seems very complex to a beginner!
The appropriate tool for morphological analysis of languages with non-trivial morphology is "finite state transducers". There are robust implementations that you can track down and use (one by Xerox PARC). There's one that has Python bindings (for use as an external library). Google it.
FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. It's complex machinery, so if your goals are simple (e.g., syllabification for purposes of hyphenation) you may want to look for something simpler. There are machine-learning algorithms that will "learn" hyphenation, for example. If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.
Now for your algorithm, in case you really only need a trivial implementation: Since any vowel or consonant could be independent, your rules are ambiguous: They allow "ab" to be parsed as "a-b". Such ambiguities mean that a regexp approach will probably never work, but you may get better results if you put the longer regexps first, so they are used in preference to the short ones when both would apply. But really you need to build a parser (by hand or using a module) and try different things in steps. It's backwards from what you imagined: Set up a loop that uses different regexps, and "consumes" the string in steps.
However, it seems to me that what you are describing is essentially syllabification. And the near-universal rule of syllabification is this: A syllable consists of a core vowel, plus as many preceding ("onset") consonants as the rules of the language allow, plus any following consonants that cannot belong to the next syllable. The rule is called "maximize onset", and it has the consequence that it's easier to parse your syllables backwards (from the end of the word). Try it out.
PS. You probably know this, but if you put the following as the second line in your scripts you can embed Bengali in your regexps:
# -*- coding: utf-8 -*-
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
Use the re.VERBOSE flag when compiling the regex.
pattern = re.compile(r"""(
      [\u0985-\u0994]    # comment to explain what this is
    | [\u0995-\u09B9]
    # etc.
    )
    """, re.VERBOSE)
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges
You can construct an RE from ordinary Python strings:
>>> subpatterns = {"vowel": "[aeiou]", "consonant": "[^aeiou]"}
>>> "{consonant}{vowel}+{consonant}*".format(**subpatterns)
'[^aeiou][aeiou]+[^aeiou]*'
The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
I'm not sure if I get what you mean, but... suppose you have a list of (uncompiled) REs, say, patterns, then you can compute their union with
re.compile("(%s)" % "|".join(patterns))
Be careful with special characters when constructing REs this way and use re.escape where necessary.
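Putting those two ideas together (named sub-patterns plus a joined alternation), a sketch along these lines could be a starting point. The structure list mirrors the five structures from the question, the range names are my own labels, and it targets Python 3, so no ur'' prefixes are needed:

import re

ranges = {
    "ind_vowel":   "[\u0985-\u0994]",
    "consonant":   "[\u0995-\u09B9]",
    "right_vowel": "(?:\u09BE|[\u09C0-\u09C4])",
    "left_vowel":  "(?:\u09BF|\u09C7|\u09C8)",
}

# Longer structures first, so they win over their shorter prefixes.
structures = [
    "{left_vowel}{consonant}{right_vowel}",
    "{left_vowel}{consonant}",
    "{consonant}{right_vowel}",
    "{consonant}",
    "{ind_vowel}",
]

word_pattern = re.compile(
    "(?:%s)+" % "|".join(s.format(**ranges) for s in structures)
)

def is_word(s):
    return word_pattern.fullmatch(s) is not None

Extending the set of permissible structures later is then just a matter of appending another template string to the list.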

What is the reason behind the advice that the substrings in regex should be ordered based on length?

longest first
>>> p = re.compile('supermanutd|supermanu|superman|superm|super')
shortest first
>>> p = re.compile('super|superm|superman|supermanu|supermanutd')
Why is the longest first regex preferred?
Alternatives in regexes are tested in the order you provide them, so if the first branch matches, the regex engine doesn't check the other branches. This doesn't matter if you only need to test for a match, but if you want to extract text based on the match, then it matters.
You only need to sort by length when your shorter strings are substrings of longer ones. For example, when you have the text:
supermanutd
supermanu
superman
superm
then with your first regex you'll get:
>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']
but with the second regex:
>>> regex.findall(string)
[u'super', u'super', u'super', u'super']
Test your regexes with http://www.pythonregex.com/
As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.
However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.
I'd guess it's because they're matched in that order, and it's faster to match shorter substrings. As an extreme example, a match against a single letter | a huge string will perform much better if the single letter (which is probably going to be responsible for the majority of matches anyway) is tested first.
But in practice you should measure, not guess. If you need to have a performant regexp, test variations against representative test data.
The advice to which you refer is contingent on the regex engine attempting to match the components of the alternation in strictly left-to-right order, as is documented for the Python re module.
Sorting substrings in descending length order is just a special case of a wider problem when you are trying to extract a series of tokens. The general principle is that you put the more specialised sub-regexes first. For example, you are writing the lexical analysis for a formula parser. You have a "float constant" subregex and an "int constant" subregex. Your first attempt at the float subregex is likely to also match int constants. If so, you have two choices: (1) write a more complicated float subregex that doesn't match int constants (2) put your int subregex first.
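To see the float/int situation concretely, here is a throwaway demonstration (the patterns are mine, not from the answer above). The naive float branch also claims plain integers, while the stricter one, which requires a decimal point, leaves them to the int branch:

import re

naive  = re.compile(r"(\d+(?:\.\d+)?)|(\d+)")   # float-ish branch also matches ints
strict = re.compile(r"(\d+\.\d+)|(\d+)")        # float branch requires a decimal point

for m in naive.finditer("42 3.14"):
    print("naive :", "float branch" if m.group(1) else "int branch", m.group())
for m in strict.finditer("42 3.14"):
    print("strict:", "float branch" if m.group(1) else "int branch", m.group())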
