How to match a string against a set of wildcard strings efficiently? - python

I am looking for a solution to match a single string against a set of wildcard strings. For example
>>> match("ab", ["a*", "b*", "*", "c", "*b"])
["a*", "*", "*b"]
The order of the output is of no importance.
I will have on the order of 10^4 wildcard strings to match against, and I will do around 10^9 match calls. This means I will probably have to rewrite my code like so:
>>> matcher = prepare(["a*", "b*", "*", "c", "*b"])
>>> for line in lines: yield matcher.match("ab")
["a*", "*", "*b"]
I've started writing a trie implementation in Python that handles wildcards, and I just need to get the corner cases right. Still, I'm curious to hear: how would you solve this? Are there any Python libraries out there that would help me solve this faster?
Some insights so far:
Named (Python, re) regular expressions will not help me here since they'll only return one match.
pyparsing seems like an awesome library, but is sparsely documented and does not, as I see it, support matching multiple patterns.

You could use the FilteredRE2 class from the re2 library with help from an Aho-Corasick algorithm implementation (or similar). From the re2 docs:
Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.
The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kind of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.

Seems like the Aho-Corasick algorithm would work. esmre seems to do what I'm looking for. I got this information from this question.
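For reference, here is a minimal baseline sketch of the prepare/match interface from the question. It compiles each wildcard to a regex with fnmatch.translate and tests them all per call, with no Aho-Corasick or trie prefiltering; the WildcardMatcher name is made up for illustration:
import fnmatch
import re

class WildcardMatcher:
    # Naive baseline: compile every wildcard once, then test each
    # compiled pattern per match() call, O(number of patterns) per call.
    def __init__(self, patterns):
        self._compiled = [(p, re.compile(fnmatch.translate(p))) for p in patterns]

    def match(self, s):
        return [p for p, rx in self._compiled if rx.match(s)]

matcher = WildcardMatcher(["a*", "b*", "*", "c", "*b"])
print(matcher.match("ab"))   # ['a*', '*', '*b']
At 10^4 patterns and 10^9 calls this linear scan is likely too slow, which is exactly what the trie and Aho-Corasick approaches discussed above are meant to address.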

Related

competing regular expressions (race condition)

I'm trying to use python PLY (lex/yacc) to parse a language called 'GRBL'.
GRBL looks something like this:
G00 X0.0 Y0.0 Z-1.0
G01 X1.0
..
The 'G' Codes tell a machine to 'go' (or move) and the coordinates say where.
LEX requires us to specify a unique regular expression for every possible 'token'.
So in this case I need a regex that will clearly define 'G00' and one that will clearly define 'G01' etc.
Obviously one's first thought would be r'G00' etc.
However, G code is imprecise: the G can be upper or lower case, there can be leading zeros, etc.
(g0, G00, g001 etc.)
So something for G00 may be as simple as:
r'[Gg]{1}0*'
And for G01 we could have
r'[Gg]{1}0*1'
But this does not work. G00 parses correctly, but G01 gives:
LexToken(G00,'G0',3,21)
Illegal character '1'
That is, lex thinks that G01 is a G0 token and doesn't know what to do with the '1'.
Which is clearly some sort of greedy matching problem.
Unfortunately I can't use the "$" terminator to specify that the string must 'end' with a "1".
I realise this might seem simple to some, but I've been at this for 3 hours and can't get it to work! Does anyone know how to address this problem?
Note: There's pretty well no reason to write {1} in a regular expression. It means that the previous element should be repeated exactly once, which is what would have happened without the repetition operator. So all it does is to obfuscate the regular expression (and slow down matching).
But that's not your problem. Your problem is likely the order in which Ply applies the regular expressions. Ply creates a single massive Python regular expression by concatenating all patterns into a set of alternatives:
(pattern1)|(pattern2)|(pattern3)|...|(patternz)
The order in which the patterns are inserted is important because Python "regular" expressions use an ordered alternation operator (making them actually irregular in mathematical terms, but that's a side issue). So once some alternative matches, the following ones are not even tried.
The Ply manual defines the ordering:
All tokens defined by functions are added in the same order as they appear in the lexer file.
Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
I'm guessing that you're using functions, so that the patterns are in order by appearance in the file, because your second pattern -- which is longer -- would be applied first if they were defined as strings. But without seeing your actual file, it's very hard to know for sure.
In any case, conventional wisdom for Ply lexers is to use as few patterns as possible, preferring to map keywords to tokens with dictionaries. In the case of GRBL, one possibility might be to use [Gg][0-9]+(\.[0-9]+)? as the pattern and then extract the index in the semantic action.
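As a rough sketch of what that single-pattern approach might look like in Ply (the token names, value handling and sample input below are illustrative, not taken from the question):
import ply.lex as lex

tokens = ('GCODE', 'COORD')

# One rule covers G0, G00, g001, G38.2, ...; the numeric index is
# extracted in the semantic action instead of letting G00 and G01 compete.
def t_GCODE(t):
    r'[Gg][0-9]+(\.[0-9]+)?'
    t.value = ('G', float(t.value[1:]))
    return t

def t_COORD(t):
    r'[XxYyZz]-?[0-9]+(\.[0-9]+)?'
    t.value = (t.value[0].upper(), float(t.value[1:]))
    return t

t_ignore = ' \t\n'

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("G00 X0.0 Y0.0 Z-1.0\nG01 X1.0")
for tok in lexer:
    print(tok)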
Your pattern [Gg]{1}0* is too general and matches both G00 and G01: https://regex101.com/r/gkz0Wb/1
In the second case you are then left with the single character 1.
You will have to make this pattern more specific, for example by adding a whitespace character at the end of the pattern: [Gg]{1}0*\s
https://regex101.com/r/dsSFaG/1

Is a single big regex more efficient than a bunch of smaller ones?

I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?
It depends on what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
This pattern is very inefficient because it starts with an alternation, which means that in the worst case (no match) each alternative needs to be tested at every position in the string.
(* Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
Most of the positions where the match isn't possible are quickly discarded before the alternation.
Also, when you write a single pattern, obviously, the string is parsed only once.
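As a quick illustration (the sample text below is made up), both forms compile directly with Python's re and find the same matches; the lookahead version simply rejects non-starting positions much sooner:
import re

# Naive alternation: every branch may be tried at every position.
naive = re.compile(r'UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}')

# Prefiltered: the two-character lookahead discards most positions
# before any branch of the alternation is attempted.
filtered = re.compile(r'(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})')

text = "lorem ipsum UKabcdefghij dolor sit PLabcdefg amet " * 1000
assert naive.findall(text) == filtered.findall(text)
Comparing the two with timeit on input that contains no matches shows the worst-case gap most clearly.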

Combine case sensitive regex and case insensitive regex into one

I have multiple filters for files (I'm using Python). Some of them are glob filters, some of them are regular expressions. I have both case-sensitive and case-insensitive globs and regexes. I can transform a glob into a regular expression with fnmatch.translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression? The expression would be (of course) case sensitive?
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Having R1,R2,R3 regexes I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (a case-insensitive glob). But I have another glob like "*abc*" which is case sensitive. How do I combine (programmatically) the two regexes produced by fnmatch.translate when one is case insensitive and the other is case sensitive?
Unfortunately, the regex ability you describe is either ordinal modifiers or a modifier span. Older versions of Python's re support neither, though here is what they would look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
In Python, the ordinal-modifier form fails to parse in re, and modifier spans are only accepted from Python 3.6 onward. The closest portable thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that works equivalent without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
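A minimal sketch of that idea, assuming ASCII-style case folding only. Rather than patching fnmatch.translate itself, translate_nocase below is a hypothetical, deliberately simplified glob translator that only understands * and ? and expands letters into two-character classes:
import fnmatch
import re

def translate_nocase(pattern):
    # Simplified glob-to-regex translation: only * and ? are special,
    # and each letter becomes a [xX] class so case is ignored.
    out = []
    for c in pattern:
        if c == '*':
            out.append('.*')
        elif c == '?':
            out.append('.')
        elif c.isalpha() and c.upper() != c.lower():
            out.append('[%s%s]' % (c.lower(), c.upper()))
        else:
            out.append(re.escape(c))
    return ''.join(out) + r'\Z'

# Combine with a case-sensitive glob translated the normal way:
combined = re.compile('(%s)|(%s)' % (translate_nocase('*.txt'), fnmatch.translate('*abc*')))
print(bool(combined.match('README.TXT')))   # True  (insensitive branch)
print(bool(combined.match('fooABCbar')))    # False (sensitive branch needs lowercase abc)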

Use regular expression to handle nested parenthesis in math equation?

If I have:
statement = "(2*(3+1))*2"
I want to be able to handle multiple parentheses within parentheses for a math reader I'm writing. Perhaps I'm going about this the wrong way, but my goal was to recursively go deeper into the parentheses until there were none, and then I would perform the math operations. Thus, I would first want to focus on
"(2*(3+1))"
then focus on
"(3+1)"
I hoped to do this by assigning the focus value to the start index of the regex match and the end index of the regex match. I have yet to figure out how to find the end index, but I'm more interested in first getting the regex to match at all. The regex
r"\(.+\)"
failed to match. I wanted it to read as "any one or more characters contained within a set of parentheses". Could someone explain why the above expression will not match the above statement in Python?
I love regular expressions. I use them all the time.
Don't use regular expressions for this.
You want an actual parser that will actually parse your math expressions. You might want to read this:
http://effbot.org/zone/simple-top-down-parsing.htm
Once you have actually parsed the expression, it's trivial to walk the parse tree and compute the result.
EDIT: @Lattyware suggested pyparsing, which should also be a good way to go, and might be easier than the EFFBot solution posted above.
https://github.com/pyparsing/pyparsing
Here's a direct link to the pyparsing sample code for a four-function algebraic expression evaluator:
http://pyparsing.wikispaces.com/file/view/fourFn.py
for what it's worth, here's a little more context:
regular expressions are called "regular" because they're associated with regular grammars, and regular grammars cannot describe (an unlimited number of) nested parentheses (they can describe a bunch of random parentheses, but cannot make them match in neat pairs).
one way to understand this is to understand that regular expressions can (modulo some details which i will explain at the end) be converted to deterministic finite automatons. which sounds intimidating but really just means that they can be converted into lists of "rules", where the rules depend on what you matched, and describe what you can match.
for example, the regular expression ab*c can be converted to:
1. at the start, you can only match a. then go to 2.
2. now, you can match b and go back to 2, or match c and go to 3.
3. you're done! the match was a success!
and that is a "deterministic finite automaton".
anyway, the interesting part of this is that if you sit down and try to make something like that for matching pairs of parentheses you can't! try it. you can match a finite number by making more and more rules, but you can't write a general set of rules that match an unlimited number of parentheses (i should add that the rules have to be of the form "if you match X go to Y").
now obviously you could modify that in various ways. you could allow more complex rules (like extending them to let you keep a count of the parentheses), and you could then get something that worked as you expect. but it wouldn't be a regular grammar.
given that regular expressions are limited in this way, why are they used rather than something more complex? it turns out that they're something of a sweet spot - they can do a lot, while remaining fairly simple and efficient. more complex grammars (kinds of rules) can be more powerful, but are also harder to implement, and have more problems with efficiency.
final disclaimer and promised extra details: in practice many regular expressions these days actually are more powerful than this (and should not really be called "regular expressions"). but the above is still the basic explanation of why you should not use a regexp for this.
ps jesse's suggested solution gets round this by using a regexp multiple times; the argument here is for a single use of the regexp.
I probably agree with steveha, and don't recommend regex for this, but to answer your question specifically, you need unescaped parens to pull out result groups (your pattern only has escaped parens):
>>> re.match(r"\((.+)\)", "(2*(3+1))*2").group(1)
'2*(3+1)'
If you go that route, you could iterate over the match results until you run out of matches, and then reverse the results list to work inside out.
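If you do go the iterative route, one way to work inside out is to match only innermost groups (using a character class that excludes parentheses) and substitute their values until no parentheses remain. A minimal sketch, using eval purely for illustration:
import re

def evaluate(statement):
    # [^()]+ only matches runs with no nested parentheses, so each
    # pass resolves the innermost level first.
    inner = re.compile(r"\(([^()]+)\)")
    while "(" in statement:
        statement = inner.sub(lambda m: str(eval(m.group(1))), statement)
    return eval(statement)

print(evaluate("(2*(3+1))*2"))   # 16
A real parser, as recommended in the answers above, is still the safer choice.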

How to evaluate a matched number later in a regex? - Lexing FORTRAN 'H' edit descriptor with Ply

I am using Ply to interpret a FORTRAN format string. I am having trouble writing a regex to match the 'H' edit descriptor which is of the form
xHccccc ...
where x specifies the number of characters to read in after the 'H'
Ply matches tokens with a single regular expression, but I am having trouble using regular expression to perform the above. I am looking for something like,
(\d+)[Hh].{\1}
where \1 is parsed as an integer and evaluated as part of the regex - however it isn't.
It seems that it is not possible to use matched numbers later in the same regex, is this the case?
Does anyone have any other solutions that might use Ply?
Regex can't do things like that. You can hack it though:
(1[Hh].|2[Hh]..|3[Hh]...|etc...)
Ugly!
This is what comes of thinking that regexps can replace a lexer.
Short version: regular expressions can only deal with that small subset of all possible languages termed "regular" (big surprise, I know). But "regular" is not isomorphic to the human understanding of "simple", so even very simple languages can be non-regular.
Writing a lexer for a simple language is not terribly hard.
That canonical Stack Overflow question for resources on the topic is Learning to write a compiler.
Ah. I seem to have misunderstood the question. Mea Culpa.
I'm not familiar with Ply, and it's been a while since I used flex, but I think you would eat any number of following digits, then check in the associated code block whether the rules had been obeyed.
Pyparsing includes an adaptive expression that is very similar to this, called countedArray. countedArray(expr) parses a leading integer 'n' and then parses 'n' instances of expr, returning the whole array as a single list. The way this works is that countedArray parses a leading integer expression, followed by an uninitialized Forward expression. The leading integer expression has a parse action attached that assigns the following Forward to 'n'*expr. The pyparsing parser then continues on, and parses the following 'n' expr's. So it is sort of a self-modifying parser.
To parse your expression, this would look something like:
from pyparsing import Forward, Word, nums, printables

integer = Word(nums).setParseAction(lambda t: int(t[0]))
following = Forward()

# define 'following' once the count is known; returning nothing keeps the integer token
def set_following(t):
    following <<= Word(printables + " ", exact=t[0])

integer.addParseAction(set_following)
H_expr = integer + 'H' + following
print(H_expr.parseString("22HThis is a test string.This is not in the string"))
Prints:
[22, 'H', 'This is a test string.']
If Ply has something similar, perhaps you could use this technique.
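Ply itself also allows something in this spirit: a token rule can look at t.lexer.lexdata and advance t.lexer.lexpos manually, so the count is matched by the regex and the following characters are consumed by hand. A hedged sketch (the token name HOLLERITH and sample input are made up for illustration):
import ply.lex as lex

tokens = ('HOLLERITH',)

def t_HOLLERITH(t):
    r'\d+[Hh]'
    n = int(t.value[:-1])               # the character count before the H
    start = t.lexer.lexpos
    t.value = t.lexer.lexdata[start:start + n]
    t.lexer.lexpos = start + n          # manually consume the next n characters
    return t

t_ignore = ' '

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("22HThis is a test string.This is not in the string")
print(lexer.token().value)   # 'This is a test string.'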
