I have a list of regular expressions and a string. I want to know which of the regular expressions (possibly more than one) match the string, if any. The trivial solution would be to try regular expressions one by one, but this part is performance critical... Is there a faster solution? Maybe by combining regular expressions in a single state machine somehow?
Reason: the regular expressions are user-supplied filters that match incoming strings. When a message with a string arrives, the system needs to perform additional user-specified actions for every filter that matches.
There are up to 10,000 regular expressions available. They are user-supplied and can be somewhat simplified if necessary (.* should be allowed, though :) ). The list of regexes is stored in MongoDB, but I can also prefetch them and perform the search inside Python if necessary.
Similar questions:
similar to this question, but the constraints are different: fuzzy matching is not enough, and the number of regular expressions is much lower (up to 10k)
similar to this question, but I can pre-process the regexes if necessary
similar to this question, but I need to find all matching regular expressions
I would appreciate some help.
First of all, if you have > 10K regular expressions you should definitely prefetch them and keep them compiled (using re.compile) in memory. As a second step, I recommend thinking about parallelism. Threads in Python don't help much here because of the GIL, so use multiple processes instead. And third, I would think about scaling across a number of servers, using ZeroMQ (or another MQ) for communication.
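For illustration, a minimal sketch of the prefetch-and-compile step (the names pattern_strings and find_matching are made up here; in your case the patterns would come from MongoDB):

import re

# Hypothetical list of user-supplied patterns, prefetched from MongoDB.
pattern_strings = [r"foo.*bar", r"^\d+$", r".*"]

# Compile once, up front, instead of on every incoming message.
compiled = [(p, re.compile(p)) for p in pattern_strings]

def find_matching(message):
    """Return every source pattern whose regex matches the incoming string."""
    return [p for p, rx in compiled if rx.search(message)]

print(find_matching("foo 123 bar"))   # ['foo.*bar', '.*']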
As an interesting scientific task, you could try to build a regexp parser that builds trees of similar regexps:
A
|-B
|-C
|-D
|-E
So if regexp A matches the string, then B, C, D, and E match it too, and you can reduce the number of checks. IMHO, though, building this would take a lot of time; using a bunch of servers will be cheaper and faster.
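Just to make the idea concrete, here is a rough sketch (my own illustration, with a hand-built tree; actually computing the implication relation between regexes is the hard part):

import re

# Hypothetical implication tree: if a node's regex matches, every
# descendant's regex (which is strictly more general) matches too.
tree = {"A": ["B", "C", "D", "E"], "B": [], "C": [], "D": [], "E": []}
regexes = {
    "A": re.compile(r"abc"),   # most specific
    "B": re.compile(r"ab"),
    "C": re.compile(r"a"),
    "D": re.compile(r".*"),
    "E": re.compile(r"b"),
}

def mark_subtree(node, result):
    result.add(node)
    for child in tree[node]:
        mark_subtree(child, result)

def matches(node, text, result):
    if regexes[node].search(text):
        mark_subtree(node, result)   # the whole subtree matches for free
    else:
        for child in tree[node]:     # parent failed; children may still match
            matches(child, text, result)

found = set()
matches("A", "xxabcxx", found)
print(sorted(found))   # ['A', 'B', 'C', 'D', 'E'], with only one regex evaluated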
Related
I need to do text matching over a large number of strings (between each pair) and find the overlapping subsequences. I wanted to know whether the Knuth-Morris-Pratt algorithm is best for this job, considering that I want this functionality in Python and it should scale over a large set of strings. Is this the best way to go about it, or is there a better way to do string matching that is both scalable and efficient?
TL;DR: scalable + efficient = RegEx.
First of all I recommend you to read: Regular Expression Matching Can Be Simple And Fast.
RegEx would probably be the most scalable solution, since it's not just for matching; it also provides group capturing and back-referencing.
Furthermore, Python's re module is written in C and will probably be faster than most of the code you'd write yourself in Python.
For simple substring searching you could indeed use the Knuth-Morris-Pratt algorithm, but when it comes to real-world words and phrases (which are not very repetitive), you may find that RegEx is better on average.
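As a small illustration (my example, not part of the original answer), even overlapping occurrences of a literal substring are easy to find with re by wrapping the escaped needle in a zero-width lookahead:

import re

def overlapping_positions(needle, haystack):
    # Start index of every (possibly overlapping) occurrence of needle.
    return [m.start() for m in re.finditer("(?=%s)" % re.escape(needle), haystack)]

print(overlapping_positions("aba", "abababa"))   # [0, 2, 4]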
I am writing a regular expression which needs to satisfy the criteria below.
(name="myName".*house="myHouse"|house="myHouse".*name="myName")
Either name or house can come first. My Regex should match both.
Actually, my real regex is even bigger once I write out the repeated parts.
Is there any way to write the regular expression without repetition like the above?
The only possible way to do this without the | pipe operator is to do two separate regex searches. So the answer is no, there is no other way.
Also, if this is XML or HTML that you are searching, it is highly advised that you use a parser such as Beautiful Soup instead of regex.
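For example, assuming the data really is HTML/XML, a parser makes the attribute order irrelevant (a sketch using beautifulsoup4; the tag and attribute names are made up):

from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = '<person house="myHouse" name="myName"/>'
soup = BeautifulSoup(html, "html.parser")

# The attributes can appear in any order once the document is parsed.
tag = soup.find(attrs={"name": "myName", "house": "myHouse"})
print(tag is not None)   # True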
You can use positive lookahead assertions for this. It's a bit jejune, but if you're trying to keep things simple, it should work. What you want to do is confirm:
I am looking at .*house
AND
I am looking at .*name
without regard to the lengths of the two .* parts.
So, since lookahead expressions are zero-width (that is, they match without actually consuming any characters; they just "look ahead"), you can paste together as many as you'd like.
Please be aware: doing this can get really expensive, performance-wise. You will have to scan, and then re-scan, for each extra term that you match. If the strings you are matching against are long, this will slow you down a lot.
Sample regex:
(?=.*name="myName")(?=.*house="myHouse")
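For example (my quick check, with made-up input strings), the attributes can then appear in either order:

import re

pattern = re.compile(r'(?=.*name="myName")(?=.*house="myHouse")')

print(bool(pattern.search('name="myName" x=1 house="myHouse"')))   # True
print(bool(pattern.search('house="myHouse" x=1 name="myName"')))   # True
print(bool(pattern.search('name="myName" only')))                  # False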
For a single large text (~4GB) I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements can easily fit in memory. The naive solution would literally take years to finish, as a single replacement takes about a minute.
Naive solution:
for search, replace in replacements.iteritems():
    text = text.replace(search, replace)
The regex method using re.sub is 10x slower:
for search, replace in replacements.iteritems():
    text = re.sub(search, replace, text)
At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but these methods, as they are generally implemented, only work for searching the string and not also replacing it.
Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.
Thanks!
There's probably a better way than this:
re.sub('|'.join(replacements), lambda match: replacements[match.group()], text)
This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
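Note that this assumes the keys of replacements are already valid regex patterns; if they are plain phrases, they would need to be escaped first, and longer phrases should come first so they win over their own prefixes. A hedged sketch of that variant:

import re

replacements = {"cat": "dog", "cat food": "dog food"}

# Escape the phrases and try longer ones first so "cat food" beats "cat".
pattern = re.compile("|".join(
    re.escape(k) for k in sorted(replacements, key=len, reverse=True)))

text = "cat food for a cat"
print(pattern.sub(lambda m: replacements[m.group()], text))
# dog food for a dog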
Outside of python, sed is usually used for this sort of thing.
For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:
sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt
You haven't posted any profiling of your code; you should try some timings before you do any premature optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.
ALTERNATIVE
Ask: should I be doing this at all?
You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, as it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load the data so that the search and replace is done on the subset of the data being loaded/viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.
On a final point, you note that:
"the speedup has to be algorithmic, not chaining millions of sed calls"
But you indicated that the data you're working with is a "single large text (~4GB)", so there shouldn't be any chaining involved, if I understand what you mean by that correctly.
UPDATE:
Below you indicate that the operation on a ~4KB file (I'm assuming) takes 90s; this seems very strange to me, as sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then it should take about 24 hours to evaluate (not ideal, but probably acceptable?).
I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.
You can get it with:
pip install fsed
"these methods, as they are generally implemented, only work for searching the string and not also replacing it"
Perfect, that's exactly what you need. Searching with an ineffective algorithm in a 4GB text is bad enough, but doing several replacements is probably even worse... you potentially have to move gigabytes of text to make space for the expansion/shrinking caused by the size difference of the source and target text.
Just find the locations, then join the pieces with the replacement parts, as in the sketch below.
So a dumb analogy would be "_".join( "a b c".split(" ") ), but of course you don't want to create copies the way split does.
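A rough sketch of that idea (my illustration; here re locates the spans in one pass, though a real implementation would use Aho-Corasick for the searching):

import re

def replace_all(text, replacements):
    # Build the output once from slices instead of rewriting the text repeatedly.
    pattern = re.compile("|".join(re.escape(k) for k in replacements))
    pieces, last = [], 0
    for m in pattern.finditer(text):
        pieces.append(text[last:m.start()])    # untouched text before the match
        pieces.append(replacements[m.group()])
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces)

print(replace_all("the cat sat on the mat", {"cat": "dog", "mat": "rug"}))
# the dog sat on the rug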
Note: any reason to do this in python?
I have a set of strings. I would like to extract a regular expression that matches all these strings. Further, it should preferably match only these and not many others.
Is there an existing python module that does this?
www.google.com
www.googlemail.com/hello/hey
www.google.com/hello/hey
Then, the extracted regex could be www\.google(mail)?\.com(/hello/hey)?
(This also matches www.googlemail.com but I guess I need to live with it)
My motivation for this is in a machine learning setting. I would like to extract a regular expression that "best" represents all these strings.
I understand that regexes like
(www.google.com)|(www.googlemail.com/hello/hey)|(www.google.com/hello/hey) or
www.google(mail.com/hello/hey)|(.com)|(/hello/hey) would be right given my specification, because they match no other URLs than the given ones. But such a regex would become very large if there is a large number of strings in the set.
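For reference, that verbose alternation is trivial to build programmatically (a throwaway sketch, just to show what I mean):

import re

strings = [
    "www.google.com",
    "www.googlemail.com/hello/hey",
    "www.google.com/hello/hey",
]

# One escaped alternative per string: matches exactly the given set,
# but grows linearly with the number of strings.
pattern = re.compile("|".join("(?:%s)" % re.escape(s) for s in strings))

print(bool(pattern.fullmatch("www.google.com/hello/hey")))   # True
print(bool(pattern.fullmatch("www.example.com")))            # False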
There's a little Perl library that was designed to do this. I know you're using Python, but if it's a very large list of strings, you can fork off a Perl subprocess now and then. (Or copy the algorithm if you're sufficiently motivated.)
A question that I answered got me wondering:
How are regular expressions implemented in Python? What sort of efficiency guarantees are there? Is the implementation "standard", or is it subject to change?
I thought that regular expressions would be implemented as DFAs, and therefore were very efficient (requiring at most one scan of the input string). Laurence Gonsalves raised an interesting point that not all Python regular expressions are regular. (His example is r"(a+)b\1", which matches some number of a's, a b, and then the same number of a's as before). This clearly cannot be implemented with a DFA.
So, to reiterate: what are the implementation details and guarantees of Python regular expressions?
It would also be nice if someone could give some sort of explanation (in light of the implementation) as to why the regular expressions "cat|catdog" and "catdog|cat" lead to different search results in the string "catdog", as mentioned in the question that I referenced before.
Python's re module was based on PCRE, but has since moved to its own implementation.
Here is the link to the C code.
It appears as though the library is based on recursive backtracking when an incorrect path has been taken.
[Graph: time to match a?^n a^n against a^n, as the regular expression and text size n grow]
Keep in mind that this graph is not representative of normal regex searches.
http://swtch.com/~rsc/regexp/regexp1.html
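To reproduce the effect the graph shows (my own timing sketch, not from the article), you can time the same a?^n a^n pattern against a^n and watch the runtime blow up as n grows:

import re
import time

for n in range(18, 25, 2):   # increase n with care; each step roughly doubles the time
    pattern = re.compile("a?" * n + "a" * n)
    text = "a" * n
    start = time.perf_counter()
    pattern.match(text)      # succeeds, but only after extensive backtracking
    print(n, "%.3fs" % (time.perf_counter() - start))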
There are no "efficiency guarantees" on Python REs any more than on any other part of the language (C++'s standard library is the only widespread language standard I know that tries to establish such standards -- but there are no standards, even in C++, specifying that, say, multiplying two ints must take constant time, or anything like that); nor is there any guarantee that big optimizations won't be applied at any time.
Today, F. Lundh (originally responsible for implementing Python's current RE module, etc), presenting Unladen Swallow at Pycon Italia, mentioned that one of the avenues they'll be exploring is to compile regular expressions directly to LLVM intermediate code (rather than their own bytecode flavor to be interpreted by an ad-hoc runtime) -- since ordinary Python code is also getting compiled to LLVM (in a soon-forthcoming release of Unladen Swallow), a RE and its surrounding Python code could then be optimized together, even in quite aggressive ways sometimes. I doubt anything like that will be anywhere close to "production-ready" very soon, though;-).
Matching regular expressions with backreferences is NP-hard, which means it is at least as hard as the NP-complete problems. That basically means it's as hard as any problem you're likely to encounter, and most computer scientists suspect it would require exponential time in the worst case. If you could match such "regular" expressions (which really aren't regular, in the technical sense) in polynomial time, you could win a million bucks.