Creating custom extensions to a regular expression engine - python

Is there any easy way to go about adding custom extensions to a
Regular Expression engine? (For Python in particular, but I would take
a general solution as well).
It might be easier to explain what I'm trying to build with an
example. Here is the use case that I have in mind:
I want users to be able to match strings that may contain arbitrary
ASCII characters. Regular Expressions are a good start, but aren't
quite enough for the type of data I have in mind. For instance, say I
have data that contains strings like this:
<STX>12.3,45.6<ETX>
where <STX> and <ETX> are the Start of Text/End of Text characters
0x02 and 0x03. To capture the two numbers, it would be very
convenient for the user to be able to specify any ASCII
character in their expression. Something like so:
\x02(\d\d\.\d),(\d\d\.\d)\x03
Where the "\x02" and "\x03" are matching the control characters and
the first and second match groups are the numbers. So, something like
regular expressions with just a few domain-specific add-ons.
How should I go about doing this? Is this even the right way to go?
I have to believe this sort of problem has been solved, but my initial
searches didn't turn up anything promising. Regular expressions have
the advantage of being well known, keeping the learning curve down.
A few notes:
I am not looking for a fixed parser for a particular protocol - it needs to be general and user configurable
I really don't want to write my own regex engine
Although it would be nice, I am not looking for "regex macros" where I create shortcuts for a handful of common expressions. (perhaps a follow-up question...)
Bonus: Have you heard of any academic work on this, e.g. "Creating Domain-Specific Search Languages"?
EDIT: Thanks for the replies so far; I hadn't realized Python re supported arbitrary ASCII chars. However, this is still not quite what I'm looking for. Here is another example that hopefully gives a sense of the breadth of what I want in the end:
Suppose I have data that contains strings like this:
$\x01\x02\x03\r\n
where \x01\x02\x03 packs two 12-bit integers (0x010 and 0x023). So how could I add syntax so the user could match it with a regex like this:
\$(\int12)(\int12)\x0d\x0a
where each \int12 pulls out 12 bits. This would be handy when searching for packed data.

\x escapes are already supported by the Python regular expression parser:
>>> import re
>>> regex = re.compile(r'\x02(\d\d\.\d),(\d\d\.\d)\x03')
>>> regex.match('\x0212.3,45.6\x03')
<_sre.SRE_Match object at 0x7f551b0c9a48>
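Building on that, one way to get domain-specific add-ons like the hypothetical \int12 token without writing a new engine is to preprocess the user's pattern into plain re syntax and post-process the captured bytes. The sketch below is not from the original thread: it assumes a toy translator that only handles \int12 tokens appearing in adjacent pairs (the question's per-token groups are dropped for brevity), and the bit layout is an assumption.
import re

# Translate a hypothetical \int12 token into ordinary regex syntax before
# compiling, then unpack the bits from the captured bytes afterwards.
def translate(user_pattern):
    # Each adjacent pair of \int12 tokens occupies 3 bytes; capture them as
    # one group. A real translator would handle lone tokens and other add-ons.
    return user_pattern.replace(r'\int12\int12', r'([\x00-\xff]{3})')

regex = re.compile(translate(r'\$\int12\int12\x0d\x0a'))
m = regex.match('$\x01\x02\x03\r\n')
b = [ord(c) for c in m.group(1)]
first = (b[0] << 4) | (b[1] >> 4)      # high 12 bits
second = ((b[1] & 0x0f) << 8) | b[2]   # low 12 bits (exact layout depends on how the data is packed)
print(hex(first))
print(hex(second))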

Related

Python IN vs grep

I have a script that iterates over the file contents of hundreds of thousands of files to find specific matches. For convenience I am using Python's string in operator. What are the performance differences between the two? I'm looking for more of a conceptual understanding here.
list_of_file_contents = [...]  # 1GB
key = 'd89fns;3ofll'
matches = []
for item in list_of_file_contents:
    if key in item:
        matches.append(key)
--vs--
grep -r 'd89fns;3ofll' my_files/
The biggest conceptual difference is that grep does regular expression matching; in Python you'd need to explicitly write code using the re module. The search expression in your example doesn't exploit any of the richness of regular expressions, so the search behaves just like a plain string match and should consume only a tiny bit more resources than fgrep would. Your Python script is effectively fgrep and should perform on par with it.
If the files are encoded, say in UTF-16, depending on the version of the various programs, there could be a big difference in whether matches are found, and a little in how long it takes.
And that's assuming that the actual Python code deals with input and output efficiently, i.e. list_of_file_contents isn't an actual list of the data but, for instance, a list comprehension around fileinput, and that there is not a huge number of matches to handle.
I suggest you try it out for yourself. Profiling Python code is really easy: https://stackoverflow.com/a/582337/970247. For a more conceptual take: regex is a powerful string-parsing engine full of features; in contrast, Python's in does just one thing in a really straightforward way. I would say the latter will be the more efficient, but again, trying it for yourself is the way to go.
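As a rough way to get those numbers on your own machine, here is a minimal timing sketch; the data and key below are small placeholders standing in for the real 1GB workload.
import re
import timeit

# Toy data standing in for the real file contents; scale these up to
# approximate the actual workload.
list_of_file_contents = ['x' * 10000 for _ in range(1000)]
key = 'd89fns;3ofll'
pattern = re.compile(re.escape(key))

def with_in():
    return [item for item in list_of_file_contents if key in item]

def with_re():
    return [item for item in list_of_file_contents if pattern.search(item)]

print(timeit.timeit(with_in, number=10))
print(timeit.timeit(with_re, number=10))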

Is there any streaming regular-expression module for Python?

I'm looking for a way to run a regex over a (long) iterable of "characters" in Python. (Python doesn't actually have characters, so it's actually an iterable of one-length strings. But same difference.)
The re module only allows searching over strings (or buffers), as far as I can tell.
I could implement it myself, but that seems a little silly.
Alternatively, I could convert the iterable to a string and run the regex over the string, but that gets (hideously) inefficient. (A worst-case example: re.search(".a", "".join('a' for a in range(10**8))) peaks at over 900M of RAM (private working set) on my (x64) machine, and takes ~12 seconds, even though it only needs to look at the first two characters in the iterable.)
As far as I understand, the example that joins a lot of 'a's is just an extremely simple example that shows the problem. In other words, constructing the content can generally be more time- and memory-consuming than the search itself.
The problem with the standard re module is that it uses the extended regular expression syntax, and it requires backtracking.
You may be interested in the very classic implementation by Thompson (NFA) -- see http://swtch.com/~rsc/regexp/regexp1.html for the explanation and the comparison of performance with the libraries that implement the extended syntax.
It seems that the re2 project could be useful for you. There is a Python port -- see "Is it possible to use re2 from Python?" However, I do not know if it supports streaming, or whether any streaming regular expression engine for Python exists.
For understanding Thompson's idea, you can also try an online visualization of the regular expression to NFA conversion.
If the number of elements in that list is truly on the order of 10**8, then you are probably better off doing a linear search if you only need to do it once. Otherwise, you have to create this huge string, which is really very inefficient. The other thing I can think of, if you need to do this more than once, is inserting the collection into a hash table and doing the search there.
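For the simpler cases, a chunked scan over the iterable avoids building the whole string up front. This is only a sketch, not a real streaming engine: it assumes no match spans more than overlap characters, and any match positions it reports are relative to the local window.
import re
from itertools import islice

def search_stream(pattern, chars, chunk_size=1 << 16, overlap=64):
    # Scan an iterable of 1-character strings without building one huge string.
    # Assumes no match is longer than `overlap` characters.
    regex = re.compile(pattern)
    chars = iter(chars)
    tail = ''
    while True:
        chunk = ''.join(islice(chars, chunk_size))
        if not chunk:
            return None
        window = tail + chunk
        m = regex.search(window)
        if m:
            return m
        tail = window[-overlap:]

# Finds a match after reading one chunk instead of materializing 10**8 characters.
print(search_stream('.a', ('a' for _ in range(10**8))))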

RegEx anomalous behavior in a program

I have written the following regex to match a set of e-mails from HTML files. The e-mails can take various formats such as
alice # so.edu
alice at sm.so.edu
alice # sm.com
<a href="mailto:alice at bob dot com">
I generally use RegexPal to test my regular expressions before implementing them in a programming language. I observe a strange behavior with the last e-mail example posted: RegexPal shows a match for my regex, but the same regex in a Python program doesn't give me a hit. What could be the reason?
mail_regex = (?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:#|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*
(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))
The RegEx is a little bit complex in order to accommodate a variety of other examples (email patterns found in the dataset). You can also run and inspect the Python program on CodePad - http://codepad.org/W2p6waBb
Edit
Just to give some perspective, the same regex works on http://pythonregex.com/
It looks like the specific issue here is that you need to use a raw string:
mail_re = r"(?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:#|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))"
Otherwise, for instance, \b will be interpreted as a backspace character instead of a word boundary.
Also, you're using a JavaScript tester. Python has different syntax and behavior. To avoid surprises, it would be better to test with the Python-specific syntax.
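A quick way to see the difference (a minimal illustration, using a made-up target string):
import re

# "\b" in a normal string literal is a backspace character (0x08); in a raw
# string it stays as backslash + "b", which re interprets as a word boundary.
print(len("\bat\b"))     # 4 -- backspace, 'a', 't', backspace
print(len(r"\bat\b"))    # 6 -- backslash, 'b', 'a', 't', backslash, 'b'
print(re.search("\bat\b", "alice at bob"))   # None: looks for literal backspaces
print(re.search(r"\bat\b", "alice at bob"))  # matches 'at'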

Can I use a python regex for letters, dashes and underscores?

I want to handle geographic names, e.g. /new_york or /new-york, etc.,
and since new-york is what django's slugify produces for New York, maybe I should use the slugified names, even if names with underscores look better, since I may want to automate the URL creation via an algorithm such as django slugify. A guess is that ([A-Za-z]+) or simply ([\w-]+) could work, but to be safe I ask you which regex is the best choice in this case.
I've already got a regex that handles number connecting numbers to a class:
('/([0-9]*)', ById)  # fetches and displays an entity by id
Now I want another regex to match names, e.g. new_york, so that a request for
/new_york gets handled by the appropriate handler. Basically I want the complement of the regex above: any combination of letters plus underscores, and maybe a dash, since the names are geographical. It seems I could use the regex below, but I believe it works only because of precedence, in that it just takes everything:
('/(.*)', ByName)  # handles for instance /new_york, /sao_paulo, etc. by custom mapping for my relevant places
Since I have other request handlers and I don't want conflicting regexes, could you recommend how to formulate the regex?
How does it work when a URL matches two regexes? Which has higher precedence? Can you also tell me more about how I should learn to write regexes, and about possible implementations for the geographical datastore (as entities or instance variables), and special problems such as geographic locations that have different names in different languages, e.g. Germany in German is called Deutschland, so I also want to apply translations, which I can do with gettext / django .po files.
The first match wins.
Usually your URLs will differ in other parts of the path. For example, you might have
/cities/(?P<city>[^/]+)
/users/(?P<user>[^/]+)
and in many cases [^/]+ is a good regex because it will match anything except /, which you would normally avoid because it is used to separate path elements.
I don't think it's a good idea to separate URLs based solely on characters (in your case, letters or digits), but if you want to do that, use [-A-Za-z_]+ (note that the "-" goes at the start of the [], or it needs a backslash).
Avoid \w because that can also match digits -- unless you want to go really crazy and send digits only to one handler and letters+digits elsewhere, in which case use:
/(?P<id>\d+)
/(?P<city>[-\w]+)
in that order.
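A minimal sketch of that "first match wins" ordering, with plain re standing in for the framework's dispatcher and the handler names reduced to strings:
import re

# Patterns are tried in order, so the digits-only pattern must come before
# the more general name pattern.
routes = [
    (re.compile(r'^/(?P<id>\d+)$'), 'ById'),
    (re.compile(r'^/(?P<city>[-\w]+)$'), 'ByName'),
]

def resolve(path):
    for regex, handler in routes:
        m = regex.match(path)
        if m:
            return handler, m.groupdict()
    return None

print(resolve('/123'))        # ('ById', {'id': '123'})
print(resolve('/new_york'))   # ('ByName', {'city': 'new_york'})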

How would I design a db to contain a set of url regexes (python) that could be matched against an incoming url

Say I have the following set of urls in a db
url                        data
^(.*)google.com/search     foobar
^(.*)google.com/alerts     barfoo
^(.*)blah.com/foo/(.*)     foofoo
... 100's more
Given any url in the wild, I would like to check to
see if that url belongs to an existing set of urls and get the
corresponding data field.
My questions are:
1. How would I design the db to do it?
2. Django does URL resolution by looping through each regex and checking for a match. Given that there may be 1000's of urls, is this the best way to approach this?
3. Are there any existing implementations I can look at?
"2. django does urlresolution by looping through each regex and checking for a match given that there maybe 1000's of urls is this the best way to approach this?"
"3. Are there any existing implementations I can look at?"
If running a large number of regular expressions does turn out to be a problem, you should check out esmre, which is a Python extension module for speeding up large collections of regular expressions. It works by extracting the fixed strings of each regular expression and putting them in an Aho-Corasick-inspired pattern matcher to quickly eliminate almost all of the work.
Django has the advantage that its URLs are generally hierarchical. While the entire Django project may well have 100s or more URLs, it's probably dealing with only a dozen or fewer patterns at a time. Do you have any structure in your URLs that you could exploit this way?
Other than that, you could try creating some kind of heuristic: e.g. find the "fixed" parts of your patterns, eliminate some of the candidates by a simple substring search, and only then switch to regex matching.
At the extreme end of the spectrum, you could create a product automaton. That would be super fast but the memory requirements would probably be impractical (and likely to remain so for the next few centuries).
Before determining that the django approach could not possibly work, try implementing it and applying a typical workload. For a really thorough approach, you could actually time the cost of each regex; that can guide you in improving the most costly and most frequently used regexes. In particular, you could move the most frequently used, inexpensive regexes to the front of the list. This is probably a better choice than inventing a new technology to fix a problem you don't even know you have yet.
You'll certainly need more care in your design of regular expressions. For example, the prefix ^(.*) will match any input - and while you may need the prefix to capture a group for various reasons, having it there will mean that you can't really eliminate any of the URLs in your database easily.
I sort of agree with TokenMacGuy's comment about the intractability of regexes, but the situation may not be completely hopeless depending on the true scale of your problem. For example, for a URL to match, its first character must match; so you could pre-filter your URLs by recording which first character in the input can match each URL. That is, you have a secondary table MatchingFirstCharacters, which is a lookup between initial characters and the URLs that match up to that initial character. (This will only work if you don't have lots of ambiguous prefixes, as I mentioned in the first paragraph of my answer.) Using this approach means you don't necessarily have to load all the regexes for full matching -- just the ones where at least the first character matches. I suppose the idea could be generalised further, but that's an exercise for the reader ;-)
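A minimal sketch of that pre-filter, assuming patterns that start with a literal character (the example patterns and data values are made up):
import re

# (pattern, data) pairs with literal, anchored prefixes.
patterns = [
    (r'google\.com/search', 'foobar'),
    (r'google\.com/alerts', 'barfoo'),
    (r'blah\.com/foo/.*',   'foofoo'),
]

# Secondary lookup: first literal character -> candidate (compiled regex, data).
matching_first_characters = {}
for pat, data in patterns:
    matching_first_characters.setdefault(pat[0], []).append((re.compile(pat), data))

def lookup(url):
    # Only patterns whose first character agrees with the URL are tried.
    for regex, data in matching_first_characters.get(url[:1], []):
        if regex.match(url):
            return data
    return None

print(lookup('google.com/search?q=python'))  # foobar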
The plan I'm leaning towards is one that picks off the domain name + TLD from
a url, uses that as a key to find all the regexes, and then loops through
each regex in this domain-specific subset to find a match.
I use two tables for this:
class Urlregex(db.Model):
    """
    The data field is structured as a newline-separated record list,
    and each record is a space-separated pair of regex and dispatch key.
    Example of one such record:

    domain_tld: google.com
    data:
    ^(.*)google.com/search(.*) google-search
    """
    domain_tld = db.StringProperty()
    data = db.TextProperty()

class Urldispatch(db.Model):
    urlkey = db.StringProperty()
    data = db.TextProperty()
So, for the cost of two db reads plus a loop through a domain-specific url subset,
any incoming url should be able to be matched against a large db of urls.
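A rough sketch of that dispatch step, assuming the classic App Engine db query API and treating the request's host as the domain_tld key (a simplification; real code would strip subdomains):
import re
from urlparse import urlparse

def dispatch(url):
    domain_tld = urlparse(url).netloc                                  # simplification: full host as key
    entry = Urlregex.all().filter('domain_tld =', domain_tld).get()   # db read 1
    if entry is None:
        return None
    for record in entry.data.splitlines():
        if not record.strip():
            continue
        pattern, urlkey = record.rsplit(' ', 1)
        if re.match(pattern, url):
            return Urldispatch.all().filter('urlkey =', urlkey).get() # db read 2
    return None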
