Regular expression implementation details

Regular expression implementation details - python

A question that I answered got me wondering:
How are regular expressions implemented in Python? What sort of efficiency guarantees are there? Is the implementation "standard", or is it subject to change?
I thought that regular expressions would be implemented as DFAs, and therefore were very efficient (requiring at most one scan of the input string). Laurence Gonsalves raised an interesting point that not all Python regular expressions are regular. (His example is r"(a+)b\1", which matches some number of a's, a b, and then the same number of a's as before). This clearly cannot be implemented with a DFA.
So, to reiterate: what are the implementation details and guarantees of Python regular expressions?
It would also be nice if someone could give some sort of explanation (in light of the implementation) as to why the regular expressions "cat|catdog" and "catdog|cat" lead to different search results in the string "catdog", as mentioned in the question that I referenced before.

Python's re module was based on PCRE, but has moved on to their own implementation.
Here is the link to the C code.
It appears as though the library is based on recursive backtracking when an incorrect path has been taken.
Regular expression and text size n
a?nan matching an
Keep in mind that this graph is not representative of normal regex searches.
http://swtch.com/~rsc/regexp/regexp1.html

There are no "efficiency guarantees" on Python REs any more than on any other part of the language (C++'s standard library is the only widespread language standard I know that tries to establish such standards -- but there are no standards, even in C++, specifying that, say, multiplying two ints must take constant time, or anything like that); nor is there any guarantee that big optimizations won't be applied at any time.
Today, F. Lundh (originally responsible for implementing Python's current RE module, etc), presenting Unladen Swallow at Pycon Italia, mentioned that one of the avenues they'll be exploring is to compile regular expressions directly to LLVM intermediate code (rather than their own bytecode flavor to be interpreted by an ad-hoc runtime) -- since ordinary Python code is also getting compiled to LLVM (in a soon-forthcoming release of Unladen Swallow), a RE and its surrounding Python code could then be optimized together, even in quite aggressive ways sometimes. I doubt anything like that will be anywhere close to "production-ready" very soon, though;-).

Matching regular expressions with backreferences is NP-hard, which is at least as hard as NP-Complete. That basically means that it's as hard as any problem you're likely to encounter, and most computer scientists think it could require exponential time in the worst case. If you could match such "regular" expressions (which really aren't, in the technical sense) in polynomial time, you could win a million bucks.

Related

Finding a matching regex(es) from a list of alternatives

I have a list of regular expressions and a string. I want to know which of the regular expressions (possibly more than one) match the string, if any. The trivial solution would be to try regular expressions one by one, but this part is performance critical... Is there a faster solution? Maybe by combining regular expressions in a single state machine somehow?
Reason: regular expressions are user-supplied filters that match incoming strings. When the message with the string arrives, system needs to perform additional user-specified actions.
There are up to 10000 regular expressions available. Regular expressions are user-supplied and can be somewhat simplified, if necessary (.* should be allowed though :) ). The list of regex-es is saved in MongoDB, but I can also prefetch them and perform the search inside Python if necessary.
Similar questions:
similar to this question, but the constraints are different: fuzzy matching is not enough, number of regular expressions is much lower (up to 10k)
similar to this question, but I can pre-process the regexes if necessary
similar to this question, but I need to find all matching regular expressions
I would appreciate some help.

First of all, if you have > 10K regular expressions you definitely should prefetch and keep them compiled (using re.compile) in memory. As a second step, I recommend to think about parallelism. Threads in Python aren't strong, due to GIL. So use multiple processes instead. And third, I would think about scalability on number of servers, using ZeroMQ (or another MQ) for communication.
As an interesting scientific task you can try to build regexp parser, which builds trees of similar regexps:
A
|-B
|-c
|-D
|-E
So if regexp A is matches string, then B, C, D, E match it too. So you will be able to reduce number of checks. IMHO, this task will take so much time. Using a bunch of servers will be cheaper and faster.

Syntax recognizer in python

I need a module or strategy for detecting that a piece of data is written in a programming language, not syntax highlighting where the user specifically chooses a syntax to highlight. My question has two levels, I would greatly appreciate any help, so:
Is there any package in python that receives a string(piece of data) and returns if it belongs to any programming language syntax ?
I don't necessarily need to recognize the syntax, but know if the string is source code or not at all.
Any clues are deeply appreciated.

Maybe you can use existing multi-language syntax highlighters. Many of them can detect language a file is written in.

You could have a look at methods around baysian filtering.

My answer somewhat depends on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things you could look at that are usually slightly different would be class definition (i.e. Python always starts with 'class', C will start with a definition of the return so you could check to see if there is a line that starts with a data type and has the formatting of a method declaration), conditionals are usually formatted slightly differently, etc, etc. If you wanted to make it more accurate, you could introduce some sort of weighting system, features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for the language, and just calculate which language has the highest composite score at the end. You could also define features that you feel are 100% unique, and tell it that as soon as it hits one of those, to stop parsing because it knows the answer (things like the shebang line).
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
If you're given less than 30 or so lines of code, your answers from parsing like that are going to be far less accurate, in that case the easiest best way to do it would probably be to take an appliance similar to Travis, and just run the code in each language (in a VM of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.

Does it make sense to use Hungarian notation prefixes in interpreted languages? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
First of all, I have taken a look at the following posts to avoid duplicate question.
https://stackoverflow.com/questions/1184717/hungarian-notation
Why shouldn't I use "Hungarian Notation"?
Are variable prefixes (“Hungarian notation”) really necessary anymore?
Do people use the Hungarian Naming Conventions in the real world?
Now, all of these posts are related to C#, C++, Java - strongly typed languages.
I do understand that there is no need for the prefixes when the type is known before compilation.
Nevertheless, my question is:
Is it worthwhile to use the prefixes in interpreter based languages, considering the fact that you cant see the type of the object before runtime?
Edit: If someone can make this post a community wiki, please do. I am hardly interested in the reputation (or negative reputation) from this post.

It depends on which of the two versions you refer to:
If you want to use the "real", original Hungarian notation AKA Applications Hungarian notation, denoting the logical variable type resp. its purpose, feel free to do so.
OTOH, the "misunderstood" version AKA Systems Hungarian notation, denotng just the physical variable type is frowned upon and should not be used.

IMHO, it never(*) makes real sense to use Systems Hungarian (prefixing the data type). Either you use a static language or a dynamic language, but with both the compiler or interpreter takes care of the type system. Annotating the type of a variable by means of the variable name can only cause ambiguity (e.g. imagine a float called intSomething).
It is completely different with regard to Application Hungarian, i.e. prefixing with some kind of usage pattern. I'd argue it is good practice to use this kind of notation, e.g. 'usValue' for an unsafe (i.e. unvalidated) value. This gives a visual cue as to the usage and prevents you from mixing different uses of variables which do have the same type but are not intended to be used together (or when they are intended to be used together, you at least have an idea as to what is being used and they produce a blip on your code checking radar).
I frequently use such a thing in MATLAB, e.g. idxInterest to indicate that the array of doubles are not raw data values, but just the indexes (into another array) which are of interest in one way or the other. I regularly use selInterest (sel from select) to do the same with logical indexes (I agree this might look like borderline Systems Hungarian), but in many cases both can be used in the same context.
Similarly for iterators: I regularly use multidimensional arrays (e.g. 4D), in the odd case I run a (par)for over a dimension, the iterators are called iFoo, jBar, kBaz, ... while their upper limit is generally nFoo, nBar, nBaz, ... (or numFoo, ...). When doing more complicated index manipulation, you can easily see what index belongs to what dimension (by the prefix you know what numerical dimension is used, by the full name you know what that dimension represents). This makes the code a lot more readable.
Next to that, I regularly use dFoo=1;, dBar=2;, ... to denote the number of the dimension for a certain set of variables. That way, you can easily see that something like meanIncome = mean(income, dBar) takes the mean income over the Bars , while meanIncome = mean(income, 2) does not convey the same information. Since you also have to set the dVariables, it also serves as documentation of your variables.
While it is not technically incorrect to do something like iFoo + jBar or kBaz + dBar, it does raise some questions when these do occur in your code and they allow you to inspect that part more vigilantly. And that is what real (Applications) Hungarian Notation is all about.
(*) The only moment where it might make some sense, is where your complete framework/language asks you to use it. E.g. the win32 API uses it, so when you interface with that directly, you should use those standards to keep confusion to a minimum. However, I'd argue that it might make even as much or even more sense to look for another framework/language.
Do note that this is something different from sigils as used in Perl, some BASIC dialects etc. These also convey the type, but in many implementations this is the type definition so no or little ambiguity is possible. It is another question whether it is good practice to use that kind of type declaration (and I'm not really sure about my own stance in this).

The reason Hungarian notation conveying type ("systems Hungarian") is frowned upon in Python is simple. It's misleading. A variable might be called iPhones (the integer number of phones, maybe :-) but because it's Python, there's nothing at all to keep you from putting something other than an integer into it! And maybe you will find you need to do that for some reason. And then all the code that uses it is very misleading to someone trying to understand it, unless of course you globally change the name of the variable.
This notation was intended to help you keep track of variable types in statically-typed languages and was arguably useful for a time. But it's obsolete now, even for statically typed languages, given the availability of IDEs that do the job in a much better way.

As it was proposed, Hungarian notation is a reasonable idea. As it was applied? It should be nuked from orbit (It's the only way to be sure.)

The accepted answer from the first question you link to applies the same to Python:
Hungarian notation has no place in Java. The Java API does not use it, and neither do most developers. Java code would not look like Java using it.
All this is also true for Python.

FSharp runs my algorithm slower than Python

Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699

Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The problem is simply that the program has been optimized for Python. This is common when the programmer is more familiar with one language than the other, of course. You just have to learn a different set of rules that dictate how F# programs should be optimized...
Several things jumped out at me such as the use of a "for i in 1..n do" loop rather than a "for i=1 to n do" loop (which is faster in general but not significant here), repeatedly doing List.mapi on a list to mimic an array index (which allocated intermediate lists unnecessarily) and your use of the F# TryGetValue for Dictionary which allocates unnecessarily (the .NET TryGetValue that accepts a ref is faster in general but not so much here)
... but the real killer problem turned out to be your use of a hash table to implement a dense 2D matrix. Using a hash table is ideal in Python because its hash table implementation has been extremely well optimized (as evidenced by the fact that your Python code is running as fast as F# compiled to native code!) but arrays are a much better way to represent dense matrices, particularly when you want a default value of zero.
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
Moral of the story:
F# Dictionary needs work... when using tuples as keys, compiled F# is slower than interpreted Python's hash tables!
Obvious, but no harm in repeating: Cleaner code sometimes means... much slower code.
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing Dictionary with Array makes F# finally run at the speeds a compiled language is expected to run, doesn't negate the need for a fix in Dictionary's speed (I hope F# people from MS are reading this). Other algorithms depend on dictionaries/hashes, and can't be easily switched to using arrays; making programs suffer "interpreter-speeds" whenever one uses a Dictionary, is arguably, a bug. If, as some have said in the comments, the problem is not with F# but with .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that) is to change this:
let optimalResults = new Dictionary<_,_>()
into this:
let optimalResults = new Dictionary<_,_>(HashIdentity.Structural)
This change makes the F# code run 2.7x times faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a "computed goto" that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.

As Jon Harrop has pointed out, simply constructing the dictionaries using Dictionary(HashIdentity.Structural) gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.

Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interessant discussion. My code partially fixed the performance issue, but this is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means, the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow, I believe there's no significant difference with Python or other languages implementations. Also, I don't agree with the "You can't 1-to-1 translate code expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to imagine that both languages should be almost as fast.
I think the performance issue is related to gabage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in a SO question ( Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and a MSDN page (Building tuples):
If they are reference types, this
means there can be lots of garbage
generated if you are changing elements
in a tuple in a tight loop. [...]
F# tuples were reference types, but
there was a feeling from the team that
they could realize a performance
improvement if two, and perhaps three,
element tuples were value types
instead. Some teams that had created
internal tuples had used value instead
of reference types, because their
scenarios were very sensitive to
creating lots of managed objects.
Of course, as Jon said in another comment, the obvious optimization in your example is to replace hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, Python code is using hashtables, not arrays).
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
type Tup = {x: int; y: int}
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i,x)) fileSizes out of the enclosing loop. I believe Python enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or use Seq module, or use a mutable counter).

Hmm.. if the hashtable is the major bottleneck, then it is properly the hash function itself. Havn't look at the specific hash function but For one of the most common hash functions namely
((a * x + b) % p) % q
The modulus operation % is painfully slow, if p and q is of the form 2^k - 1, we can do modulus with an and, add and a shift operation.
Dietzfelbingers universal hash function h_a : [2^w] -> [2^l]
lowerbound(((a * x) % 2^w)/2^(w-l))
Where is a random odd seed of w-bit.
It can be computed by (a*x) >> (w-l), which is magnitudes of speed faster than the first hash function. I had to implement a hash table with linked list as collision handling. It took 10 minutes to implement and test, we had to test it with both functions, and analyse the differens of speed. The second hash function had as I remember around 4-10 times of speed gain dependend on the size of the table.
But the thing to learn here is if your programs bottleneck is hashtable lookup the hash function has to be fast too

String matching in Python

does anyone know which string matching algorithm is implemented in Python?

Per the sources, it's a
fast search/count implementation,
based on a mix between boyer-moore and
horspool, with a few more bells and
whistles on the top. for some more
background, see:
http://effbot.org/zone/stringlib.htm
The essay in question is really well worth reading!

I assume you're talking about CPython. In that case, you could always check the source (see fastsearch.h).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.