Writing a parser for regular expressions

Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.
So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.
To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real world. It merely has to correctly match or fail to match a pattern in a string.
The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.
EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.

Writing an implementation of a regular expression engine is indeed a quite complex task.
But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.

I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?
I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.
What I'll (very quickly) describe is the approach taken in Modern Compiler Design.
Imagine you have the following regular expression...
a (b c)* d
The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...
0 : .a (b c)* d
The only possible match is 'a', so the next state we derive is...
1 : a.(b c)* d
We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...
2 : a (b.c)* d
3 : a (b c)* d.
State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.
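As a minimal sketch (mine, not taken from the book), the four states derived above can be written down by hand as a transition table and run against a string. The names TRANSITIONS, ACCEPTING and matches are made up for illustration, and I'm assuming the spaces in the pattern are only there for readability.

```python
# Hand-built automaton for the pattern  a (b c)* d,
# using the same state numbers as the dotted rules above.
TRANSITIONS = {
    0: {"a": 1},          # .a (b c)* d
    1: {"b": 2, "d": 3},  # a.(b c)* d
    2: {"c": 1},          # a (b.c)* d  -> back to state 1 after 'c'
    3: {},                # a (b c)* d.
}
ACCEPTING = {3}


def matches(text):
    """Return True if the whole of text is accepted by the automaton."""
    state = 0
    for ch in text:
        state = TRANSITIONS[state].get(ch)
        if state is None:  # no transition for this character
            return False
    return state in ACCEPTING


print(matches("abcbcd"), matches("abd"))
```

Each input character costs one dictionary lookup, which is the whole attraction of the deterministic approach.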
IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...
1 : a.b c (b c)* d
a.d
That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).
EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.
In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members in the set always produces another member that is in the set. The closure of a set is the minimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.
Minimal here refers to the closure process - there may be a smaller equivalent automaton, which is what is normally referred to as minimal.
With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how do I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will be a current state (or set of current states) from each input automaton and perhaps additional details.
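A rough sketch of that idea, assuming each automaton is stored as a (start, accepting, transitions) tuple with the transitions as nested dicts. That representation and the names product_dfa and accepts are invented for this example. A missing transition is treated as a dead state, written as None.

```python
from collections import deque


def product_dfa(dfa_a, dfa_b, alphabet, combine):
    """Build an automaton whose states are pairs of states of dfa_a and dfa_b.

    combine(in_a, in_b) decides acceptance: "or" gives the union,
    "and" the intersection, "in_a and not in_b" the set difference.
    """
    start_a, accept_a, trans_a = dfa_a
    start_b, accept_b, trans_b = dfa_b
    start = (start_a, start_b)
    transitions, accepting = {}, set()
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first discovery of the reachable pairs
        pair = queue.popleft()
        state_a, state_b = pair
        if combine(state_a in accept_a, state_b in accept_b):
            accepting.add(pair)
        transitions[pair] = {}
        for ch in alphabet:
            # None stands for "no transition", i.e. a dead state
            nxt = (trans_a.get(state_a, {}).get(ch),
                   trans_b.get(state_b, {}).get(ch))
            transitions[pair][ch] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, accepting, transitions


def accepts(dfa, text):
    start, accepting, transitions = dfa
    state = start
    for ch in text:
        state = transitions.get(state, {}).get(ch)
        if state is None:
            return False
    return state in accepting


ALPHABET = "abcd"
# a (b c)* d, with the state numbers used earlier
ABCD = (0, {3}, {0: {"a": 1}, 1: {"b": 2, "d": 3}, 2: {"c": 1}, 3: {}})
# any string of exactly two characters
TWO_CHARS = (0, {2}, {0: dict.fromkeys(ALPHABET, 1),
                      1: dict.fromkeys(ALPHABET, 2),
                      2: {}})

union = product_dfa(ABCD, TWO_CHARS, ALPHABET, lambda x, y: x or y)
intersection = product_dfa(ABCD, TWO_CHARS, ALPHABET, lambda x, y: x and y)
```

The loop is the same "discover the digraph as you search it" closure as before; only the state description has changed from a dotted rule to a pair of states.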
If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You group all your states into one equivalence class or "block". Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.
Hopcroft's algorithm is an efficient way to handle this basic idea.
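To make the block-splitting idea concrete, here is a sketch of the simple (and slower) refinement loop, not Hopcroft's algorithm itself. It assumes the automaton is given as nested dicts, and it starts from two blocks (accepting and non-accepting states) rather than one, since those can never be equivalent. The name equivalence_blocks and the example automaton are my own.

```python
def equivalence_blocks(states, alphabet, transitions, accepting):
    """Group states into blocks of equivalent states by repeated splitting."""
    states = sorted(states)
    # Initial split: accepting states versus all the others.
    block_of = {s: int(s in accepting) for s in states}
    while True:
        # A state's signature is its own block plus the block reached on
        # each character (None when there is no transition).
        signature = {
            s: (block_of[s],) + tuple(
                block_of.get(transitions.get(s, {}).get(ch))
                for ch in alphabet)
            for s in states
        }
        ids = {}
        for s in states:
            ids.setdefault(signature[s], len(ids))
        if len(ids) == len(set(block_of.values())):
            break  # no block was split, so the partition is stable
        block_of = {s: ids[signature[s]] for s in states}
    blocks = {}
    for s in states:
        blocks.setdefault(block_of[s], []).append(s)
    return sorted(blocks.values())


# a (b c)* d with the loop unrolled once: states 1 and 4 behave identically.
UNROLLED = {
    0: {"a": 1},
    1: {"b": 2, "d": 3},
    2: {"c": 4},
    3: {},
    4: {"b": 2, "d": 3},
}
print(equivalence_blocks(UNROLLED, "abcd", UNROLLED, {3}))
```

Merging each block into a single state then gives the smaller automaton.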
A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcroft's algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.
The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them, but with a bit of luck this gives a fair quick introduction to the basic ideas.
BTW - have a look around the rest of Dick Grune's site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.

"A play on regular expressions: functional pearl" takes an interesting approach. The implementation is given in Haskell, but it's been reimplemented in Python at least once.
The developed program is based on an old technique to turn regular expressions into finite automata which makes it efficient both in terms of worst-case time and space bounds and actual performance: despite its simplicity, the Haskell implementation can compete with a recently published professional C++ program for the same problem.

There's an interesting (if slightly short) chapter in Beautiful Code by Brian Kernighan, appropriately called "A Regular Expression Matcher". In it he discusses a simple matcher that can match literal characters, and the .^$* symbols.
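For a flavour of how small such a matcher can be, here is a Python sketch in the spirit of that chapter. It is my own port written from memory of the C version, so treat the function names and the details as assumptions rather than the book's code. It handles literal characters plus . ^ $ and *.

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty suffix.
    for i in range(len(text) + 1):
        if match_here(regexp, text[i:]):
            return True
    return False


def match_here(regexp, text):
    """Match regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return not text
    if text and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(c, regexp, text):
    """Match zero or more occurrences of c, followed by regexp."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1
        else:
            return False


print(match("^ab*c$", "abbbc"), match("^b", "abc"))
```

Note that this is a backtracking matcher, so it is an example of the simple-but-potentially-slow family rather than the automaton approach discussed above.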

I do agree that writing a regex engine will improve understanding, but have you taken a look at ANTLR? It generates parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examples and running through the AST and parser that it generates. The generated code is really complicated, but you will get a good understanding of how a parser works.

Related

The difference in application between SequenceMatcher in edit distance and that in difflib?

I know the implementation of the edit distance algorithm. By dynamic programming, we first fill the first column and first row and then the entries immediately right and below of the filled entries by comparing three paths from the left, above, and left above. While for the Ratcliff/Obershelp algorithm, we first extract the longest common substring out from the two strings, then we do recursive operations for the left side two sub-strings and right side two sub-strings until no characters are left.
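To pin down the two procedures as I understand them, here is a minimal sketch of both. The function names are my own, and the Ratcliff/Obershelp part is a plain textbook-style version, not difflib's actual implementation (which adds junk handling and other heuristics).

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))  # first row: distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]  # first column: distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # copy or replace
        prev = cur
    return prev[-1]


def matching_chars(a, b):
    """Ratcliff/Obershelp: longest common substring, then recurse on both sides."""
    best_len, best_i, best_j = 0, 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    return (best_len
            + matching_chars(a[:best_i], b[:best_j])
            + matching_chars(a[best_i + best_len:], b[best_j + best_len:]))


def ro_similarity(a, b):
    """Twice the number of matching characters over the total length."""
    if not a and not b:
        return 1.0
    return 2 * matching_chars(a, b) / (len(a) + len(b))


print(levenshtein("kitten", "sitting"), ro_similarity("tide", "diet"))
```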
Both of them can be utilized to calculate the similarity between two strings and transform one string into another using four operations: delete, replace, copy, and insert.
But I wonder when to use which between SequenceMatcher in edit distance and that in difflib?
Here is what I found on the internet that makes me think that this question would also benefit others:
In the documentation of edit distance it reads that
Similar to the difflib SequenceMatcher, but uses Levenshtein/edit distance.
In this answer to a question on calculating edit distance, an answer on Ratcliff/Obershelp algorithm was provided.
There are only a few resources about the Ratcliff/Obershelp algorithm, let alone its comparison to edit distance that I thought is the most well known string alignment algorithm.
So far as I know, I have the following ideas:
I find that edit distance and the Ratcliff/Obershelp algorithm can both be used for spell checking. But when to use which?
I thought the edit distance is employed to find the minimal edit sequence while the Ratcliff/Obershelp algorithm yields matches that "look right" to people. However, 'look right' seems too vague a term, especially in real world applications. What's more, when is the minimum edit sequence a must/preferred?
Any suggestions would be highly appreciated, and thanks in advance.

"Looks right to people" needn't be all that vague. Search the web for discussion of why, e.g., the very widely used git source control system added "patience" and "histogram" differencing algorithms, as options. Variations of "minimal edit distance" routinely produce diffs that are jarring to humans, and I'm not going to reproduce examples here that are easily found by searching.
From a formal perspective, Levenshtein is more in line with what a mathematician means by "distance". Chiefly, difflib's .ratio() can depend on the order of the arguments passed to it, but Levenshtein is insensitive to order:
>>> import difflib
>>> difflib.SequenceMatcher(None, "tide", "diet").ratio()
0.25
>>> difflib.SequenceMatcher(None, "diet", "tide").ratio()
0.5
For the rest, I don't think you're going to get crisp answers. There are many notions of "similarity", not just the two you mentioned, and they all have their fans. "Minimal" was probably thought to be more important back when disk space and bandwidth were scarce and expensive.
The physical realities constraining genetic mutation have made measures that take into account spatial locality much more important in that field - doesn't matter one whit if it's "minimal" if it's also physically implausible ;-) Terms to search for: Smith–Waterman, and Needleman–Wunsch.

Simplifying Equations with Python z3 API

I'm trying to learn how to accomplish a few things when working with expressions in the Python z3 API.
1. I would like to be able to simplify/reduce sets of equations that contain intermediate variables. Say I have the equations (A = B && C) and (C = D || E). In z3 these would be represented as (Bool('A') == And(Bool('B'), Bool('C'))) and (Bool('C') == Or(Bool('D'), Bool('E'))). Is there some function or series of functions that can be used to produce the simplified and reduced equation (A = B && (D || E))?
2. I would like to be able to convert a z3 expression into sum of products form (i.e., Or(minterm1, minterm2, ...)).
3. An efficient way of determining the logical equivalence of two boolean equations.
4. A way of returning boolean equations as formatted strings (i.e., NOT in the nested function form used to declare the function).
If anyone has any insight on any of these items, your input would be very much appreciated. Also, if any further clarification as to what is desired is needed, please let me know.
Thanks,

Great questions.
1. No, not in general. You can get z3 to simplify equations, but your notion of "simple" is unlikely to match what it will consider simple for its internal purposes. People often ask for this feature, but it is in general a very hard problem, and not at all clear what's meant by simple. Having said that, z3 does have a notion of Goal and Tactic, and there is even a simplify tactic that you can use. It will simplify the formulas, but having it behave precisely the way you want it to behave is a fool's errand.
See this great resource on tactics and perhaps you can play around to see to get something that works for you: http://www.cs.tau.ac.il/~msagiv/courses/asv/z3py/strategies-examples.htm
2. The simplify tactic does have a som option, I believe. That might do the trick. Again, see the above link, where they have the example:
s = Then(With('simplify', arith_lhs=True, som=True),
'normalize-bounds', 'lia2pb', 'pb2bv',
'bit-blast', 'sat').solver()
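The nugget som=True asks the simplifier for a sum-of-monomials form. As far as I can tell from the parameter description, that applies to arithmetic polynomials rather than Boolean minterms, so it may not produce the sum of products you are after. Again, your mileage might vary depending on the exact structure of your formulas, and z3 might introduce new names that might defeat the purpose.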
3. Absolutely! This is what z3 excels at. Simply assert f != g where f and g are your equations. If z3 says unsat, then you know they are equivalent for all assignments to variables. If it gives you a model, that forms a counter-example to their equivalence. (The negated-equality trick is very common in SMT solving: A formula is a tautology precisely when its negation is unsatisfiable. So, you can assert the negation of what you want and see if it comes back with unsat.)
Note that this is what SMT (and SAT) solvers excel at.
4. For any formula f you build, you can issue print(f) and it'll print it. But as you probably already observed, it will not look like your textbook logical formulas. The pretty-printer has some options to control its behaviour, but it's probably not quite what you want.
However, the API provides functions to walk down the AST and extract nodes as you wish. So, you can write your own pretty printer if you so desire. Doing so isn't terribly difficult, but that doesn't mean it's simple: There are many cases to consider and in my experience, such printers are usually not that hard to fool; i.e., produce something vastly worse for small changes to the input.
From a practical perspective, while z3 and its high-level APIs in Python, C, Java, etc., are capable of doing everything you want, it's not going to be out-of-the-box except for #3. My recommendation would be to code everything else yourself, and rely on z3 for checking equality, where it excels. Of course, this all depends on precisely what you're trying to do. Best of luck!

How to measure similarity between two python code blocks?

Many would want to measure code similarity to catch plagiarisms, however my intention is to cluster a set of python code blocks (say answers to the same programming question) into different categories and distinguish different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.

You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what will turn out to be similar will then depend highly on how you defined the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
A hard problem is finding a hashing function that classifies code blocks in a way that seems similar in a natural sense. With lots of research, nobody has yet found anything better to judge this than "I'll know if they are similar when I see them."
You surely do not want it to be dependent on layout/indentation/whitespace/comments, or slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute the hash by counting various types of operators and operands, i.e., by computing a metric. (Note: this uses lexical tokens.) These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (e.g, treat them as undifferentiated), and then essentially hash N-grams (a sequence of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries.
The sequence
return ID; } void ID ( int ID ) {
is an 11-gram which occurs frequently in C-like languages but clearly isn't a useful clone. The result is that false positives tend to occur, i.e., you get claimed matches where there aren't any.
Abstract syntax tree based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using the ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product CloneDR based on the idea, see my bio). These tools have the advantage that they can match code that contains sequences of tokens of different lengths in the middle of a match, e.g., one statement (of arbitrary size) is replaced by another.
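Since the question is about Python code blocks, here is a toy illustration of the AST idea using only the standard ast module. It is a sketch of whole-block hashing with identifiers and literals blanked out, not how CloneDR or any other real tool works, and the names _Normalise and structure_hash are made up.

```python
import ast
import hashlib


class _Normalise(ast.NodeTransformer):
    """Replace every identifier and literal by a fixed placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="ID", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "ID"
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="LIT"), node)

    def visit_FunctionDef(self, node):
        node.name = "ID"
        self.generic_visit(node)
        return node


def structure_hash(source):
    """Hash the shape of the code, ignoring names, literals, layout and comments."""
    tree = _Normalise().visit(ast.parse(source))
    # ast.dump leaves out line and column numbers by default.
    return hashlib.sha1(ast.dump(tree).encode()).hexdigest()


LOOP_A = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
LOOP_B = ("def add_all(values):  # same idea, other names\n"
          "    acc = 0\n"
          "    for v in values:\n"
          "            acc += v\n"
          "    return acc\n")
BUILTIN = "def total(xs):\n    return sum(xs)\n"

print(structure_hash(LOOP_A) == structure_hash(LOOP_B))
print(structure_hash(LOOP_A) == structure_hash(BUILTIN))
```

Exact hashing like this only groups answers with identical structure; for clustering student answers you would probably want to hash subtrees and compare how many of them two answers share.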
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
[There are graph-based schemes that match control and data flow graphs. They should arguably produce even better matches but apparently do not do much better in practice.]

One approach would be to count the number of functions, objects, and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., and the number of variables of each type, without relying on the methods and variables having the same name(s).
For a given problem, similar approaches will tend to come out with similar scores for these, e.g., a student who used a decision tree would have a high number of branch statements while one who used a decision table would have far fewer.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
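A quick sketch of such a count using the standard ast module. The categories and the name metric_vector are only an illustration; you would want to tune them for the exercise at hand.

```python
import ast
from collections import Counter

# Illustrative grouping of syntax node types into categories.
CATEGORIES = {
    "branching": (ast.If, ast.IfExp),
    "looping": (ast.For, ast.While, ast.ListComp, ast.GeneratorExp),
    "calls": (ast.Call,),
    "assignments": (ast.Assign, ast.AugAssign),
    "functions": (ast.FunctionDef, ast.Lambda),
}


def metric_vector(source):
    """Count how many syntax nodes of each category the code contains."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        for name, node_types in CATEGORIES.items():
            if isinstance(node, node_types):
                counts[name] += 1
    return [counts[name] for name in CATEGORIES]


TREE_STYLE = ("def grade(n):\n"
              "    if n >= 90:\n"
              "        return 'A'\n"
              "    elif n >= 80:\n"
              "        return 'B'\n"
              "    else:\n"
              "        return 'C'\n")
TABLE_STYLE = ("def grade(n):\n"
               "    table = {9: 'A', 8: 'B'}\n"
               "    return table.get(n // 10, 'C')\n")

print(metric_vector(TREE_STYLE), metric_vector(TABLE_STYLE))
```

The resulting vectors can be fed to any ordinary clustering routine.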

Identifying similar strings in a database in Python

I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand smaller terms in a csv file. What I want to do is identify similar terms in the database to the blacklisted terms in my csv file. Similarity in this case can be construed as mis-spellings of the blacklisted terms.
I am familiar with libraries in python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?

I'm not sure there's any definitive answer to this problem, so the best I can do is to explain how I'd approach this problem, and hopefully you'll be able to get some ideas from my ramblings. :-)
First.
On an unrelated angle, fuzzy string matching might not be enough. People are going to be using similar-looking characters and non-character symbols to get around any text matches, to the point where there's nearly zero match between a blacklisted word and actual text, and yet it's still readable for what it is. So perhaps you will need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X' etc. I believe there are libraries and/or conversion references to that purpose. Non-latin symbols are also a distinct possibility and should be accounted for.
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check if what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people who're writing acceptable text are probably going to write it in acceptable language without any tricks outlined above. Or you could normalize it if you're feeling paranoid. :-)
Third.
However the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write a "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify what words are more (or less) frequently used in combination with them, and use them to check your specific example. Using this very paragraph as example, a word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together you might want to leave a certain amount of gray area. That is, unless there's a high enough certainty in either direction, leave the final decision for a human (screening comments/posts for a time, automatically putting them into moderation queue, or whatever else your project dictates).
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
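To tie the first four points together, here is a rough sketch of the decision logic using only the standard library (difflib stands in for fuzzywuzzy here). The substitution table, the 0.85 threshold and the names normalise and classify are all placeholders I made up, and comparing every term against every blacklist entry like this would be far too slow for a million rows without some pre-filtering.

```python
import difflib

# Hypothetical look-alike substitutions; a real table would be much larger.
SUBSTITUTIONS = {"0": "o", "1": "i", "3": "e", "@": "a", "$": "s"}


def normalise(term):
    """Lower-case the term and undo simple character disguises."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in term.lower())


def classify(term, blacklist, whitelist, threshold=0.85):
    """Return 'ok', 'blocked' or 'review' (the grey area left to a human)."""
    cleaned = normalise(term)
    best = max((difflib.SequenceMatcher(None, cleaned, bad).ratio()
                for bad in blacklist), default=0.0)
    if best < threshold:
        return "ok"        # nothing on the blacklist looks similar
    if term.lower() in whitelist:
        return "ok"        # similar to a bad term, but a known legal one
    if best == 1.0:
        return "blocked"   # exact match once the disguises are removed
    return "review"        # close to a bad term: let a human decide


BLACKLIST = ["big butt"]
WHITELIST = {"big but"}

for term in ["zebra videos", "big but", "B1G BUTT", "big buttt"]:
    print(term, "->", classify(term, BLACKLIST, WHITELIST))
```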
Hope that helps. :-)

Compare two glob expressions

Does anyone know an algorithm to check which of two wild card expressions is more general than the other?
For example I'd like to compare
*/foo/foo.bar
with
*.bar
Clearly the first expression is contained in the second. I know that this is not possible for regex (at least not if you don't have a looooot of time; as far as I remember this is in the complexity class non-elementary), but it could be possible for wild card expressions, which are far less expressive. I tried to put together a simple python algorithm, but it gets very nasty when it comes to special cases.
Does anybody have an idea if there is an algorithm for that problem?
UPDATE:
I do not want to use any brute force algorithm, since this won't work in general for performance reasons.
Regards,
Gerald

You basically need to find a string somehow that matches the more general glob but not the more specific one. Just being captain obvious...
Probably by replacing * character with 0 or more random symbols.
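Something along these lines, as a sketch only: it uses fixed fillers instead of random ones, handles only the * wildcard in the more specific pattern, and a result of None means "no counter-example found among the candidates tried", not a proof of containment. The name find_counterexample and the filler character are my own choices, and the filler should be a character that appears in neither pattern.

```python
from fnmatch import fnmatchcase
from itertools import product

# Each * in the specific pattern is replaced by one of these fillers.
FILLERS = ("", "#", "##")


def find_counterexample(specific, general, fillers=FILLERS):
    """Look for a string matched by `specific` but not by `general`."""
    parts = specific.split("*")
    for combo in product(fillers, repeat=len(parts) - 1):
        candidate = parts[0] + "".join(
            filler + part for filler, part in zip(combo, parts[1:]))
        if not fnmatchcase(candidate, general):
            return candidate
    return None


print(find_counterexample("*/foo/foo.bar", "*.bar"))
print(find_counterexample("*.bar", "*/foo/foo.bar"))
```

The number of candidates grows exponentially with the number of * characters, so this is only reasonable for short patterns.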
