Simplifying Equations with Python z3 API

I'm trying to learn how to accomplish a few things when working with expressions in the Python z3 API.
1. I would like to be able to simplify/reduce sets of equations that contain intermediate variables. Say I have the equations (A = B && C) and (C = D || E). In z3 these would be represented as Bool('A') == And(Bool('B'), Bool('C')) and Bool('C') == Or(Bool('D'), Bool('E')). Is there some function, or series of functions, that can be used to produce the simplified and reduced equation A = B && (D || E)?
2. I would like to be able to convert a z3 expression into sum-of-products form (i.e., Or(minterm1, minterm2, ...)).
3. An efficient way of determining the logical equivalence of two Boolean equations.
4. A way of returning Boolean equations as formatted strings (i.e., NOT in the nested function form used to declare them).
If anyone has any insight on any of these items, your input would be very much appreciated. Also, if any further clarification as to what is desired is needed, please let me know.
Thanks,

Great questions.
No, not in general. You can get z3 to simplify equations, but your notion of "simple" is unlikely to match what it will consider simple for its internal purposes. People often ask for this feature, but it is in general a very hard problem, and not at all clear what's meant by simple. Having said that, z3 does have a notion of Goal and Tactic, and there is even a simplify tactic that you can use. It will simplify the formulas, but having it behave precisely the way you want it to behave is a fool's errand.
See this great resource on tactics; perhaps you can play around with it and get something that works for you: http://www.cs.tau.ac.il/~msagiv/courses/asv/z3py/strategies-examples.htm
The simplify tactic does have a som option, I believe. That might do the trick. Again, see the above link, where they have the example:
s = Then(With('simplify', arith_lhs=True, som=True),
         'normalize-bounds', 'lia2pb', 'pb2bv',
         'bit-blast', 'sat').solver()
The nugget som=True tells the simplifier to use sum-of-monomials (i.e., sum-of-products) form. Again, your mileage may vary depending on the exact structure of your formulas, and z3 might introduce new names that could defeat the purpose.
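For the purely Boolean formulas in the question, here is a minimal sketch of applying the simplify tactic, with and without the som parameter; note the output is whatever z3 considers simple, which may not be the reduced textbook form you are after:

from z3 import *

A, B, C, D, E = Bools('A B C D E')
g = Goal()
g.add(A == And(B, C), C == Or(D, E))

# Plain simplification vs. simplification with the som parameter enabled.
print(Tactic('simplify')(g))
print(With('simplify', som=True)(g))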
Absolutely! This is what z3, and SMT (and SAT) solvers in general, excel at. Simply assert f != g, where f and g are your equations. If z3 says unsat, then you know they are equivalent for all assignments to the variables. If it gives you a model, that model is a counter-example to their equivalence. (The negated-equality trick is very common in SMT solving: a formula is a tautology precisely when its negation is unsatisfiable. So, you can assert the negation of what you want and see if it comes back unsat.)
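A minimal sketch of the negated-equality trick on the formulas from the question (the variable names are just for illustration):

from z3 import *

B, D, E = Bools('B D E')
f = And(B, Or(D, E))      # A with the intermediate C substituted away
g = And(Or(E, D), B)      # a syntactically different but equivalent formula

s = Solver()
s.add(f != g)             # satisfiable only if some assignment distinguishes them
print(s.check())          # unsat => f and g are logically equivalent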
For any formula f you build, you can issue print(f) and it'll print it. But as you have probably already observed, it will not look like your textbook logical formulas. The pretty-printer has some options to control its behaviour, but it's probably not quite what you want.
However, the API provides functions to walk down the AST and extract nodes as you wish. So, you can write your own pretty printer if you so desire. Doing so isn't terribly difficult, but that doesn't mean it's simple: There are many cases to consider and in my experience, such printers are usually not that hard to fool; i.e., produce something vastly worse for small changes to the input.
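As a taste of what such a walker looks like, here is a tiny and deliberately incomplete infix printer (the name to_infix and the output style are mine): it only knows about And, Or and Not, and falls back to z3's own printing for everything else.

from z3 import *

def to_infix(e):
    # Minimal sketch: handles only And/Or/Not; everything else uses str().
    if is_and(e):
        return '(' + ' && '.join(to_infix(c) for c in e.children()) + ')'
    if is_or(e):
        return '(' + ' || '.join(to_infix(c) for c in e.children()) + ')'
    if is_not(e):
        return '!' + to_infix(e.children()[0])
    return str(e)

B, D, E = Bools('B D E')
print(to_infix(And(B, Or(D, E))))   # (B && (D || E))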
From a practical perspective, while z3 and its high-level APIs in Python, C, Java, etc., are capable of doing everything you want, only #3 is going to work out-of-the-box. My recommendation would be to code everything else yourself and rely on z3 for the equivalence checking, which is where it excels. Of course, this all depends on precisely what you're trying to do. Best of luck!

Related

Algorithms for implementing analytic calculus

I'm interested in writing a program in Python that can parse a mathematical expression and carry out operations on it in algebraic form. Most of these will be pretty easy, something like 2x+5x+xy. My program would take this and return the simplified 7x+xy. However, I'd also like to eventually extend its functionality to calculus as well, so if it's given something like integrate: 5x it should be able to return (5x^2)/2 + c. I'm aware that the sympy library can be used for this, but I'm interested in implementing it myself as a learning process. Are there any algorithms that exist to carry out algebraic calculus for an arbitrary integral/differential that I could implement, or is this a more daunting task than I've set for myself here? Any resources would be appreciated.
It might be worthwhile to ask why you're looking at doing this, and to understand the sorts of things that you want to be able to integrate, because I think you might not realise that you've actually stumbled onto a very difficult question.
If you are looking for an exercise to learn about simple low-level polynomial integration and Python, i.e. something of the form x^2 + 2x + 7, where the idea is to learn about string manipulation and simple math, then by all means attempt it and see how far you can get. Look at the various math-related libraries (like SymPy or NumPy) and see what you can get.
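For comparison, and as a target to aim for, SymPy already handles both examples from the question; a quick sketch:

import sympy as sp

x, y = sp.symbols('x y')
print(sp.simplify(2*x + 5*x + x*y))   # x*y + 7*x
print(sp.integrate(5*x, x))           # 5*x**2/2 (constant of integration omitted)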
If you are looking to produce a powerful tool to be used for Calculus, then I think you should rethink your idea and consider using something like Wolfram Alpha. A project like this could take years and you might not get a good result from it.

PyTorch Autograd automatic differentiation feature

I am just curious to know how PyTorch tracks operations on tensors (after .requires_grad is set to True) and how it later calculates the gradients automatically. Please help me understand the idea behind autograd. Thanks.
That's a great question!
Generally, the idea of automatic differentiation (AutoDiff) is based on the multivariable chain rule, i.e.
dx/dz = dx/dy * dy/dz
What this means is that you can express the derivative of x with respect to z via a "proxy" variable y; in fact, that allows you to break up almost any operation into a bunch of simpler (or atomic) operations that can then be "chained" together.
Now, what AutoDiff packages like Autograd do is simply store the derivative of each such atomic operation block, e.g., a division, a multiplication, etc.
Then, at runtime, your provided forward pass formula (consisting of multiple of these blocks) can be easily turned into an exact derivative. Likewise, you can also provide derivatives for your own operations, should you think AutoDiff does not exactly do what you want it to.
The advantage of AutoDiff over derivative approximations like finite differences is simply that this is an exact solution.
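A tiny example of what that looks like in PyTorch (the values are just for illustration):

import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf tensor; operations on it are recorded
y = x ** 2                                  # atomic block: dy/dx = 2*x
z = 3 * y + 1                               # atomic blocks: multiply and add
z.backward()                                # chain the stored local derivatives backwards
print(x.grad)                               # tensor(12.) == dz/dx = 6*x at x = 2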
If you are further interested in how it works internally, I highly recommend the AutoDidact project, which aims to simplify the internals of an automatic differentiator, since there is usually also a lot of code optimization involved.
Also, this set of slides from a lecture I took was really helpful in understanding.

How to measure similarity between two python code blocks?

Many would want to measure code similarity to catch plagiarism; however, my intention is to cluster a set of Python code blocks (say, answers to the same programming question) into different categories and distinguish the different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.
You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what will turn out to be similar will then depend highly on how you defined the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
The hard problem is finding a hashing function that classifies code blocks in a way that matches what people naturally consider similar. Despite lots of research, nobody has yet found anything better to judge this than "I'll know if they are similar when I see them."
You surely do not want it to be dependent on layout/indentation/whitespace/comments, or slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute the hash by counting various types of operators and operands, i.e., by computing a metric over the lexical tokens. These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (i.e., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these token streams (a small sketch of this idea appears further below). There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program-structure boundaries.
The sequence
return ID; } void ID ( int ID ) {
is an 11-gram which occurs frequently in C-like languages but clearly isn't a useful clone. The result is that false positives tend to occur, e.g., you get claimed matches where there isn't really one.
Abstract-syntax-tree-based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using the ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code that contains sequences of tokens of different lengths in the middle of a match, e.g., where one statement (of arbitrary size) is replaced by another.
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
[There are graph-based schemes that match control and data flow graphs. They should arguably produce even better matches but apparently do not do much better in practice.]
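To make scheme 2 concrete for Python, here is a rough sketch (the function name and the normalisation choices are mine, not from any particular tool): identifiers and literals are collapsed to placeholders, layout and comments are dropped, and n-grams of the remaining tokens are counted.

import io, keyword, tokenize
from collections import Counter

def ngram_fingerprint(source, n=4):
    normalized = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            normalized.append('ID')                       # undifferentiated identifier
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            normalized.append('LIT')                      # undifferentiated literal
        elif tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                          tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue                                      # layout/comments must not matter
        else:
            normalized.append(tok.string)
    grams = [tuple(normalized[i:i + n]) for i in range(len(normalized) - n + 1)]
    return Counter(grams)

a = ngram_fingerprint("def area(r):\n    return 3.14159 * r * r\n")
b = ngram_fingerprint("def disk(radius):\n    return 3.14 * radius * radius\n")
overlap = sum((a & b).values()) / max(sum(a.values()), sum(b.values()))
print(overlap)   # 1.0 here: same token structure, different names and constants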
One approach would be to count the number of functions, objects, and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., and the number of variables of each type, without relying on the methods and variables being given the same names.
For a given problem, similar approaches will tend to come out with similar scores for these. For example, a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have far fewer.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
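A quick sketch of that counting idea (here leaning on Python's own ast module to do the counting, which is still much lighter than a full structural comparison; the example snippets are made up): tally node types rather than names, so renamed variables don't change the profile.

import ast
from collections import Counter

def structural_profile(source):
    # Count AST node types; identifiers and literal values don't influence the result.
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

branchy = "def f(x):\n    if x > 0:\n        return 1\n    else:\n        return 0\n"
tabular = "def f(x):\n    return {True: 1, False: 0}[x > 0]\n"
print(structural_profile(branchy))   # contains an 'If' and two 'Return' nodes
print(structural_profile(tabular))   # 'Dict' and 'Subscript' instead of 'If'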

How to convert Math Formula to Python code?

Are there any easy ways to convert mathematical formulas to Python code?
Perhaps translators, web reference, specific book chapters, anything ~
For regular expressions there are programs such as Kodos and sites such as pythonregex.com, so I was hoping there would be something similar for formula notation and Python.
No, this isn't possible in general. There are mathematical functions that aren't computable (for example, see wikipedia/Halting_problem). There are other mathematical functions where it's just not obvious how to code them up (consider a difficult integral or differential equations). There are many books written on finding numerical solutions to these sorts of problems (you can find some links here: wikipedia/Numerical_analysis).
For simple cases, you can transcribe mathematical formulae directly, but any automated means of translation would require a formal language for writing mathematical formulae, which would be a programming language in itself. That would beg the question, since you would just be trading writing mathematical formulae in one language for writing them in another.
You could try to make your own with sympy and pyparsing.
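A tiny sketch of that route: SymPy can already parse formula-like text and turn it into a callable Python function (pyparsing would come in if you wanted to define your own input syntax):

import sympy as sp

expr = sp.sympify("x**2 + 2*x + 7")      # parse a formula given as text
f = sp.lambdify(sp.Symbol('x'), expr)    # compile it into an ordinary Python function
print(expr)                               # x**2 + 2*x + 7
print(f(3))                               # 22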
So how do you convert a math formula into Python?
Well, I don't know all that much Python myself (I am just a beginner), but I have a few tips to help you:
First, work out the formula using x and y (or a and b) and write it down on paper.
Let's take the Pythagorean theorem (which is pretty basic):
c² = a² + b²
So you take a and b as input and rearrange the formula to calculate c, which in Python becomes c = math.sqrt(a**2 + b**2).
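As a small illustration (just the standard math module, nothing fancy):

import math

a = float(input("a: "))
b = float(input("b: "))
c = math.sqrt(a**2 + b**2)   # Pythagorean theorem: c**2 = a**2 + b**2
print("c =", c)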
Now let's say you don't know the formula (I can hear you). Then what? Well, look it up on the internet, read through the explanation, and try to translate it into Python.
This is beginner's advice, but you know: there are no easy ways in programming.
(Well, just search for it on the internet, I guess?)

Writing a parser for regular expressions

Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.
So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.
To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.
The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.
EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.
Writing an implementation of a regular expression engine is indeed a quite complex task.
But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?
I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.
What I'll (very quickly) describe is the approach taken in Modern Compiler Design.
Imagine you have the following regular expression...
a (b c)* d
The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...
0 : .a (b c)* d
The only possible match is 'a', so the next state we derive is...
1 : a.(b c)* d
We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...
2 : a (b.c)* d
3 : a (b c)* d.
State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.
IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...
1 : a.b c (b c)* d
a.d
That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).
EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.
In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is the set of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members of the set always produces another member that is in the set. The closure of a set is the minimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.
Minimal here refers to the closure process - there may be a smaller equivalent automata which is normally referred to as minimal.
With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how do I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will be a current state (or set of current states) from each input automaton, plus perhaps some additional details.
If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You group all your states into one equivalence class or "block". Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.
Hopcroft's algorithm is an efficient way to handle this basic idea.
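As a rough sketch of the block-splitting idea (this is the simpler Moore-style refinement rather than Hopcroft's faster algorithm, and the data layout is just an assumption for illustration):

def minimize(states, alphabet, delta, accepting):
    # delta is assumed to be total: delta[(state, symbol)] -> state.
    # Start with two blocks: accepting and non-accepting states.
    block = {s: int(s in accepting) for s in states}
    while True:
        # States stay together only if they sit in the same block and, for
        # every symbol, transition into the same block.
        signature = {s: (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
                     for s in states}
        ids = {}
        new_block = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        if len(ids) == len(set(block.values())):
            return new_block          # nothing was split: the partition is stable
        block = new_block

# Toy DFA over {'a'}: states 1 and 2 behave identically and end up merged.
delta = {(0, 'a'): 1, (1, 'a'): 2, (2, 'a'): 1}
print(minimize([0, 1, 2], ['a'], delta, accepting={1, 2}))   # {0: 0, 1: 1, 2: 1}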
A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcroft's algorithm will produce the same representation of that minimal form no matter what larger representation it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.
The above is probably a bit sloppy WRT definitions, so make sure you look up any terms before using them yourself, but with a bit of luck this gives a fairly quick introduction to the basic ideas.
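One concrete way to play with the "set of strings that can occur from this point on" view mentioned in the edit above is Brzozowski derivatives; this is not the dotted-rule construction from the book, but a closely related idea, sketched here for the a (b c)* d example:

# Regexes as nested tuples: ('char', c), ('cat', r, s), ('alt', r, s), ('star', r),
# plus 'eps' (empty string) and 'null' (matches nothing).

def nullable(r):
    if r == 'eps':
        return True
    if r == 'null' or r[0] == 'char':
        return False
    if r[0] == 'star':
        return True
    if r[0] == 'cat':
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])       # 'alt'

def deriv(r, c):
    # The derivative of r w.r.t. c describes exactly the strings that can
    # still be matched after consuming c - it is the matcher's "state".
    if r in ('eps', 'null'):
        return 'null'
    if r[0] == 'char':
        return 'eps' if r[1] == c else 'null'
    if r[0] == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if r[0] == 'star':
        return ('cat', deriv(r[1], c), r)
    head = ('cat', deriv(r[1], c), r[2])          # 'cat' case
    return ('alt', head, deriv(r[2], c)) if nullable(r[1]) else head

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

bc = ('cat', ('char', 'b'), ('char', 'c'))
re_abcd = ('cat', ('char', 'a'), ('cat', ('star', bc), ('char', 'd')))   # a (b c)* d
print(matches(re_abcd, 'abcbcd'))   # True
print(matches(re_abcd, 'abcd'))     # True
print(matches(re_abcd, 'abd'))      # False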
BTW - have a look around the rest of Dick Grune's site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.
"A play on regular expressions: functional pearl" takes an interesting approach. The implementation is given in Haskell, but it's been reimplemented in Python at least once.
The developed program is based on an old technique to turn regular expressions into finite automata which makes it efficient both in terms of worst-case time and space bounds and actual performance: despite its simplicity, the Haskell implementation can compete with a recently published professional C++ program for the same problem.
There's an interesting (if slightly short) chapter in Beautiful Code by Brian Kernighan, appropriately called "A Regular Expression Matcher". In it he discusses a simple matcher that can match literal characters, and the .^$* symbols.
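For reference, that matcher is short enough to carry around in your head; here is a rough Python transliteration of the same idea (the original in the book is C, and this sketch glosses over its pointer-level details):

def match(pattern, text):
    # Handles literal characters plus the metacharacters . ^ $ and *.
    if pattern.startswith('^'):
        return match_here(pattern[1:], text)
    while True:
        if match_here(pattern, text):
            return True
        if not text:
            return False
        text = text[1:]            # try the match starting one character later

def match_here(pattern, text):
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == '*':
        return match_star(pattern[0], pattern[2:], text)
    if pattern == '$':
        return not text
    if text and pattern[0] in ('.', text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c, pattern, text):
    while True:                    # zero or more occurrences of c
        if match_here(pattern, text):
            return True
        if not text or (text[0] != c and c != '.'):
            return False
        text = text[1:]

print(match("ab*c$", "xac"))   # True:  zero b's, anchored at the end of the text
print(match("ab*c$", "xacy"))  # False: '$' requires the match to end the string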
I do agree that writing a regex engine will improve understanding, but have you taken a look at ANTLR? It generates parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examples and running through the AST and parser that it generates. The generated code is really complicated, but you will get a good understanding of how a parser works.
