SymPy: Safely parsing strings

SymPy comes equipped with the nice sympify() function which can parse arbitrary strings into SymPy expressions. But it has two major drawbacks:
It is not safe, as it relies on the notorious eval()
It automatically simplifies the parsed expression; e.g., sympify('binomial(5,3)') will return the expression 10.
So my questions are:
First, is there a way to "just parse" the string, without any additional computations? I want to achieve something like this effect:
latex(parse('binomial(5,3)')) #returns '{\\binom{5}{3}}'
Second, is there an accepted way to sympify (i.e. parse and compute) arbitrary user-generated strings while remaining safe? It is done by SymPy Gamma, so it's possible in practice, but the question is how much dirty work is needed.

Look at the internal functions in the SymPy parsing module.
There is no official way to do it: sympify would need to be rewritten to avoid eval. Note that SymPy Gamma just uses sympify; it remains safe because it is sandboxed on Google App Engine.
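For the first question, note that SymPy's parser is also exposed as parse_expr in sympy.parsing.sympy_parser, which accepts an evaluate=False flag. A minimal sketch (hedged: exactly how much evaluate=False suppresses varies across SymPy versions, and parse_expr still calls eval() internally on a transformed AST, so it is not safe for untrusted input either):

import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

# Try to parse without evaluating; arithmetic is left unevaluated,
# but some function calls may still auto-simplify on older versions.
expr = parse_expr('binomial(5, 3)', evaluate=False)

# Constructing the function unevaluated by hand always works:
expr = sp.binomial(5, 3, evaluate=False)
print(sp.latex(expr))  # prints {\binom{5}{3}}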


Do “Clean Code”'s function argument number guidelines apply to API design?

I am a newbie reading Uncle Bob's Clean Code Book.
It is indeed good practice to keep the number of function arguments as small as possible. But I still come across many functions in popular libraries that take a bunch of arguments. For example, in Python's pandas there is a function with 9 arguments:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
(And this function also violates the advice about flag arguments)
Such cases seem much rarer in Python's standard library, but I still managed to find one with 4 arguments:
re.split(pattern, string, maxsplit=0, flags=0)
I understand that this is a suggestion rather than a silver bullet, but does it still apply to cases like the ones above?
Uncle Bob does not mention a hard limit on the number of arguments that would make your code smell, but I would consider 9 arguments too many.
Today's IDEs are much better at supporting code readability; nevertheless, refactoring stays tricky, especially with a large number of identically typed arguments.
The suggested solution is to encapsulate the arguments in a single struct/object (depending on your language). In the given case, this could be a GroupingStrategy:
strategy = GroupingStrategy()
strategy.by = "Foo"
strategy.axis = 0
strategy.sorted = True
DataFrame.groupby(strategy)
All not mentioned attributes will be assigned with the respective default values.
You could then also convert it to a fluent API:
DataFrame.groupby(GroupingStrategy.by("Foo").axis(0).sorted())
Or keep some of the arguments, if this feels better:
DataFrame.groupby("Foo", GroupingStrategy.default())
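In Python, such a parameter object could be a simple dataclass. A minimal sketch of the idea (GroupingStrategy is hypothetical, not an actual pandas API):

from dataclasses import dataclass
from typing import Any

@dataclass
class GroupingStrategy:
    # Unspecified attributes keep their defaults, mirroring
    # groupby's keyword defaults.
    by: Any = None
    axis: int = 0
    sort: bool = True
    dropna: bool = True

strategy = GroupingStrategy(by="Foo", sort=True)
# df.groupby(strategy)  # the hypothetical API accepting the object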
The first point to note is that all those arguments to groupby are relevant. You can reduce the number of arguments by having different versions of groupby but that doesn't help much when the arguments can be applied independently of each other, as is the case here. The same logic would apply to re.split.
It's true that integer "flag" arguments can be dodgy from a maintenance point of view: what happens if you want to change a flag value in your code? You have to hunt through and manually fix each case. The traditional approach is to use enums, which map numbers to words (e.g. a Day enum would have Day.Sun = 0, Day.Mon = 1, etc.). In compiled languages like C++ or C# this gives you the speed of integers under the hood with the readability of labels in your code. However, enum lookups in Python are comparatively slow.
One rule that I think applies to any source code is to avoid "magic numbers", i.e. numbers that appear directly in the source. The enum is one solution. Another is to use constant variables to represent the different flag settings. Python sort-of supports constants (uppercase variable names in a constant.py that you then import), but they are constant only by convention; you can still reassign them. :(
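A minimal sketch of both options (the Day enum mirrors the example above; MAXSPLIT_UNLIMITED is a made-up constant name):

import re
from enum import IntEnum

# Option 1: an IntEnum keeps integer compatibility but reads as words.
class Day(IntEnum):
    SUN = 0
    MON = 1

# Option 2: a module-level constant, constant by convention only.
MAXSPLIT_UNLIMITED = 0

print(Day.MON.value)  # 1
print(re.split(r',', 'a,b,c', maxsplit=MAXSPLIT_UNLIMITED))  # ['a', 'b', 'c']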

Combine multiple fractions into one using SymPy (Python)

I have a symbolic expression (say, var_1) which is a sum of around ten fractions, each being a complex combination of many parameters. var_1 arises after performing several operations on other expressions. In other words, var_1 is an output and not an input; I only have var_1, not its individual components.
By construction, I know that when var_1 is written as one fraction (using the least common multiple of the denominators), the numerator is zero. I am confirming this with the SymPy library for Python (I am actually using SymPy from Julia, but the operations are the same).
I am looking for a function that combines multiple fractions into one, but applied to a single variable like var_1. factor(var_1) doesn't work. It seems SymPy stops factoring due to the complexity of the expression var_1. Any idea?
In fact, there are many standard functions that add up fractions in SymPy. For example, simplify, fraction, ratsimp, etc. See the developers section for a complete list of functions (link here).
The problem in my case is that the expression I have is too complex to be handled by those functions, so they return the input unchanged. Yet my problem was solved by the function powsimp with the option force=true. That is, I called powsimp(var_1,force=true) in Julia (I think it is powsimp(var_1,force=True) in Python). I am not sure myself why that option works, but it does!
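For reference, the standard SymPy function for putting a sum of fractions over a common denominator is together (often followed by fraction to extract the numerator). A minimal sketch in Python, on a toy expression rather than the asker's var_1:

import sympy as sp

a, b, x = sp.symbols('a b x')
var_1 = a/x + b/(x + 1) - (a*(x + 1) + b*x)/(x*(x + 1))

combined = sp.together(var_1)            # one fraction over the common denominator
numerator, denominator = sp.fraction(combined)
print(sp.simplify(numerator))            # 0, confirming the sum collapses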

Generate python code from a sympy expression?

The Question:
Given a sympy expression, is there an easy way to generate python code (in the end I want a .py or perhaps a .pyc file)? I imagine this code would contain a function that is given any necessary inputs and returns the value of the expression.
Why
I find myself pretty frequently needing to generate python code to compute something that is nasty to derive, such as the Jacobian matrix of a nasty nonlinear function.
I can use sympy to derive the expression for the nonlinear thing I want: very good. What I then want is to generate python code from the resulting sympy expression, and save that python code to its own module. I've done this before, but I had to:
Call str(sympyResult)
Do custom things with regular expressions to get this string to look like valid python code
write this python code to a file
I note that sympy has code generation capabilities for several other languages, but not python. Is there an easy way to get python code out of sympy?
I know of several possible but problematic ways around this problem:
I know that I could just call evalf on the sympy expression and plug in the numbers I want. This has several unfortunate side effects:
dependency: my code now depends on sympy to run. This is bad.
efficiency: sympy now must run every time I numerically evaluate: even if I pickle and unpickle the expression, I still need evalf every time.
I also know that I could generate, say, C code and then wrap that code using a host of tools (python/C api, cython, weave, swig, etc...). This, however, means that my code now depends on there being an appropriate C compiler.
Edit: Summary
It seems that sympy.python, or possibly just str(expression), is what there is (see the answer from smichr and the comment from Oliver W.), and these work for simple scalar expressions.
That doesn't help much with things like Jacobians, but then it seems that sympy.printing.print_ccode chokes on matrices as well. I suppose code that could handle the printing of matrices to another language would have to assume matrix support in the destination language, which for python would probably mean reliance on the presence of things like numpy. It would be nice if such a way to generate numpy code existed, but it seems it does not.
If you don't mind having a SymPy dependency in your code itself, a better solution is to generate the SymPy expression in your code and use lambdify to evaluate it. This will be much faster than using evalf, especially if you use numpy.
You could also look at using the printer in sympy.printing.lambdarepr directly, which is what lambdify uses to convert an expression into a lambda function.
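A minimal sketch of the lambdify route, using the same Jacobian as the answer below (assuming numpy is installed):

import sympy as sp

x, y = sp.symbols('x y')
J = sp.Matrix([[x**2, sp.exp(y) + x]]).jacobian([x, y])

# lambdify compiles the symbolic matrix into a numpy-backed function,
# so repeated numerical evaluation no longer goes through evalf.
jac = sp.lambdify((x, y), J, modules='numpy')
print(jac(2.0, 0.0))  # [[4. 0.] [1. 1.]]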
The function you are looking for to generate python code is sympy.python. Although it generates python code, that code will need some tweaking to remove its dependence on SymPy objects, as Oliver W pointed out.
>>> import sympy as sp
>>> x = sp.Symbol('x')
>>> y = sp.Symbol('y')
>>> print(sp.python(sp.Matrix([[x**2,sp.exp(y) + x]]).jacobian([x, y])))
x = Symbol('x')
y = Symbol('y')
e = MutableDenseMatrix([[2*x, 0], [1, exp(y)]])

alternative to sympy's simplify function

I've been using the simplify function in Sympy to simplify some long complicated equations, but it's not proving sufficient, as it frequently does not simplify things as much as possible, giving my program numerical errors when it comes to solving the equations.
Does anyone know of any other symbolic engines with a simplify function that can be used instead?
Many thanks.
Maybe you could use Python's subprocess module to run Maxima on behalf of your Python program? This is what maxima-mode in Emacs does, so just do something similar: start Maxima, keep file handles to its input/output, feed it equations and let it mangle them to your desire (Maxima has lots of equation-transforming functions), and read back the result from the output file handle.
Sympy vs. Maxima
pyMaxima
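A minimal sketch of that subprocess approach (assuming a maxima binary on your PATH; ratsimp is one of Maxima's many simplification functions):

import subprocess

def maxima_simplify(expr: str) -> str:
    # --very-quiet suppresses Maxima's banner and prompts, so stdout
    # contains only the result of the command fed on stdin.
    result = subprocess.run(
        ['maxima', '--very-quiet'],
        input=f'ratsimp({expr});',
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(maxima_simplify('(x^2 - 1)/(x - 1)'))  # x+1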

Regular expression implementation details

A question that I answered got me wondering:
How are regular expressions implemented in Python? What sort of efficiency guarantees are there? Is the implementation "standard", or is it subject to change?
I thought that regular expressions would be implemented as DFAs, and therefore were very efficient (requiring at most one scan of the input string). Laurence Gonsalves raised an interesting point that not all Python regular expressions are regular. (His example is r"(a+)b\1", which matches some number of a's, a b, and then the same number of a's as before). This clearly cannot be implemented with a DFA.
So, to reiterate: what are the implementation details and guarantees of Python regular expressions?
It would also be nice if someone could give some sort of explanation (in light of the implementation) as to why the regular expressions "cat|catdog" and "catdog|cat" lead to different search results in the string "catdog", as mentioned in the question that I referenced before.
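For concreteness, here is the backreference pattern from above in action:

import re

# (a+) captures a run of a's; \1 demands the very same run again,
# which a DFA cannot express, since it would have to count arbitrarily high.
print(bool(re.fullmatch(r'(a+)b\1', 'aaabaaa')))  # True
print(bool(re.fullmatch(r'(a+)b\1', 'aaabaa')))   # False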
Python's re module was originally based on PCRE, but has since moved to its own implementation (the SRE engine).
Here is the link to the C code.
It appears that the library uses recursive backtracking: when a chosen path fails to match, the engine backs up and tries the next alternative.
[Graph from the linked article: time for a?ⁿaⁿ to match aⁿ, as a function of text size n; backtracking implementations blow up exponentially. Keep in mind that this graph is not representative of typical regex searches.]
http://swtch.com/~rsc/regexp/regexp1.html
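That backtracking behavior also answers the alternation question above: the engine tries alternatives left to right and commits to the first one that lets the overall match succeed at the current position. A quick demonstration:

import re

# With a backtracking engine, alternation is ordered, not longest-match.
print(re.search(r'cat|catdog', 'catdog').group())  # cat
print(re.search(r'catdog|cat', 'catdog').group())  # catdog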
There are no "efficiency guarantees" on Python REs any more than on any other part of the language (C++'s standard library is the only widespread language standard I know that tries to establish such standards -- but there are no standards, even in C++, specifying that, say, multiplying two ints must take constant time, or anything like that); nor is there any guarantee that big optimizations won't be applied at any time.
Today, F. Lundh (originally responsible for implementing Python's current RE module, etc), presenting Unladen Swallow at Pycon Italia, mentioned that one of the avenues they'll be exploring is to compile regular expressions directly to LLVM intermediate code (rather than their own bytecode flavor to be interpreted by an ad-hoc runtime) -- since ordinary Python code is also getting compiled to LLVM (in a soon-forthcoming release of Unladen Swallow), a RE and its surrounding Python code could then be optimized together, even in quite aggressive ways sometimes. I doubt anything like that will be anywhere close to "production-ready" very soon, though;-).
Matching regular expressions with backreferences is NP-hard, i.e. at least as hard as any NP-complete problem. That basically means it's as hard as any problem you're likely to encounter, and most computer scientists believe it requires exponential time in the worst case. If you could match such "regular" expressions (which really aren't regular, in the technical sense) in polynomial time, you would have shown P = NP and could win a million bucks.
