Can I improve python runtime by compiling?

Can I improve python runtime by compiling? - python

I'm writing a small toy simulation in python. Granted, this simulations are slow. To my understanding, the major reason that python codes are slow is the fact that python is in interpreted language. I don't want to give up python since the clear syntax and the available library cut the writing time significantly. So is there a simple way for me to "compile" my python code?
Edit
I answer some questions:
Yes, I'm using numpy. It greatly simplify the code and I don't think I can improve performance writing the functions on my own. I use numpy for all my lists and and I add all of the beads together. Namely. I invoke
pos += V*dt + forces*0.5*dt**2
where ''pos'', 'V', and 'forces' are all np.array of (2000,3) dimensions.
I'm quite certain that the slow part in the forces calculation. This is logical as I have to iterate over all my particles and check their position. For my real project (Ph.D. stuff) I have code of about roughly the same level of complexity, and I know that this is the expensive stuff.

If none of the solutions in the comment suffice, you can also take a look at cython.
For a quick tutorial & example check:
http://docs.cython.org/src/tutorial/cython_tutorial.html
Used at the correct spots (e.g. around frequently called functions) it can easily speed things up by a factor of 10 - 100.

Python is a slightly odd language in that it is both interpreted and compiled. Well sort of. When you run it is compiled to ".pyc" bytecode - so we can quickly get bogged down in semantic details here. Hell I don't even know if what I just said is strictly accurate. But at the end of the day you want to speed things up so...
First, use the profiler and timeit to work out where all the time is going
Second, rewrite your pure python code to improve the slow bits you've discovered
Third, see how it goes when optimised
Now, depends on your scenario, but seriously think "Can I run it on a bigger CPU/memory"
Ok, try rewriting those slow sections in C++
Screw it, write it all in C++
If you get so far as the last option I dare say you're screwed and the savings aren't going to be significant.

Related

Python with Numpy/Scipy vs. Pure C++ for Big Data Analysis [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
Doing Python on relatively small projects makes me appreciate the dynamically typed nature of this language (no need for declaration code to keep track of types), which often makes for a quicker and less painful development process along the way. However, I feel that in much larger projects this may actually be a hindrance, as the code would run slower than say, its equivalent in C++. But then again, using Numpy and/or Scipy with Python may get your code to run just as fast as a native C++ program (where the code in C++ would sometimes take longer to develop).
I post this question after reading Justin Peel's comment on the thread "Is Python faster and lighter than C++?" where he states:
"Also, people who speak of Python being slow for serious number crunching haven't used the Numpy and Scipy modules. Python is really taking off in scientific computing these days. Of course, the speed comes from using modules written in C or libraries written in Fortran, but that's the beauty of a scripting language in my opinion." Or as S. Lott writes on the same thread regarding Python: "...Since it manages memory for me, I don't have to do any memory management, saving hours of chasing down core leaks."
I also inspected a Python/Numpy/C++ related performance question on "Benchmarking (python vs. c++ using BLAS) and (numpy)" where J.F. Sebastian writes "...There is no difference between C++ and numpy on my machine."
Both of these threads got me to wondering whether there is any real advantage conferred to knowing C++ for a Python programmer that uses Numpy/Scipy for producing software to analyze 'big data' where performance is obviously of great importance (but also code readability and development speed are a must)?
Note: I'm especially interested in handling huge text files. Text files on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long.

First off, if the bulk of your "work" comes from processing huge text files, that often means that your only meaningful speed bottleneck is your disk I/O speed, regardless of programming language.
As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.
Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.

The short answer is that for simple problems, then there should not be much difference. If you want to do anything complicated, then you will quickly run into stark performance differences.
As a simple example, try adding three vectors together
a = b + c + d
In python, as I understand it, this generally adds b to c, adds the result to d, and then make a point to that final result. Each of those operations can be fast since they are just farmed out to a BLAS library. However, if the vectors are large, then the intermediate result can not be stored in cache. Moving that intermediate result to main memory is slow.
You can do the same thing in C++ using valarray and it will be equivalently slow. However, you can also do something else
for(int i=0; i<N; ++i)
a[i] = b[i] + c[i] + d[i]
This gets rid of the intermediate result and makes the code less sensitive to speed to main memory.
Doing the equivalent thing in python is possible, but python's looping constructs are not as efficient. They do nice things like bounds checks, but sometimes it is faster to run with the safeties disengaged. Java, for example, does a fair amount of work to remove bounds checks. So if you had a sufficiently smart compiler/JIT, python's loops could be fast. In practice, that has not worked out.

Python will definitely save your development time, it provides you flexibility as well if you are just comparing two languages here, though it still can't match the power and performance of C/C++ but who cares in this age of high memory, clusters, caching and parallel processing techniques? Another disadvantage with C++ can be the possible crashes and then debugging and fixing with big data can be a nightmare.
But having said that I have not seen a place where there is one size fit all solution is available , No programming language contains solutions to every problem, (unless you are an old native C developer who like to build the database in C as well :) you have to first identify all the problems, requirements, type of data, whether it is structured or non structured, what sort of text files you need to manipulate in what way and order, is scheduling an issue and so on... Then you need to build a complete stack of applications with some tool sets and scripting languages. Like you can always put more money in hardware or even buy some expensive tool like Ab Initio which give you power of loading and parsing those large text files and manipulate over the data unless you don't need real high end pattern matching capabilities on really biggg data files, python would be just fine with a conjunction of other tools. But i don't see a single yes/no answer, in certain situations, python may not be the best solution.

Optimizing for PyPy

(This is a follow-up to Statistical profiler for PyPy)
I'm running some Python code under PyPy and would like to optimize it.
In Python, I would use statprof or lineprofiler to know which exact lines are causing the slowdown and try to work around them. In PyPy though, both of the tools don't really report sensible results as PyPy might optimize away some lines. I would also prefer not to use cProfile as I find it very difficult to distil which part of the reported function is the bottleneck.
Does anyone have some tips on how to proceed? Perhaps another profiler which works nicely under PyPy? In general, how does one go about optimizing Python code for PyPy?

If you understand the way the PyPy architecture works, you'll realize that trying to pinpoint individual lines of code isn't really productive. You start with a Python interpreter written in RPython, which then gets run through a tracing JIT which generates flow graphs and then transforms those graphs to optimize the RPython interpreter. What this means is the layout of your Python code being run by the RPython interpreter being JIT'ed may have very different structure than the optimized assembler actually be run. Furthermore, keep in mind that the JIT always works on a loop or a function, so getting line-by-line stats is not as meaningful. Consequently, I think cProfile may really be a good option for you since it will give you an idea of where to concentrate your optimization. Once you know which functions are your bottlenecks, you can spend your optimization efforts targeting those slower functions, rather than trying to fix a single line of Python code.
Keep in mind as you do this that PyPy has very different performance characteristics than cPython. Always try to write code in as simple a way as possible (that doesn't mean as few lines as possible btw). There are a few other heuristics that help such as using specialized lists, using objects over dicts when you have a small number of mostly constant keys, avoiding C extensions using the C Python API, etc.
If you really, really insist on trying to optimize at the line level. There are a few options. One is called JitViewer (https://foss.heptapod.net/pypy/jitviewer), which will let you have a very low level view of what the JIT is doing to your code. For instance, you can even see the assembler instructions which correspond to a Python loop. Using that tool, you can really get a sense for just how fast PyPy will behave with certain parts of the code, as you can now do silly things like count the number of assembler instructions used for your loop or something.

FSharp runs my algorithm slower than Python

Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699

Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The problem is simply that the program has been optimized for Python. This is common when the programmer is more familiar with one language than the other, of course. You just have to learn a different set of rules that dictate how F# programs should be optimized...
Several things jumped out at me such as the use of a "for i in 1..n do" loop rather than a "for i=1 to n do" loop (which is faster in general but not significant here), repeatedly doing List.mapi on a list to mimic an array index (which allocated intermediate lists unnecessarily) and your use of the F# TryGetValue for Dictionary which allocates unnecessarily (the .NET TryGetValue that accepts a ref is faster in general but not so much here)
... but the real killer problem turned out to be your use of a hash table to implement a dense 2D matrix. Using a hash table is ideal in Python because its hash table implementation has been extremely well optimized (as evidenced by the fact that your Python code is running as fast as F# compiled to native code!) but arrays are a much better way to represent dense matrices, particularly when you want a default value of zero.
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
Moral of the story:
F# Dictionary needs work... when using tuples as keys, compiled F# is slower than interpreted Python's hash tables!
Obvious, but no harm in repeating: Cleaner code sometimes means... much slower code.
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing Dictionary with Array makes F# finally run at the speeds a compiled language is expected to run, doesn't negate the need for a fix in Dictionary's speed (I hope F# people from MS are reading this). Other algorithms depend on dictionaries/hashes, and can't be easily switched to using arrays; making programs suffer "interpreter-speeds" whenever one uses a Dictionary, is arguably, a bug. If, as some have said in the comments, the problem is not with F# but with .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that) is to change this:
let optimalResults = new Dictionary<_,_>()
into this:
let optimalResults = new Dictionary<_,_>(HashIdentity.Structural)
This change makes the F# code run 2.7x times faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a "computed goto" that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.

As Jon Harrop has pointed out, simply constructing the dictionaries using Dictionary(HashIdentity.Structural) gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.

Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interessant discussion. My code partially fixed the performance issue, but this is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means, the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow, I believe there's no significant difference with Python or other languages implementations. Also, I don't agree with the "You can't 1-to-1 translate code expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to imagine that both languages should be almost as fast.
I think the performance issue is related to gabage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in a SO question ( Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and a MSDN page (Building tuples):
If they are reference types, this
means there can be lots of garbage
generated if you are changing elements
in a tuple in a tight loop. [...]
F# tuples were reference types, but
there was a feeling from the team that
they could realize a performance
improvement if two, and perhaps three,
element tuples were value types
instead. Some teams that had created
internal tuples had used value instead
of reference types, because their
scenarios were very sensitive to
creating lots of managed objects.
Of course, as Jon said in another comment, the obvious optimization in your example is to replace hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, Python code is using hashtables, not arrays).
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
type Tup = {x: int; y: int}
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i,x)) fileSizes out of the enclosing loop. I believe Python enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or use Seq module, or use a mutable counter).

Hmm.. if the hashtable is the major bottleneck, then it is properly the hash function itself. Havn't look at the specific hash function but For one of the most common hash functions namely
((a * x + b) % p) % q
The modulus operation % is painfully slow, if p and q is of the form 2^k - 1, we can do modulus with an and, add and a shift operation.
Dietzfelbingers universal hash function h_a : [2^w] -> [2^l]
lowerbound(((a * x) % 2^w)/2^(w-l))
Where is a random odd seed of w-bit.
It can be computed by (a*x) >> (w-l), which is magnitudes of speed faster than the first hash function. I had to implement a hash table with linked list as collision handling. It took 10 minutes to implement and test, we had to test it with both functions, and analyse the differens of speed. The second hash function had as I remember around 4-10 times of speed gain dependend on the size of the table.
But the thing to learn here is if your programs bottleneck is hashtable lookup the hash function has to be fast too

I need to speed up a function. Should I use cython, ctypes, or something else?

I'm having a lot of fun learning Python by writing a genetic programming type of application.
I've had some great advice from Torsten Marek, Paul Hankin and Alex Martelli on this site.
The program has 4 main functions:
generate (randomly) an expression tree.
evaluate the fitness of the tree
crossbreed
mutate
As all of generate, crossbreed and mutate call 'evaluate the fitness'. it is the busiest function and is the primary bottleneck speedwise.
As is the nature of genetic algorithms, it has to search an immense solution space so the faster the better. I want to speed up each of these functions. I'll start with the fitness evaluator. My question is what is the best way to do this. I've been looking into cython, ctypes and 'linking and embedding'. They are all new to me and quite beyond me at the moment but I look forward to learning one and eventually all of them.
The 'fitness function' needs to compare the value of the expression tree to the value of the target expression. So it will consist of a postfix evaluator which will read the tree in a postfix order. I have all the code in python.
I need advice on which I should learn and use now: cython, ctypes or linking and embedding.
Thank you.

Ignore everyone elses' answer for now. The first thing you should learn to use is the profiler. Python comes with a profile/cProfile; you should learn how to read the results and analyze where the real bottlenecks is. The goal of optimization is three-fold: reduce the time spent on each call, reduce the number of calls to be made, and reduce memory usage to reduce disk thrashing.
The first goal is relatively easy. The profiler will show you the most time-consuming functions and you can go straight to that function to optimize it.
The second and third goal is harder since this means you need to change the algorithm to reduce the need to make so much calls. Find the functions that have high number of calls and try to find ways to reduce the need to call them. Utilize the built-in collections, they're very well optimized.
If you're doing a lot of number and array processing, you should take a look at pandas, Numpy/Scipy, gmpy third party modules; they're well optimised C libraries for processing arrays/tabular data.
Another thing you want to try is PyPy. PyPy can JIT recompile and do much more advanced optimisation than CPython, and it'll work without the need to change your python code. Though well optimised code targeting CPython can look quite different from well optimised code targeting PyPy.
Next to try is Cython. Cython is a slightly different language than Python, in fact Cython is actually best described as C with typed Python-like syntax.
For parts of your code that is in very tight loops that you can no longer optimize using any other ways, you may want to rewrite it as C extension. Python has a very good support for extending with C. In PyPy, the best way to extend PyPy is with cffi.

Cython is the quickest to get the job done, either by writing your algorithm directly in Cython, or by writing it in C and bind it to python with Cython.
My advice: learn Cython.

Another great option is boost::python which lets you easily wrap C or C++.
Of these possibilities though, since you have python code already written, cython is probably a good thing to try first. Perhaps you won't have to rewrite any code to get a speedup.

Try to work your fitness function so that it will support memoization. This will replace all calls that are duplicates of previous calls with a quick dict lookup.

How much of NumPy and SciPy is in C?

Are parts of NumPy and/or SciPy programmed in C/C++?
And how does the overhead of calling C from Python compare to the overhead of calling C from Java and/or C#?
I'm just wondering if Python is a better option than Java or C# for scientific apps.
If I look at the shootouts, Python loses by a huge margin. But I guess this is because they don't use 3rd-party libraries in those benchmarks.

I would question any benchmark which doesn't show the source for each implementation (or did I miss something)? It's entirely possible that either or both of those solutions are coded badly which would result in an unfair appraisal of either or both language's performance. [Edit] Oops, now I see the source. As others have pointed out though, it's not using the NumPy/SciPy libraries so those benchmarks are not going to help you make a decision.
I believe the vast majority of NumPy and SciPy is written in C and wrapped in Python for ease of use.
It probably depends what you're doing in any of those languages as to how much overhead there is for a particular application.
I've used Python for data processing and analysis for a couple of years now so I would say it's certainly fit for purpose.
What are you trying to achieve at the end of the day? If you want a fast way to develop readable code, Python is an excellent option and certainly fast enough for a first stab at whatever it is you're trying to solve.
Why not have a bash at each for a small subset of your problem and benchmark the results in terms of development time and run time? Then you can make an objective decision based on some relevant data ...or at least that's what I'd do :-)

There is a better comparison here (not a benchmark but shows ways of speeding up Python). NumPy is mostly written in C. The main advantage of Python is that there are a number of ways of very easily extending your code with C (ctypes, swig,f2py) / C++ (boost.python, weave.inline, weave.blitz) / Fortran (f2py) - or even just by adding type annotations to Python so it can be processed to C (cython). I don't think there are many things comparably easy for C# or Java - at least that so seemlessly handle passing numerical arrays of different types (although I guess proponents would argue since they don't have the performance penalty of Python there is less need to).

A lot of it is written in C or fortran. You can re-write the hot loops in C (or use one of the gazillion ways to speed python up, boost/weave is my favorite), but does it really matter?
Your scientific app will be run once. The rest is just debugging and development, and those can be much quicker on Python.

Most of NumPy is in C, but a large portion of the C code is "boilerplate" to handle all the dirty details of the Python/C interface. I think the ratio C vs. Python is around 50/50 ATM for NumPy.
I am not too familiar with vm-based low-level details, but I believe the interface cost would be higher because of the restrictions put on the jvm and the .clr. One of the reason why numpy is often faster than similar environments is the memory representation and how arrays are shared/passed between functions. Whereas most environments (Matlab and R as well I believe) use Copy-On-Write to pass arrays between functions, NumPy use references. But doing so in e.g. the JVM would be hard (because of restrictions on how to use pointer, etc...). It is doable (an early port of NumPy for Jython exists), but I don't know how they solve this issue. Maybe C++/Cli would make this easier, but I have zero experience with that environment.

It always depends on your own capability to handle the langue, so the language is able to generate fast code. Out of my experience, numpy is several times slower then good .NET implementations. And I expect JAVA to be similar fast. Their optimizing JIT compilers have improved significantly over the years and produce very efficient instructions.
numpy on the other hand comes with a syntax wich is easier to use for those, which are attuned to scripting languages. But if it comes to application development, those advantages often turn to obstacles and you will yearn for typesafety and enterprise IDEs. Also, the syntactic gap is already closing with C#. A growing number of scientific libraries exist for Java and .NET.Personally I tend towards C#, bacause it provides better syntax for multidimensional arrays and somehow feels more 'modern'. But of course, this is only my personal experience.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.