Speeding up loop with fsolve using PyPy or Cython? - python

I have a Python script containing a loop with a lot of calls to scipy.optimize.fsolve (99 (55 + 54) times per time step, and right now I need around 10^5 time steps). The rest of the script isn't very fast either, but as far as I can tell from the output of the Spyder Profiler, the calls to fsolve are by far the most time consuming. With my current settings, running the script takes over 10 hours, so I could do with a little speed-up.
Based on what I read here and elsewhere on the internet, I already gave PyPy a first try: installed it in a separate environment under conda (MacOS 10.15.5, pypy 7.3.1 and pypy3.6 7.3.1), along with its own versions of numpy, scipy, and pandas, but so far it's actually a bit slower than just Python (195 s vs 171 s, for 100 time steps).
From what I read here (PyPy Status Blog, October '17), this might have to do with using numpy instead of numpypy, and/or repeated allocation of temporary arrays. Apart from calling fsolve 10 million+ times, I'm also using quite a lot of numpy arrays, so that makes sense as far as I can see.
The thing is, I'm not a developer, and I'm completely new to PyPy, so terms like JIT traces don't mean much to me, and deciphering what's in them is likely going to be challenging for me. Moreover, what used to be true in October 2017 may no longer be relevant now. Also, even if I'd manage to speed up the numpy array bit, I'm still not sure about the fsolve part.
Can anyone indicate whether it could be worthwhile for me to invest time in PyPy? Or would Cython be more suitable in this case? Or even mpi4py?
I'd happily share my code if that helps, but it includes a module of over 800 lines of code, so just including it in this post didn't seem like a great idea to me.
Many thanks!
Sita
EDIT: Thanks everyone for your quick and kind responses! That's a fair point about needing to see my code; I put it here (link valid until 19 June 2020). Arterial_1D.py is the module, CoronaryTree.py is the script that calls Arterial_1D.py. For a minimal working example, I put in one extra line, to be uncommented in that case (clearly marked in the code). Also, I set the number of time steps to 100, to have the code run in reasonable time (0.61 s for the minimal example, 37.3 s for the full coronary tree, in my case).
EDIT 2: Silly of me, in my original post I mentioned times of 197 and 171 s for running 100 steps of my code using PyPy and Python, respectively, but in that case I called Python from within the PyPy environment, so it was using the PyPy version of Numpy. From within my base environment running 100 steps takes a little over 30 s. So PyPy is A LOT slower than Python in this case, which motivates me to look into this PyPy Status Blog post anyway.

We can't really help you optimize without looking at your code. But since you have given quite a thorough description up there, let me reply with what I think you can try to speed things up.
First things first: the Scipy library.
From the source for scipy.optimize.fsolve, it wraps MINPACK's hybrd and hybrj algorithms, which are already fast FORTRAN subroutines. So in your case, switching to PyPy is not going to do much good, if any at all, for this identified bottleneck.
What can you do to optimize your Scipy fsolve calls? One of the most obvious numerical things to do is to vectorize your function's args. But it seems that you are running a time-stepping algorithm, and most standard time-stepping algorithms cannot be vectorized in time. If your 'XX times per time step' is a sort of implicit spatial loop per time step (i.e. over your grid), you can consider vectorizing that to achieve some gains in speed. Next, zoom in on your function's initial guess / starting root estimate. See if you can modify your algorithm to capitalize on a good starting solution over the whole time interval (do some literature digging). Note that this has less to do with 'programming' than with your knowledge of numerical methods.
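As a concrete illustration of the warm-start idea, here is a minimal sketch (the residual function and its arguments are hypothetical stand-ins, not taken from your code) that reuses the converged solution of the previous time step as the initial guess for the next fsolve call:

import numpy as np
from scipy.optimize import fsolve

def residual(x, t):
    # hypothetical two-equation nonlinear system; stands in for your real residual
    return [x[0]**2 - (1.0 + t), x[1]**3 + x[0] - 2.0]

n_steps = 100
dt = 0.01
x_guess = np.array([1.0, 1.0])   # initial guess for the very first step

for step in range(n_steps):
    t = step * dt
    # warm start: the previous step's root is usually an excellent guess for the
    # next one, so fsolve converges in fewer (expensive) iterations
    x_guess = fsolve(residual, x_guess, args=(t,))

Whether this pays off depends on how smoothly your solution varies between time steps, but it costs almost nothing to try.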
Next, on your comment that the "rest of the script isn't very fast either": I'd go with Cython to squeeze the remaining Python parts of your code, especially the loops. It is very actively developed, has a great community, and is battle tested. I have personally used it on many HPC-type problems. Cython also has a convenient HTML annotation mode that highlights where your native Python implementation could still be optimized.
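For reference, a minimal build sketch (the module name arterial_fast.pyx is just a placeholder for whatever part of your module you move into Cython); annotate=True produces the HTML report mentioned above:

# setup.py -- minimal sketch; the module name is a placeholder
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "arterial_fast.pyx",   # hypothetical Cython module holding your hot loops
        annotate=True,         # writes arterial_fast.html marking lines that still touch Python objects
    )
)

Build it with python setup.py build_ext --inplace and open the generated HTML: the yellower a line, the more it still interacts with the Python interpreter.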
Hope this helps! Cheers

Related

Bytecode instruction cost

Is it possible to know a "cost" in some measure (seconds, CPU ticks, a logarithmic scale, anything) for each instruction? Or at least for some instructions, skipping something like SLICE. There is a description at https://docs.python.org/3/library/dis.html. There is the source code: https://hg.python.org/cpython/file/tip/Python/ceval.c#l1199. I guess it is possible to estimate how many resources each instruction will consume by analyzing the source code, but I doubt that this can be done by a noob like me. Maybe somebody has already done this? Of course there is plenty of high-level advice about optimization and over-optimization, but maybe such a measure would help beginners understand bytecode better without digging into the C sources?
Edit: the actual question is not about profiling or debugging code - I know about various profiling methods - the question is specifically about bytecode. I have in mind CPU instructions, which have a cost measure: cycles per instruction.
The typical way to measure the performance of "small" pieces of code is timeit. For larger things, we usually use cProfile instead.
These techniques will measure the code by actually running it and seeing how long it takes to execute, so they will not produce totally deterministic answers. If you're looking for something more theoretical, you may need to look at the disassembly and the CPython source code, and reason about how fast it is. You may find the source easier to read if you consult the Python/C API docs, since a lot of CPython source is ultimately calling those same functions.
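For example, here is a minimal sketch combining the two approaches: timeit measures the empirical cost of two (arbitrary) snippets, while dis shows which bytecode instructions each one executes, even though dis cannot assign a cost to an individual instruction:

import dis
import timeit

def concat_plus(items):
    # repeated string concatenation inside the interpreted loop
    s = ""
    for item in items:
        s = s + item
    return s

def concat_join(items):
    # a single method call that does the work in C
    return "".join(items)

data = ["x"] * 1000

print(timeit.timeit(lambda: concat_plus(data), number=1000))  # empirical: wall-clock time
print(timeit.timeit(lambda: concat_join(data), number=1000))

dis.dis(concat_plus)  # theoretical: which instructions run, but not what each one costs
dis.dis(concat_join)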
As mentioned by Kevin, you can use cProfile. You can also see http://lanyrd.com/2013/pycon/scdywg/
For line-by-line profiling you can use https://github.com/rkern/line_profiler, which I found out about here.

Can I improve python runtime by compiling?

I'm writing a small toy simulation in Python. Granted, these simulations are slow. To my understanding, the major reason Python code is slow is the fact that Python is an interpreted language. I don't want to give up Python, since the clear syntax and the available libraries cut the writing time significantly. So is there a simple way for me to "compile" my Python code?
Edit
To answer some questions:
Yes, I'm using numpy. It greatly simplifies the code and I don't think I can improve performance by writing the functions on my own. I use numpy for all my lists, and I add all of the beads together. Namely, I invoke
pos += V*dt + forces*0.5*dt**2
where 'pos', 'V', and 'forces' are all np.arrays of shape (2000, 3).
I'm quite certain that the slow part is the forces calculation. This is logical, as I have to iterate over all my particles and check their positions. For my real project (Ph.D. stuff) I have code of roughly the same level of complexity, and I know that this is the expensive part.
If none of the solutions in the comments suffice, you can also take a look at Cython.
For a quick tutorial & example check:
http://docs.cython.org/src/tutorial/cython_tutorial.html
Used at the correct spots (e.g. around frequently called functions) it can easily speed things up by a factor of 10 - 100.
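For the force calculation specifically, it is often worth checking whether NumPy broadcasting can remove the per-particle Python loop before reaching for Cython. A rough sketch for an inverse-square-style pairwise force (the force law and the softening constant are illustrative placeholders, not your actual model):

import numpy as np

def pairwise_forces(pos, strength=1.0, eps=1e-9):
    # pos: (N, 3) positions; returns (N, 3) net forces, computed entirely in precompiled numpy code
    diff = pos[:, np.newaxis, :] - pos[np.newaxis, :, :]   # (N, N, 3) pairwise separations
    dist2 = np.sum(diff**2, axis=-1) + eps                 # (N, N) softened squared distances
    np.fill_diagonal(dist2, np.inf)                        # exclude self-interaction
    inv_dist3 = dist2**-1.5                                # 1/r^3, so diff/r^3 has magnitude 1/r^2
    return strength * np.sum(diff * inv_dist3[:, :, np.newaxis], axis=1)

pos = np.random.rand(2000, 3)
forces = pairwise_forces(pos)   # shape (2000, 3), no explicit Python loop

Note that the intermediate (N, N, 3) arrays cost roughly 100 MB at N = 2000, so for much larger N a chunked version or a typed Cython loop may be the better trade-off.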
Python is a slightly odd language in that it is both interpreted and compiled. Well, sort of. When you run it, it is compiled to ".pyc" bytecode - so we can quickly get bogged down in semantic details here. Hell, I don't even know if what I just said is strictly accurate. But at the end of the day you want to speed things up, so...
First, use the profiler and timeit to work out where all the time is going
Second, rewrite your pure python code to improve the slow bits you've discovered
Third, see how it goes when optimised
Now, depending on your scenario, seriously think "Can I run it on a bigger CPU / with more memory?"
Ok, try rewriting those slow sections in C++
Screw it, write it all in C++
If you get so far as the last option I dare say you're screwed and the savings aren't going to be significant.

Python Multiprocessing/EM

I did a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model1 for doing machine translation ( here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python Multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) where you perform a numerical computation one element at a time, you pay for that one computation - plus all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you have vectorized your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples on how to vectorize Python, as well as some of the alternative options.
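As a toy illustration of what pushing work into numpy looks like (the arrays here are made up):

import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# slow: one interpreted loop iteration (plus float boxing) per element
total = 0.0
for a, b in zip(x, y):
    total += (a - b) ** 2

# fast: the same reduction done in a handful of precompiled numpy calls
total_vec = np.sum((x - y) ** 2)

The two results agree up to floating-point rounding, but the second form spends almost no time in the interpreter.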
EDIT: Let me clarify that parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have gained nothing, because the fastest single-threaded version of your code will still outperform your parallel code.
Also, Python really isn't a great language to learn how to write parallel code in. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there were but one CPU core. This means bizarre hacks (such as the one you linked) must be used, which have their own additional drawbacks and issues (there are times when such tricks are needed / used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing parallel Python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, i.e. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least squares problems, where the size depends on the number of parameters in the model being solved for.
Your bottleneck may be in computing the residuals or distances between the data points and the models in stage 1, the E-step. In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map/reduce or some other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well.
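As a concrete, hedged sketch of that idea using only the standard library: the residual function below is a made-up placeholder, and the parallel payoff only materialises if computing a residual is expensive enough to amortise the process and pickling overhead.

import numpy as np
from multiprocessing import Pool

def residual(data_point, model):
    # hypothetical distance between one data point and one model's parameters
    return float(np.sum((data_point - model) ** 2))

def e_step_row(args):
    # residuals of a single data point against every model: one independent unit of work
    data_point, models = args
    return [residual(data_point, m) for m in models]

if __name__ == "__main__":
    data = np.random.rand(1000, 3)    # M data points (placeholder values)
    models = np.random.rand(5, 3)     # N models (placeholder values)

    with Pool() as pool:
        # each data point is independent, so the E-step maps cleanly onto a process pool
        distances = np.array(pool.map(e_step_row, [(d, models) for d in data]))

    print(distances.shape)            # (M, N) matrix of residuals for the M-step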

Optimizing for PyPy

(This is a follow-up to Statistical profiler for PyPy)
I'm running some Python code under PyPy and would like to optimize it.
In Python, I would use statprof or lineprofiler to know which exact lines are causing the slowdown and try to work around them. In PyPy though, neither of those tools really reports sensible results, as PyPy might optimize away some lines. I would also prefer not to use cProfile, as I find it very difficult to distil which part of the reported function is the bottleneck.
Does anyone have some tips on how to proceed? Perhaps another profiler which works nicely under PyPy? In general, how does one go about optimizing Python code for PyPy?
If you understand the way the PyPy architecture works, you'll realize that trying to pinpoint individual lines of code isn't really productive. You start with a Python interpreter written in RPython, which then gets run through a tracing JIT which generates flow graphs and then transforms those graphs to optimize the RPython interpreter. What this means is that the layout of the Python code being run by the JIT'ed RPython interpreter may have a very different structure than the optimized assembler that actually runs. Furthermore, keep in mind that the JIT always works on a loop or a function, so getting line-by-line stats is not as meaningful. Consequently, I think cProfile may really be a good option for you, since it will give you an idea of where to concentrate your optimization. Once you know which functions are your bottlenecks, you can spend your optimization efforts targeting those slower functions, rather than trying to fix a single line of Python code.
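For what it's worth, cProfile is available under PyPy as well; a minimal sketch of getting that function-level picture (the main() function below is a placeholder for your own entry point):

import cProfile
import pstats

def main():
    # placeholder for your real entry point
    sum(i * i for i in range(10**6))

cProfile.run("main()", "profile.out")           # run the entry point under the profiler
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)  # 20 most expensive call paths by cumulative time

Keep in mind that profiling itself perturbs what the JIT sees, so treat the numbers as a guide to which functions matter rather than as exact timings.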
Keep in mind as you do this that PyPy has very different performance characteristics than cPython. Always try to write code in as simple a way as possible (that doesn't mean as few lines as possible btw). There are a few other heuristics that help such as using specialized lists, using objects over dicts when you have a small number of mostly constant keys, avoiding C extensions using the C Python API, etc.
If you really, really insist on trying to optimize at the line level, there are a few options. One is called JitViewer (https://foss.heptapod.net/pypy/jitviewer), which will let you have a very low level view of what the JIT is doing to your code. For instance, you can even see the assembler instructions which correspond to a Python loop. Using that tool, you can really get a sense for just how fast PyPy will behave with certain parts of the code, as you can now do silly things like count the number of assembler instructions used for your loop.

Convert python script to binary executable

I wrote a number crunching python code. The calculations involved can take hours. Is it possible somehow to compile it to binary?
Thanks
Not in any useful (for you) way, but moving the calculations into NumPy or Cython will speed them up.
First you can try psyco; that may give you a speedup of as much as 10x, but 2x is more typical.
If you can post the code up somewhere, perhaps someone can point out how to leverage numpy.
If your task doesn't map well onto numpy, then Cython is a good choice to convert an intensive function or two into C code just by adding a few cdefs.
If you can show us the code (even just the hot spots) we can probably give you better advice.
Perhaps you can modify your algorithm
Shedskin might be worth a try.
From their front page blurb:
Shed Skin is an experimental compiler, that can translate pure, but implicitly statically typed Python programs into optimized C++. It can generate stand-alone programs or extension modules that can be imported and used in larger Python programs.
Besides the typing restriction, programs cannot freely use the Python standard library (although about 20 common modules, such as random and re, are currently supported). Also, not all Python features, such as nested functions and variable numbers of arguments, are supported (see the tutorial for details).
For a set of 44 non-trivial test programs (at over 10,000 lines in total (sloccount)), measurements show a typical speedup of 2-40 times over Psyco, and 2-220 times over CPython. Because Shed Skin is still in an early stage of development, however, many other programs will not compile out-of-the-box.
