Bytecode instruction cost - python

Is it possible to know a "cost" in some measure (seconds, CPU ticks, logarithmic scale, anything) for each instruction? Or at least for some instructions, skipping something like SLICE. There is a description at https://docs.python.org/3/library/dis.html. There is source code: https://hg.python.org/cpython/file/tip/Python/ceval.c#l1199. I guess it is possible to estimate how many resources each instruction consumes by analyzing the source code, but I doubt that this can be done by a noob like me. Maybe somebody has already done this? Of course there is plenty of high-level advice about optimization and over-optimization, but maybe such a measure would help beginners understand bytecode better without digging into the C sources?
Edit: the actual question is not about profiling or debugging code (I know about various profiling methods); the question is specifically about bytecode. I have in mind CPU instructions, which have a cost measure: cycles per instruction.

The typical way to measure the performance of "small" pieces of code is timeit. For larger things, we usually use cProfile instead.
These techniques will measure the code by actually running it and seeing how long it takes to execute, so they will not produce totally deterministic answers. If you're looking for something more theoretical, you may need to look at the disassembly and the CPython source code, and reason about how fast it is. You may find the source easier to read if you consult the Python/C API docs, since a lot of CPython source is ultimately calling those same functions.
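As a rough illustration of combining the two approaches, the sketch below lists the bytecode of a tiny function with dis and then times the corresponding statement with timeit. The function add and the loop counts are made up for the example, and the timing covers the whole statement plus loop overhead rather than a single opcode, so treat the numbers as relative at best.

```python
import dis
import timeit

def add(a, b):
    # Hypothetical example function; any small function works here.
    return a + b

# Collect the opcode names CPython generated for the function body.
# The exact names depend on the CPython version (e.g. BINARY_ADD vs BINARY_OP).
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)

# Time each statement many times and keep the fastest repeat, which is the
# one least disturbed by other activity on the machine.
n = 200_000
t_add = min(timeit.repeat("a + b", setup="a, b = 3, 4", number=n, repeat=5))
t_pass = min(timeit.repeat("pass", number=n, repeat=5))

print(f"a + b: {t_add / n * 1e9:.1f} ns per execution")
print(f"pass : {t_pass / n * 1e9:.1f} ns per execution (loop overhead baseline)")
```

The difference between the two timings gives a crude upper bound on what the handful of opcodes behind a + b costs on this machine and interpreter version; it is not a portable per-instruction figure.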

As mentioned by @kevin, you can use cProfile; see http://lanyrd.com/2013/pycon/scdywg/
For line-by-line profiling you can use https://github.com/rkern/line_profiler,
which I found out about from here.
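For reference, a minimal cProfile run might look like the sketch below. The functions slow_sum and main are placeholders invented for the example; only the cProfile and pstats calls are standard library.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Placeholder workload: a plain Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    return [slow_sum(20_000) for _ in range(20)]

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Write the report into a string, sorted by cumulative time,
# and keep only the first few entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report works at function granularity, which is why it says nothing about individual bytecode instructions.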

Related

Speeding up loop with fsolve using PyPy or Cython?

I have a Python script containing a loop with a lot of calls to scipy.optimize.fsolve (109 (55 + 54) times per time step, and right now I need around 10^5 time steps). The rest of the script isn't very fast either, but as far as I can tell from the output of the Spyder Profiler, the calls to fsolve are by far the most time consuming. With my current settings, running the script takes over 10 hours, so I could do with a little speed-up.
Based on what I read here and elsewhere on the internet, I already gave PyPy a first try: installed it in a separate environment under conda (MacOS 10.15.5, pypy 7.3.1 and pypy3.6 7.3.1), along with its own versions of numpy, scipy, and pandas, but so far it's actually a bit slower than just Python (195 s vs 171 s, for 100 time steps).
From what I read here (PyPy Status Blog, October '17), this might have to do with using numpy instead of numpypy, and/or repeated allocation of temporary arrays. Apart from calling fsolve 10 million+ times, I'm also using quite a lot of numpy arrays, so that makes sense as far as I can see.
The thing is, I'm not a developer, and I'm completely new to PyPy, so terms like JIT traces don't mean much to me, and deciphering what's in them is likely going to be challenging for me. Moreover, what used to be true in October 2017 may no longer be relevant now. Also, even if I managed to speed up the numpy array bit, I'm still not sure about the fsolve part.
Can anyone indicate whether it could be worthwhile for me to invest time in PyPy? Or would Cython be more suitable in this case? Or even mpi4py?
I'd happily share my code if that helps, but it includes a module of over 800 lines of code, so just including it in this post didn't seem like a great idea to me.
Many thanks!
Sita
EDIT: Thanks everyone for your quick and kind response! That's a fair point, about needing to see my code, I put it here (link valid until 19 June 2020). Arterial_1D.py is the module, CoronaryTree.py is the script that calls Arterial_1D.py. For a minimal working example, I put in one extra line, to be uncommented in that case (clearly marked in the code). Also, I set the number of time steps to 100, to have the code run in reasonable time (0.61 s for the minimal example, 37.3 s for the full coronary tree, in my case).
EDIT 2: Silly of me, in my original post I mentioned times of 195 and 171 s for running 100 steps of my code using PyPy and Python, respectively, but in that case I called Python from within the PyPy environment, so it was using the PyPy version of Numpy. From within my base environment running 100 steps takes a little over 30 s. So PyPy is A LOT slower than Python in this case, which motivates me to look into this PyPy Status Blog post anyway.
We can't really help you to optimize without looking at your code. But since you have quite the description going on up there, let me reply with what I think you can try to speed things up.
First things first: the Scipy library.
From the source for scipy.optimize.fsolve, it wraps MINPACK's hybrd and hybrj algorithms, which are fast FORTRAN subroutines. So in your case, switching to PyPy is not going to do much good, if any at all, for this identified bottleneck.
What can you do to optimize your Scipy fsolve calls? One of the most obvious things to do numerically is to vectorize your function's args. But it seems that you are running a sort of time-stepping algorithm, and most standard time-stepping algorithms cannot be vectorized in time. If your 'XX times per time step' is a sort of implicit spatial loop per time step (i.e. over your grid), you can consider vectorizing this to achieve some gains in speed. Next, zoom in on your function's guess / starting root estimate. See if you can modify your algorithm to capitalize on a good starting solution over the whole time interval (do some literature digging). Note that this has less to do with 'programming' than with your knowledge of numerical methods.
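To make the point about the starting estimate concrete, here is a small sketch that does not use fsolve at all: a plain-Python Newton iteration stands in for the solver, and the equation x**3 + x - t = 0 is a made-up toy problem, not your model. It compares a fixed initial guess against reusing the previous time step's root. With fsolve the same idea would amount to passing the previous solution as the x0 argument.

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Return (root, iterations) for a scalar equation f(x) = 0."""
    x = x0
    for iterations in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x, iterations
        x -= fx / df(x)
    raise RuntimeError("Newton iteration did not converge")

def solve_all(warm_start):
    """Solve the toy equation at 100 time steps; return total iterations."""
    total_iterations = 0
    x_prev = 0.0
    for step in range(100):
        t = 1.0 + 0.01 * step  # slowly varying parameter (hypothetical)
        f = lambda x, t=t: x**3 + x - t
        df = lambda x: 3 * x**2 + 1
        # Warm start: reuse the previous root; cold start: always begin at 0.
        x0 = x_prev if warm_start else 0.0
        x_prev, iterations = newton(f, df, x0)
        total_iterations += iterations
    return total_iterations

cold_iterations = solve_all(warm_start=False)
warm_iterations = solve_all(warm_start=True)
print("cold start:", cold_iterations, "iterations in total")
print("warm start:", warm_iterations, "iterations in total")
```

Because the root moves only slightly between consecutive steps, the warm-started runs need fewer iterations in total. How much this helps in a real model depends on how smoothly the solution evolves in time.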
Next, on your comment that the "rest of the script isn't very fast either": I'd go with Cython to speed up the remaining Python parts of your code, especially the loops. It is very actively developed, has a great community and is battle-tested. I personally have used it in many HPC-type problems. Cython also has a convenient HTML annotation that highlights potential optimizations over your native Python implementation.
Hope this helps! Cheers

Comparing wall time and resource usage across different programming environments

Is there a particular software resource monitor that researchers or academics use to compare execution time and other resource usage metrics between programming environments? For instance, if I have a routine in C++, another in Python and another in Matlab, all identical in function and similar in implementation, how would I make an objective, measurable comparison as to which is the most efficient? Likewise, is there a tool that could also analyze performance between versions of the same code to track improvements in processing efficiency? Please try to answer this question without generalizations like "oh, C++ is always more efficient than python and python will always be more efficient than Matlab."
The correct way is to write tests. Get the current time before the actual algorithm starts, and get the current time again after it ends. There are ways to do that in C++, Python and Matlab.
You must not treat the results as 100% precise, because of system process scheduling and so on, but it is a good way to compare before and after results.
A good way to get more precise results is to run your code multiple times.
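On the Python side, that recipe could be sketched as follows. The function workload is a stand-in for the routine under test, and the repeat count is arbitrary.

```python
import statistics
import time

def measure(func, repeats=7):
    """Run func several times and return the wall-clock time of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # time just before the algorithm starts
        func()
        samples.append(time.perf_counter() - start)  # elapsed seconds
    return samples

def workload():
    # Placeholder for the routine being compared.
    return sum(i * i for i in range(100_000))

samples = measure(workload)
print(f"min    : {min(samples):.6f} s")
print(f"median : {statistics.median(samples):.6f} s")
```

Reporting the minimum and the median over several runs reduces the influence of scheduling noise. The same pattern carries over to C++ and Matlab with their own clock functions.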

Can I improve python runtime by compiling?

I'm writing a small toy simulation in Python. Granted, these simulations are slow. To my understanding, the major reason that Python code is slow is the fact that Python is an interpreted language. I don't want to give up Python since the clear syntax and the available libraries cut the writing time significantly. So is there a simple way for me to "compile" my Python code?
Edit
To answer some questions:
Yes, I'm using numpy. It greatly simplifies the code and I don't think I can improve performance by writing the functions on my own. I use numpy for all my lists and I add all of the beads together. Namely, I invoke
pos += V*dt + forces*0.5*dt**2
where 'pos', 'V', and 'forces' are all np.array of (2000,3) dimensions.
I'm quite certain that the slow part is the forces calculation. This is logical as I have to iterate over all my particles and check their position. For my real project (Ph.D. stuff) I have code of roughly the same level of complexity, and I know that this is the expensive stuff.
If none of the solutions in the comments suffice, you can also take a look at Cython.
For a quick tutorial & example check:
http://docs.cython.org/src/tutorial/cython_tutorial.html
Used at the correct spots (e.g. around frequently called functions) it can easily speed things up by a factor of 10 - 100.
Python is a slightly odd language in that it is both interpreted and compiled. Well, sort of. When you run it, the source is compiled to bytecode (cached in ".pyc" files for imported modules), so we can quickly get bogged down in semantic details here. Hell, I don't even know if what I just said is strictly accurate. But at the end of the day you want to speed things up, so...
First, use the profiler and timeit to work out where all the time is going
Second, rewrite your pure python code to improve the slow bits you've discovered
Third, see how it goes when optimised
Now, depending on your scenario, seriously think "Can I run it on a bigger CPU/memory?"
Ok, try rewriting those slow sections in C++
Screw it, write it all in C++
If you get as far as the last option, I dare say you're screwed and the savings aren't going to be significant.

Optimizing for PyPy

(This is a follow-up to Statistical profiler for PyPy)
I'm running some Python code under PyPy and would like to optimize it.
In Python, I would use statprof or lineprofiler to know which exact lines are causing the slowdown and try to work around them. In PyPy though, both of the tools don't really report sensible results as PyPy might optimize away some lines. I would also prefer not to use cProfile as I find it very difficult to distil which part of the reported function is the bottleneck.
Does anyone have some tips on how to proceed? Perhaps another profiler which works nicely under PyPy? In general, how does one go about optimizing Python code for PyPy?
If you understand the way the PyPy architecture works, you'll realize that trying to pinpoint individual lines of code isn't really productive. You start with a Python interpreter written in RPython, which then gets run through a tracing JIT which generates flow graphs and then transforms those graphs to optimize the RPython interpreter. What this means is that the layout of your Python code, as run by the JIT'ed RPython interpreter, may have a very different structure than the optimized assembler actually being run. Furthermore, keep in mind that the JIT always works on a loop or a function, so getting line-by-line stats is not as meaningful. Consequently, I think cProfile may really be a good option for you since it will give you an idea of where to concentrate your optimization. Once you know which functions are your bottlenecks, you can spend your optimization efforts targeting those slower functions, rather than trying to fix a single line of Python code.
Keep in mind as you do this that PyPy has very different performance characteristics than CPython. Always try to write code in as simple a way as possible (that doesn't mean as few lines as possible, btw). There are a few other heuristics that help, such as using specialized lists, using objects over dicts when you have a small number of mostly constant keys, avoiding C extensions that use the CPython C API, etc.
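To show what the "objects over dicts" heuristic means in practice, here is a small sketch. The Particle class and the two helper functions are invented for the example, and whether the object version is actually faster has to be measured under PyPy on your own workload.

```python
class Particle:
    # A small, fixed set of attributes: the case the heuristic refers to.
    def __init__(self, x, y, mass):
        self.x = x
        self.y = y
        self.mass = mass

def weighted_norm_dict(p):
    # Record stored as a dict with constant string keys.
    return p["mass"] * (p["x"] ** 2 + p["y"] ** 2)

def weighted_norm_obj(p):
    # Same record stored as an object with attributes.
    return p.mass * (p.x ** 2 + p.y ** 2)

as_dicts = [{"x": i, "y": 2 * i, "mass": 1.5} for i in range(1000)]
as_objects = [Particle(i, 2 * i, 1.5) for i in range(1000)]

total_dict = sum(weighted_norm_dict(p) for p in as_dicts)
total_obj = sum(weighted_norm_obj(p) for p in as_objects)
print(total_dict == total_obj)
```

Both versions compute the same result; the only difference is how each record is represented, which is the part the JIT can treat differently.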
If you really, really insist on trying to optimize at the line level, there are a few options. One is called JitViewer (https://foss.heptapod.net/pypy/jitviewer), which will let you have a very low-level view of what the JIT is doing to your code. For instance, you can even see the assembler instructions which correspond to a Python loop. Using that tool, you can really get a sense of just how fast PyPy will behave with certain parts of the code, as you can now do silly things like count the number of assembler instructions used for your loop.

Convert python script to binary executable

I wrote a number crunching python code. The calculations involved can take hours. Is it possible somehow to compile it to binary?
Thanks
Not in any useful (for you) way, but moving the calculations into NumPy or Cython will speed them up.
First you can try psyco, which may give you a speedup of as much as 10x, but 2x is more typical.
If you can post the code up somewhere, perhaps someone can point out how to leverage numpy.
If your task doesn't map well onto numpy, then Cython is a good choice to convert an intensive function or two into C code just by adding a few cdefs.
If you can show us the code (even just the hot spots) we can probably give you better advice.
Perhaps you can modify your algorithm.
Shedskin might be worth a try.
From their front page blurb:
Shed Skin is an experimental compiler, that can translate pure, but implicitly statically typed Python programs into optimized C++. It can generate stand-alone programs or extension modules that can be imported and used in larger Python programs.
Besides the typing restriction, programs cannot freely use the Python standard library (although about 20 common modules, such as random and re, are currently supported). Also, not all Python features, such as nested functions and variable numbers of arguments, are supported (see the tutorial for details).
For a set of 44 non-trivial test programs (at over 10,000 lines in total (sloccount)), measurements show a typical speedup of 2-40 times over Psyco, and 2-220 times over CPython. Because Shed Skin is still in an early stage of development, however, many other programs will not compile out-of-the-box.
