To begin, several similar questions have previously been asked on this site, notably here and here. The former is 11 years old, and the latter, while being 4 years old, references the 11-year-old post as the solution. I am curious to know if there is something more recent that could accomplish the task. In addition, those questions are only interested in the total time spent by the interpreter. I am hoping for something more granular than that, if such a thing exists.
The problem: I have a GTK program written in C that spawns a matplotlib Python process and embeds it into a widget within the GTK program using GtkSocket and GtkPlug. The Python process is spawned using g_spawn (GLib), and the plot is plugged into the socket on the Python side after it has been created. This takes three seconds, during which the GtkSocket widget is transparent. That is not very pleasant aesthetically, and I would like to see if there is anything I can do to reduce this three-second wait. I looked at using PyPy instead of CPython as the interpreter, but I am not certain that PyPy has matplotlib support, and that route could cause further headaches since I freeze the script to an executable using PyInstaller. I timed the script itself from beginning to end, and it came to around 0.25 seconds. I can run the plotting script from the terminal (normal or frozen) and it takes the same amount of time for the plot to appear (~3 seconds), so it can't be the g_spawn(). The time must all be spent within the interpreter.
I created a minimal example that reproduces the issue (although much less extreme: the time before the plot appears in the socket is only one second). I am not going to post it now since it is not necessarily relevant, although if requested, I can add the file contents with an edit later (it contains the GUI C code using GTK, an XML Glade file, and the Python script).
The fact that the minimal example takes one second while my actual plot takes three is hardly a surprise (and further confirms that the time is being spent within the interpreter), since the real script is more complicated and involves more imports.
The question: Is there any utility that would allow me to profile where the Python interpreter spends its time in my script? Is most of the time spent on the imports? Is it elsewhere? If I could see where the interpreter spends most of its time, I might be able to reduce this three-second wait to something less egregious.
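For instance, I know the standard library ships cProfile, and I gather Python 3.7+ has a -X importtime flag that reports per-module import times, but I am not sure either will account for all of the startup cost I am seeing. Something like the following is the kind of measurement I have in mind (the module name plot_script is just a placeholder for my plotting script):

import cProfile
import pstats

# Profile everything the script does at import time (module name is a
# placeholder) and sort by cumulative time, so that expensive imports
# such as matplotlib should surface at the top of the report.
cProfile.run("import plot_script", "plot.prof")
pstats.Stats("plot.prof").sort_stats("cumulative").print_stats(20)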
Any assistance would be appreciated.
Related
I have a Python code that runs a 2D diffusion simulation for a set of parameters. I need to run the code many times, O(1000), as in a Monte Carlo approach, using different parameter settings each time. To do this more quickly, I want to use all the cores on my machine (or cluster), so that each core runs one instance of the code.
In the past I have done this successfully for serial Fortran codes by writing a Python wrapper that used multiprocessing map (or starmap in the case of multiple arguments) to call the Fortran code in an ensemble of simulations. It works very nicely in that you loop over the 1000 simulations, and the Python wrapper farms out a new integration to a core as soon as it becomes free after completing a previous one.
However, when I now set this up to run multiple instances of my Python (instead of Fortran) code, I find it is incredibly slow, much slower than simply running the code 1000 times in serial on a single core. Using the system monitor I see that only one core is working at a time, and it never goes above 10-20% load, whereas of course I expected to see N cores running near 100% (as is the case when I farm out Fortran jobs).
I thought it might be a write issue, so I checked the code carefully to ensure that all plotting is switched off; in fact there is no file/disk access at all, and I now merely have one print statement at the end to report a final diagnostic.
The structure of my code is like this:
I have the main Python code in toy_diffusion_2d.py, which takes as its single argument a dictionary holding the run parameters:
def main(args):
    for step in timesteps:   # loop over timesteps
        ...                  # calculate the simulation over a large grid
    print(statistic)         # print the result statistic
And then I wrote a "wrapper" script, where I import the main simulation code and try to run it in parallel:
from multiprocessing import Pool, cpu_count

import toy_diffusion_2d

# dummy list of arguments
par1 = [1, 2, 3]
par2 = [4, 5, 6]

# make a list of dictionaries to loop over, 3x3=9 simulations in all
arglist = [{"par1": p1, "par2": p2} for p1 in par1 for p2 in par2]

ncore = min(len(arglist), cpu_count())
with Pool(processes=ncore) as p:
    p.map(toy_diffusion_2d.main, arglist)
The above is a shorter, paraphrased example; my actual codes are longer, so I have placed them here:
Main code: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_2d.py
You can run this with the default values like this:
python3 toy_diffusion_2d.py
Wrapper script: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_loop.py
You can run a 4 member ensemble like this:
python3 toy_diffusion_loop.py --diffK=[33000,37500] --tau_sub=[20,30]
(Note that the final statistic is slightly different on each run, even with the same parameter values, as the model is stochastic: it is a version of the stochastic Allen-Cahn equations, in case anyone is interested, but uses a stupid explicit solver on the diffusion term.)
As I said, the parallel code works, but it is reeeeeallly slow... as if it is constantly gating.
I also tried using starmap, but that was no different; it is almost as if the desktop only allows one Python interpreter to run at a time...? I have spent hours on it, and I'm almost at the point of rewriting the code in Fortran. I'm sure I'm just doing something really stupid that prevents parallel execution.
EDIT(1): This problem occurs on
4.15.0-112-generic x86_64 GNU/Linux, with Python 3.6.9
In response to the comments: in fact, I also find it runs fine on my Mac laptop...
EDIT(2): It seems my question was a bit of a duplicate of several other postings; apologies! As well as the useful links provided by Pavel, I also found this page very helpful: Importing scipy breaks multiprocessing support in Python. I'll edit the solution into the accepted answer below.
The code sample you provide works just fine on my macOS Catalina 10.15.6. I would guess you're using some Linux distribution where, according to this answer, it can be the case that importing numpy meddles with core affinity because of its linkage against the OpenBLAS library.
If your Unix supports the scheduler interface, something like this will work:
>>> import os
>>> os.sched_setaffinity(0, range(os.cpu_count()))
Another question that has a good explanation of this problem is found here, and the solution suggested is this:
os.system('taskset -cp 0-%d %s' % (ncore, os.getpid()))
inserted right before the multiprocessing call.
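Put together with the wrapper from the question, the fix would look something like the sketch below (Linux-only, since os.sched_setaffinity is not available on every platform; the argument list is the dummy one from the question):

import os
from multiprocessing import Pool, cpu_count

import toy_diffusion_2d

arglist = [{"par1": p1, "par2": p2} for p1 in [1, 2, 3] for p2 in [4, 5, 6]]
ncore = min(len(arglist), cpu_count())

# Clear any affinity mask that an imported library (numpy/OpenBLAS) may
# have set; the pool's workers inherit the parent's affinity, so they
# can then be scheduled across all cores.
os.sched_setaffinity(0, range(os.cpu_count()))

with Pool(processes=ncore) as p:
    p.map(toy_diffusion_2d.main, arglist)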
Recently, when I was fiddling with Python in different IDEs/shells, I was most surprised by the performance differences among them.
The code I wrote is a simple for-loop that prints the numbers 0 through 999. When executed in Python's IDLE or Windows PowerShell, it took about 16 seconds to finish, while PyCharm finished it almost immediately, in about 500 ms.
I'm wondering why the difference is so huge.
for x in range(0, 1000, 1):
    print(x)
The time to execute the loop itself is almost zero. The time you're seeing elapse is due to the printing, which is tied to the output facilities of the particular shell you are using: the sort of buffering it does, perhaps the graphics routines being used to render the text, and so on. There is no practical application for printing numbers in a loop as fast as possible to a human-readable display, so perhaps you can try the same test writing to a file instead. I expect the times will be more similar.
On my laptop your code takes 4.8 milliseconds if writing to the terminal. It takes only 460 microseconds if writing to a file.
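If you want to reproduce that comparison yourself, a minimal sketch along these lines will do (timings will of course vary by machine and shell):

import sys
import time

def timed_print(stream):
    # Time the same 1000-print loop against a given output stream.
    start = time.perf_counter()
    for x in range(1000):
        print(x, file=stream)
    return time.perf_counter() - start

terminal_s = timed_print(sys.stdout)
with open("out.txt", "w") as f:
    file_s = timed_print(f)
print(f"terminal: {terminal_s:.4f}s, file: {file_s:.4f}s")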
TL;DR: run stupid benchmarks, get stupid times.
IDLE is written in Python and uses tkinter, which wraps tcl/tk. By default, IDLE runs your code in a separate process, with output sent through a socket for display in IDLE's Shell window. So there is extra overhead for each print call. For me, on a years-old Windows machine, the 1000 line prints take about 3 seconds, or 3 milliseconds per print.
If you print the 1000 lines with one print call, as with
print('\n'.join(str(i) for i in range(1000)))
the result may take a bit more than 3 milliseconds, but it is still subjectively almost 'instantaneous'.
Note: in 3.6.7 and 3.7.1, single 'large' prints, where 'large' can be customized by the user, are squeezed down to a label that can be expanded either in-place or in a separate window.
I am working on an OpenCV-based Python project, and I am trying to make the program take as little time as possible to execute. To test this, I wrote a small "hello world" program in Python and measured how long it takes to run. I ran it many times, and every run gives a different run time.
Can you explain why a simple program takes a different amount of time to execute on each run?
I need my program's run time to be independent of other system processes. Is that possible?
Python gets different amounts of system resources depending upon what else the CPU is doing at the time. If you're playing Skyrim with the highest graphics levels at the time, then your script will run slower than if no other programs were open. But even if your task bar is empty, there may be invisible background processes confounding things.
If you're not already using it, consider using timeit. It performs multiple runs of your program in order to smooth out bad runs caused by a busy OS.
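For example, something like this (the statement being timed is just the "hello world" from the question):

import timeit

# Run the statement in batches and repeat the measurement several times;
# the minimum is the estimate least contaminated by other system activity.
times = timeit.repeat("print('hello world')", repeat=5, number=100)
print(min(times) / 100, "seconds per run")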
If you absolutely insist on requiring your program to run in the same amount of time every time, you'll need to use an OS that doesn't support multitasking. For example, DOS.
In pdb, after I hit n to step over a line, I want to go back and then hit s to step into that function if it failed. Is this possible?
The docs say:
j(ump) lineno
Set the next line that will be executed. Only available in the bottom-most frame. This lets you jump back and execute code again, or jump forward to skip code that you don’t want to run.
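So, in the bottom-most frame, something like the following session should be possible (the line number here is hypothetical):

(Pdb) n          # step over the call; it fails
(Pdb) j 42       # jump back to the line containing the call
(Pdb) s          # step into the function this time

What I cannot tell is whether the program state is rewound as well.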
The GNU debugger, gdb: it is extremely slow, as it undoes a single machine instruction at a time.
The Python debugger, pdb: The jump command takes you backwards in the code, but does not reverse the state of the program.
For Python, the extended Python debugger prototype, epdb, was created for this reason. Here is the thesis and here is the program and the code.
I used epdb as a starting point to create a live reverse debugger as part of my MSc degree. The thesis is available online: Combining reverse debugging and live programming towards visual thinking in computer programming. In chapter 1 and 2 I also cover most of the historical approaches to reverse debugging.
PyPy has started to implement RevDB, which supports reverse debugging.
It is (as of Feb 2017) still at an alpha stage, only supports Python 2.7, only works on Linux or OS X, and requires you to build Python yourself from a special revision. It's also very slow and uses a lot of RAM. To quote the Bitbucket page:
Note that the log file typically grows at a rate of 1-2 MB per second. Assuming size is not a problem, the limiting factors are:
Replaying time. If your recorded execution took more than a few minutes, replaying will be painfully slow. It sometimes needs to go over the whole log several times in a single session. If the bug occurs randomly but rarely, you should run recording for a few minutes, then kill the process and try again, repeatedly until you get the crash.
RAM usage for replaying. The RAM requirements are 10 or 15 times larger for replaying than for recording. If that is too much, you can try with a lower value for MAX_SUBPROCESSES in _revdb/process.py, but it will always be several times larger.
Details are on the PyPy blog and installation and usage instructions are on the RevDB bitbucket page.
Reverse debugging (returning to previously recorded application state, or backwards single-stepping debugging) is generally an assembly- or C-level debugger feature. For example, gdb can do it:
https://sourceware.org/gdb/wiki/ReverseDebug
Bidirectional (or reverse) debugging
Reverse debugging is utterly complex and may carry a performance penalty of up to 50,000x. It also requires extensive support from the debugging tools. The Python virtual machine does not provide reverse debugging support.
If you are interactively evaluating Python code, I suggest trying IPython Notebook, which provides an HTML-based interactive Python shell. You can easily write your code and mix and match the order. There is no pdb debugging support, though. There is ipdb, which provides better history and search facilities for entered debugging commands, but it doesn't do direct backwards jumps either, as far as I know.
Though you may not be able to reverse the code execution in time, the next best thing pdb has is stack frame jumps.
Use w to see where you are in the stack (the bottom frame is the newest), and u(p) or d(own) to traverse the stack and reach the frame whose function call brought you into the current frame.
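A quick illustration of those commands in a session:

(Pdb) w      # "where": print the stack; the newest frame is at the bottom
(Pdb) u      # "up": move to the caller's frame, so you can inspect its locals
(Pdb) d      # "down": move back toward the newest frame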
I had been working on a Python and tkinter solution to the code golf here: https://codegolf.stackexchange.com/questions/26824/frogger-ish-game/
My response is the Python 2.7 one. The thing is, when I run this code on my 2008 Mac Pro, everything works fine. When I run it on Win7 (I have tried this on several different machines, with the same result), the main update loop runs way too slowly. You will notice that I designed my implementation with a 1 ms internal clock:
if(self.gameover == False):
    self.root.after(1, self.process_world)
Empirical testing reveals that this runs much, much slower than every 1ms. Is this a well-known Windows 7-specific behavior? I have not been able to find much information about calls to after() lagging behind by this much. I understand that the call is supposed to be executed "at least" after the given amount of time, and not "at most", but I am seeing 1000 update ticks every 20 seconds instead of every 1 second, and a factor of 20 seems excessive. The timer loop that displays the game clock works perfectly well. I thought that maybe the culprit was my thread lock arrangement, but commenting that out makes no difference. This is my first time using tkinter, so I would appreciate any help and/or advice!