Profile GDB python pretty printers for Qt5 - python

I'm trying to improve my own GDB pretty printers using the GDB python API.
Currently I'm testing them against a core dump.
I'm trying to get info for some QMap and QList contents, but they have so many elements that printing them with their contents is really slow (minutes).
So, I would like to know if there is any known way to profile which parts are slowest.
I've already checked the Python profile module manual and google-perftools, but I don't know how to hook them into the GDB execution cycle.
gdbcommands.txt:
source /home/user/codigo/git/kde-dev-scripts/gdb/load-qt5printers.py
source /home/user/codigo/myownprinters.py
core ../../core.QThread.1493215378
bt
p longQMapofQList
Link to load-qt5-printers.py content:
Then I launch gdb to automatically run those commands:
gdb-multiarch --command=~/bin/gdbcommands.txt
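Since GDB embeds its own Python interpreter, one unverified but plausible way to hook a profiler into the GDB execution cycle is to drive the slow print through cProfile from the (gdb) prompt. A minimal sketch, assuming the printers are loaded and the core is attached:
(gdb) python
import cProfile
cProfile.run('gdb.execute("p longQMapofQList")', sort='cumulative')
end
The cumulative column should then show which printer methods (for example children() or to_string()) dominate the time.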

they have so many elements that printing them with their contents is really slow (minutes).
Do the Qt pretty-printers respect the print-elements limit? Is your limit set too high?
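For reference, the limit is inspected and lowered like this at the GDB prompt (the default is 200 elements):
(gdb) show print elements
(gdb) set print elements 20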
Profiling is unlikely to help here: if you are printing a list with (say) 1000 elements, it's likely that GDB will need to execute 10,000 or more ptrace calls, and ~all the time is spent waiting for these calls.
You can run gdb under strace -c and observe how many ptraces are executed while printing a list of 10 elements, and 100 elements.
If the increase is linear, you can try to optimize the pretty printer to do fewer accesses. If the increase is quadratic, the pretty printer may do unnecessary pointer chasing (and that would certainly explain why printing long lists takes minutes).
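For example, such a counting run might look like this (using the command file from the question; -c prints a summary table of syscall counts on exit, -f follows child processes):
strace -c -f gdb-multiarch --batch --command=~/bin/gdbcommands.txt
Repeat with 10 and then 100 elements printed and compare the ptrace totals between runs.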

Related

Generating Timing Statistics/Profiling the Python Interpreter

To begin, several similar questions have previously been asked on this site, notably here and here. The former is 11 years old, and the latter, while being 4 years old, references the 11 year old post as the solution. I am curious to know if there is something more recent that could accomplish the task. In addition, these questions are only interested in the total time spent by the interpreter. I am hoping for something more granular than that, if such a thing exists.
The problem: I have a GTK program written in C that spawns a matplotlib Python process and embeds it into a widget within the GTK program using GtkSocket and GtkPlug. The python process is spawned using g_spawn (GLib) and then the plot is plugged into the socket on the Python side after it has been created. It takes three seconds to do so, during which time the GtkSocket widget is transparent. This is not very pleasant aesthetically, and I would like to see if there is anything I could do to reduce this three second wait time. I looked at using PyPy instead of CPython as the interpreter, but I am not certain that PyPy has matplotlib support, and that route could cause further headaches since I freeze the script to an executable using PyInstaller. I timed the script itself from beginning to end and the time was around 0.25 seconds. I can run the plotting script from the terminal (normal or frozen) and it takes the same amount of time for the plot to appear (~3 seconds), so it can't be the g_spawn(). The time must all be spent within the interpreter.
I created a minimal example that reproduces the issue (although much less extreme, the time before the plot appears in the socket is only one second). I am not going to post it now since it is not necessarily relevant, although if requested, I can add the file contents with an edit later (contains the GUI C code using GTK, an XML Glade file, and the Python script).
The fact that the minimal example takes one second and my actual plot takes three seconds is hardly a surprise (and further confirms that the timing problem is the time spent with the interpreter), since it is more complicated and involves more imports.
The question: Is there any utility that exists that would allow me to profile where the time is being spent within my script by the Python interpreter? Is most of the time spent with the imports? Is it elsewhere? If I could see where the interpreter spends most of its time, that may possibly allow me to reduce this three second wait time to something less egregious.
Any assistance would be appreciated.
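Two possible starting points, assuming CPython 3.7 or later and a hypothetical script name plot_script.py: the -X importtime option logs the cumulative cost of every import to stderr, and cProfile gives a per-function breakdown of everything else.
python3 -X importtime plot_script.py 2> imports.log
python3 -m cProfile -s cumulative plot_script.py
If most of the three seconds shows up under the matplotlib imports, deferring or trimming those imports is the usual remedy.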

multiprocessing pool map incredibly slow running a Monte Carlo ensemble of python simulations

I have a python code that runs a 2D diffusion simulation for a set of parameters. I need to run the code many times, O(1000), like a Monte Carlo approach, using different parameter settings each time. In order to do this more quickly I want to use all the cores on my machine (or cluster), so that each core runs one instance of the code.
In the past I have done this successfully for serial Fortran codes by writing a python wrapper that then used multiprocessing map (or starmap in the case of multiple arguments) to call the Fortran code in an ensemble of simulations. It works very nicely in that you loop over the 1000 simulations, and the python wrapper farms out a new integration to a core as soon as it becomes free after completing a previous integration.
However, now when I set this up to do the same to run multiple instances of my python (instead of fortran) code, I find it is incredibly slow, much slower than simply running the code 1000 times in serial on a single core. Using the system monitor I see that one core is working at a time, and it never goes above 10-20% load, while of course I expected to see N cores running near 100% (as is the case when I farm out fortran jobs).
I thought it might be a write issue, and so I checked the code carefully to ensure that all plotting is switched off, and in fact there is no file/disk access at all, I now merely have one print statement at the end to print out a final diagnostic.
The structure of my code is like this
I have the main python code in toy_diffusion_2d.py which has a single arg of a dictionary with the run parameters in it:
def main(arg):
    loop over timesteps:
        run the simulation calculation over a large grid
    print the result statistic
And then I wrote a "wrapper" script, where I import the main simulation code and try to run it in parallel:
from multiprocessing import Pool,cpu_count
import toy_diffusion_2d
# dummy list of arguments
par1=[1,2,3]
par2=[4,5,6]
# make a list of dictionaries to loop over, 3x3=9 simulations in all.
arglist=[{"par1":p1,"par2":p2} for p1 in par1 for p2 in par2]
ncore=min(len(arglist),int(cpu_count()))
with Pool(processes=ncore) as p:
    p.map(toy_diffusion_2d.main,arglist)
The above is a shorter paraphrased example, my actual codes are longer, so I have placed them here:
Main code: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_2d.py
You can run this with the default values like this:
python3 toy_diffusion_2d.py
Wrapper script: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_loop.py
You can run a 4 member ensemble like this:
python3 toy_diffusion_loop.py --diffK=[33000,37500] --tau_sub=[20,30]
(note that the final stat is slightly different each run, even with the same values, as the model is stochastic; it is a version of the stochastic Allen-Cahn equations, in case anyone is interested, but uses a stupid explicit solver on the diffusion term).
As I said, the second parallel code works, but it is reeeeeallly slow... as if something is constantly gating it.
I also tried using starmap, but that was no different; it is almost like the desktop only allows one python interpreter to run at a time...? I spent hours on it, and I'm almost at the point of rewriting the code in Fortran. I'm sure I'm just doing something really stupid to prevent parallel execution.
EDIT(1): this problem is occurring on
4.15.0-112-generic x86_64 GNU/Linux, with Python 3.6.9
In response to the comments: in fact, I also find it runs fine on my Mac laptop...
EDIT(2): so it seems my question was a bit of a duplicate of several other postings, apologies! As well as the useful links provided by Pavel, I also found this page very helpful: Importing scipy breaks multiprocessing support in Python. I'll edit the solution into the accepted answer below.
The code sample you provide works just fine on my macOS Catalina 10.15.6. I can guess you're using some Linux distribution where, according to this answer, it can be the case that the numpy import meddles with core affinity due to being linked with the OpenBLAS library.
If your Unix supports the scheduler interface, something like this will work:
>>> import os
>>> os.sched_setaffinity(0, set(range(os.cpu_count())))
Another question with a good explanation of this problem can be found here, and the suggested solution is this:
os.system('taskset -cp 0-%d %s' % (ncore, os.getpid()))
inserted right before the multiprocessing call.
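Putting it together, a minimal sketch of the wrapper with the affinity reset (assuming it is the import of toy_diffusion_2d, via numpy/OpenBLAS, that pins the affinity):
import os
from multiprocessing import Pool, cpu_count

import toy_diffusion_2d  # this import may pull in numpy/OpenBLAS and pin core affinity

par1 = [1, 2, 3]
par2 = [4, 5, 6]
arglist = [{"par1": p1, "par2": p2} for p1 in par1 for p2 in par2]
ncore = min(len(arglist), cpu_count())

# undo any affinity mask left by the BLAS import before spawning workers
os.sched_setaffinity(0, set(range(os.cpu_count())))

with Pool(processes=ncore) as p:
    results = p.map(toy_diffusion_2d.main, arglist)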

Python script runs slower when executed multiple times

I have a python script with a normal runtime of ~90 seconds. However, when I change only minor things in it (like the colors in my final pyplot figure) and execute it multiple times in quick succession, its runtime increases to close to 10 minutes.
Some bullet points of what I'm doing:
I'm not downloading anything, nor am I creating new files with my script.
I merely open some locally saved .dat-files using numpy.genfromtxt and crunch some numbers with them.
I transform my data into a rec-array and use indexing via array.columnname extensively.
For each file I loop over a range of criteria that basically constitute different maximum and minimum values for evaluation, and embedded in that I use an inner loop over the lines of the data arrays. A few if's here and there but nothing fancy, really.
I use the multiprocessing module as follows
import multiprocessing
npro = multiprocessing.cpu_count() # Count the number of processors
pool = multiprocessing.Pool(processes=npro)
bigdata = list(pool.map(analyze, range(len(FileEndings))))
pool.close()
with analyze being my main function and FileEndings its input, a string, used to build the right name of the file I want to load and then evaluate. Afterwards, I use it a second time with
pool2 = multiprocessing.Pool(processes=npro)
listofaverages = list(pool2.map(averaging, range(8)))
pool2.close()
averaging being another function of mine.
I use numba's @jit decorator to speed up the basic calculations I do in my inner loops, with nogil, nopython, and cache all set to True (a sketch follows this list). Commenting these out doesn't resolve the issue.
I run the script on Ubuntu 16.04, using a recent Anaconda build of Python.
I write the code in PyCharm and run it in its console most of the time. However, changing to bash doesn't help either.
Simply not running the script for about 3 minutes lets it go back to its normal runtime.
Using htop reveals that all processors are at full capacity when the script is running. I also see a lot of processes stemming from PyCharm (50 or so), each at an equal MEM% of 7.9. The CPU% is at 0 for most of them; a few exceptions are in the range of several percent.
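For reference (hypothetical function and parameter names), an inner-loop routine decorated as described above might look like this:
from numba import jit

@jit(nopython=True, nogil=True, cache=True)
def count_in_range(values, lower, upper):
    # count entries that satisfy the current min/max criteria
    count = 0
    for v in values:
        if lower <= v <= upper:
            count += 1
    return count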
Has anyone experienced such an issue before? And if so, any suggestions what might help? Or are any of the things I use simply prone to cause these problems?
This may be closed; the problem was caused by a malfunctioning fan in my machine.

Is it possible to see what a Python process is doing?

I created a python script that grabs some info from various websites. Is it possible to analyze how long it takes to download the data and how long it takes to write it to a file?
I am interested in knowing how much it could improve by running it on a better PC (it is currently running on a crappy old laptop).
If you just want to know how long a process takes to run, the time command is pretty handy. Just run time <command> and it will report how much time it took, broken down into a few categories: wall-clock time, system/kernel time, and user-space time. This won't tell you which parts of the program are taking up the time; you can always look at a profiler if you want/need that type of information.
That said, as Barmar said, if you aren't doing much processing of the sites you are grabbing, the laptop is probably not going to be a limiting factor.
You can always store the system time in a variable before a block of code that you want to test, do it again afterwards, and then compare the two, as sketched below.
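A minimal sketch of that approach with time.perf_counter(); download_data and write_to_file stand in for the script's own functions:
import time

start = time.perf_counter()
data = download_data()        # hypothetical: the part that fetches the websites
t_download = time.perf_counter() - start

start = time.perf_counter()
write_to_file(data)           # hypothetical: the part that writes the results
t_write = time.perf_counter() - start

print(f"download: {t_download:.2f} s, write: {t_write:.2f} s")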

Is it possible to step backwards in pdb?

After I hit n to evaluate a line, I want to go back and then hit s to step into that function if it failed. Is this possible?
The docs say:
j(ump) lineno
Set the next line that will be executed. Only available in the bottom-most frame. This lets you jump back and execute code again, or jump forward to skip code that you don’t want to run.
The GNU debugger, gdb: it can do this, but it is extremely slow, as it undoes a single machine instruction at a time.
The Python debugger, pdb: The jump command takes you backwards in the code, but does not reverse the state of the program.
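A hypothetical session illustrating that caveat: the script is stopped on line 4, after line 3 (total += 1) has already run once.
> demo.py(4)main()
-> print(total)
(Pdb) p total
1
(Pdb) j 3
> demo.py(3)main()
-> total += 1
(Pdb) n
> demo.py(4)main()
-> print(total)
(Pdb) p total
2
The jump moved the program counter back to line 3, but total kept its value, so re-running the increment leaves it at 2 rather than rewinding the state.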
For Python, the extended python debugger prototype, epdb, was created for this reason. Here is the thesis and here is the program and the code.
I used epdb as a starting point to create a live reverse debugger as part of my MSc degree. The thesis is available online: Combining reverse debugging and live programming towards visual thinking in computer programming. In chapter 1 and 2 I also cover most of the historical approaches to reverse debugging.
PyPy has started to implement RevDB, which supports reverse debugging.
It is (as of Feb 2017) still at an alpha stage, only supports Python 2.7, only works on Linux or OS X, and requires you to build Python yourself from a special revision. It's also very slow and uses a lot of RAM. To quote the Bitbucket page:
Note that the log file typically grows at a rate of 1-2 MB per second. Assuming size is not a problem, the limiting factors are:
Replaying time. If your recorded execution took more than a few minutes, replaying will be painfully slow. It sometimes needs to go over the whole log several times in a single session. If the bug occurs randomly but rarely, you should run recording for a few minutes, then kill the process and try again, repeatedly until you get the crash.
RAM usage for replaying. The RAM requirements are 10 or 15 times larger for replaying than for recording. If that is too much, you can try with a lower value for MAX_SUBPROCESSES in _revdb/process.py, but it will always be several times larger.
Details are on the PyPy blog and installation and usage instructions are on the RevDB bitbucket page.
Reverse debugging (returning to a previously recorded application state, or backwards single-stepping debugging) is generally an assembly- or C-level debugger feature. E.g. gdb can do it:
https://sourceware.org/gdb/wiki/ReverseDebug
Bidirectional (or reverse) debugging
Reverse debugging is utterly complex and may carry a performance penalty of 50,000x. It also requires extensive support from the debugging tools. The Python virtual machine does not provide reverse-debugging support.
If you are interactively evaluating Python code, I suggest trying IPython Notebook, which provides HTML-based interactive Python shells. You can easily write your code and mix and match the order of execution. There is no pdb debugging support, though. There is ipdb, which provides better history and search facilities for entered debugging commands, but it doesn't do direct backwards jumps either, as far as I know.
Though you may not be able to reverse the code execution in time, the next best thing pdb has is stack frame jumps.
Use w to see where you are in the stack (the bottom is the newest frame), and u(p) or d(own) to traverse the stack and reach the frame whose function call stepped you into the current frame.
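A hypothetical session, stopped inside helper() after it was called from main():
(Pdb) w
  /tmp/demo.py(9)<module>()
-> main()
  /tmp/demo.py(6)main()
-> helper()
> /tmp/demo.py(3)helper()
-> x = 1 / 0
(Pdb) u
> /tmp/demo.py(6)main()
-> helper()
(Pdb) d
> /tmp/demo.py(3)helper()
-> x = 1 / 0
Here u lands in main()'s frame, where its locals can be inspected, and d drops back into helper().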
