This is similar to a few questions on the internet, but this code seems to work for a while instead of failing instantly, which suggests to me that it may not just be a host-file error.
I am running code that spawns multiple MPI processes, which then each enter a loop; within the loop, some data is sent to them with bcast and scatter, and data is then gathered back from those processes. This runs the algorithm and saves data. It then disconnects from the spawned comm and creates another set of spawns on the next loop iteration. This works for a few minutes, then after around 300 files it spits this out:
[T7810:10898] [[50329,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 758
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error.
More information may be available above.
I am testing this on a local machine (a single node), but the end deployment will have multiple nodes that each spawn their own MPI processes within that node. I am trying to figure out whether this is an artifact of testing everything on my local machine and will work fine on the HPC, or whether it is a more serious error.
How can I debug this? Is there a way to print out what MPI is trying to do while it runs, or to monitor MPI, such as a verbose mode?
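For reference, the loop described above looks roughly like this; this is a hypothetical sketch (worker.py, the data, and the process count are placeholders, not my actual code):

# Hypothetical reconstruction of the spawn/compute/disconnect loop (placeholders only).
import sys
from mpi4py import MPI

NPROCS = 4  # number of spawned workers per iteration (placeholder)

for i in range(1000):  # roughly one iteration per output file
    intercomm = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=NPROCS)
    intercomm.bcast({'iteration': i}, root=MPI.ROOT)                     # shared parameters
    intercomm.scatter([[i, r] for r in range(NPROCS)], root=MPI.ROOT)    # one chunk per worker
    results = intercomm.gather(None, root=MPI.ROOT)                      # collect results back
    # ... save the gathered results to a file here ...
    intercomm.Disconnect()  # drop the intercommunicator before spawning the next set
# Each worker.py would call MPI.Comm.Get_parent(), the matching bcast/scatter/gather
# with root=0, and then Disconnect().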
Since mpi4py is so close to MPI (logically, if not in terms of lines of code), one way to debug this is to write the C version of your program and see if the problem persists. When you report this bug to Open MPI, they are going to want a small C test case anyway.
I am new to Python programming and am having a problem with a multithreaded program (using the "threading" module) that runs fine at the beginning, but after a while starts repeatedly printing "IOStream.flush timed out" errors.
I am not even sure how to debug such an error, because I don't know what line is causing it. I read a bit about this error and saw that it might be related to memory consumption, so I tried profiling my program using a memory profiler on the Spyder IDE. Nothing jumped out at me, however (although I admit that I am not sure what to look for when it comes to Python memory leaks).
A few more observations:
I have an outer loop that runs my function over a large number of files. The files are just numeric data with the same formatting (they are quite large, though, and there is download latency, which is why I have made my application multithreaded so that each thread works on different files). If I have a long list of files, the problem occurs. If I shorten the list, the program concludes without a problem. I am not sure why that is, although if it is some kind of memory leak, then I would assume that the longer the program runs, the more the problem grows, until it reaches some kind of memory limit.
Normally, I use 128 threads in the program (the overall structure is roughly what is sketched after these observations). If I reduce the number of threads to 48 or fewer, the program works fine and completes correctly. So the problem is clearly caused by the multithreading (I'm using the "threading" module), which makes it a bit trickier to debug and figure out what is causing it. Problems seem to start somewhere around 64 threads.
The program never explicitly crashes out. Once it gets to the point where it has this error, it just keeps repeatedly printing "IOStream.flush timed out". I have to close the Spyder IDE to stop it (Restart kernel doesn't work).
Right before this error happens, the program appears to stall; at least, no more prints appear on the console (the various threads are all printing debug information to the screen). The last lines printed are standard debugging/status print statements that work normally when the number of threads is reduced or the number of files to process is decreased.
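For reference, the structure described in these observations is roughly the following; this is a hypothetical sketch (the file list, the download/processing function, and the thread count are placeholders, not my actual code):

# Hypothetical sketch of the setup described above (all names are placeholders).
import queue
import threading

N_THREADS = 128  # works at 48 or fewer, fails somewhere above ~64
file_list = ['file_%05d.dat' % i for i in range(5000)]  # the long list of files

def download_and_process(path):
    pass  # placeholder: download the file and crunch the numeric data

def worker(q):
    while True:
        try:
            path = q.get_nowait()
        except queue.Empty:
            return
        print('processing', path)  # the status prints that eventually stop appearing
        download_and_process(path)
        q.task_done()

q = queue.Queue()
for path in file_list:
    q.put(path)
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()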
I have no idea how to get to the bottom of this problem. Any suggestions on how to debug it would be much appreciated. Thanks in advance!
Specs:
Python 3.8.8
Spyder 4.2.5
Windows 10
I am running a multi-process (and multi-threaded) Python script on Debian Linux. One of the processes repeatedly crashes after 5 or 6 days, and it is always the process with the same, unique workload that crashes. There are no entries in syslog about the crash; the process simply disappears silently. It behaves completely normally and produces normal results, then suddenly stops.
How can I instrument the rogue process? Increasing the log level will produce large amounts of logs, so that's not my preferred option.
I used good old log analysis to determine what happens when the process fails:
increased log level of the rogue process to INFO after 4 days
monitored the application for the rogue process failing
pin-pointed the point in time of the failure in syslog
analysed syslog at that time
I found the following error at that time; the first row is the last entry made by the rogue process (just before it fails), and the second row is the one pointing to the underlying error.
In this case there is a problem with the pyzmq bindings or the zeromq library. I'll open a ticket with them.
Aug 10 08:30:13 rpi6 python[16293]: 2021-08-10T08:30:13.045 WARNING w1m::pid 16325, tid 16415, taking reading from sensors with map {'000005ccbe8a': ['t-top'], '000005cc8eba': ['t-mid'], '00000676e5c3': ['t
Aug 10 08:30:14 rpi6 python[16293]: Too many open files (bundled/zeromq/src/ipc_listener.cpp:327)
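If you hit a similar "Too many open files" failure, one low-overhead way to instrument the suspect process (rather than raising the overall log level) is to log its file-descriptor count periodically. A minimal sketch, assuming Linux and that it runs inside the suspect process; the interval and logger setup are placeholders:

# Hypothetical fd-count monitor (Linux-only; reads /proc/<pid>/fd).
import logging
import os
import threading
import time

log = logging.getLogger(__name__)

def log_fd_count(interval=60):
    pid = os.getpid()
    while True:
        n_fds = len(os.listdir('/proc/%d/fd' % pid))  # number of open descriptors
        log.warning('open file descriptors: %d', n_fds)
        time.sleep(interval)

# Start as a daemon thread inside the process under suspicion.
threading.Thread(target=log_fd_count, daemon=True, name='fd-monitor').start()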
Hope this helps someone in the future.
When I run any python script that doesn't even contain any code or imports that could access the internet in any way, two pythonw.exe processes pop up in my resource monitor under network activity. One of them is always sending more than receiving, while the other has the same activity but with the amounts of sending and receiving reversed. The amount of overall activity depends on the file size, regardless of how many lines are commented out. Even a blank .py document will create network activity of about 200 kb/s. The activity drops from its peak, which is as high as 15,000 kb/s for a file with 10,000 lines, to around zero after around 20 seconds, and then the processes quit on their own. The actual script has finished running long before the network processes stop.
Because the activity is dependent on file size I'm suspicious that every time I run a python script, the whole thing is being transmitted to a server somewhere else in the world.
Is this something that could be built into Python, a virus that's infecting my computer, or just something that Python is supposed to do and is innocent activity?
Even if you don't have an answer, if you could check whether this activity affects your own installation of Python, that would be great. Thanks!
EDIT:
Peter Wood: to start the process, just run any Python script from the editor; it runs on its own, at least for me. I'm on 2.7.8.
Robert B, I think you may be right, but why would the communication continue after the script has finished running?
I have 2 Python scripts I'm trying to run side by side. However, each of them has to open, close, and reopen independently of the other. Also, one of the scripts runs inside a shell script.
Flaskserver.py & ./pyinit.sh
Flaskserver.py is just a Flask server that needs to be restarted every now and again to load a new page (I can't define all the pages, as the HTML is interchangeable). The pyinit runs as xinit ./pyinit.sh (it's Selenium WebDriver Python code).
So when the Flask server changes and restarts, ./pyinit.sh needs to wait about 20 seconds and then restart as well.
Either one of these can produce errors, so I need to be able to check whether the Flask server has an error before restarting ./pyinit.sh. If ./pyinit.sh errors, I need to set the Flask server to a default value and then relaunch both of them.
I know a little about subprocess, but I'm unsure how it can deal with errors and with stop/start code.
Rather than using subprocess, I would recommend creating a separate thread for each of your processes using multithreading.
Multithreading will not solve the problem if global variables are colliding; running them as different scripts might solve that, but then you might collide on something else, like a log file.
Now, if you keep both processes running from a single parent process that takes care of keeping them separated and assigning different global variables where necessary, you should be able to keep better control. Using things like join and Lock from the threading library will also ensure that they don't collide, and it should be easy to put one process to sleep while the other is running (as with the 20-second wait).
You can keep a thread list as a global variable, as well as your lock. I have done this successfully with CherryPy's server, for example. For more details about multithreading, look into the question I linked above; it's very well explained.
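A hypothetical sketch of that supervisor idea, with run_flask_server() and run_pyinit() standing in for the two scripts and the 20-second wait taken from the question; this illustrates the pattern and is not a drop-in implementation:

# Hypothetical supervisor keeping both tasks in one parent process (placeholder functions).
import threading
import time

restart_lock = threading.Lock()  # so the two restarts never overlap

def run_flask_server():
    pass  # placeholder: run the Flask app; raise (or return) when it needs a restart

def run_pyinit():
    pass  # placeholder: run the selenium-webdriver code; raise on error

def supervise(name, target, delay_before_restart=0):
    while True:
        try:
            target()
        except Exception as exc:
            print('%s failed: %r' % (name, exc))
        with restart_lock:  # only one task is restarted at a time
            time.sleep(delay_before_restart)
            print('restarting %s' % name)

threads = [
    threading.Thread(target=supervise, args=('flask', run_flask_server)),
    threading.Thread(target=supervise, args=('pyinit', run_pyinit, 20)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()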
Our SGE cluster setup requires there to be a delay between the controller and the engines starting. If this delay is not there, some of the servers use "old" ipcontroller-client.json files and attempt to connect to previous (and no longer running) controllers. This is an NFS "feature", so to remedy it, I set c.IPClusterStart.delay = 30 in the ipcluster_config.py file and things work well: the controller gets submitted to SGE, has enough time to start and write its json files, and then the engines can start correctly against the newly running controller. However, I'd also like to be able to start the cluster from the notebook. Unfortunately, it appears that this delay is not used there; the controller and engines start up at the same time (as seen with watch qstat), some of the engines connect (because they pick up the new settings from the json file) and some do not (because of NFS).
I ran an strace on the notebook and saw that it's using sge_controller and sge_engines scripts (created by the notebook when you press start) to start these processes.
I'm wondering if there's any way to implement a delay here, as well. It's starting the controller and engines the right way (SGE) so I know it's reading the ipcluster_config.py.
I've Googled around and searched this site, with no luck. Hoping maybe someone can shed some light on the deeper workings of this behavior.
Thanks,
Chris
Well, this is probably too late for the OP, but hopefully it helps someone.
If it is a timeout issue, just set c.EngineFactory.timeout and c.IPEngineApp.wait_for_url_file to some larger values.
If it is due to failure after the first run, it is probably due to lingering security files, which should be deleted (ipcontroller-engine.json and ipcontroller-client.json) from the relevant IPython profile, using IPython.utils.path.get_security_file to get the full paths. To automate this and make it somewhat less painful, this deletion step can be tacked on to the beginning of the same profile's ipcluster_config.py.
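To make that concrete, here is a hypothetical sketch of what could be tacked onto the top of the profile's ipcluster_config.py; the profile name and timeout values are placeholders, and get_security_file is the helper mentioned above (available in the IPython versions of that era), so exact behavior may vary with your version:

# Hypothetical snippet for the top of ipcluster_config.py (placeholder profile and values).
import os
from IPython.utils.path import get_security_file

for name in ('ipcontroller-engine.json', 'ipcontroller-client.json'):
    try:
        path = get_security_file(name, profile='sge')  # placeholder profile name
    except Exception:
        continue  # nothing to clean up for this profile
    if os.path.isfile(path):
        os.remove(path)  # delete the lingering connection file before a new run

# The timeout settings mentioned above would go in the engine-side config
# (likely the same profile's ipengine_config.py), e.g.:
# c.EngineFactory.timeout = 30           # placeholder value
# c.IPEngineApp.wait_for_url_file = 60   # placeholder value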
These changes alone were enough for me to get the cluster running with the notebook easily.
If neither of these solves the problem, there are some other thoughts (http://mail.scipy.org/pipermail/ipython-user/2011-November/008741.html).