Python parallel, semaphore leak warning and abort without traceback

I am running a parallelized grid search in Python using joblib.Parallel. My script is relatively straightforward:
# Imports for data and classes...
Parallel(n_jobs=n_jobs)(
    delayed(biz_model)(
        ...
    )
    for ml_model_params in grid
    for past_horizon in past_horizons
)
When I run it on my local machine it seems to run fine, though I can only test it on small datasets for memory reasons. Yet when I try to run it on a remote Oracle Linux server, it begins some runs and after a while outputs:
/u01/.../resources/python/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Aborted!
I tried to reproduce it locally, and with small experiments it does run. The unparallelized script also runs, and the number of jobs (either low or high) doesn't prevent the bug from happening.
So my question is: given that there is no traceback, is there a way to make joblib or Parallel more verbose? I just cannot figure out where to look for possible failure reasons without a traceback. Obviously, if some possible reason for the abort can be inferred from just this output (and I fail to grasp it), I would very much appreciate the pointer.
Thanks in advance.

Using a logger, catching the exception, logging it, flushing the logs and raising it again usually does the trick:
# Imports for data and classes...
# Create a logger
try:
    Parallel(n_jobs=n_jobs)(
        delayed(biz_model)(
            ...
        )
        for ml_model_params in grid
        for past_horizon in past_horizons
    )
except BaseException as e:
    logger.exception(e)
    # use a loop here if you have more than one handler
    logger.handlers[0].flush()
    raise e
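As for making Parallel itself more verbose: joblib's Parallel also takes a verbose argument, so a quick way to see how far the grid search gets before the abort is something like this sketch, mirroring the call above:
from joblib import Parallel, delayed

# verbose values above 10 make joblib report the completion of every task,
# which at least shows how far the run got before the abort
Parallel(n_jobs=n_jobs, verbose=50)(
    delayed(biz_model)(
        ...
    )
    for ml_model_params in grid
    for past_horizon in past_horizons
)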

Guaranteeing that destructors are called on process termination

After reading A LOT of data on the subject I still couldn't find any actual solution to my problem (there might not be any).
My problem is as follows:
In my project I have multiple drivers working with various hardware (IO managers, programmable loads, power supplies and more).
Initializing the connection to this hardware is costly (in time), and I can't open and then close the connection for every communication iteration between us.
Meaning I can't do this (assuming programmable_load implements __enter__/__exit__):
# start of code...
with programmable_load(args) as programmable_load_instance:
    programmable_load_instance.do_something()
# rest of code...
So I went for a different solution:
class programmable_load():
    def __init__(self):
        self.handler = handler_creator()

    def close_connection(self):
        self.handler.close_connection()
        self.handler = None

    def __del__(self):
        if self.handler is not None:
            self.close_connection()
For obvious reasons I don't 'trust' the destructor to actually get called, so I explicitly call close_connection() when I want to end my program (for all drivers).
The problem happens when I abruptly terminate the process, for example when I run via debug mode and quit debugging.
In these cases the process terminates without running through any destructors.
I understand that the OS will clear all memory unused at this point, but is there any way to clear the memory in an organized manner?
And if not, is there a way to make the quit-debugging action pass through a certain set of functions? Does the Python process know it got a quit-debugging event, or does it treat it as a normal termination?
Operating system: Windows
According to this documentation:
If a process is terminated by TerminateProcess, all threads of the
process are terminated immediately with no chance to run additional
code.
(Emphasis mine.) This implies that there is nothing you can do in this case.
As detailed here, signals don't work very well on MS Windows.
As was mentioned in a comment, you could use atexit to do the cleanup. But that only works if the process is asked to close (e.g. a QUIT signal on Linux) and not just killed (as is likely the case when stopping the debugging session). Similarly, if you force your computer to turn off (e.g. long-press the power button or remove power), it won't be called either. There is no 'solution' to that, for obvious reasons: your program can't expect to be called when the power suddenly goes off or when it is forcefully killed. The point of forcefully killing a process is to definitely kill it now; if it first called your clean-up code, you could delay that, which defeats the purpose. That is why there are signals to ask your process to stop. This is not Python specific; the same concept applies across operating systems.
Bonus (design suggestion, not a solution): I would argue that you can still make use of a context manager (using with). Your problem is not unique; database connections are usually kept alive for longer as well. It is a question of scope. Move the context further up, to the application level. Then it is clear what the boundary is and you don't need any magic (you are probably also aware of contextlib.contextmanager to make that a breeze).
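For instance, a minimal sketch of that idea, reusing the programmable_load class from the question; open_programmable_load and application are made-up names:
from contextlib import contextmanager

@contextmanager
def open_programmable_load():
    # Open the expensive connection once and guarantee close_connection()
    # runs when the application-level block exits (normally or via exception).
    instance = programmable_load()
    try:
        yield instance
    finally:
        instance.close_connection()

def application(load):
    # made-up stand-in for the whole program that needs the device
    load.do_something()

if __name__ == "__main__":
    with open_programmable_load() as load:
        application(load)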
I haven't tested this properly as I don't have WingIDE installed over here, so I can't guarantee it will work, but what about using SetConsoleCtrlHandler? For instance, try something like this:
import os
import sys

import win32api

if __name__ == "__main__":

    def callback(sig, func=None):
        print("Exit handler called!")

    try:
        win32api.SetConsoleCtrlHandler(callback, True)
    except Exception as e:
        print("Captured exception", e)
        sys.exit(1)

    print("Press to quit")
    input()
    print("Bye!")
It'll be able to handle CTRL+C and CTRL+BREAK signals.

Python -- when to choose logging.error() over raise Exception()?

I am using logging throughout my code for easier control over logging level. One thing that's been confusing me is
when do you prefer logging.error() to raise Exception()?
To me, logging.error() doesn't really make sense, since the program should stop when it encounters an error, like what raise Exception() does.
In what scenarios do we put up an error message with logging.error() and let the program keep running?
Logging is merely there to leave a trace of events for later inspection, but it does absolutely nothing to influence the program execution; raising an exception allows you to signal an error condition to callers higher up and let them handle it. Typically you use both together; it's not an either-or.
That purely depends on your process flow, but it basically boils down to whether you want to deal with an encountered error yourself (and how), propagate it to the user of your code, or both.
logging.error() is typically used to log when an error occurs - that doesn't mean your code should halt and raise an exception, as there are recoverable errors. For example, your code might have been attempting to load a remote resource - you can log an error and try again at a later time without raising an exception.
By the way, you can use logging to log an exception in full (with a stack trace) by utilizing logging.exception().
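For instance, a small sketch of that combination; fetch_resource, MAX_RETRIES and load_remote_resource are made-up names:
import logging

logger = logging.getLogger(__name__)
MAX_RETRIES = 3  # made-up retry budget

def load_remote_resource(url):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch_resource(url)  # hypothetical helper that may fail
        except ConnectionError:
            # Recoverable: log it (with traceback) and try again.
            logger.exception("Attempt %d/%d failed for %s", attempt, MAX_RETRIES, url)
    # Not recoverable here: log it and propagate to the caller to decide.
    logger.error("Giving up on %s after %d attempts", url, MAX_RETRIES)
    raise RuntimeError("could not load %s" % url)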

Android's MonkeyRunner occasionally throws exceptions

I am running an automated test using an Android emulator, driving an app with a Monkey script written in Python.
The script copies files onto the emulator, clicks buttons in the app and reacts depending on the activities that the software triggers during its operation. The script is supposed to run this cycle a few thousand times, so I have it in a loop in which I run the adb tool to copy the files, start the activities, and see how the software reacts by calling the getProperty method on the device with the parameter 'am.current.comp.class'.
So here is a very simplified version of my script:
for target in targets:
    androidSDK.copyFile(emulatorName, target, '/mnt/sdcard')
    # Run the component
    device.startActivity(component='com.myPackage/com.myPackage.myactivity')
    while 1:
        if device.getProperty('am.current.comp.class') == 'com.myPackage.anotheractivity':
            time.sleep(1)  # allow the screen to display the new activity before I click on it
            device.touch(100, 100, 'DOWN_AND_UP')
            # Log the result of the operation somewhere
            break
        time.sleep(0.1)
(androidSDK is a small class I've written that wraps some utility functions to copy and delete files using the adb tool).
On occasions the script crashes with one of a number of exceptions, for instance (I am leaving out the full stack trace)
[com.android.chimpchat.adb.AdbChimpDevice]com.android.ddmlib.ShellCommandUnresponsiveException
or
[com.android.chimpchat.adb.AdbChimpDevice] Unable to get variable: am.current.comp.class
[com.android.chimpchat.adb.AdbChimpDevice]java.net.SocketException: Software caused connection abort: socket write error
I have read that sometimes the socket connection to the device becomes unstable and may need a restart (adb start-server and adb kill-server come in useful).
The problem I'm having is that the tools are throwing Java exceptions (Monkey runs in Jython), but I am not sure how those can be trapped from within my Python script. I would like to be able to determine the exact cause of the failure inside the script and recover the situation so I can carry on with my iterations (by re-establishing the connection, for instance - would re-initialising my device with another call to MonkeyRunner.waitForConnection be enough?).
Any ideas?
Many thanks,
Alberto
EDIT. I thought I'd mention that I have discovered that it is possible to catch Java-specific exceptions in a Jython script, should anyone need this:
from java.net import SocketException
...
try:
    ...
except SocketException:
    ...
It is possible to catch Java-specific exceptions in a Jython script:
from java.net import SocketException
...
try:
    ...
except SocketException:
    ...
(Taken from OP's edit to his question)
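Building on that, a rough sketch of one way to recover and carry on with the loop; the retry budget and the reconnect via MonkeyRunner.waitForConnection are untested assumptions taken from the question:
from java.net import SocketException
from com.android.monkeyrunner import MonkeyRunner

for target in targets:
    for attempt in range(3):  # arbitrary retry budget
        try:
            androidSDK.copyFile(emulatorName, target, '/mnt/sdcard')
            device.startActivity(component='com.myPackage/com.myPackage.myactivity')
            # ... rest of the iteration from the question ...
            break  # this target succeeded, move on to the next one
        except SocketException:
            # The adb connection dropped: re-attach to the device and retry this target
            device = MonkeyRunner.waitForConnection()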
This worked for me:
device.shell('exit')  # Exit the shell

Why is my threading/multiprocessing python script not exiting properly?

I have a server script that I need to be able to shut down cleanly. While testing the usual try..except statements I realized that Ctrl-C didn't work the usual way. Normally I'd wrap long-running tasks like this:
try:
    ...
except KeyboardInterrupt:
    # close the script cleanly here
so the task could be shut down cleanly on Ctrl-C. I have never run into any problems with this before, but somehow when I hit Ctrl-C while this particular script is running, the script just exits without catching the Ctrl-C.
The initial version was implemented using Process from multiprocessing. I rewrote the script using Thread from threading, but the same issue occurs there. I have used threading many times before, but I am new to the multiprocessing library. Either way, I have never experienced this Ctrl-C behavior before.
I have always implemented sentinels etc. to close down Queues and Thread instances in an orderly fashion, but this script just exits without any response.
Lastly, I tried overriding signal.SIGINT as well, like this:
def handler(signal, frame):
    print 'Ctrl+C'

signal.signal(signal.SIGINT, handler)
...
Here Ctrl-C was actually caught, but the handler doesn't execute; it never prints anything.
Besides the threading / multiprocessing aspect, parts of the script contain C++ SWIG objects. I don't know if that has anything to do with it. I am running Python 2.7.2 on OS X Lion.
So, a few questions:
What's going on here?
How can I debug this?
What do I need to learn in order to understand the root cause?
PLEASE NOTE: The internals of the script are proprietary, so I can't give code examples. I am, however, very willing to receive pointers so I can debug this myself. I am experienced enough to figure it out if someone could point me in the right direction.
EDIT: I started commenting out imports etc to see what caused the weird behavior, and I narrowed it down to an import of a C++ SWIG library. Any ideas why importing a C++ SWIG library 'steals' Ctrl-C? I am not the author of the guilty library however and my SWIG experience is limited so don't really know where to start...
EDIT 2: I just tried the same script on a Windows machine, and in Windows 7 the Ctrl-C is caught as expected. I'm not really going to bother with the OS X part; the script will be run in a Windows environment anyway.
This might have to do with the way Python manages threads, signals and C calls.
In short - Ctrl-C cannot interrupt C calls, since the implementation requires that a Python thread handle the signal, and not just any thread, but the main thread (which is often blocked, waiting for other threads).
In fact, long operations can block everything.
Consider this:
>>> nums = xrange(100000000)
>>> -1 in nums
False (after ~ 6.6 seconds)
>>>
Now try hitting Ctrl-C (uninterruptible!)
>>> nums = xrange(100000000)
>>> -1 in nums
^C^C^C (nothing happens, long pause)
...
KeyboardInterrupt
>>>
The reason Ctrl-C doesn't work with threaded programs is that the main thread is often blocked on an uninterruptible thread-join or lock (e.g. any 'wait', 'join', or just a plain empty 'main' thread, which in the background causes Python to 'join' on any spawned threads).
Try to insert a simple
while True:
    time.sleep(1)
in your main thread.
If you have a long running C function, do signal handling in C-level (May the Force be with you!).
This is largely based on David Beazley's video on the subject.
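As a minimal sketch of that pattern (worker is a made-up stand-in for the real task): keeping the main thread in a short, interruptible sleep loop instead of blocking in join() lets the KeyboardInterrupt actually land.
import threading
import time

def worker():
    # made-up stand-in for the real long-running task
    while True:
        time.sleep(1)

t = threading.Thread(target=worker)
t.daemon = True          # daemon threads die with the main thread
t.start()

try:
    while True:
        time.sleep(1)    # keep the main thread interruptible instead of joining
except KeyboardInterrupt:
    print('Shutting down cleanly')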
It exits because something else is likely catching the KeyboardInterrupt and then raising some other exception, or simply returning None. You should still get a traceback to help debug. You need to capture the stderr output or run your script with the -i command-line option so you can see the traceback. Also, add another except block to catch all other exceptions.
If you suspect the C++ function call to be catching the CTRL+C, try catching its output. If the C function is not returning anything, then there isn't much you can do except ask the author to add some exception handling, return codes, etc.
try:
    # Doing something proprietary ...
    # catch the function call output
    result = yourCFuncCall()
    # raise an exception if it's not what you expected
    if result is None:
        raise ValueError('Unexpected Result')
except KeyboardInterrupt:
    print('Must be a CTRL+C')
    return
except:
    print('Unhandled Exception')
    raise
How about atexit?
http://docs.python.org/library/atexit.html#module-atexit
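A minimal atexit sketch (cleanup is a placeholder); note that registered handlers run on normal interpreter exit, not when the process is killed outright:
import atexit

def cleanup():
    # placeholder for closing queues, connections, files, ...
    print('atexit handler ran')

atexit.register(cleanup)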

How can I retrospectively debug a python exception

I'm looking for a way to debug a python exception "retrospectively". Essentially if my program raises an exception that isn't handled, I want it to save off the program state so I can go back later and debug the problem.
I've taken a look at the pdb docs, and it seems that you can do this, but only if you can interact with the program at the point of the exception. This won't work for me as the program will be run in the background (without a controlling terminal).
My first (doomed!) approach was to put a try/except block at the highest level of my program and, in the except block, extract the traceback object from the current exception and write it to disk using pickle. I planned to then write a separate program that would unpickle the object and use pdb.post_mortem to debug the crashed program. But traceback objects aren't picklable - and I wouldn't expect that to work anyway, as it wouldn't save off the entire program state.
As far as I know, there isn't any way to do what you're asking. That said, it sounds like you might be looking for a remote debugger. There are a couple of options:
rconsole - This isn't really a debugger, but it allows you to get an interactive prompt inside another process. This can be useful for debugging purposes. I haven't tried this, but it looks relatively straightforward.
rpdb2's embedded debugger - This lets you start a debugger and then connect to it from another shell.
What you can do is use twisted.python and write the traceback to a file; it gives you an exact traceback, including the exception.
At the moment the exception is caught, before the stack is unwound, the state is available for inspection with the inspect module: http://docs.python.org/2/library/inspect.html
Generically you would use inspect.getinnerframes on your traceback object. The local variables in each stack frame are available as .f_locals so you can see what they are.
The hard part is serializing all of them correctly: depending on what types you have in the local scope you may or may not be able to pickle them, dump them to JSON, or whatever.
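A sketch of that approach, installing a sys.excepthook that dumps a repr() of every frame's locals to a JSON file; crash_state.json and dump_state are arbitrary names, and repr() is just the lowest-common-denominator serializer:
import inspect
import json
import sys

def dump_state(exc_type, exc_value, tb):
    frames = []
    for frame_info in inspect.getinnerframes(tb):
        frame = frame_info[0]
        frames.append({
            'file': frame_info[1],
            'line': frame_info[2],
            'function': frame_info[3],
            # repr() sidesteps unpicklable / non-JSON-serializable objects
            'locals': {name: repr(value) for name, value in frame.f_locals.items()},
        })
    with open('crash_state.json', 'w') as fh:
        json.dump({'error': repr(exc_value), 'frames': frames}, fh, indent=2)

sys.excepthook = dump_state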
You could create an entirely separate execution environment in the top level:
myEnv = {}
myEnv.update(globals())
Then execute your code within that execution environment. If an exception occurs you have the traceback (stack) and all the globals, so you can pretty well reconstruct the program state.
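For instance, a rough sketch of that idea; my_script.py is a placeholder for the real entry point, and dumping repr() of the globals is just one crude way to persist the state:
import sys
import traceback

my_env = {}
my_env.update(globals())

try:
    # run the program inside the dedicated environment
    exec(open('my_script.py').read(), my_env)
except Exception:
    traceback.print_exc()
    # my_env still holds every global at the moment of the crash;
    # a repr() dump is a crude but serializable snapshot of that state
    with open('crash_globals.txt', 'w') as fh:
        for name, value in my_env.items():
            fh.write('%s = %r\n' % (name, value))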
I hope this helps (it helped me):
import logging, traceback

_logger = logging.getLogger(__name__)

try:
    something_bad()
except Exception as error:
    _logger.exception("Oh no!")  # Logs original traceback
    store_exception_somewhere(error)
Also, in Python 3 there are a few new options like raise new_exc from original_exc
or raise OtherException(...).with_traceback(tb).
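For example (the exception type and the lookup are hypothetical):
try:
    value = config['missing_key']  # hypothetical lookup that fails
except KeyError as original_exc:
    # Python 3: chain the new exception to the one that caused it,
    # so both tracebacks show up in the log
    raise RuntimeError('configuration is incomplete') from original_exc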
