I'm using python multiprocessing module to parallelize some computationally heavy tasks.
The obvious choice is to use a Pool of workers and then use the map method.
However, processes can fail. For instance, they may be silently killed for instance by the oom-killer. Therefore I would like to be able to retrieve the exit code of the processes launched with map.
Additionally, for logging purpose, I would like to be able to know the PID of the process launched to execute each value in the the iterable.
If you're using multiprocessing.Pool.map you're generally not interested in the exit code of the sub-processes in the pool, you're interested in what value they returned from their work item. This is because under normal conditions, the processes in a Pool won't exit until you close/join the pool, so there's no exit codes to retrieve until all work is complete, and the Pool is about to be destroyed. Because of this, there is no public API to get the exit codes of those sub-processes.
Now, you're worried about exceptional conditions, where something out-of-band kills one of the sub-processes while it's doing work. If you hit an issue like this, you're probably going to run into some strange behavior. In fact, in my tests where I killed a process in a Pool while it was doing work as part of a map call, map never completed, because the killed process didn't complete. Python did, however, immediately launch a new process to replace the one I killed.
That said, you can get the pid of each process in your pool by accessing the multiprocessing.Process objects inside the pool directly, using the private _pool attribute:
pool = multiprocessing.Pool()
for proc in pool._pool:
print proc.pid
So, one thing you could do to try to detect when a process had died unexpectedly (assuming you don't get stuck in a blocking call as a result). You can do this by examining the list of processes in the pool before and after making a call to map_async:
before = pool._pool[:] # Make a copy of the list of Process objects in our pool
result = pool.map_async(func, iterable) # Use map_async so we don't get stuck.
while not result.ready(): # Wait for the call to complete
if any(proc.exitcode for proc in before): # Abort if one of our original processes is dead.
print "One of our processes has exited. Something probably went horribly wrong."
break
result.wait(timeout=1)
else: # We'll enter this block if we don't reach `break` above.
print result.get() # Actually fetch the result list here.
We have to make a copy of the list because when a process in the Pool dies, Python immediately replaces it with a new process, and removes the dead one from the list.
This worked for me in my tests, but because it's relying on a private attribute of the Pool object (_pool) it's risky to use in production code. I would also suggest that it may be overkill to worry too much about this scenario, since it's very unlikely to occur and complicates the implementation significantly.
Related
I am relatively new to the multiprocessing world in python3 and I am therefore sorry if this question has been asked before. I have a script which, from a list of N elements, runs the entire analysis on each element, mapping each onto a different process.
I am aware that this is suboptimal, in fact I want to increase the multiprocessing efficiency. I use map() to run each process into a Pool() which can contain as many processes as the user specifies via command line arguments.
Here is how the code looks like:
max_processes = 7
# it is passed by command line actually but not relevant here
def main_function( ... ):
res_1 = sub_function_1( ... )
res_2 = sub_function_2( ... )
if __name__ == '__main__':
p = Pool(max_processes)
Arguments = []
for x in Paths.keys():
# generation of the arguments
...
Arguments.append( Tup_of_arguments )
p.map(main_function, Arguments)
p.close()
p.join()
As you see my process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions is multiprocessable. Can I map processes from those subfunctions, which map to the same pool where the main process runs?
No, you can't.
The pool is (pretty much) not available in the worker processes. It depends a bit on the start method used for the pool.
spawn
A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
fork
The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
However, that pool is unusable. The workers are created during the execution of the pool's __init__, hence the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results. Threads anyway don't make it into child processes via fork.
Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
Anyway, that pool object is a copy of the one in the main process.
forkserver
I could not test this start method
To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
You'll very probaly end up with a lot more maintainable code if you make the effort to adopt it for processing in a pool.
As an aside: I am not sure if allowing users to pass the number of workers via commandline is a good idea. I recommend to to give that value an upper boundary via os.cpu_count() at the very least.
I am using Dask to run a pool of tasks, retrieving results in the order they complete by the as_completed method, and potentially submitting new tasks to the pool each time one returns:
# Initial set of jobs
futures = [client.submit(job.run_simulation) for job in jobs]
pool = as_completed(futures, with_results=True)
while True:
# Wait for a job to finish
f, result = next(pool)
# Exit condition
if result == 'STOP':
break
# Do processing and maybe submit more jobs
more_jobs = process_result(f, result)
more_futures = [client.submit(job.run_simulation) for job in more_jobs]
pool.update(more_futures)
Here's my problem: The function job.run_simulation that I am submitting can sometimes hang for a long time, and I want to time out this function - kill the task and move on if the run time exceeds a certain time limit.
Ideally, I'd like to do something like client.submit(job.run_simulation, timeout=10), and have next(pool) return None if the task ran longer than the timeout.
Is there any way that Dask can help me time out jobs like this?
What I've tried so far
My first instinct was to handle the timeout independently of Dask within the job.run_simulation function itself. I've seen two types of suggestions (e.g. here) for generic Python timeouts.
1) Use two threads, one for the function itself and one for a timer. My impression is this doesn't actually work because you can't kill threads. Even if the timer runs out, both threads have to finish before the task is completed.
2) Use two separate processes (with the multiprocessing module), one for the function and one for the timer. This would work, but since I'm already in a daemon subprocess spawned by Dask, I'm not allowed to create new subprocesses.
A third possibility is to move the code block to a separate script that I run with subprocess.run and use the subprocess.run built in timeout. I could do this, but it feels like a worst-case fallback scenario because it would take a lot of cumbersome passing of data to and from the subprocess.
So it feels like I have to accomplish the timeout at the level of Dask. My one idea here is to create a timer as a subprocess at the same time as I submit the task to Dask. Then if the timer runs out, use Client.cancel() to stop the task. The problem with this plan is that Dask might wait for workers to free up before starting the task, and I don't want the timer running before the task is actually running.
Your assessment of the problem seems correct to me and the solutions you went through are the same that I would consider. Some notes:
Client.cancel is unable to stop a function from running if it has already started. These functions are running in a thread pool and so you run into the "can't stop threads" limitation. Dask workers are just Python processes and have the same abilities and limitations.
You say that you can't use processes from within a daemon process. One solution to this would be to change how you're using processes in one of the following ways:
If you're using dask.distributed on a single machine then just don't use processes
client = Client(processes=False)
Don't use Dask's default nanny processes, then your dask worker will be a normal process capable of using multiprocessing
Set dask's multiprocessing-context config to "spawn" rather than fork or forkserver
The clean way to solve this problem though is to solve it inside of your function job.run_simulation. Ideally you would be able to push this timeout logic down to that code and have it raise cleanly.
I am using Python's multiprocessing Process class for a project to handle a function in a separate process. My question is, what happens when does the function in a separate process do its job? Is it that the process remains idle, or is process killed by the end of the function? Also, will there be any issues with giving the process/function a heavy load?
The code block is like this:
p = Process(target=function, args=[status.json])
if not p.is_alive():
p.start()
p.join()
If you're using the Process class, then the process starts when you call start(), and terminates when the target function completes, or you call the Process object's terminate() method. The process does not consume execution resources after terminating, but it will remain in the process table as a zombie process until you call join().
Regarding your concerns about putting a "heavy load" on the new process, it shouldn't have any effect. The multiprocessing library launches full-blown independent processes, so they operate largely the same as your main process does. Now whether or not the Process class is really the appropriate solution to your problem is hard to say since "heavy load" can mean many different things. If your tasks are IO intensive, the threading module may be a better choice. If you are performing the same tasks on many different objects, process pools may be a more appropriate choice. But a "heavy load" in and of itself shouldn't cause problems.
In one of my Django views, I am calling a python script and getting its pid with:
from subprocess import Popen
p = Popen(['python', 'script.py'])
mypid = p.pid
When trying to find out if the process still is running from another page, I use the following function on mypid (thanks to this question):
def doesProcessExist(pid):
if pid < 0:
return False
try:
os.kill(pid, 0)
except OSError, e:
return e.errno == errno.EPERM
else:
return True
No matter how long I wait, the process still shows up as running. The only thing that stops it, is if I spawn a new python script process with Popen. Is there anyway I can fix this? I am not sure if this is caused by Django not closing python properly after the script is finished or something else. In Ubuntu's process status manager, the process shows up as [python] <defunct>.
--
The problem is true for all script.py I have tried. I am currently using one as simple as:
from time import sleep
sleep(5)
Really, what you're doing is wrong. When you use a high-level wrapper like a subprocess.Popen, you need to manage the process through that object. Just having the PID elsewhere isn't enough to manage it.
If you insist on dealing in PIDs instead of Popen objects, then you should use the low-level APIs in os.
Fortunately, you're not doing anything complicated, like creating pipes to talk to the child process. So, you can just launch it with your favorite spawn variant, then wait for it with waitpid or one of its variants.
I'm assuming you're doing this all in a single-process web server. If you're using a forking web server, where the other page could be in a different process, even using PIDs won't work. The parent process has to reap the child, not some other arbitrary process. If you want to make that work, you'll have to make things more complicated, and you're really going to have to learn about the Unix process model before anyone can explain it to you.
What you see is a zombie process. It doesn't keep running. It can't. It is dead. The only thing that is left is some info that allows for related processes to retrieve its status.
To find out whether a subprocess is alive without blocking, call p.poll(). If it returns None then the process is still alive, otherwise you can safely forget about it (it is already reaped by .poll()).
subprocess module calls _cleanup() function that reaps zombie processes inside Popen() constructor. So normally your script won't create many zombie processes anyway.
To see a list of zombie processes:
import os
#NOTE: don't use Popen() here
print os.popen(r"ps aux | grep Z | grep -v grep").read(),
Processes in Unix stick around until the parent waits for them. calling wait on the object returned by thepopen will wait for the process to be done and will wait for it so it goes away. Until you do that it will exist as a zombie process See this message for info on getting the process to go away in the background while your web server runs without waiting for it in a foreground thread/view.
So, let's say that you do
p = subprocess.Popen(...)
At some point you need to call
p.wait()
I've having the opposite problem of many Python users--my program is using too little CPU. I already got help in switching to multiprocessing to utilize all four of my work computer's cores, and I have seen real performance improvement as a result. But the improvement is somewhat unreliable. The CPU usage of my program seems to deteriorate as it continues to run--even with six processes running. After adding some debug messages, I discovered this was because some of the processes I was spawning (which are all supposed to run until completion) were dying prematurely. The main body of the method the processes run is a while True loop, and the only way out is this block:
try:
f = filequeue.get(False)
except Empty:
print "Done"
return
filequeue is populated before the creation of the subprocesses, so it definitely isn't actually empty. All the processes should exit at roughly the same time once it actually is empty. I tried adding a nonzero timeout (0.05) parameter to the Queue.get call, but this didn't fix the problem. Why could I be getting a Queue.empty exception from a nonempty Queue?
I suggest using filequeue.get(True) instead of filequeue.get(False). This will cause the queue to block until there are more elements.
It will, however, block forever after the final element has been processed.
To work around this, the main process could add a special "sentinel" object at the end of each queue. The workers would terminate upon seeing this special object (instead of relying on the emptiness of the queue).
I had a similar problem and found out from experimentation, not reading the docs, that even when the queue is non-empty, get(False) can still spuriously throw Empty. In my use case, the workers have to exit when they run out of work in the Queue, so get(True) is a non-option.
My solution was this: I found that if in the "except Empty:" block, I check that the Queue is indeed empty(), it works -- empty() will not return True unless the Queue is really empty.
I was using Python 2.7.