Python - weird behavior with multiprocessing - join does not execute

I am using the multiprocessing python module. I have about 20-25 tasks to run simultaneously. Each task will create a pandas.DataFrame object of ~20k rows. Problem is, all tasks execute well, but when it comes to "joining" the processes, it just stops. I've tried with "small" DataFrames and it works very well. To illustrate my point, I created the code below.
import pandas
import multiprocessing as mp
def task(arg, queue):
    DF = pandas.DataFrame({"hello":range(10)}) # try range(1000) or range(10000)
    queue.put(DF)
    print("DF %d stored" %arg)
listArgs = range(20)
queue = mp.Queue()
processes = [mp.Process(target=task,args=(arg,queue)) for arg in listArgs]
for p in processes:
    p.start()
for i,p in enumerate(processes):
    print("joining %d" %i)
    p.join()
results = [queue.get() for p in processes]
EDIT:
With DF = pandas.DataFrame({"hello":range(10)}) I have everything correct: "DF 0 stored" up to "DF 19 stored", same with "joining 0" to "joining 19".
However with DF = pandas.DataFrame({"hello":range(1000)}) the issue arises: while it is storing the DF, the joining step stops after "joining 3".
Thanks for the useful tips :)

This problem is explained in the docs, under Pipes and Queues:
Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
Using a manager would work, but there are a lot of easier ways to solve this:
1. Read the data off the queue first, then join the processes, instead of the other way around.
2. Manage the Queue manually (e.g., using a JoinableQueue and task_done).
3. Just use Pool.map instead of reinventing the wheel. (Yes, much of what Pool does isn't necessary for your use case, but it also isn't going to get in the way, and the nice thing is, you already know it works.)
I won't show the implementation for #1 because it's so trivial, or for #2 because it's such a pain, but for #3:
def task(arg):
    DF = pandas.DataFrame({"hello":range(1000)}) # try range(1000) or range(10000)
    return DF
with mp.Pool(processes=20) as p:
    results = p.map(task, range(20), chunksize=1)
(In 2.7, Pool may not work in a with statement; you can install the port of the later version of multiprocessing back to 2.7 off PyPI, or you can just manually create the pool, then close it in a try/finally, just as you would handle a file if it didn't work in a with statement...)
You may ask yourself, why exactly does it fail at this point, but work with smaller numbers—even just a little bit smaller?
A pickle of that DataFrame is just over 16K. (The list by itself is a little smaller, but if you try it with 10000 instead of 1000 you should see the same thing without Pandas.)
So, the first child writes 16K, then blocks until there's room to write the last few hundred bytes. But you're not pulling anything off the pipe (by calling queue.get) until after the join, and you can't join until they exit, which they can't do until you unblock the pipe, so it's a classic deadlock. There's enough room for the first 4 to get through, but no room for 5. Because you have 4 cores, most of the time, the first 4 that get through will be the first 4. But occasionally #4 will beat #3 or something, and then you'll fail to join #3. That would happen more often with an 8-core machine.
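The first option above (read the data off the queue, then join) can be sketched as follows. This is only an illustration, not the original poster's code: it uses plain lists instead of DataFrames so it runs without pandas, the names task and run_tasks are made up, and it pins the "fork" start method, an assumption that holds on Linux. With "spawn", pass a different start_method and keep the usual if __name__ == "__main__": guard.

```python
import multiprocessing as mp


def task(arg, queue, n_rows):
    # Stand-in payload: a list large enough to overflow a small pipe buffer.
    queue.put((arg, list(range(n_rows))))


def run_tasks(n_tasks=20, n_rows=10000, start_method="fork"):
    ctx = mp.get_context(start_method)  # assumption: "fork" is available (POSIX)
    queue = ctx.Queue()
    processes = [ctx.Process(target=task, args=(arg, queue, n_rows))
                 for arg in range(n_tasks)]
    for p in processes:
        p.start()
    # Drain the queue FIRST: one get() per process, so every child can flush
    # its buffered data to the pipe and exit.
    results = dict(queue.get() for _ in processes)
    # Only now is it safe to join.
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    results = run_tasks()
    print(len(results), "results collected")
```

Because each child puts exactly one item, one get() per process is enough here; with a variable number of items per child, a sentinel value would be needed instead.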

Related

How to get rid of zombie processes using torch.multiprocessing.Pool (Python)

I am using torch.multiprocessing.Pool to speed up my NN in inference, like this:
import functools
import torch.multiprocessing
from tqdm import tqdm
mp = torch.multiprocessing.get_context('forkserver')
def parallel_predict(predict_func, sequences, args):
    predicted_cluster_ids = []
    pool = mp.Pool(args.num_workers, maxtasksperchild=1)
    out = pool.imap(
        func=functools.partial(predict_func, args=args),
        iterable=sequences,
        chunksize=1)
    for item in tqdm(out, total=len(sequences), ncols=85):
        predicted_cluster_ids.append(item)
    pool.close()
    pool.terminate()
    pool.join()
    return predicted_cluster_ids
Note 1) I am using imap because I want to be able to show a progress bar with tqdm.
Note 2) I tried with both forkserver and spawn but no luck. I cannot use other methods because of how they interact (poorly) with CUDA.
Note 3) I am using maxtasksperchild=1 and chunksize=1 so for each sequence in sequences it spawns a new process.
Note 4) Adding or removing pool.terminate() and pool.join() makes no difference.
Note 5) predict_func is a method of a class I created. I could also pass the whole model to parallel_predict but it does not change anything.
Everything works fine except the fact that after a while I run out of memory on the CPU (while on the GPU everything works as expected). Using htop to monitor memory usage I notice that, for every process I spawn with pool, I get a zombie that uses 0.4% of the memory. They don't get cleared, so they keep using space. Still, parallel_predict does return the correct result and the computation goes on. My script is structured in a way that it does validation multiple times, so the next time parallel_predict is called the zombies add up.
This is what I get in htop:
Usually, these zombies get cleared after ctrl-c but in some rare cases I need to killall.
Is there some way I can force the Pool to close them?
UPDATE:
I tried to kill the zombie processes using this:
def kill(pool):
    import multiprocessing.pool
    import os
    import signal
    # stop repopulating new child
    pool._state = multiprocessing.pool.TERMINATE
    pool._worker_handler._state = multiprocessing.pool.TERMINATE
    for p in pool._pool:
        os.kill(p.pid, signal.SIGKILL)
    # .is_alive() will reap dead process
    while any(p.is_alive() for p in pool._pool):
        pass
    pool.terminate()
But it does not work. It gets stuck at pool.terminate()
UPDATE2:
I tried to use the initializer arg of Pool to catch signals like this:
def process_initializer():
    def handler(_signal, frame):
        print('exiting')
        exit(0)
    signal.signal(signal.SIGTERM, handler)
def parallel_predict(predict_func, sequences, args):
    predicted_cluster_ids = []
    with mp.Pool(args.num_workers, initializer=process_initializer, maxtasksperchild=1) as pool:
        out = pool.imap(
            func=functools.partial(predict_func, args=args),
            iterable=sequences,
            chunksize=1)
        for item in tqdm(out, total=len(sequences), ncols=85):
            predicted_cluster_ids.append(item)
        for p in pool._pool:
            os.kill(p.pid, signal.SIGTERM)
        pool.close()
        pool.terminate()
        pool.join()
    return predicted_cluster_ids
but again it does not free memory.

Ok, I have more insights to share with you. Indeed this is not a bug, it is actually the "supposed" behavior for the multiprocessing module in Python (torch.multiprocessing wraps it). What happens is that, although the Pool terminates all the processes, the memory is not released (given back to the OS). This is also stated in the documentation, though in a very confusing way.
In the documentation it says that
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue
but also:
A frequent pattern found in other systems (such as Apache, mod_wsgi, etc) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before being exiting, being cleaned up and a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user
but the "clean up" does NOT happen.
To make things worse I found this post in which they recommend using maxtasksperchild=1. This increases the memory leak, because this way the number of zombies grows with the number of data points to be predicted, and since pool.close() does not free memory they add up.
This is very bad if you are using multiprocessing for example in validation. For every validation step I was reinitializing the pool but the memory didn't get freed from the previous iteration.
The SOLUTION here is to move pool = mp.Pool(args.num_workers) outside the training loop, so the pool does not get closed and reopened, and therefore it always reuses the same processes. NOTE: again remember to remove maxtasksperchild=1 and chunksize=1.
I think this should be included in the best practices page.
BTW in my opinion this behavior of the multiprocessing library should be considered as a bug and should be fixed Python side (not Pytorch side)
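The fix described above (create the pool once and reuse it for every validation round) might look roughly like this. It is a sketch using the standard multiprocessing module with no CUDA involved; predict and run_validation_rounds are hypothetical names, and the ctx parameter is where a forkserver context from torch.multiprocessing would be passed in a real setup.

```python
import multiprocessing as mp


def predict(sequence):
    # Hypothetical stand-in for the real model call.
    return sum(sequence)


def run_validation_rounds(rounds, num_workers=2, ctx=None):
    # Assumption: the caller supplies a suitable context (e.g. forkserver
    # when CUDA is involved); otherwise the platform default is used.
    ctx = ctx or mp.get_context()
    all_predictions = []
    # Create the pool ONCE, outside the loop, so the same worker processes
    # are reused for every round instead of being respawned.
    with ctx.Pool(num_workers) as pool:
        for sequences in rounds:
            # imap yields results in input order, so it can still be
            # wrapped in a progress bar.
            all_predictions.append(list(pool.imap(predict, sequences)))
    return all_predictions
```

Neither maxtasksperchild nor chunksize is set, matching the note above. When run as a script, the call to run_validation_rounds belongs under an if __name__ == "__main__": guard.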

python: Why join keeps me waiting?

I want to do clustering on 10,000 models. Before that, I have to calculate the Pearson correlation coefficient for every pair of models. That's a large amount of computation, so I use multiprocessing to spawn processes, assigning the computing job to 16 CPUs. My code is like this:
import numpy as np
from multiprocessing import Process, Queue
def cc_calculator(begin, end, q):
    index=lambda i,j,n: i*n+j-i*(i+1)/2-i-1
    for i in range(begin, end):
        for j in range(i, nmodel):
            all_cc[i][j]=get_cc(i,j)
            q.put((index(i,j,nmodel),all_cc[i][j]))
def func(i):
    res=(16-i)/16.0  # float division; plain / truncates integers in Python 2
    res=res**0.5
    res=int(nmodel*(1-res))
    return res
nmodel=int(raw_input("Entering the number of models:"))
all_cc=np.zeros((nmodel,nmodel))
ncc=int(nmodel*(nmodel-1)/2)
condensed_cc=[0]*ncc
q=Queue()
mprocess=[]
for ii in range(16):
    begin=func(ii)
    end=func(ii+1)
    p=Process(target=cc_calculator,args=(begin,end,q))
    mprocess+=[p]
    p.start()
for x in mprocess:
    x.join()
while not q.empty():
    (ind, value)=q.get()
    ind=int(ind)
    condensed_cc[ind]=value
np.save("condensed_cc",condensed_cc)
where get_cc(i,j) calculates the correlation coefficient associated with models i and j. all_cc is an upper triangular matrix and all_cc[i][j] stores the cc value. condensed_cc is another version of all_cc. I'll process it to obtain condensed_dist for the clustering. The "func" function helps assign almost the same amount of computing to each CPU.
I ran the program successfully with nmodel=20. When I try to run it with nmodel=10,000, however, it seems that it never ends. I waited about two days, and the top command in another terminal window showed no process with the command "python" still running. But the program is still running and there is no output file. When I use Ctrl+C to force it to stop, the traceback points to the line x.join(). nmodel=40 ran fast but failed with the same problem.
Maybe this problem has something to do with q, because if I comment out the line q.put(...), it runs successfully. Or something like this:
q.put(...)
q.get()
It is also OK. But neither method gives a correct condensed_cc: they don't change all_cc or condensed_cc.
Another example with only one subprocess:
from multiprocessing import Process, Queue
def g(q):
    num=10**2
    for i in range(num):
        print '='*10
        print i
        q.put((i,i+2))
        print "qsize: ", q.qsize()
q=Queue()
p=Process(target=g,args=(q,))
p.start()
p.join()
while not q.empty():
    q.get()
It is OK with num=100 but fails with num=10,000. Even with num=100**2, it did print every i and every qsize. I cannot figure out why. Also, Ctrl+C causes a traceback pointing to p.join().
I want to say more about the size problem of the queue. The documentation introduces Queue as Queue([maxsize]), and it says about the put method: "...block if necessary until a free slot is available". All this makes one think that the subprocess is blocked because the queue has run out of space. However, as I mentioned before in the second example, the result printed on the screen shows an increasing qsize, meaning that the queue is not full. I add one line:
print q.full()
after the print size statement; it is always False for num=10,000 while the program is still stuck somewhere. To emphasize one thing: the top command in another terminal shows no process with the command python. That really puzzles me.
I'm using python 2.7.9.

I believe the problem you are running into is described in the multiprocessing programming guidelines: https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming
Specifically this section:
Joining processes that use queues
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the cancel_join_thread() method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
An example which will deadlock is the following:
from multiprocessing import Process, Queue
def f(q):
    q.put('X' * 1000000)
if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join() # this deadlocks
    obj = queue.get()
A fix here would be to swap the last two lines (or simply remove the p.join() line).
You might also want to check out the section on "Avoid Shared State".
It looks like you are using .join to avoid the race condition of q.empty() returning True before something is added to it. You should not rely on .empty() at all while using multiprocessing (or multithreading). Instead you should handle this by signaling from the worker process to the main process when it is done adding items to the queue. This is normally done by placing a sentinel value in the queue, but there are other options as well.
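A minimal sketch of the sentinel approach, under a few stated assumptions: None is never a legitimate result, the squaring is a placeholder for the real computation, the names worker and collect are invented for illustration, and the "fork" start method is available (as on Linux).

```python
import multiprocessing as mp

SENTINEL = None  # assumption: None is never a real result


def worker(begin, end, q):
    for i in range(begin, end):
        q.put((i, i * i))  # placeholder for the real computation
    q.put(SENTINEL)  # tell the parent this worker is finished


def collect(ranges, start_method="fork"):
    ctx = mp.get_context(start_method)  # assumption: POSIX "fork" is available
    q = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(b, e, q)) for b, e in ranges]
    for p in procs:
        p.start()
    results = {}
    finished = 0
    # Block on get() until every worker has sent its sentinel;
    # q.empty() is never consulted.
    while finished < len(procs):
        item = q.get()
        if item is SENTINEL:
            finished += 1
        else:
            key, value = item
            results[key] = value
    for p in procs:
        p.join()  # safe: the queue has already been drained
    return results


if __name__ == "__main__":
    print(len(collect([(0, 5000), (5000, 10000)])))
```

Each worker sends its sentinel after its last item, so once the parent has counted one sentinel per worker it has seen every result, and the join can no longer deadlock.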

Troubleshooting data inconsistencies with Python multiprocessing/threading

TL;DR: Getting different results after running code with threading and multiprocessing and single threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story; I have a bunch of data indexed into a Solr Collection (~250m items), all items in that collection have a sessionid. Some items can share the same session id. I am combing through the collection to extract all items that have the same session, massage the data a bit and spit out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments and the number of threads/processes. It then creates a queue from the input args, creates the processes/threads, and lets them work through it. I am not posting the actual processing code, since it generally runs fine single threaded without any issues, but I can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue,JoinableQueue
import time
import multiprocessing
import os
def proc_proc(func,data,threads,delay=10):
    if threads < 0:
        return
    q = JoinableQueue()
    procs = []
    for i in range(threads):
        thread = Process(target=proc_exec,args=(func,q))
        thread.daemon = True
        thread.start()
        procs.append(thread)
    for item in data:
        q.put(item)
    logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
    s = q.qsize()
    while s > 0:
        logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(delay)
    for p in procs:
        logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
        p.join(1)
    logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
    q.join()
    logging.debug(str(os.getpid()) + ' - *** Done')
def proc_exec(func,q):
    p = multiprocessing.current_process()
    logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
    while True:
        d = q.get()
        try:
            logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
            func(d)
            sys.stdout.flush()
            logging.debug(str(os.getpid()) + " - Marking Task as Done")
            q.task_done()
        except:
            logging.error(str(os.getpid()) + " - Exception in subprocess execution")
            logging.error(sys.exc_info()[0])
    logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue
def thread_proc(func,data,threads):
    if threads < 0:
        return "Thead Count not specified"
    q = Queue()
    for i in range(threads):
        thread = threading.Thread(target=thread_exec,args=(func,q))
        thread.daemon = True
        thread.start()
    for item in data:
        q.put(item)
    logging.debug('*** Main thread waiting')
    s = q.qsize()
    while s > 0:
        logging.debug("Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(1)
    logging.debug('*** Main thread waiting')
    q.join()
    logging.debug('*** Done')
def thread_exec(func,q):
    while True:
        d = q.get()
        #logging.debug("Working...")
        try:
            func(d)
        except:
            pass
        q.task_done()
I am running into problems with validating data after python runs under different multiprocessing/threading configs. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only with multiprocessing - 10 procs:
    Days Processed: 30
    Sessions Found: 3,507,475
    Sessions Processed: 3,514,496
    Files: 162,140
    Data Output: 1.9GB
Multiprocessing and multithreading - 10 procs, 10 threads:
    Days Processed: 30
    Sessions Found: 3,356,362
    Sessions Processed: 3,272,402
    Files: 424,005
    Data Output: 2.2GB
Just threading - 10 threads:
    Days Processed: 31
    Sessions Found: 3,595,263
    Sessions Processed: 3,595,263
    Files: 733,664
    Data Output: 3.3GB
Single process / no threading:
    Days Processed: 31
    Sessions Found: 3,595,263
    Sessions Processed: 3,595,263
    Files: 162,190
    Data Output: 1.9GB
These counts were gathered by grepping and counting entries in the log files (1 per main process). The first thing that jumps out is that days processed doesn't match. However, I manually checked the log files and it looks like a log entry was missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code to validate this code, just seems like a terrible waste of time, is there any alternative?

I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all code of relevance.
The issue might very well be
1. in the method you are using to read from solr or in preparing read data before feeding it to your workers.
2. in the architecture you have come up with for distributing the work among multiple processes.
3. in your logging infrastructure (as you have pointed out yourself).
4. in your analysis approach.
You have to go through all of these points, and given the complexity of the issue, surely nobody here will be able to identify the exact problems for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is your way to wait for the workers to finish. You seem not to be sure by which method you should wait for your workers, so you apply three different methods:
First, you are monitoring the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach, which might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
    logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
    p.join(1)
Why are you defining a timeout of one second here and do not respond to whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated or you specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically implicate the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
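To make the point about join timeouts concrete, here is one way a timed join could distinguish "finished" from "timed out". It is a sketch only: join_or_terminate and sleeper are hypothetical names, terminating the straggler is just one possible policy, and the demo assumes the "fork" start method (Linux).

```python
import multiprocessing as mp
import time


def join_or_terminate(proc, timeout):
    """Join proc; if it is still running after timeout seconds, stop it.

    Returns True if the process finished on its own, False if it had to
    be terminated.
    """
    proc.join(timeout)
    if proc.is_alive():
        # The timeout expired: handle that case explicitly instead of
        # silently carrying on.
        proc.terminate()
        proc.join()
        return False
    return True


def sleeper(seconds):
    time.sleep(seconds)  # hypothetical stand-in for real work


if __name__ == "__main__":
    ctx = mp.get_context("fork")  # assumption: POSIX "fork" is available
    fast = ctx.Process(target=sleeper, args=(0.1,))
    slow = ctx.Process(target=sleeper, args=(60,))
    fast.start()
    slow.start()
    print(join_or_terminate(fast, timeout=5))
    print(join_or_terminate(slow, timeout=1))
```

Whether to terminate, retry, or just log the straggler depends on the application; what matters is that the result of the timed join is inspected via is_alive() (or exitcode) rather than ignored.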

I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded runs. It turned out that an object that does all these things with the data from Solr was shared among threads. I did not have any locking mechanism, so in some cases it mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread, and the counts matched up. It is a bit slower, but still workable.
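One way to give every thread its own instance is threading.local. The sketch below is an assumption-laden illustration, not the poster's code: SessionParser is a hypothetical stateful helper standing in for the shared Solr-processing object, and process_sessions is an invented driver.

```python
import threading


class SessionParser:
    """Hypothetical stateful helper, standing in for the shared object."""

    def __init__(self):
        self.buffer = []

    def parse(self, session_id, items):
        # Mutates instance state, so sharing one instance across threads
        # without a lock could interleave data from different sessions.
        self.buffer.clear()
        for item in items:
            self.buffer.append((session_id, item))
        return list(self.buffer)


_local = threading.local()


def get_parser():
    # Each thread lazily creates, then reuses, its own parser instance.
    if not hasattr(_local, "parser"):
        _local.parser = SessionParser()
    return _local.parser


def process_sessions(sessions, n_threads=4):
    results = {}
    lock = threading.Lock()
    pending = list(sessions.items())

    def run():
        while True:
            with lock:  # the work list and results dict are shared
                if not pending:
                    return
                session_id, items = pending.pop()
            parsed = get_parser().parse(session_id, items)
            with lock:
                results[session_id] = parsed

    threads = [threading.Thread(target=run) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Creating one parser per thread rather than one per task keeps the construction cost down, which may recover some of the speed lost by instantiating a new object everywhere.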
Thanks

Does pool.map() from multiprocessing lock process to CPU core automatically?

I've submitted several questions over last few days trying to understand how to use the multiprocessing python library properly.
Current method I'm using is to split a task over a number of processes that is equal to the number of available CPU cores on the machine, as follows:
import multiprocessing
from multiprocessing import Pool
from contextlib import closing
def myFunction(row):
    # row function
    pass
with closing(Pool(processes=multiprocessing.cpu_count())) as pool:
    pool.map(myFunction, rowList)
However, when the map part is reached, the program seems to actually slow down, not speed up. One of my functions, for example, moves through only 60 records (the first function) and prints a result at the end of each record. The record printing seems to slow down to an eventual stop and not do much! I am wondering if the program is loading the next function into memory asynchronously, or whether there's something wrong with my methodology.
So I am wondering - are the child processes automatically 'LOCKED' to each CPU core with the pool.map() or do I need to do something extra?
EDIT:
So the program does not actually stop, it just begins to print the values very slowly.
here is an example of myFunction in very simplified terms (row is from a list object):
def myFunction(row):
    d = string
    j=0
    for item in object:
        d+= row[j]
        j=j+1
    d += row[x] + string
    d += row[y] + string
    print row[z]
    return
As I said, the above function is for a very small list, however the function proceeding it deals with a much much larger list.

The problem is that you don't appear to be doing enough work in each call to the worker function. All you seem to be doing is pasting together a list of strings passed as an argument. However, this is pretty much exactly what the multiprocessing module needs to do in the parent process to pass the list of strings to the worker process. It pickles them and writes them to a pipe; the child process then reads and unpickles them and passes them as the argument to myFunction.
Since in order to pass the argument to the worker process the parent process has to do at least as much work as the worker process needs to do, you gain no benefit from using the multiprocessing module in this case.
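A rough way to see this is to compare how many bytes have to be pickled to ship a row to a worker with how much output the worker produces from it. The sketch below is only a proxy (it counts bytes, not time), and my_function and transfer_overhead are illustrative names, not part of the question's code.

```python
import pickle


def my_function(row):
    # The "work": concatenate the fields, as in the question.
    return "".join(row)


def transfer_overhead(row):
    """Return (bytes pickled to ship row to a worker, chars of output)."""
    shipped = len(pickle.dumps(row))
    produced = len(my_function(row))
    return shipped, produced


if __name__ == "__main__":
    row = ["field%d" % i for i in range(1000)]
    shipped, produced = transfer_overhead(row)
    print("pickled argument: %d bytes, result: %d chars" % (shipped, produced))
```

When the pickled argument is at least as large as the result, the parent touches every byte just to hand the row over, so the worker has to do substantially more per row before parallelism can pay off.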

Multiple python threads writing to different records in same list simultaneously - is this ok?

I am trying to fix a bug where multiple threads are writing to a list in memory. Right now I have a thread lock and am occasionally running into problems that are related to the work being done in the threads.
I was hoping to simply make a hash of lists, one for each thread, and remove the thread lock. It seems like each thread could write to its own record without worrying about the others, but perhaps the fact that they are all using the same owning hash would itself be a problem.
Does anyone happen to know if this will work or not? If not, could I, for example, dynamically add a list to a package for each thread? Is that essentially the same thing?
I am far from a threading expert so any advice welcome.
Thanks,

import os
import threading
import time
def job(root_folder,my_list):
    # os.walk yields (dirpath, dirnames, filenames) tuples
    for current,dirs,files in os.walk(root_folder):
        my_list.extend(files)
        time.sleep(1)
my_lists = [[],[],[]]
my_folders = ["C:\\Windows","C:\\Users","C:\\Temp"]
my_threads = []
for folder,a_list in zip(my_folders,my_lists):
    my_threads.append(threading.Thread(target=job,args=(folder,a_list)))
for thread in my_threads:
    thread.start()
for thread in my_threads:
    thread.join()
my_full_list = my_lists[0] + my_lists[1] + my_lists[2]
This way each thread modifies only its own list, and at the end all the individual lists are combined.
Also, as pointed out, this gives zero performance gain (it is probably slower than not threading at all). You may get performance gains using multiprocessing instead.

Don't use a list. Use Queue (Python 2) or queue (Python 3).
There are 3 kinds of queue: FIFO, LIFO and priority. The last one is for ordered data.
You may put data in on one side (from a thread):
q.put(data)
And get it out on the other side (maybe in a loop that feeds, say, a database):
while not q.empty():
    print q.get()
https://docs.python.org/2/library/queue.html
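For Python 3, a slightly more defensive version of this pattern is sketched below. Note one deliberate change from the snippet above: checking q.empty() can race with producers that have not yet put their data, so this sketch has each producer send a sentinel when it is done and uses a blocking get() instead. The names producer, gather and _DONE are made up for the example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of one producer's data


def producer(q, items):
    for item in items:
        q.put(item)
    q.put(_DONE)


def gather(batches):
    q = queue.Queue()  # FIFO; queue.LifoQueue and queue.PriorityQueue also exist
    threads = [threading.Thread(target=producer, args=(q, b)) for b in batches]
    for t in threads:
        t.start()
    collected = []
    remaining = len(threads)
    # Blocking get() avoids the race inherent in "while not q.empty()".
    while remaining:
        item = q.get()
        if item is _DONE:
            remaining -= 1
        else:
            collected.append(item)
    for t in threads:
        t.join()
    return collected


if __name__ == "__main__":
    print(sorted(gather([[1, 2, 3], [4, 5], []])))
```

The order of items from different producers is not deterministic, which is why the demo sorts before printing.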
