Multiprocess Pool irregular behavior when processing from queue - python

I have a script that parses a large text file (~5.5GB) of irregularly sized records and sends each record to a JoinableQueue, tasks. These tasks are handled by a pool of workers, which then put results into another JoinableQueue, results. My problem arises when trying to process this second queue concurrently with another worker. Here is a small example of what I have currently.
def process_nodes(results_queue, return_dict):
    while True:
        node_pairs = results_queue.get()
        results_queue.task_done()
        if node_pairs is None:
            results_queue.task_done()
            break
        for key in node_pairs:
            return_dict[tuple(key)] += 1

if __name__ == '__main__':
    num_workers = mp.cpu_count() * 2
    tasks = mp.JoinableQueue()
    results = mp.JoinableQueue()
    parser = ParseVehicleTrajectories('VehTrajectory_1.dat', tasks)
    task_workers = [TaskWorker(tasks, results) for i in range(num_workers - 2)]
    for worker in task_workers:
        worker.start()
    manager = MyManager()
    manager.start()
    return_dict = manager.defaultdict(int)
    results_worker = mp.Process(target=process_nodes, args=(results, return_dict))
    results_worker.start()
    parser.run()
    for _ in range(num_workers - 2):
        tasks.put(None)
    results.put(None)
    tasks.join()
    print(len(return_dict), sum(return_dict.values()))
ParseVehicleTrajectories is the class that reads the file and pushes records onto the queue. TaskWorker is the worker class and inherits from mp.Process.
My issue is that sometimes this code will run properly and print results and other times it does not. When it does not it appears that results is not exhausted yet and the results_worker is stopped. I admit that I am not a master of multiprocessing so it is possible I have missed something obvious, but I have been through SO exhaustively and can't find a solution.

Related

Python multiprocessing possible deadlock with two queue as producer-consumer pattern?

I'm wondering if there can be a sort of deadlock in the following code. I have to read each element of a database (about 1 million items), process it, then collect the results in a unique file.
I've parallelized the execution with multiprocessing using two Queue's and three types of processes:
Reader: Main process which reads the database and adds the read items in a task_queue
Worker: Pool of processes. Each worker gets an item from task_queue, processes the item, saves the results in an intermediate file stored in item_name/item_name.txt and puts the item_name in a completed_queue
Writer: Process which gets an item_name from completed_queue, gets the intermediate result from item_name/item_name.txt and writes it in results.txt
from multiprocessing import Pool, Process, Queue

class Computation():
    def __init__(self,K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self,):
        pool = Pool(n_cpus, self.worker, args=())
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        pool.close()
        pool.join()
        self.completed_queue.put("DONE")
        writer.join()
The code works, but it seems that sometimes the writer or the pool stops working (or they are very slow). Is a deadlock possible in this scenario?
There are a couple of issues with your code. First, by using the queues as you are, you are in effect creating your own process pool and have no need for using the multiprocessing.Pool class at all. You are using a pool initializer as an actual pool worker and it's a bit of a misuse of this class; you would be better off to just use regular Process instances (my opinion, anyway).
Second, although it is well and good that you are putting message DONE to the writer_process to signal it to terminate, you have not done similarly for the self.n_cpus worker processes, which are looking for 'STOP' messages, and therefore the reader function needs to put self.n_cpus STOP messages in the task queue:
from multiprocessing import Process, Queue

class Computation():
    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            self.task_queue.put('STOP')

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        processes = [Process(target=self.worker) for _ in range(self.n_cpus)]
        for p in processes:
            p.start()
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        for p in processes:
            p.join()
        self.completed_queue.put("DONE")
        writer.join()
Personally, instead of using 'STOP' and 'DONE' as the sentinel messages, I would use None, assuming that it is not a valid actual message. I have tested the above code with a reader that just processed strings in a list, a self.process_item(item) that simply appended ' done' to each of those strings and put the modified string on the completed_queue, and a print call in place of self.write_f in the writer_process. I did not see any problems with the code as is.
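As an illustration of the None-sentinel variant described above, here is a minimal sketch of the same reader/worker/writer layout. The names run_pipeline and out_queue are hypothetical and not part of the original code, and appending " done" stands in for the real process_item; treat it as one possible arrangement under those assumptions rather than a drop-in replacement.

```python
import multiprocessing as mp


def worker(task_queue, completed_queue):
    """Process items until the None sentinel arrives."""
    while True:
        item = task_queue.get()
        if item is None:
            break
        # Stand-in for the real process_item(): tag the item as done.
        completed_queue.put(item + " done")


def writer(completed_queue, out_queue):
    """Collect results until the None sentinel, then hand them back."""
    collected = []
    while True:
        result = completed_queue.get()
        if result is None:
            break
        collected.append(result)
    out_queue.put(sorted(collected))


def run_pipeline(items, n_workers=2):
    """Hypothetical driver: the calling process plays the reader role."""
    task_queue = mp.Queue()
    completed_queue = mp.Queue()
    out_queue = mp.Queue()

    workers = [mp.Process(target=worker, args=(task_queue, completed_queue))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    writer_proc = mp.Process(target=writer, args=(completed_queue, out_queue))
    writer_proc.start()

    for item in items:
        task_queue.put(item)
    # One sentinel per worker so that every worker leaves its loop.
    for _ in range(n_workers):
        task_queue.put(None)
    for p in workers:
        p.join()

    # All workers are finished, so the writer can be told to stop.
    completed_queue.put(None)
    results = out_queue.get()  # read before joining the writer
    writer_proc.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a", "b", "c"]))
```

The writer's output is read from out_queue before writer_proc.join() so that the writer never blocks while flushing data nobody is reading.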
Update to use a Managed Queue
Disclaimer: I have had no experience using mpi4py and have no idea how the queue proxies would get distributed across different computers. The above code may not be sufficient as suggested by the following article, How to share mutliprocessing queue object between multiple computers. However, that code is creating instances of Queue.Queue (that code is Python 2 code) and not the proxies that are returned by the multiprocessing.SyncManager. The documentation on this is very poor. Try the above change to see if it works better (it will be slower).
Because of the proxy returned by manager.Queue(), I have had to rearrange the code a bit; the queues are now passed explicitly as arguments to the process functions:
from multiprocessing import Process, Manager

class Computation():
    def __init__(self, K):
        self.n_cpus = K

    def reader(self, task_queue):
        with open(db, "r") as db:
            ... # Read an item
            task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            task_queue.put('STOP')

    def worker(self, task_queue, completed_queue):
        while True:
            item = task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self, completed_queue):
        while True:
            f = completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        with Manager() as manager:
            task_queue = manager.Queue()
            completed_queue = manager.Queue()
            processes = [Process(target=self.worker, args=(task_queue, completed_queue)) for _ in range(self.n_cpus)]
            for p in processes:
                p.start()
            writer = Process(target=self.writer_process, args=(completed_queue,))
            writer.start()
            self.reader(task_queue)
            for p in processes:
                p.join()
            completed_queue.put("DONE")
            writer.join()

Python multiprocessing for dataset preparation

I'm looking for faster ways to prepare my dataset for a machine-learning task. I found that the multiprocessing library might be helpful. However, because I'm a newbie in multiprocessing, I couldn't find a proper way.
I first wrote some code like below:
class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.data = []

    def _ready_data(self, ex, idx):
        # Some complex functions that takes several minutes

    def _dataset_creator(self, queue):
        for idx, ex in enumerate(self.data_list):
            queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        total_mem = 0.0
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for idx in t:
            ins = queue.get()
            self.data.append(ins)
            gc.collect()

    def _build_dataset(self):
        queue = Queue()
        creator = Process(target=self._dataset_creator, args=(queue,))
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        creator.start()
        consumer.start()
        queue.close()
        queue.join_thread()
        creator.join()
        consumer.join()
However, in my opinion, because the _dataset_creator processes data (here _ready_data) in serial manner, this would not be helpful for reducing time consumption.
So, I modified the code to generate multiple processes that process one datum:
class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.data = []

    def _ready_data(self, ex, idx):
        # Some complex functions that takes several minutes

    def _dataset_creator(self, ex, idx, queue):
        queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        total_mem = 0.0
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for idx in t:
            ins = queue.get()
            self.data.append(ins)
            gc.collect()

    def _build_dataset(self):
        queue = Queue()
        for idx, ex in enumerate(self.data_list):
            p = Process(target=self._dataset_creator, args=(ex, idx, queue,))
            p.start()
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        consumer.start()
        queue.close()
        queue.join_thread()
        consumer.join()
However, this returns me errors:
Process Process-18:
Traceback ~~~
RuntimeError: can't start new thread
Traceback ~~~
OSError: [Errno 12] Cannot allocate memory
Could you help me to process complex data in a parallel way?
EDIT 1:
Thanks to @tdelaney, I can reduce the time consumption by generating self.num_worker processes (16 in my experiment):
def _dataset_creator(self, pid, queue):
    for idx, ex in list(enumerate(self.data_list))[pid::self.num_worker]:
        queue.put(self._ready_data(ex, idx))

def _dataset_consumer(self, queue):
    t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
    for _ in t:
        ins = queue.get()
        self.data[ins['idx']] = ins

def _build_dataset(self):
    queue = Queue()
    procs = []
    for pid in range(self.num_worker):
        p = Process(target=self._dataset_creator, args=(pid, queue,))
        procs.append(p)
        p.start()
    consumer = Process(target=self._dataset_consumer, args=(queue,))
    consumer.start()
    queue.close()
    queue.join_thread()
    for p in procs:
        p.join()
    consumer.join()
I'm trying to sketch out what a solution with a multiprocessing pool would look like. I got rid of the consumer process completely because it looks like the parent process is just waiting anyway (and needs the data eventually) so it can be the consumer. So, I set up a pool and use imap_unordered to handle passing the data to the worker.
I guessed that the data processing doesn't really need the DatasetReader at all and moved it out to its own function. On Windows, either the entire DatasetReader object is serialized to the subprocess (including data you don't want) or the child version of the object is incomplete and may crash when you try to use it.
Either way, changes made to a DatasetReader object in the child processes aren't seen in the parent. This can be unexpected if the parent is dependent on updated state in that object. It's best to severely bracket what's happening in subprocesses, in my opinion.
from multiprocessing import Pool, get_start_method, cpu_count
from tqdm import tqdm

# moved out of class (assuming it is not class dependent) so that
# the entire DatasetReader object isn't pickled and sent to
# the child on spawning systems like Microsoft Windows
def _ready_data(idx_ex):
    idx, ex = idx_ex
    # Some complex functions that take several minutes
    result = complex_functions(ex)
    return (idx, result)

class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.num_data = len(self.data_list)
        self.data = [None] * self.num_data

    def _ready_data_fork(self, idx):
        # on forking system, call worker with object data
        return _ready_data((idx, self.data_list[idx]))

    def run(self):
        # progress bar that is advanced manually as results arrive
        t = tqdm(total=self.num_data, desc='Building Dataset ',
                 bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) '
                            '[{elapsed}<{remaining},{rate_fmt}{postfix}]')
        pool = Pool(min(cpu_count(), len(self.data_list)))
        if get_start_method() == 'fork':
            # on forking system, self.data_list is in child process and
            # we only pass the index
            result_iter = pool.imap_unordered(self._ready_data_fork,
                                              range(len(self.data_list)),
                                              chunksize=1)
        else:
            # on spawning system, we need to pass the data
            result_iter = pool.imap_unordered(_ready_data,
                                              enumerate(self.data_list),
                                              chunksize=1)
        for idx, result in result_iter:
            t.update(1)
            self.data[idx] = result
        t.close()
        # close() must be called before join()
        pool.close()
        pool.join()
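To make the imap_unordered idea concrete, here is a small self-contained sketch of the same pattern without the class or the progress bar. The names ready_data and build_dataset are hypothetical, and squaring a number is only a placeholder for the expensive per-item work; each result carries its index so the parent can store it in the right slot even though results arrive out of order.

```python
from multiprocessing import Pool


def ready_data(idx_ex):
    """Placeholder for the expensive step: return the index with the result."""
    idx, ex = idx_ex
    return idx, ex * ex


def build_dataset(data_list, processes=2):
    """Fill a preallocated list from results that arrive in any order."""
    data = [None] * len(data_list)
    with Pool(processes) as pool:
        results = pool.imap_unordered(ready_data, enumerate(data_list), chunksize=1)
        for idx, result in results:
            data[idx] = result
    return data


if __name__ == "__main__":
    print(build_dataset([1, 2, 3, 4]))
```

Because the parent process consumes the iterator itself, no separate consumer process or explicit queue is needed.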

Dynamically generating new threads

I want to be able to run multiple threads without actually making a new line for every thread I want to run. In the code below I cannot dynamically add more accountIDs, or increase the #of threads just by changing the count on thread_count
For example this is my code now:
import threading

def get_page_list(account,thread_count):
    return list_of_pages_split_by_threads

def pull_data(page_list,account_id):
    data = api(page_list,account_id)
    return data

if __name__ == "__main__":
    accountIDs = [100]
    #of threads to make:
    thread_count = 3
    #Returns a list of pages ie : [[1,2,3],[4,5,6],[7,8,9,10]]
    page_lists = get_page_list(accountIDs[0],thread_count)
    t1 = threading.Thread(target=pull_data, args=(page_lists[0],accountIDs[0]))
    t2 = threading.Thread(target=pull_data, args=(page_lists[1],accountIDs[0]))
    t3 = threading.Thread(target=pull_data, args=(page_lists[2],accountIDs[0]))
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()
This is where I want to get to:
Any time I want to add another thread (if the server can handle it) or add more accountIDs, I don't want to have to duplicate the code.
For example (this is what I would like to do, but the code below doesn't work: it tries to finish a whole list of pages before moving on to the next thread):
if __name__ == "__main__":
    accountIDs = [100,101,103]
    thread_count = 3
    for account in accountIDs:
        page_lists = get_page_list(account,thread_count)
        for pg_list in page_lists:
            t1 = threading.Thread(target=pull_data, args=(pg_list,account))
            t1.start()
            t1.join()
One way of doing it is using Pool and Queue.
The pool will keep working while there are items in the queue, without holding the main thread.
Choose one of these imports:
import multiprocessing as mp (for process based parallelization)
import multiprocessing.dummy as mp (for thread based parallelization)
Creating the workers, pool and queue:
the_queue = mp.Queue() #store the account ids and page lists here

def worker_main(queue):
    while waiting == True:
        while not queue.empty():
            account, pageList = queue.get(True) #get an id from the queue
            pull_data(pageList, account)

waiting = True
the_pool = mp.Pool(num_parallel_workers, worker_main,(the_queue,))
# don't forget the comma here                                  ^
accountIDs = [100,101,103]
thread_count = 3
for account in accountIDs:
    list_of_page_lists = get_page_list(account, thread_count)
    for pg_list in list_of_page_lists:
        the_queue.put((account, pg_list))
....
waiting = False #while you don't do this, the pool will probably never end.
                #not sure if it's a good practice, but you might want to have
                #the pool hanging there for a while to receive more items
the_pool.close()
the_pool.join()
Another option is to fill the queue first, create the pool second, use the worker only while there are items in the queue.
Then if more data arrives, you create another queue, another pool:
import multiprocessing.dummy as mp
#if you are not using dummy, you will probably need a queue for the results too
#as the processes will not access the vars from the main thread
#something like worker_main(input_queue, output_queue):
#and pull_data(pageList,account,output_queue)
#and mp.Pool(num_parallel_workers, worker_main,(in_queue,out_queue))
#and you get the results from the output queue after pool.join()

the_queue = mp.Queue() #store the account ids and page lists here

def worker_main(queue):
    while not queue.empty():
        account, pageList = queue.get(True) #get an id from the queue
        pull_data(pageList, account)

accountIDs = [100,101,103]
thread_count = 3
for account in accountIDs:
    list_of_page_lists = get_page_list(account, thread_count)
    for pg_list in list_of_page_lists:
        the_queue.put((account, pg_list))
the_pool = mp.Pool(num_parallel_workers, worker_main,(the_queue,))
# don't forget the comma here                                  ^
the_pool.close()
the_pool.join()
del the_queue
del the_pool
I couldn't get MP to work correctly, so I did this instead and it seems to work great. But MP is probably the better way to tackle this problem.
#Just keeps track of the threads
threads = []
#Generates a thread for whatever variable thread_count = N
for thread in range(thread_count):
    #get_page_list returns a list of page lists stored in page_lists; this ensures each thread gets a unique list.
    page_list = page_lists[thread]
    #actual function for each thread to work on
    t = threading.Thread(target=pull_data, args=(page_list,account))
    #puts all threads into a list
    threads.append(t)
    #starts the thread
    t.start()
#After all threads are complete, back to the main thread.. technically this is not needed
for t in threads:
    t.join()
I also didn't understand why you would "need" .join() great answer here:
what is the use of join() in python threading
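For the "any number of accounts and threads without repeating code" goal, another option is the standard library's concurrent.futures.ThreadPoolExecutor. The sketch below is only an illustration: get_page_lists and pull_data are hypothetical stand-ins (the real page splitting and API call are not shown in the question), and pull_all is a name introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def get_page_lists(account, thread_count):
    """Hypothetical stand-in: split pages 1..10 into thread_count chunks."""
    pages = list(range(1, 11))
    return [pages[i::thread_count] for i in range(thread_count)]


def pull_data(page_list, account_id):
    """Hypothetical stand-in for the real API call."""
    return [(account_id, page) for page in page_list]


def pull_all(account_ids, thread_count):
    """Run every (page list, account) job on a pool of thread_count threads."""
    jobs = [(pages, account)
            for account in account_ids
            for pages in get_page_lists(account, thread_count)]
    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        futures = [executor.submit(pull_data, pages, account)
                   for pages, account in jobs]
        # result() waits for each job, so no explicit join() is needed.
        return [row for future in futures for row in future.result()]


if __name__ == "__main__":
    rows = pull_all([100, 101, 103], thread_count=3)
    print(len(rows))
```

Adding an account or changing thread_count then only changes the arguments to pull_all; the executor queues any jobs beyond max_workers until a thread is free.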

Problem trying to make two child processes share the load of processing the same resource

I'm messing around with python multiprocessing module. But something is not working as I was expecting it to do, so now I'm a little bit confused.
In a python script, I create two child processes, so they can work with the same resource. I was thinking that they were going to "share" the load more or less equally, but it seems that, instead of doing that, one of the processes executes just once, while the other one processes almost everything.
To test it, I wrote the following code:
#!/usr/bin/python
import os
import multiprocessing

# Worker function
def worker(queueA, queueB):
    while(queueA.qsize() != 0):
        item = queueA.get()
        item = "item: " + item + ". processed by worker " + str(os.getpid())
        queueB.put(item)
    return

# IPC Manager
manager = multiprocessing.Manager()
queueA = multiprocessing.Queue()
queueB = multiprocessing.Queue()
# Fill queueA with data
for i in range(0, 10):
    queueA.put("hello" + str(i+1))
# Create processes
process1 = multiprocessing.Process(target = worker, args = (queueA, queueB,))
process2 = multiprocessing.Process(target = worker, args = (queueA, queueB,))
# Call processes
process1.start()
process2.start()
# Wait for processes to stop processing
process1.join()
process2.join()
for i in range(0, queueB.qsize()):
    print queueB.get()
And that prints the following:
item: hello1. processed by worker 11483
item: hello3. processed by worker 11483
item: hello4. processed by worker 11483
item: hello5. processed by worker 11483
item: hello6. processed by worker 11483
item: hello7. processed by worker 11483
item: hello8. processed by worker 11483
item: hello9. processed by worker 11483
item: hello10. processed by worker 11483
item: hello2. processed by worker 11482
As you can see, one of the processes works with just one of the elements, and it doesn't continue to get more elements of the queue, while the other has to work with everything else.
I'm thinking that this is not correct, or at least not what I expected. Could you tell me which is the correct way of implementing this idea?
You're right that they won't be exactly equal, but mostly that's because your testing sample is so small. It takes time for each process to get started and start processing. The time it takes to process an item in the queue is extremely low and therefore one can quickly process 9 items before the other gets through one.
I tested this below (in Python 3, but it should apply to 2.7 as well; just change the print() function to a print statement):
import os
import multiprocessing

# Worker function
def worker(queueA, queueB):
    for item in iter(queueA.get, 'STOP'):
        out = str(os.getpid())
        queueB.put(out)
    return

# IPC Manager
manager = multiprocessing.Manager()
queueA = multiprocessing.Queue()
queueB = multiprocessing.Queue()
# Fill queueA with data
for i in range(0, 1000):
    queueA.put("hello" + str(i+1))
# Create processes
process1 = multiprocessing.Process(target = worker, args = (queueA, queueB,))
process2 = multiprocessing.Process(target = worker, args = (queueA, queueB,))
# Call processes
process1.start()
process2.start()
queueA.put('STOP')
queueA.put('STOP')
# Wait for processes to stop processing
process1.join()
process2.join()
all = {}
for i in range(1000):
    item = queueB.get()
    if item not in all:
        all[item] = 1
    else:
        all[item] += 1
print(all)
My output (a count of how many were done from each process):
{'18376': 537,
'18377': 463}
While they aren't the exact same, as we approach longer times they will approach being about equal.
Edit:
Another way to confirm this is to add a time.sleep(3) inside the worker function
def worker(queueA, queueB):
    for item in iter(queueA.get, 'STOP'):
        time.sleep(3)
        out = str(os.getpid())
        queueB.put(out)
    return
I ran a range(10) test like in your original example and got:
{'18428': 5,
'18429': 5}

multiprocessing - reading big input data - program hangs

I want to run parallel computation on some input data which is loaded from a file. (The file can be really big, so I use a generator for this.)
On a certain number of items, my code runs OK but above this threshold the program hangs (some of the worker processes do not end).
Any suggestions? (I am running this with python2.7, 8 CPUs; 5,000 lines still OK, 7,500 does not work.)
Firstly, you need an input file. Generate it in bash:
for i in {0..10000}; do echo -e "$i"'\r' >> counter.txt; done
Then, run this:
python2.7 main.py 100 counter.txt > run_log.txt
main.py:
#!/usr/bin/python2.7
import os, sys, signal, time
import Queue
import multiprocessing as mp

def eat_queue(job_queue, result_queue):
    """Eats input queue, feeds output queue
    """
    proc_name = mp.current_process().name
    while True:
        try:
            job = job_queue.get(block=False)
            if job == None:
                print(proc_name + " DONE")
                return
            result_queue.put(execute(job))
        except Queue.Empty:
            pass

def execute(x):
    """Does the computation on the input data
    """
    return x*x

def save_result(result):
    """Saves results in a list
    """
    result_list.append(result)

def load(ifilename):
    """Generator reading the input file and
    yielding it row by row
    """
    ifile = open(ifilename, "r")
    for line in ifile:
        line = line.strip()
        num = int(line)
        yield (num)
    ifile.close()
    print("file closed".upper())

def put_tasks(job_queue, ifilename):
    """Feeds the job queue
    """
    for item in load(ifilename):
        job_queue.put(item)
    for _ in range(get_max_workers()):
        job_queue.put(None)

def get_max_workers():
    """Returns optimal number of processes to run
    """
    max_workers = mp.cpu_count() - 2
    if max_workers < 1:
        return 1
    return max_workers

def run(workers_num, ifilename):
    job_queue = mp.Queue()
    result_queue = mp.Queue()
    # decide how many processes are to be created
    max_workers = get_max_workers()
    print "processes available: %d" % max_workers
    if workers_num < 1 or workers_num > max_workers:
        workers_num = max_workers
    workers_list = []
    # a process for feeding job queue with the input file
    task_gen = mp.Process(target=put_tasks, name="task_gen",
                          args=(job_queue, ifilename))
    workers_list.append(task_gen)
    for i in range(workers_num):
        tmp = mp.Process(target=eat_queue, name="w%d" % (i+1),
                         args=(job_queue, result_queue))
        workers_list.append(tmp)
    for worker in workers_list:
        worker.start()
    for worker in workers_list:
        worker.join()
        print "worker %s finished!" % worker.name

if __name__ == '__main__':
    result_list = []
    args = sys.argv
    workers_num = int(args[1])
    ifilename = args[2]
    run(workers_num, ifilename)
This is because nothing in your code takes anything off result_queue. The behavior then depends on internal queue buffering details: if "not a lot" of data is waiting, everything appears fine, but if "a lot" of data is waiting, everything freezes. Not much more can be said, because it involves layers of internal magic ;-) But the docs do warn about it:
Warning
As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
One easy way to repair that: First add
result_queue.put(None)
before eat_queue() returns. Then add:
count = 0
while count < workers_num:
    if result_queue.get() is None:
        count += 1
before the main program .join()s the workers. That drains the result queue, and everything shuts down cleanly then.
BTW, this code is pretty bizarre:
while True:
    try:
        job = job_queue.get(block=False)
        if job == None:
            print(proc_name + " DONE")
            return
        result_queue.put(execute(job))
    except Queue.Empty:
        pass
Why are you doing non-blocking get()? This turns into a CPU-hog "busy loop" so long as the queue is empty. The primary point of .get() is to supply an efficient way to wait for work to show up. So:
while True:
    job = job_queue.get()
    if job is None:
        print(proc_name + " DONE")
        break
    else:
        result_queue.put(execute(job))
result_queue.put(None)
does the same thing, but far more efficiently.
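Putting the two repairs together, a Python 3 sketch of the whole flow might look like the following. It keeps the eat_queue and execute names from the question; run_jobs is a hypothetical driver introduced here, and it feeds the job queue from the parent rather than from a separate task_gen process to keep the example short.

```python
import multiprocessing as mp


def execute(x):
    """Does the computation on the input data."""
    return x * x


def eat_queue(job_queue, result_queue):
    """Blocking get(); one None on the result queue marks this worker as done."""
    while True:
        job = job_queue.get()
        if job is None:
            break
        result_queue.put(execute(job))
    result_queue.put(None)


def run_jobs(numbers, workers_num=2):
    """Hypothetical driver: drain result_queue BEFORE joining the workers."""
    job_queue = mp.Queue()
    result_queue = mp.Queue()
    workers = [mp.Process(target=eat_queue, args=(job_queue, result_queue))
               for _ in range(workers_num)]
    for w in workers:
        w.start()

    for n in numbers:
        job_queue.put(n)
    for _ in range(workers_num):
        job_queue.put(None)  # one sentinel per worker

    results = []
    finished = 0
    while finished < workers_num:
        item = result_queue.get()
        if item is None:
            finished += 1
        else:
            results.append(item)

    # Every worker has flushed its results by now, so join() cannot hang.
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(sorted(run_jobs(range(1, 11))))
```

The order of results depends on scheduling, which is why the demo sorts them before printing.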
Queue size caution
You didn't ask about this, but let's cover it before it bites you ;-) By default, there is no bound on a Queue's size. If, e.g., you add a billion items to the Queue, it will demand enough RAM to hold a billion items. So if your producer(s) can generate work items faster than your consumer(s) can process them, memory use can get out of hand quickly.
Fortunately, that's easy to repair: specify a maximum queue size. For example,
job_queue = mp.Queue(maxsize=10*workers_num)
                     ^^^^^^^^^^^^^^^^^^^^^^^
Then job_queue.put(some_work_item) will block until consumers reduce the size of the queue to less than the maximum. This way you can process enormous problems with a queue that requires trivial RAM.
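The bound can be observed without starting any worker processes. In this small sketch (try_fill is a hypothetical helper, not part of the code above), put_nowait() is used instead of put() so that a full queue raises queue.Full rather than blocking:

```python
import multiprocessing as mp
import queue


def try_fill(maxsize, n_items):
    """Put up to n_items without blocking; return how many were accepted."""
    q = mp.Queue(maxsize=maxsize)
    accepted = 0
    for i in range(n_items):
        try:
            q.put_nowait(i)
        except queue.Full:
            # A blocking put() would wait here until a consumer made room.
            break
        accepted += 1
    # Drain the queue so its feeder thread can finish cleanly.
    for _ in range(accepted):
        q.get()
    return accepted


if __name__ == "__main__":
    print(try_fill(maxsize=3, n_items=10))
```

With a plain put(), the producer would simply pause at the point where this sketch stops, which is what keeps memory use bounded.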
