I am rewriting a threaded process into a multiprocessing queue to attempt to speed up a large calculation. I have gotten it 95% of the way there, but I can't figure out how to signal when the Queue is empty using multiprocessing.
My original code is something like this:
import Queue
from threading import Thread
num_fetch_threads = 4
enclosure_queue = Queue()
for i in range(num_fetch_threads):
worker = Thread(target=run_experiment, args=(i, enclosure_queue))
worker.setDaemon(True)
worker.start()
for experiment in experiment_collection:
enclosure_queue.put((experiment, otherVar))
enclosure_queue.join()
And the queue function like this:
def run_experiment(i, q):
while True:
... do stuff ...
q.task_done()
My new code is somethings like this:
from multiprocessing import Process, Queue
num_fetch_threads = 4
enclosure_queue = Queue()
for i in range(num_fetch_threads):
worker = Process(target=run_experiment, args=(i, enclosure_queue))
worker.daemon = True
worker.start()
for experiment in experiment_collection:
enclosure_queue.put((experiment, otherVar))
worker.join() ## I only put this here bc enclosure_queue.join() is not available
And the new queue function:
def run_experiment(i, q):
while True:
... do stuff ...
## not sure what should go here
I have been reading the docs and Google, but can't figure out what I am missing - I know that task_done / join are not part of the multiprocessing Queue class, but it's not clear what I am supposed to use.
"They differ in that Queue lacks the task_done() and join() methods
introduced into Python 2.5’s Queue.Queue class." Source
But without either of those, I'm not sure how the queue knows it is done, and how to continue on with the program.
Consider using a multiprocessing.Pool instead of managing workers manually. Pool handles dispatching tasks to workers, with convenient functions like map and apply, and supports .close and .join methods. Pool takes care of handling the queues between processes and processing the results. Here's how your code might look like using multiprocessing.Pool:
from multiprocessing import Pool
def do_experiment(exp):
# run the experiment `exp`, will be called by `p.map`
return result
p = Pool() # automatically scales to the number of CPUs available
results = p.map(do_experiment, experiment_collection)
p.close()
p.join()
Related
So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I tried the multiprocessing - pool module and to return the results with get() but by definition I have to wait for both get() to finish before I can continue with my code. My goal is to have the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
The code I have so far which waits for both get() to finish.
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet
if __name__ == '__main__':
with Pool(processes=2) as pool:
r1 = pool.apply_async(main_1, ('www.website1.com','June'))
r2 = pool.apply_async(main_2, ())
data = r1.get()
data2 = r2.get()
post_tweet("New data is {}".format(data))
post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option since webscraping involves a lot of waiting and only little parsing but I am not sure how I would implement this.
I think the solution is fairly easy but I have been searching and trying different things all day without much success so I think I will just ask here. (I only started programming 2 months ago)
As always there are many ways to accomplish this task.
you have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2
def simple_worker(target, args, ret_q):
ret_q.put(target(*args)) # mp.Queue has it's own mutex so we don't need to worry about concurrent read/write
if __name__ == "__main__":
q = Queue()
p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com','June'), q))
p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com','July'), q))
p1.start()
p2.start()
first_result = q.get()
do_stuff(first_result)
#don't forget to get() the second result before you quit. It's not a good idea to
#leave things in a Queue and just assume it will be properly cleaned up at exit.
second_result = q.get()
p1.join()
p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
def simple_worker2(args):
target, arglist = args #unpack args
return target(*arglist)
if __name__ == "__main__":
tasks = ((main_1, ('www.website1.com','June')),
(main_2, ('www.website2.com','July')))
with Pool() as p: #Pool context manager handles worker cleanup (your target function may however be interrupted at any point if the pool exits before a task is complete
for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
do_stuff(result)
break #don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results in queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
I have seen that threading might be a better option
Almost true but not quite. I'd say that asyncio and async-based libraries is much better than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
I frequently use the pattern below to parallelify tasks in python. I do it this way because filling the input queue is quick, and once the processes are launched and running asynchronously, I can call a blocking get() in a loop and pull the results out as they are ready. For tasks which take days, this is great because I can do things like report progress.
from multiprocessing import Process, Queue
class worker():
def __init__(self, init_dict,):
self.init_dict = init_dict
def __call__(self, task_queue, done_queue):
for task_args in task_queue.get()
task_result = self.do_work(task_args)
done_queue.put(task_result)
if __name__=="__main__":
n_threads = 8
init_dict = {} # whatever we need to setup our class
worker_class = worker(init_dict)
task_queue = Queue()
done_queue = Queue()
some_iterator = [1,2,3,4,5] # or a list of files to chew through normally
for task in some_iterator:
task_queue.put(task)
for i in range(n_threads):
Process(target=worker_class, args=(task_queue, done_queue)).start()
for i in range(len(some_iterator)):
result = done_queue.get()
# do something with result
# print out progress stats, whatever, as tasks complete
I have glossed over a few detail like catching errors, dealing with things that fail, killing zombie process, exiting at the end of the task queue and catching tracebacks, but you get the idea. I really love this pattern and it works perfectly for my needs. I have a lot of code that uses it.
I need more computing power though and want to spread the work across a cluster. Ray offers a multiprocessing pool with an API that matches that of python multiprocessing. I just can't work out how to get the above pattern to work. Mainly I get:
RuntimeError: Queue objects should only be shared between processes through inheritance
Does anybody have any recommendations of how I can get results as they are ready from a queue when using a pool, rather than n separate processes?
I appreciate that if I do a massive rewrite, then there are probably other ways to get what I want from ray, but I have a lot of code like this, so want to try and keep changes minimal.
Thanks
I'm using the Process class to create and manage subprocesses, which may return non-trival quantities of data. The documentation states that join() is the correct way to wait for a Process to complete (https://docs.python.org/2/library/multiprocessing.html#the-process-class).
However, when using multiprocessing.Queue this can cause a hang after joining the process, as described here: https://bugs.python.org/issue8426 and here https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming (not a bug).
These docs suggest removing p.join() - but surely this will remove the guarantee that all processes have completed, as Queue.get() only waits for a single item to become available?
How can I wait for completion of all Processes in this case, and ensure I'm collecting output from them all?
A simple example of the hang I'd like to deal with:
from multiprocessing import Process, Queue
class MyClass:
def __init__(self):
pass
def example_run(output):
output.put([MyClass() for i in range(1000)])
print("Bottom of example_run() - note hangs after this is printed")
if __name__ == '__main__':
output = Queue()
processes = [Process(target=example_run, args=(output,)) for x in range(5)]
for p in processes:
p.start()
for p in processes:
p.join()
print("Processes completed")
https://bugs.python.org/issue8426
This means that whenever you use a queue you need to make sure that
all items which have been put on the queue will eventually be removed
before the process is joined. Otherwise you cannot be sure that
processes which have put items on the queue will terminate.
In your example I just added output.get() before calling to join() and every thing worked fine. We put data in queue to be used some where, so just make sure that.
for p in processes:
p.start()
print output.get()
for p in processes:
p.join()
print("Processes completed")
An inelegant solution is to add
output_final = []
for i in range(5): # we have 5 processes
output_final.append(output.get())
before attempting to join any of the processes. This simply tries to get the appropriate number of outputs for the number of processes we've started.
Turns out a much better, wider solution is not to use Process at all; use Pool instead. This way the hassles of starting worker processes and collecting the results is handled for you:
import multiprocessing
class MyClass:
def __init__(self):
pass
def example_run(someArbitraryInput):
foo = [MyClass() for i in range(10000)]
return foo
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=5)
output = pool.map(example_run, range(5))
pool.close(); pool.join() # make sure the processes are complete and tidy
print("Processes completed")
I got multiple parallel processes writing into one list in python. My code is:
global_list = []
class MyThread(threading.Thread):
...
def run(self):
results = self.calculate_results()
global_list.extend(results)
def total_results():
for param in params:
t = MyThread(param)
t.start()
while threading.active_count() > 1:
pass
return total_results
I don't like this aproach as it has:
An overall global variable -> What would be the way to have a local variable for the `total_results function?
The way I check when the list is returned seems somewhat clumsy, what would be the standard way?
Is your computation CPU-intensive? If so you should look at the multiprocessing module which is included with Python and offers a fairly easy to use Pool class into which you can feed compute tasks and later get all the results. If you need a lot of CPU time this will be faster anyway, because Python doesn't do threading all that well: only a single interpreter thread can run at a time in one process. Multiprocessing sidesteps that (and offers the Pool abstraction which makes your job easier). Oh, and if you really want to stick with threads, multiprocessing has a ThreadPool too.
1 - Use a class variable shared between all Worker's instances to append your results
from threading import Thread
class Worker(Thread):
results = []
...
def run(self):
results = self.calculate_results()
Worker.results.extend(results) # extending a list is thread safe
2 - Use join() to wait untill all the threads are done and let them have some computational time
def total_results(params):
# create all workers
workers = [Worker(p) for p in params]
# start all workers
[w.start() for w in workers]
# wait for all of them to finish
[w.join() for w in workers]
#get the result
return Worker.results
I'm having difficulty understanding the purpose of the pool in Python's multiprocessing module.
I know what this code is doing:
import multiprocessing
def worker():
"""worker function"""
print 'Worker'
return
if __name__ == '__main__':
jobs = []
for i in range(5):
p = multiprocessing.Process(target=worker)
jobs.append(p)
p.start()
So my question is, in what type of situation would a pool be used?
Pool objects are useful when you want to be able to submit more tasks to sub-processes, but you don't want to handle all the organization of these tasks(i.e. how many processes should be spawned to handle them; which task go to which process etc.) and you care only for the result value, and not any other kind of synchronisation etc. You don't want to have the control over the sub-process computation but simply the result.
On the other hand Process is used when you want to execute a specific action, and you need control over the sub-process, not only on the result of its computation.