I need to run a bunch of parallel processes, but I cannot use the standard multiprocessing package since its pickle-based serialization does not work for more complex objects. Therefore I'm currently using pathos.multiprocessing, which uses dill for serialization, and it works flawlessly in that regard.
However, I would like to have an exit condition, so that all processes get terminated once the result of a process meets a certain condition (I'm computing objective values for an optimization problem, and I want all processes to stop once the result from a process is worse than any of the previous results).
For the standard multiprocessing package I found this solution (taken from https://stackoverflow.com/a/21491438/15799363). Can I do something similar with pathos.multiprocessing? I couldn't figure out how to pass a callback function to processes with pathos.
from random import random
from multiprocessing import Pool
from time import sleep

def add_something(i):
    # Sleep to simulate the long calculation
    sleep(random() * 30)
    return i + 1

def run_my_process():
    # Create a process pool
    pool = Pool(100)

    # Callback function that checks results and kills the pool
    def check_result(result):
        print(result)
        if result == 90:
            pool.terminate()

    # Start up all of the processes
    for i in range(100):
        pool.apply_async(add_something, args=[i], callback=check_result)
    pool.close()
    pool.join()

if __name__ == '__main__':
    run_my_process()
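For reference, here is a minimal sketch of one possible direction (an assumption on my part, not something from the original thread, and not verified against every pathos version): the multiprocess package that pathos builds on mirrors the standard multiprocessing API, including apply_async callbacks, but serializes with dill, so the pattern above may carry over with little more than a changed import.

from random import random
from time import sleep

# Assumption: the `multiprocess` package (the dill-based fork of multiprocessing
# that pathos builds on) is installed and mirrors the stdlib Pool API.
from multiprocess import Pool

def add_something(i):
    sleep(random() * 30)  # simulate the long calculation
    return i + 1

def run_my_process():
    pool = Pool(100)

    def check_result(result):
        print(result)
        if result == 90:  # stand-in for "worse than any previous result"
            pool.terminate()

    for i in range(100):
        pool.apply_async(add_something, args=[i], callback=check_result)
    pool.close()
    pool.join()

if __name__ == '__main__':
    run_my_process()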
Related
So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I have tried the multiprocessing Pool module, returning the results with get(), but by definition I have to wait for both get() calls to finish before I can continue with my code. My goal is to have the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
The code I have so far, which waits for both get() calls to finish:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet
if __name__ == '__main__':
    with Pool(processes=2) as pool:
        r1 = pool.apply_async(main_1, ('www.website1.com', 'June'))
        r2 = pool.apply_async(main_2, ())
        data = r1.get()
        data2 = r2.get()
        post_tweet("New data is {}".format(data))
        post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option, since web scraping involves a lot of waiting and only a little parsing, but I am not sure how I would implement this.
I think the solution is fairly easy but I have been searching and trying different things all day without much success so I think I will just ask here. (I only started programming 2 months ago)
As always, there are many ways to accomplish this task.
You have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2
def simple_worker(target, args, ret_q):
    # mp.Queue has its own mutex, so we don't need to worry about concurrent read/write
    ret_q.put(target(*args))

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com', 'June'), q))
    p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com', 'July'), q))
    p1.start()
    p2.start()

    first_result = q.get()
    do_stuff(first_result)

    # Don't forget to get() the second result before you quit. It's not a good idea to
    # leave things in a Queue and just assume they will be properly cleaned up at exit.
    second_result = q.get()
    p1.join()
    p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
def simple_worker2(args):
    target, arglist = args  # unpack args
    return target(*arglist)

if __name__ == "__main__":
    tasks = ((main_1, ('www.website1.com', 'June')),
             (main_2, ('www.website2.com', 'July')))

    # The Pool context manager handles worker cleanup (your target function may,
    # however, be interrupted at any point if the pool exits before a task is complete).
    with Pool() as p:
        for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
            do_stuff(result)
            break  # don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results in queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
I have seen that threading might be a better option
Almost true, but not quite. I'd say that asyncio and async-based libraries are much better than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
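For instance, here is a minimal sketch under the assumption that the scrapers can be rewritten as coroutines (scrape_1 and scrape_2 are hypothetical async stand-ins for main_1 and main_2, with sleeps in place of real HTTP requests):

import asyncio

# Hypothetical async rewrites of main_1/main_2 (assumption: the real scrapers
# can be expressed as coroutines, e.g. on top of aiohttp).
async def scrape_1(url, month):
    await asyncio.sleep(2)  # stands in for async HTTP requests plus parsing
    return "data from {}".format(url)

async def scrape_2():
    await asyncio.sleep(5)
    return "data from source 2"

async def main():
    tasks = [asyncio.create_task(scrape_1('www.website1.com', 'June')),
             asyncio.create_task(scrape_2())]
    # Wait only until the first coroutine finishes.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    first_result = done.pop().result()
    print("New data is {}".format(first_result))
    # Cancel the slower task, or keep awaiting it, depending on your needs.
    for task in pending:
        task.cancel()

if __name__ == '__main__':
    asyncio.run(main())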
Suppose we have the following toy version of a master-worker pipeline for parallelizing data collection:
# pip install gym
import gym
import numpy as np
from multiprocessing import Process, Pipe
def worker(master_conn, worker_conn):
    master_conn.close()

    env = gym.make('Pendulum-v0')
    env.reset()

    while True:
        cmd, data = worker_conn.recv()
        if cmd == 'close':
            worker_conn.close()
            break
        elif cmd == 'step':
            results = env.step(data)
            worker_conn.send(results)

class Master(object):
    def __init__(self):
        self.master_conns, self.worker_conns = zip(*[Pipe() for _ in range(10)])
        self.list_process = [Process(target=worker, args=[master_conn, worker_conn], daemon=True)
                             for master_conn, worker_conn in zip(self.master_conns, self.worker_conns)]
        [p.start() for p in self.list_process]
        [worker_conn.close() for worker_conn in self.worker_conns]

    def go(self, actions):
        [master_conn.send(['step', action]) for master_conn, action in zip(self.master_conns, actions)]
        results = [master_conn.recv() for master_conn in self.master_conns]
        return results

    def close(self):
        [master_conn.send(['close', None]) for master_conn in self.master_conns]
        [p.join() for p in self.list_process]

master = Master()

l = []
T = 1000
for t in range(T):
    actions = np.random.rand(10, 1)
    results = master.go(actions)
    l.append(len(results))
sum(l)
Because of the Pipe connections between the master and each worker, at every time step we have to send a command to each worker through its Pipe, and the worker sends back the results. We need to do this over a long horizon, so it can sometimes be a bit slow due to the frequent communication.
Therefore, I am wondering whether using the newer asyncio feature combined with Process, as a replacement for Pipe, could potentially give a speedup thanks to I/O concurrency, if I understand its functionality correctly.
The multiprocessing module already has a solution for parallel task processing: multiprocessing.Pool.
from multiprocessing import Pool
def f(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # start 4 worker processes
        print(pool.map(f, range(10)))  # prints "[0, 1, 4, ..., 81]"
You can achieve the same using multiprocessing.Queue. I believe that's how pool.map() is implemented internally.
So, what's the difference between multiprocessing.Queue and multiprocessing.Pipe? A Queue is just a Pipe plus some locking mechanism. Therefore multiple worker processes can share just a single Queue (or rather two: one for commands, one for results), whereas with Pipe each process needs its own Pipe (or a pair of them, or a duplex one), which is exactly what you are doing now.
The only disadvantage of Queue is performance: because all processes share one queue mutex, it doesn't scale well to many processes. If I needed to be sure it could handle tens of thousands of items per second, I would choose Pipe, but for the classic parallel task processing use case I think Queue, or just Pool.map(), should be fine because they are much easier to use. (Managing processes can be tricky, and asyncio doesn't make it easier either.)
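As a rough sketch of that shared-queue layout (a simplified stand-in, not the gym code from the question; note that which worker picks up which command is no longer fixed):

from multiprocessing import Process, Queue

def worker(cmd_q, result_q):
    # All workers block on the same command queue; the Queue's internal lock
    # makes concurrent get() calls safe.
    while True:
        cmd, data = cmd_q.get()
        if cmd == 'close':
            break
        result_q.put(data * 2)  # placeholder for the real per-step work

if __name__ == '__main__':
    cmd_q, result_q = Queue(), Queue()
    procs = [Process(target=worker, args=(cmd_q, result_q), daemon=True)
             for _ in range(10)]
    for p in procs:
        p.start()
    for step in range(1000):
        for _ in range(10):
            cmd_q.put(('step', step))
        results = [result_q.get() for _ in range(10)]
    for _ in procs:
        cmd_q.put(('close', None))
    for p in procs:
        p.join()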
Hope that helps; I'm aware that I've answered a slightly different question than the one you asked :)
I am in the following setting: I have a method that takes an objective function f as input. As a subroutine of that method I want to evaluate f on a small set of points. Since f has high complexity, I considered doing that in parallel.
All of the online examples hang, even for trivial functions like squaring on sets with 5 points. They use the multiprocessing library, and I don't know what I am doing wrong. I am not sure how to encapsulate that __name__ == "__main__" statement in my method. (Since it is part of a module, I guess that instead of "__main__" I should use the module name?)
The code I have been using looks like this:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count
x = [1,2,3,4,5]
num_cores = cpu_count()
def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool(num_cores)
    y = list(pool.map(f, x))
    pool.join()
    print(y)
When executing this code in Spyder it takes a very long time to finish.
So my main questions are: What am I doing wrong in this code? How can I encapsulate the __name__ statement when this code is part of a bigger method?
Is it even worth parallelizing this? (One function evaluation can take multiple minutes, and in serial this adds up to a total runtime of hours...)
According to the documentation:
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected, terminate() will be called immediately.
join()
Wait for the worker processes to exit. One must call close() or terminate() before using join().
So you should add pool.close() before pool.join():
from multiprocessing.pool import Pool
from multiprocessing import cpu_count
x = [1,2,3,4,5]
def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool()
    y = list(pool.map(f, x))
    pool.close()
    pool.join()
    print(y)
You can call Pool without any arguments, and it will use cpu_count() by default:
If processes is None then the number returned by cpu_count() is used
About the if __name__ == "__main__" guard, read more information here.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program, so that should be protected by __name__ == '__main__'.
You might want to look into the chunksize argument of the map function that you are using.
On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.
One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.
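For illustration, a small sketch of passing chunksize so that each round trip to a worker carries a batch of inputs instead of a single one (the function and numbers here are made up, not taken from the question):

from multiprocessing import Pool

def f(x):
    return x ** 2

if __name__ == '__main__':
    with Pool() as pool:
        # With chunksize=100, each worker receives inputs in batches of 100,
        # which cuts the per-item communication overhead considerably.
        results = pool.map(f, range(1_000_000), chunksize=100)
    print(results[:5])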
Earlier I tried to use the threading module in Python to create multiple threads. Then I learned about the GIL and how it does not allow taking advantage of multiple CPU cores on a single machine. So now I'm trying to do multiprocessing (I don't strictly need separate threads).
Here is a sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID every time. So multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os
def pri():
    print(os.getpid())

if __name__ == '__main__':
    # Checking number of CPU cores
    print(mp.cpu_count())
    processes = [mp.Process(target=pri()) for x in range(1, 4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in a separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
to:
mp.Process(target=pri)
Since the subprocesses run in separate processes, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri. You need to pass a callable object, not execute it.
The prints you see come from your main process, because by passing pri() the code is actually executed there. You need to change your code so that the pri function returns a value rather than printing it.
Then you need to use a queue: all your workers write their results to it, and when they're done, your main process reads them from the queue.
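A minimal sketch of that queue idea, reusing the pri example from the question (each worker puts its PID on a shared queue instead of printing it):

import multiprocessing as mp
import os

def pri(q):
    q.put(os.getpid())  # report the PID via the queue instead of printing it

if __name__ == '__main__':
    q = mp.Queue()
    processes = [mp.Process(target=pri, args=(q,)) for _ in range(3)]
    for p in processes:
        p.start()
    pids = [q.get() for _ in processes]  # one result per worker
    for p in processes:
        p.join()
    print(pids)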
A nice feature of the multiprocessing module is the Pool object. It allows you to create a pool of worker processes and then just use it. It's more convenient.
I have tried your code; the thing is that the commands execute too quickly, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it works as you expect.
That is true only on Windows. The example below was made on a Windows platform. On Unix-like machines, you won't need the sleep.
The more convenient solution looks like this:
from multiprocessing import Pool
from time import sleep
import os
def pri(x):
    sleep(1)
    return os.getpid()

def use_procs():
    p_pool = Pool(4)
    p_results = p_pool.map(pri, [_ for _ in range(1, 4)])
    p_pool.close()
    p_pool.join()
    return p_results

if __name__ == '__main__':
    res = use_procs()
    for r in res:
        print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
With the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000
I have a data analysis script that takes an argument specifying the segments of the analysis to perform. I want to run up to 'n' instances of the script at a time, where 'n' is the number of cores on the machine. The complication is that there are more segments of the analysis than there are cores, so I want to run at most 'n' processes at once, and when one of them finishes, kick off another one. Has anyone done something like this before using the subprocess module?
I do think the multiprocessing module will help you achieve what you need.
Take a look at this example technique.
import multiprocessing
def do_calculation(data):
    """
    Note: you can define your calculation code here.
    """
    return data * 2

def start_process():
    print('Starting', multiprocessing.current_process().name)

if __name__ == '__main__':
    analysis_jobs = list(range(10))  # could be your analysis work
    print('analysis_jobs:', analysis_jobs)

    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size,
                                initializer=start_process,
                                maxtasksperchild=2)
    # maxtasksperchild tells the pool to restart a worker process
    # after it has finished a few tasks. This can be used to avoid
    # having long-running workers consume ever more system resources.
    pool_outputs = pool.map(do_calculation, analysis_jobs)
    # The result of the map() method is functionally equivalent to the
    # built-in map(), except that individual tasks run in parallel.
    # Since the pool is processing its inputs in parallel, close() and join()
    # can be used to synchronize the main process with the
    # task processes to ensure proper cleanup.
    pool.close()  # no more tasks
    pool.join()   # wrap up current tasks
    print('Pool:', pool_outputs)
You can find good multiprocessing techniques here to get started.
Use the multiprocessing module, specifically the Pool class. Pool creates a pool of processes (by default, as many processes as you have CPUs), and allows you to submit jobs to the pool which are executed on the next free process. It takes care of all the subprocess management and the details of passing data between tasks, so you can write code in a very straightforward manner. See the documentation for some examples on usage.
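A minimal sketch of that approach (run_segment and the segment list are hypothetical stand-ins for the analysis script's entry point): the pool keeps at most cpu_count() workers busy and hands the next segment to whichever worker finishes first.

from multiprocessing import Pool, cpu_count

# Hypothetical stand-in for running one segment of the analysis.
def run_segment(segment):
    return segment, segment ** 2

if __name__ == '__main__':
    segments = range(32)  # more segments than cores
    with Pool(processes=cpu_count()) as pool:
        # imap_unordered yields results as each segment finishes, while the
        # pool keeps cpu_count() segments running at any one time.
        for seg, result in pool.imap_unordered(run_segment, segments):
            print(seg, result)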