This is my first time trying to use multiprocessing in Python. I'm trying to parallelize my function fun over my dataframe df by row. The callback function is just to append results to an empty list that I'll sort through later.
Is this the correct way to use apply_async? Thanks so much.
import multiprocessing as mp

function_results = []
async_results = []

p = mp.Pool()  # by default should use number of processors

for row in df.iterrows():
    r = p.apply_async(fun, (row,), callback=function_results.extend)
    async_results.append(r)

for r in async_results:
    r.wait()

p.close()
p.join()
It looks like using map or imap_unordered (depending on whether you need your results to be ordered or not) would better suit your needs:
import multiprocessing as mp

# prepare stuff

if __name__ == "__main__":
    p = mp.Pool()
    function_results = list(p.imap_unordered(fun, df.iterrows()))  # unordered
    # function_results = p.map(fun, df.iterrows())  # ordered
    p.close()
I want to use Ray to parallelize some computations in python. As part of this, I want a method which takes the desired number of worker processes as an argument.
The introductory articles on Ray that I can find say to specify the number of processes at the top level, which is different from what I want. Is it possible to specify it similarly to how one would when instantiating e.g. a multiprocessing Pool object, as illustrated below?
Example using multiprocessing:
import multiprocessing as mp

def f(x):
    return 2 * x

def compute_results(x, n_jobs=4):
    with mp.Pool(n_jobs) as pool:
        res = pool.map(f, x)
    return res

data = [1, 2, 3]
results = compute_results(data, n_jobs=4)
Example using Ray:
import ray

# Tutorials say to designate the number of cores already here, at the top level
ray.init(num_cpus=4)

@ray.remote
def f(x):
    return 2 * x

def compute_results(x):
    result_ids = [f.remote(val) for val in x]
    res = ray.get(result_ids)
    return res
If you run f.remote() four times, Ray will create four worker processes to run them.
Btw, you can use multiprocessing.Pool with Ray: https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html
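For the n_jobs-style interface specifically, something like this minimal sketch should work, using the Pool class from ray.util.multiprocessing that the link above documents (the function names just mirror your multiprocessing example):
from ray.util.multiprocessing import Pool  # Ray-backed drop-in for multiprocessing.Pool

def f(x):
    return 2 * x

def compute_results(x, n_jobs=4):
    # Pool(n_jobs) starts n_jobs Ray worker processes (initializing Ray if needed),
    # so the degree of parallelism is just a regular function argument.
    pool = Pool(n_jobs)
    res = pool.map(f, x)
    pool.close()
    return res

if __name__ == "__main__":
    print(compute_results([1, 2, 3], n_jobs=4))  # [2, 4, 6]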
This is the code I want to make parallel:
dataset = {}
for index, Id in enumerate(MarketIds['Market Id']):
    dataset[index] = GetAllBidPrice(Id)
Assuming you don't care about the order that keys are inserted into the dictionary, a good option here would probably be the imap_unordered method of a multiprocessing.pool.Pool object. Note that the worker has to be a picklable top-level function (a lambda won't work with a process pool), and on Windows the pool must be created under an if __name__ == '__main__' guard. Here's an example using all processor cores:
from multiprocessing.pool import Pool

def fetch(args):
    # imap_unordered passes a single argument: the (index, id) tuple from enumerate
    idx, market_id = args
    return idx, GetAllBidPrice(market_id)

if __name__ == '__main__':
    with Pool(None) as p:  # None -> use all processor cores
        dataset = dict(p.imap_unordered(fetch, enumerate(MarketIds['Market Id'])))
Suppose I have two independent functions. I'd like to call them concurrently, using python's concurrent.futures.ThreadPoolExecutor. Is there a way to call them using Executor and ensure they are returned in order of submission?
I understand this is possible with Executor.map, but I am looking to parallelize two separate functions, not one function with an iterable input.
I have example code below, but it doesn't guarantee that fn_a will return first (by design of the wait function).
from concurrent.futures import ThreadPoolExecutor, wait
import time

def fn_a():
    t_sleep = 0.5
    print("fn_a: Wait {} seconds".format(t_sleep))
    time.sleep(t_sleep)
    ret = t_sleep * 5  # Do unique work
    return "fn_a: return {}".format(ret)

def fn_b():
    t_sleep = 1.0
    print("fn_b: Wait {} seconds".format(t_sleep))
    time.sleep(t_sleep)
    ret = t_sleep * 10  # Do unique work
    return "fn_b: return {}".format(ret)

if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        futures = []
        futures.append(executor.submit(fn_a))
        futures.append(executor.submit(fn_b))

        complete_futures, incomplete_futures = wait(futures)
        for f in complete_futures:
            print(f.result())
I'm also interested in knowing if there is a way to do this with joblib.
I think I found a reasonable option using lambda and partials. The partials let me pass arguments to some of the functions in the parallelized iterable, but not to others.
from functools import partial
import concurrent.futures

fns = [partial(fn_a), partial(fn_b)]

data = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    for result in executor.map(lambda x: x(), fns):
        data.append(result)
Since it uses executor.map, the results come back in submission order.
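A follow-up note: map isn't strictly required for this. If you keep the futures in a list and call result() on each one in the order you submitted them, the results also come back in submission order, because result() blocks until that particular future is done. A minimal sketch, reusing fn_a and fn_b from above:
from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(fn_a), executor.submit(fn_b)]
        # result() blocks until each future finishes, so iterating the list
        # in submission order prints fn_a's result before fn_b's.
        for f in futures:
            print(f.result())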
I am currently trying to implement a class to do intensive computations:
import random
import multiprocessing as mp

class IntensiveStuff:
    def __init__(self):
        self.N = 20
        self.nb_process = 4
        self.set_of_things = set()

    def lunch_multiprocessing(self):
        processes = []
        for i in range(self.nb_process):
            processes.append(mp.Process(target=self.process_method, args=()))
        [x.start() for x in processes]
        [x.join() for x in processes]
        self.set_of_things = ...  # I want all the sub_set of 'process_method' updated in set_of_things

    def process_method(self):
        sub_set = set()
        for _ in range(self.N):
            sub_set.add(random.randint(0, 100))
I want to run independent computations, put the results in a sub_set for each process, and merge all the sub_sets into set_of_things (which holds objects in the real code).
I have tried to use Queue without success, any advice?
P.S.: I have tried to reproduce the code in Can a set() be shared between Python processes? but without any luck either.
Processes can't share memory directly, but they can communicate via pipes, sockets, etc. The multiprocessing module has special objects for this (I believe they use pipes under the hood). multiprocessing.Queue should also work, but I often use these two objects:
multiprocessing.Manager().list() and
multiprocessing.Manager().dict()
results = mp.Manager().list()

# now a bit of your code
processes = []
for i in range(self.nb_process):
    # pass the shared list to each worker as an argument
    processes.append(mp.Process(target=self.process_method, args=(results,)))

def process_method(self, results):
    sub_set = set()
    for _ in range(self.N):
        sub_set.add(random.randint(0, 100))
    results.append(sub_set)  # or whatever you really need to add to results
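Put together inside your class, a minimal sketch could look like this (it assumes the merge you want at the end is simply the union of all the sub_sets; everything else follows your original code):
import random
import multiprocessing as mp

class IntensiveStuff:
    def __init__(self):
        self.N = 20
        self.nb_process = 4
        self.set_of_things = set()

    def lunch_multiprocessing(self):
        results = mp.Manager().list()  # shared, proxy-backed list
        processes = [mp.Process(target=self.process_method, args=(results,))
                     for _ in range(self.nb_process)]
        [p.start() for p in processes]
        [p.join() for p in processes]
        # merge every worker's sub_set into the main set
        for sub_set in results:
            self.set_of_things |= sub_set

    def process_method(self, results):
        sub_set = set()
        for _ in range(self.N):
            sub_set.add(random.randint(0, 100))
        results.append(sub_set)

if __name__ == '__main__':
    stuff = IntensiveStuff()
    stuff.lunch_multiprocessing()
    print(len(stuff.set_of_things))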
I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain the final result. I have tried:
import multiprocessing
import pandas as pd

def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))
    return {ht: keys}

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)

token = pd.read_csv('/path', sep=",", header=None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])

result = pd.DataFrame(index=hashtag['hashtag'], columns=range(0, 21))
result = result.fillna(0)

final_result = [pool.apply_async(toParallel, args=(ht, token,)) for ht in hashtag['hashtag']]
The toParallel function should return a dict with the hashtag as key and a list of keys (where keys are ints). But if I try to print final_result, I only obtain
<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>
How can I do it?
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
You can either use Pool.apply() and get the result right away (in which case you do not need multiprocessing, hehe; the function is just there for completeness), or use Pool.apply_async() followed by ApplyResult.get(). Pool.apply_async() is asynchronous.
Something like this:
workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]
Alternatively, you can also use Pool.map() which will do all this for you.
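For instance, a minimal sketch of the map route (since toParallel takes two arguments, I'm fixing the shared token DataFrame with functools.partial; that partial is my own addition, not something from your code):
from functools import partial

# map() blocks until all calls are done and returns the results in input order
final_result = pool.map(partial(toParallel, token=token), hashtag['hashtag'])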
Either way, I recommend you read the documentation carefully.
Addendum: When answering this question I presumed the OP is using some Unix operating system like Linux or OSX. If you are using Windows, you must not forget to safeguard your parent/worker processes using if __name__ == '__main__'. This is because Windows lacks fork() and so the child process starts at the beginning of the file, and not at the point of forking like in Unix, so you must use an if condition to guide it. See here.
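For instance, a minimal self-contained sketch of where the guard goes (the work function is just a placeholder):
import multiprocessing

def work(x):
    return x * x

if __name__ == '__main__':
    # On Windows the workers re-import this file, so the pool must be created
    # behind this guard to avoid spawning processes recursively.
    with multiprocessing.Pool() as pool:
        print(pool.map(work, range(5)))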
ps: this is unnecessary:
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
If you call multiprocessing.Pool() without arguments (or None), it already creates a pool of workers with the size of your cpu count.