Python 3.5 multiprocessing with pandas: processes never stop

My problem is how to execute something like:
multicore_apply(serie, func)
to run a pandas apply across several cores.
So I tried to create a function that does it.
The function used to run the apply method in a process:
def adaptator(func, queue):
    # take the chunk off the queue, apply func, and put the result back
    serie = queue.get().apply(func)
    queue.put(serie)
The process management:
from multiprocessing import Process, Queue
import pandas as pd

def parallel_apply(ncores, func, serie):
    # split the series into ncores interleaved chunks, one queue per chunk
    series = [serie[i::ncores] for i in range(ncores)]
    queues = [Queue() for i in range(ncores)]
    for _serie, queue in zip(series, queues):
        queue.put(_serie)
    result = []
    jobs = []
    for i in range(ncores):
        jobs.append(Process(target=adaptator, args=(func, queues[i])))
    for job in jobs:
        job.start()
    for queue, job in zip(queues, jobs):
        job.join()
        result.append(queue.get())
    return pd.concat(result, axis=0).sort_index()
I know the i::ncores slicing is not optimal, but that is not the problem here:
if the input length is greater than about 30000, the processes never stop...
Am I misunderstanding Queue()?
I don't want to use multiprocessing.Pool.map: the function to apply is a method of a very complex and fairly large class, so shared memory makes it just too slow. I want to pass it through a queue once this process problem is solved.
Thank you for your advice.

Maybe this will help - you can use the multiprocessing lib.
Your multicore_apply(serie, func) should look like:
from multiprocessing import Pool

pool = Pool()
pool.map(func, series)
pool.terminate()
You can specify the number of processes to create like this: pool = Pool(6); by default it equals the number of cores on the machine.
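For completeness, here is a rough sketch of what such a Pool-based multicore_apply could look like for a pandas Series; the chunking via np.array_split and the _apply_chunk helper are my own assumptions, not part of the original answer:

from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

def _apply_chunk(func, chunk):
    # runs in a worker process: a plain pandas apply on one chunk
    return chunk.apply(func)

def multicore_apply(serie, func, ncores=4):
    chunks = np.array_split(serie, ncores)
    with Pool(ncores) as pool:
        results = pool.map(partial(_apply_chunk, func), chunks)
    return pd.concat(results).sort_index()

Note that func and the chunks are pickled to the worker processes, which is exactly the limitation the asker mentions for large, complex objects.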

After many nights of intense searching, I solved the problem thanks to a post on the Python development website about the maximum size of an object in a queue: that was the problem. I also used another Stack Overflow post found here:
I then wrote the following program, which is not as efficient as expected for large objects. I will make the same thing available for every axis.
Note that this version allows a complex class to be used as the function argument, which I cannot do with pool.map:
from multiprocessing import Manager, Process
import numpy as np
import pandas as pd

def adaptator(series, results, ns, i):
    # worker: fetch this worker's chunk and the function from the shared objects
    serie = series[i]
    func = ns.func
    result = serie.apply(func)
    results[i] = result

def parallel_apply(ncores, func, serie):
    series = np.array_split(serie, ncores, axis=0)
    M = Manager()
    s_series = M.list()
    s_series.extend(series)
    results = M.list()
    results.extend([None] * ncores)
    ns = M.Namespace()
    ns.func = func
    jobs = []
    for i in range(ncores):
        jobs.append(Process(target=adaptator, args=(s_series, results, ns, i)))
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    print(results)
So: if you pass large objects between processes through queues, IPython freezes.
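For reference, the original hang is a documented multiprocessing pitfall: a child that has put a large object on a Queue does not exit until its buffered data has been flushed to the underlying pipe, so calling join() before the parent has read from the queue can deadlock. Below is a minimal sketch of the original queue-based parallel_apply with the get/join order swapped; it reuses the queue-based adaptator from the question and is my adaptation, not the author's posted fix:

from multiprocessing import Process, Queue
import pandas as pd

def parallel_apply_queues(ncores, func, serie):
    chunks = [serie[i::ncores] for i in range(ncores)]
    queues = [Queue() for _ in range(ncores)]
    for chunk, queue in zip(chunks, queues):
        queue.put(chunk)
    jobs = [Process(target=adaptator, args=(func, queue)) for queue in queues]
    for job in jobs:
        job.start()
    # drain every queue BEFORE joining: otherwise the children block while
    # flushing their results to the pipe and join() never returns
    results = [queue.get() for queue in queues]
    for job in jobs:
        job.join()
    return pd.concat(results, axis=0).sort_index()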

Related

Programmatically setting number of processes with ray

I want to use Ray to parallelize some computations in Python. As part of this, I want a method which takes the desired number of worker processes as an argument.
The introductory articles on Ray that I can find say to specify the number of processes at the top level, which is different from what I want. Is it possible to specify it similarly to how one would when instantiating e.g. a multiprocessing Pool object, as illustrated below?
Example using multiprocessing:
import multiprocessing as mp

def f(x):
    return 2*x

def compute_results(x, n_jobs=4):
    with mp.Pool(n_jobs) as pool:
        res = pool.map(f, x)
    return res

data = [1, 2, 3]
results = compute_results(data, n_jobs=4)
Example using Ray:
import ray

# Tutorials say to designate the number of cores already here
ray.remote(4)

def f(x):
    return 2*x

def compute_results(x):
    result_ids = [f.remote(val) for val in x]
    res = ray.get(result_ids)
    return res
If you run f.remote() four times, then Ray will create four worker processes to run them.
Btw, you can use multiprocessing.Pool with Ray: https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html
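For illustration, here is a minimal sketch of that Pool-on-Ray approach, assuming the ray.util.multiprocessing wrapper documented at that link (it mirrors the standard Pool API, so the worker count is passed to Pool just like in the multiprocessing example above):

from ray.util.multiprocessing import Pool

def f(x):
    return 2*x

def compute_results(x, n_jobs=4):
    # Pool(n_jobs) starts n_jobs Ray workers behind the usual Pool interface
    pool = Pool(n_jobs)
    res = pool.map(f, x)
    pool.terminate()
    return res

results = compute_results([1, 2, 3], n_jobs=4)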

Using threading/multiprocessing in Python to download images concurrently

I have a list of search queries to build a dataset:
classes = [...]. There are 100 search queries in this list.
Basically, I divide the list into 4 chunks of 25 queries.
def divide_chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))
And below, I've created a function that downloads queries from each chunk iteratively:
def download_chunk(n):
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset', adult_filter_off=True, force_replace=False, verbose=True)
        except:
            pass
However, I want to run all 4 chunks concurrently. In other words, I want to run 4 separate iterative operations concurrently. I tried both the threading and multiprocessing approaches, but neither of them works:
process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()
process_1.join()
process_2.join()
process_3.join()
process_4.join()
###########################################################
thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()
You're running download_chunk outside of the thread/process. You need to provide the function and arguments separately in order to delay execution:
For example:
Process(target=download_chunk, args=(0,))
Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.
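Applied to the snippet above, the corrected multiprocessing version would look roughly like this (a sketch that keeps the asker's download_chunk unchanged):

from multiprocessing import Process

if __name__ == '__main__':
    processes = [Process(target=download_chunk, args=(i,)) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()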
For this use-case, I would suggest using multiprocessing.Pool:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))
It handles the work of creating, starting, and later joining the 4 processes. Each process calls download_chunk with each of the arguments provided in the iterable, which is range(4) in this case.
More info about multiprocessing.Pool can be found in the docs.
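Since downloading images is largely I/O-bound, a thread pool is also worth considering; this is my own suggestion rather than part of the original answer, but multiprocessing.pool.ThreadPool exposes the same interface, so only the import changes:

from multiprocessing.pool import ThreadPool

if __name__ == '__main__':
    with ThreadPool(4) as pool:
        pool.map(download_chunk, range(4))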

Plotting the pool map for multiprocessing Python

How can I run a multiprocessing pool where run1-3 are processed asynchronously, with a multiprocessing tool in Python? I am trying to pass the values (10,2,4), (55,6,8), (9,8,7) for run1, run2, run3 respectively.
import multiprocessing

def Numbers(number, number2, divider):
    value = number * number2/divider
    return value

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:  # 3 processes
        run1, run2, run3 = pool.map(Numbers, [(10,2,4), (55,6,8), (9,8,7)])  # map input & output
You just need to use the starmap method instead of map, which, according to the documentation:
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
import multiprocessing

def Numbers(number, number2, divider):
    value = number * number2/divider
    return value

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:  # 3 processes
        run1, run2, run3 = pool.starmap(Numbers, [(10,2,4), (55,6,8), (9,8,7)])  # map input & output
        print(run1, run2, run3)
Prints:
5.0 41.25 10.285714285714286
Note
This is the correct way of doing what you want to do, but you will not find that using multiprocessing for such a trivial worker function will improve performance; in fact, it will degrade performance due to the overhead in creating the pool and passing arguments and results to and from one address space to another.
Python's multiprocessing library does, however, have a wrapper for piping data between a parent and a child process: the Manager, which has shared-data utilities such as a shared dictionary. There is a good Stack Overflow post about the topic here.
Using multiprocessing you can pass unique arguments and a shared dictionary to each process, and you must ensure each process writes to a different key in the dictionary.
An example of this in use, given your example, is as follows:
import multiprocessing

def worker(process_key, return_dict, compute_array):
    """worker function"""
    number = compute_array[0]
    number2 = compute_array[1]
    divider = compute_array[2]
    return_dict[process_key] = number * number2/divider

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    compute_arrays = [[10, 2, 4], [55, 6, 8], [9, 8, 7]]
    for i in range(len(compute_arrays)):
        p = multiprocessing.Process(target=worker, args=(
            i, return_dict, compute_arrays[i]))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(return_dict)
Edit: The information from Booboo is much more precise. I had a recommendation for threading, which I'm removing as it's certainly not the right tool here in Python due to the GIL.

Multiprocessing on pd.DataFrame did not speed up?

I am trying to apply a function to a large pd.DataFrame in PySpark. My code is posted below; it uses multiprocessing.Pool but is not as fast as expected. It takes the same time as df.apply(f, axis=1).
There must be some mistake I didn't notice. I spent the whole day on it but found nothing, so I'm finally coming here for help.
import logging
import multiprocessing
import pandas as pd

def f(series):
    # do something
    return series

if __name__ == '__main__':
    # load(df)
    output = pd.DataFrame()
    pool = multiprocessing.Pool(8)
    for name in df.columns:
        res = pool.apply_async(f, (df[name],), callback=logging.info("f with " + name))
        output[name] = res.get()
    pool.close()
    pool.join()
Following @Andriy Ivaneyko's answer, I also tried this:
if __name__ == '__main__':
    # load(df)
    output = pd.DataFrame()
    res = {}
    pool = multiprocessing.Pool(8)
    for name in df.columns:
        res[name] = pool.apply_async(f, (df[name],), callback=logging.info("f with " + name))
    for name, val in res.items():
        output[name] = val.get()
    pool.close()
    pool.join()
I changed the number of cores from 4 to 8 to 16; however, the function takes almost the same time.
The get() method blocks until the function is completed; that's the reason you are not getting a performance benefit. Create a collection of the ApplyResult objects (returned by apply_async) and only call get once you have finished iterating over df.columns:
# ... Code before
apply_results = {}
for name in df.columns:
    res = pool.apply_async(f, (df[name],), callback=logging.info("f with " + name))
    apply_results[name] = res
for name, res in apply_results.items():
    output[name] = res.get()
# ... Code after
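As an aside, the same fan-out over columns can also be written with pool.map, which gathers all results at once; a rough sketch, assuming f and the loaded df from the question:

import multiprocessing
import pandas as pd

if __name__ == '__main__':
    # load(df) and define f at module level, as in the question
    columns = list(df.columns)
    with multiprocessing.Pool(8) as pool:
        # each worker receives one column (a Series) and returns the transformed Series
        results = pool.map(f, [df[name] for name in columns])
    output = pd.concat(results, axis=1)
    output.columns = columns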

Python multiprocessing - Return a dict

I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain the final result. I have tried:
import multiprocessing
import pandas as pd

def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))
    y = {ht: keys}

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)

token = pd.read_csv('/path', sep=",", header=None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])

result = pd.DataFrame(index=hashtag['hashtag'], columns=range(0, 21))
result = result.fillna(0)

final_result = []
final_result = [pool.apply_async(toParallel, args=(ht, token,)) for ht in hashtag['hashtag']]
The toParallel function should return a dict with the hashtag as key and a list of keys (where the keys are ints). But if I try to print final_result, I obtain only:
<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>
How can I do it?
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
You can either use Pool.apply() and get the result right away (in which case you do not need multiprocessing, the function is just there for completeness), or use Pool.apply_async() followed by .get() on the ApplyResult it returns. Pool.apply_async() is asynchronous.
Something like this:
workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]
Alternatively, you can also use Pool.map() which will do all this for you.
Either way, I recommend you read the documentation carefully.
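For illustration, a rough sketch of the Pool.map route for this particular case; since toParallel takes two arguments, token is bound with functools.partial. This adaptation, and the reminder that toParallel must actually return y, are mine rather than part of the original answer:

from functools import partial

# toParallel must end with "return y", otherwise every result is None
final_result = pool.map(partial(toParallel, token=token), hashtag['hashtag'])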
Addendum: When answering this question I presumed the OP is using some Unix operating system like Linux or OSX. If you are using Windows, you must not forget to safeguard your parent/worker processes using if __name__ == '__main__'. This is because Windows lacks fork() and so the child process starts at the beginning of the file, and not at the point of forking like in Unix, so you must use an if condition to guide it. See here.
ps: this is unnecessary:
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
If you call multiprocessing.Pool() without arguments (or None), it already creates a pool of workers with the size of your cpu count.
