In the following code, I time how long it takes to pass a large array (8 MB) to a child process using the args keyword when forking the process, versus passing it through a pipe.
Does anyone have any insight into why it is so much faster to pass data using an argument than using a pipe?
Below, each code block is a cell in a Jupyter notebook.
import multiprocessing as mp
import random
N = 2**20
x = list(map(lambda x : random.random(),range(N)))
Time the call to sum in the parent process (for comparison only):
%%timeit -n 5 -r 10 -p 8 -o -q
pass
y = sum(x)/N
t_sum = _
Time the result of calling sum from a child process, using the args keyword to pass the list x to the child process.
def mean(x, q):
    q.put(sum(x))
%%timeit -n 5 -r 10 -p 8 -o -q
pass
q = mp.Queue()
p = mp.Process(target=mean,args=(x,q))
p.start()
p.join()
s = q.get()
m = s/N
t_mean = _
Time using a pipe to pass the data to the child process.
def mean_pipe(cp, q):
    x = cp.recv()
    q.put(sum(x))
%%timeit -n 5 -r 10 -p 8 -o -q
pass
q = mp.Queue()
pipe0,pipe1 = mp.Pipe()
p = mp.Process(target=mean_pipe,args=[pipe0,q])
p.start()
pipe1.send(x)
p.join()
s = q.get()
m = s/N
t_mean_pipe = _
(ADDED in response to comment) Use mp.Array shared memory feature (very slow!)
def mean_pipe_shared(xs, q):
    q.put(sum(xs))
%%timeit -n 5 -r 10 -p 8 -o -q
xs = mp.Array('d',x)
q = mp.Queue()
p = mp.Process(target=mean_pipe_shared,args=[xs,q])
p.start()
p.join()
s = q.get()
m = s/N
t_mean_shared = _
Print out results (ms)
print("{:>20s} {:12.4f}".format("MB",8*N/1024**2))
print("{:>20s} {:12.4f}".format("mean (main)",1000*t_sum.best))
print("{:>20s} {:12.4f}".format("mean (args)",1000*t_mean.best))
print("{:>20s} {:12.4f}".format("mean (pipe)",1000*t_mean_pipe.best))
print("{:>20s} {:12.4f}".format("mean (shared)",1000*t_mean_shared.best))
                  MB       8.0000
         mean (main)       7.1931
         mean (args)      38.5217
         mean (pipe)     136.5020
       mean (shared)    4195.0568
Using the pipe is over 3 times slower than passing arguments to the child process. And unless I am doing something very wrong, mp.Array is a non-starter.
Why is the pipe so much slower than passing directly to the subprocess (using args)? And what's up with the shared memory?
I need to perform ~18000 somewhat expensive calculations on a supercomputer and I'm trying to figure out how to parallelize the code. I had it mostly working with multiprocessing.Process but it would hang at the .join() step if I did more than ~350 calculations.
One of the computer scientists managing the supercomputer recommended I use multiprocessing.Pool instead of Process.
When using Process, I would set up an output Queue and a list of processes, then run and join the processes like this:
output = mp.Queue()
processes = [mp.Process(target=some_function,args=(x,output)) for x in some_array]
for p in processes:
    p.start()
for p in processes:
    p.join()
Because processes is a list, it is iterable, and I can use output.get() inside a list comprehension to get all the results:
result = [output.get() for p in processes]
What is the equivalent of this when using a Pool? If the Pool is not iterable, how can I get the output of each process that is inside it?
Here is my attempt with dummy data and a dummy calculation:
import pandas as pd
import multiprocessing as mp
##dummy function
def predict(row, output):
    calc = [len(row.c1)**2, len(row.c2)**2]
    output.put([row.c1+' - '+row.c2, sum(calc)])

#dummy data
c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']], columns=['c1','c2'])

if __name__ == '__main__':
    #output queue
    print('initializing output container...')
    output = mp.Manager().Queue()

    #pool of processes
    print('initializing and storing calculations...')
    pool = mp.Pool(processes=5)
    for i, row in c.iterrows(): #try some smaller subsets here
        pool.apply_async(predict, args=(row, output))

    #run processes and keep a counter-->I'm not sure what replaces this with Pool!
    #for p in processes:
    #    p.start()

    ##exit completed processes-->or this!
    #for p in processes:
    #    p.join()

    #pool.close() #is this right?
    #pool.join() #this?

    #store each calculation
    print('storing output of calculations...')
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
    print(p)
The output I get is:
initializing output container...
initializing and storing calculations...
storing output of calculations...
Traceback (most recent call last):
File "parallel_test.py", line 37, in <module>
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
TypeError: 'Pool' object is not iterable
What I want is for p to print and look like:
            0   1
0      a - bb   5
1  ccc - dddd  25
2    ee - fff  13
3   gg - hhhh  20
4     i - jjj  10
How do I get the output from each calculation instead of just the first one?
Even though you store all your useful results in the queue output, you have to fetch them by calling output.get() the same number of times as items were put into the queue (the number of training examples, len(c) in your case). For me it works if you change the line:
print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
to:
print('storing output of calculations...')
p = pd.DataFrame([output.get() for _ in range(len(c))]) ## <-- no longer breaks
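An alternative worth sketching (my addition, not part of the answer above): skip the Manager queue entirely and collect the AsyncResult objects that pool.apply_async returns, calling .get() on each; pool.close() followed by pool.join() then plays the role of the start/join loops from the Process version. Here predict is rewritten to return its result instead of putting it on a queue:
import pandas as pd
import multiprocessing as mp

def predict(row):
    calc = [len(row.c1)**2, len(row.c2)**2]
    return [row.c1 + ' - ' + row.c2, sum(calc)]

if __name__ == '__main__':
    c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']], columns=['c1','c2'])

    pool = mp.Pool(processes=5)
    # apply_async returns an AsyncResult for each task; keep them so the results can be fetched later
    async_results = [pool.apply_async(predict, args=(row,)) for _, row in c.iterrows()]
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all workers to finish

    p = pd.DataFrame([r.get() for r in async_results])
    print(p)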
OK, here is my problem: I have a nested for loop in my program which runs on a single core. Since the program spends over 99% of its run time in this nested for loop, I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop using the multiprocessing library, but I only find very basic examples and cannot transfer them to my problem. Here are the nested loops with random data:
import numpy as np
dist_n = 100
nrm = np.linspace(1,10,dist_n)
data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

for t in range(data_Y):
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]

print(dist)
Please give me some advice on how to make it parallel.
There's a small overhead in starting a process (50 ms+, depending on data size), so it's generally best to multiprocess the largest block of code possible. From your comment it sounds like each iteration over t is independent, so we should be free to parallelize this.
When Python creates a new process you get a copy of the main process, so all your global data is available, but when each process writes the data, it writes to its own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return (which will have some overhead). In your situation, if each process writes dist[i,p] to a file then you should be fine; just don't try to write to the same file unless you implement some sort of mutex access control.
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy as np

data_Y = 11  #11000
data_I = 90  #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

def worker(t):
    st = time.time()
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]
    # Here - each worker opens a different file and writes to it
    print 'Worker time %4.3f mS' % (1000.*(time.time()-st))

if 1:  # single threaded
    st = time.time()
    for x in map(worker, range(data_Y)):
        pass
    print 'Single-process total time is %4.3f seconds' % (time.time()-st)
    print

if 1:  # multi-threaded
    pool = mp.Pool(28)  # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
    print
If you increase data_Y/data_I back to their original sizes, the speed-up should increase up to the theoretical limit.
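To illustrate the point above about dist[i,p] not being visible to the main process, here is a rough sketch (my own, not part of the answer, using the small test sizes) of the return-the-results variant: each worker computes its block for one value of t and hands it back to the parent, which reassembles the results:
import numpy as np
import multiprocessing as mp

data_Y = 11   # small sizes for a quick test
data_I = 90
dist_n = 100
nrm = np.linspace(1, 10, dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)

def worker(t):
    # compute the (data_I x dist_n) block for this t and return it
    # instead of writing to a global array
    block = np.zeros((data_I, dist_n))
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            block[i, p] = np.sum(d**nrm[p]) / nrm[p]
    return t, block

if __name__ == '__main__':
    pool = mp.Pool()
    results = {}
    for t, block in pool.imap_unordered(worker, range(data_Y)):
        results[t] = block   # collected in the parent; order recovered via t
    pool.close()
    pool.join()
    print(len(results))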
I used xargs for parallel processing. This thread shows what I had: Parallel processing with xargs in bash
But the parallel processing increased the CPU load. I run the script against 57 OpenStack tenants to fetch results from them, and the report runs every 2 hours, which causes a CPU spike.
To decrease the load, I thought of adding a random sleep time, something like below, but it didn't help much. I am not sure whether I can use nice to set a priority, because the results I am fetching come from the OpenStack server. If that's possible, please let me know.
I could remove the parallel processing, but then it would take 6 hours to get all the reports from the tenants, so that is not an option.
If there is any other way to optimize this, any ideas or suggestions would be great.
source ../creds/base
printf '%s\n' A B C D E F G H I J K L M N O P |
xargs -n 3 -P 8 bash -c 'for tenant; do
source ../creds/"$tenant"
python ../tools/openstack_resource_list.py "$tenant"> ./reports/openstack_reports/"$tenant".html
sleep $[ ( $RANDOM % 10 ) + 1 ]s
done' _
Python file
with open('../floating_list/testReports/'+tenant_file+'.csv', 'wb') as myfile:
    fields = ['Name', 'Volume', 'Flavor', 'Image', 'Floating IP']
    writer = csv.writer(myfile)
    writer.writerow(fields)

    try:
        for i in servers:
            import io, json
            with open('../floating_list/testReports/'+tenant_file+'.json', 'w') as e:
                e.write(check_output(['openstack', 'server', 'show', i.name, '-f', 'json']))
            with open('../floating_list/testReports/'+tenant_file+'.json', 'r') as a:
                data = json.load(a)

            name.append(i.name)

            volume = data.get('os-extended-volumes:volumes_attached', None)
            if volume:
                vol = [d.get('id', {}) for d in volume if volume]
                vol_name_perm = []
                for i in vol:
                    try:
                        vol_name1 = (check_output(['openstack', 'volume', 'show', i, '-c', 'name', '-f', 'value'])).rstrip()
                        vol_name_perm.append(vol_name1)
                    except:
                        vol_name_perm.append('Error')
                vol_join = ','.join(vol_name_perm)
                vol_name.append(vol_join)
            else:
                vol_name.append('None')
            ...

        zipped = [(name), (vol_name), (flavor), (image), (addr)]
        result = zip(*zipped)

        for i in result:
            wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
            wr.writerow(i)

    except (CalledProcessError, IndexError) as e:
        print (e)
        print("except", i.name)
WITH GNU PARALLEL
printf '%s\n' A B C D E F | parallel --eta -j 2 --load 40% --noswap 'for tenant; do
source ../creds/"$tenant"
python ../tools/openstack_resource_list.py "$tenant"> ./reports/openstack_reports/"$tenant".html
done'
I get syntax error near unexpected token `A'
WORKAROUND
I am able to manage the load with xargs -n 1 -P 3 for now. This gives me the reports within 2 hours. I still want to explore my options with GNU Parallel, as suggested by Ole Tange.
Maybe you can use GNU Parallel with niceload (part of GNU Parallel):
niceload parallel --nice 11 --bar "'source ../creds/{};
python ../tools/openstack_resource_list.py {} > ./reports/openstack_reports/{}.html'" ::: {A..F}
I read the multiprocessing documentation in Python and found that tasks can be assigned to different CPU cores. I would like to run the following code (as a start) in parallel.
from multiprocessing import Process
import os

def do(a):
    for i in range(a):
        print i

if __name__ == "__main__":
    proc1 = Process(target=do, args=(3,))
    proc2 = Process(target=do, args=(6,))
    proc1.start()
    proc2.start()
Right now I get the output as 1 2 3 and then 1 ... 6, but I need it to work as 1 1 2 2, i.e. I want to run proc1 and proc2 in parallel (not one after the other).
So you can have your code execute in parallel just by using map. I am using a delay (with time.sleep) to slow the code down so it prints as you want it to. If you don't use the sleep, the first process will finish before the second starts… and you get 0 1 2 0 1 2 3 4 5.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool()
>>>
>>> def do(a):
... for i in range(a):
... import time
... time.sleep(1)
... print i
...
>>> _ = p.map(do, [3,6])
0
0
1
1
2
2
3
4
5
>>>
I'm using the multiprocessing fork pathos.multiprocessing because I'm the author and I'm too lazy to code it in a file. pathos enables you to do multiprocessing in the interpreter, but otherwise it's basically the same.
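For comparison, roughly the same thing with the standard library's multiprocessing.Pool in a script (a sketch; as above, the sleep is only there so the interleaving is visible):
import time
from multiprocessing import Pool

def do(a):
    for i in range(a):
        time.sleep(1)   # slow it down so the parallel interleaving is visible
        print(i)

if __name__ == '__main__':
    p = Pool()          # one worker process per CPU by default
    p.map(do, [3, 6])   # runs do(3) and do(6) in parallel
    p.close()
    p.join()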
You can also use the library pp. I prefer pp over multiprocessing because it allows parallel processing across different CPUs on the network. A function (func) can be applied to a list of inputs (args) with a few lines of code:
import pp

job_server = pp.Server(ncpus=num_local_procs, ppservers=nodes)
jobs = [job_server.submit(func, (arg,)) for arg in args]
result = [job() for job in jobs]
You can also check out more examples at: https://github.com/gopiks/mappy/blob/master/map.py
I have a function "function" that I want to call 10 times with multiprocessing, in 2 batches of 5 CPUs each.
Therefore I need a way to synchronize the processes as described in the code below.
Is this possible without using a multiprocessing pool? I get strange errors when I use a pool (for example "UnboundLocalError: local variable 'fd' referenced before assignment", even though I don't have such a variable), and the processes seem to terminate randomly.
If possible I would like to do this without a pool. Thanks!
number_of_cpus = 5
number_of_iterations = 2

# An array for the processes.
processing_jobs = []

# Start 5 processes 2 times.
for iteration in range(0, number_of_iterations):

    # TODO SYNCHRONIZE HERE

    # Start 5 processes at a time.
    for cpu_number in range(0, number_of_cpus):
        # Calculate an offset for the current function call.
        file_offset = iteration * cpu_number * number_of_files_per_process

        p = multiprocessing.Process(target=function, args=(file_offset,))
        processing_jobs.append(p)
        p.start()

    # TODO SYNCHRONIZE HERE
This is an (anonymized) traceback of the errors I get when I run the code in a pool:
Process Process-5:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "python_code_3.py", line 88, in function_x
xyz = python_code_1.function_y(args)
File "/python_code_1.py", line 254, in __init__
self.WK = file.WK(filename)
File "/python_code_2.py", line 1754, in __init__
self.__parse__(name, data, fast_load)
File "/python_code_2.py", line 1810, in __parse__
fd.close()
UnboundLocalError: local variable 'fd' referenced before assignment
Most of the processes crash like that but not all of them. More of them seem to crash when I increase the number of processes. I also thought this might be due to memory limitations...
Here's how you can do the synchronization you're looking for without using a pool:
import multiprocessing

def function(arg):
    print ("got arg %s" % arg)

if __name__ == "__main__":
    number_of_cpus = 5
    number_of_iterations = 2

    # An array for the processes.
    processing_jobs = []

    # Start 5 processes 2 times.
    for iteration in range(1, number_of_iterations+1): # Start the range from 1 so we don't multiply by zero.

        # Start 5 processes at a time.
        for cpu_number in range(1, number_of_cpus+1):
            # Calculate an offset for the current function call.
            file_offset = iteration * cpu_number * number_of_files_per_process

            p = multiprocessing.Process(target=function, args=(file_offset,))
            processing_jobs.append(p)
            p.start()

        # Wait for all processes to finish.
        for proc in processing_jobs:
            proc.join()

        # Empty active job list.
        del processing_jobs[:]

        # Write file here
        print("Writing")
Here it is with a Pool:
import multiprocessing

def function(arg):
    print ("got arg %s" % arg)

if __name__ == "__main__":
    number_of_cpus = 5
    number_of_iterations = 2

    pool = multiprocessing.Pool(number_of_cpus)

    for i in range(1, number_of_iterations+1): # Start the range from 1 so we don't multiply by zero
        file_offsets = [number_of_files_per_process * i * cpu_num for cpu_num in range(1, number_of_cpus+1)]
        pool.map(function, file_offsets)
        print("Writing")
        # Write file here
As you can see, the Pool solution is nicer.
This doesn't solve your traceback problem, though. It's hard for me to say how to fix that without understanding what's actually causing that. You may need to use a multiprocessing.Lock to synchronize access to the resource.
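For what it's worth, a minimal sketch of the multiprocessing.Lock idea (the file name and offsets below are made up; only the lock-around-the-shared-resource pattern matters):
import multiprocessing

def function(lock, file_offset):
    # do the heavy work outside the lock, then serialize access to the
    # shared resource (here, a hypothetical shared output file)
    result = file_offset * 2
    with lock:
        with open("shared_output.txt", "a") as f:
            f.write("offset %d -> %d\n" % (file_offset, result))

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    jobs = [multiprocessing.Process(target=function, args=(lock, offset))
            for offset in (0, 10, 20, 30, 40)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()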
A Pool can be very easy to use. Here's a full example:
source
import multiprocessing

def calc(num):
    return num*2

if __name__=='__main__': # required for Windows
    pool = multiprocessing.Pool() # one Process per CPU
    for output in pool.map(calc, [1,2,3]):
        print 'output:',output
output
output: 2
output: 4
output: 6