How do I use multiprocessing Pool and Queue together?

I need to perform ~18000 somewhat expensive calculations on a supercomputer and I'm trying to figure out how to parallelize the code. I had it mostly working with multiprocessing.Process but it would hang at the .join() step if I did more than ~350 calculations.
One of the computer scientists managing the supercomputer recommended I use multiprocessing.Pool instead of Process.
When using Process, I would set up an output Queue and a list of processes, then run and join the processes like this:
output = mp.Queue()
processes = [mp.Process(target=some_function,args=(x,output)) for x in some_array]
for p in processes:
    p.start()
for p in processes:
    p.join()
Because processes is a list, it is iterable, and I can use output.get() inside a list comprehension to get all the results:
result = [output.get() for p in processes]
What is the equivalent of this when using a Pool? If the Pool is not iterable, how can I get the output of each process that is inside it?
Here is my attempt with dummy data and a dummy calculation:
import pandas as pd
import multiprocessing as mp

##dummy function
def predict(row,output):
    calc = [len(row.c1)**2,len(row.c2)**2]
    output.put([row.c1+' - '+row.c2,sum(calc)])

#dummy data
c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']],columns=['c1','c2'])

if __name__ == '__main__':
    #output queue
    print('initializing output container...')
    output = mp.Manager().Queue()

    #pool of processes
    print('initializing and storing calculations...')
    pool = mp.Pool(processes=5)
    for i,row in c.iterrows(): #try some smaller subsets here
        pool.apply_async(predict,args=(row,output))

    #run processes and keep a counter-->I'm not sure what replaces this with Pool!
    #for p in processes:
    #    p.start()

    ##exit completed processes-->or this!
    #for p in processes:
    #    p.join()

    #pool.close() #is this right?
    #pool.join() #this?

    #store each calculation
    print('storing output of calculations...')
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
    print(p)
The output I get is:
initializing output container...
initializing and storing calculations...
storing output of calculations...
Traceback (most recent call last):
  File "parallel_test.py", line 37, in <module>
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
TypeError: 'Pool' object is not iterable
What I want is for p to print and look like:
            0   1
0      a - bb   5
1  ccc - dddd  25
2    ee - fff  13
3   gg - hhhh  20
4     i - jjj  10
How do I get the output from each calculation instead of just the first one?

You do store all your results in the queue output, but you have to fetch them by calling output.get() once for each item that was put on the queue (the number of rows, len(c), in your case). For me it works if you change the lines:
print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
to:
print('storing output of calculations...')
p = pd.DataFrame([output.get() for _ in range(len(c))]) ## <-- no longer breaks
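As an aside, the Manager queue isn't strictly needed here. A common alternative (a sketch of the same dummy example, not the original code) is to have predict return its result and collect the AsyncResult objects that apply_async hands back, then call .get() on each one after closing and joining the pool:
import pandas as pd
import multiprocessing as mp

def predict(row):
    # return the result instead of putting it on a shared queue
    calc = [len(row.c1)**2, len(row.c2)**2]
    return [row.c1 + ' - ' + row.c2, sum(calc)]

if __name__ == '__main__':
    c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']],
                     columns=['c1','c2'])
    pool = mp.Pool(processes=5)
    # apply_async returns an AsyncResult for each submitted row
    results = [pool.apply_async(predict, args=(row,)) for _, row in c.iterrows()]
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for all workers to finish
    p = pd.DataFrame([r.get() for r in results])
    print(p)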

Related

Python pool.map function completes but leaves zombies

I've been having an issue where pool.map leaves processes around even after pool.terminate is called. I've looked for solutions, but they all seem to have some other issue, like recursively calling the map function or another process that interferes with the multiprocessing.
My code imports 2 NETCDF files and processes the data in them using different calculations. These take up a lot of time (several 6400x6400 arrays), so I tried to multiprocess my code. The multiprocessing works, and the first time I run my code it takes 2.5 minutes (down from 8), but every time my code finishes running, the memory usage by Spyder never goes back down and it leaves extra python processes in the Windows task manager. My code looks like this:
import numpy as np
import netCDF4
import math
from math import sin, cos
import logging
from multiprocessing.pool import Pool
import time

start=time.time()
format = "%(asctime)s: %(message)s"
logging.basicConfig(format=format, level=logging.INFO, datefmt="%H:%M:%S")
logging.info("Here we go!")

path = "DATAPATH"
geopath = "DATAPATH"

f = netCDF4.Dataset(path)
f.set_auto_maskandscale(False)
f2 = netCDF4.Dataset(geopath)

i5lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
i4lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
I5= f.groups['observation_data'].variables['I05'][:]
I4= f.groups['observation_data'].variables['I04'][:]
I5=i5lut[I5]
I4=i4lut[I4]
I4Quality= f.groups['observation_data'].variables['I04_quality_flags'][:]
I5Quality= f.groups['observation_data'].variables['I05_quality_flags'][:]
I3= f.groups['observation_data'].variables['I03']
I2= f.groups['observation_data'].variables['I02']
I1= f.groups['observation_data'].variables['I01']
I1.set_auto_scale(True)
I2.set_auto_scale(True)
I3.set_auto_scale(True)
I1=I1[:]
I2=I2[:]
I3=I3[:]

lats = f2.groups['geolocation_data'].variables['latitude'][:]
lons = f2.groups['geolocation_data'].variables['longitude'][:]
solarZen = f2.groups['geolocation_data'].variables['solar_zenith'][:]
sensorZen= solarZen = f2.groups['geolocation_data'].variables['sensor_zenith'][:]
solarAz = f2.groups['geolocation_data'].variables['solar_azimuth'][:]
sensorAz= solarZen = f2.groups['geolocation_data'].variables['sensor_azimuth'][:]

def kernMe(i, j, band):
    if i<250 or j<250:
        return -1
    else:
        return np.mean(band[i-250:i+250:1,j-250:j+250:1])

def thread_me(arr):
    start1=arr[0]
    end1=arr[1]
    start2=arr[2]
    end2=arr[3]
    logging.info("Im starting at: %d to %d, %d to %d" %(start1, end1, start2, end2))
    points = []
    avg = np.mean(I4)
    for i in range(start1,end1):
        for j in range (start2,end2):
            if solarZen[i,j]>=90:
                if not (I5[i,j]<265 and I4[i,j]<295):#
                    if I4[i,j]>320 and I4Quality[i,j]==0:
                        points.append([lons[i,j],lats[i,j], 1])
                    elif I4[i,j]>300 and I5[i,j]-I4[i,j]>10:
                        points.append([lons[i,j],lats[i,j], 2])
                    elif I4[i,j] == 367 and I4Quality ==9:
                        points.append([lons[i,j],lats[i,j, 3]])
            else:
                if not ((I1[i,j]>I2[i,j]>I3[i,j]) or (I5[i,j]<265 or (I1[i,j]+I2[i,j]>0.9 and I5[i,j]<295) or
                        (I1[i,j]+I2[i,j]>0.7 and I5[i,j]<285))):
                    if not (I1[i,j]+I2[i,j] > 0.6 and I5[i,j]<285 and I3[i,j]>0.3 and I3[i,j]>I2[i,j] and I2[i,j]>0.25 and I4[i,j]<=335):
                        thetaG= (cos(sensorZen[i,j]*(math.pi/180))*cos(solarZen[i,j]*(math.pi/180)))-(sin(sensorZen[i,j]*(math.pi/180))*sin(solarZen[i,j]*(math.pi/180))*cos(sensorAz[i,j]*(math.pi/180)))
                        thetaG= math.acos(thetaG)*(180/math.pi)
                        if not ((thetaG<15 and I1[i,j]+I2[i,j]>0.35) or (thetaG<25 and I1[i,j]+I2[i,j]>0.4)):
                            if math.floor(I4[i,j])==367 and I4Quality[i,j]==9 and I5>290 and I5Quality[i,j]==0 and (I1[i,j]+I2[i,j])>0.7:
                                points.append([lons[i,j],lats[i,j, 4]])
                            elif I4[i,j]-I5[i,j]>25 or True:
                                kern = kernMe(i, j, I4)
                                if kern!=-1 or True:
                                    BT4M = max(325, kern)
                                    kern = min(330, BT4M)
                                    if I4[i,j]> kern and I4[i,j]>avg:
                                        points.append([lons[i,j],lats[i,j], 5])
    return points

if __name__ == '__main__':
    #Separate the arrays into 1616*1600 chunks for multi processing
    #TODO: make this automatic, not hardcoded
    arg=[[0,1616,0,1600],[0,1616,1600,3200],[0,1616,3200,4800],[0,1616,4800,6400],
         [1616,3232,0,1600],[1616,3232,1600,3200],[1616,3232,3200,4800],[1616,3232,4800,6400],
         [3232,4848,0,1600],[3232,4848,1600,3200],[3232,4848,3200,4800],[3232,4848,4800,6400],
         [4848,6464,0,1600],[4848,6464,1600,3200],[4848,6464,3200,4800],[4848,6464,4800,6400]]
    print(arg)
    p=Pool(processes = 4)
    output= p.map(thread_me, arg)
    p.close()
    p.join()
    print(output)
    f.close()
    f2.close()
    logging.info("Aaaand we're here!")
    print(str((time.time()-start)/60))
    p.terminate()
I use both p.close and p.terminate because I thought it would help (it doesn't). All of my code runs and produces the expected output, but I have to manually end the lingering processes using the task manager. Any ideas as to what's causing this?
I think I put all the relevant information here; if you need more, I'll edit it in.
Thanks in advance.
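As a side note, one cleanup pattern worth trying (a sketch only, not a confirmed fix for this case) is to let a with block own the pool, so close()/join() run and the pool is torn down even if an exception escapes, and to drop the trailing terminate() after join():
if __name__ == '__main__':
    # ...build arg as before...
    with Pool(processes=4) as p:
        output = p.map(thread_me, arg)   # blocks until all chunks are processed
        p.close()                        # no further tasks will be submitted
        p.join()                         # wait for the workers to exit
    # leaving the with block terminates any remaining workers, so no explicit
    # p.terminate() is needed afterwards
    print(output)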

Passing arguments vs. using pipes to communicate with child processes

In the following code, I time how long it takes to pass a large array (8 MB) to a child process using the args keyword when forking the process, versus passing it through a pipe.
Does anyone have any insight into why it is so much faster to pass data using an argument than using a pipe?
Below, each code block is a cell in a Jupyter notebook.
import multiprocessing as mp
import random
N = 2**20
x = list(map(lambda x : random.random(),range(N)))
Time the call to sum in the parent process (for comparison only):
%%timeit -n 5 -r 10 -p 8 -o -q
pass
y = sum(x)/N
t_sum = _
Time the result of calling sum from a child process, using the args keyword to pass list x to child process.
def mean(x,q):
    q.put(sum(x))
%%timeit -n 5 -r 10 -p 8 -o -q
pass
q = mp.Queue()
p = mp.Process(target=mean,args=(x,q))
p.start()
p.join()
s = q.get()
m = s/N
t_mean = _
Time using a pipe to pass data to child process
def mean_pipe(cp,q):
    x = cp.recv()
    q.put(sum(x))
%%timeit -n 5 -r 10 -p 8 -o -q
pass
q = mp.Queue()
pipe0,pipe1 = mp.Pipe()
p = mp.Process(target=mean_pipe,args=[pipe0,q])
p.start()
pipe1.send(x)
p.join()
s = q.get()
m = s/N
t_mean_pipe = _
(ADDED in response to comment) Use mp.Array shared memory feature (very slow!)
def mean_pipe_shared(xs,q):
    q.put(sum(xs))
%%timeit -n 5 -r 10 -p 8 -o -q
xs = mp.Array('d',x)
q = mp.Queue()
p = mp.Process(target=mean_pipe_shared,args=[xs,q])
p.start()
p.join()
s = q.get()
m = s/N
t_mean_shared = _
Print out results (ms)
print("{:>20s} {:12.4f}".format("MB",8*N/1024**2))
print("{:>20s} {:12.4f}".format("mean (main)",1000*t_sum.best))
print("{:>20s} {:12.4f}".format("mean (args)",1000*t_mean.best))
print("{:>20s} {:12.4f}".format("mean (pipe)",1000*t_mean_pipe.best))
print("{:>20s} {:12.4f}".format("mean (shared)",1000*t_mean_shared.best))
                  MB       8.0000
         mean (main)       7.1931
         mean (args)      38.5217
         mean (pipe)     136.5020
       mean (shared)    4195.0568
Using the pipe is over 3 times slower than passing arguments to the child process. And unless I am doing something very wrong, mp.Array is a non-starter.
Why is the pipe so much slower than passing directly to the subprocess (using args)? And what's up with the shared memory?
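One way to sanity-check where the time goes (my own sketch, not part of the original measurements) is to time the pickle round-trip of x on its own. Pipe.send()/recv() always serializes the payload, while whether args is pickled at all depends on the start method (with fork, the child simply inherits the parent's memory):
import pickle
import timeit

# time serializing and deserializing the 8 MB list, averaged over 5 runs
t_pickle = timeit.timeit(lambda: pickle.loads(pickle.dumps(x)), number=5) / 5
print("pickle round-trip: {:8.1f} ms".format(1000 * t_pickle))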

Proper handling of parallel function that writes output in Python

I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import multiprocessing as mp
import pandas as pd
from glob import glob

def processor(fi):
    df = pd.read_table(fi)
    # ...do some processing to the df...
    filename = fi.split('/')[-1][:-4]
    df.to_pickle('{}.pkl'.format(filename))

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    pool = mp.Pool(8)
    for _ in pool.imap_unordered(processor, files):
        pass
Now, this actually works totally fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (but doesn't throw any errors).
Suggestions?
You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs, but return immediately". If you don't wait and there's nothing after that code, your program will exit and all subprocesses will be killed immediately, before they complete - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep

def my_func(val):
    print('Executing %s' % val)
    sleep(0.5)
    print('Done %s' % val)

pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, using plain map would also do the trick, and then you don't have to wait for it since it is not "asynchronous" and synchronously waits for all jobs to be complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)
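If you'd rather keep imap_unordered from the question (it's lazy and doesn't care about ordering, which fits a job that only writes files), you can also drain it without the explicit for loop. A small sketch, reusing processor and files from the question:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(8) as pool:
        # exhausting the iterator runs every task; the list of None return
        # values is simply discarded
        list(pool.imap_unordered(processor, files))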

how to implement single program multiple data (spmd) in python

I read the multiprocessing docs in Python and found that tasks can be assigned to different CPU cores. I'd like to run the following code (as a start) in parallel.
from multiprocessing import Process
import os

def do(a):
    for i in range(a):
        print i

if __name__ == "__main__":
    proc1 = Process(target=do, args=(3,))
    proc2 = Process(target=do, args=(6,))
    proc1.start()
    proc2.start()
Now I get the output as 1 2 3 and then 1 ... 6, but I need it to work as 1 1 2 2 ..., i.e. I want to run proc1 and proc2 in parallel (not one after the other).
So you can have your code execute in parallel just by using map. I am using a delay (with time.sleep) to slow the code down so it prints as you want it to. If you don't use the sleep, the first process will finish before the second starts… and you get 0 1 2 0 1 2 3 4 5.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool()
>>>
>>> def do(a):
...     for i in range(a):
...         import time
...         time.sleep(1)
...         print i
...
>>> _ = p.map(do, [3,6])
0
0
1
1
2
2
3
4
5
>>>
I'm using the multiprocessing fork pathos.multiprocessing because I'm the author and I'm too lazy to code it in a file. pathos enables you to do multiprocessing in the interpreter, but otherwise it's basically the same.
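For completeness, roughly the same thing with the standard library Pool would look like this (a minimal sketch, meant to be run as a script rather than typed into the interpreter):
import time
from multiprocessing import Pool

def do(a):
    for i in range(a):
        time.sleep(1)   # slow each step down so the interleaving is visible
        print(i)

if __name__ == '__main__':   # needed so worker processes can import this module
    p = Pool()
    p.map(do, [3, 6])        # runs do(3) and do(6) in separate worker processes
    p.close()
    p.join()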
You can also use the library pp. I prefer pp over multiprocessing because it allows for parallel processing across different CPUs on the network. A function (func) can be applied to a list of inputs (args) with a few lines of code:
job_server = pp.Server(ncpus=num_local_procs, ppservers=nodes)
jobs = [job_server.submit(func, (arg,)) for arg in args]
result = [job() for job in jobs]
You can also check out more examples at: https://github.com/gopiks/mappy/blob/master/map.py

Python multiprocessing synchronization

I have a function, function, that I want to call 10 times with multiprocessing, in 2 rounds of 5 CPUs.
Therefore I need a way to synchronize the processes as described in the code below.
Is this possible without using a multiprocessing Pool? I get strange errors when I use one (for example "UnboundLocalError: local variable 'fd' referenced before assignment" - I don't have such a variable). Also, the processes seem to terminate randomly.
If possible I would like to do this without a Pool. Thanks!
number_of_cpus = 5
number_of_iterations = 2

# An array for the processes.
processing_jobs = []

# Start 5 processes 2 times.
for iteration in range(0, number_of_iterations):

    # TODO SYNCHRONIZE HERE

    # Start 5 processes at a time.
    for cpu_number in range(0, number_of_cpus):
        # Calculate an offset for the current function call.
        file_offset = iteration * cpu_number * number_of_files_per_process

        p = multiprocessing.Process(target=function, args=(file_offset,))
        processing_jobs.append(p)
        p.start()

    # TODO SYNCHRONIZE HERE
This is an (anonymized) traceback of the errors I get when I run the code in a pool:
Process Process-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "python_code_3.py", line 88, in function_x
    xyz = python_code_1.function_y(args)
  File "/python_code_1.py", line 254, in __init__
    self.WK = file.WK(filename)
  File "/python_code_2.py", line 1754, in __init__
    self.__parse__(name, data, fast_load)
  File "/python_code_2.py", line 1810, in __parse__
    fd.close()
UnboundLocalError: local variable 'fd' referenced before assignment
Most of the processes crash like that but not all of them. More of them seem to crash when I increase the number of processes. I also thought this might be due to memory limitations...
Here's how you can do the synchronization you're looking for without using a pool:
import multiprocessing

def function(arg):
    print ("got arg %s" % arg)

if __name__ == "__main__":
    number_of_cpus = 5
    number_of_iterations = 2

    # An array for the processes.
    processing_jobs = []

    # Start 5 processes 2 times.
    for iteration in range(1, number_of_iterations+1):  # Start the range from 1 so we don't multiply by zero.

        # Start 5 processes at a time.
        for cpu_number in range(1, number_of_cpus+1):
            # Calculate an offset for the current function call.
            file_offset = iteration * cpu_number * number_of_files_per_process

            p = multiprocessing.Process(target=function, args=(file_offset,))
            processing_jobs.append(p)
            p.start()

        # Wait for all processes to finish.
        for proc in processing_jobs:
            proc.join()

        # Empty active job list.
        del processing_jobs[:]

        # Write file here
        print("Writing")
Here it is with a Pool:
import multiprocessing

def function(arg):
    print ("got arg %s" % arg)

if __name__ == "__main__":
    number_of_cpus = 5
    number_of_iterations = 2

    pool = multiprocessing.Pool(number_of_cpus)

    for i in range(1, number_of_iterations+1):  # Start the range from 1 so we don't multiply by zero
        file_offsets = [number_of_files_per_process * i * cpu_num for cpu_num in range(1, number_of_cpus+1)]
        pool.map(function, file_offsets)

        print("Writing")
        # Write file here
As you can see, the Pool solution is nicer.
This doesn't solve your traceback problem, though. It's hard for me to say how to fix that without knowing what's actually causing it. You may need to use a multiprocessing.Lock to synchronize access to the resource.
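If the crashes do turn out to be several workers touching the same file at once, a Manager-backed Lock is one way to serialize just that part. A sketch with placeholder work and a hypothetical shared file name (a plain multiprocessing.Lock can't be passed to pool workers as an argument, but a Manager().Lock() proxy can):
import multiprocessing

def function(file_offset, lock):
    # ...do the independent part of the work here...
    with lock:
        # only one process at a time runs this block
        with open("shared_output.txt", "a") as fd:   # hypothetical shared file
            fd.write("processed offset %d\n" % file_offset)

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    lock = manager.Lock()                 # a Lock that can be shared with pool workers
    pool = multiprocessing.Pool(5)
    jobs = [pool.apply_async(function, (offset, lock)) for offset in range(10)]
    pool.close()
    pool.join()
    [job.get() for job in jobs]           # surface any exceptions raised in workers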
A Pool can be very easy to use. Here's a full example:
source
import multiprocessing

def calc(num):
    return num*2

if __name__=='__main__':  # required for Windows
    pool = multiprocessing.Pool()  # one Process per CPU
    for output in pool.map(calc, [1,2,3]):
        print 'output:',output
output
output: 2
output: 4
output: 6
