Python pool.map function completes but leaves zombies - python

I've been having an issue where pool.map leaves processes even after pool.terminate is called. I've looked for solutions but they all seems to have some other issue like recursively calling the map function or another process that interferes with the multiprocessing.
So my code imports 2 NETCDF files and processes the data in them using different calculations. These take up a lot of time (several 6400x6400 arrays) so I tried to multi process my code. The multiprocessing works and the first time I run my code it takes 2.5 minutes (down from 8), but every time my code finishes running the memory usage by Spyder never goes back down and it leaves extra python processes in the Windows task manager. My code looks like this:
import numpy as np
import netCDF4
import math
from math import sin, cos
import logging
from multiprocessing.pool import Pool
import time
start=time.time()
format = "%(asctime)s: %(message)s"
logging.basicConfig(format=format, level=logging.INFO, datefmt="%H:%M:%S")
logging.info("Here we go!")
path = "DATAPATH"
geopath = "DATAPATH"
f = netCDF4.Dataset(path)
f.set_auto_maskandscale(False)
f2 = netCDF4.Dataset(geopath)
i5lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
i4lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
I5= f.groups['observation_data'].variables['I05'][:]
I4= f.groups['observation_data'].variables['I04'][:]
I5=i5lut[I5]
I4=i4lut[I4]
I4Quality= f.groups['observation_data'].variables['I04_quality_flags'][:]
I5Quality= f.groups['observation_data'].variables['I05_quality_flags'][:]
I3= f.groups['observation_data'].variables['I03']
I2= f.groups['observation_data'].variables['I02']
I1= f.groups['observation_data'].variables['I01']
I1.set_auto_scale(True)
I2.set_auto_scale(True)
I3.set_auto_scale(True)
I1=I1[:]
I2=I2[:]
I3=I3[:]
lats = f2.groups['geolocation_data'].variables['latitude'][:]
lons = f2.groups['geolocation_data'].variables['longitude'][:]
solarZen = f2.groups['geolocation_data'].variables['solar_zenith'][:]
sensorZen= solarZen = f2.groups['geolocation_data'].variables['sensor_zenith'][:]
solarAz = f2.groups['geolocation_data'].variables['solar_azimuth'][:]
sensorAz= solarZen = f2.groups['geolocation_data'].variables['sensor_azimuth'][:]
def kernMe(i, j, band):
if i<250 or j<250:
return -1
else:
return np.mean(band[i-250:i+250:1,j-250:j+250:1])
def thread_me(arr):
start1=arr[0]
end1=arr[1]
start2=arr[2]
end2=arr[3]
logging.info("Im starting at: %d to %d, %d to %d" %(start1, end1, start2, end2))
points = []
avg = np.mean(I4)
for i in range(start1,end1):
for j in range (start2,end2):
if solarZen[i,j]>=90:
if not (I5[i,j]<265 and I4[i,j]<295):#
if I4[i,j]>320 and I4Quality[i,j]==0:
points.append([lons[i,j],lats[i,j], 1])
elif I4[i,j]>300 and I5[i,j]-I4[i,j]>10:
points.append([lons[i,j],lats[i,j], 2])
elif I4[i,j] == 367 and I4Quality ==9:
points.append([lons[i,j],lats[i,j, 3]])
else:
if not ((I1[i,j]>I2[i,j]>I3[i,j]) or (I5[i,j]<265 or (I1[i,j]+I2[i,j]>0.9 and I5[i,j]<295) or
(I1[i,j]+I2[i,j]>0.7 and I5[i,j]<285))):
if not (I1[i,j]+I2[i,j] > 0.6 and I5[i,j]<285 and I3[i,j]>0.3 and I3[i,j]>I2[i,j] and I2[i,j]>0.25 and I4[i,j]<=335):
thetaG= (cos(sensorZen[i,j]*(math.pi/180))*cos(solarZen[i,j]*(math.pi/180)))-(sin(sensorZen[i,j]*(math.pi/180))*sin(solarZen[i,j]*(math.pi/180))*cos(sensorAz[i,j]*(math.pi/180)))
thetaG= math.acos(thetaG)*(180/math.pi)
if not ((thetaG<15 and I1[i,j]+I2[i,j]>0.35) or (thetaG<25 and I1[i,j]+I2[i,j]>0.4)):
if math.floor(I4[i,j])==367 and I4Quality[i,j]==9 and I5>290 and I5Quality[i,j]==0 and (I1[i,j]+I2[i,j])>0.7:
points.append([lons[i,j],lats[i,j, 4]])
elif I4[i,j]-I5[i,j]>25 or True:
kern = kernMe(i, j, I4)
if kern!=-1 or True:
BT4M = max(325, kern)
kern = min(330, BT4M)
if I4[i,j]> kern and I4[i,j]>avg:
points.append([lons[i,j],lats[i,j], 5])
return points
if __name__ == '__main__':
#Separate the arrays into 1616*1600 chunks for multi processing
#TODO: make this automatic, not hardcoded
arg=[[0,1616,0,1600],[0,1616,1600,3200],[0,1616,3200,4800],[0,1616,4800,6400],
[1616,3232,0,1600],[1616,3232,1600,3200],[1616,3232,3200,4800],[1616,3232,4800,6400],
[3232,4848,0,1600],[3232,4848,1600,3200],[3232,4848,3200,4800],[3232,4848,4800,6400],
[4848,6464,0,1600],[4848,6464,1600,3200],[4848,6464,3200,4800],[4848,6464,4800,6400]]
print(arg)
p=Pool(processes = 4)
output= p.map(thread_me, arg)
p.close()
p.join()
print(output)
f.close()
f2.close()
logging.info("Aaaand we're here!")
print(str((time.time()-start)/60))
p.terminate()
I use both p.close and p. terminate because I thought it would help (it doesn't). All of my code runs and produces the expected output but I have to manually end the lingering processes using the task manager. Any ideas as to
what's causing this?
I think I put all the relevant information here, if you need more I'll edit with the requests
Thanks in advance.

Related

How to distribute multiprocess CPU usage over multiple nodes?

I am trying to run a job on an HPC using multiprocessing. Each process has a peak memory usage of ~44GB. The job class I can use allows 1-16 nodes to be used, each with 32 CPUs and a memory of 124GB. Therefore if I want to run the code as quickly as possible (and within the max walltime limit) I should be able to run 2 CPUs on each node up to a maximum of 32 across all 16 nodes. However, when I specify mp.Pool(32) the job quickly exceeds the memory limit, I assume because more than two CPUs were used on a node.
My natural instinct was to specify 2 CPUs as the maximum in the pbs script I run my python script from, however this configuration is not permitted on the system. Would really appreciate any insight, having been scratching my head on this one for most of today - and have faced and worked around similar problems in the past without addressing the fundamentals at play here.
Simplified versions of both scripts below:
#!/bin/sh
#PBS -l select=16:ncpus=32:mem=124gb
#PBS -l walltime=24:00:00
module load anaconda3/personal
source activate py_env
python directory/script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import multiprocessing as mp
def df_function(df, arr1, arr2):
df['col3'] = some_algorithm(df, arr1, arr2)
return df
def parallelize_dataframe(df, func, num_cores):
df_split = np.array_split(df, num_cores)
with mp.Pool(num_cores, maxtasksperchild = 10 ** 3) as pool:
df = pd.concat(pool.map(func, df_split))
return df
def main():
# Loading input data
direc = '/home/dir1/dir2/'
file = 'input_data.csv'
a_file = 'array_a.npy'
b_file = 'array_b.npy'
df = pd.read_csv(direc + file)
a = np.load(direc + a_file)
b = np.load(direc + b_file)
# Globally defining function with keyword defaults
global f
def f(df):
return df_function(df, arr1 = a, arr2 = b)
num_cores = 32 # i.e. 2 per node if evenly distributed.
# Running the function as a multiprocess:
df = parallelize_dataframe(df, f, num_cores)
# Saving:
df.to_csv(direc + 'outfile.csv', index = False)
if __name__ == '__main__':
main()
To run your job as-is, you could simply request ncpu=32 and then in your python script set num_cores = 2. Obviously this has you paying for 32 cores and then leaving 30 of them idle, which is wasteful.
The real problem here is that your current algorithm is memory-bound, not CPU-bound. You should be going to great lengths to read only chunks of your files into memory, operating on the chunks, and then writing the result chunks to disk to be organized later.
Fortunately Dask is built to do exactly this kind of thing. As a first step, you can take out the parallelize_dataframe function and directly load and map your some_algorithm with a dask.dataframe and dask.array:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dask.dataframe as dd
import dask.array as da
def main():
# Loading input data
direc = '/home/dir1/dir2/'
file = 'input_data.csv'
a_file = 'array_a.npy'
b_file = 'array_b.npy'
df = dd.read_csv(direc + file, blocksize=25e6)
a_and_b = da.from_np_stack(direc)
df['col3'] = df.apply(some_algorithm, args=(a_and_b,))
# dask is lazy, this is the only line that does any work
# Saving:
df.to_csv(
direc + 'outfile.csv',
index = False,
compute_kwargs={"scheduler": "threads"}, # also "processes", but try threads first
)
if __name__ == '__main__':
main()
That will require some tweaks to some_algorithm, and to_csv and from_np_stack work a bit differently, but you will be able to reasonably run this thing just on your own laptop and it will scale to your cluster hardware. You can level up from here by using the distributed scheduler or even deploy it directly to your cluster with dask-jobqueue.

How to parallelize a nested for loop in python?

Ok, here is my problem: I have a nested for loop in my program which runs on a single core. Since the program spend over 99% of run time in this nested for loop I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop by using the multiprocessing library. But I only find very basic examples and can not transfer them to my problem. Here are the nested loops with random data:
import numpy as np
dist_n = 100
nrm = np.linspace(1,10,dist_n)
data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n)
for t in range(data_Y):
for i in range(data_I):
d = np.abs(I[i] - Y[t])
for p in range(dist_n):
dist[i,p] = np.sum(d**nrm[p])/nrm[p]
print(dist)
Please give me some advise how to make it parallel.
There's a small overhead with initiating a process (50ms+ depending on data size) so it's generally best to MP the largest block of code possible. From your comment it sounds like each loop of t is independent so we should be free to parallelize this.
When python creates a new process you get a copy of the main process so you have available all your global data but when each process writes the data, it writes to it's own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return (which will have some overhead). In your situation, if each process writes dist[i,p] to a file then you should be fine, just don't try to write to the same file unless you implement some type of mutex access control.
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy as np
data_Y = 11 #11000
data_I = 90 #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))
def worker(t):
st = time.time()
for i in range(data_I):
d = np.abs(I[i] - Y[t])
for p in range(dist_n):
dist[i,p] = np.sum(d**nrm[p])/nrm[p]
# Here - each worker opens a different file and writes to it
print 'Worker time %4.3f mS' % (1000.*(time.time()-st))
if 1: # single threaded
st = time.time()
for x in map(worker, range(data_Y)):
pass
print 'Single-process total time is %4.3f seconds' % (time.time()-st)
print
if 1: # multi-threaded
pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
st = time.time()
for x in pool.imap_unordered(worker, range(data_Y)):
pass
print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
print
If you re-increase the size of data_Y/data_I again, the speed-up should increase up to the theoretical limit.

Proper handling of parallel function that writes output in Python

I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import mulitprocessing as mp
import pandas as pd
from glob import glob
def processor(fi):
df = pd.read_table(fi)
...do some processing to the df....
filename = fi.split('/')[-1][:-4]
df.to_pickle('{}.pkl'.format(filename))
if __name__ == '__main__':
files = glob('/path/to/my/files/*.txt')
pool = mp.Pool(8)
for _ in pool.imap_unordered(processor, files):
pass
Now, this actually works totally fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (but doesn't throw any errors).
Suggestions?
You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs, but return immediately". If you don't wait, if there's nothing after your code your program will exit and all subprocesses will be killed immediately and before completing - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep
def my_func(val):
print('Executing %s' % val)
sleep(0.5)
print('Done %s' % val)
pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, using plain map would also do the trick, and then you don't have to wait for it since it is not "asynchronous" and synchronously waits for all jobs to be complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)

parallel image processing in python

I am doing some image processing but I have a lot of images (~10,000). Thus I would like to do it in parallel but for some reason it does not go as fast as it should. I am using a MacBook Pro 16Gb and i7 . The code is like this :
def process_image(img_name):
cv2.imread('image/'+img_name)
tfs_im = some_function(im) # use opencv, skimage and math
cv2.imwrite("new_img/"img_name,tfs_im)
if __name__ == '__main__':
### Set Working Dir
wd_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(wd_path+'/..')
img_list = os.listdir('images')
pool = Pool(processes=8)
pool.map(process_image, img_list) # proces data_inputs iterable with pool
I also tried a more basic approach using queueing.
def process_image(img_names):
for img_name in img_names:
cv2.imread('image/'+img_name)
im = read_img(img_name)
tfs_im = some_function(im) # use opencv, skimage and math
cv2.imwrite('new_img/'+img_name,tfs_im)
if __name__ == '__main__':
### Set Working Dir
wd_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(wd_path+'/..')
q = Queue()
img_list = os.listdir('image')
# split work into 8 processes
processes = 8
def splitlist(inlist, chunksize):
return [inlist[x:x+chunksize] for x in xrange(0, len(inlist), chunksize)]
list_splitted = splitlist(img_list, len(img_list)/processes+1)
for imgs in list_splitted:
p = Process(target=process_image, args=([imgs]))
p.Daemon = True
p.start()
None of those approaches a giving the expected speed. I know that there some set up time expected for each process and thus the code will not run 8 time faster but as of now it is barely running 2 time faster than single thread.
Maybe some tasks are not meant to be parallelize such as writing/reading images from/to the same folder in different processes ?
Thanks for any tips or advices!

multiprocessing: processing a large dataset

I am working with DEAP.
I am evaluating a population (currently 50 individuals) against a large dataset (400.000 columns of 200 floating points).
I have successfully tested the algorithm without any multiprocessing. Execution time is about 40s/generation.
I want to work with larger populations and more generations, so I try to speed up by using multiprocessing.
I guess that my question is more related to multiprocessing than to DEAP.
This question is not directly related to sharing memory/variables between processes. The main issue is how to minimise disk access.
I have started to work with Python multiprocessing module.
The code looks like this
toolbox = base.Toolbox()
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)
PICKLE_SEED = 'D:\\Application Data\\Dev\\20150925173629ClustersFrame.pkl'
PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'
if __name__ == "__main__":
pool = multiprocessing.Pool(processes = 2)
toolbox.register("map", pool.map)
data = pd.read_pickle(PICKLE_DATA).values
And then, a little bit further:
def main():
NGEN = 10
CXPB = 0.5
MUTPB = 0.2
population = toolbox.population_guess()
fitnesses = list(toolbox.map(toolbox.evaluate, population))
print(sorted(fitnesses, reverse = True))
for ind, fit in zip(population, fitnesses):
ind.fitness.values = fit
# Begin the evolution
for g in range(NGEN):
The evaluation function uses the global "data" variable.
and, finally:
if __name__ == "__main__":
start = datetime.now()
main()
pool.close()
stop = datetime.now()
delta = stop-start
print (delta.seconds)
So: the main processing loop and the pool definition are guarded by if __name__ == "__main__":.
It somehow works. Execution times are:
1 process: 398 s
2 processes: 270 s
3 processes: 272 s
4 processes: 511 s
Multiprocessing does not dramatically improve the execution time, and can even harm it.
The 4 process (lack of) performance can be explained by memory constraints. My system is basically paging instead of processing.
I guess that the other measurements can be explained by the loading of data.
My questions:
1) I understand that the file will be read and unpickled each time the module is started as a separate process. Is this correct? Does this mean it will be read each time one of the functions it contains will be called by map?
2) I have tried to move the unpickling under the if __name__ == "__main__": guard, but, then, I get an error message saying the "data" is not defined when I call the evaluation function. Could you explain how I can read the file once, and then only pass the array to the processes

Categories