Proper handling of parallel function that writes output in Python - python

I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import mulitprocessing as mp
import pandas as pd
from glob import glob
def processor(fi):
df = pd.read_table(fi)
...do some processing to the df....
filename = fi.split('/')[-1][:-4]
df.to_pickle('{}.pkl'.format(filename))
if __name__ == '__main__':
files = glob('/path/to/my/files/*.txt')
pool = mp.Pool(8)
for _ in pool.imap_unordered(processor, files):
pass
Now, this actually works totally fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (but doesn't throw any errors).
Suggestions?

You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs, but return immediately". If you don't wait, if there's nothing after your code your program will exit and all subprocesses will be killed immediately and before completing - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep
def my_func(val):
print('Executing %s' % val)
sleep(0.5)
print('Done %s' % val)
pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, using plain map would also do the trick, and then you don't have to wait for it since it is not "asynchronous" and synchronously waits for all jobs to be complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)

Related

How to distribute multiprocess CPU usage over multiple nodes?

I am trying to run a job on an HPC using multiprocessing. Each process has a peak memory usage of ~44GB. The job class I can use allows 1-16 nodes to be used, each with 32 CPUs and a memory of 124GB. Therefore if I want to run the code as quickly as possible (and within the max walltime limit) I should be able to run 2 CPUs on each node up to a maximum of 32 across all 16 nodes. However, when I specify mp.Pool(32) the job quickly exceeds the memory limit, I assume because more than two CPUs were used on a node.
My natural instinct was to specify 2 CPUs as the maximum in the pbs script I run my python script from, however this configuration is not permitted on the system. Would really appreciate any insight, having been scratching my head on this one for most of today - and have faced and worked around similar problems in the past without addressing the fundamentals at play here.
Simplified versions of both scripts below:
#!/bin/sh
#PBS -l select=16:ncpus=32:mem=124gb
#PBS -l walltime=24:00:00
module load anaconda3/personal
source activate py_env
python directory/script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import multiprocessing as mp
def df_function(df, arr1, arr2):
df['col3'] = some_algorithm(df, arr1, arr2)
return df
def parallelize_dataframe(df, func, num_cores):
df_split = np.array_split(df, num_cores)
with mp.Pool(num_cores, maxtasksperchild = 10 ** 3) as pool:
df = pd.concat(pool.map(func, df_split))
return df
def main():
# Loading input data
direc = '/home/dir1/dir2/'
file = 'input_data.csv'
a_file = 'array_a.npy'
b_file = 'array_b.npy'
df = pd.read_csv(direc + file)
a = np.load(direc + a_file)
b = np.load(direc + b_file)
# Globally defining function with keyword defaults
global f
def f(df):
return df_function(df, arr1 = a, arr2 = b)
num_cores = 32 # i.e. 2 per node if evenly distributed.
# Running the function as a multiprocess:
df = parallelize_dataframe(df, f, num_cores)
# Saving:
df.to_csv(direc + 'outfile.csv', index = False)
if __name__ == '__main__':
main()
To run your job as-is, you could simply request ncpu=32 and then in your python script set num_cores = 2. Obviously this has you paying for 32 cores and then leaving 30 of them idle, which is wasteful.
The real problem here is that your current algorithm is memory-bound, not CPU-bound. You should be going to great lengths to read only chunks of your files into memory, operating on the chunks, and then writing the result chunks to disk to be organized later.
Fortunately Dask is built to do exactly this kind of thing. As a first step, you can take out the parallelize_dataframe function and directly load and map your some_algorithm with a dask.dataframe and dask.array:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dask.dataframe as dd
import dask.array as da
def main():
# Loading input data
direc = '/home/dir1/dir2/'
file = 'input_data.csv'
a_file = 'array_a.npy'
b_file = 'array_b.npy'
df = dd.read_csv(direc + file, blocksize=25e6)
a_and_b = da.from_np_stack(direc)
df['col3'] = df.apply(some_algorithm, args=(a_and_b,))
# dask is lazy, this is the only line that does any work
# Saving:
df.to_csv(
direc + 'outfile.csv',
index = False,
compute_kwargs={"scheduler": "threads"}, # also "processes", but try threads first
)
if __name__ == '__main__':
main()
That will require some tweaks to some_algorithm, and to_csv and from_np_stack work a bit differently, but you will be able to reasonably run this thing just on your own laptop and it will scale to your cluster hardware. You can level up from here by using the distributed scheduler or even deploy it directly to your cluster with dask-jobqueue.

Multiprocessing Code not producing results

I am running the below code in Python. The code shows the example of how to use multiprocessing in Python. But it is not printing the result (line 16), Only the dataset (line 9) is getting printed and the kernel keeps on running. I am of the opinion that the results should be produced instantly as the code is using various cores of the cpu through multiprocessing so fast execution of the code should be there. Can someone tell what is the issue??
from multiprocessing import Pool
def square(x):
# calculate the square of the value of x
return x*x
if __name__ == '__main__':
# Define the dataset
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
# Output the dataset
print ('Dataset: ' + str(dataset)) # Line 9
# Run this with a pool of 5 agents having a chunksize of 3 until finished
agents = 5
chunksize = 3
with Pool(processes=agents) as pool:
result = pool.map(square, dataset, chunksize)
# Output the result
print ('Result: ' + str(result)) # Line 16
You are running under Windows. Look at the console output where you started up Jupyter. You will see 5 instances (you have 5 processes in your pool) of:
AttributeError: Can't get attribute 'square' on <module 'main' (built-in)>
You need to put your worker function in a file, for instance, workers.py, in the same directory as your Jupyter code:
workers.py
def square(x):
# calculate the square of the value of x
return x*xdef square(x):
Then remove the above function from your cell and instead add:
import workers
and then:
with Pool(processes=agents) as pool:
result = pool.map(workers.square, dataset, chunksize)
Note
See my comment(s) to your post concerning the chunksize argument to the Pool constructor. The default chunksize for the map method when None is calculated more or less as follows based on the iterable argument:
if not hasattr(iterable, '__len__'):
iterable = list(iterable)
if chunksize is None:
chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
if extra:
chunksize += 1
So I mispoke in one deatil: Regardless of whether you specify a chunksize or not, map will convert your iterable to a list if it needs to in order to get its length. But essentially the default chunksize used is the ceiling of the number of jobs being submitted divided by 4 times the pool size.
There is a known bug in older versions of python and Windows10 https://bugs.python.org/issue35797
It occurs when using multiprocessing through venv.
Bugfix is released in Python 3.7.3.

Python pool.map function completes but leaves zombies

I've been having an issue where pool.map leaves processes even after pool.terminate is called. I've looked for solutions but they all seems to have some other issue like recursively calling the map function or another process that interferes with the multiprocessing.
So my code imports 2 NETCDF files and processes the data in them using different calculations. These take up a lot of time (several 6400x6400 arrays) so I tried to multi process my code. The multiprocessing works and the first time I run my code it takes 2.5 minutes (down from 8), but every time my code finishes running the memory usage by Spyder never goes back down and it leaves extra python processes in the Windows task manager. My code looks like this:
import numpy as np
import netCDF4
import math
from math import sin, cos
import logging
from multiprocessing.pool import Pool
import time
start=time.time()
format = "%(asctime)s: %(message)s"
logging.basicConfig(format=format, level=logging.INFO, datefmt="%H:%M:%S")
logging.info("Here we go!")
path = "DATAPATH"
geopath = "DATAPATH"
f = netCDF4.Dataset(path)
f.set_auto_maskandscale(False)
f2 = netCDF4.Dataset(geopath)
i5lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
i4lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
I5= f.groups['observation_data'].variables['I05'][:]
I4= f.groups['observation_data'].variables['I04'][:]
I5=i5lut[I5]
I4=i4lut[I4]
I4Quality= f.groups['observation_data'].variables['I04_quality_flags'][:]
I5Quality= f.groups['observation_data'].variables['I05_quality_flags'][:]
I3= f.groups['observation_data'].variables['I03']
I2= f.groups['observation_data'].variables['I02']
I1= f.groups['observation_data'].variables['I01']
I1.set_auto_scale(True)
I2.set_auto_scale(True)
I3.set_auto_scale(True)
I1=I1[:]
I2=I2[:]
I3=I3[:]
lats = f2.groups['geolocation_data'].variables['latitude'][:]
lons = f2.groups['geolocation_data'].variables['longitude'][:]
solarZen = f2.groups['geolocation_data'].variables['solar_zenith'][:]
sensorZen= solarZen = f2.groups['geolocation_data'].variables['sensor_zenith'][:]
solarAz = f2.groups['geolocation_data'].variables['solar_azimuth'][:]
sensorAz= solarZen = f2.groups['geolocation_data'].variables['sensor_azimuth'][:]
def kernMe(i, j, band):
if i<250 or j<250:
return -1
else:
return np.mean(band[i-250:i+250:1,j-250:j+250:1])
def thread_me(arr):
start1=arr[0]
end1=arr[1]
start2=arr[2]
end2=arr[3]
logging.info("Im starting at: %d to %d, %d to %d" %(start1, end1, start2, end2))
points = []
avg = np.mean(I4)
for i in range(start1,end1):
for j in range (start2,end2):
if solarZen[i,j]>=90:
if not (I5[i,j]<265 and I4[i,j]<295):#
if I4[i,j]>320 and I4Quality[i,j]==0:
points.append([lons[i,j],lats[i,j], 1])
elif I4[i,j]>300 and I5[i,j]-I4[i,j]>10:
points.append([lons[i,j],lats[i,j], 2])
elif I4[i,j] == 367 and I4Quality ==9:
points.append([lons[i,j],lats[i,j, 3]])
else:
if not ((I1[i,j]>I2[i,j]>I3[i,j]) or (I5[i,j]<265 or (I1[i,j]+I2[i,j]>0.9 and I5[i,j]<295) or
(I1[i,j]+I2[i,j]>0.7 and I5[i,j]<285))):
if not (I1[i,j]+I2[i,j] > 0.6 and I5[i,j]<285 and I3[i,j]>0.3 and I3[i,j]>I2[i,j] and I2[i,j]>0.25 and I4[i,j]<=335):
thetaG= (cos(sensorZen[i,j]*(math.pi/180))*cos(solarZen[i,j]*(math.pi/180)))-(sin(sensorZen[i,j]*(math.pi/180))*sin(solarZen[i,j]*(math.pi/180))*cos(sensorAz[i,j]*(math.pi/180)))
thetaG= math.acos(thetaG)*(180/math.pi)
if not ((thetaG<15 and I1[i,j]+I2[i,j]>0.35) or (thetaG<25 and I1[i,j]+I2[i,j]>0.4)):
if math.floor(I4[i,j])==367 and I4Quality[i,j]==9 and I5>290 and I5Quality[i,j]==0 and (I1[i,j]+I2[i,j])>0.7:
points.append([lons[i,j],lats[i,j, 4]])
elif I4[i,j]-I5[i,j]>25 or True:
kern = kernMe(i, j, I4)
if kern!=-1 or True:
BT4M = max(325, kern)
kern = min(330, BT4M)
if I4[i,j]> kern and I4[i,j]>avg:
points.append([lons[i,j],lats[i,j], 5])
return points
if __name__ == '__main__':
#Separate the arrays into 1616*1600 chunks for multi processing
#TODO: make this automatic, not hardcoded
arg=[[0,1616,0,1600],[0,1616,1600,3200],[0,1616,3200,4800],[0,1616,4800,6400],
[1616,3232,0,1600],[1616,3232,1600,3200],[1616,3232,3200,4800],[1616,3232,4800,6400],
[3232,4848,0,1600],[3232,4848,1600,3200],[3232,4848,3200,4800],[3232,4848,4800,6400],
[4848,6464,0,1600],[4848,6464,1600,3200],[4848,6464,3200,4800],[4848,6464,4800,6400]]
print(arg)
p=Pool(processes = 4)
output= p.map(thread_me, arg)
p.close()
p.join()
print(output)
f.close()
f2.close()
logging.info("Aaaand we're here!")
print(str((time.time()-start)/60))
p.terminate()
I use both p.close and p. terminate because I thought it would help (it doesn't). All of my code runs and produces the expected output but I have to manually end the lingering processes using the task manager. Any ideas as to
what's causing this?
I think I put all the relevant information here, if you need more I'll edit with the requests
Thanks in advance.

How to parallelize a nested for loop in python?

Ok, here is my problem: I have a nested for loop in my program which runs on a single core. Since the program spend over 99% of run time in this nested for loop I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop by using the multiprocessing library. But I only find very basic examples and can not transfer them to my problem. Here are the nested loops with random data:
import numpy as np
dist_n = 100
nrm = np.linspace(1,10,dist_n)
data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n)
for t in range(data_Y):
for i in range(data_I):
d = np.abs(I[i] - Y[t])
for p in range(dist_n):
dist[i,p] = np.sum(d**nrm[p])/nrm[p]
print(dist)
Please give me some advise how to make it parallel.
There's a small overhead with initiating a process (50ms+ depending on data size) so it's generally best to MP the largest block of code possible. From your comment it sounds like each loop of t is independent so we should be free to parallelize this.
When python creates a new process you get a copy of the main process so you have available all your global data but when each process writes the data, it writes to it's own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return (which will have some overhead). In your situation, if each process writes dist[i,p] to a file then you should be fine, just don't try to write to the same file unless you implement some type of mutex access control.
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy as np
data_Y = 11 #11000
data_I = 90 #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))
def worker(t):
st = time.time()
for i in range(data_I):
d = np.abs(I[i] - Y[t])
for p in range(dist_n):
dist[i,p] = np.sum(d**nrm[p])/nrm[p]
# Here - each worker opens a different file and writes to it
print 'Worker time %4.3f mS' % (1000.*(time.time()-st))
if 1: # single threaded
st = time.time()
for x in map(worker, range(data_Y)):
pass
print 'Single-process total time is %4.3f seconds' % (time.time()-st)
print
if 1: # multi-threaded
pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
st = time.time()
for x in pool.imap_unordered(worker, range(data_Y)):
pass
print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
print
If you re-increase the size of data_Y/data_I again, the speed-up should increase up to the theoretical limit.

Python Multiprocessing not making it faster

Hello i have one program which takes its input from stdin which i thought it would be faster if i used multiprocessing but it takes actually longer:
Normal:
import sys
import hashlib
import base58
from progress.bar import ShadyBar
bar=ShadyBar('Fighting', max=100000, suffix='%(percent)d%% - %(index)d / %(max)d - %(elapsed)d')
listagood=[]
for cc in sys.stdin:
try:
bar.next()
hexwif=cc[0:51]
enco=base58.b58decode_check(hexwif)
Fhash=hashlib.sha256(enco)
d2=hashlib.sha256()
d2.update(Fhash.digest())
Shash=d2.hexdigest()
Conf1=Shash[0:8]
encooo=base58.b58decode(hexwif)
Conf2=encooo.encode("hex")
Conf2=Conf2[len(Conf2)-8:len(Conf2)]
if Conf1==Conf2:
listagood.append(cc)
except:
pass
bar.finish()
print("\nChecksum: " )
print(listagood)
print("\n")
Multiprocessing:
def worker(line):
try:
hexwif=line[0:51]
enco=base58.b58decode_check(hexwif)
Fhash=hashlib.sha256(enco)
d2=hashlib.sha256()
d2.update(Fhash.digest())
Shash=d2.hexdigest()
Conf1=Shash[0:8]
encooo=base58.b58decode(hexwif)
Conf2=encooo.encode("hex")
Conf2=Conf2[len(Conf2)-8:len(Conf2)]
if Conf1==Conf2:
return(line)
except:
#e=sys.exc_info()
#print(str(e))
pass
listagood=[]
pool = multiprocessing.Pool(processes=4)
bar=ShadyBar('Fighting', max=100000, suffix='%(percent)d%% - %(index)d / %(max)d - %(elapsed)d')
for result in pool.imap(worker, sys.stdin):
if result != None:
listagood.append(result)
#print "Result: %r" % (result)
bar.next()
bar.finish()
print("\nChecksum: " )
print(listagood)
print("\n")
Unfortunately, when i check the elapsed time it is almost the triple with the multiprocess one.
I have one processor, two physical cores, and 2 virtual cores for each physical core.
How can i know if this is caused by the multiprocessing overhead? Or is it something i did wrong?
Any help would be much appreciated
Pool divides input in multiple processes. In your case 4. But you are passing one input at a time which is not actually triggering all 4 threads.
Use the following and see the change in timing
pool.imap(worker, sys.stdin.readlines())
Hope this helps.

Categories