Python Multiprocessing not making it faster - python

Hello, I have a program that takes its input from stdin. I thought it would be faster if I used multiprocessing, but it actually takes longer:
Normal:
import sys
import hashlib
import base58
from progress.bar import ShadyBar

bar=ShadyBar('Fighting', max=100000, suffix='%(percent)d%% - %(index)d / %(max)d - %(elapsed)d')
listagood=[]
for cc in sys.stdin:
    try:
        bar.next()
        hexwif=cc[0:51]
        enco=base58.b58decode_check(hexwif)
        Fhash=hashlib.sha256(enco)
        d2=hashlib.sha256()
        d2.update(Fhash.digest())
        Shash=d2.hexdigest()
        Conf1=Shash[0:8]
        encooo=base58.b58decode(hexwif)
        Conf2=encooo.encode("hex")
        Conf2=Conf2[len(Conf2)-8:len(Conf2)]
        if Conf1==Conf2:
            listagood.append(cc)
    except:
        pass
bar.finish()
print("\nChecksum: ")
print(listagood)
print("\n")
Multiprocessing:
import sys
import hashlib
import base58
import multiprocessing
from progress.bar import ShadyBar

def worker(line):
    try:
        hexwif=line[0:51]
        enco=base58.b58decode_check(hexwif)
        Fhash=hashlib.sha256(enco)
        d2=hashlib.sha256()
        d2.update(Fhash.digest())
        Shash=d2.hexdigest()
        Conf1=Shash[0:8]
        encooo=base58.b58decode(hexwif)
        Conf2=encooo.encode("hex")
        Conf2=Conf2[len(Conf2)-8:len(Conf2)]
        if Conf1==Conf2:
            return line
    except:
        #e=sys.exc_info()
        #print(str(e))
        pass

listagood=[]
pool = multiprocessing.Pool(processes=4)
bar=ShadyBar('Fighting', max=100000, suffix='%(percent)d%% - %(index)d / %(max)d - %(elapsed)d')
for result in pool.imap(worker, sys.stdin):
    if result is not None:
        listagood.append(result)
    #print "Result: %r" % (result)
    bar.next()
bar.finish()
print("\nChecksum: ")
print(listagood)
print("\n")
Unfortunately, when I check the elapsed time, the multiprocessing version takes almost three times as long.
I have one processor with two physical cores, and two virtual cores per physical core.
How can I tell whether this is caused by multiprocessing overhead, or whether I did something wrong?
Any help would be much appreciated.

Pool divides the input across multiple processes, in your case 4. But you are passing it one line at a time, which does not actually keep all 4 workers busy.
Use the following and see the change in timing:
pool.imap(worker, sys.stdin.readlines())
Hope this helps.
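For reference, here is a minimal sketch of the same idea combined with imap's chunksize argument (worker is the function from the question; the chunk size of 1000 is just an assumption to tune). Batching many lines per task cuts the per-item inter-process overhead, which is usually what makes the pooled version slower than the plain loop:
import sys
import multiprocessing

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    lines = sys.stdin.readlines()   # read all input up front
    listagood = []
    # chunksize batches many lines into each task sent to a worker,
    # so the workers spend their time hashing instead of waiting on IPC
    for result in pool.imap(worker, lines, chunksize=1000):
        if result is not None:
            listagood.append(result)
    pool.close()
    pool.join()
    print(listagood)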

Related

Python pool.map function completes but leaves zombies

I've been having an issue where pool.map leaves processes behind even after pool.terminate is called. I've looked for solutions, but they all seem to involve some other issue, like recursively calling the map function or another process that interferes with the multiprocessing.
So my code imports 2 NetCDF files and processes the data in them using different calculations. These take up a lot of time (several 6400x6400 arrays), so I tried to multiprocess my code. The multiprocessing works, and the first time I run my code it takes 2.5 minutes (down from 8), but every time my code finishes running, the memory used by Spyder never goes back down and it leaves extra Python processes in the Windows task manager. My code looks like this:
import numpy as np
import netCDF4
import math
from math import sin, cos
import logging
from multiprocessing.pool import Pool
import time

start=time.time()
format = "%(asctime)s: %(message)s"
logging.basicConfig(format=format, level=logging.INFO, datefmt="%H:%M:%S")
logging.info("Here we go!")

path = "DATAPATH"
geopath = "DATAPATH"

f = netCDF4.Dataset(path)
f.set_auto_maskandscale(False)
f2 = netCDF4.Dataset(geopath)

i5lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
i4lut=f.groups['observation_data'].variables['I05_brightness_temperature_lut'][:]
I5= f.groups['observation_data'].variables['I05'][:]
I4= f.groups['observation_data'].variables['I04'][:]
I5=i5lut[I5]
I4=i4lut[I4]
I4Quality= f.groups['observation_data'].variables['I04_quality_flags'][:]
I5Quality= f.groups['observation_data'].variables['I05_quality_flags'][:]
I3= f.groups['observation_data'].variables['I03']
I2= f.groups['observation_data'].variables['I02']
I1= f.groups['observation_data'].variables['I01']
I1.set_auto_scale(True)
I2.set_auto_scale(True)
I3.set_auto_scale(True)
I1=I1[:]
I2=I2[:]
I3=I3[:]

lats = f2.groups['geolocation_data'].variables['latitude'][:]
lons = f2.groups['geolocation_data'].variables['longitude'][:]
solarZen = f2.groups['geolocation_data'].variables['solar_zenith'][:]
sensorZen= solarZen = f2.groups['geolocation_data'].variables['sensor_zenith'][:]
solarAz = f2.groups['geolocation_data'].variables['solar_azimuth'][:]
sensorAz= solarZen = f2.groups['geolocation_data'].variables['sensor_azimuth'][:]

def kernMe(i, j, band):
    if i<250 or j<250:
        return -1
    else:
        return np.mean(band[i-250:i+250:1,j-250:j+250:1])

def thread_me(arr):
    start1=arr[0]
    end1=arr[1]
    start2=arr[2]
    end2=arr[3]
    logging.info("Im starting at: %d to %d, %d to %d" %(start1, end1, start2, end2))
    points = []
    avg = np.mean(I4)
    for i in range(start1,end1):
        for j in range(start2,end2):
            if solarZen[i,j]>=90:
                if not (I5[i,j]<265 and I4[i,j]<295):
                    if I4[i,j]>320 and I4Quality[i,j]==0:
                        points.append([lons[i,j],lats[i,j], 1])
                    elif I4[i,j]>300 and I5[i,j]-I4[i,j]>10:
                        points.append([lons[i,j],lats[i,j], 2])
                    elif I4[i,j] == 367 and I4Quality ==9:
                        points.append([lons[i,j],lats[i,j, 3]])
            else:
                if not ((I1[i,j]>I2[i,j]>I3[i,j]) or (I5[i,j]<265 or (I1[i,j]+I2[i,j]>0.9 and I5[i,j]<295) or
                        (I1[i,j]+I2[i,j]>0.7 and I5[i,j]<285))):
                    if not (I1[i,j]+I2[i,j] > 0.6 and I5[i,j]<285 and I3[i,j]>0.3 and I3[i,j]>I2[i,j] and I2[i,j]>0.25 and I4[i,j]<=335):
                        thetaG= (cos(sensorZen[i,j]*(math.pi/180))*cos(solarZen[i,j]*(math.pi/180)))-(sin(sensorZen[i,j]*(math.pi/180))*sin(solarZen[i,j]*(math.pi/180))*cos(sensorAz[i,j]*(math.pi/180)))
                        thetaG= math.acos(thetaG)*(180/math.pi)
                        if not ((thetaG<15 and I1[i,j]+I2[i,j]>0.35) or (thetaG<25 and I1[i,j]+I2[i,j]>0.4)):
                            if math.floor(I4[i,j])==367 and I4Quality[i,j]==9 and I5>290 and I5Quality[i,j]==0 and (I1[i,j]+I2[i,j])>0.7:
                                points.append([lons[i,j],lats[i,j, 4]])
                            elif I4[i,j]-I5[i,j]>25 or True:
                                kern = kernMe(i, j, I4)
                                if kern!=-1 or True:
                                    BT4M = max(325, kern)
                                    kern = min(330, BT4M)
                                    if I4[i,j]> kern and I4[i,j]>avg:
                                        points.append([lons[i,j],lats[i,j], 5])
    return points

if __name__ == '__main__':
    # Separate the arrays into 1616*1600 chunks for multi processing
    # TODO: make this automatic, not hardcoded
    arg=[[0,1616,0,1600],[0,1616,1600,3200],[0,1616,3200,4800],[0,1616,4800,6400],
         [1616,3232,0,1600],[1616,3232,1600,3200],[1616,3232,3200,4800],[1616,3232,4800,6400],
         [3232,4848,0,1600],[3232,4848,1600,3200],[3232,4848,3200,4800],[3232,4848,4800,6400],
         [4848,6464,0,1600],[4848,6464,1600,3200],[4848,6464,3200,4800],[4848,6464,4800,6400]]
    print(arg)
    p=Pool(processes = 4)
    output= p.map(thread_me, arg)
    p.close()
    p.join()
    print(output)
    f.close()
    f2.close()
    logging.info("Aaaand we're here!")
    print(str((time.time()-start)/60))
    p.terminate()
I use both p.close and p.terminate because I thought it would help (it doesn't). All of my code runs and produces the expected output, but I have to manually end the lingering processes using the task manager. Any ideas as to what's causing this?
I think I put all the relevant information here; if you need more I'll edit in the requested details.
Thanks in advance.
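For reference, a minimal sketch of the shutdown sequence described in the multiprocessing docs, using the Pool context manager (thread_me and arg are the ones defined above; whether this alone releases the processes that Spyder keeps around is an assumption, not something verified here):
import multiprocessing as mp

if __name__ == '__main__':
    # close() stops new work, join() waits for the workers to exit;
    # the context manager then calls terminate() on an already-drained pool.
    with mp.Pool(processes=4) as p:
        output = p.map(thread_me, arg)   # thread_me and arg as in the question
        p.close()
        p.join()
    print(output)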

How to parallelize a nested for loop in python?

OK, here is my problem: I have a nested for loop in my program which runs on a single core. Since the program spends over 99% of its run time in this nested for loop, I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop by using the multiprocessing library, but I only find very basic examples and cannot transfer them to my problem. Here are the nested loops with random data:
import numpy as np

dist_n = 100
nrm = np.linspace(1,10,dist_n)
data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

for t in range(data_Y):
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]
print(dist)
Please give me some advice on how to make it parallel.
There's a small overhead to starting a process (50 ms+, depending on data size), so it's generally best to give multiprocessing the largest block of work possible. From your comment it sounds like each iteration over t is independent, so we should be free to parallelize this.
When Python creates a new process you get a copy of the main process, so all your global data is available, but when each process writes to that data, it writes to its own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return (which will have some overhead). In your situation, if each process writes its dist[i,p] values to a file then you should be fine; just don't try to write to the same file unless you implement some type of mutex access control.
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy as np

data_Y = 11   #11000
data_I = 90   #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

def worker(t):
    st = time.time()
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]
    # Here - each worker opens a different file and writes to it
    print 'Worker time %4.3f mS' % (1000.*(time.time()-st))

if 1:   # single threaded
    st = time.time()
    for x in map(worker, range(data_Y)):
        pass
    print 'Single-process total time is %4.3f seconds' % (time.time()-st)
    print

if 1:   # multi-threaded
    pool = mp.Pool(28)   # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
    print
If you increase data_Y/data_I back to their original sizes, the speed-up should increase up to the theoretical limit.
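As a variation on the point above about passing results back with a return (a sketch of mine, not part of the original answer): each worker can return its block of dist and the parent can collect the blocks, at the cost of pickling them back across the process boundary.
import multiprocessing as mp
import numpy as np

dist_n = 100
nrm = np.linspace(1, 10, dist_n)
data_Y = 11    # small sizes for a quick test
data_I = 90
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)

def worker(t):
    # Compute the full (data_I, dist_n) block for one t and return it
    # instead of writing into a global array that only the child sees.
    block = np.zeros((data_I, dist_n))
    d = np.abs(I - Y[t])                        # shape (data_I, 1000)
    for p in range(dist_n):
        block[:, p] = np.sum(d**nrm[p], axis=1) / nrm[p]
    return t, block

if __name__ == '__main__':
    with mp.Pool() as pool:
        results = dict(pool.imap_unordered(worker, range(data_Y)))
    # results[t] is the distance block for iteration t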

Proper handling of parallel function that writes output in Python

I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import multiprocessing as mp
import pandas as pd
from glob import glob

def processor(fi):
    df = pd.read_table(fi)
    # ...do some processing to the df...
    filename = fi.split('/')[-1][:-4]
    df.to_pickle('{}.pkl'.format(filename))

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    pool = mp.Pool(8)
    for _ in pool.imap_unordered(processor, files):
        pass
Now, this actually works fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (and doesn't throw any errors).
Suggestions?
You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs, return immediately". If you don't wait and there's nothing after that call, your program will exit and all subprocesses will be killed immediately, before completing - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep

def my_func(val):
    print('Executing %s' % val)
    sleep(0.5)
    print('Done %s' % val)

pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, you can use plain map, in which case you don't have to wait for anything: it is not asynchronous, so it blocks until all jobs are complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)
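Since the question also asked about avoiding the explicit loop, here is a minimal sketch (my addition, not part of either answer) using the Pool context manager available in Python 3; map blocks until every file has been processed, so no loop over the results is needed:
import multiprocessing as mp
from glob import glob

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    # The context manager tears the pool down on exit; map blocks
    # until every call to processor has completed.
    with mp.Pool(8) as pool:
        pool.map(processor, files)   # processor as defined in the question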

Parallelizing a Numpy vector operation

Let's use, for example, numpy.sin()
The following code will return the value of the sine for each value of the array a:
import numpy
a = numpy.arange( 1000000 )
result = numpy.sin( a )
But my machine has 32 cores, so I'd like to make use of them. (The overhead might not be worthwhile for something like numpy.sin() but the function I actually want to use is quite a bit more complicated, and I will be working with a huge amount of data.)
Is this the best (read: smartest or fastest) method:
from multiprocessing import Pool
if __name__ == '__main__':
pool = Pool()
result = pool.map( numpy.sin, a )
or is there a better way to do this?
There is a better way: numexpr
Slightly reworded from their main page:
It's a multi-threaded VM written in C that analyzes expressions, rewrites them more efficiently, and compiles them on the fly into code that gets near-optimal parallel performance for both memory- and CPU-bound operations.
For example, on my 4-core machine, evaluating a sine is just slightly less than 4 times faster with numexpr than with numpy.
In [1]: import numpy as np
In [2]: import numexpr as ne
In [3]: a = np.arange(1000000)
In [4]: timeit ne.evaluate('sin(a)')
100 loops, best of 3: 15.6 ms per loop
In [5]: timeit np.sin(a)
10 loops, best of 3: 54 ms per loop
Documentation, including the supported functions, is here. You'll have to check, or give us more information, to see whether your more complicated function can be evaluated by numexpr.
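For completeness, a small sketch (my addition) showing how to pin numexpr's thread count explicitly with ne.set_num_threads, which is part of the numexpr API:
import numpy as np
import numexpr as ne

a = np.arange(1000000, dtype=np.float64)
ne.set_num_threads(4)              # cap the worker threads numexpr uses
result = ne.evaluate('sin(a)')     # evaluated in parallel, in compiled code
assert np.allclose(result, np.sin(a))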
Well, this is kind of an interesting note: if you run the following commands:
import numpy
from multiprocessing import Pool
a = numpy.arange(1000000)
pool = Pool(processes = 5)
result = pool.map(numpy.sin, a)
UnpicklingError: NEWOBJ class argument has NULL tp_new
I wasn't expecting that. So what's going on? Well:
>>> help(numpy.sin)
Help on ufunc object:
sin = class ufunc(__builtin__.object)
| Functions that operate element by element on whole arrays.
|
| To see the documentation for a specific ufunc, use np.info(). For
| example, np.info(np.sin). Because ufuncs are written in C
| (for speed) and linked into Python with NumPy's ufunc facility,
| Python's help() function finds this page whenever help() is called
| on a ufunc.
Yep, numpy.sin is implemented in C as a ufunc, and in this setup it can't be pickled properly, so you can't really use it directly with multiprocessing.
So we have to wrap it with another function.
perf:
import time
import numpy
from multiprocessing import Pool

def numpy_sin(value):
    return numpy.sin(value)

a = numpy.arange(1000000)
pool = Pool(processes = 5)

start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)

start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.032201
Multithreaded 10.550432
Wow, I wasn't expecting that either. Well, there are a couple of issues: for starters, we are using a Python function, even if it is just a wrapper around a pure C function, and there is also the overhead of copying the values; multiprocessing by default doesn't share data, so each value needs to be copied back and forth.
Do note that if we properly segment our data:
import time
import numpy
from multiprocessing import Pool

def numpy_sin(value):
    return numpy.sin(value)

a = [numpy.arange(100000) for _ in xrange(10)]
pool = Pool(processes = 5)

start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)

start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.150192
Multithreaded 0.055083
So what can we take from this? Multiprocessing is great, but we should always test and compare; sometimes it's faster and sometimes it's slower, depending on how it's used...
Granted, you are not using numpy.sin but another function. I would recommend you first verify that multiprocessing will indeed speed up the computation; maybe the overhead of copying values back and forth affects you.
Either way, I also believe that using pool.map is the best, safest method of parallelizing code...
I hope this helps.
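To make the segmentation step concrete, here is a sketch of mine (assuming the real function works element-wise on whole arrays): np.array_split chunks one large array into a few pieces, and np.concatenate stitches the per-chunk results back together.
import numpy as np
from multiprocessing import Pool

def numpy_sin(chunk):
    # Each worker receives a whole chunk, so the per-item copy overhead is amortized
    return np.sin(chunk)

if __name__ == '__main__':
    a = np.arange(1000000)
    chunks = np.array_split(a, 8)      # a handful of pieces, not a million scalars
    with Pool(processes=4) as pool:
        result = np.concatenate(pool.map(numpy_sin, chunks))
    assert np.allclose(result, np.sin(a))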
SciPy actually has a pretty good writeup on this subject here.

Repeatedly write to stdin and read from stdout of a process from python

I have a piece of fortran code that reads some numbers from STDIN and writes results to STDOUT. For example:
do
  read (*,*) x
  y = x*x
  write (*,*) y
enddo
So I can start the program from a shell and get the following sequence of inputs/outputs:
5.0
25.0
2.5
6.25
Now I need to do this from within python. After futilely wrestling with subprocess.Popen and looking through old questions on this site, I decided to use pexpect.spawn:
import pexpect, os
p = pexpect.spawn('squarer')
p.setecho(False)
p.write("2.5" + os.linesep)
res = p.readline()
and it works. The problem is, the real data I need to pass between python and my fortran program is an array of 100,000 (or more) double precision floats. If they're contained in an array called x, then
p.write(' '.join(["%.10f"%k for k in x]) + os.linesep)
times out with the following error message from pexpect:
buffer (last 100 chars):
before (last 100 chars):
after: <class 'pexpect.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 8574
child_fd: 3
closed: False
timeout: 30
delimiter: <class 'pexpect.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
unless x has fewer than 303 elements. Is there a way to pass large amounts of data to/from the STDIN/STDOUT of another program?
I have tried splitting the data into smaller chunks, but then I lose a lot of speed.
Thanks in advance.
Found a solution using the subprocess module, so I'm posting it here for reference if anyone needs to do the same thing.
import subprocess as sbp
import numpy as np

class ExternalProg:
    def __init__(self, arg_list):
        self.opt = sbp.Popen(arg_list, stdin=sbp.PIPE, stdout=sbp.PIPE, shell=True, close_fds=True)
    def toString(self, x):
        return ' '.join(["%.12f" % k for k in x])
    def toFloat(self, x):
        return np.array(x.strip().split(), dtype=np.float64)
    def sendString(self, string):
        if not string.endswith('\n'):
            string = string + '\n'
        self.opt.stdin.write(string)
    def sendArray(self, x):
        self.opt.stdin.write(self.toString(x) + '\n')
    def readInt(self):
        return int(self.opt.stdout.readline().strip())
    def sendScalar(self, x):
        if type(x) == int:
            self.opt.stdin.write("%i\n" % x)
        elif type(x) == float:
            self.opt.stdin.write("%.12f\n" % x)
    def readArray(self):
        return self.toFloat(self.opt.stdout.readline())
    def close(self):
        self.opt.kill()
The class is invoked with an external program called 'optimizer' as:
optim = ExternalProg(['./optimizer'])
optim.sendScalar(500) # send the optimizer the length of the state vector, for example
optim.sendArray(init_x) # the initial guess for x
optim.sendArray(init_g) # the initial gradient g
next_x = optim.readArray() # get the next estimate of x
next_g = evaluateGradient(next_x) # calculate gradient at next_x from within python
# repeat until convergence
On the fortran side (the program compiled to give the executable 'optimizer'), a 500-element vector would be read in so:
read(*,*) input_vector(1:500)
and would be written out so:
write(*,'(500f18.11)') output_vector(1:500)
and that's it! I've tested it with state vectors up to 200,000 elements (which is the upper limit of what I need right now). Hope this helps someone other than myself. This solution works with ifort and xlf90, but not with gfortran for some reason I don't understand.
example squarer.py program (it just happens to be in Python, use your Fortran executable):
#!/usr/bin/python
import sys
data= sys.stdin.readline() # expecting lots of data in one line
processed_data= data[-2::-1] # reverse without the newline
sys.stdout.write(processed_data+'\n')
example target.py program:
import thread, Queue
import subprocess as sbp

class Companion(object):
    "A companion process manager"
    def __init__(self, cmdline):
        "Start the companion process"
        self.companion= sbp.Popen(
            cmdline, shell=False,
            stdin=sbp.PIPE,
            stdout=sbp.PIPE)
        self.putque= Queue.Queue()
        self.getque= Queue.Queue()
        thread.start_new_thread(self._sender, (self.putque,))
        thread.start_new_thread(self._receiver, (self.getque,))
    def _sender(self, que):
        "Actually sends the data to the companion process"
        while 1:
            datum= que.get()
            if datum is Ellipsis:
                break
            self.companion.stdin.write(datum)
            if not datum.endswith('\n'):
                self.companion.stdin.write('\n')
    def _receiver(self, que):
        "Actually receives data from the companion process"
        while 1:
            datum= self.companion.stdout.readline()
            que.put(datum)
    def close(self):
        self.putque.put(Ellipsis)
    def send(self, data):
        "Schedule a long line to be sent to the companion process"
        self.putque.put(data)
    def recv(self):
        "Get a long line of output from the companion process"
        return self.getque.get()

def main():
    my_data= '12345678 ' * 5000
    my_companion= Companion(("/usr/bin/python", "squarer.py"))
    my_companion.send(my_data)
    my_answer= my_companion.recv()
    print my_answer[:20] # don't print the long stuff
    # rinse, repeat
    my_companion.close()

if __name__ == "__main__":
    main()
The main function contains the code you will use: set up a Companion object, companion.send a long line of data, companion.recv a line. Repeat as necessary.
Here's a huge simplification: Break your Python into two things.
python source.py | squarer | python sink.py
The squarer application is your Fortran code. Reads from stdin, writes to stdout.
Your source.py is your Python that does
import sys, os
sys.stdout.write(' '.join(["%.10f"%k for k in x]) + os.linesep)
Or, perhaps something a tiny bit simpler, i.e.
from __future__ import print_function
print( ' '.join(["{0:.10f}".format(k) for k in x]) )
And your sink.py is something like this.
import fileinput
for line in fileinput.input():
# process the line
Separating source, squarer and sink gets you 3 separate processes (instead of 2) and will use more cores. More cores == more concurrency == more fun.
I think that you only add one line break here:
p.write(' '.join(["%.10f"%k for k in x]) + os.linesep)
instead of adding one per line.
It looks like you're timing out (the default timeout is, I believe, 30 seconds) because preparing, sending, receiving, and processing that much data takes a lot of time. Per the docs, timeout= is an optional named parameter to the expect method, which you're not calling -- maybe there's an undocumented way to set the default timeout in the initializer, which could be found by poring over the sources (or, worst case, created by hacking those sources).
If the Fortran program read and saved (say) 100 items at a time, with a prompt, syncing up would become enormously easier. Could you modify your Fortran code for that purpose, or would you rather go for the undocumented/hack approach?
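For what it's worth, a minimal sketch of simply raising the timeout when spawning (pexpect.spawn takes a timeout argument in current versions; whether that was true for the version used in the question is an assumption on my part), keeping the rest of the question's code unchanged:
import os
import pexpect

# Allow up to 5 minutes for the child to respond before TIMEOUT is raised
p = pexpect.spawn('squarer', timeout=300)
p.setecho(False)
p.write(' '.join(["%.10f" % k for k in x]) + os.linesep)   # x as in the question
res = p.readline()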
