Why is my parallel code slower than the sequential version? - Python

I am trying to implement an online recursive algorithm which is highly parallelizable. My problem is that my Python implementation does not work the way I want. I have two 2D matrices, and every time a new observation arrives at time step t I want to recursively update each of their columns.
My parallel code looks like this:
def apply_async(t):
    worker = mp.Pool(processes=4)
    for i in range(4):
        X[:,i,np.newaxis], b[:,i,np.newaxis] = worker.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis])).get()
    worker.close()
    worker.join()

for t in range(p, T):
    count = 0
    for l in range(p):
        for k in range(4):
            gn[count] = train[t-l-1, k]
            count += 1
    G = G*v + gn # gn.T
    Gt = (1/(t-p+1))*G
    if __name__ == '__main__':
        apply_async(t)
The two matrices are X and b. I want the updates written directly to the master's memory, since each process recursively updates only one specific column of the matrices.
Why is this implementation slower than the sequential one?
Is there any way to reuse the processes at every time step rather than killing them and creating them again? Could that be the reason it is slower?

The reason is that your program is, in practice, sequential. Here is an example code snippet that, from a parallelism standpoint, is identical to yours:
from multiprocessing import Pool
from time import sleep

def gwork(qq):
    print(qq)
    sleep(1)
    return 42

p = Pool(processes=4)
for q in range(1, 10):
    p.apply_async(gwork, args=(q,)).get()
p.close()
p.join()
Run this and you will notice the numbers 1-9 appearing exactly one per second. Why is this? The reason is your .get(). Every call to apply_async will in practice block in get() until a result is available: it submits one task, waits a second emulating processing delay, then returns the result, after which the next task is submitted to your pool. This means there is no parallel execution going on at all.
Try replacing the pool management part with this:
results = []
for q in range(1, 10):
    res = p.apply_async(gwork, args=(q,))
    results.append(res)
p.close()
p.join()
for r in results:
    print(r.get())
You can now see parallelism at work, as four of your tasks are now processed simultaneously. Your loop does not block in get, as get is moved out of the loop and results are received only when they are ready.
NB: If the arguments to your workers or the return values from them are large data structures, you will lose some performance. In practice Python implements these as queues, and transmitting a lot of data via a queue is slow in relative terms compared to getting an in-memory copy of a data structure when a subprocess is forked.
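Applied to the original column-update loop, the same restructuring might look like the sketch below (untested; it assumes OULtraining, train, X and b exist exactly as in the question, and, per the NB above, shipping whole column arrays through the queues still has a cost):
import multiprocessing as mp
import numpy as np

def apply_async(t):
    worker = mp.Pool(processes=4)
    # submit all four column updates first, without calling get() inside the loop
    pending = [
        worker.apply_async(OULtraining,
                           args=(train[t, i], X[:, i, np.newaxis], b[:, i, np.newaxis]))
        for i in range(4)
    ]
    worker.close()
    worker.join()
    # fetch the results only after every task has been submitted and has finished
    for i, res in enumerate(pending):
        X[:, i, np.newaxis], b[:, i, np.newaxis] = res.get()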

Related

Python: how to parallelize a simple loop with MPI

I need to rewrite a simple for loop with MPI because each step is time consuming. Let's say I have a list containing several NumPy arrays and I want to apply some computation to each array. For example:
import numpy as np

def myFun(x):
    return x+2  # simple example, the real one would be complicated

dat = [np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)]  # real data would be much larger

result = []
for item in dat:
    result.append(myFun(item))
Instead of using the simple for loop above, I want to use MPI to run the 'for loop' part of the above code in parallel on 24 different nodes. I also want the order of items in the result list to follow the same order as the dat list.
Note: the data is read from another file and can be treated as 'fixed' for each processor.
I haven't used MPI before, so I have been stuck on this for a while.
For simplicity, let us assume that the master process (the process with rank = 0) is the one that will read the entire file from disk into memory. This problem can be solved knowing only the following MPI routines: Get_size(), Get_rank(), scatter, and gather.
The Get_size(): Returns the number of processes in the communicator. It will return the same number to every process.
The Get_rank(): Determines the rank of the calling process in the communicator. In MPI each process is assigned a rank that varies from 0 to N - 1, where N is the total number of processes running.
The scatter: MPI_Scatter involves a designated root process sending data to all processes in a communicator. The primary difference between MPI_Bcast and MPI_Scatter is small but important: MPI_Bcast sends the same piece of data to all processes, while MPI_Scatter sends chunks of an array to different processes.
And the gather: MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from one process to many processes, MPI_Gather takes elements from many processes and gathers them to one single process.
Obviously, you should first follow a tutorial and read the MPI documentation to understand its parallel programming model, and its routines. Otherwise, you will find it very hard to understand how it all works. That being said your code could look like the following:
from mpi4py import MPI

def myFun(x):
    return x+2  # simple example, the real one would be complicated

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # get your process ID

data = None  # init the data on every process
if rank == 0:  # The master is the only process that reads the file
    data = ...  # something read from file, split into one chunk per process

# Divide the data among processes
data = comm.scatter(data, root=0)

result = []
for item in data:
    result.append(myFun(item))

# Send the results back to the master process
newData = comm.gather(result, root=0)
In this way, each process works (in parallel) on only a certain chunk of the data. After finishing its work, each process sends its chunk of results back to the master process (i.e., comm.gather(result, root=0)). This is just a toy example; it is now up to you to improve it according to your testing environment and code.
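A minimal runnable sketch of the missing pieces, assuming dat is built on the master and split into one contiguous sub-list per rank (the chunking helper and the example data are my own additions, not part of the original answer):
from mpi4py import MPI
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = None
if rank == 0:
    dat = [np.random.rand(3, 2) for _ in range(24)]  # stand-in for data read from a file
    # split dat into `size` contiguous chunks: scatter expects exactly one item per rank,
    # and contiguous chunks keep the original order when the results are flattened
    step = (len(dat) + size - 1) // size
    data = [dat[i:i + step] for i in range(0, step * size, step)]

data = comm.scatter(data, root=0)        # each rank receives its own sub-list
result = [myFun(item) for item in data]  # work on the local chunk in parallel
gathered = comm.gather(result, root=0)   # list of per-rank result lists on the master

if rank == 0:
    result = [x for sub in gathered for x in sub]  # flatten; order matches dat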
You could either go the low-level MPI way, as shown in the answer of @dreamcrash, or you could go for a more Pythonic solution that uses an executor pool very similar to the one provided by the standard Python multiprocessing module.
First, you need to turn your code into a more functional-style one by noticing that you are actually doing a map operation, which applies myFun to each element of dat:
import numpy as np

def myFun(x):
    return x+2  # simple example, the real one would be complicated

dat = [
    np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
]  # real data would be much larger

result = map(myFun, dat)
map here runs sequentially in one Python interpreter process.
To run that map in parallel with the multiprocessing module, you only need to instantiate a Pool object and then call its map() method in place of the Python map() function:
from multiprocessing import Pool
import numpy as np

def myFun(x):
    return x+2  # simple example, the real one would be complicated

if __name__ == '__main__':
    dat = [
        np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
    ]  # real data would be much larger

    with Pool() as pool:
        result = pool.map(myFun, dat)
Here, Pool() creates a new executor pool with as many interpreter processes as there are logical CPUs as seen by the OS. Calling the map() method of the pool runs the mapping in parallel by sending items to the different processes in the pool and waiting for completion. Since the worker processes import the Python script as a module, it is important to have the code that was previously at the top level moved under the if __name__ == '__main__': conditional so it doesn't run in the workers too.
Using multiprocessing.Pool() is very convenient because it requires only a slight change of the original code and the module handles for you all the work scheduling and the required data movement to and from the worker processes. The problem with multiprocessing is that it only works on a single host. Fortunately, mpi4py provides a similar interface through the mpi4py.futures.MPIPoolExecutor class:
from mpi4py.futures import MPIPoolExecutor
import numpy as np

def myFun(x):
    return x+2  # simple example, the real one would be complicated

if __name__ == '__main__':
    dat = [
        np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
    ]  # real data would be much larger

    with MPIPoolExecutor() as pool:
        result = pool.map(myFun, dat)
Like with the Pool object from the multiprocessing module, the MPI pool executor handles for you all the work scheduling and data movement.
There are two ways to run the MPI program. The first one starts the script as an MPI singleton and then uses the MPI process control facility to spawn a child MPI job with all the pool workers:
mpiexec -n 1 python program.py
You also need to specify the MPI universe size (the total number of MPI ranks in both the main and all child jobs). The specific way of doing so differs between the implementations, so you need to consult your implementation's manual.
The second option is to launch directly the desired number of MPI ranks and have them execute the mpi4py.futures module itself with the script name as argument:
mpiexec -n 24 python -m mpi4py.futures program.py
Keep in mind that no matter which way you launch the script, one MPI rank will be reserved for the controller and will not be running mapping tasks. You are aiming at running on 24 hosts, so you should have plenty of CPU cores and can probably afford to have one reserved. Or you could instruct MPI to oversubscribe the first host with one more rank.
One thing to note with both multiprocessing.Pool and mpi4py.futures.MPIPoolExecutor is that the map() method guarantees the order of the items in the output array, but it doesn't guarantee the order in which the different items are evaluated. This shouldn't be a problem in most cases.
A word of advice. If your data actually consists of chunks read from a file, you may be tempted to do something like this:
if __name__ == '__main__':
    data = read_chunks()
    with MPIPoolExecutor() as p:
        result = p.map(myFun, data)
Don't do that. Instead, if possible, e.g., if enabled by the presence of a shared (and hopefully parallel) filesystem, delegate the reading to the workers:
NUM_CHUNKS = 100

def myFun(chunk_num):
    # You may need to pass the value of NUM_CHUNKS to read_chunk()
    # for it to be able to seek to the right position in the file
    data = read_chunk(NUM_CHUNKS, chunk_num)
    return ...

if __name__ == '__main__':
    chunk_nums = range(NUM_CHUNKS)  # 100 chunks
    with MPIPoolExecutor() as p:
        result = p.map(myFun, chunk_nums)
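read_chunk() is only named in the answer, not defined. Purely as an illustration of the idea, a hypothetical version for a binary file of fixed-size records on a shared filesystem might look like the following (the file name, record shape, and dtype are all assumptions):
import os
import numpy as np

RECORD_BYTES = 3 * 2 * 8  # one float64 array of shape (3, 2), purely illustrative
DATA_FILE = 'data.bin'    # hypothetical file on a shared filesystem

def read_chunk(num_chunks, chunk_num):
    # work out which records belong to this chunk and seek straight to them
    total_records = os.path.getsize(DATA_FILE) // RECORD_BYTES
    per_chunk = (total_records + num_chunks - 1) // num_chunks
    start = chunk_num * per_chunk
    count = max(0, min(per_chunk, total_records - start))
    with open(DATA_FILE, 'rb') as f:
        f.seek(start * RECORD_BYTES)
        raw = np.fromfile(f, dtype=np.float64, count=count * 3 * 2)
    return raw.reshape(-1, 3, 2)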

How to generate a counter for finding a hash with 9 leading zeroes

I'm trying to create a function that will generate a hash using the sha1 algorithm with 9 leading zeroes. The hash is based on some random data and, like in cryptocurrency mining, I just want to add 1 to the string that is used in the hash function.
To make this faster I used map() from the Pool class so that it runs on all my cores, but I have an issue if I pass a range larger than range(99999999):
import datetime
import hashlib
import multiprocessing
import sys
from multiprocessing import Pool

def computesha(counter):
    hash = 'somedata' + 'otherdata' + str(counter)
    newHash = hashlib.sha1(hash.encode()).hexdigest()
    if newHash[:9] == '000000000':
        print(str(newHash))
        print(str(counter))
        return str(newHash), str(counter)

if __name__ == '__main__':
    d1 = datetime.datetime.now()
    print("Start timestamp " + str(d1))
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    p = Pool()
    p.map(computesha, range(sys.maxsize))
    print(return_dict)
    p.close()
    p.join()
    d2 = datetime.datetime.now()
    print("End timestamp " + str(d2))
    print("Elapsed time: " + str((d2-d1)))
I want to create something similar to a global counter to feed into the function while it is running in parallel, but if I try range(sys.maxsize) I get a MemoryError (I know, because I don't have enough RAM, few have), so I want to split the list generated by range() into chunks.
Is this possible or should I try a different approach?
Hi Alin and welcome to Stack Overflow.
Firstly, yes, a global counter is possible, e.g. with a multiprocessing.Queue or a multiprocessing.Value which is passed to the workers. However, fetching a new number from the global counter would result in locking (and possibly waiting for) the counter. This can and should be avoided, as you need to make A LOT of counter queries. My proposed solution below avoids the global counter by installing several local counters which work together as if they were a single global counter.
Regarding the RAM consumption of your code, I see two problems:
computesha returns a None value most of the time. This goes into the iterator which is created by map (even though you do not assign the return value of map). This means that the iterator is a lot bigger than necessary.
Generally speaking, the RAM of a process is freed after the process finishes. Your processes start A LOT of tasks which all reserve their own memory. A possible solution is the maxtasksperchild option (see the documentation of multiprocessing.pool.Pool). When you set this option to 1000, the pool closes a process after 1000 tasks and creates a new one, which frees the memory.
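As a hedged sketch of what those two fixes alone could look like on the question's Pool-based code (keeping computesha from the question; the counter limit and chunksize are arbitrary numbers of mine, and imap_unordered is used so the many None results can be dropped on the fly), before moving on to the alternative below:
from multiprocessing import Pool

if __name__ == '__main__':
    # recycle each worker after 1000 tasks so its memory is handed back to the OS
    with Pool(maxtasksperchild=1000) as p:
        # imap_unordered yields results lazily, so the frequent None returns can
        # be discarded immediately instead of accumulating in one huge result list
        for res in p.imap_unordered(computesha, range(100_000_000), chunksize=10_000):
            if res is not None:
                print(res)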
However, I'd like to propose a different solution which solves both problems, is very memory-friendly, and (as it seems to me after N < 10 tests) runs faster than the solution with the maxtasksperchild option:
#!/usr/bin/env python3
import datetime
import multiprocessing
import hashlib
import sys

def computesha(process_number, number_of_processes, max_counter, results):
    counter = process_number  # every process starts with a different counter
    data = 'somedata' + 'otherdata'

    while counter < max_counter:  # stop after max_counter jobs have been started
        hash = "".join((data, str(counter)))
        newHash = hashlib.sha1(hash.encode()).hexdigest()
        if newHash[:9] == '000000000':
            print(str(newHash))
            print(str(counter))
            # return the results through a queue
            results.put((str(newHash), str(counter)))
        counter += number_of_processes  # 'jump' to the next chunk

if __name__ == '__main__':
    # execute this file with two command line arguments:
    number_of_processes = int(sys.argv[1])
    max_counter = int(sys.argv[2])

    # this queue will be used to collect the results after the jobs finished
    results = multiprocessing.Queue()

    processes = []
    # start a number of processes...
    for i in range(number_of_processes):
        p = multiprocessing.Process(target=computesha, args=(i,
                                                             number_of_processes,
                                                             max_counter,
                                                             results))
        p.start()
        processes.append(p)

    # ... then wait for all processes to end
    for p in processes:
        p.join()

    # collect results
    while not results.empty():
        print(results.get())
    results.close()
This code spawns the desired number_of_processes which then call the computesha function. If number_of_processes=8 then the first process calculates the hash for the counter values [0,8,16,24,...], the second process for [1,9,17,25] and so on.
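For example, assuming the code above is saved as find_hash.py (the file name is my own placeholder):
python3 find_hash.py 8 1000000000
This starts 8 worker processes that together cover every counter value below one billion, each one testing every 8th value.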
The advantages of this approach: in each iteration of the while loop the memory of hash and newHash can be reused, loops are cheaper than function calls, only number_of_processes function calls have to be made, and the uninteresting results are simply forgotten.
A possible disadvantage is that the counters are completely independent and every process will do exactly 1/number_of_processes of the overall work, even if some are faster than others. Eventually, the program is as fast as the slowest process. I didn't measure it, but I guess it is a rather theoretical problem here.
Hope that helps!

Multiprocessing pool map_async for one function then block before the next (python 3)

Please be warned that this demonstration code generates a few GB of data.
I have been using versions of the code below for multiprocessing for some time. It works well when the run time of each process in the pool is similar, but if one process takes much longer I end up with many blocked processes waiting on that one, so I'm trying to make it run asynchronously - just for one function at a time.
For example, if I have 70 cores and need to run a function 2000 times, I want that to run asynchronously and then wait for the last process before calling the next function. Currently it just submits processes in batches of however many cores I give it, and each batch has to wait for the longest process.
As you can see I've tried using map_async but this is clearly the wrong syntax. Can anyone help me out?
import os
from multiprocessing import Pool

p = 'PATH/test/'

def f1(tup):
    x, y = tup
    to_write = x*(y**5)
    with open(p+x+str(y)+'.txt', 'w') as fout:
        fout.write(to_write)

def f2(tup):
    x, y = tup
    print(os.path.exists(p+x+str(y)+'.txt'))

def call_func(f, nos, threads, call):
    print(call)
    for i in range(0, len(nos), threads):
        print(i)
        chunk = nos[i:i + threads]
        tmp = [('args', no) for no in chunk]
        pool.map(f, tmp)
        #pool.map_async(f, tmp)

nos = [i for i in range(55)]
threads = 8

if __name__ == '__main__':
    with Pool(processes=threads) as pool:
        call_func(f1, nos, threads, 'f1')
        call_func(f2, nos, threads, 'f2')
map will only return and map_async will only call the callback after all tasks of the current chunk are done.
So you can only either give all tasks to map/map_async at once, or use apply_async (initially called threads times) where the callback calls apply_async for the next task.
If the actual return values of the call don't matter (or at least their order doesn't), imap_unordered may be another efficient solution when giving it all tasks at once (or an iterator/generator producing the tasks on demand)
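A minimal sketch of the "give all tasks to map at once" option, reusing f1/f2 and the tuple arguments from the question (the restructured call_func is my own variant, not the answerer's code):
from multiprocessing import Pool

def call_func(pool, f, nos, call):
    print(call)
    tmp = [('args', no) for no in nos]
    # hand the whole batch to the pool; map() keeps every worker busy and
    # only returns once the last task of this function has finished
    pool.map(f, tmp)

if __name__ == '__main__':
    nos = [i for i in range(55)]
    with Pool(processes=8) as pool:
        call_func(pool, f1, nos, 'f1')  # blocks until all f1 tasks are done
        call_func(pool, f2, nos, 'f2')  # only then are the f2 tasks submitted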

Python multiprocessing MUCH slower than sequential

To start out with, I know there are about 20 questions with similar titles, and I promise I've read all of them.
I know about most of the drawbacks of Python multiprocessing, and I wasn't expecting a massive speedup from applying it. This is my first time using multiprocessing.Process; in the past I've always used a Pool. I think I'm doing something wrong with it, because it is running several orders of magnitude slower. One iteration of the sequential method runs in less than a second, whereas one iteration of the parallel method takes well over a minute.
For context, this is an n-body simulator, and this particular method is checking for collisions and updating forces acting on each body. I know there are better ways to do this, this is just for my learning.
Here is the code:
from multiprocessing import Process, Manager

def par_run_helper(self, part, destroy, add):
    for other in self.parts:
        if other not in destroy and other is not part:
            if part.touches(other):
                add.append(part + other)
                destroy.append(other)
                destroy.append(part)
                self.touches += 1
                print("TOUCH " + str(part.size + other.size))
            else:
                part.interact(other, self.g)

def par_run(self, numTicks=1, visualizeEvery=1, visualizeAfter=0):
    manager = Manager()
    destroy = manager.list()
    add = manager.list()
    for tick in range(numTicks):
        print(tick)
        processes = []
        for part in self.parts:
            print(part)
            p = Process(target=self.par_run_helper, args=(part, destroy, add))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        for p in destroy:
            try:
                self.parts.remove(p)
            except Exception:
                pass
        for p in add:
            self.parts.append(p)
Each iteration should take approximately the same amount of time, as they're all operating on the same number of elements.
Manager is multiprocessing.Manager().
part is short for particle and is the particle I am currently updating.
destroy and add are lists of particles that will be destroyed and added at the end of the tick.
I tested with only 300 parts, but I'd like to be able to do as many as 1000.

Multiprocessing time increases linearly with more cores

I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound rather than CPU bound? (And if so, what do I do about it?) I would have thought it was the latter, given that the largest bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking it up and running 2 - 10 at once would be 2 - 10 times faster. Or am I potentially implementing multiprocessing incorrectly?
import arcpy
import multiprocessing

## Worker function just included to give some context
def worker(sub_code):
    layer = 'in_memory/lyr_{}'.format(sub_code)
    arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
    arcpy.env.extent = layer
    union_name = 'in_memory/union_' + sub_code
    arcpy.Union_analysis([fields],
                         union_name,
                         "NO_FID", "1 FEET")
    #.......Some calculations using cursors

    # Templating using Jinja
    context = {}
    context['DATE'] = now.strftime("%B %d, %Y")
    context['SUB_CD'] = sub_code
    context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
    # Etc

    # Then write the report out using custom function
    write_html('template.html', 'output_folder', context)

if __name__ == '__main__':
    subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
    NUM_CORES = 7
    chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
    for chunk in chunk_list:
        jobs = []
        for subbasin in chunk:
            p = multiprocessing.Process(target=worker, args=(subbasin,))
            jobs.append(p)
            p.start()
        for process in jobs:
            process.join()
There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.
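For illustration (with a toy subList of my own), the corrected step of NUM_CORES produces non-overlapping chunks:
NUM_CORES = 7
subList = list(range(20))  # toy stand-in for the real sub-basin codes
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES)]
# [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13], [14, 15, 16, 17, 18, 19]]
# no element is extracted twice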
You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)
I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    jobs.append(p)
    p.start()
for process in jobs:
    process.join()
Should instead be:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    p.start()
    p.join()
Is there a specific reason you are going against the spec of using the multiprocessing library? You are not waiting until the thread terminates before spinning another process up, which is just going to create a whole bunch of processes that are not handled by the parent calling process.
