Python multiprocessing MUCH slower than sequential

Python multiprocessing MUCH slower than sequential - python

To start out with, I know there are about 20 questions with similar titles, and I promise I've read all of them.
I know about most of the drawbacks of Python multiprocessing, and I wasn't expecting a massive speedup from applying it. This is my first time using multiprocessing.Process, in the past I've always used a Pool. I think I'm doing something wrong with it, because it is running several orders of magnitude slower. One iteration of the sequential method is running in less than a second, whereas one iteration of the parallel method is taking well more than a minute.
For context, this is an n-body simulator, and this particular method is checking for collisions and updating forces acting on each body. I know there are better ways to do this, this is just for my learning.
Here is the code:
from multiprocessing import Process, Manager
def par_run_helper(self, part, destroy, add):
for other in self.parts:
if other not in destroy and other is not part:
if part.touches(other):
add.append(part + other)
destroy.append(other)
destroy.append(part)
self.touches += 1
print("TOUCH " + str(part.size + other.size))
else:
part.interact(other, self.g)
def par_run(self, numTicks=1, visualizeEvery=1, visualizeAfter=0):
manager = Manager()
destroy = manager.list()
add = manager.list()
for tick in range(numTicks):
print(tick)
processes = []
for part in self.parts:
print(part)
p = Process(target=self.par_run_helper, args=(part, destroy, add))
p.start()
processes.append(p)
for p in processes:
p.join()
for p in destroy:
try:
self.parts.remove(p)
except Exception:
pass
for p in add:
self.parts.append(p)
Each iteration should take approximately the same amount of time, as they're all operating on the same number of elements.
Manager is multiprocessing.Manager().
part is short for particle and is the particle I am currently updating.
destroy and add are lists of particles that will be destroyed and added at the end of the tick.
I tested with only 300 parts, but I'd like to be able to do as many as 1000.

Related

How to generate a counter for finding a hash with 9 leading zeroes

I'm trying to create a function that will generate a hash using sha1 algorithm with 9 leading zeroes. The hash is based on some random data and, like in concurrency mining, I just want to add 1 to the string that is used in the hash function.
For this to be faster I used map() from the Pool class to make it run on all my cores, but I have an issue if I pass a chunk larger than range(99999999)
def computesha(counter):
hash = 'somedata'+'otherdata'+str(counter)
newHash = hashlib.sha1(hash.encode()).hexdigest()
if newHash[:9] == '000000000':
print(str(newHash))
print(str(counter))
return str(newHash), str(counter)
if __name__ == '__main__':
d1 = datetime.datetime.now()
print("Start timestamp" + str(d1))
manager = multiprocessing.Manager()
return_dict = manager.dict()
p = Pool()
p.map(computesha, range(sys.maxsize) )
print(return_dict)
p.close()
p.join()
d2 = datetime.datetime.now()
print("End timestamp " + str(d2))
print("Elapsed time: " + str((d2-d1)))
I want to create something similar to a global counter to feed it into the function while it is running multi-threaded, but if I try range(sys.maxsize) I get a MemoryError (I know, because i don't have enough RAM, few have), but I want to split the list generated by range() into chunks.
Is this possible or should I try a different approach?

Hi Alin and welcome to stackoverflow.
Firstly, yes, a global counter is possible. E.g with a multiprocessing.Queue or a multiprocessing.Value which is passed to the workers. However, fetching a new number from the global counter would result in locking (and possibly waiting for) the counter. This can and should be avoided, as you need to make A LOT of counter queries. My proposed solution below avoids the global counter by installing several local counters which work together as if they were a single global counter.
Regarding the RAM consumption of your code, I see two problems:
computesha returns a None value most of the time. This goes into the iterator which is created by map (even though you do not assign the return value of map). This means, that the iterator is a lot bigger than necessary.
Generally speaking, the RAM of a process is freed, after the process finishes. Your processes start A LOT of tasks which all reserve their own memory. A possible solution is the maxtasksperchild option (see the documentation of multiprocessing.pool.Pool). When you set this option to 1000, it closes the process after 1000 task and creates a new one, which frees the memory.
However, i'd like to propose a different solution which solves both problems, is very memory-friendly and runs faster (as it seems to me after N<10 tests) as the solution with the maxtasksperchild option:
#!/usr/bin/env python3
import datetime
import multiprocessing
import hashlib
import sys
def computesha(process_number, number_of_processes, max_counter, results):
counter = process_number # every process starts with a different counter
data = 'somedata' + 'otherdata'
while counter < max_counter: #stop after max_counter jobs have been started
hash = "".join((data,str(counter)))
newHash = hashlib.sha1(hash.encode()).hexdigest()
if newHash[:9] == '000000000':
print(str(newHash))
print(str(counter))
# return the results through a queue
results.put((str(newHash), str(counter)))
counter += number_of_processes # 'jump' to the next chunk
if __name__ == '__main__':
# execute this file with two command line arguments:
number_of_processes = int(sys.argv[1])
max_counter = int(sys.argv[2])
# this queue will be used to collect the results after the jobs finished
results = multiprocessing.Queue()
processes = []
# start a number of processes...
for i in range(number_of_processes):
p = multiprocessing.Process(target=computesha, args=(i,
number_of_processes,
max_counter,
results))
p.start()
processes.append(p)
# ... then wait for all processes to end
for p in processes:
p.join()
# collect results
while not results.empty():
print(results.get())
results.close()
This code spawns the desired number_of_processes which then call the computesha function. If number_of_processes=8 then the first process calculates the hash for the counter values [0,8,16,24,...], the second process for [1,9,17,25] and so on.
The advantages of this approach: In each iteration of the while loop the memory of hash, and newHash can be reused, loops are cheaper than functions and only number_of_processes function calls have to be made, and the uninteresting results are simply forgotten.
A possible disadvantage is, that the counters are completely independent and every process will do exactly 1/number_of_processes of the overall work, even if the some are faster than others. Eventually, the program is as fast as the slowest process. I did't measure it, but I guess it is a rather theoretical problem here.
Hope that helps!

Why my parallel code is slower than the sequential

I am trying to implement an online recursive parallel algorithm, which is highly parallelizable. My problem is that my python implementation does not work as I want. I have two 2D matrices where I want to update recursively every column every time a new observation is observed at time-step t.
My parallel code is like this
def apply_async(t):
worker = mp.Pool(processes = 4)
for i in range(4):
X[:,i,np.newaxis], b[:,i,np.newaxis] = worker.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis])).get()
worker.close()
worker.join()
for t in range(p,T):
count = 0
for l in range(p):
for k in range(4):
gn[count]=train[t-l-1,k]
count+=1
G = G*v + gn # gn.T
Gt = (1/(t-p+1))*G
if __name__ == '__main__':
apply_async(t)
The two matrices are X and b. I want to replace directly on master's memory as each process updates recursively only one specific column of the matrices.
Why this implementation is slower than the sequential?
Is there any way to resume the process every time-step rather than killing them and create them again? Could this be the reason it is slower?

The reason is, your program is in practice sequential. This is an example code snippet that is from parallelism standpoint identical to yours:
from multiprocessing import Pool
from time import sleep
def gwork( qq):
print (qq)
sleep(1)
return 42
p = Pool(processes=4)
for q in range(1, 10):
p.apply_async(gwork, args=(q,)).get()
p.close()
p.join()
Run this and you shall notice numbers 1-9 appearing exactly once in a second. Why is this? The reason is your .get(). This means every call to apply_async will in practice block in get() until a result is available. It will submit one task, wait a second emulating processing delay, then return the result, after which another task is submitted to your pool. This means there is no parallel execution ongoing at all.
Try replacing the pool management part with this:
results = []
for q in range(1, 10):
res = p.apply_async(gwork, args=(q,))
results.append(res)
p.close()
p.join()
for r in results:
print (r.get())
You can now see parallelism at work, as four of your tasks are now processed simultaneously. Your loop does not block in get, as get is moved out of the loop and results are received only when they are ready.
NB: If your arguments to your worker or the return values from them are large data structures, you will lose some performance. In practice Python implements these as queues, and transmitting a lot of data via a queue is slow on relative terms compared to getting an in-memory copy of a data structure when a subprocess is forked.

Multiprocessing time increases linearly with more cores

I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound, rather than CPU bound? (And if so - what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking it up and running 2 - 10 at once would have been 2 - 10 times faster. Or, potentially I implementing multi-process incorrectly?
## Worker function just included to give some context
def worker(sub_code):
layer = 'in_memory/lyr_{}'.format(sub_code)
arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
arcpy.env.extent = layer
union_name = 'in_memory/union_' + sub_code
arcpy.Union_analysis([fields],
union_name,
"NO_FID", "1 FEET")
#.......Some calculations using cursors
# Templating using Jinjah
context = {}
context['DATE'] = now.strftime("%B %d, %Y")
context['SUB_CD'] = sub_code
context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
# Etc
# Then write the report out using custom function
write_html('template.html', 'output_folder', context)
if __name__ == '__main__':
subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
NUM_CORES = 7
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
for chunk in chunk_list:
jobs = []
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()

There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.

You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)

I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()
Should instead be:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
p.start()
p.join()
Is there a specific reason you are going against the spec of using the multiprocessing library? You are not waiting until the thread terminates before spinning another process up, which is just going to create a whole bunch of processes that are not handled by the parent calling process.

Python multiprocess with pool workers - memory use optimization

I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this so there should be plenty of room. Btw, I've already experimented with queues (also worked but ran into the same MemoryError, plus looked through a bunch of very helpful SO contributions, but not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
## loads a few million company names into id: name dict
return hayCompanies
def getNeedles(*args):
## loads subset of 30K companies into id: name dict (for allocation to workers)
return needleCompanies
def findNeedle(needle, haystack):
""" Identify best match and return results with score """
results = {}
for hayID, hayCompany in haystack.iteritems():
if not isnull(haystack[hayID]):
results[hayID] = levi.setratio(needle.split(' '),
hayCompany.split(' '))
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
def runMatch(args):
""" Execute findNeedle and process results for poolWorker batch"""
batch, first = args
last = first + batch
hayCompanies = getHayStack()
needleCompanies = getTargets(first, last)
needles = defaultdict(list)
current = first
for needleID, needleCompany in needleCompanies.iteritems():
current += 1
needles[targetID] = findNeedle(needleCompany, hayCompanies)
## Then store results
if __name__ == '__main__':
pool = Pool(processes = numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
pool.map_async(runMatch,
itertools.izip(itertools.repeat(targetsPerBatch),
xrange(0,
totalTargets,
targetsPerBatch))).get(99999999)
pool.close()
pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?

Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs work up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool, sends each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
with Pool(processes=numProcesses) as pool:
results = pool.map_async(runMatch, needleCompanies.iteritems(),
chunkSize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
with Pool(processes=numProcesses) as pool:
for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
chunkSize=NUMBER_TWEAKED_IN_TESTING):
# Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
processes = [Process(target=runMatch,
args=(targetsPerBatch,
xrange(0,
totalTargets,
targetsPerBatch)))
for _ in range(numProcesses)]
for p in processes:
p.start()
for p in processes:
p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
global _haystack
_haystack = getHayStack()
def runMatch(args):
global _haystack
# ...
hayCompanies = _haystack
# ...
if __name__ == '__main__':
pool = Pool(processes=numProcesses, initializer=initPool)
# ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
On the other hand, if your total VM usage is still low (note that getrusage doesn't directly have any way to see your total VM size—ru_maxrss is often a useful approximation, especially if ru_nswap is 0) at time the first task runs, the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.

Why python multiprocessing in this code is slower?

I have this sample code. I'm making some math computation (graph theory) and I want to increase the speed so I decided to go with multiprocessing but surprisingly the same code runs even slower than the single process version.
I expect that if you have a list and you split it in half and start two processes it should take approximately half time. I don't think I have some synchronization issue so why the hell is this so slow?
from multiprocessing import Process, Queue
def do_work(queue, aList):
for elem in aList:
newList.append(a_math_function(elem))
queue.put(max(newList))
return max(newList)
def foo():
#I have let's say aList with 100000 objects
q1 = Queue(); q2 = Queue()
p1 = Process(target=do_work, args=[q1, aList[0: size/2])])
p2 = Process(target=do_work, args=[q2, aList[size/2+1: size])])
p1.start(); p2.start()
p1.join(); p2.join()
print(max(q1.get(),q2.get())

Instead of using multiprocessing, try cutting down on the amount of allocation you're doing. You're just filling RAM up with temporary lists (for example the halves of aList you pass the processes, and then the newList lists you create in each process). Quite possibly this is the cause of the slowness as lots of allocation means lots of garbage collection.
Try replacing all your code with this and measuring the speed.
def foo():
print max(a_math_function(x) for x in aList)

Use a multiprocessing.Pool.map() to distribute the workload between workers and get the result aggregated from it.
See the example here below http://docs.python.org/library/multiprocessing.html#multiprocessing.pool.AsyncResult.successful

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python multiprocessing MUCH slower than sequential - python

Related

How to generate a counter for finding a hash with 9 leading zeroes

Why my parallel code is slower than the sequential

Multiprocessing time increases linearly with more cores

Python multiprocess with pool workers - memory use optimization

Why python multiprocessing in this code is slower?

Categories

Resources