I wrote a program that uses the multiprocessing module. It was initially written for Linux, but I also want Windows users to be able to run it. The main issue is passing variables (or rather interpolation functions) to subroutines. On Linux I don't have to pass the interpolation function 'interpolator' explicitly, and my call to the subroutines looks like this:
pool = multiprocessing.Pool()
print 'Executing main loop...'
result2 = []
for i in range(0,NN):
    # args must be a tuple, hence (i,) rather than (i)
    pool.apply_async(SliceCalculate, (i,), callback=result2.append)
pool.close()
pool.join()
and everything seems to work fine! Inside, the 'SliceCalculate' function uses:
interpolator=interpolate.NearestNDInterpolator(xyzpoints,weights,rescale=True)
and finds 'interpolator' automatically. On Windows the call needs to look like this:
pool = multiprocessing.Pool()
print 'Executing main loop...'
result2 = []
for i in range(0,NN):
    pool.apply_async(SliceCalculate, (i,interpolator), callback=result2.append)
pool.close()
pool.join()
and that also works, except for one thing: performance drops to something like 40% (same machine used). I compared both versions on Linux. On Windows, the version meant for Windows has the same poor performance, while if I try to run the Linux version (which does not pass 'interpolator') I don't get any results. Any ideas what is wrong?
I can't share the whole program; it's simply too long.
PS. I did some more tests. When the 'interpolator' is big (e.g. the interpolation values in 3D go up to 100x100x100, which is what I need), the performance is as described above. When I limit it to e.g. 40x40x40 interpolation points (source points), both solutions perform the same, and the performance starts to diverge as the 'interpolator' size increases (NearestNDInterpolator is used). Could it be an OS 'issue', so that I can't actually do much more?
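One possible explanation for the difference, offered as an assumption rather than a diagnosis of this particular program: Linux has traditionally started workers with the fork start method, so children inherit module-level objects such as 'interpolator', while Windows uses spawn, where each worker re-imports the main module and anything that is not recreated on import has to be passed (and pickled) with every task. That per-task pickling would also grow with the size of the interpolator. A minimal Python 3 sketch for checking which start method is in effect; the helper name describe_start_method is hypothetical:

```python
import multiprocessing as mp


def describe_start_method():
    """Return the start method in use and all methods this platform supports."""
    return mp.get_start_method(), mp.get_all_start_methods()


if __name__ == "__main__":
    current, available = describe_start_method()
    print("start method in use:", current)
    print("available on this platform:", available)
```

If this reports spawn, module-level state set up only under the `if __name__ == '__main__'` guard will not exist in the workers.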
Related
I am struggling with multiprocessing. I have some heavy image processing to do and wanted to make use of multi-core CPU power. I tried a lot and finally settled on the concurrent.futures module because it is quite handy to use. But when I run my program, it runs and runs and runs and... it does not stop. The basic idea is as follows (not related to image processing, just a dummy setup):
import concurrent.futures as cf
import time
import multiprocessing as mp
def someFunc(seconds, multiplier=1):
    time.sleep(multiplier*seconds)
    return (f'Slept for {multiplier*seconds} s, Proc: {mp.current_process()}')
def parallelize(secs):
    factor=2
    def wrapper(sec):
        return someFunc(sec, factor)
    with cf.ProcessPoolExecutor() as executor:
        results=[executor.submit(wrapper, secs) for _ in range(8)]
        for result in cf.as_completed(results):
            print(result.result())
So, I am running this under Windows 10 in a Jupyter notebook. For this reason, the functions are saved in a separate func.py file, which is imported in the notebook and then run under the if __name__ == '__main__' statement:
import func
if __name__=='__main__':
    func.parallelize('some_int_number')
The reason I do this is that I have to pass two arguments to the parallelize() function, but the submit()-method only provides one argument. I know, one could also make use of the map()-method or whatever, but for some reason (overhead?!?!) the effect of parallelizing is not very significant (I am playing with the possibilities for days now). So I wanted to try the submit()-method as proposed.
BUT, this does not work (the script runs endlessly) and I don't know why. The problem also is, I have to handle a 'static' argument (factor) which is only known in the scope of the parallelize-function.
If I defined the wrapper function outside the parallelize function, the script would run as expected, but then I would have the problem of the static factor variable.
Any ideas?
Greetings
phtagen
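A possible way around this, sketched under the assumption that the hang comes from the nested wrapper not being picklable (a function defined inside another function cannot be pickled by reference, which worker processes require): keep the worker function at module level and bind the static factor with functools.partial. Note also that submit() forwards extra positional and keyword arguments to the callable, so the factor could be passed directly. The names mirror the question; max_workers, n_tasks, and the short sleep are illustrative choices only.

```python
import concurrent.futures as cf
import functools
import multiprocessing as mp
import time


def some_func(seconds, multiplier=1):
    """Module-level worker: picklable because it can be imported by name."""
    time.sleep(multiplier * seconds)
    return f"Slept for {multiplier * seconds} s, Proc: {mp.current_process().name}"


def parallelize(secs, factor=2, n_tasks=4):
    # Bind the 'static' argument up front. A partial of a module-level
    # function pickles fine, unlike a nested def or a lambda.
    task = functools.partial(some_func, multiplier=factor)
    with cf.ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(task, secs) for _ in range(n_tasks)]
        # Equivalent without partial:
        #   executor.submit(some_func, secs, factor)
        return [future.result() for future in cf.as_completed(futures)]


if __name__ == "__main__":
    for line in parallelize(0.01):
        print(line)
```

As in the question, when this is driven from a notebook on Windows the functions still need to live in an importable module such as func.py.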
multiprocessing pool.map works nicely on my old PC but does not work on the new PC.
It hangs in the call to
def wait(self,timeout=None):
    self._event.wait(timeout)
at which point the CPU utilization drops to 0%, with no further response, as if it has gone to sleep.
I wrote a simple test.py as follows
import multiprocessing as mp
letters = ['A','B','C']
def doit(letter):
    for i in range(1000):
        print(str(letter) + ' ' + str(i))
if __name__ == '__main__':
    pool = mp.Pool()
    pool.map(doit,letters)
This works on the old PC with i7-7700k(4cores,8logical), python365-64bit, Win10Pro, PyCharm2018.1 where the stdout displays letters and numbers in non-sequential order as expected.
However, this same code does not work on the new build: i9-7960 (16 cores, 32 logical), python37-64bit, Win10Pro, PyCharm2018.3
New PC bios version has not been updated from 2017/11 (4 months older)
pool.py appears to be the same on both machines (2006-2008 R Oudkerk)
The codeline where it hangs in the 'wait' function is ...
self._event.wait(timeout)
Any help please on where I might look next to find the cause.
Thanks in advance.
....
EDIT::
My further interpretation -
1. GIL (Global interpreter Lock) is not relevant here as this relates to multi-threading only, not multiprocessing.
2. multiprocessing.manager is unnecessary here as the code is consuming static input and producing independent output. So pool.close and pool.join are not required either, as I am not post-process joining results
3. This link is a good introduction to multiprocessing though I don't see a solution in here.
https://docs.python.org/2/library/multiprocessing.html#windows
Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x
def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f,bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example. It's a simple problem of generating ngrams from a string.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random
def f(x):
    return x
def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output
def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000 #100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))
if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially larger (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
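The pickling share of that overhead can be estimated in isolation. The following is a rough Python 3 sketch, not a profile of the original program: time_roundtrip is a hypothetical helper, the token lists are stand-ins for the question's n-grams, and the measurement ignores the queue, pipe, and lock costs entirely, so it only gives a lower bound on the transfer cost per item.

```python
import pickle
import time


def f(x):
    """The do-nothing 'work' from the question."""
    return x


def time_roundtrip(items):
    """Time the work itself versus one pickle/unpickle round trip per item."""
    t0 = time.perf_counter()
    for item in items:
        f(item)
    work = time.perf_counter() - t0

    t0 = time.perf_counter()
    for item in items:
        pickle.loads(pickle.dumps(item, pickle.HIGHEST_PROTOCOL))
    transfer = time.perf_counter() - t0
    return work, transfer


if __name__ == "__main__":
    # Small token lists standing in for the n-grams of the question.
    items = [[str(i), str(i + 1), str(i + 2)] for i in range(100000)]
    work, transfer = time_roundtrip(items)
    print("work: %.4f s, pickle round trip: %.4f s" % (work, transfer))
```

In a real pool each item is pickled and unpickled twice (once as an argument, once as a result), so the actual cost is at least double what this reports.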
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
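As an illustration of the initializer suggestion above, here is a minimal Python 3 sketch. It assumes the large data is read-only; init_worker, sum_slice, make_slices, and the _DATA global are hypothetical names, and summing stands in for whatever the real per-slice work is. The data reaches each worker once, and every task then carries only two integers in and one integer out.

```python
import multiprocessing

_DATA = None  # per-worker global, filled in once by init_worker


def init_worker(data):
    """Runs once in each worker process and stores the read-only data."""
    global _DATA
    _DATA = data


def sum_slice(bounds):
    """Task: receives only two ints and returns a single int."""
    start, stop = bounds
    return sum(_DATA[start:stop])


def make_slices(n_items, n_slices):
    """Split range(n_items) into contiguous (start, stop) pairs."""
    step = max(1, -(-n_items // n_slices))  # ceiling division
    return [(i, min(i + step, n_items)) for i in range(0, n_items, step)]


if __name__ == "__main__":
    data = list(range(1000000))
    slices = make_slices(len(data), 8)
    with multiprocessing.Pool(processes=4, initializer=init_worker,
                              initargs=(data,)) as pool:
        partial_sums = pool.map(sum_slice, slices)
    print(sum(partial_sums) == sum(data))
```

Because each task owns a disjoint slice, this also matches the partitioning idea in the list, as long as workers only read the data.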
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
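A minimal Python 3 sketch of the first option, with the sampling and the n-gram generation moved into the worker. The count_ngrams function and its (seed, num, n) task tuple are hypothetical, and the sizes are kept small on purpose. Each task regenerates the sample from a seed, which trades some repeated computation for not shipping the data between processes, and only a small (n, count) pair travels back.

```python
import multiprocessing
import random


def ngrams(tokens, n):
    """Return all n-grams (as lists) of a token list."""
    n = min(n, len(tokens))
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def count_ngrams(task):
    """Runs in a worker: builds the sample and the n-grams locally."""
    seed, num, n = task
    rng = random.Random(seed)
    tokens = [str(v) for v in rng.sample(range(num * 10), num)]
    return n, len(ngrams(tokens, n))


if __name__ == "__main__":
    tasks = [(0, 20000, n) for n in range(1, 9)]
    with multiprocessing.Pool(processes=4) as pool:
        for n, count in pool.imap_unordered(count_ngrams, tasks):
            print(n, count)
```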
I was reading up on Python Memory Management and would like to reduce the memory footprint of my application. It was suggested that subprocesses would go a long way in mitigating the problem, but I'm having trouble conceptualizing what needs to be done. Could someone please provide a simple example of how to turn this...
import copy
def my_function():
    x = range(1000000)
    y = copy.deepcopy(x)
    del x
    return y
#subprocess_witchcraft
def my_function_dispatcher(*args):
    return my_function()
...into a real subprocessed function that doesn't store an extra "free-list"?
Bonus Question:
Does this "free-list" concept apply to python c-extensions as well?
The important thing about the optimization suggestion is to make sure that my_function() is only invoked in a subprocess. The deepcopy and del are irrelevant — once you create five million distinct integers in a process, holding onto all of them at the same time, it's game over. Even if you stop referring to those objects, Python will free them by keeping references to five million empty integer-object-sized fields in a limbo where they await reuse for the next function that wants to create five million integers. This is the free list mentioned in the other answer, and it buys blindingly fast allocation and deallocation of ints and floats. It is only fair to Python to note that this is not a memory leak since the memory is definitely made available for further allocations. However, that memory will not get returned to the system until the process ends, nor will it be reused for anything other than allocating numbers of the same type.
Most programs don't have this problem because most programs do not create pathologically huge lists of numbers, free them, and then expect to reuse that memory for other objects. Programs using numpy are also safe because numpy stores numeric data of its arrays in tightly packed native format. For programs that do follow this usage pattern, the way to mitigate the problem is by not creating a large number of the integers at the same time in the first place, at least not in the process which needs to return memory to the system. It is unclear what exact use case you have, but a real-world solution will likely require more than a "magic decorator".
This is where subprocesses come in: if the list of numbers is created in another process, then all the memory associated with the list, including but not limited to storage of ints, is both freed and returned to the system by the mere act of terminating the subprocess. Of course, you must design your program so that the list can be both created and processed in the subprocess, without requiring the transfer of all these numbers. The subprocess can receive information needed to create the data set, and can send back the information obtained from processing the list.
To illustrate the principle, let's upgrade your example so that the whole list actually needs to exist - say we're benchmarking sorting algorithms. We want to create a huge list of integers, sort it, and reliably free the memory associated with the list, so that the next benchmark can allocate memory for its own needs without worrying about running out of RAM. To spawn the subprocess and communicate, this uses the multiprocessing module:
# To run this, save it to a file that looks like a valid Python module, e.g.
# "foo.py" - multiprocessing requires being able to import the main module.
# Then run it with "python foo.py".
import multiprocessing, random, sys, os, time
def create_list(size):
    # utility function for clarity - runs in subprocess
    maxint = sys.maxint
    randrange = random.randrange
    return [randrange(maxint) for i in xrange(size)]
def run_test(state):
    # this function is run in a separate process
    size = state['list_size']
    print 'creating a list with %d random elements - this can take a while... ' % size,
    sys.stdout.flush()
    lst = create_list(size)
    print 'done'
    t0 = time.time()
    lst.sort()
    t1 = time.time()
    state['time'] = t1 - t0
if __name__ == '__main__':
    manager = multiprocessing.Manager()
    state = manager.dict(list_size=5*1000*1000) # shared state
    p = multiprocessing.Process(target=run_test, args=(state,))
    p.start()
    p.join()
    print 'time to sort: %.3f' % state['time']
    print 'my PID is %d, sleeping for a minute...' % os.getpid()
    time.sleep(60)
    # at this point you can inspect the running process to see that it
    # does not consume excess memory
Bonus Answer
It is hard to provide an answer to the bonus question, since the question is unclear. The "free list concept" is exactly that, a concept, an implementation strategy that needs to be explicitly coded on top of the regular Python allocator. Most Python types do not use that allocation strategy, for example it is not used for instances of classes created with the class statement. Implementing a free list is not hard, but it is fairly advanced and rarely undertaken without good reason. If some extension author has chosen to use a free list for one of its types, it can be expected that they are aware of the tradeoff a free list offers — gaining extra-fast allocation/deallocation at the cost of some additional space (for the objects on the free list and the free list itself) and inability to reuse the memory for something else.
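To make the tradeoff concrete, here is a toy free list written in pure Python 3. It is only an analogy: the free lists discussed above live in C inside the interpreter or an extension, and the Vec2 class with its acquire/release methods is entirely hypothetical. Released instances are parked for reuse instead of being deallocated, which makes the next allocation cheap but keeps that memory reserved for Vec2 objects only.

```python
class Vec2:
    """Toy type that recycles discarded instances through a free list."""

    __slots__ = ("x", "y")
    _free_list = []   # discarded instances awaiting reuse
    _max_free = 100   # cap on how many instances are kept around

    @classmethod
    def acquire(cls, x, y):
        # Reuse a parked instance if there is one, else allocate normally.
        obj = cls._free_list.pop() if cls._free_list else object.__new__(cls)
        obj.x, obj.y = x, y
        return obj

    def release(self):
        # Park the instance instead of letting it be deallocated. Its memory
        # stays reserved for future Vec2 objects and nothing else.
        if len(Vec2._free_list) < Vec2._max_free:
            Vec2._free_list.append(self)


if __name__ == "__main__":
    a = Vec2.acquire(1, 2)
    a.release()
    b = Vec2.acquire(3, 4)
    print(b is a)
```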
Joblib for parallel computation is taking more time with n_jobs>1 (n_jobs=2 finishes in 12.6s) than with n_jobs=1 (finishes in 1.3s). I am on Mac OS X 10.9 with 16GB RAM. Am I making some mistake? Here is a simple demo code:
from joblib import Parallel, delayed
def func():
    for i in range(200):
        for j in range(300):
            yield i, j
def evaluate(x):
    i=x[0]
    j=x[1]
    p=i*j
    return p, i, j
if __name__ == '__main__':
    results = Parallel(n_jobs=3, verbose=2)(delayed(evaluate)(x) for x in func())
    res, i, j = zip(*results)
Short answer: Joblib is a multiprocessing system, and has a fair amount of overhead in booting up a new python process for each of your 3 simultaneous jobs. As a result, your specific code is likely to get even slower if you add more jobs.
There's some documentation about this here.
The workarounds aren't great:
accept the overhead
don't use parallel code
Use multithreading instead of multiprocessing. Unfortunately, multithreading is rarely an option unless you are using a fully compiled function in place of evaluate, because Python bytecode effectively runs on one thread at a time (see the Python GIL).
That said, for functions that take a long time, multiprocessing is often worth it. Depending on your application, it's really a judgment call. Note that every variable used in the function is copied to each process - variable copy is rare in python, so this can be a surprise. As a result, the overhead is in part a function of the size of the variables passed either explicitly or implicitly (eg. via use of global variables).
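One way to shift the balance between overhead and work, sketched here as an assumption rather than a measured fix, is to hand each worker a batch of inputs instead of a single pair. The sketch uses the standard library's concurrent.futures as a stand-in so that it runs without joblib; chunked and evaluate_batch are hypothetical helpers, and the batch size of 10000 is arbitrary. With joblib the same batches could be dispatched as Parallel(n_jobs=3)(delayed(evaluate_batch)(b) for b in batches).

```python
import concurrent.futures as cf
import itertools


def pairs():
    """Same (i, j) stream as func() in the question."""
    for i in range(200):
        for j in range(300):
            yield i, j


def chunked(iterable, size):
    """Yield lists of up to `size` consecutive items."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def evaluate_batch(batch):
    """One task now does len(batch) multiplications instead of one."""
    return [(i * j, i, j) for i, j in batch]


if __name__ == "__main__":
    batches = list(chunked(pairs(), 10000))  # 60000 pairs become 6 tasks
    with cf.ProcessPoolExecutor(max_workers=3) as executor:
        results = [r for part in executor.map(evaluate_batch, batches) for r in part]
    res, i, j = zip(*results)
    print(len(res))
```

For work this cheap the serial loop may still win; batching only reduces how often the per-task overhead is paid.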