I am trying to understand how to get children to write to a parent's variables. Maybe I'm doing something wrong here, but I would have imagined that multiprocessing would have taken a fraction of the time that it is actually taking:
import multiprocessing, time
def h(x):
h.q.put('Doing: ' + str(x))
return x
def f_init(q):
h.q = q
def main():
q = multiprocessing.Queue()
p = multiprocessing.Pool(None, f_init, [q])
results = p.imap(h, range(1,5))
p.close()
-----Results-----:
1
2
3
4
Multiprocessed: 0.0695610046387 seconds
1
2
3
4
Normal: 2.78949737549e-05 seconds # much shorter
for i in range(len(range(1,5))):
print results.next() # prints 1, 4, 9, 16
if __name__ == '__main__':
start = time.time()
main()
print "Multiprocessed: %s seconds" % (time.time()-start)
start = time.time()
for i in range(1,5):
print i
print "Normal: %s seconds" % (time.time()-start)
#Blender basically already answered your question, but as a comment. There is some overhead associated with the multiprocessing machinery, so if you incur the overhead without doing any significant work, it will be slower.
Try actually doing some work that parallelizes well. For example, write Python code to open a file, scan it using a regular expression, and pull out matching lines; then make a list of ten big files and time how long it takes to do all ten with multiprocessing vs. plain Python. Or write code to compute an expensive function and try that.
I have used multiprocessing.Pool() just to run a bunch of instances of an external program. I used subprocess to run an audio encoder, and it ran four instances of the encoder at once for a noticeable speedup.
Related
I am playing around with multiprocessing in Python 3 to try and understand how it works and when it's good to use it.
I am basing my examples on this question, which is really old (2012).
My computer is a Windows, 4 physical cores, 8 logical cores.
First: not segmented data
First I try to brute force compute numpy.sinfor a million values. The million values is a single chunk, not segmented.
import time
import numpy
from multiprocessing import Pool
# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
def numpy_sin(value):
return numpy.sin(value)
a = numpy.arange(1000000)
if __name__ == '__main__':
pool = Pool(processes = 8)
start = time.time()
result = numpy.sin(a)
end = time.time()
print('Singled threaded {}'.format(end - start))
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print('Multithreaded {}'.format(end - start))
And I get that, no matter the number of processes, the 'multi_threading' always takes 10 times or so as much as the 'single threading'. In the task manager, I see that not all the CPUs are maxed out, and the total CPU usage is goes between 18% and 31%.
So I try something else.
Second: segmented data
I try to split up the original 1 million computations in 10 batches of 100,000 each. Then I try again for 10 million computations in 10 batches of 1 million each.
import time
import numpy
from multiprocessing import Pool
# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
def numpy_sin(value):
return numpy.sin(value)
p = 3
s = 1000000
a = [numpy.arange(s) for _ in range(10)]
if __name__ == '__main__':
print('processes = {}'.format(p))
print('size = {}'.format(s))
start = time.time()
result = numpy.sin(a)
end = time.time()
print('Singled threaded {}'.format(end - start))
pool = Pool(processes = p)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print('Multithreaded {}'.format(end - start))
I ran this last piece of code for different processes p and different list length s, 100000and 1000000.
At least now the task Manager gives the CPU maxed out at 100% usage.
I get the following results for the elapsed times (ORANGE: multiprocess, BLUE: single):
So multiprocessing never wins over the single process.
Why??
Numpy changes how the parent process runs so that it only runs on one core. You can call os.system("taskset -p 0xff %d" % os.getpid()) after you import numpy to reset the CPU affinity so that all cores are used.
See this question for more details
A computer can really only do one thing at a time. When multi-threading or multi-processing, the computer is really only switching back and forth between tasks quickly. With the provided problem, the computer could either perform the calculation 1,000,000 times, or split-up the work between a couple "workers" and perform 100,000 for each of 10 "workers".
Multi-processing shines not when computing something straight out, as the computer has to take time to create multiple processes, but while waiting for something. The main example I've heard is for webscraping. If a program requested data from a list of websites and waited for each server to send data before requesting data from the next, the program will have to sit for a couple seconds. If instead, the computer used multiprocessing/threading to ask all the websites first and all concurrently wait, the total running time is much shorter.
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing
my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]
def worker(data):
return data['list_num'] + data['my_num']
pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time, I want to add my_number1 and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that I don't want to be adding the same number simultaneously at the same time across the different processes. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]
def worker(data):
# if in Process 1:
return data['list_num'] + data['my_num1']
# if in Process 2:
return data['list_num'] + data['my_num2']
# and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool allows to execute an initializer function which is going to be executed before the actual given function will be run.
You can use it altogether with a global variable to allow your function to understand in which process is running.
You probably want to control the initial number the processes will get. You can use a Queue to notify to the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing
process_number = None
def initializer(queue):
global process_number
process_number = queue.get() # atomic get the process index
def function(value):
print "I'm process %s" % process_number
return value[process_number]
def main():
queue = multiprocessing.Queue()
for index in range(multiprocessing.cpu_count()):
queue.put(index)
pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
print(pool.map(function, tasks))
My PC is a dual core, as you can see only Process-0 and Process-1 are processed.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
I use Multiprocessing library in Python to distribute a function over multiple cores. To do that I use "Pool" function, but I want to know when each processor has completed its work.
Here is the code :
def parallel(m,G):
D=0
for i in xrange(G):
D+=random()
return 1*(D<1)
pool=Pool()
TOTAL=0
for i in xrange(10):
TOTAL += sum(pool.map(partial(parallel,G=2),xrange(100)))
print TOTAL
I know how to use time.time() in normal situation, but what I need is to know when each core has completed is part of the job. If I put a time stamp directly in the function I will get many time values without knowing on what core it is processed.
Any advice is welcome!
You may return the completion time along with the actual result from parallel and then pick the last timestamp for each worker.
import time
from random import random
from functools import partial
from multiprocessing import Pool, current_process
def parallel(m, G):
D = 0
for i in xrange(G):
D += random()
# uncomment to give the other workers more chances to run
# time.sleep(.001)
return (current_process().name, time.time()), 1 * (D < 1)
# don't deny the existence of Windows
if __name__ == '__main__':
pool = Pool()
TOTAL = 0
proc_times = {}
for i in xrange(5):
# times is a list of proc_name:timestamp pairs
times, results = zip(*pool.map(partial(parallel, G=2), xrange(100)))
TOTAL += sum(results)
# process_times_loc is guaranteed to hold the last timestamp
# for each proc_name, see the doc on dict
proc_times_loc = dict(times)
print 'local completion times:', proc_times_loc
proc_times.update(proc_times_loc)
print TOTAL
print 'total completion times:', proc_times
However when jobs are that simple you may find that calling time.time each time consumes too much of CPU time.)
I'm making some API requests which are limited at 20 per second. As to get the answer the waiting time is about 0.5 secs I thought to use multiprocessing.Pool.map and using this decorator
rate-limiting
So my code looks like
def fun(vec):
#do stuff
def RateLimited(maxPerSecond):
minInterval = 1.0 / float(maxPerSecond)
def decorate(func):
lastTimeCalled = [0.0]
def rateLimitedFunction(*args,**kargs):
elapsed = time.clock() - lastTimeCalled[0]
leftToWait = minInterval - elapsed
if leftToWait>0:
time.sleep(leftToWait)
ret = func(*args,**kargs)
lastTimeCalled[0] = time.clock()
return ret
return rateLimitedFunction
return decorate
#RateLimited(20)
def multi(vec):
p = Pool(5)
return p.map(f, vec)
I have 4 cores and this program works fine and there is an improvement in time compared to the loop version. Furthermore, when the Pool argument is 4,5,6 it works and the time is smaller for Pool(6) but when I use 7+ I got errors (Too many connections per second I guess).
Then if my function is more complicated and can do 1-5 requests the decorator doesn't work as expected.
What else I can use in this case?
UPDATE
For anyone looking for use Pool remembers to close it otherwise you are going to use all the RAM
def multi(vec):
p = Pool(5)
res=p.map(f, vec)
p.close()
return res
UPDATE 2
I found that something like this WebRequestManager can do the trick. The problem is that doesn't work with multiprocessing. Pool with 19-20 processes because the time is stored in the class you need to call when you run the request.
Your indents are inconsistent up above which makes it harder to answer this, but I'll take a stab.
It looks like you're rate limiting the wrong thing; if f is supposed be limited, you need to limit the calls to f, not the calls to multi. Doing this in something that's getting dispatched to the Pool won't work, because the forked workers would each be limiting independently (forked processes will have independent tracking of the time since last call).
The easiest way to do this would be to limit how quickly the iterator that the Pool pulls from produces results. For example:
import collections
import time
def rate_limited_iterator(iterable, limit_per_second):
# Initially, we can run immediately limit times
runats = collections.deque([time.time()] * limit_per_second)
for x in iterable:
runat, now = runats.popleft(), time.time()
if now < runat:
time.sleep(runat - now)
runats.append(time.time() + 1)
yield x
def multi(vec):
p = Pool(5)
return p.map(f, rate_limited_iterator(vec, 20))
The ordering of results from the returned iterator of imap_unordered is arbitrary, and it doesn't seem to run faster than imap(which I check with the following code), so why would one use this method?
from multiprocessing import Pool
import time
def square(i):
time.sleep(0.01)
return i ** 2
p = Pool(4)
nums = range(50)
start = time.time()
print 'Using imap'
for i in p.imap(square, nums):
pass
print 'Time elapsed: %s' % (time.time() - start)
start = time.time()
print 'Using imap_unordered'
for i in p.imap_unordered(square, nums):
pass
print 'Time elapsed: %s' % (time.time() - start)
Using pool.imap_unordered instead of pool.imap will not have a large effect on the total running time of your code. It might be a little faster, but not by too much.
What it may do, however, is make the interval between values being available in your iteration more even. That is, if you have operations that can take very different amounts of time (rather than the consistent 0.01 seconds you were using in your example), imap_unordered can smooth things out by yielding faster-calculated values ahead of slower-calculated values. The regular imap will delay yielding the faster ones until after the slower ones ahead of them have been computed (but this does not delay the worker processes moving on to more calculations, just the time for you to see them).
Try making your work function sleep for i*0.1 seconds, shuffling your input list and printing i in your loops. You'll be able to see the difference between the two imap versions. Here's my version (the main function and the if __name__ == '__main__' boilerplate was is required to run correctly on Windows):
from multiprocessing import Pool
import time
import random
def work(i):
time.sleep(0.1*i)
return i
def main():
p = Pool(4)
nums = range(50)
random.shuffle(nums)
start = time.time()
print 'Using imap'
for i in p.imap(work, nums):
print i
print 'Time elapsed: %s' % (time.time() - start)
start = time.time()
print 'Using imap_unordered'
for i in p.imap_unordered(work, nums):
print i
print 'Time elapsed: %s' % (time.time() - start)
if __name__ == "__main__":
main()
The imap version will have long pauses while values like 49 are being handled (taking 4.9 seconds), then it will fly over a bunch of other values (which were calculated by the other processes while we were waiting for 49 to be processed). In contrast, the imap_unordered loop will usually not pause nearly as long at one time. It will have more frequent, but shorter pauses, and its output will tend to be smoother.
imap_unordered also seems to use less memory over time than imap. At least that's what I experienced with a iterator over millions of things.