For the following code, assuming I have a 32-core machine, will Python decide how many processes to create for me?
from multiprocessing import Process
for i in range(100):
    p = Process(target=run, args=(fileToAnalyse,))
    p.start()
No, it does not decide for you.
To limit the number of subprocesses, you need to use a pool of workers.
Example from the documentation:
from multiprocessing import Pool
def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)             # start 4 worker processes
    result = pool.apply_async(f, [10])   # evaluate "f(10)" asynchronously
    print(result.get(timeout=1))         # prints "100" unless your computer is *very* slow
    print(pool.map(f, range(10)))        # prints "[0, 1, 4,..., 81]"
If you omit processes=4, it will use multiprocessing.cpu_count(), which returns the number of CPUs.
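Applied to the original snippet, a minimal sketch might look like this (run is assumed to be your existing analysis function; the files list is a hypothetical stand-in for your 100 inputs):

from multiprocessing import Pool

def run(fileToAnalyse):
    print('analysing', fileToAnalyse)  # placeholder for the real analysis

if __name__ == '__main__':
    files = ['file_%d.txt' % i for i in range(100)]  # hypothetical input list
    with Pool(processes=32) as pool:  # never more than 32 workers on a 32-core machine
        pool.map(run, files)          # the 100 files are distributed over the workers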
Related
I am trying to make use of Manager() to share a dictionary between processes and tried out the following code:
from multiprocessing import Manager, Pool
def f(d):
    d['x'] += 2

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict()
    d['x'] = 2
    p = Pool(4)
    for _ in range(2000):
        p.map_async(f, (d,))  # apply_async, map
    p.close()
    p.join()
    print(d)  # expected result --> {'x': 4002}
Using map_async and apply_async, the result printed is always different (e.g. {'x': 3838}, {'x': 3770}).
However, using map will give the expected result.
Also, I have tried using Process instead of Pool, and the results differ too.
Any insights?
Is it something about the non-blocking calls, with race conditions that are not handled by the manager?
When you call map (rather than map_async), it will block until the workers have finished all the requests you passed, which in your case is just one call to function f. So even though you have a pool size of 4, you are in essence doing the 2000 calls one at a time. To actually parallelize execution, you should have done a single p.map(f, [d]*2000) instead of the loop.
But when you call map_async, you do not block and a result object is returned. A call to get on the result object will block until the task finishes and will return the result of the function call. So now you are running up to 4 processes at a time. But the update to the dictionary is not serialized across the processes. I have modified the code to force serialization of d['x'] += 2 by using a multiprocessing lock. You will see that the result is now 4002.
from multiprocessing import Manager, Pool, Lock
def f(d):
    lock.acquire()
    d['x'] += 2
    lock.release()

def init(l):
    global lock
    lock = l

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        d['x'] = 2
        lock = Lock()  # create the multiprocessing lock that is sharable by all the processes
        p = Pool(4, initializer=init, initargs=(lock,))
        results = []  # if the function returned a result we wanted
        for _ in range(2000):
            results.append(p.map_async(f, (d,)))  # apply_async, map
        """
        for i in range(2000):  # if the function returned a result we wanted
            results[i].get()   # wait for everything to finish
        """
        p.close()
        p.join()
        print(d)
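As noted above, the 2000 single-item map_async calls can also be collapsed into a single submission; a compact sketch of the same idea (same lock-based initializer, only the way the work is submitted changes):

from multiprocessing import Manager, Pool, Lock

def f(d):
    with lock:  # serialize the read-modify-write on the shared dict
        d['x'] += 2

def init(l):
    global lock
    lock = l

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        d['x'] = 2
        lock = Lock()
        with Pool(4, initializer=init, initargs=(lock,)) as p:
            p.map(f, [d] * 2000)  # one blocking call, spread over the 4 workers
        print(d)  # {'x': 4002}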
Trying to understand the Python multiprocessing documentation.
I would put this on meta but I'm not sure whether it might be valuable to searchers later.
I need some guidance as to how these examples relate to multiprocessing.
Am I correct in thinking that multiprocessing is using multiple processes (and thus CPUs) in order to break down an iterable task and thus shorten its duration?
from multiprocessing import Process
def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
We are starting one process; but how do I start multiple to complete my task? Do I iterate through Process + start() lines?
Yet there are no examples later in the documentation of, for example:
p = {}
for x in range(5):
    p[x] = Process(target=f, args=('bob',))
    p[x].start()
for x in range(5):
    p[x].join()
Would that be the 'real life' implementation?
Here is the 'Queue Example':
from multiprocessing import Process, Queue
def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())  # prints "[42, None, 'hello']"
    p.join()
But again, how is this multiprocessing? This just starts one process and has it put objects on a queue?
How do I make multiple processes start and run the objects in the queue?
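One way to see the parallelism is to start several Process instances that all put onto the same Queue and collect the results in the parent. A minimal sketch (not from the documentation):

from multiprocessing import Process, Queue
import os

def f(q, name):
    q.put((name, os.getpid()))  # each worker puts its own result on the shared queue

if __name__ == '__main__':
    q = Queue()
    processes = [Process(target=f, args=(q, i)) for i in range(4)]
    for p in processes:
        p.start()
    results = [q.get() for _ in processes]  # one item per worker
    for p in processes:
        p.join()
    print(results)  # four (name, pid) pairs; the pids differ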
Finally for pool:
from multiprocessing import Pool
import time
def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(processes=4) as pool:          # start 4 worker processes
        result = pool.apply_async(f, (10,))  # evaluate "f(10)" asynchronously in a single process
        print(result.get(timeout=1))         # prints "100" unless your computer is *very* slow
Are four processes doing 10*10 at once, waiting until all four come back, or does just one do this because we only gave the pool one argument?
If the former: wouldn't that be slower than just having one process do it in the first place? What about memory? Do we hold process 1's result in RAM until process 4 returns, or does it get printed?
        print(pool.map(f, range(10)))   # prints "[0, 1, 4,..., 81]"

        it = pool.imap(f, range(10))
        print(next(it))                 # prints "0"
        print(next(it))                 # prints "1"
        print(it.next(timeout=1))       # prints "4" unless your computer is *very* slow

        result = pool.apply_async(time.sleep, (10,))
        print(result.get(timeout=1))    # raises multiprocessing.TimeoutError
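To see how map fans the work out, here is a small sketch (again not from the documentation) that tags each result with the pid of the worker that computed it; with four workers you will typically see several different pids:

from multiprocessing import Pool
import os

def f(x):
    return x*x, os.getpid()  # the square plus the pid of the worker that computed it

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        for square, pid in pool.map(f, range(10)):
            print(square, 'computed in process', pid)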
Is it possible to set an execution time limit for a thread in Python? If that time period elapses, I would like to stop the thread and create a new instance of it.
from multiprocessing import Pool
def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)            # start 4 worker processes

    print(pool.map(f, range(10)))       # prints "[0, 1, 4,..., 81]"

    for i in pool.imap_unordered(f, range(10)):
        print(i)                        # prints the same numbers in arbitrary order

    res = pool.apply_async(f, (20,))    # evaluate "f(20)" asynchronously; runs in *only* one process
    print(res.get(timeout=1))           # prints "400"
From the multiprocessing docs.
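Regarding the execution-time question above: a Pool cannot cancel one individual running task, but a hedged pattern (a sketch, not an official recipe) is to wait on the async result with a timeout and rebuild the pool when the deadline passes:

from multiprocessing import Pool, TimeoutError
import time

def slow_task(seconds):
    time.sleep(seconds)
    return seconds

if __name__ == '__main__':
    pool = Pool(processes=1)
    result = pool.apply_async(slow_task, (10,))
    try:
        print(result.get(timeout=1))  # allow the task at most 1 second
    except TimeoutError:
        pool.terminate()              # stops all workers; a single task cannot be targeted
        pool.join()
        pool = Pool(processes=1)      # start a fresh pool for new attempts
    pool.close()
    pool.join()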
I want to make a brute-force attack and therefore need some speed... So I came to use the multiprocessing library... However, in every tutorial I've found, something did not work. Hm... This one seems to work very well, except that whenever I call the get() function, IDLE seems to go to sleep and doesn't respond at all. Am I just being silly or what? I just copy-pasted the example, so it should have worked...
import multiprocessing as mp
import random
import string
# Define an output queue
output = mp.Queue()

# Define an example function
def rand_string(length, output):
    """Generates a random string of numbers, lower- and uppercase chars."""
    rand_str = ''.join(random.choice(
                           string.ascii_lowercase
                           + string.ascii_uppercase
                           + string.digits)
                       for i in range(length))
    output.put(rand_str)

# Set up a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(2)]

# Run processes
for p in processes:
    p.start()

# Exit the completed processes
for p in processes:
    p.join()

# Get process results from the output queue
results = [output.get() for p in processes]

print(results)
@dano hit it on the head! You don't have if __name__ == "__main__":, so you have a "fork bomb". That is, each process launches the processes again, and so on. You will also notice that I have moved the creation of the queue.
import multiprocessing as mp
import random
import string
# Define an example function
def rand_string(length, output):
    """Generates a random string of numbers, lower- and uppercase chars."""
    rand_str = ''.join(random.choice(
                           string.ascii_lowercase
                           + string.ascii_uppercase
                           + string.digits)
                       for i in range(length))
    output.put(rand_str)

if __name__ == "__main__":
    # Define an output queue
    output = mp.Queue()

    # Set up a list of processes that we want to run
    processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(2)]

    # Run processes
    for p in processes:
        p.start()

    # Exit the completed processes
    for p in processes:
        p.join()

    # Get process results from the output queue
    results = [output.get() for p in processes]

    print(results)
What happens is that multiprocessing imports your script as a module in each child process, so __name__ is only '__main__' in the parent. If you don't have the guard, then each child process will (attempt to) start two more processes, each of which will start two more, and so on. No wonder IDLE stops.
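A tiny illustration of that re-import, assuming the spawn start method that Windows uses by default: the module-level print runs once in the parent and once more in the child, under a different __name__.

import multiprocessing as mp
import os

# Runs in the parent, and again in each spawned child when the module is re-imported.
print('importing module: __name__ =', __name__, 'pid =', os.getpid())

def worker():
    pass

if __name__ == '__main__':
    p = mp.Process(target=worker)
    p.start()
    p.join()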
I am working with Python multiprocessing, using Pool to start concurrent processes and RawArray to share an array between those processes. I do not need to synchronize access to the RawArray, that is, the array can be modified by any process at any time.
The test code for RawArray is (do not mind the meaning of the program, it is just a test):
from multiprocessing.sharedctypes import RawArray
import time
sieve = RawArray('i', (10 + 1)*[1]) # shared memory between processes
import multiprocessing as mp
def foo_pool(x):
    time.sleep(0.2)
    sieve[x] = x*x  # modify the shared memory array. seems not to work?
    return x*x

result_list = []

def log_result(result):
    result_list.append(result)

def apply_async_with_callback():
    pool = mp.Pool(processes=4)
    for i in range(10):
        pool.apply_async(foo_pool, args=(i,), callback=log_result)
    pool.close()
    pool.join()
    print(result_list)
    for x in sieve:
        print(x)  # !!! sieve is [1, 1, ..., 1]

if __name__ == '__main__':
    apply_async_with_callback()
The code did not work as expected; I have commented the key statements. I have been stuck on this for a whole day. Any help or constructive advice would be much appreciated.
time.sleep fails because you did not import time.
Use sieve[x] = x*x to modify the array instead of sieve[x].value = x*x.
On Windows, your code creates a new sieve in each subprocess. You need to pass a reference to the shared array, for example like this:
def foo_init(s):
    global sieve
    sieve = s

def apply_async_with_callback():
    pool = mp.Pool(processes=4, initializer=foo_init, initargs=(sieve,))

if __name__ == '__main__':
    sieve = RawArray('i', (10 + 1)*[1])
You should use multithreading instead of multiprocessing, as threads can share the memory of the main process natively.
If you are worried about Python's GIL, maybe you can resort to the nogil option of numba.
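For illustration, a minimal threading sketch (my own example, not from the original answer) where all threads write into one shared list with no copying:

import threading

results = [0] * 10  # a plain list in the main process, visible to every thread

def square(i):
    results[i] = i * i  # threads write directly into shared memory

threads = [threading.Thread(target=square, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [0, 1, 4, ..., 81]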
Working version:
from multiprocessing import Pool, RawArray
import time
def foo_pool(x):
    sieve[x] = x * x  # modify the shared memory array

def foo_init(s):
    global sieve
    sieve = s

def apply_async_with_callback(loc_size):
    with Pool(processes=4, initializer=foo_init, initargs=(sieve,)) as pool:
        pool.map(foo_pool, range(loc_size))
    for x in sieve:
        print(x)

if __name__ == '__main__':
    size = 50
    sieve = RawArray('i', size * [1])  # shared memory between processes
    apply_async_with_callback(size)