I have a complex problem with Python's multiprocessing module.
I have built a script that at one point has to call a multi-argument function (call_function) for each element of a specific list. My idea is to define an integer N and divide the work among N subprocesses.
li = [a, b, c, d, e]  # elements are ints
for element in li:
    call_function(element, string1, string2, int1)
call_summary_function()
The summary function will analyze the results produced by all iterations of the loop. Now, I want each iteration to be carried out by a single subprocess, but there can never be more than N subprocesses running at once. If the limit is reached, the main process should wait until one of the subprocesses finishes, and only then start the next iteration. Also, call_summary_function needs to be called after all the subprocesses finish.
I have tried my best with the multiprocessing module, Locks, and global variables to track the actual number of running subprocesses (to compare against N), but I get an error every time.
//--------------EDIT-------------//
Firstly, the main process code:
MAX_PROCESSES = 3
lock = multiprocessing.Lock()
processes = 0
k = 0
while k < len(k_list):
    if processes <= MAX_PROCESSES:  # running processes <= 'N' set by me
        p = multiprocessing.Process(target=single_analysis,
                                    args=(k_list[k], main_folder, training_testing,
                                          subsets, positive_name, ratio_list,
                                          lock, processes))
        p.start()
        k += 1
    else:
        time.sleep(1)
while processes > 0:
    time.sleep(1)
Now, the function that is called by multiprocessing:
def single_analysis(k, main_folder, training_testing, subsets, positive_name,
                    ratio_list, lock, processes):
    lock.acquire()
    processes += 1
    lock.release()
    # stuff to do
    lock.acquire()
    processes -= 1
    lock.release()
I get the error that the int value (the processes variable) is always equal to 0, since single_analysis() seems to create a new, local processes variable.
When I make processes a global variable, declare it inside single_analysis() with the global keyword, and print processes within the function, I get 1 printed len(li) times...
What you're describing is perfectly suited for multiprocessing.Pool - specifically its map method:
import multiprocessing
from functools import partial

def call_function(string1, string2, int1, element):
    # Do stuff here
    pass

if __name__ == "__main__":
    li = [a, b, c, d, e]
    p = multiprocessing.Pool(N)  # The pool will contain N worker processes.
    # Use partial so that we can pass a method that takes more than one argument to map.
    func = partial(call_function, string1, string2, int1)
    results = p.map(func, li)
    call_summary_function(results)
p.map will call call_function(string1, string2, int1, element) for each element in the li list. results will be a list containing the value returned by each call to call_function. You can pass that list to call_summary_function to process the results.
I am trying to use multiprocessing in this way:
from multiprocessing import Pool

added = []

def foo(i):
    added = []
    # do something
    added.append(x[i])
    return added

if __name__ == '__main__':
    h = 0
    while len(added) < len(c):
        pool = Pool(4)
        result = pool.imap_unordered(foo, c)
        added.append(result[-1])
        pool.close()
        pool.join()
        h = h + 1
Multiprocessing takes place in the while loop, and the added list is created in the foo function. In each subsequent step h of the loop, the list added should be extended with new values, and the current list added should be used inside the function foo. Is it possible to pass the current contents of the list to the function at each subsequent step of the loop? Because in the above code, the foo function creates the contents of the added list from scratch each time. How can this be solved?
You can use a multiprocessing.Queue. The rough idea is to construct one of these in your main process, pass it to the child processes, and have each foo() invocation call put(x[i]) to add a value to the queue.
The main process will then read the queue to collect the results.
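A minimal sketch of that idea (the body of foo and the input list c are placeholders, not your real code):

import multiprocessing

def foo(i, q):
    # Placeholder computation; put the result on the shared queue
    # instead of appending to a list that lives only in the child process.
    q.put(i * i)

if __name__ == '__main__':
    c = range(10)  # stand-in for your input
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=foo, args=(i, q)) for i in c]
    for p in procs:
        p.start()
    added = [q.get() for _ in c]  # main process drains the queue
    for p in procs:
        p.join()
    print(added)

Note that the main process gets all results off the queue before joining the workers; joining first can deadlock if the queue's buffer fills up.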
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose that I want to alternate between adding two numbers (instead of just adding one). So around half the time I want to add my_number1, and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item in the list. However, the one requirement is that the different processes must never be adding the same number at the same time. What this boils down to, essentially (I think), is that I want to use the first number on Process 1 and the second number on Process 2 exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you pass an initializer function, which is executed in each worker process before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control which initial number each process gets. You can use a Queue to tell each process which number to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomically get the process index

def function(value):
    print("I'm process %s" % process_number)
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)
    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))

if __name__ == '__main__':
    main()
My PC is a dual core, so as you can see only Process-0 and Process-1 do the work.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
I am trying to launch multiple processes to parallelize certain tasks, and I want a global variable to be decremented by 1 each time a process executes a method X().
I looked at multiprocessing.Value, but I'm not sure that's the only way to do it. Could someone provide some code snippets for this?
from multiprocessing import Pool, Process

def X(list):
    global temp
    print list
    temp = 10
    temp -= 1
    return temp

list = ['a', 'b', 'c']
pool = Pool(processes=5)
pool.map(X, list)
With a global variable, each process gets its own copy, which defeats the purpose of sharing its value. I believe what's needed is some sort of shared memory, but I am not sure how to do it. Thanks
Move the counter variable into the main process, i.e., avoid sharing the variable between processes:

for result in pool.imap_unordered(func, args):
    counter -= 1

counter is decremented as soon as the corresponding result (func(arg)) becomes available. Here's a complete code example:
#!/usr/bin/env python
import random
import time
import multiprocessing

def func(arg):
    time.sleep(random.random())
    return arg * 10

def main():
    counter = 10
    args = "abc"
    pool = multiprocessing.Pool()
    for result in pool.imap_unordered(func, args):
        counter -= 1
        print("counter=%d, result=%r" % (counter, result))

if __name__ == "__main__":
    main()
An alternative is to pass a multiprocessing.Value() object to each worker process (use the initializer and initargs parameters of Pool()).
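For example, a minimal sketch of that alternative, assuming all you need is a shared counter decremented once per task:

import multiprocessing

counter = None

def init(shared_counter):
    # Runs once in each worker; keep the shared Value in a module global.
    global counter
    counter = shared_counter

def X(arg):
    # Atomically decrement the shared counter.
    with counter.get_lock():
        counter.value -= 1
    return arg * 2  # placeholder work

if __name__ == '__main__':
    shared = multiprocessing.Value('i', 10)
    pool = multiprocessing.Pool(processes=5, initializer=init, initargs=(shared,))
    print(pool.map(X, range(5)))
    pool.close()
    pool.join()
    print('counter =', shared.value)  # 10 - 5 = 5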
I'm wondering how Python's multiprocessing.Pool class works with map, imap, and map_async. My particular problem is that I want to map over an iterator that creates memory-heavy objects, and I don't want all of these objects to be generated in memory at the same time. I wanted to see whether the various map() functions would wring my iterator dry, or intelligently call its next() function only as child processes slowly advanced, so I hacked up some tests like this:
import time
from multiprocessing import Pool

def g():
    for el in xrange(100):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    go = g()
    g2 = pool.imap(f, go)
    g2.next()
And so on with map, imap, and map_async. This is the most flagrant example, however, since simply calling next() a single time on g2 prints out all the elements from my generator g(), whereas if imap were doing this 'lazily' I would expect it to call go.next() only once, and therefore print only the first element.
Can someone clear up what is happening, and whether there is some way to have the process pool 'lazily' evaluate the iterator as needed?
Thanks,
Gabe
Let's look at the end of the program first.
The multiprocessing module uses atexit to call multiprocessing.util._exit_function when your program ends.
If you remove g2.next(), your program ends quickly.
The _exit_function eventually calls Pool._terminate_pool. The main thread changes the state of pool._task_handler._state from RUN to TERMINATE. Meanwhile the pool._task_handler thread is looping in Pool._handle_tasks and bails out when it reaches the condition
if thread._state:
    debug('task handler found thread._state != RUN')
    break
(See /usr/lib/python2.6/multiprocessing/pool.py)
This is what stops the task handler from fully consuming your generator, g(). If you look in Pool._handle_tasks you'll see
for i, task in enumerate(taskseq):
    ...
    try:
        put(task)
    except IOError:
        debug('could not put task on queue')
        break
This is the code which consumes your generator. (taskseq is not exactly your generator, but as taskseq is consumed, so is your generator.)
In contrast, when you call g2.next(), the main thread calls IMapIterator.next and waits when it reaches self._cond.wait(timeout).
That the main thread is waiting instead of calling _exit_function is what allows the task handler thread to run normally, which means fully consuming the generator as it puts tasks into the workers' inqueue in the Pool._handle_tasks function.
The bottom line is that all the Pool map functions consume the entire iterable they are given. If you'd like to consume the generator in chunks, you could do this instead:
import multiprocessing as mp
import itertools
import time

def g():
    for el in xrange(50):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # start 4 worker processes
    go = g()
    result = []
    N = 11
    while True:
        g2 = pool.map(f, itertools.islice(go, N))
        if g2:
            result.extend(g2)
            time.sleep(1)
        else:
            break
    print(result)
I had this problem too and was disappointed to learn that map consumes all its elements. I coded a function which consumes the iterator lazily, using the Queue data type in multiprocessing. This is similar to what @unutbu describes in a comment on his answer, but as he points out, it suffers from having no callback mechanism for reloading the Queue. The Queue datatype instead exposes a timeout parameter, and I've used 100 milliseconds to good effect.
from multiprocessing import Process, Queue, cpu_count
from Queue import Full as QueueFull
from Queue import Empty as QueueEmpty

def worker(recvq, sendq):
    for func, args in iter(recvq.get, None):
        result = func(*args)
        sendq.put(result)

def pool_imap_unordered(function, iterable, procs=cpu_count()):
    # Create queues for sending/receiving items from iterable.
    sendq = Queue(procs)
    recvq = Queue()
    # Start worker processes.
    for rpt in xrange(procs):
        Process(target=worker, args=(sendq, recvq)).start()
    # Iterate iterable and communicate with worker processes.
    send_len = 0
    recv_len = 0
    itr = iter(iterable)
    try:
        value = itr.next()
        while True:
            try:
                sendq.put((function, value), True, 0.1)
                send_len += 1
                value = itr.next()
            except QueueFull:
                while True:
                    try:
                        result = recvq.get(False)
                        recv_len += 1
                        yield result
                    except QueueEmpty:
                        break
    except StopIteration:
        pass
    # Collect all remaining results.
    while recv_len < send_len:
        result = recvq.get()
        recv_len += 1
        yield result
    # Terminate worker processes.
    for rpt in xrange(procs):
        sendq.put(None)
This solution has the advantage of not batching requests to Pool.map. One individual worker cannot block the others from making progress. YMMV. Note that you may want to use a different object to signal termination to the workers. In the example, I've used None.
Tested on "Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32"
What you want is implemented in the NuMap package, from the website:
NuMap is a parallel (thread- or process-based, local or remote),
buffered, multi-task, itertools.imap or multiprocessing.Pool.imap
function replacement. Like imap it evaluates a function on elements of
a sequence or iterable, and it does so lazily.
Laziness can be adjusted via the “stride” and “buffer” arguments.
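Based only on that quoted description, usage might look roughly like the sketch below; the import path, constructor signature, and iteration protocol are assumptions on my part, so check the NuMap documentation for the real API:

from numap import NuMap  # assumed import path

def f(x):
    return x * x

# Per the quote, laziness is tuned via the "stride" and "buffer" arguments;
# the values here are guesses.
lazy_results = NuMap(f, range(100), stride=4, buffer=16)
for r in lazy_results:
    print(r)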
In this example (see the code below) there are 2 workers.
The Pool works as expected: when a worker is free, it performs the next iteration.
This code is the same as the code in the question, except for one thing: the argument size is 64 KB.
64 KB is the default socket buffer size.
import itertools
from multiprocessing import Pool
from time import sleep

def f(x):
    print("f()")
    sleep(3)
    return x

def get_reader():
    for x in range(10):
        print("read: ", x)
        value = " " * 1024 * 64  # 64 KB
        yield value

if __name__ == '__main__':
    p = Pool(processes=2)
    data = p.imap(f, get_reader())
    p.close()
    p.join()
I ran into this issue as well, and came to a different solution than the other answers here so I figured I would share it.
import collections, multiprocessing

def map_prefetch(func, data, lookahead=128, workers=16, timeout=10):
    with multiprocessing.Pool(workers) as pool:
        q = collections.deque()
        for x in data:
            q.append(pool.apply_async(func, (x,)))
            if len(q) >= lookahead:
                yield q.popleft().get(timeout=timeout)
        while len(q):
            yield q.popleft().get(timeout=timeout)

for x in map_prefetch(myfunction, huge_data_iterator):
    ...  # do stuff with x
Basically it uses a queue to keep at most lookahead pending requests in flight to the worker pool, enforcing a limit on buffered results. Work starts as soon as possible within that limit, so it can run in parallel. Also, the results remain in order.
How can I control the return value of the pool's apply_async function?
Suppose that I have the following code:

import multiprocessing

def fun(...):
    ...
    ...
    return value
my_pool = multiprocessing.Pool(2)
for i in range(5):
    result = my_pool.apply_async(fun, [i])
# some code going to be here....
my_pool.close()
my_pool.join()
Here I need to process the results.
How can I access the result value of every call and check which call it belongs to?
Store the value of 'i' from the for loop together with each result, and either print it or save it somewhere else.
That way, when a result comes back, you can check which call it belongs to by looking at the variable i, as in the sketch below.
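A minimal sketch of that idea (fun here is a stand-in for your real function):

import multiprocessing

def fun(i):
    return i * i  # stand-in for your real work

if __name__ == '__main__':
    my_pool = multiprocessing.Pool(2)
    # Keep each i next to its AsyncResult so every result stays attributable.
    pending = [(i, my_pool.apply_async(fun, [i])) for i in range(5)]
    my_pool.close()
    my_pool.join()
    for i, res in pending:
        print(i, res.get())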
Hope this helps.
Are you sure that you need to know which of your two workers is doing what right now? In that case you might be better off with Processes and Queues, because this sounds as if some communication between the multiple processes is required.
If you just want to know which result was produced by which worker, you can simply return a tuple:
#!/usr/bin/python
import multiprocessing

def fun(...):
    ...
    return value, multiprocessing.current_process().name

my_pool = multiprocessing.Pool(2)
async_result = []
for i in range(5):
    async_result.append(my_pool.apply_async(fun, [i]))
# some code going to be here....
my_pool.close()
my_pool.join()
result = {}
for i in range(5):
    result[i] = async_result[i].get()
If you have the different input variables as a list, map_async might be the better choice:

#!/usr/bin/python
import multiprocessing

def fun(...):
    ...
    ...
    return value, multiprocessing.current_process().name

my_pool = multiprocessing.Pool()
async_results = my_pool.map_async(fun, range(5))
# some code going to be here....
results = async_results.get()
The last line blocks until all results are available. Note that results is a list of tuples, each consisting of your calculated value and the name of the process that calculated it.