I am trying to launch multiple processes to parallelize certain tasks, and I want one global variable to be decremented by 1 each time a process executes a method X().
I looked at multiprocessing.Value, but I'm not sure if that's the only way to do it. Could someone provide some code snippets to do this?
from multiprocessing import Pool, Process

def X(item):
    global temp
    print(item)
    temp = 10
    temp -= 1
    return temp

items = ['a', 'b', 'c']
pool = Pool(processes=5)
pool.map(X, items)
With a global variable, each process gets its own copy, which doesn't serve the purpose of sharing its value. I believe what's needed is some sort of shared memory between the processes, but I am not sure how to do it. Thanks.
Move the counter variable into the main process, i.e., avoid sharing the variable between processes:
for result in pool.imap_unordered(func, args):
    counter -= 1
counter is decremented as soon as the corresponding result (func(arg)) becomes available. Here's a complete code example:
#!/usr/bin/env python
import random
import time
import multiprocessing

def func(arg):
    time.sleep(random.random())
    return arg * 10

def main():
    counter = 10
    args = "abc"
    pool = multiprocessing.Pool()
    for result in pool.imap_unordered(func, args):
        counter -= 1
        print("counter=%d, result=%r" % (counter, result))

if __name__ == "__main__":
    main()
An alternative is to pass a multiprocessing.Value() object to each worker process (use Pool()'s initializer and initargs parameters).
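A minimal sketch of that alternative (the worker function X, the starting value 10, and the "abc" arguments are placeholders, not from the original question): the Value is created in the parent and handed to every worker through initializer/initargs, and each call decrements it under the Value's built-in lock.

#!/usr/bin/env python
import multiprocessing

def init(shared_counter):
    # runs once in each worker process; keep a reference to the shared Value
    global counter
    counter = shared_counter

def X(arg):
    with counter.get_lock():  # make the read-modify-write atomic across processes
        counter.value -= 1
    return arg * 10

if __name__ == "__main__":
    counter = multiprocessing.Value('i', 10)  # shared integer, initial value 10
    pool = multiprocessing.Pool(initializer=init, initargs=(counter,))
    results = pool.map(X, "abc")
    pool.close()
    pool.join()
    print("counter=%d, results=%r" % (counter.value, results))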
Related
I'm working on a project that needs to run two different CPU-intensive functions, so a multiprocessing approach seems to be the way to go. The challenge I'm facing is that one function has a slower runtime than the other. For the sake of argument, let's say that execute has a runtime of 0.1 seconds while update takes a full second to run. The goal is that while update is running, execute will have calculated an output value 10 times. Once update has finished, it needs to pass a set of parameters to execute, which can then continue generating output with the new set of parameters. After some time, update needs to run again and once more generate a new set of parameters.
Furthermore, both functions require different sets of input variables.
The image link below should hopefully visualize my conundrum a bit better.
function runtime visualisation
From what I've gathered (https://zetcode.com/python/multiprocessing/), an asymmetric mapping approach might be the way to go, but it doesn't really seem to work. Any help is greatly appreciated.
Pseudo Code
from multiprocessing import Pool
from datetime import datetime
import time
import numpy as np

class MyClass():
    def __init__(self, inital_parameter_1, inital_parameter_2):
        self.parameter_1 = inital_parameter_1
        self.parameter_2 = inital_parameter_2

    def execute(self, input_1, input_2, time_in):
        print('starting execute function for time:' + str(time_in))
        time.sleep(0.1)  # wait for 100 milliseconds
        # generate some output
        output = (self.parameter_1 * input_1) + (self.parameter_2 + input_2)
        print('exiting execute function')
        return output

    def update(self, update_input_1, update_input_2, time_in):
        print('starting update function for time:' + str(time_in))
        time.sleep(1)  # wait for 1 second
        # generate parameters
        self.parameter_1 += update_input_1
        self.parameter_2 += update_input_2
        print('exiting update function')

    def smap(f):
        return f()

if __name__ == "__main__":
    update_input_1 = 3
    update_input_2 = 4
    input_1 = 0
    input_2 = 1

    # initialize class
    my_class = MyClass(1, 2)

    # total runtime (arbitrary)
    runtime = int(10e6)

    # update_time (arbitrary)
    update_time = np.array([10, 10e2, 15e4, 20e5])

    for current_time in range(runtime):
        # if time equals update time, run both functions simultaneously until update is complete
        if any(update_time == current_time):
            with Pool() as pool:
                res = pool.map_async(my_class.smap, [my_class.execute(input_1, input_2, current_time),
                                                     my_class.update(update_input_1, update_input_2, current_time)])
        # otherwise run only execute
        else:
            output = my_class.execute(input_1, input_2, current_time)

        # increment input
        input_1 += 1
        input_2 += 2
I confess to not being able to fully follow your code vis-a-vis your description. But I see some issues:
Method update is not returning any value other than None, which is implicitly returned due to the lack of a return statement.
Your with Pool() ...: block will call terminate upon block exit, which is immediately after your call to pool.map_async, which is non-blocking. But you have no provision to wait for the completion of this submitted task (terminate will most likely kill the running task before it completes).
What you should be passing to map_async is the worker function and an iterable. But you are invoking execute and update in the current main process and using their return values as the elements of the iterable, and these return values are definitely not functions suitable for passing to smap. So no multiprocessing is being done, and this is just plain wrong.
You are also creating and destroying process pools over and over again. Much better to create the process pool just once.
I would therefore recommend the following changes at the very least. But note that this code potentially generates tasks much faster than they can be completed and you could have millions of tasks queued up to run given your current runtime value, which could be quite a strain on system resources such as memory. So I've inserted some code that ensures that the rate of submitting tasks is throttled so that the number of incomplete submitted tasks is never more than three times the number of CPU cores available.
# we won't need heavy-duty numpy for what we are doing:
#import numpy as np
from multiprocessing import cpu_count
from threading import Lock

... # etc.

if __name__ == "__main__":
    update_input_1 = 3
    update_input_2 = 4
    input_1 = 0
    input_2 = 1

    # initialize class
    my_class = MyClass(1, 2)

    # total runtime (arbitrary)
    runtime = int(10e6)

    # update_time (arbitrary)
    # we don't need the overhead of numpy (remove import of numpy):
    #update_time = np.array([10, 10e2, 15e4, 20e5])
    update_time = [10, 10e2, 15e4, 20e5]

    tasks_submitted = 0
    lock = Lock()

    execute_output = []
    def execute_result(result):
        global tasks_submitted
        with lock:
            tasks_submitted -= 1
        # result is the return value from method execute
        # do something with it, e.g. execute_output.append(result)
        pass

    update_output = []
    def update_result(result):
        global tasks_submitted
        with lock:
            tasks_submitted -= 1
        # result is the return value from method update
        # do something with it, e.g. update_output.append(result)
        pass

    n_processors = cpu_count()
    with Pool() as pool:
        for current_time in range(runtime):
            # if time equals update time, run both functions simultaneously until update is complete
            #if any(update_time == current_time):
            if current_time in update_time:
                # run both update and execute:
                pool.apply_async(my_class.update, args=(update_input_1, update_input_2, current_time), callback=update_result)
                with lock:
                    tasks_submitted += 1
            pool.apply_async(my_class.execute, args=(input_1, input_2, current_time), callback=execute_result)
            with lock:
                tasks_submitted += 1
            # increment input
            input_1 += 1
            input_2 += 2
            while tasks_submitted > n_processors * 3:
                time.sleep(.05)
        # Ensure all tasks have completed:
        pool.close()
        pool.join()
        assert(tasks_submitted == 0)
I try to use multiprocessing in this way:
from multiprocessing import Pool

added = []

def foo(i):
    added = []
    # do something
    added.append(x[i])
    return added

if __name__ == '__main__':
    h = 0
    while len(added) < len(c):
        pool = Pool(4)
        result = pool.imap_unordered(foo, c)
        added.append(result[-1])
        pool.close()
        pool.join()
        h = h + 1
Multiprocessing takes place in the while loop, and the added list is created in the foo function. In each subsequent step h of the loop, the list added should be extended with further values, and the current list added should be used inside the function foo. Is it possible to pass the current contents of the list to the function in each subsequent step of the loop? In the above code, the foo function creates the contents of the added list from scratch each time. How can this be solved?
You can use a multiprocessing.Queue. The rough idea is to construct one of these in your main process, pass it to the child processes, and each foo() invocation can call put(x[i]) to add a value to the queue.
The main process will then read the queue to collect the results.
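A rough sketch of that idea, using plain Process workers (the squaring is a stand-in for your real work on x[i]):

import multiprocessing

def foo(i, queue):
    # do something, then push the produced value onto the shared queue
    queue.put(i * i)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=foo, args=(i, queue)) for i in range(4)]
    for p in procs:
        p.start()
    added = [queue.get() for _ in procs]  # drain the queue before joining the workers
    for p in procs:
        p.join()
    print(added)

Note that a multiprocessing.Queue has to be handed over at process creation (or replaced with a Manager().Queue()); it cannot be passed as an ordinary argument through Pool.map.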
I have a complex problem with Python's multiprocessing module.
I have built a script that at one point has to call a multi-argument function (call_function) for each element in a specific list. My idea is to define an integer N and divide this problem among single subprocesses.
li = [a, b, c, d, e]  # elements are ints

for element in li:
    call_function(element, string1, string2, int1)

call_summary_function()
The summary function will analyze the results obtained by all iterations of the loop. Now, I want each iteration to be carried out by a single subprocess, but there cannot be more than N subprocesses altogether. If the limit is reached, the main process should wait until one of the subprocesses ends and then start another iteration. Also, call_summary_function needs to be called after all the subprocesses finish.
I have tried my best with the multiprocessing module, Locks, and global variables to keep track of the actual number of subprocesses running (to compare to N), but I get an error every time.
//--------------EDIT-------------//
Firstly, the main process code:
MAX_PROCESSES = 3
lock = multiprocessing.Lock()
processes = 0
k = 0
while k < len(k_list):
    if processes <= MAX_PROCESSES:  # running processes <= 'N' set by me
        p = multiprocessing.Process(target=single_analysis, args=(k_list[k], main_folder, training_testing, subsets, positive_name, ratio_list, lock, processes))
        p.start()
        k += 1
    else:
        time.sleep(1)
while processes > 0:
    time.sleep(1)
Now, the function that is called by multiprocessing:
def single_analysis(k, main_folder, training_testing, subsets, positive_name, ratio_list, lock, processes):
    lock.acquire()
    processes += 1
    lock.release()
    # stuff to do
    lock.acquire()
    processes -= 1
    lock.release()
I get the error that the int value (the processes variable) is always equal to 0, since the single_analysis() function seems to create a new, local processes variable.
When I instead declare processes as global inside single_analysis() with the global keyword and print processes within the function, I get 1 printed len(li) times...
What you're describing is perfectly suited for multiprocessing.Pool, specifically its map method:
import multiprocessing
from functools import partial

def call_function(string1, string2, int1, element):
    # Do stuff here
    pass

if __name__ == "__main__":
    li = [a, b, c, d, e]
    p = multiprocessing.Pool(N)  # The pool will contain N worker processes.
    # Use partial so that we can pass a method that takes more than one argument to map.
    func = partial(call_function, string1, string2, int1)
    results = p.map(func, li)
    call_summary_function(results)
p.map will call call_function(string1, string2, int1, element) for each element in the li list. results will be a list containing the value returned by each call to call_function. You can pass that list to call_summary_function to process the results.
I want to use subprocesses to let 20 instances of a script run in parallel. Let's say I have a big list of URLs with about 100,000 entries, and my program should ensure that 20 instances of my script are working on that list at all times. I wanted to code it as follows:
urllist = [url1, url2, url3, .., url100000]
i = 0
while number_of_subproccesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1
My script just writes something into a database or text file. It doesn't output anything and doesn't need more input than the URL.
My problem is that I wasn't able to find out how to get the number of subprocesses that are active. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how, once the 20 subprocesses are running, I can make the while loop check the conditions again. I thought of maybe putting another while loop over it, something like
while i < 100000:
    while number_of_subproccesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
    if number_of_subprocesses == 20:
        sleep()  # wait some time until checking again
Or maybe there's a better way for the while loop to keep checking the number of subprocesses?
I also considered using the multiprocessing module, but I found it really convenient to just call script.py with subprocess instead of a function with multiprocessing.
Maybe someone can help me and lead me in the right direction. Thanks a lot!
Taking a different approach from the above - as it seems that the callback can't be sent as a parameter:
import subprocess
import time

NextURLNo = 0
MaxProcesses = 20
MaxUrls = 100000  # Note this would be better as len(urllist)
Processes = []

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global Processes

    if NextURLNo < MaxUrls:
        proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
        print("Started to process %s" % urllist[NextURLNo])
        NextURLNo += 1
        Processes.append(proc)

def CheckRunning():
    """ Check any running processes and start new ones if there are spare slots."""
    global Processes
    global NextURLNo

    for p in range(len(Processes) - 1, -1, -1):  # Check the processes in reverse order
        if Processes[p].poll() is not None:  # poll() returns None while the process is still running
            del Processes[p]  # Remove from list - this is why we needed reverse order

    while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls):  # More to do and some spare slots
        StartNew()

if __name__ == "__main__":
    CheckRunning()  # This will start the max processes running
    while (len(Processes) > 0):  # Something still going on.
        time.sleep(0.1)  # You may wish to change the time for this
        CheckRunning()
    print("Done!")
Just keep count as you start them and use a callback from each subprocess to start a new one if there are any url list entries to process.
e.g. Assuming that your sub-process calls the OnExit method passed to it as it ends:
NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global NoSubProcess

    if NextURLNo < MaxUrls:
        subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
        print("Started to process %s" % urllist[NextURLNo])
        NextURLNo += 1
        NoSubProcess += 1

def OnExit():
    global NoSubProcess
    NoSubProcess -= 1

if __name__ == "__main__":
    for n in range(MaxProcesses):
        StartNew()
    while (NoSubProcess > 0):
        time.sleep(1)
        if (NextURLNo < MaxUrls):
            for n in range(NoSubProcess, MaxProcesses):
                StartNew()
To keep a constant number of concurrent requests, you could use a thread pool:
#!/usr/bin/env python
from multiprocessing.dummy import Pool

def process_url(url):
    # ... handle a single url
    pass

urllist = [url1, url2, url3, .., url100000]

for _ in Pool(20).imap_unordered(process_url, urllist):
    pass
To run processes instead of threads, remove .dummy from the import.
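For reference, a sketch of the process-based variant (process_url and the URL list are placeholders); with processes, the worker must live at module level and the loop should sit under the __main__ guard:

#!/usr/bin/env python
from multiprocessing import Pool

def process_url(url):
    # ... handle a single url
    return url

if __name__ == "__main__":
    urllist = ["http://example.com/%d" % i for i in range(100)]  # placeholder list
    with Pool(20) as pool:
        for _ in pool.imap_unordered(process_url, urllist):
            pass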
How can I control the return value of pool.apply_async, supposing that I have the following code:
import multiprocessing

def fun(..):
    ...
    ...
    return value

my_pool = multiprocessing.Pool(2)
for i in range(5):
    result = my_pool.apply_async(fun, [i])

# some code going to be here....

my_pool.close()
my_pool.join()

# here I need to process the results
How can I control the result value for every process, and how do I know which process it belongs to?
Store the value of 'i' from the for loop and either print it or return it and save it somewhere else.
That way, when a result comes in, you can check which call it came from by looking at the variable i.
Hope this helps.
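A small sketch of that suggestion (the body of fun is a placeholder): return i together with the computed value, collect the AsyncResult objects, and read them back after the pool has finished.

import multiprocessing

def fun(i):
    value = i * 10  # placeholder work
    return i, value

if __name__ == "__main__":
    my_pool = multiprocessing.Pool(2)
    async_results = [my_pool.apply_async(fun, [i]) for i in range(5)]
    my_pool.close()
    my_pool.join()
    for res in async_results:
        i, value = res.get()
        print("input %d produced %r" % (i, value))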
Are you sure that you need to know which of your two workers is doing what right now? In that case you might be better off with Processes and Queues, because it sounds as if some communication between the multiple processes is required.
If you just want to know which result was processed by which worker, you can simply return a tuple:
#!/usr/bin/python
import multiprocessing

def fun(..):
    ...
    return value, multiprocessing.current_process().name

my_pool = multiprocessing.Pool(2)
async_result = []
for i in range(5):
    async_result.append(my_pool.apply_async(fun, [i]))

# some code going to be here....

my_pool.close()
my_pool.join()

result = {}
for i in range(5):
    result[i] = async_result[i].get()
If you have the different input variables as a list, the map_async command might be a better choice:
#!/usr/bin/python
import multiprocessing

def fun(..):
    ...
    ...
    return value, multiprocessing.current_process().name

my_pool = multiprocessing.Pool()
async_results = my_pool.map_async(fun, range(5))

# some code going to be here....

results = async_results.get()
The last line blocks until all the results are available. Note that results is a list of tuples, each containing your calculated value and the name of the process that calculated it.
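For example, results could then be consumed like this (assuming each tuple is (value, worker_name)):

for value, worker_name in results:
    print("%r was computed by %s" % (value, worker_name))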