I am using multiprocessing to speed up a long-running process in Python, and I want to keep my data in a separate class to make the code a little cleaner. However, when I change a class variable inside a child process it seems to roll back to its last state before the process ran, even though inside the process it shows the variable as updated.
Here is a simplified example:
class state_mangment():
    def __init__(self):
        print('__init__')
        self.last_save = -1

    def update_state(self):
        self.last_save = self.last_save + 1
        return self.last_save

from multiprocessing import Process, Lock

def f(l, i, persist_state):
    l.acquire()
    try:
        print('last save is ', persist_state.update_state(), ' should be ', i)
    finally:
        l.release()

if __name__ == '__main__':
    lock = Lock()
    persist_state = state_mangment()
    processes = []
    for num in range(10):
        p = Process(target=f, args=(lock, num, persist_state))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    print(persist_state.last_save)
Here is the output. As you can see, the variable increases from -1 to 0 (shown in the return value), but it does not continue from 0 in the next iteration:
__init__
last save is 0 should be 0
last save is 0 should be 1
last save is 0 should be 2
last save is 0 should be 3
last save is 0 should be 4
last save is 0 should be 5
last save is 0 should be 6
last save is 0 should be 7
last save is 0 should be 8
last save is 0 should be 9
-1
There are several things wrong with your code. A function run by multiprocessing.Process() does not share the address space of the parent process, which is why the manipulation of the persist_state object is not reflected in the parent process. You can use a multiprocessing.Lock() object this way only because that class was designed to work across processes when used in a multiprocessing.Process() context. That does not mean you can manipulate the state of arbitrary objects and have those manipulations reflected in the parent process.
See this description of the Manager() class for one way to solve this problem.
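For example, here is a minimal sketch (not your exact class; the dictionary key is just illustrative) where the counter lives in a Manager dict, so that updates made in child processes are visible to the parent:
from multiprocessing import Process, Lock, Manager

def f(l, i, shared_state):
    with l:
        # updates to the managed dict go through the manager process,
        # so the parent sees them after join()
        shared_state['last_save'] += 1
        print('last save is', shared_state['last_save'])

if __name__ == '__main__':
    lock = Lock()
    with Manager() as manager:
        shared_state = manager.dict({'last_save': -1})
        processes = [Process(target=f, args=(lock, num, shared_state)) for num in range(10)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(shared_state['last_save'])  # prints 9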
Related
I'm working on a project that needs to run two different CPU-intensive functions, so a multiprocessing approach seems to be the way to go. The challenge I'm facing is that one function has a slower runtime than the other. For the sake of argument, let's say that execute has a runtime of 0.1 seconds while update takes a full second to run. The goal is that while update is running, execute will have calculated an output value 10 times. Once update has finished, it needs to pass a set of parameters to execute, which can then continue generating output with the new set of parameters. After some time, update needs to run again and once more generate a new set of parameters.
Furthermore, both functions require different sets of input variables.
The image link below should hopefully visualize my conundrum a bit better.
[image: function runtime visualisation]
From what I've gathered (https://zetcode.com/python/multiprocessing/), using an asymmetric mapping approach might be the way to go, but it doesn't really seem to work. Any help is greatly appreciated.
Pseudo Code
from multiprocessing import Pool
from datetime import datetime
import time
import numpy as np

class MyClass():
    def __init__(self, inital_parameter_1, inital_parameter_2):
        self.parameter_1 = inital_parameter_1
        self.parameter_2 = inital_parameter_2

    def execute(self, input_1, input_2, time_in):
        print('starting execute function for time:' + str(time_in))
        time.sleep(0.1)  # wait for 100 milliseconds
        # generate some output
        output = (self.parameter_1 * input_1) + (self.parameter_2 + input_2)
        print('exiting execute function')
        return output

    def update(self, update_input_1, update_input_2, time_in):
        print('starting update function for time:' + str(time_in))
        time.sleep(1)  # wait for 1 second
        # generate parameters
        self.parameter_1 += update_input_1
        self.parameter_2 += update_input_2
        print('exiting update function')

    def smap(f):
        return f()

if __name__ == "__main__":
    update_input_1 = 3
    update_input_2 = 4
    input_1 = 0
    input_2 = 1

    # initialize class
    my_class = MyClass(1, 2)

    # total runtime (arbitrary)
    runtime = int(10e6)

    # update_time (arbitrary)
    update_time = np.array([10, 10e2, 15e4, 20e5])

    for current_time in range(runtime):
        # if time equals update time run both functions simultaneously until update is complete
        if any(update_time == current_time):
            with Pool() as pool:
                res = pool.map_async(my_class.smap, [my_class.execute(input_1, input_2, current_time),
                                                     my_class.update(update_input_1, update_input_2, current_time)])
        # otherwise run only execute
        else:
            output = my_class.execute(input_1, input_2, current_time)

        # increment input
        input_1 += 1
        input_2 += 2
I confess to not being able to fully follow your code vis-à-vis your description, but I see some issues:
Method update is not returning any value other than None, which is implicitly returned due to the lack of a return statement.
Your with Pool() ...: block will call terminate upon block exit, which is immediately after your call to pool.map_async, which is non-blocking. But you have no provision to wait for the completion of this submitted task (terminate will most likely kill the running task before it completes).
What you are passing to map_async is the worker function and an iterable. But you are invoking execute and update in the current main process and using their return values as elements of the iterable, and those return values are definitely not functions suitable for passing to smap. So no multiprocessing is being done at all; this is just plain wrong.
You are also creating and destroying process pools over and over again. Much better to create the process pool just once.
I would therefore recommend the following changes at the very least. But note that this code potentially generates tasks much faster than they can be completed and you could have millions of tasks queued up to run given your current runtime value, which could be quite a strain on system resources such as memory. So I've inserted some code that ensures that the rate of submitting tasks is throttled so that the number of incomplete submitted tasks is never more than three times the number of CPU cores available.
# we won't need heavy-duty numpy for what we are doing:
#import numpy as np
from multiprocessing import cpu_count
from threading import Lock
... # etc.

if __name__ == "__main__":
    update_input_1 = 3
    update_input_2 = 4
    input_1 = 0
    input_2 = 1

    # initialize class
    my_class = MyClass(1, 2)

    # total runtime (arbitrary)
    runtime = int(10e6)

    # update_time (arbitrary)
    # we don't need overhead of numpy (remove import of numpy):
    #update_time = np.array([10, 10e2, 15e4, 20e5])
    update_time = [10, 10e2, 15e4, 20e5]

    tasks_submitted = 0
    lock = Lock()

    execute_output = []
    def execute_result(result):
        global tasks_submitted
        with lock:
            tasks_submitted -= 1
        # result is the return value from method execute
        # do something with it, e.g. execute_output.append(result)
        pass

    update_output = []
    def update_result(result):
        global tasks_submitted
        with lock:
            tasks_submitted -= 1
        # result is the return value from method update
        # do something with it, e.g. update_output.append(result)
        pass

    n_processors = cpu_count()

    with Pool() as pool:
        for current_time in range(runtime):
            # if time equals update time run both functions simultaneously until update is complete
            #if any(update_time == current_time):
            if current_time in update_time:
                # run both update and execute:
                pool.apply_async(my_class.update, args=(update_input_1, update_input_2, current_time), callback=update_result)
                with lock:
                    tasks_submitted += 1
            pool.apply_async(my_class.execute, args=(input_1, input_2, current_time), callback=execute_result)
            with lock:
                tasks_submitted += 1

            # increment input
            input_1 += 1
            input_2 += 2

            while tasks_submitted > n_processors * 3:
                time.sleep(.05)

        # Ensure all tasks have completed:
        pool.close()
        pool.join()

    assert(tasks_submitted == 0)
I have 800 files with some data to process; there are enough of them that I want to use multiprocessing, but I think I'm not doing it correctly.
Inside my main() function I'm trying to spin off one process for each file that needs processing (I'm guessing this is not a good idea because my computer won't be able to handle 800 concurrent processes, but I haven't gotten that far yet).
Here is my main():
manager = multiprocessing.Manager()
arr = manager.list()

def main():
    count = 0
    with open("loc.csv") as loc_file:
        locs = csv.reader(loc_file, delimiter=',')
        for loc in locs:
            if count != 0:
                process = multiprocessing.Process(target=sort_run, args=[loc])
                process.start()
                process.join()
            count += 1
And then my code that is the target of the process:
def sort_run(loc):
    start_time = time.time()
    sorted_list = sort_splits.sort_splits(loc[0])
    value = process_reads.count_coverage(sorted_list, loc[0])
    arr.append([loc[0], value])
I'm using the multiprocessing.Manager() so that my processes can access the arr list properly. I received the error:
An attempt has been made to start a new process before the current
process has finished its bootstrapping phase.
I think what's happening is that the loop is too fast to spin off the processes correctly. Or maybe each process has to have its own variable, not just process = ....
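For reference, this "bootstrapping phase" error typically shows up on spawn-based platforms when process-creating code (such as the Manager) runs at module level and is re-executed while child processes import the module. A minimal, hedged sketch of the usual guard, assuming sort_splits and process_reads are your own importable modules (it also starts all processes before joining them, so they actually run concurrently):
import csv
import multiprocessing
import sort_splits       # the asker's own modules, assumed importable
import process_reads

def sort_run(loc, arr):
    sorted_list = sort_splits.sort_splits(loc[0])
    value = process_reads.count_coverage(sorted_list, loc[0])
    arr.append([loc[0], value])

def main():
    # create the Manager inside main() so it only runs in the parent process
    manager = multiprocessing.Manager()
    arr = manager.list()
    processes = []
    with open("loc.csv") as loc_file:
        locs = csv.reader(loc_file, delimiter=',')
        next(locs)                       # skip the header row instead of counting
        for loc in locs:
            p = multiprocessing.Process(target=sort_run, args=(loc, arr))
            p.start()
            processes.append(p)
    for p in processes:
        p.join()                         # join after starting all of them

if __name__ == "__main__":
    main()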
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time I want to add my_number1 and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that the different processes must never be adding the same number at the same time. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you run an initializer function in each worker process before the actual given function is executed.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control the initial number each process gets. You can use a Queue to tell the processes which number to pick up.
This solution is not optimal, but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomically get the process index

def function(value):
    print("I'm process %s" % process_number)
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)

    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))
My PC is a dual core; as you can see, only Process-0 and Process-1 appear in the output.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
I am trying to launch multiple processes to parallelize certain tasks and want one global variable to be decremented by 1 each time each process executes a method X().
I looked at multiprocessing.Value, but I'm not sure if that's the only way to do it. Could someone provide some code snippets to do this?
from multiprocessing import Pool, Process

def X(list):
    global temp
    print(list)
    temp = 10
    temp -= 1
    return temp

list = ['a', 'b', 'c']
pool = Pool(processes=5)
pool.map(X, list)
With a global variable, each process gets its own copy, which defeats the purpose of sharing its value. I believe what's needed is some sort of shared-memory mechanism, but I am not sure how to do it. Thanks.
Move the counter variable into the main process, i.e., avoid sharing the variable between processes:
for result in pool.imap_unordered(func, args):
    counter -= 1
counter is decremented as soon as the corresponding result (func(arg)) becomes available. Here's a complete code example:
#!/usr/bin/env python
import random
import time
import multiprocessing

def func(arg):
    time.sleep(random.random())
    return arg*10

def main():
    counter = 10
    args = "abc"
    pool = multiprocessing.Pool()
    for result in pool.imap_unordered(func, args):
        counter -= 1
        print("counter=%d, result=%r" % (counter, result))

if __name__ == "__main__":
    main()
An alternative is to pass a multiprocessing.Value() object to each worker process (use the initializer and initargs parameters of Pool()).
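A rough sketch of that alternative, still counting down from 10 (the helper names init_worker and shared_counter are just illustrative, not from the question):
import multiprocessing

counter = None

def init_worker(shared_counter):
    # runs once in every worker process; stashes the shared Value in a global
    global counter
    counter = shared_counter

def func(arg):
    with counter.get_lock():   # a Value carries its own lock
        counter.value -= 1
    return arg * 10

if __name__ == "__main__":
    shared_counter = multiprocessing.Value('i', 10)
    with multiprocessing.Pool(initializer=init_worker,
                              initargs=(shared_counter,)) as pool:
        print(pool.map(func, "abc"))
    print("counter =", shared_counter.value)  # 7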
I want to use subprocesses to run 20 instances of a script in parallel. Let's say I have a big list of URLs with something like 100,000 entries, and my program should ensure that 20 instances of my script are working on that list at all times. I wanted to code it as follows:
urllist = [url1, url2, url3, .. , url100000]
i = 0

while number_of_subprocesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1
My script just writes something into a database or text file. It doesn't output anything and doesn't need more input than the URL.
My problem is that I wasn't able to find out how to get the number of subprocesses that are active. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how I can manage it so that, once the 20 subprocesses are loaded, the while loop checks the conditions again. I thought of maybe putting another while loop over it, something like:
while i < 100000:
    while number_of_subprocesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
    if number_of_subprocesses == 20:
        sleep()  # wait some time until checking again
Or maybe there's a better possibility, where the while loop is always checking the number of subprocesses?
I also considered using the multiprocessing module, but I found it really convenient to just call script.py with subprocess instead of a function with multiprocessing.
Maybe someone can help me and lead me in the right direction. Thanks a lot!
Taking a different approach from the above - as it seems that the callback can't be sent as a parameter:
import subprocess
import time

NextURLNo = 0
MaxProcesses = 20
MaxUrls = 100000  # Note this would be better to be len(urllist)
Processes = []

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global Processes

    if NextURLNo < MaxUrls:
        proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
        print("Started to Process %s" % urllist[NextURLNo])
        NextURLNo += 1
        Processes.append(proc)

def CheckRunning():
    """ Check any running processes and start new ones if there are spare slots. """
    global Processes
    global NextURLNo

    for p in range(len(Processes) - 1, -1, -1):  # Check the processes in reverse order
        if Processes[p].poll() is not None:  # poll() returns None while the process is still running
            del Processes[p]  # Remove from list - this is why we needed reverse order

    while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls):  # More to do and some spare slots
        StartNew()

if __name__ == "__main__":
    CheckRunning()  # This will start the max processes running
    while (len(Processes) > 0):  # Something still going on.
        time.sleep(0.1)  # You may wish to change the time for this
        CheckRunning()

    print("Done!")
Just keep count as you start them and use a callback from each subprocess to start a new one if there are any url list entries to process.
e.g. Assuming that your sub-process calls the OnExit method passed to it as it ends:
NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global NoSubProcess

    if NextURLNo < MaxUrls:
        subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
        print("Started to Process", urllist[NextURLNo])
        NextURLNo += 1
        NoSubProcess += 1

def OnExit():
    global NoSubProcess
    NoSubProcess -= 1

if __name__ == "__main__":
    for n in range(MaxProcesses):
        StartNew()
    while (NoSubProcess > 0):
        time.sleep(1)
        if (NextURLNo < MaxUrls):
            for n in range(NoSubProcess, MaxProcesses):
                StartNew()
To keep constant number of concurrent requests, you could use a thread pool:
#!/usr/bin/env python
from multiprocessing.dummy import Pool

def process_url(url):
    # ... handle a single url
    pass

urllist = [url1, url2, url3, .. , url100000]

for _ in Pool(20).imap_unordered(process_url, urllist):
    pass
To run processes instead of threads, remove .dummy from the import.