Always run a constant number of subprocesses in parallel - python

I want to use subprocesses to let 20 instances of a written script run parallel. Lets say i have a big list of urls with like 100.000 entries and my program should control that all the time 20 instances of my script are working on that list. I wanted to code it as follows:
urllist = [url1, url2, url3, .. , url100000]
i=0
while number_of_subproccesses < 20 and i<100000:
subprocess.Popen(['python', 'script.py', urllist[i]]
i = i+1
My script just writes something into a database or textfile. It doesnt output anything and dont need more input than the url.
My problem is i wasnt able to find something how to get the number of subprocesses that are active. Im a novice programmer so every hint and suggestion is welcome. I was also wondering how i can manage it once the 20 subprocesses are loaded that the while loop checks the conditions again? I thought of maybe putting another while loop over it, something like
while i<100000
while number_of_subproccesses < 20:
subprocess.Popen(['python', 'script.py', urllist[i]]
i = i+1
if number_of_subprocesses == 20:
sleep() # wait to some time until check again
Or maybe theres a bette possibility that the while loop is always checking on the number of subprocesses?
I also considered using the module multiprocessing, but i found it really convenient to just call the script.py with subprocessing instead of a function with multiprocessing.
Maybe someone can help me and lead me into the right direction. Thanks Alot!

Taking a different approach from the above - as it seems that the callback can't be sent as a parameter:
NextURLNo = 0
MaxProcesses = 20
MaxUrls = 100000 # Note this would be better to be len(urllist)
Processes = []
def StartNew():
""" Start a new subprocess if there is work to do """
global NextURLNo
global Processes
if NextURLNo < MaxUrls:
proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
print ("Started to Process %s", urllist[NextURLNo])
NextURLNo += 1
Processes.append(proc)
def CheckRunning():
""" Check any running processes and start new ones if there are spare slots."""
global Processes
global NextURLNo
for p in range(len(Processes):0:-1): # Check the processes in reverse order
if Processes[p].poll() is not None: # If the process hasn't finished will return None
del Processes[p] # Remove from list - this is why we needed reverse order
while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls): # More to do and some spare slots
StartNew()
if __name__ == "__main__":
CheckRunning() # This will start the max processes running
while (len(Processes) > 0): # Some thing still going on.
time.sleep(0.1) # You may wish to change the time for this
CheckRunning()
print ("Done!")

Just keep count as you start them and use a callback from each subprocess to start a new one if there are any url list entries to process.
e.g. Assuming that your sub-process calls the OnExit method passed to it as it ends:
NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000
def StartNew():
""" Start a new subprocess if there is work to do """
global NextURLNo
global NoSubProcess
if NextURLNo < MaxUrls:
subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
print "Started to Process", urllist[NextURLNo]
NextURLNo += 1
NoSubProcess += 1
def OnExit():
NoSubProcess -= 1
if __name__ == "__main__":
for n in range(MaxProcesses):
StartNew()
while (NoSubProcess > 0):
time.sleep(1)
if (NextURLNo < MaxUrls):
for n in range(NoSubProcess,MaxProcesses):
StartNew()

To keep constant number of concurrent requests, you could use a thread pool:
#!/usr/bin/env python
from multiprocessing.dummy import Pool
def process_url(url):
# ... handle a single url
urllist = [url1, url2, url3, .. , url100000]
for _ in Pool(20).imap_unordered(process_url, urllist):
pass
To run processes instead of threads, remove .dummy from the import.

Related

Python multiprocessing, spinning off processes inside a for loop

I have 800 files with some data to process, it's enough that I want to use multiprocessing to do this but I think I'm not doing it correctly.
Inside my main() function I'm trying to spin off 1 process for each file that needs processing (I'm guessing that this is not a good idea because my computer won't be able to handle 800 concurrent processes but I haven't gotten that far yet).
Here is my main():
manager = multiprocessing.Manager()
arr = manager.list()
def main():
count = 0
with open("loc.csv") as loc_file:
locs = csv.reader(loc_file, delimiter=',')
for loc in locs:
if count != 0:
process = multiprocessing.Process(target=sort_run, args=[loc])
process.start()
process.join()
count += 1
And then my code that is the target of the process:
def sort_run(loc):
start_time = time.time()
sorted_list = sort_splits.sort_splits(loc[0])
value = process_reads.count_coverage(sorted_list, loc[0])
arr.append([loc[0], value])
I'm using the multiprocessing.Manager() so that my processes can access the arr list properly. I received the error:
An attempt has been made to start a new process before the current
process has finished its bootstrapping phase.
I think what's happening is the loop is too fast to spin off the processes correctly. Or maybe each process has to have a specific variable not just "process = ..."

Different inputs for different processes in python multiprocessing

Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing
my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]
def worker(data):
return data['list_num'] + data['my_num']
pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time, I want to add my_number1 and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that I don't want to be adding the same number simultaneously at the same time across the different processes. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]
def worker(data):
# if in Process 1:
return data['list_num'] + data['my_num1']
# if in Process 2:
return data['list_num'] + data['my_num2']
# and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool allows to execute an initializer function which is going to be executed before the actual given function will be run.
You can use it altogether with a global variable to allow your function to understand in which process is running.
You probably want to control the initial number the processes will get. You can use a Queue to notify to the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing
process_number = None
def initializer(queue):
global process_number
process_number = queue.get() # atomic get the process index
def function(value):
print "I'm process %s" % process_number
return value[process_number]
def main():
queue = multiprocessing.Queue()
for index in range(multiprocessing.cpu_count()):
queue.put(index)
pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
print(pool.map(function, tasks))
My PC is a dual core, as you can see only Process-0 and Process-1 are processed.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]

Increment/decrement global variable using python multiprocessing module

I am trying to launch multiple processes to parallelize certain tasks and want one global variable to be decremented by 1 each time each process executes a method X().
I tried to look at the multiprocessing.Value method but not sure if that's the only way to do it. Could someone provide some code snippets to do this ?
from multiprocessing import Pool, Process
def X(list):
global temp
print list
temp = 10
temp -= 1
return temp
list = ['a','b','c']
pool = Pool(processes=5)
pool.map(X, list)
With the use of global variable, each process gets its own copy of the global variable which doesn't solve the purpose of sharing it's value. I believe, the need is to have sort of a shared memory system but I am not sure how to do it. Thanks
Move counter variable into the main process i.e., avoid sharing the variable between processes:
for result in pool.imap_unordered(func, args):
counter -= 1
counter is decremented as soon as the corresponding result (func(arg)) becomes available. Here's a complete code example:
#!/usr/bin/env python
import random
import time
import multiprocessing
def func(arg):
time.sleep(random.random())
return arg*10
def main():
counter = 10
args = "abc"
pool = multiprocessing.Pool()
for result in pool.imap_unordered(func, args):
counter -= 1
print("counter=%d, result=%r" % (counter, result))
if __name__ == "__main__":
main()
An alternative is to pass multiprocessing.Value() object to each worker process (use initialize, initargs Pool()'s parameters).

Multi-process, using Queue & Pool

I have a Producer process that runs and puts the results in a Queue
I also have a Consumer function that takes the results from the Queue and processes them , for example:
def processFrame(Q,commandsFile):
fr = Q.get()
frameNum = fr[0]
Frame = fr[1]
#
# Process the frame
#
commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes, they number should be set by user:
processes = raw_input('Enter the number of process you want to use: ')
i tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
when i try this , it returns a RuntimeError: Queue objects should only be shared between processes through inheritance.
what does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
mp = [Process(target=processFrame, args=(q,toFile)) for x in range(int(processes))]
for p in mp:
p.start()
for p in mp:
p.join()
This one seems to run, but not as expected.
it using multiple processes on same frame from Queue, doesn't Queue have locks?
also ,in this case the number of processes i'm allowed to use must divide the number of frames without residue(reminder) - for example:
if i have 10 frames i can use only 1,2,5,10 processes. if i use 3,4.. it will create a process while Q empty and wont work.
if u want to recycle the procces until q is empty u should just try to do somthing like that:
code1:
def proccesframe():
while(True):
frame = queue.get()
##do something
your procces will be blocked until there is something in the queue
i dont think that's a good idie to use multiproccess on the cunsomer part , you should use them on the producer.
if u want to terminate the procces when the queue is empty u can do something like that:
code2:
def proccesframe():
while(!queue.empty()):
frame = queue.get()
##do something
terminate_procces()
update:
if u want to use multiproccesing in the consumer part just do a simple loop and add code2 , then you will be able to close your proccess when u finish doing stuff with the queue.
I am not entirely sure what are you trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its methods map or map_async?
from multiprocessing import Pool
from foo import bar # your function
if __name__ == "__main__":
p = Pool(4) # your number of processes
result = p.map_async(bar, [("arg #1", "arg #2"), ...])
print result.get()
It collects result from your function in unordered(!) iterable and you can use it however you wish.
UPDATE
I think you should not use queue and be more straightforward:
from multiprocessing import Pool
def process_frame(fr): # PEP8 and see the difference in definition
# magic
return result # and result handling!
if __name__ == "__main__":
p = Pool(4) # your number of processes
results = p.map_async(process_frame, [fr_1, fr_2, ...])
# Do not ever write or manipulate with files in parallel processes
# if you are not 100% sure what you are doing!
for result in results.get():
commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time
def f(x):
return x*x
def g(yr):
with open("result.txt", "ab") as f:
for y in yr:
f.write("{}\n".format(y))
if __name__ == '__main__':
pool = Pool(4)
while True:
# here you fetch new data and send it to process
new_data = [random.randint(1, 50) for i in range(4)]
pool.map_async(f, new_data, callback=g)
Some example how to do it and I updated the algorithm to be "infinite", it can be only closed by interruption or kill command from outside. You can use also apply_async, but it would cause slow downs with result handling (depending on speed of processing).
I have also tried using long-time open result.txt in global scope, but every time it hit deadlock.

Multiprocessing with python

how can i control the return value of this function pool apply_asyn
supposing that I have the following cool
import multiprocessing:
de fun(..)
...
...
return value
my_pool = multiprocessing.Pool(2)
for i in range(5) :
result=my_pool.apply_async(fun, [i])
some code going to be here....
digest_pool.close()
digest_pool.join()
here i need to proccess the results
how can i control the result value for every proccess and know to check to which proccess it belongs ,
store the the value of 'i' from the for loop and either print it or return and save it somewhere else.
so if a process happens you can check from which process it was by looking at the variable i.
Hope this helps.
Are you sure, that you need to know, which of your two workers is doing what right now? In such a case you might be better off with Processes and Queues, because, this sounds as some communication between the multiple processes is required.
If you just want to know, which result was processed by which worker, you can simply return a tuple:
#!/usr/bin/python
import multiprocessing
def fun(..)
...
return value, multiprocessing.current_process()._name
my_pool = multiprocessing.Pool(2)
async_result = []
for i in range(5):
async_result.append(my_pool.apply_async(fun, [i]))
# some code going to be here....
my_pool.join()
result = {}
for i in range(5):
result[i] = async_result[i].get()
If you have the different input variables as a list, the map_async command might be a better decision:
#!/usr/bin/python
import multiprocessing
def fun(..)
...
...
return value, multiprocessing.current_process()._name
my_pool = multiprocessing.Pool()
async_results = my_pool.map_async(fun, range(5))
# some code going to be here....
results = async_results.get()
The last line joins the pool. Note, that results is a list of tuples, each tuple containing of your calculated value and the name of the process who calculated it.

Categories