Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing
my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]
def worker(data):
return data['list_num'] + data['my_num']
pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time, I want to add my_number1 and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that I don't want to be adding the same number simultaneously at the same time across the different processes. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]
def worker(data):
# if in Process 1:
return data['list_num'] + data['my_num1']
# if in Process 2:
return data['list_num'] + data['my_num2']
# and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool allows to execute an initializer function which is going to be executed before the actual given function will be run.
You can use it altogether with a global variable to allow your function to understand in which process is running.
You probably want to control the initial number the processes will get. You can use a Queue to notify to the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing
process_number = None
def initializer(queue):
global process_number
process_number = queue.get() # atomic get the process index
def function(value):
print "I'm process %s" % process_number
return value[process_number]
def main():
queue = multiprocessing.Queue()
for index in range(multiprocessing.cpu_count()):
queue.put(index)
pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
print(pool.map(function, tasks))
My PC is a dual core, as you can see only Process-0 and Process-1 are processed.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
Related
I am learning multiprocessing in Python, and thinking of a problem. I want that for a shared list(nums = mp.Manager().list), is there any way that it automatically splits the list for all the processes so that it does not compute on same numbers in parallel.
Current code:
# multiple processes
nums = mp.Manager().list(range(10000))
results = mp.Queue()
def get_square(list_of_num, results_sharedlist):
# simple get square
results_sharedlist.put(list(map(lambda x: x**2, list_of_num)))
start = time.time()
process1 = mp.Process(target=get_square, args = (nums, results))
process2 = mp.Process(target=get_square, args=(nums, results))
process1.start()
process2.start()
process1.join()
process2.join()
print(time.time()-start)
for i in range(results.qsize()):
print(results.get())
Current Behaviour
It computes the square of same list twice
What I want
I want the process 1 and process 2 to compute squares of nums list 1 time in parallel without me defining the split.
You can make function to decide on which data it needs to perform operations. In current scenario, you want your function to divide the square calculation work by it's own based on how many processes are working in parallel.
To do so, you need to let your function know which process it is working on and how many other processes are working along with it. So that it can only work on specific data. So you can just pass two more parameters to your functions which will give information about processes running in parallel. i.e. current_process and total_process.
If you have a list of length divisible by 2 and you want to calculate squares of same using two processes then your function would look something like as follows:
def get_square(list_of_num, results_sharedlist, current_process, total_process):
total_length = len(list_of_num)
start = (total_length // total_process) * (current_process - 1)
end = (total_length // total_process) * current_process
results_sharedlist.put(list(map(lambda x: x**2, list_of_num[start:end])))
TOTAL_PROCESSES = 2
process1 = mp.Process(target=get_square, args = (nums, results, 1, TOTAL_PROCESSES))
process2 = mp.Process(target=get_square, args=(nums, results, 2, TOTAL_PROCESSES))
The assumption I have made here is that the length of list on which you are going to work is in multiple of processes you are allocating. And if it not then the current logic will leave behind some numbers with no output.
Hope this answers your question!
Agree on the answer by Jake here, but as a bonus:
if you are using a multiprocessing.Pool(), it keeps an internal counter of the multiprocessing threads spawned, so you can avoid the parametr to identify the current_process by accessing _identity from the current_process by multiprocessing, like this:
from multiprocessing import current_process, Pool
p = current_process()
print('process counter:', p._identity[0])
more info from this answer.
Let us consider the following code where I calculate the factorial of 4 really large numbers, saving each output to a separate .txt file (out_mp_{idx}.txt). I use multiprocessing (4 processes) to reduce the computation time. Though this works fine, I want to output all the 4 results in one file.
One way is to open each of the generated (4) files I create (from the code below) and append to a new file, but that's not my choice (below is just a simplistic version of my code, I have too many files to handle, which defeats the purpose of time-saving via multiprocessing). Is there a better way to automate such that the results from the processes are all dumped/appended to some file? Also, in my case the returned results form each process could be several lines, so how would we avoid open-file conflict for the case when the results are appended in the output file by one process and second process returns its answer and wants to open/access the output file?
As an alternative, I tried process.immap route, but that's not as computationally efficient as the below code. Something like this SO post.
from multiprocessing import Process
import os
import time
tic = time.time()
def factorial(n, idx): # function to calculate the factorial
num = 1
while n >= 1:
num *= n
n = n - 1
with open(f'out_mp_{idx}.txt', 'w') as f0: # saving output to a separate file
f0.writelines(str(num))
def My_prog():
jobs = []
N = [10000, 20000, 40000, 50000] # numbers for which factorial is desired
n_procs = 4
# executing multiple processes
for i in range(n_procs):
p = Process(target=factorial, args=(N[i], i))
jobs.append(p)
for j in jobs:
j.start()
for j in jobs:
j.join()
print(f'Exec. Time:{time.time()-tic} [s]')
if __name__=='__main__':
My_prog()
You can do this.
Create a Queue
a) manager = Manager()
b) data_queue = manager.Queue()
c) put all data in this queue.
Create a thread and start it before multiprocess
a) create a function which waits on data_queue.
Something like
`
def fun():
while True:
data = data_queue.get()
if instance(data_queue, Sentinal):
break
#write to a file
`
3) Remember to send some Sentinal object after all multiprocesses are done.
You can also make this thread a daemon thread and skip sentinal part.
I have a Producer process that runs and puts the results in a Queue
I also have a Consumer function that takes the results from the Queue and processes them , for example:
def processFrame(Q,commandsFile):
fr = Q.get()
frameNum = fr[0]
Frame = fr[1]
#
# Process the frame
#
commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes, they number should be set by user:
processes = raw_input('Enter the number of process you want to use: ')
i tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
when i try this , it returns a RuntimeError: Queue objects should only be shared between processes through inheritance.
what does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
mp = [Process(target=processFrame, args=(q,toFile)) for x in range(int(processes))]
for p in mp:
p.start()
for p in mp:
p.join()
This one seems to run, but not as expected.
it using multiple processes on same frame from Queue, doesn't Queue have locks?
also ,in this case the number of processes i'm allowed to use must divide the number of frames without residue(reminder) - for example:
if i have 10 frames i can use only 1,2,5,10 processes. if i use 3,4.. it will create a process while Q empty and wont work.
if u want to recycle the procces until q is empty u should just try to do somthing like that:
code1:
def proccesframe():
while(True):
frame = queue.get()
##do something
your procces will be blocked until there is something in the queue
i dont think that's a good idie to use multiproccess on the cunsomer part , you should use them on the producer.
if u want to terminate the procces when the queue is empty u can do something like that:
code2:
def proccesframe():
while(!queue.empty()):
frame = queue.get()
##do something
terminate_procces()
update:
if u want to use multiproccesing in the consumer part just do a simple loop and add code2 , then you will be able to close your proccess when u finish doing stuff with the queue.
I am not entirely sure what are you trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its methods map or map_async?
from multiprocessing import Pool
from foo import bar # your function
if __name__ == "__main__":
p = Pool(4) # your number of processes
result = p.map_async(bar, [("arg #1", "arg #2"), ...])
print result.get()
It collects result from your function in unordered(!) iterable and you can use it however you wish.
UPDATE
I think you should not use queue and be more straightforward:
from multiprocessing import Pool
def process_frame(fr): # PEP8 and see the difference in definition
# magic
return result # and result handling!
if __name__ == "__main__":
p = Pool(4) # your number of processes
results = p.map_async(process_frame, [fr_1, fr_2, ...])
# Do not ever write or manipulate with files in parallel processes
# if you are not 100% sure what you are doing!
for result in results.get():
commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time
def f(x):
return x*x
def g(yr):
with open("result.txt", "ab") as f:
for y in yr:
f.write("{}\n".format(y))
if __name__ == '__main__':
pool = Pool(4)
while True:
# here you fetch new data and send it to process
new_data = [random.randint(1, 50) for i in range(4)]
pool.map_async(f, new_data, callback=g)
Some example how to do it and I updated the algorithm to be "infinite", it can be only closed by interruption or kill command from outside. You can use also apply_async, but it would cause slow downs with result handling (depending on speed of processing).
I have also tried using long-time open result.txt in global scope, but every time it hit deadlock.
I have a list of numbers. I want to perform some time-consuming operation on each number in the list and make a new list with all the results. Here's a simplified version of what I have:
def calcNum(n):#some arbitrary, time-consuming calculation on a number
m = n
for i in range(5000000):
m += i%25
if m > n*n:
m /= 2
return m
nums = [12,25,76,38,8,2,5]
finList = []
for i in nums:
return_val = calcNum(i)
finList.append(return_val)
print(finList)
Now, I wanted to take advantage of the multiple cores in my CPU, and give each of them a task of processing one of the numbers, and since the "number calculation" function is self-contained from start to finish I figured this would be fairly simple to do and a perfect situation for multiprocessing/threading.
My question is, which one should I use (multiprocessing or threading?), and what is the simplest way to do this?
I did a test with various code I found in other questions to achieve this, and while it runs fine it doesn't seem to be doing any actual multithreading/processing and takes just as long as my first test:
from multiprocessing.pool import ThreadPool
def calcNum(n):#some arbitrary, time-consuming calculation on a number
m = n
for i in range(5000000):
m += i%25
if m > n*n:
m /= 2
return m
pool = ThreadPool(processes=3)
nums = [12,25,76,38,8,2,5]
finList = []
for i in nums:
async_result = pool.apply_async(calcNum, (i,))
return_val = async_result.get()
finList.append(return_val)
print(finList)
multiprocessing.pool and pool.map are your best friends here. It saves a lot of headache as it hides all the other complex queues and whatnot you need to make it work. All you need to do is set up the pool, assign it the max number of processes, point it to the function and iterable. See working code below.
Because of the join and the usage cases pool.map was intended to work, the program will wait until ALL processes have returned something before giving you the result.
from multiprocessing.pool import Pool
def calcNum(n):#some arbitrary, time-consuming calculation on a number
print "Calcs Started on ", n
m = n
for i in range(5000000):
m += i%25
if m > n*n:
m /= 2
return m
if __name__ == "__main__":
p = Pool(processes=3)
nums = [12,25,76,38,8,2,5]
finList = []
result = p.map(calcNum, nums)
p.close()
p.join()
print result
That will get you something like this:
Calcs Started on 12
Calcs Started on 25
Calcs Started on 76
Calcs Started on 38
Calcs Started on 8
Calcs Started on 2
Calcs Started on 5
[72, 562, 5123, 1270, 43, 23, 23]
Regardless of when each process is started or when it completes, map waits for each to finish and then puts them all back in the correct order (corresponding to the input iterable).
As #Guy mentioned, the GIL hurts us here. You can change the Pool to ThreadPool in the code above and see how it affects the timing of the calculations. Since the same function is used, the GIL only allows one thread to use the calcNum function at a time. So it near enough still runs serially.
Multirocessing with a process or pool essentially starts further instances of your script which gets around the issue of the GIL. If you watch your running processes during the above, you'll see extra instances of 'python.exe' start while the pool is running. In this case, you'll see a total of 4.
I guess you are affected by python Global Interpreter Lock
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations.
try to use multiprocessing instead
from multiprocessing import Pool
how can i control the return value of this function pool apply_asyn
supposing that I have the following cool
import multiprocessing:
de fun(..)
...
...
return value
my_pool = multiprocessing.Pool(2)
for i in range(5) :
result=my_pool.apply_async(fun, [i])
some code going to be here....
digest_pool.close()
digest_pool.join()
here i need to proccess the results
how can i control the result value for every proccess and know to check to which proccess it belongs ,
store the the value of 'i' from the for loop and either print it or return and save it somewhere else.
so if a process happens you can check from which process it was by looking at the variable i.
Hope this helps.
Are you sure, that you need to know, which of your two workers is doing what right now? In such a case you might be better off with Processes and Queues, because, this sounds as some communication between the multiple processes is required.
If you just want to know, which result was processed by which worker, you can simply return a tuple:
#!/usr/bin/python
import multiprocessing
def fun(..)
...
return value, multiprocessing.current_process()._name
my_pool = multiprocessing.Pool(2)
async_result = []
for i in range(5):
async_result.append(my_pool.apply_async(fun, [i]))
# some code going to be here....
my_pool.join()
result = {}
for i in range(5):
result[i] = async_result[i].get()
If you have the different input variables as a list, the map_async command might be a better decision:
#!/usr/bin/python
import multiprocessing
def fun(..)
...
...
return value, multiprocessing.current_process()._name
my_pool = multiprocessing.Pool()
async_results = my_pool.map_async(fun, range(5))
# some code going to be here....
results = async_results.get()
The last line joins the pool. Note, that results is a list of tuples, each tuple containing of your calculated value and the name of the process who calculated it.