Multi-process, using Queue & Pool - python

I have a Producer process that runs and puts its results in a Queue.
I also have a Consumer function that takes the results from the Queue and processes them, for example:
def processFrame(Q, commandsFile):
    fr = Q.get()
    frameNum = fr[0]
    Frame = fr[1]
    #
    # Process the frame
    #
    commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes; their number should be set by the user:
processes = raw_input('Enter the number of process you want to use: ')
I tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
When I try this, it raises RuntimeError: Queue objects should only be shared between processes through inheritance.
What does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
    mp = [Process(target=processFrame, args=(q, toFile)) for x in range(int(processes))]
    for p in mp:
        p.start()
    for p in mp:
        p.join()
This one seems to run, but not as expected.
It uses multiple processes on the same frame from the Queue; doesn't the Queue have locks?
Also, in this case the number of processes I'm allowed to use must divide the number of frames without a remainder. For example:
if I have 10 frames I can only use 1, 2, 5, or 10 processes; if I use 3 or 4, it will create a process while the Queue is empty and it won't work.
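As an aside on the RuntimeError: a plain multiprocessing.Queue can only be shared with a child process by being inherited when the child is created (e.g. passed to Process(...)); it cannot be pickled and handed to Pool workers as an ordinary argument. A minimal sketch of one common workaround, a Manager-backed queue whose proxy object is picklable (the frame data and worker count here are made up for illustration):
from multiprocessing import Pool, Manager

def process_frame(q):
    # the proxy queue can be passed to Pool workers like any other argument
    frame_num, frame = q.get()
    return frame_num

if __name__ == '__main__':
    manager = Manager()
    q = manager.Queue()              # proxy-backed queue, safe to hand to a Pool
    for i in range(10):
        q.put((i, 'frame-%d' % i))
    pool = Pool(4)
    results = [pool.apply_async(process_frame, (q,)) for _ in range(10)]
    print([r.get() for r in results])
    pool.close()
    pool.join()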

If you want to recycle the process until the queue is empty, you should just try to do something like this:
code1:
def process_frame():
    while True:
        frame = queue.get()
        # do something
Your process will block until there is something in the queue.
I don't think it's a good idea to use multiprocessing on the consumer part; you should use it on the producer.
If you want to terminate the process when the queue is empty, you can do something like this:
code2:
def process_frame():
    while not queue.empty():
        frame = queue.get()
        # do something
    terminate_process()
Update:
If you want to use multiprocessing in the consumer part, just do a simple loop and add code2; then you will be able to close your processes when you finish doing stuff with the queue.
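A rough sketch of what that can look like with several consumer processes, using a sentinel value so each one knows when to stop (the frame data and worker count are placeholders, not the asker's actual code):
from multiprocessing import Process, Queue

SENTINEL = None

def process_frame(q, results):
    # each consumer loops until it receives the sentinel
    for frame_num, frame in iter(q.get, SENTINEL):
        results.put((frame_num, len(frame)))  # stand-in for the real processing

if __name__ == '__main__':
    q = Queue()
    results = Queue()
    num_procs = 3                 # any number works, it need not divide the frame count
    workers = [Process(target=process_frame, args=(q, results)) for _ in range(num_procs)]
    for w in workers:
        w.start()
    frames = [(i, 'frame-%d' % i) for i in range(10)]
    for fr in frames:
        q.put(fr)
    for _ in workers:
        q.put(SENTINEL)           # one sentinel per consumer so every process exits
    for _ in frames:
        print(results.get())      # drain the results before joining
    for w in workers:
        w.join()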

I am not entirely sure what you are trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its map or map_async methods?
from multiprocessing import Pool
from foo import bar  # your function

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    result = p.map_async(bar, [("arg #1", "arg #2"), ...])
    print result.get()
It collects the results from your function into an iterable (with map_async they come back in the same order as the inputs) and you can use them however you wish.
UPDATE
I think you should not use a queue and be more straightforward:
from multiprocessing import Pool

def process_frame(fr):  # PEP8 and see the difference in definition
    # magic
    return result  # and result handling!

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    results = p.map_async(process_frame, [fr_1, fr_2, ...])
    # Do not ever write or manipulate with files in parallel processes
    # if you are not 100% sure what you are doing!
    for result in results.get():
        commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time

def f(x):
    return x * x

def g(yr):
    with open("result.txt", "ab") as f:
        for y in yr:
            f.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # here you fetch new data and send it to process
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
This is an example of how to do it, and I updated the algorithm to be "infinite"; it can only be stopped by an interrupt or a kill command from outside. You can also use apply_async, but it would cause slowdowns in result handling (depending on the speed of processing).
I have also tried keeping result.txt open long-term in global scope, but it hit a deadlock every time.
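If results really do need to be written to the file as they arrive, one common pattern (sketched here with made-up function names, not tied to the code above) is a single dedicated writer process fed through a queue, so that only one process ever touches the file:
from multiprocessing import Process, Queue, Pool

def writer(result_queue, path):
    # the only process that touches the file
    with open(path, "w") as out:
        for line in iter(result_queue.get, None):
            out.write(line + "\n")

def compute(x):
    return str(x * x)

if __name__ == "__main__":
    results = Queue()
    w = Process(target=writer, args=(results, "result.txt"))
    w.start()
    pool = Pool(4)
    for line in pool.imap_unordered(compute, range(20)):
        results.put(line)        # only the parent feeds the writer
    pool.close()
    pool.join()
    results.put(None)            # sentinel tells the writer to finish
    w.join()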

Related

Getting started with multiprocessing in Python

I'm trying to set up multiprocessing in my script: I want to loop through an array and run a function for every item in the array, but I want these function calls to run simultaneously.
This is the original set up:
def my_function(my_variable):
    # do stuff
    return my_variable_updated

def main():
    # initialize my_variable as a list with 10000 items
    results = []
    for item in my_variable:
        results.append(my_function(item))
    # continue script
How can I convert this to multiprocessing so I can run multiple my_functions at the same time and get to '#continue script' faster? Do I need to use a queue for this?
You will have to restructure your script pretty thoroughly to implement multiprocessing. The main script would look something like this:
from multiprocessing import Process, JoinableQueue, Manager

def my_function(input_queue, manager_list):
    while True:
        item_to_process = input_queue.get()  # item_to_process will be an (index, item) tuple
        result_of_processing = item_to_process[1] ** 2
        manager_list[item_to_process[0]] = result_of_processing
        input_queue.task_done()

def main():
    item_count = 10  # 10000 in your case
    my_variable = [i for i in range(item_count)]
    q = JoinableQueue()
    for index, item in enumerate(my_variable):
        q.put((index, item))
    manager = Manager()
    results = manager.list([0] * item_count)  # initialize to same size as my_variable
    worker_count = 2
    for _ in range(worker_count):
        p = Process(target=my_function, args=[q, results])
        p.daemon = True  # optional, but should be used unless your subprocess will spawn another process
        p.start()
    # now you can continue on
    # but when you need to access `results` you have to put:
    q.join()
    # now we have our results
    print(results)

if __name__ == "__main__":
    main()
Yielding
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In my simple case.
You can also use a pool, but I'm not well versed in that and wouldn't want to lead you astray.
The main things to watch out for when using multiprocessing are avoiding deadlocks and maintaining shared memory, and it can get tricky fast! In most cases it would be sufficient, and recommended, to use threading.Thread instead. That module is super easy to get cooking with, but you will still likely need a queue.Queue. However, you wouldn't have to worry about sharing memory and things like multiprocessing.Manager.
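For completeness, since a pool was mentioned just above, here is a minimal sketch of the same loop using multiprocessing.Pool.map, assuming my_function only needs the item itself and returns a picklable result (the squaring is just a stand-in for the real work):
from multiprocessing import Pool

def my_function(my_variable):
    # do stuff
    return my_variable ** 2        # placeholder for the real work

def main():
    my_variable = list(range(10000))
    pool = Pool(4)                 # pick a worker count that suits your machine
    results = pool.map(my_function, my_variable)
    pool.close()
    pool.join()
    # continue script: results is in the same order as my_variable
    print(results[:10])

if __name__ == "__main__":
    main()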

Starting a large number of dependent processes asynchronously using python multiprocessing

Problem: I have a DAG (directed acyclic graph)-like structure for starting the execution of some massive data processing on a machine. Some of the processes can only be started when their parent data processing is completed, because there are multiple levels of processing. I want to use the python multiprocessing library to handle it all on a single machine as a first goal, and later scale to execution on different machines using Managers. I have no prior experience with python multiprocessing. Can anyone suggest whether it's a good library to begin with? If yes, some basic implementation ideas would do just fine. If not, what else can be used to do this in python?
Example:
A -> B
B -> D, E, F, G
C -> D
In the above example I want to kick off A & C first (in parallel); after their successful execution, the other remaining processes would just wait for B to finish first. As soon as B finishes its execution, all the other processes will start.
P.S.: Sorry, I cannot share the actual data because it is confidential, though I tried to make it clear using the example.
I'm a big fan of using processes and queues for things like this.
Like so:
from multiprocessing import Process, Queue
from Queue import Empty as QueueEmpty
import time

# example process functions
def processA(queueA, queueB):
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter queue
            continue
        # do stuff with data
        queueB.put(data)

def processB(queueB, _):
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter queue
            continue
        # do stuff with data

# helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs

def shutdown_process(proc_lst, queue):
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break

queueA = Queue(<size of queue> * 3)  # needs to be a bit bigger than actual. 3x works well for me
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB))
procsB = start_procs(number_of_workers, processB, (queueB, None))

# feed some data to processA
[queueA.put(data) for data in start_data]

# shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

# etc, etc. You could arrange the start, stop, and data feed statements to arrive at the DAG behaviour you desire
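To make the ordering from the question's example concrete, here is a bare-bones sketch that uses nothing but Process and join to respect the dependencies (the stage function is a placeholder for the real work; it waits on C before starting D, E, F and G, which is slightly conservative since only D actually needs C):
from multiprocessing import Process

def stage(name):
    # placeholder for the real data processing of one node
    print('running %s' % name)

def run_all(names):
    procs = [Process(target=stage, args=(n,)) for n in names]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    a = Process(target=stage, args=('A',))
    c = Process(target=stage, args=('C',))
    a.start()
    c.start()              # A and C have no parents, so they run in parallel
    a.join()               # B needs A
    b = Process(target=stage, args=('B',))
    b.start()
    b.join()
    c.join()               # D needs both B and C; E, F and G need only B
    run_all(['D', 'E', 'F', 'G'])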

Different inputs for different processes in python multiprocessing

Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose that I want to alternate adding two numbers (instead of just adding one). So around half the time I want to add my_number1, and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item in the list. However, the one requirement is that I don't want the different processes to be adding the same number at the same time. What this boils down to, essentially (I think), is that I want to use the first number on Process 1 and the second number on Process 2 exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool allows you to run an initializer function which is executed in each worker process before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control the initial number each process will get. You can use a Queue to notify the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomically get the process index

def function(value):
    print "I'm process %s" % process_number
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)
    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))
My PC is dual core; as you can see, only Process-0 and Process-1 are used.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
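An alternative sketch that sidesteps the initializer entirely: start one plain Process per number and hand each its own value up front, so the two processes can never add the same number at the same time (the names here are made up):
from multiprocessing import Process, Queue

def worker(my_num, in_q, out_q):
    # this process only ever adds my_num, so the two numbers never collide
    for item in iter(in_q.get, None):
        out_q.put(item + my_num)

if __name__ == '__main__':
    in_q, out_q = Queue(), Queue()
    procs = [Process(target=worker, args=(num, in_q, out_q)) for num in (5, 100)]
    for p in procs:
        p.start()
    my_list = list(range(100))
    for item in my_list:
        in_q.put(item)
    for _ in procs:
        in_q.put(None)                           # one sentinel per worker
    results = [out_q.get() for _ in my_list]     # drain before joining
    for p in procs:
        p.join()
    print(results)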

Python multiprocessing does not terminate

I am new to python multiprocessing and I want to understand why my code does not terminate (maybe a zombie or a deadlock) and how to fix it. The createChain function also executes a for loop and returns a tuple: (value1, value2). Inside createChain there are calls to other functions. I don't think posting the createChain code will help, because inside that function I am not doing anything related to multiprocessing. I tried making the processes daemons but it still didn't work. The strange thing is that if I decrease the value of maxChains, e.g. to 500 or 100, it works.
I just want the processes to do some heavy work and put the results into a data structure.
My version of python is 2.7
def createTable(chainsPerCore, q, chainLength):
    for chain in xrange(chainsPerCore):
        q.put(createChain(chainLength, chain))

def initTable():
    maxChains = 1000
    chainLength = 10000
    resultsQueue = JoinableQueue()
    numOfCores = cpu_count()
    chainsPerCore = maxChains / numOfCores
    processes = [Process(target=createTable, args=(chainsPerCore, resultsQueue, chainLength,)) for x in range(numOfCores)]
    for p in processes:
        # p.daemon = True
        p.start()
    # Wait for hashing cores to finish
    for p in processes:
        p.join()
    resultsQueue.task_done()
    temp = [resultsQueue.get() for p in processes]
    print temp
Based on the very useful comments of Tadhg McDonald-Jensen, I understood my needs better, as well as how Queues work and what they should be used for.
I changed my code to:
from contextlib import closing
from multiprocessing import Pool

def initTable(output):
    maxChains = 1000
    results = []
    with closing(Pool(processes=8)) as pool:
        results = pool.map(createChain, xrange(maxChains))
    pool.terminate()
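For what it's worth, the original hang looks like the "joining processes that use queues" pitfall described in the multiprocessing docs: a child that has put items on a queue will not exit until all of them have been flushed to the underlying pipe, so joining the workers before draining resultsQueue can deadlock once the queued data gets large enough (which would explain why smaller maxChains values appeared to work). A sketch of the original structure with the order reversed, draining first and joining afterwards (createTable and createChain are the asker's functions):
from multiprocessing import Process, Queue, cpu_count

def initTable():
    maxChains = 1000
    chainLength = 10000
    resultsQueue = Queue()   # a plain Queue is enough here
    numOfCores = cpu_count()
    chainsPerCore = maxChains / numOfCores
    processes = [Process(target=createTable, args=(chainsPerCore, resultsQueue, chainLength))
                 for _ in range(numOfCores)]
    for p in processes:
        p.start()
    # drain the queue first: every worker produces exactly chainsPerCore items
    temp = [resultsQueue.get() for _ in range(chainsPerCore * numOfCores)]
    # only now is it safe to join the workers
    for p in processes:
        p.join()
    print temp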

Python Multiprocessing.Pool lazy iteration

I'm wondering about the way that python's multiprocessing.Pool class works with map, imap, and map_async. My particular problem is that I want to map over an iterator that creates memory-heavy objects, and I don't want all these objects to be generated in memory at the same time. I wanted to see if the various map() functions would wring my iterator dry, or intelligently call the next() function only as child processes slowly advanced, so I hacked up some tests as such:
import time
from multiprocessing import Pool

def g():
    for el in xrange(100):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    go = g()
    g2 = pool.imap(f, go)
    g2.next()
And so on with map, imap, and map_async. This is the most flagrant example, however, as simply calling next() a single time on g2 prints out all the elements from my generator g(), whereas if imap were doing this 'lazily' I would expect it to only call go.next() once, and therefore print out only the first element.
Can someone clear up what is happening, and if there is some way to have the process pool 'lazily' evaluate the iterator as needed?
Thanks,
Gabe
Let's look at the end of the program first.
The multiprocessing module uses atexit to call multiprocessing.util._exit_function when your program ends.
If you remove g2.next(), your program ends quickly.
The _exit_function eventually calls Pool._terminate_pool. The main thread changes the state of pool._task_handler._state from RUN to TERMINATE. Meanwhile the pool._task_handler thread is looping in Pool._handle_tasks and bails out when it reaches the condition
if thread._state:
    debug('task handler found thread._state != RUN')
    break
(See /usr/lib/python2.6/multiprocessing/pool.py)
This is what stops the task handler from fully consuming your generator, g(). If you look in Pool._handle_tasks you'll see
for i, task in enumerate(taskseq):
    ...
    try:
        put(task)
    except IOError:
        debug('could not put task on queue')
        break
This is the code which consumes your generator. (taskseq is not exactly your generator, but as taskseq is consumed, so is your generator.)
In contrast, when you call g2.next() the main thread calls IMapIterator.next, and waits when it reaches self._cond.wait(timeout).
That the main thread is waiting instead of calling _exit_function is what allows the task handler thread to run normally, which means fully consuming the generator as it puts tasks in the workers' inqueue in the Pool._handle_tasks function.
The bottom line is that all the Pool map functions consume the entire iterable they are given. If you'd like to consume the generator in chunks, you could do this instead:
import multiprocessing as mp
import itertools
import time

def g():
    for el in xrange(50):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # start 4 worker processes
    go = g()
    result = []
    N = 11
    while True:
        g2 = pool.map(f, itertools.islice(go, N))
        if g2:
            result.extend(g2)
            time.sleep(1)
        else:
            break
    print(result)
I had this problem too and was disappointed to learn that map consumes all of its elements. I coded a function which consumes the iterator lazily, using the Queue data type in multiprocessing. This is similar to what @unutbu describes in a comment on his answer but, as he points out, it suffers from having no callback mechanism for re-loading the Queue. The Queue datatype instead exposes a timeout parameter, and I've used 100 milliseconds to good effect.
from multiprocessing import Process, Queue, cpu_count
from Queue import Full as QueueFull
from Queue import Empty as QueueEmpty

def worker(recvq, sendq):
    for func, args in iter(recvq.get, None):
        result = func(*args)
        sendq.put(result)

def pool_imap_unordered(function, iterable, procs=cpu_count()):
    # Create queues for sending/receiving items from iterable.
    sendq = Queue(procs)
    recvq = Queue()
    # Start worker processes.
    for rpt in xrange(procs):
        Process(target=worker, args=(sendq, recvq)).start()
    # Iterate iterable and communicate with worker processes.
    send_len = 0
    recv_len = 0
    itr = iter(iterable)
    try:
        value = itr.next()
        while True:
            try:
                sendq.put((function, value), True, 0.1)
                send_len += 1
                value = itr.next()
            except QueueFull:
                while True:
                    try:
                        result = recvq.get(False)
                        recv_len += 1
                        yield result
                    except QueueEmpty:
                        break
    except StopIteration:
        pass
    # Collect all remaining results.
    while recv_len < send_len:
        result = recvq.get()
        recv_len += 1
        yield result
    # Terminate worker processes.
    for rpt in xrange(procs):
        sendq.put(None)
This solution has the advantage of not batching requests to Pool.map. One individual worker cannot block others from making progress. YMMV. Note that you may want to use a different object to signal termination for the workers; in the example, I've used None.
Tested on "Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32"
What you want is implemented in the NuMap package, from the website:
NuMap is a parallel (thread- or process-based, local or remote),
buffered, multi-task, itertools.imap or multiprocessing.Pool.imap
function replacement. Like imap it evaluates a function on elements of
a sequence or iterable, and it does so lazily.
Laziness can be adjusted via the “stride” and “buffer” arguments.
In this example (see the code below) there are 2 workers.
The Pool works as expected: when a worker is free, it takes the next iteration.
This code is the same as the code in the question, except for one thing: the argument size is 64 KB, the default socket buffer size. Presumably, because each yielded item roughly fills that buffer, feeding the workers blocks until one of them reads, so the generator ends up being consumed roughly in step with the workers instead of all at once.
import itertools
from multiprocessing import Pool
from time import sleep

def f(x):
    print("f()")
    sleep(3)
    return x

def get_reader():
    for x in range(10):
        print("readed: ", x)
        value = " " * 1024 * 64  # 64k
        yield value

if __name__ == '__main__':
    p = Pool(processes=2)
    data = p.imap(f, get_reader())
    p.close()
    p.join()
I ran into this issue as well, and came to a different solution than the other answers here so I figured I would share it.
import collections, multiprocessing

def map_prefetch(func, data, lookahead=128, workers=16, timeout=10):
    with multiprocessing.Pool(workers) as pool:
        q = collections.deque()
        for x in data:
            q.append(pool.apply_async(func, (x,)))
            if len(q) >= lookahead:
                yield q.popleft().get(timeout=timeout)
        while len(q):
            yield q.popleft().get(timeout=timeout)

for x in map_prefetch(myfunction, huge_data_iterator):
    pass  # do stuff with x
Basically it uses a queue to send at most lookahead pending requests to the worker pool, enforcing a limit on buffered results. The work starts as soon as possible within that limit, so it can run in parallel. Also the results remain in order.
