Queues and multiprocessing - python

I am writing some code to build a table of variable-length (Huffman) codes, and I wanted to use the multiprocessing module for fun. The idea is to have each process try to get a node from the queue. They do work on the node, and either put that node's two children back into the work queue, or put the variable-length code into the result queue. They also pass messages to a message queue, which should be printed by a thread in the main process. Here is the code so far:
import Queue
import multiprocessing as mp
from threading import Thread
from collections import Counter, namedtuple

Node = namedtuple("Node", ["child1", "child2", "weight", "symbol", "code"])

def _sort_func(node):
    return node.weight

def _encode_proc(proc_number, work_queue, result_queue, message_queue):
    while True:
        try:
            #get a node from the work queue
            node = work_queue.get(timeout=0.1)
            #if it is an end node, add the symbol-code pair to the result queue
            if node.child1 == node.child2 == None:
                message_queue.put("Symbol processed! : proc%d" % proc_number)
                result_queue.put({node.symbol:node.code})
            #otherwise do some work and add some nodes to the work queue
            else:
                message_queue.put("More work to be done! : proc%d" % proc_number)
                node.child1.code.append(node.code + '0')
                node.child2.code.append(node.code + '1')
                work_queue.put(node.child1)
                work_queue.put(node.child2)
        except Queue.Empty: #everything is probably done
            return

def _reporter_thread(message_queue):
    while True:
        try:
            message = message_queue.get(timeout=0.1)
            print message
        except Queue.Empty: #everything is probably done
            return

def _encode_tree(tree, symbol_count):
    """Uses multiple processes to walk the tree and build the huffman codes."""
    #Create a manager to manage the queues, and a pool of workers.
    manager = mp.Manager()
    worker_pool = mp.Pool()
    #create the queues you will be using
    work = manager.Queue()
    results = manager.Queue()
    messages = manager.Queue()
    #add work to the work queue, and start the message printing thread
    work.put(tree)
    message_thread = Thread(target=_reporter_thread, args=(messages,))
    message_thread.start()
    #add the workers to the pool and close it
    for i in range(mp.cpu_count()):
        worker_pool.apply_async(_encode_proc, (i, work, results, messages))
    worker_pool.close()
    #get the results from the results queue, and update the table of codes
    table = {}
    while symbol_count > 0:
        try:
            processed_symbol = results.get(timeout=0.1)
            table.update(processed_symbol)
            symbol_count -= 1
        except Queue.Empty:
            print "WAI DERe NO SYMBOLzzzZzz!!!"
        finally:
            print "Symbols to process: %d" % symbol_count
    return table

def make_huffman_table(data):
    """
    data is an iterable containing the string that needs to be encoded.
    Returns a dictionary mapping symbols to codes.
    """
    #Build a list of Nodes out of the characters in data
    nodes = [Node(None, None, weight, symbol, bytearray()) for symbol, weight in Counter(data).items()]
    nodes.sort(reverse=True, key=_sort_func)
    symbols = len(nodes)
    append_node = nodes.append
    while len(nodes) > 1:
        #make a new node out of the two nodes with the lowest weight and add it to the list of nodes.
        child2, child1 = nodes.pop(), nodes.pop()
        new_node = Node(child1, child2, child1.weight+child2.weight, None, bytearray())
        append_node(new_node)
        #then resort the nodes
        nodes.sort(reverse=True, key=_sort_func)
    top_node = nodes[0]
    return _encode_tree(top_node, symbols)

def chars(fname):
    """
    A simple generator to make reading from files without loading them
    totally into memory a simple task.
    """
    f = open(fname)
    char = f.read(1)
    while char != '':
        yield char
        char = f.read(1)
    f.close()
    raise StopIteration

if __name__ == "__main__":
    text = chars("romeo-and-juliet.txt")
    table = make_huffman_table(text)
    print table
The current output of this is:
More work to be done! : proc0
WAI DERe NO SYMBOLzzzZzz!!!
Symbols to process: 92
WAI DERe NO SYMBOLzzzZzz!!!
Symbols to process: 92
WAI DERe NO SYMBOLzzzZzz!!!
Symbols to process: 92
It just repeats the last bit forever. After the first process adds work to the queue, everything just stops. Why is that? Am I not understanding/using queues properly? Sorry for all the code you have to read.

Your first problem is trying to use timeouts. They're almost never a good idea. They may be a good idea if you can't possibly think of a reliable way to do something efficiently, and you use timeouts only as a first step in checking whether something is really done.
That said, the primary problem is that multiprocessing is often very bad at reporting exceptions that occur in worker processes. Your code is actually dying here:
node.child1.code.append(node.code + '0')
The error message you're not seeing is "an integer or string of size 1 is required". You can't append a bytearray to a bytearray. You want to do:
node.child1.code.extend(node.code + '0')
                 ^^^^^^
instead, and in the similar line for child2. As is, because the first worker process to take something off the work queue dies, nothing more is ever added to the work queue. That explains everything you've seen - so far ;-)
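An aside: one way to actually see that hidden exception is to keep the AsyncResult objects returned by apply_async() and call .get() on them, which re-raises whatever exception killed the worker. A minimal sketch, assuming the pool setup already in _encode_tree():
async_results = [worker_pool.apply_async(_encode_proc, (i, work, results, messages))
                 for i in range(mp.cpu_count())]
worker_pool.close()
for r in async_results:
    r.get()   # re-raises the worker's exception in the main process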
No timeouts
FYI, the usual approach to avoid timeouts (which are flaky - unreliable) is to put a special sentinel value on a queue. Consumers know it's time to quit when they see the sentinel, and use a plain blocking .get() to retrieve items from the queue. So first thing is to create a sentinel; e.g., add this near the top:
ALL_DONE = "all done"
Best practice is also to .join() threads and processes - that way the main program knows (doesn't just guess) when they're done too.
So, you can change the end of _encode_tree() like so:
for i in range(1, symbol_count + 1):
    processed_symbol = results.get()
    table.update(processed_symbol)
    print "Symbols to process: %d" % (symbol_count - i)

for i in range(mp.cpu_count()):
    work.put(ALL_DONE)
worker_pool.join()

messages.put(ALL_DONE)
message_thread.join()

return table
The key here is that the main program knows all the work is done when, and only when, no symbols remain to be processed. Until then, it can unconditionally .get() results from the results queue. Then it puts a number of sentinels on the work queue equal to the number of workers. They'll each consume a sentinel and quit. Then we wait for them to finish (worker_pool.join()). Then a sentinel is put on the message queue, and we wait for that thread to end too. Only then does the function return.
Now nothing ever terminates early, everything is shut down cleanly, and the output of your final table isn't mixed up anymore with various other output from the workers and the message thread. _reporter_thread() gets rewritten like so:
def _reporter_thread(message_queue):
    while True:
        message = message_queue.get()
        if message == ALL_DONE:
            break
        else:
            print message
and similarly for _encode_proc(). No more timeouts or try/except Queue.Empty: fiddling. You don't even have to import Queue anymore :-)
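For completeness, here is a sketch of what _encode_proc() could look like with the sentinel approach (assuming the ALL_DONE constant above and the .extend() fix):
def _encode_proc(proc_number, work_queue, result_queue, message_queue):
    while True:
        node = work_queue.get()   # plain blocking get, no timeout
        if node == ALL_DONE:      # sentinel from the main process: nothing left to do
            return
        if node.child1 == node.child2 == None:
            message_queue.put("Symbol processed! : proc%d" % proc_number)
            result_queue.put({node.symbol: node.code})
        else:
            message_queue.put("More work to be done! : proc%d" % proc_number)
            node.child1.code.extend(node.code + '0')
            node.child2.code.extend(node.code + '1')
            work_queue.put(node.child1)
            work_queue.put(node.child2)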

Related

Starting a large number of dependent process in async using python multiprocessing

Problem: I have a DAG (directed acyclic graph) like structure for starting the execution of some massive data processing on a machine. Some of the processes can only be started when their parent data processing is completed, because there are multiple levels of processing. I want to use the Python multiprocessing library to handle all of it on one single machine as a first goal, and later scale to execute on different machines using Managers. I've got no prior experience with Python multiprocessing. Can anyone suggest if it's a good library to begin with? If yes, some basic implementation idea would do just fine. If not, what else can be used to do this in Python?
Example:
A -> B
B -> D, E, F, G
C -> D
In the above example I want to kick off A & C first (in parallel); after their successful execution, the other remaining processes would just wait for B to finish first. As soon as B finishes its execution, all the other processes will start.
P.S.: Sorry, I cannot share the actual data because it is confidential, though I tried to make it clear using the example.
I'm a big fan of using processes and queues for things like this.
Like so:
from multiprocessing import Process, Queue
from Queue import Empty as QueueEmpty
import time

#example process functions
def processA(queueA, queueB):
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2) #wait some time for data to enter queue
            continue
        #do stuff with data
        queueB.put(data)

def processB(queueB, _):
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2) #wait some time for data to enter queue
            continue
        #do stuff with data

#helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs

def shutdown_process(proc_lst, queue):
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break

queueA = Queue(<size of queue> * 3) #needs to be a bit bigger than actual. 3x works well for me
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB))
procsB = start_procs(number_of_workers, processB, (queueB, None))

# feed some data to processA
[queueA.put(data) for data in start_data]

#shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

#etc, etc. You could arrange the start, stop, and data feed statements to arrive at the dag behaviour you desire
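As a rough, hypothetical sketch (not part of the answer above) of how the question's DAG could be arranged with these helpers: start A and C together, join them, and only then start B. It assumes processA, processB and processC are written in the style shown earlier, and that start_data seeds the pipeline.
q_in_A, q_A_out = Queue(100), Queue(100)
q_in_C, q_C_out = Queue(100), Queue(100)

# A and C have no parents, so start them in parallel
procsA = start_procs(2, processA, (q_in_A, q_A_out))
procsC = start_procs(2, processC, (q_in_C, q_C_out))
for item in start_data:
    q_in_A.put(item)
    q_in_C.put(item)

# shutdown_process() joins, so these calls block until A and C are done
shutdown_process(procsA, q_in_A)
shutdown_process(procsC, q_in_C)

# only now start B (its input is A's output); D, E, F and G would follow B the same way
q_B_out = Queue(100)
procsB = start_procs(2, processB, (q_A_out, q_B_out))
shutdown_process(procsB, q_A_out)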

Different inputs for different processes in python multiprocessing

Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing
my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]
def worker(data):
    return data['list_num'] + data['my_num']
pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time I want to add my_number1, and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that I don't want the different processes to be adding the same number at the same time. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]
def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you pass an initializer function which is executed in each worker before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control which initial number each process gets; you can use a Queue to tell the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get() # atomic get the process index

def function(value):
    print "I'm process %s" % process_number
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)
    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))
My PC is a dual core; as you can see, only Process-0 and Process-1 do the processing.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
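If you want exactly the two numbers from the question and nothing else, a minimal sketch (a hypothetical adaptation of the same idea, not part of the answer above) is to pin the pool to two workers and hand each one its own number through the initializer queue:
import multiprocessing

my_number = None

def init_worker(number_queue):
    global my_number
    my_number = number_queue.get()  # each worker atomically takes one number

def add_my_number(list_num):
    return list_num + my_number

if __name__ == "__main__":
    number_queue = multiprocessing.Queue()
    number_queue.put(5)    # my_num1 goes to whichever worker initializes first
    number_queue.put(100)  # my_num2 goes to the other worker
    pool = multiprocessing.Pool(processes=2, initializer=init_worker,
                                initargs=[number_queue])
    print(pool.map(add_my_number, list(range(100))))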

How to design an async pipeline pattern in python

I am trying to design an async pipeline that can easily make a data processing pipeline. The pipeline is composed of several functions. Input data goes in at one end of the pipeline and comes out at the other end.
I want to design the pipeline in a way that:
Additional functions can be inserted into the pipeline
Functions already in the pipeline can be popped out.
Here is what I came up with:
import asyncio

@asyncio.coroutine
def add(x):
    return x + 1

@asyncio.coroutine
def prod(x):
    return x * 2

@asyncio.coroutine
def power(x):
    return x ** 3

def connect(funcs):
    def wrapper(*args, **kwargs):
        data_out = yield from funcs[0](*args, **kwargs)
        for func in funcs[1:]:
            data_out = yield from func(data_out)
        return data_out
    return wrapper

pipeline = connect([add, prod, power])
input = 1
output = asyncio.get_event_loop().run_until_complete(pipeline(input))
print(output)
This works, of course, but the problem is that if I want to add another function into (or pop out a function from) this pipeline, I have to disassemble and reconnect every function again.
I would like to know if there is a better scheme or design pattern to create such a pipeline?
I've done something similar before, using just the multiprocessing library. It's a bit more manual, but it gives you the ability to easily create and modify your pipeline, as you've requested in your question.
The idea is to create functions that can live in a multiprocessing pool, and their only arguments are an input queue and an output queue. You tie the stages together by passing them different queues. Each stage receives some work on its input queue, does some more work, and passes the result out to the next stage through its output queue.
The workers spin on trying to get something from their queues, and when they get something, they do their work and pass the result to the next stage. All of the work ends by passing a "poison pill" through the pipeline, causing all stages to exit:
This example just builds a string in multiple work stages:
import multiprocessing as mp

POISON_PILL = "STOP"

def stage1(q_in, q_out):
    while True:
        # get either work or a poison pill from the previous stage (or main)
        val = q_in.get()
        # check to see if we got the poison pill - pass it along if we did
        if val == POISON_PILL:
            q_out.put(val)
            return
        # do stage 1 work
        val = val + "Stage 1 did some work.\n"
        # pass the result to the next stage
        q_out.put(val)

def stage2(q_in, q_out):
    while True:
        val = q_in.get()
        if val == POISON_PILL:
            q_out.put(val)
            return
        val = val + "Stage 2 did some work.\n"
        q_out.put(val)

def main():
    pool = mp.Pool()
    manager = mp.Manager()

    # create managed queues
    q_main_to_s1 = manager.Queue()
    q_s1_to_s2 = manager.Queue()
    q_s2_to_main = manager.Queue()

    # launch workers, passing them the queues they need
    results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
    results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))

    # Send a message into the pipeline
    q_main_to_s1.put("Main started the job.\n")

    # Wait for work to complete
    print(q_s2_to_main.get()+"Main finished the job.")

    q_main_to_s1.put(POISON_PILL)

    pool.close()
    pool.join()

    return

if __name__ == "__main__":
    main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
You can easily put more stages in the pipeline or rearrange them just by changing which functions get which queues. I'm not very familiar with the asyncio module, so I can't speak to what capabilities you would be losing by using the multiprocessing library instead, but this approach is very straightforward to implement and understand, so I like its simplicity.
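As a sketch of that flexibility (a hypothetical stage3, not part of the example above), inserting a new stage only means adding one queue and re-pointing its neighbours:
def stage3(q_in, q_out):
    while True:
        val = q_in.get()
        if val == POISON_PILL:
            q_out.put(val)
            return
        q_out.put(val + "Stage 3 did some work.\n")

# in main(), rewire the stages like this:
#   q_s1_to_s3 = manager.Queue()
#   q_s3_to_s2 = manager.Queue()
#   results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s3))
#   results_s3 = pool.apply_async(stage3, (q_s1_to_s3, q_s3_to_s2))
#   results_s2 = pool.apply_async(stage2, (q_s3_to_s2, q_s2_to_main))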
I don't know if it is the best way to do it, but here is my solution.
While I think it's possible to control a pipeline using a list or a dictionary, I found it easier and more efficient to use a generator.
Consider the following generator:
def controller():
    old = value = None
    while True:
        new = (yield value)
        value = old
        old = new
This is basically a one-element queue, it stores the value that you send it and releases it at the next call of send (or next).
Example:
>>> c = controller()
>>> next(c) # prime the generator
>>> c.send(8) # send a value
>>> next(c) # pull the value from the generator
8
By associating every coroutine in the pipeline with its controller we will have an external handle that we can use to push the target of each one. We just need to define our coroutines in a way that they will pull the new target from our controller every cycle.
Now consider the following coroutines:
def source(controller):
    while True:
        target = next(controller)
        print("source sending to", target.__name__)
        yield (yield from target)

def add():
    return (yield) + 1

def prod():
    return (yield) * 2
The source is a coroutine that does not return, so it will not terminate itself after the first cycle. The other coroutines are "sinks" and do not need a controller.
You can use these coroutines in a pipeline as in the following example. We initially set up a route source --> add and after receiving the first result we change the route to source --> prod.
# create a controller for the source and prime it
cont_source = controller()
next(cont_source)
# create three coroutines
# associate the source with its controller
coro_source = source(cont_source)
coro_add = add()
coro_prod = prod()
# create a pipeline
cont_source.send(coro_add)
# prime the source and send a value to it
coro_source.send(None)
print("add =", coro_source.send(4))
# change target of the source
cont_source.send(coro_prod)
# reset the source, send another value
coro_source.send(None)
print("prod =", coro_source.send(8))
Output:
source sending to add
add = 5
source sending to prod
prod = 16

Multi-process, using Queue & Pool

I have a Producer process that runs and puts the results in a Queue
I also have a Consumer function that takes the results from the Queue and processes them, for example:
def processFrame(Q, commandsFile):
    fr = Q.get()
    frameNum = fr[0]
    Frame = fr[1]
    #
    # Process the frame
    #
    commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes; their number should be set by the user:
processes = raw_input('Enter the number of process you want to use: ')
I tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
When I try this, it returns a RuntimeError: Queue objects should only be shared between processes through inheritance.
What does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
    mp = [Process(target=processFrame, args=(q, toFile)) for x in range(int(processes))]
    for p in mp:
        p.start()
    for p in mp:
        p.join()
This one seems to run, but not as expected.
It uses multiple processes on the same frame from the Queue; doesn't Queue have locks?
Also, in this case the number of processes I'm allowed to use must divide the number of frames without remainder. For example:
if I have 10 frames I can use only 1, 2, 5, or 10 processes. If I use 3 or 4, it will create a process while the Queue is empty and it won't work.
If you want to recycle the process until the queue is empty, you should just try to do something like this:
code1:
def process_frame():
    while True:
        frame = queue.get()
        ##do something
Your process will be blocked until there is something in the queue.
I don't think it's a good idea to use multiprocessing on the consumer part; you should use it on the producer.
If you want to terminate the process when the queue is empty, you can do something like this:
code2:
def process_frame():
    while not queue.empty():
        frame = queue.get()
        ##do something
    terminate_process()
update:
If you want to use multiprocessing in the consumer part, just do a simple loop and add code2; then you will be able to close your processes when you finish doing stuff with the queue.
I am not entirely sure what you are trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its map or map_async methods?
from multiprocessing import Pool
from foo import bar # your function

if __name__ == "__main__":
    p = Pool(4) # your number of processes
    result = p.map_async(bar, [("arg #1", "arg #2"), ...])
    print result.get()
It collects the results from your function into an iterable (in the same order as the inputs) and you can use them however you wish.
UPDATE
I think you should not use a queue and be more straightforward:
from multiprocessing import Pool

def process_frame(fr): # PEP8 and see the difference in definition
    # magic
    return result # and result handling!

if __name__ == "__main__":
    p = Pool(4) # your number of processes
    results = p.map_async(process_frame, [fr_1, fr_2, ...])
    # Do not ever write or manipulate with files in parallel processes
    # if you are not 100% sure what you are doing!
    for result in results.get():
        commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time

def f(x):
    return x*x

def g(yr):
    with open("result.txt", "ab") as f:
        for y in yr:
            f.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # here you fetch new data and send it to process
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
This is an example of how to do it, and I updated the algorithm to be "infinite": it can only be closed by an interrupt or a kill command from outside. You can also use apply_async, but it would cause slowdowns in result handling (depending on the speed of processing).
I have also tried keeping result.txt open long-term in global scope, but it hit a deadlock every time.
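If the input is finite, a minimal sketch of a clean-shutdown variant (my sketch, same idea as above) is to stop submitting, then close and join the pool so the outstanding batches and their callbacks finish before the program exits:
from multiprocessing import Pool
import random

def f(x):
    return x*x

def g(yr):
    # the callback runs in the main process, so the file writes stay serialized
    with open("result.txt", "ab") as out:
        for y in yr:
            out.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    for _ in range(10):  # ten batches instead of looping forever
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the submitted batches (and their callbacks) to complete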

Python multiprocessing map_async hangs

I have some trouble [probably] with closing a pool of processes in my parser. When all tasks are done, it hangs and does nothing; CPU usage is about 1%.
profiles_pool = multiprocessing.Pool(processes=4)
pages_pool = multiprocessing.Pool(processes=4)
m = multiprocessing.Manager()
pages = m.list(['URL'])
pages_done = m.list()

while True:
    # grab all links
    res = pages_pool.imap_unordered(deco_process_people, pages, chunksize=1)
    pages_done += pages
    pages = []
    for new_users, new_pages in res:
        users.update(new_users)
        profile_tasks = [ (new_users[i]['link'], i) for i in new_users ]
        # enqueue grabbed links for parsing
        profiles_pool.map_async(deco_process_profiles,
                                profile_tasks, chunksize=2,
                                callback=profile_update_callback)
        # i dont need a result of map_async actually
        # callback will apply parsed data to users dict
        # users dict is an instance of Manager.dict()
        for p in new_pages:
            if p not in pages_done and p not in pages:
                pages.append(p)
    # we need more than 900 pages to be parsed for bug occurrence
    #if len(pages) == 0:
    if len(pages_done) > 900:
        break

#
# closing other pools
#

# ---- the last printed string:
print 'Closing profiles pool',
sys.stdout.flush()
profiles_pool.close()
profiles_pool.join()
print 'closed'
I guess the problem is a wrong count of open tasks in the pool queue, but I'm not sure and cannot check this; I don't know how to get the task queue length.
What could it be, and where should I look first?
The most immediately-obvious problem is that pages_done is a synchronized Manager.list object (so each process can access it atomically), but while pages starts out as one such, it quickly becomes an ordinary un(multi)processed list:
pages_done += pages
pages = []
The second quoted line rebinds pages to a new, empty ordinary list.
Even if you deleted all the elements of pages on the second line (rather than doing a rebinding assignment), you could run into a race where (e.g.) pages had A, B, and C in it when you did the += on the first line, but had become A, B, C, and D by the second.
A quick fix would be to take items off pages one at a time and put them into pages_done one at a time (not very efficient; sketched below). It might be better not to make these shared data structures at all, though; it doesn't look like they need to be in the quoted code (I'm assuming some unquoted code depends on it, since otherwise the rebinding of pages is a red herring anyway!).
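A sketch of that one-at-a-time fix, using the variable names from the question:
# Drain pages into pages_done element by element instead of rebinding pages,
# so both names keep referring to the shared Manager.list proxies.
while len(pages) > 0:
    pages_done.append(pages.pop(0))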
I've found out the reason for the bug: "join method of multiprocessing Pool object hangs if iterable argument of pool.map is empty"
http://bugs.python.org/issue12157
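A minimal reproduction of that issue (my sketch; it only hangs on Python versions affected by the linked bug):
import multiprocessing

def noop(x):
    return x

if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    pool.map_async(noop, [])  # empty iterable triggers the bug
    pool.close()
    pool.join()               # hangs here on affected versions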
