I am having trouble (probably) with closing a pool of processes in my parser. When all tasks are done, it hangs and does nothing; CPU usage is about 1%.
profiles_pool = multiprocessing.Pool(processes=4)
pages_pool = multiprocessing.Pool(processes=4)
m = multiprocessing.Manager()
pages = m.list(['URL'])
pages_done = m.list()
while True:
    # grab all links
    res = pages_pool.imap_unordered(deco_process_people, pages, chunksize=1)
    pages_done += pages
    pages = []
    for new_users, new_pages in res:
        users.update(new_users)
        profile_tasks = [(new_users[i]['link'], i) for i in new_users]
        # enqueue grabbed links for parsing
        profiles_pool.map_async(deco_process_profiles,
                                profile_tasks, chunksize=2,
                                callback=profile_update_callback)
        # i dont need a result of map_async actually
        # callback will apply parsed data to users dict
        # users dict is an instance of Manager.dict()
        for p in new_pages:
            if p not in pages_done and p not in pages:
                pages.append(p)
    # we need more than 900 pages to be parsed for bug occurrence
    #if len(pages) == 0:
    if len(pages_done) > 900:
        break

#
# closing other pools
#

# ---- the last printed string:
print 'Closing profiles pool',
sys.stdout.flush()
profiles_pool.close()
profiles_pool.join()
print 'closed'
I guess the problem is an incorrect count of open tasks in the pool queue, but I'm not sure and cannot check this; I don't know how to get the task queue length.
What could it be, and where should I look first?
The most immediately obvious problem is that pages_done is a synchronized Manager.list object (so each process can access it atomically), but while pages starts out as one, it quickly becomes an ordinary, non-shared list:
pages_done += pages
pages = []
The second quoted line rebinds pages to a new, empty ordinary list.
Even if you deleted all the elements of pages on the second line (rather than doing a rebinding assignment), you could run into a race where, e.g., pages held A, B, and C when you did the += on the first line, but had become A, B, C, and D by the second.
A quick fix would be to take items off pages one at a time and put them into pages_done one at a time (not very efficient). It might be better not to make these shared data structures at all; nothing in the quoted code seems to need that (I'm assuming some unquoted code depends on it, since otherwise the rebinding of pages is a red herring anyway!).
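A minimal sketch of that one-at-a-time approach, assuming pages and pages_done are both Manager.list() proxies as in the question:

while True:
    try:
        page = pages.pop(0)        # each pop()/append() on the proxy is atomic
    except IndexError:             # shared list is empty
        break
    pages_done.append(page)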
I've found out the reason for the bug: "join method of multiprocessing Pool object hangs if iterable argument of pool.map is empty"
http://bugs.python.org/issue12157
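Given that report, a sketch of a workaround is to never hand the pool an empty iterable, e.g. by guarding the map_async call in the loop above:

if profile_tasks:   # skip the call entirely when there is nothing to parse
    profiles_pool.map_async(deco_process_profiles,
                            profile_tasks, chunksize=2,
                            callback=profile_update_callback)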
I have a piece of code with a multiprocessing implementation:
import itertools
import multiprocessing as mul
import tqdm

q = range(len(aaa))
w = range(len(aab))
e = range(len(aba))
paramlist = list(itertools.product(q, w, e))

def f(combinations):
    q = combinations[0]
    w = combinations[1]
    e = combinations[2]
    # the rest of the function

if __name__ == '__main__':
    pool = mul.Pool(4)
    res_p = pool.map(f, paramlist)
    for _ in tqdm.tqdm(res_p, total=len(paramlist)):
        pass
    pool.close()
    pool.join()
where aaa, aab, and aba are lists of triples, e.g.:
aaa = [[1,2,3], [3,5,1], ...], etc.
I wanted to use imap() so that I could follow the calculation progress with the tqdm module.
But why does map() give list(res_p) the correct length, while with imap() the list ends up empty? Is there a way to track progress using map()?
tqdm doesn't work with map because map is blocking: it waits for all results and then returns them as a list. By the time your loop executes, the only progress left to be made is what happens in that loop; the parallel phase has already completed.
imap does not block, since it returns just an iterator, i.e. a thing you can ask for the next result, and the next, and the next. Only when you do that, by looping over it, is each result waited for, one after another. Because it is an iterator, once all results have been consumed (at the end of your loop) it is exhausted, so there is nothing left to put in a list. If you want to keep the results, you could append each one inside the loop, or change the code to this:
res_p = list(tqdm.tqdm(pool.imap(f, paramlist), total=len(paramlist)))
for res in res_p:
    ... # Do stuff
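For reference, here is a complete, runnable sketch of that pattern with a toy f() and a small paramlist standing in for the real ones; the bar advances as each result is pulled from the imap() iterator.

import itertools
import multiprocessing
import tqdm

def f(combination):
    q, w, e = combination
    return q + w + e              # stand-in for the real computation

if __name__ == '__main__':
    paramlist = list(itertools.product(range(10), range(10), range(10)))
    pool = multiprocessing.Pool(4)
    res_p = list(tqdm.tqdm(pool.imap(f, paramlist), total=len(paramlist)))
    pool.close()
    pool.join()
    print(len(res_p))             # res_p really is a full list of results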
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose I want to alternate between adding two numbers (instead of just one): around half the time I want to add my_number1, and the other half of the time my_number2. It doesn't matter which number gets added to which item of the list. The one requirement is that I never want to be adding the same number simultaneously in different processes. What this boils down to, essentially (I think), is that I want to use the first number exclusively on Process 1 and the second number exclusively on Process 2, so that the processes are never adding the same number at the same time. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you pass an initializer function, which is executed in each worker process before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control which initial number each process gets. You can use a Queue to tell the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomic get the process index

def function(value):
    print "I'm process %s" % process_number
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)

    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]

    print(pool.map(function, tasks))
My PC is a dual core; as you can see, only Process-0 and Process-1 do the processing.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
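Applied to the original question (one addend per process), a sketch built on the same initializer idea might look like this; my_addend, init_worker and the concrete numbers are just illustrative.

import multiprocessing

my_addend = None                  # set once per worker by the initializer

def init_worker(addend_queue):
    global my_addend
    my_addend = addend_queue.get()

def worker(item):
    return item + my_addend

if __name__ == '__main__':
    addends = multiprocessing.Queue()
    addends.put(5)                # the first worker picks up 5
    addends.put(100)              # the second worker picks up 100
    pool = multiprocessing.Pool(processes=2, initializer=init_worker,
                                initargs=(addends,))
    print(pool.map(worker, list(range(100))))
    pool.close()
    pool.join()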
How can I get the results from my processes without using a pool?
(I want to keep an eye on the progress:
print "\r", float(done)/total, "%",
which can't be done using a pool as far as I know.)
def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    jobs = []
    while argslist != []:
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function, args=(argslist.pop(),))
            jobs.append(p)
            p.start()
            done += 1
            print "\r", float(done)/total, "%",
    # get results here
    for job in jobs:
        job.get_my_result()???
The processes are really short (< 0.5 seconds each), but I have around 1 million of them.
I saw this thread: Can I get a return value from multiprocessing.Process? I tried to reproduce it but couldn't make it work properly.
I'm happy to provide any further information.
This question may be considered a duplicate, but anyway, here is the solution to my problem:
def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    result_queue = mp.Queue()
    jobs = []
    while argslist != [] and done < 10:
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function, args=(result_queue, argslist.pop(),))
            jobs.append(p)
            p.start()
            done += 1
            print "\r", float(done)/total, "%",
    # get results here
    res = [result_queue.get() for p in jobs]
    print res
and I also had to change the
return function_result
into
result_queue.put(function_result)
The easiest way should be a queue that is passed as an argument to your function. The results of that function can be put into that queue, and later on you can iterate over the queue to collect all the results, or process them as soon as they arrive. However, this only works when you can work with "unordered" results. See the Python documentation for details: Examples for Multiprocessing and Queues
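A self-contained sketch of that queue-based pattern; square() and the argument list here are stand-ins for the real function and argslist.

import sys
import multiprocessing as mp

def square(result_queue, arg):
    result_queue.put(arg * arg)   # "return" the value via the queue

def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    result_queue = mp.Queue()
    jobs = []
    while argslist:
        # busy-wait for a free worker slot, as in the code above
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function, args=(result_queue, argslist.pop()))
            jobs.append(p)
            p.start()
            done += 1
            sys.stdout.write("\r%.1f %%" % (100.0 * done / total))
            sys.stdout.flush()
    # drain the queue before joining, so no worker blocks on a full queue
    results = [result_queue.get() for _ in jobs]
    for p in jobs:
        p.join()
    return results

if __name__ == '__main__':
    print(multiprocess(square, list(range(20)), ncpu=4))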
I am writing some code to build a table of variable-length (Huffman) codes, and I wanted to use the multiprocessing module for fun. The idea is to have each process try to get a node from the queue. They do work on the node, and either put that node's two children back into the work queue, or put the variable-length code into the result queue. They are also passing messages to a message queue, which should be printed by a thread in the main process. Here is the code so far:
import Queue
import multiprocessing as mp
from threading import Thread
from collections import Counter, namedtuple

Node = namedtuple("Node", ["child1", "child2", "weight", "symbol", "code"])

def _sort_func(node):
    return node.weight

def _encode_proc(proc_number, work_queue, result_queue, message_queue):
    while True:
        try:
            #get a node from the work queue
            node = work_queue.get(timeout=0.1)
            #if it is an end node, add the symbol-code pair to the result queue
            if node.child1 == node.child2 == None:
                message_queue.put("Symbol processed! : proc%d" % proc_number)
                result_queue.put({node.symbol:node.code})
            #otherwise do some work and add some nodes to the work queue
            else:
                message_queue.put("More work to be done! : proc%d" % proc_number)
                node.child1.code.append(node.code + '0')
                node.child2.code.append(node.code + '1')
                work_queue.put(node.child1)
                work_queue.put(node.child2)
        except Queue.Empty: #everything is probably done
            return

def _reporter_thread(message_queue):
    while True:
        try:
            message = message_queue.get(timeout=0.1)
            print message
        except Queue.Empty: #everything is probably done
            return

def _encode_tree(tree, symbol_count):
    """Uses multiple processes to walk the tree and build the huffman codes."""
    #Create a manager to manage the queues, and a pool of workers.
    manager = mp.Manager()
    worker_pool = mp.Pool()
    #create the queues you will be using
    work = manager.Queue()
    results = manager.Queue()
    messages = manager.Queue()
    #add work to the work queue, and start the message printing thread
    work.put(tree)
    message_thread = Thread(target=_reporter_thread, args=(messages,))
    message_thread.start()
    #add the workers to the pool and close it
    for i in range(mp.cpu_count()):
        worker_pool.apply_async(_encode_proc, (i, work, results, messages))
    worker_pool.close()
    #get the results from the results queue, and update the table of codes
    table = {}
    while symbol_count > 0:
        try:
            processed_symbol = results.get(timeout=0.1)
            table.update(processed_symbol)
            symbol_count -= 1
        except Queue.Empty:
            print "WAI DERe NO SYMBOLzzzZzz!!!"
        finally:
            print "Symbols to process: %d" % symbol_count
    return table

def make_huffman_table(data):
    """
    data is an iterable containing the string that needs to be encoded.
    Returns a dictionary mapping symbols to codes.
    """
    #Build a list of Nodes out of the characters in data
    nodes = [Node(None, None, weight, symbol, bytearray()) for symbol, weight in Counter(data).items()]
    nodes.sort(reverse=True, key=_sort_func)
    symbols = len(nodes)
    append_node = nodes.append
    while len(nodes) > 1:
        #make a new node out of the two nodes with the lowest weight and add it to the list of nodes.
        child2, child1 = nodes.pop(), nodes.pop()
        new_node = Node(child1, child2, child1.weight+child2.weight, None, bytearray())
        append_node(new_node)
        #then resort the nodes
        nodes.sort(reverse=True, key=_sort_func)
    top_node = nodes[0]
    return _encode_tree(top_node, symbols)

def chars(fname):
    """
    A simple generator to make reading from files without loading them
    totally into memory a simple task.
    """
    f = open(fname)
    char = f.read(1)
    while char != '':
        yield char
        char = f.read(1)
    f.close()
    raise StopIteration

if __name__ == "__main__":
    text = chars("romeo-and-juliet.txt")
    table = make_huffman_table(text)
    print table
The current output of this is:
More work to be done! : proc0
WAI DERe NO SYMBOLzzzZzz!!!
Symbols to process: 92
WAI DERe NO SYMBOLzzzZzz!!!
Symbols to process: 92
WAI DERe NO SYMBOLzzzZzz!!!
Symbols to process: 92
It just repeats the last bit forever. After the first process adds work to the queue, everything just stops. Why is that? Am I not understanding/using the queues properly? Sorry for all the code to read.
Your first problem is trying to use timeouts. They're almost never a good idea. They may be a good idea if you can't possibly think of a reliable way to do something efficiently, and you use timeouts only as a first step in checking whether something is really done.
That said, the primary problem is that multiprocessing is often very bad at reporting exceptions that occur in worker processes. Your code is actually dying here:
node.child1.code.append(node.code + '0')
The error message you're not seeing is "an integer or string of size 1 is required". You can't append a bytearray to a bytearray. You want to do:
node.child1.code.extend(node.code + '0')
                 ^^^^^^
instead, and in the similar line for child2. As is, because the first worker process to take something off the work queue dies, nothing more is ever added to the work queue. That explains everything you've seen - so far ;-)
No timeouts
FYI, the usual approach to avoid timeouts (which are flaky - unreliable) is to put a special sentinel value on a queue. Consumers know it's time to quit when they see the sentinel, and use a plain blocking .get() to retrieve items from the queue. So first thing is to create a sentinel; e.g., add this near the top:
ALL_DONE = "all done"
Best practice is also to .join() threads and processes - that way the main program knows (doesn't just guess) when they're done too.
So, you can change the end of _encode_tree() like so:
for i in range(1, symbol_count + 1):
    processed_symbol = results.get()
    table.update(processed_symbol)
    print "Symbols to process: %d" % (symbol_count - i)

for i in range(mp.cpu_count()):
    work.put(ALL_DONE)
worker_pool.join()

messages.put(ALL_DONE)
message_thread.join()

return table
The key here is that the main program knows all the work is done when, and only when, no symbols remain to be processed. Until then, it can unconditionally .get() results from the results queue. Then it puts a number of sentinels on the work queue equal to the number of workers. They'll each consume a sentinel and quit. Then we wait for them to finish (worker_pool.join()). Then a sentinel is put on the message queue, and we wait for that thread to end too. Only then does the function return.
Now nothing ever terminates early, everything is shut down cleanly, and the output of your final table isn't mixed up anymore with various other output from the workers and the message thread. _reporter_thread() gets rewritten like so:
def _reporter_thread(message_queue):
    while True:
        message = message_queue.get()
        if message == ALL_DONE:
            break
        else:
            print message
and similarly for _encode_proc(). No more timeouts or try/except Queue.Empty: fiddling. You don't even have to import Queue anymore :-)
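For completeness, here is my sketch of what "similarly for _encode_proc()" could look like, with the extend() fix from above folded in:

def _encode_proc(proc_number, work_queue, result_queue, message_queue):
    while True:
        node = work_queue.get()             # plain blocking get, no timeout
        if node == ALL_DONE:                # sentinel: nothing more to do
            return
        if node.child1 == node.child2 == None:
            message_queue.put("Symbol processed! : proc%d" % proc_number)
            result_queue.put({node.symbol: node.code})
        else:
            message_queue.put("More work to be done! : proc%d" % proc_number)
            node.child1.code.extend(node.code + '0')
            node.child2.code.extend(node.code + '1')
            work_queue.put(node.child1)
            work_queue.put(node.child2)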
How can I control the return value of this function, pool.apply_async?
Suppose that I have the following code:
import multiprocessing

def fun(..):
    ...
    ...
    return value

my_pool = multiprocessing.Pool(2)
for i in range(5):
    result = my_pool.apply_async(fun, [i])

# some code going to be here....

digest_pool.close()
digest_pool.join()

# here I need to process the results

How can I control the result value for every process and check which process it belongs to?
Store the value of i from the for loop and either print it or return it and save it somewhere else.
That way, if something happens in a process, you can check which process it was by looking at the variable i.
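A minimal sketch of that suggestion, with a toy fun() standing in for the real one; the point is just to keep each i next to its AsyncResult:

import multiprocessing

def fun(i):
    return i * 10                           # stand-in for the real work

if __name__ == '__main__':
    my_pool = multiprocessing.Pool(2)
    pending = [(i, my_pool.apply_async(fun, [i])) for i in range(5)]
    my_pool.close()
    my_pool.join()
    for i, async_res in pending:
        print((i, async_res.get()))         # i tells you which task this was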
Hope this helps.
Are you sure that you need to know which of your two workers is doing what right now? In that case you might be better off with Processes and Queues, because this sounds as if some communication between the multiple processes is required.
If you just want to know which result was processed by which worker, you can simply return a tuple:
#!/usr/bin/python
import multiprocessing

def fun(..):
    ...
    return value, multiprocessing.current_process()._name

my_pool = multiprocessing.Pool(2)
async_result = []
for i in range(5):
    async_result.append(my_pool.apply_async(fun, [i]))

# some code going to be here....

my_pool.close()  # close() must be called before join()
my_pool.join()

result = {}
for i in range(5):
    result[i] = async_result[i].get()
If you have the different input variables as a list, the map_async command might be a better choice:
#!/usr/bin/python
import multiprocessing

def fun(..):
    ...
    ...
    return value, multiprocessing.current_process()._name

my_pool = multiprocessing.Pool()
async_results = my_pool.map_async(fun, range(5))

# some code going to be here....

results = async_results.get()
The last line waits for all results to be computed. Note that results is a list of tuples, each containing your calculated value and the name of the process that calculated it.
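A runnable sketch of that tuple-returning pattern, with a concrete fun() in place of the pseudocode version:

import multiprocessing

def fun(i):
    return i * i, multiprocessing.current_process().name

if __name__ == '__main__':
    my_pool = multiprocessing.Pool(2)
    results = my_pool.map_async(fun, range(5)).get()
    my_pool.close()
    my_pool.join()
    for value, worker_name in results:
        print("%s computed %d" % (worker_name, value))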