I have a simple implementation of python's multi-processing module
from multiprocessing import Process

if __name__ == '__main__':
    jobs = []
    while True:
        for i in range(40):
            # fetch one by one from redis queue
            #item = item from redis queue
            p = Process(name='worker '+str(i), target=worker, args=(item,))
            # if p is not running, start p
            if not p.is_alive():
                jobs.append(p)
                p.start()
        for j in jobs:
            j.join()
            jobs.remove(j)

def worker(url_data):
    """worker function"""
    print url_data['link']
What I expect this code to do:
Run in an infinite loop, waiting on the Redis queue.
If the Redis queue is not empty, fetch an item.
Create 40 multiprocessing.Process instances, no more, no less.
When a process has finished processing, start a new one, so that ~40 processes are running at all times.
I read that, to avoid zombie processes, children should be joined to the parent, and that is what I tried to achieve in the second loop. The problem is that on launch it spawns 40 processes, and the workers that finish processing stay in a zombie state until all currently spawned processes have finished;
then, in the next iteration of "while True", the same pattern repeats.
So my question is:
How can I avoid zombie processes and spawn a new process as soon as one of the 40 has finished?
For a task like the one you described, it is usually better to use a different approach based on Pool.
You can have the main process fetch the data and let the workers deal with it.
Here is an example of Pool from the Python docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result = pool.apply_async(f, [10])    # evaluate "f(10)" asynchronously
    print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"
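I also suggest using imap instead of map, since your tasks look independent and can be processed asynchronously.
Roughly, your code would be:
from multiprocessing import Pool

def worker(url_data):
    """worker function"""
    print url_data['link']

if __name__ == '__main__':
    p = Pool(40)
    while True:
        # fetch_items() is a hypothetical helper returning a batch of items from the redis queue
        items = fetch_items()
        # the unordered version is faster; iterate so the results are consumed
        for _ in p.imap_unordered(worker, items):
            pass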
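As a rough, self-contained sketch of the Pool approach described above (written in Python 3 syntax): the list of dictionaries stands in for items fetched from Redis, and process_batch is a hypothetical helper name, so both are assumptions for illustration only. The worker returns its result instead of printing it so that the parent can collect results as soon as each worker finishes.

```python
from multiprocessing import Pool


def worker(url_data):
    """Worker function: return the link so the parent can collect it."""
    return url_data['link']


def process_batch(pool, items):
    """Hand items to the pool and gather results in completion order."""
    # imap_unordered yields each result as soon as a worker finishes it.
    return sorted(pool.imap_unordered(worker, items))


if __name__ == '__main__':
    # Stand-in for a batch fetched from Redis (assumption for this sketch).
    items = [{'link': 'http://example.com/%d' % i} for i in range(10)]
    pool = Pool(4)
    try:
        links = process_batch(pool, items)
    finally:
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for the workers, so no zombies are left behind
    print(links)
```

The pool keeps a fixed number of worker processes alive and reuses them, which is what removes the need to start and join a new Process per item.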
I am using python multiprocessing library for executing a selenium script. My code is below :
#-- start and join multiple threads ---
thread_list = []
total_threads=10 #-- no of parallel threads
for i in range(total_threads):
    t = Process(target=get_browser_and_start, args=[url,nlp,pixel])
    thread_list.append(t)
    print "starting thread..."
    t.start()

for t in thread_list:
    print "joining existing thread..."
    t.join()
As I understand it, join() waits for each process to complete. But I want a process to be given another task as soon as it becomes free.
It can be understood like this:
Say 8 processes are started at first.
no_of_tasks_to_perform = 100
for i in range(no_of_tasks_to_perform):
    processes start(8)
    if process no 2 finished executing, start new process
    maintain 8 process at any point of time till
    "i" is <= no_of_tasks_to_perform
Instead of starting new processes every now and then, put all your tasks into a multiprocessing.Queue() and start 8 long-running processes. Each process keeps pulling tasks from the queue and working on them until there are no tasks left.
In your case, it's more like this:
from multiprocessing import Queue, Process

def worker(queue):
    # note: empty() is not reliable when several processes consume the same queue
    while not queue.empty():
        task = queue.get()
        # now start to work on your task
        url, nlp, pixel = task[:3]  # unpack the arguments from the task
        get_browser_and_start(url, nlp, pixel)

def main():
    queue = Queue()
    # Now put tasks into queue
    no_of_tasks_to_perform = 100
    for i in range(no_of_tasks_to_perform):
        queue.put([url, nlp, pixel, ...])
    # Now start all processes
    process = Process(target=worker, args=(queue, ))
    process.start()
    ...
    process.join()
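A minimal sketch of the long-running-worker idea above, in Python 3 syntax. It swaps the queue.empty() check for a None sentinel per worker, which is a different exit technique from the one in the answer, and handle() is a hypothetical placeholder for the real job (here it only doubles a number).

```python
from multiprocessing import Process, Queue


def handle(task):
    """Placeholder for the real job (assumption): double the number."""
    return task * 2


def worker(tasks, results):
    """Long-running worker: pull tasks until a None sentinel arrives."""
    while True:
        task = tasks.get()      # blocks until a task is available
        if task is None:        # sentinel: no more work for this worker
            break
        results.put(handle(task))


def run(num_workers, items):
    items = list(items)
    tasks, results = Queue(), Queue()
    procs = [Process(target=worker, args=(tasks, results))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for item in items:
        tasks.put(item)
    for _ in procs:
        tasks.put(None)         # one sentinel per worker
    # Drain the result queue before joining the workers.
    out = [results.get() for _ in items]
    for p in procs:
        p.join()
    return sorted(out)


if __name__ == '__main__':
    print(run(4, range(20)))
```

A blocking get() plus a sentinel avoids the race in which one worker sees a non-empty queue and another worker takes the last item before it calls get().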
I'm using the Process class to create and manage subprocesses, which may return non-trivial quantities of data. The documentation states that join() is the correct way to wait for a Process to complete (https://docs.python.org/2/library/multiprocessing.html#the-process-class).
However, when using multiprocessing.Queue this can cause a hang after joining the process, as described here: https://bugs.python.org/issue8426 and here https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming (not a bug).
These docs suggest removing p.join() - but surely this will remove the guarantee that all processes have completed, as Queue.get() only waits for a single item to become available?
How can I wait for completion of all Processes in this case, and ensure I'm collecting output from them all?
A simple example of the hang I'd like to deal with:
from multiprocessing import Process, Queue

class MyClass:
    def __init__(self):
        pass

def example_run(output):
    output.put([MyClass() for i in range(1000)])
    print("Bottom of example_run() - note hangs after this is printed")

if __name__ == '__main__':
    output = Queue()
    processes = [Process(target=example_run, args=(output,)) for x in range(5)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print("Processes completed")
https://bugs.python.org/issue8426
This means that whenever you use a queue you need to make sure that
all items which have been put on the queue will eventually be removed
before the process is joined. Otherwise you cannot be sure that
processes which have put items on the queue will terminate.
In your example I just added output.get() calls before calling join(), and everything worked fine. Data is put on a queue to be used somewhere, so just make sure it actually gets consumed.
for p in processes:
    p.start()

for p in processes:
    print output.get()  # one get() per process, so nothing is left on the queue

for p in processes:
    p.join()
print("Processes completed")
An inelegant solution is to add
output_final = []
for i in range(5): # we have 5 processes
    output_final.append(output.get())
before attempting to join any of the processes. This simply gets one output for each process we've started.
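Put together with the original example, the drain-then-join order might look like the following sketch (Python 3 syntax). The collect() helper is a name introduced here for illustration; it assumes each process puts exactly one item on the queue.

```python
from multiprocessing import Process, Queue


class MyClass:
    pass


def example_run(output):
    """Each process puts exactly one (fairly large) list on the queue."""
    output.put([MyClass() for _ in range(1000)])


def collect(num_processes):
    """Start the processes, drain one result per process, then join."""
    output = Queue()
    processes = [Process(target=example_run, args=(output,))
                 for _ in range(num_processes)]
    for p in processes:
        p.start()
    # Remove everything that was put on the queue *before* joining.
    results = [output.get() for _ in processes]
    for p in processes:
        p.join()
    return results


if __name__ == '__main__':
    results = collect(5)
    print(len(results), len(results[0]))
```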
It turns out that a much better, more general solution is not to use Process at all; use Pool instead. This way the hassle of starting worker processes and collecting the results is handled for you:
import multiprocessing

class MyClass:
    def __init__(self):
        pass

def example_run(someArbitraryInput):
    foo = [MyClass() for i in range(10000)]
    return foo

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=5)
    output = pool.map(example_run, range(5))
    pool.close(); pool.join() # make sure the processes are complete and tidy
    print("Processes completed")
I'm running Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32.
I spawn 4 processes, give them 2 queues - for tasks and results, and join the task queue at the end. And when the task count reaches a certain amount - njobs = 10000 for example - some of the children and the main process won't exit, even though all tasks are done.
Why is this?
The code to illustrate this:
def worker(job_queue, result_queue):
    import Queue
    while True:
        try:
            j = job_queue.get(False)
        except Queue.Empty:
            exit('done')
        else:
            result_queue.put_nowait(j)
            job_queue.task_done()

if __name__ == "__main__":
    from multiprocessing import JoinableQueue, Process, cpu_count
    job_queue = JoinableQueue()
    result_queue = JoinableQueue()
    njobs = 10000
    for i in xrange(njobs):
        job_queue.put(i)
    cpus = cpu_count()
    for i in xrange(cpus):
        p = Process(target=worker, args=(job_queue, result_queue))
        p.start()
    job_queue.join()
    print("DONE")
And the longer each task takes, the fewer tasks are needed for some (or all) of the processes to hang. Originally, I was doing sequence matching with this, and it usually left 3 processes hanging when the queue held about 500 items.
Apparently, having more than 6570 items in a queue might cause a deadlock (more information in this thread). What you can do is empty result_queue at the end of the main execution:
while not result_queue.empty():
    result_queue.get(False)
    result_queue.task_done()
print "Done"
Note that you don't have to call exit in the worker function; return is enough:
except Queue.Empty:
    print "done"
    return
You might also consider using a Pool:
from multiprocessing import Pool

def task(arg):
    """Called by the workers"""
    return arg

def callback(arg):
    """Called by the main process"""
    pass

if __name__ == "__main__":
    pool = Pool()
    njobs = 10000
    print "Enqueuing tasks"
    for i in xrange(njobs):
        pool.apply_async(task, (i,), callback=callback)
    print "Closing the pool"
    pool.close()
    print "Joining the pool"
    pool.join()
    print "Done"
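The callback in the example above does nothing. As a small sketch in Python 3 syntax, the callback could collect the results in the main process instead; the squaring task and the run() helper are assumptions chosen only to give the workers something to return.

```python
from multiprocessing import Pool


def task(arg):
    """Called by the workers (hypothetical job: square the argument)."""
    return arg * arg


def run(njobs, processes=4):
    results = []
    pool = Pool(processes)
    for i in range(njobs):
        # The callback runs in the parent process, so appending to a
        # plain list is enough to gather the results.
        pool.apply_async(task, (i,), callback=results.append)
    pool.close()   # no more tasks
    pool.join()    # wait until all tasks and callbacks have finished
    return sorted(results)


if __name__ == "__main__":
    print(len(run(1000)))
```

Because the pool reads results from its workers internally, there is no user-managed result queue left to drain before joining.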
This is an implementation limit with pipes or sockets well described in Issue 8426: multiprocessing.Queue fails to get() very large objects. Note it also applies to a lot of small objects.
Solution
Either make sure to consume the result queue concurrently and fast enough,
or, from the child processes, call Queue.cancel_join_thread().
Documentation
Bear in mind that a process that has put items in a queue will wait
before terminating until all the buffered items are fed by the
“feeder” thread to the underlying pipe. (The child process can call
the cancel_join_thread() method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that
all items which have been put on the queue will eventually be removed
before the process is joined. Otherwise you cannot be sure that
processes which have put items on the queue will terminate. Remember
also that non-daemonic processes will be joined automatically.
— Multiprocessing - Programming guidelines
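A hedged sketch of the second option, in Python 3 syntax: the child calls cancel_join_thread() so that it may exit without waiting for its buffered data to reach the pipe. Any data that has not been flushed is lost, so this only makes sense when losing it is acceptable. The payload() helper and the sizes are assumptions chosen to make the item large, and the join timeout is purely defensive.

```python
from multiprocessing import Process, Queue


def payload(n):
    """Build a list large enough to overflow a typical pipe buffer."""
    return ['x' * 100 for _ in range(n)]


def producer(queue, n):
    queue.put(payload(n))
    # Let this process exit even if the feeder thread has not flushed the
    # data to the pipe yet. Unflushed data is simply lost.
    queue.cancel_join_thread()


if __name__ == '__main__':
    queue = Queue()
    p = Process(target=producer, args=(queue, 100000))
    p.start()
    # Nothing is read from the queue, on purpose.
    p.join(timeout=30)
    if p.is_alive():          # defensive: never leave the demo hanging
        p.terminate()
        p.join()
    print("producer exit code:", p.exitcode)
```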
How can I write a Python multiprocessing script that uses two Queues like these?
one as a working queue that starts with some data and that, depending on conditions in the functions being parallelized, receives further tasks on the fly,
another that gathers results and is used to write them out after processing finishes.
I basically need to put more tasks in the working queue depending on what I find in its initial items. The example I post below is silly (I could transform the item as I like and put it directly in the output Queue), but its mechanics are clear and reflect part of the concept I need to develop.
Here is my attempt:
import multiprocessing as mp

def worker(working_queue, output_queue):
    item = working_queue.get() #I take an item from the working queue
    if item % 2 == 0:
        output_queue.put(item**2) # If I like it, I do something with it and conserve the result.
    else:
        working_queue.put(item+1) # If there is something missing, I do something with it and leave the result in the working queue

if __name__ == '__main__':
    static_input = range(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker,args=(working_q, output_q)) for i in range(mp.cpu_count())] #I am running as many processes as CPU my machine has (is this wise?).
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    for result in iter(output_q.get, None):
        print result #alternatively, I would like to (c)pickle.dump this, but I am not sure if it is possible.
This neither ends nor prints any result.
At the end of the whole process I would like to ensure that the working queue is empty, and that all the parallel functions have finished writing to the output queue before the latter is iterated to take out the results. Do you have suggestions on how to make it work?
The following code achieves the expected results. It follows the suggestions made by @tawmas.
This code makes it possible to use multiple cores in a process where the queue that feeds data to the workers can be updated by them during processing:
import multiprocessing as mp

def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() == True:
            break # exit condition: stop once the working queue looks empty
        else:
            picked = working_queue.get()
            if picked % 2 == 0:
                output_queue.put(picked)
            else:
                working_queue.put(picked+1)
    return

if __name__ == '__main__':
    static_input = xrange(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker,args=(working_q, output_q)) for i in range(mp.cpu_count())]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    results_bank = []
    while True:
        if output_q.empty() == True:
            break
        results_bank.append(output_q.get_nowait())
    print len(results_bank) # should equal len(static_input), the range used to populate the input queue; this tells whether all the items placed for processing were actually processed.
    results_bank.sort()
    print results_bank
You have a typo in the line that creates the processes. It should be mp.Process, not mp.process. This is what is causing the exception you get.
Also, you are not looping in your workers, so they actually only consume a single item each from the queue and then exit. Without knowing more about the required logic, it's not easy to give specific advice, but you will probably want to enclose the body of your worker function inside a while True loop and add a condition in the body to exit when the work is done.
Please note that, if you do not add a condition to explicitly exit from the loop, your workers will simply stall forever when the queue is empty. You might consider using the so-called poison pill technique to signal the workers they may exit. You will find an example and some useful discussion in the PyMOTW article on Communication Between processes.
As for the number of processes to use, you will need to benchmark a bit to find what works for you, but, in general, one process per core is a good starting point when your workload is CPU bound. If your workload is IO bound, you might have better results with a higher number of workers.
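One way to combine the advice above (a looping worker plus poison pills) with a working queue that the workers themselves refill is sketched below in Python 3 syntax. It uses JoinableQueue to detect when all work, including re-enqueued items, is finished. That is a swapped-in technique rather than what the question or answer used, and step() and run() are hypothetical helper names.

```python
import multiprocessing as mp


def step(item):
    """Decide whether an item is finished or needs another pass."""
    if item % 2 == 0:
        return ('done', item ** 2)
    return ('retry', item + 1)


def worker(working_q, output_q):
    while True:
        item = working_q.get()
        if item is None:              # poison pill: no more work
            working_q.task_done()
            break
        action, value = step(item)
        if action == 'done':
            output_q.put(value)
        else:
            # Re-enqueue before task_done() so the unfinished-task
            # count never drops to zero while work remains.
            working_q.put(value)
        working_q.task_done()


def run(items, num_workers=4):
    items = list(items)
    working_q = mp.JoinableQueue()
    output_q = mp.Queue()
    for item in items:
        working_q.put(item)
    procs = [mp.Process(target=worker, args=(working_q, output_q))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    working_q.join()                  # all items, including retries, are done
    for _ in procs:
        working_q.put(None)           # one poison pill per worker
    # Every input item ends up producing exactly one result here, so
    # drain that many results before joining the workers.
    results = [output_q.get() for _ in items]
    for p in procs:
        p.join()
    return sorted(results)


if __name__ == '__main__':
    print(run(range(100)))
```

Waiting on working_q.join() before sending the pills avoids the situation where a worker exits on a momentarily empty queue while another worker is about to put a new task on it.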
I'm having this problem in Python:
I have a queue of URLs that I need to check from time to time.
If the queue is filled up, I need to process each item in the queue.
Each item in the queue must be processed by a single process (multiprocessing).
So far I managed to achieve this "manually" like this:
while 1:
    self.updateQueue()
    while not self.mainUrlQueue.empty():
        domain = self.mainUrlQueue.get()
        # if we didn't launched any process yet, we need to do so
        if len(self.jobs) < maxprocess:
            self.startJob(domain)
            #time.sleep(1)
        else:
            # If we already have process started we need to clear the old process in our pool and start new ones
            jobdone = 0
            # We circle through each of the process, until we find one free ; only then leave the loop
            while jobdone == 0:
                for p in self.jobs :
                    #print "entering loop"
                    # if the process finished
                    if not p.is_alive() and jobdone == 0:
                        #print str(p.pid) + " job dead, starting new one"
                        self.jobs.remove(p)
                        self.startJob(domain)
                        jobdone = 1
However, that leads to tons of problems and errors. I wonder whether I would be better off using a Pool of processes. What would be the right way to do this?
However, my queue is often empty, and it can be filled with 300 items in a second, so I'm not too sure how to do things here.
You could use the blocking capabilities of the queue: spawn multiple processes at startup (using multiprocessing.Pool) and let them sleep until some data is available on the queue to process. If you're not familiar with that, you could try to "play" with this simple program:
import multiprocessing
import os
import time

the_queue = multiprocessing.Queue()

def worker_main(queue):
    print os.getpid(),"working"
    while True:
        item = queue.get(True)
        print os.getpid(), "got", item
        time.sleep(1) # simulate a "long" operation

the_pool = multiprocessing.Pool(3, worker_main,(the_queue,))
#                            don't forget the comma here ^

for i in range(5):
    the_queue.put("hello")
    the_queue.put("world")

time.sleep(10)
Tested with Python 2.7.3 on Linux
This will spawn 3 processes (in addition to the parent process). Each child executes the worker_main function, a simple loop that gets a new item from the queue on each iteration. Workers block if nothing is ready to process.
At startup, all 3 processes sleep until the queue is fed with some data. When an item becomes available, one of the waiting workers gets it and starts processing it. After that, it tries to get another item from the queue, waiting again if nothing is available...
Added some code (submitting "None" to the queue) to shut down the worker processes cleanly, and added code to close and join the_queue and the_pool:
import multiprocessing
import os
import time

NUM_PROCESSES = 20
NUM_QUEUE_ITEMS = 20 # so really 40, because hello and world are processed separately

def worker_main(queue):
    print(os.getpid(),"working")
    while True:
        item = queue.get(block=True) #block=True means make a blocking call to wait for items in queue
        if item is None:
            break
        print(os.getpid(), "got", item)
        time.sleep(1) # simulate a "long" operation

def main():
    the_queue = multiprocessing.Queue()
    the_pool = multiprocessing.Pool(NUM_PROCESSES, worker_main,(the_queue,))
    for i in range(NUM_QUEUE_ITEMS):
        the_queue.put("hello")
        the_queue.put("world")
    for i in range(NUM_PROCESSES):
        the_queue.put(None)
    # prevent adding anything more to the queue and wait for queue to empty
    the_queue.close()
    the_queue.join_thread()
    # prevent adding anything more to the process pool and wait for all processes to finish
    the_pool.close()
    the_pool.join()

if __name__ == '__main__':
    main()