producer/consumer-like multithreading in Python

A typical producer-consumer problem is solved in Python like below:
from queue import Queue

job_queue = Queue(maxsize=10)

def manager():
    while i_have_some_job_to_do:
        job = get_data_from_somewhere()
        job_queue.put(job)  # blocks only if the queue is currently full

def worker():
    while True:
        data = job_queue.get()  # blocks until data is available
        # get things done
But I have a variant of the producer/consumer problem (not one strictly speaking, so let me call it manager-worker):
The manager puts some jobs in a Queue, and the worker should keep getting jobs and doing them. But when the worker gets a job, it does not remove the job from the Queue (unlike Queue.get()). It is the manager that is able to remove a job from the Queue.
So how does the worker get a job while not removing it from the queue? Maybe a get followed by a put is OK?
And how does the manager remove a particular job from the queue?

Perhaps your workers can't remove jobs completely, but consider letting them move jobs from the original queue to a different "job done" queue. The move itself should be cheap and fast, and the manager can then process the "job done" queue, removing the elements it agrees are done and moving the others back to the worker queue.
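A minimal sketch of that two-queue idea, assuming hypothetical process() and is_really_done() helpers that are not part of the question:

from queue import Queue

job_queue = Queue()
done_queue = Queue()

def worker():
    while True:
        job = job_queue.get()    # take a job off the worker queue
        process(job)             # hypothetical work function
        done_queue.put(job)      # cheap, fast hand-off to the manager
        job_queue.task_done()

def manager():
    while True:
        job = done_queue.get()
        if not is_really_done(job):  # hypothetical acceptance check
            job_queue.put(job)       # not accepted: move it back to the worker queue
        done_queue.task_done()       # accepted jobs are simply dropped, i.e. removed

The worker never "peeks" at the queue; it takes a job and is responsible for handing it somewhere, so no job is ever lost between the two queues.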

Related

How do I wait when all ThreadPoolExecutor threads are busy?

My understanding of how a ThreadPoolExecutor works is that when I call #submit, tasks are assigned to threads until all available threads are busy, at which point the executor puts the tasks in a queue awaiting a thread becoming available.
The behavior I want is to block when there is not a thread available, to wait until one becomes available and then only submit my task.
The background is that my tasks are coming from a queue, and I only want to pull messages off my queue when there are threads available to work on these messages.
In an ideal world, I'd be able to provide an option to #submit to tell it to block if a thread is not available, rather than putting the task in a queue.
However, that option does not exist. So what I'm looking at is something like:
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    while True:
        wait_for_available_thread(executor)
        message = pull_from_queue()
        executor.submit(do_work_for_message, message)
And I'm not sure of the cleanest implementation of wait_for_available_thread.
Honestly, I'm surprised this isn't actually in concurrent.futures, as I would have thought the pattern of pulling from a queue and submitting to a thread pool executor would be relatively common.
One approach might be to keep track of your currently running threads via a set of Futures:
import concurrent.futures
import time

active_threads = set()

def pop_future(future):
    active_threads.discard(future)  # set.pop() takes no argument; discard removes this exact future

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    while True:
        while len(active_threads) >= CONCURRENCY:
            time.sleep(0.1)  # or whatever
        message = pull_from_queue()
        future = executor.submit(do_work_for_message, message)
        active_threads.add(future)
        future.add_done_callback(pop_future)
A more sophisticated approach might be to have the done_callback be the thing that triggers a queue pull, rather than polling and blocking, but then you need to fall back to polling the queue if the workers manage to get ahead of it.
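Another common way to get the blocking-submit behavior, not from the answer above but a frequently used variant, is a BoundedSemaphore with one permit per worker slot, released from the done callback (CONCURRENCY, pull_from_queue and do_work_for_message are reused from the question):

import concurrent.futures
import threading

# One permit per worker slot; acquire() blocks once CONCURRENCY tasks are in flight.
slots = threading.BoundedSemaphore(CONCURRENCY)

def release_slot(future):
    slots.release()  # free a permit when a task finishes

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    while True:
        slots.acquire()  # blocks until a worker slot is free
        message = pull_from_queue()
        future = executor.submit(do_work_for_message, message)
        future.add_done_callback(release_slot)

This avoids the sleep-based polling loop entirely.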

Multiprocessing pool.join() hangs under some circumstances

I am trying to create a simple producer/consumer pattern in Python using multiprocessing. It works, but it hangs on pool.join().
from multiprocessing import Pool, Queue

que = Queue()

def consume():
    while True:
        element = que.get()
        if element is None:
            print('break')
            break
    print('Consumer closing')

def produce(nr):
    que.put([nr] * 1000000)
    print('Producer {} closing'.format(nr))

def main():
    p = Pool(5)
    p.apply_async(consume)
    p.map(produce, range(5))
    que.put(None)
    print('None')
    p.close()
    p.join()

if __name__ == '__main__':
    main()
Sample output:
~/Python/Examples $ ./multip_prod_cons.py
Producer 1 closing
Producer 3 closing
Producer 0 closing
Producer 2 closing
Producer 4 closing
None
break
Consumer closing
However, it works perfectly when I change one line:
que.put([nr] * 100)
It is 100% reproducible on a Linux system running Python 3.4.3 or Python 2.7.10. Am I missing something?
There is quite a lot of confusion here. What you are writing is not a producer/consumer scenario but a mess that misuses another pattern, usually referred to as a "pool of workers".
The pool of workers pattern is an application of the producer/consumer one in which there is one producer that schedules the work and many consumers that consume it. In this pattern, the owner of the Pool ends up being the producer while the workers are the consumers.
In your example you instead have a hybrid solution where one worker ends up being a consumer and the others act as a sort of middleware. The whole design is very inefficient, duplicates most of the logic already provided by the Pool and, more importantly, is very error prone. What you end up suffering from is a deadlock.
Putting an object into a multiprocessing.Queue is an asynchronous operation: put blocks only if the Queue is full, and your Queue has unbounded size, so it never blocks.
This means your produce function returns immediately, therefore the call to p.map is not blocking as you expect it to. The related worker processes instead wait until the actual message goes through the Pipe which the Queue uses as its communication channel.
What happens next is that you terminate your consumer prematurely: the None "message" you put in the Queue gets delivered before all the lists your produce function creates are properly pushed through the Pipe.
You notice the issue once you call p.join, but the real situation is the following:
the p.join call is waiting for all the worker processes to terminate.
the worker processes are waiting for the big lists to go through the Queue's Pipe.
as the consumer worker is long gone, nobody drains the Pipe, which is obviously full.
The issue does not show if your lists are small enough to go through before you actually send the termination message to the consume function.
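A minimal sketch of the pool-of-workers shape described above, where the Pool owner is the sole producer and the workers are the consumers (this is one possible way to restructure the question's code, not the exact fix):

from multiprocessing import Pool

def consume(element):
    # each worker consumes one scheduled element and returns a result
    return len(element)

def main():
    with Pool(5) as p:
        # the Pool owner produces the work; no hand-rolled Queue is needed
        results = p.map(consume, [[nr] * 1000000 for nr in range(5)])
    print(results)

if __name__ == '__main__':
    main()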

Dynamically reordering jobs in a multiprocessing pool in Python

I'm writing a python script (for cygwin and linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which need to be run depending on the needs of the developer (on the order of 10 to 1000). Each job can take anywhere from a few seconds to 20 minutes to complete.
I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).
My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple, long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all are based on networking it seems, and not particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like spawned off on other processors.
import multiprocessing
import subprocess

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs  # list to be fed to Popen
        self.runTime = runTime      # approximate run time for the job
        self.children = children    # jobs that require this job to run first

def runJob(job):
    subprocess.Popen(job.popenArgs).wait()
    ####################################################
    # I want to remove this, and instead kick these back to the pool
    for j in job.children:
        runJob(j)
    ####################################################

def main(jobs):
    # This jobs argument contains only jobs which are ready to be run,
    # i.e. no children, only parent-less jobs
    jobs.sort(key=lambda job: job.runTime, reverse=True)
    multiprocessing.Pool(4).map(runJob, jobs)
First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.
Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.
Since you want the child jobs to be inserted at the front of the line when their parent is finished, a stack-like container would seem to fit your needs; the Queue module provides a thread-safe LifoQueue class that you can use.
import threading
import subprocess
from Queue import LifoQueue  # Python 2; in Python 3 this is: from queue import LifoQueue

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs
        self.runTime = runTime
        self.children = children

def run_jobs(queue):
    while True:
        job = queue.get()
        subprocess.Popen(job.popenArgs).wait()
        for child in job.children:
            queue.put(child)
        queue.task_done()

# Parameter 'jobs' contains the jobs that have no parent.
def main(jobs):
    job_queue = LifoQueue()
    num_workers = 4
    jobs.sort(key=lambda job: job.runTime)  # shortest first, so the longest ends up on top of the stack
    for job in jobs:
        job_queue.put(job)
    for i in range(num_workers):
        t = threading.Thread(target=run_jobs, args=(job_queue,))
        t.daemon = True
        t.start()
    job_queue.join()
A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.
(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.

Threads in Python again

guys!
My application is a bot. It simply receives a message, processes it, and returns a result.
But there are a lot of messages, and creating a separate thread for each one makes the application considerably slower.
So, is there any way to reduce CPU usage by replacing threads with something else?
You probably want processes rather than threads. Spawn processes at startup, and use Pipes to talk to them.
http://docs.python.org/dev/library/multiprocessing.html
Threads and processes have the same speed.
Your problem is not which one you use, but how many you use.
The answer is to have only a fixed handful of threads or processes, say 10.
You then create a Queue (using the Queue module) to store all messages from your bot.
The 10 threads will be working constantly, and every time one finishes, it waits for a new message in the Queue.
This saves you the overhead of creating and destroying threads.
See http://docs.python.org/library/queue.html for more info.
from Queue import Queue  # Python 2; in Python 3: from queue import Queue
from threading import Thread

def worker():
    while True:
        item = q.get()
        do_work(item)  # do_work, source and num_worker_threads are placeholders from the docs example
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for item in source():
    q.put(item)

q.join()  # block until all tasks are done
You could try creating only a limited number of workers and distributing work between them. Python's multiprocessing.Pool would be the thing to use.
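A minimal sketch of that, assuming a hypothetical process_message() handler standing in for the bot's real work:

from multiprocessing import Pool

def process_message(message):
    # hypothetical handler: do the real processing here and return the result
    return message.upper()

if __name__ == '__main__':
    with Pool(processes=10) as pool:  # a fixed number of workers, as suggested above
        results = pool.map(process_message, ['hello', 'world'])
    print(results)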
You might not even need threads. If your server can handle each request quickly, you can just make it all single-threaded using something like Twisted.

python: how to make threads wait for specific response?

Could anyone please advise on how to achieve the scenario below?
2 queues - a destination queue and a response queue
a thread picks a task up from the destination queue
finds out it needs more details
submits a new task to the destination queue
waits for its request to be processed and the result to appear in the response queue
or
monitors the response queue for the response to its own task, but does not actually pick up any other response, so those stay available to the other threads waiting for other responses?
thank you
If a thread waits for a specific task's completion, i.e. it shouldn't pick up any completed task except the one it put, you can use a lock to wait for that task:
import threading

def run(self):
    # get a task, do something, put a new task
    newTask.waitFor()
    ...

class Task:
    def __init__(self):
        self._error = None
        self._lock = threading.Lock()
        self._lock.acquire()  # hold the lock until the task is completed
    ...
    def waitFor(self):
        self._lock.acquire()  # blocks until complete() or failedToComplete() releases the lock
    def complete(self):
        self._lock.release()
    def failedToComplete(self, err):
        self._error = err
        self._lock.release()
This helps you avoid time.sleep()-style polling of the response queue. Error handling for failed tasks should also be considered here. But this is an uncommon approach. Is it some specific algorithm where the thread that puts a new task must wait for it? Even so, you can implement that logic in the Task class itself, not in the thread that processes it.
And why does a thread pick a task from the destination queue and put a new task back into the same destination queue? If you have n steps of processing, you can use n queues for it, as sketched below. A group of threads serves the first queue, gets a task, processes it, and puts the result (a new task) into the next queue. A group of final response-handler threads gets a response and sends it back to the client. The tasks encapsulate the details concerning themselves, the threads don't distinguish one task from another, and there is no need to wait for a particular task.
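A minimal sketch of that n-queue pipeline with two stages; step1, step2 and send_response are hypothetical placeholders:

import threading
from queue import Queue

stage1_queue = Queue()
stage2_queue = Queue()

def stage1_worker():
    while True:
        task = stage1_queue.get()
        result = step1(task)      # hypothetical first processing step
        stage2_queue.put(result)  # the result is simply a new task for the next stage
        stage1_queue.task_done()

def stage2_worker():
    while True:
        task = stage2_queue.get()
        send_response(step2(task))  # hypothetical final step; reply goes back to the client
        stage2_queue.task_done()

# a small fixed group of threads per stage
for target in (stage1_worker, stage2_worker):
    for _ in range(5):
        threading.Thread(target=target, daemon=True).start()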
