Single Producer Multiple Consumer - Python

I want a single-producer, multiple-consumer architecture in multi-threaded Python. The operation should look like this:
The producer produces the data.
Consumers 1..N (N is pre-determined) wait for the data to arrive (block), then process the SAME data in different ways.
So I need all the consumers to get the same data from the producer.
When I used a Queue for this, I realized that all but the first consumer would be starved with my implementation.
One possible solution is to have a unique queue for each consumer thread, with the producer pushing the same data into every queue. Is there a better way to do this?
from threading import Thread
import time
import random
from Queue import Queue

my_queue = Queue(0)

def Producer():
    global my_queue
    my_list = []
    for each in range(50):
        my_list.append(each)
    my_queue.put(my_list)

def Consumer1():
    print "Consumer1"
    global my_queue
    print my_queue.get()
    my_queue.task_done()

def Consumer2():
    print "Consumer2"
    global my_queue
    print my_queue.get()
    my_queue.task_done()

P = Thread(name="Producer", target=Producer)
C1 = Thread(name="Consumer1", target=Consumer1)
C2 = Thread(name="Consumer2", target=Consumer2)

P.start()
C1.start()
C2.start()
In the example above, C2 blocks indefinitely because C1 consumes the data produced by P. What I want instead is for both C1 and C2 to be able to access the SAME data produced by P.
Thanks for any code/pointers!

Your producer creates only one job to do:
my_queue.put(my_list)
For example, put my_list twice, and both consumers work:
def Producer():
    global my_queue
    my_list = []
    for each in range(50):
        my_list.append(each)
    my_queue.put(my_list)
    my_queue.put(my_list)
This way you put two jobs into the queue, both referencing the same list.
However, a warning: modifying the same data from different threads without synchronization is generally a bad idea.
In any case, the single-queue approach will not work for you, since a single queue is meant to be processed by threads running the same algorithm.
So I advise you to go ahead with a unique queue per consumer, since other solutions are not as trivial.

How about a per-thread queue then?
As part of starting each consumer, you would also create another Queue and add it to a list of "all thread queues". Then start the producer, passing it the list of all queues, so it can push each piece of data into every one of them. A minimal sketch of this idea follows.
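For illustration, here is a runnable sketch of the per-thread-queue approach in Python 3 (the names producer, consumer, and the None sentinel are my own choices, not from the answer):

import threading
from queue import Queue  # Python 2: from Queue import Queue

def producer(queues):
    # Broadcast the same payload to every consumer's private queue.
    data = list(range(50))
    for q in queues:
        q.put(data)   # note: every consumer sees the same list object
    for q in queues:
        q.put(None)   # sentinel so each consumer knows to stop

def consumer(name, q):
    while True:
        item = q.get()
        if item is None:  # sentinel received: shut down
            break
        print(f'{name} got {len(item)} items')

queues, consumers = [], []
for i in range(3):
    q = Queue()
    queues.append(q)
    c = threading.Thread(target=consumer, args=(f'Consumer{i + 1}', q))
    c.start()
    consumers.append(c)

p = threading.Thread(target=producer, args=(queues,))
p.start()
p.join()
for c in consumers:
    c.join()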

A single-producer, five-consumer example, verified.
from multiprocessing import Process, JoinableQueue
import time
import os

q = JoinableQueue()

def producer():
    for item in range(30):
        time.sleep(2)
        q.put(item)
    pid = os.getpid()
    print(f'producer {pid} done')

def worker():
    while True:
        item = q.get()
        pid = os.getpid()
        print(f'pid {pid} Working on {item}')
        print(f'pid {pid} Finished {item}')
        q.task_done()

# On Windows/macOS (spawn start method), guard this section with
# if __name__ == '__main__':
for i in range(5):
    Process(target=worker, daemon=True).start()

producers = []
# it is easy to extend this to multiple producers
for i in range(1):
    p = Process(target=producer)
    producers.append(p)
    p.start()

# make sure the producers are done
for p in producers:
    p.join()

# block until all workers are done
q.join()
print('All work completed')
Explanation:
1. One producer and five consumers in this example.
2. A JoinableQueue is used to make sure every element stored in the queue will be processed. task_done() is how a worker notifies that an element is done; q.join() waits until all elements have been marked done.
3. Because of #2, there is no need to join every worker.
4. But it is important to join the producers, to be sure the elements have been stored in the queue; otherwise the program exits immediately.
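As a sketch of the multi-producer extension mentioned in the code comment above, only the producer loop needs to change (N_PRODUCERS is a name introduced here for illustration; the rest reuses the example's producer function and queue q):

N_PRODUCERS = 3

producers = []
for i in range(N_PRODUCERS):
    p = Process(target=producer)   # same producer function as above
    producers.append(p)
    p.start()

for p in producers:
    p.join()   # wait until every producer has enqueued all of its items
q.join()       # then block until the workers have processed everything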

I know it might be overkill, but... what about using the signal/slot framework from Qt? For consistency, QThread can be used instead of threading.Thread.
from __future__ import annotations  # Needed for the forward Consumer type hint in register_consumer
from queue import Queue
from typing import List
from PySide2.QtCore import QThread, QObject, QCoreApplication, Signal, Slot, Qt
import time
import random

def thread_name():
    # Convenience helper
    return QThread.currentThread().objectName()

class Producer(QThread):
    product_available = Signal(list)

    def __init__(self):
        QThread.__init__(self, objectName='ThreadProducer')
        self.consumers: List[Consumer] = list()
        # See Consumer class comments for info (exactly the same reason here)
        self.internal_consumer_queue = Queue()
        self.active = True

    def run(self):
        my_list = [each for each in range(5)]
        self.product_available.emit(my_list)
        print(f'Producer: from thread {QThread.currentThread().objectName()} I\'ve sent my products\n')
        while self.active:
            consumer: Consumer = self.internal_consumer_queue.get(block=True)
            print(f'Producer: {consumer} has told me it has completed his task with my product! '
                  f'(Thread {thread_name()})')
            if consumer not in self.consumers:
                raise ValueError(f'Consumer {consumer} was not registered')
            self.consumers.remove(consumer)
            if len(self.consumers) == 0:
                print('All consumers have completed their task! I\'m terminating myself')
                self.active = False

    @Slot(object)
    def on_task_done_by_consumer(self, consumer: Consumer):
        self.internal_consumer_queue.put(consumer)

    def register_consumer(self, consumer: Consumer):
        if consumer in self.consumers:
            return
        self.consumers.append(consumer)
        consumer.task_done_with_product.connect(self.on_task_done_by_consumer)

class Consumer(QThread):
    task_done_with_product = Signal(object)

    def __init__(self, name: str, producer: Producer):
        # Super init must run first; it also sets the thread name
        QThread.__init__(self, objectName=f'Thread_Of_{name}')
        self.name = name
        self.producer = producer
        # See method on_product_available doc
        self.internal_queue = Queue()

    def run(self) -> None:
        self.producer.product_available.connect(self.on_product_available, Qt.ConnectionType.UniqueConnection)
        # Thread loop waiting for product availability
        product = self.internal_queue.get(block=True)
        print(f'{self.name}: Product {product} received and elaborated in thread {thread_name()}\n\n')
        # Tell the producer I'm done
        self.task_done_with_product.emit(self)
        # Now the thread ends naturally

    @Slot(list)
    def on_product_available(self, product: list):
        """
        As a limitation of PySide, it seems that lists are not supported for QueuedConnection. This
        workaround using an internal queue might solve it.
        """
        # This is executed in the main loop!
        print(f'{self.name}: In thread {thread_name()} I received the product, and I\'m queuing it for elaboration '
              f'in the consumer thread')
        self.internal_queue.put(product)
        self.active = False  # note: not read anywhere; run() exits after one product anyway

    def __repr__(self):
        # Needed in case of exception, for representing the current consumer
        return f'{self.name}'

# Needed to execute the main and thread event loops
app = QCoreApplication()
QThread.currentThread().setObjectName('MainThread')
producer = Producer()
c1 = Consumer('Consumer1', producer)
c1.start()
producer.register_consumer(c1)
c2 = Consumer('Consumer2', producer)
c2.start()
producer.register_consumer(c2)
producer.product_available.connect(c1.on_product_available)
producer.product_available.connect(c2.on_product_available)
# Start the Producer thread LAST!
producer.start()
app.exec_()
Results:
Producer: from thread ThreadProducer I've sent my products
Consumer1: In thread MainThread I received the product, and I'm queuing it for elaboration in the consumer thread
Consumer1: Product [0, 1, 2, 3, 4] received and elaborated in thread Thread_Of_Consumer1
Consumer2: In thread MainThread I received the product, and I'm queuing it for elaboration in the consumer thread
Consumer2: Product [0, 1, 2, 3, 4] received and elaborated in thread Thread_Of_Consumer2
Producer: Consumer1 has told me it has completed his task with my product! (Thread ThreadProducer)
Producer: Consumer2 has told me it has completed his task with my product! (Thread ThreadProducer)
All consumers have completed their task! I'm terminating myself
Notes:
The step-by-step explanation is in the code comments. If anything is unclear, I'll do my best to clarify further.
Unfortunately I've not found a way to use QueuedConnection (doc here) so as to execute the Slot directly in the proper thread: an internal queue is used to pass information from the main loop to the proper thread (either Producer or Consumer). It seems that list and object cannot be meta-registered in PySide/PyQt for queuing purposes.

Related

Why am I unable to join this thread in Python?

I am writing a multithreading class. The class has a parallel_process() function that is overridden with the parallel task. The data to be processed is put in the queue. The worker() function in each thread keeps calling parallel_process() until the queue is empty. Results are put in the results Queue object. The class definition is:
import threading
try:
    from Queue import Queue  # Python 2
except ImportError:
    from queue import Queue  # Python 3

class Parallel:
    def __init__(self, pkgs, common=None, nthreads=1):
        self.nthreads = nthreads
        self.threads = []
        self.queue = Queue()
        self.results = Queue()
        self.common = common
        for pkg in pkgs:
            self.queue.put(pkg)

    def parallel_process(self, pkg, common):
        pass

    def worker(self):
        while not self.queue.empty():
            pkg = self.queue.get()
            self.results.put(self.parallel_process(pkg, self.common))
            self.queue.task_done()
        return

    def start(self):
        for i in range(self.nthreads):
            t = threading.Thread(target=self.worker)
            t.daemon = False
            t.start()
            self.threads.append(t)

    def wait_for_threads(self):
        print('Waiting on queue to empty...')
        self.queue.join()
        print('Queue processed. Joining threads...')
        for t in self.threads:
            t.join()
            print('...Thread joined.')

    def get_results(self):
        results = []
        print('Obtaining results...')
        while not self.results.empty():
            results.append(self.results.get())
        return results
I use it to create a parallel task:
class myParallel(Parallel):  # return square of numbers in a list
    def parallel_process(self, pkg, common):
        return pkg**2

p = myParallel(range(50), nthreads=4)
p.start()
p.wait_for_threads()
r = p.get_results()
print('FINISHED')
However, not all threads join every time the code is run. Sometimes only 2 join, sometimes none do. I do not think I am blocking the threads from finishing. What reason could there be for join() not to work here?
This statement may lead to errors:
while not self.queue.empty():
    pkg = self.queue.get()
With multiple threads pulling items from the queue, there's no guarantee that self.queue.get() will return a valid item, even if you check whether the queue is empty beforehand. Here is a possible scenario:
1. Thread 1 checks the queue; it is not empty, so control proceeds into the while loop.
2. Control passes to Thread 2, which also checks the queue, finds it is not empty, and enters the while loop. Thread 2 gets an item from the queue. The queue is now empty.
3. Control passes back to Thread 1. It calls get() on the now-empty queue; since get() blocks by default, the thread hangs there forever (with get_nowait() an Empty exception would be raised instead), so join() never completes.
You should just use a try/except with get_nowait() to pull an item from the queue:
try:
    pkg = self.queue.get_nowait()
except Empty:  # from Queue import Empty (Python 2) / from queue import Empty (Python 3)
    return     # a bare `pass` would leave pkg undefined below
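Applied to the worker() from the question, a hedged sketch of the corrected method might look like this (my own assembly of the answer's fragments, not the answer author's exact code):

from queue import Empty  # Python 2: from Queue import Empty

def worker(self):
    while True:
        try:
            pkg = self.queue.get_nowait()
        except Empty:
            return  # queue exhausted: the thread exits, so t.join() returns promptly
        self.results.put(self.parallel_process(pkg, self.common))
        self.queue.task_done()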
@Brendan Abel identified the cause. I'd like to suggest a different solution: queue.join() is usually a Bad Idea too. Instead, create a unique value to use as a sentinel:
class Parallel:
_sentinel = object()
At the end of __init__(), add one sentinel to the queue for each thread:
for i in range(nthreads):
    self.queue.put(self._sentinel)
Change the start of worker() like so:
while True:
    pkg = self.queue.get()
    if pkg is self._sentinel:
        break
By the construction of the queue, it won't be empty until each thread has seen its sentinel value, so there's no need to mess with the unreliable queue.empty() check.
Also remove the queue.join() and queue.task_done() cruft.
This will give you reliable code that's easy to modify for fancier scenarios. For example, if you want to add more work items while the threads are running, fine - just write another method to say "I'm done adding work items now", and move the loop adding sentinels into that.
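For concreteness, here is a hedged sketch of the Parallel class with these changes applied (my own assembly of the answer's fragments, trimmed to the parts that change):

import threading
from queue import Queue

class Parallel:
    _sentinel = object()  # unique marker no real work item can equal

    def __init__(self, pkgs, common=None, nthreads=1):
        self.nthreads = nthreads
        self.threads = []
        self.queue = Queue()
        self.results = Queue()
        self.common = common
        for pkg in pkgs:
            self.queue.put(pkg)
        for i in range(nthreads):          # one sentinel per worker thread
            self.queue.put(self._sentinel)

    def parallel_process(self, pkg, common):
        pass

    def worker(self):
        while True:
            pkg = self.queue.get()
            if pkg is self._sentinel:      # our sentinel: time to stop
                break
            self.results.put(self.parallel_process(pkg, self.common))

    def start(self):
        for i in range(self.nthreads):
            t = threading.Thread(target=self.worker)
            t.start()
            self.threads.append(t)

    def wait_for_threads(self):
        for t in self.threads:             # thread.join() is now reliable
            t.join()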

Can a queue worker signal failure to the parent?

Imagine that I have a task queue with a consumer like this (this is almost identical to the sample code here):
def worker(tasks):
    while True:
        try:
            item = tasks.get_nowait()
        except:
            return
        execute(item)
        tasks.task_done()
and a producer like this:
def batch_execute(items, n_threads):
    tasks = Queue()
    for item in items:
        tasks.put(item)
    for n in range(n_threads):
        t = threading.Thread(target=worker, args=(tasks,))  # args must be a tuple
        t.start()
    tasks.join()
This works, except that execute(item) can throw exceptions. If that happens, the given thread bails out, the others keep running, and tasks.join() hangs indefinitely. Both traits are undesirable. Is there a typical design people use to, e.g., "forward" the exception from the child thread to the parent thread and unblock tasks.join()? Or do I have to implement all of that manually around Python's Queue class?
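No answer is recorded here, but one common pattern is to catch exceptions inside the worker, push them onto a dedicated error queue, and still call task_done() so that tasks.join() can return; the parent then re-raises. A hedged sketch (the errors queue and the re-raise policy are my own additions; execute() is the question's work function):

import threading
from queue import Queue, Empty

def worker(tasks, errors):
    while True:
        try:
            item = tasks.get_nowait()
        except Empty:
            return
        try:
            execute(item)              # the question's work function; may raise
        except Exception as exc:
            errors.put(exc)            # forward the failure to the parent
        finally:
            tasks.task_done()          # always mark done so join() unblocks

def batch_execute(items, n_threads):
    tasks, errors = Queue(), Queue()
    for item in items:
        tasks.put(item)
    for n in range(n_threads):
        threading.Thread(target=worker, args=(tasks, errors)).start()
    tasks.join()                       # cannot hang: task_done() runs in finally
    if not errors.empty():
        raise errors.get()             # re-raise the first recorded failure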

Why does this http thread pool die (join), but keep functioning?

The following code takes an initial string ('a', 'b', or 'c'), and the two thread types pass it back and forth, appending 'W' and 'H' to it repeatedly, marking that the Worker thread or the Http thread last handled the string.
The code is a simple test to try and eventually accomplish the following. The http thread pool will pull web pages, and the worker thread will add info to a db and then give the http thread more urls to pull. They just go back and forth. I want both thread pools and queues to stay alive unless BOTH are empty simultaneously. (There are cases where one pool will temporarily run out of things to do, and I don't want it to join, because its companion thread pool will probably be adding more work to its queue soon.)
In the following code, the http thread pool runs out of things to do almost immediately, and then joins. But you'll notice that the threads keep functioning.
Why does it do this?
And how do I make it so that neither queue can join until BOTH are simultaneously empty?
from queue import Queue
import threading
import time

class http(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            row = self.queue.get()
            print(row)
            self.out_queue.put(row + 'H')
            self.queue.task_done()

class worker(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            time.sleep(1)
            row = self.out_queue.get()
            self.queue.put(row + 'W')
            self.out_queue.task_done()

URL_THREAD_COUNT = 3
rows = [chr(x) for x in range(97, 100)]

def main():
    queue = Queue()
    out_queue = Queue()

    # spawn a pool of threads, and pass them the queue instances
    for i in range(URL_THREAD_COUNT):
        t = http(queue, out_queue)
        t.daemon = True
        t.start()

    # populate the queue with data
    for row in rows:
        queue.put(row)

    # spawn the worker thread
    dt = worker(queue, out_queue)
    dt.daemon = True
    dt.start()

    # time.sleep(5)
    # wait for the queues
    queue.join()
    print('EXIT http')
    out_queue.join()
    print('EXIT worker')

start = time.time()
main()
print("Elapsed Time: %s" % (time.time() - start))
"joining" a queue waits until the queue is empty. If worker finishes processing some out_queue messages before the other threads can add more messages, the outer out_queue.join thinks you are done. You may want to add a control message that tells the threads when their work is done so that they can exit, and call thread.join() for them all instead. That will mean keeping a list of threads created in the for loop instead of just abandoning them.

Python queue task_done() issue

I have a problem with Python's multithreaded Queues. In this script, the producer takes elements from the input queue, produces new elements, and puts them into the output queue; the consumer takes elements from the output queue and just prints them:
import threading
import Queue

class Producer(threading.Thread):
    def __init__(self, iq, oq):
        threading.Thread.__init__(self)
        self.iq = iq
        self.oq = oq

    def produce(self, e):
        self.oq.put(e*2)
        self.oq.task_done()
        print "Producer %s produced %d and put it to output Queue" % (self.getName(), e*2)

    def run(self):
        while 1:
            e = self.iq.get()
            self.iq.task_done()
            print "Get %d from input Queue" % (e)
            self.produce(e)

class Consumer(threading.Thread):
    def __init__(self, oq):
        threading.Thread.__init__(self)
        self.oq = oq

    def run(self):
        while 1:
            e = self.oq.get()
            self.oq.task_done()
            print "Consumer get %d from output queue and consumed" % e

iq = Queue.Queue()
oq = Queue.Queue()
for i in xrange(2):
    iq.put((i+1)*10)
for i in xrange(2):
    t1 = Producer(iq, oq)
    t1.setDaemon(True)
    t1.start()
    t2 = Consumer(oq)
    t2.setDaemon(True)
    t2.start()
iq.join()
oq.join()
But every time I run it, it behaves differently (raises an exception, or the consumer does no work). I think the problem is in the task_done() command; can anyone explain where the bug is?
I have modified Consumer class:
class Consumer(threading.Thread):
    def __init__(self, oq):
        threading.Thread.__init__(self)
        self.oq = oq

    def run(self):
        while 1:
            e = self.oq.get()
            self.oq.task_done()
            print "Consumer get %d from output queue and consumed" % e
            page = urllib2.urlopen("http://www.ifconfig.me/ip")  # requires: import urllib2
            print page
Now, after each task_done() call the consumer should connect to a web site (which takes some time), but it does not. Instead, if the code after task_done() executes quickly, it runs; but if it takes long, it does not run! Why? Can anyone explain this issue? If I put everything before the task_done() call, I will block the queue from other threads, which seems wrong. Or is there anything I am missing about multithreading in Python?
From the Queue docs:
Queue.task_done(): Indicate that a formerly enqueued task is complete. Used by queue consumer threads. For each get() used to fetch a task, a subsequent call to task_done() tells the queue that the processing on the task is complete. If a join() is currently blocking, it will resume when all items have been processed (meaning that a task_done() call was received for every item that had been put() into the queue).
For example in your code you do the following in your Producer class:
def produce(self, e):
    self.oq.put(e*2)
    self.oq.task_done()
    print "Producer %s produced %d and put it to output Queue" % (self.getName(), e*2)
You shouldn't call self.oq.task_done() here, since you haven't called oq.get().
I am not sure this is the only problem though.
EDIT:
For your other problem: you're using iq.join() and oq.join() at the end, which lets your main thread exit before the other threads print the retrieved pages; and since you're creating your threads as daemons, your Python application exits without waiting for them to finish executing. (Remember that Queue.join() depends on Queue.task_done().)
Now you're saying "If I put everything before task_done() command then I will block queue from other threads". I can't see what you mean; this will only block your Consumer thread, and you can always create more Consumer threads, which won't be blocked by each other.
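To make the fix concrete, here is a hedged sketch of the two methods with task_done() used correctly (my own rewrite in the question's Python 2 style, not the answer author's code):

# In Producer: no task_done() here, because produce() never calls oq.get()
def produce(self, e):
    self.oq.put(e*2)
    print "Producer %s produced %d and put it to output Queue" % (self.getName(), e*2)

# In Consumer: mark the item done only after the work on it is finished
def run(self):
    while 1:
        e = self.oq.get()
        print "Consumer get %d from output queue and consumed" % e
        self.oq.task_done()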

Checking on a thread / remove from list

I have a class which extends Thread. The code looks a little like this:
class MyThread(Thread):
    def run(self):
        # Do stuff
        pass

my_threads = []
while has_jobs() and len(my_threads) < 5:
    new_thread = MyThread(next_job_details())
    new_thread.start()  # start(), not run(): run() would execute the job in the current thread
    my_threads.append(new_thread)

for my_thread in my_threads:
    my_thread.join()

# Do stuff
So here in my pseudocode I check whether there are any jobs (from a db, etc.), and if there are jobs and fewer than 5 threads running, I create new threads.
From there I check over my threads, and this is where I get stuck. I can use .join(), but my understanding is that it waits until the thread is finished; so if the first thread it checks is still in progress, it waits until that one is done, even if the other threads are already finished...
So is there a way to check whether a thread is done, and remove it from the list if so?
e.g.
for my_thread in my_threads:
    if my_thread.done():
        # process results
        del my_threads[my_thread]  # ?? will that work...
As TokenMacGuy says, you should use thread.is_alive() to check whether a thread is still running. To remove no-longer-running threads from your list you can use a list comprehension:

# (assumes each thread was created with t.handled = False)
for t in my_threads:
    if not t.is_alive():
        # get results from thread
        t.handled = True
my_threads = [t for t in my_threads if not t.handled]
This avoids the problem of removing items from a list while iterating over it.
mythreads = threading.enumerate()
Enumerate returns a list of all Thread objects still alive.
https://docs.python.org/3.6/library/threading.html
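For instance, a small sketch that keeps only your own still-running threads (the main thread is filtered out because enumerate() includes it):

import threading

# All non-main threads that are still alive right now
my_threads = [t for t in threading.enumerate()
              if t is not threading.main_thread()]
print('%d worker threads still alive' % len(my_threads))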
You need to call thread.is_alive() to find out whether the thread is still running (in older code this was spelled isAlive()).
The answer has been covered, but for simplicity...
# To filter out finished threads
threads = [t for t in threads if t.is_alive()]
# Same thing but for QThreads (if you are using PyQt)
threads = [t for t in threads if t.isRunning()]
A better way is to use the Queue class:
http://docs.python.org/library/queue.html
Look at the good example code at the bottom of the documentation page:
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for item in source():
    q.put(item)

q.join()  # block until all tasks are done
An easy solution for checking whether a thread has finished or not. It is thread-safe.
Install pyrvsignal
pip install pyrvsignal
Example:
import time
from threading import Thread
from pyrvsignal import Signal

class MyThread(Thread):
    started = Signal()
    finished = Signal()

    def __init__(self, target, args):
        self.target = target
        self.args = args
        Thread.__init__(self)

    def run(self) -> None:
        self.started.emit()
        self.target(*self.args)
        self.finished.emit()

def do_my_work(details):
    print(f"Doing work: {details}")
    time.sleep(10)

def started_work():
    print("Started work")

def finished_work():
    print("Work finished")

thread = MyThread(target=do_my_work, args=("testing",))
thread.started.connect(started_work)
thread.finished.connect(finished_work)
thread.start()
