Why does this http thread pool die (join), but keep functioning?


The following code takes an initial string ('a', 'b', or 'c'), and the two thread types pass it back and forth, appending 'W' or 'H' each time to mark whether the Worker thread or the Http thread handled it last.
The code is a simple test toward the following goal: the http thread pool pulls web pages, and the worker thread adds info to a db and then gives the http threads more URLs to pull. They just go back and forth. I want both thread pools and queues to stay alive unless BOTH are empty simultaneously. (There are cases where one pool temporarily runs out of things to do, and I don't want it to join, because its companion thread pool will probably add more work to its queue soon.)
In the following code, the http thread pool runs out of things to do almost immediately, and then joins. But you'll notice that the threads keep functioning.
Why does it do this?
And how do I make it so that neither queue can join until BOTH are simultaneously empty?
from queue import Queue
import threading
import time
class http(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue
    def run(self):
        while True:
            row = self.queue.get()
            print(row)
            self.out_queue.put(row+'H')
            self.queue.task_done()
class worker(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue
    def run(self):
        while True:
            time.sleep(1)
            row = self.out_queue.get()
            self.queue.put(row+'W')
            self.out_queue.task_done()
URL_THREAD_COUNT = 3
rows = [chr(x) for x in range(97, 100)]
def main():
    queue = Queue()
    out_queue = Queue()
    #spawn a pool of threads, and pass them queue instance
    for i in range(URL_THREAD_COUNT):
        t = http(queue, out_queue)
        t.daemon = True
        t.start()
    #populate queue with data
    for row in rows:
        queue.put(row)
    #spawn worker thread
    dt = worker(queue, out_queue)
    dt.daemon = True
    dt.start()
    #time.sleep(5)
    # wait for queues
    queue.join()
    print('EXIT http')
    out_queue.join()
    print('EXIT worker')
start = time.time()
main()
print("Elapsed Time: %s" % (time.time() - start))

"Joining" a queue does not stop any threads; it only blocks until every item put on the queue has been marked with task_done(). If the http threads drain queue before the worker has put more items on it, queue.join() returns, and likewise out_queue.join() returns if the worker finishes its out_queue messages before the other threads add more. Either way the join thinks you are done while the daemon threads are still running. You may want to add a control message that tells the threads when their work is done so that they can exit, and call thread.join() on each of them instead. That means keeping a list of the threads created in the for loop instead of just abandoning them.

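One way to combine the control message with a "both pools idle at once" check is sketched below. It is only an illustration under assumptions: the names InFlight, stage, STOP and MAX_LEN are hypothetical, and the length cutoff exists only so the demo terminates. A single shared counter tracks every item that is queued or being processed in either pool, and each thread registers its follow-up item before retiring the current one, so the counter reaches zero only when both queues are empty and no thread is mid-task.

```python
import queue
import threading

STOP = object()   # hypothetical control message telling a thread to exit
MAX_LEN = 5       # hypothetical cutoff so the demo eventually runs out of work


class InFlight:
    """Counts items that sit in either queue or are currently being processed."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def add(self, n=1):
        with self._cond:
            self._count += n

    def done(self):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_idle(self):
        with self._cond:
            self._cond.wait_for(lambda: self._count == 0)


def stage(inbox, outbox, tag, in_flight, results):
    while True:
        row = inbox.get()
        if row is STOP:
            return
        row += tag
        if len(row) >= MAX_LEN:
            results.append(row)   # finished item, nothing to hand over
        else:
            in_flight.add()       # register the follow-up item first...
            outbox.put(row)
        in_flight.done()          # ...then retire the current one


def run(rows, http_threads=3):
    http_q, work_q = queue.Queue(), queue.Queue()
    in_flight, results = InFlight(), []
    threads = [
        threading.Thread(target=stage, args=(http_q, work_q, "H", in_flight, results))
        for _ in range(http_threads)
    ]
    threads.append(
        threading.Thread(target=stage, args=(work_q, http_q, "W", in_flight, results))
    )
    for t in threads:
        t.start()
    in_flight.add(len(rows))      # count the initial items before queuing them
    for row in rows:
        http_q.put(row)
    in_flight.wait_idle()         # returns only when both pools are idle together
    for _ in range(http_threads):
        http_q.put(STOP)
    work_q.put(STOP)
    for t in threads:             # the list of threads is kept so they can be joined
        t.join()
    return sorted(results)


print(run(["a", "b", "c"]))
```

Because the threads are joined rather than abandoned, they no longer need to be daemon threads.
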
Related

How to prevent multiple threads from picking up same task from queue

I want to run multiple threads in parallel. Each thread picks up a task from a task queue and executes that task.
from threading import Thread
from Queue import Queue
import time
class link(object):
    def __init__(self, i):
        self.name = str(i)
def run_jobs_in_parallel(consumer_func, jobs, results, thread_count,
                         async_run=False):
    def consume_from_queue(jobs, results):
        while not jobs.empty():
            job = jobs.get()
            try:
                results.append(consumer_func(job))
            except Exception as e:
                print str(e)
                results.append(False)
            finally:
                jobs.task_done()
    #start worker threads
    if jobs.qsize() < thread_count:
        thread_count = jobs.qsize()
    for tc in range(1,thread_count+1):
        worker = Thread(
            target=consume_from_queue,
            name="worker_{0}".format(str(tc)),
            args=(jobs,results,))
        worker.start()
    if not async_run:
        jobs.join()
def create_link(link):
    print str(link.name)
    time.sleep(10)
    return True
def consumer_func(link):
    return create_link(link)
# create_link takes a while to execute
jobs = Queue()
results = list()
for i in range(0,10):
    jobs.put(link(i))
run_jobs_in_parallel(consumer_func, jobs, results, 25, async_run=False)
What happens is this: say we have 10 link objects in the jobs queue; while the threads run in parallel, multiple threads execute the same task. How can I prevent this from happening?
Note: the sample code above does not have the problem described, but my real code is exactly the same except that the create_link method does some complex stuff.

I think what you need is a lock object (threading.Lock). If you create an instance of such an object, you can 'lock' parts of your code, ensuring that only one thread executes that part at a time.
I guess in your case you want to lock the line job = jobs.get().
First you have to create the lock in a scope where all threads have access to it. (You don't want a lock for every thread but a single lock shared by all your threads, so creating the lock inside a thread just before acquiring it won't work.)
import threading
lock = threading.Lock()
Then you can use it on your line like this:
lock.acquire()
job = jobs.get()
lock.release()
or
with lock:
    job = jobs.get()
The first thread to reach acquire() locks the lock. Other threads that try to acquire() it pause until it is unlocked again by a call to release().
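A related point worth noting: the standard library's Queue does its own locking, so two threads calling get() never receive the same item. The fragile part of the question's loop is rather the gap between jobs.empty() and the blocking jobs.get(), where another thread may take the last item. A minimal Python 3 sketch that avoids the gap with get_nowait() follows; drain and run_jobs are hypothetical names, and doubling each job is a stand-in for the real work.

```python
import queue
import threading


def drain(jobs, results):
    while True:
        try:
            job = jobs.get_nowait()   # never blocks; raises queue.Empty once drained
        except queue.Empty:
            return
        try:
            results.append(job * 2)   # stand-in for the real work
        finally:
            jobs.task_done()


def run_jobs(items, thread_count=4):
    jobs = queue.Queue()
    for item in items:                # fill the queue before any worker starts
        jobs.put(item)
    results = []
    workers = [
        threading.Thread(target=drain, args=(jobs, results))
        for _ in range(thread_count)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(sorted(run_jobs(range(10))))
```

Each job appears exactly once in the results, whatever the number of worker threads.
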

Single Producer Multiple Consumer

I wish to have a single-producer, multiple-consumer architecture in Python using multi-threaded programming. I wish to have an operation like this:
Producer produces the data
Consumers 1..N (N is pre-determined) wait for the data to arrive (block) and then process the SAME data in different ways.
So I need all the consumers to get the same data from the producer.
When I used Queue to do this, I realized that all but the first consumer would be starved with the implementation I have.
One possible solution is to have a unique queue for each consumer thread, with the producer pushing the same data into every queue. Is there a better way to do this?
from threading import Thread
import time
import random
from Queue import Queue
my_queue = Queue(0)
def Producer():
    global my_queue
    my_list = []
    for each in range (50):
        my_list.append(each)
    my_queue.put(my_list)
def Consumer1():
    print "Consumer1"
    global my_queue
    print my_queue.get()
    my_queue.task_done()
def Consumer2():
    print "Consumer2"
    global my_queue
    print my_queue.get()
    my_queue.task_done()
P = Thread(name = "Producer", target = Producer)
C1 = Thread(name = "Consumer1", target = Consumer1)
C2 = Thread(name = "Consumer2", target = Consumer2)
P.start()
C1.start()
C2.start()
In the example above, C2 gets blocked indefinitely because C1 consumes the data produced by P. What I would rather want is for C1 and C2 both to be able to access the SAME data as produced by P.
Thanks for any code/pointers!

Your producer creates only one job:
my_queue.put(my_list)
For example, put my_list twice, and both consumers work:
def Producer():
    global my_queue
    my_list = []
    for each in range (50):
        my_list.append(each)
    my_queue.put(my_list)
    my_queue.put(my_list)
This way you put two jobs on the queue with the same list.
However, I have to warn you: modifying the same data in different threads without thread synchronization is generally a bad idea.
In any case, the single-queue approach will not work for you, since one queue is meant to be processed by threads that all run the same algorithm.
So I advise you to go ahead with a unique queue per consumer, since other solutions are not as trivial.

How about a per-thread queue then?
As part of starting each consumer, you would also create another Queue and add it to a list of "all thread queues". Then start the producer, passing it the list of all queues, so that it can push the data into each of them.
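The per-thread queue idea can be sketched roughly as follows. The names fan_out, producer and consumer are hypothetical, and the two transforms (sum and max) only stand in for "process the same data in different ways". The data is passed as a tuple so that no consumer can modify what the others see.

```python
import queue
import threading


def producer(consumer_queues, data):
    for q in consumer_queues:
        q.put(data)   # every consumer receives a reference to the same object


def consumer(name, inbox, transform, results):
    data = inbox.get()                # blocks until the producer delivers
    results[name] = transform(data)   # each consumer writes only its own key


def fan_out(data, transforms):
    queues = {name: queue.Queue() for name in transforms}
    results = {}
    consumers = [
        threading.Thread(target=consumer, args=(name, queues[name], fn, results))
        for name, fn in transforms.items()
    ]
    for t in consumers:
        t.start()
    p = threading.Thread(target=producer, args=(list(queues.values()), data))
    p.start()
    p.join()
    for t in consumers:
        t.join()
    return results


print(fan_out(tuple(range(5)), {"total": sum, "largest": max}))
```

No consumer is starved, because each one blocks on its own queue rather than competing for a shared one.
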

A single-producer, five-consumer example, verified.
from multiprocessing import Process, JoinableQueue
import time
import os
q = JoinableQueue()
def producer():
    for item in range(30):
        time.sleep(2)
        q.put(item)
    pid = os.getpid()
    print(f'producer {pid} done')
def worker():
    while True:
        item = q.get()
        pid = os.getpid()
        print(f'pid {pid} Working on {item}')
        print(f'pid {pid} Finished {item}')
        q.task_done()
for i in range(5):
    # start() returns None, so there is nothing useful to keep here
    Process(target=worker, daemon=True).start()
producers = []
# it is easy to extend it to multi producers.
for i in range(1):
    p = Process(target=producer)
    producers.append(p)
    p.start()
# make sure producers done
for p in producers:
    p.join()
# block until all workers are done
q.join()
print('All work completed')
Explanation:
1. There is one producer and five consumers in this example.
2. JoinableQueue is used to make sure every element stored in the queue gets processed. task_done() lets a worker signal that an element is finished, and q.join() waits until all elements have been marked as done.
3. Because of #2, there is no need to join every worker.
4. It is important, however, to join the producer so that it has stored its elements in the queue. Otherwise, the program exits immediately.

I know it might be overkill, but what about using the signal/slot framework from Qt? For consistency, QThread can be used instead of threading.Thread.
from __future__ import annotations  # Needed for forward Consumer typehint in register_consumer
from queue import Queue
from typing import List
from PySide2.QtCore import QThread, QObject, QCoreApplication, Signal, Slot, Qt
import time
import random
def thread_name():
    # Convenience helper
    return QThread.currentThread().objectName()
class Producer(QThread):
    product_available = Signal(list)
    def __init__(self):
        QThread.__init__(self, objectName='ThreadProducer')
        self.consumers: List[Consumer] = list()
        # See Consumer class comments for info (exactly the same reason here)
        self.internal_consumer_queue = Queue()
        self.active = True
    def run(self):
        my_list = [each for each in range(5)]
        self.product_available.emit(my_list)
        print(f'Producer: from thread {QThread.currentThread().objectName()} I\'ve sent my products\n')
        while self.active:
            consumer: Consumer = self.internal_consumer_queue.get(block=True)
            print(f'Producer: {consumer} has told me it has completed his task with my product! '
                  f'(Thread {thread_name()})')
            if not consumer in self.consumers:
                raise ValueError(f'Consumer {consumer} was not registered')
            self.consumers.remove(consumer)
            if len(self.consumers) == 0:
                print('All consumers have completed their task! I\'m terminating myself')
                self.active = False
    @Slot(object)
    def on_task_done_by_consumer(self, consumer: Consumer):
        self.internal_consumer_queue.put(consumer)
    def register_consumer(self, consumer: Consumer):
        if consumer in self.consumers:
            return
        self.consumers.append(consumer)
        consumer.task_done_with_product.connect(self.on_task_done_by_consumer)
class Consumer(QThread):
    task_done_with_product = Signal(object)
    def __init__(self, name: str, producer: Producer):
        self.name = name
        # Super init and set Thread name
        QThread.__init__(self, objectName=f'Thread_Of_{self.name}')
        self.producer = producer
        # See method on_product_available doc
        self.internal_queue = Queue()
    def run(self) -> None:
        self.producer.product_available.connect(self.on_product_available, Qt.ConnectionType.UniqueConnection)
        # Thread loop waiting for product availability
        product = self.internal_queue.get(block=True)
        print(f'{self.name}: Product {product} received and elaborated in thread {thread_name()}\n\n')
        # Tell the producer I've done
        self.task_done_with_product.emit(self)
        # Now the thread is naturally closed
    @Slot(list)
    def on_product_available(self, product: list):
        """
        As a limitation of PySide, it seems that list are not supported for QueuedConnection. This work around using
        internal queue might solve
        """
        # This is executed in Main Loop!
        print(f'{self.name}: In thread {thread_name()} I received the product, and I\'m queuing it for being elaborated'
              f'in consumer thread')
        self.internal_queue.put(product)
        # Quit the thread
        self.active = False
    def __repr__(self):
        # Needed in case of exception for representing current consumer
        return f'{self.name}'
# Needed to executed main and threads event loops
app = QCoreApplication()
QThread.currentThread().setObjectName('MainThread')
producer = Producer()
c1 = Consumer('Consumer1', producer)
c1.start()
producer.register_consumer(c1)
c2 = Consumer('Consumer2', producer)
c2.start()
producer.register_consumer(c2)
producer.product_available.connect(c1.on_product_available)
producer.product_available.connect(c2.on_product_available)
# Start Producer thread for LAST!
producer.start()
app.exec_()
Results:
Producer: from thread ThreadProducer I've sent my products
Consumer1: In thread MainThread I received the product, and I'm queuing it for being elaboratedin consumer thread
Consumer1: Product [0, 1, 2, 3, 4] received and elaborated in thread Thread_Of_Consumer1
Consumer2: In thread MainThread I received the product, and I'm queuing it for being elaboratedin consumer thread
Consumer2: Product [0, 1, 2, 3, 4] received and elaborated in thread Thread_Of_Consumer2
Producer: Consumer1 has told me it has completed his task with my product! (Thread ThreadProducer)
Producer: Consumer2 has told me it has completed his task with my product! (Thread ThreadProducer)
All consumers have completed their task! I'm terminating myself
Notes:
The step-by-step explanation is in the code comments. If anything is unclear, I'll do my best to clarify.
Unfortunately I have not found a way to use QueuedConnection so that the Slot executes directly in the proper thread, so an internal queue is used to pass information from the main loop to the proper thread (for both Producer and Consumer). It seems that list and object cannot be meta-registered in PySide/PyQt for queueing purposes.

why queue is showing incorrect data?

I have used a queue for passing URLs to download; however, the queue contents appear corrupted when received in the thread:
class ThreadedFetch(threading.Thread):
    """ docstring for ThreadedFetch
    """
    def __init__(self, queue, out_queue):
        super(ThreadedFetch, self).__init__()
        self.queue = queue
        self.outQueue = out_queue
    def run(self):
        items = self.queue.get()
        print items
def main():
    for i in xrange(len(args.urls)):
        t = ThreadedFetch(queue, out_queue)
        t.daemon = True
        t.start()
    # populate queue with data
    for url, saveTo in urls_saveTo.iteritems():
        queue.put([url, saveTo, split])
    # wait on the queue until everything has been processed
    queue.join()
The output of run() when I execute main is:
['http://www.nasa.gov/images/content/607800main_kepler1200_1600-1200.jpg', ['http://broadcast.lds.org/churchmusic/MP3/1/2/nowords/271.mp3', None, 3None, 3]
]
while expected is
['http://www.nasa.gov/images/content/607800main_kepler1200_1600-1200.jpg', None, 3]
['http://broadcast.lds.org/churchmusic/MP3/1/2/nowords/271.mp3', None, 3]

All of the threads print their data at once and the results are interleaved. If you want threads to display data in production code, you need some way for them to cooperate when writing. One option is a global lock that all screen writers use, another is the logging module.
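As a rough illustration of the logging option (Python 3), the sketch below has several threads write through one shared handler. The handler serializes access internally, so each record comes out as a whole line. The names make_logger, report and log_from_threads are hypothetical, and the output goes to an in-memory stream only to keep the example self-contained.

```python
import io
import logging
import threading


def make_logger(stream):
    logger = logging.getLogger("fetch_demo")   # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()                    # avoid stacking handlers on reuse
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def report(logger, item):
    logger.info("%s", item)   # the handler's lock keeps each record in one piece


def log_from_threads(items):
    stream = io.StringIO()
    logger = make_logger(stream)
    threads = [
        threading.Thread(target=report, args=(logger, item), name=f"fetch-{i}")
        for i, item in enumerate(items)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stream.getvalue().splitlines()


for line in log_from_threads([["http://example.com/a.jpg", None, 3],
                              ["http://example.com/b.mp3", None, 3]]):
    print(line)
```

The order of the lines still depends on thread scheduling; only the integrity of each line is guaranteed.
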

Populate your queue before you start your threads. Add a lock for I/O (for the reason @tdelaney gives: the threads interleave their writes to stdout, so the results appear broken). And modify your run method to this:
lock = threading.Lock()
def run(self):
    while True:
        try:
            items = self.queue.get_nowait()
            with lock:
                print items
        except Queue.Empty:
            break
        except Exception as err:
            pass
        self.queue.task_done()
You might also find that it is easier to do this with concurrent.futures. Its documentation has a solid example of calling a function that returns a value from a thread pool.
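A minimal Python 3 sketch of the concurrent.futures route is shown below. The fetch function is a hypothetical placeholder that formats a string instead of downloading anything, and the job tuples mirror the [url, saveTo, split] lists from the question. All printing happens in the main thread after the results are collected, so nothing can interleave.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(job):
    url, save_to, split = job
    # Placeholder for the real download; returns a summary instead of printing.
    return f"{url} saved to {save_to} in {split} parts"


def fetch_all(jobs, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands the results back in the order of the inputs
        return list(pool.map(fetch, jobs))


jobs = [
    ("http://example.com/a.jpg", None, 3),
    ("http://example.com/b.mp3", None, 3),
]
for line in fetch_all(jobs):
    print(line)
```

The executor takes care of starting the threads, distributing the jobs and waiting for completion, which replaces the hand-written queue and join logic.
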

Python multithread using queue - the program gets blocked forever

I am not sure which part of my program is wrong. It will be blocked at the join() calls of two queues. However, if I removed the 2 join calls, the program does not work at all.
import threading
import Queue
queue = Queue.Queue()
out_queue = Queue.Queue()
fruits = ['apple', 'strawberry', 'banana', 'peach', 'rockmelon']
class WorkerThread(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue
    def run(self):
        print 'run'
        while not self.queue.empty():
            name = self.queue.get()
            self.out_queue.put(name)
            self.queue.task_done()
def main():
    print 'start'
    for i in xrange(5):
        t = WorkerThread(queue, out_queue)
        t.setDaemon(True)
        t.start()
    #populate the queue
    for fruit in fruits:
        queue.put(fruit)
    queue.join()
    out_queue.join()
    while not out_queue.empty():
        print out_queue.get()
    print 'end'
if __name__=='__main__':
    main()
Thanks in advance.

You're calling out_queue.join(), which waits until out_queue.task_done() has been called the same number of times out_queue.put() has been called. However, you're never calling out_queue.task_done(). This can best be fixed by never calling out_queue.join() in the first place.
EDIT: also, you are populating queue after you start your WorkerThreads. This means there's a chance that the worker threads will run and finish before you've had a chance to insert all your elements. Inserting them before starting the worker thread will fix this.
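Putting both fixes together, a Python 3 version of the program might look like the sketch below. The function names worker and copy_through are hypothetical, and get_nowait() replaces the empty()/get() pair so that a worker cannot block on an item that another worker has just taken.

```python
import queue
import threading


def worker(in_q, out_q):
    while True:
        try:
            name = in_q.get_nowait()   # raises queue.Empty once the input is drained
        except queue.Empty:
            return
        out_q.put(name)
        in_q.task_done()


def copy_through(items, thread_count=5):
    in_q, out_q = queue.Queue(), queue.Queue()
    for item in items:                 # populate before starting the workers
        in_q.put(item)
    threads = [
        threading.Thread(target=worker, args=(in_q, out_q))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    in_q.join()                        # every get() here has a matching task_done()
    for t in threads:
        t.join()
    collected = []
    while not out_q.empty():           # safe now: no other thread touches out_q
        collected.append(out_q.get())
    return collected                   # out_q.join() is never called


fruits = ['apple', 'strawberry', 'banana', 'peach', 'rockmelon']
for fruit in copy_through(fruits):
    print(fruit)
```
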

Signal the end of jobs on the Queue?

Here's example code from the Python documentation:
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()
q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
for item in source():
    q.put(item)
q.join() # block until all tasks are done
I modified it to fit my use case like this:
import threading
from Queue import Queue
max_threads = 10
q = Queue(maxsize=max_threads + 2)
def worker():
    while True:
        task = q.get(1)
        # do something with the task
        q.task_done()
for i in range(max_threads):
    t = threading.Thread(target=worker)
    t.start()
for task in ['a', 'b', 'c']:
    q.put(task)
q.join()
When I execute it, debugger says that all the jobs were executed, but q.join() seems to wait forever. How can I send a signal to the worker threads that I already sent all the tasks?

This process doesn't finish at .join() because the worker threads keep waiting for new queue data (the blocking .get()).
Here is a method that uses a simple flag, finishUp, to tell the workers to exit. We set it after .join() returns, meaning all tasks have been processed. I added a timeout to the q.get() call so that each worker can periodically check the finishUp flag.
import threading
import queue
max_threads = 5
q = queue.Queue(maxsize=max_threads + 2)
finishUp = False
def worker():
    while True:
        try:
            task = q.get(block=True, timeout=1)
            # do something with the task
            print ("processing task for:"+str(task))
            q.task_done()
        except queue.Empty: # raised when the queue stays empty for the whole timeout
            if finishUp:
                print ("thread finishing because processing is done")
                return
for i in range(max_threads):
    t = threading.Thread(target=worker)
    t.start()
for task in ['a', 'b', 'c']:
    q.put(task)
print ("waiting on join")
q.join()
finishUp = True # let the workers know that they can exit
print ("finished")
This produces the following output:
waiting on join
processing task for:a
processing task for:b
processing task for:c
finished
thread finishing because processing is done
thread finishing because processing is done
thread finishing because processing is done
thread finishing because processing is done
thread finishing because processing is done
Process finished with exit code 0

q.join() actually returns. You can test that by putting print('done') after the q.join() line.
....
q.join()
print('done')
Then why does the program not end?
Because, by default, threads are non-daemon threads, and the interpreter waits for them before exiting.
You can set thread as daemon thread using <thread_object>.daemon = True
for i in range(max_threads):
    t = threading.Thread(target=worker)
    t.daemon = True # <---
    t.start()
According to threading module documentation:
daemon
A boolean value indicating whether this thread is a daemon thread
(True) or not (False). This must be set before start() is called,
otherwise RuntimeError is raised. Its initial value is inherited from
the creating thread; the main thread is not a daemon thread and
therefore all threads created in the main thread default to daemon =
False.
The entire Python program exits when no alive non-daemon threads are
left.
New in version 2.6.

I defined a DONE object to signal the end of work:
DONE = object()
and literally put it into the queue when the upper level knows that no more data will come:
q.put_nowait(DONE)
In the worker thread, as soon as the object is received, the thread quits.
But in case other threads are listening on the very same queue, we have to put the object back on the queue:
item = q.get()
if item is DONE:
    q.put_nowait(DONE)
    return
cheers :)
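A complete, minimal version of this sentinel pattern could look like the following sketch. The names worker and process are hypothetical, upper-casing stands in for the real work, and the queue is unbounded so that putting the marker back can never block.

```python
import queue
import threading

DONE = object()   # unique marker; compared by identity, never by value


def worker(q, results):
    while True:
        item = q.get()
        if item is DONE:
            q.put(DONE)                # leave the marker for the next worker
            return
        results.append(item.upper())   # stand-in for the real work


def process(tasks, thread_count=3):
    q = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(q, results))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)
    q.put(DONE)                        # queued after all real tasks (FIFO order)
    for t in threads:
        t.join()                       # every worker has seen DONE and returned
    return results


print(sorted(process(['a', 'b', 'c'])))
```

Because the workers exit on their own, they can be joined directly and do not need to be daemon threads. One DONE marker is left in the queue at the end.
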
