I am learning about threading in Python and am trying to make a simple program, one that uses threads to grab a number off the Queue and print it. I have the following code:
import threading
from Queue import Queue

test_lock = threading.Lock()
tests = Queue()

def start_thread():
    while not tests.empty():
        with test_lock:
            if tests.empty():
                return
            test = tests.get()
        print("{}".format(test))

for i in range(10):
    tests.put(i)

threads = []
for i in range(5):
    threads.append(threading.Thread(target=start_thread))
    threads[i].daemon = True

for thread in threads:
    thread.start()

tests.join()
When run, it prints the values but never exits. How do I make the program exit when the Queue is empty?
From the docstring of Queue.join():
Blocks until all items in the Queue have been gotten and processed.
The count of unfinished tasks goes up whenever an item is added to the
queue. The count goes down whenever a consumer thread calls task_done()
to indicate the item was retrieved and all work on it is complete.
When the count of unfinished tasks drops to zero, join() unblocks.
So you must call tests.task_done() after processing the item.
Since your threads are daemon threads, and the queue will handle concurrent access correctly, you don't need to check if the queue is empty or use a lock. You can just do:
def start_thread():
    while True:
        test = tests.get()
        print("{}".format(test))
        tests.task_done()
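Putting it together, a minimal runnable version of the whole fixed program (written for Python 3, where the module is queue rather than Queue):

import threading
from queue import Queue

tests = Queue()

def start_thread():
    while True:
        test = tests.get()
        print("{}".format(test))
        tests.task_done()

for i in range(10):
    tests.put(i)

for _ in range(5):
    t = threading.Thread(target=start_thread)
    t.daemon = True  # daemon threads die with the main thread
    t.start()

tests.join()  # returns once task_done() has been called for every item

Because the workers are daemons, the program exits as soon as tests.join() returns, even though the threads are still blocked inside tests.get().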
Related
I have set up a thread pool executor with 4 threads and added 2 items to my queue to be processed. When I submit the tasks and retrieve the futures, the 2 threads that are not processing items from the queue keep running and hang, even though they have nothing to process!
import time
import queue
import concurrent.futures

def _read_queue(queue):
    msg = queue.get()
    time.sleep(2)
    queue.task_done()

n_threads = 4
q = queue.Queue()
q.put('test')
q.put("test2")

with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:
    futures = []
    for _ in range(n_threads):
        future = pool.submit(_read_queue, q)
        print(future.running())
    print("Why am I running forever?")
How can I adjust my code so that threads that are not processing anything from the queue are shut down, so my program can terminate?
It happens because the queue.get() operation blocks your ThreadPoolExecutor threads.
for _ in range(n_threads):
    future = pool.submit(_read_queue, q)
    print(future.running())
Let's examine future = pool.submit(_read_queue, q) in each iteration of the for loop.

In the first iteration, pool.submit(_read_queue, q) puts a job on the ThreadPoolExecutor's internal queue (its name is self._work_queue). When a job is put on that internal queue, the submit method creates a thread, thread1 (I say thread1, thread2, ... for easier understanding). This thread executes the _read_queue function (this can happen immediately, or after the fourth iteration of the for loop; the ordering depends on the operating system scheduler), and queue.get() returns "test". Then the thread sleeps for 2 seconds.

In the second iteration, pool.submit(_read_queue, q) puts another job on the internal queue, and the submit method checks whether any thread is waiting for a job. No, there is none; the first thread is sleeping (for 2 seconds). So the submit method does the following steps:
if "there is a thread which will accept a job immediately":  # Step 1
    return
# Step 2
if number_of_created_threads < self._max_workers:  # now 1 < 4
    threading.Thread(...)  # create a new thread
The submit method therefore creates a new thread, thread2, which executes the _read_queue function; queue.get() returns "test2", and the thread sleeps for 2 seconds. At this point the q queue object is empty, so any subsequent get() call will block the calling thread.

In the third iteration, the submit method puts a job on the internal queue and again checks whether any thread is waiting for a job. There is none; the first and second threads are both sleeping (for 2 seconds), so the submit method creates a new thread, thread3 (it goes through both Step 1 and Step 2), which executes the _read_queue function like the others. When thread3 runs, its queue.get() call blocks, because q is empty, and calling get(block=True) on an empty queue blocks the calling thread.

The fourth iteration plays out the same as the third, and thread4 also blocks on the queue.get() call.

Assuming 2 seconds have not yet passed, there are now 5 live threads (sleeping or not). After the 2 seconds pass, thread1 and thread2 terminate*1 (because time.sleep(2) returns), but thread3 and thread4 do not, because queue.get() is blocking them. That is why your main thread (the whole program) waits for them and does not terminate.
What can we do in this situation ?
We can put two more elements into the q object: q.get() blocks your thread by acquiring a lock, and that lock is only released (via its release() method) when something calls queue.put().

Here is one of the solutions:
import time
import queue
from concurrent import futures

def _read_queue(queue):
    msg = queue.get()
    time.sleep(2)
    queue.put(None)  # wake up one of the threads blocked in get()

n_threads = 4
q = queue.Queue()
q.put('test')
q.put("test2")

with futures.ThreadPoolExecutor(max_workers=n_threads) as pool:
    futures = []
    for _ in range(n_threads):
        futures.append(pool.submit(_read_queue, q))
*1: I said the ThreadPoolExecutor threads terminate after the function finishes, but this depends on the shutdown() method being called. If we don't call shutdown() on the pool object, a thread will not terminate even after its function finishes, because creating and destroying a thread is costly; that is the whole point of the thread pool concept. (shutdown() is called at the end of the with statement.)
If I'm wrong somewhere please correct me.
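If you would rather not feed sentinel values back into the queue, another option along the same lines is to give get() a timeout, so that a worker finding the queue empty gives up instead of blocking forever. A minimal sketch of that variation (assuming a 1-second timeout is acceptable for your workload):

import time
import queue
from concurrent import futures

def _read_queue(q):
    try:
        msg = q.get(timeout=1)  # give up instead of blocking forever
    except queue.Empty:
        return  # nothing to do; let this worker finish
    time.sleep(2)
    q.task_done()

q = queue.Queue()
q.put('test')
q.put("test2")

with futures.ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(_read_queue, q)

Here two workers process the items and the other two time out after a second, so the with statement's implicit shutdown() can complete and the program exits.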
I have two threads in a producer consumer pattern. When the consumer receives data, it calls a time-consuming function expensive() and then enters a for loop. But if new data arrives while the consumer is working, it should abort the current work (exit the loop) and start on the new data.

I tried something like this with a queue.Queue:
q = queue.Queue()

def producer():
    while True:
        ...
        q.put(d)

def consumer():
    while True:
        d = q.get()
        expensive(d)
        for i in range(10000):
            ...
            if not q.empty():
                break
The problem with this code is that if the producer puts data too fast and the queue accumulates many items, the consumer performs the expensive(d) call plus one loop iteration and then aborts, for every single item, which is time-consuming. The code works, but it is not optimized.

Without modifying the code in expensive, one solution could be to run it as a separate process, which gives you the ability to terminate it prematurely. Since there's no mention of how long expensive runs, this may or may not be more time-efficient, however.
import multiprocessing as mp
import queue

q = queue.Queue()

def producer():
    while True:
        ...
        q.put(d)

def consumer():
    while True:
        d = q.get()
        # Run expensive() in a separate process so it can be killed early.
        exp = mp.Process(target=expensive, args=(d,))
        exp.start()
        for i in range(10000):
            ...
            if not q.empty():
                exp.terminate()  # or exp.kill()
                break
Well, one way is to use a queue design that keeps internal lists of waiting and working threads. You can then create several consumer threads to wait on the queue and, when work arrives, set a known consumer thread to do the work. When the thread has finished, it calls into the queue to remove itself from the working list and add itself to the waiting list.

The consumer threads each have an 'abort' atomic that can signal the thread to finish early. There will be some latency while the thread finishes its inner loop, but that will not matter much.

If new work arrives at the queue from the producer and the working list is not empty, the 'abort' flag of the working thread(s) can be set and their priority dropped to the minimum possible. The new work can then be dispatched to one of the waiting threads from the pool, setting it to work.

The waiting threads each need a 'start' function that signals an event/semaphore/condvar that the waiting thread, well, waits on. That allows the producer that supplied the work to set that specific thread running, rather than the 'usual' practice where any thread from a pool may pick up work.

Such a design allows new work to be started 'immediately', makes the previous work thread irrelevant by de-prioritizing it, and avoids the overhead of thread/process termination.
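A minimal sketch of just the abort-flag part of this design, using a threading.Event as the 'abort' atomic (the waiting/working bookkeeping and the priority changes are left out, expensive() stands in for the function from the question, and at most one consumer is assumed busy at a time):

import threading
from queue import Queue

q = Queue()

def expensive(d):
    ...  # the time-consuming call from the question

class Consumer(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.abort = threading.Event()  # set by the producer to cancel work

    def run(self):
        while True:
            d = q.get()
            self.abort.clear()
            expensive(d)
            for i in range(10000):
                if self.abort.is_set():  # new work arrived; stop early
                    break
                ...  # one inner-loop step

The producer would set consumer.abort on the busy thread before dispatching the new item to an idle one.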
I am currently using worker threads in Python to get tasks from a Queue and execute them, as follows:
from queue import Queue
from threading import Thread

def run_job(item):
    # runs an independent job...
    pass

def workingThread():
    while True:
        item = q.get()
        run_job(item)
        q.task_done()

q = Queue()
num_worker_threads = 2
for i in range(num_worker_threads):
    t = Thread(target=workingThread)
    t.daemon = True
    t.start()

for item in listOfJobs:
    q.put(item)

q.join()
This is functional, but there is an issue: some of the jobs executed by the run_job function are very memory-demanding and can only be run individually. Given that I can identify these at runtime, how could I make the parallel worker threads halt their execution until said job is taken care of?

Edit: It has been flagged as a possible duplicate of Python - Thread that I can pause and resume. I referred to that question before asking, and it surely is a reference that has to be cited. However, I don't think it addresses this situation specifically, as it does not consider the jobs being inside a Queue, nor how to specifically point to the other objects that have to be halted.
I would pause/resume the threads so that the memory-demanding jobs run individually. The following thread, Python - Thread that I can pause and resume, indicates how to do that.
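Applied to the worker loop from the question, a minimal sketch using a threading.Event (is_memory_heavy() is a hypothetical predicate for identifying the demanding jobs; this only pauses workers between jobs, not mid-job, and assumes at most one heavy job runs at a time):

import threading
from queue import Queue

q = Queue()
resume = threading.Event()  # set means workers may run
resume.set()

def workingThread():
    while True:
        item = q.get()
        if is_memory_heavy(item):  # hypothetical predicate
            resume.clear()         # pause the other workers
            run_job(item)          # run the heavy job alone
            resume.set()           # let the others continue
        else:
            resume.wait()          # block while a heavy job runs
            run_job(item)
        q.task_done()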
In Python's multiprocessing module there are two kinds of queues:
Queue
JoinableQueue.
What is the difference between them?
Queue
from multiprocessing import Queue
q = Queue()
q.put(item) # Put an item on the queue
item = q.get() # Get an item from the queue
JoinableQueue
from multiprocessing import JoinableQueue
q = JoinableQueue()
q.task_done() # Signal task completion
q.join() # Wait for completion
JoinableQueue has the methods join() and task_done(), which Queue lacks.
class multiprocessing.Queue([maxsize])
Returns a process shared queue implemented using a pipe and a few locks/semaphores. When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe.
The usual Queue.Empty and Queue.Full exceptions from the standard library’s Queue module are raised to signal timeouts.
Queue implements all the methods of Queue.Queue except for task_done() and join().
class multiprocessing.JoinableQueue([maxsize])
JoinableQueue, a Queue subclass, is a queue which additionally has task_done() and join() methods.
task_done()
Indicate that a formerly enqueued task is complete. Used by queue consumer threads. For each get() used to fetch a task, a subsequent call to task_done() tells the queue that the processing on the task is complete.
If a join() is currently blocking, it will resume when all items have been processed (meaning that a task_done() call was received for every item that had been put() into the queue).
Raises a ValueError if called more times than there were items placed in the queue.
join()
Block until all items in the queue have been gotten and processed.
The count of unfinished tasks goes up whenever an item is added to the queue. The count goes down whenever a consumer thread calls task_done() to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, join() unblocks.
If you use JoinableQueue then you must call JoinableQueue.task_done() for each task removed from the queue or else the semaphore used to count the number of unfinished tasks may eventually overflow, raising an exception.
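To see the difference in action, here is a minimal demonstration: q.join() only returns because the worker calls task_done() for every item, which a plain multiprocessing.Queue could not express (the __main__ guard matters on platforms that spawn processes):

from multiprocessing import JoinableQueue, Process

def worker(q):
    while True:
        item = q.get()
        print(item)
        q.task_done()  # without this, q.join() below never returns

if __name__ == '__main__':
    q = JoinableQueue()
    Process(target=worker, args=(q,), daemon=True).start()
    for i in range(3):
        q.put(i)
    q.join()  # returns once task_done() has been called for all three items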
Based on the documentation, it's hard to be sure that Queue is actually empty. With JoinableQueue you can wait for the queue to empty by calling q.join(). In cases where you want to complete work in distinct batches where you do something discrete at the end of each batch, this could be helpful.
For example, perhaps you process 1000 items at a time through the queue, then send a push notification to a user that you've completed another batch. This would be challenging to implement with a normal Queue.
It might look something like:
import multiprocessing as mp

BATCH_SIZE = 1000
STOP_VALUE = 'STOP'

def consume(q):
    for item in iter(q.get, STOP_VALUE):
        try:
            process(item)
        # Be very defensive about errors since they can corrupt pipes.
        except Exception as e:
            logger.error(e)
        finally:
            q.task_done()

q = mp.JoinableQueue()
with mp.Pool() as pool:
    # Pull items off queue as fast as we can whenever they're ready.
    for _ in range(mp.cpu_count()):
        pool.apply_async(consume, (q,))
    for i in range(0, len(URLS), BATCH_SIZE):
        # Put `BATCH_SIZE` items in queue asynchronously; the callback
        # receives the whole batch of results as a list.
        pool.map_async(expensive_func, URLS[i:i+BATCH_SIZE],
                       callback=lambda batch: [q.put(x) for x in batch])
        # Wait for the queue to empty.
        q.join()
        notify_users()
    # Stop the consumers so we can exit cleanly.
    for _ in range(mp.cpu_count()):
        q.put(STOP_VALUE)
NB: I haven't actually run this code. If you pull items off the queue faster than you put them on, you might finish early. In that case this code sends an update AT LEAST every 1000 items, and maybe more often. For progress updates, that's probably OK. If it's important that it be exactly 1000, you could use an mp.Value('i', 0) and check that it's 1000 whenever your join releases.
Python's Queue has a join() method that will block until task_done() has been called on all the items that have been taken from the queue.
Is there a way to periodically check for this condition, or receive an event when it happens, so that you can continue to do other things in the meantime? You can, of course, check if the queue is empty, but that doesn't tell you if the count of unfinished tasks is actually zero.
The Python Queue itself does not support this, so you could try the following:
from threading import Thread

class QueueChecker(Thread):
    def __init__(self, q):
        Thread.__init__(self)
        self.q = q

    def run(self):
        self.q.join()

q_manager_thread = QueueChecker(my_q)
q_manager_thread.start()

while q_manager_thread.is_alive():
    # Do other things here. When the loop exits, the tasks are done,
    # because the checker thread will have returned from blocking on
    # q.join() and exited its run method.
    pass

q_manager_thread.join()  # clean up the thread
A while loop on thread.is_alive() might not be exactly what you want, but at least it shows how to check on the status of the q.join() asynchronously.
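If you want an actual event rather than polling is_alive(), the same trick works with a threading.Event that the checker sets once join() returns; a minimal sketch (my_q as above):

import threading

def notify_when_done(q, done_event):
    q.join()          # blocks until task_done() has been called for every item
    done_event.set()  # then signal anyone waiting on the event

done = threading.Event()
threading.Thread(target=notify_when_done, args=(my_q, done), daemon=True).start()

while not done.wait(timeout=0.1):  # returns True once all tasks are done
    pass  # do other things in the meantime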