I have a huge list of items and my program should analyze each of them. To speed this up I want to use threads, but I want to limit them to 5. So I need a pool of 5 threads where each thread, when it finishes a job, grabs a new one until the end of the list is reached.
But I don't have a clue how to do that. Should I use a queue? For now I am just starting 5 threads in the simplest way:

for thread_number in range(5):
    thread = Th(thread_number)
    thread.start()

Thank you!
Separate the idea of a worker thread from that of a task -- do not have one worker work on one task and then terminate the thread. Instead, spawn 5 threads and let them all get tasks from a common queue. Let each of them iterate until it receives a sentinel from the queue telling it to quit.
This is more efficient than continually spawning and terminating threads after they complete just one task.
import logging
import Queue  # on Python 3 this module is named `queue`
import threading

logger = logging.getLogger(__name__)

N = 100
sentinel = object()

def worker(jobs):
    name = threading.current_thread().name  # not used below; the log format already includes %(threadName)s
    for task in iter(jobs.get, sentinel):
        logger.info(task)
    logger.info('Done')

def main():
    logging.basicConfig(level=logging.DEBUG,
                        format='[%(asctime)s %(threadName)s] %(message)s',
                        datefmt='%H:%M:%S')
    jobs = Queue.Queue()

    # put tasks in the jobs Queue
    for task in range(N):
        jobs.put(task)

    threads = [threading.Thread(target=worker, args=(jobs,))
               for thread_number in range(5)]
    for t in threads:
        t.start()
        jobs.put(sentinel)   # one sentinel per worker, so each thread terminates

    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
It seems that you want a thread pool. If you're using Python 3, you're in luck: there is a ThreadPoolExecutor class.
Otherwise, this SO question lists various solutions, either handcrafted or using undocumented modules from the Python standard library.
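For the pool approach, a minimal sketch with ThreadPoolExecutor (analyze() and the items list are placeholders for the asker's own function and data):

from concurrent.futures import ThreadPoolExecutor

def analyze(item):
    # placeholder for the real per-item work
    return item * 2

items = range(100)

# at most 5 items are processed concurrently; the pool reuses its 5 threads
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(analyze, items))

print(results[:5])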
I have a list list_of_params and a function run_train() that receives an item from list_of_params (e.g. run_train(list_of_params[0])). I can dispatch run_train() to multiple GPUs at a time, so I'd like to know if there is any implementation of a single queue that can be parallelized.
If this is not clear enough, imagine the following scenario:
"A supermarket which has a single customer queue, but 5 cashiers. Once a cashier is idle, it process the products of the next customer in the queue. This is the contrary of each cashier having its own line."
I can provide more details if needed.
Thank you!
Try the queue module. It provides 'thread-safe multi-producer, multi-consumer queues'.
Here's an example of using a queue with a few producer and consumer processes:
from multiprocessing import Process, Queue, Event
# if you use threading instead of multiprocessing, use queue.Queue rather than multiprocessing.Queue
from queue import Empty
from time import sleep
from random import random

def producer(stopflag, out_queue):
    while True:
        if stopflag.is_set():
            break
        sleep(random())  # sleep anywhere from 0-1 sec, avg 0.5 (a producer produces at most ~2 outputs/sec on average)
        out_queue.put(random())  # may wait if the queue is currently full, thereby reducing the production rate

def consumer(stopflag, in_queue):
    while True:
        if stopflag.is_set():
            break
        try:
            x = in_queue.get(timeout=1)  # time out so the loop can re-check the stop flag instead of blocking forever
        except Empty:
            continue
        sleep(x * 2)  # consumers work twice as slowly as producers, averaging 1 item per sec

def main():
    stopflag = Event()
    shared_q = Queue(maxsize=100)  # if the queue fills up, it forces the producers to wait
                                   # for the consumers to catch up; otherwise it may grow
                                   # indefinitely (until the computer runs out of memory)

    # 2 producers
    producers = [Process(target=producer, args=(stopflag, shared_q)) for _ in range(2)]
    # 4 consumers
    consumers = [Process(target=consumer, args=(stopflag, shared_q)) for _ in range(4)]

    for process in producers + consumers:
        process.start()

    sleep(20)       # let them work a while
    stopflag.set()  # tell them to stop

    for process in producers + consumers:  # wait for them to stop
        process.join()

    # empty unfinished tasks from the queue so its background thread can exit normally
    # (it's good practice to clean up resources, much like closing files when done with them)
    try:
        while True:
            shared_q.get_nowait()
    except Empty:
        pass

if __name__ == '__main__':
    main()
I am currently using worker threads in Python to get tasks from a Queue and execute them, as follows:
from queue import Queue
from threading import Thread
def run_job(item):
    # runs an independent job...
    pass

def workingThread():
    while True:
        item = q.get()
        run_job(item)
        q.task_done()

q = Queue()
num_worker_threads = 2
for i in range(num_worker_threads):
    t = Thread(target=workingThread)
    t.daemon = True
    t.start()

for item in listOfJobs:
    q.put(item)
q.join()
This is functional, but there is an issue: some of the jobs executed by the run_job function are very memory-demanding and can only be run individually. Given that I can identify these at runtime, how could I make the parallel worker threads halt their execution until such a job is taken care of?
Edit: This has been flagged as a possible duplicate of Python - Thread that I can pause and resume, and I had looked at that question before asking; it is certainly a reference worth citing. However, I don't think it addresses this situation specifically, as it does not consider the jobs being inside a Queue, nor how to refer specifically to the other threads that have to be halted.
I would pause/resume the threads so that the memory-demanding job runs on its own. The following question, Python - Thread that I can pause and resume, shows how to do that.
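As a minimal sketch of how that could look with the queue-based workers from the question (is_memory_demanding() is a hypothetical predicate standing in for however you identify heavy jobs at runtime; workers already in the middle of a job are not interrupted, they simply won't start a new one while the flag is cleared):

import threading
from queue import Queue

q = Queue()
resume = threading.Event()  # set = workers may take new jobs; cleared = pool is paused
resume.set()

def is_memory_demanding(item):
    return False  # hypothetical: replace with your runtime check

def run_job(item):
    pass  # the real work

def workingThread():
    while True:
        resume.wait()        # block here while another worker runs a heavy job
        item = q.get()
        if is_memory_demanding(item):
            resume.clear()   # stop the other workers from taking new jobs
            try:
                run_job(item)
            finally:
                resume.set() # let everyone continue
        else:
            run_job(item)
        q.task_done()

For strict exclusivity (no other job running at all while the heavy one executes) you would need something stronger, such as a reader-writer lock around run_job().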
I am learning about threads in Python and am trying to make a simple program that uses threads to grab a number off a Queue and print it.
I have the following code:
import threading
from Queue import Queue

test_lock = threading.Lock()
tests = Queue()

def start_thread():
    while not tests.empty():
        with test_lock:
            if tests.empty():
                return
            test = tests.get()
        print("{}".format(test))

for i in range(10):
    tests.put(i)

threads = []
for i in range(5):
    threads.append(threading.Thread(target=start_thread))
    threads[i].daemon = True

for thread in threads:
    thread.start()

tests.join()
When run, it just prints the values and never exits.
How do I make the program exit when the Queue is empty?
From the docstring of Queue.join():
Blocks until all items in the Queue have been gotten and processed.
The count of unfinished tasks goes up whenever an item is added to the
queue. The count goes down whenever a consumer thread calls task_done()
to indicate the item was retrieved and all work on it is complete.
When the count of unfinished tasks drops to zero, join() unblocks.
So you must call tests.task_done() after processing the item.
Since your threads are daemon threads, and the queue will handle concurrent access correctly, you don't need to check if the queue is empty or use a lock. You can just do:
def start_thread():
    while True:
        test = tests.get()
        print("{}".format(test))
        tests.task_done()
My problem is the following: I have a multiprocessing.pool.ThreadPool object with worker_count workers and a main pqueue from which I feed tasks to the pool.
The flow is as follows: there is a main loop that gets an item of level level from pqueue and submits it to the pool using apply_async. When the item is processed, it generates items of level + 1. The problem is that the pool accepts all tasks and processes them in the order they were submitted.
More precisely, what is happening is that the level 0 items are processed, and each generates 100 level 1 items that are retrieved immediately from pqueue and added to the pool; each level 1 item produces 100 level 2 items that are submitted to the pool, and so on, so the items are processed in a BFS manner.
I need to tell the pool not to accept more than worker_count tasks at a time, to give higher-level items a chance to be retrieved from pqueue, so that items are processed in a DFS manner.
The current solution I came up with is: for each submitted task, save the AsyncResult object in an asyncres_list list; before retrieving items from pqueue, remove the results that have completed (if any) and, every 0.5 seconds, check whether the length of asyncres_list is lower than the number of threads in the pool. That way only worker_count items are processed at the same time.
I am wondering if there is a cleaner way to achieve this behaviour; I can't seem to find any parameter in the documentation that limits the maximum number of tasks that can be submitted to a pool.
ThreadPool is a simple tool for a common task. If you want to manage the queue yourself to get DFS behavior, you can implement the necessary functionality on top of the threading and queue modules directly.
To prevent scheduling the next root task until all tasks spawned by the current task are done ("DFS"-like order), you could use Queue.join():
#!/usr/bin/env python3
import queue
import random
import threading
import time

def worker(q, multiplicity=5, maxlevel=3, lock=threading.Lock()):
    for task in iter(q.get, None):  # blocking get until None is received
        try:
            if len(task) < maxlevel:
                for i in range(multiplicity):
                    q.put(task + str(i))  # schedule the next level
            time.sleep(random.random())  # emulate some work
            with lock:
                print(task)
        finally:
            q.task_done()

worker_count = 2
q = queue.LifoQueue()
threads = [threading.Thread(target=worker, args=[q], daemon=True)
           for _ in range(worker_count)]
for t in threads:
    t.start()

for task in "01234":  # populate the first level
    q.put(task)
q.join()              # block until all spawned tasks are done

for _ in threads:     # signal workers to quit
    q.put(None)
for t in threads:     # wait until workers exit
    t.join()
The code example is derived from the example in the queue module documentation.
The task at each level spawns multiplicity direct child tasks that spawn their own subtasks until maxlevel is reached.
None is used to signal the workers that they should quit. t.join() is used to wait until the threads exit gracefully. If the main thread is interrupted for any reason, the daemon threads are killed unless there are other non-daemon threads (you might want to provide a SIGINT handler to signal the workers to exit gracefully on Ctrl+C instead of just dying).
queue.LifoQueue() is used to get "Last In, First Out" order (it is only approximate because multiple threads are consuming from the queue).
maxsize is not set, because otherwise the workers may deadlock -- you have to put the spawned tasks somewhere anyway. Only worker_count background threads are running regardless of how many tasks are in the queue.
I have a Python program that spawns a number of threads. These threads last anywhere from 2 to 30 seconds. In the main thread I want to track when each thread completes and print a message. If I just .join() all threads sequentially and the first thread lasts 30 seconds while the others complete much sooner, I can't print the messages any earlier -- they are all printed after 30 seconds.
Basically I want to block until any thread completes. As soon as a thread completes, print a message about it and go back to blocking if any other threads are still alive. If all threads are done then exit program.
One way I could think of is to have a queue that is passed to all the threads and block on queue.get(). Whenever a message is received from the queue, print it, check if any other threads are alive using threading.active_count() and if so, go back to blocking on queue.get(). This would work but here all the threads need to follow the discipline of sending a message to the queue before terminating.
I wonder if this is the conventional way of achieving this behavior, or whether there are other/better ways.
Here's a variation on @detly's answer that lets you specify the messages from your main thread, instead of printing them from your target functions. It creates a wrapper function which calls your target and then prints a message before terminating. You could modify this to perform any kind of standard cleanup after each thread completes.
#!/usr/bin/python
import threading
import time

def target1():
    time.sleep(0.1)
    print "target1 running"
    time.sleep(4)

def target2():
    time.sleep(0.1)
    print "target2 running"
    time.sleep(2)

def launch_thread_with_message(target, message, args=[], kwargs={}):
    def target_with_msg(*args, **kwargs):
        target(*args, **kwargs)
        print message
    thread = threading.Thread(target=target_with_msg, args=args, kwargs=kwargs)
    thread.start()
    return thread

if __name__ == '__main__':
    thread1 = launch_thread_with_message(target1, "finished target1")
    thread2 = launch_thread_with_message(target2, "finished target2")
    print "main: launched all threads"
    thread1.join()
    thread2.join()
    print "main: finished all threads"
The thread needs to be checked using the Thread.is_alive() call.
Why not just have the threads themselves print a completion message, or call some other completion callback when done?
You can then just join these threads from your main program, so you'll see a bunch of completion messages and your program will terminate when they're all done, as required.
Here's a quick and simple demonstration:
#!/usr/bin/python
import threading
import time

def really_simple_callback(message):
    """
    This is a really simple callback. `sys.stdout` already has a lock built-in,
    so this is fine to do.
    """
    print message

def threaded_target(sleeptime, callback):
    """
    Target for the threads: sleep and call back with completion message.
    """
    time.sleep(sleeptime)
    callback("%s completed!" % threading.current_thread())

if __name__ == '__main__':
    # Keep track of the threads we create
    threads = []

    # callback_when_done is effectively a function
    callback_when_done = really_simple_callback

    for idx in xrange(0, 10):
        threads.append(
            threading.Thread(
                target=threaded_target,
                name="Thread #%d" % idx,
                args=(10 - idx, callback_when_done)
            )
        )

    [t.start() for t in threads]
    [t.join() for t in threads]

    # Note that thread #0 runs the longest, so we'll see its message last!
What I would suggest is a loop like this:

while len(threadSet) > 0:
    time.sleep(1)
    for thread in threadSet.copy():  # iterate over a copy so we can remove from the set safely
        if not thread.isAlive():
            print "Thread " + thread.getName() + " terminated"
            threadSet.remove(thread)
There is a 1 second sleep, so there will be a slight delay between the thread termination and the message being printed. If you can live with this delay, then I think this is a simpler solution than the one you proposed in your question.
You can let the threads push their results into a queue.Queue. Have another thread (or the main thread) wait on this queue and print a message as soon as a new item appears.
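A minimal sketch of that idea (the worker body and timings are placeholders; here the main thread does the waiting and printing):

import threading
import queue
import time

done_q = queue.Queue()

def worker(name, seconds):
    time.sleep(seconds)  # stand-in for the real work
    done_q.put(name)     # announce completion just before returning

threads = [threading.Thread(target=worker, args=("thread-%d" % i, 3 - i))
           for i in range(3)]
for t in threads:
    t.start()

for _ in threads:        # one message per thread, printed in completion order
    print("%s finished" % done_q.get())

for t in threads:
    t.join()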
I'm not sure I see the problem with using:
threading.activeCount()
to track the number of threads that are still active?
Even if you don't know how many threads you're going to launch before starting, it seems pretty easy to track. I usually generate thread collections via a list comprehension; then a simple comparison of activeCount() against the list size tells you how many have finished.
See here: http://docs.python.org/library/threading.html
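A minimal sketch of that comparison, assuming the main thread is the only other thread running (otherwise the baseline would need adjusting):

import threading
import time

def target(seconds):
    time.sleep(seconds)  # stand-in for real work

threads = [threading.Thread(target=target, args=(i + 1,)) for i in range(5)]
for t in threads:
    t.start()

# active_count() includes the main thread, so subtract 1 to count only our workers
while threading.active_count() - 1 > 0:
    finished = len(threads) - (threading.active_count() - 1)
    print("{} of {} threads finished".format(finished, len(threads)))
    time.sleep(1)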
Alternatively, once you have your thread objects, you can just use their .isAlive() method to check.
I just checked by throwing this into a multithreaded program I have and it looks fine:

for thread in threadlist:
    print(thread.isAlive())

This gives me a list of True/False values as the threads start and finish. So you should be able to do the same and check for any False value to see whether a thread has finished.
I use a slightly different technique because of the nature of the threads I used in my application. To illustrate, this is a fragment of a test-strap program I wrote to scaffold a barrier class for my threading class:
while threads:
    finished = set(threads) - set(threading.enumerate())
    while finished:
        ttt = finished.pop()
        threads.remove(ttt)
    time.sleep(0.5)
Why do I do it this way? In my production code, I have a time limit, so the first line actually reads "while threads and time.time() < cutoff_time". If I reach the cut-off, I then have code to tell the threads to shut down.
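For illustration, a minimal sketch of that time-limited variant, assuming the workers periodically check a stop event (the event, the worker body, and the 30-second limit are made up for the example):

import threading
import time

stop_event = threading.Event()   # workers are assumed to check this periodically

def worker(seconds):
    end = time.time() + seconds
    while time.time() < end and not stop_event.is_set():
        time.sleep(0.1)          # stand-in for a unit of real work

threads = [threading.Thread(target=worker, args=(i * 2,)) for i in range(1, 5)]
for t in threads:
    t.start()

cutoff_time = time.time() + 30   # hypothetical time limit
while threads and time.time() < cutoff_time:
    finished = set(threads) - set(threading.enumerate())
    while finished:
        ttt = finished.pop()
        print("%s terminated" % ttt.name)
        threads.remove(ttt)
    time.sleep(0.5)

if threads:                      # hit the cutoff: ask the remaining workers to shut down
    stop_event.set()
    for t in threads:
        t.join()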