I'm trying to write a Python program that works with threads, using concurrent.futures to handle them. In the program, one thread should create a PDF file. Because this task takes very long, I want to hand the creation off to a thread. But after some time I want to work with the PDF file again, so I have to be sure that the previous thread has finished.
My question is: how can I check whether my concurrent.futures thread is finished, or how can I wait for its execution to complete?
Code
if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor() as executor:
        document = executor.submit(createPDF, pdfValues)
        # working on something different ...

    # Here I want to work with the PDF file again, via a new thread.
    # Therefore I want to make sure that the thread above is finished.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        result = executor.submit(workWithPDF, values)
Your document variable is of type concurrent.futures.Future, which has a done() method that returns True if the call has completed, and a result() method that returns the call's result, blocking until it is ready (if it isn't yet).
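For illustration, here is a minimal runnable sketch of that pattern; the createPDF and workWithPDF bodies are stand-ins for the question's placeholders:

import concurrent.futures

def createPDF(values):   # stand-in for the real, slow PDF creation
    return "pdf(%s)" % values

def workWithPDF(doc):    # stand-in for the follow-up work
    return "processed " + doc

with concurrent.futures.ThreadPoolExecutor() as executor:
    document = executor.submit(createPDF, "pdfValues")
    # ... work on something different ...
    print(document.done())    # True once createPDF has finished
    pdf = document.result()   # blocks until the PDF is ready
    result = executor.submit(workWithPDF, pdf)
    print(result.result())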
Related
Maybe it's a very simple question, but I'm new to concurrency. I want to write a Python script that runs foo.py 10 times simultaneously, with a time limit of 60 seconds before automatically aborting. The script is a non-deterministic algorithm, so every execution takes a different time and one will finish before the others. Once the first one ends, I would like to save its execution time and the algorithm's output, and then kill the rest of the processes.
I have seen the question "run multiple instances of python script simultaneously" and it looks very similar, but how can I add the time limit, and kill the rest of the processes when the first one finishes?
Thank you in advance.
I'd suggest using the threading lib, because with it you can make threads daemon threads, so that if the main thread exits for whatever reason the other threads are killed. Here's a small example:
# Import the libs...
import threading, time

# Global variables... (list of results)
results = []

# The subprocess you want to run several times simultaneously...
def run():
    # We declare results as a global variable.
    global results
    # Do stuff...
    results.append("Hello World! These are my results!")

n = int(input("Welcome user, how many times should I execute run()? "))

# We run the thread n times.
for _ in range(n):
    # Define the thread.
    t = threading.Thread(target=run)
    # Make the thread a daemon; if the main process exits, the threads are killed.
    t.daemon = True
    # Start the thread.
    t.start()

# Once the threads have started we can execute the main code.
# We set a timer...
startTime = time.time()
while True:
    # If the timer reaches 60 s we exit from the program.
    if time.time() - startTime >= 60:
        print("[ERROR] The script took too long to run!")
        exit()
    # Do stuff on your main thread; once the stuff is complete you can
    # break from the while loop as well.
    results.append("Main result.")
    break

# When we break from the while loop we print the output.
print("Here are the results: ")
for i in results:
    print(f"-{i}")
This example should solve your problem, but if you wanted to use blocking commands on the main thread, the timer would fail, so you'd need to tweak this code a bit. To do that, move the code from the main thread's loop into a new function (for example def main():) and run the rest of the threads from a primary thread inside main. This example may help you:
def run():
    pass

# Secondary "main" thread.
def main():
    # Start the rest of the threads (in this case I just start 1).
    localT = threading.Thread(target=run)
    localT.daemon = True
    localT.start()
    # Do stuff.
    pass

# Actual main thread...
t = threading.Thread(target=main)
t.daemon = True
t.start()
# Set up a timer and fetch the results you need with a global list
# or any other method...
pass
Now, you should avoid global variables at all costs, as they can sometimes be a bit buggy, but the threading lib doesn't let you return values from threads, at least not through any method I know of. I think there are other multiprocessing libs out there that do let you return values, but I don't know anything about them, so I can't explain them. Anyway, I hope this works for you.
Update: OK, I was busy writing the code and didn't read the comments on the post, sorry. You can still use this method, but instead of writing code inside the threads, execute another script. You could either import it as a module or actually run it as a script. Here's a question that may help you with that:
How to run one python file in another file?
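If you do run foo.py as a separate script, here is a minimal sketch of the whole goal (10 simultaneous runs, a 60-second cap, keep the first finisher, kill the rest) using only the standard subprocess module; the foo.py name comes from the question, and the polling interval is an arbitrary choice:

import subprocess, time

TIMEOUT = 60
procs = [subprocess.Popen(["python", "foo.py"], stdout=subprocess.PIPE)
         for _ in range(10)]
start = time.time()
winner = None

# Poll until one process finishes or the time limit is hit.
while winner is None and time.time() - start < TIMEOUT:
    winner = next((p for p in procs if p.poll() is not None), None)
    time.sleep(0.1)

if winner is not None:
    elapsed = time.time() - start
    output = winner.communicate()[0]
    print("First finished in %.1fs, output: %r" % (elapsed, output))
else:
    print("[ERROR] No run finished within %ds" % TIMEOUT)

# Kill whatever is still running.
for p in procs:
    if p.poll() is None:
        p.kill()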
My script has started many threads and has kept count of them. The maximum number of threads is now running, and there are more to be run. How can the script wait for any one of the running threads to end, so it can safely start another one? It is using threading.Thread() to create each thread, but that can be changed if there is a better module. I am using Python 3.6.x.
To create a pool of processes and pass tasks to them:
from multiprocessing import Pool

def processing_task(arg1, arg2):
    ...

with Pool() as worker_pool:  # by default creates one process per CPU
    while True:
        task = some_queue_implementation.get()  # some blocking method that receives tasks
        worker_pool.apply_async(processing_task, (task.arg1, task.arg2))

This will create child processes that sit idle until they are passed a task.
Here is a snippet of code I always use when working with threads. It:

- sets a certain number of threads;
- ensures that no code after the context-manager block executes until all threads complete; and
- cancels the remaining futures and re-raises the exception if one of the child threads throws one.
import concurrent.futures

futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures.append(executor.submit(function, args))
    for future in concurrent.futures.as_completed(futures):
        if future.exception():
            for child in futures:
                child.cancel()
            raise future.exception()
It has kept count

I hope you are guarding that count with a threading.Lock() and acquire(), ;-)
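For reference, a minimal sketch of such a lock-protected counter (the variable names are made up for illustration):

import threading

count = 0
count_lock = threading.Lock()

def worker():
    global count
    with count_lock:   # the with-block handles acquire()/release()
        count += 1

threads = [threading.Thread(target=worker) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)           # reliably 100 with the lock in place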
I am very new to the concept of threading and the concepts are still somewhat fuzzy.
But as of now I have a requirement in which I spin up an arbitrary number of threads from my Python program, and my Python program should indicate to the user running it which threads have finished executing. Below is my first try:
import threading
from threading import Thread
from time import sleep
def exec_thread(n):
    name = threading.current_thread().getName()
    filename = name + ".txt"
    with open(filename, "w+") as file:
        file.write(f"My name is {name} and my main thread is {threading.main_thread()}\n")
        sleep(n)
        file.write(f"{name} exiting\n")

t1 = Thread(name="First", target=exec_thread, args=(10,))
t2 = Thread(name="Second", target=exec_thread, args=(2,))

t1.start()
t2.start()

while len(threading.enumerate()) > 1:
    print(f"Waiting ... !")
    sleep(5)

print(f"The threads are done")
So this basically tells me when all the threads are done executing.
But I want to know as soon as any one of my threads has completed execution, so that I can tell the user to check the output file for that thread.
I cannot use thread.join(), since that would block my main program and the user would not know anything until everything is complete, which might take hours. The user wants to know as soon as some results are available.
Now I know that we can check whether an individual thread is active by calling thread.is_alive(), but I was hoping for a more elegant solution in which the child threads can somehow communicate with the main thread and say: I am done!
Many thanks for any answers in advance.
The simplest and most straightforward way to indicate that a single thread is "done" is to put the required notification in the thread's implementation method, as the very last step. For example, you could print a notification to the user.
Or, you could use events, see: https://docs.python.org/3/library/threading.html#event-objects
This is one of the simplest mechanisms for communication between threads: one thread signals an event and other threads wait for it. An event object manages an internal flag that can be set to true with the set() method and reset to false with the clear() method. The wait() method blocks until the flag is true.
So, the "final act" in your thread implementation would be to set an event object, and your main thread can wait until it's set.
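As a rough sketch of that idea (using one Event per thread, which is an assumption made for illustration):

import threading, time

def worker(done_event, n):
    time.sleep(n)      # stand-in for the real work
    done_event.set()   # "final act": signal the main thread

done = threading.Event()
threading.Thread(target=worker, args=(done, 2)).start()
done.wait()            # blocks until the worker calls set()
print("Worker finished -- check its output file.")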
Or, for an even fancier and more flexible mechanism, use queues: https://docs.python.org/3/library/queue.html
Each thread writes an "I'm done" object to the queue when done, and the main thread can read those notifications from the queue in sequence as each thread completes.
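A minimal sketch of that queue approach, adapted to the question's exec_thread example (the "I'm done" message format is an assumption):

import queue, threading, time

done_q = queue.Queue()

def exec_thread(n):
    time.sleep(n)                                # stand-in for the real work
    done_q.put(threading.current_thread().name)  # "I'm done"

threads = [threading.Thread(name="First", target=exec_thread, args=(10,)),
           threading.Thread(name="Second", target=exec_thread, args=(2,))]
for t in threads:
    t.start()

for _ in threads:
    name = done_q.get()   # blocks until the next thread reports in
    print("%s is done -- check %s.txt for its output" % (name, name))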
My apologies for the long-ish post up front. Hopefully it'll give enough context for a solution. I've tried to create a utility function that will take any number of old classmethods and stick them into a multi-threaded queue:
import threading

class QueuedCall(threading.Thread):
    def __init__(self, name, queue, fn, args, cb):
        threading.Thread.__init__(self)

        self.name = name
        self._cb = cb
        self._fn = fn
        self._queue = queue
        self._args = args

        self.daemon = True
        self.start()

    def run(self):
        r = self._fn(*self._args) if self._args is not None \
            else self._fn()
        if self._cb is not None:
            self._cb(self.name, r)
        self._queue.task_done()
Here's what my calling code looks like (within a class):

data = {}

def __op_complete(name, r):
    data[name] = r

q = Queue.Queue()
socket.setdefaulttimeout(5)

q.put(QueuedCall('twitter', q, Twitter.get_status, [5,], __op_complete))
q.put(QueuedCall('so_answers', q, StackExchange.get_answers,
                 ['api.stackoverflow.com', 534476, 5], __op_complete))
q.put(QueuedCall('so_user', q, StackExchange.get_user_info,
                 ['api.stackoverflow.com', 534476], __op_complete))
q.put(QueuedCall('p_answers', q, StackExchange.get_answers,
                 ['api.programmers.stackexchange.com', 23901, 5], __op_complete))
q.put(QueuedCall('p_user', q, StackExchange.get_user_info,
                 ['api.programmers.stackexchange.com', 23901], __op_complete))
q.put(QueuedCall('fb_image', q, Facebook.get_latest_picture, None, __op_complete))

q.join()

return data
The problem I'm running into is that it seems to work every time on a fresh server restart, but fails every second or third request with the error:

ValueError: task_done() called too many times

The error presents itself in a random thread every second or third request, so it's rather difficult to nail down exactly what the problem is.
Anyone have any ideas and/or suggestions?
Thanks.
Edit:
I added print statements in an effort to debug this (quick and dirty rather than logging): one (print 'running thread: %s' % self.name) in the first line of run, and another (print 'thread done: %s' % self.name) right before calling task_done().
The output of a successful request:
running thread: twitter
running thread: so_answers
running thread: so_user
running thread: p_answers
thread done: twitter
thread done: so_user
running thread: p_user
thread done: so_answers
running thread: fb_image
thread done: p_answers
thread done: p_user
thread done: fb_image
The output of an unsuccessful request:
running thread: twitter
running thread: so_answers
thread done: twitter
thread done: so_answers
running thread: so_user
thread done: so_user
running thread: p_answers
thread done: p_answers
Exception in thread p_answers:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/home/demian/src/www/projects/demianbrecht/demianbrecht/demianbrecht/helpers.py", line 37, in run
self._queue.task_done()
File "/usr/lib/python2.7/Queue.py", line 64, in task_done
raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
running thread: p_user
thread done: p_user
running thread: fb_image
thread done: fb_image
Your approach to this problem is "unconventional", but ignoring that for now ... the issue is simply that, in the code you have given,
q.put(QueuedCall('twitter', q, Twitter.get_status, [5,], __op_complete))
it is clearly possible for the following workflow to occur:

1. A thread is constructed and started by QueuedCall.__init__.
2. It is then put into the queue q. However, before the Queue completes its logic for inserting the item, the independent thread has already finished its work and attempted to call q.task_done(), which causes the error you see: task_done() has been called before the object was safely put into the queue.
How should it be done? You don't insert threads into queues; queues hold data that threads process. So instead you:

1. Create a Queue and insert into it the jobs you want done (e.g. a function, the args it wants, and the callback).
2. Create and start worker threads.
3. Each worker thread calls q.get() to fetch a job, invokes the function, and then calls q.task_done() to let the queue know the item was handled.
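A minimal sketch of that worker pattern (the job tuple layout mirrors the question's QueuedCall fields; everything else is made up for illustration):

import threading, queue

def worker(q):
    while True:
        name, fn, args, cb = q.get()   # blocks until a job is available
        try:
            result = fn(*args) if args is not None else fn()
            if cb is not None:
                cb(name, result)
        finally:
            q.task_done()              # called only for items taken off the queue

q = queue.Queue()
for _ in range(4):                     # a fixed pool of worker threads
    threading.Thread(target=worker, args=(q,), daemon=True).start()

q.put(("twitter", lambda n: "status %d" % n, (5,), lambda name, r: print(name, r)))
q.join()                               # returns once every job was task_done()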
I may be misunderstanding here, but I'm not sure you're using the Queue correctly.
From a brief survey of the docs, it looks like the idea is that you can use the put method to put work into a Queue, then another thread can call get to get some work out of it, do the work, and then call task_done when it has finished.
What your code appears to do is put instances of QueuedCall into a queue. Nothing ever gets anything from the queue, but the QueuedCall instances are also passed a reference to the queue they're being inserted into; they do their work (which they know about intrinsically, not because they get it from the queue) and then call task_done.
If my reading of all that is correct (and you don't call the get method from somewhere else I can't see), then I believe I understand the problem.
The issue is that the QueuedCall instances have to be created before they can be put on the queue, and the act of creating one starts its work in another thread. If the thread finishes its work and calls task_done before the main thread has managed to put the QueuedCall into the queue, then you can get the error you see.
I think it only works the first time you run it by accident. The GIL 'helps' you a lot: it's not very likely that a QueuedCall thread will actually gain the GIL and begin running immediately. The fact that you don't actually care about the Queue other than as a counter also 'helps' this appear to work: it doesn't matter if a QueuedCall hasn't hit the queue yet, so long as the queue isn't empty (this QueuedCall can just task_done another element in the queue, and by the time that element calls task_done, this one will hopefully be in the queue and can be marked done in its place). Adding a sleep also masks the problem, because it makes the new threads wait a bit, giving the main thread time to make sure they're actually in the queue.
Also note that, as far as I can tell from some quick fiddling with an interactive shell, your queue is actually still full at the end, because you never actually get anything out of it. It's just received a number of task_done messages equal to the number of things that were put in it, so the join works.
I think you'll need to radically redesign the way your QueuedCall class works, or use a different synchronisation primitive than a Queue. A Queue is designed to be used to queue work for worker threads that already exist. Starting a thread from within a constructor for an object that you put on the queue isn't really a good fit.
Let's assume I'm stuck using Python 2.6 and can't upgrade (even if that would help). I've written a program that uses the Queue class. My producer is a simple directory listing; my consumer threads pull a file from the queue and do stuff with it. If the file has already been processed, I skip it. The processed list is generated before the threads are started, so it isn't empty.
Here's some pseudo-code.
import os, Queue
from threading import Thread

processed = []

def consumer():
    while True:
        file = dirlist.get(block=True)
        if file in processed:
            print "Ignoring %s" % file
        else:
            # do stuff here
            dirlist.task_done()

dirlist = Queue.Queue()
for f in os.listdir("/some/dir"):
    dirlist.put(f)

max_threads = 8
for i in range(max_threads):
    thr = Thread(target=consumer)
    thr.start()

dirlist.join()
The strange behavior I'm getting is that if a thread encounters a file that's already been processed, the thread stalls out and waits until the entire program ends. I've done a little bit of testing, and the first 7 threads (assuming 8 is the max) stop, while the 8th thread keeps processing, one file at a time. But, by doing that, I'm losing the entire reason for threading the application.
Am I doing something wrong, or is this the expected behavior of the Queue/threading classes in Python 2.6?
I tried running your code, and did not see the behavior you describe. However, the program never exits. I recommend changing the .get() call as follows:
try:
    file = dirlist.get(True, 1)
except Queue.Empty:
    return
If you want to know which thread is currently executing, you can import the thread module and print thread.get_ident().
I added the following line after the .get():
print file, thread.get_ident()
and got the following output:
bin 7116328
cygdrive 7116328
cygwin.bat 7149424
cygwin.ico 7116328
dev etc7598568
7149424
fix 7331000
home 7116328lib
7598568sbin
7149424Thumbs.db
7331000
tmp 7107008
usr 7116328
var 7598568proc
7441800
The output is messy because the threads are writing to stdout at the same time. The variety of thread identifiers further confirms that all of the threads are running.
Perhaps something is wrong in the real code or your test methodology, but not in the code you posted?
Since this problem only manifests itself when a file that has already been processed is found, it seems to be something to do with the processed list itself. Have you tried implementing a simple lock? For example:
processed = []
processed_lock = threading.Lock()

def consumer():
    while True:
        with processed_lock:  # Lock objects are context managers; no .acquire() call needed
            fileInList = file in processed
        if fileInList:
            # ... et cetera
Threading tends to cause the strangest bugs, even if they seem like they "shouldn't" happen. Using locks on shared variables is the first step to make sure you don't end up with some kind of race condition that could cause threads to deadlock.
Of course, if what you're doing under # do stuff here is CPU-intensive, then Python will only run code from one thread at a time anyway, due to the Global Interpreter Lock. In that case, you may want to switch to the multiprocessing module - it's very similar to threading, though you will need to replace shared variables with another solution (see here for details).
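For reference, a minimal Python 3 sketch of that switch (the question targets 2.6, but the shape is the same); Manager().list() stands in for the shared processed list, and the file names are made up:

import multiprocessing, queue

def consumer(dirlist, processed, lock):
    while True:
        try:
            name = dirlist.get(True, 1)
        except queue.Empty:
            return
        with lock:
            if name in processed:
                continue
            processed.append(name)
        # do the CPU-intensive work here, now on a separate core

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    processed = manager.list()
    lock = manager.Lock()
    dirlist = multiprocessing.Queue()
    for name in ["a.txt", "b.txt", "a.txt"]:
        dirlist.put(name)
    workers = [multiprocessing.Process(target=consumer,
                                       args=(dirlist, processed, lock))
               for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(list(processed))  # each file appears once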