thread.join() does not work when target function uses subprocess.run() - python

I have a function that runs another script in the background from the command line using subprocess.run(). When this function runs, it generates as many processes as I give it as input. For instance:
# Importing libraries
import subprocess
import threading

# Function with the background command inside
def func(input_value, processes):
    # Run the background script as many times as the requested number of processes
    for i in range(processes):
        subprocess.run("nohup python3 func_background " + input_value + " &\n", shell=True)
So if I specify 50, I will run 50 processes in the background with the same input value. The problem is that now I am trying to run different input values, each with many processes. Of course, if I run all input values at the same time I will max out the CPU, so I was trying to create one thread for each input value and run them one by one, waiting for each thread to finish before starting the next (using thread.join()). Like this:
# Defining the input values
list_input_values = [input_value_1, input_value_2]

# Looping over the input values, creating one thread for each of them
for i in list_input_values:
    # Creating the thread
    th = threading.Thread(target=func, args=(i, 4))
    # Starting the thread and trying to make the loop wait before starting a new one
    th.daemon = True
    th.start()
    th.join()
This works, but it runs both threads at the same time instead of one by one, and I don't know if that's because of the loop or because the target function finishes while the subprocess.run() calls inside it do not. Is there any way to solve this by waiting for the background processes to finish, or am I doing something wrong?
Thank you in advance

This is a comment, not an answer; I can't include code in a comment. I asked: why create threads if you aren't going to let them run concurrently?
You said:
I was trying to find a way to not let the loop go to the next iteration until the first one is finished, because the first iteration will use all the available cores, and only when those processes finish should it move on to the next iteration.
OK, so why don't you just do this?
list_input_values = [input_value_1, input_value_2]

for i in list_input_values:
    func(i, 4)
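For completeness, part of the problem in the original func is that nohup ... & makes the shell background each command, so every subprocess.run() call returns immediately and func finishes long before the real work does, which is why join() seems not to wait. A minimal sketch of one way around that (assuming the background program is the func_background script from the question and that it takes the input value as a command-line argument) keeps a handle to each child process and waits on all of them:

import subprocess

def func(input_value, processes):
    # Launch the background script the requested number of times,
    # keeping a handle to each child process (no nohup, no shell "&").
    procs = [
        subprocess.Popen(["python3", "func_background", str(input_value)])
        for _ in range(processes)
    ]
    # Block until every child process has finished.
    for p in procs:
        p.wait()

With func written this way, a plain loop calling func(i, 4) for each input value already processes the batches one after another, and no threads are needed.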

Related

Best way to run more than one function at the same time, but if one crashes, continue running all the others and be able to fix the function that crashed

I'm creating a script that scrapes data from sites. I have at least 10 sites to scrape. Each site is one .ipynb file (which I then convert to .py to execute). It can happen that a site changes, so its scraping code would need to be changed.
I have the following:
def ex_scrape_site1():
    %run "scrape\\scrape_site1.py"

def ex_scrape_site2():
    %run "scrape\\scrape_site2.py"

def ex_scrape_site3():
    %run "scrape\\scrape_site3.py"

...
(10 so far)
I'm currently using a list with all the functions and then looping over the list to create one thread per function. Like this:
funcs = [ex_scrape_site1, ex_scrape_site2, ex_scrape_site3]
Then I'm executing them like this:
while True:
    threads = []
    for func in funcs:
        threads.append(Thread(target=func))
    [thread.start() for thread in threads]  # start threads
    [thread.join() for thread in threads]   # wait for all to complete
So here it executes all the functions in parallel, which is OK. However, if one crashes I have to stop everything and fix the error.
Is there a way to do the following?
When something breaks in one of the scraping functions, I want to be able to amend the broken function but keep all the others running.
Since I'm using join(), I have to wait until all the scrapes finish before it iterates again. How could I iterate over each function individually, without waiting for all of them to finish before starting the process again?
I also thought of using Airflow; do you think it would make sense to implement that?
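One possible pattern that addresses both points (a sketch, not from the original post) is to give each scraper its own supervising thread that catches exceptions and reruns the function on its own schedule, so a crash in one scraper never blocks the others:

import traceback
from threading import Thread
from time import sleep

def supervise(func, delay=60):
    # Run one scraper forever; log crashes instead of dying with them.
    while True:
        try:
            func()
        except Exception:
            # The broken scraper just logs and retries later;
            # the other threads keep running untouched.
            traceback.print_exc()
        sleep(delay)  # pause before this scraper's next run

# ex_scrape_site1, etc. are the functions from the question above
funcs = [ex_scrape_site1, ex_scrape_site2, ex_scrape_site3]

threads = [Thread(target=supervise, args=(f,), daemon=True) for f in funcs]
for t in threads:
    t.start()
for t in threads:
    t.join()  # these loops run forever; join just keeps the main thread alive

Here each scraper restarts independently as soon as its own run (or crash) ends, instead of waiting for the slowest one, and a broken scrape file can be fixed while the rest keep going.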

How to run a python script multiple times simultaneously using python and terminate all when one has finished

Maybe it's a very simple question, but I'm new to concurrency. I want to write a Python script that runs foo.py 10 times simultaneously, with a time limit of 60 seconds before automatically aborting. The script is a non-deterministic algorithm, hence all executions take different times and one will finish before the others. Once the first one ends, I would like to save its execution time and the output of the algorithm, and then kill the rest of the processes.
I have seen this question, run multiple instances of python script simultaneously, and it looks very similar, but how can I add a time limit and the possibility of killing the rest of the processes when the first one finishes?
Thank you in advance.
I'd suggest using the threading lib, because with it you can set threads as daemon threads, so that if the main thread exits for whatever reason the other threads are killed with it. Here's a small example:
# Import the libs...
import threading, time

# Global variables... (list of results)
results = []

# The task you want to run several times simultaneously...
def run():
    # We declare results as a global variable.
    global results
    # Do stuff...
    results.append("Hello World! These are my results!")

n = int(input("Welcome user, how many times should I execute run()? "))

# We run the thread n times.
for _ in range(n):
    # Define the thread.
    t = threading.Thread(target=run)
    # Set the thread to daemon; this means that if the main process exits, the threads will be killed.
    t.daemon = True
    # Start the thread.
    t.start()

# Once the threads have started we can execute the main code.
# We set a timer...
startTime = time.time()
while True:
    # If the timer reaches 60 s we exit the program.
    if time.time() - startTime >= 60:
        print("[ERROR] The script took too long to run!")
        exit()
    # Do stuff on your main thread; if the stuff is complete you can break from the while loop as well.
    results.append("Main result.")
    break

# When we break from the while loop we print the output.
print("Here are the results: ")
for i in results:
    print(f"-{i}")
This example should solve your problem, but if you wanted to use blocking calls on the main thread the timer would fail, so you'd need to tweak this code a bit. If you want to do that, move the code from the main thread's loop into a new function (for example def main():), start the rest of the threads from there, and run main itself in a thread of its own. This example may help you:
def run():
    pass

# Secondary "main" thread.
def main():
    # Start the rest of the threads (in this case I just start 1).
    localT = threading.Thread(target=run)
    localT.daemon = True
    localT.start()
    # Do stuff.
    pass

# Actual main thread...
t = threading.Thread(target=main)
t.daemon = True
t.start()

# Set up a timer and fetch the results you need with a global list or any other method...
pass
Now, you should avoid global variables as much as possible, since they can be error-prone, but the threading lib doesn't give you a direct way to return values from threads, at least none that I know of. I think there are other multiprocessing libs out there that do let you return values, but I don't know enough about them to explain anything. Anyway, I hope this works for you.
Update: OK, I was busy writing the code and didn't read the comments on the post, sorry. You can still use this method, but instead of writing the code inside the threads, execute another script. You could either import it as a module or actually run it as a script; here's a question that may help you with that:
How to run one python file in another file?
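As a side note on the return-value point above: the standard library's concurrent.futures module does let you collect a return value from each worker thread. A minimal sketch, reusing the toy run() task from the example above rather than the asker's real foo.py:

from concurrent.futures import ThreadPoolExecutor

def run():
    # Same toy task as above, but the result is returned
    # instead of being appended to a global list.
    return "Hello World! These are my results!"

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(run) for _ in range(5)]
    results = [f.result() for f in futures]  # blocks until each worker is done

print(results)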

What is the preferred way of running two methods containing infinite loops concurrently using threads?

I'm trying to combine two Python 3 scripts that I'm currently running separately. Both run in an infinite loop. I found different ways of achieving what I want, but I'm still a beginner trying to learn to do it the right way.
One script is a reddit bot that replies to certain comments and uploads videos, saving links in newly created .txt files. The other iterates through those .txt files, reads them and sometimes deletes them.
This variant seems the most intuitive to me:
from threading import Thread

def runA():
    while True:
        print('A\n')

def runB():
    while True:
        print('B\n')

if __name__ == "__main__":
    t1 = Thread(target=runA)
    t2 = Thread(target=runB)
    t1.daemon = True
    t2.daemon = True
    t1.start()
    t2.start()
    while True:
        pass
Is this the preferred way of running threads? And why do I need
while True:
    pass
at the end?
In general, that is a good way to start two threads, but there are details to think about.
Note that in that code, there are actually 3 threads: main thread, t1 and t2.
Since the comments say one thread downloads and the other reads the downloaded files and since the main thread does nothing in your case, I'd say you need just this much:
from threading import Thread
from time import sleep

def download_forever():
    while True:
        download_stuff()

def process_new_downloads():
    do_something_with_new_downloads_here()

def main():
    download_thread = Thread(target=download_forever)
    download_thread.start()
    while True:
        process_new_downloads()
        sleep(1)  # let go of the CPU for a while, there's nothing to do anyway
Setting the threads as daemons does not change how they live, only how they die, and here it is not clear how the whole thing ends, so I'm not sure you need that. You might want to implement some way to stop the threads politely, and you might also want to define some way to end the whole program.
Additionally, you could implement a way for one thread to wake up the other exactly when there is something new to do. You can do that e.g. with a threading.Event.
BTW, the while True loop that was in the main thread of the original code was needed exactly because all the other threads were daemons, so ending the main thread (i.e. not making it run forever) would have killed the whole application.
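To illustrate the threading.Event suggestion, here is a minimal sketch reusing the placeholder names from the answer above (download_stuff and process_new_downloads stand in for the real work): the downloader signals the processing loop whenever something new arrives.

from threading import Thread, Event

new_downloads = Event()

def download_forever():
    while True:
        download_stuff()          # placeholder for the real download step
        new_downloads.set()       # wake up the processing loop

def process_forever():
    while True:
        new_downloads.wait()      # sleeps until the downloader signals
        new_downloads.clear()
        process_new_downloads()   # placeholder for the real processing step

Thread(target=download_forever, daemon=True).start()
process_forever()                 # run the processing loop in the main thread

Compared with the sleep(1) polling loop, the processing thread here wakes up only when there is actually something new to do.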

Share variables across scripts

I have 2 separate scripts working with the same variables.
To be more precise, one script edits the variables and the other one uses them (it would be nice if it could edit them too, but that is not absolutely necessary).
This is what I am currently doing:
When script 1 edits a variable, it dumps it into a JSON file.
Script 2 repeatedly opens the JSON file to get the variables.
This method is really not elegant and the while loop is really slow.
How can I share variables across scripts?
My first script gets data from a MIDI controller and sends web requests.
My second script is for LED strips (those run thanks to the same MIDI controller). Both scripts run in a "while True" loop.
I can't simply put them in the same script since every web request would slow the LEDs down. I am currently just sharing the variables via a JSON file.
If enough people ask for it I will post the whole code, but I have been told not to do this.
Considering the information you provided, meaning...
Both scripts run in a "while True" loop.
I can't simply put them in the same script since every web request would slow the LEDs down.
To me, you have 2 choices:
Use a client/server model. You have two machines: one acts as the server and the second as the client. The server runs a script with an infinite loop that keeps updating the data, plus an API that simply reads and exposes the current state of your file/database to the client. The client, on the other machine, simply requests the current data and processes it.
Make a single multiprocessing script. Each script would run in a separate process and manage its own memory. Since you also want to share variables between your two programs, you could pass them, as an argument, an object that is shared between both of them. See this resource to help you.
Note that there are more solutions to this. For instance, you're using a JSON file that you keep opening and closing (which is probably what takes the most time in your program). You could instead use a real database that is opened once and queried many times, while still being updated.
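For the client/server option, here is a minimal single-machine sketch (illustrative only, using the standard library's multiprocessing.connection rather than any particular web framework; the state dictionary and the address are made up):

from multiprocessing import Process
from multiprocessing.connection import Listener, Client
from time import sleep

ADDRESS = ("localhost", 6000)
AUTHKEY = b"change-me"

def server():
    # Stand-in for the MIDI/web-request script: it owns the state
    # and answers every client request with the current value.
    state = {"value": 0}
    with Listener(ADDRESS, authkey=AUTHKEY) as listener:
        while True:
            with listener.accept() as conn:
                state["value"] += 1      # pretend the MIDI loop updated something
                conn.send(state)

def client():
    # Stand-in for the LED script: it asks for the state whenever it needs it.
    for _ in range(5):
        with Client(ADDRESS, authkey=AUTHKEY) as conn:
            print("LED script sees:", conn.recv())
        sleep(0.5)

if __name__ == "__main__":
    Process(target=server, daemon=True).start()
    sleep(0.5)  # give the listener a moment to start
    client()

In a real setup the two functions would live in the two separate scripts, with the LED script connecting to the MIDI script whenever it needs fresh values.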
A Manager from multiprocessing lets you do this sort of thing pretty easily.
First, I simplify your "midi controller and sends web-requests" code down to something that just sleeps for random amounts of time and updates a variable in a managed dictionary:
from time import sleep
from random import random

def slow_fn(d):
    i = 0
    while True:
        sleep(random() ** 2)
        i += 1
        d['value'] = i
Next, we simplify the "LED strip" control down to something that just prints to the screen:
from time import perf_counter

def fast_fn(d):
    last = perf_counter()
    while True:
        sleep(0.05)
        value = d.get('value')
        now = perf_counter()
        print(f'fast {value} {(now - last) * 1000:.2f}ms')
        last = now
You can then run these functions in separate processes:
import multiprocessing as mp

with mp.Manager() as manager:
    d = manager.dict()
    procs = []
    for fn in [slow_fn, fast_fn]:
        p = mp.Process(target=fn, args=[d])
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
the "fast" output happens regularly with no obvious visual pauses

python: Why does join keep me waiting?

I want to do clustering on 10,000 models. Before that, I have to calculate the Pearson correlation coefficient between every pair of models. That's a large amount of computation, so I use multiprocessing to spawn processes, assigning the computing job to 16 CPUs. My code is like this:
import numpy as np
from multiprocessing import Process, Queue

def cc_calculator(begin, end, q):
    index = lambda i, j, n: i*n + j - i*(i+1)/2 - i - 1
    for i in range(begin, end):
        for j in range(i, nmodel):
            all_cc[i][j] = get_cc(i, j)
            q.put((index(i, j, nmodel), all_cc[i][j]))

def func(i):
    res = (16-i)/16
    res = res**0.5
    res = int(nmodel*(1-res))
    return res

nmodel = int(raw_input("Entering the number of models:"))
all_cc = np.zeros((nmodel, nmodel))
ncc = int(nmodel*(nmodel-1)/2)
condensed_cc = [0]*ncc
q = Queue()
mprocess = []
for i in range(16):
    begin = func(i)
    end = func(i+1)
    p = Process(target=cc_calculator, args=(begin, end, q))
    mprocess += [p]
    p.start()
for x in mprocess:
    x.join()
while not q.empty():
    (ind, value) = q.get()
    ind = int(ind)
    condensed_cc[ind] = value
np.save("condensed_cc", condensed_cc)
where get_cc(i, j) calculates the correlation coefficient between models i and j. all_cc is an upper triangular matrix and all_cc[i][j] stores the cc value. condensed_cc is another version of all_cc; I'll process it into condensed_dist to do the clustering. The func function helps assign almost the same amount of computation to each CPU.
I run the program successfully with nmodel=20. When I try to run it with nmodel=10,000, however, it seems that it never ends. I waited about two days and used the top command in another terminal window; no process with the command "python" was still running, yet the program had not exited and there was no output file. When I used Ctrl+C to force it to stop, the traceback pointed to the line x.join(). nmodel=40 ran fast but failed with the same problem.
Maybe this problem has something to do with q, because if I comment out the line q.put(...), it runs successfully. Or with something like this:
q.put(...)
q.get()
it is also OK. But these two workarounds will not give the right condensed_cc; they don't change all_cc or condensed_cc.
Another example with only one subprocess:
from multiprocessing import Process, Queue

def g(q):
    num = 10**2
    for i in range(num):
        print '='*10
        print i
        q.put((i, i+2))
        print "qsize: ", q.qsize()

q = Queue()
p = Process(target=g, args=(q,))
p.start()
p.join()
while not q.empty():
    q.get()
It is OK with num=100 but fails with num=10,000. Even with num=100**2, it did print all the i values and the qsizes; I cannot figure out why. Also, Ctrl+C causes a traceback pointing to p.join().
I want to say more about the size problem of the queue. The documentation introduces Queue as Queue([maxsize]) and says about the put method: "...block if necessary until a free slot is available". All this makes one think that the subprocess is blocked because the queue has run out of space. However, as I mentioned for the second example above, the output printed on the screen shows an increasing qsize, meaning that the queue is not full. I added one line:
print q.full()
after the qsize print statement; it is always False for num=10,000 while the program is still stuck somewhere. To emphasize one thing again: the top command in another terminal shows no process with the command python. That really puzzles me.
I'm using Python 2.7.9.
I believe the problem you are running into is described in the multiprocessing programming guidelines: https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming
Specifically this section:
Joining processes that use queues
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the cancel_join_thread() method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
An example which will deadlock is the following:
from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join()                    # this deadlocks
    obj = queue.get()
A fix here would be to swap the last two lines (or simply remove the p.join() line).
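For clarity, here is a sketch of the fixed ordering described above (the same example with the get moved before the join):

from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    obj = queue.get()   # drain the queue first...
    p.join()            # ...then joining the process is safe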
You might also want to check out the section on "Avoid Shared State".
It looks like you are using .join to avoid the race condition of q.empty() returning True before anything has been added to the queue. You should not rely on .empty() at all when using multiprocessing (or multithreading). Instead, you should handle this by signaling from the worker process to the main process when it is done adding items to the queue. This is normally done by placing a sentinel value in the queue, but there are other options as well.
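A minimal sketch of that sentinel pattern (illustrative only, not the asker's cc code; None is used as the end-of-work marker):

from multiprocessing import Process, Queue

SENTINEL = None  # marker meaning "this worker is done"

def worker(begin, end, q):
    for i in range(begin, end):
        q.put((i, i * i))   # stand-in for the real (index, value) results
    q.put(SENTINEL)         # tell the consumer this worker has finished

if __name__ == '__main__':
    q = Queue()
    procs = [Process(target=worker, args=(k * 10, (k + 1) * 10, q)) for k in range(4)]
    for p in procs:
        p.start()

    finished = 0
    results = {}
    while finished < len(procs):     # keep draining until every worker says it is done
        item = q.get()
        if item is SENTINEL:
            finished += 1
        else:
            ind, value = item
            results[ind] = value

    for p in procs:
        p.join()                     # safe now: the queue has been fully drained
    print(len(results), "results collected")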
