pool.apply_async takes so long to finish, how to speed up? - python

I call pool.apply_async() with 14 cores.
import multiprocessing
from time import time
import timeit
informative_patients = informative_patients_2500_end[20:]
pool = multiprocessing.Pool(14)
results = []
wLength = [20,30,50]
start = time()
for fn in informative_patients:
    result = pool.apply_async(compute_features_test_set,
                              args=(fn, wLength),
                              callback=results.append)
pool.close()
pool.join()
stop = timeit.default_timer()
print stop - start
The problem is that it finishes calling compute_features_test_set() for the first 13 data sets in less than an hour, but the last one takes more than an hour. All 14 data sets are the same size. I tried putting pool.terminate() after pool.close(), but then the pool terminates immediately without ever entering the for loop. This always happens the same way, and if I use more cores and more data sets, it is still the last one that takes so long to finish. My compute_features_test_set() function is a simple feature-extraction routine and works correctly. I work on a server with Red Hat Linux 6, Python 2.7, and Jupyter. Computation time matters to me: what is wrong here, and how can I fix it so that all the computation finishes in a reasonable time?

Question: ... what is wrong here and how I can fix it
I can't pin this down as a multiprocessing issue.
But how do you arrive at "always the last one takes so long to finish"?
You are using callback=results.append instead of a function of your own?
Edit your question to show how you time a single process.
Also add your Python version to your question.
Do the following to verify it's not a data issue:
start = timeit.default_timer()
results.append(
    compute_features_test_set(<First informative_patients>, wLength)
)
stop = timeit.default_timer()
print stop - start

start = timeit.default_timer()
results.append(
    compute_features_test_set(<Last informative_patients>, wLength)
)
stop = timeit.default_timer()
print stop - start
Compare the two times you get.
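If those two times differ noticeably, the data is the issue. If they match, a sketch (reusing compute_features_test_set and wLength from the question) for timing each task inside the pool is to wrap the worker so every result carries its own duration:
from timeit import default_timer

# hedged sketch: wrap the worker so each task reports how long it
# actually ran inside its pool process
def timed_compute(fn, wLength):
    t0 = default_timer()
    result = compute_features_test_set(fn, wLength)  # from the question
    return fn, default_timer() - t0, result

# usage with the original pool:
# pool.apply_async(timed_compute, args=(fn, wLength), callback=results.append)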

Related

How to kill the process after particular timeout in python?

I have a Python function containing a loop that may fall into an infinite loop.
I want to kill the function and print an error if it exceeds a time limit, let's say 10 seconds.
For example, this is the function that contains the infinite loop:
def myfunc(data):
    while True:
        data[0] += 10
    return data

data = [57, 6, 879, 79, 79]
It is called from here:
print(myfunc(data))
I want it to completely kill the process and print the message. I am working on Windows.
Any reference or resource will be helpful.
Thanks
Maybe you can try the func-timeout Python package.
You can install it with the following command:
pip install func-timeout
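A minimal sketch of how the package is typically used, applied to the myfunc from the question (the error message is a placeholder):
from func_timeout import func_timeout, FunctionTimedOut

def myfunc(data):
    while True:  # the potentially infinite loop from the question
        data[0] += 10

data = [57, 6, 879, 79, 79]
try:
    # run myfunc, aborting it if it takes longer than 10 seconds
    result = func_timeout(10, myfunc, args=(data,))
    print(result)
except FunctionTimedOut:
    print("myfunc did not finish within 10 seconds")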
This code will resolve your problem: "I want to kill the function and print an error if it exceeds a time limit, let's say 10 seconds."
from time import process_time as timer

def whatever(data):
    start = timer()           # CPU-time stamp taken once, before the loop
    while True:               # the potentially infinite loop
        data[0] += 10         # (your code here)
        end = timer()         # re-read the CPU clock each iteration
        if end - start >= 10:
            print("whatever: hit the 10 second limit")
            return data
Breakdown: there are a few options for doing this with the time module.
At the start of your program, import the method.
Just before the (infinite) loop, start = timer() calls process_time and saves the value as start.
Inside the loop, end = timer() keeps calling process_time, storing the new value on each iteration.
Put whatever code you want in your loop.
if end - start >= 10: checks the elapsed time on each loop iteration.
The print(...) and return then terminate the function by escaping the loop once 10 seconds have elapsed, the next time the check runs.
This doesn't cover how to stop process_time from running once it's called; I don't know if it continues to run in the background and you have to stop it on your own, but this should answer your question, and you can investigate a bit further with what I've given you.
Note: this is not designed to produce "precision" timing; for that, a more complex method will need to be found.
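For instance, calling the sketch above on the question's data (process_time counts CPU time, and a busy loop like this one does accrue it):
data = [57, 6, 879, 79, 79]
print(whatever(data))  # returns data after roughly 10 seconds of CPU time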

Why this multi-threading is slower than single thread

This is my first time using multi-threading.
I wrote some code to process every file in a directory, like this:
import gzip
import pathlib
import time

list_battle = []
start = time.time()
for filepath in pathlib.Path(dir_battle).glob('**/*'):
    battle_json = gzip.GzipFile(filepath, 'rb').read().decode("utf-8")
    battle_id = eval(type_cmd)
    list_battle.append((battle_id, battle_json))
end = time.time()
print(end - start)
It shows the code runs in 8.74 seconds.
Then, I tried to use multi-threading as follows:
from multiprocessing import Pool

# define function to process each file
def get_file_data(path, cmd, result_list):
    data_json = gzip.GzipFile(path, 'rb').read().decode("utf-8")
    data_id = eval(cmd)
    result_list.append((data_id, data_json))

# start to run multi-threading
pool = Pool(5)
start = time.time()
for filepath in pathlib.Path(dir_battle).glob('**/*'):
    pool.apply_async( get_file_data(filepath, type_cmd, list_battle) )
end = time.time()
print(end - start)
However, the result shows it takes 12.36 seconds!
In my view, with a single thread, each turn of the loop waits for the code to finish before starting the next turn. With multiprocessing, on the first turn the loop hands the work to thread 1, on the second turn to thread 2, and so on; while the job is being dispatched to the other four threads, thread 1 keeps running, and by the time the sixth turn arrives it should have finished its job, so the loop can directly ask it to run the code of the sixth turn.
So this should be quicker than a single thread. Why does the code with multiprocessing run even slower? How can I address this issue? What is wrong with my thinking?
Any help is appreciated.
Multiprocessing does not reduce your processing time unless your process has a lot of dead time (waiting). The main purpose of multiprocessing is parallelism of different tasks, at the cost of context switching. Whenever you switch from one task to another, interrupting the previous one, your program needs to store all the variables of the former task and fetch the ones of the new one. This takes time as well.
This means the shorter the time you spend per task, the less efficient (in terms of computing time) your multiprocessing is.
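As an aside, pool.apply_async( get_file_data(filepath, type_cmd, list_battle) ) calls get_file_data immediately in the main process and hands its return value (None) to apply_async, so the pool never actually receives any work. A minimal sketch of the asynchronous form, reusing the names from the question (and returning results instead of appending to a shared list, since an ordinary list does not propagate back from separate pool processes):
from multiprocessing import Pool
import gzip
import pathlib

def get_file_data(path, cmd):
    # worker returns its result; appending to a plain list would not
    # propagate back from a separate pool process
    data_json = gzip.GzipFile(path, 'rb').read().decode("utf-8")
    data_id = eval(cmd)
    return data_id, data_json

if __name__ == '__main__':
    pool = Pool(5)
    handles = [pool.apply_async(get_file_data, args=(fp, type_cmd))
               for fp in pathlib.Path(dir_battle).glob('**/*')]
    pool.close()
    pool.join()
    list_battle = [h.get() for h in handles]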

Why this Python parallel loop is taking longer time than sequential loop?

I have this code that I tried to make parallel based on a previous question. Here is the code using 2 processes.
import multiprocessing
import timeit

start_time = timeit.default_timer()
d1 = dict( (i, tuple([i*0.1, i*0.2, i*0.3])) for i in range(500000) )
d2 = {}

def fun1(gn):
    x, y, z = d1[gn]
    d2.update({gn: ((x + y + z) / 3)})
#
if __name__ == '__main__':
    gen1 = [x for x in d1.keys()]
    #fun1(gen1)
    p = multiprocessing.Pool(2)
    p.map(fun1, gen1)
    print('Script finished')
    stop_time = timeit.default_timer()
    print(stop_time - start_time)
Output is:
Script finished
1.8478448875989333
If I change the program to sequential,
fun1(gen1)
#p= multiprocessing.Pool(2)
#p.map(fun1,gen1)
output is:
Script finished
0.8345944193950299
So the parallel loop is taking more time than the sequential loop, more than double. (My computer has 2 cores, running on Windows.) I tried to find similar questions on the topic, this and this, but could not figure out the reason. How can I get a performance improvement using the multiprocessing module in this example?
When you do p.map(fun1, gen1), you send gen1 over to the other process. This includes serializing the list, which is 500000 elements long.
Compared to the small per-key computation, that serialization takes much longer.
You can measure or profile where the time is spent.
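A minimal sketch of such a measurement, timing just the pickling that p.map performs implicitly when it ships the 500000 keys to the workers:
import pickle
import timeit

gen1 = list(range(500000))  # stands in for the keys of d1 from the question

# time only the serialization step, averaged over 10 runs
t = timeit.timeit(lambda: pickle.dumps(gen1), number=10) / 10
print('serializing 500000 keys: %.4f s per call' % t)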

Python threads do something at the EXACT same time

Is it possible to have 2, 3 or more threads in Python execute something simultaneously, at the exact same moment? And if one of the threads is late, is it possible for the others to wait for it, so that the last request can be executed at the same time?
Example: there are two threads that are calculating specific parameters; after they have done that, they need to click one button at the same time (to send a POST request to the server).
"Exact the same time" is really difficult, at almost the same time is possible but you need to use multiprocessing instead of threads. Here one example.
from time import time
from multiprocessing import Pool

def f(*args):
    while time() < start + 5:  # synchronize the execution of each process
        pass
    print(time())

start = time()
with Pool(10) as p:
    p.map(f, range(10))
It prints
1495552973.6672032
1495552973.6672032
1495552973.669514
1495552973.667697
1495552973.6672032
1495552973.668086
1495552973.6693969
1495552973.6672032
1495552973.6677089
1495552973.669164
Note that some of the processes print truly simultaneous timestamps (to 10e-7 second precision). It's impossible to guarantee that all the processes will be executed at the very same moment.
However, if you limit the number of processes to the number of cores you actually have, then most of the time they will run at exactly the same moment.
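For a tighter rendezvous than busy-waiting on a shared start time, a sketch using multiprocessing.Barrier (Python 3.3+) makes every process block until all of them have arrived; the worker below is a hypothetical stand-in for the question's button-click:
from multiprocessing import Barrier, Process
from time import time

def worker(barrier):
    barrier.wait()   # block until every process has reached this point
    print(time())    # e.g. send the POST request here

if __name__ == '__main__':
    n = 4  # ideally no more than the number of cores you have
    barrier = Barrier(n)
    procs = [Process(target=worker, args=(barrier,)) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()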

Terminating a while loop after 2 seconds for a lengthy process within the loop

I have a while loop containing a process that takes roughly 1 second each time but occasionally takes over 5 minutes. I want the loop to terminate after 2 seconds in case the 5-minute scenario occurs.
I have already tried
import time

t1 = time.time() + 2
while time.time() < t1:
    PerformProcess
but this approach is limited by the fact that PerformProcess is automatically terminated each time time.time() is calculated.
The other solution I have commonly seen posted is
import time

t1 = time.time()
t2 = time.time()
while t2 - t1 < 2:
    PerformProcess
    t2 = time.time()
but this approach does not work, as the refresh of t2 will not be reached while the 5-minute process is running.
I essentially just need a way to terminate the while loop after 2 seconds, either with no time calculation inside the loop at all, or with a time calculation that does not terminate the process running within the loop.
In case it's helpful, PerformProcess essentially opens a web page using urllib and extracts data from it. A simplified version of what it does:
import urllib

f = urllib.urlopen(str(Hyperlink))
s = f.read()
print s
f.close()
where opening and reading f is what usually takes 1 second but sometimes, for some reason, takes far longer.
If I understand correctly, the timeout you are trying to set relates to the URL fetch. I'm not sure what version of Python you are using, but if you are on 2.6+ then I wonder if something like this might be what you're looking to do:
import urllib2
f = urllib2.urlopen(str(Hyperlink), timeout=2)
You would then need to wrap in a try/except and do something if the timeout occurs.
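A minimal sketch of that wrapping, assuming the Hyperlink variable from the question (a timeout can surface as a urllib2.URLError on connect or a socket.timeout during the read):
import socket
import urllib2

try:
    f = urllib2.urlopen(str(Hyperlink), timeout=2)
    s = f.read()
    f.close()
    print s
except (urllib2.URLError, socket.timeout) as e:
    print 'request failed or timed out:', e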
