Creating a thread pool for a function - Python

I have a function that gives values fast, and I have a second function that calculates something from each value, which takes around 10-20 seconds. Since the first function releases values so fast, I need to create a thread pool that runs around 40 instances of the second function at the same time. How can I go about doing this?
Eg:
def function1(): #gives values fast
    while(values not complete):
        newvalue = xyz
        UseAvailableThread(function2(newvalue)) # wait if thread isn't available

def function2(value):
    #computes some data - takes 10-20 seconds
I hope the example helps. How can I go about doing this?
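A minimal sketch of how this is typically done with the standard library (this is an illustration, not from the original post; produce_values and the sleep are stand-ins for the real functions): concurrent.futures.ThreadPoolExecutor caps the concurrency at max_workers, so at most 40 instances of the slow function run at once while further work waits in the pool's queue.

import time
from concurrent.futures import ThreadPoolExecutor

def function2(value):
    time.sleep(1)          # stand-in for the 10-20 second computation
    return value * 2

def produce_values():
    yield from range(100)  # stand-in for the fast producer

with ThreadPoolExecutor(max_workers=40) as executor:
    # submit() itself never blocks: at most 40 calls run concurrently,
    # the rest simply wait in the pool's internal queue
    futures = [executor.submit(function2, v) for v in produce_values()]
    results = [f.result() for f in futures]
print(len(results))

If letting pending values pile up in the queue is a concern, a threading.BoundedSemaphore acquired before each submit (and released in a done callback) makes the producer block whenever all 40 workers are busy.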

Related

Scheduling non-periodic events with multiple threads

I am attempting to develop a GUI program in Python (using PyQt5) to interact with a data acquisition device (DAQ) that will be connected via LAN or USB to a Windows PC. On the click of a button (in the GUI), the DAQ will perform a test.
Each "Test" will consist of collecting a reading (collecting a reading takes about 1.5 seconds) at user-defined intervals from the start of the test (e.g., 0.1 min, 0.2 min, 0.5 min, 1 min, 2 min, 5 min...1000 min etc.). A reading is collected by execution of a function, so code for a single test might look like this:
import time

t = [0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000] #times from start of test to collect readings at (min)
intervals = [t[i] - t[i-1] for i in range(1, len(t))] #time delta between readings (min)

def GetReading():
    #some code to connect to the DAQ (using pyvisa) and collect reading
    reading = ['2020-01-02 17:33:33', 1.23456] #the data returned from the DAQ
    return reading

def RunTest(r):
    results = [GetReading()] #get the initial (t=0) reading
    ReadTime = 1.5 #time in seconds to collect 1 reading (I may use an implementation of
                   #time.run_process() or similar to actually calculate this instead)
    for j in r:
        time.sleep(j*60 - ReadTime)
        results.append(GetReading())
    return results

RunTest(intervals)
The DAQ can only perform one reading at a time. I would like to be able to run multiple tests simultaneously, and have my program automatically wait and start a new test when it is feasible (i.e., delay starting a test on click if another test is already running).
It is imperative that the first, say, 5 readings happen on time, but the subsequent readings of a given test can be delayed a bit without affecting the quality of the test. For example, if a test is running at the 0.2 min reading interval and the user initiates a new test, the program would wait until the current test completed, say, the 5 min reading before starting the additional test sequence.
Subsequent readings beyond the 5 min reading could be delayed to collect the first 5 readings of a new test sequence, or collect a reading from another test.
I'm struggling with how to program this, conceptually. I think I need to use multiprocessing or similar to allow multiple tests to run in parallel (though no actual parallel readings can occur). Or perhaps I can use a scheduler? I'm just not sure how to implement either of these; I've never used them before, and I'm having trouble understanding the examples I find in the context of my problem.
Furthermore, I need to be able to access results (output from RunTest) between calls to GetReading() (e.g., to view data as the test progresses), and using the time.sleep wouldn't allow that.
UPDATE
The measurement the DAQ is collecting is deformation, via an LVDT.
The time zero in var t is not actually the button click supplied by the user. On button click, the DAQ will open the specified channel and the program will monitor for a change in deformation above a certain threshold. The user will then physically start the test (which involves adding a weight on some material, to measure stress-strain properties), and time zero will occur at reading i-1, where i is the first reading at which a change above the threshold is detected (i.e., t=0 corresponds to the zero-deformation reading the instant before the weight is added). I need the whole process, from button click, to adding the weight, to collecting up to the 5 minute reading, to be uninterrupted for a single test (deformation occurs most rapidly, and potentially erratically, in the first 5 minutes or so).
The code below works, but it doesn't ensure that the first measurements of a new test are prioritized.
If this is essential, then the solution will be a little more involved.
To be sure that only one function/thread is reading data at a given time, you can use a mutex (threading.Lock):
from threading import Lock

read_lock = Lock()

def get_data():
    with read_lock:
        #some code to connect to the DAQ (using pyvisa) and collect reading
        reading = ['2020-01-02 17:33:33', 1.23456] #the data returned from the DAQ
        return reading
I'd propose writing a function that fetches the result and appends it to a results list.
Any object modified by one thread and read by another should be protected with a Lock, so there is a second lock to avoid simultaneous reading/writing of results.
results_lock = Lock()

def get_and_store_data(results):
    result = get_data()
    with results_lock:
        results.append(result)
You can schedule a get_and_store_data action with threading.Timer.
Below is the full code example:
from threading import Lock
from threading import Timer

t = [0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000] #times from start of test to collect readings at (min)

read_lock = Lock()
results_lock = Lock()

def get_data():
    import time
    import datetime
    import random
    with read_lock:
        time.sleep(1.5)
        #some code to connect to the DAQ (using pyvisa) and collect reading
        reading = [
            datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            random.randint(0, 10000) / 1000,
        ]
        return reading

def get_and_store_data(results):
    result = get_data()
    with results_lock:
        results.append(result)

# schedule measures for one test
def schedule_measures(measure_times, results):
    timers = []
    for t in measure_times:
        timer = Timer(t, get_and_store_data, args=[results])
        timer.start()
        timers.append(timer)

def main():
    results = []
    meas_times = [tim * 1 for tim in t]  # factor 1 treats the minute values as seconds, for a quick demo
    schedule_measures(meas_times, results)
    while True:
        msg = "Please press enter to display results or q and enter to quit"
        choice = input(msg).strip()
        if choice == "q":
            break
        print("Results:")
        with results_lock:
            for result in results:
                print(result)

main()
If you want to reduce the 'drift' between measurements, then you could do something like:
import time

def schedule_measures(measure_times, results):
    timers = []
    t_0 = time.time()
    for t in measure_times:
        now = time.time()
        timer = Timer(t - (now - t_0), get_and_store_data, args=[results])
        timer.start()
        timers.append(timer)
The drift will probably be low enough anyway, but it's a neat trick if you have more CPU-intensive actions in your schedule function or if you do not want to schedule all events at startup.
For prioritizing measurements it might be easier to create a sorted list of the calculated times at which each measurement should be performed and start the next timer only when the previous timer has fired; there would have to be some logic to decide which measurement should be scheduled next. I don't have time now, but will probably come back within the next 12 hours with a suggested algorithm.
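As a rough illustration of what such an algorithm could look like (this sketch is my addition, not the answer's promised follow-up; it reuses get_and_store_data from above and assumes the first PRIORITY_READINGS readings of each test are the time-critical ones):

import heapq
import time

PRIORITY_READINGS = 5  # assumption: how many early readings per test are time-critical

def run_schedule(tests, results):
    """tests maps a test id to its list of reading times (seconds from now)."""
    start = time.time()
    heap = []
    for test_id, offsets in tests.items():
        for idx, offset in enumerate(offsets):
            priority = 0 if idx < PRIORITY_READINGS else 1
            heapq.heappush(heap, (start + offset, priority, test_id, idx))
    while heap:
        wait = heap[0][0] - time.time()
        if wait > 0:
            time.sleep(wait)                   # nothing is due yet
        now = time.time()
        due = []
        while heap and heap[0][0] <= now:      # collect everything that is due
            due.append(heapq.heappop(heap))
        if not due:
            continue
        due.sort(key=lambda entry: entry[1])   # highest priority (lowest number) first
        entry = due.pop(0)                     # entry[2], entry[3] identify the test and reading
        for late in due:                       # late, lower-priority readings go back on the heap
            heapq.heappush(heap, late)
        get_and_store_data(results)            # the 1.5 s read still happens under read_lock

Starting a new test mid-run just means pushing its entries onto the heap (under a lock if done from another thread); when readings collide, the early readings of the newer test are served first.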

How to use multiple, but a limited number of threads in Python to process a list

I have a dataframe, several thousand rows long, that contains two pairs of GPS coordinates in one of its columns, from which I am trying to calculate the drive time between those coordinates. I have a function that takes those coordinates and returns the drive time, and it takes maybe 3-8 seconds per entry, so the total process can take quite a while. What I'd like to do is iterate through the list using maybe 3-5 threads: calculate a drive time and move on to the next entry while the other threads are completing, without creating more than 5 threads in the process. Independently, I have everything working: I can run multiple threads, I can track the thread count and wait until the number of active threads drops below the limit before the next one starts, and I can iterate the dataframe and calculate the drive time. However, I'm having trouble piecing it all together. Here's an edited, slimmed-down version of what I have.
import pandas
import threading
import arcgis

class MassFunction:
    #This is intended to keep track of the active threads
    threadCount = 0

    def startThread(functionName, params=None):
        #This kicks off a new thread and should count up to keep track of the threads
        MassFunction.threadCount += 1
        if params is None:
            t = threading.Thread(target=functionName)
        else:
            t = threading.Thread(target=functionName, args=[params])
        t.daemon = True
        t.start()

class GeoAnalysis:
    #This class handles the connection to the ArcGIS services
    def __init__(self):
        super(GeoAnalysis, self).__init__()
        self.my_gis = arcgis.gis.GIS("https://www.arcgis.com", username, pw)

    def drivetimeCalc(self, coordsString):
        #The coords come in as a string, formatted as 'lat_1,long_1,lat_2,long_2'
        #This is the bottleneck of the process, as this calculation/response
        #below takes a few seconds to get a response
        points = coordsString.split(", ")
        route_service_url = self.my_gis.properties.helperServices.route.url
        self.route_layer = arcgis.network.RouteLayer(route_service_url, gis=self.my_gis)
        point_a_to_point_b = "{0}, {1}; {2}, {3}".format(points[1], points[0], points[3], points[2])
        result = self.route_layer.solve(stops=point_a_to_point_b, return_directions=False, return_routes=True,
                                        output_lines='esriNAOutputLineNone', return_barriers=False,
                                        return_polygon_barriers=False, return_polyline_barriers=False)
        travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
        #This is intended to 'remove' one of the active threads
        MassFunction.threadCount -= 1
        return travel_time

class MainFunction:
    #This is to give access to the GeoAnalysis class from this class
    GA = GeoAnalysis()

    def closureDriveTimeCalc(self, coordsList):
        #This is intended to loop in the event that a fifth thread gets started
        #and will prevent additional threads from starting
        while MassFunction.threadCount > 4:
            pass
        MassFunction.startThread(MainFunction.GA.drivetimeCalc, coordsList)

    def driveTimeAnalysis(self, location):
        #This reads a csv file containing a few thousand entries.
        #Each entry/row contains gps coordinates, which need to be
        #iterated over to calculate the drivetimes
        locationMemberFile = pandas.read_csv(someFileName)
        #The built-in apply() method in pandas seems to be the
        #fastest way to iterate through the rows
        locationMemberFile['DRIVETIME'] = locationMemberFile['COORDS_COL'].apply(self.closureDriveTimeCalc)
When I run this right now, using VS Code, I can see the thread counts go up into the thousands in the call stack, so I feel like it is not waiting for the thread to finish and adding/subtracting from the threadCount value. Any ideas/suggestions/tips would be much appreciated.
EDIT: Essentially, my problem is how to get the travel_time value back so that it can be placed into the dataframe. I currently have no return statement for the closureDriveTimeCalc function, so while the function runs correctly, it doesn't send any information back into the apply() method.
Rather than do this in an apply, I'd use multiprocessing Pool.map:
from multiprocessing import Pool

with Pool(processes=4) as pool:
    locationMemberFile['DRIVETIME'] = pool.map(self.closureDriveTimeCalc, locationMemberFile['COORDS_COL'])
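Since the bottleneck here is a network round trip rather than CPU work, a bounded thread pool is another option and avoids pickling the instance method; a minimal sketch (my addition, assuming the same drivetimeCalc method and dataframe column as in the question):

from concurrent.futures import ThreadPoolExecutor

ga = GeoAnalysis()  # assumption: one shared connection, like the question's MainFunction.GA

with ThreadPoolExecutor(max_workers=5) as executor:
    # map() preserves input order, so the travel times line up with the rows;
    # returning travel_time from drivetimeCalc is all that's needed to get the value back
    locationMemberFile['DRIVETIME'] = list(executor.map(ga.drivetimeCalc, locationMemberFile['COORDS_COL']))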

Schedule at a given time while a function runs - Python

I have been looking at different schedulers in Python, such as sched (I'm a Windows user), but I can't really get a grip on them and I don't know if what I want is possible. My plan is shown in the picture below:
At time 00:21 I want the program to run function 2, BUT for the 2 minutes before the timer hits, function 1 should keep running and add as many items as possible to a list I have made. Basically...
Function 1 does its work for the 2 minutes before the timer. When it hits 00:21, function 1 should stop and function 2 should run, taking the list and using it in its own function; when it's done, it's done.
However, I don't know how to do this or where to start. I was thinking of writing my own timer, but it feels like that is not the solution. What do you suggest?
I think I would approach a problem like this by creating a class that subclasses threading.Thread. From there, you override the run method with the function that you want to perform, which in this case will put stuff in a list. Then, in main, you start that thread followed by a call to sleep. The class would look like this:
import threading

class ListBuilder(threading.Thread):
    def __init__(self):
        super().__init__()
        self._finished = False
        self.lst = []

    def get_data(self):
        # This is the data retrieval function.
        # It could be imported in, defined outside the class, or made static.
        pass

    def run(self):
        while not self._finished:
            self.lst.append(self.get_data())

    def stop(self):
        self._finished = True
Your main would then look something like
import time

if __name__ == '__main__':
    lb = ListBuilder()
    lb.start()
    time.sleep(120)   # sleep for 120 seconds, 2 minutes
    lb.stop()
    time.sleep(.1)    # A time buffer to make sure the final while loop iteration finishes.
                      # Depending on how long each loop iteration takes,
                      # it may not be necessary or it may need to be longer.
    do_stuff(lb.lst)  # performs actions on the resulting list (your function 2)
Now, all you have to do is use the Windows Task Scheduler to run it at 00:19 and you should be set.
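If you'd rather keep everything in Python instead of relying on the Windows Task Scheduler, a minimal sketch (my addition, not part of the answer) is to sleep until the start time before kicking off the thread:

import datetime
import time

def sleep_until(hour, minute):
    # wait until the next occurrence of hour:minute (local time)
    now = datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)
    time.sleep((target - now).total_seconds())

sleep_until(0, 19)  # then start the ListBuilder / do_stuff sequence from above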

Function takes a long time to return

I have a function that takes a long time to return control to the parent after the return call is made. I've timed it, and it frequently runs between 2-10 minutes. Here's the function itself:
def mass_update_database(database, queue):
    documents_to_update = []
    # Get all docs from queue for updating.
    while not queue.empty():
        documents_to_update.append(queue.get())
        queue.task_done()
    # Update database
    database.update(documents_to_update)
    # Compact the database, which removes previous revisions
    # and slims the size of our database.
    if database.compact():
        print('Compaction completed successfully.')
    else:
        print('Compaction failed.')
    print('Beginning return')
    d = datetime.datetime.now()
    return d
Some notes on the above code: queue is pretty large (8,500 dictionaries with at least 20 keys and potentially lengthy values). This is updating CouchDB, so the database object is a couchdb.Database object. The d variable is for timing (which is how I know it's taking so long).
I suspect that maybe the documents_to_update variable is so large that cleaning it up is taking a long time? But I ran it with a variation where I added documents_to_update = [] right before the timer started, and it still took a long time to return.
Here's where it's being called. The above function is in a different module called NS.
d = NS.mass_update_database(ns_database, docs_to_update_queue)
print('Returned', datetime.datetime.now() - d)
Anyone know any reason why returning control to the parent could take 2-10 minutes?
I should add that when I take the code from the function and stick it where the function call would go, it doesn't take forever to finish running where the return statement would be.
EDIT: I should clarify, the long time that it takes to return is from where I initialize d until control returns to the parent. All the code ABOVE that has finished and completed. What's taking a long time is from the return statement until the next statement in the parent that called mass_update_database

Why are Python threads so frustrating?

# test.py
import threading
import time
import random
from itertools import count

def fib(n):
    """fibonacci sequence
    """
    if n < 2:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    counter = count(1)
    start_time = time.time()

    def thread_worker():
        while True:
            try:
                # To simulate downloading
                time.sleep(random.randint(5, 10))
                # To simulate doing some process, will take about 0.14 ~ 0.63 second
                fib(n=random.randint(28, 31))
            finally:
                finished_number = counter.next()
                print 'Has finished %d, the average speed is %f per second.' % (finished_number, finished_number/(time.time() - start_time))

    threads = [threading.Thread(target=thread_worker) for i in range(100)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
The above is my test script.
The thread_worker function takes at most 10.63 seconds to run once.
I started 100 threads and expected a rate of roughly 10 finishes per second.
But the actual results were frustrating, as follows:
...
Has finished 839, the average speed is 1.385970 per second.
Has finished 840, the average speed is 1.386356 per second.
Has finished 841, the average speed is 1.387525 per second.
...
And if I comment "fib(n=random.randint(28, 31))" out, the results are as expected:
...
Has finished 1026, the average speed is 12.982740 per second.
Has finished 1027, the average speed is 12.995230 per second.
Has finished 1028, the average speed is 13.007719 per second.
...
Has finished 1029, the average speed is 12.860571 per second.
My question is: why is it so slow? I expected ~10 per second.
How can I make it faster?
The fib() function is just there to simulate doing some processing, e.g. extracting data from big HTML.
If I ask you to bake a cake and this takes you an hour and a half, 30 minutes for the dough and 60 minutes in the oven, by your logic I would expect 2 cakes to take exactly the same amount of time. However, there are some things you are missing. First, if I do not tell you at the beginning to bake two cakes, you have to make the dough twice, which is now 2 times 30 minutes, so it actually takes you two hours (you are free to work on the second cake once the first is in the oven).
Now let's assume I ask you to bake four cakes; again I do not allow you to make the dough once and split it for four cakes, you have to make it every time. The time we would expect now is 4 * 30 minutes + one hour for the last cake to bake. Now, for the sake of example, assume your wife helps you, meaning you can make the dough for two cakes in parallel. The time expected now is two hours, since each person has to bake two cakes. However, the oven you have can only fit 2 cakes at a time. The time now becomes 30 minutes to make the first dough, one hour to bake it while you make the second dough, and after the first two cakes are done you put the next two cakes in the oven, which takes another hour. If you add up the times you will see that it now took you 2 and a half hours.
If you take this further and I ask you for a thousand cakes, it will take you 500 and a half hours.
What does this have to do with threads?
Think of making the dough as an initial computation that creates 100% CPU load. Your wife is the second core in a dual-core machine. The oven is a resource for which your program generates 50% load.
In real threading you have some overhead to start the threads (I told you to bake the cakes, you have to ask your wife for help, which takes time), and you compete for resources (e.g. memory access; you and your wife can't use the mixer at the same time). The speedup is sublinear even if the number of threads is smaller than the number of cores.
Furthermore, smart programs download their code once (make the dough once) in the main thread and then hand it to the threads; there is no need to duplicate a computation. It does not get faster just because you compute it twice.
While Manoj's answer is correct, I think it needs more explanation. The Python GIL is a mutex used in CPython that essentially disables any parallel execution of Python code. It does not make threaded code slower, nor does it actually prevent the OS from scheduling Python threads simultaneously on all your cores. It just makes sure only one thread can execute Python byte code at the same time.
What does this mean for you? You essentially do two things:
Sleep: While performing this function no Python code is being executed, you just do nothing for 5 to 10 seconds. In the meantime any other thread can do exactly the same thing. Given that the overhead of calling time.sleep is negligible, you could have thousands of threads and it will probably still scale linearly like you expected. This is why everything works as soon as you comment out the fib line. Your average sleep time is 7.5 s, so with 100 threads you'd expect roughly 13 iterations per second.
A calculation of the Fibonacci sequence: This one is the problem, because it is actually executing Python code. Let's say it takes about 0.5 s per calculation. We've seen that only one calculation can run at a time, no matter how many threads you have. Given that, you'd only get to 2 calculations per second.
Now, the measured numbers are lower than these estimates, mainly because there is some overhead involved. First of all you are printing out data to the screen, which is almost always a surprisingly slow operation. Secondly, you're using 100 threads, which means that you're constantly switching between 100 thread stacks (even if they're sleeping), which is not a lightweight operation.
Note that threading can still be very useful though. For example for blocking calls where the execution is not done by python itself but some other resource. This could be waiting for the result of a socket, a sleep like in your example or even a calculation that is done outside of python itself (e.g. many numpy calculations).
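To actually get closer to the expected throughput for the CPU-bound part, the usual remedy (my addition, not from the answers, and written for Python 3) is to hand the fib call to a process pool so it runs outside the GIL, while the sleeping "download" stays in threads:

import random
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def thread_worker(cpu_pool):
    time.sleep(random.randint(5, 10))  # simulated download: sleeping releases the GIL
    # the CPU-heavy part runs in another process, so workers no longer serialize on the GIL
    return cpu_pool.submit(fib, random.randint(28, 31)).result()

if __name__ == '__main__':
    with ProcessPoolExecutor() as cpu_pool, ThreadPoolExecutor(max_workers=100) as io_pool:
        futures = [io_pool.submit(thread_worker, cpu_pool) for _ in range(200)]
        print('finished', sum(1 for f in futures if f.result() is not None), 'iterations')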
Python threads use the Global Interpreter Lock (GIL) to synchronize access to the Python interpreter's state. Compared to other threads, such as POSIX threads, the GIL can make Python threads significantly slower, especially when dealing with multiple cores. This is well known. Here is a really good presentation on the subject: www.dabeaz.com/python/UnderstandingGIL.pdf
You're looking for a faster solution. Memoizing results helps.
import collections
import functools

class Memoized(object):
    """Decorator. Caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned
    (not reevaluated).
    """
    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        if not isinstance(args, collections.Hashable):
            # uncacheable. a list, for instance.
            # better to not cache than blow up.
            return self.func(*args)
        if args in self.cache:
            return self.cache[args]
        else:
            value = self.func(*args)
            self.cache[args] = value
            return value

    def __repr__(self):
        """Return the function's docstring."""
        return self.func.__doc__

    def __get__(self, obj, objtype):
        """Support instance methods."""
        return functools.partial(self.__call__, obj)
if __name__ == '__main__':

    @Memoized
    def fibonacci(n):
        """Return the nth fibonacci number
        :param n: value
        """
        if n in (0, 1):
            return n
        return fibonacci(n - 1) + fibonacci(n - 2)

    print(fibonacci(35))
Try running it with and without the @Memoized decorator.
Recipe taken from http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize.
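For what it's worth, the standard library now ships the same idea: functools.lru_cache provides an equivalent memoizing decorator (this note is mine, not part of the original answer).

import functools

@functools.lru_cache(maxsize=None)
def fibonacci(n):
    return n if n in (0, 1) else fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(35))  # returns almost instantly thanks to the cache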
