I have a function that takes a long time to return control to the parent after the return statement executes. I've timed it, and the return frequently takes between 2 and 10 minutes. Here's the function itself:
def mass_update_database(database, queue):
    documents_to_update = []
    # Get all docs from queue for updating.
    while not queue.empty():
        documents_to_update.append(queue.get())
        queue.task_done()
    # Update database
    database.update(documents_to_update)
    # Compact the database, which removes previous revisions
    # and slims the size of our database.
    if database.compact():
        print('Compaction completed successfully.')
    else:
        print('Compaction failed.')
    print('Beginning return')
    d = datetime.datetime.now()
    return d
Some notes on the above code: queue is pretty large (8,500 dictionaries with at least 20 keys and potentially lengthy values). This is updating CouchDB, so the database object is a couchdb.Database object. The d variable is for timing (which is how I know it's taking so long).
I suspect that maybe the documents_to_update variable is so large that cleaning it up is taking a long time? But I ran it with a variation where I added documents_to_update = [] right before the timer started, and it still took a long time to return.
Here's where it's being called. The above function is in a different module called NS.
d = NS.mass_update_database(ns_database, docs_to_update_queue)
print('Returned', datetime.datetime.now() - d)
Anyone know any reason why returning control to the parent could take 2-10 minutes?
I should add that when I take the code from the function and stick it where the function call would go, it doesn't take forever to finish running where the return statement would be.
EDIT: I should clarify, the long time that it takes to return is from where I initialize d until control returns to the parent. All the code ABOVE that has finished and completed. What's taking a long time is from the return statement until the next statement in the parent that called mass_update_database
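A cheap way to probe the cleanup suspicion above is to make the garbage collector's activity visible around the call. This is only a diagnostic sketch (not from the original post), and it only rules the cyclic collector in or out; reference-count deallocation of the big list is unaffected by it:

import datetime
import gc

gc.set_debug(gc.DEBUG_STATS)  # print a line to stderr for every collection pass
before_call = datetime.datetime.now()
d = NS.mass_update_database(ns_database, docs_to_update_queue)
after_return = datetime.datetime.now()
print('return-to-parent gap:', after_return - d)
print('whole call:', after_return - before_call)
gc.set_debug(0)

If no collection passes are reported during the gap, the delay is coming from somewhere other than the cyclic garbage collector.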
Related
I have a function that produces values quickly, and a second function that takes each of those values and spends around 10-20 seconds computing something from it. Since the first function releases values so fast, I need a thread pool that runs around 40 instances of the second function concurrently. How can I go about doing this?
E.g.:
def function1(): #gives values fast
    while(values not complete):
        newvalue = xyz
        UseAvailableThread(function2(newvalue)) # wait if thread isn't available

def function2(value):
    #computes some data - takes 10-20 seconds
I hope the example helps, how can I go about doing this?
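One way to get this shape (a sketch, not from the original thread) is concurrent.futures.ThreadPoolExecutor with max_workers=40: submit returns immediately and queues the work, and the pool never runs more than 40 calls to function2 at once. produce_value and values_remaining below are stand-ins for however function1 obtains its values:

from concurrent.futures import ThreadPoolExecutor

def function2(value):
    # computes some data - takes 10-20 seconds (placeholder from the question)
    ...

def function1(produce_value, values_remaining):
    with ThreadPoolExecutor(max_workers=40) as pool:
        futures = []
        while values_remaining():
            # submit returns immediately; at most 40 function2 calls run at once
            futures.append(pool.submit(function2, produce_value()))
        # leaving the with-block waits for all submitted work to finish
        return [f.result() for f in futures]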
I have a dataframe, several thousand rows long, with two pairs of GPS coordinates in one of its columns, and I am trying to calculate the drive time between each pair. I have a function that takes in those coordinates and returns the drive time, and it takes maybe 3-8 seconds per entry, so the total process can take quite a while.
What I'd like to do is iterate through the dataframe with maybe 3-5 threads: calculate a drive time, move on to the next entry while the other threads are completing, and never create more than 5 threads in the process. Independently, I have everything working: I can run multiple threads, I can track the thread count and wait until the number of active threads drops below the limit before the next one starts, and I can iterate the dataframe and calculate the drive times. However, I'm having trouble piecing it all together. Here's an edited, slimmed-down version of what I have.
import pandas
import threading
import arcgis

class MassFunction:
    #This is intended to keep track of the active threads
    MassFunction.threadCount = 0

    def startThread(functionName, params=None):
        #This kicks off a new thread and should count up to keep track of the threads
        MassFunction.threadCount += 1
        if params is None:
            t = threading.Thread(target=functionName)
        else:
            t = threading.Thread(target=functionName, args=[params])
        t.daemon = True
        t.start()

class GeoAnalysis:
    #This class handles the connection to the ArcGIS services
    def __init__(self):
        super(GeoAnalysis, self).__init__()
        self.my_gis = arcgis.gis.GIS("https://www.arcgis.com", username, pw)

    def drivetimeCalc(self, coordsString):
        #The coords come in as a string, formatted as 'lat_1,long_1,lat_2,long_2'
        #This is the bottleneck of the process, as this calculation/response
        #below takes a few seconds to get a response
        points = coordsString.split(", ")
        route_service_url = self.my_gis.properties.helperServices.route.url
        self.route_layer = arcgis.network.RouteLayer(route_service_url, gis=self.my_gis)
        point_a_to_point_b = "{0}, {1}; {2}, {3}".format(points[1], points[0], points[3], points[2])
        result = self.route_layer.solve(stops=point_a_to_point_b, return_directions=False, return_routes=True, output_lines='esriNAOutputLineNone', return_barriers=False, return_polygon_barriers=False, return_polyline_barriers=False)
        travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
        #This is intended to 'remove' one of the active threads
        MassFunction.threadCount -= 1
        return travel_time

class MainFunction:
    #This is to give access to the GeoAnalysis class from this class
    GA = GeoAnalysis()

    def closureDriveTimeCalc(self, coordsList):
        #This is intended to loop in the event that a fifth loop gets started and will prevent additional threads from starting
        while MassFunction.threadCount > 4:
            pass
        MassFunction.startThread(MainFunction.GA.drivetimeCalc, coordsList)

    def driveTimeAnalysis(self, location):
        #This reads a csv file containing a few thousand entries.
        #Each entry/row contains gps coordinates, which need to be
        #iterated over to calculate the drivetimes
        locationMemberFile = pandas.read_csv(someFileName)
        #The built-in apply() method in pandas seems to be the
        #fastest way to iterate through the rows
        locationMemberFile['DRIVETIME'] = locationMemberFile['COORDS_COL'].apply(self.closureDriveTimeCalc)
When I run this right now, using VS Code, I can see the thread counts go up into the thousands in the call stack, so I feel like it is not waiting for threads to finish, and the threadCount value is not being incremented and decremented the way I intended. Any ideas/suggestions/tips would be much appreciated.
EDIT: Essentially my problem is how do I get the travel_time value back so that it can be placed into the dataframe. I currently have no return statement for closureDriveTimeCalc function, so while the function runs correctly, it doesn't send any information back into the apply() method.
Rather than do this in an apply, I'd use multiprocessing Pool.map:
from multiprocessing import Pool

with Pool(processes=4) as pool:
    locationMemberFile['DRIVETIME'] = pool.map(self.closureDriveTimeCalc, locationMemberFile['COORDS_COL'])
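Since the slow step here is a network round trip rather than CPU work (see the comment in drivetimeCalc above), a thread pool is also a workable fit and sidesteps multiprocessing's pickling constraints on bound methods. A minimal sketch, assuming the drivetimeCalc method and column names from the question; compute_drivetimes itself is an illustrative helper, not part of the original code:

from concurrent.futures import ThreadPoolExecutor

def compute_drivetimes(location_member_file, drivetime_calc, max_workers=5):
    # Run the per-row drive-time lookup in a small pool of worker threads.
    # executor.map preserves input order, so results line up with the rows.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(drivetime_calc, location_member_file['COORDS_COL']))
    location_member_file['DRIVETIME'] = results
    return location_member_file

Usage would be roughly compute_drivetimes(locationMemberFile, GA.drivetimeCalc), with max_workers set to the 3-5 threads mentioned in the question.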
I have been looking at different schedulers in Python, such as sched (I'm a Windows user), but I can't really get a grip on them and I don't know whether what I want is even possible. My plan is to do something like the picture below:
We can see that time 00:21 is when I want the program to run function 2, BUT function 1 should have been adding items to a list I have made, as many as possible, during the 2 minutes before the timer hits. Basically...
Function 1 does its work for the 2 minutes before the timer. When it hits 00:21, function 1 should stop and function 2 should run, taking the list and using it in its own function; when it's done, it's done.
However, I don't know how to do this or where to start. I was thinking of writing my own timer, but it feels like that is not the solution. What do you suggest?
I think I would approach a problem like this by creating a class that subclasses threading.Thread. From there, you override the run method with the function that you want to perform, which in this case will put stuff in a list. Then, in main, you start that thread followed by a call to sleep. The class would look like this:
import threading

class ListBuilder(threading.Thread):
    def __init__(self):
        super().__init__()
        self._finished = False
        self.lst = []

    def get_data(self):
        # This is the data retrieval function (placeholder).
        # It could be imported in, defined outside the class, or made static.
        pass

    def run(self):
        while not self._finished:
            self.lst.append(self.get_data())

    def stop(self):
        self._finished = True
Your main would then look something like
import time

if __name__ == '__main__':
    lb = ListBuilder()
    lb.start()
    time.sleep(120)   # sleep for 120 seconds, 2 minutes
    lb.stop()
    time.sleep(.1)    # A time buffer to make sure the final while loop finishes.
                      # Depending on how long each while loop iteration takes,
                      # it may not be necessary or it may need to be longer.
    do_stuff(lb.lst)  # performs actions on the resulting list
Now, all you have to do is use the Windows Task Scheduler to run it at 00:19 and you should be set.
I have a concurrent.futures.ThreadPoolExecutor and a list. And with the following code I add futures to the ThreadPoolExecutor:
for id in id_list:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
And then I wait upon the list:
concurrent.futures.wait(self._futures)
However, self.myfunc does some network I/O, so there will be some network exceptions. When an error occurs, self.myfunc submits a new self.myfunc with the same id to the same thread pool and adds a new future to the same list, as above:
try:
    do_stuff(id)
except:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
    return None
Here comes the problem: I got an error on the line of concurrent.futures.wait(self._futures):
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 277, in wait
    f._waiters.remove(waiter)
ValueError: list.remove(x): x not in list
How should I properly add new Futures to a list while already waiting upon it?
Looking at the implementation of wait(), it certainly doesn't expect that anything outside concurrent.futures will ever mutate the list passed to it. So I don't think you'll ever get that "to work". It's not just that it doesn't expect the list to mutate, it's also that significant processing is done on list entries, and the implementation has no way to know that you've added more entries.
Untested, I'd suggest trying this instead: skip all that, and just keep a running count of threads still active. A straightforward way is to use a Condition guarding a count.
Initialization:
self._count_cond = threading.Condition()
self._thread_count = 0
When my_func is entered (i.e., when a new thread starts):
with self._count_cond:
    self._thread_count += 1
When my_func is done (i.e., when a thread ends), for whatever reason (exceptional or not):
with self._count_cond:
    self._thread_count -= 1
    self._count_cond.notify()  # wake up the waiting logic
And finally the main waiting logic:
with self._count_cond:
    while self._thread_count:
        self._count_cond.wait()
POSSIBLE RACE
It seems possible that the thread count could reach 0 while work for a new thread has been submitted, but before its my_func invocation starts running (and so before _thread_count is incremented to account for the new thread).
So the:
with self._count_cond:
    self._thread_count += 1
part should really be done instead right before each occurrence of
self._thread_pool.submit(self.myfunc, id)
Or write a new method to encapsulate that pattern; e.g., like so:
def start_new_thread(self, id):
    with self._count_cond:
        self._thread_count += 1
    self._thread_pool.submit(self.myfunc, id)
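Putting those pieces together, a minimal self-contained sketch of the counting approach might look like this (the Runner class name, pool size, and placeholder work are illustrative; only the Condition-guarded counter pattern comes from the answer above):

import concurrent.futures
import threading

class Runner:
    def __init__(self):
        self._thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)
        self._count_cond = threading.Condition()
        self._thread_count = 0

    def start_new_task(self, id):
        # Count the task *before* submitting, so the count can never be
        # observed as 0 while submitted work is still pending.
        with self._count_cond:
            self._thread_count += 1
        self._thread_pool.submit(self._myfunc, id)

    def _myfunc(self, id):
        try:
            pass  # the network I/O goes here; resubmit via start_new_task on failure
        finally:
            with self._count_cond:
                self._thread_count -= 1
                self._count_cond.notify()  # wake up the waiting logic

    def wait_all(self):
        with self._count_cond:
            while self._thread_count:
                self._count_cond.wait()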
A DIFFERENT APPROACH
Offhand, I expect this could work too (but, again, haven't tested it): keep all your code the same except change how you're waiting:
while self._futures:
    self._futures.pop().result()
So this simply waits for one thread at a time, until none remain.
Note that .pop() and .append() on lists are atomic in CPython, so no need for your own lock. And because your my_func() code appends before the thread it's running in ends, the list won't become empty before all threads really are done.
AND YET ANOTHER APPROACH
Keep the original waiting code, but rework the rest not to create new threads in case of exception. For example, rewrite my_func to return True if it quits due to an exception and False otherwise, and start the threads running a wrapper instead:
def my_func_wrapper(self, id):
    keep_going = True
    while keep_going:
        keep_going = self.my_func(id)
This may be especially attractive if you someday decide to use multiple processes instead of multiple threads (creating new processes can be a lot more expensive on some platforms).
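For concreteness, the reworked my_func would then take roughly this shape (a sketch; do_stuff is the placeholder from the question):

def my_func(self, id):
    try:
        do_stuff(id)    # the network I/O from the question
    except Exception:
        return True     # tell my_func_wrapper to try this id again
    return False        # success: the wrapper's loop ends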
AND A WAY USING cf.wait()
Another way is to change just the waiting code:
while self._futures:
    fs = self._futures[:]
    for f in fs:
        self._futures.remove(f)
    concurrent.futures.wait(fs)
Clear? This makes a copy of the list to pass to .wait(), and the copy is never mutated. New threads show up in the original list, and the whole process is repeated until no new threads show up.
Which of these ways makes most sense seems to me to depend mostly on pragmatics, but there's not enough info about all you're doing for me to make a guess about that.
Let's say I add 100 push tasks (as group 1) to my task queue. Then I add another 200 tasks (as group 2) to the same queue. How can I tell when all the tasks of group 1 are finished?
It looks like QueueStatistics will not help here, and tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like #mgilson said, and decorating my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for, if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time

from google.appengine.ext import deferred

def done_work():
    logging.info('work done!')

def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst

def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))

def make_some_tasks():
    task = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(task, n, shard_key=key)
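For reference, the missing decrement might look roughly like the following. This is only a sketch: the GeneralCounterShard / GeneralCounterShardConfig models and SHARD_KEY_TEMPLATE are assumed to match the ones in the linked sharding_counters article, and the exact names may need adjusting.

import random

from google.appengine.ext import ndb

SHARD_KEY_TEMPLATE = 'shard-{}-{:d}'  # assumed to match the article's template

class GeneralCounterShardConfig(ndb.Model):
    # Mirrors the article's config model: how many shards a counter uses.
    num_shards = ndb.IntegerProperty(default=20)

class GeneralCounterShard(ndb.Model):
    # Mirrors the article's shard model: one partial count per shard.
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def _decrement(name, num_shards):
    # Pick one shard at random and decrement it transactionally.
    index = random.randint(0, num_shards - 1)
    shard_key_string = SHARD_KEY_TEMPLATE.format(name, index)
    counter = GeneralCounterShard.get_by_id(shard_key_string)
    if counter is None:
        counter = GeneralCounterShard(id=shard_key_string)
    counter.count -= 1
    counter.put()

def decrement(name):
    """Decrement the value for a given sharded counter."""
    config = GeneralCounterShardConfig.get_or_insert(str(name))
    _decrement(name, config.num_shards)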
Tasks are not guaranteed to run only once; occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this, using a counter incremented at task enqueueing and decremented at task completion wouldn't work: it would be decremented twice in such a duplicate-execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
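For what it's worth, a rough sketch of that tracking idea (untested, like the answer itself): the TaskGroup model, the /work URL, and mark_task_done are hypothetical illustrations, while taskqueue.add returning a named Task, the X-AppEngine-TaskName request header, and ndb are the real pieces:

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class TaskGroup(ndb.Model):
    # Hypothetical bookkeeping entity, one per group (key id = group name),
    # holding the names of tasks that have not yet reported completion.
    pending_task_names = ndb.StringProperty(repeated=True)

def enqueue_group(group_id, payloads):
    names = []
    for payload in payloads:
        # taskqueue.add fills in the auto-assigned, queue-unique task name
        # on the returned Task after a successful enqueue.
        task = taskqueue.add(url='/work', params={'group': group_id,
                                                  'payload': payload})
        names.append(task.name)
    TaskGroup(id=group_id, pending_task_names=names).put()

@ndb.transactional
def mark_task_done(group_id, task_name):
    # Called at the end of the /work handler, which can read its own name
    # from the X-AppEngine-TaskName request header. Removing the name is
    # idempotent, so a duplicate execution cannot throw the bookkeeping off.
    group = TaskGroup.get_by_id(group_id)
    if group is None:
        return False
    if task_name in group.pending_task_names:
        group.pending_task_names.remove(task_name)
        group.put()
    return not group.pending_task_names  # True once the whole group is done

One entity per group does mean all of a group's completions contend on the same entity group, so a very hot group's name list might need to be split across several entities, much like the sharded counter in the other answer.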