My queue is empty after multiprocessing.Process instances finish - python

I have a python script where at the top of the file I have:
result_queue = Queue.Queue()
key_list = *a large list of small items* #(actually from bucket.list() via boto)
I have learned that Queues are process safe data structures. I have a method:
def enqueue_tasks(keys):
    for key in keys:
        try:
            result = perform_scan.delay(key)
            result_queue.put(result)
        except:
            print "failed"
The perform_scan.delay() function here actually calls a celery worker, but I don't think that is relevant (it is an asynchronous process call).
I also have:
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
Lastly I have a main() function:
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 40)]
    concurrent.futures.wait(futures)
    print len(result_queue)
The result from the print statement is a 0. Yet if I include a print statement of the size of result_queue in enqueue_tasks, while the program is running, I can see that the size is increasing and things are being added to the queue.
Any ideas what is happening?

It looks like there's a simpler solution to this problem.
You're building a list of futures. The whole point of futures is that they're future results. In particular, whatever each function returns, that's the (eventual) value of the future. So, don't do the whole "push results onto a queue" thing at all, just return them from the task function, and pick them up from the futures.
The simplest way to do this is to break that loop up so that each key is a separate task, with a separate future. I don't know whether that's appropriate for your real code, but if it is:
def do_task(key):
    try:
        return perform_scan.delay(key)
    except:
        print "failed"
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_task, key) for key in key_list]
    # If you want to do anything with these results, you probably want
    # a loop around concurrent.futures.as_completed or similar here,
    # rather than waiting for them all to finish, ignoring the results,
    # and printing the number of them.
    concurrent.futures.wait(futures)
    print len(futures)
Of course that doesn't do the grouping. But do you need it?
The most likely reason for the grouping to be necessary is that the tasks are so tiny that the overhead in scheduling them (and pickling the inputs and outputs) swamps the actual work. If that's true, then you can almost certainly wait until a whole batch is done to return any results. Especially given that you're not even looking at the results until they're all done anyway. (This model of "split into groups, process each group, merge back together" is pretty common in cases like numerical work, where each element may be tiny, or elements may not be independent of each other, but there are groups that are big enough or independent from the rest of the work.)
At any rate, that's almost as simple:
def do_tasks(keys):
    results = []
    for key in keys:
        try:
            result = perform_scan.delay(key)
            results.append(result)
        except:
            print "failed"
    return results
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    print sum(len(future.result()) for future in concurrent.futures.as_completed(futures))
Or, if you prefer to first wait and then calculate:
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    concurrent.futures.wait(futures)
    print sum(len(future.result()) for future in futures)
But again, I doubt you need even this.
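(Side note, hedged: if you are on Python 3.5+, ProcessPoolExecutor.map grew a chunksize argument that does this batching for you behind the scenes; on Python 2 you would keep the explicit grouper instead. A rough sketch of that version:)

def do_task(key):
    return perform_scan.delay(key)

def main():
    with concurrent.futures.ProcessPoolExecutor(10) as executor:
        # chunksize pickles and ships the keys in batches of 40, much like grouper()
        results = list(executor.map(do_task, key_list, chunksize=40))
    print(len(results))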

You need to use a multiprocessing.Queue, not a Queue.Queue. Queue.Queue is thread-safe, not process-safe, so the changes you make to it in one process are not reflected in any others.
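A minimal sketch of that change, using plain multiprocessing.Process as in the question's title. (As an aside: if you keep the ProcessPoolExecutor instead, a raw multiprocessing.Queue can't be passed to the workers as a task argument; you'd reach for a multiprocessing.Manager().Queue() there.)

from multiprocessing import Process, Queue

def enqueue_tasks(keys, result_queue):
    for key in keys:
        if key is None:                    # grouper() pads the last group with None
            continue
        result_queue.put(perform_scan.delay(key))

def main():
    result_queue = Queue()                 # shared between processes, unlike Queue.Queue
    workers = [Process(target=enqueue_tasks, args=(group, result_queue))
               for group in grouper(key_list, 40)]
    for w in workers:
        w.start()
    # Pull one result per key *before* joining, so a child never blocks on
    # unread queue data while we wait for it to exit.
    results = [result_queue.get() for _ in key_list]
    for w in workers:
        w.join()
    print len(results)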

Multithreading / Multiprocessing with a for-loop in Python3

I have this task which is sort of I/O bound and CPU bound at the same time.
Basically I am getting a list of queries from a user, searching each of them on Google (via a custom search API), storing each query's results in a .txt file, and storing all results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped in an object which has two member fields that I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e:
class MyObject(object):
    _my_list = []
    _my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search on Google. It is safe to assume that it is not written into.
For each query: I search it on Google, grab the top results, and store them in _my_dict.
I want to do this in parallel. I thought that threading might be a good fit, but it seems to slow the work down...
Here is how I attempted to do it (this is the method which does the entire job per query):
def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """
This is the method which is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()
The above execution does not work; I get corrupted results...
When I use multithreading however, the results are fine, but very slow:
def find_articles(self):
    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, starting a thread per query is too fast and gets bottlenecked). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as many network resources as possible without a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the correct thread pool size to avoid a bottleneck, but overall I've found this to work well.
import threading

class MultiThread:
    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
        func : function
            Callback function to multi-thread
        list_data : list
            List of data to multi-thread over
        thread_cap : int
            Maximum number of threads available in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool += [thread]
            thread.start()

    def _wrapper(self):
        while not self.complete:
            if self.current_index < self.total_index:
                self.current_index += 1
                self.func(self.list_data[self.current_index])
            else:
                self.complete = True

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()
import requests #, time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

#start_time = time.time()
_my_list = ["examplequery" + str(n) for n in range(100)]

mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()
# output queries to file
#print("Time:{:2f}".format(time.time()-start_time))
You could also open the file and output whatever you need as you go, or output the data at the end. Obviously, my replica here isn't exactly what you need, but it's solid boilerplate with a lightweight class I made that will greatly reduce the time it takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed cycling through 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before hitting the bottleneck.
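For comparison, the stdlib's thread pools can do the same capped-worker pattern. Here is a rough equivalent of the toy example above using concurrent.futures.ThreadPoolExecutor (same made-up query list, so treat it as a sketch rather than a drop-in for your class):

import concurrent.futures
import requests

base_url = "https://www.google.com/search?q="
s = requests.Session()
_my_dict = {}

def example_callback_func(query):
    # same toy work as above: fetch the page and stash the text
    r = s.get(base_url + query)
    _my_dict[query] = r.text

_my_list = ["examplequery" + str(n) for n in range(100)]

# max_workers plays the role of thread_cap
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    list(executor.map(example_callback_func, _my_list))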

end python multithreading if one of the threads ends first

So, I have the following code which I'm using to run tasks in multiple functions at the same time:
if __name__ == '__main__':
    po = Pool(processes=10)
    resultslist = []
    i = 1
    while i <= 2:
        arg = [i]
        result = po.apply_async(getAllTimes, arg)
        resultslist.append(result)
        i += 1

    feedback = []
    for res in resultslist:
        multipresults = res.get()
        feedback.append(multipresults)

    matchesBegin, matchesEnd = feedback[0][0], feedback[0][1]
    TheTimes = feedback[1]
This works well for me. I'm currently using it to run two jobs at the same time.
But the problem is, I don't always need both of the simultaneously running jobs to complete before I move on to the next phases of the script. Sometimes, if the first job completes successfully and I'm able to confirm it by verifying what's in matchesBegin and matchesEnd, I want to be able to just move on and kill off the other job.
My issue is, I don't know how to do that.
Job 1 usually completes much faster than Job 2. So, what I'm trying to do here is: IF Job 1 completes before Job 2, AND the content of the variables from Job 1 (matchesBegin, matchesEnd) is True, then I want Job 2 to be blown away because I don't need it anymore. If I don't blow it away, it will only prolong the completion of the script. Job 2 should only be allowed to continue to run if the results of the variables from Job 1 aren't True.
I do not know all the details of your use case, but I would hope this provides you some direction. Essentially, what you've started with apply_async() could do that job, but you would also need to use its callback argument and evaluate the incoming result to see if it fulfills your criteria, taking a corresponding action if it does. I've hacked around your code a bit and got this:
class ParallelCall:
    def __init__(self, jobs=None, check_done=lambda res: None):
        self.pool = Pool(processes=jobs)
        self.pending_results = []
        self.return_results = []
        self.check_done = check_done

    def _callback(self, incoming_result):
        self.return_results.append(incoming_result)
        if self.check_done(incoming_result):
            self.pool.terminate()
        return incoming_result

    def run_fce(self, fce, *args, **kwargs):
        self.pending_results.append(self.pool.apply_async(fce, *args,
                                                          callback=self._callback,
                                                          **kwargs))

    def collect(self):
        self.pool.close()
        self.pool.join()
        return self.return_results
Which you could use like this:
def final_result(result_to_check):
    return result_to_check[0] == result_to_check[1]

if __name__ == '__main__':
    runner = ParallelCall(jobs=2, check_done=final_result)
    for i in range(1, 3):
        arg = [i]
        runner.run_fce(getAllTimes, arg)
    feedback = runner.collect()
    TheTimes = feedback[-1]  # last completed getAllTimes call
What does it do? runner is an instance of ParallelCall (note: I've used only two workers as you seem to only run two jobs) which uses the final_result() function to evaluate whether a result is a suitable candidate for a valid final result. In this case, that means its first and second items are equal.
We use that to start getAllTimes twice, like in your example above. It uses apply_async() just as you did, but we now also have a callback registered, through which we pass the result when it becomes available. We also pass it through the function registered as check_done to see if we got an acceptable final result, and if so (the return value evaluates to True) we just stop all the worker processes.
Disclaimer: this is not exactly what your example does, because the returned list is not in the order in which the function calls took place, but in the order in which the results became available.
Then we collect() the available results into feedback. This method closes the pool so it won't accept any further tasks (close()) and then waits for the workers to finish (join()) (they could be stopped early if one of the incoming results matched the registered criterion). Then we return all the results (either up to the matching result or until all the work has been done).
I've put this into the ParallelCall class so that I can conveniently keep track of the pending and finished results as well as know what my pool is. The default check_done is basically a (callable) no-op.

Is there any way to understand when all tasks are finished?

Let's say I add 100 push tasks (as group 1) to my tasks-queue. Then I add another 200 tasks (as group 2) to the same queue. How can I understand if all tasks of group 1 are finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like @mgilson said, and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time

from google.appengine.ext import deferred

def done_work():
    logging.info('work done!')

def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst

def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))

def make_some_tasks():
    task = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(task, n, shard_key=key)
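(The article linked above only ships increment() and get_count(); the decrement() counterpart mentioned earlier might look roughly like the sketch below. This is a hedged illustration assuming an ndb shard model in the same spirit as the article, not the article's actual code; CounterShard and NUM_SHARDS are made-up names here.)

import random

from google.appengine.ext import ndb

NUM_SHARDS = 20                                   # assumption, mirrors the article's idea

class CounterShard(ndb.Model):                    # hypothetical shard model
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def decrement(key):
    # pick one shard for this counter key and decrease it by one
    shard_id = '{}-{}'.format(key, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_or_insert(shard_id)
    shard.count -= 1
    shard.put()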
Tasks are not guaranteed to run only once, occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this using a counter incremented at task enqueueing and decremented at task completion wouldn't work - it would be decremented twice in such a duplicate execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.

multithreading check membership in Queue and stop the threads

I want to iterate over a list using 2 threads: one from the front and the other from the back, putting the elements into a Queue on each iteration. But before putting a value into the Queue I need to check whether it is already in the Queue (which happens when the other thread has already put that value in). When this happens, I need to stop the thread and return the list of traversed values for each thread.
This is what I have tried so far:
from Queue import Queue
from threading import Thread, Event

class ThreadWithReturnValue(Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs={}, Verbose=None):
        Thread.__init__(self, group, target, name, args, kwargs, Verbose)
        self._return = None
    def run(self):
        if self._Thread__target is not None:
            self._return = self._Thread__target(*self._Thread__args,
                                                **self._Thread__kwargs)
    def join(self):
        Thread.join(self)
        return self._return

main_path = Queue()

def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue

def a(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'a'
        if is_in_queue(i, main_path):
            return l
        main_path.put(i)

def b(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'b'
        if is_in_queue(i, main_path):
            return l
        main_path.put(i)

g = ['a','b','c','d','e','f','g','h','i','j','k','l']

t1 = ThreadWithReturnValue(target=a, args=(main_path, g))
t2 = ThreadWithReturnValue(target=b, args=(main_path, g[::-1]))
t2.start()
t1.start()
# Wait for all produced items to be consumed
print main_path.join()
I used ThreadWithReturnValue that will create a custom thread that returns the value.
And for membership checking I used the following function:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
Now if I first start t1 and then t2, I will get 12 a's, then one b, then it doesn't do anything and I need to terminate the Python process manually!
But if I first run t2 and then t1, I will get the following result:
b
b
b
b
ab
ab
b
b
b
b
a
a
So my question is: why does Python treat the threads differently in these two cases? And how can I terminate the threads and make them communicate with each other?
Before we get into bigger problems, you're not using Queue.join right.
The whole point of this function is that a producer who adds a bunch of items to a queue can wait until the consumer or consumers have finished working on all of those items. This works by having the consumer call task_done after they finish working on each item that they pulled off with get. Once there have been as many task_done calls as put calls, the queue is done. You're not doing a get anywhere, much less a task_done, so there's no way the queue can ever be finished. So, that's why you block forever after the two threads finish.
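For reference, the intended put/get/task_done/join protocol looks roughly like this (a minimal sketch, separate from your traversal code):

from Queue import Queue
from threading import Thread

q = Queue()

def consumer():
    while True:
        item = q.get()       # pull an item off the queue
        # ... do some work with item ...
        q.task_done()        # tell the queue this item is finished

t = Thread(target=consumer)
t.daemon = True              # let the program exit once the queue is drained
t.start()

for item in range(10):
    q.put(item)

q.join()                     # returns only after task_done() matched every put()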
The first problem here is that your threads are doing almost no work outside of the actual synchronization. If the only thing they do is fight over a queue, only one of them is going to be able to run at a time.
Of course that's common in toy problems, but you have to think through your real problem:
If you're doing a lot of I/O work (listening on sockets, waiting for user input, etc.), threads work great.
If you're doing a lot of CPU work (calculating primes), threads don't work in Python because of the GIL, but processes do.
If you're actually primarily dealing with synchronizing separate tasks, neither one is going to work well (and processes will be worse). It may still be simpler to think in terms of threads, but it'll be the slowest way to do things. You may want to look into coroutines; Greg Ewing has a great demonstration of how to use yield from to use coroutines to build things like schedulers or many-actor simulations.
Next, as I alluded to in your previous question, making threads (or processes) work efficiently with shared state requires holding locks for as short a time as possible.
So, if you have to search a whole queue under a lock, that had better be a constant-time search, not a linear-time search. That's why I suggested using something like an OrderedSet recipe for the container inside the queue, rather than the plain deque/list that the stdlib's Queue.Queue uses. Then this function:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
… is only blocking the queue for a tiny fraction of a second—just long enough to look up a hash value in a table, instead of long enough to compare every element in the queue against x.
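A hedged sketch of that idea (SetQueue is a made-up name): a Queue.Queue subclass whose internal container is an OrderedDict used as an ordered set, so the membership test under q.mutex becomes a hash lookup:

from Queue import Queue
from collections import OrderedDict

class SetQueue(Queue):
    """FIFO queue whose backing store supports O(1) membership tests."""
    def _init(self, maxsize):
        self.queue = OrderedDict()                # keys act as an ordered set
    def _put(self, item):
        self.queue[item] = None
    def _get(self):
        return self.queue.popitem(last=False)[0]  # pop in FIFO order

def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue                       # hash lookup, not a linear scan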
Finally, I tried to explain about race conditions on your other question, but let me try again.
You need a lock around every complete "transaction" in your code, not just around the individual operations.
For example, if you do this:
with queue locked:
    see if x is in the queue
if x was not in the queue:
    with queue locked:
        add x to the queue
… then it's always possible that x was not in the queue when you checked, but in the time between when you unlocked it and relocked it, someone added it. This is exactly why it's possible for both threads to stop early.
To fix this, you need to put a lock around the whole thing:
with queue locked:
    if x is not in the queue:
        add x to the queue
Of course this goes directly against what I said before about locking the queue for as short a time as possible. Really, that's what makes multithreading hard in a nutshell. It's easy to write safe code that just locks everything for as long as might conceivably be necessary, but then your code ends up only using a single core, while all the other threads are blocked waiting for the lock. And it's easy to write fast code that just locks everything as briefly as possible, but then it's unsafe and you get garbage values or even crashes all over the place. Figuring out what needs to be a transaction, and how to minimize the work inside those transactions, and how to deal with the multiple locks you'll probably need to make that work without deadlocking them… that's not so easy.
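As a concrete example of making the whole check-and-add a single transaction, here is a hedged sketch that holds q.mutex for both steps. It pokes at Queue.Queue internals (_put, unfinished_tasks, not_empty) and assumes an unbounded queue, so treat it as an illustration rather than a drop-in:

def put_if_absent(x, q):
    """Atomically enqueue x unless it is already in the queue.

    Returns True if x was added, False if another thread beat us to it.
    """
    with q.mutex:
        if x in q.queue:
            return False
        q._put(x)                 # what Queue.put() does under the hood
        q.unfinished_tasks += 1
        q.not_empty.notify()      # wake up any blocked get()
        return True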
A couple of things that I think can be improved:
Due to the GIL, you might want to use the multiprocessing (rather than threading) module. In general, CPython threading will not cause CPU intensive work to speed up. (Depending on what exactly is the context of your question, it's also possible that multiprocessing won't, but threading almost certainly won't.)
A function like your is_in_queue would likely lead to high contention.
The locked time seems linear in the number of items that need to be traversed:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
So, instead, you could possibly do the following.
Use multiprocessing with a shared dict:
from multiprocessing import Process, Manager

manager = Manager()
d = manager.dict()

# Fn definitions and such

p1 = Process(target=p1, args=(d,))
p2 = Process(target=p2, args=(d,))
within each function, check for the item like this:
def p1(d):
    # Stuff
    if 'foo' in d:
        return
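Fleshed out a bit (a sketch with hypothetical walk_forward/walk_backward functions, just to show the shared-dict wiring; note that the membership check and the write are still two separate steps, so the race condition discussed in the other answer still applies unless you add a lock):

from multiprocessing import Process, Manager

def walk_forward(d, items):
    for i in items:
        if i in d:            # the other process already reached this item
            return
        d[i] = 'forward'

def walk_backward(d, items):
    for i in reversed(items):
        if i in d:
            return
        d[i] = 'backward'

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict()
    items = list('abcdefghijkl')
    p1 = Process(target=walk_forward, args=(d, items))
    p2 = Process(target=walk_backward, args=(d, items))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(dict(d))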

Picking up items progressively as soon as a queue is available

I am looking for a solid implementation to allow me to progressively work through a list of items using Queue.
The idea is that I want to use a set number of workers that will go through a list of 20+ database-intensive tasks and return the result. I want Python to start with the first five items and, as soon as it's done with one task, start on the next task in the queue.
This is how I am currently doing it without Threading.
for key, v in self.sources.iteritems():
    # Do Stuff
I would like to have a similar approach, but possibly without having to split the list up into subgroups of five, so that it will automatically pick up the next item in the list. The goal is to make sure that if one database is slowing down the process, it will not have a negative impact on the whole application.
You can implement this yourself, but Python 3 already comes with an Executor-based solution for thread management, which you can use in Python 2.x by installing the backported version.
Your code could then look like
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_key = {}
    for key, value in sources.items():
        future_to_key[executor.submit(do_stuff, value)] = key
    for future in concurrent.futures.as_completed(future_to_key):
        key = future_to_key[future]
        result = future.result()
        # process result
If you are using Python 3, I recommend the concurrent.futures module. If you are not using Python 3 and are not attached to threads (versus processes), then you might try multiprocessing.Pool (though it comes with some caveats and I have had trouble with pools not closing properly in my applications). If you must use threads, in Python 2, you might end up writing the code yourself: spawn 5 threads running consumer functions and simply push the calls (function + args) onto the queue iteratively for the consumers to find and process them, as sketched below.
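A rough sketch of that hand-rolled Python 2 variant (do_stuff and sources are placeholders standing in for your real work and data):

from Queue import Queue
from threading import Thread

task_queue = Queue()

def consumer():
    while True:
        key, value = task_queue.get()
        try:
            do_stuff(value)          # placeholder for the real database work
        finally:
            task_queue.task_done()

# five capped worker threads, mirroring max_workers=5 above
for _ in range(5):
    t = Thread(target=consumer)
    t.daemon = True
    t.start()

for key, value in sources.iteritems():
    task_queue.put((key, value))

task_queue.join()                    # block until every queued item is processed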
You could do it using only stdlib:
#!/usr/bin/env python
from multiprocessing.dummy import Pool  # use threads

def db_task(key_value):
    try:
        key, value = key_value
        # compute result..
        return result, None
    except Exception as e:
        return None, e

def main():
    pool = Pool(5)
    for result, error in pool.imap_unordered(db_task, sources.items()):
        if error is None:
            print(result)

if __name__ == "__main__":
    main()
