How do I make my ThreadPool work better with requests

How do I make my ThreadPool work better with requests - python

I currently have this function, which does a api call, each api call is requesting different data. I can do up to 300 concurrent api calls at a time.
Doing this does not seem to go fast, since this is just waiting for the repl I was wondering how I would make this function faster?
from multiprocessing.pool import ThreadPool
import requests
pool = ThreadPool(processes=500)
variables = VariableBaseDict
for item in variables:
async_result = pool.apply_async(requests.get(url.json()))
result = async_result.get()
#do stuff with result

Your current code is not actually farming any real work off to a worker thread. You are calling requests.get(url.json()) right in the main thread, and then passing the object that returns to pool.apply_async. You should be doing pool.apply_async(requests.get, (url.json(),)) instead. That said, even if you corrected this problem, you are then immediately waiting for the reply to the call, which means you never actually run any calls concurrently. You farm one item off to a thread, wait for it to be done, then wait for the next item.
You need to:
Fix the issue where you're accidentally calling requests.get(...) in the main thread.
Either use pool.map to farm the list of work off to the worker threads concurrently, or continue using pool.apply_async, but instead of immediately calling async_result.get(), store all the async_result objects in a list, and once you've iterated over variables, iterate over the async_result list and call .get() on each item. That way you actually end up running all the calls concurrently.
So, if you used apply_async, you'd do something like this:
async_results = [pool.apply_async(requests.get, (build_url(item),)) for item in variables]
for ar in async_results:
result = ar.get()
# do stuff with result
With pool.map it would be:
results = pool.map(requests.get, [build_url(item) for item in variables])

Related

Is it right to append item into the same list by multi-thread without lock?

Hrere's the detail question:
I want use multi-thread way to do a batch-http-request work, then gather all these result into a list and sort all items.
So I want to define a empty list origin_list in main process first, and start some threads to just append result into this list after pass origin_list to ervery thread.
And It seemed that I got the expected results in then end, so I think I got the right result list finally without thread lock for the list is a mutable object, am I right?
My main codes are as below:
def do_request_work(final_item_list,request_url):
request_results = request.get(request_url).text
# do request work
finnal_item_list.append(request_results )
def do_sort_work(final_item_list):
# do sort work
return final_item_list
def main():
f_item_list = []
request_list = [url1, url2, ...]
with ThreadPoolExecutor(max_workers=20) as executor:
executor.map(
partial(
do_request_work,
f_item_list
),
request_list)
sorted_list = do_sort_work(f_item_list)
Any commentary is very welcome. great thanks.

I think, that this is a quite questionable solution even without taking thread safety into account.
First of all python has GIL, which
In CPython, the global interpreter lock, or GIL, is a mutex that
protects access to Python objects, preventing multiple threads from
executing Python bytecodes at once.
Thus, I doubt about much performance benefit here, even noting that
potentially blocking or long-running operations, such as I/O, image
processing, and NumPy number crunching, happen outside the GIL.
all python work will be executed one thread in a time.
From the other perspective, the same lock may help you with the thread safety here, so only one thread will modify final_item_list in a time, but I am not sure.
Anyway, I would use multiprocessing module here with integrated parallel map:
from multiprocessing import Pool
def do_request_work(request_url):
request_results = request.get(request_url).text
# do request work
return request_results
if __name__ == '__main__':
request_list = [url1, url2, ...]
with Pool(20) as p:
f_item_list = p.map(do_request_work, request_list)
Which will guarantee you parallel lock-free execution of requests, since every process will receive only their part of work and just return the result, when ready.

Look at this thread: I'm seeking advise on multi-tasking on Python36 platform, Procedure setup.
Relevant to python3.5+
Running Tasks Concurrently¶
awaitable asyncio.gather(*aws, loop=None, return_exceptions=False)
Run awaitable objects in the aws sequence concurrently.
I use this very often, just be aware that its not thread-safe, so do not change values inside, otherwise you will have use deepcopy.
Other things to look at:
https://github.com/kennethreitz/grequests
https://github.com/jreese/aiomultiprocess
aiohttp

multithreading check membership in Queue and stop the threads

I want to iterate over a list using 2 thread. One from leading and other from trailing, and put the elements in a Queue on each iteration. But before putting the value in Queue I need to check for existence of the value within Queue (its when that one of the threads has putted that value in Queue), So when this happens I need to stop the thread and return list of traversed values for each thread.
This is what I have tried so far :
from Queue import Queue
from threading import Thread, Event
class ThreadWithReturnValue(Thread):
def __init__(self, group=None, target=None, name=None,
args=(), kwargs={}, Verbose=None):
Thread.__init__(self, group, target, name, args, kwargs, Verbose)
self._return = None
def run(self):
if self._Thread__target is not None:
self._return = self._Thread__target(*self._Thread__args,
**self._Thread__kwargs)
def join(self):
Thread.join(self)
return self._return
main_path = Queue()
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
def a(main_path,g,l=[]):
for i in g:
l.append(i)
print 'a'
if is_in_queue(i,main_path):
return l
main_path.put(i)
def b(main_path,g,l=[]):
for i in g:
l.append(i)
print 'b'
if is_in_queue(i,main_path):
return l
main_path.put(i)
g=['a','b','c','d','e','f','g','h','i','j','k','l']
t1 = ThreadWithReturnValue(target=a, args=(main_path,g))
t2 = ThreadWithReturnValue(target=b, args=(main_path,g[::-1]))
t2.start()
t1.start()
# Wait for all produced items to be consumed
print main_path.join()
I used ThreadWithReturnValue that will create a custom thread that returns the value.
And for membership checking I used the following function :
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
Now if I first start the t1 and then the t2 I will get 12 a then one b then it doesn't do any thing and I need to terminate the python manually!
But if I first run the t2 then t1 I will get the following result:
b
b
b
b
ab
ab
b
b
b
b
a
a
So my questions is that why python treads different in this cases? and how can I terminate the threads and make them communicate with each other?

Before we get into bigger problems, you're not using Queue.join right.
The whole point of this function is that a producer who adds a bunch of items to a queue can wait until the consumer or consumers have finished working on all of those items. This works by having the consumer call task_done after they finish working on each item that they pulled off with get. Once there have been as many task_done calls as put calls, the queue is done. You're not doing a get anywhere, much less a task_done, so there's no way the queue can ever be finished. So, that's why you block forever after the two threads finish.
The first problem here is that your threads are doing almost no work outside of the actual synchronization. If the only thing they do is fight over a queue, only one of them is going to be able to run at a time.
Of course that's common in toy problems, but you have to think through your real problem:
If you're doing a lot of I/O work (listening on sockets, waiting for user input, etc.), threads work great.
If you're doing a lot of CPU work (calculating primes), threads don't work in Python because of the GIL, but processes do.
If you're actually primarily dealing with synchronizing separate tasks, neither one is going to work well (and processes will be worse). It may still be simpler to think in terms of threads, but it'll be the slowest way to do things. You may want to look into coroutines; Greg Ewing has a great demonstration of how to use yield from to use coroutines to build things like schedulers or many-actor simulations.
Next, as I alluded to in your previous question, making threads (or processes) work efficiently with shared state requires holding locks for as short a time as possible.
So, if you have to search a whole queue under a lock, that had better be a constant-time search, not a linear-time search. That's why I suggested using something like an OrderedSet recipe rather than a list, like the one inside the stdlib's Queue.Queue. Then this function:
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
… is only blocking the queue for a tiny fraction of a second—just long enough to look up a hash value in a table, instead of long enough to compare every element in the queue against x.
Finally, I tried to explain about race conditions on your other question, but let me try again.
You need a lock around every complete "transaction" in your code, not just around the individual operations.
For example, if you do this:
with queue locked:
see if x is in the queue
if x was not in the queue:
with queue locked:
add x to the queue
… then it's always possible that x was not in the queue when you checked, but in the time between when you unlocked it and relocked it, someone added it. This is exactly why it's possible for both threads to stop early.
To fix this, you need to put a lock around the whole thing:
with queue locked:
if x is not in the queue:
add x to the queue
Of course this goes directly against what I said before about locking the queue for as short a time as possible. Really, that's what makes multithreading hard in a nutshell. It's easy to write safe code that just locks everything for as long as might conceivably be necessary, but then your code ends up only using a single core, while all the other threads are blocked waiting for the lock. And it's easy to write fast code that just locks everything as briefly as possible, but then it's unsafe and you get garbage values or even crashes all over the place. Figuring out what needs to be a transaction, and how to minimize the work inside those transactions, and how to deal with the multiple locks you'll probably need to make that work without deadlocking them… that's not so easy.

A couple of things that I think can be improved:
Due to the GIL, you might want to use the multiprocessing (rather than threading) module. In general, CPython threading will not cause CPU intensive work to speed up. (Depending on what exactly is the context of your question, it's also possible that multiprocessing won't, but threading almost certainly won't.)
A function like your is_inqueue would likely lead to high contention.
The locked time seems linear in the number of items that need to be traversed:
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
So, instead, you could possibly do the following.
Use multiprocessing with a shared dict:
from multiprocessing import Process, Manager
manager = Manager()
d = manager.dict()
# Fn definitions and such
p1 = Process(target=p1, args=(d,))
p2 = Process(target=p2, args=(d,))
within each function, check for the item like this:
def p1(d):
# Stuff
if 'foo' in d:
return

multiprocessing.Pool: calling helper functions when using apply_async's callback option

How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000 file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to charecterize each file. This is done on a 16 CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
def dirwalker(directory):
ahlala = []
# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
fileinfo = Z(arr_of_lines)
return fileinfo
# Y() reads other types of files and does the same thing
def Y(f):
fileinfo = Z(arr_of_lines)
return fileinfo
# results() is the callback function
def results(r):
ahlala.extend(r) # or .append, haven't yet decided
# helper function
def Z(arr):
return fileinfo # to X() or Y()!
for _,_,files in os.walk(directory):
pool = mp.Pool(mp.cpu_count()
for f in files:
if (filetype(f) == filetypeX):
pool.apply_async(X, args=(f,), callback=results)
elif (filetype(f) == filetypeY):
pool.apply_async(Y, args=(f,), callback=results)
pool.close(); pool.join()
return ahlala
Note, the code works if I put all of Z(), the helper function, into either X(), Y(), or results(), but is this either repetitive or possibly slower than possible? I know that the callback function is called for every function call, but when is the callback function called? Is it after pool.apply_async()...finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: Are daemon processes why nothing shows up? I am also very confused about how to queue things, and if this is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at a noticable time inefficiency?

You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback will be executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to get the results from all the worker processes. After the thread pulls the result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so its important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result will be placed into the result_queue by the worker process, and then the result-handling thread will pull the result off of the result_queue, and your callback will be executed.
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
The reason the workers are probably failing (at least in the example code you provided) is because X and Y are defined as function inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process, and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to successfully unpickle them in the worker process. To fix this, define both functions at the top-level of your module, rather than embedded insice the dirwalker function.
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os
# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
fileinfo = Z(f)
return fileinfo
# Y() reads other types of files and does the same thing
def Y(f):
fileinfo = Z(f)
return fileinfo
# helper function
def Z(arr):
return arr + "zzz"
def dirwalker(directory):
ahlala = []
# results() is the callback function
def results(r):
ahlala.append(r) # or .append, haven't yet decided
for _,_,files in os.walk(directory):
pool = mp.Pool(mp.cpu_count())
for f in files:
if len(f) > 5: # Just an arbitrary thing to split up the list with
pool.apply_async(X, args=(f,), callback=results) # ,error_callback=handle_error # In Python 3, there's an error_callback you can use to handle errors. It's not available in Python 2.7 though :(
else:
pool.apply_async(Y, args=(f,), callback=results)
pool.close()
pool.join()
return ahlala
if __name__ == "__main__":
print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = multiprocessing.Manager()
helper_dict = m.dict()
for f in files:
if len(f) > 5:
pool.apply_async(X, args=(f, helper_dict), callback=results)
else:
pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
The caveat is that this worked by creating a server process that contains a normal dict, and all your other processes talk to that one dict via a Proxy object. So every time you read or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.

How to efficiently iterate over multiple generators?

I've got three different generators, which yields data from the web. Therefore, each iteration may take a while until it's done.
I want to mix the calls to the generators, and thought about roundrobin (Found here).
The problem is that every call is blocked until it's done.
Is there a way to loop through all the generators at the same time, without blocking?

You can do this with the iter() method on my ThreadPool class.
pool.iter() yields threaded function return values until all of the decorated+called functions finish executing. Decorate all of your async functions, call them, then loop through pool.iter() to catch the values as they happen.
Example:
import time
from threadpool import ThreadPool
pool = ThreadPool(max_threads=25, catch_returns=True)
# decorate any functions you need to aggregate
# if you're pulling a function from an outside source
# you can still say 'func = pool(func)' or 'pool(func)()
#pool
def data(ID, start):
for i in xrange(start, start+4):
yield ID, i
time.sleep(1)
# each of these calls will spawn a thread and return immediately
# make sure you do either pool.finish() or pool.iter()
# otherwise your program will exit before the threads finish
data("generator 1", 5)
data("generator 2", 10)
data("generator 3", 64)
for value in pool.iter():
# this will print the generators' return values as they yield
print value

In short, no: there's no good way to do this without threads.
Sometimes ORMs are augmented with some kind of peek function or callback that will signal when data is available. Otherwise, you'll need to spawn threads in order to do this. If threads are not an option, you might try switching out your database library for an asynchronous one.

How to manage python threads results?

I am using this code:
def startThreads(arrayofkeywords):
global i
i = 0
while len(arrayofkeywords):
try:
if i<maxThreads:
keyword = arrayofkeywords.pop(0)
i = i+1
thread = doStuffWith(keyword)
thread.start()
except KeyboardInterrupt:
sys.exit()
thread.join()
for threading in python, I have almost everything done, but I dont know how to manage the results of each thread, on each thread I have an array of strings as result, how can I join all those arrays into one safely? Because, I if I try writing into a global array, two threads could be writing at the same time.

First, you actually need to save all those thread objects to call join() on them. As written, you're saving only the last one of them, and then only if there isn't an exception.
An easy way to do multithreaded programming is to give each thread all the data it needs to run, and then have it not write to anything outside that working set. If all threads follow that guideline, their writes will not interfere with each other. Then, once a thread has finished, have the main thread only aggregate the results into a global array. This is know as "fork/join parallelism."
If you subclass the Thread object, you can give it space to store that return value without interfering with other threads. Then you can do something like this:
class MyThread(threading.Thread):
def __init__(self, ...):
self.result = []
...
def main():
# doStuffWith() returns a MyThread instance
threads = [ doStuffWith(k).start() for k in arrayofkeywords[:maxThreads] ]
for t in threads:
t.join()
ret = t.result
# process return value here
Edit:
After looking around a bit, it seems like the above method isn't the preferred way to do threads in Python. The above is more of a Java-esque pattern for threads. Instead you could do something like:
def handler(outList)
...
# Modify existing object (important!)
outList.append(1)
...
def doStuffWith(keyword):
...
result = []
thread = Thread(target=handler, args=(result,))
return (thread, result)
def main():
threads = [ doStuffWith(k) for k in arrayofkeywords[:maxThreads] ]
for t in threads:
t[0].start()
for t in threads:
t[0].join()
ret = t[1]
# process return value here

Use a Queue.Queue instance, which is intrinsically thread-safe. Each thread can .put its results to that global instance when it's done, and the main thread (when it knows all working threads are done, by .joining them for example as in #unholysampler's answer) can loop .getting each result from it, and use each result to .extend the "overall result" list, until the queue is emptied.
Edit: there are other big problems with your code -- if the maximum number of threads is less than the number of keywords, it will never terminate (you're trying to start a thread per keyword -- never less -- but if you've already started the max numbers you loop forever to no further purpose).
Consider instead using a threading pool, kind of like the one in this recipe, except that in lieu of queueing callables you'll queue the keywords -- since the callable you want to run in the thread is the same in each thread, just varying the argument. Of course that callable will be changed to peel something from the incoming-tasks queue (with .get) and .put the list of results to the outgoing-results queue when done.
To terminate the N threads you could, after all keywords, .put N "sentinels" (e.g. None, assuming no keyword can be None): a thread's callable will exit if the "keyword" it just pulled is None.
More often than not, Queue.Queue offers the best way to organize threading (and multiprocessing!) architectures in Python, be they generic like in the recipe I pointed you to, or more specialized like I'm suggesting for your use case in the last two paragraphs.

You need to keep pointers to each thread you make. As is, your code only ensures the last created thread finishes. This does not imply that all the ones you started before it have also finished.
def startThreads(arrayofkeywords):
global i
i = 0
threads = []
while len(arrayofkeywords):
try:
if i<maxThreads:
keyword = arrayofkeywords.pop(0)
i = i+1
thread = doStuffWith(keyword)
thread.start()
threads.append(thread)
except KeyboardInterrupt:
sys.exit()
for t in threads:
t.join()
//process results stored in each thread
This also solves the problem of write access because each thread will store it's data locally. Then after all of them are done, you can do the work to combine each threads local data.

I know that this question is a little bit old, but the best way to do this is not to harm yourself too much in the way proposed by other colleagues :)
Please read the reference on Pool. This way you will fork-join your work:
def doStuffWith(keyword):
return keyword + ' processed in thread'
def startThreads(arrayofkeywords):
pool = Pool(processes=maxThreads)
result = pool.map(doStuffWith, arrayofkeywords)
print result

Writing into a global array is fine if you use a semaphore to protect the critical section. You 'acquire' the lock when you want to append to the global array, then 'release' when you are done. This way, only one thread is every appending to the array.
Check out http://docs.python.org/library/threading.html and search for semaphore for more info.
sem = threading.Semaphore()
...
sem.acquire()
# do dangerous stuff
sem.release()

try some semaphore's methods, like acquire and release..
http://docs.python.org/library/threading.html

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.