In my application, I am sending off several requests.post() requests in threads. Depending on the amount of data I have to post, the number of threads created can be in the hundreds.
The actual creation of the request object is made using requests-oauthlib, which inserts authentication data into the request object when it is used.
My issue is that when a large amount of data is being sent in parallel, the log is flooded with the following message, and eventually nothing more is written to the log:
Connection pool is full. Discarding connection.
My question is: with requests-oauthlib, is there a way to specify, perhaps within the post method itself, the size of the connection pool, or whether it should block so that other requests can complete before creating more? I ask because with requests-oauthlib it would be tricky to construct a custom request object and ask requests-oauthlib to use it.
One thing I have tried is as follows, but it had no effect - I continued to get the warnings:
import requests
s = requests.Session()
a = requests.adapters.HTTPAdapter(pool_block=True)
s.mount('http://', a)
s.mount('https://', a)
Update - The threads are now being created in a controlled manner.
from concurrent import futures  # or 'import futures' with the 2.x backport

with futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.submit(function, args)
The easiest way to block the requests so only N of them are trying to use the connection pool at once is to only create N at a time.
The easiest way to do that is to use a pool of N threads servicing a queue of M requests, instead of a separate thread for every request. If you're using Python 3.2+, this is very easy with the concurrent.futures library; in fact, it's nearly identical to the first ThreadPoolExecutor example in the docs, except that you're using requests instead of urllib. If you're not using 3.2+, there's a backport of the stdlib module named futures that provides the same functionality back to… I think 2.6, but don't quote me on that (PyPI is down at the moment).
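A minimal sketch of that approach, using plain requests for illustration (the URL, payloads, and worker count below are placeholders, not taken from the question):

import concurrent.futures
import requests

def post_one(url, data):
    # Each worker thread runs this; only max_workers requests are in flight at once.
    return requests.post(url, data=data)

jobs = [('https://example.com/api', {'field': 'value'})] * 100  # placeholder work items

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    submitted = [executor.submit(post_one, url, data) for url, data in jobs]
    responses = [f.result() for f in concurrent.futures.as_completed(submitted)]

Because only 10 worker threads exist, at most 10 requests can touch the connection pool at any moment, which gives you the blocking behaviour you're asking about.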
There may be an even easier solution: there's a third-party library named requests-futures that, I'm guessing from the name (again, PyPI down…), wraps that up for you in some way.
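If it does what the name suggests, the usage is roughly this (a sketch from memory, not verified against the library's docs):

from requests_futures.sessions import FuturesSession

# FuturesSession wraps a requests.Session around a thread pool;
# max_workers bounds how many requests run at once.
session = FuturesSession(max_workers=10)

future = session.post('https://example.com/api', data={'field': 'value'})
response = future.result()  # blocks until that particular request finishes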
You may also want to consider using something like grequests to do it all in one thread with gevent greenlets, but that won't be significantly different, as far as your code is concerned, from using a thread pool.
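Finally, on the adapter snippet from the question: the pool size itself can also be raised via the same HTTPAdapter, as long as the session the adapter is mounted on is the one actually issuing the posts. A rough sketch, assuming OAuth1 is in use (OAuth1Session subclasses requests.Session, so adapters can be mounted on it; the credentials and URL are placeholders):

from requests.adapters import HTTPAdapter
from requests_oauthlib import OAuth1Session

session = OAuth1Session('client_key', client_secret='client_secret')  # placeholder credentials

# Larger per-host pool, and block (wait) instead of discarding connections
# when the pool is exhausted.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100, pool_block=True)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Requests made through this session now share the configured pool.
response = session.post('https://example.com/api/endpoint', data={'field': 'value'})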
Related
In order to test our server we designed a test that sends a lot of requests with JSON payload and compares the response it gets back.
I'm currently trying to find a way to optimize the process by using multi threads to do so. I didn't find any solution for the problem that I'm facing though.
I have a url address and a bunch of JSON files (these files hold the requests, and for each request file there is an 'expected response' JSON to compare the response to).
I would like to use multi threading to send all these requests and still be able to match the response that I get back to the request I sent.
Any ideas?
Well, you have a couple of options:
Use multiprocessing.pool.ThreadPool (Python 2.7), where you create a pool of threads and then use them for dispatching requests; map_async may be of interest here if you want to make asynchronous requests (see the sketch after this list),
Use concurrent.futures.ThreadPoolExecutor (Python 3), which works in a similar way to ThreadPool but is designed for asynchronously executing callables,
You even have the option of using multiprocessing.Pool, but I'm not sure that will give you any benefit, since everything you will be doing is I/O bound, so threads should do just fine,
You can make asynchronous requests with Twisted or asyncio, but that may require a bit more learning if you are not accustomed to asynchronous programming.
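For instance, with the first option, a rough sketch could look like this (the server URL, file names, and comparison logic are placeholders for whatever your test actually uses):

import json
import requests
from multiprocessing.pool import ThreadPool

SERVER_URL = 'http://localhost:8000/endpoint'  # placeholder server address

def send_and_check(pair):
    request_file, expected_file = pair
    with open(request_file) as f:
        payload = json.load(f)
    with open(expected_file) as f:
        expected = json.load(f)
    response = requests.post(SERVER_URL, json=payload)
    # The return value is tied to the input pair, so each response
    # stays matched to the request (and expected response) it came from.
    return request_file, response.json() == expected

pairs = [('req1.json', 'expected1.json'),
         ('req2.json', 'expected2.json')]

pool = ThreadPool(10)
results = pool.map(send_and_check, pairs)
pool.close()
pool.join()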
Use the Python multiprocessing ThreadPool, where you can get the return value back so it can be compared.
https://docs.python.org/2/library/multiprocessing.html
https://gist.github.com/wrunk/b689be4b59270c32441c
Let's say I make 5 requests via a requests.Session to a server, using a ThreadPoolExecutor:
import concurrent.futures
import requests

session = requests.Session()
executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)

def post(data):
    response = session.post('http://example.com/api/endpoint1', data=data)
    return response

for data in (data1, data2, data3, data4, data5):
    executor.submit(post, data)
Since we are using the same requests.Session for each request, do we have to wait for the server to acknowledge the first request before we can send the next one?
If I had 5 sessions open concurrently -- one session per thread -- would I be able to send the requests more rapidly by sending each request via its own session?
The maintainer already recommends "one session per thread" so it's certainly doable... but will it improve performance?
Would I be better off using aiohttp and async?
So, first of all, if you are not sure whether a certain object or function is thread-safe, you should assume that it is not. Therefore you should not use Session objects in multiple threads without appropriate locking.
As for performance: always measure. Many libraries do a lot of work under the hood, including opening multiple TCP connections. They can probably be configured to tune performance, so it's very hard to answer the question precisely, especially since we don't know your case. For example, if you intend to make 5 parallel requests, then simply run 5 threads with 5 session objects. Most likely you won't see a difference between libraries (unless you pick a really bad one). On the other hand, if you are looking at hundreds or thousands of concurrent requests, it will matter.
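If you do want to try the one-session-per-thread route, a minimal sketch is a thread-local Session used from a ThreadPoolExecutor (the endpoint and payloads below are placeholders):

import threading
import concurrent.futures
import requests

thread_local = threading.local()

def get_session():
    # Each worker thread creates its own Session the first time and then reuses it.
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def post(data):
    session = get_session()
    return session.post('http://example.com/api/endpoint1', data=data)

payloads = [{'n': i} for i in range(5)]  # stand-ins for data1..data5

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    responses = list(executor.map(post, payloads))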
Anyway: always measure it yourself.
How can I minimize blocking the thread with Tornado? Actually, I already have working code, but I suspect that it is not fully asynchronous.
I have a really long task.
It consists of making several requests to CouchDB to get meta-data and to construct a final link. Then I need to make the last request to CouchDB and stream a file (from 10 MB up to 100 MB). So, the result will be the streaming of a large file to a client.
The problem is that the server can receive 100 simultaneous requests to download large files, and I must not block the thread; I have to keep receiving new requests (I have to minimize blocking).
So, I am making several synchronous requests (with the requests library) and then streaming a large file in chunks with AsyncHTTPClient.
The questions are as follows:
1) Should I use AsyncHTTPClient EVERYWHERE? Since I have some interface code, it will take quite a lot of time to replace all synchronous requests with asynchronous ones. Is it worth doing?
2) Should I use tornado.curl_httpclient.CurlAsyncHTTPClient? Will the code run faster (file download, making requests)?
3) I see that Python 3.5 introduced async and await, and theoretically that can be faster. Should I use async/await or keep using the @gen.coroutine decorator?
Use AsyncHTTPClient or CurlAsyncHTTPClient. Since the "requests" library is synchronous, it blocks the Tornado event loop while it executes, so you can only have one request in progress at a time. Doing asynchronous network operations with Tornado requires purpose-built asynchronous network code, like CurlAsyncHTTPClient.
Yes, CurlAsyncHTTPClient is a bit faster than AsyncHTTPClient; you may notice a speedup if you stream large amounts of data with it.
async and await are faster than gen.coroutine and yield, so if you have yield statements that are executed very frequently in a tight loop, or if you have deeply nested coroutines that call coroutines, it will be worthwhile to port your code.
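As a rough sketch of what that looks like, here is a native-coroutine handler that fetches CouchDB metadata with AsyncHTTPClient (the URL and route are placeholders, not taken from the question):

from tornado import web
from tornado.httpclient import AsyncHTTPClient

class MetaHandler(web.RequestHandler):
    async def get(self):
        client = AsyncHTTPClient()
        # While this fetch waits on CouchDB, the event loop is free to keep
        # accepting and serving other requests.
        meta = await client.fetch('http://localhost:5984/db/doc')  # placeholder URL
        self.write(meta.body)

if __name__ == '__main__':
    import tornado.ioloop
    app = web.Application([(r'/meta', MetaHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()

For the large-file part, fetch also accepts a streaming_callback, so chunks can be flushed to the client as they arrive rather than buffered in memory.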
So I have this problem I am trying to solve in a particular way, but I am not sure how hard it is to achieve.
I would like to use the asyncio/coroutines features of Python 3.4 to trigger many concurrent http requests using a blocking http library, like requests, or any Python api that does http requests like boto for aws.
I know about the run_in_executor() method to run tasks in threads/processes, but I would like to avoid that.
I would like to do it in a single thread, using the select features of the Linux/Unix kernel.
Actually I was following David Beazley's presentation on this, and I was trying to use this code: https://github.com/dabeaz/concurrencylive/blob/master/aserver.py
but without the future/pool stuff, and use my blocking-api call instead of computing the Fibonacci number.
But it seems that the http requests are still running in sequence.
Any ideas if this is possible? And how?
Thanks
Not possible. All the calls that the requests library makes to the underlying socket are blocking (i.e. socket.read) because the socket is in blocking mode. You could put the socket into non-blocking mode, but then socket.read would fail. You basically need an event loop to tell you when it's possible to do a socket.read, but blocking libraries aren't written with one in mind. This is the whole reason asyncio exists: to provide a default event loop that different libraries can share and that makes use of non-blocking file descriptors (e.g. sockets).
Use aiohttp; it's just as easy as requests, and in the process you get to learn more about asyncio. asyncio and the new Python 3.5 async/await syntax are the Future of networking IO; yield to it (pun intended).
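A small sketch of what that looks like with aiohttp and asyncio.gather (the URLs are placeholders; Python 3.5+ is needed for async/await):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # All requests are scheduled concurrently on a single thread;
        # the event loop switches between them while they wait on the network.
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['http://example.com/'] * 5  # placeholder targets
results = asyncio.get_event_loop().run_until_complete(main(urls))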
Let's say that I have a way to send an HTTP request to a server. How is it possible to send two (or more) of these requests to the server at the same time? For example, maybe by forking a process? How can I do it?
(also, I'm using Django)
# This example is not tested...
import requests
import simplejson

def tester(request):
    server_url = 'http://localhost:9000/receive'
    payload = {
        'd_test1': '1234',
        'd_test2': 'demo',
    }
    json_payload = simplejson.dumps(payload)
    content_length = len(json_payload)
    headers = {'Content-Type': 'application/json', 'Content-Length': str(content_length)}
    response = requests.post(server_url, data=json_payload, headers=headers, allow_redirects=True)
    if response.status_code == requests.codes.ok:
        print 'Headers: {}\nResponse: {}'.format(response.headers, response.text)
Thanks!
I think you want to use threads here rather than forking off new processes. While threads are bad in some cases, that isn't true here. Also, I think you want to use concurrent.futures instead of using threads (or processes) directly.
For example, let's say you have 10 URLs, and you're currently doing them one at a time, like this:
results = map(tester, urls)
But now, you want to send them 2 at a time. Just change it to this:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    results = pool.map(tester, urls)
If you want to try 4 at a time instead of 2, just change the max_workers. In fact, you should probably experiment with different values to see what works best for your program.
If you want to do something a little fancier, see the documentation—the main ThreadPoolExecutor Example is almost exactly what you're looking for.
Unfortunately, in 2.7, this module doesn't come with the standard library, so you will have to install the backport from PyPI.
If you have pip installed, this should be as simple as:
pip install futures
… or maybe sudo pip install futures, on Unix.
And if you don't have pip, go get it first (follow the link above).
The main reason you sometimes want to use processes instead of threads is that you've got heavy CPU-bound computation, and you want to take advantage of multiple CPU cores. In Python, threading can't effectively use up all your cores. So, if the Task Manager/Activity Monitor/whatever shows that your program is using up 100% CPU on one core, while the others are all at 0%, processes are the answer. With futures, all you have to do is change ThreadPoolExecutor to ProcessPoolExecutor.
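A sketch of that swap, where the function is just a stand-in for real CPU-bound work:

import concurrent.futures

def crunch(n):
    # Stand-in for a CPU-heavy computation.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # Same pattern as the ThreadPoolExecutor above, but each task runs in
    # its own process, so it can use a separate CPU core.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(crunch, [10**6] * 8))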
Meanwhile, sometimes you need more than just "give me a magic pool of workers to run my tasks". Sometimes you want to run a handful of very long jobs instead of a bunch of little ones, or load-balance the jobs yourself, or pass data between jobs, or whatever. For that, you want to use multiprocessing or threading instead of futures.
Very rarely, even that is too high-level, and you want to tell Python directly to create a new child process or thread. For that, you go all the way down to os.fork (on Unix only) or thread.
I would use gevent, which can launch these all in so-called green-threads:
# This will make requests compatible
from gevent import monkey; monkey.patch_all()
import requests
# Make a pool of greenlets to make your requests
from gevent.pool import Pool
p = Pool(10)
urls = [..., ..., ...]
p.map(requests.get, urls)
Of course, this example submits GETs, but the pool is generalized to map inputs onto any function, including, say, yours that makes the requests. These greenlets will run nearly as simultaneously as if you were using fork, but are much faster and much lighter-weight.