Asynchronous HTTP calls in Python - python

I have a need for a callback kind of functionality in Python where I am sending a request to a webservice multiple times, with a change in the parameter each time. I want these requests to happen concurrently instead of sequentially, so I want the function to be called asynchronously.
It looks like asyncore is what I might want to use, but the examples I've seen of how it works all look like overkill, so I'm wondering if there's another path I should be going down. Any suggestions on modules/process? Ideally I'd like to use these in a procedural fashion instead of creating classes but I may not be able to get around that.

Starting in Python 3.2, you can use concurrent.futures for launching parallel tasks.
Check out this ThreadPoolExecutor example:
http://docs.python.org/dev/library/concurrent.futures.html#threadpoolexecutor-example
It spawns threads to retrieve HTML and acts on responses as they are received.
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the url and contents
def load_url(url, timeout):
conn = urllib.request.urlopen(url, timeout=timeout)
return conn.readall()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
The above example uses threading. There is also a similar ProcessPoolExecutor that uses a pool of processes, rather than threads:
http://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor-example
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the url and contents
def load_url(url, timeout):
conn = urllib.request.urlopen(url, timeout=timeout)
return conn.readall()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))

Do you know about eventlet? It lets you write what appears to be synchronous code, but have it operate asynchronously over the network.
Here's an example of a super minimal crawler:
urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
"https://wiki.secondlife.com/w/images/secondlife.jpg",
"http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]
import eventlet
from eventlet.green import urllib2
def fetch(url):
return urllib2.urlopen(url).read()
pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
print "got body", len(body)

Twisted framework is just the ticket for that. But if you don't want to take that on you might also use pycurl, wrapper for libcurl, that has its own async event loop and supports callbacks.

(Although this thread is about server-side Python. Since this question was asked a while back. Others might stumble on this where they are looking for a similar answer on the client side)
For a client side solution, you might want to take a look at Async.js library especially the "Control-Flow" section.
https://github.com/caolan/async#control-flow
By combining the "Parallel" with a "Waterfall" you can achieve your desired result.
WaterFall( Parallel(TaskA, TaskB, TaskC) -> PostParallelTask)
If you examine the example under Control-Flow - "Auto" they give you an example of the above:
https://github.com/caolan/async#autotasks-callback
where "write-file" depends on "get_data" and "make_folder" and "email_link" depends on write-file".
Please note that all of this happens on the client side (unless you're doing Node.JS - on the server-side)
For server-side Python, look at PyCURL # https://github.com/pycurl/pycurl/blob/master/examples/basicfirst.py
By combining the example below with pyCurl, you can achieve the non-blocking multi-threaded functionality.

Related

Is having a concurrent.futures.ThreadPoolExecutor call dangerous in a FastAPI endpoint?

I have the following test code:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
with urllib.request.urlopen(url, timeout=timeout) as conn:
return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor() as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
I need to use the concurrent.futures.ThreadPoolExecutor part of the code in a FastAPI endpoint.
My concern is the impact of the number of API calls and the inclusion of threads. Concern about creating too many threads and its related consequences, starving the host, crashing the application and/or the host.
Any thoughts or gotchas on this approach?
You should rather use the HTTPX library, which provides an async API. As described in this answer , you spawn a Client and reuse it every time you need it. To make asynchronous requests with HTTPX, you'll need an AsyncClient.
You could control the connection pool size as well, using the limits keyword argument on the Client, which takes an instance of httpx.Limits. For example:
limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
client = httpx.AsyncClient(limits=limits)
You can adjust the above per your needs. As per the documentation on Pool limit configuration:
max_keepalive_connections, number of allowable keep-alive connections, or None to always allow. (Defaults 20)
max_connections, maximum number of allowable connections, or None for no limits. (Default 100)
keepalive_expiry, time limit on idle keep-alive connections in seconds, or None for no limits. (Default 5)
If you would like to adjust the timeout as well, you can use the timeout paramter to set timeout on an individual request, or on a Client/AsyncClient instance, which results in the given timeout being used as the default for requests made with this client (see the implementation of Timeout class as well). You can specify the timeout behavior in a fine grained detail; for example, setting the read timeout parameter will specify the maximum duration to wait for a chunk of data to be received (i.e., a chunk of the response body). If HTTPX is unable to receive data within this time frame, a ReadTimeout exception is raised. If set to None instead of some positive numerical value, there will be no timeout on read. The default is 5 seconds timeout on all operations.
You can use await client.aclose() to explicitly close the AsyncClient when you are done with it (this could be done inside a shutdown event handler, for instance).
To run multiple asynchronous operations—as you need to request five different URLs, when your API endpoint is called—you can use the awaitable asyncio.gather(). It will execute the async operations and return a list of results in the same order the awaitables (tasks) were passed to that function.
Working Example:
from fastapi import FastAPI
import httpx
import asyncio
URLS = ['https://www.foxnews.com/',
'https://edition.cnn.com/',
'https://www.nbcnews.com/',
'https://www.bbc.co.uk/',
'https://www.reuters.com/']
limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
timeout = httpx.Timeout(5.0, read=15.0) # 15s timeout on read. 5s timeout elsewhere.
client = httpx.AsyncClient(limits=limits, timeout=timeout)
app = FastAPI()
#app.on_event('shutdown')
async def shutdown_event():
await client.aclose()
async def send(url, client):
return await client.get(url)
#app.get('/')
async def main():
tasks = [send(url, client) for url in URLS]
responses = await asyncio.gather(*tasks)
return [r.text[:50] for r in responses] # return the first 50 chars of each response
If you would like to avoid reading the entire response body into RAM, you could use Streaming responses, as described in this answer and demonstrated below:
# ... rest of the code is the same as above
from fastapi.responses import StreamingResponse
async def send(url, client):
req = client.build_request('GET', url)
return await client.send(req, stream=True)
async def iter_content(responses):
for r in responses:
async for chunk in r.aiter_text():
yield chunk[:50] # return the first 50 chars of each response
yield '\n'
break
await r.aclose()
#app.get('/')
async def main():
tasks = [send(url, client) for url in URLS]
responses = await asyncio.gather(*tasks)
return StreamingResponse(iter_content(responses), media_type='text/plain')

Python concurrent executor.map() and submit()

I'm learning how to use concurrent with executor.map() and executor.submit().
I have a list that contains 20 url and want to send 20 requests at the same time, the problem is .submit() returns results in different order than the given list from the beginning. I've read that map() does what I need but i don't know how to write code with it.
The code below worked perfect to me.
Questions: is there any code block of map()that equivalent to the code below, or any sorting methods that can sort the result list from submit() by order of the list given?
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the url and contents
def load_url(url, timeout):
with urllib.request.urlopen(url, timeout=timeout) as conn:
return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
Here is the map version of your existing code. Note that the callback now accepts a tuple as a parameter. I added an try\except in the callback so the results will not throw an error. The results are ordered according to the input list.
from concurrent.futures import ThreadPoolExecutor
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://www.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the url and contents
def load_url(tt): # (url,timeout)
url, timeout = tt
try:
with urllib.request.urlopen(url, timeout=timeout) as conn:
return (url, conn.read())
except Exception as ex:
print("Error:", url, ex)
return(url,"") # error, return empty string
with ThreadPoolExecutor(max_workers=5) as executor:
results = executor.map(load_url, [(u,60) for u in URLS]) # pass url and timeout as tuple to callback
executor.shutdown(wait=True) # wait for all complete
print("Results:")
for r in results: # ordered results, will throw exception here if not handled in callback
print(' %r page is %d bytes' % (r[0], len(r[1])))
Output
Error: http://www.wsj.com/ HTTP Error 404: Not Found
Results:
'http://www.foxnews.com/' page is 320028 bytes
'http://www.cnn.com/' page is 1144916 bytes
'http://www.wsj.com/' page is 0 bytes
'http://www.bbc.co.uk/' page is 279418 bytes
'http://some-made-up-domain.com/' page is 64668 bytes
Without using the map method, you can use enumerate to build the future_to_url dict with not just the URLs as values, but also their indices in the list. You can then build a dict from the future objects returned by the call to concurrent.futures.as_completed(future_to_url) with indices as the keys, so that you can iterate an index over the length of the dict to read the dict in the same order as the corresponding items in the original list:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {
executor.submit(load_url, url, 60): (i, url) for i, url in enumerate(URLS)
}
futures = {}
for future in concurrent.futures.as_completed(future_to_url):
i, url = future_to_url[future]
futures[i] = url, future
for i in range(len(futures)):
url, future = futures[i]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))

Python multiprocess Pool vs Process

I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of http GET requests to send. After sending each and getting the response, I want to store to response (a simple int) to a (shared) dict. My final goal is to write all data in the dict to a file.
This is not CPU intensive at all. All my goal is the speed up sending the http GET requests because there are too many. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed with multiprocessing. The key challenges I'm facing are:
1) GET request can fail sometimes. I have to set 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async http requests without using Pool?
2) I want to check the response value of every requests, and have exception handling
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
def get_data(endpoint, get_params):
response = requests.get(endpoint, params = get_params)
if response.status_code != 200:
raise Exception("bad response for " + str(get_params))
return response.json()
def get_currency_data(endpoint, currency, date):
get_params = {'currency': currency,
'date' : date
}
for attempt in range(3):
try:
output = get_data(endpoint, get_params)
# additional return value check
# ......
return output['value']
except:
time.sleep(1) # I found that sleeping for 1s almost always make the retry successfully
return 'error'
def get_all_data(currencies, dates):
# I have many dates, but not too many currencies
for currency in currencies:
results = []
pool = Pool(processes=20)
for date in dates:
results.append(pool.apply_async(get_currency_data, args=(endpoint, date)))
output = [p.get() for p in results]
pool.close()
pool.join()
time.sleep(10) # Unfortunately I have to give the server some time to rest. I found it helps to reduce failures. I didn't write the server. This is not something that I can control
Neither. Use asynchronous programming. Consider the below code pulled directly from that article (credit goes to Paweł Miech)
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession
async def fetch(url, session):
async with session.get(url) as response:
return await response.read()
async def run(r):
url = "http://localhost:8080/{}"
tasks = []
# Fetch all responses within one Client session,
# keep connection alive for all requests.
async with ClientSession() as session:
for i in range(r):
task = asyncio.ensure_future(fetch(url.format(i), session))
tasks.append(task)
responses = await asyncio.gather(*tasks)
# you now have all response bodies in this variable
print(responses)
def print_responses(result):
print(result)
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
Just maybe create a URL's array, and instead of the given code, loop against that array and issue each one to fetch.
EDIT: Use requests_futures
As per #roganjosh comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession
sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content
for url, future in responses.items()
if future.result().status_code == 200}
EDIT: Use grequests to support Python 2.7
You can also us grequests, which supports Python 2.7 for performing asynchronous URL calling.
import grequests
urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in rs])
# [10475, 250785]
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward, you're mapping the URL's through the http GET function:
import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issues GET requests. Sizing it up should increase your network efficiency, but it'll add overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.

Faster Scraping of JSON from API: Asynchronous or?

I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
the API places a limit of 1 request per second.
I would like to accomplish this on a laptop and 100MB/sec internet connection
Currently, I am accomplishing this (synchronously & too slowly) by:
-pre-computing all of the (encoded) URL's I want to scrape
-using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
#for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
with open('json_output.txt', 'w') as outfile:
json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straight-forward way to accomplish this asynchronously but respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc but do not have enough experience to know whether/if one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is, how can I do an overall rate-limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously).
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues
URLS = ["https://baconipsum.com/api/?type=meat",
"https://baconipsum.com/api/?type=filler",
"https://baconipsum.com/api/?type=meat-and-filler",
"https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2
def handle_request(response):
if response.code == 200:
with open("FOO"+'.txt', "wb") as thisfile:#fix filenames to avoid overwrite
thisfile.write(response.body)
#gen.coroutine
def request_and_save_url(url):
try:
response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
print('fetched {0}'.format(url))
except Exception as e:
print('Exception: {0} {1}'.format(e, url))
raise gen.Return([])
#gen.coroutine
def main():
q = queues.Queue()
tstart = datetime.now()
fetching, fetched = set(), set()
#gen.coroutine
def fetch_url(worker_id):
current_url = yield q.get()
try:
if current_url in fetching:
return
#print('fetching {0}'.format(current_url))
print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now()-tstart).seconds ))
fetching.add(current_url)
yield request_and_save_url(current_url)
fetched.add(current_url)
finally:
q.task_done()
#gen.coroutine
def worker(worker_id):
while True:
yield fetch_url(worker_id)
# Fill a queue of URL's to scrape
list = [q.put(url) for url in URLS] # this does not make a list...it just puts all the URLS into the Queue
# Start workers, then wait for the work Queue to be empty.
for ii in range(concurrency):
worker(ii)
yield q.join(timeout=timedelta(seconds=300))
assert fetching == fetched
print('Done in {0} seconds, fetched {1} URLs.'.format(
datetime.now() - tstart, len(fetched)))
if __name__ == '__main__':
import logging
logging.basicConfig()
io_loop = ioloop.IOLoop.current()
io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
with open('json_output.txt', 'w') as outfile:
outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
tornado has a very powerful asynchronous client. Here's a basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado
URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()
def handle_request(response):
if response.code == 200:
with open('json_output.txt', 'a') as outfile:
outfile.write(response.body)
#tornado.gen.coroutine
def queue_requests():
results = []
for url in URLS:
nxt = tornado.gen.sleep(1) # 1 request per second
res = http_client.fetch(url, handle_request)
results.append(res)
yield nxt
yield results # wait for all requests to finish
loop.add_callback(loop.stop)
loop.add_callback(queue_requests)
loop.start()
This is a straight-forward approach that may lead to too many connections with the remote server. You may have to resolve such problem using a sliding window while queuing the requests.
In case of request timeouts or specific headers required, feel free to read the doc

Python what is the best way to handle multiple threads

Since my scaper is running so slow (one page at a time) so I'm trying to use thread to make it work faster. I have a function scrape(website) that take in a website to scrape, so easily I can create each thread and call start() on each of them.
Now I want to implement a num_threads variable that is the number of threads that I want to run at the same time. What is the best way to handle those multiple threads?
For ex: supposed num_threads = 5 , my goal is to start 5 threads then grab the first 5 website in the list and scrape them, then if thread #3 finishes, it will grab the 6th website from the list to scrape immidiately, not wait until other threads end.
Any recommendation for how to handle it? Thank you
It depends.
If your code is spending most of its time waiting for network operations (likely, in a web scraping application), threading is appropriate. The best way to implement a thread pool is to use concurrent.futures in 3.4. Failing that, you can create a threading.Queue object and write each thread as an infinite loop that consumes work objects from the queue and processes them.
If your code is spending most of its time processing data after you've downloaded it, threading is useless due to the GIL. concurrent.futures provides support for process concurrency, but again only works in 3.4+. For older Pythons, use multiprocessing. It provides a Pool type which simplifies the process of creating a process pool.
You should profile your code (using cProfile) to determine which of those two scenarios you are experiencing.
If you're using Python 3, have a look at concurrent.futures.ThreadPoolExecutor
Example pulled from the docs ThreadPoolExecutor Example:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the url and contents
def load_url(url, timeout):
conn = urllib.request.urlopen(url, timeout=timeout)
return conn.readall()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
If you're using Python 2, there is a backport available:
ThreadPoolExecutor Example:
from concurrent import futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
def load_url(url, timeout):
return urllib.request.urlopen(url, timeout=timeout).read()
with futures.ThreadPoolExecutor(max_workers=5) as executor:
future_to_url = dict((executor.submit(load_url, url, 60), url)
for url in URLS)
for future in futures.as_completed(future_to_url):
url = future_to_url[future]
if future.exception() is not None:
print('%r generated an exception: %s' % (url,
future.exception()))
else:
print('%r page is %d bytes' % (url, len(future.result())))

Categories