tl;dr: how do I maximize the number of HTTP requests I can send in parallel?
I am fetching data from multiple URLs with the aiohttp library. I'm testing its performance and I've observed that somewhere in the process there is a bottleneck: running more URLs at once just doesn't help.
I am using this code:
import asyncio
import aiohttp
async def fetch(url, session):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}
    try:
        async with session.get(
            url, headers=headers,
            ssl=False,
            timeout=aiohttp.ClientTimeout(
                total=None,
                sock_connect=10,
                sock_read=10
            )
        ) as response:
            content = await response.read()
            return (url, 'OK', content)
    except Exception as e:
        print(e)
        return (url, 'ERROR', str(e))

async def run(url_list):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(fetch(url, session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses
    return responses
loop = asyncio.get_event_loop()
asyncio.set_event_loop(loop)
task = asyncio.ensure_future(run(url_list))
loop.run_until_complete(task)
result = task.result().result()
Running this with a url_list of varying length (tests against https://httpbin.org/delay/2) I see that adding more URLs to be run at once helps only up to ~100 URLs, and then the total time starts to grow proportionally to the number of URLs (in other words, the time per URL does not decrease). This suggests that something fails when trying to process these at once. In addition, with more URLs in one batch I occasionally receive connection timeout errors.
Why is this happening? What exactly limits the speed here?
How can I check what the maximum number of parallel requests I can send on a given computer is? (I mean an exact number, not an approximation by 'trial and error' as above.)
What can I do to increase the number of requests processed at once?
I am running this on Windows.
EDIT in response to comment:
These are the same measurements with the limit set to None. There is only a slight improvement at the end, and there are many connection timeout errors with 400 URLs sent at once. I ended up using limit=200 on my actual data.
By default, aiohttp limits the number of simultaneous connections to 100. It does this by setting a default limit on the TCPConnector object that is used by the ClientSession. You can bypass it by creating and passing a custom connector to the session:
connector = aiohttp.TCPConnector(limit=None)
async with aiohttp.ClientSession(connector=connector) as session:
# ...
Note, however, that you probably don't want to set this number too high: your network capacity, CPU, RAM and the target server have their own limits, and trying to open an enormous number of connections can lead to more failures.
The optimal number can probably only be found through experiments on a concrete machine.
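One way to run such an experiment is to time the same batch of URLs at several connector limits. A minimal sketch, assuming https://httpbin.org/delay/2 as the test endpoint (as in the question) and a fixed batch of 400 URLs; replace both with your actual workload:
import asyncio
import time
import aiohttp

async def fetch(url, session):
    # the body is discarded; we only care about total wall-clock time
    async with session.get(url, ssl=False) as response:
        await response.read()

async def timed_batch(urls, limit):
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        start = time.perf_counter()
        await asyncio.gather(*(fetch(url, session) for url in urls),
                             return_exceptions=True)
        return time.perf_counter() - start

async def main():
    urls = ['https://httpbin.org/delay/2'] * 400
    for limit in (50, 100, 200, 400):
        elapsed = await timed_batch(urls, limit)
        print(f'limit={limit}: {elapsed:.1f}s')

asyncio.run(main())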
Unrelated:
You don't have to create tasks without reason. Most asyncio APIs accept regular coroutines. For example, your last lines of code can be altered this way:
loop = asyncio.get_event_loop()
loop.run_until_complete(run(url_list))
Or even just asyncio.run(run(url_list)) if you're using Python 3.7+.
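Putting both suggestions together, the question's code could be condensed roughly as follows. This is only a sketch, and limit=200 is simply the value the asker eventually settled on, not a general recommendation:
import asyncio
import aiohttp

async def fetch(url, session):
    try:
        async with session.get(
            url, ssl=False,
            timeout=aiohttp.ClientTimeout(total=None, sock_connect=10, sock_read=10)
        ) as response:
            return (url, 'OK', await response.read())
    except Exception as e:
        return (url, 'ERROR', str(e))

async def run(url_list):
    # the connector limit caps the number of simultaneous connections
    connector = aiohttp.TCPConnector(limit=200)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(url, session) for url in url_list))

result = asyncio.run(run(url_list))  # url_list as in the question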
Related
I have this code in python:
session = requests.Session()
for i in range(0, len(df_1)):
    page = session.head(df_1['listing_url'].loc[i], allow_redirects=False, stream=True)
    if page.status_code == 200:
        df_1['condition'][i] = 'active'
    else:
        df_1['condition'][i] = 'false'
df_1 is my data frame, and the column "listing_url" has more than 500 rows.
I want to check whether each URL in the list is active and record the result in my data frame. But this code takes a long time. How can I reduce it?
The problem with your current approach is that requests runs sequentially (synchronously), which means that a new request can't be sent before the prior one is finished.
What you are looking for is handling those requests asynchronously. Sadly, the requests library does not support asynchronous requests. A newer library that has a similar API to requests but can do that is httpx; aiohttp is another popular choice. With httpx you can do something like this:
import asyncio
import httpx

listing_urls = list(df_1['listing_url'])

async def do_tasks():
    async with httpx.AsyncClient() as client:
        tasks = [client.head(url) for url in listing_urls]
        responses = await asyncio.gather(*tasks)
        return {r.url: r.status_code for r in responses}

url_2_status = asyncio.run(do_tasks())
This will give you a mapping of {url: status_code}. You should be able to go from there.
This solution assumes you are using Python 3.7 or newer. Also remember to install httpx.
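For instance, writing the statuses back into the data frame might look roughly like this; a sketch reusing df_1 and url_2_status from above. Note that the mapping is keyed by httpx.URL objects, which compare equal to the matching URL strings, so the string lookup works as long as httpx does not normalize your URLs (if lookups miss, key the mapping by str(r.url) instead):
# mark each listing as 'active' when its HEAD request returned 200,
# mirroring the labels used in the original loop
df_1['condition'] = [
    'active' if url_2_status.get(url) == 200 else 'false'
    for url in df_1['listing_url']
]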
I have the following test code:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
I need to use the concurrent.futures.ThreadPoolExecutor part of the code in a FastAPI endpoint.
My concern is the impact of the number of API calls combined with the use of threads: creating too many threads could starve the host, or crash the application and/or the host.
Any thoughts or gotchas on this approach?
You should use the HTTPX library instead, which provides an async API. As described in this answer, you spawn a Client and reuse it every time you need it. To make asynchronous requests with HTTPX, you'll need an AsyncClient.
You could control the connection pool size as well, using the limits keyword argument on the Client, which takes an instance of httpx.Limits. For example:
limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
client = httpx.AsyncClient(limits=limits)
You can adjust the above per your needs. As per the documentation on Pool limit configuration:
max_keepalive_connections: the number of allowable keep-alive connections, or None to always allow (default 20).
max_connections: the maximum number of allowable connections, or None for no limits (default 100).
keepalive_expiry: the time limit on idle keep-alive connections, in seconds, or None for no limits (default 5).
If you would like to adjust the timeout as well, you can use the timeout parameter to set a timeout on an individual request, or on a Client/AsyncClient instance, in which case the given timeout is used as the default for requests made with that client (see the implementation of the Timeout class as well). You can specify the timeout behavior in fine-grained detail; for example, the read timeout parameter specifies the maximum duration to wait for a chunk of data to be received (i.e., a chunk of the response body). If HTTPX is unable to receive data within this time frame, a ReadTimeout exception is raised. If set to None instead of some positive numerical value, there will be no timeout on read. The default is a 5 second timeout on all operations.
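For example, a client-level default combined with a per-request override might look like this; a small sketch with arbitrary values and example.com as a placeholder URL:
import asyncio
import httpx

async def main():
    # 5s default for all operations, with a 15s read timeout
    timeout = httpx.Timeout(5.0, read=15.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        r1 = await client.get('https://www.example.com')                # client default applies
        r2 = await client.get('https://www.example.com', timeout=30.0)  # per-request override
        print(r1.status_code, r2.status_code)

asyncio.run(main())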
You can use await client.aclose() to explicitly close the AsyncClient when you are done with it (this could be done inside a shutdown event handler, for instance).
To run multiple asynchronous operations (here, requesting five different URLs when your API endpoint is called), you can use the awaitable asyncio.gather(). It will execute the async operations and return a list of results in the same order the awaitables (tasks) were passed to that function.
Working Example:
from fastapi import FastAPI
import httpx
import asyncio

URLS = ['https://www.foxnews.com/',
        'https://edition.cnn.com/',
        'https://www.nbcnews.com/',
        'https://www.bbc.co.uk/',
        'https://www.reuters.com/']

limits = httpx.Limits(max_keepalive_connections=5, max_connections=10)
timeout = httpx.Timeout(5.0, read=15.0)  # 15s timeout on read. 5s timeout elsewhere.
client = httpx.AsyncClient(limits=limits, timeout=timeout)
app = FastAPI()

@app.on_event('shutdown')
async def shutdown_event():
    await client.aclose()

async def send(url, client):
    return await client.get(url)

@app.get('/')
async def main():
    tasks = [send(url, client) for url in URLS]
    responses = await asyncio.gather(*tasks)
    return [r.text[:50] for r in responses]  # return the first 50 chars of each response
If you would like to avoid reading the entire response body into RAM, you could use Streaming responses, as described in this answer and demonstrated below:
# ... rest of the code is the same as above
from fastapi.responses import StreamingResponse

async def send(url, client):
    req = client.build_request('GET', url)
    return await client.send(req, stream=True)

async def iter_content(responses):
    for r in responses:
        async for chunk in r.aiter_text():
            yield chunk[:50]  # return the first 50 chars of each response
            yield '\n'
            break
        await r.aclose()

@app.get('/')
async def main():
    tasks = [send(url, client) for url in URLS]
    responses = await asyncio.gather(*tasks)
    return StreamingResponse(iter_content(responses), media_type='text/plain')
I am attempting to make an API request, pull down specific chunks of the response and ultimately save them into a file for later processing. I also want to mention up front that the script works fully until I begin to pull larger sets of data.
When I open the params to a larger date range, I receive:
aiohttp.client_exceptions.ContentTypeError: 0, message='Attempt to decode JSON with unexpected mimetype: text/html'
async def get_dataset(session, url):
    async with session.get(url=url, headers=headers, params=params) as resp:
        dataset = await resp.json()
        return dataset['time_entries']

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for page in range(1, total_pages):
            url = "https://api.harvestapp.com/v2/time_entries?page=" + str(page)
            tasks.append(asyncio.ensure_future(get_dataset(session, url)))
        dataset = await asyncio.gather(*tasks)
If I keep my params small enough, it works without issue. But with too large a date range the error pops up, and nothing past the snippet I shared above runs.
More for reference:
url_address = "https://api.harvestapp.com/v2/time_entries/"
headers = {
    "Content-Type": 'application/json',
    "Authorization": authToken,
    "Harvest-Account-ID": accountID
}
params = {
    "from": StartDate,
    "to": EndDate
}
Any ideas on what would cause this to work on certain data sizes but fail on larger sets? I am assuming the JSON is becoming malformed at some point, but I am unsure how to examine that and/or prevent it from happening, since I am able to pull multiple pages from the API and append them successfully on the smaller data pulls.
OP: Thank you to the others who gave answers. I discovered the issue and implemented a solution. A friend pointed out that aiohttp can return that error message if the response is an error page instead of the expected JSON content, i.e. an HTML page returning HTTP 429 Too Many Requests. I looked up the API limits and found they are set to 100 requests per 15 seconds.
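One quick way to confirm this kind of problem is to check the status and raw body before parsing; a small diagnostic sketch (not the poster's code), reusing the same headers and params:
async def get_dataset(session, url):
    async with session.get(url=url, headers=headers, params=params) as resp:
        # if the server did not return JSON (e.g. a 429 error page),
        # print what it actually sent back instead of crashing on .json()
        if resp.status != 200 or 'application/json' not in resp.headers.get('Content-Type', ''):
            body = await resp.text()
            print(f'{url} returned {resp.status}: {body[:200]}')
            return []
        dataset = await resp.json()
        return dataset['time_entries']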
My solution was to implement the asyncio-throttle module, which allowed me to directly limit the requests per time period. You can find it on the developer's GitHub.
Here is my updated code with the implementation; very simple! For my case I needed to limit my requests to 100 per 15 seconds, which you can see below as well.
from asyncio_throttle import Throttler  # pip install asyncio-throttle

async def get_dataset(session, url, throttler):
    while True:
        async with throttler:
            async with session.get(url=url, headers=headers, params=params) as resp:
                dataset = await resp.json()
                return dataset['time_entries']

async def main():
    tasks = []
    throttler = Throttler(rate_limit=100, period=15)
    async with aiohttp.ClientSession() as session:
        for page in range(1, total_pages):
            url = "https://api.harvestapp.com/v2/time_entries?page=" + str(page)
            tasks.append(asyncio.ensure_future(get_dataset(session, url, throttler)))
        dataset = await asyncio.gather(*tasks)
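If you would rather avoid the extra dependency, roughly the same effect can be approximated with a plain asyncio.Semaphore whose slots are handed back on a timer. This is only a sketch of the idea, not a drop-in replacement for asyncio-throttle:
import asyncio

class SimpleThrottler:
    # allows at most `rate_limit` requests to start in any `period`-second window
    def __init__(self, rate_limit, period):
        self._sem = asyncio.Semaphore(rate_limit)
        self._period = period

    async def __aenter__(self):
        await self._sem.acquire()
        # hand the slot back `period` seconds after this request was allowed to start
        asyncio.get_running_loop().call_later(self._period, self._sem.release)

    async def __aexit__(self, exc_type, exc, tb):
        return False
It would be used exactly where the Throttler is used above (async with throttler:).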
I am trying to build a web application fuzzer. It will take a wordlist and a URL from the user and will make requests to those URLs. At the end, it will give output according to the responses' status codes.
I have written some code; it does ~600 req/s locally (it takes about 8 seconds to finish a 4600-line wordlist), but since I'm using the requests library I was wondering whether there is a faster way to do it.
The only time-consuming parts, as far as I analyzed, are the fuzz() and req() functions, as they do most of the work. I also have other functions, but those I've shown should be enough to understand (I didn't want to include too much code).
def __init__(self):
    self.statusCodes = [200, 204, 301, 302, 307, 403]
    self.session = requests.Session()
    self.headers = {
        'User-Agent': 'x',
        'Connection': 'Closed'
    }

def req(self, URL):
    # request to only one url
    try:
        r = self.session.head(URL, allow_redirects=False, headers=self.headers, timeout=3)
        if r.status_code in self.statusCodes:
            if r.status_code == 301:
                self.directories.append(URL)
                self.warning("301", URL)
                return
            self.success(r.status_code, URL)
            return
        return
    except requests.exceptions.ConnectTimeout:
        return
    except requests.exceptions.ConnectionError:
        self.error("Connection error")
        sys.exit(1)

def fuzz(self):
    pool = ThreadPool(self.threads)
    pool.map(self.req, self.URLList)
    pool.close()
    pool.join()
    return

# self.threads is the number of threads
# self.URLList is a list of full urls
'__init__' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>, 'http://localhost', '/usr/share/wordlists/seclists/Discovery/Web-Content/common.txt', 25), {}) 0.00362110137939453125 sec
#each req is around this
'req' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>, 'http://localhost/webedit'), {}) 0.00855112075805664062 sec
'fuzz' ((<MWAF.MWAF instance at 0x7f554cd8dcb0>,), {}) 7.39054012298583984375 sec
Whole Program
[*] 7.39426517487
You may wish to combine multiple processes with multiple threads. As the question "400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task" shows, there's an optimal number of threads per process: the larger the percentage of time they spend waiting for I/O, the higher it is.
As a smaller, second-order optimization, you can try reusing prepared requests to save on object creation time. (I'm not sure whether that will have an effect; requests might, for example, treat them as immutable and create a new object each time anyway. But it may still cut down on input validation time or something similar.)
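A sketch of the processes-plus-threads idea applied to this fuzzer; the process and thread counts are placeholders to be tuned, and check() is a simplified, hypothetical stand-in for the req() method above:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

import requests

HEADERS = {'User-Agent': 'x', 'Connection': 'close'}

def check(url):
    # simplified stand-in for req(): return the status code, or None on failure
    try:
        r = requests.head(url, allow_redirects=False, headers=HEADERS, timeout=3)
        return url, r.status_code
    except requests.exceptions.RequestException:
        return url, None

def worker(urls):
    # one thread pool per process; the threads spend most of their time waiting on I/O
    with ThreadPool(50) as threads:
        return threads.map(check, urls)

def fuzz(url_list, processes=4):
    # split the wordlist into one slice per process
    chunks = [url_list[i::processes] for i in range(processes)]
    with Pool(processes) as procs:
        results = procs.map(worker, chunks)
    return [item for chunk in results for item in chunk]

if __name__ == '__main__':
    print(fuzz(['http://localhost/webedit', 'http://localhost/admin']))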
I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.
This is not CPU intensive at all. My whole goal is to speed up sending the HTTP GET requests because there are so many. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed it with multiprocessing. The key challenges I'm facing are:
1) A GET request can fail sometimes. I have to set 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async HTTP requests without using Pool?
2) I want to check the response value of every request, and have exception handling.
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date
                  }
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data, args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        time.sleep(10)  # Unfortunately I have to give the server some time to rest. I found it helps to reduce failures. I didn't write the server. This is not something that I can control.
Neither. Use asynchronous programming. Consider the below code pulled directly from that article (credit goes to Paweł Miech)
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []
    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
Just create an array of URLs and, instead of the given code, loop over that array and pass each one to fetch.
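To the retry question in the edit: yes, retries can be handled per URL inside the coroutine, so only the failed requests are repeated. A sketch built on the same aiohttp pattern (the status check, retry count, 1-second pause and localhost URLs are placeholders to adapt to your API):
import asyncio
import aiohttp

async def fetch_with_retries(url, session, retries=3):
    # retry only this URL, mirroring the 3-attempt loop in the question
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.json()
        except aiohttp.ClientError:
            pass
        await asyncio.sleep(1)
    return 'error'

async def run(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_with_retries(url, session) for url in urls))

# urls would be built from your endpoint, currencies and dates
results = asyncio.run(run(['http://localhost:8080/0', 'http://localhost:8080/1']))
print(results)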
EDIT: Use requests_futures
As per @roganjosh's comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession
sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content
            for url, future in responses.items()
            if future.result().status_code == 200}
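The concurrency here is governed by the session's underlying thread pool; if you need more (or fewer) simultaneous requests, it can be sized via the max_workers argument (worth double-checking against the requests_futures docs for your version):
sess = FuturesSession(max_workers=32)  # the default is reportedly 8 worker threads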
EDIT: Use grequests to support Python 2.7
You can also use grequests, which supports Python 2.7 for performing asynchronous URL calls.
import grequests
urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward: you're mapping the URLs through the HTTP GET function:
import requests
from multiprocessing import Pool

urls = ['http://google.com', 'http://stackoverflow.com']

pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issued GET requests. Sizing it up should increase your network efficiency, but it'll add overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
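For completeness, the same map-based pattern can be run with threads instead of processes via multiprocessing.dummy, which keeps the simple API while avoiding the fork and inter-process communication overhead for an I/O-bound job like this; a sketch, not a recommendation over the async approaches above:
import requests
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

urls = ['http://google.com', 'http://stackoverflow.com']

with ThreadPool(8) as pool:
    responses = pool.map(requests.get, urls)

print([r.status_code for r in responses])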