Asyncio get with unique URLs slower than non-unique - python

I have an asyncio function which calls aiohttp.ClientSession to get JSON data from many (10,000) URLs (code below).
During testing, I have simply been calling the async function with a list of URLs:
urls=[url1, url2, url3, url4, url5]*2000
I have now achieved a speed of about 300 it/s with the above urls variable.
However, it is about 10x slower when I run it with a complete list of 10,000 unique URLs:
urls=[url1, url2, ..., url10000]
I am not sure why the unique URL list is so much slower. Could asyncio be comparing the requests, noticing they fetch the same thing, and returning the same value instead of fetching a new one?
It seems as though this implementation is robust and fast. I have even tried copying it, but I get the same results: 300 it/s with the non-unique list and 30 it/s with the unique one.
Perhaps there is something simple I am missing in the implementation of the function below.
import aiohttp
import asyncio
import tqdm.asyncio
from url_list import unique_urls

async def fetch(session, url, sem):
    async with sem, session.get(url) as response:
        return await response.json()

async def fetch_all(session, urls, loop):
    sem = asyncio.Semaphore(100)
    tasks = [asyncio.create_task(fetch(session, url, sem)) for url in urls]
    return await asyncio.gather(*tasks, tq(tasks))

async def run(urls):
    async with aiohttp.ClientSession(loop=loop) as session:
        return await fetch_all(session, urls, loop)

async def tq(tasks):
    for f in tqdm.asyncio.tqdm.as_completed(tasks):
        await f

if __name__ == '__main__':
    len(unique_urls)
    loop = asyncio.get_event_loop()
    calc_routes = loop.run_until_complete(run(unique_urls))
My non-unique and unique lists are below:
nonunique_urls = [
    "https://arweave.net/ufeA5gLvBtf9vFt_HaLtZsOXIIN5cOfHB2c_DKTbOT8",
    "https://arweave.net/UHGKgIp9OAxBsAbR1px3Sqaq1kWqFvNfeEeT9VDeArY",
    "https://arweave.net/J8dxe4MHj1TRRaEI5wxiXP-ulqTn_z5NADlX19kVxew",
    "https://arweave.net/TuJ0uo6Gofe1cxaNIWwyx0RxjF5khkHTcwtdEu3m1BA",
    "https://arweave.net/nv0yKK7U_2T2a4KdC41BBPgOYAldJxyFmnjnavjMI5I"
] * 2000

unique_urls = [
    'https://arweave.net/ufeA5gLvBtf9vFt_HaLtZsOXIIN5cOfHB2c_DKTbOT8',
    'https://arweave.net/UHGKgIp9OAxBsAbR1px3Sqaq1kWqFvNfeEeT9VDeArY',
    'https://arweave.net/J8dxe4MHj1TRRaEI5wxiXP-ulqTn_z5NADlX19kVxew',
    'https://arweave.net/TuJ0uo6Gofe1cxaNIWwyx0RxjF5khkHTcwtdEu3m1BA',
    'https://arweave.net/nv0yKK7U_2T2a4KdC41BBPgOYAldJxyFmnjnavjMI5I',
    'https://arweave.net/a8cnY1xvM82d444JMSvzKO2qy7DCW73tFaFuNAkPzeg',
    'https://arweave.net/pfv8utICUpTHmvdM9bWxMyI8hHmJCf33bMw1F0nvXik',
    'https://arweave.net/oRAf9M7gG3mtraqzXh9XsCxbM675u3wxzYM5scL1keY',
    'https://arweave.net/2AiZNhPluj9jwdOOlIk_vqKYzhLwtJyLAT_7tdNvgVg',
    ........]
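One way to narrow this down (a sketch of my own, not part of the original post) is to take concurrency out of the picture and time a small batch of repeated URLs against an equally sized batch of unique ones, one request at a time. If the unique batch is still much slower per request, the difference is on the server side (e.g. the gateway caching repeated content) rather than in the asyncio code. The sketch reuses the nonunique_urls and unique_urls lists from above.

import time
import asyncio
import aiohttp

async def time_urls(urls):
    # sequential requests, so only per-URL server latency is measured
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        for url in urls:
            async with session.get(url) as response:
                await response.read()
        return time.perf_counter() - start

async def compare():
    print("repeated:", await time_urls(nonunique_urls[:50]))  # 5 URLs, repeated
    print("unique:  ", await time_urls(unique_urls[:50]))     # 50 distinct URLs

# asyncio.run(compare())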

Related

AsyncHTMLSession returns the responses list out of order - how to sort it or keep it ordered?

I found async requests-html much more useful than plain requests for parsing with BeautifulSoup. But when I use asession.run for my async functions, the responses come back in a different order on every call. I could make each async function return its response in a dict keyed by URL to restore the order, but that looks redundant to me. Any ideas?
Here I'm expecting the responses in a consistent order, or at least not a random one on every call:
from requests_html import AsyncHTMLSession, HTMLSession, HTMLResponse
from bs4 import BeautifulSoup

asession = AsyncHTMLSession()

async def kucoin():
    print(f'get K')
    r = await asession.get('https://kucoin.com')
    return r

async def gateio():
    print(f'get g')
    r = await asession.get('https://gate.io')
    return r

async def vk():
    print(f'get vk')
    r = await asession.get('https://vk.com')
    return r

tasks = [kucoin, gateio, vk]
results = asession.run(*tasks)
for result in results:
    print(BeautifulSoup(result.text).title)
But getting:
get K
get g
get vk
<title>Buy/Sell Bitcoin, Ethereum | Cryptocurrency Exchange | Gate.io</title>
<title>Crypto Exchange | Bitcoin Exchange | Bitcoin Trading | KuCoin</title>
<title>Welcome | VK</title>
If you have experience with async parsing, I would be thankful if you shared it!
UPDATE: it turns out that returning responses out of order is normal for this lib: https://github.com/psf/requests-html/issues/381
In AsyncHTMLSession.run, done is a set (which is unordered).
You can replace the implementation so that it returns the results from tasks instead:
def run(self, *coros):
    tasks = [asyncio.ensure_future(coro()) for coro in coros]
    done, _ = self.loop.run_until_complete(asyncio.wait(tasks))
    # return [t.result() for t in done]
    return [t.result() for t in tasks]

AsyncHTMLSession.run = run
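With the patch applied, the results should come back in the same order as the coroutines passed to run. A small usage sketch, reusing the functions from the question:

results = asession.run(kucoin, gateio, vk)  # patched run preserves task order
for coro, result in zip((kucoin, gateio, vk), results):
    print(coro.__name__, '->', BeautifulSoup(result.text, 'html.parser').title)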

Using Python asyncio and aiohttp - one request is fast, two or more very slow

I'm trying to speed up multiple requests to the US Census API by using asyncio and aiohttp.
If I make a request with only one lat/lon pair, it is fast, less than 1 second. With two or more, it is very slow, always 20 seconds or more.
Can't figure out why.
import pprint
import aiohttp
import time
import asyncio

async def getCensusInfo(lat, lon, session):
    url = f'https://geocoding.geo.census.gov/geocoder/geographies/coordinates?x={lon}&y={lat}&format=json&benchmark=Public_AR_Current&vintage=Census2020_Current&layers=all'
    async with session.get(url) as response:
        result = await response.json()
        return result

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        locations = [(27.652250703997332, -80.42654388841413)]
        locations = [(27.652250703997332, -80.42654388841413), (27.459669616175788, -80.30859777448217)]
        for location in locations:
            tasks.append(getCensusInfo(location[0], location[1], session))
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for result in results:
            pprint.pprint(result["result"]["geographies"]["Incorporated Places"][0], sort_dicts=False, indent=4)

start_time = time.perf_counter()
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
asyncio.run(main())
duration = time.perf_counter() - start_time
print(f'Time to download geoCode data: {duration} seconds')
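One way to narrow this down (my own sketch, not part of the question) is to time each request individually, which shows whether the extra ~20 seconds sits inside a single slow response or only appears when the two requests run together. timed_census_info is a hypothetical wrapper around the getCensusInfo coroutine above:

import time

async def timed_census_info(lat, lon, session):
    # wraps getCensusInfo from the snippet above with a per-request timer
    start = time.perf_counter()
    result = await getCensusInfo(lat, lon, session)
    print(f'({lat}, {lon}) took {time.perf_counter() - start:.1f}s')
    return result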

python asyncio asynchronously fetch data by key from a dict when the key becomes available

Like the title says, my use case is this:
I have an aiohttp server which accepts requests from clients. When a request comes in, I generate a unique request id for it and then send the {req_id: req_payload} dict to some workers (the workers are not written in Python and run in another process). When the workers complete the work, I get the responses back and put them in a result dict like this: {req_id_1: res_1, req_id_2: res_2}.
Then I want my aiohttp server handler to await on the result dict above, so that when the specific response becomes available (by req_id) it can be sent back.
I built the example code below to try to simulate the process, but got stuck implementing the coroutine async def fetch_correct_res(req_id), which should asynchronously (without blocking) fetch the correct response by req_id.
import random
import asyncio
import shortuuid

n_tests = 1000
idxs = list(range(n_tests))

req_ids = []
for _ in range(n_tests):
    req_ids.append(shortuuid.uuid())

res_dict = {}

async def fetch_correct_res(req_id):
    pass

async def handler(req):
    res = await fetch_correct_res(req)
    assert req == res, "the correct res for the req should exactly be the req itself."
    print("got correct res for req: {}".format(req))

async def randomly_put_res_to_res_dict():
    for _ in range(n_tests):
        random_idx = random.choice(idxs)
        await asyncio.sleep(random_idx / 1000)
        res_dict[req_ids[random_idx]] = req_ids[random_idx]
        print("req: {} is back".format(req_ids[random_idx]))
So:
Is it possible to make this solution work? How?
If the above solution is not possible, what would be the correct solution for this use case with asyncio?
Many thanks.
The only approach I can think of so far is: pre-create some asyncio.Queue objects with pre-assigned ids, then assign one queue to each incoming request, so the handler just awaits on that queue; when the response comes back I put it into that pre-assigned queue only, and after the request is fulfilled I collect the queue back for the next incoming request. Not very elegant, but it would solve the problem.
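For what it's worth, a minimal sketch of that queue-per-request idea (my own illustration, not from the original post): the queue can be created on demand instead of pre-allocated, and pending_queues and worker_done are hypothetical names.

import asyncio

pending_queues = {}  # req_id -> asyncio.Queue that will receive that request's result

async def handler(req_id, payload):
    queue = asyncio.Queue(maxsize=1)
    pending_queues[req_id] = queue
    # ... hand {req_id: payload} to the workers here ...
    res = await queue.get()       # suspends until the worker's result arrives
    del pending_queues[req_id]    # queue no longer needed once fulfilled
    return res

def worker_done(req_id, result):
    # call this when a worker result for req_id comes back
    pending_queues[req_id].put_nowait(result)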
See if the sample implementation below fulfils your need.
Basically, you want to respond to each request (id) with its response (whose arrival order you cannot predict) in an asynchronous way.
So at request-handling time, populate the dict with {request_id: {'event': <asyncio.Event>, 'result': <result>}} and await asyncio.Event.wait(); once the response is received, signal the event with asyncio.Event.set(), which releases the await, and then fetch the response from the dict by request id.
I modified your code slightly to pre-populate the dict with the request id and to await asyncio.Event.wait() until the signal comes from the response.
import random
import asyncio
import shortuuid

n_tests = 10
idxs = list(range(n_tests))

req_ids = []
for _ in range(n_tests):
    req_ids.append(shortuuid.uuid())

res_dict = {}

async def fetch_correct_res(req_id, event):
    await event.wait()
    res = res_dict[req_id]['result']
    return res

async def handler(req, loop):
    print("incoming request id: {}".format(req))
    event = asyncio.Event()
    data = {req: {}}
    res_dict.update(data)
    res_dict[req]['event'] = event
    res_dict[req]['result'] = 'pending'
    res = await fetch_correct_res(req, event)
    assert req == res, "the correct res for the req should exactly be the req itself."
    print("got correct res for req: {}".format(req))

async def randomly_put_res_to_res_dict():
    random.shuffle(req_ids)
    for i in req_ids:
        await asyncio.sleep(random.randrange(2, 4))
        print("req: {} is back".format(i))
        if res_dict.get(i) is not None:
            event = res_dict[i]['event']
            res_dict[i]['result'] = i
            event.set()

loop = asyncio.get_event_loop()
tasks = asyncio.gather(handler(req_ids[0], loop),
                       handler(req_ids[1], loop),
                       handler(req_ids[2], loop),
                       handler(req_ids[3], loop),
                       randomly_put_res_to_res_dict())
loop.run_until_complete(tasks)
loop.close()
Sample output from the above code:
incoming request id: NDhvBPqMiRbteFD5WqiLFE
incoming request id: fpmk8yC3iQcgHAJBKqe2zh
incoming request id: M7eX7qeVQfWCCBnP4FbRtK
incoming request id: v2hAfcCEhRPUDUjCabk45N
req: VeyvAEX7YGgRZDHqa2UGYc is back
req: M7eX7qeVQfWCCBnP4FbRtK is back
got correct res for req: M7eX7qeVQfWCCBnP4FbRtK
req: pVvYoyAzvK8VYaHfrFA9SB is back
req: soP8NDxeQKYjgeT7pa3wtG is back
req: j3rcg5Lp59pQXuvdjCAyZe is back
req: NDhvBPqMiRbteFD5WqiLFE is back
got correct res for req: NDhvBPqMiRbteFD5WqiLFE
req: v2hAfcCEhRPUDUjCabk45N is back
got correct res for req: v2hAfcCEhRPUDUjCabk45N
req: porzHqMqV8SAuttteHRwNL is back
req: trVVqZrUpsW3tfjQajJfb7 is back
req: fpmk8yC3iQcgHAJBKqe2zh is back
got correct res for req: fpmk8yC3iQcgHAJBKqe2zh
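A slightly more compact variant of the same idea (my own sketch, not from the answer above) stores an asyncio.Future per request id instead of an Event plus a result slot; setting the future both wakes the waiter and carries the result:

import asyncio

pending = {}  # req_id -> asyncio.Future that will receive the result

async def handler(req_id):
    fut = asyncio.get_running_loop().create_future()
    pending[req_id] = fut
    res = await fut               # resumes once the result is set
    print("got correct res for req: {}".format(req_id))
    return res

def on_worker_result(req_id, result):
    fut = pending.pop(req_id, None)
    if fut is not None and not fut.done():
        fut.set_result(result)    # wakes the awaiting handler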
This may work (note: I removed UUID in order to know req id in advance)
import random
import asyncio

n_tests = 1000
idxs = list(range(n_tests))

req_ids = []
for i in range(n_tests):
    req_ids.append(i)

res_dict = {}

async def fetch_correct_res(req_id):
    while not res_dict.get(req_id):
        await asyncio.sleep(0.1)
    return req_ids[req_id]

async def handler(req):
    print("fetching req: ", req)
    res = await fetch_correct_res(req)
    assert req == res, "the correct res for the req should exactly be the req itself."
    print("got correct res for req: {}".format(req))

async def randomly_put_res_to_res_dict(future):
    for i in range(n_tests):
        res_dict[req_ids[i]] = req_ids[i]
        await asyncio.sleep(0.5)
        print("req: {} is back".format(req_ids[i]))
    future.set_result("done")

loop = asyncio.get_event_loop()
future = asyncio.Future()
asyncio.ensure_future(randomly_put_res_to_res_dict(future))
loop.run_until_complete(handler(10))
loop.close()
Is it the best solution? In my opinion, no: this is basically polling the status of a long-running job, and you should have a (REST) API for submitting the job and checking its status, like:
http POST server:port/job
{some job json payload}
Response: 200 OK {"req_id": 1}

http GET server:port/job/1
Response: 200 OK {"req_id": 1, "status": "in process"}

http GET server:port/job/1
Response: 200 OK {"req_id": 1, "status": "done", "result": {}}
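A rough aiohttp sketch of that job-submission pattern (my own illustration; the route names and the jobs dict are made up for the example):

import uuid
from aiohttp import web

jobs = {}  # job_id -> {"status": ..., "result": ...}

async def submit_job(request):
    payload = await request.json()
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "in process", "result": None}
    # hand (job_id, payload) to the workers here; when they finish, set
    # jobs[job_id] = {"status": "done", "result": ...}
    return web.json_response({"req_id": job_id})

async def job_status(request):
    job_id = request.match_info["job_id"]
    job = jobs.get(job_id)
    if job is None:
        raise web.HTTPNotFound()
    return web.json_response({"req_id": job_id, **job})

app = web.Application()
app.add_routes([web.post("/job", submit_job),
                web.get("/job/{job_id}", job_status)])
# web.run_app(app)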

Asyncio and aiohttp returning task instead of results

I have a script to run parallel requests against an API within a class. However, what I'm getting back is basically a task instead of the actual results. Any reason why?
I mimicked the modified Client code on https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html.
import asyncio
from aiohttp import ClientSession

class Requestor:
    async def _async_request(self, url, session, sema_sz=10):
        sema = asyncio.Semaphore(sema_sz)
        async with sema:
            async with session.get(url) as response:
                req = await response.json()
                return req

    async def _async_chunk_request(self, url, chunks, headers=None, sema_sz=10):
        async with ClientSession(headers=headers) as session:
            futures = [asyncio.ensure_future(self._async_request(url.format(chunk), session, sema_sz)) for chunk in chunks]
            responses = asyncio.gather(*futures)
            await responses

    def get_request(self, url, chunks):
        loop = asyncio.get_event_loop()
        bulk_req = asyncio.ensure_future(self._async_chunk_request(url, chunks))
        loop.run_until_complete(bulk_req)
        return bulk_req
bulk_req is actually a Task and not the results; PyCharm shows it as Task finished coro=<Requestor._async_chunk_request() done, defined at ...
When I debug, I see that req has a full and proper response value, so there's no issue with that. I feel like it's something to do with the actual gathering of the futures?
Your _chunk_request does not return anything.
async def _chunk_request(...):
    ...
    ...
    await responses
I made a toy example trying to mimic your process. If I ended _chunk_request the way you did, I got the same result - a finished Task with no results. Changing _chunk_request to return something fixed it:
async def _chunk_request(...):
    ...
    ...
    return await responses
If you only need the return values from the tasks, get_request should return the result of the loop.run_until_complete() call.
My toy example
import asyncio
import random
from pprint import pprint

async def response(n):
    await asyncio.sleep(random.choice([1, 3, 5]))
    return f'i am {n}'

async def _request(n):
    req = await response(n)
    #print(req)
    return req

async def _chunk_request(chunks):
    futures = [asyncio.ensure_future(_request(chunk)) for chunk in chunks]
    #pprint(futures)
    responses = asyncio.gather(*futures, return_exceptions=True)
    #pprint(responses)
    return await responses

def get_request(chunks):
    loop = asyncio.get_event_loop()
    bulk_req = asyncio.ensure_future(_chunk_request(chunks))
    return loop.run_until_complete(bulk_req)
In [7]: result = get_request(range(1,6))
In [8]: print(result)
['i am 1', 'i am 2', 'i am 3', 'i am 4', 'i am 5']
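Applied to the Requestor class from the question, the two fixes might look like this (a sketch along the same lines, not tested against the real API; I also moved the semaphore so one instance is shared by all requests, since a fresh semaphore created inside each request does not limit anything):

import asyncio
from aiohttp import ClientSession

class Requestor:
    async def _async_request(self, url, session, sema):
        async with sema:
            async with session.get(url) as response:
                return await response.json()

    async def _async_chunk_request(self, url, chunks, headers=None, sema_sz=10):
        sema = asyncio.Semaphore(sema_sz)  # shared by every request in this batch
        async with ClientSession(headers=headers) as session:
            futures = [asyncio.ensure_future(self._async_request(url.format(chunk), session, sema))
                       for chunk in chunks]
            return await asyncio.gather(*futures)  # fix 1: return the gathered results

    def get_request(self, url, chunks):
        loop = asyncio.get_event_loop()
        bulk_req = asyncio.ensure_future(self._async_chunk_request(url, chunks))
        return loop.run_until_complete(bulk_req)  # fix 2: return the results, not the task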

How to use parallelization in set/list comprehension using asyncio?

I want to create a multiprocess comprehension in Python 3.7.
Here's the code I have:
import asyncio

import requests

async def _url_exists(url):
    """Check whether a url is reachable"""
    request = requests.get(url)
    return request.status_code == 200

async def _remove_unexisting_urls(rows):
    return {row for row in rows if await _url_exists(row)}

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]
rows = asyncio.run(_remove_unexisting_urls(rows))
In this code example, I want to remove non-existing URLs from a list. (Note that I'm using a set instead of a list because I also want to remove duplicates).
My issue is that the execution still appears to be sequential: the HTTP requests make the execution wait.
When compared to a serial execution, the execution time is the same.
Am I doing something wrong?
How should these await/async keywords be used with python comprehension?
asyncio cannot make the blocking requests calls concurrent: requests.get blocks the event loop, so the coroutines still run one after another. However, with the multiprocessing module's Pool.map, you can schedule the checks to run in other processes:
from multiprocessing.pool import Pool

import requests

pool = Pool()

def fetch(url):
    request = requests.get(url)
    return request.status_code == 200

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]
# keep only the URLs whose check returned True
rows = [row for row, ok in zip(rows, pool.map(fetch, rows)) if ok]
requests does not support asyncio. If you want to go for true asynchronous execution, you will have to look at libs like aiohttp or asks
The set should be built before handing the URLs off to the tasks, so duplicates are never even checked, rather than deduplicating the results afterwards.
With requests itself, you can fall back to run_in_executor, which will execute your requests inside a ThreadPoolExecutor, so it is not really asynchronous I/O:
import asyncio
import time
from requests import exceptions, get

def _url_exists(url):
    try:
        r = get(url, timeout=10)
    except (exceptions.ConnectionError, exceptions.ConnectTimeout):
        return False
    else:
        return r.status_code == 200

async def _remove_unexisting_urls(l, r):
    # making a set from the list before passing it to the futures
    # so we just have three tasks instead of nine
    futures = [l.run_in_executor(None, _url_exists, url) for url in set(r)]
    return [await f for f in futures]

rows = [  # added some dupes
    'http://example.com/',
    'http://example.com/',
    'http://example.com/',
    'http://example.org/',
    'http://example.org/',
    'http://example.org/',
    'http://foo.org/',
    'http://foo.org/',
    'http://foo.org/',
]

loop = asyncio.get_event_loop()
print(time.time())
result = loop.run_until_complete(_remove_unexisting_urls(loop, rows))
print(time.time())
print(result)
Output
1537266974.403686
1537266986.6789136
[False, False, False]
As you can see, there is a penalty for initializing the thread pool, ~2.3 seconds in this case. However, given that each of the three tasks runs for ten seconds until it times out on my box (my IDE is not allowed through the proxy), an overall execution time of twelve seconds looks quite concurrent.
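For comparison, a fully asynchronous version built on aiohttp, which is what the answer points to for true asynchronous execution, might look roughly like this (my own sketch, not from the original answers; it keeps the URLs that answer with status 200):

import asyncio
import aiohttp

async def _url_exists(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return resp.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False

async def _remove_unexisting_urls(rows):
    urls = list(set(rows))  # dedupe before issuing any requests
    async with aiohttp.ClientSession() as session:
        checks = await asyncio.gather(*(_url_exists(session, url) for url in urls))
    return {url for url, ok in zip(urls, checks) if ok}

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]
print(asyncio.run(_remove_unexisting_urls(rows)))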
