I am using an aiohttp session along with a semaphore within a custom class:
async def get_url(self, url):
    async with self.semaphore:
        async with self.session.get(url) as response:
            try:
                text_response = await response.text()
                read_response = await response.read()
                json_response = await response.json()
                await asyncio.sleep(random.uniform(0.1, 0.5))
            except aiohttp.client_exceptions.ContentTypeError:
                json_response = {}
            return {
                'json': json_response,
                'text': text_response,
                'read': read_response,
                'status': response.status,
                'url': response.url,
            }
I have two questions:
Is it correct/incorrect to have multiple await statements within a single async function? I need to return both response.text() and response.read(). However, depending on the URL, response.json() may or may not be available, so I've wrapped everything in a try/except block to catch that exception.
Since I am using this function to loop through a list of different RESTful API endpoints, I am controlling the number of simultaneous requests through the semaphore (set to a max of 100), but I also need to stagger the requests so they aren't logjamming the host machine. So I thought I could accomplish this by adding an asyncio.sleep chosen randomly between 0.1 and 0.5 seconds. Is this the best way to enforce a small wait between requests? Should I move it to the beginning of the function instead of near the end?
It is absolutely fine to have multiple awaits in one async function, as long as you know what you are awaiting, and each of them is awaited one by one, just like normal sequential execution. One thing to mention about aiohttp: you'd better call read() first and catch UnicodeDecodeError too, because internally text() and json() call read() first and then process its result; you don't want that processing to prevent you from returning at least read_response. You don't have to worry about read() being called multiple times; its result is simply cached in the response instance on the first call.
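A minimal sketch of that ordering, keeping your attribute names (self.semaphore, self.session) and return shape; the extra ValueError catch for malformed JSON bodies is an optional addition, not something your code requires:
async def get_url(self, url):
    async with self.semaphore:
        async with self.session.get(url) as response:
            # read() first: its result is cached, so text()/json() reuse the same bytes
            read_response = await response.read()
            try:
                text_response = await response.text()
            except UnicodeDecodeError:
                text_response = ''
            try:
                json_response = await response.json()
            except (aiohttp.client_exceptions.ContentTypeError, ValueError):
                json_response = {}  # non-JSON content type or malformed body
            await asyncio.sleep(random.uniform(0.1, 0.5))
            return {
                'json': json_response,
                'text': text_response,
                'read': read_response,
                'status': response.status,
                'url': response.url,
            }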
Random stagger is an easy and effective solution for sudden traffic. However if you want to control exactly the minimum time interval between any two requests - for academic reasons, you could set up two semaphores:
def __init__(self):
    # something else
    self.starter = asyncio.Semaphore(0)
    self.ender = asyncio.Semaphore(30)
Then change get_url() to use them:
async def get_url(self, url):
    await self.starter.acquire()
    try:
        async with self.session.get(url) as response:
            # your code
    finally:
        self.ender.release()
Because starter was initialized with zero, all get_url() coroutines will block on starter. We'll use a separate coroutine to control it:
async def controller(self):
    last = 0
    while self.running:
        await self.ender.acquire()
        sleep = 0.5 - (self.loop.time() - last)  # at most 2 requests per second
        if sleep > 0:
            await asyncio.sleep(sleep)
        last = self.loop.time()
        self.starter.release()
And your main program should look something like this:
def run(self):
    for url in [...]:
        self.loop.create_task(self.get_url(url))
    self.loop.create_task(self.controller())
So at first, the controller will release starter 30 times, evenly spread over 15 seconds, because 30 is the initial value of ender. After that, the controller will release starter as soon as any get_url() finishes, provided 0.5 seconds have passed since the last release of starter; otherwise it will wait out the remaining time.
One issue here: if the URLs to fetch are not a constant list in memory (e.g. they arrive from the network continuously, with unpredictable delays between URLs), the RPS limiter will fail (starter is released too early, before there is actually a URL to fetch). You'll need further tweaks for this case, even though the chance of a traffic burst is already very low.
I want to scrape a website asynchronously using a list of tor circuits with different exit nodes and making sure each exit node only makes a request every 5 seconds.
For testing purposes, I'm using the website https://books.toscrape.com/ and I'm lowering the sleep time, number of circuits and number of pages to scrape.
I'm getting the following two errors when I use the --tor argument. Both related to the torpy package.
'TorWebScraper' object has no attribute 'circuits'
'_GeneratorContextManager' object has no attribute 'create_stream'
Here is the relevant code causing the error:
async with aiohttp.ClientSession() as session:
    for circuit in self.circuits:
        async with circuit.create_stream() as stream:
            async with session.get(url, proxy=stream.proxy) as response:
                await asyncio.sleep(20e-3)
                text = await response.text()
                return url, text
Here is more context.
Your error is caused by the fact that your code starts the asyncio loop in the object's __init__, which is not good practice:
class WebScraper(object):
    def __init__(self, urls: List[str]):
        self.urls = urls
        self.all_data = []
        self.master_dict = {}
        asyncio.run(self.run())
        #^^^^^^

class TorWebScraper(WebScraper):
    def __init__(self, urls: List[str]):
        super().__init__(urls)
        # ^^^^^ this already called run() from the parent class
        self.circuits = get_circuits(3)
        asyncio.run(self.run())
        # ^^^^^ now run() is being called a second time
Ideally, to avoid issues like this, you should leave the logic in your classes and move the run code out to your script. In other words, move asyncio.run out of __init__ and call it once at the script level, e.g. asyncio.run(scrape_test_website()).
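A sketch of that separation (the names urls and scrape_test_website are illustrative; it assumes both __init__ methods drop their asyncio.run calls and that run() is a coroutine on the scraper):
class TorWebScraper(WebScraper):
    def __init__(self, urls: List[str]):
        super().__init__(urls)           # parent __init__ now only stores state
        self.circuits = get_circuits(3)  # no asyncio.run() in either __init__

async def scrape_test_website() -> None:
    scraper = TorWebScraper(urls)  # construction is cheap and has no side effects
    await scraper.run()            # the scraping coroutine runs inside one loop

if __name__ == '__main__':
    asyncio.run(scrape_test_website())  # the event loop is started exactly once, here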
I'm writing a genetic algorithm that uses a Cloud Function as its evaluation function and have been running into 401 response codes about 1/3 of the way into the process. I'm not entirely certain what is going on here given the numerous successful invocations, and there are no indications in the Google Cloud logs that anything is amiss (either in the CF logs or the generic cloud-wide logs).
The intention is to use it as a generic evaluation function for more 'rigorous' projects; for this one I am passing a list of strings, along with the 'correct' string, and returning the ASCII distance from each string to the correct string. This information comes in as a JSON packet. The genetic algorithm basically just needs to discover the correct string to complete. This is basically a play on the one-max optimization problem.
For reference, this has only really happened since I scaled up the number of invocations and the number of strings passed. The process ran fine with a smaller number of evaluations and fewer strings, but when I scale it up a bit it chokes about halfway through. Note that the entire purpose of using CFs is to try and scale exponentially upwards for evaluation calls, otherwise I'd just run this locally.
The Cloud Function is fairly trivial (evaluating the string optimization problem):
import json

# Evaluate distance from expected to individual
def fitness(bitstring, expected, dna_size):
    f = 0
    for c in range(dna_size):
        f += abs(ord(bitstring[c]) - ord(expected[c]))
    return f

def evaluateBitstrings(request):
    resp = []
    request_json = request.get_json()
    if request_json and 'bitstrings' in request_json and 'expected' in request_json and 'dna_size' in request_json:
        for bitstring in request_json['bitstrings']:
            f = fitness(bitstring, request_json['expected'], int(request_json['dna_size']))
            resp.append((bitstring, f))
        return str(json.dumps(resp))
    else:
        return f'Error: missing JSON information'
The JSON packet I'm sending comprises a list of 1000 strings, so it's really just doing a loop over those and creating a JSON packet of distances to return.
It is configured with 512 MB of memory and a 180-second timeout, and uses authentication to protect against anonymous calls. I trigger the call locally via Python, asyncio, and aiohttp with the authorization included in the header (authenticating locally in Windows Subsystem for Linux (Ubuntu) via gcloud).
Here is the relevant bit of Python code (using 3.6). One of the issues was being locally bound with the large number of aiohttp calls, and I came across this post on using semaphores to manage the number of concurrent calls.
import aiohttp
import asyncio
...
base_url = "https://<GCF:CF>"
headers = {'Content-Type': 'application/json'}
...
token = sys.argv[2]  # call to `gcloud auth print-identity-token` as parameter
headers['Authorization'] = 'bearer ' + token

async def fetch(bitstrings, expected, session):
    b = {'bitstrings': bitstrings,
         'expected': expected,
         'dna_size': len(expected)}
    async with session.post(base_url, data=json.dumps(b), headers=headers) as response:
        assert response.status == 200
        data = await response.read()
        try:
            return json.loads(data)
        except:
            print("An error occurred: {0}".format(data))

async def bound_fetch(sem, bitstrings, expected, session):
    async with sem:
        return await fetch(bitstrings, expected, session)

async def run(iterable, expected, token):
    tasks = []
    sem = asyncio.Semaphore(1000)
    async with aiohttp.ClientSession(trust_env=True) as session:
        chunks = [iterable[x:x+1000] for x in range(0, len(population), 1000)]
        # build up JSON array
        for chunk in chunks:
            task = asyncio.ensure_future(bound_fetch(sem, chunk, expected, session))
            tasks.append(task)
...
# Within the GA code
for generation in range(ga.GENERATIONS):
    ...
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(run(population, ga.OPTIMAL, token))
    responses = []
    results = loop.run_until_complete(future)
    for res in results:  # loop through task results
        for r in res:  # json coming in as a list
            responses.append((r[0], float(r[1])))  # string, fitness
For further reference, I have run this locally via the functions-framework and did not run into this issue. It only happens when I reach out to the cloud.
Edit: I resolved the Forbidden issue (the token needed to be refreshed, so I simply fired off a subprocess call to the relevant gcloud command); however, now I am seeing a new issue:
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host us-central1-cloud-function-<CF>:443 ssl:default [Connect call failed ('216.239.36.54', 443)]
This now happens sporadically (10 minutes in, 70 minutes in, etc.). I'm starting to wonder if I'm fighting a losing battle here.
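For completeness, the token refresh mentioned in the edit can be sketched roughly as below (this assumes gcloud is installed and authenticated in the local environment; how often you refresh is up to you):
import subprocess

def refresh_identity_token():
    # Shell out to gcloud for a fresh identity token (same command used for sys.argv[2] above).
    out = subprocess.run(
        ['gcloud', 'auth', 'print-identity-token'],
        stdout=subprocess.PIPE, check=True,
    ).stdout
    return out.decode().strip()

# e.g. refresh once per generation, before building the requests:
headers['Authorization'] = 'bearer ' + refresh_identity_token()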
I am taking my first steps in Python and I'm struggling a bit to understand why I don't get the expected result with this one. Here is what I am trying to achieve:
I have a function that consumes an API. While waiting for the API to answer, and given that I am going through a proxy that adds extra lag, I thought that sending concurrent requests would speed up the process (I run 100 concurrent requests). It does. But asyncio's run_until_complete always returns some unfinished coroutines.
Here is the code (simplified):
import aiohttp
import asyncio

async def consume_api(parameter):
    url = "someurl"  # it is actually based on the parameter
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, proxy="someproxy") as asyncresponse:
                r = await asyncresponse.read()
    except:
        global error_count
        error_count += 1
        if error_count > 50:
            return "Exceeded 50 tries on same request"
        else:
            return consume_api(parameter)
    return r.decode("utf-8")
def loop_on_api(list_of_parameter):
    loop = asyncio.get_event_loop()
    coroutines = [consume_api(list_of_parameter[i]) for i in range(len(list_of_parameter))]
    results = loop.run_until_complete(asyncio.gather(*coroutines))
    return results
When I run the debugger, the results returned by the loop_on_api function include a list of strings corresponding to the results of consume_api and some occurrences of <coroutine object consume_api at 0x00...>. Those variables have a cr_running attribute set to False and a cr_frame.
Though if I check the coroutines variable, I can find all 100 coroutines, but none seems to have a cr_frame.
Any idea what I am doing wrong?
I'm also thinking my way of counting the 50 errors will be shared by all coroutines.
Any idea how I can make it specific to each coroutine?
This should work; you can add/change/refactor whatever you want:
import aiohttp
import asyncio

async def consume_api(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

def loop_on_api(list_of_urls):
    loop = asyncio.get_event_loop()
    coroutines = [consume_api(url) for url in list_of_urls]
    results = loop.run_until_complete(asyncio.gather(*coroutines))
    return results

if __name__ == '__main__':
    print(loop_on_api(['https://google.com', 'https://twitter.com']))
It seems the issue is coming from the proxy I am using, which sometimes does not carry the request or response, so forcing a rerun seems to be the best answer. I now check whether the returned results still contain coroutines and re-run loop_on_api() on the corresponding parameters:
def loop_on_api(list_of_parameter):
    loop = asyncio.get_event_loop()
    coroutines = [consume_api(list_of_parameter[i]) for i in range(len(list_of_parameter))]
    results = loop.run_until_complete(asyncio.gather(*coroutines))
    undone = []
    rerun_list_of_parameter = []
    for i in range(len(results)):
        if str(type(results[i])) == "<class 'coroutine'>":  # not very elegant >> is there a better way?
            undone.append(i)
            rerun_list_of_parameter.append(list_of_parameter[i])
    if len(undone) > 0:
        undone_results = loop_on_api(rerun_list_of_parameter)
        for i in range(len(undone_results)):
            results[undone[i]] = undone_results[i]
    return results
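As a side note on the inline question in the snippet above, a slightly cleaner check for the same condition (just a suggestion, not part of the original answer) is asyncio.iscoroutine:
import asyncio

# cleaner than comparing str(type(results[i])) against "<class 'coroutine'>":
if asyncio.iscoroutine(results[i]):
    undone.append(i)
    rerun_list_of_parameter.append(list_of_parameter[i])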
I have an API whose paginated response returns only 10 records at a time. I want to process 10 records (index=0 and limit=10), then the next 10 (index=10 and limit=10), and so on until it returns an empty array.
I want to do this asynchronously.
I am using the following deps:
yarl==1.6.0
Mako==1.1.3
asyncio==3.4.3
aiohttp==3.6.2
The code is:
loop = asyncio.get_event_loop()
loop.run_until_complete(getData(id, token, 0, 10))
logger.info("processed all data")

async def getData(id, token, index, limit):
    try:
        async with aiohttp.ClientSession() as session:
            response = await fetch_data_from_api(session, id, token, index, limit)
            if response == []:
                logger.info('Fetched all data')
            else:
                # process data(response)
                getData(session, id, limit, limit+10)
    except Exception as ex:
        raise Exception(ex)

async def fetch_data_from_api(
    session, id, token, index, limit
):
    try:
        url = f"http://localhost:8080/{id}?index={index}&limit={limit}"
        async with session.post(
            url=url,
            headers={"Authorization": token}
        ) as response:
            response.raise_for_status()
            response = await response.json()
            return json.loads(json.dumps(response))
    except Exception as ex:
        raise Exception(
            f"Exception {ex} occurred"
        )
The issue is that it works fine the first time, but when I call the method getData(session, id, limit, limit+10) again from within async def getData(id, token, index, limit), it is never actually called.
How can I resolve the issue?
There are a few issues I see in your code.
First, and this is what you ask about, is the getData method.
It is a bit unclear to me, looking at the code, what that "second" getData is. In the function definition your arguments are getData(id, token, index, limit), but when you call it from within the function you call it with getData(session, id, limit, limit+10), where id is the second parameter. Is that intentional? This looks to me like either there is another getData method, or it's a bug.
In case of the first option: (a) you probably need to show us that code as well, as it's important for us to be able to give you better responses, and (b), more importantly, it will not work. Python doesn't support overloading, and the getData you are referencing from within the wrapping getData is the same wrapping method.
In case it's the second option: (a) you might have an issue with the function parameters, and (b) you are missing an await before the getData call (i.e. await getData(...)). This is probably also relevant in case it's the "first option".
Other than that, your exception handling is redundant. You basically just re-raise the exception, so I don't see any point in having the try/except blocks. Even worse, for some reason in the first method you create an Exception from the base exception class (not to be confused with BaseException). Just don't have the try block.
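To make that concrete, here is a sketch of the corrected recursive call, keeping the original parameter order, awaiting the recursion, and advancing the index by limit for the next page (adjust to your API's pagination semantics); the redundant try/except is dropped as suggested:
async def getData(id, token, index, limit):
    async with aiohttp.ClientSession() as session:
        response = await fetch_data_from_api(session, id, token, index, limit)
        if response == []:
            logger.info('Fetched all data')
        else:
            # process data(response)
            await getData(id, token, index + limit, limit)  # await the recursive call
A plain while loop that reuses a single ClientSession would avoid both the recursion and the cost of opening a new session for every page.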
I'd like to embed some async code in my Python project to make the HTTP request part asynchronous. For example, I read params from Kafka, use those params to generate some URLs, and put the URLs into a list. If the length of the list is greater than 1000, I send the list to aiohttp to batch-fetch the responses.
I cannot change the whole project from sync to async, so I can only change the HTTP request part.
The code example is:
async def async_request(url):
    async with aiohttp.ClientSession() as client:
        resp = await client.get(url)
        result = await resp.json()
        return result

async def do_batch_request(url_list, result):
    task_list = []
    for url in url_list:
        task = asyncio.create_task(async_request(url))
        task_list.append(task)
    batch_response = await asyncio.gather(*task_list)
    result.extend(batch_response)

def batch_request(url_list):
    batch_response = []
    asyncio.run(do_batch_request(url_list, batch_response))
    return batch_response

url_list = []
for msg in kafka_consumer:
    url = msg['url']
    url_list.append(url)
    if len(url_list) >= 1000:
        batch_response = batch_request(url_list)
        parse(batch_response)
        ....
As we know, asyncio.run will create an event loop to run the async function and then close that event loop. My question is: will my approach hurt the performance of the async code? And is there a better way to handle my situation?
There's no serious problem with your approach, and you'll get a speed benefit from asyncio. The only possible problem is that if you later want to do something async elsewhere in the code, you won't be able to do it concurrently with batch_request.
There's not much to do about that if you don't want to change the whole project from sync to async, but if in the future you want to run batch_request in parallel with something else, keep in mind that you can run it in a thread and wait for the result asynchronously, as sketched below.
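A minimal sketch of that thread-based option (it assumes Python 3.9+ for asyncio.to_thread; on older versions, loop.run_in_executor(None, batch_request, url_list) achieves the same thing; some_async_caller is just an illustrative name):
import asyncio

async def some_async_caller(url_list):
    # batch_request blocks and runs its own event loop internally via asyncio.run,
    # so execute it in a worker thread and await the result without blocking this loop.
    batch_response = await asyncio.to_thread(batch_request, url_list)
    parse(batch_response)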