I'm writing code to get some links from a list of input urls using asyncio, aiohttp and BeautifulSoup.
Here's a snippet of the relevant code:
import asyncio

import aiohttp
import bs4


def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        soup = bs4.BeautifulSoup(html, 'lxml',
                                 parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        async with session.get(ep_link) as response:
            html_txt = await response.text()
            return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))
I then later call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a bunch of links (~170 links).
jpgs_links should be a dictionary with numerical keys and a bunch of lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
When I reran the same code with a minute between runs, different lists came back empty and different ones came back full each time.
Could it be that asyncio.gather is not waiting for all the tasks to finish?
How can I get asyncio to return no empty lists while keeping the number of links in hrefs high?
So, turns out that some of the urls I sent in threw up the error:
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
So I changed
async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)
to
async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)
What this does is retry the connection after sleeping for a second (the await asyncio.sleep(1) does that).
Nothing to do with asyncio or BeautifulSoup, apparently.
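As a side note, the while loop above retries forever if a URL never recovers. A minimal sketch of a bounded variant, assuming the same extractLinks helper; the max_retries count and the doubling back-off are just illustrative choices:

async def get_htmllinks(session, ep_num, ep_link, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
                return extractLinks(ep_num, html_txt)
        except aiohttp.ClientResponseError:
            # Back off a little longer after each failed attempt.
            await asyncio.sleep(2 ** attempt)
    # Give up after max_retries attempts; an empty list marks the failure.
    return ep_num, []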
I am a noob who is trying to scrape a list of urls and search for a word using asynchronous programming in python.
My code is as follows:
import asyncio

import aiohttp
from bs4 import BeautifulSoup as bsoup


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


def parse(wd, html, url):
    add_soup = bsoup(html, 'html.parser')
    res = []
    for para in add_soup.find_all("p"):
        para_txt = para.text
        for sent_txt in para_txt.split("."):
            if wd in sent_txt:
                res.append([sent_txt, url])
    return res


async def scrape_urls(wd, urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(wd, session, url) for url in urls)
        )


async def fetch_and_parse(wd, session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    paras = await loop.run_in_executor(None, parse, wd, html, url)
    return paras
I wrote the above code based on this link, but I am unclear on how to proceed to retrieve the resultant list.
I am trying to get the results using co = scrape_urls("agriculture", urls). As expected, I get a coroutine object. How do I run the coroutine object and retrieve the results?
Not entirely sure what issue you're facing. Once you use gather to get the Future instance, use an event loop to execute it and get results.
loop = asyncio.get_event_loop()
group = scrape_urls("agriculture", urls)
results = loop.run_until_complete(group)
loop.close()
print(results)
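On Python 3.7+, asyncio.run is a simpler way to drive the coroutine; it creates the event loop, runs the coroutine to completion, and closes the loop for you. A minimal sketch, reusing the scrape_urls and urls from the question:

import asyncio

results = asyncio.run(scrape_urls("agriculture", urls))
print(results)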
Problem I'm trying to solve:
I'm making many api requests to a server. I'm trying to create delays between async api calls to comply with the server's rate limit policy.
What I want it to do
I want it to behave like this:
Make api request #1
wait 0.1 seconds
Make api request #2
wait 0.1 seconds
... and so on ...
repeat until all requests are made
gather the responses and return the results in one object (results)
Issue:
When I introduced asyncio.sleep() or time.sleep() in the code, it still made the api requests almost instantaneously. It seemed to delay the execution of print(), but not the api requests. I suspect that I have to create the delays within the loop, not at the fetch_one() or fetch_all() level, but couldn't figure out how to do so.
Code block:
import asyncio
from ssl import SSLContext

import aiohttp


async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results


async def fetch_one(loop, url, delay):
    # time.sleep(delay)
    # asyncio.sleep(delay)
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url, ssl=SSLContext()) as resp:
            # print("An api call to ", url, " is made at ", time.time())
            # print(resp)
            return await resp.text()  # note: resp itself is not awaitable


delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
Versions I'm using:
python 3.8.5
aiohttp 3.7.4
asyncio 3.4.3
I would appreciate any tips on guiding me to the right direction!
The call to asyncio.gather will launch all requests "simultaneously" - and on the other hand, if you simply used a lock or awaited each task in turn, you would not gain anything from using parallelism at all.
The simplest thing to do, if you know the rate at which you can issue the requests, is simply to increase the asynchronous pause before each request in succession - a simple global variable can do that:
next_delay = 0.1

async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    global next_delay
    next_delay += delay
    await asyncio.sleep(next_delay)
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url, ssl=SSLContext()) as resp:
            # print("An api call to ", url, " is made at ", time.time())
            # print(resp)
            return await resp.text()  # note: resp itself is not awaitable

delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
Now, if you want to, say, issue 5 requests and then issue the next 5, you could use a synchronization primitive such as asyncio.Condition, with its wait_for checking how many api calls are active - the example below uses the simpler asyncio.Event for the same purpose:
active_calls = 0
MAX_CALLS = 5

async def fetch_all(loop, urls, delay):
    event = asyncio.Event()
    event.set()
    results = await asyncio.gather(*[fetch_one(loop, url, delay, event) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay, event):
    global active_calls
    active_calls += 1
    if active_calls > MAX_CALLS:
        event.clear()
    await event.wait()
    try:
        async with aiohttp.ClientSession(loop=loop) as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                # print("An api call to ", url, " is made at ", time.time())
                # print(resp)
                return await resp.text()  # note: resp itself is not awaitable
    finally:
        active_calls -= 1
        if active_calls == 0:
            event.set()

delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
For both examples, should your design need to avoid global variables (these are actually "module" variables) - you could either move all functions into a class, work on an instance, and promote the global variables to instance attributes, or use a mutable container, such as a list holding the active_calls value in its first item, and pass that as a parameter.
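For instance, a minimal class-based sketch of the first example, keeping the accumulating delay as an instance attribute instead of a module-level global (the class name and the simplified session handling are illustrative, not part of the original code):

import asyncio

import aiohttp


class RateLimitedFetcher:
    def __init__(self, delay):
        self.delay = delay
        self.next_delay = 0.0  # instance attribute instead of a global

    async def fetch_one(self, session, url):
        # Each task schedules itself a progressively later start time.
        self.next_delay += self.delay
        await asyncio.sleep(self.next_delay)
        async with session.get(url) as resp:
            return await resp.text()

    async def fetch_all(self, urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *[self.fetch_one(session, url) for url in urls],
                return_exceptions=True,
            )


# usage: results = asyncio.run(RateLimitedFetcher(0.1).fetch_all(urls))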
When you use asyncio.gather you run all fetch_one coroutines concurrently. All of them wait for the delay together, then make the API calls almost simultaneously.
To solve the issue, you should either await fetch_one one by one in fetch_all, or use a Semaphore to signal that the next coroutine shouldn't start before the previous one is done.
Here's the idea:
import asyncio
from ssl import SSLContext

import aiohttp

_sem = asyncio.Semaphore(1)


async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results


async def fetch_one(loop, url, delay):
    async with _sem:  # the next coroutine(s) will wait here until the previous one is done
        await asyncio.sleep(delay)
        async with aiohttp.ClientSession(loop=loop) as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                # print("An api call to ", url, " is made at ", time.time())
                # print(resp)
                return await resp.text()  # note: resp itself is not awaitable


delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
This question is regarding aiohttp library usage.
My goal here is to check a list of urls by sending a bunch of HEAD requests, potentially asynchronously, and eventually create a dict of url: status pairs.
I am new to asyncio and I have found a lot of examples where people use GET requests to fetch html and await resp.read() or await resp.text(), and that works fine, but with a HEAD request I don't have a body, just headers. If I try to await resp.status, or resp itself as an object, it does not work, as they are not awaitable.
The code below only works synchronously, step by step, and I can't figure out how to make it run asynchronously. It seems like whatever I do with the status turns the code into sync mode somehow...
I would be glad to see your ideas.
Thanks.
import asyncio
import aiohttp

urls_list = [url1, url2, url3, etc, etc, etc, ]
status_dict = {}


async def main():
    async with aiohttp.ClientSession() as session:
        for individual_url in urls_list:
            async with session.head(individual_url) as resp:
                status_dict.update({individual_url: resp.status})


asyncio.run(main())
You can use asyncio.gather:
import asyncio
import aiohttp

urls_list = ["https://google.com", "https://yahoo.com", "http://hello123456789.com"]
status_dict = {}


async def head_status(session, url) -> dict:
    async with session.head(url) as resp:
        return {url: resp.status}


async def main():
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*[head_status(session, url) for url in urls_list], return_exceptions=True)
        for a in statuses:
            if not isinstance(a, Exception):
                status_dict.update(a)


asyncio.run(main())
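Since return_exceptions=True hands back exception objects for URLs that fail (like the unreachable third URL above), a small variation could record those failures instead of silently dropping them. A sketch reusing the same head_status and urls_list; mapping failed URLs to None is just an illustrative choice:

async def main():
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *[head_status(session, url) for url in urls_list],
            return_exceptions=True,
        )
        # gather preserves input order, so results line up with urls_list.
        for url, result in zip(urls_list, statuses):
            if isinstance(result, Exception):
                status_dict[url] = None  # mark failed lookups
            else:
                status_dict.update(result)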
I'm trying to figure out a way to handle bad requests from an API within an asynchronous function using aiohttp. This is what I've got for testing:
import asyncio

from aiohttp import ClientSession


async def fetch(session):
    url = 'http://httpbin.org/status/404'
    async with session.request('GET', url) as response:
        if response.status == 200:
            try:
                r = await response.json()
                return r
            except ValueError:
                return
        else:
            return None


async def fetch_all(project_list):
    output = []
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, project)) for project in project_list]
        for future in await asyncio.gather(*tasks):
            output += future
        return output


def get_data(project_list):
    loop = asyncio.get_event_loop()
    futures = asyncio.ensure_future(fetch_all(project_list))
    output = loop.run_until_complete(futures)
    return output
In this example, project_list is just a list of integers.
In this instance, fetch() should return None since the response will undoubtedly be 404. The problem arises in fetch_all() where I tell it to += future. I get a TypeError: 'coroutine' object is not iterable. Basically I'd like this to return nothing and, in this case, += nothing to that list. In a perfect world I'd receive a proper json response every time, but I'd like to account for the random instance wherein I receive a bad response from the server.
From what I've read, @asyncio.coroutine would return None, but async values have to be awaited if I'm understanding it correctly.
First, you don't need to wrap the tasks in ensure_future if you want to use gather. Second, you are trying to create fetch tasks with two arguments, session and project, but you have only session in your fetch function definition. And one more thing you can change is removing the loop where you iterate through the gather result, because it already gives you the output you want to return.
Code will be like that:
import asyncio

from aiohttp import ClientSession


async def fetch(session, project):
    url = 'http://httpbin.org/status/404'
    async with session.request('GET', url) as response:
        if response.status == 200:
            try:
                return await response.json()
            except ValueError:
                pass
        return None


async def fetch_all(project_list):
    async with ClientSession() as session:
        tasks = [fetch(session, project) for project in project_list]
        return await asyncio.gather(*tasks)


def get_data(project_list):
    loop = asyncio.get_event_loop()
    output = loop.run_until_complete(fetch_all(project_list))
    return output
As for advice: use ensure_future only if you want to start a coroutine running immediately while its result will only be needed later.
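A minimal sketch of that pattern, with purely illustrative names (slow_lookup and do_other_work are stand-ins for whatever actually runs):

import asyncio


async def slow_lookup():
    await asyncio.sleep(1)
    return 42


async def do_other_work():
    await asyncio.sleep(0.5)


async def main():
    # Schedule slow_lookup right away; it starts running as soon as we await something.
    task = asyncio.ensure_future(slow_lookup())
    await do_other_work()       # other work overlaps with the lookup
    result = await task         # collect the result only when it is needed
    print(result)


asyncio.run(main())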
I'm not completely sure this is the best way to do it but I put this together and it worked. If anyone can correct me, please do.
async def fetch_all(project_list):
    output = []
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, project)) for project in project_list]
        for future in await asyncio.gather(*tasks):
            if future is not None:  # Check if the future is None before adding it
                output += future
        return output
I have a script that checks the status code for a couple hundred thousand supplied websites, and I was trying to integrate a Semaphore to the flow to speed up processing. The problem is that whenever I integrate a Semaphore, I just get a list populated with None objects, and I'm not entirely sure why.
I have been mostly copying code from other sources as I don't fully grok asynchronous programming yet, but when I debug it seems like I should be getting results out of the function; something is going wrong when I gather the results. I've tried juggling around my looping, my gathering, ensuring futures, etc., but nothing seems to return a list of things that work.
import asyncio

import aiohttp


async def fetch(session, url):
    try:
        async with session.head(url, allow_redirects=True) as resp:
            return url, resp.real_url, resp.status, resp.reason
    except Exception as e:
        return url, None, e, 'Error'


async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)


async def run(urls):
    timeout = 15
    tasks = []
    sem = asyncio.Semaphore(100)
    conn = aiohttp.TCPConnector(limit=64, ssl=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        for url in urls:
            task = asyncio.wait_for(bound_fetch(sem, session, url), timeout)
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # responses = [await f for f in tqdm.tqdm(asyncio.as_completed(tasks), total=len(tasks))]
        return responses


urls = ['https://google.com', 'https://yahoo.com']
loop = asyncio.ProactorEventLoop()
data = loop.run_until_complete(run(urls))
I've commented out the progress bar component, but that implementation returns the desired results when there is no semaphore.
Any help would be greatly appreciated. I am furiously reading up on asynchronous programming, but I can't wrap my mind around it yet.
You should explicitly return the result of the awaited coroutine.
Replace this code...
async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)
... with this:
async def bound_fetch(sem, session, url):
    async with sem:
        return await fetch(session, url)
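With that one-word change, data should come back as a list of (url, real_url, status, reason) tuples instead of None values. A quick sanity check, assuming the run function and urls above:

for url, real_url, status, reason in data:
    # e.g. https://google.com -> 200 OK
    print(url, '->', status, reason)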