I am a noob who is trying to scrape a list of URLs and search for a word using asynchronous programming in Python.
My code is as follows:
import asyncio

import aiohttp
from bs4 import BeautifulSoup as bsoup

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(wd, html, url):
    add_soup = bsoup(html, 'html.parser')
    res = []
    for para in add_soup.find_all("p"):
        para_txt = para.text
        for sent_txt in para_txt.split("."):
            if wd in sent_txt:
                res.append([sent_txt, url])
    return res

async def scrape_urls(wd, urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(wd, session, url) for url in urls)
        )

async def fetch_and_parse(wd, session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run the CPU-bound parsing in the default executor so it doesn't block the event loop
    paras = await loop.run_in_executor(None, parse, wd, html, url)
    return paras
I wrote the above code from this link, but I am unclear as to how to proceed to retrieve the resulting list.
I am trying to get the results using co = scrape_urls("agriculture", urls). As expected, I get a coroutine object. How do I run the coroutine object and get the results out of it?
Not entirely sure what issue you're facing. Once you use gather to get the Future instance, use an event loop to execute it and get results.
loop = asyncio.get_event_loop()
group = scrape_urls("agriculture", urls)
results = loop.run_until_complete(group)
loop.close()
print(results)
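On Python 3.7+ you can also let asyncio.run manage the loop for you; a minimal sketch, assuming urls is already defined:
# asyncio.run creates a new event loop, runs the coroutine to completion, and closes the loop
results = asyncio.run(scrape_urls("agriculture", urls))
print(results)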
I am trying to achieve aiohttp async processing of requests that have been defined in my class as follows:
class Async():
    async def get_service_1(self, zip_code, session):
        url = SERVICE1_ENDPOINT.format(zip_code)
        response = await session.request('GET', url)
        return await response

    async def get_service_2(self, zip_code, session):
        url = SERVICE2_ENDPOINT.format(zip_code)
        response = await session.request('GET', url)
        return await response

    async def gather(self, zip_code):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                self.get_service_1(zip_code, session),
                self.get_service_2(zip_code, session)
            )

    def get_async_requests(self, zip_code):
        asyncio.set_event_loop(asyncio.SelectorEventLoop())
        loop = asyncio.get_event_loop()
        results = loop.run_until_complete(self.gather(zip_code))
        loop.close()
        return results
When I run get_async_requests to get the results, I am getting the following error:
TypeError: object ClientResponse can't be used in 'await' expression
Where am I going wrong in the code? Thank you in advance.
When you await something like session.request(...), the I/O starts, but aiohttp returns as soon as it receives the headers; it doesn't wait for the response to finish. (This lets you react to a status code without waiting for the entire body of the response.)
You need to await something that does wait for the body. If you're expecting a response that contains text, that's response.text(). If you're expecting JSON, that's response.json(). This would look something like:
response = await session.get(url)
return await response.text()
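Applied to the class in the question, get_service_1 would then look roughly like this (a sketch, assuming the endpoint returns a text body):
async def get_service_1(self, zip_code, session):
    url = SERVICE1_ENDPOINT.format(zip_code)
    # awaiting session.request gives you the ClientResponse once the headers arrive;
    # awaiting response.text() (or .json()) then reads the body
    response = await session.request('GET', url)
    return await response.text()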
I'm writing code to get some links from a list of input URLs using asyncio, aiohttp and BeautifulSoup.
Here's a snippet of the relevant code:
def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        soup = bs4.BeautifulSoup(html, 'lxml',
                                 parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        async with session.get(ep_link) as response:
            html_txt = await response.text()
            return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))
I then later call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a bunch of links (~170 links).
jpgs_links should be a dictionary with numerical keys and a bunch of lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
When I reran the same code a minute later, different lists came back empty and different ones came back full.
Could it be that asyncio.gather is not waiting for all the tasks to finish?
How can I get asyncio to return no empty lists, while keeping the number of links in hrefs high?
It turns out that some of the URLs I sent in raised this error:
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
So I changed
async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)
to
async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)
This retries the request after sleeping for a second (that's what await asyncio.sleep(1) does).
Nothing to do with asyncio or BeautifulSoup, apparently.
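One caveat with the while not html_txt loop: if a URL never recovers (say it keeps returning 504), it will retry forever. A capped-retry variant might look like this (a sketch; max_retries and the empty-list fallback are my own additions):
async def get_htmllinks(session, ep_num, ep_link, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
                return extractLinks(ep_num, html_txt)
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)  # back off for a second, then retry
    return ep_num, []  # give up after max_retries failed attempts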
My question is about usage of the aiohttp library.
My goal here is to check a list of URLs by sending a bunch of HEAD requests, potentially asynchronously, and eventually create a dict of url: status pairs.
I am new to asyncio and I found a lot of examples where people use GET requests to fetch HTML, for example, and await resp.read() or resp.text(), and that works fine. But with a HEAD request I don't have a body, just headers. If I try to await resp.status, or resp itself as an object, it does not work, since they are not awaitable.
The code below works only synchronously, step by step, and I can't figure out how to make it run asynchronously. It seems like whatever I do with the status turns the code into sync mode somehow...
I would be glad to see your ideas.
Thanks.
import asyncio
import aiohttp

urls_list = [url1, url2, url3, etc, etc, etc, ]
status_dict = {}

async def main():
    async with aiohttp.ClientSession() as session:
        for individual_url in urls_list:
            async with session.head(individual_url) as resp:
                status_dict.update({individual_url: resp.status})

asyncio.run(main())
You can use asyncio.gather:
import asyncio
import aiohttp

urls_list = ["https://google.com", "https://yahoo.com", "http://hello123456789.com"]
status_dict = {}

async def head_status(session, url) -> dict:
    async with session.head(url) as resp:
        return {url: resp.status}

async def main():
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*[head_status(session, url) for url in urls_list], return_exceptions=True)
        for a in statuses:
            if not isinstance(a, Exception):
                status_dict.update(a)

asyncio.run(main())
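If you also want to know which URLs failed instead of silently skipping them, you could catch the error inside head_status; a sketch, where storing the exception message as the "status" is just my own convention:
async def head_status(session, url) -> dict:
    try:
        async with session.head(url) as resp:
            return {url: resp.status}
    except aiohttp.ClientError as e:
        # record the failure so the URL still shows up in status_dict
        return {url: f"error: {e}"}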
I have a script that checks the status code for a couple hundred thousand supplied websites, and I was trying to integrate a Semaphore into the flow to speed up processing. The problem is that whenever I integrate a Semaphore, I just get a list populated with None objects, and I'm not entirely sure why.
I have been mostly copying code from other sources as I don't fully grok asynchronous programming yet. When I debug, it seems like I should be getting results out of the function, but something goes wrong when I gather the results. I've tried juggling around my looping, my gathering, ensuring futures, etc., but nothing seems to return a list of things that work.
async def fetch(session, url):
    try:
        async with session.head(url, allow_redirects=True) as resp:
            return url, resp.real_url, resp.status, resp.reason
    except Exception as e:
        return url, None, e, 'Error'

async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)

async def run(urls):
    timeout = 15
    tasks = []
    sem = asyncio.Semaphore(100)
    conn = aiohttp.TCPConnector(limit=64, ssl=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        for url in urls:
            task = asyncio.wait_for(bound_fetch(sem, session, url), timeout)
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # responses = [await f for f in tqdm.tqdm(asyncio.as_completed(tasks), total=len(tasks))]
        return responses

urls = ['https://google.com', 'https://yahoo.com']
loop = asyncio.ProactorEventLoop()
data = loop.run_until_complete(run(urls))
I've commented out the progress bar component, but that implementation returns the desired results when there is no semaphore.
Any help would be greatly appreciated. I am furiously reading up on asynchronous programming, but I can't wrap my mind around it yet.
You should explicitly return the result of awaiting the coroutine.
Replace this code...
async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)
... with this:
async def bound_fetch(sem, session, url):
    async with sem:
        return await fetch(session, url)
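One more thing to watch in your run coroutine: asyncio.wait_for raises asyncio.TimeoutError when a URL takes longer than the 15-second timeout, and that exception propagates out of asyncio.gather, so you lose the other results. Passing return_exceptions=True keeps them; a sketch of just that part:
# Exceptions (e.g. timeouts) are returned in place instead of being raised,
# so the successful responses are still available.
responses = await asyncio.gather(*tasks, return_exceptions=True)
results = [r for r in responses if not isinstance(r, Exception)]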
I am making a script that gets the HTML of almost 20 000 pages and parses it to get just a portion of it.
I managed to get the 20 000 pages' content into a dataframe with asynchronous requests using asyncio and aiohttp, but this script still waits for all the pages to be fetched before parsing them.
async def get_request(session, url, params=None):
    async with session.get(url, headers=HEADERS, params=params) as response:
        return await response.text()

async def get_html_from_url(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_request(session, url))
        html_page_response = await asyncio.gather(*tasks)
    return html_page_response
html_pages_list = asyncio_loop.run_until_complete(get_html_from_url(urls))
Once I have the content of each page, I use multiprocessing's Pool to parallelize the parsing.
def get_whatiwant_from_html(html_content):
    parsed_html = BeautifulSoup(html_content, "html.parser")
    clean = parsed_html.find("div", class_="class").get_text()
    # Some re.subs
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    return clean

pool = Pool(4)
what_i_want = pool.map(get_whatiwant_from_html, html_content_list)
This code mixes the fetching and the parsing asynchronously, but I would like to integrate multiprocessing into it:
async def process(url, session):
    html = await get_request(session, url)
    return get_whatiwant_from_html(html)

async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(url, session) for url in urls)
        return await asyncio.gather(*coros)

result = asyncio.get_event_loop().run_until_complete(dispatch(urls))
Is there any obvious way to do this? I thought about creating 4 processes that each run the asynchronous calls but the implementation looks a bit complex and I'm wondering if there is another way.
I am very new to asyncio and aiohttp, so if you have any reading to recommend for a better understanding, I would be very happy.
You can use ProcessPoolExecutor. With run_in_executor you can do the IO in your main asyncio process but run your heavy CPU calculations in separate processes.
import concurrent.futures
from functools import partial

async def get_data(session, url, params=None):
    loop = asyncio.get_event_loop()
    async with session.get(url, headers=HEADERS, params=params) as response:
        html = await response.text()
        # hand the CPU-bound parsing off to the default (process pool) executor
        data = await loop.run_in_executor(None, partial(get_whatiwant_from_html, html))
        return data

async def get_data_from_urls(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_data(session, url))
        result_data = await asyncio.gather(*tasks)
    return result_data

executor = concurrent.futures.ProcessPoolExecutor(max_workers=10)
asyncio_loop.set_default_executor(executor)
results = asyncio_loop.run_until_complete(get_data_from_urls(urls))
You can increase your parsing speed by changing your BeautifulSoup parser from html.parser to lxml, which is by far the fastest, followed by html5lib. html.parser is the slowest of them all.
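The swap is a one-line change in the parsing function (assuming the third-party lxml package is installed, e.g. via pip install lxml):
# same call as before, just with the faster C-based parser
parsed_html = BeautifulSoup(html_content, "lxml")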
Your bottleneck is not processing but IO. You might want multiple threads rather than processes.
For example, here is a template program that scrapes and sleeps to make it slow, but runs in multiple threads and thus completes the task faster.
from concurrent.futures import ThreadPoolExecutor
import random, time
from bs4 import BeautifulSoup as bs
import requests

URL = 'http://quotesondesign.com/wp-json/posts'

def quote_stream():
    '''
    Quoter streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')

def multi_qouter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]

if __name__ == '__main__':
    now = time.time()
    multi_qouter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
In your case, create a function that performs the task you want from start to finish. This function would accept a url and the necessary parameters as arguments. After that, create another function that calls the previous function in different threads, each thread having its own url. So instead of for i in range(...), use for url in urls. You can run 2000 threads at once, but I would prefer chunks of, say, 200 running in parallel, as sketched below.
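A rough sketch of that idea; scrape_one and the chunk size of 200 are placeholders, not something from your code:
from concurrent.futures import ThreadPoolExecutor

def scrape_one(url):
    # placeholder: fetch one url and parse out whatever you need from it
    ...

def scrape_in_chunks(urls, chunk_size=200):
    results = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        # one thread per url in the chunk, so at most chunk_size requests run in parallel
        with ThreadPoolExecutor(max_workers=len(chunk)) as executor:
            results.extend(executor.map(scrape_one, chunk))
    return results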