Script performs very slowly even when it runs asynchronously - python

I've written a script with asyncio and the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic in the following script the way it is usually applied in scrapy.
However, when I execute my script, it behaves like synchronous libraries such as requests or urllib.request do. As a result it is very slow and doesn't serve the purpose.
I know I can get around this by defining all the next-page links in the link variable. But am I not already doing the task the right way with my existing script?
Within the script, the processing_docs() function collects all the links of the different posts and passes the refined links to the fetch_again() function, which fetches the title from its target page. processing_docs() also collects the next_page link and supplies it back to the fetch() function to repeat the same process. This next_page call is what makes the script slower, whereas we usually do the same in scrapy and get the expected performance.
My question is: How can I achieve the same keeping the existing logic intact?
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session, title)
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()

The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel with each other). Let's look at your code:
for title in titles:
    await fetch_again(session, title)
This part means that each new fetch_again is started only after the previous one has been awaited (finished). If you do things this way, then yes, there's no difference from using a synchronous approach.
To invoke the full power of asyncio, start multiple fetches concurrently using asyncio.gather:
await asyncio.gather(*[
    fetch_again(session, title)
    for title
    in titles
])
You'll see a significant speedup.
You can go even further and start the fetch for the next page concurrently with fetch_again for the titles:
async def processing_docs(session, html):
    coros = []
    tree = fromstring(html)

    # titles:
    titles = [
        urljoin(link, title.attrib['href'])
        for title
        in tree.cssselect(".summary .question-hyperlink")
    ]
    for title in titles:
        coros.append(
            fetch_again(session, title)
        )

    # next_page:
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        coros.append(
            fetch(page_link)
        )

    # await:
    await asyncio.gather(*coros)
Important note
While this approach allows you to do things much faster, you may want to limit the number of concurrent requests at any given time to avoid significant resource usage on both your machine and the server.
You can use asyncio.Semaphore for this purpose:
semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
            return result
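Note that in the sketch above only fetch acquires the semaphore, so the fetch_again calls spawned from processing_docs are still unbounded. A minimal variation (my own sketch, not part of the original answer) that throttles those requests as well, holding the semaphore only for the duration of the HTTP request:

semaphore = asyncio.Semaphore(10)

async def fetch_again(session, url):
    # Hold a semaphore slot only while the HTTP request is in flight
    async with semaphore:
        async with session.get(url) as response:
            text = await response.text()
    tree = fromstring(text)
    title = tree.cssselect("h1[itemprop='name'] a")[0].text
    print(title)

The same idea can be applied to fetch: scoping the semaphore to the session.get call, rather than to the whole recursive chain, avoids holding a slot while subsequent pages are still being processed.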

Related

How to retrieve scraped data using asyncio

I am a noob who is trying to scrape a list of URLs and search for a word, using asynchronous programming in Python.
My code is as follows:
import asyncio
import aiohttp
from bs4 import BeautifulSoup as bsoup

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(wd, html, url):
    add_soup = bsoup(html, 'html.parser')
    res = []
    for para in add_soup.find_all("p"):
        para_txt = para.text
        for sent_txt in para_txt.split("."):
            if wd in sent_txt:
                res.append([sent_txt, url])
    return res

async def scrape_urls(wd, urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(wd, session, url) for url in urls)
        )

async def fetch_and_parse(wd, session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # Run the parsing in the default executor, passing wd, html and url through
    paras = await loop.run_in_executor(None, parse, wd, html, url)
    return paras
I wrote the above code from this link, but I am unclear how to proceed to retrieve the resulting list.
I am trying to get the results using co = scrape_urls("agriculture", urls). As expected, I get a coroutine object. How do I run the coroutine object and get the results out of it?
Not entirely sure what issue you're facing. Once you use gather to get the Future instance, use an event loop to execute it and get results.
loop = asyncio.get_event_loop()
group = scrape_urls("agriculture", urls)
results = loop.run_until_complete(group)
loop.close()
print(results)
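On Python 3.7+ you can also skip the explicit event-loop management and use asyncio.run, which creates and closes the loop for you. A minimal sketch, assuming urls is already defined:

import asyncio

results = asyncio.run(scrape_urls("agriculture", urls))
print(results)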

intercepting response with substring in url using playwright

I have been learning Playwright in Python, but it appears that I cannot get it to find a response whose URL contains a given substring, while in Node I am able to do so. Is there anything I am doing wrong?
async with page.expect_response("*") as response:
    if "getVerify" in response.url:
        print("found")
I have also tried using "getVerify" with the in operator, to no avail.
node code:
page.on('response', response => {
    if (response.url().includes('getVerify')) {
        console.log(response.url())
    }
})
With the Node case, it's a bit different, as you're passively subscribing to an event. In the Python snippet, you are basically doing the equivalent of page.waitForResponse. Typically you'll do that in conjunction with some action that triggers the response (such as submitting a form).
Try using the Python page.on API, like this:
import asyncio
from playwright.async_api import async_playwright

def check_response(response):
    print(response.url)
    if 'getVerify' in response.url:
        print("Response URL: ", response.url)

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        page.on("response", check_response)
        await page.goto("http://playwright.dev")
        print(await page.title())
        await browser.close()

asyncio.run(main())
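Alternatively, if you want the blocking waitForResponse-style behaviour from your original snippet, expect_response also accepts a predicate function instead of a URL pattern, so the substring check can live there. A sketch of that pattern (the page.goto call is just a placeholder for whatever action actually triggers the getVerify request):

async with page.expect_response(lambda r: "getVerify" in r.url) as response_info:
    await page.goto("http://playwright.dev")  # placeholder trigger action
response = await response_info.value
print(response.url)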

aiohttp download large list of pdf files

I am trying to download a large number of PDF files asynchronously; Python's requests does not work well with async functionality.
I am finding aiohttp hard to use for PDF downloads, and I can't find a thread on this specific task that is easy to understand for someone new to the Python async world.
Yes, it could be done with ThreadPoolExecutor, but in this case it is better to keep everything in one thread.
This code works, but I need to do it with 100 or so URLs asynchronously:
import aiohttp
import aiofiles

async with aiohttp.ClientSession() as session:
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    async with session.get(url) as resp:
        if resp.status == 200:
            f = await aiofiles.open('download_pdf.pdf', mode='wb')
            await f.write(await resp.read())
            await f.close()
Thanks in advance.
You could try something like this. For the sake of simplicity, the same dummy PDF will be downloaded multiple times to disk with different file names:
from asyncio import Semaphore, gather, run, wait_for
from random import randint

import aiofiles
from aiohttp.client import ClientSession

# Mock a list of different pdfs to download
pdf_list = [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
]

MAX_TASKS = 5
MAX_TIME = 5

async def download(pdf_list):
    tasks = []
    sem = Semaphore(MAX_TASKS)
    async with ClientSession() as sess:
        for pdf_url in pdf_list:
            # Mock a different file name each iteration
            dest_file = str(randint(1, 100000)) + ".pdf"
            tasks.append(
                # Wait max 5 seconds for each download
                wait_for(
                    download_one(pdf_url, sess, sem, dest_file),
                    timeout=MAX_TIME,
                )
            )
        return await gather(*tasks)

async def download_one(url, sess, sem, dest_file):
    async with sem:
        print(f"Downloading {url}")
        async with sess.get(url) as res:
            content = await res.read()
        # Check everything went well
        if res.status != 200:
            print(f"Download failed: {res.status}")
            return
        async with aiofiles.open(dest_file, "+wb") as f:
            await f.write(content)
            # No need to use close(f) when using with statement

if __name__ == "__main__":
    run(download(pdf_list))
Keep in mind that firing multiple concurrent requests at a server might get your IP banned for a period of time. In that case, consider adding a sleep call (which somewhat defeats the purpose of using aiohttp) or switching to a classic sequential script. To keep things concurrent but kinder to the server, the script fires at most 5 requests at any given time (MAX_TASKS).
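If you do need to be even gentler, one option is the small sleep mentioned above, placed inside download_one while the semaphore slot is held. A hedged sketch (the delay parameter is illustrative, not part of the original answer):

from asyncio import sleep

async def download_one(url, sess, sem, dest_file, delay=0.5):
    async with sem:
        # Pause briefly before each request to spread the load on the server
        await sleep(delay)
        print(f"Downloading {url}")
        async with sess.get(url) as res:
            if res.status != 200:
                print(f"Download failed: {res.status}")
                return
            content = await res.read()
        async with aiofiles.open(dest_file, "+wb") as f:
            await f.write(content)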

asyncio.gather not waiting long enough for all tasks to complete

I'm writing code to get some links from a list of input URLs using asyncio, aiohttp and BeautifulSoup.
Here's a snippet of the relevant code:
def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        soup = bs4.BeautifulSoup(html, 'lxml',
                                 parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        async with session.get(ep_link) as response:
            html_txt = await response.text()
            return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))
I then later call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a bunch of links (~170 links).
jpgs_links should be a dictionary with numerical keys and a bunch of lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
I reran the same code with a minute in between, and each time I got different lists coming back empty and different ones coming back full.
Could it be that asyncio.gather is not waiting for all the tasks to finish?
How can I get asyncio to return no empty lists, while keeping the number of links in hrefs high?
So, it turns out that some of the URLs I sent in threw this error:
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
So I changed
async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)
to
async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)
What this does is retry the connection after sleeping for a second (the await asyncio.sleep(1) handles the sleep).
Nothing to do with asyncio or BeautifulSoup, apparently.
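One thing to keep in mind with the retry loop above: if a URL keeps failing, the while loop never exits. A hedged variant that caps the number of attempts (max_retries is an illustrative parameter, not from the original code):

async def get_htmllinks(session, ep_num, ep_link, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
                return extractLinks(ep_num, html_txt)
        except aiohttp.ClientResponseError:
            # Back off a little longer after each failed attempt
            await asyncio.sleep(1 + attempt)
    return ep_num, []  # give up and return an empty list for this episode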

Combine aiohttp with multiprocessing

I am making a script that gets the HTML of almost 20 000 pages and parses it to get just a portion of it.
I managed to get the content of the 20 000 pages into a dataframe with asynchronous requests using asyncio and aiohttp, but this script still waits for all the pages to be fetched before parsing them.
async def get_request(session, url, params=None):
    async with session.get(url, headers=HEADERS, params=params) as response:
        return await response.text()

async def get_html_from_url(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_request(session, url))
        html_page_response = await asyncio.gather(*tasks)
    return html_page_response

html_pages_list = asyncio_loop.run_until_complete(get_html_from_url(urls))
Once I have the content of each page, I use multiprocessing's Pool to parallelize the parsing.
def get_whatiwant_from_html(html_content):
    parsed_html = BeautifulSoup(html_content, "html.parser")
    clean = parsed_html.find("div", class_="class").get_text()
    # Some re.subs
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    return clean

pool = Pool(4)
what_i_want = pool.map(get_whatiwant_from_html, html_content_list)
This code mixes the fetching and the parsing asynchronously, but I would like to integrate multiprocessing into it:
async def process(url, session):
    html = await getRequest(session, url)
    return await get_whatiwant_from_html(html)

async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(url, session) for url in urls)
        return await asyncio.gather(*coros)

result = asyncio.get_event_loop().run_until_complete(dispatch(urls))
Is there any obvious way to do this? I thought about creating 4 processes that each run the asynchronous calls but the implementation looks a bit complex and I'm wondering if there is another way.
I am very new to asyncio and aiohttp, so if there is anything you can advise me to read to get a better understanding, I will be very happy.
You can use ProcessPoolExecutor. With run_in_executor you can do the IO in your main asyncio process, but run your heavy CPU calculations in separate processes.
async def get_data(session, url, params=None):
    loop = asyncio.get_event_loop()
    async with session.get(url, headers=HEADERS, params=params) as response:
        html = await response.text()
        data = await loop.run_in_executor(None, partial(get_whatiwant_from_html, html))
        return data

async def get_data_from_urls(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_data(session, url))
        result_data = await asyncio.gather(*tasks)
    return result_data

executor = concurrent.futures.ProcessPoolExecutor(max_workers=10)
asyncio_loop.set_default_executor(executor)
results = asyncio_loop.run_until_complete(get_data_from_urls(urls))
You can also increase your parsing speed by changing your BeautifulSoup parser from html.parser to lxml, which is by far the fastest, followed by html.parser; html5lib is the slowest of them all.
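Switching parsers is a one-line change in the question's get_whatiwant_from_html, assuming lxml is installed:

parsed_html = BeautifulSoup(html_content, "lxml")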
Your bottleneck is not processing but IO. You might want multiple threads, not processes:
E.g. here is a template program that does some scraping, run in multiple threads so it completes the task faster.
from concurrent.futures import ThreadPoolExecutor
import random, time
from bs4 import BeautifulSoup as bs
import requests

URL = 'http://quotesondesign.com/wp-json/posts'

def quote_stream():
    '''
    Quoter streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')

def multi_qouter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]

if __name__ == '__main__':
    now = time.time()
    multi_qouter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
In your case, create a function that performs the task you want from start to finish. This function would accept a url and any necessary parameters as arguments. After that, create another function that calls the previous function in different threads, each thread having its own url: so instead of for i in range(..), use for url in urls. You could run 2000 threads at once, but I would prefer chunks of, say, 200 running in parallel, as in the sketch below.
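A minimal sketch of that chunking idea, assuming a synchronous scrape_one(url) helper (a placeholder name for the start-to-finish function described above) and a list urls:

from concurrent.futures import ThreadPoolExecutor

def scrape_in_chunks(urls, chunk_size=200):
    results = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        # Run one chunk of URLs in parallel, then move on to the next chunk
        with ThreadPoolExecutor(max_workers=chunk_size) as executor:
            results.extend(executor.map(scrape_one, chunk))
    return results

Note that ThreadPoolExecutor already caps concurrency at max_workers, so a single executor with max_workers=200 over the full list achieves much the same effect.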
