intercepting response with substring in url using playwright - python

I have been learning Playwright with Python, but I cannot get it to find a response whose URL contains a given substring, while in Node I am able to do so. Is there anything I am doing wrong?
async with page.expect_response("*") as response:
    if "getVerify" in response.url:
        print("found")
I have also tried using getVerify and in, to no avail.
node code:
page.on('response', response => {
    if (response.url().includes('getVerify')) {
        console.log(response.url())
    }
})

The Node case is a bit different: there you are passively subscribing to an event. In the Python snippet, you are essentially doing the equivalent of Node's page.waitForResponse (page.expect_response in Playwright Python). Typically you would use that in conjunction with some action that triggers the response (such as submitting a form).
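If you do want to block until one specific response arrives, expect_response also accepts a predicate instead of a glob pattern; a sketch along these lines should work (the page.click selector is just a placeholder for whatever action triggers the request):

async with page.expect_response(lambda r: "getVerify" in r.url) as response_info:
    await page.click("text=Submit")  # placeholder: the action that triggers the request
response = await response_info.value
print(response.url)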
Alternatively, to mirror your Node snippet, use the Python page.on API, like this:
import asyncio
from playwright.async_api import async_playwright

def check_response(response):
    print(response.url)
    if 'getVerify' in response.url:
        print("Response URL: ", response.url)

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        page.on("response", check_response)
        await page.goto("http://playwright.dev")
        print(await page.title())
        await browser.close()

asyncio.run(main())

Related

How to check if Pyppeteer browser has closed?

I can't seem to find any information for Pyppeteer (Python's version of Puppeteer) on how to check whether my browser has closed properly after calling browser.close().
I have limited knowledge of JavaScript, so I can't properly follow the answer to "puppeteer: how to check if browser is still open and working".
print(browser.on('disconnected')) seems to return a function object, which when called requires something called f.
What is the proper way to check if the browser has closed properly?
import asyncio
from pyppeteer import launch

async def get_browser():
    return await launch({"headless": False})

async def get_page():
    browser = await get_browser()
    url = 'https://www.wikipedia.org/'
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()
    await browser.close()
    print(browser.on('disconnected'))
    #assert browser is None
    #assert print(html)

loop = asyncio.get_event_loop()
result = loop.run_until_complete(get_page())
print(result)
.on methods register a callback to be fired on a particular event. For example:
import asyncio
from pyppeteer import launch

async def get_page():
    browser = await launch({"headless": True})
    browser.on("disconnected", lambda: print("disconnected"))
    url = "https://www.wikipedia.org/"
    page, = await browser.pages()
    await page.goto(url)
    content = await page.content()
    print("disconnecting...")
    await browser.disconnect()
    await browser.close()
    return content

loop = asyncio.get_event_loop()
result = loop.run_until_complete(get_page())
Output:
disconnecting...
disconnected
From the callback, you could flip a flag to indicate closure or (better yet) take whatever other action you want to take directly.
There's also browser.process.returncode (browser.process is a Popen instance). It's 1 after the browser has been closed, but not after disconnect.
Here's an example of the above:
import asyncio
from pyppeteer import launch

async def get_page():
    browser = await launch({"headless": True})
    connected = True

    async def handle_disconnected():
        nonlocal connected
        connected = False

    browser.on(
        "disconnected",
        lambda: asyncio.ensure_future(handle_disconnected())
    )
    print("connected?", connected)
    print("return code?", browser.process.returncode)
    print("disconnecting...")
    await browser.disconnect()
    print("connected?", connected)
    print("return code?", browser.process.returncode)
    print("closing...")
    await browser.close()
    print("return code?", browser.process.returncode)

asyncio.get_event_loop().run_until_complete(get_page())
Output:
connected? True
return code? None
disconnecting...
connected? False
return code? None
closing...
return code? 1
You can use browser.on('disconnected') to listen for when the browser is closed or crashed, or when the browser.disconnect() method was called. Then you can automatically relaunch the browser and continue with your program.
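As a rough sketch of that relaunch idea (the state dict, the launch_with_relaunch and on_disconnected helpers, and the headless option are illustrative choices, not part of the answer above):

import asyncio
from pyppeteer import launch

state = {"browser": None, "shutting_down": False}

async def launch_with_relaunch():
    browser = await launch({"headless": True})
    browser.on(
        "disconnected",
        lambda: asyncio.ensure_future(on_disconnected())
    )
    state["browser"] = browser
    return browser

async def on_disconnected():
    if state["shutting_down"]:
        return                        # intentional shutdown, nothing to do
    print("browser went away, relaunching...")
    await launch_with_relaunch()      # replace the lost browser

async def main():
    browser = await launch_with_relaunch()
    page = await browser.newPage()
    await page.goto("https://www.wikipedia.org/")
    print(await page.title())
    state["shutting_down"] = True     # don't relaunch on this deliberate close
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())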

how to schedule the execution of an async function in python and immediately return

I need to implement a proxy server in Python that forwards client requests to some API if it doesn't have the data in its cache. The requirement is that when the data isn't in the cache, the client should not wait at all: the server sends back something like "you'll have your data soon" and meanwhile sends a request to the API. My understanding is that I need async/await for this, but I could not make it work no matter what I tried. I am using the asyncio and aiohttp libraries.
So let's say I have my function that sends a request to the api:
async def fetch(url, page_num):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            resp = await response.json()
            cache[page_num] = (resp, datetime.now())
            return resp
what I would like is the following behavior:
if not_in_cache(page_number):
    fetch(url, page_number)  # this needs to return immediately so the client won't wait!!!
    return Response("we're working on it")  # send back a response without data
So on the one hand I want the method to immediately return a response to the client, but in the background I want it to get the data and store it in the cache. How can you accomplish that with async/await?
Create a task. Instead of:
if not_in_cache(page_number):
    await fetch(url, page_number)
    return Response(...)
write:
if not_in_cache(page_number):
    asyncio.create_task(fetch(url, page_number))
    return Response(...)
Don't forget to read the asyncio docs: Coroutines and Tasks
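Put into a fuller context, here is a minimal sketch (assuming an aiohttp.web server, an in-memory dict cache, and a hypothetical upstream API_URL; those names are not from the question) of a handler that answers immediately while the fetch runs in the background:

import asyncio
from datetime import datetime

import aiohttp
from aiohttp import web

API_URL = "https://example.com/api/pages/{}"   # hypothetical upstream API
cache = {}
background_tasks = set()                       # keep references so tasks aren't garbage-collected

async def fetch(url, page_num):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            cache[page_num] = (await response.json(), datetime.now())

async def handler(request):
    page_num = request.match_info["page_num"]
    if page_num in cache:
        return web.json_response(cache[page_num][0])
    # schedule the fetch and return right away, without awaiting it
    task = asyncio.create_task(fetch(API_URL.format(page_num), page_num))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return web.Response(text="we're working on it")

app = web.Application()
app.add_routes([web.get("/pages/{page_num}", handler)])

if __name__ == "__main__":
    web.run_app(app)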

Fetch HEAD request's status asynchronously in aiohttp

This question is regarding aiohttp library usage.
My goal here is to check a list of URLs by sending a bunch of HEAD requests, potentially asynchronously, and eventually create a dict of url: status pairs.
I am new to asyncio and I found a lot of examples where people use GET requests to fetch html, for example, and use await resp.read() or await resp.text(), and that works fine, but with a HEAD request I don't have a body, only headers. If I try to await resp.status, or resp itself as an object, it does not work as they are not awaitable.
The code below works only synchronously, step by step, and I can't figure out how to make it run asynchronously. It seems like whatever I do with the status turns the code into sync mode somehow...
I would be glad to see your ideas.
Thanks.
import asyncio
import aiohttp

urls_list = [url1, url2, url3, etc, etc, etc, ]
status_dict = {}

async def main():
    async with aiohttp.ClientSession() as session:
        for individual_url in urls_list:
            async with session.head(individual_url) as resp:
                status_dict.update({individual_url: resp.status})

asyncio.run(main())
You can use asyncio.gather:
import asyncio
import aiohttp

urls_list = ["https://google.com", "https://yahoo.com", "http://hello123456789.com"]
status_dict = {}

async def head_status(session, url) -> dict:
    async with session.head(url) as resp:
        return {url: resp.status}

async def main():
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*[head_status(session, url) for url in urls_list], return_exceptions=True)
        for a in statuses:
            if not isinstance(a, Exception):
                status_dict.update(a)

asyncio.run(main())
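return_exceptions=True is what keeps one unreachable URL (like the third one above) from cancelling the whole gather; the failing entries are simply skipped. If you would rather record failures as well, a small variation of head_status (illustrative, not part of the original answer) could catch the client error itself:

async def head_status(session, url) -> dict:
    try:
        async with session.head(url) as resp:
            return {url: resp.status}
    except aiohttp.ClientError:
        # record the failure instead of silently dropping the URL
        return {url: None}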

aiohttp: multiple requests to same URL return authentication error, but the URL is correct

I am using the code below to make 599 asynchronous requests to the Strava API. For some reason the response I get for each of them is
{"message":"Authorization Error","errors":[{"resource":"Application","field":"","code":"invalid"}]}
This is the type of error you typically get when your access_token query string parameter is invalid. But in this case the token is 100% correct: the URL returns the correct response when simply copy-pasted into the browser.
What might be the reason for the error and how can I fix it? Might it be that the aiohttp session is somehow messing up the authentication procedure?
Note: for privacy reasons the token in the code below is fake.
import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        print(await response.text())

async def main():
    urls = ['''https://www.strava.com/api/v3/activities/
280816027?include_all_efforts=true&
access_token=11111111'''] * 599
    async with aiohttp.ClientSession() as session:
        tasks = [
            asyncio.ensure_future(fetch(session, url))
            for url in urls
        ]
        await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
You shouldn't use a multiline string for the URL, because it keeps all the whitespace (the newlines and any leading spaces), and as a result you request the wrong URL.
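One way to avoid the problem entirely (a sketch, reusing the same placeholder activity id and fake token from the question) is to keep the URL on a single line and let aiohttp build the query string from a params dict:

url = "https://www.strava.com/api/v3/activities/280816027"
params = {"include_all_efforts": "true", "access_token": "11111111"}

async def fetch(session, url, params):
    async with session.get(url, params=params) as response:
        print(await response.text())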

Script performs very slowly even when it runs asynchronously

I've written a script using asyncio together with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic in the following script the way it is usually applied in scrapy.
However, when I execute my script, it behaves like synchronous libraries such as requests or urllib.request. Therefore, it is very slow and doesn't serve the purpose.
I know I can get around this by defining all the next-page links in the link variable. But am I not already doing the task the right way with my existing script?
Within the script, the processing_docs() function collects all the links to the different posts and passes the refined links to the fetch_again() function to fetch the title from its target page. There is also logic within processing_docs() that collects the next_page link and supplies it to the fetch() function to repeat the process. This next_page call is making the script slower, whereas we usually do the same in scrapy and get the expected performance.
My question is: how can I achieve the same thing while keeping the existing logic intact?
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
            return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session, title)
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()
The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel to each other). Let's look at your code:

for title in titles:
    await fetch_again(session, title)

This part means that each new fetch_again will be started only after the previous one was awaited (finished). If you do things this way, there is indeed no difference from using a synchronous approach.
To unlock the full power of asyncio, start multiple fetches concurrently using asyncio.gather:

await asyncio.gather(*[
    fetch_again(session, title)
    for title
    in titles
])
You'll see significant speedup.
You can go even further and start the fetch for the next page concurrently with the fetch_again calls for the titles:
async def processing_docs(session, html):
    coros = []
    tree = fromstring(html)

    # titles:
    titles = [
        urljoin(link, title.attrib['href'])
        for title
        in tree.cssselect(".summary .question-hyperlink")
    ]
    for title in titles:
        coros.append(
            fetch_again(session, title)
        )

    # next_page:
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        coros.append(
            fetch(page_link)
        )

    # await:
    await asyncio.gather(*coros)
Important note
While this approach allows you to do things much faster, you may want to limit the number of concurrent requests at any given time, to avoid significant resource usage on both your machine and the server.
You can use asyncio.Semaphore for this purpose:
semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
                return result
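Two extra details, beyond the original answer: creating the Semaphore while the event loop is already running sidesteps loop-binding issues that older Python versions can hit, and reusing a single ClientSession for all requests is cheaper than opening a new one per fetch. A standalone sketch of that pattern, with illustrative names:

import asyncio
import aiohttp

async def fetch_text(session, semaphore, url):
    async with semaphore:  # at most 10 requests in flight at once
        async with session.get(url) as response:
            return await response.text()

async def main(urls):
    semaphore = asyncio.Semaphore(10)               # created while the loop is running
    async with aiohttp.ClientSession() as session:  # one session reused for every request
        pages = await asyncio.gather(*(fetch_text(session, semaphore, url) for url in urls))
        print([len(page) for page in pages])

if __name__ == '__main__':
    asyncio.run(main(["https://stackoverflow.com/questions/tagged/web-scraping"]))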
