Using Pyppeteer to Generate a PDF - Working inconsistently - python

Summary
I am using Pyppeteer to open a headless browser and load HTML & CSS to create a PDF. The HTML is accessed via an HTTP request, as it is laid out in a server-side React client.
The PDF generation is triggered by a button press in a front-end React site.
The issue
The majority of the time the PDF prints perfectly; occasionally, however, the PDF prints blank, and once it happens it seems more likely to happen again several times in a row.
I initially thought this was from consecutive download requests happening too close together, but that doesn't seem to be the only cause.
I am seeing 2 errors:
RuntimeError: You cannot use AsyncToSync in the same thread as an async event loop - just await the async function directly.
and
pyppeteer.errors.TimeoutError: Timeout exceeded while waiting for event
Additionally, when this issue happens, I get the following message from the dumpio log:
"Uncaught (in promise) SyntaxError: Unexpected token < in JSON at position 0"
The code
When the webpage is first loaded, I've added a "launch" function to pre-download the browser so it is preinstalled, reducing the wait time for the first PDF download:
from flask import Flask
import asyncio
from pyppeteer import launch
import os

download_in_progress = False

app = Flask(__name__)

async def preload_browser():  # pre-downloading the Chrome client to save time for the first PDF generation
    print("downloading browser")
    await launch(
        headless=True,
        handleSIGINT=False,
        handleSIGTERM=False,
        handleSIGHUP=False,
        autoClose=False,
        args=['--no-sandbox', '--single-process', '--font-render-hinting=none']
    )
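(For reference, a minimal sketch of how preload_browser() might be invoked at startup; the exact hook below is an assumption, since that part of the code isn't shown.)

# Hypothetical startup hook (not from the original code): run the preload
# once before the Flask app starts serving requests.
if __name__ == "__main__":
    asyncio.run(preload_browser())
    app.run()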
The code for creating the PDF is then triggered by an app route via an async def:
@app.route(
    "/my app route with :variables",
    methods=["POST"]
)
async def go_to_pdf_download(self, id, name, version_number, data):
    global download_in_progress
    while download_in_progress:
        print("download in progress, please wait")
        await asyncio.sleep(1)
    else:
        download_in_progress = True
        download_pdf = await pdf(self, id, name, version_number, data)
        return download_pdf
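(As an aside, a minimal sketch of wrapping the returned bytes in an explicit PDF response; the helper and filename below are hypothetical and not taken from the original app.)

from flask import Response

# Hypothetical helper (not in the original code): send the generated bytes
# with an explicit PDF content type and a download filename.
def pdf_response(pdf_bytes, filename="report.pdf"):
    return Response(
        pdf_bytes,
        mimetype="application/pdf",
        headers={"Content-Disposition": f"attachment; filename={filename}"},
    )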
I found that the headless browser was failing with multiple function calls, so I tried adding a while loop. This worked in my local (Docker) container; however, it didn't work consistently in our test environment, so I will likely remove it (instead I am removing the download button in the React app, once clicked, until the PDF is returned).
The code itself:
async def pdf(self, id, name, version_number, data):
    global download_in_progress
    url = "my website URL"
    try:
        print("opening browser")
        browser = await launch(  # can't use initBrowser here for some reason, so calling launch again to access the browser without having to redownload Chrome
            headless=True,
            handleSIGINT=False,
            handleSIGTERM=False,
            handleSIGHUP=False,
            autoClose=False,
            args=['--no-sandbox', '--single-process', '--font-render-hinting=none'],
            dumpio=True  # used to generate console.log statements in the terminal for debugging
        )
        page = await browser.newPage()
        if os.getenv("running environment") == "local":
            pass
        elif self._oidc_identity and self._oidc_data:
            await page.setExtraHTTPHeaders({
                "x-amzn-oidc-identity": self._oidc_identity,
                "x-amzn-oidc-data": self._oidc_data
            })
        await page.goto(url, {'waitUntil': ['domcontentloaded']})  # waitUntil doesn't seem to do anything
        await page.waitForResponse(lambda res: res.status == 200)  # otherwise the download completed before the PDF was generated
        # used to use a sleep timer for 5 seconds, but this was inconsistent, so now waiting for an HTTP response
        pdf = await page.pdf({
            'printBackground': True,
            'format': 'A4',
            'scale': 1,
            'preferCSSPageSize': True
        })
        download_in_progress = False
        return pdf
    except Exception as e:
        download_in_progress = False
        raise e
(I've amended some of the code to hide variable names etc)
I have thought I'd solved this issue multiple times; however, I just seem to be missing something. Any suggestions (or code improvements!) would be greatly appreciated.
Things I've tried
Off the top of my head, the main things I have tried to solve this are:
Adding wait-until loops to create a queue system and stop simultaneous calls; if statements to block multiple requests; adding packages; various different "wait for X" calls; manual sleep timers instead of trying to detect when the process is complete (up to 20 seconds, and it still failed occasionally). I also tried using Selenium, but that ran into a lot of issues.
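(As one possible improvement, a minimal sketch of serializing PDF generation with an asyncio.Lock instead of the global download_in_progress flag; it assumes the route and the lock can share module-level state, and is not taken from the original code.)

import asyncio

# Hypothetical alternative to the global flag: concurrent requests queue on
# the lock instead of polling a boolean in a while loop.
pdf_lock = asyncio.Lock()

async def go_to_pdf_download(self, id, name, version_number, data):
    async with pdf_lock:  # waiters are released one at a time, in order
        return await pdf(self, id, name, version_number, data)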

Related

python/httpx/asyncio: httpx.RemoteProtocolError: Server disconnected without sending a response

I am attempting to optimize a simple web scraper that I made. It gets a list of urls from a table on a main page and then goes to each of those "sub" urls and gets information from those pages. I was able to successfully write it synchronously and using concurrent.futures.ThreadPoolExecutor(). However, I am trying to optimize it to use asyncio and httpx as these seem to be very fast for making hundreds of http requests.
I wrote the following script using asyncio and httpx; however, I keep getting the following errors:
httpcore.RemoteProtocolError: Server disconnected without sending a response.
RuntimeError: The connection pool was closed while 4 HTTP requests/responses were still in-flight.
It appears that I keep losing the connection when I run the script. I even attempted running a synchronous version of it and got the same error. I was thinking that the remote server was blocking my requests; however, I am able to run my original program and go to each of the urls from the same IP address without issue.
What would cause this exception and how do you fix it?
import httpx
import asyncio

async def get_response(client, url):
    resp = await client.get(url, headers=random_user_agent())  # Gets a random user agent.
    html = resp.text
    return html

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
        detail_responses = await asyncio.gather(*tasks)
        for resp in detail_responses:
            event = get_details(resp)  # Parse url and get desired info

asyncio.run(main())
I've had the same issue. The problem occurs when there is an exception in one of the asyncio.gather tasks; when it is raised, it causes the httpx client to call __aexit__ and cancel all the current requests. You can bypass this by passing return_exceptions=True to asyncio.gather:
async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
        detail_responses = await asyncio.gather(*tasks, return_exceptions=True)
        for resp in detail_responses:
            # here you would need to do something with the exceptions
            # if isinstance(resp, Exception): ...
            event = get_details(resp)  # Parse url and get desired info

Concurrent execution of two python methods

I'm creating a script that posts a message to both Discord and Twitter, depending on some input. I have two methods (in separate .py files), post_to_twitter and post_to_discord. What I want to achieve is that both of these try to execute even if the other fails (e.g. if there is some exception with login).
Here is the relevant code snippet for posting to discord:
def post_to_discord(message, channel_name):
    client = discord.Client()

    @client.event
    async def on_ready():
        channel = # getting the right channel
        await channel.send(message)
        await client.close()

    client.run(discord_config.token)
and here is the snippet for the posting-to-Twitter part (stripped of the try-except blocks):
def post_to_twitter(message):
    auth = tweepy.OAuthHandler(twitter_config.api_key, twitter_config.api_key_secret)
    auth.set_access_token(twitter_config.access_token, twitter_config.access_token_secret)
    api = tweepy.API(auth)
    api.update_status(message)
Now, both of these work perfectly fine on their own and when being called synchronously from the same method:
def main(message):
    post_discord.post_to_discord(message)
    post_tweet.post_to_twitter(message)
However, I just cannot get them to work concurrently (i.e. to try to post to twitter even if discord fails or vice-versa). I've already tried a couple of different approaches with multi-threading and with asyncio.
Among others, I've tried the solution from this question, but got the error No module named 'IPython'. When I omitted the IPython line and changed the methods to async, I got this error: RuntimeError: Cannot enter into task <ClientEventTask state=pending event=on_ready coro=<function post_to_discord.<locals>.on_ready at 0x7f0ee33e9550>> while another task <Task pending name='Task-1' coro=<main() running at post_main.py:31>> is being executed..
To be honest, I'm not even sure if asyncio would be the right approach for my use case, so any insight is much appreciated.
Thank you.
In this case running the two things in completely separate threads (and completely separate event loops) is probably the easiest option at your level of expertise. For example, try this:
import post_discord, post_tweet
import concurrent.futures

def main(message):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        fut1 = pool.submit(post_discord.post_to_discord, message)
        fut2 = pool.submit(post_tweet.post_to_twitter, message)
        # here closing the threadpool will wait for both futures to complete
    # make exceptions visible
    for fut in (fut1, fut2):
        try:
            fut.result()
        except Exception as e:
            print("error: ", e)

How to poll a webpage that keeps updating?

I'm trying to integrate my telegram bot with my webcam (DLINK DCS-942LB).
Using the NIPCA standard (Network IP Camera Application Programming Interface) I managed to solve almost everything.
I'm now working on a polling mechanism.
The basic flow should be:
telegram bot keeps polling the camera using http://CAMERA_IP:CAMERA_PORT/config/notify_stream.cgi
when an event happens, telegram bot sends a notification to users
The problem is: the notify_stream.cgi page keeps updating every second, adding events.
I am not able to poll notify_stream.cgi because the request hangs (it never gets a response).
This can be reproduced with a simple script:
import requests
myurl = "http://CAMERA_IP:CAMERA_PORT/config/notify_stream.cgi"
response = requests.get(myurl, auth=("USERNAME", "PASSWORD"))
This results in the request hanging until I stop it manually.
Is it possible to keep listening to notify_stream.cgi and pass new lines to a function?
Thanks to the comment received, using a session with stream=True works fine.
Here is the code:
import requests

def getwebcameventstream(webcam_url, webcam_username, webcam_password):
    requestsession = requests.Session()
    eventhandler = ["first_event", "second_event", "third_event"]
    with requestsession.get(webcam_url, auth=(webcam_username, webcam_password), stream=True) as webcam_response:
        for event in webcam_response.iter_lines():
            if event in eventhandler:
                handlewebcamalarm(event)

def handlewebcamalarm(event):
    print("New event received: " + str(event))

url = 'http://CAMERA_IP:CAMERA_PORT/config/notify_stream.cgi'
username = "myusername"
password = "mypassword"
getwebcameventstream(url, username, password)

Run spider while web application running based on Python asyncio module

My practice now:
I let my backend catch the GET request sent by the front-end page to run my Scrapy spider every time the page is refreshed or loaded. The crawled data is then shown on my front page. Here's the code; I call a subprocess to run the spider:
import os
import json
import logging
from subprocess import run

@get('/api/get_presentcode')
def api_get_presentcode():
    if os.path.exists("/static/presentcodes.json"):
        run("rm presentcodes.json", shell=True)
    run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
    with open("/static/presentcodes.json") as data_file:
        data = json.load(data_file)
    logging.info(data)
    return data
It works well.
What I want:
However, the spider crawls a website that barely changes, so there's no need to crawl that often.
So I want to run my Scrapy spider every 30 minutes, using coroutines, purely in the backend.
What I tried that succeeded:
from subprocess import run

# init of my web application
async def init(loop):
    ....

async def run_spider():
    while True:
        print("Run spider...")
        await asyncio.sleep(10)  # to check results more obviously

loop = asyncio.get_event_loop()
tasks = [run_spider(), init(loop)]
loop.run_until_complete(asyncio.wait(tasks))
loop.run_forever()
It works well too.
But when I change the code of run_spider() to this (which is basically the same as the first snippet):
async def run_spider():
    while True:
        if os.path.exists("/static/presentcodes.json"):
            run("rm presentcodes.json", shell=True)
        run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
        await asyncio.sleep(20)
the spider is run only the first time, and the crawled data is stored to presentcodes.json successfully, but the spider is never called again 20 seconds later.
Questions
What's wrong with my program? Is it because I called a subprocess in a coroutine, and that is invalid?
Any better thoughts to run a spider while the main application is running?
Edit:
Let me put the code of my web app init function here first:
async def init(loop):
    logging.info("App started at {0}".format(datetime.now()))
    await orm.create_pool(loop=loop, user='root', password='', db='myBlog')
    app = web.Application(loop=loop, middlewares=[
        logger_factory, auth_factory, response_factory
    ])
    init_jinja2(app, filters=dict(datetime=datetime_filter))
    add_routes(app, 'handlers')
    add_static(app)
    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 9000)  # It seems something happened here.
    logging.info('server started at http://127.0.0.1:9000')  # this log didn't show up.
    return srv
My thought is that the main app made the coroutine event loop 'stuck', so the spider cannot be called back later.
Let me check the source code of create_server and run_until_complete...
Probably not a complete answer, and I would not do it the way you do. But calling subprocess from within an asyncio coroutine is definitely not correct. Coroutines offer cooperative multitasking, so when you call subprocess from within a coroutine, that coroutine effectively stops your whole app until the called process is finished.
One thing you need to understand when working with asyncio is that control flow can be switched from one coroutine to another only when you call await (or yield from, or async for, async with and other shortcuts). If you do some long-running action without calling any of those, you block all other coroutines until that action is finished.
What you need to use is asyncio.subprocess which will properly return control flow to other parts of your application (namely webserver) while the subprocess is running.
Here is how the actual run_spider() coroutine could look:
import asyncio
import os

async def run_spider():
    while True:
        sp = await asyncio.subprocess.create_subprocess_shell(
            "scrapy crawl presentcodespider -o ../static/presentcodes.new.json",
            cwd="./presentcodeSpider")
        code = await sp.wait()
        if code != 0:
            print("Warning: something went wrong, code %d" % code)
            continue  # retry immediately
        if os.path.exists("/static/presentcodes.new.json"):
            # output was created, overwrite the older version (if any)
            os.rename("/static/presentcodes.new.json", "/static/presentcodes.json")
        else:
            print("Warning: output file was not found")
        await asyncio.sleep(20)

Why doesn't AsyncHTTPClient in Tornado send the request immediately?

In my current application I use Tornado AsyncHttpClient to make requests to a web site.
The flow is complex: processing the responses from previous requests results in further requests.
Actually, I download an article, then analyze it and download the images mentioned in it.
What bothers me is that while in my log I clearly see the message indicating that .fetch() on the photo URL has been issued, no actual HTTP request is made, as sniffed in Wireshark.
I tried tinkering with max_client_count and the Curl/Simple HTTP clients, but the behavior is always the same: until all articles are downloaded, no photo requests are actually issued. How can I change this?
Update: some pseudo code.
@VictorSergienko I am on Linux, so by default, I guess, the EPoll version is used. The whole system is too complicated, but it boils down to:
@gen.coroutine
def fetch_and_process(self, url, callback):
    body = yield self.async_client.fetch(url)
    res = yield callback(body)
    return res

@gen.coroutine
def process_articles(self, urls):
    wait_ids = []
    for url in urls:
        # Enqueue but don't wait for one
        IOLoop.current().add_callback(self.fetch_and_process(url, self.process_article))
        wait_ids.append(yield gen.Callback(key=url))
    # wait for all tasks to finish
    yield wait_ids

@gen.coroutine
def process_article(self, body):
    photo_url = self.extract_photo_url_from_page(body)
    do_some_stuff()
    print('I gonna download that photo ' + photo_url)
    yield self.download_photo(photo_url)

@gen.coroutine
def download_photo(self, photo_url):
    body = yield self.async_client.fetch(photo_url)
    with open(self.construct_filename(photo_url), 'wb') as f:
        f.write(body)
And when it prints "I gonna download that photo", no actual request is made!
Instead, it keeps on downloading more articles and enqueueing more photos until all articles are downloaded; only THEN are all the photos requested in bulk.
AsyncHTTPClient has a queue, which you are filling up immediately in process_articles ("Enqueue but don't wait for one"). By the time the first article is processed its photos will go at the end of the queue after all the other articles.
If you used yield self.fetch_and_process instead of add_callback in process_articles, you would alternate between articles and their photos, but you could only be downloading one thing at a time. To maintain a balance between articles and photos while still downloading more than one thing at a time, consider using the toro package for synchronization primitives. The example in http://toro.readthedocs.org/en/stable/examples/web_spider_example.html is similar to your use case.
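(For illustration, a minimal sketch of the first suggestion, yielding fetch_and_process instead of scheduling it with add_callback; it assumes the same gen.coroutine style as the question's pseudo code.)

from tornado import gen

@gen.coroutine
def process_articles(self, urls):
    for url in urls:
        # Waiting here means each article's photo is fetched before the next
        # article is queued, at the cost of downloading only one thing at a time.
        yield self.fetch_and_process(url, self.process_article)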
