I am attempting to optimize a simple web scraper that I made. It gets a list of urls from a table on a main page and then goes to each of those "sub" urls and gets information from those pages. I was able to successfully write it synchronously and using concurrent.futures.ThreadPoolExecutor(). However, I am trying to optimize it to use asyncio and httpx as these seem to be very fast for making hundreds of http requests.
I wrote the following script using asyncio and httpx; however, I keep getting the following errors:
httpcore.RemoteProtocolError: Server disconnected without sending a response.
RuntimeError: The connection pool was closed while 4 HTTP requests/responses were still in-flight.
It appears that I keep losing the connection when I run the script. I even attempted running a synchronous version of it and got the same error. I was thinking that the remote server was blocking my requests; however, I am able to run my original program and visit each of the urls from the same IP address without issue.
What would cause this exception and how do you fix it?
import httpx
import asyncio
async def get_response(client, url):
    resp = await client.get(url, headers=random_user_agent())  # Gets a random user agent.
    html = resp.text
    return html

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
        detail_responses = await asyncio.gather(*tasks)
        for resp in detail_responses:
            event = get_details(resp)  # Parse url and get desired info

asyncio.run(main())
I've had the same issue. The problem occurs when one of the asyncio.gather tasks raises an exception: the exception propagates out of gather, the async with block exits, the httpx client's __aexit__ closes the connection pool, and all in-flight requests are cancelled. You can work around it by passing return_exceptions=True to asyncio.gather:
async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
        detail_responses = await asyncio.gather(*tasks, return_exceptions=True)
        for resp in detail_responses:
            # here you would need to do something with the exceptions
            # if isinstance(resp, Exception): ...
            event = get_details(resp)  # Parse url and get desired info
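As a follow-up, here is a minimal sketch of what that exception handling could look like, reusing the helper functions from the question (random_user_agent, get_events, get_details). It also caps the number of requests in flight with an asyncio.Semaphore, since "Server disconnected" errors often show up when too many connections are opened at once; the limit of 20 is an arbitrary example value, not something taken from the original post.

import asyncio
import httpx

async def get_response(client, url, semaphore):
    async with semaphore:
        resp = await client.get(url, headers=random_user_agent())
        resp.raise_for_status()  # turn HTTP error statuses into exceptions so gather() collects them
        return resp.text

async def main():
    semaphore = asyncio.Semaphore(20)  # arbitrary cap on concurrent requests
    async with httpx.AsyncClient() as client:
        urls = get_events('https://main-url-to-parse.com')
        tasks = [asyncio.ensure_future(get_response(client, url, semaphore)) for url in urls]
        detail_responses = await asyncio.gather(*tasks, return_exceptions=True)
        for url, resp in zip(urls, detail_responses):
            if isinstance(resp, Exception):
                print(f"Request for {url} failed: {resp!r}")  # log, retry, or skip
                continue
            event = get_details(resp)  # Parse url and get desired info

asyncio.run(main())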
I have a process which generates some data then uploads it to a web server. Both operations take some time, so I would like to speed the process up by sending the data asynchronously. I also do not want to overload the server, so I want to restrict uploading to one file at a time. I can't get this working and was hoping someone could point me in the right direction.
I have set up testing using a simple Flask server. I created a test example using requests, which works fine. However, when I try to upload data with asyncio it throws an exception some of the time. It seems to be related to how quickly I make the calls: sometimes it's fine, sometimes it throws, and once it starts failing it seems to cascade. I am trying to ensure I'm only making one call at a time, which should mimic requests.
Given that it works fine with requests, I'm guessing I'm doing something wrong on the asyncio side. I have tried a range of different implementations with no luck.
This is the code for the Flask server:
from flask import Flask
from flask import jsonify
app = Flask(__name__)
@app.route('/', methods=['GET', 'POST'])
def upload():
    return jsonify({'status': 'OK'})

if __name__ == "__main__":
    app.run(debug=True, port=8080)
This is the code for the requests and asyncio upload implementations:
import asyncio, requests
from aiohttp import ClientSession

url = 'http://127.0.0.1:8080'

def get_data():
    return {'a': 'b'}

async def send_data(session, data):
    async with session.post(url, json=data) as response:
        response_data = await response.json()
        print(response_data)

async def upload_asyncio(loop):
    """attempt at asyncio upload"""
    task = None
    async with ClientSession() as session:
        for _ in range(20):
            # dummy get data function for testing
            data = get_data()
            # I want to wait until the previous upload has completed before uploading the next chunk
            if task is not None:
                await task
            task = asyncio.create_task(send_data(session, data))
        if task is not None:
            await task

def upload_requests():
    """working equivalent using requests"""
    for _ in range(20):
        data = get_data()
        with requests.post(url, json=data) as resp:
            response_data = resp.json()
            print(response_data)

# upload_requests()
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(upload_asyncio(loop))
the exception which gets raised is:
Exception in callback _ProactorBasePipeTransport._call_connection_lost(None)
handle: <Handle _ProactorBasePipeTransport._call_connection_lost(None)>
Traceback (most recent call last):
File "C:\Users\John\anaconda3\envs\py310\lib\asyncio\events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "C:\Users\John\anaconda3\envs\py310\lib\asyncio\proactor_events.py", line 162, in _call_connection_lost
self._sock.shutdown(socket.SHUT_RDWR)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
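For what it's worth, since the stated goal is one upload at a time, a simpler serial pattern is to await the coroutine directly inside the loop instead of creating a task and awaiting the previous one. The sketch below reuses get_data and send_data from the code above, drops the unused loop argument, and uses asyncio.run; it is only an illustration of that pattern, not necessarily a fix for the WinError 10054 above.

import asyncio
from aiohttp import ClientSession

async def upload_asyncio_serial():
    """sequential uploads: each POST finishes before the next one starts"""
    async with ClientSession() as session:
        for _ in range(20):
            data = get_data()  # dummy get data function from the question
            await send_data(session, data)

asyncio.run(upload_asyncio_serial())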
Summary
I am using Pyppeteer to open a headless browser and load HTML & CSS to create a PDF. The HTML is accessed via an HTTP request, as it is laid out in a server-side React client.
The PDF is triggered by a button press in a front end React site.
The issue
The majority of the time the PDF prints perfectly; however, occasionally the PDF prints blank, and once it happens it seems more likely to happen again multiple times in a row.
I initially thought this was from consecutive download requests happening too close together, but that doesn't seem to be the only cause.
I am seeing 2 errors:
RuntimeError: You cannot use AsyncToSync in the same thread as an async event loop - just await the async function directly.
and
pyppeteer.errors.TimeoutError: Timeout exceeded while waiting for event
Additionally, when this issue happens, I get the following message from the dumpio log:
"Uncaught (in promise) SyntaxError: Unexpected token < in JSON at position 0"
The code
When the webpage is first loaded, I've added a "launch" function to pre-download the browser so it is already installed, reducing the wait time for the first PDF download:
from flask import Flask
import asyncio
from pyppeteer import launch
import os
download_in_progress = False
app = Flask(__name__)
async def preload_browser():  # pre-downloading the chrome client to save time on the first pdf generation
    print("downloading browser")
    await launch(
        headless=True,
        handleSIGINT=False,
        handleSIGTERM=False,
        handleSIGHUP=False,
        autoClose=False,
        args=['--no-sandbox', '--single-process', '--font-render-hinting=none']
    )
Then the code for creating the PDF is triggered by an app route via an async def:
@app.route(
    "/my app route with :variables",
    methods=["POST"]
)
async def go_to_pdf_download(self, id, name, version_number, data):
    global download_in_progress
    while download_in_progress:
        print("download in progress, please wait")
        await asyncio.sleep(1)
    else:
        download_in_progress = True
        download_pdf = await pdf(self, id, name, version_number, data)
        return pdf
I found that the headless browser was failing with multiple function calls, so I tried adding a while loop. This worked in my local (Docker) container; however, it didn't work consistently in my test environment, so I will likely remove it (instead I am disabling the download button in the React app, once clicked, until the PDF is returned).
The code itself:
async def pdf(self, id, name, version_number, data):
    global download_in_progress
    url = "my website URL"
    try:
        print("opening browser")
        browser = await launch(  # can't use initBrowser here for some reason, so calling again to access it without having to redownload chrome
            headless=True,
            handleSIGINT=False,
            handleSIGTERM=False,
            handleSIGHUP=False,
            autoClose=False,
            args=['--no-sandbox', '--single-process', '--font-render-hinting=none'],
            dumpio=True  # used to generate console.log statements in the terminal for debugging
        )
        page = await browser.newPage()
        if os.getenv("running environment") == "local":
            pass
        elif self._oidc_identity and self._oidc_data:
            await page.setExtraHTTPHeaders({
                "x-amzn-oidc-identity": self._oidc_identity,
                "x-amzn-oidc-data": self._oidc_data
            })
        await page.goto(url, {'waitUntil': ['domcontentloaded']})  # waitUntil doesn't seem to do anything
        await page.waitForResponse(lambda res: res.status == 200)  # otherwise the download completed before the PDF was generated
        # used to use a sleep timer for 5 seconds but this was inconsistent, so now waiting for the http response
        pdf = await page.pdf({
            'printBackground': True,
            'format': 'A4',
            'scale': 1,
            'preferCSSPageSize': True
        })
        download_in_progress = False
        return pdf
    except Exception as e:
        download_in_progress = False
        raise e
(I've amended some of the code to hide variable names etc)
I have thought I'd solved this issue multiple times; however, I just seem to be missing something. Any suggestions (or code improvements!) would be greatly appreciated.
Things I've tried
Off the top of my head, the main things I have tried to solve this are:
Adding wait-until loops to create a queue system and stop simultaneous calls, if statements to block multiple requests, adding packages, various different "wait for X" calls, and manual sleep timers instead of trying to detect when the process is complete (up to 20 seconds, and it still failed occasionally). I also tried using Selenium, but that ran into a lot of issues.
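One more wait strategy that may be worth trying, on the assumption that the blank PDFs happen because page.pdf() runs before the React app has finished fetching and rendering its data: ask page.goto to wait for the network to go idle rather than for domcontentloaded. networkidle0 (and the more lenient networkidle2) are standard pyppeteer/puppeteer options; this is only a sketch of a replacement for the goto line inside the pdf() coroutine, not a confirmed fix.

        # resolves once there have been no network connections for at least 500 ms
        await page.goto(url, {'waitUntil': ['networkidle0']})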
I'm trying to integrate my telegram bot with my webcam (DLINK DCS-942LB).
Using the NIPCA standard (Network IP Camera Application Programming Interface) I have managed to get almost everything working.
I'm now working on a polling mechanism.
The basic flow should be:
telegram bot keeps polling the camera using http://CAMERA_IP:CAMERA_PORT/config/notify_stream.cgi
when an event happens, telegram bot sends a notification to users
The problem is that the notify_stream.cgi page keeps updating every second, adding events.
I am not able to poll notify_stream.cgi, as the request hangs (it never gets a response).
This can be reproduced with a simple script:
import requests
myurl = "http://CAMERA_IP:CAMERA_PORT/config/notify_stream.cgi"
response = requests.get(myurl, auth=("USERNAME", "PASSWORD"))
This results in requests hanging until I stop it manually.
Is it possible to keep listening the notify_stream.cgi and passing new lines to a function?
Thanks to the comment received, using a session with stream=True works fine.
Here is the code:
import requests
def getwebcameventstream(webcam_url, webcam_username, webcam_password):
    requestsession = requests.Session()
    eventhandler = ["first_event", "second_event", "third_event"]
    with requestsession.get(webcam_url, auth=(webcam_username, webcam_password), stream=True) as webcam_response:
        # decode_unicode=True yields str lines, so the comparison against the list of event names works
        for event in webcam_response.iter_lines(decode_unicode=True):
            if event in eventhandler:
                handlewebcamalarm(event)

def handlewebcamalarm(event):
    print("New event received: " + str(event))

url = 'http://CAMERA_IP:CAMERA_PORT/config/notify_stream.cgi'
username = "myusername"
password = "mypassword"
getwebcameventstream(url, username, password)
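If the rest of the bot runs on an asyncio event loop, an equivalent non-blocking listener can be written with httpx's streaming API. This is a sketch under that assumption, with the function name chosen for illustration:

import asyncio
import httpx

async def getwebcameventstream_async(webcam_url, webcam_username, webcam_password):
    eventhandler = ["first_event", "second_event", "third_event"]
    # timeout=None keeps the long-lived streaming connection from being cut off
    async with httpx.AsyncClient(auth=(webcam_username, webcam_password), timeout=None) as client:
        async with client.stream("GET", webcam_url) as webcam_response:
            async for event in webcam_response.aiter_lines():
                if event in eventhandler:
                    handlewebcamalarm(event)

asyncio.run(getwebcameventstream_async(url, username, password))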
I am looking to create a server that takes in a request, does some processing, and forwards the request to another endpoint. I seem to be running into an issue at higher concurrency where my client.post is causing a httpx.ConnectTimeout exception.
I haven't completely ruled out the possibility of an issue with the endpoint (I am currently working with them to debug anything that might be on their end), but I'm trying to figure out if there is something wrong on my end or if there are any glaring inefficiencies I can improve upon.
I am running this in ECS, currently on a cluster where tasks have 4 vCPUs. I am using the docker image uvicorn-gunicorn-fastapi(https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker). Currently all default settings minus the bind/port/logging. Here is a minimal code example:
import httpx
from fastapi import FastAPI, Request, Response
app = FastAPI()
def process_request(path, request):
    # Process Request Here
    ...

def create_headers(path):
    # Create headers here
    ...

@app.get('/')
async def root(path: str, request: Request):
    endpoint = 'https://endpoint.com/'
    querystring = 'path=' + path
    data = process_request(path, request)
    headers = create_headers(path)
    async with httpx.AsyncClient() as client:
        await client.post(endpoint + "?" + querystring, data=data, headers=headers)
    return Response(status_code=200)
It could be that the server on the other side is taking too long to respond, and the connection simply times out because httpx doesn't give the other endpoint enough time to complete the request.
If so, you could try increasing the timeout limit (which I'd suggest) or disabling it entirely.
See https://www.python-httpx.org/quickstart/#timeouts
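For reference, a minimal sketch of what raising the limits could look like, reusing the endpoint, querystring, data and headers names from the question (the 60-second values are arbitrary examples, not recommendations):

import httpx

# allow up to 60 s to establish the connection and 60 s to read the response
timeout = httpx.Timeout(10.0, connect=60.0, read=60.0)
async with httpx.AsyncClient(timeout=timeout) as client:
    await client.post(endpoint + "?" + querystring, data=data, headers=headers)

# or disable timeouts entirely (riskier, since a stuck request will hang forever):
# async with httpx.AsyncClient(timeout=None) as client:
#     ...

Separately, note that the example creates a new AsyncClient for every incoming request, which throws away the connection pool each time; reusing a single client across requests is generally cheaper at high concurrency.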
I'm working on an application that will have to consult multiple APIs for information and after processing the data, will output the answer to a client. The client uses a browser to connect to a web server to forward the request, afterwards, the web server will look for the information needed from the multiple APIs and after joining the responses from those APIs will then give an answer to the client.
The web server was built using Flask, and a module that extracts the needed information from each API was also implemented (in Python). Since the consulting process for each API takes time, I would like to give the web server a timeout for responding; after the requests are sent, only those that respond within the time buffer will be used.
My proposed solution:
Use a Redis queue and an RQ worker to enqueue the requests for each API and store the responses on the queue, then wait for the timeout and collect the responses that arrived within the allowed time. Afterwards, process the information and give the response to the user.
The Flask web server is set up something like this:
@app.route('/result', methods=["POST"])
def show_result():
    inputText = request.form["question"]
    tweetModule = Twitter()
    tweeterResponse = tweetModule.ask(params=inputText)
    redditObject = RedditModule()
    redditResponse = redditObject.ask(params=inputText)
    edmunds = Edmunds()
    edmundsJson = edmunds.ask(params=inputText)
    # More APIs could be consulted here
    # Send each request async and then synchronize the responses from the queue
    template = env.get_template('templates/result.html')
    return render_template(template, resp=resp)
The worker:
import os, redis
from rq import Worker, Queue, Connection

listen = ['default']  # queue names (assumed; not shown in the original snippet)
redis_url = os.getenv('REDIS_URL', 'redis://localhost:6379')  # assumed source of redis_url
conn = redis.from_url(redis_url)
if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(map(Queue, listen))
        worker.work()
And let's assume each module handles its own queueing process.
I can see some problems ahead:
What happens to the information stored on the queue that did not make it to the timeout?
How can I make Flask wait and then extract the responses from the Queue?
Is it possible that information could get mixed if two clients ask in the same time-frame?
Is there a better way to handle the async requests and then synchronize the response?
Thanks!
In such cases I prefer a combination of HTTPX and flask[async]
First - HTTPX
HTTPX offers a standard synchronous API by default, but also gives you the option of an async client if you need it.
Async is a concurrency model that is far more efficient than multi-threading, and can provide significant performance benefits and enable the use of long-lived network connections such as WebSockets.
If you're working with an async web framework then you'll also want to use an async client for sending outgoing HTTP requests.
>>> async with httpx.AsyncClient() as client:
... r = await client.get('https://www.example.com/')
...
>>> r
<Response [200 OK]>
Second - Using async and await in Flask
Routes, error handlers, before request, after request, and teardown functions can all be coroutine functions if Flask is installed with the async extra (pip install flask[async]). It requires Python 3.7+ where contextvars.ContextVar is available. This allows views to be defined with async def and use await.
For example, you should do something like this:
import asyncio
import httpx
from flask import Flask, render_template, request
app = Flask(__name__)
@app.route('/async', methods=['GET', 'POST'])
async def async_form():
    if request.method == 'POST':
        ...
        async with httpx.AsyncClient() as client:
            tweeterResponse, redditResponse, edmundsJson = await asyncio.gather(
                client.get(f'https://api.tweeter....../id?id={request.form["tweeter_id"]}', timeout=None),
                client.get(f'https://api.redditResponse.....?key={APIKEY}&reddit={request.form["reddit_id"]}'),
                client.post(f'https://api.edmundsJson.......', data=inputText)
            )
        ...
        resp = {
            "tweeter_response": tweeterResponse,
            "reddit_response": redditResponse,
            "edmunds_json": edmundsJson
        }
        template = env.get_template('templates/result.html')
        return render_template(template, resp=resp)
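Note that asyncio.gather here returns httpx.Response objects, so before rendering you would typically pull the payload out of each one; assuming every API returns JSON, for example:

        resp = {
            "tweeter_response": tweeterResponse.json(),
            "reddit_response": redditResponse.json(),
            "edmunds_json": edmundsJson.json()
        }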