How to Retry 429 Errors On Scroll - python

I am using the elasticsearch-dsl library for Python. I am running scroll scans in a threaded application like so:
s = Search\
    .from_dict(query)\
    .using(es)\
    .index(index)\
    .doc_type(doc_type)\
    .extra(slice=slice)

for hit in s.scan():
    yield hit
There can be 32 threads all running the same scroll, each with a different slice. Occasionally one of them hits a 429 Too Many Requests error, which kills the whole process. Heart-breakingly, this may occur an hour into the scroll and the process just has to start over.
How can I recover from a 429 Too Many Requests error? Is it possible to retry at the last scroll offset without restarting the entire scroll?

Use the retry lib. I raise a custom TooManyRequestsError so that only 429s are caught and retried:
import logging
import requests
from retry import retry  # pip install retry

logger = logging.getLogger(__name__)

class TooManyRequestsError(Exception):
    pass

@retry(TooManyRequestsError, delay=3, jitter=3, max_delay=30)
def get(url: str) -> requests.Response:
    res = requests.get(url)
    if res.status_code == 429:
        logger.warning(f'Too many requests! {url}')
        raise TooManyRequestsError()
    res.raise_for_status()
    return res
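To apply the same idea to the scroll itself, one option is to wrap each slice's scan in a function retried on 429. This is a sketch, not a drop-in fix: a retry re-runs the slice from the beginning, since elasticsearch-py does not expose a way to resume a scroll at an offset, and catching TransportError with a status_code attribute assumes elasticsearch-py 7.x.
from elasticsearch.exceptions import TransportError
from elasticsearch_dsl import Search
from retry import retry

@retry(TooManyRequestsError, delay=3, jitter=3, max_delay=30)
def scan_slice(es, index, doc_type, query, slice_):
    s = Search.from_dict(query)\
        .using(es)\
        .index(index)\
        .doc_type(doc_type)\
        .extra(slice=slice_)
    try:
        # materialise the whole slice so a mid-scan 429 is caught and retried;
        # note this re-runs the slice from the start and buffers it in memory
        return list(s.scan())
    except TransportError as e:
        if e.status_code == 429:
            raise TooManyRequestsError()
        raise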

Related

part of the requests delay the whole process in grequests

I have 600 URLs to request. When I use grequests, I find that sometimes it finishes very fast, within 10 seconds, but sometimes it just gets stuck (it never reaches the statement that prints 'done').
Here is my code:
l = []
urls = [...]
reqs = [grequests.get(i) for i in urls]
rs = grequests.map(reqs)
print('done')
Is it because the majority of the requests finish within 10 seconds (with status 200), and just a few are still waiting for a response? I am using PyCharm, and in the variable monitor window I can indeed see that lots of requests already have status 200. How can I fix this? Is there a parameter I can set to limit the maximum time per request, so that if a request exceeds a certain time it returns anyway?
Just set the timeout:
import grequests

urls = [...]
reqs = [grequests.get(i, timeout=1) for i in urls]  # timeout in seconds
rs = grequests.map(reqs)
print(rs)
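As a sketch of what handling those timeouts might look like (the exception_handler callback is part of grequests.map; the handler name and the filtering are additions for illustration), failed or timed-out requests come back as None in the result list:
import grequests

urls = []  # fill in your 600 URLs

def on_exception(request, exception):
    # called for every request that fails, including timeouts
    print('request failed:', request.url, exception)

reqs = [grequests.get(u, timeout=1) for u in urls]
rs = grequests.map(reqs, exception_handler=on_exception)

# failed/timed-out requests appear as None, so filter them out before use
ok = [r for r in rs if r is not None]
print('%d of %d responses received' % (len(ok), len(urls)))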

Using Pyppeteer to Generate a PDF - Working inconsistently

Summary
I am using Pyppeteer to open a headless browser and load HTML & CSS to create a PDF. The HTML is accessed via an HTTP request, as it is laid out in a server-side React client.
The PDF is triggered by a button press in a front end React site.
The issue
The majority of the time the PDF prints perfectly; however, occasionally the PDF prints blank, and once that has happened it seems more likely to happen again several times in a row.
I initially thought this was from consecutive download requests happening too close together, but that doesn't seem to be the only cause.
I am seeing 2 errors:
RuntimeError: You cannot use AsyncToSync in the same thread as an async event loop - just await the async function directly.
and
pyppeteer.errors.TimeoutError: Timeout exceeded while waiting for event
Additionally, when this issue happens, I get the following message, from the dumpio log:
"Uncaught (in promise) SyntaxError: Unexpected token < in JSON at position 0"
The code
When the webpage is first loaded, I've added a "launch" function to pre-download the browser, so it is preinstalled, to reduce the waiting time for the first PDF download:
from flask import Flask
import asyncio
from pyppeteer import launch
import os

download_in_progress = False
app = Flask(__name__)

async def preload_browser():
    # pre-download the Chromium client to save time on the first PDF generation
    print("downloading browser")
    await launch(
        headless=True,
        handleSIGINT=False,
        handleSIGTERM=False,
        handleSIGHUP=False,
        autoClose=False,
        args=['--no-sandbox', '--single-process', '--font-render-hinting=none']
    )
Then the code for creating the PDF is triggered by an app route via an async def:
@app.route(
    "/my app route with :variables",
    methods=["POST"]
)
async def go_to_pdf_download(self, id, name, version_number, data):
    global download_in_progress
    while download_in_progress:
        print("download in progress, please wait")
        await asyncio.sleep(1)
    else:
        download_in_progress = True
        download_pdf = await pdf(self, id, name, version_number, data)
    return pdf
I found that the headless browser was failing with multiple function calls, so I tried adding a while loop. This worked in my local (Docker) container; however, it didn't work consistently in my test environment, so I will likely remove it (instead I am removing the download button in the React app, once clicked, until the PDF is returned).
The code itself:
async def pdf(self, id, name, version_number, data):
    global download_in_progress
    url = "my website URL"
    try:
        print("opening browser")
        # can't use initBrowser here for some reason, so calling launch again
        # to access the browser without having to re-download Chromium
        browser = await launch(
            headless=True,
            handleSIGINT=False,
            handleSIGTERM=False,
            handleSIGHUP=False,
            autoClose=False,
            args=['--no-sandbox', '--single-process', '--font-render-hinting=none'],
            dumpio=True  # used to surface console.log statements in the terminal for debugging
        )
        page = await browser.newPage()
        if os.getenv("running environment") == "local":
            pass
        elif self._oidc_identity and self._oidc_data:
            await page.setExtraHTTPHeaders({
                "x-amzn-oidc-identity": self._oidc_identity,
                "x-amzn-oidc-data": self._oidc_data
            })
        await page.goto(url, {'waitUntil': ['domcontentloaded']})  # waitUntil doesn't seem to do anything
        await page.waitForResponse(lambda res: res.status == 200)  # otherwise the download completed before the PDF was generated
        # used to use a 5 second sleep timer, but this was inconsistent, so now waiting for an HTTP response
        pdf = await page.pdf({
            'printBackground': True,
            'format': 'A4',
            'scale': 1,
            'preferCSSPageSize': True
        })
        download_in_progress = False
        return pdf
    except Exception as e:
        download_in_progress = False
        raise e
(I've amended some of the code to hide variable names etc)
I have thought I'd solved this issue multiple times; however, I just seem to be missing something. Any suggestions (or code improvements!) would be greatly appreciated.
Things I've tried
Off the top of my head, the main things I have tried to solve this are:
Adding wait-until loops to create a queue system and stop simultaneous calls; if statements to block multiple requests; adding packages; various different "wait for X" calls; manual sleep timers instead of trying to detect when the process is complete (up to 20 seconds, and it still failed occasionally). I also tried using Selenium, but that ran into a lot of issues.
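As a sketch of what the queueing idea described above could look like without the global flag and the polling while loop (an illustration only, not part of the original post; pdf refers to the function shown earlier):
import asyncio

pdf_lock = asyncio.Lock()  # shared by every request handled on this event loop

async def pdf_serialized(self, id, name, version_number, data):
    # Only one coroutine generates a PDF at a time; the others simply
    # await the lock instead of polling a global flag every second.
    async with pdf_lock:
        return await pdf(self, id, name, version_number, data)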

Stop concurrent.futures if a condition is met and then execute again after the task is done

I am fetching JSON data from a website (https://www.nseindia.com/) using the Python requests library. The website only returns data if the correct cookies are provided, so I use the Selenium webdriver to get cookies and then fetch data for 850 different stocks.
My code is written so that if the cookies are wrong, Selenium should open again and get new cookie values. The problem is that with concurrent.futures the tasks run very fast (due to the concurrency), and a new driver is opened for each symbol until new cookies are found. My code is below:
--Initially get cookies
cookie_dict = get_cookies()
for cookie in cookie_dict:
    if cookie == "bm_sv" or cookie == "nsit" or cookie == "nseappid":
        session.cookies.set(cookie, cookie_dict[cookie])
--This function is used in Thread Pool Executor
def final(u):
    try:
        data = session.get(u, headers=headers).json()
        print(data['data'][0]['CH_SYMBOL'])
        list_done.append(data['data'][0]['CH_SYMBOL'])
    except:
        print("Error")
        cookie_dict = get_cookies()
        for cookie in cookie_dict:
            if cookie == "bm_sv" or cookie == "nsit" or cookie == "nseappid":
                session.cookies.set(cookie, cookie_dict[cookie])
        data = session.get(u, headers=headers).json()
        print(data['data'][0]['CH_SYMBOL'])
        list_done.append(data['data'][0]['CH_SYMBOL'])
As you can see, if an exception is raised the function fetches new cookies. But as I mentioned, the futures keep running for the other stocks, and exceptions keep occurring until new cookies are found.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(final, urls)
Is there a way to change my code, or some built-in facility of futures, so that I can pause everything until new cookies are obtained and continue running only once they are set?
You could use the wait() function of concurrent.futures to monitor for the first exception. First, change final so that it raises instead of swallowing the exception. You should not append to a shared list inside this function; return the value instead:
def final(u):
    data = session.get(u, headers=headers).json()
    print(data['data'][0]['CH_SYMBOL'])
    return data['data'][0]['CH_SYMBOL']
Second, instead of using .map() you submit work and store futures:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(final, u) for u in urls]
Then you use wait() to wait for an exception to occur. Here, either there is an exception and you handle it before continuing, or there is no error and you unwrap the futures:
from concurrent.futures import wait, FIRST_EXCEPTION

completed, not_done = wait(futures, return_when=FIRST_EXCEPTION)

if not_done:
    # An exception occurred before all futures completed;
    # handle it here (e.g. refresh the cookies).
    pass

# Then handle the results of the futures:
# `completed` contains both finished and failed futures,
# `not_done` contains the still-pending futures.
Note that you should check, for each future, whether it succeeded or failed before unwrapping it, and you should resubmit the futures that failed. Also, the wait() function does not cancel pending futures, so the not_done futures are still progressing.
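A sketch of unwrapping and resubmitting, under the assumption that refresh_cookies() is a hypothetical helper wrapping the cookie-setting loop from the question and final is the raising version above:
from concurrent.futures import ThreadPoolExecutor

def unwrap_and_resubmit(futures, urls):
    # futures were submitted in the same order as urls
    symbols, retry_urls = [], []
    for fut, url in zip(futures, urls):
        try:
            symbols.append(fut.result())   # blocks until this future finishes
        except Exception:
            retry_urls.append(url)         # request failed, e.g. because of stale cookies
    if retry_urls:
        refresh_cookies()                  # hypothetical: get_cookies() + session.cookies.set(...)
        with ThreadPoolExecutor(max_workers=5) as executor:
            symbols += list(executor.map(final, retry_urls))
    return symbols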

Bug in python thread

I have some Raspberry Pis running some Python code. Once in a while my devices will fail to check in. The rest of the Python code continues to run perfectly, but the code here quits; I am not sure why. If the devices can't check in they should reboot, but they don't. Other threads in the Python file continue to run correctly.
class reportStatus(Thread):
    def run(self):
        checkInCount = 0
        while 1:
            try:
                if checkInCount < 50:
                    payload = {'d': device, 'k': cKey}
                    resp = requests.post(url + 'c', json=payload)
                    if resp.status_code == 200:
                        checkInCount = 0
                        time.sleep(1800)  # 30 min
                    else:
                        checkInCount += 1
                        time.sleep(300)  # 5 min
                else:
                    os.system("sudo reboot")
            except:
                try:
                    checkInCount += 1
                    time.sleep(300)
                except:
                    pass
The devices can run for days or weeks, checking in perfectly every 30 minutes, then out of the blue they stop. My Linux computers run read-only and continue to work and run correctly; my issue is in this thread. I think they might fail to get a response, and this line could be the issue:
resp = requests.post(url+'c', json=payload)
I am not sure how to solve this, any help or suggestions would be greatly appreciated.
Thank you
A bare except:pass is a very bad idea.
A much better approach would be to, at the very minimum, log any exceptions:
import datetime
import time
import traceback

while True:
    try:
        time.sleep(60)
    except:
        with open("exceptions.log", "a") as log:
            log.write("%s: Exception occurred:\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
            traceback.print_exc(file=log)
Then, when you get an exception, you get a log:
2016-12-20 13:28:55: Exception occurred:
Traceback (most recent call last):
File "./sleepy.py", line 8, in <module>
time.sleep(60)
KeyboardInterrupt
It is also possible that your code is hanging on sudo reboot or requests.post. You could add additional logging to troubleshoot which issue you have, although given you've seen it do reboots, I suspect it's requests.post, in which case you need to add a timeout (from the linked answer):
import requests
import eventlet
eventlet.monkey_patch()

# ...

resp = None
with eventlet.Timeout(10):
    resp = requests.post(url + 'c', json=payload)

if resp:
    ...  # your code
Your code basically ignores all exceptions. This is considered a bad thing in Python.
The only reason I can think of for the behavior that you're seeing is that after checkInCount reaches 50, the sudo reboot raises an exception which is then ignored by your program, keeping this thread stuck in the infinite loop.
If you want to see what really happens, add print or logging.info statements to all the different branches of your code.
Alternatively, remove the blanket try-except clause or replace it with something specific, e.g. except requests.exceptions.RequestException.
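As a sketch of that narrowed handler (names such as url, payload and checkInCount come from the run() method in the question; the logging call is an addition for illustration):
import logging
import time
import requests

try:
    resp = requests.post(url + 'c', json=payload)
except requests.exceptions.RequestException:
    # Network errors and timeouts land here and get logged; anything
    # unexpected (e.g. a NameError) is no longer silently swallowed.
    logging.exception("check-in failed")
    checkInCount += 1
    time.sleep(300)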
Thanks to the answers given I was able to come up with a solution. I realized requests has a built-in timeout parameter; no timeout is applied unless one is specified.
Here is my solution:
resp = requests.post(url+'c', json=payload, timeout=45)
You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.
The answers provided by TemporalWolf and others helped me a lot. Thank you to all who helped.

requests process hangs

I'm using requests to get a URL, such as:
import socket
import requests

while True:
    try:
        rv = requests.get(url, timeout=1)
        doSth(rv)
    except socket.timeout as e:
        print e
    except Exception as e:
        print e
After it runs for a while, it stops working: no exception or error, it just looks suspended. When I stop the process with Ctrl+C from the console, it shows that the process is waiting for data:
.............
httplib_response = conn.getresponse(buffering=True) #httplib.py
response.begin() #httplib.py
version, status, reason = self._read_status() #httplib.py
line = self.fp.readline(_MAXLINE + 1) #httplib.py
data = self._sock.recv(self._rbufsize) #socket.py
KeyboardInterrupt
Why is this happening? Is there a solution?
It appears that the server you're sending your request to is throttling you - that is, it's sending bytes with less than 1 second between each packet (thus not triggering your timeout parameter), but slowly enough for it to appear stuck.
The only fix for this I can think of is to reduce the timeout parameter, unless you can fix this throttling issue with the server provider.
Do keep in mind that you'll need to consider latency when setting the timeout parameter, otherwise your connection will be dropped too quickly and might not work at all.
By default, requests does not set a timeout for connect or read.
If for some reason the server cannot get back to the client in time, the client will get stuck connecting or reading, most often while reading the response.
The quick fix is to set a timeout value on the request; the approach is well described here: http://docs.python-requests.org/en/master/user/advanced/#timeouts
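As a short illustration of what the linked docs describe (the URL and the numbers below are placeholders): requests accepts either a single number or a (connect, read) tuple for timeout, and the read timeout limits the time allowed between bytes received, which is what eventually trips on a slow, throttled response:
import requests

url = 'https://example.com/'  # placeholder

# single value: applies to both the connect and the read phase
rv = requests.get(url, timeout=5)

# tuple: 3 s to establish the connection, then at most 10 s between bytes read
rv = requests.get(url, timeout=(3, 10))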
(Thanks to the guys.)
If this resolves the issue, please kindly mark this as the resolution. Thanks.
