Errno 10054 while scraping HTML with Python: how to reconnect - python

I'm a novice Python programmer trying to use Python to scrape a large amount of pages from fanfiction.net and deposit a particular line of the page's HTML source into a .csv file. My program works fine, but eventually hits a snag where it stops running. My IDE told me that the program has encountered "Errno 10054: an existing connection was forcibly closed by the remote host".
I'm looking for a way to get my code to reconnect and continue every time I get the error. My code will be scraping a few hundred thousand pages every time it runs; is this maybe just too much for the site? The site doesn't appear to prevent scraping. I've done a fair amount of research on this problem already and attempted to implement a retry decorator, but the decorator doesn't seem to work. Here's the relevant section of my code:
def retry(ExceptionToCheck, tries=4, delay=3, backoff=2, logger=None):
    def deco_retry(f):
        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck as e:
                    msg = "%s, Retrying in %d seconds..." % (str(e), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print(msg)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)
        return f_retry # true decorator
    return deco_retry
@retry(urllib.error.URLError, tries=4, delay=3, backoff=2)
def retrieveURL(URL):
    response = urllib.request.urlopen(URL)
    return response
def main():
    # first check: 5000 to 100,000
    MAX_ID = 600000
    ID = 400001
    URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
    fCSV = open('buffyData400k600k.csv', 'w')
    fCSV.write("Rating, Language, Genre 1, Genre 2, Character A, Character B, Character C, Character D, Chapters, Words, Reviews, Favorites, Follows, Updated, Published, Story ID, Story Status, Author ID, Author Name" + '\n')
    while ID <= MAX_ID:
        URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
        response = retrieveURL(URL)
Whenever I run the .py file outside of my IDE, it eventually locks up and stops grabbing new pages after about an hour, tops. I'm also running a different version of the same file in my IDE, and that appears to have been running for almost 12 hours now, if not longer. Is it possible that the file could work in my IDE but not when run independently?
Have I set my decorator up wrong? What else could I do to get Python to reconnect? I've also seen claims that an out-of-date SQL native client could cause problems for a Windows user such as myself. Is this true? I've tried to update it but had no luck.
Thank you!

You are catching URLError, which Errno 10054 is not, so your @retry decorator never retries. Try this:
@retry(Exception, tries=4)
def retrieveURL(URL):
    response = urllib.request.urlopen(URL)
    return response
This makes up to 4 attempts, retrying on any Exception. Your @retry decorator is defined correctly.

Your code for reconnecting looks good except for one part - the exception that you're trying to catch. According to this StackOverflow question, an Errno 10054 is a socket.error. All you need to do is to import socket and add an except socket.error statement in your retry handler.
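As a rough sketch of what both answers suggest: in Python 3.3 and later, socket.error is an alias of OSError, and both ConnectionResetError (the exception behind Errno 10054) and urllib.error.URLError are subclasses of it, so catching OSError covers both. The version below is a simplified rewrite of the decorator, not the asker's exact code. The `sleep` parameter and the `flaky_fetch` function are hypothetical, added only so the example runs without a network or real delays.

```python
import functools
import time


def retry(exceptions, tries=4, delay=3, backoff=2, sleep=time.sleep):
    """Retry the wrapped function when one of `exceptions` is raised."""
    def deco_retry(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == tries:
                        raise  # out of attempts: let the caller see the error
                    print("%s, retrying in %d seconds..." % (exc, wait))
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return deco_retry


# Hypothetical stand-in for urllib.request.urlopen: fails twice, then succeeds.
calls = {"count": 0}


@retry(OSError, tries=4, delay=1, sleep=lambda seconds: None)
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError(10054, "connection forcibly closed")
    return "page source"


result = flaky_fetch()
print(result, "after", calls["count"], "attempts")
```

In the scraper itself, the decorator would sit on retrieveURL, with the default time.sleep providing the pause between attempts.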

Related

Using Pyppeteer to Generate a PDF - Working inconsistently

Summary
I am using Pyppeteer to open a headless browser and load HTML and CSS to create a PDF. The HTML is accessed via an HTTP request, as it is laid out in a server-side React client.
The PDF is triggered by a button press in a front end React site.
The issue
The majority of the time the PDF prints perfectly. Occasionally, however, it prints blank, and once that happens it seems more likely to happen again several times in a row.
I initially thought this was from consecutive download requests happening too close together, but that doesn't seem to be the only cause.
I am seeing 2 errors:
RuntimeError: You cannot use AsyncToSync in the same thread as an async event loop - just await the async function directly.
and
pyppeteer.errors.TimeoutError: Timeout exceeded while waiting for event
Additionally, when this issue happens, I get the following message, from the dumpio log:
"Uncaught (in promise) SyntaxError: Unexpected token < in JSON at position 0"
The code
When the webpage is first loaded, I've added a "launch" function to predownload the browser, so it is preinstalled, to reduce waiting time of the first PDF download:
from flask import Flask
import asyncio
from pyppeteer import launch
import os
download_in_progress = False
app = Flask(__name__)
async def preload_browser(): #pre-downloading chrome client to save time for first pdf generation.
    print("downloading browser")
    await launch(
        headless=True,
        handleSIGINT=False,
        handleSIGTERM=False,
        handleSIGHUP=False,
        autoClose=False,
        args=['--no-sandbox', '--single-process', '--font-render-hinting=none']
    )
Then the code for creating the PDF is triggered by app route via an async def:
@app.route(
    "/my app route with :variables",
    methods=["POST"]
)
async def go_to_pdf_download(self, id, name, version_number, data):
    global download_in_progress
    while download_in_progress:
        print("download in progress, please wait")
        await asyncio.sleep(1)
    else:
        download_in_progress = True
        download_pdf = await pdf(self, id, name, version_number, data)
        return download_pdf # return the generated PDF, not the pdf function
I found that the headless browser was failing with multiple function calls, so I tried adding a while loop. This worked in my local (Docker) container, but it didn't work consistently in our test environment, so I will likely remove it (instead, I am removing the download button in the React app once it has been clicked, until the PDF is returned).
The code itself:
async def pdf(self, id, name, version_number, data):
    global download_in_progress
    url = "my website URL"
    try:
        print("opening browser")
        browser = await launch( #can't use initBrowser here for some reason, so calling again to access it without having to redownload chrome.
            headless=True,
            handleSIGINT=False,
            handleSIGTERM=False,
            handleSIGHUP=False,
            autoClose=False,
            args=['--no-sandbox', '--single-process', '--font-render-hinting=none'],
            dumpio = True #used to generate console.log statements in terminal for debugging
        )
        page = await browser.newPage()
        if os.getenv("running environment") == "local":
            pass
        elif self._oidc_identity and self._oidc_data:
            await page.setExtraHTTPHeaders({
                "x-amzn-oidc-identity": self._oidc_identity,
                "x-amzn-oidc-data": self._oidc_data
            })
        await page.goto(url, {'waitUntil' : ['domcontentloaded']}) #wait until doesn't seem to do anything
        await page.waitForResponse(lambda res: res.status == 200) #otherwise was completing download before PDF was generated
        #used to use a sleep timer for 5 seconds but this was inconsistent, so now waiting for http response
        pdf = await page.pdf({
            'printBackground':True,
            'format': 'A4',
            'scale': 1,
            'preferCSSPageSize': True
        })
        download_in_progress = False
        return pdf
    except Exception as e:
        download_in_progress = False
        raise e
(I've amended some of the code to hide variable names etc)
I have thought I've solved this issue multiple times, however I just seem to be missing something. Any suggestions (or code improvements!) would be greatly appreciated.
Things I've tried
Off the top of my head, the main things I have tried to solve this are:
Adding wait-until loops to create a queue of generation requests and stop simultaneous calls; if statements to block multiple requests; adding packages; various "wait for X" calls; and manual sleep timers instead of trying to detect when the process is complete (up to 20 seconds, and it still failed occasionally). I also tried Selenium, but that ran into a lot of issues.
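One small code improvement, offered as a sketch and not as a confirmed fix for the blank PDFs: the pdf() coroutine resets download_in_progress in two separate places and never closes the browser it launches. A try/finally block does the cleanup once, on success and on failure alike. FakeBrowser, its render() method, and make_pdf below are hypothetical stand-ins so the example runs without Chrome; with Pyppeteer the cleanup call would be the browser's own close() coroutine.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for the object returned by pyppeteer.launch()."""

    def __init__(self):
        self.closed = False

    async def render(self, fail=False):
        if fail:
            raise RuntimeError("render failed")
        return b"%PDF-fake"

    async def close(self):
        self.closed = True


state = {"download_in_progress": False}


async def make_pdf(browser, fail=False):
    state["download_in_progress"] = True
    try:
        return await browser.render(fail=fail)
    finally:
        # Runs on success and on failure, so cleanup is written only once.
        state["download_in_progress"] = False
        await browser.close()


async def main():
    ok_browser, bad_browser = FakeBrowser(), FakeBrowser()
    pdf_bytes = await make_pdf(ok_browser)
    try:
        await make_pdf(bad_browser, fail=True)
    except RuntimeError:
        pass  # the error still propagates to the caller
    return pdf_bytes, ok_browser.closed, bad_browser.closed


outcome = asyncio.run(main())
print(outcome)
```

Because the exception is re-raised automatically after the finally block, the separate except branch in the original is no longer needed.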

Request timed out: timeout('timed out') in Python's HTTPServer

I am trying to create a simple HTTP server using Python's HTTPServer from the http.server module (https://github.com/python/cpython/blob/main/Lib/http/server.py).
There are numerous examples of this approach online and I don't believe I am doing anything unusual.
I am simply importing the class via:
"from http.server import HTTPServer, BaseHTTPRequestHandler"
in my code.
My code overrides the do_GET() method to parse the path variable to determine what page to show.
However, if I start this server and connect to it locally (e.g. http://127.0.0.1:50000), the first page loads fine. If I navigate to another page via the links on my first page, that usually works too. Occasionally, though (and this is somewhat sporadic), there is a delay and the server log shows a Request timed out: timeout('timed out') error. I have tracked this down to the handle_one_request method in the BaseHTTPRequestHandler class:
def handle_one_request(self):
    """Handle a single HTTP request.
    You normally don't need to override this method; see the class
    __doc__ string for information on how to handle specific HTTP
    commands such as GET and POST.
    """
    try:
        self.raw_requestline = self.rfile.readline(65537)
        if len(self.raw_requestline) > 65536:
            self.requestline = ''
            self.request_version = ''
            self.command = ''
            self.send_error(HTTPStatus.REQUEST_URI_TOO_LONG)
            return
        if not self.raw_requestline:
            self.close_connection = True
            return
        if not self.parse_request():
            # An error code has been sent, just exit
            return
        mname = 'do_' + self.command ## the name of the method is created
        if not hasattr(self, mname): ## checking that we have that method defined
            self.send_error(
                HTTPStatus.NOT_IMPLEMENTED,
                "Unsupported method (%r)" % self.command)
            return
        method = getattr(self, mname) ## getting that method
        method() ## finally calling it
        self.wfile.flush() #actually send the response if not already done.
    except socket.timeout as e:
        # a read or a write timed out. Discard this connection
        self.log_error("Request timed out: %r", e)
        self.close_connection = True
        return
You can see where the exception is caught and logged, in the "except socket.timeout as e:" clause.
I have tried overriding this method by including it in my code but it is not clear what is causing the error so I run into dead ends. I've tried creating very basic HTML pages to see if there was something in the page itself, but even "blank" pages cause the same sporadic issue.
What's odd is that sometimes a page loads instantly, and almost randomly, it will then timeout. Sometimes the same page, sometimes a different page.
I've played with the http.timeout setting, but it makes no difference. I suspect it's some underlying socket issue, but am unable to diagnose it further.
This is on a Mac running Big Sur 11.3.1, with Python version 3.9.4.
Any ideas on what might be causing this timeout, and in particular any suggestions for a resolution? Any pointers would be appreciated.
After further investigation, this appears to be an issue with Safari. Running the exact same code with Firefox does not show the same issue.
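The question does not establish why Safari triggers the timeout. One explanation that is sometimes suggested, treated here purely as an assumption, is that a browser can open a connection and leave it idle, and the single-threaded HTTPServer then blocks on that connection until the socket timeout fires. Under that assumption, a minimal sketch is to serve with ThreadingHTTPServer (available in http.server since Python 3.7), so an idle connection ties up only its own thread. PageHandler and the port choice are illustrative, not the asker's code.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PageHandler(BaseHTTPRequestHandler):
    # Seconds a connection may sit idle before handle_one_request logs
    # "Request timed out" and drops it (None would mean wait forever).
    timeout = 5

    def do_GET(self):
        body = ("<html><body>You asked for %s</body></html>" % self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the OS for any free port; the question used 50000.
server = ThreadingHTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Bypass any configured proxy so the request goes straight to localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
url = "http://127.0.0.1:%d/about" % server.server_address[1]
with opener.open(url, timeout=5) as response:
    status = response.status
    html = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, html)
```

If the sporadic delays disappear with the threaded server, that would support the idle-connection explanation; if they persist, the cause lies elsewhere.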

for loop skipping over code?

So I'm trying to execute this code:
liner = 0
for eachLine in content:
    print(content[liner].rstrip())
    raw=str(content[liner].rstrip())
    print("Your site:"+raw)
    Sitecheck=requests.get(raw)
    time.sleep(5)
    var=Sitecheck.text.find('oht945t945iutjkfgiutrhguih4w5t45u9ghdgdirfgh')
    time.sleep(5)
    print(raw)
    liner += 1
I would expect this to run through the first print up to the liner variable and then go back up, however something else seems to happen:
https://google.com
Your site:https://google.com
https://google.com
https://youtube.com
Your site:https://youtube.com
https://youtube.com
https://x.com
Your site:https://x.com
This happens before the get requests. And the get requests later just get timed out:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
I tried adding time.sleep(5) to my code to make it run more smoothly, but this did not help.
Why don't you use Python's exception handling to catch failed connections?
import requests
#list of websites
content = ["https://google.com", "https://stackoverflow.com/", "https://bbc.co.uk/", "https://this.site.doesnt.exi.st"]
#get list index and element
for liner, eachLine in enumerate(content):
    #not sure why this line exists; probably necessary for your content list
    raw = str(eachLine.rstrip())
    #try to get a connection and give feedback if successful
    try:
        #the timeout keeps one unresponsive site from stalling the loop
        Sitecheck = requests.get(raw, timeout=10)
        print("Tested site #{0}: site {1} responded".format(liner, raw))
    except requests.exceptions.RequestException:
        print("Tested site #{0}: site {1} seems to be down".format(liner, raw))
Mind you that there are more elaborate ways like scrapy or beautifulsoup in Python to retrieve web content. But I think that your question is more a conceptual than a practical one.

Bug in python thread

I have some Raspberry Pis running some Python code. Once in a while my devices fail to check in. The rest of the Python code continues to run perfectly, but the code here quits, and I am not sure why. If the devices can't check in they should reboot, but they don't. Other threads in the Python file continue to run correctly.
class reportStatus(Thread):
    def run(self):
        checkInCount = 0
        while 1:
            try:
                if checkInCount < 50:
                    payload = {'d':device,'k':cKey}
                    resp = requests.post(url+'c', json=payload)
                    if resp.status_code == 200:
                        checkInCount = 0
                        time.sleep(1800) # 30 min
                    else:
                        checkInCount += 1
                        time.sleep(300) # 5 min
                else:
                    os.system("sudo reboot")
            except:
                try:
                    checkInCount += 1
                    time.sleep(300)
                except:
                    pass
The devices can run for days or weeks and check in perfectly every 30 minutes, then out of the blue they stop. My Linux computers are read-only and continue to work and run correctly. The issue is in this thread. I think it might fail to get a response, and this line could be the problem:
resp = requests.post(url+'c', json=payload)
I am not sure how to solve this, any help or suggestions would be greatly appreciated.
Thank you
A bare except: pass is a very bad idea.
A much better approach would be to, at the very minimum, log any exceptions:
import datetime
import time
import traceback
while True:
    try:
        time.sleep(60)
    except:
        with open("exceptions.log", "a") as log:
            log.write("%s: Exception occurred:\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
            traceback.print_exc(file=log)
Then, when you get an exception, you get a log:
2016-12-20 13:28:55: Exception occurred:
Traceback (most recent call last):
File "./sleepy.py", line 8, in <module>
time.sleep(60)
KeyboardInterrupt
It is also possible that your code is hanging on sudo reboot or requests.post. You could add additional logging to troubleshoot which issue you have, although given you've seen it do reboots, I suspect it's requests.post, in which case you need to add a timeout (from the linked answer):
import requests
import eventlet
eventlet.monkey_patch()
#...
resp = None
with eventlet.Timeout(10):
    resp = requests.post(url+'c', json=payload)
if resp:
    # your code
Your code basically ignores all exceptions. This is considered a bad thing in Python.
The only reason I can think of for the behavior that you're seeing is that after checkInCount reaches 50, the sudo reboot raises an exception which is then ignored by your program, keeping this thread stuck in the infinite loop.
If you want to see what really happens, add print or logging.info statements to all the different branches of your code.
Alternatively, remove the blanket try-except clause or replace it with something specific, e.g. except requests.exceptions.RequestException.
Thanks to the answers given, I was able to come up with a solution. I realized requests has a built-in timeout parameter. If no timeout is specified, the request never times out.
Here is my solution:
resp = requests.post(url+'c', json=payload, timeout=45)
You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.
The answers provided by TemporalWolf and others helped me a lot. Thank you to all who helped.
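Putting the suggestions from this thread together, the check-in loop could be restructured roughly as follows. This is only a sketch: run_check_ins and its parameters are hypothetical names, and the network call, sleep, and reboot are passed in as functions so the example can run with fakes. The max_rounds argument exists only to let the demonstration terminate.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkin")


def run_check_ins(check_in, sleep, reboot, max_failures=50, max_rounds=None):
    """Call check_in() repeatedly; reboot after max_failures failures in a row.

    check_in() should return True on success and may raise on network errors.
    """
    failures = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        try:
            ok = check_in()
        except Exception:
            # Log the traceback instead of silently swallowing the error.
            log.exception("check-in raised")
            ok = False
        if ok:
            failures = 0
            sleep(1800)  # 30 minutes until the next check-in
            continue
        failures += 1
        if failures >= max_failures:
            reboot()
            return
        sleep(300)  # 5 minutes before retrying


# Hypothetical fakes so the sketch runs without a network or a real reboot.
outcomes = iter([True, TimeoutError("no response"), False, False])
events = []


def fake_check_in():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


run_check_ins(
    fake_check_in,
    sleep=lambda seconds: events.append(("sleep", seconds)),
    reboot=lambda: events.append(("reboot",)),
    max_failures=3,
    max_rounds=10,
)
print(events)
```

In the real thread, check_in would wrap the requests.post(url+'c', json=payload, timeout=45) call from the solution above and return whether the status code was 200, with time.sleep as the sleep function.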

Wrapping library functions to retry on 500 errors not working properly

I'm trying to wrap all the functions in an instance of a library to retry on 500 errors (wrapping in order to avoid forcing team members to specially add retry code on each function). I've done similar stuff before, but for BigQuery, I'm having no luck. Here's my code:
def bq_methods_retry(func):
    num_retries = 5
    @functools.wraps(func)
    def wrapper(*a, **kw):
        sleep_interval = 2
        for i in xrange(num_retries):
            try:
                return func(*a, **kw)
            except apiclient.errors.HttpError, e:
                if e.resp.status == 500 and i < num_retries-1:
                    logger.info("got a 500. retrying.")
                    time.sleep(sleep_interval)
                    sleep_interval = min(2*sleep_interval, 60)
                else:
                    logger.info('failed with unexpected apiclient error:')
                    raise e
            except:
                logger.info('failed with unexpected error:')
                raise
    return wrapper
def decorate_all_bq_methods(instance, decorator):
    for k, f in instance.__dict__.items():
        if inspect.ismethod(f):
            name = f.func_name
            setattr(instance, k, decorator(f))
    return instance
...
service = discovery.build('bigquery', 'v2', http=http)
#make all the methods in the service retry when appropriate
service = decorate_all_bq_methods(service, bq_methods_retry)
jobs = decorate_all_bq_methods(service.jobs(), bq_methods_retry)
Then, when I run something like:
jobs.query(projectId=some_id, body=some_query).execute()
500 Errors are never caught by bq_methods_retry, but pass along to the rest of the program.
Any ideas? I'm also open to a better retry solution.
The BigQuery client that the bq command line tool uses does something similar by wrapping the HTTP object. It doesn't do a retry, but it does translate exceptions, so you could likely use the same type of hook.
Note that you may want to be careful about retrying certain types of operations; for example, if you retry a job insert that appends data, if it hit a network error returning the response, the original request might actually succeed, so you'll be inserting the same data twice. To avoid this, you can pass in your own job id, which should prevent it from being run twice (since the job will already exist the second time).
Check out the code here.
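A possible reason the decorator never fires, stated here as an assumption about the client library and not something the answer above confirms: jobs.query(...) only builds a request object, and the HTTP call (and any 500) happens later inside .execute(), which the decorator does not wrap. A minimal sketch of retrying at the execute step follows. FakeHttpError and FakeRequest are hypothetical stand-ins so the example runs without the BigQuery client, and the error class is simplified (the real HttpError exposes the status as e.resp.status, as in the question's code).

```python
import time


class FakeHttpError(Exception):
    """Hypothetical stand-in for apiclient.errors.HttpError."""

    def __init__(self, status):
        super().__init__("HTTP %d" % status)
        self.status = status


class FakeRequest:
    """Hypothetical stand-in for the request object that jobs.query() returns."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.attempts = 0

    def execute(self):
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise FakeHttpError(500)
        return {"jobComplete": True}


def execute_with_retry(request, num_retries=5, sleep=time.sleep):
    """Call request.execute(), retrying on status 500 with capped backoff."""
    sleep_interval = 2
    for attempt in range(num_retries):
        try:
            return request.execute()
        except FakeHttpError as err:
            if err.status != 500 or attempt == num_retries - 1:
                raise
            sleep(sleep_interval)
            sleep_interval = min(2 * sleep_interval, 60)


request = FakeRequest(failures_before_success=2)
response = execute_with_retry(request, sleep=lambda seconds: None)
print(response, "after", request.attempts, "attempts")
```

The caveat from the answer still applies: retrying a non-idempotent operation such as a job insert can duplicate data unless a caller-supplied job id is used. The client library's execute method may also offer its own retry argument, which is worth checking in the documentation for the version in use before writing custom code.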
