I am working on a web-scraping script using the requests-html library. I scrape the URLs, then run through them and commit the results to a database. I have been able to scrape the links and have created a for loop which renders each page and then scrapes the specific product information. For the majority of links this works, but for some the page will not render and I get a pyppeteer.errors.TimeoutError. I am fine with not scraping some links, as the majority of the website's information is grabbed. I have tried using try and except as below:
session = HTMLSession()
for link in productlinks2:
    r = session.get(link)
    try:
        r.html.render(sleep=3, timeout=30)
    except TimeoutError:
        pass
But this still produces:
pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
Is there any way to skip over the links which won't render in time? Any help would be appreciated.
Did you import your error? Without from pyppeteer.errors import TimeoutError, the except TimeoutError clause catches Python's built-in TimeoutError, not the one raised by render().
Then you need to set a timeout on your session.get() too.
It also depends on the error: if you have a bad URL, you will get an error from session.get() before the page is even rendered.
So, for example, here are the different errors that can be caught:
from requests_html import HTMLSession
from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
from pyppeteer.errors import TimeoutError

session = HTMLSession()

links = [
    'https://www.google.com/',
    'h**ps://www.google.com/',
    'https://deelay.me/4000/https://www.google.com/',  # 4s of delay to get the page
    'https://www.baaaadurl.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
]

for url in links:
    try:
        r = session.get(url, timeout=3)
        r.html.render(timeout=1)  # timeout short enough to render google but not youtube
        print(r.html.find('title', first=True).text, '\n')
    except InvalidSchema as e:
        # error for 'h**ps://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ReadTimeout as e:
        # error due to too much delay for
        # 'https://deelay.me/4000/https://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ConnectionError as e:
        # error for 'https://www.baaaadurl.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except TimeoutError as e:
        # error if the timeout is exceeded
        # while rendering the page 'https://www.youtube.com/'
        print(f'For the url "{url}" the error is: {e} \n')
Printed result:
Google
For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/'
For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me', port=443): Read timed out. (read timeout=3)
For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known'))
For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded.
Google
So you can catch errors and continue your loop.
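Applied back to the loop from the question, a minimal sketch could look like the following (the timeout values are placeholders, and catching RequestException for other network failures is an extra assumption on top of the example above):

from requests_html import HTMLSession
from requests.exceptions import RequestException
from pyppeteer.errors import TimeoutError  # the error actually raised by render()

session = HTMLSession()
for link in productlinks2:  # productlinks2 as defined earlier in the script
    try:
        r = session.get(link, timeout=10)   # guard the HTTP request itself
        r.html.render(sleep=3, timeout=30)  # guard the page rendering
    except (TimeoutError, RequestException):
        continue  # skip links that fail or do not render in time
    # ... scrape the product information and commit it to the database here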
Related
I want to access mouser.com using urllib. When I try to fetch the data from the URL, it hangs indefinitely.
Here is the code:
import urllib.error
import urllib.request

try:
    htmls = urllib.request.urlopen("https://www.mouser.com/")
except urllib.error.HTTPError as e:
    print("HTTP ERROR")
except urllib.error.URLError as e:
    print("URL ERROR")
else:
    print(htmls.read().decode("utf-8"))
This piece of code works fine for most URLs, but for some, such as Mouser or element14, it does not.
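As a side note, urlopen accepts a timeout argument, so the call can be made to fail fast instead of hanging. The sketch below also sets a browser-like User-Agent header, which is only a guess at why this particular site stalls, not something confirmed in the post:

import socket
import urllib.error
import urllib.request

url = "https://www.mouser.com/"
# browser-like User-Agent; an assumption, since some servers stall
# requests that use urllib's default one
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
try:
    # the timeout makes the call raise instead of hanging indefinitely
    htmls = urllib.request.urlopen(req, timeout=10)
except (urllib.error.URLError, socket.timeout) as e:
    print("Request failed:", e)
else:
    print(htmls.read().decode("utf-8"))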
I have a list of 1000 websites to check whether or not they exist, but my code reports every URL that starts with https:// as working. Here is my code:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://stackoverflow.com")
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print('Website is working fine')
You can use the Python requests library.
If you do response = requests.get('http://stackoverflow.com') and then check response.status_code, you should get 200. But if you try a site that is not available, you should get a status_code of 404. You can use status_code in your case.
More on status codes: Link.
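A sketch of how that check could look over a list of URLs (the list, the timeout, and the exception handling are illustrative placeholders):

import requests

# a couple of placeholder URLs standing in for the list of 1000 sites
urls = [
    "http://stackoverflow.com",
    "http://stackoverflow.com/nonexistent-page",
]

for url in urls:
    try:
        response = requests.get(url, timeout=5)
    except requests.exceptions.RequestException as e:
        print(f"{url} is unreachable: {e}")
    else:
        if response.status_code == 200:
            print(f"{url} is working fine")
        else:
            print(f"{url} returned status code {response.status_code}")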
I get a "Connection Error" error while capturing data. It works fine for a while then gives error, how can I overcome this error.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"

for page in range(0, 951, 50):
    new_url = url + str(page) + "&pagingSize=50"
    r = requests.get(new_url)
    source = BeautifulSoup(r.content, "html.parser")
    content = source.select('tr.searchResultsItem:not(.nativeAd, .classicNativeAd)')
    print(content)
When I get this error, I want the script to wait for a while and then continue where it left off.
Error:
ConnectionError: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')"))
You can work around connection resets (and other networking problems) by implementing retries. Basically, you can tell requests to automatically retry if a problem occurs.
Here's how you can do it:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
session = requests.Session()
# in case of error, retry at most 3 times, waiting
# at least half a second between each retry
retry = Retry(total=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
Then, instead of:
r = requests.get(new_url)
you can use:
r = session.get(new_url)
See also the documentation for Retry for a full overview of the scenarios it supports.
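Putting this into the loop from the question, with an extra sleep-and-continue fallback for the case where the retries themselves are exhausted (the pause length and request timeout are arbitrary choices, not part of the answer above):

import time

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

url = "https://www.example.com"
for page in range(0, 951, 50):
    new_url = url + str(page) + "&pagingSize=50"
    try:
        r = session.get(new_url, timeout=10)
    except requests.exceptions.ConnectionError:
        # retries are exhausted: wait a while, then carry on with the next page
        time.sleep(30)
        continue
    source = BeautifulSoup(r.content, "html.parser")
    content = source.select('tr.searchResultsItem:not(.nativeAd, .classicNativeAd)')
    print(content)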
I want to get the response code from a web server, but sometimes I get code 200 even if the page doesn't exist, and I don't know how to deal with it.
I'm using this code:
import urllib.request
import urllib.error

def checking_url(link):
    try:
        link = urllib.request.urlopen(link)
        response = link.code
    except urllib.error.HTTPError as e:
        response = e.code
    return response
When I'm checking a website like this one:
https://www.wykop.pl/notexistlinkkk/
It still returns code 200 even if the page doesn't exist.
Is there any solution to deal with it?
I found a solution and am now going to test it with more websites: I had to use http.client.
You are getting response code 200 because the website you are checking has automatic redirection. In the URL you gave, even if you specify a non-existing page, it automatically redirects you to the home page rather than returning a 404 status code. Your code works fine.
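One rough way to see this with urllib is to compare the final URL (after redirects have been followed) with the one that was requested; the comparison below is a simple heuristic, not a complete solution:

import urllib.request
import urllib.error

def checking_url(link):
    try:
        response = urllib.request.urlopen(link)
    except urllib.error.HTTPError as e:
        return e.code
    # urlopen follows redirects, so a missing page that redirects to the
    # home page still reports 200; geturl() exposes the final URL
    if response.geturl().rstrip('/') != link.rstrip('/'):
        return f'redirected to {response.geturl()}'
    return response.code

print(checking_url('https://www.wykop.pl/notexistlinkkk/'))

With requests, the same information is available through response.history, or by passing allow_redirects=False to see the raw 3xx status code.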
import urllib.request
import urllib.error

thisCode = None
try:
    i = urllib.request.urlopen('http://www.google.com')
    thisCode = i.code
except urllib.error.HTTPError as e:
    thisCode = e.code
# prints the status code, or the HTTP error code if one was raised
print(thisCode)
I am having an issue getting a response from a particular URL using requests, which I need for web scraping. I have been able to get all other URLs to work except this one. My code:
import requests
u = "https://jobs.utc.com"
r = requests.get(u)
r
The error I am receiving is:
SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
Is there a reason why this URL is giving me trouble?
I received the same error in my browser, coming from Cloudflare. It seems that this particular host simply has a problem, and you are not facing a particular Python or socket challenge.
[Screenshot: Cloudflare error message]