I have a list of 1000 websites to check whether they exist or not, but my code reports every URL starting with https:// as working. Here is my code below:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://stackoverflow.com")
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print('Website is working fine')
You can use the Python requests library.
If you do response = requests.get('http://stackoverflow.com') and then check response.status_code, you should get 200. If you try a site that is not available, you should get a status_code of 404. You can use status_code in your case.
More on status codes: Link.
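For example, a minimal sketch of looping over a list of URLs and checking each status code (the url_list variable, the timeout value and the catch-all RequestException handler are assumptions, not from the original post):

import requests

# Hypothetical list standing in for the 1000 URLs from the question.
url_list = [
    "http://stackoverflow.com",
    "https://thissitedoesnotexist.example",
]

for url in url_list:
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            print(url, "is working")
        else:
            print(url, "returned status", response.status_code)
    except requests.exceptions.RequestException as e:
        # DNS failures, connection errors and timeouts never produce a
        # status code, so they have to be caught separately.
        print(url, "failed:", e)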
Related
I want to access mouser.com using urllib. When I try to fetch the data from the URL, it hangs indefinitely.
Here is the code:
import urllib.error
import urllib.request

try:
    htmls = urllib.request.urlopen("https://www.mouser.com/")
except urllib.error.HTTPError as e:
    print("HTTP ERROR")
except urllib.error.URLError as e:
    print("URL ERROR")
else:
    print(htmls.read().decode("utf-8"))
This piece of code works fine for most URLs, but for some URLs it doesn't, like Mouser or element14.
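One thing worth trying is a minimal sketch like the one below, assuming the hang is caused by the site not answering requests that identify themselves with urllib's default User-Agent; the browser-like header value and the 30-second timeout are assumptions, and the timeout at least makes urlopen() raise instead of waiting forever:

import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent header; some sites stall or block
# requests that identify themselves as Python-urllib.
req = urllib.request.Request(
    "https://www.mouser.com/",
    headers={"User-Agent": "Mozilla/5.0"},
)
try:
    # The timeout makes urlopen() raise instead of hanging indefinitely.
    htmls = urllib.request.urlopen(req, timeout=30)
except urllib.error.HTTPError as e:
    print("HTTP ERROR", e.code)
except urllib.error.URLError as e:
    print("URL ERROR", e.reason)
else:
    print(htmls.read().decode("utf-8"))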
I am working on a web-scraping script using requests-html. I scrape the URLs, then run through them and commit the results to a database. I have been able to scrape the links and created a for loop that renders each page and then scrapes the specific product information. For the majority of links this works, but for some the page will not render and I get a pyppeteer.errors.TimeoutError. I am fine with not scraping some links, as the majority of the website information is grabbed. I have tried using try and except as below:
session = HTMLSession()
for link in productlinks2:
    r = session.get(link)
    try:
        r.html.render(sleep=3, timeout=30)
    except TimeoutError:
        pass
But this still produces:
pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
Is there any way to skip over the links which won't render in time? Any help would be appreciated.
Did you import your error? The TimeoutError you are catching is Python's built-in one, not pyppeteer.errors.TimeoutError, so the render timeout slips straight past your except clause.
You also need to set a timeout on your session.get() call.
Beyond that it depends on the error: if you have a bad URL, you will get an error from session.get() before the page is ever rendered.
So, for example, here are the different errors that can be caught:
from requests_html import HTMLSession
from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
from pyppeteer.errors import TimeoutError

session = HTMLSession()

links = [
    'https://www.google.com/',
    'h**ps://www.google.com/',
    'https://deelay.me/4000/https://www.google.com/',  # 4s of delay to get the page
    'https://www.baaaadurl.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
]

for url in links:
    try:
        r = session.get(url, timeout=3)
        r.html.render(timeout=1)  # timeout short enough to render google but not youtube
        print(r.html.find('title', first=True).text, '\n')
    except InvalidSchema as e:
        # error for 'h**ps://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
        pass
    except ReadTimeout as e:
        # error due to too much delay for
        # 'https://deelay.me/4000/https://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
        pass
    except ConnectionError as e:
        # error for 'https://www.baaaadurl.com/'
        print(f'For the url "{url}" the error is: {e} \n')
        pass
    except TimeoutError as e:
        # error if the timeout is exceeded while
        # rendering the page 'https://www.youtube.com/'
        print(f'For the url "{url}" the error is: {e} \n')
        pass
Print result:
Google
For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/'
For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me', port=443): Read timed out. (read timeout=3)
For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known'))
For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded.
Google
So you can catch errors and continue your loop.
I am trying to define a function for a list of URLs; the function is intended to print a message when a certain link or server from the original list (job_title_links) is not found:
This is what I've got so far:
from urllib.error import URLError
from urllib.error import HTTPError
from urllib.request import urlopen

job_title_links = ['https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/',
                   'https://www.salario.com.br/profissao/abade-cbo-263105/',
                   'https://www.salario.com.br/profissao/abanador-na-agricultura-cbo-622020/']

def try_url_exist(links):
    for link in job_title_links:
        try:
            html = urlopen(link)
        except HTTPError as e:
            print(e)  # not found url
        except URLError as e:
            print(e)  # server not found

try_url_exist(job_title_links)
However, the function gives me HTTP Error 403 for every link, even when the URLs exist.
Console output:
HTTP Error 403: Forbidden
HTTP Error 403: Forbidden
HTTP Error 403: Forbidden
The expected output is that the function does nothing if the URL exists, and prints either the HTTPError or the URLError together with the name of the URL when the URL does not exist.
How could I accomplish this task?
By changing urlopen() to requests.get() from the requests library, checking the response with raise_for_status(), and collecting the working links in a list, the code worked.
import requests
from requests.exceptions import HTTPError, RequestException

def try_url_exist(links):
    functional = []
    for link in links:
        try:
            html = requests.get(link)
            html.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        except HTTPError as e:
            print(e)  # URL not found or rejected
        except RequestException as e:
            print(e)  # server not found
        else:
            functional.append(link)
    return functional

functional_links = try_url_exist(job_title_links)
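If you prefer to stay with urlopen(), a sketch that often avoids the 403 is to send a browser-like User-Agent header, assuming the site is rejecting the default Python-urllib agent string (the header value here is an assumption):

from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def try_url_exist(links):
    for link in links:
        # Hypothetical browser-like User-Agent; many sites answer 403
        # to the default Python-urllib agent string.
        req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            urlopen(req)
        except HTTPError as e:
            print(e, link)  # URL not found or rejected
        except URLError as e:
            print(e, link)  # server not found

try_url_exist(job_title_links)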
I have a list of more than 1000 URLs (the URLs are for downloading reports) saved in a .csv file.
Some of the URLs return a 404 error and I want to find a way to remove them from the list.
I managed to write code (for Python 3, below) to identify which URL is invalid. However, I don't know how to remove those URLs from the list automatically, given that there are many URLs. Thank you!
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    urlopen("url")
except HTTPError as err:
    if err.code == 404:
        print('invalid')
    else:
        raise
You can use another collection to save the 404 URLs (assuming there are fewer 404 URLs than valid ones), then take the set difference, like so:
from urllib.request import urlopen
from urllib.error import HTTPError

exclude_urls = set()
for url in all_urls:
    try:
        urlopen(url)
    except HTTPError as err:
        if err.code == 404:
            exclude_urls.add(url)

valid_urls = set(all_urls) - exclude_urls
Consider that list A has all the URLs. You can remove an invalid one in place with A.remove("invalid_url"); note that list.remove() returns None, so do not reassign its result back to A.
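Since list.remove() only deletes the first matching entry per call, a list comprehension is a handy alternative once the invalid URLs are known (the example data below is hypothetical):

# Hypothetical data standing in for the real lists.
A = ["https://good.example/1", "https://bad.example/404", "https://good.example/2"]
invalid_urls = {"https://bad.example/404"}

# Keep only the URLs that were not flagged as invalid, preserving order.
A = [url for url in A if url not in invalid_urls]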
You can do something like this:
from urllib.request import urlopen
from urllib.error import HTTPError

def load_data(csv_name):
    ...

def save_data(data, csv_name):
    ...

links = load_data(csv_name)
new_links = set()
for i in links:
    try:
        urlopen(i)
    except HTTPError as err:
        if err.code == 404:
            print('invalid')
    else:
        new_links.add(i)

save_data(list(new_links), csv_name)
Creating a loop and writing the valid URLs to a new CSV file (from the else clause of the try statement) would be the easiest solution.
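A minimal sketch of that idea, assuming the input CSV has one URL per line in a single column (the file names urls.csv and valid_urls.csv are assumptions):

import csv
from urllib.error import HTTPError
from urllib.request import urlopen

valid_urls = []
with open('urls.csv', newline='') as f:  # hypothetical input file
    for row in csv.reader(f):
        url = row[0]
        try:
            urlopen(url)
        except HTTPError as err:
            if err.code == 404:
                print('invalid:', url)
        else:
            valid_urls.append(url)

with open('valid_urls.csv', 'w', newline='') as f:  # hypothetical output file
    csv.writer(f).writerows([url] for url in valid_urls)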
I am using Python to fetch content from some URLs. I have a list of URLs, and all of them are fine except one, for which I get a 404. I wanted to handle it like this:
for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()
    except RuntimeError:
        print('error: could not get content from url because of {}'.format(r.status_code))
But now the exception raised by raise_for_status() is not caught; its traceback is just printed out. How can I print my own error message if it is raised?
You need to modify your try/except block:
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.exceptions.HTTPError as error:
    print(error)
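Dropped into the loop from the question, catching the right exception means the custom message is printed instead of a traceback (this reuses the urls list from the question):

for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError:
        # The HTTPError is now actually caught, so the custom message is
        # printed and the loop moves on to the next URL.
        print('error: could not get content from url because of {}'.format(r.status_code))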
You could create your own exception class and just raise that:
class MyException(Exception):
    pass

...
...

for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError as error:
        raise MyException('error: could not get content from url because of {}'.format(r.status_code))
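Note that raising MyException stops the loop unless something catches it further up, so the calling code would look roughly like the sketch below (fetch_all is a hypothetical wrapper around the loop above):

try:
    fetch_all(urls)  # hypothetical function containing the loop above
except MyException as e:
    print(e)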