I have the following code:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
# target = "https://www.rolcruise.co.uk/cruise-detail/1158731-hawaii-round-trip-honolulu-2020-05-23"
target = "https://www.rolcruise.co.uk"
try:
    html = urlopen(target)
except HTTPError as e:
    print("You got a HTTP Error. Something wrong with the path.")
    print("Here is the error code: " + str(e.code))
    print("Here is the error reason: " + e.reason)
    print("Happy for the program to end here")
except URLError as e:
    print("You got a URL Error. Something wrong with the URL.")
    print("Here is the error reason: " + str(e.reason))
    print("Happy for the program to end here")
else:
    bs_obj = BeautifulSoup(html, features="lxml")
    print(bs_obj)
If I deliberately mistype certain parts of the URL, the URLError handling works fine, e.g. if I type "htps" instead of "https", "ww" instead of "www", or "u" instead of "uk".
e.g.
target = "https://www.rolcruise.co.u"
However, if there is a typo in the hostname ("rolcruise") or in the "co" part of the URL, the URLError handler is not triggered and I get an ssl.CertificateError instead.
e.g.
target = "https://www.rolcruise.c.uk"
I do not understand why URLError doesn't cover every case where there is a typo somewhere in a URL.
Given that this is happening, what is the right way to handle the ssl.CertificateError?
Thanks for your help!
Get ssl into your namespace to start:
import ssl
Then you can catch that kind of exception:
try:
    html = urlopen(target)
except HTTPError as e:
    print("You got a HTTP Error. Something wrong with the path.")
    print("Here is the error code: " + str(e.code))
    print("Here is the error reason: " + e.reason)
    print("Happy for the program to end here")
except URLError as e:
    print("You got a URL Error. Something wrong with the URL.")
    print("Here is the error reason: " + str(e.reason))
    print("Happy for the program to end here")
except ssl.CertificateError:
    # Do your stuff here...
    pass
else:
    bs_obj = BeautifulSoup(html, features="lxml")
    print(bs_obj)
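One caveat, offered as a sketch rather than a definitive answer: on Python 3.7 and later, ssl.CertificateError is an alias of ssl.SSLCertVerificationError, which derives from OSError, so urlopen may deliver it wrapped inside a URLError (as its reason) instead of raising it directly. The helper below inspects a caught exception and covers both cases. The name describe_failure and its labels are hypothetical, not part of urllib.

```python
import ssl
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Return a short label for an exception caught around urlopen.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    if isinstance(exc, HTTPError):
        # HTTPError is a subclass of URLError, so it is tested first.
        return "http error " + str(exc.code)
    if isinstance(exc, ssl.CertificateError):
        # Raised directly (the behaviour described in the question).
        return "certificate error"
    if isinstance(exc, URLError):
        # Newer Python versions may wrap the certificate failure here.
        if isinstance(exc.reason, ssl.CertificateError):
            return "certificate error"
        return "url error"
    return "unexpected error"
```

It could be called from a single handler such as except (URLError, ssl.CertificateError) as e: print(describe_failure(e)).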
In this loop, the bot sometimes spams Telegram endlessly with the error message for a response code other than 200, and no records are written to log_file. It looks as if it is looping on bot.send_message, and I don't understand why.
The code doesn't crash, so I don't see any errors. It seems as if sleep sometimes doesn't work, but how can it fail randomly?
Most of the time, the code works properly.
import time
import telebot
import requests
import datetime
import json
import os
from funcs import *
from varias import *
from pathlib import Path
if not os.path.exists('C:/pymon_logs'):
    os.makedirs('C:/pymon_logs')
log_file = path("api_log", "a")
while 1:
    try:
        request = str(requests.get("https://***"))
        if request == "<Response [200]>":
            time.sleep(5)
        elif request != "<Response [200]>":
            bot.send_message(chat_id, "API " + str(request)[1:-1])
            log_file.write(str(request)[1:-1] + " " + today + '\n')
            log_file.close()
            time.sleep(120)
    except requests.ConnectionError as e:
        bot.send_message(chat_id, "API Connection Error")
        log_file.write("API Connection Error" + " " + today + '\n')
        log_file.close()
        time.sleep(120)
    except requests.Timeout as e:
        bot.send_message(chat_id, "API Timeout Error")
        log_file.write("API Timeout Error" + " " + today + '\n')
        log_file.close()
        time.sleep(120)
    except requests.RequestException as e:
        # The message is transliterated Russian, roughly "API: who knows what kind of error"
        bot.send_message(chat_id, "API huy znaet chto za oshibka")
        log_file.write("API huy znaet chto za oshibka" + " " + today + '\n')
        log_file.close()
        time.sleep(120)
    except:
        pass
First of all, you shouldn't check the response code by converting the response to a string. It's not good practice, IMHO.
Instead, keep the response object:
request = requests.get("https://***")
and check its status code:
if request.status_code == 200:
    ...
The following line is also unnecessary; replace it with a plain else:
elif request != "<Response [200]>":
But the main problem with your code is that you close the file inside an infinite loop. The very first time a non-200 response or an exception is handled, log_file.close() closes the file handle, and you cannot write to a closed handle, so nothing more will appear in the log file.
The next time the code gets a non-200 response, bot.send_message runs first, then log_file.write raises another exception because the file is closed (a ValueError for a regular file object). The bare except: pass swallows it, so the time.sleep(120) call is skipped and the loop starts over immediately, again and again...
TL;DR:
Remove all the
log_file.close()
parts.
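As a minimal sketch of the advice above, the helper below compares an integer status code (what response.status_code gives you) and opens the log file only for the duration of each write, so there is no long-lived handle that can end up closed. The names handle_status, notify and log_path are hypothetical, and the 5 and 120 second delays are simply copied from the question.

```python
import datetime


def handle_status(status_code, notify, log_path):
    """Report and log a non-200 status; return seconds to wait before the next check.

    Hypothetical helper: notify is any callable that takes the message text,
    log_path is the path of the log file.
    """
    if status_code == 200:
        return 5
    message = "API returned status " + str(status_code)
    notify(message)
    # Open in append mode for this one write; the with block closes the
    # file afterwards, and the next call simply opens it again.
    with open(log_path, "a") as log_file:
        log_file.write(message + " " + datetime.datetime.now().isoformat() + "\n")
    return 120
```

Inside the loop this would be used roughly as time.sleep(handle_status(requests.get(url).status_code, lambda text: bot.send_message(chat_id, text), log_path)), with the requests exceptions still caught around the call.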
I'm trying to test whether each URL in a simple list exists. The code works when I test just one URL, but when I try to add an array of URLs, it breaks.
Any idea what I'm doing wrong?
Single URL Code
import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
Broken Array Code
import httplib
Urls = ['www.google.ie', 'www.msn.com', 'www.fakeniallweb.com', 'www.wikipedia.org', 'www.galwaydxc.com', 'www.foxnews.com', 'www.blizzard.com', 'www.youtube.com']
for x in Urls:
    c = httplib.HTTPConnection(x)
    c.request("HEAD", '')
    if c.getresponse().status == 200:
        print('web site exists')
    else:
        print('web site' + x + 'un-reachable')
#To prevent code from closing
input ()
The problem is not that you use an array; it is that one of your URLs (www.fakeniallweb.com) fails in a different way than the others.
I think that because its hostname cannot be resolved by DNS, the HEAD request itself raises an exception before there is any response to inspect. So you need an additional check besides testing for response code 200.
Maybe you could do something like this:
from socket import gaierror

try:
    c.request("HEAD", '')
    if c.getresponse().status == 200:
        print('web site exists')
    else:
        print('website does not exist')
except gaierror as e:
    print('Error resolving DNS')
Honestly I suspect you will find other cases where a website returns different status codes. For example a website might return something in the 3xx range for a redirect, or a 403 if you cannot access it. That does not mean the website does not exist.
Hope this helps you on your way!
@Dries De Rydt
Thanks for your help, it was an unresolved DNS error causing it to crash out.
I ended up using the socket module (Lib/socket.py).
Solution:
import socket
Urls = ['www.google.ie', 'www.msn.com', 'www.fakeniallweb.com', 'www.wikipedia.org', 'www.galwaydxc.com', 'www.foxnews.com', 'www.blizzard.com', 'www.youtube.com']
for x in Urls:
    try:
        url = socket.gethostbyname(x)
        print(x + ' was reachable')
    except socket.gaierror as err:
        print("cannot resolve hostname: %s %s" % (x, err))
#To prevent code from closing
input ()
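The same DNS check can be wrapped in a small function, shown here as a sketch for Python 3. The function name hostname_resolves and its resolver parameter are hypothetical; the parameter only exists so the lookup can be swapped out in tests, and by default it is socket.gethostbyname as in the solution above. Note that a successful lookup only shows that the name resolves, not that a web server answers on it.

```python
import socket


def hostname_resolves(hostname, resolver=socket.gethostbyname):
    """Return True if the hostname resolves, False on a DNS lookup failure.

    resolver is a hypothetical hook for testing; it defaults to the real lookup.
    """
    try:
        resolver(hostname)
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False
    return True
```

Looping over the list then becomes: for x in Urls: print(x, 'resolves' if hostname_resolves(x) else 'cannot be resolved').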
Thanks for all the help.
I have a simple function (in Python 3) that takes a URL and attempts to resolve it: it returns an error code if there is one (e.g. 404), or resolves a shortened URL to its full URL. My URLs are in one column of a CSV file and the output is saved in the next column. The problem arises when the program encounters a URL whose server takes too long to respond: the program just crashes. Is there a simple way to force urllib to return an error code if the server is taking too long? I looked into "Timeout on a function call", but that looks a little too complicated as I am just starting out. Any suggestions?
i.e. (COL A) shorturl (COL B) http://deals.ebay.com/500276625
import urllib.request
import urllib.error

def urlparse(urlColumnElem):
    try:
        conn = urllib.request.urlopen(urlColumnElem)
    except urllib.error.HTTPError as e:
        return (e.code)
    except urllib.error.URLError as e:
        return ('URL_Error')
    else:
        redirect=conn.geturl()
        #check redirect
        if(redirect == urlColumnElem):
            #print ("same: ")
            #print(redirect)
            return (redirect)
        else:
            #print("Not the same url ")
            return(redirect)
EDIT: if anyone gets the http.client.disconnected error (like me), see this question/answer http.client.RemoteDisconnected error while reading/parsing a list of URL's
Have a look at the docs:
urllib.request.urlopen(url, data=None[, timeout])
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used).
You can set a realistic timeout (in seconds) for your process:
conn = urllib.request.urlopen(urlColumnElem, timeout=realistic_timeout_in_seconds)
and, to stop your code from crashing, move everything inside the try/except block:
import socket
import urllib.request
import urllib.error

realistic_timeout_in_seconds = 10  # example value only; pick what suits your servers

def urlparse(urlColumnElem):
    try:
        conn = urllib.request.urlopen(
            urlColumnElem,
            timeout=realistic_timeout_in_seconds
        )
        redirect=conn.geturl()
        #check redirect
        if(redirect == urlColumnElem):
            #print ("same: ")
            #print(redirect)
            return (redirect)
        else:
            #print("Not the same url ")
            return(redirect)
    except urllib.error.HTTPError as e:
        return (e.code)
    except urllib.error.URLError as e:
        return ('URL_Error')
    except socket.timeout as e:
        return ('Connection timeout')
Now if a timeout occurs, you will catch the exception and the program will not crash.
Good luck :)
First, there is a timeout parameter that can be used to control the time allowed for urlopen. Next, a timeout in urlopen should just throw an exception, more precisely a socket.timeout. If you do not want it to abort the program, you just have to catch it (this needs import socket):
def urlparse(urlColumnElem, timeout=5): # allow 5 seconds by default
    try:
        conn = urllib.request.urlopen(urlColumnElem, timeout = timeout)
    except urllib.error.HTTPError as e:
        return (e.code)
    except urllib.error.URLError as e:
        return ('URL_Error')
    except socket.timeout:
        return ('Timeout')
    else:
        ...
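A further detail, stated as an assumption about urllib's behaviour rather than a guarantee: a timeout while connecting may arrive wrapped in a URLError whose reason is the socket.timeout, while a timeout later on may be raised as a bare socket.timeout. The sketch below checks for both. The name resolve_url and the opener parameter are hypothetical; opener exists only so the function can be exercised without a network.

```python
import socket
import urllib.error
import urllib.request


def resolve_url(url, timeout=5, opener=urllib.request.urlopen):
    """Return the final URL, an HTTP status code, or a short error label.

    opener is a hypothetical hook for testing; it defaults to urlopen.
    """
    try:
        with opener(url, timeout=timeout) as conn:
            return conn.geturl()
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError as e:
        # A timeout can be reported as the reason of a URLError.
        if isinstance(e.reason, socket.timeout):
            return "Timeout"
        return "URL_Error"
    except socket.timeout:
        # ... or raised directly.
        return "Timeout"
```

Called on each cell of the URL column, this returns a value for the next column in every case instead of letting the exception end the program.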
My code is the following:
import json
import urllib2
from urllib2 import HTTPError
def karma_reddit(user):
    while True:
        try:
            url = "https://www.reddit.com/user/" + str(user) + ".json"
            data = json.load(urllib2.urlopen(url))
        except urllib2.HTTPError as err:
            if err == "Too Many Requests":
                continue
            if err == "Not Found":
                print str(user) + " isn't a valid username."
            else:
                raise
        break
I'm trying to get the data from a reddit user profile. However, HTTPErrors keep occurring. When I try to catch them with the except statement, they are still raised, and the program neither runs another iteration of the loop nor executes the print statement. How do I catch the HTTPErrors? I'm pretty new to Python, so this might be a rookie mistake. Thanks!
You need to check err.msg for the string. err itself is never equal to either string, so you always reach the else: raise branch:
if err.msg == "Too Many Requests":
    continue
if err.msg == "Not Found":
    print str(user) + " isn't a valid username."
I would recommend using requests. With reddit, the error code is actually returned in the JSON, so you can use that:
import requests
def karma_reddit(user):
    while True:
        data = requests.get("https://www.reddit.com/user/" + str(user) + ".json").json()
        if data.get("error") == 429:
            print("Too many requests")
        elif data.get("error") == 404:
            print(str(user) + " isn't a valid username.")
        return data
The fact you are raising all exceptions bar your 429 and 404's means you don't need a try. You should really break on any error and just output a message to the user and limit the amount of requests.
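If you stay with urllib, comparing the numeric err.code is less fragile than matching the message text. The following is a sketch for Python 3 (urllib.request instead of urllib2) that also limits the number of retries. The parameters opener, max_retries and delay are hypothetical additions for illustration, not part of the original function.

```python
import json
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def karma_reddit(user, opener=urlopen, max_retries=3, delay=2.0):
    """Return the parsed JSON for a user, or None if it could not be fetched.

    opener, max_retries and delay are hypothetical parameters for this sketch.
    """
    url = "https://www.reddit.com/user/" + str(user) + ".json"
    for _ in range(max_retries):
        try:
            with opener(url) as response:
                return json.load(response)
        except HTTPError as err:
            if err.code == 429:
                # Too Many Requests: wait, then try again.
                time.sleep(delay)
                continue
            if err.code == 404:
                print(str(user) + " isn't a valid username.")
                return None
            raise
    # Still rate-limited after max_retries attempts.
    return None
```

Any HTTP error other than 429 or 404 is re-raised, as in the original code.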
Here is the code:
import urllib2 as URL
def get_unread_msgs(user, passwd):
    auth = URL.HTTPBasicAuthHandler()
    auth.add_password(
        realm='New mail feed',
        uri='https://mail.google.com',
        user='%s'%user,
        passwd=passwd
    )
    opener = URL.build_opener(auth)
    URL.install_opener(opener)
    try:
        feed= URL.urlopen('https://mail.google.com/mail/feed/atom')
        return feed.read()
    except:
        return None
It works just fine. The only problem is that when a wrong username or password is used, it takes forever to open the URL:
feed= URL.urlopen('https://mail.google.com/mail/feed/atom')
It doesn't throw any errors; it just keeps executing the urlopen statement forever.
How can I tell whether the username/password is incorrect?
I thought of adding a timeout to the function, but that would turn every error, and even a slow internet connection, into an authentication error.
It should throw an error, more precisely an urllib2.HTTPError, with the code field set to 401, you can see some adapted code below. I left your general try/except structure, but really, do not use general except statements, catch only what you expect that could happen!
def get_unread_msgs(user, passwd):
    auth = URL.HTTPBasicAuthHandler()
    auth.add_password(
        realm='New mail feed',
        uri='https://mail.google.com',
        user='%s'%user,
        passwd=passwd
    )
    opener = URL.build_opener(auth)
    URL.install_opener(opener)
    try:
        feed= URL.urlopen('https://mail.google.com/mail/feed/atom')
        return feed.read()
    except URL.HTTPError as e: # HTTPError lives in urllib2, which is imported as URL
        if e.code == 401:
            print "authorization failed"
        else:
            raise e # or do something else
    except: #A general except clause is discouraged, I left it in because you had it already
        return None
I just tested it here, works perfectly