Checking URLs with Python

I am trying to test an entire list of websites to see if the URLs are valid, and I want to know which ones are not.
import urllib2

filename = open(argfile, 'r')
f = filename.readlines()
filename.close()

def urlcheck():
    for line in f:
        try:
            urllib2.urlopen()
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)

urlcheck()

You have to pass the URL to urlopen:
def urlcheck():
    for line in f:
        try:
            urllib2.urlopen(line)
            print line, "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print line, "SITE IS NOT FUNCTIONAL"
            print(e.code)
        except urllib2.URLError, e:
            print line, "SITE IS NOT FUNCTIONAL"
            print(e.args)
        except Exception, e:
            print line, "Invalid URL"
Some edge cases and things to consider.
A little bit on error codes and HTTPError:
Every HTTP response from the server contains a numeric “status code”.
Sometimes the status code indicates that the server is unable to
fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a “redirection”
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can’t handle, urlopen
will raise an HTTPError. Typical errors include ‘404’ (page not
found), ‘403’ (request forbidden), and ‘401’ (authentication
required).
Even when HTTPError is raised you can still inspect the error code. A URL may be perfectly valid and reachable yet raise HTTPError with a code such as 403 (forbidden) or 401 (authentication required), and valid URLs can also return 5xx codes because of temporary server errors.
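To illustrate, here is a minimal sketch (the classify helper is made up for this example, urllib2 / Python 2) that inspects the error code instead of treating every HTTPError as an invalid URL:
import urllib2

def classify(url):
    try:
        urllib2.urlopen(url)
        return 'reachable'
    except urllib2.HTTPError as e:
        # the server answered, so the URL exists; decide based on the code
        if e.code in (401, 403):
            return 'exists, but access is restricted (%d)' % e.code
        if 500 <= e.code < 600:
            return 'server error, possibly temporary (%d)' % e.code
        return 'HTTP error %d' % e.code
    except urllib2.URLError as e:
        return 'unreachable (%s)' % e.reason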

I would suggest using the requests library.
import requests

resp = requests.get('your url')
if not resp.ok:
    print resp.status_code
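For a whole file of URLs, a sketch along these lines could work (the urls.txt filename and the 5-second timeout are assumptions, not from the question):
import requests

with open('urls.txt') as f:                      # assumed input file
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        resp = requests.get(url, timeout=5)      # assumed timeout
        if resp.ok:
            print(url + ' SITE IS FUNCTIONAL')
        else:
            print(url + ' returned status ' + str(resp.status_code))
    except requests.RequestException as e:
        print(url + ' SITE IS NOT FUNCTIONAL: ' + str(e))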

You have to pass the URL as a parameter to the urlopen function.
import urllib2

filename = open(argfile, 'r')
f = filename.readlines()
filename.close()

def urlcheck():
    for line in f:
        try:
            urllib2.urlopen(line)  # careful here
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)

urlcheck()

import urllib2

def check(url):
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'  # fetch only the headers, not the body (faster)
    request.add_header('Accept-Encoding', 'gzip, deflate, br')  # allow a compressed response (smaller transfer)
    try:
        response = urllib2.urlopen(request)
        return response.getcode() < 400
    except Exception:
        return False

'''
Contents of "/tmp/urls.txt"
http://www.google.com
https://fb.com
http://not-valid
http://not-valid.nvd
not-valid
'''
filename = open('/tmp/urls.txt', 'r')
urls = filename.readlines()
filename.close()

for url in urls:
    url = url.strip()  # drop the trailing newline left by readlines()
    print url + ' ' + str(check(url))
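For comparison, roughly the same HEAD-based check can be written with the requests library (a sketch; the timeout value is an assumption):
import requests

def check(url):
    try:
        # HEAD asks the server for headers only, mirroring the urllib2 version
        resp = requests.head(url, allow_redirects=True, timeout=5)
        return resp.status_code < 400
    except requests.RequestException:
        return False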

I would probably write it like this:
import urllib2

with open('urls.txt') as f:
    urls = [url.strip() for url in f.readlines()]

def urlcheck():
    for url in urls:
        try:
            urllib2.urlopen(url)
        except (ValueError, urllib2.URLError) as e:
            print('invalid url: {}'.format(url))

urlcheck()
Some changes from the OP's original implementation:
- use a context manager to open/close the data file
- strip newlines from URLs as they are read from the file
- use better variable names
- switch to the more modern exception handling style
- also catch ValueError for malformed URLs
- display a more useful error message
example output:
$ python urlcheck.py
invalid url: http://www.google.com/wertbh
invalid url: htp:/google.com
invalid url: google.com
invalid url: https://wwwbad-domain-zzzz.com

Related

How can I make this work? Should I use requests or urllib.error for exceptions?

I am trying to handle the exceptions from the http responses.
The PROBLEM with my code is that I am forced to use an IF condition to catch HTTP error codes:
if page.status_code != requests.codes.ok:
    page.raise_for_status()
I do not believe this is the right way to do it, so I am trying the FOLLOWING:
import requests

url = 'http://someurl.com/404-page.html'
myHeaders = {'User-agent': 'myUserAgent'}
s = requests.Session()
try:
    page = s.get(url, headers=myHeaders)
    #if page.status_code != requests.codes.ok:
    #    page.raise_for_status()
except requests.ConnectionError:
    print ("DNS problem or refused to connect")
    # Or Do something with it
except requests.HTTPError:
    print ("Some HTTP response error")
    # Or Do something with it
except requests.Timeout:
    print ("Error loading...too long")
    # Or Do something with it, perhaps retry
except requests.TooManyRedirects:
    print ("Too many redirect")
    # Or Do something with it
except requests.RequestException as e:
    print (e.message)
    # Or Do something with it
else:
    print ("nothing happen")
    # Do something if no exception
s.close()
This ALWAYS prints "nothing happen". How can I catch all possible exceptions related to the GET request?
You could catch a RequestException if you want to catch all the exceptions:
import requests

try:
    r = requests.get(........)
except requests.RequestException as e:
    print(e.message)
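Note that requests does not raise HTTPError on its own for a 404: a 4xx/5xx reply is still a completed response, which is why your else branch always runs. If you want HTTP error statuses to land in an except block, keep raise_for_status() inside the try, roughly like this (a sketch reusing the question's example URL):
import requests

try:
    page = requests.get('http://someurl.com/404-page.html', timeout=5)
    page.raise_for_status()  # converts a 4xx/5xx response into requests.HTTPError
except requests.HTTPError as e:
    print('HTTP error: ' + str(e))
except requests.RequestException as e:
    print('request failed: ' + str(e))
else:
    print('request succeeded')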

Python 2.2.3 HTTP Basic Authentication Implementation

I am trying to implement HTTP Basic Authentication in Python 2.2.3. This is the code:
import urllib2

proxyUserName1='<proxyusername>'
proxyPassword1='<proxypassword>'
realmName1='<realm>'
proxyUri1='<uri>'
passman=urllib2.HTTPPasswordMgr()
passman.add_password(realm=realmName1, uri=proxyUri1, user=proxyUserName1, passwd=proxyPassword1)
auth_handler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)

# Setting up the request & request parameters
login_url_request = urllib2.Request('<URL To be Accessed>')

# Getting the Response & reading it.
try:
    url_socket_connection = urllib2.urlopen(login_url_request)
except urllib2.URLError, urlerror:
    print ("URL Error Occured:")
    print (urlerror.code)
    print (urlerror.headers)
except urllib2.HTTPError, httperror:
    print ("HTTP Error Occured:")
    print (httperror.code)
    print (httperror.headers)
else:
    login_api_response = str(url_socket_connection.read())
    print (login_api_response)
I always get URL Error 401. This code works perfectly in Python 3.4. Unfortunately I need to get this running in Python 2.2.3. Can someone please tell me where I am going wrong?
It worked after changing the code:
import urllib2
import base64

proxyUserName1='<proxyusername>'
proxyPassword1='<proxypassword>'
realmName1='<realm>'
proxyUri1='<uri>'
base64encodedstring = base64.encodestring('%s:%s' % (proxyUserName1, proxyPassword1)).replace('\n', '')
passman=urllib2.HTTPPasswordMgr()
passman.add_password(realm=realmName1, uri=proxyUri1, user=proxyUserName1, passwd=proxyPassword1)
auth_handler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)

# Setting up the request & request parameters
login_url_request = urllib2.Request('<URL To be Accessed>')
login_url_request.add_header('Authorization', 'Basic %s' % base64encodedstring)

# Getting the Response & reading it.
try:
    url_socket_connection = urllib2.urlopen(login_url_request)
except urllib2.HTTPError, httperror:
    # HTTPError is a subclass of URLError, so it must be caught first
    print ("HTTP Error Occured:")
    print (httperror.code)
    print (httperror.headers)
except urllib2.URLError, urlerror:
    print ("URL Error Occured:")
    print (urlerror.reason)
else:
    login_api_response = str(url_socket_connection.read())
    print (login_api_response)
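The likely reason the original version failed is that urllib2's HTTPBasicAuthHandler only sends credentials in reply to a 401 challenge whose realm matches what was registered, whereas adding the Authorization header by hand sends the credentials preemptively, which some servers require. If a mismatched realm string is the problem, an alternative sketch (assuming HTTPPasswordMgrWithDefaultRealm is available in your Python version) is:
import urllib2

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means "use these credentials for any realm at this URI"
passman.add_password(None, proxyUri1, proxyUserName1, proxyPassword1)
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passman))
urllib2.install_opener(opener)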

Multithreading under url open process

I finished editing a script that checks whether a URL requires WWW basic authentication or not and prints the result for the user, as in this script:
#!/usr/bin/python
# Importing libraries
from urllib2 import urlopen, HTTPError
import socket
import urllib2
import threading
import time

# Setting up variables
url = open("oo.txt", 'r')
response = None
start = time.time()

# Executing commands
start = time.time()
for line in url:
    try:
        response = urlopen(line, timeout=1)
    except HTTPError as exc:
        # A 401 unauthorized will raise an exception
        response = exc
    except socket.timeout:
        print ("{0} | Request timed out !!".format(line))
    except urllib2.URLError:
        print ("{0} | Access error !!".format(line))

    auth = response and response.info().getheader('WWW-Authenticate')
    if auth and auth.lower().startswith('basic'):
        print "requires basic authentication"
    elif socket.timeout or urllib2.URLError:
        print "Yay"
    else:
        print "Not requires basic authentication"

print "Elapsed Time: %s" % (time.time() - start)
There are a couple of things I need your help with to edit in the script.
I want the script to check 10 URLs at a time and write the results for all the URLs into a text file in one go. I read about multithreading and multiprocessing, but I didn't find an example matching my case that would help me simplify the code.
I also have a problem with the output: when a timeout or a URL error occurs, the script prints the result on two lines, like this:
http://www.test.test
| Access error !!
I want it on one line; why does it show up on two?
Any help with these issues?
Thanks in advance.
The concurrent.futures package provides functionality that makes it very easy to use concurrency in Python. You define a function check_url that should be called for each URL. Then you can use the executor's map function to apply check_url to each URL in parallel and iterate over the return values.
#! /usr/bin/env python3
import concurrent.futures
import urllib.error
import urllib.request
import socket

def load_urls(pathname):
    with open(pathname, 'r') as f:
        return [line.rstrip('\n') for line in f]

class BasicAuth(Exception): pass

class CheckBasicAuthHandler(urllib.request.BaseHandler):
    def http_error_401(self, req, fp, code, msg, hdrs):
        if hdrs.get('WWW-Authenticate', '').lower().startswith('basic'):
            raise BasicAuth()
        return None

def check_url(url):
    try:
        opener = urllib.request.build_opener(CheckBasicAuthHandler())
        with opener.open(url, timeout=1) as u:
            return 'requires no authentication'
    except BasicAuth:
        return 'requires basic authentication'
    except socket.timeout:
        return 'request timed out'
    except urllib.error.URLError as e:
        return 'access error ({!r})'.format(e.reason)

if __name__ == '__main__':
    urls = load_urls('/tmp/urls.txt')
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for url, result in zip(urls, executor.map(check_url, urls)):
            print('{}: {}'.format(url, result))
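If the results should end up in a text file rather than on stdout, as the question asks, the main block can collect the results and write them out in one go; the output path below is an assumption:
if __name__ == '__main__':
    urls = load_urls('/tmp/urls.txt')
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        results = list(zip(urls, executor.map(check_url, urls)))
    # write all results at once after the workers have finished
    with open('/tmp/results.txt', 'w') as out:
        for url, result in results:
            out.write('{}: {}\n'.format(url, result))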

web2py url validator

In a URL shortener built with web2py, I want to validate URLs first; if a URL is not valid, the request should go back to the first page with an error message. This is my code in the controller (MVC architecture), but I don't get what's wrong!
import urllib

def index():
    return dict()

def random_maker():
    url = request.vars.url
    try:
        urllib.urlopen(url)
        return dict(rand_url=''.join(random.choice(string.ascii_uppercase +
                    string.digits + string.ascii_lowercase) for x in range(6)),
                    input_url=url)
    except IOError:
        return index()
Couldn't you check the HTTP response code using httplib? If it is 200 then the page is valid; if it is anything else (like 404) or an error, then it is invalid.
See this question: What’s the best way to get an HTTP response code from a URL?
Update:
Based on your comment it looks like your issue is how you are handling the error: you are only handling IOError. You can either handle all errors with a single bare except clause by switching to:
except:
    return index()
You could also build your own exception handler by overriding http_default_error. See How to catch 404 error in urllib.urlretrieve for more information.
Or you can switch to urllib2, which has specific errors. You can then handle the specific errors that urllib2 throws like this:
from urllib2 import Request, urlopen, URLError

req = Request('http://jfvbhsjdfvbs.com')
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    print 'URL is good!'
Running the above code against that (non-existent) URL will print:
We failed to reach a server.
Reason: [Errno 61] Connection refused
The specifics of each exception class are contained in the urllib.error API documentation.
I am not exactly sure how to slot this into your code, because I am not sure exactly what you are trying to do, but IOError is not going to handle the exceptions thrown by urllib.
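Putting that together, a rough sketch of the controller using urllib2 (untested against web2py, and it assumes random and string are imported) might look like:
import random
import string
import urllib2

def random_maker():
    url = request.vars.url      # request is web2py's global request object
    try:
        urllib2.urlopen(url)
    except (ValueError, urllib2.URLError):
        # covers malformed URLs, unreachable hosts and HTTP errors
        # (HTTPError is a subclass of URLError)
        return index()
    return dict(rand_url=''.join(random.choice(string.ascii_uppercase +
                                                string.digits +
                                                string.ascii_lowercase)
                                 for x in range(6)),
                input_url=url)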

Get URL when handling urllib2.URLError

This pertains to urllib2 specifically, but custom exception handling more generally. How do I pass additional information to a calling function in another module via a raised exception? I'm assuming I would re-raise using a custom exception class, but I'm not sure of the technical details.
Rather than pollute the sample code with what I've tried and failed, I'll simply present it as a mostly blank slate. My end goal is for the last line in the sample to work.
# mymod.py
import urllib2

def openurl():
    req = urllib2.Request("http://duznotexist.com/")
    response = urllib2.urlopen(req)

# main.py
import urllib2
import mymod

try:
    mymod.openurl()
except urllib2.URLError as e:
    # how do I do this?
    print "Website (%s) could not be reached due to %s" % (e.url, e.reason)
You can add information to and then re-raise the exception.
# mymod.py
import urllib2

def openurl():
    req = urllib2.Request("http://duznotexist.com/")
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError as e:
        # add URL and reason to the exception object
        e.url = "http://duznotexist.com/"
        e.reason = "URL does not exist"
        raise e  # re-raise the exception, so the calling function can catch it

# main.py
import urllib2
import mymod

try:
    mymod.openurl()
except urllib2.URLError as e:
    print "Website (%s) could not be reached due to %s" % (e.url, e.reason)
I don't think re-raising the exception is an appropriate way to solve this problem.
As #Jonathan Vanasco said,
if you're opening a.com , and it 301 redirects to b.com , urlopen will automatically follow that because an HTTPError with a redirect was raised. if b.com causes the URLError , the code above marks a.com as not existing
My solution is to override redirect_request of urllib2.HTTPRedirectHandler:
import urllib2

class NewHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
                or code in (301, 302, 303) and m == "POST"):
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k, v) for k, v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type"))
            # reuse the req object
            # mind that req will be changed if redirection happens
            req.__init__(newurl,
                         headers=newheaders,
                         origin_req_host=req.get_origin_req_host(),
                         unverifiable=True)
            return req
        else:
            raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)

opener = urllib2.build_opener(NewHTTPRedirectHandler)
urllib2.install_opener(opener)

# mind that req will be changed if redirection happens
#req = urllib2.Request('http://127.0.0.1:5000')
req = urllib2.Request('http://www.google.com/')
try:
    response = urllib2.urlopen(req)
except urllib2.URLError as e:
    print 'error'
    print req.get_full_url()
else:
    print 'normal'
    print response.geturl()
Let's try redirecting the URL to an unknown URL:
import os
from flask import Flask, redirect

app = Flask(__name__)

@app.route('/')
def hello():
    # return 'hello world'
    return redirect("http://a.com", code=302)

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)
And the result is:
error
http://a.com/
normal
http://www.google.com/
