I finished writing a script that checks whether a URL requires HTTP Basic (WWW-Authenticate) authentication and prints the result for the user, as in this script:
#!/usr/bin/python
# Importing libraries
from urllib2 import urlopen, HTTPError
import socket
import urllib2
import threading
import time

# Setting up variables
url = open("oo.txt", 'r')
response = None
start = time.time()

# Executing commands
for line in url:
    try:
        response = urlopen(line, timeout=1)
    except HTTPError as exc:
        # A 401 unauthorized will raise an exception
        response = exc
    except socket.timeout:
        print ("{0} | Request timed out !!".format(line))
    except urllib2.URLError:
        print ("{0} | Access error !!".format(line))
    auth = response and response.info().getheader('WWW-Authenticate')
    if auth and auth.lower().startswith('basic'):
        print "requires basic authentication"
    elif socket.timeout or urllib2.URLError:
        print "Yay"
    else:
        print "Not requires basic authentication"

print "Elapsed Time: %s" % (time.time() - start)
There are a few things I need your help with to edit the script. I want it to check the URLs ten at a time and write the results for all of them into a text file in one go. I read about multithreading and multiprocessing, but I didn't find an example matching my case that would simplify the code for me.
I also have a problem with the result when a timeout or a URL error appears: the script prints the result on two lines, like this:
http://www.test.test
| Access error !!
I want it on one line. Why does it show on two?
Any help with these issues?
Thanks in advance.
The concurrent.futures package provides functionality that makes it very easy to use concurrency in Python. You define a function check_url that should be called for each URL. Then you can use the map function to apply the function to each URL in parallel and iterate over the return values. As for the two-line output: each line read from the file keeps its trailing newline, so the URL and your message print on separate lines; stripping it (as load_urls below does with rstrip) puts everything on one line.
#! /usr/bin/env python3

import concurrent.futures
import urllib.error
import urllib.request
import socket

def load_urls(pathname):
    with open(pathname, 'r') as f:
        return [line.rstrip('\n') for line in f]

class BasicAuth(Exception): pass

class CheckBasicAuthHandler(urllib.request.BaseHandler):
    def http_error_401(self, req, fp, code, msg, hdrs):
        if hdrs.get('WWW-Authenticate', '').lower().startswith('basic'):
            raise BasicAuth()
        return None

def check_url(url):
    try:
        opener = urllib.request.build_opener(CheckBasicAuthHandler())
        with opener.open(url, timeout=1) as u:
            return 'requires no authentication'
    except BasicAuth:
        return 'requires basic authentication'
    except socket.timeout:
        return 'request timed out'
    except urllib.error.URLError as e:
        return 'access error ({!r})'.format(e.reason)

if __name__ == '__main__':
    urls = load_urls('/tmp/urls.txt')
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for url, result in zip(urls, executor.map(check_url, urls)):
            print('{}: {}'.format(url, result))
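If you want the results written to a text file rather than printed, as the question asks, you can collect the mapped results and write them all out in one go. A minimal sketch, assuming the same helpers as above (the output name results.txt is arbitrary):

if __name__ == '__main__':
    urls = load_urls('/tmp/urls.txt')
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # max_workers=10 means at most 10 URLs are checked concurrently
        results = list(executor.map(check_url, urls))
    with open('results.txt', 'w') as out:  # arbitrary output file name
        for url, result in zip(urls, results):
            out.write('{}: {}\n'.format(url, result))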
Good day. The problem I am facing is that I want to check whether my website is up or not. This is sample pseudocode:

Check(website.com)
if checking_time > 10 seconds:
    print "No response Received"
else:
    print "Site is up"
I already tried the code below, but it is not working:
try:
    response = urllib.urlopen("http://insurance.contactnumbersph.com").getcode()
    time.sleep(5)
    if response == "" or response == "403":
        print "No response"
    else:
        print "ok"
If the website is not up and running, you will get a connection refused error, and no status code is actually returned. So you can catch the error in Python with simple try: and except: blocks.
import requests

URL = 'http://some-url-where-there-is-no-server'
try:
    resp = requests.get(URL)
except Exception as e:
    # handle here
    print(e)  # for example
You can also retry up to 10 times, once per second: if there is an exception, you check again.
import requests
import time

URL = 'http://some-url'
counts = 0
gotConnected = False
while counts < 10:
    try:
        resp = requests.get(URL)
        gotConnected = True
        break
    except Exception as e:
        counts += 1
        time.sleep(1)
The result will be available in the gotConnected flag, which you can use later to take the appropriate action.
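A variation on the same idea, just a sketch (the function name and default values are arbitrary), wraps the retry loop in a reusable function and adds an explicit per-request timeout so a single hung connection cannot stall an attempt:

import time
import requests

def is_reachable(url, retries=10, delay=1, timeout=5):  # hypothetical helper
    """Return True as soon as one GET succeeds, False after all retries fail."""
    for _ in range(retries):
        try:
            requests.get(url, timeout=timeout)
            return True
        except requests.exceptions.RequestException:
            time.sleep(delay)
    return False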
Note that the timeout that gets passed around by urllib applies to the "wrong thing": each individual network operation (e.g. hostname resolution, socket connection, sending the headers, reading a few bytes of the headers, reading a few more bytes of the response) gets this same timeout applied. Hence passing a "timeout" of 10 seconds could allow a large response to continue for hours.
If you want to stick to built-in Python code, it would be nice to use a thread to do this, but it doesn't seem to be possible to cancel running threads nicely. An async library like Trio would allow better timeout and cancellation handling, but we can make do by using the multiprocessing module instead:
from urllib.request import Request, urlopen
from multiprocessing import Process
from time import perf_counter

def _http_ping(url):
    req = Request(url, method='HEAD')
    print(f'trying {url!r}')
    start = perf_counter()
    res = urlopen(req)
    secs = perf_counter() - start
    print(f'response {url!r} of {res.status} after {secs*1000:.2f}ms')
    res.close()

def http_ping(url, timeout):
    proc = Process(target=_http_ping, args=(url,))
    try:
        proc.start()
        proc.join(timeout)
        success = not proc.is_alive()
    finally:
        proc.terminate()
        proc.join()
        proc.close()
    return success
You can use https://httpbin.org/ to test this, e.g.:
http_ping('https://httpbin.org/delay/2', 1)
should print a "trying" message, but not a "response" message. You can adjust the delay time and the timeout to explore how this behaves.
Note that this spins up a new process for each request, but as long as you're doing fewer than a thousand pings a second it should be OK.
I have the Flask code below:
from flask import Flask, request, jsonify
import requests
from werkzeug.exceptions import InternalServerError, NotFound
import sys
import json

app = Flask(__name__)
app.config['SECRET_KEY'] = "Secret!"

class InvalidUsage(Exception):
    status_code = 400

    def __init__(self, message, status_code=None, payload=None):
        Exception.__init__(self)
        self.message = message
        if status_code is not None:
            self.status_code = status_code
        self.payload = payload

    def to_dict(self):
        rv = dict(self.payload or ())
        rv['message'] = self.message
        rv['status_code'] = self.status_code
        return rv

@app.errorhandler(InvalidUsage)
def handle_invalid_usage(error):
    response = jsonify(error.to_dict())
    response.status_code = error.status_code
    return response

@app.route('/test', methods=["GET", "POST"])
def test():
    url = "https://httpbin.org/status/404"
    try:
        response = requests.get(url)
        if response.status_code != 200:
            try:
                response.raise_for_status()
            except requests.exceptions.HTTPError:
                status = response.status_code
                print status
                raise InvalidUsage("An HTTP exception has been raised", status_code=status)
    except requests.exceptions.RequestException as e:
        print e

if __name__ == "__main__":
    app.run(debug=True)
My question is: how do I get the exception string (message) and other relevant params from the requests.exceptions.RequestException object e?
Also, what is the best way to log such exceptions? In the case of HTTPError exceptions I have the status code to refer to.
But requests.exceptions.RequestException catches all request exceptions, so how do I differentiate between them, and what is the best way to log them apart from using print statements?
Thanks a lot in advance for any answers.
RequestException is a base class for HTTPError, ConnectionError, Timeout, URLRequired, TooManyRedirects and others (the whole list is available on the GitHub page of the requests module). It seems that the best way of dealing with each error and printing the corresponding information is to handle them starting from the most specific and finishing with the most general one (the base class). This has been elaborated on widely in the comments in this StackOverflow topic. For your test() method this could be:
@app.route('/test', methods=["GET", "POST"])
def test():
    url = "https://httpbin.org/status/404"
    try:
        response = requests.get(url)  # some code...
    except requests.exceptions.ConnectionError as ece:
        print("Connection Error:", ece)
    except requests.exceptions.Timeout as et:
        print("Timeout Error:", et)
    except requests.exceptions.RequestException as e:
        print("Some Ambiguous Exception:", e)
This way you first catch the more specific errors that inherit from the RequestException class.
As for an alternative to print statements: I'm not sure if that's exactly what you meant, but you can log to the console or to a file with standard Python logging in Flask, or with the logging module itself (here for Python 3).
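For example, a minimal logging sketch (the file name, format, and levels are arbitrary choices):

import logging
import requests

# log warnings and errors to a file instead of printing them
logging.basicConfig(filename='errors.log', level=logging.WARNING,
                    format='%(asctime)s %(levelname)s %(message)s')

try:
    requests.get('https://httpbin.org/status/404').raise_for_status()
except requests.exceptions.HTTPError as e:
    logging.warning('HTTP error: %s', e)    # has a status code to refer to
except requests.exceptions.RequestException as e:
    logging.error('Request failed: %s', e)  # connection errors, timeouts, etc.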
This is actually not a question about the requests library so much as a general Python question about how to extract the error string from an exception instance. The answer is relatively straightforward: you convert it to a string by calling str() on the exception instance. Any properly written exception class (in requests or otherwise) will have implemented an __str__() method to allow an str() call on an instance. Example below:
import requests

rsp = requests.get('https://httpbin.org/status/404')
try:
    if rsp.status_code >= 400:
        rsp.raise_for_status()
except requests.exceptions.RequestException as e:
    error_str = str(e)
    # log 'error_str' to disk, a database, etc.
    print('The error was:', error_str)
Yes, in this example, we print it, but once you have the string you have additional options. Anyway, saving this to test.py results in the following output given your test URL:
$ python3 test.py
The error was: 404 Client Error: NOT FOUND for url: https://httpbin.org/status/404
I am trying to test an entire list of websites to see if the URLs are valid, and I want to know which ones are not.
import urllib2

filename = open(argfile, 'r')
f = filename.readlines()
filename.close()

def urlcheck():
    for line in f:
        try:
            urllib2.urlopen()
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)

urlcheck()
You have to pass the URL:
def urlcheck():
    for line in f:
        try:
            urllib2.urlopen(line)
            print line, "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print line, "SITE IS NOT FUNCTIONAL"
            print(e.code)
        except urllib2.URLError, e:
            print line, "SITE IS NOT FUNCTIONAL"
            print(e.args)
        except Exception, e:
            print line, "Invalid URL"
Some edge cases and things to consider:
A little on error codes and HTTPError:
Every HTTP response from the server contains a numeric “status code”.
Sometimes the status code indicates that the server is unable to
fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a “redirection”
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can’t handle, urlopen
will raise an HTTPError. Typical errors include ‘404’ (page not
found), ‘403’ (request forbidden), and ‘401’ (authentication
required).
Even if HTTPError is raised, you may check the error code.
So sometimes, even if the URL is valid and available, it may raise HTTPError with code 403, 401, etc.
Sometimes valid URLs give 5xx because of temporary server errors.
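For instance, a sketch in the question's urllib2 style that distinguishes these cases by inspecting e.code (the labels are arbitrary):

try:
    urllib2.urlopen(line)
    print line, "SITE IS FUNCTIONAL"
except urllib2.HTTPError, e:
    if e.code in (401, 403):
        print line, "UP BUT RESTRICTED"       # valid URL, access denied
    elif 500 <= e.code < 600:
        print line, "TEMPORARY SERVER ERROR"  # may work on a retry
    else:
        print line, e.code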
I would suggest you use the requests library.
import requests

resp = requests.get('your url')
if not resp.ok:
    print resp.status_code
You have to pass the URL as a parameter to the urlopen function.
import urllib2

filename = open(argfile, 'r')
f = filename.readlines()
filename.close()

def urlcheck():
    for line in f:
        try:
            urllib2.urlopen(line)  # careful here
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)

urlcheck()
import urllib2

def check(url):
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'  # gets only headers without body (increases speed)
    request.add_header('Accept-Encoding', 'gzip, deflate, br')  # asks for compressed content (increases speed)
    try:
        response = urllib2.urlopen(request)
        return response.getcode() <= 400
    except Exception:
        return False

'''
Contents of "/tmp/urls.txt"
http://www.google.com
https://fb.com
http://not-valid
http://not-valid.nvd
not-valid
'''
filename = open('/tmp/urls.txt', 'r')
urls = filename.readlines()
filename.close()

for url in urls:
    print url + ' ' + str(check(url))
I would probably write it like this:
import urllib2

with open('urls.txt') as f:
    urls = [url.strip() for url in f.readlines()]

def urlcheck():
    for url in urls:
        try:
            urllib2.urlopen(url)
        except (ValueError, urllib2.URLError) as e:
            print('invalid url: {}'.format(url))

urlcheck()
some changes from the OP's original implementation:
use a context manager to open/close data file
strip newlines from URLs as they are read from file
use better variable names
switch to more modern exception handling style
also catch ValueError for malformed URLs
display a more useful error message
example output:
$ python urlcheck.py
invalid url: http://www.google.com/wertbh
invalid url: htp:/google.com
invalid url: google.com
invalid url: https://wwwbad-domain-zzzz.com
This pertains to urllib2 specifically, but custom exception handling more generally. How do I pass additional information to a calling function in another module via a raised exception? I'm assuming I would re-raise using a custom exception class, but I'm not sure of the technical details.
Rather than pollute the sample code with what I've tried and failed, I'll simply present it as a mostly blank slate. My end goal is for the last line in the sample to work.
# mymod.py
import urllib2

def openurl():
    req = urllib2.Request("http://duznotexist.com/")
    response = urllib2.urlopen(req)
# main.py
import urllib2
import mymod

try:
    mymod.openurl()
except urllib2.URLError as e:
    # how do I do this?
    print "Website (%s) could not be reached due to %s" % (e.url, e.reason)
You can add information to and then re-raise the exception.
# mymod.py
import urllib2

def openurl():
    req = urllib2.Request("http://duznotexist.com/")
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError as e:
        # add URL and reason to the exception object
        e.url = "http://duznotexist.com/"
        e.reason = "URL does not exist"
        raise e  # re-raise the exception, so the calling function can catch it
# main.py
import urllib2
import mymod

try:
    mymod.openurl()
except urllib2.URLError as e:
    print "Website (%s) could not be reached due to %s" % (e.url, e.reason)
I don't think re-raising the exception is an appropriate way to solve this problem.
As @Jonathan Vanasco said,
if you're opening a.com, and it 301 redirects to b.com, urlopen will automatically follow that, because an HTTPError with a redirect was raised. If b.com causes the URLError, the code above marks a.com as not existing.
My solution is to override redirect_request of urllib2.HTTPRedirectHandler:
import urllib2

class NewHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
                or code in (301, 302, 303) and m == "POST"):
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k, v) for k, v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type"))
            # reuse the req object
            # mind that req will be changed if redirection happens
            req.__init__(newurl,
                         headers=newheaders,
                         origin_req_host=req.get_origin_req_host(),
                         unverifiable=True)
            return req
        else:
            raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)

opener = urllib2.build_opener(NewHTTPRedirectHandler)
urllib2.install_opener(opener)
# mind that req will be changed if redirection happens
# req = urllib2.Request('http://127.0.0.1:5000')
req = urllib2.Request('http://www.google.com/')
try:
    response = urllib2.urlopen(req)
except urllib2.URLError as e:
    print 'error'
    print req.get_full_url()
else:
    print 'normal'
    print response.geturl()
Let's try redirecting the URL to an unknown URL:
import os
from flask import Flask, redirect

app = Flask(__name__)

@app.route('/')
def hello():
    # return 'hello world'
    return redirect("http://a.com", code=302)

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)
And the result is:
error
http://a.com/
normal
http://www.google.com/
Using Python, how can I check whether a website is up? From what I read, I need to send an "HTTP HEAD" request and check for status code "200 OK", but how do I do so?
Cheers
Related: How do you send a HEAD HTTP request in Python?
You could try to do this with getcode() from urllib:
import urllib.request
print(urllib.request.urlopen("https://www.stackoverflow.com").getcode())
200
For Python 2, use
print urllib.urlopen("http://www.stackoverflow.com").getcode()
200
I think the easiest way to do it is by using the Requests module.
import requests

def url_ok(url):
    r = requests.head(url)
    return r.status_code == 200
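One caveat worth noting: requests.head does not follow redirects by default, so a site whose front page 301-redirects will not return 200. A variant sketch that follows redirects and adds a timeout:

import requests

def url_ok(url):
    # follow redirects so a 301/302 front page still counts as "up"
    r = requests.head(url, allow_redirects=True, timeout=5)
    return r.status_code == 200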
You can use httplib
import httplib
conn = httplib.HTTPConnection("www.python.org")
conn.request("HEAD", "/")
r1 = conn.getresponse()
print r1.status, r1.reason
prints
200 OK
Of course, only if www.python.org is up.
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://stackoverflow.com")
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print('Website is working fine')
Works on Python 3
import httplib
import socket
import re

def is_website_online(host):
    """ This function checks to see if a host name has a DNS entry by checking
        for socket info. If the website gets something in return,
        we know it's available to DNS.
    """
    try:
        socket.gethostbyname(host)
    except socket.gaierror:
        return False
    else:
        return True

def is_page_available(host, path="/"):
    """ This function retrieves the status code of a website by requesting
        HEAD data from the host. This means that it only requests the headers.
        If the host cannot be reached or something else goes wrong, it returns
        False.
    """
    try:
        conn = httplib.HTTPConnection(host)
        conn.request("HEAD", path)
        if re.match("^[23]\d\d$", str(conn.getresponse().status)):
            return True
    except StandardError:
        return False
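Hypothetical usage of the two helpers above (the host name is only an example):

host = "www.python.org"
if is_website_online(host) and is_page_available(host):
    print "site is up"
else:
    print "site is down"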
The HTTPConnection object from the httplib module in the standard library will probably do the trick for you. BTW, if you start doing anything advanced with HTTP in Python, be sure to check out httplib2; it's a great library.
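For instance, a minimal httplib2 sketch of the same HEAD check (the timeout value and URL are only examples):

import httplib2

h = httplib2.Http(timeout=10)
# request() returns a (response, content) tuple; HEAD fetches headers only
resp, content = h.request("http://www.python.org/", "HEAD")
print(resp.status)  # 200 if the page is up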
If the server is down, on Python 2.7 x86 Windows urllib has no timeout and the program goes into a deadlock. So use urllib2:
import urllib2
import socket

def check_url(url, timeout=5):
    try:
        return urllib2.urlopen(url, timeout=timeout).getcode() == 200
    except urllib2.URLError as e:
        return False
    except socket.timeout as e:
        return False

print check_url("http://google.fr")    # True
print check_url("http://notexist.kc")  # False
I use requests for this; then it is easy and clean.
Instead of the print function you can define and call a new function (notify via email, etc.). The try-except block is essential, because if the host is unreachable it will raise a lot of exceptions, so you need to catch them all.
import requests

URL = "https://api.github.com"

try:
    response = requests.head(URL)
except Exception as e:
    print(f"NOT OK: {str(e)}")
else:
    if response.status_code == 200:
        print("OK")
    else:
        print(f"NOT OK: HTTP response code {response.status_code}")
You may use the requests library to find out whether the website is up, i.e. gives status code 200:
import requests
url = "https://www.google.com"
page = requests.get(url)
print (page.status_code)
>> 200
In my opinion, caisah's answer misses an important part of your question, namely dealing with the server being offline.
Still, using requests is my favorite option, albeit as such:
import requests

try:
    requests.get(url)
except requests.exceptions.ConnectionError:
    print(f"URL {url} not reachable")
If by up you simply mean "the server is serving", then you could use cURL, and if you get a response then it's up.
I can't give you specific advice because I'm not a python programmer, however here is a link to pycurl http://pycurl.sourceforge.net/.
Hi, these functions can do a speed and up/down test for your web page:
from urllib.request import urlopen
from socket import socket
import time

def tcp_test(server_info):
    cpos = server_info.find(':')
    try:
        sock = socket()
        sock.connect((server_info[:cpos], int(server_info[cpos+1:])))
        sock.close()
        return True
    except Exception as e:
        return False

def http_test(server_info):
    try:
        # TODO: we can use this data later to find sub-URL up/down results
        startTime = time.time()
        data = urlopen(server_info).read()
        endTime = time.time()
        speed = endTime - startTime
        return {'status': 'up', 'speed': str(speed)}
    except Exception as e:
        return {'status': 'down', 'speed': str(-1)}

def server_test(test_type, server_info):
    if test_type.lower() == 'tcp':
        return tcp_test(server_info)
    elif test_type.lower() == 'http':
        return http_test(server_info)
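Hypothetical usage of server_test (the host and port are only examples):

print(server_test('tcp', 'www.python.org:80'))       # True if a TCP connection succeeds
print(server_test('http', 'http://www.python.org'))  # e.g. {'status': 'up', 'speed': '0.42...'}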
Requests and httplib2 are great options:
# Using requests.
import requests

def url_ok(value):  # wrapped in a function so the returns are valid
    request = requests.get(value)
    if request.status_code == 200:
        return True
    return False

# Using httplib2.
import httplib2

def url_ok(value):  # alternative implementation with httplib2
    try:
        http = httplib2.Http()
        response = http.request(value, 'HEAD')
        if int(response[0]['status']) == 200:
            return True
    except Exception:
        pass
    return False
If using Ansible, you can use the fetch_url function:
from ansible.module_utils.basic import AnsibleModule
from ansible.module_utils.urls import fetch_url

module = AnsibleModule(
    dict(),
    supports_check_mode=True)

def url_up(module, url):  # hypothetical wrapper so the returns are valid
    try:
        response, info = fetch_url(module, url)
        if info['status'] == 200:
            return True
    except Exception:
        pass
    return False
my 2 cents
import urllib.request

def getResponseCode(url):
    conn = urllib.request.urlopen(url)
    return conn.getcode()

url = 'https://www.google.com'  # example URL
if getResponseCode(url) != 200:
    print('Wrong URL')
else:
    print('Good URL')
Here's my solution using PycURL and validators
import pycurl
import validators

def url_exists(url):
    """
    Check if the given URL really exists
    :param url: str
    :return: bool
    """
    if validators.url(url):
        c = pycurl.Curl()
        c.setopt(pycurl.NOBODY, True)
        c.setopt(pycurl.FOLLOWLOCATION, False)
        c.setopt(pycurl.CONNECTTIMEOUT, 10)
        c.setopt(pycurl.TIMEOUT, 10)
        c.setopt(pycurl.COOKIEFILE, '')
        c.setopt(pycurl.URL, url)
        try:
            c.perform()
            response_code = c.getinfo(pycurl.RESPONSE_CODE)
            c.close()
            return True if response_code < 400 else False
        except pycurl.error as err:
            errno, errstr = err.args
            raise OSError('An error occurred: {}'.format(errstr))
    else:
        raise ValueError('"{}" is not a valid url'.format(url))
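Hypothetical usage (the URL is only an example):

print(url_exists('https://stackoverflow.com'))  # True for a reachable URL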