I want to get the URL status for websites with the code below. For one website (webscraper.io), I get an error. My script is:
import httplib
url = "http://webscraper.io/"
if 'http' in url:
    url = url.replace('http://', '').strip()
conn = httplib.HTTPConnection(url)
conn.request("GET",'')
r1 = conn.getresponse()
print 'r1.Status code=', r1.status
I got the below errors:
Traceback (most recent call last):
File "TestSatusline.py", line 23, in <module>
conn.request("GET",'')
File "/usr/lib/python2.7/httplib.py", line 1017, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 1051, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 1013, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 864, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 826, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 807, in connect
self.timeout, self.source_address)
File "/usr/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known
Does anybody have any idea?
Thanks.
After
if 'http' in url:
    url = url.replace('http://', '').strip()
in your code, url is webscraper.io/, but it should be webscraper.io (no trailing slash).
Use urlparse to extract the host instead:
import httplib
import urlparse
url = "http://webscraper.io/"
o = urlparse.urlparse(url)
conn = httplib.HTTPConnection(o.netloc)
conn.request("GET",'')
r1 = conn.getresponse()
print 'r1.Status code=', r1.status
Output:
r1.Status code= 200
You could also take a look at requests: http://docs.python-requests.org/en/master/
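For readers on Python 3, where httplib and urlparse became http.client and urllib.parse, the same idea can be sketched as below. This is a minimal sketch, not the answerer's code: the helper names split_target and url_status and the 10-second default timeout are assumptions made for illustration.

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlsplit


def split_target(url):
    """Return (scheme, host, port, request_target) for an absolute URL."""
    parts = urlsplit(url)
    # hostname excludes the port and the path; it is None for input
    # such as "webscraper.io/" that carries no scheme.
    if not parts.hostname:
        raise ValueError("not an absolute URL: %r" % url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.scheme, parts.hostname, parts.port, target


def url_status(url, timeout=10):
    """Send a GET request and return the HTTP status code (hypothetical helper)."""
    scheme, host, port, target = split_target(url)
    conn_cls = HTTPSConnection if scheme == "https" else HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", target)
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling url_status("http://webscraper.io/") would then pass only the bare host name to the connection object, which is the point of the answer above.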
I'm trying to check 22,800+ URLs from a 2012 database to find out which ones are still valid. I'm using urllib in Python 3.8 in PyCharm. It makes it through the first 47 URLs, which I read in from a text file, and then it crashes when a host can't be found.
Here's the error output:
Traceback (most recent call last):
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 1350, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1010, in _send_output
self.send(msg)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 950, in send
self.connect()
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 921, in connect
self.sock = self._create_connection(
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\socket.py", line 787, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11002] getaddrinfo failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/rmcape/PycharmProjects/first/venv/validateURLs.py", line 19, in
resp=urllib.request.urlopen(req)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 1379, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\rmcape\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 1353, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11002] getaddrinfo failed>
How can I detect the DNS lookup failure and recover from it and continue on to the next URL in the file? Is there some other library that I should be using? I've googled about everything I can think of.
Thanks for any help.
Here's the code:
#!/bin/python
#
#validateURLs.py
import urllib
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import responses
import socket
f = open("updatedURLs.txt", "r")
site=f.readline()
siteCount=1
errorCount=0
while site:
    site=site.strip()
    req = urllib.request.Request(site)
    try:
        resp=urllib.request.urlopen(req)
        respo=str(resp.getcode())
        result = "("+str(siteCount)+") "+respo+" ==> "+site
        print(result)
        #print(siteCount, site, resp.getcode())
    except urllib.error.HTTPError as e:
        errorCount=errorCount+1
        result="("+str(siteCount)+") "+str(e.code)+" ==> "+site
        print(result)
        print("errorCount = "+str(errorCount))
    site=f.readline()
    siteCount=siteCount+1
print(errorCount)
print("Done")
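Will this work for you? It catches Exception instead of only HTTPError, so the URLError raised by a failed DNS lookup is handled too: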
#!/bin/python
#
#validateURLs.py
import urllib
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import responses
import socket
f = open("updatedURLs.txt", "r")
site=f.readline()
siteCount=1
errorCount=0
while site:
    site=site.strip()
    req = urllib.request.Request(site)
    try:
        resp=urllib.request.urlopen(req)
        respo=str(resp.getcode())
        result = "("+str(siteCount)+") "+respo+" ==> "+site
        print(result)
        #print(siteCount, site, resp.getcode())
    except Exception as e:
        errorCount=errorCount+1
        result="("+str(siteCount)+") "+str(e)+" ==> "+site
        print(result)
        print("errorCount = "+str(errorCount))
    # Advance to the next URL whether or not the request succeeded;
    # reading only on success would retry a failing URL forever.
    site=f.readline()
    siteCount=siteCount+1
print(errorCount)
print("Done")
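If you want to tell a DNS failure apart from other errors rather than catching everything, one option is to inspect URLError.reason, which holds the underlying socket.gaierror when name resolution fails. The following is a minimal sketch under that approach; the names check_url and check_all, the opener parameter, and the 10-second timeout are assumptions introduced here, not part of the question's script.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def check_url(url, opener=urlopen, timeout=10):
    """Return a short status string for url instead of raising."""
    try:
        with opener(url, timeout=timeout) as resp:
            return str(resp.getcode())
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        return str(e.code)
    except URLError as e:
        # No HTTP response at all; reason is the underlying exception.
        if isinstance(e.reason, socket.gaierror):
            return "DNS lookup failed"
        return "unreachable: %s" % e.reason
    except (OSError, ValueError) as e:
        # Timeouts or resets while reading, or a malformed URL.
        return "error: %s" % e


def check_all(lines, opener=urlopen):
    """Yield (line_number, url, status) for each non-blank line."""
    for number, line in enumerate(lines, start=1):
        url = line.strip()
        if url:
            yield number, url, check_url(url, opener=opener)
```

A loop such as `for n, url, status in check_all(open("updatedURLs.txt")): print(n, status, "==>", url)` would then continue past hosts that no longer resolve. HTTPError has to be caught before URLError because it is a subclass of it.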
UPDATE: I managed to make the request with urllib2, but I'm still wondering what is happening here.
I would like to make an HTTPS request with Python.
This works fine with the requests module, but I don't want to use external dependencies, so I'd like to use the standard library.
httplib
When I follow this example I don't get a response. I get a timeout instead. I'm out of ideas as to what would cause this.
Code:
import requests
print requests.get('https://python.org')
from httplib import HTTPSConnection
conn = HTTPSConnection('www.python.org')
conn.request('GET', '/index.html')
print conn.getresponse()
Output:
<Response [200]>
Traceback (most recent call last):
File "test.py", line 6, in <module>
conn.request('GET', '/index.html')
File "C:\Python27\lib\httplib.py", line 1069, in request
self._send_request(method, url, body, headers)
File "C:\Python27\lib\httplib.py", line 1109, in _send_request
self.endheaders(body)
File "C:\Python27\lib\httplib.py", line 1065, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 892, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 854, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 1282, in connect
HTTPConnection.connect(self)
File "C:\Python27\lib\httplib.py", line 831, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 575, in create_connection
raise err
socket.error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
urllib
This fails for a different (but possibly related) reason. Code:
import urllib
print urllib.urlopen("https://python.org")
Output:
Traceback (most recent call last):
File "test.py", line 10, in <module>
print urllib.urlopen("https://python.org")
File "C:\Python27\lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 215, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 445, in open_https
h.endheaders(data)
File "C:\Python27\lib\httplib.py", line 1065, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 892, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 854, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 1290, in connect
server_hostname=server_hostname)
File "C:\Python27\lib\ssl.py", line 369, in wrap_socket
_context=self)
File "C:\Python27\lib\ssl.py", line 599, in __init__
self.do_handshake()
File "C:\Python27\lib\ssl.py", line 828, in do_handshake
self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:727)
What is requests doing that makes it succeed where both of these libraries fail?
requests.get without a timeout parameter means no timeout at all.
httplib.HTTPSConnection accepts a timeout parameter in Python 2.6 and newer, according to the httplib docs. If your problem is caused by a timeout, setting a high enough timeout should help. Please try replacing:
conn = HTTPSConnection('www.python.org')
with:
conn = HTTPSConnection('www.python.org', timeout=300)
which gives each blocking socket operation, including the connection attempt, up to 300 seconds (5 minutes).
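Whether a longer timeout actually resolves the error in the question is not established here, but making the timeout explicit and handling it is useful either way. Below is a minimal Python 3 sketch using http.client, the successor of httplib; the helper name https_status, the connection_cls parameter, and the 10-second default are assumptions for illustration.

```python
import socket
from http.client import HTTPSConnection


def https_status(host, path="/", timeout=10.0, connection_cls=HTTPSConnection):
    """Return the HTTP status code, or None if the attempt timed out."""
    conn = connection_cls(host, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (socket.timeout, TimeoutError):
        # socket.timeout is raised when the timeout set above expires;
        # TimeoutError covers an OS-level connection timeout.
        return None
    finally:
        conn.close()
```

For example, https_status("www.python.org", "/index.html", timeout=300) mirrors the suggested change while returning None instead of a traceback when no response arrives in time.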
I am trying to connect to Tor through Python, but it doesn't work. The code is:
def tor_connection():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050, True)
    socket.socket = socks.socksocket
def main():
    tor_connection()
    print('Connected to tor')
    con = httplib.HTTPConnection('myip.dnsomatic.com/')
    con.request('GET', '/')
    response = con.getresponse()
    print(response.read())
main()
But it gives me the following error message:
Traceback (most recent call last):
File "C:/Users/anon/PycharmProjects/Scraper/tor.py", line 198, in <module>
main()
File "C:/Users/anon/PycharmProjects/Scraper/tor.py", line 194, in main
con.request('GET', '/')
File "C:\Python27\lib\httplib.py", line 1001, in request
self._send_request(method, url, body, headers)
File "C:\Python27\lib\httplib.py", line 1035, in _send_request
self.endheaders(body)
File "C:\Python27\lib\httplib.py", line 997, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 850, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 812, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 793, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno 11001] getaddrinfo failed
I am just a beginner. Could someone help me out, please? I tried it on another laptop, but I get the same error message.
This is not a SOCKS problem. You need to specify the hostname without the trailing /:
>>> # with /
>>> httplib.HTTPConnection('myip.dnsomatic.com/').request('GET', '/')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/httplib.py", line 973, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 829, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 791, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 772, in connect
self.timeout, self.source_address)
File "/usr/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known
>>> # without /
>>> httplib.HTTPConnection('myip.dnsomatic.com').request('GET', '/')
>>>
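The reason the slash matters can be seen without touching the network: the connection object stores whatever string it is given as the host name and later hands it to getaddrinfo unchanged. The sketch below uses Python 3's http.client, the renamed httplib; it assumes nothing beyond the standard library, and the variable names are illustrative.

```python
from http.client import HTTPConnection

# Constructing the object does not open a socket, so this runs offline.
with_slash = HTTPConnection("myip.dnsomatic.com/")
without_slash = HTTPConnection("myip.dnsomatic.com")

# The trailing slash is kept as part of the host name, and no DNS
# record exists for a name that ends in "/".
print(repr(with_slash.host))
print(repr(without_slash.host))
```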
I have a script that fetches the HTTP headers of many pages on the Internet using httplib in Python.
My problem is that on one specific domain (and probably others), httplib raises an exception, and I don't understand why.
>>> import httplib
>>> http = httplib.HTTPConnection('iswtc.la')
>>> http.request('GET', '/a')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/httplib.py", line 914, in request
self._send_request(method, url, body, headers)
File "/usr/lib64/python2.6/httplib.py", line 951, in _send_request
self.endheaders()
File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
self._send_output()
File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
self.send(msg)
File "/usr/lib64/python2.6/httplib.py", line 739, in send
self.connect()
File "/usr/lib64/python2.6/httplib.py", line 720, in connect
self.timeout)
File "/usr/lib64/python2.6/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known
What is different about this specific domain, and how can I handle this?
PS: It's not really my code, because this works fine:
>>> http = httplib.HTTPConnection('bit.ly')
>>> http.request('GET', '/a')
bit.ly exists, whereas iswtc.la doesn't:
$ nslookup bit.ly
Non-authoritative answer:
Name: bit.ly
Address: 69.58.188.39
Name: bit.ly
Address: 69.58.188.40
$ nslookup iswtc.la
** server can't find iswtc.la: NXDOMAIN
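As for handling it in a script that visits many domains, one approach is to treat socket.gaierror as "this host does not exist" and move on. The following is a minimal Python 3 sketch of such a pre-check; the function name resolves and the resolver parameter are assumptions introduced for illustration.

```python
import socket


def resolves(host, port=80, resolver=socket.getaddrinfo):
    """Return True if host resolves to at least one address."""
    try:
        return bool(resolver(host, port, 0, socket.SOCK_STREAM))
    except (socket.gaierror, UnicodeError):
        # gaierror: the name could not be resolved (for example NXDOMAIN).
        # UnicodeError: the name is not a valid host name at all.
        return False
```

With this helper, a loop could skip a domain when resolves(domain) is False. Alternatively, the same except clause can be wrapped directly around the request call, which avoids resolving each name twice.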
I'm using gdata to map YouTube URLs to video titles, using the following code:
import gdata.youtube.service as youtube
import re
import queue
import urlparse
ytservice = youtube.YouTubeService()
ytservice.ssl = True
ytservice.developer_key = '' # snip
class youtube(mediaplugin):
    def __init__(self, parsed_url):
        self.url = parsed_url
        self.video_id = urlparse.parse_qs(parsed_url.query)['v'][0]
        self.ytdata = ytservice.GetYouTubeVideoEntry(self.video_id)
        print self.ytdata
I get the following socket exception when calling service.GetYouTubeVideoEntry():
File "/Users/haldean/Documents/qpi/qpi/media.py", line 21, in __init__
self.ytdata = ytservice.GetYouTubeVideoEntry(self.video_id)
File "/Users/haldean/Documents/qpi/lib/python2.7/site-packages/gdata/youtube/service.py", line 210, in GetYouTubeVideoEntry
return self.Get(uri, converter=gdata.youtube.YouTubeVideoEntryFromString)
File "/Users/haldean/Documents/qpi/lib/python2.7/site-packages/gdata/service.py", line 1069, in Get
headers=extra_headers)
File "/Users/haldean/Documents/qpi/lib/python2.7/site-packages/atom/__init__.py", line 93, in optional_warn_function
return f(*args, **kwargs)
File "/Users/haldean/Documents/qpi/lib/python2.7/site-packages/atom/service.py", line 186, in request
data=data, headers=all_headers)
File "/Users/haldean/Documents/qpi/lib/python2.7/site-packages/atom/http_interface.py", line 148, in perform_request
return http_client.request(operation, url, data=data, headers=headers)
File "/Users/haldean/Documents/qpi/lib/python2.7/site-packages/atom/http.py", line 163, in request
connection.endheaders()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 937, in endheaders
self._send_output(message_body)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 797, in _send_output
self.send(msg)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 759, in send
self.connect()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1140, in connect
self.timeout, self.source_address)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 8] nodename nor servname provided, or not known
I'm at a loss as to how to even begin debugging this. Any ideas appreciated. Thanks!
Edit:
In response to a question asked in comments, video_id is qh-mwjF-OMo and parsed_url is:
ParseResult(scheme=u'http', netloc=u'www.youtube.com', path=u'/watch', params='', query=u'v=qh-mwjF-OMo&feature=g-user-u', fragment='')
My mistake was that the video_id should be passed as a keyword parameter, like so:
self.ytdata = ytservice.GetYouTubeVideoEntry(video_id=self.video_id)
It seems that the socket layer is the only place where gdata raises an exception here: it builds a URL blindly from the arguments it is given, and it fails only when fetching that URL fails.
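The mechanism can be illustrated with a toy function. This sketch assumes, as the fix above implies, that the first positional parameter of the call is something other than the video id (here a URI); get_video_entry and the example.invalid address are hypothetical stand-ins, not gdata's API.

```python
def get_video_entry(uri=None, video_id=None):
    """Hypothetical stand-in: return the URI that would be fetched."""
    if uri is None and video_id is not None:
        uri = "https://example.invalid/videos/" + video_id
    return uri


# Passed positionally, the id lands in the uri parameter and is used as-is.
positional = get_video_entry("qh-mwjF-OMo")

# Passed by keyword, the id is turned into a proper URI first.
keyword = get_video_entry(video_id="qh-mwjF-OMo")

print(positional)
print(keyword)
```

In the positional case the "URI" has no valid host, so the first error anyone sees is the name lookup failing deep inside the socket layer.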