Urllib2 <urlopen error [Errno -5] No address associated with hostname> - python

I have a Python program that uses several APIs, including two from Yahoo!. One of them, however, is giving me trouble: it always fails with <urlopen error [Errno -5] No address associated with hostname>. Here are the outputs of various terminal commands trying to get the data from "http://download.finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv":
My code:
def stocks():
    lcd.clear()
    lcd.message('RasPi Stocks\nUsing Yahoo!')
    time.sleep(2)
    val = 1
    for i in stockslst:
        # host = urllib2.socket.gethostbyname('finance.yahoo.com')
        req = urllib2.Request('http://download.finance.yahoo.com/d/quotes.csv?s=' + i + '&f=nsl1l2op&e=.csv')
        req.add_header('Host', 'finance.yahoo.com')
        mdata = urllib2.urlopen(req)
        csvdata = [row for row in csv.reader(mdata)]
        sname = csvdata[0][0]
        last1 = float(csvdata[0][2])
        last2 = float(csvdata[0][3])
        if last1 > last2:
            r = 0
        if last1 < last2:
            r = 1
        if last1 == last2:
            r = 126
        lcd.clear()
        lcd.message(sname + '\n' + chr(r) + str(last1))
        time.sleep(2)
Running my program:
pi@raspberrypi ~ $ sudo python lcddisplayrpi.py
Traceback (most recent call last):
File "lcddisplayrpi.py", line 132, in <module>
stocks()
File "lcddisplayrpi.py", line 113, in stocks
mdata = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 401, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 419, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1181, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -5] No address associated with hostname>
curl the problematic address:
curl "http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv"
<HEAD><TITLE>Redirect</TITLE></HEAD>
<BODY BGCOLOR="white" FGCOLOR="black">
<FONT FACE="Helvetica,Arial"><B>
"<em>http://download.finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv</em>".<p></B></FONT>
<!-- default "Redirect" response (301) -->
</BODY>
wget the problematic address:
wget "http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv"
--2013-10-06 09:01:30-- http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv
Resolving finance.yahoo.com (finance.yahoo.com)... failed: No address associated with hostname.
wget: unable to resolve host address `finance.yahoo.com'
I can get the CSV data in a web browser on the same computer. As hinted by the curl output, I tried changing the host to download.finance.yahoo.com, but still no luck with the Python program. I tried curl again:
curl "http://download.finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv"
"Google Inc.","GOOG",872.35,-,876.00,876.09
This is the data I want, so I don't know what is wrong with my Python program.
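A quick sanity check (a minimal sketch, separate from the program above) is whether Python can resolve the hostname at all; if this also fails, the problem is the Pi's DNS configuration rather than urllib2:

import socket

try:
    # If this raises gaierror, DNS resolution itself is broken on the machine.
    print socket.gethostbyname('download.finance.yahoo.com')
except socket.gaierror as e:
    print 'DNS lookup failed:', e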

Related

Why is the connection refused?

I am creating a web scraping script and divided it into four pieces. Separately they all work perfectly, but when I put them together I get the following error: urlopen error [Errno 111] Connection refused. I have looked at questions similar to mine and tried to catch the error with try-except, but even that doesn't work. My all-in-one code is:
from selenium import webdriver
import re
import urllib2

site = ""

def phone():
    global site
    site = "https://www." + site
    if "spokeo" in site:
        browser = webdriver.Firefox()
        browser.get(site)
        content = browser.page_source
        browser.quit()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
        if m_obj:
            print m_obj.group(0)
    elif "addresses" in site:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)
    else:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\d{3}-\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)

def pipl():
    global site
    url = "https://pipl.com/search/?q=tom+jones&l=Phoenix%2C+AZ%2C+US&sloc=US|AZ|Phoenix&in=6"
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    r_list = [#re.compile("spokeo.com/[^\s]+"),
              re.compile("addresses.com/[^\s]+"),
              re.compile("10digits.us/[^\s]+")]
    for r in r_list:
        match = re.findall(r, data)
        for site in match:
            site = site[:-6]
            print site
            phone()

pipl()
Here is my traceback:
Traceback (most recent call last):
File "/home/lazarov/.spyder2/.temp.py", line 48, in <module>
pipl()
File "/home/lazarov/.spyder2/.temp.py", line 46, in pipl
phone()
File "/home/lazarov/.spyder2/.temp.py", line 25, in phone
usock = urllib2.urlopen(site)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
After manually debugging the code I found that the error comes from the function phone(), so I tried to run just that piece:
import re
import urllib2

url = 'http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\d{3}-\d{4})", data)
if m_obj:
    print m_obj.group(0)
And it worked, which, I believe, shows that it's not that the firewall is actively denying the connection, or that the service on the other side is not started or is overloaded. Any help would be appreciated.
Usually the devil is in the detail.
According to your traceback...
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
and your source code...
site = "https://www." + site
...I suppose that in your code you are trying to access https://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7, whereas in your test you are connecting to http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7.
Try replacing the https with http (at least for www.10digits.us): probably the website you are trying to scrape does not respond on port 443 but only on port 80 (you can check this in your browser as well).
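A minimal way to apply this (a sketch; the make_url helper is hypothetical and assumes only 10digits.us needs plain HTTP, with the other sites staying on HTTPS):

def make_url(site):
    # Assumption: only 10digits.us is HTTP-only; spokeo and addresses keep HTTPS.
    scheme = "http" if "10digits" in site else "https"
    return scheme + "://www." + site

print make_url("10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7")
# -> http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7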

Python Web Scraping - urlopen error [Errno -2] Name or service not known

I am trying to extract data from the Civic Commons Apps link for my project. I am able to obtain the links of the pages that I need, but when I try to open them I get "urlopen error [Errno -2] Name or service not known".
The web scraping Python code:
from bs4 import BeautifulSoup
from urlparse import urlparse, parse_qs
import re
import urllib2
import pdb

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/apps"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))

list_of_next_pages = []
for categorized_apps_url in list_of_links:
    categorized_apps_page = urllib2.urlopen(categorized_apps_url)
    categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())
    last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
    if last_page_tag:
        last_page_url = base_url + last_page_tag.get('href')
        index_value = last_page_url.find("page=") + 5
        base_url_for_next_page = last_page_url[:index_value]
        for pageno in xrange(0, int(parse_qs(urlparse(last_page_url).query)['page'][0]) + 1):
            list_of_next_pages.append(base_url_for_next_page + str(pageno))
    else:
        list_of_next_pages.append(categorized_apps_url)
I get the following error:
urllib2.urlopen(categorized_apps_url)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
Should I take care of anything specific when I perform urlopen? I don't see a problem with the HTTP links that I get.
[edit]
On a second run I got the following error:
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
The same code runs fine on my friend's Mac, but fails on my Ubuntu 12.04.
I also tried running the code on ScraperWiki and it finished successfully, but a few URLs were missing (compared to the Mac). Is there any reason for this behaviour?
The code works on my Mac and on your friend's Mac. It runs fine from a virtual machine instance of Ubuntu 12.04 Server. There is obviously something in your particular environment, your OS (Ubuntu Desktop?) or network, that is causing it to crap out. For example, my home router's default setting throttles the number of calls to the same domain within x seconds, which could cause this kind of issue if I didn't turn it off. It could be a number of things.
At this stage I would suggest refactoring your code to catch the URLError and set aside problematic URLs for a retry. Also log/print errors if they still fail after several retries, and maybe even throw in some code to time your calls between errors. That is better than having your script just fail outright, and you'll get feedback as to whether it is just particular URLs causing the problem or a timing issue (i.e. does it fail after x urlopen calls, or does it fail after x urlopen calls within some number of seconds or microseconds). If it's a timing issue, a simple time.sleep(1) inserted into your loops might do the trick.
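A sketch of what that might look like (the helper name and the retry/delay values are arbitrary choices, not from the original code):

import time
import urllib2

def fetch_with_retries(url, retries=3, delay=1.0):
    # Try a URL a few times, sleeping between attempts; return None on failure.
    for attempt in range(retries):
        try:
            page = urllib2.urlopen(url)
            try:
                return page.read()
            finally:
                page.close()
        except urllib2.URLError as e:
            print "attempt %d failed for %s: %s" % (attempt + 1, url, e)
            time.sleep(delay)
    return None

failed = []
for url in ["http://civiccommons.org/apps"]:
    html = fetch_with_retries(url)
    if html is None:
        failed.append(url)  # set aside for a later retry or for logging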
SyncMaster,
I ran into the same issue recently after jumping onto an old Ubuntu box I hadn't played with in a while. This issue is actually caused by the DNS settings on your machine. I would highly recommend that you check your DNS settings (edit /etc/resolv.conf and add nameserver 8.8.8.8) and then try again; you should meet with success.

Python - Trying to write a script that tests an http interface

I'm still getting to grips with the Python basics.
My current requirement is to develop a Python script that tests the availability of the web-based interfaces of multiple devices (e.g. where you would enter "http://192.168.0.2:9876" in a web browser); this does not have to be over-complicated.
I'm trying to convert from a simple bash curl command; originally I had something like the following in a bash script:
date=`date +"%Y-%m-%d_%H-%M-%S-%N"`
curl -s --connect-timeout 1 ${ip} -o /dev/null
test=$?
if [[ $test == 0 ]] ;then
    echo "${date}:webping - Web Page Up for ${ip}" >> $log
else
    echo "${date}:webping - Web Page Down for ${ip}" >> $log
fi
which worked for the original concept, but I was looking to have something similar in Python. The output can vary, within reason... does anyone have any pointers on where to start?
P.S. I have looked at some other questions on here, but they appear to give false positives: where the interface has been "taken down" (i.e. I stopped the service), they still report a status code of 200.
EDIT: Below is the code I have tried.
for url in ["http://www.google.co.uk", "http://192.168.0.2:8000"]:
    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print "none"
CORRECTION: I get the following results...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 391, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 409, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1173, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1148, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 10061] No connection could be made because the target machine actively refused it>
I would prefer not to see the Python error output.
Thanks in advance
Take a look at http://docs.python-requests.org/en/latest/index.html for a Python module providing the facilities you need with a nice friendly API. In this instance you'd do something along these lines:
import requests
...
try:
    r = requests.get(url, timeout=1)
    ok = (r.status_code // 100) == 2
except:
    ok = False
# now use the value of ok
though I don't know whether the particular test I've used there (success means a 2xx response) is exactly what you want.
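For completeness, here is a sketch of how the original bash loop might translate using requests (the URL list, log file name, and message format are assumptions, not taken from the question):

import datetime
import requests

urls = ["http://192.168.0.2:9876", "http://192.168.0.3:9876"]  # assumed device list
logfile = "webping.log"                                        # assumed log path

with open(logfile, "a") as log:
    for url in urls:
        stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f")
        try:
            r = requests.get(url, timeout=1)
            ok = (r.status_code // 100) == 2   # success means a 2xx response
        except requests.exceptions.RequestException:
            ok = False
        state = "Up" if ok else "Down"
        log.write("%s:webping - Web Page %s for %s\n" % (stamp, state, url))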

Problem with urllib2 loading mobile site

I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with Python.
At the moment I'm using the following code:
req = urllib2.urlopen(URL,None,2.5)
page = req.read()
print page
Here is the traceback for timeout error:
Traceback (most recent call last):
File "user/src/finnkinoParser.py", line 26, in <module>
main()
File "user/src/finnkinoParser.py", line 13, in main
getNowPlayingMovies()
File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
req = urllib2.urlopen(baseURL,None,2.5)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the URL with my browser it works fine, so could someone tell me what makes that site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users. With "regular" sites urllib2 works fine. Are there any other kinds of sites for which the basic urlopen(URL) doesn't work?
Thanks for the help.
The following snippet works fine.
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server checks several request properties. After some testing, here is the conclusion:
The HTTP protocol must be HTTP/1.1.
If the request headers include a Connection header, its value should be keep-alive.
The request headers must include a User-Agent header, whatever its value.
In urllib2, however, the Connection header in HTTPHandler is set to close by default (line 1127 in urllib2.py). You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.

Repeated host lookups failing in urllib2

I have code which issues many HTTP GET requests using Python's urllib2, in several threads, writing the responses into files (one per thread).
During execution, it looks like many of the host lookups fail (causing a name or service unknown error, see appended error log for an example).
Is this due to a flaky DNS service? Is it bad practice to rely on DNS caching if the host name isn't changing? I.e. should the result of a single lookup be passed into urlopen?
Exception in thread Thread-16:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/home/da/local/bin/ThreadedDownloader.py", line 61, in run
page = urllib2.urlopen(url) # get the page
File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.6/urllib2.py", line 391, in open
response = self._open(req, data)
File "/usr/lib/python2.6/urllib2.py", line 409, in _open
'_open', req)
File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.6/urllib2.py", line 1145, in do_open
raise URLError(err)
URLError: <urlopen error [Errno -2] Name or service not known>
UPDATE: my (extremely simple) code:
class AsyncGet(threading.Thread):

    def __init__(self, outDir, baseUrl, item, method, numPages, numRows, semaphore):
        threading.Thread.__init__(self)
        self.outDir = outDir
        self.baseUrl = baseUrl
        self.method = method
        self.numPages = numPages
        self.numRows = numRows
        self.item = item
        self.semaphore = semaphore

    def run(self):
        with self.semaphore: # 'with' is awesome.
            with open(os.path.join(self.outDir, self.item + ".xml"), 'a') as f:
                for i in xrange(1, self.numPages + 1):
                    url = self.baseUrl + \
                          "method=" + self.method + \
                          "&item=" + self.item + \
                          "&page=" + str(i) + \
                          "&rows=" + str(self.numRows) + \
                          "&prettyXML"
                    page = urllib2.urlopen(url)
                    f.write(page.read())
                    page.close() # Must remember to close!
The semaphore is a BoundedSemaphore to constrain the total number of running threads.
This is not a Python problem; on Linux systems, make sure nscd (Name Service Cache Daemon) is actually running.
UPDATE:
And looking at your code, you are never calling page.close(), hence leaking sockets.
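If closing responses reliably is the concern, one option (a sketch, not from the original answer) is to wrap each response in contextlib.closing, so the socket is released even if the read or the file write raises:

import contextlib
import urllib2

url = "http://example.com/"  # placeholder URL

# Guarantees page.close() runs even if read() or a later write fails.
with contextlib.closing(urllib2.urlopen(url)) as page:
    data = page.read()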
