I am creating a web scraping script and divided it into four pieces. Separately they all work perfectly, but when I put them all together I get the following error: urlopen error [Errno 111] Connection refused. I have looked at questions similar to mine and have tried to catch the error with try-except, but even that doesn't work. My all-in-one code is:
from selenium import webdriver
import re
import urllib2
site = ""
def phone():
    global site
    site = "https://www." + site
    if "spokeo" in site:
        browser = webdriver.Firefox()
        browser.get(site)
        content = browser.page_source
        browser.quit()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
        if m_obj:
            print m_obj.group(0)
    elif "addresses" in site:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)
    else:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\d{3}-\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)

def pipl():
    global site
    url = "https://pipl.com/search/?q=tom+jones&l=Phoenix%2C+AZ%2C+US&sloc=US|AZ|Phoenix&in=6"
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    r_list = [#re.compile("spokeo.com/[^\s]+"),
              re.compile("addresses.com/[^\s]+"),
              re.compile("10digits.us/[^\s]+")]
    for r in r_list:
        match = re.findall(r, data)
        for site in match:
            site = site[:-6]
            print site
            phone()

pipl()
Here is my traceback:
Traceback (most recent call last):
File "/home/lazarov/.spyder2/.temp.py", line 48, in <module>
pipl()
File "/home/lazarov/.spyder2/.temp.py", line 46, in pipl
phone()
File "/home/lazarov/.spyder2/.temp.py", line 25, in phone
usock = urllib2.urlopen(site)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
After manually debugging the code I found that the error comes from the function phone(), so I tried to run just that piece:
import re
import urllib2
url = 'http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\d{3}-\d{4})", data)
if m_obj:
    print m_obj.group(0)
And it worked. Which, I believe, shows that it's not a case of the firewall actively denying the connection, or of the respective service not being started or being overloaded on the other side. Any help would be appreciated.
Usually the devil is in the detail.
According to your traceback...
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
and your source code...
site = "https://www." + site
...I may suppose that in your code you are trying to access https://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7 whereas in your test you are connecting to http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7.
Try replacing the https with http (at least for www.10digits.us): probably the website you are trying to scrape does not respond on port 443 but only on port 80 (you can check this even with your browser).
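One way to apply that suggestion, as a minimal sketch rather than a drop-in fix (build_url is a hypothetical helper, and the URL is the one from your working test above), assuming only www.10digits.us needs plain http:

import urllib2

def build_url(site):
    # hypothetical helper: use plain http for 10digits.us, keep https for the rest
    if "10digits" in site:
        return "http://www." + site
    return "https://www." + site

try:
    usock = urllib2.urlopen(build_url("10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7"))
    data = usock.read()
    usock.close()
except urllib2.URLError as e:
    # Errno 111 (connection refused) would still show up here if the port really is closed
    print "request failed:", e.reason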
Related
I have a Python program that uses several APIs, including two from Yahoo!. One of them, however, I am having trouble with: it always gives me the error <urlopen error [Errno -5] No address associated with hostname>. Here are the outputs from various terminal commands trying to get the data from "http://download.finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv":
My code:
def stocks():
    lcd.clear()
    lcd.message('RasPi Stocks\nUsing Yahoo!')
    time.sleep(2)
    val = 1
    for i in stockslst:
        # host = urllib2.socket.gethostbyname('finance.yahoo.com')
        req = urllib2.Request('http://download.finance.yahoo.com/d/quotes.csv?s=' + i + '&f=nsl1l2op&e=.csv')
        req.add_header('Host:', 'finance.yahoo.com')
        mdata = urllib2.urlopen(req)
        csvdata = [row for row in csv.reader(mdata)]
        sname = csvdata[0][0]
        last1 = float(csvdata[0][2])
        last2 = float(csvdata[0][3])
        if last1 > last2:
            r = 0
        if last1 < last2:
            r = 1
        if last1 == last2:
            r = 126
        lcd.clear()
        lcd.message(sname + '\n' + chr(r) + str(last1))
        time.sleep(2)
Running my program:
pi@raspberrypi ~ $ sudo python lcddisplayrpi.py
Traceback (most recent call last):
  File "lcddisplayrpi.py", line 132, in <module>
    stocks()
  File "lcddisplayrpi.py", line 113, in stocks
    mdata = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 401, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 419, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1181, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno -5] No address associated with hostname>
curl the problematic address:
curl "http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv"
<HEAD><TITLE>Redirect</TITLE></HEAD>
<BODY BGCOLOR="white" FGCOLOR="black">
<FONT FACE="Helvetica,Arial"><B>
"<em>http://download.finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv</em>".<p></B></FONT>
<!-- default "Redirect" response (301) -->
</BODY>
wget the problematic address:
wget "http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv"
--2013-10-06 09:01:30-- http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv
Resolving finance.yahoo.com (finance.yahoo.com)... failed: No address associated with hostname.
wget: unable to resolve host address `finance.yahoo.com'
I can get the CSV data in a web browser on the same computer. As hinted in the curl output, I tried changing it to download.finance.yahoo.com, but had no luck with the Python program. I tried curl again:
curl "http://download.finance.yahoo.com/d/quotes.csv?s=GOOG&f=nsl1l2op&e=.csv"
"Google Inc.","GOOG",872.35,-,876.00,876.09
This is the data I want. I don't know what is wrong with my Python program.
I am trying to extract data from the Civic Commons Apps link for my project. I am able to obtain the links of the page that I need, but when I try to open the links I get "urlopen error [Errno -2] Name or service not known".
The web-scraping Python code:
from bs4 import BeautifulSoup
from urlparse import urlparse, parse_qs
import re
import urllib2
import pdb
base_url = "http://civiccommons.org"
url = "http://civiccommons.org/apps"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
    string_temp_link = base_url+link_tag.get('href')
    list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))
list_of_next_pages = []

for categorized_apps_url in list_of_links:
    categorized_apps_page = urllib2.urlopen(categorized_apps_url)
    categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())
    last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
    if last_page_tag:
        last_page_url = base_url+last_page_tag.get('href')
        index_value = last_page_url.find("page=") + 5
        base_url_for_next_page = last_page_url[:index_value]
        for pageno in xrange(0, int(parse_qs(urlparse(last_page_url).query)['page'][0]) + 1):
            list_of_next_pages.append(base_url_for_next_page+str(pageno))
    else:
        list_of_next_pages.append(categorized_apps_url)
I get the following error:
    urllib2.urlopen(categorized_apps_url)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
Should I take care of anything specific when I perform urlopen? I don't see a problem with the HTTP links that I get.
[edit]
On a second run I got the following error:
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
The same code runs fine on my friend's Mac, but fails on my Ubuntu 12.04.
Also, I tried running the code in ScraperWiki and it finished successfully, but a few URLs were missing (compared to the Mac). Is there any reason for this behavior?
The code works on my Mac and on your friend's Mac. It runs fine from a virtual machine instance of Ubuntu 12.04 server. There is obviously something in your particular environment - your OS (Ubuntu Desktop?) or network - that is causing it to crap out. For example, my home router's default settings throttle the number of calls to the same domain within x seconds - that could cause this kind of issue if I didn't turn it off. It could be a number of things.
At this stage I would suggest refactoring your code to catch the URLError and set aside problematic URLs for a retry. Also log/print errors if they fail after several retries. Maybe even throw in some code to time your calls between errors. It is better than having your script just fail outright, and you'll get feedback as to whether it is just particular URLs causing the problem or a timing issue (i.e., does it fail after x urlopen calls, or after x urlopen calls in x amount of micro/seconds). If it's a timing issue, a simple time.sleep(1) inserted into your loops might do the trick.
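A rough sketch of that retry idea (fetch_with_retries is a hypothetical helper, and the retry count and delay are arbitrary placeholders, not values from your code):

import time
import urllib2

def fetch_with_retries(url, retries=3, delay=1):
    # hypothetical helper: retry a urlopen call, pausing between attempts
    for attempt in xrange(retries):
        try:
            page = urllib2.urlopen(url)
            try:
                return page.read()
            finally:
                page.close()
        except urllib2.URLError as e:
            print "attempt %d failed for %s: %s" % (attempt + 1, url, e.reason)
            time.sleep(delay)
    return None  # caller can set this URL aside and log it for a later pass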
SyncMaster,
I ran into the same issue recently after jumping onto an old Ubuntu box I hadn't played with in a while. This issue is actually caused by the DNS settings on your machine. I would highly recommend that you check your DNS settings (/etc/resolv.conf; add nameserver 8.8.8.8) and then try again - you should meet with success.
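For reference, the suggested entry would go into /etc/resolv.conf, for example:

# /etc/resolv.conf
nameserver 8.8.8.8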
I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with Python.
At the moment I'm using the following code:
req = urllib2.urlopen(URL,None,2.5)
page = req.read()
print page
Here is the traceback for timeout error:
Traceback (most recent call last):
  File "user/src/finnkinoParser.py", line 26, in <module>
    main()
  File "user/src/finnkinoParser.py", line 13, in main
    getNowPlayingMovies()
  File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
    req = urllib2.urlopen(baseURL,None,2.5)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the URL with my browser it works fine, so could someone tell me what makes that site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users. With "regular" sites urllib2 works fine. Are there any other kinds of sites for which the basic urlopen(URL) doesn't work?
Thanks for the help.
The following snippet works fine.
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server checks several request variables. After testing a few times, here are the conclusions:
The HTTP protocol must be HTTP/1.1.
If the request headers have a Connection property, its value should be keep-alive.
The request headers must have a User-Agent property, whatever its value.
In urllib2, however, the Connection property in HTTPHandler is set to close by default (line 1127 in urllib2.py). You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.
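If you want to stay with urllib2 anyway and at least satisfy the User-Agent requirement (the request may still fail if the server also insists on keep-alive, which urllib2 won't send), a minimal attempt would look like this:

import urllib2

req = urllib2.Request("http://m.finnkino.fi/events/now_showing",
                      headers={"User-Agent": "Mozilla/5.0"})
try:
    response = urllib2.urlopen(req, None, 2.5)
    print response.read()
    response.close()
except urllib2.URLError as e:
    print "still failing:", e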
I have code which issues many HTTP GET requests using Python's urllib2, in several threads, writing the responses into files (one per thread).
During execution, it looks like many of the host lookups fail (causing a name or service unknown error, see appended error log for an example).
Is this due to a flaky DNS service? Is it bad practice to rely on DNS caching if the host name isn't changing? I.e., should a single lookup's result be passed into urlopen?
Exception in thread Thread-16:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/home/da/local/bin/ThreadedDownloader.py", line 61, in run
    page = urllib2.urlopen(url) # get the page
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1145, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno -2] Name or service not known>
UPDATE: my (extremely simple) code:
class AsyncGet(threading.Thread):

    def __init__(self,outDir,baseUrl,item,method,numPages,numRows,semaphore):
        threading.Thread.__init__(self)
        self.outDir = outDir
        self.baseUrl = baseUrl
        self.method = method
        self.numPages = numPages
        self.numRows = numRows
        self.item = item
        self.semaphore = semaphore

    def run(self):
        with self.semaphore: # 'with' is awesome.
            with open( os.path.join(self.outDir,self.item+".xml"), 'a' ) as f:
                for i in xrange(1,self.numPages+1):
                    url = self.baseUrl + \
                          "method=" + self.method + \
                          "&item=" + self.item + \
                          "&page=" + str(i) + \
                          "&rows=" + str(self.numRows) + \
                          "&prettyXML"
                    page = urllib2.urlopen(url)
                    f.write(page.read())
                    page.close() # Must remember to close!
The semaphore is a BoundedSemaphore to constrain the total number of running threads.
This is not a Python problem; on Linux systems, make sure nscd (Name Service Cache Daemon) is actually running.
UPDATE:
And looking at your code: you are never calling page.close(), hence leaking sockets.
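On the close() point, one way to guarantee the response gets closed even if read() or write() raises is contextlib.closing; a small sketch (the URL and filename below are placeholders, not the actual ThreadedDownloader values):

from contextlib import closing
import urllib2

url = "http://example.com/"  # placeholder for the query URL built in run()

with open("out.xml", "a") as f:
    with closing(urllib2.urlopen(url)) as page:
        f.write(page.read())  # page.close() runs even if read()/write() raises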
I installed Python 2.6.2 earlier on a Windows XP machine and ran the following code:
import urllib2
import urllib
page = urllib2.Request('http://www.python.org/fish.html')
urllib2.urlopen( page )
I get the following error:
Traceback (most recent call last):
  File "C:\Python26\test3.py", line 6, in <module>
    urllib2.urlopen( page )
  File "C:\Python26\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python26\lib\urllib2.py", line 383, in open
    response = self._open(req, data)
  File "C:\Python26\lib\urllib2.py", line 401, in _open
    '_open', req)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 1130, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python26\lib\urllib2.py", line 1105, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 11001] getaddrinfo failed>
import urllib2
response = urllib2.urlopen('http://www.python.org/fish.html')
html = response.read()
You're doing it wrong.
Have a look in the urllib2 source, at the line specified by the traceback:
File "C:\Python26\lib\urllib2.py", line 1105, in do_open
raise URLError(err)
There you'll see the following fragment:
try:
    h.request(req.get_method(), req.get_selector(), req.data, headers)
    r = h.getresponse()
except socket.error, err: # XXX what error?
    raise URLError(err)
So it looks like the source is a socket error, not an HTTP protocol-related error. Possible reasons: you are not online, you are behind a restrictive firewall, your DNS is down, ...
All this aside from the fact, as mcandre pointed out, that your code is wrong.
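One quick way to narrow down which of those it is: catch the exception and look at its reason attribute, which carries the underlying socket error. A hedged sketch, not part of the original code:

import urllib2

try:
    urllib2.urlopen('http://www.python.org/')
except urllib2.HTTPError as e:
    print 'the server answered, but with an error status:', e.code
except urllib2.URLError as e:
    # for socket-level failures, e.reason holds the wrapped socket error,
    # e.g. [Errno 11001] getaddrinfo failed
    print 'could not reach the server at all:', e.reason
else:
    print 'connection fine; the earlier failure was environmental'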
Name resolution error.
getaddrinfo is used to resolve the hostname (python.org) in your request. If it fails, it means that the name could not be resolved because:
It does not exist, or the records are outdated (unlikely; python.org is a well-established domain name)
Your DNS server is down (unlikely; if you can browse other sites, you should be able to fetch that page through Python)
A firewall is blocking Python or your script from accessing the Internet (most likely; Windows Firewall sometimes does not ask you if you want to allow an application)
You live on an ancient voodoo cemetery. (unlikely; if that is the case, you should move out)
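To test that resolution step in isolation (the same getaddrinfo call urllib2 ends up making), you could run something like this:

import socket

try:
    print socket.getaddrinfo('www.python.org', 80)
except socket.gaierror as e:
    # this is the same failure urllib2 reports as "getaddrinfo failed"
    print 'resolution failed:', e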
Windows Vista, Python 2.6.2
It's a 404 page, right?
>>> import urllib2
>>> import urllib
>>>
>>> page = urllib2.Request('http://www.python.org/fish.html')
>>> urllib2.urlopen( page )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python26\lib\urllib2.py", line 389, in open
    response = meth(req, response)
  File "C:\Python26\lib\urllib2.py", line 502, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python26\lib\urllib2.py", line 427, in error
    return self._call_chain(*args)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 510, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
>>>
DJ
First, I see no reason to import urllib; I've only ever seen urllib2 used to replace urllib entirely and I know of no functionality that's useful from urllib and yet is missing from urllib2.
Next, I notice that http://www.python.org/fish.html gives a 404 error to me. (That doesn't explain the backtrace/exception you're seeing; I get urllib2.HTTPError: HTTP Error 404: Not Found.)
Normally, if you just want to do a default fetch of a web page (without adding special HTTP headers, doing any sort of POST, etc.), then the following suffices:
req = urllib2.urlopen('http://www.python.org/')
html = req.read()
# and req.close() if you want to be pedantic