HTML Link parsing using BeautifulSoup - python

Here is the Python code I'm using to extract specific HTML from the page links I'm passing as a parameter. I'm using BeautifulSoup. This code works fine sometimes, and sometimes it gets stuck!
import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):
    # iterate over the URLs and capture the content
    sock = urllib.urlopen(url + str(i))
    html = sock.read()
    sock.close()
    rawHtml += html
    print i
Here I'm printing the loop variable to find out where it gets stuck. It shows me that it gets stuck at a random point in the loop.
soup = BeautifulSoup(rawHtml, 'html.parser')
t = ''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)
f.close()
What could be the possible issue? Is it a problem with the socket configuration, or something else?
This is the error I got. I checked these links - python-gaierror-errno-11004, ioerror-errno-socket-error-errno-11004-getaddrinfo-failed - for a solution, but I didn't find them very helpful.
d:\python>python ext.py
Traceback (most recent call last):
File "ext.py", line 8, in <module>
sock = urllib.urlopen(url+ str(i))
File "d:\python\lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "d:\python\lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "d:\python\lib\urllib.py", line 350, in open_http
h.endheaders(data)
File "d:\python\lib\httplib.py", line 1049, in endheaders
self._send_output(message_body)
File "d:\python\lib\httplib.py", line 893, in _send_output
self.send(msg)
File "d:\python\lib\httplib.py", line 855, in send
self.connect()
File "d:\python\lib\httplib.py", line 832, in connect
self.timeout, self.source_address)
File "d:\python\lib\socket.py", line 557, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
It runs perfectly fine on my personal laptop, but it gives this error on my office desktop. Also, my version of Python is 2.7. I hope this information helps.

Finally, guys... it worked! The same script worked when I checked it on other PCs too. So the problem was probably the firewall or proxy settings of my office desktop, which were blocking this website.
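In hindsight the fix was outside the code, but if the hangs had been transient DNS or network failures, a small retry wrapper with a bounded number of attempts would keep the loop moving instead of dying on one bad request. A Python 3-flavored sketch, with the download callable injected (in the original script it would be something like lambda u: urllib.urlopen(u).read()):

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on a socket-level error, wait and try again.

    socket.gaierror and timeouts are OSError subclasses in Python 3,
    so one except clause covers them all.
    """
    last_err = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as err:
            last_err = err
            time.sleep(delay)
    raise last_err  # every attempt failed; let the caller decide
```

With this in place, a single flaky page costs a few retries instead of hanging or killing the whole 48-page loop.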

Related

BeautifulSoup timing out on instantiation?

I'm just doing some web scraping with BeautifulSoup and I'm running into a weird error. Code:
import urllib2
from bs4 import BeautifulSoup

print "Running urllib2"
g = urllib2.urlopen(link + "about", timeout=5)
print "Finished urllib2"
about_soup = BeautifulSoup(g, 'lxml')
Here's the output:
Running urllib2
Finished urllib2
Error
Traceback (most recent call last):
File "/Users/pspieker/Documents/projects/ThePyStrikesBack/tests/TestSpringerOpenScraper.py", line 10, in test_strip_chars
for row in self.instance.get_entries():
File "/Users/pspieker/Documents/projects/ThePyStrikesBack/src/JournalScrapers.py", line 304, in get_entries
about_soup = BeautifulSoup(g, 'lxml')
File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 355, in read
data = self._sock.recv(rbufsize)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 588, in read
return self._read_chunked(amt)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 648, in _read_chunked
value.append(self._safe_read(amt))
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 703, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 384, in read
data = self._sock.recv(left)
timeout: timed out
I understand that urllib2.urlopen could be causing problems, but the exception occurs on the line instantiating BeautifulSoup. I did some googling but couldn't find anything about BeautifulSoup timeout issues.
Any ideas on what is happening?
It's the urllib2 part that is causing the timeout.
The reason it appears to fail on the BeautifulSoup instantiation line is that g, the file-like object, is read by BeautifulSoup internally. This is the part of the stack trace that proves it:
File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()
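One way to see this for yourself is to call read() before constructing the soup, so the socket error is raised on a line you control rather than inside bs4. A sketch, where SlowResponse is a hypothetical stand-in for the urllib2 response object:

```python
import socket

class SlowResponse(object):
    """Hypothetical stand-in for a urllib2 response whose body times out."""
    def read(self):
        raise socket.timeout("timed out")

def fetch_markup(response):
    # Read the body ourselves: any timeout now surfaces here rather
    # than inside BeautifulSoup's internal markup.read() call.
    try:
        return response.read()
    except socket.timeout:
        return None  # caller can retry, log, or skip this page

markup = fetch_markup(SlowResponse())
# only construct BeautifulSoup(markup, 'lxml') when markup is not None
```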

Timing out when I use a proxy

I am trying to add proxy support to my web crawler. Without proxies, my code connects to the website with no problem, but when I try to add proxies it suddenly won't connect! It doesn't look like anybody has posted about this problem for python-requests, so I'm hoping you all can help me!
Background info: I'm using a Mac and using Anaconda's Python 3.4 inside of a virtual environment.
Here is my code, which works without proxies:
import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/' + str(pmid) + '/citedby/?page=' + str(start)
        req = requests.get(url)  # this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey)  # url = base + key
                url = "http://www.ncbi.nlm.nih.gov" + str(urlkey)
                url_list.append(url)
        start += 1
    return titles_list, url_list, authors_list
Based on other posts I'm looking at, I should just be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with this line of code, the terminal gives me a timeout error:
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
And then I get a few of these printed in the terminal with different traces but the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
Why would adding proxies to my code suddenly cause a timeout? I tried it with several random URLs and the same thing happened, so it seems to be a problem with the proxies rather than with my code. However, I'm at the point where I MUST use proxies now, so I need to get to the root of this and fix it. I've also tried several different IP addresses for the proxy from a VPN that I use, so I know the IP addresses are valid.
I appreciate your help so much! Thank you!
It looks like you'll need to use an HTTP or HTTPS proxy that actually responds to requests.
The 10.10.1.10:3128 in your code appears to come from the examples in the requests documentation.
Taking a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (which may not be the best source), your proxyDict should look like this: {'http': 'http://209.242.141.60:8080'}
Testing this on the command line seems to work fine:
>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>
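Since free proxies die quickly, it can also help to probe a list of candidates and keep the first one that answers. A sketch with the probe injected so it runs without a network; in real use, check could be something like lambda p: requests.get('http://example.com', proxies=p, timeout=2).raise_for_status():

```python
def first_working_proxy(hosts, check):
    """Return a requests-style proxies dict for the first host in
    `hosts` that passes `check`, or None if none of them do.

    `check` is any callable that takes the proxies dict and raises
    on failure (connect timeout, refused connection, bad status...).
    """
    for host in hosts:
        proxies = {'http': 'http://%s' % host}
        try:
            check(proxies)
            return proxies
        except Exception:
            continue  # dead proxy; try the next candidate
    return None
```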

Why is the connection refused?

I am creating a web scraping script and have divided it into four pieces. Separately they all work perfectly, however when I put them all together I get the following error: urlopen error [Errno 111] Connection refused. I have looked at questions similar to mine and have tried to catch the error with try-except, but even that doesn't work. My all-in-one code is:
from selenium import webdriver
import re
import urllib2

site = ""

def phone():
    global site
    site = "https://www." + site
    if "spokeo" in site:
        browser = webdriver.Firefox()
        browser.get(site)
        content = browser.page_source
        browser.quit()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
        if m_obj:
            print m_obj.group(0)
    elif "addresses" in site:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)
    else:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\d{3}-\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)

def pipl():
    global site
    url = "https://pipl.com/search/?q=tom+jones&l=Phoenix%2C+AZ%2C+US&sloc=US|AZ|Phoenix&in=6"
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    r_list = [#re.compile("spokeo.com/[^\s]+"),
              re.compile("addresses.com/[^\s]+"),
              re.compile("10digits.us/[^\s]+")]
    for r in r_list:
        match = re.findall(r, data)
        for site in match:
            site = site[:-6]
            print site
            phone()

pipl()
pipl()
Here is my traceback:
Traceback (most recent call last):
File "/home/lazarov/.spyder2/.temp.py", line 48, in <module>
pipl()
File "/home/lazarov/.spyder2/.temp.py", line 46, in pipl
phone()
File "/home/lazarov/.spyder2/.temp.py", line 25, in phone
usock = urllib2.urlopen(site)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
After manually debugging the code I found that the error comes from the function phone(), so I tried running just that piece:
import re
import urllib2

url = 'http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\d{3}-\d{4})", data)
if m_obj:
    print m_obj.group(0)
And it worked. Which, I believe, shows it's not that the firewall is actively denying the connection, or that the respective service is not started on the other site, or that it is overloaded. Any help would be appreciated.
Usually the devil is in the details.
According to your traceback...
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
and your source code...
site = "https://www." + site
...I suppose that your code is trying to access https://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7, whereas in your test you are connecting to http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7.
Try replacing the https with http (at least for www.10digits.us): the website you are trying to scrape probably does not respond on port 443, only on port 80 (you can check this even with your browser).
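If you still want phone() to build its URLs automatically, one option is a small helper that downgrades the scheme for hosts known to serve only plain HTTP. This is a hypothetical helper, not part of the original script, and HTTP_ONLY_HOSTS is an assumption based on the diagnosis above:

```python
# hosts that answer on port 80 but not 443 (assumed, per the diagnosis)
HTTP_ONLY_HOSTS = ('10digits.us',)

def fix_scheme(url):
    """Rewrite https:// to http:// for hosts that are HTTP-only."""
    if url.startswith('https://') and any(h in url for h in HTTP_ONLY_HOSTS):
        return 'http://' + url[len('https://'):]
    return url
```

phone() would then call urllib2.urlopen(fix_scheme(site)) instead of hardcoding one scheme for every site.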

Python help - parse xml

Hey, I'm trying to parse the Yahoo Weather XML feed using Python, and this is the code:
#!/usr/bin/env python
import subprocess
import urllib
from xml.dom import minidom
WEATHER_URL = 'http://weather.yahooapis.com/forecastrss?w=55872649&u=c'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'
dom = minidom.parse(urllib.urlopen(WEATHER_URL))
ycondition = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]
CURRENT_OUTDOOR_TEMP = ycondition.getAttribute('temp')
print(CURRENT_OUTDOOR_TEMP)
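For what it's worth, the namespace lookup itself can be exercised offline against a made-up fragment shaped like the feed; the sample string below is an illustrative assumption, not real feed output:

```python
from xml.dom import minidom

WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

# made-up fragment in the shape the script expects
sample = ('<rss xmlns:yweather="%s"><channel>'
          '<yweather:condition temp="23" text="Sunny"/>'
          '</channel></rss>') % WEATHER_NS

dom = minidom.parseString(sample)
cond = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]
print(cond.getAttribute('temp'))  # prints 23
```

This separates "is my parsing right?" from "can I reach the server?", which is exactly the question the traceback below raises.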
Why am I getting this error on IIS7?
Traceback (most recent call last):
File "C:\inetpub\wwwroot\22.py", line 16, in <module>
dom = minidom.parse(urllib.urlopen(WEATHER_URL))
File "C:\Python27\lib\urllib.py", line 86, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 207, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 344, in open_http
h.endheaders(data)
File "C:\Python27\lib\httplib.py", line 954, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 814, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 776, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 757, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 571, in create_connection
raise err
IOError: [Errno socket error] [Errno 10061] No connection could be made because the target machine actively refused it
Please help...
thanks
It looks like your firewall is blocking access to weather.yahooapis.com.
Check your firewall logs.
Allow access to the domain weather.yahooapis.com.
There could be several reasons:
You're behind a proxy, so you should provide code that first connects to the proxy and then calls urlopen().
Some firewall on your PC or gateway forbids connecting to that site.
Your antivirus software is suspicious of your program. Rare, but possible.
The website detected that you're a bot, not a browser, and closed the connection (it can tell, for example, from the User-Agent header).
Make sure the server is not SSL-only.
Hope this helps you diagnose the problem.
import json
import codecs
import urllib, cStringIO
import string
from Tkinter import *
weather_api =urllib.urlopen('https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20in%20(select%20woeid%20from%20geo.places(1)%20where%20text%3D%22nagercoil%2C%20IND%22)&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys')
weather_json = json.load(weather_api)
data_location = weather_json["query"]['results']['channel']['location']
place = data_location['city']
data_item = weather_json["query"]['results']['channel']['item']
fore_cast = data_item['forecast'][0]['text']
temp = data_item['condition']['temp']
date = data_item['condition']['date']
data_image= weather_json["query"]['results']['channel']['image']
root = Tk()
f = Frame(width=800, height=546, bg='green', colormap='new')
w0 = Label(root, fg='#FF6699', text=fore_cast, font=("Helvetica",15))
w = Label(root, fg='blue', text=temp + u'\N{DEGREE SIGN}'+'F', font=("Helvetica",56))
w1 = Label(root, fg='#3399CC', text=place, font=("Helvetica",15))
w0.pack()
w.pack()
w1.pack()
root.mainloop()
A simple weather forecast application using Python 2.7.
Select an API from https://developer.yahoo.com/weather/, then simply decode the JSON object; it contains a large amount of data. You can then easily display the data using Tkinter for the final output.
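Since the fiddly part is digging the fields out of the nested JSON, that step can be isolated into a small function and tested against a hand-built sample; the sample dict in the test mirrors the YQL response shape used above:

```python
def extract_weather(payload):
    """Pull out the fields the Tkinter window displays.

    `payload` is the already-decoded YQL response as a dict,
    i.e. the result of json.load(weather_api) above.
    """
    channel = payload["query"]["results"]["channel"]
    item = channel["item"]
    return {
        "place": channel["location"]["city"],
        "forecast": item["forecast"][0]["text"],
        "temp": item["condition"]["temp"],
    }
```

Keeping the extraction separate from the urlopen call also means a network failure like the one above can't be confused with a parsing bug.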

Python Scraper - Socket Error breaks script if target is 404'd

I encountered an error while building a web scraper to compile data and output it into XLS format. When testing against a list of domains I wish to scrape, the program falters when it receives a socket error. I'm hoping to find an 'if' statement that would skip parsing a broken website and continue through my while loop. Any ideas?
workingList = xlrd.open_workbook(listSelection)
workingSheet = workingList.sheet_by_index(0)
destinationList = xlwt.Workbook()
destinationSheet = destinationList.add_sheet('Gathered')
startX = 1
startY = 0
while startX != 21:
    workingCell = workingSheet.cell(startX, startY).value
    print ''
    print ''
    print ''
    print workingCell
    # Setup
    preSite = 'http://www.' + workingCell
    theSite = urlopen(preSite).read()
    currentSite = BeautifulSoup(theSite)
    destinationSheet.write(startX, 0, workingCell)
And here's the error:
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
homeMenu()
File "C:\Python27\farming.py", line 31, in homeMenu
openList()
File "C:\Python27\farming.py", line 79, in openList
openList()
File "C:\Python27\farming.py", line 83, in openList
openList()
File "C:\Python27\farming.py", line 86, in openList
homeMenu()
File "C:\Python27\farming.py", line 34, in homeMenu
startScrape()
File "C:\Python27\farming.py", line 112, in startScrape
theSite = urlopen(preSite).read()
File "C:\Python27\lib\urllib.py", line 84, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 205, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 342, in open_http
h.endheaders(data)
File "C:\Python27\lib\httplib.py", line 951, in endheaders
self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 811, in _send_output
self.send(msg)
File "C:\Python27\lib\httplib.py", line 773, in send
self.connect()
File "C:\Python27\lib\httplib.py", line 754, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
Ummm that looks like the error I get when my internet connection is down. HTTP 404 errors are what you get when you do have a connection but the URL that you specify can't be found.
There's no if statement to handle exceptions; you need to "catch" them using the try/except construct.
Update: Here's a demonstration:
import urllib

def getconn(url):
    try:
        conn = urllib.urlopen(url)
        return conn, None
    except IOError as e:
        return None, e

urls = """
    qwerty
    http://www.foo.bar.net
    http://www.google.com
    http://www.google.com/nonesuch
    """

for url in urls.split():
    print
    print url
    conn, exc = getconn(url)
    if conn:
        print "connected; HTTP response is", conn.getcode()
    else:
        print "failed"
        print exc.__class__.__name__
        print str(exc)
        print exc.args
Output:
qwerty
failed
IOError
[Errno 2] The system cannot find the file specified: 'qwerty'
(2, 'The system cannot find the file specified')
http://www.foo.bar.net
failed
IOError
[Errno socket error] [Errno 11004] getaddrinfo failed
('socket error', gaierror(11004, 'getaddrinfo failed'))
http://www.google.com
connected; HTTP response is 200
http://www.google.com/nonesuch
connected; HTTP response is 404
Note that so far we have only opened the connection. Now what you need to do is check the HTTP response code and decide whether there is anything worth retrieving using conn.read().
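The same skip-on-failure pattern can be folded into the scraping loop itself. A Python 3-flavored sketch with the downloader injected so it runs without a network; in the real script, fetch would be something like lambda u: urlopen(u).read():

```python
def scrape_all(domains, fetch):
    """Try fetch('http://www.' + d) for every domain; collect the
    successes and record the failures instead of crashing."""
    results, failed = {}, []
    for d in domains:
        url = 'http://www.' + d
        try:
            results[d] = fetch(url)
        except OSError as err:  # py3: socket/URL errors are OSError subclasses
            failed.append((d, err))
    return results, failed
```

One bad spreadsheet row then costs you an entry in failed rather than a traceback that kills the whole run.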
