Beautiful Soup Traceback on First Attempt - python

Hello, I'm new to Python and Beautiful Soup. I installed BS4 with pip and am attempting to do some web scraping. I have looked through a lot of help guides but haven't been able to get my BeautifulSoup() call to work when running the script from cmd. Here is my code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
This is the traceback I get with a URL input:
C:\Users\aaron\OneDrive\Desktop\Coding>python urllinks_get.py
Enter - http://www.dr-chuck.com/page1.htm
Traceback (most recent call last):
  File "C:\Users\aaron\OneDrive\Desktop\Coding\urllinks_get.py", line 21, in <module>
    soup = BeautifulSoup(html, 'html.parser')
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 215, in __init__
    self._feed()
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\builder\_htmlparser.py", line 164, in feed
    parser.feed(markup)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 110, in feed
    self.goahead(0)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 170, in goahead
    k = self.parse_starttag(i)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 344, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\builder\_htmlparser.py", line 62, in handle_starttag
    self.soup.handle_starttag(name, None, None, attr_dict)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 404, in handle_starttag
    self.currentTag, self._most_recent_element)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1001, in __getattr__
    return self.find(tag)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1238, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1259, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 516, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1560, in __init__
    self.text = self._normalize_search_value(text)
  File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1565, in _normalize_search_value
    if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
AttributeError: module 'collections' has no attribute 'Callable'
I'd really like to continue my online classes, so any help would be much appreciated!
Thanks!

Found my issue. I had installed beautifulsoup4 with pip, and I also had a bs4 folder in the same directory my program ran in. I didn't realize they would interfere with one another. As soon as I removed the bs4 folder from the directory, my program ran fine :)
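For anyone hitting the same thing: Python imports a bs4 folder sitting next to your script in preference to the pip-installed package. A quick diagnostic (a minimal sketch, nothing course-specific assumed) is to print where the module was actually loaded from:
import bs4

# A path inside your project folder instead of site-packages means a
# local copy is shadowing the pip-installed beautifulsoup4.
print(bs4.__file__)
print(bs4.__version__)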

Related

BeautifulSoup soup object creation consistent error

I am new to web scraping, and I'm having a few difficulties using BeautifulSoup that seem more related to the installation than to the code itself. I have installed bs4 and want to get data from webpages. I started with a simple exercise as follows:
import requests
import urllib
from BeautifulSoup import BeautifulSoup
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
which gets me the following error message
Traceback (most recent call last):
  File "<ipython-input-62-a9912850b0dc>", line 1, in <module>
    soup = BeautifulSoup(page.content, 'html.parser')
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
  File "/Users/../anaconda/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/Users/../anaconda/lib/python2.7/sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1463, in parse_declaration
    j = SGMLParser.parse_declaration(self, i)
  File "/Users/../anaconda/lib/python2.7/markupbase.py", line 109, in parse_declaration
    self.handle_decl(data)
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1448, in handle_decl
    self._toStringSubclass(data, Declaration)
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1381, in _toStringSubclass
    self.endData(subclass)
  File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1251, in endData
    (not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
If I remove 'html.parser' and use
soup = BeautifulSoup(page.content)
the code works, but, of course, it does not give me what I need.
Any clues as to how to solve this? I am on OS X El Capitan and use Spyder as my editor. I did reinstall bs4 a few times.
Thanks
You are using the old BeautifulSoup 3 package (that is what from BeautifulSoup import BeautifulSoup picks up). Uninstall it, install BeautifulSoup4 with pip install beautifulsoup4, and then adjust your code:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
s = BeautifulSoup(r.content, 'html.parser')
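From there the usual bs4 calls should work; for example (a quick sanity check, assuming the request succeeded):
print(s.title.get_text())    # the page's <title> text
print(len(s.find_all('a')))  # how many anchor tags were found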
Also try using page.text in place of page.content; page.content is the raw bytes of the response, while page.text is the decoded string:
soup = BeautifulSoup(page.text, 'html.parser')

BeautifulSoup timing out on instantiation?

I'm just doing some web scraping with BeautifulSoup and I'm running into a weird error. Code:
print "Running urllib2"
g = urllib2.urlopen(link + "about", timeout=5)
print "Finished urllib2"
about_soup = BeautifulSoup(g, 'lxml')
Here's the output:
Running urllib2
Finished urllib2
Error
Traceback (most recent call last):
  File "/Users/pspieker/Documents/projects/ThePyStrikesBack/tests/TestSpringerOpenScraper.py", line 10, in test_strip_chars
    for row in self.instance.get_entries():
  File "/Users/pspieker/Documents/projects/ThePyStrikesBack/src/JournalScrapers.py", line 304, in get_entries
    about_soup = BeautifulSoup(g, 'lxml')
  File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
    markup = markup.read()
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 588, in read
    return self._read_chunked(amt)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 648, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 703, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
timeout: timed out
I understand that the urllib2.urlopen call could be causing problems, but the exception occurs on the line instantiating BeautifulSoup. I did some googling but couldn't find anything about BeautifulSoup timeout issues.
Any ideas on what is happening?
It is the urllib2 part that is causing the timeout.
The reason it appears to fail on the BeautifulSoup instantiation line is that g, the file-like object returned by urlopen, is read by BeautifulSoup internally, so the network read (and hence the timeout) happens inside the constructor. This is the part of the stack trace that proves it:
  File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
    markup = markup.read()
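If you want the timeout to surface where you can handle it, do the read yourself and hand BeautifulSoup a string. A minimal sketch (Python 2, assuming link is defined as in the question):
import socket
import urllib2
from bs4 import BeautifulSoup

try:
    g = urllib2.urlopen(link + "about", timeout=5)
    html = g.read()  # the network read (and any timeout) now happens here
except (urllib2.URLError, socket.timeout) as e:
    html = None
    print "request failed:", e

if html is not None:
    about_soup = BeautifulSoup(html, 'lxml')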

RuntimeError: maximum recursion depth exceeded while trying to scrape data from a website [duplicate]

I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to pull names and URLs out of the links on an HTML page I downloaded. I have it working great up to this point.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    print link
but when I enter this next part
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
I get this error
Traceback (most recent call last):
  File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
    print names
  File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
    return self.shell.write(s, self.tags)
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
    getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)
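Applied to the loop in the question, that looks like:
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print unicode(names)  # plain unicode value, safe to print in IDLE
    print fullLink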

requests: TypeError: 'tuple' object is not callable in python 3.1.2

I'm making a web page scraper using the BeautifulSoup4 and requests libraries. I had some trouble getting BeautifulSoup working, but got some help and was able to fix that. Now I've run into a new problem and I'm not sure how to fix it. I'm using requests 2.2.1 and trying to run this program on Python 3.1.2, and when I do I get a traceback error.
Here is my code:
from bs4 import BeautifulSoup
import requests
url = input("Enter a URL (start with www): ")
link = "http://" + url
page = requests.get(link).content
soup = BeautifulSoup(page)
for url in soup.find_all('a'):
    print(url.get('href'))
    print()
and the error:
Enter a URL (start with www): www.google.com
Traceback (most recent call last):
  File "/Users/user/Desktop/project.py", line 8, in <module>
    page = requests.get(link).content
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 349, in request
    prep = self.prepare_request(req)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 287, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 287, in prepare
    self.prepare_url(url,params)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 321, in prepare_url
    url = str(url)
TypeError: 'tuple' object is not callable
I've done some looking, and when others have gotten this error (mostly in Django) there was a comma missing, but I'm not sure where a comma would go here. Any help will be appreciated.

What is the best way to handle a bad link given to BeautifulSoup?

I'm working on something that pulls in URLs from Delicious and then uses those URLs to discover associated feeds.
However, some of the bookmarks in Delicious are not HTML links and cause BS to barf. Basically, I want to throw away a link if BS fetches it and it does not look like HTML.
Right now, this is what I'm getting:
trillian:Documents jauderho$ ./d2o.py "green data center"
processing http://www.greenm3.com/
processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
Traceback (most recent call last):
  File "./d2o.py", line 53, in <module>
    get_feed_links(d_links)
  File "./d2o.py", line 43, in get_feed_links
    soup = BeautifulSoup(html)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u'</b />', at line 739, column 1
Update:
Jehiah's answer does the trick. For reference, here's some code to get the content type:
import urllib

def check_for_html(link):
    out = urllib.urlopen(link)
    return out.info().getheader('Content-Type')
I simply wrap my BeautifulSoup processing and look for the HTMLParser.HTMLParseError exception:
import HTMLParser, BeautifulSoup

try:
    soup = BeautifulSoup.BeautifulSoup(raw_html)
    for a in soup.findAll('a'):
        href = a['href']
        # ... process each link ...
except HTMLParser.HTMLParseError:
    print "failed to parse", url
But beyond that, you can check the Content-Type of the response when you crawl a page and make sure it's something like text/html or application/xhtml+xml before you even try to parse it. That should head off most errors.
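Putting the two together, a rough sketch (Python 2; the helper name is mine, not from either answer):
import urllib
import HTMLParser
import BeautifulSoup

def soup_or_none(link):
    # Skip responses that don't advertise an HTML-ish content type,
    # and treat parse failures as "not really HTML" as well.
    out = urllib.urlopen(link)
    ctype = (out.info().getheader('Content-Type') or '').split(';')[0].strip()
    if ctype not in ('text/html', 'application/xhtml+xml'):
        return None
    try:
        return BeautifulSoup.BeautifulSoup(out.read())
    except HTMLParser.HTMLParseError:
        print "failed to parse", link
        return None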
