BeautifulSoup not working - python

I am trying to import the content of my blog using BeautifulSoup, with the code given below:
import urllib2
from BeautifulSoup import BeautifulSoup
response=urllib2.urlopen('http://www.bugsandbrains.blogspot.com')
html=response.read()
soup=BeautifulSoup(html)
Everything worked fine two or three times; after that it started throwing HTMLParseError.
I find it highly unlikely that the structure of the page changed within a few minutes. What else might be causing this problem?
I am enclosing the traceback as well.
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 317, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParseError: bad end tag: u"</scr' + 'ipt>", at line 1152, column 16

I just tried your code on Windows with:
Python: 2.6 (same as yours)
BeautifulSoup: 3.0.8.1 (latest)
I can't reproduce this. Are you using the latest code from the 3.0 series, which is meant for Python 2.6, rather than the 3.1 series, which is for Python 3 [0]? Sorry, but I can't think of any other clues right now.
[0] http://www.crummy.com/software/BeautifulSoup/#Download

I have tried your code, and it works. My environment: ActivePython 2.6.6.15, BeautifulSoup 3.0.8.1. I printed the soup variable and it contains the content of "Boredom Induced Post". When I opened http://www.bugsandbrains.blogspot.com in browsers, they showed a Wave Sandbox login page. No clue about what is wrong :(
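The traceback above ends in parse_endtag with the text </scr' + 'ipt>, which suggests that the 2.6 HTMLParser tried to read a piece of JavaScript string concatenation as a real end tag. Below is a minimal sketch, assuming Python 3 and only the standard library, that feeds a similar script block to the current html.parser and logs which tags it reports. The sample markup and the names TagLogger and logged_events are made up for illustration; this is not the blog's actual page.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every start and end tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Made-up markup containing the same split end tag seen in the traceback.
SAMPLE = """<html><body>
<script type="text/javascript">
document.write('<scr' + 'ipt src="x.js"></scr' + 'ipt>');
</script>
<p>post body</p>
</body></html>"""


def logged_events(markup):
    """Parse markup and return the list of (kind, tag) events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in logged_events(SAMPLE):
        print(event)
```

If the script body is treated as opaque text, only the real script tags show up in the log. A page that serves such a script only on some requests (for example, a rotating widget) would be one possible reason the error appears intermittently, though the thread does not confirm that.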

Related

Unexpected error when using feedparser.py

I have had great success parsing RSS feeds from the National Hurricane Center using the feedparser module:
import feedparser
feedparser.parse('https://www.nhc.noaa.gov/gis-at.xml') #Works Fine
feedparser.parse('https://www.nhc.noaa.gov/gis-ep.xml') #Works Fine
However, when I try to read the superficially similar feed from the Central Pacific Hurricane Center, I generate a KeyError:
feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml') #Doesn't work
Is this a bug with feedparser? Is the CPHC's feed malformed? Is there an option that I've forgotten to specify? It seems the trouble is that there isn't a key named 'where', but I don't know why this isn't a problem for the NHC feeds. The stack is reproduced below:
>>> import feedparser
>>> feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 3956, in parse
saxparser.parse(source)
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 111, in parse
xmlreader.IncrementalParser.parse(self, source)
File ".../anaconda3/lib/python3.6/xml/sax/xmlreader.py", line 125, in parse
self.feed(buffer)
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 217, in feed
self._parser.Parse(data, isFinal)
File "/tmp/build/80754af9/python_1516124163501/work/Modules/pyexpat.c", line 414, in StartElement
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 370, in start_element_ns
AttributesNSImpl(newattrs, qnames))
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 2031, in startElementNS
self.unknown_starttag(localname, list(attrsD.items()))
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 666, in unknown_starttag
return method(attrsD)
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 1500, in _start_gml_point
self._parse_srs_attrs(attrsD)
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 1496, in _parse_srs_attrs
context['where']['srsName'] = srsName
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 356, in __getitem__
return dict.__getitem__(self, key)
KeyError: 'where'
I know this is an old question, but I ran into this issue myself and fixing it became my first open-source contribution :)
Is this a bug with feedparser?
Yes, it was.
Is the CPHC's feed malformed?
Also yes, or at least it doesn't follow the GeoRSS GML model to the letter. If you check the GML Point description you will see the following structure:
<georss:where>
<gml:Point>
<gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
</georss:where>
but the feed data is structured this way:
<gml:Point>
<gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
So that's why the KeyError: 'where' occurs: the where tag is absent.
This was fixed in feedparser's 6.0.9 hotfix (see https://github.com/kurtmckee/feedparser/pull/306)
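To check a feed for this pattern yourself, here is a minimal sketch that uses only the standard library to find gml:Point elements that are not wrapped in georss:where. It is not how feedparser fixed the bug. The function name unwrapped_points and the two sample documents are made up for illustration, and the namespace URIs are the commonly used GeoRSS and GML ones, which may differ from what a particular feed declares.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the feed you are inspecting.
GEORSS_NS = "http://www.georss.org/georss"
GML_NS = "http://www.opengis.net/gml"

WRAPPED = """<item xmlns:georss="http://www.georss.org/georss"
      xmlns:gml="http://www.opengis.net/gml">
  <georss:where>
    <gml:Point><gml:pos>45.256 -71.92</gml:pos></gml:Point>
  </georss:where>
</item>"""

BARE = """<item xmlns:gml="http://www.opengis.net/gml">
  <gml:Point><gml:pos>45.256 -71.92</gml:pos></gml:Point>
</item>"""


def unwrapped_points(xml_text):
    """Return gml:Point elements whose parent is not georss:where."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent links, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    point_tag = "{%s}Point" % GML_NS
    where_tag = "{%s}where" % GEORSS_NS
    bare = []
    for point in root.iter(point_tag):
        parent = parent_of.get(point)
        if parent is None or parent.tag != where_tag:
            bare.append(point)
    return bare


if __name__ == "__main__":
    print(len(unwrapped_points(WRAPPED)), len(unwrapped_points(BARE)))
```

A non-empty result means the feed has the bare structure shown above, which is the shape that triggered the KeyError in older feedparser versions.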

Python - BeautifulSoup error while scraping

UPDATE: Using lxml instead of html.parser helped solve the problem, as Freddier suggested in the answer below!
I am trying to webscrape some information off of this website: https://www.ticketmonster.co.kr/deal/952393926.
I get an error when I run soup(thispage, 'html.parser'), but this error only happens for this specific page. Does anyone know why this is happening?
The code I have so far is very simple:
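from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
url = 'https://www.ticketmonster.co.kr/deal/952393926'
openU = urlopen(url)
thispage = openU.read()
openU.close()  # close the response object, not the built-in open
pageS = soup(thispage, 'html.parser')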
The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
parser.feed(markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
% rawdata[declstartpos:declstartpos+20])
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
"subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
Please help!
Try using
pageS = soup(thispage, 'lxml')
instead of
pageS = soup(thispage, 'html.parser')
It looks like it may be a problem with character encoding when using "html.parser".

malformed start tag, exception being thrown in python 2.6.9 but not in 2.7.4 HTMLParser

I am fetching URL contents using urllib2 in Python and then feeding them to Python's native HTML parser. The code works wonderfully well on my Python 2.7.4; however, my friend's machine has Python 2.6.9, and the issue he is facing on his machine is:
Traceback (most recent call last):
File "opsview_audit.py", line 420, in <module>
check_instances_against_regex(instances)
File "opsview_audit.py", line 219, in check_instances_against_regex
attrs_being_monitored = get_host_monitoring_status(cred['url'], running_instances,
cred['user_name'], cred['pass_key'])
File "opsview_audit.py", line 112, in get_host_monitoring_status
parser.feed(result.read())
File "/usr/lib64/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib64/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib64/python2.6/HTMLParser.py", line 229, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib64/python2.6/HTMLParser.py", line 304, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib64/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 509, column 47
Maybe some start tag is malformed, and Python 2.6.9 raises an exception for it while 2.7.4 does not.
Upgrading from 2.6.9 to 2.7.4 or above is not an option here.
Two solutions:
-Use another HTML parser such as Beautiful Soup 3 or lxml. Both are really easy to learn and compatible with Python 2.6.
-Try to find the bug and filter it out.
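For the second option, the exception message already says where to look (line 509, column 47 in the traceback above). Here is a minimal sketch of a helper that prints the text around such a position so the offending tag can be seen and filtered. The name context_at and the width parameter are made up for illustration; it assumes the line number is 1-based and the column is 0-based, which is how HTMLParser.getpos() reports positions.

```python
def context_at(text, line, column, width=30):
    """Return (snippet, caret) for a 1-based line and 0-based column.

    snippet is the part of the line around the column, and caret is a
    marker string that points at the reported column within the snippet.
    """
    source_line = text.splitlines()[line - 1]
    start = max(0, column - width)
    end = column + width
    snippet = source_line[start:end]
    caret = " " * (column - start) + "^"
    return snippet, caret


if __name__ == "__main__":
    # Made-up page text; in practice pass the HTML that was fed to the parser.
    page = "<p>ok</p>\n<a href=x title=a=&b>link</a>"
    snippet, caret = context_at(page, 2, 19, width=12)
    print(snippet)
    print(caret)
```

Once the bad fragment is visible, it can be removed or rewritten in the string before the page is handed to the parser.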

Parsing XML exception

I'm new to python, and seriously need help! I have a number of errors I can't figure out. I'm using python 2.7 on a mac. Here is the list of errors:
Traceback (most recent call last):
File "minihiveosc.py", line 378, in <module>
swhive = SWMiniHiveOSC( options.host, options.hport, options.ip, options.port, options.minibees, options.serial, options.baudrate, options.config, [1,options.minibees], options.verbose, options.apimode )
File "minihiveosc.py", line 280, in __init__
self.hive.load_from_file( config )
File "/Users/Puffin/Documents/python/pydon/pydon/pydonhive.py", line 396, in load_from_file
hiveconf = cfgfile.read_file( filename )
File "/Users/Puffin/Documents/python/pydon/pydon/minibeexml.py", line 116, in read_file
tree = ET.parse( filename )
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 164, column 8
Any chance someone can help me?
Thanks!
What you posted in your question is called a "Traceback", and it shows only one error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 164, column 8
All the lines before it show how python got there; in the file minihiveosc.py, on line 378 some code was executed (shown in the traceback), which then led to line 280 of the same file, where something else was called, etc.
Every time Python calls a function, the current state is pushed onto the stack to make room for the next context, and when an exception occurs Python can show you this stack to help you diagnose your problem.
In this case, you are trying to feed an XML document to the XML parser that has an error in it; by the time the parser gets to line 164, column 8, it found something it didn't expect. You'll need to inspect that document to see what the problem is, it'll be around that area.
It is just because your XML file is not well-formed at line 164, column 8. When the parser tries to read that line it raises that error. Have a look at your document to see what it is.
This is one error with stack trace.
Creating the SWMiniHiveOSC object caused an error while executing the load_from_file(config) method. The file name or file content is in 'options.config'. Your XML config file is not well-formed: there is an invalid token at line 164, column 8 in this file. The problem is with the XML file, not the Python code.
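To look at the offending line without counting lines by hand, here is a minimal sketch that catches the ParseError and reads its position attribute, a (line, column) tuple. It is written for Python 3, and the function name first_xml_error and the sample document are made up for illustration; the real config file is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Made-up document with a raw "<" inside text on its third line.
BROKEN = "<hive>\n  <minibee id='1'/>\n  <note>rate < 10</note>\n</hive>"


def first_xml_error(xml_text):
    """Return (line, column, text_of_that_line), or None if the XML parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        line, column = exc.position  # line is 1-based
        bad_line = xml_text.splitlines()[line - 1]
        return line, column, bad_line
    return None


if __name__ == "__main__":
    print(first_xml_error(BROKEN))
```

For a file on disk, read its contents into a string first and pass that to the function. Typical causes of an invalid token are a raw & or < inside text, or a stray control character.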

how to fix or make an exception for this error

I'm writing code that gets image URLs from any web page; the code is in Python and uses BeautifulSoup and httplib2.
When I run the code, I get the following error:
Look me http://movies.nytimes.com (this line is printed by the code)
Traceback (most recent call last):
File "main.py", line 103, in <module>
visit(initialList,profundidad)
File "main.py", line 98, in visit
visit(dodo[indice], bottom -1)
File "main.py", line 94, in visit
getImages(w)
File "main.py", line 34, in getImages
iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 942, column 118
Can someone explain how to fix this error or catch it as an exception?
Are you using the latest version of BeautifulSoup?
This seems to be a known issue with version 3.1.x, because it started using a new parser (HTMLParser instead of SGMLParser) that is much worse at processing malformed HTML. You can find more information about this on the BeautifulSoup website.
As a quick solution, you can simply use an older version (3.0.7a).
To catch that error specifically, change your code to look like this:
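from HTMLParser import HTMLParseError  # Python 2 module, as in the traceback
try:
    iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
except HTMLParseError:
    # Do something intelligent here, e.g. log the URL and skip this page
    pass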
Here's some more reading on Python's try except blocks:
http://docs.python.org/tutorial/errors.html
I got that error when I had the string =& in my HTML document. When I replaced that string (in my case with =and) then I no longer received that parsing error.
