Beautiful Soup and uTidy - python

I want to pass the results of uTidy to Beautiful Soup, like this:
page = urllib2.urlopen(url)
options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0)
cleaned_html = tidy.parseString(page.read(), **options)
soup = BeautifulSoup(cleaned_html)
When run, the following error results:
Traceback (most recent call last):
File "soup.py", line 34, in <module>
soup = BeautifulSoup(cleaned_html)
File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1245, in _feed
smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1751, in __init__
self._detectEncoding(markup, isHTML)
File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1899, in _detectEncoding
xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
TypeError: expected string or buffer
I gather utidy returns an XML document while BeautifulSoup wants a string. Is there a way to cast cleaned_html? Or am I doing it wrong and should take a different approach?

Just wrap str() around cleaned_html when passing it to BeautifulSoup.

Convert the value passed to BeautifulSoup into a string.
In your case, change the last line to:
soup = BeautifulSoup(str(cleaned_html))
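To see why the conversion helps, here is a minimal sketch. It assumes the object returned by uTidy is not a string but can be turned into one with str(). `FakeTidyDocument` is a hypothetical stand-in for that object and `looks_like_markup` is an illustrative helper that mimics the `re` call shown in the traceback; neither is part of uTidy or BeautifulSoup.

```python
import re


class FakeTidyDocument(object):
    """Hypothetical stand-in for the object returned by tidy.parseString()."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


def looks_like_markup(markup):
    # Like the encoding detection in the traceback, re.match() accepts
    # only strings (or buffers), not arbitrary objects.
    return re.compile(r"\s*<").match(markup) is not None


doc = FakeTidyDocument("<html><body><p>hi</p></body></html>")

try:
    looks_like_markup(doc)          # passing the object itself
    raised_type_error = False
except TypeError:
    raised_type_error = True

converted_ok = looks_like_markup(str(doc))  # passing its string form
print(raised_type_error, converted_ok)
```

The same pattern applies to the real call: convert first, then hand the string to BeautifulSoup.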

Related

MemoryError when parsing XML file

I am trying to find the specific tag in an XML file and I used BeautifulSoup to read the XML file. It produces the following error:
soup = BeautifulSoup(XML, 'xml')
Traceback (most recent call last):
File "<ipython-input-5-f431fabb5903>", line 1, in <module>
soup = BeautifulSoup(XML, 'xml')
File "D:\software\Anaconda3\envs\py37\lib\site-packages\bs4\__init__.py", line 362, in __init__
self._feed()
File "D:\software\Anaconda3\envs\py37\lib\site-packages\bs4\__init__.py", line 448, in _feed
self.builder.feed(self.markup)
File "D:\software\Anaconda3\envs\py37\lib\site-packages\bs4\builder\_lxml.py", line 203, in feed
markup = StringIO(markup)
MemoryError
The file is 353 MB, but the same code has parsed a larger file without producing this error. Do you know what the problem is?
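No answer is recorded here. The traceback shows the whole document being copied into a StringIO, so one possible workaround, if only a specific tag is needed, is to stream the file instead of building a full tree. The sketch below swaps in the standard library's `xml.etree.ElementTree.iterparse` in place of BeautifulSoup; that is a different technique from the asker's code, and `iter_tag_text` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_tag_text(source, tag):
    """Yield the text of every <tag> element while streaming the document.

    `source` may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            text = elem.text
            elem.clear()  # drop the matched element's content once it is read
            yield text


# Tiny in-memory sample standing in for the real 353 MB file.
sample = io.BytesIO(b"<root><item>a</item><other>x</other><item>b</item></root>")
found = list(iter_tag_text(sample, "item"))
print(found)
```

For a real file, pass its path instead of the `BytesIO` object. Note that this sketch ignores XML namespaces; with namespaced documents the tag names include the namespace URI in braces.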

Python - BeautifulSoup error while scraping

UPDATE: Using lxml instead of html.parser helped solve the problem, as Freddier suggested in the answer below!
I am trying to webscrape some information off of this website: https://www.ticketmonster.co.kr/deal/952393926.
I get an error when I run soup(thispage, 'html.parser'), but this error only happens for this specific page. Does anyone know why this is happening?
The code I have so far is very simple:
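from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.ticketmonster.co.kr/deal/952393926'
openU = urlopen(url)
thispage = openU.read()
openU.close()  # close the response object, not the built-in open
pageS = soup(thispage, 'html.parser')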
The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
parser.feed(markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
% rawdata[declstartpos:declstartpos+20])
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
"subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
Please help!
Try using
pageS = soup(thispage, 'lxml')
instead of
pageS = soup(thispage, 'html.parser')
It looks like it may be a character-encoding problem when using "html.parser".
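If the failing parser is not known in advance, the swap can be automated. The following is only a sketch: `parse_with_fallback` is a hypothetical helper, `make_soup` is assumed to be any callable with the shape `make_soup(markup, parser_name)` (BeautifulSoup itself fits that shape), and `fake_make_soup` exists purely so the example runs without bs4 installed.

```python
def parse_with_fallback(markup, make_soup, parsers=("lxml", "html.parser")):
    """Try each parser name in order and return the first successful result."""
    last_error = None
    for name in parsers:
        try:
            return make_soup(markup, name)
        except Exception as exc:  # broad on purpose: parsers fail in different ways
            last_error = exc
    if last_error is None:
        raise ValueError("no parser names were given")
    raise last_error


def fake_make_soup(markup, parser_name):
    """Stand-in for BeautifulSoup that fails the way html.parser did above."""
    if parser_name == "html.parser":
        raise NotImplementedError("subclasses of ParserBase must override error()")
    return (parser_name, markup)


result = parse_with_fallback("<p>hi</p>", fake_make_soup,
                             parsers=("html.parser", "lxml"))
print(result)
```

With bs4 installed, the call would look like `parse_with_fallback(thispage, soup)`, using the `soup` alias from the question.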

How to unpack DMOZ URLs from an RDF dump with Python and rdflib?

I tried to open an RDF file (the DMOZ RDF dump), but I get this error message:
Traceback (most recent call last):
File "/media/_dev_/ODP_RDF_get_links.py", line 4, in <module>
result = g.parse("data/content.rdf")
File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 1033, in parse
parser.parse(source, self, **args)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 577, in parse
self._parser.parse(source)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 352, in end_element_ns
self._cont_handler.endElementNS(pair, None)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
self.current.end(name, qname)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 331, in node_element_end
self.error("Repeat node-elements inside property elements: %s"%"".join(name))
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 185, in error
raise ParserError(info + message)
file:///media/_dev_/data/content.rdf:5:12: Repeat node-elements inside property elements: http://dmoz.org/rdf/catid
My simple code is as follows:
import rdflib
g = rdflib.Graph()
result = g.parse("data/content.rdf")
print("graph has %s statements." % len(g))
I need to be able to read the file and extract all links in the World category.
Thanks for any help.
EDIT:
PS: I found this on Wikipedia (rdf_dumps); it seems that developing custom scripts is necessary to use this dump.
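A custom script of that kind could be as simple as a line-by-line scan. The sketch below rests on assumptions that are not confirmed by the question: that each `Topic` element opens with `<Topic r:id="...">`, that its links appear as `<link r:resource="..."/>` before the closing `</Topic>`, and that each of these sits on its own line. `iter_links` and the sample lines are hypothetical, and the URLs are made up.

```python
import re

# Assumed layout of the dump (not verified against the real file).
TOPIC_RE = re.compile(r'<Topic r:id="([^"]*)"')
LINK_RE = re.compile(r'<link r:resource="([^"]*)"')


def iter_links(lines, prefix="Top/World"):
    """Yield (topic, url) pairs for links inside topics under `prefix`."""
    current_topic = None
    for line in lines:
        topic_match = TOPIC_RE.search(line)
        if topic_match:
            current_topic = topic_match.group(1)
            continue
        if "</Topic>" in line:
            current_topic = None
            continue
        link_match = LINK_RE.search(line)
        if link_match and current_topic and current_topic.startswith(prefix):
            yield current_topic, link_match.group(1)


# Made-up sample standing in for data/content.rdf.
sample_lines = [
    '<Topic r:id="Top/World/Deutsch">',
    '  <catid>123</catid>',
    '  <link r:resource="http://example.org/de"/>',
    '</Topic>',
    '<Topic r:id="Top/Arts">',
    '  <link r:resource="http://example.org/arts"/>',
    '</Topic>',
]
links = list(iter_links(sample_lines))
print(links)
```

Because a file object is itself an iterable of lines, the real dump could be passed as `iter_links(open("data/content.rdf"))` without loading it into memory.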

How do I parse a result page containing a form with lxml?

I am trying to parse a secondary page that contains a form. I am using the example source code from this link:
http://blog.ianbicking.org/2007/09/24/lxmlhtml/
In my test I use this URL: http://www.infofer.ro/
As in the example, I use these values:
>>> pprint(form.form_values())
[('cboData', '8/30/2010'),
('txtPlecare', 'Bucuresti Nord'),
('txtSosire', 'Constanta'),
('tip', 'GO'),
('lng', '1')]
The result is obtained like this:
result = parse(submit_form(form)).getroot()
The result is another page with another form.
I tried something like this:
>>> page2=parse(result).getroot()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/site-packages/lxml/html/__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49945)
File "parser.pxi", line 1525, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:72026)
TypeError: cannot parse from 'HtmlElement'
How do I parse the form from the secondary page?
Regards.
The getroot method does not give you another "page", but an instance of lxml.html.HtmlElement.
There is no need (and no way) to parse this once more; you already have everything you need packed into the result variable.
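The same idea can be shown with the standard library's ElementTree, used here only as a stand-in for lxml so the sketch runs without extra packages: an already parsed element cannot be parsed again, but it can be queried directly. The sample markup and variable names are hypothetical. With lxml.html, the forms of the parsed element should be reachable through `result.forms`.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the page returned by submit_form().
page = '<html><body><form action="/search"><input name="q"/></form></body></html>'
root = ET.fromstring(page)  # root is already an element, like `result` above

try:
    ET.parse(root)  # parsing an element again is a type error
    reparse_failed = False
except TypeError:
    reparse_failed = True

# Query the element directly instead of re-parsing it.
form_actions = [form.get("action") for form in root.iter("form")]
print(reparse_failed, form_actions)
```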

How to fix or catch an exception for this error

I'm writing code that gets image URLs from any web page; the code is in Python and uses BeautifulSoup and httplib2.
When I run the code, I get the following error:
Look me http://movies.nytimes.com (this line is printed by the code)
Traceback (most recent call last):
File "main.py", line 103, in <module>
visit(initialList,profundidad)
File "main.py", line 98, in visit
visit(dodo[indice], bottom -1)
File "main.py", line 94, in visit
getImages(w)
File "main.py", line 34, in getImages
iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 942, column 118
Can someone explain how to fix this error or catch the exception?
Are you using the latest version of BeautifulSoup?
This seems to be a known issue in version 3.1.x, because it started using a new parser (HTMLParser, instead of SGMLParser) that is much worse at processing malformed HTML. You can find more information about this on the BeautifulSoup website.
As a quick solution, you can simply use an older version (3.0.7a).
To catch that error specifically, change your code to look like this:
from HTMLParser import HTMLParseError

try:
    iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
except HTMLParseError:
    # Do something intelligent here, e.g. skip this page
    iSoupList = []
Here's some more reading on Python's try except blocks:
http://docs.python.org/tutorial/errors.html
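In a recursive crawler like the one in the question, a common pattern is to treat a page that fails to parse as having no images, so the crawl continues. This is only a sketch under stated assumptions: `PageParseError` is a hypothetical stand-in for HTMLParser.HTMLParseError, and `images_or_empty` and `fake_extract` are illustrative names that do not appear in the original code.

```python
class PageParseError(Exception):
    """Hypothetical stand-in for HTMLParser.HTMLParseError."""


def images_or_empty(markup, extract_images, error_type=PageParseError):
    """Return the extracted images, or an empty list if parsing fails."""
    try:
        return extract_images(markup)
    except error_type as exc:
        print("Skipping page: %s" % exc)
        return []


def fake_extract(markup):
    """Illustrative extractor that rejects the '=&' sequence."""
    if "=&" in markup:
        raise PageParseError("malformed start tag")
    return ["logo.png"] if "<img" in markup else []


good = images_or_empty('<img src="logo.png">', fake_extract)
bad = images_or_empty('<a href="x?a=&b=1">', fake_extract)
print(good, bad)
```

In the real code, `extract_images` would wrap the BeautifulSoup call and `error_type` would be the parser's own exception class.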
I got that error when I had the string =& in my HTML document. When I replaced that string (in my case with =and) then I no longer received that parsing error.
