Python: Parsing XML with lxml - python

I am trying to use a library called dblp-python, which parses the DBLP data (in XML format). When I try to print all publications of an author, the script behaves inconsistently: sometimes it prints them without any errors, and sometimes it raises an error. If I run the same code again after an error, it prints the publications without any problem.
The code I use is:
import dblp
a = dblp.search('Michael L. Littman')
for i in range(len(a[0].publications)):
    print i
    print a[0].publications[i].title
The error I get when I execute the above code is:
> Traceback (most recent call last):
>   File "<pyshell#217>", line 3, in <module>
>     print a[0].publications[i].title
>   File "build\bdist.win32\egg\dblp\__init__.py", line 19, in __getattr__
>     self.load_data()
>   File "build\bdist.win32\egg\dblp\__init__.py", line 110, in load_data
>     root = etree.fromstring(xml)
>   File "lxml.etree.pyx", line 3092, in lxml.etree.fromstring (src\lxml\lxml.etree.c:70691)
>   File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:106689)
>   File "parser.pxi", line 1716, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:105478)
>   File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:100105)
>   File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
>   File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
>   File "parser.pxi", line 620, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95050)
> XMLSyntaxError: Space required after the Public Identifier, line 2, column 47
The code of the library can be seen HERE.
I raised this problem with the author but got no response. I hope someone here can help me at least understand what the error might be.
Thank you
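One way to narrow this down is to look at what was actually handed to the parser when it fails. The sketch below (Python 3 syntax) is not part of dblp-python: `try_parse` is a hypothetical helper, and the idea that the intermittent failures come from a response that is not XML (for example an HTML error page) is only an assumption to check.

```python
from lxml import etree

def try_parse(payload, preview=200):
    """Parse bytes as XML; on failure return the error plus the start of the payload.

    Hypothetical diagnostic helper, not part of dblp-python.
    """
    try:
        return etree.fromstring(payload), None
    except etree.XMLSyntaxError as exc:
        # Showing the first bytes makes a non-XML response easy to spot.
        return None, '%s | payload starts with: %r' % (exc, payload[:preview])

root, problem = try_parse(b'<root><title>ok</title></root>')
bad_root, bad_problem = try_parse(b'<a><b></a>')
```

If the preview of a failing payload starts with an HTML doctype rather than XML, the server response would be the thing to investigate, not the parser.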

Related

python lxml in docker: "Document is empty" while parsing

Why does this code work without issues on my Mac with any version of Python, requests and lxml, but fail in every Docker container? I have tried everything.
It fails at line 34533 (found by printing el.sourceline).
from requests import get
from lxml import etree
r = get('https://printbar.ru/synsfiles/yandex/market/idrr_full.xml')
with open('test.xml', 'wb') as f:
    f.write(r.content)
tree = etree.iterparse(source='test.xml', events=('end',))
for (ev, el) in tree:
    continue
print('ok')
The file at https://printbar.ru/synsfiles/yandex/market/idrr_full.xml seems completely valid and parses locally on all of my Macs.
I tried Ubuntu, Alpine, and several Python images, even with prebuilt lxml; nothing helped. I expected the file to parse cleanly, but this error is thrown in the middle of parsing:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "test.xml", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
xmllint says that there is an encoding error, but the file parses locally on my Mac. How is that possible? I want to run this in Docker.
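Since xmllint reports an encoding error, it may help to locate the offending bytes before involving lxml at all. The following is a minimal sketch in plain Python, assuming the feed is meant to be UTF-8 (an assumption; check the XML declaration of the file). `find_invalid_utf8` is a hypothetical helper name.

```python
def find_invalid_utf8(data, limit=5):
    """Return up to `limit` (offset, byte) pairs where UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            data[pos:].decode('utf-8')
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            bad = pos + exc.start          # absolute offset of the bad byte
            problems.append((bad, data[bad:bad + 1]))
            pos = bad + 1                  # continue scanning after it
    return problems

clean = find_invalid_utf8(b'<a>ok</a>')
broken = find_invalid_utf8(b'<a>\xff</a>')
```

Running it on the bytes of test.xml both in the container and on the Mac would show whether the two machines received the same data in the first place.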

Reading local file with lxml -Python

I have what I thought was very basic code to read an xml file into Python. But I'm baffled that I'm running into issues.
I thought this code should work:
from lxml import etree
parser = etree.XMLParser(ns_clean=True, recover=True)
tree = etree.parse('file.xml', parser)
#root = tree.getroot()
However, I get this issue:
Traceback (most recent call last):
File "C:/Users/Root/Documents/xml.py", line 5, in <module>
tree = etree.parse('file.xml', parser)
File "src\lxml\etree.pyx", line 3519, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1839, in lxml.etree._parseDocument
File "src\lxml\parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src\lxml\parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file 'file.xml': failed to load external entity "file.xml"
I read a few posts about similar issues, where the problem turned out to be a missing full path. So I tried a few variations of the path:
tree = etree.parse(r'C:\Users\Root\Documents\file.xml', parser)
tree = etree.parse("C:\\Users\\Root\\Documents\\file.xml", parser)
tree = etree.parse('C:/Users/Root/Documents/file.xml', parser)
Am I missing something obvious? Any help is much appreciated
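The "failed to load external entity" message commonly appears when the file cannot be opened at the given path, so a quick check is to confirm the file exists and to hand lxml an already opened file object. This is only a sketch under that assumption; `parse_local_xml` is a hypothetical helper, and the parser options are the ones from the question.

```python
import os
from lxml import etree

def parse_local_xml(path):
    """Parse a local XML file, failing early with a clear message if it is missing."""
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        raise IOError('no such file: %s (cwd is %s)' % (path, os.getcwd()))
    parser = etree.XMLParser(ns_clean=True, recover=True)
    # Opening the file in Python keeps path handling out of libxml2.
    with open(path, 'rb') as fh:
        return etree.parse(fh, parser)
```

If the explicit check raises, the path (or a hidden extension such as file.xml.txt) is the problem rather than lxml.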

Can etree.XMLParser in recover mode still throw a parse error?

I have a utility method that parses XML using a parser created as etree.XMLParser(recover=True). I would like to test failure scenarios in a unit test. Except for empty input throwing an lxml.etree.XMLSyntaxError, I can't seem to break the parser.
My question is: is it possible to construct a StringIO or BytesIO input for this parser such that the parser throws a parse error?
Here are some examples (tested with Python 3.5 and lxml 4.3.3):
from io import BytesIO
from lxml import etree
def parse(xml):
    parser = etree.XMLParser(recover=True)
    elem = etree.parse(BytesIO(xml), parser)
    print(etree.tostring(elem))
parse(b'<broken<') # prints b'<broken/>'
parse(b'</lf|\jf>') # prints None
parse('<?xml encoding="ascii"?><foo>æøå</foo>'.encode('utf-8')) # prints b'<foo/>'
parse(b'') # Throws lxml.etree.XMLSyntaxError
If I slap a NULL character at the beginning of any of the bad inputs you show that don't raise an error, I do get an error. For instance:
parse(b'\0<broken<')
produces:
Traceback (most recent call last):
File "test.py", line 13, in <module>
parse(b'\0<broken<') # prints b'<broken/>'
File "test.py", line 9, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
Isn't it because you are using recover=True?
recover - try hard to parse through broken XML
I changed it to recover=False, and I get:
Traceback (most recent call last):
File "./foo.py", line 11, in <module>
parse(b'<broken<') # prints b'<broken/>'
File "./foo.py", line 7, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 8
Am I missing something?
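For unit tests, an alternative to provoking an exception is to inspect what the recovering parser logged. The sketch below assumes the parser's `error_log` property, which lxml documents as holding the errors and warnings of the last parser run; `parse_with_errors` is a hypothetical helper, not the utility method from the question.

```python
from lxml import etree

def parse_with_errors(xml):
    """Parse in recover mode and return (root, list of logged error messages)."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(xml, parser)
    # error_log holds the errors and warnings of this parser's last run.
    errors = [entry.message for entry in parser.error_log]
    return root, errors

root, errors = parse_with_errors(b'<a><b></a>')
clean_root, clean_errors = parse_with_errors(b'<a><b/></a>')
```

A test can then assert that `errors` is non-empty for broken input even though no exception was raised.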

printing title of URL from a file in python

I'm trying to read URLs from a file and print the title of each page:
import lxml.html
file = open('ab.txt','r')
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text
The error :
Traceback (most recent call last):
File "C:\Python27\site.py", line 4, in <module>
t = lxml.html.parse(line)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'http://example.com/5129860
': failed to load HTTP resource
The ab.txt has:
example.com/123
example.com/234
example.com/456
....
Anything wrong in here?
The parse method in lxml.html parses a filename, URL, or file-like object into an HTML document and returns a tree. According to the documentation, its signature is:
parse(filename_or_url, parser=None, base_url=None, **kw)
So you can directly pass the filename and get your output.
t = lxml.html.parse('ab.txt')
print t.find(".//title").text
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text
Here you are reading the file line by line and passing each line to lxml.html.parse, which means the argument is not valid HTTP content. You should modify these lines as follows:
from urllib2 import urlopen
for line in file:
    # strip the trailing newline that each line read from the file carries
    content = urlopen(line.strip())
    t = lxml.html.parse(content)
    print t.find(".//title").text
Here urlopen fetches the page at each URL and returns it as a file-like object in the variable content, so lxml.html.parse receives valid HTTP content.
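The error message in the question shows the URL with a line break before the closing quote, so the trailing newline of each line is worth removing before any request is made. Below is a minimal Python 3 sketch of that cleanup and of the title lookup; `clean_urls` and `page_title` are hypothetical helpers, and prepending `http://` to entries without a scheme is an assumption about how ab.txt is written.

```python
import lxml.html

def clean_urls(lines):
    """Yield usable URLs from raw text lines (strip whitespace, skip blanks)."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        if '://' not in url:
            url = 'http://' + url  # assumption: bare entries are plain HTTP
        yield url

def page_title(html_bytes):
    """Return the text of the first <title> element, or None if there is none."""
    doc = lxml.html.fromstring(html_bytes)
    return doc.findtext('.//title')

urls = list(clean_urls(['example.com/123\n', '\n', 'http://example.com/234\n']))
title = page_title(b'<html><head><title>Hi</title></head><body></body></html>')
```

Each cleaned URL can then be fetched, for instance with `urllib.request.urlopen(url).read()`, and the bytes passed to `page_title`.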

Why is the slash at the end of lxml.html.parse() important?

I am using lxml to scrape html. This code works.
lxml.html.parse( "http://google.com/" )
This code does not.
lxml.html.parse( "http://google.com" )
Why does the slash at the end of the URL matter? Thank you.
To be clear, here is the error log that python is giving me from the latter code.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/davidfaux/epd-7.2-2-rh5-x86/lib/python2.7/site-packages/lxml/html/__init__.py", line 692, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82287)
File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82580)
File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81619)
File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78528)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74665)
IOError: Error reading file 'http://google.com': failed to load HTTP resource
Because without the slash, Google isn't sending you a page, it's sending you a redirect. In fact, it's redirecting you to the URL with the slash! The body of the redirect is probably empty.
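A common way around this is to let an HTTP client that follows redirects fetch the page and hand the response to lxml. The Python 3 sketch below uses `urllib.request.urlopen`, which follows redirects by default; `parse_url` and its `opener` parameter are hypothetical and exist only so the fetch step can be swapped out.

```python
from io import BytesIO
from urllib.request import urlopen
import lxml.html

def parse_url(url, opener=urlopen):
    """Fetch `url` with `opener` (redirects followed by urlopen) and parse the HTML."""
    response = opener(url)
    try:
        # lxml.html.parse accepts a file-like object as well as a filename or URL.
        return lxml.html.parse(response)
    finally:
        response.close()

# Offline demonstration with a stand-in opener instead of a real request.
fake = lambda url: BytesIO(b'<html><head><title>Google</title></head><body></body></html>')
tree = parse_url('http://google.com', opener=fake)
```

With the default opener, `parse_url("http://google.com")` would go through the redirect to the final page before lxml sees any content.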
