This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<papers>
<paper>
<title>Title containing & and more</title>
</paper>
</papers>
How do I read that using lxml's etree? I tried
from lxml import etree
with open(xml_file, 'r') as inf:
    tree = etree.parse(inf)
but it results in the following Traceback:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 5, column 30
If you need to retain the & character, you can parse the file as HTML.
from lxml import html
tree = html.parse(path)
If you don't need the & character, you can create a new XML parser and pass the recover=True option.
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse(path, parser=parser)
The XML file is malformed because of the bare ampersand: in XML, & starts an entity reference, so a literal ampersand in text has to be written as &amp;. If you can, use BeautifulSoup, which is a more error-tolerant parser.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data)
print soup.find("title").text
outputs
Title containing & and more
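A further option, not taken from the answers above, is to escape bare ampersands before handing the data to the parser, so the & survives and the XML parser can stay strict. The sketch below is only an illustration: the helper name escape_bare_ampersands and the regular expression are assumptions, and the pattern would also rewrite an ampersand inside a CDATA section, which may not be what you want. It falls back to the standard library's ElementTree if lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:  # fall back to the standard library for this sketch
    import xml.etree.ElementTree as etree

# Match an '&' that does NOT start an entity or character reference
# such as &amp; &#38; or &#x26;
BARE_AMP = re.compile(
    rb"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(raw):
    """Hypothetical helper: replace bare '&' bytes with '&amp;'."""
    return BARE_AMP.sub(b"&amp;", raw)


raw = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<papers>
<paper>
<title>Title containing & and more</title>
</paper>
</papers>
"""

# Parse the repaired bytes with an ordinary (non-recovering) parser.
root = etree.fromstring(escape_bare_ampersands(raw))
title_text = root.find("paper/title").text
print(title_text)
```

The data is kept as bytes on purpose: lxml refuses a unicode string that carries an encoding declaration, so the repair is done on the raw bytes.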
I have what I thought was very basic code to read an xml file into Python. But I'm baffled that I'm running into issues.
I thought this code should work:
from lxml import etree
parser = etree.XMLParser(ns_clean=True, recover=True)
tree = etree.parse('file.xml', parser)
#root = tree.getroot()
However, I get this issue:
Traceback (most recent call last):
File "C:/Users/Root/Documents/xml.py", line 5, in <module>
tree = etree.parse('file.xml', parser)
File "src\lxml\etree.pyx", line 3519, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1839, in lxml.etree._parseDocument
File "src\lxml\parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src\lxml\parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file 'file.xml': failed to load external entity "file.xml"
I read a few posts about similar issues, and in those cases the problem was that the full path had not been given. So I tried a few variations of the path:
tree = etree.parse(r'C:\Users\Root\Documents\file.xml', parser)
tree = etree.parse("C:\\Users\\Root\\Documents\\file.xml", parser)
tree = etree.parse('C:/Users/Root/Documents/file.xml', parser)
Am I missing something obvious? Any help is much appreciated
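The thread does not say what the cause turned out to be. libxml2 reports "failed to load external entity" whenever it cannot open the file, so one way to narrow this down is to ask Python what it sees at that path before calling etree.parse. The following is a minimal sketch using only the standard library; describe_path is a hypothetical helper, and the temporary file merely stands in for file.xml.

```python
import tempfile
from pathlib import Path


def describe_path(path):
    """Hypothetical helper: collect basic facts about a path."""
    p = Path(path)
    return {
        "resolved": str(p.resolve()),  # absolute form of the path
        "exists": p.exists(),
        "is_file": p.is_file(),
        "size": p.stat().st_size if p.is_file() else None,
    }


# Demonstration with a throwaway directory instead of the real file.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "file.xml"
    present.write_bytes(b"<root/>")
    info_present = describe_path(present)
    info_missing = describe_path(Path(tmp) / "missing.xml")

print(info_present["exists"], info_present["size"])
print(info_missing["exists"], info_missing["size"])
```

If "exists" is False for the path you pass, the problem lies with the path or the working directory rather than with the parser.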
I have a utility method that parses XML using a parser created as etree.XMLParser(recover=True). I would like to test failure scenarios in a unit test. Except for empty input throwing an lxml.etree.XMLSyntaxError, I can't seem to break the parser.
My question is: is it possible to construct a StringIO or BytesIO input for this parser such that the parser throws a parse error?
Here are some examples (tested with Python 3.5 and lxml 4.3.3):
from io import BytesIO
from lxml import etree
def parse(xml):
    parser = etree.XMLParser(recover=True)
    elem = etree.parse(BytesIO(xml), parser)
    print(etree.tostring(elem))
parse(b'<broken<') # prints b'<broken/>'
parse(b'</lf|\jf>') # prints None
parse('<?xml encoding="ascii"?><foo>æøå</foo>'.encode('utf-8')) # prints b'<foo/>'
parse(b'') # Throws lxml.etree.XMLSyntaxError
If I slap a NULL character at the beginning of any of the bad inputs you show that don't raise an error, I do get an error. For instance:
parse(b'\0<broken<')
produces:
Traceback (most recent call last):
File "test.py", line 13, in <module>
parse(b'\0<broken<') # prints b'<broken/>'
File "test.py", line 9, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
Isn't it because you are using recover=True?
recover - try hard to parse through broken XML
I changed recover=False and I get:
Traceback (most recent call last):
File "./foo.py", line 11, in <module>
parse(b'<broken<') # prints b'<broken/>'
File "./foo.py", line 7, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 8
Am I missing something?
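If the goal is to keep recover=True but still have a failure path to exercise in a unit test, one possibility is to inspect the parser's error_log after parsing and raise when it is not empty. This is only a sketch under that assumption: parse_strict_after_recover is a hypothetical name, and raising ValueError is an arbitrary choice for the illustration.

```python
from io import BytesIO

from lxml import etree


def parse_strict_after_recover(xml):
    """Parse with recover=True, but raise if the parser logged any problem."""
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(BytesIO(xml), parser)
    # error_log holds the errors and warnings of the last parser run.
    if len(parser.error_log) > 0:
        first = parser.error_log[0]
        raise ValueError("recovered from broken XML: %s" % first.message)
    return tree


try:
    parse_strict_after_recover(b"<broken<")
    broken_raised = False
except ValueError:
    broken_raised = True

ok_root = parse_strict_after_recover(b"<ok/>").getroot()
print(broken_raised, ok_root.tag)
```

A test can then feed any of the broken inputs from the question and assert that the wrapper raises, while well-formed input still returns a tree.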
I want to know how to use lxml to fetch a URL so that I can then use XPath to extract the data I want.
Please guide me. Thank you very much.
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = parse(res.content)
name = doc.xpath("//meta[@itemprop='name']/@content")
print name
There are errors in my code:
doc = parse(res.content)
File "/Users/ome/djangoenv/lib/python2.7/site-packages/lxml/html/__init__.py", line 786, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95015)
IOError
res.content is a string, an HTML string.
You need to use lxml.html.fromstring():
import lxml.html
import requests
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = lxml.html.fromstring(res.content)
name = doc.xpath(".//meta[@itemprop='name']/@content")
print name
Presumably res.content is a string containing the contents of the page. parse takes a filename or file-like object. Thus, you are using the page content as the name of a file. This is probably not what you want. To construct a tree from a string, use fromstring rather than parse.
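The difference between the two entry points can be tried without any network access. The sketch below uses a made-up page (the markup and the "Example Shop" value are assumptions for illustration): fromstring takes the content itself, while parse is given a file-like object wrapping the same bytes.

```python
import io

import lxml.html

# Made-up page content standing in for res.content
page = (
    b'<html><head><meta itemprop="name" content="Example Shop"></head>'
    b"<body><p>hello</p></body></html>"
)

# fromstring: the argument is the document content itself
doc = lxml.html.fromstring(page)
name_from_string = doc.xpath("//meta[@itemprop='name']/@content")

# parse: the argument is a filename, URL, or file-like object
tree = lxml.html.parse(io.BytesIO(page))
name_from_filelike = tree.xpath("//meta[@itemprop='name']/@content")

print(name_from_string, name_from_filelike)
```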
I'm trying to read URLs from a file and print the title of each page:
import lxml.html
file = open('ab.txt','r')
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text
The error:
Traceback (most recent call last):
File "C:\Python27\site.py", line 4, in <module>
t = lxml.html.parse(line)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'http://example.com/5129860
': failed to load HTTP resource
The ab.txt has:
example.com/123
example.com/234
example.com/456
....
Is anything wrong here?
The parse function in lxml.html parses a filename, URL, or file-like object into an HTML document and returns a tree. According to the documentation, its signature is:
parse(filename_or_url, parser=None, base_url=None, **kw)
So you can directly pass the filename and get your output.
t = lxml.html.parse('ab.txt')
print t.find(".//title").text
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text
Here you read the file line by line and pass each line to lxml.html.parse, which means the argument to the function is not valid HTTP content. You should modify these lines as follows:
from urllib2 import urlopen
for line in file:
    # strip the trailing newline before opening the URL
    content = urlopen(line.strip())
    t = lxml.html.parse(content)
    print t.find(".//title").text
Here urlopen opens the URL and returns a file-like object, which is stored in the variable content, so lxml.html.parse receives valid HTTP content.
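The error message in the question shows a line break inside the quoted URL, which suggests that each line is passed on together with its trailing newline. A small cleaning step before opening the URLs could look like the sketch below. The helper clean_urls and the choice of "http" as the default scheme are assumptions for illustration; the sketch performs no network access.

```python
def clean_urls(lines, default_scheme="http"):
    """Hypothetical helper: yield usable URLs from raw lines of a text file."""
    for line in lines:
        url = line.strip()          # drop the trailing newline and stray spaces
        if not url:
            continue                # skip blank lines
        if "://" not in url:
            # assume a scheme when the line has none
            url = "%s://%s" % (default_scheme, url)
        yield url


sample = [
    "example.com/123\n",
    "\n",
    "http://example.com/234\n",
    "  example.com/456  \n",
]
cleaned = list(clean_urls(sample))
print(cleaned)
```

Each cleaned URL can then be opened and the resulting file-like object passed to lxml.html.parse.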
Here's the Python source:
fsock = urllib2.urlopen('http://eprints.soton.ac.uk/cgi/exportview/divisions/uos-fp/2009/XML/uos-fp_2009.xml')
doc=et.parse(fsock)
When I tried to run this it gives the following error:
Traceback (most recent call last):
File "C:\Python27\reading and writing xml file from web1.py", line 30, in
doc=et.parse(fsock)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1176, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 654, in parse
self._root = parser.close()
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1635, in close
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1487, in _raiseerror
raise err
ParseError: no element found: line 1, column 0
Can anyone help me understand why this is happening?
Your code works:
import urllib2
from xml.etree.cElementTree import parse, dump
fsock = urllib2.urlopen('http://eprints.soton.ac.uk/cgi/exportview/divisions/uos-fp/2009/XML/uos-fp_2009.xml')
doc = parse(fsock)
dump(doc)
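The thread does not establish why the original run failed. For what it is worth, "no element found: line 1, column 0" is what ElementTree reports when the source yields no data at all, for example an empty response or a stream that has already been read to the end. The sketch below reproduces that message with an in-memory stream; the payload is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up payload standing in for the downloaded XML
payload = b"<records><record id='1'/></records>"

stream = io.BytesIO(payload)
first = ET.parse(stream)          # consumes the stream up to the end

try:
    ET.parse(stream)              # nothing left to read from the same stream
    second_error = None
except ET.ParseError as err:
    second_error = str(err)

print(first.getroot().tag)
print(second_error)
```

If the response object was read earlier in the script (for instance to print it), parsing it afterwards would fail in the same way.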