How do I validate xml against a DTD file in Python

How do I validate xml against a DTD file in Python - python

I need to validate an XML string (and not a file)
against a DTD description file.
How can that be done in python?

Another good option is lxml's validation which I find quite pleasant to use.
A simple example taken from the lxml site:
from StringIO import StringIO
from lxml import etree
dtd = etree.DTD(StringIO("""<!ELEMENT foo EMPTY>"""))
root = etree.XML("<foo/>")
print(dtd.validate(root))
# True
root = etree.XML("<foo>bar</foo>")
print(dtd.validate(root))
# False
print(dtd.error_log.filter_from_errors())
# <string>:1:0:ERROR:VALID:DTD_NOT_EMPTY: Element foo was declared EMPTY this one has content

from the examples directory in the libxml2 python bindings:
#!/usr/bin/python -u
import libxml2
import sys
# Memory debug specific
libxml2.debugMemory(1)
dtd="""<!ELEMENT foo EMPTY>"""
instance="""<?xml version="1.0"?>
<foo></foo>"""
dtd = libxml2.parseDTD(None, 'test.dtd')
ctxt = libxml2.newValidCtxt()
doc = libxml2.parseDoc(instance)
ret = doc.validateDtd(ctxt, dtd)
if ret != 1:
print "error doing DTD validation"
sys.exit(1)
doc.freeDoc()
dtd.freeDtd()
del dtd
del ctxt

Related

How to write an xml file using libxml2 in python?

I am attempting to write an xml file in python3 using libxml2. I cannot find any relevant documentation regarding python about writing files with libxml. When I attempt to write an xml file parsed with libxml2 I get the error:
xmlDoc has no attribute write
Anyone here done this before? I can get it to work in Etree just fine but Etree will not respect the attribute order that I need.

You can use saveFile() or saveFileEnc(). Example:
import libxml2
XML = """
<root a="1" b="2">XYZ</root>
"""
doc = libxml2.parseDoc(XML)
doc.saveFile("test.xml")
doc.saveFileEnc("test2.xml", "UTF-8")
I could not find any good documentation for the Python API. Here is the corresponding C documentation: http://xmlsoft.org/html/libxml-tree.html#xmlSaveFile.

import libxml2
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okibgo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
doc = libxml2.parseDoc(DOC)
root = doc.children
print root

Writing "pretty printed" indented xml to an xml file results in wrong indentation levels [duplicate]

What is the best way (or are the various ways) to pretty print XML in Python?

import xml.dom.minidom
dom = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()

lxml is recent, updated, and includes a pretty print function
import lxml.etree as etree
x = etree.parse("filename")
print etree.tostring(x, pretty_print=True)
Check out the lxml tutorial:
http://lxml.de/tutorial.html

Another solution is to borrow this indent function, for use with the ElementTree library that's built in to Python since 2.5.
Here's what that would look like:
from xml.etree import ElementTree
def indent(elem, level=0):
i = "\n" + level*" "
j = "\n" + (level-1)*" "
if len(elem):
if not elem.text or not elem.text.strip():
elem.text = i + " "
if not elem.tail or not elem.tail.strip():
elem.tail = i
for subelem in elem:
indent(subelem, level+1)
if not elem.tail or not elem.tail.strip():
elem.tail = j
else:
if level and (not elem.tail or not elem.tail.strip()):
elem.tail = j
return elem
root = ElementTree.parse('/tmp/xmlfile').getroot()
indent(root)
ElementTree.dump(root)

You have a few options.
xml.etree.ElementTree.indent()
Batteries included, simple to use, pretty output.
But requires Python 3.9+
import xml.etree.ElementTree as ET
element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
BeautifulSoup.prettify()
BeautifulSoup may be the simplest solution for Python < 3.9.
from bs4 import BeautifulSoup
bs = BeautifulSoup(open(xml_file), 'xml')
pretty_xml = bs.prettify()
print(pretty_xml)
Output:
<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<id>
1
</id>
<title>
Add Visual Studio 2005 and 2008 solution files
</title>
</issue>
</issues>
This is my goto answer. The default arguments work as is. But text contents are spread out on separate lines as if they were nested elements.
lxml.etree.parse()
Prettier output but with arguments.
from lxml import etree
x = etree.parse(FILE_NAME)
pretty_xml = etree.tostring(x, pretty_print=True, encoding=str)
Produces:
<issues>
<issue>
<id>1</id>
<title>Add Visual Studio 2005 and 2008 solution files</title>
<details>We need Visual Studio 2005/2008 project files for Windows.</details>
</issue>
</issues>
This works for me with no issues.
xml.dom.minidom.parse()
No external dependencies but post-processing.
import xml.dom.minidom as md
dom = md.parse(FILE_NAME)
# To parse string instead use: dom = md.parseString(xml_string)
pretty_xml = dom.toprettyxml()
# remove the weird newline issue:
pretty_xml = os.linesep.join([s for s in pretty_xml.splitlines()
if s.strip()])
The output is the same as above, but it's more code.

Here's my (hacky?) solution to get around the ugly text node problem.
uglyXml = doc.toprettyxml(indent=' ')
text_re = re.compile('>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub('>\g<1></', uglyXml)
print prettyXml
The above code will produce:
<?xml version="1.0" ?>
<issues>
<issue>
<id>1</id>
<title>Add Visual Studio 2005 and 2008 solution files</title>
<details>We need Visual Studio 2005/2008 project files for Windows.</details>
</issue>
</issues>
Instead of this:
<?xml version="1.0" ?>
<issues>
<issue>
<id>
1
</id>
<title>
Add Visual Studio 2005 and 2008 solution files
</title>
<details>
We need Visual Studio 2005/2008 project files for Windows.
</details>
</issue>
</issues>
Disclaimer: There are probably some limitations.

As of Python 3.9, ElementTree has an indent() function for pretty-printing XML trees.
See https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.indent.
Sample usage:
import xml.etree.ElementTree as ET
element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
The upside is that it does not require any additional libraries. For more information check https://bugs.python.org/issue14465 and https://github.com/python/cpython/pull/15200

As others pointed out, lxml has a pretty printer built in.
Be aware though that by default it changes CDATA sections to normal text, which can have nasty results.
Here's a Python function that preserves the input file and only changes the indentation (notice the strip_cdata=False). Furthermore it makes sure the output uses UTF-8 as encoding instead of the default ASCII (notice the encoding='utf-8'):
from lxml import etree
def prettyPrintXml(xmlFilePathToPrettyPrint):
assert xmlFilePathToPrettyPrint is not None
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False)
document = etree.parse(xmlFilePathToPrettyPrint, parser)
document.write(xmlFilePathToPrettyPrint, pretty_print=True, encoding='utf-8')
Example usage:
prettyPrintXml('some_folder/some_file.xml')

If you have xmllint you can spawn a subprocess and use it. xmllint --format <file> pretty-prints its input XML to standard output.
Note that this method uses an program external to python, which makes it sort of a hack.
def pretty_print_xml(xml):
proc = subprocess.Popen(
['xmllint', '--format', '/dev/stdin'],
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
)
(output, error_output) = proc.communicate(xml);
return output
print(pretty_print_xml(data))

I tried to edit "ade"s answer above, but Stack Overflow wouldn't let me edit after I had initially provided feedback anonymously. This is a less buggy version of the function to pretty-print an ElementTree.
def indent(elem, level=0, more_sibs=False):
i = "\n"
if level:
i += (level-1) * ' '
num_kids = len(elem)
if num_kids:
if not elem.text or not elem.text.strip():
elem.text = i + " "
if level:
elem.text += ' '
count = 0
for kid in elem:
indent(kid, level+1, count < num_kids - 1)
count += 1
if not elem.tail or not elem.tail.strip():
elem.tail = i
if more_sibs:
elem.tail += ' '
else:
if level and (not elem.tail or not elem.tail.strip()):
elem.tail = i
if more_sibs:
elem.tail += ' '

If you're using a DOM implementation, each has their own form of pretty-printing built-in:
# minidom
#
document.toprettyxml()
# 4DOM
#
xml.dom.ext.PrettyPrint(document, stream)
# pxdom (or other DOM Level 3 LS-compliant imp)
#
serializer.domConfig.setParameter('format-pretty-print', True)
serializer.writeToString(document)
If you're using something else without its own pretty-printer — or those pretty-printers don't quite do it the way you want —  you'd probably have to write or subclass your own serialiser.

I had some problems with minidom's pretty print. I'd get a UnicodeError whenever I tried pretty-printing a document with characters outside the given encoding, eg if I had a β in a document and I tried doc.toprettyxml(encoding='latin-1'). Here's my workaround for it:
def toprettyxml(doc, encoding):
"""Return a pretty-printed XML document in a given encoding."""
unistr = doc.toprettyxml().replace(u'<?xml version="1.0" ?>',
u'<?xml version="1.0" encoding="%s"?>' % encoding)
return unistr.encode(encoding, 'xmlcharrefreplace')

from yattag import indent
pretty_string = indent(ugly_string)
It won't add spaces or newlines inside text nodes, unless you ask for it with:
indent(mystring, indent_text = True)
You can specify what the indentation unit should be and what the newline should look like.
pretty_xml_string = indent(
ugly_xml_string,
indentation = ' ',
newline = '\r\n'
)
The doc is on http://www.yattag.org homepage.

I wrote a solution to walk through an existing ElementTree and use text/tail to indent it as one typically expects.
def prettify(element, indent=' '):
queue = [(0, element)] # (level, element)
while queue:
level, element = queue.pop(0)
children = [(level + 1, child) for child in list(element)]
if children:
element.text = '\n' + indent * (level+1) # for child open
if queue:
element.tail = '\n' + indent * queue[0][0] # for sibling open
else:
element.tail = '\n' + indent * (level-1) # for parent close
queue[0:0] = children # prepend so children come before siblings

Here's a Python3 solution that gets rid of the ugly newline issue (tons of whitespace), and it only uses standard libraries unlike most other implementations.
import xml.etree.ElementTree as ET
import xml.dom.minidom
import os
def pretty_print_xml_given_root(root, output_xml):
"""
Useful for when you are editing xml data on the fly
"""
xml_string = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml()
xml_string = os.linesep.join([s for s in xml_string.splitlines() if s.strip()]) # remove the weird newline issue
with open(output_xml, "w") as file_out:
file_out.write(xml_string)
def pretty_print_xml_given_file(input_xml, output_xml):
"""
Useful for when you want to reformat an already existing xml file
"""
tree = ET.parse(input_xml)
root = tree.getroot()
pretty_print_xml_given_root(root, output_xml)
I found how to fix the common newline issue here.

XML pretty print for python looks pretty good for this task. (Appropriately named, too.)
An alternative is to use pyXML, which has a PrettyPrint function.

You can use popular external library xmltodict, with unparse and pretty=True you will get best result:
xmltodict.unparse(
xmltodict.parse(my_xml), full_document=False, pretty=True)
full_document=False against <?xml version="1.0" encoding="UTF-8"?> at the top.

Take a look at the vkbeautify module.
It is a python version of my very popular javascript/nodejs plugin with the same name. It can pretty-print/minify XML, JSON and CSS text. Input and output can be string/file in any combinations. It is very compact and doesn't have any dependency.
Examples:
import vkbeautify as vkb
vkb.xml(text)
vkb.xml(text, 'path/to/dest/file')
vkb.xml('path/to/src/file')
vkb.xml('path/to/src/file', 'path/to/dest/file')

You can try this variation...
Install BeautifulSoup and the backend lxml (parser) libraries:
user$ pip3 install lxml bs4
Process your XML document:
from bs4 import BeautifulSoup
with open('/path/to/file.xml', 'r') as doc:
for line in doc:
print(BeautifulSoup(line, 'lxml-xml').prettify())

An alternative if you don't want to have to reparse, there is the xmlpp.py library with the get_pprint() function. It worked nice and smoothly for my use cases, without having to reparse to an lxml ElementTree object.

I found a fast and easy way to nicely format and print an xml file:
import xml.etree.ElementTree as ET
xmlTree = ET.parse('your XML file')
xmlRoot = xmlTree.getroot()
xmlDoc = ET.tostring(xmlRoot, encoding="unicode")
print(xmlDoc)
Outuput:
<root>
<child>
<subchild>.....</subchild>
</child>
<child>
<subchild>.....</subchild>
</child>
...
...
...
<child>
<subchild>.....</subchild>
</child>
</root>

I had this problem and solved it like this:
def write_xml_file (self, file, xml_root_element, xml_declaration=False, pretty_print=False, encoding='unicode', indent='\t'):
pretty_printed_xml = etree.tostring(xml_root_element, xml_declaration=xml_declaration, pretty_print=pretty_print, encoding=encoding)
if pretty_print: pretty_printed_xml = pretty_printed_xml.replace(' ', indent)
file.write(pretty_printed_xml)
In my code this method is called like this:
try:
with open(file_path, 'w') as file:
file.write('<?xml version="1.0" encoding="utf-8" ?>')
# create some xml content using etree ...
xml_parser = XMLParser()
xml_parser.write_xml_file(file, xml_root, xml_declaration=False, pretty_print=True, encoding='unicode', indent='\t')
except IOError:
print("Error while writing in log file!")
This works only because etree by default uses two spaces to indent, which I don't find very much emphasizing the indentation and therefore not pretty. I couldn't ind any setting for etree or parameter for any function to change the standard etree indent. I like how easy it is to use etree, but this was really annoying me.

For converting an entire xml document to a pretty xml document
(ex: assuming you've extracted [unzipped] a LibreOffice Writer .odt or .ods file, and you want to convert the ugly "content.xml" file to a pretty one for automated git version control and git difftooling of .odt/.ods files, such as I'm implementing here)
import xml.dom.minidom
file = open("./content.xml", 'r')
xml_string = file.read()
file.close()
parsed_xml = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = parsed_xml.toprettyxml()
file = open("./content_new.xml", 'w')
file.write(pretty_xml_as_string)
file.close()
References:
- Thanks to Ben Noland's answer on this page which got me most of the way there.

from lxml import etree
import xml.dom.minidom as mmd
xml_root = etree.parse(xml_fiel_path, etree.XMLParser())
def print_xml(xml_root):
plain_xml = etree.tostring(xml_root).decode('utf-8')
urgly_xml = ''.join(plain_xml .split())
good_xml = mmd.parseString(urgly_xml)
print(good_xml.toprettyxml(indent=' ',))
It's working well for the xml with Chinese!

If for some reason you can't get your hands on any of the Python modules that other users mentioned, I suggest the following solution for Python 2.7:
import subprocess
def makePretty(filepath):
cmd = "xmllint --format " + filepath
prettyXML = subprocess.check_output(cmd, shell = True)
with open(filepath, "w") as outfile:
outfile.write(prettyXML)
As far as I know, this solution will work on Unix-based systems that have the xmllint package installed.

I found this question while looking for "how to pretty print html"
Using some of the ideas in this thread I adapted the XML solutions to work for XML or HTML:
from xml.dom.minidom import parseString as string_to_dom
def prettify(string, html=True):
dom = string_to_dom(string)
ugly = dom.toprettyxml(indent=" ")
split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
if html:
split = split[1:]
pretty = '\n'.join(split)
return pretty
def pretty_print(html):
print(prettify(html))
When used this is what it looks like:
html = """\
<div class="foo" id="bar"><p>'IDK!'</p><br/><div class='baz'><div>
<span>Hi</span></div></div><p id='blarg'>Try for 2</p>
<div class='baz'>Oh No!</div></div>
"""
pretty_print(html)
Which returns:
<div class="foo" id="bar">
<p>'IDK!'</p>
<br/>
<div class="baz">
<div>
<span>Hi</span>
</div>
</div>
<p id="blarg">Try for 2</p>
<div class="baz">Oh No!</div>
</div>

Use etree.indent and etree.tostring
import lxml.etree as etree
root = etree.fromstring('<html><head></head><body><h1>Welcome</h1></body></html>')
etree.indent(root, space=" ")
xml_string = etree.tostring(root, pretty_print=True).decode()
print(xml_string)
output
<html>
<head/>
<body>
<h1>Welcome</h1>
</body>
</html>
Removing namespaces and prefixes
import lxml.etree as etree
def dump_xml(element):
for item in element.getiterator():
item.tag = etree.QName(item).localname
etree.cleanup_namespaces(element)
etree.indent(element, space=" ")
result = etree.tostring(element, pretty_print=True).decode()
return result
root = etree.fromstring('<cs:document xmlns:cs="http://blabla.com"><name>hello world</name></cs:document>')
xml_string = dump_xml(root)
print(xml_string)
output
<document>
<name>hello world</name>
</document>

I solved this with some lines of code, opening the file, going trough it and adding indentation, then saving it again. I was working with small xml files, and did not want to add dependencies, or more libraries to install for the user. Anyway, here is what I ended up with:
f = open(file_name,'r')
xml = f.read()
f.close()
#Removing old indendations
raw_xml = ''
for line in xml:
raw_xml += line
xml = raw_xml
new_xml = ''
indent = ' '
deepness = 0
for i in range((len(xml))):
new_xml += xml[i]
if(i<len(xml)-3):
simpleSplit = xml[i:(i+2)] == '><'
advancSplit = xml[i:(i+3)] == '></'
end = xml[i:(i+2)] == '/>'
start = xml[i] == '<'
if(advancSplit):
deepness += -1
new_xml += '\n' + indent*deepness
simpleSplit = False
deepness += -1
if(simpleSplit):
new_xml += '\n' + indent*deepness
if(start):
deepness += 1
if(end):
deepness += -1
f = open(file_name,'w')
f.write(new_xml)
f.close()
It works for me, perhaps someone will have some use of it :)

Extracting nested namespace from a xml using lxml

I'm new to Python and currently learning to parse XML. All seems to be going well until I hit a wall with nested namespaces.
Below is an snippet of my xml ( with a beginning and child element that I'm trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
-<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
-------------
-<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>
What I'm need is a solution to extract the URI of the local name 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#". I looked through a lot of tutorials but cannot seems to find a solution.
If there's anyone out there can lend your expertise, it would be much appreciated.
Here what I did so far with the help from the two contributors:
#!/usr/bin/env python
from xml.etree import ElementTree as ET #import ElementTree module as an alias ET
from lxml import objectify, etree
def parse():
import os
import sys
cpl_file = sys.argv[1]
xml_file = os.path.abspath(__file__)
xml_file = os.path.dirname(xml_file)
xml_file = os.path.join(xml_file,cpl_file)
with open(xml_file)as f:
xml = f.read()
tree = etree.XML(xml)
caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace
print caption_namespace
print tree.nsmap
nsmap = {}
for ns in tree.xpath('//namespace::*'):
if ns[0]:
nsmap[ns[0]] = ns[1]
tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)
return nsmap
if __name__=="__main__":
parse()
But it's not working so far. I got the result 'None' when I used QName to locate the tag and its namespace. And when I try to locate all namespace in the XML using for loop as suggested in another post, I got the error 'Unknown return type: dict'
Any suggestions pls?

This program prints the namespace of the indicated tag:
from lxml import etree
xml = etree.XML('''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')
print etree.QName(xml.find('.//{*}MainClosedCaption')).namespace
Result:
http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
Reference: http://lxml.de/tutorial.html#namespaces

python lxml with py2exe

I have Generated an XML with dom and i want to use lxml to pretty print the xml.
this is my code for pretty print the xml
def prettify_xml(xml_str):
import lxml.etree as etree
root = etree.fromstring(xml_str)
xml_str = etree.tostring(root, pretty_print=True)
return xml_str
my output should be an xml formatted string.
I got this code from some post in stactoverflow. This works flawlessly when i am compiling wit python itself. But when i convert my project to a binary created from py2exe (my binary is windows service with a namedpipe).I had two problems:
My service was not starting , i solved this by adding lxml.etree in includes option in py2exe function. then on my service started properly.
when xml generation in called here, is the error which I am seeing in my log
'module' object has no attribute 'fromstring'
where do i rectify this error ? And Is my first problem's solution correct ?
my xml generation Code :
from xml.etree import ElementTree
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring, XML
import lxml.etree
def prettify_xml(xml_str):
root = lxml.etree.fromstring(xml_str)
xml_str = lxml.etree.tostring(root, pretty_print=True)
return xml_str
def dll_xml(status):
try:
xml_declaration = '<?xml version="1.0" standalone="no" ?>'
rootTagName='response'
root = Element(rootTagName)
root.set('id' , 'rp001')
parent = SubElement(root, 'command', opcode ='-ac')
# Create children
chdtag1Name = 'mode'
chdtag1Value = 'repreport'
chdtag2Name='status'
chdtag2Value = status
fullchildtag1 = ''+chdtag1Name+' value = "'+chdtag1Value+'"'
fullchildtag2=''+chdtag2Name+' value="'+chdtag2Value+'"'
children = XML('''<root><'''+fullchildtag1+''' /><'''+fullchildtag2+'''/></root> ''')
# Add parent
parent.extend(children)
dll_xml_doc = xml_declaration + tostring(root)
dll_xml_doc = prettify_xml(dll_xml_doc)
return dll_xml_doc
except Exception , error:
log.error("xml_generation_failed : %s" % error)

Try to use PyInstaller instead py2exe. I converted your program to binary .exe with no problem just by running python pyinstaller.py YourPath\xml_a.py.

Python xml etree DTD from a StringIO source?

I'm adapting the following code (created via advice in this question), that took an XML file and it's DTD and converted them to a different format. For this problem only the loading section is important:
xmldoc = open(filename)
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(xmldoc, parser)
This worked fine, whilst using the file system, but I'm converting it to run via a web framework, where the two files are loaded via a form.
Loading the xml file works fine:
tree = etree.parse(StringIO(data['xml_file'])
But as the DTD is linked to in the top of the xml file, the following statement fails:
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(StringIO(data['xml_file'], parser)
Via this question, I tried:
etree.DTD(StringIO(data['dtd_file'])
tree = etree.parse(StringIO(data['xml_file'])
Whilst the first line doesn't cause an error, the second falls over on unicode entities the DTD is meant to pick up (and does so in the file system version):
XMLSyntaxError: Entity 'eacute' not
defined, line 4495, column 46
How do I go about correctly loading this DTD?

Here's a short but complete example, using the custom resolver technique #Steven mentioned.
from StringIO import StringIO
from lxml import etree
data = dict(
xml_file = '''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>ézz</y></x>
''',
dtd_file = '''<!ENTITY eacute "é">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''')
class DTDResolver(etree.Resolver):
def resolve(self, url, id, context):
return self.resolve_string(data['dtd_file'], context)
xmldoc = StringIO(data['xml_file'])
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
try:
tree = etree.parse(xmldoc, parser)
except etree.XMLSyntaxError as e:
# handle xml and validation errors

You could probably use a custom resolver. The docs actually give an example of doing this to provide a dtd.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I validate xml against a DTD file in Python - python

I need to validate an XML string (and not a file) against a DTD description file. How can that be done in python?

Related

How to write an xml file using libxml2 in python?

Writing "pretty printed" indented xml to an xml file results in wrong indentation levels [duplicate]

Extracting nested namespace from a xml using lxml

python lxml with py2exe

Python xml etree DTD from a StringIO source?

Categories

Resources