Reading XML DOCTYPE info with Python - python

I need to read the version of an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE twReport [
<!ELEMENT twReport (twHead?, (twWarn | twDebug | twInfo)*, twBody, twSum?,
twDebug*, twFoot?, twClientInfo?)>
<!ATTLIST twReport version CDATA "10,4"> <----- VERSION INFO HERE
I use xml.dom.minidom to parse the XML file, and I need to read the version that is declared in the embedded DTD.
Can I use xml.dom.minidom for this purpose?
Is there any Python XML parser that can do this?

How about xmlproc's DTD api?
Here's a random snippet of code I wrote years and years ago to do some work with DTDs from Python, which might give you an idea of what it's like to work with this library:
# Python 2 code; xmlproc was distributed with the PyXML package
from xml.parsers.xmlproc import dtdparser
attr_separator = '_'
child_separator = '_'
dtd = dtdparser.load_dtd('schedule.dtd')
for name, element in dtd.elems.items():
    # attributes declared for this element
    for attr in element.attrlist:
        output = '%s%s%s = ' % (name, attr_separator, attr)
        print output
    # child elements that are valid from the element's start state
    for child in element.get_valid_elements(element.get_start_state()):
        output = '%s%s%s = ' % (name, child_separator, child)
        print output
(FYI, this was the first result when searching for "python dtd parser")

Because both standard library XML libraries (xml.dom.minidom and xml.etree) use the same parser (xml.parsers.expat), you are limited in the "quality" of XML data you can parse successfully: expat rejects anything that is not well-formed.
You're better off using one of the tried-and-true third-party modules such as lxml or BeautifulSoup, which are more tolerant of errors and should also give you what you are looking for with little trouble.
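If only the default value from the embedded DTD is needed, the expat bindings in the standard library may already be enough, since expat reports ATTLIST declarations from the internal subset through its AttlistDeclHandler callback. The following is a minimal sketch rather than a definitive solution: SAMPLE is a shortened, hypothetical stand-in for the file in the question (its content model is simplified), and dtd_attribute_defaults is a helper name made up for this example.

```python
import xml.parsers.expat

# Hypothetical, shortened stand-in for the report file from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE twReport [
<!ELEMENT twReport EMPTY>
<!ATTLIST twReport version CDATA "10,4">
]>
<twReport/>
"""


def dtd_attribute_defaults(xml_bytes):
    """Collect default attribute values declared in the internal DTD subset."""
    defaults = {}

    def on_attlist(elname, attname, attr_type, default, required):
        # Called once per attribute declared in an <!ATTLIST ...> declaration.
        defaults[(elname, attname)] = default

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return defaults


defaults = dtd_attribute_defaults(SAMPLE)
print(defaults[("twReport", "version")])
```

The returned dictionary is keyed by (element name, attribute name), so the version string from the DTD is available as defaults[("twReport", "version")].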

Related

how to build xml file in python, with formatting

I'm trying to build an XML file in Python so I can write it out to a file, but I'm running into problems with newlines, tabs, etc.
I cannot use a module to do this, because I'm using a cut-down version of Python 2. It must all be in pure Python.
For instance, how can I create an XML file with this type of formatting, which keeps all the newlines and tabs (whitespace)?
e.g.
<?xml version="1.0" encoding="UTF-8"?>
<myfiledata>
<mydata>
blahblah
</mydata>
</myfiledata>
I've tried enclosing each line
' <myfiledata>' +\n
' blahblah' +\n
etc.
However, the output I'm getting from the script is nowhere close to how it looks in my Python file: there is extra whitespace and the newlines aren't working properly.
Is there any definitive way to do this? I would rather edit a file that looks somewhat like what I will end up with, for clarity's sake.
You can use XMLGenerator from xml.sax.saxutils to generate the XML and xml.dom.minidom to parse it and pretty-print it (both modules are in the Python 2 standard library).
Sample code creating an XML document and pretty-printing it:
from __future__ import print_function
from xml.sax.saxutils import XMLGenerator
import io
import xml.dom.minidom
def pprint_xml_string(s):
    """Pretty-print an XML string with minidom"""
    parsed = xml.dom.minidom.parse(io.BytesIO(s))
    return parsed.toprettyxml()
# create an XML file in memory:
fp = io.BytesIO()
xg = XMLGenerator(fp)
xg.startDocument()
xg.startElement('root', {})
xg.startElement('subitem', {})
xg.characters('text content')
xg.endElement('subitem')
xg.startElement('subitem', {})
xg.characters('text content for another subitem')
xg.endElement('subitem')
xg.endElement('root')
xg.endDocument()
# pretty-print it
xml_string = fp.getvalue()
pretty_xml = pprint_xml_string(xml_string)
print(pretty_xml)
Output is:
<?xml version="1.0" ?>
<root>
	<subitem>text content</subitem>
	<subitem>text content for another subitem</subitem>
</root>
Note that the text content elements (wrapped in <subitem> tags) aren't indented because doing so would change their content (XML doesn't ignore whitespace like HTML does).
The answer was to use xml.etree.ElementTree together with from xml.dom import minidom, both of which are available in Python 2.5.
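A minimal sketch of that combination, assuming the element names from the question and written for a current Python 3 interpreter: ElementTree builds the tree, and minidom is used only to add indentation. The four-space indent is an arbitrary choice for the example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Build the tree from the question with ElementTree.
root = ET.Element("myfiledata")
child = ET.SubElement(root, "mydata")
child.text = "blahblah"

# Serialize to bytes, then let minidom re-parse and indent the result.
rough = ET.tostring(root, encoding="utf-8")
pretty = minidom.parseString(rough).toprettyxml(indent="    ")
print(pretty)
```

Because the indentation is added at serialization time, the Python source does not have to mirror the layout of the output file.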

extracting result text from xml using Python

I have the following XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:result xmlns:ns2="http://ws.def.com/">
<ns3:value>QWESW12323D2412123S</ns3:value>
</ns3:result>
and want to parse it with Python and extract the text. I tried the following:
from xml.etree import ElementTree as etree
xml = etree.fromstring(data)
item = xml.find('ns3:value')
print item
but I get an empty item. Could someone help me achieve this with Python?
Use the syntax '{namespace-uri}value' (the full namespace URI in braces, not the prefix) to apply the namespace. However, as the ns3 prefix isn't declared (only ns2 is), I don't think this is actually valid XML.
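As an illustration, here is a small sketch that assumes the document was meant to declare ns3 instead of ns2 (that correction is an assumption, not something stated in the question). It shows both the braces form and the prefix-mapping form that ElementTree accepts.

```python
import xml.etree.ElementTree as ET

# Assumption: the root declares xmlns:ns3 (the original declares only ns2).
data = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:result xmlns:ns3="http://ws.def.com/">
<ns3:value>QWESW12323D2412123S</ns3:value>
</ns3:result>"""

root = ET.fromstring(data)

# Form 1: the full namespace URI in braces.
item = root.find("{http://ws.def.com/}value")

# Form 2: a prefix in the path plus a prefix-to-URI mapping.
ns = {"ns3": "http://ws.def.com/"}
same_item = root.find("ns3:value", ns)

print(item.text)
```

The prefix used in the mapping does not have to match the one in the file; only the URI matters for the lookup.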

How to (push) parse XML files in Python?

I've already seen this question, but it's from 2009.
What's a simple modern way to handle XML files in Python 3?
I.e., from this TLD (adapted from here):
<?xml version="1.0" encoding="UTF-8" ?>
<taglib>
<tlib-version>1.0</tlib-version>
<short-name>bar-baz</short-name>
<tag>
<name>present</name>
<tag-class>condpkg.IfSimpleTag</tag-class>
<body-content>scriptless</body-content>
<attribute>
<name>test</name>
<required>true</required>
<rtexprvalue>true</rtexprvalue>
</attribute>
</tag>
</taglib>
I want to parse TLD files (JavaServer Pages Tag Library Descriptors) to obtain some sort of structure in Python (I still have to decide about that part).
Hence, I need a push parser. But I won't do much more with it, so I'd prefer a simple API (I'm new to Python).
xml.etree.ElementTree is still there, in the standard library:
import xml.etree.ElementTree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
If you look outside the standard library, there is the very popular and fast lxml module, which follows the ElementTree interface and supports Python 3:
from lxml import etree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
Besides, there is lxml.objectify that allows you to deal with XML structure like with a Python object.
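The two snippets above build the whole tree in one call. If the data really has to be pushed into the parser piece by piece, the standard library also offers xml.etree.ElementTree.XMLPullParser (Python 3.4 and later). The sketch below is only an illustration: the chunks list is a made-up stand-in for data arriving in parts, and the document is a cut-down version of the TLD from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical chunks, e.g. as they might arrive from a file or socket.
chunks = [
    b"<taglib><tag><name>pre",
    b"sent</name><tag-class>condpkg.IfSimpleTag</tag-class>",
    b"</tag></taglib>",
]

parser = ET.XMLPullParser(events=("end",))
tag_names = []
for chunk in chunks:
    parser.feed(chunk)  # push the next piece of data into the parser
    for event, elem in parser.read_events():
        # An "end" event fires once an element has been read completely.
        if elem.tag == "name":
            tag_names.append(elem.text)
parser.close()

print(tag_names)
```

Elements are only reported once they are complete, so text that is split across two chunks still arrives in one piece.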

Python ElementTree - print out namespace definitions?

I'm using Python's elementtree to parse some XML configuration files.
At the top of the file, I have a root element like this:
<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:foo="http://ns.au.firm.com/foo.xsd"
xmlns:bar="http://ns.au.firm.com/bar.xsd"
>
The problem is, the bar namespace can be set to one of two different XSDs, depending on the version of the configuration file.
I'm looking for a way to print out the namespace mapping using ElementTree, so I can check which of the two XSDs is being used - then I can get my code to handle the correct case.
Is there a way to print out all the namespace definitions out using Python?
Cheers,
Victor
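What you have is not well-formed XML (the sgx prefix is never declared). I don't think xml.etree exposes the namespace map on an element, but you should be able to get it with lxml:
import lxml.etree as et
tree = et.XML(yourxml)  # yourxml: the document as a string, with every prefix declared
print(tree.nsmap)  # dict mapping each prefix to its namespace URI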
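The standard library can also report the prefix-to-URI mapping, although not as an attribute on the element: xml.etree.ElementTree.iterparse emits a "start-ns" event for each namespace declaration it encounters. A minimal sketch, assuming a well-formed version of the file; the sgx URI below is an invented placeholder, and namespace_map is a helper name made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the sgx prefix is declared; its URI here is a placeholder.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
    xmlns:sgx="http://example.com/sgx"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:foo="http://ns.au.firm.com/foo.xsd"
    xmlns:bar="http://ns.au.firm.com/bar.xsd"
/>
"""


def namespace_map(source):
    """Return a {prefix: uri} dict for all namespace declarations in source."""
    nsmap = {}
    # With only "start-ns" requested, each event carries a (prefix, uri) tuple.
    for event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        nsmap[prefix] = uri
    return nsmap


nsmap = namespace_map(io.BytesIO(SAMPLE))
print(nsmap["bar"])
```

Checking nsmap["bar"] against the two known XSD URIs would then select the right code path for each configuration version.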

Use Python to edit XML header

I've written a Python script to create some XML, but I didn't find a way to edit the heading within Python.
Basically, instead of having this as my heading:
<?xml version="1.0" ?>
I need to have this:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
I looked into this as best I could, but found no way to add standalone status from within Python. So I figured that I'd have to go over the file after it had been created and then replace the text. I read in several places that I should stay far away from using readlines() because it could ruin the XML formatting.
Currently the code I have to do this - which I got from another Stackoverflow post about editing XML with Python - is:
doc = parse('file.xml')
elem = doc.find('xml version="1.0"')
elem.text = 'xml version="1.0" encoding="UTF-8" standalone="no"'
That provides me with a KeyError. I've tried to debug it to no avail, which leads me to believe that perhaps the XML heading wasn't meant to be edited in this way. Or my code is just wrong.
If anyone is curious (or miraculously knows how to insert standalone status into Python code), here is my code to write the xml file:
with open('file.xml', 'w') as f:
    f.write(doc.toprettyxml(indent=' '))
Some people don't like "toprettyxml", but with my relatively basic level, it seemed like the best bet.
Anyway, if anyone can provide some advice or insight, I would be very grateful.
The xml.etree API does not give you any options to write out a standalone attribute in the XML declaration.
You may want to use the lxml library instead; it uses the same ElementTree API, but offers more control over the output. tostring() supports a standalone flag:
from lxml import etree
etree.tostring(doc, pretty_print=True, standalone=False)
or use .write(), which supports the same options:
doc.write(outputfile, pretty_print=True, standalone=False)
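If adding lxml is not an option, one workaround with minidom is to write the declaration by hand and serialize only the root element. This is a sketch under assumptions: the document content is a made-up placeholder, and the declaration string is simply the one requested in the question. Python 3.9 and later also accept a standalone argument to toprettyxml(), which may make the workaround unnecessary there.

```python
from xml.dom import minidom

# Placeholder document standing in for the XML built by the script.
doc = minidom.parseString("<root><item>text</item></root>")

# Write the declaration by hand, then serialize only the root element
# so that minidom does not emit its own declaration.
declaration = '<?xml version="1.0" encoding="UTF-8" standalone="no" ?>\n'
body = doc.documentElement.toprettyxml(indent="  ")
xml_text = declaration + body

print(xml_text)
```

When writing xml_text to disk, the file should be opened with encoding="utf-8" so that the bytes match what the declaration claims.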
