How to produce XML output? [duplicate] - python

I'm creating a web API and need a good way to very quickly generate some well-formatted XML. I cannot find any good way of doing this in Python.
Note: Some libraries look promising but either lack documentation or only output to files.

ElementTree is a good module for reading XML, and for writing it too, e.g.:
from xml.etree.ElementTree import Element, SubElement, tostring
root = Element('root')
child = SubElement(root, "child")
child.text = "I am a child"
print(tostring(root, encoding='unicode'))
Output:
<root><child>I am a child</child></root>
See this tutorial for more details and how to pretty print.
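For a web API you usually want the document as a string in memory rather than as a file. A minimal sketch of that, assuming Python 3; the element and attribute names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Build a small tree in memory (names are illustrative only).
root = ET.Element("root")
child = ET.SubElement(root, "child", {"id": "1"})
child.text = "a & b"  # special characters are escaped on serialization

as_bytes = ET.tostring(root)                     # bytes, ASCII-encoded by default
as_text = ET.tostring(root, encoding="unicode")  # str, no XML declaration

print(as_text)
```

Either value can be returned from a request handler directly; nothing is written to disk.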
Alternatively if your XML is simple, do not underestimate the power of string formatting :)
xmlTemplate = """<root>
<person>
<name>%(name)s</name>
<address>%(address)s</address>
</person>
</root>"""
data = {'name': 'anurag', 'address': 'Pune, india'}
print(xmlTemplate % data)
Output:
<root>
<person>
<name>anurag</name>
<address>Pune, india</address>
</person>
</root>
You can use string.Template or some template engine too, for complex formatting.
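One caveat with plain string formatting is that the values are not escaped, so an & or < in the data yields malformed XML. A minimal sketch of one way around that, assuming the standard library's xml.sax.saxutils.escape and made-up data:

```python
from xml.sax.saxutils import escape

xml_template = """<root>
<person>
<name>%(name)s</name>
<address>%(address)s</address>
</person>
</root>"""

data = {"name": "Tom & Jerry", "address": "1 <Main> Street"}

# Escape &, < and > in every value before substituting it into the template.
safe_data = {key: escape(value) for key, value in data.items()}
xml_text = xml_template % safe_data
print(xml_text)
```

Values that end up inside attributes would need quoteattr from the same module instead.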

Using lxml:
from lxml import etree
# create XML
root = etree.Element('root')
root.append(etree.Element('child'))
# another child with text
child = etree.Element('child')
child.text = 'some text'
root.append(child)
# pretty string
s = etree.tostring(root, pretty_print=True, encoding='unicode')
print(s)
Output:
<root>
  <child/>
  <child>some text</child>
</root>
See the tutorial for more information.

I would use the yattag library.
from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('food'):
    with tag('name'):
        text('French Breakfast')
    with tag('price', currency='USD'):
        text('6.95')
    with tag('ingredients'):
        for ingredient in ('baguettes', 'jam', 'butter', 'croissants'):
            with tag('ingredient'):
                text(ingredient)
print(doc.getvalue())
FYI I'm the author of the library.

Use lxml.builder class, from: http://lxml.de/tutorial.html#the-e-factory
import lxml.builder as lb
from lxml import etree
nstext = "new story"
story = lb.E.Asset(
    lb.E.Attribute(nstext, name="Name", act="set"),
    lb.E.Relation(lb.E.Asset(idref="Scope:767"),
                  name="Scope", act="set")
)
print('story:\n' + etree.tostring(story, pretty_print=True, encoding='unicode'))
Output:
story:
<Asset>
  <Attribute name="Name" act="set">new story</Attribute>
  <Relation name="Scope" act="set">
    <Asset idref="Scope:767"/>
  </Relation>
</Asset>

Another option, if you want to use pure Python:
ElementTree is good for most cases, but it cannot write CDATA sections, and it has no built-in pretty printing before Python 3.9.
So, if you need CDATA and pretty printing, you should use minidom:
minidom_example.py:
from xml.dom import minidom
doc = minidom.Document()
root = doc.createElement('root')
doc.appendChild(root)
leaf = doc.createElement('leaf')
text = doc.createTextNode('Text element with attributes')
leaf.appendChild(text)
leaf.setAttribute('color', 'white')
root.appendChild(leaf)
leaf_cdata = doc.createElement('leaf_cdata')
cdata = doc.createCDATASection('<em>CData</em> can contain <strong>HTML tags</strong> without encoding')
leaf_cdata.appendChild(cdata)
root.appendChild(leaf_cdata)
branch = doc.createElement('branch')
branch.appendChild(leaf.cloneNode(True))
root.appendChild(branch)
mixed = doc.createElement('mixed')
mixed_leaf = leaf.cloneNode(True)
mixed_leaf.setAttribute('color', 'black')
mixed_leaf.setAttribute('state', 'modified')
mixed.appendChild(mixed_leaf)
mixed_text = doc.createTextNode('Do not use mixed elements if it possible.')
mixed.appendChild(mixed_text)
root.appendChild(mixed)
xml_str = doc.toprettyxml(indent=" ")
with open("minidom_example.xml", "w") as f:
    f.write(xml_str)
minidom_example.xml:
<?xml version="1.0" ?>
<root>
<leaf color="white">Text element with attributes</leaf>
<leaf_cdata>
<![CDATA[<em>CData</em> can contain <strong>HTML tags</strong> without encoding]]> </leaf_cdata>
<branch>
<leaf color="white">Text element with attributes</leaf>
</branch>
<mixed>
<leaf color="black" state="modified">Text element with attributes</leaf>
Do not use mixed elements if it possible.
</mixed>
</root>

I've tried some of the solutions in this thread, and unfortunately, I found some of them to be cumbersome (i.e. requiring excessive effort when doing something non-trivial) and inelegant. Consequently, I thought I'd throw my preferred solution, web2py HTML helper objects, into the mix.
First, install the standalone web2py module:
pip install web2py
Unfortunately, the above installs an extremely antiquated version of web2py, but it'll be good enough for this example. The updated source is here.
Import web2py HTML helper objects documented here.
from gluon.html import *
Now, you can use web2py helpers to generate XML/HTML.
words = ['this', 'is', 'my', 'item', 'list']
# helper function
create_item = lambda idx, word: LI(word, _id = 'item_%s' % idx, _class = 'item')
# create the HTML
items = [create_item(idx, word) for idx,word in enumerate(words)]
ul = UL(items, _id = 'my_item_list', _class = 'item_list')
my_div = DIV(ul, _class = 'container')
>>> my_div
<gluon.html.DIV object at 0x00000000039DEAC8>
>>> my_div.xml()
# I added the line breaks for clarity
<div class="container">
<ul class="item_list" id="my_item_list">
<li class="item" id="item_0">this</li>
<li class="item" id="item_1">is</li>
<li class="item" id="item_2">my</li>
<li class="item" id="item_3">item</li>
<li class="item" id="item_4">list</li>
</ul>
</div>


Writing "pretty printed" indented xml to an xml file results in wrong indentation levels [duplicate]

What is the best way (or are the various ways) to pretty print XML in Python?
import xml.dom.minidom
dom = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()
lxml is recent, updated, and includes a pretty print function
import lxml.etree as etree
x = etree.parse("filename")
print(etree.tostring(x, pretty_print=True, encoding='unicode'))
Check out the lxml tutorial:
http://lxml.de/tutorial.html
Another solution is to borrow this indent function, for use with the ElementTree library that's built in to Python since 2.5.
Here's what that would look like:
from xml.etree import ElementTree
def indent(elem, level=0):
    i = "\n" + level*" "
    j = "\n" + (level-1)*" "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for subelem in elem:
            indent(subelem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = j
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = j
    return elem
root = ElementTree.parse('/tmp/xmlfile').getroot()
indent(root)
ElementTree.dump(root)
You have a few options.
xml.etree.ElementTree.indent()
Batteries included, simple to use, pretty output.
But requires Python 3.9+
import xml.etree.ElementTree as ET
element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
BeautifulSoup.prettify()
BeautifulSoup may be the simplest solution for Python < 3.9.
from bs4 import BeautifulSoup
bs = BeautifulSoup(open(xml_file), 'xml')
pretty_xml = bs.prettify()
print(pretty_xml)
Output:
<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<id>
1
</id>
<title>
Add Visual Studio 2005 and 2008 solution files
</title>
</issue>
</issues>
This is my go-to answer. The default arguments work as is, but text contents are spread out on separate lines as if they were nested elements.
lxml.etree.parse()
Prettier output but with arguments.
from lxml import etree
x = etree.parse(FILE_NAME)
pretty_xml = etree.tostring(x, pretty_print=True, encoding=str)
Produces:
<issues>
<issue>
<id>1</id>
<title>Add Visual Studio 2005 and 2008 solution files</title>
<details>We need Visual Studio 2005/2008 project files for Windows.</details>
</issue>
</issues>
This works for me with no issues.
xml.dom.minidom.parse()
No external dependencies but post-processing.
import os
import xml.dom.minidom as md
dom = md.parse(FILE_NAME)
# To parse string instead use: dom = md.parseString(xml_string)
pretty_xml = dom.toprettyxml()
# remove the weird newline issue:
pretty_xml = os.linesep.join([s for s in pretty_xml.splitlines()
                              if s.strip()])
The output is the same as above, but it's more code.
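The blank-line problem shows up when the parsed document is already indented, because minidom keeps those whitespace-only text nodes and then adds its own line breaks. A small sketch of the same filter on an in-memory string (the sample XML is made up):

```python
import xml.dom.minidom as md

# Input that is already indented, so it contains whitespace-only text nodes.
source = "<issues>\n  <issue>\n    <id>1</id>\n  </issue>\n</issues>"

raw = md.parseString(source).toprettyxml(indent="  ")

# Keep only the lines that contain something other than whitespace.
pretty = "\n".join(line for line in raw.splitlines() if line.strip())
print(pretty)
```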
Here's my (hacky?) solution to get around the ugly text node problem.
import re
uglyXml = doc.toprettyxml(indent=' ')
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)
print(prettyXml)
The above code will produce:
<?xml version="1.0" ?>
<issues>
<issue>
<id>1</id>
<title>Add Visual Studio 2005 and 2008 solution files</title>
<details>We need Visual Studio 2005/2008 project files for Windows.</details>
</issue>
</issues>
Instead of this:
<?xml version="1.0" ?>
<issues>
<issue>
<id>
1
</id>
<title>
Add Visual Studio 2005 and 2008 solution files
</title>
<details>
We need Visual Studio 2005/2008 project files for Windows.
</details>
</issue>
</issues>
Disclaimer: There are probably some limitations.
As of Python 3.9, ElementTree has an indent() function for pretty-printing XML trees.
See https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.indent.
Sample usage:
import xml.etree.ElementTree as ET
element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
The upside is that it does not require any additional libraries. For more information check https://bugs.python.org/issue14465 and https://github.com/python/cpython/pull/15200
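indent() also takes a space argument that sets the indentation unit. A minimal sketch, assuming Python 3.9 or later and a made-up document:

```python
import xml.etree.ElementTree as ET

element = ET.XML("<root><child>text</child><child/></root>")

# Use four spaces per level instead of the default two; the tree is modified in place.
ET.indent(element, space="    ")
pretty = ET.tostring(element, encoding="unicode")
print(pretty)
```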
As others pointed out, lxml has a pretty printer built in.
Be aware though that by default it changes CDATA sections to normal text, which can have nasty results.
Here's a Python function that preserves the input file and only changes the indentation (notice the strip_cdata=False). Furthermore it makes sure the output uses UTF-8 as encoding instead of the default ASCII (notice the encoding='utf-8'):
from lxml import etree
def prettyPrintXml(xmlFilePathToPrettyPrint):
    assert xmlFilePathToPrettyPrint is not None
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False)
    document = etree.parse(xmlFilePathToPrettyPrint, parser)
    document.write(xmlFilePathToPrettyPrint, pretty_print=True, encoding='utf-8')
Example usage:
prettyPrintXml('some_folder/some_file.xml')
If you have xmllint you can spawn a subprocess and use it. xmllint --format <file> pretty-prints its input XML to standard output.
Note that this method uses a program external to Python, which makes it sort of a hack.
import subprocess
def pretty_print_xml(xml):
    # xml must be bytes, because the pipes are opened in binary mode
    proc = subprocess.Popen(
        ['xmllint', '--format', '/dev/stdin'],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    (output, error_output) = proc.communicate(xml)
    return output
print(pretty_print_xml(data))
I tried to edit "ade"s answer above, but Stack Overflow wouldn't let me edit after I had initially provided feedback anonymously. This is a less buggy version of the function to pretty-print an ElementTree.
def indent(elem, level=0, more_sibs=False):
    i = "\n"
    if level:
        i += (level-1) * ' '
    num_kids = len(elem)
    if num_kids:
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
            if level:
                elem.text += ' '
        count = 0
        for kid in elem:
            indent(kid, level+1, count < num_kids - 1)
            count += 1
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
            if more_sibs:
                elem.tail += ' '
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
            if more_sibs:
                elem.tail += ' '
If you're using a DOM implementation, each has their own form of pretty-printing built-in:
# minidom
#
document.toprettyxml()
# 4DOM
#
xml.dom.ext.PrettyPrint(document, stream)
# pxdom (or other DOM Level 3 LS-compliant imp)
#
serializer.domConfig.setParameter('format-pretty-print', True)
serializer.writeToString(document)
If you're using something else without its own pretty-printer — or those pretty-printers don't quite do it the way you want —  you'd probably have to write or subclass your own serialiser.
I had some problems with minidom's pretty print. I'd get a UnicodeError whenever I tried pretty-printing a document with characters outside the given encoding, eg if I had a β in a document and I tried doc.toprettyxml(encoding='latin-1'). Here's my workaround for it:
def toprettyxml(doc, encoding):
    """Return a pretty-printed XML document in a given encoding."""
    unistr = doc.toprettyxml().replace(u'<?xml version="1.0" ?>',
                                       u'<?xml version="1.0" encoding="%s"?>' % encoding)
    return unistr.encode(encoding, 'xmlcharrefreplace')
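The key piece of this workaround is the xmlcharrefreplace error handler, which turns any character the target encoding cannot represent into a numeric character reference. A small standalone sketch of just that step (the sample string is made up):

```python
text = "<p>\u03b2-carotene</p>"  # contains a Greek beta, which latin-1 cannot encode

try:
    text.encode("latin-1")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# With xmlcharrefreplace the beta becomes the reference &#946; instead of raising.
encoded = text.encode("latin-1", "xmlcharrefreplace")
print(encoded)
```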
from yattag import indent
pretty_string = indent(ugly_string)
It won't add spaces or newlines inside text nodes, unless you ask for it with:
indent(mystring, indent_text = True)
You can specify what the indentation unit should be and what the newline should look like.
pretty_xml_string = indent(
    ugly_xml_string,
    indentation = ' ',
    newline = '\r\n'
)
The documentation is on the http://www.yattag.org homepage.
I wrote a solution to walk through an existing ElementTree and use text/tail to indent it as one typically expects.
def prettify(element, indent=' '):
    queue = [(0, element)] # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = '\n' + indent * (level+1) # for child open
        if queue:
            element.tail = '\n' + indent * queue[0][0] # for sibling open
        else:
            element.tail = '\n' + indent * (level-1) # for parent close
        queue[0:0] = children # prepend so children come before siblings
Here's a Python3 solution that gets rid of the ugly newline issue (tons of whitespace), and it only uses standard libraries unlike most other implementations.
import xml.etree.ElementTree as ET
import xml.dom.minidom
import os
def pretty_print_xml_given_root(root, output_xml):
    """
    Useful for when you are editing xml data on the fly
    """
    xml_string = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml()
    xml_string = os.linesep.join([s for s in xml_string.splitlines() if s.strip()]) # remove the weird newline issue
    with open(output_xml, "w") as file_out:
        file_out.write(xml_string)
def pretty_print_xml_given_file(input_xml, output_xml):
    """
    Useful for when you want to reformat an already existing xml file
    """
    tree = ET.parse(input_xml)
    root = tree.getroot()
    pretty_print_xml_given_root(root, output_xml)
I found how to fix the common newline issue here.
XML pretty print for python looks pretty good for this task. (Appropriately named, too.)
An alternative is to use pyXML, which has a PrettyPrint function.
You can use the popular external library xmltodict; unparse with pretty=True gives the best result:
import xmltodict
xmltodict.unparse(
    xmltodict.parse(my_xml), full_document=False, pretty=True)
full_document=False suppresses the <?xml version="1.0" encoding="UTF-8"?> declaration at the top.
Take a look at the vkbeautify module.
It is a python version of my very popular javascript/nodejs plugin with the same name. It can pretty-print/minify XML, JSON and CSS text. Input and output can be string/file in any combinations. It is very compact and doesn't have any dependency.
Examples:
import vkbeautify as vkb
vkb.xml(text)
vkb.xml(text, 'path/to/dest/file')
vkb.xml('path/to/src/file')
vkb.xml('path/to/src/file', 'path/to/dest/file')
You can try this variation...
Install BeautifulSoup and the backend lxml (parser) libraries:
user$ pip3 install lxml bs4
Process your XML document:
from bs4 import BeautifulSoup
with open('/path/to/file.xml', 'r') as doc:
    # parse the whole file at once; feeding it line by line would treat every line as its own document
    print(BeautifulSoup(doc, 'lxml-xml').prettify())
As an alternative, if you don't want to reparse, there is the xmlpp.py library with its get_pprint() function. It worked nicely and smoothly for my use cases, without having to reparse into an lxml ElementTree object.
I found a fast and easy way to nicely format and print an xml file:
import xml.etree.ElementTree as ET
xmlTree = ET.parse('your XML file')
xmlRoot = xmlTree.getroot()
xmlDoc = ET.tostring(xmlRoot, encoding="unicode")
print(xmlDoc)
Output:
<root>
<child>
<subchild>.....</subchild>
</child>
<child>
<subchild>.....</subchild>
</child>
...
...
...
<child>
<subchild>.....</subchild>
</child>
</root>
I had this problem and solved it like this:
def write_xml_file(self, file, xml_root_element, xml_declaration=False, pretty_print=False, encoding='unicode', indent='\t'):
    pretty_printed_xml = etree.tostring(xml_root_element, xml_declaration=xml_declaration, pretty_print=pretty_print, encoding=encoding)
    if pretty_print:
        # etree indents with two spaces per level; swap each pair for the chosen indent
        pretty_printed_xml = pretty_printed_xml.replace('  ', indent)
    file.write(pretty_printed_xml)
In my code this method is called like this:
try:
    with open(file_path, 'w') as file:
        file.write('<?xml version="1.0" encoding="utf-8" ?>')
        # create some xml content using etree ...
        xml_parser = XMLParser()
        xml_parser.write_xml_file(file, xml_root, xml_declaration=False, pretty_print=True, encoding='unicode', indent='\t')
except IOError:
    print("Error while writing in log file!")
This works only because etree uses two spaces to indent by default, which I find does not emphasize the indentation much and is therefore not pretty. I couldn't find any setting for etree or parameter for any function to change the standard etree indent. I like how easy it is to use etree, but this was really annoying me.
For converting an entire xml document to a pretty xml document
(ex: assuming you've extracted [unzipped] a LibreOffice Writer .odt or .ods file, and you want to convert the ugly "content.xml" file to a pretty one for automated git version control and git difftooling of .odt/.ods files, such as I'm implementing here)
import xml.dom.minidom
file = open("./content.xml", 'r')
xml_string = file.read()
file.close()
parsed_xml = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = parsed_xml.toprettyxml()
file = open("./content_new.xml", 'w')
file.write(pretty_xml_as_string)
file.close()
References:
- Thanks to Ben Noland's answer on this page which got me most of the way there.
import re
from lxml import etree
import xml.dom.minidom as mmd
xml_root = etree.parse(xml_file_path, etree.XMLParser())
def print_xml(xml_root):
    plain_xml = etree.tostring(xml_root).decode('utf-8')
    # drop whitespace between tags only, so attributes and text keep their spaces
    ugly_xml = re.sub(r'>\s+<', '><', plain_xml)
    good_xml = mmd.parseString(ugly_xml)
    print(good_xml.toprettyxml(indent=' ',))
It works well for XML containing Chinese text!
If for some reason you can't get your hands on any of the Python modules that other users mentioned, I suggest the following solution for Python 2.7:
import subprocess
def makePretty(filepath):
    # pass the arguments as a list so the path needs no shell quoting
    prettyXML = subprocess.check_output(["xmllint", "--format", filepath])
    with open(filepath, "w") as outfile:
        outfile.write(prettyXML)
As far as I know, this solution will work on Unix-based systems that have the xmllint package installed.
I found this question while looking for "how to pretty print html"
Using some of the ideas in this thread I adapted the XML solutions to work for XML or HTML:
from xml.dom.minidom import parseString as string_to_dom
def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent=" ")
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty
def pretty_print(html):
    print(prettify(html))
When used this is what it looks like:
html = """\
<div class="foo" id="bar"><p>'IDK!'</p><br/><div class='baz'><div>
<span>Hi</span></div></div><p id='blarg'>Try for 2</p>
<div class='baz'>Oh No!</div></div>
"""
pretty_print(html)
Which returns:
<div class="foo" id="bar">
<p>'IDK!'</p>
<br/>
<div class="baz">
<div>
<span>Hi</span>
</div>
</div>
<p id="blarg">Try for 2</p>
<div class="baz">Oh No!</div>
</div>
Use etree.indent and etree.tostring
import lxml.etree as etree
root = etree.fromstring('<html><head></head><body><h1>Welcome</h1></body></html>')
etree.indent(root, space=" ")
xml_string = etree.tostring(root, pretty_print=True).decode()
print(xml_string)
output
<html>
<head/>
<body>
<h1>Welcome</h1>
</body>
</html>
Removing namespaces and prefixes
import lxml.etree as etree
def dump_xml(element):
    # iter() replaces the deprecated getiterator()
    for item in element.iter():
        item.tag = etree.QName(item).localname
    etree.cleanup_namespaces(element)
    etree.indent(element, space=" ")
    result = etree.tostring(element, pretty_print=True).decode()
    return result
root = etree.fromstring('<cs:document xmlns:cs="http://blabla.com"><name>hello world</name></cs:document>')
xml_string = dump_xml(root)
print(xml_string)
output
<document>
<name>hello world</name>
</document>
I solved this with a few lines of code: opening the file, going through it and adding indentation, then saving it again. I was working with small XML files and did not want to add dependencies or more libraries for the user to install. Anyway, here is what I ended up with:
f = open(file_name,'r')
xml = f.read()
f.close()
#Removing old indentations
raw_xml = ''
for line in xml:
    raw_xml += line
xml = raw_xml
new_xml = ''
indent = ' '
deepness = 0
for i in range((len(xml))):
    new_xml += xml[i]
    if(i<len(xml)-3):
        simpleSplit = xml[i:(i+2)] == '><'
        advancSplit = xml[i:(i+3)] == '></'
        end = xml[i:(i+2)] == '/>'
        start = xml[i] == '<'
        if(advancSplit):
            deepness += -1
            new_xml += '\n' + indent*deepness
            simpleSplit = False
            deepness += -1
        if(simpleSplit):
            new_xml += '\n' + indent*deepness
        if(start):
            deepness += 1
        if(end):
            deepness += -1
f = open(file_name,'w')
f.write(new_xml)
f.close()
It works for me, perhaps someone will have some use of it :)

parse xml with lxml including namespace

I need to get some information from inside a specific tag using lxml.
The XML document looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd"
version="2.4">
<display-name>Community Bank</display-name>
<description>WebGoat for Cigital</description>
<context-param>
<param-name>PropertiesPath</param-name>
<param-value>/WEB-INF/properties.txt</param-value>
<description>This is the path to the properties file from the servlet root</description>
</context-param>
<servlet>
<servlet-name>Index</servlet-name>
<servlet-class>com.cigital.boi.servlet.index</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index</url-pattern>
</servlet-mapping>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index.html</url-pattern>
</servlet-mapping>
I want to read com.cigital.boi.servlet.index .
I have used this code to read everything under servlets
context = etree.parse(handle)
list = parser.xpath('//servlet')
print list
list contains nothing
more info : iterating over the context field i found these lines.
<Element {http://java.sun.com/xml/ns/j2ee}servlet-name at 2ad19e6eca48>
<Element {http://java.sun.com/xml/ns/j2ee}servlet-class at 2ad19e6ecaf8>
I think the output is an empty list because I did not include the namespace while searching.
Please suggest how to read "com.cigital.boi.servlet.index" from the servlet-class tag.
Try the following:
from lxml import etree
context = etree.parse(handle)
print(next(x.text for x in context.xpath('.//*[local-name()="servlet-class"]')))
Alternative:
from lxml import etree
context = etree.parse(handle)
nsmap = context.getroot().nsmap.copy()
nsmap['xmlns'] = nsmap.pop(None)
print(next(x.text for x in context.xpath('.//xmlns:servlet-class', namespaces=nsmap)))
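The same lookup can also be written with an explicit prefix-to-URI mapping. A minimal sketch using the standard library's ElementTree rather than lxml, on a cut-down version of the document from the question; the j2ee prefix is an arbitrary name chosen here:

```python
import xml.etree.ElementTree as ET

xml_text = """<web-app xmlns="http://java.sun.com/xml/ns/j2ee" version="2.4">
  <servlet>
    <servlet-name>Index</servlet-name>
    <servlet-class>com.cigital.boi.servlet.index</servlet-class>
  </servlet>
</web-app>"""

root = ET.fromstring(xml_text)

# Map a prefix of our choosing to the document's default namespace URI.
ns = {"j2ee": "http://java.sun.com/xml/ns/j2ee"}
servlet_class = root.find(".//j2ee:servlet-class", ns).text
print(servlet_class)
```

With lxml, the same dictionary can be passed to xpath() as namespaces=ns.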

Parsing large xml data using python's elementtree

I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.
My code is right below, and a bit of the xml data is after my code.
import xml.etree.ElementTree as ET
tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()
for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'): # all venue tags with id attribute
        print 'journal'
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>
<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>
You are using .fromstring() instead of .parse():
import xml.etree.ElementTree as ET
tree = ET.parse("C:\pbc.xml")
root = tree.getroot()
.fromstring() expects to be given the XML data in a bytestring, not a filename.
If the document is really large (many megabytes or more) then you should use the ET.iterparse() function instead and clear elements you have processed:
# the standard library's iterparse() takes no tag= keyword (lxml's does), so filter on the tag name
for event, article in ET.iterparse('C:\\pbc.xml'):
    if article.tag != 'article':
        continue
    for title in article.findall('title'):
        print('Title: {}'.format(title.text))
    for author in article.findall('author'):
        print('Author name: {}'.format(author.text))
    for journal in article.findall('journal'):
        print('journal')
    article.clear()
with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())
Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.
Also in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit
root = tree.getroot()
Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.
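With the standard library, the tag filter therefore has to be done by hand. A minimal sketch, assuming Python 3 and a small in-memory sample standing in for the real file:

```python
import io
import xml.etree.ElementTree as ET

sample = b"""<dblp>
<article key="persons/Codd71a"><author>E. F. Codd</author></article>
<article key="persons/Hall74"><author>Patrick A. V. Hall</author></article>
</dblp>"""

authors = []
# Only "end" events are requested, so each element is complete when it arrives.
for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "article":
        authors.extend(a.text for a in elem.findall("author"))
        elem.clear()  # drop the article's children to keep memory use low

print(authors)
```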
Try:
import xml.etree.ElementTree as ET
import htmlentitydefs
filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events = ('end', ), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print('Author name: {}'.format(author.text))
        for journal in elem.findall('journal'): # all venue tags with id attribute
            print(journal.text)
        elem.clear()
Note: To use iterparse your XML must be valid, which means among other things that there can not be empty lines at the beginning of the file.
It's better not to feed the XML file's meta-information to the parser. The parser does well when the tags are properly closed, but it may not recognize the <?xml declaration. So omit the first two lines and try again. :-)

ExpatError: junk after document element

I really don't know, what the Problem is? I get the following error:
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
ExpatError: junk after document element: line 5, column 0
I DONT SEE NO JUNK! Any help? I'm getting crazy......
text = """<questionaire>
<question>
<questiontext>Question1</questiontext>
<answer>Your Answer: 99</answer>
</question>
<question>
<questiontext>Question2</questiontext>
<answer>Your Answer: 64</answer>
</question>
<question>
<questiontext>Question3</questiontext>
<answer>Your Answer: 46</answer>
</question>
<question>
<questiontext>Bitte geben</questiontext>
<answer>Your Answer: 544</answer>
<answer>Your Answer: 943</answer>
</question>
</questionaire>"""
cleandata = text.split('<questionaire>')
cleandatastring= "".join(cleandata)
stripped = cleandatastring.strip()
planhtml = stripped.split('</questionaire>')[0]
clean= planhtml.strip()
from xml.dom import minidom
doc = minidom.parseString(clean)
for question in doc.getElementsByTagName('question'):
    for answer in question.getElementsByTagName('answer'):
        if answer.childNodes[0].nodeValue.strip() == 'Your Answer: 99':
            question.parentNode.removeChild(question)
print(doc.toxml())
Thanx!
Your original text string is well-formed XML. Then you do a bunch of stuff to it that breaks it. Parse your original text, and you will be fine.
XML is required to have exactly one top-level element. By the time you parse it, it has a number of top-level <question> tags. The XML parser is parsing the first one as a root element, and then is surprised to find another top-level element.
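A short sketch of that rule with minidom, using a made-up two-element fragment: sibling elements at the top level fail to parse, while the same elements inside a single root parse fine.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

fragment = "<question>1</question>\n<question>2</question>"

try:
    minidom.parseString(fragment)
    error_message = None
except ExpatError as exc:
    error_message = str(exc)  # expat reports the second top-level element as junk

# Wrapping the fragment in one root element makes it well-formed.
doc = minidom.parseString("<questionaire>" + fragment + "</questionaire>")
count = len(doc.getElementsByTagName("question"))
print(error_message, count)
```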
In my case it was caused by the changes made in libxml2-2.9.11 that made tostring() (lxml) return more content (what follows the element) than it should. E.g.
from lxml import etree
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
</b>
</a>
'''
t = etree.fromstring(xml.encode()).getroottree()
print(etree.tostring(
t.xpath('/a/b')[0],
encoding=t.docinfo.encoding,
).decode())
Expected output:
<b>
</b>
Actual output:
<b>
</b>
</a>
Should you pass the result to xml.dom.minidom.parseString(), it will complain.
More on it here.
To avoid this you either need libxml2 <= 2.9.10, or Alpine Linux >= 3.14.
