Python xml minidom. generate <text>Some text</text> element - python

I have the following code.
from xml.dom.minidom import Document
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
main = doc.createElement('Text')
root.appendChild(main)
text = doc.createTextNode('Some text here')
main.appendChild(text)
print doc.toprettyxml(indent='\t')
The result is:
<?xml version="1.0" ?>
<root>
    <Text>
        Some text here
    </Text>
</root>
This is all fine and dandy, but what if I want the output to look like this?
<?xml version="1.0" ?>
<root>
    <Text>Some text here</Text>
</root>
Can this easily be done?
Orjanp...

Can this easily be done?
It depends what exact rule you want, but generally no, you get little control over pretty-printing. If you want a specific format you'll usually have to write your own walker.
The DOM Level 3 LS parameter format-pretty-print in pxdom comes pretty close to your example. Its rule is that if an element contains only a single TextNode, no extra whitespace will be added. However it (currently) uses two spaces for an indent rather than four.
>>> doc= pxdom.parseString('<a><b>c</b></a>')
>>> doc.domConfig.setParameter('format-pretty-print', True)
>>> print doc.pxdomContent
<?xml version="1.0" encoding="utf-16"?>
<a>
<b>c</b>
</a>
(Adjust encoding and output format for whatever type of serialisation you're doing.)
If that's the rule you want, and you can get away with it, you might also be able to monkey-patch minidom's Element.writexml, eg.:
>>> from xml.dom import minidom
>>> def newwritexml(self, writer, indent= '', addindent= '', newl= ''):
...     if len(self.childNodes)==1 and self.firstChild.nodeType==3:
...         writer.write(indent)
...         self.oldwritexml(writer) # cancel extra whitespace
...         writer.write(newl)
...     else:
...         self.oldwritexml(writer, indent, addindent, newl)
...
>>> minidom.Element.oldwritexml= minidom.Element.writexml
>>> minidom.Element.writexml= newwritexml
All the usual caveats about the badness of monkey-patching apply.
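For the "write your own walker" route mentioned above, here is a minimal sketch rather than a complete serialiser. The function name pretty and its parameters are made up for illustration; it only handles elements, attributes and text (comments, CDATA and processing instructions are silently dropped), and it applies the same rule as above: an element whose only child is a text node stays on one line.

```python
from xml.dom import Node
from xml.dom.minidom import parseString
from xml.sax.saxutils import escape, quoteattr


def pretty(node, indent="    ", level=0):
    """Hypothetical helper: serialise a minidom element with custom rules."""
    pad = indent * level
    if node.nodeType == Node.TEXT_NODE:
        text = node.data.strip()
        return pad + escape(text) + "\n" if text else ""
    if node.nodeType != Node.ELEMENT_NODE:
        return ""  # sketch only: comments, PIs etc. are ignored
    attrs = "".join(
        " %s=%s" % (name, quoteattr(value))
        for name, value in node.attributes.items()
    )
    # ignore whitespace-only text nodes left over from earlier formatting
    children = [c for c in node.childNodes
                if not (c.nodeType == Node.TEXT_NODE and not c.data.strip())]
    tag = node.tagName
    if not children:
        return "%s<%s%s/>\n" % (pad, tag, attrs)
    if len(children) == 1 and children[0].nodeType == Node.TEXT_NODE:
        # single text child: keep it on the same line as the tags
        return "%s<%s%s>%s</%s>\n" % (pad, tag, attrs,
                                      escape(children[0].data), tag)
    body = "".join(pretty(c, indent, level + 1) for c in children)
    return "%s<%s%s>\n%s%s</%s>\n" % (pad, tag, attrs, body, pad, tag)


doc = parseString('<root><Text>Some text here</Text></root>')
print(pretty(doc.documentElement))
```

Because you own the walker, changing the indent width or the one-line rule is a local edit instead of a patch to minidom.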

I was looking for exactly the same thing, and I came across this post. (the indenting provided in xml.dom.minidom broke another tool that I was using to manipulate the XML, and I needed it to be indented) I tried the accepted solution with a slightly more complex example and this was the result:
In [1]: import pxdom
In [2]: xml = '<a><b>fda</b><c><b>fdsa</b></c></a>'
In [3]: doc = pxdom.parseString(xml)
In [4]: doc.domConfig.setParameter('format-pretty-print', True)
In [5]: print doc.pxdomContent
<?xml version="1.0" encoding="utf-16"?>
<a>
<b>fda</b><c>
<b>fdsa</b>
</c>
</a>
The pretty printed XML isn't formatted correctly, and I'm not too happy about monkey patching (i.e. I barely know what it means, and understand it's bad), so I looked for another solution.
I'm writing the output to a file, so I can use the xmlindent program for Ubuntu (sudo aptitude install xmlindent). I write the unformatted XML to the file, then call xmlindent from within the Python program:
from subprocess import Popen, PIPE
Popen(["xmlindent", "-i", "2", "-w", "-f", "-nbe", file_name],
stderr=PIPE,
stdout=PIPE).communicate()
The -w switch causes the file to be overwritten, but annoyingly leaves a backup file named e.g. "myfile.xml~", which you'll probably want to remove. The other switches are there to get the formatting right (for me).
P.S. xmlindent is a stream formatter, i.e. you can use it as follows:
cat myfile.xml | xmlindent > myfile_indented.xml
So you might be able to use it in a python script without writing to a file if you needed to.

This could be done with toxml(), using regular expressions to tidy things up.
>>> from xml.dom.minidom import Document
>>> import re
>>> doc = Document()
>>> root = doc.createElement('root')
>>> _ = doc.appendChild(root)
>>> main = doc.createElement('Text')
>>> _ = root.appendChild(main)
>>> text = doc.createTextNode('Some text here')
>>> _ = main.appendChild(text)
>>> out = doc.toxml()
>>> niceOut = re.sub(r'><', r'>\n<', re.sub(r'(<\/.*?>)', r'\1\n', out))
>>> print niceOut
<?xml version="1.0" ?>
<root>
<Text>Some text here</Text>
</root>

The pyxml package offers a simple solution to this by using the xml.dom.ext.PrettyPrint() function. It can also print to a file descriptor.
But the pyxml package is no longer maintained.
Oerjan Pettersen

This solution worked for me without monkey patching or ceasing to use minidom:
from xml.dom.ext import PrettyPrint
from StringIO import StringIO
def toprettyxml_fixed (node, encoding='utf-8'):
    tmpStream = StringIO()
    PrettyPrint(node, stream=tmpStream, encoding=encoding)
    return tmpStream.getvalue()
http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace/#best-solution

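The easiest way to do this is to use toprettyxml() and then remove the newlines and tabs between the tags and the text. That way you keep the indentation you asked for in your example.
import re
xml_output = doc.toprettyxml()
# collapse ">\n\t\ttext\n\t</" into ">text</" and leave everything else alone
nojunkintags = re.sub(r'>\s+([^<>\s][^<]*?)\s+</', r'>\1</', xml_output)
print(nojunkintags)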

Related

Need some help generating XML with Python

I have some variables in Python that I need to store as XML. I have been using the lxml module for this so far, but I am not very experienced with it. I have tried playing around with various tutorials and docs, but I am at a dead end and need some help.
Here is the python script:
root = etree.Element("root")
coins=etree.Element("coins")
doc=etree.ElementTree(coins)
coins.append(etree.Element("trader"))
coins.append(etree.Element("metal"))
coins.append(etree.Element("type"))
coins.append(etree.Element("price"))
coins[0].text="Gold.co.uk"
coins[0].attrib["variable"]=("GLDAG_MAPLE")
coins[1].text="Silver"
coins[2].text="Britannia"
coins[3].text=str(GLDAG_MAPLE)
doc.write('data.xml', pretty_print=True)
As of now it outputs this:
<coins>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
<price>
£31.20
</price>
</coins>
However I would like it to look like this:
<root>
<coin>
<trader> Gold.co.uk </trader>
<type> Britannia </type>
<price> £31.20 </price>
</coin>
</root>
The tags and their sub-tags would be duplicated for every type of coin. I have no idea how to construct the XML so that the output looks like the third code block. So far I have tried to follow other scripts that I have seen on GitHub and other sites and modify them to suit my needs, but my scripts keep failing or producing incorrect results for some reason.
If someone could help me out then that would be great!
You can simply append the Element to root:
from lxml import etree
coinItems = [
    {'trader': 'Gold.co.uk', 'metal': 'Silver', 'type': 'Britannia'},
    {'trader': 'copper.co.uk', 'metal': 'Copper', 'type': 'World'}
]
root = etree.Element("root")
for ci in coinItems:
    coin=etree.Element("coin")
    etree.SubElement(coin, "trader", {'variable': 'GLDAG_MAPLE'}).text = ci['trader'] # example how to use attributes!
    etree.SubElement(coin, "metal").text = ci['metal']
    etree.SubElement(coin, "type").text = ci['type']
    root.append(coin)
fName = '/tmp/data.xml'
with open(fName, 'wb') as f:
    # remove encoding here, in case you want escaped ASCII characters: £
    f.write(etree.tostring(root, xml_declaration=True, encoding="utf-8", pretty_print=True))
print(open(fName).read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<root>
<coin>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
</coin>
<coin>
<trader variable="GLDAG_MAPLE">copper.co.uk</trader>
<metal>Copper</metal>
<type>World</type>
</coin>
</root>
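If lxml is not available, roughly the same structure can be built with the standard library. This is only a sketch: build_coins is a hypothetical helper, the field names are taken from the question, and ET.indent() requires Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET


def build_coins(coin_items):
    """Hypothetical helper: one <coin> element per dict in coin_items."""
    root = ET.Element("root")
    for item in coin_items:
        coin = ET.SubElement(root, "coin")
        for field in ("trader", "metal", "type", "price"):
            if field in item:
                ET.SubElement(coin, field).text = item[field]
    ET.indent(root)  # Python 3.9+, default indent is two spaces
    # encoding="unicode" returns str and keeps characters such as £ as-is
    return ET.tostring(root, encoding="unicode")


print(build_coins([
    {"trader": "Gold.co.uk", "metal": "Silver",
     "type": "Britannia", "price": "£31.20"},
]))
```

Each text value stays on the same line as its tags, which is the layout the question asked for.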
I prefer using the lxml builder (https://lxml.de/api/lxml.builder.ElementMaker-class.html) because imho it is easier to see the structure of your XML document.
from lxml.builder import E
root = E.root(
    E.coin(
        E.trader("Gold.co.uk",
                 variable="GLDAG_MAPLE"),
        E.metal("silver"),
        E.price("£31.20")
    )
)
You can then append the root element to your main document.

ElementTree prettyprint blank lines [duplicate]

I've been using minidom.toprettyxml to prettify my XML file.
When I create the XML file and use this method, everything works great. But if I use it after modifying the XML file (for example, after adding nodes) and then write it back, I get empty lines, and each time I update the file I get more and more of them.
my code :
file.write(prettify(xmlRoot))
def prettify(elem):
    rough_string = xml.tostring(elem, 'utf-8')  # xml as ElementTree
    reparsed = mini.parseString(rough_string)  # mini as minidom
    return reparsed.toprettyxml(indent=" ")
and the result :
<?xml version="1.0" ?>
<testsuite errors="0" failures="3" name="TestSet_2013-01-23 14_28_00.510935" skip="0" tests="3" time="142.695" timestamp="2013-01-23 14:28:00.515460">
<testcase classname="TC test" name="t1" status="Failed" time="27.013"/>
<testcase classname="TC test" name="t2" status="Failed" time="78.325"/>
<testcase classname="TC test" name="t3" status="Failed" time="37.357"/>
</testsuite>
any suggestions ?
thanks.
I found a solution here: http://code.activestate.com/recipes/576750-pretty-print-xml/
Then I modified it to take a string instead of a file.
from xml.dom.minidom import parseString
pretty_print = lambda data: '\n'.join([line for line in parseString(data).toprettyxml(indent=' '*2).split('\n') if line.strip()])
Output:
<?xml version="1.0" ?>
<testsuite errors="0" failures="3" name="TestSet_2013-01-23 14_28_00.510935" skip="0" tests="3" time="142.695" timestamp="2013-01-23 14:28:00.515460">
<testcase classname="TC test" name="t1" status="Failed" time="27.013"/>
<testcase classname="TC test" name="t2" status="Failed" time="78.325"/>
<testcase classname="TC test" name="t3" status="Failed" time="37.357"/>
</testsuite>
This may help you work it into your function a little more easily:
def new_prettify():
    reparsed = parseString(CONTENT)
    print '\n'.join([line for line in reparsed.toprettyxml(indent=' '*2).split('\n') if line.strip()])
I found an easy solution for this problem, just with changing the last line
of your prettify() so it will be:
def prettify(elem):
    rough_string = xml.tostring(elem, 'utf-8')  # xml as ElementTree
    reparsed = mini.parseString(rough_string)  # mini as minidom
    return reparsed.toprettyxml(indent=" ", newl='')
use this to resolve problem with the lines
toprettyxml(indent=' ', newl='\r', encoding="utf-8")
I am having the same issue with Python 2.7 (32b) in a Windows 10 machine. The issue seems to be that when python parses an XML text to an ElementTree object, it adds some annoying line feeds to either the "text" or "tail" attributes of each element.
This script removes such line break characters:
import re
def removeAnnoyingLines(elem):
    hasWords = re.compile("\\w")
    for element in elem.iter():
        if not re.search(hasWords,str(element.tail)):
            element.tail=""
        if not re.search(hasWords,str(element.text)):
            element.text = ""
Use this function before "pretty-printing" your tree:
removeAnnoyingLines(element)
myXml = xml.dom.minidom.parseString(xml.etree.ElementTree.tostring(element))
print myXml.toprettyxml()
It worked for me. I hope it works for you!
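The same idea can be applied on the minidom side: the blank lines come from whitespace-only text nodes that the previous round of pretty-printing left in the file, so dropping those nodes before calling toprettyxml() keeps the output stable. This is a sketch under that assumption; strip_whitespace_nodes and reprettify are names I made up.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def strip_whitespace_nodes(node):
    """Remove text nodes that contain nothing but whitespace."""
    for child in list(node.childNodes):  # copy, since we mutate the list
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


def reprettify(xml_text, indent="  "):
    doc = parseString(xml_text)
    strip_whitespace_nodes(doc)
    return doc.toprettyxml(indent=indent)


once = reprettify('<a><b>c</b><d/></a>')
twice = reprettify(once)  # feeding the output back in adds no blank lines
print(twice)
```

Note that this also discards whitespace that was meaningful (for example inside mixed content), so it is only appropriate for data-style XML.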
Here's a Python3 solution that gets rid of the ugly newline issue (tons of whitespace), and it only uses standard libraries unlike most other implementations.
import xml.etree.ElementTree as ET
import xml.dom.minidom
import os
def pretty_print_xml_given_root(root, output_xml):
    """
    Useful for when you are editing xml data on the fly
    """
    xml_string = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml()
    xml_string = os.linesep.join([s for s in xml_string.splitlines() if s.strip()]) # remove the weird newline issue
    with open(output_xml, "w") as file_out:
        file_out.write(xml_string)
def pretty_print_xml_given_file(input_xml, output_xml):
    """
    Useful for when you want to reformat an already existing xml file
    """
    tree = ET.parse(input_xml)
    root = tree.getroot()
    pretty_print_xml_given_root(root, output_xml)
I found how to fix the common newline issue here.
The problem is that minidom doesn't handle newline characters well (on Windows).
It doesn't need them anyway, so removing them from the string is the solution:
reparsed = mini.parseString(rough_string)  # mini as minidom
replace with
reparsed = mini.parseString(rough_string.replace('\n',''))  # mini as minidom
But be aware that this solution works only on Windows.
Since minidom toprettyxml insert too many lines, my solution was to delete lines that do not have useful data in them by checking if there is at least one '<' character (there may be a better idea). This worked perfectly for a similar issue I had (on Windows).
text = md.toprettyxml() # get the prettyxml string from minidom Document md
# text = text.replace(' ', '\t') # for those using tabs :)
spl = text.split('\n') # split lines into a list
spl = [i for i in spl if '<' in i] # keep only element with data inside
text = '\n'.join(spl) # join again all elements of the filtered list into a string
# write the result to file (I use codecs because I needed the utf-8 encoding)
import codecs # if not imported yet (just to show this import is needed)
with codecs.open('yourfile.xml', 'w', encoding='utf-8') as f:
    f.write(text)

Writing "pretty printed" indented xml to an xml file results in wrong indentation levels [duplicate]

What is the best way (or are the various ways) to pretty print XML in Python?
import xml.dom.minidom
dom = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()
lxml is recent, updated, and includes a pretty print function
import lxml.etree as etree
x = etree.parse("filename")
print etree.tostring(x, pretty_print=True)
Check out the lxml tutorial:
http://lxml.de/tutorial.html
Another solution is to borrow this indent function, for use with the ElementTree library that's built in to Python since 2.5.
Here's what that would look like:
from xml.etree import ElementTree
def indent(elem, level=0):
    i = "\n" + level*" "
    j = "\n" + (level-1)*" "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for subelem in elem:
            indent(subelem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = j
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = j
    return elem
root = ElementTree.parse('/tmp/xmlfile').getroot()
indent(root)
ElementTree.dump(root)
You have a few options.
xml.etree.ElementTree.indent()
Batteries included, simple to use, pretty output.
But requires Python 3.9+
import xml.etree.ElementTree as ET
element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
BeautifulSoup.prettify()
BeautifulSoup may be the simplest solution for Python < 3.9.
from bs4 import BeautifulSoup
bs = BeautifulSoup(open(xml_file), 'xml')
pretty_xml = bs.prettify()
print(pretty_xml)
Output:
<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<id>
1
</id>
<title>
Add Visual Studio 2005 and 2008 solution files
</title>
</issue>
</issues>
This is my go-to answer. The default arguments work as is. However, text contents are spread out on separate lines as if they were nested elements.
lxml.etree.parse()
Prettier output but with arguments.
from lxml import etree
x = etree.parse(FILE_NAME)
pretty_xml = etree.tostring(x, pretty_print=True, encoding=str)
Produces:
<issues>
<issue>
<id>1</id>
<title>Add Visual Studio 2005 and 2008 solution files</title>
<details>We need Visual Studio 2005/2008 project files for Windows.</details>
</issue>
</issues>
This works for me with no issues.
xml.dom.minidom.parse()
No external dependencies but post-processing.
import os
import xml.dom.minidom as md
dom = md.parse(FILE_NAME)
# To parse string instead use: dom = md.parseString(xml_string)
pretty_xml = dom.toprettyxml()
# remove the weird newline issue:
pretty_xml = os.linesep.join([s for s in pretty_xml.splitlines()
                              if s.strip()])
The output is the same as above, but it's more code.
Here's my (hacky?) solution to get around the ugly text node problem.
uglyXml = doc.toprettyxml(indent=' ')
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)
print prettyXml
The above code will produce:
<?xml version="1.0" ?>
<issues>
<issue>
<id>1</id>
<title>Add Visual Studio 2005 and 2008 solution files</title>
<details>We need Visual Studio 2005/2008 project files for Windows.</details>
</issue>
</issues>
Instead of this:
<?xml version="1.0" ?>
<issues>
<issue>
<id>
1
</id>
<title>
Add Visual Studio 2005 and 2008 solution files
</title>
<details>
We need Visual Studio 2005/2008 project files for Windows.
</details>
</issue>
</issues>
Disclaimer: There are probably some limitations.
As of Python 3.9, ElementTree has an indent() function for pretty-printing XML trees.
See https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.indent.
Sample usage:
import xml.etree.ElementTree as ET
element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
The upside is that it does not require any additional libraries. For more information check https://bugs.python.org/issue14465 and https://github.com/python/cpython/pull/15200
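indent() also takes a space argument, so the indentation unit does not have to be the default two spaces. A small sketch, assuming Python 3.9+; to_pretty_string is a hypothetical wrapper, not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def to_pretty_string(xml_text, space="\t"):
    """Hypothetical wrapper around ET.indent() with a custom indent unit."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space=space)  # modifies the tree in place
    return ET.tostring(root, encoding="unicode")


print(to_pretty_string("<root><Text>Some text here</Text></root>"))
```

Elements that contain only text are left on one line; only the whitespace between elements is rewritten.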
As others pointed out, lxml has a pretty printer built in.
Be aware though that by default it changes CDATA sections to normal text, which can have nasty results.
Here's a Python function that preserves the input file and only changes the indentation (notice the strip_cdata=False). Furthermore it makes sure the output uses UTF-8 as encoding instead of the default ASCII (notice the encoding='utf-8'):
from lxml import etree
def prettyPrintXml(xmlFilePathToPrettyPrint):
    assert xmlFilePathToPrettyPrint is not None
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False)
    document = etree.parse(xmlFilePathToPrettyPrint, parser)
    document.write(xmlFilePathToPrettyPrint, pretty_print=True, encoding='utf-8')
Example usage:
prettyPrintXml('some_folder/some_file.xml')
If you have xmllint you can spawn a subprocess and use it. xmllint --format <file> pretty-prints its input XML to standard output.
Note that this method uses a program external to Python, which makes it something of a hack.
import subprocess
def pretty_print_xml(xml):
    # on Python 3, xml must be bytes because the pipes are opened in binary mode
    proc = subprocess.Popen(
        ['xmllint', '--format', '/dev/stdin'],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    (output, error_output) = proc.communicate(xml)
    return output
print(pretty_print_xml(data))
I tried to edit "ade"s answer above, but Stack Overflow wouldn't let me edit after I had initially provided feedback anonymously. This is a less buggy version of the function to pretty-print an ElementTree.
def indent(elem, level=0, more_sibs=False):
    i = "\n"
    if level:
        i += (level-1) * ' '
    num_kids = len(elem)
    if num_kids:
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
            if level:
                elem.text += ' '
        count = 0
        for kid in elem:
            indent(kid, level+1, count < num_kids - 1)
            count += 1
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
            if more_sibs:
                elem.tail += ' '
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
            if more_sibs:
                elem.tail += ' '
If you're using a DOM implementation, each has their own form of pretty-printing built-in:
# minidom
#
document.toprettyxml()
# 4DOM
#
xml.dom.ext.PrettyPrint(document, stream)
# pxdom (or other DOM Level 3 LS-compliant imp)
#
serializer.domConfig.setParameter('format-pretty-print', True)
serializer.writeToString(document)
If you're using something else without its own pretty-printer — or those pretty-printers don't quite do it the way you want —  you'd probably have to write or subclass your own serialiser.
I had some problems with minidom's pretty print. I'd get a UnicodeError whenever I tried pretty-printing a document with characters outside the given encoding, eg if I had a β in a document and I tried doc.toprettyxml(encoding='latin-1'). Here's my workaround for it:
def toprettyxml(doc, encoding):
    """Return a pretty-printed XML document in a given encoding."""
    unistr = doc.toprettyxml().replace(u'<?xml version="1.0" ?>',
                                       u'<?xml version="1.0" encoding="%s"?>' % encoding)
    return unistr.encode(encoding, 'xmlcharrefreplace')
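To see what the xmlcharrefreplace error handler does in this workaround, here is a small Python 3 run of the same idea. The name to_pretty_bytes is mine, and the sample document is made up.

```python
from xml.dom.minidom import parseString


def to_pretty_bytes(doc, encoding):
    """Pretty-print doc and encode it, escaping unencodable characters."""
    text = doc.toprettyxml()  # str, with a declaration that names no encoding
    text = text.replace('<?xml version="1.0" ?>',
                        '<?xml version="1.0" encoding="%s"?>' % encoding, 1)
    # characters that do not exist in the target encoding become &#NNN;
    return text.encode(encoding, "xmlcharrefreplace")


doc = parseString("<a>\u03b2</a>")  # Greek beta, not representable in latin-1
print(to_pretty_bytes(doc, "latin-1"))
```

The beta comes out as the character reference &#946;, which any XML parser reads back as the original character.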
from yattag import indent
pretty_string = indent(ugly_string)
It won't add spaces or newlines inside text nodes, unless you ask for it with:
indent(mystring, indent_text = True)
You can specify what the indentation unit should be and what the newline should look like.
pretty_xml_string = indent(
ugly_xml_string,
indentation = ' ',
newline = '\r\n'
)
The doc is on http://www.yattag.org homepage.
I wrote a solution to walk through an existing ElementTree and use text/tail to indent it as one typically expects.
def prettify(element, indent=' '):
    queue = [(0, element)] # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = '\n' + indent * (level+1) # for child open
        if queue:
            element.tail = '\n' + indent * queue[0][0] # for sibling open
        else:
            element.tail = '\n' + indent * (level-1) # for parent close
        queue[0:0] = children # prepend so children come before siblings
XML pretty print for python looks pretty good for this task. (Appropriately named, too.)
An alternative is to use pyXML, which has a PrettyPrint function.
You can use popular external library xmltodict, with unparse and pretty=True you will get best result:
xmltodict.unparse(
xmltodict.parse(my_xml), full_document=False, pretty=True)
full_document=False suppresses the <?xml version="1.0" encoding="UTF-8"?> declaration at the top.
Take a look at the vkbeautify module.
It is a python version of my very popular javascript/nodejs plugin with the same name. It can pretty-print/minify XML, JSON and CSS text. Input and output can be string/file in any combinations. It is very compact and doesn't have any dependency.
Examples:
import vkbeautify as vkb
vkb.xml(text)
vkb.xml(text, 'path/to/dest/file')
vkb.xml('path/to/src/file')
vkb.xml('path/to/src/file', 'path/to/dest/file')
You can try this variation...
Install BeautifulSoup and the backend lxml (parser) libraries:
user$ pip3 install lxml bs4
Process your XML document:
from bs4 import BeautifulSoup
with open('/path/to/file.xml', 'r') as doc:
    # parse the whole file at once; feeding it line by line would treat every line as its own document
    print(BeautifulSoup(doc, 'lxml-xml').prettify())
An alternative if you don't want to have to reparse, there is the xmlpp.py library with the get_pprint() function. It worked nice and smoothly for my use cases, without having to reparse to an lxml ElementTree object.
I found a fast and easy way to nicely format and print an xml file:
import xml.etree.ElementTree as ET
xmlTree = ET.parse('your XML file')
xmlRoot = xmlTree.getroot()
ET.indent(xmlRoot)  # Python 3.9+; tostring() on its own does not re-indent the tree
xmlDoc = ET.tostring(xmlRoot, encoding="unicode")
print(xmlDoc)
Output:
<root>
<child>
<subchild>.....</subchild>
</child>
<child>
<subchild>.....</subchild>
</child>
...
...
...
<child>
<subchild>.....</subchild>
</child>
</root>
I had this problem and solved it like this:
def write_xml_file (self, file, xml_root_element, xml_declaration=False, pretty_print=False, encoding='unicode', indent='\t'):
    pretty_printed_xml = etree.tostring(xml_root_element, xml_declaration=xml_declaration, pretty_print=pretty_print, encoding=encoding)
    # etree indents with two spaces, so swap each two-space unit for the requested indent
    if pretty_print: pretty_printed_xml = pretty_printed_xml.replace('  ', indent)
    file.write(pretty_printed_xml)
In my code this method is called like this:
try:
    with open(file_path, 'w') as file:
        file.write('<?xml version="1.0" encoding="utf-8" ?>')
        # create some xml content using etree ...
        xml_parser = XMLParser()
        xml_parser.write_xml_file(file, xml_root, xml_declaration=False, pretty_print=True, encoding='unicode', indent='\t')
except IOError:
    print("Error while writing in log file!")
This works only because etree by default uses two spaces to indent, which I find does not emphasize the indentation much and is therefore not pretty. I couldn't find any setting for etree or parameter for any function to change the standard etree indent. I like how easy it is to use etree, but this was really annoying me.
For converting an entire xml document to a pretty xml document
(ex: assuming you've extracted [unzipped] a LibreOffice Writer .odt or .ods file, and you want to convert the ugly "content.xml" file to a pretty one for automated git version control and git difftooling of .odt/.ods files, such as I'm implementing here)
import xml.dom.minidom
file = open("./content.xml", 'r')
xml_string = file.read()
file.close()
parsed_xml = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = parsed_xml.toprettyxml()
file = open("./content_new.xml", 'w')
file.write(pretty_xml_as_string)
file.close()
References:
- Thanks to Ben Noland's answer on this page which got me most of the way there.
from lxml import etree
import xml.dom.minidom as mmd
xml_root = etree.parse(xml_fiel_path, etree.XMLParser())
def print_xml(xml_root):
    plain_xml = etree.tostring(xml_root).decode('utf-8')
    urgly_xml = ''.join(plain_xml .split())
    good_xml = mmd.parseString(urgly_xml)
    print(good_xml.toprettyxml(indent=' ',))
It's working well for the xml with Chinese!
If for some reason you can't get your hands on any of the Python modules that other users mentioned, I suggest the following solution for Python 2.7:
import subprocess
def makePretty(filepath):
    cmd = "xmllint --format " + filepath
    prettyXML = subprocess.check_output(cmd, shell = True)
    with open(filepath, "w") as outfile:
        outfile.write(prettyXML)
As far as I know, this solution will work on Unix-based systems that have the xmllint package installed.
I found this question while looking for "how to pretty print html"
Using some of the ideas in this thread I adapted the XML solutions to work for XML or HTML:
from xml.dom.minidom import parseString as string_to_dom
def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent=" ")
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty
def pretty_print(html):
    print(prettify(html))
When used this is what it looks like:
html = """\
<div class="foo" id="bar"><p>'IDK!'</p><br/><div class='baz'><div>
<span>Hi</span></div></div><p id='blarg'>Try for 2</p>
<div class='baz'>Oh No!</div></div>
"""
pretty_print(html)
Which returns:
<div class="foo" id="bar">
<p>'IDK!'</p>
<br/>
<div class="baz">
<div>
<span>Hi</span>
</div>
</div>
<p id="blarg">Try for 2</p>
<div class="baz">Oh No!</div>
</div>
Use etree.indent and etree.tostring
import lxml.etree as etree
root = etree.fromstring('<html><head></head><body><h1>Welcome</h1></body></html>')
etree.indent(root, space=" ")
xml_string = etree.tostring(root, pretty_print=True).decode()
print(xml_string)
output
<html>
<head/>
<body>
<h1>Welcome</h1>
</body>
</html>
Removing namespaces and prefixes
import lxml.etree as etree
def dump_xml(element):
    for item in element.iter():  # iter() replaces the deprecated getiterator()
        item.tag = etree.QName(item).localname
    etree.cleanup_namespaces(element)
    etree.indent(element, space=" ")
    result = etree.tostring(element, pretty_print=True).decode()
    return result
root = etree.fromstring('<cs:document xmlns:cs="http://blabla.com"><name>hello world</name></cs:document>')
xml_string = dump_xml(root)
print(xml_string)
output
<document>
<name>hello world</name>
</document>
I solved this with a few lines of code: opening the file, going through it and adding indentation, then saving it again. I was working with small XML files and did not want to add dependencies or more libraries for the user to install. Anyway, here is what I ended up with:
f = open(file_name,'r')
xml = f.read()
f.close()
#Removing old indentations
raw_xml = ''
for line in xml:
    raw_xml += line
xml = raw_xml
new_xml = ''
indent = ' '
deepness = 0
for i in range((len(xml))):
    new_xml += xml[i]
    if(i<len(xml)-3):
        simpleSplit = xml[i:(i+2)] == '><'
        advancSplit = xml[i:(i+3)] == '></'
        end = xml[i:(i+2)] == '/>'
        start = xml[i] == '<'
        if(advancSplit):
            deepness += -1
            new_xml += '\n' + indent*deepness
            simpleSplit = False
            deepness += -1
        if(simpleSplit):
            new_xml += '\n' + indent*deepness
        if(start):
            deepness += 1
        if(end):
            deepness += -1
f = open(file_name,'w')
f.write(new_xml)
f.close()
It works for me, perhaps someone will have some use of it :)

Generating XML file with proper indentation

I am trying to generate an XML file in Python, but it's not getting indented; the output comes out on a single line.
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
name = str(request.POST.get('name'))
top = Element('scenario')
environment = SubElement(top, 'environment')
cluster = SubElement(top, 'cluster')
cluster.text=name
I tried to use a pretty-printing parser, but it gives me this error: 'Element' object has no attribute 'read'
import xml.dom.minidom
xml_p = xml.dom.minidom.parse(top)
pretty_xml = xml_p.toprettyxml()
Is the input given to the parser in the proper format? If this is the wrong method, please suggest another way to indent.
You cannot directly parse top, which is an Element(). You need to turn it into a string first (which is why you should use tostring, which you import but currently do not use) and call xml.dom.minidom.parseString() on the result:
import xml.dom.minidom
xml_p = xml.dom.minidom.parseString(tostring(top))
pretty_xml = xml_p.toprettyxml()
print(pretty_xml)
that gives:
<?xml version="1.0" ?>
<scenario>
<environment/>
<cluster>xyz</cluster>
</scenario>
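Put together as one runnable Python 3 script, it might look like this. The helper names build_scenario and pretty are mine, and the request.POST lookup from the question is replaced by a plain argument.

```python
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring


def build_scenario(name):
    """Build the <scenario> tree from the question."""
    top = Element('scenario')
    SubElement(top, 'environment')
    SubElement(top, 'cluster').text = name
    return top


def pretty(elem, indent="  "):
    # tostring() gives bytes; parseString() accepts them directly
    return minidom.parseString(tostring(elem)).toprettyxml(indent=indent)


print(pretty(build_scenario('xyz')))
```

minidom.parse() expects a filename or file object, which is why passing the Element raised the 'read' error; parseString() takes the serialised document instead.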

XML row structure in one row

A strange error occurred: I was emailed an XML file that was wrongly formatted. All the info in the file was on one line.
Like this
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Text><otherText><printdate>2015-02-08</printdate>
Does anyone know a quick way to fix this with a Python script, or has anyone had the same problem?
I want to make the file like this.
<?xml version="1.0" encoding="ISO-8859-1"?>
<Text>
<OtherText>
<Name>VH2</Name>
<PrintDate>2015-02-05</PrintDate>
Thanks!
It seems you want pretty-printing. Other XML libraries, such as lxml, support it.
import lxml.etree as etree
x = etree.parse("filename")
print etree.tostring(x, pretty_print = True)
However, you can also try this:
Pretty printing XML in Python
If the XML is well formed this snippet will work
#!/usr/bin/python
import xml.dom.minidom
def main():
    ugly_xml = open('ugly.xml', 'r')
    pretty_xml = open('pretty.xml', 'w')
    xmll = xml.dom.minidom.parseString(ugly_xml.read())
    pretty_xml.write(xmll.toprettyxml())
if __name__ == "__main__":
    main()
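Since the snippet only works on well-formed input, and the sample line in the question is cut off before its closing tags, it may be worth catching the parser error explicitly. A minimal sketch; try_pretty is a hypothetical helper that returns None instead of raising.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_pretty(xml_text, indent="  "):
    """Return pretty-printed XML, or None if the input is not well formed."""
    try:
        return minidom.parseString(xml_text).toprettyxml(indent=indent)
    except ExpatError as exc:
        print("not well-formed XML: %s" % exc)
        return None


print(try_pretty('<Text><otherText><printdate>2015-02-08</printdate>'))
print(try_pretty('<Text><printdate>2015-02-08</printdate></Text>'))
```

The first call reports a parse error because the elements are never closed; the second returns the indented document.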
