How to write an XML file without header in Python?

When using Python's stock XML tools such as xml.dom.minidom for writing, the output file always starts with
<?xml version="1.0"?>
[...]
While this is perfectly legal XML, and using the declaration is even recommended, I'd like to get rid of it because one of the programs I'm working with has problems with it.
I can't seem to find the appropriate option in xml.dom.minidom, so I wondered whether there are other packages that allow omitting the header.
Cheers,
Nico

Unfortunately minidom does not give you the option to omit the XML Declaration.
But you can always serialise the document content yourself by calling toxml() on the document's root element instead of the document. Then you won't get an XML Declaration:
xml = document.documentElement.toxml('utf-8')
...but then you also wouldn't get anything else outside the root element, such as the DOCTYPE, or any comments or processing instructions. If you need them, serialise each child of the document object one by one:
xml = '\n'.join(node.toxml('utf-8') for node in document.childNodes)
I wondered whether there are other packages that allow omitting the header.
DOM Level 3 LS defines an xml-declaration config parameter you can use to suppress it. The only Python implementation I know of is pxdom, which is thorough on standards support, but not at all fast.
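For completeness, here is a minimal sketch of the child-by-child approach on Python 3; the document content and output file name are made up for illustration:
import xml.dom.minidom as minidom

document = minidom.parseString('<!-- a comment --><root><child>text</child></root>')

# Serialise every top-level node (comments, PIs, root element) but skip
# the Document itself, so no XML declaration is emitted.
xml = '\n'.join(node.toxml() for node in document.childNodes)

with open('no_declaration.xml', 'w', encoding='utf-8') as f:
    f.write(xml)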

If you want to use minidom and maintain 'prettiness', how about this as a quick/hacky fix:
xml_without_declaration.py:
import xml.dom.minidom as xml
doc = xml.Document()
declaration = doc.toxml()
a = doc.createElement("A")
doc.appendChild(a)
b = doc.createElement("B")
a.appendChild(b)
xml = doc.toprettyxml()[len(declaration):]
print xml

The header is printed by the Document. If you serialise the node directly, the header is not included:
root = doc.childNodes[0]
root.toprettyxml(encoding="utf-8")

Just replace the declaration (the first line) with an empty string:
import xml.dom.minidom as MD
<XML String>.replace(MD.Document().toxml()+'\n', '')

If you're set on using minidom, just scan back in the file and remove the first line after writing all the XML you need.
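A minimal sketch of that post-processing step, assuming the document has already been written to a hypothetical output.xml and the declaration sits alone on the first line:
# Rewrite the file without its first line (the XML declaration).
with open('output.xml', 'r', encoding='utf-8') as f:
    lines = f.readlines()
with open('output.xml', 'w', encoding='utf-8') as f:
    f.writelines(lines[1:])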

You might be able to use a custom file-like object which removes the first tag, e.g:
class RemoveFirstLine:
    def __init__(self, f):
        self.f = f
        self.xmlTagFound = False

    def __getattr__(self, attr):
        # Delegate everything else to the wrapped file object.
        return getattr(self.f, attr)

    def write(self, s):
        if not self.xmlTagFound:
            x = 0  # just to be safe
            for x, c in enumerate(s):
                if c == '>':
                    self.xmlTagFound = True
                    break
            self.f.write(s[x + 1:])
        else:
            self.f.write(s)
...
f = RemoveFirstLine(open('path', 'wb'))
Node.writexml(f, encoding='UTF-8')
or something similar. This has the advantage the file doesn't have to be totally rewritten if the XML files are fairly large.

Purists may not like to hear this, but I have found using an XML parser to generate XML to be overkill. Just generate it directly as strings. This also lets you generate files larger than you can keep in memory, which you can't do with DOM. Reading XML is another story.
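As a sketch of that string-based approach (element names and file name invented for illustration), xml.sax.saxutils.escape keeps the output well-formed:
from xml.sax.saxutils import escape

def write_items(path, items):
    # Stream the document out element by element; no declaration, no DOM.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<items>\n')
        for item in items:
            f.write('  <item>%s</item>\n' % escape(item))
        f.write('</items>\n')

write_items('items.xml', ['first', 'a < b', 'last'])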

Use a string replace:
from xml.dom import minidom

mydoc = minidom.parse('filename.xml')
with open(newfile, "w") as fs:
    fs.write(mydoc.toxml().replace('<?xml version="1.0" ?>', ''))
That's it ;)

Related

Modifying XML file from string source - how to do it?

I have a problem where I want to change some lines in my XML, but the XML is not in a file; it is in a string. I am using Python 3.x and the xml.etree.ElementTree library for this purpose.
I have this piece of code, which I know works for files in the project, but as I said, I want no files, only operations on string sources.
source_tree = ET.ElementTree(ET.fromstring(source_config))
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
This works, but I don't know how to save it. I tried
source_tree.write(source_config, encoding='latin-1'), but this doesn't work (it treats the whole XML string as a file name).
I don't think you need both source_tree and source_tree_root. By having both, you're creating two separate things. When you write using source_tree, you don't get the changes made to source_tree_root.
Try creating an ElementTree from source_tree_root (which is just an Element), like this (untested since you didn't supply an mcve)...
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
ET.ElementTree(source_tree_root).write("output.xml")
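Since the question asks for string-only operations, a small hedged addition: instead of writing to a file, ElementTree can serialise the modified tree straight back to a string, for example:
# Serialise the modified tree back to a str instead of writing a file.
new_config = ET.tostring(source_tree_root, encoding='unicode')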
OK, I thought that you were using lxml. My bad. Well here is how you would do it with lxml. I think that lxml is superior.
Here is the basic way of parsing a string into an XML document.
from lxml import etree
doc = etree.fromstring(yourstring)
for e in doc.xpath('//sometagname'):
    e.set('foo', 'bar')

Python lxml error "namespace not defined."

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots (500m+) of these files, which are all gz compressed, and grab the text values from a few of the contained tags.
sample code:
from lxml import objectify, etree
import gzip
with open('file_list', 'rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)
The first elem.text value is returned, but only for the first file in the list, and it is quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well-formed XML: the sphinx namespace prefix is used without ever being declared. I assume it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword from lxml.etree.iterparse:
xml2 = etree.iterparse(in_xml, recover=True)
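If reconstructing the full document is not practical, another workaround is to wrap each snippet in a dummy root element that declares the missing prefix before parsing. This is only a sketch: the file name is made up and the namespace URI is arbitrary, since the parser only needs the prefix to be declared:
from lxml import etree

with open('snippet.xml', 'rb') as f:
    snippet = f.read()
# Declare the sphinx prefix on a wrapper element so the parser accepts it.
wrapped = (b'<wrapper xmlns:sphinx="urn:example:sphinx">'
           + snippet + b'</wrapper>')
root = etree.fromstring(wrapped)
for elem in root.iter('page_number'):
    print(elem.text)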

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.etree.ElementTree, the tree.write(file) method replaces CRLF line endings with LF only, which also creates issues when trying to import the XML file into e.g. PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace')), and finally file.write(source_XML).
It's not nice, but it solves the issue. I don't mind about the indentation, so I can't really comment on that; I would only use pprint.pprint() whenever I need to print it.
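A sketch of that approach (file name and replacement value invented; assuming the first <version> element is the one to change). ElementTree is only used to discover the current value, and the raw text is edited directly so the original formatting survives:
import xml.etree.ElementTree as ET

with open('someFile.xml', 'r', encoding='utf-8') as f:
    source_xml = f.read()

# Use ElementTree only to find the current value of the tag.
old_value = ET.fromstring(source_xml).find('.//version').text

# Splice the new value into the raw text; indentation and newlines stay intact.
source_xml = '<version>1.2</version>'.join(
    source_xml.split('<version>%s</version>' % old_value))

with open('someFile.xml', 'w', encoding='utf-8') as f:
    f.write(source_xml)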

Loading different data types from XML into a dictionary in python

I'm using cElementTree to extract xml tags and values in a loop and then storing them into a dictionary.
XML file contains:
<root>
<tag1>['item1', 'item2']</tag1>
<tag2>a normal string</tag2>
</root>
Python code (roughly):
import xml.etree.cElementTree as xml

xmldata = {}
xmlfile = xml.parse("XMLFile.xml")
for xmltag in xmlfile.iter():
    xmldata[xmltag.tag] = xmltag.text
The problem I have encountered is that the xml file contains different data types, which include string and list. Unfortunately Element.text saves all the xml values as string (including the lists).
So when I load from the XML file I have:
{'tag1':"['item1', 'item2']", 'tag2':'a normal string'}
When I'd prefer to have:
{'tag1':['item1', 'item2'], 'tag2':'a normal string'}
Is there an easy way to do this?
e.g a command that saves to the dictionary in the original format
Or do I need to set up if statements to determine the value type and save it separately using an alternative to Element.text?
You can use literal_eval to try to parse complex Python literals. Since your strings are unquoted, they will raise a SyntaxError (or, for single words, a ValueError) in literal_eval, but that is simple to work around:
import xml.etree.cElementTree as xml
from ast import literal_eval

xmldata = {}
xmlfile = xml.parse("XMLFile.xml")
for xmltag in xmlfile.iter():
    try:
        xmldata[xmltag.tag] = literal_eval(xmltag.text)
    except (SyntaxError, ValueError):
        # Plain strings are not valid Python literals; keep them as-is.
        xmldata[xmltag.tag] = xmltag.text
Unlike Python's built-in eval, ast.literal_eval does not allow the execution of arbitrary expressions, and thus is safe even if the XML data comes from an untrusted source.
Here is a proposed solution: check for the existence of [, then parse the list. It's not failsafe (it won't work if the separator is not exactly ", " with a space), but I think it'll be easy for you to improve it.
import xml.etree.cElementTree as xml

xmldata = {}
xmlfile = xml.parse("data.xml")
for xmltag in xmlfile.iter():
    # it's a list
    if "[" in xmltag.text:
        d = xmltag.text.lstrip("[").rstrip("]")
        l = [item.lstrip("'").rstrip("'") for item in d.split(", ")]
        xmldata[xmltag.tag] = l
    else:
        xmldata[xmltag.tag] = xmltag.text
print xmldata
Prints: {'root': '\n', 'tag1': ['item1', 'item2'], 'tag2': 'a normal string'}
I think that you are not using xml in all its mighty power!
Why don't you organize your .xml like:
<root>
    <tag1>
        <item>item1</item>
        <item>item2</item>
    </tag1>
    <tag2>a normal string</tag2>
</root>
This way your Python code will handle every <tag1> as a container of <item> elements, which I think is better.
Note: You may also want to take a look here. (I agree with the "Favorite Way" of the author)
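A sketch of how the restructured document could then be read back into the desired dictionary (file name made up, untested):
import xml.etree.cElementTree as xml

xmldata = {}
root = xml.parse("data.xml").getroot()
for child in root:
    items = child.findall("item")
    if items:
        # A container of <item> elements becomes a list.
        xmldata[child.tag] = [item.text for item in items]
    else:
        xmldata[child.tag] = child.text
# {'tag1': ['item1', 'item2'], 'tag2': 'a normal string'}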

python: is there an XML parser implemented as a generator?

I'd like to parse a big XML file "on the fly", using a Python generator to do it. I've tried iterparse from xml.etree.cElementTree (which is really nice), but it's still not a generator.
Other suggestions?
xml.etree.cElementTree comes close to a generator with correct usage; by default you receive each element after its 'end' event, at which point you can process it. You should call element.clear() on the element if you don't need it after processing; that way you save memory.
Here is a complete example of what I mean, where I parse Rhythmbox's (music player) library. I use (c)ElementTree's iterparse, and for each processed element I call element.clear() so that I save quite a lot of memory. (Btw, the code below is a successor to some SAX code doing the same thing; the cElementTree solution was a relief since 1) the code is concise and expresses what I need and nothing more, 2) it is 3x as fast, and 3) it uses less memory.)
import os
import xml.etree.cElementTree as ElementTree

NEEDED_KEYS = set(("title", "artist", "album", "track-number", "location", ))

def _lookup_string(string, strmap):
    """Look up #string in the string map,
    and return the copy in the map.
    If not found, update the map with the string.
    """
    string = string or ""
    try:
        return strmap[string]
    except KeyError:
        strmap[string] = string
        return string

def get_rhythmbox_songs(dbfile, typ="song", keys=NEEDED_KEYS):
    """Return a list of info dictionaries for all songs
    in a Rhythmbox library database file, with dictionary
    keys as given in #keys.
    """
    rhythmbox_dbfile = os.path.expanduser(dbfile)
    lSongs = []
    strmap = {}
    # Parse with iterparse; we get the elements when
    # they are finished, and can remove them directly after use.
    for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
        if not (entry.tag == ("entry") and entry.get("type") == typ):
            continue
        info = {}
        for child in entry.getchildren():
            if child.tag in keys:
                tag = _lookup_string(child.tag, strmap)
                text = _lookup_string(child.text, strmap)
                info[tag] = text
        lSongs.append(info)
        entry.clear()
    return lSongs
Now, I don't quite understand your expectations. Do you expect something like the following?
# take one
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse some entries, then exit loop

# take two
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse the rest of entries
Each time you call iterparse you get a new iterator object, reading the file anew! If you want a persistent object with iterator semantics, you have to refer to the same object in both loops (untried code):
# setup
parseiter = iter(ElementTree.iterparse(rhythmbox_dbfile))

# take one
for event, entry in parseiter:
    # parse some entries, then exit loop

# take two
for event, entry in parseiter:
    # parse the rest of entries
I think it can be confusing since different objects have different semantics. A file object always has an internal state and advances through the file, however you iterate on it; an ElementTree iterparse object apparently does not. The crux is that when you use a for loop, the for always calls iter() on the thing you iterate over. Here is an experiment comparing ElementTree.iterparse with a file object:
>>> import xml.etree.cElementTree as ElementTree
>>> pth = "/home/ulrik/.local/share/rhythmbox/rhythmdb.xml"
>>> iterparse = ElementTree.iterparse(pth)
>>> iterparse
<iterparse object at 0x483a0890>
>>> iter(iterparse)
<generator object at 0x483a2f08>
>>> iter(iterparse)
<generator object at 0x483a6468>
>>> f = open(pth, "r")
>>> f
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
What you see is that each call to iter() on an iterparse object returns a new generator. The file object, however, has internal operating-system state that must be conserved, and it is its own iterator.
"On the fly" parsing and document trees are not really compatible. SAX-style parsers are usually used for that (for example, Python's standard xml.sax). You basically have to define a class with handlers for various events like startElement, endElement, etc. and the parser will call the methods as it parses the XML file.
PullDom does what you want. It reads XML from a stream, like SAX, but then builds a DOM for a selected piece of it.
"PullDOM is a really simple API for working with DOM objects in a streaming (efficient!) manner rather than as a monolithic tree."
This is possible with elementtree and incremental parsing:
http://effbot.org/zone/element-iterparse.htm#incremental-parsing
import xml.etree.cElementTree as etree

for event, elem in etree.iterparse(source):
    ...
Easier to use than sax.
xmltodict has a callback way of reading row by row, but it is not very pythonic. I wanted something similar for reading stackoverflow posts one by one from their xml dump using a generator.
This is the structure of the xml file:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" ... />
<row Id="2" ... />
</posts>
And here is the code I used. It combines pulldom for streaming and xmltodict for parsing the rows.
def xml_to_dict_gen(file_path, tag='row'):
    from xml.dom import pulldom
    import xmltodict

    doc = pulldom.parse(file_path)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            doc.expandNode(node)
            yield dict(xmltodict.parse(node.toxml()).get(tag))

for post in xml_to_dict_gen('Posts.xml'):
    print(post)
