Modifying XML file from string source - how to do it? - python

I have a problem: I want to change some lines in my XML, but the XML is in a string, not in a file. I am using Python 3.x and the xml.etree.ElementTree library for this.
I have this piece of code, which I know works for files in the project, but as I said, I want no files, only operations on string sources.
source_tree = ET.ElementTree(ET.fromstring(source_config))
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
This works, but I don't know how to save the result. I tried
source_tree.write(source_config, encoding='latin-1'), but this doesn't work (it treats the whole XML string as a file name).

I don't think you need both source_tree and source_tree_root. By having both, you're creating two separate things. When you write using source_tree, you don't get the changes made to source_tree_root.
Try creating an ElementTree from source_tree_root (which is just an Element), like this (untested since you didn't supply an mcve)...
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
ET.ElementTree(source_tree_root).write("output.xml")
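If the result should stay in a string rather than go to "output.xml", ET.tostring can serialize the modified root directly. The following is only a minimal sketch: the sample XML and the replaced values are made up for illustration, and the plain strings stand in for self.firstarg and self.secondarg.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the real source_config comes from elsewhere.
source_config = (
    "<config><generation>alpha-1</generation>"
    "<generation>beta-1</generation></config>"
)

root = ET.fromstring(source_config)
for item in root.iter("generation"):
    if item.text:  # skip empty elements, whose text is None
        item.text = item.text.replace("-1", "-2")

# encoding="unicode" returns a str (no XML declaration) instead of bytes.
source_config = ET.tostring(root, encoding="unicode")
```

Passing a byte encoding such as 'latin-1' instead would return bytes with an XML declaration.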

OK, I thought that you were using lxml. My bad. Well here is how you would do it with lxml. I think that lxml is superior.
Here is the basic way of parsing a string into an XML document.
from lxml import etree
doc = etree.fromstring(yourstring)
for e in doc.xpath('//sometagname'):
    e.set('foo', 'bar')

Related

MemoryError on generating large xml files using xml.etree.ElementTree [duplicate]

There are many ways to read XML, both all-at-once (DOM) and one-bit-at-a-time (SAX). I have used SAX or lxml to iteratively read large XML files (e.g. a Wikipedia dump, which is 6.5 GB compressed).
However, after doing some iterative processing (in Python, using ElementTree) of that XML file, I want to write out the (new) XML data to another file.
Are there any libraries to iteratively write out XML data? I could create the XML tree and then write it out, but that is not possible without oodles of RAM. Is there any way to write the XML tree to a file iteratively, one bit at a time?
I know I could generate the XML myself with print "<%s>" % tag_name, etc., but that seems a bit... hacky.
Fredrik Lundh's elementtree.SimpleXMLWriter will let you write out XML incrementally. Here's the demo code embedded in the module:
from elementtree.SimpleXMLWriter import XMLWriter
import sys
w = XMLWriter(sys.stdout)
html = w.start("html")
w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value="my application 1.0")
w.end()
w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")
w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(html)
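The standard library has a comparable incremental writer in xml.sax.saxutils.XMLGenerator, which may be easier to come by than the standalone elementtree package. The sketch below is an illustration, not a drop-in replacement: the write_items helper and the element names are hypothetical, and an in-memory buffer stands in for the output file.

```python
import io
from xml.sax.saxutils import XMLGenerator

def write_items(out, items):
    """Stream (name, text) pairs to `out` without building a tree in memory."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("items", {})
    for name, text in items:
        gen.startElement("item", {"name": name})
        gen.characters(text)  # special characters are escaped here
        gen.endElement("item")
    gen.endElement("items")
    gen.endDocument()

buf = io.StringIO()  # stand-in for an open output file
write_items(buf, [("a", "1 < 2"), ("b", "x & y")])
result = buf.getvalue()
```

Because each element is written as soon as it is produced, memory use stays flat no matter how many items pass through.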
With lxml, you can use etree.Element to create new nodes, and etree.tostring to write out the XML representation. See, for example, Listing 6. Serialize an element's children from Liza Daly's article "High-performance XML parsing in Python with lxml".
If you're reading in XML dialect1, and have to write XML dialect2, wouldn't it be a good idea to write down the conversion process using XSLT? You may not even need any source code that way.
If you don't find anything else, what I'd prefer here is to inherit from ElementTree and create an "IterativeElementTree", adding a "file" attribute to it. I'd subclass the nodes to have a "start_tag_committed" attribute and a "commit" method. When called, this "commit" method would call the render method for a subtree, starting from the farthest parent where "start_tag_committed" is false. With the string in hand, I'd manually strip the closing tags for the parents of the current node. Previously opened but not yet closed parent siblings need to be handled as well.
Then, I'd remove the committed node from the in-memory model.
You will also need to annotate each node with its parent, as ElementTree does not do that.
(Write to me if there are no better answers and you get stuck there; I could implement this.)

Following xs:include when parsing XSD as XML with lxml in Python

So, my problem is that I'm trying to do something a little unorthodox. I have a complicated set of XSD files. However, I don't want to use these XSD files to verify an XML file; I want to parse these XSDs as XML and interrogate them just as I would a normal XML file. This is possible because XSDs are valid XML. I am using lxml with Python 3.
The problem I'm having is with the statement:
<xs:include schemaLocation="sdm-extension.xsd"/>
If I instruct lxml to create an XSD for verifying like this:
schema = etree.XMLSchema(schema_root)
this dependency will be resolved (the file exists in the same directory as the one I've just loaded). HOWEVER, I am treating these as XML so, correctly, lxml just treats this as a normal element with an attribute and does not follow it.
Is there an easy or correct way to extend lxml so that I may have the same or similar behaviour as, say
<xi:include href="metadata.xml" parse="xml" xpointer="title"/>
I could, of course, create a separate xml file manually that includes all the dependencies in the XSD schema. That is perhaps a solution?
So it seems like one option is to use xi:include and create a separate XML file that includes all the XSDs I want to parse. Something along the lines of:
<fullxsd xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="./xsd-cdisc-sdm-1.0.0/sdm1-0-0.xsd" parse="xml"/>
    <xi:include href="./xsd-cdisc-sdm-1.0.0/sdm-ns-structure.xsd" parse="xml"/>
</fullxsd>
Then use some lxml along the lines of
from lxml import etree

def combine(xsd_file):
    with open(xsd_file, 'rb') as f_xsd:
        parser = etree.XMLParser(recover=True, encoding='utf-8', remove_comments=True, remove_blank_text=True)
        xsd_source = f_xsd.read()
        root = etree.fromstring(xsd_source, parser)
        # resolve the xi:include elements in place
        incl = etree.XInclude()
        incl(root)
        print(etree.tostring(root, pretty_print=True))
It's not ideal, but it seems the proper way. I've looked at custom URI resolvers in lxml, but that would mean actually altering the XSDs, which seems messier.
Try this:
from lxml import etree

def validate_xml(schema_file, xml_file):
    xsd_doc = etree.parse(schema_file)
    xsd = etree.XMLSchema(xsd_doc)
    xml = etree.parse(xml_file)
    return xsd.validate(xml)
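Another option for the original goal (interrogating the XSDs as plain XML) is to follow the xs:include elements by hand instead of writing a wrapper file. The sketch below is only an assumption about how that could look. It uses the standard library's xml.etree.ElementTree so that it runs without lxml, though the same calls should carry over to lxml.etree. The load_with_includes helper, the XS constant, and the two tiny schema files are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def load_with_includes(path, seen=None):
    """Parse an XSD as plain XML and splice in the top-level children of
    every xs:include target, resolved relative to the including file."""
    seen = set() if seen is None else seen
    path = os.path.abspath(path)
    seen.add(path)
    root = ET.parse(path).getroot()
    for inc in list(root.findall(XS + "include")):
        target = os.path.abspath(
            os.path.join(os.path.dirname(path), inc.get("schemaLocation")))
        root.remove(inc)
        if target in seen:
            continue  # guard against circular includes
        # Included definitions are appended at the end of the schema.
        root.extend(list(load_with_includes(target, seen)))
    return root

# Hypothetical two-file schema, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "main.xsd"), "w") as f:
        f.write('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
                '<xs:include schemaLocation="ext.xsd"/>'
                '<xs:element name="a"/></xs:schema>')
    with open(os.path.join(tmp, "ext.xsd"), "w") as f:
        f.write('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
                '<xs:element name="b"/></xs:schema>')
    merged = load_with_includes(os.path.join(tmp, "main.xsd"))
    names = sorted(e.get("name") for e in merged.iter(XS + "element"))
```

This leaves the XSD files untouched, at the cost of ignoring xs:import and xs:redefine, which would need their own handling.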

Using Python to comment out XML

I'm trying (and failing) to comment out the HornetQ configuration in a JBoss 6.2 domain.xml file. Instead of inserting a comment around the stanza I want to remove, I'm managing to delete everything remaining in the file.
The code I have so far is
from xml.dom import minidom
import os, time, shutil
domConf=('/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml')
commentSub=('urn:jboss:domain:messaging:1.4')
now=str(int(time.time()))
bkup=(domConf+now)
shutil.copy2(domConf, bkup)
xmldoc = minidom.parse(domConf)
itemlist = xmldoc.getElementsByTagName('subsystem')
for s in itemlist:
    if commentSub in s.attributes['xmlns'].value:
        s.parentNode.insertBefore(xmldoc.createComment(s.toxml()), s)
file = open(domConf, "wb")
xmldoc.writexml(file)
file.write('\n')
file.close()
The configuration I'm trying to comment out is:
<subsystem xmlns="urn:jboss:domain:messaging:1.4">
<hornetq-server>
<persistence-enabled>true</persistence-enabled>
<journal-type>NIO</journal-type>
<journal-min-files>2</journal-min-files>
<connectors>
[....]
</pooled-connection-factory>
</jms-connection-factories>
</hornetq-server>
</subsystem>
Thanks!
The problem you were running into was that the sections you are trying to comment out already contain XML comments. Nested comments are not allowed in XML. (See Nested comments in XML? for more info.)
I think what you need to do is this:
from xml.dom import minidom
import os, time, shutil
domConf=('/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml')
resultFile='result.xml'
commentSub=('urn:jboss:domain:messaging:1.4')
now=str(int(time.time()))
bkup=(domConf+now)
shutil.copy2(domConf, bkup)
xmldoc = minidom.parse(domConf)
itemlist = xmldoc.getElementsByTagName('subsystem')
for s in itemlist:
    if commentSub in s.attributes['xmlns'].value:
        commentText = s.toxml()
        # '--' is not allowed inside an XML comment, so break up nested comment markers
        commentText = commentText.replace('--', '- -')
        s.parentNode.insertBefore(xmldoc.createComment(commentText), s)
        s.parentNode.removeChild(s)
# open in text mode so that writexml() and write('\n') also work on Python 3
with open(resultFile, "w") as file:
    xmldoc.writexml(file)
    file.write('\n')
shutil.copy2(resultFile, domConf)
This finds the subsystem as you do, but before inserting the comment, it changes any nested XML comments so they are no longer comments, by replacing '--' with '- -'. (Note this will probably break the XML file structure if you uncomment that section. You'll have to reverse the process if you want it to parse again.) After inserting, the script deletes the original node. Then it writes everything to a temporary file, and uses shutil to copy it back over the original.
I tested this on my system, using the file you posted to the pastebin in the comment below, and it works.
Note that it's kind of a quick and dirty hack - because the script will also replace '--' with '- -' everywhere in that section, and if there is other text as part of an XML node that has '--' in it, it too will get replaced...
The right way to do this would probably be to use lxml's elementtree implementation, use lxml's XSL to select only comments within the section, and either delete or transform them appropriately - so you don't mess up non-commented text. But that's probably beyond the scope of what you asked. (Python's built-in elementtree doesn't have a complete XSL implementation and probably can't be used to select comments.)
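The same replace-then-comment idea can be tried on an in-memory document, without touching domain.xml. This is a minimal sketch with a made-up two-subsystem document. The comment_out helper and the urn:a / urn:b namespaces are hypothetical, and the '--' caveat described above still applies.

```python
from xml.dom import minidom

def comment_out(doc, tag, xmlns):
    """Replace every <tag> whose xmlns attribute equals `xmlns` with a
    comment holding its serialized form."""
    for node in list(doc.getElementsByTagName(tag)):
        if node.getAttribute("xmlns") == xmlns:
            # '--' may not appear inside a comment, so break up nested markers.
            text = node.toxml().replace("--", "- -")
            node.parentNode.replaceChild(doc.createComment(text), node)

# Hypothetical stand-in for the relevant part of domain.xml.
doc = minidom.parseString(
    '<profile><subsystem xmlns="urn:a"><x/><!-- note --></subsystem>'
    '<subsystem xmlns="urn:b"/></profile>'
)
comment_out(doc, "subsystem", "urn:a")
result = doc.documentElement.toxml()
```

Using replaceChild swaps the element for the comment in one step, so the original node cannot be left behind by accident.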

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.etree.ElementTree, the tree.write(file) method replaces CRLF with LF only, which also creates issues when trying to import the XML file into, e.g., PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace')), and finally file.write(source_XML).
It's not nice, but it solves the issue. However, I don't care about the indentation, so I can't really say anything on that point. I would only use pprint.pprint() whenever I need to print it.
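The string-replacement idea can be narrowed to a single tag, so that indentation and line endings elsewhere are left exactly as they were. This is a rough sketch, not a general XML editor. The replace_tag_text helper and the sample document are hypothetical, and the pattern assumes the tag has no attributes and no nested elements.

```python
import re
from xml.sax.saxutils import escape

def replace_tag_text(xml_text, tag, new_value, occurrence=0):
    """Replace the text of the n-th <tag>...</tag> and leave every other
    character, including whitespace and line endings, untouched."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    matches = list(pattern.finditer(xml_text))
    if occurrence >= len(matches):
        raise ValueError("tag occurrence not found")
    m = matches[occurrence]
    # escape() keeps characters such as '&' and '<' in the new value well-formed.
    return xml_text[:m.start(1)] + escape(new_value) + xml_text[m.end(1):]

original = "<project>\r\n    <version>1.0</version>\r\n</project>\r\n"
updated = replace_tag_text(original, "version", "1.2")
```

When reading and writing the real file in Python 3, opening it with newline="" avoids newline translation, so CRLF line endings survive the round trip.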

Python xml ElementTree from a string source?

The ElementTree.parse reads from a file, how can I use this if I already have the XML data in a string?
Maybe I am missing something here, but there must be a way to use the ElementTree without writing out the string to a file and reading it again.
You can parse the text as a string, which creates an Element, and create an ElementTree using that Element.
import xml.etree.ElementTree as ET
tree = ET.ElementTree(ET.fromstring(xmlstring))
I just came across this issue and the documentation, while complete, is not very straightforward on the difference in usage between the parse() and fromstring() methods.
If you're using xml.etree.ElementTree.parse to parse from a file, then you can use xml.etree.ElementTree.fromstring to get the root Element of the document. Often you don't actually need an ElementTree.
See xml.etree.ElementTree
You need xml.etree.ElementTree.fromstring(text):
from xml.etree.ElementTree import XML, fromstring
myxml = fromstring(text)
io.StringIO is another option for getting XML into xml.etree.ElementTree:
import io
import xml.etree.ElementTree as ET
f = io.StringIO(xmlstring)
tree = ET.parse(f)
root = tree.getroot()
However, parsing this way does not keep the XML declaration that one might assume to be in the tree, so it has to be requested again when calling ElementTree.write(). See How to write XML declaration using xml.etree.ElementTree.
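To illustrate the point about the declaration, the sketch below parses a string through io.StringIO and writes the tree back to an in-memory buffer, asking write() for a declaration explicitly. The sample XML is made up, and io.BytesIO stands in for a real output file.

```python
import io
import xml.etree.ElementTree as ET

xmlstring = "<root><child>text</child></root>"  # hypothetical sample input

# parse() accepts any file-like object, so a StringIO works in place of a file.
tree = ET.parse(io.StringIO(xmlstring))

buf = io.BytesIO()
# xml_declaration=True makes write() emit the <?xml ...?> line.
tree.write(buf, encoding="utf-8", xml_declaration=True)
output = buf.getvalue().decode("utf-8")
```

Here output begins with an XML declaration followed by the serialized root element.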
