Using Python to comment out XML - python

I'm trying (and failing) to comment out the HornetQ configuration from a JBoss 6.2 domain.xml file, instead of inserting a comment around the stanza I want to remove, I'm managing to delete everything remaining in the file.
The code I have so far is
from xml.dom import minidom
import os, time, shutil
domConf=('/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml')
commentSub=('urn:jboss:domain:messaging:1.4')
now=str(int(time.time()))
bkup=(domConf+now)
shutil.copy2(domConf, bkup)
xmldoc = minidom.parse(domConf)
itemlist = xmldoc.getElementsByTagName('subsystem')
for s in itemlist:
if commentSub in s.attributes['xmlns'].value:
s.parentNode.insertBefore(xmldoc.createComment(s.toxml()), s)
file = open(domConf, "wb")
xmldoc.writexml(file)
file.write('\n')
file.close()
the configuration I'm trying to comment out is -
<subsystem xmlns="urn:jboss:domain:messaging:1.4">
<hornetq-server>
<persistence-enabled>true</persistence-enabled>
<journal-type>NIO</journal-type>
<journal-min-files>2</journal-min-files>
<connectors>
[....]
</pooled-connection-factory>
</jms-connection-factories>
</hornetq-server>
</subsystem>
Thanks!

The problem you were running into was that the sections you are trying to comment out already contain XML comments. Nested comments are not allowed in XML. (See Nested comments in XML? for more info.)
I think what you need to do is this:
from xml.dom import minidom
import os, time, shutil
domConf=('/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml')
resultFile='result.xml'
commentSub=('urn:jboss:domain:messaging:1.4')
now=str(int(time.time()))
bkup=(domConf+now)
shutil.copy2(domConf, bkup)
xmldoc = minidom.parse(domConf)
itemlist = xmldoc.getElementsByTagName('subsystem')
for s in itemlist:
if commentSub in s.attributes['xmlns'].value:
commentText = s.toxml()
commentText = commentText.replace('--', '- -')
s.parentNode.insertBefore(xmldoc.createComment(commentText), s)
s.parentNode.removeChild(s)
file = open("result.xml", "wb")
xmldoc.writexml(file)
file.write('\n')
file.close()
shutil.copy2(resultFile, domConf)
This finds the comment as you do, but before inserting it, changes any nested XML comments so they are no longer comments by replacing '--' with '- -'. (Note this will probably break the XML file structure if you uncomment that section. You'll have to reverse the process if you want it to parse again.) After inserting, the script deletes the original node. Then it writes everything to a temporary file, and uses shutil to copy it back over the original.
I tested this on my system, using the file you posted to the pastebin in the comment below, and it works.
Note that it's kind of a quick and dirty hack - because the script will also replace '--' with '- -' everywhere in that section, and if there is other text as part of an XML node that has '--' in it, it too will get replaced...
The right way to do this would probably be to use lxml's elementtree implementation, use lxml's XSL to select only comments within the section, and either delete or transform them appropriately - so you don't mess up non-commented text. But that's probably beyond the scope of what you asked. (Python's built-in elementtree doesn't have a complete XSL implementation and probably can't be used to select comments.)

Related

Modifying XML file from string source - how to do it?

I have a problem, where I want to change some lines in my XML, but this XML is not in file, it is in string. I am using Python 3.x and lib xml.etree.ElementTree for this purpose.
I have these piece of code which I know works for files in project, but as I said, I want no files, only operations on string sources.
source_tree = ET.ElementTree(ET.fromstring(source_config))
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
item.text = item.text.replace(self.firstarg, self.secondarg)
This works, but I don't know how to save it. I tried
source_tree.write(source_config, encoding='latin-1') but this doesn't work (treats all XML as a name).
I don't think you need both source_tree and source_tree_root. By having both, you're creating two separate things. When you write using source_tree, you don't get the changes made to source_tree_root.
Try creating an ElementTree from source_tree_root (which is just an Element), like this (untested since you didn't supply an mcve)...
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
item.text = item.text.replace(self.firstarg, self.secondarg)
ET.ElementTree(source_tree_root).write("output.xml")
OK, I thought that you were using lxml. My bad. Well here is how you would do it with lxml. I think that lxml is superior.
Here is the basic way of parsing a string into an XML document.
from lxml import etree
doc = etree.fromstring(yourstring)
for e in doc.xpath('//sometagname'):
e.set('foo', 'bar')

Python lxml error "namespace not defined."

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots (500m+ of these files which are all gz compresed) and grab the text values form a few of the contained tags.
sample code:
from lxml import objectify, etree
import gzip
with open ('file_list','rb') as file_list:
for file in file_list:
in_xml = gzip.open(file.strip('\n'))
xml2 = etree.iterparse(in_xml)
for action, elem in xml2:
if elem.tag == "page_number":
print elem.text + str(file)
the first value elem.text is returned but only for the first file in the list and quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well formed XML. I assume that it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword from lxml.etree.iterparse:
xml2 =etree.iterparse(in_xml, recover=True)

Reading 1000s of XML documents with BeautifulSoup

I'm trying to read a bunch of xml files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file.
You can see a sample of the data hereWarning this will initiate a download of a 108MB zip file!. That's a huge xml file with thousands of smaller xml files inside it. I've broken those out into individual files. I want to rename the files based on a number inside (part of preprocessing). I have the following code:
from __future__ import print_function
from bs4 import BeautifulSoup # To get everything
import os
def rename_xml_files(directory):
xml_files = [xml_file for xml_file in os.listdir(directory) ]
for filename in xml_files:
filename = filename.strip()
full_filename = directory + "/" +filename
print (full_filename)
f = open(full_filename, "r")
xml = f.read()
soup = BeautifulSoup(xml)
del xml
del soup
f.close()
If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line, it will work for a moment and then it will eventually crap out - it just crashes the python kernel. I'm using Enthought Canopy, but I've tried it running from the command line and it craps out there, too.
I thought, perhaps, it was not deallocating the space for the variable "soup" so I tried adding the "del" commands. Same problem.
Any thoughts on how to circumvent this? I'm not stuck on BS. If there's a better way of doing this, I would love it, but I need a little sample code.
Try using cElementTree.parse() from Python's standard xml library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.
Like this:
import xml.etree.cElementTree as cET
# ...
def rename_xml_files(directory):
xml_files = [xml_file for xml_file in os.listdir(directory) ]
for filename in xml_files:
filename = filename.strip()
full_filename = directory + "/" +filename
print(full_filename)
parsed = cET.parse(full_filename)
del parsed
If your XML formatted correctly this should parse it. If your machine is still unable to handle all that data in memory, you should look into streaming the XML.
I would not separate that file out into many small files and then process them some more, I would process them all in one go.
I would just use a streaming api XML parser and parse the master file, get the name and write out the sub-files once with the correct name.
There is no need for BeautifulSoup which is primarily designed to handle HTML and uses a document model instead of a streaming parser.
There is no need for what you are doing to build an entire DOM just to get a single element all at once.

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.elementtree, the tree.write(file) method replaces the CRLF by LF only, which also creates issues when trying to import the XML file into i.e. PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace)) Finally a file.write(source_XML)
it's not nice, but it solves the issue. However, I do not mind about the indentations, so on this I can't really say. I would only use pprint.pprint() whenever I need to print it.

How to write an XML file without header in Python?

when using Python's stock XML tools such as xml.dom.minidom for XML writing, a file would always start off like
<?xml version="1.0"?>
[...]
While this is perfectly legal XML code, and it's even recommended to use the header, I'd like to get rid of it as one of the programs I'm working with has problems here.
I can't seem to find the appropriate option in xml.dom.minidom, so I wondered if there are other packages which do allow to neglect the header.
Cheers,
Nico
Unfortunately minidom does not give you the option to omit the XML Declaration.
But you can always serialise the document content yourself by calling toxml() on the document's root element instead of the document. Then you won't get an XML Declaration:
xml= document.documentElement.toxml('utf-8')
...but then you also wouldn't get anything else outside the root element, such as the DOCTYPE, or any comments or processing instructions. If you need them, serialise each child of the document object one by one:
xml= '\n'.join(node.toxml('utf-8') for node in document.childNodes)
I wondered if there are other packages which do allow to neglect the header.
DOM Level 3 LS defines an xml-declaration config parameter you can use to suppress it. The only Python implementation I know of is pxdom, which is thorough on standards support, but not at all fast.
If you want to use minidom and maintain 'prettiness', how about this as a quick/hacky fix:
xml_without_declaration.py:
import xml.dom.minidom as xml
doc = xml.Document()
declaration = doc.toxml()
a = doc.createElement("A")
doc.appendChild(a)
b = doc.createElement("B")
a.appendChild(b)
xml = doc.toprettyxml()[len(declaration):]
print xml
The header is print in Document. If you print the node directly, it won't print the header.
root = doc.childNodes[0]
root.toprettyxml(encoding="utf-8")
Just replace the first line with blank:
import xml.dom.minidom as MD
<XML String>.replace(MD.Document().toxml()+'\n', '')
If you're set on using minidom, just scan back in the file and remove the first line after writing all the XML you need.
You might be able to use a custom file-like object which removes the first tag, e.g:
class RemoveFirstLine:
def __init__(self, f):
self.f = f
self.xmlTagFound = False
def __getattr__(self, attr):
return getattr(self, self.f)
def write(self, s):
if not self.xmlTagFound:
x = 0 # just to be safe
for x, c in enumerate(s):
if c == '>':
self.xmlTagFound = True
break
self.f.write(s[x+1:])
else:
self.f.write(s)
...
f = RemoveFirstLine(open('path', 'wb'))
Node.writexml(f, encoding='UTF-8')
or something similar. This has the advantage the file doesn't have to be totally rewritten if the XML files are fairly large.
Purists may not like to hear this, but I have found using an XML parser to generate XML to be overkill. Just generate it directly as strings. This also lets you generate files larger than you can keep in memory, which you can't do with DOM. Reading XML is another story.
Use string replace
from xml.dom import minidom
mydoc = minidom.parse('filename.xml')
with open(newfile, "w" ) as fs:
fs.write(mydoc.toxml().replace('?xml version="1.0" ?>', ''))
fs.close()
That's it ;)

Categories