I can't control quality of XML that I get. In some cases it is:
<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.1">
...
</COLLADA>
in others I get:
<COLLADA>...</COLLADA>
and I guess I should also handle
<collada:COLLADA xmlns:collada="http://www.collada.org/2005/11/COLLADASchema">
...
</collada:COLLADA>
It's the same schema all over, and I only need one parser to process it. How can I handle all these cases? I need XPath and other lxml goodies to get through this. How do I make it consistent during etree.parse time? I don't want to check on namespaces every time I need to use XPath.
My usual recommendation is to preprocess it first, to normalize the namespaces. This has two benefits: the normalization code is highly reusable, because it doesn't depend on how the data is being processed subsequently; and the logic to process the data is considerably simplified.
If the documents only use this one namespace, or none, and do not use qualified names in the content of text or attribute nodes, then the transformation to achieve this normalization is very simple:
<xsl:template match="*">
<xsl:element name="{local-name()}" namespace="http://www.collada.org/2005/11/COLLADASchema">
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
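For comparison, here is a minimal sketch of the same normalization done directly in Python rather than in XSLT. It uses the standard library's xml.etree.ElementTree instead of lxml (the element API used here is shared by both), and it relies on the same assumptions as the transformation above: one namespace or none, and no qualified names inside text or attribute values. The helper name normalize_namespace and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

COLLADA_NS = "http://www.collada.org/2005/11/COLLADASchema"

def normalize_namespace(root, namespace=COLLADA_NS):
    """Put every element into `namespace`, keeping only its local name."""
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments / processing instructions if present
        local = elem.tag.rsplit("}", 1)[-1]  # drop any "{uri}" prefix
        elem.tag = "{%s}%s" % (namespace, local)
    return root

# The three spellings from the question, each with one hypothetical child.
variants = [
    '<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.1"><asset/></COLLADA>',
    '<COLLADA><asset/></COLLADA>',
    '<collada:COLLADA xmlns:collada="http://www.collada.org/2005/11/COLLADASchema"><collada:asset/></collada:COLLADA>',
]

ns = {"c": COLLADA_NS}
# After normalization the same namespaced query works on all three inputs.
results = [len(normalize_namespace(ET.fromstring(v)).findall("c:asset", ns))
           for v in variants]
print(results)
```

Once the tree is normalized, every later query can use the one namespace mapping without checking which variant was received.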
I have many graphml files starting with:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
I need to change the xmlns and xsi attributes to reflect proper values for this XML file format specification:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
I tried to change these values with BeautifulSoup like:
soup = BeautifulSoup(myfile, 'html.parser')
soup.graphml['xmlns'] = 'http://graphml.graphdrawing.org/xmlns'
soup.graphml['xsi:schemalocation'] = "http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
It works fine but it is definitely too slow on some of my larger files, so I am trying to do the same with lxml, but I don't understand how to achieve the same result. I sort of managed to reach the attributes, but don't know how to change them:
doc = etree.parse(myfile)
root = doc.getroot()
root.attrib
> {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://graphml.graphdrawing.org/xmlns/graphml'}
What is the right way to accomplish this task?
When you say that you have many files "starting with" those 4 lines, if you really mean they're exactly like that, the fastest way is probably to entirely ignore the fact that it's XML, and just replace those lines.
In Python, just read the first four lines, compare them to what you expect (so you can issue a warning if they don't match), then discard them. Write out the new four lines you want, then copy the rest of the file out. Repeat for each file.
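A rough sketch of that line-replacement approach follows. It assumes the header really is the four lines shown in the question (compared after stripping surrounding whitespace); replace_header is a hypothetical helper name, and the demo runs on in-memory streams rather than real files.

```python
import io
import shutil
import warnings

OLD_HEADER = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"',
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"',
    'xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">',
]
NEW_HEADER = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns"',
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"',
    'xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns',
    'http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">',
]

def replace_header(src, dst, old=OLD_HEADER, new=NEW_HEADER):
    """Copy text stream src to dst, swapping the first len(old) lines for `new`."""
    head = [src.readline().strip() for _ in old]
    if head != old:
        # The answer suggests warning on a mismatch; the lines are replaced anyway.
        warnings.warn("unexpected header lines: %r" % head)
    dst.write("\n".join(new) + "\n")
    shutil.copyfileobj(src, dst)  # copy the rest of the file unchanged

# Demo on an in-memory document.
sample = "\n".join(OLD_HEADER) + "\n<graph/>\n</graphml>\n"
out = io.StringIO()
replace_header(io.StringIO(sample), out)
fixed = out.getvalue()
print(fixed)
```

For real files, the same function can be called with two handles from open(), writing to a temporary file and renaming it afterwards.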
On the other hand, if you have namespace attributes anywhere else in the file this method wouldn't catch them, and you should probably do a real XML-based solution. With a regular SAX parser, you get a callback for each element start, element end, text node, etc. as it comes along. So you'd just copy them out until you hit the one(s) you want (in this case, a graphml element), then instead of copying out that start-tag, write out the new one you want. Then back to copying. XSLT is also a fine way to do this, which would let you write a tiny generic copier, plus one rule to handle the graphml element.
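The SAX variant could look roughly like the sketch below, using only the standard library's xml.sax. It assumes namespace processing is left off (the default), so the xmlns declarations arrive as ordinary attributes and can be replaced wholesale. GraphmlFixer and fix_graphml are illustrative names, and XMLGenerator writes empty elements as start/end pairs by default.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl

NEW_ATTRS = {
    "xmlns": "http://graphml.graphdrawing.org/xmlns",
    "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "xsi:schemaLocation": "http://graphml.graphdrawing.org/xmlns "
                          "http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd",
}

class GraphmlFixer(XMLFilterBase):
    """Pass every event through, but rewrite the graphml start tag."""
    def startElement(self, name, attrs):
        if name == "graphml":
            attrs = AttributesImpl(NEW_ATTRS)
        super().startElement(name, attrs)

def fix_graphml(source, out):
    reader = GraphmlFixer(xml.sax.make_parser())
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(source)

# Demo on an in-memory document with the incorrect namespace values.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml" '
          b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
          b'xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">'
          b'<graph id="G"/></graphml>')
out = io.StringIO()
fix_graphml(io.BytesIO(sample), out)
fixed = out.getvalue()
print(fixed)
```

Because the document is streamed event by event, memory use stays flat even for the larger files.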
Suppose I have an abc.xml and a sample.xsl which transforms my abc.xml to 123.xml.
Now, given the transformed XML and the stylesheet sample.xsl, is there a way to get back my original abc.xml file?
Is it possible to achieve this using XSLT?
given the transformed XML and the stylesheet sample.xsl is there a way
to get back my original abc.xml file.
This is theoretically impossible, because information may be (and very often is) discarded during the first transformation. Here's a trivial example:
Original XML
<input>123.4567</input>
XSLT
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<output>
<xsl:value-of select="round(input)" />
</output>
</xsl:template>
</xsl:stylesheet>
Result
<?xml version="1.0" encoding="UTF-8"?>
<output>123</output>
As you can see, the original fractional part, .4567, is nowhere in the result nor in the stylesheet.
In certain (simple) cases you can make an educated guess regarding the contents of the original XML document, but no more than that.
As Matthias already said, this is (usually) not possible. Suppose this is your input XML:
<hello>world</hello>
And you transform that using the following stylesheet:
<xsl:template match="hello"><world>Not here!</world></xsl:template>
This will produce the following XML:
<world>Not here!</world>
Obviously, no round-trip conversion is possible in this case: the resulting XML is (usually) just a text file, and it contains no meta-information about its state prior to the conversion. However, if your conversion includes some way to make the round trip possible, for instance by embedding the original XML, then it becomes a trivial exercise:
<xsl:template match="/">
<result>
<original-xml>
<xsl:copy-of select="."/>
</original-xml>
<xsl:apply-templates />
</result>
</xsl:template>
Then you can get your original file back by doing:
<xsl:template match="original-xml">
<xsl:copy-of select="node()" />
</xsl:template>
A waste of resources, if you ask me, but if that is the requirement... (though, a warning up front, a copy of XML is not always 100% equal to the original due to whitespace stripping, order of attributes and line-end handling).
Another way of doing this is perhaps by storing the original URI to your source document, which, if not moved, can be used to reconstruct the original by simply pointing to that file.
I have some code that, as part of its running, takes a HTML document and mangles it into another form for the sake of output. (HTML to BBCode, essentially.)
I am currently doing this via defining a dictionary of XPath and the replacements, and then iterating over the dictionary with tools from lxml:
change_xpaths = {
    XPath(".//span[contains(@style, 'font')]") : "font",
    XPath(".//span[contains(@style, 'color')]") : "color",
    XPath(".//span[contains(@style, 'size')]") : "size"
}
replace_xpaths = {
    XPath(".//span[@style='text-decoration: underline']") : "u",
    XPath(".//span[@style='text-decoration: line-through']") : "s",
    XPath(".//div[@style='padding-left: 30px']") : "remove"
}
def _clean_text(cls, raw):
    for ele in cls.struck_through_xpath(raw):
        ele.getparent().remove(ele)
    for xp, repl in cls.replace_xpaths.items():
        for ele in xp(raw):
            ele.attrib.pop("style")
            ele.tag = repl
    for xp, chng in cls.change_xpaths.items():
        for ele in xp(raw):
            ele.tag = chng
    for br in raw.xpath(".//br"):
        try:
            br.tail = "\n" + br.tail
        except TypeError:
            br.tail = "\n"
    strip_elements(raw, 'img', with_tail = False)
    strip_elements(raw, 'br', with_tail = False)
    strip_tags(raw, 'remove')
(This is, indeed, part of a class definition.)
I understand that I can do this using an xslt transform, too.
I would like, firstly, a confirmation that I can indeed do all this with XSLT, namely, replacing some tags with non-standard tags, and outright removing tags while leaving their text or tail content.
Secondly, I would like to know if I can expect a significant performance increase by doing so? I would suspect so, however, I can't seem to find much about this on the internet.
Question 1: Yes, this is possible with XSLT. But it seems that you simply ignore the font, color and size values. Actually parsing these values from inline CSS could be complicated with XSLT 1.0.
Question 2: I think it will be significantly faster. With your current solution, you have to iterate over all nodes of your document multiple times (more than 10 times, AFAICS). With an XSLT stylesheet, you visit each input node only once. Also, since lxml is based on libxml2 and libxslt, you'll need fewer calls into the C API, which can be quite expensive in my experience.
OTOH, you could get a similar performance boost by rewriting your Python code to scan the document only once.
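As a rough illustration of such a single-pass rewrite, the sketch below walks the tree once and decides each element's new tag from lookup tables. It uses the standard library's ElementTree so that it runs without lxml (iter(), get(), tag and attrib behave the same way there), and it covers only the tag-renaming part of the original code, not the removal of br/img or the final strip_tags call. EXACT, SUBSTRINGS and retag_once are hypothetical names.

```python
import xml.etree.ElementTree as ET

# (tag, exact style value) -> replacement tag; the style attribute is dropped.
EXACT = {
    ("span", "text-decoration: underline"): "u",
    ("span", "text-decoration: line-through"): "s",
    ("div", "padding-left: 30px"): "remove",
}
# Substrings checked in order on <span style="...">; first hit wins.
SUBSTRINGS = ("font", "color", "size")

def retag_once(root):
    """Rename styled elements in one traversal instead of one per XPath."""
    for ele in root.iter():
        style = ele.get("style")
        if style is None:
            continue
        new_tag = EXACT.get((ele.tag, style))
        if new_tag is not None:
            del ele.attrib["style"]
            ele.tag = new_tag
        elif ele.tag == "span":
            for key in SUBSTRINGS:
                if key in style:
                    ele.tag = key
                    break
    return root

doc = ET.fromstring(
    '<div><span style="text-decoration: underline">a</span>'
    '<span style="font-size: 12px">b</span>'
    '<span style="color: red">c</span></div>'
)
tags = [child.tag for child in retag_once(doc)]
print(tags)
```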
Make sure that you compile your XSLT stylesheet only once if you make multiple transformations.
There are also some optimizations possible on the XSLT level. The most elegant way would be to write templates like:
<xsl:template match="span[contains(@style, 'font')]">...
<xsl:template match="span[contains(@style, 'color')]">...
<xsl:template match="span[contains(@style, 'size')]">...
It might be a little faster to have a single template per element name like this:
<xsl:template match="span">
<xsl:choose>
<xsl:when test="contains(@style, 'font')">...
<xsl:when test="contains(@style, 'color')">...
<xsl:when test="contains(@style, 'size')">...
I have just finished skimming through the Python DOM API and can't seem to find what I am looking for.
I basically want to preserve the XML tags when traversing through the DOM tree. The idea is to print the tag name and corresponding attributes which I later want to convert into an xml file.
<book name="bookname" source="/home/phiri/Book/book.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>
<page>page1</page>
<page>page2</page>
</chapter>
<chapter>
<page>page1</page>
<page>page2</page>
<page>Page3</page>
</chapter>
</book>
Using the XML contents above, for instance, what I want is for the resulting book.xml file to contain:
<book name="bookname" source="/home/phiri/Book/book.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter></chapter>
<chapter></chapter>
</book>
Is there an alternative xml package I could use to preserve results I get when extracting contents using python?
A simple way to get the output you posted from the input is to override the XSLT identity transform. It looks like you want to eliminate all text nodes and all elements that have two or more ancestors (here, the page elements nested under book and chapter), so you'd just add empty templates for those:
<xsl:template match="text()"/>
<xsl:template match="*[count(ancestor::*) >= 2]"/>
Generally the best way to use XSLT in Python is with the libxml2 module. Unless you need a pure Python solution, in which case you're stuck not using XSLT, because nobody's built a pure Python XSLT processor yet.
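If a pure-Python route is needed, the same pruning can be sketched with the standard library's ElementTree instead of XSLT: drop all text and every element nested below the chapter level. The function name prune and its max_depth parameter are made up for this example. The xmlns:xi declaration is left out of the sample because ElementTree does not write namespace declarations that no element or attribute uses.

```python
import xml.etree.ElementTree as ET

def prune(elem, depth=0, max_depth=1):
    """Drop text and every element deeper than max_depth (the root is depth 0)."""
    elem.text = None
    elem.tail = None
    for child in list(elem):  # copy the child list, since we remove while looping
        if depth + 1 > max_depth:
            elem.remove(child)
        else:
            prune(child, depth + 1, max_depth)
    return elem

book = ET.fromstring(
    '<book name="bookname" source="/home/phiri/Book/book.xml">'
    '<chapter><page>page1</page><page>page2</page></chapter>'
    '<chapter><page>page1</page></chapter>'
    '</book>'
)
result = ET.tostring(prune(book), encoding="unicode", short_empty_elements=False)
print(result)
```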
I need to remove white spaces between xml tags, e.g. if the original xml looks like:
<node1>
<node2>
<node3>foo</node3>
</node2>
</node1>
I'd like the end-result to be crunched down to single line:
<node1><node2><node3>foo</node3></node2></node1>
Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.
I have a couple of ideas so far: (1) parse the xml as text and look for start and end of tags < and > (2) another approach is to load the xml document and go node-by-node and print out a new document by concatenating the tags.
I think either method would work, but I'd rather not reinvent the wheel here, so maybe there is a Python library that already does something like this? If not, then any issues/pitfalls to be aware of when rolling out my own cruncher? Any recommendations?
EDIT
Thank you all for answers/suggestions, both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.
This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
foo = """<node1>
<node2>
<node3>foo </node3>
</node2>
</node1>"""
bar = etree.XML(foo, parser)
print(etree.tostring(bar, pretty_print=False, with_tail=True).decode())
Results in:
<node1><node2><node3>foo </node3></node2></node1>
Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
I'd use XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
That should do the trick.
In Python you could use lxml to apply the transformation.
For some tests, use xsltproc, sample:
xsltproc test.xsl test.xml
where test.xsl is the file above and test.xml your XML file.
Pretty straightforward with BeautifulSoup.
This solution assumes it is ok to strip whitespace from the tail ends of character data.
Example: <foo> bar </foo> becomes <foo>bar</foo>
It will correctly ignore comments and CDATA.
import BeautifulSoup
s = """
<node1>
<node2>
<node3>foo</node3>
</node2>
<node3>
<!-- I'm a comment! Leave me be! -->
</node3>
<node4>
<![CDATA[
I'm CDATA! Changing me would be bad!
]]>
</node4>
</node1>
"""
soup = BeautifulSoup.BeautifulStoneSoup(s)
for t in soup.findAll(text=True):
    if type(t) is BeautifulSoup.NavigableString: # Ignores comments and CDATA
        t.replaceWith(t.strip())
print soup
Not a solution really but since you asked for recommendations: I'd advise against doing your own parsing (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the "xml:space=preserve" attribute, which correspond to things like <pre> in XHTML (where the enclosed whitespaces actually have meaning), and writing a parser that is able to recognize those elements and leave the whitespace alone would be possible but unpleasant.
I would go with the parsing method, i.e. load the document and go node-by-node printing them out. That way you can easily identify which nodes you can strip the spaces out of and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-) that could be useful to you... try xml.dom, or I'm not sure if you could do this with xml.parsers.expat.
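A minimal sketch of that node-by-node approach with xml.dom.minidom might look like this. It removes whitespace-only text nodes, leaves CDATA sections untouched, and skips subtrees marked xml:space="preserve". The function name strip_blank_text and the sample document are invented for illustration, and unlike the BeautifulSoup answer it does not trim whitespace around real text.

```python
from xml.dom import minidom, Node

def strip_blank_text(node, preserve=False):
    """Remove whitespace-only text nodes below `node`.

    CDATA sections have their own node type and are never touched;
    subtrees under xml:space="preserve" keep all their whitespace.
    """
    if node.nodeType == Node.ELEMENT_NODE and node.hasAttribute("xml:space"):
        preserve = node.getAttribute("xml:space") == "preserve"
    for child in list(node.childNodes):  # copy, since we remove while looping
        if child.nodeType == Node.TEXT_NODE:
            if not preserve and not child.data.strip():
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_blank_text(child, preserve)
    return node

source = """<node1>
  <node2>
    <node3>foo</node3>
  </node2>
  <node4><![CDATA[  keep   me  ]]></node4>
  <pre xml:space="preserve"> <b>x</b> </pre>
</node1>"""

doc = minidom.parseString(source)
crunched = strip_blank_text(doc.documentElement).toxml()
print(crunched)
```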