Python XML/Pandas: How to merge nested XML? - python

How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
While this is a single case and trivial in the instance below; if I run the below commands on the full file, I get shapes that do not match and therefore cannot be joined so easily.
How can I extract the XML information such that for every row, I get all the RSku information PLUS its Blade information. Each xpath contains no information that would let me join it to another xpath so that I may combine the information.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)

Consider transforming the XML with XSLT by flattening the document with information you need. Specifically, retrieve only Blade attributes using descendant::* axis and corresponding RSku attributes using the ancestor::* axis. Python' lxml (default parser of pandas.read_xml) can run XSLT 1.0 scripts.
Below XSLT's <xsl:for-each> is used to prefix RSku_ and Blade_ to attribute names since they share same attribute such as ID. Otherwise template would be much less wordy.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/TopologyDefinition">
<root>
<xsl:apply-templates select="descendant::Blade"/>
</root>
</xsl:template>
<xsl:template match="Blade">
<data>
<xsl:for-each select="ancestor::RSku/#*">
<xsl:attribute name="{concat('RSku_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
<xsl:for-each select="#*">
<xsl:attribute name="{concat('Blade_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
</data>
</xsl:template>
</xsl:stylesheet>''')
blades_df = pd.read xml(xml1, stylesheet=xsl)
Online XSLT Demo

Related

How do I insert a tag that holds the text of an older tag in xml using python?

I want to insert s tags inside a tag that already exists and move the text of the older tag inside the s tag. For example, if my XML file looks like this:
<root>
<name>Light and dark</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
I want it to be like this (check the name tag):
<root>
<name>
<s>Light and dark</s>
</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
I tried using ET.SubElement but it doesn't give me the same result.
It it much better to use XSLT for such tasks.
XSLT has so called Identity Transform pattern.
Input XML
<root>
<name>Light and dark</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="name">
<xsl:copy>
<s>
<xsl:value-of select="."/>
</s>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output XML
<root>
<name>
<s>Light and dark</s>
</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
To insert a sub-element in the XML using ElementTree XML API, append the new element then set it with the text value of the parent element.
import xml.etree.ElementTree as ET
xml = """
<root>
<name>Light and dark</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>"""
root = ET.fromstring(xml)
# 1. find name element in document
name = root.find('name')
# 2. get text value and reset the element
text = name.text
name.clear()
# 3. create new element s and set text
elt = ET.SubElement(name, "s")
elt.text = text
print(ET.tostring(root, encoding='unicode'))
To process multiple elements, add a loop around steps 1-3:
for child in root.findall('name'):
text = child.text
child.clear()
elt = ET.SubElement(child, "s")
elt.text = text
Output:
<root>
<name><s>Light and dark</s></name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>

Converting complex XML to CSV using Python or XSLT

Using Python or XSLT, I would like to know how to convert highly complex, hierarchical nested XML file to CSV including all the sub-elements and without hard coding as few element nodes as possible or is rational/effective?
Please find attached simplified XML example and the output CSV to get a better understanding of what I’m trying to achieve.
The actual XML file has much more elements but the data hierarchy and the nesting is like in the example. <InvoiceRow> element and its sub-elements are the only repeating elements in the XML file, all the other elements are static that are repeated in the output CSV as many times as there are <InvoiceRow> elements in the XML file.
It’s the repeating <InvoiceRow> element that is causing trouble for me. Elements that don’t repeat are easy to convert to CSV without hard coding any elements.
Complex XML scenarios, with hierarchical data structures and multiple one-to-many relationships all being stored in a single XML file. Structured text file.
Example XML input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Invoice>
<SellerDetails>
<Identifier>1234-1</Identifier>
<SellerAddress>
<SellerStreet>Street1</SellerStreet>
<SellerTown>Town1</SellerTown>
</SellerAddress>
</SellerDetails>
<BuyerDetails>
<BuyerIdentifier>1234-2</BuyerIdentifier>
<BuyerAddress>
<BuyerStreet>Street2</BuyerStreet>
<BuyerTown>Town2</BuyerTown>
</BuyerAddress>
</BuyerDetails>
<BuyerNumber>001234</BuyerNumber>
<InvoiceDetails>
<InvoiceNumber>0001</InvoiceNumber>
</InvoiceDetails>
<InvoiceRow>
<ArticleName>Article1</ArticleName>
<RowText>Product Text1</RowText>
<RowText>Product Text2</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article2</ArticleName>
<RowText>Product Text11</RowText>
<RowText>Product Text22</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article3</ArticleName>
<RowText>Product Text111</RowText>
<RowText>Product Text222</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
</InvoiceRow>
<EpiDetails>
<EpiPartyDetails>
<EpiBfiPartyDetails>
<EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
</EpiBfiPartyDetails>
</EpiPartyDetails>
</EpiDetails>
<InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>
Example CSV output:
Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article1,Product Text1,Product Text2,10,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article2,Product Text11,Product Text22,20,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article3,Product Text111,Product Text222,30,XXXXX,Some text
Consider the following example:
XML
<Invoice>
<SellerDetails>
<Identifier>1234-1</Identifier>
<SellerAddress>
<SellerStreet>Street1</SellerStreet>
<SellerTown>Town1</SellerTown>
</SellerAddress>
</SellerDetails>
<BuyerDetails>
<BuyerIdentifier>1234-2</BuyerIdentifier>
<BuyerAddress>
<BuyerStreet>Street2</BuyerStreet>
<BuyerTown>Town2</BuyerTown>
</BuyerAddress>
</BuyerDetails>
<BuyerNumber>001234</BuyerNumber>
<InvoiceDetails>
<InvoiceNumber>0001</InvoiceNumber>
</InvoiceDetails>
<InvoiceRow>
<ArticleName>Article1</ArticleName>
<RowText>Product Text1</RowText>
<RowText>Product Text2</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article2</ArticleName>
<RowText>Product Text11</RowText>
<RowText>Product Text22</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article3</ArticleName>
<RowText>Product Text111</RowText>
<RowText>Product Text222</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
</InvoiceRow>
<EpiDetails>
<EpiPartyDetails>
<EpiBfiPartyDetails>
<EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
</EpiBfiPartyDetails>
</EpiPartyDetails>
</EpiDetails>
<InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="Invoice">
<xsl:variable name="common-head">
<xsl:value-of select="SellerDetails/Identifier"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="BuyerDetails/BuyerIdentifier"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="InvoiceDetails/InvoiceNumber"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
</xsl:variable>
<xsl:variable name="common-tail">
<xsl:value-of select="EpiDetails/EpiPartyDetails/EpiBfiPartyDetails/EpiBfiIdentifier"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
<xsl:value-of select="InvoiceUrlText"/>
</xsl:variable>
<!-- header -->
<xsl:text>SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
</xsl:text>
<!-- data -->
<xsl:for-each select="InvoiceRow">
<xsl:copy-of select="$common-head"/>
<xsl:value-of select="ArticleName"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="RowAmount"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
<xsl:copy-of select="$common-tail"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Result
SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,1234-2,0001,Article1,10.00,XXXXX,Some text
1234-1,1234-2,0001,Article2,20.00,XXXXX,Some text
1234-1,1234-2,0001,Article3,30.00,XXXXX,Some text
Added in response to:
Is there a way in XSLT to get the same results using loop? For example loop through and output all the elements and the sub-elements except the InvoiceRow elements and then vice versa?
If you prefer, you could try something like:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="Invoice">
<xsl:variable name="invoice-fields" select="//*[not(*) and not(ancestor::InvoiceRow)]" />
<xsl:variable name="common-data">
<xsl:for-each select="$invoice-fields">
<xsl:value-of select="."/>
<xsl:text>,</xsl:text>
</xsl:for-each>
</xsl:variable>
<!-- header -->
<xsl:for-each select="$invoice-fields">
<xsl:value-of select="name()"/>
<xsl:text>,</xsl:text>
</xsl:for-each>
<xsl:for-each select="InvoiceRow[1]/*">
<xsl:value-of select="name()"/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
<!-- data -->
<xsl:for-each select="InvoiceRow">
<xsl:copy-of select="$common-data"/>
<xsl:for-each select="*">
<xsl:value-of select="."/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result here would be:
Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,EpiBfiIdentifier,InvoiceUrlText,ArticleName,RowText,RowText,RowAmount
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article1,Product Text1,Product Text2,10.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article2,Product Text11,Product Text22,20.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article3,Product Text111,Product Text222,30.00
i.e. listing all invoice fields before the row fields.
I have done similar case like your requirements, I have created one package base on untangle, a package which can parse your XML to pure python objects like:
<?xml version="1.0"?>
<root>
<child name="child1"/>
</root>
to
obj.root.child['name'] # u'child1'
then you can easily write some code to traverse the object to get what you want.
For example, you can do something like get_items_by_tag(InvoiceRow).
Hope it helps!

lxml throwing xslt parse error - not able to template match any other than root

I'm using python/lxml to translate source xml to a target xml format. I keep getting a XLSTParseError when I try to match template to any other elements than root ('/') but cannot figure out what is wrong - pretty sure its namespace related though...The content i am trying to access from source xml is contained in the elements. Any idea how to fix or how to get lxml to output more detailed error msg?
Source xml has declaration:
<?xml version="1.0" encoding="UTF-8"?>
<dataroot generated="2016-10-24T09:16:37" xsi:noNamespaceSchemaLocation="BOLIG_XML.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:od="urn:schemas-microsoft-com:officedata">
<BOLIG_XML>...</BOLIG_XML>
<BOLIG_XML>...</BOLIG_XML>
...
Target xml has declaration:
<?xml version="1.0" encoding="utf-8"?>
<BoligListe xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:oio:lbf:1.0.0">
<BoligStruktur>...</BoligStruktur>
<BoligStruktur>...</BoligStruktur>
...
XSLT currently looks like this:
xslt_tree = etree.XML('''\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<BoligListe xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:oio:lbf:1.0.0">
<xsl:template match="BOLIG_XML">
<BoligStruktur>hello world</BoligStruktur>
</xsl:template>
</BoligListe>
</xsl:template>
</xsl:stylesheet>'''
)
Each xsl:template element needs to be a top level element of the stylesheet root element, you can't nest templates as you seem to try to do in your XSLT code. You then use xsl:apply-templates to process child elements with a matching template so I guess you want
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<BoligListe xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:oio:lbf:1.0.0">
<xsl:apply-templates/>
</BoligListe>
</xsl:template>
<xsl:template match="BOLIG_XML">
<BoligStruktur>hello world</BoligStruktur>
</xsl:template>
</xsl:stylesheet>

How to use python functions and variables inside XSLT?

In PHP, you can use registerPHPFunctions to use a PHP function inside an XSLT file like this:
<?php
$xml = <<<EOB
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>
EOB;
$xsl = <<<EOB
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:php="http://php.net/xsl">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<xsl:value-of
select="php:function('ucfirst',concat(string(uid), string(id)))"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
EOB;
$xmldoc = DOMDocument::loadXML($xml);
$xsldoc = DOMDocument::loadXML($xsl);
$proc = new XSLTProcessor();
$proc->registerPHPFunctions('ucfirst');
$proc->importStyleSheet($xsldoc);
echo $proc->transformToXML($xmldoc);
?>
What is the Python equivalent? This is what I've tried
from lxml import etree
xml = etree.XML('''
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>''')
xsl = etree.XML('''
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="mynamespace"
extension-element-prefixes="f">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<f:ucfirst>
<xsl:value-of select="concat(string(uid), string(id))"/>
</f:ucfirst>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
''')
extension = Ucase()
extensions = { ('mynamespace', 'ucfirst') : extension }
proc = etree.XSLT(xsl, extensions=extensions)
str(proc(xml))
class Ucase(etree.XSLTExtension):
def execute(self, context, self_node, input_node, output_parent):
title = self_node[0].text.capitalize()
output_parent.text(title)
This is a simplified version of my XSLT.
Here is how an extension function (not an element) can give the result that I think you want:
from lxml import etree
def ucfirst(context, s):
return s.capitalize()
ns = etree.FunctionNamespace("mynamespace")
ns['ucfirst'] = ucfirst
xml = etree.XML('''
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>''')
xsl = etree.XML('''\
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="mynamespace" exclude-result-prefixes="f">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<xsl:value-of select="f:ucfirst(concat(string(uid), string(id)))"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
''')
transform = etree.XSLT(xsl)
result = transform(xml)
print result
Output:
<html><body>
<h2>Users</h2>
<table>
<tr><td>Bob1</td></tr>
<tr><td>Joe2</td></tr>
</table>
</body></html>
See http://lxml.de/extensions.html#xpath-extension-functions.
There are separate answers for variables and functions. I'm only really familiar with the variable half.
For variables, you can pass them as an xsl:param by passing them as keyword arguments to the call. For example:
transform = etree.XSLT(xslt_tree)
result = transform(doc_root, a="5")
Note that the argument is an XPath expression, so strings need to be quoted. There is a function that does this opaquely:
result = transform(doc_root, a=etree.XSLT.strparam(""" It's "Monty Python" """))
If you want to pass an XML fragment you could use the exslt:node-set() function.
For functions, you can expose them either as an xpath function or as an element. There is a bunch of variety and I haven't done this myself so read the docs below and/or edit this answer.
Docs for basic use and variables.
Docs for adding functions.

(Not so) advanced xsl transformation of child nodes into list

Input:
<root>
<aa><aaa/><bbb/><ccc/><ddd/><eee/></aa>
<bb><ggg/></bb>
</root>
Desirable output:
<root>
<aa>aaa<aa>
<aa>bbb<aa>
<aa>ccc<aa>
<aa>ddd<aa>
<aa>eee<aa>
<bb>ggg</bb>
</root>
I've come up with the simple xslt but it properly handles only and doesn't create list of tags.
XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- select all elements that doesn't have any child nodes (elements or text etc) -->
<xsl:template match="//*[not(node())]">
<xsl:value-of select="name()"/>
</xsl:template>
</xsl:stylesheet>
Output:
<root>
<aa>aaabbbcccdddeee</aa>
<bb>ggg</bb>
</root>
P.S. It is part of python script. Does it make to do such conversions using xslt in python script? Or python solution using simple xpath and python logic will work better?
An example is not a substitute for explaining the logic behind the required transformation. I can think of several different ways to process your example input and arrive at the same output.
Here's a guess at what you want to accomplish (read the comments):
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<root>
<!-- select all elements that don't have any child nodes -->
<xsl:for-each select="//*[not(node())]">
<!-- create an element with the name of the parent element -->
<xsl:element name="{name(..)}">
<xsl:value-of select="name()"/>
</xsl:element>
</xsl:for-each>
</root>
</xsl:template>
</xsl:stylesheet>

Categories