How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
The single case below is trivial, but when I run these commands on the full file, I get frames whose shapes do not match and therefore cannot be joined so easily.
How can I extract the XML information so that every row has all the RSku information plus its Blade information? Neither XPath result contains anything that would let me join it to the other.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)
Consider transforming the XML with XSLT to flatten the document down to the information you need. Specifically, select the Blade elements with the descendant:: axis and pull the attributes of the enclosing RSku with the ancestor:: axis. Python's lxml (the default parser of pandas.read_xml) can run XSLT 1.0 scripts.
In the XSLT below, <xsl:for-each> prefixes the attribute names with RSku_ and Blade_, because both elements share attribute names such as ID. Without that clash, the template would be much less wordy.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/TopologyDefinition">
<root>
<xsl:apply-templates select="descendant::Blade"/>
</root>
</xsl:template>
<xsl:template match="Blade">
<data>
<xsl:for-each select="ancestor::RSku/@*">
<xsl:attribute name="{concat('RSku_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
<xsl:for-each select="@*">
<xsl:attribute name="{concat('Blade_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
</data>
</xsl:template>
</xsl:stylesheet>''')
blades_df = pd.read_xml(xml1, stylesheet=xsl)
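If running XSLT is not an option, the same flattening can be sketched in plain Python. This is only a minimal sketch and not part of the answer above: it uses the standard library's xml.etree.ElementTree on a trimmed copy of the sample, and blade_rows is a hypothetical helper name. It pairs each Blade with the RSku that contains it and applies the same RSku_ and Blade_ prefixes.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (assumption: no namespaced elements).
xml_text = """<TopologyDefinition>
  <RSkus>
    <RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
      <Blades>
        <Blade ID="{1-20}" SkuID="SBlade" />
      </Blades>
    </RSku>
  </RSkus>
</TopologyDefinition>"""


def blade_rows(xml_string):
    """Return one dict per Blade, combined with its parent RSku attributes."""
    root = ET.fromstring(xml_string)
    rows = []
    for rsku in root.iter("RSku"):
        # Walking down from each RSku keeps the parent/child link that two
        # separate XPath queries lose.
        for blade in rsku.iter("Blade"):
            row = {"RSku_" + key: value for key, value in rsku.attrib.items()}
            row.update({"Blade_" + key: value for key, value in blade.attrib.items()})
            rows.append(row)
    return rows


rows = blade_rows(xml_text)
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get one row per Blade.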
Using Python or XSLT, how can I convert a highly complex, hierarchically nested XML file to CSV, including all the sub-elements, while hard coding as few element nodes as is rational and effective?
Below is a simplified XML example and the desired CSV output, to give a better idea of what I'm trying to achieve.
The actual XML file has many more elements, but the data hierarchy and the nesting are the same as in the example. The <InvoiceRow> element and its sub-elements are the only repeating elements in the XML file. All other elements are static and should be repeated in the output CSV as many times as there are <InvoiceRow> elements in the XML file.
It's the repeating <InvoiceRow> element that is causing me trouble. Elements that don't repeat are easy to convert to CSV without hard coding any elements.
In short, this is a complex XML scenario: hierarchical data structures and multiple one-to-many relationships, all stored in a single structured text file.
Example XML input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Invoice>
<SellerDetails>
<Identifier>1234-1</Identifier>
<SellerAddress>
<SellerStreet>Street1</SellerStreet>
<SellerTown>Town1</SellerTown>
</SellerAddress>
</SellerDetails>
<BuyerDetails>
<BuyerIdentifier>1234-2</BuyerIdentifier>
<BuyerAddress>
<BuyerStreet>Street2</BuyerStreet>
<BuyerTown>Town2</BuyerTown>
</BuyerAddress>
</BuyerDetails>
<BuyerNumber>001234</BuyerNumber>
<InvoiceDetails>
<InvoiceNumber>0001</InvoiceNumber>
</InvoiceDetails>
<InvoiceRow>
<ArticleName>Article1</ArticleName>
<RowText>Product Text1</RowText>
<RowText>Product Text2</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article2</ArticleName>
<RowText>Product Text11</RowText>
<RowText>Product Text22</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article3</ArticleName>
<RowText>Product Text111</RowText>
<RowText>Product Text222</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
</InvoiceRow>
<EpiDetails>
<EpiPartyDetails>
<EpiBfiPartyDetails>
<EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
</EpiBfiPartyDetails>
</EpiPartyDetails>
</EpiDetails>
<InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>
Example CSV output:
Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article1,Product Text1,Product Text2,10,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article2,Product Text11,Product Text22,20,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article3,Product Text111,Product Text222,30,XXXXX,Some text
Consider the following example:
XML
<Invoice>
<SellerDetails>
<Identifier>1234-1</Identifier>
<SellerAddress>
<SellerStreet>Street1</SellerStreet>
<SellerTown>Town1</SellerTown>
</SellerAddress>
</SellerDetails>
<BuyerDetails>
<BuyerIdentifier>1234-2</BuyerIdentifier>
<BuyerAddress>
<BuyerStreet>Street2</BuyerStreet>
<BuyerTown>Town2</BuyerTown>
</BuyerAddress>
</BuyerDetails>
<BuyerNumber>001234</BuyerNumber>
<InvoiceDetails>
<InvoiceNumber>0001</InvoiceNumber>
</InvoiceDetails>
<InvoiceRow>
<ArticleName>Article1</ArticleName>
<RowText>Product Text1</RowText>
<RowText>Product Text2</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article2</ArticleName>
<RowText>Product Text11</RowText>
<RowText>Product Text22</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article3</ArticleName>
<RowText>Product Text111</RowText>
<RowText>Product Text222</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
</InvoiceRow>
<EpiDetails>
<EpiPartyDetails>
<EpiBfiPartyDetails>
<EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
</EpiBfiPartyDetails>
</EpiPartyDetails>
</EpiDetails>
<InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="Invoice">
<xsl:variable name="common-head">
<xsl:value-of select="SellerDetails/Identifier"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="BuyerDetails/BuyerIdentifier"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="InvoiceDetails/InvoiceNumber"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
</xsl:variable>
<xsl:variable name="common-tail">
<xsl:value-of select="EpiDetails/EpiPartyDetails/EpiBfiPartyDetails/EpiBfiIdentifier"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
<xsl:value-of select="InvoiceUrlText"/>
</xsl:variable>
<!-- header -->
<xsl:text>SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
</xsl:text>
<!-- data -->
<xsl:for-each select="InvoiceRow">
<xsl:copy-of select="$common-head"/>
<xsl:value-of select="ArticleName"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="RowAmount"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
<xsl:copy-of select="$common-tail"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Result
SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,1234-2,0001,Article1,10.00,XXXXX,Some text
1234-1,1234-2,0001,Article2,20.00,XXXXX,Some text
1234-1,1234-2,0001,Article3,30.00,XXXXX,Some text
Added in response to:
Is there a way in XSLT to get the same results using loop? For example loop through and output all the elements and the sub-elements except the InvoiceRow elements and then vice versa?
If you prefer, you could try something like:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="Invoice">
<xsl:variable name="invoice-fields" select="//*[not(*) and not(ancestor::InvoiceRow)]" />
<xsl:variable name="common-data">
<xsl:for-each select="$invoice-fields">
<xsl:value-of select="."/>
<xsl:text>,</xsl:text>
</xsl:for-each>
</xsl:variable>
<!-- header -->
<xsl:for-each select="$invoice-fields">
<xsl:value-of select="name()"/>
<xsl:text>,</xsl:text>
</xsl:for-each>
<xsl:for-each select="InvoiceRow[1]/*">
<xsl:value-of select="name()"/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
<!-- data -->
<xsl:for-each select="InvoiceRow">
<xsl:copy-of select="$common-data"/>
<xsl:for-each select="*">
<xsl:value-of select="."/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result here would be:
Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,EpiBfiIdentifier,InvoiceUrlText,ArticleName,RowText,RowText,RowAmount
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article1,Product Text1,Product Text2,10.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article2,Product Text11,Product Text22,20.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article3,Product Text111,Product Text222,30.00
i.e. listing all invoice fields before the row fields.
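Since the question also allows Python, the generic loop above can be approximated with the standard library. This is a sketch under stated assumptions, not a definitive implementation: invoice_to_csv is a hypothetical helper, the sample is a trimmed version of the invoice, and, like the XSLT, it assumes every InvoiceRow has the same children as the first one.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed version of the invoice from the question.
SAMPLE = """<Invoice>
  <SellerDetails>
    <Identifier>1234-1</Identifier>
  </SellerDetails>
  <BuyerNumber>001234</BuyerNumber>
  <InvoiceRow>
    <ArticleName>Article1</ArticleName>
    <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
  </InvoiceRow>
  <InvoiceRow>
    <ArticleName>Article2</ArticleName>
    <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
  </InvoiceRow>
  <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>"""


def invoice_to_csv(xml_string, row_tag="InvoiceRow"):
    """Repeat the non-row leaf fields on every repeating row."""
    root = ET.fromstring(xml_string)
    rows = root.findall(row_tag)
    if not rows:
        return ""
    # Remember every element that lives inside a repeating row.
    in_rows = {id(el) for row in rows for el in row.iter()}
    # Leaf elements outside the rows, in document order (the common fields).
    common = [el for el in root.iter() if len(el) == 0 and id(el) not in in_rows]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header: common field names first, then the children of the first row.
    writer.writerow([el.tag for el in common] + [child.tag for child in rows[0]])
    common_values = [el.text or "" for el in common]
    for row in rows:
        writer.writerow(common_values + [child.text or "" for child in row])
    return out.getvalue()


csv_text = invoice_to_csv(SAMPLE)
print(csv_text)
```

Unlike the XSLT version, csv.writer quotes any value that itself contains a comma, which may matter for free-text fields such as RowText.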
I have handled a case similar to yours. I created a package based on untangle, which parses XML into plain Python objects, for example:
<?xml version="1.0"?>
<root>
<child name="child1"/>
</root>
to
obj.root.child['name'] # u'child1'
then you can easily write some code to traverse the object to get what you want.
For example, you can do something like get_items_by_tag(InvoiceRow).
Hope it helps!
In PHP, you can use registerPHPFunctions to use a PHP function inside an XSLT file like this:
<?php
$xml = <<<EOB
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>
EOB;
$xsl = <<<EOB
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:php="http://php.net/xsl">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<xsl:value-of
select="php:function('ucfirst',concat(string(uid), string(id)))"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
EOB;
$xmldoc = DOMDocument::loadXML($xml);
$xsldoc = DOMDocument::loadXML($xsl);
$proc = new XSLTProcessor();
$proc->registerPHPFunctions('ucfirst');
$proc->importStyleSheet($xsldoc);
echo $proc->transformToXML($xmldoc);
?>
What is the Python equivalent? This is what I've tried
from lxml import etree
xml = etree.XML('''
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>''')
xsl = etree.XML('''
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="mynamespace"
extension-element-prefixes="f">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<f:ucfirst>
<xsl:value-of select="concat(string(uid), string(id))"/>
</f:ucfirst>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
''')
extension = Ucase()
extensions = { ('mynamespace', 'ucfirst') : extension }
proc = etree.XSLT(xsl, extensions=extensions)
str(proc(xml))
class Ucase(etree.XSLTExtension):
    def execute(self, context, self_node, input_node, output_parent):
        title = self_node[0].text.capitalize()
        output_parent.text(title)
This is a simplified version of my XSLT.
Here is how an extension function (not an element) can give the result that I think you want:
from lxml import etree
def ucfirst(context, s):
    # lxml passes the XPath evaluation context as the first argument
    return s.capitalize()
ns = etree.FunctionNamespace("mynamespace")
ns['ucfirst'] = ucfirst
xml = etree.XML('''
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>''')
xsl = etree.XML(b'''\
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="mynamespace" exclude-result-prefixes="f">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<xsl:value-of select="f:ucfirst(concat(string(uid), string(id)))"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
''')
transform = etree.XSLT(xsl)
result = transform(xml)
print(result)
Output:
<html><body>
<h2>Users</h2>
<table>
<tr><td>Bob1</td></tr>
<tr><td>Joe2</td></tr>
</table>
</body></html>
See http://lxml.de/extensions.html#xpath-extension-functions.
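A function registered this way can also be exercised directly from XPath, without a stylesheet, which is a convenient way to check it in isolation. The following is a minimal sketch assuming lxml is installed; the prefix f is bound to the namespace through the namespaces argument rather than in a stylesheet.

```python
from lxml import etree


def ucfirst(context, s):
    # lxml passes the XPath evaluation context as the first argument
    return s.capitalize()


# Register the function globally under the same namespace URI as above.
ns = etree.FunctionNamespace("mynamespace")
ns["ucfirst"] = ucfirst

users = etree.XML(
    b"<allusers>"
    b"<user><uid>bob</uid><id>1</id></user>"
    b"<user><uid>joe</uid><id>2</id></user>"
    b"</allusers>"
)

# Evaluate the same expression the stylesheet uses, once per <user> element.
names = [
    user.xpath(
        "f:ucfirst(concat(string(uid), string(id)))",
        namespaces={"f": "mynamespace"},
    )
    for user in users
]
print(names)
```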
There are separate answers for variables and functions. I'm only really familiar with the variable half.
For variables, you can pass them as an xsl:param by passing them as keyword arguments to the call. For example:
transform = etree.XSLT(xslt_tree)
result = transform(doc_root, a="5")
Note that the argument is an XPath expression, so strings need to be quoted. There is a function that does this opaquely:
result = transform(doc_root, a=etree.XSLT.strparam(""" It's "Monty Python" """))
If you want to pass an XML fragment you could use the exslt:node-set() function.
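The variable-passing calls above can be sketched end to end as follows. This is an illustrative example assuming lxml is installed; the stylesheet, the empty root document, and the parameter name a are made up for the demonstration. It shows that a plain keyword argument is evaluated as an XPath expression, while strparam passes the text through as a string.

```python
from lxml import etree

# Hypothetical stylesheet that simply echoes the top-level parameter "a".
xslt_tree = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="a"/>
  <xsl:template match="/">
    <xsl:value-of select="$a"/>
  </xsl:template>
</xsl:stylesheet>""")

doc_root = etree.XML(b"<root/>")
transform = etree.XSLT(xslt_tree)

# A plain string is treated as an XPath expression, so this is arithmetic.
number_result = str(transform(doc_root, a="2 + 3"))

# strparam quotes arbitrary text, including both kinds of quote characters.
text_result = str(
    transform(doc_root, a=etree.XSLT.strparam("""It's "Monty Python" """))
)

print(number_result)
print(text_result)
```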
For functions, you can expose them either as an XPath function or as an extension element. There are several variations and I haven't done this myself, so read the docs below and/or edit this answer.
Docs for basic use and variables.
Docs for adding functions.