Converting complex XML to CSV using Python or XSLT


Using Python or XSLT, how can I convert a highly complex, hierarchically nested XML file to CSV, including all the sub-elements, while hard-coding as few element nodes as is reasonable?
Below is a simplified XML example and the desired CSV output, to show what I'm trying to achieve.
The actual XML file has many more elements, but the data hierarchy and nesting are the same as in the example. <InvoiceRow> and its sub-elements are the only repeating elements in the XML file. All other elements occur once, and their values are repeated in the output CSV once per <InvoiceRow> element.
It's the repeating <InvoiceRow> element that is causing trouble for me. Elements that don't repeat are easy to convert to CSV without hard-coding any elements.
In short, this is a complex XML scenario: hierarchical data structures and multiple one-to-many relationships, all stored in a single XML file, that need to be flattened into a structured text file.
Example XML input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Invoice>
<SellerDetails>
<Identifier>1234-1</Identifier>
<SellerAddress>
<SellerStreet>Street1</SellerStreet>
<SellerTown>Town1</SellerTown>
</SellerAddress>
</SellerDetails>
<BuyerDetails>
<BuyerIdentifier>1234-2</BuyerIdentifier>
<BuyerAddress>
<BuyerStreet>Street2</BuyerStreet>
<BuyerTown>Town2</BuyerTown>
</BuyerAddress>
</BuyerDetails>
<BuyerNumber>001234</BuyerNumber>
<InvoiceDetails>
<InvoiceNumber>0001</InvoiceNumber>
</InvoiceDetails>
<InvoiceRow>
<ArticleName>Article1</ArticleName>
<RowText>Product Text1</RowText>
<RowText>Product Text2</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article2</ArticleName>
<RowText>Product Text11</RowText>
<RowText>Product Text22</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article3</ArticleName>
<RowText>Product Text111</RowText>
<RowText>Product Text222</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
</InvoiceRow>
<EpiDetails>
<EpiPartyDetails>
<EpiBfiPartyDetails>
<EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
</EpiBfiPartyDetails>
</EpiPartyDetails>
</EpiDetails>
<InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>
Example CSV output:
Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article1,Product Text1,Product Text2,10,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article2,Product Text11,Product Text22,20,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article3,Product Text111,Product Text222,30,XXXXX,Some text

Consider the following example:
XML
<Invoice>
<SellerDetails>
<Identifier>1234-1</Identifier>
<SellerAddress>
<SellerStreet>Street1</SellerStreet>
<SellerTown>Town1</SellerTown>
</SellerAddress>
</SellerDetails>
<BuyerDetails>
<BuyerIdentifier>1234-2</BuyerIdentifier>
<BuyerAddress>
<BuyerStreet>Street2</BuyerStreet>
<BuyerTown>Town2</BuyerTown>
</BuyerAddress>
</BuyerDetails>
<BuyerNumber>001234</BuyerNumber>
<InvoiceDetails>
<InvoiceNumber>0001</InvoiceNumber>
</InvoiceDetails>
<InvoiceRow>
<ArticleName>Article1</ArticleName>
<RowText>Product Text1</RowText>
<RowText>Product Text2</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article2</ArticleName>
<RowText>Product Text11</RowText>
<RowText>Product Text22</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
</InvoiceRow>
<InvoiceRow>
<ArticleName>Article3</ArticleName>
<RowText>Product Text111</RowText>
<RowText>Product Text222</RowText>
<RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
</InvoiceRow>
<EpiDetails>
<EpiPartyDetails>
<EpiBfiPartyDetails>
<EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
</EpiBfiPartyDetails>
</EpiPartyDetails>
</EpiDetails>
<InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="Invoice">
<xsl:variable name="common-head">
<xsl:value-of select="SellerDetails/Identifier"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="BuyerDetails/BuyerIdentifier"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="InvoiceDetails/InvoiceNumber"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
</xsl:variable>
<xsl:variable name="common-tail">
<xsl:value-of select="EpiDetails/EpiPartyDetails/EpiBfiPartyDetails/EpiBfiIdentifier"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
<xsl:value-of select="InvoiceUrlText"/>
</xsl:variable>
<!-- header -->
<xsl:text>SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowAmount,EpiBfiIdentifier,InvoiceUrlText
</xsl:text>
<!-- data -->
<xsl:for-each select="InvoiceRow">
<xsl:copy-of select="$common-head"/>
<xsl:value-of select="ArticleName"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="RowAmount"/>
<xsl:text>,</xsl:text>
<!-- add more here -->
<xsl:copy-of select="$common-tail"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Result
SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,1234-2,0001,Article1,10.00,XXXXX,Some text
1234-1,1234-2,0001,Article2,20.00,XXXXX,Some text
1234-1,1234-2,0001,Article3,30.00,XXXXX,Some text
Added in response to:
Is there a way in XSLT to get the same results using loop? For example loop through and output all the elements and the sub-elements except the InvoiceRow elements and then vice versa?
If you prefer, you could try something like:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="Invoice">
<xsl:variable name="invoice-fields" select="//*[not(*) and not(ancestor::InvoiceRow)]" />
<xsl:variable name="common-data">
<xsl:for-each select="$invoice-fields">
<xsl:value-of select="."/>
<xsl:text>,</xsl:text>
</xsl:for-each>
</xsl:variable>
<!-- header -->
<xsl:for-each select="$invoice-fields">
<xsl:value-of select="name()"/>
<xsl:text>,</xsl:text>
</xsl:for-each>
<xsl:for-each select="InvoiceRow[1]/*">
<xsl:value-of select="name()"/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
<!-- data -->
<xsl:for-each select="InvoiceRow">
<xsl:copy-of select="$common-data"/>
<xsl:for-each select="*">
<xsl:value-of select="."/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result here would be:
Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,EpiBfiIdentifier,InvoiceUrlText,ArticleName,RowText,RowText,RowAmount
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article1,Product Text1,Product Text2,10.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article2,Product Text11,Product Text22,20.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article3,Product Text111,Product Text222,30.00
i.e. listing all invoice fields before the row fields.
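For the Python side of the question, the same idea (collect every leaf element outside InvoiceRow once, then repeat it for each row) can be sketched with the standard library alone. This is a minimal sketch, not a general converter: it assumes the repeating rows are direct children of the root, it ignores attributes, and the function names invoice_to_csv and leaves as well as the shortened sample document are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened version of the invoice from the question (sample data for illustration).
XML = """<Invoice>
  <SellerDetails><Identifier>1234-1</Identifier></SellerDetails>
  <BuyerNumber>001234</BuyerNumber>
  <InvoiceRow>
    <ArticleName>Article1</ArticleName>
    <RowText>Product Text1</RowText>
    <RowText>Product Text2</RowText>
    <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
  </InvoiceRow>
  <InvoiceRow>
    <ArticleName>Article2</ArticleName>
    <RowText>Product Text11</RowText>
    <RowText>Product Text22</RowText>
    <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
  </InvoiceRow>
  <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>"""


def leaves(element):
    """Yield the element and its descendants that have no child elements."""
    for node in element.iter():
        if len(node) == 0:
            yield node


def invoice_to_csv(xml_text, row_tag="InvoiceRow"):
    """Hypothetical helper: one CSV line per repeating row element."""
    root = ET.fromstring(xml_text)
    # Assumption: the repeating rows are direct children of the root element.
    rows = root.findall(row_tag)
    # Leaf elements outside the rows are collected once and repeated per row.
    common = [leaf for child in root if child.tag != row_tag
              for leaf in leaves(child)]
    first_row = list(leaves(rows[0])) if rows else []

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([n.tag for n in common] + [n.tag for n in first_row])
    for row in rows:
        writer.writerow([n.text for n in common] + [n.text for n in leaves(row)])
    return out.getvalue()


csv_text = invoice_to_csv(XML)
print(csv_text)
```

As with the XSLT above, the invoice-level fields come before the row fields. Using csv.writer rather than joining strings by hand means values that contain commas or quotes are escaped correctly.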

I have dealt with a similar requirement by building a package on top of untangle, a library that parses XML into plain Python objects. For example, it turns:
<?xml version="1.0"?>
<root>
<child name="child1"/>
</root>
to
obj.root.child['name'] # u'child1'
Then you can easily write some code to traverse the object and get what you want.
For example, you can do something like get_items_by_tag(InvoiceRow).
Hope it helps!

Related

Python XML/Pandas: How to merge nested XML?

How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
This single case is trivial, but if I run the commands below on the full file, I get shapes that do not match and therefore cannot be joined easily.
How can I extract the XML information so that every row contains all the RSku information plus its Blade information? Neither XPath result contains anything that would let me join it to the other.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)

Consider transforming the XML with XSLT to flatten the document down to the information you need. Specifically, select only the Blade elements with the descendant:: axis and pick up the corresponding RSku attributes with the ancestor:: axis. Python's lxml (the default parser of pandas.read_xml) can run XSLT 1.0 scripts.
In the XSLT below, <xsl:for-each> is used to prefix attribute names with RSku_ and Blade_, since both elements share attribute names such as ID. Without that requirement, the template would be much less wordy.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/TopologyDefinition">
<root>
<xsl:apply-templates select="descendant::Blade"/>
</root>
</xsl:template>
<xsl:template match="Blade">
<data>
<xsl:for-each select="ancestor::RSku/@*">
<xsl:attribute name="{concat('RSku_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
<xsl:for-each select="@*">
<xsl:attribute name="{concat('Blade_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
</data>
</xsl:template>
</xsl:stylesheet>''')
blades_df = pd.read_xml(xml1, stylesheet=xsl)
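If you would rather avoid XSLT, the same parent/child join can be sketched in plain Python by walking each RSku and copying its attributes onto every Blade beneath it. This is only a sketch under assumptions: the helper name blade_rows is made up, and the second RSku ("V2") in the sample is invented to show more than one parent. The resulting list of dicts can be passed to pd.DataFrame(rows).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the topology document from the question.
# The second RSku ("V2") is invented here to show a one-to-many join.
XML = """<TopologyDefinition>
  <RSkus>
    <RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
      <Blades>
        <Blade ID="{1-20}" SkuID="SBlade" />
      </Blades>
    </RSku>
    <RSku ID="V2" Deprecated="false" Owner="Unknown" Generation="2">
      <Blades>
        <Blade ID="{1-10}" SkuID="SBlade" />
        <Blade ID="{11-20}" SkuID="TBlade" />
      </Blades>
    </RSku>
  </RSkus>
</TopologyDefinition>"""


def blade_rows(xml_text):
    """Hypothetical helper: one dict per Blade, including its RSku's attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for rsku in root.iter("RSku"):
        # Prefix the names because RSku and Blade both have an ID attribute.
        rsku_attrs = {f"RSku_{k}": v for k, v in rsku.attrib.items()}
        for blade in rsku.iter("Blade"):
            row = dict(rsku_attrs)
            row.update({f"Blade_{k}": v for k, v in blade.attrib.items()})
            rows.append(row)
    return rows


rows = blade_rows(XML)
for row in rows:
    print(row)
```

Because the inner loop runs inside the outer one, each Blade row always carries the attributes of the RSku it belongs to, which is the join key the two separate read_xml calls were missing.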

How do I insert a tag that holds the text of an older tag in xml using python?

I want to insert s tags inside a tag that already exists and move the text of the older tag inside the s tag. For example, if my XML file looks like this:
<root>
<name>Light and dark</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
I want it to be like this (check the name tag):
<root>
<name>
<s>Light and dark</s>
</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
I tried using ET.SubElement but it doesn't give me the same result.

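It is much better to use XSLT for such tasks.
XSLT has a pattern for this, the so-called identity transform: one template copies everything unchanged, and additional templates override it for the nodes you want to change.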
Input XML
<root>
<name>Light and dark</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="name">
<xsl:copy>
<s>
<xsl:value-of select="."/>
</s>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output XML
<root>
<name>
<s>Light and dark</s>
</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>

To insert a sub-element in the XML using ElementTree XML API, append the new element then set it with the text value of the parent element.
import xml.etree.ElementTree as ET
xml = """
<root>
<name>Light and dark</name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>"""
root = ET.fromstring(xml)
# 1. find name element in document
name = root.find('name')
# 2. save the text value, then empty the element
#    (name.clear() would also drop the element's tail and attributes)
text = name.text
name.text = None
# 3. create new element s and set text
elt = ET.SubElement(name, "s")
elt.text = text
print(ET.tostring(root, encoding='unicode'))
To process multiple elements, add a loop around steps 1-3:
for child in root.findall('name'):
    text = child.text
    child.text = None  # keeps the tail and attributes, unlike clear()
    elt = ET.SubElement(child, "s")
    elt.text = text
Output:
<root>
<name><s>Light and dark</s></name>
<address>
<sector>142</sector>
<location>Noida</location>
</address>
</root>

How to use python functions and variables inside XSLT?

In PHP, you can use registerPHPFunctions to use a PHP function inside an XSLT file like this:
<?php
$xml = <<<EOB
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>
EOB;
$xsl = <<<EOB
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:php="http://php.net/xsl">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<xsl:value-of
select="php:function('ucfirst',concat(string(uid), string(id)))"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
EOB;
$xmldoc = DOMDocument::loadXML($xml);
$xsldoc = DOMDocument::loadXML($xsl);
$proc = new XSLTProcessor();
$proc->registerPHPFunctions('ucfirst');
$proc->importStyleSheet($xsldoc);
echo $proc->transformToXML($xmldoc);
?>
What is the Python equivalent? This is what I've tried:
from lxml import etree
xml = etree.XML('''
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>''')
xsl = etree.XML('''
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="mynamespace"
extension-element-prefixes="f">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<f:ucfirst>
<xsl:value-of select="concat(string(uid), string(id))"/>
</f:ucfirst>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
''')
extension = Ucase()
extensions = { ('mynamespace', 'ucfirst') : extension }
proc = etree.XSLT(xsl, extensions=extensions)
str(proc(xml))
class Ucase(etree.XSLTExtension):
    def execute(self, context, self_node, input_node, output_parent):
        title = self_node[0].text.capitalize()
        output_parent.text(title)
This is a simplified version of my XSLT.

Here is how an extension function (not an element) can give the result that I think you want:
from lxml import etree
def ucfirst(context, s):
    return s.capitalize()
ns = etree.FunctionNamespace("mynamespace")
ns['ucfirst'] = ucfirst
xml = etree.XML('''
<allusers>
<user>
<uid>bob</uid>
<id>1</id>
</user>
<user>
<uid>joe</uid>
<id>2</id>
</user>
</allusers>''')
# bytes literal: on Python 3, lxml rejects str input that carries an encoding declaration
xsl = etree.XML(b'''\
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="mynamespace" exclude-result-prefixes="f">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="allusers">
<html><body>
<h2>Users</h2>
<table>
<xsl:for-each select="user">
<tr><td>
<xsl:value-of select="f:ucfirst(concat(string(uid), string(id)))"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
''')
transform = etree.XSLT(xsl)
result = transform(xml)
print(result)
Output:
<html><body>
<h2>Users</h2>
<table>
<tr><td>Bob1</td></tr>
<tr><td>Joe2</td></tr>
</table>
</body></html>
See http://lxml.de/extensions.html#xpath-extension-functions.

There are separate answers for variables and functions. I'm only really familiar with the variable half.
For variables, you can pass them as an xsl:param by passing them as keyword arguments to the call. For example:
transform = etree.XSLT(xslt_tree)
result = transform(doc_root, a="5")
Note that the argument is an XPath expression, so strings need to be quoted. There is a function that does this opaquely:
result = transform(doc_root, a=etree.XSLT.strparam(""" It's "Monty Python" """))
If you want to pass an XML fragment you could use the exslt:node-set() function.
For functions, you can expose them either as an xpath function or as an element. There is a bunch of variety and I haven't done this myself so read the docs below and/or edit this answer.
Docs for basic use and variables.
Docs for adding functions.

(Not so) advanced xsl transformation of child nodes into list

Input:
<root>
<aa><aaa/><bbb/><ccc/><ddd/><eee/></aa>
<bb><ggg/></bb>
</root>
Desirable output:
<root>
<aa>aaa</aa>
<aa>bbb</aa>
<aa>ccc</aa>
<aa>ddd</aa>
<aa>eee</aa>
<bb>ggg</bb>
</root>
I've come up with a simple XSLT, but it properly handles only <bb> and doesn't create a list of <aa> tags.
XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- select all elements that don't have any child nodes (elements or text etc) -->
<xsl:template match="//*[not(node())]">
<xsl:value-of select="name()"/>
</xsl:template>
</xsl:stylesheet>
Output:
<root>
<aa>aaabbbcccdddeee</aa>
<bb>ggg</bb>
</root>
P.S. This is part of a Python script. Does it make sense to do such conversions using XSLT in a Python script? Or would a Python solution using simple XPath and Python logic work better?

An example is not a substitute for explaining the logic behind the required transformation. I can think of several different ways to process your example input and arrive at the same output.
Here's a guess at what you want to accomplish (read the comments):
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<root>
<!-- select all elements that don't have any child nodes -->
<xsl:for-each select="//*[not(node())]">
<!-- create an element with the name of the parent element -->
<xsl:element name="{name(..)}">
<xsl:value-of select="name()"/>
</xsl:element>
</xsl:for-each>
</root>
</xsl:template>
</xsl:stylesheet>
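Since the question also asks whether plain Python would work better, here is a rough ElementTree equivalent of the stylesheet above. It is a sketch under the same guess about the intended logic: every element with no children and no text becomes a new element named after its parent. The function name children_to_list is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = "<root><aa><aaa/><bbb/><ccc/><ddd/><eee/></aa><bb><ggg/></bb></root>"


def children_to_list(xml_text):
    """Hypothetical helper mirroring the XSLT above with ElementTree."""
    source = ET.fromstring(xml_text)
    result = ET.Element(source.tag)
    # iter() walks the tree in document order, so output order matches input order.
    for parent in source.iter():
        for child in parent:
            # "Empty" here means no child elements and no text content.
            if len(child) == 0 and not child.text:
                item = ET.SubElement(result, parent.tag)
                item.text = child.tag
    return result


out = children_to_list(XML)
print(ET.tostring(out, encoding="unicode"))
```

Both versions are short; which one is preferable mostly depends on whether the rest of the script already uses XSLT.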

Removing Signature from xml

I would like to discard the Signature element from my XML files. I use XSLT, run from Python, to filter some elements and attributes out of the files. The XSLT looks like this:
xslt_root = etree.XML('''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="TimeStamp"/>
<xsl:template match="@timeStamp"/>
<xsl:template match="TimeStamps"/>
<xsl:template match="Signature"/>
</xsl:stylesheet>
''')
The problem is that when I save the resulting XML files, every element and attribute matched by the XSLT rules is discarded except the Signature element, which remains. Is there a way to discard the Signature element as well?

If your Signature element has a namespace, for example:
<Signature xmlns="http://www.w3.org/2000/09/xmldsig#">...</Signature>
Then you'll need to adapt your XSLT to match it with the namespace:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:s="http://www.w3.org/2000/09/xmldsig#"> <!-- CHANGE #1 -->
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="TimeStamp"/>
<xsl:template match="@timeStamp"/>
<xsl:template match="TimeStamps"/>
<xsl:template match="s:Signature"/> <!-- CHANGE #2 -->
</xsl:stylesheet>
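The same namespace issue applies if you remove the element in Python instead of XSLT: ElementTree stores a namespaced tag as "{uri}local", so a plain "Signature" comparison never matches. A minimal sketch, assuming the Signature element is in the XML-DSig namespace shown above; the sample document and the helper name strip_signatures are made up for illustration.

```python
import xml.etree.ElementTree as ET

DSIG = "http://www.w3.org/2000/09/xmldsig#"

# Hypothetical document; only the Signature namespace comes from the answer above.
XML = """<Document>
  <Payload>data</Payload>
  <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
    <SignatureValue>abc</SignatureValue>
  </Signature>
</Document>"""


def strip_signatures(xml_text):
    """Hypothetical helper: remove every XML-DSig Signature element."""
    root = ET.fromstring(xml_text)
    # ElementTree represents a namespaced tag as "{namespace-uri}localname".
    target = "{%s}Signature" % DSIG
    # Iterate over snapshots so removing children does not disturb the loops.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == target:
                parent.remove(child)
    return root


cleaned = strip_signatures(XML)
print(ET.tostring(cleaned, encoding="unicode"))
```

If the signature were the root element itself, it could not be removed this way, because ElementTree elements are only removed through their parent.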
