Parsing PDML / XML and selecting specific fields for Pandas


I am using the ElementTree XML API to parse a large PDML (XML) file in Python. I want to get a tabular Pandas DataFrame containing specific fields of information. The following is a subset of the actual file.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="pdml2html.xsl"?>
<!-- You can find pdml2html.xsl in C:\Program Files\Wireshark or at https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob_plain;f=pdml2html.xsl. -->
<pdml version="0" creator="wireshark/3.2.2" time="Sun Mar 22 23:53:43 2020" capture_file="C:\Users\anyoung\AppData\Local\Temp\wireshark_Wi-Fi 2_20200322234518_a20824.pcapng">
<packet>
<proto name="geninfo" pos="0" showname="General information" size="66">
<field name="frame.cap_len" showname="Capture Length: 66 bytes (528 bits)" size="0" pos="0" show="66"/>
<field name="frame.marked" showname="Frame is marked: False" size="0" pos="0" show="0"/>
<field name="frame.cap_len" showname="Capture Length: 66 bytes (528 bits)" size="0" pos="0" show="66"/>
<field name="frame.marked" showname="Frame is marked: False" size="0" pos="0" show="0"/>
<field name="caplen" pos="0" show="66" showname="Captured Length" value="42" size="66"/>
<field name="timestamp" pos="0" show="Mar 22, 2020 23:45:34.045301000 Pacific Daylight Time" showname="Captured Time" value="1584945934.045301000" size="66"/>
</proto>
I want to get a table like:
field size value
frame.cap_len 0 null
frame.marked 0 null
timestamp 66 1584945934.045301000
I am really struggling with the syntax to do the above. I haven't been able to get anything that even comes close.

Here's another XSLT example (this is more for @Kristian).
XML Input (input.xml)
<pdml version="0" creator="wireshark/3.2.2" time="Sun Mar 22 23:53:43 2020" capture_file="C:\Users\anyoung\AppData\Local\Temp\wireshark_Wi-Fi 2_20200322234518_a20824.pcapng">
<packet>
<proto name="geninfo" pos="0" showname="General information" size="66">
<field name="frame.cap_len" showname="Capture Length: 66 bytes (528 bits)" size="0" pos="0" show="66"/>
<field name="frame.marked" showname="Frame is marked: False" size="0" pos="0" show="0"/>
<field name="frame.cap_len" showname="Capture Length: 66 bytes (528 bits)" size="0" pos="0" show="66"/>
<field name="frame.marked" showname="Frame is marked: False" size="0" pos="0" show="0"/>
<field name="caplen" pos="0" show="66" showname="Captured Length" value="42" size="66"/>
<field name="timestamp" pos="0" show="Mar 22, 2020 23:45:34.045301000 Pacific Daylight Time" showname="Captured Time" value="1584945934.045301000" size="66"/>
</proto>
</packet>
</pdml>
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="tab" select="'&#x9;'"/>
<xsl:variable name="nl" select="'&#xA;'"/>
<xsl:template match="/">
<xsl:value-of select="concat('field',$tab,'size',$tab,'value',$nl)"/>
<xsl:apply-templates select=".//field"/>
</xsl:template>
<xsl:template match="field">
<xsl:value-of select="concat(@name,$tab,@size,$tab,@value,$nl)"/>
</xsl:template>
</xsl:stylesheet>
Python 3
from lxml import etree
tree = etree.parse("input.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
print(new_tree)
Printed Output
field size value
frame.cap_len 0
frame.marked 0
frame.cap_len 0
frame.marked 0
caplen 66 42
timestamp 66 1584945934.045301000
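For the Pandas side of the question, the same three attributes can also be collected without XSLT. The following is a minimal sketch using only the standard library; the inline sample and the helper name fields_to_rows are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; in practice you would use ET.parse("input.xml")
PDML = """<pdml version="0">
<packet>
<proto name="geninfo" pos="0" showname="General information" size="66">
<field name="frame.cap_len" showname="Capture Length" size="0" pos="0" show="66"/>
<field name="timestamp" pos="0" show="Mar 22, 2020" showname="Captured Time" value="1584945934.045301000" size="66"/>
</proto>
</packet>
</pdml>"""


def fields_to_rows(root, wanted=("name", "size", "value")):
    """Return one dict per <field>; attributes that are absent become None."""
    return [{key: field.get(key) for key in wanted} for field in root.iter("field")]


rows = fields_to_rows(ET.fromstring(PDML))
for row in rows:
    print(row)
```

If pandas is installed, passing the list to pandas.DataFrame(rows) should give the table, with missing values shown as None. For a very large capture, ET.iterparse may be worth a look so the whole file is not held in memory.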

lxml
https://lxml.de/xpathxslt.html
lxml lets you transform XML documents using XSLT. XSLT is often overlooked in favor of brute-force procedural code, but I prefer to use XSLT when handling and transforming XML data.
I highly recommend learning XSLT and utilizing it regularly if you must handle XML data on a daily basis.
XSLT to transform your XML document: packet.xsl
Variables are used for delimiter and end-of-line (EOL) to allow for easy modification.
Templates for header-row and field are used to allow the re-arrangement or addition of new fields when necessary.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" media-type="string" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="delimiter" select="'&#x9;'"/>
<xsl:variable name="EOL" select="'&#xA;'"/>
<xsl:template match="/pdml/packet/proto">
<xsl:call-template name="header-row"/>
<xsl:apply-templates select="field"/>
</xsl:template>
<xsl:template match="field">
<xsl:value-of select="@name"/>
<xsl:value-of select="$delimiter"/>
<xsl:value-of select="@size"/>
<xsl:value-of select="$delimiter"/>
<xsl:value-of select="@value"/>
<xsl:value-of select="$EOL"/>
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template name="header-row">
<xsl:element name="row">
<xsl:text>field</xsl:text>
<xsl:value-of select="$delimiter"/>
<xsl:text>size</xsl:text>
<xsl:value-of select="$delimiter"/>
<xsl:text>value</xsl:text>
<xsl:value-of select="$EOL"/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Example Python XML/XSLT transformation code
Use the Python script provided by @Daniel Haley. I just named my file test.py.
Execute XSLT with XML input
XSLT = packet.xsl
XML = packet.xml
This assumes your packet.xml is a well-formed XML document and not an incomplete fragment.
./test.py
Tab delimited output.
field size value
frame.cap_len 0
frame.marked 0
frame.cap_len 0
frame.marked 0
caplen 66 42
timestamp 66 1584945934.045301000
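The tab-delimited text still has to be turned into records before it can feed a DataFrame. A small sketch with the standard csv module follows; the tsv_text literal stands in for the transformation result and is an assumption for illustration.

```python
import csv
import io

# Stand-in for the text produced by the transformation (header row first)
tsv_text = (
    "field\tsize\tvalue\n"
    "frame.cap_len\t0\t\n"
    "caplen\t66\t42\n"
    "timestamp\t66\t1584945934.045301000\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = [
    # An empty string means the attribute was absent in the XML; map it to None
    {key: (value or None) for key, value in row.items()}
    for row in reader
]
print(records)
```

With pandas available, pandas.read_csv(io.StringIO(tsv_text), sep="\t") is the more direct route to the same table.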

Related

Editing EPG XML with Python

I am new to Python and looking to modify an XML file to change some things around. I can provide an example followed by what I would like the output to be.
Original....
<programme channel="I9.11363.zap2it.com" start="20220729080000 -0500" stop="20220729090000 -0500">
<title lang="en">Live with Kelly and Ryan</title>
<sub-title lang="en">Live's Ready or Not Week; Live's Foodfluencer Friday Faceoff</sub-title>
<desc lang="en">Making an emergency evacuation kit; a chef provides a summertime recipe.</desc>
<date>20220729</date>
<category lang="en">Talk</category>
<category lang="en">Series</category>
<length units="minutes">60</length>
<icon src="https://zap2it.tmsimg.com/assets/p14101643_b_v13_ah.jpg" />
<url>https://tvlistings.zap2it.com//overview.html?programSeriesId=SH02684484&amp;tmsId=EP026844841372</url>
<episode-num system="common">S06E232</episode-num>
<episode-num system="dd_progid">EP02684484.1372</episode-num>
<episode-num system="xmltv_ns">5.231.</episode-num>
<audio>
<stereo>stereo</stereo>
</audio>
<new />
<subtitles type="teletext" />
<rating>
<value>TV-PG</value>
</rating>
</programme>
Desired Output.... Moving the "New" tag into the title and removing the <episode-num system="common">S06E232</episode-num> and placing it into the description.
<programme channel="I9.11363.zap2it.com" start="20220729080000 -0500" stop="20220729090000 -0500">
<title lang="en">Live with Kelly and Ryan New</title>
<sub-title lang="en">Live's Ready or Not Week; Live's Foodfluencer Friday Faceoff</sub-title>
<desc lang="en">S06E232 (return)Making an emergency evacuation kit; a chef provides a summertime recipe. TV-PG 20220729 </desc>
<icon src="https://zap2it.tmsimg.com/assets/p14101643_b_v13_ah.jpg" />
<url>https://tvlistings.zap2it.com//overview.html?programSeriesId=SH02684484&amp;tmsId=EP026844841372</url>
</programme>

Here is an XSLT based solution.
Input XML
<?xml version="1.0"?>
<programme channel="I9.11363.zap2it.com" start="20220729080000 -0500" stop="20220729090000 -0500">
<title lang="en">Live with Kelly and Ryan</title>
<sub-title lang="en">Live's Ready or Not Week; Live's Foodfluencer Friday Faceoff</sub-title>
<desc lang="en">Making an emergency evacuation kit; a chef provides a summertime recipe.</desc>
<date>20220729</date>
<category lang="en">Talk</category>
<category lang="en">Series</category>
<length units="minutes">60</length>
<icon src="https://zap2it.tmsimg.com/assets/p14101643_b_v13_ah.jpg"/>
<url>https://tvlistings.zap2it.com//overview.html?programSeriesId=SH02684484&amp;tmsId=EP026844841372</url>
<episode-num system="common">S06E232</episode-num>
<episode-num system="dd_progid">EP02684484.1372</episode-num>
<episode-num system="xmltv_ns">5.231.</episode-num>
<audio>
<stereo>stereo</stereo>
</audio>
<new/>
<subtitles type="teletext"/>
<rating>
<value>TV-PG</value>
</rating>
</programme>
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="title">
<xsl:copy>
<xsl:attribute name="lang">en</xsl:attribute>
<xsl:value-of select="concat(., ' new')"/>
</xsl:copy>
</xsl:template>
<xsl:template match="desc">
<xsl:copy>
<xsl:attribute name="lang">en</xsl:attribute>
<xsl:value-of select="concat(/programme/episode-num[@system='common'], ' ', .)"/>
</xsl:copy>
</xsl:template>
<xsl:template match="date | category | length | episode-num | audio | new | subtitles | rating"/>
</xsl:stylesheet>
Output XML
<programme stop="20220729090000 -0500" channel="I9.11363.zap2it.com" start="20220729080000 -0500">
<title lang="en">Live with Kelly and Ryan new</title>
<sub-title lang="en">Live's Ready or Not Week; Live's Foodfluencer Friday Faceoff</sub-title>
<desc lang="en">S06E232 Making an emergency evacuation kit; a chef provides a summertime recipe.</desc>
<icon src="https://zap2it.tmsimg.com/assets/p14101643_b_v13_ah.jpg"/>
<url>https://tvlistings.zap2it.com//overview.html?programSeriesId=SH02684484&amp;tmsId=EP026844841372</url>
</programme>
Python
import lxml.etree as ET

inputfile = "D:\\temp\\input.xml"
xsltfile = "D:\\temp\\process.xslt"
outfile = "D:\\output\\output.xml"

dom = ET.parse(inputfile)
xslt = ET.parse(xsltfile)
transform = ET.XSLT(xslt)
# The stylesheet above declares no xsl:param, so no parameters are passed
newdom = transform(dom)
# str() serializes the result according to the xsl:output settings
with open(outfile, "w", encoding="utf-8") as f:
    f.write(str(newdom))
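For comparison, the same two moves (the new marker into the title, the common episode number into the description) can be done directly on the tree. This is a sketch using the standard library only; the trimmed sample, the DROP set and the function name rewrite_programme are assumptions for illustration, and appending the rating or date to the description would follow the same pattern.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample of the <programme> element from the question
SOURCE = """<programme channel="I9.11363.zap2it.com">
<title lang="en">Live with Kelly and Ryan</title>
<desc lang="en">Making an emergency evacuation kit.</desc>
<date>20220729</date>
<episode-num system="common">S06E232</episode-num>
<episode-num system="dd_progid">EP02684484.1372</episode-num>
<new />
<rating><value>TV-PG</value></rating>
</programme>"""

# Elements that should not appear in the output (same list as the XSLT above)
DROP = {"date", "category", "length", "episode-num", "audio", "new", "subtitles", "rating"}


def rewrite_programme(prog):
    title = prog.find("title")
    desc = prog.find("desc")
    # Append the marker only when a <new/> element is present
    if prog.find("new") is not None:
        title.text = f"{title.text} New"
    # Prefix the description with the "common" episode number, if any
    episode = prog.findtext("episode-num[@system='common']")
    if episode:
        desc.text = f"{episode} {desc.text}"
    # Remove the unwanted children; iterate over a copy while removing
    for child in list(prog):
        if child.tag in DROP:
            prog.remove(child)
    return prog


programme = rewrite_programme(ET.fromstring(SOURCE))
print(ET.tostring(programme, encoding="unicode"))
```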

Create a series tag for multiple episodes - XML to XSLT

I have an XML feed with multiple TV shows' episodes.
I need to convert it into an RDF/XML file using an XSLT template.
Each RDF should be one TV show; its episodes should be nested properties under it, along with extracted info about the show.
The way I'm transforming it, all episodes (item tags) become RDFs.
How do I extract the TV show's metadata from the episodes (i.e. the items) and then nest the show's episodes under the TV show RDF?
CURRENT BEHAVIOUR
rdf:Group rdf:id=New Item in Series
ebucore:groupName: Football
ebucore:Season: 1
　ebucore:Episode: 1
　ebucore:groupId: Better Call Saul
　ebucore:groupDescription: Three young up
#----------------------------------------------------------------------
rdf:Group rdf:id=New Item in Series
ebucore:groupName: Basketball
ebucore:Season: 1
　ebucore:Episode: 2
　ebucore:groupId: Better Call Saul
　ebucore:groupDescription: Three young up
#----------------------------------------------------------------------
EXPECTED BEHAVIOUR IS BELOW
Each show as an rdf with its episodes nested properties under it.
rdf
show title better call saul
show description: Three young up
ep
ep.id 1
ep
ep.id 2
Attached below are the raw XML, the XSLT and the code.
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:media="http://search.yahoo.com/mrss/" xmlns:ott="http://rss.ott.com/1.0" xmlns:name="http://www.name.com/rss/extensions/" xmlns:dcterms="http://purl.org/dc/terms/">
<channel>
<title>ABC</title>
<description>Feed of items</description>
<pubDate>Wed, 06 May 2022</pubDate>
<item>
<guid isPermaLink="false">some_alphaNumeric</guid>
<title>Better Call Saul</title>
<name:episodic type="episode">
<name:seriesId>Better Call Saul</name:seriesId>
<name:seasonNum>1</name:seasonNum>
<name:episodeNum>1</name:episodeNum>
</name:episodic>
<description>This is a victorious story of a Lawyer </description>
<media:keywords>urban, African American, destiny</media:keywords>
<media:subTitle lang="en" type="application/octet-stream"/>
<media:category scheme="http://www.name.com">Drama - Drama</media:category>
<media:group>
<media:content url="video_url.mp4?Jd" type="application/mp4" duration="4693" medium="video" expression="full" bitrate="7627"/>
</media:group>
<media:group>
<media:thumbnail url="some_url" height="450" width="800"/>
<media:thumbnail url="https://mp4video.ott.com/04" height="1920" width="1080"/>
</media:group>
<media:rating scheme="simple">nonadult</media:rating>
<name:cuePoints>10</name:cuePoints>
<pubDate>Tue, 28 Jul 2020 13:29:16 -0400</pubDate>
</item>
<item>
<guid isPermaLink="false">some_alphaNumeric</guid>
<title>Better Call Saul</title>
<name:episodic type="episode">
<name:seriesId>Better Call Saul</name:seriesId>
<name:seasonNum>1</name:seasonNum>
<name:episodeNum>2</name:episodeNum>
</name:episodic>
<description>This is a victorious story of a Lawyer </description>
<media:keywords>urban, African American, destiny</media:keywords>
<media:subTitle lang="en" type="application/octet-stream"/>
<media:category scheme="http://www.name.com">Drama - Drama</media:category>
<media:group>
<media:content url="video_url.mp4?acs3" type="application/mp4" duration="4693" medium="video" expression="full" bitrate="7627"/>
</media:group>
<media:group>
<media:thumbnail url="some_url" height="450" width="800"/>
<media:thumbnail url="https://mp4video.ott.com/04" height="1920" width="1080"/>
</media:group>
<media:rating scheme="simple">nonadult</media:rating>
<name:cuePoints>10</name:cuePoints>
<pubDate>Tue, 28 Aug 2020 13:29:16 -0400</pubDate>
</item>
and the XSLT
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:foaf="http://xmlns.com/foaf/spec/"
xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#"
xmlns:ebulang="http://resolution.org/res#"
xmlns:foo="http://example.com/foo#"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:name="http://www.name.com/rss/extensions/">
<xsl:template match="/">
<rdf:RDF>
Number of Items in Rss Feed: <xsl:value-of select="count(//item)" />
<xsl:apply-templates/>
</rdf:RDF>
</xsl:template>
<xsl:template match="item">
<xsl:text>#----------------------------------------------------------------------</xsl:text><xsl:text>
</xsl:text>
<xsl:text>rdf:Group rdf:id=New Item in Series</xsl:text><xsl:text>
</xsl:text>
<xsl:text>ebucore:groupName: </xsl:text>
<xsl:value-of select="normalize-space(title)"/>
<xsl:text>
</xsl:text>
<xsl:for-each select="name:episodic">
<xsl:text>ebucore:Season: </xsl:text>
<xsl:value-of select="normalize-space(name:seasonNum)"/>
<xsl:text>
</xsl:text>
<xsl:text>　</xsl:text>
<xsl:text>ebucore:Episode: </xsl:text>
<xsl:value-of select="normalize-space(name:episodeNum)"/>
<xsl:text>
</xsl:text>
<xsl:text>　</xsl:text>
<xsl:text>ebucore:groupId: </xsl:text>
<xsl:value-of select="name:seriesId"/>
<xsl:text>
</xsl:text>
<xsl:text>　</xsl:text>
</xsl:for-each>
<xsl:text>ebucore:groupDescription: </xsl:text>
<xsl:if test="count(description)=1">
<xsl:value-of select="normalize-space(description)"/>
</xsl:if>
<xsl:if test="count(description)=0">
<xsl:text>Null</xsl:text>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
code
import lxml.etree as ET
import xml.dom.minidom
from pprint import pprint
import os
dom_rssFeed = ET.parse('raw.xml')
xslt_rssFeed = ET.parse('mapper.xslt')
transform_rssFeed = ET.XSLT(xslt_rssFeed)
finaldom = transform_rssFeed(dom_rssFeed)
new = ET.tostring(finaldom)
dom_transformed = xml.dom.minidom.parseString(new)
pretty_xml_as_string = dom_transformed.toprettyxml()
print(pretty_xml_as_string)
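The grouping being asked for (one record per show, episodes nested under it) can be prototyped in Python before committing to a stylesheet; in XSLT 1.0 the usual technique for this kind of problem is Muenchian grouping with xsl:key. The sketch below uses only the standard library; the shortened feed and the function name group_by_series are assumptions for illustration, and the result is a plain dict rather than RDF.

```python
import xml.etree.ElementTree as ET

NS = {"name": "http://www.name.com/rss/extensions/"}

# Shortened version of the feed from the question (two episodes of one show)
FEED = """<rss version="2.0" xmlns:name="http://www.name.com/rss/extensions/">
<channel>
<item>
<title>Better Call Saul</title>
<name:episodic type="episode">
<name:seriesId>Better Call Saul</name:seriesId>
<name:seasonNum>1</name:seasonNum>
<name:episodeNum>1</name:episodeNum>
</name:episodic>
<description>This is a victorious story of a Lawyer </description>
</item>
<item>
<title>Better Call Saul</title>
<name:episodic type="episode">
<name:seriesId>Better Call Saul</name:seriesId>
<name:seasonNum>1</name:seasonNum>
<name:episodeNum>2</name:episodeNum>
</name:episodic>
<description>This is a victorious story of a Lawyer </description>
</item>
</channel>
</rss>"""


def group_by_series(root):
    """Return {seriesId: {title, description, episodes}} in feed order."""
    shows = {}
    for item in root.iter("item"):
        series_id = item.findtext("name:episodic/name:seriesId", namespaces=NS)
        # The first item seen for a series supplies the show-level metadata
        show = shows.setdefault(series_id, {
            "title": item.findtext("title"),
            "description": (item.findtext("description") or "").strip(),
            "episodes": [],
        })
        show["episodes"].append({
            "season": item.findtext("name:episodic/name:seasonNum", namespaces=NS),
            "episode": item.findtext("name:episodic/name:episodeNum", namespaces=NS),
        })
    return shows


shows = group_by_series(ET.fromstring(FEED))
print(shows)
```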

Change the XML structure to a new one with python

I want to change the structure of an XML file to another standard structure given to me.
I believe I can achieve that through the following steps:
1. Get all the tags and their attributes, so I know what to modify, remove or add.
2. Change the tag names (e.g. informaltable to table, or sect1 to section).
3. Establish certain standard attributes for the different tags and hold them in a dictionary (e.g. all the section, title and table tags must have these attributes: section: {"xmlns:xsi", "id", "type", "xsi:noNamespaceSchemaLocation"}, title: "id", table: {"frame", "id"}).
4. Give a random alphanumeric id to every tag that has the id attribute; it must never repeat (e.g. id=id-824fc56b-431b-4ad3-e933-f0fc222e50d3).
5. Modify, add and remove attribute values for certain tags (e.g. frame was frame=all and now is frame=any; delete the rowsep attribute in the colspec tag).
6. Remove specific tags (e.g. remove the anchor tags and of course all of their attributes). I hope this doesn't affect the whole hierarchy.
I have this xml example
<section xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="id-c3ee53e4-e2ef-441b-8f3b-7320c4e32ef8" type="policy" xsi:noNamespaceSchemaLocation="urn:fontoxml:cpa.xsd:5.0">
<title id="id-f0497441-5ecb-47ee-b7c0-263832a9e402">
<anchor id="_Toc493170182"/>
<anchor id="__RefHeading___Toc3574_3674829928"/>
<anchor id="_Toc72503731"/>
<anchor id="_Toc69390724"/>
<anchor id="_Toc493496869"/>
Abbreviations of Terms
</title>
<table frame="all" id="id-6837f232-02e3-4e7a-ce8d-cb2df48256ac">
<tgroup cols="2" id="id-437c0d54-7257-4d34-a73d-351d533f0460">
<colspec colname="column-0" colnum="1" colsep="1" rowsep="1" colwidth="0.2*" id="id-c87e1040-c2d7-4b15-fb0c-86557d201235" />
<colspec colname="column-1" colnum="2" colsep="1" rowsep="1" colwidth="0.8*" id="id-5bebcf85-440b-416e-b2f9-72e47d5bb4f7" />
<thead id="id-ff67f8a7-5baf-4a42-ac31-09c0f99cceed">
<row id="id-542df999-7736-4cc2-e725-1b7b106e08d6">
<entry rowsep="1" colsep="1" colname="column-0" id="id-54a7d605-21ff-44db-c1f6-03111db180c7">
<para id="id-f43f7fb1-cd40-4b4a-88f2-02e55e786a5e">
<emphasis style="bold">Abbreviation
</emphasis>
</para>
</entry>
<entry rowsep="1" colsep="1" colname="column-1" id="id-aecec4c6-f85b-490e-9b72-99c6764b49cf">
<para id="id-4d89100a-4e4c-419a-d081-f776bcf9083e">
<emphasis style="bold">Definition
</emphasis>
</para>
</entry>
</row>
</thead>
<tbody id="id-824fc56b-431b-4ad3-e933-f0fc222e50d3">
<row id="id-620a8ff6-0189-41c7-e9af-dc9498ce703e">
<entry rowsep="1" colsep="1" colname="column-0" id="id-fb941cc0-287d-4760-a5a0-87419fa66d68">
<para id="id-127a8a37-9705-496b-87ee-303bcfd52a25">A/C</para>
</entry>
<entry rowsep="1" colsep="1" colname="column-1" id="id-317ad682-6e02-43c3-b724-5d50683c8f79">
<para id="id-c7c2fac5-f286-4802-b8d6-2e54fa2cad3c">AirCraft</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
</section>
And this is the code that I have so far
import uuid
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def new_id():
    # uuid4 gives a random id that will not repeat in practice
    return f"id-{uuid.uuid4()}"

# Parse the XML file
tree = etree.parse("InitialFile")
root = tree.getroot()
# Show the unique tag [attributes] pairs (skipping comments and processing instructions)
for tag in sorted({el.tag for el in root.iter() if isinstance(el.tag, str)}):
    print(tag, root.xpath(f"//{tag}")[0].attrib.keys())
# Change the tag names to the required names
for p in tree.findall(".//sect1"):
    p.tag = "section"
for p in tree.findall(".//informaltable"):
    p.tag = "table"
# Set the required attributes on each section
for cy in root.xpath("//section"):
    # xmlns:xsi is a namespace declaration, not a regular attribute, so it cannot
    # be set through attrib; a namespaced attribute is addressed as {uri}name instead
    cy.set("id", new_id())
    cy.set("type", "policy")
    cy.set(f"{{{XSI}}}noNamespaceSchemaLocation", "urn:fontoxml:cpa.xsd:1.0")
# Give every title and table a fresh id
for el in root.xpath("//title | //table"):
    el.set("id", new_id())
# Remove the rowsep attribute from colspec
for ct in root.xpath("//colspec"):
    ct.attrib.pop("rowsep", None)
# Print the new XML to make sure it worked
print(etree.tostring(root).decode())
tree.write("Final file.xml")
If you have any other ideas please feel free to share.

I agree that this is a task for XSLT (which can be used from lxml). Here is an example stylesheet that tries to implement some of your requirements in a modular way by delegating each change to a template of its own:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.0">
<xsl:output method="xml"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="sect1">
<section>
<xsl:apply-templates select="@* | node()"/>
</section>
</xsl:template>
<xsl:template match="informaltable">
<table>
<xsl:apply-templates select="@* | node()"/>
</table>
</xsl:template>
<xsl:template match="@id">
<xsl:attribute name="{name()}">
<xsl:value-of select="generate-id()"/>
</xsl:attribute>
</xsl:template>
<xsl:template match="@xsi:noNamespaceSchemaLocation">
<xsl:attribute name="{name()}" namespace="{namespace-uri()}">urn:fontoxml:cpa.xsd:1.0</xsl:attribute>
</xsl:template>
<xsl:template match="colspec/@rowsep"/>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/bET2rXs
I hope with that as a starting point and any XSLT tutorial or introduction you can work it out.
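The stylesheet leaves a few of the listed requirements open: ids in the id-<uuid> form, the frame value change, and removing anchor elements without losing the text that follows them. A rough Python sketch of those steps, using only the standard library, might look like this; the compact sample and the helper names are assumptions for illustration.

```python
import uuid
import xml.etree.ElementTree as ET

# Compact, hypothetical sample in the "old" structure
SOURCE = (
    '<sect1 id="old"><title id="t"><anchor id="_Toc1"/><anchor id="_Toc2"/>'
    'Abbreviations of Terms</title>'
    '<informaltable frame="all" id="x"><colspec colname="c0" rowsep="1" id="y"/>'
    '</informaltable></sect1>'
)

RENAME = {"sect1": "section", "informaltable": "table"}


def remove_keeping_tail(parent, child):
    """Remove child from parent but keep the text that follows it."""
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def restructure(root):
    # Drop every <anchor>; the rest of the hierarchy is left untouched
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "anchor":
                remove_keeping_tail(parent, child)
    for el in root.iter():
        el.tag = RENAME.get(el.tag, el.tag)
        if "id" in el.attrib:
            # uuid4 values are random and will not repeat in practice
            el.set("id", f"id-{uuid.uuid4()}")
        if el.tag == "table" and el.get("frame") == "all":
            el.set("frame", "any")
        if el.tag == "colspec":
            el.attrib.pop("rowsep", None)
    return root


root = restructure(ET.fromstring(SOURCE))
print(ET.tostring(root, encoding="unicode"))
```

Because an element's trailing text is stored on the element itself (its tail), removing an anchor naively would also drop the title text; the helper moves that text to the preceding sibling or the parent first.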

How to convert nodes in XML to CDATA with XSLT?

I have a source.xml file with structure like:
<products>
<product>
<id>1</id>
<description>
<style>
table{
some css here
}
</style>
<descr>
<div>name of producer like ABC&DEF</div>
<table>
<th>parameters</th>
<tr><td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g < 100 W</td></tr>
</table>
</descr>
</description>
</product>
.....................
</products>
I would like to have:
<products>
<product>
<id>1</id>
<description>
<![CDATA[
<style>
table{
some css here
}
</style>
<descr>
<div>name of producer like ABC&DEF</div>
<table>
<th>parameters</th>
<tr><td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g < 100 VA</td></tr>
</table>
]]>
</descr>
</description>
</product>
.....................
</products>
I tried .xsl stylesheets based on:
How to use in XSLT?
and
Add CDATA to an xml file
and
how to add cdata to an xml file using xsl such as:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8" />
<xsl:template match="/products">
<products>
<xsl:for-each select="product">
<product>
<description>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:copy-of select="description/node()" />
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:for-each>
</description>
</product>
</xsl:for-each>
</products>
</xsl:template>
</xsl:stylesheet>
and
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="xml" indent="yes" cdata-section-elements="description"/>
<xsl:template match="description">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:variable name="subElementsText">
<xsl:apply-templates select="node()" mode="asText"/>
</xsl:variable>
</xsl:copy>
</xsl:template>
<xsl:template match="text()" mode="asText">
<xsl:copy/>
</xsl:template>
<xsl:template match="*" mode="asText">
<xsl:value-of select="concat('&lt;',name())"/>
<xsl:for-each select="@*">
<xsl:value-of select="concat(' ',name(),'=&quot;',.,'&quot;')"/>
</xsl:for-each>
<xsl:value-of select="'&gt;'"/>
<xsl:apply-templates select="node()" mode="asText"/>
<xsl:value-of select="concat('&lt;/',name(),'&gt;')"/>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
but running my python script
import lxml.etree as ET
doc = ET.parse('source.xml')
xslt = ET.parse('modyfi.xsl')
transform = ET.XSLT(xslt)
newdoc = transform(doc)
with open('output.xml', 'wb') as f:
    f.write(newdoc)
in Sublime Text 3 I always get the same error:
lxml.etree.XMLSyntaxError: StartTag: invalid element name, {number of line and column with first appearance of illegal character}
I am sure, that solution is straight in front of me in links above, but I can't see it.
Or maybe I can't find it because I can't ask the right question. Please help, I'm new to coding.

The input XML is not well-formed. I had to fix it first. That seems to be the reason why it is failing on your end.
XML
<products>
<product>
<id>1</id>
<description>
<style>table{
some css here
}</style>
<descr>
<div>name of producer like ABC&amp;DEF</div>
<table>
<th>parameters</th>
<tr>
<td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g &lt; 100 W</td>
</tr>
</table>
</descr>
</description>
</product>
</products>
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="description">
<xsl:copy>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:copy-of select="*"/>
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output
<products>
<product>
<id>1</id>
<description><![CDATA[
<style>table{
some css here
}
</style>
<descr>
<div>name of producer like ABC&DEF</div>
<table>
<th>parameters</th>
<tr>
<td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g < 100 W</td>
</tr>
</table>
</descr>]]>
</description>
</product>
</products>
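The well-formedness point is easy to check before running any transformation. A small sketch with the standard library: a bare & or < in text content makes the parser reject the input, while the escaped form parses. The sample strings and the helper name is_well_formed are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


content = "value of param 1 with e.g < 100 W, by ABC&DEF"
raw = "<td>" + content + "</td>"
# escape() replaces &, < and > with the corresponding entities
fixed = "<td>" + escape(content) + "</td>"

print(is_well_formed(raw))
print(is_well_formed(fixed))
print(fixed)
```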

In my view, a clean approach is to use a serialize function that turns the elements you want into plain text, to list the parent container in the cdata-section-elements attribute of the xsl:output declaration, and finally to make sure the XSLT processor is in charge of the serialization.
Now XSLT 3 has a built-in XPath 3.1 serialize function, in Python you could use that with Saxon-C and its Python API.
For libxslt based XSLT 1 with lxml you can write an extension function in Python exposed to XSLT:
from lxml import etree as ET
def serialize(context, nodes):
    return b''.join(ET.tostring(node) for node in nodes)
ns = ET.FunctionNamespace('http://example.com/mf')
ns['serialize'] = serialize
xml = ET.fromstring('<root><div><p>foo</p><p>bar</p></div></root>')
xsl = ET.fromstring('''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:mf="http://example.com/mf" version="1.0">
<xsl:output method="xml" cdata-section-elements="div" encoding="UTF-8"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div">
<xsl:copy>
<xsl:value-of select="mf:serialize(node())"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>''')
transform = ET.XSLT(xsl)
result = transform(xml)
result.write_output("transformed.xml")
Output then is
<?xml version="1.0" encoding="UTF-8"?>
<root><div><![CDATA[<p>foo</p><p>bar</p>]]></div></root>

xslt transformation works in Altova but not in python

I am trying to transform one XML document into another using XSLT. Below is the XSLT I am using:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns="Apartments.AP.Mits20PropertyFinal"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:var="http://schemas.microsoft.com/BizTalk/2003/var"
exclude-result-prefixes="msxsl var userCSharp"
xmlns:userCSharp="http://schemas.microsoft.com/BizTalk/2003/userCSharp"
>
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:key name="myKey" match="ILS_Unit/Units/Unit" use="./Identification[@IDType='FloorplanID']/@IDValue"/>
<xsl:template match="/">
<PhysicalProperty>
<!-- <xsl:apply-templates select="PhysicalProperty/Management"/>
--> <xsl:apply-templates select="PhysicalProperty/Property"/>
</PhysicalProperty>
</xsl:template>
<xsl:template match="Property">
<Property>
<PropertyInfo>
<xsl:element name="MITSID">
<xsl:value-of select="./PropertyID/Identification[@IDRank='secondary']/@IDValue"/>
</xsl:element>
<xsl:element name="MarketingName">
<xsl:value-of select="./PropertyID/MarketingName"/>
</xsl:element>
<xsl:element name="TotalUnits">
</xsl:element>
<xsl:element name="AddressLine1">
<xsl:value-of select="./PropertyID/Address/AddressLine1"/>
</xsl:element>
<xsl:element name="City">
<xsl:value-of select="./PropertyID/Address/City"/>
</xsl:element>
<xsl:element name="State">
<xsl:value-of select="./PropertyID/Address/State"/>
</xsl:element>
<xsl:element name="Zip">
<xsl:value-of select="./PropertyID/Address/PostalCode"/>
</xsl:element>
</PropertyInfo>
<xsl:for-each select="./Floorplan">
<Floorplan>
<FloorplanID><xsl:value-of select="./@IDValue"/></FloorplanID>
<FloorplanName><xsl:value-of select="./Name"/></FloorplanName>
<!--This Unit count is Totals no of Units per floorplan-->
<UnitCount><xsl:value-of select="./UnitCount"/></UnitCount>
<Units>
<xsl:variable name="floorplanid" select="./@IDValue"/>
<xsl:for-each select="key('myKey',$floorplanid)">
<Unit>
<UnitNum>
<xsl:value-of select="./MarketingName"/>
</UnitNum>
<xsl:if test="./UnitLeasedStatus='Not_Available'">
<UnitLeasedStatus>Occupied</UnitLeasedStatus>
</xsl:if>
<xsl:if test="./UnitLeasedStatus='On_Notice'">
<UnitLeasedStatus>On Notice</UnitLeasedStatus>
</xsl:if>
<xsl:if test="./UnitLeasedStatus='Available'">
<UnitLeasedStatus>Available</UnitLeasedStatus>
</xsl:if>
</Unit>
</xsl:for-each>
</Units>
</Floorplan>
</xsl:for-each>
</Property>
</xsl:template>
</xsl:stylesheet>
And below is the sample xml file
<?xml version="1.0" encoding="utf-8"?>
<PhysicalProperty>
<Property>
<PropertyID>
<Identification IDValue="183dbed4-0101-4a85-954c-a4e8042d2819" IDRank="primary"/>
<Identification IDValue="6458174" IDRank="secondary"/>
<MarketingName>Westmount at London Park</MarketingName>
<Website>http://www.westmountatlondonpark.com</Website>
<Address AddressType="property">
<AddressLine1>14545 Bammel North Houston Road</AddressLine1>
<AddressLine2/>
<City>Houston</City>
<State>TX</State>
<PostalCode>77014</PostalCode>
</Address>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Unspecified"/>
<Floorplan IDValue="171ad0f2-da57-45c0-9cf5-c61cfa26dd63" IDType="FloorplanID" IDRank="primary">
<FloorplanType>Internal</FloorplanType>
<Name>A1  </Name>
<Comment/>
<UnitCount>22</UnitCount>
<UnitsAvailable>3</UnitsAvailable>
<DisplayedUnitsAvailable>5</DisplayedUnitsAvailable>
<Room RoomType="Bedroom">
<Count>1</Count>
</Room>
<Room RoomType="Bathroom">
<Count>1</Count>
</Room>
<SquareFeet Min="602" Max="602"/>
<EffectiveRent Min="810" Max="830"/>
<Deposit DepositType="Security Deposit">
<Amount>
<ValueRange Min="150" Max="150"/>
</Amount>
</Deposit>
<File FileID="e114a438-ee82-497d-8e04-6a5a8b2381ef" active="true">
<FileType>Floorplan</FileType>
<Caption/>
<Src>https://apollostore.blob.core.windows.net/londonpark/uploads/images/floorplans/a1.7a75909c-3362-409c-82c7-aa6c959f9c99.jpg</Src>
<Rank>999</Rank>
</File>
</Floorplan>
<ILS_Unit>
<ILS_Unit IDValue="be827564-6460-4af1-9644-3f6ffa225557" IDType="UnitID" IDRank="primary">
<Units>
<Unit>
<Identification IDValue="be827564-6460-4af1-9644-3f6ffa225557" IDType="UnitID" IDRank="primary"/>
<Identification IDValue="171ad0f2-da57-45c0-9cf5-c61cfa26dd63" IDType="FloorplanID" IDRank="primary"/>
<MarketingName>1901</MarketingName>
<UnitBedrooms>1</UnitBedrooms>
<UnitBathrooms>1</UnitBathrooms>
<UnitRent>765</UnitRent>
<UnitLeasedStatus>Not_Available</UnitLeasedStatus>
<FloorplanName>A1  </FloorplanName>
</Unit>
</Units>
<Comment/>
<EffectiveRent Min="765" Max="765"/>
<Deposit DepositType="Security Deposit">
<Amount>
<ValueRange Min="150" Max="150"/>
</Amount>
</Deposit>
</ILS_Unit>
</ILS_Unit>
</Property>
</PhysicalProperty>
The transformation works fine using altova tool but does not work in python.
Below is the python script I am using.
from lxml import etree
import os
import glob
xslt = etree.parse("fl.xslt")
dom = etree.parse("f.xml",)
transform = etree.XSLT(xslt)
try:
    newdom = transform(dom)
    root = etree.parse(newdom)
    properties = root.findall("./Property")
    print(properties)
except Exception as e:
    print (e)
for error in transform.error_log:
    print(error.message, error.line)
print(etree.tostring(newdom, pretty_print=True))
I am trying to print the Property nodes, and I also tried to write the result to an .xml file, but it returns nothing. Could someone tell me what the issue could be?
Surprisingly it works fine in Altova.
Below are the errors that are thrown.
line 53, in <module>
root = etree.parse(newdom)
File "src\lxml\etree.pyx", line 3519, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1862, in lxml.etree._parseDocument
TypeError: cannot parse from 'lxml.etree._XSLTResultTree'

The specific error has nothing to do with XSLT; it comes from your attempt to parse the result. Like Python's built-in xml.etree, the parse function of lxml.etree expects a file name or a file-like object. However, the result of an XSLT transformation in lxml is already an ElementTree-like object (_XSLTResultTree) on which you can directly call methods such as findall, iterfind, etc.
Therefore, simply remove the parse line. Additionally, because the output has a default namespace, consider assigning a temporary prefix to access nodes. Also, consider xpath in lxml.
newdom = transform(dom)
properties = newdom.findall("./doc:Property", namespaces={'doc': 'Apartments.AP.Mits20PropertyFinal'})
print(properties)
# [<Element {Apartments.AP.Mits20PropertyFinal}Property at 0x136b2f94b08>]
properties = newdom.xpath("./doc:Property", namespaces={'doc': 'Apartments.AP.Mits20PropertyFinal'})
print(properties)
# [<Element {Apartments.AP.Mits20PropertyFinal}Property at 0x136b2f94b08>]
Output
print(etree.tostring(newdom, pretty_print=True).decode("utf-8"))
# <PhysicalProperty xmlns="Apartments.AP.Mits20PropertyFinal">
# <Property>
# <PropertyInfo>
# <MITSID>6458174</MITSID>
# <MarketingName>Westmount at London Park</MarketingName>
# <TotalUnits/>
# <AddressLine1>14545 Bammel North Houston Road</AddressLine1>
# <City>Houston</City>
# <State>TX</State>
# <Zip>77014</Zip>
# </PropertyInfo>
# <Floorplan>
# <FloorplanID>171ad0f2-da57-45c0-9cf5-c61cfa26dd63</FloorplanID>
# <FloorplanName>A1 </FloorplanName>
# <UnitCount>22</UnitCount>
# <Units>
# <Unit>
# <UnitNum>1901</UnitNum>
# <UnitLeasedStatus>Occupied</UnitLeasedStatus>
# </Unit>
# </Units>
# </Floorplan>
# </Property>
# </PhysicalProperty>
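The namespace point can be reproduced in isolation. The stylesheet declares xmlns="Apartments.AP.Mits20PropertyFinal" on its root element, so the literal result elements end up in that namespace and an unprefixed path no longer matches them. The sketch below uses the standard library and a shortened copy of the output; the prefix doc is arbitrary.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the transformation result shown above
RESULT = """<PhysicalProperty xmlns="Apartments.AP.Mits20PropertyFinal">
<Property><PropertyInfo><MITSID>6458174</MITSID></PropertyInfo></Property>
</PhysicalProperty>"""

# The prefix is only a local alias; it does not have to appear in the document
NS = {"doc": "Apartments.AP.Mits20PropertyFinal"}
root = ET.fromstring(RESULT)

unqualified = root.findall("./Property")      # ignores the default namespace
qualified = root.findall("./doc:Property", NS)
mitsid = root.findtext("doc:Property/doc:PropertyInfo/doc:MITSID", namespaces=NS)

print(unqualified)
print(qualified)
print(mitsid)
```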
