Change the XML structure to a new one with Python

I want to convert an XML document from its current structure to another standard structure that was given to me.
I believe I can achieve that through the following steps:
1. Get all the tags and their attributes, so I know what to modify, remove or add.
2. Change tag names (e.g. informaltable to table, or sect1 to section).
3. Establish standard attributes for the different tags and keep them in a dictionary (e.g. all section, title and table tags must have these attributes: section: {"xmlns:xsi", "id", "type", "xsi:noNamespaceSchemaLocation"}, title: {"id"}, table: {"frame", "id"}).
4. Give a random alphanumeric id to every tag that has the id attribute; an id must never repeat (e.g. id=id-824fc56b-431b-4ad3-e933-f0fc222e50d3).
5. Modify, add and remove attribute values for certain tags (e.g. frame was frame=all and is now frame=any; delete the rowsep attribute in the colspec tag).
6. Remove specific tags (e.g. remove the anchor tags and, of course, all of their attributes). I hope this doesn't affect the rest of the hierarchy.
I have this XML example:
<section xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="id-c3ee53e4-e2ef-441b-8f3b-7320c4e32ef8" type="policy" xsi:noNamespaceSchemaLocation="urn:fontoxml:cpa.xsd:5.0">
<title id="id-f0497441-5ecb-47ee-b7c0-263832a9e402">
<anchor id="_Toc493170182"/>
<anchor id="__RefHeading___Toc3574_3674829928"/>
<anchor id="_Toc72503731"/>
<anchor id="_Toc69390724"/>
<anchor id="_Toc493496869"/>
Abbreviations of Terms
</title>
<table frame="all" id="id-6837f232-02e3-4e7a-ce8d-cb2df48256ac">
<tgroup cols="2" id="id-437c0d54-7257-4d34-a73d-351d533f0460">
<colspec colname="column-0" colnum="1" colsep="1" rowsep="1" colwidth="0.2*" id="id-c87e1040-c2d7-4b15-fb0c-86557d201235" />
<colspec colname="column-1" colnum="2" colsep="1" rowsep="1" colwidth="0.8*" id="id-5bebcf85-440b-416e-b2f9-72e47d5bb4f7" />
<thead id="id-ff67f8a7-5baf-4a42-ac31-09c0f99cceed">
<row id="id-542df999-7736-4cc2-e725-1b7b106e08d6">
<entry rowsep="1" colsep="1" colname="column-0" id="id-54a7d605-21ff-44db-c1f6-03111db180c7">
<para id="id-f43f7fb1-cd40-4b4a-88f2-02e55e786a5e">
<emphasis style="bold">Abbreviation
</emphasis>
</para>
</entry>
<entry rowsep="1" colsep="1" colname="column-1" id="id-aecec4c6-f85b-490e-9b72-99c6764b49cf">
<para id="id-4d89100a-4e4c-419a-d081-f776bcf9083e">
<emphasis style="bold">Definition
</emphasis>
</para>
</entry>
</row>
</thead>
<tbody id="id-824fc56b-431b-4ad3-e933-f0fc222e50d3">
<row id="id-620a8ff6-0189-41c7-e9af-dc9498ce703e">
<entry rowsep="1" colsep="1" colname="column-0" id="id-fb941cc0-287d-4760-a5a0-87419fa66d68">
<para id="id-127a8a37-9705-496b-87ee-303bcfd52a25">A/C</para>
</entry>
<entry rowsep="1" colsep="1" colname="column-1" id="id-317ad682-6e02-43c3-b724-5d50683c8f79">
<para id="id-c7c2fac5-f286-4802-b8d6-2e54fa2cad3c">AirCraft</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
</section>
And this is the code that I have so far
from lxml import etree
import numpy as np

# Parse the XML file and create the lists
tree = etree.parse("InitialFile")
root = tree.getroot()
Lista = []
tags = []

# Get the unique tag values
for element in root.iter():
    Lista.append(element.tag)
tags = np.unique(Lista)

# Show the unique tag[attributes] pairs
for tag in tags:
    print(tag, root.xpath(f'//{tag}')[0].attrib.keys())

# Change the tag names to the required names
for p in tree.findall(".//sect1"):
    p.tag = "section"
for p in tree.findall(".//informaltable"):
    p.tag = "table"

# Modify the tag's attributes to their desired form
for cy in root.xpath('//section'):
    cy.attrib['xmlns:xsi'] = 'http://www.w3.org/2001/XMLSchema-instance'  # it doesn't accept : as part of the attribute's name and I don't know why
    cy.attrib['id'] = random()  # this doesn't work yet
    cy.attrib['type'] = 'policy'
    cy.attrib['xsi:noNamespaceSchemaLocation'] = 'urn:fontoxml:cpa.xsd:1.0'  # it doesn't accept : as part of the attribute's name and I don't know why

# Modify the attribute values
for t in root.xpath('//title'):
    t.attrib['id'] = random()
for p in root.xpath('//section'):
    p.attrib['id'] = random()
    p.attrib['type'] = 'policy'
for p in root.xpath('//table'):
    p.attrib['id'] = random()
for ct in root.xpath('//colspec'):
    ct.attrib.pop("rowsep", None)

# Print the new XML to make sure it worked
print(etree.tostring(root).decode())
tree.write("Final file.xml")
If you have any other ideas please feel free to share.

I agree that this is a task for XSLT (which lxml can run). Here is an example stylesheet that implements some of your requirements in a modular way by delegating each change to a template of its own:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0">
  <xsl:output method="xml"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sect1">
    <section>
      <xsl:apply-templates select="@* | node()"/>
    </section>
  </xsl:template>
  <xsl:template match="informaltable">
    <table>
      <xsl:apply-templates select="@* | node()"/>
    </table>
  </xsl:template>
  <xsl:template match="@id">
    <xsl:attribute name="{name()}">
      <xsl:value-of select="generate-id()"/>
    </xsl:attribute>
  </xsl:template>
  <xsl:template match="@xsi:noNamespaceSchemaLocation">
    <xsl:attribute name="{name()}" namespace="{namespace-uri()}">urn:fontoxml:cpa.xsd:1.0</xsl:attribute>
  </xsl:template>
  <xsl:template match="colspec/@rowsep"/>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/bET2rXs
I hope with that as a starting point and any XSLT tutorial or introduction you can work it out.
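For comparison, here is a minimal Python-only sketch of the same changes. It assumes the standard library's xml.etree.ElementTree instead of lxml, and the helper names (new_id, remove_keep_tail, restructure) are made up for illustration. It also shows why 'xsi:noNamespaceSchemaLocation' is rejected as an attribute name: a namespaced attribute has to be addressed as {namespace-uri}localname, and the xmlns:xsi declaration is written by the serializer rather than set as an attribute. Ids come from uuid4, which is random and, for practical purposes, does not repeat.

```python
import uuid
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)  # serialize this namespace with the xsi prefix

RENAME = {"sect1": "section", "informaltable": "table"}  # old tag -> new tag


def new_id():
    """Return an id shaped like the ids in the sample document."""
    return f"id-{uuid.uuid4()}"


def remove_keep_tail(parent, child):
    """Remove child from parent but keep the text that follows it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def restructure(root):
    # Drop every anchor element without losing the surrounding text.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "anchor":
                remove_keep_tail(parent, child)
    for elem in root.iter():
        elem.tag = RENAME.get(elem.tag, elem.tag)
        if elem.tag in ("section", "title", "table") or "id" in elem.attrib:
            elem.set("id", new_id())
        if elem.tag == "section":
            elem.set("type", "policy")
            # Namespaced attribute: {uri}localname, not 'xsi:...'
            elem.set(f"{{{XSI}}}noNamespaceSchemaLocation", "urn:fontoxml:cpa.xsd:1.0")
        if elem.tag == "table" and elem.get("frame") == "all":
            elem.set("frame", "any")
        if elem.tag == "colspec":
            elem.attrib.pop("rowsep", None)
    return root


sample = (
    '<sect1><title id="old"><anchor id="_Toc1"/>Abbreviations of Terms</title>'
    '<informaltable frame="all"><colspec colname="c0" rowsep="1"/></informaltable></sect1>'
)
root = restructure(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

lxml accepts the same {uri}localname notation for attribute names, and its elements offer getparent(), which makes the anchor removal a little shorter there.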

Related

Python + XML documents

I'm a bit new to XML and python. Below is a cut down version of a large XML file I'm trying to bring into python to eventually write into SQL Server db.
<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
<Manifest>
<Version value="2-0-0" />
<PublicationType value="Full" />
<PublicationSource value="TEST123" />
<PublicationDate value="2022-05-23" />
<PublicationSeqNum value="1659" />
<FileCreationDateTime value="2022-05-23T22:14:47" />
<RecordCount value="287654" />
<ContentDescription value="FullFile_20220523" />
<PrimaryRoleScope>
<PrimaryRole id="123" displayName="Free beer for me" />
<PrimaryRole id="456" displayName="Free air for you" />
</PrimaryRoleScope>
</Manifest>
<CodeSystems>
<CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
<concept id="RC2" code="2" displayName="World1" />
<concept id="RC1" code="1" displayName="World2" />
</CodeSystem>
<CodeSystem name="OrganisationRole" oid="5.4.7.8">
<concept id="B1ng0" code="179" displayName="BoomBastic" />
<concept id="R2D2a" code="180" displayName="Fantastic" />
</CodeSystem>
</CodeSystems>
</MyOrgRefData:OrgRefData>
I've tried with lxml, pandas.read_xml, xml.etree and I'm not able to understand how to get what I want.
Ideally I'd like to pull Manifest into a dataframe ready to send to SQL (DataFrame.to_sql()). I would do the same with CodeSystems as well, but separately. (There are other sections, but I cut them to keep this short.)
For example, using pandas to read in, I can only get a column with the values in. But I would like to either have the tag (Version, PublicationType, PublicationSource etc) in a column by the side of the value, or have them as the column headers and the values pivoted across the row instead.
import pandas as pd

dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
    dataFolder+'Data_Full_20220523.xml',
    xpath='//Manifest/*',
    attrs_only=True,
)
df_bulk.head()
This is the output I get:
inx  value
0    2-0-0
1    Full
2    TEST123
3    2022-05-23
4    1659
5    2022-05-23T22:14:47
6    287654
7    FullFile_20220523
Ideally I would like:
inx                   value
Version               2-0-0
PublicationType       Full
PublicationSource     TEST123
PublicationDate       2022-05-23
PublicationSeqNum     1659
FileCreationDateTime  2022-05-23T22:14:47
RecordCount           287654
ContentDescription    FullFile_20220523
The eagle-eyed among you will notice I've left out PrimaryRoleScope. I would ideally like to treat this separately in its own dataframe as well, but I am unsure how to exclude it when pulling in the rest of the Manifest section.
Many thanks if you've read this far, even more thanks for any help.
One possibility is using the stylesheet parameter to transform the XML data internally with XSLT before processing it.
So your code could look like this:
dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
    dataFolder+'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
    attrs_only=True,
)
print(df_bulk.head(10))
The stylesheet (transform.xslt) passed to read_xml could look like this (lxml is required):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <Root><xsl:apply-templates /></Root>
  </xsl:template>
  <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item name="{name()}" value="{@value}" />
  </xsl:template>
</xsl:stylesheet>
In this example a new XML like the following is created. It is intermediate XML and not shown, but the xpath= parameter above has to be set accordingly.
<Root>
<Item name="Version" value="2-0-0"/>
<Item name="PublicationType" value="Full"/>
<Item name="PublicationSource" value="TEST123"/>
<Item name="PublicationDate" value="2022-05-23"/>
<Item name="PublicationSeqNum" value="1659"/>
<Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
<Item name="RecordCount" value="287654"/>
<Item name="ContentDescription" value="FullFile_20220523"/>
</Root>
And the final output is
name value
0 Version 2-0-0
1 PublicationType Full
2 PublicationSource TEST123
3 PublicationDate 2022-05-23
4 PublicationSeqNum 1659
5 FileCreationDateTime 2022-05-23T22:14:47
6 RecordCount 287654
7 ContentDescription FullFile_20220523
The above approach uses only attributes, but you could also create an element structure with the XSLT if you prefer that. In this case change one template to
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
  <Item>
    <name><xsl:value-of select="name()" /></name>
    <value><xsl:value-of select="@value" /></value>
  </Item>
</xsl:template>
and your python code to
dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
    dataFolder+'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
)
print(df_bulk.head(10))
The output is the same.
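If you would rather not maintain a stylesheet, the same name/value rows can be collected in plain Python. This is only a sketch, assuming the standard library's xml.etree.ElementTree; the function names manifest_rows and primary_roles and the shortened SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (hypothetical sample).
SAMPLE = """\
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <RecordCount value="287654" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
</MyOrgRefData:OrgRefData>
"""


def manifest_rows(xml_text):
    """One {name, value} row per Manifest child, skipping PrimaryRoleScope."""
    manifest = ET.fromstring(xml_text).find("Manifest")
    return [
        {"name": child.tag, "value": child.get("value")}
        for child in manifest
        if child.tag != "PrimaryRoleScope"
    ]


def primary_roles(xml_text):
    """One row per PrimaryRole element, built from its attributes."""
    root = ET.fromstring(xml_text)
    return [dict(role.attrib) for role in root.iterfind("Manifest/PrimaryRoleScope/PrimaryRole")]


rows = manifest_rows(SAMPLE)
roles = primary_roles(SAMPLE)
for row in rows:
    print(row["name"], row["value"])
```

Each list of dictionaries can be handed to pd.DataFrame(...), which gives one dataframe for the Manifest values and a separate one for PrimaryRoleScope.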

How to convert nodes in XML to CDATA with XSLT?

I have a source.xml file with structure like:
<products>
<product>
<id>1</id>
<description>
<style>
table{
some css here
}
</style>
<descr>
<div>name of producer like ABC&DEF</div>
<table>
<th>parameters</th>
<tr><td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g < 100 W</td></tr>
</table>
</descr>
</description>
</product>
.....................
</products>
I would like to have:
<products>
<product>
<id>1</id>
<description>
<![CDATA[
<style>
table{
some css here
}
</style>
<descr>
<div>name of producer like ABC&DEF</div>
<table>
<th>parameters</th>
<tr><td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g < 100 VA</td></tr>
</table>
]]>
</descr>
</description>
</product>
.....................
</products>
I tried .xsl stylesheets based on:
How to use in XSLT?
and
Add CDATA to an xml file
and
how to add cdata to an xml file using xsl such as:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8" />
<xsl:template match="/products">
<products>
<xsl:for-each select="product">
<product>
<description>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:copy-of select="description/node()" />
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:for-each>
</description>
</product>
</xsl:for-each>
</products>
</xsl:template>
</xsl:stylesheet>
and
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="xml" indent="yes" cdata-section-elements="description"/>
<xsl:template match="description">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:variable name="subElementsText">
<xsl:apply-templates select="node()" mode="asText"/>
</xsl:variable>
</xsl:copy>
</xsl:template>
<xsl:template match="text()" mode="asText">
<xsl:copy/>
</xsl:template>
<xsl:template match="*" mode="asText">
<xsl:value-of select="concat('&lt;',name())"/>
<xsl:for-each select="@*">
<xsl:value-of select="concat(' ',name(),'=&quot;',.,'&quot;')"/>
</xsl:for-each>
<xsl:value-of select="'&gt;'"/>
<xsl:apply-templates select="node()" mode="asText"/>
<xsl:value-of select="concat('&lt;/',name(),'&gt;')"/>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
but running my python script
import lxml.etree as ET
doc = ET.parse('source.xml')
xslt = ET.parse('modyfi.xsl')
transform = ET.XSLT(xslt)
newdoc = transform(doc)
with open(f'output.xml', 'wb') as f:
    f.write(newdoc)
in Sublime Text 3, I always get the same error:
lxml.etree.XMLSyntaxError: StartTag: invalid element name, {number of line and column with first appearance of illegal character}
I am sure the solution is right in front of me in the links above, but I can't see it.
Or maybe I can't find it because I don't know how to ask the right question. Please help, I'm new to coding.
The input XML is not well-formed: a literal & or < in text content has to be written as &amp; or &lt;. I had to fix that first, and it seems to be the reason why it is failing on your end.
XML
<products>
<product>
<id>1</id>
<description>
<style>table{
some css here
}</style>
<descr>
<div>name of producer like ABC&amp;DEF</div>
<table>
<th>parameters</th>
<tr>
<td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g &lt; 100 W</td>
</tr>
</table>
</descr>
</description>
</product>
</products>
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="description">
<xsl:copy>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:copy-of select="*"/>
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output
<products>
<product>
<id>1</id>
<description><![CDATA[
<style>table{
some css here
}
</style>
<descr>
<div>name of producer like ABC&amp;DEF</div>
<table>
<th>parameters</th>
<tr>
<td>name of param 1 e.g POWER CONSUMPTION</td>
<td>value of param 1 with e.g &lt; 100 W</td>
</tr>
</table>
</descr>]]>
</description>
</product>
</products>
In my view a clean way is to make use of a serialize function to serialize all elements you want as plain text, to then designate the parent container in the xsl:output declaration in the cdata-section-elements and to finally make sure the XSLT processor is in charge of the serialization.
Now XSLT 3 has a built-in XPath 3.1 serialize function, in Python you could use that with Saxon-C and its Python API.
For libxslt based XSLT 1 with lxml you can write an extension function in Python exposed to XSLT:
from lxml import etree as ET

def serialize(context, nodes):
    return b''.join(ET.tostring(node) for node in nodes)

ns = ET.FunctionNamespace('http://example.com/mf')
ns['serialize'] = serialize

xml = ET.fromstring('<root><div><p>foo</p><p>bar</p></div></root>')
xsl = ET.fromstring('''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:mf="http://example.com/mf" version="1.0">
  <xsl:output method="xml" cdata-section-elements="div" encoding="UTF-8"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="div">
    <xsl:copy>
      <xsl:value-of select="mf:serialize(node())"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>''')
transform = ET.XSLT(xsl)
result = transform(xml)
result.write_output("transformed.xml")
Output then is
<?xml version="1.0" encoding="UTF-8"?>
<root><div><![CDATA[<p>foo</p><p>bar</p>]]></div></root>
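The same idea (serialize the children to text, then let the serializer emit the CDATA section) can also be sketched without XSLT. The following assumes the standard library's xml.dom.minidom rather than lxml, and the function name wrap_children_in_cdata is hypothetical; it is an illustration, not a drop-in replacement for the stylesheet above.

```python
from xml.dom import minidom


def wrap_children_in_cdata(xml_text, tag):
    """Replace the children of every <tag> element with one CDATA section
    that holds their serialized markup."""
    doc = minidom.parseString(xml_text)
    for elem in doc.getElementsByTagName(tag):
        # Serialize all child nodes (elements and text) to a single string.
        markup = "".join(child.toxml() for child in elem.childNodes)
        # Remove the original children ...
        while elem.firstChild:
            elem.removeChild(elem.firstChild)
        # ... and put the markup back as a CDATA section.
        elem.appendChild(doc.createCDATASection(markup))
    return doc.documentElement.toxml()


result = wrap_children_in_cdata("<root><div><p>foo</p><p>bar</p></div></root>", "div")
print(result)
```

As with the XSLT versions, the input has to be well-formed before any of this can run, because the parser rejects a bare & or < in text.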

xslt transformation works in Altova but not in python

I am trying to transform one XML document into another using XSLT. Below is the XSLT I am using:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns="Apartments.AP.Mits20PropertyFinal"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:var="http://schemas.microsoft.com/BizTalk/2003/var"
exclude-result-prefixes="msxsl var userCSharp"
xmlns:userCSharp="http://schemas.microsoft.com/BizTalk/2003/userCSharp"
>
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:key name="myKey" match="ILS_Unit/Units/Unit" use="./Identification[@IDType='FloorplanID']/@IDValue"/>
<xsl:template match="/">
<PhysicalProperty>
<!-- <xsl:apply-templates select="PhysicalProperty/Management"/>
--> <xsl:apply-templates select="PhysicalProperty/Property"/>
</PhysicalProperty>
</xsl:template>
<xsl:template match="Property">
<Property>
<PropertyInfo>
<xsl:element name="MITSID">
<xsl:value-of select="./PropertyID/Identification[@IDRank='secondary']/@IDValue"/>
</xsl:element>
<xsl:element name="MarketingName">
<xsl:value-of select="./PropertyID/MarketingName"/>
</xsl:element>
<xsl:element name="TotalUnits">
</xsl:element>
<xsl:element name="AddressLine1">
<xsl:value-of select="./PropertyID/Address/AddressLine1"/>
</xsl:element>
<xsl:element name="City">
<xsl:value-of select="./PropertyID/Address/City"/>
</xsl:element>
<xsl:element name="State">
<xsl:value-of select="./PropertyID/Address/State"/>
</xsl:element>
<xsl:element name="Zip">
<xsl:value-of select="./PropertyID/Address/PostalCode"/>
</xsl:element>
</PropertyInfo>
<xsl:for-each select="./Floorplan">
<Floorplan>
<FloorplanID><xsl:value-of select="./@IDValue"/></FloorplanID>
<FloorplanName><xsl:value-of select="./Name"/></FloorplanName>
<!--This Unit count is Totals no of Units per floorplan-->
<UnitCount><xsl:value-of select="./UnitCount"/></UnitCount>
<Units>
<xsl:variable name="floorplanid" select="./@IDValue"/>
<xsl:for-each select="key('myKey',$floorplanid)">
<Unit>
<UnitNum>
<xsl:value-of select="./MarketingName"/>
</UnitNum>
<xsl:if test="./UnitLeasedStatus='Not_Available'">
<UnitLeasedStatus>Occupied</UnitLeasedStatus>
</xsl:if>
<xsl:if test="./UnitLeasedStatus='On_Notice'">
<UnitLeasedStatus>On Notice</UnitLeasedStatus>
</xsl:if>
<xsl:if test="./UnitLeasedStatus='Available'">
<UnitLeasedStatus>Available</UnitLeasedStatus>
</xsl:if>
</Unit>
</xsl:for-each>
</Units>
</Floorplan>
</xsl:for-each>
</Property>
</xsl:template>
</xsl:stylesheet>
And below is the sample xml file
<?xml version="1.0" encoding="utf-8"?>
<PhysicalProperty>
<Property>
<PropertyID>
<Identification IDValue="183dbed4-0101-4a85-954c-a4e8042d2819" IDRank="primary"/>
<Identification IDValue="6458174" IDRank="secondary"/>
<MarketingName>Westmount at London Park</MarketingName>
<Website>http://www.westmountatlondonpark.com</Website>
<Address AddressType="property">
<AddressLine1>14545 Bammel North Houston Road</AddressLine1>
<AddressLine2/>
<City>Houston</City>
<State>TX</State>
<PostalCode>77014</PostalCode>
</Address>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Unspecified"/>
<Floorplan IDValue="171ad0f2-da57-45c0-9cf5-c61cfa26dd63" IDType="FloorplanID" IDRank="primary">
<FloorplanType>Internal</FloorplanType>
<Name>A1  </Name>
<Comment/>
<UnitCount>22</UnitCount>
<UnitsAvailable>3</UnitsAvailable>
<DisplayedUnitsAvailable>5</DisplayedUnitsAvailable>
<Room RoomType="Bedroom">
<Count>1</Count>
</Room>
<Room RoomType="Bathroom">
<Count>1</Count>
</Room>
<SquareFeet Min="602" Max="602"/>
<EffectiveRent Min="810" Max="830"/>
<Deposit DepositType="Security Deposit">
<Amount>
<ValueRange Min="150" Max="150"/>
</Amount>
</Deposit>
<File FileID="e114a438-ee82-497d-8e04-6a5a8b2381ef" active="true">
<FileType>Floorplan</FileType>
<Caption/>
<Src>https://apollostore.blob.core.windows.net/londonpark/uploads/images/floorplans/a1.7a75909c-3362-409c-82c7-aa6c959f9c99.jpg</Src>
<Rank>999</Rank>
</File>
</Floorplan>
<ILS_Unit>
<ILS_Unit IDValue="be827564-6460-4af1-9644-3f6ffa225557" IDType="UnitID" IDRank="primary">
<Units>
<Unit>
<Identification IDValue="be827564-6460-4af1-9644-3f6ffa225557" IDType="UnitID" IDRank="primary"/>
<Identification IDValue="171ad0f2-da57-45c0-9cf5-c61cfa26dd63" IDType="FloorplanID" IDRank="primary"/>
<MarketingName>1901</MarketingName>
<UnitBedrooms>1</UnitBedrooms>
<UnitBathrooms>1</UnitBathrooms>
<UnitRent>765</UnitRent>
<UnitLeasedStatus>Not_Available</UnitLeasedStatus>
<FloorplanName>A1  </FloorplanName>
</Unit>
</Units>
<Comment/>
<EffectiveRent Min="765" Max="765"/>
<Deposit DepositType="Security Deposit">
<Amount>
<ValueRange Min="150" Max="150"/>
</Amount>
</Deposit>
</ILS_Unit>
</ILS_Unit>
</Property>
</PhysicalProperty>
The transformation works fine using altova tool but does not work in python.
Below is the python script I am using.
from lxml import etree
import os
import glob

xslt = etree.parse("fl.xslt")
dom = etree.parse("f.xml",)
transform = etree.XSLT(xslt)
try:
    newdom = transform(dom)
    root = etree.parse(newdom)
    properties = root.findall("./Property")
    print(properties)
except Exception as e:
    print(e)
    for error in transform.error_log:
        print(error.message, error.line)
print(etree.tostring(newdom, pretty_print=True))
I am trying to print the Property nodes. I also tried to write the result to an .xml file, but it returns nothing. Could someone tell me what the issue could be?
Surprisingly it works fine in Altova.
Below are the errors that are thrown.
line 53, in <module>
root = etree.parse(newdom)
File "src\lxml\etree.pyx", line 3519, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1862, in lxml.etree._parseDocument
TypeError: cannot parse from 'lxml.etree._XSLTResultTree'
The specific error has nothing to do with XSLT; it comes from your attempt to parse the result. Like Python's built-in xml.etree, the parse function of lxml.etree expects a file name or a file-like object. The result of an XSLT transformation in lxml, however, is already an ElementTree-like object on which you can directly call methods such as findall, iterfind, etc.
Therefore, simply remove the parse line. Additionally, because the stylesheet declares a default namespace for the result, assign a temporary prefix to access nodes. Also, consider xpath in lxml.
newdom = transform(dom)
properties = newdom.findall("./doc:Property", namespaces={'doc': 'Apartments.AP.Mits20PropertyFinal'})
print(properties)
# [<Element {Apartments.AP.Mits20PropertyFinal}Property at 0x136b2f94b08>]
properties = newdom.xpath("./doc:Property", namespaces={'doc': 'Apartments.AP.Mits20PropertyFinal'})
print(properties)
# [<Element {Apartments.AP.Mits20PropertyFinal}Property at 0x136b2f94b08>]
Output
print(etree.tostring(newdom, pretty_print=True).decode("utf-8"))
# <PhysicalProperty xmlns="Apartments.AP.Mits20PropertyFinal">
# <Property>
# <PropertyInfo>
# <MITSID>6458174</MITSID>
# <MarketingName>Westmount at London Park</MarketingName>
# <TotalUnits/>
# <AddressLine1>14545 Bammel North Houston Road</AddressLine1>
# <City>Houston</City>
# <State>TX</State>
# <Zip>77014</Zip>
# </PropertyInfo>
# <Floorplan>
# <FloorplanID>171ad0f2-da57-45c0-9cf5-c61cfa26dd63</FloorplanID>
# <FloorplanName>A1 </FloorplanName>
# <UnitCount>22</UnitCount>
# <Units>
# <Unit>
# <UnitNum>1901</UnitNum>
# <UnitLeasedStatus>Occupied</UnitLeasedStatus>
# </Unit>
# </Units>
# </Floorplan>
# </Property>
# </PhysicalProperty>
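The effect of the default namespace can be reproduced in isolation. This is a small sketch using the standard library's xml.etree.ElementTree on a shortened copy of the output above; the doc prefix is an arbitrary name chosen for the lookup, exactly as in the lxml calls.

```python
import xml.etree.ElementTree as ET

NS = {"doc": "Apartments.AP.Mits20PropertyFinal"}  # prefix name is arbitrary

# Shortened copy of the transformation result shown above.
RESULT = (
    '<PhysicalProperty xmlns="Apartments.AP.Mits20PropertyFinal">'
    "<Property><PropertyInfo><MITSID>6458174</MITSID></PropertyInfo></Property>"
    "</PhysicalProperty>"
)

root = ET.fromstring(RESULT)

# Without a prefix the path looks for elements in no namespace and finds nothing.
unqualified = root.findall("Property")

# With a prefix mapped to the default namespace the element is found.
qualified = root.findall("doc:Property", NS)
mitsid = root.findtext("doc:Property/doc:PropertyInfo/doc:MITSID", namespaces=NS)

print(len(unqualified), len(qualified), mitsid)
```

Every step of the path needs the prefix, because all elements in the result inherit the default namespace from the root.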

How to get content of an XML element as string?

Considering this XML example:
<data>
<items>
<item name="item1">item1pre <bold>ok!</bold> item1post</item>
<item name="item2">item2</item>
</items>
</data>
I am looking for a way to get the following result:
"item1pre **ok!** item1post"
I thought of getting all the content of item1 as the string "item1pre <bold>ok!</bold> item1post" and then replacing "<bold>" and "</bold>" with "**", but I don't know how to get that string.
xml = """
<data>
<items>
<item name="item1">item1pre<bold>ok!</bold>item1post</item>
<item name="item2">item2</item>
</items>
</data>
"""

import xml.etree.ElementTree as ET  # module included with Python


def cleaned_strings_from_xml(xml_str, tag='item'):
    """
    finds all items of type tag from xml-string
    :param xml_str: valid xml structure as string
    :param tag: tag to search inside the xml
    :returns: list of all texts of 'tag'-items
    """
    strings = []
    root = ET.fromstring(xml_str)  # parse the argument, not the global variable
    for item in root.iter(tag):
        item_str = ET.tostring(item).decode('utf-8')
        item_str = item_str.replace('<bold>', ' **').replace('</bold>', ' **')
        strings.append(ET.fromstring(item_str).text)
    return strings


print(cleaned_strings_from_xml(xml))
You could offload all XML processing to libxml2/libxslt by using an XSLT transformation. These libraries are written in C, which should be quicker:
from lxml import etree

transform = etree.XSLT(etree.XML('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>
  <xsl:template match="data/items/item[@name = 'item1']">
    <xsl:text>"</xsl:text>
    <xsl:value-of select="text()"/>
    <xsl:text>**</xsl:text>
    <xsl:value-of select="bold/."/>
    <xsl:text>**</xsl:text>
    <xsl:value-of select="bold/following-sibling::text()[1]"/>
    <xsl:text>"</xsl:text>
  </xsl:template>
  <xsl:template match="data/items/item[@name != 'item1']" />
</xsl:stylesheet>
'''))
with open("source.xml") as f:
    print(transform(etree.parse(f)))
In a nutshell: Match the item element with name attribute 'item1' then use relative xpath expressions to extract the strings.
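The same string can also be assembled directly from the element's text, its children and their tails, without serializing and re-parsing. This is a minimal sketch with the standard library's xml.etree.ElementTree; the function name item_markdown is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<data>
  <items>
    <item name="item1">item1pre <bold>ok!</bold> item1post</item>
    <item name="item2">item2</item>
  </items>
</data>
"""


def item_markdown(item):
    """Join the item's text, wrapping the content of <bold> children in **."""
    parts = [item.text or ""]            # text before the first child
    for child in item:
        inner = "".join(child.itertext())
        parts.append(f"**{inner}**" if child.tag == "bold" else inner)
        parts.append(child.tail or "")   # text after the child
    return "".join(parts)


root = ET.fromstring(XML)
texts = {item.get("name"): item_markdown(item) for item in root.iter("item")}
print(texts["item1"])
```

Spaces are preserved exactly as they appear in the source, so the markers sit tight against the bold text.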

Python XML and XPath to sort things out

Let's say I have an XML as follows.
<a>
<b>
<c>A</c>
</b>
<bb>
<c>B</c>
</bb>
<c>
X
</c>
</a>
I need to parse this XML into dictionary X for a/b/c and a/bb/c, but into dictionary T for a/c.
dictionary X
X[a_b_c] = A
X[a_bb_c] = B
dictionary T
T[a_c] = X
Question: I'd like to define this mapping in an XML file using XPath. How can I do this?
I am thinking of a mapping.xml like the following:
<mapping>
<from>a/c</from><to>dictionary T</to>
....
</mapping>
I would then use 'a/c' to get X and put it in dictionary T. Is there a better way to go?
Maybe you could do this with XSLT. This stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:key name="dict" match="item" use="@dict"/>
<xsl:key name="path" match="*[not(*)]" use="concat(name(../..),'/',
name(..),'/',
name())"/>
<xsl:variable name="map">
<item path="a/b/c" dict="X"/>
<item path="a/bb/c" dict="X"/>
<item path="/a/c" dict="T"/>
</xsl:variable>
<xsl:template match="/">
<xsl:variable name="input" select="."/>
<xsl:for-each select="document('')/*/xsl:variable[@name='map']/*[count(.|key('dict',@dict)[1])=1]">
<xsl:variable name="dict" select="@dict"/>
<xsl:variable name="path" select="../item[@dict=$dict]/@path"/>
<xsl:value-of select="concat('dictionary ',$dict,'&#xA;')"/>
<xsl:for-each select="$input">
<xsl:apply-templates select="key('path',$path)">
<xsl:with-param name="dict" select="$dict"/>
</xsl:apply-templates>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
<xsl:template match="*">
<xsl:param name="dict"/>
<xsl:variable name="path" select="concat(name(../..),'_',
name(..),'_',
name())"/>
<xsl:value-of select="concat($dict,'[',
translate(substring($path,
1,
1),
'_',
''),
substring($path,2),'] = ',
normalize-space(.),'&#xA;')"/>
</xsl:template>
</xsl:stylesheet>
Output:
dictionary X
X[a_b_c] = A
X[a_bb_c] = B
dictionary T
T[a_c] = X
EDIT: Pretty things a bit.
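If the mapping can live in Python instead of XML, the path-to-dictionary idea can be sketched in a few lines. This assumes the standard library's xml.etree.ElementTree; the MAPPING table and the function name collect are made up for illustration, and the paths are the ones from the question.

```python
import xml.etree.ElementTree as ET

XML = "<a><b><c>A</c></b><bb><c>B</c></bb><c> X </c></a>"

# Hypothetical mapping table: element path -> name of the target dictionary.
MAPPING = {"a/b/c": "X", "a/bb/c": "X", "a/c": "T"}


def collect(root, mapping):
    """Fill one dictionary per target name, keyed by the path with / replaced by _."""
    result = {}
    for path, dict_name in mapping.items():
        root_tag, _, rest = path.partition("/")
        if root.tag != root_tag:
            continue  # the path does not start at this root element
        for elem in root.findall(rest):
            key = path.replace("/", "_")
            result.setdefault(dict_name, {})[key] = (elem.text or "").strip()
    return result


dictionaries = collect(ET.fromstring(XML), MAPPING)
print(dictionaries)
```

The paths are resolved relative to the root element, so a/c only matches the c that is a direct child of a.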
