Python: in an XML file, how to delete nodes that meet some condition

I have a XML file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
</sentence>
<sentence id="1004293:1">
<text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
<Opinions>
</sentence>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
</sentences>
</Review>
How can I delete the sentences that have no opinions and keep only the sentences whose text has an opinion?
I would like to get something like this:
<sentences>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
</sentences>

I would convert the XML to a dict (see, for example, "How to convert an xml string to a dictionary?"), filter out the nodes that you do not want, and convert the result back to XML.

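Consider using XSLT, the special-purpose language designed to transform XML documents. Specifically, run the identity transform to copy the document as is, then add an empty template that matches every sentence meeting the removal condition, so those nodes are dropped from the output.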
XSLT (save as an .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- EMPTY TEMPLATE TO DELETE NODE(S) -->
<xsl:template match="sentence[text and not(Opinions/*)]"/>
</xsl:stylesheet>
Python (using third-party module, lxml)
import lxml.etree as et
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/Script.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# TRANSFORM SOURCE DOC
result = transform(doc)
# OUTPUT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Using builtin XML library (ElementTree).
Note: the XML you posted was not well-formed, so I had to fix it.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions />
</sentence>
<sentence id="1004293:1">
<text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
<Opinions />
</sentence>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0" />
</Opinions>
</sentence>
</sentences>
</Review>
</Reviews>
'''
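root = ET.fromstring(xml)
sentences_root = root.find('.//sentences')
# Select sentences whose <Opinions> holds no <Opinion> child.
# Compare with None explicitly: an Element without children is falsy.
sentences_with_no_opinions = [s for s in root.findall('.//sentence') if s.find('./Opinions/Opinion') is None]
for s in sentences_with_no_opinions:
    sentences_root.remove(s)
print(ET.tostring(root, encoding='unicode'))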
output
<?xml version="1.0" encoding="UTF-8"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion category="SERVICE#GENERAL" from="0" polarity="negative" target="NULL" to="0" />
</Opinions>
</sentence>
</sentences>
</Review>
</Reviews>
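The snippet above removes sentences from the first <sentences> element only. A minimal sketch of a variant for files with several reviews follows, assuming the same element names as the posted XML; the function name and the shortened sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def drop_sentences_without_opinions(root):
    """Remove every <sentence> that has no <Opinion> inside <Opinions>."""
    removed = 0
    # Snapshot the containers first so the tree is not modified mid-traversal.
    for container in list(root.iter("sentences")):
        # Copy the child list as well; removing while iterating skips items.
        for sentence in list(container.findall("sentence")):
            if sentence.find("./Opinions/Opinion") is None:
                container.remove(sentence)
                removed += 1
    return removed


# Hypothetical, shortened sample with two reviews.
sample = """<Reviews>
  <Review rid="1"><sentences>
    <sentence id="1:0"><text>a</text><Opinions/></sentence>
    <sentence id="1:1"><text>b</text><Opinions><Opinion polarity="negative"/></Opinions></sentence>
  </sentences></Review>
  <Review rid="2"><sentences>
    <sentence id="2:0"><text>c</text></sentence>
  </sentences></Review>
</Reviews>"""

root = ET.fromstring(sample)
removed_count = drop_sentences_without_opinions(root)
remaining_ids = [s.get("id") for s in root.iter("sentence")]
print(removed_count, remaining_ids)
```

Each sentence is removed through its own <sentences> parent, which is what ElementTree's remove() requires.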

Related

Python XML/Pandas: How to merge nested XML?

How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
This single case is trivial, but if I run the commands below on the full file, the resulting frames have different shapes and therefore cannot be joined so easily.
How can I extract the XML information so that every row holds all the RSku information plus its Blade information? Neither XPath result contains a key that would let me join it to the other.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)
Consider transforming the XML with XSLT to flatten the document down to the information you need. Specifically, retrieve only the Blade attributes using the descendant::* axis and the corresponding RSku attributes using the ancestor::* axis. Python's lxml (the default parser of pandas.read_xml) can run XSLT 1.0 scripts.
In the XSLT below, <xsl:for-each> prefixes attribute names with RSku_ and Blade_ because both elements share attribute names such as ID. Without that clash, the template would be much shorter.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/TopologyDefinition">
<root>
<xsl:apply-templates select="descendant::Blade"/>
</root>
</xsl:template>
<xsl:template match="Blade">
<data>
<xsl:for-each select="ancestor::RSku/@*">
<xsl:attribute name="{concat('RSku_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
<xsl:for-each select="@*">
<xsl:attribute name="{concat('Blade_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
</data>
</xsl:template>
</xsl:stylesheet>''')
blades_df = pd.read_xml(xml1, stylesheet=xsl)
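If XSLT is not an option, the same flattening can be sketched with the standard library alone: walk each RSku, then pair its attributes with those of every Blade beneath it. This is only an illustration under the assumption that the element names match the posted file; the function name and the shortened two-RSku sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def blades_with_rsku(xml_text):
    """Return one dict per Blade, carrying its parent RSku's attributes too."""
    rows = []
    root = ET.fromstring(xml_text)
    for rsku in root.iter("RSku"):
        # Prefix names because RSku and Blade both have an ID attribute.
        rsku_attrs = {f"RSku_{k}": v for k, v in rsku.attrib.items()}
        for blade in rsku.iter("Blade"):
            row = dict(rsku_attrs)
            row.update({f"Blade_{k}": v for k, v in blade.attrib.items()})
            rows.append(row)
    return rows


# Hypothetical, shortened sample with two RSku elements.
sample = """<TopologyDefinition>
  <RSkus>
    <RSku ID="V1" Generation="1">
      <Blades><Blade ID="{1-20}" SkuID="SBlade"/></Blades>
    </RSku>
    <RSku ID="V2" Generation="2">
      <Blades>
        <Blade ID="{1-10}" SkuID="SBlade"/>
        <Blade ID="{11-20}" SkuID="TBlade"/>
      </Blades>
    </RSku>
  </RSkus>
</TopologyDefinition>"""

rows = blades_with_rsku(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get one row per Blade with its RSku columns alongside.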

How to remove all occurrences of element in XML file?

I'd like to edit a KML file and remove all occurrences of ExtendedData elements, wherever they are located in the file.
Here's the input XML file:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<Style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</Style>
<name>My track</name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:name>
<mwm:lang code="default">Blah</mwm:lang>
</mwm:name>
<mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
</ExtendedData>
<Placemark>
<name></name>
…
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:localId>0</mwm:localId>
<mwm:visibility>1</mwm:visibility>
</ExtendedData>
</Placemark>
</Document>
</kml>
And here's the code that 1) only removes the outermost occurrence, and 2) requires adding the namespace to find it:
from lxml import etree
from pykml import parser
from pykml.factory import KML_ElementMaker as KML
with open("input.xml") as f:
    doc = parser.parse(f)
root = doc.getroot()
ns = "{http://earth.google.com/kml/2.2}"
for pm in root.Document.getchildren():
    # No way to get rid of namespace, for easier search?
    if pm.tag == f"{ns}ExtendedData":
        root.Document.remove(pm)
    # How to remove innermost occurrence of ExtendedData?
print(etree.tostring(doc, pretty_print=True))
Is there a way to remove all occurrences in one go, or should I parse the whole tree?
Thank you.
Edit: The BeautifulSoup solution below requires adding an option "BeautifulSoup(my_xml,features="lxml")" to avoid the warning "No parser was explicitly specified".
Here's a solution using BeautifulSoup:
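from bs4 import BeautifulSoup
soup = BeautifulSoup(my_xml) # this is your xml
while True:
    # With an HTML parser, tag names are lowercased, hence "extendeddata".
    elem = soup.find("extendeddata")
    if not elem:
        break
    elem.decompose()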
Here's the output for your data:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<kml xmlns="http://earth.google.com/kml/2.2">
<document>
<style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</style>
<name>
My track
</name>
<placemark>
<name>
</name>
</placemark>
</document>
</kml>
</body>
</html>
If you know the XML structure, try:
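from xml.etree import ElementTree
# Note: in a namespaced file such as KML, tags must be qualified,
# e.g. '{http://earth.google.com/kml/2.2}ExtendedData'.
xml_root = ElementTree.parse(filename_path).getroot()
elem = xml_root.find('./ExtendedData')
xml_root.remove(elem)
or
xml_root = ElementTree.parse(filename_path).getroot()
# Paths on an element must be relative; a leading '/' is not allowed.
p_elem = xml_root.find('./Placemark')
c_elem = p_elem.find('./ExtendedData')
p_elem.remove(c_elem)
Play with these ideas :)
If you don't know the XML structure, I think you need to parse the whole tree.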
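Walking the whole tree can be sketched with the standard library as follows. The helper names are made up for illustration, and the sample is a shortened version of the posted KML; matching on the local name is one possible way to sidestep the namespace prefix, not the only one.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def remove_all(root, name):
    """Remove every element whose local name equals `name`, at any depth."""
    removed = 0
    # Snapshot all elements first; each one is then treated as a parent.
    for parent in list(root.iter()):
        for child in list(parent):
            if isinstance(child.tag, str) and local_name(child.tag) == name:
                parent.remove(child)
                removed += 1
    return removed


# Shortened version of the KML from the question.
sample = """<kml xmlns="http://earth.google.com/kml/2.2">
  <Document>
    <name>My track</name>
    <ExtendedData xmlns:mwm="https://maps.me">
      <mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
    </ExtendedData>
    <Placemark>
      <name>p</name>
      <ExtendedData xmlns:mwm="https://maps.me">
        <mwm:localId>0</mwm:localId>
      </ExtendedData>
    </Placemark>
  </Document>
</kml>"""

root = ET.fromstring(sample)
removed_count = remove_all(root, "ExtendedData")
left = [local_name(e.tag) for e in root.iter()]
print(removed_count, left)
```

Because removal always goes through the direct parent, both the outer and the nested ExtendedData elements are handled in one pass.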
Simply run the empty template with Identity Transform using XSLT 1.0 which Python's lxml can run. No for/while loops or if logic needed. To handle the default namespace, define a prefix like doc:
XSLT (save as an .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://earth.google.com/kml/2.2">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- REMOVE ALL OCCURRENCES OF NODE -->
<xsl:template match="doc:ExtendedData"/>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# TRANSFORM INPUT
transform = et.XSLT(xsl)
result = transform(xml)
# PRINT TO SCREEN
print(result)
# SAVE TO FILE
with open('Output.kml', 'wb') as f:
    f.write(result)

BeautifulSoup4 Removes Namespace Definitions from Schema in WSDL

I am working on a custom library that consumes WSDLs. One thing I need to be able to do is pull out the namespace definitions in the schemas so I can create a map of them. What I'm running into is that BeautifulSoup (using lxml) is removing the namespace definitions from the Schema elements.
Here is one of my actual Schemas:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
And here is what bs4's rendering of it looks like:
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
All of my xmlns attributes are gone. Obviously, this seems intentional, but I can't figure out how I can retrieve these attributes. They are not in .attrs and nothing I can find in documentation or online or using dir() has thus far yielded anything useful.
EDIT:
I reduced my WSDL to just the following:
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" xmlns:ns="http://servicecenter.peregrine.com/PWS" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" targetNamespace="http://servicecenter.peregrine.com/PWS" xsi:schemaLocation="http://schemas.xmlsoap.org/wsdl/ http://schemas.xmlsoap.org/wsdl/">
<types>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
</xs:schema>
</types>
</definitions>
And passed this to BeautifulSoup:
from bs4 import BeautifulSoup
wsdl = "..." #Replace this with wsdl from above. I didn't want to duplicate data
soup = BeautifulSoup(wsdl,'xml')
print(soup.prettify())
And now it is gone:
<?xml version="1.0" encoding="utf-8"?>
<definitions targetNamespace="http://servicecenter.peregrine.com/PWS" xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" xmlns:ns="http://servicecenter.peregrine.com/PWS" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.xmlsoap.org/wsdl/ http://schemas.xmlsoap.org/wsdl/">
<types>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
</xsd:schema>
</types>
</definitions>
I can see that it is apparently removing namespace declarations that are redundant (they are already defined under different names in the definitions tag) but it changes the name of these namespaces. Is there any way to prevent it from being so smart? ;) I realize in terms of functionality of the web service request, the name change does not matter, but I would like to stick as close to the actual content of the WSDL as possible.
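One possible way to read the declarations exactly as written, sketched here with the standard library rather than BeautifulSoup, is to listen for "start-ns" events in xml.etree.ElementTree.iterparse. Those events are reported just before the "start" event of the element that declares them, so they can be grouped per element. The function name and the trimmed WSDL sample are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def namespaces_by_element(xml_text):
    """Return (tag, {prefix: uri}) for each element that declares namespaces."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    pending, result = [], []
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            pending.append(payload)  # payload is a (prefix, uri) tuple
        elif pending:
            # payload is the element whose start tag carried the declarations
            result.append((payload.tag, dict(pending)))
            pending = []
    return result


# Trimmed version of the WSDL from the question.
sample = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common"/>
  </types>
</definitions>"""

result = namespaces_by_element(sample)
for tag, declared in result:
    print(tag, declared)
```

The default namespace appears under the empty-string prefix, and the original prefix names (xs, cmn) are preserved, so a map can be built per schema element.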

Save (print) xml node with its parents but without children

From the XML document, I want to save one node to a file - with all parent nodes, but without any child nodes. For example, for the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
<name>ref.kml</name>
<Style id="normalState">
<IconStyle><scale>1.0</scale><Icon><href>yt.png</href></Icon></IconStyle>
</Style>
</Document>
</kml>
expected output for <Document> node will be like this:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
</Document>
</kml>
So far I only found a solution with iterated removal of all child elements before saving. But as I need to work with original XML after, I have to make a copy of the whole document:
#!/usr/bin/env python
import lxml.etree as ET # have to use [lxml] because [xml] doesn't support 'xml_declaration'
import copy
kml_file = ET.parse("myfile.kml")
kml_copied = copy.deepcopy(kml_file) # .copy() is not enough, need .deepcopy()
root = kml_copied.getroot()
my_node = root[0]
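# Iterate over a snapshot: removing children while iterating skips elements.
for child in list(my_node):
    my_node.remove(child)
print(ET.tostring(kml_copied, xml_declaration=True, encoding='utf-8').decode('utf-8'))
Is there a better way to do this? At least one that avoids making a deepcopy of the whole document...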
Consider XSLT, the special-purpose declarative language designed to transform XML documents. Python's lxml module has a built-in XSLT 1.0 processor. Because an XSLT script is itself a well-formed XML document, it can also handle KML's default (unprefixed) namespace by binding it to a prefix such as doc:
XSLT Script (save as .xsl to be loaded in Python, also portable to other languages)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://earth.google.com/kml/2.1">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform to copy entire document -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Empty Template to Remove Nodes -->
<xsl:template match="doc:Style|doc:name"/>
</xsl:transform>
Python Script
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM INPUT INTO DOM OBJECT
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT DOM TO STRING
tree_out = ET.tostring(newdom,
                       encoding='UTF-8',
                       pretty_print=True,
                       xml_declaration=True)
print(tree_out.decode("utf-8"))
# SAVE RESULTING XML
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid"/>
</kml>
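To avoid the deepcopy entirely, another option is to build a small new tree that copies only the tag and attributes of the target node and each of its ancestors. The sketch below uses the standard library; the skeleton function name is hypothetical and the sample is a shortened version of the posted KML.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://earth.google.com/kml/2.1"
# Serialize the KML namespace as the default one (no ns0: prefix).
ET.register_namespace("", KML_NS)


def skeleton(root, target):
    """Copy `target` and its ancestors (tags and attributes only)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chain = [target]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    # chain runs from target up to root; rebuild it from the root downwards.
    new_root = current = None
    for original in reversed(chain):
        if current is None:
            new_root = current = ET.Element(original.tag, dict(original.attrib))
        else:
            current = ET.SubElement(current, original.tag, dict(original.attrib))
    return new_root


# Shortened version of the KML from the question.
sample = """<kml xmlns="http://earth.google.com/kml/2.1">
  <Document id="myid">
    <name>ref.kml</name>
    <Style id="normalState"><IconStyle><scale>1.0</scale></IconStyle></Style>
  </Document>
</kml>"""

root = ET.fromstring(sample)
document = root.find(f"{{{KML_NS}}}Document")
copy_root = skeleton(root, document)
print(ET.tostring(copy_root, encoding="unicode"))
```

The original tree is left untouched, and only as many elements are created as the target has ancestors. In Python 3.8 and later, ET.tostring in the standard library also accepts xml_declaration=True if the declaration is needed.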

xml parsing (Removing parent nodes)

Hi, I'm seriously stuck trying to filter my XML document. Here is an example of the contents:
<sentence id="1" document_id="Perseus:text:1999.02.0029" >
<primary>millermo</primary>
<word id="1" />
<word id="2" />
<word id="3" />
<word id="4" />
</sentence>
<sentence id="2" document_id="Perseus:text:1999.02.0029" >
<primary>millermo</primary>
<word id="1" />
<word id="2" />
<word id="3" />
<word id="4" />
<word id="5" />
<word id="6" />
<word id="7" />
<word id="8" />
</sentence>
There are many sentences (over 3000), but all I want to do is write some code (preferably in Java or Python) that goes through my XML file and removes every sentence that has more than 5 word elements,
so that I am left with only the sentence tags that have 5 or fewer words. Thanks. (Just to note, my XML knowledge isn't great; I get mixed up with nodes/tags/elements/ids.)
I'm trying this at the moment but I'm not sure about it:
import xml.etree.ElementTree as ET
tree = ET.parse('treebank.xml')
root = tree.getroot()
parent_map = dict((c, p) for p in tree.getiterator() for c in p)
iterator = list(root.getiterator('word id'))
for item in iterator:
    old = item.find('word id')
    text = old.text
    if 'id=16' in text:
        parent_map[item].remove(item)
        continue
tree.write('out.xml')
Consider an XSLT solution where no looping is required. XSLT is a declarative, special-purpose language designed to transform XML documents into other structures and formats. Here, the identity transform copies the entire document as is, and an empty template drops every <word> node whose position is greater than 5, so each sentence keeps at most five words. (To drop whole sentences that have more than five words instead, the empty template would match sentence[count(word) > 5].)
XSLT script (save as .xsl or .xslt file)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="word[position() > 5]"/>
</xsl:transform>
Python Script
import os, sys
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('C:/Path/To/Input.xml')
xslt = ET.parse('C:/Path/To/XSLTscript.xsl')
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# PRETTY PRINT OUTPUT
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
# SAVE TO FILE
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()
A further advantage of XSLT is portability: most general-purpose languages, including Java, have XSLT processors available:
Java
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;
public class Sentence {
    public static void main(String[] args) throws IOException, URISyntaxException, TransformerException {
        String currentDir = new File("").getAbsolutePath();
        String xml = "C:/Path/To/Input.xml";
        String xsl = "C:/Path/To/XSLTScript.xsl";
        // Transformation
        TransformerFactory factory = TransformerFactory.newInstance();
        Source xslt = new StreamSource(new File(xsl));
        Transformer transformer = factory.newTransformer(xslt);
        Source text = new StreamSource(new File(xml));
        transformer.transform(text, new StreamResult(new File("C:/Path/To/Output.xml")));
    }
}
OUTPUT (using posted content)
<?xml version='1.0' encoding='UTF-8'?>
<root>
<sentence id="1" document_id="Perseus:text:1999.02.0029">
<primary>millermo</primary>
<word id="1"/>
<word id="2"/>
<word id="3"/>
<word id="4"/>
</sentence>
<sentence id="2" document_id="Perseus:text:1999.02.0029">
<primary>millermo</primary>
<word id="1"/>
<word id="2"/>
<word id="3"/>
<word id="4"/>
<word id="5"/>
</sentence>
</root>
The thing is that word is a tag and id is its attribute; you can't pass them both to .find().
Also, the result of parsing is a tree, where attributes and text are represented differently than in an XML file.
I suppose you have a root element which has <sentence> elements as children.
Then you have to look at each <sentence> node, count its <word> elements, and remove the sentence if it has more than five.
# We cannot iterate over a tree and modify it at the same time.
# Remember the nodes to remove later.
elements_to_kill = []
for sentence_node in root.iter('sentence'):
    # Mark sentences with MORE than 5 words; those are the ones to drop.
    if len(sentence_node.findall('word')) > 5:
        elements_to_kill.append(sentence_node)
# Now it's safe to remove them
for node in elements_to_kill:
    root.remove(node)
# Serialize as file, etc
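The loop above assumes every <sentence> is a direct child of the root. A self-contained sketch that also works for nested sentences, using the parent-map idea from the question, might look like this; the function name, the <root> wrapper, and the shortened sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def drop_long_sentences(root, max_words=5):
    """Remove each <sentence> with more than `max_words` <word> children.

    Assumes <sentence> is never the root element itself.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    too_long = [s for s in root.iter("sentence")
                if len(s.findall("word")) > max_words]
    # Removal happens only after the traversal above has finished.
    for sentence in too_long:
        parent_of[sentence].remove(sentence)
    return len(too_long)


# Shortened sample based on the posted content, wrapped in a <root> element.
sample = """<root>
  <sentence id="1"><primary>millermo</primary>
    <word id="1"/><word id="2"/><word id="3"/><word id="4"/>
  </sentence>
  <sentence id="2"><primary>millermo</primary>
    <word id="1"/><word id="2"/><word id="3"/><word id="4"/>
    <word id="5"/><word id="6"/><word id="7"/><word id="8"/>
  </sentence>
</root>"""

root = ET.fromstring(sample)
dropped = drop_long_sentences(root)
kept_ids = [s.get("id") for s in root.iter("sentence")]
print(dropped, kept_ids)
```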
Hope this helps.
You seem to lack a grasp of how ElementTree works. Feel free to read the docs and experiment in a Python REPL to build that understanding.
