How to load a specific paragraph from an XML file in Python?


I have an XML file with the following structure:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<book>
<toc> <tocdiv pagenum="564">
<title>9thmemo</title>
<tocdiv pagenum="588">
<title>b</title>
</tocdiv>
</tocdiv></toc>
<chapter><title>9thmemo</title>
<para>...</para>
<para>...</para></chapter>
<chapter>...</chapter>
<chapter>...</chapter>
</book>
The <book>...</book> element contains several chapters, and each chapter has a title. I only want to read the full content of the chapter titled "9thmemo" (not the others).
I tried to read it with the following code:
from xml.dom import minidom
filename = "result.xml"
file = minidom.parse(filename)
chapters = file.getElementsByTagName('chapter')
for i in range(10):
    print(chapters[i])
This only prints the memory address of each chapter object.
If I try to access a sub-element, e.g. chapters[i].title, I get an error saying the attribute cannot be found.

I only want to read all content of this chapter,"9thmemo"(not others)
The problem with the question's code is that it never tries to locate the specific chapter. The code below uses an XPath predicate to select the chapter whose title is "9thmemo".
Try the following:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<book>
<toc>
<tocdiv pagenum="564">
<title>9thmemo</title>
<tocdiv pagenum="588">
<title>b</title>
</tocdiv>
</tocdiv>
</toc>
<chapter>
<title>9thmemo</title>
<para>A</para>
<para>B</para>
</chapter>
<chapter>...</chapter>
<chapter>...</chapter>
</book>'''
root = ET.fromstring(xml)
chapter = root.find('.//chapter[title="9thmemo"]')
para_data = ','.join(p.text for p in chapter.findall('para'))
print(para_data)
output
A,B
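
Since the question uses minidom, the same lookup can probably be done there as well. The following is only a minimal sketch: the helper name chapter_paragraphs and the shortened sample document are assumptions made for illustration, not part of the original post. minidom nodes do not expose children as attributes (which is why chapters[i].title fails), so the sketch calls getElementsByTagName on each chapter instead.

```python
from xml.dom import minidom

# Shortened sample document (an assumption; the real file has more content).
XML = """<book>
  <chapter><title>9thmemo</title><para>A</para><para>B</para></chapter>
  <chapter><title>other</title><para>C</para></chapter>
</book>"""


def chapter_paragraphs(xml_text, wanted_title):
    """Return the text of every <para> in the chapter with the given title."""
    doc = minidom.parseString(xml_text)
    for chapter in doc.getElementsByTagName("chapter"):
        titles = chapter.getElementsByTagName("title")
        # Skip chapters without a title or with an empty <title/>.
        if not titles or titles[0].firstChild is None:
            continue
        if titles[0].firstChild.data == wanted_title:
            return [
                para.firstChild.data
                for para in chapter.getElementsByTagName("para")
                if para.firstChild is not None
            ]
    return []


paragraphs = chapter_paragraphs(XML, "9thmemo")
print(paragraphs)
```

With a real file, minidom.parse(filename) would take the place of parseString.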

Related

Removing Elements from a KML (Python)

I generated a KML file using Python's SimpleKML library and the following script, the output of which is also shown below:
import simplekml
kml = simplekml.Kml()
ground = kml.newgroundoverlay(name='Aerial Extent')
ground.icon.href = 'C:\\Users\\mdl518\\Desktop\\aerial_image.png'
ground.latlonbox.north = 46.55537
ground.latlonbox.south = 46.53134
ground.latlonbox.east = 48.60005
ground.latlonbox.west = 48.57678
ground.latlonbox.rotation = 0.090320
kml.save(".//aerial_extent.kml")
The output KML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Document id="1">
<GroundOverlay id="2">
<name>Aerial Extent</name>
<Icon id="3">
<href>C:\\Users\\mdl518\\Desktop\\aerial_image.png</href>
</Icon>
<LatLonBox>
<north>46.55537</north>
<south>46.53134</south>
<east>48.60005</east>
<west>48.57678</west>
<rotation>0.090320</rotation>
</LatLonBox>
</GroundOverlay>
</Document>
</kml>
However, I am trying to remove the "Document" tag from this KML since it is a default element generated with SimpleKML, while keeping the child elements (e.g. GroundOverlay). Additionally, is there a way to remove the "id" attributes associated with specific elements (i.e. for the GroundOverlay, Icon elements)? I am exploring the usage of ElementTree/lxml to enable this, but these seem to be more specific to XML files as opposed to KMLs. Here's what I'm trying to use to modify the KML, but it is unable to remove the Document element:
from lxml import etree
tree = etree.fromstring(open("C:\\Users\\mdl518\\Desktop\\aerial_extent.kml").read())
for item in tree.xpath("//Document[@id='1']"):
    item.getparent().remove(item)
print(etree.tostring(tree, pretty_print=True))
Here is the final desired output XML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<GroundOverlay>
<name>Aerial Extent</name>
<Icon>
<href>C:\\Users\\mdl518\\Desktop\\aerial_image.png</href>
</Icon>
<LatLonBox>
<north>46.55537</north>
<south>46.53134</south>
<east>48.60005</east>
<west>48.57678</west>
<rotation>0.090320</rotation>
</LatLonBox>
</GroundOverlay>
</kml>
Any insights are most appreciated!

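You are getting tripped up by the dreaded namespaces: the KML file declares a default namespace, so an unprefixed name such as Document in an XPath expression matches nothing.
Try using something like this: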
ns = {'kml': 'http://www.opengis.net/kml/2.2'}
for item in tree.xpath("//kml:Document[@id='1']", namespaces=ns):
    item.getparent().remove(item)
Edit:
To remove just the parent and retain all its descendants, try the following:
retain = tree.xpath("//kml:Document[@id='1']/kml:GroundOverlay", namespaces=ns)[0]
for item in tree.xpath("//kml:Document[@id='1']", namespaces=ns):
    anchor = item.getparent()
    anchor.remove(item)
    anchor.insert(1, retain)
print(etree.tostring(tree, pretty_print=True).decode())
This should get you the desired output.
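
The question also asks about dropping the id attributes, which the snippets above leave in place. Below is a rough sketch of one way to do both steps. It deliberately uses the standard library's xml.etree.ElementTree rather than lxml, and the inline KML string is a shortened stand-in for the real file (both are assumptions made for illustration).

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ns = {"kml": KML_NS}
# Serialise with a default namespace instead of generated ns0: prefixes.
ET.register_namespace("", KML_NS)

# Shortened stand-in for the generated file (an assumption).
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document id="1">
    <GroundOverlay id="2">
      <name>Aerial Extent</name>
      <Icon id="3"><href>aerial_image.png</href></Icon>
    </GroundOverlay>
  </Document>
</kml>"""

root = ET.fromstring(kml_text)

# Unwrap each <Document>: move its children up to the parent, in order.
for document in root.findall("kml:Document", ns):
    position = list(root).index(document)
    root.remove(document)
    for offset, child in enumerate(list(document)):
        root.insert(position + offset, child)

# Drop every id attribute that remains in the tree.
for element in root.iter():
    element.attrib.pop("id", None)

print(ET.tostring(root, encoding="unicode"))
```

ElementTree has no getparent(), so this sketch only unwraps Document elements that are direct children of the root, which matches the file shown in the question.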

parsing xml with Python minidom

I'm trying to parse elements from an .xml file with Python's minidom library, but it doesn't seem to work: it raises "IndexError: list index out of range". Perhaps I'm using the wrong method or library for the job. Please suggest how to do this. Thanks
from xml.dom import minidom
doc = minidom.parse('/path/to/file/runParameters.xml')
docs = doc.getElementsByTagName('RunParameters')
for el in docs:
    cloud = el.getElementsByTagName("EnableCloud")
    print(cloud[0].firstChild.nodeValue)
Here is what the structure of the file looks like
<?xml version="1.0"?>
<RunParameters xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EnableCloud>false</EnableCloud>
<RunParametersVersion>MiSeq</RunParametersVersion>
<CopyManifests>true</CopyManifests>
<FlowcellRFIDTag>
<SerialNumber>000000000-AG01C</SerialNumber>
<PartNumber>17772</PartNumber>
<ExpirationDate>2016-04-10T00:00:00</ExpirationDate>
</FlowcellRFIDTag>
</RunParameters>

Using ElementTree
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0"?>
<RunParameters xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EnableCloud>false</EnableCloud>
<RunParametersVersion>MiSeq</RunParametersVersion>
<CopyManifests>true</CopyManifests>
<FlowcellRFIDTag>
<SerialNumber>000000000-AG01C</SerialNumber>
<PartNumber>17772</PartNumber>
<ExpirationDate>2016-04-10T00:00:00</ExpirationDate>
</FlowcellRFIDTag>
</RunParameters>'''
root = ET.fromstring(xml)
print(root.find('.//EnableCloud').text)
output
false

This code works for me. Please try it on your system:
xx = '''
<?xml version="1.0"?>
<RunParameters xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EnableCloud>false</EnableCloud>
<RunParametersVersion>MiSeq</RunParametersVersion>
<CopyManifests>true</CopyManifests>
<FlowcellRFIDTag>
<SerialNumber>000000000-AG01C</SerialNumber>
<PartNumber>17772</PartNumber>
<ExpirationDate>2016-04-10T00:00:00</ExpirationDate>
</FlowcellRFIDTag>
</RunParameters>
'''.strip()
with open('test2.xml','w') as f:
    f.write(xx)
from xml.dom import minidom
doc = minidom.parse('test2.xml')
docs = doc.getElementsByTagName('RunParameters')
for el in docs:
    cloud = el.getElementsByTagName("EnableCloud")
    print(cloud[0].firstChild.nodeValue)
Output
false
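
Since the posted code works on the posted file, the IndexError most likely means that getElementsByTagName returned an empty list for the file actually being parsed, for example because the tag is missing or spelled differently there. A small defensive sketch follows; the helper name first_text and the shortened sample are assumptions, not something from the original thread.

```python
from xml.dom import minidom

# Shortened sample (an assumption; the real file has more elements).
SAMPLE = """<?xml version="1.0"?>
<RunParameters>
  <EnableCloud>false</EnableCloud>
  <RunParametersVersion>MiSeq</RunParametersVersion>
</RunParameters>"""


def first_text(parent, tag, default=None):
    """Return the text of the first <tag> under parent, or default if absent."""
    nodes = parent.getElementsByTagName(tag)
    # An empty result here is what makes nodes[0] raise IndexError.
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.nodeValue


doc = minidom.parseString(SAMPLE)
print(first_text(doc, "EnableCloud"))
print(first_text(doc, "NoSuchTag", "n/a"))
```

Returning a default instead of indexing blindly makes it easier to see which tag is missing from the real file.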

How to get the content of child->child->child->child in XML file using Python

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FIToFIPmtCxlReq>
<Assgnmt>
<Id>TEST-ISO-81</Id>
<Assgnr>
<Agt>
<FinInstnId>
<BIC>CCCCGB2L</BIC>
</FinInstnId>
</Agt>
</Assgnr>
<Assgne>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Assgne>
<CreDtTm>2009-03-24T11:22:59</CreDtTm>
</Assgnmt>
<Undrlyg>
<TxInf>
<CxlId>103012345</CxlId>
<Case>
<Id>ISO_TEST_CASE</Id>
<Cretr>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Cretr>
</Case>
</TxInf>
</Undrlyg>
</FIToFIPmtCxlReq>
</Document>
Here I want to get the content of "TxInf": all of its children, their children, and the data they contain.
What I have tried is:
import xml.etree.ElementTree as ET
from xml.etree import ElementTree
tree = ET.parse('R3-CAMT.056.001.07-ISO-V.XML')
root = tree.getroot()
for element in root.iter():
    if element.tag == "{urn:iso:std:iso:20022:tech:xsd:camt.056.001.01}TxInf":
        tree._setroot(element.tag)
print(root.tag)
print(root.attrib)
Please suggest if I can change the root with _setroot or any other possible method

Try something along these lines on your code to see if it works:
for r in root.findall(".//*"):
    if 'TxInf' in r.tag:
        print(ET.tostring(r))
By the way, it may be easier to do it with lxml, if you can use it.
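
There is probably no need to change the root at all: the TxInf element can be looked up directly and then walked with iter(). The following is a sketch with the standard library only. The prefix "c", the helper local_name and the shortened sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing ("c" is arbitrary) to the default namespace.
NS = {"c": "urn:iso:std:iso:20022:tech:xsd:camt.056.001.01"}

# Shortened sample (an assumption; the real message has more elements).
xml_text = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01">
  <FIToFIPmtCxlReq>
    <Undrlyg>
      <TxInf>
        <CxlId>103012345</CxlId>
        <Case><Id>ISO_TEST_CASE</Id></Case>
      </TxInf>
    </Undrlyg>
  </FIToFIPmtCxlReq>
</Document>"""


def local_name(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(xml_text)
tx_inf = root.find(".//c:TxInf", NS)

# Collect (tag, text) for every descendant that carries real text.
pairs = [
    (local_name(el.tag), el.text.strip())
    for el in tx_inf.iter()
    if el.text and el.text.strip()
]
print(pairs)
```

The element returned by find() can be used as if it were a root: find, findall and iter all work relative to it.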

XML file generating unwanted data

I am reading data from one XML file and writing part of it to another XML file. Everything runs without errors, but the generated output file contains a few unwanted tags.
Here is what I have tried:
from xml.etree import ElementTree as ET
from xml.dom.minidom import getDOMImplementation
from xml.dom.minidom import parseString
tree = ET.parse('C:\\Users\\ca33.xml')
root = tree.getroot()
impl = getDOMImplementation()
#print(root)
header = [root.find('header')]
for h in header:
    h1=(parseString(ET.tostring(h)).toprettyxml(''))
#print(h1)
commands = root.findall(".//records//")
recs=[c for c in commands if c.find('soc_id')!=None and c.find('soc_id').text[:9]=='000001051']
bb=""
for rec in recs:
    aa=(parseString(ET.tostring(rec)).toprettyxml(''))
    bb=bb+aa
#print(bb)
newdoc = impl.createDocument(None, "file"+h1+bb, None)
newdoc.writexml(open('data.xml', 'w'),'\n'.join([line for line in newdoc.toprettyxml(indent=' '*2).split('\n') if line.strip()]))
The resulting data.xml file looks like this:
<?xml version="1.0" ?><?xml version="1.0" ?>
<file<?xml version="1.0" ?>
<header>
<number_of_records>41</number_of_records>
</header>
<?xml version="1.0" ?>
<record>
<soc_id>00000105139E3B82</soc_id>
</record>
<?xml version="1.0" ?>
<soc_id>00000105139E3640</soc_id>
</record>
<?xml version="1.0" ?>
<header>
<number_of_records>41</number_of_records>
As you can see, the <?xml version="1.0" ?> declaration is generated many times throughout the file, and at the end the data is written again from the beginning, after a gap of two blank lines.

As I understand it, you are reading an XML file and then trying to write part of the same data to a different file, and you are running into problems in the process. Try the following:
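from xml.etree import ElementTree as ET
tree = ET.parse('C:\\Users\\ca33.xml')
root = tree.getroot()
# Serialise each <header>; ET.tostring() returns ASCII bytes without an XML declaration
str_header = ""
for header_ex in root.findall('header'):
    str_header += ET.tostring(header_ex).decode('ascii')
# Loop twice: over each <records> container, then over the record tags inside it
str_rec = ""
for record_ex in root.findall('records'):
    for c in record_ex:
        soc_id = c.find('soc_id')
        if soc_id is not None and soc_id.text and soc_id.text[:9] == '000001051':
            str_rec += ET.tostring(c).decode('ascii')
# Write a single XML declaration followed by one <file> root element
with open("output.xml", "w") as f:
    f.write("<?xml version='1.0' encoding='ASCII' standalone='yes'?>\n")
    f.write("<file>" + str_header + str_rec + "</file>")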
Since you have not posted any sample input, I assume its structure matches what the question shows. I also assume that record is a tag with one or more child tags inside it, which is why I loop over it twice.
Also, remove the unnecessary imports from your code.
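
The repeated declarations in the question's output appear because every parseString(...).toprettyxml() call produces a complete document, declaration included. An alternative that avoids string concatenation altogether is to build a new tree and serialise it once. This is a sketch only: the layout of the input (header and records directly under file) is an assumption, since the original input was not posted.

```python
import xml.etree.ElementTree as ET

# Assumed input layout; the original input file was not posted.
source = """<file>
  <header><number_of_records>41</number_of_records></header>
  <records>
    <record><soc_id>00000105139E3B82</soc_id></record>
    <record><soc_id>00000999999E0000</soc_id></record>
    <record><soc_id>00000105139E3640</soc_id></record>
  </records>
</file>"""

root = ET.fromstring(source)

# Build the output tree from the existing elements.
new_root = ET.Element("file")
header = root.find("header")
if header is not None:
    new_root.append(header)
for record in root.iter("record"):
    soc_id = record.findtext("soc_id", default="")
    if soc_id.startswith("000001051"):
        new_root.append(record)

# Serialise once, so exactly one XML declaration is emitted (Python 3.8+).
result = ET.tostring(new_root, encoding="utf-8", xml_declaration=True).decode("utf-8")
print(result)
```

To write a file instead of a string, ET.ElementTree(new_root).write("output.xml", encoding="utf-8", xml_declaration=True) should behave the same way.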

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
import pdb
f = open('/vagrant/out2.xml', 'r')
xml_str = f.read()
xml_doc = etree.fromstring(xml_str)
f.close()
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty array.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.

You need to take the default namespace into account so instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5'
    }
)
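
The same prefix-mapping idea also works with the standard library's xml.etree.ElementTree, in case lxml is not available. The sketch below uses ElementTree, not lxml, and a shortened copy of the document (both are assumptions made for illustration). ElementTree paths are relative to the element they are called on and do not support text(), so findtext is used instead.

```python
import xml.etree.ElementTree as ET

# Prefixes are our own choice; only the namespace URIs must match the file.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}

# Shortened copy of the document from the question (an assumption).
xml_text = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <codeBook version="2.5" xmlns="ddi:codebook:2_5">
    <docDscr><citation><titlStmt><titl>Test Title</titl></titlStmt></citation></docDscr>
    <stdyDscr><citation><titlStmt><titl>Test Title 2</titl></titlStmt></citation></stdyDscr>
  </codeBook>
</metadata>"""

root = ET.fromstring(xml_text)  # root is the <metadata> element

# The path starts below the root, so it begins at codeBook.
doc_title = root.findtext(
    "ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl",
    namespaces=NS,
)

# Every <titl> anywhere in the document, in document order.
all_titles = [titl.text for titl in root.iterfind(".//ddi:titl", NS)]

print(doc_title)
print(all_titles)
```

Note that codeBook switches the default namespace, which is why two prefixes are needed in the lxml call and why only the ddi prefix appears in these relative paths.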
