Parsing XML with Python minidom

I'm trying to parse elements from an .xml file with Python's minidom library, but it doesn't work: it raises "IndexError: list index out of range". Perhaps I'm using the wrong method or library for the job. Please suggest how to do this. Thanks.
from xml.dom import minidom
doc = minidom.parse('/path/to/file/runParameters.xml')
docs = doc.getElementsByTagName('RunParameters')
for el in docs:
    cloud = el.getElementsByTagName("EnableCloud")
    print(cloud[0].firstChild.nodeValue)
Here is what the structure of the file looks like
<?xml version="1.0"?>
<RunParameters xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EnableCloud>false</EnableCloud>
<RunParametersVersion>MiSeq</RunParametersVersion>
<CopyManifests>true</CopyManifests>
<FlowcellRFIDTag>
<SerialNumber>000000000-AG01C</SerialNumber>
<PartNumber>17772</PartNumber>
<ExpirationDate>2016-04-10T00:00:00</ExpirationDate>
</FlowcellRFIDTag>
</RunParameters>

Using ElementTree
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0"?>
<RunParameters xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EnableCloud>false</EnableCloud>
<RunParametersVersion>MiSeq</RunParametersVersion>
<CopyManifests>true</CopyManifests>
<FlowcellRFIDTag>
<SerialNumber>000000000-AG01C</SerialNumber>
<PartNumber>17772</PartNumber>
<ExpirationDate>2016-04-10T00:00:00</ExpirationDate>
</FlowcellRFIDTag>
</RunParameters>'''
root = ET.fromstring(xml)
print(root.find('.//EnableCloud').text)
output
false

This code works for me. Please try it on your system:
xx = '''
<?xml version="1.0"?>
<RunParameters xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EnableCloud>false</EnableCloud>
<RunParametersVersion>MiSeq</RunParametersVersion>
<CopyManifests>true</CopyManifests>
<FlowcellRFIDTag>
<SerialNumber>000000000-AG01C</SerialNumber>
<PartNumber>17772</PartNumber>
<ExpirationDate>2016-04-10T00:00:00</ExpirationDate>
</FlowcellRFIDTag>
</RunParameters>
'''.strip()
with open('test2.xml', 'w') as f:
    f.write(xx)
from xml.dom import minidom
doc = minidom.parse('test2.xml')
docs = doc.getElementsByTagName('RunParameters')
for el in docs:
    cloud = el.getElementsByTagName("EnableCloud")
    print(cloud[0].firstChild.nodeValue)
Output
false
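Since the question's code runs fine on the posted XML, one plausible cause of the IndexError is that the file actually being parsed has no EnableCloud element (a different file, or a tag spelled differently). A minimal sketch of a defensive lookup, assuming a shortened inline copy of the XML; first_text is a hypothetical helper, not part of minidom:

```python
from xml.dom import minidom

# Shortened inline copy of the question's XML (assumption: same tag names).
xml_text = """<?xml version="1.0"?>
<RunParameters>
  <EnableCloud>false</EnableCloud>
  <RunParametersVersion>MiSeq</RunParametersVersion>
</RunParameters>"""


def first_text(node, tag):
    """Return the text of the first <tag> descendant, or None if it is absent or empty."""
    matches = node.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return None
    return matches[0].firstChild.nodeValue


doc = minidom.parseString(xml_text)
enable_cloud = first_text(doc, "EnableCloud")  # tag exists
missing = first_text(doc, "NoSuchTag")         # tag does not exist, no IndexError
print(enable_cloud)
print(missing)
```

If the helper returns None for a tag you expect, printing doc.documentElement.tagName is a quick way to check which file was really loaded.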

Related

How to load an XML file with a specific paragraph in Python?

I have an XML file whose structure looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<book>
<toc> <tocdiv pagenum="564">
<title>9thmemo</title>
<tocdiv pagenum="588">
<title>b</title>
</tocdiv>
</tocdiv></toc>
<chapter><title>9thmemo</title>
<para>...</para>
<para>...</para></chapter>
<chapter>...</chapter>
<chapter>...</chapter>
</book>
There are several chapters in <book>...</book>, and each chapter has a title. I only want to read the content of the chapter titled "9thmemo" (not the others).
I tried to read it with the following code:
from xml.dom import minidom
filename = "result.xml"
file = minidom.parse(filename)
chapters = file.getElementsByTagName('chapter')
for i in range(10):
    print(chapters[i])
I only get the address of each chapter (the default representation of the element object).
If I try to access a sub-element like chapters[i].title, I get an error saying the attribute cannot be found.
"I only want to read the content of the chapter titled "9thmemo" (not the others)"
The problem with your code is that it never tries to locate the specific chapter; it only prints the element objects. The code below uses an XPath expression to find the chapter whose title is "9thmemo".
Try the following:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<book>
<toc>
<tocdiv pagenum="564">
<title>9thmemo</title>
<tocdiv pagenum="588">
<title>b</title>
</tocdiv>
</tocdiv>
</toc>
<chapter>
<title>9thmemo</title>
<para>A</para>
<para>B</para>
</chapter>
<chapter>...</chapter>
<chapter>...</chapter>
</book>'''
root = ET.fromstring(xml)
chapter = root.find('.//chapter[title="9thmemo"]')
para_data = ','.join(p.text for p in chapter.findall('para'))
print(para_data)
output
A,B
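The same lookup can also be done with minidom, which is what the question started with. A minimal sketch, assuming a shortened inline copy of the document; node_text is a hypothetical helper. minidom nodes do not expose children as attributes (hence the failure of chapters[i].title), so the title has to be fetched with getElementsByTagName:

```python
from xml.dom import minidom

# Shortened inline copy of the question's XML (assumption: same tag names).
xml_text = """<book>
  <toc><tocdiv pagenum="564"><title>9thmemo</title></tocdiv></toc>
  <chapter><title>9thmemo</title><para>A</para><para>B</para></chapter>
  <chapter><title>other</title><para>C</para></chapter>
</book>"""


def node_text(node):
    """Concatenate the direct text children of a minidom node."""
    return "".join(
        child.data for child in node.childNodes if child.nodeType == child.TEXT_NODE
    )


doc = minidom.parseString(xml_text)
paras = []
for chapter in doc.getElementsByTagName("chapter"):
    titles = chapter.getElementsByTagName("title")
    # Keep only the chapter whose <title> text matches.
    if titles and node_text(titles[0]).strip() == "9thmemo":
        paras = [node_text(p) for p in chapter.getElementsByTagName("para")]
        break
print(paras)
```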

How to get the content of child->child->child->child in XML file using Python

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FIToFIPmtCxlReq>
<Assgnmt>
<Id>TEST-ISO-81</Id>
<Assgnr>
<Agt>
<FinInstnId>
<BIC>CCCCGB2L</BIC>
</FinInstnId>
</Agt>
</Assgnr>
<Assgne>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Assgne>
<CreDtTm>2009-03-24T11:22:59</CreDtTm>
</Assgnmt>
<TxInf>
<CxlId>103012345</CxlId>
<Case>
<Id>ISO_TEST_CASE</Id>
<Cretr>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Cretr>
</Case>
</TxInf>
</Undrlyg>
</FIToFIPmtCxlReq>
</Document>
Here I want to get the content of "TxInf": all of its children, their children, and the data they contain.
What I have tried is:
import xml.etree.ElementTree as ET
from xml.etree import ElementTree
tree = ET.parse('R3-CAMT.056.001.07-ISO-V.XML')
root = tree.getroot()
for element in root.iter():
    if element.tag == "{urn:iso:std:iso:20022:tech:xsd:camt.056.001.01}TxInf":
        tree._setroot(element.tag)
print(root.tag)
print(root.attrib)
Please suggest whether I can change the root with _setroot, or whether there is another way to do this.
Try something along these lines in your code to see if it works:
for r in root.findall(".//*"):
    if 'TxInf' in r.tag:
        print(ET.tostring(r))
By the way, it may be easier to do it with lxml, if you can use it.
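Rather than replacing the tree's root, it may be enough to find the TxInf element with a namespace map and walk its descendants. A minimal sketch with the standard library, assuming a trimmed inline copy of the document; the "camt" prefix and the local_name helper are arbitrary names chosen for this example:

```python
import xml.etree.ElementTree as ET

# The prefix "camt" is an arbitrary label; only the URI must match the document.
NS = {"camt": "urn:iso:std:iso:20022:tech:xsd:camt.056.001.01"}

# Trimmed inline copy of the question's XML (assumption: same namespace and tags).
xml_text = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01">
  <FIToFIPmtCxlReq>
    <TxInf>
      <CxlId>103012345</CxlId>
      <Case>
        <Id>ISO_TEST_CASE</Id>
      </Case>
    </TxInf>
  </FIToFIPmtCxlReq>
</Document>"""


def local_name(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(xml_text)
tx_inf = root.find(".//camt:TxInf", NS)

# Collect (tag, text) for every leaf element below TxInf, at any depth.
leaves = [(local_name(el.tag), el.text) for el in tx_inf.iter() if len(el) == 0]
print(leaves)
```

With a file on disk, ET.parse(path).getroot() would replace the ET.fromstring call; the rest stays the same.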

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
import pdb
f = open('/vagrant/out2.xml', 'r')
xml_str = f.read()
xml_doc = etree.fromstring(xml_str)
f.close()
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty list.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.
You need to take the default namespace into account so instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
'/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
namespaces={
'oai': 'http://www.openarchives.org/OAI/2.0/',
'ddi': 'ddi:codebook:2_5'
}
)
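The same prefix-to-URI mapping idea also works without lxml. A minimal sketch using the standard library's xml.etree.ElementTree instead of lxml (a swapped-in library, not the one from the question), on a trimmed inline copy of the document. Note that ElementTree paths are relative to the element they are called on, so the path starts below the root metadata element:

```python
import xml.etree.ElementTree as ET

# Prefixes are arbitrary labels; the URIs must match the document exactly.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}

# Trimmed inline copy of the question's XML (assumption: same namespaces and tags).
xml_text = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <codeBook version="2.5" xmlns="ddi:codebook:2_5">
    <docDscr>
      <citation>
        <titlStmt>
          <titl>Test Title</titl>
        </titlStmt>
      </citation>
    </docDscr>
  </codeBook>
</metadata>"""

root = ET.fromstring(xml_text)  # root is the <metadata> element

# Every step of the path carries a prefix, because every element is namespaced.
title = root.findtext(
    "ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl",
    namespaces=NS,
)
# Without prefixes nothing matches, which mirrors the empty result in the question.
unqualified = root.findtext("codeBook/docDscr/citation/titlStmt/titl")
print(title)
print(unqualified)
```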

How to get node's value of an XML in Python?

Suppose I have an XML file:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<quarkSettings>
<UpdatePath></UpdatePath>
<Version>Development</Version>
<Project>ABC</Project>
</quarkSettings>
</configuration>
Now I want to get the value of Project. I have written the following code:
import xml.etree.ElementTree as ET
doc1 = ET.parse("Configuration.xml")
for e in doc1.find("Project"):
    project = e.text
but it doesn't give the value.
I found the answer:
import xml.etree.ElementTree as ET
doc1 = ET.parse(get_path_for_config_Quark_Release)
root = doc1.getroot()
for element in root.findall("quarkSettings"):
    project = element.find("Project").text
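The original attempt fails because find("Project") only looks at direct children of the root, and Project sits one level deeper, so find returns None. A minimal sketch of two shorter lookups, assuming the XML is inlined as a string rather than read from Configuration.xml:

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's XML (the file path is replaced by a string here).
xml_text = """<configuration>
  <quarkSettings>
    <UpdatePath></UpdatePath>
    <Version>Development</Version>
    <Project>ABC</Project>
  </quarkSettings>
</configuration>"""

root = ET.fromstring(xml_text)

direct = root.find("Project")                     # None: not a direct child of the root
project = root.findtext("quarkSettings/Project")  # explicit path from the root
anywhere = root.findtext(".//Project")            # first match at any depth
print(direct, project, anywhere)
```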

XML header getting removed after processing with elementtree

I have an XML file and I used ElementTree to add a new tag to it. My XML file before processing is as follows:
<?xml version="1.0" encoding="utf-8"?>
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data >
<data ID="http://someurldata2">data2</data >
<data ID="http://someurldata3">data3</data >
</PackageInfo>
I used the following Python code to add a new data tag and write it to my XML file:
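import xml.etree.ElementTree as ET
tree = ET.parse(xmlFile)  # read the file; ET.ElementTree(xmlFile) would not parse it
root = tree.getroot()
elem = ET.Element('data')
elem.attrib['ID'] = "http://someurldata4"
elem.text = 'data4'
root.append(elem)  # add the new <data> element as a child of the root
tree.write(xmlFile)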
But in the resulting XML file the <?xml version="1.0" encoding="utf-8"?> line is missing, and the file looks like this:
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data >
<data ID="http://someurldata2">data2</data >
<data ID="http://someurldata3">data3</data >
</PackageInfo>
Is there any way to include the XML header other than hardcoding the line?
It looks like you need optional arguments to the write method to output the declaration.
http://docs.python.org/library/xml.etree.elementtree.html#elementtree-elementtree-objects
tree.write(xmlFile, xml_declaration=True)
I'm afraid I'm not that familiar with xml.etree.ElementTree and its variations between Python releases.
Here's it working with lxml.etree:
>>> from lxml import etree
>>> sample = """<?xml version="1.0" encoding="utf-8"?>
... <PackageInfo xmlns="http://someurlpackage">
... <data ID="http://someurldata1">data1</data >
... <data ID="http://someurldata2">data2</data >
... <data ID="http://someurldata3">data3</data >
... </PackageInfo>"""
>>>
>>> doc = etree.XML(sample)
>>> data = doc.makeelement("data")
>>> data.attrib['ID'] = 'http://someurldata4'
>>> data.text = 'data4'
>>> doc.append(data)
>>> etree.tostring(doc,xml_declaration=True)
'<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<PackageInfo xmlns="http://someurlpackage">\n<data ID="http://someurldata1">data1</data>\n<data ID="http://someurldata2">data2</data>\n<data ID="http://someurldata3">data3</data>\n<data ID="http://someurldata4">data4</data></PackageInfo>'
>>> etree.tostring(doc,xml_declaration=True,encoding='utf-8')
'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<PackageInfo xmlns="http://someurlpackage">\n<data ID="http://someurldata1">data1</data>\n<data ID="http://someurldata2">data2</data>\n<data ID="http://someurldata3">data3</data>\n<data ID="http://someurldata4">data4</data></PackageInfo>'
Try this:
tree.write(xmlFile, encoding="utf-8")
If you are using Python <= 2.6, there is no xml_declaration parameter in ElementTree.write(). The relevant signatures are:
def write(self, file, encoding="us-ascii"):
def _write(self, file,node, encoding, namespaces):
You can use lxml.etree instead (install lxml first).
A sample:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))
By the way, I found that writing the XML declaration is not strictly necessary.
From "Is the XML declaration node mandatory?":
There is no XML declaration necessary for a document to be
successfully readable, since there are defaults for both version and
encoding (1.0 and UTF-8, respectively).
At least, it works even if AndroidManifest.xml does not have an XML declaration.
I have tried it :-)
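For current Python 3, a minimal sketch with the standard library, assuming a shortened inline copy of the file and an in-memory buffer in place of xmlFile. As far as I can tell, passing encoding="utf-8" alone does not emit the declaration on Python 3, so xml_declaration=True is passed explicitly. The register_namespace call is an addition not in the original answers; it keeps the default namespace from being rewritten with an ns0: prefix, and it changes a process-wide setting:

```python
import io
import xml.etree.ElementTree as ET

NS_URI = "http://someurlpackage"
# Serialize this namespace as the default one (xmlns="...") instead of ns0:.
ET.register_namespace("", NS_URI)

# Shortened inline copy of the question's XML.
sample = """<?xml version="1.0" encoding="utf-8"?>
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data>
</PackageInfo>"""

root = ET.fromstring(sample)

# The new element must live in the same namespace as its siblings.
new = ET.SubElement(root, "{%s}data" % NS_URI, {"ID": "http://someurldata4"})
new.text = "data4"

# Write to an in-memory buffer; a file path works the same way.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
output = buf.getvalue().decode("utf-8")
print(output)
```

ElementTree writes the declaration with single quotes; that is equivalent XML, but a byte-for-byte match with the original header should not be expected.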
