How to write an XML file using libxml2 in Python?

I am trying to write an XML file in Python 3 using libxml2, but I cannot find any Python-specific documentation on writing files with libxml2. When I try to write out a document that I parsed with libxml2, I get this error:
xmlDoc has no attribute write
Has anyone here done this before? I can get it to work with etree just fine, but etree does not preserve the attribute order that I need.

You can use saveFile() or saveFileEnc(). Example:
import libxml2
XML = """
<root a="1" b="2">XYZ</root>
"""
doc = libxml2.parseDoc(XML)
doc.saveFile("test.xml")
doc.saveFileEnc("test2.xml", "UTF-8")
I could not find any good documentation for the Python API. Here is the corresponding C documentation: http://xmlsoft.org/html/libxml-tree.html#xmlSaveFile.
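If attribute order is the only reason for reaching for libxml2, it may be worth re-testing the standard library first. As far as I know, xml.etree.ElementTree stopped sorting attributes on output in Python 3.8 and now writes them in the order in which they were parsed or set. A minimal sketch, assuming Python 3.8 or later (the sample document and the output path are made up for illustration):

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical input; attribute "b" deliberately comes before "a".
XML = '<root b="2" a="1">XYZ</root>'

root = ET.fromstring(XML)

# Serialize back to text; on Python 3.8+ the attributes keep their order.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Writing to a file goes through ElementTree.write().
out_path = os.path.join(tempfile.gettempdir(), "test.xml")
ET.ElementTree(root).write(out_path, encoding="UTF-8", xml_declaration=True)
```

On older interpreters the attributes are sorted alphabetically on output, so the libxml2 approach above remains the safer choice there.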

import libxml2
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okibgo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
doc = libxml2.parseDoc(DOC)
root = doc.children
print(root)

Related

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
# Read the file as bytes so the parser can honour the encoding declaration.
with open('/vagrant/out2.xml', 'rb') as f:
    xml_doc = etree.fromstring(f.read())
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty list.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.
You need to take the default namespaces into account. The metadata element is in the OAI namespace, and codeBook and everything below it are in ddi:codebook:2_5. So instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5',
    }
)
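A minimal sketch of that fix, run against a trimmed-down copy of the document from the question (the shortened XML and the NS variable are illustrative, not from the original post). The prefixes used in the XPath expression can be chosen freely; only the URIs have to match the document.

```python
from lxml import etree

# Trimmed-down version of the question's document (bytes, because it
# carries an encoding declaration).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <codeBook version="2.5" xmlns="ddi:codebook:2_5">
    <docDscr><citation><titlStmt><titl>Test Title</titl></titlStmt></citation></docDscr>
  </codeBook>
</metadata>"""

# Prefix-to-URI mapping used only by the XPath expressions below.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}

root = etree.fromstring(XML)

# Unprefixed names only match elements in no namespace, so this finds nothing.
unprefixed = root.xpath("/metadata/codeBook/docDscr/citation/titlStmt/titl/text()")

# Prefixed names are resolved through the namespaces mapping.
titles = root.xpath(
    "/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()",
    namespaces=NS,
)
print(unprefixed, titles)
```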

Extracting nested namespace from a xml using lxml

I'm new to Python and currently learning to parse XML. All seemed to be going well until I hit a wall with nested namespaces.
Below is a snippet of my XML, showing the root element and the child element that I'm trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
-------------
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>
What I need is a way to extract the namespace URI of the element whose local name is 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#". I have looked through a lot of tutorials but cannot seem to find a solution.
If anyone out there can lend their expertise, it would be much appreciated.
Here is what I have so far, with help from two contributors:
#!/usr/bin/env python
from xml.etree import ElementTree as ET  # import ElementTree module as an alias ET
from lxml import objectify, etree
def parse():
    import os
    import sys
    cpl_file = sys.argv[1]
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file, cpl_file)
    with open(xml_file) as f:
        xml = f.read()
    tree = etree.XML(xml)
    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace
    print(caption_namespace)
    print(tree.nsmap)
    nsmap = {}
    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)
    return nsmap
if __name__ == "__main__":
    parse()
But it's not working so far. I get 'None' when I use QName to look up the tag and its namespace. And when I try to collect all namespaces in the XML with a for loop, as suggested in another post, I get the error 'Unknown return type: dict'.
Any suggestions, please?
This program prints the namespace of the indicated tag:
from lxml import etree
# Bytes literal: lxml rejects str input that carries an encoding declaration.
xml = etree.XML(b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')
# {*} matches the element in any namespace; QName then splits off the URI.
print(etree.QName(xml.find('.//{*}MainClosedCaption')).namespace)
Result:
http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
Reference: http://lxml.de/tutorial.html#namespaces
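The 'Unknown return type: dict' error in the question most likely comes from the keyword spelling: lxml's xpath() takes namespaces= (plural), and, as far as I can tell, any other keyword argument is treated as an XPath variable, which cannot hold a dict. Below is a sketch of the asker's loop with that one change, on a shortened document. Skipping the built-in "xml" prefix is an extra precaution added here, not something from the original post.

```python
from lxml import etree

# Shortened version of the playlist from the question.
XML = b"""<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <Id>urn:uuid:0607e57f</Id>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>"""

tree = etree.XML(XML)

# The namespace axis yields (prefix, uri) tuples. The default namespace has
# prefix None, which XPath cannot use, so only named prefixes are kept.
nsmap = {}
for prefix, uri in tree.xpath("//namespace::*"):
    if prefix and prefix != "xml":
        nsmap[prefix] = uri

# Note the plural keyword: namespaces=, not namespace=.
captions = tree.xpath("//cc-cpl:MainClosedCaption", namespaces=nsmap)
caption_ns = etree.QName(captions[0]).namespace
print(caption_ns)
```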

python lxml etree parsing an fvdl file

The file contains the following lines.
<?xml version="1.0" encoding="UTF-8"?>
<FVDL xmlns="xmlns://www.fortifysoftware.com/schema/fvdl" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.9" xsi:type="FVDL">
<CreatedTS date="2013-08-06" time="11:8:48" />
I am trying to read the version tag in FVDL. I am using lxml etree, and my code snippet is:
import os
from lxml import etree
with open(os.path.join(analysis, "merged-results.fvdl"), "rb") as file_handle:
    context = etree.parse(file_handle)
ver = context.xpath('//FVDL')
print(ver)
This had worked before when parsing a standard XML file. However, it fails for the file above: ver is an empty list at the end of execution.
Alternative to @falsetru's answer
(By "trying to read the version tag", I understand "the version attribute", which may not be what you want.)
Explicitly register the FVDL namespace under the "fvdl" prefix:
ver = context.xpath('//fvdl:FVDL/@version',
                    namespaces={"fvdl": "xmlns://www.fortifysoftware.com/schema/fvdl"})
Or, riskier, if you somehow know that you want the version attribute of the root node:
ver = context.xpath('/*/@version')
Both give ['1.9'].
context = etree.parse(file_handle)
ver = context.getroot()
print(ver.attrib['version'])
Output: 1.9
Use [local-name()=...]:
ver = context.xpath('//*[local-name()="FVDL"]')
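To see why '//FVDL' matches nothing, it can help to look at the root element's tag: lxml stores the namespace URI as part of the element name. A small sketch using the two lines quoted in the question (the closing tag and the variable names are added here so the fragment parses and reads clearly):

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<FVDL xmlns="xmlns://www.fortifysoftware.com/schema/fvdl" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.9" xsi:type="FVDL">
<CreatedTS date="2013-08-06" time="11:8:48" />
</FVDL>"""

root = etree.fromstring(XML)

# The tag carries the namespace in Clark notation: {uri}localname.
tag = root.tag
qname = etree.QName(root)
print(tag, qname.localname, qname.namespace)

# An unprefixed name only matches elements in no namespace.
no_match = root.xpath("//FVDL")

# Either bind a prefix of your choice to the URI ...
by_prefix = root.xpath("//f:FVDL/@version", namespaces={"f": qname.namespace})

# ... or ignore the namespace and compare local names only.
by_local = root.xpath('//*[local-name()="FVDL"]/@version')
```

The local-name() form is convenient for quick scripts, but it also matches an FVDL element from any other namespace, so binding a prefix is usually the more precise choice.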

How to resolve external entities with xml.etree like lxml.etree

I have a script that parses XML using lxml.etree:
from lxml import etree
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
tree = etree.parse('main.xml', parser=parser)
I need load_dtd=True and resolve_entities=True to have &emptyEntry; from globals.xml resolved:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE map SYSTEM "globals.xml" [
<!ENTITY dirData "${DATADIR}">
]>
<map
xmlns:map="http://my.dummy.org/map"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsschemaLocation="http://my.dummy.org/map main.xsd"
>
&emptyEntry; <!-- from globals.xml -->
<entry><key>KEY</key><value>VALUE</value></entry>
<entry><key>KEY</key><value>VALUE</value></entry>
</map>
with globals.xml
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY emptyEntry "<entry></entry>">
Now I would like to move from the third-party lxml to the standard library's xml.etree. But this fails with my file because load_dtd=True and resolve_entities=True are not supported by xml.etree.
Is there an xml.etree way to have these entities resolved?
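My trick is to use the external program xmllint to expand the entities first:
import subprocess
from io import BytesIO
from xml.etree import ElementTree
# --noent makes xmllint substitute entity references in its output
proc = subprocess.Popen(['xmllint', '--noent', fname], stdout=subprocess.PIPE)
output = proc.communicate()[0]
tree = ElementTree.parse(BytesIO(output))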
lxml is the right tool for the job.
But if you want to use the stdlib, be prepared for difficulties and take a look at the UseForeignDTD method of the underlying expat parser. Here's a good (but hacky) example: Python ElementTree support for parsing unknown XML entities?
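For simple text entities there is also a smaller stdlib-only workaround: xml.etree's XMLParser exposes an entity dictionary that is consulted when a reference cannot be resolved. The sketch below rests on two assumptions from my understanding of expat's behaviour: the document must declare a DOCTYPE with an external subset (as main.xml does), and the replacement is inserted as plain text, not parsed as markup, so it would not rebuild the <entry></entry> element from globals.xml. The entity name defaultValue and the shortened document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DOCTYPE points at an external file that
# xml.etree will not load, and &defaultValue; is defined nowhere inline.
XML = """<?xml version="1.0"?>
<!DOCTYPE map SYSTEM "globals.xml">
<map><entry><key>KEY</key><value>&defaultValue;</value></entry></map>
"""

parser = ET.XMLParser()
# Supply the replacement text by hand instead of reading globals.xml.
parser.entity["defaultValue"] = "VALUE"

root = ET.fromstring(XML, parser=parser)
value = root.find("entry/value").text
print(value)
```

If the entities have to expand to elements, this is not enough, and preprocessing with xmllint or staying with lxml looks like the more practical route.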

Python xml etree DTD from a StringIO source?

I'm adapting the following code (created via advice in this question), which takes an XML file and its DTD and converts them to a different format. For this problem only the loading section is important:
xmldoc = open(filename)
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(xmldoc, parser)
This worked fine while using the file system, but I'm converting it to run via a web framework, where the two files are uploaded via a form.
Loading the XML file works fine:
tree = etree.parse(StringIO(data['xml_file']))
But as the DTD is referenced at the top of the XML file, the following statement fails:
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(StringIO(data['xml_file']), parser)
Via this question, I tried:
etree.DTD(StringIO(data['dtd_file']))
tree = etree.parse(StringIO(data['xml_file']))
While the first line doesn't cause an error, the second falls over on the character entities that the DTD is meant to define (and does define in the file system version):
XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46
How do I go about correctly loading this DTD?
Here's a short but complete example, using the custom resolver technique @Steven mentioned.
from io import BytesIO
from lxml import etree
data = dict(
    xml_file='''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>&eacute;zz</y></x>
''',
    dtd_file='''<!ENTITY eacute "&#233;">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''')
class DTDResolver(etree.Resolver):
    def resolve(self, url, id, context):
        # Ignore the requested URL and always serve the uploaded DTD text.
        return self.resolve_string(data['dtd_file'], context)
# Wrap the uploaded XML in a file-like object of bytes.
xmldoc = BytesIO(data['xml_file'].encode('utf-8'))
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
try:
    tree = etree.parse(xmldoc, parser)
except etree.XMLSyntaxError as e:
    # handle XML and validation errors
    print(e)
You could probably use a custom resolver. The docs actually give an example of doing this to provide a dtd.
