Can I create this XML file with lxml? - python

I'm trying to generate an xml that looks exactly like this:
<?xml version="1.0" encoding="utf-8"?>
<XML type="formats" version="4">
<format type="format" uid="BEAUTY:MasterBeauty">
<type>video</type>
<channelsDepth type="uint">16</channelsDepth>
<channelsEncoding type="string">Float</channelsEncoding>
<channelsEndianess type="string">Little Endian</channelsEndianess>
<fieldDominance type="int">2</fieldDominance>
<height type="uint">1080</height>
<nbChannels type="uint">4</nbChannels>
<pixelLayout type="string">ABGR</pixelLayout>
<pixelRatio type="float">1</pixelRatio>
<rowOrdering type="string">up</rowOrdering>
<width type="uint">1920</width>
</format>
</XML>
It's part of a VFX nodal workflow script ensemble and this file is part of a "read media" node.
I've spent the whole week looking at many different things but can't find anything close to this. I picked lxml for its pretty printing. I was able to generate a bunch of other, simpler (to me) XML files, but for this one, I have to say, I'm lost. Complete fail so far!
Could someone kindly shed some light on this, please?
MY QUESTIONS:
- Is lxml appropriate for this?
- If not, what is a better choice? (I did look for an ElementTree example, no luck!)
- If yes, where do I start? Could someone share a piece of code to get me started?
What I could create so far was things like this one:
from lxml import etree
from lxml.builder import ElementMaker
E = ElementMaker()
Setup = E.Setup
Base = E.Base
Version = E.Version
Note = E.Note
Expanded = E.Expanded
ScrollBar = E.ScrollBar
Frames = E.Frames
Current_Time = E.Current_Time
Input_DataType = E.Input_DataType
ClampMode = E.ClampMode
AdapDegrad = E.AdapDegrad
UsedAsTransition = E.UsedAsTransition
State = E.State
root_node = Setup(
    Base(
        Version('12.030000'),
        Note(''),
        Expanded('False'),
        ScrollBar('0'),
        Frames('0'),
        Current_Time('1'),
        Input_DataType('3'),
        ClampMode('0'),
        AdapDegrad('False'),
        UsedAsTransition('False')
    ),
    State(),
)
# tostring() returns bytes: decode for printing, write in binary mode
xml_bytes = etree.tostring(root_node, pretty_print=True)
print(xml_bytes.decode())
with open('/Users/stefan/XenDRIVE/___DEV/PYTHON/Create_xlm/create_Batch_xml_setups/result/xml_result/root.root_node.xml', 'wb') as my_xml_file:
    my_xml_file.write(xml_bytes)
Hope those are "acceptable" questions.
Thank you in advance for any help.

First, make the format node and then add it to the root XML node.
Example code (follow it to create more nodes):
from lxml import etree
from lxml.builder import ElementMaker
E = ElementMaker()
format_node = E.format(
    E.type("video"),
    E.channelsDepth("16", type="uint"),
    # create more elements here
    type="format",
    uid="BEAUTY:MasterBeauty"
)
root = E.XML(
    format_node,
    type="formats",
    version="4"
)
# tostring() returns bytes; decode so print() shows the XML text itself
print(etree.tostring(root, xml_declaration=True, encoding='utf-8', pretty_print=True).decode('utf-8'))
Prints:
<?xml version='1.0' encoding='utf-8'?>
<XML type="formats" version="4">
  <format type="format" uid="BEAUTY:MasterBeauty">
    <type>video</type>
    <channelsDepth type="uint">16</channelsDepth>
  </format>
</XML>
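The question also asks about ElementTree. As a rough sketch only, the same file can be built with the standard-library `xml.etree.ElementTree` instead of lxml, filling in all eleven children from a table. The names `FIELDS` and `build_formats` are made up for this illustration, and `ET.indent` is assumed to be available (Python 3.9+); the code skips indentation on older versions.

```python
import xml.etree.ElementTree as ET

# (tag, "type" attribute or None, text) for each child of <format>, in file order.
# Values are copied from the target file in the question.
FIELDS = [
    ("type", None, "video"),
    ("channelsDepth", "uint", "16"),
    ("channelsEncoding", "string", "Float"),
    ("channelsEndianess", "string", "Little Endian"),
    ("fieldDominance", "int", "2"),
    ("height", "uint", "1080"),
    ("nbChannels", "uint", "4"),
    ("pixelLayout", "string", "ABGR"),
    ("pixelRatio", "float", "1"),
    ("rowOrdering", "string", "up"),
    ("width", "uint", "1920"),
]


def build_formats(uid="BEAUTY:MasterBeauty"):
    """Build the <XML><format>...</format></XML> tree (hypothetical helper)."""
    root = ET.Element("XML", {"type": "formats", "version": "4"})
    fmt = ET.SubElement(root, "format", {"type": "format", "uid": uid})
    for tag, type_attr, text in FIELDS:
        attrib = {"type": type_attr} if type_attr else {}
        ET.SubElement(fmt, tag, attrib).text = text
    return root


root = build_formats()
if hasattr(ET, "indent"):  # ET.indent was added in Python 3.9
    ET.indent(root, space="  ")

# encoding="unicode" returns a str without a declaration, so add one by hand.
xml_text = ET.tostring(root, encoding="unicode")
print('<?xml version="1.0" encoding="utf-8"?>')
print(xml_text)
```

Keeping the fields in a list makes it easy to add or reorder children without touching the tree-building code; the same table could drive the lxml `ElementMaker` version as well.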

Related

Need some help generating XML with Python

I have some variables in Python that I need to store as XML. I have been using the lxml module for this so far, but I'm not too experienced with it. I have tried playing around with various tutorials and docs, but I am at a dead end and need some help.
Here is the Python script:
root = etree.Element("root")
coins=etree.Element("coins")
doc=etree.ElementTree(coins)
coins.append(etree.Element("trader"))
coins.append(etree.Element("metal"))
coins.append(etree.Element("type"))
coins.append(etree.Element("price"))
coins[0].text="Gold.co.uk"
coins[0].attrib["variable"]=("GLDAG_MAPLE")
coins[1].text="Silver"
coins[2].text="Britannia"
coins[3].text=str(GLDAG_MAPLE)
doc.write('data.xml', pretty_print=True)
As of now it outputs this:
<coins>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
<price>
£31.20
</price>
</coins>
However I would like it to look like this:
<root>
<coin>
<trader> Gold.co.uk </trader>
<type> Britannia </type>
<price> £31.20 </price>
</coin>
</root>
The tags and their sub-tags would be duplicated for every type of coin. I have no idea how to construct the XML so that the output looks like the third code block. So far I have tried to follow other scripts that I have seen on GitHub and other sites and modify them to suit my needs, but my scripts keep failing or producing incorrect results for some reason.
If someone could help me out then that would be great!
You can simply append the Element to root:
from lxml import etree
coinItems = [
{'trader': 'Gold.co.uk', 'metal': 'Silver', 'type': 'Britannia'},
{'trader': 'copper.co.uk', 'metal': 'Copper', 'type': 'World'}
]
root = etree.Element("root")
for ci in coinItems:
    coin = etree.Element("coin")
    etree.SubElement(coin, "trader", {'variable': 'GLDAG_MAPLE'}).text = ci['trader']  # example of how to use attributes!
    etree.SubElement(coin, "metal").text = ci['metal']
    etree.SubElement(coin, "type").text = ci['type']
    root.append(coin)
fName = '/tmp/data.xml'
with open(fName, 'wb') as f:
    # omit the encoding argument if you want non-ASCII characters escaped as character references, e.g. &#163;
    f.write(etree.tostring(root, xml_declaration=True, encoding="utf-8", pretty_print=True))
with open(fName, encoding="utf-8") as f:
    print(f.read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<root>
  <coin>
    <trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
    <metal>Silver</metal>
    <type>Britannia</type>
  </coin>
  <coin>
    <trader variable="GLDAG_MAPLE">copper.co.uk</trader>
    <metal>Copper</metal>
    <type>World</type>
  </coin>
</root>
I prefer using the lxml builder (https://lxml.de/api/lxml.builder.ElementMaker-class.html) because, IMHO, it makes the structure of your XML document easier to see.
from lxml.builder import E
root = E.root(
    E.coin(
        E.trader("Gold.co.uk",
                 variable="GLDAG_MAPLE"),
        E.metal("silver"),
        E.price("£31.20")
    )
)
You can then append the root element to your main document.
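The layout asked for in the question has a `price` child in every `coin`. A minimal sketch of that, using the standard-library `xml.etree.ElementTree` rather than lxml so it runs without extra packages; the helper name `make_coin` is hypothetical and the second coin's price is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data: the first row comes from the question, the second price is invented.
coins = [
    {"trader": "Gold.co.uk", "type": "Britannia", "price": "£31.20"},
    {"trader": "copper.co.uk", "type": "World", "price": "£1.00"},  # made-up price
]


def make_coin(trader, coin_type, price):
    """Return one <coin> element with trader, type and price children."""
    coin = ET.Element("coin")
    ET.SubElement(coin, "trader").text = trader
    ET.SubElement(coin, "type").text = coin_type
    ET.SubElement(coin, "price").text = price
    return coin


root = ET.Element("root")
for c in coins:
    root.append(make_coin(c["trader"], c["type"], c["price"]))

# encoding="unicode" returns a str, so the pound sign stays a literal character.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

One `<coin>` is appended per dictionary, so adding another coin only means adding another row to `coins`.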

Parsing subchilds in XML with ElementTree

I'm trying to extract information from an XML document with ElementTree in Python 3.2.
The XML looks like this:
<Page Id="1">
<Group>4</Group>
<Type>
<Letter>B</Letter>
<Number>101</Number>
<Deep>
<A>900</A>
<B>900</B>
</Deep>
</Type>
</Page>
I managed to get the element data from "Group" with:
for Page in root.iter('Page'):
    Group = Page.find('Group').text
And the "Letter" data with:
for Type in root.iter('Type'):
    Dim = Type.find('Letter').text
However, I can't figure out how to get the data from the children of "Deep" (A and B).
All help is greatly appreciated!
You are very close. Use find to locate the Deep tag and then iterate over its children.
Ex:
import xml.etree.ElementTree as ET
tree = ET.parse(filename)  # filename: path to your XML file
root = tree.getroot()
for Type in root.iter('Type'):
    for deep_tag in Type.find("Deep"):
        print(deep_tag.text)
Output:
900
900
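If the tag names matter as well as the values, the children of `Deep` can be collected into a dictionary. This is a small sketch that parses the sample document from a string; the names `SAMPLE`, `values` and `a_value` are only for illustration. It also guards against `find` returning `None` when no `Deep` element exists, which would otherwise make the loop fail.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Page Id="1">
  <Group>4</Group>
  <Type>
    <Letter>B</Letter>
    <Number>101</Number>
    <Deep>
      <A>900</A>
      <B>900</B>
    </Deep>
  </Type>
</Page>"""

# Here the root element is <Page> itself, so paths are relative to it.
root = ET.fromstring(SAMPLE)

# find() returns None when nothing matches, so check before iterating.
deep = root.find("./Type/Deep")
values = {child.tag: child.text for child in deep} if deep is not None else {}
print(values)

# A single value can also be fetched directly with a path.
a_value = root.findtext("./Type/Deep/A")
print(a_value)
```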

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
import pdb
f = open('/vagrant/out2.xml', 'r')
xml_str = f.read()
xml_doc = etree.fromstring(xml_str)
f.close()
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty array.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.
You need to take the default namespaces into account, so instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5'
    }
)
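The same prefix-to-URI mapping idea can be tried without lxml. The sketch below uses the standard-library `xml.etree.ElementTree`, whose `find`/`findtext` accept a `namespaces` mapping but only support a limited XPath subset (relative paths, no `text()`), so it is an illustration of the mapping rather than a drop-in replacement for lxml's `xpath()`. The document is a trimmed copy of the one in the question, and the prefixes `oai` and `ddi` are arbitrary labels chosen in the code, not taken from the file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document: two nested default namespaces.
SAMPLE = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <codeBook version="2.5" xmlns="ddi:codebook:2_5">
    <docDscr>
      <citation>
        <titlStmt>
          <titl>Test Title</titl>
        </titlStmt>
      </citation>
    </docDscr>
  </codeBook>
</metadata>"""

# Prefixes are local to this mapping; only the URIs must match the document.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}

root = ET.fromstring(SAMPLE)

# The parser stores the namespace in the tag itself, which is why the
# unprefixed path 'codeBook/docDscr/...' matches nothing.
print(root.tag)

# The path is relative to the root <metadata> element.
title = root.findtext(
    "ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl",
    namespaces=NS,
)
print(title)
```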

Extracting nested namespace from a xml using lxml

I'm new to Python and currently learning to parse XML. All seemed to be going well until I hit a wall with nested namespaces.
Below is a snippet of my XML (with the root element and the child element that I'm trying to parse):
<?xml version="1.0" encoding="UTF-8"?>
-<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
-------------
-<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>
What I need is a way to extract the namespace URI of the element whose local name is 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#". I looked through a lot of tutorials but cannot seem to find a solution.
If anyone out there can lend their expertise, it would be much appreciated.
Here is what I did so far with the help of the two contributors:
#!/usr/bin/env python
from xml.etree import ElementTree as ET #import ElementTree module as an alias ET
from lxml import objectify, etree
def parse():
    import os
    import sys
    cpl_file = sys.argv[1]
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file,cpl_file)
    with open(xml_file) as f:
        xml = f.read()
    tree = etree.XML(xml)
    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace
    print caption_namespace
    print tree.nsmap
    nsmap = {}
    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)
    return nsmap
if __name__=="__main__":
    parse()
But it's not working so far. I got the result 'None' when I used QName to locate the tag and its namespace. And when I try to locate all namespace in the XML using for loop as suggested in another post, I got the error 'Unknown return type: dict'
Any suggestions pls?
This program prints the namespace of the indicated tag:
from lxml import etree
# bytes literal: lxml rejects str input that carries an encoding declaration
xml = etree.XML(b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <!-- Generated by orca_wrapping version 3.8.3-0 -->
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id>
    <EditRate>24 1</EditRate>
    <IntrinsicDuration>2698</IntrinsicDuration>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')
print(etree.QName(xml.find('.//{*}MainClosedCaption')).namespace)
Result:
http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
Reference: http://lxml.de/tutorial.html#namespaces
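Both lxml and the standard-library ElementTree store a namespaced tag as `{uri}localname`, so the URI can also be read by splitting the tag string. The following is a sketch under that assumption, written against the standard library; `namespace_of` is a hypothetical helper name and the sample is a shortened copy of the document above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <EditRate>24 1</EditRate>
    <IntrinsicDuration>2698</IntrinsicDuration>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>"""


def namespace_of(root, local_name):
    """Return the namespace URI of the first element with this local name.

    Returns None if no element matches or the match has no namespace.
    """
    for elem in root.iter():
        tag = elem.tag
        if not isinstance(tag, str):
            continue  # comments and processing instructions have no string tag
        if tag.startswith("{"):
            # Tags look like "{uri}local": split at the closing brace.
            uri, _, local = tag[1:].partition("}")
        else:
            uri, local = None, tag
        if local == local_name:
            return uri
    return None


root = ET.fromstring(SAMPLE)
caption_ns = namespace_of(root, "MainClosedCaption")
print(caption_ns)
```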

XML row structure in one row

A strange error occurred: I got an XML file emailed to me that was wrongly formatted. All the content of the file was on a single line.
Like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Text><otherText><printdate>2015-02-08</printdate>
Does anyone who has had the same problem know a quick way to fix this, with a Python script or something similar?
I want to make the file look like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Text>
<OtherText>
<Name>VH2</Name>
<PrintDate>2015-02-05</PrintDate>
Thanks!
It seems you want pretty printing. Other XML libraries, such as lxml, support it:
import lxml.etree as etree
x = etree.parse("filename")
# tostring() returns bytes; decode so print() shows the XML text itself
print(etree.tostring(x, pretty_print=True).decode())
However, you can also try this:
Pretty printing XML in Python
If the XML is well formed this snippet will work
#!/usr/bin/python
import xml.dom.minidom
def main():
    # read the single-line file, then write an indented copy
    with open('ugly.xml', 'r') as ugly_xml:
        dom = xml.dom.minidom.parseString(ugly_xml.read())
    with open('pretty.xml', 'w') as pretty_xml:
        pretty_xml.write(dom.toprettyxml())
if __name__ == "__main__":
    main()
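To try the minidom approach without any files, the single-line document can be passed in as a string. In this sketch the closing `</otherText></Text>` tags are an assumption, added only so that the truncated sample from the question is well formed; the two-space `indent` is likewise just a choice.

```python
import xml.dom.minidom

# One-line input modelled on the question; the closing tags are assumed.
ugly = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    "<Text><otherText><printdate>2015-02-08</printdate></otherText></Text>"
)

# parseString() raises an ExpatError if the input is not well formed.
dom = xml.dom.minidom.parseString(ugly)

# toprettyxml() puts each element on its own line, indented by nesting depth.
pretty = dom.toprettyxml(indent="  ")
print(pretty)
```

Note that `toprettyxml()` writes its own XML declaration, so the declaration in the output will not necessarily match the one in the input.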
