How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc, that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want some different Python data structure, you'll have to walk through the Python data structure which lxml returns, as above mentioned, and build your different data structure yourself based on the information thus collected; lxml can't specifically help you, except by offering several helpers for finding information in the parsed structure it returns, so that collecting said info is a flexible, easy task (again, see the documentation URL above).
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.
Related
I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml
I am trying to generate a PDF using FOP. To do this I am taking in a template file, initialling its values with Jinja2 and then passing it through to fop with a system call.
Is it possible to do a subprocess call to FOP without passing through an input file but instead a string containing the XML directly? And if so how would I go about doing so?
I was hoping for something like this
fop -fo "XML here" -pdf output.pdf
Yes actually it was possible.
Using python I was able to import the xml from the file into lxml.etree:
tree = etree.parse('FOP_PARENT.fo.xml')
And then by using the etree to parse the include tags:
tree.xinclude()
Then it was a simple case of converting the xml back into unicode:
xml = etree.tounicode(tree)
This is how I got the templates to work. Hopefully this helps someone who has the same issue!
In Python, I need to Canonicalize (c14n) an XML string.
Which module/package can I use for this? And how should I do this?
(I prefer to use default python 2.7 modules, without extra installs or patches.)
For reference see: http://www.w3.org/TR/xml-exc-c14n/
from http://www.decalage.info/en/python/lxml-c14n
lxml provides a very easy way to do c14n in python.
<..>
Here is an example showing how to perform C14N using lxml 2.1:
import lxml.etree as ET
et = ET.parse('file.xml')
output = StringIO.StringIO()
et.write_c14n(output)
print output.getvalue()
from lxml docs:
write_c14n(self, file, exclusive=False, with_comments=True,
compression=0, inclusive_ns_prefixes=None)
C14N write of document. Always writes UTF-8.
<..>
Also there is libxml2:
XML C14N version 1.0 provides two options which make four
possibilities (see http://www.w3.org/TR/xml-c14n and
http://www.w3.org/TR/xml-exc-c14n/):
Inclusive or Exclusive C14N
With or without comments
libxml2 gives access to these options in its C14N API:
http://xmlsoft.org/html/libxml-c14n.html
Though obligatory check for version changes in these two libs.
now in python 3 you can write your code like this:
import lxml.etree as ET
et = ET.parse('your_xml_file_that_you_want_to_canonicalize.xml')
et.write_c14n("your_result_will_be_in_this_file.xml")
Here are two different files that my python (2.6) script encounters. One will parse, the other will not. I'm just curious as to why this happens.
This xml file will not parse and the script will fail:
<Landfire_Feedback_Point_xlsform id="fbfm40v10" instanceID="uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab" submissionDate="2013-04-30T23:03:32.881Z" isComplete="true" markedAsCompleteDate="2013-04-30T23:03:32.881Z" xmlns="http://opendatakit.org/submissions">
<date_test>2013-04-17</date_test>
<plot_number>10</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.2452830500 -118.2149402900 210.3000030518 3.0000000000</geopoint_plot><fbfm40_new>GS2</fbfm40_new>
<select_grazing>NONE</select_grazing>
<image_close>1366230030355.jpg</image_close>
<plot_note>No road present.</plot_note>
<n0:meta xmlns:n0="http://openrosa.org/xforms">
<n0:instanceID>uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab</n0:instanceID>
</n0:meta>
</Landfire_Feedback_Point_xlsform>
This xml file will parse correctly and the script succeeds:
<Landfire_Feedback_Point_xlsform id="fbfm40v10">
<date_test>2013-05-14</date_test>
<plot_number>010</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.26630563 -118.39881809 351.70001220703125 5.0</geopoint_plot>
<fbfm40_new>GR1</fbfm40_new>
<select_grazing>HIGH</select_grazing>
<image_close>fbfm40v10_PLOT_010_ID_6.jpg</image_close>
<plot_note>Heavy grazing</plot_note>
<meta><instanceID>uuid:90e7d603-86c0-46fc-808f-ea0baabdc082</instanceID></meta>
</Landfire_Feedback_Point_xlsform>
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't. Specifically, the one that doesn't seem to parse fails with a "'NONE' type doesn't have a 'text' attribute" or something similar. But, it's because it doesn't seem to consider the file as xml or it can't see any elements beyond the opening line. Any explanation or direction with regard to this error would be appreciated. Thanks in advance.
Python script:
import os
from xml.etree import ElementTree
def replace_xml_attribute_in_file(original_file,element_name,attribute_value):
#THIS FUNCTION ONLY WORKS ON XML FILES WITH UNIQUE ELEMENT NAMES
# -DUPLICATE ELEMENT NAMES WILL ONLY GET THE FIRST ELEMENT WITH A GIVEN NAME
#split original filename and add tempfile name
tempfilename="temp.xml"
rootsplit = original_file.rsplit('\\') #split the root directory on the backslash
rootjoin = '\\'.join(rootsplit[:-1]) #rejoin the root diretory parts with a backslash -minus the last
temp_file = os.path.join(rootjoin,tempfilename)
et = ElementTree.parse(original_file)
author=et.find(element_name)
author.text = attribute_value
et.write(temp_file)
if os.path.exists(temp_file) and os.path.exists(original_file): #if both the original and the temp files exist
os.remove(original_file) #erase the original
os.rename(temp_file,original_file) #rename the new file
else:
print "Something went wrong."
replace_xml_attribute_in_file("testfile1.xml","image_close","whoopdeedoo.jpg");
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't.
Your code doesn't demonstrate that at all. It demonstrates that they're both seen by ElementTree as valid XML files chock full of nodes. They both parse just fine, they both read past the first line, etc.
The only problem is that the first one doesn't have a node named 'image_close', so your code doesn't work.
You can see that pretty easily:
for node in et.getroot().getchildren():
print node.tag
You get 9 children of the root, with either version.
And the output to that should show you the problem. The node you want is actually named {http://opendatakit.org/submissions}image_close in the first example, rather than image_close as in the second.
And, as you can probably guess, this is because of the namespace=http://opendatakit.org/submissions in the root node. ElementTree uses the "James Clark notation" for mapping unknown-namespaced names to universal names.
Anyway, because none of the nodes are named image_close, the et.find(element_name) returns None, so your code stores author=None, then tries to assign to author.text, and gets an error.
As for how to fix this problem… well, you could learn how namespaces work by default in ElementTree, or you could upgrade to Python 2.7 or install a newer ElementTree for 2.6 that lets you customize things more easily. But if you want to do custom namespace handling and also stick with your old version… I'd start with this article (and its two predecessors) and this one.
I've written a Python script to create some XML, but I didn't find a way to edit the heading within Python.
Basically, instead of having this as my heading:
<?xml version="1.0" ?>
I need to have this:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
I looked into this as best I could, but found no way to add standalone status from within Python. So I figured that I'd have to go over the file after it had been created and then replace the text. I read in several places that I should stay far away from using readlines() because it could ruin the XML formatting.
Currently the code I have to do this - which I got from another Stackoverflow post about editing XML with Python - is:
doc = parse('file.xml')
elem = doc.find('xml version="1.0"')
elem.text = 'xml version="1.0" encoding="UTF-8" standalone="no"'
That provides me with a KeyError. I've tried to debug it to no avail, which leads me to believe that perhaps the XML heading wasn't meant to be edited in this way. Or my code is just wrong.
If anyone is curious (or miraculously knows how to insert standalone status into Python code), here is my code to write the xml file:
with open('file.xml', 'w') as f:
f.write(doc.toprettyxml(indent=' '))
Some people don't like "toprettyxml", but with my relatively basic level, it seemed like the best bet.
Anyway, if anyone can provide some advice or insight, I would be very grateful.
The xml.etree API does not give you any options to write out a standalone attribute in the XML declaration.
You may want to use the lxml library instead; it uses the same ElementTree API, but offers more control over the output. tostring() supports a standalone flag:
from lxml import etree
etree.tostring(doc, pretty_print=True, standalone=False)
or use .write(), which support the same options:
doc.write(outputfile, pretty_print=True, standalone=False)