Python Exclusive XML Canonicalization (xml-exc-c14n)

In Python, I need to canonicalize (c14n) an XML string.
Which module or package can I use for this, and how should I do it?
(I would prefer to use the default Python 2.7 modules, without extra installs or patches.)
For reference see: http://www.w3.org/TR/xml-exc-c14n/

From http://www.decalage.info/en/python/lxml-c14n:
lxml provides a very easy way to do c14n in python.
<..>
Here is an example showing how to perform C14N using lxml 2.1:
import StringIO
import lxml.etree as ET
et = ET.parse('file.xml')
output = StringIO.StringIO()
et.write_c14n(output)
print output.getvalue()
From the lxml docs:
write_c14n(self, file, exclusive=False, with_comments=True,
compression=0, inclusive_ns_prefixes=None)
C14N write of document. Always writes UTF-8.
<..>
Also there is libxml2:
XML C14N version 1.0 provides two options which make four
possibilities (see http://www.w3.org/TR/xml-c14n and
http://www.w3.org/TR/xml-exc-c14n/):
Inclusive or Exclusive C14N
With or without comments
libxml2 gives access to these options in its C14N API:
http://xmlsoft.org/html/libxml-c14n.html
As always, check for version changes in these two libraries.
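Since the question asks for exclusive canonicalization specifically, here is a minimal sketch that uses the exclusive flag from the write_c14n signature quoted above. It assumes Python 3 with lxml installed; the sample document, its urn: namespace URIs, and the c14n helper name are made up for illustration.

```python
import io

from lxml import etree

# Made-up sample: the "unused" prefix is declared but never used.
XML = b'<a:root xmlns:a="urn:a" xmlns:unused="urn:unused"><a:child/></a:root>'


def c14n(xml_bytes, exclusive):
    """Return the canonical form of xml_bytes as UTF-8 bytes."""
    tree = etree.parse(io.BytesIO(xml_bytes))
    output = io.BytesIO()
    tree.write_c14n(output, exclusive=exclusive, with_comments=False)
    return output.getvalue()


inclusive = c14n(XML, exclusive=False)
exclusive = c14n(XML, exclusive=True)
print(inclusive)
print(exclusive)
```

Exclusive mode drops namespace declarations that are not visibly used, so the unused prefix should survive only in the inclusive output.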

In Python 3, you can write it like this:
import lxml.etree as ET
et = ET.parse('your_xml_file_that_you_want_to_canonicalize.xml')
et.write_c14n("your_result_will_be_in_this_file.xml")
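If extra installs are a concern and Python 3.8 or later is available, the standard library has xml.etree.ElementTree.canonicalize. Note that it implements C14N 2.0 rather than the exclusive C14N 1.0 spec linked in the question, so treat this as a related option and not a drop-in answer. The sample XML is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample with unsorted attributes and an empty element.
xml_data = '<root  b="2" a="1"><child/></root>'

# canonicalize() returns the canonical form as a string when no
# output file is given (Python 3.8+, C14N 2.0).
canonical = ET.canonicalize(xml_data)
print(canonical)
```

The canonical form sorts the attributes and writes the empty element as a start/end tag pair.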

Related

Reading a page and parsing it with minidom.parse or minidom.parseString in Python?

I have tried both of these snippets:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)
which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Or this:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())
which gives me the same error. res.read() reads fine and is a string.
I would like to parse through the code later. How can I do this using xml.dom.minidom?
The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).
* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.
** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. This module is deprecated for a reason.
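As a rough illustration of the stdlib route, here is a sketch using html.parser from Python 3. The LinkCollector class name and the HTML snippet are made up; a real page would be fetched first and its decoded text passed to feed().

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up HTML5 snippet standing in for the downloaded page.
page = (
    '<!DOCTYPE html><html><body><p>Hi<br>'
    '<a href="/one">1</a> <a href="/two">2</a></body></html>'
)

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Unlike minidom, this parser does not require well-formed XML, so the unclosed <p> and <br> tags in the snippet are not a problem.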

Python ElementTree - print out namespace definitions?

I'm using Python's ElementTree to parse some XML configuration files.
At the top of the file, I have a root element like this:
<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:foo="http://ns.au.firm.com/foo.xsd"
xmlns:bar="http://ns.au.firm.com/bar.xsd"
>
The problem is, the bar namespace can be set to one of two different XSDs, depending on the version of the configuration file.
I'm looking for a way to print out the namespace mapping using ElementTree, so I can check which of the two XSDs is being used - then I can get my code to handle the correct case.
Is there a way to print out all the namespace definitions using Python?
Cheers,
Victor
What you have is not valid XML (the sgx prefix is undefined), and I don't think you can do this with xml.etree, but you should be able to do it using lxml:
import lxml.etree as et
tree = et.XML(yourxml)
print tree.nsmap
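For what it's worth, the standard library may be enough here: xml.etree.ElementTree.iterparse can report "start-ns" events, each carrying a (prefix, uri) pair. A sketch, assuming Python 3; the xmlns:sgx declaration and its URI are invented so that the sample is well-formed, and namespace_map is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# The sgx declaration is an invented placeholder; the others come from the question.
CONFIG = b"""<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
    xmlns:sgx="http://ns.au.firm.com/sgx.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:foo="http://ns.au.firm.com/foo.xsd"
    xmlns:bar="http://ns.au.firm.com/bar.xsd"
/>
"""


def namespace_map(source):
    """Return a {prefix: uri} dict of the namespace declarations in source."""
    return dict(
        prefix_and_uri
        for _event, prefix_and_uri in ET.iterparse(source, events=("start-ns",))
    )


namespaces = namespace_map(io.BytesIO(CONFIG))
print(namespaces["bar"])
```

If a prefix is redeclared deeper in the document, the later declaration overwrites the earlier one in this dict, so this is only suitable for simple files like the one in the question.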

python feedparser

How would you parse XML data like the following with Python feedparser?
<Book_API>
<Contributor_List>
<Display_Name>Jason</Display_Name>
</Contributor_List>
<Contributor_List>
<Display_Name>John Smith</Display_Name>
</Contributor_List>
</Book_API>
That doesn't look like any sort of RSS/ATOM feed. I wouldn't use feedparser at all for that, I would use lxml. In fact, feedparser can't make any sense of it and drops the "Jason" contributor in your example.
from lxml import etree
data = <fetch the data somehow>
root = etree.parse(data)
Now you have a tree of xml objects. How to do it in lxml more specifically is impossible to say until you actually give valid XML data. ;)
As Lennart Regebro mentioned, this does not look like an RSS/Atom feed but just a plain XML document. There are several XML parsing facilities (both SAX and DOM) in the Python standard library. I recommend ElementTree. Among third-party libraries, lxml is the best one, and it is a drop-in replacement for ElementTree.
try:
    from lxml import etree
except ImportError:
    try:
        import xml.etree.cElementTree as etree
    except ImportError:
        import xml.etree.ElementTree as etree
doc = """<Book_API>
<Contributor_List>
<Display_Name>Jason</Display_Name>
</Contributor_List>
<Contributor_List>
<Display_Name>John Smith</Display_Name>
</Contributor_List>
</Book_API>"""
xml_doc = etree.fromstring(doc)
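Once the document is parsed, the contributor names can be pulled out with findall and findtext. A sketch using only the standard library (the same calls should also work on lxml elements), assuming Python 3:

```python
import xml.etree.ElementTree as ET

doc = """<Book_API>
<Contributor_List>
<Display_Name>Jason</Display_Name>
</Contributor_List>
<Contributor_List>
<Display_Name>John Smith</Display_Name>
</Contributor_List>
</Book_API>"""

root = ET.fromstring(doc)

# One Display_Name per Contributor_List element, in document order.
names = [
    contributor.findtext("Display_Name")
    for contributor in root.findall("Contributor_List")
]
print(names)
```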

Editing values in a xml file with Python

Hey. I want to have a config.xml file for settings in a Python web app.
I made car.xml manually. It looks like this:
<car>
<lights>
<blinkers>off</blinkers>
</lights>
</car>
Now I want to see whether the blinkers are on or off, using xml.etree.ElementTree.
import xml.etree.ElementTree as ET
tree = ET.parse('car.xml')
blinkers = tree.findtext('lights/blinkers')
print blinkers
> off
Now I want to turn the blinkers on and off, how can I do this?
You can remove nodes by calling the parent node's remove method,
and insert nodes by calling ET.SubElement:
import xml.etree.ElementTree as ET

def flip_lights(tree):
    lights = tree.find('lights')
    state = get_blinker(tree)
    blinkers = tree.find('lights/blinkers')
    lights.remove(blinkers)
    new_blinkers = ET.SubElement(lights, "blinkers")
    new_blinkers.text = 'on' if state == 'off' else 'off'

def get_blinker(tree):
    blinkers = tree.find('lights/blinkers')
    return blinkers.text

tree = ET.parse('car.xml')
print(get_blinker(tree))
# off
flip_lights(tree)
print(get_blinker(tree))
# on
flip_lights(tree)
print(get_blinker(tree))
# off
flip_lights(tree)
print(get_blinker(tree))
# on
tree.write('car2.xml')
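Removing and re-creating the element is not strictly necessary: an element's text attribute can be assigned directly. A sketch with the standard library, assuming Python 3; toggle_blinkers is a hypothetical helper name, and the XML is parsed from a string instead of car.xml so the snippet stands alone.

```python
import xml.etree.ElementTree as ET

CAR_XML = """<car>
  <lights>
    <blinkers>off</blinkers>
  </lights>
</car>"""


def toggle_blinkers(root):
    """Flip the blinkers element between 'on' and 'off' in place."""
    blinkers = root.find("lights/blinkers")
    blinkers.text = "on" if blinkers.text == "off" else "off"
    return blinkers.text


root = ET.fromstring(CAR_XML)
first = toggle_blinkers(root)
second = toggle_blinkers(root)
print(first, second)
```

With a file on disk, the same function could be applied to ET.parse('car.xml').getroot(), followed by a call to the tree's write method to save the change.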
Without addressing the merits of using XML instead of a Python module for managing configuration files, here's how to do what you asked using lxml:
>>> from lxml import etree
>>> xml = """<car>
<lights>
<blinkers>on</blinkers>
</lights>
</car>"""
>>> doc = etree.fromstring(xml)
>>> elm = doc.xpath("/car/lights/blinkers")[0]
>>> elm.text="off"
>>> etree.tostring(doc)
'<car>\n <lights>\n <blinkers>off</blinkers>\n </lights>\n</car>'
Take a look at this article.
But consider AaronMcSmooth's comment above -- this may be the wrong approach to your overall problem.
XML is a rather poor way of storing configuration settings. For one, XML is not exactly human friendly in the context of settings. In the Python universe in particular you are better off using a settings module (as @AaronMcSmooth commented). Unfortunately a lot of projects in the Java world have (mis?)used XML for settings, thereby making it a trend. I'd argue that this trend really sucks. Use native settings (a module in Python) or something more human friendly like YAML.
Use BeautifulStoneSoup. Here is the section on modifying XML:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Modifying%20the%20Parse%20Tree

xml to Python data structure using lxml

How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&amp;os=win&amp;lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc, that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want some different Python data structure, you'll have to walk through the Python data structure which lxml returns, as above mentioned, and build your different data structure yourself based on the information thus collected; lxml can't specifically help you, except by offering several helpers for finding information in the parsed structure it returns, so that collecting said info is a flexible, easy task (again, see the documentation URL above).
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.
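As one possible shape for such a conversion, here is a sketch that walks the tree recursively and builds a dictionary of lists. It is not the linked code sample; element_to_dict is a hypothetical helper, attributes are ignored, and the standard library's ElementTree is used so the snippet runs without lxml (plain lxml elements expose the same tag, text, and child iteration). Python 3 is assumed.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element into nested dicts of lists; leaves become text."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        # Repeated tags accumulate in the same list.
        result.setdefault(child.tag, []).append(element_to_dict(child))
    return result


# Shortened excerpt in the shape of the question's input.
SAMPLE = """<ApplicationPack>
  <name>Mozilla Firefox</name>
  <shortname>firefox</shortname>
  <requires name="autohotkey" />
</ApplicationPack>"""

pack = element_to_dict(ET.fromstring(SAMPLE))
print(pack["name"][0])
```

Every child tag maps to a list, even when it occurs once, which keeps lookups uniform at the cost of an extra [0]. Preserving attributes such as name="autohotkey" would need a richer leaf value than a plain string.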
