I'm coming from Java, C++ and Delphi, and now I'm working on a small project in Python. So far I have been able to solve most problems, but the following one remains unsolved:
I just want to substitute/override the getAttribute() method of an XML node by type casting the XML node to ext_xml_node, so that every function of the aforementioned project uses the new getAttribute. As far as I could read, there is no real type cast (as in C++ etc.) in Python, so I came up with the idea of casting the argument to the new class ext_xml_node in certain functions (those that, like a dispatcher, call other sub-functions with an XML node argument):
class ext_xml_node(xml_node):
    ...
    def getAttribute(self, name):
        unhandled_value = super...(name)
        handled_value = dosomethingiwth(unhandled_value)
        return handled_value
def dispatcher(self, xml_node):
    for child_node in xml_node.childNodes:
        if child_node.nodeName == 'setvariable':
            bla = ext_xml_node(child_node)
            self.handle_setvariable_statement(bla)
def handle_setvariable_statement(self, xml_node):
    varname = xml_node.getAttribute("varname")
    # Now it should call ext_xml_node.getAttribute Method
I don't want to replace every getAttribute call in this project. Is there another way (duck typing surely won't work), or should I really write a function that yields each attribute, and if so, how?
lxml provides custom element classes which should suit your needs.
xml = b'''\
<?xml version="1.0" encoding="utf-8"?>
<root xmlns="http://example.com">
  <element1>
    <element2/>
  </element1>
</root>'''
from lxml import etree
class MyXmlClass1(etree.ElementBase):
    def getAttribute(self):
        print('1')
class MyXmlClass2(etree.ElementBase):
    def getAttribute(self):
        print('2')
nsprefix = '{http://example.com}'
# Elements without a more specific mapping fall back to MyXmlClass1.
fallback = etree.ElementDefaultClassLookup(element=MyXmlClass1)
lookup = etree.ElementNamespaceClassLookup(fallback)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)
# Map <element2> in the example namespace to MyXmlClass2.
namespace = lookup.get_namespace('http://example.com')
namespace['element2'] = MyXmlClass2
root = etree.XML(xml, parser)
root.getAttribute()      # prints 1
element1 = root[0]
element1.getAttribute()  # prints 1
element2 = element1[0]
element2.getAttribute()  # prints 2
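If switching to lxml is not an option, a thin wrapper that delegates to the original node may be enough. Below is a minimal sketch, assuming the nodes come from xml.dom.minidom (the childNodes/nodeName/getAttribute names in the question suggest this, but it is an assumption). The class name ExtXmlNode and the strip/upper post-processing are hypothetical placeholders for your own handling.

```python
from xml.dom import minidom


class ExtXmlNode:
    """Wraps a DOM node and post-processes the result of getAttribute()."""

    def __init__(self, node):
        self._node = node

    def getAttribute(self, name):
        raw = self._node.getAttribute(name)
        # Hypothetical post-processing; replace with your own handling.
        return raw.strip().upper()

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so everything that is not
        # overridden above is delegated to the wrapped node.
        return getattr(self._node, attr)


doc = minidom.parseString('<root><setvariable varname=" counter "/></root>')
child = doc.documentElement.childNodes[0]

wrapped = ExtXmlNode(child)
result = wrapped.getAttribute("varname")  # goes through the override
node_name = wrapped.nodeName              # delegated to the wrapped node
print(result, node_name)
```

Only the entry points (such as the dispatcher) need to wrap the node; the code further down keeps calling getAttribute() unchanged. Nodes reached through the wrapper, for example wrapped.childNodes, are plain nodes again and would need wrapping too.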
I would like to extract elements from an XMI file using Python and write them to a new file in the order I want. For example, I have the following XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" xmlns="SimpleClass">
<Class name="AcademicInstitution" is_persistent="false">
<attrs name="name" is_primary="true" type="/11"/>
</Class>
I would like to transform it into:
Class (AcademicInstitution).
Class (false).
I have tried Python's ElementTree, but with node.attrib.get() I have to write node.attrib.get('name') and node.attrib.get('is_persistent') to get the results AcademicInstitution and false.
But how can I get these results directly, without passing 'name' and 'is_persistent' to get()?
And how can I get the 'Class' string from the XML file?
Thanks!
The attrib property of the element is a dictionary of all attributes.
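from xml.etree import ElementTree
tree = ElementTree.parse('sample.xml')
# iter() replaces getiterator(), which was removed in Python 3.9
for el in tree.iter():
    # drop the '{namespace}' part of the tag, keeping only the local name
    _, _, tag = el.tag.rpartition('}')
    for att in el.attrib:
        print(f"{tag} ({el.attrib.get(att)})")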
Yields the following from your sample (which needs a closing </xmi:XMI> to be valid):
XMI (2.0)
Class (AcademicInstitution)
Class (false)
attrs (name)
attrs (true)
attrs (/11)
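To get only the Class lines asked for in the question, the same loop can be filtered by tag name. This is a rough sketch rather than a definitive solution: the helper name describe is made up for illustration, the sample is parsed from a bytes literal (with the closing tag added) instead of a file, and the trailing period follows the desired output in the question.

```python
from xml.etree import ElementTree

SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" xmlns="SimpleClass">
  <Class name="AcademicInstitution" is_persistent="false">
    <attrs name="name" is_primary="true" type="/11"/>
  </Class>
</xmi:XMI>"""


def describe(root, tag_name):
    """Return one 'Tag (value).' line per attribute of matching elements."""
    lines = []
    for el in root.iter():
        # strip the '{namespace}' part to get the local tag name
        local = el.tag.rpartition('}')[2]
        if local != tag_name:
            continue
        for value in el.attrib.values():
            lines.append(f"{local} ({value}).")
    return lines


root = ElementTree.fromstring(SAMPLE)
lines = describe(root, "Class")
print("\n".join(lines))
```

Writing the result to a file is then just a matter of passing "\n".join(lines) to a file object's write() method.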
I need to build an XML file using special item names. This is my current code:
from lxml import etree
import lxml
from lxml.builder import E
wp = E.wp
tmp = wp("title")
print(etree.tostring(tmp))
The current output is:
b'<wp>title</wp>'
I want it to be:
b'<wp:title>title</wp:title>'
How can I create items with a name like wp:title?
You confused the namespace prefix wp with the tag name. The namespace prefix is a document-local name for a namespace URI. wp:title requires a parser to look for an xmlns:wp="..." attribute to find the namespace itself (usually a URL, but any globally unique string would do), either on the tag itself or on a parent tag. This connects tags to a unique value without making tag names too verbose to type out or read.
You need to provide the namespace and, optionally, the namespace mapping (mapping short names to full namespace names) to the element maker object. The default E object provided doesn't have a namespace or namespace map set. I'm going to assume here that wp is the http://wordpress.org/export/1.2/ WordPress namespace, as that seems the most likely, although it could also be that you are trying to send Windows Phone notifications.
Instead of using the default E element maker, create your own ElementMaker instance and pass it a namespace argument to tell lxml what URL the element belongs to. To get the right prefix on your element names, you also need to give it a nsmap dictionary that maps prefixes to URLs:
from lxml.builder import ElementMaker
namespaces = {"wp": "http://wordpress.org/export/1.2/"}
E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
title = E.title("Value of the wp:title tag")
This produces a tag with both the correct prefix, and the xmlns:wp attribute:
>>> from lxml import etree
>>> from lxml.builder import ElementMaker
>>> namespaces = {"wp": "http://wordpress.org/export/1.2/"}
>>> E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
>>> title = E.title("Value of the wp:title tag")
>>> etree.tostring(title, encoding="unicode")
'<wp:title xmlns:wp="http://wordpress.org/export/1.2/">Value of the wp:title tag</wp:title>'
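For comparison, roughly the same element can be built with the standard library's xml.etree.ElementTree, which has no ElementMaker. This is a sketch of a swapped-in alternative, not part of the lxml approach above: the prefix is registered globally with register_namespace, and the tag is written in {namespace}name form. The URI is the same assumed WordPress namespace.

```python
from xml.etree import ElementTree as ET

WP_NS = "http://wordpress.org/export/1.2/"  # assumed namespace, as above

# Tell the serializer which prefix to use for this namespace URI.
# Note that this registration is global for the running process.
ET.register_namespace("wp", WP_NS)

# '{uri}local' is how ElementTree spells a namespace-qualified tag name.
title = ET.Element(f"{{{WP_NS}}}title")
title.text = "Value of the wp:title tag"

xml_text = ET.tostring(title, encoding="unicode")
print(xml_text)
```

The serialized element should carry the wp: prefix together with the matching xmlns:wp declaration, much like the lxml output above.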
You can omit the nsmap value, but then you'd want to have such a map on a parent element of the document. In that case, you probably want to make separate ElementMaker objects for each namespace you need to support, and you put the nsmap namespace mapping on the outer-most element. When writing out the document, lxml then uses the short names throughout.
For example, creating a Wordpress WXR format document would require a number of namespaces:
from lxml.builder import ElementMaker
namespaces = {
    "excerpt": "https://wordpress.org/export/1.2/excerpt/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wfw": "http://wellformedweb.org/CommentAPI/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "https://wordpress.org/export/1.2/",
}
RootElement = ElementMaker(nsmap=namespaces)
ExcerptElement = ElementMaker(namespace=namespaces["excerpt"])
ContentElement = ElementMaker(namespace=namespaces["content"])
CommentElement = ElementMaker(namespace=namespaces["wfw"])
DublinCoreElement = ElementMaker(namespace=namespaces["dc"])
ExportElement = ElementMaker(namespace=namespaces["wp"])
and then you'd construct a document with
doc = RootElement.rss(
    RootElement.channel(
        ExportElement.wxr_version("1.2"),
        # etc. ...
    ),
    version="2.0"
)
which, when pretty printed with etree.tostring(doc, pretty_print=True, encoding="unicode"), produces:
<rss xmlns:excerpt="https://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="https://wordpress.org/export/1.2/" version="2.0">
  <channel>
    <wp:wxr_version>1.2</wp:wxr_version>
  </channel>
</rss>
Note how only the root <rss> element has xmlns attributes, and how the <wp:wxr_version> tag uses the right prefix even though we only gave it the namespace URI.
To give a different example, if you are building a Windows Phone tile notification, it'd be simpler. After all, there is just a single namespace to use:
from lxml.builder import ElementMaker
namespaces = {"wp": "WPNotification"}
E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
notification = E.Notification(
    E.Tile(
        E.BackgroundImage("https://example.com/someimage.png"),
        E.Count("42"),
        E.Title("The notification title"),
        # ...
    )
)
which produces
<wp:Notification xmlns:wp="WPNotification">
  <wp:Tile>
    <wp:BackgroundImage>https://example.com/someimage.png</wp:BackgroundImage>
    <wp:Count>42</wp:Count>
    <wp:Title>The notification title</wp:Title>
  </wp:Tile>
</wp:Notification>
Only the outer-most element, <wp:Notification>, now has the xmlns:wp attribute. All other elements only need to include the wp: prefix.
Note that the prefix used is entirely up to you and even optional. It is the namespace URI that is the real key to uniquely identifying elements across different XML documents. If you used E = ElementMaker(namespace="WPNotification", nsmap={None: "WPNotification"}) instead, and so produced a top-level element with <Notification xmlns="WPNotification"> you still have a perfectly legal XML document that, according to the XML standard, has the exact same meaning.
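The point that the prefix is cosmetic can be checked with any namespace-aware parser. Here is a small sketch using the standard library's xml.etree.ElementTree (chosen only to keep the example dependency-free; the two shortened documents are made up for illustration). Both spellings parse to the same qualified tag names.

```python
from xml.etree import ElementTree as ET

# Same namespace URI, once with a prefix and once as the default namespace.
prefixed = '<wp:Notification xmlns:wp="WPNotification"><wp:Tile/></wp:Notification>'
default = '<Notification xmlns="WPNotification"><Tile/></Notification>'

tags_prefixed = [el.tag for el in ET.fromstring(prefixed).iter()]
tags_default = [el.tag for el in ET.fromstring(default).iter()]

# The parser reports '{namespace}local' names; the prefix is gone.
print(tags_prefixed)
print(tags_prefixed == tags_default)
```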
My Python libxml2 processes files with default attributes differently depending on something, and I want to know what. Here is an example using the DITA DTD (the package can be downloaded from www.dita-ot.org):
import libxml2
import libxsltmod
s = """<!DOCTYPE map PUBLIC "-//OASIS//DTD XDITA Map//EN"
"file://.../dita-ot-2.2.1/plugins/org.oasis-open.dita.v1
_2/dtd/technicalContent/dtd/map.dtd">
<map title="Empty map">
</map>"""
libxml2.substituteEntitiesDefault(1)
xmldoc = libxml2.parseDoc(s)
print(xmldoc)
The output is as desired:
<?xml version="1.0"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD XDITA Map//EN"
"file://.../dita-ot-2.2.1/plugins/org.oasis-open.dita.v1
_2/dtd/technicalContent/dtd/map.dtd">
<map xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/"
title="Empty map" ditaarch:DITAArchVersion="1.2" domains="(topic delay-d)
(map mapgroup-d) (topic indexing-d)
(map glossref-d) (topic hi-d)
(topic ut-d) (topic hazard-d)
(topic abbrev-d) (topic pr-d)
(topic sw-d) (topic ui-d)
" class="- map/map ">
</map>
But if I comment out import libxsltmod, the result is:
<?xml version="1.0"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD XDITA Map//EN"
"file://.../dita-ot-2.2.1/plugins/org.oasis-open.dita.v
1_2/dtd/technicalContent/dtd/map.dtd">
<map title="Empty map">
</map>
So libxsltmod does something to activate default attribute expansion. Could you please suggest what it does, and how I can activate this functionality from Python?
I have no idea how libxsltmod enables this setting globally, but normally, DTD default attributes are added with the parser option XML_PARSE_DTDATTR. Use readDoc instead of parseDoc to provide parser options:
xmldoc = libxml2.readDoc(s, None, None, libxml2.XML_PARSE_DTDATTR)
Or, if you also want to substitute entities:
flags = libxml2.XML_PARSE_NOENT | libxml2.XML_PARSE_DTDATTR
xmldoc = libxml2.readDoc(s, None, None, flags)
I've accepted the answer from @nwellnhof, but I would also like to publish my investigations.
The initialization function initlibxsltmod of the libxslt module sets a global variable:
xmlLoadExtDtdDefaultValue = XML_DETECT_IDS | XML_COMPLETE_ATTRS;
I have not found any possibility to access this variable from the libxml2 Python/C binding code, but I have found that this variable is used to initialize a default 'parser context', and it is possible to create and use the parser context manually:
ctxt = libxml2.createDocParserCtxt(s)
opts = libxml2.XML_PARSE_NOENT | libxml2.XML_PARSE_DTDATTR
ctxt.ctxtUseOptions(opts)
ctxt.parseDocument()
xmldoc = ctxt.doc()
del ctxt
The Python/C function readDoc works in exactly this way (create context, set options, parse). Creating the context manually is verbose, but is probably necessary in some situations.
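To see what "default attribute expansion" means without installing the DITA DTD, here is a small illustration using the standard library's expat-based xml.etree.ElementTree instead of libxml2 (so none of the libxml2 calls above are involved). The one-line ATTLIST in the internal DTD subset is made up for the example; as far as I know, the stdlib parser applies defaults from the internal subset only and does not fetch external DTDs such as map.dtd.

```python
from xml.etree import ElementTree as ET

# Hypothetical internal DTD subset declaring a default for map/@class.
DOC = """<!DOCTYPE map [
  <!ATTLIST map class CDATA "- map/map ">
]>
<map title="Empty map"/>"""

root = ET.fromstring(DOC)
attrs = dict(root.attrib)

# 'class' was never written on <map>; it comes from the ATTLIST default.
print(attrs)
```

With libxml2, the corresponding behaviour for an external DTD is what XML_PARSE_DTDATTR switches on.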
I would like to format a bit of XML and pass it to a Django template. In the shell, I am able to successfully create the XML string using the following code:
locations = Location.objects.all()
industries = Industry.objects.all()
root = ET.Element("root")
for industry in industries:
    doc = ET.SubElement(root, "industry")
    doc.set("name", industry.text)
    for location in locations:
        if industry.id == location.company.industry_id:
            item = ET.SubElement(doc, "item")
            latitude = ET.SubElement(item, "latitude")
            latitude.text = str(location.latitude)
            longitude = ET.SubElement(item, "longitude")
            longitude.text = str(location.longitude)
Then, still in the shell, ET.dump(root) outputs the XML I expect.
But, how can I use ET.dump(root) to pass the XML string from a Django view to a template file?
I have tried to pass it as {{xml_items}} using 'xml_items': ET.dump(root) and I have also tried to assign ET.dump(root) to a variable and pass it like 'xml_items': xml_items.
In both cases, the template outputs None for {{xml_items}}
dump is just a debug function. You should use the tostring function:
ET.tostring(root)
which will give you exactly what ET.dump() prints, but as a string.
If you're using lxml, you can also use
ET.tostring(root, pretty_print=True)
to get better-looking XML, but if this is just going to be consumed by another code layer, you don't really need that anyway. Note that pretty_print is not available in the stock ElementTree.
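The difference is easy to see in a shell session. A short sketch with the stock xml.etree.ElementTree follows; the element values are invented, and the context dict only stands in for whatever the view passes to the template. Note that on Python 3 tostring() returns bytes unless encoding="unicode" is given.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
industry = ET.SubElement(root, "industry", name="Retail")  # made-up data
item = ET.SubElement(industry, "item")
ET.SubElement(item, "latitude").text = "52.52"
ET.SubElement(item, "longitude").text = "13.40"

# dump() writes the XML to stdout and returns None,
# which is why the template rendered "None".
dumped = ET.dump(root)

# tostring() returns the serialized XML; encoding="unicode" yields a str.
xml_items = ET.tostring(root, encoding="unicode")

# Stand-in for the template context built in the view.
context = {"xml_items": xml_items}
print(dumped is None, type(xml_items).__name__)
```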
I want to parse XML content with Python's libxml2 using XPath. I followed this example and that tutorial. The XML file is:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://purl.org/atom/ns#" version="0.3">
<title>Gmail - Inbox for myemailaddress@gmail.com</title>
<tagline>New messages in your Gmail Inbox</tagline>
<fullcount>1</fullcount>
<link rel="alternate" href="http://mail.google.com/mail" type="text/html"/>
<modified>2011-05-04T18:56:19Z</modified>
</feed>
This XML is stored in a file called "atom", and I try the following:
>>> import libxml2
>>> myfile = open('/pathtomyfile/atom', 'r').read()
>>> data = libxml2.parseDoc(myfile)
>>> data.xpathEval('/fullcount')
[]
>>>
Now, as you can see, it returns an empty list. No matter what XPath expression I provide, it returns an empty list. However, if I use the * wildcard, I get a list of all nodes:
>>> data.xpathEval('//*')
[<xmlNode (feed) object at 0xb73862cc>, <xmlNode (title) object at 0xb738650c>, <xmlNode (tagline) object at 0xb73865ec>, <xmlNode (fullcount) object at 0xb738660c>, <xmlNode (link) object at 0xb738662c>, <xmlNode (modified) object at 0xb738664c>]
Now I don't understand, judging from the working examples above, why XPath doesn't find the "fullcount" node or any other: I'm using the same syntax, after all...
Any idea or suggestion? Thanks.
Your XPath is failing because you need to specify the purl namespace on the node:
import libxml2
tree = libxml2.parseDoc(data)
xp = tree.xpathNewContext()
xp.xpathRegisterNs("purl", "http://purl.org/atom/ns#")
print(xp.xpathEval('//purl:fullcount'))
Result:
[<xmlNode (fullcount) object at 0x7fbbeba9ef80>]
(Also: check out lxml, it has a nicer, higher-level interface).
Firstly:
/fullcount is an absolute path, so it's looking for the <fullcount> element in the root of the document, when the element is in fact within the <feed> element.
Secondly:
You need to specify the namespace. This is how you would do it with lxml:
import lxml.etree as etree
tree = etree.parse('/pathtomyfile/atom')
fullcounts = tree.xpath('//ns:fullcount',
                        namespaces={'ns': "http://purl.org/atom/ns#"})
print(etree.tostring(fullcounts[0], encoding="unicode"))
Which would give you:
<fullcount xmlns="http://purl.org/atom/ns#">1</fullcount>
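The same effect can be reproduced with the standard library alone, which may help when neither libxml2 nor lxml is at hand. This is a sketch with a shortened copy of the feed; the prefix name atom in the mapping is arbitrary, and only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

FEED = """<feed xmlns="http://purl.org/atom/ns#" version="0.3">
  <title>Gmail - Inbox</title>
  <fullcount>1</fullcount>
</feed>"""

root = ET.fromstring(FEED)

# Prefix chosen freely for the query; it maps to the feed's default namespace.
ns = {"atom": "http://purl.org/atom/ns#"}

without_ns = root.findall(".//fullcount")        # no namespace: nothing found
with_ns = root.findall(".//atom:fullcount", ns)  # qualified name: one match

count = int(with_ns[0].text)
print(len(without_ns), len(with_ns), count)
```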