Expand default (dita) attributes - python

My Python libxml2 processes files that rely on DTD default attributes differently depending on some setting, and I want to know which one. Here is an example using the DITA DTD (the package can be downloaded from www.dita-ot.org):
import libxml2
import libxsltmod
s = """<!DOCTYPE map PUBLIC "-//OASIS//DTD XDITA Map//EN"
"file://.../dita-ot-2.2.1/plugins/org.oasis-open.dita.v1
_2/dtd/technicalContent/dtd/map.dtd">
<map title="Empty map">
</map>"""
libxml2.substituteEntitiesDefault(1)
xmldoc = libxml2.parseDoc(s)
print xmldoc
The output is as desired:
<?xml version="1.0"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD XDITA Map//EN"
"file://.../dita-ot-2.2.1/plugins/org.oasis-open.dita.v1
_2/dtd/technicalContent/dtd/map.dtd">
<map xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/"
title="Empty map" ditaarch:DITAArchVersion="1.2" domains="(topic delay-d)
(map mapgroup-d) (topic indexing-d)
(map glossref-d) (topic hi-d)
(topic ut-d) (topic hazard-d)
(topic abbrev-d) (topic pr-d)
(topic sw-d) (topic ui-d)
" class="- map/map ">
</map>
But if I comment-out import libxsltmod, the result is:
<?xml version="1.0"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD XDITA Map//EN"
"file://.../dita-ot-2.2.1/plugins/org.oasis-open.dita.v
1_2/dtd/technicalContent/dtd/map.dtd">
<map title="Empty map">
</map>
So libxsltmod does something that activates default attribute expansion. Could you suggest what it is, and how I can activate this functionality from Python?

I have no idea how libxsltmod enables this setting globally, but normally, DTD default attributes are added with the parser option XML_PARSE_DTDATTR. Use readDoc instead of parseDoc to provide parser options:
xmldoc = libxml2.readDoc(s, None, None, libxml2.XML_PARSE_DTDATTR)
Or, if you also want to substitute entities:
flags = libxml2.XML_PARSE_NOENT | libxml2.XML_PARSE_DTDATTR
xmldoc = libxml2.readDoc(s, None, None, flags)
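If the libxml2 Python bindings are not at hand, the effect of the option can be tried out with lxml, which is built on libxml2. The following is only a sketch: it assumes lxml is installed and that its attribute_defaults parser option is the counterpart of XML_PARSE_DTDATTR. The two-line DTD, the file name map.dtd and the helper parse_both_ways are made up for illustration and stand in for the real DITA DTD.

```python
import os
import pathlib
import tempfile

from lxml import etree

# Hypothetical stand-in for the DITA map DTD: one attribute with a default.
DTD_TEXT = """<!ELEMENT map EMPTY>
<!ATTLIST map
    title CDATA #IMPLIED
    class CDATA "- map/map ">
"""


def parse_both_ways():
    """Parse the same document without and with DTD attribute defaults."""
    with tempfile.TemporaryDirectory() as tmp:
        dtd_path = os.path.join(tmp, "map.dtd")
        with open(dtd_path, "w") as handle:
            handle.write(DTD_TEXT)

        # Reference the DTD through a file:// URI, as in the question.
        dtd_uri = pathlib.Path(dtd_path).as_uri()
        doc = '<!DOCTYPE map SYSTEM "%s"><map title="Empty map"/>' % dtd_uri

        # Default parser: the external DTD is not consulted.
        plain = etree.fromstring(doc)

        # attribute_defaults=True asks the parser to read the DTD and
        # add the attribute defaults it declares.
        parser = etree.XMLParser(attribute_defaults=True)
        expanded = etree.fromstring(doc, parser)

        return dict(plain.attrib), dict(expanded.attrib)


plain_attrs, expanded_attrs = parse_both_ways()
print(plain_attrs)
print(expanded_attrs)
```

With the default parser only the attributes written in the document should appear; with attribute_defaults=True the class default declared in the DTD should be added as well.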

I've accepted the answer from @nwellnhof, but I would also like to publish my own findings.
The initialization function initlibxsltmod of the libxslt module sets a global variable:
xmlLoadExtDtdDefaultValue = XML_DETECT_IDS | XML_COMPLETE_ATTRS;
I have not found any way to access this variable from the libxml2 Python/C binding code, but I have found that it is used to initialize the default parser context, and that it is possible to create and use a parser context manually:
ctxt = libxml2.createDocParserCtxt(s)
opts = libxml2.XML_PARSE_NOENT | libxml2.XML_PARSE_DTDATTR
ctxt.ctxtUseOptions(opts)
ctxt.parseDocument()
xmldoc = ctxt.doc()
del ctxt
The Python/C function readDoc works in exactly this way (create a context, set options, parse). Creating the context manually is more verbose, but is probably necessary in some situations.

Related

python lxml how i use tag in items name?

I need to build an XML file whose element names have a special form. This is my current code:
from lxml import etree
import lxml
from lxml.builder import E
wp = E.wp
tmp = wp("title")
print(etree.tostring(tmp))
The current output is:
b'<wp>title</wp>'
I want it to be:
b'<wp:title>title</wp:title>'
How can I create elements with a name like wp:title?
You confused the namespace prefix wp with the tag name. The namespace prefix is a document-local name for a namespace URI. wp:title requires a parser to look for a xmlns:wp="..." attribute to find the namespace itself (usually a URL but any globally unique string would do), either on the tag itself or on a parent tag. This connects tags to a unique value without making tag names too verbose to type out or read.
You need to provide the namespace, and optionally, the namespace mapping (mapping short names to full namespace names) to the element maker object. The default E object provided doesn't have a namespace or namespace map set. I'm going to assume here that wp is the http://wordpress.org/export/1.2/ Wordpress namespace, as that seems the most likely, although it could also be that you are trying to send Windows Phone notifications.
Instead of using the default E element maker, create your own ElementMaker instance and pass it a namespace argument to tell lxml what URL the element belongs to. To get the right prefix on your element names, you also need to give it a nsmap dictionary that maps prefixes to URLs:
from lxml.builder import ElementMaker
namespaces = {"wp": "http://wordpress.org/export/1.2/"}
E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
title = E.title("Value of the wp:title tag")
This produces a tag with both the correct prefix, and the xmlns:wp attribute:
>>> from lxml import etree
>>> from lxml.builder import ElementMaker
>>> namespaces = {"wp": "http://wordpress.org/export/1.2/"}
>>> E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
>>> title = E.title("Value of the wp:title tag")
>>> etree.tostring(title, encoding="unicode")
'<wp:title xmlns:wp="http://wordpress.org/export/1.2/">Value of the wp:title tag</wp:title>'
You can omit the nsmap value, but then you'd want to have such a map on a parent element of the document. In that case, you probably want to make separate ElementMaker objects for each namespace you need to support, and you put the nsmap namespace mapping on the outer-most element. When writing out the document, lxml then uses the short names throughout.
For example, creating a Wordpress WXR format document would require a number of namespaces:
from lxml.builder import ElementMaker
namespaces = {
    "excerpt": "https://wordpress.org/export/1.2/excerpt/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wfw": "http://wellformedweb.org/CommentAPI/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "https://wordpress.org/export/1.2/",
}
RootElement = ElementMaker(nsmap=namespaces)
ExcerptElement = ElementMaker(namespace=namespaces["excerpt"])
ContentElement = ElementMaker(namespace=namespaces["content"])
CommentElement = ElementMaker(namespace=namespaces["wfw"])
DublinCoreElement = ElementMaker(namespace=namespaces["dc"])
ExportElement = ElementMaker(namespace=namespaces["wp"])
and then you'd construct a document with
doc = RootElement.rss(
    RootElement.channel(
        ExportElement.wxr_version("1.2"),
        # etc. ...
    ),
    version="2.0"
)
which, when pretty printed with etree.tostring(doc, pretty_print=True, encoding="unicode"), produces:
<rss xmlns:excerpt="https://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="https://wordpress.org/export/1.2/" version="2.0">
  <channel>
    <wp:wxr_version>1.2</wp:wxr_version>
  </channel>
</rss>
Note how only the root <rss> element has xmlns attributes, and how the <wp:wxr_version> tag uses the right prefix even though we only gave it the namespace URI.
To give a different example, if you are building a Windows Phone tile notification, it'd be simpler. After all, there is just a single namespace to use:
from lxml.builder import ElementMaker
namespaces = {"wp": "WPNotification"}
E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
notification = E.Notification(
    E.Tile(
        E.BackgroundImage("https://example.com/someimage.png"),
        E.Count("42"),
        E.Title("The notification title"),
        # ...
    )
)
which produces
<wp:Notification xmlns:wp="WPNotification">
  <wp:Tile>
    <wp:BackgroundImage>https://example.com/someimage.png</wp:BackgroundImage>
    <wp:Count>42</wp:Count>
    <wp:Title>The notification title</wp:Title>
  </wp:Tile>
</wp:Notification>
Only the outer-most element, <wp:Notification>, now has the xmlns:wp attribute. All other elements only need to include the wp: prefix.
Note that the prefix used is entirely up to you and even optional. It is the namespace URI that is the real key to uniquely identifying elements across different XML documents. If you used E = ElementMaker(namespace="WPNotification", nsmap={None: "WPNotification"}) instead, and so produced a top-level element with <Notification xmlns="WPNotification"> you still have a perfectly legal XML document that, according to the XML standard, has the exact same meaning.
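A quick way to convince yourself of this is to parse the same structure with different prefixes and compare the qualified names. This is a small sketch using the standard library's xml.etree.ElementTree rather than lxml (an assumption made only to keep it dependency-free); the prefix n is invented for the comparison.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same document: wp prefix, default namespace,
# and an arbitrary other prefix.
prefixed = ET.fromstring(
    '<wp:Notification xmlns:wp="WPNotification"><wp:Tile/></wp:Notification>'
)
default_ns = ET.fromstring(
    '<Notification xmlns="WPNotification"><Tile/></Notification>'
)
other_prefix = ET.fromstring(
    '<n:Notification xmlns:n="WPNotification"><n:Tile/></n:Notification>'
)

# Collect the {namespace}localname form of every element in each document.
tags = [
    [element.tag for element in doc.iter()]
    for doc in (prefixed, default_ns, other_prefix)
]

for tag_list in tags:
    print(tag_list)
```

All three lists come out identical, because the parser resolves each prefix to the namespace URI and discards the prefix itself.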

Python object typecasting and XML

I'm coming from Java, C++ and Delphi, and am now working on a small project in Python. So far I have been able to solve most problems, but the following one remains:
I want to override the getAttribute() method of an XML node by type casting the node to ext_xml_node, so that every function in the project uses the new getAttribute(). As far as I can tell, Python has no real type casts (as in C++ etc.), so my idea was to cast the argument to the new class ext_xml_node in certain functions (those that, like a dispatcher, call other sub-functions with an XML node argument):
class ext_xml_node(xml_node):
    ...
    def getAttribute(self, name):
        unhandled_value = super(ext_xml_node, self).getAttribute(name)
        handled_value = dosomethingwith(unhandled_value)
        return handled_value
def dispatcher(self, xml_node):
    for child_node in xml_node.childNodes:
        if child_node.nodeName == 'setvariable':
            bla = ext_xml_node(child_node)
            self.handle_setvariable_statement(bla)
def handle_setvariable_statement(self, xml_node):
    varname = xml_node.getAttribute("varname")
    # Now it should call the ext_xml_node.getAttribute method
I don't want to replace every getAttribute call in this project. Is there another way (duck typing surely won't work), or should I really write a function that iterates over each attribute, and if so, how?
lxml provides custom element classes which should suit your needs.
xml = '''\
<?xml version="1.0" encoding="utf-8"?>
<root xmlns="http://example.com">
<element1>
<element2/>
</element1>
</root>'''
from lxml import etree
class MyXmlClass1(etree.ElementBase):
    def getAttribute(self):
        print '1'
class MyXmlClass2(etree.ElementBase):
    def getAttribute(self):
        print '2'
nsprefix = '{http://example.com}'
fallback = etree.ElementDefaultClassLookup(element = MyXmlClass1)
lookup = etree.ElementNamespaceClassLookup(fallback)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)
namespace = lookup.get_namespace('http://example.com')
namespace['element2'] = MyXmlClass2
root = etree.XML(xml, parser)
root.getAttribute()
>>> 1
element1 = root.getchildren()[0]
element1.getAttribute()
>>> 1
element2 = element1.getchildren()[0]
element2.getAttribute()
>>> 2
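If switching to lxml is not an option, a thin wrapper that overrides one method and delegates everything else comes close to the "cast" the question asks for. The sketch below assumes the nodes come from xml.dom.minidom (a guess based on the childNodes, nodeName and getAttribute names in the question); the class name ExtXmlNode, the sample document and the strip/lower post-processing are placeholders.

```python
from xml.dom import minidom


class ExtXmlNode(object):
    """Wrap a DOM node: override getAttribute, delegate everything else."""

    def __init__(self, node):
        self._node = node

    def __getattr__(self, name):
        # Only called for names not found on the wrapper itself,
        # so childNodes, nodeName, etc. fall through to the real node.
        return getattr(self._node, name)

    def getAttribute(self, name):
        raw = self._node.getAttribute(name)
        # Placeholder for the project's real post-processing.
        return raw.strip().lower()


def handle_setvariable_statement(xml_node):
    # Unchanged caller: it just calls getAttribute on whatever it is given.
    return xml_node.getAttribute("varname")


doc = minidom.parseString(
    '<script><setvariable varname="  Counter "/></script>'
)
results = [
    handle_setvariable_statement(ExtXmlNode(child))
    for child in doc.documentElement.childNodes
    if child.nodeName == "setvariable"
]
print(results)
```

Existing functions keep calling getAttribute as before; only the dispatcher has to wrap the node before passing it on.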

lxml - using find method to find specific tag? (does not find)

I have an XML file in which I need to update the values of some specific tags. The header contains some tags with namespace prefixes. Using find for those tags works, but if I search for other tags that have no prefix, it does not find them.
I tried relative and absolute paths, but neither finds anything. The code looks like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
# get its namespace map, excluding default namespace
nsmap = {k:v for k,v in root.nsmap.iteritems() if k}
# Replace values in tags
identity = tree.find('.//env:identity', nsmap)
identity.text = 'Placeholder' # works fine
e01_0017 = tree.find('.//e01_0017') # does not find
e01_0017.text = 'Placeholder' # and then, of course, this raises AttributeError: 'NoneType' object has no attribute 'text'
# Also tried like this, but still not working
e01_0017 = tree.find('Envelope/Body/IVOIC/UNB/cmp04/e01_0017')
I even tried finding, for example, the Body tag, but it does not find that either.
This is what the XML structure looks like:
<?xml version="1.0" encoding="ISO-8859-1"?><Envelope xmlns="http://www.someurl.com/TTT" xmlns:env="http://www.someurl.com/TTT_Envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="http://www.someurl.com/TTT TTT_INVOIC.xsd"><Header>
<env:delivery>
<env:to>
<env:address>Test</env:address>
</env:to>
<env:from>
<env:address>Test2</env:address>
</env:from>
<env:reliability>
<env:sendReceiptTo/>
<env:receiptRequiredBy/>
</env:reliability>
</env:delivery>
<env:properties>
<env:identity>some code</env:identity>
<env:sentAt>2006-03-17T00:38:04+01:00</env:sentAt>
<env:expiresAt/>
<env:topic>http://www.someurl.com/TTT/</env:topic>
</env:properties>
<env:manifest>
<env:reference uri="#INVOIC#D00A">
<env:description>Doc Name Descr</env:description>
</env:reference>
</env:manifest>
<env:process>
<env:type></env:type>
<env:instance/>
<env:handle></env:handle>
</env:process>
</Header>
<Body>
<INVOIC>
<UNB>
<cmp01>
<e01_0001>1</e01_0001>
<e02_0002>1</e02_0002>
</cmp01>
<cmp02>
<e01_0004>from</e01_0004>
</cmp02>
<cmp03>
<e01_0010>to</e01_0010>
</cmp03>
<cmp04>
<e01_0017>060334</e01_0017>
<e02_0019>1652</e02_0019>
</cmp04>
<e01_0020>1</e01_0020>
<cmp05>
<e01_0022>1</e01_0022>
</cmp05>
</UNB>
</INVOIC>
</Body>
</Envelope>
Update: It seems something is wrong with the header or envelope tags. If, for example, I use XML without the header and envelope info, the tags are found just fine. If I include the envelope attributes and header, it stops finding tags. I have updated the XML sample with the header info.
The problem is that elements such as e01_0017 also have a namespace: they inherit the default namespace declared on an ancestor, which in this case goes all the way back to <Envelope>. The namespace of your elements is "http://www.someurl.com/TTT".
You have two options.
Either specify the namespace directly in the path expression, for example:
e01_0017 = tree.find('.//{http://www.someurl.com/TTT}e01_0017')
Demo (for your XML):
In [39]: e01_0017 = tree.find('.//{http://www.someurl.com/TTT}e01_0017')
In [40]: e01_0017
Out[40]: <Element {http://www.someurl.com/TTT}e01_0017 at 0x2fe78c8>
The other option is to add the default namespace to nsmap under a key of your choosing and then use that prefix in the path expression, for example:
nsmap = {(k or 'def'):v for k,v in root.nsmap.items()}
e01_0017 = tree.find('.//def:e01_0017',nsmap)
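Both options can be tried on a cut-down version of the document. The sketch below uses the standard library's xml.etree.ElementTree, whose find() accepts the same path syntax as used here with lxml; the prefix def is an arbitrary name chosen for the default namespace, and the XML is abbreviated from the question.

```python
import xml.etree.ElementTree as ET

XML = """<Envelope xmlns="http://www.someurl.com/TTT"
          xmlns:env="http://www.someurl.com/TTT_Envelope">
  <Header>
    <env:properties><env:identity>some code</env:identity></env:properties>
  </Header>
  <Body>
    <INVOIC><UNB><cmp04><e01_0017>060334</e01_0017></cmp04></UNB></INVOIC>
  </Body>
</Envelope>"""

root = ET.fromstring(XML)
ns = {
    "def": "http://www.someurl.com/TTT",  # arbitrary prefix for the default namespace
    "env": "http://www.someurl.com/TTT_Envelope",
}

# Without a namespace the element is not found.
unqualified = root.find(".//e01_0017")

# Option 1: namespace URI written inline in braces.
clark = root.find(".//{http://www.someurl.com/TTT}e01_0017")

# Option 2: prefix resolved through the mapping.
prefixed = root.find(".//def:e01_0017", ns)

# A full path works too, as long as every step is qualified.
full_path = root.find("def:Body/def:INVOIC/def:UNB/def:cmp04/def:e01_0017", ns)

prefixed.text = "Placeholder"
print(unqualified, clark.text)
```

Every step of a path needs the namespace, which is why the Envelope/Body/... attempt in the question fails even though the element names are spelled correctly.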

Python 2.5 ElementTree handle xml node with namespace

I'm using Python 2.5 and ElementTree 1.2 to parse an XML document that looks like this:
<cm:CompositeMessage xmlns:cm="http://www.xyz.com">
<cm:Message>
<cm:Body format="text/xml">
<CHMasterbook >
<event>
<eventName>Snapshot</eventName>
<date>2013-10-25</date>
<time>20:59:02</time>
</event>
</CHMasterbook>
</cm:Body>
</cm:Message>
</cm:CompositeMessage>
After I register the namespace
ET._namespace_map['http://www.xyz.com'] = 'cm'
I can parse the XML document and locate the 'event' node
tree = ElementTree(fromstring(xml))
tree.findall('./{http://www.xyz.com}Message/{http://www.xyz.com}Body/CHMasterBook/event')
But if 'CHMasterbook' node has namespaces like
<CHMasterbook xmlns="http://uri.xyz.com/Chorus/Message" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uri.xyz.com/Chorus/Message ../schema/chorus-master-book-msg.xsd">
tree.findall only returns an empty list and can no longer locate the 'event' node. I also tried to register those namespaces like this:
ET._namespace_map['http://uri.xyz.com/Chorus/Message'] = 'xmlns'
ET._namespace_map['http://www.w3.org/2001/XMLSchema-instance'] = 'xmlns:xsi'
ET._namespace_map['http://uri.xyz.com/Chorus/Message ../schema/chorus-master-book-msg.xsd'] = 'xsi:schemaLocationi'
But it didn't help.
I can only use Python 2.5 and ElementTree 1.2 (can't use lxml). Does anyone know how to locate the 'event' node with 'CHMasterbook' having those namespaces?
Try this:
tree = ElementTree(fromstring(xml))
tree.findall('./{http://www.xyz.com}Message'
'/{http://www.xyz.com}Body'
'/{http://uri.xyz.com/Chorus/Message}CHMasterbook'
'/{http://uri.xyz.com/Chorus/Message}event')
In your example, you sometimes write CHMasterbook and sometimes CHMasterBook. Remember that XML is case-sensitive.
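When findall unexpectedly returns an empty list, it can help to list the fully qualified tag of every element and build the path from what you see. A minimal sketch, written for a current Python (on 2.5, getiterator() takes the place of iter()); the XML is shortened from the question and the constants CM and CH are just illustrative names:

```python
import xml.etree.ElementTree as ET

XML = """<cm:CompositeMessage xmlns:cm="http://www.xyz.com">
  <cm:Message>
    <cm:Body format="text/xml">
      <CHMasterbook xmlns="http://uri.xyz.com/Chorus/Message">
        <event><eventName>Snapshot</eventName></event>
      </CHMasterbook>
    </cm:Body>
  </cm:Message>
</cm:CompositeMessage>"""

root = ET.fromstring(XML)

# Show which namespace each element actually ended up in.
qualified_names = [element.tag for element in root.iter()]
for name in qualified_names:
    print(name)

# Build the path from the qualified names; children of CHMasterbook
# inherit its default namespace.
CM = "{http://www.xyz.com}"
CH = "{http://uri.xyz.com/Chorus/Message}"
path = "/".join([CM + "Message", CM + "Body", CH + "CHMasterbook", CH + "event"])
events = root.findall(path)
print(len(events))
```

The listing makes it visible that event and eventName carry the Chorus namespace even though they have no xmlns attribute of their own.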

Python: libxml2 xpath returns empty list

I want to parse XML content with Python's libxml2 using XPath. I followed this example and that tutorial. The XML file is:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://purl.org/atom/ns#" version="0.3">
<title>Gmail - Inbox for myemailaddress@gmail.com</title>
<tagline>New messages in your Gmail Inbox</tagline>
<fullcount>1</fullcount>
<link rel="alternate" href="http://mail.google.com/mail" type="text/html"/>
<modified>2011-05-04T18:56:19Z</modified>
</feed>
This XML is stored in a file called "atom", and I try the following:
>>> import libxml2
>>> myfile = open('/pathtomyfile/atom', 'r').read()
>>> data = libxml2.parseDoc(myfile)
>>> data.xpathEval('/fullcount')
[]
>>>
As you can see, it returns an empty list. No matter what XPath expression I provide, it returns an empty list. However, if I use the * wildcard, I get a list of all nodes:
>>> data.xpathEval('//*')
[<xmlNode (feed) object at 0xb73862cc>, <xmlNode (title) object at 0xb738650c>, <xmlNode (tagline) object at 0xb73865ec>, <xmlNode (fullcount) object at 0xb738660c>, <xmlNode (link) object at 0xb738662c>, <xmlNode (modified) object at 0xb738664c>]
I don't understand, judging from the working examples above, why XPath doesn't find the "fullcount" node or any other; I'm using the same syntax after all.
Any ideas or suggestions? Thanks.
Your XPath is failing because you need to specify the purl namespace on the node:
import libxml2
tree = libxml2.parseDoc(data)
xp = tree.xpathNewContext()
xp.xpathRegisterNs("purl", "http://purl.org/atom/ns#")
print xp.xpathEval('//purl:fullcount')
Result:
[<xmlNode (fullcount) object at 0x7fbbeba9ef80>]
(Also: check out lxml, it has a nicer, higher-level interface).
Firstly:
/fullcount is an absolute path, so it's looking for the <fullcount> element in the root of the document, when the element is in fact within the <feed> element.
Secondly:
You need to specify the namespace. This is how you would do it with lxml:
import lxml.etree as etree
tree = etree.parse('/pathtomyfile/atom')
fullcounts = tree.xpath('//ns:fullcount',
namespaces={'ns': "http://purl.org/atom/ns#"})
print etree.tostring(fullcounts[0])
Which would give you:
<fullcount xmlns="http://purl.org/atom/ns#">1</fullcount>
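The two points can be checked side by side. This is a sketch assuming lxml is installed; the feed is shortened from the question and the prefix ns is an arbitrary choice, as in the answer above.

```python
from lxml import etree

FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://purl.org/atom/ns#" version="0.3">
  <title>Gmail - Inbox</title>
  <fullcount>1</fullcount>
</feed>"""

root = etree.fromstring(FEED)
ns = {"ns": "http://purl.org/atom/ns#"}

results = {
    # Absolute path, no namespace: looks for a root element named fullcount.
    "/fullcount": root.xpath("/fullcount"),
    # Anywhere in the document, but still without a namespace.
    "//fullcount": root.xpath("//fullcount"),
    # Absolute path with every step qualified.
    "/ns:feed/ns:fullcount": root.xpath("/ns:feed/ns:fullcount", namespaces=ns),
    # Anywhere in the document, qualified.
    "//ns:fullcount": root.xpath("//ns:fullcount", namespaces=ns),
}

for expression, nodes in results.items():
    print(expression, len(nodes))
```

Only the two namespace-qualified expressions match; the unqualified ones stay empty regardless of whether the path is absolute or uses //.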
