I am working on a xml parser.
The goal is to parse a number of different xml files where prefixes and tags remain consistent but namespaces change.
I am hence trying either:
to parse the xml just by <prefix:tags> without resolving (replacing) the prefix with the namespace. The prefixes remain unchanged from document to document.
to load automatically the namespaces so that the identifier (<prefix:tag>) could be replaced with the proper namespace.
just parse the xml by tag
I have tried with xml.etree.ElementTree.
I also had a look at lxml
I did not find any configuration option of the XMLParser in lxml that could help me out although here I could read an answer where the author suggests that lxml should be able to collect namespaces for me automatically.
Interestingly, parsed_file = etree.XML(file) fails with the error:
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
One example of the files I would like to parse is here
Do not care about ns prefixes, care about complete namespaces
Sometime, people do care about those short prefixes and forgetting, the are of secondary importance. They are only short reference to fully qualified namespace. E.g.
xmlns:trw="http://www.trw.com/20131231"
in xml means, from now on, the "trw:" stands for fully qualified namespace "http://www.trw.com/20131231". Note, that this prefix can be redefined to any other namespace in any following element and may get completely different meaning.
On the other hand, when you care about real meaning, what means here fully qualified namespace, you may think of "trw:row" as "{http://www.trw.com/20131231}row". This translated meaning will be reliable and will not change with prefix changes.
Parsing referred xml
The link to http://edgar.sec.gov/Archives/edgar/data/1267097/000104746914000925/trw-20131231.xml leads to an xml, which validates by xmlstarlet and which lxml is able to parse.
The error message you show is referring to very first character of the stream, so chances are you either met BOM byte in your file, or you are trying to read xml, which is gzipped and shall be decompressed first.
lxml and namespaces
lxml works with namespaces well. It allows you to use XPath expressions, which use namespaces. With controlling namspace prefix on output it is a bit more complex, as it is dependent on xmlns attributes, which are part of serialized document. If you want to modify the prefixes, you must somehow organize these xmlns attributes, often by moving all of the to the root element. At the same time, lxml keeps track of fully qualified namespace of each element, so at the moment of serialization, it will respect this full name as well as currently valid prefix for this namespace.
Handling these xmlna attributes is a bit of more code, refer to lxml documentation.
items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")
did the job. On top of that I had to browse the generated list items manually to define my other desired filtering functions.
Related
I am trying to parse a .kml file into Python using the xml module (after failing to make this work in BeautifulSoup, which I use for HTML).
As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:
from lxml import etree
tree=etree.parse('kmlfile')
Here is the example from the tutorial I am trying to emulate:
If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:
for element in root.getiterator("child"):
print element.tag, '-', element.text
I would like to get all data under 'Placemark', so I tried
for i in tree.getiterterator("Placemark"):
print i, type(i)
which doesn't give me anything. What does work is:
for i in tree.getiterterator("{http://www.opengis.net/kml/2.2}Placemark"):
print i, type(i)
I don't understand how this comes about. The www.opengis.net is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...) , but I don't understand
how the part in {} relates to my specific example at all
why it is different from the tutorial
and what I am doing wrong
Any help is much appreciated!
Here is my solution.
So, the most important thing to do is read this as posted by Tomalak. It's a really good description of namespaces and easy to understand.
We are going to use XPath to navigate the XML document. Its notation is similar to file systems, where parents and descendants are separated by slashes /. The syntax is explained here, but note that some commands are different for the lxml implementation.
###Problem
Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:
<Placemark> <name>CITY NAME</name>
The XPath equivalent to the non-functional code I posted above is:
tree=etree.parse('kml document')
result=tree.xpath('//Placemark/name/text()')
Where the text() part is needed to get the text contained in the location //Placemark/name.
Now this doesn't work, as Tomalak pointed out, cause the name of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not show up in the actual document (which confused me) but it is defined at the beginning of the XML document like this:
xmlns="http://www.opengis.net/kml/2.2"
###Solution
We can supply namespaces to xpath by setting the namespaces argument:
xpath(X, namespaces={prefix: namespace})
This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
However, Xpath does not understand what a default namespace is (cf docs). Therefore, we need to trick it, like Tomalak suggested above: We invent a prefix for the default and add it to our search terms. We can just call it kml for instance. This piece of code actually does the trick:
tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
The tutorial mentions that there is also an ETXPath method, that works just like Xpath except that one writes the namespaces out in curly brackets instead of defining them in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.
I can't find info, how to parse my XML with namespace:
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
#do something
print(subtag)
And got exception, because it doesn't know about namespace prefix.
Is there best way to solve that problem, counting that script will not know about file it going to parse and tag is going to search for?
Searching web and stackoverflow I found, that if I will add there:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
#do something
print(subtag)
That works. Perfect. But I don't know which XML I will parse, and searching tag (such as //par:actual) is also unknown to my script. So, I need to find way to extract namespace from XML somehow.
I found a lot of ways, how to extract namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract prefix to create dictionary which ElementTree wants from me? I don't want to use regular expression monster over xml body to extract prefix, I believe there have to exist supported way for that, isn't it?
And maybe there have to exist some methods for me to extract by ETree namespace from XML as dictionary (as ETree wants!) without hands manipulation?
You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (the ETXPath has support for this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full xpath)
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")
Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
Object rootxml contains dictionary nsmap, which contains all namespaces that I want.
So, simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
#do something
print(subtag)
That works.
UPD: that works if user understand what means 'par' in XML he works with. For example, comparing supposed namespace with existing namespace before any other operations.
Still, I like much variant with XPath that understands {...}actual, that was what I tried to achieve.
With Python 3.8.2 I found this question with the same issue.
This is the solution I found, put the namespace in the XPath query. (Between the {})
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if(ApplicationArea is None):
ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents, some have namespaces, some do not.
I hope this helps!
ElementTree (Python 2.7) does not see the attributes of the root element, for example, for tag <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"> - get an empty dictionary. I want "on the fly"to get the namespace for working with tags. Library xml.dom.minidom works fine, but I don't want to lose features with ET. Code example:
from xml.etree import ElementTree as ET
import zipfile
path = '/path/to/sample.docx'
zf = zipfile.ZipFile(path, 'r')
root = ET.fromstring(zf.read('word/document.xml'))
print(root.tag, root.attrib) # =>
# ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}document', {})
An XML namespace declaration (a thing starting with xmlns:) is not an attribute. I think that's why you're not seeing it appear in the attrib dictionary. There are other ways of working with namespaces, so if you can say more about the purposes you're working to serve I may be able to be of more help.
The namespaces (and their prefixes) of WordprocessingML elements are well known and documented, and relatively few in number. There are some tens at most and only a small handful that appear in most documents. So depending on what you're trying to accomplish it may be easier to get done than it might seem.
I'm getting a lot of parsing errors from python related to my xml file. I read elsewhere on stackoverflow that I should validate the xml file first.
I can't understand why this xml will not validate:
<xml><hive name="myprojectname">
XML validator says this
Error: Can not find declaration of element 'xml'.
Error Position: <xml><hive name="myprojectname">
This:
<xml><hive name="myprojectname">
doesn't validate in http://www.validome.org/xml/validate/, because first you have to check "Well-Formedness only" option there.
Second, it have to follow XML rules, http://en.wikipedia.org/wiki/XML#Well-formedness_and_error-handling. So this should look:
<xml><hive name="myprojectname"/></xml>
The validator you are using appears to be a DTD based validator. Unless you tell it to check for well formedness only (in which case it won't check if your elements and attributes are correct, just that you open/close elements in a sane order, quote your attribute values, etc) then you must start the XML document with a Doctype so that it can find the DTD.
I'm having an issue when using lxml's find() method to select a node in an xml file. Essentially I am trying to move a node from one xml file to another.
File 1:
<somexml xmlns:a='...' xmlns:b='...' xmlns:c='...'>
<somenode id='foo'>
<something>bar</something>
</somenode>
</somexml>
Once I parse File 1 and do a find on it:
node = tree.find('//*[#id="foo"]')
Node looks like this:
<somenode xmlns:a='...' xmlns:b='...' xmlns:c='...'>
<something>bar</something>
</somenode>
Notice it added the namespaces that were found in the document to that node. However, nothing in that node uses any of those namespaces. How would I go about either A) not writing namespaces that aren't used in the selected node, or B) removing unused name space declarations? If it's being used in the selected node then I will need it, but otherwise, I would like to get rid of them. Any ideas? Thanks!
If the namespaces are in the document, then the document uses the namespaces. The namespaces are being used in those nodes, because those nodes are part of the subtree which declared the namespace. Follow the link given by Daenyth to remove them, or strip them off the XML string before you turn it into an lxml object.