I'm parsing a non-compliant XML file (Sphinx's xmlpipe2 format) and would like LXML parser to ignore the fact that there are unresolved namespace prefixes.
An example of the Sphinx XML:
<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
I'm aware of passing a parser keyword option to try and recover broken XML, e.g.
parser = etree.XMLParser(recover=True)
tree = etree.parse('sphinxTest.xml', parser)
but the above does not ignore the prefix, it removes it.
I could create a target which adds in the removed prefix e.g.
parser = etree.XMLParser(target = AddPrefix())
where AddPrefix() is a class which adds in the prefix to every attribute tag.
Is there a simpler way to do this?
Eventually i want to programmatically write Sphinx's xmlpipe2 format cleanly.
Add xmlns:sphinx="bogus" to the root element.
Related
I'm using lxml to parse XML product feeds with the following code:
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc",namespaces=namespace)]
This works with the majority of feeds that I am using as an input, but I occasionally I find a feed with additional namespaces such as the below:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.example.com/</loc>
<priority>1.00</priority>
</url>
From what I've read I would need to add the additional namespace here (xmlns:xsi I guess) to the namespace dictionary to get my xpath to work with multiple namespaces.
However, this is not a long term solution for me as I might come across other differing namespaces in the future - is there a way for me to search/detect or even delete the namespace ? The element tree always will be the same, so my xpath wouldn't change.
Thanks
You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.
The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
To handle both namespace URIs (or really any other namespace URI that urlset might have) is to first get the namespace URI for the root element and then use that in your dict mapping...
from lxml import etree
tree = etree.parse("input.xml")
root_ns_uri = tree.xpath("namespace-uri()")
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
prints...
['https://www.example.com/']
If urlset isn't always the root element, you may want to do something like this instead...
root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
I have an XML file with an element which looks like this:
<wrapping_element>
<prefix:tag xmlns:prefix="url">value</prefix:tag>
</wrapping_element>
I want to get this element, so I am using lxml as follows:
wrapping_element.find('prefix:tag', wrapping_element.nsmap)
but I get the following error: SyntaxError: prefix 'prefix' not found in prefix map because prefix is not defined before reaching this element in the XML.
Is there a way to get the element anyway?
Like mentioned in the comments, you could use local-name() to circumvent the namespace, but it's easy enough to just handle the namespace directly in the xpath() call...
from lxml import etree
tree = etree.parse("input.xml")
wrapping_element = tree.xpath("/wrapping_element")[0]
tag = wrapping_element.xpath("x:tag", namespaces={"x": "url"})[0]
print(etree.tostring(tag, encoding="unicode"))
This will print...
<prefix:tag xmlns:prefix="url">value</prefix:tag>
Notice I used the prefix x. The prefix can match the prefix in the XML file, but it doesn't have to; only the namespace URIs need to match exactly.
See here for more details: http://lxml.de/xpathxslt.html#namespaces-and-prefixes
I can't find info, how to parse my XML with namespace:
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
#do something
print(subtag)
And got exception, because it doesn't know about namespace prefix.
Is there best way to solve that problem, counting that script will not know about file it going to parse and tag is going to search for?
Searching web and stackoverflow I found, that if I will add there:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
#do something
print(subtag)
That works. Perfect. But I don't know which XML I will parse, and searching tag (such as //par:actual) is also unknown to my script. So, I need to find way to extract namespace from XML somehow.
I found a lot of ways, how to extract namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract prefix to create dictionary which ElementTree wants from me? I don't want to use regular expression monster over xml body to extract prefix, I believe there have to exist supported way for that, isn't it?
And maybe there have to exist some methods for me to extract by ETree namespace from XML as dictionary (as ETree wants!) without hands manipulation?
You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (the ETXPath has support for this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full xpath)
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")
Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
Object rootxml contains dictionary nsmap, which contains all namespaces that I want.
So, simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
#do something
print(subtag)
That works.
UPD: that works if user understand what means 'par' in XML he works with. For example, comparing supposed namespace with existing namespace before any other operations.
Still, I like much variant with XPath that understands {...}actual, that was what I tried to achieve.
With Python 3.8.2 I found this question with the same issue.
This is the solution I found, put the namespace in the XPath query. (Between the {})
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if(ApplicationArea is None):
ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents, some have namespaces, some do not.
I hope this helps!
I am working on a xml parser.
The goal is to parse a number of different xml files where prefixes and tags remain consistent but namespaces change.
I am hence trying either:
to parse the xml just by <prefix:tags> without resolving (replacing) the prefix with the namespace. The prefixes remain unchanged from document to document.
to load automatically the namespaces so that the identifier (<prefix:tag>) could be replaced with the proper namespace.
just parse the xml by tag
I have tried with xml.etree.ElementTree.
I also had a look at lxml
I did not find any configuration option of the XMLParser in lxml that could help me out although here I could read an answer where the author suggests that lxml should be able to collect namespaces for me automatically.
Interestingly, parsed_file = etree.XML(file) fails with the error:
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
One example of the files I would like to parse is here
Do not care about ns prefixes, care about complete namespaces
Sometime, people do care about those short prefixes and forgetting, the are of secondary importance. They are only short reference to fully qualified namespace. E.g.
xmlns:trw="http://www.trw.com/20131231"
in xml means, from now on, the "trw:" stands for fully qualified namespace "http://www.trw.com/20131231". Note, that this prefix can be redefined to any other namespace in any following element and may get completely different meaning.
On the other hand, when you care about real meaning, what means here fully qualified namespace, you may think of "trw:row" as "{http://www.trw.com/20131231}row". This translated meaning will be reliable and will not change with prefix changes.
Parsing referred xml
The link to http://edgar.sec.gov/Archives/edgar/data/1267097/000104746914000925/trw-20131231.xml leads to an xml, which validates by xmlstarlet and which lxml is able to parse.
The error message you show is referring to very first character of the stream, so chances are you either met BOM byte in your file, or you are trying to read xml, which is gzipped and shall be decompressed first.
lxml and namespaces
lxml works with namespaces well. It allows you to use XPath expressions, which use namespaces. With controlling namspace prefix on output it is a bit more complex, as it is dependent on xmlns attributes, which are part of serialized document. If you want to modify the prefixes, you must somehow organize these xmlns attributes, often by moving all of the to the root element. At the same time, lxml keeps track of fully qualified namespace of each element, so at the moment of serialization, it will respect this full name as well as currently valid prefix for this namespace.
Handling these xmlna attributes is a bit of more code, refer to lxml documentation.
items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")
did the job. On top of that I had to browse the generated list items manually to define my other desired filtering functions.
How do I generate and parse XML like the following using lxml?
<s:Envelope xmlns:s="a" xmlns:a="http_//www.w3.org/2005/08/addressing">
....
</s:Envelope>
I currently swap : with _ in the element names when I parse and generate XML, but it seems stupid.
It's not clear exactly what you're asking, but maybe this will help:
An element such as <s:Envelope> is using a XML namespace prefix. This is used to indicate that the s:Envelope attribute in this document is defined in the a namespace.
LXML represents XML namespaces using a namespace prefix in braces, for example: {a}Envelope. Your example document is sort of confusing, because you also defined the a: namespace prefix, so:
a:Element is equivalent to {http://www.w3.org/2005/08/addressing}Element, and
s:Element is equivalent to {a}Element.
Many of the LXML commands let you provide a namespace prefix mapping. For example, to find the Envelope element in your document using XPATH, you could do this:
import lxml.etree as etree
doc = etree.parse('mydocument.xml')
envelope = doc.xpath('//s:Envelope',
namespaces={'s': 'a'})
Note that this is exactly equivalent to:
envelope = doc.xpath('//x:Envelope',
namespaces={'x': 'a'})
That is, the namespace prefix doesn't have to match what is used in the source XML document; only the the absolute namespace matters.
You can read more about LXML and namespaces here.