How to make a correct xpath? [duplicate]

How to make a correct xpath? [duplicate] - python

How does XPath deal with XML namespaces?
If I use
/IntuitResponse/QueryResponse/Bill/Id
to parse the XML document below I get 0 nodes back.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<IntuitResponse xmlns="http://schema.intuit.com/finance/v3"
time="2016-10-14T10:48:39.109-07:00">
<QueryResponse startPosition="1" maxResults="79" totalCount="79">
<Bill domain="QBO" sparse="false">
<Id>=1</Id>
</Bill>
</QueryResponse>
</IntuitResponse>
However, I'm not specifying the namespace in the XPath (i.e. http://schema.intuit.com/finance/v3 is not a prefix of each token of the path). How can XPath know which Id I want if I don't tell it explicitly? I suppose in this case (since there is only one namespace) XPath could get away with ignoring the xmlns entirely. But if there are multiple namespaces, things could get ugly.

XPath 1.0/2.0
Defining namespaces in XPath (recommended)
XPath itself doesn't have a way to bind a namespace prefix with a namespace. Such facilities are provided by the hosting library.
It is recommended that you use those facilities and define namespace prefixes that can then be used to qualify XML element and attribute names as necessary.
Here are some of the various mechanisms which XPath hosts provide for specifying namespace prefix bindings to namespace URIs.
(OP's original XPath, /IntuitResponse/QueryResponse/Bill/Id, has been elided to /IntuitResponse/QueryResponse.)
C#:
XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
nsmgr.AddNamespace("i", "http://schema.intuit.com/finance/v3");
XmlNodeList nodes = el.SelectNodes(#"/i:IntuitResponse/i:QueryResponse", nsmgr);
Google Docs:
Unfortunately, IMPORTXML() does not provide a namespace prefix binding mechanism. See next section, Defeating namespaces in XPath, for how to use local-name() as a work-around.
Java (SAX):
NamespaceSupport support = new NamespaceSupport();
support.pushContext();
support.declarePrefix("i", "http://schema.intuit.com/finance/v3");
Java (XPath):
xpath.setNamespaceContext(new NamespaceContext() {
public String getNamespaceURI(String prefix) {
switch (prefix) {
case "i": return "http://schema.intuit.com/finance/v3";
// ...
}
});
Remember to call
DocumentBuilderFactory.setNamespaceAware(true).
See also:
Java XPath: Queries with default namespace xmlns
JavaScript:
See Implementing a User Defined Namespace Resolver:
function nsResolver(prefix) {
var ns = {
'i' : 'http://schema.intuit.com/finance/v3'
};
return ns[prefix] || null;
}
document.evaluate( '/i:IntuitResponse/i:QueryResponse',
document, nsResolver, XPathResult.ANY_TYPE,
null );
Note that if the default namespace has an associated namespace prefix defined, using the nsResolver() returned by Document.createNSResolver() can obviate the need for a customer nsResolver().
Perl (LibXML):
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs('i', 'http://schema.intuit.com/finance/v3');
my #nodes = $xc->findnodes('/i:IntuitResponse/i:QueryResponse');
Python (lxml):
from lxml import etree
f = StringIO('<IntuitResponse>...</IntuitResponse>')
doc = etree.parse(f)
r = doc.xpath('/i:IntuitResponse/i:QueryResponse',
namespaces={'i':'http://schema.intuit.com/finance/v3'})
Python (ElementTree):
namespaces = {'i': 'http://schema.intuit.com/finance/v3'}
root.findall('/i:IntuitResponse/i:QueryResponse', namespaces)
Python (Scrapy):
response.selector.register_namespace('i', 'http://schema.intuit.com/finance/v3')
response.xpath('/i:IntuitResponse/i:QueryResponse').getall()
PhP:
Adapted from #Tomalak's answer using DOMDocument:
$result = new DOMDocument();
$result->loadXML($xml);
$xpath = new DOMXpath($result);
$xpath->registerNamespace("i", "http://schema.intuit.com/finance/v3");
$result = $xpath->query("/i:IntuitResponse/i:QueryResponse");
See also #IMSoP's canonical Q/A on PHP SimpleXML namespaces.
Ruby (Nokogiri):
puts doc.xpath('/i:IntuitResponse/i:QueryResponse',
'i' => "http://schema.intuit.com/finance/v3")
Note that Nokogiri supports removal of namespaces,
doc.remove_namespaces!
but see the below warnings discouraging the defeating of XML namespaces.
VBA:
xmlNS = "xmlns:i='http://schema.intuit.com/finance/v3'"
doc.setProperty "SelectionNamespaces", xmlNS
Set queryResponseElement =doc.SelectSingleNode("/i:IntuitResponse/i:QueryResponse")
VB.NET:
xmlDoc = New XmlDocument()
xmlDoc.Load("file.xml")
nsmgr = New XmlNamespaceManager(New XmlNameTable())
nsmgr.AddNamespace("i", "http://schema.intuit.com/finance/v3");
nodes = xmlDoc.DocumentElement.SelectNodes("/i:IntuitResponse/i:QueryResponse",
nsmgr)
SoapUI (doc):
declare namespace i='http://schema.intuit.com/finance/v3';
/i:IntuitResponse/i:QueryResponse
xmlstarlet:
-N i="http://schema.intuit.com/finance/v3"
XSLT:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:i="http://schema.intuit.com/finance/v3">
...
Once you've declared a namespace prefix, your XPath can be written to use it:
/i:IntuitResponse/i:QueryResponse
Defeating namespaces in XPath (not recommended)
An alternative is to write predicates that test against local-name():
/*[local-name()='IntuitResponse']/*[local-name()='QueryResponse']
Or, in XPath 2.0:
/*:IntuitResponse/*:QueryResponse
Skirting namespaces in this manner works but is not recommended because it
Under-specifies the full element/attribute name.
Fails to differentiate between element/attribute names in different
namespaces (the very purpose of namespaces). Note that this concern could be addressed by adding an additional predicate to check the namespace URI explicitly:
/*[ namespace-uri()='http://schema.intuit.com/finance/v3'
and local-name()='IntuitResponse']
/*[ namespace-uri()='http://schema.intuit.com/finance/v3'
and local-name()='QueryResponse']
Thanks to Daniel Haley for the namespace-uri() note.
Is excessively verbose.
XPath 3.0/3.1
Libraries and tools that support modern XPath 3.0/3.1 allow the specification of a namespace URI directly in an XPath expression:
/Q{http://schema.intuit.com/finance/v3}IntuitResponse/Q{http://schema.intuit.com/finance/v3}QueryResponse
While Q{http://schema.intuit.com/finance/v3} is much more verbose than using an XML namespace prefix, it has the advantage of being independent of the namespace prefix binding mechanism of the hosting library. The Q{} notation is known as Clark Notation after its originator, James Clark. The W3C XPath 3.1 EBNF grammar calls it a BracedURILiteral.
Thanks to Michael Kay for the suggestion to cover XPath 3.0/3.1's BracedURILiteral.

I use /*[name()='...'] in a google sheet to fetch some counts from Wikidata. I have a table like this
thes WD prop links items
NOM P7749 3925 3789
AAT P1014 21157 20224
and the formulas in cols links and items are
=IMPORTXML("https://query.wikidata.org/sparql?query=SELECT(COUNT(*)as?c){?item wdt:"&$B14&"[]}","//*[name()='literal']")
=IMPORTXML("https://query.wikidata.org/sparql?query=SELECT(COUNT(distinct?item)as?c){?item wdt:"&$B14&"[]}","//*[name()='literal']")
respectively. The SPARQL query happens not to have any spaces...
I saw name() used instead of local-name() in Xml Namespace breaking my xpath!, and for some reason //*:literal doesn't work.

Related

Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc",namespaces=namespace)]
This works with the majority of feeds that I am using as an input, but I occasionally I find a feed with additional namespaces such as the below:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.example.com/</loc>
<priority>1.00</priority>
</url>
From what I've read I would need to add the additional namespace here (xmlns:xsi I guess) to the namespace dictionary to get my xpath to work with multiple namespaces.
However, this is not a long term solution for me as I might come across other differing namespaces in the future - is there a way for me to search/detect or even delete the namespace ? The element tree always will be the same, so my xpath wouldn't change.
Thanks

You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.
The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
To handle both namespace URIs (or really any other namespace URI that urlset might have) is to first get the namespace URI for the root element and then use that in your dict mapping...
from lxml import etree
tree = etree.parse("input.xml")
root_ns_uri = tree.xpath("namespace-uri()")
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
prints...
['https://www.example.com/']
If urlset isn't always the root element, you may want to do something like this instead...
root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")

python etree with xpath and namespaces with prefix

I can't find info, how to parse my XML with namespace:
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
#do something
print(subtag)
And got exception, because it doesn't know about namespace prefix.
Is there best way to solve that problem, counting that script will not know about file it going to parse and tag is going to search for?
Searching web and stackoverflow I found, that if I will add there:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
#do something
print(subtag)
That works. Perfect. But I don't know which XML I will parse, and searching tag (such as //par:actual) is also unknown to my script. So, I need to find way to extract namespace from XML somehow.
I found a lot of ways, how to extract namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract prefix to create dictionary which ElementTree wants from me? I don't want to use regular expression monster over xml body to extract prefix, I believe there have to exist supported way for that, isn't it?
And maybe there have to exist some methods for me to extract by ETree namespace from XML as dictionary (as ETree wants!) without hands manipulation?

You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (the ETXPath has support for this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full xpath)
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")

Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
Object rootxml contains dictionary nsmap, which contains all namespaces that I want.
So, simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
#do something
print(subtag)
That works.
UPD: that works if user understand what means 'par' in XML he works with. For example, comparing supposed namespace with existing namespace before any other operations.
Still, I like much variant with XPath that understands {...}actual, that was what I tried to achieve.

With Python 3.8.2 I found this question with the same issue.
This is the solution I found, put the namespace in the XPath query. (Between the {})
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if(ApplicationArea is None):
ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents, some have namespaces, some do not.
I hope this helps!

Retrieve content of element with unknown namespace using python

I am attempting to parse a maven project definition using python to extract a version.
The project definition looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>...</groupId>
<artifactId>...</artifactId>
<version>1.6.0-SNAPSHOT</version>
...
</project>
I can extract the version using:
root = ET.fromstring(xml)
version = root.find('./p:version', { 'p': 'http://maven.apache.org/POM/4.0.0' })
print(version.text)
prints: 1.6.0-SNAPSHOT
However, the namespace used may change, and I don't want to depend on this. Is there a way to extract the namespace to use in my subsequent xpath expression?
I tried the following, to see if xmlns was itself exposed, but no luck:
root = ET.fromstring(xml)
for k in root.attrib:
print('%s => %s' % (k, root.attrib[k]))
prints: {http://www.w3.org/2001/XMLSchema-instance}schemaLocation => http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd

However, the namespace used may change, and I don't want to depend on this.
Are you saying that the namespace uri might change, or that the prefix might? If it's just the prefix, then that's not an issue, because what matters is that the prefixes in your XPath match the prefixes you supply to the XPath evaluator. And in either case, auto-detecting the namespaces is probably a bad call. Suppose someone decides to start generating that XML like this:
<proj:project xmlns:proj="http://maven.apache.org/POM/4.0.0"
xmlns:other="http://maven.apache.org/POM/5.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
which is still perfectly representing the XML in the same namespace as your example, but you have no idea that the proj prefix is the namespace prefix you're looking for.
I think it's unlikely that Apache would suddenly change the namespace for one of their official XML formats, but if you are genuinely worried about it, there should always be the option of using local-name() to namespace-agnostically find a node you're looking for:
version = root.find('./*[local-name() = "version"]')
Also, I'm not familiar with the elementTree library, but you could try this to try to get information about the XML document's namespaces, just to see if you can:
namespaces = root.findall('//namespace::*')

Unfortunately, ElementTree namespace support is rather patchy.
You'll need to use an internal method from the xml.etree.ElementTree module to get a namespace map out:
_, namespaces = ET._namespaces(root, 'utf8')
namespaces is now a dict with URIs as keys, and prefixes as values.
You could switch to lxml instead. That library implements the same ElementTree API, but has augmented that API considerably.
For example, each node includes a .nsmap attribute which maps prefixes to URIs, including the default namespace under the key None.

python : lxml xpath tag name with colon

i have to parse some feed, but one of the element (tag) is with colon <dc:creator>leemore23</dc:creator>
how can i parse it using lxml? so i have done it in this way
r = requests.get('http://www.site.com/feed/')
foo = (r.content).replace("dc:creator","dc")
tree = lxml.etree.fromstring(foo)
for article_node in tree.xpath('//item'):
data['dc'] = article_node.xpath('.//dc')[0].text.strip()
but i think there is a better way, something like
data['dc'] = article_node.xpath('.//dc:creator')[0].text.strip()
or
data['dc'] = article_node.xpath('.//dc|creator')[0].text.strip()
so without replacing
what can you advice me ?

The dc: prefix indicates a XML namespace. Use the elementtree API namespace support to deal with it, not just remove it from your input. As it happens, dc usually refers to Dublin Core metadata.
You need to determine the full namespace URL, then use that URL in your XPath queries:
DCNS = 'http://purl.org/dc/elements/1.1/'
creator = article_node.xpath('.//{{{0}}}creator'.format(DCNS))
Here I used the recommended http://purl.org/dc/elements/1.1/ namespace URL for the dublin core prefix.
You can normally determine the URL from the .nsmap property; your root element probably has the following .nsmap attribute:
{'dc': 'http://purl.org/dc/elements/1.1/'}
and thus you can change your code to:
creator = article_node.xpath('.//{{{0}}}creator'.format(article_node.nsmap['dc']))
This can be simplified further still by passing the nsmap dictionary to the xpath() method as the namespaces keyword, at which point you can use the prefix in your xpath expression:
creator = article_node.xpath('.//dc:creator', namespaces=article_node.nsmap)

The dc: indicates a namespace. When using lxml's xpath method, use the namespaces parameter to search for elements in a namespace.
So, in your case, using the dublin core prefix supplied by #MartijnPieters,
r = requests.get('http://www.site.com/feed/')
tree = lxml.etree.fromstring(r.content)
ns = {'dc':'http://purl.org/dc/elements/1.1/'}
for article_node in tree.xpath('//item'):
data['dc'] = article_node.xpath('.//dc:creator', namespaces = ns)[0].text.strip()

How do I use ":" in XML element names using lxml?

How do I generate and parse XML like the following using lxml?
<s:Envelope xmlns:s="a" xmlns:a="http_//www.w3.org/2005/08/addressing">
....
</s:Envelope>
I currently swap : with _ in the element names when I parse and generate XML, but it seems stupid.

It's not clear exactly what you're asking, but maybe this will help:
An element such as <s:Envelope> is using a XML namespace prefix. This is used to indicate that the s:Envelope attribute in this document is defined in the a namespace.
LXML represents XML namespaces using a namespace prefix in braces, for example: {a}Envelope. Your example document is sort of confusing, because you also defined the a: namespace prefix, so:
a:Element is equivalent to {http://www.w3.org/2005/08/addressing}Element, and
s:Element is equivalent to {a}Element.
Many of the LXML commands let you provide a namespace prefix mapping. For example, to find the Envelope element in your document using XPATH, you could do this:
import lxml.etree as etree
doc = etree.parse('mydocument.xml')
envelope = doc.xpath('//s:Envelope',
namespaces={'s': 'a'})
Note that this is exactly equivalent to:
envelope = doc.xpath('//x:Envelope',
namespaces={'x': 'a'})
That is, the namespace prefix doesn't have to match what is used in the source XML document; only the the absolute namespace matters.
You can read more about LXML and namespaces here.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to make a correct xpath? [duplicate] - python

Related

Finding namespace URIs for lxml

python etree with xpath and namespaces with prefix

Retrieve content of element with unknown namespace using python

python : lxml xpath tag name with colon

How do I use ":" in XML element names using lxml?

Categories

Resources