Xpath select attribute of current node?

I use Python with lxml to process XML. After I query/filter to get to the node I want, I have a problem: how do I get its attribute's value with XPath? Here is an example of my input.
>print(etree.tostring(node, pretty_print=True ))
<rdf:li xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
The value I want is in resource=... . Currently I just use lxml to get the value. I wonder whether it is possible to do this in pure XPath? Thanks.
EDIT: I forgot to say that this is not a root node, so I can't use // here. I have around 2000-3000 others in the XML file. My first attempt was playing around with ".@attrib" and "self::*@", but those do not seem to work.
EDIT2: I will try my best to explain (this is my first time dealing with an XML problem using XPath, and English is not one of my strongest subjects). Here is my input snippet: http://pastebin.com/kZmVdbQQ (the full file is available from http://www.comp-sys-bio.org/yeastnet/, version 4).
In my code, I try to get the speciesType nodes that have a ChEBI resource link (<rdf:li rdf:resource="urn:miriam:obo.chebi:...."/>), and then I try to get the value of the rdf:resource attribute in rdf:li. I am pretty sure it would be easy to get the attribute of a child node if I started from a parent node like speciesType, but I wonder how to do it if I start from rdf:li. From my understanding, "//" in XPath looks for nodes everywhere in the document, not just under the current node.
Below is my code:
import lxml.etree as etree
tree = etree.parse("yeast_4.02.xml")
root = tree.getroot()
ns = {"sbml": "http://www.sbml.org/sbml/level2/version4",
      "rdf":"http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "body":"http://www.w3.org/1999/xhtml",
      "re": "http://exslt.org/regular-expressions"
      }
#good enough for now
maybemeta = root.xpath("//sbml:speciesType[descendant::rdf:li[starts-with(@rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(@rdf:resource, 'urn:miriam:uniprot'))]]", namespaces = ns)
def extract_name_and_chebi(node):
    name = node.attrib['name']
    chebies = node.xpath("./sbml:annotation//rdf:li[starts-with(@rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(@rdf:resource, 'urn:miriam:uniprot'))]", namespaces=ns) #get all rdf:li node with chebi resource
    assert len(chebies) == 1
    #my current solution to get rdf:resource value from rdf:li node
    rdfNS = "{" + ns.get('rdf') + "}"
    chebi = chebies[0].attrib[rdfNS + 'resource']
    #do protein later
    return (name, chebi)
metaWithChebi = map(extract_name_and_chebi, maybemeta)
fo = open("metabolites.txt", "w")
for name, chebi in metaWithChebi:
    fo.write("{0}\t{1}\n".format(name, chebi))

Prefix the attribute name with @ in the XPath query:
>>> from lxml import etree
>>> xml = """\
... <?xml version="1.0" encoding="utf8"?>
... <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
... <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
... </rdf:RDF>
... """
>>> tree = etree.fromstring(xml)
>>> ns = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
>>> tree.xpath('//rdf:li/@rdf:resource', namespaces=ns)
['urn:miriam:obo.chebi:CHEBI%3A37671']
EDIT
Here's a revised version of the script in the question:
import lxml.etree as etree
ns = {
    'sbml': 'http://www.sbml.org/sbml/level2/version4',
    'rdf':'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'body':'http://www.w3.org/1999/xhtml',
    're': 'http://exslt.org/regular-expressions',
    }
def extract_name_and_chebi(node):
    chebies = node.xpath("""
        .//rdf:li[
        starts-with(@rdf:resource, 'urn:miriam:obo.chebi')
        ]/@rdf:resource
        """, namespaces=ns)
    return node.attrib['name'], chebies[0]
with open('yeast_4.02.xml') as xml:
    tree = etree.parse(xml)
maybemeta = tree.xpath("""
    //sbml:speciesType[descendant::rdf:li[
    starts-with(@rdf:resource, 'urn:miriam:obo.chebi')]]
    """, namespaces = ns)
with open('metabolites.txt', 'w') as output:
    for node in maybemeta:
        output.write('%s\t%s\n' % extract_name_and_chebi(node))

To select the rdf:resource attribute of the current node, use this XPath expression:
@rdf:resource
For this to work, the prefix "rdf" must be bound to the corresponding namespace in whatever XPath API you are using.
If you don't know how to register the rdf namespace, it is still possible to select the attribute, with this XPath expression:
@*[name()='rdf:resource']

Well, I got it. The XPath expression I need here is "./@rdf:resource", not ".@rdf:resource". But why? I thought "./" indicated a child of the current node.
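On the "but why?" above: in XPath's abbreviated syntax, "." is short for self::node(), "/" separates location steps, and "@" is short for the attribute:: axis. So "./@rdf:resource" reads as "from the current node, step to its rdf:resource attribute", whereas ".@rdf:resource" is two steps with no separator and is not valid syntax. A plain "@rdf:resource" is equivalent, because a relative path already starts at the context node. Here is a minimal sketch with lxml; the rdf:Bag wrapper is made up for illustration, and only the rdf:li line comes from the question.

```python
from lxml import etree

ns = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}

# Hypothetical wrapper document; only the rdf:li element is from the question.
xml = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
  </rdf:Bag>
</rdf:RDF>"""

root = etree.fromstring(xml)

# Get to the rdf:li node first, as in the question.
node = root.find('.//rdf:li', namespaces=ns)

# Three equivalent ways to select the attribute of the current node.
short_form = node.xpath('@rdf:resource', namespaces=ns)
dotted_form = node.xpath('./@rdf:resource', namespaces=ns)
long_form = node.xpath('self::node()/attribute::rdf:resource', namespaces=ns)

# string() returns a single string instead of a one-item list.
as_string = node.xpath('string(@rdf:resource)', namespaces=ns)

print(short_form, dotted_form, long_form, as_string)
```

All three path forms return a one-item list, while the string() form returns the value directly, which may be more convenient when exactly one attribute is expected.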

Related

Is there something wrong with my script or the XML file? I am using ElementTree in attempt to get out child attributes

This is a shortened version of the XML file that I am trying to parse:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TipsContents xmlns="http://www.avendasys.com/tipsapiDefs/1.0">
<TipsHeader exportTime="Mon May 04 20:05:47 SAST 2020" version="6.8"/>
<Endpoints>
<Endpoint macVendor="SHENZHEN RF-LINK TECHNOLOGY CO.,LTD." macAddress="c46e7b2939cb" status="Known">
<EndpointProfile updatedAt="May 04, 2020 10:02:21 SAST" profiledBy="Policy Manager" addedAt="Mar 04, 2020 17:31:53 SAST" fingerprint="{}" conflict="false" name="Windows" family="Windows" category="Computer" staticIP="false" ipAddress="xxx.xxx.xxx.xxx"/>
<EndpointTags tagName="Username" tagValue="xxxxxxxx"/>
<EndpointTags tagName="Disabled Reason" tagValue="IS_ACTIVE"/>
</Endpoint>
</Endpoints>
<TagDictionaries>
<TagDictionary allowMultiple="false" mandatory="true" defaultValue="false" dataType="Boolean" attributeName="DOMAIN-MACHINES" entityName="Endpoint"/>
<TagDictionary allowMultiple="false" mandatory="true" defaultValue="true" dataType="Boolean" attributeName="IS_ACTIVE" entityName="Endpoint"/>
<TagDictionary allowMultiple="true" mandatory="false" dataType="String" attributeName="Disabled Reason" entityName="Endpoint"/>
<TagDictionary allowMultiple="false" mandatory="false" dataType="String" attributeName="Username" entityName="Endpoint"/>
</TagDictionaries>
</TipsContents>
I run the following script:
import xml.etree.ElementTree as ET
f = open("Endpoint-5.xml", 'r')
tree = ET.parse(f)
root = tree.getroot()
This is what my outputs look like:
In [8]: root = tree.getroot()
In [9]: root.findall('.')
Out[9]: [<Element '{http://www.avendasys.com/tipsapiDefs/1.0}TipsContents' at 0x10874b410>]
In [10]: root.findall('./TipsHeader')
Out[10]: []
In [11]: root.findall('./TipsContents')
Out[11]: []
In [15]: root.findall('{http://www.avendasys.com/tipsapiDefs/1.0}TipsContents//Endpoints/Endpoint/EndpointProfile')
Out[15]: []
I have been following this: https://docs.python.org/3/library/xml.etree.elementtree.html#example
among other tutorials, but I don't seem to get any output.
I have also tried lxml (from lxml import html).
My script is as follows:
tree = html.fromstring(html=f)
updatedAt = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointProfile/@updatedAt")
name = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointProfile/@name")
category = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointProfile/@category")
tagValue = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointTags[@tagName = 'Username']/@tagValue")
active = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointTags[@tagName = 'Disabled Reason']/@tagValue")
print("Name:",name)
The above attempt also returns nothing.
I am able to parse an XML document from an API and use the second approach successfully, but when I do this with a local file I do not get any results.
Any assistance will be appreciated.

Note that your input XML contains a default namespace, so to refer to
any element you have to specify the namespace.
One way to do this is to define a dictionary of namespaces
(prefix: full namespace URI), in your case:
ns = {'tips': 'http://www.avendasys.com/tipsapiDefs/1.0'}
Then, when calling findall, put the appropriate prefix (followed by ':')
before each element name and pass the namespace dictionary as the second argument.
The code to do it is:
for elem in root.findall('./tips:TipsHeader', ns):
    print(elem.attrib)
The result, for your input sample, is:
{'exportTime': 'Mon May 04 20:05:47 SAST 2020', 'version': '6.8'}
As far as root.findall('./TipsContents') is concerned, it will return
an empty list, even if you specify the namespace as above.
The reason is that TipsContents is the name of the root node,
whereas you attempt to find an element with the same name below in
the XML tree, but it contains no such element.
If you want to access attributes of the root element, you can run:
print(root.attrib)
but to get something more than an empty dictionary, you have to add
some attributes to the root element (namespace is not an attribute).
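The same namespace dictionary also works for the deeper elements the question was after. The following is a rough sketch using only the standard library, run against a trimmed copy of the question's XML. The helper name endpoint_summary and the keys of the returned dictionary are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (most attributes omitted).
SAMPLE = """<TipsContents xmlns="http://www.avendasys.com/tipsapiDefs/1.0">
  <TipsHeader exportTime="Mon May 04 20:05:47 SAST 2020" version="6.8"/>
  <Endpoints>
    <Endpoint macAddress="c46e7b2939cb" status="Known">
      <EndpointProfile name="Windows" family="Windows" category="Computer"/>
      <EndpointTags tagName="Username" tagValue="xxxxxxxx"/>
      <EndpointTags tagName="Disabled Reason" tagValue="IS_ACTIVE"/>
    </Endpoint>
  </Endpoints>
</TipsContents>"""

ns = {'tips': 'http://www.avendasys.com/tipsapiDefs/1.0'}
root = ET.fromstring(SAMPLE)


def endpoint_summary(endpoint):
    """Collect a few attributes from one Endpoint element (hypothetical helper)."""
    profile = endpoint.find('tips:EndpointProfile', ns)
    # ElementTree supports [@attr='value'] predicates on namespaced tags.
    username = endpoint.find("tips:EndpointTags[@tagName='Username']", ns)
    reason = endpoint.find("tips:EndpointTags[@tagName='Disabled Reason']", ns)
    return {
        'mac': endpoint.get('macAddress'),
        'name': profile.get('name') if profile is not None else None,
        'category': profile.get('category') if profile is not None else None,
        'username': username.get('tagValue') if username is not None else None,
        'disabled_reason': reason.get('tagValue') if reason is not None else None,
    }


summaries = [endpoint_summary(e)
             for e in root.findall('./tips:Endpoints/tips:Endpoint', ns)]
print(summaries)
```

Note that ElementTree's limited XPath cannot return attribute values directly (a path ending in /@name is not supported), so the sketch finds the element and then calls get() on it.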

Python xml to csv

Please read entire question before marking duplicate.
I have a nested XML file which I want to convert to a CSV file.
I have to write a Python script for this.
The XML file is:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<Orders>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T03:58:30Z</PurchaseDate>
<AmazonOrderId>171-6355256-9594715</AmazonOrderId>
<LastUpdateDate>2015-06-01T04:18:58Z</LastUpdateDate>
<ShipServiceLevel>IN Std Domestic</ShipServiceLevel>
<NumberOfItemsShipped>0</NumberOfItemsShipped>
<OrderStatus>Canceled</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
</Order>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T04:50:07Z</PurchaseDate>
<BuyerEmail>dr7h1rhy6457rng@marketplace.amazon.in</BuyerEmail>
<AmazonOrderId>403-5551715-2566754</AmazonOrderId>
<LastUpdateDate>2015-06-01T07:52:49Z</LastUpdateDate>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<NumberOfItemsShipped>2</NumberOfItemsShipped>
<OrderStatus>Shipped</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<LatestDeliveryDate>2015-06-06T18:29:59Z</LatestDeliveryDate>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<BuyerName>Ajit Nair</BuyerName>
<EarliestDeliveryDate>2015-06-02T18:30:00Z</EarliestDeliveryDate>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>938.00</Amount>
</OrderTotal>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<PaymentMethod>Other</PaymentMethod>
<ShippingAddress>
<StateOrRegion>MAHARASHTRA</StateOrRegion>
<City>THANE</City>
<Phone>9769994355</Phone>
<CountryCode>IN</CountryCode>
<PostalCode>400709</PostalCode>
<Name>Ajit Nair</Name>
<AddressLine1>C-25 / con-7 / Chandralok CHS</AddressLine1>
<AddressLine2>Sector-10 ,Koper khairne</AddressLine2>
</ShippingAddress>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
</Order>
I tried to get the values in my code as a list, but it doesn't print anything.
My code:
from xml.etree import ElementTree
with open('orders.xml', 'rb') as f:
    tree = ElementTree.parse(f)
for node in tree.findall('.//Order'):
    oid = node.attrib.get('SellerOrderId')
    if oid:
        print(oid)
What is wrong with my code?
EDIT: Temporary link to complete File Orders.xml

Your XML has a default namespace defined here:
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
Note that descendant elements inherit the ancestor's default namespace implicitly, unless otherwise specified. You need to combine the namespace and the local name to form a fully qualified element name, for example:
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order', ns):
    oid = node.attrib.get('SellerOrderId')
    if oid:
        print(oid)
According to the full XML file you linked to, SellerOrderId is a child element of Order rather than an attribute. In this case, you can simply use .//d:Order/d:SellerOrderId to get the elements and then print their text, like so:
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order/d:SellerOrderId', ns):
    print(node.text)
output :
171-1322776-9700344
171-4214129-7148305
402-8263846-7042737
402-7017923-9474716
402-9691237-2887553
171-4614227-7597903
403-6729903-2119563
402-2184564-2676353
171-4520392-2088330
402-7986969-8827533
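Since the question title asks for CSV output, here is one possible way to go from the namespaced Order elements to CSV rows, sketched in Python 3 with only the standard library. The choice of columns, the COLUMNS mapping, and the function name orders_to_csv are assumptions for illustration. The sample is a trimmed version of the question's XML with the closing tags added.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed sample based on the question's XML; the real file has many more fields.
SAMPLE = """<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
  <ListOrdersResult>
    <Orders>
      <Order>
        <AmazonOrderId>171-6355256-9594715</AmazonOrderId>
        <OrderStatus>Canceled</OrderStatus>
      </Order>
      <Order>
        <AmazonOrderId>403-5551715-2566754</AmazonOrderId>
        <OrderStatus>Shipped</OrderStatus>
        <OrderTotal>
          <CurrencyCode>INR</CurrencyCode>
          <Amount>938.00</Amount>
        </OrderTotal>
      </Order>
    </Orders>
  </ListOrdersResult>
</ListOrdersResponse>
"""

NS = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}

# Hypothetical column choice: CSV column name -> path relative to each Order.
COLUMNS = {
    'AmazonOrderId': 'd:AmazonOrderId',
    'OrderStatus': 'd:OrderStatus',
    'Currency': 'd:OrderTotal/d:CurrencyCode',
    'Amount': 'd:OrderTotal/d:Amount',
}


def orders_to_csv(xml_text):
    """Return CSV text with one row per Order element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(COLUMNS), lineterminator='\n')
    writer.writeheader()
    for order in root.findall('.//d:Order', NS):
        # findtext returns the default ('') when the element is missing,
        # so orders without an OrderTotal still produce a complete row.
        writer.writerow({col: order.findtext(path, default='', namespaces=NS)
                         for col, path in COLUMNS.items()})
    return out.getvalue()


csv_text = orders_to_csv(SAMPLE)
print(csv_text)
```

Nested elements such as OrderTotal are flattened by giving each leaf its own column. To write to a file instead, the same DictWriter can be pointed at a file opened with newline=''.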

How to get the child of child using Python's ElementTree

I'm building a Python script that communicates with a PLC. When compiling, the PLC creates an XML file that delivers important information about the program. The XML looks more or less like this:
<visu>
<time>12:34</time>
<name>my_visu</name>
<language>english</language>
<vars>
<var name="input1">2</var>
<var name="input2">45.6</var>
<var name="input3">"hello"</var>
</vars>
</visu>
The important part is found under the child "vars". Using Python, I want to write a script that, when given the argument "input2", prints "45.6".
So far I can read all children of "visu", but I don't know how to tell Python to search among "the child of child". Here is what I have so far:
tree = ET.parse("file.xml")
root = tree.getroot()
for child in root:
    if child.tag == "vars":
        .......
        if ( "childchild".attrib.get("name") == "input2" ):
            print "childchild".text
Any ideas how I can complete the script? (or maybe a more efficient way of programming it?)

You'd be better off using an XPath search here:
name = 'input2'
value = root.find('.//vars/var[@name="{}"]'.format(name)).text
This searches for a <var> tag directly below a <vars> tag, whose name attribute is equal to the value given by the Python name variable, then retrieves the text value of that tag.
Demo:
>>> from xml.etree import ElementTree as ET
>>> sample = '''\
... <visu>
... <time>12:34</time>
... <name>my_visu</name>
... <language>english</language>
... <vars>
... <var name="input1">2</var>
... <var name="input2">45.6</var>
... <var name="input3">"hello"</var>
... </vars>
... </visu>
... '''
>>> root = ET.fromstring(sample)
>>> name = 'input2'
>>> root.find('.//vars/var[@name="{}"]'.format(name)).text
'45.6'
You can do this the hard way and manually loop over all the elements; each element can be looped over directly:
name = 'input2'
for elem in root:
    if elem.tag == 'vars':
        for var in elem:
            if var.attrib.get('name') == name:
                print(var.text)
but using element.find() or element.findall() is probably going to be easier and more concise.
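One caveat with the one-liner: find() returns None when nothing matches, so calling .text on the result raises an AttributeError for an unknown name. A small sketch of a wrapper that handles this, in Python 3 with the standard library; the function name get_var is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<visu>
  <time>12:34</time>
  <name>my_visu</name>
  <language>english</language>
  <vars>
    <var name="input1">2</var>
    <var name="input2">45.6</var>
    <var name="input3">"hello"</var>
  </vars>
</visu>"""


def get_var(root, name):
    """Return the text of <var name=...> under <vars>, or None if absent."""
    # Building the path by string formatting assumes `name` contains
    # no double-quote characters.
    var = root.find('./vars/var[@name="{}"]'.format(name))
    return var.text if var is not None else None


root = ET.fromstring(SAMPLE)
print(get_var(root, 'input2'))
print(get_var(root, 'missing'))
```

The values come back as strings, so "45.6" would still need float() if a number is wanted.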

Using "info.get" for a child element in Python / lxml

I'm trying to get the attribute of a child element in Python, using lxml.
This is the structure of the xml:
<GroupInformation groupId="crid://thing.com/654321" ordered="true">
<GroupType value="show" xsi:type="ProgramGroupTypeType"/>
<BasicDescription>
<Title type="main" xml:lang="EN">A programme</Title>
<RelatedMaterial>
<HowRelated href="urn:eventis:metadata:cs:HowRelatedCS:2010:boxCover">
<Name>Box cover</Name>
</HowRelated>
<MediaLocator>
<mpeg7:MediaUri>file://ftp.something.com/Images/123456.jpg</mpeg7:MediaUri>
</MediaLocator>
</RelatedMaterial>
</BasicDescription>
The code I've got is below. The bit I want to return is the 'value' attribute ("show" in the example) under 'grouptype' (third line from the bottom):
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010','mpeg7':'urn:tva:mpeg7:2008'}
with open(file_name+'.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:GroupInformation', namespaces=nsmap):
        crid = (info.get('groupId'))
        grouptype = info.find('.//xmlns:GroupType', namespaces=nsmap)
        gtype = grouptype.get('value')
        titlex = info.find('.//xmlns:BasicDescription/xmlns:Title', namespaces=nsmap)
        title = titlex.text if titlex is not None else 'Missing'
Can anyone explain to me how to implement it? I had a quick look at the xsi namespace, but was unable to get it to work (and didn't know if it was the right thing to do).

Is this what you are looking for?
grouptype.attrib['value']
PS: why the parentheses around assignment values? Those look unnecessary.
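On the xsi question: a plain attribute such as value needs no namespace at all, so get('value') is enough. Only prefixed attributes such as xsi:type need the namespace, written in {uri}name form. Below is a minimal sketch with the standard library. The TVAMain wrapper element and its namespace declarations are assumptions added so the fragment parses, and group_types is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper: the question's snippet omits the root element and the
# namespace declarations, so they are added here for illustration.
SAMPLE = """<TVAMain xmlns="urn:tva:metadata:2010"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <GroupInformation groupId="crid://thing.com/654321" ordered="true">
    <GroupType value="show" xsi:type="ProgramGroupTypeType"/>
  </GroupInformation>
</TVAMain>"""

ns = {'tva': 'urn:tva:metadata:2010'}
XSI = 'http://www.w3.org/2001/XMLSchema-instance'


def group_types(root):
    """Return (groupId, value, xsi:type) for each GroupInformation element."""
    rows = []
    for info in root.findall('.//tva:GroupInformation', ns):
        grouptype = info.find('tva:GroupType', ns)
        if grouptype is None:
            rows.append((info.get('groupId'), 'Missing', None))
            continue
        # Unprefixed attribute: plain name. Prefixed attribute: {uri}name.
        rows.append((info.get('groupId'),
                     grouptype.get('value'),
                     grouptype.get('{%s}type' % XSI)))
    return rows


root = ET.fromstring(SAMPLE)
print(group_types(root))
```

The same get() calls work on lxml elements, so this should carry over to the lxml code in the question.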

How should I parse this xml string in python?

My XML string is -
xmlData = """<SMSResponse xmlns="http://example.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Cancelled>false</Cancelled>
<MessageID>00000000-0000-0000-0000-000000000000</MessageID>
<Queued>false</Queued>
<SMSError>NoError</SMSError>
<SMSIncomingMessages i:nil="true"/>
<Sent>false</Sent>
<SentDateTime>0001-01-01T00:00:00</SentDateTime>
</SMSResponse>"""
I am trying to parse it and get the values of the tags Cancelled, MessageID, SMSError, etc. I am using Python's ElementTree library. So far, I have tried things like:
root = ET.fromstring(xmlData)
print(root.find('Sent'))  # gives None
for child in root:
    print(child.find('MessageId'))  # also gives None
However, I am able to print the tags with:
for child in root:
    print(child.tag)
# child.tag for the tag Cancelled is {http://example.com}Cancelled
and their respective values with:
for child in root:
    print(child.text)
How do I get something like:
print(child.Queued)  # should print false
Like in PHP we can access them with the root -
$xml = simplexml_load_string($data);
$status = $xml->SMSError;

Your document has a namespace on it, so you need to include the namespace when searching:
root = ET.fromstring(xmlData)
print(root.find('{http://example.com}Sent'))
print(root.find('{http://example.com}MessageID'))
output:
<Element '{http://example.com}Sent' at 0x1043e0690>
<Element '{http://example.com}MessageID' at 0x1043e0350>
The find() and findall() methods also take a namespace map; you can search with an arbitrary prefix, and the prefix will be looked up in that map, to save typing:
nsmap = {'n': 'http://example.com'}
print(root.find('n:Sent', namespaces=nsmap))
print(root.find('n:MessageID', namespaces=nsmap))

If you're set on Python standard XML libraries, you could use something like this:
root = ET.fromstring(xmlData)
namespace = 'http://example.com'
def query(tree, nodename):
    return tree.find('{{{ex}}}{nodename}'.format(ex=namespace, nodename=nodename))
queued = query(root, 'Queued')
print(queued.text)

You can create a dictionary and directly get values out of it...
tree = ET.fromstring(xmlData)
root = {}
for child in tree:
    root[child.tag.split("}")[1]] = child.text
print(root["Queued"])
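Building on the dictionary idea above, attribute-style access similar to the PHP example (resp.Queued) can be approximated with types.SimpleNamespace from the standard library. This is only a sketch in Python 3: the function name to_namespace is made up, and it assumes every child's local name is a valid Python identifier and that children are not nested.

```python
from types import SimpleNamespace
import xml.etree.ElementTree as ET

xmlData = """<SMSResponse xmlns="http://example.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Cancelled>false</Cancelled>
<MessageID>00000000-0000-0000-0000-000000000000</MessageID>
<Queued>false</Queued>
<SMSError>NoError</SMSError>
<SMSIncomingMessages i:nil="true"/>
<Sent>false</Sent>
<SentDateTime>0001-01-01T00:00:00</SentDateTime>
</SMSResponse>"""


def to_namespace(xml_text):
    """Map each direct child's local tag name to its text (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # '{http://example.com}Queued' -> 'Queued'; tags without a
        # namespace are left unchanged.
        local = child.tag.rsplit('}', 1)[-1]
        fields[local] = child.text
    return SimpleNamespace(**fields)


resp = to_namespace(xmlData)
print(resp.Queued)
print(resp.SMSError)
print(resp.SMSIncomingMessages)
```

Empty elements such as SMSIncomingMessages come back as None, and all other values are strings, so "false" is not converted to a boolean.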

With lxml.etree:
In [8]: import lxml.etree as et
In [9]: doc=et.fromstring(xmlData)
In [10]: ns={'n':'http://example.com'}
In [11]: doc.xpath('n:Queued/text()',namespaces=ns)
Out[11]: ['false']
With elementtree you can do:
import xml.etree.ElementTree as ET
root=ET.fromstring(xmlData)
ns={'n':'http://example.com'}
root.find('n:Queued',namespaces=ns).text
Out[13]: 'false'
