Conditional Search in XML Python

Conditional Search in XML Python - python

I have a xml file Orders.xml (excerpt follows):
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<Orders>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T03:58:30Z</PurchaseDate>
<AmazonOrderId>171-6355256-9594715</AmazonOrderId>
<LastUpdateDate>2015-06-01T04:18:58Z</LastUpdateDate>
<ShipServiceLevel>IN Std Domestic</ShipServiceLevel>
<NumberOfItemsShipped>0</NumberOfItemsShipped>
<OrderStatus>Canceled</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
</Order>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T04:50:07Z</PurchaseDate>
<BuyerEmail>dr7h1rhy6457rng#marketplace.amazon.in</BuyerEmail>
<AmazonOrderId>403-5551715-2566754</AmazonOrderId>
<LastUpdateDate>2015-06-01T07:52:49Z</LastUpdateDate>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<NumberOfItemsShipped>2</NumberOfItemsShipped>
<OrderStatus>Shipped</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<LatestDeliveryDate>2015-06-06T18:29:59Z</LatestDeliveryDate>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<BuyerName>Ajit Nair</BuyerName>
<EarliestDeliveryDate>2015-06-02T18:30:00Z</EarliestDeliveryDate>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>938.00</Amount>
</OrderTotal>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<PaymentMethod>Other</PaymentMethod>
<ShippingAddress>
<StateOrRegion>MAHARASHTRA</StateOrRegion>
<City>THANE</City>
<Phone>9769994355</Phone>
<CountryCode>IN</CountryCode>
<PostalCode>400709</PostalCode>
<Name>Ajit Nair</Name>
<AddressLine1>C-25 / con-7 / Chandralok CHS</AddressLine1>
<AddressLine2>Sector-10 ,Koper khairne</AddressLine2>
</ShippingAddress>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
</Order>
</Orders>
<CreatedBefore>2015-06-08T06:45:22Z</CreatedBefore>
<NextToken>smN7fNREdZyaJqJYLDm0ZIfVkJJPpovRb7YcCAmB0tlUojdU4H46trQzazHyYVyLqBXdLk4iogxpJASl2BeRezElfc2tdWR3lK0FtvOjoEqUrelVme04kSJ0wMvlylZkWQWPqGlbsnPaEpJjLWtrc27Vm9nDvRdgFtvOhjiqTWA16vKmtecRgbuZIF9n45mtnrZ4AbBdBTdge/hBzh1HtoVw85GaTVKBVfeXMWcfhX25HmwX5IAmwKfxnqm3JqvZ0Rjw/YZARKQMcjl5+H0CsJGesRwkZOQCBLVDshZ93sFo8v4Do3XuodaFg8ZGJDSTcawcthgh/MGM4KOIYd79q7Aq3I/8b9+STDy5JVgPyI0jQ6ftKc7EcAIwpq2cHuPbP+HgZXNbc7qI4HDvHa5YloEDUrIQbaP8qbwRHLZm6VTmGvVwLKwj6AZ0GNanrGO6</NextToken>
</ListOrdersResult>
<ResponseMetadata>
<RequestId>f2b55344-d281-4bd3-b8b3-788be07b7656</RequestId>
</ResponseMetadata>
</ListOrdersResponse>
I am using a python script to parse data from xml file. I want two fields from XML file AmazonOrderID and BuyerName. Some sub element in XML might not have have BuyerName. When I parse both individually, I get a list of 100 AmazonOrder and 70 BuyerName.
I want to get a empty string instead of nothing. i.e. if any subelement doesn't have a buyer name, i want to include '' instead of nothing.
My Code:
from xml.etree import ElementTree
with open('orders.xml', 'rb') as f:
tree = ElementTree.parse(f)
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order/d:AmazonOrderId', ns):
oid.append(node.text)
for node in tree.findall('.//d:Order/d:BuyerName', ns):
bn.append(node.text)
print oid
print bn

You can make it in a single loop using findtext() specifying the default as an empty string:
for node in tree.findall('.//d:Order', namespaces=ns):
oid.append(node.findtext("d:AmazonOrderId", default='', namespaces=ns))
bn.append(node.findtext("d:BuyerName", default='', namespaces=ns))

Related

How to iterate over a XML file and sum a specific field

I want to iterate over an xml file and get the sum of the field "PremieTot" (marked in the xml below)
<?xml version="1.0" encoding="iso-8859-1" ?>
<Bericht Version="1.0" xmlns="http://www.test.nl/test/2022/01">
<Bericht>
<IdBer>1111</IdBer>
<IdLcr>2323</IdLcr>
<NmLcr>Test Company</NmLcr>
</Bericht>
<AdministratieveEenheid>
<LhNr>3434</LhNr>
<NmIP>Test Company</NmIP>
<TvkCd>MND</TvkCd>
<TijdvakAangifte>
<DatAanvTv>2022-01-01</DatAanvTv>
<DatEindTv>2022-01-31</DatEindTv>
<VolledigeAangifte>
<CollectieveAangifte>
<TotaalRegelingen>
<RelNrAansl>3434</RelNrAansl>
</TotaalRegelingen>
<TotaalRegelingen>
<RelNrAansl>3434</RelNrAansl>
</TotaalRegelingen>
</CollectieveAangifte>
<InkomstenverhoudingInitieel>
<NumIV>1</NumIV>
<DatAanv>2020-01-01</DatAanv>
<PersNr>2364</PersNr>
<RegelingGegevens>
<PremieTot>0.52</PremieTot> //I want to sum this field
</RegelingGegevens>
</InkomstenverhoudingInitieel>
<InkomstenverhoudingInitieel>
<NumIV>1</NumIV>
<DatAanv>2020-07-01</DatAanv>
<PersNr>2365</PersNr>
<RegelingGegevens>
<PremieTot>0.66</PremieTot> //I want to sum this field
<AantVerlUPens>29.12</AantVerlUPens>
</RegelingGegevens>
</InkomstenverhoudingInitieel>
</VolledigeAangifte>
</TijdvakAangifte>
</AdministratieveEenheid>
</Bericht>
Iam trying it with xmldict to parse the xml file into a dict, but for some reason i cant get the value "PremieTot"
info_dict = xml_dict["PensioenAangifte"]["AdministratieveEenheid"]["TijdvakAangifte"]["VolledigeAangifte"]
premieTotal = [xml_data["RegelingGegevens]["PremieTot"] for xml_data in info_dict]

Quite easy with ElementTree:
from xml.etree import ElementTree as ET
et = ET.fromstring(xml)
result = sum(
float(el.text)
for el in et.findall('.//{*}PremieTot')
)

how to format attributes, prefixes, and tags using xml.etree.ElementTree Python

I'm trying to create a python script that will create a schema to then fill data based on an existing reference.
This is what I need to create:
<srp:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
this is what I have:
from xml.etree.ElementTree import *
from xml.dom import minidom
def prettify(elem):
rough_string = tostring(elem, "utf-8")
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")
ns = { "SOAP-ENV": "http://www.w3.org/2003/05/soap-envelope",
"SOAP-ENC": "http://www.w3.org/2003/05/soap-encoding",
"xsi": "http://www.w3.org/2001/XMLSchema-instance",
"srp": "http://www.-redacted-standards.org/Schemas/MSRP.xsd"}
def gen():
root = Element(QName(ns["xsi"],'root'))
print(prettify(root))
gen()
which gives me:
<xsi:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
how do I fix it so that the front matches?

The exact result that you ask for is incomplete, but with a few edits to the gen() function, it is possible to generate well-formed output.
The root element should be bound to the http://www.-redacted-standards.org/Schemas/MSRP.xsd namespace (srp prefix). In order to generate the xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" declaration, the namespace must be used in the XML document.
def gen():
root = Element(QName(ns["srp"], 'root'))
root.set(QName(ns["xsi"], "schemaLocation"), "whatever") # Add xsi:schemaLocation attribute
register_namespace("srp", ns["srp"]) # Needed to get 'srp' instead of 'ns0'
print(prettify(root))
Result (linebreaks added for readability):
<?xml version="1.0" ?>
<srp:root xmlns:srp="http://www.-redacted-standards.org/Schemas/MSRP.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="whatever"/>

gpxpy: Get extension value from gpx file

I’m trying to get the ID from a waypoint in my gpx-file. The ID is placed in the extension tag of my file. I’m using gpxpy to get other values like the latitude and longitude from the file, but I didn’t find a way to get the ID.
Here you can see my code:
import gpxpy
node_id = []
gpx_file = open("test.gpx", mode='rt', encoding='utf-8')
gpx = gpxpy.parse(gpx_file)
for waypoint in gpx.waypoints:
node_id.append(waypoint.extensions.id)
And a part of my test.gpx-file:
<wpt lat="53.865650" lon="10.684415">
<extensions>
<ogr:id>17</ogr:id>
<ogr:longitude>10.684415</ogr:longitude>
<ogr:latitude>53.865650</ogr:latitude>
</extensions>
</wpt>
Is there a way to get the id of the waypoint with gpxpy?

waypoint.extensions is just an array. So you can't just get an item by name. You have to iterate through that array. The "name" of the extensions is stored in the "tag" property of the Element, the value in the "text" property. As i don't have your xml-scheme to test with the extension ogr:id, i tried with the following gpx file:
<?xml version="1.0" encoding="UTF-8" ?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="OSMTracker for Android™ - https://github.com/labexp/osmtracker-android"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd ">
<wpt lat="10.31345465" lon="10.21237815">
<extensions>
<id>17</id>
</extensions>
<ele>110.0</ele>
<time>2018-09-29T09:31:58Z</time>
<name><![CDATA[train station]]></name>
<sat>0</sat>
</wpt>
</gpx>
I wrote an short function to get the id. It is not tested against anything (for example the extensions doesn't exist).
import gpxpy
def getId(waypoint):
for extension in waypoint.extensions:
if extension.tag == 'id':
return extension.text
node_id = []
gpx_file = open("test2.gpx", mode='rt', encoding='utf-8')
gpx = gpxpy.parse(gpx_file)
for waypoint in gpx.waypoints:
print(getId(waypoint))
The functions gets an GPX Waypoint as argument and loops through the extensions array. If that array contains an element with the tag (name) "id" it returns the text (value).
Best regards
Thimo

Python xml to csv

Please read entire question before marking duplicate.
I have a nested XML file which i Want to convert to a csv file.
I have to write a python script for same.
The XML file is:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<Orders>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T03:58:30Z</PurchaseDate>
<AmazonOrderId>171-6355256-9594715</AmazonOrderId>
<LastUpdateDate>2015-06-01T04:18:58Z</LastUpdateDate>
<ShipServiceLevel>IN Std Domestic</ShipServiceLevel>
<NumberOfItemsShipped>0</NumberOfItemsShipped>
<OrderStatus>Canceled</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
</Order>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T04:50:07Z</PurchaseDate>
<BuyerEmail>dr7h1rhy6457rng#marketplace.amazon.in</BuyerEmail>
<AmazonOrderId>403-5551715-2566754</AmazonOrderId>
<LastUpdateDate>2015-06-01T07:52:49Z</LastUpdateDate>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<NumberOfItemsShipped>2</NumberOfItemsShipped>
<OrderStatus>Shipped</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<LatestDeliveryDate>2015-06-06T18:29:59Z</LatestDeliveryDate>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<BuyerName>Ajit Nair</BuyerName>
<EarliestDeliveryDate>2015-06-02T18:30:00Z</EarliestDeliveryDate>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>938.00</Amount>
</OrderTotal>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<PaymentMethod>Other</PaymentMethod>
<ShippingAddress>
<StateOrRegion>MAHARASHTRA</StateOrRegion>
<City>THANE</City>
<Phone>9769994355</Phone>
<CountryCode>IN</CountryCode>
<PostalCode>400709</PostalCode>
<Name>Ajit Nair</Name>
<AddressLine1>C-25 / con-7 / Chandralok CHS</AddressLine1>
<AddressLine2>Sector-10 ,Koper khairne</AddressLine2>
</ShippingAddress>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
</Order>
I tried to get values for my code in form of a list. But it doesn't print anything.
My Code:
from xml.etree import ElementTree
with open('orders.xml', 'rb') as f:
tree = ElementTree.parse(f)
for node in tree.findall('.//Order'):
oid = node.attrib.get('SellerOrderId')
if oid:
print oid
What is wrong with my code?
EDIT: Temporary link to complete File Orders.xml

Your XML has default namespace defined here :
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
Note that descendant elements inherits ancestor default namespace implicitly, unless otherwise specified. You need to combine namespace + local name to form a fully qualified element name, for example :
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order', ns):
oid = node.attrib.get('SellerOrderId')
if oid:
print oid
According to the full XML file you linked to, SellerOrderId is child element of Order instead of attribute. In this case, you can simply use .//d:Order/d:SellerOrderId to get them and then print it's value, like so :
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order/d:SellerOrderId', ns):
print node.text
output :
171-1322776-9700344
171-4214129-7148305
402-8263846-7042737
402-7017923-9474716
402-9691237-2887553
171-4614227-7597903
403-6729903-2119563
402-2184564-2676353
171-4520392-2088330
402-7986969-8827533

Xpath select attribute of current node?

I use python with lxml to process the xml. After I query/filter to get to a nodes I want but I have some problem. How to get its attribute's value by xpath ? Here is my input example.
>print(etree.tostring(node, pretty_print=True ))
<rdf:li xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
The value I want is in resource=... . Currently I just use the lxml to get the value. I wonder if it is possible to do in pure xpath ? thanks
EDIT: Forgot to said, this is not a root nodes so I can't use // here. I have like 2000-3000 others in xml file. My first attempt was playing around with ".#attrib" and "self::*#" but those does not seems to work.
EDIT2: I will try my best to explain (well, this is my first time to deal with xml problem using xpath. and english is not one of my favorite field....). Here is my input snippet http://pastebin.com/kZmVdbQQ (full one from here http://www.comp-sys-bio.org/yeastnet/ using version 4).
In my code, I try to get speciesTypes node with resource link chebi (<rdf:li rdf:resource="urn:miriam:obo.chebi:...."/>). and then I tried to get value from rdf:resource attribute in rdf:li. The thing is, I am pretty sure it would be easy to get attribute in child node if I start from parent node like speciesTypes, but I wonder how to do if I start from rdf:li. From my understanding, the "//" in xpath will looking for node from everywhere not just only in the current node.
below is my code
import lxml.etree as etree
tree = etree.parse("yeast_4.02.xml")
root = tree.getroot()
ns = {"sbml": "http://www.sbml.org/sbml/level2/version4",
"rdf":"http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"body":"http://www.w3.org/1999/xhtml",
"re": "http://exslt.org/regular-expressions"
}
#good enough for now
maybemeta = root.xpath("//sbml:speciesType[descendant::rdf:li[starts-with(#rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(#rdf:resource, 'urn:miriam:uniprot'))]]", namespaces = ns)
def extract_name_and_chebi(node):
name = node.attrib['name']
chebies = node.xpath("./sbml:annotation//rdf:li[starts-with(#rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(#rdf:resource, 'urn:miriam:uniprot'))]", namespaces=ns) #get all rdf:li node with chebi resource
assert len(chebies) == 1
#my current solution to get rdf:resource value from rdf:li node
rdfNS = "{" + ns.get('rdf') + "}"
chebi = chebies[0].attrib[rdfNS + 'resource']
#do protein later
return (name, chebi)
metaWithChebi = map(extract_name_and_chebi, maybemeta)
fo = open("metabolites.txt", "w")
for name, chebi in metaWithChebi:
fo.write("{0}\t{1}\n".format(name, chebi))

Prefix the attribute name with # in the XPath query:
>>> from lxml import etree
>>> xml = """\
... <?xml version="1.0" encoding="utf8"?>
... <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
... <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
... </rdf:RDF>
... """
>>> tree = etree.fromstring(xml)
>>> ns = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
>>> tree.xpath('//rdf:li/#rdf:resource', namespaces=ns)
['urn:miriam:obo.chebi:CHEBI%3A37671']
EDIT
Here's a revised version of the script in the question:
import lxml.etree as etree
ns = {
'sbml': 'http://www.sbml.org/sbml/level2/version4',
'rdf':'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'body':'http://www.w3.org/1999/xhtml',
're': 'http://exslt.org/regular-expressions',
}
def extract_name_and_chebi(node):
chebies = node.xpath("""
.//rdf:li[
starts-with(#rdf:resource, 'urn:miriam:obo.chebi')
]/#rdf:resource
""", namespaces=ns)
return node.attrib['name'], chebies[0]
with open('yeast_4.02.xml') as xml:
tree = etree.parse(xml)
maybemeta = tree.xpath("""
//sbml:speciesType[descendant::rdf:li[
starts-with(#rdf:resource, 'urn:miriam:obo.chebi')]]
""", namespaces = ns)
with open('metabolites.txt', 'w') as output:
for node in maybemeta:
output.write('%s\t%s\n' % extract_name_and_chebi(node))

To select off the current node its attribute named rdf:resource, use this XPath expression:
#rdf:resource
In order for this to "work correctly" you must register the association of the prefix "rdf:" to the corresponding namespace.
If you don't know how to register the rdf namespace, it is still possible to select the attribute -- with this XPath expression:
#*[name()='rdf:resource']

Well, I got it. The xpath expression I need here is "./#rdf:resource" not ".#rdf:resource". But why ? I thought "./" indicate the child of current node.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Conditional Search in XML Python - python

You can make it in a single loop using findtext() specifying the default as an empty string: for node in tree.findall('.//d:Order', namespaces=ns): oid.append(node.findtext("d:AmazonOrderId", default='', namespaces=ns)) bn.append(node.findtext("d:BuyerName", default='', namespaces=ns))

Related

How to iterate over a XML file and sum a specific field

how to format attributes, prefixes, and tags using xml.etree.ElementTree Python

gpxpy: Get extension value from gpx file

Python xml to csv

Xpath select attribute of current node?

Categories

Resources