How can I parse an XML file to a dictionary in Python?

I am trying to parse an XML file using the Python library minidom (I also tried the xml.etree.ElementTree API).
My XML (resource.xml):
<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
</quota_rule>
<quota_rule name='max_mem_per_user/5'>
<users>user1</users>
<limit resource='mem' limit='1550' value='921'/>
</quota_rule>
<quota_rule name='max_mem_per_user/6'>
<users>user2</users>
<limit resource='mem' limit='2150' value='3'/>
</quota_rule>
</quota_result>
I would like to parse this file, store the information in a dictionary of the following form, and be able to access it:
dict={user1=[resource,limit,value],user2=[resource,limit,value]}
So far I have only been able to do things like:
from xml.dom import minidom

docXML = minidom.parse("resource.xml")
for node in docXML.getElementsByTagName('limit'):
    print(node.getAttribute('value'))

You can use getElementsByTagName and getAttribute to build the result:
from xml.dom.minidom import parse

dict_users = dict()
docXML = parse('mydata.xml')
users = docXML.getElementsByTagName("quota_rule")
for node in users:
    user = 'None'
    # Check the length of tag_user to see whether the <users> tag exists
    tag_user = node.getElementsByTagName("users")
    if len(tag_user) == 0:
        print("tag <users> does not exist")
    else:
        user = tag_user[0]
    limit_node = node.getElementsByTagName("limit")[0]
    resource = limit_node.getAttribute("resource")
    limit = limit_node.getAttribute("limit")
    value = limit_node.getAttribute("value")
    # Fall back to the key 'None' when the rule has no <users> tag
    if user == 'None':
        dict_users['None'] = [resource, limit, value]
    else:
        dict_users[user.firstChild.data] = [resource, limit, value]
print(dict_users)
Output (with <users>user1</users> removed from the XML):
tag <users> does not exist
{'None': [u'mem', u'1550', u'921'], u'user2': [u'mem', u'2150', u'3']}
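Since the question also mentions xml.etree.ElementTree, here is a minimal sketch of the same idea with that API. It assumes the sample document is well-formed (both quota_rule elements properly opened and closed) and that the default namespace is the https://some_url placeholder from the question. The prefix q and the helper name quota_to_dict are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the question's resource.xml, with tags closed properly.
XML_TEXT = """<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
  <quota_rule name='max_mem_per_user/5'>
    <users>user1</users>
    <limit resource='mem' limit='1550' value='921'/>
  </quota_rule>
  <quota_rule name='max_mem_per_user/6'>
    <users>user2</users>
    <limit resource='mem' limit='2150' value='3'/>
  </quota_rule>
</quota_result>"""

# The document declares a default namespace, so every tag must be qualified.
NS = {"q": "https://some_url"}


def quota_to_dict(xml_text):
    """Map each user to [resource, limit, value] (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for rule in root.findall("q:quota_rule", NS):
        # Use the key 'None' when a rule has no <users> child, as in the answer.
        user = rule.findtext("q:users", default="None", namespaces=NS).strip()
        limit = rule.find("q:limit", NS)
        if limit is None:
            continue  # skip rules without a <limit> element
        result[user] = [limit.get("resource"), limit.get("limit"), limit.get("value")]
    return result


users = quota_to_dict(XML_TEXT)
print(users["user1"])
```

Individual values can then be read with ordinary dictionary access, for example users["user2"][2] for the value attribute.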

Related

How to get the full XML when decoding bytes to string. It is generating XML but truncated

I am using dicttoxml to generate an XML file in Python.
xml = dicttoxml(obj, item_func = my_item_func, attr_type=False, custom_root='Benefits')
It generates the XML as a bytes object, which I then convert to a string for my XSD validation.
strxml = bytes.decode(xml, 'utf-8')
Issue: when the XML is generated, many nodes are skipped and replaced by "...". I think this happens because the XML file is very big. I don't want any nodes skipped; I want the XML file in its entirety.
However, when this "xml" bytes object is rendered in the browser, or when I print it in debug mode, there is no issue and I get the XML in its entirety.
How can I overcome this problem?
Here is the complete code:
from datetime import date
from dicttoxml import dicttoxml
from xml.dom.minidom import parseString
import json
import xmlschema

def response_handler(self, node, mktsegtype):
    isvalidxml = False
    if node is not None:
        obj = json.loads(json.dumps(node,default=lambda o: dict((key, value) for key, value in o.__dict__.items() if value is not None),indent=4,allow_nan=False))
        my_item_func = lambda x: 'cvrg' if x == "Covered" else('Insure' if x == "Insurance" else x[:-1])
        xml = dicttoxml(obj, item_func = my_item_func, attr_type=False, custom_root='Benefits')
        isvalidxml = self.validatexmlwithxsd(xml, mktsegtype)
    if(isvalidxml):
        return xml
    else:
        return None

def validatexmlwithxsd(self, xml, mktsegtype):
    valid = False
    xsd = None
    strxml = bytes.decode(xml, 'utf-8')
    if(mktsegtype == "XXX"):
        xsd = xmlschema.XMLSchema('tests/test_cases/examples/vehicles/vehicles.xsd')
    elif(mktsegtype == "YYY"):
        xsd = xmlschema.XMLSchema('tests/test_cases/examples/vehicles/boat.xsd')
    valid = xsd.is_valid(strxml)
    return valid
Example of the nodes generated:
<cvrg>
<cvrgID>285</cvrgID>
<cvrgCoveredRtl>1</cvrgCoveredRtl>
<cvrgCoveredMail>1</cvrgCoveredMail>
<cvrgO...goryAgeLimitMax>
</cvrgAgeLimitMax>
<cvrgOutOfLimitActionAge></cvrgOutOfLimitActionAge>
<cvrgOutOfLimitActionGender></cvrgOutOfLimitActionGender>
</cvrg>
<cvrg>
<cvrgID>559</cvrgID>
<cvrgCoveredRtl>2</cvrgCoveredRtl>
<cvrgCoveredMail>2</cvrgCoveredMail>
<cvrgOutOfLimitActionAge></cvrgOutOfLimitActionAge>
<cvrgOutOfLimitActionGender></cvrgOutOfLimitActionGender>
</cvrg>
Update 1:
I tried to capture the print output in a variable, based on the question "Python: Assign print output to a variable":
from io import StringIO

s = StringIO()
print(xml, file=s, flush=True)
result = s.getvalue()
In this case I also get truncated XML nodes (with "..."), but as I mentioned in my original post, when I print the bytes object in the debug window or render it in the browser, I get the entire XML. Any suggestions or help would be appreciated.
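One way to narrow this down is sketched here, on the assumption (not confirmed in the post) that the "..." comes from how a debugger or console abbreviates long values rather than from the decode step. Decoding bytes as UTF-8 returns the whole string, so a round-trip comparison and a file dump show whether anything is actually missing. The payload is a made-up stand-in for the dicttoxml output, and decode_and_check is a hypothetical helper name.

```python
import os
import tempfile


def decode_and_check(xml_bytes):
    """Decode UTF-8 bytes and report whether the text round-trips unchanged."""
    text = xml_bytes.decode("utf-8")
    return text, text.encode("utf-8") == xml_bytes


# Stand-in payload: a large document built from repeated nodes.
payload = b"<Benefits>" + b"<cvrg><cvrgID>285</cvrgID></cvrg>" * 5000 + b"</Benefits>"
text, intact = decode_and_check(payload)

# Same length and an unchanged round trip mean nothing was dropped.
print(len(payload), len(text), intact)

# Writing the string to a file sidesteps any abbreviation in a variable viewer.
path = os.path.join(tempfile.gettempdir(), "benefits_check.xml")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
```

If the lengths match and the file contains every node, the string passed to the validator is complete and only the on-screen representation was shortened.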

How to handle XML element iter nested attributes with the same tag

I am trying to parse an NPORT-P XML SEC submission. My code (Python 3.6.8) with a sample XML record:
import xml.etree.ElementTree as ET
content_xml = '<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><headerData></headerData><formData><genInfo></genInfo><fundInfo></fundInfo><invstOrSecs><invstOrSec><name>N/A</name><lei>N/A</lei><title>US 10YR NOTE (CBT)Sep20</title><cusip>N/A</cusip> <identifiers> <ticker value="TYU0"/> </identifiers> <derivativeInfo> <futrDeriv derivCat="FUT"> <counterparties> <counterpartyName>Chicago Board of Trade</counterpartyName> <counterpartyLei>549300EX04Q2QBFQTQ27</counterpartyLei> </counterparties><payOffProf>Short</payOffProf> <descRefInstrmnt> <otherRefInst> <issuerName>U.S. Treasury 10 Year Notes</issuerName> <issueTitle>U.S. Treasury 10 Year Notes</issueTitle> <identifiers> <cusip value="N/A"/><other otherDesc="USER DEFINED" value="TY_Comdty"/> </identifiers> </otherRefInst> </descRefInstrmnt> <expDate>2020-09-21</expDate> <notionalAmt>-2770555</notionalAmt> <curCd>USD</curCd> <unrealizedAppr>-12882.5</unrealizedAppr></futrDeriv> </derivativeInfo> </invstOrSec> </invstOrSecs> <signature> </signature> </formData></edgarSubmission>'
content_tree = ET.ElementTree(ET.fromstring(bytes(content_xml, encoding='utf-8')))
content_root = content_tree.getroot()
for edgar_submission in content_root.iter('{http://www.sec.gov/edgar/nport}edgarSubmission'):
    for form_data in edgar_submission.iter('{http://www.sec.gov/edgar/nport}formData'):
        for genInfo in form_data.iter('{http://www.sec.gov/edgar/nport}genInfo'):
            pass
        for fundInfo in form_data.iter('{http://www.sec.gov/edgar/nport}fundInfo'):
            pass
        for invstOrSecs in form_data.iter('{http://www.sec.gov/edgar/nport}invstOrSecs'):
            for invstOrSec in invstOrSecs.iter('{http://www.sec.gov/edgar/nport}invstOrSec'):
                myrow = []
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}name'), 'text', ''))
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}lei'), 'text', ''))
                security_title = getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}title'), 'text', '')
                myrow.append(security_title)
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}cusip'), 'text', ''))
                for identifiers in invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers'):
                    if identifiers.find('{http://www.sec.gov/edgar/nport}isin') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}isin').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No ISIN")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}ticker') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}ticker').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Ticker")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}other') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}other').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Other")
The output from this code is:
No ISIN
No Other
No ISIN
No Ticker
This works fine, except that invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers') finds the identifiers under formData>invstOrSecs>invstOrSec but also the other identifiers under the nested tag formData>invstOrSecs>invstOrSec>derivativeInfo>futrDeriv>descRefInstrmnt>otherRefInst. How can I restrict my iter or find to the right level? I have tried unsuccessfully to get the parent, but I cannot find out how to do this using the {namespace}tag notation. Any ideas?

So I switched from ElementTree to lxml, using an import like this to avoid code changes:
from lxml import etree as ET
Make sure you check https://lxml.de/1.3/compatibility.html to avoid compatibility issues. In my case lxml worked without issues.
I was then able to use the getparent() method to read only the identifiers from the right part of the XML file:
if identifiers.getparent().tag == '{http://www.sec.gov/edgar/nport}invstOrSec':
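If staying with the standard library is preferable, a possible alternative (not the approach used above) relies on the fact that Element.findall() with a plain tag name matches only direct children, whereas iter() walks the whole subtree. The sketch below uses a cut-down version of the sample record. The NPORT constant and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NPORT = "{http://www.sec.gov/edgar/nport}"

# Cut-down record: one <identifiers> directly under invstOrSec, one nested deeper.
SAMPLE = (
    '<invstOrSec xmlns="http://www.sec.gov/edgar/nport">'
    '<identifiers><ticker value="TYU0"/></identifiers>'
    '<derivativeInfo><futrDeriv><descRefInstrmnt><otherRefInst>'
    '<identifiers><cusip value="N/A"/></identifiers>'
    '</otherRefInst></descRefInstrmnt></futrDeriv></derivativeInfo>'
    '</invstOrSec>'
)

invst_or_sec = ET.fromstring(SAMPLE)

# iter() descends through every level, so it also finds the nested element.
all_levels = list(invst_or_sec.iter(NPORT + "identifiers"))

# findall() with a bare tag name only looks at direct children.
direct_only = invst_or_sec.findall(NPORT + "identifiers")

print(len(all_levels), len(direct_only))
print(direct_only[0].find(NPORT + "ticker").get("value"))
```

In the question's loop this would mean replacing invstOrSec.iter(...) with invstOrSec.findall(...) for the identifiers tag, leaving the rest unchanged.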

AttributeError when assigning value to function for XML data extraction

I'm coding a script to extract information from several XML files that share the same structure but have missing sections when there is no information related to a tag. The easiest way to achieve this was to use try/except, so that instead of getting "AttributeError: 'NoneType' object has no attribute 'find'" I assign an empty string ('') to the object in the except block. Something like this:
try:
    string1=root.find('value1').find('value2').find('value3').text
except:
    string1=''
The issue is that I want to shrink my code by using a function:
def extract(string):
    tempstr=''
    try:
        tempstr=string.replace("\n", "")
    except:
        if tempstr is None:
            tempstr=""
    return string
And then I try to call it like this:
string1=extract(root.find('value1').find('value2').find('value3').text)
If value2 or value3 does not exist in the XML being processed, I get an AttributeError even if I don't use the variable in the function, which makes the function useless.
Is there a way to make the function work, perhaps by making it run without first checking whether the value passed in is valid?
Solution:
I'm using a mix of both answers:
def extract(root, xpath):
    tempstr=''
    try:
        tempstr=root.findall(xpath)[0].text.replace("\n", "")
    except (IndexError, AttributeError):
        # No match (IndexError) or empty text (AttributeError on None)
        tempstr=''#To avoid getting a Nonetype object
    return tempstr
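The original call fails before extract() ever runs because Python evaluates the argument expression first, so the chained find() calls raise outside the function's try block. One way to keep the one-call style, sketched here with a hypothetical helper named safe_text, is to pass a callable so that the lookup happens inside the try.

```python
import xml.etree.ElementTree as ET


def safe_text(getter, default=""):
    """Run getter() and return its text, or default if the chain hits None."""
    try:
        value = getter()
    except AttributeError:
        # Some find() in the chain returned None, so the next call failed.
        return default
    if value is None:
        return default
    return value.replace("\n", "")


root = ET.fromstring("<r><value1><value2><value3>a\nb</value3></value2></value1></r>")

# The lambda delays evaluation until safe_text calls it inside the try block.
found = safe_text(lambda: root.find("value1").find("value2").find("value3").text)
missing = safe_text(lambda: root.find("value1").find("nope").find("value3").text)

print(repr(found), repr(missing))
```

The lambda adds a little noise at each call site, which is why passing the root and a path, as in the solution above, is often the tidier choice.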

You can try something like this:
def extract(root, children_keys: list):
    target_object = root
    result_text = ''
    try:
        for child_key in children_keys:
            target_object = target_object.find(child_key)
        result_text = target_object.text
    except AttributeError:
        # A key was missing, so find() returned None somewhere along the path
        pass
    return result_text
The for loop walks deeper into the XML structure (children_keys is a list of nested keys that you define in advance: the path to your object).
If a key is missing and an error is raised inside that code, you get '' as the result.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>
<y>Don't forget me this weekend!</y>
</body>
</note>
Example:
import xml.etree.ElementTree as ET
tree = ET.parse('note.xml')
root = tree.getroot()
children_keys = ['body', 'y']
result_string = extract(root, children_keys)
print(result_string)
Output:
"Don't forget me this weekend!"

Use an XPath expression:
import xml.etree.ElementTree as ET

xml1 = '''<r><v1><v2><v3>a string</v3></v2></v1></r>'''
root = ET.fromstring(xml1)
v3 = root.findall('./v1/v2/v3')
if v3:
    print(v3[0].text)
else:
    print('v3 not found')

xml2 = '''<r><v1><v3>a string</v3></v1></r>'''
root = ET.fromstring(xml2)
v3 = root.findall('./v1/v2/v3')
if v3:
    print(v3[0].text)
else:
    print('v3 not found')
Output:
a string
v3 not found
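A slightly shorter variant of the same idea, offered only as a sketch, uses Element.findtext(), which takes a path and a default and returns the default when nothing matches. The extract name mirrors the question. The default argument and the newline stripping are assumptions about what the asker wants.

```python
import xml.etree.ElementTree as ET


def extract(root, path, default=""):
    """Return the text at path with newlines removed, or default if absent."""
    text = root.findtext(path, default=default)
    if text is None:
        return text  # only possible when the caller passes default=None
    return text.replace("\n", "")


complete = ET.fromstring("<r><v1><v2><v3>a string</v3></v2></v1></r>")
partial = ET.fromstring("<r><v1><v3>a string</v3></v1></r>")

print(repr(extract(complete, "./v1/v2/v3")))
print(repr(extract(partial, "./v1/v2/v3")))
```

Because no exception is involved, a typo in the path silently yields the default, so it can help to log missing paths while developing.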

Python xml to csv

Please read entire question before marking duplicate.
I have a nested XML file which I want to convert to a CSV file.
I have to write a Python script for this.
The XML file is:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<Orders>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T03:58:30Z</PurchaseDate>
<AmazonOrderId>171-6355256-9594715</AmazonOrderId>
<LastUpdateDate>2015-06-01T04:18:58Z</LastUpdateDate>
<ShipServiceLevel>IN Std Domestic</ShipServiceLevel>
<NumberOfItemsShipped>0</NumberOfItemsShipped>
<OrderStatus>Canceled</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
</Order>
<Order>
<LatestShipDate>2015-06-02T18:29:59Z</LatestShipDate>
<OrderType>StandardOrder</OrderType>
<PurchaseDate>2015-05-31T04:50:07Z</PurchaseDate>
<BuyerEmail>dr7h1rhy6457rng#marketplace.amazon.in</BuyerEmail>
<AmazonOrderId>403-5551715-2566754</AmazonOrderId>
<LastUpdateDate>2015-06-01T07:52:49Z</LastUpdateDate>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<NumberOfItemsShipped>2</NumberOfItemsShipped>
<OrderStatus>Shipped</OrderStatus>
<SalesChannel>Amazon.in</SalesChannel>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<LatestDeliveryDate>2015-06-06T18:29:59Z</LatestDeliveryDate>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<BuyerName>Ajit Nair</BuyerName>
<EarliestDeliveryDate>2015-06-02T18:30:00Z</EarliestDeliveryDate>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>938.00</Amount>
</OrderTotal>
<IsPremiumOrder>false</IsPremiumOrder>
<EarliestShipDate>2015-05-31T18:30:00Z</EarliestShipDate>
<MarketplaceId>A21TJRUUN4KGV</MarketplaceId>
<FulfillmentChannel>MFN</FulfillmentChannel>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<PaymentMethod>Other</PaymentMethod>
<ShippingAddress>
<StateOrRegion>MAHARASHTRA</StateOrRegion>
<City>THANE</City>
<Phone>9769994355</Phone>
<CountryCode>IN</CountryCode>
<PostalCode>400709</PostalCode>
<Name>Ajit Nair</Name>
<AddressLine1>C-25 / con-7 / Chandralok CHS</AddressLine1>
<AddressLine2>Sector-10 ,Koper khairne</AddressLine2>
</ShippingAddress>
<IsPrime>false</IsPrime>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
</Order>
I tried to get the values in my code in the form of a list, but it doesn't print anything.
My code:
from xml.etree import ElementTree

with open('orders.xml', 'rb') as f:
    tree = ElementTree.parse(f)
for node in tree.findall('.//Order'):
    oid = node.attrib.get('SellerOrderId')
    if oid:
        print(oid)
What is wrong with my code?
EDIT: Temporary link to complete File Orders.xml

Your XML has a default namespace defined here:
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
Note that descendant elements inherit the ancestor's default namespace implicitly, unless otherwise specified. You need to combine the namespace and the local name to form a fully qualified element name, for example:
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order', ns):
    oid = node.attrib.get('SellerOrderId')
    if oid:
        print(oid)
According to the full XML file you linked to, SellerOrderId is a child element of Order rather than an attribute. In this case, you can simply use .//d:Order/d:SellerOrderId to get the elements and then print their values, like so:
ns = {'d': 'https://mws.amazonservices.com/Orders/2013-09-01'}
for node in tree.findall('.//d:Order/d:SellerOrderId', ns):
    print(node.text)
Output:
171-1322776-9700344
171-4214129-7148305
402-8263846-7042737
402-7017923-9474716
402-9691237-2887553
171-4614227-7597903
403-6729903-2119563
402-2184564-2676353
171-4520392-2088330
402-7986969-8827533
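The answer stops at reading the elements. For the CSV step itself, a rough sketch could look like the following. It assumes one row per Order, flattens nested elements such as OrderTotal into dotted column names (a made-up convention), and fills fields missing from an order with empty strings. The sample is a shortened version of the XML in the question, and flatten and orders_to_csv are hypothetical helper names.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"d": "https://mws.amazonservices.com/Orders/2013-09-01"}

SAMPLE = """<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
  <ListOrdersResult><Orders>
    <Order>
      <AmazonOrderId>171-6355256-9594715</AmazonOrderId>
      <OrderStatus>Canceled</OrderStatus>
    </Order>
    <Order>
      <AmazonOrderId>403-5551715-2566754</AmazonOrderId>
      <OrderStatus>Shipped</OrderStatus>
      <OrderTotal><CurrencyCode>INR</CurrencyCode><Amount>938.00</Amount></OrderTotal>
    </Order>
  </Orders></ListOrdersResult>
</ListOrdersResponse>"""


def flatten(element, prefix=""):
    """Map leaf elements to their text, joining nested names with '.'."""
    row = {}
    for child in element:
        name = prefix + child.tag.split("}", 1)[-1]  # drop the {namespace} part
        if len(child):  # the child has children of its own, so recurse
            row.update(flatten(child, name + "."))
        else:
            row[name] = (child.text or "").strip()
    return row


def orders_to_csv(xml_text, out):
    """Write one CSV row per Order element to the file-like object out."""
    root = ET.fromstring(xml_text)
    rows = [flatten(order) for order in root.findall(".//d:Order", NS)]
    # Orders differ in their fields, so take the union of names in first-seen order.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return rows


buffer = io.StringIO()
rows = orders_to_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open('orders.csv', 'w', newline='', encoding='utf-8') and pass the file object as out.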

How to retrieve certain child elements using python and lxml

With lots of help from Stack Overflow, I managed to get some Python code working to process XML files (using lxml). I've been able to adapt it for lots of different purposes, but there is one thing I can't work out.
Example XML:
<?xml version="1.0" encoding="UTF-8" ?>
<TVAMain xml:lang="PL" publisher="Someone" publicationTime="2014-01-03T06:24:24+00:00" version="217" xmlns="urn:tva:metadata:2010" xmlns:mpeg7="urn:tva:mpeg7:2008" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:tva:metadata:2010 http://Something.xsd">
<ProgramDescription>
<ProgramInformationTable>
<ProgramInformation programId="crid://bds.tv/88032"><BasicDescription>
<Title xml:lang="PL" type="episodeTitle">Some Title</Title>
<Synopsis xml:lang="PL" length="short">Some Synopsis</Synopsis>
<Genre href="urn:tva:metadata:cs:EventGenreCS:2009:96">
<Name xml:lang="EN">Some Genre</Name>
</Genre>
<Language>PL</Language>
<RelatedMaterial>
<HowRelated href="urn:eventis:metadata:cs:HowRelatedCS:2010:boxCover">
<Name>Box cover</Name>
</HowRelated>
<MediaLocator>
<mpeg7:MediaUri>file://Images/98528834.p.jpg</mpeg7:MediaUri>
</MediaLocator>
</RelatedMaterial>
The python code will return the Title, Genre and Synopsis, but it will not return the image reference (3rd line from the bottom). I presume this is because of the name format 'mpeg7:MediaUri' (which I cannot change). The code will return the 'No Image' string instead.
This is the relevant Python code:
from lxml import etree

file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010'}
with open(file_name+'.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
        crid = (info.get('programId'))
        titlex = (info.find('.//xmlns:Title', namespaces=nsmap))
        title = (titlex.text if titlex is not None else 'No title')
        genrex = (info.find('.//xmlns:Genre/xmlns:Name', namespaces=nsmap))
        genre = (genrex.text if genrex is not None else 'No Genre')
        imagex = (info.find('.//xmlns:RelatedMaterial/xmlns:MediaLocator/xmlns:"mpeg7:MediaUri"', namespaces=nsmap))
        image = (imagex.text if imagex is not None else 'No Image')
        f.write('{}|{}|{}|{}\n'.format(crid, title, genre, image))
Can someone explain how I can adapt the 'imagex' line, so that it returns 'file://Images/98528834.p.jpg' from the example? I had a look at using square brackets, but it caused an error.

The node you are interested in is in the mpeg7 namespace rather than the default namespace. You can use the syntax *[local-name() = "elementName"] to match an element by its local name (ignoring the namespace):
imagex = info.xpath(
    './/xmlns:RelatedMaterial/xmlns:MediaLocator/*[local-name() = "MediaUri"]',
    namespaces=nsmap)[0]
Or add mpeg7 to the namespaces declaration:
nsmap = {'xmlns': 'urn:tva:metadata:2010', 'mpeg7':'urn:tva:mpeg7:2008'}
Then you can use the mpeg7 prefix in the XPath query:
imagex = (info.find('.//xmlns:RelatedMaterial/xmlns:MediaLocator/mpeg7:MediaUri', namespaces=nsmap))
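The two-prefix approach can be tried without lxml, since the standard library's xml.etree.ElementTree accepts the same kind of prefix map in find(). The sketch below runs against a trimmed copy of the sample document. The prefix tva is an arbitrary name chosen here in place of xmlns, and the trimmed SAMPLE is an assumption about which parts of the file matter.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample: default namespace plus the mpeg7 namespace.
SAMPLE = """<TVAMain xmlns="urn:tva:metadata:2010" xmlns:mpeg7="urn:tva:mpeg7:2008">
  <ProgramInformation programId="crid://bds.tv/88032">
    <BasicDescription>
      <Title>Some Title</Title>
      <RelatedMaterial>
        <MediaLocator>
          <mpeg7:MediaUri>file://Images/98528834.p.jpg</mpeg7:MediaUri>
        </MediaLocator>
      </RelatedMaterial>
    </BasicDescription>
  </ProgramInformation>
</TVAMain>"""

# Prefixes are local to the query; only the namespace URIs must match the file.
NSMAP = {"tva": "urn:tva:metadata:2010", "mpeg7": "urn:tva:mpeg7:2008"}

root = ET.fromstring(SAMPLE)
info = root.find(".//tva:ProgramInformation", NSMAP)
imagex = info.find(".//tva:RelatedMaterial/tva:MediaLocator/mpeg7:MediaUri", NSMAP)
image = imagex.text if imagex is not None else "No Image"

print(image)
```

The prefix used in the query does not have to match the one in the document, which is why tva works here even though the file declares that namespace as the default.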
