Extract data from XML to Excel (Python 2.7)

I'm attempting to extract some data from an XML file and create an Excel file with the information.
XML File:
<UniversalTransaction>
<TransactionInfo>
<DataContext>
<DataSourceCollection>
<DataSource>
<Type>AccountingInvoice</Type>
<Key>AR INV 00001006</Key>
</DataSource>
</DataSourceCollection>
<Company>
<Code>DCL</Code>
<Country>
<Code>CL</Code>
<Name>Chile</Name>
</Country>
<Name>Your Chile Corp</Name>
</Company>
...etc
Then I wrote this code in Python 2.7:
import xml.etree.ElementTree as ET
import xlwt
from datetime import datetime
tree = ET.parse('ar.xml')
root = tree.getroot()
#extract xml
invoice = root.findall('DataSource')
arinv = root.find('Key').text
country = root.findall('Company')
ctry = root.find('Name').text
wb = xlwt.Workbook()
ws = wb.add_sheet('A Test Sheet')
ws.write(0, 0, arinv)
ws.write(0, 1, ctry)
wb.save('example2.xls')
But I get this error:
arinv = root.find('Key').text
'NoneType' object has no attribute 'text'
And I guess it will be the same with
ctry = root.find('Name').text
Also, when I change the "extract xml" part of the code to this:
for ar in root.findall('DataContext'):
    nro = []
    ctry = []
    inv = ar.find('Key').text
    nro.append(inv)
    country = ar.find('Name').text
    ctry.append(country)
I get the following error:
ws.write(0, 0, arinv)
name 'arinv' is not defined
Then again, I guess it's the same with "ctry".
Windows 10, Python 2.7
I'd appreciate any help, thanks.

It is better to ask shorter questions, without the whole bunch of context code. You will probably find the solution yourself when you carefully try to reduce the problem to one exact, short question.
According to the docs, Element.find with a plain tag name searches only the direct children of the element. You need an XPath expression (see the XPath support section in the docs), such as
root.findall('.//Key')[0].text
(this assumes the Key element always exists, contains text, and is unique within the document; i.e. there is no validation)
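To make the difference concrete, here is a minimal sketch using only the standard library. The XML is the fragment from the question; the closing tags after </Company> are my assumption, because the original file is cut off at "...etc". It uses findtext, which returns None instead of raising when nothing matches.

```python
import xml.etree.ElementTree as ET

# Fragment from the question; the closing tags after </Company> are assumed,
# since the original file is truncated.
SAMPLE = """
<UniversalTransaction>
  <TransactionInfo>
    <DataContext>
      <DataSourceCollection>
        <DataSource>
          <Type>AccountingInvoice</Type>
          <Key>AR INV 00001006</Key>
        </DataSource>
      </DataSourceCollection>
      <Company>
        <Code>DCL</Code>
        <Country>
          <Code>CL</Code>
          <Name>Chile</Name>
        </Country>
        <Name>Your Chile Corp</Name>
      </Company>
    </DataContext>
  </TransactionInfo>
</UniversalTransaction>
"""

root = ET.fromstring(SAMPLE)

# A bare tag name only matches direct children of root, so this is None.
direct = root.find('Key')

# './/' searches the whole subtree below root.
arinv = root.findtext('.//DataSource/Key')
# Spell out the parent to choose between the two <Name> elements.
company = root.findtext('.//Company/Name')
country = root.findtext('.//Company/Country/Name')

print(direct, arinv, company, country)
```

Note that a plain './/Name' would return the first Name in document order, which here is the country, not the company. The resulting strings can be passed to ws.write exactly as in the question.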

Related

Adding a single occurrence xml tag using lxml

Based on a couple of other examples I've found here, I've created a script that creates an XML file from a CSV input using lxml.etree and lxml.builder. It gives me almost what I need. The one thing I'm struggling with is that I also need to include a single-occurrence tag at the top of the data, which will contain a static value.
Here's my sample data:
ACTION|INV_ACCT_CLASS|EXT_INV_ID|WAREHOUSE_ID|NAME|CNTRY_CD|PHONE|ADDR_STR1|ADDR_STR2|CITY|ST|ZIP|ADD_KEY_NUM
add|2|AAA_00005|1001213|Company 1|US|9995555555|1313 Mockingbird Lane||New York|NY|10001|44433322
add|2|BBB_00008|1004312|Company 2|US|43255511110|Some other address||Stamford|CT|44112|11122233
My code so far:
import lxml.etree
from lxml.builder import E
import csv
with open("filename.csv") as csvfile:
    results = E.paiInv(*(
        E.invrec(
            E.action(row['ACTION']),
            E.investor(
                # dictionary keys match the column names in the sample header above
                E.inv_account_class(row['INV_ACCT_CLASS']),
                E.ext_inv_id(row['EXT_INV_ID']),
                E.warehouse_id(row['WAREHOUSE_ID']),
                E.name(row['NAME']),
                E.cntry_cd(row['CNTRY_CD']),
                E.phone(row['PHONE']),
                E.addr_str1(row['ADDR_STR1']),
                E.addr_str2(row['ADDR_STR2']),
                E.city(row['CITY']),
                E.st(row['ST']),
                E.zip(row['ZIP']),
                E.add_key_num(row['ADD_KEY_NUM'])
            )
        ) for row in csv.DictReader(csvfile, delimiter='|'))
    )
lxml.etree.ElementTree(results).write("OutputFile.xml")
Here's my output so far:
<paiInv>
<invrec>
<action>add</action>
<investor>
<inv_account_class>2</inv_account_class>
<ext_inv_id>AAA_00005</ext_inv_id>
<warehouse_id>1001213</warehouse_id>
<name>Company 1</name>
<cntry_cd>US</cntry_cd>
<phone>9995555555</phone>
<addr_str1>1313 Mockingbird Lane</addr_str1>
<addr_str2></addr_str2>
<city>New York</city>
<st>NY</st>
<zip>10001</zip>
<add_key_num>44433322</add_key_num>
</investor>
</invrec>
<invrec>
<action>add</action>
<investor>
<inv_account_class>2</inv_account_class>
<ext_inv_id>BBB_00008</ext_inv_id>
<warehouse_id>1004312</warehouse_id>
<name>Company 2</name>
<cntry_cd>US</cntry_cd>
<phone>43255511110</phone>
<addr_str1>Some other address</addr_str1>
<addr_str2></addr_str2>
<city>Stamford</city>
<st>NB</st>
<zip>44112</zip>
<add_key_num>11122233</add_key_num>
</investor>
</invrec>
</paiInv>
And the output I need includes one extra (single occurrence) tag, named request_id, occurring at the top of the data, like this:
<paiInv>
<request_id>req44</request_id>
<invrec>
<action>add</action>
<investor>
<inv_account_class>2</inv_account_class>
<ext_inv_id>AAA_00005</ext_inv_id>
<warehouse_id>1001213</warehouse_id>
<name>Company 1</name>
<cntry_cd>US</cntry_cd>
<phone>9995555555</phone>
<addr_str1>1313 Mockingbird Lane</addr_str1>
<addr_str2></addr_str2>
<city>New York</city>
<st>NY</st>
<zip>10001</zip>
<add_key_num>44433322</add_key_num>
</investor>
</invrec>
<invrec>
<action>add</action>
<investor>
<inv_account_class>2</inv_account_class>
<ext_inv_id>BBB_00008</ext_inv_id>
<warehouse_id>1004312</warehouse_id>
<name>Company 2</name>
<cntry_cd>US</cntry_cd>
<phone>43255511110</phone>
<addr_str1>Some other address</addr_str1>
<addr_str2></addr_str2>
<city>Stamford</city>
<st>NB</st>
<zip>44112</zip>
<add_key_num>11122233</add_key_num>
</investor>
</invrec>
</paiInv>
Any suggestions will be appreciated. I haven't been able to get anything other than syntax errors with my attempts to get the extra tag so far.

Before you save the file, try something like:
doc = lxml.etree.ElementTree(results)
ins = lxml.etree.fromstring('<request_id>req44</request_id>')
ins.tail = "\n"
dest = doc.xpath('/paiInv')[0]
dest.insert(0,ins)
print(lxml.etree.tostring(doc).decode())
The output should be what you are looking for.
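The same insert(0, ...) idea can be tried without lxml. This is only a sketch on the standard library's ElementTree, with two columns of the sample data inlined as a string (CSV_TEXT is my own stand-in for the file) so that it runs without filename.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two columns of the sample data, inlined so the sketch needs no file.
CSV_TEXT = "ACTION|NAME\nadd|Company 1\nadd|Company 2\n"

root = ET.Element('paiInv')
for row in csv.DictReader(io.StringIO(CSV_TEXT), delimiter='|'):
    invrec = ET.SubElement(root, 'invrec')
    ET.SubElement(invrec, 'action').text = row['ACTION']
    investor = ET.SubElement(invrec, 'investor')
    ET.SubElement(investor, 'name').text = row['NAME']

# Build the single-occurrence element and place it before every invrec.
request_id = ET.Element('request_id')
request_id.text = 'req44'
root.insert(0, request_id)

xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

With lxml.builder it should also be possible to pass E.request_id('req44') as the first argument to E.paiInv, before the unpacked generator, since the factory accepts child elements positionally.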

How to handle XML element iter nested attributes with the same tag

I am trying to parse NPORT-P XML SEC submission. My code (Python 3.6.8) with a sample XML record:
import xml.etree.ElementTree as ET
content_xml = '<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><headerData></headerData><formData><genInfo></genInfo><fundInfo></fundInfo><invstOrSecs><invstOrSec><name>N/A</name><lei>N/A</lei><title>US 10YR NOTE (CBT)Sep20</title><cusip>N/A</cusip> <identifiers> <ticker value="TYU0"/> </identifiers> <derivativeInfo> <futrDeriv derivCat="FUT"> <counterparties> <counterpartyName>Chicago Board of Trade</counterpartyName> <counterpartyLei>549300EX04Q2QBFQTQ27</counterpartyLei> </counterparties><payOffProf>Short</payOffProf> <descRefInstrmnt> <otherRefInst> <issuerName>U.S. Treasury 10 Year Notes</issuerName> <issueTitle>U.S. Treasury 10 Year Notes</issueTitle> <identifiers> <cusip value="N/A"/><other otherDesc="USER DEFINED" value="TY_Comdty"/> </identifiers> </otherRefInst> </descRefInstrmnt> <expDate>2020-09-21</expDate> <notionalAmt>-2770555</notionalAmt> <curCd>USD</curCd> <unrealizedAppr>-12882.5</unrealizedAppr></futrDeriv> </derivativeInfo> </invstOrSec> </invstOrSecs> <signature> </signature> </formData></edgarSubmission>'
content_tree = ET.ElementTree(ET.fromstring(bytes(content_xml, encoding='utf-8')))
content_root = content_tree.getroot()
for edgar_submission in content_root.iter('{http://www.sec.gov/edgar/nport}edgarSubmission'):
    for form_data in edgar_submission.iter('{http://www.sec.gov/edgar/nport}formData'):
        for genInfo in form_data.iter('{http://www.sec.gov/edgar/nport}genInfo'):
            pass
        for fundInfo in form_data.iter('{http://www.sec.gov/edgar/nport}fundInfo'):
            pass
        for invstOrSecs in form_data.iter('{http://www.sec.gov/edgar/nport}invstOrSecs'):
            for invstOrSec in invstOrSecs.iter('{http://www.sec.gov/edgar/nport}invstOrSec'):
                myrow = []
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}name'), 'text', ''))
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}lei'), 'text', ''))
                security_title = getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}title'), 'text', '')
                myrow.append(security_title)
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}cusip'), 'text', ''))
                for identifiers in invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers'):
                    if identifiers.find('{http://www.sec.gov/edgar/nport}isin') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}isin').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No ISIN")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}ticker') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}ticker').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Ticker")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}other') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}other').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Other")
The output from this code is:
No ISIN
No Other
No ISIN
No Ticker
This works fine, aside from the fact that the identifiers loop, invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers'), finds the identifiers under formData>invstOrSecs>invstOrSec but also the other identifiers nested under formData>invstOrSecs>invstOrSec>derivativeInfo>futrDeriv>descRefInstrmnt>otherRefInst. How can I restrict my iter or find to the right level? I have tried, unsuccessfully, to get the parent, but I can't work out how to do this using the {namespace}tag notation. Any ideas?

So I switched from ElementTree to lxml using an import like this to avoid code changes:
from lxml import etree as ET
Make sure you check https://lxml.de/1.3/compatibility.html to avoid compatibility issues. In my case lxml worked without issues.
I was then able to use the getparent() method to read only the identifiers from the right part of the XML file:
if identifiers.getparent().tag == '{http://www.sec.gov/edgar/nport}invstOrSec':
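If switching libraries is not an option, plain ElementTree may be enough: iter() walks every descendant, while find() and findall() with a plain tag look only at direct children. A sketch on a reduced copy of the record from the question (the NS prefix 'n' and the helper attr_value are names I made up for illustration):

```python
import xml.etree.ElementTree as ET

NS = {'n': 'http://www.sec.gov/edgar/nport'}

# Reduced version of the record from the question.
SAMPLE = (
    '<edgarSubmission xmlns="http://www.sec.gov/edgar/nport"><formData><invstOrSecs>'
    '<invstOrSec><title>US 10YR NOTE (CBT)Sep20</title>'
    '<identifiers><ticker value="TYU0"/></identifiers>'
    '<derivativeInfo><futrDeriv derivCat="FUT"><descRefInstrmnt><otherRefInst>'
    '<identifiers><cusip value="N/A"/>'
    '<other otherDesc="USER DEFINED" value="TY_Comdty"/></identifiers>'
    '</otherRefInst></descRefInstrmnt></futrDeriv></derivativeInfo>'
    '</invstOrSec></invstOrSecs></formData></edgarSubmission>'
)

root = ET.fromstring(SAMPLE)


def attr_value(parent, tag):
    """Return the 'value' attribute of a direct child, or '' if absent."""
    child = parent.find('n:' + tag, NS)
    return child.get('value', '') if child is not None else ''


rows = []
for sec in root.iterfind('.//n:invstOrSec', NS):
    # iter() descends through every level below invstOrSec ...
    all_levels = list(sec.iter('{http://www.sec.gov/edgar/nport}identifiers'))
    # ... while findall() with a plain tag stays one level down.
    top_level = sec.findall('n:identifiers', NS)
    for identifiers in top_level:
        rows.append([attr_value(identifiers, t) for t in ('isin', 'ticker', 'other')])

print(len(all_levels), len(top_level), rows)
```

Here iter() sees both identifiers blocks, while findall() only returns the one directly under invstOrSec, so the nested cusip/other pair never reaches the row.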

Is there something wrong with my script or the XML file? I am using ElementTree in an attempt to get child attributes

This is a shortened version of the XML file that I am trying to parse:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TipsContents xmlns="http://www.avendasys.com/tipsapiDefs/1.0">
<TipsHeader exportTime="Mon May 04 20:05:47 SAST 2020" version="6.8"/>
<Endpoints>
<Endpoint macVendor="SHENZHEN RF-LINK TECHNOLOGY CO.,LTD." macAddress="c46e7b2939cb" status="Known">
<EndpointProfile updatedAt="May 04, 2020 10:02:21 SAST" profiledBy="Policy Manager" addedAt="Mar 04, 2020 17:31:53 SAST" fingerprint="{}" conflict="false" name="Windows" family="Windows" category="Computer" staticIP="false" ipAddress="xxx.xxx.xxx.xxx"/>
<EndpointTags tagName="Username" tagValue="xxxxxxxx"/>
<EndpointTags tagName="Disabled Reason" tagValue="IS_ACTIVE"/>
</Endpoint>
</Endpoints>
<TagDictionaries>
<TagDictionary allowMultiple="false" mandatory="true" defaultValue="false" dataType="Boolean" attributeName="DOMAIN-MACHINES" entityName="Endpoint"/>
<TagDictionary allowMultiple="false" mandatory="true" defaultValue="true" dataType="Boolean" attributeName="IS_ACTIVE" entityName="Endpoint"/>
<TagDictionary allowMultiple="true" mandatory="false" dataType="String" attributeName="Disabled Reason" entityName="Endpoint"/>
<TagDictionary allowMultiple="false" mandatory="false" dataType="String" attributeName="Username" entityName="Endpoint"/>
</TagDictionaries>
</TipsContents>
I run the following script:
import xml.etree.ElementTree as ET
f = open("Endpoint-5.xml", 'r')
tree = ET.parse(f)
root = tree.getroot()
This is what my outputs look like:
In [8]: root = tree.getroot()
In [9]: root.findall('.')
Out[9]: [<Element '{http://www.avendasys.com/tipsapiDefs/1.0}TipsContents' at 0x10874b410>]
In [10]: root.findall('./TipsHeader')
Out[10]: []
In [11]: root.findall('./TipsContents')
Out[11]: []
In [15]: root.findall('{http://www.avendasys.com/tipsapiDefs/1.0}TipsContents//Endpoints/Endpoint/EndpointProfile')
Out[15]: []
I have been following this: https://docs.python.org/3/library/xml.etree.elementtree.html#example
among other tutorials but I don't seem to get an output.
I have tried from lxml import html
My script is as follows:
tree = html.fromstring(html=f)
updatedAt = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointProfile/@updatedAt")
name = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointProfile/@name")
category = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointProfile/@category")
tagValue = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointTags[@tagName = 'Username']/@tagValue")
active = tree.xpath("//TipsContents/Endpoints/Endpoint/EndpointTags[@tagName = 'Disabled Reason']/@tagValue")
print("Name:",name)
The above attempt also returns nothing.
I am able to parse an XML document from an API and use the second attempt successfully but when I am doing this from a local file I do not get the results.
Any assistance will be appreciated.

Note that your input XML contains a default namespace, so to refer to
any element you have to specify the namespace.
One way to do this is to define a dictionary of namespaces
(shortcut: full name), in your case:
ns = {'tips': 'http://www.avendasys.com/tipsapiDefs/1.0'}
Then, when calling findall, put the appropriate shortcut and ':'
before each element name, and pass the namespace dictionary as the
second argument.
The code to do it is:
for elem in root.findall('./tips:TipsHeader', ns):
print(elem.attrib)
The result, for your input sample, is:
{'exportTime': 'Mon May 04 20:05:47 SAST 2020', 'version': '6.8'}
As far as root.findall('./TipsContents') is concerned, it will return
an empty list, even if you specify the namespace as above.
The reason is that TipsContents is the name of the root node,
whereas you attempt to find an element with the same name below in
the XML tree, but it contains no such element.
If you want to access attributes of the root element, you can run:
print(root.attrib)
but to get something more than an empty dictionary, you have to add
some attributes to the root element (namespace is not an attribute).
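The same namespace dictionary also reaches the deeper elements the question was after. A minimal sketch on a trimmed copy of the sample file (attributes shortened; the record keys such as 'mac' and 'username' are just names I picked):

```python
import xml.etree.ElementTree as ET

NS = {'tips': 'http://www.avendasys.com/tipsapiDefs/1.0'}

# Trimmed copy of the XML from the question.
SAMPLE = """<TipsContents xmlns="http://www.avendasys.com/tipsapiDefs/1.0">
  <TipsHeader exportTime="Mon May 04 20:05:47 SAST 2020" version="6.8"/>
  <Endpoints>
    <Endpoint macAddress="c46e7b2939cb" status="Known">
      <EndpointProfile updatedAt="May 04, 2020 10:02:21 SAST" name="Windows" category="Computer"/>
      <EndpointTags tagName="Username" tagValue="xxxxxxxx"/>
      <EndpointTags tagName="Disabled Reason" tagValue="IS_ACTIVE"/>
    </Endpoint>
  </Endpoints>
</TipsContents>"""

root = ET.fromstring(SAMPLE)

records = []
# Every step of the path needs the prefix, not only the first one.
for endpoint in root.findall('tips:Endpoints/tips:Endpoint', NS):
    profile = endpoint.find('tips:EndpointProfile', NS)
    # Attributes are not in the default namespace, so the predicate uses the bare name.
    username = endpoint.find("tips:EndpointTags[@tagName='Username']", NS)
    records.append({
        'mac': endpoint.get('macAddress'),
        'name': profile.get('name') if profile is not None else None,
        'category': profile.get('category') if profile is not None else None,
        'username': username.get('tagValue') if username is not None else None,
    })

print(records)
```

This also shows why the attempt in Out[15] came back empty: the namespace was applied only to the first path step, and that step named the root element itself.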

Is there a way to parse a XML according to its attributes?

I'm trying to parse my XML using minidom.parse, but the program crashes when the debugger reaches the line xmldoc = minidom.parse(self)
Here is what I have tried:
attribValList = list()
xmldoc = minidom.parse(path)
equipments = xmldoc.getElementsByTagName(xmldoc , elementName)
equipNames = equipments.getElementsByTagName(xmldoc , attributeKey)
for item in equipNames:
    attribValList.append(item.value)
return attribValList
Maybe my XML is too specific for minidom. Here is what it looks like:
<TestSystem id="...">
<Port>58</Port>
<TestSystemEquipment>
<Equipment type="BCAFC">
<Name>System1</Name>
<DU-Junctions>
...
</DU-Junctions>
</Equipment>
Basically I need to get for each Equipment its name and write the names into a list.
Can anybody tell what I'm doing wrong?
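For reference, here is a rough sketch of how the names could be collected with minidom. It is not a confirmed fix for the crash. The sample document is my own completion of the truncated XML above (the id value "demo", the second Equipment and all closing tags are assumptions), and equipment_names is a hypothetical helper name.

```python
from xml.dom import minidom

# Assumed completion of the truncated XML from the question.
SAMPLE = """<TestSystem id="demo">
  <Port>58</Port>
  <TestSystemEquipment>
    <Equipment type="BCAFC">
      <Name>System1</Name>
    </Equipment>
    <Equipment type="BCAFC">
      <Name>System2</Name>
    </Equipment>
  </TestSystemEquipment>
</TestSystem>"""


def equipment_names(xml_text):
    """Collect the text of the first <Name> found under every <Equipment>."""
    doc = minidom.parseString(xml_text)
    names = []
    # getElementsByTagName takes only the tag name and returns a NodeList,
    # so it has to be called again on each individual Equipment node.
    for equipment in doc.getElementsByTagName('Equipment'):
        name_nodes = equipment.getElementsByTagName('Name')
        if name_nodes and name_nodes[0].firstChild is not None:
            names.append(name_nodes[0].firstChild.data.strip())
    return names


names = equipment_names(SAMPLE)
print(names)
```

Two things differ from the attempt above: getElementsByTagName is called with a single argument, and the text is read from the element's child text node (firstChild.data) rather than from a value attribute. For a file on disk, minidom.parse(path) would replace parseString.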

Using "info.get" for a child element in Python / lxml

I'm trying to get the attribute of a child element in Python, using lxml.
This is the structure of the xml:
<GroupInformation groupId="crid://thing.com/654321" ordered="true">
<GroupType value="show" xsi:type="ProgramGroupTypeType"/>
<BasicDescription>
<Title type="main" xml:lang="EN">A programme</Title>
<RelatedMaterial>
<HowRelated href="urn:eventis:metadata:cs:HowRelatedCS:2010:boxCover">
<Name>Box cover</Name>
</HowRelated>
<MediaLocator>
<mpeg7:MediaUri>file://ftp.something.com/Images/123456.jpg</mpeg7:MediaUri>
</MediaLocator>
</RelatedMaterial>
</BasicDescription>
The code I've got is below. The bit I want to return is the 'value' attribute ("show" in the example) under 'grouptype' (third line from the bottom):
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010','mpeg7':'urn:tva:mpeg7:2008'}
with open(file_name+'.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:GroupInformation', namespaces=nsmap):
        crid = (info.get('groupId'))
        grouptype = info.find('.//xmlns:GroupType', namespaces=nsmap)
        gtype = grouptype.get('value')
        titlex = info.find('.//xmlns:BasicDescription/xmlns:Title', namespaces=nsmap)
        title = titlex.text if titlex is not None else 'Missing'
Can anyone explain to me how to implement it? I had a quick look at the xsi namespace, but was unable to get it to work (and didn't know if it was the right thing to do).

Is this what you are looking for?
grouptype.attrib['value']
PS: why the parentheses around assignment values? Those look unnecessary.
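A small sketch of the difference between .attrib['value'] and .get('value'), and of how the xsi:type attribute can be read. It uses the standard library's ElementTree so that it runs without lxml; find, get and findtext are also available on lxml elements. The <Root> wrapper and its namespace declarations are assumptions, since the question does not show the top of the file.

```python
import xml.etree.ElementTree as ET

NS = {'tva': 'urn:tva:metadata:2010'}
# Namespaced attributes are addressed as '{namespace-uri}localname'.
XSI_TYPE = '{http://www.w3.org/2001/XMLSchema-instance}type'

# <Root> and the xmlns declarations are assumed; only the inner part is from the question.
SAMPLE = """<Root xmlns="urn:tva:metadata:2010"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <GroupInformation groupId="crid://thing.com/654321" ordered="true">
    <GroupType value="show" xsi:type="ProgramGroupTypeType"/>
    <BasicDescription>
      <Title type="main">A programme</Title>
    </BasicDescription>
  </GroupInformation>
</Root>"""

root = ET.fromstring(SAMPLE)

results = []
for info in root.iterfind('.//tva:GroupInformation', NS):
    grouptype = info.find('tva:GroupType', NS)
    # .get() returns None when the attribute is missing;
    # .attrib['value'] would raise KeyError instead.
    gtype = grouptype.get('value') if grouptype is not None else 'Missing'
    xsi_type = grouptype.get(XSI_TYPE) if grouptype is not None else None
    title = info.findtext('tva:BasicDescription/tva:Title',
                          default='Missing', namespaces=NS)
    results.append((info.get('groupId'), gtype, xsi_type, title))

print(results)
```

So if gtype comes back empty in the original script, the more likely suspects are the namespace mapping or a GroupType element that was not found, rather than the choice between get and attrib.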
