Python: lxml is not reading element text all the time

I want to load an XML file with the structure below into a pandas DataFrame.
The size of the XML file can be anywhere between 1 GB and 6 GB.
The sample below has only 5 records, but my actual file will have around 100,000 records, as indicated by the RECORDS attribute below (RECORDS="108881").
Every element in this file has a value.
No element is empty anywhere in the file.
<?xml version="1.0" encoding="UTF-8"?>
<STUDENTS ASOF_DATE="11/21/2019" CREATE_DATE="11/22/2019" RECORDS="108881">
I am trying to read these XML files with lxml, using the functions below.
As you can see, I am only interested in reading specific tags from the XML file: ["ACADEMICS","STUDENDS","ID","SHORT_STD_DESC"]
def recursive_dict(self, element):
    return element.tag, \
        dict(map(self.recursive_dict, element)) or element.text

def ConvertFilePivot(self, inputfile):
    context = etree.iterparse(inputfile, events=('start','end'), tag=["ACADEMICS","STUDENDS","ID","SHORT_STD_DESC"])
    lstValues = []
    asOfDate = ""
    for event, elem in context:
        if elem.tag == "ACADEMICS":
            asOfDate = elem[0].attrib['ASOF_DATE']
        for event, elem in context:
            doc = self.recursive_dict(elem)
    dfvalues = pd.DataFrame(lstValues, columns=["ColName","ColValue"])
    columns = dfvalues['ColName'].unique()
    data = {}
    for column in columns:
        data[column] = list(dfvalues[dfvalues['ColName'] == column]['ColValue'])
    dfdata = pd.DataFrame(data)
    return dfdata
The problem is that when I load this XML into a DataFrame as shown in the function above, for some records I get 'None' as the text of the ID and SHORT_STD_DESC elements.
The actual XML file does have values there.
So I am not sure why they are not reflected in my DataFrame.
Any input would be a great help.

This may be more a comment than an answer, but I can't fit it in an actual comment...
Try changing
for event, elem in context:
    doc = self.recursive_dict(elem)
to just:
doc = self.recursive_dict(elem)
and see if it works.
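
Another possible cause, offered as a guess rather than a diagnosis: on a 'start' event the parser has only seen the opening tag, so the element's text is not guaranteed to be available yet; only the 'end' event guarantees a fully parsed element. The sketch below listens for 'end' events only. It uses the standard-library xml.etree.ElementTree.iterparse so that it runs without extra packages (lxml's etree.iterparse documents the same caveat). The STUDENT record tag, the read_records helper and the sample data are assumptions, since the XML in the question is truncated.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real record layout is not shown in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<STUDENTS ASOF_DATE="11/21/2019" RECORDS="2">
  <STUDENT><ID>1</ID><SHORT_STD_DESC>first</SHORT_STD_DESC></STUDENT>
  <STUDENT><ID>2</ID><SHORT_STD_DESC>second</SHORT_STD_DESC></STUDENT>
</STUDENTS>"""


def read_records(source, record_tag="STUDENT", fields=("ID", "SHORT_STD_DESC")):
    """Collect one dict per record, reading text only on 'end' events."""
    rows = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            # The record is complete here, so its children and their text exist.
            rows.append({field: elem.findtext(field) for field in fields})
            # Drop the record's children to keep memory use low on large files.
            elem.clear()
    return rows


rows = read_records(io.BytesIO(SAMPLE))
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows).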


convert xml to csv using python

I am learning my way around Python and need a little help. I have an XML file from a SOAP API that I am failing to convert to CSV. Getting the data with the requests library was easy. My struggle is the conversion to CSV: I end up with headers but no values.
My XML Data :
<soap:Envelope xmlns:soap=""
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Message>20 alert(s) generated for this period</Message>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
My python code looks like this :
import xml.etree.ElementTree as ET
import pandas as pd
cols = ['Date', 'RegisteredDate', 'Type',
rows = []
# parse xml file
xmlparse = ET.parse('xmldata.xml')
root = xmlparse.getroot()
for i in root:
    Date = i.get('Date').text
    RegisteredDate = i.get('RegisteredDate').text
    Type = i.get('Type').text
    TypeDescription = i.get('TypeDescription').text
    rows.append({'Date': Date,
                 'RegisteredDate': RegisteredDate,
                 'Type': Type,
                 'TypeDescription': TypeDescription})
df = pd.DataFrame(rows, columns=cols)
In my approach, I was following the idea from here
You probably don't need to go through ElementTree; you can feed the xml directly to pandas. If I understand you correctly, this should do it:
df = pd.read_xml(path_to_file,"//*[local-name()='MainVIP']")
df = df.iloc[:,:4]
Output from your xml above:
Date RegisteredDate Type TypeDescription
0 20210616 20210216 YMBA TYPE OF ENQUIRY
Without any external lib - the code below generates a csv file.
The idea is to collect the required elements data from MainVip and store it in list of dicts. Loop on the list and write the data into a file.
import xml.etree.ElementTree as ET
xml = ''' <soap:Envelope xmlns:soap=""
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Message>20 alert(s) generated for this period</Message>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
cols = ['Date', 'RegisteredDate', 'Type',
rows = []
NS = '{https://xxxxxxxxxx/xxxxxxx}'
root = ET.fromstring(xml)
for vip in root.findall(f'.//{NS}MainVIP'):
    rows.append({c: vip.find(NS+c).text for c in cols})
with open('out.csv','w') as f:
    f.write(','.join(cols) + '\n')
    for row in rows:
        f.write(','.join(row[c] for c in cols) + '\n')
20210616,20210216,YMBA,TYPE OF ENQUIRY
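
The same idea (collect the MainVIP fields into a list of dicts, then write them out) can also be sketched with the standard csv module, which handles quoting if a value ever contains a comma. This is only an illustration: the XML below is a made-up, complete stand-in for the truncated response, the urn:example:soap namespace is a placeholder, and the column list is assumed from the output shown above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, complete stand-in for the truncated SOAP response.
XML = """<soap:Envelope xmlns:soap="urn:example:soap">
  <soap:Body>
    <Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
      <MainVIP>
        <Date>20210616</Date>
        <RegisteredDate>20210216</RegisteredDate>
        <Type>YMBA</Type>
        <TypeDescription>TYPE OF ENQUIRY</TypeDescription>
      </MainVIP>
    </Level2>
  </soap:Body>
</soap:Envelope>"""

# Prefix "v" is arbitrary; it maps to the default namespace of <Level2>.
NS = {"v": "https://xxxxxxxxxx/xxxxxxx"}
COLS = ["Date", "RegisteredDate", "Type", "TypeDescription"]


def rows_from_xml(xml_text):
    """Yield one dict per MainVIP element, with '' for missing fields."""
    root = ET.fromstring(xml_text)
    for vip in root.iterfind(".//v:MainVIP", NS):
        yield {c: vip.findtext(f"v:{c}", default="", namespaces=NS) for c in COLS}


# Written to an in-memory buffer here; open('out.csv', 'w', newline='') works the same way.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLS)
writer.writeheader()
writer.writerows(rows_from_xml(XML))
csv_text = buf.getvalue()
print(csv_text)
```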

How to handle XML element iter nested attributes with the same tag

I am trying to parse NPORT-P XML SEC submission. My code (Python 3.6.8) with a sample XML record:
import xml.etree.ElementTree as ET
content_xml = '<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="" xmlns:com="" xmlns:ncom="" xmlns:xsi=""><headerData></headerData><formData><genInfo></genInfo><fundInfo></fundInfo><invstOrSecs><invstOrSec><name>N/A</name><lei>N/A</lei><title>US 10YR NOTE (CBT)Sep20</title><cusip>N/A</cusip> <identifiers> <ticker value="TYU0"/> </identifiers> <derivativeInfo> <futrDeriv derivCat="FUT"> <counterparties> <counterpartyName>Chicago Board of Trade</counterpartyName> <counterpartyLei>549300EX04Q2QBFQTQ27</counterpartyLei> </counterparties><payOffProf>Short</payOffProf> <descRefInstrmnt> <otherRefInst> <issuerName>U.S. Treasury 10 Year Notes</issuerName> <issueTitle>U.S. Treasury 10 Year Notes</issueTitle> <identifiers> <cusip value="N/A"/><other otherDesc="USER DEFINED" value="TY_Comdty"/> </identifiers> </otherRefInst> </descRefInstrmnt> <expDate>2020-09-21</expDate> <notionalAmt>-2770555</notionalAmt> <curCd>USD</curCd> <unrealizedAppr>-12882.5</unrealizedAppr></futrDeriv> </derivativeInfo> </invstOrSec> </invstOrSecs> <signature> </signature> </formData></edgarSubmission>'
content_tree = ET.ElementTree(ET.fromstring(bytes(content_xml, encoding='utf-8')))
content_root = content_tree.getroot()
for edgar_submission in content_root.iter('{}edgarSubmission'):
    for form_data in edgar_submission.iter('{}formData'):
        for genInfo in form_data.iter('{}genInfo'):
            pass  # loop body not shown in the post
        for fundInfo in form_data.iter('{}fundInfo'):
            pass  # loop body not shown in the post
        for invstOrSecs in form_data.iter('{}invstOrSecs'):
            for invstOrSec in invstOrSecs.iter('{}invstOrSec'):
                myrow = []
                myrow.append(getattr(invstOrSec.find('{}name'), 'text', ''))
                myrow.append(getattr(invstOrSec.find('{}lei'), 'text', ''))
                security_title = getattr(invstOrSec.find('{}title'), 'text', '')
                myrow.append(getattr(invstOrSec.find('{}cusip'), 'text', ''))
                for identifiers in invstOrSec.iter('{}identifiers'):
                    if identifiers.find('{}isin') is not None:
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No ISIN")
                    if identifiers.find('{}ticker') is not None:
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Ticker")
                    if identifiers.find('{}other') is not None:
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Other")
The output from this code is:
No Other
No Ticker
This works fine, except that invstOrSec.iter('{}identifiers') finds not only the identifiers under formData>invstOrSecs>invstOrSec but also the identifiers under a nested tag, formData>invstOrSecs>invstOrSec>derivativeInfo>futrDeriv>descRefInstrmnt>otherRefInst. How can I restrict my iter or find to the right level? I have tried, unsuccessfully, to get the parent, but I cannot work out how to do this using the {namespace}tag notation. Any ideas?
So I switched from ElementTree to lxml, using an import like this to avoid code changes:
from lxml import etree as ET
Make sure you check for compatibility issues first. In my case lxml worked without any.
I was then able to use the getparent() method to read only the identifiers from the right part of the XML file:
if identifiers.getparent().tag == '{}invstOrSec':
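
If staying with the standard library is preferable, one alternative (a sketch, not the approach taken above) relies on the fact that find() and findall() with a bare tag name only match direct children, whereas iter() walks the whole subtree. The sample below is a trimmed, namespace-free version of the record in the question; with a real filing each tag would need its {namespace} prefix.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free version of the invstOrSec record from the question.
XML = """<invstOrSec>
  <title>US 10YR NOTE (CBT)Sep20</title>
  <identifiers><ticker value="TYU0"/></identifiers>
  <derivativeInfo>
    <futrDeriv derivCat="FUT">
      <descRefInstrmnt>
        <otherRefInst>
          <identifiers>
            <cusip value="N/A"/>
            <other otherDesc="USER DEFINED" value="TY_Comdty"/>
          </identifiers>
        </otherRefInst>
      </descRefInstrmnt>
    </futrDeriv>
  </derivativeInfo>
</invstOrSec>"""

sec = ET.fromstring(XML)

# iter() descends into every level, so it also finds the nested <identifiers>.
all_ids = list(sec.iter("identifiers"))

# findall() with a plain tag name only looks at direct children of invstOrSec.
direct_ids = sec.findall("identifiers")

# Tags of the identifier kinds found at the top level only.
kinds = [child.tag for ids in direct_ids for child in ids]
print(len(all_ids), len(direct_ids), kinds)
```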

How to get the content of specific grandchild from xml file through python

Hi, I am very new to Python programming. I have an XML file with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation=""
xmlns="" uid="">
<TaskDescription>Second unblinded read</TaskDescription>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<noduleID>Nodule 001</noduleID>
<imageZposition>-125.000000 </imageZposition>
There are four <readingSession> elements, each containing multiple <roi> elements. Each <roi> contains an <imageSOP_UID>. I need to extract the information in <imageSOP_UID> from all of these headers.
Right now I am doing this:
import xml.etree.ElementTree as ET
tree = ET.parse('069.xml')
root = tree.getroot()
#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
but the only output is this:
Process finished with exit code 0.
Any help would be appreciated.
The real problem was the namespace. I tried with and without it, but it did not work with that code.
import xml.etree.ElementTree as ET

import pydicom

ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID
tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    if child.tag == '{}readingSession':
        read = child.find('{}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{}noduleID').text
            xml_uid = read.find('{}roi').find('{}imageSOP_UID').text
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi = read.find('{}roi')
This works completely fine for getting a UID from a DICOM image of the LIDC/IDRI dataset and then finding the same UID in the XML file to extract its region of interest.
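
As a rough illustration of the namespace point, the sketch below passes a prefix-to-URI mapping to ElementTree instead of writing the {namespace} prefix on every tag. The namespace URI, the UID values and the exact nesting are placeholders chosen for the example; the real file's namespace has to be copied from its xmlns attribute.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace; use the xmlns value from the real file instead.
NS = {"lidc": "http://example.com/lidc"}

# Hypothetical, shortened document following the nesting used in the code above.
XML = """<LidcReadMessage xmlns="http://example.com/lidc">
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 001</noduleID>
      <roi>
        <imageZposition>-125.000000 </imageZposition>
        <imageSOP_UID>1.2.3.4</imageSOP_UID>
      </roi>
      <roi>
        <imageZposition>-126.250000 </imageZposition>
        <imageSOP_UID>1.2.3.5</imageSOP_UID>
      </roi>
    </unblindedReadNodule>
  </readingSession>
</LidcReadMessage>"""

root = ET.fromstring(XML)

# Pair every roi's imageSOP_UID with the ID of the nodule it belongs to.
uids = [
    (
        nodule.findtext("lidc:noduleID", namespaces=NS),
        roi.findtext("lidc:imageSOP_UID", namespaces=NS),
    )
    for nodule in root.iterfind(".//lidc:unblindedReadNodule", NS)
    for roi in nodule.iterfind("lidc:roi", NS)
]
print(uids)
```

Unlike the loop above, this visits every roi of every nodule rather than only the first match of each find().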

How can I parse a XML file to a dictionary in Python?

I am trying to parse an XML file using the Python library minidom (I have also tried the xml.etree.ElementTree API).
My XML (resource.xml)
<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
<quota_rule name='max_mem_per_user/5'>
<limit resource='mem' limit='1550' value='921'/>
<quota_rule name='max_mem_per_user/6'>
<users>user2</users>
<limit resource='mem' limit='2150' value='3'/>
I would like to parse this file and store the information in a dictionary of the following form, so that I can access it:
So far I have only been able to do things like:
docXML = minidom.parse("resource.xml")
for node in docXML.getElementsByTagName('limit'):
    print node.getAttribute('value')
You can use getElementsByTagName and getAttribute to trace the result:
from xml.dom.minidom import parse

dict_users = dict()
docXML = parse('mydata.xml')
users = docXML.getElementsByTagName("quota_rule")
for node in users:
    user = 'None'
    # check the length of tag_user to see whether the <users> tag exists
    tag_user = node.getElementsByTagName("users")
    if len(tag_user) == 0:
        print("tag <users> is not exist")
    else:
        # use the text inside <users> as the dictionary key
        user = tag_user[0].firstChild.data.strip()
    limit_node = node.getElementsByTagName("limit")[0]
    resource = limit_node.getAttribute("resource")
    limit = limit_node.getAttribute("limit")
    value = limit_node.getAttribute("value")
    # rules without a <users> tag are stored under the key 'None'
    dict_users[user] = [resource, limit, value]
print(dict_users) # remove the <users>user1</users> in xml
tag <users> is not exist
{'None': [u'mem', u'1550', u'921'], u'user2': [u'mem', u'2150', u'3']}
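
For comparison, here is a minimal sketch of the same mapping with xml.etree.ElementTree, which the question mentions having tried. The XML is a completed guess at the truncated sample (closing tags added, first rule without a users element), and the quota_by_user helper name is made up for this example.

```python
import xml.etree.ElementTree as ET

# Prefix "q" is arbitrary; it maps to the default namespace of the document.
NS = {"q": "https://some_url"}

# Completed guess at the truncated sample from the question.
XML = """<quota_result xmlns="https://some_url">
  <quota_rule name="max_mem_per_user/5">
    <limit resource="mem" limit="1550" value="921"/>
  </quota_rule>
  <quota_rule name="max_mem_per_user/6">
    <users>user2</users>
    <limit resource="mem" limit="2150" value="3"/>
  </quota_rule>
</quota_result>"""


def quota_by_user(xml_text):
    """Map each rule's user (or 'None') to [resource, limit, value]."""
    result = {}
    root = ET.fromstring(xml_text)
    for rule in root.iterfind("q:quota_rule", NS):
        # Fall back to the string 'None' when the rule has no <users> element.
        user = rule.findtext("q:users", default="None", namespaces=NS).strip()
        limit = rule.find("q:limit", NS)
        if limit is None:
            continue  # skip rules without a <limit> element
        # Unprefixed attributes are not in any namespace, so plain names work.
        result[user] = [limit.get("resource"), limit.get("limit"), limit.get("value")]
    return result


quotas = quota_by_user(XML)
print(quotas)
```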

How to associate values of tags with label of the tag the using ElementTree in a Pythonic way

I have some xml files I am trying to process.
Here is a derived sample from one of the files
fileAsString = """
<?xml version="1.0" encoding="utf-8"?>
<value>Some Event</value>
<footnoteId id="F1"/>
<footnoteId id="F2"/>
Note that there can be many eventTables in each document, and events can have more details than just the ones I have isolated.
My goal is to create a dictionary in the following form
{'eventTitle':'Some Event', 'eventDate':'2003-12-31','eventType':'47',\
'eventCode':'A', 'eventCoding_FTNT_1':'F1','eventCoding_FTNT_2':'F2',\
'eventCycled': , 'eventVoltage':'40000'}
I am actually reading these in from files, but assuming I have a string, this is my code to get the text of the elements directly below the eventTransaction element whose text is inside a value tag:
import xml.etree.cElementTree as ET
myXML = ET.fromstring(fileAsString)
eventTransactions = [ e for e in myXML.iter() if e.tag == 'eventTransaction']
testTransaction = eventTransactions[0]
my_dict = {}
for child_of in testTransaction:
    grand_children_tags = [e.tag for e in child_of]
    if grand_children_tags == ['value']:
        my_dict[child_of.tag] = [e.text for e in child_of][0]
>>> my_dict
{'eventTitle': 'Some Event', 'eventCycled': None, 'eventDate': '2003-12-31'}
This seems wrong because I am not really taking advantage of the XML structure; I am using brute force instead, but I have not been able to find an example.
Is there a clearer and more pythonic way to create the output I am looking for?
Use XPath to pull out the elements you're interested in.
The following code creates a list of lists of dicts (i.e. tables/transactions/info):
from pprint import pprint

tables = []
myXML = ET.fromstring(fileAsString)
for table in myXML.findall('./eventTable'):
    transactions = []
    for transaction in table.findall('./eventTransaction'):
        info = {}
        # every descendant of this transaction that has a <value> child
        for element in transaction.findall('.//*[value]'):
            info[element.tag] = element.find('./value').text or ''
        coding = transaction.find('./eventCoding')
        if coding is not None:
            for tag in 'eventType', 'eventCode':
                element = coding.find('./%s' % tag)
                if element is not None:
                    info[tag] = element.text or ''
            for index, element in enumerate(coding.findall('./footnoteId')):
                info['eventCoding_FTNT_%d' % index] = element.get('id', '')
        if info:
            transactions.append(info)
    if transactions:
        tables.append(transactions)
pprint(tables)
Output:
[[{'eventCode': 'A',
'eventCoding_FTNT_0': 'F1',
'eventCoding_FTNT_1': 'F2',
'eventCycled': '',
'eventDate': '2003-12-31',
'eventTitle': 'Some Event',
'eventType': '47',
'eventVoltage': '40000'}]]
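
To make the XPath idea easier to try in isolation, here is a small self-contained sketch. The document layout is an assumption pieced together from the fragments and the desired dictionary in the question (the eventDocument root and the transaction_to_dict helper are invented for the example), and the footnote keys are numbered from 1 to match the form the question asks for.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the tag names come from the question.
XML = """<eventDocument>
  <eventTable>
    <eventTransaction>
      <eventTitle><value>Some Event</value></eventTitle>
      <eventDate><value>2003-12-31</value></eventDate>
      <eventCoding>
        <eventType>47</eventType>
        <eventCode>A</eventCode>
        <footnoteId id="F1"/>
        <footnoteId id="F2"/>
      </eventCoding>
      <eventCycled><value/></eventCycled>
      <eventVoltage><value>40000</value></eventVoltage>
    </eventTransaction>
  </eventTable>
</eventDocument>"""


def transaction_to_dict(tx):
    """Flatten one eventTransaction element into a plain dict."""
    # './/*[value]' selects every descendant that has a direct <value> child.
    info = {el.tag: el.findtext("value") or "" for el in tx.findall(".//*[value]")}
    coding = tx.find("eventCoding")
    if coding is not None:
        for tag in ("eventType", "eventCode"):
            text = coding.findtext(tag)
            if text is not None:
                info[tag] = text
        # Number the footnotes from 1, as in the desired output of the question.
        for n, note in enumerate(coding.findall("footnoteId"), start=1):
            info[f"eventCoding_FTNT_{n}"] = note.get("id", "")
    return info


root = ET.fromstring(XML)
records = [transaction_to_dict(tx) for tx in root.iterfind("./eventTable/eventTransaction")]
print(records)
```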
