Python: lxml is not always reading element text

I want to load an XML file with the structure below into a pandas DataFrame.
The XML file can be between 1 GB and 6 GB in size.
The sample below has only 5 records, but my actual file will have around 100,000 records, as indicated by the RECORDS attribute (RECORDS="108881").
Every element in this file has a value; none of the elements is empty anywhere in the file.
<?xml version="1.0" encoding="UTF-8"?>
<ACADEMICS>
<STUDENTS ASOF_DATE="11/21/2019" CREATE_DATE="11/22/2019" RECORDS="108881">
<STUDENT>
<NAME>JOHN</NAME>
<REGNUM>1000</REGNUM>
<COUNTRY>USA</COUNTRY>
<ID>JH1</ID>
<SHORT_STD_DESC>JOHN IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>ADAM</NAME>
<REGNUM>1001</REGNUM>
<COUNTRY>FRANCE</COUNTRY>
<ID>AD2</ID>
<SHORT_STD_DESC>ADAM IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>PETER</NAME>
<REGNUM>1003</REGNUM>
<COUNTRY>BELGIUM</COUNTRY>
<ID>PE5</ID>
<SHORT_STD_DESC>PETER IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>ERIC</NAME>
<REGNUM>1006</REGNUM>
<COUNTRY>AUSTRALIA</COUNTRY>
<ID>ER7</ID>
<SHORT_STD_DESC>ERIC IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>NICHOLAS</NAME>
<REGNUM>1009</REGNUM>
<COUNTRY>GREECE</COUNTRY>
<ID>NI8</ID>
<SHORT_STD_DESC>NICHOLAS IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
</STUDENTS>
</ACADEMICS>
I am trying to read these XML files with lxml using the functions below.
As you can see, I am only interested in reading specific tags from the XML file: ["ACADEMICS","STUDENDS","ID","SHORT_STD_DESC"]
def recursive_dict(self, element):
    return element.tag, \
        dict(map(self.recursive_dict, element)) or element.text
def ConvertFilePivot(self, inputfile):
    context = etree.iterparse(inputfile, events=('start','end'), tag=["ACADEMICS","STUDENDS","ID","SHORT_STD_DESC"])
    lstValues = []
    asOfDate = ""
    for event, elem in context:
        if elem.tag == "ACADEMICS":
            asOfDate = elem[0].attrib['ASOF_DATE']
        else:
            for event, elem in context:
                doc = self.recursive_dict(elem)
                lstValues.append(doc)
    dfvalues = pd.DataFrame(lstValues, columns=["ColName","ColValue"])
    columns = dfvalues['ColName'].unique()
    data = {}
    for column in columns:
        data[column] = list(dfvalues[dfvalues['ColName'] == column]['ColValue'])
    dfdata = pd.DataFrame(data)
    return dfdata
Now, the problem is that when I load this XML into a DataFrame with the function above, I get None as the text of the ID and SHORT_STD_DESC elements for some records, even though the actual XML file has values there.
I am not sure why they are not reflected in my DataFrame.
Any input would be a great help.

This may be more a comment than an answer, but I can't fit it in an actual comment...
Try changing
        else:
            for event, elem in context:
                doc = self.recursive_dict(elem)
to just:
        else:
            doc = self.recursive_dict(elem)
and see if it works.
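One likely reason for the None values: when a 'start' event is delivered, the parser has only seen the opening tag, so the element's attributes are available but its text is not guaranteed to have been read yet. Only the 'end' event guarantees a fully parsed element. The sketch below illustrates this with the standard library's xml.etree.ElementTree.iterparse rather than lxml (a substitution made here so the example has no third-party dependency; lxml's iterparse follows the same start/end rule). The function name read_students and the trimmed two-record sample are assumptions for illustration, not part of the original code.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed, hypothetical two-record version of the sample from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ACADEMICS>
<STUDENTS ASOF_DATE="11/21/2019" CREATE_DATE="11/22/2019" RECORDS="2">
<STUDENT><NAME>JOHN</NAME><ID>JH1</ID><SHORT_STD_DESC>JOHN IS A GOOD STUDENT</SHORT_STD_DESC></STUDENT>
<STUDENT><NAME>ADAM</NAME><ID>AD2</ID><SHORT_STD_DESC>ADAM IS A GOOD STUDENT</SHORT_STD_DESC></STUDENT>
</STUDENTS>
</ACADEMICS>"""


def read_students(source, wanted=("ID", "SHORT_STD_DESC")):
    """Stream the file and return (as_of_date, list of row dicts)."""
    as_of_date = None
    rows = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available on 'start'; text may not be.
            if elem.tag == "STUDENTS":
                as_of_date = elem.get("ASOF_DATE")
        elif elem.tag == "STUDENT":
            # 'end' of STUDENT: all of its children are fully parsed here.
            rows.append({tag: elem.findtext(tag) for tag in wanted})
            # Drop the children we no longer need to keep memory bounded.
            elem.clear()
    return as_of_date, rows


as_of_date, rows = read_students(io.BytesIO(SAMPLE))
print(as_of_date, rows)
```

If this matches the real file, the list of dicts can be handed to pd.DataFrame(rows) directly, which would also avoid the ColName/ColValue pivot step.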

Related

convert xml to csv using python

I am learning my way around Python and need a little help. I have an XML file from a SOAP API that I am failing to convert to CSV. Getting the data with the requests library was easy; my struggle is converting it to CSV, where I end up with headers but no values.
My XML data:
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Level3>
<ResponseStatus>Success</ResponseStatus>
<ErrorMessage/>
<Message>20 alert(s) generated for this period</Message>
<ProcessingTimeSecs>0.88217689999999993</ProcessingTimeSecs>
<Something1>1</Something1>
<Something2/>
<Something3/>
<Something4/>
<VIP>
<MainVIP>
<Date>20210616</Date>
<RegisteredDate>20210216</RegisteredDate>
<Type>YMBA</Type>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<ITNumber>987654321</ITNumber>
<RegistrationNumber>123456789</RegistrationNumber>
<SubscriberNumber>55889977</SubscriberNumber>
<SubscriberReference/>
<TicketNumber>1122336655</TicketNumber>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
<CompletedDate>20210615</CompletedDate>
</MainVIP>
</VIP>
<Something5/>
<Something6/>
<Something7/>
<Something8/>
<Something9/>
<PrincipalSomething10/>
<PrincipalSomething11/>
<PrincipalSomething12/>
<PrincipalSomething13/>
<Something14/>
<Something15/>
<Something16/>
<Something17/>
<Something18/>
<PrincipalSomething19/>
<PrincipalSomething20/>
</Level3>
</Level2>
</soap:Body>
</soap:Envelope>
My Python code looks like this:
import xml.etree.ElementTree as ET
import pandas as pd
cols = ['Date', 'RegisteredDate', 'Type',
'TypeDescription']
rows = []
# parse xml file
xmlparse = ET.parse('xmldata.xml')
root = xmlparse.getroot()
for i in root:
    Date = i.get('Date').text
    RegisteredDate = i.get('RegisteredDate').text
    Type = i.get('Type').text
    TypeDescription = i.get('TypeDescription').text
    rows.append({'Date': Date,
                 'RegisteredDate': RegisteredDate,
                 'Type': Type,
                 'TypeDescription': TypeDescription})
df = pd.DataFrame(rows, columns=cols)
print(df)
df.to_csv('csvdata.csv')
In my approach, I was following the idea from here https://www.geeksforgeeks.org/convert-xml-to-csv-in-python/

You probably don't need to go through ElementTree; you can feed the xml directly to pandas. If I understand you correctly, this should do it:
df = pd.read_xml(path_to_file, xpath="//*[local-name()='MainVIP']")
df = df.iloc[:,:4]
df
Output from your xml above:
       Date  RegisteredDate  Type  TypeDescription
0  20210616        20210216  YMBA  TYPE OF ENQUIRY

Without any external library, the code below generates a CSV file.
The idea is to collect the required element data from each MainVIP and store it in a list of dicts, then loop over the list and write the data to a file.
import xml.etree.ElementTree as ET
xml = ''' <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Level3>
<ResponseStatus>Success</ResponseStatus>
<ErrorMessage/>
<Message>20 alert(s) generated for this period</Message>
<ProcessingTimeSecs>0.88217689999999993</ProcessingTimeSecs>
<Something1>1</Something1>
<Something2/>
<Something3/>
<Something4/>
<VIP>
<MainVIP>
<Date>20210616</Date>
<RegisteredDate>20210216</RegisteredDate>
<Type>YMBA</Type>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<ITNumber>987654321</ITNumber>
<RegistrationNumber>123456789</RegistrationNumber>
<SubscriberNumber>55889977</SubscriberNumber>
<SubscriberReference/>
<TicketNumber>1122336655</TicketNumber>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
<CompletedDate>20210615</CompletedDate>
</MainVIP>
</VIP>
<Something5/>
<Something6/>
<Something7/>
<Something8/>
<Something9/>
<PrincipalSomething10/>
<PrincipalSomething11/>
<PrincipalSomething12/>
<PrincipalSomething13/>
<Something14/>
<Something15/>
<Something16/>
<Something17/>
<Something18/>
<PrincipalSomething19/>
<PrincipalSomething20/>
</Level3>
</Level2>
</soap:Body>
</soap:Envelope>'''
cols = ['Date', 'RegisteredDate', 'Type',
'TypeDescription']
rows = []
NS = '{https://xxxxxxxxxx/xxxxxxx}'
root = ET.fromstring(xml)
for vip in root.findall(f'.//{NS}MainVIP'):
    rows.append({c: vip.find(NS+c).text for c in cols})
with open('out.csv','w') as f:
    f.write(','.join(cols) + '\n')
    for row in rows:
        f.write(','.join(row[c] for c in cols) + '\n')
out.csv
Date,RegisteredDate,Type,TypeDescription
20210616,20210216,YMBA,TYPE OF ENQUIRY
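A possible variation on the approach above, sketched under a few assumptions: joining fields with ',' by hand breaks as soon as a value contains a comma, and Element.get() (used in the question) looks up an attribute rather than a child element, so child text has to be read with find() or findtext(). The sketch below uses the standard csv module for quoting. The helper names main_vip_rows and to_csv are hypothetical, and the sample is shortened, with a comma added to TypeDescription purely to show the quoting.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{https://xxxxxxxxxx/xxxxxxx}"
COLS = ["Date", "RegisteredDate", "Type", "TypeDescription"]

# Shortened sample; the comma in TypeDescription is added for illustration.
SAMPLE = """<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx"><Level3><VIP><MainVIP>
<Date>20210616</Date><RegisteredDate>20210216</RegisteredDate>
<Type>YMBA</Type><TypeDescription>TYPE, OF ENQUIRY</TypeDescription>
</MainVIP></VIP></Level3></Level2>"""


def main_vip_rows(xml_text):
    """Yield one dict per MainVIP element, with '' for missing children."""
    root = ET.fromstring(xml_text)
    for vip in root.iter(NS + "MainVIP"):
        yield {c: vip.findtext(NS + c, default="") for c in COLS}


def to_csv(xml_text, out):
    """Write the selected columns to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=COLS)
    writer.writeheader()
    writer.writerows(main_vip_rows(xml_text))


buffer = io.StringIO()
to_csv(SAMPLE, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of a StringIO, the csv documentation recommends opening it with newline='' so the writer controls line endings.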

How to handle XML element iter nested attributes with the same tag

I am trying to parse an NPORT-P XML SEC submission. My code (Python 3.6.8) with a sample XML record:
import xml.etree.ElementTree as ET
content_xml = '<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><headerData></headerData><formData><genInfo></genInfo><fundInfo></fundInfo><invstOrSecs><invstOrSec><name>N/A</name><lei>N/A</lei><title>US 10YR NOTE (CBT)Sep20</title><cusip>N/A</cusip> <identifiers> <ticker value="TYU0"/> </identifiers> <derivativeInfo> <futrDeriv derivCat="FUT"> <counterparties> <counterpartyName>Chicago Board of Trade</counterpartyName> <counterpartyLei>549300EX04Q2QBFQTQ27</counterpartyLei> </counterparties><payOffProf>Short</payOffProf> <descRefInstrmnt> <otherRefInst> <issuerName>U.S. Treasury 10 Year Notes</issuerName> <issueTitle>U.S. Treasury 10 Year Notes</issueTitle> <identifiers> <cusip value="N/A"/><other otherDesc="USER DEFINED" value="TY_Comdty"/> </identifiers> </otherRefInst> </descRefInstrmnt> <expDate>2020-09-21</expDate> <notionalAmt>-2770555</notionalAmt> <curCd>USD</curCd> <unrealizedAppr>-12882.5</unrealizedAppr></futrDeriv> </derivativeInfo> </invstOrSec> </invstOrSecs> <signature> </signature> </formData></edgarSubmission>'
content_tree = ET.ElementTree(ET.fromstring(bytes(content_xml, encoding='utf-8')))
content_root = content_tree.getroot()
for edgar_submission in content_root.iter('{http://www.sec.gov/edgar/nport}edgarSubmission'):
    for form_data in edgar_submission.iter('{http://www.sec.gov/edgar/nport}formData'):
        for genInfo in form_data.iter('{http://www.sec.gov/edgar/nport}genInfo'):
            None
        for fundInfo in form_data.iter('{http://www.sec.gov/edgar/nport}fundInfo'):
            None
        for invstOrSecs in form_data.iter('{http://www.sec.gov/edgar/nport}invstOrSecs'):
            for invstOrSec in invstOrSecs.iter('{http://www.sec.gov/edgar/nport}invstOrSec'):
                myrow = []
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}name'), 'text', ''))
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}lei'), 'text', ''))
                security_title = getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}title'), 'text', '')
                myrow.append(security_title)
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}cusip'), 'text', ''))
                for identifiers in invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers'):
                    if identifiers.find('{http://www.sec.gov/edgar/nport}isin') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}isin').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No ISIN")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}ticker') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}ticker').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Ticker")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}other') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}other').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Other")
The output from this code is:
No ISIN
No Other
No ISIN
No Ticker
This works fine, except that invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers') finds not only the identifiers under formData>invstOrSecs>invstOrSec but also the identifiers nested under formData>invstOrSecs>invstOrSec>derivativeInfo>futrDeriv>descRefInstrmnt>otherRefInst. How can I restrict my iter or find to the right level? I have tried unsuccessfully to get the parent, but I cannot find how to do this using the {namespace}tag notation. Any ideas?

So I switched from ElementTree to lxml using an import like this to avoid code changes:
from lxml import etree as ET
Make sure you check https://lxml.de/1.3/compatibility.html to avoid compatibility issues. In my case lxml worked without issues.
I was then able to use the getparent() method to read only the identifiers from the right part of the XML file:
if identifiers.getparent().tag == '{http://www.sec.gov/edgar/nport}invstOrSec':
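An alternative that stays within the standard library, offered as a sketch rather than as the poster's method: iter() walks every descendant, whereas find() and findall() with a bare tag only look at direct children. Asking each invstOrSec for its own identifiers child therefore skips the nested one without needing getparent(). The namespace prefix "n", the helper value_of, and the shortened sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

NS = {"n": "http://www.sec.gov/edgar/nport"}

# Shortened version of the record from the question.
SAMPLE = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport"><formData><invstOrSecs><invstOrSec>
<title>US 10YR NOTE (CBT)Sep20</title>
<identifiers><ticker value="TYU0"/></identifiers>
<derivativeInfo><futrDeriv><descRefInstrmnt><otherRefInst>
<identifiers><cusip value="N/A"/><other otherDesc="USER DEFINED" value="TY_Comdty"/></identifiers>
</otherRefInst></descRefInstrmnt></futrDeriv></derivativeInfo>
</invstOrSec></invstOrSecs></formData></edgarSubmission>"""


def value_of(identifiers, name):
    """Return the 'value' attribute of a direct child, or '' if absent."""
    if identifiers is None:
        return ""
    node = identifiers.find(f"n:{name}", NS)
    return node.get("value", "") if node is not None else ""


root = ET.fromstring(SAMPLE)
rows = []
nested_count = 0
for sec in root.iterfind(".//n:invstOrSec", NS):
    # iter() sees both <identifiers> elements, at any depth.
    nested_count += len(list(sec.iter("{http://www.sec.gov/edgar/nport}identifiers")))
    # find() with a bare tag only matches a direct child of invstOrSec.
    own = sec.find("n:identifiers", NS)
    rows.append([
        sec.findtext("n:title", default="", namespaces=NS),
        value_of(own, "isin"),
        value_of(own, "ticker"),
        value_of(own, "other"),
    ])
print(nested_count, rows)
```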

How to get the content of specific grandchild from xml file through python

Hi, I am very new to Python programming. I have an XML file with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.nih.gov" uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1307390687803.0">
<ResponseHeader>
<Version>1.8.1</Version>
<MessageId>-421198203</MessageId>
<DateRequest>2007-11-01</DateRequest>
<TimeRequest>12:30:44</TimeRequest>
<RequestingSite>removed</RequestingSite>
<ServicingSite>removed</ServicingSite>
<TaskDescription>Second unblinded read</TaskDescription>
<CtImageFile>removed</CtImageFile>
<SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192</SeriesInstanceUid>
<DateService>2008-08-18</DateService>
<TimeService>02:05:51</TimeService>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178</StudyInstanceUID>
</ResponseHeader>
<readingSession>
<annotationVersion>3.12</annotationVersion>
<servicingRadiologistID>540461523</servicingRadiologistID>
<unblindedReadNodule>
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
<roi>
<imageZposition>-125.000000 </imageZposition>
<imageSOP_UID>1.3.6.1.4.1.14519.5.2.1.6279.6001.110383487652933113465768208719</imageSOP_UID>
......
There are four <readingSession> elements, each of which contains multiple <roi> elements. Each <roi> contains an <imageSOP_UID>. I need to extract the information in <imageSOP_UID> from all of these headers.
Right now I am doing this:
import xml.etree.ElementTree as ET
tree = ET.parse('069.xml')
root = tree.getroot()
#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
        print(id)
but it prints nothing and only outputs this:
Process finished with exit code 0.
Can anyone help?

The real problem was with the namespace. I tried with and without it, but it didn't work with my original code.
import xml.etree.ElementTree as ET
import pydicom
ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID
tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    print(child.tag)
    if child.tag == '{http://www.nih.gov}readingSession':
        read = child.find('{http://www.nih.gov}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{http://www.nih.gov}noduleID').text
            xml_uid = read.find('{http://www.nih.gov}roi').find('{http://www.nih.gov}imageSOP_UID').text
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi = read.find('{http://www.nih.gov}roi')
                print(roi)
This works completely fine for getting a UID from a DICOM image of the LIDC/IDRI dataset and then finding the same UID in the XML file to get its region of interest.
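The code above only looks at the first unblindedReadNodule and the first roi of each reading session. To collect every imageSOP_UID, as the question asks, a namespace map plus a descendant search may be enough. This is a sketch under assumptions: the prefix "lidc" is arbitrary, and the sample with its shortened UIDs is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Map an arbitrary prefix to the document's default namespace.
NS = {"lidc": "http://www.nih.gov"}

# Hypothetical miniature of the LIDC file, with made-up short UIDs.
SAMPLE = """<LidcReadMessage xmlns="http://www.nih.gov">
<readingSession><unblindedReadNodule><noduleID>Nodule 001</noduleID>
<roi><imageSOP_UID>1.2.3</imageSOP_UID></roi>
<roi><imageSOP_UID>1.2.4</imageSOP_UID></roi>
</unblindedReadNodule></readingSession>
<readingSession><unblindedReadNodule><noduleID>Nodule 002</noduleID>
<roi><imageSOP_UID>1.2.5</imageSOP_UID></roi>
</unblindedReadNodule></readingSession>
</LidcReadMessage>"""

root = ET.fromstring(SAMPLE)
uids = []
for session in root.findall("lidc:readingSession", NS):
    # './/' searches all descendants, so roi may sit at any depth.
    for uid_el in session.iterfind(".//lidc:roi/lidc:imageSOP_UID", NS):
        uids.append(uid_el.text)
print(uids)
```

Without the namespace map, a path such as 'readingSession' matches nothing in a document that declares a default namespace, which would explain the empty output in the question.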

How can I parse a XML file to a dictionary in Python?

I am trying to parse an XML file using the Python library minidom (I have also tried the xml.etree.ElementTree API).
My XML (resource.xml)
<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
<quota_rule name='max_mem_per_user/5'>
<users>user1</users>
<limit resource='mem' limit='1550' value='921'/>
</quota_rule>
<quota_rule name='max_mem_per_user/6'>
<users>user2</users>
<limit resource='mem' limit='2150' value='3'/>
</quota_rule>
</quota_result>
I would like to parse this file, store the information in a dictionary of the following form, and be able to access it:
{'user1': [resource, limit, value], 'user2': [resource, limit, value]}
So far I have only been able to do things like:
from xml.dom import minidom
docXML = minidom.parse("resource.xml")
for node in docXML.getElementsByTagName('limit'):
    print(node.getAttribute('value'))

You can use getElementsByTagName and getAttribute to build the result:
from xml.dom.minidom import parse
dict_users = dict()
docXML = parse('mydata.xml')
users = docXML.getElementsByTagName("quota_rule")
for node in users:
    user = 'None'
    # check the length of tag_user to see whether a <users> tag exists
    tag_user = node.getElementsByTagName("users")
    if len(tag_user) == 0:
        print("tag <users> is not exist")
    else:
        user = tag_user[0]
    resource = node.getElementsByTagName("limit")[0].getAttribute("resource")
    limit = node.getElementsByTagName("limit")[0].getAttribute("limit")
    value = node.getElementsByTagName("limit")[0].getAttribute("value")
    # rules without a <users> tag are stored under the key 'None'
    if user == 'None':
        dict_users['None'] = [resource, limit, value]
    else:
        dict_users[user.firstChild.data] = [resource, limit, value]
print(dict_users)
Output (Python 2, after removing <users>user1</users> from the XML to show the missing-tag case):
tag <users> is not exist
{'None': [u'mem', u'1550', u'921'], u'user2': [u'mem', u'2150', u'3']}
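For comparison, here is a sketch of the same idea with xml.etree.ElementTree under Python 3, which the question mentions having tried. It assumes the well-formed version of the XML (both rules having a <users> element); the function name quota_by_user and the prefix "q" are made up for this example, and rules without <users> are filed under 'None' to mirror the minidom answer.

```python
import xml.etree.ElementTree as ET

# The document declares a default namespace, so paths need a prefix map.
NS = {"q": "https://some_url"}

SAMPLE = """<quota_result xmlns="https://some_url">
<quota_rule name='max_mem_per_user/5'>
<users>user1</users>
<limit resource='mem' limit='1550' value='921'/>
</quota_rule>
<quota_rule name='max_mem_per_user/6'>
<users>user2</users>
<limit resource='mem' limit='2150' value='3'/>
</quota_rule>
</quota_result>"""


def quota_by_user(xml_text):
    """Map each user name to [resource, limit, value] from its <limit>."""
    root = ET.fromstring(xml_text)
    result = {}
    for rule in root.findall("q:quota_rule", NS):
        limit = rule.find("q:limit", NS)
        if limit is None:
            continue  # nothing to record for this rule
        user = rule.findtext("q:users", default="", namespaces=NS).strip() or "None"
        result[user] = [limit.get("resource"), limit.get("limit"), limit.get("value")]
    return result


quotas = quota_by_user(SAMPLE)
print(quotas)
```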

How to associate values of tags with the label of the tag using ElementTree in a Pythonic way

I have some xml files I am trying to process.
Here is a derived sample from one of the files
fileAsString = """<?xml version="1.0" encoding="utf-8"?>
<eventDocument>
<schemaVersion>X2</schemaVersion>
<eventTable>
<eventTransaction>
<eventTitle>
<value>Some Event</value>
</eventTitle>
<eventDate>
<value>2003-12-31</value>
</eventDate>
<eventCoding>
<eventType>47</eventType>
<eventCode>A</eventCode>
<footnoteId id="F1"/>
<footnoteId id="F2"/>
</eventCoding>
<eventCycled>
<value></value>
</eventCycled>
<eventAmounts>
<eventVoltage>
<value>40000</value>
</eventVoltage>
</eventAmounts>
</eventTransaction>
</eventTable>
</eventDocument>"""
Note that there can be many eventTables in each document, and events can have more details than just the ones I have isolated.
My goal is to create a dictionary in the following form
{'eventTitle':'Some Event', 'eventDate':'2003-12-31','eventType':'47',\
'eventCode':'A', 'eventCoding_FTNT_1':'F1','eventCoding_FTNT_2':'F2',\
'eventCycled':'', 'eventVoltage':'40000'}
I am actually reading these in from files, but assuming I have a string, this is my code to get the text of the elements directly below the eventTransaction element whose text sits inside a value tag:
import xml.etree.ElementTree as ET
myXML = ET.fromstring(fileAsString)
eventTransactions = [e for e in myXML.iter() if e.tag == 'eventTransaction']
testTransaction = eventTransactions[0]
my_dict = {}
for child_of in testTransaction:
    grand_children_tags = [e.tag for e in child_of]
    if grand_children_tags == ['value']:
        my_dict[child_of.tag] = [e.text for e in child_of][0]
>>> my_dict
{'eventTitle': 'Some Event', 'eventCycled': None, 'eventDate': '2003-12-31'}
This seems wrong because I am not really taking advantage of XML; I am using brute force instead, but I have not found a better example.
Is there a clearer and more Pythonic way to create the output I am looking for?

Use XPath to pull out the elements you're interested in.
The following code creates a list of lists of dicts (i.e. tables/transactions/info):
import xml.etree.ElementTree as ET
tables = []
myXML = ET.fromstring(fileAsString)
for table in myXML.findall('./eventTable'):
    transactions = []
    tables.append(transactions)
    for transaction in table.findall('./eventTransaction'):
        info = {}
        # elements anywhere below this transaction that have a <value> child
        for element in transaction.findall('.//*[value]'):
            info[element.tag] = element.find('./value').text or ''
        coding = transaction.find('./eventCoding')
        if coding is not None:
            for tag in 'eventType', 'eventCode':
                element = coding.find('./%s' % tag)
                if element is not None:
                    info[tag] = element.text or ''
            for index, element in enumerate(coding.findall('./footnoteId')):
                info['eventCoding_FTNT_%d' % index] = element.get('id', '')
        if info:
            transactions.append(info)
Output:
[[{'eventCode': 'A',
   'eventCoding_FTNT_0': 'F1',
   'eventCoding_FTNT_1': 'F2',
   'eventCycled': '',
   'eventDate': '2003-12-31',
   'eventTitle': 'Some Event',
   'eventType': '47',
   'eventVoltage': '40000'}]]
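The output above numbers the footnotes from 0, while the question asked for eventCoding_FTNT_1 and eventCoding_FTNT_2. A small variation, sketched here as one possible way to match the requested keys, starts the enumeration at 1 and wraps the per-transaction logic in a helper. The name flatten_transaction is hypothetical, and the sample is a compacted copy of the question's XML.

```python
import xml.etree.ElementTree as ET

# Compacted copy of the sample document from the question.
SAMPLE = """<eventDocument><eventTable><eventTransaction>
<eventTitle><value>Some Event</value></eventTitle>
<eventDate><value>2003-12-31</value></eventDate>
<eventCoding><eventType>47</eventType><eventCode>A</eventCode>
<footnoteId id="F1"/><footnoteId id="F2"/></eventCoding>
<eventCycled><value></value></eventCycled>
<eventAmounts><eventVoltage><value>40000</value></eventVoltage></eventAmounts>
</eventTransaction></eventTable></eventDocument>"""


def flatten_transaction(transaction):
    """Return one flat dict for a single eventTransaction element."""
    info = {}
    # Any element at any depth whose direct child is <value>.
    for parent in transaction.iter():
        value = parent.find("value")
        if value is not None:
            info[parent.tag] = value.text or ""
    coding = transaction.find("eventCoding")
    if coding is not None:
        for child in coding:
            if child.tag != "footnoteId":
                info[child.tag] = child.text or ""
        # start=1 gives eventCoding_FTNT_1, eventCoding_FTNT_2, ...
        for number, note in enumerate(coding.findall("footnoteId"), start=1):
            info[f"eventCoding_FTNT_{number}"] = note.get("id", "")
    return info


root = ET.fromstring(SAMPLE)
records = [flatten_transaction(t) for t in root.iter("eventTransaction")]
print(records)
```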
