I have a very large XML file (about 100 MB) with many elements similar to the one in this example:
<adrmsg:hasMember>
  <aixm:DesignatedPoint gml:id="ID_197095_1650420151927_74256">
    <gml:identifier codeSpace="urn:uuid:">084e1bb6-94f7-450f-a88e-44eb465cd5a6</gml:identifier>
    <aixm:timeSlice>
      <aixm:DesignatedPointTimeSlice gml:id="ID_197095_1650420151927_74257">
        <gml:validTime>
          <gml:TimePeriod gml:id="ID_197095_1650420151927_74258">
            <gml:beginPosition>2020-12-31T00:00:00</gml:beginPosition>
            <gml:endPosition indeterminatePosition="unknown"/>
          </gml:TimePeriod>
        </gml:validTime>
        <aixm:interpretation>BASELINE</aixm:interpretation>
        <aixm:featureLifetime>
          <gml:TimePeriod gml:id="ID_197095_1650420151927_74259">
            <gml:beginPosition>2020-12-31T00:00:00</gml:beginPosition>
            <gml:endPosition indeterminatePosition="unknown"/>
          </gml:TimePeriod>
        </aixm:featureLifetime>
        <aixm:designator>BITLA</aixm:designator>
        <aixm:type>ICAO</aixm:type>
        <aixm:location>
          <aixm:Point gml:id="ID_197095_1650420151927_74260">
            <gml:pos srsName="urn:ogc:def:crs:EPSG::4326">40.87555555555556 21.358055555555556</gml:pos>
          </aixm:Point>
        </aixm:location>
        <aixm:extension>
          <adrext:DesignatedPointExtension gml:id="ID_197095_1650420151927_74261">
            <adrext:pointUsage>
              <adrext:PointUsage gml:id="ID_197095_1650420151927_74262">
                <adrext:role>FRA_ENTRY</adrext:role>
                <adrext:reference_border>
                  <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74263">
                    <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                    <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                  </adrext:AirspaceBorderCrossingObject>
                </adrext:reference_border>
              </adrext:PointUsage>
            </adrext:pointUsage>
            <adrext:pointUsage>
              <adrext:PointUsage gml:id="ID_197095_1650420151927_74264">
                <adrext:role>FRA_EXIT</adrext:role>
                <adrext:reference_border>
                  <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74265">
                    <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                    <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                  </adrext:AirspaceBorderCrossingObject>
                </adrext:reference_border>
              </adrext:PointUsage>
            </adrext:pointUsage>
          </adrext:DesignatedPointExtension>
        </aixm:extension>
      </aixm:DesignatedPointTimeSlice>
    </aixm:timeSlice>
  </aixm:DesignatedPoint>
</adrmsg:hasMember>
The ultimate goal is to get the parsed data from this very big XML file into a pandas DataFrame.
So far I cannot capture the data I am looking for: I only manage to capture the data from the very last element in the file.
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

ab = {'aixm':'http://www.aixm.aero/schema/5.1.1', 'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR', 'gml':'http://www.opengis.net/gml/3.2'}

for point in root.findall('.//aixm:DesignatedPointTimeSlice', ab):
    designator = point.find('.//aixm:designator', ab)
    d = point.find('.//{http://www.aixm.aero/schema/5.1.1}type', ab)
    for pos in point.findall('.//gml:pos', ab):
        print(designator.text, pos.text, d.text)
The print statement outputs the data I would like to have, but as mentioned, only for the very last element of the file, whereas I would like the result for all of them:
ZIFSA 54.02111111111111 27.823888888888888 ICAO
Could I please be advised on the path I should follow? I need some help, please.
Thank you very much.
Assuming all three needed nodes (aixm:designator, aixm:type, and gml:pos) are always present, consider parsing the parent nodes, aixm:DesignatedPointTimeSlice and aixm:Point, and then joining them. Finally, select the three columns needed.
import pandas as pd

ab = {
    'aixm': 'http://www.aixm.aero/schema/5.1.1',
    'adrext': 'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR',
    'gml': 'http://www.opengis.net/gml/3.2'
}

time_slice_df = pd.read_xml(
    'file.xml', xpath=".//aixm:DesignatedPointTimeSlice", namespaces=ab
).add_prefix("time_slice_")

point_df = pd.read_xml(
    'file.xml', xpath=".//aixm:Point", namespaces=ab
).add_prefix("point_")

time_slice_df = (
    time_slice_df.join(point_df)
    .reindex(
        ["time_slice_designator", "time_slice_type", "point_pos"],
        axis="columns"
    )
)
And in the forthcoming pandas 1.5, read_xml will support iterparse, allowing retrieval of descendant nodes without being limited to XPath expressions:
time_slice_df = pd.read_xml(
    'file.xml',
    namespaces=ab,
    iterparse={"aixm:DesignatedPointTimeSlice":
               ["aixm:designator", "aixm:type", "aixm:Point"]}
)
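If upgrading pandas is not an option, a memory-friendly alternative for a file of this size is the standard library's xml.etree.ElementTree.iterparse. Below is a minimal sketch, assuming the same namespaces as above and that each DesignatedPointTimeSlice carries at most one designator, type, and position:

import xml.etree.ElementTree as ET
import pandas as pd

AIXM = '{http://www.aixm.aero/schema/5.1.1}'
GML = '{http://www.opengis.net/gml/3.2}'

rows = []
# Stream the document element by element instead of building the whole tree in memory.
for _, elem in ET.iterparse('file.xml', events=('end',)):
    if elem.tag == AIXM + 'DesignatedPointTimeSlice':
        designator = elem.find('.//' + AIXM + 'designator')
        ptype = elem.find('.//' + AIXM + 'type')
        pos = elem.find('.//' + GML + 'pos')
        rows.append({
            'designator': designator.text if designator is not None else None,
            'type': ptype.text if ptype is not None else None,
            'pos': pos.text if pos is not None else None,
        })
        elem.clear()  # release the subtree that has already been processed

df = pd.DataFrame(rows)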
Using Python lxml, I want to test whether an XML document contains EXPERIMENT_TYPE and, if it exists, extract the <VALUE>.
Example:
<EXPERIMENT_SET>
  <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
    <TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
    <STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
    <EXPERIMENT_ATTRIBUTES>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
    </EXPERIMENT_ATTRIBUTES>
  </EXPERIMENT>
</EXPERIMENT_SET>
Is there a faster way than iterating through all elements?
all = etree.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        print("Found")
That attempt also gets messy when I want to extract the <VALUE>.
Preferably you do this with XPath, which is bound to be very fast. My suggestion (tested and working) is below. It will return a (possibly empty) list of VALUE elements from which you can extract the text.
PS: do not use built-in names such as all as variable names. It is bad practice and may lead to unexpected bugs.
import lxml.etree as ET
from lxml.etree import Element
from typing import List
xml_str = """
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
"""
tree = ET.ElementTree(ET.fromstring(xml_str))
vals: List[Element] = tree.xpath(".//EXPERIMENT_ATTRIBUTE/TAG[text()='EXPERIMENT_TYPE']/following-sibling::VALUE")
print(vals[0].text)
# DNA Methylation
An alternative XPath expression was provided by Michael Kay, which is identical to the answer by Martin Honnen:
.//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE
In terms of XPath, it seems you simply want to select the VALUE element based on the TAG element, e.g. with /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE.
I think with Python and lxml, people often select the text node instead, e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text(), as the xpath function then returns that as a Python string.
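A minimal sketch of that text() form with lxml, assuming the question's document is saved as experiment.xml (the filename is only for illustration):

import lxml.etree as ET

root = ET.parse('experiment.xml').getroot()
values = root.xpath(
    "/EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES"
    "/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text()"
)
# values is a list of Python strings; it is empty if EXPERIMENT_TYPE is absent.
print(values[0] if values else 'EXPERIMENT_TYPE not found')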
Using findall is the natural way to do it. I suggest the following code to find the VALUEs:
from lxml import etree

root = etree.parse('toto.xml').getroot()
tags = root.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in tags:
    if e.text == 'EXPERIMENT_TYPE':
        v = e.getparent().find('VALUE')
        if v is not None:
            print(f'Found val="{v.text}"')
This outputs:
Found val="DNA Methylation"
I have an XML file that contains a lot of information and tags.
For example, I have this tag:
<SelectListMap SourceName="Document Type" SourceNumber="43" DestName="Document Type" DestNumber="43"/>
I have 40 other tags like this one with the same attributes, but the values of these attributes are different in each tag.
SourceName and DestName have the same value.
In some tags the DestName value is empty, like this one:
<SelectListMap SourceName="Boolean Values" SourceNumber="73" DestName="" DestNumber="0" IsInternal="True"/>
So I'm trying to give the empty DestName the value of SourceName.
Here is my Python code:
import re
import xml.etree.ElementTree as ET

tree = ET.parse("SPPID04A_BG3 - Copy - Copy.xml")
root = tree.getroot()

for SelectListMap in root.iter('SelectListMap'):
    #DestName.text = str(DestName)
    for node in tree.iter('SelectListMap'):
        SourceName = node.attrib.get('SourceName')
        SelectListMap.set('DestName', SourceName)

tree.write("SPPID04A_BG3 - Copy - Copy.xml")
This program is not working the right way. Any help or ideas?
Thanks!
You never check whether the DestName attribute is empty. If you replace the first for loop with the following, you should get what you want:
for SelectListMap in root.iter('SelectListMap'):
    if SelectListMap.get("DestName") == "":
        SourceName = SelectListMap.get("SourceName")
        SelectListMap.set("DestName", SourceName)
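If some SelectListMap tags omit the DestName attribute entirely rather than carrying an empty string, get() returns None instead of "", so a slightly more defensive variant (a sketch under that assumption) would be:

import xml.etree.ElementTree as ET

tree = ET.parse("SPPID04A_BG3 - Copy - Copy.xml")
root = tree.getroot()

for SelectListMap in root.iter('SelectListMap'):
    # get() returns None when the attribute is missing and "" when it is present but empty
    if not SelectListMap.get("DestName"):
        SourceName = SelectListMap.get("SourceName")
        if SourceName is not None:
            SelectListMap.set("DestName", SourceName)

tree.write("SPPID04A_BG3 - Copy - Copy.xml")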
I'm trying to figure out in lxml and python how to replace an element with a string.
In my experimentation, I have the following code:
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
xref = topicroot2.xpath('//*/xref')
xref_attribute = xref[0].attrib['browsertext']
print(xref_attribute)
The result is: 'something here'
This is the browser text attribute I'm looking for in this small sample. But what I can't seem to figure out is how to replace the entire element with the attribute text I've captured here.
(I do recognize that in my sample I could have multiple xrefs and will need to construct a loop to go through them properly.)
What's the best way to go about doing this?
And for those wondering, I'm having to do this because the link actually goes to a file that doesn't exist because of our different build systems.
Thanks in advance!
Try this (Python 3):
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
# Get the root element.
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
# Get the text of the root element. This is a list of strings!
topicroot2_text = topicroot2.xpath("text()")
# Get the xref element.
xref = topicroot2.xpath('//*/xref')[0]
xref_attribute = xref.attrib['browsertext']
# Save a reference to the p element, remove the xref from it.
parent = xref.getparent()
parent.remove(xref)
# Set the text of the p element by combining the list of strings with the
# extracted attribute value.
new_text = [topicroot2_text[0], xref_attribute, topicroot2_text[1]]
parent.text = "".join(new_text)
print(et.tostring(topicroot2))
Output:
b'<p>The value is permitted only when that includes something here, otherwise the value is reserved.</p>'
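If the paragraph can contain several xref elements, one way to generalize this (a sketch, not the only approach) is to fold each element's attribute text into lxml's text/tail structure instead of rebuilding the whole string by hand; the sample document below is made up for illustration:

from lxml import etree as et

docstring = ('<p>See <xref browsertext="first link"/> and also '
             '<xref browsertext="second link"/> for details.</p>')
root = et.XML(docstring)

for xref in root.findall('.//xref'):
    parent = xref.getparent()
    replacement = xref.attrib.get('browsertext', '')
    # Text that followed the xref must be preserved when the element is dropped.
    tail = xref.tail or ''
    previous = xref.getprevious()
    if previous is not None:
        previous.tail = (previous.tail or '') + replacement + tail
    else:
        parent.text = (parent.text or '') + replacement + tail
    parent.remove(xref)

print(et.tostring(root))
# b'<p>See first link and also second link for details.</p>'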
I am looking at an XML file similar to the one below:
<pinnacle_line_feed>
  <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
  <lastContest>28962804</lastContest>
  <lastGame>162995589</lastGame>
  <events>
    <event>
      <event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
      <gamenumber>422739932</gamenumber>
      <sporttype>Alpine Skiing</sporttype>
      <league>DH 145</league>
      <IsLive>No</IsLive>
      <participants>
        <participant>
          <participant_name>Kjetil Jansrud (NOR)</participant_name>
          <contestantnum>2001</contestantnum>
          <rotnum>2001</rotnum>
          <visiting_home_draw>Visiting</visiting_home_draw>
        </participant>
        <participant>
          <participant_name>The Field</participant_name>
          <contestantnum>2002</contestantnum>
          <rotnum>2002</rotnum>
          <visiting_home_draw>Home</visiting_home_draw>
        </participant>
      </participants>
      <periods>
        <period>
          <period_number>0</period_number>
          <period_description>Matchups</period_description>
          <periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
          <period_status>I</period_status>
          <period_update>open</period_update>
          <spread_maximum>200</spread_maximum>
          <moneyline_maximum>100</moneyline_maximum>
          <total_maximum>200</total_maximum>
          <moneyline>
            <moneyline_visiting>116</moneyline_visiting>
            <moneyline_home>-136</moneyline_home>
          </moneyline>
        </period>
      </periods>
      <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
    </event>
  </events>
</pinnacle_line_feed>
I have parsed the file with the code below:
import urllib
import xml.etree.ElementTree as ET

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []

for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text
However, I am hoping to get the nodes organized per event, similar to a 2D table (as in a pandas DataFrame). The full XML file has multiple "event" children, and some events do not share the same nodes as the one above. I am struggling to take each event node and simply create a 2D table where each tag acts as the column name and its text acts as the value.
Up to this point, I have written the code above to gauge how I might put that information into a dictionary and then collect a number of dictionaries into a list from which I can create a DataFrame using pandas, but that has not worked out: all attempts have required me to find and replace text to create the dictionaries, and Python has not responded well to that when I subsequently tried to create a DataFrame. I have also used a simple:
for elt in tree.iter():
    list.append("'%s': '%s'" % (elt.tag, elt.text.strip()))
which worked quite well for simply pulling out every single tag and the corresponding text, but I was unable to make anything of it because my attempts at finding and replacing the text to create dictionaries were no good.
Any assistance would be greatly appreciated.
Thank you.
Here's an easy way to get your XML into a pandas DataFrame. This uses the awesome requests library (which you can swap for urllib if you'd like), as well as the always helpful xmltodict library available on PyPI. (NOTE: a reverse library is also available, known as dicttoxml.)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unwieldy XML doc look like a native dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
To get some idea of what this is starting to look like, here are the first couple of lines of the CSV:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
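If you would rather flatten the nested participants into real columns than drop them, one possible approach (a sketch, assuming the normal_dict list built above, that every event has a participants/participant list, and a recent pandas version) is json_normalize, which emits one row per participant while repeating the event-level fields:

import pandas

# One row per participant; event-level fields are repeated on each row.
participants_df = pandas.json_normalize(
    normal_dict,
    record_path=['participants', 'participant'],
    meta=['event_datetimeGMT', 'gamenumber', 'league', 'sporttype', 'IsLive'],
    errors='ignore'  # tolerate events that are missing some of the meta fields
)
print(participants_df.head())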