I have been learning how to extract parts of XML documents using the xml.dom.minidom module, and I can return specific elements and attributes successfully.
I have a number of large XML files I want to parse, and push all the results into a db.
Is there a function like os.walk that I can use to extract elements from the XML in a logical way that preserves the hierarchical structure?
The XML is pretty basic and very straightforward:
<InternalSignature ID="9" Specificity="Generic">
<ByteSequence Reference="BOFoffset">
<SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0" MinFragLength="0">
<Sequence>49492A00</Sequence>
<DefaultShift>5</DefaultShift>
<Shift Byte="00">1</Shift>
<Shift Byte="2A">2</Shift>
<Shift Byte="49">3</Shift>
</SubSequence>
</ByteSequence>
</InternalSignature>
<InternalSignature ID="10" Specificity="Generic">
<ByteSequence Reference="BOFoffset">
<SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0" MinFragLength="0">
<Sequence>4D4D002A</Sequence>
<DefaultShift>5</DefaultShift>
<Shift Byte="2A">1</Shift>
<Shift Byte="00">2</Shift>
<Shift Byte="4D">3</Shift>
</SubSequence>
</ByteSequence>
</InternalSignature>
Is there a formal method of crawling the XML and (in this small example) extracting the elements that relate to each specific InternalSignature?
I can see how to call things via a list using minidom.parse and the .getElementsByTagName method, but I'm not sure how you associate elements into their hierarchical representation.
So far I have found a tutorial that shows how to return various values:
from xml.dom import minidom

xmldoc = minidom.parse("file.xml")
Versionlist = xmldoc.getElementsByTagName('FFSignatureFile')
VersionRef = Versionlist[0]
Version = VersionRef.attributes["Version"]
DateCreated = VersionRef.attributes["DateCreated"]
print(Version.value)
print(DateCreated.value)
InternalSignatureList = xmldoc.getElementsByTagName('InternalSignature')
InternalSignatureRef = InternalSignatureList[0]
SigID = InternalSignatureRef.attributes["ID"]
SigSpecificity = InternalSignatureRef.attributes["Specificity"]
print(SigID.value)
print(SigSpecificity.value)
print(len(InternalSignatureList))
I can see from the last line (the len call) that there are 134 elements in InternalSignatureList, and essentially I want to be able to extract all the elements inside each InternalSignature as an individual record and flick it into a db.
(What have you tried?)
from xml.etree import ElementTree
e = ElementTree.fromstring(xmlstring)
e.findall("ByteSequence")
Related
Using Python lxml I want to test whether an XML document contains EXPERIMENT_TYPE, and if it exists, extract the <VALUE>.
Example:
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
Is there a faster way than iterating through all elements?
all = etree.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        print("Found")
That attempt is also getting messy when I want to extract the <VALUE>.
Preferably you do this with XPath, which is bound to be fast. My suggestion (tested and working) is below. It will return a (possibly empty) list of VALUE elements from which you can extract the text.
PS: do not shadow builtin names such as all with your own variable names. It is bad practice and may lead to unexpected bugs.
import lxml.etree as ET
from lxml.etree import Element
from typing import List
xml_str = """
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
"""
tree = ET.ElementTree(ET.fromstring(xml_str))
vals: List[Element] = tree.xpath(".//EXPERIMENT_ATTRIBUTE/TAG[text()='EXPERIMENT_TYPE']/following-sibling::VALUE")
print(vals[0].text)
# DNA Methylation
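To see why that PS matters: all is a Python builtin, and shadowing it fails the moment the real function is needed afterwards. A two-line illustration:

all = [1, 2, 3]                 # shadows the builtin all()
print(all(x > 0 for x in all))  # TypeError: 'list' object is not callable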
An alternative XPath expression was provided below by Michael Kay; it is essentially identical to the answer by Martin Honnen.
.//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE
In terms of XPath it seems you simply want to select the VALUE element based on the TAG element with e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE.
I think with Python and lxml people often select the text node instead, e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text(), as the xpath function then returns the matches directly as Python strings.
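A quick sketch of that text() variant, reusing xml_str from the lxml example above:

import lxml.etree as ET

tree = ET.ElementTree(ET.fromstring(xml_str))
values = tree.xpath("/EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES"
                    "/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text()")
print(values)  # ['DNA Methylation']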
Using findall is the natural way to do it. I suggest the following code to find the VALUEs:
from lxml import etree

root = etree.parse('toto.xml').getroot()
tag_elems = root.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in tag_elems:
    if e.text == 'EXPERIMENT_TYPE':
        v = e.getparent().find('VALUE')
        if v is not None:
            print(f'Found val="{v.text}"')
This outputs:
Found val="DNA Methylation"
I've been struggling with this for a couple of days now, and I figured I would ask here.
I am working on preparing an XML payload to POST to an Oracle endpoint that contains financials data. I've got most of the XML structured per Oracle specs, but I am struggling with one aspect of it. This is data that will feed the general ledger financial system, and the XML structure is below (some elements have been omitted to cut down on the post).
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:typ="http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/types/" xmlns:jour="http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/">
<soapenv:Header/>
<soapenv:Body>
<typ:importJournals>
<typ:interfaceRows>
<jour:BatchName>batch</jour:BatchName>
<jour:AccountingPeriodName>Aug-20</jour:AccountingPeriodName>
<jour:AccountingDate>2020-08-31</jour:AccountingDate>
<jour:GlInterface>
<jour:LedgerId>1234567890</jour:LedgerId>
<jour:PeriodName>Aug-20</jour:PeriodName>
<jour:AccountingDate>2020-08-31</jour:AccountingDate>
<jour:Segment1>1</jour:Segment1>
<jour:Segment2>1</jour:Segment2>
<jour:Segment3>1</jour:Segment3>
<jour:Segment4>1</jour:Segment4>
<jour:Segment5>0</jour:Segment5>
<jour:Segment6>0</jour:Segment6>
<jour:CurrencyCode>USD</jour:CurrencyCode>
<jour:EnteredCrAmount currencyCode="USD">10.0000</jour:EnteredCrAmount>
</jour:GlInterface>
<jour:GlInterface>
<jour:LedgerId>1234567890</jour:LedgerId>
<jour:PeriodName>Aug-20</jour:PeriodName>
<jour:AccountingDate>2020-08-31</jour:AccountingDate>
<jour:Segment1>2</jour:Segment1>
<jour:Segment2>2</jour:Segment2>
<jour:Segment3>2</jour:Segment3>
<jour:Segment4>2</jour:Segment4>
<jour:Segment5>0</jour:Segment5>
<jour:Segment6>0</jour:Segment6>
<jour:CurrencyCode>USD</jour:CurrencyCode>
<jour:EnteredDrAmount currencyCode="USD">10.0000</jour:EnteredDrAmount>
</jour:GlInterface>
</typ:interfaceRows>
</typ:importJournals>
</soapenv:Body>
</soapenv:Envelope>
So if you look at the XML above, there are two GlInterface blocks per transaction: one is a debit and one is a credit. The Segments (account codes) differ between them, and one GlInterface block has an EnteredDrAmount tag while the other has an EnteredCrAmount tag.
In the source data, either the Cr or the Dr value is null depending on whether the line is a debit or a credit, and the null comes through as "None" in Python.
The way I got this to work is to make two calls to get the data, one where Cr is not null and one where Dr is not null. This process works fine, but in Python I get an error: "only one * allowed". Code is below.
xmlOracle = x_Envelope(
    x_Header,
    x_Body(
        x_importJournals(
            x_interfaceRows(
                x_h_BatchName(str(batch[0])),
                x_h_AccountingPeriodName(str(batch[3])),
                x_h_AccountingDate(str(batch[4])),
                *[x_GlInterface(
                    x_d_LedgerId(str(adid[0])),
                    x_d_PeriodName(str(adid[1])),
                    x_d_AccountingDate(str(adid[2])),
                    x_d_Segment1(str(adid[5])),
                    x_d_Segment2(str(adid[6])),
                    x_d_Segment3(str(adid[7])),
                    x_d_Segment4(str(adid[8])),
                    x_d_Segment5(str(adid[9])),
                    x_d_Segment6(str(adid[10])),
                    x_d_CurrencyCode(str(adid[11])),
                    x_d_EnteredCrAmount(str(adid[14]), currencyCode=str(adid[11]))
                ) for adid in CrAdidToProcess],
                *[x_GlInterface(
                    x_d_LedgerId(str(adid[0])),
                    x_d_PeriodName(str(adid[1])),
                    x_d_AccountingDate(str(adid[2])),
                    x_d_Segment1(str(adid[5])),
                    x_d_Segment2(str(adid[6])),
                    x_d_Segment3(str(adid[7])),
                    x_d_Segment4(str(adid[8])),
                    x_d_Segment5(str(adid[9])),
                    x_d_Segment6(str(adid[10])),
                    x_d_CurrencyCode(str(adid[11])),
                    x_d_EnteredDrAmount(str(adid[14]), currencyCode=str(adid[11]))
                ) for adid in DrAdidToProcess]
            )
        )
    )
)
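For reference, multiple * unpackings in a single call only became legal with PEP 448 in Python 3.5, which is the likely source of the "only one * allowed" restriction if an older interpreter (or a tool enforcing the pre-3.5 grammar) is involved. A generic, version-independent sketch of the workaround is to concatenate the lists first so the call needs a single unpacking (call and the row lists below are stand-ins, not the real builders):

def call(*args):
    return args

credit_rows = ['cr1', 'cr2']   # stand-ins for the x_GlInterface(...) results
debit_rows = ['dr1']

rows = credit_rows + debit_rows   # concatenate first ...
print(call('header', *rows))      # ... so only one * is needed
# ('header', 'cr1', 'cr2', 'dr1')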
I've also tried making a single call to get the line details and then either removing or filtering out the tag (either Cr or Dr) if it's "None" but I had no luck with this.
While the above process works, there is still an error in my code, and I'd like to get rid of it.
Thank you all.
After further testing, I believe I figured out the solution. I was trying to call remove on the ElementTree object itself, and it was not having any of that. Once I called remove on the element's parent and passed it the element to delete, it finally worked.
Here is code for the function to remove the "None" entries.
def removeCrDrEmptyElements(element):
    namespaces = {
        'soapenv': 'http://schemas.xmlsoap.org/soap/envelope/',
        'typ': 'http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/types/',
        'jour': 'http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/'
    }
    # Drop credit amounts whose text is the string 'None'
    for removeElement in element.xpath(
            '/soapenv:Envelope/soapenv:Body/typ:importJournals/typ:interfaceRows/jour:GlInterface/jour:EnteredCrAmount',
            namespaces=namespaces):
        if removeElement.text == 'None':
            removeElement.getparent().remove(removeElement)
    # Same for debit amounts
    for removeElement in element.xpath(
            '/soapenv:Envelope/soapenv:Body/typ:importJournals/typ:interfaceRows/jour:GlInterface/jour:EnteredDrAmount',
            namespaces=namespaces):
        if removeElement.text == 'None':
            removeElement.getparent().remove(removeElement)
    return element
Obviously this can be rewritten more cleanly (which I will do), but I only want to check two elements within the GlInterface tag, EnteredCrAmount and EnteredDrAmount, and remove those elements if their text is None.
Then you can call the function as below to get back an element with the nulls/Nones removed:
xmlWithoutNull = removeCrDrEmptyElements(xmlElement)
Output before running the function:
<jour:GlInterface>
    <!-- omitted elements -->
    <jour:EnteredCrAmount currencyCode="USD">1.000000</jour:EnteredCrAmount>
    <jour:EnteredDrAmount currencyCode="USD">None</jour:EnteredDrAmount>
    <!-- omitted elements -->
</jour:GlInterface>
<jour:GlInterface>
    <!-- omitted elements -->
    <jour:EnteredCrAmount currencyCode="USD">None</jour:EnteredCrAmount>
    <jour:EnteredDrAmount currencyCode="USD">1.000000</jour:EnteredDrAmount>
    <!-- omitted elements -->
</jour:GlInterface>
Output after running the function:
<jour:GlInterface>
    <!-- omitted elements -->
    <jour:EnteredCrAmount currencyCode="USD">1.000000</jour:EnteredCrAmount>
    <!-- omitted elements -->
</jour:GlInterface>
<jour:GlInterface>
    <!-- omitted elements -->
    <jour:EnteredDrAmount currencyCode="USD">1.000000</jour:EnteredDrAmount>
    <!-- omitted elements -->
</jour:GlInterface>
What I am trying to do is learn pandas DataFrames so I can work with and analyse data coming in XML format. In particular I want to be confident in ingesting nested XML. Most of the tutorials I have read or seen on YouTube use flat XML documents with no nesting. That does not represent real-world data, so I am trying something a bit more challenging.
I have knocked up some code in Python with a view to generating pandas DataFrames I can start practising queries against with the pandas framework.
I am using an open-source music resource, 'Discogs', because they provide access to large XML files with lots of data I can play with.
There are a couple of challenges with the source data. The first is that there is no standardised schema for the tables, so the structure of the data is not consistent throughout an XML table (an issue I feel mimics the real data I'll eventually be working with). The second is that the source files are huge, the smallest being 1.5 GB.
The first step I took was to split the files into smaller 200 MB chunks. I then looked at the structure with a text editor so I had a good understanding of the tags and elements I needed to work with. Right now I am working with a table called 'Masters'. I'm hard-coding the elements I am trying to pull into a DataFrame to keep the exercise simple and contained for now.
I am using xml.etree to parse the XML document and interact with each element it contains.
I have created a static DataFrame with a fixed set of columns for the data to go into. Again, keeping it simple for now.
I am then searching for specific elements within the parsed XML data and extracting the text from each element of interest into a variable.
The data is broken down within this XML as a set of rows, each wrapped in a tag called master. So I use this tag as my root anchor to loop around.
If I run the above as a print to console, everything works fine up to this point, and I get a stream of nicely flattened and well-formed data (excluding some elements which randomly have None values and therefore throw an error).
The last step was to parse the strings from each collected element into a row appended to the DataFrame.
This is where I hit a problem. The code to append to the DataFrame seems straightforward, but when I add it to my for loop, I get an endless loop which I have to force to end.
I am obviously missing something here; advice greatly appreciated. The code I am working with is below:
import xml.etree.ElementTree as et
import re
import pandas as pd

tree = et.parse('/media/linux/Data1/TestData/masters/200mb/masters-01.xml')
root = tree.getroot()

masters_df_cols = ["MasterID", "MainRelease", "Title", "Year",
                   "Genre", "ArtistID", "ArtistName"]
masters_df = pd.DataFrame(columns=masters_df_cols)

for elem in root.iter('master'):
    if elem is not None:
        masterID = str(elem.get('id'))
        mainRelease = str(elem.find('main_release').text)
        year = str(elem.find('year').text)
        title = str(elem.find('title').text)
        genre = str(elem.find('./genres/genre').text)
        #style = str(elem.find('./styles/style').text)
        artistID = str(elem.find('./artists/artist/id').text)
        artistName = str(elem.find('./artists/artist/name').text)
        print(masterID, ':', mainRelease, ':', year, ':', title,
              ':', genre, ':', artistID, ':', artistName)
        masters_df = masters_df.append(pd.DataFrame([masterID,
            mainRelease, year, title, genre, artistID, artistName],
            index=masters_df_cols), ignore_index=True)

print("Dataframe exported.")
The goal is to eventually take this exercise and replicate the knowledge I gain from it across different types of XML, giving me the skill to search dynamically through XML for the tags and elements I want to draw out into a DataFrame. Then I will use the DataFrames to generate meaningful stats about the data content. For now I am just trying to create simple flat DataFrames with hard-coded element values.
There are a couple of issues in the above code.
First and foremost, it is important to be vigilant with the whitespace and indentation in Python. The copy/paste of the above code had a combination of space and tab whitespace.
The primary semantic problem is that pandas is not good at incrementally building DataFrames. Typically, the pandas API is passed an entire data structure at once, letting the framework do the heavy lifting. There are methods to append DataFrames, but here a DataFrame was instantiated and then, for each iteration over the XML, a new DataFrame was instantiated and appended to the first. This is almost certainly not what was desired and would cause memory headaches, especially in the face of the voluminous XML, which will require a lot of host memory.
Then there were minor issues in constructing the row-level DataFrame. Here is a working MCVE illustrating how to parse the Discogs XML, including sample inline data.
import xml.etree.ElementTree as et
import re
import pandas as pd
parser = et.XMLParser()
discogs_masters = """<masters>
<master id="18500"><main_release>155102</main_release><images><image height="588" type="primary" uri="" uri150="" width="600" /></images><artists><artist><id>212070</id><name>Samuel L Session</name><anv>Samuel L</anv><join /><role /><tracks /></artist></artists><genres><genre>Electronic</genre></genres><styles><style>Techno</style></styles><year>2001</year><title>New Soil</title><data_quality>Correct</data_quality><videos><video duration="489" embed="true" src="http://www.youtube.com/watch?v=f05Ai921itM"><title>Samuel L - Velvet</title><description>Samuel L - Velvet</description></video><video duration="292" embed="true" src="http://www.youtube.com/watch?v=iOQsBOJLbwg"><title>Samuel L. - Danshes D'afrique</title><description>Samuel L. - Danshes D'afrique</description></video><video duration="348" embed="true" src="http://www.youtube.com/watch?v=v23rSPG_StA"><title>Samuel L - Danses D'Afrique</title><description>Samuel L - Danses D'Afrique</description></video><video duration="288" embed="true" src="http://www.youtube.com/watch?v=tHo82ha6p40"><title>Samuel L - Body N' Soul</title><description>Samuel L - Body N' Soul</description></video><video duration="331" embed="true" src="http://www.youtube.com/watch?v=KDcqzHca5dk"><title>Samuel L - Into The Groove</title><description>Samuel L - Into The Groove</description></video><video duration="334" embed="true" src="http://www.youtube.com/watch?v=3DIYjJFl8Dk"><title>Samuel L - Soul Syndrome</title><description>Samuel L - Soul Syndrome</description></video><video duration="325" embed="true" src="http://www.youtube.com/watch?v=_o8yZMPqvNg"><title>Samuel L - Lush</title><description>Samuel L - Lush</description></video><video duration="346" embed="true" src="http://www.youtube.com/watch?v=JPwwJSc_-30"><title>Samuel L - Velvet ( Direct Me )</title><description>Samuel L - Velvet ( Direct Me )</description></video></videos></master>
<master id="18512"><main_release>33699</main_release><images><image height="150" type="primary" uri="" uri150="" width="150" /><image height="592" type="secondary" uri="" uri150="" width="600" /><image height="592" type="secondary" uri="" uri150="" width="600" /></images><artists><artist><id>212070</id><name>Samuel L Session</name><anv /><join /><role /><tracks /></artist></artists><genres><genre>Electronic</genre></genres><styles><style>Tribal</style><style>Techno</style></styles><year>2002</year><title>Psyche EP</title><data_quality>Correct</data_quality><videos><video duration="376" embed="true" src="http://www.youtube.com/watch?v=c_AfLqTdncI"><title>Samuel L. Session - Psyche Part 1</title><description>Samuel L. Session - Psyche Part 1</description></video><video duration="419" embed="true" src="http://www.youtube.com/watch?v=0nxvR8Zl9wY"><title>Samuel L. Session - Psyche Part 2</title><description>Samuel L. Session - Psyche Part 2</description></video><video duration="118" embed="true" src="http://www.youtube.com/watch?v=QYf4j0Pd2FU"><title>Samuel L. Session - Arrival</title><description>Samuel L. Session - Arrival</description></video></videos></master>
</masters>
"""
parser.feed(discogs_masters)
root = parser.close()
masters_df_cols = ["MasterID", "MainRelese", "Title", "Year",
"Genre", "ArtistID", "ArtistName"]
masters_rows = []
for elem in root.iter('master'):
if elem is not None:
masterID = str(elem.get('id'))
mainRelease = str(elem.find('main_release').text)
year = str(elem.find('year').text)
title = str(elem.find('title').text)
genre = str(elem.find('./genres/genre').text)
artistID = str(elem.find('./artists/artist/id').text)
artistName = str(elem.find('./artists/artist/name').text)
masters_rows.append([masterID, mainRelease, year, title, genre, artistID, artistName])
masters_df = pd.DataFrame(masters_rows, columns = masters_df_cols)
print(masters_df)
Produces this output
  MasterID MainRelease      Title  Year       Genre ArtistID        ArtistName
0    18500      155102   New Soil  2001  Electronic   212070  Samuel L Session
1    18512       33699  Psyche EP  2002  Electronic   212070  Samuel L Session
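As an aside, collecting the rows in a plain list and constructing the DataFrame once, as above, is also the future-proof pattern: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. Where two frames genuinely need combining, pd.concat is the replacement, e.g.:

import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'a': [2]})
combined = pd.concat([df1, df2], ignore_index=True)  # replaces df1.append(df2)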
I am looking at an XML file similar to the one below:
<pinnacle_line_feed>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
<lastContest>28962804</lastContest>
<lastGame>162995589</lastGame>
<events>
<event>
<event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
<gamenumber>422739932</gamenumber>
<sporttype>Alpine Skiing</sporttype>
<league>DH 145</league>
<IsLive>No</IsLive>
<participants>
<participant>
<participant_name>Kjetil Jansrud (NOR)</participant_name>
<contestantnum>2001</contestantnum>
<rotnum>2001</rotnum>
<visiting_home_draw>Visiting</visiting_home_draw>
</participant>
<participant>
<participant_name>The Field</participant_name>
<contestantnum>2002</contestantnum>
<rotnum>2002</rotnum>
<visiting_home_draw>Home</visiting_home_draw>
</participant>
</participants>
<periods>
<period>
<period_number>0</period_number>
<period_description>Matchups</period_description>
<periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
<period_status>I</period_status>
<period_update>open</period_update>
<spread_maximum>200</spread_maximum>
<moneyline_maximum>100</moneyline_maximum>
<total_maximum>200</total_maximum>
<moneyline>
<moneyline_visiting>116</moneyline_visiting>
<moneyline_home>-136</moneyline_home>
</moneyline>
</period>
</periods>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
</event>
</events>
</pinnacle_line_feed>
I have parsed the file with the code below:
import urllib
import xml.etree.ElementTree as ET

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()

list = []
for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text
I am hoping to get the nodes separated by event, similar to a 2D table (as in a pandas DataFrame). However, the full XML file has multiple "event" children, and some events do not share the same nodes as above. I am struggling quite mightily with taking each event node and simply creating a 2D table where the tag acts as the column name and the text acts as the value.
Up to this point, I have done the above to gauge how I might put that information into a dictionary and subsequently put a number of dictionaries into a list from which I can create a DataFrame using pandas, but that has not worked out, as all attempts have required me to find and replace text to create the dictionaries, and Python has not responded well to that when subsequently creating a DataFrame. I have also used a simple:
for elt in tree.iter():
    list.append("'%s': '%s'" % (elt.tag, elt.text.strip()))
which worked quite well in simply pulling out every single tag and the corresponding text, but I was unable to make anything of that, because my attempts at finding and replacing the text to create dictionaries went nowhere.
Any assistance would be greatly appreciated.
Thank you.
Here's an easy way to get your XML into a pandas DataFrame. This utilizes the awesome requests library (which you can swap for urllib if you'd like), as well as the always helpful xmltodict library available on PyPI. (NOTE: a reverse library is also available, known as dicttoxml.)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unwieldy XML doc look like a native dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
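If you do want those nested columns flattened rather than dropped, pandas.json_normalize (available in pandas >= 1.0) can explode the participant records into their own table. A sketch, assuming every event carries a participants block shaped like the feed above:

import pandas
import requests
import xmltodict

web_request = requests.get('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
events = xmltodict.parse(web_request.text)['pinnacle_line_feed']['events']['event']

# One row per participant, keyed back to its event via gamenumber:
participants = pandas.json_normalize(
    events, record_path=['participants', 'participant'], meta=['gamenumber'])
print(participants.head())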
I quite often have SVGs with structures like this:
<svg:g
transform="translate(-251.5,36.5)"
id="g12578"
style="fill:#ffff00;fill-opacity:1">
<svg:rect
width="12"
height="12"
x="288"
y="35.999958"
id="rect12580"
style="fill:#ffff00;fill-opacity:1;stroke:#000000;stroke-width:1" />
</svg:g>
I would like to apply the translate directly to the coordinates and delete the transform attribute:
<svg:g
id="g12578"
style="fill:#ffff00;fill-opacity:1">
<svg:rect
width="12"
height="12"
x="36.5"
y="69.499958"
id="rect12580"
style="fill:#ffff00;fill-opacity:1;stroke:#000000;stroke-width:1" />
</svg:g>
Do you know a script or program for simplifying SVGs, or a Python snippet for parsing SVGs?
This script works for my special case, but I would like one which always works:
# http://epydoc.sourceforge.net/stdlib/xml.dom.minidom.Element-class.html
from xml.dom.minidom import parse, parseString
import re

f = open('/home/moose/mathe/svg/Solitaire-Board.svg', 'r')
xmldoc = parse(f)
p = re.compile(r'translate\(([-\d.]+),([-\d.]+)\)', re.IGNORECASE)

for node in xmldoc.getElementsByTagName('svg:g'):
    transform_dict = node.attributes["transform"]
    m = p.match(transform_dict.value)
    if m:
        x = float(m.group(1))
        y = float(m.group(2))
        child_rectangles = node.getElementsByTagName('svg:rect')
        for rectangle in child_rectangles:
            x_dict = rectangle.attributes["x"]
            y_dict = rectangle.attributes["y"]
            new_x = float(x_dict.value) + x
            new_y = float(y_dict.value) + y
            rectangle.setAttribute('x', str(new_x))
            rectangle.setAttribute('y', str(new_y))
        node.removeAttribute('transform')

print(xmldoc.toxml())
I think the size of the SVG could be reduced quite heavily, without loss of quality, if the transform attribute could be removed.
If the tool could also reduce coordinate precision, delete unnecessary regions, and group and style wisely, that would be great.
I'd recommend using lxml. It's extremely fast and has a lot of nice features. You can parse your example if you properly declare the svg namespace prefix. You can do that pretty easily:
>>> svg = '<svg xmlns:svg="http://www.w3.org/2000/svg">' + example_svg + '</svg>'
Now you can parse it with lxml.etree (or xml.etree.ElementTree):
>>> doc = etree.fromstring(svg)
If you use lxml you can take advantage of XPath:
>>> ns = {'svg': 'http://www.w3.org/2000/svg'}
>>> doc.xpath('//svg:g/@transform', namespaces=ns)
<<< ['translate(-251.5,36.5)']
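Building on that, a minimal sketch of the same translate-folding with lxml, under the same assumptions as the minidom script above (the transform is always a plain translate, and only the rect x/y coordinates need adjusting):

import re
from lxml import etree

ns = {'svg': 'http://www.w3.org/2000/svg'}
doc = etree.fromstring(svg)  # the wrapped string from above
pattern = re.compile(r'translate\(([-\d.]+),([-\d.]+)\)')

for g in doc.xpath('//svg:g[@transform]', namespaces=ns):
    m = pattern.match(g.get('transform'))
    if m:
        dx, dy = float(m.group(1)), float(m.group(2))
        # Fold the translation into each rect's coordinates ...
        for rect in g.xpath('.//svg:rect', namespaces=ns):
            rect.set('x', str(float(rect.get('x')) + dx))
            rect.set('y', str(float(rect.get('y')) + dy))
        # ... then drop the now-redundant transform attribute
        del g.attrib['transform']

print(etree.tostring(doc, pretty_print=True).decode())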
You might want to have a look at scour:
Scour aims to reduce the size of SVG files as much as possible, while retaining the original rendering of the files. It does not do so flawlessly for all files, therefore users are encouraged not to overwrite their original files.
Optimizations performed by Scour on SVG files include: removing empty elements, removing metadata elements, removing unused id= attribute values, removing unrenderable elements, trimming coordinates to a certain number of significant places, and removing vector editor metadata.
1) It can be parsed and edited with regular expressions; you can easily get the translate values and the x and y coordinates.
2) If you have checked minidom and are sure your only problem is the ':' in the tag names, just replace the ':', edit what you need, and then replace it back.
3) You can use this question: Is there any scripting SVG editor? to learn better ways of parsing this XML format.