I have a requests call that returns loosely formatted XML data like this:
<info>
<stats vol="545080705" orders="718021755"/>
<symbols timestamp="2022-09-08 19:56:37" count="11394">
<symbol name="TQQQ" vol="8700394" last="28.23" matched="8382339" />
<symbol name="SPY" vol="8571092" last="401.00" matched="8209174" />
<symbol name="SQQQ" vol="7091770" last="44.39" matched="6734334" />
<symbol name="AVCT" vol="6493626" last="0.17" matched="6469576" />
<symbol name="UVXY" vol="6158364" last="9.42" matched="6142800" />
I'm having difficulty figuring out how to convert this into either a dictionary, or data frame, or some other object which I can loop over and extract the NAME, VOL, LAST & MATCHED items.
You can parse this XML as follows, though the best approach depends on what you need from it:
from io import StringIO
from bs4 import BeautifulSoup as bs
import pandas as pd
response = """<info>
<stats vol="545080705" orders="718021755"/>
<symbols timestamp="2022-09-08 19:56:37" count="11394">
<symbol name="TQQQ" vol="8700394" last="28.23" matched="8382339" />
<symbol name="SPY" vol="8571092" last="401.00" matched="8209174" />
<symbol name="SQQQ" vol="7091770" last="44.39" matched="6734334" />
<symbol name="AVCT" vol="6493626" last="0.17" matched="6469576" />
<symbol name="UVXY" vol="6158364" last="9.42" matched="6142800" />
</symbols></info>"""
# Parse with BeautifulSoup's XML parser, then hand the serialized document to pandas
content = bs(response, "lxml-xml")
# Recent pandas versions expect a file-like object rather than a literal XML string
df = pd.read_xml(StringIO(str(content)), xpath="//symbol")
Output:
last matched name vol
0 28.23 8382339 TQQQ 8700394
1 401.00 8209174 SPY 8571092
2 44.39 6734334 SQQQ 7091770
3 0.17 6469576 AVCT 6493626
4 9.42 6142800 UVXY 6158364
There are many ways to parse and convert XML. Here is one way, using BeautifulSoup:
doc = '''
<info>
<stats vol="545080705" orders="718021755"/>
<symbols timestamp="2022-09-08 19:56:37" count="11394">
<symbol name="TQQQ" vol="8700394" last="28.23" matched="8382339" />
<symbol name="SPY" vol="8571092" last="401.00" matched="8209174" />
<symbol name="SQQQ" vol="7091770" last="44.39" matched="6734334" />
<symbol name="AVCT" vol="6493626" last="0.17" matched="6469576" />
<symbol name="UVXY" vol="6158364" last="9.42" matched="6142800" />
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'lxml-xml')
for sym in soup.find_all('symbol'):
    print("-"*32)
    print(sym.get("name"))
    print(sym.get("vol"))
    print(sym.get("last"))
    print(sym.get("matched"))
If you want more ways you can check this link
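If neither pandas nor BeautifulSoup is installed, the standard library can produce a list of dictionaries to loop over. This is a minimal sketch, assuming the response body is well-formed XML (the closing tags are added to the shortened sample here); symbols_to_dicts is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Shortened sample from the question, with the closing tags added so it parses.
response_text = """<info>
<stats vol="545080705" orders="718021755"/>
<symbols timestamp="2022-09-08 19:56:37" count="11394">
<symbol name="TQQQ" vol="8700394" last="28.23" matched="8382339" />
<symbol name="SPY" vol="8571092" last="401.00" matched="8209174" />
</symbols></info>"""


def symbols_to_dicts(xml_text):
    """Return one dict of attributes per <symbol> element."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth does not matter.
    return [dict(sym.attrib) for sym in root.iter("symbol")]


rows = symbols_to_dicts(response_text)
for row in rows:
    # Attribute values are strings; convert them where numbers are needed.
    print(row["name"], int(row["vol"]), float(row["last"]), int(row["matched"]))
```

With a real requests call, the argument would presumably be the response's text attribute instead of the literal string.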
I am very new to python.
I have an xml file with data like this:
<SCHEDULE type="ILN_BGY_G162_SL1D2T4T4SL2D2T4T4" inter_league="0" balanced_games="1" games_per_team="162" preferred_start_day="2">
<GAMES>
<GAME day="-34" time="1905" away="14" home="9" type="2" />
<GAME day="-34" time="1905" away="16" home="11" type="2" />
<GAME day="78" time="1905" away="12" home="15" type="2" />
<GAME day="79" time="1905" away="6" home="8" type="2" />
</GAMES>
</SCHEDULE>
I am trying to remove all elements of the xml file that have a day value NOT in the list day_range where day_range = [78,79,80]. With the sample data above, I would remove the elements where day="-34" and retain those where day="78" and day="79".
I have followed the answers in the following questions very closely and have gotten various errors and unwanted results that I will explain below. Accepted solutions I have tried:
XML Filtering with Python
How do I Filter Values From XML in Python
When I try the following code
import xml.etree.ElementTree as ET
from pathlib import Path
day_range = [78,79,80]
schedule = ET.parse(path)
root = schedule.getroot()
for element in root:
    for day in element:
        if element['day'] in day_range:
            root.remove(element)
I get a TypeError on the line if element['day'] in day_range: (element indices must be integers).
Changing it slightly as below, I get a ValueError on root.remove(element): list.remove(x): x not in list
for element in root:
    for day in element.findall('GAME'):
        if element[0] in day_range:
            root.remove(element)
schedule.write('test.xml')
I would like the output xml to look like this:
<SCHEDULE type="ILN_BGY_G162_SL1D2T4T4SL2D2T4T4" inter_league="0" balanced_games="1" games_per_team="162" preferred_start_day="2">
<GAMES>
<GAME day="78" time="1905" away="12" home="15" type="2" />
<GAME day="79" time="1905" away="6" home="8" type="2" />
</GAMES>
</SCHEDULE>
I have been working on this all day and I believe that I am missing an important concept but can't quite find it.
Below:
import xml.etree.ElementTree as ET
xml = '''<SCHEDULE type="ILN_BGY_G162_SL1D2T4T4SL2D2T4T4" inter_league="0" balanced_games="1" games_per_team="162" preferred_start_day="2">
<GAMES>
<GAME day="-34" time="1905" away="14" home="9" type="2" />
<GAME day="-34" time="1905" away="16" home="11" type="2" />
<GAME day="78" time="1905" away="12" home="15" type="2" />
<GAME day="79" time="1905" away="6" home="8" type="2" />
</GAMES>
</SCHEDULE>'''
day_range = {78,79,80}
root = ET.fromstring(xml)
games = root.find('.//GAMES')
for g in games.findall('./GAME'):
    if int(g.attrib['day']) not in day_range:
        games.remove(g)
ET.dump(root)
output
<SCHEDULE balanced_games="1" games_per_team="162" inter_league="0" preferred_start_day="2" type="ILN_BGY_G162_SL1D2T4T4SL2D2T4T4">
<GAMES>
<GAME away="12" day="78" home="15" time="1905" type="2" />
<GAME away="6" day="79" home="8" time="1905" type="2" />
</GAMES>
</SCHEDULE>
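Both errors in the question trace back to two ElementTree details: attributes are read with .get() or .attrib rather than by indexing the element with a string, and remove() has to be called on the element's direct parent (GAMES here, not the root). The following is a minimal sketch of the same idea wrapped in a function; filter_games is a hypothetical helper name and the sample XML is shortened.

```python
import xml.etree.ElementTree as ET


def filter_games(xml_text, day_range):
    """Drop every <GAME> whose day attribute is not in day_range."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so the tree is not walked while it changes.
    for parent in list(root.iter()):
        # findall() returns a list, so removing inside this loop is safe.
        for game in parent.findall("GAME"):
            # Attribute values are strings, so convert before comparing.
            if int(game.get("day")) not in day_range:
                parent.remove(game)
    return root


sample = """<SCHEDULE type="demo">
<GAMES>
<GAME day="-34" time="1905" away="14" home="9" type="2" />
<GAME day="78" time="1905" away="12" home="15" type="2" />
<GAME day="79" time="1905" away="6" home="8" type="2" />
</GAMES>
</SCHEDULE>"""

filtered = filter_games(sample, {78, 79, 80})
remaining_days = [g.get("day") for g in filtered.iter("GAME")]
print(remaining_days)
```

To work on a file instead, ET.parse(path) followed by tree.write('test.xml') would replace the fromstring call.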
I'm looking for a solution to convert XML to JSON and use the JSON as the payload for a POST request.
I'm aiming for the following logic:
1. Search for all root.listings.schedules.s elements and parse @s, @d, @p and @c.
2. In root.listings.programs, parse @t where p.id = @p (from schedules) -> "Prime Discussion"
3. In root.channels, parse @c where c.id = @c (from schedules) -> "mychannel"
4. Once I have all the info parsed, build a JSON containing all the params and send it using a POST request.
I'm also looking for a solution that sends one POST request per root.listings.schedules.s element.
{
"time":"{@s}",
"durartion":"{@d}",
"programID":"{@p}",
"title":"{@t}",
"channelName":"{@c}",
}
<?xml version="1.0" encoding="UTF-8"?>
<root>
<listings>
<schedules>
<s s="2019-09-26T00:00:00" d="1800" p="1569735" c="100007">
<f id="3" />
</s>
</schedules>
<programs>
<p id="1569735" t="Prime Discussion" d="Discussion on Current Affairs." rd="Discussion on Current Affairs." l="en">
<f id="2" />
<f id="21" />
<k id="6" v="20160614" />
<k id="1" v="2450548" />
<k id="18" v="12983658" />
<k id="21" v="12983658" />
<k id="10" v="Program" />
<k id="19" v="SH024505480000" />
<k id="20" v="http://tmsimg.com/assets/p12983658_b_h5_aa.jpg" />
<c id="607" />
<r o="1" r="1" n="100" />
<r o="2" r="1" n="1000" />
<r o="3" r="1" n="10000" />
</p>
</programs>
</listings>
<channels>
<c id="100007" c="mychannel" l="Prime Asia TV SD" d="Prime Asia TV SD" t="Digital" iso639="hi" />
<c id="10035" c="AETV" l="A&amp;E Canada" d="A&amp;E Canada" t="Digital" u="WWW.AETV.COM" iso639="en" />
</channels>
</root>
Currently, I use this code to parse the schedules.s elements (part 1) and need some help with parts 2, 3 and 4.
import xml.etree.ElementTree as ET
tree = ET.parse('ChannelsProgramsTest.xml')
root = tree.getroot()
for sched in root[0][0].findall('s'):
    new = sched.get('s'),sched.get('p'),sched.get('d'),sched.get('c')
    print(new)
Below (I think it is the core solution you were looking for)
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<listings>
<schedules>
<s s="2019-09-26T00:00:00" d="1800" p="1569735" c="100007">
<f id="3" />
</s>
</schedules>
<programs>
<p id="1569735" t="Prime Discussion" d="Discussion on Current Affairs." rd="Discussion on Current Affairs." l="en">
<f id="2" />
<f id="21" />
<k id="6" v="20160614" />
<k id="1" v="2450548" />
<k id="18" v="12983658" />
<k id="21" v="12983658" />
<k id="10" v="Program" />
<k id="19" v="SH024505480000" />
<k id="20" v="http://tmsimg.com/assets/p12983658_b_h5_aa.jpg" />
<c id="607" />
<r o="1" r="1" n="100" />
<r o="2" r="1" n="1000" />
<r o="3" r="1" n="10000" />
</p>
</programs>
</listings>
<channels>
<c id="100007" c="mychannel" l="Prime Asia TV SD" d="Prime Asia TV SD" t="Digital" iso639="hi" />
<c id="10035" c="AETV" l="A&amp;E Canada" d="A&amp;E Canada" t="Digital" u="WWW.AETV.COM" iso639="en" />
</channels>
</root>'''
tree = ET.fromstring(xml)
listings = tree.findall('.//listings')
for entry in listings:
    # This is the first requirement: find s,d,p,c under 's' element
    s = entry.find('./schedules/s')
    print(s.attrib)
    # now that we have s,d,p,c we can move on and look for the program with a specific id
    program = entry.find("./programs/p[@id='{}']".format(s.attrib['p']))
    print(program.attrib['t'])
    # find the channel
    channel = tree.find(".//channels/c[@id='{}']".format(s.attrib['c']))
    print(channel.attrib['c'])
output
{'s': '2019-09-26T00:00:00', 'd': '1800', 'p': '1569735', 'c': '100007'}
Prime Discussion
mychannel
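The remaining steps (one JSON payload per s element, then one POST per payload) could be sketched along these lines. This is only an illustration under assumptions: build_payloads is a hypothetical helper, the XML is shortened, and the key is spelled "duration" here although the question's template spells it "durartion", so use whatever the receiving API expects. The POST itself is left as a comment because the endpoint is unknown.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """<root>
<listings>
<schedules>
<s s="2019-09-26T00:00:00" d="1800" p="1569735" c="100007" />
</schedules>
<programs>
<p id="1569735" t="Prime Discussion" />
</programs>
</listings>
<channels>
<c id="100007" c="mychannel" />
</channels>
</root>"""


def build_payloads(xml_text):
    """Return one JSON-ready dict per <s> element."""
    root = ET.fromstring(xml_text)
    # Index programs and channels by id once instead of searching per schedule.
    titles = {p.get("id"): p.get("t") for p in root.iterfind("./listings/programs/p")}
    channels = {c.get("id"): c.get("c") for c in root.iterfind("./channels/c")}
    payloads = []
    for s in root.iterfind("./listings/schedules/s"):
        payloads.append({
            "time": s.get("s"),
            "duration": s.get("d"),
            "programID": s.get("p"),
            "title": titles.get(s.get("p")),  # None if the program id is unknown
            "channelName": channels.get(s.get("c")),
        })
    return payloads


payloads = build_payloads(xml_text)
for payload in payloads:
    print(json.dumps(payload))
    # Sending is left out of this sketch. With the requests library it would
    # be roughly requests.post(url, json=payload), where url is a placeholder
    # for the real endpoint.
```

Building the two lookup dictionaries up front keeps the loop simple when there are many schedules, at the cost of holding all program titles in memory.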
I'm still fairly new to Stack Overflow in terms of usage, though not in years. I think this is close to a duplicate question, but I don't yet know how to flag it as one.
A very good explanation of converting XML to JSON in Python is given in the following post, written by the author of the library it suggests:
Converting XML to JSON using Python?
The data source may contain unexpected characters that you will need to handle yourself if you don't use a library,
e.g. newlines, Unicode characters, or other stray characters. Libraries often handle these for you already, so you don't have to reinvent the wheel.
I have recently got a RaspberryPi and have started to learn Python. To begin with I want to parse an XML file and I am doing this via the untangle library.
My XML looks like:
<?xml version="1.0" encoding="utf-8"?>
<weatherdata>
<location>
<name>Katherine</name>
<type>Administrative division</type>
<country>Australia</country>
<timezone id="Australia/Darwin" utcoffsetMinutes="570" />
<location altitude="176" latitude="-14.65012" longitude="132.17414" geobase="geonames" geobaseid="7839404" />
</location>
<sun rise="2019-02-04T06:33:52" set="2019-02-04T19:16:15" />
<forecast>
<tabular>
<time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
<!-- Valid from 2019-02-04T06:30:00 to 2019-02-04T12:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="1.8" />
<!-- Valid at 2019-02-04T06:30:00 -->
<windDirection deg="314.8" code="NW" name="Northwest" />
<windSpeed mps="3.3" name="Light breeze" />
<temperature unit="celsius" value="26" />
<pressure unit="hPa" value="1005.0" />
</time>
<time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
<!-- Valid from 2019-02-04T12:30:00 to 2019-02-04T18:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="2.3" />
<!-- Valid at 2019-02-04T12:30:00 -->
<windDirection deg="253.3" code="WSW" name="West-southwest" />
<windSpeed mps="3.0" name="Light breeze" />
<temperature unit="celsius" value="29" />
<pressure unit="hPa" value="1005.0" />
</time>
</tabular>
</forecast>
</weatherdata>
From this I would like to be able to print out the from and to attributes of the <time> element as well as the value attribute in its child node <temperature>
I can correctly print out the temperature values if I run the Python script below:
for forecast in data.weatherdata.forecast.tabular.time:
    print (forecast.temperature['value'])
but if I run
for forecast in data.weatherdata.forecast.tabular:
    print ("time is " + forecast.time['from'] + "and temperature is " + forecast.time.temperature['value'])
I get an error:
print (forecast.time['from'] + forecast.time.temperature['value'])
TypeError: list indices must be integers, not str
Can anyone advise how I can correctly access these values?
forecast.time should be a list, as it does have multiple values, one for each <time> node.
Did you expect forecast.time['from'] to automatically aggregate that data?
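In other words, the loop has to run over the individual time elements, as the first (working) loop in the question already does, and read from, to and the temperature from each item. The sketch below shows that pattern with the standard library's ElementTree rather than untangle, on a shortened copy of the XML; in untangle terms it would presumably mean looping over tabular.time and reading forecast['from'] on each item.

```python
import xml.etree.ElementTree as ET

weather_xml = """<weatherdata>
<forecast>
<tabular>
<time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
<temperature unit="celsius" value="26" />
</time>
<time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
<temperature unit="celsius" value="29" />
</time>
</tabular>
</forecast>
</weatherdata>"""

root = ET.fromstring(weather_xml)
readings = []
# Loop over each <time> element, then read its own <temperature> child.
for time_el in root.iterfind("./forecast/tabular/time"):
    temperature = time_el.find("temperature")
    readings.append((time_el.get("from"), time_el.get("to"), temperature.get("value")))

for start, end, value in readings:
    print("from " + start + " to " + end + " temperature is " + value)
```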
I have an xml file as below:
<?xml version="1.0" encoding="utf-8"?>
<EDoc CID="1000101" Cname="somename" IName="iname" CSource="e1" Version="1.0">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" nodeAExtID="9000011" />
<nodeA nodeAID="1000012" nodeAName="node2A" nodeAExtID="9000012" />
<nodeA nodeAID="1000013" nodeAName="node3A" nodeAExtID="9000013" />
<nodeA nodeAID="1000014" nodeAName="node4A" nodeAExtID="9000014" />
<nodeA nodeAID="1000015" nodeAName="node5A" nodeAExtID="9000015" />
<nodeA nodeAID="1000016" nodeAName="node6A" nodeAExtID="9000016" />
<nodeA nodeAID="1000017" nodeAName="node7A" nodeAExtID="9000017" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" nodeAExtID="9000021" />
<nodeA nodeAID="1000022" nodeAName="node2B" nodeAExtID="9000022" />
<nodeA nodeAID="1000023" nodeAName="node3B" nodeAExtID="9000023" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>
I need to search for the Node value RIGName and if match is found print out all the values of nodeAName
Example:
Searching for RIGName = "RgName2" should print all the values as node1B, node2B, node3B
As of now I am only able to get the first part as below:
import xml.etree.ElementTree as eT
import re
xmlfilePath = "Path of xml file"
tree = eT.parse(xmlfilePath)
root = tree.getroot()
for elem in root.iter("RIGName"):
    # print(elem.tag, elem.attrib)
    if re.findall(searchtxt, elem.attrib['RIGName'], re.IGNORECASE):
        print(elem.attrib)
        count += 1
How can I get only the immediate child node values?
Switching from xml.etree to lxml would give you a way to do it in a single go because of a much better XPath query language support:
In [1]: from lxml import etree as ET
In [2]: tree = ET.parse('input.xml')
In [3]: root = tree.getroot()
In [4]: root.xpath('//RIG[@RIGName = "RgName2"]/ListID/nodeA/@nodeAName')
Out[4]: ['node1B', 'node2B', 'node3B']
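If installing lxml is not an option, xml.etree.ElementTree supports enough of XPath for this case: it can filter on an attribute value, but it cannot select the attribute itself, so the last step is done in Python. A minimal sketch on a shortened copy of the XML; node_names is a hypothetical helper, and the match is exact and case-sensitive, unlike the re.IGNORECASE search in the question.

```python
import xml.etree.ElementTree as ET

edoc_xml = """<EDoc CID="1000101">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" nodeAExtID="9000011" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" nodeAExtID="9000021" />
<nodeA nodeAID="1000022" nodeAName="node2B" nodeAExtID="9000022" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>"""


def node_names(xml_text, rig_name):
    """Return the nodeAName values under the RIG with the given RIGName."""
    root = ET.fromstring(xml_text)
    # Select the nodeA elements first, then read the attribute from each one.
    path = ".//RIG[@RIGName='{}']/ListID/nodeA".format(rig_name)
    return [node.get("nodeAName") for node in root.findall(path)]


names = node_names(edoc_xml, "RgName2")
print(names)
```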
I have the following XML file ('registerreads_EE.xml'):
<?xml version="1.0" encoding="us-ascii" standalone="yes"?>
<ReadingDocument xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ReadingStatusRefTable>
<ReadingStatusRef Ref="1">
<UnencodedStatus SourceValidation="SIMPLE">
<StatusCodes>
<Signal>XX</Signal>
</StatusCodes>
</UnencodedStatus>
</ReadingStatusRef>
</ReadingStatusRefTable>
<Header>
<IEE_System Id="XXXXXXXXXXXXXXX" />
<Creation_Datetime Datetime="2015-10-22T09:05:32Z" />
<Timezone Id="UTC" />
<Path FilePath="X:\XXXXXXXXXXXX.xml" />
<Export_Template Id="XXXXX" />
<CorrelationID Id="" />
</Header>
<ImportExportParameters ResubmitFile="false" CreateGroup="true">
<DataFormat TimestampType="XXXXXX" Type="XXXX" />
</ImportExportParameters>
<Channels>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825603:301" />
<Readings>
<Reading Value="3577.0" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3601.3" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825603" RequestSource="Scheduled" />
</Channel>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825604:301" />
<Readings>
<Reading Value="3462.5" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3501.5" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825604" RequestSource="Scheduled" />
</Channel>
</Channels>
</ReadingDocument>
I want to parse the XML of the channel data into a csv file.
Here is what I have written in Python 2.7.10:
import xml.etree.ElementTree as ET
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
for channel in tree.iter('Channel'):
    for exportrequest in channel.iter('ExportRequest'):
        entityid = exportrequest.attrib.get('EntityID')
    for meterread in channel.iter('Reading'):
        read = meterread.attrib.get('Value')
        date = meterread.attrib.get('ReadingTime')
        print read[:-2],",",date[:10],",",entityid
tree.write(open('registerreads_EE.csv','w'))
Here is the screen output when the above is run:
3577 , 2015-10-21 , 73825603
3601 , 2015-10-22 , 73825603
3462 , 2015-10-21 , 73825604
3501 , 2015-10-22 , 73825604
The 'registerreads_EE.csv' output file is like the original XML file, minus the first line.
I would like the printed output above outputted to a csv file with headers of read, date, entityid.
I am having difficulty with this. This is my first python program. Any help is appreciated.
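Use the csv module, not tree.write, to write the rows to the CSV file, but still use ElementTree to parse and extract the content from the XML file. (The open(..., newline='') call below is Python 3 syntax; on Python 2.7, open the file in 'wb' mode instead.)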
import xml.etree.ElementTree as ET
import csv
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
with open('registerreads_EE.csv', 'w', newline='') as r:
    writer = csv.writer(r)
    writer.writerow(['read', 'date', 'entityid']) # WRITING HEADERS
    for channel in tree.iter('Channel'):
        for exportrequest in channel.iter('ExportRequest'):
            entityid = exportrequest.attrib.get('EntityID')
        for meterread in channel.iter('Reading'):
            read = meterread.attrib.get('Value')
            date = meterread.attrib.get('ReadingTime')
            # WRITE EACH ROW ITERATIVELY
            writer.writerow([read[:-2],date[:10],entityid])
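One fragile spot in the code above is read[:-2], which only strips the fraction when the value has exactly one decimal digit. A variant that separates extraction from writing and converts the value numerically could look like the sketch below. It assumes Python 3; reading_rows is a hypothetical helper, the XML is shortened, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import xml.etree.ElementTree as ET

readings_xml = """<ReadingDocument>
<Channels>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<Readings>
<Reading Value="3577.0" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3601.3" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825603" RequestSource="Scheduled" />
</Channel>
</Channels>
</ReadingDocument>"""


def reading_rows(xml_text):
    """Yield one dict per <Reading>, tagged with its channel's EntityID."""
    root = ET.fromstring(xml_text)
    for channel in root.iter("Channel"):
        entity_id = channel.find("ExportRequest").get("EntityID")
        for reading in channel.iter("Reading"):
            yield {
                # int(float(...)) drops the fraction however many digits it has.
                "read": int(float(reading.get("Value"))),
                "date": reading.get("ReadingTime")[:10],
                "entityid": entity_id,
            }


# The buffer stands in for open('registerreads_EE.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["read", "date", "entityid"])
writer.writeheader()
writer.writerows(reading_rows(readings_xml))
csv_text = buffer.getvalue()
print(csv_text)
```

DictWriter ties each value to its column name, so reordering the columns later only means changing the fieldnames list.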