Parse XML from String using Python - python

I'm pulling XML from a URL, which is behind a login screen. I can enter my credentials and grab the XML just fine, but I'm having trouble parsing this string. I think it's a list of XML tags. Anyway, the XML looks like this:
<Report Type="SLA Report" SiteName="Get Dataset Metadata" SLA_Name="Get Dataset Metadata" SLA_Description="Get Dataset Metadata" From="2018-11-27 00:00" Thru="2018-11-27 23:59" obj_device="4500" locations="69,31,"><Objective Type="Availability"><Goal>99.93</Goal><Actual>100.00</Actual><Compliant>Yes</Compliant><Errors>0</Errors><Checks>2878</Checks></Objective><Objective Type="Uptime"><Goal/><Actual/><Compliant/><Errors>0</Errors><Checks>0</Checks></Objective><Objective Type="Response Time"><Goal>300.00</Goal><Actual>1.7482</Actual><Compliant>Yes</Compliant><Errors>0</Errors><Checks>2878</Checks></Objective><MonitoringPeriods><Monitor><Exclude>No</Exclude><DayFrom>Sunday</DayFrom><TimeFrom>00:00</TimeFrom><DayThru>Sunday</DayThru><TimeThru>23:59</TimeThru></Monitor><Monitor><Exclude>No</Exclude><DayFrom>Monday</DayFrom><TimeFrom>00:00</TimeFrom><DayThru>Monday</DayThru><TimeThru>23:59</TimeThru></Monitor>
I want to get everything between the tags and everything between the tags. I want to load this into a Data Frame. That's it. How can I do that?
I thought I could use tree and root, like this:
REQUEST_URL = 'https://URL'
response = requests.get(REQUEST_URL, auth=(login, password))
xml_data = response.text.encode('utf-8', 'ignore')
tree = ET.parse(xml_data)
root = tree.getroot()
But that just gives me NameError: name 'tree' is not defined & NameError: name 'root' is not defined. I'm hoping there is a one-liner or two-liner, to get everything tidied up. So far, I haven't come across anything useful.
print(response.text) gives me:
<?xml version='1.0' standalone='yes'?><Report Type='SLA Report'
SiteName='Execute Query'
SLA_Name='Execute Query'
SLA_Description='Execute Query'
From='2018-11-27 00:00'
Thru='2018-11-27 23:59'
obj_device='4500'
locations='69,31,'
>
<Objective Type='Availability'>
<Goal>99.93</Goal>
<Actual>99.93</Actual>
<Compliant>Yes</Compliant>
<Errors>2</Errors>
<Checks>2878</Checks>
</Objective>
<Objective Type='Uptime'>
<Goal></Goal>
<Actual></Actual>
<Compliant></Compliant>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>
<Objective Type='Response Time'>
<Goal>300.00</Goal>
<Actual>3.1164</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2878</Checks>
</Objective>
<MonitoringPeriods>
<Monitor>
<Exclude>No</Exclude><DayFrom>Sunday</DayFrom><TimeFrom>00:00</TimeFrom><DayThru>Sunday</DayThru><TimeThru>23:59</TimeThru>
</Monitor>

Related

How can we parse xml data that contains nodes with xml namespace tags in python?

I am getting XML as a response so I want to parse it. I tried many python libraries but not get my desired results. So if you can help, it will be really appreciative.
The following code returns None:
xmlResponse = ET.fromstring(context.response_document)
a = xmlResponse.findall('.//Body')
print(a)
Sample XML Data:
<S:Envelope
xmlns:S="http://www.w3.org/2003/05/soap-envelope">
<S:Header>
<wsa:Action s:mustUnderstand="1"
xmlns:s="http://www.w3.org/2003/05/soap-envelope"
xmlns:wsa="http://www.w3.org/2005/08/addressing">urn:ihe:iti:2007:RegistryStoredQueryResponse
</wsa:Action>
</S:Header>
<S:Body>
<query:AdhocQueryResponse status="urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success"
xmlns:query="urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0">
<rim:RegistryObjectList
xmlns:rim="u`enter code here`rn:oasis:names:tc:ebxml-regrep:xsd:rim:3.0"/>
</query:AdhocQueryResponse>
</S:Body>
</S:Envelope>
I want to get status from it which is in Body. If you can suggest some changes of some library then please help me. Thanks
Given the following base code:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
Let's build on top of it to get your desired output.
Your initial find for .//Body x-path returns NONE because it doesn't exist in your XML response.
Each tag in your XML has a namespace associated with it. More info on xml namespaces can be found here.
Consider the following line with xmlns value (xml-namespace):
<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope">
The value of namespace S is set to be http://www.w3.org/2003/05/soap-envelope.
Replacing S in {S}Envelope with value set above will give you the resulting tag to find in your XML:
root.find('{http://www.w3.org/2003/05/soap-envelope}Envelope') #top most node
We would need to do the same for <S:Body>.
To get<S:Body> elements and it's child nodes you can do the following:
body_node = root.find('{http://www.w3.org/2003/05/soap-envelope}Body')
for response_child_node in list(body_node):
print(response_child_node.tag) #tag of the child node
print(response_child_node.get('status')) #the status you're looking for
Outputs:
{urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0}AdhocQueryResponse
urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success
Alternatively
You can also directly find all {query}AdhocQueryResponse in your XML using:
response_nodes = root.findall('.//{urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0}AdhocQueryResponse')
for response in response_nodes:
print(response.get('status'))
Outputs:
urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success

Goodreads API Error: List Indices must be integers or slices, not str

So, I'm trying to program a Goodreads Information Fetcher App in Python using Goodreads' API. I'm currently working on the first function of the app which will fetch information from the API, the API returns an XML file.
I parsed the XML file and converted it to a JSON file, then I further converted it to a dictionary. but I still can't seem to extract the information from it, I've looked up other posts here, but nothing works.
main.py
def get_author_books(authorId):
url = "https://www.goodreads.com/author/list/{}?format=xml&key={}".format(authorId, key)
r = requests.get(url)
xml_file = r.content
json_file = json.dumps(xmltodict.parse(xml_file))
data = json.loads(json_file)
print("Book Name: " + str(data[0]["GoodreadsResponse"]["author"]["books"]["book"]))
I expect the output to give me the name of the first book in the dictionary.
Here is a sample XML file provided by Goodreads.
I think you lack understanding of how xml works, or at the very least, how the response you're getting is formatted.
The xml file you linked to has the following format:
<GoodreadsResponse>
<Request>...</Request>
<Author>
<id>...</id>
<name>...</name>
<link>...</link>
<books>
<book> [some stuff about the first book] </book>
<book> [some stuff about the second book] </book>
[More books]
</books>
</Author>
</GoodreadsResponse>
This means that in your data object, data["GoodreadsResponse"]["author"]["books"]["book"] is a collection of all the books in the response (all the elements surrounded by the <book> tags). So:
data["GoodreadsResponse"]["author"]["books"]["book"][0] is the first book.
data["GoodreadsResponse"]["author"]["books"]["book"][1] is the second book, and so on.
Looking back at the xml, each book element has an id, isbn, title, description, among other tags. So you can print the title of the first book by printing:
data["GoodreadsResponse"]["author"]["books"]["book"][0]["title"]
For reference, I'm running the following code using the xml file you linked to, you'd normally fetch this from the API:
import json
import xmltodict
f = open("source.xml", "r") # xml file in OP
xml_file = f.read()
json_file = json.dumps(xmltodict.parse(xml_file))
data = json.loads(json_file)
books = data["GoodreadsResponse"]["author"]["books"]["book"]
print(books[0]["title"]) # The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary

How to iteratively parse and save XML responses that come in as one string?

I am making API calls, that is retrieving IDs, each call represents 10000 IDs and I can only retrieve 10000 at a time. My goal is to save each XML call into a list to count how many people are in the platform automatically.
The problem I running into is two fold.
Each call comes as response object, the response object when I append to a list appends as a single string, so I can not count total number of IDs
To get the next 10000 list of IDs I have to use another API call to get information about each ID, and retrieve a piece of information called website ID and use that to call the next 10000 from the API in #1
I also want to prevent any duplicate IDs in the list but I feel like this is the easiest task.
Here is my code:
1
Call profile IDs (each call brings back 10000)
Append response object 'r' into list 'lst'
import requests
import xml.etree.ElementTree as et
import pandas as pd
from lxml import etree
import time
lst = []
xml = '''
<?xml version="1.0" encoding="utf-8" ?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>*****</ApiKey>
<CallID>009</CallID>
<SaPasscode>*****</SaPasscode>
<Call Method="Sa.People.All.GetIDs">
<Timestamp></Timestamp>
<WebsiteID></WebsiteID>
<Groups>
<Code></Code>
<Name></Name>
</Groups>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post('https://api.yourmembership.com', data=xml, headers=headers)
lst.append(r.text)
API Call result
<YourMembership_Response>
<Sa.People.All.GetIDs>
<People>
<ID>1234567</ID>
</People>
</Sa.People.All.GetIDs>
</YourMembership_Response>
2
I take the last ID from API call in #1 and manually input the value
into the API call below in the 'ID' tags.
xml_2 = '''
<?xml version="1.0" encoding="utf-8" ?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>****</ApiKey>
<CallID>001</CallID>
<SaPasscode>****</SaPasscode>
<Call Method="Sa.People.Profile.Get">
<ID>1234567</ID>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r_2 = requests.post('https://api.yourmembership.com', data=xml_2, headers=headers)
print (r_2.text)
API call result:
<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<Sa.People.Profile.Get>
<ID>1234567</ID>
<WebsiteID>7654321</WebsiteID>
</YourMembership_Response>
I take the website ID and rerun this in API Call from #1 (example) with website ID tag filled, get the next 10000 until no more results come back:
xml = '''
<?xml version="1.0" encoding="utf-8" ?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>*****</ApiKey>
<CallID>009</CallID>
<SaPasscode>*****</SaPasscode>
<Call Method="Sa.People.All.GetIDs">
<Timestamp></Timestamp>
<WebsiteID>7654321</WebsiteID>
<Groups>
<Code></Code>
<Name></Name>
</Groups>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post('https://api.yourmembership.com', data=xml, headers=headers)
lst.append(r.text)
Hope my question makes sense, and thank you in advance.
I once started building something to crawl over an API which sounds similar to what you are aiming to achieve. One difference in my case though was the response came as json instead of xml but shouldn't be a big deal.
Can't see in your question evidence that you are really using the power of the xml parser. Have a look at the docs. For example you can easily get the id number out of those items you are appending to the list like this:
xml_sample = """
<YourMembership_Response>
<Sa.People.All.GetIDs>
<People>
<ID>1234567</ID>
</People>
</Sa.People.All.GetIDs>
</YourMembership_Response>
"""
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_sample)
print (root[0][0][0].text)
>>> '1234567'
Experiment, apply it in a loop to each element in the list or maybe you will be lucky and the whole response object will parse without needing to look through things.
You should now be able to programmatically instead of manually enter that number in the next bit of code.
Your XML for the next section for the website ID seems to have an invalid line in it <Sa.People.Profile.Get> Once I take it out it can be parsed:
xml_sample2 = """
<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<ID>1234567</ID>
<WebsiteID>7654321</WebsiteID>
</YourMembership_Response>
"""
root2 = ET.fromstring(xml_sample2)
print (root2[3].text)
>>> '7654321'
So not sure if there is always an invalid line there or if you forgot to paste something, maybe remove that line with regex or something before applying xtree.
Would recommend you try sqlite to help you with the interactions between 1 and 2. I think it's good up to half a million rows otherwise you would need to hook to a proper database. It saves a file in your directory and has a bit less setup time and fuss as with a proper database. Perhaps, test the concept with sqlite and if necessary migrate to postgresql.
You can store whichever useful elements from this parsed xml you like user ID, website ID into a table and pull it out again to use in a different section. Is also not hard to go back and forth from sqlite to pandas dataframes if you need it with pandas.read_sql and pandas.DataFrame.to_sql Hope this helps..

How to parse an XML file and get its data Python

I have a the below web service : 'https://news.google.com/news/rss/?ned=us&hl=en'
I need to parse it and get the title and date values of each item in the XML file.
I have tried to get the data to an xml file and i am trying to parse it but i see all blank values:
import requests
import xml.etree.ElementTree as ET
response = requests.get('https://news.google.com/news/rss/?ned=us&hl=en')
with open('text.xml','w') as xmlfile:
xmlfile.write(response.text)
with open('text.xml','rt') as f:
tree = ET.parse(f)
for node in tree.iter():
print (node.tag, node.attrib)
I am not sure where i am going wrong . I have to somehow extract the values of title and published date of each and every item in the XML.
Thanks for any answers in advance.
#Ilja Everilä is right, you should use feedparser.
For sure there is no need to write any xml file... except if you want to archive it.
I didn't really get what output you expected but something like this works (python3)
import feedparser
url = 'https://news.google.com/news/rss/?ned=us&hl=en'
d = feedparser.parse(url)
#print the feed title
print(d['feed']['title'])
#print tuples (title, tag)
print([(d['entries'][i]['title'], d['entries'][i]['tags'][0]['term']) for i in range(len(d['entries']))] )
to explicitly print it as utf8 strings use:
print([(d['entries'][i]['title'].encode('utf8'), d['entries'][i]['tags'][0]['term'].encode('utf8')) for i in range(len(d['entries']))])
Maybe if you show your expected output, we could help you to get the right content from the parser.

Parsing Weather XML with Python

I'm a beginner but with a lot of effort I'm trying to parse some data about the weather from an .xml file called "weather.xml" which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Weather>
<locality name="Rome" alt="21">
<situation temperature="18°C" temperatureF="64,4°F" humidity="77%" pression="1016 mb" wind="5 SSW km/h" windKN="2,9 SSW kn">
<description>clear sky</description>
<lastUpdate>17:45</lastUpdate>
/>
</situation>
<sun sunrise="6:57" sunset="18:36" />
</locality>
I parsed some data from this XML and this is how my Python code looks now:
#!/usr/bin/python
from xml.dom import minidom
xmldoc = minidom.parse('weather.xml')
entry_situation = xmldoc.getElementsByTagName('situation')
entry_locality = xmldoc.getElementsByTagName('locality')
print entry_locality[0].attributes['name'].value
print "Temperature: "+entry_situation[0].attributes['temperature'].value
print "Humidity: "+entry_situation[0].attributes['humidity'].value
print "Pression: "+entry_situation[0].attributes['pression'].value
It's working fine but if I try to parse data from "description" or "lastUpdate" node with the same method, I get an error, so this way must be wrong for those nodes which actually I can see they are differents.
I'm also trying to write output into a log file with no success, the most I get is an empty file.
Thank you for your time reading this.
It is because "description" and "lastUpdate" are not attributes but child nodes of the "situation" node.
Try:
d = entry_situation[0].getElementsByTagName("description")[0]
print "Description: %s" % d.firstChild.nodeValue
You should use the same method to access the "situation" node from its parent "locality".
By the way you should take a look at the lxml module, especially the objectify API as yegorich said. It is easier to use.

Categories