I'm scraping a webpage that returns an XML response that I cannot for the life of me extract any data from. Here is my code that just returns the XML response:
import requests
url = 'https://www5.fdic.gov/cra/WebServices/DBService.asmx/callWS'
r = requests.post(url, data={"functionName":"SearchCRA","parmsJSON":"{\"Appl_Number\":\"\",\"Appl_Type\":\"\",\"PSTALP\":\"\",\"SUPRV_FDICDBS\":\"09\",\"BANK_NAME\":\"\"}"})
print(r.content)
For example I would like to extract application numbers, institution names, and application type. I'm relatively new to Python and I just can't get my head around this one.
Thanks in advance.
The XML response actually has a very simple structure, with just a single root element <string>. The text of that element contains JSON, so actually parsing the content is trivial.
Assuming you have the response in r, then:
import json
from xml.etree import ElementTree as ET
root = ET.fromstring(r.content)
data = json.loads(root.text)
for result in data['Result']:
    print(result['Appl_Number'])
    print(result['Instname'])
    print(result['Appl_Type'])
    print('--')
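The same pattern (an XML wrapper whose text is a JSON document) can be sketched without the network call; the sample payload below is made up for illustration and only imitates the shape the endpoint returns:

```python
import json
from xml.etree import ElementTree as ET

# Made-up stand-in for r.content: a single <string> element whose
# text is plain JSON, as the FDIC endpoint returns.
sample = b'<string>{"Result": [{"Appl_Number": "123", "Instname": "First Bank", "Appl_Type": "CRA"}]}</string>'

root = ET.fromstring(sample)   # parse the XML wrapper
data = json.loads(root.text)   # the element's text is the JSON payload
for result in data['Result']:
    print(result['Appl_Number'], result['Instname'], result['Appl_Type'])
```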
I'm trying to retrieve data from an API, however it appears to be returning in XML format.
response = requests.get('https string')
print(response.text)
Output:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><RegisterSearch TotalResultsOnPage="500" TotalResults="15167" TotalPages="31" PageSize="500" CurrentPage="1"><SearchResults><Document DocumentId="1348828088640186163"/><Document DocumentId="1348828088751561003"/></SearchResults></RegisterSearch>
I've tried using ElementTree as suggested by other answers, but receive a file not found error. I think I'm missing something.
import xml.etree.ElementTree as ET
tree = ET.parse(response.text)
root = tree.getroot()
EDIT:
If you want to use ElementTree, you need to parse from a string:
root = ET.fromstring(response.text)
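For example, with the sample response above, fromstring gives you the root element directly, and the document IDs live in attributes rather than element text:

```python
from xml.etree import ElementTree as ET

# The sample response body from the question (XML declaration omitted).
xml_text = ('<RegisterSearch TotalResults="15167" CurrentPage="1">'
            '<SearchResults>'
            '<Document DocumentId="1348828088640186163"/>'
            '<Document DocumentId="1348828088751561003"/>'
            '</SearchResults></RegisterSearch>')

root = ET.fromstring(xml_text)
print(root.tag)                    # the root element's name
print(root.get('TotalResults'))    # attributes are read with .get()
for doc in root.iter('Document'):
    print(doc.get('DocumentId'))   # DocumentId is an attribute, not text
```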
You can parse it with Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'xml')
Then, depending on what you want to extract, you can use find. Note that in this response DocumentId is an attribute of the Document tag, not a tag itself, so find the tag and read the attribute:
soup.find('Document')['DocumentId']
I'm new to working with XML and BeautifulSoup and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy) but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get its text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
You can search for the field tag in lowercase, and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
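A self-contained sketch of the attribute filter, using a hardcoded fragment that imitates the API's `<Field Name="...">` layout (the fragment and its values are made up; lxml's HTML mode lowercases tag and attribute names, but not attribute values, which is why the filter uses "field" and "name" with the original "NCTId" value):

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the full_studies XML layout.
xml_text = '''<FieldList>
  <Field Name="NCTId">NCT04000000</Field>
  <Field Name="OfficialTitle">A Made-Up Trial Title</Field>
</FieldList>'''

soup = BeautifulSoup(xml_text, "lxml")
nctids = [f.text for f in soup.find_all("field", attrs={"name": "NCTId"})]
titles = [f.text for f in soup.find_all("field", attrs={"name": "OfficialTitle"})]
print(nctids)
print(titles)
```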
Goodreads claims I can get XML that begins with a root called <GoodreadsResponse>, whose first child is <book>, the 8th child of which is image_url. Trouble is, I can't even get it to recognize the proper root (it prints root, not GoodreadsResponse) and it fails to recognize that the root has any children at all, though the response code is 200. I prefer to work with JSON and, allegedly, you can convert it to JSON, but I had zero luck with that.
Here's the function I have at the moment. Where am I going wrong?
def main(url, payload):
    """Retrieves image from Goodreads API endpoint returning XML response"""
    res = requests.get(url, payload)
    status = res.status_code
    print(status)
    parser = etree.XMLParser(recover=True)
    tree = etree.fromstring(res.content, parser=parser)
    root = etree.Element("root")
    print(root.text)

if __name__ == '__main__':
    main("https://www.goodreads.com/book/isbn/", '{"isbns": "0441172717", "key": "my_key"}')
The goodreads info is here:
**Get the reviews for a book given an ISBN**
Get an xml or json response that contains embed code for the iframe reviews widget that shows excerpts (first 300 characters) of the most popular reviews of a book for a given ISBN. The reviews are from all known editions of the book.
URL: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT (sample url)
HTTP method: GET
At the moment you are receiving HTML, not XML, with your request.
You need to set the format of the response you want: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT
And you need to pass the query parameters via params, not as a JSON string payload:
Constructing requests with URL Query String in Python
P.S. For the request you are doing you can use JSON.
https://www.goodreads.com/api/index#book.show_by_isbn
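For example, requests will build the query string for you from a params dict; preparing the request shows the resulting URL without hitting the network (the key value here is a placeholder):

```python
import requests

params = {"format": "xml", "key": "my_key"}

# Prepare the request to inspect the URL requests would send.
req = requests.Request("GET",
                       "https://www.goodreads.com/book/isbn/0441172717",
                       params=params).prepare()
print(req.url)
```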
Here's the solution that worked best for me:
import requests
from bs4 import BeautifulSoup
def main():
    key = 'myKey'
    isbn = '0441172717'
    url = 'https://www.goodreads.com/book/isbn/{}?key={}'.format(isbn, key)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml-xml")
    print(soup.find('image_url').text)
The issue was that the XML contents were wrapped in CDATA sections. Using the Beautiful Soup 'lxml-xml' parser rather than 'lxml' retained the content inside the CDATA sections and allowed it to be parsed correctly.
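The difference is easy to see on a tiny made-up document: with the XML parser, the CDATA content is available as the element's text:

```python
from bs4 import BeautifulSoup

# Made-up document imitating an image_url wrapped in CDATA.
doc = '<book><image_url><![CDATA[https://example.com/cover.jpg]]></image_url></book>'

soup = BeautifulSoup(doc, 'lxml-xml')   # XML parser keeps CDATA content as text
print(soup.find('image_url').text)
```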
I've been trying to get information from a site, and recently found out that is stored in childNodes[0].data.
I'm pretty new to python and never tried scripting against websites.
Somebody told me I could make a tmp.xml file and extract the information from there, but as it's only getting the source code (which I think is of no use to me), I don't get any results.
Current code:
response = urllib2.urlopen(get_link)
html = response.read()
with open("tmp.xml", "w") as f:
    f.write(html)
dom = parse("tmp.xml")
name = dom.getElementsByTagName("name[0].firstChild.nodeValue")
I've also tried using 'dom = parse(html)' without better result.
getElementsByTagName() takes an element name, not an expression. It is highly unlikely that the page you are loading contains any <name[0].firstChild.nodeValue> tags.
If you are loading HTML, use an HTML parser instead, like BeautifulSoup. For XML, the ElementTree API is a lot easier to use than the (archaic and very verbose) DOM API.
Neither approach requires that you first save the source to disk, both APIs can parse directly from the response object returned by urllib2.
# HTML
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen(get_link)
soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))
print soup.find('title').text
or
# XML
import urllib2
from xml.etree import ElementTree as ET
response = urllib2.urlopen(get_link)
tree = ET.parse(response)
print tree.find('elementname').text
I'm trying to learn Python. My only experience is AppleScript, and it's not so easy to learn... so far, anyway.
I'm trying to parse an xml weather site and so far I have the data I need but I can't figure out how to get it into a list to process it further. Can anyone help?
from BeautifulSoup import BeautifulSoup
import xml.etree.cElementTree as ET
from xml.etree.cElementTree import parse
import urllib2
url = "http://www.weatheroffice.gc.ca/rss/city/ab-52_e.xml"
response = urllib2.urlopen(url)
local_file = open("\Temp\weather.xml", "w")
local_file.write(response.read())
local_file.close()
invalid_tags = ['b', 'br']
tree = parse("\Temp\weather.xml")
stuff = tree.findall("channel/item/description")
item = stuff[1]
parsewx = BeautifulSoup(stuff[1].text)
for tag in invalid_tags:
    for match in parsewx.findAll(tag):
        match.replaceWithChildren()
print parsewx
Since XML is structured data, BeautifulSoup returns a tree of Tags.
The documentation has extensive information on how to search and navigate in that tree.
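For instance, stripping the formatting tags and collecting the plain text into a list can be sketched with a hardcoded snippet (the markup below is made up, but mimics the b/br tags inside the feed's descriptions; the example uses the modern bs4 API, where unwrap() plays the role of BS3's replaceWithChildren()):

```python
from bs4 import BeautifulSoup  # modern bs4; the question used the older BeautifulSoup 3

# Made-up description markup like the weather feed's <description> contents.
html = '<b>Current Conditions:</b> Sunny, 5&deg;C<br/><b>Wind:</b> NW 20 km/h'
soup = BeautifulSoup(html, 'html.parser')

# unwrap() removes a tag but keeps its children in place.
for tag in soup.find_all(['b', 'br']):
    tag.unwrap()

print(soup.get_text())
```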