Failed attempts to read XML from Goodreads API using requests and lxml - Python

Goodreads claims I can get XML that begins with a root called <GoodreadsResponse>, whose first child is <book>, the eighth child of which is <image_url>. The trouble is, I can't even get it to recognize the proper root: it prints root rather than GoodreadsResponse and fails to recognize that the root has any children at all, even though the response code is 200. I'd prefer to work with JSON and, allegedly, you can convert the response to JSON, but I've had zero luck with that.
Here's the function I have at the moment. Where am I going wrong?
import requests
from lxml import etree

def main(url, payload):
    """Retrieves image from Goodreads API endpoint returning XML response"""
    res = requests.get(url, payload)
    status = res.status_code
    print(status)
    parser = etree.XMLParser(recover=True)
    tree = etree.fromstring(res.content, parser=parser)
    root = etree.Element("root")
    print(root.text)

if __name__ == '__main__':
    main("https://www.goodreads.com/book/isbn/", '{"isbns": "0441172717", "key": "my_key"}')
The Goodreads info is here:
**Get the reviews for a book given an ISBN**
Get an xml or json response that contains embed code for the iframe reviews widget that shows excerpts (first 300 characters) of the most popular reviews of a book for a given ISBN. The reviews are from all known editions of the book.
URL: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT (sample url)
HTTP method: GET

At the moment you are receiving HTML, not XML, with your request.
You need to set the format of the response you want: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT
And you need to pass the query string via params, not payload:
Constructing requests with URL Query String in Python
P.S. For the request you are making, you can also use JSON:
https://www.goodreads.com/api/index#book.show_by_isbn
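A minimal sketch of the corrected request (the key value is a placeholder; the format parameter follows the URL pattern above):
import requests

url = 'https://www.goodreads.com/book/isbn/0441172717'
payload = {'format': 'xml', 'key': 'my_key'}
# params= encodes the dict into the query string:
# https://www.goodreads.com/book/isbn/0441172717?format=xml&key=my_key
res = requests.get(url, params=payload)
print(res.status_code)
print(res.content[:300])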

Here's the solution that worked best for me:
import requests
from bs4 import BeautifulSoup

def main():
    key = 'myKey'
    isbn = '0441172717'
    url = 'https://www.goodreads.com/book/isbn/{}?key={}'.format(isbn, key)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml-xml")
    print(soup.find('image_url').text)

if __name__ == '__main__':
    main()
The issue was that the XML contents were wrapped in CDATA sections. Using the Beautiful Soup 'lxml-xml' parser rather than 'lxml' retained the content inside the CDATA sections and allowed it to be parsed correctly.
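To illustrate, here's a small sketch with a made-up snippet shaped like the Goodreads response (the real document is much larger, but the CDATA handling is the same):
from bs4 import BeautifulSoup

xml = b'''<GoodreadsResponse>
  <book>
    <image_url><![CDATA[https://example.com/cover.jpg]]></image_url>
  </book>
</GoodreadsResponse>'''

# 'lxml-xml' parses the document as XML, so the CDATA payload is exposed as element text
soup = BeautifulSoup(xml, 'lxml-xml')
print(soup.find('image_url').text)  # https://example.com/cover.jpg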

Related

Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags

I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get its text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the Beautiful Soup documentation.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
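To pull the values out, iterate over the matches as in the answer above; for example, pairing each ID with its title (a sketch that assumes the two result lists line up one study per entry):
for nct_id, official_title in zip(m1_nctid, m1_officialtitle):
    print(nct_id.text, '-', official_title.text)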

Why does my request return an empty list?

When I use XPath to crawl and parse the content of Tencent commonweal, all the returned lists are empty.
My code is below (the headers information is hidden), and the target URL is https://gongyi.qq.com/succor/project_list.htm#s_tid=75. I would appreciate it if someone could help me solve this problem.
import requests
import os
from lxml import etree

if __name__ == '__main__':
    url = 'https://gongyi.qq.com/succor/project_list.htm#s_tid=75'
    headers = {
        'User-Agent': XXX}
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="pro_main"]//li')
    for li in li_list:
        title = li.xpath('./div[2]/div/a/text()')[0]
        print(title)
What is actually happening here is that you can only access the first ul inside the pro_main div, because those li items and their parent are populated by JavaScript. The list won't exist yet by the time you fetch the HTML with requests.get(), so your XPath result is empty!
The good news is that the JS script in question populates the data through an API, so just as the website does, you can retrieve those titles from the actual API and print them.
import requests, json

if __name__ == '__main__':
    url = 'https://ssl.gongyi.qq.com/cgi-bin/WXSearchCGI?ptype=stat&s_status=1&s_tid=75'
    resp = requests.get(url).text
    resp = resp[1:-1]  # The result is wrapped in (), so we get rid of those
    jj = json.loads(resp)
    for i in jj["plist"]:
        title = i["title"]
        print(title)
You can explore the API by printing jj to see if there's more info that you may need later!
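A quick way to do that (a sketch; ensure_ascii=False keeps any non-ASCII text readable):
print(json.dumps(jj, indent=2, ensure_ascii=False))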
Let me know if it works for you!

Large XML Response - Python

I'm scraping a webpage that returns an XML response that I cannot for the life of me extract any data from. Here is my code that just returns the XML response:
import requests
url = 'https://www5.fdic.gov/cra/WebServices/DBService.asmx/callWS'
r = requests.post(url, data={"functionName":"SearchCRA","parmsJSON":"{\"Appl_Number\":\"\",\"Appl_Type\":\"\",\"PSTALP\":\"\",\"SUPRV_FDICDBS\":\"09\",\"BANK_NAME\":\"\"}"})
print(r.content)
For example I would like to extract application numbers, institution names, and application type. I'm relatively new to Python and I just can't get my head around this one.
Thanks in advance.
The XML response actually has a very simple structure, with just a single root element <string>. The text of that element contains JSON, so actually parsing the content is trivial.
Assuming you have the response in r, then:
import json
from xml.etree import ElementTree as ET

root = ET.fromstring(r.content)
data = json.loads(root.text)

for result in data['Result']:
    print(result['Appl_Number'])
    print(result['Instname'])
    print(result['Appl_Type'])
    print('--')

Python - getting information from nodes

I've been trying to get information from a site, and recently found out that it is stored in childNodes[0].data.
I'm pretty new to Python and have never tried scripting against websites.
Somebody told me I could make a tmp.xml file and extract the information from there, but since the file only contains the source code (which I think is of no use to me), I don't get any results.
Current code:
response = urllib2.urlopen(get_link)
html = response.read()

with open("tmp.xml", "w") as f:
    f.write(html)

dom = parse("tmp.xml")
name = dom.getElementsByTagName("name[0].firstChild.nodeValue")
I've also tried using dom = parse(html), without better results.
getElementsByTagName() takes an element name, not an expression. It is highly unlikely that the page you are loading contains <name[0].firstChild.nodeValue> tags.
If you are loading HTML, use an HTML parser instead, like BeautifulSoup. For XML, using the ElementTree API is a lot easier than using the (archaic and very verbose) DOM API.
Neither approach requires that you first save the source to disk; both APIs can parse directly from the response object returned by urllib2.
# HTML
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen(get_link)
soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))
print soup.find('title').text
or
# XML
import urllib2
from xml.etree import ElementTree as ET
response = urllib2.urlopen(get_link)
tree = ET.parse(response)
print tree.find('elementname').text

Return last URL in sequence of redirects

I sometimes need to use Beautiful Soup and Requests to parse URLs that are provided like this:
http://bit.ly/sdflksdfwefwe
http://stup.id/sdfslkjsfsd
http://0.r.msn.com/sdflksdflsdj
Of course, these URLs generally 'resolve' to a canonical URL such as http://real-website.com/page.html. How can I get the last URL in the resolution / redirect chain?
My code generally looks like this:
from bs4 import BeautifulSoup
import requests
response = requests.get(url)
soup = BeautifulSoup(response.text, from_encoding=response.encoding)
canonical_url = response.??? ## This is what I need to know
Note that I don't mean to query http://bit.ly/bllsht just to see where it goes, but rather, when I am already using Beautiful Soup to parse the page it returns, to also get the canonical URL that was the last in the redirect chain.
Thanks.
It's in the url attribute of your response object.
>>> response = requests.get('http://bit.ly/bllsht')
>>> response.url
u'http://www.thenews.org/sports/well-hey-there-murray-state-1-21-11-1.2436937'
You could easily find this information in the “Quick Start” page.
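If you also need the intermediate hops, requests keeps them in the history attribute: a list of the redirect responses, in order, with the final destination in response.url.
>>> [hop.url for hop in response.history]  # every URL visited before the last one
>>> response.url  # the final URL in the chain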
