I am learning to work with the requests library and pandas, but I have been struggling to get past the starting point even with a good number of examples online.
I am trying to extract NBA shot data from the URL below using a GET request, and then turn it into a DataFrame:
import requests
import pandas

def extractData():
    Harden_data_url = "https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID=201935&ContextMeasure=FGA&Season=2017-18&section=player&sct=hex"
    response = requests.get(Harden_data_url)
    data = response.json()
    shots = data['resultSets'][0]['rowSet']
    headers = data['resultSets'][0]['headers']
    df = pandas.DataFrame.from_records(shots, columns=headers)
However, I get this error on the line response = requests.get(Harden_data_url):
ValueError: No JSON object could be decoded
I imagine I am missing something basic; any debugging help is appreciated!
The problem is that you are using the wrong URL for fetching the data.
The URL you used returns the HTML page, which is responsible for the site's layout. The data comes from a different URL, which serves it in JSON format.
The correct URL for the data you are looking for is this:
https://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS=2017-18&ContextMeasure=FGA&DateFrom=&DateTo=&EndPeriod=10&EndRange=28800&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=201935&PlayerPosition=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&StartPeriod=1&StartRange=0&TeamID=0&VsConference=&VsDivision=
If you open it in a browser, you will see only the raw JSON data, which is exactly what your code will receive, so the request will work properly.
This blog post explains the method to find the data URL, and although the API has changed a little since the post was written, the method still works:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
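For reference, here is a minimal sketch of the original function pointed at the data URL above. One assumption worth flagging: stats.nba.com has since become stricter about rejecting requests that don't look like they come from a browser, so a browser-like User-Agent header is included as a precaution.

import requests
import pandas

def extractData():
    # The /stats/shotchartdetail endpoint serves JSON, unlike the HTML page.
    shot_data_url = (
        "https://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS=2017-18"
        "&ContextMeasure=FGA&DateFrom=&DateTo=&EndPeriod=10&EndRange=28800"
        "&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00"
        "&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0"
        "&Period=0&PlayerID=201935&PlayerPosition=&RangeType=0&RookieYear="
        "&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season"
        "&StartPeriod=1&StartRange=0&TeamID=0&VsConference=&VsDivision="
    )
    # Browser-like header as a precaution against the API rejecting scripts.
    response = requests.get(shot_data_url, headers={"User-Agent": "Mozilla/5.0"})
    data = response.json()
    headers = data["resultSets"][0]["headers"]
    shots = data["resultSets"][0]["rowSet"]
    return pandas.DataFrame.from_records(shots, columns=headers)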
I am a new programmer and I'm learning the requests module. I'm stuck because I don't know how to get a specific part of a JSON response; I think it's called a header? Or is it the thing inside of a header? I'm not sure, but the API returns simple JSON. This is the API:
https://mcapi.us/server/status?ip=mc.hypixel.net
For more of an example, let's say the API returns this JSON:
{"status": "success", "online": true}
If I wanted to get the "online" value, how would I do that?
And this is the code I'm currently working with.
import requests

def main():
    ask = input("IP : ")
    response = requests.get('https://mcapi.us/server/status?ip=' + ask)
    print(response.content)

main()
And to be honest, I don't even know if this is JSON. I think it is, but the API page says it's CORS? If it isn't, I'm sorry.
In your example you have a dictionary with the key "online".
You need to parse the response first with .json(), and then you can read values in the form dict[key].
In your case
response = requests.get('https://mcapi.us/server/status?ip=' + ask).json()
print(response["online"])
Or, if the payload instead had a "content" key:
response = requests.get('https://mcapi.us/server/status?ip=' + ask).json()
print(response["content"])
I am very new to APIs. I am trying to get the response of a requests.post call in the form of a JSON file or dictionary. I get a status_code of 200, so I know the request succeeded, but when I run response.text everything comes back as a string. I have read parts of the Quickstart guide for Requests, but it only seems to use .text to extract the data. My expected output for this particular API would ideally be a JSON file or some dictionary I can work with.
What I have so far (I realize this is not a fully reproducible example, but I think it gets the point across; otherwise refer to here for some examples):
import pandas as pd
import requests
response = requests.post(
    url=request_url,
    headers=headers,
    json=body,
)
response.text  # returns a string
response.json  # returns a method
pd.json_normalize(response.text)  # throws an error that pandas does not have this attribute (which it does, I don't know why not)
pd.read_json(response.text)  # somewhat workable DataFrame
pd.read_json() gets me somewhere, but it puts an object in a cell of the DataFrame, which I feel is not the route to go down.
Based on John Gordon's comment above, you can do the following
data = response.json()
Then, with from pandas.io.json import json_normalize (in pandas 1.0 and later, use pd.json_normalize directly), you can also do
df = json_normalize(data)
This will convert the response into a pandas dataframe.
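Putting both steps together, a minimal sketch would look like this (the URL, headers, and body are placeholders standing in for the undefined names from the question):

import pandas as pd
import requests

# Placeholders: substitute your real endpoint, headers, and request body.
request_url = "https://example.com/api/endpoint"
headers = {"Authorization": "Bearer <token>"}
body = {"query": "..."}

response = requests.post(url=request_url, headers=headers, json=body)
data = response.json()        # note the parentheses: .json is a method
df = pd.json_normalize(data)  # flatten the dict into a DataFrame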
I'm fairly new to programming and I am trying to take data from a webpage and use it in my Python code. Basically, I'm trying to take the price of an item in a game by having Python grab the data whenever I run my code, if that makes sense. Here's what I'm struggling with in particular:
The page I'm using is for RuneScape, namely
http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151
This page provides me with a bunch of dictionaries from which I am trying to extract the price of the item in question. All I really want to do is get all of this data into Python so I can then manipulate it. My current code is:
import urllib2
response = urllib2.urlopen('http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151')
print response
And it outputs:
<addinfourl at 49631760 whose fp = <socket._fileobject object at 0x02F4B2F0>>
whereas I just want it to display exactly what is on the URL in question.
Any ideas? I'm sorry if my formatting is terrible. And if it sounds like I have no idea what I'm talking about, it's because I don't.
If the webpage returns JSON-encoded data, then do something like this:
import urllib2
import json
response = urllib2.urlopen("http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151")
data = json.load(response)
print(data)
Extract the relevant keys in the data variable to get the values you want.
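For example, the price appears to live under a nested key path along the lines of item → current → price in this endpoint's payload; that path is an assumption, so verify it against the printed output first:

# Hypothetical key path; inspect the printed data to confirm the structure.
price = data["item"]["current"]["price"]
print(price)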
This is my first couple of weeks coding; apologies for a basic question.
I've managed to parse the 'WorldNews' subreddit JSON, identify the individual children (24 of them as I write), and grab the titles of each news item. I'm now trying to create an array from these news titles. The code below does print the fifth title ([4]) to the command line every 2-3 attempts (otherwise it produces the error below). It will also not print more than one title at a time (for example, if I try [2, 3, 4] I will continuously get the same error).
The error I get when it fails:
in <module> Children = theJSON["data"]["children"] KeyError: 'data'
My script:
import requests
import json

r = requests.get('https://www.reddit.com/r/worldnews/.json')
theJSON = json.loads(r.text)

Children = theJSON["data"]["children"]

News_list = []
for post in Children:
    News_list.append(post["data"]["title"])

print News_list[4]
I've managed to find a solution with Eric's help. The issue was in fact not related to the key, the parsing, or the presentation of the dict or array. When requesting the URL from Reddit and attempting to print the JSON output, we encounter an HTTP Error 429 (too many requests). Fixing this is simple. The answer was found on this redditdev thread.
Solution: by adding an identifier for the client requesting the URL (a 'User-agent' in the headers), it runs smoothly and works every time.
import requests
import json

r = requests.get('https://www.reddit.com/r/worldnews.json',
                 headers={'User-agent': 'Chrome'})
theJSON = json.loads(r.text)
print theJSON
This means that the payload you got didn't have a data key in it, for whatever reason. I don't know Reddit's JSON API well; I tested the request and saw that you were using the correct keys. The fact that your code works only every few times tells me that you're getting a different response between requests. I can't reproduce it; I tried making the request over and over and always got the correct response. If I had to guess why you'd sometimes get something different, I'd say it would have to be either rate limiting or a temporary 503 (Reddit having issues).
You can guard against this by either catching the KeyError or using the .get method of dictionaries.
Catching KeyError:
try:
    Children = theJSON["data"]["children"]
except KeyError:
    print 'bad payload'
    return
Using .get:
Children = theJSON.get("data", {}).get("children")
if not Children:
    print 'bad payload'
    return
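Combining the User-agent workaround above with the defensive lookup, a sketch of the whole script might look like this:

import requests

# Reddit rate-limits anonymous clients (HTTP 429), so identify the client.
r = requests.get('https://www.reddit.com/r/worldnews/.json',
                 headers={'User-agent': 'Chrome'})
theJSON = r.json()

Children = theJSON.get("data", {}).get("children")
if not Children:
    print('bad payload')
else:
    News_list = [post["data"]["title"] for post in Children]
    print(News_list[4])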
I have a site that I want to extract data from. The data retrieval is very straightforward.
It takes its parameters via HTTP POST and returns a JSON object. So I have a list of queries that I want to run, and then repeat at certain intervals to update a database. Is Scrapy suitable for this, or should I be using something else?
I don't actually need to follow links but I do need to send multiple requests at the same time.
What does the POST request look like? There are many variations: simple query parameters (?a=1&b=2), a form-like payload (the body contains a=1&b=2), or any other kind of payload (the body contains a string in some format, such as JSON or XML).
In Scrapy it is fairly straightforward to make POST requests; see: http://doc.scrapy.org/en/latest/topics/request-response.html#request-usage-examples
For example, you may need something like this:
# Warning: take care of the undefined variables and modules!

def start_requests(self):
    payload = {"a": 1, "b": 2}
    yield Request(url, self.parse_data, method="POST",
                  body=urllib.urlencode(payload))

def parse_data(self, response):
    # do stuff with data...
    data = json.loads(response.body)
For handling requests and retrieving responses, Scrapy is more than enough. And to parse JSON, just use the json module from the standard library:
import json
data = ...
json_data = json.loads(data)
Hope this helps!
Based on my understanding of the question, you just want to fetch/scrape data from a web page at certain intervals; Scrapy is generally used for crawling.
If you just want to make HTTP POST requests, you might consider using the Python requests library.
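For instance, a bare-bones polling loop with requests might look like this (the URL, payloads, and interval are placeholders, and the database update is left as a stub):

import time
import requests

# Placeholders: substitute the real endpoint and your list of query payloads.
url = "https://example.com/api/data"
queries = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

while True:
    for payload in queries:
        response = requests.post(url, data=payload)  # form-encoded POST body
        data = response.json()
        # ... update the database with `data` here ...
    time.sleep(600)  # repeat every 10 minutes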