Parsing reddit json into Python array and print items from array - python

This is my first couple of weeks coding; apologies for a basic question.
I've managed to parse the 'WorldNews' subreddit JSON, identify the individual children (24 of them as I write) and grab the title of each news item. I'm now trying to create an array from these news titles. The code below prints the fifth title ([4]) to the command line only every 2-3 attempts (otherwise it gives the error below). It will also not print more than one title at a time (for example, if I try [2,3,4] I continuously get the same error).
The error I get when it fails:
in <module>
    Children = theJSON["data"]["children"]
KeyError: 'data'
My script:
import requests
import json

r = requests.get('https://www.reddit.com/r/worldnews/.json')
theJSON = json.loads(r.text)
Children = theJSON["data"]["children"]
News_list = []
for post in Children:
    News_list.append(post["data"]["title"])
print News_list[4]

I've managed to find a solution with the help of Eric. The issue was in fact not related to the key, or to parsing or presenting the dict or array. When requesting a URL from Reddit and attempting to print the JSON output, we encounter an HTTP Error 429 (Too Many Requests). Fixing this is simple; the answer was found on this redditdev thread.
Solution: by adding an identifier for the device requesting the URL (a 'User-agent' in the headers), it runs smoothly and works every time.
import requests
import json

r = requests.get('https://www.reddit.com/r/worldnews.json', headers={'User-agent': 'Chrome'})
theJSON = json.loads(r.text)
print theJSON

This means that the payload you got didn't have a data key in it, for whatever reason. I don't know Reddit's JSON API in detail; I tested the request and saw that you were using the correct keys. The fact that your code works only every few attempts tells me that you're getting a different response between requests. I can't reproduce it; I tried making the request over and over and checking for the correct response. If I had to guess why you'd get something different, I'd say it has to be either rate limiting or a temporary 503 (Reddit having issues).
You can guard against this by either catching the KeyError or using the .get method of dictionaries.
Catching KeyError:
try:
    Children = theJSON["data"]["children"]
except KeyError:
    print 'bad payload'
    return
Using .get:
Children = theJSON.get("data", {}).get("children")
if not Children:
    print 'bad payload'
    return
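A small sketch of the .get-style guard, using made-up payloads in place of real Reddit responses (the 429 body shown is only illustrative of what an error payload might look like):

```python
# Guarding against a missing "data" key in a Reddit-style payload.
# The sample dicts below are invented for illustration; a real payload
# would come from requests.get(...).json().

def extract_titles(payload):
    """Return the list of post titles, or None if the payload is malformed."""
    children = payload.get("data", {}).get("children")
    if not children:
        return None
    return [post["data"]["title"] for post in children]

good = {"data": {"children": [{"data": {"title": "First story"}},
                              {"data": {"title": "Second story"}}]}}
bad = {"message": "Too Many Requests", "error": 429}  # illustrative error body

print(extract_titles(good))   # ['First story', 'Second story']
print(extract_titles(bad))    # None
```

Because the guard returns None instead of raising, the caller can decide whether to retry, sleep, or give up.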

Related

How do you get a part of a json response in python?

I am a new programmer and I'm learning the requests module. I'm stuck because I don't know how to get a specific part of a JSON response. I think it's called a header? Or is it the thing inside of a header? I'm not sure, but the API returns simple JSON. This is the API:
https://mcapi.us/server/status?ip=mc.hypixel.net
For more of an example, let's say it returns this JSON from the API:
{"status": "success", "online": true}
If I wanted to get the "online" value, how would I do that?
This is the code I'm currently working with:
import requests

def main():
    ask = input("IP : ")
    response = requests.get('https://mcapi.us/server/status?ip=' + ask)
    print(response.content)

main()
And to be honest, I don't even know if this is JSON. I think it is, but the API page says it's CORS? If it isn't, I'm sorry.
In your example you have a dictionary with the key "online".
You need to parse the response first with .json(), and then you can index it as dict[key].
In your case
response = requests.get('https://mcapi.us/server/status?ip=' + ask).json()
print(response["online"])
or in case of actual content
response = requests.get('https://mcapi.us/server/status?ip=' + ask).json()
print(response["content"])
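A minimal offline sketch of the same idea: the raw string below stands in for the API's body (the real response has more fields), and json.loads plays the role of response.json():

```python
import json

# Stand-in for the API's response body; the real API returns more fields.
raw = '{"status": "success", "online": true}'

parsed = json.loads(raw)             # response.json() does this step for you
print(parsed["online"])              # True
print(parsed.get("players", "n/a"))  # .get avoids a KeyError for absent keys
```

Once the text is parsed into a dict, any field is just a key lookup; .get is the safe form when you aren't sure the key exists.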

Unable to modify page number which is within dictionary

I've written a script in Python using POST requests to fetch the JSON content from a webpage. The script does just fine if I only stick to its default page. However, my intention is to create a loop to collect the content from a few different pages. The only problem I'm struggling to solve is using the page keyword within the payload in order to loop over three different pages. Consider my faulty approach a placeholder.
How can I use format within a dict in order to change page numbers?
Working script (if I get rid of the pagination loop):
import requests

link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'

for page in range(0,3):
    payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query="}
    res = requests.post(link, json=payload.format(page)).json()
    for item in res['hits']:
        print(item['name'])
I get an error when I run the script as it is:
res = requests.post(link,json=payload.format(page)).json()
AttributeError: 'dict' object has no attribute 'format'
format is a string method. You should apply it to the string value of your payload instead:
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query=".format(page)}
res = requests.post(link,json=payload).json()
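A minimal sketch of the idea, with the long parameter string trimmed to just the parts that matter here (a real run would then POST each payload as in the answer above):

```python
# Trimmed, illustrative template: only the page placeholder matters here.
template = "getRankingInfo=true&hitsPerPage=20&page={}&query="

# Build one payload per page by formatting the string value, not the dict.
payloads = [{"params": template.format(page)} for page in range(0, 3)]
for p in payloads:
    print(p["params"])
    # a real run would then do: requests.post(link, json=p).json()
```

The dict itself never changes shape; only the string value under "params" varies per page, which is why .format belongs on the string and not on the dict.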

Access the next page of list results in Reddit API

I'm trying to play around with the API of Reddit, and I understand most of it, but I can't seem to figure out how to access the next page of results (since each page is 25 entries).
Here is the code I'm using:
import requests
import json

r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all')
listing = r.json()
after = listing['data']['after']
data = listing['data']['children']
for entry in data:
    post = entry['data']
    print post['score']

query = 'https://www.reddit.com/r/Petscop/top.json?after='+after
r = requests.get(query)
listing = r.json()
data = listing['data']['children']
for entry in data:
    post = entry['data']
    print post['score']
So I extract the after ID as after and pass it into the next request. However, after the first 25 entries (the first page), the code returns just an empty list ([]). I tried changing the second query to:
r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?after='+after)
And the result is the same. I also tried replacing "after" with "before", but the result was again the same.
Is there a better way to get the next page of results?
Also, what the heck is the r in the get argument? I copied it from an example, but I have no idea what it actually means. I ask because I don't know whether it is necessary to access the next page and, if it is, how to modify the query dynamically by adding after to it.
Try:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after='+after
or better:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after={}'.format(after)
As for the r prefix on the string: it marks a raw string literal, in which backslashes are not treated as escape characters. These URLs contain no backslashes, so you can omit it.
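To page through results without hand-building query strings, one approach is to pass the parameters as a dict and let requests handle the encoding. build_params and fetch_page below are hypothetical helpers, and the User-agent header follows the earlier answer about Reddit's rate limiting:

```python
import requests

def build_params(after=None):
    # Query parameters for one page; passing them as a dict lets
    # requests handle the URL encoding for us.
    params = {"sort": "top", "show": "all", "t": "all"}
    if after is not None:
        params["after"] = after
    return params

def fetch_page(after=None):
    # Hypothetical wrapper; requires network access when actually called.
    return requests.get(
        "https://www.reddit.com/r/Petscop/top.json",
        params=build_params(after),
        headers={"User-agent": "pagination-example"},
    ).json()

# Usage sketch (network required):
# listing = fetch_page()
# while listing["data"]["after"]:
#     for entry in listing["data"]["children"]:
#         print(entry["data"]["score"])
#     listing = fetch_page(after=listing["data"]["after"])

print(build_params("t3_abc123"))
```

Keeping sort, show, and t in every request is the key point: dropping them on the second request changes which listing Reddit paginates, which is one way to end up with an empty children list.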

Using python request library to dynamically get HTTP Error String

I've been asked to swap over from urllib2 to the requests library because the library is simpler to use and doesn't raise exceptions for HTTP error codes.
I can get the HTTP error code with response.status_code but I don't see a way to get the error message for that code. Normally, I wouldn't care, but I'm testing an API and the string is just as important.
Does anybody know of a simple way to get that string? I'm expecting 2 pieces something like:
'400':'Bad Request'
This is NOT a duplicate: some of the codes being returned carry unique strings sent by the application I am testing. These strings cannot be looked up with requests.status_codes._codes[error][0], since the string comes dynamically from the back-end server. I was able to get this information using urllib2 like this:
import urllib2
...
opener = urllib2.build_opener(urllib2.HTTPSHandler(context=ctx))
...
except urllib2.HTTPError as err:
    try:
        error_message = err.read()
...
The question now is: is there a method of getting the dynamic HTTP error string? Thanks so much for being patient. The previous question was closed so quickly that I never got a chance to look at the answer, test it, and re-ask with a cleaned-up description.
response = requests.get(url)
error_message = response.reason
In urllib2's HTTPResponse there is a reason attribute that returns the reason phrase from the response's status line. In the requests library, the requests.Response class has an equivalent reason attribute that returns the same thing. Both return information from the response itself, not a fixed string based on the code. If the server sends its dynamic message in the response body instead, it is available as response.text.
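As a sketch of where each string lives, the snippet below builds a Response by hand (the status code, reason phrase, and body text are all made-up values) so it runs without a network; a real object would come back from requests.get:

```python
import requests

# Hand-built Response standing in for a real server reply; the values
# below are invented for illustration.
resp = requests.models.Response()
resp.status_code = 400
resp.reason = "Bad Request"                 # reason phrase from the status line
resp._content = b"parameter 'id' is missing"  # dynamic message in the body
resp.encoding = "utf-8"

print(resp.status_code, resp.reason)  # 400 Bad Request
print(resp.text)                      # parameter 'id' is missing
```

The static pair you asked for ('400': 'Bad Request') comes from status_code and reason; the application-specific string, if the server sends one, arrives in the body and is read via text.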

KeyError in JSON request Python - NYT API

I am trying to extract the URL of specific articles from NYT API.
This is my code:
import requests
for i in range(0,100):
page=str(i)
r = requests.get("http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20100101&q=terrorist+attack&page="+page+"&api-key=***")
data = r.json()
article = data['response']['docs']
for url in article:
print(url["web_url"])
After printing the first 20 URLs, it gives me this error:
KeyError: 'response'
However, by checking random pages, the key 'response' is present in all of them. What can I do to print all the URLs from the next 88 pages?
I ran into a similar problem. You might be requesting faster than the allowed limit of 5 per second. In that case, the NYT server responds with an error message, so there will be no 'response' key. I would suggest printing out the keys from every GET request with something like:
print(data.keys())
If you keep seeing 'message' as one of the keys, then you know you're probably requesting too fast. Just put in a time.sleep(0.5) to slow things down and you should be good.
You are assuming that there are at least 101 pages to request (0 to 100).
If you make a request for page 100, do you still get the same JSON structure with a response key?
What you should use instead is a while loop that breaks when you get a KeyError.
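A sketch of such a loop. fake_get below simulates the API: a normal payload for the first few pages, then a body with a 'message' key instead of 'response' (all names and URLs here are illustrative):

```python
import time

def fake_get(page):
    # Simulated NYT-style responses: normal pages, then an error body.
    if page < 3:
        return {"response": {"docs": [{"web_url": "https://example.com/%d" % page}]}}
    return {"message": "rate limit exceeded or no more pages"}

urls = []
page = 0
while True:
    data = fake_get(page)
    if "response" not in data:   # error or end of results: stop cleanly
        break
    urls.extend(doc["web_url"] for doc in data["response"]["docs"])
    page += 1
    time.sleep(0.0)              # use ~0.5s between real requests

print(urls)
```

Checking for the key before indexing combines both answers: it stops on the rate-limit 'message' body and on running out of pages, without ever raising a KeyError.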
