I am trying to extract the URLs of specific articles from the NYT API.
This is my code:
import requests
for i in range(0,100):
    page=str(i)
    r = requests.get("http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20100101&q=terrorist+attack&page="+page+"&api-key=***")
    data = r.json()
    article = data['response']['docs']
    for url in article:
        print(url["web_url"])
After printing the first 20 URLs it gives me this error:
KeyError: 'response'
However, when I check random pages individually, the key 'response' is present in every one of them. What can I do to print all the URLs from the next 88 pages?
I ran into a similar problem. You might be requesting faster than the allowed limit of 5 requests per second. In that case the NYT server returns an error message instead of results, so there is no 'response' key. I would suggest printing the keys of every response with something like:
print(data.keys())
If you keep seeing 'message' as one of the keys, you are probably requesting too fast. Put in a time.sleep(0.5) between requests to slow things down and you should be good.
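A minimal sketch of that throttling idea, assuming a hypothetical fetch_page callable that wraps the requests.get(...).json() call from the question. The names collect_urls and delay are illustrative and not part of the NYT API:

```python
import time


def collect_urls(fetch_page, pages, delay=0.5):
    """Collect web_url values page by page, pausing between requests.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    """
    urls = []
    for page in pages:
        data = fetch_page(page)
        if "response" not in data:
            # An error payload (e.g. one with a 'message' key) came back;
            # show its keys and stop instead of raising KeyError.
            print("page", page, "returned keys:", list(data.keys()))
            break
        for doc in data["response"]["docs"]:
            urls.append(doc["web_url"])
        time.sleep(delay)  # pause so the next request stays under the limit
    return urls
```

Passing the fetch function in keeps the loop easy to try out without touching the real API.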
You are assuming that there are at least 100 pages to request (range(0,100) covers pages 0 to 99).
If you make a request to page 99, do you still get the same JSON structure with a response key?
Instead, use a while loop that breaks when you get a KeyError.
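One way such a while loop could look, again assuming a hypothetical fetch_page callable that returns the parsed JSON for a page number. The extra check for an empty docs list is an added assumption so the loop cannot run forever if the API keeps answering with empty pages:

```python
def urls_until_missing(fetch_page):
    """Request pages 0, 1, 2, ... until a payload lacks the expected keys."""
    urls = []
    page = 0
    while True:
        data = fetch_page(page)
        try:
            docs = data["response"]["docs"]
        except KeyError:
            break  # no 'response' key: past the last page or an error payload
        if not docs:
            break  # an empty page also means there is nothing left
        urls.extend(doc["web_url"] for doc in docs)
        page += 1
    return urls
```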
I am a new programmer and I'm learning the requests module. I'm stuck because I don't know how to get a specific part of a JSON response. I think it's called a header, or maybe it's the thing inside a header? I'm not sure. The API returns simple JSON. This is the API:
https://mcapi.us/server/status?ip=mc.hypixel.net
For example, let's say the API returns this JSON:
{"status":"success","online":true}
If I wanted to get the "online" value, how would I do that?
This is the code I'm currently working with:
import requests
def main():
    ask = input("IP : ")
    response = requests.get('https://mcapi.us/server/status?ip=' + ask)
    print(response.content)
main()
And to be honest, I don't even know if this is JSON. I think it is, but the API page says it's CORS? If it isn't, I'm sorry.
In your example you have a dictionary with the key "online".
You need to parse the response first with .json(); then you can read a value as dict[key].
In your case:
response = requests.get('https://mcapi.us/server/status?ip=' + ask).json()
print(response["online"])
or, for any other key that the payload actually contains, such as a "content" key:
response = requests.get('https://mcapi.us/server/status?ip=' + ask).json()
print(response["content"])
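To see what .json() gives you without calling the API, here is a small sketch that parses the example payload from the question with the standard json module. The "players" key is a made-up name used only to show a missing key:

```python
import json

# The example payload from the question, as a raw JSON string.
raw = '{"status":"success","online":true}'

# json.loads turns the JSON text into a plain Python dict,
# which is also what response.json() returns for a JSON body.
payload = json.loads(raw)

print(payload["online"])       # JSON true becomes Python True
print(payload.get("players"))  # .get returns None when the key is absent
```

Indexing with payload["players"] would raise KeyError instead, so .get is the safer choice when you are not sure a key exists.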
I've written a script in Python using POST requests to fetch the JSON content from a webpage. The script works fine as long as I stick to its default page. However, my intention is to create a loop to collect the content from a few different pages. The only problem I'm struggling to solve is how to use the page keyword within the payload in order to loop over three different pages. Consider my faulty approach a placeholder.
How can I use format within a dict in order to change page numbers?
Working script (if I get rid of the pagination loop):
import requests
link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'
for page in range(0,3):
    payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query="}
    res = requests.post(link,json=payload.format(page)).json()
    for item in res['hits']:
        print(item['name'])
I get an error when I run the script as it is:
res = requests.post(link,json=payload.format(page)).json()
AttributeError: 'dict' object has no attribute 'format'
format is a string method. You should apply it to the string value of your payload instead:
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query=".format(page)}
res = requests.post(link,json=payload).json()
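The same idea with a shortened, made-up params string, so the mechanics are easier to see. The helper name build_payload and the abbreviated query string are assumptions for illustration, not the full Algolia parameters from the question:

```python
def build_payload(page):
    """Return a fresh payload dict with the page number filled in."""
    # str.format works on the string value, not on the dict that holds it.
    template = "hitsPerPage=20&page={}&query="
    return {"params": template.format(page)}


# One payload per page; each dict could be passed as json=... to requests.post.
payloads = [build_payload(page) for page in range(0, 3)]
for payload in payloads:
    print(payload["params"])
```

Building the dict inside a function also avoids reusing an already formatted string on the next loop iteration.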
I am trying to write a requests.get statement with two URLs in it. What I am aiming to do is have requests (the Python module) make two requests based on a list or two strings I provide. How can I pass multiple strings from a list into requests.get, and have requests go to each URL (string) and do something with it?
Thanks
With the Python requests library, each requests.get call fetches one URL at a time. If what you are trying to do is perform multiple requests with a list of known URLs, then it's quite easy.
import requests
# requests needs a full URL, including the scheme
my_links = ['https://www.google.com', 'https://www.yahoo.com']
my_responses = []
for link in my_links:
    # these pages return HTML, so keep the text; use .json() only for JSON endpoints
    payload = requests.get(link).text
    print('got response from {}'.format(link))
    my_responses.append(payload)
    print(payload)
my_responses now has all the content from the pages.
You don't. The requests.get() method (or any other method, really) takes a single URL and makes a single HTTP request because that is what most humans want it to do.
If you need to make two requests, you must call that method twice.
requests.get(url)
requests.get(another_url)
Of course, these calls are synchronous: the second will only begin once the first response is received.
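If waiting for each response in turn is a problem, one option that is not part of the answer above is to run the calls in a thread pool from the standard library. This is only a sketch: fetch stands for any callable that takes one URL (requests.get would be one candidate), and fetch_all is a hypothetical helper name:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL using a small pool of threads.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

With the requests library installed, a call might look like fetch_all(my_links, requests.get). Each individual call is still one URL per request.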
I've been asked to swap over from urllib2 to the requests library because the library is simpler to use and doesn't cause exceptions.
I can get the HTTP error code with response.status_code but I don't see a way to get the error message for that code. Normally, I wouldn't care, but I'm testing an API and the string is just as important.
Does anybody know of a simple way to get that string? I'm expecting 2 pieces something like:
'400':'Bad Request'
This is NOT a DUPLICATE
Some of the codes being returned have unique strings sent by the application that I am testing. These strings cannot be looked up with requests.status_codes._codes[error][0], since the string comes dynamically from the back-end server. I was able to get this information with urllib2 using this method:
import urllib2
...
opener = urllib2.build_opener(urllib2.HTTPSHandler(context=ctx))
except urllib2.HTTPError as err:
try: error_message = err.read()
...
The question now is... is there a method of getting the dynamic http error string? Thanks so much for being patient. The previous issue was closed so quickly I never got a chance to look at the answer, test it and re-ask by cleaning up the description.
response = requests.get(url)
error_message = response.reason
In HTTPResponse there's a reason attribute that returns the reason phrase from the response's status line. In the requests library the requests.Response class has an equivalent reason attribute that returns the same thing. Both should return the information from the response, not a fixed string based on the code.
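A small sketch that pulls the pieces together, assuming response is a requests.Response (or any object with the same status_code, reason and text attributes). The helper name describe_error is hypothetical. Note that err.read() in the urllib2 snippet reads the response body, which in requests is available as response.text rather than response.reason:

```python
def describe_error(response):
    """Return (status code, reason phrase, body text) for a response object."""
    return response.status_code, response.reason, response.text


# Usage with requests (not run here, it needs network access):
#   response = requests.get(url)
#   code, reason, body = describe_error(response)
#   print("%s: %s" % (code, reason))   # e.g. the status line sent by the server
#   print(body)                        # any message the application put in the body
```

requests does not raise on 4xx or 5xx responses unless you call response.raise_for_status(), so all three values can be read directly.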
This is my first couple of weeks coding; apologies for a basic question.
I've managed to parse the 'WorldNews' subreddit JSON, identify the individual children (24 of them as I write) and grab the titles of each news item. I'm now trying to create an array from these news titles. The code below prints the fifth title ([4]) to the command line only every 2-3 attempts (otherwise it gives the error below). It also will not print more than one title at a time (for example, if I try [2,3,4] I continuously get the same error).
The error I get when it fails:
in <module>
    Children = theJSON["data"]["children"]
KeyError: 'data'
My script:
import requests
import json
r = requests.get('https://www.reddit.com/r/worldnews/.json')
theJSON = json.loads(r.text)
Children = theJSON["data"]["children"]
News_list = []
for post in Children:
    News_list.append(post["data"]["title"])
print(News_list[4])
I've managed to find a solution with the help of Eric. The issue here was in fact not related to the key, parsing or presentation of the dict or array. When requesting a URL from reddit and attempting to print the JSON string output we encounter an HTTP Error 429. Fixing this is simple. The answer was found on this redditdev thread.
Solution: by adding an identifier for the device requesting the URL ('User-agent' in the headers) it runs smoothly and works every time.
import requests
import json
r = requests.get('https://www.reddit.com/r/worldnews.json', headers = {'User-agent': 'Chrome'})
theJSON = json.loads(r.text)
print(theJSON)
This means that the payload you got didn't have a data key in it, for whatever reason. I don't know about Reddit's JSON API; I tested the request and saw that you were using the correct keys. The fact that you say your code works every few times tells me that you're getting a different response between requests. I can't reproduce it, I tried making the request over and over and checking for the correct response. If I had to guess why you'd get something different I'd say it'd have to be either rate limiting or a temporary 503 (Reddit having issues.)
You can guard against this by either catching the KeyError or using the .get method of dictionaries.
Catching KeyError:
try:
    Children = theJSON["data"]["children"]
except KeyError:
    print('bad payload')
    return
Using .get:
Children = theJSON.get("data", {}).get("children")
if not Children:
    print('bad payload')
    return
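Both snippets use return, so they are meant to sit inside a function. A self-contained sketch of the .get variant, with a hypothetical function name extract_titles and a made-up error payload standing in for whatever Reddit sends when it rejects a request:

```python
def extract_titles(payload):
    """Return the post titles from a listing payload, or [] if it is malformed."""
    children = payload.get("data", {}).get("children")
    if not children:
        print("bad payload")
        return []
    return [post["data"]["title"] for post in children]


# A well-formed listing and an error-style payload (both made up for illustration).
good = {"data": {"children": [{"data": {"title": "First"}},
                              {"data": {"title": "Second"}}]}}
bad = {"message": "Too Many Requests", "error": 429}

print(extract_titles(good))
print(extract_titles(bad))
```

Returning an empty list lets the caller loop over the result without a separate error check.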