I've written a script in Python that uses POST requests to fetch JSON content from a webpage. The script works fine as long as I stick to its default page. However, my intention is to create a loop to collect the content from a few different pages. The only problem I'm struggling to solve is how to use the page keyword within the payload in order to loop over three different pages. Consider my faulty approach a placeholder.
How can I use format within a dict in order to change page numbers?
Working script (if I get rid of the pagination loop):
import requests
link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'
for page in range(0,3):
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query="}
res = requests.post(link,json=payload.format(page)).json()
for item in res['hits']:
print(item['name'])
I get an error when I run the script as it is:
res = requests.post(link,json=payload.format(page)).json()
AttributeError: 'dict' object has no attribute 'format'
format is a string method, not a dict method. You should apply it to the string value inside your payload instead:
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query=".format(page)}
res = requests.post(link,json=payload).json()
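Putting the fix into the loop, a minimal sketch (the params string is shortened here for readability; substitute the full string from the answer above):

import requests

link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'

# shortened params template; use the full string from the answer above
params_template = "getRankingInfo=true&hitsPerPage=20&page={}&query="

for page in range(0, 3):
    payload = {"params": params_template.format(page)}  # format the string first, then build the dict
    res = requests.post(link, json=payload).json()
    for item in res['hits']:
        print(item['name'])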
I am scraping data from a website and have retrieved a list of URLs from which I will get the final data I need. How do I retrieve the HTML from this list of addresses using a loop?
Using xpath in lxml I have a list of URLs. Now I need to retrieve the page content for each of these URLs and then use xpath once again to get the final data from each of these pages. I am able to get data from each page individually if I use
pagecontent=requests.get(linklist[1])
then I am able to get the content of one URL, but if I use a for loop
for i in range(0,8):
    pagecontent[i]=requests.get(linklist[i])
I get a list assignment index out of range error. I have also tried using
pagecontent=[requests.get(linklist) for s in linklist]
but then the error I see is No connection adapters were found for '['http...(list of links)...]'
I am trying to get a list pagecontent where each item in the list contains the HTML of the respective URL. What is the best way to achieve this?
In light of your comment, I believe this (or something like this) may be what you're looking for; I can't try it myself since I don't have your linklist, but you should be able to modify the code to fit your situation. It uses Python f-strings to accomplish what you need.
linklist = ['www.example_1.com','www.example_2.com','www.example_3.com']
pages = {}  # initialize an empty dictionary to house your name/link entries
for i in range(len(linklist)):
    pages[f'pagecontent[{i+1}]'] = linklist[i]  # the '+1' is needed because Python counts from 0
for name, link in pages.items():
    print(name, link)
Output:
pagecontent[1] www.example_1.com
pagecontent[2] www.example_2.com
pagecontent[3] www.example_3.com
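If what you want is the HTML itself rather than named links, a minimal sketch (assuming every entry in linklist is a full, reachable URL; the URLs below are hypothetical) would fetch each page inside the loop and append it to a list:

import requests

linklist = ['http://www.example.com/page1', 'http://www.example.com/page2']  # hypothetical URLs

pagecontent = []  # start from an empty list instead of indexing into one
for link in linklist:
    response = requests.get(link)      # fetch one URL at a time, not the whole list
    pagecontent.append(response.text)  # store the HTML of each page

The original list assignment index out of range error happened because the code assigned into list positions that didn't exist yet; appending to an empty list sidesteps that.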
I'm trying to play around with the API of Reddit, and I understand most of it, but I can't seem to figure out how to access the next page of results (since each page is 25 entries).
Here is the code I'm using:
import requests
import json
r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all')
listing = r.json()
after = listing['data']['after']
data = listing['data']['children']
for entry in data:
    post = entry['data']
    print post['score']
query = 'https://www.reddit.com/r/Petscop/top.json?after='+after
r = requests.get(query)
listing = r.json()
data = listing['data']['children']
for entry in data:
    post = entry['data']
    print post['score']
So I extract the after ID as after and pass it into the next request. However, after the first 25 entries (the first page), the code returns just an empty list ([]). I tried changing the second query to:
r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?after='+after)
And the result is the same. I also tried replacing "after" with "before", but the result was again the same.
Is there a better way to get the next page of results?
Also, what the heck is the r in the get argument? I copied it from an example, but I have no idea what it actually means. I ask because I don't know if it is necessary to access the next page, and if it is necessary, I don't know how to modify the query dynamically by adding after to it.
Try:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after='+after
or better:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after={}'.format(after)
As for the r prefix on strings: it marks a raw string literal, in which backslashes are not treated as escape characters. Your URL contains no backslashes, so you can simply omit it.
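A minimal sketch of a paginated loop built on that fix (variable names are illustrative, and the User-agent value is an arbitrary identifier; as a later answer on this page notes, Reddit throttles requests that don't set one):

import requests

base = 'https://www.reddit.com/r/Petscop/top.json'
params = {'sort': 'top', 'show': 'all', 't': 'all'}
headers = {'User-agent': 'my-script'}  # hypothetical identifier to avoid throttling

after = None
for _ in range(3):  # fetch three pages of 25 entries each
    if after:
        params['after'] = after  # cursor returned by the previous page
    listing = requests.get(base, params=params, headers=headers).json()
    for entry in listing['data']['children']:
        print(entry['data']['score'])
    after = listing['data']['after']  # becomes None on the last page
    if after is None:
        break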
I tried to scrape Twitter followers using the requests library. Finally, I saved the response of the required page in JSON format and then tried to search for the required parts. The thing is, how do I find the required elements in the JSON object?
my code is:
s = requests.session()
res = s.post("https://twitter.com/sessions",data=payload,headers=headers)
r = s.get("https://twitter.com/akhiltaker619/following/users?include_available_features=1&include_entities=1&max_position=1590310744326457266&reset_error_state=false")
dp = r.text
dp1=json.loads(dp)
x = json.dumps(dp1)
print(res.status_code)
soup = BeautifulSoup(x,"html.parser")
x1= soup.find_all("b",{"class":"u-linkComplex-target"})
for i in x1:
    print(i.text)
The parsing part at the end is wrong, since I am trying to scrape a JSON object with an HTML parser, which is not possible. When I print the JSON object, I get this:
(a link containing the output of the JSON object is attached)
Now, from this object, I want the "class: u-linkComplex-target" elements present in the "item_html" field of this JSON object. How do I get this? Or is there any way to get the same content without using the JSON object (this content is the followers list page on Twitter)? I used JSON in order to load the dynamic content of the page.
The Beautiful Soup library is for parsing HTML and similar tagged languages, not JSON.
If your requests return JSON responses then you should call the r.json() method. This will return a dictionary of the JSON structure. Suppose you used
j = r.json()
then you probably want j['item-html']['linkComplex-target'] or something similar. If you access the dictionary interactively you will probably find what you want.
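Since the question says the markup lives inside "item_html", a sketch assuming that key holds an HTML snippet (verify the exact key name against your actual response) is to parse the JSON first and then run BeautifulSoup on just that embedded fragment:

from bs4 import BeautifulSoup

j = r.json()               # r is the response from the question's s.get(...) call
fragment = j['item_html']  # assumed key holding an HTML snippet; check your payload
soup = BeautifulSoup(fragment, "html.parser")
for b in soup.find_all("b", {"class": "u-linkComplex-target"}):
    print(b.text)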
All,
I wrote the following code:
import requests,bs4
res=requests.get('http://itunes.apple.com/lookup?id=551798799')
res.raise_for_status()
wwe=bs4.BeautifulSoup(res.text)
print wwe.select('p averageUserRating')
If I only do print wwe.select('p'), the code works, but it prints everything in the list. However, when I run the select call shown above, it throws an error saying the selector is unsupported.
I am basically only trying to return the averageUserRating value (which is 4.0).
Thanks for the help!
The content of that URL isn't HTML, which is what BeautifulSoup is designed to read; it's a different data format called JSON. Thankfully, the requests library makes it really easy to parse JSON: if you call .json() on a response, it parses the body into a dictionary. You need to access averageUserRating, which is inside the first element of the results list, so you can use this to access what you need:
>>> data = res.json()
>>> data["results"][0]["averageUserRating"]
4.0
To modify your existing code:
import requests
res=requests.get('http://itunes.apple.com/lookup?id=551798799')
res.raise_for_status()
wwe=res.json()
print data["results"][0]["averageUserRating"]
This is my first couple of weeks coding; apologies for a basic question.
I've managed to parse the 'WorldNews' subreddit JSON, identify the individual children (24 of them as I write), and grab the titles of each news item. I'm now trying to create an array from these news titles. The code below does print the fifth title ([4]) to the command line every 2-3 attempts (otherwise it produces the error below). It will also not print more than one title at a time (for example, if I try [2,3,4] I continuously get the same error).
The error I get when it doesn't run:
in <module>
    Children = theJSON["data"]["children"]
KeyError: 'data'
My script:
import requests
import json
r = requests.get('https://www.reddit.com/r/worldnews/.json')
theJSON = json.loads(r.text)
Children = theJSON["data"]["children"]
News_list = []
for post in Children:
    News_list.append(post["data"]["title"])
print News_list[4]
I've managed to find a solution with the help of Eric. The issue here was in fact not related to the key, parsing, or presentation of the dict or array. When requesting a URL from Reddit and attempting to print the JSON string output, we encounter an HTTP Error 429 (Too Many Requests). Fixing this is simple. The answer was found on this redditdev thread.
Solution: by adding an identifier for the client requesting the URL (a 'User-agent' header), it runs smoothly and works every time.
import requests
import json
r = requests.get('https://www.reddit.com/r/worldnews.json', headers = {'User-agent': 'Chrome'})
theJSON = json.loads(r.text)
print theJSON
This means that the payload you got didn't have a data key in it, for whatever reason. I don't know much about Reddit's JSON API; I tested the request and saw that you were using the correct keys. The fact that you say your code works every few times tells me that you're getting a different response between requests. I can't reproduce it; I tried making the request over and over and checking for the correct response. If I had to guess why you'd get something different, I'd say it would have to be either rate limiting or a temporary 503 (Reddit having issues).
You can guard against this by either catching the KeyError or using the .get method of dictionaries.
Catching KeyError:
try:
    Children = theJSON["data"]["children"]
except KeyError:
    print 'bad payload'
    return
Using .get:
Children = theJSON.get("data", {}).get("children")
if not Children:
    print 'bad payload'
    return
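Putting the two answers together, a minimal end-to-end sketch (the User-agent string is an arbitrary identifier, as in the accepted fix above):

import requests

r = requests.get('https://www.reddit.com/r/worldnews.json',
                 headers={'User-agent': 'my-script'})  # identify the client to avoid HTTP 429
theJSON = r.json()

children = theJSON.get("data", {}).get("children")  # guard against a payload with no 'data' key
if not children:
    print('bad payload')
else:
    for post in children:
        print(post["data"]["title"])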