I'm trying to play around with the API of Reddit, and I understand most of it, but I can't seem to figure out how to access the next page of results (since each page is 25 entries).
Here is the code I'm using:
import requests
import json
r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all')
listing = r.json()
after = listing['data']['after']
data = listing['data']['children']
for entry in data:
post = entry['data']
print post['score']
query = 'https://www.reddit.com/r/Petscop/top.json?after='+after
r = requests.get(query)
listing = r.json()
data = listing['data']['children']
for entry in data:
post = entry['data']
print post['score']
So I extract the after ID as after, and pass it into the next request. However, after the first 25 entries (the first page) the code returns just an empty list ([]). I tried changing the second query as:
r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?after='+after)
And the result is the same. I also tried replacing "after" with "before", but the result was again the same.
Is there a better way to get the next page of results?
Also, what the heck is the r in the get argument? I copied it from an example, but I have no idea what it actually means. I ask because I don't know if it is necessary to access the next page, and if it is necessary, I don't know how to modify the query dynamically by adding after to it.
Try:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after='+after
or better:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after={}'.format(after)
As for r in strings you can omit it.
Related
I have two lists of baseball players that I would like to scrape data for from the website fangraphs. I am trying to figure out how to have selenium search the first player in the list which would redirect to that players profile, scrape the data I am interested in, and then search the next player until each for loop is completed for the two lists. I have written other scrapers with selenium, but I haven't come across this situation where I need to perform a search, collect the data, then perform the next search, etc ...
Here is a smaller version of one of the lists:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()
for batter in batters:
search_box.send_keys(batter)
search_box.send_keys(Keys.RETURN)
This will search all the names at once obviously, so I guess I'm trying to figure out how to code the logic of searching one by one but not performing the next search until I have collected the data for the previous search - any help is appreciated cheers
With selenium, you would just have to iterate through the names, "type" it into the search bar, click/go to the link, scrape the stats, then repeat. You have set up to do that, you just need to add the scrape part. So something like:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()
for batter in batters:
search_box.send_keys(batter)
search_box.send_keys(Keys.RETURN)
## CODE THAT SCRAPES THE DATA ##
## CODE THAT STORES IT SOMEWAY TO APPEND AFTER EACH ITERATION ##
However, they have an api which is a far better solution than Selenium. Why?
APIs are consistent. Parsing HTML with selnium and/or beautifulsoup is reliant on the html structure. If they ever change the layout of the website, it may crash as certain tags that used to be there may not be there anymore, or they may add certain tags and attributes to the html. But the underlying data that is rendered in the html will come from the api in a nice json format and that will rarely change unless they do a complete overhaul of the data structure
It's far more efficient and quicker. No need to have Selenium open a browser, search and load/render the content, that scrape, then repeat. You get the response in 1 request
You'll get far more data than you intended and (imo) is a good thing. I'd rather have more data and "trim" off what I don't need. Lots of the time you'll see very interesting and useful data that you otherwise wouldn't had known was there.
So I'm not sure what you are after specifically, but this will get you going. You'll have to sift through the statsData to figure out what you want, but if you tell me what you are after, I can help get that into a nice table for you. Or if you want to figure it out yourself, look up pandas and the .json_normalize() function with that. Parsing nested json can be tricky (but it's also fun ;-) )
Code:
import requests
# Get teamIds
def get_teamIds():
team_id_dict = {}
url = 'https://cdn.fangraphs.com/api/menu/menu-standings'
jsonData = requests.get(url).json()
for team in jsonData:
team_id_dict[team['shortName']] = str(team['teamid'])
return team_id_dict
# Get Player IDs
def get_playerIds(team_id_dict):
player_id_dict = {}
for team, teamId in team_id_dict.items():
url = 'https://cdn.fangraphs.com/api/depth-charts/roster?teamid={teamId}'.format(teamId=teamId)
jsonData = requests.get(url).json()
print(team)
for player in jsonData:
if 'oPlayerId' in player.keys():
player_id_dict[player['player']] = [str(player['oPlayerId']), player['position']]
else:
player_id_dict[player['player']] = ['N/A', player['position']]
return player_id_dict
team_id_dict = get_teamIds()
player_id_dict = get_playerIds(team_id_dict)
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
for player in batters:
playerId = player_id_dict[player][0]
pos = player_id_dict[player][1]
url = 'https://cdn.fangraphs.com/api/players/stats?playerid={playerId}&position={pos}'.format(playerId=playerId, pos=pos)
statsData = requests.get(url).json()
Ouput: Here's just a look at what you get
I am trying to request data from Yahoo Finance and then print specific pieces of the data.
My code so far is:
import requests
ticker = input("Enter Stock Ticker: ")
url = "https://query1.finance.yahoo.com/v8/finance/chart/{}?region=GB&lang=en-GB&includePrePost=false&interval=2m&range=1d&corsDomain=uk.finance.yahoo.com&.tsrc=finance".format(ticker)
r = requests.get(url)
data = r.json()
What I am unsure of is how to extract certain pieces of the 'data' variable. For example, I want to display the value that is paired with 'regularMarketPrice'. This can be found in the request.
How can I do this?
Apologies if this isn't worded correctly.
Thanks
If you print data, you will see that it is a dictionary.
If you dig deep enough into the dictionary, you will see that regularMarketPrice can be retrieved as follows (for the first result):
print(data['chart']['result'][0]['meta']['regularMarketPrice'])
If there are multiple results, then you can use the following:
for result in data['chart']['result']:
print(result['meta']['regularMarketPrice'])
I've written a script in python using post requests to fetch the json content from a webpage. The script is doing just fine if I'm only stick to it's default page. However, my intention is to create a loop to collect the content from few different pages. The only problem I'm struggling to solve is use page keyword within payload in order to loop three different pages. Consider my faulty approach as a placeholder.
How can I use format within dict in order to change page numbers?
Working script (if I get rid of the pagination loop):
import requests
link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'
for page in range(0,3):
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query="}
res = requests.post(link,json=payload.format(page)).json()
for item in res['hits']:
print(item['name'])
I get an error when I run the script as it is:
res = requests.post(link,json=payload.format(page)).json()
AttributeError: 'dict' object has no attribute 'format'
format is a string method. You should apply it to the string value of your payload instead:
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query=".format(page)}
res = requests.post(link,json=payload).json()
This is my first couple of weeks coding; apologies for a basic question.
I've managed to parse the 'WorldNews' subreddit json, identify the individual children (24 of them as I write) and grab the titles of each news item. I'm now trying to create an array from these news titles. The code below does print the fifth title ([4]) to command line every 2-3 attempts (otherwise provides the error below). It will also not print more than one title at a time (for example if I try[2,3,4] I will continuously get the same error).
The error I get when doesn't compile:
in <module> Children = theJSON["data"]["children"] KeyError: 'data'
My script:
import requests
import json
r = requests.get('https://www.reddit.com/r/worldnews/.json')
theJSON = json.loads(r.text)
Children = theJSON["data"]["children"]
News_list = []
for post in Children:
News_list.append (post["data"]["title"])
print News_list [4]
I've managed to find a solution with the help of Eric. The issue here was in fact not related to the key, parsing or presentation of the dict or array. When requesting a Url from reddit and attempting to print the json string output we encounter an HTTP Error 429. Fixing this is simple. The answer was found on this redditdev thread.
Solution: by adding an identifier for the device requesting the Url ('User-agent' in header) it runs smoothly and works every time.
import requests
import json
r = requests.get('https://www.reddit.com/r/worldnews.json', headers = {'User-agent': 'Chrome'})
theJSON = json.loads(r.text)
print theJSON
This means that the payload you got didn't have a data key in it, for whatever reason. I don't know about Reddit's JSON API; I tested the request and saw that you were using the correct keys. The fact that you say your code works every few times tells me that you're getting a different response between requests. I can't reproduce it, I tried making the request over and over and checking for the correct response. If I had to guess why you'd get something different I'd say it'd have to be either rate limiting or a temporary 503 (Reddit having issues.)
You can guard against this by either catching the KeyError or using the .get method of dictionaries.
Catching KeyError:
try:
Children = theJSON["data"]["children"]
except KeyError:
print 'bad payload'
return
Using .get:
Children = theJSON.get("data", {}).get("children")
if not Children:
print 'bad payload'
return
I am just starting out with Python/Scrapy.
I have a written a spider that crawls a website and fetches information. But i am stuck in 2 places.
I am trying to retrieve the telephone numbers from a page and they are coded like this
<span class="mrgn_right5">(+001) 44 42676000,</span>
<span class="mrgn_right5">(+011) 44 42144100</span>
The code i have is:
getdata = soup.find(attrs={"class":"mrgn_right5"})
if getdata:
aditem['Phone']=getdata.get_text().strip()
#print phone
But it is fetching only the first set of numbers and not the second one. How can i fix this?
On the same page there is another set of information
I am using this code
getdata = soup.find(attrs={"itemprop":"pricerange"})
if getdata:
#print getdata
aditem['Pricerange']=getdata.get_text().strip()
#print pricerange
But it is not fetching any thing.
Any help on fixing these two would be great.
From a browse of the Beautiful Soup documentation, find will only return a single result. If multiple results are expected/required, then use find_all instead. Since there are two results, a list will be returned, so the elements of the list need to be joined together (for example) to add them to Phone field of your AdItem.
getdata = soup.find_all(attrs={"class":"mrgn_right5"})
if getdata:
aditem['Phone'] = ''.join([x.get_text().strip() for x in getdata])
For the second issue, you need to access the attributes of the returned object. Try the following:
getdata = soup.find(attrs={"itemprop":"pricerange"})
if getdata:
aditem['Pricerange'] = getdata.attrs['content']
And for the address information, the following code works but is very hacky and could no doubt be improved by someone who understands Beautiful Soup better than me.
getdata = soup.find(attrs={"itemprop":"address"})
address = getdata.span.get_text()
addressLocality = getdata.meta.attrs['content']
addressRegion = getdata.find(attrs={"itemprop":"addressRegion"}).attrs['content']
postalCode = getdata.find(attrs={"itemprop":"postalCode"}).attrs['content']