Paginate the CSV (Python) - python

How can I paginate through the CSV version of an API call using Python?
I understand the metadata in the JSON call includes the total number of records, but without similar info in the CSV call I won't know where to stop my loop if I try to increment the page parameter.
Below is my code:
import requests
import pandas as pd

url = 'https://api.data.gov/ed/collegescorecard/v1/schools.csv'
payload = {
    'api_key': '4KC***UNKk',
    'fields': 'school.name,2012.repayment.2_yr_default_rate',
    '_page': '0'
}
r = requests.get(url, params=payload)
df = pd.read_csv(r.url)
This loads a dataframe with the first 20 results, but I'd like to load a dataframe with all the results.

Use the _per_page parameter to change the number of results per call. Setting _per_page=200 still returns a CSV with only 100 rows, so let's assume 100 is the maximum.
Now that we know the maximum per call, and we know the total number of records, it's possible to run a for loop to get what we need, like so:
url = 'https://api.data.gov/ed/collegescorecard/v1/schools.csv'
apikey = '&api_key=xxx'
fields = '&_fields=school.name,2012.repayment.2_yr_default_rate'
pageA = '?_page='
pageTotal = '&_per_page='
perPage = 100            # maximum rows the API returns per call
pageNumbersMaximum = 70  # 7000 total records / 100 rows per page
rowSum = 0
for page in range(pageNumbersMaximum):
    fullURL = url + pageA + str(page) + pageTotal + str(perPage) + fields + apikey
    print(fullURL)
    rowSum += perPage
    print("Page Number: " + str(page) + ", Total Rows: " + str(rowSum))
With 100 rows per page and 70 pages, the loop works through all 7,000 records.
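If you'd rather not hard-code the page count, another option is to ask the JSON endpoint for the record total the question mentions and derive the number of pages from it. A minimal sketch of that idea, assuming the total sits at metadata['total'] in the JSON response (check your actual payload), which then concatenates the CSV pages into one dataframe:
import io
import math
import requests
import pandas as pd

base = 'https://api.data.gov/ed/collegescorecard/v1/schools'
params = {
    'api_key': 'xxx',
    'fields': 'school.name,2012.repayment.2_yr_default_rate',
    '_per_page': 100,   # observed maximum rows per call
}

# One JSON call just for the metadata, to learn the total record count.
meta = requests.get(base + '.json', params={**params, '_per_page': 1}).json()
total = meta['metadata']['total']               # assumed field name, check your response
pages = math.ceil(total / params['_per_page'])

# Pull each CSV page and stack the pieces into one dataframe.
frames = []
for page in range(pages):
    r = requests.get(base + '.csv', params={**params, '_page': page})
    frames.append(pd.read_csv(io.StringIO(r.text)))

df = pd.concat(frames, ignore_index=True)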

Related

How to change variable in a while loop from hardcoded to incorporate pandas iterrows loop?

I have the following while loop to scrape info from a platform:
while result_count != 0:
start_at = "startAt=" + str(start_index)
url = base_url + toget + "&" + start_at + "&" + max_results
response = requests.get(url, auth=(username, password))
json_response = json.loads(response.text)
print (json_response)
page_info = json_response["meta"]["pageInfo"]
start_index = page_info["startIndex"] + allowed_results
result_count = page_info["resultCount"]
items2 = json_response["data"]
print(items2)
The 'toget' variable is a dataframe that contains different ids.
I need 'toget' to iterate through every element of a pandas dataframe column, returning a different id each time, as this is the only way to scrape all the information properly.
import pandas as pd
toget = {'id': [3396750, 3396753, 3396755, 3396757, 3396759]}
Just add a for loop that iterates through your list and use that variable in the url.
A few other things I'd clean up here:
I would use f'{}' syntax for the url; how you had it is fine, it's just a preference, as I think it's easier to read.
No need to use the json package to read the response; you can call response.json() directly (see the code below).
I'm also assuming you set initial values for start_index, max_results, and result_count, as the code will throw a NameError if they aren't defined when the while loop is reached; in the code below, start_index and result_count are reset at the top of the for loop so each id starts paging from the beginning.
Code:
import pandas as pd

toget = {'id': [3396750, 3396753, 3396755, 3396757, 3396759]}

for eachId in toget['id']:
    # Reset the paging state so each id starts from the first page.
    start_index = 0
    result_count = 1
    while result_count != 0:
        start_at = "startAt=" + str(start_index)
        url = f'{base_url}{eachId}&{start_at}&{max_results}'
        response = requests.get(url, auth=(username, password))
        json_response = response.json()
        print(json_response)
        page_info = json_response["meta"]["pageInfo"]
        start_index = page_info["startIndex"] + allowed_results
        result_count = page_info["resultCount"]
        items2 = json_response["data"]
        print(items2)
If you need to loop through a pandas DataFrame, then I recommend reviewing this post: How to iterate over rows in a DataFrame in Pandas.
The code in your question declares toget as a dict, not a DataFrame. If that's the case, then you can use the code below to loop through it:
Looping through Dict
toget = {'id': [3396750, 3396753, 3396755, 3396757, 3396759]}

for i in toget.get('id'):
    print(i)
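If toget really is a DataFrame rather than a dict, a minimal sketch of the same loop over its 'id' column; iterating the column directly is simpler than iterrows() when only one column is needed:
import pandas as pd

toget = pd.DataFrame({'id': [3396750, 3396753, 3396755, 3396757, 3396759]})

# Iterate the single column directly instead of using iterrows().
for each_id in toget['id']:
    print(each_id)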

Python - Using a While Loop, how can I update a variable string to be used in a function

Background: I'm attempting to create a dataframe using data called from Twitch's API. They only allow 100 records per call so with each pull a new Pagination Cursor is offered in order to move on to the next page. I'm using the following code to try and efficiently pull this data rather than manually adjusting the after=(pagination value) in the get response. Right now the variable I'm trying to make dynamic is the 'Pagination' variable but it only gets updated once the loop finishes - not helpful! Take a look below and see if you notice anything I can change to achieve this goal. Any help is appreciated!
import requests
import pandas as pd

TwitchTopGamesDataFrame = [] #This is our Data List
BaseURL = 'https://api.twitch.tv/helix/games/top?first=100'
Headers = {'client-id':'lqctse0orgdbs5gdf5faz665api03r','Authorization': 'Bearer a1yl09mwmnwetp6ovocilheias8pzt'}
Indent = 2
Pagination = ''
FullURL = BaseURL + Pagination
Response = requests.get(FullURL, headers=Headers)
iterations = 1 # Data records returned are equivalent to iterations x 100
#Loop: Response, Convert JSON data, Append to Data List, Get Pagination & Replace String in Variable - Iterate until 300 records
while iterations <= 3:
    #Grab JSON Data, Convert, & Append
    ResponseJSONData = Response.json()
    #print(pgn) - Debug
    pd.set_option('display.max_rows', None)
    TopGamesDF = pd.DataFrame(ResponseJSONData['data'])
    TopGamesDF = TopGamesDF[['id','name']]
    TopGamesDF = TopGamesDF.rename(columns={'id':'GameID','name':'GameName'})
    TopGamesDF['Rank'] = TopGamesDF.index + 1
    TwitchTopGamesDataFrame.append(TopGamesDF)
    #print(FullURL) - Debug
    #Grab & Replace Pagination Value
    RPagination = pd.DataFrame(ResponseJSONData['pagination'], index=[0])
    pgn = str('&after=' + RPagination.to_string(index=False, header=False).strip())
    Pagination = pgn
    #print(FullURL) - Debug
    iterations += 1
TwitchTopGamesDataFrame
Figured it out:
def top_games(page_count):
    from time import gmtime, strftime
    print("Time of Execution:", strftime("%Y-%m-%d %H:%M:%S", gmtime()))
    #In order to condense the code above and be more efficient, a while loop works well.
    #Goal: Run a While Loop to create a larger DataFrame through Pagination, as the Twitch API only allows 100 records per call.
    baseURL = 'https://api.twitch.tv/helix/games/top?first=100' #Base URL
    Headers = {'client-id':'lqctse0orgdbs5gdf5faz665api03r','Authorization': 'Bearer a1yl09mwmnwetp6ovocilheias8pzt'}
    Pagination = ''
    start_count = 0
    count = 0 # Data records returned are equivalent to count x 100
    max_count = page_count
    #Loop: Response, Convert JSON data, Extend List, Get Pagination & Replace String in Variable
    while count <= max_count:
        #Grab JSON Data and extend the list of records
        FullURL = baseURL + Pagination
        Response = requests.get(FullURL, headers=Headers)
        ResponseJSONData = Response.json()
        pd.set_option('display.max_rows', None)
        if count == start_count:
            TopGamesDFL = ResponseJSONData['data']
        if count > start_count:
            TopGamesDFL.extend(ResponseJSONData['data'])
        #Grab & Replace Pagination Value
        RPagination = pd.DataFrame(ResponseJSONData['pagination'], index=[0])
        pgn = str('&after=' + RPagination.to_string(index=False, header=False).strip())
        Pagination = pgn
        count += 1
        if count == max_count:
            FinalDataFrame = pd.DataFrame(TopGamesDFL)
            FinalDataFrame = FinalDataFrame[['id','name']]
            FinalDataFrame = FinalDataFrame.rename(columns={'id':'GameID','name':'GameName'})
            FinalDataFrame['Rank'] = FinalDataFrame.index + 1
            return FinalDataFrame
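For comparison, here is a rough sketch of the same pagination idea that reads the cursor straight from the parsed JSON instead of round-tripping it through a throwaway DataFrame. It assumes the Helix pagination object exposes the cursor under a 'cursor' key (that is the value the to_string() trick above extracts) and takes the credentials as arguments; the name top_games_cursor is just a hypothetical label:
import requests
import pandas as pd

def top_games_cursor(page_count, client_id, token):
    url = 'https://api.twitch.tv/helix/games/top'
    headers = {'client-id': client_id, 'Authorization': 'Bearer ' + token}
    params = {'first': 100}          # 100 records per call, the maximum per the question
    rows = []
    for _ in range(page_count):
        data = requests.get(url, headers=headers, params=params).json()
        rows.extend(data['data'])
        cursor = data.get('pagination', {}).get('cursor')   # assumed key name
        if not cursor:               # no further pages
            break
        params['after'] = cursor     # request the next page on the next call
    df = pd.DataFrame(rows)[['id', 'name']]
    df = df.rename(columns={'id': 'GameID', 'name': 'GameName'})
    df['Rank'] = df.index + 1
    return df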

Determine the rate limit for requests

I have a question about rate limits.
I take data from a CSV, feed each row into the query, and store the output in a list.
I get an error because I make too many requests at once (I can only make 20 requests per second). How can I keep my requests under the rate limit?
import requests
import pandas as pd

df = pd.read_csv("Data_1000.csv")
id_list = []

def requestSummonerData(summonerName, APIKey):
    URL = "https://euw1.api.riotgames.com/lol/summoner/v3/summoners/by-name/" + summonerName + "?api_key=" + APIKey
    response = requests.get(URL)
    return response.json()

def main():
    APIKey = str(input('Copy and paste your API Key here: '))
    for index, row in df.iterrows():
        summonerName = row['Player_Name']
        responseJSON = requestSummonerData(summonerName, APIKey)
        ID = int(responseJSON['accountId'])
        id_list.insert(index, ID)
    df["accountId"] = id_list
If you already know you can only make 20 requests per second, you just need to work out how long to wait between requests:
Divide 1 second by 20, which gives 0.05. So you just need to sleep for 0.05 seconds between requests and you shouldn't hit the limit (maybe increase it a bit if you want to be safe).
Add import time at the top of your file and call time.sleep(0.05) inside your for loop (you could also just write time.sleep(1/20)).
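A minimal sketch of the question's loop with the sleep in place, using the same (hypothetical) CSV and column names as above:
import time
import requests
import pandas as pd

df = pd.read_csv("Data_1000.csv")
account_ids = []

def request_summoner_data(summoner_name, api_key):
    url = ("https://euw1.api.riotgames.com/lol/summoner/v3/summoners/by-name/"
           + summoner_name + "?api_key=" + api_key)
    return requests.get(url).json()

api_key = input('Copy and paste your API Key here: ')
for index, row in df.iterrows():
    response_json = request_summoner_data(row['Player_Name'], api_key)
    account_ids.append(int(response_json['accountId']))
    time.sleep(0.05)   # pause ~1/20 s so we stay under 20 requests per second

df["accountId"] = account_ids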

Extract JSON value from API call to use as variable [duplicate]

This question already has answers here:
What's the best way to parse a JSON response from the requests library?
(3 answers)
Closed 6 years ago.
I am using Indeed's API to scrape job listings. Their API only allows 25 results per call, so that's why I have to iterate through the range.
I need to know the number of results returned (for the range) to use as my numresults variable. Right now I am just doing the same search in my browser and manually inputting the result.
I want to iterate through multiple countries or search terms, so I need to pass the value "totalResults" found in the JSON into numresults.
The problem is I don't understand how to extract this value.
Can I do this right after the call (where would the JSON be stored), or do I need to create the JSON file first?
Here is my working scraper:
import requests

api_url = 'http://api.indeed.com/ads/apisearch?publisher=XXXXXXXXXXX&v=2&limit=100000&format=json'
Country = 'au'
SearchTerm = 'Insight'
number = -25
numresults = 3925
# numresults must match the actual number of job results, rounded down to the nearest multiple of 25, or the last page will repeat over and over
# so if there are 392 results, then put 375
for number in range(-25, numresults, 25):
    url = api_url + '&co=' + Country + '&q=' + SearchTerm + '&start=' + str(number + 25)
    response = requests.get(url)
    f = open(SearchTerm + '_' + Country + '.json', 'a')
    f.write(response.content)
    f.close()
    print 'Complete', url
Here is a sample of the returned JSON:
{
    "version" : 2,
    "query" : "Pricing",
    "location" : "",
    "dupefilter" : true,
    "highlight" : true,
    "start" : 1,
    "end" : 25,
    "totalResults" : 1712,
    "pageNumber" : 0,
    "results" : [
        {
            "jobtitle" : "New Energy Technical Specialist",
            "company" : "Rheem",
            etc.
Why not use the Python json module?
import json

# inside the loop, after the request:
json_content = json.loads(response.content)
print(json_content["version"])  # should display 2
Be careful: check beforehand that the content returned by the request really is JSON. The docs are here: https://docs.python.org/2/library/json.html
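To tie this back to the question: once the response is parsed, totalResults (visible in the sample JSON above) can size the range directly, so numresults no longer has to be typed in by hand. A rough Python 3 sketch along those lines, keeping the question's URL pieces (the publisher value stays elided):
import requests

api_url = 'http://api.indeed.com/ads/apisearch?publisher=XXXXXXXXXXX&v=2&limit=100000&format=json'
Country = 'au'
SearchTerm = 'Insight'

# First call: parse the JSON and read totalResults to size the loop.
first = requests.get(api_url + '&co=' + Country + '&q=' + SearchTerm + '&start=0').json()
numresults = first['totalResults']

# Page through in steps of 25, appending each raw response to a file.
for start in range(0, numresults, 25):
    url = api_url + '&co=' + Country + '&q=' + SearchTerm + '&start=' + str(start)
    response = requests.get(url)
    with open(SearchTerm + '_' + Country + '.json', 'a') as f:
        f.write(response.text)
    print('Complete', url)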

I seem to be getting an incorrect tweet count with my code

For some reason, I'm only getting 100 tweets from this code. According to the Twitter API, I believe I should be getting 1500.
What am I doing incorrectly here?
Specifically the issue in question is this:
twiturl = "http://search.twitter.com/search.json?q=" + urlinfo + "&rpp=99&page=15" + "&since_id=" + str(tweetdate)
for x in arg1:
    urlinfo = x[2]
    idnum = int(x[1])
    name = x[0]
    twiturl = "http://search.twitter.com/search.json?q=" + urlinfo + "&rpp=99&page=15" + "&since_id=" + str(tweetdate)
    response = urllib2.urlopen(twiturl)
    twitseek = simplejson.load(response)
    twitsearch = twitseek['results']
    tweets = [x['text'] for x in twitsearch]
    tweetlist = [tweets, name]
    namelist.append(tweetlist)
The item that should be in x[2] is just a word or phrase like "I am" or "I am feeling", changed into a URL-friendly encoding.
The docs for the Twitter Search API state:
rpp (optional): The number of tweets to return per page, up to a max of 100.
and
page (optional): The page number (starting at 1) to return, up to a max of roughly 1500 results (based on rpp * page).
Accordingly, you should make multiple requests, each with a different page number for up to 100 tweets for each request:
import urllib, json

twiturl = "http://search.twitter.com/search.json?q=%s&rpp=99&page=%d"

def getmanytweets(topic):
    'Return a list of up to 1500 tweets'
    results = []
    for page in range(1, 16):
        u = urllib.urlopen(twiturl % (topic, page))
        data = u.read()
        u.close()
        t = json.loads(data)
        results += t['results']
    return results

if __name__ == '__main__':
    import pprint
    pprint.pprint(getmanytweets('obama'))
The maximum number of results returned on a single page of results is 100. In order to get all results, you need to 'page' through them by using the next_page URL that is included in the response (see here for the documentation). You can then loop through the responses, calling the next_page argument of each one until the argument is no longer present (indicating that you have collected all of the results).
import json
import urllib
import urllib2

# General query stub
url_stub = 'http://search.twitter.com/search.json'

# Parameters to pass
params = {
    'q': 'tennis',
    'rpp': 100,
    'result_type': 'mixed'
}

# Variable to store our results
results = []

# Outside of our loop, we pull the first page of results
# The '?' is included in the 'next_page' parameter we receive
# later, so here we manually add it
resp = urllib2.urlopen('{0}?{1}'.format(url_stub, urllib.urlencode(params)))
contents = json.loads(resp.read())
results.extend(contents['results'])

# Now we loop until there is either no longer a 'next_page' variable
# or until we max out our number of results
while 'next_page' in contents:
    # Print some random information
    print 'Page {0}: {1} results'.format(
        contents['page'], len(contents['results']))
    # Capture the HTTPError that will appear once the results have maxed
    try:
        resp = urllib2.urlopen(url_stub + contents['next_page'])
    except urllib2.HTTPError:
        print 'No mas'
        break
    # Load our new contents
    contents = json.loads(resp.read())
    # Extend our results
    results.extend(contents['results'])

# Print out how many results we received
print len(results)
Output:
Page 1: 103 results
Page 2: 99 results
Page 3: 100 results
Page 4: 100 results
Page 5: 100 results
Page 6: 100 results
Page 7: 100 results
Page 8: 99 results
Page 9: 98 results
Page 10: 95 results
Page 11: 100 results
Page 12: 99 results
Page 13: 99 results
Page 14: 100 results
Page 15: 100 results
No mas
1492
