How to iterate over dataframe rows for individual API calls - python

I'm trying to set up a loop to pull in weather data for an entire year for the roughly 500 weather stations in my dataframe. The base URL stays the same; the only part that changes is the weather station ID.
I'd like to create a dataframe with the results. I believe I'd use requests.get to pull in data for all the weather stations in my list; the IDs to use in the URL are in a column called "API ID" in my dataframe. I am a Python beginner, so any help would be appreciated! My code is below, but it doesn't work and returns this error:
InvalidSchema: No connection adapters were found for '0 " http://www.ncei.noaa.gov/access/services/data/...\nName: API ID, Length: 497, dtype: object'
def callAPI(API_id):
    for IDs in range(len(API_id)):
        url = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=PRCP,SNOW,TMAX,TMIN&stations=' + distances['API ID'] + '&startDate=2020-01-01&endDate=2020-12-31&includeAttributes=0&includeStationName=true&units=standard&format=json')
        r = requests.request('GET', url)
        d = r.json()

ll = []
for index1, rows1 in distances.iterrows():
    station = rows1['Closest Station']
    API_id = rows1['API ID']
    data = callAPI(API_id)
    ll.append([(data)])

I am not sure about your whole code base, but this is the function that will return the data from the API. The error happens because your loop concatenates the entire distances['API ID'] Series into the URL instead of the single API_id passed in, so requests receives the string representation of the whole Series rather than a valid URL. If you have multiple station IDs in a dataframe column, you can use a for loop; otherwise there is no need.
Also, you are not returning the result from the function. Note the return keyword at the end of the function.
Working code:
import requests

def callAPI(API_id):
    url = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=PRCP,SNOW,TMAX,TMIN&stations=' + API_id + '&startDate=2020-01-01&endDate=2020-12-31&includeAttributes=0&includeStationName=true&units=standard&format=json')
    r = requests.request('GET', url)
    d = r.json()
    return d

print(callAPI('USC00457180'))
So your full code will be something like this:
def callAPI(API_id):
    url = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=PRCP,SNOW,TMAX,TMIN&stations=' + API_id + '&startDate=2020-01-01&endDate=2020-12-31&includeAttributes=0&includeStationName=true&units=standard&format=json')
    r = requests.request('GET', url)
    d = r.json()
    return d

ll = []
for index1, rows1 in distances.iterrows():
    station = rows1['Closest Station']
    API_id = rows1['API ID']
    data = callAPI(API_id)
    ll.append(data)  # each element is one station's list of records
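If the goal is a single dataframe of all results, a minimal sketch (assuming each call returns a JSON list of daily-record dicts, as format=json suggests) would flatten ll before building the frame:

import pandas as pd

# each element of ll is one station's list of record dicts
all_records = [record for station_records in ll for record in station_records]
results_df = pd.DataFrame(all_records)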
Note: Even better use asynchronous calls to the API to make the process faster. Something like this: https://stackoverflow.com/a/56926297/1138192
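For example, a minimal concurrency sketch using the standard library's concurrent.futures (threads rather than asyncio, but the same idea of overlapping the waits on the network):

from concurrent.futures import ThreadPoolExecutor

ids = distances['API ID'].tolist()
with ThreadPoolExecutor(max_workers=10) as pool:
    # callAPI is the function defined above; results come back in input order
    results = list(pool.map(callAPI, ids))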

Related

How to change variable in a while loop from hardcoded to incorporate pandas iterrows loop?

I have the following while loop to scrape info from a platform:
while result_count != 0:
    start_at = "startAt=" + str(start_index)
    url = base_url + toget + "&" + start_at + "&" + max_results
    response = requests.get(url, auth=(username, password))
    json_response = json.loads(response.text)
    print(json_response)
    page_info = json_response["meta"]["pageInfo"]
    start_index = page_info["startIndex"] + allowed_results
    result_count = page_info["resultCount"]
    items2 = json_response["data"]
    print(items2)
The 'toget' variable is a dataframe which includes different IDs.
I need the 'toget' variable to iterate through all elements of the pandas dataframe column, returning a different ID each time, as this is the only way to scrape all the information properly.
import pandas as pd
toget = {'id': [3396750, 3396753, 3396755, 3396757, 3396759]}
Just add the for loop to iterate through your list and use that variable in the url.
A few other things I'd clean up here:
I would use f-string syntax for the url; how you had it is fine, it's just preference, as I think f-strings are easier to read
No need to use the json package to read in the response; you can get the JSON straight from response.json() (see the code below)
I'm also making an assumption here that you are setting initial values for variables like max_results, as this code will throw an error about undefined variables once it enters the loop. Note that result_count and start_index need to be reset for each ID, so the code below does that at the top of the for loop.
Code:
import pandas as pd

toget = {'id': [3396750, 3396753, 3396755, 3396757, 3396759]}

for eachId in toget['id']:
    result_count = 1  # reset so the while loop runs again for each id
    start_index = 0   # each id pages from the start
    while result_count != 0:
        start_at = "startAt=" + str(start_index)
        url = f'{base_url}{eachId}&{start_at}&{max_results}'
        response = requests.get(url, auth=(username, password))
        json_response = response.json()
        print(json_response)
        page_info = json_response["meta"]["pageInfo"]
        start_index = page_info["startIndex"] + allowed_results
        result_count = page_info["resultCount"]
        items2 = json_response["data"]
        print(items2)
If you need to loop through a pandas DataFrame, then recommend reviewing this post: How to iterate over rows in a DataFrame in Pandas
The code in your question declares toget as a dict, not a DataFrame. If that's the case, then you can use the code below to loop through it:
Looping through Dict
toget = {'id': [3396750, 3396753, 3396755, 3396757, 3396759]}

for i in toget.get('id'):
    print(i)
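If toget really is a DataFrame with an 'id' column, a sketch of the equivalent loop (assuming that column name) is:

import pandas as pd

toget_df = pd.DataFrame({'id': [3396750, 3396753, 3396755, 3396757, 3396759]})
for each_id in toget_df['id']:
    # iterating the column directly avoids iterrows when only one column is needed
    print(each_id)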

Processing API data (json) into a singular data frame (list of list of dictionaries)?

So this is a somewhat of a continuation from a previous post of mine except now I have API data to work with. I am trying to get keys Type and Email as columns in a data frame to come up with a final number. My code:
jsp_full = []
for p in payloads:
    payload = {"payload": {"segmentId": p}}
    r = requests.post(url, headers=header, json=payload)
    #print(r, r.reason)
    time.sleep(r.elapsed.total_seconds())
    json_data = r.json() if r and r.status_code == 200 else None
    json_keys = json_data['payload']['supporters']
    json_package = []
    jsp_full.append(json_package)
    for row in json_keys:
        SID = row['supporterId']
        Handle = row['contacts']
        a_key = 'value'
        list_values = [a_list[a_key] for a_list in Handle]
        string = str(list_values).split(",")
        data = {
            'SupporterID': SID,
            'Email': strip_characters(string[-1]),
            'Type': labels(p)
        }
        json_package.append(data)
    t2 = round(time.perf_counter(), 2)
    b_key = "Email"
    e = len([b_list[b_key] for b_list in json_package])
    t = str(labels(p))
    #print(json_package)
    print(f'There are {e} emails in the {t} segment')
    print(f'Finished in {t2 - t1} seconds')
    excel = pd.DataFrame(json_package)
    excel.to_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(t, str(today)), sheet_name=t)
This part works all well and good. Each payload in the API represents a different segment of people, so I split them out into different files. However, I am at a point where I need to combine all records into a single data frame, hence why I append out to jsp_full, which is a list of lists of dictionaries.
Once I have that I would run the balance of my code which is like this:
S = pd.DataFrame(jsp_full[0], index={0})
Advocacy_Supporters = S.sort_values("Type").groupby("Type", as_index=False)["Email"].first()
print(Advocacy_Supporters['Email'].count())
print("The number of Unique Advocacy Supporters is:")
Advocacy_Supporters_Group = Advocacy_Supporters.groupby("Type")["Email"].nunique()
print(Advocacy_Supporters_Group)
Some sample data:
[{'SupporterID': '565f6a2f-c7fd-4f1b-bac2-e33976ef4306', 'Email': 'somebody#somewhere.edu', 'Type': 'd_Student Ambassadors'}, {'SupporterID': '7508dc12-7647-4e95-a8b8-bcb067861faf', 'Email': 'someoneelse#email.somewhere.edu', 'Type': 'd_Student Ambassadors'}, ...]
My desired output is a dataframe that looks like so:
SupporterID                           Email                            Type
565f6a2f-c7fd-4f1b-bac2-e33976ef4306  somebody#somewhere.edu           d_Student Ambassadors
7508dc12-7647-4e95-a8b8-bcb067861faf  someoneelse#email.somewhere.edu  d_Student Ambassadors
Any help is greatly appreciated!!
So because this code creates an excel file for each segment, all I did was read back in the excels via a for loop like so:
filesnames = ['e_S Donors', 'b_Contributors', 'c_Activists', 'd_Student Ambassadors', 'a_Volunteers', 'f_Offline Action Takers']

S = pd.DataFrame()
for i in filesnames:
    data = pd.read_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(i, str(today)), sheet_name=i, engine='openpyxl')
    S = S.append(data)
This did the trick since it was in a format I already wanted.
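Alternatively, since jsp_full already holds every segment's records, the Excel round trip can be skipped entirely. A sketch (assuming jsp_full is the list of lists of dictionaries built above):

import pandas as pd

# flatten the list of lists into one list of record dicts, then build the frame
all_supporters = [record for segment in jsp_full for record in segment]
combined = pd.DataFrame(all_supporters)

On newer pandas versions DataFrame.append is deprecated in favor of pd.concat, which is another reason to build the frame in a single step.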

Python - Using a While Loop, how can I update a variable string to be used in a function

Background: I'm attempting to create a dataframe using data called from Twitch's API. They only allow 100 records per call so with each pull a new Pagination Cursor is offered in order to move on to the next page. I'm using the following code to try and efficiently pull this data rather than manually adjusting the after=(pagination value) in the get response. Right now the variable I'm trying to make dynamic is the 'Pagination' variable but it only gets updated once the loop finishes - not helpful! Take a look below and see if you notice anything I can change to achieve this goal. Any help is appreciated!
TwitchTopGamesDataFrame = [] #This is our Data List
BaseURL = 'https://api.twitch.tv/helix/games/top?first=100'
Headers = {'client-id':'lqctse0orgdbs5gdf5faz665api03r','Authorization': 'Bearer a1yl09mwmnwetp6ovocilheias8pzt'}
Indent = 2
Pagination = ''
FullURL = BaseURL + Pagination
Response = requests.get(FullURL, headers=Headers)
iterations = 1 # Data records returned are equivalent to iterations x100

#Loop: Response, Convert JSON data, Append to Data List, Get Pagination & Replace String in Variable - Iterate until 300 records
while count <= 3:
    #Grab JSON Data, Convert, & Append
    ResponseJSONData = Response.json()
    #print(pgn) - Debug
    pd.set_option('display.max_rows', None)
    TopGamesDF = pd.DataFrame(ResponseJSONData['data'])
    TopGamesDF = TopGamesDF[['id','name']]
    TopGamesDF = TopGamesDF.rename(columns={'id':'GameID','name':'GameName'})
    TopGamesDF['Rank'] = TopGamesDF.index + 1
    TwitchTopGamesDataFrame.append(TopGamesDF)
    #print(FullURL) - Debug
    #Grab & Replace Pagination Value
    RPagination = pd.DataFrame(ResponseJSONData['pagination'], index=[0])
    pgn = str('&after=' + RPagination.to_string(index=False, header=False).strip())
    Pagination = pgn
    #print(FullURL) - Debug
    iterations += 1

TwitchTopGamesDataFrame
Figured it out:
def top_games(page_count):
    from time import gmtime, strftime
    print("Time of Execution:", strftime("%Y-%m-%d %H:%M:%S", gmtime()))
    #In order to condense the code above and be more efficient, a while loop works great.
    #Goal: Run a while loop to create a larger DataFrame through pagination, as the Twitch API only allows 100 records per call.
    baseURL = 'https://api.twitch.tv/helix/games/top?first=100' #Base URL
    Headers = {'client-id':'lqctse0orgdbs5gdf5faz665api03r','Authorization': 'Bearer a1yl09mwmnwetp6ovocilheias8pzt'}
    Pagination = ''
    start_count = 0
    count = 0 # Data records returned are equivalent to pages x100
    max_count = page_count
    #Loop: request, convert JSON data, extend the list, then grab & replace the pagination value until max_count pages are fetched
    while count <= max_count:
        #Grab JSON data, extend the list
        FullURL = baseURL + Pagination
        Response = requests.get(FullURL, headers=Headers)
        ResponseJSONData = Response.json()
        pd.set_option('display.max_rows', None)
        if count == start_count:
            TopGamesDFL = ResponseJSONData['data']
        if count > start_count:
            TopGamesDFL.extend(ResponseJSONData['data'])
        #Grab & replace the pagination value
        RPagination = pd.DataFrame(ResponseJSONData['pagination'], index=[0])
        Pagination = '&after=' + RPagination.to_string(index=False, header=False).strip()
        count += 1
        if count == max_count:
            FinalDataFrame = pd.DataFrame(TopGamesDFL)
            FinalDataFrame = FinalDataFrame[['id','name']]
            FinalDataFrame = FinalDataFrame.rename(columns={'id':'GameID','name':'GameName'})
            FinalDataFrame['Rank'] = FinalDataFrame.index + 1
            return FinalDataFrame
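A quick usage sketch (assuming requests and pandas are imported and the credentials above are still valid):

# pulls page_count pages of 100 games each and returns one DataFrame
df = top_games(3)
print(df.head())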

Using data from one request for second url request in same script python

What I have so far is the first request gathering Id's. I would then like to use that return draftgroupid to insert into the second url request. Is it possible to send two requests in the same script, and if so how would I do a for i in range(draftgroupid) in the second url request?
import requests
import json
req1 = requests.get(url="https://www.draftkings.com/lobby/getcontests?sport=NHL")
req.raise_for_status()
data = req.json()

for i, contest in enumerate(data['DraftGroups']):
    draftgroupid = contest['DraftGroupId']
Output of draftgroupid:
16901
16905
16902
16903
req2 = requests.get(url="https://api.draftkings.com/draftgroups/v1/draftgroups/THEVALUEIWANTTOLOOPTHROUGH/draftables?format=json")
EDIT
import csv
import requests
import json
req = requests.get(url="https://www.draftkings.com/lobby/getcontests?sport=NHL")
req.raise_for_status()
data = req.json()

for i, contest in enumerate(data['DraftGroups']):
    draftgroupid = contest['DraftGroupId']
    req2 = requests.get(url="https://api.draftkings.com/draftgroups/v1/draftgroups/" + str(draftgroupid) + "/draftables?format=json")
    data2 = req2.json
    for i, player_info in enumerate(data2['draftables'][0]):
        date = player_info['competition']['startTime']
        print(date)
Running into a TypeError: 'method' object is not subscriptable
As I understand it, your problem is related to string manipulation rather than to the requests library. (The TypeError in your edit, by the way, comes from data2 = req2.json, which references the method without calling it; it should be data2 = req2.json().)
So basically,
import requests
import json
req = requests.get(url="https://www.draftkings.com/lobby/getcontests?sport=NHL")
req.raise_for_status()
data = req.json()

for i, contest in enumerate(data['DraftGroups']):
    draftgroupid = contest['DraftGroupId']
    requests.get(url="https://api.draftkings.com/draftgroups/v1/draftgroups/" + str(draftgroupid) + "/draftables?format=json")
should do the job.
More elegant ways to concatenate strings can be found at http://www.pythonforbeginners.com/concatenation/string-concatenation-and-formatting-in-python
Edit
For example,
"some string " + str(123)
"some string %d" % 123
"some string %s" % 123
will all give the same output. There are more ways to concatenate strings; you just need to choose the best fit based on the context.
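A modern alternative is an f-string (Python 3.6+), which gives the same output:

f"some string {123}"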
for i, contest in enumerate(data['DraftGroups']):
    draftgroupid = contest['DraftGroupId']
    req2 = requests.get(url="https://api.draftkings.com/draftgroups/v1/draftgroups/%d/draftables?format=json" % draftgroupid)
I assume you didn't actually mean for i in range(draftgroupid) as you stated in the question, because that would mean making 16901 requests, followed by 16905 requests (all of which except the last four would be duplicates of the first batch), followed by 16902 requests (of which all would be duplicates), etc.

Split large JSON file into batches of 100 at a time to run through an API

I am so close to being done with this tool I am developing, but as a junior developer with NO senior programmer to work with, I am stuck. I have a Python script that takes data from our database, converts it to JSON, and runs it through an address validation API. I have it all working, but the API only accepts 100 objects at a time, so I need to break up the file of X objects into batches of 100, run them, and store the results in the same output file. Here is a snippet of my script structure:
for row in rows:
    d = collections.OrderedDict()
    d['input_id'] = str(row.INPUT_ID)
    d['addressee'] = row.NAME
    d['street'] = row.ADDRESS
    d['city'] = row.CITY
    d['state'] = row.STATE
    d['zipcode'] = row.ZIP
    d['candidates'] = row.CANDIDATES
    obs_list.append(d)
json.dump(obs_list, file)

ids_file = '.csv'
cur.execute(input_ids)
columns = [i[0] for i in cur.description]
ids_input = cur.fetchall()
#ids_csv = csv.writer(with open('.csv','w',newline=''))
with open('.csv','w',newline='') as f:
    ids_csv = csv.writer(f,delimiter=',')
    ids_csv.writerow(columns)
    ids_csv.writerows(ids_input)

print('Run through API')
url = 'https://api.'
headers = {'content-type': 'application/json'}

#this is where I assume I need to do the loop to break it up
with open('.json', 'r') as run:
    dict_run = run.readlines()
    dict_ready = (''.join(dict_run))

#lost :(
for object in dict_ready:
    # do something with object to only run 100 at a time
    r = requests.post(url, data=dict_ready, headers=headers)
    ss_output = r.text
    output = 'C:\\Users\\TurnerC1\\Desktop\\ss_output.json'
    with open(output,'w') as of:
        of.write(ss_output)
At the moment I have about 4,000 of these in a file to be run through an API that only accepts 100 at a time. I'm sure there is an easy answer; I am just burnt out doing this by myself lol. Any help is greatly appreciated.
sample json:
[
    {
        "street": "1 Santa Claus",
        "city": "North Pole",
        "state": "AK",
        "candidates": 10
    },
    {
        "addressee": "Apple Inc",
        "street": "1 infinite loop",
        "city": "cupertino",
        "state": "CA",
        "zipcode": "95014",
        "candidates": 10
    }
]
As I read it, you have a range of objects, and you want to break it up into subsets, each of which is no more than 100 items.
Assuming that dict_ready is a list of objects (if it's not, modify your code to make it so):
count = 100
subsets = (dict_ready[x:x + count] for x in range(0, len(dict_ready), count))

for subset in subsets:
    # json= serializes each 100-item subset for us
    r = requests.post(url, json=subset, headers=headers)
Try this as your second chunk of code:
with open('.json', 'r') as run:
    dict_run = json.load(run)

ss_output = []
for i in range(0, len(dict_run), 100):
    # send only 100 records at a time
    dict_ready = json.dumps(dict_run[i:i + 100])
    r = requests.post(url, data=dict_ready, headers=headers)
    ss_output.extend(r.json())

output = 'C:\\Users\\TurnerC1\\Desktop\\ss_output.json'
with open(output, 'w') as of:
    json.dump(ss_output, of)
So, assuming the rest of your code works, this will give the API a break every 100 rows for 10 seconds. You will need to import the time module.
import time

# assumes dict_ready is a list of record dicts, posted one at a time
for i, obj in enumerate(dict_ready):
    r = requests.post(url, data=json.dumps([obj]), headers=headers)
    if i % 100 == 0:
        time.sleep(10)
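Combining the two ideas above, a minimal sketch (assuming dict_run is the full list of record dicts loaded with json.load, and url/headers are defined as in the question) that sends batches of 100 with a short pause between calls:

import json
import time
import requests

results = []
for i in range(0, len(dict_run), 100):
    batch = dict_run[i:i + 100]  # at most 100 objects per request
    r = requests.post(url, data=json.dumps(batch), headers=headers)
    results.extend(r.json())
    time.sleep(1)  # brief pause between batches to be gentle on the API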
