Working with Google (and any) API calls for the first time, I'm continually hitting a rate limit threshold despite having limited my rate. How would I go about changing the following code to a batch format to avoid this?
# API call function
from ratelimit import limits, sleep_and_retry
import requests
import pandas as pd
from googleapiclient.discovery import build

@sleep_and_retry
@limits(calls=1, period=4.5)
def pull_sheet_data(SCOPE, SPREADSHEET_ID, DATA_TO_PULL):
    creds = gsheet_api_check(SCOPE)
    service = build('sheets', 'v4', credentials=creds)
    sheet = service.spreadsheets()
    result = sheet.values().get(
        spreadsheetId=SPREADSHEET_ID,
        range=DATA_TO_PULL).execute()
    values = result.get('values', [])
    if not values:
        print('No data found.')
    else:
        # Note: this second get() repeats the same request, doubling the API calls per tab
        rows = sheet.values().get(spreadsheetId=SPREADSHEET_ID,
                                  range=DATA_TO_PULL).execute()
        data = rows.get('values')
        print("COMPLETE: Data copied")
        return data

# List files in the active brews folder (drive is an authenticated PyDrive GoogleDrive instance)
activebrews = drive.ListFile({'q': "'0BxU70FB_wb-Da0x5amtYbUkybXc' in parents"}).GetList()
tabs = ['Brew Log', 'Fermentation Log', 'Centrifuge Log']
brewsheetdict = {}

# Pulls data from the entire spreadsheet tab
for i in activebrews:
    for j in tabs:
        # Set spreadsheet parameters
        SCOPE = ['https://www.googleapis.com/auth/drive.readonly']
        SPREADSHEET_ID = i['id']
        data = pull_sheet_data(SCOPE, SPREADSHEET_ID, j)
        dftitle = str(i['title'])
        dftab = str(j)
        dfname = dftitle + '_' + dftab
        brewsheetdict[dfname] = pd.DataFrame(data)
Thanks!
Thanks to Tanaike's suggestion, I settled on the following solution. There are still issues I haven't resolved. Namely, the resulting API calls occurred at a rate of 0.5/second, well below the published limit, but anything faster would still trigger rate limiting. Additionally, the code finished after executing the for loops on only half of the list each time, requiring repeated runs. To work around the second problem, I included the indicated line to remove list items after they were iterated over, so rerunning the code began with the last incomplete record each time.
import time
import pickle

SCOPE = ['https://www.googleapis.com/auth/drive.readonly']
recordcount = 0
sleeptime = 2.2
start = time.time()

for i in activebrews:
    SPREADSHEET_ID = i['id']
    for j in tabs:
        data = pull_sheet_data(SCOPE, SPREADSHEET_ID, j)
        dftitle = str(i['title'])
        dftab = str(j)
        dfname = dftitle + '_' + dftab
        brewsheetdict[dfname] = pd.DataFrame(data)
        recordcount += 1
        time.sleep(sleeptime)
        end = time.time()
        print(recordcount, dfname, (end - start) / recordcount)
    # Remove the list item after it has been iterated over; note that mutating the list
    # while looping over it is also what makes the loop skip roughly every other item
    activebrews.remove(i)
    time.sleep(1)

brewsheetdata = open("brewsheetdata.pkl", "wb")
pickle.dump(brewsheetdict, brewsheetdata)
brewsheetdata.close()
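For reference, the batching the original question asked about could look roughly like the sketch below: the Sheets API values().batchGet method fetches several ranges in a single call, so each spreadsheet needs only one request instead of one per tab. This reuses the gsheet_api_check helper and the activebrews/tabs variables from above; it is an untested sketch, not the code that was actually run.

from googleapiclient.discovery import build

def pull_tabs_batched(scope, spreadsheet_id, ranges):
    creds = gsheet_api_check(scope)
    service = build('sheets', 'v4', credentials=creds)
    # batchGet returns one ValueRange per requested range, in the same order
    result = service.spreadsheets().values().batchGet(
        spreadsheetId=spreadsheet_id, ranges=ranges).execute()
    return result.get('valueRanges', [])

# One API call per spreadsheet instead of one per tab
for i in activebrews:
    for tab, value_range in zip(tabs, pull_tabs_batched(SCOPE, i['id'], tabs)):
        brewsheetdict[str(i['title']) + '_' + tab] = pd.DataFrame(value_range.get('values', []))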
I'm working on pulling data from a public API and converting the response JSON to a pandas DataFrame. I've written the code to pull the data and gotten a successful JSON response. The issue I'm having is parsing the file and converting the data to a DataFrame. Whenever I run my for loop, I get a DataFrame that returns 1 row when it should be returning approximately 2500 rows and 6 columns. I've copied and pasted my code below:
Things to note:
I've replaced my API key with "api_key".
I'm new(ish) to python so I understand that my code formatting might not be best practices. I'm open to changes.
Here is the link to the API that I am requesting from: https://developer.va.gov/explore/facilities/docs/facilities?version=current
import requests
import json
import time
import pandas as pd

facilities_data = pd.DataFrame(columns=['geometry_type', 'geometry_coordinates', 'id', 'facility_name', 'facility_type', 'facility_classification'])

# Function that will make the API call and sort through the JSON data
def get_facilities_data(facilities_data):
    # Make API call
    res = requests.get('https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all',
                       headers={'apikey': 'api_key'})
    data = json.loads(res.content.decode('utf-8'))
    time.sleep(1)
    for facility in data['features']:
        geometry_type = data['features'][0]['geometry']['type']
        geometry_coordinates = data['features'][0]['geometry']['coordinates']
        facility_id = data['features'][0]['properties']['id']
        facility_name = data['features'][0]['properties']['name']
        facility_type = data['features'][0]['properties']['facility_type']
        facility_classification = data['features'][0]['properties']['classification']
    # Save data into pandas dataframe
    facilities_data = facilities_data.append(
        {'geometry_type': geometry_type, 'geometry_coordinates': geometry_coordinates,
         'facility_id': facility_id, 'facility_name': facility_name, 'facility_type': facility_type,
         'facility_classification': facility_classification}, ignore_index=True)
    return facilities_data

facilities_data = get_facilities_data(facilities_data)
print(facilities_data)
As mentioned, you should
loop over facility instead of data['features'][0]
append within the loop
This will get you the result you are after.
facilities_data = pd.DataFrame(columns=['geometry_type', 'geometry_coordinates', 'id', 'facility_name', 'facility_type', 'facility_classification'])

def get_facilities_data(facilities_data):
    # Make API call
    res = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
                       headers={"apikey": "REDACTED"})
    data = json.loads(res.content.decode('utf-8'))
    time.sleep(1)
    for facility in data['features']:
        geometry_type = facility['geometry']['type']
        geometry_coordinates = facility['geometry']['coordinates']
        facility_id = facility['properties']['id']
        facility_name = facility['properties']['name']
        facility_type = facility['properties']['facility_type']
        facility_classification = facility['properties']['classification']
        # Save data into pandas dataframe
        facilities_data = facilities_data.append(
            {'geometry_type': geometry_type, 'geometry_coordinates': geometry_coordinates,
             'facility_id': facility_id, 'facility_name': facility_name, 'facility_type': facility_type,
             'facility_classification': facility_classification}, ignore_index=True)
    return facilities_data

facilities_data = get_facilities_data(facilities_data)
print(facilities_data.head())
There are some more things we can improve upon:
json() can be called directly on the requests output
time.sleep() is not needed
appending to a DataFrame on each iteration is discouraged; we can collect the data in another way and create the DataFrame afterwards.
Implementing these improvements results in:
def get_facilities_data():
    data = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
                        headers={"apikey": "REDACTED"}).json()
    facilities_data = []
    for facility in data["features"]:
        facility_data = (facility["geometry"]["type"],
                         facility["geometry"]["coordinates"],
                         facility["properties"]["id"],
                         facility["properties"]["name"],
                         facility["properties"]["facility_type"],
                         facility["properties"]["classification"])
        facilities_data.append(facility_data)
    facilities_df = pd.DataFrame(data=facilities_data,
                                 columns=["geometry_type", "geometry_coords", "id", "name", "type", "classification"])
    return facilities_df
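The refactored function can then be called like the earlier version; a minimal usage sketch:

facilities_df = get_facilities_data()
print(facilities_df.head())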
The idea of the code is to add unwatched episodes to an existing playlist in index order (ep 1 of Show X, ep 1 of Show Z, and so on), regardless of air date:
from plexapi.server import PlexServer

baseurl = 'http://0.0.0.0:0000/'
token = '0000000000000'
plex = PlexServer(baseurl, token)

episode = 0
first_ep_name = []
for x in plex.library.section('Anime').search(unwatched=True):
    try:
        for y in plex.library.section('Anime').get(x.title).episodes()[episode]:
            if plex.library.section('Anime').get(x.title).episodes()[episode].isWatched:
                episode += 1
                first_ep_name.append(y)
            else:
                episode = 0
                first_ep_name.append(y)
    except:
        continue

plex.playlist('Anime Playlist').addItems(first_ep_name)
But when I run it, it always adds watched episodes, yet if I debug the code in the Thonny IDE it seems to be doing its job, so I am not sure what's wrong with the code.
Any ideas?
I'm thinking that the error might be here:
plex.playlist('Anime Playlist').addItems(first_ep_name)
but according to the documentation addItems should take a list, and my list "first_ep_name" is already collecting unwatched episodes in the correct order. In theory addItems should recognize the specific episode and not only the series name, but I am not sure anymore.
If somebody out there is having the same issue with plexapi, I was able to find a way to get this project working properly:
from plexapi.server import PlexServer

baseurl = 'insert plex url here'
token = 'plex token here'
plex = PlexServer(baseurl, token)

anime_plex = []
scrapped_playlist = []

for x in plex.library.section('Anime').search(unwatched=True):
    anime_plex.append(x)

while len(anime_plex) > 0:
    episode_list = []
    for y in plex.library.section('Anime').get(anime_plex[0].title).episodes():
        episode_list.append(y)
    ep_checker = True
    while ep_checker:
        if episode_list[0].isWatched:
            episode_list.pop(0)
        else:
            scrapped_playlist.append(episode_list[0])
            episode_list.clear()
            ep_checker = False
    anime_plex.pop(0)

# plex.playlist('Anime Playlist').addItems(scrapped_playlist)
plex.playlist('Anime Playlist').delete()
plex.createPlaylist('Anime Playlist', section='Anime', items=scrapped_playlist)
Basically, what I am doing with that code is looping through each anime series I have; while episode X is watched it gets popped from the list, and the first episode with isWatched False is appended to an empty list that I later use for creating/adding to the playlist.
The last lines of the code can be commented out depending on the purpose: creating the Anime playlist or adding items to it.
I have Python code here which goes into SAP using the RFC_READ_TABLE BAPI, queries the USR02 table, and brings back the results. The input is taken from column A of an Excel sheet and the output is pasted into column B.
The code runs fine. However, for 1000 records it takes approximately 8 minutes to run.
Can you please help in optimizing the code? I am really new at python, managed to write this heavy code but now stuck at the optimization part.
It would be really great if this can run in 1-2 minutes max.
from pyrfc import Connection, ABAPApplicationError, ABAPRuntimeError, LogonError, CommunicationError
from configparser import ConfigParser
from pprint import PrettyPrinter
import openpyxl

ASHOST = '***'
CLIENT = '***'
SYSNR = '***'
USER = '***'
PASSWD = '***'
conn = Connection(ashost=ASHOST, sysnr=SYSNR, client=CLIENT, user=USER, passwd=PASSWD)

try:
    wb = openpyxl.load_workbook('new2.xlsx')
    ws = wb['Sheet1']
    for i in range(1, len(ws['A']) + 1):
        x = ws['A' + str(i)].value
        options = [{'TEXT': "BNAME = '" + x + "'"}]
        fields = [{'FIELDNAME': 'CLASS'}, {'FIELDNAME': 'USTYP'}]
        pp = PrettyPrinter(indent=4)
        ROWS_AT_A_TIME = 10
        rowskips = 0
        while True:
            result = conn.call('RFC_READ_TABLE',
                               QUERY_TABLE='USR02',
                               OPTIONS=options,
                               FIELDS=fields,
                               ROWSKIPS=rowskips, ROWCOUNT=ROWS_AT_A_TIME)
            rowskips += ROWS_AT_A_TIME
            if len(result['DATA']) < ROWS_AT_A_TIME:
                break
        data_result = result['DATA']
        length_result = len(data_result)
        for line in range(0, length_result):
            a = data_result[line]["WA"].strip()
            wb = openpyxl.load_workbook('new2.xlsx')
            ws = wb['Sheet1']
            ws['B' + str(i)].value = a
            wb.save('new2.xlsx')
except CommunicationError:
    print("Could not connect to server.")
    raise
except LogonError:
    print("Could not log in. Wrong credentials?")
    raise
except (ABAPApplicationError, ABAPRuntimeError):
    print("An error occurred.")
    raise
EDIT:
Here is my updated code. For now, I have decided to output the data on the command line only. The output shows where the time is taken.
try:
    output_list = []
    wb = openpyxl.load_workbook('new3.xlsx')
    ws = wb['Sheet1']
    col = ws['A']
    col_lis = [col[x].value for x in range(len(col))]
    length = len(col_lis)
    for i in range(length):
        print("--- %s seconds Start of the loop ---" % (time.time() - start_time))
        x = col_lis[i]
        options = [{'TEXT': "BNAME = '" + x + "'"}]
        fields = [{'FIELDNAME': 'CLASS'}, {'FIELDNAME': 'USTYP'}]
        ROWS_AT_A_TIME = 10
        rowskips = 0
        while True:
            result = conn.call('RFC_READ_TABLE', QUERY_TABLE='USR02', OPTIONS=options, FIELDS=fields, ROWSKIPS=rowskips, ROWCOUNT=ROWS_AT_A_TIME)
            rowskips += ROWS_AT_A_TIME
            if len(result['DATA']) < ROWS_AT_A_TIME:
                break
        print("--- %s seconds in SAP ---" % (time.time() - start_time))
        data_result = result['DATA']
        length_result = len(data_result)
        for line in range(0, length_result):
            a = data_result[line]["WA"]
            output_list.append(a)
    print(output_list)
First, I put timing marks at different places in the code, having divided it into functional sections (SAP processing, Excel processing).
Upon analyzing the timings, I found that most of the runtime is consumed by the Excel-writing code; consider the intervals:
16:52:37.306272
16:52:37.405006 moment it was fetched from SAP
16:52:37.552611 moment it was pushed to Excel
16:52:37.558631
16:52:37.634395 moment it was fetched from SAP
16:52:37.796002 moment it was pushed to Excel
16:52:37.806930
16:52:37.883724 moment it was fetched from SAP
16:52:38.060254 moment it was pushed to Excel
16:52:38.067235
16:52:38.148098 moment it was fetched from SAP
16:52:38.293669 moment it was pushed to Excel
16:52:38.304640
16:52:38.374453 moment it was fetched from SAP
16:52:38.535054 moment it was pushed to Excel
16:52:38.542004
16:52:38.618800 moment it was fetched from SAP
16:52:38.782363 moment it was pushed to Excel
16:52:38.792336
16:52:38.873119 moment it was fetched from SAP
16:52:39.034687 moment it was pushed to Excel
16:52:39.040712
16:52:39.114517 moment it was fetched from SAP
16:52:39.264716 moment it was pushed to Excel
16:52:39.275649
16:52:39.346005 moment it was fetched from SAP
16:52:39.523721 moment it was pushed to Excel
16:52:39.530741
16:52:39.610487 moment it was fetched from SAP
16:52:39.760086 moment it was pushed to Excel
16:52:39.771057
16:52:39.839873 moment it was fetched from SAP
16:52:40.024574 moment it was pushed to Excel
As you can see, the Excel-writing part takes about twice as long as the SAP-querying part.
What is wrong in your code is that you open/initialize the workbook and sheet in each loop iteration; this slows execution a lot and is redundant, as you can reuse the workbook variables from the top.
Another redundant thing is stripping leading and trailing zeroes; it is redundant, as Excel does this automatically for string data.
This variant of the code
try:
    wb = openpyxl.load_workbook('new2.xlsx')
    ws = wb['Sheet1']
    print(datetime.now().time())
    for i in range(1, len(ws['A']) + 1):
        x = ws['A' + str(i)].value
        options = [{'TEXT': "BNAME = '" + x + "'"}]
        fields = [{'FIELDNAME': 'CLASS'}, {'FIELDNAME': 'USTYP'}]
        ROWS_AT_A_TIME = 10
        rowskips = 0
        while True:
            result = conn.call('RFC_READ_TABLE', QUERY_TABLE='USR02', OPTIONS=options, FIELDS=fields, ROWSKIPS=rowskips, ROWCOUNT=ROWS_AT_A_TIME)
            rowskips += ROWS_AT_A_TIME
            if len(result['DATA']) < ROWS_AT_A_TIME:
                break
        data_result = result['DATA']
        length_result = len(data_result)
        for line in range(0, length_result):
            ws['B' + str(i)].value = data_result[line]["WA"]
    wb.save('new2.xlsx')
    print(datetime.now().time())
except ...
gives me the following timestamps for the program run:
>>> exec(open('RFC_READ_TABLE.py').read())
18:14:03.003174
18:16:29.014373
2.5 minutes for 1000 user records, which looks like a fair price for this kind of processing.
In my opinion, the problem is in the while True loop. I think you need to optimize (or change) your query logic. That is hard to judge without knowing what you are interested in from the DB; the other parts look easy and fast.
Something that could help is not opening and closing the file continuously: try to compute your "B" column first and then open the xlsx file and paste it all at once. It could help (but I'm pretty sure the query is the problem).
P.S. Maybe you can use a timing library (like here) to work out WHERE you spend most of the time.
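As an illustration of that suggestion, here is a rough sketch of computing the whole "B" column first and writing the workbook once at the end. It reuses the conn, wb, and ws variables from the code above and assumes one matching user row per query; it is an untested outline, not the original code.

results = []
for i in range(1, len(ws['A']) + 1):
    x = ws['A' + str(i)].value
    result = conn.call('RFC_READ_TABLE', QUERY_TABLE='USR02',
                       OPTIONS=[{'TEXT': "BNAME = '" + x + "'"}],
                       FIELDS=[{'FIELDNAME': 'CLASS'}, {'FIELDNAME': 'USTYP'}],
                       ROWSKIPS=0, ROWCOUNT=10)
    # Keep only the first returned row, or an empty string if the user was not found
    results.append(result['DATA'][0]['WA'].strip() if result['DATA'] else '')

# Write the whole "B" column and save the workbook a single time
for i, value in enumerate(results, start=1):
    ws['B' + str(i)].value = value
wb.save('new2.xlsx')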
I have a question about rate limits.
I take data from a CSV, feed it into the query, and store the output in a list.
I get an error because I make too many requests at once
(I can only make 20 requests per second). How can I stay within the rate limit?
import requests
import pandas as pd

df = pd.read_csv("Data_1000.csv")
list = []

def requestSummonerData(summonerName, APIKey):
    URL = "https://euw1.api.riotgames.com/lol/summoner/v3/summoners/by-name/" + summonerName + "?api_key=" + APIKey
    response = requests.get(URL)
    return response.json()

def main():
    APIKey = (str)(input('Copy and paste your API Key here: '))
    for index, row in df.iterrows():
        summonerName = row['Player_Name']
        responseJSON = requestSummonerData(summonerName, APIKey)
        ID = responseJSON['accountId']
        ID = int(ID)
        list.insert(index, ID)
    df["accountId"] = list
If you already know you can only make 20 requests per second, you just need to work out how long to wait between each request:
Divide 1 second by 20, which should give you 0.05. So you just need to sleep for 0.05 of a second between each request and you shouldn't hit the limit (maybe increase it a bit if you want to be safe).
import time at the top of your file and then time.sleep(0.05) inside of your for loop (you could also just do time.sleep(1/20))
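Applied to the loop in the question, that change could look roughly like the sketch below (the list is renamed to account_ids here to avoid shadowing the built-in list; the pause is the only functional addition):

import time

def main():
    APIKey = str(input('Copy and paste your API Key here: '))
    account_ids = []
    for index, row in df.iterrows():
        responseJSON = requestSummonerData(row['Player_Name'], APIKey)
        account_ids.append(int(responseJSON['accountId']))
        time.sleep(0.05)  # stay under 20 requests per second
    df["accountId"] = account_ids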
I am trying to do reverse geocoding and extract pincodes for lat-long pairs. The .csv file has around 1 million records.
Below is my problem:
1. The Google API fails to give addresses for large record sets and takes a huge amount of time. I will later move this to a batch process.
2. I tried splitting the file into chunks and ran a few files manually one by one (1000 records in each file after splitting), and then I surprisingly got a 100% result.
3. Later, I ran them in a loop one by one and, again, the Google API failed to give the result.
Note: Right now we are looking for free APIs only.
Below is my code:
import requests
import pandas as pd

def reverse_geocode(latlng):
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    key = '&key=' + api_key
    request = request + key
    data = requests.get(request).json()
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

def parse_postal_code(geocode_data):
    if (not geocode_data is None) and ('formatted_address' in geocode_data):
        for component in geocode_data['address_components']:
            if 'postal_code' in component['types']:
                return component['short_name']
    return None

dfinal = pd.DataFrame(columns=colnames)
dmiss = pd.DataFrame(columns=colnames)

for fl in files:
    df = pd.read_csv(fl)
    print('Processing file : ' + fl[36:])
    df['geocode_data'] = ''
    df['Pincode'] = ''
    df['geocode_data'] = df['latlng'].map(reverse_geocode)
    df['Pincode'] = df['geocode_data'].map(parse_postal_code)
    if (len(df[df['Pincode'].isnull()]) > 0):
        d0 = df[df['Pincode'].isnull()]
        print("Missing Pincodes : " + str(len(df[df['Pincode'].isnull()])) + " / " + str(len(df)))
        dmiss.append(d0)
        d0 = df[~df['Pincode'].isnull()]
        dfinal.append(d0)
    else:
        dfinal.append(df)
Can anybody help me out with what the problem in my code is? If any additional info is required, please let me know.
You've run into Google API usage limits.
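One way to work within those limits, sketched against the reverse_geocode function above, is to pace the calls and retry with backoff when the API returns an empty result; the helper below is a hypothetical addition, not part of the original code.

import time

def reverse_geocode_throttled(latlng, retries=3, pause=0.1):
    # Sketch: space out requests and back off when the API returns nothing
    for attempt in range(retries):
        result = reverse_geocode(latlng)
        if result:
            time.sleep(pause)  # small gap between successful calls
            return result
        time.sleep(pause * (2 ** attempt))  # exponential backoff on empty responses
    return {}

# Then map the throttled helper instead of reverse_geocode:
# df['geocode_data'] = df['latlng'].map(reverse_geocode_throttled)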