I am paginating through an API that uses two different pagination methods: a next token and a next timestamp.
For reasons unknown to me, sometimes the next token returned at the end of a call is the same as the current one, which leaves me stuck in an endless loop. The same happens if I use the next timestamp.
However, I have noticed that this could be avoided if I use a combination of the two.
This is the loop I am currently running:
while int(JSONContent['next'][0:10]) > unixtime_yesterday:
    try:
        url = 'www.website.com?next' + JSONContent['next'][0:10] + 'api_key'
        JSONContent = requests.request("GET", url, headers=headers).json()
        temp_df = json_normalize(JSONContent['data'])
        df = df.append(temp_df, ignore_index=True, sort=False)
    except ValueError:
        print('There was a JSONDecodeError')
This is a typical value of the JSONContent['next'] field. The first 10 characters are the timestamp and the characters after the hyphen are the other token:
'1650377727-3feWs8592va'
How can I check whether the next timestamp is the same as the current one, so that I can use the token instead of the timestamp?
In layman's terms I want to do the following:

if JSONContent['next'][0:10][current_cycle] == JSONContent['next'][0:10][next_cycle]:
    token = JSONContent['next'][11:22][next_cycle]
else:
    token = JSONContent['next'][0:10][next_cycle]
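One way to express that intent is to remember the timestamp from the previous cycle and compare it with the current one. A minimal sketch, assuming df, headers, unixtime_yesterday and an initial JSONContent already exist as in the question (prev_timestamp and cursor are names introduced here for illustration, and the URL layout is only indicative):

import requests
from pandas import json_normalize

prev_timestamp = None  # timestamp seen on the previous cycle
while int(JSONContent['next'][0:10]) > unixtime_yesterday:
    timestamp, token = JSONContent['next'].split('-', 1)
    # fall back to the token whenever the timestamp did not move on
    cursor = token if timestamp == prev_timestamp else timestamp
    prev_timestamp = timestamp
    url = 'www.website.com?next' + cursor + 'api_key'
    try:
        JSONContent = requests.request("GET", url, headers=headers).json()
    except ValueError:
        print('There was a JSONDecodeError')
        break
    temp_df = json_normalize(JSONContent['data'])
    df = df.append(temp_df, ignore_index=True, sort=False)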
If you want just the "next next" before moving on to the next iteration, send another request and check for equality between the next and the next next.
import time

time_stamp = time.time()
using_token = False
while using_token or int(time_stamp) > unixtime_yesterday:
    try:
        url = 'www.website.com?next' + JSONContent['next'][0:10] + 'api_key'
        JSONContent = requests.request("GET", url, headers=headers).json()
        next_url = 'www.website.com?next' + JSONContent['next'][0:10] + 'api_key'
        next_JSONContent = requests.request("GET", next_url, headers=headers).json()
        if JSONContent['next'][0:10] == next_JSONContent['next'][0:10]:
            using_token = True
        else:
            time_stamp = JSONContent['next'][0:10]
            using_token = False
        temp_df = json_normalize(JSONContent['data'])
        df = df.append(temp_df, ignore_index=True, sort=False)
    except ValueError:
        print('There was a JSONDecodeError')
You can also initialize using_token to True, but that breaks the clean-code convention for naming.
I am trying to make a call to an API and then grab event_ids from the data. I then want to use those event ids as variables in another request, then parse that data. Then loop back and make another request using the next event id in the event_id variable for all the IDs.
So far I have the following:
def nba_odds():
    url = "https://xxxxx.com.au/sports/summary/basketball?api_key=xxxxx"
    response = requests.get(url)
    data = response.json()
    event_ids = []
    for event in data['Events']:
        if event['Country'] == 'USA' and event['League'] == 'NBA':
            event_ids.append(event['EventID'])
    # print(event_ids)
    game_url = f'https://xxxxx.com.au/sports/detail/{event_ids}?api_key=xxxxx'
    game_response = requests.get(game_url)
    game_data = game_response.json()
    print(game_url)
That gives me the result below in the terminal.
https://xxxxx.com.au/sports/detail/['dbx-1425135', 'dbx-1425133', 'dbx-1425134', 'dbx-1425136', 'dbx-1425137', 'dbx-1425138', 'dbx-1425139', 'dbx-1425140', 'anyvsany-nba01-1670043600000000000', 'dbx-1425141', 'dbx-1425142', 'dbx-1425143', 'dbx-1425144', 'dbx-1425145', 'dbx-1425148', 'dbx-1425149', 'dbx-1425147', 'dbx-1425146', 'dbx-1425150', 'e95270f6-661b-46dc-80b9-cd1af75d38fb', '0c989be7-0802-4683-8bb2-d26569e6dcf9']?api_key=779ac51a-2fff-4ad6-8a3e-6a245a0a4cbb
The URL above should instead be formatted like
https://xxxx.com.au/sports/detail/dbx-1425135
If anyone can point me in the right direction it would be appreciated.
Thanks.
You need to loop over the event IDs and call the API once per event_id, since it does not support multiple event_ids in a single request:
all_events_response = []
for event_id in event_ids:
    game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
    game_response = requests.get(game_url)
    game_data = game_response.json()
    all_events_response.append(game_data)
    print(game_url)
You will then find the list of JSON responses in all_events_response.
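For example, a minimal usage sketch (the 'Odds' key below is hypothetical; substitute whatever fields your detail payload actually contains):

for event_id, game_data in zip(event_ids, all_events_response):
    # inspect each per-event response alongside its id
    print(event_id, game_data.get('Odds'))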
event_ids is an entire list of event ids. You make a single URL with the full list converted to its string view (['dbx-1425135', 'dbx-1425133', ...]). But it looks like you want to get information on each event in turn. To do that, put the second request in the loop so that it runs for every event you find interesting.
def nba_odds():
    url = "https://xxxxx.com.au/sports/summary/basketball?api_key=xxxxx"
    response = requests.get(url)
    data = response.json()
    event_ids = []
    for event in data['Events']:
        if event['Country'] == 'USA' and event['League'] == 'NBA':
            event_id = event['EventID']
            # print(event_id)
            game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
            game_response = requests.get(game_url)
            game_data = game_response.json()
            # do something with game_data - it will be overwritten
            # on the next round in the loop
            print(game_url)
I am trying to open up several URLs (because they contain data I want to append to a list). I have logic saying "if amount in icl_dollar_amount_l" then run the rest of the code. However, I want the script to only run the rest of the code on that specific amount in the variable amount.
Example:
Selenium opens up X amount of links and sees ['144,827.95', '5,199,024.87', '130,710.67'] in icl_dollar_amount_l, but I want it to skip '144,827.95' and '5,199,024.87' and only get the information for '130,710.67', which is already in the amount variable.
Actual results:
It's getting web scraping information for amount '144,827.95' only and not even going on to '5,199,024.87' or '130,710.67'. I only want it to get web scraping information for '130,710.67' because my amount variable has this as the only amount.
print(icl_dollar_amount_l)
['144,827.95', '5,199,024.87', '130,710.67']
print(amount)
'130,710.67'
file2.py
def scrapeBOAWebsite(url, fcg_subject_l, gp_subject_l):
    from ICL_Awk_Checker import rps_amount_l2
    icl_dollar_amount_l = []
    amount_ack_missing_l = []
    file_total_l = []
    body_l = []
    for link in url:
        print(link)
        browser = webdriver.Chrome(options=options,
                                   executable_path=r'\\TEST\user$\TEST\Documents\driver\chromedriver.exe')
        # if 'P2 Cust ID 908554 File' in fcg_subject:
        browser.get(link)
        username = browser.find_element_by_name("dialog:username").get_attribute('value')
        submit = browser.find_element_by_xpath("//*[@id='dialog:continueButton']").click()
        body = browser.find_element_by_xpath("//*[contains(text(), 'Total:')]").text
        body_l.append(body)
        icl_dollar_amount = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
        icl_dollar_amount_l.append(icl_dollar_amount)

    if not missing_amount:
        logging.info("List is empty")
        print("List is empty")

    count = 0
    for amount in missing_amount:
        if amount in icl_dollar_amount_l:
            body = body_l[count]
            get_file_total = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
            file_total_l.append(get_file_total)

    return icl_dollar_amount_l, file_date_l, company_id_l, client_id_l, customer_name_l, file_name_l, file_total_l, \
        item_count_l, file_status_l, amount_ack_missing_l
I don't know if I fully understand the problem, but this
if amount in icl_dollar_amount_l:
doesn't tell you at which position '130,710.67' sits in icl_dollar_amount_l; you also need
count = icl_dollar_amount_l.index(amount)
for amount in missing_amount:
    if amount in icl_dollar_amount_l:
        count = icl_dollar_amount_l.index(amount)
        body = body_l[count]
But that only works if you expect a given amount to appear once in the list icl_dollar_amount_l. For repeated elements you would have to use a for-loop and check every element separately:
for amount in missing_amount:
    for count, item in enumerate(icl_dollar_amount_l):
        if amount == item:
            body = body_l[count]
But frankly, I don't know why you don't check it in the first loop (for link in url:), where you have direct access to icl_dollar_amount and body; a sketch of that idea follows.
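A minimal sketch of that check inside the scraping loop, assuming missing_amount is already known before the loop runs (as it appears to be in the question) and keeping the question's own extraction code:

for link in url:
    browser.get(link)
    body = browser.find_element_by_xpath("//*[contains(text(), 'Total:')]").text
    icl_dollar_amount = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
    icl_dollar_amount_l.append(icl_dollar_amount)
    # decide right here, while body is still at hand
    if icl_dollar_amount in missing_amount:
        file_total_l.append(icl_dollar_amount)  # same value that get_file_total re-extracted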
Background: I'm attempting to create a dataframe using data called from Twitch's API. They only allow 100 records per call so with each pull a new Pagination Cursor is offered in order to move on to the next page. I'm using the following code to try and efficiently pull this data rather than manually adjusting the after=(pagination value) in the get response. Right now the variable I'm trying to make dynamic is the 'Pagination' variable but it only gets updated once the loop finishes - not helpful! Take a look below and see if you notice anything I can change to achieve this goal. Any help is appreciated!
TwitchTopGamesDataFrame = [] #This is our Data List
BaseURL = 'https://api.twitch.tv/helix/games/top?first=100'
Headers = {'client-id':'lqctse0orgdbs5gdf5faz665api03r','Authorization': 'Bearer a1yl09mwmnwetp6ovocilheias8pzt'}
Indent = 2
Pagination = ''
FullURL = BaseURL + Pagination
Response = requests.get(FullURL,headers=Headers)
iterations = 1 # Data records returned are equivalent to iterations x100
#Loop: Response, Convert JSON data, Append to Data List, Get Pagination & Replace String in Variable - Iterate until 300 records
while iterations <= 3:
    #Grab JSON Data, Convert, & Append
    ResponseJSONData = Response.json()
    #print(pgn) - Debug
    pd.set_option('display.max_rows', None)
    TopGamesDF = pd.DataFrame(ResponseJSONData['data'])
    TopGamesDF = TopGamesDF[['id','name']]
    TopGamesDF = TopGamesDF.rename(columns={'id':'GameID','name':'GameName'})
    TopGamesDF['Rank'] = TopGamesDF.index + 1
    TwitchTopGamesDataFrame.append(TopGamesDF)
    #print(FullURL) - Debug
    #Grab & Replace Pagination Value
    ResponseJSONData['pagination']
    RPagination = pd.DataFrame(ResponseJSONData['pagination'],index=[0])
    pgn = str('&after='+RPagination.to_string(index=False,header=False).strip())
    Pagination = pgn
    #print(FullURL) - Debug
    iterations += 1

TwitchTopGamesDataFrame
Figured it out:
def top_games(page_count):
    from time import gmtime, strftime
    strftime("%Y-%m-%d %H:%M:%S", gmtime())
    print("Time of Execution:", strftime("%Y-%m-%d %H:%M:%S", gmtime()))

    #In order to condense the code above and be more efficient, a while/for loop would work great.
    #Goal: Run a While Loop to create a larger DataFrame through Pagination as the Twitch API only allows for 100 records per call.
    baseURL = 'https://api.twitch.tv/helix/games/top?first=100' #Base URL
    Headers = {'client-id':'lqctse0orgdbs5gdf5faz665api03r','Authorization': 'Bearer a1yl09mwmnwetp6ovocilheias8pzt'}
    Indent = 2
    Pagination = ''
    FullURL = baseURL + Pagination
    Response = requests.get(FullURL,headers=Headers)
    start_count = 0
    count = 0 # Data records returned are equivalent to iterations x100
    max_count = page_count

    #Loop: Response, Convert JSON data, Append to Data List, Get Pagination & Replace String in Variable - Iterate until 300 records
    while count <= max_count:
        #Grab JSON Data, Extend List
        FullURL = baseURL + Pagination
        Response = requests.get(FullURL,headers=Headers)
        ResponseJSONData = Response.json()
        pd.set_option('display.max_rows', None)
        if count == start_count:
            TopGamesDFL = ResponseJSONData['data']
        if count > start_count:
            i = ResponseJSONData['data']
            TopGamesDFL.extend(i)
        #Grab & Replace Pagination Value
        RPagination = pd.DataFrame(ResponseJSONData['pagination'],index=[0])
        pgn = str('&after='+RPagination.to_string(index=False,header=False).strip())
        Pagination = pgn
        count += 1
        if count == max_count:
            FinalDataFrame = pd.DataFrame(TopGamesDFL)
            FinalDataFrame = FinalDataFrame[['id','name']]
            FinalDataFrame = FinalDataFrame.rename(columns={'id':'GameID','name':'GameName'})
            FinalDataFrame['Rank'] = FinalDataFrame.index + 1
            return FinalDataFrame
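As a side note, a leaner way to pull the cursor is sketched below; it assumes the Helix response shape {'pagination': {'cursor': '...'}} described in Twitch's documentation, rather than going through a DataFrame round-trip:

cursor = ResponseJSONData.get('pagination', {}).get('cursor', '')
Pagination = '&after=' + cursor if cursor else ''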
The code below errors out when trying to execute this line
"RptTime = TimeTable[0].xpath('//text()')"
I'm not sure why: I can see that TimeTable has a value in my variable window, but the HtmlElement TimeTable[0] has no value, even though content.cssselect returns a value at the time of assignment. Why then would I get the error "list index out of range"? That tells me the element is empty. I am trying to get the year/month value in that field.
import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type":"text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type":"text"})
    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslinks = [
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']
df = pd.DataFrame()
df2 = pd.DataFrame()

for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    TimeTable = content.cssselect('td[headers="view-dlf-2-report-period-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    RptTime = TimeTable.xpath('//text()')
    dfl = pd.DataFrame(headers, columns=['links'])
    dft = pd.DataFrame(RptTime, columns=['ReportTime'])
    df = df.append(dfl)
    df2 = df.append(dft)
Error
src\lxml\etree.pyx in lxml.etree._Element.__getitem__()
IndexError: list index out of range
Look carefully at your last line: df2 = df.append(dft). You append dft to df rather than to df2, and since DataFrame.append returns a new frame instead of modifying one in place, df2 ends up holding df plus dft, which is almost certainly not what you intended. This is not the cause of your traceback, but I thought it my duty to tell you anyway.
Could you please tell us exactly what is returned by the cssselect call? Although you said it returned something, you only store index [0] of the return value, so if it were hypothetically ["", "something"], the value stored would be empty.
It is quite likely that the problem you are having is that you indexed [0] twice, when you probably only meant to do it once.
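For what it's worth, a minimal sketch of that single-indexing idea, keeping the question's selector but avoiding the second [0] by iterating over every matching cell:

for cmslink in cmslinks:
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    # walk every report-period cell instead of taking cell [0] and indexing into it again
    for cell in content.cssselect('td[headers="view-dlf-2-report-period-table-column"]'):
        rpt_time = cell.text_content().strip()  # flattens all text nodes inside the cell
        print(rpt_time)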
I have a program to call an API, return the JSON data and write it to a CSV.
The program loops through a list of entities as the first parameter in the API call, but also now needs to loop through a second parameter set (start and end times in epoch) because the API has a max of pulling a week of data at a time.
Example:
API call: ex.com/api/entity/timecards?start_time=1531306800&end_time=1531846800&pretty=1
So I need to loop through all of the entities, and then loop through an entire year's worth of data, a week at a time.
Code example so far for the API call function:
def callAPI(entities):
    for j in range(len(entities)):
        locnum = entities[j][:5]
        locnumv = entities[j]
        startTime =
        endTime =
        url = "http://ex.com/api/entity/" + entities[j] + "/timecards?start_time=" + startTime + "&end_time=" + endTime
        querystring = {"pretty":"1"}
        headers = {
            'Api-Key': ""
        }
        r = requests.request("GET", url, headers=headers, params=querystring)
        d = r.json()
The program then goes on to write the data to rows in a CSV, which is all successful when tested with looping through the entities with static time parameters.
So I just need to figure out how to create another nested for loop to step the start/end times by 518,400 seconds (6 days instead of 7, to be safe), and how to factor in a timeout, since this is effectively going to be 20,000+ API calls by the time it's all said and done.
First of all, because you are just using j to get the current entity, you could replace for j in range(len(entities)) with for entity in entities; it reads better. As for the question, you could just use an inner for loop to iterate over each week. The whole code would be:
def callAPI(entities):
    for entity in entities:
        locnum = entity[:5]
        locnumv = entity  # This is redundant
        START = 1531306800  # starting time, e.g. 1 year ago
        END = START + 31536000  # final time, e.g. the current time
        TIME_STEP = 518400  # step size: 6 days here; could be 1 day, 1 week, ...
        for start_time in range(START, END, TIME_STEP):
            end_time = start_time + TIME_STEP - 1  # subtract 1 to prevent overlapping times
            url = "http://ex.com/api/entity/%s/timecards?start_time=%d&end_time=%d" % (entity, start_time, end_time)
            querystring = {"pretty":"1"}
            headers = {'Api-Key': ""}
            try:
                r = requests.request("GET", url, headers=headers, params=querystring)
            except:
                break
            d = r.json()
            # Do something with the data
I hope this can help you!!
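On the timeout part of the question, which neither snippet here addresses: a minimal, hedged sketch of throttling the roughly 20,000 calls (the helper name fetch, the one-second pause and the 10-second request timeout are my own choices, to be adjusted to the API's rate limits):

import time
import requests

def fetch(url, headers, querystring, pause=1.0, timeout=10):
    # sleep between calls so the API is not hammered, and give up
    # on any single request after `timeout` seconds
    time.sleep(pause)
    r = requests.get(url, headers=headers, params=querystring, timeout=timeout)
    return r.json()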
First off, you can just do:
for entity in entities:
instead of:
for j in range(len(entities)):
and then use entity instead of entities[j]
When it comes to looping through your epoch times, you will have to set your start time and then set your end time to start_time + 540000 inside another loop:
start_time = 1531306800
i = 0
while True:
    if i != 0:
        start_time = end_time
    end_time = start_time + 540000
    url = "http://ex.com/api/entity/" + entity + "/timecards?start_time=" + str(start_time) + "&end_time=" + str(end_time)
    querystring = {"pretty":"1"}
    headers = {'Api-Key': ""}
    try:
        r = requests.request("GET", url, headers=headers, params=querystring)
    except:
        break
    d = r.json()
    i += 1
Basically, you loop through all of the epoch times until a request fails. Once it does, you exit the loop and go to your next entity. Each new entity's URL starts from the same initial epoch time as the entity before it, and so on.
I hope that helps!