Creating a pandas DataFrame in a while loop with exception handling, then being unable to read the DataFrame afterwards - python

This is my first post here. I had a Python script for some algorithmic trading. It worked fine on Coinbase, but when I tried switching part of the code to work with Binance it stopped working. Below is the error I get. I made sure to copy and paste the DataFrame name so it is exactly the same. I'm not sure why the print statement is unable to print the DataFrame. I'll also paste the part of the code that isn't working, in case someone can get it to run. In my actual code, instead of the print, this is where I start calculating the moving averages, but I get the same error with just print, so for some reason the DataFrame isn't being created. Any help will be appreciated.
Error Encountered
Traceback (most recent call last):
  File "test2.py", line 44, in <module>
    print(historic_df)
NameError: name 'historic_df' is not defined
The code:
import requests
import json
import time
import pandas as pd

currency = 'BTCUSD'
base = 'https://api.binance.com'
endpoint = '/api/v1/klines'
params = '?&symbol='+currency+'&interval=1h'
url = base + endpoint + params

#---------------------------------------------------------------------------------------------------
### Begin Loop and get Historic Data ###
while True:
    try:
        # Pulls the historical data from binance. The interval is defined in params variable
        data = requests.get(url)
        dictionary = json.loads(data.text)
        # The line below just puts the historic data we pulled into a data frame
        historic_df = pd.DataFrame.from_dict(dictionary)
        historic_df = historic_df.drop(range(6, 12), axis=1)
        # This gives the columns meaningful names based on what we pull
        historic_df.columns = ['time', 'open', 'high', 'low', 'close', 'volumne']
        # Changing the columns to floats
        historic_df['open'] = historic_df['open'].astype(float)
        historic_df['high'] = historic_df['high'].astype(float)
        historic_df['low'] = historic_df['low'].astype(float)
        historic_df['close'] = historic_df['close'].astype(float)
        historic_df['volume'] = historic_df['volume'].astype(float)
        # Get latest data and show to the user for reference
        btc_price = bi_client.get_symbol_ticker(symbol=currency)
        currentPrice = float(btc_price['price'])
    except:
        print("Error Encountered")
    print(historic_df)

The problem comes from how you handle your errors. It originates at the line json.loads(data.text): data does not contain valid JSON. Because you catch every exception with a bare except, you only print "Error Encountered", and historic_df is never created, since the code fails before reaching that assignment.
You should write something more like this:
import requests
import json
import time
import pandas as pd
from requests import HTTPError

currency = 'BTCUSD'
base = 'https://api.binance.com'
endpoint = '/api/v1/klines'
params = '?&symbol=' + currency + '&interval=1h'
url = base + endpoint + params

# ---------------------------------------------------------------------------------------------------
### Begin Loop and get Historic Data ###
while True:
    try:
        # Pulls the historical data from binance. The interval is defined in params variable
        data = requests.get(url)
        data.raise_for_status()
        dictionary = data.json()
        # The line below just puts the historic data we pulled into a data frame
        historic_df = pd.DataFrame.from_dict(dictionary)
        historic_df = historic_df.drop(range(6, 12), axis=1)
        # ... the rest of your processing goes here ...
    except HTTPError as err:
        print(err)
    except Exception as err:
        print(err)
For more detail, you should read the Python tutorial on errors and exceptions.
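If you want to keep the print inside the loop, a try/except/else layout makes sure historic_df is only touched after a successful fetch, so the NameError cannot occur. This is just a minimal sketch reusing the imports and url defined above; the sleep interval is an assumption, not something from the original post.
while True:
    try:
        data = requests.get(url)
        data.raise_for_status()
        historic_df = pd.DataFrame.from_dict(data.json())
    except (requests.RequestException, ValueError) as err:
        # network errors, bad status codes, or invalid JSON all land here
        print("Error Encountered:", err)
    else:
        # only runs when the request and parsing succeeded
        print(historic_df)  # or start calculating the moving averages here
    time.sleep(60)  # assumed polling interval; adjust to your needs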

Related

Trouble with Gate.io API call

I'm working on Python code that updates and appends token price and volume data from gate.io's API to a .csv file. Basically it checks whether the file is up to date and, if not, appends the most recent hour's data. The code below isn't throwing any errors, but it's not working either. My columns are all in the same order as they are in the code. Any assistance would be greatly appreciated, thank you
import requests
import pandas as pd
from datetime import datetime
# Define API endpoint and parameters
host = "https://api.gateio.ws"
prefix = "/api/v4"
url = '/spot/candlesticks'
currency_pair = "BTC_USDT"
interval = "1h"
# Read the existing data from the csv file
df = pd.read_csv("price_calcs.csv")
# Extract the last timestamp from the csv file
last_timestamp = df["time1"].iloc[-1]
# Convert the timestamp to datetime and add an hour to get the new "from" parameter
from_time = datetime.utcfromtimestamp(last_timestamp).strftime('%Y-%m-%d %H:%M:%S')
to_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
# Use the last timestamp to make a 'GET' request to the API to get the latest hourly data for the token
query_params = {"currency_pair": currency_pair, "from": from_time, "to": to_time, "interval": interval}
r = requests.get(host + prefix + url, params=query_params)
# Append the new data to the existing data from the csv file
new_data = pd.DataFrame(r.json(), columns=["time1", "volume1", "close1", "high1", "low1", "open1", "volume2"])
df = pd.concat([df, new_data])
# Write the updated data to the csv file
df.to_csv("price_calcs.csv", index=False)
Never mind, I figured it out myself.
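The asker did not share the fix, but one likely culprit is worth noting: gate.io's /spot/candlesticks endpoint takes the from/to parameters as Unix timestamps in seconds rather than formatted date strings. A minimal sketch under that assumption, reusing the column layout from the question (verify it against the actual response):
import time
import requests
import pandas as pd

df = pd.read_csv("price_calcs.csv")
last_timestamp = int(df["time1"].iloc[-1])

query_params = {
    "currency_pair": "BTC_USDT",
    "interval": "1h",
    "from": last_timestamp + 3600,  # one hour after the last stored candle
    "to": int(time.time()),         # now, as a Unix timestamp in seconds
}
r = requests.get("https://api.gateio.ws/api/v4/spot/candlesticks", params=query_params)
r.raise_for_status()
# column names taken from the question's code
new_data = pd.DataFrame(r.json(), columns=["time1", "volume1", "close1", "high1", "low1", "open1", "volume2"])
pd.concat([df, new_data]).to_csv("price_calcs.csv", index=False)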

Python API Call: JSON to Pandas DF

I'm working on pulling data from a public API and converting the response JSON to a pandas DataFrame. I've written the code to pull the data and got a successful JSON response. The issue I'm having is parsing the file and converting the data to a DataFrame. Whenever I run my for loop, I get a DataFrame with 1 row when it should have approximately 2500 rows and 6 columns. I've copied and pasted my code below:
Things to note:
I've commented out my api key with "api_key".
I'm new(ish) to python so I understand that my code formatting might not be best practices. I'm open to changes.
Here is the link to the API that I am requesting from: https://developer.va.gov/explore/facilities/docs/facilities?version=current
facilities_data = pd.DataFrame(columns=['geometry_type', 'geometry_coordinates', 'id', 'facility_name', 'facility_type', 'facility_classification'])

# function that will make the api call and sort through the json data
def get_facilities_data(facilities_data):
    # Make API Call
    res = requests.get('https://sandboxapi.va.gov/services/va_facilities/v0/facilities/all', headers={'apikey': 'api_key'})
    data = json.loads(res.content.decode('utf-8'))
    time.sleep(1)
    for facility in data['features']:
        geometry_type = data['features'][0]['geometry']['type']
        geometry_coordinates = data['features'][0]['geometry']['coordinates']
        facility_id = data['features'][0]['properties']['id']
        facility_name = data['features'][0]['properties']['name']
        facility_type = data['features'][0]['properties']['facility_type']
        facility_classification = data['features'][0]['properties']['classification']
    # Save data into pandas dataframe
    facilities_data = facilities_data.append(
        {'geometry_type': geometry_type, 'geometry_coordinates': geometry_coordinates,
         'facility_id': facility_id, 'facility_name': facility_name, 'facility_type': facility_type,
         'facility_classification': facility_classification}, ignore_index=True)
    return facilities_data

facilities_data = get_facilities_data(facilities_data)
print(facilities_data)
As mentioned, you should
loop over facility instead of data['features'][0]
append within the loop
This will get you the result you are after.
facilities_data = pd.DataFrame(columns=['geometry_type', 'geometry_coordinates', 'id', 'facility_name', 'facility_type', 'facility_classification'])

def get_facilities_data(facilities_data):
    # Make API Call
    res = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
                       headers={"apikey": "REDACTED"})
    data = json.loads(res.content.decode('utf-8'))
    time.sleep(1)
    for facility in data['features']:
        geometry_type = facility['geometry']['type']
        geometry_coordinates = facility['geometry']['coordinates']
        facility_id = facility['properties']['id']
        facility_name = facility['properties']['name']
        facility_type = facility['properties']['facility_type']
        facility_classification = facility['properties']['classification']
        # Save data into pandas dataframe
        facilities_data = facilities_data.append(
            {'geometry_type': geometry_type, 'geometry_coordinates': geometry_coordinates,
             'facility_id': facility_id, 'facility_name': facility_name, 'facility_type': facility_type,
             'facility_classification': facility_classification}, ignore_index=True)
    return facilities_data

facilities_data = get_facilities_data(facilities_data)
print(facilities_data.head())
There are some more things we can improve upon:
json() can be called directly on the requests output
time.sleep() is not needed
appending to a DataFrame on each iteration is discouraged; we can collect the data in another way and create the DataFrame afterwards.
Implementing these improvements results in:
def get_facilities_data():
    data = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
                        headers={"apikey": "REDACTED"}).json()
    facilities_data = []
    for facility in data["features"]:
        facility_data = (facility["geometry"]["type"],
                         facility["geometry"]["coordinates"],
                         facility["properties"]["id"],
                         facility["properties"]["name"],
                         facility["properties"]["facility_type"],
                         facility["properties"]["classification"])
        facilities_data.append(facility_data)
    facilities_df = pd.DataFrame(data=facilities_data,
                                 columns=["geometry_type", "geometry_coords", "id", "name", "type", "classification"])
    return facilities_df
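For completeness, pandas can also flatten this kind of nested JSON directly with pd.json_normalize. The dotted column names below are assumptions based on the keys accessed in the question, so verify them against the actual response. A sketch:
import requests
import pandas as pd

def get_facilities_df():
    data = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
                        headers={"apikey": "REDACTED"}).json()
    # json_normalize joins nested keys with dots, e.g. properties.name
    df = pd.json_normalize(data["features"])
    return df[["geometry.type", "geometry.coordinates", "properties.id",
               "properties.name", "properties.facility_type", "properties.classification"]]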

Issues converting json data to a dataframe

I am using the pushshift api to gather posts in a Reddit subreddit. But some of the returned data from the request is throwing an error which is:
"None of [Index(['author', 'id', 'title', 'score', 'created_utc', 'permalink',\n 'num_comments'],\n dtype='object')] are in the [columns]"
I understand that it means that some of the columns have spaces in them, but the thing is that I am running this request over 112 different days and 94 times it succeeds. So I am struggling to figure out how to fix it, since the request should ideally return the same format of data every time; the only thing I am changing is the range of days for before and after.
I cannot fit the entire json object which is returned but you can check it for yourself by running this:
import requests
from pprint import pprint
link = "https://api.pushshift.io/reddit/search/submission/?size=100&before=111d&after=112d&sort_type=num_comments&sort=desc&subreddit=wallstreetbets"
sample = requests.get(link).json()
pprint(sample)
Full code:
import pandas as pd
import requests

for i in reversed(range(2, 113)):
    after_ = i
    before_ = i - 1
    link = f"https://api.pushshift.io/reddit/search/submission/?size=100&before={before_}d&after={after_}d&sort_type=num_comments&sort=desc&subreddit=wallstreetbets"
    path = f"/content/drive/MyDrive/UsersWSB/Posts/days_after_{after_}.csv"
    try:
        sample = requests.get(link).json()
        df = pd.DataFrame(sample['data'])
        # I only need these columns from the entire df
        df = df[['author', 'id', 'title', 'score', 'created_utc', 'permalink', 'num_comments']]
        df.to_csv(path)
    except Exception as e:
        print(e)
In your full code you are not testing for the expected conditions:
whether the response is an error (status_code != 200)
whether the response JSON actually returned any data
import pandas as pd
import requests

if True:
    df = pd.DataFrame()
    link = "https://api.pushshift.io/reddit/search/submission/"
    for i in reversed(range(2, 113)):
        after_ = i
        before_ = i - 1
        params = {"size": 100, "before": f"{before_}d", "after": f"{after_}d",
                  "sort_type": "num_comments", "sort": "desc", "subreddit": "wallstreetbets"}
        # DL is very slow, only DL if haven't done so already
        if len(df) == 0 or len(df.loc[df.before.eq(before_) & df.after.eq(after_)]) == 0:
            req = requests.get(link, params=params)
            # check request succeeded and it returned some data...
            if req.status_code == 200 and "data" in req.json().keys() and len(req.json()["data"]) > 0:
                df = pd.concat([df,
                                pd.json_normalize(req.json()['data'])
                                  .loc[:, ['author', 'id', 'title', 'score', 'created_utc', 'permalink', 'num_comments']]
                                  .assign(before=before_, after=after_)])
            else:
                print(f"{req.status_code} {params} {'error' if req.status_code != 200 else req.json()}")
    print(f"***DONE*** downloaded: {len(df)}")

Scraping text from a header and Class tag

The code below errors out when trying to execute this line
"RptTime = TimeTable[0].xpath('//text()')"
I'm not sure why. I can see that TimeTable has a value in my variable window, but the HtmlElement TimeTable[0] has no value, even though content.cssselect returned a value at the time of assignment. Why then would I get the error "list index out of range"? That tells me the element is empty. I am trying to get the Year-Month value in that field.
import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})
    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslinks = [
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

df = pd.DataFrame()
df2 = pd.DataFrame()
for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    TimeTable = content.cssselect('td[headers="view-dlf-2-report-period-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    RptTime = TimeTable.xpath('//text()')
    dfl = pd.DataFrame(headers, columns=['links'])
    dft = pd.DataFrame(RptTime, columns=['ReportTime'])
    df = df.append(dfl)
    df2 = df.append(dft)
Error
src\lxml\etree.pyx in lxml.etree._Element.__getitem__()
IndexError: list index out of range
Look carefully at your last line, df2 = df.append(dft). Your explanation and code are hard to follow because of the lost indentation and the truncated traceback, but that line is almost certainly not what you intended: you probably meant df2 = df2.append(dft). Note that, unlike list.append, pandas' DataFrame.append does return a new DataFrame rather than modifying in place, so assigning the result is correct; the only mistake is which frame you append to. That is not what raises the error, though, I just thought it my duty to tell you anyway.
The IndexError itself comes from indexing [0] twice. content.cssselect(...) already returns a list and you take its [0] at assignment, so linkTable and TimeTable are single HtmlElements. Writing linkTable[0] then asks lxml for that element's first child, and when the cell has no child elements lxml raises "list index out of range". You most likely only meant to index [0] once.
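A sketch of that extraction step without the double indexing, meant to replace the body of the for loop above. It iterates over every matching cell and skips pages where nothing matched; the .// prefix scopes each XPath to the current cell instead of the whole document. The selectors are taken from the question, but the page layout is an assumption, so verify the output.
link_cells = content.cssselect('td[headers="view-dlf-1-title-table-column"]')
time_cells = content.cssselect('td[headers="view-dlf-2-report-period-table-column"]')
if not link_cells or not time_cells:
    print("no report rows found on " + cmslink)
else:
    headers = [href for cell in link_cells
               for href in cell.xpath(".//a[contains(text(),'Contract Summary') or "
                                      "contains(text(),'Monthly Enrollment by CPSC')]/@href")]
    RptTime = [t.strip() for cell in time_cells for t in cell.xpath(".//text()") if t.strip()]
    df = df.append(pd.DataFrame(headers, columns=['links']))
    df2 = df2.append(pd.DataFrame(RptTime, columns=['ReportTime']))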

Random "IndexError: list index out of range "

I am trying to scrape a site that returns its data via Javascript. The code I wrote using BeautifulSoup works pretty well, but at random points during scraping I get the following error:
Traceback (most recent call last):
  File "scraper.py", line 48, in <module>
    accessible = accessible[0].contents[0]
IndexError: list index out of range
Sometimes I can scrape 4 urls, sometimes 15, but at some point the script eventually fails and gives me the above error. I can find no pattern behind the failing, so I'm really at a loss here - what am I doing wrong?
from bs4 import BeautifulSoup
import urllib
import urllib2
import jabba_webkit as jw
import csv
import string
import re
import time

countries = csv.reader(open("countries.csv", 'rb'), delimiter=",")
database = csv.writer(open("herdict_database.csv", 'w'), delimiter=',')

basepage = "https://www.herdict.org/explore/"
session_id = "indepth;jsessionid=C1D2073B637EBAE4DE36185564156382"
ccode = "#fc=IN"
end_date = "&fed=12/31/"
start_date = "&fsd=01/01/"
year_range = range(2009, 2011)
years = [str(year) for year in year_range]

def get_number(var):
    number = re.findall("(\d+)", var)
    if len(number) > 1:
        thing = number[0] + number[1]
    else:
        thing = number[0]
    return thing

def create_link(basepage, session_id, ccode, end_date, start_date, year):
    link = basepage + session_id + ccode + end_date + year + start_date + year
    return link

for ccode, name in countries:
    for year in years:
        link = create_link(basepage, session_id, ccode, end_date, start_date, year)
        print link
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")
        accessible = soup.find_all("em", class_="accessible")
        inaccessible = soup.find_all("em", class_="inaccessible")
        accessible = accessible[0].contents[0]
        inaccessible = inaccessible[0].contents[0]
        acc_num = get_number(accessible)
        inacc_num = get_number(inaccessible)
        print acc_num
        print inacc_num
        database.writerow([name]+[year]+[acc_num]+[inacc_num])
        time.sleep(2)
You need to add error-handling to your code. When scraping a lot of websites, some will be malformed, or somehow broken. When that happens, you'll be trying to manipulate empty objects.
Look through the code, find every place where you assume an operation succeeded, and check for errors there.
For that specific case, I would do this:
if not inaccessible or not accessible:
    # malformed page
    continue
soup.find_all("em", class_="accessible") is probably returning an empty list. You can try:
if accessible:
    accessible = accessible[0].contents[0]
or more generally:
if accessible and inaccessible:
    accessible = accessible[0].contents[0]
    inaccessible = inaccessible[0].contents[0]
else:
    print 'Something went wrong!'
    continue
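One way to keep those guards in a single place is a small helper (the name is made up for illustration) that returns None when find_all comes back empty; the commented usage lines show where it would slot into the loop from the question.
def first_text(results):
    # return the first tag's first content, or None when the result list is empty
    return results[0].contents[0] if results else None

# inside the scraping loop:
# accessible = first_text(soup.find_all("em", class_="accessible"))
# inaccessible = first_text(soup.find_all("em", class_="inaccessible"))
# if accessible is None or inaccessible is None:
#     continue  # the page did not render the expected elements; skip it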
