I am using sanpy to gather crypto market data, compute alpha, beta and R-squared with statsmodels, and then build a crypto = input("Cryptocurrency: ") prompt inside a while loop, so the user can ask for a specific crypto, see its statistics, and then be shown the prompt again.
With the following code I receive the error: ValueError: If using all scalar values, you must pass an index
import san
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import statsmodels.api as sm
from statsmodels import regression
cryptos = ["bitcoin", "ethereum", "ripple", "bitcoin-cash", "tether",
"bitcoin-sv", "litecoin", "binance-coin", "eos", "chainlink",
"monero", "bitcoin-gold"]
def get_and_process_data(c):
raw_data = san.get("daily_closing_price_usd/" + c, from_date="2014-12-31", to_date="2019-12-31", interval="1d") # "query/slug"
return raw_data.pct_change()[1:]
df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})
df['MKT Return'] = df.mean(axis=1) # avg market return
#print(df) # show dataframe with all data
def model(x, y):
    # Calculate r-squared
    X = sm.add_constant(x)  # artificially add intercept to x, as advised in the docs
    model = sm.OLS(y, X).fit()
    rsquared = model.rsquared
    # Fit linear regression and calculate alpha and beta
    X = sm.add_constant(x)
    model = regression.linear_model.OLS(y, X).fit()
    alpha = model.params[0]
    beta = model.params[1]
    return rsquared, alpha, beta
results = pd.DataFrame({c: model(df[df[c].notnull()]['MKT Return'], df[df[c].notnull()][c]) for c in cryptos}).transpose()
results.columns = ['rsquared', 'alpha', 'beta']
print(results)
The error is in the following line:
df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})
I tried solving the issue by changing it to:
df = {c: get_and_process_data(c) for c in cryptos}
df['MKT Return'] = df.mean(axis=1) # avg market return
print(df) # show dataframe with all data
But with that, it gave me a different error: AttributeError: 'dict' object has no attribute 'mean'.
The goal is to create a single DataFrame with a datetime column, one column per crypto holding its pct_change data, and an additional MKT Return column with the daily mean of all the cryptos' pct_change values. Then use all this data to calculate each crypto's statistics and finally build the input loop mentioned at the beginning.
I hope I made myself clear and that someone is able to help me with this matter.
This is a great start, but I think you are getting confused by what san returns. If you look at the following:
import san
import pandas as pd
# List of data we are interested in
cryptos = ["bitcoin", "ethereum", "ripple", "bitcoin-cash", "tether",
"bitcoin-sv", "litecoin", "binance-coin", "eos", "chainlink",
"monero", "bitcoin-gold"]
# function to get the data from san into a dataframe and turn it into
# a daily percentage change
def get_and_process_data(c):
    raw_data = san.get("daily_closing_price_usd/" + c, from_date="2014-12-31", to_date="2019-12-31", interval="1d")  # "query/slug"
    return raw_data.pct_change()[1:]

# now set up an empty dataframe to collect all the data
df = pd.DataFrame()
# cycle through your list
for c in cryptos:
    # get the data as percentage changes
    dftemp = get_and_process_data(c)
    # then add it to the output dataframe df
    df[c] = dftemp['value']
# have a look at what you have
print(df)
And from that point on you know you have some good data and you can play with it as you go forward.
If I could make one suggestion: get just one currency and get the regressions working with that, then move on to cycling through all of them.
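For example, a minimal sketch of that single-currency step (my own illustration, assuming the df built in the loop above and using bitcoin as the test case):
import statsmodels.api as sm

df['MKT Return'] = df.mean(axis=1)       # avg market return, as in your code

one = df[df['bitcoin'].notnull()]        # keep only days where bitcoin has data
X = sm.add_constant(one['MKT Return'])   # regressor plus an intercept column
fit = sm.OLS(one['bitcoin'], X).fit()

print(fit.rsquared)                      # r-squared
print(fit.params['const'])               # alpha (the intercept)
print(fit.params['MKT Return'])          # beta (the slope)
Once that looks right, wrapping it in a loop over cryptos gets you back to your results table.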
You are passing scalar values; you need to pass lists, so try the following:
data = {c: [get_and_process_data(c)] for c in cryptos}
df = pd.DataFrame(data)
Maybe try this first
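To see where that ValueError comes from, here is a toy example of my own (unrelated to san):
import pandas as pd

# pd.DataFrame({'a': 1, 'b': 2})                # raises: If using all scalar values, you must pass an index
df = pd.DataFrame({'a': [1], 'b': [2]})         # wrapping the scalars in lists supplies an index
df = pd.DataFrame({'a': 1, 'b': 2}, index=[0])  # or pass an index explicitly
Note that in your case the dict values are DataFrames rather than scalars, so the other answer's approach of building the frame column by column may fit better.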
I am using sklearn's KNN regressor:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv")#pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
a = theta.get("YearBuilt")
b = theta.get("YrSold")
A = a.to_frame()
B = b.to_frame()
glasses = [A,B]
x = pd.concat(glasses)
#getting target data
y = theta.get("SalePrice")
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y)
I get this error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could someone please explain this? My target data is in the hundreds of thousands and my input data is in the thousands, and there are no blanks in the data.
Before answering the question, let me refactor the code. You are using a DataFrame, so you can index single or multiple fields of the DataFrame without going through the extra steps you've used:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv") # pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
x = theta[["YearBuilt", "YrSold"]] # index multiple fields
#getting target data
y = theta["SalePrice"] # index single field
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y) # fit KNN
Regarding your error: it indicates that you have some NaN, inf, or too-large values in your data. You can make sure these don't occur by filtering out the NaN and inf values like this:
import numpy as np

theta = theta.replace([np.inf, -np.inf], np.nan)
theta.dropna(inplace=True)
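For what it's worth, the NaNs in the original code most likely came from the pd.concat call: by default concat stacks frames vertically (axis=0), so the two one-column frames end up on top of each other with NaNs filling the non-matching column. A toy example of my own:
import pandas as pd

A = pd.DataFrame({'YearBuilt': [2001, 2005]})
B = pd.DataFrame({'YrSold': [2010, 2012]})

print(pd.concat([A, B]))          # stacked rows: NaNs appear in both columns
print(pd.concat([A, B], axis=1))  # side by side: no NaNs, same data as theta[["YearBuilt", "YrSold"]]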
I have installed the Salesforce-Merlion package in my conda environment. Now I want to use my own dataset to run the algorithm for forecasting; I need only one univariate series to forecast. But I cannot figure out how to do that, as there are some variables I cannot work out how to initialize. The example provided on GitHub uses an already split dataset. Can someone help me out here?
The GitHub example for forecasting looks like this:
from merlion.utils import TimeSeries
from ts_datasets.forecast import M4
# Data loader returns pandas DataFrames, which we convert to Merlion TimeSeries
time_series, metadata = M4(subset="Hourly")[0]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
The complete code with their internal dataset is available at the following link:
https://github.com/salesforce/Merlion/tree/main/examples/forecast
(Here they are using their internal dataset M4)
Now, I have to use my dataset. So my code is like this:
from merlion.utils import TimeSeries
df = pd.read_csv(r'C:\Users\Doyel_De_Sarkar\Desktop\forecasting\15786_GIK.csv')
df.dropna(inplace=True)
df['ts'] = pd.to_datetime(df['ts'])
df.sort_values('ts', inplace=True)
trainval = []
for i in range(len(df)):
    if i <= (round((len(df)*0.75),0)):
        trainval.append(True)
    else:
        trainval.append(False)
df['trainval'] = trainval
df = df.drop(columns=['wday', 'hour'])
from merlion.utils import UnivariateTimeSeries
kpi = UnivariateTimeSeries(
    time_stamps=df.ts,     # timestamps in units of seconds
    values=df.saps_total,  # time series values
    name="kpi"             # optional: a name for this univariate
)
kpi_label = UnivariateTimeSeries(
    time_stamps=df.ts,     # timestamps in units of seconds
    values=df.trainval     # time series values
)
from merlion.utils import TimeSeries
time_series, metadata = kpi, kpi_label
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
I am getting the following error:
'UnivariateTimeSeries' object has no attribute 'trainval'
at this line:
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
The reason you're getting this error is that trainval is not an attribute of the UnivariateTimeSeries class. In the GitHub example you shared, metadata is a pandas DataFrame, but you're constructing a UnivariateTimeSeries object as kpi_label.
I'm not sure exactly what your dataset looks like, but try using:
kpi_labels = df.trainval
instead.
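A minimal sketch of that idea (my own sketch, assuming df is indexed by its timestamp column so TimeSeries.from_pd can pick up the dates):
from merlion.utils import TimeSeries

values = df[['saps_total']]   # the series to forecast
mask = df['trainval']         # keep this as a plain boolean Series

train_data = TimeSeries.from_pd(values[mask])
test_data = TimeSeries.from_pd(values[~mask])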
Thank you SalmonKiller for taking the time to look into the issue. The dataset used in the GitHub example has a very weird data structure, hence I had to create the trainval column and set the metadata to df[['trainval']]. The univariate I had created was of no use. The issue was with indexing: after I set the timestamp column as the index, the problem was solved.
Here is the code which is running fine now.
import os
import numpy as np
import pandas as pd
from merlion.models.forecast.smoother import MSESConfig, MSES
from merlion.transform.resample import TemporalResample
from merlion.utils import TimeSeries
df = pd.read_csv(r'<file.csv>')
df['ts'] = pd.to_datetime(df['ts'])
df.set_index('ts', inplace=True)
df.sort_values('ts', inplace=True)
hours = pd.date_range(start=df.index[0], end=df.index[-1], freq='H')
mean = df.saps_total.mean()
df = df.reindex(hours, fill_value=mean)
trainval = []
for i in range(len(df)):
    if i <= (round((len(df)*0.75),0)):
        trainval.append(True)
    else:
        trainval.append(False)
df['trainval'] = trainval
df = df.drop(columns=['wday', 'hour'])
from merlion.utils import TimeSeries
time_series = df[['saps_total']]
metadata = df[['trainval']]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
from merlion.models.forecast.arima import Arima, ArimaConfig
config1 = ArimaConfig(max_forecast_steps=len(time_series[~metadata.trainval].index), order=(0, 1, 0),
                      transform=TemporalResample(granularity="1h"))
model1 = Arima(config1)
model1.train(train_data=train_data)
test_pred, test_err = model1.forecast(time_stamps=test_data.time_stamps)
print(test_pred)
I have a dataframe of 24 variables (24 columns x 4580 rows) from 2008 to 2020.
My independent variable is the first one in the DF and the dependent variables are the 23 others.
I've done a test with one rolling-window regression and it works well; here is my code:
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
import seaborn

seaborn.set_style('darkgrid')
pd.plotting.register_matplotlib_converters()
x = sm.add_constant(df[['DIFFSWAP']])
y = df[['CADUSD']]
rols = RollingOLS(y,x, window=60)
rres = rols.fit()
params = rres.params
r_sq = rres.rsquared
Now, what I want to do is loop the rolling-window regression over all the dependent variables of the DF (columns 2:24) against the independent variable (column 1), and store the coefficients and the R-squareds.
My ultimate goal is to extract the R-squareds and coefficients, put them in dataframes (or lists, or whatever), and then graph them.
I'm new to Python, so I'd be very grateful for any help.
Thank you!
Can you throw it all in a loop and store the results in some other object like a dict?
Potential solution:
data = {}
for column in list(df.columns)[1:]:  # iterate over columns 2 to 24, i.e. everything after the first
    x = sm.add_constant(df[column])
    y = df[['CADUSD']]  # This never changes from CADUSD, right?
    rols = RollingOLS(y, x, window=60)
    rres = rols.fit()
    params = rres.params
    r_sq = rres.rsquared
    # Store results from each column's fit as a new dict entry
    data[column] = {'params': params, 'r_sq': r_sq}
results_df = pd.DataFrame(data).T
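If it helps with the graphing step, a sketch of my own on top of the loop above: rres.rsquared comes back as a Series and rres.params as a DataFrame, both indexed by date, so they collect naturally into frames you can plot:
r_sq_df = pd.DataFrame({col: d['r_sq'] for col, d in data.items()})
r_sq_df.plot(title='Rolling 60-window R-squared')

# each fit's slope sits in the params column named after its regressor
betas = pd.DataFrame({col: d['params'][col] for col, d in data.items()})
betas.plot(title='Rolling coefficients')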
I'm trying to use statsmodels to run separate logistic regressions for each "group" in a pandas dataframe and save the predicted probabilities for each observation (row). Each "group" represents about 2500 respondents or observations; I would like to get the predicted probability for each respondent, similar to how in SPSS you can "save" predicted probabilities when running a logistic regression.
I've read what others have attempted, but nothing seems to work. I'm using SPSS to check that the looping operation in Python is working correctly - the predicted probabilities should be the same (SPSS has a split function which makes this really easy).
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
df = pd.read_csv('test_data.csv')
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df)
    print(est_result.summary())
    df['pred'] = pred
The model summaries are correct (est_result.summary()) and match SPSS exactly. However, the saved predicted values do not match at all. I cannot seem to understand how to get it to work correctly.
Any advice is appreciated.
I solved it in a really un-pythonic kind of way. I hope someone can improve this code. The probabilities now match exactly what SPSS produces when you split the file by group, and run individual regressions by group.
results = []
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df_slice)
    results.append(pred)
    # print(est_result.summary())
n = len(df['Brand'].unique())
r = pd.DataFrame(results)  # put the results into a dataframe
rt = r.T  # transpose the dataframe
r_small = rt[rt.columns[-n:]]  # keep only the last n columns, n = number of categories
r_new = r_small.bfill(axis=1).iloc[:, 0]  # merge the n columns and remove the NaNs
r_new  # show us
df['predicted'] = r_new  # combine the r_new series with the original dataframe
df  # show us
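For what it's worth, a sketch of a tidier variant of the same idea (my own suggestion, not verified against SPSS): predict() returns a Series that keeps df_slice's index, so you can write each group's probabilities straight back into the matching rows with .loc and skip the transpose/bfill juggling:
import numpy as np
from statsmodels.formula.api import logit

df['predicted'] = np.nan
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est_result = logit('binary ~ var_1', df_slice).fit()
    # the returned Series is aligned on df_slice's index, so .loc places each value correctly
    df.loc[df_slice.index, 'predicted'] = est_result.predict(df_slice)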
I am trying to construct a trend-following momentum portfolio strategy based on the S&P 500 index (monthly data).
I used Kaufman's fractal efficiency ratio to filter out whipsaw signals
(http://etfhq.com/blog/2011/02/07/kaufmans-efficiency-ratio/).
I succeeded in coding it, but it's very clumsy, so I need advice for better code.
Strategy
1. Get data for the S&P 500 index from Yahoo Finance.
2. Calculate Kaufman's efficiency ratio over a lookback period X (1 if close > close(n), else 0).
3. Average the values from step 2 over lookback periods 1 to 12 ---> monthly asset allocation ratio; 1 - allocation ratio goes to cash (3% per year).
I am having difficulty averaging the efficiency ratios over periods 1 to 12. Of course I know it can be implemented with a simple for loop and it's an easy task, but I failed.
I need more concise and refined code; can anybody help me?
The a['meanfractal'] line in the code below bothers me...
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pandas_datareader.data as web
def price(stock, start):
    price = web.DataReader(name=stock, data_source='yahoo', start=start)['Adj Close']
    return price.div(price.iat[0]).resample('M').last().to_frame('price')

a = price('SPY','2000-01-01')

def fractal(a,p):
    a['direction'] = np.where(a['price'].diff(p)>0,1,0)
    a['abs'] = a['price'].diff(p).abs()
    a['volatility'] = a.price.diff().abs().rolling(p).sum()
    a['fractal'] = a['abs'].values/a['volatility'].values*a['direction'].values
    return a['fractal']

def meanfractal(a):
    a['meanfractal']= (fractal(a,1).values+fractal(a,2).values+fractal(a,3).values+fractal(a,4).values+fractal(a,5).values+fractal(a,6).values+fractal(a,7).values+fractal(a,8).values+fractal(a,9).values+fractal(a,10).values+fractal(a,11).values+fractal(a,12).values)/12
    a['portfolio1'] = (a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+(1-a.meanfractal.shift(1).values)*1.03**(1/12)).cumprod()
    a['portfolio2'] = ((a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+1.03**(1/12))/(1+a.meanfractal.shift(1))).cumprod()
    a=a.dropna()
    a=a.div(a.ix[0])
    return a[['price','portfolio1','portfolio2']].plot()
print(a)
plt.show()
You could simplify further by storing the values corresponding to each p in a DataFrame, rather than computing each series separately, as shown:
def fractal(a, p):
    df = pd.DataFrame()
    for count in range(1,p+1):
        a['direction'] = np.where(a['price'].diff(count)>0,1,0)
        a['abs'] = a['price'].diff(count).abs()
        a['volatility'] = a.price.diff().abs().rolling(count).sum()
        a['fractal'] = a['abs']/a['volatility']*a['direction']
        df = pd.concat([df, a['fractal']], axis=1)
    return df
Then, you can assign the repeated operations to variables, which avoids re-computing them:
def meanfractal(a, l=12):
    a['meanfractal'] = pd.DataFrame(fractal(a, l)).sum(1, skipna=False)/l
    mean_shift = a['meanfractal'].shift(1)
    price_shift = a['price'].shift(1)
    factor = 1.03**(1/l)
    a['portfolio1'] = (a['price']/price_shift*mean_shift+(1-mean_shift)*factor).cumprod()
    a['portfolio2'] = ((a['price']/price_shift*mean_shift+factor)/(1+mean_shift)).cumprod()
    a.dropna(inplace=True)
    a = a.div(a.iloc[0])  # .iloc instead of the deprecated .ix
    return a[['price','portfolio1','portfolio2']].plot()
Resulting plot obtained:
meanfractal(a)
Note: if speed is not a major concern, you can perform the operations via the built-in pandas methods instead of converting to the corresponding numpy array values.
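One small observation of my own about the code above: sum(1, skipna=False)/l is used rather than mean(axis=1) because mean skips NaNs by default, which would quietly average over fewer than l lookbacks in the early rows; skipna=False keeps those rows NaN so dropna can discard them. A toy illustration:
import numpy as np
import pandas as pd

row = pd.DataFrame([[0.5, np.nan, 0.7]])
print(row.mean(axis=1))                    # 0.6 -- the NaN is silently skipped
print(row.sum(axis=1, skipna=False) / 3)   # NaN -- incomplete rows stay NaN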