I'm trying to predict the future values of a share with scikit-learn regressors (it could be the next value of anything; I've tried the same function on Covid case data with the same results), but it doesn't work.
I've written a function that takes my train dataset, the target variable, the test Xs and the features to take into account and gives back the prediction:
from sklearn.ensemble import GradientBoostingRegressor

def predict_share_valuesGRDBST(data, target_variable, X_test, features=None):
    # Split data into features (X) and target (y)
    if features:
        X = data[features]
    else:
        X = data.drop(target_variable, axis=1)
        features = X.columns  # so X_test is indexed with the same columns below
    y = data[target_variable]
    # Fit Gradient Boosting model to the training data
    model = GradientBoostingRegressor(n_estimators=200, random_state=20)
    model.fit(X, y)
    # Use the model to make predictions on the test rows
    next_values = model.predict(X_test[features])
    return next_values
The data variable looks like:

Date        CloseValue  OpenValue  TradeVolume
...         ...         ...        ...
2023-01-19  100         90         1000
2023-01-20  110         101        1100
target_variable is 'CloseValue'.
X_test is like data but with values at future dates.
The features variable is like ['OpenValue', 'TradeVolume', 'Date'].
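For reference, I call the function roughly like this (a sketch; sklearn cannot fit on raw date strings, so assume Date is first converted to a numeric ordinal):
import pandas as pd
# hypothetical train/test frames with the columns shown above
for frame in (data, X_test):
    frame['Date'] = pd.to_datetime(frame['Date']).map(pd.Timestamp.toordinal)
next_values = predict_share_valuesGRDBST(data, 'CloseValue', X_test,
                                         features=['OpenValue', 'TradeVolume', 'Date'])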
But the returned values don't fit the actual series at all.
I've tried other regressors (AdaBoost, RandomForest) but they all give me the same wrong results.
That's why I think I'm doing something wrong and it's not just a problem of correlation between variables; it seems they're working on the wrong data, but I cannot figure out how to fix it. Any ideas?
I'm trying to predict volatility one step ahead with an SVM model, based on an example from the O'Reilly book Machine Learning for Financial Risk Management with Python. When I copy the example exactly (with S&P 500 data) it works well, but now I'm having trouble with this chunk of code on a particular fund's returns data:
import numpy as np
import pandas as pd
from scipy.stats import norm
from scipy.stats import uniform as sp_rand
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

# returns (as a pandas Series, so that .rolling() below works)
r = pd.Series([np.nan, 0.0013933, 0.00118874, 0.00076462, 0.00168565,
               -0.00018507, -0.00390753, 0.00307275, -0.00351472])
# horizon
t = 252
# mean of returns
mu = r.mean()
# critical value
z = norm.ppf(0.95)
# realized volatility
vol = r.rolling(5).std()
vol = pd.DataFrame(vol)
vol.reset_index(drop=True, inplace=True)
# SVM GARCH
r_svm = r ** 2
r_svm = r_svm.reset_index()
# inputs X (returns and realized volatility)
X = pd.concat([vol, r_svm], axis=1, ignore_index=True)
X = X.dropna().copy()
X = X.reset_index()
X.drop([1, 'index'], axis=1, inplace=True)
# labels y: realized volatility shifted 1 period onward
vol = vol.dropna().reset_index()
vol.drop('index', axis=1, inplace=True)
# linear kernel
svr_lin = SVR(kernel='linear')
# hyperparameter grid
para_grid = {'gamma': sp_rand(),
             'C': sp_rand(),
             'epsilon': sp_rand()}
# SVM regressor wrapped in a randomized hyperparameter search
clf = RandomizedSearchCV(svr_lin, para_grid)
clf.fit(X[:-1].dropna().values,
        vol[1:].values.reshape(-1,))
# prediction
n_vol = clf.predict(X.iloc[-1:])
The raised error is:
ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=3.
The code works with longer return series, so I assume the problem is the length of the array, but I can't figure out how to solve it. Can someone help me with that?
This error is raised because you use RandomizedSearchCV with its default cv parameter.
By default, RandomizedSearchCV runs 5-fold cross-validation to find the best hyperparameters for the model.
5-fold cross-validation means splitting your training data into 5 subsets and training 5 different models based on these splits.
It looks like you have fewer than 5 objects in your training set, so splitting your data into 5 folds isn't possible.
To fix the issue you should either add more data or decrease the number of folds for RandomizedSearchCV by passing the cv parameter:
clf = RandomizedSearchCV(svr_lin, para_grid, cv=2)
I'd recommend collecting more data, since 4 data points most likely won't be enough to make the model accurate or predictive.
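A minimal sketch of the error and the fix, on synthetic data rather than the fund returns:
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
rng = np.random.default_rng(0)
X_small = rng.random((3, 2))  # only 3 training samples, as in the question
y_small = rng.random(3)
para_grid = {'C': sp_rand(), 'epsilon': sp_rand()}
# RandomizedSearchCV(SVR(kernel='linear'), para_grid).fit(X_small, y_small)
# -> ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=3
clf = RandomizedSearchCV(SVR(kernel='linear'), para_grid, cv=2, n_iter=5)
clf.fit(X_small, y_small)  # works: 2 folds fit into 3 samples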
I have built a multi-step, multivariate LSTM model to predict the target variable 5 days into the future with 5 days of look-back. The model runs smoothly (even though it has to be further improved), but I cannot correctly invert the transformation applied once I get my predictions.
I have seen on the web that there are many ways to pre-process and transform data. I decided to follow these steps:
Data fetching and cleaning
import yfinance
df = yfinance.download(['^GSPC', '^GDAXI', 'CL=F', 'AAPL'], period='5y', interval='1d')['Adj Close']
df.dropna(axis=0, inplace=True)
df.describe()
Data set table
Split the data set into train and test
size = int(len(df) * 0.80)
df_train = df.iloc[:size]
df_test = df.iloc[size:]
Scaled train and test sets separately with MinMaxScaler()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)
Creation of 3D X and y time-series compatible with the LSTM model
I borrowed the following function from this article
def create_X_Y(ts: np.array, lag=1, n_ahead=1, target_index=0) -> tuple:
    """
    A method to create X and Y matrix from a time series array for the training of
    deep learning models
    """
    # Extracting the number of features that are passed from the array
    n_features = ts.shape[1]
    # Creating placeholder lists
    X, Y = [], []
    if len(ts) - lag <= 0:
        X.append(ts)
    else:
        for i in range(len(ts) - lag - n_ahead):
            Y.append(ts[(i + lag):(i + lag + n_ahead), target_index])
            X.append(ts[i:(i + lag)])
    X, Y = np.array(X), np.array(Y)
    # Reshaping the X array to an RNN input shape
    X = np.reshape(X, (X.shape[0], lag, n_features))
    return X, Y
# In this example let's assume that the first column (AAPL) is the target variable.
trainX, trainY = create_X_Y(df_train_sc, lag=5, n_ahead=5, target_index=0)
testX, testY = create_X_Y(df_test_sc, lag=5, n_ahead=5, target_index=0)
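As a quick sanity check of the resulting 3D shapes (an illustration; with the 4 tickers downloaded above and lag=5, n_ahead=5):
print(trainX.shape)  # (n_samples, 5, 4): 5 look-back steps, 4 features
print(trainY.shape)  # (n_samples, 5): 5 future values of the target column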
Model creation
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True,
                        input_shape=(trainX.shape[1], trainX.shape[2])))
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True))
    grid_model.add(LSTM(64, activation='tanh'))
    grid_model.add(Dropout(0.2))
    grid_model.add(Dense(trainY.shape[1]))
    grid_model.compile(loss='mse', optimizer=optimizer)
    return grid_model

grid_model = KerasRegressor(build_fn=build_model, verbose=1, validation_data=(testX, testY))
parameters = {'batch_size': [12, 24],
              'epochs': [8, 30],
              'optimizer': ['adam', 'Adadelta']}
grid_search = GridSearchCV(estimator=grid_model,
                           param_grid=parameters,
                           cv=3)
grid_search = grid_search.fit(trainX, trainY)
grid_search.best_params_
my_model = grid_search.best_estimator_.model
Get predictions
yhat = my_model.predict(testX)
Invert transformation of predictions and actual values
Here my problems begin, because I am not sure which way to go. I have read many tutorials, but it seems that those authors prefer to apply MinMaxScaler() on the entire dataset before splitting the data into train and test. I do not agree with this, because otherwise the training data would be incorrectly scaled with information we should not use (i.e. the test set). So I followed my approach, but I am stuck here.
I found this possible solution on another post, but it's not working for me:
# invert scaling for forecast
pred_scaler = MinMaxScaler(feature_range=(0, 1)).fit(df_test.values[:,0].reshape(-1, 1))
inv_yhat = pred_scaler.inverse_transform(yhat)
# invert scaling for actual
inv_y = pred_scaler.inverse_transform(testY)
In fact, when I double-check the last values of the target from my original data set, they don't match the inverted scaled version of testY.
Can someone please help me on this? Many thanks in advance for your support!
A few things could be mentioned here. First, you cannot inverse-transform something the scaler did not see. This happens because you use two different scalers: the network predicts values in the range of scaler 1 (fit on training data), and there is no guarantee that these lie within the range of scaler 2 (fit on test data). Second, best practice is to fit your scaler on the training set and use the same scaler (transform only) on the test data as well; then you should be able to reverse-transform your test results. Third, if scaling goes off because the test set has completely different values (which happens, e.g., with live streaming data), it is up to you to deal with it; for instance, the min-max scaler will then produce values > 1.0.
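A minimal sketch of that advice under the setup of the question (df_train/df_test as above, target in the first column, yhat of shape (n_samples, n_ahead)):
from sklearn.preprocessing import MinMaxScaler
# one scaler for the features, fit on the training set only
scaler = MinMaxScaler(feature_range=(0, 1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)  # same scaler, transform only
# a separate scaler for the target column, also fit on the training set,
# so the multi-step prediction matrix can be inverted
target_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler.fit(df_train.values[:, 0].reshape(-1, 1))
inv_yhat = target_scaler.inverse_transform(yhat.reshape(-1, 1)).reshape(yhat.shape)
inv_y = target_scaler.inverse_transform(testY.reshape(-1, 1)).reshape(testY.shape)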
I am building a quant model that takes a bunch of features and predicts the performance of an index. The model is doing exceptionally well, which obviously makes me wonder if I am making some mistake.
I have looked at the underlying features that I am using to ensure there is no data leakage. So now my attention is turning towards my code. Below is the main body of code that I use for prediction.
Does anything look wrong in the looping or how I am predicting? Please let me know if you need any more information and I will share what I can share.
X -> Features used in model training and prediction
y -> Class variable (1,0)
n_records -> Number of records in the dataset
n_train -> Amount of data to use for training in the rolling window construct
model -> Ensemble model from sklearn
My training data is c. 4,500 records. I use an n_train of 800 to train the first instance of the model, and then a rolling window of 800 records for training to predict the 801st instance (and so on). In that way I roll through time, leaving out very old data (keeping the model "current").
col_names = ['Pred', 'Actual', 'Pred Prob']  # column names for prediction output dataframe

def Strategy(n_train):
    list_ans = []
    n_records = len(X)  # number of records in X
    for i in range(n_train, n_records):
        # rolling window: train on the previous n_train records, predict tomorrow's performance
        X_train, X_test, y_train, y_test = X[i-n_train:i], X[i:i+1], y[i-n_train:i], y[i:i+1]
        X_train = ss.fit_transform(X_train)
        X_test = ss.transform(X_test)
        model.fit(X_train, y_train)
        Pred = model.predict(X_test)
        Actual = y_test.values
        Prob = model.predict_proba(X_test)[:, 1]
        i_ans = [Pred.item(), Actual.item(), Prob.item()]
        resi = pd.Series(data=i_ans, index=col_names)
        list_ans.append(resi)
    return pd.DataFrame(list_ans)
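For completeness, this is roughly how everything is wired up (a sketch; ss and model stand in for the scaler and sklearn ensemble I actually use, which are defined elsewhere, and X and y are the feature matrix and class labels):
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier  # placeholder for my ensemble
ss = StandardScaler()
model = RandomForestClassifier(random_state=0)
results = Strategy(n_train=800)
print(results.head())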
For 1: what values do you expect for n_records and n_train? Keep in mind that n_train acts as the minimum value for the range, so the first n_train records are never predicted. I don't know if this is how it should be, but be careful: you may be skipping training data.
Apart from that, it looks good to my eyes!
I am wondering how to predict future time series values after model training; I would like to get the values N steps ahead, and to check whether the time series has been properly learned and predicted. How do I do this correctly? I want to get the next values using model.predict or similar.
I have x_test, and x_test[-1] == t, so by "the next values" I mean t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below:
The results for X_test[-20:] and the following 20 predictions look the same. I'm wondering whether this is the correct method to train and predict, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried another official dataset: I used the time series from the TensorFlow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]              # remove the first value
    a = np.append(a, tmp)  # insert the predicted value
The predictions came out as an almost straight line, very different from the real data.
Output: a linear-looking trend that is independent of the real data:
full source (After the 25th line is my code.)
I'm really very curious how to predict the following values of a time series using TensorFlow's predict method.
I'm not asking whether this works theoretically; I'm just asking how to get the following n steps using the predict method.
Thank you for reading this long question. Any advice is much appreciated.
In the second approach, the output is not as expected because of a small mistake in the code, as per my understanding.
The line of code,
a = y_val[-look_back:]
should be replaced by
look_back = 20
x = x_val_uni
a = x[-look_back:]
a.shape
In other words, we should send the X values as inputs to the model for the prediction, not the Y values.
However, we can compare its predictions with the Y values, with the code:
y = y_val_uni[-20:]
plt.plot(y)
plt.plot(tmp)
plt.show()
Which would result in the plot shown below:
Please find the Complete Working Code in this Google Colab Gist.
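If the goal is a truly recursive multi-step forecast, as in the loop from the question, a sketch under the same tutorial setup (assuming x_val_uni windows of shape (look_back, 1) and a one-step-ahead model):
import numpy as np
look_back = 20
num_feature = 1
n_steps = 20
window = x_val_uni[-1].reshape(look_back)  # seed with the last window of X values
preds = []
for _ in range(n_steps):
    tmp = model.predict(window.reshape(-1, look_back, num_feature))  # one-step prediction
    preds.append(tmp.item())
    window = np.append(window[1:], tmp)  # slide the window: drop the oldest value, append the prediction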
I'm confused how to interpret the output of .predict from a fitted CoxnetSurvivalAnalysis model in scikit-survival. I've read through the notebook Intro to Survival Analysis in scikit-survival and the API reference, but can't find an explanation. Below is a minimal example of what leads to my confusion:
import pandas as pd
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis
# load data
data_X, data_y = load_veterans_lung_cancer()
# one-hot-encode categorical columns in X
categorical_cols = ['Celltype', 'Prior_therapy', 'Treatment']
X = data_X.copy()
for c in categorical_cols:
    dummy_matrix = pd.get_dummies(X[c], prefix=c, drop_first=False)
    X = pd.concat([X, dummy_matrix], axis=1).drop(c, axis=1)
# display final X to fit Cox Elastic Net model on
del data_X
print(X.head(3))
so here's the X going into the model:
Age_in_years Celltype Karnofsky_score Months_from_Diagnosis \
0 69.0 squamous 60.0 7.0
1 64.0 squamous 70.0 5.0
2 38.0 squamous 60.0 3.0
Prior_therapy Treatment
0 no standard
1 yes standard
2 no standard
...moving on to fitting the model and generating predictions:
# Fit Model
coxnet = CoxnetSurvivalAnalysis()
coxnet.fit(X, data_y)
# What are these predictions?
preds = coxnet.predict(X)
preds has the same number of records as X, but its values are way different from the values in data_y, even when predicting on the same data the model was fit on.
print(preds.mean())
print(data_y['Survival_in_days'].mean())
output:
-0.044114643249153422
121.62773722627738
So what exactly are preds? Clearly .predict means something quite different here than in scikit-learn, but I can't figure out what. The API reference says it returns "the predicted decision function," but what does that mean? And how do I get to a predicted estimate in months, yhat, for a given X? I'm new to survival analysis, so I'm obviously missing something.
I posted this question on GitHub, though the author relabeled the issue as a question.
I got some helpful explanation of what the predict output is, but I am still not sure how to get to a set of predicted survival times, which is what I really want. Here are a couple of helpful explanations from that GitHub thread:
predictions are risk scores on an arbitrary scale, which means you can
usually only determine the sequence of events, but not their exact time.
-sebp (library author)
It [predict] returns a type of risk score. Higher value means higher
risk of your event (class value = True)...You were probably looking
for a predicted time. You can get the predicted survival function with
estimator.predict_survival_function as in the example 00
notebook...EDIT: Actually, I’m trying to extract this but it’s been a
bit of a pain to munge
-pavopax.
There's more explanation at the github thread, though I wasn't really able to follow all of it. I need to play around with predict_survival_function and predict_cumulative_hazard_function and see if I can get to a set of predictions for most likely survival time by row in X, which is what I really want.
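As a sketch of what I plan to try (assumptions: the X and data_y from above; per the scikit-survival docs, CoxnetSurvivalAnalysis needs fit_baseline_model=True before predict_survival_function is available; I take the median survival time as the point estimate):
import numpy as np
coxnet = CoxnetSurvivalAnalysis(fit_baseline_model=True)
coxnet.fit(X, data_y)
surv_fns = coxnet.predict_survival_function(X)  # one step function per row of X
times = surv_fns[0].x  # grid of event times
def median_survival_time(fn):
    # first time point at which the survival curve drops to 0.5 or below
    below = times[fn(times) <= 0.5]
    return below[0] if len(below) else np.inf  # inf: curve never reaches 0.5
pred_times = [median_survival_time(fn) for fn in surv_fns]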
I'm not going to accept this answer here, in case anyone else has a better one.
With the X input, predict evaluates the linear predictor on the input array, i.e. the dot product of X with the fitted coefficients:
def predict(self, X, alpha=None):
    """The linear predictor of the model.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test data of which to calculate log-likelihood from

    alpha : float, optional
        Constant that multiplies the penalty terms. If the same alpha was used during training, exact
        coefficients are used, otherwise coefficients are interpolated from the closest alpha values that
        were used during training. If set to ``None``, the last alpha in the solution path is used.

    Returns
    -------
    T : array, shape = (n_samples,)
        The predicted decision function
    """
    X = check_array(X)
    coef = self._get_coef(alpha)
    return numpy.dot(X, coef)
The definition of check_array comes from another library, scikit-learn (sklearn.utils.validation).
You can review the code of coxnet.
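So a quick sanity check along these lines should reproduce preds (a sketch; coef_ holds one column of coefficients per alpha in the solution path, and predict with alpha=None uses the last one):
import numpy as np
coef = coxnet.coef_[:, -1]  # coefficients at the last alpha in the path
manual = np.dot(X.values, coef)
print(np.allclose(manual, coxnet.predict(X)))  # expected: True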