Scaling TEST data which is not truly representative of the train data - python

I've built a model which I would like to test on unseen data. I feed the data in daily, and it can have a different range every day. For example, if I use MinMaxScaler(), I scale the training data to the [0, 1] interval.
Now, say the maximum value in the training set is 100, which will be transformed to 1.
When my test data comes in daily, it could turn out that its maximum value is actually 10, which would also be transformed to 1.
min_max_scaler = preprocessing.MinMaxScaler()
df_scaled = min_max_scaler.fit_transform(df.values)
I tried using normalisation instead, e.g. df_norm = (df - df.mean()) / (df.max() - df.min()), and then using these values on the test data:
test_norm = (test_df - df.mean()) / (df.max() - df.min())
But my data is not normally distributed. It is probably closer to exponentially distributed, with a large number of zeros and fewer large values.

No, the maximum value of your test data (i.e. 10) will not be scaled to 1 but to 0.1, provided it is scaled against the max and min learned from the training data.
That is achieved by calling only min_max_scaler.transform() on the test data; fit() or fit_transform() should be used on the training data only.
So for the training data the code stays the same:
df_train_scaled = min_max_scaler.fit_transform(df_train.values)
But for testing data, it becomes:
df_test_scaled = min_max_scaler.transform(df_test.values)
This way, the MinMaxScaler stores the max and min values seen during fit() on the training data and then reuses them to scale the test data consistently.
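To make the difference concrete, here is a minimal sketch of the fit-on-train, transform-on-test pattern; the toy values are illustrative, not from the question:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Toy data: the training max is 100, the test max is only 10
train_values = np.array([[0.0], [25.0], [100.0]])
test_values = np.array([[5.0], [10.0]])
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_values)  # learns min=0, max=100
test_scaled = scaler.transform(test_values)        # reuses those statistics
print(test_scaled)  # [[0.05] [0.1 ]] -- the test max of 10 maps to 0.1, not 1.0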


SVM classifier n_samples, n_splits problem sklearn Python

I'm trying to predict volatility one step ahead with an SVM model, based on an example from the O'Reilly book Machine Learning for Financial Risk Management with Python. When I copy the example exactly (with S&P 500 data) it works well, but now I'm having trouble with this chunk of code on a particular fund's return data:
# returns
r = np.array([np.nan, 0.0013933, 0.00118874, 0.00076462, 0.00168565,
              -0.00018507, -0.00390753, 0.00307275, -0.00351472])
# horizon
t = 252
# mean of returns
mu = r.mean()
# critical value
z = norm.ppf(0.95)
# realized volatility
vol = r.rolling(5).std()
vol = pd.DataFrame(vol)
vol.reset_index(drop=True, inplace=True)
# SVM GARCH
r_svm = r ** 2
r_svm = r_svm.reset_index()
# inputs X (returns and realized volatility)
X = pd.concat([vol, r_svm], axis=1, ignore_index=True)
X = X.dropna().copy()
X = X.reset_index()
X.drop([1, 'index'], axis=1, inplace=True)
# labels y realized volatility shifted 1 period onward
vol = vol.dropna().reset_index()
vol.drop('index', axis=1, inplace=True)
# linear kernel
svr_lin = SVR(kernel='linear')
# hyperparameters grid
para_grid = {'gamma': sp_rand(),
             'C': sp_rand(),
             'epsilon': sp_rand()}
# svm classifier (regression?)
clf = RandomizedSearchCV(svr_lin, para_grid)
clf.fit(X[:-1].dropna().values,
        vol[1:].values.reshape(-1,))
# prediction
n_vol = clf.predict(X.iloc[-1:])
The raised error is:
ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=3.
The code works with longer return series, so I assume the problem is the length of the array, but I can't figure out how to solve it. Can someone help me with that?
This error is raised because you use RandomizedSearchCV with the default cv parameter.
By default, RandomizedSearchCV runs 5-fold cross-validation to find the best hyperparameters for the model.
5-fold cross-validation means splitting your training data into 5 subsets and training 5 different models based on these splits.
It looks like you have fewer than 5 samples in your training set, so splitting your data into 5 folds isn't possible.
To fix the issue you should either add more data or decrease the number of folds for RandomizedSearchCV by adding the cv parameter:
clf = RandomizedSearchCV(svr_lin, para_grid, cv=2)
I'd recommend collecting more data, since 4 data points most likely won't be enough to make the model accurate or predictive.
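If collecting more data is not an option, a rough sketch of a guard, reusing the names from the question, is to cap the number of folds at the number of training samples and pick a scorer that is defined on single-sample test folds (the default R² score is not):
from sklearn.model_selection import RandomizedSearchCV
X_fit = X[:-1].dropna().values
y_fit = vol[1:].values.reshape(-1,)
# Never ask the splitter for more folds than there are samples
n_folds = max(2, min(5, len(X_fit)))
clf = RandomizedSearchCV(svr_lin, para_grid, cv=n_folds,
                         scoring='neg_mean_squared_error')
clf.fit(X_fit, y_fit)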

Problems with inverse_transform scaled predictions and y_test in multi-step, multi-variate LSTM

I have built a multi-step, multi-variate LSTM model to predict the target variable 5 days into the future with 5 days of look-back. The model runs smoothly (even though it still needs further improvement), but I cannot correctly invert the transformation applied once I get my predictions.
I have seen on the web that there are many ways to pre-process and transform data. I decided to follow these steps:
Data fetching and cleaning
df = yfinance.download(['^GSPC', '^GDAXI', 'CL=F', 'AAPL'], period='5y', interval='1d')['Adj Close']
df.dropna(axis=0, inplace=True)
df.describe()
Data set table
Split the data set into train and test
size = int(len(df) * 0.80)
df_train = df.iloc[:size]
df_test = df.iloc[size:]
Scaled train and test sets separately with MinMaxScaler()
scaler = MinMaxScaler(feature_range=(0,1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)
Creation of 3D X and y time-series compatible with the LSTM model
I borrowed the following function from this article
def create_X_Y(ts: np.array, lag=1, n_ahead=1, target_index=0) -> tuple:
    """
    A method to create X and Y matrix from a time series array for the training of
    deep learning models
    """
    # Extracting the number of features that are passed from the array
    n_features = ts.shape[1]
    # Creating placeholder lists
    X, Y = [], []
    if len(ts) - lag <= 0:
        X.append(ts)
    else:
        for i in range(len(ts) - lag - n_ahead):
            Y.append(ts[(i + lag):(i + lag + n_ahead), target_index])
            X.append(ts[i:(i + lag)])
    X, Y = np.array(X), np.array(Y)
    # Reshaping the X array to an RNN input shape
    X = np.reshape(X, (X.shape[0], lag, n_features))
    return X, Y
# In this example let's assume that the first column (AAPL) is the target variable.
trainX, trainY = create_X_Y(df_train_sc, lag=5, n_ahead=5, target_index=0)
testX, testY = create_X_Y(df_test_sc, lag=5, n_ahead=5, target_index=0)
Model creation
def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True, input_shape=(trainX.shape[1], trainX.shape[2])))
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True))
    grid_model.add(LSTM(64, activation='tanh'))
    grid_model.add(Dropout(0.2))
    grid_model.add(Dense(trainY.shape[1]))
    grid_model.compile(loss='mse', optimizer=optimizer)
    return grid_model
grid_model = KerasRegressor(build_fn=build_model, verbose=1, validation_data=(testX, testY))
parameters = {'batch_size': [12, 24],
              'epochs': [8, 30],
              'optimizer': ['adam', 'Adadelta']}
grid_search = GridSearchCV(estimator=grid_model,
                           param_grid=parameters,
                           cv=3)
grid_search = grid_search.fit(trainX, trainY)
grid_search.best_params_
my_model = grid_search.best_estimator_.model
Get predictions
yhat = my_model.predict(testX)
Invert transformation of predictions and actual values
Here my problems begin, because I am not sure which way to go. I have read many tutorials, but it seems that those authors prefer to apply MinMaxScaler() on the entire dataset before splitting it into train and test. I do not agree with this, because otherwise the training data would be scaled with information we should not use (i.e. the test set). So I followed my own approach, but I am stuck here.
I found this possible solution on another post, but it's not working for me:
# invert scaling for forecast
pred_scaler = MinMaxScaler(feature_range=(0, 1)).fit(df_test.values[:,0].reshape(-1, 1))
inv_yhat = pred_scaler.inverse_transform(yhat)
# invert scaling for actual
inv_y = pred_scaler.inverse_transform(testY)
In fact, when I double-check the last values of the target from my original data set, they don't match the inverse-scaled version of testY.
Can someone please help me with this? Many thanks in advance for your support!
A few things could be mentioned here. First, you cannot inverse transform something the scaler did not see during fitting. This happens because you use two different scalers: the network predicts values in the range of scaler 1 (fit on the training data), and it is not guaranteed that these lie within the range of scaler 2 (fit on the test data). Second, best practice is to fit your scaler on the training set and use that same scaler (transform only) on the test data as well; with that, you should be able to inverse transform your test results. Third, if the scaling goes off because the test set contains completely different values, which happens e.g. with live streaming data, it is up to you to deal with it; for instance, the min-max scaler will then produce values greater than 1.0.
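As a concrete illustration of the second point, here is a minimal sketch that inverts the predictions using statistics from the training data only. It reuses df_train, yhat and testY from the question and assumes, as stated there, that column 0 (AAPL) is the target:
from sklearn.preprocessing import MinMaxScaler
# Fit a target-only scaler on TRAINING data, never on the test set
target_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler.fit(df_train.values[:, 0].reshape(-1, 1))   # column 0 = assumed target
# yhat and testY have shape (n_samples, n_ahead); flatten to one column,
# inverse transform, then restore the original shape
inv_yhat = target_scaler.inverse_transform(yhat.reshape(-1, 1)).reshape(yhat.shape)
inv_y = target_scaler.inverse_transform(testY.reshape(-1, 1)).reshape(testY.shape)
Because MinMaxScaler works column-wise, this target-only scaler reproduces the column-0 statistics of the scaler fitted earlier, so the inverted values land back on the original price scale without ever touching test-set statistics.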

h2o: F1 score and other binary classification metrics missing

I am able to run the following example code and get an F1 score:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"] = airlines["Year"].asfactor()
airlines["Month"] = airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
"DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8], seed = 1234)
# train your model
airlines_gbm = H2OGradientBoostingEstimator(sample_rate = .7, seed = 1234)
airlines_gbm.train(x = predictors,
                   y = response,
                   training_frame = train,
                   validation_frame = valid)
# retrieve the model performance
perf = airlines_gbm.model_performance(valid)
perf
With output like so:
ModelMetricsBinomial: gbm
** Reported on test data. **
MSE: 0.20546330299964743
RMSE: 0.4532806007316521
LogLoss: 0.5967028742962095
Mean Per-Class Error: 0.31720065289432364
AUC: 0.7414970113257631
AUCPR: 0.7616331690362552
Gini: 0.48299402265152613
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.35417599264806404:
         NO      YES     Error   Rate
0  NO    1641.0  2480.0  0.6018  (2480.0/4121.0)
1  YES   595.0   4011.0  0.1292  (595.0/4606.0)
2  Total 2236.0  6491.0  0.3524  (3075.0/8727.0)
...
However, my dataset doesn't behave in the same way, despite appearing to be of the same form. My dataset's target variable also has a binary label. Some information about my dataset:
y_test.nunique()
failure 2
dtype: int64
Yet my performance (perf) metrics are a much smaller subset of those produced by the example code:
perf = gbm.model_performance(hf_test)
perf
ModelMetricsRegression: gbm
** Reported on test data. **
MSE: 0.02363221438767555
RMSE: 0.1537277281028883
MAE: 0.07460874699751764
RMSLE: 0.12362377397478382
Mean Residual Deviance: 0.02363221438767555
It is difficult to share my data due to its sensitive nature. Any ideas on what to check?
You're training a regression model, and that's why you're missing the binary classification metrics. The way H2O knows whether to train a regression vs. classification model is by looking at the data type of the response column.
We explain it here in the H2O User Guide, but this is a frequent question we get since it's different from how scikit-learn works, which uses different methods for regression vs. classification and doesn't require you to think about column types.
On the response column in your training data, you can do something like this:
train["response"] = train["response"].asfactor()
Alternatively, when you read the file in from disk, you can parse the response column as the "enum" type, so you don't have to convert it after the fact. There are some examples of how to do that in Python here. If the response is stored as integers, H2O just assumes it's a numeric column when it reads the data from disk, but if the response is stored as strings, it will correctly parse it as a categorical (a.k.a. "enum") column and you won't need to specify or convert it.
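For illustration, a short sketch of both options; the file path and the failure column name are taken from the question's y_test output and are assumptions about the actual data:
import h2o
h2o.init()
# Option 1: convert the response column after import
train = h2o.import_file("my_training_data.csv")          # hypothetical path
train["failure"] = train["failure"].asfactor()
# Option 2: parse the response as categorical ("enum") directly on import
train = h2o.import_file("my_training_data.csv",
                        col_types={"failure": "enum"})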

Keras / NN - Applying different scaling/normalization on inputs

On my journey learning ML I was testing some neural networks, and I noticed that my output doesn't seem to take one of my 3 inputs into consideration, even though it is very important.
My dataset is composed of 4 columns (CSV):
3 are numbers (including the output) ranging from 1,000 up to 150,000,000
1 is a number between 0 and 100, and it is the one my NN doesn't take into consideration
I scale my dataset this way using MinMaxScaler from scikit-learn:
df = pd.read_csv('rawData.csv')
dataset = df.values
min_max_scaler = preprocessing.MinMaxScaler()
dataset = min_max_scaler.fit_transform(dataset)
X = dataset[:,0:3] # input
Y = dataset[:,3] # output
I also use another way to scale my data (when I want to test my model):
min_test = np.min(runset)
max_test = np.max(runset)
normalized = (runset - min_test) / (max_test - min_test)
test = model.predict(normalized)
result = test * (max_test - min_test) + min_test
So my question is: Is it possible and recommended to use different scales for different inputs? If yes how do I do that?
The short answer to your question is another question: do I have prior knowledge about the importance of the features characterizing my dataset?
If yes, I may scale my data in such a way that the more important features have a larger variance/range.
If not, I should scale my data so that all features have mean 0 and standard deviation 1. Why? Mainly to improve the numerical conditioning of the problem, remove the problem of the initial weights depending on the input scale, speed up training, and reduce the risk of getting stuck in local optima.
Do not underestimate the sensitivity of gradient descent methods to scaling.
Finally, remember to use the statistics (mean and standard deviation) of the training set to standardize the validation/test set.
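To address the "how do I do that" part of the question, a common way to apply different scalers to different input columns is scikit-learn's ColumnTransformer. This is only a sketch: the column indices are illustrative, since the question does not say which of the three inputs is the 0-100 one:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
df = pd.read_csv('rawData.csv')      # file name from the question
X = df.iloc[:, 0:3]                  # the 3 inputs
y = df.iloc[:, 3]                    # the output
# Standardize the two large-valued inputs, min-max scale the 0-100 input
pre = ColumnTransformer([
    ('large', StandardScaler(), [0, 1]),
    ('small', MinMaxScaler(), [2]),
])
X_scaled = pre.fit_transform(X)      # fit on training data only...
# ...and later reuse the same statistics: X_test_scaled = pre.transform(X_test)
The key point is that fit_transform() is called once on the training data and the fitted transformer is reused, via transform(), on the validation/test data.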

LSTM keras MinMaxScaler producing different values for the same number

I am trying to apply an LSTM model to do time-series forecasting.
One step in preparing the data is to scale it using MinMaxScaler. I'm trying to scale the data to the range (-1, 1).
My train data looks something like below:
[[0. 2.54583333]
[2.54583333 2.6125 ]
[2.6125 2.875 ]
...
[2.69583333 2.5125 ]
[2.5125 2.91666667]
[2.91666667 3.4375 ]]
MinMaxScaler(copy=True, feature_range=(-1, 1))
[[-1. -0.95342466]
[-0.80649248 -0.94794521]
[-0.80142518 -0.92636986]
...
[-0.79509105 -0.95616438]
[-0.80902613 -0.92294521]
[-0.77830562 -0.88013699]]
As can be seen above, the values are not getting scaled consistently. For example, 2.54583333 is scaled to -0.95342466 as well as to -0.80649248.
I expected the scaled values to look something like below:
[[-1. -0.95342466]
[-0.95342466 -0.94794521]
[-0.94794521 -0.92636986]
...
[-0.79509105 -0.95616438]
[-0.95616438 -0.92294521]
[-0.92294521 -0.88013699]]
The scaled values should also follow the pattern in the training set.
Can someone tell me what I am doing wrong, and why multiple scaled values are produced for the same number?
I'm scaling values using below code:
def scale(train, test):
    # fit scaler
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler = scaler.fit(train)
    # transform train
    train = train.reshape(train.shape[0], train.shape[1])
    train_scaled = scaler.transform(train)
    # transform test
    test = test.reshape(test.shape[0], test.shape[1])
    test_scaled = scaler.transform(test)
    return scaler, train_scaled, test_scaled
But something seems really wrong here. Please help.
My whole dataset is of shape (720, 1).
Length of train data: 519
Length of test data: 200
Inside the above method the shape of the train data is (519, 2) and of the test data is (200, 2). Please let me know if more information or any other clarification is required. I am new to this and I am trying LSTM for the first time.
There is nothing wrong with this. Each feature (column) is scaled independently with its own min and max, so the same value appearing in a different feature column will not be scaled to the same number.
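A small sketch of both behaviours, with toy values rather than the asker's data, in case a single consistent mapping across both columns is actually what is wanted:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
train = np.array([[0.0, 2.5],
                  [2.5, 3.0],
                  [3.0, 4.0]])
# Default behaviour: each column is scaled with its own min/max, so 2.5 maps
# to different values depending on which column it sits in
per_column = MinMaxScaler(feature_range=(-1, 1)).fit_transform(train)
# For one consistent mapping across all values, fit on the flattened data
flat_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train.reshape(-1, 1))
consistent = flat_scaler.transform(train.reshape(-1, 1)).reshape(train.shape)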
