Let's say I have a 10-feature dataset X of shape [100, 10] and a target dataset y of shape [100, 1].
For example, after splitting the two with sklearn.model_selection.train_test_split I obtained:
X_train: [70, 10]
X_test: [30, 10]
y_train: [70, 1]
y_test: [30, 1]
What is the correct way of applying standardization?
I've tried with:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
but then, if I predict using a model and try to invert the scaling to look at the MAE, I get an error:
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)
y_pred_std = lr.predict(X_test_std)
y_pred = scaler.inverse_transform(y_pred_std) # error here
I also have another question. Since I have the target values, should I use
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train, y_train)
X_test_std = scaler.transform(X_test)
instead of the first code block?
Do I also have to apply the transformation to the y_train and y_test datasets? I am a bit confused.
StandardScaler is supposed to be used on the feature matrix X only.
So all the fit, transform and inverse_transform methods just need your X.
Note that after you fit the scaler, you can access the following attributes:
mean_: mean of each feature in X_train
scale_: standard deviation of each feature in X_train
The transform method computes (X[i, col] - mean_[col]) / scale_[col] for each sample i, whereas the inverse_transform method computes X[i, col] * scale_[col] + mean_[col] for each sample i.
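As a quick illustration, here is a minimal sketch (made-up random data standing in for the real X_train) that checks those formulas against the scaler's mean_ and scale_ attributes:
import numpy as np
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
X_train = rng.normal(size=(70, 10))  # made-up stand-in for the real X_train
scaler = StandardScaler().fit(X_train)
# transform: (X - mean_) / scale_, column by column
manual = (X_train - scaler.mean_) / scaler.scale_
assert np.allclose(scaler.transform(X_train), manual)
# inverse_transform: X * scale_ + mean_ recovers the original values
assert np.allclose(scaler.inverse_transform(manual), X_train)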
While practicing a simple linear regression model I got this error. I think there is something wrong with my data set.
Here are my data set, the independent variable X, the dependent variable Y, X_train and Y_train (shown as screenshots in the original post, not reproduced here).
This is the error body:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is my code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
#Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] would be transformed to a matrix x' = [[1], [2], [3]] (-1 gives the number of rows, inferred from the length of the array and the remaining dimensions; 1 is the number of columns, giving us an n x 1 matrix where n is the input length).
Questions about reshape have been answered in the past; this one, for example, should explain what reshape(-1, 1) fully means: What does -1 mean in numpy reshape? (Some of the other answers below explain this very well too.)
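For example, a quick sketch (with a made-up array) of what the reshape does to the shape:
import numpy as np
x = np.array([1, 2, 3])   # shape (3,)  -- 1D
x_2d = x.reshape(-1, 1)   # shape (3, 1) -- 2D, one column (one feature)
print(x_2d)
# [[1]
#  [2]
#  [3]]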
A lot of times when doing linear regression problems, people like to envision the classic plot of a single input variable against the target with a fitted line.
On the input side, we have an X of [1, 2, 3, 4, 5].
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices; it's multiple features (e.g. number of rooms, location, etc.).
If you look at the documentation, it tells us that the rows consist of the samples while the columns consist of the features.
However, consider what happens when we have one feature as our input. Then we need an n x 1 input, where n is the number of samples and the single column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means: choose whatever number of rows works given the single column provided, which turns the 1D input into exactly that n x 1 shape.
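As a rough sketch of this idea (with made-up numbers, not the asker's data), a single-feature regression only fits once the input is an n x 1 array:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # shape (5, 1): 5 samples, 1 feature
y = np.array([2, 4, 6, 8, 10])                # the target can stay 1D
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[6]])))         # a single new sample, also passed as 2D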
If you look at the documentation of scikit-learn's LinearRegression:
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see, X has 2 dimensions, whereas your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
Before fitting and predicting the model.
Use
y_pred = regressor.predict([[x_test]])
I would suggest reshaping X at the beginning, before you split into training and test datasets:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick
x = x.reshape(-1,1)
#Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
This is what I use
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
This is the solution
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to
regressor.fit(x_train.values.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.values.reshape(-1,1))
I'm trying to fit a Random Forest regression model. These are the steps I've followed (please see the code below, with comments):
Before fitting the model, I split the data into training and test sets
I converted the results into arrays
I reshaped them to 2D arrays, as the regressor expects, using the reshape function
I'm getting the following error (it seems that there is a 1D array even though I've reshaped them at the beginning):
ValueError: Expected 2D array, got 1D array instead:
array=[183. 27. 520. ... 23. 28. 34.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample.
Here's the code I've used:
#train & test split
from sklearn.model_selection import train_test_split
X = order_final.loc[:, ~order_final.columns.isin(['lag','observed'])]
y = order_final['lag']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#convert X,y train and X test into arrays
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()
X_test = X_test.to_numpy()
#make them 2D-arrays
X_train.reshape(-1,1)
y_train.reshape(-1,1)
X_test.reshape(-1,1)
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor
# create regressor object
RF = RandomForestRegressor(n_estimators = 100, random_state = 0)
# fit the regressor with x and y data
RF.fit(X_train, y_train)
#Prediction of test set
y_pred = RF.predict(X_test)
# View accuracy score
RF.score(y_test, y_pred)
And here are the shapes of my arrays (which look good to me, but...):
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_pred.shape)
(7326, 10)
(7326,)
(1832, 10)
(1832,)
(1832,)
Can someone please help me out and point me to where the error is? Thanks in advance!
Stefano
RF.score() needs the inputs to be reshaped:
RF.score(y_test.reshape(-1,1), y_pred.reshape(-1,1))
You need to pass the test features and the true target values to the score() method instead, like so:
RF.score(X_test, y_test)
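For a regressor, score() returns the R² on the given test data. A small sketch (reusing the RF, X_test, y_test and y_pred from the question's code) of how it relates to computing the metric yourself:
from sklearn.metrics import r2_score, mean_absolute_error
# score() for a regressor is the R^2 on the test set ...
r2 = RF.score(X_test, y_test)
# ... which matches computing predictions and scoring them explicitly
assert abs(r2 - r2_score(y_test, RF.predict(X_test))) < 1e-12
# other regression metrics follow the (y_true, y_pred) argument order
mae = mean_absolute_error(y_test, y_pred)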
I have a dataset with more than 2000 rows and 23 columns, including the age column. I have completed all of the steps for SVR. Now I want to make predictions with the trained SVR model; is this where I need to input X_test to the model? I have faced this error:
ValueError: X.shape[1] = 1 should be equal to 22, the number of features at training time
How can I resolve this problem, and how should I write code to make predictions with the trained SVR model?
import pandas as pd
import numpy as np
from sklearn.svm import SVR
# Make fake dataset
dataset = pd.DataFrame(data= np.random.rand(2000,22))
dataset['age'] = np.random.randint(2, size=2000)
# Separate the target from the other features
target = dataset['age']
data = dataset.drop('age', axis = 1)
X_train, y_train = data.loc[:1000], target.loc[:1000]
X_test, y_test = data.loc[1001], target.loc[1001]
X_test = np.array(X_test).reshape((len(X_test), 1))
print(X_test.shape)
SupportVectorRefModel = SVR()
SupportVectorRefModel.fit(X_train, y_train)
y_pred = SupportVectorRefModel.predict(X_test)
Output:
ValueError: X.shape[1] = 1 should be equal to 22, the number of features at training time
Your reshaping of X_test is not correct; it should be:
X_test = np.array(X_test).reshape(1, -1)
print(X_test.shape)
# (1, 22)
With that change, the rest of your code runs OK:
y_pred = SupportVectorRefModel.predict(X_test)
y_pred
# array([0.90156667])
UPDATE
In the case as you show it in your code, obviously X_test consists of one single sample, as defined here:
X_test, y_test = data.loc[1001], target.loc[1001]
But if (as I suspect) this is not what you actually want, but in fact you want the rest of your data as your test set, you should change the definition to:
X_test, y_test = data.loc[1001:], target.loc[1001:]
X_test.shape
# (999, 22)
and without any reshaping
y_pred = SupportVectorRefModel.predict(X_test)
y_pred.shape
# (999,)
i.e. a y_pred of 999 predictions.
I am experimenting with the way the weights on the distance affect the performance of the kNN algorithm and for a reproducible example I am working with the iris dataset.
To my surprise, weighting 2 predictors 100 times more than the other 2 predictors generates predictions identical to the unweighted model. What explains this rather counterintuitive finding?
My code is the following:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = load_iris()
X_original = iris['data']
Y = iris['target']
sc = StandardScaler() # Defines the parameters of the Scaler
X = sc.fit_transform(X_original) # Transforms the original data to standardized data and returns them
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)
split = sss.split(X, Y)
s = list(split)
train_index = s[0][0]
test_index = s[0][1]
X_train = X[train_index, ]
X_test = X[test_index, ]
Y_train = Y[train_index]
Y_test = Y[test_index]
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6)
iris_fit = knn.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w1 = knn.predict(X_test)
weights = np.array([1, 1, 100, 100])
weights = weights / np.sum(weights)
knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2,
metric_params={'w': weights})
iris_fit_w = knn_w.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w100 = knn_w.predict(X_test)
(predictions_w1 != predictions_w100).sum()
0
They are not always the same; add a random state to your train/test split and you will see how the result changes for different values:
StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)
Additionally, the weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features basically gives you the same results as if you only ran kNN on these 2 features with unweighted Minkowski. And since those two features are quite informative on their own, it is no surprise you get very similar results compared to considering all 4 features, as the quick check below shows.
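A sketch of that check (reusing X_train, X_test, Y_train and predictions_w100 from the question's code): run an unweighted kNN restricted to the last two columns and compare its predictions with the heavily weighted model:
# unweighted kNN on petal length and petal width only
knn_2feat = KNeighborsClassifier(n_neighbors = 6)
knn_2feat.fit(X_train[:, 2:4], Y_train)
predictions_2feat = knn_2feat.predict(X_test[:, 2:4])
# with such extreme weights, the weighted model behaves almost the same
print((predictions_2feat != predictions_w100).sum())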
I am working with sklearn's stratified k-fold split, and when I attempt to split using multi-class targets, I receive the error below. When I split using binary targets, it works with no problem.
num_classes = len(np.unique(y_train))
y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
# splitting data into different folds
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical)):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs:
split(X, y, groups=None)
[...]
y : array-like, shape (n_samples,)
The target variable for supervised learning problems. Stratification is done based on the y labels.
i.e. your y must be a 1-D array of your class labels.
Essentially, what you have to do is simply invert the order of the operations: split first (using your initial y_train), and convert with to_categorical afterwards.
Alternatively, you can recover the integer labels from the one-hot encoding with argmax and call split() like this:
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical.argmax(1))):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
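A minimal sketch of the "split first, encode afterwards" order (assuming the x_train, integer-label y_train, num_classes and keras import from the question):
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
    # one-hot encode per fold, after the stratified split has been done
    y_train_kf_cat = keras.utils.to_categorical(y_train_kf, num_classes)
    y_val_kf_cat = keras.utils.to_categorical(y_val_kf, num_classes)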
If your target variable is continuous then use simple KFold cross validation instead of StratifiedKFold.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
I bumped into the same problem and found out that you can check the type of the target with this util function:
from sklearn.utils.multiclass import type_of_target
type_of_target(y)
'multilabel-indicator'
From its docstring:
'binary': y contains <= 2 discrete values and is 1d or a column vector.
'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
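For instance, a small sketch (with made-up labels) showing the two types that matter here:
from sklearn.utils.multiclass import type_of_target
import numpy as np
y_labels = np.array([0, 1, 2, 1])
y_onehot = np.eye(3)[y_labels]     # one-hot encoding of the same labels
print(type_of_target(y_labels))    # 'multiclass' -- accepted by StratifiedKFold
print(type_of_target(y_onehot))    # 'multilabel-indicator' -- rejected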
With LabelEncoder you can transform your classes into a 1d array of numbers (given your target labels are in a 1d array of categoricals/objects):
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)
In my case, X was a 2D matrix and y was also a 2D matrix, i.e. indeed a multi-class multi-output case. I just passed a dummy np.zeros(shape=(n, 1)) for the y, and the X as usual. Full code example:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 7], [9, 4]])
# y = np.array([0, 0, 1, 1, 0, 1]) # <<< works
y = X # does not work if passed into `.split`
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=36851234)
for train_index, test_index in rskf.split(X, np.zeros(shape=(X.shape[0], 1))):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Complementing what desertnaut said: in order to convert your one-hot encoding back to a 1-D array, all you need to do is:
class_labels = np.argmax(y_train, axis=1)
This will convert back to the initial representation of your classes.