How to find the best degree of polynomials? - python

I'm new to Machine Learning and currently got stuck with this.
First I used linear regression to fit the training set, but got a very large RMSE. Then I tried polynomial regression to reduce the bias.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
poly_predict = poly_reg.predict(X_poly)
poly_mse = mean_squared_error(y, poly_predict)
poly_rmse = np.sqrt(poly_mse)
poly_rmse
That gave a slightly better result than linear regression, so I continued with degree = 3/4/5, and the result kept getting better. But it might be overfitting as the degree increases.
The best polynomial degree should be the one that gives the lowest RMSE on a cross-validation set, but I don't have any idea how to achieve that. Should I use GridSearchCV, or some other method?
Much appreciated if you could help me with this.

You should provide the data for X/y next time, or something dummy; it'll be faster and you'll get a more specific solution. For now I've created a dummy equation of the form y = X**4 + X**3 + X + 1.
There are many ways you can improve on this, but a quick iteration to find the best degree is to simply fit your data on each degree and pick the degree with the best performance (e.g., lowest RMSE).
You can also play with how you decide to hold out your train/test/validation data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = X**4 + X**3 + X + 1

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rmses = []
degrees = np.arange(1, 10)
min_rmse, min_deg = 1e10, 0

for deg in degrees:

    # Train features
    poly_features = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly_train = poly_features.fit_transform(x_train)

    # Linear regression
    poly_reg = LinearRegression()
    poly_reg.fit(x_poly_train, y_train)

    # Compare with test data
    x_poly_test = poly_features.transform(x_test)
    poly_predict = poly_reg.predict(x_poly_test)
    poly_mse = mean_squared_error(y_test, poly_predict)
    poly_rmse = np.sqrt(poly_mse)
    rmses.append(poly_rmse)

    # Cross-validation of degree
    if min_rmse > poly_rmse:
        min_rmse = poly_rmse
        min_deg = deg

# Plot and present results
print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, rmses)
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')
This will print:
Best degree 4 with RMSE 1.27689038706e-08
Alternatively, you could build an estimator (for example, a pipeline) that carries out the polynomial fitting and pass that to GridSearchCV with a set of parameters; a rough sketch follows.
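As a quick sketch of that idea (this assumes the x_train/y_train split from the code above, sklearn's default pipeline step names, and negative mean squared error as the scoring metric):

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())
param_grid = {'polynomialfeatures__degree': np.arange(1, 10)}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(x_train, y_train)
print(search.best_params_)  # degree with the best cross-validated score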

In my opinion, the best way to find the optimal curve-fitting degree, or more generally the right model, is to use the GridSearchCV module from the scikit-learn library.
Here is an example of how to use it:
First, let's define a method to sample random data:
def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
Build a pipeline:
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
Create the data and a vector (X_test) for testing and visualisation purposes:
X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]
Define the GridSearchCV parameters (note that the normalize option was removed from LinearRegression in newer scikit-learn releases, so drop that entry if it raises an error):
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)
Get the best parameters from our model:
model = grid.best_estimator_
model
Pipeline(memory=None,
         steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
Fit the model with the X and y data and use the vector to predict the values:
y_test = model.fit(X, y).predict(X_test)
Visualize the result:
plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
The best fit result
The full code snippet:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

model = grid.best_estimator_
y_test = model.fit(X, y).predict(X_test)

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')

This is really where Bayesian model selection comes in: it gives you the most likely model given both model complexity and data fit. The quick answer is to use the BIC (Bayesian information criterion):

k = number of fitted parameters in the model
n = number of observations
sse = sum(residuals**2)

BIC = n*ln(sse/n) + k*ln(n)

Picking the model with the lowest BIC (or AIC, etc.) will give you the best model; a rough sketch is below.
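A minimal sketch of how the BIC could be plugged into the degree search above (this reuses the x_train/y_train split and the PolynomialFeatures/LinearRegression imports from the first answer, and assumes Gaussian errors so that the formula above applies):

def bic(y_true, y_pred, k):
    # BIC for a least-squares fit with Gaussian errors: n*ln(SSE/n) + k*ln(n)
    n = len(y_true)
    sse = np.sum((y_true - y_pred) ** 2)
    return n * np.log(sse / n) + k * np.log(n)

bics = []
for deg in range(1, 10):
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly = poly.fit_transform(x_train)
    reg = LinearRegression().fit(x_poly, y_train)
    pred = reg.predict(x_poly)
    bics.append(bic(y_train, pred, k=x_poly.shape[1] + 1))  # +1 for the intercept

best_degree = int(np.argmin(bics)) + 1  # degrees start at 1
print('Best degree by BIC:', best_degree)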

Related


Why am I getting negative SCORE even if i am using scoring = 'neg_mean_squared_error'?
I tried the following code, apparently taken from the scikit-learn source code:
neg_mean_squared_error_scorer = make_scorer(mean_squared_error, greater_is_better=False)
Source Code
However, it doesn't work, and I don't see the point of using it if we are supposed to use scoring = 'neg_mean_squared_error'.
Here is the code I used:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from sklearn.metrics import \
r2_score, get_scorer, make_scorer, mean_squared_error
from sklearn.linear_model import \
Lasso, Ridge, LassoCV,LinearRegression
from sklearn.preprocessing import \
StandardScaler, PolynomialFeatures
from sklearn.model_selection import \
KFold, RepeatedKFold, GridSearchCV, \
cross_validate, train_test_split
# Features
x1 = np.linspace(-20,20,100)
x1 = np.array(x1).reshape(-1,1)
x2 = pow(x1,2)
x3 = pow(x1,3)
x4 = pow(x1,4)
x5 = pow(x1,5)
# Parameters
beta_0 = 1.75
beta_1 = 5
beta_3 = 0.05
beta_5 = -10.3
eps_mu = 0 # epsilon mean
eps_sigma = sqrt(4) # epsilon standard deviation
eps_size = 100 # epsilon size
np.random.seed(1) # Fixing a seed
eps = np.random.normal(eps_mu, eps_sigma, eps_size)
eps = np.array(eps).reshape(-1,1)
y = beta_0 + beta_1*x1 + beta_3*x3 + beta_5*x5 + eps
data = np.concatenate((y,x1,x2,x3,x4,x5), axis = 1)
X = data[:,1:6]
y = data[:,0]
alphas_to_try = np.linspace(0.00000000000000000000000001,0.002,10) ######## To modify #######
scoring = 'neg_mean_squared_error'
#scoring = (mean_squared_error, greater_is_better=False)
scorer = get_scorer(scoring)
k = 5
cv = KFold(n_splits = k)
for train_index, test_index in cv.split(data):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)

validation_scores = []
train_scores = []
results_list = []
test_scores = []

for curr_alpha in alphas_to_try:
    regmodel = Lasso(alpha = curr_alpha)
    results = cross_validate(
        regmodel, X, y, scoring=scoring, cv=cv,
        return_train_score = True)
    validation_scores.append(np.mean(results['test_score']))
    train_scores.append(np.mean(results['train_score']))
    results_list.append(results)
    regmodel.fit(X,y)
    y_pred = regmodel.predict(X_test)
    test_scores.append(scorer(regmodel, X_test, y_test))
chosen_alpha_id = np.argmax(validation_scores)
chosen_alpha = alphas_to_try[chosen_alpha_id]
max_validation_score = np.max(validation_scores)
test_score_at_chosen_alpha = test_scores[chosen_alpha_id]
print('chosen_alpha:', chosen_alpha)
print('max_validation_score:', max_validation_score)
print('test_score_at_chosen_alpha:', test_score_at_chosen_alpha)
plt.figure(figsize = (8,8))
sns.lineplot(y = validation_scores, x = alphas_to_try, label = 'validation_data')
sns.lineplot(y = train_scores, x = alphas_to_try, label = 'training_data')
plt.axvline(x=chosen_alpha, linestyle='--')
sns.lineplot(y = test_scores, x = alphas_to_try, label = 'test_data')
plt.xlabel('alpha_parameter')
plt.ylabel(scoring)
plt.title('LASSO Regularisation')
plt.legend()
plt.show()
Why is the code not working? Why am I getting negative scores?
Output: (plot not shown)
What I am supposed to get: something like the reference screenshot (not shown), but with MSE instead of r2 on the y axis.
As the name suggests, neg_mean_squared_error is the negative of the mean squared error, so negative scores are expected (in fact, it is positive scores that would be impossible).
As to the plots, there's a bigger problem. Your train and validation scores are obtained using cross_validate, and are fine. But your test scores are obtained by fitting the regressor to the entire X, y and then scoring that on X_test, y_test, a subset of the training set! So those scores are quite optimistically biased.
A quick check on the scale of the errors: you have a degree-5 polynomial with the original feature taking values between -20 and 20. So the target takes values on the order of 10^6, and so squared errors may be expected on the order of 10^12.
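A minimal sketch of how to get an unbiased test score instead (an assumption about the intent: hold out a test set once, run cross_validate on the training portion only, then refit on that training portion and score on the untouched test set; this reuses alphas_to_try, scoring, scorer and cv from the question):

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

validation_scores, train_scores, test_scores = [], [], []
for curr_alpha in alphas_to_try:
    regmodel = Lasso(alpha=curr_alpha)
    results = cross_validate(regmodel, X_tr, y_tr, scoring=scoring,
                             cv=cv, return_train_score=True)
    validation_scores.append(np.mean(results['test_score']))
    train_scores.append(np.mean(results['train_score']))
    # refit on the training portion only, then score on the held-out test set
    regmodel.fit(X_tr, y_tr)
    test_scores.append(scorer(regmodel, X_te, y_te))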

SVR/SVM output predictions are very similar to each other but far from true value

The main idea is to predict 2 target outputs based on the input features.
The input features are already scaled using StandardScaler() from sklearn.
The size of X_train is (190 x 6) and Y_train is (190 x 2); X_test is (20 x 6) and Y_test is (20 x 2).
Both the linear and rbf kernels use GridSearchCV to find the best C (linear), and the best gamma and C (rbf).
[PROBLEM] I perform SVR using MultiOutputRegressor on both the linear and rbf kernels, but the predicted outputs are very similar to each other (not exactly a constant prediction) and pretty far from the true values of Y.
Below are the plots, where the scatter points represent the true values of Y. The first picture corresponds to the first target, Y[:,0], and the second picture to the second target, Y[:,1].
Do I have to scale my target outputs? Is there any other model that could help improve test accuracy?
I have tried a random forest regressor with tuning as well, and its test accuracy is about the same as what I'm getting with SVR (results below are from SVR).
Best parameter: {'estimator__C': 1}
MAE: [18.51151192 9.604601 ] #from linear kernel
Best parameter (rbf): {'estimator__C': 1, 'estimator__gamma': 1e-09}
MAE (rbf): [17.80482033 9.39780134] #from rbf kernel
Thank you so much! Any help and input is greatly appreciated! ^__^
---------------- Code -----------------------------
import numpy as np
from numpy import load
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=3)
#input features - HR, HRV, PTT, breathing_rate, LASI, AI
X = load('200_patient_input_scaled.npy')
#Output features - SBP, DBP
Y = load('200_patient_output_raw.npy')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.095, random_state = 43)
epsilon = 0.1
#--------------------------- Linear SVR kernel Model ------------------------------------------------------
linear_svr = SVR(kernel='linear', epsilon = epsilon)
multi_output_linear_svr = MultiOutputRegressor(linear_svr)
#multi_output_linear_svr.fit(X_train, Y_train) #just to see the output
#GridSearch - find the best C
grid = {'estimator__C': [1,10,10,100,1000] }
grid_linear_svr = GridSearchCV(multi_output_linear_svr, grid, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_linear_svr.fit(X_train, Y_train)
#Prediction
Y_predict = grid_linear_svr.predict(X_test)
print("\nBest parameter:", grid_linear_svr.best_params_ )
print("MAE:", mean_absolute_error(Y_predict,Y_test, multioutput='raw_values'))
#-------------------------- RBF SVR kernel Model --------------------------------------------------------
rbf_svr = SVR(kernel='rbf', epsilon = epsilon)
multi_output_rbf_svr = MultiOutputRegressor(rbf_svr)
#Grid search - Find best combination of C and gamma
grid_rbf = {'estimator__C': [1,10,10,100,1000], 'estimator__gamma': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2] }
grid_rbf_svr = GridSearchCV(multi_output_rbf_svr, grid_rbf, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_rbf_svr.fit(X_train, Y_train)
#Prediction
Y_predict_rbf = grid_rbf_svr.predict(X_test)
print("\nBest parameter (rbf):", grid_rbf_svr.best_params_ )
print("MAE (rbf):", mean_absolute_error(Y_predict_rbf,Y_test, multioutput='raw_values'))
#Plotting
plot_y_predict = Y_predict_rbf[:,1]
plt.scatter( np.linspace(0, 20, num = 20), Y_test[:,1], color = 'red')
plt.plot(np.linspace(0, 20, num = 20), plot_y_predict)
A common mistake with StandardScaler is applying it along the wrong axis of the data: scaling all the data at once, or row by row instead of column by column. Please make sure you've done this right! I would do it by hand to be sure, because otherwise I'd worry that it needs a separate StandardScaler fit for each feature.
[RESPONSE/EDIT]: I think that just negates what StandardScaler did by inverting the transformation. I'm not entirely sure of StandardScaler's behaviour; I'm saying all this from experience of having trouble scaling multi-feature data. If I were you (for example, for min-max scaling) I would prefer something like this:
columnsX = X.shape[1]
for i in range(columnsX):
    X[:, i] = (X[:, i] - X[:, i].min()) / (X[:, i].max() - X[:, i].min())
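For what it's worth, a quick sketch to check that StandardScaler already scales column by column (one mean/std per feature) when fitted on the training set and reused on the test set; this is a generic illustration, not the original poster's code:

from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # per-column mean/std learned here
X_test_scaled = scaler.transform(X_test)        # the same column-wise parameters are reused

# each column of the scaled training data should have mean ~0 and std ~1
print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))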

Cannot use Mean Squared Logarithmic Error (negative predicted values) though normalized data and predictions > -1

I am trying to implement a simple Sklearn.linear_model.LinearRegression model and evaluate its performance through MSLE:
MSLE is based on SLE = (log(prediction + 1) - log(actual + 1))^2
I have something like 15 features, which are all normalized or standardized and all positive.
But when I try to do cross-validation on my training data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lin_reg = LinearRegression()
linreg_scores = cross_val_score(lin_reg, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_log_error')
I get the following error:
ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
So I checked by hand doing a manual cross validation with sklearn.model_selection.KFold, in order to print the predicted values for each fold...
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.base import clone
kf = KFold(n_splits=5, shuffle=True, random_state=5)
lin_reg = LinearRegression()
split_count = 0
for train_index, val_index in kf.split(X_train, y_train):
    split_count += 1
    clone_reg = clone(lin_reg)
    X_tr = X_train.loc[train_index, :]
    X_val = X_train.loc[val_index, :]
    y_tr = y_train.loc[train_index]
    y_val = y_train.loc[val_index]
    clone_reg.fit(X_tr, y_tr)
    pred = clone_reg.predict(X_val)
    if any(pred < 0):
        print(split_count)
        print(pred[pred < 0])
The thing is, I do get negative predicted values, but they are all between [-1, 0]:
1
[-0.08642619]
3
[-0.2426673]
5
[-0.51744243]
So according to the MSLE formula, (y_predict + 1) should be positive, and thus ln(y_predict + 1) should be mathematically well-defined; a quick check follows.
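A quick numerical check of that claim (a sketch using numpy's log1p on the negative predictions printed above; the actual targets here are hypothetical placeholders):

import numpy as np

preds = np.array([-0.08642619, -0.2426673, -0.51744243])
actuals = np.zeros_like(preds)  # hypothetical non-negative targets, just for illustration

# squared log error term by term: (log(pred + 1) - log(actual + 1))**2
sle = (np.log1p(preds) - np.log1p(actuals)) ** 2
print(sle)  # finite values, since every prediction is > -1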
Is there something that I am missing here?
Thanks a lot for your help, I'll obviously provide any additional info if needed!

RMSE error doesn't converge towards the same value depending on the train/test ratio

I am trying to find a reliable testing method to compute the error of my model / training parameters, but I am seeing weird results when I play with the train/test ratio.
When I change the ratio of my train/test data, the RMSE converges towards different values (see the plots; the test ratio is shown in the top-right corner of each).
Zoomed in, even after 50K iterations the curves do not converge towards the same value.
Here is the code:
import time
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
np.random.seed(int(time.time()))
def seed():
    return np.random.randint(2**32 - 1)
n_scores_per_test = 50000
test_ratios = [.1, .2, .4, .6, .8]
model = Lasso(alpha=0.0005, random_state=seed(), tol=0.00001, copy_X=True)
# load our training data
train = pd.read_csv('train.csv')
X = train[['OverallCond']].values
y = np.log(train['SalePrice'].values)
# custom RMSE
def rmse(y_predicted, y_actual):
    tmp = np.power(y_actual - y_predicted, 2) / y_actual.size
    return np.sqrt(np.sum(tmp, axis=0))
for test_ratio in test_ratios:
    print('Testing test ratio:', test_ratio)
    scores = []
    avg_scores = []
    for i in range(n_scores_per_test):
        if i % 200 == 0:
            print(i, '/', n_scores_per_test)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_ratio, random_state=seed())
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        scores.append(rmse(y_pred, y_test))
        avg_scores.append(np.array(scores).mean())
    plt.plot(avg_scores, label=str(test_ratio))

plt.legend(loc='upper right')
plt.show()
Any idea why they don't all converge nicely together?
See https://github.com/benji/rmse_convergence/
UPDATE:
- using selection='random' for Lasso
- using random_state in Lasso
- using random_state in train_test_split
- removed the redundant shuffle()
- set a low tol on the Lasso model

Different accuracy for python (Scikit-Learn) and R (e1071)

For the same dataset (here Bupa) and the same parameters I get different accuracies.
What did I overlook?
R implementation:
data_file = "bupa.data"
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7~ .,data=dataset[train,],
type="C-classification",
kernel="linear",
cost=1,
cross=10)
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get accuracy: 0.94
but when I do the following in Python (scikit-learn)
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
import pandas as pd
from sklearn import svm, grid_search
f = open("data/bupa.data")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get accuracy 0.67
Please help me.
I came across this post having the same issue: wildly different accuracy between scikit-learn and the e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters for use when predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling approach needs to be taken on both the training and test data. I only thought to check this after encountering and reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling params (mean and standard deviation of the columns of Bupa). You can use these to appropriately scale your data in Python (see below for an idea). Alternately, you can scale the entire dataset together in Python and then do the test/train splits; either way should now give you identical predictions.
def manual_scale(a, means, sds):
    a1 = a - means
    a1 = a1 / sds
    return a1
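A sketch of the alternative mentioned above, scaling the whole dataset in Python before splitting so that both languages see identically scaled inputs (this reuses manual_scale plus the X, y arrays and imports from the question):

means = X.mean(axis=0)
sds = X.std(axis=0, ddof=1)  # ddof=1 to match R's sd()

X_scaled = manual_scale(X, means, sds)

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X_scaled, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))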
When using Support Vector Regression in Python/sklearn and R/e1071, both the x and y variables need to be scaled/unscaled.
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.datasets
import sklearn.svm
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):

    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow = len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow = len(x_test))

    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
                     y=ry_train,
                     kernel='radial',
                     cost=cost,
                     epsilon=epsilon,
                     gamma=gamma,
                     scale=scale)

    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))

    return ry_pred

# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.title(title)
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    plt.xlabel('observed')
    plt.ylabel('predicted')
    plt.legend(loc=0)
    return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1)... in R vs. 1/N... in Python where N is the sample size). However, for typical sample sizes, this should be negligible.
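To make that difference concrete, here is a tiny sketch of the two conventions (numpy's std defaults to the 1/N population form used by StandardScaler; ddof=1 gives the 1/(N-1) sample form used by R's sd()):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(x.std(ddof=0))  # 1/N form (numpy default, sklearn StandardScaler)
print(x.std(ddof=1))  # 1/(N-1) form (R's sd(), used by e1071's scaling)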
I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I simply used the StandardScaler instance obtained from scaling the training set. It is important to use it just for transforming, not for fitting and transforming (like above), i.e.:
X_test = sc_X.transform(X_test)
This allowed me to obtain substantial agreement between the R and scikit-learn results.
