I'm new to machine learning, and in the books and documentation I read there is always a score value between 0 and 1, which represents an accuracy between 0% and 100%.
In my own machine learning code with scikit-learn I get score values between -750.880810 and 5154.771036, which confuses me.
>>> pipe = Pipeline([("scaler", MinMaxScaler()), ("svr", SVR())])
>>> param_grid = {'svr__C': [0.1, 1, 5],
...               'svr__epsilon': [0.001, 0.01]}
>>> grid = GridSearchCV(estimator=pipe,
...                     param_grid=param_grid,
...                     cv=GroupKFold(n_splits=24))
>>> grid.fit(X, y, groups)
GridSearchCV(cv=GroupKFold(n_splits=24), error_score=nan,
estimator=Pipeline(memory=None,
steps=[('scaler',
MinMaxScaler(copy=True,
feature_range=(0, 1))),
('svr',
SVR(C=1.0, cache_size=200, coef0=0.0,
degree=3, epsilon=0.1,
gamma='scale', kernel='rbf',
max_iter=-1, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid='deprecated', n_jobs=None,
param_grid={'svr__C': [0.1, 1, 5], 'svr__epsilon': [0.001, 0.01]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
>>> grid.best_score_
-750.880810
Could someone please explain that to me?
Edit:
My input data consists of measurements of an engine.
I have 12 different failures of the engine and every failure is measured twice => 12x2 = 24 different groups (I will also try 12 groups). Every group consists of:
X data: 13 different features (temperature, pressure, electric voltage etc.) with 1200 samples per group
y data: 1 feature (pressure) with 1200 samples per group
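For reference, here is a minimal sketch (with random placeholder values, not my real measurements) of how this layout maps onto the X, y and groups arrays that GroupKFold expects:
import numpy as np

n_groups, n_per_group, n_features = 24, 1200, 13
X = np.random.rand(n_groups * n_per_group, n_features)   # 28800 x 13 feature matrix
y = np.random.rand(n_groups * n_per_group)               # 28800 target values (pressure)
groups = np.repeat(np.arange(n_groups), n_per_group)     # group id per sample: 0,0,...,1,1,...,23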
Accuracy is the usual scoring method for classification problems. For a regression problem, the default is the R² value.
For the scoring param in GridSearchCV:
If None, the estimator's score method is used.
For SVR, the default score method comes from RegressorMixin, which is R².
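A quick way to confirm this (a minimal sketch on synthetic data): an estimator's score method returns the same number as r2_score applied to its predictions.
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression

X, y = make_regression(random_state=0)
model = SVR().fit(X, y)

# RegressorMixin.score is exactly the R^2 of the predictions
print(model.score(X, y))
print(r2_score(y, model.predict(X)))  # same value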
Documentation:
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
sum of squares ((y_true - y_true.mean()) ** 2).sum().
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse).
A constant model that always
predicts the expected value of y, disregarding the input features,
would get a R^2 score of 0.0.
Hence, it looks weird when you get a very large or very small value as R².
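To see how R² can become negative, here is a quick check with r2_score (a minimal sketch): any prediction whose squared errors exceed those of simply predicting the mean scores below 0.
from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean of y_true gives R^2 = 0.0
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))    # 0.0

# Predictions far off the mark give an arbitrarily negative R^2
print(r2_score(y_true, np.array([100.0, -50.0, 80.0, -20.0])))  # -3801.0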
A toy example, to understand the scoring output:
from sklearn import datasets, svm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
import numpy as np
np.random.seed(0)
X, y = datasets.make_regression()
groups = np.random.randint(0, 10, len(X))
pipe = Pipeline([("scaler", MinMaxScaler()), ("svr", svm.SVR())])
parameters = {'svr__C': [ 0.1, 1, 5, 100], 'svr__epsilon': [0.001, 0.1]}
svr = svm.SVR()
clf = GridSearchCV(pipe, parameters, cv=GroupKFold(n_splits=2))
clf.fit(X, y, groups)
print(clf.best_score_)
# 0.1239707770092825
I would recommend trying different cv settings and investigating the issue.
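One way to investigate is to inspect the per-split scores stored in cv_results_ on the fitted search object (a sketch using grid from the question; the same works for clf in the toy example above). A single pathological fold can drag best_score_ far below zero.
import pandas as pd

# Mean and spread of the test score for every parameter combination
results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_test_score", "std_test_score"]])

# The individual split columns show whether a few folds are responsible for the extreme values
split_cols = [c for c in results.columns if c.startswith("split") and c.endswith("_test_score")]
print(results[split_cols])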
Related
The main idea is to predict 2 target outputs based on input features.
The input features are already scaled using StandardScaler() from sklearn.
The size of X_train is (190 x 6), Y_train is (190 x 2), X_test is (20 x 6), Y_test is (20 x 2).
Both the linear and rbf kernels use GridSearchCV to find the best C (linear), and gamma and C (rbf).
[PROBLEM] I perform SVR using MultiOutputRegressor on both the linear and rbf kernels, but the predicted outputs are very similar to each other (not exactly a constant prediction) and pretty far from the true values of y.
Below are the plots, where the scatter points represent the true values of Y. The first picture corresponds to the first target, Y[:,0], and the second picture to the second target, Y[:,1].
Do I have to scale my target output? Is there any other model that could help improve test accuracy?
I have tried a random forest regressor and performed tuning as well, and the test accuracy is about the same as what I'm getting with SVR. (Results below are from SVR.)
Best parameter: {'estimator__C': 1}
MAE: [18.51151192 9.604601 ] #from linear kernel
Best parameter (rbf): {'estimator__C': 1, 'estimator__gamma': 1e-09}
MAE (rbf): [17.80482033 9.39780134] #from rbf kernel
Thank you so much! Any help and input is greatly appreciated!! ^__^
---------------- Code -----------------------------
import numpy as np
from numpy import load
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=3)
#input features - HR, HRV, PTT, breathing_rate, LASI, AI
X = load('200_patient_input_scaled.npy')
#Output features - SBP, DBP
Y = load('200_patient_output_raw.npy')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.095, random_state = 43)
epsilon = 0.1
#--------------------------- Linear SVR kernel Model ------------------------------------------------------
linear_svr = SVR(kernel='linear', epsilon = epsilon)
multi_output_linear_svr = MultiOutputRegressor(linear_svr)
#multi_output_linear_svr.fit(X_train, Y_train) #just to see the output
#GridSearch - find the best C
grid = {'estimator__C': [1, 10, 100, 1000]}
grid_linear_svr = GridSearchCV(multi_output_linear_svr, grid, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_linear_svr.fit(X_train, Y_train)
#Prediction
Y_predict = grid_linear_svr.predict(X_test)
print("\nBest parameter:", grid_linear_svr.best_params_ )
print("MAE:", mean_absolute_error(Y_predict,Y_test, multioutput='raw_values'))
#-------------------------- RBF SVR kernel Model --------------------------------------------------------
rbf_svr = SVR(kernel='rbf', epsilon = epsilon)
multi_output_rbf_svr = MultiOutputRegressor(rbf_svr)
#Grid search - Find best combination of C and gamma
grid_rbf = {'estimator__C': [1, 10, 100, 1000], 'estimator__gamma': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]}
grid_rbf_svr = GridSearchCV(multi_output_rbf_svr, grid_rbf, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_rbf_svr.fit(X_train, Y_train)
#Prediction
Y_predict_rbf = grid_rbf_svr.predict(X_test)
print("\nBest parameter (rbf):", grid_rbf_svr.best_params_ )
print("MAE (rbf):", mean_absolute_error(Y_predict_rbf,Y_test, multioutput='raw_values'))
#Plotting (second target, Y[:,1]): red scatter = true values, line = predictions
plot_y_predict = Y_predict_rbf[:, 1]
plt.scatter(np.linspace(0, 20, num=20), Y_test[:, 1], color='red')
plt.plot(np.linspace(0, 20, num=20), plot_y_predict)
plt.show()
A common mistake when people use StandardScaler is applying it along the wrong axis of the data: scaling all the data at once, or row by row instead of column by column. Please make sure you've done this right! I would do it by hand to be sure, because otherwise I think it needs a different StandardScaler fit for each feature.
[RESPONSE/EDIT]: I think that just negates what StandardScaler did by inverting the application. I'm not entirely sure of the StandardScaler behaviour; I'm just saying all this from experience and from having trouble scaling multi-feature data. If I were you (for example for MinMax scaling) I would prefer something like this:
columnsX = X.shape[1]
for i in range(columnsX):
    X[:, i] = (X[:, i] - X[:, i].min()) / (X[:, i].max() - X[:, i].min())
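For what it's worth, scikit-learn's MinMaxScaler (like StandardScaler) already scales each column, i.e. each feature, independently by default, so the hand-rolled loop above should match it exactly (a quick sketch with placeholder data):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(190, 6)  # placeholder with the shapes from the question

# MinMaxScaler fits one min/max per column (per feature)
X_sklearn = MinMaxScaler().fit_transform(X)

# Manual per-column min-max scaling, as in the loop above
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(X_sklearn, X_manual))  # True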
I am using ElasticNet to fit my data. To determine the hyperparameters (l1_ratio, alpha), I am using ElasticNetCV. With the obtained hyperparameters, I refit the model to the whole dataset for production use. I am unsure whether this is correct, both in the machine learning aspect and - if so - in how I do it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet, ElasticNetCV

X_tr, X_te, y_tr, y_te = train_test_split(X, y)
optimizer = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .99, 1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X, y)
Thank you in advance
I am a beginner at this, but I would love to share my approach to ElasticNet hyperparameter tuning. I would suggest using RandomizedSearchCV instead. Here is part of the code I am currently writing:
#-----------------------------------------------
# Input:
#   X_train, X_test, Y_train, Y_test: datasets
# Returns:
#   R² and RMSE scores
#-----------------------------------------------
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Standardize the data first
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the grid
params = dict()
# values for alpha: 100 log-spaced values between 10^-5 and 10^5
params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
# values for l1_ratio: 100 values between 0 and 1
params['l1_ratio'] = np.arange(0, 1, 0.01)
# Warning: the grid contains 100 x 100 = 10 000 possible combinations

# Create an instance of the ElasticNet regressor
regressor = ElasticNet()
# Call RandomizedSearchCV with cross-validation using the chosen regressor
rs_cv = RandomizedSearchCV(regressor, params, n_iter=100, scoring=None, cv=5, verbose=0, refit=True)
rs_cv.fit(X_train, Y_train.values.ravel())

# Results
Y_pred = rs_cv.predict(X_test)
R2_score = rs_cv.score(X_test, Y_test)
RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
return R2_score, RMSE_score, rs_cv.best_params_
The advantage is that with RandomizedSearchCV the number of iterations can be fixed in advance. The points to be tested are chosen at random, but the search can be much faster (90% faster in some cases) than GridSearchCV, which tests all possible combinations.
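As a rough illustration of the difference (a sketch, not a benchmark): with the grid above, GridSearchCV fits every combination, while RandomizedSearchCV only samples n_iter of them.
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# With cv=5:
#   GridSearchCV:       100 alphas x 100 l1_ratios = 10 000 candidates -> 50 000 fits
#   RandomizedSearchCV: n_iter=100 sampled candidates                  ->    500 fits
gs = GridSearchCV(ElasticNet(), params, cv=5)                    # exhaustive
rs = RandomizedSearchCV(ElasticNet(), params, n_iter=100, cv=5)  # sampled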
I am using this same approach for other regressors like random forests and gradient boosting, whose parameter grids are far more complicated and demand much more computing power to run.
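For example, a random forest search space quickly grows to hundreds of combinations, which is where fixing n_iter pays off (the parameter values below are illustrative only):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', None],
}
# 3 x 4 x 3 x 3 = 108 combinations; RandomizedSearchCV samples only n_iter of them
rf_search = RandomizedSearchCV(RandomForestRegressor(), rf_params, n_iter=50, cv=5)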
As I said at the beginning, I am new to this field, so any constructive comments are welcome.
Johnny
I need to develop a model which will be free (or close to free) of false negatives. To do so I've plotted a recall-precision curve and determined that the threshold value should be set to 0.11.
My question is: how do I set the threshold value during model training? There's no point in defining it later, during evaluation, because it won't carry over to new data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

rfc_model = RandomForestClassifier(random_state=101)
rfc_model.fit(X_train, y_train)
rfc_preds = rfc_model.predict(X_test)

recall_precision_vals = []
for val in np.linspace(0, 1, 101):
    predicted_proba = rfc_model.predict_proba(X_test)
    predicted = (predicted_proba[:, 1] >= val).astype('int')
    recall_sc = recall_score(y_test, predicted)
    precis_sc = precision_score(y_test, predicted)
    recall_precision_vals.append({
        'Threshold': val,
        'Recall val': recall_sc,
        'Precis val': precis_sc
    })
recall_prec_df = pd.DataFrame(recall_precision_vals)
Any ideas?
how to define threshold value upon model training?
There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which indeed require a threshold, are neither produced nor used in any stage of the model training - only during prediction, and even then only in the cases we indeed require a hard classification (not always the case). Please see Predict classes or class probabilities? for more details.
In fact, the scikit-learn implementation of RF doesn't employ a threshold at all, even for hard class prediction; reading the docs for the predict method closely:
the predicted class is the one with highest mean probability estimate across the trees
In simple words, this means that the actual RF output is [p0, p1] (assuming binary classification), from which the predict method simply returns the class with the highest value, i.e. 0 if p0 > p1 and 1 otherwise.
Assuming that what you actually want is to return 1 if p1 is greater than some threshold less than 0.5, you have to ditch predict, use predict_proba instead, and then manipulate the returned probabilities to get what you want. Here is an example with dummy data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)
clf.fit(X, y)
Here, simply using predict for, say, the first element of X, will give 0:
clf.predict(X)[0]
# 0
because
clf.predict_proba(X)[0]
# array([0.85266881, 0.14733119])
i.e. p0 > p1.
To get what you want (i.e. here returning class 1, since p1 > threshold for a threshold of 0.11), here is what you have to do:
prob_preds = clf.predict_proba(X)
threshold = 0.11 # define threshold here
preds = [1 if prob_preds[i][1]> threshold else 0 for i in range(len(prob_preds))]
after which, it is easy to see that now for the first predicted sample we have:
preds[0]
# 1
since, as shown above, for this sample we have p1 = 0.14733119 > threshold.
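For larger arrays, the same thresholding can also be written in vectorized NumPy form (an equivalent sketch):
# 1 where p1 > threshold, else 0 - same result as the list comprehension above
preds = (clf.predict_proba(X)[:, 1] > threshold).astype(int)
print(preds[0])  # 1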
I have a multiclass classification problem with various classifiers (random forest, SVM, NN), and I use OneVsRestClassifier to wrap my models. I want to use an interpretability method (LIME) which makes use of probabilities that sum to 1, but when I use predict_proba, the rows of the returned matrix do not always sum to 1.
It's a multiclass classification problem. I have checked my raw data, my binarized values, and my train/test data to confirm that there is no overlap of classes. Each instance has exactly one label (100, 010, or 001).
import pandas as pd
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

x = pd.read_pickle(r"x.pkl").values
y = pd.read_pickle(r"y.pkl").values

# binarize labels for multilabel auc calculations
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# create train and test sets, stratified
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.20, random_state=5)

rfclassifier = RandomForestClassifier(n_estimators=100, random_state=5, criterion='gini', bootstrap=True)
classifier = OneVsRestClassifier(rfclassifier)
classifier.fit(x_train, y_train)

prediction = classifier.predict(x_test)
probability = classifier.predict_proba(x_test)

# check probabilities for one example of each label
print(classifier.predict_proba([x_test[0]]).round(3))
print(classifier.predict_proba([x_test[1]]).round(3))
print(classifier.predict_proba([x_test[20]]).round(3))
The print statements show examples for label 1, 0, and 2 respectively.
The outputs are [[0.164 0.836 0. ]], [[0.953 0.015 0. ]], and [[0.01 0.12 0.96]]. The last two (as well as many other instances) do not sum to 1, which prevents me from applying the interpretability method.
I am using stratified 10-fold cross-validation to find the model that predicts y (a binary outcome) from X (X has 34 features) with the highest AUC. I set up the GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

log_reg = LogisticRegression()
parameter_grid = {'penalty': ["l1", "l2"], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid, scoring='roc_auc',
                           cv=cross_validation)
And then do the cross-validation:
grid_search.fit(X, y)
y_pr=grid_search.predict(X)
I do not understand the following:
Why do grid_search.score(X, y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63)? Why don't these commands do the same thing in my case?
This is due to the different initialization of the roc_auc scorer when it is used inside GridSearchCV.
Look at the source code here
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
Observe the third parameter, needs_threshold. When true, it requires continuous values for y_pred, such as probabilities or confidence scores, which in grid search are calculated from log_reg.decision_function().
When you explicitly call roc_auc_score with y_pr, you are using .predict(), which outputs the predicted class labels of the data, not probabilities. That should account for the difference.
Try :
y_pr=grid_search.decision_function(X)
roc_auc_score(y, y_pr)
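If decision_function is not available for the estimator you end up using, the positive-class probability from predict_proba works just as well here (an alternative sketch):
y_scores = grid_search.predict_proba(X)[:, 1]
roc_auc_score(y, y_scores)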
If the results are still not the same, please update the question with complete code and some sample data.