I'm working on an SVC model for classification, and I get a different accuracy result each time I change the values of the parameters (svc__gamma, svc__kernel and svc__C). I read the Sklearn documentation but could not understand what these parameters mean. I have three questions:
1. What do these parameters indicate?
2. How do they affect accuracy each time I change them?
3. What are the correct parameter values?
The accuracy result is 0.70, but when I remove svc__gamma and svc__C, it increases to 0.76.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

pipe = make_pipeline(TfidfVectorizer(), SVC())

param_grid = {'svc__kernel': ['rbf', 'linear', 'poly'],
              'svc__gamma': [0.1, 1, 10, 100],
              'svc__C': [0.1, 1, 10, 100]}

svc_model = GridSearchCV(pipe, param_grid, cv=3)
svc_model.fit(X_train, Y_train)

prediction = svc_model.predict(X_test)
print(f"Accuracy score is {accuracy_score(Y_test, prediction):.2f}")
print(classification_report(Y_test, prediction))
Regarding 1:
gamma is a parameter of the Gaussian bell curve, so it should only affect the RBF (Gaussian) kernel. C is the parameter of the optimization problem, the inverse of the Lagrange multiplier. (A sketch of how to restrict gamma to the RBF kernel in the grid follows at the end of this answer.)
Regarding 2:
Get familiar with the mathematical background to fully understand how they affect your accuracy (side note: accuracy is usually not a reliable measure, but that depends on context).
Regarding 3:
There are no 'correct' parameters. They depend on the context, the data and the goal you want to achieve. Usually there is a trade-off between how well the algorithm works on the training data and how it works on new data (overfitting vs. underfitting).
I hope that helps as a first step :)
For further information I suggest reading up on SVMs.
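As a concrete illustration of point 1: since gamma is ignored by the linear kernel, the grid can be passed as a list of dicts so gamma is only varied where it matters. This is only a sketch built on the question's pipeline (X_train and Y_train are the question's variables; the value ranges are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(TfidfVectorizer(), SVC())

# gamma only matters for the non-linear kernels, so vary it only there
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.1, 1, 10, 100]},
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10, 100],
     'svc__gamma': [0.01, 0.1, 1, 10]},
]

svc_model = GridSearchCV(pipe, param_grid, cv=3)
svc_model.fit(X_train, Y_train)
print(svc_model.best_params_)   # which kernel / C / gamma won
print(svc_model.best_score_)    # its mean cross-validated accuracy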
Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.
I think it is possible the problem is that I am not using train-test-split to create a hold out test set because it is time series data, and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. But clarification from anyone who knows whether these could be the problems and what I might do to remedy these potential issues would be greatly appreciated.
I start with the following code (and have imported Ridge, GridSearchCV, make_pipeline, TimeSeriesSplit, numpy, pandas, etc.):
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1.2589254117941673}
-4.067235334106922
Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:
param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1e-300}
-2.0906161667718835e-13
Clearly, then, an alpha value of 1e-300 has a better Root MSE (approx. -2e-13) than an alpha value of 1e-1 (approx. -4), since negative Root MSE in GridSearchCV means the same thing - as I understand it - as positive Root MSE in all other contexts. So a Root MSE of -2e-13 is really 2e-13, -4 is really 4, and the lower the Root MSE, the better.
To see if np.logspace could be the culprit, I instead provide just a list of values:
param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
And the output shows the same problem:
{'ridge__alpha': 0.1}
-2.0419740158869386
And I don't think it's because I'm using TimeSeriesSplit, because I have tried using cv=5 instead of cv=tscv inside GridSearchCV() and it results in the same problem.
The same issue happens when I try Lasso instead of Ridge. Any thoughts?
This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (search's method score delegates to the score method of search.best_estimator_, which is the model using best hyperparameters refitted on the entire training set); but the grid search is selecting based on cross-validated scores, which are a better indicator for future performance.
Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, and so the rmse on that training data is very small (2e-13). Meanwhile, with alpha=1.26, the model performs worse on the training data (rmse 4), but performs better on unseen data. You can see those cross-validation scores in the grid search's attribute cv_results_.
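For instance, a short sketch of inspecting those cross-validated scores (using the grid object from the question's first snippet):

import pandas as pd

# cv_results_ holds the per-candidate cross-validated scores the search selected on
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['param_ridge__alpha', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())

# best_score_ is the mean cross-validated score of the winning alpha; this is the
# number to compare across candidates, not grid.score() on the training data
print(grid.best_score_)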
I need to predict what the output will be depending on the time. I want to train my model on the first 20% of the data and then build a model that follows the behaviour and predicts the remaining 80%.
The data I am working on looks as follows:
My data
But when I try to make regressions to do this, I either get something way off target, or something quite close but linear, which is not acceptable.
I think my problem may be the choice of kernel, or the way I am setting up the regressions. Right now I am making them with the sklearn package as follows:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

gpr = GaussianProcessRegressor(kernel=1.15**2 * RBF(length_scale=41.4) + WhiteKernel(noise_level=1.32e-4),
                               n_restarts_optimizer=10,
                               optimizer='fmin_l_bfgs_b',
                               normalize_y=True,
                               alpha=0.051)
gpr.fit(X_train, y_train)
y_gpr, y_std = gpr.predict(X_test, return_std=True)
But after a few predictions, the predictions just become the same steady value instead of following the curve in the data. Also, the standard deviations of the predictions become very large.
The GPR prediction on the real data
When doing the kernel ridge regression in Python, I can't seem to get the curve to follow the data either. Either it drops to 0 within a few predictions, or it has to be a linear prediction.
The KRR model, but linear instead - which is not good enough
The KRR model is made as follows (I know the kernel is polynomial with a degree of 1, but I can't seem to figure out or find an appropriate kernel that will follow my data):
from sklearn.kernel_ridge import KernelRidge

# The kernel ridge regression
krr = KernelRidge(alpha=0.051, kernel='polynomial', degree=1)
# krr = KernelRidge(alpha=0.051, kernel=RBF(0.5))
krr.fit(X_train, y_train)
list_y_pred = krr.predict(X_test)
So if possible, I would like some input on how it should be done instead, or whether a different approach to the problem would be better. But I am really hoping I can get the KRR to fit the data, and the Gaussian process regression as well.
There is nothing absolutely wrong with your code. I believe your parameters are wrong, and the fit is not the best because of it.
My suggestion would be to use grid search and pipelines to estimate the best parameters.
An example of how it would work:
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'alpha': [1, 10, 100, 1000], 'kernel': ['linear']},                         # test the linear kernel while varying alpha
    {'alpha': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},  # test the rbf kernel while varying gamma and alpha
]
# You can have as many dictionaries as you want inside this list, or just one. Keep in
# mind that the number of fits is the product of the grid sizes times the number of CV
# folds, so the search can take a long time.

estimator = KernelRidge()
clf = GridSearchCV(estimator, param_grid)
clf.fit(X_train, y_train)
list_y_pred = clf.predict(X_test)
For a more comprehensive tutorial, take a look at the official docs here and here, or even here for a faster but less thorough search.
Keep in mind my parameters are way off; I just copied the example from the docs.
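If the faster search mentioned above is a randomized search (an assumption on my part), a minimal sketch for KernelRidge could look like this; the distributions and iteration count are illustrative, not tuned:

from scipy.stats import loguniform
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import RandomizedSearchCV

# sample alpha and gamma from log-uniform ranges instead of an exhaustive grid
param_distributions = {
    'alpha': loguniform(1e-3, 1e3),
    'gamma': loguniform(1e-4, 1e1),
    'kernel': ['rbf'],
}

search = RandomizedSearchCV(KernelRidge(), param_distributions,
                            n_iter=50, cv=5, random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)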
I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average F-measure and roc_auc.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

df = pd.read_csv(input_path + input_file)
X = df[features]
y = df["gold_standard"]

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold,
                        scoring=('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'))

print("accuracy")
print(np.mean(scores['test_accuracy']))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted']))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted']))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted']))
print("roc_auc")
print(np.mean(scores['test_roc_auc']))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806, 0.6754, 0.6806, 0.6643, 0.7233
So, we can see that feature setting 1 gives better results for 'accuracy', 'precision_weighted', 'recall_weighted' and 'f1_weighted' than feature setting 2.
However, when it comes to 'roc_auc', feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.
On one hand, I suspect that this happens because I am using weighted scores for precision, recall and F-measure, but not for roc_auc. Is it possible to do a weighted roc_auc for binary classification in sklearn?
What is the real reason for these weird roc_auc results?
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and F1-score are calculated over the hard class predictions 0/1, i.e. after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
So, it can certainly happen, and it can indeed lead to confusion among new practitioners.
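A tiny sketch of the distinction, using made-up probabilities (the arrays below are purely illustrative):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.6, 0.9, 0.45, 0.55])  # hypothetical predicted probabilities for class 1

# accuracy (and precision/recall/F1) first turns probabilities into hard labels at 0.5
y_pred = (y_prob > 0.5).astype(int)
print(accuracy_score(y_true, y_pred))   # depends on that particular threshold

# AUC is computed directly from the probabilities, over all possible thresholds
print(roc_auc_score(y_true, y_prob))    # threshold-free ranking quality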
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.
I have to use a decision tree for binary classification on an unbalanced dataset (50000:0, 1000:1). To get a good recall (0.92) I used the RandomOverSampler from the imblearn module and pruning with the max_depth parameter.
The problem is that precision is very low (0.44); I have too many false positives.
I tried to train a specific classifier to deal with the borderline instances that generate false positives.
First I split the dataset into train and test sets (80%-20%).
Then I split train into train2 and test2 sets (66%, 33%).
I used a dtc(#1) to predict test2 and kept only the instances predicted as true.
Then I trained a dtc(#2) on all this data, with the goal of building a classifier able to distinguish borderline cases.
I used the dtc(#3) trained on the first oversampled train set to predict the official test set and got Recall=0.92 and Precision=0.44.
Finally I used dtc(#2) only on the data predicted as true by dtc(#3), hoping to distinguish TP from FP, but it did not work too well. I got Rec=0.79 and Prec=0.69.
x_train, X_test, y_train, Y_test = train_test_split(df2.drop('k', axis=1), df2['k'],
                                                    test_size=0.2, random_state=0)
x_res, y_res = ros.fit_resample(x_train, y_train)
df_to_trick = df2.iloc[x_train.index.tolist(), :]
# ....split in 0.33-0.66, trained and tested
confusion_matrix(y_test,predicted1) #dtc1
array([[13282, 266],
[ 18, 289]])
# training dtc2 only on the (266+289) data points
confusion_matrix(Y_test,predicted3) #dtc3 on official test set
array([[9950, 294],
[ 20, 232]])
confusion_matrix(true, predicted4)  # here I used dtc2 on the (294+232) data points
array([[204, 90],
[ 34, 198]])
I have to choose between dtc3 (Recall=0.92, Prec=0.44) and the entire convoluted process (Recall=0.79, Prec=0.69).
Do you have any ideas to improve these metrics? My goal is around 0.8/0.9.
Keep in mind that precision and recall are based on the threshold that you choose (i.e. in sklearn the default threshold is 0.5 - any sample with a predicted probability > 0.5 is classified as positive), and that there will always be a trade-off between favoring precision over recall. ...
I think in the case you are describing (trying to fine-tune your classifier given your model's performance limitations) you can choose a higher or lower cut-off threshold that has a more favorable precision-recall trade-off ...
The below code can help you visualize how your precision and recall change as you move your decision threshold:
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions, recalls, thresholds as returned by sklearn.metrics.precision_recall_curve
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
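To feed that function, the curve values can come from precision_recall_curve on the positive-class probabilities. A sketch, assuming a fitted tree called dtc3 and the question's test split names (both hypothetical here):

from sklearn.metrics import precision_recall_curve

# probability of the positive class from the already-fitted tree
y_scores = dtc3.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(Y_test, y_scores)
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()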
Other suggestions to improve your model's performance are to use alternative pre-processing methods (SMOTE instead of random oversampling) or to choose a more complex classifier (a random forest / ensemble of trees, or a boosting approach such as AdaBoost or gradient boosting).
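For example, a minimal sketch of swapping the random oversampler for SMOTE from imblearn (variable names follow the question; the max_depth value is illustrative, not tuned):

from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

# resample only the training split, then fit the tree on the synthetically balanced data
smote = SMOTE(random_state=0)
x_res, y_res = smote.fit_resample(x_train, y_train)
dtc = DecisionTreeClassifier(max_depth=5, random_state=0)
dtc.fit(x_res, y_res)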
Given a machine learning model (an RBF SVC called 'm'), I performed a GridSearchCV on the gamma value to optimize recall.
I'm looking to answer to this:
"The grid search should find the model that best optimizes for recall. How much better is the recall of this model than the precision?"
So I did the GridSearchCV:
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
y_decision_fn_scores_re = grid_m_re.decision_function(X_test)
print('Grid best parameter (max. recall): ', grid_m_re.best_params_)
print('Grid best score (recall): ', grid_m_re.best_score_)
This tells me the best model is for gamma=0.001 and that it has a recall score of 1.
I'm wondering how to get the precision of this model, to see its trade-off, because GridSearchCV only exposes the metric it was told to optimize for. ([Doc sklearn.GridSearchCV][1])
Not sure if there's an easier/more direct way to get this, but this approach also allows you to capture the 'best' model to play around with later:
First do your CV fit on the training data:
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
Once you're done, you can pull out the 'best' model (as determined by your scoring criteria during CV), and then use it however you want:
m_best = grid_m_re.best_estimator_
and in your specific case:
from sklearn.metrics import precision_score
y_pred = m_best.predict(X_test)
precision_score(y_test, y_pred)
You can easily overfit if you don't optimize both C and gamma at the same time.
If you plot the SVC with C on the x axis, gamma on the y axis and recall as colour, you get some kind of V-shape, see here.
So if you do grid search, better to optimize both C and gamma at the same time.
The problem is that you usually get the best results for small C values, and in that area the V-shape has its pointy end: it is not very big and is difficult to hit.
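For instance, sticking with the question's setup, the grid could cover both parameters at once (the value ranges below are illustrative guesses, and m is the question's SVC):

import numpy as np
from sklearn.model_selection import GridSearchCV

grid_values = {'C': np.logspace(-2, 2, 5),
               'gamma': np.logspace(-4, 1, 6)}
grid_m_re = GridSearchCV(m, param_grid=grid_values, scoring='recall')
grid_m_re.fit(X_train, y_train)
print(grid_m_re.best_params_, grid_m_re.best_score_)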
I recently used the following approach (a rough code sketch follows at the end of this answer):
make a random grid of 10 points
every point contains C, gamma, a direction and a speed
cut the dataset with StratifiedShuffleSplit
fit & estimate the score with cross validation
repeat:
kill the worst two points
the best two points spawn a kid
move every point in its direction with just a little bit of random
fit & estimate the score with cross validation
(if a point notices it is going downhill, it turns around and halves its speed)
until the break criterion is hit
Worked like a charm.
I used the max distance in the feature space divided by four as the initial speed,
and the direction had a random component of at most pi/4.
Well, the cross validation was a bit costly.
Cleptocreatively inspired by this paper.
... and another edit:
I used between 10 and 20 cycles in the loop to get the perfect points.
If your dataset is too big to do several fits, make a representative subset for the first few trainings...
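Below is a rough, hedged sketch of that evolutionary search against sklearn's SVC, working in (log10 C, log10 gamma) space. The population size, value ranges, mutation scale and fixed cycle count are my assumptions; also, the description above spawns one kid per cycle, while this sketch spawns two so the population size stays constant:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

def cv_recall(point, X, y):
    # cross-validated recall of an RBF SVC at the given (log10 C, log10 gamma) point
    C, gamma = 10.0 ** point[0], 10.0 ** point[1]
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=cv, scoring='recall').mean()

def evolve(X, y, n_points=10, n_cycles=15):
    # each row: [log10 C, log10 gamma, direction (angle), speed]
    pts = np.column_stack([rng.uniform(-2, 2, n_points),          # log10 C
                           rng.uniform(-4, 1, n_points),          # log10 gamma
                           rng.uniform(0, 2 * np.pi, n_points),   # direction
                           np.full(n_points, 0.5)])               # initial speed
    scores = np.array([cv_recall(p, X, y) for p in pts])

    for _ in range(n_cycles):
        order = np.argsort(scores)                                 # worst first
        pts, scores = pts[order][2:], scores[order][2:]            # kill the worst two
        parents_mid = pts[-2:, :2].mean(axis=0)                    # midpoint of the best two
        for _ in range(2):                                         # spawn kids near it
            kid = np.append(parents_mid + rng.normal(0, 0.1, 2),
                            [rng.uniform(0, 2 * np.pi), 0.5])
            pts = np.vstack([pts, kid])
            scores = np.append(scores, cv_recall(kid, X, y))

        for i in range(len(pts)):                                  # move every point
            p = pts[i].copy()
            p[:2] += p[3] * np.array([np.cos(p[2]), np.sin(p[2])]) + rng.normal(0, 0.05, 2)
            new_score = cv_recall(p, X, y)
            if new_score < scores[i]:                              # going downhill:
                p[2] += np.pi                                      # turn around
                p[3] /= 2                                          # half speed
            pts[i], scores[i] = p, new_score

    best = pts[np.argmax(scores)]
    return 10.0 ** best[0], 10.0 ** best[1], scores.max()

# usage, with the question's training data:
# C, gamma, recall = evolve(X_train, y_train)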