How to improve and tune these models in sklearn (Python)

I am building an algorithm to predict football matches for sports betting.
I have a function that trains several models from a list. I would like to improve the accuracy of the models and get probabilities that are as accurate as possible.
List of classifiers
clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = GradientBoostingClassifier()
clf3 = MLPClassifier()
classifiers = [
    MLPClassifier(),
    # labels renamed to match the estimators they refer to
    VotingClassifier(estimators=[('lr', clf1), ('gb', clf2), ('mlp', clf3)], voting='soft'),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    CalibratedClassifierCV(),
    LinearDiscriminantAnalysis(),
    LogisticRegression(),
    LogisticRegressionCV(),
    QuadraticDiscriminantAnalysis(),
]
Code
def proba_classifiers(df_clean, df_clean_pred, X, X_pred, y, classifier_list):
    df_proba = []
    df_proba_pred = []
    for classifier in classifier_list:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.8, stratify=y)
        classifier.fit(X_train, y_train)
        p_train = classifier.predict(X_train)
        acc_train = accuracy_score(y_train, p_train)
        p_test = classifier.predict(X_test)
        acc_test = accuracy_score(y_test, p_test)
        print(f'Accuracy Train: {acc_train:.0%}, Accuracy Test: {acc_test:.0%}, Model: {classifier.__class__.__name__} ')
        proba = classifier.predict_proba(X)
        proba_pred = classifier.predict_proba(X_pred)
        proba = pd.DataFrame(proba, columns=['H', 'D', 'A'])
        df_proba.append(proba)
        proba_pred = pd.DataFrame(proba_pred, columns=['H', 'D', 'A'])
        df_proba_pred.append(proba_pred)
    df_proba = pd.concat(df_proba, axis=1)
    # adjust iloc
    df_clean = df_clean.iloc[:, :16]
    df_proba = pd.concat([df_clean.reset_index(drop=True), df_proba], axis=1)
    df_proba_pred = pd.concat(df_proba_pred, axis=1)
    # adjust iloc
    df_clean_pred = df_clean_pred.iloc[:, :16]
    df_proba_pred = pd.concat([df_clean_pred.reset_index(drop=True), df_proba_pred], axis=1)
    return df_proba, df_proba_pred
P.S. X_pred holds the features of the matches that I want to predict.
Model scores
Accuracy Train: 48%, Accuracy Test: 48%, Model: MLPClassifier
Accuracy Train: 48%, Accuracy Test: 48%, Model: VotingClassifier
Accuracy Train: 48%, Accuracy Test: 48%, Model: AdaBoostClassifier
Accuracy Train: 50%, Accuracy Test: 48%, Model: GradientBoostingClassifier
Accuracy Train: 48%, Accuracy Test: 48%, Model: CalibratedClassifierCV
Accuracy Train: 48%, Accuracy Test: 48%, Model: LinearDiscriminantAnalysis
Accuracy Train: 48%, Accuracy Test: 48%, Model: LogisticRegression
Accuracy Train: 48%, Accuracy Test: 48%, Model: LogisticRegressionCV
Accuracy Train: 47%, Accuracy Test: 47%, Model: QuadraticDiscriminantAnalysis

I assume you want to increase your accuracy. A common way to do that is hyperparameter tuning with GridSearchCV in sklearn: it tries each combination of parameters for your model (classifier) with cross-validation and reports the best one. You can go through the documentation link below to learn more.
After the tuning has finished, it gives you the best parameters to plug into your classifier.
For example, where you have "clf1 = LogisticRegression(multi_class='multinomial', random_state=1)"
you will need to set more parameters, such as the ones I have defined in the code below.
Sklearn documentation
However, here is a way to tune the hyperparameters of logistic regression; look up the parameters of the other models, such as AdaBoost, to tune them in the same way:
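import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Hyperparameter tuning for logistic regression
# note: this grid is large (10 * 2 * 10 * 37 * 3 = 22,200 combinations, each fitted 5 times)
logreg_grid = {"C": np.logspace(-5, 5, 10),
               "solver": ["liblinear", "lbfgs"],
               "tol": np.logspace(-5, 5, 10),
               "max_iter": np.arange(100, 3000, 80),
               "class_weight": [None, {0: 1, 1: 1.5}, {0: 1, 1: 2}]}
# set up the grid hyperparameter search for logistic regression
gs_logreg = GridSearchCV(LogisticRegression(),
                         param_grid=logreg_grid,
                         cv=5,
                         verbose=True)
# fit the grid hyperparameter search model
gs_logreg.fit(X_train, y_train)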
After you have done that, you can use this code to get the best parameters:
gs_logreg.best_params_
Then plug those parameters into your earlier classifier definition (only after you have done the hyperparameter tuning), for example:
clf1 = LogisticRegression(C=2154.4346900318865, solver='lbfgs', tol=0.0037649358067924714, max_iter=300, multi_class='multinomial', random_state=1)
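As a rough, self-contained sketch of the same workflow: the synthetic three-class data, the small grid and the `neg_log_loss` scoring are my assumptions, not part of the original answer. Log loss is used here because the question asks for accurate probabilities, and it scores the predicted probabilities rather than only the predicted labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the match features (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# A deliberately small grid; only C is searched here (assumption).
param_grid = {"C": np.logspace(-3, 2, 6)}

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid=param_grid,
                      scoring="neg_log_loss",  # scores probabilities, not just labels
                      cv=5)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ has already been refitted on
# X_train with the best parameters, so they need not be copied by hand.
best_clf = search.best_estimator_
proba = best_clf.predict_proba(X_test)
print(search.best_params_)
print(proba.shape)
```

Using `best_estimator_` directly avoids copy-paste mistakes when transferring the parameters into a new classifier.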

Related

Scikitlearn GridSearchCV best model scores

I am trying to print the training and test score of the best model from my GridSearchCV object. My initial guess was to use cv_results['best_train_score'] and cv_results['best_test_score'], but after looking at the documentation I don't think there is a 'best_train_score' in cv_results.
I also see that there is a best_estimator_ but I'm not sure if I can use this to print a test and a training score. Any help is greatly appreciated.
You can use the best_estimator_ of your fitted GridSearchCV to retrieve the best model and then use the score function of your estimator to calculate the train and test accuracy.
As follows:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
svc = svm.SVC()
cv = GridSearchCV(svc, parameters)
# fit the search on the training split only, so the test split stays unseen
cv.fit(X_train, y_train)
model = cv.best_estimator_
print(f"train score: {model.score(X_train, y_train)}")
print(f"test score: {model.score(X_test, y_test)}")
Output (exact values vary with the random split):
train score: 0.9916666666666667
test score: 1.0
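If you also want the cross-validated scores of the best parameter combination, a possible sketch is to pass return_train_score=True and read the row at best_index_ from cv_results_. The fixed random_state and the variable names mean_train and mean_valid are my additions for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
# return_train_score=True adds the mean_train_score entry to cv_results_.
cv = GridSearchCV(svm.SVC(), parameters, return_train_score=True)
cv.fit(X_train, y_train)

best = cv.best_index_  # row of cv_results_ that belongs to the best parameters
mean_train = cv.cv_results_["mean_train_score"][best]
mean_valid = cv.cv_results_["mean_test_score"][best]
print(f"mean CV train score: {mean_train:.3f}")
print(f"mean CV validation score: {mean_valid:.3f}")
print(f"held-out test score: {cv.score(X_test, y_test):.3f}")
```

Note that "test" in cv_results_ refers to the validation folds inside the search, not to the held-out X_test.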

How to plot model accuracy and loss for cross-validation with SVM classifier?

Greeting of the day.
I used SVM classifier to classify clickbaits and non-clickbaits.
The code is:
# code for training SVM with cross validation
svm_class = svm.SVC(kernel='linear', C=1.0)
result_svm = cross_val_score(svm_class, x, encoded_Y, scoring='accuracy', cv=10)
print("Accuracy with SVM")
result_svm.mean()*100
Output: 94.30 %
# code for cross-validation test
y_pred_svm = cross_val_predict(svm_class, x, encoded_Y, cv=10)
# code for confusion matrix
import numpy as np
y_s=np.argmax(y, axis=1)
#print(y_s)
from sklearn.metrics import confusion_matrix
cm_svm = confusion_matrix(y_s, y_pred_svm)
cm_svm
Output: array([[3688, 312],
[ 257, 5743]])
I have two questions:
First, how do I plot model accuracy and loss for this SVM?
Second, how do I plot the ROC-AUC curve?

Getting a 100% Training Accuracy, but 60% Testing accuracy

I am trying different classifiers with different parameters and stuff on a dataset provided to us as part of a course project. We have to try and get the best performance on the dataset. The dataset is actually a reduced version of the online news popularity dataset.
I have tried SVM, Random Forest, and SVM with cross-validation with k = 5, and they all seem to give approximately 100% training accuracy, while the testing accuracy is between 60% and 70%. I think the testing accuracy is fine, but the training accuracy bothers me.
I would say it might be a case of overfitting, but none of my classmates seem to be getting similar results, so maybe the problem is with my code.
Here is the code for my cross-validation and random forest classifier. I would be very grateful if you could help me find out why I am getting such a high training accuracy.
def crossValidation(X_train, X_test, y_train, y_test, numSplits):
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    Cs = np.logspace(-3, 3, 10)
    gammas = np.logspace(-3, 3, 10)
    ACC = np.zeros((10, 10))
    DEV = np.zeros((10, 10))
    for i, gamma in enumerate(gammas):
        for j, C in enumerate(Cs):
            acc = []
            for train_index, dev_index in skf.split(X_train, y_train):
                X_cv_train, X_cv_dev = X_train[train_index], X_train[dev_index]
                y_cv_train, y_cv_dev = y_train[train_index], y_train[dev_index]
                clf = SVC(C=C, kernel='rbf', gamma=gamma)
                clf.fit(X_cv_train, y_cv_train)
                acc.append(accuracy_score(y_cv_dev, clf.predict(X_cv_dev)))
            ACC[i, j] = np.mean(acc)
            DEV[i, j] = np.std(acc)
    # pick the (gamma, C) pair with the highest mean validation accuracy
    i, j = np.argwhere(ACC == np.max(ACC))[0]
    clf1 = SVC(C=Cs[j], kernel='rbf', gamma=gammas[i], decision_function_shape='ovr')
    clf1.fit(X_train, y_train)
    y_predict_train = clf1.predict(X_train)
    y_pred_test = clf1.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))
def randomForestClassifier(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_predict_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))
There are two likely reasons why the training accuracy and the testing accuracy differ so much:
1. The training data and the testing data have different distributions (because only a part of the dataset was selected).
2. The model is overfitting the training data.
Since you already apply cross-validation, you should think about another remedy.
I recommend that you apply feature selection or dimensionality reduction (such as PCA) to tackle the overfitting problem.
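A minimal sketch of the PCA suggestion, assuming synthetic data in place of the news popularity dataset and an arbitrary choice of 10 components (both are my assumptions; the right number of components has to be tuned). Putting the scaler and PCA inside a pipeline means each cross-validation fold fits them on its own training part only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with many uninformative features (assumption).
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)

# Scale, reduce to 10 components, then classify with an RBF SVM.
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))

# return_train_score=True lets you compare train and validation accuracy per fold.
scores = cross_validate(model, X, y, cv=5, return_train_score=True)
print("train accuracy:", scores["train_score"].mean())
print("validation accuracy:", scores["test_score"].mean())
```

Comparing the two printed numbers for different values of n_components is one way to see whether the gap between training and validation accuracy shrinks.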

How to get the accuracy for a simple fit and for cross-validation

I have the simple fitting model like this:
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
print(accuracy_score(y_test, predictions))
and using cross-validation I have this:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=7)
From the cross-validation, how can I get an accuracy that is the same measure as print(accuracy_score(y_test, predictions))? Is it accuracies.mean()?
print(accuracies) will show an array with the score on each fold of the cross-validation.
print("Train set score :: {} ".format(accuracies.mean())) will give the mean score across the folds, and
print("Train set score :: {} +/-{}".format(accuracies.mean(), accuracies.std()*2)) will give you the mean score along with a spread of two standard deviations.
Note that for a regressor such as LinearRegression, the default score used by cross_val_score is R², not classification accuracy.
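Put together, a small runnable sketch could look like the following. The synthetic regression data is my assumption; the score reported is whatever the estimator's default scorer returns, which for LinearRegression is R².

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X_train, y_train (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One score per fold; with scoring left at its default this is R^2 for a regressor.
scores = cross_val_score(LinearRegression(), X, y, cv=7)
print(scores)
print(f"R^2: {scores.mean():.3f} +/- {2 * scores.std():.3f}")
```

To get a different measure, pass an explicit scoring argument such as scoring="neg_mean_squared_error".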

When using OnevsRest classifier, is it possible to know the accuracy for each classifier?

I am using a OneVsRest classifier.
I have a dataset with 21 classes. I want to know the accuracy of each classifier.
For example:
Accuracy for class1 vs (class2 + class3 + ... + class21)
Accuracy for class2 vs (class1 + class3 + ... + class21)
.
.
.
Accuracy for class21 vs (class1 + class2 + ... + class20)
How can I find that out?
# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=random_state))
y_score = classifier.fit(X_train, y_train).score(X_test, y_test)
print(y_score)
I think this is not supported out of the box and you will need to do it yourself.
Here is some example code, which I will call a prototype because it's not heavily tested! Keep in mind that it's hard to compare single-class accuracies and the meta-accuracy (which is based on probability estimates; in the SVM case obtained by Platt scaling).
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
# Data
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    iris_X, iris_y, test_size=0.5, random_state=0)
# Train classifier
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=0))
y_score = classifier.fit(X_train, y_train).score(X_test, y_test)
print(y_score)
# Get all accuracies
classes = np.unique(y_train)
def get_acc_single(clf, X_test, y_test, class_):
    pos = np.where(y_test == class_)[0]
    neg = np.where(y_test != class_)[0]
    y_trans = np.empty(X_test.shape[0], dtype=bool)
    y_trans[pos] = True
    y_trans[neg] = False
    return clf.score(X_test, y_trans)  # assumption: acc = default-scorer
for class_index, est in enumerate(classifier.estimators_):
    class_ = classes[class_index]
    print('class ' + str(class_))
    print(get_acc_single(est, X_test, y_test, class_))
Output:
0.8133333333333334
class 0
1.0
class 1
0.6666666666666666
class 2
0.9733333333333334
