from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score']  # target variable

# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)

confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification, e.g. see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as #gtomer pointed out. That means an exact data point that is present in your training set is also available in your test set. You can run a K-fold test on your data and see how the accuracy holds up across folds, as in the sketch below. Secondly, try different classifiers too (Random Forests are generally a better choice than Decision Trees).
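A minimal sketch of such a K-fold check, assuming the X, y, and DecisionTreeClassifier import from your code above:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: near-perfect accuracy on every fold suggests either
# a genuinely easy dataset or leakage between your features and the target
scores_dt = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print("Decision tree fold accuracies:", scores_dt)

scores_rf = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print("Random forest fold accuracies:", scores_rf)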
I have a problem. I found this question: https://stats.stackexchange.com/questions/56302/what-are-good-rmse-values
Someone wrote:
The RMSE for your training and your test sets should be very similar if you have built a good model.
and another wrote:
RMSE of test > RMSE of train => OVERFITTING of the data. RMSE of test < RMSE of train => UNDERFITTING of the data.
I think the RMSE of the test data is
y_pred = knn.predict(X_test)
rmse = metrics.mean_squared_error(y_test, y_pred , squared=False)
But how could I get the RMSE (or another metric) of my training data? Perhaps it is
rmse = metrics.mean_squared_error(X_train, X_test, squared=False)
But with that I got
ValueError: Found input variables with inconsistent numbers of samples: [8880, 2220]
So how could I get the RMSE from my training data?
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=30)
knn = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                          metric_params=None, n_jobs=1, n_neighbors=5,
                          p=2, weights='uniform')
knn.fit(X, y)
y_pred = knn.predict(X_test)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = metrics.mean_squared_error(y_test, y_pred , squared=False)
print(rmse)
First of all, there's something wrong with your code: you are training your model on the whole dataset instead of the training split you've already created. This makes your validation sample useless, because the model has already learnt from it.
You should change your fit like so:
knn.fit(X_train, y_train)
Then, to get the RMSE of the training set, you should predict on your train data and compare afterwards:
y_train_pred = knn.predict(X_train)
rmse = metrics.mean_squared_error(y_train, y_train_pred, squared=False)
Everything else should stay the same.
The context of your question is not entirely clear. However, you should take the predictions, not the model features of the test set, and evaluate the RMSE using the same package that you used:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
Pay attention to the squared parameter.
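A minimal sketch of both options, assuming the metrics import from the question and a knn model fitted on the training data only:
import numpy as np

# RMSE on the training set: predict on X_train and compare to y_train
y_train_pred = knn.predict(X_train)
rmse_train = metrics.mean_squared_error(y_train, y_train_pred, squared=False)

# equivalently, take the square root of the plain MSE yourself
rmse_train_alt = np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
print(rmse_train, rmse_train_alt)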
I am building an algorithm to predict football matches for sports betting.
I have a function that trains several models from a list. I would like to improve the models' accuracy and get probabilities that are as accurate as possible.
List of classifiers
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = GradientBoostingClassifier()
clf3 = MLPClassifier()

classifiers = [
    MLPClassifier(),
    VotingClassifier(estimators=[('lr', clf1), ('gb', clf2), ('mlp', clf3)], voting='soft'),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    CalibratedClassifierCV(),
    LinearDiscriminantAnalysis(),
    LogisticRegression(),
    LogisticRegressionCV(),
    QuadraticDiscriminantAnalysis(),
]
Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def proba_classifiers(df_clean, df_clean_pred, X, X_pred, y, classifier_list):
    df_proba = []
    df_proba_pred = []
    for classifier in classifier_list:
        # note: test_size=.8 sends 80% of the data to the test set
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.8, stratify=y)
        classifier.fit(X_train, y_train)
        p_train = classifier.predict(X_train)
        acc_train = accuracy_score(y_train, p_train)
        p_test = classifier.predict(X_test)
        acc_test = accuracy_score(y_test, p_test)
        print(f'Accuracy Train: {acc_train:.0%}, Accuracy Test: {acc_test:.0%}, Model: {classifier.__class__.__name__}')
        proba = classifier.predict_proba(X)
        proba_pred = classifier.predict_proba(X_pred)
        proba = pd.DataFrame(proba, columns=['H', 'D', 'A'])
        df_proba.append(proba)
        proba_pred = pd.DataFrame(proba_pred, columns=['H', 'D', 'A'])
        df_proba_pred.append(proba_pred)
    df_proba = pd.concat(df_proba, axis=1)
    # adjust iloc
    df_clean = df_clean.iloc[:, :16]
    df_proba = pd.concat([df_clean.reset_index(drop=True), df_proba], axis=1)
    df_proba_pred = pd.concat(df_proba_pred, axis=1)
    # adjust iloc
    df_clean_pred = df_clean_pred.iloc[:, :16]
    df_proba_pred = pd.concat([df_clean_pred.reset_index(drop=True), df_proba_pred], axis=1)
    return df_proba, df_proba_pred
PS: X_pred is the X that I want predictions for.
Model scores
Accuracy Train: 48%, Accuracy Test: 48%, Model: MLPClassifier
Accuracy Train: 48%, Accuracy Test: 48%, Model: VotingClassifier
Accuracy Train: 48%, Accuracy Test: 48%, Model: AdaBoostClassifier
Accuracy Train: 50%, Accuracy Test: 48%, Model: GradientBoostingClassifier
Accuracy Train: 48%, Accuracy Test: 48%, Model: CalibratedClassifierCV
Accuracy Train: 48%, Accuracy Test: 48%, Model: LinearDiscriminantAnalysis
Accuracy Train: 48%, Accuracy Test: 48%, Model: LogisticRegression
Accuracy Train: 48%, Accuracy Test: 48%, Model: LogisticRegressionCV
Accuracy Train: 47%, Accuracy Test: 47%, Model: QuadraticDiscriminantAnalysis
I assume you want to increase your accuracy. Most times, to do that you will need to tune your hyperparameters, for example with GridSearchCV in sklearn. This is a way of tuning your model (classifier) to increase accuracy; you can go through the documentation link below to learn more.
After the tuning finishes, it gives you the best parameters to plug into your classifier.
For example, where you have clf1 = LogisticRegression(multi_class='multinomial', random_state=1), you will need more parameters, such as those defined in the code below.
Sklearn documentation
Here is a way to do logistic regression hyperparameter tuning; search the same way for AdaBoost and the rest to find their best parameters:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# hyperparameter tuning for logistic regression
logreg_grid = {"C": np.logspace(-5, 5, 10),
               "solver": ["liblinear", "lbfgs"],
               "tol": np.logspace(-5, 5, 10),
               "max_iter": np.arange(100, 3000, 80),
               "class_weight": [None, {0: 1, 1: 1.5}, {0: 1, 1: 2}]}

# set up grid hyperparameter search for logistic regression
gs_logreg = GridSearchCV(LogisticRegression(),
                         param_grid=logreg_grid,
                         cv=5,
                         verbose=True)

# fit grid hyperparameter search model
gs_logreg.fit(X_train, y_train)
After the search has finished, you can use this code to get the best parameters and plug them into your classifier:
gs_logreg.best_params_
Then fix those parameters into your earlier definition (only once you have done the hyperparameter tuning), for example:
clf1 = LogisticRegression(C=2154.4346900318865, solver='lbfgs', tol=0.0037649358067924714, max_iter=300, multi_class='multinomial', random_state=1)
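As a shorthand sketch using the standard sklearn API: since gs_logreg.best_params_ is a plain dict, you can also unpack it directly instead of copying the values by hand:
# unpack the best parameters found by the grid search into the constructor
clf1 = LogisticRegression(**gs_logreg.best_params_, multi_class='multinomial', random_state=1)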
I'm testing different models (classifiers): I've created a list (that will contain model names) and then loop through it to print the accuracy and cross-validation score for each of them. It works fine.
What I'd like to do is show them ordered by descending accuracy_score (metrics.accuracy_score(y_test, y_pred) in the code below). How do I do that easily?
Thanks a lot to anyone who's willing to help!
#create an array of models
models = []
models.append(("Random Forest",RandomForestClassifier(n_estimators = 100, random_state = 0)))
#models.append(("Logistic Regression",LogisticRegression()))
models.append(("Naive Bayes",GaussianNB()))
models.append(("SVM",SVC()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("Gradient Boosting",GradientBoostingClassifier()))
# measure the accuracy and show results per model
for name, model in models:
    # fit the model with x and y data
    model.fit(X_train, y_train)
    # prediction on the test set
    y_pred = model.predict(X_test)
    kfold = KFold(n_splits=4)  #, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', metrics.accuracy_score(y_test, y_pred), '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')
Append your scores to a new list, and then sort that list using the .sort() method, like so:
#create an array of models
models = []
models.append(("Random Forest",RandomForestClassifier(n_estimators = 100, random_state = 0)))
#models.append(("Logistic Regression",LogisticRegression()))
models.append(("Naive Bayes",GaussianNB()))
models.append(("SVM",SVC()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("Gradient Boosting",GradientBoostingClassifier()))
results = []  # new list to store results

# measure the accuracy and show results per model
for name, model in models:
    # fit the model with x and y data
    model.fit(X_train, y_train)
    # prediction on the test set
    y_pred = model.predict(X_test)
    kfold = KFold(n_splits=4)  #, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    results.append((name, metrics.accuracy_score(y_test, y_pred)))
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', metrics.accuracy_score(y_test, y_pred), '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')

results.sort(key=lambda tup: tup[1], reverse=True)  # sort in-place, descending
print(results)  # print results
Rather than doing everything in the same big chunk of code inside the loop, I suggest identifying the different types of operations you're doing and separating them into their own functions:
Run the model, fit, predict, compute score;
Sort the list;
Print the model;
import operator # itemgetter
#create an array of models
models = []
models.append(("Random Forest",RandomForestClassifier(n_estimators = 100, random_state = 0)))
#models.append(("Logistic Regression",LogisticRegression()))
models.append(("Naive Bayes",GaussianNB()))
models.append(("SVM",SVC()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("Gradient Boosting",GradientBoostingClassifier()))
def run_model(m):
    name, model = m
    # fit the model with x and y data
    model.fit(X_train, y_train)
    # prediction on the test set
    y_pred = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    kfold = KFold(n_splits=4)  #, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    return (name, accuracy, cv_result)

def print_model(name, accuracy, cv_result):
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', accuracy, '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')

# run every model, sort by accuracy (descending), then print
results = sorted(map(run_model, models), key=operator.itemgetter(1), reverse=True)
for name, accuracy, cv_result in results:
    print_model(name, accuracy, cv_result)
Disclaimer: Contrary to all best practices, I did not test this code before posting it, because the OP didn't provide example values for X_train, y_train, X_test, y_test, nor the relevant imports to make their code work.
I am trying different classifiers, with different parameters, on a dataset provided to us as part of a course project. We have to try to get the best performance on the dataset. The dataset is actually a reduced version of the online news popularity dataset.
I have tried SVM, Random Forest, and SVM with 5-fold cross-validation, and they all seem to give approximately 100% training accuracy, while the testing accuracy is between 60% and 70%. I think the testing accuracy is fine, but the training accuracy bothers me.
I would say it may be a case of overfitting, but none of my classmates seem to be getting similar results, so maybe the problem is with my code.
Here is the code for my cross-validation and random forest classifier. I would be very grateful if you could help me find out why I am getting such a high training accuracy.
def crossValidation(X_train, X_test, y_train, y_test, numSplits):
    skf = StratifiedKFold(n_splits=numSplits, shuffle=True)
    Cs = np.logspace(-3, 3, 10)
    gammas = np.logspace(-3, 3, 10)
    ACC = np.zeros((10, 10))
    DEV = np.zeros((10, 10))
    # grid search over gamma and C, scored on held-out folds
    for i, gamma in enumerate(gammas):
        for j, C in enumerate(Cs):
            acc = []
            for train_index, dev_index in skf.split(X_train, y_train):
                X_cv_train, X_cv_dev = X_train[train_index], X_train[dev_index]
                y_cv_train, y_cv_dev = y_train[train_index], y_train[dev_index]
                clf = SVC(C=C, kernel='rbf', gamma=gamma)
                clf.fit(X_cv_train, y_cv_train)
                acc.append(accuracy_score(y_cv_dev, clf.predict(X_cv_dev)))
            ACC[i, j] = np.mean(acc)
            DEV[i, j] = np.std(acc)
    # refit with the best (gamma, C) pair on the full training set
    i, j = np.argwhere(ACC == np.max(ACC))[0]
    clf1 = SVC(C=Cs[j], kernel='rbf', gamma=gammas[i], decision_function_shape='ovr')
    clf1.fit(X_train, y_train)
    y_predict_train = clf1.predict(X_train)
    y_pred_test = clf1.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))
def randomForestClassifier(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_predict_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))
There are two possible causes when training accuracy and testing accuracy are significantly different:
Different distributions of the training data and the testing data (because only a part of the dataset was selected).
Overfitting of the model to the training data.
Since you already apply cross-validation, the split itself is probably not the problem, so you should think about another solution.
I recommend that you apply a feature selection or feature reduction (like PCA) approach to tackle the overfitting problem, for example as in the sketch below.
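A minimal sketch of that idea, assuming the same X_train/X_test/y_train/y_test as above and a hypothetical component count of 20:
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scale, reduce the feature space, then classify; fitting PCA inside the
# pipeline means the test set never influences the learned components
model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(C=1.0, kernel='rbf'))
model.fit(X_train, y_train)
print("Train Accuracy :: ", accuracy_score(y_train, model.predict(X_train)))
print("Test Accuracy :: ", accuracy_score(y_test, model.predict(X_test)))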