Getting 100% training accuracy but 60% testing accuracy - python

I am trying different classifiers with different parameters on a dataset provided to us as part of a course project; we have to get the best performance we can on it. The dataset is a reduced version of the Online News Popularity dataset.
I have tried SVM, Random Forest, and SVM with k = 5 cross-validation, and they all give approximately 100% training accuracy, while the testing accuracy is between 60% and 70%. I think the testing accuracy is fine, but the training accuracy bothers me.
I would say it is a case of overfitting, but none of my classmates seem to be getting similar results, so maybe the problem is with my code.
Here is the code for my cross-validation and Random Forest classifier. I would be very grateful if you could help me find out why I am getting such a high training accuracy.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def crossValidation(X_train, X_test, y_train, y_test, numSplits):
    skf = StratifiedKFold(n_splits=numSplits, shuffle=True)
    Cs = np.logspace(-3, 3, 10)
    gammas = np.logspace(-3, 3, 10)
    ACC = np.zeros((10, 10))
    DEV = np.zeros((10, 10))
    for i, gamma in enumerate(gammas):
        for j, C in enumerate(Cs):
            acc = []
            for train_index, dev_index in skf.split(X_train, y_train):
                X_cv_train, X_cv_dev = X_train[train_index], X_train[dev_index]
                y_cv_train, y_cv_dev = y_train[train_index], y_train[dev_index]
                clf = SVC(C=C, kernel='rbf', gamma=gamma)
                clf.fit(X_cv_train, y_cv_train)
                acc.append(accuracy_score(y_cv_dev, clf.predict(X_cv_dev)))
            ACC[i, j] = np.mean(acc)
            DEV[i, j] = np.std(acc)
    # Pick the (gamma, C) pair with the best mean dev-fold accuracy
    i, j = np.argwhere(ACC == np.max(ACC))[0]
    clf1 = SVC(C=Cs[j], kernel='rbf', gamma=gammas[i], decision_function_shape='ovr')
    clf1.fit(X_train, y_train)
    y_predict_train = clf1.predict(X_train)
    y_pred_test = clf1.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))
from sklearn.ensemble import RandomForestClassifier

def randomForestClassifier(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_predict_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))

There are two common reasons why training and testing accuracy can differ this much:
Different distributions of the training and testing data (because a part of the dataset was selected).
Overfitting of the model to the training data.
Since you already apply cross-validation, it seems you need to think about another remedy.
I recommend applying a feature selection or feature reduction (like PCA) approach to tackle the overfitting problem.
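For example, here is a minimal sketch of a feature-reduction pipeline with PCA feeding the SVM; the number of components is an illustrative guess, not a tuned value:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Standardize, project onto the first 20 principal components (an
# assumption to tune), then fit the RBF SVM on the reduced features.
pipe = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel='rbf'))
pipe.fit(X_train, y_train)
print("Train Accuracy :: ", pipe.score(X_train, y_train))
print("Test Accuracy :: ", pipe.score(X_test, y_test))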

Related

I am getting 100% accuracy with my decision tree model. Where did I go wrong?

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # Features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as noted by @gtomer: an exact point that is present in the training set may also appear in your test set. You can run a K-fold test on your data and see how the accuracy holds up. Secondly, try different classifiers too (Random Forests generally perform better than single Decision Trees).
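For instance, a quick sketch of that K-fold check with a Random Forest, assuming the X and y from the question:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# If every fold scores near 1.0, the classes may really be separable
# (or there is leakage); large fold-to-fold variance points elsewhere.
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)
print(scores, scores.mean())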

Where to use validation set in model training

I have split my data into three sets, train, test and validation, as shown below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
I wanted to ask where do I put the validation set in this code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy on test is:", accuracy_score(y_test, y_pred))
# Measure Precision, Recall
print("Precision Score: ", precision_score(y_test, y_pred, average='macro'))
print("Recall Score: ", recall_score(y_test, y_pred, average='macro'))
print("F1-Score :", f1_score(y_test, y_pred, average='macro'))
I suggest you read up some more on why you would split your data into train, test and validation sets.
In the code you show, you can use the validation data the same way you use your test data, but that doesn't really make full sense.
There is a lot to it; I think this can get you started. Link
In short and very simplified, the general idea is that you use results on the validation data to make adjustments to your model, to improve its performance.
The test data you use only at the very end, for your final model evaluation, to make sure it actually performs well on unseen data.
(Worst case, if you only use two sets, you might tune parameters until the model works well for those two datasets but still not for any other.)
You can show the validation score and accuracy like this:
#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_val = model.predict(X_val)
print("Accuracy on val is:",accuracy_score(y_val, y_pred_val))
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score for Val: ",precision_score(y_val, y_pred_val, average='macro'))
print("Recall Score for Val: ",recall_score(y_val, y_pred_val, average='macro'))
print("F1-Score for Val :",f1_score(y_val, y_pred_val, average='macro'))
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))

Can I know train and test accuracy individually if I am using cross_val_score?

I used a Random Forest classifier on my dataset and I want to use cross-validation. My problem is that I could not find a way to get the accuracy of the train and test splits individually; is this possible?
Here is the code I used:
from numpy import mean, std
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_test, X_rem, y_test, y_rem = train_test_split(X, Y, test_size=0.1)
X_valid, X_train, y_valid, y_train = train_test_split(X_rem, y_rem, train_size=0.80)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = RandomForestClassifier(n_estimators=400)
scoresranP = cross_val_score(model, X_rem, y_rem, cv=cv, n_jobs=-1)
print('Accuracy of Random Forest: %.3f (%.3f)' % (mean(scoresranP), std(scoresranP)))
If I am right, in this case "scoresranP" will give me the training accuracy, so can I get the test accuracy using the test split?
Or am I wrong? Can anyone tell me if I can use cross_val_score with three splits (train, valid, test)?
I really appreciate any help you can provide.
To be precise, cross_val_score returns only the validation-fold scores, not the training accuracy (and Random Forest is not a regression-only method; RandomForestClassifier handles classification). If you also want the training accuracy on each fold, use cross_validate with return_train_score=True.
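A minimal sketch, reusing the X_rem/y_rem variables from the question:
from sklearn.model_selection import cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier

cv = KFold(n_splits=10, random_state=1, shuffle=True)
model = RandomForestClassifier(n_estimators=400)
# return_train_score=True adds per-fold training scores to the result dict
scores = cross_validate(model, X_rem, y_rem, cv=cv, n_jobs=-1, return_train_score=True)
print('Train accuracy: %.3f (%.3f)' % (scores['train_score'].mean(), scores['train_score'].std()))
print('Validation accuracy: %.3f (%.3f)' % (scores['test_score'].mean(), scores['test_score'].std()))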

How to get predicted labels during training of any classifier?

Is there any way that I can track my model's performance in terms of its predicted labels during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of Confusion Matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective here is to see how well the model is learning (during training). This is similar to analyzing the training loss, which is common practice for DNNs, and libraries such as PyTorch, Keras, and TensorFlow already implement such capability.
I thought a quick browse of the web would give me what I want, but apparently not. I still believe this should be fairly simple, though.
Some ML practitioners like to work with three splits of the data: training, validation and testing sets. The last should not be seen during training at all, but the middle one can be. For example, cross-validation uses K different validation folds "during the training phase" to get a less biased performance estimate by training on different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked.
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)
# Fit a classifier with train data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusionm_valid = confusion_matrix(y_valid, y_valid_pred)  # ... here ...
# Refit with all your training data, then predict on the held-out test set
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
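If you want a list of confusion matrices collected during training itself, one option is an estimator that supports incremental fitting, such as SGDClassifier; this is a sketch under that assumption (the epoch count is arbitrary), not LinearSVC itself:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

clf = SGDClassifier(random_state=42)
classes = np.unique(y_train)
matrices = []  # one confusion matrix per pass over the training data
for epoch in range(10):
    clf.partial_fit(X_train2, y_train2, classes=classes)
    matrices.append(confusion_matrix(y_valid, clf.predict(X_valid)))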

Take accuracy for simple fit and cross val

I have a simple fitted model like this:
from sklearn import linear_model
from sklearn.metrics import accuracy_score

lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
print(accuracy_score(y_test, predictions))
and using cross-validation I have this:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=7)
From cross-validation, how can I get the accuracy so that I have the same measure as print(accuracy_score(y_test, predictions))? Is it accuracies.mean()?
print(accuracies) will give an array with the accuracy of each cross-validation fold,
print("Train set score :: {}".format(accuracies.mean())) will give the mean accuracy over the folds, and
print("Train set score :: {} +/- {}".format(accuracies.mean(), accuracies.std() * 2)) will give you the mean accuracy along with twice its standard deviation.
