Where to use validation set in model training - python

I have split my data into three sets (train, test, and validation) as shown below:
# hold out 10% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
# carve 20% of the remaining 90% off as the validation set (18% of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
I wanted to ask where I should use the validation set in this code:
# Defining the model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy on test is:", accuracy_score(y_test, y_pred))
# Measure precision, recall, and F1
print("Precision Score:", precision_score(y_test, y_pred, average='macro'))
print("Recall Score:", recall_score(y_test, y_pred, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred, average='macro'))

I suggest you read up a bit more on why you would split your data into train, test, and validation sets.
In the code you show, you could use the validation data the same way you use your test data, but that on its own doesn't really make sense.
There is a lot to it; I think this can get you started. Link
In short and very simplified: the general idea is that you use the results on your validation data to make adjustments to your model (hyperparameters, feature choices) and improve its performance.
Your test data you use only at the very end, for the final model evaluation, to make sure the model actually performs well on unseen data.
(Worst case, if you only use two sets, you might tune parameters until the model works well on those two datasets but still not on any other.)
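As a minimal sketch of that workflow (assuming the X_train/X_val/X_test splits from the question; the candidate C values are arbitrary examples), you could tune LogisticRegression's regularization strength on the validation set and touch the test set only once at the end:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# try a few regularization strengths and keep the one that
# scores best on the validation set
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# final, one-time evaluation on the untouched test set
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))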

You can report the validation scores and accuracy like this:
# Defining the model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_val = model.predict(X_val)
print("Accuracy on val is:", accuracy_score(y_val, y_pred_val))
y_pred = model.predict(X_test)
print("Accuracy on test is:", accuracy_score(y_test, y_pred))
# Measure precision, recall, and F1
print("Precision Score for Val:", precision_score(y_val, y_pred_val, average='macro'))
print("Recall Score for Val:", recall_score(y_val, y_pred_val, average='macro'))
print("F1-Score for Val:", f1_score(y_val, y_pred_val, average='macro'))
print("Precision Score:", precision_score(y_test, y_pred, average='macro'))
print("Recall Score:", recall_score(y_test, y_pred, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred, average='macro'))

Related

I am getting 100% accuracy in my decision tree model. Where did I go wrong?

# split dataset into features and target variable
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # features
scale = StandardScaler()
X = scale.fit_transform(X)  # note: fitting the scaler on all the data before splitting lets test-set statistics influence training
y = data['depre_score']  # target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training, 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as pointed out by @gtomer: an exact data point that is present in the training set also appears in your test set. You can run a K-fold test on your data and see whether the accuracy holds up across folds. Secondly, try different classifiers too (Random Forests are generally more robust than single Decision Trees).
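As a quick sanity check (a sketch assuming the X, y, and clf from the question), you can look at the per-fold accuracy; if every fold sits near 100%, leakage or a trivially separable target is more likely than luck:

from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = cross_val_score(clf, X, y, cv=cv)
print("Per-fold accuracy:", fold_scores)  # consistently ~1.0 points to leakage or an easy target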

Can I know train and test accuracy individually if I am using cross_val_score?

I used a random forest classifier on my dataset and I want to use cross-validation. My problem is that I could not find a way to get the accuracy of the train and test splits individually, so is this possible?
Here is the code I used:
from numpy import mean, std
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X_test, X_rem, y_test, y_rem = train_test_split(X, Y, test_size=0.1)
X_valid, X_train, y_valid, y_train = train_test_split(X_rem, y_rem, train_size=0.80)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = RandomForestClassifier(n_estimators=400)
scoresranP = cross_val_score(model, X_rem, y_rem, cv=cv, n_jobs=-1)
print('Accuracy of Random Forest: %.3f (%.3f)' % (mean(scoresranP), std(scoresranP)))
If I am right, "scoresranP" will in this case give me the training accuracy, so can I get the test accuracy using the test split? Or am I wrong? Can anyone tell me whether I can use cross_val_score with three splits (train, valid, test)?
I really appreciate any help you can provide.
Not quite: cross_val_score does not give you a training accuracy. It reports the score on the held-out fold of each split, i.e., a validation accuracy. If you want the training and validation accuracy of every fold, use cross_validate with return_train_score=True; for a final test accuracy, fit the model on your training data and score it on the test split you held out.
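A minimal sketch of that (assuming the model, X_rem, y_rem, cv, and the held-out splits from the question):

from numpy import mean, std
from sklearn.model_selection import cross_validate

# per-fold train and validation scores
results = cross_validate(model, X_rem, y_rem, cv=cv, n_jobs=-1, return_train_score=True)
print('Train accuracy: %.3f (%.3f)' % (mean(results['train_score']), std(results['train_score'])))
print('Validation accuracy: %.3f (%.3f)' % (mean(results['test_score']), std(results['test_score'])))

# final test accuracy on the split held out at the start
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))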

How to apply oversampling when doing Leave-One-Group-Out cross validation?

I am working with imbalanced data for classification, and I have previously used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the training data. This time, however, I think I also need Leave-One-Group-Out (LOGO) cross-validation, because I want to leave one subject out in each CV split.
I am not sure if I can explain it nicely but, as I understand it, to do k-fold CV with SMOTE we can apply SMOTE to every fold inside the loop, as I saw in this code on another post. Below is an example of a SMOTE implementation on k-fold CV.
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # oversample only the training fold
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
Without SMOTE, I tried to do this to do LOGO CV. But by doing this, I will be using a super imbalanced dataset.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # because I want to leave out all data from the same cow ID on each run

logo = LeaveOneGroupOut()
logo.get_n_splits(X, y, groups)
cv = logo.split(X, y, groups)

scores = []
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
How should I implement SMOTE inside the leave-one-group-out CV loop? I am confused about how to define the group list for the synthetic training data.
The approach suggested here (LOOCV) makes sense for leave-one-group-out cross-validation as well: leave out the one group you will use as the test set and oversample all of the remaining data. Train your classifier on the oversampled data and evaluate it on the held-out group. Note that the synthetic samples exist only inside the training fold and are never seen by the splitter, so you do not need to define group labels for them; the group list applies only to the original data.
In your case, the following code would be the correct way to implement SMOTE inside the LOGO CV loop:
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    # oversample only the training groups; the held-out group stays untouched
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train.ravel())
    model.fit(X_train_oversampled, y_train_oversampled)
    scores.append(model.score(X_test, y_test.ravel()))
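Alternatively (a sketch assuming the X, y, and groups from the question, with RandomForestClassifier as an example model), imblearn's Pipeline applies SMOTE only to the training portion of each split automatically, so you can use cross_val_score directly:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# SMOTE is fitted and applied only on the training part of each split
pipe = Pipeline([('smote', SMOTE()), ('clf', RandomForestClassifier())])
scores = cross_val_score(pipe, X, y.ravel(), groups=groups, cv=LeaveOneGroupOut())
print(scores)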

Getting 100% training accuracy, but 60% testing accuracy

I am trying different classifiers with different parameters on a dataset provided to us as part of a course project. We have to try to get the best performance on the dataset. The dataset is actually a reduced version of the Online News Popularity dataset.
I have tried SVM, Random Forest, and SVM with cross-validation (k = 5), and they all seem to give approximately 100% training accuracy, while the testing accuracy is between 60% and 70%. I think the testing accuracy is fine, but the training accuracy bothers me.
I would say it might be a case of overfitting, but none of my classmates seem to be getting similar results, so maybe the problem is with my code.
Here is the code for my cross-validation and random forest classifier. I would be very grateful if you could help me find out why I am getting such high training accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def crossValidation(X_train, X_test, y_train, y_test, numSplits):
    skf = StratifiedKFold(n_splits=numSplits, shuffle=True)
    Cs = np.logspace(-3, 3, 10)
    gammas = np.logspace(-3, 3, 10)
    ACC = np.zeros((10, 10))
    DEV = np.zeros((10, 10))
    for i, gamma in enumerate(gammas):
        for j, C in enumerate(Cs):
            acc = []
            for train_index, dev_index in skf.split(X_train, y_train):
                X_cv_train, X_cv_dev = X_train[train_index], X_train[dev_index]
                y_cv_train, y_cv_dev = y_train[train_index], y_train[dev_index]
                clf = SVC(C=C, kernel='rbf', gamma=gamma)
                clf.fit(X_cv_train, y_cv_train)
                acc.append(accuracy_score(y_cv_dev, clf.predict(X_cv_dev)))
            ACC[i, j] = np.mean(acc)
            DEV[i, j] = np.std(acc)
    # pick the (gamma, C) pair with the best mean CV accuracy
    i, j = np.argwhere(ACC == np.max(ACC))[0]
    clf1 = SVC(C=Cs[j], kernel='rbf', gamma=gammas[i], decision_function_shape='ovr')
    clf1.fit(X_train, y_train)
    y_predict_train = clf1.predict(X_train)
    y_pred_test = clf1.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))

def randomForestClassifier(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_predict_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    print("Train Accuracy :: ", accuracy_score(y_train, y_predict_train))
    print("Test Accuracy :: ", accuracy_score(y_test, y_pred_test))
There are two usual causes when training and testing accuracy differ this much:
1. A different distribution in the training and the testing data (because you are working with a selected part of the dataset).
2. Overfitting of the model to the training data.
Since you already apply cross-validation and the gap remains, the model's capacity is the more likely culprit: flexible models such as an unregularized Random Forest or an RBF SVM with a large C can memorize the training set, so near-100% training accuracy is expected, and the test accuracy is the number to watch.
I recommend that you apply some feature selection or feature reduction (like PCA) approaches to tackle the overfitting problem.
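As a sketch of that suggestion (assuming the X_train/X_test/y_train/y_test from the question; the 20 components are an arbitrary example value), you can put PCA in front of the SVM in a pipeline so the reduction is fitted only on the training data:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scale, reduce to 20 components (an example value), then classify
pipe = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel='rbf'))
pipe.fit(X_train, y_train)
print("Train Accuracy :: ", pipe.score(X_train, y_train))
print("Test Accuracy :: ", pipe.score(X_test, y_test))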

Make a graph of accuracy by epoch with MLPClassifier in scikit-learn

I would like to track the accuracy by epoch in scikit-learn. However, so far I have been unsuccessful.
This is the part of my code that classifies with MLPClassifier:
NUM_EPOCHS = 1000
LOG_FOR_EVERY = 10

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = MLPClassifier(hidden_layer_sizes=(18, 175, 256), batch_size=528,
                    learning_rate_init=0.0001, beta_1=0.001,
                    beta_2=0.001, max_iter=1, warm_start=True)

# with warm_start=True and max_iter=1, each call to fit runs one more epoch
for i in range(NUM_EPOCHS):
    clf.fit(X_train, y_train.ravel())
I have also made a graph with this result, but I need to make it continuous and further increase the accuracy.
Why is the accuracy not increasing?
MLPClassifier.fit does not accept an epochs argument (that is Keras' API), so clf.fit(X_train, y_train.ravel(), epochs=NUM_EPOCHS) would raise a TypeError. Your loop with warm_start=True and max_iter=1 is the right way to train one epoch at a time in scikit-learn; to graph accuracy by epoch, record the score after each fit call.
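A minimal sketch of that (assuming the clf, splits, and constants from the question):

import matplotlib.pyplot as plt

train_acc, test_acc = [], []
for i in range(NUM_EPOCHS):
    clf.fit(X_train, y_train.ravel())  # one more epoch thanks to warm_start
    if i % LOG_FOR_EVERY == 0:
        train_acc.append(clf.score(X_train, y_train))
        test_acc.append(clf.score(X_test, y_test))

plt.plot(train_acc, label='train accuracy')
plt.plot(test_acc, label='test accuracy')
plt.xlabel(f'epoch / {LOG_FOR_EVERY}')
plt.ylabel('accuracy')
plt.legend()
plt.show()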
