I use scikit-learn's SVM like so:
clf = svm.SVC()
clf.fit(td_X, td_y)
My question is: when I use the classifier to predict the class of a member of the training set, could the classifier ever be wrong, even in scikit-learn's implementation? (i.e. is clf.predict(td_X[a]) == td_y[a] always true?)
Yes, definitely. Run this code, for example:
from sklearn import svm
import numpy as np
clf = svm.SVC()
np.random.seed(seed=42)
x=np.random.normal(loc=0.0, scale=1.0, size=[100,2])
y=np.random.randint(2,size=100)
clf.fit(x,y)
print(clf.score(x,y))
The score is 0.61, so nearly 40% of the training data is misclassified. Part of the reason is that even though the default kernel is 'rbf' (which in theory should be able to classify any training set perfectly, as long as you don't have two identical training points with different labels), there is also regularization to reduce overfitting. The default regularization parameter is C=1.0.
If you run the same code as above but switch clf = svm.SVC() to clf = svm.SVC(C=200000), you'll get an accuracy of 0.94.
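For comparison, a minimal sketch of that rerun (reusing x and y from the snippet above; the exact score can vary slightly between scikit-learn versions):
# Same data as above, but with a much weaker regularization penalty (larger C)
clf_large_C = svm.SVC(C=200000)
clf_large_C.fit(x, y)
print(clf_large_C.score(x, y))  # training accuracy rises to roughly 0.94, as noted above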
I am testing RandomForestClassifier on a simple dataset from sklearn. When I split the data with train_test_split, I get accuracy = 0.89. If I use cross-validation with cross_val_score with the same classifier parameters, the accuracy is smaller, about 0.83. Why?
Here is the code:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles

np.random.seed(42)
#create dataset:
x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)
#initialize stratified split:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#create classifier:
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)
#average accuracy on cross-validation:
results = np.mean(cross_val_score(clf, x, y, cv=skf,scoring=make_scorer(accuracy_score)))
print("ACCURACY WITH CV = ",results)#prints 0.832
#use train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
clf=RandomForestClassifier(random_state=42, max_depth=12,n_jobs=-1, oob_score=True,n_estimators=100,min_samples_leaf=10)
clf.fit(xtrain,ytrain)
ypred=clf.predict(xtest)
print("ACCURACY WITHOUT CV = ",accuracy_score(ytest,ypred))#prints 0.89
What I got:
ACCURACY WITH CV = 0.83
ACCURACY WITHOUT CV = 0.89
Cross validation is used to run multiple experiments on different splits of data and then average their results. This is to ensure that the result of the experiment is not biased by one split, as it is in your case.
Your chosen seed, along with some luck, gave you a train/test split with higher accuracy than the average. The higher accuracy is an artifact of random sampling when making the split and not an indicator of better model performance.
Simply put:
Cross-validation makes multiple splits of the data. Your model is trained and evaluated on each of these different splits, and the performance is then averaged.
If you pick one of these splits, you may get lucky and there may be good overlap between the data points in your test and train set. Your model will have high accuracy in this case.
Or you may get unlucky and there may not be much overlap between the data points in the test and train set. Your model will have a lower accuracy in this case.
Thus, cross validation is used to average the results of various such splits (5 in your case).
Here is your code run in a google colab notebook:
https://colab.research.google.com/drive/16-NotF-_WVLESmvGMONSGSZigxrT3KLx?usp=sharing
The last cell makes 5 different splits and then averages their accuracies. Notice how that is the same as the one you got from cross validation. Also notice how some splits have higher and some splits have a lower accuracy.
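For reference, roughly what that last cell does, assuming the same x, y and classifier settings as in the question:
# Repeat the train/test split with different seeds and average the accuracies
split_accuracies = []
for seed in range(5):
    xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                                   oob_score=True, n_estimators=100, min_samples_leaf=10)
    model.fit(xtr, ytr)
    split_accuracies.append(accuracy_score(yte, model.predict(xte)))
print(split_accuracies)           # some splits score higher, some lower
print(np.mean(split_accuracies))  # the average lands close to the cross-validation score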
To further convince yourself, look at the output of:
cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
The output is a list of scores (accuracies in your case) for the 5 different splits. You can see that they have varying values around 0.83.
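A minimal way to inspect those per-fold values and their spread (reusing clf, x, y and skf from the question):
fold_scores = cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
print(fold_scores)                            # five accuracies scattered around 0.83
print(fold_scores.mean(), fold_scores.std())  # the mean matches the reported CV accuracy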
This is just down to chance for the split and the random state of the RandomForestClassifier. Try leaving random_state=42 out and fitting several times, and you'll get a range of different accuracies. By chance, I had one run without CV of "just" 0.78! In contrast, CV gives you an average (your calculated mean) plus an idea of how much your accuracy could vary around it.
I am analyzing a dataset from Kaggle and want to apply a logistic regression model to predict something. This is the data: https://www.kaggle.com/code/mohamedadelhosny/stroke-prediction-data-analysis-challenge/data
I split the data into train and test sets and want to use cross-validation to ensure the highest accuracy possible. I did some pre-processing and used dummy encoding for the categorical features, got to a certain point in the code, and now I don't know how to proceed. I can't figure out how to use the results of the cross-validation; it's not so straightforward.
This is what I got so far:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values # features
Y = data_Enco.iloc[:, 6] # labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
logisticModel = LogisticRegression(class_weight='balanced')
# evaluate model
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print('average score = ', np.mean(scores))
print('std of scores = ', np.std(scores))
average score = 0.7483538453549359
std of scores = 0.0190400919099899
So far so good. I got the results of the model for each of the 10 splits. But now what? How do I build a confusion matrix? How do I calculate recall, precision, and so on? I have the right code without performing cross-validation, I just don't know how to adapt it. How do I use the scores from the cross_val_score function?
import pandas as pd
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(scaled_X_train, Y_train)  # Train the model
predictions_log = logisticModel.predict(scaled_X_test)

## Scoring the model
logisticModel.score(scaled_X_test, Y_test)

## Confusion Matrix
Y_pred = logisticModel.predict(scaled_X_test)
real_data = Y_test
print('Observe the difference between the real data and the data predicted by the classifier:\n')
print('Predictions: ', Y_pred, '\n\n')
print('Real Data: ', real_data, '\n')

cmtx = pd.DataFrame(
    confusion_matrix(real_data, Y_pred, labels=[0, 1]),
    index=['real 0:', 'real 1:'], columns=['pred 0:', 'pred 1:']
)
print(cmtx)
print('Accuracy score is: ', accuracy_score(real_data, Y_pred))
print('Precision score is: ', precision_score(real_data, Y_pred))
print('Recall Score is: ', recall_score(real_data, Y_pred))
print('F1 Score is: ', f1_score(real_data, Y_pred))
The performance of a model on the training dataset is not a good estimator of the performance on new data because of overfitting.
Cross-validation is used to obtain an estimation of the performance of your model on new data, i.e. without overfitting. And you correctly applied it to compute the mean and variance of the accuracy of your model. This should be a much better approximation of the accuracy on your test dataset than the accuracy on your training dataset. And that is it.
However, cross-validation is usually used to do model selection. Say you have two logistic regression models that use different sets of independent variables. E.g., one is using only age and gender while the other one is using age, gender, and bmi. Or you want to compare logistic regression with an SVM model.
I.e., you have several possible models and you want to decide which one is best. Of course, you cannot just compare the training-dataset accuracies of all the models, because those are spoiled by overfitting. And if you use the performance on the test dataset to choose the best model, the test dataset becomes part of the training, you will have leakage, and the performance on the test dataset can no longer be used as a final, untainted performance measure. That is why cross-validation is used: it creates splits whose held-out folds act as different validation sets.
So the idea is to
apply cross-validation to each of your candidate models,
use the scores of those cross-validations to choose the best model,
retrain that best model on the complete training dataset to get a final version of your best model, and
to finally apply this final version to the test dataset to obtain some untainted evaluation.
But note that those steps are for model selection. However, you have only a single model, the logistic regression, so there is nothing to select from. If you fit your model, let's call it m(p) where p denotes the parameters, to e.g. five folds of CV, you get five different fitted versions m(p1), m(p2), ..., m(p5) of the same model.
So if you have only one model, you fit it to the complete training dataset, and maybe use CV to get an additional estimate of the performance on new data, but that's it. And you have already done this. There is no "selection of the best model"; that only applies if you have several models as described above, e.g. logistic regression and an SVM.
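As a minimal sketch of that single-model workflow, reusing the variable names from the question (confusion_matrix and classification_report are the usual sklearn.metrics helpers):
from sklearn.metrics import confusion_matrix, classification_report

# 1) CV on the training data: an estimate of how the model behaves on new data
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print(scores.mean(), scores.std())

# 2) Fit the single model on the complete training set
logisticModel.fit(scaled_X_train, Y_train)

# 3) Final, untainted evaluation on the held-out test set
Y_pred = logisticModel.predict(scaled_X_test)
print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))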
I trained a model using Logistic Regression to predict whether a name field and description field belong to the profile of a male, female, or brand. My train accuracy is around 99% while my test accuracy is around 83%. I have tried implementing regularization by tuning the C parameter, but the improvements were barely noticeable. I have around 5,000 examples in my training set. Is this an instance where I just need more data, or is there something else I can do in Sci-Kit Learn to get my test accuracy higher?
Overfitting is a multifaceted problem. It could be your train/test/validation split (anything from 50/40/10 to 90/9/1 could change things). You might need to shuffle your input. Try an ensemble method, or reduce the number of features. You might have outliers throwing things off.
Then again, it could be none of these, or all of these, or some combination of these.
For starters, try plotting the test score as a function of the test split size and see what you get.
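A rough sketch of such a plot, assuming X and y hold your features and labels (the estimator and split details below are only placeholders):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

test_sizes = np.linspace(0.1, 0.5, 9)
test_scores = []
for size in test_sizes:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=size, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    test_scores.append(accuracy_score(y_te, model.predict(X_te)))

plt.plot(test_sizes, test_scores, marker='o')
plt.xlabel('test split size')
plt.ylabel('test accuracy')
plt.show()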
# The 'C' value in Logistic Regression works very similarly to the one in the Support
# Vector Machine (SVM) algorithm. When I use SVM I like to use GridSearch to find the
# best possible values for 'C' and 'gamma'; maybe this can give you some light.

# For SVC; you can remove the gamma and kernel keys:
# param_grid = {'C': [0.1, 1, 10, 100, 1000],
#               'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
#               'kernel': ['rbf']}
param_grid = {'C': [0.1, 1, 10, 100, 1000]}
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report,confusion_matrix
# Train and fit your model to see initial values
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
model = SVC()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
# Find the best 'C' value
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)
grid.best_params_
c_val = grid.best_estimator_.C
#Then you can re-run predictions on this grid object just like you would with a normal model.
grid_predictions = grid.predict(X_test)
# use the best 'C' value found by GridSearch and refit your LogisticRegression model
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(C=c_val)
logmodel.fit(X_train, y_train)
log_predictions = logmodel.predict(X_test)
print(confusion_matrix(y_test, log_predictions))
print(classification_report(y_test, log_predictions))
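If you prefer to skip the SVC detour, an alternative (just a sketch, with illustrative parameter values) is to grid-search C directly on LogisticRegression:
from sklearn.linear_model import LogisticRegression

lr_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                       {'C': [0.1, 1, 10, 100, 1000]},
                       refit=True, verbose=3)
lr_grid.fit(X_train, y_train)
print(lr_grid.best_params_)
print(classification_report(y_test, lr_grid.predict(X_test)))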
I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold

lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
I don't think cross_val_score or cross_val_predict uses fit before predicting. It does it on the fly. If you look at the documentation (section 3.1.1.1), you'll see that they never mention fit anywhere.
As #DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[feature_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
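To connect the two: cross_val_predict gives out-of-fold predictions for the training data only, which you can score to estimate generalization; predictions for the test set still come from the explicitly fitted model above. A minimal sketch, reusing predictions from the question:
from sklearn.metrics import accuracy_score

# Accuracy of the out-of-fold (cross-validated) predictions on the training data
cv_train_accuracy = accuracy_score(train_df["target"], predictions)
print(cv_train_accuracy)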
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with a method that does not apply CV. The CV-validated accuracy is on X_train and y_train. The other method fits the model using X_train and y_train and tests on X_test and y_test. So the comparison is not fair, since they are on different datasets.
What you can do is use the estimators returned by cross_validate:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=True)
# cross_validate returns a dict; 'estimator' holds the fitted model from each fold
y_pred = cv_results['estimator'][0].predict(test_df[feature_columns])
accuracy = (y_pred == test_df["target"]).mean()
I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get a good performance on the training data (CV_score = 0.8), but the performance on the test data is much worse: AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC < 0.5 if a naive classifier should score AUC = 0.5? Is this due to the fact that I have a small number of samples for training?
I have noticed that the performance on test data improves if I change the code for :
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC = 0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better using cross-validation predictions? Is it because cross-validation allows making predictions on subsets of the test data which do not contain the objects/samples that decrease the AUC?
Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier using part of your testing data and then predicting on the rest. So the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related with your question: since the number of your training samples is much smaller than the feature dimensions, it may be helpful to perform dimension reduction before feeding to classifier.
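For example, one way to do that (a sketch only; the component count is an arbitrary illustrative choice) is to add a PCA step inside the pipeline:
from sklearn.decomposition import PCA

# Standardize, reduce the 258 features, then classify
clf_pca = make_pipeline(StandardScaler(),
                        PCA(n_components=20),  # illustrative value, tune via CV
                        LogisticRegression(C=1, class_weight='balanced'))
print(cross_val_score(clf_pca, X_train, y_train, cv=10, scoring='roc_auc').mean())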
It looks like your training and test process are inconsistent. Although from your code you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the Logistic Regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')) but during test instead of using the pipeline as expected to predict, you use only LogReg, hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data, while ignoring the train data. Here, you do data standardization since you use clf and thus your score is high; this is evidence that the standardization step is important.
To summarize: standardizing the test data will, I believe, improve your test score.
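Concretely, a minimal sketch of that fix, fitting the pipeline you already defined so the scaler is also applied to the test data:
# Fit the whole pipeline (scaler + LogReg), not just the bare LogisticRegression
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, probas))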
Firstly, it makes no sense to have 258 features for 60 training items. Secondly, cv=10 for 60 items means you split the data into 10 train/test sets, each of which has only 6 items in the test set. So whatever results you obtain will be useless. You need more training data and fewer features.