I am training several models for a regression problem. To find the best one, I want to run cross-validation with k = 20, characterize each model's MSE, and determine statistically which model is better.
The problem has multiple dependent variables, and I would like to compute the MSE separately for each of them, but cross_val_score doesn't let me do that explicitly.
Here is some example code of one of my models:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x_test)
mse = mean_squared_error(scaler2.inverse_transform(y_test), scaler2.inverse_transform(y_pred), multioutput="raw_values")
How can I repeat the training k times, once for each of the k models trained and tested in k-fold cross-validation?
Scikit-learn provides KFold, but as far as I can tell it only specifies the number of folds and doesn't return the training and test folds itself, so I can't see how to train separate models with it. I also need to evaluate the MSE separately for each dependent variable, since this is a multi-output regression problem.
You can use scikit-learn's KFold cross-validation with a simple for loop: kf.split() yields the train and test indices for each fold.
Here is an example of 5-fold cross-validation on a Bayes classifier:
import numpy as np
from sklearn.model_selection import KFold

# X_train_concat, y_train_concat, trainBayes and predict are defined elsewhere
k = 5
kf = KFold(n_splits=k)
res = []
for train_index, test_index in kf.split(X_train_concat):
    X_train_kf, X_test_kf = X_train_concat[train_index, :], X_train_concat[test_index, :]
    y_train_kf, y_test_kf = y_train_concat[train_index], y_train_concat[test_index]
    # Append the labels as the last column of this fold's training data only
    X_train = np.append(X_train_kf, np.reshape(y_train_kf, (len(y_train_kf), 1)), axis=1)
    W_bayes = trainBayes(X_train)
    y_pred = predict(X_test_kf, W_bayes)
    # Misclassification rate of this fold, in percent
    mis_classification = len(y_pred) - np.count_nonzero(y_pred == y_test_kf)
    e = (mis_classification / y_test_kf.shape[0]) * 100
    res.append(e)
avg_res = sum(res) / k
print('Result of each fold - {}'.format(res))
print('Avg result : {}'.format(avg_res))
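For the regression case in the question, the same loop can collect one MSE per target column in each fold. The following is only a sketch on synthetic data (the arrays X and Y, the weight matrix W and the noise level are made up for illustration); it relies on mean_squared_error with multioutput="raw_values", which the question already uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 samples, 3 features, 2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
W = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
Y = X @ W + rng.normal(scale=0.1, size=(200, 2))

kf = KFold(n_splits=20, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    # A fresh model is fitted on the training part of each fold.
    model = LinearRegression().fit(X[train_idx], Y[train_idx])
    Y_pred = model.predict(X[test_idx])
    # multioutput="raw_values" returns one MSE per target column.
    fold_mse.append(
        mean_squared_error(Y[test_idx], Y_pred, multioutput="raw_values")
    )

fold_mse = np.array(fold_mse)   # shape (n_folds, n_targets)
print(fold_mse.mean(axis=0))    # mean MSE per target
print(fold_mse.std(axis=0))     # spread across folds, per target
```

Keeping the per-fold values in an array (rather than only their sum) makes it possible to compare models fold by fold afterwards, provided every model is evaluated with the same KFold settings.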
I'm trying to compare the RMSE from multiple linear regression on the full data set with the RMSE from 10-fold cross-validation, using the KFold class in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
Here's my linear regression function
def standRegres(xArr,yArr):
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T*xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df,comm_target)
    p = linreg.predict(comm_df)
    e = p-comm_target
    xval_err += np.sqrt(np.dot(e,e)/len(comm_df))
rmse_10cv = xval_err/10
I get an error saying that the KFold object is not iterable.
There are several things you need to correct in this code.
You cannot iterate over kf itself. You can only iterate over kf.split(comm_df), which yields a pair of train and test index arrays for each fold.
You need to actually use the train/test split that KFold provides, and your code does not. The goal of KFold is to fit your regression on the train observations and to evaluate it (i.e. compute the RMSE in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas)
linreg = LinearRegression()
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    # Fit on the training fold only
    linreg.fit(comm_df[train], comm_target[train])
    # Evaluate on the held-out fold
    p = linreg.predict(comm_df[test])
    e = p - comm_target[test]
    xval_err += np.sqrt(np.dot(e, e) / len(comm_target[test]))
rmse_10cv = xval_err / 10
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum=0
RMSE_length=10
X = np.array(comm_df)
y = np.array(comm_target)
for loop_number, (train, test) in enumerate(kf.split(X)):
    ## Get Training Matrix and Vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)
    ## Get Testing Matrix Values
    X_test_array = X[test]
    y_actual_values = y[test]
    ## Fit the Linear Regression Model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)
    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    ## Flatten to 1-D so the shape matches y_actual_values
    crime_probabilites = np.array(prediction).ravel()
    ## Calculate the RMSE (RMSEcalc is my own helper, not shown here)
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum=RMSE_cross_fold+RMSE_sum
## Calculate the average and print
RMSE_cross_fold_avg=RMSE_sum/RMSE_length
print('The Mean RMSE across all folds is',RMSE_cross_fold_avg)
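If the manual loop is not needed, the per-fold RMSE can probably also be obtained from cross_val_score. This is a sketch on synthetic data (X, y and the coefficients are made up), and it assumes a scikit-learn version that provides the "neg_root_mean_squared_error" scorer (0.22 or later). The scorer returns negated values, so the sign is flipped afterwards.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for comm_df / comm_target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
# Scorers follow "higher is better", so the RMSE comes back negated.
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print(rmse_per_fold)
print("Mean RMSE across folds:", rmse_per_fold.mean())
```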
Using Python and an SVM, I ran these two pieces of code:
First I applied this code with a dataset
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
model = LinearSVC(class_weight='balanced',C=0.01, penalty='l2').fit(X_, y)
y_preds = model.predict(X_)
report = classification_report( y, y_preds )
print(report)
print(cohen_kappa_score(y, y_preds), '\n', accuracy_score(y, y_preds), '\n', confusion_matrix(y, y_preds))
This gives me this accuracy : 0.9485714285714286
Second, I applied this code to exactly the same dataset:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    LinearSVC(class_weight='balanced',C=0.01, penalty='l2', loss='squared_hinge'),
]
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, X_, y, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
cv_df.groupby('model_name').accuracy.mean()
The accuracy is different: 0.797090
Where are my mistakes?
Which code is correct if any?
How to calculate precision and recall after cross-validation as in the 2nd code?
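In the first snippet, you fit the model on the whole dataset and then predict on that same data, so the accuracy is computed once, on samples the model has already seen, and is therefore optimistic. In the second snippet, the data is split into 5 folds; each time the model is trained on 4 folds and scored on the held-out fold, and the 5 accuracy scores are averaged. In other words, the second snippet gives the more reliable accuracy estimate.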
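As for your other question: to cross-validate with multiple metrics, use cross_validate() instead of cross_val_score(). The per-fold results are stored under keys prefixed with test_ (for a multiclass target, use an averaged scorer such as 'precision_macro'):
from sklearn.model_selection import cross_validate
scores = cross_validate(model, X_, y, scoring=('precision', 'recall'))
print(scores['test_precision'])
print(scores['test_recall'])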
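To see the gap between training-set accuracy and cross-validated accuracy in one call, something like the following should work. It is a sketch on synthetic data from make_classification (an assumption, not the asker's dataset); return_train_score=True makes cross_validate also report scores on the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LinearSVC(class_weight="balanced", C=0.01)

scores = cross_validate(
    model, X, y, cv=5,
    scoring=("accuracy", "precision", "recall"),
    return_train_score=True,
)
# Scores on data the model was fitted on vs. on held-out folds.
print("train accuracy:", scores["train_accuracy"].mean())
print("test accuracy: ", scores["test_accuracy"].mean())
print("test precision:", scores["test_precision"].mean())
print("test recall:   ", scores["test_recall"].mean())
```

The held-out ("test_") numbers are the ones to report; the training numbers are only useful for spotting overfitting.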
I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.iloc[:, :-1].values
# Target
df_target = df.iloc[:, -1].values
# Values to predict
df_test = df_pred.iloc[:, :-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training data and the data to predict independently, which isn't correct: both datasets must be transformed with the same scaler, fitted on the training data. scale() is a simple function that cannot be reused this way, so it is better to use an estimator such as StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How does your code work? You split the data 10 times into a training set and a hold-out set, and each time you fit the model on the training set and compute the score on the hold-out set. This gives you cross-validation scores, but the model left over at the end was fitted on only part of the data (the training portion of the last fold). So it is better to fit the model on the whole dataset and then make the prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
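One way to combine both points (a shared scaler and a final fit on all the data) is a Pipeline, so that the scaler is re-fitted on the training part of every fold during cross-validation. This is a sketch on synthetic data; X, y and X_new are stand-ins for the asker's dataframes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins (assumption): X, y for training, X_new to predict.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_new = X[:10]

# Scaling and the classifier are fitted together as one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# 10-fold cross-validation: the scaler only ever sees the training folds.
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())

# Final model: fit on all training data, then predict the new samples.
pipe.fit(X, y)
novel = pipe.predict(X_new)
print(novel)
```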
I want to note that there are advanced techniques like stacking and boosting, but if you are still learning sklearn, it is better to stick to the basics.
I am running two different classification algorithms on my data, logistic regression and naive Bayes, but they give me the same accuracy even if I change the ratio of training to testing data. This is the code I am using:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('Speed Dating.csv', encoding = 'latin-1')
X = pd.DataFrame()
X['d_age'] = df ['d_age']
X['match'] = df ['match']
X['importance_same_religion'] = df ['importance_same_religion']
X['importance_same_race'] = df ['importance_same_race']
X['diff_partner_rating'] = df ['diff_partner_rating']
# Drop NAs
X = X.dropna(axis=0)
# Categorical variable Match [Yes, No]
y = X['match']
# Drop y from X
X = X.drop(['match'], axis=1)
# Transformation
scalar = StandardScaler()
X = scalar.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print('Accuracy Score with Logistic Regression: ', accuracy_score(y_test, model.predict(X_test)))
#Naive Bayes
model_2 = GaussianNB()
model_2.fit(X_train, y_train)
print('Accuracy Score with Naive Bayes: ', accuracy_score(y_test, model_2.predict(X_test)))
print(model_2.predict(X_test))
Is it possible for the accuracy to be the same every time?
This is a common phenomenon when the class frequencies are unbalanced, e.g. when nearly all samples belong to one class. For example, if 80% of your samples belong to class "No", a classifier will often tend to predict "No" for everything, because such a trivial prediction already reaches a high overall accuracy on the training set.
In general, when evaluating the performance of a binary classifier, you should not only look at the overall accuracy. You have to consider other metrics such as the ROC curve, per-class accuracies, F1 scores and so on.
In your case you can use sklearn's classification report to get a better feeling for what your classifiers are actually learning:
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
print(classification_report(y_test, model_2.predict(X_test)))
It will print the precision, recall and F1 score for every class.
There are three options for reaching a better classification accuracy on your class "Yes":
use sample weights: you can increase the importance of the samples of the "Yes" class, thus forcing the classifier to predict "Yes" more often
downsample the "No" class in the original X to reach more balanced class frequencies
upsample the "Yes" class in the original X to reach more balanced class frequencies
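The three options could look roughly like this. It is a sketch on made-up data with an 80/20 class split; class_weight="balanced" is used as one way to weight samples, and sklearn.utils.resample does the down- and upsampling. In a real workflow the resampling should be applied to the training split only, never to the test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Made-up imbalanced data (assumption): 80 "No" (0) and 20 "Yes" (1) samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(80, 2)), rng.normal(1, 1, size=(20, 2))])
y = np.array([0] * 80 + [1] * 20)

# Option 1: weight classes inversely to their frequency.
weighted_model = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: downsample the majority class to the minority size.
X_no_down = resample(X[y == 0], replace=False, n_samples=20, random_state=0)
X_down = np.vstack([X_no_down, X[y == 1]])
y_down = np.array([0] * 20 + [1] * 20)

# Option 3: upsample the minority class (with replacement) to the majority size.
X_yes_up = resample(X[y == 1], replace=True, n_samples=80, random_state=0)
X_up = np.vstack([X[y == 0], X_yes_up])
y_up = np.array([0] * 80 + [1] * 80)

print(np.bincount(y), np.bincount(y_down), np.bincount(y_up))
```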
I am trying to run multinomial naive Bayes on a series of examples in Python using scikit-learn. I am consistently getting all examples classified as negative. The training set is somewhat biased towards negatives, P(negative) ~ .75. I looked through the documentation and couldn't find a way to bias toward positives.
from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
X_train, y_train= load_svmlight_file("POS.train")
x_test, y_test = load_svmlight_file("POS.val")
clf = MultinomialNB()
clf.fit(X_train, y_train)
preds = clf.predict(x_test)
print('accuracy: ' + str(accuracy_score(y_test, preds)))
print('precision: ' + str(precision_score(y_test, preds)))
print('recall: ' + str(recall_score(y_test, preds)))
Setting a prior is a poor way to handle this and will result in negative cases being classified as positive that really shouldn't be. Your data has a .25/.75 split, so a .5/.5 prior is a pretty bad option.
Instead, one can average the precision and recall with a harmonic mean to produce an F score which attempts to properly handle biased data like this:
from sklearn.metrics import f1_score
The F1 score can then be used to assess the quality of the model. You can then do some model tuning and cross validation to find a model that better classifies your data i.e. the model that maximizes the F1 score.
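A small illustration of why F1 is more informative than accuracy here. The labels are toy values made up to mirror the .25/.75 split; zero_division=0 (available in scikit-learn 0.22 and later) just silences the warning when no positives are predicted.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with a 25% / 75% positive/negative split (made-up data).
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_all_negative = [0, 0, 0, 0, 0, 0, 0, 0]   # classifier that always says "negative"
y_some_positive = [1, 0, 1, 0, 0, 0, 0, 0]  # finds one positive, one false alarm

acc_a = accuracy_score(y_true, y_all_negative)
acc_b = accuracy_score(y_true, y_some_positive)
f1_a = f1_score(y_true, y_all_negative, zero_division=0)
f1_b = f1_score(y_true, y_some_positive)

print(acc_a, acc_b)  # both classifiers reach the same accuracy
print(f1_a, f1_b)    # but F1 separates them
```

Both predictions are 75% accurate, yet the all-negative one has an F1 of 0 while the other reaches 0.5, which is the distinction accuracy alone hides.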
Another option is to randomly prune out the negative cases in your data so that the classifier is trained with .5/.5 data. The predict step should then give more appropriate classifications.