I'm using scikit-learn to train some classifiers. I do cross validation and then compute AUC. However I'm getting a different AUC number every time I run the tests although I made sure to use a seed and a RandomState. I want my tests to be deterministic. Here's my code:
from sklearn.utils import shuffle
SEED = 0
random_state = np.random.RandomState(SEED)
X, y = shuffle(data, Y, random_state=random_state)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, y, test_size=test_size, random_state=random_state)
clf = linear_model.LogisticRegression()
kfold = cross_validation.KFold(len(X), n_folds=n_folds)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
for train, test in kfold:
    probas_ = clf.fit(X[train], Y[train]).predict_proba(X[test])
    fpr, tpr, thresholds = roc_curve(Y[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
mean_tpr /= len(kfold)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
My questions:
1- Is there something wrong in my code that's making the results different each time I run it?
2- Is there a global way to make scikit deterministic?
EDIT:
I just tried this:
test_size = 0.5
X = np.random.randint(10, size=(10,2))
Y = np.random.randint(2, size=(10))
SEED = 0
random_state = np.random.RandomState(SEED)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train # I recorded the result
Then I did:
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=6)  # notice the change in random_state
Then I did:
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train #the result is different from the first one!!!!
As you see I'm getting different results although I used the same random_state! How to solve this?
LogisticRegression uses randomness internally and has an (undocumented, will fix in a moment) random_state argument.
There's no global way of setting the random state, because unfortunately the random state on LogisticRegression and the SVM code can only be set in a hacky way. That's because this code comes from Liblinear and LibSVM, which use the C standard library's rand function and that cannot be seeded in a principled way.
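EDIT: The above is true, but it is probably not the cause of the problem. You're threading a single np.random.RandomState instance through your calls. That object is stateful: every call that draws from it advances it, so a later call with the same object no longer starts from the seed. Pass the same integer seed to each call instead for easy reproducibility.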
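A minimal sketch of the difference, assuming a current scikit-learn release where train_test_split lives in sklearn.model_selection (the question uses the older cross_validation module). The toy data and the helper name split_with are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up purely for illustration.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2


def split_with(random_state):
    """Return the training rows produced by one train_test_split call."""
    X_train, _, _, _ = train_test_split(
        X, y, test_size=0.5, random_state=random_state)
    return X_train


# An integer seed builds a fresh generator on every call, so the split repeats.
first = split_with(0)
second = split_with(0)
same_with_int_seed = np.array_equal(first, second)

# A RandomState instance is stateful: each call that uses it advances it,
# so the second call below draws from a different point in the stream.
rng = np.random.RandomState(0)
a = split_with(rng)
b = split_with(rng)

# Re-creating the generator from the same seed restores the first split.
c = split_with(np.random.RandomState(0))
same_after_reseed = np.array_equal(a, c)

print(same_with_int_seed, same_after_reseed)
```

Comparing a and b on your own data shows whether the shared instance is what changes the split between calls.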
I'm trying to do multiclass classification with several machine learning models, using this function that I have created:
def model_roc(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    pipeline1 = imbpipeline(steps = [['pca' , PCA()],
                                     ['smote', SMOTE('not majority',random_state=11)],
                                     ['scaler', MinMaxScaler()],
                                     ['classifier', LogisticRegression(random_state=11,max_iter=1000)]])
    stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
    param_grid1 = {'classifier__penalty': ['l1', 'l2'],'classifier__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    grid_search1 = GridSearchCV(estimator=pipeline1,
                                param_grid=param_grid1,
                                scoring=make_scorer(roc_auc_score, average='weighted',multi_class='ovo', needs_proba=True),
                                cv=stratified_kfold,
                                n_jobs=-1)
    print("#"*100)
    print(pipeline1['classifier'])
    grid_search1.fit(X_train, y_train)
    cv_score1 = grid_search1.best_score_
    test_score1 = grid_search1.score(X_test, y_test)
    print('cv_score',cv_score1, 'test_score',test_score1)
    return
I have 2 questions:
Can I get multiple metrics from the same function (ROC AUC, accuracy, precision and F1 score), given that the data is imbalanced and has multiple classes?
I need to plot the learning curve, but I don't know how to do this out of my function.
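For the first question, one possible approach (a sketch, not something confirmed in this thread) is to pass a list of scikit-learn's built-in scorer names to cross_validate. The synthetic data and the plain LogisticRegression below are stand-ins for the real X, y and pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced 3-class data standing in for the real X, y.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=11)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# Built-in scorer names; the weighted variants account for class imbalance.
scoring = ["accuracy", "precision_weighted", "f1_weighted",
           "roc_auc_ovo_weighted"]

results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring=scoring)

# Each metric comes back as one score per fold under the key "test_<name>".
for name in scoring:
    print(name, results["test_" + name].mean())
```

GridSearchCV accepts the same kind of scoring list, in which case refit has to name the metric used to pick the best parameters.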
To plot a learning curve of roc_auc_score:
def plot_learning_curves(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    train_scores, test_scores = [], []
    # Note: for very small m the slice may hold a single class, and roc_auc_score then raises a ValueError.
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_test_predict = model.predict(X_test)
        # Score the training predictions against the same m labels they were made for.
        train_scores.append(roc_auc_score(y_train[:m], y_train_predict))
        test_scores.append(roc_auc_score(y_test, y_test_predict))
    plt.plot(train_scores, "r-+", linewidth=2, label="train")
    plt.plot(test_scores, "b-", linewidth=2, label="test")
We iterate m from 1 to the whole length of the training set, slicing m records from the training data and training the model on this subset.
The model's roc_auc_score is calculated and appended to a list, showing how its roc_auc_score progresses as the model trains on more and more instances.
-modified from ageron-handsonml2
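As an alternative sketch, scikit-learn ships a learning_curve helper that does the refitting and scoring per training size with cross-validation. For a multiclass target, roc_auc_score needs class probabilities and a multi_class setting, which the "roc_auc_ovo_weighted" scorer takes care of. The synthetic data and the bare LogisticRegression are assumptions standing in for the real data and pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic 3-class data standing in for the real X, y.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=11)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# Refit on 20%, 40%, ..., 100% of each training fold and score every fit.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=cv, scoring="roc_auc_ovo_weighted")

# Rows are training sizes, columns are folds; average over the folds.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(n, tr, te)
```

The arrays train_sizes, train_mean and test_mean can be passed straight to plt.plot in the same way as the lists in the answer above.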
I am trying to code a multiple linear regression problem using two different methods. One is the simple one as stated below:
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
X = df[['geo','age','v_age']]
y = df['freq']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # predictions on the held-out split
print(metrics.mean_squared_error(ypred,y_test))
print(r2_score(y_test,ypred))
The above code gives me an MSE of 0.46 and an R2 score of 0.0012, which is a really bad fit. Meanwhile, when I use:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=1)  # Degree = 1 should give the same equation as above code block
X_ = poly.fit_transform(X)
y = y.values.reshape(-1, 1)
predict_ = poly.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X_, predict_, test_size=0.33, random_state=42)
# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # predictions on the held-out split
print(metrics.mean_squared_error(ypred,y_test))
print(r2_score(y_test,ypred))
Using PolynomialFeatures gives me an MSE of 0.23 and an R2 score of 0.5, which is much better. I don't understand how two methods using the same regression equation give such different answers. Everything else is the same.
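One plausible explanation, offered as a hypothesis rather than something confirmed in this thread: PolynomialFeatures(degree=1) prepends a constant column of ones (include_bias=True is the default), so calling it on y turns the target into two columns. The model then also has to predict a constant, which is trivial, and mean_squared_error and r2_score average over output columns by default. That would be consistent with the MSE halving from 0.46 to 0.23. The target values below are made up to show the effect on the shape of y:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up target values standing in for df['freq'].
y = np.array([0.0, 1.0, 0.0, 2.0, 1.0]).reshape(-1, 1)

poly = PolynomialFeatures(degree=1)  # include_bias=True by default
y_expanded = poly.fit_transform(y)

print(y_expanded.shape)   # two columns instead of one
print(y_expanded[:, 0])   # the added bias column of ones
print(y_expanded[:, 1])   # the original target values
```

If that is what happens here, leaving y untransformed (only X goes through PolynomialFeatures) should make the two code paths agree.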
I need to create a FOR loop in Python that will repeat steps 1-2 1,00 times.
1. Split the sample randomly into training and test sets using a 632:368 ratio.
2. Build the model using the 63.2% training data and compute R square in the holdout data.
I can't seem to grab the R square for the dataset:
y=data['Amount']
xall = data
xall.drop(["No","Amount", "Class"], axis = 1, inplace = True)
for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
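from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)
for seed in range(100):
    # test_size=0.368 matches the 632:368 ratio stated in the question
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.368, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)          # fit on the training split only
    r_sq = modelall.score(X_test, y_test)   # R square on the held-out split
    print('coefficient of determination:', r_sq)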
In your code you fit the linear model to the whole dataset (xall) on every iteration, so the seed has no effect: the split changes, but the data passed to fit and score does not. Linear regression gives the same output on the same data irrespective of the seed value.
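A more compact variant of the same loop, sketched under the assumption that 100 repeats are wanted (as in the loop above) and using synthetic data in place of the real xall and y: ShuffleSplit generates the random 63.2/36.8 splits and cross_val_score fits on each training part and scores on the matching holdout part.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data standing in for xall and y.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

# 100 independent random splits, each holding out 36.8% of the rows.
splitter = ShuffleSplit(n_splits=100, test_size=0.368, random_state=0)

# One holdout R square per split: fit on the training part, score on the rest.
r2_scores = cross_val_score(LinearRegression(), X, y,
                            cv=splitter, scoring="r2")

print(len(r2_scores), r2_scores.mean(), r2_scores.std())
```

Keeping all the scores in one array makes it easy to look at their spread rather than only the last value printed.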
I am using cross-validation to evaluate my ML models but now I want to look into the distribution of the errors, i.e. I want to get the average error of specific data points whenever they are in the test set.
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score
X = #data points
y = #output
lm = linear_model.LinearRegression()
kfold = KFold(n_splits=10)
scores = cross_val_score(lm, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Testing RMSE (lin reg): {:.3f}'.format(np.mean(rmse_scores)))
Is there an easy way to get the individual errors of each of the data points whenever they are in the test set (not training error) using cross-validation with scikit-learn?
Thank you!
If I understood your question correctly, this should be what you are looking for.
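import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
# model is any scikit-learn regressor (e.g. lm from the question); X and y are NumPy arrays
kf = KFold(n_splits=3)
error = []
for train_index, val_index in kf.split(X, y):
    Xtrain, X_val = X[train_index], X[val_index]
    ytrain, y_val = y[train_index], y[val_index]
    model.fit(Xtrain, ytrain)
    pred = model.predict(X_val)
    current_error = mean_squared_error(y_val, pred)  # error per iteration
    error.append(current_error)
print(np.mean(error))  # get mean error after CV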
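The loop above yields one error per fold. If the goal is one error per data point, a possible sketch is cross_val_predict, which returns for every sample the prediction made while that sample was in the test fold. The synthetic data is an assumption standing in for the real X and y:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data standing in for the real X and y.
X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

kfold = KFold(n_splits=10)

# Each sample is predicted exactly once, by the model that did not train on it.
y_pred = cross_val_predict(LinearRegression(), X, y, cv=kfold)

per_sample_error = y - y_pred            # signed out-of-fold residuals
squared_error = per_sample_error ** 2    # per-sample squared error
rmse = np.sqrt(squared_error.mean())     # pooled RMSE over all samples

print(per_sample_error[:5], rmse)
```

Note that cross_val_predict needs a splitter that puts each sample in a test fold exactly once, so averaging a point's error over repeated splits would still need an explicit loop like the one above.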
I want to have metrics per class label and an aggregate confusion matrix from a cross validation in scikit learn.
I wrote a method that performs a cross-validation for scikit learn that sums the confusion matrices and also stores all the predicted labels. Then, it calls scikit learn methods to print out the metrics.
The code below should run with any recent scikit-learn installation; you can test it out with any dataset.
Is the code below the correct way to gather an aggregate confusion matrix (cm) and a classification_report when doing StratifiedKFold cross-validation?
from sklearn import metrics
from sklearn.cross_validation import StratifiedKFold
import numpy as np
def customCrossValidation(self, X, y, classifier, n_folds=10, shuffle=True, random_state=0):
    ''' Perform a cross validation and print out the metrics '''
    skf = StratifiedKFold(y, n_folds=n_folds, shuffle=shuffle, random_state=random_state)
    cm = None
    y_predicted_overall = None
    y_test_overall = None
    for train_index, test_index in skf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier.fit(X_train, y_train)
        y_predicted = classifier.predict(X_test)
        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = y_predicted
            y_test_overall = y_test
        else:
            y_predicted_overall = np.concatenate([y_predicted_overall, y_predicted])
            y_test_overall = np.concatenate([y_test_overall, y_test])
        cv_cm = metrics.confusion_matrix(y_test, y_predicted)
        # sum the cv per fold
        if cm is None:
            cm = cv_cm
        else:
            cm += cv_cm
    print (metrics.classification_report(y_test_overall, y_predicted_overall, digits=3))
    print (cm)
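The sklearn.cross_validation module used above was removed in later scikit-learn releases. A rough equivalent with sklearn.model_selection, offered as a sketch rather than a verdict on the code above, collects the out-of-fold predictions with cross_val_predict and builds one confusion matrix and one report from them. The synthetic data and LogisticRegression are stand-ins for the real dataset and classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 3-class data standing in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold predictions: every sample is predicted once, in its test fold.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=skf)

# Passing labels explicitly fixes the row and column order of the matrix.
labels = np.unique(y)
cm = confusion_matrix(y, y_pred, labels=labels)
report = classification_report(y, y_pred, labels=labels, digits=3)

print(report)
print(cm)
```

If the per-fold summation is kept instead, passing the same labels argument to every confusion_matrix call keeps the fold matrices the same shape, so that cm += cv_cm stays valid even when a fold misses a class.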