Today I attempted to use the bootstrap to obtain confidence intervals for the AUC of several different ML algorithms.
I used my own medical dataset with 61 features, formatted like this:
Age    Female
65     1
45     0
For example, I used this kind of model:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve
from sklearn import metrics

X = data_sevrage.drop(['Echec_sevrage'], axis=1)
y = data_sevrage['Echec_sevrage']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
lr = LogisticRegression(C=10, penalty='l1', solver='saga', max_iter=500).fit(X_train, y_train)
score = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
precision, recall, thresholds = precision_recall_curve(y_test, lr.predict_proba(X_test)[:, 1])
auc_precision_recall = metrics.auc(recall, precision)
y_pred = lr.predict(X_test)
print('ROC AUC score:', score)
print('auc_precision_recall:', auc_precision_recall)
And finally, here is what happens when I use the bootstrap method to obtain the confidence interval (I took the code from another topic: How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python?):
import numpy as np

def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))

bootstrap_auc(lr, X_train, y_train, X_test, y_test, nsamples=1000)
I get this error:
"None of [Int64Index([21, 22, 20, 31, 30, 13, 22, 1, 31, 3, 2, 9, 9, 18, 29, 30, 31, 31, 16, 11, 23, 7, 19, 10, 14, 5, 10, 25, 30, 24, 8, 20], dtype='int64')] are in the [columns]"
I also used this other method, and I get nearly the same error:
n_bootstraps = 1000
rng_seed = 42  # control reproducibility
bootstrapped_scores = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    # bootstrap by sampling with replacement on the prediction indices
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_test[indices])) < 2:
        # We need at least one positive and one negative sample for ROC AUC
        # to be defined: reject the sample
        continue

    score = roc_auc_score(y_test[indices], y_pred[indices])
    bootstrapped_scores.append(score)
    print("Bootstrap #{} ROC area: {:0.3f}".format(i + 1, score))
'[6, 3, 12, 14, 10, 7, 9] not in index'
Can you help me please? I have tried many solutions, but I get this error every time.
Thank you !
Bootstrap method for AUC confidence interval on machine learning algorithm.
The problem is solved! It was just a format problem; converting the data to NumPy arrays solves it. Thank you!
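For future readers, here is a minimal sketch of that conversion (my illustration of the fix, reusing the variable names from the code above):

import numpy as np

# Plain NumPy arrays make integer indexing such as X_train[idx] select rows by position,
# instead of being interpreted as pandas column labels.
X_train_np, X_test_np = np.asarray(X_train), np.asarray(X_test)
y_train_np, y_test_np = np.asarray(y_train), np.asarray(y_test)

print(bootstrap_auc(lr, X_train_np, y_train_np, X_test_np, y_test_np, nsamples=1000))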
Related
I work with an ML Random Forest model and I want to tune all of its important parameters so that it performs as well as possible. For this purpose, in multiple nested loops I try all possible variants and save their results. When I finish, I just look through the results to see which setup is the best.
Doing this on my own PC, I ran into the problem that my code crashes after 3 hours of work because it runs out of memory. Because of this I come to you with 2 questions:
Is it even good and right to do what I am doing (I am new to ML)? I mean, going through all the variants to find the best setup?
Given my memory limits, can this be done on some website? Free online notebooks or compilers to which I can upload my data files and have them compute the variants for me?
Anyway, my code is:
import numpy as np
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

random_states = [0, 42, 1000]
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]
max_depths = np.linspace(1, 32, 32, endpoint=True)
train_results = []
test_results = []
temp_results = []
attempts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for estimator in n_estimators:
    for max_depth in max_depths:
        for min_samples_split in min_samples_splits:
            for min_samples_leaf in min_samples_leafs:
                for random_state in random_states:
                    for attempt in attempts:
                        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
                        rf = RandomForestClassifier(n_estimators=estimator, max_depth=int(max_depth), n_jobs=-1,
                                                    min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)
                        rf.fit(X_train, y_train)
                        train_pred = rf.predict(X_train)
                        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
                        roc_auc = auc(false_positive_rate, true_positive_rate)
                        temp_results.append({"estimator": estimator, "max_depth": max_depth,
                                             "sample_split": min_samples_split, "sample_leaf": min_samples_leaf,
                                             "random_state": random_state, "attempt": attempt, "result": roc_auc})
                        if attempt == attempts[-1]:
                            results = 0
                            for elem in temp_results:
                                results += float(elem["result"])
                            results = results / 10
                            test_results.append({"estimator": estimator, "max_depth": max_depth,
                                                 "sample_split": min_samples_split, "sample_leaf": min_samples_leaf,
                                                 "random_state": random_state, "attempt": attempt, "final_result": results})

result = []
max = 0
goat = 0
for dict in test_results:
    if dict["final_result"] > max:
        max = dict["final_result"]
        goat = dict
    result.append(dict)

print(datetime.now().strftime("%H:%M:%S"), "END ML")
print(result)
print(goat)
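For comparison, the same kind of exhaustive sweep can be expressed with scikit-learn's GridSearchCV, which handles the looping, cross-validation and result bookkeeping itself. This is only a sketch of that idea (not the original poster's code), assuming X and y are defined as in the post:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {
    "n_estimators": [1, 2, 4, 8, 16, 32, 64, 100, 200],
    "max_depth": [int(d) for d in np.linspace(1, 32, 32)],
    "min_samples_split": list(np.linspace(0.1, 1.0, 10)),
    "min_samples_leaf": list(np.linspace(0.1, 0.5, 5)),
}

# 5-fold cross-validation scored by ROC AUC; only a summary table of results is kept.
search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid,
                      scoring="roc_auc", cv=5, verbose=1)
search.fit(X, y)
print(search.best_params_, search.best_score_)

If the full grid is too expensive, RandomizedSearchCV accepts the same kind of grid and samples a fixed number of settings from it.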
I'm trying to tune hyperparameters for KNN on a quite small dataset (Kaggle Leaf, which has around 990 rows):
def knnTuning(self, x_train, t_train):
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)
    print(grid.best_params_)
    print(grid.best_score_)
    return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
                   weight=grid.best_params_["weights"],
                   leafSize=grid.best_params_["leaf_size"])
Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999
And I return this classifier
class KNN:
    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy
I run this several times using 700 rows for training and 200 for validation, chosen with a random permutation.
I then get results for the global accuracy ranging from 0.01 (often) to 0.4 (rarely).
I know that I'm not comparing the same two metrics, but I still can't understand the huge difference between the results.
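For clarity, here is a minimal sketch of the kind of split-and-evaluate run described above; the permutation split and the names x and t (assumed to be NumPy arrays holding the Leaf features and labels) are my reconstruction, not the original poster's code:

import numpy as np

# Random permutation split: 700 rows for training, 200 for validation (assumed setup).
perm = np.random.permutation(len(x))
x_train, t_train = x[perm[:700]], t[perm[:700]]
x_val, t_val = x[perm[700:900]], t[perm[700:900]]

# Train the KNN wrapper defined above with the reported best parameters
# and measure its global accuracy on the validation part.
model = KNN(neighbors=1, weight='uniform', leafSize=5)
model.train(x_train, t_train)
print(model.global_accuracy(x_val, t_val))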
I'm not very sure how you trained your model or how the preprocessing was done. The Leaf dataset has about 100 labels (species), so you have to take care when splitting into train and test sets to ensure an even split of your samples. One reason for the weird accuracy could be that your samples are split unevenly.
You also need to scale your features:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")

le = LabelEncoder()
scaler = StandardScaler()

X = df.drop(['id', 'species'], axis=1)
X = scaler.fit_transform(X)
y = le.fit_transform(df['species'])

strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y)
x_train, y_train, x_test, y_test = [[X[train], y[train], X[test], y[test]] for train, test in strat][0]
If we do the training (and I would be careful about including n_neighbors = 1):
params = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'leaf_size': [5, 10, 15, 20]
}

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
0.9676258992805755
Then you can check on your test:
pred = grid.predict(x_test)
(y_test == pred).mean()
0.9831649831649831
I experimented with the breast cancer data from scikit-learn.
Using all features, without StandardScaler:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
result 1 : 0.9473684210526315
Using all features, with StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc=StandardScaler()
sc.fit(x_train)
x_train=sc.transform(x_train)
x_test=sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
result 2 : 0.9736842105263158
Using only two features, without StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:,[27,22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
result 3 : 0.37719298245614036
Using only two features, with StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:,[27,22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc=StandardScaler()
sc.fit(x_train)
x_train=sc.transform(x_train)
x_test=sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
result 4 : 0.9824561403508771
As results 1 to 4 show, accuracy improves much more with StandardScaler when training with fewer features.
So I am wondering: why does StandardScaler have different effects under different numbers of features?
P.S. Here are the two features I chose:
TL;DR
Don't do feature selection as long as you do not fully understand why you're doing it and in which way it may help your algorithm learn and generalize better. For a start, please read http://www.feat.engineering/selection.html by Max Kuhn.
Full answer.
I suspect you tried to select a best feature subset and encountered a situation where an [arbitrary] subset performed better than the whole dataset. StandardScaler is not the question here, because it's a standard preprocessing step for your algorithm. So your real question should be: "Why does a subset of features perform better than the full dataset?"
Why is your selection algorithm arbitrary? Two reasons.
First, nobody has proven that the most linearly correlated features improve your [or any other] algorithm. Second, the best feature subset is not the same as the subset of best-correlated features.
Let's see this with code.
A feature subset giving the best accuracy
Let's do a brute-force search.
from itertools import combinations
from tqdm import tqdm

acc_bench = 0.9736842105263158  # accuracy on all features
res = {}
f = x_train.shape[1]
pcpt = Perceptron(n_jobs=-1)

for i in tqdm(range(2, 10)):
    features_list = combinations(range(f), i)
    for features in features_list:
        pcpt.fit(x_train[:, features], y_train)
        preds = pcpt.predict(x_test[:, features])
        acc = accuracy_score(y_test, preds)
        if acc > acc_bench:
            acc_bench = acc
            res["accuracy"] = acc_bench
            res["features"] = features

print(res)
{'accuracy': 1.0, 'features': (0, 15, 22)}
So you see that the features [0, 15, 22] give perfect accuracy on the validation dataset.
Do the best features have anything to do with correlation with the target?
Let's find a list ordered by the degree of linear correlation.
import numpy as np
import pandas as pd

features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['target'])
cancer_data = pd.concat([features, target], axis=1)

features_list = np.argsort(np.abs(cancer_data.corr()['target'])[:-1].values)[::-1]
features_list

array([27, 22,  7, 20,  2, 23,  0,  3,  6, 26,  5, 25, 10, 12, 13, 21, 24,
       28,  1, 17,  4,  8, 29, 15, 16, 19, 14,  9, 11, 18])
You see that the best feature subset found by brute force has nothing to do with correlation.
Can linear correlation explain the accuracy of the Perceptron?
Let's plot the number of features taken from the above list (starting with the 2 most correlated) against the resulting accuracy.
import matplotlib.pyplot as plt

res = dict()
for i in tqdm(range(2, 10)):
    features = features_list[:i]
    pcpt.fit(x_train[:, features], y_train)
    preds = pcpt.predict(x_test[:, features])
    acc = accuracy_score(y_test, preds)
    res[i] = [acc]

pd.DataFrame(res).T.plot.bar()
plt.ylim([.9, 1])
Once again, linearly correlated features have nothing to do with perceptron accuracy.
Conclusion.
Don't select features prior to any algorithm unless you're perfectly sure what you're doing and what the effects will be. Do not mix up different selection and learning algorithms, because different algorithms have different opinions about what is important and what is not. A feature that is unimportant for one algorithm may be important for another. This is especially true for linear vs. nonlinear algorithms.
If you want to improve the accuracy of your algorithm, do data cleaning or feature engineering instead.
I'm new to data science, and I have a question about train_test_split.
I have an example that tries to predict iced tea sales from temperature.
My question is that when I use train_test_split, my MSE, score and predicted sales value are different every time (since train_test_split selects a different split every time).
Is this normal? If a user enters the same value of 30 degrees every time, will they get a different predicted sales value?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#1. predict value
temperature = np.reshape(np.array([30]), (1, 1))
#2. data
X = np.array([29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]) #temperatures
y = np.array([77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]) #iced_tea_sales
X = np.reshape(X, (len(X), 1))
y = np.reshape(y, (len(y), 1))
#3. split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#4. train
lm = LinearRegression()
lm.fit(X_train, y_train)
#5. mse score
y_pred = lm.predict(X_test)
mse = np.mean((y_pred - y_test) ** 2)
r_squared = lm.score(X_test, y_test)
print(f'mse: {mse}')
print(f'score(r_squared): {r_squared}')
#6. predict
sales = lm.predict(temperature)
print(sales) #output, user get their prediction
The values will never be exactly the same: each run of train_test_split gives the model different training data, so the learned weights vary slightly and the predictions differ. They should still be close enough (if you don't have outliers), since the samples all come from a common distribution.
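As a side note (my addition, not part of the answer above), if you want reproducible runs, train_test_split accepts a random_state argument that fixes the split, so repeated runs of the script give the same MSE, score and prediction:

# Fixing random_state makes the split, and therefore the fitted model, identical on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)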
Below is my trial code:
from sklearn import linear_model
# plt.title("Time-independent variant student performance analysis")
x_train = [5, 9, 33, 25, 4]
y_train = [35, 2, 14 ,9, 7]
x_test = [14, 2, 8, 1, 11]
# create linear regression object
linear = linear_model.LinearRegression()
#train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
# predict output
predicted = linear.predict(x_test)
When run, this is the output:
ValueError: Found arrays with inconsistent numbers of samples: [1 5]
Redefine
x_train = [[5],[9],[33],[25],[4]]
y_train = [35,2,14,9,7]
x_test = [[14],[2],[8],[1],[11]]
From the docs of fit(X, y): X : numpy array or sparse matrix of shape [n_samples, n_features].
In your case, every example has only one feature.
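Equivalently (my addition, not part of the original answer), the flat lists can be converted to the expected 2-D shape with NumPy:

import numpy as np

# reshape(-1, 1) turns a flat array of n samples into shape (n, 1): one feature per sample.
x_train = np.array([5, 9, 33, 25, 4]).reshape(-1, 1)
x_test = np.array([14, 2, 8, 1, 11]).reshape(-1, 1)
y_train = [35, 2, 14, 9, 7]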