Use of sample_weight in gradient boosting classifier - python

I have the following code for gradient boosting classifier to be used for binary classification problem.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
#Creating training and test dataset
X_train, X_test, y_train, y_test =
train_test_split(X,y,test_size=0.30,random_state=1)
#Count of goods in the training set
#This count is 50000
y0 = len(y_train[y_train['bad_flag'] == 0])
#Count of bads in the training set
#This count is 100
y1 = len(y_train[y_train['bad_flag'] == 1])
#Creating the sample_weights array. Include all bad customers and
#twice the number of goods as bads
w0=(y1/y0)*2
w1=1
sample_weights = np.zeros(len(y_train))
sample_weights[y_train['bad_flag'] == 0] = w0
sample_weights[y_train['bad_flag'] == 1] = w1
model=GradientBoostingClassifier(
n_estimators=100,max_features=0.5,random_state=1)
model=model.fit(X_train, y_train.values.ravel(),sample_weights)
My thinking about writing this code is as follows:-
sample_weights will allow model.fit to select all 100 bads and 200 goods from the training set and this same set of 300 customers will be used to fit 100 estimators in forward stage-wise fashion. I want to undersample my training set because the two response classes are highly imbalanced. Please let me know if my understanding of the code is correct?
Also, I would like to confirm that n_estimators=100 means that 100 estimators will be fit on the same set of 300 customers. This also means that there is no bootstrapping in gradient boosting classifier as seen in bagging classifier.

As far as I understand, this is not how it works. By default, you have GradientBoostingClassifier(subsample = 1.0) which means that the sample size that will be used at each stage (for each of the n_estimators) will be the same as in your original dataset. The weights will not change anything to the size of the subsample. If you want to enforce 300 observations for each stage, you need to set subsample = 300/(50000+100) in addition to the weight definition.
The answer is no. For each stage, a new fraction subsample of observations will be drawn. You can read more about it here: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting. It says:
At each iteration the base classifier is trained on a fraction subsample of the available training data.
So, as a result, there is some bootstraping combined with the boosting algorithm.

Related

Different accuracy for cross_val_score and train_test_split

I am testing RandomForestClassifier on simple dataset from sklearn. When I split the data with train_test_split, I get accuracy=0.89. If I use cross-validation with cross_val_score with same parameters of classifier, accuracy is smaller - about 0.83. Why?
Here is the code:
from sklearn.model_selection import cross_val_score, StratifiedKFold,GridSearchCV,train_test_split
from sklearn.metrics import accuracy_score,f1_score,make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles
np.random.seed(42)
#create dataset:
x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)
#initialize stratified split:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#create classifier:
clf = RandomForestClassifier(random_state=42, max_depth=12,n_jobs=-1,
oob_score=True,n_estimators=100,min_samples_leaf=10)
#average accuracy on cross-validation:
results = np.mean(cross_val_score(clf, x, y, cv=skf,scoring=make_scorer(accuracy_score)))
print("ACCURACY WITH CV = ",results)#prints 0.832
#use train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
clf=RandomForestClassifier(random_state=42, max_depth=12,n_jobs=-1, oob_score=True,n_estimators=100,min_samples_leaf=10)
clf.fit(xtrain,ytrain)
ypred=clf.predict(xtest)
print("ACCURACY WITHOUT CV = ",accuracy_score(ytest,ypred))#prints 0.89
what I got:
ACCURACY WITH CV = 0.83
ACCURACY WITHOUT CV = 0.89
Cross validation is used to run multiple experiments on different splits of data and then average their results. This is to ensure that the result of the experiment is not biased by one split, as it is in your case.
Your chosen seed along with some luck gave you a test train split which has higher accuracy than the average. The higher accuracy is an artifact of random sampling when making a split and not an indicator of better model performance.
Simply put:
Cross Validation makes multiple splits of data. Your model is trained
on all of these different splits and then the performance is
averaged.
If you pick one of these splits, you may get lucky and there might be
good overlap between the data points in your test and train set. Your
model will have high accuracy in this case.
Or you may get unlucky and there might not be a high overlap between
the data points in test and train set. Your model will have a lower
accuracy in this case.
Thus, cross validation is used to average the results of various such splits (5 in your case).
Here is your code run in a google colab notebook:
https://colab.research.google.com/drive/16-NotF-_WVLESmvGMONSGSZigxrT3KLx?usp=sharing
The last cell makes 5 different splits and then averages their accuracies. Notice how that is the same as the one you got from cross validation. Also notice how some splits have higher and some splits have a lower accuracy.
To further convince yourself, look at the output of:
cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
The output is a list of scores (accuracies in your case) for the 5 different splits. You can see that they have varying values around 0.83
This is just up to chance for the split and the random state of the Random Forest Classifier. Try leaving the random_state=42 out and let it fit several times and you'll get a variance of different accuracies. By chance, I had one without CV of "just" 0.78! In contrast, the cv will give you and average (your calculated mean) PLUS an idea about how much your accuracy could vary around that.

Cross validation and logistic regression

I am analyzing a dataset from kaggle and want to apply a logistic regression model to predict something. This is the data: https://www.kaggle.com/code/mohamedadelhosny/stroke-prediction-data-analysis-challenge/data
I split the data into train and test, and want to use cross validation to inssure highest accuracy possible. I did some pre-processing and used the dummy function over catigorical features, got to a certain point in the code, and and I don't know how to proceed. I cant figure out how to use the results of the cross validation, it's not so straight forward.
This is what I got so far:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values # features
Y = data_Enco.iloc[:, 6] # labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
logisticModel = LogisticRegression(class_weight='balanced')
# evaluate model
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print('average score = ', np.mean(scores))
print('std of scores = ', np.std(scores))
average score = 0.7483538453549359
std of scores = 0.0190400919099899
So far so good.. I got the results of the model for each 10 splits. But now what? how do I build a confusion matrix? how do I calculate the recall, precesion..? I have the right code without performing cross validation, I just dont know how to adapt it.. how do I use the scores of the cross_val_score function ?
logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(scaled_X_train, Y_train) # Train the model
predictions_log = logisticModel.predict(scaled_X_test)
## Scoring the model
logisticModel.score(scaled_X_test,Y_test)
## Confusion Matrix
Y_pred = logisticModel.predict(scaled_X_test)
real_data = Y_test
print('Observe the difference between the real data and the data predicted by the knn classifier:\n')
print('Predictions: ',Y_pred,'\n\n')
print('Real Data:m', real_data,'\n')
cmtx = pd.DataFrame(
confusion_matrix(real_data, Y_pred, labels=[0, 1]),
index = ['real 0: ', 'real 1:'], columns = ['pred 0:', 'pred 1:']
)
print(cmtx)
print('Accuracy score is: ',accuracy_score(real_data, Y_pred))
print('Precision score is: ',precision_score(real_data, Y_pred))
print('Recall Score is: ',recall_score(real_data, Y_pred))
print('F1 Score is: ',f1_score(real_data, Y_pred))
The performance of a model on the training dataset is not a good estimator of the performance on new data because of overfitting.
Cross-validation is used to obtain an estimation of the performance of your model on new data, i.e. without overfitting. And you correctly applied it to compute the mean and variance of the accuracy of your model. This should be a much better approximation of the accuracy on your test dataset than the accuracy on your training dataset. And that is it.
However, cross-validation is usually used to do model selection. Say you have two logistic regression models that use different sets of independent variables. E.g., one is using only age and gender while the other one is using age, gender, and bmi. Or you want to compare logistic regression with an SVM model.
I.e. you have several possible models and you want to decide which one is best. Of course, you cannot just compare the training dataset accuracies of all the models because those are spoiled by overfitting. And if you use the performance on the test dataset for choosing the best model, the test dataset becomes part of the training, you will have leakage, and thus the performance on the test dataset cannot be used anymore for a final, untainted performance measure. That is why cross-validation is used which creates those splits that contain different versions of validation sets.
So the idea is to
apply cross-validation to each of your candidate models,
use the scores of those cross-validations to choose the best model,
retrain that best model on the complete training dataset to get a final version of your best model, and
to finally apply this final version to the test dataset to obtain some untainted evaluation.
But note, that those three steps are for model selection. However, you have only a single model, the logistic regression, so there is nothing to select from. If you fit your model, let's call it m(p) where p denotes the parameters, to e.g. five folds of CV, you get five different fitted versions m(p1), m(p2), ..., m(p5) of the same model.
So if you have only one model, you fit it to the complete training dataset, maybe use CV to have an additional estimate for the performance on new data, but that's it. But you have already done this. There is no "selection of best model", that is only for if you have several models as described above, like e.g. logistic regression and SVM.

Pairwise comparisons for model training/testing - how to parameter tune?

For some reasons, I have base dataframes of the following structure
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
Where the top dataframe is my features set and my bottom is the output set. To turn this into a problem that is amenable to data modeling I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2; y_test/y_train) is individual measurements with no inherent meaning on their own, I calculate pairwise distances between the labels to generate a single output value denoting the pairwise distances between observations using (the distances are computed after z-scoring; the z-scoring code isn't described here but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
Similarly I then apply this strategy to the features set to generate pairwise distances between individual observations of each of the instances of each feature.
def feature_distances(input_vector):
modified_vector = np.array(input_vector).reshape(-1,1)
vector_distances = pdist(modified_vector, 'euclidean')
vector_distances = pd.Series(vector_distances)
return vector_distances
x_train = x_train.apply(feature_distances, axis = 0)
x_test = x_test.apply(feature_distances, axis = 0)
I then proceed to train & test all of my models.
For now I am trying linear regression , random forest, xgboost.
Is there any easy way to implement a cross validation scheme in my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to identify an easy way to do cross validation schemes to optimize parameter tuning.
GridsearchCV doesn't quite work here since in each instance of the test/train split, distances have to be recomputed to avoid contamination of test with train.
Hope it's clear!
First, what I understood from the shape of your data frames that you have 42 samples and 1643 features in the input, and each output vector consists of 392 values.
Huge Input: In case, you are sure that your problem has 1643 features, you might need to use PCA to reduce the dimensionality instead of pairwise distance. You should collect more samples instead of 42 samples to avoid overfitting because it is not enough data to train and test your model.
Huge Output: you could use sampled_softmax_loss to speed up the training process as mentioned in TensorFlow documentation . You could also read this here. In case, you do not want to follow this approach, you can continue training with this output but it takes some time.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=n)
here X is independent feature, y is dependent feature means what you actually want to predict - it could be label or continuous value. We used train_test_split on train dataset and we are using (x_train, y_train) to train model and (x_test, y_test) to test model to ensure performance of model on unknown data(x_test, y_test). In your case you have given y as df2 which is wrong just figure out your target feature and give it as y and there is no need to split test data.

Implementing AdaBoost from first principles using SVM classifiers

I am currently trying to code the AdaBoost algorithm from first principles using SVM classifiers. I am using the moons dataset and I want to train 5 SVM classifiers sequentially, updating the weights of the incorrectly classified instances each time in line with the way that Adaboost updates the weights. the issue is that when implementing the correct initialisation and normalisation for the weights, my classifiers all remain the same (not better or worse) for each iteration. If I then change the initialisation to be 1 and don't normalise I get a better result where the sequentially trained classifiers seem to improve however at the 5th iteration it gets worse. If I extend the number of models, the model misclassification rate converges at 111 which is a lot greater than the initial 49.
I have written the following code:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X,y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
m = len(X_train)
sample_weights = np.ones(m)/m #initialise at 1 (in book it does say that each weight should be initialised at 1/m)
learning_rate = 1
models = {}
r_js = []
alphas = []
for i in range(5):
print("iteration {0}".format(i))
svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
svm_clf = svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
models["SVM_"+str(i)] = svm_clf #storing the SVM trained models
y_pred = svm_clf.predict(X_train)
r_j = sample_weights[y_pred != y_train].sum() / sample_weights.sum()
r_js.append(r_j)
number = (1-r_j)/r_j
alpha_j = learning_rate * np.log10(number)
alphas.append(alpha_j)
sample_weights[y_pred != y_train] *= np.exp(alpha_j)
print(len(sample_weights[y_pred != y_train]))
sample_weights /= sample_weights.sum() #normalising the sample weights by dividing by the sum of the weights
Running the code, the weights do get altered as expected but I get the following misclassifications for 5 models:
RESULTS:
iteration 0: 193
iteration 1: 193
iteration 2: 193
iteration 3: 193
iteration 4: 193
I then alter the initialisation and do not do it "properly" by setting the weights to be 1 using:
sample_weights = np.ones(m)
and not normalising the weights after updating them. When I implement this new code I get the following misclassification rate:
RESULTS:
iteration 0: 49
iteration 1: 41
iteration 2: 32
iteration 3: 36
iteration 4: 46
this looks like it is working (until the 5th models) as the classifications are getting better.
My questions are:
Am I correctly implementing AdaBoost?
Does the svm_clf.fit() method with the param sample_weights do what I am intending it to do? i.e. update the training data with new weights OR do I have to explicitly do a W.X matrix multiplication to update the training data with the new weights?
Any help would be greatly appreciated here!
Cheers

How can I explain this drop in performance on test data?

I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, crosss_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get a good performance on training data (CV_score = 0.8) but the performance on the test data is much worse : AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC<0.5 if a naive classifier should score AUC = 0.5 ? Is this due to the fact that I have a small number of samples for training ?
I have noticed that the performance on test data improves if I change the code for :
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC=0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better using cross validation predictions ? Is it because cross validation allows to make predictions on subsets of the test data which do not contain objects/samples which decrease the AUC ?
Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier using part of your testing data and then predict on the rest. So the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related with your question: since the number of your training samples is much smaller than the feature dimensions, it may be helpful to perform dimension reduction before feeding to classifier.
It looks like your training and test process are inconsistent. Although from your code you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the Logistic Regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')) but during test instead of using the pipeline as expected to predict, you use only LogReg, hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data, while ignoring the train data. Here, you do data standardization since you use clf and thus your score is high; this is evidence that the standardization step is important.
To summarize, standardizing the test data, I believe will improve your test score.
Firstly it makes no sense to have 258 features for 60 training items. Secondly CV=10 for 60 items means you split the data into 10 train/test sets. Each of these has 6 items only in the test set. So whatever results you obtain will be useless. You need more training data and less features.

Categories