Implementing AdaBoost from first principles using SVM classifiers - python

I am trying to code the AdaBoost algorithm from first principles using SVM classifiers. I am using the moons dataset and want to train 5 SVM classifiers sequentially, updating the weights of the incorrectly classified instances each time, in line with how AdaBoost updates the weights. The issue is that when I implement the correct initialisation and normalisation of the weights, my classifiers all remain exactly the same (no better or worse) at each iteration. If I instead initialise the weights to 1 and don't normalise, I get a better result where the sequentially trained classifiers seem to improve, but at the 5th iteration it gets worse. If I extend the number of models, the number of misclassifications converges at 111, which is a lot greater than the initial 49.
I have written the following code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.svm import SVC
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
m = len(X_train)
sample_weights = np.ones(m)/m  # initialise at 1/m (the book says each weight should start at 1/m)
learning_rate = 1
models = {}
r_js = []
alphas = []
for i in range(5):
    print("iteration {0}".format(i))
    svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
    svm_clf = svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
    models["SVM_"+str(i)] = svm_clf  # storing the trained SVM models
    y_pred = svm_clf.predict(X_train)
    r_j = sample_weights[y_pred != y_train].sum() / sample_weights.sum()
    r_js.append(r_j)
    number = (1-r_j)/r_j
    alpha_j = learning_rate * np.log10(number)
    alphas.append(alpha_j)
    sample_weights[y_pred != y_train] *= np.exp(alpha_j)
    print(len(sample_weights[y_pred != y_train]))
    sample_weights /= sample_weights.sum()  # normalising the sample weights by dividing by the sum of the weights
Running the code, the weights do get altered as expected but I get the following misclassifications for 5 models:
RESULTS:
iteration 0: 193
iteration 1: 193
iteration 2: 193
iteration 3: 193
iteration 4: 193
I then alter the initialisation and do not do it "properly" by setting the weights to be 1 using:
sample_weights = np.ones(m)
and not normalising the weights after updating them. When I implement this new code I get the following misclassification counts:
RESULTS:
iteration 0: 49
iteration 1: 41
iteration 2: 32
iteration 3: 36
iteration 4: 46
This looks like it is working (until the 5th model), as the classifications are getting better.
My questions are:
Am I correctly implementing AdaBoost?
Does the svm_clf.fit() method with the sample_weight parameter do what I intend, i.e. weight the training instances with the new weights, OR do I have to explicitly do a W.X matrix multiplication to apply the new weights to the training data?
Any help would be greatly appreciated here!
Cheers
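For reference, here is a minimal sketch of the textbook-style update loop described above (same moons data and SVC setup as in my code; the deliberate differences are that alpha uses the natural log, which is how the formula in the book is stated, and the weights are renormalised every round):
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
m = len(X_train)
sample_weights = np.ones(m) / m  # start uniform at 1/m
models, alphas = [], []
for i in range(5):
    clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
    clf.fit(X_train, y_train, sample_weight=sample_weights)
    models.append(clf)
    y_pred = clf.predict(X_train)
    wrong = y_pred != y_train
    r_j = sample_weights[wrong].sum() / sample_weights.sum()  # weighted error rate
    alpha_j = np.log((1 - r_j) / r_j)  # natural log, not log10
    alphas.append(alpha_j)
    sample_weights[wrong] *= np.exp(alpha_j)  # boost only the misclassified instances
    sample_weights /= sample_weights.sum()    # renormalise each round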

Related

train_test_split random_state not working; produces different output everytime

So, I've been using KNN on a set of data, with random_state = 4 during the train_test_split phase. Despite using the random state, the output of accuracy, classification report, prediction, etc. is different each time. I was wondering why that is?
Here's the head of the data: (predicting the position based on all_time_runs and order)
order position all_time_runs
0 10 NO BAT 1304
1 2 CAN BAT 7396
2 3 NO BAT 6938
3 6 CAN BAT 4903
4 6 CAN BAT 3761
And here's the code for the classification and prediction:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
#splitting data into features and target
X = posdf.drop('position',axis=1)
y = posdf['position']
knn = KNeighborsClassifier(n_neighbors = 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
#fitting the KNN model
knn.fit(X_train, y_train)
#predicting with the model
prediction = knn.predict(X_test)
#knn score
score = knn.score(X_test, y_test)
Although train_test_split has a random factor associated with it, and that has to be fixed to avoid getting random results, it's not the only source of randomness you should deal with.
KNN is a model that takes each row of the test set, finds the nearest k training set vectors and classifies it by majority vote, and in case of ties the decision can be random. You need to fix a random seed (e.g. np.random.seed(x)) to ensure the method is replicable.
Documentation states:
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
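As a minimal sketch of pinning down the obvious sources of randomness (assuming posdf is the dataframe from the question), you can seed NumPy globally and keep random_state fixed in the split:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(4)  # global NumPy seed, in case anything else draws random numbers
X = posdf.drop('position', axis=1)
y = posdf['position']
# a fixed random_state makes the split itself deterministic
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # should now be identical on every run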

Handling unbalanced data in GradientBoostingClassifier using weighted class?

I have a very unbalanced dataset on which I need to build a model for a classification problem. The dataset has around 30000 samples, of which around 1000 samples are labelled as 1 and the rest as 0. I build the model with the following lines:
X_train=training_set
y_train=target_value
my_classifier=GradientBoostingClassifier(loss='deviance',learning_rate=0.005)
my_model = my_classifier.fit(X_train, y_train)
Since this is unbalanced data, it is not correct to build the model simply like the above code, so I have tried to use class weights as follows:
class_weights = compute_class_weight('balanced',np.unique(y_train), y_train)
Now, I have no idea how I can use class_weights (which basically includes 0.5 and 9.10 values) to train and build the model using GradientBoostingClassifier.
Any idea how I can handle this unbalanced data with class weights or other techniques?
You should be using sample weights instead of class weights. In other words, GradientBoostingClassifier lets you assign weights to each observation and not to classes. This is how you can do it, supposing y = 0 corresponds to the weight 0.5 and y = 1 to the weight 9.1:
import numpy as np
sample_weights = np.zeros(len(y_train))
sample_weights[y_train == 0] = 0.5
sample_weights[y_train == 1] = 9.1
Then pass these weights to the fit method:
my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)
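An equivalent sketch using sklearn's helper, which expands the per-class 'balanced' weights into one weight per training sample (so you don't have to hard-code 0.5 and 9.1):
from sklearn.utils.class_weight import compute_sample_weight
# 'balanced' reproduces the per-class weights from compute_class_weight,
# expanded to one weight per training sample
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)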

Use of sample_weight in gradient boosting classifier

I have the following code for gradient boosting classifier to be used for binary classification problem.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
#Creating training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
#Count of goods in the training set
#This count is 50000
y0 = len(y_train[y_train['bad_flag'] == 0])
#Count of bads in the training set
#This count is 100
y1 = len(y_train[y_train['bad_flag'] == 1])
#Creating the sample_weights array. Include all bad customers and
#twice the number of goods as bads
w0=(y1/y0)*2
w1=1
sample_weights = np.zeros(len(y_train))
sample_weights[y_train['bad_flag'] == 0] = w0
sample_weights[y_train['bad_flag'] == 1] = w1
model=GradientBoostingClassifier(
n_estimators=100,max_features=0.5,random_state=1)
model=model.fit(X_train, y_train.values.ravel(),sample_weights)
My thinking behind writing this code is as follows:
sample_weights will allow model.fit to select all 100 bads and 200 goods from the training set and this same set of 300 customers will be used to fit 100 estimators in forward stage-wise fashion. I want to undersample my training set because the two response classes are highly imbalanced. Please let me know if my understanding of the code is correct?
Also, I would like to confirm that n_estimators=100 means that 100 estimators will be fit on the same set of 300 customers. This also means that there is no bootstrapping in gradient boosting classifier as seen in bagging classifier.
As far as I understand, this is not how it works. By default, you have GradientBoostingClassifier(subsample=1.0), which means that the sample size used at each stage (for each of the n_estimators) will be the same as your original dataset. The weights will not change the size of the subsample. If you want to enforce 300 observations for each stage, you need to set subsample = 300/(50000+100) in addition to the weight definition.
The answer is no. For each stage, a new fraction subsample of observations will be drawn. You can read more about it here: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting. It says:
At each iteration the base classifier is trained on a fraction subsample of the available training data.
So, as a result, there is some bootstrapping combined with the boosting algorithm.
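A minimal sketch of that suggestion, using the counts from the question (50000 goods, 100 bads), so each stage draws roughly 300 rows:
from sklearn.ensemble import GradientBoostingClassifier
# subsample is the fraction of the training set drawn (without replacement) at each boosting stage
model = GradientBoostingClassifier(n_estimators=100, max_features=0.5,
                                   subsample=300/(50000+100), random_state=1)
model = model.fit(X_train, y_train.values.ravel(), sample_weight=sample_weights)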

About scikit-learn Perceptron Learning Rate

I'm studying machine learning with the book 'Python Machine Learning' by Sebastian Raschka.
My question is about the learning rate eta0 in the scikit-learn Perceptron class.
The following code is implemented for Iris data classifier using Perceptron in that book.
(...omitted...)
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
ml = Perceptron(eta0=0.1, n_iter=40, random_state=0)
ml.fit(X_train_std, y_train)
y_pred = ml.predict(X_test_std)
print('total test:%d, errors:%d' %(len(y_test), (y_test != y_pred).sum()))
print('accuracy: %.2f' %accuracy_score(y_test, y_pred))
My question is the following.
The result (total test, errors, accuracy) does not change for various eta0 values.
The same result of "total test=45, errors=4, accuracy=0.91" comes out with both eta0=0.1 and eta0=100.
What is wrong?
I will try to briefly explain the role of the learning rate in the Perceptron so you understand why there is no difference in the final error count and the accuracy score.
The Perceptron algorithm always finds a solution, provided we have defined a finite number of epochs (i.e. iterations or steps), no matter how big eta0 is, because this constant simply multiplies the output weights during fitting.
The learning rate in other implementations (like neural nets and basically everything else*) is a value multiplied on the partial derivatives of a given function during the process of reaching an optimal minimum. While higher learning rates give higher chances of overshooting the optimum, lower learning rates take more time to converge (to reach the optimal point). The theory is complex, though; there is a really good chapter describing the learning rate which you should read:
http://neuralnetworksanddeeplearning.com/chap3.html
Okay, now I will also show you that the learning rate in the Perceptron is only used to rescale weights. Let us consider X as our train data and y as our train labels. Let us try to fit the Perceptron with two different eta0, say, 1.0 and 100.0:
X = [[1,2,3], [4,5,6], [1,2,3]]
y = [1, 0, 1]
clf = Perceptron(eta0=1.0, n_iter=5)
clf.fit(X, y)
clf.coef_ # returns weights assigned to the input features
array([[-5., -1., 3.]])
clf = Perceptron(eta0=100.0, n_iter=5)
clf.fit(X, y)
clf.coef_
array([[-500., -100., 300.]])
As you can see, the learning rate in the Perceptron only rescales the weights (leaving signs unchanged) of the model while leaving accuracy score and the error term constant.
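As a quick sanity check (a sketch of mine, using max_iter, the parameter name in current scikit-learn versions, instead of the deprecated n_iter), you can also verify that the predictions themselves are identical, since only the sign of the weighted sum matters:
clf_small = Perceptron(eta0=1.0, max_iter=5, random_state=0).fit(X, y)
clf_large = Perceptron(eta0=100.0, max_iter=5, random_state=0).fit(X, y)
# same labels, even though the coefficients differ by a factor of 100
print((clf_small.predict(X) == clf_large.predict(X)).all())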
Hope that suffices. E.

Logistic regression and cross-validation in Python (with sklearn)

I am trying to solve a classification problem on a given dataset, through logistic regression (and this is not the problem). To avoid overfitting I'm trying to implement it through cross-validation (and here's the problem): there's something that I'm missing to complete the program. My purpose here is to determine accuracy.
But let me be specific. This is what I've done:
I split the set into train set and test set
I defined the logregression prediction model to be used
I used the cross_val_predict method (in sklearn.cross_validation) to make predictions
Lastly, I measured accuracy
Here is the code:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# define method
logreg=LogisticRegression()
# cross valitadion prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))
My problems:
From what I understand, the test set should not be considered until the very end, and cross-validation should be performed on the training set. That's why I passed X_train and t_train to the cross_val_predict method. However, I get an error saying:
ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]
where 6016 is the number of samples in the whole dataset, and 4812 is the number of samples in the training set after the dataset has been split
After this, I don't know what to do. I mean: when do the X_test and t_test come into play? I don't get how I should use them after cross-validating and how to get the final accuracy.
Bonus question: I'd also like to perform scaling and reduction of dimensionality (through feature selection or PCA) within each step of the cross-validation. How can I do this? I've seen that defining a pipeline can help with scaling, but I don't know how to apply this to the second problem.
I'd really appreciate any help :-)
Here is working code tested on a sample dataframe. The first issue in your code is that the target array is not an np.array. You also shouldn't have target data in your features. Below I illustrate how to manually split the training and testing data using train_test_split. I also show how to use the wrapper cross_val_score to automatically split, fit, and score.
import random
import string
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection
random.seed(42)
# Create example df with alphabetic col names.
alphabet_cols = list(string.ascii_uppercase)[:26]
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                  columns=alphabet_cols)
df['Target'] = df['A']
df.drop(['A'], axis=1, inplace=True)
print(df.head())
y = df.Target.values  # df['Target'] is not an np.array.
feature_cols = [i for i in list(df.columns) if i != 'Target']
X = df.loc[:, feature_cols].values
# Illustrated here for manual splitting of training and testing data.
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
# Initialize model.
logreg = linear_model.LinearRegression()
# Use cross_val_score to automatically split, fit, and score.
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
print(scores)
print('average score: {}'.format(scores.mean()))
Output
B C D E F G H I J K ... Target
0 20 33 451 0 420 657 954 156 200 935 ... 253
1 427 533 801 183 894 822 303 623 455 668 ... 421
2 148 681 339 450 376 482 834 90 82 684 ... 903
3 289 612 472 105 515 845 752 389 532 306 ... 639
4 556 103 132 823 149 974 161 632 153 782 ... 347
[5 rows x 26 columns]
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399 0.0328
-0.0409]
average score: -0.04258093018969249
Helpful references:
Convert from pandas to numpy
Select all but subset of columns of dataframe
sklearn.model_selection.train_test_split
sklearn.model_selection.cross_val_score
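For the actual classification target in the question, the same cross_val_score pattern with a classifier (a sketch reusing the asker's X_train and t_train) reports classification accuracy by default:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression()
# for classifiers, cross_val_score uses accuracy as the default metric
scores = cross_val_score(logreg, X_train, t_train, cv=10)
print(scores.mean())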
Please look at the documentation of cross-validation at scikit-learn to understand it more.
Also, you are using cross_val_predict incorrectly. What it will do is internally use the cv you supplied (cv=10) to split the supplied data (i.e. X_train, t_train in your case) again into train and test folds, fit the estimator on the train folds and predict on the data that remains in the test fold.
Now, for the usage of your X_test and t_test: you should first fit your estimator on the train data (cross_val_predict will not fit it) and then use it to predict on the test data and then calculate accuracy.
Here is a simple code snippet describing the above (borrowing from your code; do read the comments and ask if you don't understand anything):
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# Until here everything is good
# You keep away 20% of data for testing (test_size=0.2)
# This test data should be unseen by any of the below methods
# define method
logreg=LogisticRegression()
# Ideally what you are doing here should be correct, until you did anything wrong in dataframe operations (which apparently has been solved)
#cross valitadion prediction
#This cross validation prediction will print the predicted values of 't_train'
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
# internal working of cross_val_predict:
#1. Get the data and estimator (logreg, X_train, t_train)
#2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesnt know that its our training data) - Doubts??
#3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
#4. Use X_cv_train, t_cv_train for fitting 'logreg'
#5. Predict on X_cv_test (No use of t_cv_test)
#6. Repeat steps 3 to 5 repeatedly for cv=10 iterations, each time using different data for training and different data for testing.
# So here you are correctly comparing 'predicted' and 't_train'
print(metrics.accuracy_score(t_train, predicted))
# The above metrics will show you how our estimator 'logreg' works on 'X_train' data. If the accuracies are very high it may be because of overfitting.
# Now what to do about the X_test and t_test above.
# Actually the final metrics should be computed on X_test and t_test
# If you are satisfied by the accuracies on the training data then you should fit the entire training data to the estimator and then predict on X_test
logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)
# Here is the final accuracy
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.
If you have less data or don't want to split the data into training and testing, then you should use the approach suggested by fuzzyhedge above:
# Use cross_val_score on your all data
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
# 'cross_val_score' works in almost the same way as steps 1 to 4 above
#5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
#6. Repeat steps 1 to 5 for cv_iterations = 10
#7. Return array of accuracies calculated in step 5.
# Find out average of returned accuracies to see the model performance
scores = scores.mean()
Note: cross-validation is best used together with grid search to find the parameters of the estimator which perform best for the given data.
For example, LogisticRegression has many parameters. Using
logreg = LogisticRegression()
will initialize the model with only the default parameters, while a different choice of parameters, e.g.
logreg = LogisticRegression(penalty='l1', solver='liblinear')
may perform better for your data. This search for better parameters is grid search.
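A minimal grid-search sketch (the parameter grid here is only an example, not a recommendation):
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
# liblinear supports both the l1 and l2 penalties
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, t_train)
print(grid.best_params_, grid.best_score_)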
Now, as for the second part of your question about scaling, dimensionality reduction etc. using a pipeline: you can refer to the documentation of Pipeline and the following examples:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#sphx-glr-auto-examples-feature-stacker-py
http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py
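And here is a minimal pipeline sketch for the scaling/dimensionality-reduction part, so that both steps are re-fit inside each cross-validation fold (the number of PCA components is just a placeholder):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ('scale', StandardScaler()),    # fitted on each fold's training part only
    ('pca', PCA(n_components=10)),  # placeholder component count
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X_train, t_train, cv=10)
print(scores.mean())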
Feel free to contact me if need any help.
