I have a very imbalanced dataset on which I need to build a classification model. The dataset has around 30,000 samples, of which about 1,000 are labelled 1 and the rest are 0. I build the model with the following lines:
from sklearn.ensemble import GradientBoostingClassifier

X_train = training_set
y_train = target_value
my_classifier = GradientBoostingClassifier(loss='deviance', learning_rate=0.005)
my_model = my_classifier.fit(X_train, y_train)
Since the data are imbalanced, it is not correct to build the model simply as above, so I have tried to compute class weights as follows:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
Now I have no idea how to use class_weights (which comes out to roughly 0.5 and 9.1) to train and build the model with GradientBoostingClassifier.
Any idea how I can handle this imbalanced data with class weights or other techniques?
You should be using sample weights instead of class weights. In other words, GradientBoostingClassifier lets you assign weights to each observation and not to classes. This is how you can do it, supposing y = 0 corresponds to the weight 0.5 and y = 1 to the weight 9.1:
import numpy as np
sample_weights = np.zeros(len(y_train))
sample_weights[y_train == 0] = 0.5
sample_weights[y_train == 1] = 9.1
Then pass these weights to the fit method:
my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)
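Alternatively, scikit-learn can map the balanced class weights onto per-sample weights for you; a minimal sketch that does the same thing as the manual assignment above:

from sklearn.utils.class_weight import compute_sample_weight

# expands the 'balanced' class weights into one weight per training sample
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)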
I am currently trying to code the AdaBoost algorithm from first principles using SVM classifiers. I am using the moons dataset, and I want to train 5 SVM classifiers sequentially, updating the weights of the incorrectly classified instances each time, in line with the way AdaBoost updates the weights. The issue is that with the correct initialisation and normalisation of the weights, my classifiers all stay exactly the same (neither better nor worse) at each iteration. If I instead initialise the weights to 1 and do not normalise, I get a better result: the sequentially trained classifiers seem to improve, but at the 5th iteration the result gets worse again. If I extend the number of models, the misclassification count converges at 111, which is a lot greater than the initial 49.
I have written the following code:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

m = len(X_train)
sample_weights = np.ones(m) / m  # initialise each weight at 1/m, as the book recommends
learning_rate = 1

models = {}
r_js = []
alphas = []

for i in range(5):
    print("iteration {0}".format(i))
    svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
    svm_clf = svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
    models["SVM_" + str(i)] = svm_clf  # storing the SVM trained models
    y_pred = svm_clf.predict(X_train)
    r_j = sample_weights[y_pred != y_train].sum() / sample_weights.sum()  # weighted error rate
    r_js.append(r_j)
    number = (1 - r_j) / r_j
    alpha_j = learning_rate * np.log10(number)  # predictor weight
    alphas.append(alpha_j)
    sample_weights[y_pred != y_train] *= np.exp(alpha_j)  # boost the misclassified instances
    print(len(sample_weights[y_pred != y_train]))  # number of misclassified training instances
    sample_weights /= sample_weights.sum()  # normalising the sample weights by dividing by their sum
Running the code, the weights do get altered as expected, but I get the following misclassification counts for the 5 models:
RESULTS:
iteration 0: 193
iteration 1: 193
iteration 2: 193
iteration 3: 193
iteration 4: 193
I then alter the initialisation, not doing it "properly", by setting every weight to 1 using:
sample_weights = np.ones(m)
and I do not normalise the weights after updating them. When I run this new version I get the following misclassification counts:
RESULTS:
iteration 0: 49
iteration 1: 41
iteration 2: 32
iteration 3: 36
iteration 4: 46
This looks like it is working (until the 5th model), as the classifications are getting better.
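For reference, I have not yet written the step that combines the stored models and their alphas into a single AdaBoost prediction; a rough sketch of the weighted vote I have in mind (using the models, alphas, X_test and y_test defined above) is:

import numpy as np

def adaboost_predict(models, alphas, X):
    # weighted vote over the stored classifiers (classes are 0/1 for make_moons)
    scores = np.zeros((len(X), 2))
    for clf, alpha in zip(models.values(), alphas):
        pred = clf.predict(X)                       # each model votes for a class
        scores[np.arange(len(X)), pred] += alpha    # vote weighted by the model's alpha
    return scores.argmax(axis=1)

y_ens = adaboost_predict(models, alphas, X_test)
print((y_ens != y_test).sum())  # misclassification count of the combined ensemble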
My questions are:
Am I correctly implementing AdaBoost?
Does the svm_clf.fit() method with the sample_weight parameter do what I intend it to do, i.e. weight the training instances with the new weights, or do I have to explicitly do a W.X matrix multiplication to apply the new weights to the training data?
Any help would be greatly appreciated here!
Cheers
I'm using the Cardiovascular Disease dataset from Kaggle.
The model has been trained, and what I want to do is label a single input (a row of 13 values) supplied dynamically.
The dataset has 13 features plus 1 target and about 66k rows.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# prepare dataset for train and test
dfCardio = pd.read_csv("cleanCardio.csv")
y = dfCardio['cardio']
x = dfCardio.drop('cardio', axis=1, inplace=False)

model = KNeighborsClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model.fit(x_train, y_train)

# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The model is trained; what I want to do is predict the label of this single row:
['69','1','151','22','37','0','65','140','90','2','1','0','0','1']
and get back 0 or 1 for the target.
So I wrote this code:
import numpy as np
import pandas as pd
single = np.array(['69','1','151','22','37','0','65','140','90','2','1','0','0','1'])
singledf = pd.DataFrame(single)
final=singledf.transpose()
prediction = model.predict(final)
print(prediction)
but it gives the error: query data dimension must match training data dimension.
How can I fix the labelling for a single row? Why am I not able to predict a single case?
Each instance in your dataset has 13 features and 1 label.
x = dfCardio.drop('cardio',axis = 1, inplace=False)
This line in the code removes what I assume is the label column from the data, leaving only the (13) feature columns.
The feature vector on which you are trying to predict is 14 elements long. You can only predict on feature vectors that are 13 elements long, because that is what the model was trained on.
If you are looking for a quick, working solution, you can use this:
import numpy as np
import pandas as pd
single = np.array([['69','1','151','22','37','0','65','140','90','2','1','0','0']])
prediction = model.predict(single)
print(prediction)
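If the model was fitted on a pandas DataFrame (as in the question), a slight variation that keeps the original column names and numeric types is this sketch; it assumes the x_train from the training code is still in scope:

import pandas as pd

# one record with the same 13 feature columns the model was trained on
row = [69, 1, 151, 22, 37, 0, 65, 140, 90, 2, 1, 0, 0]
single_df = pd.DataFrame([row], columns=x_train.columns)

prediction = model.predict(single_df)
print(prediction)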
I disagree with the others; this is not a problem with including the target.
I had this problem too. The only way I got around it was to input part of x.
So:
x2=x.iloc[0:3]
then give the first row a new value:
x2.iloc[0]=single
ypred=model.predict(x2)
and just look at ypred[0].
Or try a dataframe with 2 values
I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import pandas as pd
import statsmodels.api as sm

# Step 1) Load data into a dataframe
df = pd.read_csv('my_data.csv')

# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df['dependent_variable']

# Step 3) Using OLS, fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X)  # make predictions
predictions
I'm not sure what predictions is showing. Is it predicting the next x rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model to your data, which is most likely interpreted as an array. The predict method returns an array of fitted values given the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
This is similar to scikit-learn. After model = sm.OLS(y, X).fit() you have a fitted model; predictions = model.predict(X) does not predict the next x rows, it predicts from your X, the training dataset. The model fitted by ordinary least squares is a function of x, and the output is:
$$ \hat{y}=f(x) $$
If you want predictions for new X, you need to split X into training and testing datasets.
Actually, you are doing it wrong. The predict method is used to predict new values.
After separating the dependent and independent variables, you can split the data into two parts, train and test:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
This makes X_train 80% of the total data, containing only the independent variables.
You can then call model.predict(X_test) and compare the result against Y_test to check how well the model is performing.
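A minimal sketch of that workflow with statsmodels, assuming X and y are the pandas objects loaded in the question (sm.add_constant adds the intercept, which sm.OLS does not include by default):

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = sm.OLS(y_train, sm.add_constant(X_train)).fit()  # fit on the training split only
preds = model.predict(sm.add_constant(X_test))           # predictions for unseen rows
print(preds[:5])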
I have the following code for a gradient boosting classifier to be used for a binary classification problem.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Creating training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Count of goods in the training set (this count is 50000)
y0 = len(y_train[y_train['bad_flag'] == 0])

# Count of bads in the training set (this count is 100)
y1 = len(y_train[y_train['bad_flag'] == 1])

# Creating the sample_weights array. Include all bad customers and
# twice the number of goods as bads
w0 = (y1 / y0) * 2
w1 = 1

sample_weights = np.zeros(len(y_train))
sample_weights[y_train['bad_flag'] == 0] = w0
sample_weights[y_train['bad_flag'] == 1] = w1

model = GradientBoostingClassifier(n_estimators=100, max_features=0.5, random_state=1)
model = model.fit(X_train, y_train.values.ravel(), sample_weight=sample_weights)
My thinking in writing this code is as follows:
sample_weights will allow model.fit to select all 100 bads and 200 goods from the training set, and this same set of 300 customers will be used to fit 100 estimators in a forward stage-wise fashion. I want to undersample my training set because the two response classes are highly imbalanced. Please let me know whether my understanding of the code is correct.
Also, I would like to confirm that n_estimators=100 means that 100 estimators will be fit on the same set of 300 customers. That would also mean that there is no bootstrapping in the gradient boosting classifier, as there is in a bagging classifier.
As far as I understand, this is not how it works. By default you have GradientBoostingClassifier(subsample=1.0), which means that the sample size used at each stage (for each of the n_estimators) is the same as your original dataset. The weights do not change the size of the subsample. If you want to enforce 300 observations for each stage, you need to set subsample = 300/(50000+100) in addition to the weight definition.
As for the second question, the answer is no. For each stage, a new fraction subsample of the observations is drawn. You can read more about it here: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting. It says:
At each iteration the base classifier is trained on a fraction subsample of the available training data.
So, as a result, there is some random subsampling combined with the boosting algorithm.
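A sketch of what that would look like, reusing the names from the question (the exact subsample fraction is an illustration based on the counts quoted there, not a recommendation):

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    max_features=0.5,
    subsample=300 / (50000 + 100),  # roughly 300 rows drawn per boosting stage
    random_state=1,
)
model = model.fit(X_train, y_train.values.ravel(), sample_weight=sample_weights)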
I'm working on a classification problem using Python and scikit-learn; it's a medical diagnostics dataset with 6 features and 2 targets. I tried it with one target and trained a model using the KNN algorithm; prediction accuracy is 100% with this model.
Now I want to extend this to the second target, i.e. predict both y values from the same feature set (6 columns).
Following is my code, where I'm able to accurately predict the outcome of target 1 ('Outcome1-Urinary-bladder'). How can I extend it to predict the outcome of the second target ('Outcome2-Nephritis-of-renal')?
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X = Feature_set
y = Target1['Outcome1-Urinary-bladder'].values

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_predictor = knn.predict(X)
print(metrics.accuracy_score(y, y_predictor))
What modifications should be made to the code to predict the outcomes of both target values ('Outcome1-Urinary-bladder' and 'Outcome2-Nephritis-of-renal')?
Please help me out. Thanks in advance.
In general, you just wrap your classifier in the one-vs-rest classifier wrapper:
http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier
and feed it the matrix y, which has both target columns at the same time.
Example of usage:
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

selClassifiers = {
    'linear': LinearSVC(),
    'linearWithSGD': SGDClassifier(),
    'rbf': SVC(kernel='rbf', probability=True),
    'poly': SVC(kernel='poly', probability=True),
    'sigmoid': SVC(kernel='sigmoid', probability=True),
    'bayes': MultinomialNB()
}

# classif selects one of the classifiers above; lb is assumed to be a previously fitted label binarizer
classifier = Pipeline([('vectorizer', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                       ('clf', OneVsRestClassifier(selClassifiers[classif]))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
As pointed out by @yangjie, for your specific classifier there is no need to wrap it; it already supports multi-output classification.
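For example, KNeighborsClassifier accepts a two-column y directly; a minimal sketch, assuming both outcome columns live in the same DataFrame as in the question:

from sklearn.neighbors import KNeighborsClassifier

# y becomes an (n_samples, 2) matrix: one column per target
Y = Target1[['Outcome1-Urinary-bladder', 'Outcome2-Nephritis-of-renal']].values

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, Y)
predictions = knn.predict(X)  # shape (n_samples, 2): one prediction per target per row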