I would like to use class_weight to create a weighted SVC classifier in scikit-learn. However, I'm not sure whether I'm configuring my model correctly. Please consider the example below:
import numpy as np
from sklearn.svm import SVC

x = np.array([[0,0,1],[0,1,1],[1,0,0]])
y = np.array([1,1,0])
# Weight each class by its number of samples
cw = {}
for l in set(y):
    cw[l] = np.sum(y == l)
print(cw)
m = SVC(probability = True, max_iter = 1000, class_weight = cw)
m = m.fit(x,y)
I obtained the model:
SVC(C=1.0, cache_size=200, class_weight={0: 1, 1: 2}, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=1000, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Here class_weight={0: 1, 1: 2} corresponds to the number of data points in each class.
QUESTION: Is it correct to proceed in this way?
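As you have a 2:1 ratio of class labels, this weighting matches the class counts. Be aware, though, that it gives the more frequent class the larger weight; to compensate for class imbalance you would normally weight the classes inversely to their frequencies.
If you don't want to calculate the class weights manually, you can pass class_weight='balanced' and let SVC compute them for you, inversely proportional to the class frequencies.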
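To see what class_weight='balanced' would produce for the labels in the question, the weights can be computed directly. This is a minimal sketch that assumes a recent scikit-learn, where compute_class_weight takes classes and y as keyword arguments; the variable names are only illustrative. The documented formula is n_samples / (n_classes * np.bincount(y)), so the rarer class receives the larger weight, which is the reverse of using raw counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1, 1, 0])
classes = np.unique(y)

# 'balanced' uses n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
balanced = dict(zip(classes.tolist(), weights.tolist()))

# Raw counts, as built by hand in the question
counts = {int(c): int(np.sum(y == c)) for c in classes}

print("balanced:", balanced)
print("counts:  ", counts)
```

Comparing the two dictionaries shows which class each scheme emphasizes.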
I'm new to machine learning, and in the books and documentation I have read there is always a score value between 0 and 1, which represents an accuracy between 0% and 100%.
In my own machine learning code in scikit-learn I get score values between -750.880810 and 5154.771036, which confuses me.
>>> pipe = Pipeline([("scaler", MinMaxScaler()), ("svr", SVR())])
>>> param_grid = {'svr__C':[0.1, 1, 5],
'svr__epsilon':[0.001, 0.01]}
>>> grid = GridSearchCV(estimator=pipe,
param_grid=param_grid,
cv=GroupKFold(n_splits=24)
)
>>> grid.fit(X, y, groups)
GridSearchCV(cv=GroupKFold(n_splits=24), error_score=nan,
estimator=Pipeline(memory=None,
steps=[('scaler',
MinMaxScaler(copy=True,
feature_range=(0, 1))),
('svr',
SVR(C=1.0, cache_size=200, coef0=0.0,
degree=3, epsilon=0.1,
gamma='scale', kernel='rbf',
max_iter=-1, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid='deprecated', n_jobs=None,
param_grid={'svr__C': [0.1, 1, 5], 'svr__epsilon': [0.001, 0.01]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
>>> grid.best_score_
-750.880810
Could someone please explain that to me?
Edit:
My input data is a measurement of an engine.
I have 12 different failures of the engine and every failure is measured twice => 12x2 = 24 different groups (I will also try 12 groups). Every group consists of:
X data: 13 different features (temperature, pressure, electric voltage etc.) with 1200 samples per group
y data: 1 feature (pressure) with 1200 samples per group
Accuracy is the usual scoring method for classification problems. For regression problems, it is the R^2 value (coefficient of determination).
For the scoring parameter of GridSearchCV, the documentation says:
If None, the estimator's score method is used.
For SVR, the default score method comes from RegressorMixin, which returns R^2.
Documentation:
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
sum of squares ((y_true - y_true.mean()) ** 2).sum().
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse).
A constant model that always
predicts the expected value of y, disregarding the input features,
would get a R^2 score of 0.0.
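Hence, it is strange to see such large positive or negative values as an R^2 score. In particular, R^2 can never exceed 1.0, although it can be arbitrarily negative.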
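The definition quoted above can be checked on a few numbers. The following is a small sketch with made-up values (the arrays and the helper name r2_manual are illustrative only); it computes 1 - u/v by hand and compares it with sklearn.metrics.r2_score for a perfect prediction, a constant prediction of the mean, and a deliberately bad prediction.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - u/v, with u the residual and v the total sum of squares."""
    u = ((y_true - y_pred) ** 2).sum()
    v = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - u / v

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y_true.copy()                          # best case
constant = np.full_like(y_true, y_true.mean())   # always predicts the mean
bad = np.array([40.0, -30.0, 20.0, -10.0])       # far off the targets

for name, pred in [("perfect", perfect), ("constant", constant), ("bad", bad)]:
    print(name, r2_manual(y_true, pred), r2_score(y_true, pred))
```

The bad prediction yields a strongly negative score, which shows that large negative values are possible while values above 1 are not.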
A toy example, to understand the scoring output.
from sklearn import datasets, svm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
import numpy as np
np.random.seed(0)
X, y = datasets.make_regression()
groups = np.random.randint(0, 10, len(X))
pipe = Pipeline([("scaler", MinMaxScaler()), ("svr", svm.SVR())])
parameters = {'svr__C': [ 0.1, 1, 5, 100], 'svr__epsilon': [0.001, 0.1]}
clf = GridSearchCV(pipe, parameters, cv=GroupKFold(n_splits=2))
# Pass groups by keyword so that GroupKFold receives them
clf.fit(X, y, groups=groups)
print(clf.best_score_)
# 0.1239707770092825
I would recommend trying different cv settings and investigating the issue.
I have used a GridSearch for parameter optimization when predicting values with 10-fold cross validation using sklearn, as shown below,
svr_params = {
'C': [0.1, 1, 10],
'epsilon': [0.01, 0.05, 0.1, 0.5, 1],
}
svr = SVR(kernel='linear', coef0=0.1, shrinking=True, tol=0.001, cache_size=200, verbose=False, max_iter=-1)
best_svr = GridSearchCV(
svr, param_grid=svr_params, cv=10, verbose=0, n_jobs=-1)
predicted = cross_val_predict(best_svr, X, y, cv=10)
I want to print out the best parameters selected by the GridSearch for C and epsilon. I would really appreciate some help. Thanks in advance.
The best parameters are available as the best_params_ attribute of GridSearchCV once it has been fitted.
best_svr = GridSearchCV(svr, param_grid=svr_params, cv=10, verbose=0, n_jobs=-1, refit=True)
best_svr.fit(X, y)
print(best_svr.best_params_)
I want to use GridSearchCV over a range of alphas (Laplace smoothing parameters) to check which gives me the best accuracy with a Bernoulli Naive Bayes model.
def binarize_pixels(data, threshold=0.784):
    # Initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)
    # Apply a threshold to each feature.
    for feature in range(data.shape[1]):
        binarized_data[:,feature] = data[:,feature] > threshold
    return binarized_data
binarized_train_data = binarize_pixels(mini_train_data)
def BNB():
    clf = BernoulliNB()
    clf.fit(binarized_train_data, mini_train_labels)
    scoring = clf.score(mini_train_data, mini_train_labels)
    predsNB = clf.predict(dev_data)
    print("Bernoulli binarized model accuracy: {:.4}".format(np.mean(predsNB == dev_labels)))
The model runs fine, while my GridSearch cross validation does not:
pipeline = Pipeline([('classifier', BNB())])
def P8(alphas):
    gs_clf = GridSearchCV(pipeline, param_grid = alphas, refit=True)
    y_predictions = gs_clf.best_estimator_.predict(dev_data)
    print(classification_report(dev_labels, y_predictions))
alphas = {'alpha' : [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
P8(alphas)
I get AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'
The problem is in the following two lines:
gs_clf = GridSearchCV(pipeline, param_grid = alphas, refit=True)
y_predictions = gs_clf.best_estimator_.predict(dev_data)
Note that before you can use best_estimator_ or predict, you first need to fit the search, that is, call gs_clf.fit on your training data. See the following example from the documentation:
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svr = svm.SVC()
>>> clf = GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
...
GridSearchCV(cv=None, error_score=...,
estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
decision_function_shape=None, degree=..., gamma=...,
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=...,
verbose=False),
fit_params={}, iid=..., n_jobs=1,
param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
scoring=..., verbose=...)
>>> sorted(clf.cv_results_.keys())
...
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
'mean_train_score', 'param_C', 'param_kernel', 'params',...
'rank_test_score', 'split0_test_score',...
'split0_train_score', 'split1_test_score', 'split1_train_score',...
'split2_test_score', 'split2_train_score',...
'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]
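Applied to the Bernoulli Naive Bayes question, a fitted search could look roughly like the sketch below. It rests on several assumptions that are not in the original post: synthetic data stands in for mini_train_data, Binarizer replaces the hand-written binarize_pixels, and alpha=0.0 is left out of the grid to avoid smoothing warnings. Two further points appear to matter besides calling fit: the pipeline step should be an unfitted estimator instance (the BNB() function in the question returns None), and grid keys for a Pipeline take the form step name, double underscore, parameter name.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Synthetic stand-in for the pixel data (hypothetical)
rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 0] > 0.784).astype(int)

pipeline = Pipeline([
    ("binarize", Binarizer(threshold=0.784)),
    ("classifier", BernoulliNB()),  # unfitted estimator instance
])

# Keys are "<step name>__<parameter name>"
param_grid = {"classifier__alpha": [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}

gs_clf = GridSearchCV(pipeline, param_grid=param_grid, refit=True, cv=3)
gs_clf.fit(X, y)  # best_estimator_ only exists after this call

print(gs_clf.best_params_)
y_pred = gs_clf.best_estimator_.predict(X)
```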
For experimental purposes, I train the SVM model as follows,
clf = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
scores = cross_val_score(clf,train_feature,train_label,cv=3)
print(scores)
The printed result looks as follows:
Warning: using -h 0 may be faster
optimization finished, #iter = 2182
obj = -794.208203, rho = 1.303717
nSV = 1401, nBSV = 992
Total nSV = 1401
The cross-validation score is like
[LibSVM][LibSVM][LibSVM][ 0.68838493 0.6887449 0.75864138]
I think nSV represents the number of support vectors. Is that right? Then what do nBSV and rho represent? How can I know whether these cross-validation scores are a good indicator of the model's performance?
I have 9164 points, where 4303 are labeled as the class I want to predict and 4861 are labeled as not that class. There are no duplicate points.
Following How to split into train, test and evaluation sets in sklearn?, and since my dataset is a tuple of 3 items (id, vector, label), I do:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
def predict_labels(test_data, classifier):
    labels = []
    for test_d in test_data:
        labels.append(classifier.predict([test_d]))
    return np.array(labels)
def construct_labels(df):
    labels = []
    for index, row in df.iterrows():
        if row[2] == 'Trump':
            labels.append('Trump')
        else:
            labels.append('Not Trump')
    return np.array(labels)
def construct_data(df):
    first_row = df.iloc[0]
    data = np.array([first_row[1]])
    for index, row in df.iterrows():
        if first_row[0] != row[0]:
            data = np.concatenate((data, np.array([row[1]])), axis=0)
    return data
and then:
>>> classifier = SVC(verbose=True)
>>> classifier.fit(train_data, train_labels)
[LibSVM].......*..*
optimization finished, #iter = 9565
obj = -2718.376533, rho = 0.132062
nSV = 5497, nBSV = 2550
Total nSV = 5497
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
>>> predicted_labels = predict_labels(test_data, classifier)
>>> correct = 0
>>> for p, t in zip(predicted_labels, test_labels):
...     if p == t:
...         correct = correct + 1
and I get only 943 labels correct out of 1833 (= len(test_labels)), i.e. 943*100/1833 = 51.4%.
I suspect I am missing something big here. Maybe I should set a parameter on the classifier to make it do more refined work, or something similar?
Note: This is my first time using SVMs, so anything you might take for granted, I might not even have imagined...
Attempt:
I went ahead and decreased the number of negative examples to 4303 (the same number as positive examples). This slightly improved accuracy.
Edit after the answer:
>>> print(clf.best_estimator_)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> classifier = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
... decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
... max_iter=-1, probability=False, random_state=None, shrinking=True,
... tol=0.001, verbose=False)
>>> classifier.fit(train_data, train_labels)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Also I tried clf.fit(train_data, train_labels), which performed the same.
Edit with data (the data are not random):
>>> train_data[0]
array([ 20.21062112, 27.924016 , 137.13815308, 130.97432804,
... # there are 256 coordinates in total
67.76352596, 56.67798138, 104.89566517, 10.02616417])
>>> train_labels[0]
'Not Trump'
>>> train_labels[1]
'Trump'
Most estimators in scikit-learn, such as SVC, are initialized with a number of input parameters, also known as hyperparameters. Depending on your data, you will have to figure out what to pass as inputs to the estimator during initialization. If you look at the SVC documentation in scikit-learn, you will see that it can be initialized with several different input parameters.
For simplicity, let's consider kernel, which can be 'rbf' or 'linear' (among a few other choices), and C, which is a penalty parameter; suppose you want to try the values 0.01, 0.1, 1, 10, 100 for C. That leads to 10 different possible models to create and evaluate.
One simple solution is to write two nested for-loops, one for kernel and the other for C, create the 10 possible models, and see which one is the best. However, if you have several hyperparameters to tune, you have to write several nested for-loops, which can be tedious.
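For illustration, the nested-loop approach could be sketched as follows. The synthetic data from make_classification is a hypothetical stand-in for train_data and train_labels, and scoring each model by 3-fold cross-validated accuracy is an assumption, since the text does not say how the models are compared.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data standing in for train_data / train_labels (hypothetical)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

results = {}
for kernel in ["rbf", "linear"]:
    for C in [0.01, 0.1, 1, 10, 100]:
        model = SVC(kernel=kernel, C=C)
        # Mean accuracy over 3 cross-validation folds
        results[(kernel, C)] = cross_val_score(model, X, y, cv=3).mean()

# Pick the combination with the highest mean score
best_kernel, best_C = max(results, key=results.get)
print(len(results), best_kernel, best_C)
```

Every additional hyperparameter adds another loop level, which is the tedium the paragraph above refers to.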
Luckily, scikit-learn has a better way to create different models based on different combinations of hyperparameter values and choose the best one. For that, you use GridSearchCV. GridSearchCV is initialized with two things: an instance of an estimator, and a dictionary of hyperparameters and the values to examine. It then creates all possible models given the choices of hyperparameters and finds the best one, so you do not need to write any nested for-loops. Here is an example:
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
print("Fitting the classifier to the training set")
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
clf = clf.fit(train_data, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
You will need to use something similar to this example, and play with different hyperparameters. If you have a good variety of values for your hyperparameters, there is a very good chance you will find a much better model this way.
It is, however, possible for GridSearchCV to take a very long time to create all these models and find the best one. A more practical approach is to use RandomizedSearchCV instead, which evaluates only a random subset of the possible hyperparameter combinations. It should run much faster if you have a lot of hyperparameters, and its best model is usually pretty good.
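A randomized search might be set up along these lines. This is a sketch under stated assumptions: the data is synthetic, the parameter ranges and n_iter=8 are arbitrary illustrative choices, and scipy.stats.loguniform is used to sample C and gamma on a log scale.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for train_data / train_labels (hypothetical)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Distributions are sampled; lists are drawn from uniformly
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["rbf", "linear"],
}

search = RandomizedSearchCV(
    SVC(class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=8,          # number of sampled combinations, not the full grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The cost is controlled by n_iter rather than by the size of the grid, which is why it scales better as more hyperparameters are added.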
After the comments of sascha and the answer of shahins, I did this eventually:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
from sklearn.svm import SVC
# Classifier found with shahins' answer
classifier = SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
classifier = classifier.fit(train_data, train_labels)
test_data = scaler.transform(test_data)  # reuse the scaling fitted on the training data
predicted_labels = predict_labels(test_data, classifier)
and got:
>>> correct_labels = count_correct_labels(predicted_labels, test_labels)
>>> print_stats(correct_labels, len(test_labels))
Correct labels = 1624
Accuracy = 88.5979268958
with these methods:
def count_correct_labels(predicted_labels, test_labels):
    correct = 0
    for p, t in zip(predicted_labels, test_labels):
        if p[0] == t:
            correct = correct + 1
    return correct
def print_stats(correct_labels, len_test_labels):
    print("Correct labels = " + str(correct_labels))
    print("Accuracy = " + str((correct_labels * 100 / float(len_test_labels))))
I was able to improve this further with more hyperparameter tuning!
Helpful link: RBF SVM parameters
Note: If I don't transform the test_data, accuracy is 52.7%.
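Since the classifier only ever sees scaled inputs during training, the test data has to go through the same scaling. One way to keep the two steps in sync is to wrap the scaler and the classifier in a single pipeline, so that the scaling statistics are learned from the training data only and reused at prediction time. The sketch below uses synthetic data and an arbitrary offset and scale as assumptions; only C=10 and class_weight='balanced' are taken from the post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, deliberately unscaled data (hypothetical stand-in)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X = X * 100.0 + 50.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# fit() learns mean/std on the training data; score() and predict() reuse them
model = make_pipeline(StandardScaler(), SVC(C=10, class_weight="balanced"))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

The same pipeline object can also be passed to GridSearchCV, in which case the scaler is refitted inside every cross-validation fold.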