How can I do classification or regression in sklearn if I want to weight each sample differently? Is there a way to do it with a custom loss function? If so, what does that loss function look like in general? Is there an easier way?
To weigh individual samples, feed a sample_weight array to the estimator's fit method. This should be a 1-d array of length n_samples (i.e. the same dimension as y in most tasks):
estimator.fit(X, y, sample_weight=some_array)
Not all models support this; check the documentation.
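For example, here is a minimal sketch assuming a classification task with made-up toy data; any estimator whose fit() accepts sample_weight (DecisionTreeClassifier is used here) works the same way:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])

# Give the last two samples twice the influence of the first two.
weights = np.array([1.0, 1.0, 2.0, 2.0])

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y, sample_weight=weights)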
I am using XGBClassifier to model an unbalanced multiclass target. I have a few questions:
First, I would like to know where I should use the weight parameter: at instantiation of the classifier, or in the fit step of the pipeline?
Second, how do I calculate the weights? I assume that the sum of the array should be 1.
Third: is there any order of the weight array that maps to the different label classes?
Thank you all in advance
For your first question:
where should I use the weight parameter
Use sample_weight in XGBClassifier.fit()
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X, y, sample_weight=sample_weight)
When using pipeline:
pipe = Pipeline([
    ('my_xgb_clf', xgb.XGBClassifier()),
])
pipe.fit(X, y, my_xgb_clf__sample_weight=sample_weight)
Btw, some APIs in sklearn do not support the sample_weight kwarg, e.g., learning_curve.
So I simply do this:
import functools
xgb_clf.fit = functools.partial(xgb_clf.fit, sample_weight=sample_weight)
Note: You would need to patch fit() again after a grid search, because GridSearchCV.best_estimator_ will not be the original estimator.
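A minimal sketch of that caveat, with a made-up param_grid: GridSearchCV works on clones of the estimator, so neither the fitted clones nor best_estimator_ carry the patched fit(); re-apply the partial afterwards.
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(xgb_clf, param_grid={'max_depth': [3, 5]})
search.fit(X, y)

best_clf = search.best_estimator_
best_clf.fit = functools.partial(best_clf.fit, sample_weight=sample_weight)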
For the second question:
how do I calculate the weights? I assume that the sum of the array should be 1.
from sklearn.utils import compute_sample_weight
sample_weight = compute_sample_weight('balanced', y_train)
This simulates class_weight='balanced' in sklearn.
Note:
The sum of the array is not 1. You can normalize it, but I think the score result would be different.
This is not equivalent to class_weight='balanced_subsample'; I cannot find a way to simulate that.
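For reference, here is a sketch of what 'balanced' computes under the hood, assuming y_train holds the class labels: each sample gets n_samples / (n_classes * count_of_its_class), so rarer classes receive larger weights.
import numpy as np

classes, counts = np.unique(y_train, return_counts=True)
class_weight = {c: len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}
manual_weight = np.array([class_weight[label] for label in y_train])
# manual_weight should match compute_sample_weight('balanced', y_train)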
For the third question:
Is there any order...
Sorry, I don't understand what you mean...
Maybe you want the order in xgb_clf.classes_?
You can access this after calling xgb_clf.fit.
Or just use np.unique(y_train).
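A small sketch, assuming X_train, y_train and X_test already exist: the columns of predict_proba follow xgb_clf.classes_, which is the sorted set of unique labels.
import numpy as np

xgb_clf.fit(X_train, y_train, sample_weight=sample_weight)
print(xgb_clf.classes_)        # e.g. [0 1 2] for a 3-class target
print(np.unique(y_train))      # same values, same order
proba = xgb_clf.predict_proba(X_test)  # shape (n_samples, n_classes), columns in that order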
I would like to use a custom loss function to train a neural network in scikit-learn, using MLPClassifier. I would like to give more importance to larger values. Therefore, I would like to use something like the mean squared error, but with each term multiplied by y. Thus, it would look like:
(1/n) Σᵢ yᵢ (yᵢ − ŷᵢ)²
Here is the code of my model:
mlp10 = MLPClassifier(hidden_layer_sizes=(150, 100, 50, 25, 10), max_iter=1000,
                      random_state=42)
mlp10.fit(X_train, y_train)
How can I modify the loss function?
I don't believe you can modify the loss function directly: there is no parameter for it in the constructor of the classifier, and the documentation explicitly states that it optimizes using the log-loss function. If you're willing to be a bit flexible, you might be able to get the effect you're looking for simply by transforming the y values before training and then applying the inverse transform to recover the predicted ys after testing.
For instance, mapping y_prime = transform(y) and y = inverse_transform(y_prime) on each value, where you define transform and inverse_transform as:
import math

def transform(y):
    return y ** 2

def inverse_transform(y_prime):
    return math.sqrt(y_prime)
would cause larger values of y to have more influence in the training. Obviously you could experiment with different transforms to see what works best for your use-case. The key is just to make sure that transform is superlinear.
Before training you'd need to do:
y_train = [transform(y) for y in y_train]
And after calling predict:
y_predict = model.predict(x)
y_predict = [inverse_transform(y) for y in y_predict]
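Putting it together as a minimal end-to-end sketch, using numpy instead of per-element loops and assuming a non-negative, regression-style target (the y-squaring trick only makes sense for continuous values, so MLPRegressor stands in here purely for illustration):
import numpy as np
from sklearn.neural_network import MLPRegressor

def transform(y):
    return y ** 2            # assumes y >= 0, otherwise the square is not invertible

def inverse_transform(y_prime):
    return np.sqrt(y_prime)

model = MLPRegressor(hidden_layer_sizes=(150, 100, 50, 25, 10),
                     max_iter=1000, random_state=42)
model.fit(X_train, transform(y_train))                  # train on the transformed target
y_predict = inverse_transform(model.predict(X_test))    # undo the transform on predictions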
I have a pipeline like this:
attribute_est = Pipeline([
('jsdf', DictVectorizer()),
('clf', Ridge())
])
In there I pass data like:
{
    'Master_card': 1,
    'Credit_Cards': 1,
    'casual_ambiance': 0,
    'Classy_People': 0
}
My model does not predict very well. Now it has been proposed to me that:
You may find it difficult to find a single regressor that does well
enough. A common solution is to use a linear model to fit the linear
part of some data, and use a non-linear model to fit the residual that
the linear model can't fit. Build a residual estimator that takes as
an argument two other estimators. It should use the first to fit the
raw data and the second to fit the residuals of the first.
What is meant by a residual estimator? Can you provide me with an example, please?
A residual is the error between the true data values, and those predicted by some estimator. The simplest example is in the case of linear regression, where the residuals are the distance between the best linear fit to some data and the actual data points. Least-squares fitting of a line minimizes the sum of these squared residuals.
The recommendation you were given suggests using two estimators. The first will fit the data itself. In the linear regression case, this is a least-squares linear fit, probably using something like scikit-learn's linear regression model.
The second estimator will then try to fit the residuals, i.e., the difference between the linear fit to the data and the actual data points. In the least-squares case, this is effectively detrending the data and then fitting what is left over. You might pick this to be a Gaussian, in the case where you expect the data actually to be a line with additive Gaussian noise. But if you know something about the underlying noise distribution, then use that as your second estimator.
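Here is a sketch of one way to build such a residual estimator; the class name and parameter names are illustrative, not an existing scikit-learn API. The first estimator fits the raw data, the second fits what the first leaves behind, and predictions are the sum of the two.
from sklearn.base import BaseEstimator, RegressorMixin, clone

class ResidualEstimator(BaseEstimator, RegressorMixin):
    def __init__(self, base, residual):
        self.base = base          # e.g. Ridge(): fits the (linear) trend
        self.residual = residual  # a non-linear regressor: fits what is left over

    def fit(self, X, y):
        self.base_ = clone(self.base).fit(X, y)
        self.residual_ = clone(self.residual).fit(X, y - self.base_.predict(X))
        return self

    def predict(self, X):
        return self.base_.predict(X) + self.residual_.predict(X)
You could then drop ResidualEstimator(Ridge(), some_nonlinear_regressor) in place of Ridge() as the final step of your pipeline, keeping the DictVectorizer step unchanged.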
Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models and then reports the majority answer. It seems like this algorithm could also be used to generate a probability for each classification and then report the mean value.
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
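If what you actually want is a single continuous score per sample (in the binary case), a common choice is the positive-class column; the column order follows trees.classes_:
Y_score = trees.predict_proba(X_test)[:, 1]  # probability of the second class in trees.classes_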
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
I typically use BaggingRegressor() for continuous values, and then compare performance with RMSE. Example below:
import math
from sklearn import metrics
from sklearn.ensemble import BaggingRegressor

trees = BaggingRegressor()
trees.fit(X_train, Y_train)
scores_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test)))
I need to know how to return the logistic regression coefficients in such a manner that I can generate the predicted probabilities myself.
My code looks like this:
lr = LogisticRegression()
lr.fit(training_data, binary_labels)
# Generate probabilities automatically
predicted_probs = lr.predict_proba(training_data)
I had assumed the lr.coef_ values would follow typical logistic regression, so that I could return the predicted probabilities like this:
sigmoid( dot([val1, val2, offset], lr.coef_.T) )
But this is not the appropriate formulation. Does anyone have the proper format for generating predicted probabilities from Scikit Learn LogisticRegression?
Thanks!
Take a look at the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html): the offset coefficient isn't stored in lr.coef_.
coef_ : array, shape = [n_classes-1, n_features]
Coefficient of the features in the decision function. coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.

intercept_ : array, shape = [n_classes-1]
Intercept (a.k.a. bias) added to the decision function. It is available only when the intercept parameter is set to True.
Try:
sigmoid( dot([val1, val2], lr.coef_.T) + lr.intercept_ )
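As a concrete sketch for the binary case, assuming some 1-d feature vector x_row with the same number of features the model was trained on (scipy's expit is a numerically stable sigmoid):
import numpy as np
from scipy.special import expit

manual_prob = expit(np.dot(x_row, lr.coef_.T) + lr.intercept_)
# should match lr.predict_proba(x_row.reshape(1, -1))[:, 1]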
The easiest way is to access the coef_ attribute of the LR classifier:
For the definition of coef_, please check the Scikit-Learn documentation:
See example:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train, y_train)
weight = clf.coef_