How to add sample_weight into a scikit-learn estimator - python

I have recently developed a scikit-learn estimator (a classifier) and now want to add sample_weight support to it. The reason is so I can apply boosting (i.e. AdaBoost) to the estimator, since AdaBoost requires the estimator's fit method to accept sample_weight.
I had a look at a few different scikit-learn estimators such as linear regression, logistic regression and SVM, but they each seem to handle sample_weight differently, and it's not very clear to me:
Linear regression:
https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/linear_model/_base.py#L375
Logistic regression:
https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/linear_model/_logistic.py#L1459
SVM:
https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/svm/_base.py#L796
So I am confused now and want to know: how do I add sample_weight to my estimator? Is there a standard way of doing this in scikit-learn, or does it just depend on the estimator? Any templates or examples would be really appreciated. Many thanks in advance.
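The convention followed by scikit-learn's own estimators is that fit accepts an optional sample_weight array (defaulting to None, i.e. uniform weights) and feeds it into whatever quantity the fitting procedure sums or averages over samples. Below is a minimal sketch of that pattern; MyClassifier is a hypothetical toy (a weighted nearest-centroid rule), not an existing sklearn class, and only the sample_weight handling is the part to imitate.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class MyClassifier(BaseEstimator, ClassifierMixin):
    def fit(self, X, y, sample_weight=None):
        X, y = check_X_y(X, y)
        if sample_weight is None:
            sample_weight = np.ones(X.shape[0])      # default: all samples weighted equally
        else:
            sample_weight = np.asarray(sample_weight, dtype=float)
        # Use the weights wherever the fitting procedure aggregates over samples;
        # here, a weighted centroid per class.
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([
            np.average(X[y == c], axis=0, weights=sample_weight[y == c])
            for c in self.classes_
        ])
        return self

    def predict(self, X):
        check_is_fitted(self, "centroids_")
        X = check_array(X)
        distances = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[np.argmin(distances, axis=1)]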

Related

Shapley for Logistic regression?

Does SHAP support logistic regression models?
Running the following code, I get:
import shap
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
explainer = shap.TreeExplainer(logmodel)
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.linear_model.logistic.LogisticRegression'>
P.S. You are supposed to use a different explainer for different models.
Shap is model agnostic by definition. It looks like you have just chosen an explainer that doesn't suit your model type. I suggest looking at KernelExplainer, which, as described by the creators here, is
An implementation of Kernel SHAP, a model agnostic method to estimate SHAP values for any model. Because it makes no assumptions about the model type, KernelExplainer is slower than the other model type specific algorithms.
The documentation for Shap is mostly solid and has some decent examples.
explainer = shap.LinearExplainer(logmodel, X_train) should work, as Logistic Regression is a linear model (LinearExplainer also needs background data; X_train from the question is used here).
Logistic Regression is a linear model, so you should use the linear explainer.
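A short sketch of both suggestions, reusing X_train, X_test and the fitted logmodel from the question above; the background sub-sample of 100 rows is an arbitrary choice to keep KernelExplainer reasonably fast.

import shap

background = shap.sample(X_train, 100)                       # small background set for speed
kernel_explainer = shap.KernelExplainer(logmodel.predict_proba, background)
kernel_shap_values = kernel_explainer.shap_values(X_test)    # model-agnostic but slow

linear_explainer = shap.LinearExplainer(logmodel, background)
linear_shap_values = linear_explainer.shap_values(X_test)    # fast, exploits linearity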

Nature and redundancy of classifiers

I am applying a set of linear and non-linear classification models in a classification task. The input data are language vectors (CountVectorizer, Word2Vec) and binary labels. In scikit-learn, I selected the following estimators:
LogisticRegression(),
LinearSVC(),
XGBClassifier(),
SGDClassifier(),
SVC(), # Radial basis function kernel
BernoulliNB(), # Naive Bayes seems widely used for LV models
KNeighborsClassifier(),
RandomForestClassifier(),
MLPClassifier()
Question: Am I correct that LinearSVC() is a linear classifier, at least in the binary case?
Question: In view of experts, is there any significant redundancy among the classifiers?
Thanks for clarification.
LogisticRegression(), LinearSVC(), SGDClassifier() and BernoulliNB() are linear models.
With its default loss function (hinge), SGDClassifier() works as a linear SVM; with log loss it works as a logistic regression, so one of those three is redundant. Also, you could substitute LogisticRegressionCV() for LogisticRegression(): it has built-in optimization of the regularization hyperparameter.
XGBClassifier() and all the others are non-linear.
The list seems to include all the major sklearn classifiers.
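A minimal sketch of the overlap described above; note that recent scikit-learn releases spell the logistic loss "log_loss", while older ones use "log".

from sklearn.linear_model import SGDClassifier

svm_like = SGDClassifier(loss="hinge")        # default loss: linear-SVM objective, like LinearSVC()
logreg_like = SGDClassifier(loss="log_loss")  # logistic-regression objective ("log" in older versions)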

Online version of Ridge Regression Classifier in scikit-learn?

I'm trying a range of online classifiers from the scikit-learn library to train a model on huge data. I found there are many classifiers supporting partial_fit, which allows incremental learning. I want to use the Ridge Regression classifier in this setting, but could not find it in the implementation. Is there an alternative model that can do this in sklearn?
sklearn.linear_model.SGDClassifier: its loss function can be 'hinge', 'log', 'modified_huber', 'squared_hinge', or 'perceptron'.
sklearn.linear_model.SGDRegressor: its default loss function is 'squared_loss'; the possible values are 'squared_loss', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'. With squared loss and the default L2 penalty, SGDRegressor optimizes essentially the same objective as Ridge, which makes it the incremental alternative here.
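A minimal sketch of that combination; stream_of_batches below is a placeholder for however the data is read in chunks, and recent scikit-learn spells the loss "squared_error" rather than "squared_loss".

from sklearn.linear_model import SGDRegressor

# Squared loss + L2 penalty is essentially the Ridge objective, fitted incrementally.
online_ridge = SGDRegressor(loss="squared_error", penalty="l2", alpha=1e-4)

# for X_batch, y_batch in stream_of_batches:
#     online_ridge.partial_fit(X_batch, y_batch)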

Plotting training and Cross-validation error for Lasso using Scikit-learn

I am running a lasso on a dataset, using scikit-learn's LassoCV. This is the code I used:
# Running Lasso on the GDP dataset:
import numpy as np
from sklearn import linear_model

lasso_gdp = linear_model.LassoCV(max_iter=2000, cv=10, normalize=False)
lasso_gdp.fit(Gdp_train, Gdp_Y)
lasso_gdp.alpha_                                          # alpha chosen by cross-validation
scores_gdp = np.zeros((100, 1))
scores_gdp[:, 0] = np.mean(lasso_gdp.mse_path_, axis=1)   # mean CV MSE per alpha
scores_gdp = np.sort(scores_gdp)                          # note: sorting loses the alpha ordering
lasso_gdp.coef_                                           # final weight vector
While lasso_gdp.alpha_ and lasso_gdp.coef_ give me the cross-validated alpha and the final weight vector, I am looking to plot, for each alpha value (I used the default of 100 alphas), the MSE on the training set and on the cross-validation folds that LassoCV uses. As of now I am not sure which observations it chose for cross-validation, as I simply used the cv=10 option in the LassoCV call.
Could someone help how to get those two curves for each alpha value?
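One way to get both curves, reusing Gdp_train, Gdp_Y and the fitted lasso_gdp from the question: the cross-validation curve comes straight from mse_path_, while the training curve refits a plain Lasso at each of the alphas LassoCV tried. A sketch:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

alphas = lasso_gdp.alphas_                          # the alphas LassoCV actually tried
cv_mse = lasso_gdp.mse_path_.mean(axis=1)           # mean CV MSE per alpha

# Training error: refit at each alpha on the full training set.
train_mse = [
    mean_squared_error(
        Gdp_Y,
        Lasso(alpha=a, max_iter=2000).fit(Gdp_train, Gdp_Y).predict(Gdp_train),
    )
    for a in alphas
]

plt.semilogx(alphas, cv_mse, label="cross-validation MSE")
plt.semilogx(alphas, train_mse, label="training MSE")
plt.axvline(lasso_gdp.alpha_, linestyle="--", label="chosen alpha")
plt.xlabel("alpha")
plt.ylabel("MSE")
plt.legend()
plt.show()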

Working of Regression in sklearn.linear_model.LogisticRegression

How does scikit-learn's sklearn.linear_model.LogisticRegression class work with regression as well as classification problems?
As explained on the Wikipedia page and in a number of other sources, the output of logistic regression is based on the sigmoid function, so it returns a probability. How, then, does the sklearn class work as both a classifier and a regressor?
Logistic regression is a method for classification, not regression. This goes for scikit-learn as for anywhere else.
If you have entered continuous values as the target vector y, then LogisticRegression will most probably fail, as it interprets the unique values of y, i.e. np.unique(y), as different classes. So you may end up with as many classes as samples.
TL;DR: Logistic regression needs a categorical target variable, because it is a classification method.
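A quick illustration of that point on toy data (all names here are made up for the example); recent scikit-learn versions reject a continuous target outright instead of silently building one class per sample:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y_continuous = rng.normal(size=10)           # a continuous "target"
print(len(np.unique(y_continuous)))          # 10 distinct values -> one per sample

try:
    LogisticRegression().fit(X, y_continuous)
except ValueError as err:
    print(err)                               # e.g. "Unknown label type: continuous"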
