Custom Criterion for DecisionTreeRegressor in sklearn

Custom Criterion for DecisionTreeRegressor in sklearn - python

I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).
Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?

I am afraid you can only provide one weight-set when you fit
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit
And the more disappointing thing is that since only one weight-set is allowed, the algorithms in sklearn is all about one weight-set.
As for custom criterion:
There is a similar issue in scikit-learn
https://github.com/scikit-learn/scikit-learn/issues/17436
Potential solution is to create a criterion class mimicking the existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
However, if you see the code in detail, you will find that all the variables about weights are "one weight-set", which is unspecific to the tasks.
So to customize, you may need to hack a lot of code, including:
hacking the fit function to accept a 2D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
Bypassing the checking (otherwise continue to hack...)
Modify tree builder to allow the weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111
It is terrible, there are a lot of related variable, you should change double to double*
Modify Criterion class to accept a 2-D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you have to keep attributions such as self.weighted_n_node_samples specific to outputs (tasks).
TBH, I think it is really difficult to implement. Maybe we need to raise an issue for scikit-learn group.

Related

Difference between MultiOutputRegressor(RandomForestRegressor()) versus RandomForestRegressor() when predicting multiple outputs?

It's not clear to me why some resources online demonstrate a multi-target Random Forest regression as being instantiated as either
model = MultiOutputRegressor(RandomForestRegressor())
versus:
model = RandomForestRegressor()
when both seemingly generate multiple regressed outputs. Can anyone clarify?

The internal models are different, but they are both multioutput regressors.
MultiOutputRegressor fits one random forest for each target. Each tree inside then is predicting one of your outputs.
Without the wrapper, RandomForestRegressor fits trees targeting all the outputs at once. The split criteria are based on the average impurity reduction across the outputs. See the User Guide.
The latter may be better computationally, since fewer trees are being built. It can also make use of the fact that the several outputs for a given input may well be correlated. That's all discussed in the user guide as well.
Some conjecture on my part: On the other hand, if the several outputs for a given input are not correlated, internal splits that are good for one output may be lousy for other inputs, so simply averaging them might not work as well. I think in that case increasing the tree complexity can alleviate the issue (but will also take more computation).

Is it possible to use a MultiOutputRegressor with different hyperparameters for each output?

I need a scikit-learn composite estimator for vector targets, but I need to define different hyperparameters for each target.
My first instinct was to define a MultiOutputRegressor of dummy estimators, then overwrite the estimators_ attribute with the desired regressors, but this does not work as only the base estimator is defined on construction; it is then copied on fit.
Do I need to write my own meta-estimator class, or is there a better solution I'm not thinking of?

This was answered off-site by Dr. Lemaitre - no prepacked solution exists to define multiple different regressors into a single multi-output regressor, but a decent work-around is to use one of the -CV family of regressors such as ElasticNetCV as a base estimator. This will allow different hyperparameters for each output, assuming the parameters can be decently tuned on each instance of fit.

fit() vs fit_predict() metthods in sklearn KMeans

There are two methods when we make a model on sklearn.cluster.KMeans. First is fit() and other is fit_predict(). My understanding is that when we use fit() method on KMeans model, it gives an attribute labels_ which basically holds the info on which observation belong to which cluster. fit_predict() also have labels_ attribute.
So my question are,
If fit() fulfills the need then why their is fit_predict()?
Are fit() and fit_predict() interchangeable while writing code?

KMeans is just one of the many models that sklearn has, and many share the same API. The basic functions ae fit, which teaches the model using examples, and predict, which uses the knowledge obtained by fit to answer questions on potentially new values.
KMeans will automatically predict the cluster of all the input data during the training, because doing so is integral to the algorithm. It keeps them around for efficiency, because predicting the labels for the original dataset is very common. Thus, fit_predict adds very little: it calls fit, then returns .labels_. fit_predict is just a convenience method that calls fit, then returns the labels of the training dataset. (fit_predict doesn't have a labels_ attribute, it just gives you the labels.)
However, if you want to train your model on one set of data and then use this to quickly (and without changing the established cluster boundaries) get an answer for a data point that was not in the original data, you would need to use predict, not fit_predict.
In other models (for example sklearn.neural_network.MLPClassifier), training can be a very expensive operation so you may not want to re-train a model every time you want to predict something; also, it may not be a given that the prediction result is generated as a part of the prediction. Or, as discussed above, you just don't want to change the model in response to new data. In those cases, you cannot get predictions from the result of fit: you need to call predict with the data you want to get a prediction on.
Also note that labels_ is marked with an underscore, a Python convention for "don't touch this, it's private" (in absence of actual access control). Whenever possible, you should use the established API instead.

In scikit-learn, there are similar things such as fit and fit_transform.
Fit and predict or labels_ are essential for clustering.
Thus fit_predict is just efficient code, and its result is the same as the result from fit and predict (or labels).
In addition, the fitted clustering model is used only once when determining cluster labels of samples.

Does scikit-learn use One-Vs-Rest by default in multi-class classification?

I am dealing with a multi-class problem (4 classes) and I am trying to solve it with scikit-learn in Python.
I saw that I have three options:
I simply instantiate a classifier, then I fit with train and evaluate with test;
classifier = sv.LinearSVC(random_state=123)
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsRest object, generating a new classifier that I use for train and test;
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsOne object, generating a new classifier that I use for train and test.
classifier = OneVsOneClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I understand the difference between OneVsRest and OneVsOne, but I cannot understand what I am doing in the first scenario where I do not explicitly pick up any of these two options. What does scikit-learn do in that case? Does it implicitly use OneVsRest?
Any clarification on the matter would be highly appreciated.
Best,
MR
Edit:
Just to make things clear, I am not specifically interested in the case of SVMs. For example, what about RandomForest?

Updated answer: As clarified in the comments and edits, the question is more about the general setting of sklearn, and less about the specific case of LinearSVC which is explained below.
The main difference here is that some of the classifiers you can use have "built-in multiclass classification support", i.e. it is possible for that algorithm to discern between more than two classes by default. One example for this would for example be a Random Forest, or a Multi-Layer Perceptron (MLP) with multiple output nodes.
In these cases, having a OneVs object is not required at all, since you are already solving your task. In fact, using such a strategie might even decreaes your performance, since you are "hiding" potential correlations from the algorithm, by letting it only decide between single binary instances.
On the other hand, algorithms like SVC or LinearSVC only support binary classification. So, to extend these classes of (well-performing) algorithms, we instead have to rely on the reduction to a binary classification task, from our initial multiclass classification task.
As far as I am aware of, the most complete overview can be found here:
If you scroll down a little bit, you can see which one of the algorithms is inherently multiclass, or uses either one of the strategies by default.
Note that all of the listed algorithms under OVO actually now employ a OVR strategy by default! This seems to be slightly outdated information in that regard.
Initial answer:
This is a question that can easily be answered by looking at the relevant scikit-learn documentation.
Generally, the expectation on Stackoverflow is that you have at least done some form of research on your own, so please consider looking into existing documentation first.
multi_class : string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)
Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while
"crammer_singer" optimizes a joint objective over all classes. While
crammer_singer is interesting from a theoretical perspective as it is
consistent, it is seldom used in practice as it rarely leads to better
accuracy and is more expensive to compute. If "crammer_singer" is
chosen, the options loss, penalty and dual will be ignored.
So, clearly, it uses one-vs-rest.
The same holds by the way for the "regular" SVC.

GridSearchCV: passing weights to a scorer

I am trying to find an optimal parameter set for an XGB_Classifier using GridSearchCV.
Since my data is very unbalanced, both fitting and scoring (in cross_validation) must be performed using weights, therefore I have to use a custom scorer, which takes a 'weights' vector as a parameter.
However, I can't find a way to have GridSearchCV pass 'weights' vector to a scorer.
There were some attempts to add this functionality to gridsearch:
https://github.com/ndawe/scikit-learn/commit/3da7fb708e67dd27d7ef26b40d29447b7dc565d7
But they were not merged into master and now I am afraid that this code is not compatible with upstream changes.
Has anyone faced a similar problem and is there any 'easy' way to cope with it?

You could manually balance your training dataset as in the answer to Scikit-learn balanced subsampling

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.