Using Python and any machine learning library, I'm trying to train a model with two target labels and a custom loss function. From my understanding, there is only one way to achieve this, and that is by using Keras. Is this correct?
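For reference, here is a minimal sketch of what I mean by the Keras route (shapes, layer names, and the loss are illustrative, not from any particular source):

```python
# Sketch only: a Keras model with two target outputs, each trained with a
# hand-written custom loss. Shapes and names are made up for illustration.
import tensorflow as tf

def custom_loss(y_true, y_pred):
    # Example custom loss: mean squared error written by hand.
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)

inputs = tf.keras.Input(shape=(10,))
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
out1 = tf.keras.layers.Dense(1, name="y1")(hidden)
out2 = tf.keras.layers.Dense(1, name="y2")(hidden)

model = tf.keras.Model(inputs, [out1, out2])
model.compile(optimizer="adam", loss={"y1": custom_loss, "y2": custom_loss})
```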
Here is a list of other things I have tried; have I missed something?
LightGBM
This article is the first result that pops up when searching for custom loss functions. Unfortunately, LightGBM does not support more than one target label, and it doesn't seem like that's going to change anytime soon.
XGBoost
Has the same problem as LightGBM: you cannot have multiple target labels, only multiple target classes (done by duplicating those rows), as discussed here.
scikit-learn: GridSearchCV and make_scorer
This initially looked promising, as you can have several target labels. However, make_scorer only wraps a function that scores the fitted model's predictions; it is not the loss function the model itself optimizes during training.
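To illustrate the limitation (a minimal sketch with synthetic data): the custom scorer below only changes how GridSearchCV ranks the fitted candidates; each tree is still trained by minimizing its own built-in criterion, not this function.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(size=(100, 2))  # two target labels are accepted here

# Used only to *evaluate* each fitted candidate, not to *train* it.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

search = GridSearchCV(DecisionTreeRegressor(), {"max_depth": [2, 4]},
                      scoring=mae_scorer)
search.fit(X, Y)  # scoring uses MAE; the training loss is unchanged
```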
Related
I have a question regarding XGBoost's multi:softmax objective function. I've been playing around with the objective a bit in the context of multi-class classification, and I've noticed something I don't quite understand.
Suppose we have a multi-class classification problem with three different classes, so I want to use multi:softmax as the objective and set num_class = 3, as recommended in the XGBoost documentation. Everything works as expected.
https://xgboost.readthedocs.io/en/stable/parameter.html
Now I set num_class = 2 for the same problem setting, and XGBoost still works as before.
Why does it still work even though num_class was set incorrectly?
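For reference, a minimal sketch of the setup described, with synthetic data and num_class = 3:

```python
# Three-class problem trained with the native xgboost API and multi:softmax.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)  # class labels 0, 1, 2

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "multi:softmax", "num_class": 3}
booster = xgb.train(params, dtrain, num_boost_round=10)

preds = booster.predict(dtrain)  # predicted class labels, not probabilities
```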
I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).
Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?
I am afraid you can only provide a single sample-weight vector when you call fit:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit
More disappointingly, since only one weight set is allowed, the algorithms in sklearn are all built around a single weight set.
As for a custom criterion:
There is a similar issue in the scikit-learn tracker:
https://github.com/scikit-learn/scikit-learn/issues/17436
A potential solution is to create a criterion class mimicking an existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
However, if you read the code in detail, you will find that all the weight-related variables assume a single weight set, with nothing specific to individual outputs (tasks).
So to customize, you may need to hack a lot of code, including:
Hacking the fit function to accept a 2D array of weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
Bypassing the input validation (otherwise you have to keep hacking further)
Modifying the tree builder to accept the weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111
This part is painful: there are a lot of related variables, and you would have to change double to double*.
Modifying the Criterion class to accept a 2D array of weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you would have to keep attributes such as self.weighted_n_node_samples specific to each output (task).
TBH, I think it is really difficult to implement. Maybe we need to raise an issue with the scikit-learn group.
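That said, if all you need is a weighted squared-error criterion (as in the question), there is a much lighter workaround that avoids touching any Cython: the default squared-error impurity simply sums the per-output variances, so rescaling each output column by the square root of its weight before fitting (and unscaling the predictions) has the same effect. A minimal sketch under that assumption, with illustrative data:

```python
# Weighted MSE  sum_k w_k * (y_k - yhat_k)^2  equals plain MSE computed on
# sqrt(w_k) * y_k, so per-output weights can be baked into the targets.
# Only valid for the squared-error criterion, not MAE.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = rng.normal(size=(200, 2))  # two outputs: y1, y2
w = np.array([2.0, 1.0])       # y1 is twice as important as y2

scale = np.sqrt(w)
tree = DecisionTreeRegressor(max_depth=5)
tree.fit(X, Y * scale)               # fit on rescaled targets

Y_pred = tree.predict(X) / scale     # unscale to get back to original units
```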
I wish to use classification metrics like matthews_corrcoef as a metric for a neural network built with CNTK. The only way I could find so far is to evaluate the metric by passing the predictions and labels, as shown:
```python
matthews_corrcoef(cntk.argmax(y_true, axis=-1).eval(),
                  cntk.argmax(y_pred, axis=-1).eval())
```
Ideally I'd like to pass the metric to the trainer object while building my network.
One of the ways would be to create my own custom metric and pass that to the trainer object. Although possible, it would be better to be able to reuse the metrics that already exist in other libraries.
Unless this metric is already implemented in CNTK, implement your own custom "metric" function in whatever format CNTK requires, and have it pass the inputs on to scikit-learn's metric function.
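A minimal sketch of such a wrapper, assuming the same one-hot CNTK values as in the question. Whether this callable can be handed to cntk.Trainer directly depends on the format CNTK requires; here it is only a standalone evaluation helper:

```python
# Sketch: convert CNTK values to class indices, then delegate the actual
# scoring to scikit-learn. Adapting this to the exact signature that
# cntk.Trainer expects is left open.
import cntk
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_metric(y_true, y_pred):
    true_labels = np.asarray(cntk.argmax(y_true, axis=-1).eval()).ravel()
    pred_labels = np.asarray(cntk.argmax(y_pred, axis=-1).eval()).ravel()
    return matthews_corrcoef(true_labels, pred_labels)
```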
I am trying to find an optimal parameter set for an XGBClassifier using GridSearchCV.
Since my data is very unbalanced, both fitting and scoring (in cross-validation) must be performed using weights, so I have to use a custom scorer that takes a 'weights' vector as a parameter.
However, I can't find a way to have GridSearchCV pass the 'weights' vector to the scorer.
There were some attempts to add this functionality to grid search:
https://github.com/ndawe/scikit-learn/commit/3da7fb708e67dd27d7ef26b40d29447b7dc565d7
But they were not merged into master, and now I am afraid that this code is no longer compatible with upstream changes.
Has anyone faced a similar problem and is there any 'easy' way to cope with it?
You could manually balance your training dataset, as in the answer to "Scikit-learn balanced subsampling".
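A minimal sketch of that idea, downsampling every class to the minority class size (the function name and data are illustrative):

```python
import numpy as np

def balanced_subsample(X, y, rng=None):
    """Return X, y with every class downsampled to the minority class size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = balanced_subsample(X, y, rng=0)  # two samples per class
```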
I have a support vector machine trained on ~300,000 examples; it takes roughly 1.5-2 hours to train this model, and I have pickled (serialized) it. Currently, I want to add/remove a couple of the parameters of the model. Is there a way to do this without having to retrain the entire model? I am using sklearn in Python.
If you are using SVC from sklearn, then the answer is no. There is no way to do it; this implementation is purely batch-training based. If you are training a linear SVM using SGDClassifier from sklearn, then the answer is yes, as you can simply start the optimization from the previous solution (when removing a feature, simply drop the corresponding weight; when adding one, initialize the new weight to any value, e.g. zero).
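A minimal sketch of that warm-start idea with SGDClassifier, using coef_init/intercept_init (the data and the dropped feature index are illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Initial (expensive) training run.
clf = SGDClassifier(loss="hinge")
clf.fit(X, y)

# Remove feature 3: drop the column and the corresponding weight, then
# continue optimizing from the previous solution via coef_init.
X_new = np.delete(X, 3, axis=1)
coef_init = np.delete(clf.coef_, 3, axis=1)

clf2 = SGDClassifier(loss="hinge")
clf2.fit(X_new, y, coef_init=coef_init, intercept_init=clf.intercept_)
```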