I am trying to find an optimal parameter set for an XGB_Classifier using GridSearchCV.
Since my data is very unbalanced, both fitting and scoring (in cross_validation) must be performed using weights, therefore I have to use a custom scorer, which takes a 'weights' vector as a parameter.
However, I can't find a way to have GridSearchCV pass 'weights' vector to a scorer.
There were some attempts to add this functionality to gridsearch:
https://github.com/ndawe/scikit-learn/commit/3da7fb708e67dd27d7ef26b40d29447b7dc565d7
But they were not merged into master and now I am afraid that this code is not compatible with upstream changes.
Has anyone faced a similar problem and is there any 'easy' way to cope with it?
You could manually balance your training dataset as in the answer to Scikit-learn balanced subsampling
Related
I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).
Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?
I am afraid you can only provide one weight-set when you fit
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit
And the more disappointing thing is that since only one weight-set is allowed, the algorithms in sklearn is all about one weight-set.
As for custom criterion:
There is a similar issue in scikit-learn
https://github.com/scikit-learn/scikit-learn/issues/17436
Potential solution is to create a criterion class mimicking the existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
However, if you see the code in detail, you will find that all the variables about weights are "one weight-set", which is unspecific to the tasks.
So to customize, you may need to hack a lot of code, including:
hacking the fit function to accept a 2D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
Bypassing the checking (otherwise continue to hack...)
Modify tree builder to allow the weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111
It is terrible, there are a lot of related variable, you should change double to double*
Modify Criterion class to accept a 2-D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you have to keep attributions such as self.weighted_n_node_samples specific to outputs (tasks).
TBH, I think it is really difficult to implement. Maybe we need to raise an issue for scikit-learn group.
This question already has an answer here:
Retrieving specific classifiers and data from GridSearchCV
(1 answer)
Closed 2 years ago.
GridSearchCV and RandomizedSearchCV has best_estimator_ that :
Returns only the best estimator/model
Find the best estimator via one of the simple scoring methods : accuracy, recall, precision, etc.
Evaluate based on training sets only
I would like to enrich those limitations with
My own definition of scoring methods
Evaluate further on test set rather than training as done by GridSearchCV. Eventually it's the test set performance that counts. Training set tends to give almost perfect accuracy on my Grid Search.
I was thinking of achieving it by :
Get the individual estimators/models in GridSearchCV and RandomizedSearchCV
With every estimator/model, predict on test set and evaluate with my customized score
My question is:
Is there a way to get all individual models from GridSearchCV ?
If not, what is your thought to achieve the same thing as what I wanted ? Initially I wanted to exploit existing GridSearchCV because it handles automatically multiple parameter grid, CV and multi-threading. Any other recommendation to achieve the similar result is welcome.
You can use custom scoring methods already in the XYZSearchCVs: see the scoring parameter and the documentation's links to the User Guide for how to write a custom scorer.
You can use a fixed train/validation split to evaluate the hyperparameters (see the cv parameter), but this will be less robust than a k-fold cross-validation. The test set should be reserved for scoring only the final model; if you use it to select hyperparameters, then the scores you receive will not be unbiased estimates of future performance.
There is no easy way to retrieve all the models built by GridSearchCV. (It would generally be a lot of models, and saving them all would generally be a waste of memory.)
The parallelization and parameter grid parts of GridSearchCV are surprisingly simple; if you need to, you can copy out the relevant parts of the source code to produce your own approach.
Training set tends to give almost perfect accuracy on my Grid Search.
That's a bit surprising, since the CV part of the searches means the models are being scored on unseen data. If you get very high best_score_ but low performance on the test set, then I would suspect your training set is not actually a representative sample, and that'll require a much more nuanced understanding of the situation.
I need to get the parameters to use the model in another program.
I tried cat_model.coef_, cat_model.intercept_ or what I think. is that possible to catch the params ?
I totally solved this problem, what i was tryna do is named 'saving model'.
cat_model.save_model('cat_model.cbm')
Attributes .coef_ and .intercept_ only exist in sklearn applications of linear regression and logistic regression and will give you the slopes and the intercept (if fitted). You can use .feature_importances_ instead.
For catboost, your model has something called feature importances, given that it's a gradient boosting tree model what you get back is how heavy certain features are in splitting the tree up.
cat_model.feature_importances_
will tell you that. Though you should do more research into how the model works and what it will give you back because interpreting these features can be somewhat deceptive.
Using python and any machine learning library, I'm trying to have two target labels and a custom loss function. From my understanding, there is only one way to achieve this and that is by using Keras. Is this correct?
Here is a list of other things I have tried, have I missed something?
LightGBM
This article is the first that pops up when searching for custom loss functions. Unfortunately, LightGBM doees not support more than one target label and it doesn't seem like that's going to change anytime soon.
XGBoost
Has the same problem as LightGBM, you cannot have multiple labels only multiple target classes (Done by duplicating those rows) as discussed here.
SciKit-Learn: GridSearchCV and make_scorer
This initially looked good as you can have several target labels. However, the make_scorer method only scores the result of the model and it is not the loss function the model itself uses.
I'm using an SGDClassifier in combination with the partial fit method to train with lots of data. I'd like to monitor when I've achieved an acceptable level of convergence, which means I'd like to know the loss every n iterations on some data (possibly training, possibly held-out, maybe both).
I know this information is available if I pass verbose=1 in the constructor of the classifier, but I'd like to query it programmatically rather than visually. I also know I can use the score method to get accuracy, but I'd like actual loss as measured by my chosen loss function.
Does anyone know how to do this?
You'll have to use either the score method, or one of the loss functions in sklearn.metrics, called explicitly. Not all of SGDC's losses are in sklearn.metrics, but log loss and hinge loss are.
The above answer was too short, outdated and might result in misleading.
Using score method could only give accuracy (it's in the BaseEstimator). If you want the loss function, you could either call private function _get_loss_function (defined in the BaseSGDClassifier). Or accessing BaseSGDClassifier.loss_functions class attribute which will give you a dict and whose entry is the callable for loss function (with default setting)
Also using sklearn.metrics might not get exact loss used for minimization (due to regularization and what to minimize, but you can hand compute anyway). The exact code for Loss function is defined in cython code (sgd_fast.pyx, you could look up the code in scikit-learn github repo)
I'm looking for a good way to plot the minimization progress. Probably will redirect stdout and parse the output.
BTW, I'm using 0.17.1. So a update for the answer.