How to get CatBoost model's coefficients? - Python

I need to get the model's parameters so that I can use the model in another program.
I tried cat_model.coef_ and cat_model.intercept_, or at least that is what I thought would work. Is it possible to retrieve the parameters?
I solved this myself: what I was trying to do is called 'saving the model'.
cat_model.save_model('cat_model.cbm')
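For anyone needing the full round trip, here is a minimal sketch (the toy data and model settings are placeholders for your own):

    import numpy as np
    from catboost import CatBoostRegressor

    # toy stand-in data; replace with your own training set
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 4)), rng.normal(size=100)

    cat_model = CatBoostRegressor(iterations=50, verbose=False)
    cat_model.fit(X, y)
    cat_model.save_model('cat_model.cbm')

    # in the other program: load the file and predict
    loaded = CatBoostRegressor()
    loaded.load_model('cat_model.cbm')
    print(loaded.predict(X[:5]))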

The attributes .coef_ and .intercept_ only exist on sklearn's linear models, such as linear regression and logistic regression, where they give you the slopes and the intercept (once fitted). For CatBoost you can use .feature_importances_ instead.

CatBoost models expose something called feature importances: since it is a gradient-boosted tree model, what you get back is how heavily each feature is used in splitting up the trees.
cat_model.feature_importances_
will tell you that. You should still read up on how the model works and what these numbers mean, because interpreting the importances can be misleading.
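For example, a small sketch that pairs the importances with the column names (this assumes the model has already been fitted; feature_names_ is populated by CatBoost during fit):

    # assumes cat_model has been fitted
    importances = cat_model.feature_importances_
    for name, score in sorted(zip(cat_model.feature_names_, importances),
                              key=lambda pair: -pair[1]):
        print(f'{name}: {score:.2f}')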

Related

Custom Criterion for DecisionTreeRegressor in sklearn

I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).
Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?
I am afraid you can only provide one weight set (one weight per sample) when you fit:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit
Even more disappointingly, since only one weight set is allowed, all the algorithms in sklearn are built around that single weight set.
As for a custom criterion: there is a similar issue in scikit-learn:
https://github.com/scikit-learn/scikit-learn/issues/17436
A potential solution is to create a criterion class mimicking an existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
However, if you read the code in detail, you will find that all the weight-related variables assume a single weight set that is not specific to the individual tasks.
So to customize it, you would need to hack a lot of code, including:
- the fit function, to accept a 2-D array of weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
and to bypass the input checking (otherwise, keep hacking...);
- the tree builder, to carry the weights through:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111
This is the nasty part: there are many related variables, and you would have to change double to double*;
- the Criterion class, to accept a 2-D array of weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you would have to keep attributes such as self.weighted_n_node_samples specific to each output (task).
To be honest, I think this is really difficult to implement. Maybe we should raise an issue with the scikit-learn team.
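That said, if the default squared-error criterion is enough for you, there is a lighter workaround (my suggestion, not from the issue above): scaling target column j by sqrt(w_j) multiplies its contribution to every node's impurity by w_j, so per-output weights can be emulated by rescaling the targets before fitting and undoing the scaling after predicting. A sketch on toy data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    Y = np.column_stack([X[:, 0] + rng.normal(scale=0.1, size=200),
                         X[:, 1] + rng.normal(scale=0.1, size=200)])

    w = np.array([2.0, 1.0])            # y1 twice as important as y2
    scale = np.sqrt(w)

    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X, Y * scale)              # splits now favour the heavier output
    Y_pred = tree.predict(X) / scale    # undo the scaling on the way out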

Can I use a machine learning model as the objective function in an optimization problem?

I have a data set for which I use scikit-learn's decision tree regression to build a model for prediction purposes. I am now trying to use the scipy.optimize package to find the minimizing solution subject to a given constraint.
However, I am not sure whether I can use the decision tree model as the objective function of the optimization problem. What should the approach be in a situation like this? I have tried linear regression models such as LarsCV in the past and they worked just fine, but in a linear regression model you can simply extract the coefficients and intercept from the model.
Yes; a linear regression model is a straightforward linear function of its coefficients (one of which is the 'intercept', or bias), which is why it is so easy to write out explicitly.
The problem you have now is that a more complex model isn't quite so simple. You need to load the model into an appropriate engine; to 'call' the model, you feed that engine the input vector (the analogue of a list of arguments) and wait for it to return the prediction.
You need to wrap this process in a function call, perhaps one that issues the model load and inference as external system / shell commands and returns the results to your main program. Some applications are large enough that it makes sense to implement a full-bore data stream with a listener and reporter to handle the throughput.
Does that get you moving?
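In the common case where the model lives in the same Python process, the 'engine' is simply the fitted estimator and the wrapper is a plain function around predict. A minimal sketch on toy data; note that a tree's prediction surface is piecewise constant, so its gradient is zero almost everywhere, which makes a derivative-free method such as Nelder-Mead the safer choice:

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.tree import DecisionTreeRegressor

    # toy data whose true minimum is at the origin
    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = (X ** 2).sum(axis=1)
    model = DecisionTreeRegressor(max_depth=6).fit(X, y)

    def objective(x):
        # predict() expects a 2-D array: one row per candidate point
        return model.predict(np.asarray(x).reshape(1, -1))[0]

    result = minimize(objective, x0=np.array([1.0, 1.0]), method='Nelder-Mead')
    print(result.x, result.fun)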

Python classification define feature importance

I am wondering whether it is possible to define feature importances/weights in Python classification methods. For example:
model = tree.DecisionTreeClassifier(feature_weight = ...)
I've seen that RandomForest has an attribute feature_importances_, which shows the importance of features based on the analysis. But is it possible to define the feature importance for the analysis in advance?
Thank you very much for your help in advance!
The feature importance determination in random forest classifiers uses a forest-specific method (roughly: invert all the binary tests over a feature and measure the additional classification error).
Feature importance is thus a concept that relates to the predictive ability of the fitted model, not to the training phase. If you want your model to favour some features over others, you will have to find some trick that depends on the model.
For sklearn's DecisionTreeClassifier, such a trick does not appear to be trivial. You could adjust your class weights if you know some classes will be predicted more easily from the features you want to favour, but that seems pretty dirty.
In other types of models, such as those using kernels, you can do this more easily by setting hyperparameters that relate directly to individual features.
If you are trying to limit overfitting, I would also simply suggest that you remove the features you know to be less important.
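A rough sketch of that last suggestion on toy data: fit once, read off the importances, and drop the weakest features (the median threshold here is an arbitrary choice):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=3, random_state=0)

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    # keep only the features whose importance reaches the median
    keep = clf.feature_importances_ >= np.median(clf.feature_importances_)
    X_reduced = X[:, keep]
    print(X_reduced.shape)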

What's the best way to select features independent of the model being used?

I am using TensorFlow's DNNRegressor to model a multivariate regression problem, and I want to form an optimal feature set from a mixed bag of categorical and continuous features. What would be the best way to proceed? The reason I want the approach to be independent of the model is that I couldn't find much about feature selection/evaluation in the direct context of TensorFlow.
TensorFlow is mostly a library for the machine learning algorithms themselves, so you need to use other libraries for preprocessing.
scikit-learn is a good choice in many cases; you should try it, as it contains feature selection methods. I'm not sure it handles categorical features directly, but you can always encode them as numerical ones.
The documentation suggests:
For regression: f_regression, mutual_info_regression
And for any problem, you can use their first method, VarianceThreshold.
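Put together, a minimal model-independent pipeline could look like this (toy data; k=5 is an arbitrary choice):

    import numpy as np
    from sklearn.feature_selection import (VarianceThreshold, SelectKBest,
                                           mutual_info_regression)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

    # drop constant features first, then keep the k features with the
    # highest estimated mutual information with the target
    X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
    selector = SelectKBest(score_func=mutual_info_regression, k=5)
    X_selected = selector.fit_transform(X_var, y)
    print(selector.get_support(indices=True))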

How do I evaluate whether the mean squared error (MSE) is reasonable or not?

I'm creating regression models using scikit-learn.
Now I'm wondering how I can tell whether a given mean squared error is reasonable or bad.
For example, when I do cross-validation, the test MSE of a model fitted on the training data is 0.70. Is that a reasonable or a bad score?
Also, is it meaningful to compute the MSE on the whole data set for each model, compare them, and check whether the scores are similar?
It's not a programming question, but I want to know how to evaluate the value; I'm not sure whether my approach is correct.
The way to use MSE and other regression performance metrics is to compare different models (or the same model with different hyperparameters). If you keep the data set constant, that tells you which models perform better and which worse; a single MSE value in isolation says very little.
Let me suggest two benchmark regression models to always compare your sophisticated model against. If you are not able to beat these in terms of test MSE (or other metrics), you are doing something wrong:
Dummy regressor (sklearn.dummy.DummyRegressor)
Linear regression (sklearn.linear_model.LinearRegression)
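A quick sketch of that comparison on toy data (the decision tree here stands in for whatever model you are evaluating; note that cross_val_score reports negated MSE, hence the sign flip):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                           random_state=0)

    models = {
        'dummy (mean)': DummyRegressor(strategy='mean'),
        'linear': LinearRegression(),
        'tree': DecisionTreeRegressor(random_state=0),
    }
    for name, model in models.items():
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring='neg_mean_squared_error')
        print(f'{name}: {mse.mean():.1f}')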
