As I understand it, ExtraTreesRegressor from sklearn works by making random splits instead of minimizing a metric like Gini for classification or MAE for regression.
I don't understand why there's a criterion parameter, as the criterion for the splits should be random.
Is it just for code compatibility, or am I missing something?
I think you misunderstood what is random about these splits. According to the User Guide for Extremely Randomized Trees:
As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds,
thresholds are drawn at random for each candidate feature
and
the best of these randomly-generated thresholds is picked as the splitting rule.
The additional randomization of the ExtraTreesRegressor concerns the thresholds of the candidate features. But it must still be determined which of them provides the best split. And this is why you still need a criterion specifying the function to evaluate the quality of a split.
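A minimal sketch to make this concrete: the thresholds are drawn at random, but the criterion still decides which of those random candidate splits is kept, which is why it remains a real parameter. (The criterion names below assume a recent scikit-learn release; older versions used "mse"/"mae".)

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Thresholds are still random per candidate feature; the criterion only
# ranks those random candidates to pick the split that is actually used.
for criterion in ("squared_error", "absolute_error"):
    est = ExtraTreesRegressor(n_estimators=100, criterion=criterion, random_state=0)
    est.fit(X, y)
    print(criterion, est.score(X, y))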
I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results. All of the metrics in the classification report (and the confusion matrix) are the same. Is there an issue with my implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),  # convert strings to integer counts
    ('tfidf', TfidfTransformer()),  # convert integer counts to weighted TF-IDF scores
    ('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train)))  # train XGBoost on the TF-IDF vectors
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
                     'class': ['government', 'military', 'other']})
Classification Report
First, and most important: use a multiclass eval_metric, e.g. eval_metric='merror' or 'mlogloss', then post the results. You showed ['precision', 'recall', 'f1-score', 'support'], but those are suboptimal, or outright broken, unless you computed them in a multiclass-aware, imbalance-aware way.
Second, you need weights. Your class ratio is military: government: other 1:30:18, or as percentages 2:61:37%.
You can manually set per-class weights with xgb.DMatrix(..., weight=...).
Look inside your pipeline (use print or verbose settings, dump values), don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18 and trying more extreme values (reciprocals, so the rarest class gets the highest weight).
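As a minimal sketch of that manual weighting (the weight values are illustrative, and X_train/y_train are placeholders for your already-vectorised training data; note that sample_weight belongs in fit(), not in the XGBClassifier constructor):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Illustrative per-class weights: reciprocals of the 1:30:18 ratio, to be tuned manually.
class_weights = {'military': 1.0, 'government': 1.0 / 30, 'other': 1.0 / 18}

# X_train / y_train are assumed to be your training features and string labels.
sample_weight = np.array([class_weights[label] for label in y_train])
y_encoded = LabelEncoder().fit_transform(y_train)  # recent xgboost versions expect encoded labels

clf = XGBClassifier()
clf.fit(X_train, y_encoded, sample_weight=sample_weight)  # sample_weight is a fit-time argument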
Also try setting min_child_weight much higher, so that it requires a few exemplars (of the minority classes). Start with min_child_weight >= 2 * (weight of the rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold cross-validation for some protection, but your code isn't using CV).
We can't see your other xgboost parameters (how many estimators? early stopping on or off? what learning_rate/eta? etc.). It seems you used the defaults, which will be terrible, or else you're not showing your code. Distrust xgboost's defaults, especially for multiclass, and don't expect xgboost to give good out-of-the-box results. Read the docs and experiment with values.
Do all that experimentation, post your results, and check before concluding "it doesn't work". Don't expect optimal results out of the box. Distrust or double-check the sklearn utility functions and try manual alternatives. (Just because sklearn has a function to do something doesn't mean it's good, best, or suitable for all use cases, like imbalanced multiclass.)
It seems that GridSearchCV of scikit-learn collects the scores of its (inner) cross-validation folds and then averages across the scores of all folds. I was wondering about the rationale behind this. At first glance, it would seem more flexible to instead collect the predictions of its cross-validation folds and then apply the chosen scoring metric to the predictions of all folds.
The reason I stumbled upon this is that I use GridSearchCV on an imbalanced data set with cv=LeaveOneOut() and scoring='balanced_accuracy' (scikit-learn v0.20.dev0). It doesn't make sense to apply a scoring metric such as balanced accuracy (or recall) to each left-out sample. Rather, I would want to collect all predictions first and then apply my scoring metric once to all predictions. Or does this involve an error in reasoning?
Update: I solved it by creating a custom grid search class based on GridSearchCV with the difference that predictions are first collected from all inner folds and the scoring metric is applied once.
GridSearchCV uses the scoring to decide which internal hyperparameters to set in the model.
If you want to estimate the performance of the "optimal" hyperparameters, you need to do an additional step of cross validation.
See http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
EDIT to get closer to answering the actual question:
For me it seems reasonable to collect predictions for each fold and then score them all, if you want to use LeaveOneOut and balanced_accuracy. I guess you need to make your own grid searcher to do that. You could use model_selection.ParameterGrid and model_selection.KFold for that.
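A rough sketch of such a hand-rolled searcher, pooling the left-out predictions and scoring once per candidate, using LeaveOneOut and balanced accuracy as in the question (the function name and structure are purely illustrative):

import numpy as np
from sklearn.base import clone
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut, ParameterGrid

def grid_search_pooled(estimator, param_grid, X, y):
    best_score, best_params = -np.inf, None
    for params in ParameterGrid(param_grid):
        preds = np.empty_like(y)
        for train_idx, test_idx in LeaveOneOut().split(X):
            model = clone(estimator).set_params(**params)
            model.fit(X[train_idx], y[train_idx])
            preds[test_idx] = model.predict(X[test_idx])
        score = balanced_accuracy_score(y, preds)  # metric applied once to all pooled predictions
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score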
I'm trying to create a metric to optimize the precision of True Positives of the positive class in a Decision Tree classifier:
metrica = make_scorer(precision_score, pos_label=1, greater_is_better=True,
                      average="binary")
And then using RandomizedSearchCV for hyperparameter tuning:
random_search = RandomizedSearchCV(clf, scoring= metrica,
param_distributions=param_dist,
n_iter=n_iter_search)
I get the following result:
Tuning the tree with these parameters, I get zero percent of True Positives ...
Simply changing splitter='random' to splitter='best', I do much better: 82% accuracy on the positive class.
What is failing in my metric or in RandomizedSearchCV?
There's nothing wrong with your RandomizedSearchCV or your scorer, though you could simply have passed scoring='precision' instead of building a scorer with make_scorer, since that built-in scorer already uses the parameters you set (pos_label=1, average='binary').
The point of grid searching (or randomized search) is to find the best hyperparameter values for the model that you're using. In this case, you chose to use a classic decision tree. Keep in mind, that this model is pretty basic. You're essentially building one tree so the important hyperparameters are the ones that affect the tree depth and the criteria for splitting.
You mention that when the splitter strategy was changed to "best", you got a better accuracy score. Well, splitter also happens to be a hyperparameter of the model, so it could be fed as an additional parameter in the grid search space.
Another potential reason why you might have gotten the low precision score after running the randomized search is that you might not have given it enough iterations to find the right hyperparameter combinations.
Ultimately, here are my pointers:
A decision tree is a very basic model. It's prone to overfitting. I would recommend an ensemble model instead, such as a random forest, which consists of multiple decision trees.
Consider what hyperparameters you want to tune and how exhaustively you want to search within that hyperparameter space to get the best model. You should start with a couple of hyperparameters to search on, starting with the important ones like max_depth or min_samples_split and then scaling up. This will be experimental on your side. There's no right or wrong here, but keep track of the best parameters found.
You should consider how well balanced your classes are. Models tend to be pretty biased if there's too much of one class over another. If there is a class imbalance, you can control for the imbalance using the class_weight argument.
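For example, a search space that folds in those pointers might look like the following (the ranges are just a starting point, not recommendations):

from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'splitter': ['best', 'random'],       # let the search choose the splitter too
    'max_depth': randint(2, 20),
    'min_samples_split': randint(2, 50),
    'class_weight': [None, 'balanced'],   # control for class imbalance
}

clf = DecisionTreeClassifier(random_state=0)
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   scoring='precision', n_iter=100, random_state=0)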
I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question but not exactly the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seed, is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Not just that: a lot of algorithms in scikit-learn use the random_state to select subsets of features, subsets of samples, to determine initial weights, etc.
For example:
Tree based estimators will use the random_state for random selections of features and samples (like DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like Kmeans, random_state is used to initialize centers of clusters.
SVMs use it when computing probability estimates (the data is shuffled internally)
Some feature selection algorithms also use it for initial selection
And many more...
It's mentioned in the documentation that:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
Do read the following questions and answers for better understanding:
Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn
It does matter. When your training set differs, your trained state also changes. For a different subset of the data you can end up with a classifier that is a little different from the one trained on some other subset.
Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.
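A quick illustration of both points on synthetic data (depending on the data and estimator, the differences across seeds may be tiny or even zero, but fixing random_state always makes the run reproducible):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

for seed in (0, 1, 42):
    clf = AdaBoostClassifier(random_state=seed)   # same data, different seeds
    print(seed, cross_val_score(clf, X, y, cv=5).mean())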
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set with 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB) and a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ . The whole shebang is implemented using scikit-learn and Python, citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, get the 10-fold partitions of the data set first, and for each fold train all of the ensemble's members on that same fold's training portion, measure the accuracy on the held-out portion, rinse and repeat with the other folds, and then compute the accuracy of the ensemble. So the key difference is not to train the individual algorithms with their own k-fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
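One way to score the majority vote with the same folds as the individual models is scikit-learn's VotingClassifier (here X and y stand for your feature matrix and labels, and the member estimators and their parameters are placeholders):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

members = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=0)),
    ('et', ExtraTreesClassifier(n_estimators=200, random_state=0)),
    ('svm', SVC(kernel='rbf', probability=True, random_state=0)),
]
ensemble = VotingClassifier(estimators=members, voting='hard')  # simple majority vote

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(ensemble, X, y, cv=cv, scoring='accuracy').mean())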
An alternative approach (again an ensemble approach) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. With a 10-fold split, you can divide the data into:
8 folds for training
1 fold for validation
1 fold for testing
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack your predictions by fitting a linear regression - or a logistic regression - on the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k, and a_k is the weight of model k. You can also use the predicted labels directly if a model doesn't give you probabilities.
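A sketch of that idea using out-of-fold probabilities and a logistic regression as the blender (the base models, and X and y, are placeholders for your own classifiers and data):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

base_models = [RandomForestClassifier(random_state=0), ExtraTreesClassifier(random_state=0)]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probability of the positive class from each base model, so the
# blender never sees predictions made on data a base model was trained on.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method='predict_proba')[:, 1]
    for m in base_models
])

# The logistic regression learns the weights a_k of the blend.
blender = LogisticRegression().fit(meta_features, y)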
If your models are the same, you can optimise the parameters of the models and the weights at the same time.
If you have obvious differences, you can use different bins with different parameters for each. For example, one bin could be short sequences and the other long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, in particular the section on blending. In 2008 and 2009 they used more advanced techniques; those may also be interesting for you.