Random Forest in-bag and node dimensions - python

I have to build a random forest classifier for an exercise, and the exercise specifically says for the parameters (translated from my language):
in-bag percentage: 25%, 50%, 85%
number of dimensions in one node: 10%, 50%, 80%
I use scikit-learn for the classifier and I don't know which parameters of the class set the in-bag percentage and the number of dimensions.

You can set the number of dimensions considered at each node using the max_features parameter. Something like:
rf = RandomForestClassifier(max_features=.1)
Unfortunately, RandomForestClassifier doesn't yet support subsampling (i.e. the in-bag percentage). However, this feature has been added in the current development branch of sklearn, so it will be available in a future release.
A good workaround for now is to use BaggingClassifier: it has a max_samples parameter for subsampling, and it can be turned into a random forest by using DecisionTreeClassifier as the base estimator.
base = DecisionTreeClassifier(max_features=.1)
rf = BaggingClassifier(base_estimator=base, max_samples=.25)
Note that BaggingClassifier also has a max_features parameter, but it works differently from a random forest's: it subsamples features once per estimator rather than at every split.
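A minimal, self-contained sketch of that workaround (note that recent scikit-learn releases have since added a max_samples parameter to RandomForestClassifier itself, and have renamed BaggingClassifier's base_estimator argument to estimator; the n_estimators value below is an arbitrary choice):
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# 10% of the dimensions are considered at each node split
base = DecisionTreeClassifier(max_features=0.1)

# 25% of the samples are drawn (with replacement) for each tree: the in-bag percentage
rf_like = BaggingClassifier(base_estimator=base,
                            n_estimators=100,
                            max_samples=0.25,
                            bootstrap=True)
# rf_like.fit(X_train, y_train)  # X_train / y_train are your data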

Related

xgboost: Sample Weights for Imbalanced Data?

I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results: all of the metrics in the classification report and the confusion matrix are the same. Is there an issue with my implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),  # convert strings to integer counts
    ('tfidf', TfidfTransformer()),  # convert integer counts to weighted TF-IDF scores
    ('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train)))  # train XGBoost on the TF-IDF vectors
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
                     'class': ['government', 'military', 'other']})
Classification Report
First, and most important: use a multiclass eval_metric, e.g. eval_metric='merror' or 'mlogloss', then post the results. You showed us ['precision', 'recall', 'f1-score', 'support'], but those numbers are suboptimal, or outright misleading, unless you computed them in a multiclass-aware, imbalance-aware way.
Second, you need weights. Your class ratio is military : government : other = 1 : 30 : 18, or as percentages roughly 2% : 61% : 37%.
You can manually set per-class weights via the weight argument of xgb.DMatrix(..., weight=...).
Look inside your pipeline (use print or verbose settings, dump values); don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18, and try more extreme values. Use reciprocals so the rarer class gets the higher weight.
Also try setting min_child_weight much higher, so that a node requires a few exemplars of the minority classes. Start with min_child_weight >= 2 * (weight of the rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold cross-validation for some protection, but your code isn't using CV at all).
We can't see your other xgboost parameters (how many estimators? early stopping on or off? what learning_rate/eta? etc.). It looks like you used the defaults, which will be terrible, or else you're not showing your code. Distrust xgboost's defaults, especially for multiclass, and don't expect xgboost to give good out-of-the-box results. Read the docs and experiment with values.
Do all that experimentation, post your results, and check before concluding "it doesn't work". Don't expect optimal results out of the box, and distrust or double-check the sklearn utility functions; try manual alternatives. Just because sklearn has a function to do something doesn't mean it's good, best, or suitable for every use case, such as imbalanced multiclass.
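As a hedged illustration of the manual per-class weights suggested above (this is a sketch, not the poster's pipeline: sample_weight is accepted by XGBClassifier.fit() rather than the constructor, the exact placement of eval_metric varies a little between xgboost versions, and the n_estimators/learning_rate values are arbitrary):
import numpy as np
from xgboost import XGBClassifier

# class counts from the question
counts = {'military': 1171, 'government': 34852, 'other': 20869}

# reciprocal weights: the rarest class (military) gets the largest weight
class_weight = {c: 1.0 / n for c, n in counts.items()}

def per_sample_weights(labels):
    # map each training label to the weight of its class
    return np.array([class_weight[lbl] for lbl in labels])

clf = XGBClassifier(objective='multi:softprob',  # multiclass objective
                    eval_metric='mlogloss',      # multiclass eval metric, as recommended above
                    n_estimators=200,
                    learning_rate=0.1)

# X_train is your (e.g. TF-IDF) feature matrix and y_train_enc the labels encoded as 0..2;
# the per-sample weights go to fit():
# clf.fit(X_train, y_train_enc, sample_weight=per_sample_weights(y_train))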

Creating a metric for True Positives with make_scorer

I'm trying to create a metric to optimize the precision of True Positives of the positive class in a Decision Tree classifier:
metrica = make_scorer(precision_score, pos_label=1, greater_is_better=True,
                      average="binary")
And then I use RandomizedSearchCV for hyperparameter tuning:
random_search = RandomizedSearchCV(clf, scoring=metrica,
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search)
I get the following result:
Tuning the tree with these parameters, I get zero percent of True Positives ...
Simply changing splitter='random' to splitter='best', I do much better, with 82% accuracy on the positive class.
What is failing in my metric or in RandomizedSearchCV?
There's nothing wrong with your RandomizedSearchCV or your scorer, though you could simply have passed scoring='precision' instead of building one with make_scorer, because the built-in precision scorer already uses the parameters you set (pos_label=1, average='binary').
The point of grid searching (or randomized search) is to find the best hyperparameter values for the model you're using. In this case you chose a classic decision tree. Keep in mind that this model is pretty basic: you're essentially building one tree, so the important hyperparameters are the ones that affect the tree depth and the criteria for splitting.
You mention that when the splitter strategy was changed to "best", you got a better accuracy score. splitter also happens to be a hyperparameter of the model, so it could be fed as an additional dimension of the search space.
Another potential reason why you got a low precision score after running the randomized search is that you might not have given it enough iterations to find the right hyperparameter combination.
Ultimately, here are my pointers:
A decision tree is a very basic model and is prone to overfitting. I would recommend an ensemble model instead, such as a random forest, which consists of multiple decision trees.
Consider which hyperparameters you want to tune and how exhaustively you want to search that hyperparameter space to get the best model. Start with a couple of important ones, like max_depth or min_samples_split, and then scale up (see the sketch after this list). This will be experimental on your side; there's no right or wrong here, but keep track of the best parameters found.
Consider how well balanced your classes are. Models tend to be pretty biased if there's too much of one class over another; if there is a class imbalance, you can control for it using the class_weight argument.
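A hedged sketch of what a wider search space might look like, folding in the points above (the value ranges are illustrative assumptions, not recommendations):
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'splitter': ['best', 'random'],       # let the search choose the split strategy itself
    'max_depth': randint(2, 20),
    'min_samples_split': randint(2, 20),
    'class_weight': [None, 'balanced'],   # helps if the positive class is rare
}

clf = DecisionTreeClassifier(random_state=0)
random_search = RandomizedSearchCV(clf,
                                   param_distributions=param_dist,
                                   n_iter=100,           # enough iterations to cover the space
                                   scoring='precision',  # same as the make_scorer above
                                   random_state=0)
# random_search.fit(X_train, y_train)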

h2o Distributed Random Forest maximum features parameter

I am hyperparameter tuning a random forest and I would like to tune the parameter controlling the maximum number of features of each tree. According to sklearn's documentation it is:
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
I tried looking through the h2o documentation, to no avail.
Does this parameter, or any of the different ways of adjusting it (e.g. log of the number of features), exist in h2o?
The name for this parameter in H2O Random Forest is mtries.
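A sketch of how this might look with the h2o Python API: unlike sklearn's fractional max_features, mtries takes an integer number of columns (negative values select the built-in defaults), so a percentage has to be converted to a count first. The file path and column handling below are placeholders:
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
# train = h2o.import_file("your_data.csv")        # placeholder: load your H2OFrame
# x, y = train.columns[:-1], train.columns[-1]    # placeholder predictors / response

n_features = 100                                   # placeholder: use len(x) in practice
drf = H2ORandomForestEstimator(ntrees=100,
                               mtries=max(1, int(0.5 * n_features)))  # roughly sklearn's max_features=0.5
# drf.train(x=x, y=y, training_frame=train)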

Random Forest in Spark

So I am trying to classify certain text documents into three classes.
I wrote the following code for cross-validation in Spark:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Define a grid of hyperparameters to test:
# - maxDepth: max depth of each decision tree in the random forest
# - numTrees: number of trees in the forest
# In this example we keep these values small. In practice, to get the highest accuracy, you would likely want to try deeper trees (10 or higher) and more trees in the ensemble (>100).
paramGrid = ParamGridBuilder()\
    .addGrid(jpsa.rf.maxDepth, [2, 4, 10])\
    .addGrid(jpsa.rf.numTrees, [100, 250, 600, 800, 1000])\
    .build()
# Define an evaluation metric. This tells CrossValidator how well we are doing by comparing the true labels with predictions.
evaluator = MulticlassClassificationEvaluator(metricName="f1",
                                              labelCol=jpsa.rf.getLabelCol(),
                                              predictionCol=jpsa.rf.getPredictionCol())
# Declare the CrossValidator, which runs model tuning for us.
cv = CrossValidator(estimator=pipeline, evaluator=evaluator,
                    estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(jpsa.jpsa_train)
evaluator.evaluate(cvModel.transform(jpsa.jpsa_train))
I don't have much data: 115 observations in total (documents with labels). I split them 80:35 into training and test. On the training set I use 5-fold cross-validation with the code above.
The evaluator above gave me the following on the whole training data:
evaluator.evaluate(cvModel.transform(jpsa.jpsa_train))
0.9021290600237969
I am using f1 here since I could not find aucROC as a metric option for MulticlassClassificationEvaluator in Spark; it only exists for the binary evaluator. I know AUC is defined for binary classes, but we can get a combined or average AUC for multiclass by treating it as several one-vs-rest binary problems and averaging their AUCs; scikit-learn does the same for multiclass AUC.
However, when I use the evaluator on test data, my f1 score is miserable.
evaluator.evaluate(cvModel.transform(jpsa.jpsa_test))
0.5830903790087463
This indicates it's overfitting. Also, if I don't use 1000 and 800 trees in the hyperparameter search space and just keep it to 600 and below, my test accuracy is 69%. So does that mean more trees are leading to overfitting? That is strange, as it is contrary to how random forests are supposed to work: more trees reduce variance and lead to less overfitting (in fact, people even suggest that random forests don't overfit, though I disagree; with very little data and a complex forest they can).
Is that what is happening here? Less data and a larger number of trees leading to overfitting?
Also, how do I get a measure of the cross-validation accuracy? Currently the evaluator is applied to the training data, and I don't want that as the measure for choosing the algorithm; I want the validation score. Is it possible to get this OOB-style estimate internally from the CV estimator?
Parameter selection is an important aspect of developing machine learning models, and there are multiple ways to do it. One of them would be this: use 50% of the data (stratified) for parameter selection, divide this data into 10 folds, then perform 10-fold cross-validation together with a grid search over the tuning parameters. The parameters usually tuned in a random forest are the number of trees and the depth of each tree (there are other parameters as well, such as the number of features to consider at each split, but the default generally works well).
Also, it is true that a higher number of trees can reduce variance, but if it is too high it can increase bias; there is a trade-off. Create a grid with the number of trees varying from 10 to 100 with a step of 10, 50, or 100.
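On the question of getting a cross-validation score rather than a training-set score: the fitted CrossValidatorModel exposes the average held-out-fold metric for each parameter combination via avgMetrics. A sketch, reusing the jpsa and pipeline objects from the question and the smaller tree counts suggested above:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

paramGrid = ParamGridBuilder()\
    .addGrid(jpsa.rf.maxDepth, [2, 4, 10])\
    .addGrid(jpsa.rf.numTrees, [10, 50, 100])\
    .build()

evaluator = MulticlassClassificationEvaluator(metricName="f1",
                                              labelCol=jpsa.rf.getLabelCol(),
                                              predictionCol=jpsa.rf.getPredictionCol())

cv = CrossValidator(estimator=pipeline, evaluator=evaluator,
                    estimatorParamMaps=paramGrid, numFolds=10)
cvModel = cv.fit(jpsa.jpsa_train)

# avgMetrics holds the mean cross-validated f1 for each entry of paramGrid;
# this is the number to compare models with, not the training-set score
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print(params, metric)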

Python: In which cases can random forest and SVM classifiers produce high accuracy?

I am using Random Forest and SVM classifiers for a classification task, and I have 18322 samples that are unbalanced across 9 classes (3667, 1060, 1267, 2103, 2174, 1495, 884, 1462, 4210). I use 10-fold CV and my training data has 100 feature dimensions. My samples are not very different along these 100 dimensions; with SVM the accuracy is approximately 40%, whereas with RF it reaches 92%. When I make the data even less different along these 100 feature dimensions, RF still gives me 92% accuracy, but the SVM accuracy drops to 25%.
My classifier configurations are:
SVM: LinearSVC(penalty="l1",dual=False)
RF: RandomForestClassifier(n_estimators = 50)
All other parameters are default values. I think there must be something wrong with my RF classifier but I don't know how to check it.
Anyone familiar with these two classifiers can give me some hints?
LinearSVC tries to separate your classes by finding appropriate hyperplanes in Euclidean space. Your samples might simply not be linearly separable, causing poor performance. Random Forest, on the other hand, combines several (in this case 50) simpler classifiers (decision trees), each of which has a piece-wise linear decision boundary; when you combine them you end up with a much more complicated decision function.
In my experience, RF tends to perform quite well with default parameters, and even an extensive parameter search improves accuracy only a little. SVMs behave almost exactly the opposite way.
Have you tried different configurations? How about doing a grid search for better SVM parameters?
Since you're already using sklearn you can use GridSearchCV from sklearn.model_selection (sklearn.grid_search in older versions); see its documentation for details.
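A minimal sketch of such a grid search for the LinearSVC above; the C values and the class_weight option are illustrative assumptions, not tuned recommendations:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],         # regularization strength
    'class_weight': [None, 'balanced'],   # may help with the 9 unbalanced classes
}

svm = LinearSVC(penalty='l1', dual=False)
grid = GridSearchCV(svm, param_grid, cv=10, scoring='accuracy')
# grid.fit(X, y)
# print(grid.best_params_, grid.best_score_)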
