Random Forest in Spark - python

So I am trying to classify certain text documents into three classes.
I wrote the following code for cross-validation in Spark:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Define a grid of hyperparameters to test:
# - maxDepth: max depth of each decision tree in the random forest
# - numTrees: number of trees in the forest
# In practice, to get the highest accuracy, you would likely want to try deeper trees (10 or higher) and more trees in the ensemble (>100).
paramGrid = ParamGridBuilder() \
    .addGrid(jpsa.rf.maxDepth, [2, 4, 10]) \
    .addGrid(jpsa.rf.numTrees, [100, 250, 600, 800, 1000]) \
    .build()
# We define an evaluation metric. This tells CrossValidator how well we are doing by comparing the true labels with predictions.
evaluator = MulticlassClassificationEvaluator(metricName="f1", labelCol=jpsa.rf.getLabelCol(), predictionCol=jpsa.rf.getPredictionCol())
# Declare the CrossValidator, which runs model tuning for us.
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(jpsa.jpsa_train)
evaluator.evaluate(cvModel.transform(jpsa.jpsa_train))
I don't have much data: 115 total observations (documents with labels). I split them 80:35 into training and test sets. On the training set I run 5-fold cross-validation using the code above.
The evaluator above gave me the following on whole training data.
evaluator.evaluate(cvModel.transform(jpsa.jpsa_train))
0.9021290600237969
I am using f1 here since I could not find aucROC as a metric option for the multiclass evaluator in Spark; it is only available for the binary evaluator. I know AUC is defined for binary classification, but we can still get a combined or average AUC for the multiclass case by treating each class one-vs-rest and averaging the per-class AUCs. scikit-learn does this for multiclass AUC.
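(For reference, a minimal scikit-learn sketch of that averaged one-vs-rest multiclass AUC; y_true and y_proba are hypothetical arrays of true labels and predicted class probabilities:)
from sklearn.metrics import roc_auc_score
# y_proba has shape (n_samples, n_classes); macro-average the one-vs-rest AUCs
auc_ovr = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")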
However, when I use the evaluator on test data, my f1 score is miserable.
evaluator.evaluate(cvModel.transform(jpsa.jpsa_test))
0.5830903790087463
This indicates it's overfitting. Also, if I don't use 1000 trees in the hyperparameter search space and just keep it to 600 and 800, my test accuracy is 69%. So does that mean more trees are leading to overfitting? That is strange, as it is contrary to how random forests work: more trees reduce variance and lead to less overfitting (in fact, people sometimes even suggest that random forests don't overfit, though I disagree; with very little data and a complex forest they can).
Is that what is happening here: less data and more trees leading to overfitting?
Also, how do I get a measure of the cross-validation accuracy? Currently the evaluator is applied to the training data, and I don't want that as the measure for choosing the algorithm; I want the performance on the held-out validation folds. Is it possible to get this estimate (similar to an OOB estimate) internally from the CV estimator?
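(Note: a sketch of one way to read a cross-validated metric out of the fitted model, using Spark's CrossValidatorModel.avgMetrics, which holds the evaluator metric averaged over the held-out folds, one entry per parameter combination:)
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print(params, metric)
print("best cross-validated f1:", max(cvModel.avgMetrics))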

Parameter selection is an important aspect of developing machine learning models, and there are multiple ways to do it. One of them is this: use 50% of the data (stratified) for parameter selection, divide this data into 10 folds, and perform 10-fold cross-validation together with a grid search over the tuning parameters. The parameters usually tuned in a random forest are the number of trees and the depth of each tree (there are other parameters as well, such as the number of features to consider at each split, but the defaults generally work well).
Also, it is true that a higher number of trees can reduce variance, but beyond a point the gains flatten out while the computational cost keeps growing, so there is a trade-off. Create a grid with the number of trees varying from 10 to 100 in steps of 10, 50 or 100.
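A sketch of such a grid in the asker's PySpark setup (rf here stands for the RandomForestClassifier instance, jpsa.rf in the question; the exact values are only illustrative):
from pyspark.ml.tuning import ParamGridBuilder
# number of trees from 10 to 100 in steps of 10, plus a small range of depths
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, list(range(10, 101, 10))) \
    .addGrid(rf.maxDepth, [3, 5, 7]) \
    .build()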

Related

low training (~64%) and test accuracy (~14%) with 5 different models

I'm struggling to find a learning algorithm that works for my dataset.
I am working with a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a highly non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and the xgb regressor. The training dataset gives around 64% accuracy and the test data returns 11%-14%. Both numbers scare me, haha. I have tried tuning the parameters for the random forest, but nothing seems to make a drastic difference.
Function to tune the parameters
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
I expected the output to be at least acceptable, but instead I got 64% on the training data and 12%-14% on the test data. It is a real horror to look at these numbers!
There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
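If one wants well-defined regression metrics instead, here is a minimal scikit-learn sketch (reusing the question's predictions and test_labels):
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(test_labels, predictions)
r2 = r2_score(test_labels, predictions)          # an "accuracy-like" summary for regression
print('MAE: {:0.4f}'.format(mae))
print('R^2: {:0.3f}'.format(r2))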
It is an overfitting problem: you are fitting the hypothesis very well on your training data.
Possible solutions to your problem:
You can try getting more training data (not features).
Try a less complex model, like a decision tree, since highly complex models (random forests, neural networks, etc.) fit the hypothesis very closely on the training data.
Cross-validation: it allows you to tune hyperparameters using only your original training set, which keeps your test set as a truly unseen dataset for selecting your final model.
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression (see the sketch after this list).
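As a minimal sketch of the cross-validation and regularization points, assuming the question's train_features and train_labels arrays and an sklearn RandomForestRegressor (all values are only illustrative):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Capacity-limited forest: shallow trees and a higher split threshold act as regularization
rf = RandomForestRegressor(n_estimators=200, max_depth=5, min_samples_split=10, random_state=0)
# 5-fold cross-validated R^2 on the training set only; the test set stays untouched
scores = cross_val_score(rf, train_features, train_labels, cv=5, scoring='r2')
print(scores.mean(), scores.std())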
I would suggest you use the Pipeline function, since it lets you chain preprocessing and the model into a single estimator and grid search them together.
An example of that:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pca = PCA()
logistic = SGDClassifier(loss='log_loss', penalty='l2')  # logistic regression fit by SGD; exposes an 'alpha' penalty ('log' in older sklearn)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipeline steps can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
I would suggest improving things by preprocessing the data better. Try to manually remove the outliers, and look into Cook's distance to find points that have a strong negative influence on your model. You could also scale the data differently from standard scaling: use log scaling if values in your data are too big or too small, or use feature transformations such as a DCT or SVD transform.
Or, most simply, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as two features in a stock price prediction task, you can create a new feature for the percentage change between them, which could help your accuracy a lot.
Do some linear regression analysis to inspect the beta values and get a better understanding of which features contribute most to the target value. You can use feature_importances_ in random forests for the same purpose, and then try to improve those features as much as possible so that the model can learn from them better.
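A small sketch of reading those importances from a fitted scikit-learn forest (feature_names is a hypothetical list of the 6 column names; train_features and train_labels come from the question):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(train_features, train_labels)
# Rank features by impurity-based importance, highest first
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(feature_names[idx], rf.feature_importances_[idx])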
This is just the tip of the iceberg of what could be done. I hope this helps.
Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.

When to do feature selection in an imblearn pipeline with cross-validation and grid search

Currently I am building a classifier with heavily imbalanced data. I am using the imblearn pipeline to first do standard scaling and SMOTE, and then the classification with GridSearchCV. This ensures that the upsampling is done during cross-validation. Now I want to include feature selection in my pipeline. How should I include this step in the pipeline?
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score, classification_report, confusion_matrix

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', RandomForestClassifier())
])
param_grid = {
    'classification__n_estimators': [10, 20, 50],
    'classification__max_depth': [2, 3, 5]
}
gridsearch_model = GridSearchCV(model, param_grid, cv=4, scoring=make_scorer(recall_score))
gridsearch_model.fit(X_train, y_train)
predictions = gridsearch_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
It does not necessarily make sense to include feature selection in a pipeline where your model is a random forest (RF). This is because the max_depth and max_features arguments of the RF model already control how many features are used when building the individual trees (a max depth of n says that each tree is grown to at most depth n, and each split considers a random subset of max_features candidate features). Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
You can simply investigate your trained model for the top ranked features. When training an individual tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. So then you actually don't need to retrain the forest for different feature sets, because the feature importance (already computed in the sklearn model) tells you all the info you'd need.
P.S. I would not waste time grid searching n_estimators either: more trees generally give better accuracy, at the cost of more computation, and beyond a certain number the improvement becomes too small to matter. So computational cost may be something to watch, but otherwise you will gain performance from a large-ish n_estimators, and you are not really at risk of overfitting from it either.
Do you mean feature selection from sklearn? https://scikit-learn.org/stable/modules/feature_selection.html
You can run it at the beginning. You will basically adjust the columns of X (X_train, and X_test accordingly). It is important that you fit your feature selection only on the training data, since your test data should be unseen at that point in time.
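A minimal sketch of that pattern with a univariate selector (SelectKBest is just one possible choice; k=20 is arbitrary):
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=20)            # fit on the training data only
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)            # reuse the same fitted selector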
How should I include this step into the pipeline?
so you should run it before the code above.
There is no single "how", as if there were a concrete recipe; it depends on your goal.
If you want to check which set of features gives you the best performance (according to your metric, here recall), you could use sklearn's sklearn.feature_selection.RFE (Recursive Feature Elimination) or its cross-validation variant sklearn.feature_selection.RFECV.
The first one fits your model on the whole set of features, measures their importance, and prunes the least impactful ones. This continues until the desired number of features is left. It is quite computationally intensive, though.
The second one starts with all features and removes step features at a time, scoring each candidate feature subset with cross-validation, until min_features_to_select is reached. It is VERY computationally intensive, way more than the first one.
As this operation is rather infeasible to combine with a hyperparameter search, you should do it with a fixed set of defaults, either before GridSearchCV or after you have found some suitable values with it. In the first case, the feature choice will not depend on the hyperparameters you find; in the second case the influence might be quite high. Both ways are correct but would probably yield different results and models.
You can read more about RFECV and RFE in this StackOverflow answer.
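A sketch of the RFECV route with the defaults fixed up front (the step, min_features_to_select and scoring choices are only illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(RandomForestClassifier(n_estimators=100),
                 step=1, min_features_to_select=5,
                 cv=StratifiedKFold(5), scoring='recall')
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)   # keep only the columns RFECV selected
X_test_sel = selector.transform(X_test)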

Insignificant improvement after Grid Search

I am running some algorithms for classification purposes on a dataset regarding bus schedules. Specifically, I run some random forests, and part of my source code is the following:
# Instantiate random forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
# K-fold cross-validation (for the grid search)
from sklearn.model_selection import StratifiedKFold
inner_cross_validator = StratifiedKFold(n_splits=k_fold, shuffle=True)
from sklearn.model_selection import GridSearchCV
# Define parameters for grid search
number_of_trees = {'n_estimators': [100, 300, 500]}
max_features_per_tree = {'max_features': [0.2, 0.5, 0.8]}
min_samples_split_per_node = {'min_samples_split': [0.2, 0.5, 0.8]}
parameters = {**number_of_trees, **max_features_per_tree, **min_samples_split_per_node}
# Execute grid search and retrieve the best classifier
best_random_forest = GridSearchCV(estimator=random_forest, param_grid=parameters, scoring='average_precision', cv=inner_cross_validator, n_jobs=3)
best_random_forest.fit(X_train, y_train)
However, after the grid search the precision and the recall are hardly improved at all.
In general, in my experience with other datasets in the past, I have not noticed an improvement of more than 5%, or rarely 10%, in the scores of the various metrics after a grid search, compared with the default values of a library like scikit-learn.
Can I do something (after the feature engineering stage) to improve the performance of my classification model significantly more?
A 5%-10% increase from hyperparameter tuning is a significant increase. You should not expect a greater increase than that from GridSearch.
Other than feature engineering (which has a very large scope for increase in performance) you can try:
Randomised search: searches over randomly selected hyperparameter values within defined ranges, so you can cover a wider range of settings for the same budget (see the sketch after this list):
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Using a different algorithm: You are currently using RandomForest. This is a very effective method to reduce the variance of your predictions and slightly increase the performance. However, other methods like Gradient Boosting should give you better performance.
Ensembling of different Algorithms: This is a very broad topic and covers many different ways to combine models for increased performance. https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/
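A sketch of such a randomized search over the same kind of random-forest parameters (the distributions and n_iter are only illustrative; inner_cross_validator is the StratifiedKFold from the question):
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(100, 600),
    'max_features': uniform(0.2, 0.6),        # samples from [0.2, 0.8)
    'min_samples_split': randint(2, 20),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=30, scoring='average_precision',
                            cv=inner_cross_validator, n_jobs=3)
search.fit(X_train, y_train)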

Python: In which cases can random forest and SVM classifiers produce high accuracy?

I am using random forest and SVM classifiers to do classification, and I have 18322 samples which are unbalanced across 9 classes (3667, 1060, 1267, 2103, 2174, 1495, 884, 1462, 4210). I use 10-fold CV and my training data has 100 feature dimensions. In my samples, the training data are not very different across these 100 dimensions; when I use the SVM, the accuracy is approximately 40%, but when I use RF, the accuracy can be 92%. Then I make my data even less different across these 100 feature dimensions, yet RF still gives me 92% accuracy, while the accuracy of the SVM drops to 25%.
My classifier configurations are:
SVM: LinearSVC(penalty="l1",dual=False)
RF: RandomForestClassifier(n_estimators = 50)
All other parameters are default values. I think there must be something wrong with my RF classifier but I don't know how to check it.
Anyone familiar with these two classifiers can give me some hints?
Linear SVC tries to separate your classes by finding appropriate hyperplanes in euclidean space. Your samples might just not be linearly separable causing poor performance. Random Forest, on the other hand, uses several (in this case 50) simpler classifiers (Decision Trees), each of which has a piece-wise linear decision boundary. When you sum them together you end up with a much more complicated decision function.
In my experience, RF tends to perform quite well with default parameters, and even an extensive parameter search improves accuracy only a little. SVM behaves almost exactly the opposite way.
Have you tried different configurations? How about doing a grid search for better SVM parameters?
Since you're already using sklearn, you can use sklearn.model_selection.GridSearchCV (sklearn.grid_search in older versions); see the sketch below.
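A minimal sketch of such a search over the SVM's regularization strength (the C grid is arbitrary, and X, y stand for the training features and labels; scaling the 100 features first usually matters a lot for SVMs):
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipe = Pipeline([('scale', StandardScaler()),
                 ('svm', LinearSVC(penalty='l1', dual=False))])
grid = GridSearchCV(pipe, {'svm__C': [0.01, 0.1, 1, 10, 100]}, cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)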

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc' on the training set for 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (Linear kernel), SVM (RBF kernel) and GRB) , and to use a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ . The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10-fold data set partitions first, and for each fold, train your whole ensemble on that same fold, measure the accuracy, rinse and repeat with the other folds, and then compute the accuracy of the ensemble. So the key difference is not to train the individual algorithms with their own k-fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
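One way to wire that up in scikit-learn is to treat the combined vote itself as a single estimator and cross-validate it as a whole; a sketch, with base estimators mirroring the ones named in the question and X, y standing for the training features and labels:
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 'soft' voting averages predicted probabilities, so ROC/AUC can be computed for the ensemble;
# voting='hard' would give the plain majority vote but no probability scores
vote = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier()),
    ('et', ExtraTreesClassifier()),
    ('svm_lin', SVC(kernel='linear', probability=True)),
    ('svm_rbf', SVC(kernel='rbf', probability=True)),
    ('gbc', GradientBoostingClassifier()),
], voting='soft')

scores = cross_val_score(vote, X, y, cv=10, scoring='roc_auc')   # same 10-fold CV as the single models
print(scores.mean())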
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble never sees the test data) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. For each of the 10 folds, you can split the data into:
8 folds for training
1 fold for validation
1 fold for testing
Optimise the hyper-parameters for each algorithm using the training and validation sets, then stack your predictions by fitting a linear regression (or a logistic regression) over the validation set. Your final model will be p = a_0 + a_1 p_1 + ... + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k, and a_k is the weight of model k. You can also use the predicted labels directly if a model does not give you probabilities.
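A compact sketch of that blending step, assuming base_models is a list of the already-fitted classifiers and X_valid/y_valid is the held-out validation fold (all names here are hypothetical):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack the base models' positive-class probabilities from the validation fold
P_valid = np.column_stack([m.predict_proba(X_valid)[:, 1] for m in base_models])
blender = LogisticRegression().fit(P_valid, y_valid)

# At prediction time, feed the same per-model probabilities through the blender
P_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
p_final = blender.predict_proba(P_test)[:, 1]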
If your models are of the same type, you can optimise the model parameters and the weights at the same time.
If you have obvious differences, you can build different bins with different parameters for each: for example, one bin could be short sequences and the other long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, in the section on blending. In 2008 and 2009 they used more advanced techniques, which may also be interesting for you.
