I am running some classification algorithms on a dataset of bus schedules. Specifically, I am training random forests, and part of my source code is the following:
# Instantiate random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
random_forest = RandomForestClassifier()
# K-fold cross-validation (for the grid search)
inner_cross_validator = StratifiedKFold(n_splits=k_fold, shuffle=True)
# Define parameters for grid search
number_of_trees = {'n_estimators': [100, 300, 500]}
max_features_per_tree = {'max_features': [0.2, 0.5, 0.8]}
min_samples_split_per_node = {'min_samples_split': [0.2, 0.5, 0.8]}
parameters = {**number_of_trees, **max_features_per_tree, **min_samples_split_per_node}
# Execute grid search and retrieve the best classifier
best_random_forest = GridSearchCV(estimator=random_forest, param_grid=parameters,
                                  scoring='average_precision', cv=inner_cross_validator, n_jobs=3)
best_random_forest.fit(X_train, y_train)
However, after the grid search, precision and recall have barely improved at all.
In general, in my past experience with other datasets, I have not seen grid search improve the scores of the various metrics by more than 5%, or rarely 10%, compared with the default values of a library like scikit-learn.
Is there anything I can do (after the feature engineering stage) to improve the performance of my classification model significantly more?
A 5%-10% increase from hyperparameter tuning is a significant increase. You should not expect a greater increase than that from GridSearch.
Other than feature engineering (which has a very large scope for increase in performance) you can try:
Randomised Search: searches over randomly sampled hyperparameter values within defined ranges. This usually explores a wider range of values than an exhaustive grid for the same compute budget; a minimal sketch is given after this list.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Using a different algorithm: You are currently using RandomForest. This is a very effective method to reduce the variance of your predictions and slightly increase the performance. However, other methods like Gradient Boosting should give you better performance.
Ensembling of different Algorithms: This is a very broad topic and covers many different ways to combine models for increased performance. https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/
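For reference, here is a minimal sketch of randomised search applied to the same random forest as in the question; the distributions and n_iter are illustrative assumptions, not tuned choices.

# Minimal sketch: randomised search over the question's random forest (illustrative ranges)
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

param_distributions = {
    'n_estimators': randint(100, 600),      # sample integers in [100, 600)
    'max_features': uniform(0.2, 0.6),      # sample floats in [0.2, 0.8)
    'min_samples_split': randint(2, 20),    # absolute counts instead of fractions
}
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_distributions,
    n_iter=30,                              # number of sampled configurations
    scoring='average_precision',
    cv=StratifiedKFold(n_splits=5, shuffle=True),
    n_jobs=3,
)
# random_search.fit(X_train, y_train)       # X_train / y_train as in the question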
I'm struggling to find a learning algorithm that works for my dataset.
I am working on a typical regression problem. There are 6 features in the dataset that I am concerned with and about 800 data points. The features and the predicted values have a high non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and the XGBoost regressor. The training dataset gives around 64% accuracy and the test data returns 11%-14%. Both numbers scare me, haha. I have tried tuning the parameters of the random forest, but nothing seems to make a drastic difference.
Function to tune the parameters
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    # Grid search over the supplied parameter grid with 3-fold CV
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)   # mean absolute percentage error
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
I expect the output to be at least, you know, acceptable, but instead I get 64% on the training data and 12%-14% on the test data. It is a real horror to look at these numbers!
There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide your exact models (it would arguably be a good idea to do so), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
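As a minimal sketch of the alternative, here is how one might evaluate such a regressor with standard metrics from sklearn.metrics instead of the ad-hoc accuracy = 100 - MAPE formula; the function and argument names simply mirror the question's evaluate function.

# Sketch: evaluate a regressor with standard regression metrics instead of "accuracy"
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate_regression(model, test_features, test_labels):
    predictions = model.predict(test_features)
    mae = mean_absolute_error(test_labels, predictions)   # mean of |y - y_hat|
    r2 = r2_score(test_labels, predictions)               # 1.0 is perfect; can be negative
    print('MAE: {:0.4f}'.format(mae))
    print('R^2: {:0.4f}'.format(r2))
    return mae, r2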
It is an overfitting problem. You are fitting the hypothesis very well on your training data.
Possible solutions to your problem:
You can try getting more training data (not more features).
Try a less complex model, like a single decision tree, since highly complex models (random forests, neural networks, etc.) fit the hypothesis very closely to the training data.
Cross-validation: it allows you to tune hyperparameters using only your original training set, so you can keep your test set as a truly unseen dataset for selecting your final model.
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.
I would suggest you use scikit-learn's Pipeline, since it allows you to chain several steps (for example dimensionality reduction and a model) and tune them all in one grid search.
An example of that:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline(steps=[('pca', PCA()), ('logistic', LogisticRegression(max_iter=1000))])
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__C': np.logspace(-4, 4, 5),   # regularisation strength of the logistic step
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
I would suggest improving the way you preprocess the data. Try manually removing outliers, and look at the concept of Cook's distance to find points that have a strongly negative influence on your model. Also, you could scale the data differently than standard scaling: use log scaling if elements in your data are too big or too small, or use feature transformations like a DCT or SVD transform.
Or, at its simplest, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as two features in a stock price prediction task, you can create a new feature for the percentage difference between them, which could help your accuracy a lot.
Do some linear regression analysis to look at the beta values and get a better understanding of which features contribute most to the target value. You can also use feature_importances_ in random forests for the same purpose, and then try to improve those features as much as possible so the model can make better use of them.
This is just the tip of the iceberg of what could be done. I hope this helps.
Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.
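As a minimal sketch (with illustrative values, not recommendations), the capacity-limiting parameters named above might be set like this on a random forest regressor:

# Sketch: restrict the capacity of a tree ensemble via the parameters named above
from sklearn.ensemble import RandomForestRegressor

regularized_rf = RandomForestRegressor(
    n_estimators=100,        # fewer learners
    max_depth=5,             # cap the depth of each tree
    min_samples_split=10,    # require more samples before splitting a node
    min_samples_leaf=5,      # require a minimum leaf size
    random_state=0,
)
# regularized_rf.fit(train_features, train_labels)   # names as in the question's functions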
I am currently building a classifier on heavily imbalanced data. I am using the imblearn pipeline to first do standard scaling, then SMOTE, and then the classification with GridSearchCV. This ensures that the upsampling is done during cross-validation. Now I want to include feature selection in my pipeline. How should I include this step in the pipeline?
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', RandomForestClassifier())
])
param_grid = {
    'classification__n_estimators': [10, 20, 50],
    'classification__max_depth': [2, 3, 5]
}
gridsearch_model = GridSearchCV(model, param_grid, cv=4, scoring=make_scorer(recall_score))
gridsearch_model.fit(X_train, y_train)
predictions = gridsearch_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
It does not necessarily make sense to include feature selection in a pipeline where your model is a random forest (RF). This is because the max_depth and max_features arguments of the RF model already control how many features are used when building the individual trees (a max depth of n says that each tree in your forest is grown to at most n levels, and each split considers at most max_features candidate features). Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
You can simply investigate your trained model for the top-ranked features. When training an individual tree, it can be computed how much each feature decreases the weighted impurity in that tree. For a forest, the impurity decrease from each feature can be averaged, and the features ranked according to this measure. So you don't actually need to retrain the forest for different feature sets, because the feature importances (already computed in the sklearn model) tell you all the information you need.
P.S. I would not waste time grid-searching n_estimators either, because more trees generally give better accuracy. More trees also mean more computational cost, and after a certain number of trees the improvement becomes too small to matter, so that may be something to keep an eye on; but otherwise you will gain performance from a large-ish n_estimators, and you are not really in danger of overfitting because of it either.
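A minimal sketch of reading those impurity-based importances from the fitted forest, assuming the question's gridsearch_model has already been fitted:

# Sketch: rank features by impurity-based importance from the fitted forest
import numpy as np

forest = gridsearch_model.best_estimator_.named_steps['classification']  # the RF step of the pipeline
importances = forest.feature_importances_          # mean impurity decrease per feature
ranking = np.argsort(importances)[::-1]            # feature indices, most important first
for idx in ranking[:10]:
    print('feature {}: importance {:.4f}'.format(idx, importances[idx]))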
Do you mean feature selection from sklearn? https://scikit-learn.org/stable/modules/feature_selection.html
You can run it at the beginning. You will basically adjust the columns of X (X_train and X_test accordingly). It is important that you fit your feature selection only on the training data (as your test data should be unseen at that point in time).
How should I include this step into the pipeline?
So you should run it before the code above.
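A minimal sketch of that idea, using SelectKBest as one possible selector (an assumption; k=20 is an arbitrary example and any sklearn selector works the same way):

# Sketch: fit a feature selector on the training data only, then transform both sets
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=20)            # k is an arbitrary example
X_train_selected = selector.fit_transform(X_train, y_train)   # fit on the training data only
X_test_selected = selector.transform(X_test)                  # reuse the same fitted selector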
There is no single "how", as if there were a concrete recipe; it depends on your goal.
If you want to check which set of features gives you the best performance (according to your metric, here recall), you could use sklearn's sklearn.feature_selection.RFE (Recursive Feature Elimination) or its cross-validation variant sklearn.feature_selection.RFECV.
The first one fits your model on the whole set of features, measures their importance and prunes the least impactful ones. This is repeated until the desired number of features is left. It is quite computationally intensive, though.
The second one starts with all features and removes a number of features (the step parameter) at a time, cross-validating the learned model at each stage. This continues until min_features_to_select is reached. It is VERY computationally intensive, way more than the first one.
As this operation is rather infeasible to combine with a hyperparameter search, you should do it either with a fixed set of defaults before GridSearchCV, or after you have found some suitable values with it. In the first case, the feature choice will not depend on the hyperparameters you found, while in the second case the influence might be quite high. Both ways are correct but would probably yield different results and models.
You can read more about RFECV and RFE in this StackOverflow answer.
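As a minimal sketch of the first option (fixed defaults before the grid search), RFE with the question's random forest might look like this; n_features_to_select is an illustrative choice:

# Sketch: recursive feature elimination with a default forest, run before the grid search
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rfe = RFE(
    estimator=RandomForestClassifier(),   # fixed defaults, as discussed above
    n_features_to_select=20,              # illustrative target number of features
    step=1,                               # drop one feature per iteration
)
X_train_selected = rfe.fit_transform(X_train, y_train)
X_test_selected = rfe.transform(X_test)
# The reduced matrices can then be fed to the SMOTE + RandomForest pipeline and GridSearchCV.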
So I am trying to classify certain text documents into three classes.
I wrote the following code for cross-validation in Spark:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Define a grid of hyperparameters to test:
#  - maxDepth: max depth of each decision tree in the ensemble
#  - numTrees: number of trees in the ensemble
# In this example we keep these values small. In practice, to get the highest accuracy, you would likely want to try deeper trees (10 or higher) and more trees in the ensemble (>100).
paramGrid = ParamGridBuilder()\
.addGrid(jpsa.rf.maxDepth, [2,4,10])\
.addGrid(jpsa.rf.numTrees, [100, 250, 600, 800, 1000])\
.build()
# We define an evaluation metric. This tells CrossValidator how well we are doing by comparing the true labels with predictions.
evaluator = MulticlassClassificationEvaluator(metricName="f1", labelCol=jpsa.rf.getLabelCol(), predictionCol=jpsa.rf.getPredictionCol())
# Declare the CrossValidator, which runs model tuning for us.
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(jpsa.jpsa_train)
evaluator.evaluate(cvModel.transform(jpsa.jpsa_train))
I don't have much data: 115 observations in total (documents with labels). I split them into 80 training and 35 test observations. On the training set I use 5-fold cross-validation with the code above.
The evaluator above gave me the following on the whole training data:
evaluator.evaluate(cvModel.transform(jpsa.jpsa_train))
0.9021290600237969
I am using f1 here since I cannot find aucROC as an option for the multiclass evaluator in Spark; it is only available for the binary one. I know AUC is defined for binary classes, but we can get a combined or average AUC for the multiclass case by treating each class as a binary one-vs-rest problem, plotting the individual curves and averaging their AUCs. Scikit-learn does the same for multiclass AUC.
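For reference, a minimal sketch of that averaged one-vs-rest AUC in scikit-learn (y_true and y_proba are placeholders for the labels and the predicted class-probability matrix):

# Sketch: macro-averaged one-vs-rest AUC for a multiclass problem in scikit-learn
from sklearn.metrics import roc_auc_score

# y_true: shape (n_samples,); y_proba: predicted class probabilities, shape (n_samples, n_classes)
auc_ovr = roc_auc_score(y_true, y_proba, multi_class='ovr', average='macro')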
However, when I use the evaluator on test data, my f1 score is miserable.
evaluator.evaluate(cvModel.transform(jpsa.jpsa_test))
0.5830903790087463
This indicates it is overfitting. Also, if I don't use 1000 trees in the hyperparameter search space and just keep it to 600 and 800, my test accuracy is 69%. So does that mean more trees are leading to overfitting? That is strange, as it is contrary to how random forests are supposed to work: more trees reduce variance and lead to less overfitting (in fact people even suggest that random forests don't overfit, though I disagree; with very little data and a complex forest they can).
Is that what is happening here: less data and more trees leading to overfitting?
Also, how do I get a measure of the cross-validation accuracy? Currently the evaluator is applied to the training data, and I don't want to use that as the measure for choosing the algorithm; I want the validation data. Is it possible to get this OOB estimate internally from this CV estimator?
Parameter selection is an important aspect of developing machine learning models, and there are multiple ways to do it. One of them is the following: use 50% of the data (stratified) for parameter selection, divide this data into 10 folds, and perform 10-fold cross-validation together with a grid search over the tuning parameters. Usually the parameters to tune in a random forest are the number of trees and the depth of each tree (there are other parameters as well, such as the number of features to consider at each split, but the defaults generally work well).
Also, it is true that a higher number of trees can reduce variance, but if it is much too high it can increase bias; there is a trade-off. Create a grid with the number of trees varying from 10 to 100 with a step of 10, 50 or 100; a sketch of such a grid follows.
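A minimal sketch of that grid in the same PySpark API as the question (the maxDepth values are illustrative):

# Sketch: a smaller grid with the number of trees going from 10 to 100 in steps of 10
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder()\
    .addGrid(jpsa.rf.numTrees, list(range(10, 101, 10)))\
    .addGrid(jpsa.rf.maxDepth, [2, 4, 6])\
    .build()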
I use the scikit-learn library for machine learning (with text data). It looks like this:
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', tokenizer=nltk.word_tokenize, stop_words=stop_words).fit(train)
matr_train = vectorizer.transform(train)
X_train = matr_train.toarray()
matr_test = vectorizer.transform(test)
X_test = matr_test.toarray()

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_predict = rfc.predict(X_test)
When I run it for the first time, the result on the test dataset is 0.17 recall and 1.00 precision. OK.
But when I run it a second time on the same test and training datasets, the result is different: 0.23 recall and 1.00 precision. And when I run it again, the result changes again. At the same time, the precision and recall on the training dataset stay exactly the same.
Why does this happen? Does it have something to do with my data?
Thanks.
A random forest fits a number of decision tree classifiers on various sub-samples of the dataset. Every time you fit the classifier, the sub-samples are randomly generated, and thus the results differ. To control this you need to set a parameter called random_state.
rfc = RandomForestClassifier(random_state=137)
Note that random_state is the seed used by the random number generator. You can use any integer to set this parameter. Whenever you change the random_state value the results are likely to change. But as long as you use the same value for random_state you will get the same results.
The random_state parameter is used in various other classifiers as well. For example in Neural Networks we use random_state in order to fix initial weight vectors for every run of the classifier. This helps in tuning other hyper-parameters like learning rate, weight decay etc. If we don't set the random_state, we are not sure whether the performance change is due to the change in hyper-parameters or due to change in initial weight vectors. Once we tune the hyper-parameters we can change the random_state to further improve the performance of the model.
The clue is (at least partly) in the name.
A Random Forest uses randomised decision trees, and as such, each time you fit, the result will change.
https://www.quora.com/How-does-randomization-in-a-random-forest-work
I am using Random Forest and SVM classifiers for classification, and I have 18322 samples which are unbalanced across 9 classes (3667, 1060, 1267, 2103, 2174, 1495, 884, 1462, 4210). I use 10-fold CV and my training data has 100 feature dimensions. My samples do not differ very much across these 100 dimensions. When I use SVM, the accuracy is approximately 40%; however, when I use RF, the accuracy reaches 92%. Then I made my data even less distinct in these 100 feature dimensions, yet RF still gives me 92% accuracy while the accuracy of SVM drops to 25%.
My classifier configurations are:
SVM: LinearSVC(penalty="l1",dual=False)
RF: RandomForestClassifier(n_estimators = 50)
All other parameters are default values. I think there must be something wrong with my RF classifier, but I don't know how to check it.
Can anyone familiar with these two classifiers give me some hints?
Linear SVC tries to separate your classes by finding appropriate hyperplanes in Euclidean space. Your samples might simply not be linearly separable, which would cause the poor performance. Random Forest, on the other hand, uses several (in this case 50) simpler classifiers (decision trees), each of which has a piece-wise linear decision boundary. When you combine them, you end up with a much more complicated decision function.
In my experience, RF tends to perform quite well with default parameters, and even an extensive parameter search improves accuracy only a little. SVM behaves almost exactly the opposite way.
Have you tried different configurations? How about doing a grid search for better SVM parameters?
Since you're already using sklearn, you can use sklearn.grid_search.GridSearchCV (sklearn.model_selection.GridSearchCV in recent versions); more details here.
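A minimal sketch of such a grid search over the regularisation strength C of the question's LinearSVC (the value range is illustrative, and cv=10 simply mirrors the question's 10-fold CV):

# Sketch: grid search over the regularisation strength C of the question's LinearSVC
from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in older releases
from sklearn.svm import LinearSVC

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}        # illustrative range
svm_search = GridSearchCV(
    LinearSVC(penalty="l1", dual=False),           # same configuration as in the question
    param_grid,
    cv=10,                                         # the question uses 10-fold CV
    n_jobs=-1,
)
# svm_search.fit(X_train, y_train)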