I understand that Random Forest models can be used for both classification and regression. Are there more specific criteria for determining when a random forest would perform better than common regression models (Linear, Lasso, etc.) for estimating values, or better than Logistic Regression for classification?
A random forest is built from a collection of decision trees; it is a supervised ensemble learning algorithm designed to reduce the over-fitting that afflicts individual decision trees.
A basic principle in machine learning is that no single model outperforms all others on every problem, so it is always recommended to try several models before settling on the best one.
That said, there are preferences in model selection when dealing with data of different natures. Each model makes intrinsic assumptions about the data, and the model whose assumptions are most aligned with the data generally performs best. For instance, logistic regression assumes a smooth, linear decision boundary between classes, whereas a random forest makes no such assumption; if your data really does have a linear boundary, logistic regression is a sensible choice, and if it does not, a random forest may do better. Hence, the nature of your data makes a difference in your choice of model, and it is always good to try several before reaching a conclusion.
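Just as an illustration of that point (not part of the original answer), here is a minimal sketch, assuming synthetic data from sklearn's make_moons, whose curved class boundary violates the linear-boundary assumption of logistic regression while a random forest copes with it:

```python
# Illustrative comparison of a linear and a non-linear classifier on data
# with a curved decision boundary. Dataset and settings are assumptions.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# make_moons has a non-linear class boundary, which breaks the linear
# decision-boundary assumption of logistic regression.
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)

for name, model in [
    ("Logistic regression", LogisticRegression(max_iter=1000)),
    ("Random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```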
Let's assume we're dealing with continuous features and responses. We fit a linear regression model (say, first order) and after cross-validation we get a reasonably good R² (say R² = 0.8).
Why do we go for other ML algorithms? I've read some research papers in which the authors tried different ML algorithms and used a simple linear model as the baseline for comparison. In those papers the linear model outperformed the other algorithms, and what I have difficulty understanding is why we go for other ML algorithms at all. Why can't we just be satisfied with the linear model, especially in the specific case where the other algorithms perform poorly?
The other question is: what do the authors gain from presenting the other algorithms in their papers if those algorithms performed poorly?
The best model for predictive problems with a continuous output is a regression model, especially one implemented as a neural network (polynomial or linear) with hyperparameters tuned to the problem.
Other ML algorithms such as decision trees or SVMs have classification as their main goal; on paper they can also do regression, but in practice they cannot predict values outside the range seen in training.
But in research, people are always trying to find a better way to predict values than plain regression, just as in the classification world we started with logistic regression, moved to decision trees, and now have SVMs, ensemble models, and deep learning.
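To illustrate the extrapolation point above (a sketch under the assumption that "can't predict any new values" refers to values outside the training range), a decision tree regressor trained on synthetic data cannot predict beyond the largest target it has seen, while a linear model can:

```python
# Synthetic illustration: trees cannot extrapolate, linear models can.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train.ravel() + rng.normal(scale=1.0, size=200)

# Ask both models for a prediction well outside the training range (x = 20).
X_new = np.array([[20.0]])

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

print("Decision tree:", tree.predict(X_new))   # capped near the max training target (~30)
print("Linear model:", lin.predict(X_new))     # extrapolates to roughly 60
```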
I think the answer is: because you never know.
"...especially in the specific case where other algorithms perform poorly?"
You only know they performed poorly because someone tried those models. It's always worth trying various models.
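As a minimal sketch of "trying various models" (the dataset and model choices here are illustrative assumptions, not from the papers in question), one can keep the linear model as the baseline and compare everything else under the same cross-validation:

```python
# Compare several regressors against a linear baseline under identical CV.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

models = {
    "Linear (baseline)": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV r^2 = {r2:.3f}")
```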
Those are the learning curves for each algorithm I used. I'm working on my report and I'm confused about how to interpret the curves.
I used multi-label classification algorithms.
The first one is the learning curve of Binary Relevance, where the classifier is KNeighborsClassifier.
The second one is the curve of Classifier Chain using DecisionTreeClassifier,
and the last one is the curve of Label Powerset using GaussianNB.
Which one is the best, given that the accuracy and the F1 scores look like good results?
Learning curves are a tool for finding out which models benefit from additional training data. In other words, they indicate whether a model would give better results if the dataset were larger.
The best curve, in my opinion, is the one that reaches the best normalized score with the minimum number of training examples. It also has to converge quickly enough to a good score.
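For reference, here is a hedged sketch of how such curves are typically produced with scikit-learn's learning_curve; the KNeighborsClassifier, the synthetic data, and the F1 scoring are illustrative stand-ins, not the exact setup from the question:

```python
# Compute a learning curve: validation score as a function of training size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(),
    X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="f1",
)

# A curve whose validation score is still rising at the largest training size
# suggests the model would benefit from more data; a flat curve suggests it
# has already converged.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train={tr:.3f}  validation={va:.3f}")
```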
I have trained a model using several algorithms, including Random Forest from scikit-learn and LightGBM, and these models perform similarly in terms of accuracy and other statistics.
The issue is the inconsistent behaviour between the two algorithms in terms of feature importance. I used the default parameters, and I know they use different methods for calculating feature importance, but I would expect the features most highly correlated with the target to always have the most influence on the model's predictions. The Random Forest result makes more sense to me because those highly correlated features appear at the top, while this is not the case for LightGBM.
Is there a way to explain this behaviour, and is the LightGBM result trustworthy enough to present?
(Plots attached: Random Forest feature importance, LightGBM feature importance, and feature correlation with the target.)
I have had a similar issue. The default feature importance for LGBM is based on 'split', and when I changed this to 'gain', the plots gave similar results.
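As a minimal sketch of that fix (with an illustrative synthetic dataset), the two importance types can be pulled from the fitted booster directly:

```python
# Switch LightGBM feature importance from the default 'split' (number of
# times a feature is used) to 'gain' (total gain contributed by the feature).
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = lgb.LGBMClassifier(random_state=0).fit(X, y)

split_importance = model.booster_.feature_importance(importance_type="split")
gain_importance = model.booster_.feature_importance(importance_type="gain")

print("split:", split_importance)
print("gain: ", gain_importance)

# Equivalently, lgb.plot_importance(model, importance_type="gain") plots the
# gain-based ranking, which tends to agree better with the random forest plot.
```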
GBM is often shown to perform better than random forest, especially when the comparison is with LightGBM: a properly tuned LightGBM will most likely beat a random forest in both performance and speed.
GBM advantages:
More actively developed. Many new features are added to modern GBM implementations (XGBoost, LightGBM, CatBoost), which improves their performance, speed, and scalability.
GBM disadvantages:
A larger number of hyperparameters to tune.
A tendency to overfit easily.
If you aren't completely sure the hyperparameters are tuned correctly for the LightGBM, stick with the Random Forest; this will be easier to use and maintain.
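As a rough sketch of that trade-off (the dataset and parameter ranges below are illustrative assumptions, not a tuning recipe), one can compare a default random forest against a LightGBM model given a small random search:

```python
# Default random forest vs. LightGBM with a small hyperparameter search.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf_score = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

search = RandomizedSearchCV(
    lgb.LGBMClassifier(random_state=0),
    param_distributions={
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [100, 300, 500],
    },
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(f"Random forest (defaults): {rf_score:.3f}")
print(f"LightGBM (tuned):         {search.best_score_:.3f}")
```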
I have two data sets of different sizes.
1) Data set 1 has high dimensionality, with 4500 samples (sketches).
2) Data set 2 has low dimensionality, with 1000 samples (real data).
I assume that both data sets have the same distribution.
I want to train a non-linear SVM model using sklearn on the first data set (as pre-training), and after that I want to update the model on part of the second data set (to fit the model).
How can I implement this kind of update in sklearn? How can I update an SVM model?
In sklearn you can do this only for a linear kernel, using SGDClassifier (with an appropriate choice of loss/penalty terms: the loss should be hinge and the penalty L2). Incremental learning is supported through the partial_fit method, which is not implemented for either SVC or LinearSVC.
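Here is a minimal sketch of that linear-only option, with two synthetic arrays standing in for the 4500-sample and 1000-sample datasets from the question:

```python
# Linear SVM via SGD, pre-fit on one dataset and then updated on another.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.randn(4500, 50), rng.randint(0, 2, 4500)   # stand-in for "sketches"
X2, y2 = rng.randn(1000, 50), rng.randint(0, 2, 1000)   # stand-in for "real data"

# hinge loss + L2 penalty makes SGDClassifier a linear SVM.
clf = SGDClassifier(loss="hinge", penalty="l2")

# partial_fit needs the full set of classes on the first call.
clf.partial_fit(X1, y1, classes=np.array([0, 1]))

# Later, update the same model on (part of) the second dataset.
clf.partial_fit(X2[:500], y2[:500])
```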
Unfortunately, in practice fitting an SVM incrementally on such small datasets is rather useless. The SVM objective has an easily obtainable global solution, so you do not need pre-training of any form; in fact it should not matter at all, if you are thinking about pre-training in the neural-network sense. If correctly implemented, an SVM will completely forget the previous dataset. Why not learn on the whole data in one pass? That is what an SVM is supposed to do. (Unless you are working with some non-convex modification of the SVM, in which case pre-training makes sense.)
To sum up:
From both a theoretical and a practical point of view there is no point in pre-training an SVM. You can either learn only on the second dataset, or on both at the same time. Pre-training is only reasonable for methods that suffer from local minima (or hard convergence of any kind) and thus need to start near the actual solution to find a reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such a small dataset you will be just fine fitting the whole dataset at once.
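And a hedged sketch of that simpler recommendation: pool both datasets and fit a kernel SVM once (again using synthetic stand-in arrays for the two datasets in the question):

```python
# Fit a kernel SVM once on both datasets pooled together.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X1, y1 = rng.randn(4500, 50), rng.randint(0, 2, 4500)   # first dataset
X2, y2 = rng.randn(1000, 50), rng.randint(0, 2, 1000)   # second dataset

X_all = np.vstack([X1, X2])
y_all = np.concatenate([y1, y2])

svm = SVC(kernel="rbf").fit(X_all, y_all)
```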
I am new to all these methods and am trying to get a simple answer, or perhaps a pointer to a high-level explanation somewhere on the web. My googling only returned Kaggle sample code.
Are extra-trees and random forest essentially the same? And XGBoost uses boosting when it chooses the features for any particular tree, i.e. it samples the features. But then how do the other two algorithms select their features?
Thanks!
Extra-trees (ET), a.k.a. extremely randomized trees, is quite similar to random forest (RF). Both are bagging methods that aggregate a number of fully grown decision trees. At each split, RF considers only a subset of the features (e.g. a third of them) but evaluates every possible break point within those features and picks the best one. ET also restricts itself to a random subset of features, but evaluates only a few random break points per feature and picks the best of those. ET can bootstrap samples for each tree or use all samples; RF must use bootstrapping to work well.
XGBoost is an implementation of gradient boosting and can work with decision trees, typically smaller trees. Each tree is trained to correct the residuals of the previously trained trees. Gradient boosting can be more difficult to train, but can achieve a lower model bias than RF. For noisy data, bagging is likely to be the most promising approach; for low-noise data with complex structure, boosting is likely to be the most promising.
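To make the comparison concrete, here is a minimal sketch using the scikit-learn implementations of all three approaches (xgboost's XGBClassifier would be a drop-in alternative for the boosting part); the dataset and settings are illustrative:

```python
# Compare random forest, extra-trees, and gradient boosting under the same CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    # RF: bootstrap samples, best split among a random subset of features
    "Random forest": RandomForestClassifier(max_features="sqrt", random_state=0),
    # ET: optionally no bootstrap, random split thresholds within the subset
    "Extra trees": ExtraTreesClassifier(max_features="sqrt", random_state=0),
    # Boosting: small trees, each fitted to the errors of the previous ones
    "Gradient boosting": GradientBoostingClassifier(max_depth=3, random_state=0),
}

for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```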