I've seen some papers proposing an information criterion for SVMs (e.g. Demyanov, Bailey, Ramamohanarao & Leckie (2012)), but there doesn't seem to be any implementation of such a method in Python. For instance, scikit-learn only provides such methods for linear models and random forest/gradient boosting algorithms.
Is there any implementation of a potential Information Criterion for SVM in Python?
You can use an SVM as a non-linear model by changing its kernel.
For example, kernel='poly'.
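As a minimal sketch of that suggestion, the snippet below compares a few kernels by cross-validated accuracy. The toy dataset (make_circles) and the kernel settings are assumptions chosen for illustration, and cross-validation here stands in for model selection; it is not the information criterion from the paper mentioned in the question.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data: two concentric rings, not separable by a straight line.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # degree and coef0 only affect the 'poly' kernel; the others ignore them.
    model = SVC(kernel=kernel, degree=2, coef0=1.0, gamma="scale")
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

for kernel, score in scores.items():
    print(f"{kernel:>6}: mean CV accuracy = {score:.3f}")
```

On data like this the linear kernel should stay near chance level, while the polynomial and RBF kernels can separate the rings.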
Let's assume we're dealing with continuous features and responses. We fit a linear regression model (let's say first order) and after CV we get a reasonably good r^2 (let's say r^2=0.8).
Why do we go for other ML algorithms? I've read some research papers that tried different ML algorithms and took the simple linear model as a baseline for comparison. In these papers, the linear model outperformed the other algorithms. What I have difficulty understanding is why we go for other ML algorithms then. Why can't we just be satisfied with the linear model, especially in the specific case where other algorithms perform poorly?
The other question is what do they gain from presenting the other algorithms in their research papers if these algorithms performed poorly?
The best model for predictive problems with a continuous output is a regression model, especially one built with a neural network (polynomial or linear) with hyperparameter tuning suited to the problem.
Other ML algorithms such as decision trees or SVMs are designed mainly for classification. On paper they can also do regression, but in my experience they predict new values poorly.
In research, however, people always try to find better ways to predict values than plain regression, just as classification progressed from logistic regression to decision trees, and now to SVMs, ensemble models and deep learning.
I think the answer is because you never know.
especially in the specific case where other algorithms perform poorly?
You know they performed poorly because someone tried those models. It's always worth trying various models.
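A rough sketch of what "trying various models" against a linear baseline can look like is shown below. The synthetic dataset (make_regression, which has a truly linear signal) and the model settings are assumptions for illustration, not taken from the papers discussed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic data: linear signal plus Gaussian noise (assumed for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr_rbf": SVR(kernel="rbf"),
}

# Mean 5-fold cross-validated R^2 for each candidate model.
r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in r2.items():
    print(f"{name:>14}: mean CV r^2 = {score:.3f}")
```

When the underlying relationship really is linear, the baseline tends to come out ahead, but that only becomes visible once the alternatives have been run.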
I understand Random Forest models can be used for both classification and regression. Are there more specific criteria for determining when a random forest model would perform better than common regression models (linear, Lasso, etc.) for estimating values, or than logistic regression for classification?
A random forest is built from a collection of decision trees. It is a supervised ensemble learning algorithm that reduces the over-fitting seen in individual decision trees.
In machine learning theory, no single model outperforms all others on every problem, so it is always recommended to try out different models before settling on one.
That said, some models are preferred for data of a particular nature. Each model makes intrinsic assumptions about the data, and the model whose assumptions are best aligned with the data generally works better. For instance, a logistic model is suitable for categorical outcomes when the data has a smooth linear decision boundary, whereas a random forest does not assume a smooth linear decision boundary. Hence, the nature of your data makes a difference in your choice of model, and it is always good to try several before reaching a conclusion.
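To illustrate the point about decision boundaries, here is a small sketch comparing logistic regression with a random forest on data whose class boundary is curved. The dataset (make_moons) and all parameter values are assumptions chosen for the example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two interleaving half-moons: the true class boundary is not a straight line.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

candidates = {
    "logistic": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in accuracy.items():
    print(f"{name:>14}: mean CV accuracy = {score:.3f}")
```

Here the forest should score higher because it can follow the curved boundary; on data with a genuinely linear boundary the ranking can easily reverse.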
I am trying to perform cross-validation on NMF to find the best parameters to use. I tried using scikit-learn's cross-validation but get an error stating that NMF does not have a scoring method. Could anyone here help me with that? Thank you all
NMF is an unsupervised machine learning method. This generally means that there is no labeled data that can serve as a 'gold standard'.
In the case of NMF you cannot define the 'desired' outcome beforehand.
The cross validation in sklearn is designed for supervised machine learning, in which you have labeled data by definition.
Cross-validation holds out subsets of the labeled data, trains a model on the remaining data, and evaluates this model on the held-out set. Any metric can be used for this evaluation, for example accuracy, precision, recall or F-measure, but computing these measures requires labeled data.
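One possible workaround, sketched below, is to score NMF by how well it reconstructs held-out rows instead of by a label-based metric. This is a swapped-in heuristic rather than something scikit-learn's cross-validation provides; the helper name heldout_error and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic non-negative data with a true rank of 3 (assumed for illustration).
X = rng.random((200, 3)) @ rng.random((3, 20)) + 0.01 * rng.random((200, 20))

def heldout_error(X, n_components, n_splits=5):
    """Mean relative reconstruction error on held-out rows."""
    errors = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        model = NMF(n_components=n_components, init="nndsvda",
                    max_iter=1000, random_state=0)
        model.fit(X[train_idx])
        # Project held-out rows onto the learned components, then reconstruct.
        W_test = model.transform(X[test_idx])
        X_hat = W_test @ model.components_
        errors.append(np.linalg.norm(X[test_idx] - X_hat)
                      / np.linalg.norm(X[test_idx]))
    return float(np.mean(errors))

errors = {k: heldout_error(X, k) for k in (1, 2, 3, 4, 5)}
for k, err in errors.items():
    print(f"n_components={k}: held-out relative error = {err:.4f}")
```

Because the held-out rows are still fitted when they are projected, this error tends to keep shrinking as components are added, so it is better read as an elbow plot than as a strict cross-validation score.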
I'm trying a range of online classifiers in the scikit-learn library to train a model on a huge dataset. I found that many classifiers support partial_fit, which allows incremental learning. I want to use the Ridge Regression classifier in this setting, but could not find an implementation that supports it. Is there an alternative model in sklearn that can do this?
sklearn.linear_model.SGDClassifier supports partial_fit; its loss function can be ‘hinge’, ‘log’, ‘modified_huber’, ‘squared_hinge’ or ‘perceptron’.
sklearn.linear_model.SGDRegressor also supports partial_fit; its default loss function is ‘squared_loss’, and the possible values are ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’ or ‘squared_epsilon_insensitive’. Recent scikit-learn releases rename ‘log’ to ‘log_loss’ and ‘squared_loss’ to ‘squared_error’.
In Python's sklearn library, both RandomizedLogisticRegression and RandomizedLasso are supported as feature selection methods.
However, they both use an L1 (Lasso) penalty, and I am not sure why both of them are implemented. In fact, I thought that Lasso regression was just another term for L1-regularized logistic regression, but there seems to be some difference.
I think even a linear SVM with an L1 penalty (combined with resampling) would produce a similar result.
Are there significant differences among them?
From: http://scikit-learn.org/stable/modules/feature_selection.html#randomized-l1
RandomizedLasso implements this strategy for regression settings, using the Lasso, while RandomizedLogisticRegression uses the logistic regression and is suitable for classification tasks. To get a full path of stability scores you can use lasso_stability_path.
RandomizedLasso is used for regression in which the outcome is continuous. RandomizedLogisticRegression on the other hand is for classification in which the outcome is a class label.
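The distinction can be sketched by hand with resampling: fit an L1-penalized model on many random subsamples and count how often each feature keeps a non-zero coefficient, using Lasso for a continuous outcome and L1-penalized logistic regression for a class label. This is a simplified illustration and not the library's own implementation; the helper selection_frequency, the synthetic data and the penalty strengths are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def selection_frequency(make_model, X, y, n_resamples=50,
                        sample_fraction=0.75, seed=0):
    """Fraction of random subsamples in which each feature has a non-zero coefficient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_resamples):
        idx = rng.choice(n_samples, size=int(sample_fraction * n_samples),
                         replace=False)
        model = make_model().fit(X[idx], y[idx])
        counts += (np.ravel(model.coef_) != 0)
    return counts / n_resamples

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
signal = 3.0 * X[:, 0] - 2.0 * X[:, 1]           # only features 0 and 1 matter

y_continuous = signal + rng.normal(size=200)                # regression target
y_label = (signal + rng.normal(size=200) > 0).astype(int)   # class label

# Continuous outcome: Lasso. Class label: L1-penalized logistic regression.
freq_reg = selection_frequency(lambda: Lasso(alpha=0.3), X, y_continuous)
freq_clf = selection_frequency(
    lambda: LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    X, y_label)

print("regression selection frequency:    ", np.round(freq_reg, 2))
print("classification selection frequency:", np.round(freq_clf, 2))
```

In both cases the informative features should be selected far more often than the noise features; the only difference is which loss the L1 penalty is attached to.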