I'm interested in using group lasso for a problem I have. Here is a link to the algorithm. I know R has a slick implementation, but am curious to see if python has something similar.
I think sklearn.linear_model.MultiTaskLasso might be kind of similar, but am not sure. Can anyone shed some light on this?
Whether or not to implement the Group Lasso in sklearn is discussed in this issue in the sklearn repo, where the conclusion so far is that it is too much of a niche model to justify the maintenance it would need if included in master.
Therefore, I have implemented a GroupLasso class that passes sklearn's check_estimator() in my Python/Cython package celer, which acts as a drop-in replacement for sklearn's Lasso, MultiTaskLasso, and sparse logistic regression, with faster solvers.
The solver uses coordinate descent, working set methods and extrapolation, which should allow it to scale to problems with millions of features.
It supports sparse and dense data, along with centering and normalization (centering sparse data is not trivial as it breaks the sparsity of the design matrix), and comes with a GroupLassoCV class to perform cross-validation.
In celer's documentation, there is an example showing how to use it.
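To make the penalty itself concrete, here is a minimal NumPy sketch of a group lasso solved by plain proximal gradient descent. It is not celer's solver (which, as described above, relies on coordinate descent, working sets and extrapolation). The function names, the unweighted penalty alpha * sum_g ||beta_g||_2, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def group_soft_threshold(v, threshold):
    """Shrink the whole vector v toward zero; return exact zeros if its norm is small."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v


def group_lasso(X, y, groups, alpha, n_iter=500):
    """Minimise 1/(2n) * ||y - X beta||^2 + alpha * sum_g ||beta_g||_2.

    `groups` is a list of index lists that partition the columns of X.
    """
    n_samples, n_features = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth part.
    step = n_samples / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n_samples
        z = beta - step * grad
        # The proximal step acts on each group as a block.
        for idx in groups:
            beta[idx] = group_soft_threshold(z[idx], step * alpha)
    return beta


# Toy data: the first group carries signal, the second group is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_beta = np.array([1.0, -2.0, 1.5, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.01 * rng.normal(size=200)
groups = [[0, 1, 2], [3, 4, 5]]

beta = group_lasso(X, y, groups, alpha=0.1)
print(beta)
```

The point to notice is that coefficients are kept or dropped one whole group at a time, which is what distinguishes the group lasso from the ordinary lasso.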
I've also looked into this; as far as I know, scikit-learn does not provide an implementation.
The MultiTaskLasso does something else. From the documentation:
"The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks."
In other words, the MultiTaskLasso is an implementation of the Lasso which is able to predict multiple targets at the same time (hence y is a 2D array). This problem is also known as 'multi-output regression' or 'multi-target regression'. If the tasks are related, such methods can improve on methods that model every task or target separately.
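A small sketch of what that shared-support constraint looks like in practice, using synthetic data (the sizes, the alpha value and the variable names are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n_samples, n_features, n_tasks = 100, 20, 3

# Only the first 4 features matter, and they matter for every task.
true_coef = np.zeros((n_tasks, n_features))
true_coef[:, :4] = rng.normal(size=(n_tasks, 4))

X = rng.normal(size=(n_samples, n_features))
Y = X @ true_coef.T + 0.1 * rng.normal(size=(n_samples, n_tasks))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)

# coef_ has shape (n_tasks, n_features). The penalty zeroes whole columns,
# so a feature is either used by every task or by none of them.
support = model.coef_ != 0
print(model.coef_.shape)
print(np.flatnonzero(support.any(axis=0)))
```

So the grouping here is across tasks for each feature, not across user-defined groups of features, which is why it is not a substitute for the group lasso.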
I have quickly looked for Distributed Lag Model in StatsModels but can't find one. The one that is similar is VAR model. Can I transform VAR model to Distributed Lag Model and how? It will be great if there are already other packages which have Distributed Lag Model. Please let me know if so.
Thanks!
If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.
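As a rough sketch of the finite distributed lag case with plain OLS (no autocorrelation model, so not FGLS), the lagged predictors can be stacked into a design matrix with NumPy. The helper name lag_matrix, the number of lags and the simulated series are assumptions for illustration:

```python
import numpy as np


def lag_matrix(x, n_lags):
    """Return a matrix whose row for time t holds x[t], x[t-1], ..., x[t-n_lags].

    The first n_lags time points are dropped because their lags are unavailable.
    """
    T = len(x)
    return np.column_stack([x[n_lags - k : T - k] for k in range(n_lags + 1)])


rng = np.random.default_rng(0)
n_lags = 2
x = rng.normal(size=500)
lags = lag_matrix(x, n_lags)

# Simulate y[t] = 1.0 + 0.5 x[t] + 0.3 x[t-1] + 0.1 x[t-2] + noise.
true_intercept = 1.0
true_lag_coef = np.array([0.5, 0.3, 0.1])
y = true_intercept + lags @ true_lag_coef + 0.05 * rng.normal(size=len(lags))

# Ordinary least squares on [intercept, lagged predictors].
design = np.column_stack([np.ones(len(lags)), lags])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef)
```

The same design matrix can be handed to any regression routine that accepts a covariate matrix, including one that models autocorrelated errors.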
Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
- it is not that hard to do (theory-wise)
- it is a pain to keep all the basic sklearn APIs and support all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could quite easily build your own optimizer based on scipy.optimize that supports this kind of sample weighting (it will be a bit slower, but robust and precise)!
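A minimal sketch of such a scipy.optimize-based solver follows. Before using it, check your installed version: as far as I know, recent scikit-learn releases accept sample_weight in ElasticNet.fit directly. The sketch assumes no intercept, normalises the weights to sum to one, and uses an objective of the same form as the one in scikit-learn's ElasticNet documentation. The function name weighted_elastic_net and the trick of splitting the coefficients into positive and negative parts (so that L-BFGS-B can handle the L1 term through bounds) are my own choices, not anything from the issue tracker.

```python
import numpy as np
from scipy.optimize import minimize


def weighted_elastic_net(X, y, sample_weight, alpha=1.0, l1_ratio=0.5):
    """Elastic net without intercept, with per-sample weights.

    Minimises 0.5 * sum_i w_i (y_i - x_i . beta)^2
              + alpha * l1_ratio * ||beta||_1
              + 0.5 * alpha * (1 - l1_ratio) * ||beta||_2^2
    where the weights w_i are rescaled to sum to one.
    """
    n_features = X.shape[1]
    w = np.asarray(sample_weight, dtype=float)
    w = w / w.sum()

    def objective(z):
        # beta = pos - neg with pos, neg >= 0 makes the L1 term linear.
        pos, neg = z[:n_features], z[n_features:]
        beta = pos - neg
        resid = y - X @ beta
        value = (
            0.5 * np.sum(w * resid**2)
            + alpha * l1_ratio * np.sum(pos + neg)
            + 0.5 * alpha * (1 - l1_ratio) * np.sum(beta**2)
        )
        grad_beta = -X.T @ (w * resid) + alpha * (1 - l1_ratio) * beta
        grad = np.concatenate(
            [grad_beta + alpha * l1_ratio, -grad_beta + alpha * l1_ratio]
        )
        return value, grad

    result = minimize(
        objective,
        np.zeros(2 * n_features),
        jac=True,
        method="L-BFGS-B",
        bounds=[(0, None)] * (2 * n_features),
    )
    return result.x[:n_features] - result.x[n_features:]


# Toy check: the last 50 targets are corrupted and get weight zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
y = X @ true_beta + 0.05 * rng.normal(size=200)
y[150:] = 50.0
weights = np.ones(200)
weights[150:] = 0.0

beta_hat = weighted_elastic_net(X, y, weights, alpha=0.01, l1_ratio=0.5)
print(beta_hat)
```

This is a dense, general-purpose approach, so it will not match the speed of a dedicated coordinate descent solver on large or sparse problems.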
I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.
In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
What would be the quickest implementation of OLS for larger datasets? The statsmodels OLS implementation seems to use NumPy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially quick implementation?
Or maybe an mcmc based approach using pymc is quickest...
Update 1: Seems that the scikit learn implementation of LinearRegression is a wrapper for the scipy implementation.
Update 2: Scipy OLS via scikit learn LinearRegression is twice as fast as statsmodels OLS in my very limited tests...
The scikit-learn SGDRegressor class is (iirc) the fastest, but would probably be more difficult to tune than a simple LinearRegression.
I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data: if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.
Once you find a few candidate models, then you can try those on the whole dataset.
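For the VIF-based procedure in the question specifically, the repeated regressions can be avoided altogether: the VIFs of all predictors are the diagonal entries of the inverse of their correlation matrix, so each elimination step needs one small matrix inversion instead of one OLS fit per variable. A sketch, assuming the predictors (or a subsample of them) fit in memory; the function names and the threshold of 10 are illustrative:

```python
import numpy as np


def vif_from_correlation(X):
    """VIF of every column at once: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold=10.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        vif = vif_from_correlation(X)
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names


# Toy data: column d is almost exactly a + b, so it is highly collinear.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 1000))
d = a + b + 0.01 * rng.normal(size=1000)
X = np.column_stack([a, b, c, d])

X_kept, kept = drop_high_vif(X, ["a", "b", "c", "d"], threshold=10.0)
print(kept)
```

For data that does not fit in memory, the correlation matrix itself can be accumulated chunk by chunk, since it only depends on sums and cross-products of the columns.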
Stepwise methods are not a good way to perform model selection, as they are entirely ad hoc, and depend highly on which direction you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is to use reversible-jump MCMC, which fits models over the entire model space, and not just the parameter space of a particular model.
PyMC does not implement rjMCMC itself, but it can be implemented. Note also that PyMC 3 makes it really easy to fit regression models using its new glm submodule.
I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At its core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised, BoW data in scikit-learn?
Thanks!
I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depend on the type of vectors you have (frequency counts, tfidf), and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
An increasingly popular approach uses transformation by random projections, Random Indexing. You can do this in scikit-learn with the functions in random_projection.
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
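A short sketch of both options on a sparse matrix. The random matrix stands in for your BoW/tfidf data, and the shapes and component counts are placeholders picked from the ranges above:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import SparseRandomProjection

# Stand-in for a (n_documents, vocabulary_size) sparse CSR matrix.
X = sp.random(200, 5000, density=0.01, format="csr", random_state=0)

# LSA-style reduction: works directly on sparse input, returns a dense array.
svd = TruncatedSVD(n_components=100, random_state=0)
X_svd = svd.fit_transform(X)

# Random projection: cheaper to fit, usually needs more output dimensions.
proj = SparseRandomProjection(n_components=500, random_state=0)
X_proj = proj.fit_transform(X)

print(X_svd.shape, X_proj.shape)
```

Both transformers are unsupervised, so the labels play no role at this stage; they only come in when the reduced matrix is passed to a classifier.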
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be a better option. Beware, it is slow and it only takes pure frequency counts (no tfidf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would make a scikit-learn pipeline with a vectorizer, dimensionality reduction options and a classifier, and would carry out a massive parameter search. In this way, you will see what gives you the best results with the label set you have.
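A toy-sized sketch of such a pipeline for multi-label data. The corpus, the labels, the choice of one-vs-rest logistic regression as classifier, and the tiny parameter grid are all assumptions for illustration; a real search would cover far more options:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "my dog chased the cat",
    "the central bank raised interest rates",
    "pet owners spend money on dog food",
    "shares in the pet food company rose",
]
labels = [
    ["animals"], ["animals"], ["finance"], ["finance"],
    ["animals"], ["finance"], ["animals", "finance"], ["animals", "finance"],
]

# Turn the label lists into a binary indicator matrix, one column per label.
Y = MultiLabelBinarizer().fit_transform(labels)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(random_state=0)),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Search over the number of latent dimensions; other steps can be added to the grid.
search = GridSearchCV(pipeline, {"svd__n_components": [2, 3]}, cv=2)
search.fit(docs, Y)

pred = search.predict(["the dog food company sold shares"])
print(search.best_params_, pred)
```

Because the reduction step sits inside the pipeline, it is refit on each training fold, so the parameter search does not leak information from the held-out documents.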
You can use latent Dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. At the time of writing, scikit-learn did not have an LDA implementation, but gensim did (later scikit-learn releases added sklearn.decomposition.LatentDirichletAllocation).
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that you start here (note: it still assumes a high level of mathematical maturity).
Several existing scikit modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It finds a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
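A minimal sketch of LDA used as a supervised projection. Note the assumptions: the data here is dense, synthetic, and has a single class label per sample, so multi-label sparse BoW data as in the question would need to be adapted first (for example, reduced with TruncatedSVD and handled one label at a time).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# With 3 classes, LDA can project onto at most n_classes - 1 = 2 dimensions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)
print(X_low.shape)
```

The cap of n_classes - 1 output dimensions is the main practical limitation when the number of classes is small.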
Cross decomposition includes methods like Partial Least Squares, which fit linear regression models for multidimensional targets via a projection through a lower-dimensional intermediate space. It is a lot like a single-hidden-layer neural net without the sigmoids.
These are linear regression methods, but you could apply a 0-1 encoding to your target signal and use these models anyway.
You could use an L1-regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping to get a more stable feature set (it was available in older scikit-learn releases and has since been removed).
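A sketch of L1-based feature selection with SelectFromModel, on synthetic single-label data. The value of C is an arbitrary assumption that would need tuning, and the penalty="l1" with solver="liblinear" spelling is the one I know from existing releases, so check it against your installed version:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300,
    n_features=50,
    n_informative=5,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# A smaller C means stronger L1 regularization and fewer surviving features.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)

# Keep only the columns whose coefficient was not driven to zero.
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
```

Unlike the projection methods, this keeps a subset of the original columns, which works on sparse input and keeps the selected features interpretable as words.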
Try Isomap. There's a very simple built-in class for it in scikit-learn (sklearn.manifold.Isomap). Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The role of the hidden layer is by definition optimised to distinguish between the classes, since that's what's directly optimised when the weights are set.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).
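A sketch of reading the hidden-layer activations, using scikit-learn's MLPClassifier as a stand-in for whichever network library you prefer. The data is synthetic and single-label, and the layer size of 8 is an arbitrary assumption. MLPClassifier does not expose hidden activations directly, so they are recomputed here from the fitted weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# One tanh hidden layer with 8 units; the output layer is softmax for multiclass targets.
mlp = MLPClassifier(
    hidden_layer_sizes=(8,), activation="tanh", max_iter=500, random_state=0
).fit(X, y)

# Forward pass up to the hidden layer: this is the learned 8-dimensional representation.
hidden = np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])
print(hidden.shape)
```

Each row of hidden is the reduced representation of the corresponding input, learned with the class labels as the training signal.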
I would like to ask if anyone has an idea or example of how to do support vector regression in Python with high-dimensional output (more than one dimension) using a Python binding of libsvm. I checked the examples and they all assume the output to be one-dimensional.
libsvm might not be the best tool for this task.
The problem you describe is called multivariate regression, and for regression problems SVMs are not necessarily the best choice.
You could try something like group lasso (http://www.di.ens.fr/~fbach/grouplasso/index.htm - matlab) or sparse group lasso (http://spams-devel.gforge.inria.fr/ - seems to have a python interface), which solve the multivariate regression problem with different types of regularization.
Support Vector Machines as a mathematical framework is formulated in terms of a single prediction variable. Hence most libraries implementing them will reflect this as using one single target variable in their API.
What you could do is train a single SVM model for each target dimension in your data.
- on the plus side, you can train them in parallel on a cluster, as each model is independent of the others
- on the minus side, the sub-models will share nothing and won't benefit from what they individually discovered in the structure of the input data, and they may need a lot of memory to store, as they will have no shared intermediate representation
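A sketch of the one-model-per-target approach, using scikit-learn's SVR (which is built on libsvm) wrapped in MultiOutputRegressor instead of the raw libsvm binding. The toy data and hyperparameters are assumptions for illustration:

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))

# Three target dimensions, each a different function of the inputs.
Y = np.column_stack([np.sin(X[:, 0]), X[:, 0] * X[:, 1], np.cos(X[:, 1])])

# Fits one independent SVR per column of Y; n_jobs can parallelise the fits.
model = MultiOutputRegressor(SVR(kernel="rbf", C=10.0), n_jobs=None).fit(X, Y)

pred = model.predict(X[:5])
print(len(model.estimators_), pred.shape)
```

With the raw libsvm binding the same idea amounts to a loop that trains and stores one model per target column.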
Variants of SVMs can probably be devised in a multi-task learning setting to learn some common kernel-based intermediate representation suitable for reuse to predict multi-dimensional targets; however, this is not implemented in libsvm AFAIK. Search for "multi-task learning SVM" if you want to learn more.
Alternatively, multi-layer perceptrons (a kind of feed forward neural networks) can naturally deal with multi-dimensional outcomes and hence should be better at sharing intermediate representations of the data reused across targets, especially if they are deep enough with the first layers pre-trained in an unsupervised manner using an autoencoder objective function.
You might want to have a look at http://deeplearning.net/tutorial/ for a nice introduction to various neural network architectures and practical tools and examples to implement them efficiently.