I'm working on a time series dataset, so when fitting scikit-learn's GaussianMixture(), I need the features (timestamps) to be treated as dependent. However, after examining the source code, I can't find a parameter for customizing the covariance matrix.
With my limited statistics knowledge, I'm curious how I can modify the covariance matrix during the M-step to incorporate time dependency into the GMM. Thank you very much.
Here is the source code; the change I want to make is in the _estimate_gaussian_parameters() function:
https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/mixture/gaussian_mixture.py#L435
With darksky's help, I learned that the covariance structure is already built in as an option. The covariance_type parameter has four options:
'full' (each component has its own general covariance matrix),
'tied' (all components share the same general covariance matrix),
'diag' (each component has its own diagonal covariance matrix),
'spherical' (each component has its own single variance).
In my understanding, then, 'spherical' is used for univariate data and 'diag' for multivariate data with independent features. Therefore, one should use either 'full' or 'tied' to model multivariate, dependent features.
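For illustration, here is a minimal sketch (with made-up toy data) of fitting with covariance_type='full', so that each component estimates a dense covariance matrix and cross-feature dependencies are captured during the M-step:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: 500 samples, 4 dependent features (standing in for timestamps).
rng = np.random.RandomState(0)
X = rng.multivariate_normal(
    mean=np.zeros(4),
    cov=0.5 * np.eye(4) + 0.5,  # off-diagonal terms make the features dependent
    size=500,
)

# 'full' lets each component fit its own dense covariance matrix, so
# cross-feature dependencies are estimated rather than assumed away.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)
print(gmm.covariances_.shape)  # (n_components, n_features, n_features)
```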
I'm currently trying to train a GP regression model in GPflow which will predict precipitation values given some meteorological inputs. I'm using a Linear+RBF+WhiteNoise kernel, which seems appropriate given the set of predictors I'm using.
My problem at the moment is that when I get the model to predict new values, it has a tendency to predict negative precipitation - see attached figure.
How can I "enforce" physical constraints when building the model? The training data doesn't contain any negative precipitation values, but it does contain a lot of values close to zero, which I assume means the GPR model isn't learning the "precipitation must be >=0" constraint very well.
If there's a way of explicitly enforcing a constraint like this it'd be perfect, but I'm not sure how that would work. Would this require a different optimization algorithm? Or is it possible to somehow build this constraint into the kernel structure?
This is more of a question for CrossValidated ... A Gaussian process is essentially a distribution over functions with Gaussian marginals: the predictive distribution of f(x) at any point is by construction Gaussian, hence unconstrained. E.g. if you have lots of observations close to zero, your model will consider values just below zero to be very likely as well.
If your observations are strictly positive, you could use a different likelihood, e.g. Exponential (gpflow.likelihoods.Exponential) or Beta (gpflow.likelihoods.Beta). Note that model.predict_y() always returns mean and variance, and for non-Gaussian likelihoods the variance may not actually be what you want. In practice, you're more likely to care about quantiles (e.g. 10%-90% confidence interval); there is an open issue on the GPflow github that relates to this. Which likelihood you use is part of your modelling choice, and depends on your data.
The simplest practical answer to your problem is to model the log-precipitation: if your original dataset is X and Y (with Y > 0 for all entries), compute logY = np.log(Y) and create your GP model e.g. using gpflow.models.GPR((X, logY), kernel). You then predict logY at the test points and convert the predictions back from log-precipitation into precipitation space. (This is equivalent to a LogNormal likelihood, which isn't currently implemented in GPflow, though this would be straightforward.)
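A minimal sketch of this log-transform approach, assuming the GPflow 2.x API and made-up toy data in place of the real meteorological inputs:

```python
import numpy as np
import gpflow

# Toy data standing in for (inputs, precipitation); values are made up.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 1))
Y = np.exp(0.3 * X + 0.1 * rng.standard_normal((100, 1)))  # strictly positive

logY = np.log(Y)  # model log-precipitation instead of raw precipitation

kernel = gpflow.kernels.Linear() + gpflow.kernels.RBF() + gpflow.kernels.White()
model = gpflow.models.GPR((X, logY), kernel=kernel)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Predict in log space, then map back. Under the implied log-normal,
# exp(mu) is the median and exp(mu + var / 2) is the mean; both are positive.
Xtest = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
mu, var = model.predict_f(Xtest)
mu, var = mu.numpy(), var.numpy()
precip_median = np.exp(mu)
precip_mean = np.exp(mu + var / 2.0)
```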
I have quickly looked for a Distributed Lag Model in statsmodels but can't find one. The closest is the VAR model. Can I transform a VAR model into a Distributed Lag Model, and how? It would be great if other packages already implement Distributed Lag Models; please let me know if so.
Thanks!
If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.
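A minimal sketch of the finite distributed lag approach with statsmodels (toy series; the lag depth and error model are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy series standing in for the predictor and target; values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.standard_normal(200)})
df["y"] = (0.5 * df["x"] + 0.3 * df["x"].shift(1) + 0.1 * df["x"].shift(2)
           + 0.1 * rng.standard_normal(200))

# Finite distributed lag model with 2 lags: regress y on x_t, x_{t-1}, x_{t-2}.
n_lags = 2
for k in range(1, n_lags + 1):
    df[f"x_lag{k}"] = df["x"].shift(k)
df = df.dropna()

exog = sm.add_constant(df[["x"] + [f"x_lag{k}" for k in range(1, n_lags + 1)]])
ols = sm.OLS(df["y"], exog).fit()
print(ols.params)

# For FGLS with AR(1) errors, statsmodels' GLSAR is one option:
fgls = sm.GLSAR(df["y"], exog, rho=1).iterative_fit(maxiter=5)
```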
I'm interested in using group lasso for a problem I have. Here is a link to the algorithm. I know R has a slick implementation, but am curious to see if python has something similar.
I think sklearn.linear_model.MultiTaskLasso might be kind of similar, but am not sure. Can anyone shed some light on this?
Whether or not to implement the Group Lasso in sklearn is discussed in this issue in the sklearn repo, where the conclusion so far is that it is too much of a niche model to justify the maintenance it would need if included in master.
Therefore, I have implemented a GroupLasso class in my Python/Cython package celer; it passes sklearn's check_estimator() and acts as a drop-in replacement for sklearn's Lasso, MultiTaskLasso, and sparse logistic regression, with faster solvers.
The solver uses coordinate descent, working set methods and extrapolation, which should allow it to scale to problems with millions of features.
It supports sparse and dense data, along with centering and normalization (centering sparse data is not trivial as it breaks the sparsity of the design matrix), and comes with a GroupLassoCV class to perform cross-validation.
In celer's documentation, there is an example showing how to use it.
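Roughly, usage looks like this (toy data; see the documentation for the exact semantics of the groups parameter, which here is assumed to take a contiguous group size):

```python
import numpy as np
from celer import GroupLasso  # pip install celer

# Toy data: 100 samples, 30 features arranged in 6 contiguous groups of 5;
# only the first group actually drives the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(100)

# groups=5 means contiguous groups of 5 features; a list of index lists
# can be used instead for non-contiguous groups.
clf = GroupLasso(groups=5, alpha=0.5)
clf.fit(X, y)
print(clf.coef_.reshape(6, 5))  # whole groups are zeroed out together
```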
I've also looked into this; as far as I know, scikit-learn does not provide this implementation.
The MultiTaskLasso does something else. From the documentation:
"The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks."
In other words, the MultiTaskLasso is an implementation of the Lasso that can predict multiple targets at the same time (hence y is a 2D array). This problem is also known as 'multi-output regression' or 'multi-target regression'. If the tasks are related, such methods can outperform approaches that model each task or target separately.
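To make the distinction concrete, here is a small sketch (toy data) in which the same few features drive every task, which is exactly the structure MultiTaskLasso exploits:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Toy multi-output data: Y has shape (n_samples, n_tasks).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
W = np.zeros((20, 3))
W[:5, :] = rng.standard_normal((5, 3))  # the same 5 features drive all 3 tasks
Y = X @ W + 0.1 * rng.standard_normal((100, 3))

clf = MultiTaskLasso(alpha=0.1).fit(X, Y)
# The selected (non-zero) features are shared across all tasks:
print(np.flatnonzero(np.any(clf.coef_ != 0, axis=0)))
```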
I'm using the locally linear embedding (LLE) method in scikit-learn for dimensionality reduction. The only examples I could find are in the scikit-learn documentation here and here, but I'm not sure how I should choose the method's parameters. In particular, is there any relation between the dimension of the data points or the number of samples and the number of neighbors (n_neighbors) and number of components (n_components)? All of the examples in scikit-learn use n_components=2; is this always the case? Finally, is there any other parameter that is critical to tune, or should I use the default settings for the rest?
Is there any relation between the dimension of data points or the number of samples and the number of neighbors (n_neighbors) and number of components (n_components)?
Generally speaking, they are not related. n_neighbors is usually chosen based on the distances among samples; in particular, if you know the classes of your samples, it is better to set n_neighbors slightly greater than the number of samples in each class. Meanwhile n_components, i.e. the reduced dimensionality, is determined by how redundant the data is in its original dimension. Based on the specific data distribution and your own requirements, you can choose the appropriate dimension for the projection space.
n_components=2 maps the original high-dimensional space into a 2D space; it is really just a special case, chosen in the examples because it is easy to visualize.
Is there any other parameter that is critical to tune, or should I use the default settings for the rest?
Here are several other parameters you should take care of.
reg, the weight regularization term, which is not used in the original LLE paper. If you don't want it, simply set it to zero; note that the default value is 1e-3, which is already quite small.
eigen_solver. If your dataset is small, using 'dense' is recommended for accuracy. You can research this further.
max_iter. The default value of max_iter is only 100, which often means the results have not converged. If the results are not stable, choose a larger integer.
You can use a grid search (e.g. scikit-learn's GridSearchCV) to choose the best values for you.
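Putting these together, a minimal sketch (toy manifold data; the parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(
    n_neighbors=12,        # choose based on the distances among your samples
    n_components=2,        # target dimensionality; need not be 2
    reg=1e-3,              # weight regularization (set to 0 to disable)
    eigen_solver="dense",  # recommended for small datasets
    max_iter=500,          # only used by the 'arpack' solver
)
X_reduced = lle.fit_transform(X)
print(X_reduced.shape)  # (1000, 2)
```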
I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At its core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other techniques people have used when working with supervised BoW data in scikit-learn?
Thanks!
I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and there are various ways to do it with text data. The applicability and performance depend on the type of vectors you have (frequency counts, tf-idf), and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
An increasingly popular approach uses transformation by random projections (Random Indexing). You can do this in scikit-learn with the functions in random_projection.
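A minimal sketch of both options on a toy corpus (real values of n_components would be in the ranges given below):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import SparseRandomProjection

docs = ["the cat sat on the mat", "dogs and cats", "the dog barked"]  # toy corpus
X = TfidfVectorizer().fit_transform(docs)  # sparse CSR matrix

# LSA: truncated SVD works directly on sparse tf-idf matrices.
X_lsa = TruncatedSVD(n_components=2).fit_transform(X)

# Random projection: very cheap, approximately preserves pairwise distances.
X_rp = SparseRandomProjection(n_components=2, random_state=0).fit_transform(X)
```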
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be a better option. Beware: it is slow and it only takes pure frequency counts (no tf-idf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would build a scikit-learn pipeline with a vectorizer, dimensionality reduction options and a classifier, and carry out a massive parameter search. That way you will see what gives the best results with the label set you have.
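A rough sketch of such a pipeline (toy corpus and single-label targets for brevity; a real multi-label setup would binarize the labels and wrap the classifier in OneVsRestClassifier):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

docs = ["rain tomorrow", "sunny today", "storm warning", "clear skies"] * 10
labels = ["wet", "dry", "wet", "dry"] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD()),
    ("clf", LogisticRegression()),
])
grid = GridSearchCV(
    pipe,
    param_grid={"svd__n_components": [2, 5], "clf__C": [0.1, 1.0, 10.0]},
    cv=2,
)
grid.fit(docs, labels)
print(grid.best_params_)
```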
You can use latent dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest you start here (note: it still assumes a high level of mathematical maturity).
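A minimal gensim sketch on a toy tokenized corpus:

```python
from gensim import corpora, models

# Toy tokenized documents; a real corpus would be much larger.
texts = [
    ["rain", "flood", "storm"],
    ["sun", "heat", "drought"],
    ["storm", "wind", "rain"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda[corpus[0]])  # topic distribution for the first document
```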
Several existing scikit modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It finds a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
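A minimal sketch on the iris data (note that LDA yields at most n_classes - 1 components and requires dense input, so sparse BoW vectors would need densifying or a prior reduction step):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes -> at most 2 components

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)  # supervised: uses the labels y
print(X_reduced.shape)  # (150, 2)
```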
Cross decomposition includes methods like Partial Least Squares (PLS), which fit linear regression models for multidimensional targets via a projection through a lower-dimensional intermediate space. It is a lot like a single hidden layer neural net without the sigmoids.
These are linear regression methods, but you can apply a 0-1 encoding to your target signal and use these models anyway.
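A rough sketch of that idea (iris data for brevity):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)
Y = LabelBinarizer().fit_transform(y)  # 0-1 encoding of the class labels

# PLS regresses the indicator matrix through a low-dimensional latent space;
# transform() returns the reduced representation.
pls = PLSRegression(n_components=2).fit(X, Y)
X_reduced = pls.transform(X)
print(X_reduced.shape)  # (150, 2)
```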
You could use an L1-regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping to get a more stable feature set.
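A minimal sketch of L1-based selection (note that RandomizedLogisticRegression has since been removed from scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The L1 penalty drives uninformative coefficients to exactly zero;
# SelectFromModel keeps only the surviving features.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(l1_clf).fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)
```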
Try Isomap. There's a super simple built-in implementation in scikit-learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
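A minimal sketch:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
X_reduced = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(X_reduced.shape)  # (1000, 2)
```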
Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The hidden-layer representation is by construction optimised to distinguish between the classes, since that is what is directly optimised when the weights are trained.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).
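A sketch of recovering the hidden representation from a fitted scikit-learn MLPClassifier (the activations aren't exposed directly, but can be recomputed from the learned weights):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# tanh on the hidden layer; MLPClassifier applies softmax on the output
# layer automatically for multi-class problems.
clf = MLPClassifier(hidden_layer_sizes=(2,), activation="tanh",
                    max_iter=2000, random_state=0).fit(X, y)

# First-layer activations: tanh(X @ W0 + b0), using the learned weights.
hidden = np.tanh(X @ clf.coefs_[0] + clf.intercepts_[0])
print(hidden.shape)  # (150, 2): the learned low-dimensional representation
```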