I'd like to (efficiently) evaluate a Gaussian mixture model (GMM) over an (n,d) list of datapoints, given the GMM parameters ($\pi_k, \mu_k, \Sigma_k$). I can't find a way to do this using standard sklearn or scipy packages.
EDIT: assume there are n datapoints of dimension d, so the data is (n,d), and the GMM has k components. For example, the covariance matrix of the k-th component, $\Sigma_k$, is (d,d), and altogether $\Sigma$ is (k,d,d).
For example, if you first fit a GMM in sklearn, you can call score_samples, but this only works if I'm fitting to data. Or, in scipy you can run a for-loop over multivariate_normal.pdf with each set of parameters, and do a weighted sum/dot product, but this is slow. Checking the source code of either was not illuminating (for me).
I'm currently hacking something together with n-d arrays and tensor dot products .. oy .. hoping someone has a better way?
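One possible vectorized approach, as a minimal sketch rather than a definitive implementation: take a batched Cholesky factor of the (k,d,d) covariances, compute all Mahalanobis terms at once, and combine the components with logsumexp. The function name `gmm_log_pdf` and its argument names are hypothetical; only NumPy and `scipy.special.logsumexp` are used.

```python
import numpy as np
from scipy.special import logsumexp


def gmm_log_pdf(X, weights, means, covs):
    """Log density of a GMM at each row of X.

    Shapes (assumed): X (n, d), weights (k,), means (k, d), covs (k, d, d).
    Returns an array of shape (n,).
    """
    n, d = X.shape
    # Batched Cholesky factorization: covs[j] = L[j] @ L[j].T
    L = np.linalg.cholesky(covs)                          # (k, d, d)
    diff = X[None, :, :] - means[:, None, :]              # (k, n, d)
    # Solve L[j] @ y = diff[j].T for every component j at once
    y = np.linalg.solve(L, diff.transpose(0, 2, 1))       # (k, d, n)
    maha = np.sum(y ** 2, axis=1)                         # (k, n) squared Mahalanobis distances
    # log|Sigma_j| = 2 * sum(log(diag(L[j])))
    log_det = 2.0 * np.sum(np.log(np.diagonal(L, axis1=1, axis2=2)), axis=1)  # (k,)
    log_comp = -0.5 * (d * np.log(2.0 * np.pi) + log_det[:, None] + maha)     # (k, n)
    # Weighted sum over components, done in log space for numerical stability
    return logsumexp(log_comp + np.log(weights)[:, None], axis=0)             # (n,)
```

Calling `np.exp(gmm_log_pdf(X, weights, means, covs))` gives the mixture density itself. Working in log space avoids underflow when d is large or points lie far from every component.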
I'm trying to fit a simple KNN classifier, and wanted to use the scikit-learn implementation in order to benefit from their efficient implementation (multiprocessing, tree-based algorithms).
However, what I want to get as a result is just the list of distances and nearest neighbours for each data point, rather than the predicted label.
I will then compute the label separately in a non-standard way.
The kneighbors method seems exactly what I need, however I cannot call it without fitting the model with fit first. The issue is, fit() requires the labels (y) as a parameter.
Is there a way to achieve what I'm after? Perhaps I can pass fake labels to the fit() method. Are there any issues I'm missing by doing this? E.g. is this going to affect the results (the computed distances and the list of nearest neighbours for each datapoint) in any way? I wouldn't expect so, but I'm not familiar with the workings of the scikit-learn implementation.
There is another algorithm for what you desire: NearestNeighbors.
This algorithm is unsupervised (you don't need the y labels); moreover, it has a method (kneighbors) that returns the distances to the nearest points and the indices of those points.
Check the link, it is quite clear.
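A minimal sketch of that usage, assuming a small made-up array `X` of four 2-D points. Calling `kneighbors()` with no argument queries the fitted points themselves and does not count a point as its own neighbour.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data (made up for illustration): four points in 2-D
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

# Unsupervised: fit() takes only X, no labels
nn = NearestNeighbors(n_neighbors=2, algorithm="auto")
nn.fit(X)

# With no argument, the neighbours of each fitted point are returned,
# and a point is not reported as its own neighbour
distances, indices = nn.kneighbors()

print(indices)    # row i holds the indices of the 2 nearest neighbours of X[i]
print(distances)  # row i holds the corresponding distances
```

The `indices` array can then be used to look up labels and combine them in whatever non-standard way is needed.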
So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand Logistic Regression is not the proper algorithm for this sort of problem. This is more of a learning exercise in preprocessing and other feature engineering/extraction methods, to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals. Some datasets have 2 classes, others up to 5, which means up to 5 spirals.
Logistic Regression is generally used as a linear classifier, i.e. the decision boundary separating the samples of one class from the other is linear (a straight line), but it can be used for non-linear decision boundaries as well if you transform the features.
Using the kernel trick in SVC is also a good option, as it implicitly maps the data from the lower dimension to a higher dimension where it may become linearly separable.
example:
In the above example, the data is not linearly separable in lower dimension, but after applying the transformation ϕ(x) = x² and adding the second dimension to the features we have the right side graph that becomes linearly separable.
You can start transforming the data by creating new features for applying logistic regression.
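As a minimal sketch of the ϕ(x) = x² idea, here is a made-up 1-D toy dataset (not the spiral data) in which one class lies on both sides of the other. The threshold 1.55 and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 1 lies on both sides of class 0,
# so no single threshold on x separates the classes
x = np.linspace(-3.0, 3.0, 61).reshape(-1, 1)
y = (np.abs(x.ravel()) > 1.55).astype(int)

# Logistic regression on the raw feature only
acc_raw = LogisticRegression().fit(x, y).score(x, y)

# Add x^2 as a second feature; the classes are now separable by a line
x_aug = np.hstack([x, x ** 2])
acc_aug = LogisticRegression().fit(x_aug, y).score(x_aug, y)

print(f"raw feature: {acc_raw:.2f}, with x^2 added: {acc_aug:.2f}")
```

For spirals the useful extra features are likely different (for instance radius and angle, or products of them), but the mechanism is the same: the model stays linear in the engineered features.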
Also try SVC (Support Vector Classifier), which uses the kernel trick. For SVC you don't have to transform the data into higher dimensions explicitly.
A few resources that are great for learning are one and two.
Since the data doesn't seem to be linearly separable, you can try using the kernel trick commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher-dimensional space. That means the transformed vector ϕ(x), which is just some function of the coordinates of the corresponding lower-dimensional vector x, never has to be computed explicitly.
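A small numeric check of that statement, using the degree-2 polynomial kernel K(a, b) = (a·b)² in two dimensions as an illustrative choice. Its explicit feature map is ϕ(x) = (x₁², √2·x₁x₂, x₂²); the helper names `phi` and `poly_kernel` are hypothetical.

```python
import numpy as np


def phi(v):
    """Explicit feature map for the 2-D kernel K(a, b) = (a . b)**2."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def poly_kernel(a, b):
    """Kernel evaluated directly in the original 2-D space."""
    return np.dot(a, b) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

k_val = poly_kernel(a, b)          # computed without ever forming phi
phi_val = np.dot(phi(a), phi(b))   # dot product in the 3-D feature space

print(k_val, phi_val)  # both are 1.0, up to floating-point rounding
```

The two values agree, so a classifier that only needs dot products can work in the higher-dimensional space while evaluating the kernel in the original one.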
I'm working on a time series dataset, so when fitting GaussianMixture() from the scikit-learn package I need the features (timestamps) to be treated as dependent. However, after examining the source code I couldn't find a parameter to customize the covariance matrix.
With my limited statistics knowledge, I'm curious how I can modify the covariance matrix during the E-step to incorporate time dependency into the GMM. Thank you very much.
Here is the Source Code: The change I want to make is in the estimate_gaussian_parameters() function
https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/mixture/gaussian_mixture.py#L435
With darksky's help, I learned that the class has a built-in option for the form of the covariance matrix. The parameter covariance_type has 4 options:
'full' (each component has its own general covariance matrix),
'tied' (all components share the same general covariance matrix),
'diag' (each component has its own diagonal covariance matrix),
'spherical' (each component has its own single variance).
In my understanding then, 'spherical' assumes each component has the same variance along every feature, and 'diag' allows a separate variance per feature but still treats the features as uncorrelated within a component. Therefore, one should use either 'full' or 'tied' to model multivariate data with dependent features.
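To see what each option stores, here is a minimal sketch that fits all four variants to made-up correlated data and collects the shape of the fitted `covariances_` attribute. The data and `n_components=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data (made up): 200 samples of 3 correlated features
cov = np.array([[1.0, 0.8, 0.5],
                [0.8, 1.0, 0.8],
                [0.5, 0.8, 1.0]])
X = rng.multivariate_normal(np.zeros(3), cov, size=200)

shapes = {}
for cov_type in ("full", "tied", "diag", "spherical"):
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X)
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])

# full:      (n_components, n_features, n_features)
# tied:      (n_features, n_features)
# diag:      (n_components, n_features)
# spherical: (n_components,)
```

Only 'full' and 'tied' have off-diagonal entries, which is where dependence between features (here, timestamps) can be represented.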
I have a vector of data points that seems to represent a 3D Gaussian distribution or a Gaussian mixture distribution. Is there a way to fit a 3D Gaussian distribution or a Gaussian mixture distribution to this matrix, and if yes, do there exist libraries to do that (e.g. in Python)?
The question seems related to the following one, but I would like to fit a 3D Gaussian to it:
Fit multivariate gaussian distribution to a given dataset
The targeted end results would look like this (a single distribution or a mixture):
For example, very much simplified, my data vector (from which the Gaussian (mixture) distribution should be learned) looks like this:
[[0,0,0,0,0,0], [0,1,1,1,1,0], [0,1,2,2,1,0], [1,2,3,3,2,1], [0,1,2,2,1,0], [0,0,0,0,0,0]]
I can give an answer if you know the number of Gaussians. Your vector gives the Z values at a grid of X, Y points. You can make X and Y vectors:
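import numpy as np
# z: the grid of values from the question
z = np.array([[0,0,0,0,0,0], [0,1,1,1,1,0], [0,1,2,2,1,0],
              [1,2,3,3,2,1], [0,1,2,2,1,0], [0,0,0,0,0,0]])
num_x, num_y = z.shape
# Coordinate grids: xx holds the column index, yy the row index of each cell
xx = np.outer(np.ones(num_x), np.arange(num_y))
yy = np.outer(np.arange(num_x), np.ones(num_y))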
Then follow any routine fitting procedure, for instance 2D Gaussian Fit for intensities at certain coordinates in Python.
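One such routine, as a sketch under stated assumptions: a single axis-aligned Gaussian plus a constant offset, fitted by least squares with `scipy.optimize.curve_fit`. The model function `gaussian_2d`, the helper `fit_gaussian_2d`, and the initial-guess heuristics are hypothetical choices, and the demo uses noise-free synthetic data rather than the grid from the question.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian_2d(coords, amp, x0, y0, sigma_x, sigma_y, offset):
    """Axis-aligned 2-D Gaussian with a constant background."""
    x, y = coords
    expo = (x - x0) ** 2 / (2 * sigma_x ** 2) + (y - y0) ** 2 / (2 * sigma_y ** 2)
    return offset + amp * np.exp(-expo)


def fit_gaussian_2d(z):
    """Fit gaussian_2d to a 2-D grid of values; returns the fitted parameters."""
    z = np.asarray(z, dtype=float)
    rows, cols = z.shape
    yy, xx = np.mgrid[0:rows, 0:cols]              # yy = row index, xx = column index
    coords = np.vstack([xx.ravel(), yy.ravel()])   # shape (2, rows * cols)
    # Rough initial guess: peak location from the maximum, widths from the grid size
    peak_row, peak_col = np.unravel_index(np.argmax(z), z.shape)
    p0 = [z.max() - z.min(), peak_col, peak_row, cols / 4, rows / 4, z.min()]
    popt, _ = curve_fit(gaussian_2d, coords, z.ravel(), p0=p0)
    return popt


# Synthetic check: generate a noise-free Gaussian and recover its parameters
true_params = (3.0, 9.0, 11.0, 3.0, 2.0, 0.5)
yy, xx = np.mgrid[0:20, 0:20]
z_demo = gaussian_2d((xx, yy), *true_params)
fitted = fit_gaussian_2d(z_demo)
print(np.round(fitted, 3))
```

For a mixture, the model function would be a sum of several such terms, which is why the number of Gaussians has to be known in advance. On small or noisy grids the fit can be sensitive to the initial guess.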
There are so-called Gaussian Mixture Models (GMMs), with lots of literature behind them. There is also Python code for sampling, parameter estimation, etc.; I'm not sure if it fits your needs:
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html
Disclaimer: used scikit, but never used GMM
I'm trying to train a linear model on a very large dataset.
The feature space is small but there are too many samples to hold in memory.
I'm calculating the Gram matrix on-the-fly and trying to pass it as an argument to sklearn Lasso (or other algorithms) but, when I call fit, it needs the actual X and y matrices.
Any idea how to use the 'precompute' feature without storing the original matrices?
(My answer is based on the usage of svm.SVC, Lasso may be different.)
I think that you are supposed to pass the Gram matrix instead of X to the fit method.
Also, the Gram matrix has shape (n_samples, n_samples) so it should also be too large for memory in your case, right?
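On the shape question: for svm.SVC with a precomputed kernel the Gram matrix is indeed (n_samples, n_samples), but for Lasso the `precompute` Gram matrix is, as far as I know, X^T X with shape (n_features, n_features), which stays small when the feature space is small. A minimal sketch of accumulating it chunk by chunk follows; the function name `accumulate_gram` and the chunking scheme are hypothetical, and the sketch does not address whether fit can then be called without X.

```python
import numpy as np


def accumulate_gram(chunks):
    """Accumulate X^T X and X^T y over an iterable of (X_chunk, y_chunk) pairs."""
    gram, xy = None, None
    for X_chunk, y_chunk in chunks:
        if gram is None:
            d = X_chunk.shape[1]
            gram = np.zeros((d, d))
            xy = np.zeros(d)
        gram += X_chunk.T @ X_chunk   # (n_features, n_features)
        xy += X_chunk.T @ y_chunk     # (n_features,)
    return gram, xy


# Demo on made-up data small enough to check against the in-memory result
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)
chunks = ((X[i:i + 100], y[i:i + 100]) for i in range(0, 1000, 100))
gram, xy = accumulate_gram(chunks)
print(gram.shape, xy.shape)
```

Only one chunk needs to be in memory at a time, and the accumulated matrices have a size that depends on the number of features, not the number of samples.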