Using precomputed Gram matrix in sklearn linear models (Lasso, Lars, etc) - python

I'm trying to train a linear model on a very large dataset.
The feature space is small but there are too many samples to hold in memory.
I'm calculating the Gram matrix on-the-fly and trying to pass it as an argument to sklearn Lasso (or other algorithms) but, when I call fit, it needs the actual X and y matrices.
Any idea how to use the 'precompute' feature without storing the original matrices?

(My answer is based on the usage of svm.SVC; Lasso may be different.)
I think you are supposed to pass the Gram matrix instead of X to the fit method.
Also, that Gram matrix has shape (n_samples, n_samples), so it should also be too large for memory in your case, right?
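For reference, this is the svm.SVC pattern being described: a precomputed kernel (Gram) matrix between samples is passed in place of X. This is only a minimal illustrative sketch with made-up toy data, not the questioner's setup:

import numpy as np
from sklearn import svm

# toy data, purely for illustration
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(100, 5), rng.randint(0, 2, 100)
X_test = rng.randn(20, 5)

gram_train = X_train @ X_train.T          # linear-kernel Gram matrix, shape (n_samples, n_samples)

clf = svm.SVC(kernel='precomputed')
clf.fit(gram_train, y_train)              # the Gram matrix is passed in place of X

gram_test = X_test @ X_train.T            # kernel between test and training samples
predictions = clf.predict(gram_test)

Note that, as the question observes, Lasso's precompute argument works differently: fit still expects the original X and y.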

Related

Pre Processing spiral dataset to use for Logistic Regression

So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand that Logistic Regression is not the proper algorithm for this sort of problem. This is more of a learning exercise in preprocessing and other feature engineering/extraction methods, to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals. Some datasets have 2 classes, and some have up to 5, which means up to 5 spirals.
Logistic Regression is generally used as a linear classifier, i.e. the decision boundary separating the samples of one class from the other is linear (a straight line), but it can be used for non-linear decision boundaries as well.
Using the kernel trick in SVC is also a good option, as it maps the data from the lower-dimensional space to a higher-dimensional one where it becomes linearly separable.
For example: data that is not linearly separable in one dimension becomes linearly separable after applying the transformation ϕ(x) = x² and adding it to the features as a second dimension.
You can start by transforming the data, creating new features on which to apply logistic regression.
Also try SVC (Support Vector Classifier), which uses the kernel trick. With SVC you don't have to transform the data into higher dimensions explicitly.
A couple of resources that are great for learning this are one and two.
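For instance, one common way to hand-engineer spiral features is to move to polar coordinates (radius and angle) and add phase terms, then feed those to plain logistic regression. The sketch below is only illustrative: the synthetic spiral generator and the particular features are assumptions, not something from the original question:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def make_spirals(n=500, noise=0.2, seed=0):
    # two interleaved spiral arms, purely as stand-in data
    rng = np.random.RandomState(seed)
    t = np.sqrt(rng.rand(n)) * 3 * np.pi
    arm0 = np.column_stack([t * np.cos(t), t * np.sin(t)])
    arm1 = np.column_stack([t * np.cos(t + np.pi), t * np.sin(t + np.pi)])
    X = np.vstack([arm0, arm1]) + rng.randn(2 * n, 2) * noise
    y = np.hstack([np.zeros(n), np.ones(n)])
    return X, y

def spiral_features(X):
    # polar coordinates, plus phase terms that are roughly constant along one arm
    r = np.hypot(X[:, 0], X[:, 1])
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([X, r, theta, np.sin(theta - r), np.cos(theta - r)])

X, y = make_spirals()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(spiral_features(X_train), y_train)
print("accuracy with engineered features:", clf.score(spiral_features(X_test), y_test))

The idea is that a spiral arm is much simpler to describe in polar coordinates than in raw x, y, so a linear model has a chance once those coordinates (and phase terms that handle the angle wrap-around) are exposed as features.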
Since the data doesn't seem to be linearly separable, you can try the kernel trick commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher-dimensional space. That means the transformed vector ϕ(x) is just some function of the coordinates of the corresponding lower-dimensional vector x.
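For comparison, an RBF-kernel SVC handles such data without any explicit feature transformation. A minimal sketch, reusing the illustrative make_spirals helper from the previous example:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_spirals()                    # illustrative spiral generator defined in the earlier sketch
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(kernel='rbf', gamma='scale')   # the RBF kernel does the lifting to higher dimensions implicitly
svc.fit(X_train, y_train)
print("RBF SVC accuracy:", svc.score(X_test, y_test))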

Convert Keras model output into sparse matrix without forloop

I have a pretrained Keras model whose output has dimension [n, 4000] (it performs classification over 4000 classes).
I need to make a prediction on the test data (300K observations).
But when I call model.predict(X_train), I get an out-of-memory error, because I don't have enough RAM to store a matrix with shape (300K, 4000).
Therefore, it would be logical to convert the model output to a sparse matrix.
But wrapping the predict call in the scipy function sparse.csr_matrix does not work (sparse.csr_matrix(model.predict(X_train))), because it first allocates RAM for the dense prediction and only then converts it into a sparse matrix.
I can also make a prediction on specific batches of the test data and then convert the results in a for-loop.
But that seems like a suboptimal and resource-consuming approach.
Is there any other method for converting the model output into a sparse matrix?
Isn't there a batch_size parameter in predict()?
If I understand correctly, n is the number of samples, right?
That assumes your system RAM is enough to hold the entire output but the GPU VRAM is not.
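One workable pattern, even if the question calls the loop suboptimal, is to predict batch by batch and sparsify each chunk before stacking, so the full dense (300K, 4000) array never exists at once. A rough sketch, assuming a fitted Keras model named model and an input array X_test; the batch size and the thresholding rule are illustrative choices (a softmax output stays fully dense unless small values are zeroed out):

import numpy as np
from scipy import sparse

def predict_sparse(model, X, batch_size=1024, threshold=1e-4):
    # Predict in chunks, zero out tiny probabilities, and store each chunk as CSR.
    chunks = []
    for start in range(0, X.shape[0], batch_size):
        probs = model.predict(X[start:start + batch_size], verbose=0)
        probs[probs < threshold] = 0.0       # without this step the softmax output has no zeros to exploit
        chunks.append(sparse.csr_matrix(probs))
    return sparse.vstack(chunks, format="csr")

# usage (model and X_test assumed to exist):
# y_sparse = predict_sparse(model, X_test)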

Efficient Gaussian mixture evaluation

I'd like to (efficiently) evaluate a Gaussian mixture model (GMM) over an (n,d) list of datapoints, given the GMM parameters ($\pi_k, \mu_k, \Sigma_k$). I can't find a way to do this using standard sklearn or scipy packages.
EDIT: assume there are n datapoints of dimension d, so the data is (n, d), and the GMM has k components; for example the covariance matrix of the k-th component, $\Sigma_k$, is (d, d), and altogether $\Sigma$ is (k, d, d).
For example, if you first fit a GMM in sklearn you can call score_samples, but that only works after fitting to data. Or, in scipy, you can run a for-loop over multivariate_normal.pdf with each set of parameters and do a weighted sum/dot product, but this is slow. Checking the source code of either was not illuminating (for me).
I'm currently hacking something together with n-d arrays and tensor dot products .. oy .. hoping someone has a better way?
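For what it's worth, a numerically stable way to do the scipy version is to loop over the k components (not the n points) with multivariate_normal.logpdf and combine them with logsumexp; each logpdf call is already vectorized over all n points. A minimal sketch under the shapes described in the edit, with placeholder parameter values:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_density(X, weights, means, covariances):
    # X: (n, d), weights: (k,), means: (k, d), covariances: (k, d, d)
    k = weights.shape[0]
    log_probs = np.stack([
        multivariate_normal.logpdf(X, mean=means[j], cov=covariances[j])
        for j in range(k)
    ], axis=1)                                              # shape (n, k)
    return logsumexp(log_probs + np.log(weights), axis=1)   # shape (n,)

# illustrative usage with random placeholder parameters
rng = np.random.RandomState(0)
n, d, k = 1000, 3, 4
X = rng.randn(n, d)
weights = np.full(k, 1.0 / k)
means = rng.randn(k, d)
covariances = np.stack([np.eye(d) for _ in range(k)])
density = np.exp(gmm_log_density(X, weights, means, covariances))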

Is it possible to use scikit TSNE on a large sparse matrix?

The scikit documentation explains that fit_transform can only be used for dense matrices, but I have a sparse matrix in CSR format on which I want to perform t-SNE. The documentation says to use the fit method for sparse matrices, but that doesn't return the low-dimensional embedding.
I appreciate I could use the .todense() method as in this question, but my dataset is very large (0.4*10^6 rows and 0.5*10^4 columns) so it won't fit in memory. Really, it would be nice to do this using sparse matrices. Is there a way to use scikit TSNE (or any other Python implementation of t-SNE) to reduce the dimensionality of a large sparse matrix and return the low-dimensional embedding so I can then visualize it?
From that same documentation:
It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.
Use sklearn.decomposition.TruncatedSVD instead.
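In practice that means reducing the sparse matrix to ~50 dense components with TruncatedSVD, which fits comfortably in memory at 0.4M x 50, and then running TSNE on the result. A minimal sketch; the stand-in random matrix just mimics the format and sparsity of the real data:

from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# stand-in for the real CSR matrix, purely for illustration
X_sparse = sparse.random(10000, 5000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_sparse)        # dense (n_samples, 50), small enough to keep in memory

tsne = TSNE(n_components=2, random_state=0)
X_embedded = tsne.fit_transform(X_reduced)     # dense (n_samples, 2) embedding to visualize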

Trying to avoid .toarray() when loading data into an SVC model in scikit-learn

I'm trying to plug a bunch of data (sentiment-tagged tweets) into an SVM using scikit-learn. I've been using CountVectorizer to build a sparse array of word counts, and it's all working fine with smallish datasets (~5000 tweets). However, when I try to use a larger corpus (ideally 150,000 tweets, but I'm currently exploring with 15,000), .toarray(), which converts the sparse format to a dense one, immediately starts taking up immense amounts of memory (30k tweets hit over 50 GB before the MemoryError).
So my question is -- is there a way to feed LinearSVC() or a different manifestation of SVM a sparse matrix? Am I necessarily required to use a dense matrix? It doesn't seem like a different vectorizer would help fix this problem (as this problem seems to be solved by: MemoryError in toarray when using DictVectorizer of Scikit Learn). Is a different model the solution? It seems like all of the scikit-learn models require a dense array representation at some point, unless I've been looking in the wrong places.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

cv = CountVectorizer(analyzer=str.split)
clf = svm.LinearSVC()
X = cv.fit_transform(data)            # sparse matrix of word counts
trainArray = X[:breakpt].toarray()    # densifies the training rows; this is where memory blows up
testArray = X[breakpt:].toarray()     # densifies the test rows
clf.fit(trainArray, label)
guesses = clf.predict(testArray)
LinearSVC.fit and its predict method can both handle a sparse matrix as the first argument, so just removing the toarray calls from your code should work.
All estimators that take sparse inputs are documented as doing so. E.g., the docstring for LinearSVC states:
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and
n_features is the number of features.
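That is, the fit and predict calls can take the CSR matrix directly; a sketch of the corrected version of the code above, keeping the same variable names:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

cv = CountVectorizer(analyzer=str.split)
clf = svm.LinearSVC()
X = cv.fit_transform(data)         # stays sparse
clf.fit(X[:breakpt], label)        # LinearSVC.fit accepts the sparse matrix directly
guesses = clf.predict(X[breakpt:])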
