The scikit-learn documentation explains that fit_transform can only be used for dense matrices, but I have a sparse matrix in CSR format on which I want to perform t-SNE. The documentation says to use the fit method for sparse matrices, but that doesn't return the low-dimensional embedding.
I appreciate I could use the .todense() method as in this question, but my data set is very large (0.4 × 10^6 rows and 0.5 × 10^4 columns), so it won't fit in memory. Ideally, I would do this with sparse matrices. Is there a way to use scikit-learn's TSNE (or any other Python implementation of t-SNE) to reduce the dimensionality of a large sparse matrix and return the low-dimensional embedding, which I can then visualize?
From that same documentation:
It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.
Use sklearn.decomposition.TruncatedSVD instead.
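For example, a minimal sketch of that two-step route, assuming your CSR matrix is named X_sparse (a placeholder):

from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# TruncatedSVD accepts sparse input directly; ~50 components is the
# ballpark the documentation recommends
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_sparse)  # dense array of shape (n_samples, 50)

# t-SNE on the reduced, dense matrix is now feasible
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)
# embedding has shape (n_samples, 2) and is ready to plot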
Related
I have data with both numeric and categorical attributes to which I'm trying to apply PCA. I one-hot encoded the categorical ones using sklearn.preprocessing.OneHotEncoder, but when I pass the resulting matrix to sklearn.decomposition.PCA it gives me the following error:
TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.
I want to use PCA because my data also has numeric attributes besides the categorical ones I'm one-hot encoding. I could just convert the SciPy sparse matrix to a dense NumPy array and append it to my df (I don't know whether that would give decent results, as I don't have much knowledge of statistics), but I wanted to know if there's a way to apply PCA to the sparse matrix directly, in case I run into a bigger data set.
Further information:
I'm using the "Horse Colic Dataset".
You can download it here: http://networkrepository.com/horse-colic.php
The datadict may be obtained here: https://archive.ics.uci.edu/ml/datasets/Horse+Colic
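For concreteness, a minimal sketch of the TruncatedSVD route the error message points to, keeping everything sparse; df, numeric_cols, and categorical_cols are placeholder names:

import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD

X_cat = OneHotEncoder().fit_transform(df[categorical_cols])  # sparse by default
X_num = sp.csr_matrix(df[numeric_cols].to_numpy())           # wrap the numeric part
X = sp.hstack([X_num, X_cat], format="csr")                  # combined, still sparse

svd = TruncatedSVD(n_components=10)
X_reduced = svd.fit_transform(X)
print(svd.explained_variance_ratio_.sum())  # variance retained by the components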
I am successfully converting documents using this module available on TensorFlow Hub.
The output for each document is a 512-dimensional vector; however, this is too large for my application, and I would like to reduce the dimensionality, which the module itself does not provide.
I can see a few options:
Use another package with a lower dimensionality output.
Use something such as PCA or t-SNE to reduce the dimensions.
The problem with using PCA or t-SNE is that they need to be fitted to many example vectors. This means that as new documents arrived and were converted to 512-dim vectors, I would need to keep fitting another model and then updating the old document vectors, which would be a huge issue in my application.
Are there any other dimensionality reduction techniques which can operate on a single data point?
"UMAP supports adding new points to an existing embedding via the standard sklearn transform method." UMAP is the winner for dimensionality reduction in every way, speed, accuracy, and theoretical foundation.
I have an np.array with 400 entries, each containing the values of a spectrum with 1000 points.
I want to identify the n most interesting indices of the spectrum and return them, so I can visualize them and use them as an input vector for my classifier.
Is it best to calculate the variance, apply a PCA or are there better-suited algorithms?
And how do I compute the variance accounted for by that selection?
Thanks
Dimensionality reduction can be performed in two broad ways: feature extraction and feature selection. PCA is a feature extraction method, while "identify the n most interesting indices" is a feature selection problem. More details, and how to code it up, here: sklearn.feature_selection
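For instance, a minimal sketch with SelectKBest, assuming X is the (400, 1000) array and y holds the class labels for your classifier (both placeholder names):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=20)       # keep the 20 highest-scoring indices
X_selected = selector.fit_transform(X, y)
indices = selector.get_support(indices=True)  # positions of the chosen spectrum points

# fraction of the total variance accounted for by the selected columns
explained = X[:, indices].var(axis=0).sum() / X.var(axis=0).sum()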
I'm trying to train a linear model on a very large dataset.
The feature space is small but there are too many samples to hold in memory.
I'm calculating the Gram matrix on the fly and trying to pass it as an argument to sklearn's Lasso (or other algorithms), but when I call fit, it needs the actual X and y matrices.
Any idea how to use the 'precompute' feature without storing the original matrices?
(My answer is based on the usage of svm.SVC, Lasso may be different.)
I think that you are supposed to pass the Gram matrix instead of X to the fit method.
Also, the Gram matrix has shape (n_samples, n_samples), so it should also be too large for memory in your case, right?
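A minimal sketch of that usage with svm.SVC, where gram_train, gram_test, and y_train are placeholders for matrices you compute yourself:

from sklearn.svm import SVC

clf = SVC(kernel="precomputed")
clf.fit(gram_train, y_train)    # gram_train: (n_train, n_train) kernel between training samples

# at predict time, pass the kernel between test and training samples
preds = clf.predict(gram_test)  # gram_test: (n_test, n_train)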
I need to train the svm classifier in sklearn. The dimensions of the feature vectors go into the hundreds of thousands, and there are tens of thousands of such feature vectors. However, each dimension can be 0, 1, or -1. Only some 100 are non-zero in each feature vector. Is there an efficient way to give the classifier this information about the feature vectors?
I need to train the svm classifier in sklearn.
You mean sklearn.svm.SVC? For high-dimensional sparse data with many samples, LinearSVC, LogisticRegression, PassiveAggressiveClassifier, or SGDClassifier can be much faster to train, with comparable predictive accuracy.
The dimensions of the feature vectors go into the hundreds of thousands, and there are tens of thousands of such feature vectors. However, each dimension can be 0, 1, or -1. Only some 100 are non-zero in each feature vector. Is there an efficient way to give the classifier this information about the feature vectors?
Find a way to load your data as a scipy.sparse matrix that does not store the zeros in memory. Have a look at the documentation on feature extraction. It will give you tools to do that depending on the nature of the representation of the original data.
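A minimal sketch of that approach, with toy triplets standing in for your real non-zero entries:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# store only the non-zero entries as (value, (row, col)) triplets
rows = np.array([0, 0, 1, 2])
cols = np.array([3, 7, 2, 0])
vals = np.array([1, -1, 1, -1])
X = csr_matrix((vals, (rows, cols)), shape=(3, 200000))  # zeros are never stored
y = np.array([0, 1, 1])

clf = LinearSVC()
clf.fit(X, y)  # LinearSVC accepts scipy.sparse input directly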