Universal Sentence Encoder, reduce vector dimensionality - python

I am successfully converting documents using this module available on TensorFlow hub.
The output of each document is a 512 dimensional vector, however this is too large for my application and I would like to reduce the dimensionality, which the module itself does not provide.
I can see a few options:
Use another package with a lower dimensionality output.
Use something such as PCA or tSNE to reduce the dimensions.
The problem with using PCA or tSNE is that this needs to be fit to the data of many example vectors - this would mean as new documents arrived and had been converted to a 512-dim vector, I would need to keep fitting another model, and then updating the old document vectors - this would be a huge issue in my application.
Are there any other dimensionality reduction techniques which can operate on a single data point?

"UMAP supports adding new points to an existing embedding via the standard sklearn transform method." UMAP is the winner for dimensionality reduction in every way, speed, accuracy, and theoretical foundation.

Related

Pre Processing spiral dataset to use for Logistic Regression

So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand Logistic Regression is not the proper algorithm to do this sort of problem. This is more of a learning excerise for Pre processing and other feature engineering/extraction methods to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals as well. some datasets have 2 classes or sometimes up to 5. This means up to 5 spirals.
Logistic Regression is generally used as a linear classifier i.e the decision boundary separating one class samples from the other is a linear(straight-line) but it can be used for non-linear decision boundaries as well.
Using the kernel trick in SVC is also good option as it maps the data in the lower dimension to higher dimension making it linearly separable.
example:
In the above example, the data is not linearly separable in lower dimension, but after applying the transformation ϕ(x) = x² and adding the second dimension to the features we have the right side graph that becomes linearly separable.
You can start transforming the data by creating new features for applying logistic regression.
Also try SVC(Support Vector Classifier) that uses kernel trick. For SVC you don't have to transform the data into higher dimensions explicitly.
There are few resources which are great for learning are one and two
Since the data doesn't seem to be linearly separable, you can try using the Kernel Trick method commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. That means transformed vector ϕ(x) is just some function of the coordinates in the corresponding lower-dimensional vector x.

How to save learned weights/parameters of PCA and T-SNE in Sklearn

I have two set of data let's say A and B.
I want to apply PCA and T-sne to A and fine tune the algo.
Once, I am satisfied with my tuning I want to save the learnt things to some pickle file.
Now I want to apply the same learnt PCA and t-sne to set B.
I want t-sne to produce the same results every time on B. I am hoping this because I am assuming, we can save the state of learnt t-sne parametes also. If the parametes are same, and when I load the same file everytime, the result of applying t-sne on the set B should be same every time.
How can I do this in Sklearn and python?
I am sorry, I am new to ML and python, this may be a very basic question.
Fine-tuning T-SNE equals tuning some heuristic-algorithm (it's ill-conditioned after all; higher-dimension -> lower-dimension mapping is lossy) for your data.
Applying this tuned & learned mapping to other data is done by sklearn's transform.
But: you will see that there is no transform-method for T-SNE and the reason is given here (including further discussion):
It is a transductive learner, like many clustering algorithms: the model is not really applicable beyond the data points it is fed as training.
So whatever you tuned for dataset A, does not really apply to dataset B (including parameters)!
For PCA this is trivial. Use the methods described in docs: model_persistence and use PCA's transform-method (assuming compatible datasets; dimensions!).
To be able to save the models you should use the below library:
from joblib import dump, load
after establishing the model as below in PCA:
pca_model = PCA(n_components=n)
you can save the model in joblib format in the current directory:
dump(pca_model, 'pca_model.joblib')

Clustering algorithm performance check on un plot able data

I am using Kmeans Clustring algorithm from Sci-kit learn library and dimension of my data is 169 and that's why I am unable to visualize the result of clustering.
Is there any way to measure the performance of algorithm?
Secondly, I have the labels of data and I want to test the learned model with the test dataset but I am not sure the labels Kmeans algo gave to cluster coincide with the labels I have.
There are ways of visualizing high dimensional data. You can sample some dimensions, use PCA components, MDS, tSNE, parallel coordinates, and many more.
If you even just read the Wikipedia article on clustering, there is a section on evaluation, including supervised as well as unsupervised evaluation. But the results of such evaluation can be very misleading...
Bear on mind that if you have labeled data, supervised methods should always outperform unsupervised methods that do not have the labels: they don't know what to look for - there is lie reason to believe that every clustering happens to align with some labels. In particular, on most data there will be many reasonable clusterings that capture different aspects of your data.

How to convert Gensim corpus to numpy array (or scipy sparse matrix) efficiently?

Suppose I have a (possibly) large corpus, about 2.5M of them with 500 features (after running LSI on the original data with gensim). I need the corpus to train my classifiers using scikit-learn. However, I need to first convert the corpus into a numpy array. The corpus creation and classifier trainer are done in two different scripts.
So the problem is that, my collection size is expected to grow, and at this stage I already don't have enough memory (32GB on the machine) to convert all at once (with gensim.matutils.corpus2dense). In order to work around the problem I am converting one vector after another at a time, but it is very slow.
I have considered dumping the corpus into svmlight format, and have scikit-learn to load it with sklearn.datasets.load_svmlight_file. But then it would probably mean I will need to load everything into memory at once?
Is there anyway I can efficiently convert from gensim corpus to numpy array (or scipy sparse matrix)?
I'm not very knowledgable about Gensim, so I hesitate to answer, but here goes:
Your data does not fit in memory so you will have to either stream it (basically what you are doing now) or chunk it out. It looks to me like gensim.utils.chunkize chunks it out for you, and you should be able to get the dense numpy array that you need with as_numpy=True. You will have to use the sklearn models that support partial_fit. These are trained iteratively, one batch at a time. The good ones are the SGD classifier and the Passive-Aggressive Classifier. Make sure to pass the classes argument at least the first time you call partial_fit. I recommend reading the docs on out-of-core scaling.

Supervised Dimensionality Reduction for Text Data in scikit-learn

I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At it's core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised, BoW data in scikit-learn?
Thanks!
I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depends on the type of vectors you have (frequency counts, tfidf) and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
An increasingly popular approach uses transformation by random projections, Random Indexing. You can do this in scikit-learn with the functions in random_projection.
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be better option. Beware, it is slow and it only takes pure frequency counts (no tfidf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would make a sckit-learn pipeline with a vectorizer, dimensionality reduction options and classifier and would carry out a massive parameter search. In this way, you will see what gives you best results with the label set you have.
You can use latent dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that your start here (note: it's still assumes a high level of mathematical maturity).
Several existing scikit modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It find a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
Cross decomposition includes methods like Partial Least Squares which fit linear regression models for multidimentional targets via a projection through a lower dimentonial intermediate space. It is a lot like a single hidden layer neural net without the sigmoids.
These are linear regression methods, but you could apply a 0-1 encoding to you target signal
and use these models anyway.
You could use an L1 regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping get a more stable feature set.
Try ISOMAP. There's a super simple built-in function for it in scikits.learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The role of the hidden layer is by definition optimised to distinguish between the classes, since that's what's directly optimised when the weights are set.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).

Categories