tf.contrib.layers.embedding_column from TensorFlow - python

I am going through the TensorFlow tutorial. I would like to find a description of the following line:
tf.contrib.layers.embedding_column
I wonder if it uses word2vec or anything else, or maybe I am thinking in a completely wrong direction. I tried to click around on GitHub, but found nothing. I am guessing that looking on GitHub is not going to be easy, since the Python might refer to some C++ libraries. Could anybody point me in the right direction?

I've been wondering about this too. It's not really clear to me what they're doing, but this is what I found.
In the paper on wide and deep learning, they describe the embedding vectors as being randomly initialized and then adjusted during training to minimize error.
Normally when you do embeddings, you take some arbitrary vector representation of the data (such as one-hot vectors) and then multiply it by a matrix that represents the embedding. This matrix can be found with PCA, or learned during training by something like t-SNE or word2vec.
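As a tiny illustration of that lookup (made-up sizes, plain NumPy): a one-hot vector multiplied by an embedding matrix simply selects one row of that matrix.

import numpy as np

vocab_size, dim = 5, 3
E = np.random.randn(vocab_size, dim)     # the (trainable) embedding matrix
one_hot = np.eye(vocab_size)[2]          # one-hot vector for word id 2
print(np.allclose(one_hot @ E, E[2]))    # True: the lookup is a matrix multiply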
The actual code for the embedding_column is here, and it's implemented as a class called _EmbeddingColumn which is a subclass of _FeatureColumn. It stores the embedding matrix inside its sparse_id_column attribute. Then, the method to_dnn_input_layer applies this embedding matrix to produce the embeddings for the next layer.
def to_dnn_input_layer(self,
                       input_tensor,
                       weight_collections=None,
                       trainable=True):
  output, embedding_weights = _create_embedding_lookup(
      input_tensor=self.sparse_id_column.id_tensor(input_tensor),
      weight_tensor=self.sparse_id_column.weight_tensor(input_tensor),
      vocab_size=self.length,
      dimension=self.dimension,
      weight_collections=_add_variable_collection(weight_collections),
      initializer=self.initializer,
      combiner=self.combiner,
      trainable=trainable)
So as far as I can see, it seems like the embeddings are formed by applying whatever learning rule you're using (gradient descent, etc.) to the embedding matrix.
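For context, this is roughly how the column is used in the wide-and-deep tutorial. This is only a sketch against the old tf.contrib API (removed in TensorFlow 2.x); the column name "occupation" and the sizes are made up for illustration.

import tensorflow as tf  # assumes a 1.x release where tf.contrib still exists

# A sparse categorical column: strings are hashed into 1000 integer ids.
occupation = tf.contrib.layers.sparse_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

# Wrap it in an embedding_column: each of the 1000 ids gets a trainable
# 8-dimensional vector, initialized randomly and updated by the optimizer.
occupation_emb = tf.contrib.layers.embedding_column(occupation, dimension=8)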

I had a similar doubt about embeddings.
Here is the main point:
Adding an embedding layer alongside traditional wide linear models allows for accurate predictions by reducing sparse, high-dimensional features down to a low-dimensional representation.
Here is a good post about it!
And here is a simple example combining embedding layers, using the Titanic Kaggle data to predict whether or not a passenger will survive based on attributes like their name, sex, what ticket they had, the fare they paid, the cabin they stayed in, etc.
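For reference, a hedged sketch of that wide + deep combination using the old tf.contrib.learn estimator from the tutorial; the column names ("sex", "fare") and hidden-layer sizes are illustrative only, not taken from the posts above.

import tensorflow as tf  # assumes a 1.x release where tf.contrib still exists

sex = tf.contrib.layers.sparse_column_with_keys("sex", keys=["female", "male"])
fare = tf.contrib.layers.real_valued_column("fare")

wide_columns = [sex]                                            # linear "wide" part
deep_columns = [tf.contrib.layers.embedding_column(sex, dimension=8), fare]

model = tf.contrib.learn.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
# model.fit(input_fn=train_input_fn, steps=200)  # input_fn is left to the reader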

Related

Feature Extraction Using Representation Learning

I'm new to machine learning, and I've been given a task where I'm asked to extract features from a data set with continuous data using representation learning (for example a stacked autoencoder).
Then I'm to combine these extracted features with the original features of the dataset and then use a feature selection technique to determine my final set of features that goes into my prediction model.
Could anyone point me to some resources or demos or sample code of how I could get started on this? I'm very confused on where to begin on this and would love some advice!
Okay, say you have an input of 1000 instances and 30 features. Based on what you told us, what I would do is:
Train an autoencoder: a neural network that compresses the input and then decompresses it, with your original input as the target. The compressed representation lies in the latent space and encapsulates information about the input that is not directly accessible to humans. You can find such networks in TensorFlow or PyTorch; TensorFlow is easier and more straightforward, so it could be better for you. I will provide this link (https://keras.io/examples/generative/vae/) for a variational autoencoder that may do the job for you. It has Conv2D layers, so it performs really well on image data, but you can play around with the architecture. I cannot tell you more because you did not provide more information about your dataset. However, the important thing is the following:
After your autoencoder is trained properly (and you need to make sure of that: it should adequately reconstruct the input), you need to extract the aforementioned latent representation (you will find more in the link). That will be, let's say, 16 numbers, but you can play with the size. These 16 numbers were built to preserve information about your input. You said you wanted to combine them with your original features, so you might as well do that and end up with 46 input features. The feature-selection part then has to do with selecting the input features that are most useful for your model. That part is not very interesting; you may find more information here (https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e), and one way to select features is by training many models with different feature subsets. Remember, techniques such as PCA are for feature extraction, not selection. I cannot provide a demo that does the whole thing, but there are sources that can help. Remember, your autoencoder is supposed to return 16 numbers for each training example, and it is trained only on your training data, with the training data as targets.
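A minimal sketch of that recipe, assuming a plain dense autoencoder (not the convolutional VAE from the link), 30 tabular features, a 16-dimensional latent space, and placeholder training data:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, latent_dim = 30, 16

inputs = keras.Input(shape=(n_features,))
h = layers.Dense(24, activation="relu")(inputs)
latent = layers.Dense(latent_dim, activation="relu", name="latent")(h)
h = layers.Dense(24, activation="relu")(latent)
outputs = layers.Dense(n_features, activation="linear")(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X_train = np.random.rand(1000, n_features)  # placeholder data: 1000 instances
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, verbose=0)  # input == target

# Extract the 16 latent features and concatenate with the originals (30 + 16 = 46).
encoder = keras.Model(inputs, latent)
X_combined = np.hstack([X_train, encoder.predict(X_train)])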

Visualizing a feature/kernel produced by a CNN via an "optimal" input image using Keras

So I've been doing a lot of research regarding the visualization of CNN's and I can't seem to find a solution to what I'm trying to do, or at least to my understanding of the methodologies employed. A lot of it is pretty new and cutting edge, so I could just not be properly grasping the concepts.
Basically, I want to take a learned kernel/feature as trained by a CNN and essentially manufacture an "optimized" picture such that when the kernel is convolved with said picture, we have the highest convolutional sum possible.
If I'm not mistaken, this should exaggerate the features of that kernel on the image level rather than at the filter/kernel level, which seems to be what most have done in terms of visualizing these filters.
In case what I'm asking is not clear, here's an example (probably bad, but it'll get the point across).
Assume we are using MNIST and I've created a CNN like so:
5x5 Conv with 10 kernels/Feature Maps
Relu
2x2 MaxPool 2 stride
Dense + Softmax
Let's say I've fully trained my model and now want to look at one of the 10 5x5 kernels it produced and get a better idea of what it's looking for. I want to manufacture a new 28x28 picture such that when convolved with this 5x5 kernel, the sum of the 28x28 convolution is maximized.
Are there techniques that already do something like this? I feel like everything I see involves either "unwinding" or "reversing" the neural net (https://arxiv.org/pdf/1311.2901.pdf), viewing the feature maps as pictures pass through (http://kvfrans.com/visualizing-features-from-a-convolutional-neural-network/), or just looking at the kernels themselves (https://www.youtube.com/watch?v=AgkfIQ4IGaM).
Is it even something useful to look at? I feel like this is the closest thing I've seen to what I'm requesting. https://arxiv.org/pdf/1312.6034.pdf
Any insight would be a huge help, thanks!
This is called activation maximization, and Keras even has an example of it available here. Note that the code in the post might be outdated for current Keras versions, but an updated version is available in the examples folder in Keras.
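For reference, here is a rough sketch of the gradient-ascent loop behind activation maximization, written against current TensorFlow/Keras rather than the linked example; the layer name, step count, and learning rate are assumptions:

import tensorflow as tf

def visualize_filter(model, layer_name, filter_index, steps=100, lr=1.0):
    """Gradient ascent on a random input image to maximize one feature map."""
    feature_extractor = tf.keras.Model(model.inputs,
                                       model.get_layer(layer_name).output)

    # Start from a random 28x28 single-channel image (MNIST-sized).
    img = tf.Variable(tf.random.uniform((1, 28, 28, 1)))

    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(img)
            # Maximize the mean activation of the chosen feature map.
            loss = tf.reduce_mean(activation[..., filter_index])
        grads = tape.gradient(loss, img)
        img.assign_add(lr * tf.math.l2_normalize(grads))

    return img.numpy()[0, ..., 0]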

How to analyse 3D mesh data (in .stl) with TensorFlow

I am trying to write a script in Python to analyse an .stl data file (3D geometry) and say whether the model is convex or concave, whether it is watertight, and other properties...
I would like to use TensorFlow, scikit-learn, or another machine learning library: create a database with examples of tagged objects, add more examples in the future, and just re-train the model for better results.
But my problem is: I don't know how to recalculate or restructure the 3D data so it can be used with ML libraries. I have no idea.
Thank you for your help.
You have to first extract "features" out of your dataset. These are fixed-dimension vectors. Then you have to define labels which define the prediction. Then, you have to define a loss function and a neural network. Put that all together and you can train a classifier.
In your example, you would first need to extract a fixed-dimension vector from each object. For instance, you could take the object and project it onto a fixed support along the x, y, and z dimensions. That defines the features.
For each object, you'll need to label whether it's convex or concave. You can do that by hand, analytically, or by creating objects analytically that are known to be concave or convex. Now you have a dataset with a lot of sample pairs (object, is-concave).
For the loss function, you can simply use the negative log-probability.
Finally, a feed-forward network with some convolutional layers at the bottom is probably a good idea.
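One hedged way to get such a fixed-dimension vector from an .stl file is to histogram the mesh vertices on a fixed 3D grid; this sketch assumes the numpy-stl package, and the file name and bin count are illustrative:

import numpy as np
from stl import mesh  # pip install numpy-stl

def mesh_features(path, bins=8):
    """Load an .stl mesh and return a fixed-length (bins**3) feature vector."""
    m = mesh.Mesh.from_file(path)
    # m.vectors has shape (n_triangles, 3, 3): flatten to a list of vertices.
    verts = m.vectors.reshape(-1, 3)
    # Normalize into the unit cube so the histogram is scale-invariant.
    span = verts.max(axis=0) - verts.min(axis=0)
    verts = (verts - verts.min(axis=0)) / (span + 1e-9)
    hist, _ = np.histogramdd(verts, bins=(bins, bins, bins))
    return (hist / hist.sum()).ravel()

features = mesh_features("model.stl")  # e.g. 512 numbers per object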

How to find the most important features learned during Deep Learning using CNN?

I followed the tutorial given at this site, which detailed how to perform text classification on the movie dataset using a CNN. It used the movie review dataset to predict positive and negative reviews.
My question is: is there any way to find the most important learned features from the model? Does TensorFlow/Theano have any support for this?
Thanks!
A word of warning: if you can trace the classification back to specific input features, it's quite possible that CNN is the wrong ML paradigm for your application. Most text processing uses RNN, bag-of-words, bi-grams, and other simple linear combinations.
The structure of a CNN is generally antithetical to identifying the importance of individual features. Because of the various non-linear layers, it is rarely possible to pick out any one feature as important; rather, the combinations of inputs form small structures of inference, which then convolve to form more complex structures, until the final output is driven by a series of neighbor relationships, cut-offs, poolings, and other items.
This is why back-propagation is so important to running CNNs: the causation chain does not reverse cleanly. Otherwise, we'd reduce the process to a simple linear NN with one hidden layer.
If you want to analyze what's happening, try visualizing your intermediate layers. There are various modules to help with that; for instance, try a search for "+theano +visualize +CNN -news" (the last is to remove the high-traffic references to Cable News Network). There are plenty of examples in image processing; we won't know how much it might help your text processing, until you try it.
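As a minimal sketch of that layer-visualization idea (in Keras rather than Theano; the trained model, layer name, and input batch are assumed to come from your own code):

import tensorflow as tf

def layer_activations(model, layer_name, batch):
    """Return the feature maps that a given layer produces for a batch of inputs."""
    probe = tf.keras.Model(model.inputs, model.get_layer(layer_name).output)
    return probe.predict(batch)  # e.g. shape (batch, seq_len, n_filters) for a 1D conv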

Supervised Dimensionality Reduction for Text Data in scikit-learn

I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At its core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised, BoW data in scikit-learn?
Thanks!
I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depends on the type of vectors you have (frequency counts, tfidf) and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
An increasingly popular approach uses transformation by random projections, Random Indexing. You can do this in scikit-learn with the functions in random_projection.
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be a better option. Beware, it is slow and it only takes pure frequency counts (no tfidf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would make a scikit-learn pipeline with a vectorizer, dimensionality-reduction options, and a classifier, and would carry out a massive parameter search. That way, you will see what gives you the best results with the label set you have.
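A sketch of that pipeline-plus-parameter-search idea; the grid values are illustrative, and LogisticRegression here assumes a single-label setup (a multi-label problem would need something like OneVsRestClassifier around it):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "svd__n_components": [100, 300, 500],   # common values from the literature
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(docs, labels)   # docs: list of strings, labels: array of class ids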
You can use latent dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that you start here (note: it still assumes a high level of mathematical maturity).
Several existing scikit modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It finds a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
Cross decomposition includes methods like Partial Least Squares, which fit linear regression models for multidimensional targets via a projection through a lower-dimensional intermediate space. It is a lot like a single-hidden-layer neural net without the sigmoids.
These are linear regression methods, but you could apply a 0-1 encoding to your target signal and use these models anyway.
You could use an L1-regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping to get a more stable feature set.
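A short sketch of that L1-based feature selection on sparse BoW vectors; the regularization strength and solver are illustrative choices:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_clf)
# X_reduced = selector.fit_transform(X_bow, y)  # keeps only features with nonzero weights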
Try ISOMAP. There's a super simple built-in function for it in scikits.learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The hidden-layer representation is by definition optimised to distinguish between the classes, since that is what is directly optimised when the weights are trained.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).
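A minimal sketch of that idea in Keras (the layer sizes and number of classes are assumptions, and the BoW input would need to be dense):

from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes, hidden_dim = 10000, 20, 64

inputs = keras.Input(shape=(n_features,))
hidden = layers.Dense(hidden_dim, activation="tanh", name="hidden")(inputs)
outputs = layers.Dense(n_classes, activation="softmax")(hidden)

clf = keras.Model(inputs, outputs)
clf.compile(optimizer="adam", loss="categorical_crossentropy")
# clf.fit(X_dense, y_onehot, epochs=10)

# After training, the hidden activations are the class-aware reduced representation.
encoder = keras.Model(inputs, hidden)
# X_reduced = encoder.predict(X_dense)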
