I have a set of vectors that represent words, and each vector has 300 features, meaning there are 300 floats per vector. My goal is to reduce the dimensionality, e.g. to 50, so that I can save some space.
How can I apply dimensionality reduction to this vector set using e.g. TensorFlow? I couldn't find a method or implementation that takes a list of vectors as input and reduces them.
You might want to look into convolutional neural networks for text processing. CNNs in general are known for reducing the dimensionality of their input vectors. They are usually used for image classification, but they also work for text and sentence classification. What you are looking for is an embedding of the input vectors. Quote:
Now that our words have been replaced by numbers, we could simply do one-hot encoding but that would result in an extremely wide input — there are thousands of unique words in the titles dataset. A better approach is to reduce the dimensionality of the input — this is done through an embedding layer (see full code here):
This is from here:
TowardsDataScience
Another resource:
AnalyticsVidhya
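A minimal sketch of the embedding-layer idea from the quoted article, with made-up sizes: instead of one-hot encoding a vocabulary of 10,000 words (a 10,000-dimensional input per word), each word id is mapped to a dense 50-dimensional vector that is learned during training.

import tensorflow as tf

vocab_size, reduced_dim = 10000, 50            # made-up sizes
inputs = tf.keras.layers.Input(shape=(None,))  # a sequence of integer word ids
embedded = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=reduced_dim)(inputs)
# embedded has shape (batch, sequence_length, 50) instead of (batch, sequence_length, 10000)
model = tf.keras.models.Model(inputs, embedded)

Note that this learns a low-dimensional representation from word ids as part of a model; it does not directly compress an existing set of 300-float vectors.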
Let's say I have a dataset consisting of a review column with exactly 100 words for each review. Then it may be easy to train my model, as I can simply tokenize each of the 100 words of each review, convert them into a numerical array, and feed it into a Sequential model with input_shape=(1,100). But in the real world, reviews are never the same size. If I use a function such as CountVectorizer, the structure of the sentence is not preserved, and one-hot encoding may not be efficient enough.
So what is the proper way to preprocess this particular dataset so that I can feed it into a trainable NN?
A common way to represent text as vectors is by utilizing word embeddings. The main idea is that you use a large text corpus to compute vector representations of all words occurring in it. For each review, you can then run the following algorithm to compute its vector representation:
1. For each word in the review, check if a word embedding exists (in other words, that word occurred in the large training corpus), and if it does, add its vector representation to the representation of the review.
2. Once you have summed up the vector representations of all words, divide the summed review vector by the number of words in the document to obtain the average embedding; this is the final vector representation for that document.
3. This vector can now be fed into a trainable NN.
Before performing steps 1-3, you could also apply more preprocessing steps: remove stop words such as "and", "or", etc. (they usually carry no meaning), convert words to lower case, and apply other standard NLP (natural language processing) techniques, all of which can affect the vector representation of the reviews. But the key idea is to sum up the word vectors of a review and use the averaged vector as the representation of the review. Because of the averaging, the length of the reviews does not matter. Similarly, in word embeddings the dimensionality of the word vectors is fixed (100D, 200D, ...), so you can experiment to find the most suitable dimensionality.
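A minimal sketch of steps 1-3, using a hypothetical embeddings dictionary (word -> 100-dimensional numpy vector) as a stand-in for the vectors of a real pretrained model:

import numpy as np

embedding_dim = 100
# Hypothetical pretrained embeddings; in practice these come from word2vec, GloVe, etc.
embeddings = {"great": np.random.rand(embedding_dim), "movie": np.random.rand(embedding_dim)}

def review_vector(review):
    # Step 1: collect the vectors of all words that exist in the embedding vocabulary.
    vectors = [embeddings[w] for w in review.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(embedding_dim)
    # Step 2: sum and divide by the word count, so the review length no longer matters.
    return np.sum(vectors, axis=0) / len(vectors)

x = review_vector("A great movie")  # fixed-size 100-dimensional input for the NN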
Note that there are many different models available that compute word embeddings, so you could choose any of them. One that is nicely integrated into Python is word2vec.
And a state-of-the-art model that is currently being used by Google is called BERT.
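If you go the word2vec route, one way to obtain pretrained vectors in Python is the gensim package and its downloader module (an assumption on my part; the model name below is one of the pretrained models gensim distributes):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # downloads pretrained 300-dimensional vectors
print(wv["movie"].shape)                   # (300,)
print(wv.most_similar("movie", topn=3))    # nearest neighbours in the embedding space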
I'm trying to build a neural network using pytorch-nlp (https://pytorchnlp.readthedocs.io/en/latest/).
My intent is to build a network like this:
Embedding layer (uses the standard PyTorch layer and the from_pretrained method)
Encoder with LSTM (also uses the standard nn.LSTM)
Attention mechanism (uses torchnlp.nn.Attention)
Decoder with LSTM (same as the encoder)
Standard linear layer
I'm encountering a major problem with the dimensions of the input sentences (each word is a vector), and most importantly with the attention layer: I don't know how to declare it, because I need the exact dimensions of the output from the encoder, but the sequences have varying dimensions (corresponding to the fact that sentences have different numbers of words).
I've tried looking at torch.nn.utils.rnn.pad_packed_sequence and torch.nn.utils.rnn.pack_padded_sequence since they're supported by LSTM, but I cannot find the solution.
Can anyone help me?
EDIT
I thought about padding all sequences to a specific dimension, but I don't want to truncate longer sequences because I want to keep all the information.
You are on the right track with padding all sequences to a specific dimension. You will have to pick a length that is larger than "most" of your sentences, but you will need to cut off some sentences. This blog article should help.
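A minimal sketch, assuming the 300-dimensional pretrained embeddings have already been looked up and using made-up sizes, of how torch.nn.utils.rnn.pad_sequence, pack_padded_sequence and pad_packed_sequence can be combined with an LSTM encoder so that the padded positions are skipped:

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three "sentences" of different lengths; each word is a 300-dimensional vector.
sentences = [torch.randn(7, 300), torch.randn(4, 300), torch.randn(5, 300)]
lengths = torch.tensor([len(s) for s in sentences])

# Pad to the length of the longest sentence -> shape (batch, max_len, 300).
padded = pad_sequence(sentences, batch_first=True)

lstm = torch.nn.LSTM(input_size=300, hidden_size=128, batch_first=True)

# Pack so the LSTM ignores the padded positions, then unpad the outputs.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h, c) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)  # torch.Size([3, 7, 128])

The attention layer can then be declared for the fixed hidden size (128 here) and the fixed, padded sequence length, rather than for varying sentence lengths.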
As I understand it, embedding layers are simply lookup matrices whose weights are learned as part of the optimisation problem.
Suppose, for this example, my dataset contains a single categorical variable. For example, I would like to auto-encode a sentence of words to itself, to learn the sentence representation.
# example model
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(None,))                       # sequence of word indices
embed = tf.keras.layers.Embedding(input_dim=99, output_dim=64)(inputs)
encoder = tf.keras.layers.LSTM(64, return_sequences=True)(embed)    # encoded sequence
decoder = tf.keras.layers.LSTM(64, return_sequences=True)(encoder)  # tries to reconstruct embed
model = tf.keras.models.Model(inputs, decoder)
The loss will minimise the difference between the embed and decoder outputs.
However, since the embeddings are learned as part of the optimisation, I think I will end up learning trivial representations, e.g. the embedding matrix is all ones (or even all zeros) and the decoder always outputs ones, giving me 100% accuracy in training.
For example, if in the embedding matrix all words are just a vector of ones, the auto-encoder can simply return ones.
What I would like to do is to learn a meaningful representation of categorical variables.
If I have to use pretrained word vectors as the embedding layer in a neural network (e.g. a CNN), how do I deal with index 0?
Detail:
We usually start by creating a 2D numpy array of zeros, and later fill in the indices of the words from the vocabulary.
The problem is that 0 is already the index of another word in our vocabulary (say, 'i' is at index 0). Hence, we are basically initializing the whole matrix as if it were filled with 'i' instead of with empty positions. So how do we pad all the sentences to equal length?
One easy idea that pops to mind is to use another index, numberOfWordsInVocab+1, for padding. But wouldn't that take more space? [Help me!]
One easy idea that pops to mind is to use another index, numberOfWordsInVocab+1, for padding. But wouldn't that take more space?
Nope! That's the same size.
import numpy as np

a = np.full((5000, 5000), 7)  # filled with a non-zero padding index, e.g. numberOfWordsInVocab+1
a.nbytes                      # 200000000
b = np.zeros((5000, 5000))    # filled with index 0
b.nbytes                      # 200000000
If I have to use pretrained word vectors as the embedding layer in a neural network (e.g. a CNN), how do I deal with index 0?
Answer
In general, empty entries can be handled by weighting the cost computed between the model outputs and the targets.
However, when dealing with words and sequential data, things can be a little tricky, and there are several aspects to consider. Let's make some assumptions and work with them.
Assumptions
We begin with a pre-trained word2vec model.
We have sequences of varying lengths, with at most max_length words.
Details
Word2Vec is a model that learns a mapping (embedding) from discrete variables (word token = word unique id) to a continuous vector space.
The representation in the vector space is such that the cost function (CBOW or Skip-gram; essentially predicting a word from its context, or the context from a word) is minimized on the corpus.
Reading basic tutorials (like Google's word2vec tutorial among the TensorFlow tutorials) reveals some details of the algorithm, including negative sampling.
The implementation is a lookup table. It is faster than the alternative one-hot encoding technique, since a one-hot encoded matrix is huge (say 10,000 columns for 10,000 words, and n rows for n sequential words). The lookup (hash) table simply selects rows from the embedding matrix (one row vector per word), which is significantly faster.
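A minimal illustration of that lookup with made-up sizes: the embedding matrix has one row per word id, so looking up a tokenized sentence is just row selection.

import numpy as np

vocab_size, embedding_dim = 10000, 300          # made-up sizes
embedding_matrix = np.random.rand(vocab_size, embedding_dim)
token_ids = np.array([0, 5, 6, 2, 178, 24])     # a tokenized sentence (word ids)
sentence_vectors = embedding_matrix[token_ids]  # row selection, shape (6, 300)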
Task
Add entries for the missing positions (no word) and use them in the model.
Suggestions
If there is some use for the cost of the missing data, for example if there is a prediction for that entry and a label for it, you can add a new value as suggested (it can be index 0, but then all indices must shift by i = i + 1 and the embedding matrix needs a new row at position 0).
If you follow the first suggestion, you need to train the added row. You can use negative sampling for the NaN class vs. all. I do not suggest it for handling missing values, but it is a good trick for handling an "unknown word" class.
You can weight the cost of those entries by a constant 0 for each sample that is shorter than max_length. That is, if we have a sequence of word tokens [0,5,6,2,178,24,0,NaN,NaN], the corresponding weight vector is [1,1,1,1,1,1,1,0,0] (see the sketch at the end of this answer).
You should think about the cost of re-indexing the words. In memory, there is almost no difference (1 vs. N words, where N is large). In complexity, it is something that can later be incorporated into the initial tokenize function. The predictions and the model complexity are the larger issue and the more important requirement for the system.
There are numerous ways to tackle varying lengths (LSTMs, RNNs, and now CNNs and cost tricks). Read the state-of-the-art literature on that issue; I'm sure there is much work. For example, see the paper A Convolutional Neural Network for Modelling Sentences.
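A minimal sketch of the first and third suggestions, assuming Keras; pretrained stands in for a real word2vec matrix of shape (vocab_size, 300), all word indices are shifted by +1, index 0 is reserved for padding, and mask_zero makes downstream layers ignore (i.e. zero-weight) the padded positions:

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10000, 300                   # made-up sizes
pretrained = np.random.rand(vocab_size, embedding_dim)   # stand-in for real pretrained vectors

# Prepend an all-zero row so index 0 can be used for padding; word i moves to row i + 1.
weights = np.vstack([np.zeros((1, embedding_dim)), pretrained])

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size + 1,
    output_dim=embedding_dim,
    mask_zero=True,      # positions with index 0 are masked in downstream layers
    trainable=False,
)
embedding.build((None,))          # create the layer's weight matrix
embedding.set_weights([weights])  # load the shifted pretrained matrix

# A padded batch (word ids already shifted by +1): the trailing zeros are treated as padding.
batch = np.array([[1, 6, 7, 3, 179, 25, 1, 0, 0]])
print(embedding(batch).shape)  # (1, 9, 300)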
I want to create a Neural Network in Keras for converting handwriting into computer letters.
My first step is to convert a sentence into an array. My array has the shape (1, number of letters, 27). Now I want to input it into my deep neural network and train it.
But how do I input it properly if the dimensions don't match those of my image? And how do I get my predict function to give me an output array of shape (1, number of letters, 27)?
It seems like you are attempting handwriting recognition, or more broadly Optical Character Recognition (OCR). This is quite a broad field and there are many ways to proceed. Even so, one approach I suggest is the following:
It is commonly known that neural networks have fixed-size inputs; that is, if you build a model to take, say, inputs of shape (28, 28, 1), then it will expect that shape for all of its inputs. Therefore, having a dimension in your samples that depends on the number of letters in a sentence (something variable) is not recommended, as you will not be able to train a model that way.
Training such a model is possible if you design it to predict one character at a time, instead of a whole sentence that can have different lengths, and then group the predicted characters. The steps you could try to achieve this are:
Obtain training samples for the characters you wish to recognize (like the MNIST database for example), and design and train your model to predict one character at a time.
Take the image with the writing to classify and pass a sliding window over it that matches your expected input size (say a 28x28 window). Then classify each of those windows as a character (see the sketch after these steps). Instead of a sliding window, you could also try isolating your desired features somehow and classify just those 28x28 segments.
Group the predicted characters somehow so you get words (probably grouping those separated by empty spaces) or do whatever you want with the predictions.
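A minimal sketch of the first two steps, assuming a 27-class alphabet (from the question) and a hypothetical line image line_img that is already 28 pixels high; the window slides across the width in fixed strides:

import numpy as np
import tensorflow as tf

num_classes = 27  # alphabet size assumed from the question

# Step 1: a small CNN that classifies a single 28x28 character image.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
# model.compile(...) and model.fit(...) would be run on per-character samples (e.g. MNIST-like data).

# Step 2: slide a 28x28 window over a line image of shape (28, width) and classify each window.
line_img = np.random.rand(28, 200)          # stand-in for a real scanned line of writing
windows = [line_img[:, x:x + 28] for x in range(0, line_img.shape[1] - 27, 14)]
batch = np.stack(windows)[..., np.newaxis]  # shape (num_windows, 28, 28, 1)
predicted_chars = model.predict(batch).argmax(axis=1)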
You can also try searching for tutorials or guides on handwriting recognition, like this one which I have found quite useful. I hope this helps you get on track. Good luck.