If I have to use pretrained word vectors as an embedding layer in a neural network (e.g. a CNN), how do I deal with index 0?
Detail:
We usually start by creating a 2D numpy array of zeros, and later fill it in with the indices of words from the vocabulary.
The problem is that 0 is already the index of another word in our vocabulary (say, 'i' is at index 0). So we are effectively initializing the whole matrix with 'i' instead of with empty/padding entries. How, then, do we pad all sentences to equal length?
One idea that comes to mind is to pad with another index, numberOfWordsInVocab + 1. But wouldn't that take more space? [Help me!]
One idea that comes to mind is to pad with another index, numberOfWordsInVocab + 1. But wouldn't that take more space?
Nope! That's the same size.
import numpy as np

a = np.full((5000, 5000), 7)   # padded with an arbitrary non-zero index
a.nbytes                       # 200000000
b = np.zeros((5000, 5000))     # padded with zeros
b.nbytes                       # 200000000
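As a side note (my addition, not from the original answer): what determines the size is the array's shape and dtype, not the value you pad with. For example, a float32 matrix of the same shape takes half the space:

# Same shape, smaller dtype -> half the memory, regardless of the fill value
np.zeros((5000, 5000), dtype=np.float32).nbytes   # 100000000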
If I have to use pretrained word vectors as an embedding layer in a neural network (e.g. a CNN), how do I deal with index 0?
Answer
In general, empty entries can be handled with a per-sample (or per-timestep) weight on the model's cost with respect to the targets.
However, when dealing with words and sequential data, things can be a little tricky and there are several things that can be considered. Let's make some assumptions and work with that.
Assumptions
We begin with a pre-trained word2vec model.
We have sequences with varying lengths, with at most max_length words.
Details
Word2Vec is a model that learns a mapping (embedding) from discrete variables (word token = word unique id) to a continuous vector space.
The representation in the vector space is such that the cost function (CBOW or Skip-gram; essentially predicting a word from its context, or the context from a word) is minimized on the corpus.
Reading basic tutorials (like Google's word2vec tutorial in the TensorFlow tutorials) reveals some details of the algorithm, including negative sampling.
The implementation is a lookup table. It is faster than the alternative one-hot encoding technique, since the dimensions of a one-hot encoded matrix are huge (say 10,000 columns for 10,000 words, n rows for n sequential words). The lookup table is significantly faster: it simply selects rows of the embedding matrix (one row vector per word).
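To make the lookup concrete, here is a minimal sketch (my own illustration; the matrix and indices are made up, not from the answer):

import numpy as np

vocab_size, embed_dim = 10000, 300
embedding = np.random.rand(vocab_size, embed_dim).astype(np.float32)  # stand-in for pretrained word2vec vectors

token_ids = np.array([12, 7, 431, 7])    # a sequence of word indices
sequence_vectors = embedding[token_ids]  # the "lookup": select rows, result has shape (4, 300)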
Task
Handle the missing entries (positions with no word) and use the result in the model.
Suggestions
If there is some use for the cost at the missing entries (for example, you want a prediction there and have a label for it), you can add a new value as suggested. It can be index 0, but then every existing index must shift by one (i = i + 1) and the embedding matrix needs a new row at position 0 (see the sketch after this list).
If you follow the first suggestion, you need to train the added row. You could use negative sampling for the NaN class vs. all others. I do not suggest it for handling missing values, but it is a good trick for handling an "unknown word" class.
You can weight the cost of those entries with a constant 0 for every position beyond a sample's true length (i.e. for each sample shorter than max_length). That is, if we have the sequence of word tokens [0,5,6,2,178,24,0,NaN,NaN], the corresponding weight vector is [1,1,1,1,1,1,1,0,0].
Should you worry about re-indexing the words and its cost? In memory there is almost no difference (one extra row vs. N words, with N large). In complexity, it is something that can later be incorporated into the initial tokenization function. The predictions and model complexity are the larger issue and the more important requirement for the system.
There are numerous ways to tackle varying lengths (LSTMs, RNNs, and now CNNs and cost-weighting tricks). Read the state-of-the-art literature on that issue; I'm sure there is much work. For example, see the paper A Convolutional Neural Network for Modelling Sentences.
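As a rough sketch of suggestions 1 and 3 above (my own illustration, not code from the answer; the shapes and names are made up), prepending a padding row at index 0 and building a per-timestep weight mask could look like this:

import numpy as np

max_length, embed_dim = 9, 300
pretrained = np.random.rand(10000, embed_dim).astype(np.float32)   # stand-in for the real word2vec matrix

# Suggestion 1: reserve index 0 for padding -> every existing index moves by one (i = i + 1)
embedding = np.vstack([np.zeros((1, embed_dim), dtype=np.float32), pretrained])

def encode(token_ids, max_length):
    """Shift indices by 1, pad with index 0, and build the weight mask from suggestion 3."""
    ids = [i + 1 for i in token_ids][:max_length]
    pad = max_length - len(ids)
    return np.array(ids + [0] * pad), np.array([1.0] * len(ids) + [0.0] * pad)

ids, weights = encode([0, 5, 6, 2, 178, 24, 0], max_length)
# ids     -> [1, 6, 7, 3, 179, 25, 1, 0, 0]
# weights -> [1, 1, 1, 1, 1, 1, 1, 0, 0]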
Related
I have a set of vectors that represent words, and each vector has 300 features, meaning there are 300 floats per vector. My goal is to reduce the dimensionality, e.g. to 50, so that I can save some space.
How can I apply dimensionality reduction to this vector set using, e.g., TensorFlow? I couldn't find a method or implementation that takes a list of vectors as input and reduces them.
You might want to look into convolutional neural networks for text processing. CNNs in general are known for dimensionality reduction of the input vectors. They are usually used for image classification but also work on text and sentence classification. What you are looking for is the embedding of an input vector. Quote:
Now that our words have been replaced by numbers, we could simply do one-hot encoding but that would result in an extremely wide input — there are thousands of unique words in the titles dataset. A better approach is to reduce the dimensionality of the input — this is done through an embedding layer (see full code here):
This is from here:
TowardsDataScience
Another resource:
AnalyticsVidhya
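To make the quoted idea concrete, here is a minimal sketch (my own, with hypothetical sizes, assuming the tf.keras API) of an embedding layer that maps word indices to 50-dimensional vectors inside a small text CNN:

from tensorflow.keras import layers, models

vocab_size, seq_len = 10000, 100   # hypothetical vocabulary size and (padded) title length

model = models.Sequential([
    layers.Input(shape=(seq_len,)),                          # integer word indices
    layers.Embedding(input_dim=vocab_size, output_dim=50),   # 50-d dense vectors instead of 10,000-d one-hot
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.summary()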
Let's say I have a dataset consisting of a review column with exactly 100 words per review; then it may be easy to train my model, as I can simply tokenize each of the 100 words of each review, convert it into a numerical array, and feed it into a Sequential model with input_shape=(1,100). But in the real world, reviews are never the same size. If I use a function such as CountVectorizer, then the structure of the sentence is not preserved, and one-hot encoding may not be efficient enough.
So what is the proper way to preprocess this particular dataset so that I can feed it into a trainable NN?
A common way to represent text as vectors is by utilizing word embeddings. The main idea is that you use a large text corpus to compute vector representations of all words occurring in that corpus. So now, for each review, you could run the following algorithm to compute its vector representation:
For each word in the review, check if a word embedding exists (in other words, that word occurred in the large training corpus) and if it does, add its vector representation to the representation of the review
Once you summed up the vector representations of all words, you compute the average embedding by dividing the summed review vector by the number of words in the document and this results in the final vector representation for that document
This vector can now be fed into a trainable NN
Before performing steps 1-3, you could also apply more preprocessing steps and remove filler words such as "and", "or", etc., as they usually carry no meaning; you could convert words to lower case and apply other standard NLP (natural language processing) techniques, which could affect the vector representation of the reviews. But the key idea is to sum up the word vectors of a review and use its averaged vector as the representation of the review. By averaging, the length of the reviews becomes unimportant. Similarly, in word embeddings the dimensionality of the word vectors is fixed (100D, 200D, ...), so you can experiment to find the most suitable dimensionality.
Note that there are many different models available that compute word embeddings, so you could choose any of them. One that is nicely integrated into Python is word2vec.
And a state-of-the-art model that is currently being used by Google is called BERT.
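A minimal sketch of steps 1-3 using gensim's word2vec vectors (my own illustration; the vector file path, preprocessing, and example reviews are hypothetical):

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to pretrained vectors (e.g. a word2vec model in binary format)
wv = KeyedVectors.load_word2vec_format("word2vec_vectors.bin", binary=True)

def review_vector(review):
    """Steps 1-3: keep in-vocabulary words, sum their vectors, divide by the word count."""
    tokens = [t for t in review.lower().split() if t in wv]   # real code would also strip punctuation, stop words, etc.
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

reviews = ["great product would buy again", "terrible quality"]   # made-up examples
X = np.vstack([review_vector(r) for r in reviews])   # fixed-size matrix, ready to feed into a NN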
I'm using Google's Word2vec and I'm wondering how to get the top words that are predicted by a skipgram model that is trained using hierarchical softmax, given an input word?
For instance, when using negative sampling, one can simply multiply an input word's embedding (from the input matrix) with each of the vectors in the output matrix and take the one with the top value. However, in hierarchical softmax, there are multiple output vectors that correspond to each input word, due to the use of the Huffman tree.
How do we compute the likelihood value/probability of an output word given an input word in this case?
I haven't seen any way to do this, and given the way hierarchical-softmax (HS) outputs work, there's no obviously correct way to turn the output nodes' activation levels into a precise per-word likelihood estimation. Note that:
the predict_output_word() method that (sort-of) simulates a negative-sampling prediction doesn't even try to handle HS mode
during training, neither HS nor negative-sampling modes make exact predictions – they just nudge the outputs to be more like the current training example would require
To the extent you could calculate all output node activations for a given context, then check each word's unique HS code-point node values for how close they are to "being predicted", you could potentially synthesize relative scores for each word – some measure of how far the values are from a "certain" output of that word. But whether and how each node's deviation should contribute to that score, and how that score might be indicative of an interpretable likelihood, is unclear.
There could also be issues because of the way HS codes are assigned strictly by word frequency – 'neighbor' words sharing mostly-the-same encoding may be very different semantically. (There were some hints in the original word2vec.c code that it could potentially be beneficial to assign HS encodings by clustering related words to have similar codings, rather than by strict frequency, but I've seen little practice of that since.)
I would suggest sticking to negative sampling if interpretable predictions are important. (But also remember, word2vec isn't mainly used for predictions; it just uses the training-attempts-at-prediction to bootstrap a vector arrangement that turns out to be useful for other tasks.)
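For reference, a minimal sketch of the negative-sampling route (my own toy example, assuming gensim 4.x; the corpus and parameters are made up):

from gensim.models import Word2Vec

# Tiny toy corpus, just to have something trainable
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# hs=0 and negative>0 select negative sampling, the mode that predict_output_word supports
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, hs=0, negative=5, epochs=50)

print(model.predict_output_word(["cat", "on"], topn=5))   # [(word, probability), ...]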
I want to build a recurrent neural network (RNN) in TensorFlow that predicts the next word in a sequence of words. I have looked at several tutorials, e.g. the TensorFlow one. I know that each word in the training text(s) is mapped to an integer index. However, there are still a few things about the input that I don't get:
Networks are trained with batches, e.g. with 128 examples at the same time. Let's say we have 10,000 words in our vocabulary. Is the input to the network a matrix of size (128, sequence_length) or a one-hot encoded tensor (128, sequence_length, 10,000)?
How large is the second dimension, i.e. the sequence length? Do I use one sentence in each row of the batch, padding the sentences that are shorter than others with zeros?
Or can a row correspond to multiple sentences? E.g. can a row stand for "This is a test sentence. How are"? If so, where does the second sentence continue? In the next row of the same batch? Or in the same row in the next batch? How do I guarantee that TensorFlow continues the sentence correctly?
I wasn't able to find answers to these questions even if they are quite simple. I hope someone can help!
Yes, it is a 3-dimensional tensor of shape (128, sequence_length, 10,000).
Yes, you should pad your sentences to make them the same length. You can also use tf.nn.dynamic_rnn, which can handle variable-length sentences based on tf.while_loop.
There is a great article that deals with this problem:
https://danijar.com/variable-sequence-lengths-in-tensorflow/
You can find more detail in "What's the difference between tensorflow dynamic_rnn and rnn?"
It is possible, but the network doesn't know whether the sentences are connected or not; it just treats each row as one sentence. So the result would be meaningless.
I hope this answer helps you.
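A minimal sketch of the padding + tf.nn.dynamic_rnn approach (my own illustration; it assumes the TensorFlow 1.x API, where tf.nn.dynamic_rnn and placeholders are available, and the sizes are made up):

import tensorflow as tf  # TensorFlow 1.x assumed

batch_size, max_len, vocab_size, hidden_units = 128, 30, 10000, 128

# One-hot encoded input as described above, with padded positions filled by all-zero steps
inputs  = tf.placeholder(tf.float32, [batch_size, max_len, vocab_size])
seq_len = tf.placeholder(tf.int32, [batch_size])   # true (unpadded) length of each row

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_units)
# dynamic_rnn stops updating each row's state past its sequence_length
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)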
I'm new to machine learning algorithms. I have read the scikit-learn website and other SO posts extensively, which led me to build my first machine learning algorithm using RandomForestClassifier and LinearSVC.
I'm working on medical notes. Each patient stay is associated (or not) with a code corresponding to a complication (bleeding, infection, heart attack...).
Using the notes, fitted and transformed with CountVectorizer and TfidfTransformer, I can accurately predict most of the codes. However, I'd like to add more data to my training dataset: length of stay, number of operations, title of operations, ICU stay duration, etc.
After searching the web and SO, I ended up appending all the continuous/binary/scaled values to my word-frequency array.
e.g.: [0, 0, 0.34, 0, 0.45, 0, 2, 45] (the last 2 numbers are the added data, whereas the previous ones come from CountVectorizer and tfidf.fit_transform(train_set)).
However, this seems to me a crude way to combine data, and the huge number of word features could mask the other data.
I tried to structure my data like [[0,0,0.34,0,0.45,0],[2],[45]], but it doesn't work.
I searched the web but found no real clue, even though I might not be the first one facing this issue... :p
Thanks for your help.
Edit:
Thanks for your detailed and valuable answer, I really appreciate it. However, what exactly is the range 0-1: is it the predict_proba value (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)? I understood that the score is the accuracy of the prediction model. Then, once you have all the predictions for each variable, do you average them? Finally, I'm working with multiple outputs; I guess that's not a problem since I can get a prediction for each output (by the way, predict_proba(X) gives me an array like [array([[0., 1.]]), array([[0.2, 0.8]]), ...] with a random forest classifier; I guess one of the numbers is the probability of the output, but I haven't explored this yet!)
Your first solution of just appending to the list is the correct one. However, you should think about what this implies. If you have 100 words and add two additional features, each individual word gets the same "weight" as the added features, i.e. your added features won't be weighted very strongly in the model. Additionally, you're saying that the last feature, with a value of 45, is 100x the value of the feature 4th from the end (0.45).
One common way to get around that is to use an ensemble model. Instead of adding those features to your list of words and predicting, first build a prediction model using just the words. That prediction will be in the range 0-1 and will capture the "sentiment" of the article. Then, scale your other variables (min-max scaler, normal distribution, etc.). Finally, combine the score from the words with the two scaled variables and run another prediction on a list like [.86, .2, .65]. In this way, you have transformed all of the words into a sentiment score, which you can use as a feature.
Hope that helps.
EDIT PER YOUR UPDATE ABOVE
Yes, in this instance you could use predict_proba, but really, if everything is scaled correctly and you are using 1/0 as your class targets, you don't even need predict_proba. The idea is to take the prediction from the words and combine it with the other variables. You do not average the predictions, you make a prediction from the predictions! This is called ensemble learning. Train another model with the output of your predictions as the features. Here is a flow of what you need to do.
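A rough sketch of that flow (my own illustration with scikit-learn; the notes, features, and model choices are hypothetical, not from the answer):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

notes = ["patient bled after surgery", "no complication observed", "infection treated with antibiotics"]
other = np.array([[2, 45], [1, 3], [4, 12]], dtype=float)   # e.g. number of operations, length of stay
y = np.array([1, 0, 1])

# Step 1: text-only model -> one 0-1 "sentiment" score per note
X_text = TfidfVectorizer().fit_transform(notes)
text_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_text, y)
text_score = text_clf.predict_proba(X_text)[:, 1]           # probability of the positive class
# (in practice, compute these scores on held-out folds to avoid leaking the training labels)

# Step 2: scale the other variables to a comparable 0-1 range
other_scaled = MinMaxScaler().fit_transform(other)

# Step 3: second-stage model trained on rows like [.86, .2, .65]
X_stack = np.column_stack([text_score, other_scaled])
final_clf = LogisticRegression().fit(X_stack, y)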
Thanks for your time and your detailed answer, I think I get it. In short:
Make a prediction based on the words: for each bag of words in the training set (t1), you pull out a "sentiment".
Create a new array for each training-set row with the sentiment and the other values -> new training set (t2).
Make a prediction based on t2.
Apply the previous steps to the test set.
One more question though!
What is the "sentiment" value?! For each bag of words, I have a sparse matrix (CountVectorizer + tf-idf). So how do you calculate the sentiment? Do you run each row of the test set against the rest of the test set? And is your sentiment the clf.predict(X) value?