Training an RNN to output word2vec embeddings instead of logits - python

Traditionally it seems that RNNs use logits to predict the next time step in the sequence. In my case I need the RNN to output a word2vec (50-dimensional) vector prediction. This means that the cost function has to be based on two vectors: Y, the actual vector of the next word in the series, and Y_hat, the network's prediction.
I've tried using a cosine distance cost function, but the network does not seem to learn (I've let it run over 10 hours on an AWS P3 and the cost stays around 0.7).
Is such a model possible at all? If so, what cost function should be used?
Cosine distance in TF:
cosine_distance = tf.losses.cosine_distance(tf.nn.l2_normalize(outputs, 2), tf.nn.l2_normalize(targets, 2), axis=2)
Update:
I am trying to predict a word2vec vector so that during sampling I can pick the next word based on the closest neighbors of the predicted vector.

What is the reason that you want to predict a word embedding? Where are you getting the "ground truth" word embeddings from? For word2vec models you typically re-use the trained word embeddings in future models. If you trained a word2vec model with an embedding size of 50, then you would have 50-d embeddings that you could save and use in future models. If you just want to re-create an existing ground-truth word2vec model, then you could just use those values. A typical word2vec model has regular softmax outputs (trained via continuous-bag-of-words or skip-gram) and then saves the resulting word embeddings.
If you really do have a reason for trying to build a model that matches word2vec, then looking at your loss function, here are a few suggestions. I do not believe that you should be normalizing your outputs or your targets -- you probably want those to remain unaffected (the targets are no longer the "ground truth" targets if you have normalized them). Also, it appears you are using dim=0, which has now been deprecated and replaced with axis. Did you try different values for dim? This should represent the dimension along which to compute the cosine distance, and I think that the 0th dimension would be the wrong one (as this is likely the batch dimension). I would try axis=-1 (the last dimension) or axis=1 and see if you observe any difference.
Separately, what is your optimizer and learning rate? If the learning rate is too small, then you may not actually be able to move far enough in the right direction.
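For reference, here is a minimal sketch (TF 1.x, with hypothetical placeholder shapes and learning rate, not taken from the question) of computing the cosine distance by hand along the last (embedding) axis and attaching an optimizer with an explicit learning rate:
import tensorflow as tf

# Hypothetical shapes: [batch, time, 50] RNN outputs and word2vec targets.
outputs = tf.placeholder(tf.float32, [None, None, 50], name="outputs")
targets = tf.placeholder(tf.float32, [None, None, 50], name="targets")

# Cosine similarity along the embedding axis (axis=-1); the stored targets
# themselves are left untouched, only their norms enter the computation.
dot = tf.reduce_sum(outputs * targets, axis=-1)
norms = tf.norm(outputs, axis=-1) * tf.norm(targets, axis=-1)
cosine_similarity = dot / (norms + 1e-8)

# Loss = 1 - mean cosine similarity (0 when predictions align with targets).
loss = 1.0 - tf.reduce_mean(cosine_similarity)

# The learning rate is a hyperparameter worth sweeping, as noted above.
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)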

Related

How to build an end-to-end NLP RNN classification model?

I have trained an NLP model with an RNN in Keras to classify tweets, using word embeddings (Stanford GloVe) as features. I would like to apply this trained model to newly extracted tweets. However, this error appears when I try to apply the model to the new data.
ValueError: Error when checking input: expected input_1 to have shape (22,) but got array with shape (51,)
Then I realised that the trained model expects a 22-dimensional input vector (the maximum tweet length in the training set), whereas the new dataset I would like to apply the model to produces a 51-dimensional input vector (the maximum tweet length in the new dataset).
In an attempt to tackle this, I increased the size of the vector when training the model to 51 so both would match. A new error popped up:
InvalidArgumentError: indices[29,45] = 5870 is not in [0, 2489)
Thus, I decided to try applying the model back on the training dataset to see if it was possible in the first place with the original parameters and model. It was not, and a very similar error appeared.
InvalidArgumentError: indices[23,11] = 2489 is not in [0, 2489)
In this case, how can I export an end-to-end NLP RNN classification model and apply it to new, unseen data? (FYI: I was able to do this successfully for Logistic Regression with TF-IDF features. There just seem to be a lot of issues with the word embeddings.)
===========
UPDATE:
I was able to solve this issue by pickling not only the model, but also variables such as max_len, the texttotensor_instance and the tokenizer. When applying the model to new data, I have to use the same variables generated from the training data (instead of redefining them with the new data).
Your error occurs because the word indices in your data exceed the vocabulary size set in the Embedding layer (its input_dim).
It seems that the input_dim parameter in your Embedding layer is set to 2489, while words in your dataset are tokenized and mapped to a higher index (5870).
Also, don't forget to add one to the maximum number of words when you set this in the Embedding layer (input_dim=max_number_of_words+1). If you're interested to know why, check this question: Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?
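As a concrete illustration of the update above, here is a minimal Keras sketch (hypothetical data and file names) of pickling the tokenizer and max_len at training time and reusing them, unchanged, on new data:
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_tweets = ["sample training tweet one", "another training tweet"]  # placeholder data
new_tweets = ["a brand new tweet to classify"]                          # placeholder data

# At training time: fit the tokenizer on the training tweets only, then
# persist it together with the padding length the model expects.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_tweets)
max_len = max(len(t.split()) for t in train_tweets)   # e.g. 22 in the question

with open("preprocessing.pkl", "wb") as f:
    pickle.dump({"tokenizer": tokenizer, "max_len": max_len}, f)

# At inference time: reload and reuse the SAME tokenizer and max_len;
# do not refit or recompute them on the new data.
with open("preprocessing.pkl", "rb") as f:
    prep = pickle.load(f)

seqs = prep["tokenizer"].texts_to_sequences(new_tweets)
X_new = pad_sequences(seqs, maxlen=prep["max_len"])
# model.predict(X_new)  # 'model' is the trained Keras classifier (not shown here)
Because the new tweets are converted with the training tokenizer, no index can exceed the vocabulary the Embedding layer was built for, which is what the index-out-of-range errors above were complaining about.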

Can we normalize features extracted from pre-trained models

I am working with features extracted from pre-trained VGG16 and VGG19 models. The features have been extracted from the second fully connected layer (FC2) of these networks.
The resulting feature matrix (of dimensions (8000, 4096)) has values in the range [0, 45]. As a result, when I use this feature matrix in gradient-based optimization algorithms, the loss function, gradients, norms, etc. take very high values.
In order to do away with such high values, I applied MinMax normalization to this feature matrix, and since then the values are manageable and the optimization algorithm is behaving properly. Is my strategy OK, i.e. is it fair to normalize features that have been extracted from pre-trained models for further processing?
From experience, as long as you are aware that your results are coming from normalized values, it is okay. If normalization helps you inspect gradients, norms, etc. better, then I am for it.
What I would be cautious about, though, is any further analysis on those feature matrices, as they hold normalized rather than true values. Say, if you were to study the distributions and such, you should be fine, but I am not sure what your next step is, and whether this can/will be harmful.
Can you share more details around "further analysis"?
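For concreteness, the kind of scaling described in the question might look like this (a scikit-learn sketch with randomly generated stand-in features; fitting the scaler on the training split only and reusing it elsewhere keeps any later analysis consistent):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for FC2 features of shape (8000, 4096) with values roughly in [0, 45].
features = np.random.uniform(0.0, 45.0, size=(8000, 4096)).astype(np.float32)

# Fit the scaler on the training features only, then reuse the same scaler
# for any held-out split so both are scaled the same way.
scaler = MinMaxScaler()                     # maps each feature column to [0, 1]
features_scaled = scaler.fit_transform(features)

print(features_scaled.min(), features_scaled.max())   # ~0.0, ~1.0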

TF2 - Splitting Input Data & Using Different Pre-Trained Weights of a Layer

I have individually trained the same neural network architecture on a large number of different datasets (order of 100s) to learn a unique non-linear function for each, i.e. I have basically learned a set of weights that describes the function for each dataset.
Now, I want to use these sets of weights as a pre-trained layer in another optimization problem. I know how to load a single saved model and employ it as a layer. However, what I will be doing is a group-wise optimization across the 100s of different datasets, where I have pre-trained weights for each (from above).
So the setup is a batch of x datasets, each with n data points in d dimensions, i.e. the input data has shape [X, N, D]. A series of layers acts on all of this data, and then, when it gets to the "pre-trained" layer, I wish to use different pre-trained weights for each dataset: [0,:,:] uses the weights learned from dataset 0 above, [1,:,:] the weights learned from dataset 1, and so on.
I then need to combine the output of all of this, as the loss function for this group-wise optimization is based on the variance across all datasets. So I don't believe I can trivially evaluate one set, calculate the loss, change the weights, rinse and repeat, and sum up at the end.
I doubt it is feasible to have massive duplicate branches, with x copies of the pre-trained NN layers, as the pre-trained architecture is already quite complex.
Is it possible to use a split layer and then a for-loop type approach, in which I change the weights and pass the correct portion of data through, and then merge all the outputs? Or is there a better way of tackling this?
Any help much appreciated.
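One possible sketch of the split-and-loop idea, assuming TF2/Keras, placeholder sizes, and Dense layers as stand-ins for the pre-trained per-dataset layers (in practice each would be loaded from a saved model, e.g. with tf.keras.models.load_model):
import tensorflow as tf

X, N, D = 4, 32, 8   # placeholder sizes: datasets, points per dataset, dimensions

# Stand-ins for the per-dataset pre-trained layers, frozen during this optimization.
pretrained = [tf.keras.layers.Dense(D, trainable=False) for _ in range(X)]

inputs = tf.keras.Input(shape=(X, N, D))

# Shared layers that act on all datasets at once.
shared = tf.keras.layers.Dense(D, activation="relu")(inputs)

# Split along the dataset axis, route each slice through its own pre-trained
# layer, then stack everything back together.
per_dataset_outputs = []
for i in range(X):
    slice_i = shared[:, i, :, :]                 # shape (batch, N, D)
    per_dataset_outputs.append(pretrained[i](slice_i))
merged = tf.stack(per_dataset_outputs, axis=1)   # shape (batch, X, N, D)

model = tf.keras.Model(inputs, merged)

# Group-wise objective based on the variance across datasets (axis=1).
def variance_loss(y_true, y_pred):
    return tf.reduce_mean(tf.math.reduce_variance(y_pred, axis=1))

model.compile(optimizer="adam", loss=variance_loss)
Because all x branches live in one graph, the variance-based loss sees every dataset's output at once and gradients flow through the shared layers in a single update, rather than evaluating one set at a time.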

EnsembleVoteClassifier with neural network

I have trained neural networks whose predictions I am trying to average using EnsembleVoteClassifier from mlxtend.classifier. The problem is that my neural networks don't share the same input (I performed feature reduction and feature selection algorithms randomly and stored the results in different variables, so I have something like X_test_algo1, X_test_algo2 and X_test_algo3, plus Y_test).
I am trying to average the weights, but as I said, I don't have the same X, and I didn't find any example in the documentation. How can I average the predictions of my three models model1, model2 and model3?
eclf = EnsembleVoteClassifier(clfs=[model1, model2, model3], weights=[1,1,1], refit=False)
names = ['NN1', 'NN2', 'NN3', 'Ensemble']
eclf.fit(X_train_algo1, Ytrain) #????
If it's not possible, that is okay. I am only interested in how to calculate Hard Voting, Soft Voting and Weighted Voting; a more flexible library, or the explicit expressions of the formulas, would be helpful too.
Why would you need a library to do that?
Simply pass the same examples through all your neural networks and get the predictions (either logits or probabilities or labels).
Hard voting: choose the label predicted most often by the classifiers.
Soft voting: average the probabilities predicted by the classifiers and choose the label with the highest average.
Weighted voting: either of the above can be weighted. Just assign a weight to each classifier and multiply its predictions by it. Weights are usually normalized to the (0, 1] range.
In principle you could also sum the logits and choose the label with the highest sum.
Oh, and weight averaging is a different technique: it requires you to have the same model, and it is usually done with the same initialization but at different training timesteps. You can read about it in this blog post.
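A small NumPy sketch of those rules, with randomly generated stand-in probabilities (in practice p1, p2 and p3 would come from model1.predict(X_test_algo1), model2.predict(X_test_algo2) and model3.predict(X_test_algo3), with the rows of the three feature matrices aligned to the same examples):
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-class probabilities, one row per test example.
def fake_proba(n_samples=5, n_classes=3):
    p = rng.random((n_samples, n_classes))
    return p / p.sum(axis=1, keepdims=True)

p1, p2, p3 = fake_proba(), fake_proba(), fake_proba()

weights = np.array([1.0, 1.0, 1.0])
weights = weights / weights.sum()          # normalize the classifier weights

# Soft (weighted) voting: weighted average of probabilities, then argmax.
soft_vote = np.argmax(weights[0] * p1 + weights[1] * p2 + weights[2] * p3, axis=1)

# Hard voting: each model votes with its own argmax; take the most frequent label.
votes = np.stack([p.argmax(axis=1) for p in (p1, p2, p3)], axis=1)
hard_vote = np.array([np.bincount(row).argmax() for row in votes])

print(soft_vote, hard_vote)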

Why do we use numpy.argmax() to return an index from a numpy array of predictions?

Let me preface this by saying, I am very new to neural networks, and this is my first time using numpy, tensorflow, or keras.
I wrote a neural network to recognize handwritten digits, using the MNIST data set. I followed this tutorial by Sentdex and noticed he was using print(np.argmax(predictions[0])) to print the first index from the numpy array of predictions.
I tried running the program with that line replaced by print(predictions[i]) (with i set to 0), but the output was not a number; it was:
[2.1975785e-08 1.8658861e-08 2.8842608e-06 5.7113186e-05 1.2067199e-10
7.2511304e-09 1.6282028e-12 9.9993789e-01 1.3356166e-08 2.0409643e-06].
The code that I'm confused about is:
predictions = model.predict(x_test)
for i in range(10):
    plt.imshow(x_test[i])
    plt.show()
    print("PREDICTION: ", predictions[i])
I read the numpy documentation for the argmax() function, and from what I understand, it takes in an n-dimensional array, flattens it to a one-dimensional array, and then returns the index of the largest value. The Keras documentation for model.predict() indicates that the function returns a numpy array of the network's predictions. So I don't understand why we have to use argmax() to properly print the prediction, because as I understand it, it has a completely unrelated purpose.
Sorry for the bad code formatting, I couldn't figure out how to properly insert multi line chunks of code into my post
What any classification neural network outputs is a probability distribution over the class indices, meaning that the network assigns one probability to each class. The sum of these probabilities is 1.0. Then the network is trained to assign the highest probability to the correct class, so to recover the class index from the probabilities you have to take the location (index) that has the maximum probability. This is done with the argmax operation.
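For example, running np.argmax on the probability vector printed in the question recovers the predicted digit:
import numpy as np

# The probability vector from the question: one entry per digit class 0-9.
predictions_0 = np.array([2.1975785e-08, 1.8658861e-08, 2.8842608e-06, 5.7113186e-05,
                          1.2067199e-10, 7.2511304e-09, 1.6282028e-12, 9.9993789e-01,
                          1.3356166e-08, 2.0409643e-06])

print(predictions_0.sum())       # ~1.0: the entries form a probability distribution
print(np.argmax(predictions_0))  # 7: index of the largest probability = predicted digit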
If I understand your question well, the answer is pretty simple:
You want to predict which digit is in the image, and for that you are using a softmax activation layer to predict a probability for each class.
So your predictions array has NUMBER_OF_CLASSES entries, but what we want is not the class probabilities, only which digit is in the image.
So we take the index of the maximum probability in this array of predictions.
That index corresponds to the digit predicted by the network.
I hope that's clear!
