How is cosine similarity performed in this Keras word2vec implementation? - python

I'm following a tutorial
http://adventuresinmachinelearning.com/word2vec-keras-tutorial/
which uses cosine similarity for the validation model but a plain dot product (not normalized, unlike cosine similarity) for training. That's really confusing, but my question is about the axis of the dot product. For the validation model it's:
similarity = merge([target, context], mode='cos', dot_axes=0)
as a result when I log the model summary for the validation model I get:
Layer (type) Output Shape
dot_1 (Dot) (300, 1, 1)
but for training model
dot_product = merge([target, context], mode='dot', dot_axes=1)
not surprisingly, the model summary shows something different:
Layer (type) Output Shape
dot_2 (Dot) (None, 1, 1)
I've read that the first dimension of 'Output Shape' in the summary (whose value is usually None) shows the batch size (if it's fixed for some reason), but it still doesn't make sense to me why it's 300 for the first model, because I think the batch size in the example is not fixed and 300 is also my embedding dimension. Clearly something's wrong with the dot_axes, can anybody confirm? Do I just fix it by putting dot_axes=1 instead of zero? Why would the first axis be 1 and not 0? Does dot_axes=0 dot over the batch dimension or what (makes no sense to me)?
code on github:
https://github.com/adventuresinML/adventures-in-ml-code/blob/master/keras_word2vec.py
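For reference, here's a small shape check using the current Keras Dot layer (which replaced the old merge API), assuming the tutorial's setup where both embeddings are reshaped to (300, 1); this is an illustration, not the tutorial's own code:
from keras.layers import Input, Dot
from keras.models import Model

target = Input(shape=(300, 1))     # full shape including the batch axis: (None, 300, 1)
context = Input(shape=(300, 1))

# axes=1 dots over the embedding dimension; normalize=True makes it cosine similarity
similarity = Dot(axes=1, normalize=True)([target, context])

Model([target, context], similarity).summary()  # Dot output shape: (None, 1, 1)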

Related

Tensorflow Keras Predict Returning Wrong Shaped Output

I am trying to predict a single image. But my model returns a prediction array with the shape (1,1,1,2048) when it should be (1,10). Any idea what I am doing wrong? My x input shape is correct at (1,32,32,3).
import tensorflow as tf

def ResNet50V2():
    IMG_SHAPE = (32, 32, 3)
    return tf.keras.applications.ResNet50V2(input_shape=IMG_SHAPE, include_top=False, weights=None, classes=10)

model = ResNet50V2()
x = x[None, :]
predictions = model.predict(x)
You are loading your keras-model with parameter
include_top=False
which cuts off the fully-connected projection layer that is responsible for projecting the model output to your expected amount of classes. Change the parameter to True.
That's because you are disabling the top with include_top=False, which removes the final classification layer. You need to either add your own layer with 10 classes, or remove the include_top parameter (it defaults to True) and retrain the network with the desired inputs.
An image classification network usually works in two processing steps.
The first one is feature extraction, which we call the "base": a stack of layers that find and reinforce patterns in the image (Conv2D, ReLU and MaxPool).
The second one is the "head", which is used to classify the features extracted in the previous step.
Your code is getting the raw output of the "base", with no classification, and as the other answers stated, the solution is adding a custom "head" or changing the include_top parameter to True.
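A minimal sketch of the custom-head option, assuming 10 classes and the same 32x32 input (the pooling/Dense head is illustrative, not the asker's code):
import tensorflow as tf

IMG_SHAPE = (32, 32, 3)
base = tf.keras.applications.ResNet50V2(input_shape=IMG_SHAPE, include_top=False, weights=None)
model = tf.keras.Sequential([
    base,                                            # outputs (None, 1, 1, 2048) for 32x32 input
    tf.keras.layers.GlobalAveragePooling2D(),        # -> (None, 2048)
    tf.keras.layers.Dense(10, activation='softmax')  # -> (None, 10)
])
# predictions = model.predict(x)  # now shaped (1, 10) for a single image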

What values should I take of input_shape of LSTM in below model

This is an image of the LSTM model code; please help me choose an appropriate input_dim value for the first LSTM layer.
For the code you gave, all possible answers are wrong.
An LSTM layer accepts a 3D input tensor of shape (batch_dim, time_dim, feat_dim), and you should write input_shape=(time_dim, feat_dim) in the layer definition.
However, since you use X_train = np.expand_dims(X_train, axis=0), it implies there is only one training sample in your data, which makes no sense at all. I therefore suspect what you really want to do is
X_train = np.expand_dims(X_train, axis=-1)
which gives you X_train.shape[0] samples, X_train.shape[1] time steps, and only 1 feature dimension, which is quite common in many time-series analysis problems.
If my guess is correct, then your input shape of LSTM should be of shape (X_train.shape[1], 1).
NOTE: by Keras convention the batch_dim is intentionally not specified, which makes sense, because if you included it in your model definition you would have to use that particular batch size for both training and testing, which is very inconvenient.
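A minimal sketch of that guess, assuming a univariate series (the data and layer sizes here are made up):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

X_train = np.random.rand(100, 50)           # placeholder data: (n_samples, n_timesteps)
X_train = np.expand_dims(X_train, axis=-1)  # -> (100, 50, 1): samples, time steps, 1 feature

model = Sequential()
model.add(LSTM(32, input_shape=(X_train.shape[1], 1)))  # (time_dim, feat_dim); batch_dim omitted
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')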

Split TFLite model into two submodels

I have a trained TF model which has the following architecture:
Inputs:
word_a, one-hot representation, vocab-size: 50000
word_b, one-hot representation, vocab-size: 50
Output:
probs, size: 1x10000
The network consists of an embedding lookup for word_a, giving a 1x100 vector (dense_word_a) from an embedding matrix. word_b is transformed into a similar dense vector of size 1x250 using a character CNN. Both vectors are concatenated into a 1x350 vector, and using a decoder layer with a sigmoid we map it to an output vector of size 1x10000.
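For illustration, a rough Keras sketch of that combined architecture (sizes taken from the description above, the char CNN abbreviated to a single Dense layer, all layer names hypothetical):
import tensorflow as tf

word_a = tf.keras.Input(shape=(50000,), name='word_a')   # one-hot
word_b = tf.keras.Input(shape=(50,), name='word_b')      # one-hot

# embedding lookup expressed as a matmul with the one-hot input
dense_word_a = tf.keras.layers.Dense(100, use_bias=False, name='dense_word_a')(word_a)
dense_word_b = tf.keras.layers.Dense(250, name='char_cnn_stand_in')(word_b)  # stand-in for the char CNN

concat = tf.keras.layers.Concatenate()([dense_word_a, dense_word_b])          # 1x350
probs = tf.keras.layers.Dense(10000, activation='sigmoid', name='probs')(concat)

model = tf.keras.Model([word_a, word_b], probs)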
I need to run this model on the client, and for this I'm converting it to TFLite.
However, I also need to break the model into two sub-models with the following inputs and outputs:
Model 1:
Inputs:
word_a: one-hot representation, (1x50000) vocab-size: 50000
Output:
dense_word_a: dense word-embedding looked up from embedding matrix (1x100)
Network:
Simple embedding lookup for word_a from embedding matrix.
Model 2:
Inputs:
dense_word_a: embedding for word_a received from Model 1. (1x100).
word_b, one-hot representation, vocab-size: 50 (1x50)
Output:
probs, size: 1x10000
In Model 1, the input word_a is a placeholder and dense_word_a is a variable. In Model 2, dense_word_a is a placeholder and its value is concatenated with word_b's embedding.
The embeddings in my model are not pre-trained, and are trained as part of the model training process itself. So I need to train the model as a combined model but during inference I want to break it up into Model 1 and Model 2 as described above.
The idea is to run the Model 1 on server side and pass the embedding values to client so it can perform inference using word_b and not have a 5MB embedding matrix on the client. So, I'm not constrained on the size of Model 1, but since Model 2 runs on the client I need it to be small.
Here's what I've tried:
When freezing the model, I freeze the full model, but in the output nodes list I also add the tensor name dense_word_a along with probs. I then convert the model to TFLite. During inference, I'm able to see the dense_word_a output as a 1x100 vector. This seems to work fine, and I'm also getting probs as output.
For generating Model 2, I just remove the dense_word_a variable and convert it into a placeholder (using tf.placeholder), remove the placeholder for word_a and freeze the graph again.
However, I'm not able to match the probs values: the probs vector generated by the full model doesn't match the probs vector generated by Model 2.
How can I go about breaking the model into sub-models and also match the results?
I think what you described should work.
Is it easy to reproduce the problem that you're seeing? If you can isolate the reproducible steps and you believe there's a bug, could you file a bug on github? Thanks!
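One way to do the split without editing the graph by hand is to point the TFLite converter at the same frozen graph twice, once with dense_word_a as an output and once with it as an input. A rough TF 1.x-style sketch (file and tensor names are placeholders for whatever your graph actually uses):
import tensorflow as tf

# Model 1 (server side): word_a -> dense_word_a
converter1 = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    'frozen_full_model.pb',
    input_arrays=['word_a'],
    output_arrays=['dense_word_a'])
open('model1.tflite', 'wb').write(converter1.convert())

# Model 2 (client side): dense_word_a + word_b -> probs
# Declaring 'dense_word_a' as an input prunes the embedding matrix and
# the lookup ops out of this sub-graph.
converter2 = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    'frozen_full_model.pb',
    input_arrays=['dense_word_a', 'word_b'],
    output_arrays=['probs'],
    input_shapes={'dense_word_a': [1, 100], 'word_b': [1, 50]})
open('model2.tflite', 'wb').write(converter2.convert())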

LSTM having a systematic offset between predictions and ground truth

Currently I think I'm experiencing a systematic offset in an LSTM model between the predictions and the ground truth values. What's the best approach to proceed from here?
The model architecture, along with the predictions & ground truth values are shown below. This is a regression problem where the historical data of the target plus 5 other correlated features X are used to predict the target y. Currently the input sequence n_input is of length 256, where the output sequence n_out is one. Simplified, the previous 256 points are used to predict the next target value.
X is normalized. The mean squared error is used as the loss function. Adam with a cosine annealing learning rate is used as the optimizer (min_lr=1e-7, max_lr=6e-2).
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
cu_dnnlstm_8 (CuDNNLSTM) (None, 256) 270336
_________________________________________________________________
batch_normalization_11 (Batc (None, 256) 1024
_________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 256) 0
_________________________________________________________________
dropout_11 (Dropout) (None, 256) 0
_________________________________________________________________
dense_11 (Dense) (None, 1) 257
=================================================================
Total params: 271,617
Trainable params: 271,105
Non-trainable params: 512
_________________________________________________________________
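For reference, a sketch that reproduces the summary above (the 6 input features are inferred from the 270,336 LSTM parameters, i.e. 4 * 256 * (6 + 256 + 2); the dropout rate is a guess since it isn't recoverable from the summary):
from keras.models import Sequential
from keras.layers import CuDNNLSTM, BatchNormalization, LeakyReLU, Dropout, Dense

n_input, n_features = 256, 6

model = Sequential()
model.add(CuDNNLSTM(256, input_shape=(n_input, n_features)))
model.add(BatchNormalization())
model.add(LeakyReLU())
model.add(Dropout(0.2))   # rate assumed
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')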
Increasing the node size in the LSTM layer, adding more LSTM layers (with return_sequences=True) or adding dense layers after the LSTM layer(s) only seems to lower the accuracy. Any advice would be appreciated.
Additional information on the image. The y-axis is a value, x-axis is the time (in days). NaNs have been replaced with zero, because the ground truth value in this case can never reach zero. That's why the odd outliers are in the data.
Edit:
I made some changes to the model, which increased accuracy. The architecture is the same, but the features used have changed: currently only the historical data of the target sequence itself is used as a feature. Along with this, n_input was changed to 128, Adam was swapped for SGD, mean squared error for mean absolute error, and finally the NaNs have been interpolated instead of being replaced with 0.
One step ahead predictions on the validation set look fine:
However, the offset on the validation set remains:
It might be worth noting that this offset also appears on the train set for x < ~430:
It looks like your model is overfitting and is simply always returning the value from the last timestep as a prediction. Your dataset is probably too small for a model with this many parameters to converge. You'll need to resort to techniques that combat overfitting: aggressive dropout, adding more data, or trying simpler, less over-parameterized methods.
This phenomenon (LSTMs returning a shifted version of the input) has been a recurring theme in many stackoverflow questions. The answers there might contain some useful information:
LSTM Sequence Prediction in Keras just outputs last step in the input
LSTM model just repeats the past in forecasting time series
LSTM NN produces “shifted” forecast (low quality result)
Keras network producing inverse predictions
Stock price predictions of keras multilayer LSTM model converge to a constant value
Keras LSTM predicted timeseries squashed and shifted
LSTM Time series shifted predictions on stock market close price
Interesting results from LSTM RNN : lagged results for train and validation data
Finally, be aware that, depending on the nature of your dataset, there simply might be no pattern to be discovered in your data at all. You see this a lot with people trying to predict the stock market with LSTMs (there is a question on stackoverflow on how to predict the lottery numbers).
The answer is much simpler than we thought...
I saw multiple people saying this is due to overfitting and data size. Some others stated it is due to rescaling.
After several tries, I found the solution: detrend the data before feeding it to the RNN.
For example, you can fit a simple degree-2 polynomial to the data, which gives you a polynomial formula. Then subtract the corresponding formula value from each data point. That gives you a new, detrended dataset to feed to the LSTM; after prediction you just add the trend back to the result, and the results should look better.
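A minimal numpy sketch of that idea (the series, horizon and model output here are made up for illustration):
import numpy as np

series = np.cumsum(np.random.randn(500)) + 0.05 * np.arange(500)  # fake trending series
horizon = 10

t = np.arange(len(series))
coeffs = np.polyfit(t, series, deg=2)   # degree-2 polynomial trend
trend = np.polyval(coeffs, t)

detrended = series - trend              # train the LSTM on this instead of the raw series

# ... fit the LSTM on `detrended` and predict `horizon` future values ...
predictions = np.zeros(horizon)         # stand-in for the LSTM's detrended predictions

t_future = np.arange(len(series), len(series) + horizon)
final_predictions = predictions + np.polyval(coeffs, t_future)  # add the trend back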

How to train a network in Keras for varying output size

I have a basic neural network created with Keras. I train the network successfully with vectors of data and corresponding output data that is a vector with two elements. It represents a coordinate (x, y). So in goes an array, out comes an array.
The problem is that I am unable to use training data where a single input vector should correspond to many coordinates. Effectively, I want a vector of coordinates as output, without prior knowledge of the number of coordinates.
Network is created by
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(2))
and model summary shows the output dimensions for each layer
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20) 3932180
_________________________________________________________________
dense_2 (Dense) (None, 10) 210
_________________________________________________________________
dense_3 (Dense) (None, 2) 22
=================================================================
I realize the network structure only allows a length 2 vector as output. Dense layers also do not accept None as their size. How do I modify the network so that it can train on and output a vector of vectors (list of coordinates)?
Recurrent neural networks (RNNs) would be much more appropriate; these models are typically called seq2seq, that is, sequence to sequence. Recurrent nets use layers like LSTM and GRU, and can take variable-length sequences as input and output. Just look at things like machine translation done with RNNs.
This can be done directly with keras, and there are many examples lying around the internet, for example this one.
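As a very rough illustration of the idea (not the asker's data; MAX_LEN and the layer sizes are invented, a fixed maximum length stands in for true variable-length decoding, and targets would be padded to MAX_LEN coordinates):
from keras.models import Sequential
from keras.layers import Dense, RepeatVector, LSTM, TimeDistributed

MAX_LEN = 10  # invented upper bound on coordinates per sample

model = Sequential()
model.add(Dense(64, input_shape=(196608,)))   # encode the input vector
model.add(RepeatVector(MAX_LEN))              # feed the encoding to every decoder step
model.add(LSTM(32, return_sequences=True))    # decoder
model.add(TimeDistributed(Dense(2)))          # one (x, y) pair per step
model.compile(optimizer='adam', loss='mse')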
An RNN is not what you want for predicting coordinates. Instead, I would recommend using a model that predicts coordinates and associated confidences. You would have 100 coordinate predictions for every forward pass through your model, and each of those predictions would have another associated prediction that determines whether it is correct or not. Only predictions above a certain confidence threshold would count. That confidence threshold is what allows the model to choose how many points it wants to use each time (with a maximum set by the number of outputs, which in this example is 100).
R-CNN is a model that does just that. Here is the first Keras implementation I found on GitHub: https://github.com/yhenon/keras-frcnn.
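A bare-bones sketch of the coordinates-plus-confidence idea on top of the asker's own layers (100 points and the 0.5 threshold are arbitrary, and coordinates are assumed normalized to [0, 1] because of the sigmoid):
from keras.models import Sequential
from keras.layers import Dense, Reshape

MAX_POINTS = 100

model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(MAX_POINTS * 3, activation='sigmoid'))  # x, y, confidence for each point
model.add(Reshape((MAX_POINTS, 3)))

# At inference time, keep only the confident points:
# preds = model.predict(batch)                  # shape (batch, 100, 3)
# coords = preds[0][preds[0, :, 2] > 0.5, :2]   # points whose confidence > 0.5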
