I want to implement a GRU for sentiment analysis and this is what I have so far:
epochs = 20
batch_size = 25
embedding_size = 50
layers = 1
max_label = 2 # only valid target labels are 0 and 1
# one word is fed in at any time instance
embedding_matrix = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embeddings = tf.nn.embedding_lookup(embedding_matrix, x)
# number of units in the GRU cell is equal to the embedding size
cell = tf.contrib.rnn.GRUCell(embedding_size)
cell = tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=0.75)
# encoding fed into softmax prediction layer
output, states = tf.nn.dynamic_rnn(cell, embeddings, dtype=tf.float32)
logits = tf.layers.dense(output, max_label, activation=None)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(cross_entropy)
# prediction is accurate if predicted label is equal to the actual label
prediction = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
But I get this error:
ValueError: Rank mismatch: Rank of labels (received 1) should equal rank of logits minus 1 (received 3).
I tried changing the value of max_label to 2 and 1 and tried to make logits=logits-1 but the error is still there. How can I fix it?
NB: This answer is written without knowing the shape of your label vector y or which line raises the error. Please include this information in future questions.
According to the tensorflow 1.3 docs, the usage of nn.sparse_softmax_cross_entropy_with_logits is:
"... to have logits of shape [batch_size, num_classes] and labels of shape [batch_size]. But higher dimensions are supported."
From my experience with RNNs in TensorFlow, the shape convention of your model will be [Batch, Length of Sequence, Feature Size], so your output, and therefore your logits, will be 3-dimensional. That matches the error: the logits have rank 3 while y has rank 1. According to the quote above, your labels should then have shape [Batch, Length of Sequence], with each value being the index of the label (so if position (X,Y) is class 30, then, in your code, y[X,Y] = 30). This differs (from memory) from the non-sparse version, which takes one-hot encodings (or, more generally, a probability distribution per entry).
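The rank rule can be checked without TensorFlow. The following is a minimal NumPy sketch, not the TensorFlow implementation: `sparse_softmax_cross_entropy` is a hypothetical stand-in for the op, and `dense_w` stands in for the dense layer's weights. It also shows one possible way out when there is a single sentiment label per sequence, assuming that is the intent: project only the last timestep so the logits have rank 2.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """NumPy stand-in for the TF op: labels must have rank(logits) - 1."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    if labels.ndim != logits.ndim - 1:
        raise ValueError(
            "Rank mismatch: rank of labels (%d) should equal rank of logits minus 1 (%d)"
            % (labels.ndim, logits.ndim))
    # Numerically stable log-softmax over the last (class) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class for every entry.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

batch, steps, hidden, classes = 4, 7, 5, 2
rng = np.random.default_rng(0)
rnn_output = rng.normal(size=(batch, steps, hidden))  # plays the role of `output`
dense_w = rng.normal(size=(hidden, classes))          # hypothetical dense weights
y = np.array([0, 1, 1, 0])                            # one label per sequence

logits_all_steps = rnn_output @ dense_w               # (batch, steps, classes): rank 3
logits_last_step = rnn_output[:, -1, :] @ dense_w     # (batch, classes): rank 2

try:
    sparse_softmax_cross_entropy(logits_all_steps, y)  # rank 3 vs rank 1: rejected
    rank_error = None
except ValueError as exc:
    rank_error = str(exc)

per_example_loss = sparse_softmax_cross_entropy(logits_last_step, y)  # shape (batch,)
```

In the question's graph the analogous change would be to feed `output[:, -1, :]` (or the final `states`) into the dense layer; whether that suits the data depends on how the sequences are padded.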
I am working on a binary classification problem using LSTM layers where I classify each timestep as belonging to a class (0 or 1). My sequences have variable lengths, so I padded the extra steps at the end of each sequence with the value -1 and masked them. My input has the shape (300,2000,8), so all 300 samples have 2000 timesteps and 8 features. If a sample originally had 1500 timesteps, I add 500 extra steps at the end, with the value -1 for each of the 8 features.
I added the padding for both my inputs = train_x and the labels = train_y, so the shape of train_y is actually (300,2000,1). So, both the input and label vectors are padded with a -1 signaling the steps to ignore.
Now, I have some doubts that have been giving me headaches for days. From what I understood from here, whenever Keras 'sees' a timestep where all the feature values are -1, it ignores it during processing. However, when I access the tensors y_predictions and y_labels in my custom loss function, the timesteps that have a value of -1 in y_labels also have a prediction from the model (a value between 0 and 1) that is usually the same for all -1 timesteps. Did I do something wrong?
Should I pad and mask only the features vector and keep the vector of labels with their original size when passing it to the model?
I think I end up 'ignoring' the -1 timesteps by doing the following at the start of the loss function and then using only the indices of the timesteps where y_true != -1 when doing the calculations and returning the loss value. Does that make sense?
pos_class = tf.where(y_true > 0)
neg_class = tf.where(y_true == 0)
... rest of calculations ...
The code for the model building part goes as follows:
# train_x -> (300,2000,8)
# train_y -> (300,2000,1)
# both already padded and masked with -1 for the extra steps
input_layer = Input(shape=(2000, 8))
mask_1 = Masking(mask_value=-1)(input_layer)
lstm_1 = LSTM(64, return_sequences=True)(mask_1)
dense_1 = Dense(1, activation="sigmoid")(lstm_1)
model = Model(inputs=input_layer, outputs=dense_1)
model.summary()
optimizer = Adam(lr=0.001)
model.compile(optimizer=optimizer, loss=CustomLossFun, metrics=[CustomMcc(), CustomPrecision()])
train_model = model.fit(x = train_x, y = train_y, ...)
I understood this after reading the answer to this question.
I still obtain a tensor with the size of the entire sequence in the loss function. However, the values at the masked timesteps are only fillers and were not computed (that's probably why they are the same value repeated). In my loss function I still need to ignore them explicitly, though.
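One way to ignore the padded timesteps inside a loss is to build a 0/1 mask from the labels and average only over real steps. The sketch below uses NumPy rather than Keras tensors, and `masked_binary_cross_entropy` is a hypothetical helper, not part of Keras; it assumes, as in the question, that padded labels carry the value -1.

```python
import numpy as np

def masked_binary_cross_entropy(y_true, y_pred, pad_value=-1.0, eps=1e-7):
    """Mean binary cross-entropy over timesteps whose label is not pad_value."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    mask = (y_true != pad_value).astype(np.float64)   # 1 for real steps, 0 for padding
    safe_true = np.where(mask == 1.0, y_true, 0.0)    # keep the padded -1 out of the log terms
    per_step = -(safe_true * np.log(y_pred)
                 + (1.0 - safe_true) * np.log(1.0 - y_pred))
    return (per_step * mask).sum() / max(mask.sum(), 1.0)

# Two sequences of 4 timesteps; the second has 2 padded steps.
y_true = np.array([[[1.0], [0.0], [1.0], [0.0]],
                   [[1.0], [0.0], [-1.0], [-1.0]]])
y_pred = np.array([[[0.9], [0.2], [0.8], [0.1]],
                   [[0.7], [0.3], [0.5], [0.5]]])

loss = masked_binary_cross_entropy(y_true, y_pred)

# Changing the predictions at padded steps leaves the loss unchanged.
y_pred_other = y_pred.copy()
y_pred_other[1, 2:, 0] = 0.01
loss_other = masked_binary_cross_entropy(y_true, y_pred_other)
```

Dividing by the number of real steps rather than by the full tensor size keeps heavily padded batches from producing an artificially small loss.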
I would like to create a 'Sequential' model (a Time Series model as you might have guessed), that takes 20 days of past data with a feature size of 2, and predict 1 day into the future with the same feature size of 2.
I found out that you need to specify the batch size for a stateful LSTM model. If I specify a batch size of 32, for example, the final output shape of the model is (32, 2), which I think means the model is predicting 32 days into the future rather than 1.
How would I go about fixing this?
Also, before I run into the problem: if I specify a batch size of 32 but want to predict on an input of shape (1, 20, 2), will the model still predict correctly, given that the batch size changed from 32 to 1? Thank you.
You don't need to specify batch_size, but you should feed a 3-D tensor:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras import Model, Sequential
features = 2
dim = 128
new_model = Sequential([
    LSTM(dim, stateful=True, return_sequences=True),
    Dense(features)
])
number_of_sequences = 1000
sequence_length = 20
inputs = tf.random.uniform([number_of_sequences, sequence_length, features], dtype=tf.float32)
output = new_model(inputs)  # shape is (number_of_sequences, sequence_length, features)
predicted = output[:, -1]  # last timestep only: shape is (number_of_sequences, features)
Shape of (32, 2) means that your sequence length is 32.
Batch size is a training parameter (how many sequences are fed to the model before the error is backpropagated; see stochastic gradient descent). It doesn't affect your data, which should be 3-D: (number of sequences, length of sequence, features).
If you need to predict for only one sequence, just feed a tensor of shape (1, 20, 2) to the model.
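To make the 3-D layout concrete, here is a small NumPy sketch of how a daily series could be cut into 20-day inputs and next-day targets. `make_windows` is a hypothetical helper and the series is synthetic; the point is only the resulting shapes.

```python
import numpy as np

def make_windows(series, window=20):
    """Split a (days, features) array into (inputs, targets) for next-day prediction."""
    series = np.asarray(series, dtype=np.float32)
    n_windows = series.shape[0] - window
    # inputs[i] holds days i .. i+window-1; targets[i] is day i+window.
    inputs = np.stack([series[i:i + window] for i in range(n_windows)])
    targets = series[window:]
    return inputs, targets

days, features = 100, 2
series = np.arange(days * features, dtype=np.float32).reshape(days, features)

inputs, targets = make_windows(series, window=20)
# inputs:  (80, 20, 2) -> (number of sequences, sequence length, features)
# targets: (80, 2)     -> one next-day vector per sequence

single = inputs[:1]  # shape (1, 20, 2): one sequence, as fed for a single forecast
```

The first axis counts sequences, so its size can differ between training and prediction without changing what a single row means.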
I want to create an RNN in TensorFlow that classifies short texts by analyzing them on a per-letter basis. For that I created a 2D NumPy array in which each piece of text is either padded or truncated and each element is a character code. The output is a vector of classes represented as a one-hot encoded 2D NumPy array.
Here is an example:
train_x.shape, train_y.shape
((91845, 50), (91845, 5))
Input consists of 90K rows 50 chars each, output is 90K rows with 5 classes. Next, I want to build a network shown in a figure below.
The structure looks trivial, but I definitely lack knowledge of TensorFlow and run into all kinds of problems just trying to get training to work. Here is the part of the code I use to build the network:
chars = sequence_categorical_column_with_identity('chars', params['domain_size']+1)
chars_emb = tf.feature_column.embedding_column(chars, dimension=10)
columns = [chars_emb]
input_layer, sequence_length = sequence_input_layer(features, columns)
hidden_units = 32
lstm = tf.nn.rnn_cell.LSTMCell(hidden_units, state_is_tuple=True)
rnn_outputs, state = tf.nn.dynamic_rnn(lstm,
                                       inputs=input_layer,
                                       sequence_length=sequence_length,
                                       dtype=tf.float32)
output = rnn_outputs[:,-1,:]
logits = tf.layers.dense(output, params['n_classes'], activation=tf.nn.tanh)
# apply projection to every timestep.
# Compute predictions.
predicted_classes = tf.nn.softmax(logits)
# Compute loss.
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
# Compute evaluation metrics.
accuracy = tf.metrics.accuracy(labels=labels,
                               predictions=predicted_classes,
                               name='acc_op')
But I get an error
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 8 values, but the requested shape has 1
[[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](softmax_cross_entropy_with_logits, sequence_input_layer/chars_embedding/assert_equal/Const)]]
A fuller minimal example can be found here. You will most likely need TensorFlow 1.8.0.
Adding
loss = tf.reduce_mean(loss)
now allows the network to train, but the results are underwhelming.
I'm building a DNN to predict whether an object is present in an image. My network has two hidden layers, and the last layer looks like this:
# Output layer
W_fc2 = weight_variable([2048, 1])
b_fc2 = bias_variable([1])
y = tf.matmul(h_fc1, W_fc2) + b_fc2
Then I have placeholder for labels:
y_ = tf.placeholder(tf.float32, [None, 1], 'Output')
I run training in batches (therefore first argument in Output layer shape is None).
I use the following loss function:
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    y[:, :1], y_[:, :1], name='xentropy')
loss = tf.reduce_mean(cross_entropy, name='xentropy_mean')
predict_hand = tf.greater(y, 0.5)
correct_prediction = tf.equal(tf.to_float(predict_hand), y_)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
But at runtime I got the following error:
Rank mismatch: Rank of labels (received 2) should equal rank of logits minus 1 (received 2).
I guess I should reshape labels layer, but not sure what it expects. I looked up in documentation and it says:
logits: Unscaled log probabilities of rank r and shape [d_0, d_1, ...,
d_{r-2}, num_classes] and dtype float32 or float64. labels: Tensor of
shape [d_0, d_1, ..., d_{r-2}] and dtype int32 or int64. Each entry in
labels must be an index in [0, num_classes).
If I have just a single class, what should my labels look like (right now each is just 0 or 1)? Any help appreciated.
From the documentation* for tf.nn.sparse_softmax_cross_entropy_with_logits:
"A common use case is to have logits of shape [batch_size,
num_classes] and labels of shape [batch_size]. But higher dimensions
are supported."
So I suppose your labels tensor should be of shape [None]. Note that a given tensor with shape [None, 1] or shape [None] will contain the same number of elements.
Example input with concrete dummy values:
>>> logits = np.array([[11, 22], [33, 44], [55, 66]])
>>> labels = np.array([1, 0, 1])
Here there are 3 examples in the mini-batch, the logits for the first example are 11 and 22, and there are 2 classes: 0 and 1.
*https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#sparse_softmax_cross_entropy_with_logits
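As a rough illustration of the shape change, the NumPy sketch below flattens labels of shape (3, 1) to shape (3,) and computes the per-example loss by hand for the dummy values above. It keeps two logit columns, as in the example, and is a sketch of the arithmetic only, not the TensorFlow op. In a TensorFlow graph the flattening would presumably be something like `tf.squeeze(y_, axis=1)` followed by a cast to an integer type, since the op requires int32 or int64 labels.

```python
import numpy as np

logits = np.array([[11., 22.], [33., 44.], [55., 66.]])  # shape (3, 2)
labels_2d = np.array([[1], [0], [1]])                     # shape (3, 1): rank 2, rejected
labels = labels_2d.reshape(-1)                            # shape (3,): rank 1, accepted

# Log-softmax over the class axis, then pick the true class per row.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
per_example = -log_probs[np.arange(len(labels)), labels]
loss = per_example.mean()
```

The second example is labeled 0 while its larger logit belongs to class 1, so its loss dominates the mean; the other two examples contribute almost nothing.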
The problem may be the activation function in your network. Try tf.nn.softmax_cross_entropy_with_logits instead of the sparse variant: it takes labels with the same shape as the logits (for example, one-hot rows), so this rank requirement does not apply.
In short, here is an implementation using sparse_softmax_cross_entropy_with_logits:
cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=hypothesis, labels=tf.argmax(Y, 1)))
sparse_softmax_cross_entropy_with_logits
Computes sparse softmax cross entropy between logits and labels.
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly
one class).
For example, each CIFAR-10 image is labeled with one and only one
label: an image can be a dog or a truck, but not both.
NOTE: For this operation, the probability of a given label is
considered exclusive. That is, soft classes are not allowed,
and the labels vector must provide a single specific index for the
true class for each row of logits (each minibatch entry).
For soft softmax classification with a probability distribution
for each entry, see softmax_cross_entropy_with_logits.
WARNING: This op expects unscaled logits, since it performs a softmax
on logits internally for efficiency. Do not call this op with the
output of softmax, as it will produce incorrect results.
A common use case is to have logits of shape [batch_size, num_classes]
and labels of shape [batch_size]. But higher dimensions are supported.
Note that to avoid confusion, it is required to pass only named
arguments to this function.
softmax_cross_entropy_with_logits_v2 and softmax_cross_entropy_with_logits
Computes softmax cross entropy between logits and labels. (deprecated)
THIS FUNCTION IS DEPRECATED. It will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into
the labels input on backprop by default. Backpropagation will happen
only into logits. To calculate a cross entropy loss that allows
backpropagation into both logits and labels, see softmax_cross_entropy_with_logits_v2
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly one
class).
For example, each CIFAR-10 image is labeled with one and only one
label: an image can be a dog or a truck, but not both.
Here is the same implementation using softmax_cross_entropy_with_logits_v2:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=hypothesis, labels=Y))
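The two implementations should give the same cost when Y is strictly one-hot, because taking the argmax of a one-hot row recovers the class index. A small NumPy check of that claim, with made-up values for `hypothesis` and `Y` and a hand-written `log_softmax` helper (none of this is TensorFlow code):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

hypothesis = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # logits, shape (batch, classes)
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])            # one-hot labels, same shape as the logits

# Dense form: labels are distributions over classes.
dense_cost = np.mean(-(Y * log_softmax(hypothesis)).sum(axis=1))

# Sparse form: labels are class indices recovered with argmax.
indices = Y.argmax(axis=1)
sparse_cost = np.mean(-log_softmax(hypothesis)[np.arange(len(indices)), indices])
```

The equivalence breaks down for soft labels: argmax discards everything except the most likely class, so only the dense form is appropriate there.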
First let me explain the input and target values of the RNN. My dataset consists of sequences (e.g. 4, 7, 1, 23, 42, 69). The RNN is trained to predict the next value in each sequence. So all values except the last are input, and all values except the first are target values. Each value is represented as a 1-HOT vector.
I have a RNN in Tensorflow where the outputs from the RNN (tf.dynamic_rnn) are sent through a feedforward layer. The input sequences have varying length, so I use the sequence_length parameter to specify the length of each sequence in a batch. The output from the RNN layer is a Tensor of outputs for each timestep. Most sequences have the same length, but some are shorter. When shorter sequences are sent through, I get additional all-zero vectors (as a padding).
The problem is that I want to send the output from the RNN layer through a feedforward layer. If I add bias in this feedforward layer, then the additional all-zero vectors become non-zero. With no bias, only weights, this works fine, since the all-zero vectors are not affected by multiplication. So without bias, I can set the target vectors as all-zero as well and thus they will not affect the backward pass. But if bias is added, I don't know what to put in the padded/dummy target vectors.
So the network looks like this:
[INPUT (1-HOT vectors, one vector for each value in the sequence)]
V
[GRU layer (smaller size than the input layer)]
V
[Feedforward layer (outputs vectors of the same size as the input)]
And here is the code:
# [batch_size, max_sequence_length, size of 1-HOT vectors]
x = tf.placeholder(tf.float32, [None, max_length, n_classes])
y = tf.placeholder(tf.int32, [None, max_length, n_classes])
session_length = tf.placeholder(tf.int32, [None])
outputs, state = rnn.dynamic_rnn(
    rnn_cell.GRUCell(n_hidden),
    x,
    dtype=tf.float32,
    sequence_length=session_length
)
layer = {'weights': tf.Variable(tf.random_normal([n_hidden, n_classes])),
         'biases': tf.Variable(tf.random_normal([n_classes]))}
# Flatten to apply same weights to all timesteps
outputs = tf.reshape(outputs, [-1, n_hidden])
prediction = tf.matmul(outputs, layer['weights'])  # + layer['biases']
# Flatten the targets the same way so they line up with the predictions
targets = tf.reshape(tf.cast(y, tf.float32), [-1, n_classes])
error = tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=targets)
You can add the bias, but mask out the non-relevant sequence elements from the loss function.
See an example from the im2txt project:
weights = tf.to_float(tf.reshape(self.input_mask, [-1])) # these are the masks
# Compute losses.
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, targets)
batch_loss = tf.div(tf.reduce_sum(tf.mul(losses, weights)),
                    tf.reduce_sum(weights),
                    name="batch_loss") # Here the irrelevant sequence elements are masked out
Also, for generating the mask see the function batch_with_dynamic_pad in the same project, under ops/inputs
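For the setup in the question, the mask can also be derived directly from the sequence lengths. The NumPy sketch below mirrors that idea with random stand-in data; all values and the `masked_loss` helper are illustrative assumptions, not code from im2txt. In TensorFlow, `tf.sequence_mask(session_length, max_length)` should produce an equivalent boolean mask.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_length, n_hidden, n_classes = 3, 5, 4, 6
session_length = np.array([5, 3, 2])                 # true length of each sequence

outputs = rng.normal(size=(batch, max_length, n_hidden))
targets = rng.integers(0, n_classes, size=(batch, max_length))

# mask[b, t] is 1 for real timesteps and 0 for padding.
mask = (np.arange(max_length)[None, :] < session_length[:, None]).astype(np.float64)
outputs *= mask[:, :, None]                          # padded outputs are all-zero
targets = np.where(mask == 1.0, targets, 0)          # dummy targets in padded slots

weights = rng.normal(size=(n_hidden, n_classes))
biases = rng.normal(size=(n_classes,))

def masked_loss(step_targets):
    # The bias makes the padded rows non-zero, which is fine once they are masked.
    logits = outputs.reshape(-1, n_hidden) @ weights + biases
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    flat_targets = step_targets.reshape(-1)
    losses = -log_probs[np.arange(flat_targets.size), flat_targets]
    flat_mask = mask.reshape(-1)
    return (losses * flat_mask).sum() / flat_mask.sum()

batch_loss = masked_loss(targets)

# Whatever is placed in the padded target slots does not change the loss.
other_targets = np.where(mask == 1.0, targets, n_classes - 1)
batch_loss_other = masked_loss(other_targets)
```

Because the padded positions are multiplied by zero before the sum, they contribute no gradient either, so the choice of dummy target there is arbitrary.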