The output layer of my CNN should use the RBF function, described as "each neuron outputs the square of the Euclidean distance between its input vector and its weight vector". I've implemented this as
dense2 = tf.square(tf.norm(dense1 - tf.transpose(dense2_W)))
where dense1 is a tensor of shape (?, 84). I've tried declaring dense2_W, the weights, as a variable of shape (84, 10) since it's doing number classification and should have 10 outputs. Running the code with a batch of 100 I get this error: InvalidArgumentError: Incompatible shapes: [100,84] vs. [10,84]. I believe it is due to the subtraction.
I train the network by iterating this code:
x_batch, y_batch = mnist.train.next_batch(100)
x_batch = tf.pad(x_batch, [[0,0],[2,2],[2,2],[0,0]]).eval() # Pad 28x28 -> 32x32
sess.run(train_step, {X: x_batch, Y: y_batch})
and then test it using the entire test set, thus the batch size in the network must be dynamic.
How can I work around this? The batch size must be dynamic, as in dense1's case, but I don't understand how to create a variable with a dynamic size and transpose it (dense2_W).
You need the shapes of the two tensors to be compatible. Assuming you want to share the weights across the batch and have a separate set of weights for each output class, you can add a size-1 axis to each tensor so that they broadcast correctly, e.g.:
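# dense1: (batch, 84) -> (batch, 84, 1); broadcasting will copy the input to every output class neuron
input_dense = tf.expand_dims(dense1, axis=2)
# dense2_W: (84, 10) -> (1, 84, 10); broadcasting here will copy the weights across the batch
# (no transpose, so the hidden axis of size 84 lines up in both tensors)
weights = tf.expand_dims(dense2_W, axis=0)
# the norm over the hidden axis (axis=1) leaves a tensor of shape (batch, 10)
dense2 = tf.square(tf.norm(input_dense - weights, axis=1))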
The resulting tensor dense2 should have a shape of [batch_size, num_classes], which is [100, 10] in your case (so it will hold one output per class for every data instance).
EDIT: added axis argument to the tf.norm call so that the distance is computed in the hidden dimension (not over the whole matrices).
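To see the broadcasting at work outside TensorFlow, here is a minimal NumPy sketch of the same squared-distance computation. The random values and the names rng, diff, rbf_out and rbf_loop are illustrative assumptions, not part of the question; the sizes (100, 84, 10) are the ones mentioned above. It checks the broadcast result against an explicit loop.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, hidden, num_classes = 100, 84, 10  # sizes from the question

dense1 = rng.normal(size=(batch_size, hidden))     # stand-in for the previous layer's output
dense2_W = rng.normal(size=(hidden, num_classes))  # one weight column per output class

# (100, 84, 1) - (1, 84, 10) broadcasts to (100, 84, 10)
diff = dense1[:, :, np.newaxis] - dense2_W[np.newaxis, :, :]
# Summing the squares over the hidden axis gives the squared Euclidean distance, shape (100, 10)
rbf_out = np.sum(diff ** 2, axis=1)

# Reference implementation: loop over every (example, class) pair
rbf_loop = np.empty((batch_size, num_classes))
for i in range(batch_size):
    for j in range(num_classes):
        rbf_loop[i, j] = np.sum((dense1[i] - dense2_W[:, j]) ** 2)

print(rbf_out.shape)
```

Because the batch axis only ever appears in the input, the weights keep a fixed shape and any batch size works.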
I am learning deep learning and am trying to understand the PyTorch code given below. I'm struggling to understand how the probability calculation works. Can someone break it down in layman's terms? Thanks a ton.
ps = model.forward(images[0,:])
# Hyperparameters for our network
input_size = 784
hidden_sizes = [128, 64]
output_size = 10
# Build a feed-forward network
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
nn.ReLU(),
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
nn.ReLU(),
nn.Linear(hidden_sizes[1], output_size),
nn.Softmax(dim=1))
print(model)
# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
print(images.shape)
ps = model.forward(images[0,:])
I'm a layman so I'll help you with the layman's terms :)
input_size = 784
hidden_sizes = [128, 64]
output_size = 10
These are parameters for the layers in your network. Each neural network consists of layers, and each layer has an input and an output shape.
Specifically, input_size deals with the input shape of the first layer. This is the input size of the entire network. Each sample that is fed into the network will be a 1-dimensional vector of length 784 (an array that is 784 long).
hidden_sizes deals with the shapes inside the network. We will cover this a little later.
output_size deals with the output shape of the last layer. This means that our network will output a 1-dimensional vector of length 10 for each sample.
Now to break down the model definition line by line:
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
The nn.Sequential part simply defines a network; each argument passed in defines a new layer in that network, in that order.
nn.Linear(input_size, hidden_sizes[0]) is an example of such a layer. It is the first layer of our network: it takes in an input of size input_size and outputs a vector of size hidden_sizes[0]. The size of the output is considered "hidden" in that it is neither the input nor the output of the whole network. It is "hidden" because it's located inside the network, far from the input and output ends that you interact with when you actually use it.
This is called Linear because it applies a linear transformation by multiplying the input by its weight matrix and adding its bias vector to the result. (Y = Ax + b, Y = output, x = input, A = weights, b = bias).
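As a rough illustration of that formula, here is a NumPy sketch of what a single linear layer computes. The random weights and the variable names are assumptions made for the example; a real nn.Linear learns A and b during training.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 784, 128  # input_size and hidden_sizes[0] from the question

A = rng.normal(size=(out_features, in_features)) * 0.01  # weights (random here, learned in practice)
b = np.zeros(out_features)                               # bias vector
x = rng.normal(size=(1, in_features))                    # one flattened image as a row vector

# nn.Linear stores its weight as (out_features, in_features),
# so for row-vector inputs the product is written x @ A.T
y = x @ A.T + b
print(y.shape)
```

The output has one row per sample and hidden_sizes[0] columns, which is exactly the input size the next layer expects.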
nn.ReLU(),
ReLU is an example of an activation function. What this function does is apply some sort of transformation to the output of the last layer (the layer discussed above), and outputs the result of that transformation. In this case the function being used is the ReLU function, which is defined as ReLU(x) = max(x, 0). Activation functions are used in neural networks because they create non-linearities. This allows your model to model non-linear relationships.
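A tiny NumPy sketch of the ReLU definition above (the sample values are made up for illustration):

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0): negatives become 0, non-negatives pass through unchanged
    return np.maximum(x, 0)

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(values))
```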
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
From what we discussed above, this is another example of a layer. It takes an input of size hidden_sizes[0] (the same shape as the output of the last layer) and outputs a 1D vector of length hidden_sizes[1].
nn.ReLU(),
Applies the ReLU function again.
nn.Linear(hidden_sizes[1], output_size)
Same as the above two layers, but our output shape is the output_size this time.
nn.Softmax(dim=1))
Another activation function. This activation function turns the logits outputted by nn.Linear into an actual probability distribution. This lets the model output the probability for each class. At this point our model is built.
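To make the "logits to probabilities" step concrete, here is a hedged NumPy sketch of a softmax over the class axis. The logit values are invented for the example, and the max-subtraction is a common numerical-stability trick rather than something taken from the question.

```python
import numpy as np

def softmax(logits, axis=1):
    # Subtracting the row maximum does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# One sample with 10 made-up logits, shaped (1, 10) like the network output
logits = np.array([[2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0, -0.5, 1.5, 0.2]])
ps = softmax(logits, axis=1)

predicted_class = int(np.argmax(ps, axis=1)[0])
print(ps.sum(axis=1), predicted_class)
```

Each entry of ps is positive and the row sums to 1, so the largest logit simply becomes the most probable class.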
# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
print(images.shape)
These lines simply preprocess the training data and put it into the correct format (each image is flattened into a vector of length 784).
ps = model.forward(images[0,:])
This passes the first image of the batch through the model (the forward pass), applying the layers discussed above in order. The resulting output, ps, holds the probability of each of the 10 classes.
I am training a model using 3D point cloud data in TensorFlow. My batch size is 64, so TensorFlow expects to receive batch of 64 of 3D points like: (64,1024,3). When I run the training code:
feed_dict = {ops['points_pl']: augmented_data,
ops['labels_pl']: current_label[start_idx:end_idx],
ops['w_pl']: gmm.weights_,
ops['mu_pl']: gmm.means_,
ops['sigma_pl']: np.sqrt(gmm.covariances_),
ops['is_training_pl']: is_training, }
summary, step, _, loss_val, pred_val = sess.run([ops['merged'], ops['step'],
ops['train_op'], ops['loss'], ops['pred']],
feed_dict=feed_dict)
In the last batch because the remaining data is less than 64, I get this error:
ValueError: Cannot feed value of shape (36, 1024, 3) for Tensor 'Placeholder_4:0', which has shape '(64, 1024, 3)'
I tried manually adding data at the end of a batch when it is smaller than 64, but that significantly reduced the performance. When I set the batch size to 1, 2 or 4 it works, but it runs very slowly. How can I get rid of this problem in an efficient way? Is there a way for TF to recognize such a situation and continue training without throwing an error?
You don't need to define the size of the batch dimension precisely. Instead you put None as the size of that dimension. You can define your placeholders e.g.:
n1 = 1024
n2 = 3
ops['points_pl'] = tf.placeholder(tf.float32, [None, n1, n2])
ops['labels_pl'] = tf.placeholder(tf.float32, [None])
TensorFlow will then let you feed arrays to those placeholders without any restriction on the first dimension. This solves the problem of the final batch, and is also useful during inference (when you may want to apply the model to a different number of inputs than your batch size).
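On the data side, nothing special is needed for the final batch once the batch dimension is None. A minimal sketch, assuming a hypothetical iterate_minibatches helper and a made-up dataset of 164 samples (two full batches of 64 plus a remainder of 36):

```python
import numpy as np

def iterate_minibatches(data, labels, batch_size):
    # Hypothetical helper: slicing past the end of an array simply returns
    # what is left, so the last batch may be smaller than batch_size
    for start in range(0, len(data), batch_size):
        end = start + batch_size
        yield data[start:end], labels[start:end]

points = np.zeros((164, 1024, 3), dtype=np.float32)  # dummy point clouds
labels = np.zeros(164, dtype=np.float32)             # dummy labels

batch_shapes = [x.shape for x, _ in iterate_minibatches(points, labels, 64)]
print(batch_shapes)
```

Each yielded pair can be fed as-is to placeholders declared with None in the first dimension.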
I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and explicitly instantiates variables; that is, I'm not using tf.contrib.learn, only the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
[INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
fc,
center=True,
scale=True,
is_training=training,
scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
[CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
input=u, filter=w,
strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
cnn,
data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
center=True,
scale=True,
is_training=training,
scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN3D params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGHT = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
[CNN3D_FILTER_LENGHT, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
input=u, filter=w,
strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
cnn3d,
data_format='NHWC', # Channels last, matching the "cnn3d" tensor which has shape (?, 9, 120, 160, 96).
center=True,
scale=True,
is_training=training,
scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model and print all the trainable variables, I see the expected numbers of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks OK to me: because of BN, one pair of beta and gamma is learned for each feature map (256 in total).
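The per-feature-map behaviour can be sketched in NumPy. This is only an illustration of the training-time normalization under stated assumptions: the function name batch_norm_train, the eps value and the small tensor sizes are made up, and the moving averages used at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    # Normalize jointly over every axis except the last (channel) axis, so all
    # locations of one feature map share the same mean and variance
    reduce_axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # gamma and beta have shape (channels,)

rng = np.random.default_rng(0)
# (batch, time, height, width, channels); kept small for the example
x = rng.normal(loc=5.0, scale=3.0, size=(4, 9, 12, 16, 8))
gamma = np.ones(8)   # one scale per feature map
beta = np.zeros(8)   # one offset per feature map

y = batch_norm_train(x, gamma, beta)
channel_means = y.mean(axis=(0, 1, 2, 3))
channel_stds = y.std(axis=(0, 1, 2, 3))
print(channel_means.shape, channel_stds.shape)
```

The same function works unchanged for 2D (batch, features) and 4D (batch, height, width, channels) inputs, since only the last axis is treated as channels.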
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
That is a great post about 3D batchnorm. It often goes unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help adding a few important notes:
A "standard" 2D batchnorm (which accepts a 4D tensor) can be significantly faster in TensorFlow than 3D or higher, because it supports the fused_batch_norm implementation, which applies a single kernel operation:
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process
that for some models makes up a large percentage of the operation
time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity and at this point the issue is closed unresolved.
Although the original paper prescribes using batchnorm before ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu before BN. It is still occasionally a topic of debate, though.
For anyone interested in applying the idea of normalization in practice, there have been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of the original batchnorm; for example, they work better for LSTMs and other recurrent networks.
I'm building a DNN to predict whether an object is present in an image. My network has two hidden layers, and the last layer looks like this:
# Output layer
W_fc2 = weight_variable([2048, 1])
b_fc2 = bias_variable([1])
y = tf.matmul(h_fc1, W_fc2) + b_fc2
Then I have a placeholder for the labels:
y_ = tf.placeholder(tf.float32, [None, 1], 'Output')
I run training in batches (therefore the first dimension of the Output placeholder's shape is None).
I use the following loss function:
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
y[:, :1], y_[:, :1], name='xentropy')
loss = tf.reduce_mean(cross_entropy, name='xentropy_mean')
predict_hand = tf.greater(y, 0.5)
correct_prediction = tf.equal(tf.to_float(predict_hand), y_)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
But at runtime I get the following error:
Rank mismatch: Rank of labels (received 2) should equal rank of logits
minus 1 (received 2).
I guess I should reshape the labels, but I'm not sure what shape is expected. I looked it up in the documentation, which says:
logits: Unscaled log probabilities of rank r and shape [d_0, d_1, ...,
d_{r-2}, num_classes] and dtype float32 or float64. labels: Tensor of
shape [d_0, d_1, ..., d_{r-2}] and dtype int32 or int64. Each entry in
labels must be an index in [0, num_classes).
If I have just a single class, what should my labels look like (right now they are just 0 or 1)? Any help is appreciated.
From the documentation* for tf.nn.sparse_softmax_cross_entropy_with_logits:
"A common use case is to have logits of shape [batch_size,
num_classes] and labels of shape [batch_size]. But higher dimensions
are supported."
So I suppose your labels tensor should be of shape [None]. Note that a given tensor with shape [None, 1] or shape [None] will contain the same number of elements.
Example input with concrete dummy values:
>>> logits = np.array([[11, 22], [33, 44], [55, 66]])
>>> labels = np.array([1, 0, 1])
Here there are 3 examples in the mini-batch, the logits for the first example are 11 and 22, and there are 2 classes: 0 and 1.
*https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#sparse_softmax_cross_entropy_with_logits
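For intuition about the expected shapes, here is a NumPy sketch of what a sparse softmax cross-entropy computes on those dummy values. The function below is an illustrative re-implementation written for this example, not TensorFlow's code.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # logits: [batch_size, num_classes]; labels: [batch_size] of integer class indices
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[11, 22], [33, 44], [55, 66]])
labels = np.array([1, 0, 1])

losses = sparse_softmax_cross_entropy(logits, labels)
print(losses.shape)
```

The result has one loss per example, so it is of rank 1 just like the labels. The second example gets a large loss (a little over 11) because its label points at the class with the smaller logit.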
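The problem is more likely the loss function than the network itself. Using tf.nn.softmax_cross_entropy_with_logits instead of the sparse variant should avoid the rank mismatch, because it expects labels with the same shape as the logits.
In short, here is an implementation using the sparse variant: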
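One caveat for the single-output layer in the question is worth checking numerically: a softmax over a single logit is always 1, so a softmax-based cross-entropy on a [batch, 1] output is identically zero. The NumPy sketch below shows this and contrasts it with a sigmoid cross-entropy, which is one common alternative for a yes/no output (TensorFlow offers tf.nn.sigmoid_cross_entropy_with_logits for that). The logit and label values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid_cross_entropy(logits, labels):
    # Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

logits = np.array([[-3.0], [0.2], [4.0]])  # one output unit per example
labels = np.array([[0.0], [1.0], [1.0]])   # 0 = absent, 1 = present

single_class_probs = softmax(logits)  # always 1.0 with a single class
softmax_losses = -(labels * np.log(single_class_probs)).sum(axis=1)
sigmoid_losses = sigmoid_cross_entropy(logits, labels)
print(softmax_losses, sigmoid_losses.ravel())
```

With the softmax variant there is no gradient signal at all, whereas the sigmoid loss depends on how well each logit matches its label.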
cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=hypothesis,labels=tf.argmax(Y,1)))
sparse_softmax_cross_entropy_with_logits
Computes sparse softmax cross entropy between logits and labels.
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly
one class).
For example, each CIFAR-10 image is labeled with one and only one
label: an image can be a dog or a truck, but not both.
NOTE: For this operation, the probability of a given label is
considered exclusive. That is, soft classes are not allowed,
and the labels vector must provide a single specific index for the
true class for each row of logits (each minibatch entry).
For soft softmax classification with a probability distribution
for each entry, see softmax_cross_entropy_with_logits.
WARNING: This op expects unscaled logits, since it performs a softmax
on logits internally for efficiency. Do not call this op with the
output of softmax, as it will produce incorrect results.
A common use case is to have logits of shape [batch_size, num_classes]
and labels of shape [batch_size]. But higher dimensions are supported.
Note that to avoid confusion, it is required to pass only named
arguments to this function.
softmax_cross_entropy_with_logits_v2 and softmax_cross_entropy_with_logits
Computes softmax cross entropy between logits and labels. (deprecated)
THIS FUNCTION IS DEPRECATED. It will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into
the labels input on backprop by default. Backpropagation will happen
only into logits. To calculate a cross entropy loss that allows
backpropagation into both logits and labels, see softmax_cross_entropy_with_logits_v2
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly one
class).
For example, each CIFAR-10 image is labeled with one and only one
label: an image can be a dog or a truck, but not both.
Here is the same implementation using softmax_cross_entropy_with_logits_v2:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
logits=hypothesis,labels=Y))
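The two variants should give the same cost when Y is one-hot, which can be checked with a small NumPy sketch. The helper functions and the sample values of hypothesis and Y below are assumptions made for illustration, not TensorFlow's implementation.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def dense_cross_entropy(logits, one_hot_labels):
    # Labels are full distributions over classes, same shape as the logits
    return -(one_hot_labels * log_softmax(logits)).sum(axis=1)

def sparse_cross_entropy(logits, class_indices):
    # Labels are integer class indices, one per row of logits
    return -log_softmax(logits)[np.arange(len(class_indices)), class_indices]

hypothesis = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])  # made-up logits
Y = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])            # one-hot labels

cost_dense = dense_cross_entropy(hypothesis, Y).mean()
cost_sparse = sparse_cross_entropy(hypothesis, np.argmax(Y, axis=1)).mean()
print(cost_dense, cost_sparse)
```

The difference is only in how the labels are passed: class indices for the sparse variant, full rows for the dense one (which also allows soft labels).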
First let me explain the input and target values of the RNN. My dataset consists of sequences (e.g. 4, 7, 1, 23, 42, 69). The RNN is trained to predict the next value in each sequence. So all values except the last are input, and all values except the first are target values. Each value is represented as a 1-HOT vector.
I have an RNN in TensorFlow where the outputs from the RNN (tf.dynamic_rnn) are sent through a feedforward layer. The input sequences have varying length, so I use the sequence_length parameter to specify the length of each sequence in a batch. The output from the RNN layer is a Tensor of outputs for each timestep. Most sequences have the same length, but some are shorter. When shorter sequences are sent through, I get additional all-zero vectors (as padding).
The problem is that I want to send the output from the RNN layer through a feedforward layer. If I add bias in this feedforward layer, then the additional all-zero vectors become non-zero. With no bias, only weights, this works fine, since the all-zero vectors are not affected by multiplication. So without bias, I can set the target vectors as all-zero as well and thus they will not affect the backward pass. But if bias is added, I don't know what to put in the padded/dummy target vectors.
So the network looks like this:
[INPUT (1-HOT vectors, one vector for each value in the sequence)]
V
[GRU layer (smaller size than the input layer)]
V
[Feedforward layer (outputs vectors of the same size as the input)]
And here is the code:
# [batch_size, max_sequence_length, size of 1-HOT vectors]
x = tf.placeholder(tf.float32, [None, max_length, n_classes])
y = tf.placeholder(tf.int32, [None, max_length, n_classes])
session_length = tf.placeholder(tf.int32, [None])
outputs, state = rnn.dynamic_rnn(
rnn_cell.GRUCell(n_hidden),
x,
dtype=tf.float32,
sequence_length=session_length
)
layer = {'weights':tf.Variable(tf.random_normal([n_hidden, n_classes])),
'biases':tf.Variable(tf.random_normal([n_classes]))}
# Flatten to apply same weights to all timesteps
outputs = tf.reshape(outputs, [-1, n_hidden])
prediction = tf.matmul(outputs, layer['weights']) # + layer['biases']
error = tf.nn.softmax_cross_entropy_with_logits(prediction,y)
You can add the bias, but mask out the non-relevant sequence elements from the loss function.
See an example from the im2txt project:
weights = tf.to_float(tf.reshape(self.input_mask, [-1])) # these are the masks
# Compute losses.
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, targets)
batch_loss = tf.div(tf.reduce_sum(tf.mul(losses, weights)),
tf.reduce_sum(weights),
name="batch_loss") # Here the irrelevant sequence elements are masked out
Also, for generating the mask see the function batch_with_dynamic_pad in the same project, under ops/inputs
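A minimal NumPy sketch of the masking idea, assuming the per-timestep losses have already been computed. The helpers sequence_mask and masked_mean_loss, and the loss values, are hypothetical and only serve the illustration; the padded positions deliberately carry a large loss (9.0) to show that they are ignored.

```python
import numpy as np

def sequence_mask(lengths, max_length):
    # 1.0 for real timesteps, 0.0 for padding; shape [batch_size, max_length]
    positions = np.arange(max_length)[np.newaxis, :]
    return (positions < np.asarray(lengths)[:, np.newaxis]).astype(np.float64)

def masked_mean_loss(losses, mask):
    # Average only over the timesteps that belong to a real sequence element
    return (losses * mask).sum() / mask.sum()

lengths = [3, 1]                      # true lengths of the two sequences
mask = sequence_mask(lengths, max_length=4)

losses = np.array([[0.5, 1.0, 1.5, 9.0],   # last step is padding
                   [2.0, 9.0, 9.0, 9.0]])  # last three steps are padding

batch_loss = masked_mean_loss(losses, mask)
print(mask, batch_loss)
```

Since padded steps contribute nothing to the loss, whatever the biased feedforward layer outputs at those positions has no effect on the gradients.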