How to implement a fixed-length spatial pyramid pooling layer? - python

I would like to implement the spatial pyramid pooling layer as introduced in this paper.
Following the setting in the paper, the key point is to define a variable kernel size and stride for the max pooling layer:
kernel_size = ceil(a/n)
stride_size = floor(a/n)
where a is the spatial size of the input tensor and n is the pyramid level, i.e. the number of spatial bins per side of the pooling output.
I try to implement this layer with tensorflow:
import numpy as np
import tensorflow as tf

def spp_layer(input_, name='SPP_layer'):
    """
    4-level SPP layer.
    spatial bins: [6x6, 3x3, 2x2, 1x1]

    Parameters
    ----------
    input_ : tensor
    name : str

    Returns
    -------
    tensor
    """
    shape = input_.get_shape().as_list()
    with tf.variable_scope(name):
        spp_6_6_pool = tf.nn.max_pool(input_,
                                      ksize=[1,
                                             np.ceil(shape[1] / 6).astype(np.int32),
                                             np.ceil(shape[2] / 6).astype(np.int32),
                                             1],
                                      strides=[1, shape[1] // 6, shape[2] // 6, 1],
                                      padding='SAME')
        print('SPP layer level 6:', spp_6_6_pool.get_shape().as_list())
        spp_3_3_pool = tf.nn.max_pool(input_,
                                      ksize=[1,
                                             np.ceil(shape[1] / 3).astype(np.int32),
                                             np.ceil(shape[2] / 3).astype(np.int32),
                                             1],
                                      strides=[1, shape[1] // 3, shape[2] // 3, 1],
                                      padding='SAME')
        print('SPP layer level 3:', spp_3_3_pool.get_shape().as_list())
        spp_2_2_pool = tf.nn.max_pool(input_,
                                      ksize=[1,
                                             np.ceil(shape[1] / 2).astype(np.int32),
                                             np.ceil(shape[2] / 2).astype(np.int32),
                                             1],
                                      strides=[1, shape[1] // 2, shape[2] // 2, 1],
                                      padding='SAME')
        print('SPP layer level 2:', spp_2_2_pool.get_shape().as_list())
        spp_1_1_pool = tf.nn.max_pool(input_,
                                      ksize=[1,
                                             np.ceil(shape[1] / 1).astype(np.int32),
                                             np.ceil(shape[2] / 1).astype(np.int32),
                                             1],
                                      strides=[1, shape[1] // 1, shape[2] // 1, 1],
                                      padding='SAME')
        print('SPP layer level 1:', spp_1_1_pool.get_shape().as_list())
        spp_6_6_pool_flat = tf.reshape(spp_6_6_pool, [shape[0], -1])
        spp_3_3_pool_flat = tf.reshape(spp_3_3_pool, [shape[0], -1])
        spp_2_2_pool_flat = tf.reshape(spp_2_2_pool, [shape[0], -1])
        spp_1_1_pool_flat = tf.reshape(spp_1_1_pool, [shape[0], -1])
        spp_pool = tf.concat([spp_6_6_pool_flat,
                              spp_3_3_pool_flat,
                              spp_2_2_pool_flat,
                              spp_1_1_pool_flat], axis=1)  # TF >= 1.0 argument order
    return spp_pool
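For example, a quick check (assuming TF 1.x; the 12x12 and 13x13 input sizes are arbitrary) shows that the flattened output lengths differ between inputs of different spatial sizes:
x_12 = tf.constant(np.random.rand(1, 12, 12, 3), dtype=tf.float32)
x_13 = tf.constant(np.random.rand(1, 13, 13, 3), dtype=tf.float32)
print(spp_layer(x_12, name='SPP_12').get_shape().as_list())  # [1, 150]
print(spp_layer(x_13, name='SPP_13').get_shape().as_list())  # [1, 225], a different length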
But, as the check above shows, it cannot guarantee the same pooling output length when the input sizes differ.
How can I solve this problem?

I believe the authors of the paper are wrong; the formulas should be:
stride_size = floor(a/n)
kernel_size = floor(a/n) + (a mod n)
Notice that both definitions of the kernel size coincide whenever a mod n <= 1 (which is always the case for n = 2).
You can prove this result by doing the Euclidean division of a by n: writing a = n*q + r with q = floor(a/n) and r = a mod n, a 'VALID' pooling with kernel size q + r and stride q produces floor((a - (q + r))/q) + 1 = (n - 1) + 1 = n output bins, provided a >= n.
I modified the code I found at https://github.com/tensorflow/tensorflow/issues/6011 and here it is:
def spp_layer(input_, levels=(6, 3, 2, 1), name='SPP_layer'):
    shape = input_.get_shape().as_list()
    with tf.variable_scope(name):
        pyramid = []
        for n in levels:
            # stride = floor(a / n), kernel = floor(a / n) + (a mod n)
            stride_1 = shape[1] // n
            stride_2 = shape[2] // n
            ksize_1 = stride_1 + (shape[1] % n)
            ksize_2 = stride_2 + (shape[2] % n)
            pool = tf.nn.max_pool(input_,
                                  ksize=[1, ksize_1, ksize_2, 1],
                                  strides=[1, stride_1, stride_2, 1],
                                  padding='VALID')
            # print("Pool Level {}: shape {}".format(n, pool.get_shape().as_list()))
            pyramid.append(tf.reshape(pool, [shape[0], -1]))
        spp_pool = tf.concat(pyramid, axis=1)  # TF >= 1.0 argument order
    return spp_pool
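As a quick sanity check (assuming TF 1.x and the spp_layer above; the 13x13 and 21x17 input sizes are arbitrary), inputs of different spatial sizes now yield the same output length, namely (36 + 9 + 4 + 1) * C features per example:
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(1, 13, 13, 3), dtype=tf.float32)
b = tf.constant(np.random.rand(1, 21, 17, 3), dtype=tf.float32)
print(spp_layer(a, name='SPP_a').get_shape().as_list())  # [1, 150]
print(spp_layer(b, name='SPP_b').get_shape().as_list())  # [1, 150]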

Yes, the output size is currently not constant, and looking at your code, your individual pooling operations will have output sizes that alternate between two numbers. The reason is that the output size, at least for 'SAME' padding, is calculated by the formula
out_height = ceil(float(in_height) / float(strides[1]))
If the stride is essentially the floor of in_height/n, the output will fluctuate between n and n+1. To keep it constant, use the ceil value for your strides as well. The altered code for the spp_6_6 pool would be
ksize = [1, np.ceil(shape[1] / 6).astype(np.int32), np.ceil(shape[2] / 6).astype(np.int32), 1]
spp_6_6_pool = tf.nn.max_pool(input_, ksize=ksize, strides=ksize, padding='SAME')
I defined ksize outside the call to tf.nn.max_pool() for clarity. So, if you use your ksize for your strides too, it should work out. If you round up, then mathematically, as long as your input dimensions are at least double your largest pyramid size n, your output size should be constant with 'SAME' padding!
Somewhat related to your question, in your first max pooling operation your ksize parameter is
ksize=[1, np.ceil(shape[1]/6).astype(np.int32), np.ceil(shape[1]/6).astype(np.int32), 1]
For the third element of ksize, you did shape[1]/6 instead of shape[2]/6. I assumed that was a typo, so I changed it in the above code.
I'm aware that in the paper the stride is taken to be the floor of a/n and not the ceil, but as of now, using the native pooling operations of TensorFlow, there is no way to make that work as desired. 'VALID' pooling will not give anything near what you want.
Well... if you're really willing to put the time into it, you can take the input size modulo your largest pyramid dimension, in this case 6, and handle all six of those cases independently. I can't find a good justification for doing that, though. TensorFlow pads differently than other libraries such as, say, Caffe, so inherently there will be differences. The above solution will get you what the authors are aiming for in the paper: a pyramid of pooling layers where disjoint regions of the image are max-pooled with differing levels of granularity.
EDIT: Actually, if you use tf.pad() to manually pad the input yourself and create a new input for each max pooling operation, such that the new inputs have height and width that are a neat multiple of n, then it would work out with the code you already have.
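For instance, something along these lines could do it for a single pyramid level (a rough sketch of that idea, with a hypothetical spp_level_padded helper; padding with zeros assumes non-negative activations, e.g. after a ReLU):
def spp_level_padded(input_, n):
    # Pad height and width up to the next multiple of n, then pool with
    # ksize == stride so that each of the n x n bins covers a disjoint region.
    shape = input_.get_shape().as_list()  # [batch, height, width, channels], static
    pad_h = (n - shape[1] % n) % n
    pad_w = (n - shape[2] % n) % n
    padded = tf.pad(input_, [[0, 0], [0, pad_h], [0, pad_w], [0, 0]])
    k_h = (shape[1] + pad_h) // n
    k_w = (shape[2] + pad_w) // n
    return tf.nn.max_pool(padded,
                          ksize=[1, k_h, k_w, 1],
                          strides=[1, k_h, k_w, 1],
                          padding='VALID')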

Related

Max pool a single image in tensorflow using "tf.nn.avg_pool"

I want to apply "tf.nn.max_pool()" to a single image, but I get a result with dimensions that are totally different from the input:
import tensorflow as tf
import numpy as np

ifmaps_1 = tf.Variable(tf.random_uniform(shape=[7, 7, 3], minval=0, maxval=3, dtype=tf.int32))
ifmaps = tf.dtypes.cast(ifmaps_1, dtype=tf.float64)
ofmaps_tf = tf.nn.max_pool([ifmaps], ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding="SAME")[0]  # no padding
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    print("ifmaps_tf = ")
    print(ifmaps.eval())
    print("ofmaps_tf = ")
    result = sess.run(ofmaps_tf)
    print(result)
I think this is related to applying pooling to a single example rather than a batch. I need to do the pooling on a single example.
Any help is appreciated.
Your input is (7,7,3), your kernel size is (3,3) and your stride is (2,2). So if you do not want any padding (as stated in your comment), you should use padding="VALID", which will return an output with spatial size (3,3). If you use padding="SAME", the spatial size will be (4,4).
Usually, the formula for calculating the output size with SAME padding is:
out_size = ceil(in_size / stride)
For VALID padding it is:
out_size = ceil((in_size - filter_size + 1) / stride)
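A quick check of both cases with the sizes above (a sketch, assuming TF 1.x):
import tensorflow as tf

x = tf.zeros([1, 7, 7, 3])
same = tf.nn.max_pool(x, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME')
valid = tf.nn.max_pool(x, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID')
print(same.get_shape().as_list())   # [1, 4, 4, 3]  ->  ceil(7 / 2) = 4
print(valid.get_shape().as_list())  # [1, 3, 3, 3]  ->  ceil((7 - 3 + 1) / 2) = 3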

How to perform element convolution between two tensors?

I have two tensors, both containing a batch of N images with the same resolution. I would like to convolve the first image of tensor 1 with the first image of tensor 2, the second image of tensor 1 with the second image of tensor 2, and so on. I want the output to be a tensor with N images of the same size.
I looked into using tf.nn.conv2d, but it seems like this command will take in a batch of N images and convolve them with a single filter.
I looked into examples like What does tf.nn.conv2d do in tensorflow?
but they do not talk about multiple images and multiple filters.
You can manage to do something like that using tf.nn.separable_conv2d, using the batch dimension as the separable channels and the actual input channels as the batch dimension. I am not sure if it is going to perform very well, though, as it involves several transpositions (which are not free in TensorFlow) and a convolution through a large number of channels, which is not really the optimized use case. Here is how it could work:
import tensorflow as tf
import numpy as np
import scipy.signal

# Expects imgs with shape (B, H, W, C) and filters with shape (B, H, W, 1)
def batch_conv(imgs, filters, strides, padding, rate=None):
    imgs = tf.convert_to_tensor(imgs)
    filters = tf.convert_to_tensor(filters)
    b = tf.shape(imgs)[0]
    imgs_t = tf.transpose(imgs, [3, 1, 2, 0])
    filters_t = tf.transpose(filters, [1, 2, 0, 3])
    strides = [strides[3], strides[1], strides[2], strides[0]]
    # "do-nothing" pointwise filter
    pointwise = tf.eye(b, batch_shape=[1, 1])
    conv = tf.nn.separable_conv2d(imgs_t, filters_t, pointwise, strides, padding, rate)
    return tf.transpose(conv, [3, 1, 2, 0])

# Slow, loop-based version using SciPy's correlate to check the result
def batch_conv_np(imgs, filters, padding):
    return np.stack(
        [np.stack([scipy.signal.correlate2d(img[..., i], filter[..., 0], padding.lower())
                   for i in range(img.shape[-1])], axis=-1)
         for img, filter in zip(imgs, filters)], axis=0)

# Make random input
np.random.seed(0)
imgs = np.random.rand(5, 20, 30, 3).astype(np.float32)
filters = np.random.rand(5, 20, 30, 1).astype(np.float32)
padding = 'SAME'

# Test
res_np = batch_conv_np(imgs, filters, padding)
with tf.Graph().as_default(), tf.Session() as sess:
    res_tf = batch_conv(imgs, filters, [1, 1, 1, 1], padding)
    res_tf_val = sess.run(res_tf)
print(np.allclose(res_np, res_tf_val))
# True

Is there a way to add constraints to a neural network output while still using a softmax activation function?

I am not a deep learning geek; I am learning to do this for my homework.
How can I make my neural network output a list of positive floats that sum to 1, while at the same time each element of the list is smaller than a threshold (0.4, for example)?
I tried to add some hidden layers before the output layer, but that didn't improve the results.
Here is where I started from:
def build_net(inputs, predictor, scope, trainable):
    with tf.name_scope(scope):
        if predictor == 'CNN':
            L = int(inputs.shape[2])
            N = int(inputs.shape[3])
            conv1_W = tf.Variable(tf.truncated_normal([1, L, N, 32], stddev=0.15), trainable=trainable)
            layer = tf.nn.conv2d(inputs, filter=conv1_W, padding='VALID', strides=[1, 1, 1, 1])
            norm1 = tf.layers.batch_normalization(layer)
            x = tf.nn.relu(norm1)
            conv3_W = tf.Variable(tf.random_normal([1, 1, 32, 1], stddev=0.15), trainable=trainable)
            conv3 = tf.nn.conv2d(x, filter=conv3_W, strides=[1, 1, 1, 1], padding='VALID')
            norm3 = tf.layers.batch_normalization(conv3)
            net = tf.nn.relu(norm3)
            net = tf.layers.flatten(net)
            return net

x = build_net(inputs, predictor, scope, trainable=trainable)
y = tf.placeholder(tf.float32, shape=[None] + [self.M])
network = tf.add(x, y)
w_init = tf.random_uniform_initializer(-0.0005, 0.0005)
outputs = tf.layers.dense(network, self.M, activation=tf.nn.softmax, kernel_initializer=w_init)
I am expecting that outputs would still sum to 1, but with each element smaller than a specific threshold I set.
Thank you so much in advance for your help, guys.
What you want to do is add a penalty in case any of the outputs is larger than a specified threshold thresh; you can do this with the max function:
thresh = 0.4
strength = 10.0
reg_output = strength * tf.reduce_sum(tf.math.maximum(0.0, outputs - thresh), axis=-1)
Then you need to add reg_output to your loss so it is optimized together with the rest of the loss. strength is a tunable parameter that defines how strong the penalty for going over the threshold is; you have to tune it to your needs.
This penalty works by summing max(0, output - thresh) over the last dimension, which activates the penalty whenever an output is bigger than thresh. If it is smaller, the penalty is zero and does nothing.
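For example, it could be wired into training like this (a sketch; base_loss stands in for whatever loss you are already minimizing, and the learning rate is arbitrary):
total_loss = base_loss + tf.reduce_mean(reg_output)  # penalty averaged over the batch
train_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)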

Tensorflow: CNN training converges at a vector of zeros

I'm a beginner in deep learning and have taken a few courses on Udacity. Recently I have been trying to build a deep network that detects hand joints in input depth images, but it doesn't seem to be working well. (My dataset is the ICVL Hand Posture Dataset.)
The network structure is shown here.
① A batch of input images, 240x320;
② An 8-channel convolutional layer with a 5x5 kernel;
③ A max pooling layer, ksize = stride = 2;
④ A fully-connected layer, weight.shape = [38400, 1024];
⑤ A fully-connected layer, weight.shape = [1024, 48].
After several epochs of training, the output of the last layer converges to a (0, 0, ..., 0) vector. I chose mean squared error as the loss function; its value stayed above 40000 and didn't seem to decrease.
The network structure is already too simple to be simplified any further, but the problem remains. Could anyone offer any suggestions?
My main code is posted below:
image = tf.placeholder(tf.float32, [None, 240, 320, 1])
annotations = tf.placeholder(tf.float32, [None, 48])

W_convolution_layer1 = tf.Variable(tf.truncated_normal([5, 5, 1, 8], stddev=0.1))
b_convolution_layer1 = tf.Variable(tf.constant(0.1, shape=[8]))
h_convolution_layer1 = tf.nn.relu(
    tf.nn.conv2d(image, W_convolution_layer1, [1, 1, 1, 1], 'SAME') + b_convolution_layer1)
h_pooling_layer1 = tf.nn.max_pool(h_convolution_layer1, [1, 2, 2, 1], [1, 2, 2, 1], 'SAME')

W_fully_connected_layer1 = tf.Variable(tf.truncated_normal([120 * 160 * 8, 1024], stddev=0.1))
b_fully_connected_layer1 = tf.Variable(tf.constant(0.1, shape=[1024]))
h_pooling_flat = tf.reshape(h_pooling_layer1, [-1, 120 * 160 * 8])
h_fully_connected_layer1 = tf.nn.relu(
    tf.matmul(h_pooling_flat, W_fully_connected_layer1) + b_fully_connected_layer1)

W_fully_connected_layer2 = tf.Variable(tf.truncated_normal([1024, 48], stddev=0.1))
b_fully_connected_layer2 = tf.Variable(tf.constant(0.1, shape=[48]))
detection = tf.nn.relu(
    tf.matmul(h_fully_connected_layer1, W_fully_connected_layer2) + b_fully_connected_layer2)

mean_squared_error = tf.reduce_sum(tf.losses.mean_squared_error(annotations, detection))
training = tf.train.AdamOptimizer(1e-4).minimize(mean_squared_error)

# This data loader reads images and annotations and converts them into batches of numbers.
loader = ICVLDataLoader('../data/')
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(1000):
        # batch_images: a list with shape = [BATCH_SIZE, 240, 320, 1]
        # batch_annotations: a list with shape = [BATCH_SIZE, 48]
        [batch_images, batch_annotations] = loader.get_batch(100).to_1d_list()
        [t_, l_, p_] = session.run([training, mean_squared_error, detection],
                                   feed_dict={image: batch_images, annotations: batch_annotations})
And it runs like this.
The main issue is likely the relu activation in the output layer. You should remove it, i.e. let detection simply be the result of a matrix multiplication. If you want to force the outputs to be positive, consider something like the exponential function instead.
While relu is a popular hidden activation, I see one major problem with using it as an output activation: as is well known, relu maps negative inputs to 0, and, crucially, the gradients at those points are also 0. In the output layer this basically means your network cannot learn from its mistakes when it produces outputs < 0 (which is likely to happen with random initializations). This will likely heavily impair the overall learning process.
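Applied to the code in the question, the change would look roughly like this (a sketch):
# Linear output layer: no relu on the final detection
detection = tf.matmul(h_fully_connected_layer1, W_fully_connected_layer2) + b_fully_connected_layer2
# If the outputs must be positive, one alternative (as suggested above) would be an exponential:
# detection = tf.exp(tf.matmul(h_fully_connected_layer1, W_fully_connected_layer2) + b_fully_connected_layer2)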

What is the effect of tf.nn.conv2d() on an input tensor's shape?

I am studying TensorBoard code from Dandelion Mane, specifically: https://github.com/dandelionmane/tf-dev-summit-tensorboard-tutorial/blob/master/mnist.py
His convolution layer is specifically defined as:
def conv_layer(input, size_in, size_out, name="conv"):
    with tf.name_scope(name):
        w = tf.Variable(tf.truncated_normal([5, 5, size_in, size_out], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[size_out]), name="B")
        conv = tf.nn.conv2d(input, w, strides=[1, 1, 1, 1], padding="SAME")
        act = tf.nn.relu(conv + b)
        tf.summary.histogram("weights", w)
        tf.summary.histogram("biases", b)
        tf.summary.histogram("activations", act)
        return tf.nn.max_pool(act, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")
I am trying to work out the effect of conv2d on the input tensor size. As far as I can tell, the first 3 dimensions seem to be unchanged, while the last dimension of the output follows the size of the last dimension of w.
For example, ?x47x36x64 input becomes ?x47x36x128 with w shape=5x5x64x128
And I also see that: ?x24x18x128 becomes ?x24x18x256 with w shape=5x5x128x256
So, for an input of size [a,b,c,d], is the output size [a,b,c,w.shape[3]]?
Would it be correct to think that the first dimension does not change?
This works in your case because of the stride used and the padding applied. The output width and height will not always be the same as the input.
Check out this excellent discussion of the topic. The basic takeaway (taken almost verbatim from that link) is that a convolution layer:
Accepts an input volume of size W1 x H1 x D1
Requires four hyperparameters:
Number of filters K
Spatial extent of filters F
The stride with which the filter moves S
The amount of zero padding P
Produces a volume of size W2 x H2 x D2 where:
W2 = (W1 - F + 2*P)/S + 1
H2 = (H1 - F + 2*P)/S + 1
D2 = K
And when you are processing batches of data in TensorFlow, they typically have shape [batch_size, width, height, depth], so the first dimension, which is just the number of samples in your batch, should not change.
Note that the amount of padding P in the above is a little tricky with TF. When you give the padding='SAME' argument to tf.nn.conv2d, TensorFlow applies zero padding to both sides of the image to make sure that no pixels of the image are ignored by your filter, but it may not add the same amount of padding to both sides (it can differ by at most one, I think). This SO thread has some good discussion on the topic.
In general, with a stride S of 1 (which your network has), zero padding of P = (F - 1) / 2 will ensure that the output width/height equals the input, i.e. W2 = W1 and H2 = H1. In your case, F is 5, so tf.nn.conv2d must be adding two zeros to each side of the image for a P of 2, and your output width according to the above equation is W2 = (W1 - 5 + 2*2)/1 + 1 = W1 - 1 + 1 = W1.
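A quick shape check of those formulas, using the sizes from the question and an arbitrary batch size of 8 (a sketch, assuming TF 1.x):
import tensorflow as tf

x = tf.zeros([8, 47, 36, 64])   # batch of 8, spatial size 47x36, 64 input channels (D1)
w = tf.zeros([5, 5, 64, 128])   # F = 5, K = 128 filters
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
print(y.get_shape().as_list())  # [8, 47, 36, 128]: spatial size unchanged, D2 = K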
