Tensorflow Depthwise Convolution Understanding - python

I'm currently trying to understand how TensorFlow's depthwise convolution works. As far as I understand, each channel of the input image is convolved with its own set of filters, and the results are then concatenated. For simplicity I'll stick with depth_multiplier=1 in the remainder, so n_inputchannels == n_outputchannels.
So in theory, I could split the depthwise convolution into N individual, regular Conv2Ds, correct? Why, then, does the following code produce different results - is this a precision issue? I'm following the documentation: the ordering [filter_height, filter_width, in_channels, 1] for the depthwise convolution filters, [filter_height, filter_width, in_channels, out_channels] for the regular convolutions, and NHWC data format.
import tensorflow as tf
import numpy as np
import random
width = 128
height = 128
channels = 32
kernel_width = 3
kernel_height = 3
with tf.Session() as sess:
    _input = np.float32(np.random.rand(1, height, width, channels))
    _weights = np.float32(np.random.rand(kernel_height, kernel_width, channels, 1))

    _input_ph = tf.placeholder(tf.float32, shape=(1, height, width, channels))
    _weights_pc = tf.placeholder(tf.float32, shape=(kernel_height, kernel_width, channels, 1))

    feed = {_input_ph: _input, _weights_pc: _weights}

    result = tf.nn.depthwise_conv2d(_input_ph, _weights_pc, [1, 1, 1, 1], 'SAME')

    individual_results = []
    for i in range(channels):
        individual_results.append(
            tf.nn.conv2d(tf.expand_dims(_input_ph[:, :, :, i], axis=3),
                         tf.expand_dims(_weights_pc[:, :, i, :], axis=3),
                         [1, 1, 1, 1], 'SAME'))

    depth_result = sess.run(result, feed_dict=feed)
    concat_result = sess.run(tf.concat(individual_results, axis=3), feed_dict=feed)

    channel_diff = 0.0
    for i in range(channels):
        channel_diff += np.sum(depth_result[:, :, :, i] - concat_result[:, :, :, i])
    print(channel_diff)
Here I first compute the normal tf.nn.depthwise_conv2d, then slice the input and weights accordingly and run the tf.nn.conv2ds individually. For these parameters I get a difference of about 1e-5, but it tends to grow as I increase the number of channels.
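As a side note, summing signed differences lets positive and negative deviations cancel each other, so a maximum absolute difference or an np.allclose check gives a clearer picture of the per-element discrepancy. A minimal sketch, reusing the depth_result and concat_result arrays computed above:
max_abs_diff = np.max(np.abs(depth_result - concat_result))
print("max abs diff:", max_abs_diff)
# typical float32 tolerances; adjust as needed
print("allclose:", np.allclose(depth_result, concat_result, rtol=1e-5, atol=1e-6))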
I would be really glad if someone could explain to me what's going on :)
Thanks!

Related

Feature importance in neural networks with multiple differently shaped inputs in pytorch and captum (classification)

I have developed a model with three input types: image, categorical data, and numerical data. For the image data I've used ResNet50; for the other two I developed my own network.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MulticlassClassification(nn.Module):
    def __init__(self, cat_size, num_col, output_size, layers, p=0.4):
        super(MulticlassClassification, self).__init__()

        # IMAGE: ResNet
        self.cnn = models.resnet50(pretrained=True)
        for param in self.cnn.parameters():
            param.requires_grad = False
        n_inputs = self.cnn.fc.in_features
        self.cnn.fc = nn.Sequential(
            nn.Linear(n_inputs, 250),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(250, output_size),
            nn.LogSoftmax(dim=1)
        )

        # TABULAR
        self.all_embeddings = nn.ModuleList(
            [nn.Embedding(categories, size) for categories, size in cat_size]
        )
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_col)

        all_layers = []
        num_cat_col = sum(e.embedding_dim for e in self.all_embeddings)
        input_size = num_cat_col + num_col

        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        all_layers.append(nn.Linear(layers[-1], output_size))
        self.layers = nn.Sequential(*all_layers)

        # combine
        self.combine_fc = nn.Linear(output_size * 2, output_size)

    def forward(self, image, x_categorical, x_numerical):
        embeddings = []
        for i, embedding in enumerate(self.all_embeddings):
            print(x_categorical[:, i])
            embeddings.append(embedding(x_categorical[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        x = self.layers(x)

        # img
        x2 = self.cnn(image)

        # combine
        x3 = torch.cat([x, x2], 1)
        x3 = F.relu(self.combine_fc(x3))
        return x
Now after successful training I would like to calculate integrated gradients by using the captum library.
from captum.attr import IntegratedGradients
ig = IntegratedGradients(model)
testiter = iter(testloader)
img, stack_cat, stack_num, target = next(testiter)
attributions_ig = ig.attribute(inputs=(img.cuda(), stack_cat.cuda(), stack_num.cuda()), target=target.cuda())
And here I got an error:
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
I figured out (via the print in my forward method) that captum injects a wrongly shaped tensor into my x_categorical input. It seems like captum only sees the first input tensor and uses its shape for all the other inputs. How can I change this behaviour?
I've found a similar issue here (https://github.com/pytorch/captum/issues/439). It was recommended to use an Interpretable Embedding for categorical data. When I used it I got this error:
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
I would be very grateful for any tips and advice on how to combine all three inputs and solve my problem.
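One possible workaround, as a sketch only (this is not something suggested in the linked issue): if attributions for the categorical columns are not strictly needed, the Long-typed categorical tensor can be passed through captum's additional_forward_args so that Integrated Gradients never tries to interpolate it. A wrapper is needed because additional_forward_args are appended after the attributed inputs:
from captum.attr import IntegratedGradients

# wrapper reorders arguments so the categorical tensor comes last
def forward_wrapper(image, x_numerical, x_categorical):
    return model(image, x_categorical, x_numerical)

ig = IntegratedGradients(forward_wrapper)
img, stack_cat, stack_num, target = next(iter(testloader))

# only img and stack_num are attributed; stack_cat is passed through unchanged
attributions = ig.attribute(
    inputs=(img.cuda(), stack_num.cuda()),
    additional_forward_args=(stack_cat.cuda(),),
    target=target.cuda(),
)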

Why do the images generated by a GAN get darker as the network trains more?

I created a simple DCGAN with 6 layers and trained it on the CelebA dataset (a portion of it containing 30K images).
I noticed that the images my network generates look dim, and as the network trains more, the bright colors fade into dim ones!
Here are some examples:
This is what the CelebA images look like (real images used for training):
and these are the generated ones; the number shows the epoch (they were trained for 30 epochs in total):
What is the cause of this phenomenon?
I tried all the usual GAN tricks, such as rescaling the input images to [-1, 1], not using BatchNorm in the first layer of the Discriminator or in the last layer of the Generator, and using LeakyReLU(0.2) in the Discriminator and ReLU in the Generator. Yet I have no idea why the images are this dim/dark!
Is this simply caused by having fewer training images?
Or is it caused by deficiencies in the networks? If so, what is the source of those deficiencies?
Here is how these networks are implemented:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_batch(in_dim, out_dim, kernel_size, stride, padding, batch_norm=True):
    layers = nn.ModuleList()
    conv = nn.Conv2d(in_dim, out_dim, kernel_size, stride, padding, bias=False)
    layers.append(conv)
    if batch_norm:
        layers.append(nn.BatchNorm2d(out_dim))
    return nn.Sequential(*layers)

class Discriminator(nn.Module):
    def __init__(self, conv_dim=32, act=nn.ReLU()):
        super().__init__()
        self.conv_dim = conv_dim
        self.act = act
        self.conv1 = conv_batch(3, conv_dim, 4, 2, 1, False)
        self.conv2 = conv_batch(conv_dim, conv_dim*2, 4, 2, 1)
        self.conv3 = conv_batch(conv_dim*2, conv_dim*4, 4, 2, 1)
        self.conv4 = conv_batch(conv_dim*4, conv_dim*8, 4, 1, 1)
        self.conv5 = conv_batch(conv_dim*8, conv_dim*10, 4, 2, 1)
        self.conv6 = conv_batch(conv_dim*10, conv_dim*10, 3, 1, 1)
        self.drp = nn.Dropout(0.5)
        self.fc = nn.Linear(conv_dim*10*3*3, 1)

    def forward(self, input):
        batch = input.size(0)
        output = self.act(self.conv1(input))
        output = self.act(self.conv2(output))
        output = self.act(self.conv3(output))
        output = self.act(self.conv4(output))
        output = self.act(self.conv5(output))
        output = self.act(self.conv6(output))
        output = output.view(batch, self.fc.in_features)
        output = self.fc(output)
        output = self.drp(output)
        return output

def deconv_convtranspose(in_dim, out_dim, kernel_size, stride, padding, batchnorm=True):
    layers = []
    deconv = nn.ConvTranspose2d(in_dim, out_dim, kernel_size=kernel_size, stride=stride, padding=padding)
    layers.append(deconv)
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_dim))
    return nn.Sequential(*layers)

class Generator(nn.Module):
    def __init__(self, z_size=100, conv_dim=32):
        super().__init__()
        self.conv_dim = conv_dim
        # make the 1d input into a 3d output of shape (conv_dim*4, 4, 4)
        self.fc = nn.Linear(z_size, conv_dim*4*4*4)  # 4x4
        # conv and deconv layers work on 3d volumes, so we now only need to pass
        # the number of fmaps and the input volume size (its h,w which is 4x4!)
        self.drp = nn.Dropout(0.5)
        self.deconv1 = deconv_convtranspose(conv_dim*4, conv_dim*3, kernel_size=4, stride=2, padding=1)
        self.deconv2 = deconv_convtranspose(conv_dim*3, conv_dim*2, kernel_size=4, stride=2, padding=1)
        self.deconv3 = deconv_convtranspose(conv_dim*2, conv_dim, kernel_size=4, stride=2, padding=1)
        self.deconv4 = deconv_convtranspose(conv_dim, conv_dim, kernel_size=3, stride=2, padding=1)
        self.deconv5 = deconv_convtranspose(conv_dim, 3, kernel_size=4, stride=1, padding=1, batchnorm=False)

    def forward(self, input):
        output = self.fc(input)
        output = self.drp(output)
        output = output.view(-1, self.conv_dim*4, 4, 4)
        output = F.relu(self.deconv1(output))
        output = F.relu(self.deconv2(output))
        output = F.relu(self.deconv3(output))
        output = F.relu(self.deconv4(output))
        # we create the image using tanh!
        output = F.tanh(self.deconv5(output))
        return output

# testing nets
dd = Discriminator()
zd = np.random.rand(2, 3, 64, 64)
zd = torch.from_numpy(zd).float()
# print(dd)
print(dd(zd).shape)

gg = Generator()
z = np.random.uniform(-1, 1, size=(2, 100))
z = torch.from_numpy(z).float()
print(gg(z).shape)
I think that the problem lies rather in the architecture itself, and I would first consider the overall quality of the generated images rather than their brightness or darkness. The generations clearly get better as you train for more epochs. I agree that the images get darker, but even in the early epochs the generated images are significantly darker than the training samples. (At least compared to the ones you posted.)
Now, coming back to your architecture: 30k samples are actually enough to obtain very convincing results, as achieved by state-of-the-art models in face generation. The generations do get better, but they are still far from being "very good".
I think the generator is definitely not strong enough and is the problematic part. (The fact that your generator loss skyrockets can also be a hint of this.) In the generator, all you do is upsample and upsample. Note that the transposed convolution is more of a heuristic and does not provide much learnability. This is related to the nature of the problem: when you are doing convolutions, you have all the information and you are trying to learn to encode it, but in the decoder you are trying to recover information that was previously lost :). So, in a way, it is harder to learn because the information taken as input is limited and lacking.
In fact, deterministic bilinear interpolation methods perform similarly to or even better than transposed convolutions, and these are purely based on scaling/extending with zero learnability (https://arxiv.org/pdf/1707.05847.pdf).
To see the limits of transposed convolutions, I suggest you replace all of them with UpSampling2D (https://www.tensorflow.org/api_docs/python/tf/keras/layers/UpSampling2D), and I claim that the results will not be much different. UpSampling2D is one of those deterministic methods I mentioned.
To improve your generator, you can try to insert convolutional layers between the upsampling layers, as sketched below. These layers would refine the features/images and correct some of the mistakes that occurred during up-sampling. In addition to these corrections, the next upsampling layer would then take a more informative input. What I mean is to try a U-Net-like decoder, which you can find in this link (https://arxiv.org/pdf/1505.04597.pdf). Of course, that would be a first step to explore; there are many more GAN architectures that you can try that will probably perform better.
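To make that concrete, here is a minimal sketch of the idea (an illustration, not code from the question) of a generator block that pairs nn.Upsample with a regular convolution instead of a transposed convolution; the convolution after the upsampling is the refinement step described above:
import torch.nn as nn

def up_conv_block(in_dim, out_dim, batch_norm=True):
    # nearest-neighbour upsampling doubles H and W; the 3x3 convolution then
    # refines the upsampled feature map and can fix interpolation artifacts
    layers = [
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=1, padding=1, bias=False),
    ]
    if batch_norm:
        layers.append(nn.BatchNorm2d(out_dim))
    layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

# e.g. replacing the first deconv stage of the Generator above (4x4 -> 8x8)
up1 = up_conv_block(32 * 4, 32 * 3)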

How to convolve signal with 1D kernel in TensorFlow?

I am trying to filter a TensorFlow tensor of shape (N_batch, N_data), where N_batch is the batch size (e.g. 32), and N_data is the size of the (noisy) timeseries array. I have a Gaussian kernel (taken from here), which is one-dimensional. I then want to use tensorflow.nn.conv1d to convolve this kernel with my signal.
I have been trying for most of the morning to get the dimensions of the input signal and the kernel right, but obviously with no success. From what I gathered from the interwebs, the dimensions of both the input signal and the kernel need to be aligned in some finicky way, and I just can't figure out which way that is. The TensorFlow error messages aren't particularly meaningful either (Shape must be rank 4 but is rank 3 for 'conv1d/Conv2D' (op: 'Conv2D') with input shapes: [?,1,1000], [1,81]). Below I've included a little piece of code to reproduce the situation:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Based on: https://stackoverflow.com/a/52012658/1510542
# Credits to #zephyrus

def gaussian_kernel(size, mean, std):
    d = tf.distributions.Normal(tf.cast(mean, tf.float32), tf.cast(std, tf.float32))
    vals = d.prob(tf.range(start=-size, limit=size+1, dtype=tf.float32))
    kernel = vals  # Some reshaping is required here
    return kernel / tf.reduce_sum(kernel)

def gaussian_filter(input, sigma):
    size = int(4*sigma + 0.5)
    x = input  # Some reshaping is required here
    kernel = gaussian_kernel(size=size, mean=0.0, std=sigma)
    conv = tf.nn.conv1d(x, kernel, stride=1, padding="SAME")
    return conv

def run_filter():
    tf.reset_default_graph()

    # Define size of data, batch sizes
    N_batch = 32
    N_data = 1000

    noise = 0.2 * (np.random.rand(N_batch, N_data) - 0.5)
    x = np.linspace(0, 2*np.pi, N_data)
    y = np.tile(np.sin(x), N_batch).reshape(N_batch, N_data)
    y_noisy = y + noise

    input = tf.placeholder(tf.float32, shape=[None, N_data])
    smooth_input = gaussian_filter(input, sigma=10)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        y_smooth = smooth_input.eval(feed_dict={input: y_noisy})

    plt.plot(y_noisy[0])
    plt.plot(y_smooth[0])
    plt.show()

if __name__ == "__main__":
    run_filter()
Any ideas?
You need to add channel dimensions to your input/kernel, since TF convolutions are generally used for multi-channel inputs/outputs. As you are working with simple 1-channel input/output this amounts to just adding some size-1 "dummy" axes.
Since by default convolution expects channels to come last, your placeholder should have shape [None, N_data, 1] and your input should be modified like this:
y_noisy = y + noise
y_noisy = y_noisy[:, :, np.newaxis]
Similarly, you need to add input and output channel dimensions to your filter:
kernel = gaussian_kernel(size=size, mean=0.0, std=sigma)
kernel = kernel[:, tf.newaxis, tf.newaxis]
That is, the filter is expected to have shape [width, in_channels, out_channels].
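Putting the pieces together, a minimal sketch of the adjusted code, assuming the same gaussian_kernel from the question and changing only the shapes:
def gaussian_filter(input, sigma):
    size = int(4*sigma + 0.5)
    x = input  # already shaped [batch, N_data, 1], see the placeholder below
    kernel = gaussian_kernel(size=size, mean=0.0, std=sigma)
    kernel = kernel[:, tf.newaxis, tf.newaxis]  # [width] -> [width, in_channels=1, out_channels=1]
    return tf.nn.conv1d(x, kernel, stride=1, padding="SAME")

# inside run_filter(): the placeholder carries an explicit channel dimension
input = tf.placeholder(tf.float32, shape=[None, N_data, 1])
y_noisy = (y + noise)[:, :, np.newaxis]  # add the channel axis before feeding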

Iterate over a tensor dimension in Tensorflow

I am trying to develop a seq2seq model from a low-level perspective (creating all the tensors needed by myself). I am trying to feed the model a sequence of vectors as a two-dimensional tensor; however, I can't iterate over one dimension of the tensor to extract the vectors one by one. Does anyone know what I could do to feed a batch of vectors and later get them one by one?
This is my code:
batch_size = 100
hidden_dim = 5
input_dim = embedding_dim
time_size = 5

input_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim, None], name='input')
output_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim, None], name='output')

input_array = np.asarray(input_sentence)
output_array = np.asarray(output_sentence)

gru_layer1 = GRU(input_array, input_dim, hidden_dim)  # This is a class created by myself

for i in range(input_array.shape[-1]):
    word = input_array[:, i]
    previous_state = gru_encoder.h_t
    gru_layer1.forward_pass(previous_state, word)
And this is the error that I get
TypeError: Expected binary or unicode string, got <tf.Tensor 'input_7:0' shape=(10, ?) dtype=float64>
Tensorflow does deferred execution.
You usually can't know how big the vector will be (words in a sentence, audio samples, etc...). The common thing to do is to cap it at some reasonably large value and then pad the shorter sequences with an empty token.
Once you do this you can select the data for a time slice with the slice operator:
data = tf.placeholder(tf.float32, shape=(batch_size, max_size, number_of_inputs))
....
for i in range(max_size):
    time_data = data[:, i, :]
    DoStuff(time_data)
Also look up tf.transpose for swapping the batch and time indices. It can help with performance in certain cases.
Alternatively consider something like tf.nn.static_rnn or tf.nn.dynamic_rnn to do the boilerplate stuff for you.
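For reference, the boilerplate version sketched with tf.nn.dynamic_rnn (the cell type and sizes here are placeholders, not taken from the question) looks roughly like:
cell = tf.nn.rnn_cell.GRUCell(hidden_dim)
# data: [batch_size, max_size, number_of_inputs]; dynamic_rnn runs the time loop internally
outputs, final_state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)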
Finally I found an approach that solves my problem. It works by using tf.scan() instead of a loop, which doesn't require the input tensor to have a defined size in the second dimension. Consequently, you have to prepare the input tensor beforehand so that it is traversed the way you want by tf.scan(). In my case this is the code:
batch_size = 100
hidden_dim = 5
input_dim = embedding_dim
time_size = 5
input_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim,None], name='input')
output_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim,None], name='output')
input_array = np.asarray(input_sentence)
output_array = np.asarray(output_sentence)
x_t = tf.transpose(input_array, [1, 0], name='x_t')
h_0 = tf.convert_to_tensor(h_0, dtype=tf.float64)
h_t_transposed = tf.scan(forward_pass, x_t, h_0, name='h_t_transposed')
h_t = tf.transpose(h_t_transposed, [1, 0], name='h_t')
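For clarity, a minimal illustration of the tf.scan calling convention used here (a toy example, not part of the model above): the function receives the previous accumulator and the current slice along the leading dimension:
elems = tf.constant([[1., 2.], [3., 4.], [5., 6.]])  # three "time steps" of size 2
init = tf.zeros([2])                                 # initial accumulator state
running_sum = tf.scan(lambda acc, x: acc + x, elems, initializer=init)
# running_sum has shape [3, 2]: one accumulated state per step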

Flattening two last dimensions of a tensor in TensorFlow

I'm trying to reshape a tensor from [A, B, C, D] into [A, B, C * D] and feed it into a dynamic_rnn. Assume that I don't know the B, C, and D in advance (they're a result of a convolutional network).
I think in Theano such reshaping would look like this:
x = x.flatten(ndim=3)
It seems that in TensorFlow there's no easy way to do this and so far here's what I came up with:
x_shape = tf.shape(x)
x = tf.reshape(x, [batch_size, x_shape[1], tf.reduce_prod(x_shape[2:])])
Even when the shape of x is known during graph building (i.e. print(x.get_shape()) prints out absolute values, like [10, 20, 30, 40]), after the reshaping get_shape() becomes [10, None, None]. Again, still assume the initial shape isn't known, so I can't operate with absolute values.
And when I'm passing x to a dynamic_rnn it fails:
ValueError: Input size (depth of inputs) must be accessible via shape inference, but saw value None.
Why is reshape unable to handle this case? What is the right way of replicating Theano's flatten(ndim=n) in TensorFlow with tensors of rank 4 and more?
It is not a flaw in reshape, but a limitation of tf.dynamic_rnn.
Your code to flatten the last two dimensions is correct. And, reshape behaves correctly too: if the last two dimensions are unknown when you define the flattening operation, then so is their product, and None is the only appropriate value that can be returned at this time.
The culprit is tf.dynamic_rnn, which expects a fully-defined feature shape during construction, i.e. all dimensions apart from the first (batch size) and the second (time steps) must be known. It is a bit unfortunate perhaps, but the current implementation does not seem to allow RNNs with a variable number of features, à la FCN.
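One hedged workaround, assuming the product of the last two dimensions is in fact known at graph-construction time even though the reshape itself uses dynamic shapes: pin the static feature size afterwards with set_shape, so shape inference (and hence dynamic_rnn) can see it:
dyn_shape = tf.shape(x)
static_shape = x.get_shape().as_list()  # e.g. [None, None, C, D] with C and D known

x = tf.reshape(x, [dyn_shape[0], dyn_shape[1], tf.reduce_prod(dyn_shape[2:])])

if static_shape[2] is not None and static_shape[3] is not None:
    # restore the statically known feature size so dynamic_rnn accepts the input
    x.set_shape([None, None, static_shape[2] * static_shape[3]])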
I tried some simple code according to your requirements. Since you are trying to reshape a CNN output, the shape of X here is the same as the shape of a CNN output in TensorFlow.
HEIGHT = 100
WIDTH = 200
N_CHANELS = 3
N_HIDDEN = 64

X = tf.placeholder(tf.float32, shape=[None, HEIGHT, WIDTH, N_CHANELS], name='input')  # output of CNN
shape = X.get_shape().as_list()  # shape[0] = BATCH_SIZE, shape[1] = HEIGHT, shape[2] = WIDTH, shape[3] = N_CHANELS
input = tf.reshape(X, [-1, shape[1], shape[2] * shape[3]])
print(input.shape) # prints (?, 100, 600)
#Input for tf.nn.dynamic_rnn should be in the shape of [BATCH_SIZE, N_TIMESTEPS, INPUT_SIZE]
#Therefore, according to the reshape N_TIMESTEPS = 100 and INPUT_SIZE= 600
#create the RNN here
lstm_layers = tf.contrib.rnn.BasicLSTMCell(N_HIDDEN, forget_bias=1.0)
outputs, _ = tf.nn.dynamic_rnn(lstm_layers, input, dtype=tf.float32)
Hope this helps.
I found a solution to this by using .get_shape().
Assuming 'x' is a 4-D Tensor.
This will only work with the Reshape layer. Since you are changing the architecture of the model anyway, this should work.
x = tf.keras.layers.Reshape((x.get_shape()[1], x.get_shape()[2] * x.get_shape()[3]))(x)  # target_shape excludes the batch dimension
Hope this works!
If you use the tf.keras.models.Model or tf.keras.layers.Layer wrapper, the build method provides a nice way to do this.
Here's an example:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Conv1D, Conv2D, Conv2DTranspose, Attention, Layer, Reshape

class VisualAttention(Layer):
    def __init__(self, channels_out, key_is_value=True):
        super(VisualAttention, self).__init__()
        self.channels_out = channels_out
        self.key_is_value = key_is_value
        self.flatten_images = None  # see build method
        self.unflatten_images = None  # see build method
        self.query_conv = Conv1D(filters=channels_out, kernel_size=1, padding='same')
        self.value_conv = Conv1D(filters=channels_out, kernel_size=4, padding='same')
        self.key_conv = self.value_conv if key_is_value else Conv1D(filters=channels_out, kernel_size=4, padding='same')
        self.attention_layer = Attention(use_scale=False, causal=False, dropout=0.)

    def build(self, input_shape):
        b, h, w, c = input_shape
        self.flatten_images = Reshape((h*w, c), input_shape=(h, w, c))
        self.unflatten_images = Reshape((h, w, self.channels_out), input_shape=(h*w, self.channels_out))

    def call(self, x, training=True):
        x = self.flatten_images(x)
        q = self.query_conv(x)
        v = self.value_conv(x)
        inputs = [q, v] if self.key_is_value else [q, v, self.key_conv(x)]
        output = self.attention_layer(inputs=inputs, training=training)
        return self.unflatten_images(output)

# test
import numpy as np
x = np.arange(8*28*32*3).reshape((8, 28, 32, 3)).astype('float32')
model = VisualAttention(8)
y = model(x)
print(y.shape)
