Weight of layer as a result of dot operation in PyTorch - python

I’m trying to implement a network in which the weights of the layers are calculated as a result of a tensor operation. This is the code I have for a single layer and is repeated for all conv and fc layers:
class NOWANet(nn.Module):
def __init__(self, V, Wstacked):
super(NOWANet, self).__init__()
# first conv layer
# define layer
self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
# define tensors that will be used to calculate weights and biases
self.conv1_W_V = nn.Parameter(V[0], requires_grad=True)
self.conv1_B_V = nn.Parameter(V[1], requires_grad=True)
self.conv1_W_stack = nn.Parameter(Wstacked[0], requires_grad=False)
self.conv1_B_stack = nn.Parameter(Wstacked[1], requires_grad=False)
# set layer weights and bias
self.conv1.weight = nn.Parameter(torch.tensordot(self.conv1_W_V,
self.conv1_W_stack,
dims=1, out=None),
requires_grad=True)
self.conv1.bias = nn.Parameter(torch.tensordot(self.conv1_B_V,
self.conv1_B_stack,
dims=1, out=None),
requires_grad=True)
The idea is to perform a tensordot between the V and W_stack tensors, which then should be used as weights and bias (one tensordot each). The catch is that I only want to optimize V so that W_stack remains the intact. The way it is currently written initializes the weights correctly, but backprop optimizes over the actual weights of the layer, there is no change to the V vector.
Any ideas or suggestions on how to do it?

Related

Add non-image features to the Inception network

I am training an binary classifier using the Inception V3 model and I would like to feed some of the non-image features of my dataset into the network.
I have previously trained a logistic regression model with these features which performed well and I would like to see if I can improve my cnn by combining these models.
It looks like inception has a fully connected layer (logits layer) just before the softmax and I believe I should concatenate some nodes onto that layer to feed in my features. I have never done this, however.
The logits layer is build here - a snippet of the inception code
# Final pooling and prediction
with tf.variable_scope('logits'):
shape = net.get_shape()
net = ops.avg_pool(net, shape[1:3], padding='VALID', scope='pool')
# 1 x 1 x 2048
net = ops.dropout(net, dropout_keep_prob, scope='dropout')
net = ops.flatten(net, scope='flatten')
# 2048
logits = ops.fc(net, num_classes, activation=None, scope='logits',
restore=restore_logits)
# 1000
end_points['logits'] = logits
if FLAGS.mode == '0_softmax':
end_points['predictions'] = tf.nn.softmax(logits, name='predictions')
The function to make the fully connected layer:
#scopes.add_arg_scope
def fc(inputs,
num_units_out,
activation=tf.nn.relu,
stddev=0.01,
bias=0.0,
weight_decay=0,
batch_norm_params=None,
is_training=True,
trainable=True,
restore=True,
scope=None,
reuse=None):
"""Adds a fully connected layer followed by an optional batch_norm layer.
FC creates a variable called 'weights', representing the fully connected
weight matrix, that is multiplied by the input. If `batch_norm` is None, a
second variable called 'biases' is added to the result of the initial
vector-matrix multiplication.
Args:
inputs: a [B x N] tensor where B is the batch size and N is the number of
input units in the layer.
num_units_out: the number of output units in the layer.
activation: activation function.
stddev: the standard deviation for the weights.
bias: the initial value of the biases.
weight_decay: the weight decay.
batch_norm_params: parameters for the batch_norm. If is None don't use it.
is_training: whether or not the model is in training mode.
trainable: whether or not the variables should be trainable or not.
restore: whether or not the variables should be marked for restore.
scope: Optional scope for variable_scope.
reuse: whether or not the layer and its variables should be reused. To be
able to reuse the layer scope must be given.
Returns:
the tensor variable representing the result of the series of operations.
"""
with tf.variable_scope(scope, 'FC', [inputs], reuse=reuse):
num_units_in = inputs.get_shape()[1]
weights_shape = [num_units_in, num_units_out]
weights_initializer = tf.truncated_normal_initializer(stddev=stddev)
l2_regularizer = None
if weight_decay and weight_decay > 0:
l2_regularizer = losses.l2_regularizer(weight_decay)
weights = variables.variable('weights',
shape=weights_shape,
initializer=weights_initializer,
regularizer=l2_regularizer,
trainable=trainable,
restore=restore)
if batch_norm_params is not None:
outputs = tf.matmul(inputs, weights)
with scopes.arg_scope([batch_norm], is_training=is_training,
trainable=trainable, restore=restore):
outputs = batch_norm(outputs, **batch_norm_params)
else:
bias_shape = [num_units_out,]
bias_initializer = tf.constant_initializer(bias)
biases = variables.variable('biases',
shape=bias_shape,
initializer=bias_initializer,
trainable=trainable,
restore=restore)
outputs = tf.nn.xw_plus_b(inputs, weights, biases)
if activation:
outputs = activation(outputs)
return outputs
My model has 10 non-image features, so I suppose I will use num_units_out + 10? I am not sure how what to do with the inputs. I assume I will add the feature data directly into this layer, by adding it to the input already coming from the previous layers. So in essence I will have two input layers.
Add your features just before the FC layer:
net = ops.flatten(net, scope='flatten')
# assuming both tensors have a shape like <batch>x<features>
net = tf.concat([net, my_other_features], axis=-1)
This will combine the existing FC part with a Input->FC->Sigmoid part into a single layer. Another way of saying it is that the final logistic layer (FC->Sigmoid) will get a feature vector input that consists of both your features and the features calculated by the CNN from the image.

Using numpy for hidden layer outpute given weights and biases

I have extracted the weights and biases for my hidden Tensorflow fully-connected layers (3 hidden-ReLU, 1 output-NoActivation) like so:
trainable_variables = tf.trainable_variables()
kernels_list = [trainable_variable for trainable_variable in trainable_variables if "DQN" in trainable_variable.name and "kernel" in trainable_variable.name]
biases_list = [trainable_variable for trainable_variable in trainable_variables if "DQN" in trainable_variable.name and "bias" in trainable_variable.name]
sess.run(tf.global_variables_initializer())
kval_list = [sess.run(kernels_list[i]) for i in range(len(kernels_list))]
bval_list = [sess.run(biases_list[i]) for i in range(len(biases_list))]
Given that the model is fully trained, I want to use these weights and biases to run a "numpy prediction". I have the input array which goes to the first hidden layer, then numpy uses those weights and biases to find the output from the first hidden layer etc. This is the code I am using to try and run these calculations:
x = state # state is my 1 dimensional input array
for i in range(0, len(kernels_list)):
weights = kval_list[i]
biases = bval_list[i]
w = x.dot(weights)
w += biases
out = np.maximum(w, 0) # ReLU
if i == len(kernels_list) - 1: # If output layer, no ReLU
out = w
x = out # Input to the next layer
However, when I compare the final x value (output layer) to the output layer using Tensorflow prediction, I get different values for each output node... Am I doing my matrix calculations wrong? The input layer has 1140 nodes, so x.shape = (1140,) initially and the first hidden layer has 600 nodes so weights.shape = (1140, 600) and biases.shape = (600,) in the first iteration.

Not fully connected layer in tensorflow

I want to create a network where in the input layer nodes are just connected to some nodes in the next layer. Here is a small example:
My solution so far is that I set the weight of the edge between i1 and h1 to zero and after every optimization step I multiply the weights with a matrix (I call this matrix mask matrix) in which every entry is 1 except the entry of the weight of the edge between i1 and h1.
(See code below)
Is this approach right? Or does this have a affect on the GradientDescent? Is there another approach to create this kind of a network in TensorFlow?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
tf.enable_eager_execution()
model = tf.keras.Sequential([
tf.keras.layers.Dense(2, activation=tf.sigmoid, input_shape=(2,)), # input shape required
tf.keras.layers.Dense(2, activation=tf.sigmoid)
])
#set the weights
weights=[np.array([[0, 0.25],[0.2,0.3]]),np.array([0.35,0.35]),np.array([[0.4,0.5],[0.45, 0.55]]),np.array([0.6,0.6])]
model.set_weights(weights)
model.get_weights()
features = tf.convert_to_tensor([[0.05,0.10 ]])
labels = tf.convert_to_tensor([[0.01,0.99 ]])
mask =np.array([[0, 1],[1,1]])
#define the loss function
def loss(model, x, y):
y_ = model(x)
return tf.losses.mean_squared_error(labels=y, predictions=y_)
#define the gradient calculation
def grad(model, inputs, targets):
with tf.GradientTape() as tape:
loss_value = loss(model, inputs, targets)
return loss_value, tape.gradient(loss_value, model.trainable_variables)
#create optimizer an global Step
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
global_step = tf.train.get_or_create_global_step()
#optimization step
loss_value, grads = grad(model, features, labels)
optimizer.apply_gradients(zip(grads, model.variables),global_step)
#masking the optimized weights
weights=(model.get_weights())[0]
masked_weights=tf.multiply(weights,mask)
model.set_weights([masked_weights])
If you are looking for a solution for the specific example you provided, you can simply use tf.keras Functional API and define two Dense layers where one is connected to both neurons in the previous layer and the other one is only connected to one of the neurons:
from tensorflow.keras.layer import Input, Lambda, Dense, concatenate
from tensorflow.keras.models import Model
inp = Input(shape=(2,))
inp2 = Lambda(lambda x: x[:,1:2])(inp) # get the second neuron
h1_out = Dense(1, activation='sigmoid')(inp2) # only connected to the second neuron
h2_out = Dense(1, activation='sigmoid')(inp) # connected to both neurons
h_out = concatenate([h1_out, h2_out])
out = Dense(2, activation='sigmoid')(h_out)
model = Model(inp, out)
# simply train it using `fit`
model.fit(...)
The problem with your solution and some others suggested by other answers in this post is that they do not prevent training of this weight. They allow the gradient descent to train the non existent weight and then overwrite it retrospectively. This will result in a network that has a zero in this location as desired, but will negatively affect your training process as the back propagation calculation will not see the masking step as it is not part of a TensorFlow graph and so the gradient descent will follow a path which includes the assumption that this weight does have an affect on the outcome (it does not).
A better solution would be to include the masking step as a part of your TensorFlow graph, so that it can be factored into the gradient descent. Since the masking step is simply a element wise multiplication by your sparse, binary martix mask, you could just include the mask matrix as an elementwise matrix multiplicaiton in the graph definition using tf.multiply.
Sadly this means sying goodbye to the user friendly keras,layers methods and embracing a more nuts & bolts approach to TensorFlow. I can't see an obvious way to do it using the layers API.
See the implementation below, I have tried to provide comments explaining what is happening at each stage.
import tensorflow as tf
## Graph definition for model
# set up tf.placeholders for inputs x, and outputs y_
# these remain fixed during training and can have values fed to them during the session
with tf.name_scope("Placeholders"):
x = tf.placeholder(tf.float32, shape=[None, 2], name="x") # input layer
y_ = tf.placeholder(tf.float32, shape=[None, 2], name="y_") # output layer
# set up tf.Variables for the weights at each layer from l1 to l3, and setup feeding of initial values
# also set up mask as a variable and set it to be un-trianable
with tf.name_scope("Variables"):
w_l1_values = [[0, 0.25],[0.2,0.3]]
w_l1 = tf.Variable(w_l1_values, name="w_l1")
w_l2_values = [[0.4,0.5],[0.45, 0.55]]
w_l2 = tf.Variable(w_l2_values, name="w_l2")
mask_values = [[0., 1.], [1., 1.]]
mask = tf.Variable(mask_values, trainable=False, name="mask")
# link each set of weights as matrix multiplications in the graph. Inlcude an elementwise multiplication by mask.
# Sequence takes us from inputs x to output final_out, which will be compared to labels fed to placeholder y_
l1_out = tf.nn.relu(tf.matmul(x, tf.multiply(w_l1, mask)), name="l1_out")
final_out = tf.nn.relu(tf.matmul(l1_out, w_l2), name="output")
## define loss function and training operation
with tf.name_scope("Loss"):
# some loss defined as a function of graph output: final_out and labels: y_
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=final_out, labels=y_, name="loss")
with tf.name_scope("Train"):
# some optimisation strategy, arbitrary learning rate
optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name="optimizer_adam")
train_op = optimizer.minimize(loss, name="train_op")
# create session, initialise variables and train according to inputs and corresponding labels
# This should show that the values of the first layer weights change, but the one set to 0 remains at 0
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
print(initial_l1_weights.eval())
inputs = [[0.05, 0.10]]
labels = [[0.01, 0.99]]
ans = sess.run(train_op, feed_dict={"Placeholders/x:0": inputs, "Placeholders/y_:0": labels})
train_steps = 1
for i in range(train_steps):
initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
print(initial_l1_weights.eval())
Or use the answer provided by today for a keras friendly option.
You have multiple options here.
First, you could use the dynamic masking approach in your example. I believe this will work as expected since the gradients w.r.t. the masked-out parameters will be zero (the output is constant when you change the unused parameters). This approach is simple and it can be used even when your mask is not constant during the training.
Second, if you know beforehand which weights will be always zero, you can compose your weight matrix using tf.get_variable to get a submatrix, and then concatenate it with a tf.constant tensor, e.g.:
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)
this example will make one column of your weight matrix to be always zero.
Finally, if your mask is more complex, you can use tf.get_variable on a flattened vector and then compose a tf.SparseTensor with the variable values on the used indices:
weights_used = tf.get_variable("w", [num_used_vars])
indices = ... # get your indices in a 2-D matrix of shape [num_used_vars, 2]
dense_shape = tf.constant([dim_in, dim_out]) # this is the final shape of the weight matrix
weights = tf.SparseTensor(indices, weights_used, dense_shape)
EDIT: This probably won't work in combination with Keras' set_weights method, as it expects Numpy arrays, not Tensors.

How to implement Tensorflow batch normalization in LSTM

My current LSTM network looks like this.
rnn_cell = tf.contrib.rnn.BasicRNNCell(num_units=CELL_SIZE)
init_s = rnn_cell.zero_state(batch_size=1, dtype=tf.float32) # very first hidden state
outputs, final_s = tf.nn.dynamic_rnn(
rnn_cell, # cell you have chosen
tf_x, # input
initial_state=init_s, # the initial hidden state
time_major=False, # False: (batch, time step, input); True: (time step, batch, input)
)
# reshape 3D output to 2D for fully connected layer
outs2D = tf.reshape(outputs, [-1, CELL_SIZE])
net_outs2D = tf.layers.dense(outs2D, INPUT_SIZE)
# reshape back to 3D
outs = tf.reshape(net_outs2D, [-1, TIME_STEP, INPUT_SIZE])
Usually, I apply tf.layers.batch_normalization as batch normalization. But I am not sure if this works in a LSTM network.
b1 = tf.layers.batch_normalization(outputs, momentum=0.4, training=True)
d1 = tf.layers.dropout(b1, rate=0.4, training=True)
# reshape 3D output to 2D for fully connected layer
outs2D = tf.reshape(d1, [-1, CELL_SIZE])
net_outs2D = tf.layers.dense(outs2D, INPUT_SIZE)
# reshape back to 3D
outs = tf.reshape(net_outs2D, [-1, TIME_STEP, INPUT_SIZE])
If you want to use batch norm for RNN (LSTM or GRU), you can check out this implementation , or read the full description from blog post.
However, the layer-normalization has more advantage than batch norm in sequence data. Specifically, "the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent networks" (from the paper Ba, et al. Layer normalization).
For layer normalization, it normalizes the summed inputs within each layer. You can check out the implementation of layer-normalization for GRU cell:
Based on this paper: "Layer Normalization" - Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
Tensorflow now comes with the tf.contrib.rnn.LayerNormBasicLSTMCell a LSTM unit with layer normalization and recurrent dropout.
Find the documentation here.

Add bias to Lasagne neural network layers

I am wondering if there is a way to add bias node to each layer in Lasagne neural network toolkit? I have been trying to find related information in documentation.
This is the network I built but i don't know how to add a bias node to each layer.
def build_mlp(input_var=None):
# This creates an MLP of two hidden layers of 800 units each, followed by
# a softmax output layer of 10 units. It applies 20% dropout to the input
# data and 50% dropout to the hidden layers.
# Input layer, specifying the expected input shape of the network
# (unspecified batchsize, 1 channel, 28 rows and 28 columns) and
# linking it to the given Theano variable `input_var`, if any:
l_in = lasagne.layers.InputLayer(shape=(None, 60),
input_var=input_var)
# Apply 20% dropout to the input data:
l_in_drop = lasagne.layers.DropoutLayer(l_in, p=0.2)
# Add a fully-connected layer of 800 units, using the linear rectifier, and
# initializing weights with Glorot's scheme (which is the default anyway):
l_hid1 = lasagne.layers.DenseLayer(
l_in_drop, num_units=800,
nonlinearity=lasagne.nonlinearities.rectify,
W=lasagne.init.Uniform())
# We'll now add dropout of 50%:
l_hid1_drop = lasagne.layers.DropoutLayer(l_hid1, p=0.5)
# Another 800-unit layer:
l_hid2 = lasagne.layers.DenseLayer(
l_hid1_drop, num_units=800,
nonlinearity=lasagne.nonlinearities.rectify)
# 50% dropout again:
l_hid2_drop = lasagne.layers.DropoutLayer(l_hid2, p=0.5)
# Finally, we'll add the fully-connected output layer, of 10 softmax units:
l_out = lasagne.layers.DenseLayer(
l_hid2_drop, num_units=2,
nonlinearity=lasagne.nonlinearities.softmax)
# Each layer is linked to its incoming layer(s), so we only need to pass
# the output layer to give access to a network in Lasagne:
return l_out
Actually you don't have to explicitly create biases, because DenseLayer(), and convolution base layers too, has a default keyword argument:
b=lasagne.init.Constant(0.).
Thus you can avoid creating bias, if you don't want to have with explicitly pass bias=None, but this is not that case.
Thus in brief you do have bias parameters while you don't pass None to bias parameter e.g.:
hidden = Denselayer(...bias=None)

Categories