Compute updates in Theano after N number of loss calculations - python

I've constructed an LSTM recurrent neural network using lasagne that is loosely based on the architecture in this blog post. My input is a text file that has around 1,000,000 sentences and a vocabulary of 2,000 word tokens. Normally, when I construct networks for image recognition, my input layer looks something like the following:
l_in = nn.layers.InputLayer((32, 3, 128, 128))
(where the dimensions are batch size, channel, height and width) which is convenient because all the images are the same size so I can process them in batches. Since each instance in my LSTM network has a varying sentence length, I have an input layer that looks like the following:
l_in = nn.layers.InputLayer((None, None, 2000))
As described in the blog post referenced above:
Masks: Because not all sequences in each minibatch will always have the same length, all recurrent layers in lasagne accept a separate mask input which has shape (batch_size, n_time_steps), which is populated such that mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When no mask is provided, it is assumed that all sequences in the minibatch are of length n_time_steps.
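For illustration, a mask of the shape described above can be built directly from the sequence lengths. This is a minimal NumPy sketch, not lasagne code: the helper name build_mask is hypothetical, and because NumPy indices start at 0 the comparison is j < length rather than the 1-based j <= length used in the quoted text.

```python
import numpy as np

def build_mask(lengths, n_time_steps):
    """Return a (batch_size, n_time_steps) mask: 1.0 for real steps, 0.0 for padding."""
    lengths = np.asarray(lengths)
    steps = np.arange(n_time_steps)
    # Broadcast (1, n_time_steps) against (batch_size, 1)
    return (steps[None, :] < lengths[:, None]).astype("float32")

# Three sequences of lengths 3, 1 and 4, padded to 4 time steps
mask = build_mask([3, 1, 4], n_time_steps=4)
print(mask)
```

An array like this is what would be fed to the mask input of a recurrent layer alongside the padded batch.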
My question is: Is there a way to process this type of network in mini-batches without using a mask?
Here is a simplified version of my network.
# -*- coding: utf-8 -*-
import theano
import theano.tensor as T
import lasagne as nn
softmax = nn.nonlinearities.softmax
def build_model():
    l_in = nn.layers.InputLayer((None, None, 2000))
    lstm = nn.layers.LSTMLayer(l_in, 4096, grad_clipping=5)
    rs = nn.layers.SliceLayer(lstm, 0, 0)
    dense = nn.layers.DenseLayer(rs, num_units=2000, nonlinearity=softmax)
    return l_in, dense
model = build_model()
l_in, l_out = model
all_params = nn.layers.get_all_params(l_out)
target_var = T.ivector("target_output")
output = nn.layers.get_output(l_out)
loss = T.nnet.categorical_crossentropy(output, target_var).sum()
updates = nn.updates.adagrad(loss, all_params, 0.005)
train = theano.function([l_in.input_var, target_var], loss, updates=updates)
From there I have a generator that yields (X, y) pairs, and I compute train(X, y) and update the parameters with each iteration. What I want instead is to run N training steps and then update the parameters with the average gradient.
To do this, I tried creating a compute_gradient function:
gradient = theano.grad(loss, all_params)
compute_gradient = theano.function(
    [l_in.input_var, target_var],
    outputs=gradient
)
and then looping over several training instances to create a "batch" and collect the gradient calculations to a list:
grads = []
for _ in xrange(1024):
    X, y = train_gen.next()  # generator for producing training data
    grads.append(compute_gradient(X, y))
This produces a list of lists:
>>> grads
[[<CudaNdarray at 0x7f83b5ff6d70>,
<CudaNdarray at 0x7f83b5ff69f0>,
<CudaNdarray at 0x7f83b5ff6270>,
<CudaNdarray at 0x7f83b5fc05f0>],
[<CudaNdarray at 0x7f83b5ff66f0>,
<CudaNdarray at 0x7f83b5ff6730>,
<CudaNdarray at 0x7f83b5ff6b70>,
<CudaNdarray at 0x7f83b5ff64f0>] ...
From here I would need to take the mean of the gradient for each parameter, and then update the model parameters. Is it possible to do this in pieces like this, or does the gradient calculation and parameter update need to happen all in one theano function?
Thanks.

NOTE: this is a solution, but I don't have enough experience to verify that it is the best one, and the code is just a sloppy example.
You need 2 theano functions. The first is the gradient function, which you seem to have already, judging from the information in your question.
After computing the batched gradients, you want to feed them straight back as input arguments into another theano function dedicated to updating the shared variables. For this you need to specify the expected batch size when you compile your network, so you could do something like this (for simplicity I will assume you have a global list variable where all your params are stored):
params  # list of params (shared variables) you wish to update
BATCH_SIZE = 1024  # size of the expected training batch
LEARNING_RATE = 0.005  # assumed step size; pick whatever suits your optimizer
# placeholders for the gradients, flattened so they can be fed into a theano function:
# G[j*len(params) + i] is the gradient of params[i] for sample j
# (T.matrix() assumes every param is 2-D; use the matching tensor type otherwise)
G = [T.matrix() for j in range(BATCH_SIZE) for param in params]
updates = [G[i] for i in range(len(params))]  # start with the gradients from the first sample
for i in range(len(params)):  # sum the gradients for each individual param
    for j in range(1, BATCH_SIZE):
        updates[i] += G[j*len(params) + i]
for i in range(len(params)):  # make a list of tuples for theano.function's updates argument
    updates[i] = (params[i], params[i] - LEARNING_RATE * updates[i] / BATCH_SIZE)
update = theano.function(G, [], updates=updates)
This way theano takes the mean of the gradients and updates the params as usual.
I don't know if you need to flatten the inputs as I did, but probably.
EDIT: Gathering from how you edited your question, it seems important that the batch size can vary. In that case you could add 2 theano functions to your existing one:
The first theano function takes two sets of parameter gradients and returns their sum. You could apply this function using Python's reduce() to get the sum over the whole batch of gradients.
The second theano function takes those summed parameter gradients and a scalar (the batch size) as input, and hence is able to update the NN params using the mean of the summed gradients.
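The variable-batch-size idea can be sketched outside theano with plain NumPy arrays. This is only an illustration of the arithmetic under stated assumptions: the helper names average_gradients and sgd_step are hypothetical, the update rule is plain SGD rather than adagrad, and the toy arrays stand in for the per-sample gradient lists returned by compute_gradient (converted with np.asarray).

```python
import numpy as np

def average_gradients(grads):
    """grads: list (one entry per sample) of lists (one array per parameter)."""
    n_params = len(grads[0])
    return [np.mean([np.asarray(g[i]) for g in grads], axis=0)
            for i in range(n_params)]

def sgd_step(params, mean_grads, learning_rate=0.005):
    """Plain SGD update applied in place; a stand-in for adagrad."""
    for p, g in zip(params, mean_grads):
        p -= learning_rate * g

# Toy data: two parameters, gradients from three samples
params = [np.ones((2, 2)), np.ones(2)]
grads = [[np.full((2, 2), k), np.full(2, k)] for k in (1.0, 2.0, 3.0)]

mean_grads = average_gradients(grads)
sgd_step(params, mean_grads, learning_rate=0.5)
print(params[0])
```

Because the averaging happens over a Python list, the number of accumulated samples can change from one update to the next without recompiling anything.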

Related

How to compute the gradient of the output with respect to each input in pytorch

I have a tensor of shape (number_of_rays, number_of_points_per_ray, 3), let’s call it input. input is passed through a model and some processing (all of this is differentiable), let’s call this process inference. Finally, we get output = inference(input), which has a shape of (number_of_rays, number_of_points_per_ray, 300), where each “ray” in the output only depends on the same ray of the input, e.g. output[i] only depends on input[i]. This means that for each set of 3 elements in the input the output has 300 elements, so I would expect to get a gradient with the same shape as the output.
As explained at https://discuss.pytorch.org/t/need-help-computing-gradient-of-the-output-with-respect-to-the-input/150950/5 , I tried grads = torch.autograd.grad(outputs=output, inputs=input, grad_outputs=None)
but the result I am getting has shape (number_of_rays, number_of_points_per_ray, 3), which is the same as the input and not the same as the output.
Any clue what I may be doing wrong?
Thanks in advance
I am assuming that 3 in input is the size of the state that you forward to your model network, and 300 is the size of the output that your model network produces.
Now, you want to call a separate instance of your model network for each element in (number_of_rays)? If yes, then one way of getting the gradients for each element in your array, is to assign a separate optimizer to each instance of the model network that you assign to your array elements.
Here is my code:
# assuming torch was imported as T
import torch as T
class Environment():
    def __init__(self, N, input_dims, output_dims, lr, gamma):
        self.number_of_rays = N
        self.input_dims = input_dims
        self.output_dims = output_dims
        # assign to each ray a neural network
        self.rays = [YourModel(input_dims, output_dims, gamma) for i in range(N)]
        self.optimizer = []
        for i in range(N):
            optim = T.optim.Adam(self.rays[i].parameters(), lr=lr, betas=(0.92, 0.999))
            self.optimizer.append(optim)
    def run(self):
        # here you do your stuff
        # ...
        # now update the networks
        for i in range(self.number_of_rays):
            prediction = self.rays[i](observation)  # feed your input state to the model
            loss = loss_fn(prediction)  # compute loss
            self.optimizer[i].zero_grad()  # reset the gradients of model parameters
            loss.backward()  # backpropagate the prediction loss
            self.optimizer[i].step()  # adjust the parameters by the gradients collected in the backward pass
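As a side note on the shapes in the question: torch.autograd.grad returns a vector-Jacobian product, not the full Jacobian, which is why its result always has the shape of the input. The sketch below shows that arithmetic in NumPy for a toy linear map standing in for inference (an assumption made purely for illustration; the real inference is nonlinear). For the full derivative in PyTorch, torch.autograd.functional.jacobian is one documented option.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(300, 3))    # toy linear "inference": y = W @ x
x = rng.normal(size=3)
y = W @ x                        # shape (300,)

jacobian = W                     # dy/dx for a linear map, shape (300, 3)
v = np.ones_like(y)              # plays the role of grad_outputs
vjp = v @ jacobian               # shape (3,), the same shape as x

print(jacobian.shape, vjp.shape)
```

The full derivative of a 300-element output with respect to a 3-element input has 300 x 3 entries per point, so it matches neither the input shape nor the output shape.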

For loop with GRUCell in call method of subclassed tf.keras.Model

I have subclassed tf.keras.Model and I use tf.keras.layers.GRUCell in a for loop to compute sequences 'y_t' (n, timesteps, hidden_units) and final hidden states 'h_t' (n, hidden_units). For my loop to output 'y_t', I update a tf.Variable after each iteration of the loop. Calling the model with model(input) is not a problem, but when I fit the model with the for loop in the call method I get either a TypeError or a ValueError.
Please note, I cannot simply use tf.keras.layers.GRU because I am trying to implement this paper. Instead of just passing x_t to the next cell in the RNN, the paper performs some computation as a step in the for loop (they implement in PyTorch) and pass the result of that computation to the RNN cell. They end up essentially doing this: h_t = f(special_x_t, h_t-1).
Please see the model below that causes the error:
class CustomGruRNN(tf.keras.Model):
    def __init__(self, batch_size, timesteps, hidden_units, features, **kwargs):
        # Inheritance
        super().__init__(**kwargs)
        # Args
        self.batch_size = batch_size
        self.timesteps = timesteps
        self.hidden_units = hidden_units
        # Stores y_t
        self.rnn_outputs = tf.Variable(tf.zeros(shape=(batch_size, timesteps, hidden_units)), trainable=False)
        # To be used in for loop in call
        self.gru_cell = tf.keras.layers.GRUCell(units=hidden_units)
        # Reshape to match input dimensions
        self.dense = tf.keras.layers.Dense(units=features)
    def call(self, inputs):
        """Inputs is rank-3 tensor of shape (n, timesteps, features) """
        # Initial state for gru cell
        h_t = tf.zeros(shape=(self.batch_size, self.hidden_units))
        for timestep in tf.range(self.timesteps):
            # Get the timestep of the inputs
            x_t = tf.gather(inputs, timestep, axis=1)  # Same as x_t = inputs[:, timestep, :]
            # Compute outputs and hidden states
            y_t, h_t = self.gru_cell(x_t, h_t)
            # Update y_t at the t^th timestep
            self.rnn_outputs = self.rnn_outputs[:, timestep, :].assign(y_t)
        # Outputs need to have same last dimension as inputs
        outputs = self.dense(self.rnn_outputs)
        return outputs
An example that would throw the error:
# Arbitrary values for dataset
num_samples = 128
batch_size = 4
timesteps = 5
features = 10
# Arbitrary dataset
x = tf.random.uniform(shape=(num_samples, timesteps, features))
y = tf.random.uniform(shape=(num_samples, timesteps, features))
train_data = tf.data.Dataset.from_tensor_slices((x, y))
train_data = train_data.shuffle(batch_size).batch(batch_size, drop_remainder=True)
# Model with arbitrary hidden units
model = CustomGruRNN(batch_size, timesteps, hidden_units=5, features=features)
model.compile(loss=tf.keras.losses.MeanSquaredError(), optimizer=tf.keras.optimizers.Adam())
When running eagerly:
model.fit(train_data, epochs=2, run_eagerly=True)
Epoch 1/2
WARNING:tensorflow:Gradients do not exist for variables
['stack_overflow_gru_rnn/gru_cell/kernel:0',
'stack_overflow_gru_rnn/gru_cell/recurrent_kernel:0',
'stack_overflow_gru_rnn/gru_cell/bias:0'] when minimizing the loss.
ValueError: substring not found ValueError
When not running eagerly:
model.fit(train_data, epochs=2, run_eagerly=False)
Epoch 1/2
TypeError: in user code:
TypeError: Can not convert a NoneType into a Tensor or Operation.
Edit:
While the TensorFlow guide answer suffices, I think my self-answered question involving custom cells for RNNs is a much better option. Please see this answer. Using a custom RNN cell removes the need to use tf.transpose and tf.TensorArray, and thus lowers the complexity of the code while improving readability.
Original Self-Answer:
The use of the DynamicRNN described near the bottom of TensorFlow's Guide to Effective TensorFlow2 solves my problem.
To expand briefly on the DynamicRNN's conceptual use: an RNN cell is defined (in my case a GRU cell), and then any number of custom steps can be defined within the tf.range loop. Values should be tracked using tf.TensorArray objects created outside the loop but inside the call method itself, and the sizes of such arrays can be determined by reading the .shape attribute of the (input) tensors. Notably, the DynamicRNN object works in model.fit, where the default execution mode is graph mode as opposed to the slower eager execution mode.
Lastly, one might require a DynamicRNN because, by default, the tf.keras.layers.GRU computation is loosely described by the following recurrent logic (assume that f defines a GRU cell):
# NumPy is used here for ease of indexing, but in general you should use
# tensors and transpose them accordingly (see the previously linked guide)
inputs = np.random.randn(batch, total_timesteps, features)
# List for tracking outputs -- just for simple demonstration... again please see the guide for more details
outputs = []
# Initialize the 'hidden state' (often referred to as h_naught and denoted h_0) of the RNN cell
state_at_t_minus_1 = tf.zeros(shape=(batch, hidden_cell_units))
# Iterate through the input until all timesteps in the sequence have been 'seen' by the GRU cell function 'f'
for timestep_t in range(total_timesteps):
    # This is of shape (batch, features)
    input_at_t = inputs[:, timestep_t, :]
    # output_at_t of shape (batch, hidden_units_of_cell) and state_at_t (batch, hidden_units_of_cell)
    output_at_t, state_at_t = f(input_at_t, state_at_t_minus_1)
    outputs.append(output_at_t)
    # When the loop restarts, this variable will be used in the next GRU Cell function call 'f'
    state_at_t_minus_1 = state_at_t
One might wish to add other steps in the for loop of the recurrent logic (e.g., dense layers, other layers, etc.) to modify the inputs and states passed to the GRU Cell function 'f'. This is one motivation of the DynamicRNN.
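The loop with an extra per-step computation, h_t = f(special_x_t, h_t-1), might look roughly like the following runnable NumPy sketch. Everything here is a stand-in chosen for illustration: f is a plain tanh recurrent step rather than a real GRU cell, and custom_step is a hypothetical placeholder for whatever computation the paper performs on x_t.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, total_timesteps, features, hidden = 4, 5, 10, 8

# Stand-in cell weights: a plain tanh RNN step, NOT a real GRU
W_x = rng.normal(scale=0.1, size=(features, hidden))
W_h = rng.normal(scale=0.1, size=(hidden, hidden))

def f(x_t, h_prev):
    h_t = np.tanh(x_t @ W_x + h_prev @ W_h)
    return h_t, h_t  # (output, new state)

def custom_step(x_t, h_prev):
    # Hypothetical extra computation; here it only rescales the input
    return x_t * 0.5

inputs = rng.normal(size=(batch, total_timesteps, features))
state = np.zeros((batch, hidden))
outputs = []
for t in range(total_timesteps):
    special_x_t = custom_step(inputs[:, t, :], state)
    output_t, state = f(special_x_t, state)
    outputs.append(output_t)

outputs = np.stack(outputs, axis=1)  # (batch, total_timesteps, hidden)
print(outputs.shape)
```

Collecting the per-step outputs in a list and stacking them afterwards plays the role that tf.TensorArray plays in the graph-mode version.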

Keras sequence models - how to generate data during test/generation?

Is there a way to use the already trained RNN (SimpleRNN or LSTM) model to generate new sequences in Keras?
I'm trying to modify an exercise from the Coursera Deep Learning Specialization - Sequence Models course, where you train an RNN to generate dinosaur names. In the exercise you build the RNN using only numpy, but I want to use Keras.
One of the problems is different lengths of the sequences (dino names), so I used padding and set sequence length to the max size appearing in the dataset (I padded with 0, which is also the code for '\n').
My question is how to generate the actual sequence once training is done? In the numpy version of the exercise you take the softmax output of the previous cell and use it as a distribution to sample a new input for the next cell. But is there a way to connect the output of the previous cell as the input of the next cell in Keras, during testing/generation time?
Also, some additional side questions:
Since I'm using padding, I suspect the accuracy is way too optimistic. Is there a way to tell Keras not to include the padding values in its accuracy calculations?
Am I even doing this right? Is there a better way to use Keras with sequences of different lengths?
You can check my (WIP) code here.
Inferring from a model that has been trained on a sequence
So it's a pretty common thing to do in RNN models and in Keras the best way (at least from what I know) is to create two different models.
One model for training (which uses sequences instead of individual items)
Another model for predicting (which uses a single element instead of a sequence)
So let's see an example. Suppose you have the following model.
from tensorflow.keras import models, layers
n_chars = 26
timesteps = 10
inp = layers.Input(shape=(timesteps, n_chars))
lstm = layers.LSTM(100, return_sequences=True)
out1 = lstm(inp)
dense = layers.Dense(n_chars, activation='softmax')
out2 = layers.TimeDistributed(dense)(out1)
model = models.Model(inp, out2)
model.summary()
Now to infer from this model, you create another model which looks like the one below.
inp_infer = layers.Input(shape=(1, n_chars))
# Inputs to feed LSTM states back in
h_inp_infer = layers.Input(shape=(100,))
c_inp_infer = layers.Input(shape=(100,))
# We need return_state=True so we are creating a new layer
lstm_infer = layers.LSTM(100, return_state=True, return_sequences=True)
out1_infer, h, c = lstm_infer(inp_infer, initial_state=[h_inp_infer, c_inp_infer])
out2_infer = layers.TimeDistributed(dense)(out1_infer)
# Our model takes the previous states as inputs and spits out new states as outputs
model_infer = models.Model([inp_infer, h_inp_infer, c_inp_infer], [out2_infer, h, c])
# We are setting the weights from the trained model
lstm_infer.set_weights(lstm.get_weights())
model_infer.summary()
So what's different? We have defined a new input layer which accepts an input with only one timestep (in other words, just a single item). The model then produces an output with a single timestep (technically we don't need the TimeDistributed layer, but I've kept it for consistency). Other than that, we take the previous LSTM state as an input and produce the new state as an output. More specifically, we have the following inference model.
Input: [(None, 1, n_chars), (None, 100), (None, 100)] list of tensors
Output: [(None, 1, n_chars), (None, 100), (None, 100)] list of tensors
Note that I'm updating the weights of the new layers from the trained model or using the existing layers from the training model. It will be a pretty useless model if you don't reuse the trained layers and weights.
Now we can write inference logic.
import numpy as np
x = np.random.randint(0,2,size=(1, 1, n_chars))
h = np.zeros(shape=(1, 100))
c = np.zeros(shape=(1, 100))
seq_len = 10
for _ in range(seq_len):
    print(x)
    y_pred, h, c = model_infer.predict([x, h, c])
    # drop the time axis of the prediction: (batch, 1, n_chars) -> (batch, n_chars)
    y_pred = y_pred[:,0,:]
    y_onehot = np.zeros(shape=(x.shape[0],n_chars))
    y_onehot[np.arange(x.shape[0]),np.argmax(y_pred,axis=1)] = 1.0
    x = np.expand_dims(y_onehot, axis=1)
This part starts with an initial x, h, c. It gets the prediction y_pred, h, c, converts the prediction into a one-hot input in the following lines, and assigns it back to x, so the new x, h, c feed the next step. You keep going for as many iterations as you choose.
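The loop above picks the most likely character with argmax. The question mentions sampling from the softmax output instead, as the numpy exercise does; a minimal sketch of that step is below. The helper name sample_next_onehot is hypothetical, and the toy y_pred stands in for the (batch, n_chars) prediction of model_infer.

```python
import numpy as np

def sample_next_onehot(y_pred, rng):
    """y_pred: (batch, n_chars) rows of probabilities.
    Returns a (batch, 1, n_chars) one-hot array sampled from each row."""
    batch, n_chars = y_pred.shape
    onehot = np.zeros((batch, n_chars))
    for i in range(batch):
        p = y_pred[i].astype("float64")
        p /= p.sum()  # guard against float32 rounding so the row sums to 1
        onehot[i, rng.choice(n_chars, p=p)] = 1.0
    return onehot[:, None, :]

rng = np.random.default_rng(0)
y_pred = np.array([[0.1, 0.7, 0.2]], dtype="float32")
x_next = sample_next_onehot(y_pred, rng)
print(x_next.shape)
```

Sampling tends to produce more varied sequences than argmax, which always follows the single most likely path from a given start.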
About masking zeros
Keras does offer a Masking layer which can be used for this purpose. And the second answer in this question seems to be what you're looking for.

Why prediction on activation values (Softmax) gives incorrect results?

I've implemented a basic neural network from scratch using TensorFlow and trained it on the Fashion-MNIST dataset. It trains correctly and reaches a test accuracy of about 88-90% over 10 classes.
Now I've written predict() function which predicts the class of given image using trained weights. Here is the code:
def predict(images, trained_parameters):
    Ws, bs = [], []
    parameters = {}
    for param in trained_parameters.keys():
        parameters[param] = tf.convert_to_tensor(trained_parameters[param])
    X = tf.placeholder(tf.float32, [images.shape[0], None], name = 'X')
    Z_L = forward_propagation(X, trained_parameters)
    p = tf.argmax(Z_L) # Working fine
    # p = tf.argmax(tf.nn.softmax(Z_L)) # not working if softmax is applied
    with tf.Session() as session:
        prediction = session.run(p, feed_dict={X: images})
    return prediction
This uses the forward_propagation() function, which returns the weighted sum of the last layer (Z) and not the activations (A), because TensorFlow's tf.nn.softmax_cross_entropy_with_logits() requires Z instead of A: it calculates A itself by applying softmax. Refer to this link for details.
Now in the predict() function, when I make predictions using Z instead of A (activations), it works correctly. But if I calculate softmax on Z (which gives the activations A of the last layer), it gives incorrect predictions.
Why does it give correct predictions on the weighted sums Z? Aren't we supposed to first apply the softmax activation (and calculate A) and then make predictions?
Here is the link to my colab notebook if anyone wants to look at my entire code: Link to Notebook Gist
So what am I missing here?
Most TF functions, such as tf.nn.softmax, assume by default that the batch dimension is the first one - that is a common practice. Now, I noticed in your code that your batch dimension is the second, i.e. your output shape is (output_dim=10, batch_size=?), and as a result, tf.nn.softmax is computing the softmax activation along the batch dimension.
There is nothing wrong in not following the conventions - one just needs to be aware of them. Computing the argmax of the softmax along the first axis should yield the desired results (it is equivalent to taking the argmax of the logits):
p = tf.argmax(tf.nn.softmax(Z_L, axis=0))
I would also recommend passing the axis to argmax explicitly (here the first axis) in case more than one image is fed into the network.
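The effect of the softmax axis can be checked with a small NumPy sketch. The softmax helper and the toy logits below are illustrative assumptions, laid out as (output_dim, batch_size) like the network in the question.

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy logits with shape (output_dim=3, batch_size=2)
Z = np.array([[5.0, 0.0],
              [4.0, 3.0],
              [0.0, 1.0]])

from_logits = np.argmax(Z, axis=0)
right_axis = np.argmax(softmax(Z, axis=0), axis=0)  # softmax over classes
wrong_axis = np.argmax(softmax(Z, axis=1), axis=0)  # softmax over the batch

print(from_logits, right_axis, wrong_axis)  # [0 1] [0 1] [0 2]
```

Softmax is monotonic within each column when taken over the class axis, so the argmax is unchanged; taken over the batch axis it rescales each class row differently and can change which class wins.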

Writing this exotic NN architecture with keras, tensorflow and python

I'm trying to get Keras to train a multiclass classification model that can be written in a network like this:
The only set of trainable parameters are those , all the rest is given. The functions fi are combinations of usual mathematical functions (for example .Sigma stands for summing the previous terms and softmax is the usual function. The (x1,x2,...xn) are elements of train or test set and are a specific subset of the original data already selected.
The model in more depth:
Specifically, given (x_1,x_2,...,x_n) an input in train or test set, the network evaluates
where fi are given mathematical functions, are rows of a particular subset of the original data and the coefficients are the parameters I want to train.
As I'm using keras, I expect it to add a bias term to each row.
After the above evaluation, I will apply a softmax layer (each of the m lines above are numbers that will be inputs for the softmax function).
At the end I want to compile the model and run model.fit as usual.
The problem is that I couldn't translate the expression to keras syntax.
My attempt:
Following the network scratch above, I first tried to consider each of the expressions of the form as lambda layers in a Sequential Model, but the best I could get to work was a combination of a dense layer with linear activation (which would play the role of a row's parameters: ) followed by a Lambda layer outputting a vector without the required summation, as follows:
model = Sequential()
#single row considered:
model.add(Lambda(lambda x: f_fixedRow(x), input_shape=(nFeatures,)))
#parameters set after lambda layer to get (a1*f(x1,y1),...,an*f(xn,yn)) and not (f(a1*x1,y1),...,f(an*xn,yn))
model.add(Dense(nFeatures, activation='linear'))
#missing summation: sum(x)
#missing evaluation of f in all other rows
model.add(Dense(classes,activation='softmax',trainable=False)) #should get all rows
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
Also, I had to define the function in the lambda function call with the argument already fixed (because the lambda function could have only the input layers as variable):
def f_fixedRow(x):
    #picking a particular row (as a vector) to evaluate f in (f works element-wise)
    y=tf.constant(value=x[0,:],dtype=tf.float32)
    return f(x,y)
I managed to write the f function with tensorflow (working element-wise in a row), although this is a possible source for problems in my code (and the above workaround seems unnatural).
I also thought that if I could properly write the element-wise sum of the vector in the aforementioned attempt I could repeat the same procedure in a parallelized manner with the keras Functional API and then insert the output of each parallel model in a softmax function, as I need.
Another approach that I considered was to train the parameters keeping their natural matrix structure seen in Network Description, maybe writing a matrix Lambda layer, but I could not find anything related to this idea.
Anyway, I'm not sure what is a good way to work with this model within keras, maybe I'm missing an important point because of the non standard way the parameters are written or lack of experience with tensorflow. Any suggestions are welcome.
For this answer, it's important that f be a tensor function that operates elementwise. (No iterating). This is reasonably easy to have, just check the keras backend functions.
Assumptions:
The x_pk set is constant, otherwise this solution must be reviewed.
The function f is elementwise (if not, please show f for better code)
Your model will need x_pk as a tensor input. And you should do that in a functional API model.
import keras.backend as K
from keras.layers import Input, Lambda, Activation, Layer
from keras.models import Model
#x_pk data
x_pk_numpy = select_X_pk_samples(x_train)
x_pk_tensor = K.variable(x_pk_numpy)
#number of rows in x_pk
m = len(x_pk_numpy)
#I suggest a fixed batch size for simplicity
batch = some_batch_size
First let's work on the function that will take x and x_pk calling f.
def calculate_f(inputs): #inputs will be a list with x and x_pk
    x, x_pk = inputs
    #since f will work elementwise, let's replicate x and x_pk so they have equal shapes
    #please explain f for better optimization
    # x from (batch, n) to (batch, m, n)
    x = K.stack([x]*m, axis=1)
    # x_pk from (m, n) to (batch, m, n)
    x_pk = K.stack([x_pk]*batch, axis=0)
    #a batch size of 1 could make this even simpler
    #a variable batch size would make this more complicated
    #certain f functions could make this process unnecessary
    return f(x, x_pk)
Now, different from a Dense layer, this formula is using the a_pk weights multiplied elementwise. So we need a custom layer:
class ElementwiseWeights(Layer):
    def __init__(self, **kwargs):
        super(ElementwiseWeights, self).__init__(**kwargs)
    def build(self, input_shape):
        weight_shape = (1,) + input_shape[1:] #shape (1, m, n)
        self.kernel = self.add_weight(name='kernel',
                                      shape=weight_shape,
                                      initializer='uniform',
                                      trainable=True)
        super(ElementwiseWeights, self).build(input_shape)
    def compute_output_shape(self,input_shape):
        return input_shape
    def call(self, inputs):
        return self.kernel * inputs
Now let's build our functional API model:
#x_pk model tensor input
x_pk = Input(tensor=x_pk_tensor) #shape (m, n)
#x usual input with fixed batch size
x = Input(batch_shape=(batch,n)) #shape (batch, n)
#calculate F
out = Lambda(calculate_f)([x, x_pk]) #shape (batch, m, n)
#multiply a_pk
out = ElementwiseWeights()(out) #shape (batch, m, n)
#sum n elements, keep m rows:
out = Lambda(lambda x: K.sum(x, axis=-1))(out) #shape (batch, m)
#softmax
out = Activation('softmax')(out) #shape (batch,m)
Continue this model with whatever you want and finish it:
model = Model([x, x_pk], out)
model.compile(.....)
model.fit(x_train, y_train, ....) #perhaps you might need .fit([x_train], ytrain,...)
Edit for function f
You can have the proposed f like this:
#create the n coefficients:
coefficients = np.array([c0, c1, .... , cn])
coefficients = coefficients.reshape((1,1,n))
def f(x, x_pk):
    c = K.variable(coefficients) #shape (1, 1, n)
    out = (x - x_pk) / c
    return K.exp(out)
This f would accept x with shape (batch, 1, n), without the stack used in the calculate_f function.
Or could accept x_pk with shape (1, m, n), allowing variable batch size.
But I'm not sure it's possible to have both of these shapes together. Testing this might be interesting.
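The shape question can at least be tested in NumPy, whose broadcasting rules combine (batch, 1, n) with (1, m, n) into (batch, m, n). The sketch below runs the whole pipeline (f, elementwise weights, sum, softmax) on toy data; all values are random stand-ins, a_pk is a fixed array rather than a trained weight, and whether the Keras backend behaves identically is an assumption not verified here.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, m, n = 4, 3, 5

x = rng.normal(size=(batch, n))
x_pk = rng.normal(size=(m, n))
c = np.ones((1, 1, n))              # toy coefficients
a_pk = rng.normal(size=(1, m, n))   # stand-in for the trainable weights

def f(x, x_pk):
    return np.exp((x - x_pk) / c)

# Broadcasting: (batch, 1, n) with (1, m, n) gives (batch, m, n)
out = f(x[:, None, :], x_pk[None, :, :])
# Same result as explicitly stacking both operands first
stacked = f(np.stack([x] * m, axis=1), np.stack([x_pk] * batch, axis=0))

scores = (a_pk * out).sum(axis=-1)  # (batch, m)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)
print(out.shape, probs.shape)
```

If the backend broadcasts the same way, the explicit stacking and the fixed batch size would both become unnecessary.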
