State persistence in shared LSTM layers in Keras - python

I am trying to use a shared LSTM layer with state in a Keras model, but it seems that the internal state is modified by each parallel use. This raises two questions:
When training a model with a shared LSTM layer and using stateful=True, are the parallel uses updating the same state also during training?
If my observation is valid, is there a way to use weight-sharing LSTMs such that the state is stored independently for each of the parallel uses?
The code below exemplifies the problem with three sequences sharing the LSTM. The prediction of a full input is compared with the result from splitting the prediction input into two halves and feeding them into the network consecutively.
What can be observed is that a1 is the same as the first half of aFull, meaning that the uses of the LSTM really are in parallel with independent states during the first prediction, i.e., z1 is not affected by the parallel calls creating z2 and z3. But a2 is different from the second half of aFull, so there is some interaction between the states of the parallel uses.
What I was hoping is that the concatenation of the two pieces a1 and a2 would be the same as the result from calling the prediction with a longer input sequence, but this doesn't seem to be the case. A further concern is whether this kind of interaction, if it takes place in prediction, also happens during training.
import keras
import keras.backend as K
import numpy as np
nOut = 3
xShape = (3, 50, 4)
inShape = (xShape[0], None, xShape[2])
batchInShape = (1, ) + inShape
x = np.random.randn(*xShape)
# construct network
xIn = keras.layers.Input(shape=inShape, batch_shape=batchInShape)
# shared LSTM layer
sharedLSTM = keras.layers.LSTM(units=nOut, stateful=True, return_sequences=True, return_state=False)
# split the input on the first axis
x1 = keras.layers.Lambda(lambda x: x[:,0,:,:])(xIn)
x2 = keras.layers.Lambda(lambda x: x[:,1,:,:])(xIn)
x3 = keras.layers.Lambda(lambda x: x[:,2,:,:])(xIn)
# pass each input through the LSTM
z1 = sharedLSTM(x1)
z2 = sharedLSTM(x2)
z3 = sharedLSTM(x3)
# add a singleton dimension
y1 = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(z1)
y2 = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(z2)
y3 = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(z3)
# combine the outputs
y = keras.layers.Concatenate(axis=1)([y1, y2, y3])
model = keras.models.Model(inputs=xIn, outputs=y)
model.compile(loss='mse', optimizer='adam')
# no need to train, since we're interested only what is happening mechanically
# reset to a known state and predict for full input
model.reset_states()
aFull = model.predict(x[np.newaxis,:,:,:])
# reset to a known state and predict for the same input, but in two pieces
model.reset_states()
a1 = model.predict(x[np.newaxis,:,:xShape[1]//2,:])
a2 = model.predict(x[np.newaxis,:,xShape[1]//2:,:])
# combine the pieces
aSplit = np.concatenate((a1, a2), axis=2)
print('full diff: {}, first half diff: {}, second half diff: {}'.format(str(np.sum(np.abs(aFull - aSplit))), str(np.sum(np.abs(aFull[:,:,:xShape[1]//2,:] - aSplit[:,:,:xShape[1]//2,:]))), str(np.sum(np.abs(aFull[:,:,xShape[1]//2:,:] - aSplit[:,:,xShape[1]//2:,:])))))
Update: The behaviour described above was observed with Keras using Tensorflow 1.14 and 1.15 as the backend. Running the same code with tf2.0 (with the adjusted imports) changes the result so that a1 is no longer the same as the first half of aFull. This can be still accomplished by setting stateful=False in the layer instantiation.
This suggests to me that the way I'm trying to use the recurrent layer, with shared parameters but separate states for the parallel uses, is not really possible like this.
Update 2: It seems that others have missed the same functionality earlier: there is a closed, unanswered question about it on Keras' GitHub.
For a comparison, here is a scribbling in pytorch (the first time I've tried to use it) implementing a simple network with N parallel LSTMs sharing the weights, but having independent states. In this case the states are stored explicitly in a list and provided to the LSTM cell manually.
import torch
import numpy as np

class sharedLSTM(torch.nn.Module):
    def __init__(self, batchSz, nBands, nDims, outDim):
        super(sharedLSTM, self).__init__()
        self.internalLSTM = torch.nn.LSTM(input_size=nDims, hidden_size=outDim, num_layers=1, bias=True, batch_first=True)
        # one (h, c) state pair per parallel band
        allStates = list()
        for bandIdx in range(nBands):
            h_0 = torch.zeros(1, batchSz, outDim)
            c_0 = torch.zeros(1, batchSz, outDim)
            allStates.append((h_0, c_0))
        self.allStates = allStates
        self.nBands = nBands

    def forward(self, x):
        allOut = list()
        for dimIdx in range(self.nBands):
            thisSlice = x[:,dimIdx,:,:] # (batchSz, nSteps, nFeats)
            thisState = self.allStates[dimIdx]
            thisY, thisState = self.internalLSTM(thisSlice, thisState)
            self.allStates[dimIdx] = thisState
            allOut.append(thisY[:,None,:,:]) # => (batchSz, 1, nSteps, outDim)
        y = torch.cat(allOut, dim=1) # => (batchSz, nBands, nSteps, outDim)
        return y

    def resetStates(self):
        for bandIdx in range(self.nBands):
            self.allStates[bandIdx][0][:] = 0.0
            self.allStates[bandIdx][1][:] = 0.0

batchSz = 5
nBands = 3
nFeats = 4
nOutDims = 2
net = sharedLSTM(batchSz, nBands, nFeats, nOutDims)
net = net.float()
N = 20
x = torch.from_numpy(np.random.rand(batchSz, nBands, N, nFeats)).float()
x1 = x[:, :, :N//2, :]
x2 = x[:, :, N//2:, :]
aa = net.forward(x)
# reset the stored states before predicting the same input in two pieces
net.resetStates()
a1 = net.forward(x1)
a2 = net.forward(x2)
print('(with reset) first half abs diff: {}'.format(str(torch.sum(torch.abs(a1 - aa[:,:,:N//2,:])).detach().numpy())))
print('(with reset) second half abs diff: {}'.format(str(torch.sum(torch.abs(a2 - aa[:,:,N//2:,:])).detach().numpy())))
Result: the output is the same regardless of whether we do the prediction in one go or in pieces.
I've tried to replicate this in Keras using sub-classing, but without success:
import keras
import numpy as np
class sharedLSTM(keras.Model):
def __init__(self, batchSz, nBands, nDims, outDim):
super(sharedLSTM, self).__init__()
self.internalLSTM = keras.layers.LSTM(units=outDim, stateful=True, return_sequences=True, return_state=True), None, nDims))
allStates = list()
allSlicers = list()
for bandIdx in range(nBands):
allSlicers.append(keras.layers.Lambda(lambda x, b: x[:, :, b, :], arguments = {'b' : bandIdx}))
self.allStates = allStates
self.allSlicers = allSlicers
self.Concat = keras.layers.Lambda(lambda x: keras.backend.concatenate(x, axis=2))
self.nBands = nBands
def call(self, x):
allOut = list()
for bandIdx in range(self.nBands):
thisSlice = self.allSlicers[bandIdx]( x )
thisState = self.allStates[bandIdx]
thisY, *thisState = self.internalLSTM(thisSlice, initial_state=thisState)
self.allStates[bandIdx] = thisState.copy()
y = self.Concat( allOut )
return y
batchSz = 1
nBands = 3
nFeats = 4
nOutDims = 2
N = 20
model = sharedLSTM(batchSz, nBands, nFeats, nOutDims)
model.compile(optimizer='SGD', loss='mae')
x = np.random.rand(batchSz, N, nBands, nFeats)
x1 = x[:, :N//2, :, :]
x2 = x[:, N//2:, :, :]
aa = model.predict(x)
a1 = model.predict(x1)
a2 = model.predict(x2)
print('(with reset) first half abs diff: {}'.format(str(np.sum(np.abs(a1 - aa[:,:N//2,:,:])))))
print('(with reset) second half abs diff: {}'.format(str(np.sum(np.abs(a2 - aa[:,N//2:,:,:])))))
If you now ask "why don't you then use torch and shut up?", the answer is that the surrounding experimental framework has been built assuming Keras and changing it would be a non-negligible amount of work.

My current understanding of the behaviour of LSTMs (and other RNNs) in Keras is that a shared LSTM layer with stateful=True does not work as one would expect: there is only one state variable, and it gets updated through all the parallel uses. So the answers to the two questions appear to be:
To the first question: yes, they are. The processing runs over one of the many parallel sequences, stores the state at the end, and uses this as the initial state for the second parallel sequence, and so forth.
To the second question: yes, but it requires some work. See below for the details.
I've managed to handle the states in two ways. The first is deriving sub-classes from Keras' LSTM and LSTMCell, and overriding their methods to handle the parallel data streams by splitting the input, and storing and recovering the state of each parallel stream. A drawback here is that the input shape to an RNN is fixed to be 3D, which means that the parallel inputs need to be reshaped into the feature dimension along with the real features.
The second approach is to create a wrapper Layer not completely dissimilar to the sharedLSTM-Model in the question, containing slicing of the input to parallel streams, calling the internal LSTM with the correct state for each stream, and storing the returned states. The state storage update in the list works through add_update() call inserted into the end of call(). This add_update() does not (seem to) work with Model, hence Layer. However, when run with Keras <2.3, the weights of the nested layers are not tracked or updated, so Keras 2.3+ or TF2 is needed.
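To make the idea concrete without depending on a particular Keras version, here is a minimal NumPy sketch of the bookkeeping: one set of LSTM weights, and a list holding one (h, c) pair per parallel stream. It is not the Keras wrapper Layer described above; the class name, the gate ordering and the weight initialisation are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SharedWeightLSTM:
    """One set of LSTM weights, one (h, c) state pair per parallel stream."""

    def __init__(self, n_streams, n_feats, n_units, seed=0):
        rng = np.random.default_rng(seed)
        # Shared parameters for the four gates (input, forget, candidate, output).
        self.W = rng.normal(scale=0.5, size=(n_feats, 4 * n_units))
        self.U = rng.normal(scale=0.5, size=(n_units, 4 * n_units))
        self.b = np.zeros(4 * n_units)
        self.n_streams, self.n_units = n_streams, n_units
        self.reset_states()

    def reset_states(self):
        # Independent zero state for every stream.
        self.states = [(np.zeros(self.n_units), np.zeros(self.n_units))
                       for _ in range(self.n_streams)]

    def _step(self, x_t, h, c):
        i, f, g, o = np.split(x_t @ self.W + h @ self.U + self.b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def __call__(self, x):
        """x: (n_streams, n_steps, n_feats) -> (n_streams, n_steps, n_units)."""
        out = np.zeros(x.shape[:2] + (self.n_units,))
        for s in range(self.n_streams):
            h, c = self.states[s]      # recover this stream's own state
            for t in range(x.shape[1]):
                h, c = self._step(x[s, t], h, c)
                out[s, t] = h
            self.states[s] = (h, c)    # store it back for the next call
        return out

net = SharedWeightLSTM(n_streams=3, n_feats=4, n_units=2)
x = np.random.default_rng(1).normal(size=(3, 20, 4))

full = net(x)
net.reset_states()
split = np.concatenate([net(x[:, :10]), net(x[:, 10:])], axis=1)
print(np.abs(full - split).max())
```

Because each stream reads and writes only its own entry in the list, predicting in two pieces gives the same result as predicting in one go, which is the behaviour the question was after.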


How to use Grid Search when there are multiple inputs, including a matrix?

I am trying to tune a neural network model, but since there are multiple inputs with different shapes, it seems impossible to stack them together.
There are two types of inputs when I make the model.
one of them is:
cats = [train.time_signature,train.key,train.mode_cat,train.loudness_cat,train.tempo_cat,train.duration_cat]
Each of them has shape (7678,)
another input is:
The shape of it is (7678,30), which is a matrix
When fitting the Keras model, it's okay to just pass them together as the training set:
model.fit([cats,num_train],train.genre_cat,batch_size=50,epochs=1000,verbose=1,validation_split=.1)
However, when I use GridSearch, it doesn't allow me to pass the same input as I did for the model.
grid_result = grid.fit([cats,num_train],train.genre_cat)
It raises ValueError: Found input variables with inconsistent numbers of samples: [2, 7678], which I take to mean that num_train is not accepted because of the 30 values it has along index 1.
Is there any way to deal with this problem? Thanks.
Grid Search for Keras with multiple inputs
I think that the trick is always the same: make one single input which is the concatenation of the multiple inputs.
In this particular case, we have N inputs of dim (n_sample,1) and one input of dim (n_sample,30). The new concatenated input will have dimensions (n_sample, n_stacked_columns).
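We then separate the columns again inside the model:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Lambda, Dense, Concatenate
from tensorflow.keras.models import Model
n_sample = 100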
x1 = np.random.uniform(0,1, (n_sample,1)) # (n_sample,1)
x2 = np.random.uniform(0,1, (n_sample,1)) # (n_sample,1)
x3 = np.random.uniform(0,1, (n_sample,1)) # (n_sample,1)
x = np.random.uniform(0,1, (n_sample,30)) # (n_sample,30)
X = np.column_stack([x1,x2,x3,x]) # (n_sample,1+1+1+30)
y = np.random.uniform(0,1, n_sample)
inp = Input((X.shape[-1],))
inp1 = Lambda(lambda x: tf.expand_dims(x[:,0],-1))(inp) # (None,1)
inp2 = Lambda(lambda x: tf.expand_dims(x[:,1],-1))(inp) # (None,1)
inp3 = Lambda(lambda x: tf.expand_dims(x[:,2],-1))(inp) # (None,1)
inp_matrix = Lambda(lambda x: x[:,3:])(inp) # (None,30)
d1 = Dense(8)(inp1)
d2 = Dense(8)(inp2)
d3 = Dense(8)(inp3)
d_matrix = Dense(8)(inp_matrix)
concat = Concatenate()([d1,d2,d3,d_matrix])
out = Dense(1)(concat)
model = Model(inp, out)
model.compile('adam','mse')
model.fit(X, y, epochs=3)
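For the shapes in the question (six arrays of shape (7678,) plus one matrix of shape (7678, 30)), the stacking step might look like the sketch below. The names cats and num_train come from the question, but the data here are random stand-ins. It also shows why the nested list is reported as having 2 samples.

```python
import numpy as np

n_samples = 7678
rng = np.random.default_rng(0)

# Stand-ins for the six categorical columns, each of shape (n_samples,).
cats = [rng.integers(0, 5, size=n_samples) for _ in range(6)]
# Stand-in for the numeric matrix of shape (n_samples, 30).
num_train = rng.normal(size=(n_samples, 30))

# The nested list has length 2, which is what gets counted as "2 samples".
print(len([cats, num_train]))   # 2

# Stacking column-wise gives one array whose first axis is the sample axis.
X = np.column_stack(cats + [num_train])
print(X.shape)                  # (7678, 36)

# The pieces can be recovered by slicing columns, as the Lambda layers do.
cats_back = [X[:, i] for i in range(6)]
num_back = X[:, 6:]
```

Note that column_stack casts the integer category codes to float, so they would need to be cast back inside the model if a layer such as Embedding expects integers.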

Solved: How to combine tf.gradients with tf.data.Dataset and Keras models

I'm trying to build a workflow that uses batches and an iterator. For performance reasons, I am really trying to avoid using the placeholder->feed_dict loop workflow.
The process I'm trying to implement involves grad-cam (which requires the gradient of the loss with respect to the final convolutional layer of a CNN) as an intermediate step, and ideally I'd like to be able to try it out on several Keras pre-trained models, including non-sequential ones like ResNet.
Most implementations of Grad-CAM that I've found rely on hand-crafting the CNN of interest in TensorFlow. I found one implementation that is made for Keras models, and following that example, I get:
def safe_norm(x):
    return x / tf.sqrt(tf.reduce_mean(x ** 2) + 1e-8)
vgg_ = VGG19()
dataset =
it = dataset.make_one_shot_iterator()
files, batch = it.get_next()
conv5_4 = vgg_.layers[-6]
h_k, w_k, c_k = conv5_4.output.shape[1:]
vgg_model = Model(inputs=vgg_.input, outputs=vgg_.output)
conv_model = Model(inputs=vgg_.input, outputs=conv5_4.output)
probs = vgg_model(batch)
predicted_class = tf.argmax(probs, axis=-1)
layer_name = 'block5_conv4'
target_layer = lambda x: target_category_loss(x, predicted_class, n_categories)
x = Lambda(target_layer)(vgg_model.outputs[0])
model = Model(inputs=vgg_model.inputs[0], outputs=x)
loss = K.sum(model.output, axis=-1)
conv_output = [l for l in model.layers if l.name == layer_name][0].output
grads = Lambda(safe_norm)(K.gradients(loss, [conv_output])[0])
gradient_function = K.function([model.input], [conv_output, grads])
output, grads_val = gradient_function([batch])
weights = tf.reduce_mean(grads_val, axis = (1, 2))
cam = tf.ones([batch_size, h_k, w_k], dtype = tf.float32)
cam += tf.reduce_sum(output * tf.reshape(weights, [-1, 1, 1, weights.shape[-1]]), axis=-1)
cam = tf.squeeze(tf.image.resize_images(images=tf.expand_dims(cam, axis=-1), size=(224, 224)))
cam = tf.maximum(cam, 0)
heatmap = cam / tf.reshape(tf.reduce_max(cam, axis=[1, 2]), shape=[-1, 1, 1])
The problem is that gradient_function([batch]) returns a numpy array whose value is determined by the first batch, so that heatmap doesn't change with subsequent evaluations.
I've tried replacing K.function with a Model in various ways, but nothing seems to work. I usually end up either with an error suggesting that grads evaluates to None or that one model or another is expecting a feed_dict and not receiving one.
Is this code salvageable? Is there a better way to do this besides looping through the data several times (once to get all the grad-cams and then again once I have them) or using placeholders and feed_dicts?
def safe_norm(x):
    return x / tf.sqrt(tf.reduce_mean(x ** 2) + 1e-8)
vgg_ = VGG19()
dataset =
it = dataset.make_one_shot_iterator()
files, batch = it.get_next()
conv5_4 = vgg_.layers[-6]
h_k, w_k, c_k = conv5_4.output.shape[1:]
vgg_model = Model(inputs=vgg_.input, outputs=vgg_.output)
conv_model = Model(inputs=vgg_.input, outputs=conv5_4.output)
probs = vgg_model(batch)
predicted_class = tf.argmax(probs, axis=-1)
layer_name = 'block5_conv4'
target_layer = lambda x: target_category_loss(x, predicted_class, n_categories)
x = Lambda(target_layer)(vgg_model.outputs[0])
model = Model(inputs=vgg_model.inputs[0], outputs=x)
loss = K.sum(model.output, axis=-1)
conv_output = [l for l in model.layers if l.name == layer_name][0].output
grads = Lambda(safe_norm)(K.gradients(loss, [conv_output])[0])
gradient_function = K.function([model.input], [conv_output, grads])
output, grads_val = gradient_function([batch])
weights = tf.reduce_mean(grads_val, axis = (1, 2))
cam = tf.ones([batch_size, h_k, w_k], dtype = tf.float32)
cam += tf.reduce_sum(output * tf.reshape(weights, [-1, 1, 1, weights.shape[-1]]), axis=-1)
cam = tf.squeeze(tf.image.resize_images(images=tf.expand_dims(cam, axis=-1), size=(224, 224)))
cam = tf.maximum(cam, 0)
heatmap = cam / tf.reshape(tf.reduce_max(cam, axis=[1, 2]), shape=[-1, 1, 1])
# other operations on heatmap and batch ...
# ...
output_function = K.function(model.input, [node1, ..., nodeN])
for batch in range(n_batches):
    outputs1, ... , outputsN = output_function(batch)
Gives me the desired outputs for each batch.
Yes, K.function returns numpy arrays because it evaluates the symbolic computation in your graph. What I think you should do is to keep everything symbolic up to K.function, and after getting the gradients, perform all computations of the Grad-CAM weights and final saliency map using numpy.
Then you can iterate on your dataset, evaluate gradient_function on a new batch of data, and compute the saliency map.
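As a rough sketch of that NumPy post-processing step, assuming conv_output and grads_val are the two arrays returned by gradient_function for one batch (random stand-ins are used here). The function name is hypothetical, and the resize to 224x224 from the question is left out:

```python
import numpy as np

def grad_cam_numpy(conv_output, grads_val, eps=1e-8):
    """conv_output, grads_val: (batch, h, w, channels) -> heatmaps (batch, h, w)."""
    # Channel weights: gradients averaged over the spatial axes.
    weights = grads_val.mean(axis=(1, 2))                  # (batch, channels)
    # Weighted sum of the feature maps over the channel axis.
    cam = (conv_output * weights[:, None, None, :]).sum(axis=-1)
    cam = np.maximum(cam, 0)                               # keep positive evidence only
    # Scale each heatmap to [0, 1]; eps guards against an all-zero map.
    return cam / (cam.max(axis=(1, 2), keepdims=True) + eps)

# Random stand-ins for one batch of K.function outputs.
rng = np.random.default_rng(0)
conv_output = rng.random((2, 14, 14, 512))
grads_val = rng.normal(size=(2, 14, 14, 512))

heatmap = grad_cam_numpy(conv_output, grads_val)
print(heatmap.shape)
```

Calling this once per batch inside the loop over the dataset keeps the symbolic part limited to what K.function evaluates.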
If you want to keep everything symbolic, then you should not use K.function to produce the gradient function, but use the symbolic gradient (the output of K.gradient, without lambda) and convolutional feature maps (conv_output) and perform the saliency map computation on top of that, and then build a function (using K.function) that takes the model input, and outputs the saliency map.
Hope the explanation is enough.

Iterate over a tensor dimension in Tensorflow

I am trying to develop a seq2seq model from a low-level perspective (creating all the needed tensors myself). I am trying to feed the model a sequence of vectors as a two-dimensional tensor; however, I can't iterate over one dimension of the tensor to extract the vectors one by one. Does anyone know how I could feed a batch of vectors and later get them one by one?
This is my code:
batch_size = 100
hidden_dim = 5
input_dim = embedding_dim
time_size = 5
input_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim,None], name='input')
output_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim,None], name='output')
input_array = np.asarray(input_sentence)
output_array = np.asarray(output_sentence)
gru_layer1 = GRU(input_array, input_dim, hidden_dim) #This is a class created by myself
for i in range(input_array.shape[-1]):
    word = input_array[:,i]
    previous_state = gru_encoder.h_t
And this is the error that I get
TypeError: Expected binary or unicode string, got <tf.Tensor 'input_7:0' shape=(10, ?) dtype=float64>
Tensorflow does deferred execution.
You usually can't know how big the vector will be (words in a sentence, audio samples, etc.). The common thing to do is to cap it at some reasonably large value and then pad the shorter sequences with an empty token.
Once you do this you can select the data for a time slice with the slice operator:
data = tf.placeholder(tf.float64, shape=(batch_size, max_size, number_of_inputs))
for i in range(max_size):
    time_data = data[:, i, :]
Also look up tf.transpose for swapping the batch and time indices. It can help with performance in certain cases.
Alternatively consider something like tf.nn.static_rnn or tf.nn.dynamic_rnn to do the boilerplate stuff for you.
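The padding step mentioned above can be done in NumPy before feeding the placeholder. This is only a sketch: the helper name pad_sequences, the zero pad value and the toy sequences are assumptions for illustration.

```python
import numpy as np

def pad_sequences(seqs, max_size, pad_value=0.0):
    """Pad or truncate (steps, n_inputs) arrays into (batch, max_size, n_inputs)."""
    n_inputs = seqs[0].shape[1]
    batch = np.full((len(seqs), max_size, n_inputs), pad_value, dtype=np.float32)
    lengths = np.zeros(len(seqs), dtype=np.int32)
    for i, seq in enumerate(seqs):
        n = min(len(seq), max_size)   # truncate sequences longer than the cap
        batch[i, :n] = seq[:n]
        lengths[i] = n
    return batch, lengths

# Three toy sequences with 3, 5 and 1 time steps of 2 inputs each.
seqs = [np.ones((3, 2)), np.ones((5, 2)), np.ones((1, 2))]
batch, lengths = pad_sequences(seqs, max_size=4)

print(batch.shape, lengths)
time_slice = batch[:, 0, :]           # same slicing as data[:, i, :] with i = 0
```

The returned lengths are the kind of per-example values that tf.nn.dynamic_rnn accepts through its sequence_length argument, so that the padded steps can be skipped.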
Finally I found an approach that solves my problem. It works by using tf.scan() instead of a loop, which doesn't require the input tensor to have a defined size in the second dimension. Consequently, you have to prepare the input tensor beforehand so that tf.scan() parses it the way you want. In my case this is the code:
batch_size = 100
hidden_dim = 5
input_dim = embedding_dim
time_size = 5
input_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim,None], name='input')
output_sentence = tf.placeholder(dtype=tf.float64, shape=[embedding_dim,None], name='output')
# use the placeholder directly: wrapping it in np.asarray() does not give a tensor
x_t = tf.transpose(input_sentence, [1, 0], name='x_t')
h_0 = tf.convert_to_tensor(h_0, dtype=tf.float64)
h_t_transposed = tf.scan(forward_pass, x_t, h_0, name='h_t_transposed')
h_t = tf.transpose(h_t_transposed, [1, 0], name='h_t')
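For readers unfamiliar with tf.scan, the following NumPy sketch mimics what it does in the code above: it walks along the first axis of its input, carries the previous result forward, and stacks every intermediate result. The scan helper and the tanh recurrence are stand-ins written for illustration; they are neither the TensorFlow implementation nor the GRU from the question.

```python
import numpy as np

def scan(fn, elems, initializer):
    """Apply fn(previous_result, element) along axis 0 and stack every result."""
    acc, results = initializer, []
    for elem in elems:
        acc = fn(acc, elem)
        results.append(acc)
    return np.stack(results)

embedding_dim, hidden_dim, n_steps = 10, 5, 7
rng = np.random.default_rng(0)
W = rng.normal(size=(embedding_dim, hidden_dim))
U = rng.normal(size=(hidden_dim, hidden_dim))

def forward_pass(h_prev, x_t):
    # Stand-in recurrence: a plain tanh RNN step.
    return np.tanh(x_t @ W + h_prev @ U)

input_sentence = rng.normal(size=(embedding_dim, n_steps))
x_t = input_sentence.T                           # time on the first axis
h_t_transposed = scan(forward_pass, x_t, np.zeros(hidden_dim))
h_t = h_t_transposed.T                           # back to (hidden_dim, n_steps)
print(h_t.shape)
```

This is why the input is transposed first: the scan iterates over the leading axis, so time has to come first.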

Tensorflow 170 times slower than Theano for RNN implementation

I am trying to implement a RNN in Tensorflow (0.11), based on this paper.
They have a Theano implementation here, that I am comparing my implementation to. When I try to run their Theano implementation, it finishes 10 epochs in about 1 hour. My Tensorflow implementation needs about 17 hours just to finish 1 epoch. I am wondering if anyone could look at my code and tell me if there are some obvious problems that are slowing it down.
The purpose of the RNN is to predict the next item a user is going to click on, given his previous clicks. The items are represented by unique IDs that are given as input to the RNN as a 1-HOT vector.
So the RNN is built like this:
[INPUT (1-HOT representation, size 37803)] -> [GRU layer (size 100)] -> [FeedForward layer]
and the output from the FF layer is a vector with the same size as the input vector, where high values indicate that the item corresponding to that index is very likely to be the next one clicked.
num_hidden = 100
x = tf.placeholder(tf.float32, [None, max_length, n_items], name="InputX")
y = tf.placeholder(tf.float32, [None, max_length, n_items], name="TargetY")
session_length = tf.placeholder(tf.int32, [None], name="SeqLenOfInput")
output, state = rnn.dynamic_rnn(
layer = {'weights':tf.Variable(tf.random_normal([num_hidden, n_items])),
output = tf.reshape(output, [-1, num_hidden])
prediction = tf.matmul(output, layer['weights'])
y_flat = tf.reshape(y, [-1, n_items])
final_output = tf.nn.softmax_cross_entropy_with_logits(prediction,y_flat)
cost = tf.reduce_sum(final_output)
optimizer = tf.train.AdamOptimizer().minimize(cost)
Both implementations are tested on the same hardware. Both implementations utilize the GPU.
The Theano model has the same structure. (1-HOT input -> GRU layer with 100 units -> FeedForward)
I tested the Theano version with the same parameters as I used in my model (using cross entropy for the loss, batch size=200, adam optimizer, with the same learning rate, no dropout in either model) but the speed difference is still the same.
EDIT (2016-12-07):
Using file queues to queue batches instead of using feed_dict helped a lot.
I still need to do other optimizations to make it faster. Anyway, here is how I used file queues to make it faster.
# Create filename_queue
filename_queue = tf.train.string_input_producer(train_files, shuffle=True)
min_after_dequeue = 1024
capacity = min_after_dequeue + 3*batch_size
examples_queue = tf.RandomShuffleQueue(
# Create multiple readers to populate the queue of examples
enqueue_ops = []
for i in range(n_readers):
reader = tf.TextLineReader()
_key, value =
tf.train.queue_runner.QueueRunner(examples_queue, enqueue_ops))
example_string = examples_queue.dequeue()
# Default values, and type of the columns, first is sequence_length
# +1 since first field is sequence length
record_defaults = [[0]]*(max_sequence_length+1)
enqueue_examples = []
for thread_id in range(n_preprocess_threads):
example = tf.decode_csv(value, record_defaults=record_defaults)
# Split the row into input/target values
sequence_length = example[0]
features = example[1:-1]
targets = example[2:]
enqueue_examples.append([sequence_length, features, targets])
# Batch together examples
session_length, x_unparsed, y_unparsed = tf.train.batch_join(
# Parse the examples in a batch
x = tf.one_hot(x_unparsed, depth=n_classes)
y = tf.one_hot(y_unparsed, depth=n_classes)
# From here on, x, y and session_length can be used in the model
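One plausible reason the change helps, sketched as back-of-the-envelope arithmetic: with the placeholders from the question, every batch is fed as dense 1-HOT float32 tensors, whereas the queue version reads integer IDs and applies tf.one_hot inside the graph. The value of max_length below is hypothetical, since the question does not state it.

```python
n_items = 37803        # size of the 1-HOT vector (from the question)
batch_size = 200       # batch size (from the question)
max_length = 20        # hypothetical; the question does not give this value

bytes_float32 = 4
bytes_int32 = 4

# x and y are both fed as dense tensors of shape (batch, max_length, n_items).
one_hot_bytes = 2 * batch_size * max_length * n_items * bytes_float32
# Feeding item IDs instead needs one int32 per time step.
id_bytes = 2 * batch_size * max_length * bytes_int32

print("1-HOT batch: {:.1f} MB".format(one_hot_bytes / 1e6))
print("ID batch:    {:.1f} kB".format(id_bytes / 1e3))
```

Under these assumptions the dense batch is n_items times larger than the ID batch, whatever max_length is.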

Deep network produces zero accuracy

I am trying to build a deep network using Theano. However, the accuracy is zero and I cannot figure out my mistake. I am trying to create a deep learning network with 3 hidden layers and one output layer. I am trying to do a classification task with 5 classes; therefore, the output layer has 5 nodes.
Any suggestion?
#!/usr/bin/env python
from __future__ import print_function
import theano
import theano.tensor as T
import lasagne
import numpy as np
import sklearn.datasets
import os
import csv
import pandas as pd
# Lasagne is pre-release, so its interface is changing.
# Whenever there's a backwards-incompatible change, a warning is raised.
# Let's ignore these for the course of the tutorial
import warnings
warnings.filterwarnings('ignore', module='lasagne')
from lasagne.objectives import categorical_crossentropy, aggregate
#load the data and prepare it
df = pd.read_excel('risk_sample_data_9.20.16_anon.xls',skiprows=0)
rawdata = df.values
# remove empty rows (odd rows)
mask = np.ones(len(rawdata), dtype=bool)
mask[::2] = False
data = rawdata[mask]
idx = np.array([1,5,6,7])
m = np.zeros_like(data)
m[:,idx] = 1
X = np.ma.masked_array(data,m)
X = np.ma.filled(X, fill_value=0)
X = X.astype(theano.config.floatX)
y = data[:,7] # extract financial rating labels
# convert char labels into int, A=1, B=2, C=3, D=4, F=5
y[y == 'A'] = 1
y[y == 'B'] = 2
y[y == 'C'] = 3
y[y == 'D'] = 4
y[y == 'F'] = 5
y = pd.to_numeric(y)
y = y.astype('int32')
#y = y.astype(theano.config.floatX)
# First, construct an input layer.
# The shape parameter defines the expected input shape,
# which is just the shape of our data matrix data.
l_in = lasagne.layers.InputLayer(shape=X.shape)
# We'll create a network with two dense layers:
# A tanh hidden layer and a softmax output layer.
l_hidden1 = lasagne.layers.DenseLayer(
# The first argument is the input layer
# This defines the layer's output dimensionality
# Various nonlinearities are available
l_hidden2 = lasagne.layers.DenseLayer(
# The first argument is the input layer
# This defines the layer's output dimensionality
# Various nonlinearities are available
l_hidden3 = lasagne.layers.DenseLayer(
# The first argument is the input layer
# This defines the layer's output dimensionality
# Various nonlinearities are available
l_hidden4 = lasagne.layers.DenseLayer(
# The first argument is the input layer
# This defines the layer's output dimensionality
# Various nonlinearities are available
# For our output layer, we'll use a dense layer with a softmax nonlinearity.
l_output = lasagne.layers.DenseLayer(
l_hidden4, num_units=N_CLASSES, nonlinearity=lasagne.nonlinearities.softmax)
net_output = lasagne.layers.get_output(l_output)
# As a loss function, we'll use Theano's categorical_crossentropy function.
# This allows for the network output to be class probabilities,
# but the target output to be class labels.
true_output = T.ivector('true_output')
# get_loss computes a Theano expression for the objective,
# given a target variable
# By default, it will use the network's InputLayer input_var,
# which is what we want.
#loss = objective.get_loss(target=true_output)
loss = lasagne.objectives.categorical_crossentropy(net_output, true_output)
loss = aggregate(loss, mode='mean')
# Retrieving all parameters of the network is done using get_all_params,
# which recursively collects the parameters of all layers
# connected to the provided layer.
all_params = lasagne.layers.get_all_params(l_output)
# Now, we'll generate updates using Lasagne's SGD function
updates = lasagne.updates.sgd(loss, all_params, learning_rate=1)
# Finally, we can compile Theano functions for training and
# computing the output.
# Note that because loss depends on the input variable of our input layer,
# we need to retrieve it and tell Theano to use it.
train = theano.function([l_in.input_var, true_output], loss, updates=updates)
get_output = theano.function([l_in.input_var], net_output)
def eq(x, y):
    if x==y:
        return 1
    return 0
print("Training ...")
# Train for 10 epochs
for n in xrange(10):
    train(X, y)
    y_predicted = np.argmax(get_output(X), axis=1)
    correct = reduce(lambda a, b: a+b, map(eq, y_predicted, y))
    print("Iteration {} correct prediction {}".format(n, correct))
# Compute the predicted label of the training data.
# The argmax converts the class probability output to class label
y_predicted = np.argmax(get_output(X), axis=1)
The learning rate seems way too high. Try a lower learning rate first. It might be that your model diverges on the task. Hard to tell without being able to try it on your data.
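To illustrate what a too-high learning rate does, here is a toy example on f(w) = w**2 rather than on your network. The function is chosen only because its behaviour is easy to follow by hand: each step multiplies w by (1 - 2 * learning_rate).

```python
def gradient_descent(learning_rate, w=1.0, n_steps=10):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    for _ in range(n_steps):
        w = w - learning_rate * 2.0 * w
    return w

for lr in (0.1, 1.0, 1.5):
    print(lr, gradient_descent(lr))
```

With 0.1 the value shrinks towards the minimum at 0, with 1.0 it only flips sign on every step and never gets closer, and with 1.5 it grows without bound. A real network is not a quadratic, so the threshold will differ, but the same kind of overshooting is what I suspect here.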
