Keras masking and padding - mask values still available in loss?

I am working on a binary classification problem using LSTM layers, where I classify each timestep as belonging to a class (0, 1). I have sequences of variable length, so I padded the extra steps at the end of each sequence and masked them with the value -1. My input has shape (300, 2000, 8): all 300 samples have 2000 timesteps and 8 features. If a sample originally had 1500 timesteps, I therefore add 500 extra steps at the end with the value -1 in each of the 8 features.
I added the padding to both my inputs (train_x) and my labels (train_y), so the shape of train_y is (300, 2000, 1). Both the input and label tensors are padded with -1 to signal the steps to ignore.
Now, I have some doubts that have been causing me headaches for days. From what I understood from here, whenever Keras 'sees' a timestep where all the feature values are -1, it will ignore it in the processing. However, when I access the tensors in my custom loss function, y_predictions and y_labels, the timesteps that have a value of -1 in y_labels also have a prediction given by the model (e.g. a value between 0 and 1) that is usually the same for all -1 timesteps. Am I doing something wrong?
Should I pad and mask only the feature vectors and keep the label vectors at their original size when passing them to the model?
I think I end up 'ignoring' the -1 timesteps by doing this at the start of the loss function, and then using only the indexes of the timesteps where y_true != -1 when doing the calculations and returning the loss value. Does that make sense?
pos_class = tf.where(y_true > 0)
neg_class = tf.where(y_true == 0)
... rest of calculations ...
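For illustration, here is a minimal sketch of how those index tensors might be used inside a custom loss. The per-class log-loss below is just a placeholder, not the actual CustomLossFun, and the function name is hypothetical:
import tensorflow as tf

def masked_class_loss(y_true, y_pred):  # hypothetical stand-in for CustomLossFun
    # indices of real (non-padded) timesteps, split by class; the -1 padding matches neither
    pos_class = tf.where(y_true > 0)
    neg_class = tf.where(tf.equal(y_true, 0))
    # gather only the predictions that belong to real timesteps
    pos_pred = tf.gather_nd(y_pred, pos_class)
    neg_pred = tf.gather_nd(y_pred, neg_class)
    # placeholder per-class log-loss; replace with the real calculations
    pos_loss = -tf.reduce_mean(tf.math.log(pos_pred + 1e-7))
    neg_loss = -tf.reduce_mean(tf.math.log(1.0 - neg_pred + 1e-7))
    return pos_loss + neg_loss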
The code for the model building part goes as follows:
# train_x -> (300,2000,8)
# train_y -> (300,2000,1)
# both already padded and masked with -1 for the extra steps
from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

input_layer = Input(shape=(2000, 8))
mask_1 = Masking(mask_value=-1)(input_layer)
lstm_1 = LSTM(64, return_sequences=True)(mask_1)
dense_1 = Dense(1, activation="sigmoid")(lstm_1)
model = Model(inputs=input_layer, outputs=dense_1)
model.summary()
optimizer = Adam(lr=0.001)
model.compile(optimizer=optimizer, loss=CustomLossFun, metrics=[CustomMcc(), CustomPrecision()])
train_model = model.fit(x = train_x, y = train_y, ...)

I got to understand the answer from this question.
I will still obtain a tensor the size of the entire sequence in the loss function. However, the masked values are there just as fillers and were not actually computed (which is probably why they are the same repeated value). In my loss function, I still need to ignore them, though.
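A minimal sketch of that final step, assuming the labels are padded with -1 and using plain binary cross-entropy as a stand-in for the real custom loss:
import tensorflow as tf

def masked_binary_crossentropy(y_true, y_pred):
    # keep only the timesteps whose label is not the -1 padding value
    valid = tf.not_equal(y_true, -1.0)
    y_true_valid = tf.boolean_mask(y_true, valid)
    y_pred_valid = tf.boolean_mask(y_pred, valid)
    # ordinary elementwise binary cross-entropy over the real timesteps only
    bce = tf.keras.backend.binary_crossentropy(y_true_valid, y_pred_valid)
    return tf.reduce_mean(bce)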

Related

GRU in TensorFlow 1.3

I want to implement a GRU for sentiment analysis and this is what I have so far:
epochs = 20
batch_size = 25
embedding_size = 50
layers = 1
max_label = 2 # only valid target labels are 0 and 1
# one word is fed in at any time instance
embedding_matrix = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embeddings = tf.nn.embedding_lookup(embedding_matrix, x)
# number of units in the GRU cell is equal to the embedding size
cell = tf.contrib.rnn.GRUCell(embedding_size)
cell = tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=0.75)
# encoding fed into softmax prediction layer
output, states = tf.nn.dynamic_rnn(cell, embeddings, dtype=tf.float32)
logits = tf.layers.dense(output, max_label, activation=None)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(cross_entropy)
# prediction is accurate if predicted label is equal to the actual label
prediction = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
But I get this error:
ValueError: Rank mismatch: Rank of labels (received 1) should equal rank of logits minus 1 (received 3).
I tried changing the value of max_label to 2 and 1 and tried to make logits=logits-1 but the error is still there. How can I fix it?
NB: this is written without knowing the size of your label vector y or which line the error message comes from. Please include this important information in future questions.
According to the TensorFlow 1.3 docs, the usage of nn.sparse_softmax_cross_entropy_with_logits is:
"... to have logits of shape [batch_size, num_classes] and labels of shape [batch_size]. But higher dimensions are supported."
From my experience with using RNNs in TensorFlow, the size convention of your model will be [Batch, Length of Sequence, Feature Size], so your output will be 3-dimensional. According to the quote attached above, your labels should then be of shape [Batch, Length of Sequence], with each value being the index of your label (so if position (X, Y) is class 30, then, in your code, y[X, Y] = 30). This is different (from memory) from the non-sparse version, which takes in one-hot (or multiple-hot) encodings.
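If there is only a single sentiment label per review (an assumption here, since the shape of y is not shown), one way to satisfy that rank requirement is to build the logits from the last RNN output only. This sketch reuses output, max_label and y from the question's snippet:
# keep the RNN output of the final timestep only: shape [batch_size, embedding_size]
last_output = output[:, -1, :]
# logits are now rank 2: [batch_size, max_label]
logits = tf.layers.dense(last_output, max_label, activation=None)
# labels y of shape [batch_size] (rank 1) now satisfy rank(labels) == rank(logits) - 1
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(cross_entropy)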

What activation function on the last layer and loss function should I use in an auto encoder for reconstructing a sequence of events? [Keras]

My data set is a 3D array of the size (M,t,N) where M is the number of samples, t is the number of timesteps in a sequence and N is the number of possible events that can happen at time t. By selecting a specific M we have a 2D array of size (t,N) where each row is a timestep and each column is an event. Each column is set to 1 if that event happened at time t, otherwise it's set to 0. Only 1 event can happen at any given timestep.
I want to try and build an auto-encoder for anomaly detection, and in the tutorials and blogs I have read, the last activation layer is 'relu' and the loss function is 'mse'. But since I am trying to basically reconstruct a classification with N classes, would 'softmax' as the last layer and 'categorical_crossentropy' be better?
inputs = Input(shape = (timesteps,n_features))
# Encoder
lstm_enc_1 = LSTM(32, activation='relu', input_shape=(timesteps, n_features), return_sequences=True)(inputs)
lstm_enc_2 = LSTM(latent_dim, activation='relu', return_sequences=False)(lstm_enc_1)
repeater = RepeatVector(timesteps)
# Decoder
lstm_dec_1 = LSTM(latent_dim, activation='relu', return_sequences=True)
lstm_dec_2 = LSTM(32, activation='relu', return_sequences=True)
time_dis = TimeDistributed(Dense(n_features,activation='softmax')) #<-- Does this make sense here?
z = repeater(lstm_enc_2)
h = lstm_dec_1(z)
decoded_h = lstm_dec_2(h)
decoded = time_dis(decoded_h)
ae = Model(inputs,decoded)
ae.compile(loss='categorical_crossentropy', optimizer='adam') #<-- Does this make sense here?
Or should I, for some reason, still use 'relu' and 'mse' as the last activation function and loss function?
Any input is appreciated.
If I read it correctly, N is one-hot encoded, and it sounds like you want to do classification, not regression.
Since y is one-hot encoded, using categorical_crossentropy is correct.
If you have more than 4 classes in y, you may use integer encodings and sparse_categorical_crossentropy, which decodes your y values into one-hot matrices on the fly.
mse is better suited to regression.
As the last activation, since you have a classification task, you may want to use softmax, which outputs a probability for each of your y classes.
As far as I know, you normally do not use relu in the last layer; for a regression task, sigmoid is generally preferred.
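To illustrate the integer-encoded variant mentioned above, here is a condensed sketch of the same idea (the layer sizes are placeholders, and the targets are class ids of shape (M, t) rather than one-hot vectors):
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

timesteps, n_features, latent_dim = 50, 10, 8   # placeholder sizes

inputs = Input(shape=(timesteps, n_features))
encoded = LSTM(latent_dim)(inputs)                                            # encoder
decoded = LSTM(32, return_sequences=True)(RepeatVector(timesteps)(encoded))   # decoder
outputs = TimeDistributed(Dense(n_features, activation='softmax'))(decoded)

ae = Model(inputs, outputs)
# targets: integer class ids of shape (M, timesteps) instead of one-hot (M, timesteps, N)
ae.compile(loss='sparse_categorical_crossentropy', optimizer='adam')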

Keras LSTM appears to be fitting the end of time-series input instead of the prediction target

To preface this, I have plenty of experience with python and moderate experience building and using machine learning networks. That being said, this is the first LSTM I have made aside from some of the cookie-cutter examples available, so any help is appreciated. I feel like this is a problem with a simple solution and that I have just been looking at this code for far too long to see it.
This model is made in a python3.5 venv using Keras with a tensorflow backend.
In short, I am trying to make predictions of some temporal data using the data itself as well as a few mathematical permutations of this data, creating four input features. I am building a time-series input from the prior 60 data points and specifying the prediction target to be 60 data points in the future.
Shape of complete training data (input)(target): (2476224, 60, 4) (2476224)
Shape of single data "point" (input)(target): (1, 60, 4) (1)
What appears to be happening is that the trained model has fit the trailing value of my input time-series (the current value) instead of the target I have provided it (60 cycles in the future).
What is interesting is that the loss function seems to be calculating according to the correct prediction target, yet the model is not converging to the proper solution.
I have no idea why the model should be doing this. My first thought was that I was preprocessing my data incorrectly and feeding it the wrong target. I have tested my input formatting extensively and am pretty confident that I am providing the model with the correct target and input information.
In one instance, I increased the learning rate a tad such that the model converged to a local minimum. The testing loss at this convergence was very similar to the loss at my preferred learning rate (still quite high). But the predictions were still of the "current value". Why is this so?
Here is how I created my model:
def create_model():
    lstm_model = Sequential()
    lstm_model.add(CuDNNLSTM(100, batch_input_shape=(batch_size, time_step, train_input.shape[2]),
                             stateful=True, return_sequences=True,
                             kernel_initializer='random_uniform'))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(CuDNNLSTM(60))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation='relu'))
    lstm_model.add(Dense(1, activation='linear'))
    optimizer = optimizers.Adagrad(lr=params["lr"])
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)
    return lstm_model
This is how I am pre-processing the data. The first function, build_timeseries, constructs my input-output pairs. I believe this is working correctly (but please correct me if I am wrong). The second function trims the pairs to fit the batch size. I do the exact same for the test input/target.
train_input, train_target = build_timeseries(train_input, time_step, pred_horiz, 0)
train_input = trim_dataset(train_input, batch_size)
train_target = trim_dataset(train_target, batch_size)
def build_timeseries(mat, TIME_STEPS, PRED_HORIZON, y_col_index):
    # y_col_index is the index of column that would act as output column
    dim_0 = mat.shape[0]  # num datasets
    dim_1 = mat.shape[1]  # num features
    dim_2 = mat.shape[2]  # num datapoints
    # Reformatted matrix
    mat = mat.swapaxes(1, 2)
    x = np.zeros((dim_0*(dim_2-PRED_HORIZON), TIME_STEPS, dim_1))
    y = np.zeros((dim_0*(dim_2-PRED_HORIZON),))
    k = 0
    for i in range(dim_0):  # Iterate through datasets
        for j in range(TIME_STEPS, dim_2-PRED_HORIZON):
            x[k] = mat[i, j-TIME_STEPS:j]
            y[k] = mat[i, j+PRED_HORIZON, y_col_index]
            k += 1
    print("length of time-series i/o", x.shape, y.shape)
    return x, y

def trim_dataset(mat, batch_size):
    no_of_rows_drop = mat.shape[0] % batch_size
    if no_of_rows_drop > 0:
        return mat[no_of_rows_drop:]
    else:
        return mat
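For reference, here is a quick sanity check of the pairing produced by build_timeseries on toy data (this check is not part of the original pipeline; the toy array is made up purely so the alignment is easy to read off):
import numpy as np

# 1 dataset, 1 feature, datapoints 0..99
toy = np.arange(100, dtype=float).reshape(1, 1, 100)   # (datasets, features, datapoints)
TIME_STEPS, PRED_HORIZON = 5, 3

x, y = build_timeseries(toy, TIME_STEPS, PRED_HORIZON, 0)
print(x[0].ravel())   # first window: [0. 1. 2. 3. 4.]
print(y[0])           # its target: the value at index TIME_STEPS + PRED_HORIZON, i.e. 8.0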
Lastly, this is how I call the actual model.
history = model.fit(train_input, train_target, epochs=params["epochs"], verbose=2, batch_size=batch_size,
                    shuffle=True, validation_data=(test_input, test_target), callbacks=[es, mcp])
As the model converges, I expect it to predict values close to the specified targets I fed it. Instead, its predictions align much more closely with the trailing value of the time-series input (i.e. the current value). On the other hand, the model does appear to evaluate the loss against the specified target. Why is it behaving this way, and how can I fix it? Any help is appreciated.

Does Keras's LSTM really take into account the cell state and previous output?

I learned about LSTMs over the past day, and then I decided to look at a tutorial that uses Keras to create one. I looked at several tutorials, and they all had some derivative of:
model = Sequential()
model.add(LSTM(10, input_shape=(1,1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
X,y = get_train()
model.fit(X, y, epochs=300, shuffle=False, verbose=0)
then they predicted using
model.predict(X, verbose=0)
My question is: don't you have to give the previous prediction along with the input and cell state in order to predict the next outcome using an LSTM?
Also, what does the 10 represent in model.add(LSTM(10, input_shape=(1,1)))?
You do have to feed the previous prediction back in along with the LSTM state. If you call predict, the LSTM will be initialized every time; it will not remember the state from previous predictions.
Typically (e.g. if you generate text with an LSTM) you have a loop where you do something like this:
# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print "Seed:"
print "\"", ''.join([int_to_char[value] for value in pattern]), "\""
# generate characters
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print "\nDone."
(example copied from machinelearningmastery.com)
The important thing is these lines:
pattern.append(index)
pattern = pattern[1:len(pattern)]
Here they append the next character to the pattern and then drop the first character to get an input length that matches what the LSTM expects. They then bring it into a numpy array (x = numpy.reshape(...)) and predict from the model with the generated output. So, to answer your first question, you need to feed the output back in again.
For the second question: the 10 corresponds to the number of LSTM cells you have in the layer. If you don't use return_sequences=True, it also corresponds to the output size of that layer.
Let's break it down into pieces and look at it step by step.
LSTM(10, input_shape=(3,1)): defines an LSTM whose sequence length is 3, i.e. the LSTM will unroll for 3 timesteps. At each timestep the LSTM takes an input of size 1. The output (and also the size of the hidden state and all the other LSTM gates) is 10 (a vector of size 10).
You don't have to do the unrolling manually (passing the current hidden state on to the next step); it is taken care of by the Keras/TensorFlow LSTM layer. All you have to do is pass in data in the (batch_size x time_steps x input_size) format.
Dense(1, activation='linear'): this is a dense layer with linear activation which takes as input the output of the previous layer (i.e. the output of the LSTM at the last unrolling, which will be a vector of size 10). It returns a vector of size 1.
The same can be checked using model.summary()
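A minimal sketch of that check (only the shapes matter here):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10, input_shape=(3, 1)))      # 3 timesteps, 1 feature per step, 10 units
model.add(Dense(1, activation='linear'))
model.summary()
# LSTM output shape:  (None, 10) -> one vector of size 10 per sequence
# Dense output shape: (None, 1)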
Your 1st question:
don't you have to give the previous prediction along with input and cell state in order to predict the next outcome using an LSTM?
No, you don't have to do that. As far as I understand, it is stored in the LSTM cell, which is why LSTMs use so much RAM.
If you have data with a shape looking like this:
(100, 1000)
and you plug it into the fit function, each epoch will run on 100 lists. The LSTM will remember 1000 data points before refreshing when it moves on to the next list.
2nd:
Also, what does the 10 represent in model.add(LSTM(10, input_shape=(1,1)))?
It is the size of the first layer after the input, so your model currently has the shape:
1,1
10
1
hope it helps :)

Implementing a many-to-many regression task

Sorry if I don't present my problem clearly; English is not my first language.
Problem
Short description:
I want to train a model which maps an input x (with shape [n_sample, timestamp, feature]) to an output y (with the exact same shape). It's like mapping between two spaces.
Longer version:
I have 2 float ndarrays of shape [n_sample, timestamp, feature], representing the MFCC features of n_sample audio files. These 2 ndarrays are 2 speakers' speech of the same corpus, aligned by DTW. Let's name these 2 arrays x and y. I want to train a model which predicts y[k] given x[k]. It's like mapping from space x to space y, and the output must have the exact same shape as the input.
What I've tried
It's a time-series problem, so I decided to use an RNN approach. Here is my code in PyTorch (I put comments along the code; I removed the calculation of the average loss for simplicity). Note that I've tried many options for the learning rate; the behavior is still the same.
Class define
class Net(nn.Module):
    def __init__(self, in_size, hidden_size, out_size, nb_lstm_layers):
        super().__init__()
        self.in_size = in_size
        self.hidden_size = hidden_size
        self.out_size = out_size
        self.nb_lstm_layers = nb_lstm_layers

        # self.fc1 = nn.Linear()
        self.lstm = nn.LSTM(input_size=self.in_size, hidden_size=self.hidden_size,
                            num_layers=self.nb_lstm_layers, batch_first=True, bias=True)
        # self.fc = nn.Linear(self.hidden_size, self.out_size)
        self.fc1 = nn.Linear(self.hidden_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, self.out_size)

    def forward(self, x, h_state):
        out, h_state = self.lstm(x, h_state)
        output_fc = []
        for frame in out:
            # I added a fully connected layer to each frame, to make an output with the same shape as the input
            output_fc.append(self.fc3(torch.tanh(self.fc1(frame))))
        return torch.stack(output_fc), h_state

    def hidden_init(self):
        if use_cuda:
            h_state = torch.stack([torch.zeros(nb_lstm_layers, batch_size, 20) for _ in range(2)]).cuda()
        else:
            h_state = torch.stack([torch.zeros(nb_lstm_layers, batch_size, 20) for _ in range(2)])
        return h_state
Training step:
net = Net(20, 20, 20, nb_lstm_layers)
optimizer = optim.Adam(net.parameters(), lr=0.0001, weight_decay=0.0001)
criterion = nn.MSELoss()

for epoch in range(nb_epoch):
    count = 0
    loss_sum = 0
    batch_x = None
    for i in range(len(data)):
        # data is my entire dataset, which contains the two arrays I specified above
        temp_x = torch.tensor(data[i][0])
        temp_y = torch.tensor(data[i][1])

        for ii in range(0, data[i][0].shape[0] - nb_frame_in_batch*2 + 1):  # Create batches
            batch_x, batch_y = get_batches(temp_x, temp_y, ii, batch_size, nb_frame_in_batch)
            # this returns 2 tensors of shape (batch_size, nb_frame_in_batch, 20),
            # where `batch_size` is the number of samples fed to the net each time
            # and nb_frame_in_batch is the number of frames in each sample
            optimizer.zero_grad()
            h_state = net.hidden_init()

            prediction, h_state = net(batch_x.float(), h_state)
            loss = criterion(prediction.float(), batch_y.float())

            h_state = (h_state[0].detach(), h_state[1].detach())
            loss.backward()
            optimizer.step()
The problem is that the loss does not seem to decrease; it fluctuates a lot, without a clear pattern.
Please help me; any suggestion will be greatly appreciated. If somebody can inspect my code and provide some comments, that would be very kind.
Thanks in advance!
It seems the network is learning nothing from your data, hence the loss fluctuation (since the weights then depend only on their random initialization). There are some things you can try:
Try to normalize the data (this suggestion is quite broad and I can't give you more details since I don't have your data, but normalizing it to a specific range like [0, 1], or to a given mean and std value, is worth trying; see the sketch after this answer).
One very typical problem with LSTM in PyTorch is that its input dimensions are quite different from other types of neural networks. You must feed your network a tensor with shape (seq_len, batch, input_size). You should go here, LSTM section, for more details.
One more thing: try to tune your hyperparameters. LSTMs are harder to train compared to FC or CNN networks (in my experience).
Tell me if you see an improvement. Debugging a neural network is always hard and full of potential coding mistakes.
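Regarding the normalization suggestion, here is a minimal sketch, assuming x and y are the MFCC ndarrays of shape [n_sample, timestamp, feature] described in the question (per-feature min-max scaling to [0, 1] is just one option):
import numpy as np

def minmax_scale(a, eps=1e-8):
    # scale each feature to [0, 1], using min/max taken over samples and timesteps
    a_min = a.min(axis=(0, 1), keepdims=True)
    a_max = a.max(axis=(0, 1), keepdims=True)
    return (a - a_min) / (a_max - a_min + eps)

# x = minmax_scale(x)
# y = minmax_scale(y)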
With most ML algorithms it is tough to diagnose a problem without seeing the data. Based on the inconsistency of your loss results, this might be an issue with your data pre-processing. Have you tried normalizing the data first? Oftentimes with large fluctuations in results, one of your input neuron values may be skewing your loss function, making it unable to find a good direction.
How to normalize a NumPy array to within a certain range?
That is an example for audio normalization, but I would also try adjusting the learning rate, as it looks high, and possibly removing a hidden layer.
Maybe the problem is in the calculation of the loss. Try to sum the losses of each time-step in a sequence and then take the average over the batch. Maybe it helps.
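A sketch of that suggestion, assuming prediction and batch_y have shape (batch_size, nb_frame_in_batch, 20) as described in the question:
import torch

def seq_mse_loss(prediction, target):
    # squared error per element, summed over timesteps and features for each sequence
    per_sample = ((prediction - target) ** 2).sum(dim=(1, 2))
    # then averaged over the batch
    return per_sample.mean()

# loss = seq_mse_loss(prediction.float(), batch_y.float())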
