I am trying to implement an LSTM VAE (following this example I found), but also have it accept variable-length sequences using Masking layers. I tried to combine the above code with the ideas from this SO question, which seems to deal with it the "best way" by cropping the gradients to get as accurate a loss as possible. However, my implementation does not seem to be able to reproduce sequences on a small set of data. I am thus relatively confident that there is something amiss with my implementation, but I cannot seem to pinpoint what exactly is wrong. The relevant part is here:
x = Input(shape=(None, input_dim))
x_masked = Masking(mask_value=0.0, input_shape=(None, input_dim))(x)

h = LSTM(intermediate_dim)(x_masked)

z_mean = Dense(latent_dim)(h)
z_log_sigma = Dense(latent_dim)(h)

def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=epsilon_std)
    return z_mean + z_log_sigma * epsilon

z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])

decoder_h = LSTM(intermediate_dim, return_sequences=True)
decoder_mean = LSTM(latent_dim, return_sequences=True)

h_decoded = RepeatVector(max_timesteps)(z)
h_decoded = decoder_h(h_decoded)
x_decoded_mean = decoder_mean(h_decoded)

def crop_outputs(x):
    # Zero out the decoder output wherever the padded input is zero,
    # so the padded timesteps contribute nothing to the loss
    padding = K.cast(K.not_equal(x[1], 0), dtype=K.floatx())
    return x[0] * padding

x_decoded_mean = Lambda(crop_outputs, output_shape=(max_timesteps, input_dim))([x_decoded_mean, x])

vae = Model(x, x_decoded_mean)

def vae_loss(x, x_decoded_mean):
    xent_loss = objectives.mse(x, x_decoded_mean)
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma))
    loss = xent_loss + kl_loss
    return loss

vae.compile(optimizer='adam', loss=vae_loss)

# Here, X is variable-length time series data of shape
# (num_examples, max_timesteps, input_dim) and is zero-padded
# on the right for all the examples of length less than max_timesteps.
# X has been appropriately scaled using the StandardScaler.
vae.fit(X, X, epochs=num_epochs, batch_size=batch_size)
As always, any help is much appreciated. Thank you!
I came by your question while looking to do exactly the same. I gave up on VAEs, but found a solution to apply masking to layers that don't support it. What I did was just predefine a binary mask (you can do this with numpy, see Code 1) and then multiply my output by the mask. During backpropagation the algorithm will take the derivative of the multiplication and will end up propagating the value or not. It is not as clever as the Masking layer in Keras, but it worked for me.
# Code 1
# making a numpy binary mask
# expecting a sequence with shape (Time_Steps, Features)
# let's say that my sequence has Features = 10 and a max_Len of 15
import numpy as np

max_Len = 15
seq = np.linspace(0, 1, 100).reshape((10, 10))  # 10 observed timesteps

# You must pad/truncate the sequence here
mask = np.concatenate([np.ones(seq.shape[0]), np.zeros(max_Len - seq.shape[0])], axis=-1)
# This mask can be thrown as input to the model afterwards
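For illustration, here is a minimal, hedged sketch of how such a mask could be wired into a model as a second input and multiplied with the output; the layer sizes and names are made up, not taken from my actual code:

from keras.layers import Input, LSTM, TimeDistributed, Dense, Lambda
from keras.models import Model
import keras.backend as K

n_features = 10
seq_in = Input(shape=(max_Len, n_features))
mask_in = Input(shape=(max_Len,))  # one row of the Code 1 mask per example

h = LSTM(32, return_sequences=True)(seq_in)
out = TimeDistributed(Dense(n_features))(h)

# Broadcast the mask from (batch, max_Len) to (batch, max_Len, 1)
# and zero the outputs at the padded timesteps
masked = Lambda(lambda t: t[0] * K.expand_dims(t[1], -1))([out, mask_in])

model = Model([seq_in, mask_in], masked)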
A few considerations:
1- It resulted in a weak regression model. I don't know the impact on VAEs, since I never tested, but I think it will generate lots of noise.
2- The computational resource demand went up, so it is a good thing to try and calculate the requirements of propagating and backpropagating this workaround (or "gambiarra", as we say here) if you are on a budget like me.
3- It won't solve the problem completely; you could try to delve deeper into this and implement a more stable solution using pure Tensorflow.
4- A more "accurate" solution would be to implement a custom masking layer (Code 2).
Regarding point 4, it is easy: you define the layer as a standard custom layer, have its call function receive a mask, and then output the multiplication of mask and input. Like this:
# Code 2
class MyCoolMaskingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(MyCoolMaskingLayer, self).__init__(**kwargs)
        # init stuff here

    def compute_mask(self, inputs, mask=None):
        # pass the incoming mask through unchanged
        return mask

    def call(self, input, mask=None):
        # broadcast the boolean mask to (batch, ..., 1) so it multiplies cleanly
        bc_mask = tf.expand_dims(tf.cast(mask, "float32"), -1) if mask is not None else np.asarray([[1]])
        return input * bc_mask
This layer might not work for you; it is really problem-specific and from a noob (me), but it worked for me. I just cannot share the entire code because my Master's tutor doesn't allow it.
(A little bit of context: I wrap it around a TimeDistributed so that each timestep of an LSTM output is individually processed by this masking layer, because inside call I perform some transformations on the data. A hedged usage sketch follows.)
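For illustration only, a minimal sketch of how such a layer could sit after a TimeDistributed block (the sizes are made up; the mask originates from a standard Masking layer, which Keras propagates through the LSTM and TimeDistributed for us):

import tensorflow as tf

inp = tf.keras.layers.Input(shape=(15, 10))
x = tf.keras.layers.Masking(mask_value=0.0)(inp)  # creates the mask
x = tf.keras.layers.LSTM(32, return_sequences=True)(x)  # mask flows through
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(10))(x)
out = MyCoolMaskingLayer()(x)  # receives the mask, zeros the padded timesteps
model = tf.keras.Model(inp, out)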
To preface this, I have plenty of experience with python and moderate experience building and using machine learning networks. That being said, this is the first LSTM I have made aside from some of the cookie-cutter examples available, so any help is appreciated. I feel like this is a problem with a simple solution and that I have just been looking at this code for far too long to see it.
This model is made in a python3.5 venv using Keras with a tensorflow backend.
In short, I am trying to make predictions of some temporal data using the data itself as well as a few mathematical permutations of this data, creating four input features. I am building a time-series input from the prior 60 data points and specifying the prediction target to be 60 data points in the future.
Shape of complete training data (input)(target): (2476224, 60, 4) (2476224)
Shape of single data "point" (input)(target): (1, 60, 4) (1)
What appears to be happening is that the trained model has fit the trailing value of my input time-series (the current value) instead of the target I have provided it (60 cycles in the future).
What is interesting is that the loss function seems to be calculating according to the correct prediction target, yet the model is not converging to the proper solution.
I have no idea why the model is doing this. My first thought was that I was preprocessing my data incorrectly and feeding it the wrong target, but I have tested my input formatting of the data extensively and am pretty confident that I am providing the model with the correct target and input information.
In one instance, I increased the learning rate a tad such that the model converged to a local minimum. The testing loss at this convergence was very similar to the loss at my preferred learning rate (still quite high), but the predictions were still of the "current value". Why is this so?
Here is how I created my model:
def create_model():
    lstm_model = Sequential()
    lstm_model.add(CuDNNLSTM(100, batch_input_shape=(batch_size, time_step, train_input.shape[2]),
                             stateful=True, return_sequences=True,
                             kernel_initializer='random_uniform'))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(CuDNNLSTM(60))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation='relu'))
    lstm_model.add(Dense(1, activation='linear'))
    optimizer = optimizers.Adagrad(lr=params["lr"])
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)
    return lstm_model
This is how I am pre-processing the data. The first function, build_timeseries, constructs my input-output pairs. I believe this is working correctly (but please correct me if I am wrong). The second function trims the pairs to fit the batch size. I do the exact same for the test input/target.
train_input, train_target = build_timeseries(train_input, time_step, pred_horiz, 0)
train_input = trim_dataset(train_input, batch_size)
train_target = trim_dataset(train_target, batch_size)

def build_timeseries(mat, TIME_STEPS, PRED_HORIZON, y_col_index):
    # y_col_index is the index of the column that acts as the output column
    dim_0 = mat.shape[0]  # num datasets
    dim_1 = mat.shape[1]  # num features
    dim_2 = mat.shape[2]  # num datapoints

    # Reformatted matrix: (datasets, datapoints, features)
    mat = mat.swapaxes(1, 2)

    x = np.zeros((dim_0*(dim_2-PRED_HORIZON), TIME_STEPS, dim_1))
    y = np.zeros((dim_0*(dim_2-PRED_HORIZON),))
    k = 0
    for i in range(dim_0):  # Iterate through datasets
        for j in range(TIME_STEPS, dim_2-PRED_HORIZON):
            x[k] = mat[i, j-TIME_STEPS:j]
            y[k] = mat[i, j+PRED_HORIZON, y_col_index]
            k += 1
    print("length of time-series i/o", x.shape, y.shape)
    return x, y

def trim_dataset(mat, batch_size):
    no_of_rows_drop = mat.shape[0] % batch_size
    if no_of_rows_drop > 0:
        return mat[no_of_rows_drop:]
    else:
        return mat
Lastly, this is how I call the actual model.
history = model.fit(train_input, train_target, epochs=params["epochs"], verbose=2, batch_size=batch_size,
                    shuffle=True, validation_data=(test_input, test_target), callbacks=[es, mcp])
As the model converges, I expect it to predict values close to the specified targets I fed it. Instead, its predictions align much more closely with the trailing value of the time-series data (the current value), even though the model appears to be evaluating the loss according to the specified target. Why is it working this way, and how can I fix it? Any help is appreciated.
I am converting Keras code into PyTorch because I am more familiar with the latter than the former. However, I found that it is not learning (or only barely).
Below I have provided almost all of my PyTorch code, including the initialisation code, so that you can try it out yourself. The only thing you would need to provide yourself is the word embeddings (I'm sure you can find many word2vec models online). The first input file should be a file with tokenised text, the second input file should be a file with floating-point numbers, one per line. Because I have provided all the code, this question may seem huge and too broad. However, my question is specific enough, I think: what is wrong in my model or training loop that causes my model to not or barely improve? (See below for results.)
I have tried to provide many comments where applicable, and I have provided the shape transformations as well so you do not have to run the code to see what is going on. The data prep methods are not important to inspect.
The most important parts are the forward method of the RegressorNet, and the training loop of RegressionNN (admittedly, these names were badly chosen). I think the mistake is there somewhere.
from pathlib import Path
import time
import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import gensim
from scipy.stats import pearsonr
from LazyTextDataset import LazyTextDataset
class RegressorNet(nn.Module):
    def __init__(self, hidden_dim, embeddings=None, drop_prob=0.0):
        super(RegressorNet, self).__init__()
        self.hidden_dim = hidden_dim
        self.drop_prob = drop_prob

        # Load pretrained w2v model, but freeze it: don't retrain it.
        self.word_embeddings = nn.Embedding.from_pretrained(embeddings)
        self.word_embeddings.weight.requires_grad = False
        self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim, bidirectional=True, dropout=drop_prob)

        self.dropout = nn.Dropout(drop_prob)
        self.linear = nn.Linear(hidden_dim * 2, 1)
        # LeakyReLU rather than ReLU so that we don't get stuck in dead nodes
        self.lrelu = nn.LeakyReLU()

    def forward(self, batch_size, sentence_input):
        # shape sizes for:
        # * batch_size 128
        # * embeddings of dim 146
        # * hidden dim of 200
        # * sentence length of 20

        # sentence_input: torch.Size([128, 20])
        # Get word2vec vector representation
        embeds = self.word_embeddings(sentence_input)
        # embeds: torch.Size([128, 20, 146])

        # embeds.view(-1, batch_size, embeds.size(2)): torch.Size([20, 128, 146])
        # Input vectors into GRU, only keep track of output
        w2v_out, _ = self.w2v_rnode(embeds.view(-1, batch_size, embeds.size(2)))
        # w2v_out: torch.Size([20, 128, 400])

        # Leaky ReLU it
        w2v_out = self.lrelu(w2v_out)

        # Dropout some nodes
        if self.drop_prob > 0:
            w2v_out = self.dropout(w2v_out)
        # w2v_out: torch.Size([20, 128, 400])
        # w2v_out[-1, :, :]: torch.Size([128, 400])

        # Only use the last output of a sequence! Supposedly that cell outputs the final information
        regression = self.linear(w2v_out[-1, :, :])
        # regression: torch.Size([128, 1])

        return regression
class RegressionRNN:
    def __init__(self, train_files=None, test_files=None, dev_files=None):
        print('Using torch ' + torch.__version__)

        self.datasets, self.dataloaders = RegressionRNN._set_data_loaders(train_files, test_files, dev_files)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        self.model = self.w2v_vocab = self.criterion = self.optimizer = self.scheduler = None

    @staticmethod
    def _set_data_loaders(train_files, test_files, dev_files):
        # labels must be the last input file
        datasets = {
            'train': LazyTextDataset(train_files) if train_files is not None else None,
            'test': LazyTextDataset(test_files) if test_files is not None else None,
            'valid': LazyTextDataset(dev_files) if dev_files is not None else None
        }

        dataloaders = {
            'train': DataLoader(datasets['train'], batch_size=128, shuffle=True, num_workers=4) if train_files is not None else None,
            'test': DataLoader(datasets['test'], batch_size=128, num_workers=4) if test_files is not None else None,
            'valid': DataLoader(datasets['valid'], batch_size=128, num_workers=4) if dev_files is not None else None
        }

        return datasets, dataloaders

    @staticmethod
    def prepare_lines(data, split_on=None, cast_to=None, min_size=None, pad_str=None, max_size=None, to_numpy=False,
                      list_internal=False):
        """ Converts the string input (line) to an applicable format. """
        out = []
        for line in data:
            line = line.strip()
            if split_on:
                line = line.split(split_on)
                line = list(filter(None, line))
            else:
                line = [line]

            if cast_to is not None:
                line = [cast_to(l) for l in line]

            if min_size is not None and len(line) < min_size:
                # pad line up to a number of tokens
                line += (min_size - len(line)) * ['#pad#']
            elif max_size and len(line) > max_size:
                line = line[:max_size]

            if list_internal:
                line = [[item] for item in line]

            if to_numpy:
                line = np.array(line)

            out.append(line)

        if to_numpy:
            out = np.array(out)

        return out

    def prepare_w2v(self, data):
        idxs = []
        for seq in data:
            tok_idxs = []
            for word in seq:
                # For every word, get its index in the w2v model.
                # If it doesn't exist, use #unk# (available in the model).
                try:
                    tok_idxs.append(self.w2v_vocab[word].index)
                except KeyError:
                    tok_idxs.append(self.w2v_vocab['#unk#'].index)
            idxs.append(tok_idxs)
        idxs = torch.tensor(idxs, dtype=torch.long)

        return idxs
    def train(self, epochs=10):
        valid_loss_min = np.Inf
        train_losses, valid_losses = [], []
        for epoch in range(1, epochs + 1):
            epoch_start = time.time()

            train_loss, train_results = self._train_valid('train')
            valid_loss, valid_results = self._train_valid('valid')

            # Calculate Pearson correlation between prediction and target
            try:
                train_pearson = pearsonr(train_results['predictions'], train_results['targets'])
            except FloatingPointError:
                train_pearson = "Could not calculate Pearsonr"

            try:
                valid_pearson = pearsonr(valid_results['predictions'], valid_results['targets'])
            except FloatingPointError:
                valid_pearson = "Could not calculate Pearsonr"

            # calculate average losses
            train_loss = np.mean(train_loss)
            valid_loss = np.mean(valid_loss)

            train_losses.append(train_loss)
            valid_losses.append(valid_loss)

            # print training/validation statistics
            print(f'----------\n'
                  f'Epoch {epoch} - completed in {(time.time() - epoch_start):.0f} seconds\n'
                  f'Training Loss: {train_loss:.6f}\t Pearson: {train_pearson}\n'
                  f'Validation loss: {valid_loss:.6f}\t Pearson: {valid_pearson}')

            # validation loss has decreased
            if valid_loss <= valid_loss_min and train_loss > valid_loss:
                print(f'!! Validation loss decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving model ...')
                valid_loss_min = valid_loss

            if train_loss <= valid_loss:
                print('!! Training loss is lte validation loss. Might be overfitting!')

            # Optimise with scheduler
            if self.scheduler is not None:
                self.scheduler.step(valid_loss)

        print('Done training...')

    def _train_valid(self, do):
        """ Do training or validating. """
        if do not in ('train', 'valid'):
            raise ValueError("Use 'train' or 'valid' for 'do'.")

        results = {'predictions': np.array([]), 'targets': np.array([])}
        losses = np.array([])

        self.model = self.model.to(self.device)
        if do == 'train':
            self.model.train()
            torch.set_grad_enabled(True)
        else:
            self.model.eval()
            torch.set_grad_enabled(False)

        for batch_idx, data in enumerate(self.dataloaders[do], 1):
            # 1. Data prep
            sentence = data[0]
            target = data[-1]
            curr_batch_size = target.size(0)

            # Returns list of tokens, possibly padded with #pad#
            sentence = self.prepare_lines(sentence, split_on=' ', min_size=20, max_size=20)
            # Converts tokens into w2v IDs as a Tensor
            sent_w2v_idxs = self.prepare_w2v(sentence)
            # Converts output to Tensor of floats
            target = torch.Tensor(self.prepare_lines(target, cast_to=float))

            # Move input to device
            sent_w2v_idxs, target = sent_w2v_idxs.to(self.device), target.to(self.device)

            # 2. Predictions
            pred = self.model(curr_batch_size, sentence_input=sent_w2v_idxs)
            loss = self.criterion(pred, target)

            # 3. Optimise during training
            if do == 'train':
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

            # 4. Save results
            pred = pred.detach().cpu().numpy()
            target = target.cpu().numpy()

            results['predictions'] = np.append(results['predictions'], pred, axis=None)
            results['targets'] = np.append(results['targets'], target, axis=None)
            losses = np.append(losses, float(loss))

        torch.set_grad_enabled(True)

        return losses, results
if __name__ == '__main__':
    HIDDEN_DIM = 200

    # Load embeddings from pretrained gensim model
    embed_p = Path('path-to.w2v_model').resolve()
    w2v_model = gensim.models.KeyedVectors.load_word2vec_format(str(embed_p))
    # add a padding token with only zeros
    w2v_model.add(['#pad#'], [np.zeros(w2v_model.vectors.shape[1])])
    embed_weights = torch.FloatTensor(w2v_model.vectors)

    # Text files are used as input. Every line is one datapoint.
    # *.tok.low.*: tokenized (space-separated) sentences
    # *.cross: one floating point number per line, which we are trying to predict
    regr = RegressionRNN(train_files=(r'train.tok.low.en',
                                      r'train.cross'),
                         dev_files=(r'dev.tok.low.en',
                                    r'dev.cross'),
                         test_files=(r'test.tok.low.en',
                                     r'test.cross'))
    regr.w2v_vocab = w2v_model.vocab
    regr.model = RegressorNet(HIDDEN_DIM, embed_weights, drop_prob=0.2)
    regr.criterion = nn.MSELoss()
    regr.optimizer = optim.Adam(list(regr.model.parameters())[0:], lr=0.001)
    regr.scheduler = optim.lr_scheduler.ReduceLROnPlateau(regr.optimizer, 'min', factor=0.1, patience=5, verbose=True)

    regr.train(epochs=100)
For the LazyTextDataset, you can refer to the class below.
from torch.utils.data import Dataset
import linecache


class LazyTextDataset(Dataset):
    def __init__(self, paths):
        # labels are in the last path
        self.paths, self.labels_path = paths[:-1], paths[-1]

        with open(self.labels_path, encoding='utf-8') as fhin:
            lines = 0
            for line in fhin:
                if line.strip() != '':
                    lines += 1

            self.num_entries = lines

    def __getitem__(self, idx):
        data = [linecache.getline(p, idx + 1) for p in self.paths]
        label = linecache.getline(self.labels_path, idx + 1)

        return (*data, label)

    def __len__(self):
        return self.num_entries
As I wrote before, I am trying to convert a Keras model to PyTorch. The original Keras code does not use an embedding layer, and uses pre-built word2vec vectors per sentence as input. In the model below, there is no embedding layer. The Keras summary looks like this (I don't have access to the base model setup).
Layer (type) Output Shape Param # Connected to
====================================================================================================
bidirectional_1 (Bidirectional) (200, 400) 417600
____________________________________________________________________________________________________
dropout_1 (Dropout) (200, 800) 0 merge_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (200, 1) 801 dropout_1[0][0]
====================================================================================================
The issue is that with identical input, the Keras model works and gets a +0.5 Pearson correlation between predicted and actual labels. The PyTorch model above, though, does not seem to work at all. To give you an idea, here is the loss (mean squared error) and Pearson (correlation coefficient, p-value) after the first epoch:
Epoch 1 - completed in 11 seconds
Training Loss: 1.684495 Pearson: (-0.0006077809280690612, 0.8173368901481127)
Validation loss: 1.708228 Pearson: (0.017794288315261794, 0.4264098054188664)
And after the 100th epoch:
Epoch 100 - completed in 11 seconds
Training Loss: 1.660194 Pearson: (0.0020315421756790806, 0.4400929436716754)
Validation loss: 1.704910 Pearson: (-0.017288118524826892, 0.4396865964324158)
The loss is plotted below (when you look at the Y-axis, you can see the improvements are minimal).
A final indicator that something may be wrong is that for my 140K lines of input, each epoch only takes 10 seconds on my GTX 1080TI. I feel that this is not much, and I would guess that the optimisation is not working/running. I cannot figure out why, though. The issue will probably be in my train loop or the model itself, but I cannot find it.
Again, something must be going wrong because:
- the Keras model does perform well;
- the training speed is 'too fast' for 140K sentences
- there are almost no improvements after training.
What am I missing? The issue is more than likely present in the training loop or in the network structure.
TL;DR: Use permute instead of view when swapping axes, see the end of answer to get an intuition about the difference.
About RegressorNet (neural network model)
There is no need to freeze the embedding layer if you are using from_pretrained. As the documentation states, it is frozen by default (freeze=True), so it does not receive gradient updates.
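A quick sanity check (with a made-up weight matrix) confirms this:

import torch
from torch import nn

weights = torch.randn(10, 4)  # hypothetical 10-word, 4-dimensional embedding table
emb = nn.Embedding.from_pretrained(weights)  # freeze=True is the default
print(emb.weight.requires_grad)  # False, so setting it manually is redundant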
This part:
self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim, bidirectional=True, dropout=drop_prob)
and especially dropout without providing num_layers is totally pointless (dropout is only applied between stacked recurrent layers, so it has no effect on a shallow one-layer network; PyTorch even warns about this).
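For dropout between recurrent layers to do anything, the GRU would have to be stacked, along these lines (sizes are illustrative only):

from torch import nn

# dropout is applied between stacked layers, so it needs num_layers > 1
rnn = nn.GRU(input_size=146, hidden_size=200, num_layers=2,
             dropout=0.2, bidirectional=True)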
BUG AND MAIN ISSUE: in your forward function you are using view instead of permute, here:
w2v_out, _ = self.w2v_rnode(embeds.view(-1, batch_size, embeds.size(2)))
See this answer and appropriate documentation for each of those functions and try to use this line instead:
w2v_out, _ = self.w2v_rnode(embeds.permute(1, 0, 2))
You may consider using the batch_first=True argument when creating w2v_rnode; you won't have to permute the axes that way.
Check the documentation of torch.nn.GRU: you are after the last step of the sequence, not after all of the sequences you have there, so you should use:
_, last_hidden = self.w2v_rnode(embeds.permute(1, 0, 2))
but I think this part is fine otherwise.
Data preparation
No offence, but prepare_lines is very unreadable and seems pretty hard to maintain as well, not to mention that spotting an eventual bug in it is hard (I suppose one lies in here).
First of all, it seems like you are padding manually. Please don't do it that way; use torch.nn.utils.rnn.pad_sequence to work with batches!
In essence, first you encode each word in every sentence as an index pointing into the embedding (as you seem to do in prepare_w2v); after that, you use torch.nn.utils.rnn.pad_sequence and torch.nn.utils.rnn.pack_padded_sequence (or torch.nn.utils.rnn.pack_sequence if the lines are already sorted by length).
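A minimal sketch of that pipeline, with made-up index-encoded sentences (sorted by descending length, which packing requires):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# three hypothetical index-encoded sentences of different lengths
seqs = [torch.tensor([4, 2, 9, 1]), torch.tensor([5, 5, 8]), torch.tensor([7, 3])]
lengths = torch.tensor([len(s) for s in seqs])  # already descending: [4, 3, 2]

padded = pad_sequence(seqs, batch_first=True)  # shape (3, 4), zero-padded on the right
packed = pack_padded_sequence(padded, lengths, batch_first=True)  # ready for an RNN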
Proper batching
This part is very important and it seems you are not doing that at all (and likely this is the second error in your implementation).
PyTorch's RNN cells take inputs not as padded tensors, but as torch.nn.utils.rnn.PackedSequence objects. This is an efficient object storing the indices which specify the unpadded length of each sequence.
See more information on the topic here, here and in many other blog posts throughout the web.
The first sequence in the batch has to be the longest, and all others have to be provided in descending order of length. That means either:
- you sort your batch each time by sequence length and sort your targets in an analogous way, OR
- you sort your batch, push it through the network and unsort it afterwards to match it with your targets.
Either is fine, it's your call what seems to be more intuitive for you.
What I like to do is more or less the following, hope it helps:
Create unique indices for each word and map each sentence appropriately (you've already done it).
Create a regular torch.utils.data.Dataset object returning a single sentence for each __getitem__, where it is returned as a tuple consisting of features (torch.Tensor) and labels (single value); it seems like you're doing that as well.
Create custom collate_fn for use with torch.utils.data.DataLoader, which is responsible for sorting and padding each batch in this scenario (+ it returns lengths of each sentence to be passed into neural network).
Using the sorted and padded features and their lengths, I use torch.nn.utils.rnn.pack_padded_sequence inside the neural network's forward method (do it after embedding!) to push it through the RNN layer.
Depending on the use-case, I unpack them using torch.nn.utils.rnn.pad_packed_sequence. In your case, you only care about the last hidden state, hence you don't have to do that. If you were using all of the hidden outputs (like is the case with, say, attention networks), you would add this part.
When it comes to the third point, here is a sample implementation of collate_fn, you should get the idea:
import torch


def length_sort(features):
    # Get length of each sentence in batch
    sentences_lengths = torch.tensor(list(map(len, features)))
    # Get indices which sort the sentences based on descending length
    _, sorter = sentences_lengths.sort(descending=True)
    # Pad batch as you have the lengths and sorter saved already
    padded_features = torch.nn.utils.rnn.pad_sequence(features, batch_first=True)
    return padded_features, sentences_lengths, sorter


def pad_collate_fn(batch):
    # DataLoader returns the batch like that, unluckily; check it on your own
    features, labels = (
        [element[0] for element in batch],
        [element[1] for element in batch],
    )
    padded_features, sentences_lengths, sorter = length_sort(features)
    # Sort features and labels by length accordingly
    sorted_padded_features, sorted_labels = (
        padded_features[sorter],
        torch.tensor(labels)[sorter],
    )
    # Return the lengths sorted the same way, so they line up with the features
    return sorted_padded_features, sorted_labels, sentences_lengths[sorter]
Use those as collate_fn in your DataLoaders and you should be just about fine (maybe with minor adjustments, so it's essential you understand the idea behind it).
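And a hedged sketch of what the matching forward method could look like (assuming the GRU was created with batch_first=True; the names mirror your model, but this is illustrative rather than a drop-in fix):

def forward(self, sentence_input, lengths):
    embeds = self.word_embeddings(sentence_input)  # (batch, seq_len, emb_dim)
    # Pack after embedding so the GRU skips the padded steps
    packed = torch.nn.utils.rnn.pack_padded_sequence(embeds, lengths, batch_first=True)
    _, last_hidden = self.w2v_rnode(packed)  # (2, batch, hidden) for one bidirectional layer
    # Concatenate the final forward and backward hidden states
    hidden = torch.cat((last_hidden[0], last_hidden[1]), dim=1)
    return self.linear(self.lrelu(hidden))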
Other possible problems and tips
Training loop: a great place for a lot of small errors; you may want to minimise those by using PyTorch Ignite. I am having an unbelievably hard time going through your Tensorflow-like-Estimator-like-API-like training loop (e.g. self.model = self.w2v_vocab = self.criterion = self.optimizer = self.scheduler = None and the like). Please don't do it this way; separate each task (data creation, data loading, data preparation, model setup, training loop, logging) into its own respective module. All in all, there is a reason why PyTorch/Keras is more readable and sanity-preserving than Tensorflow.
Make the first row of your embedding equal to a vector containing zeros: torch.nn.functional.embedding (and nn.Embedding) treats the row given by padding_idx as the padding vector. Hence you should start your unique indexing for each word at 1 and reserve row 0 for padding, or specify padding_idx with a different value (though I highly discourage this approach; it's confusing at best).
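For example (hypothetical sizes), reserving row 0 via padding_idx pins that row to zeros and excludes it from gradient updates:

from torch import nn

# index 0 is reserved for padding; its row stays a zero vector
emb = nn.Embedding(num_embeddings=2000, embedding_dim=146, padding_idx=0)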
I hope this answer helps you at least a little bit, if something is unclear post a comment below and I'll try to explain it from a different perspective/more detail.
Some final comments
This code is not reproducible, nor is the question specific. We don't have the data you are using, nor your word vectors; the random seed is not fixed, etc.
PS. One last thing: check your performance on a really small subset of your data (say, 96 examples); if it does not converge, it is very likely you indeed have a bug in your code.
About the times: they are probably off (due to not sorting and not packing, I suppose); usually Keras's and PyTorch's times are quite similar (if I understood this part of your question as intended) for correct and efficient implementations.
Permute vs view vs reshape explanation
This simple example shows the differences between permute() and view(). The first one swaps axes, while the second does not change the memory layout, just chunks the array into the desired shape (if possible).
import torch
a = torch.tensor([[1, 2], [3, 4], [5, 6]])
print(a)
print(a.permute(1, 0))
print(a.view(2, 3))
And the output would be:
tensor([[1, 2],
[3, 4],
[5, 6]])
tensor([[1, 3, 5],
[2, 4, 6]])
tensor([[1, 2, 3],
[4, 5, 6]])
reshape is almost like view; it was added for those coming from numpy, so it's easier and more natural for them, but it has one important difference:
- view never copies data and works only on contiguous memory (so after a permutation like the one above, your data may not be contiguous, hence access to it might be slower);
- reshape can copy data if needed, so it also works for non-contiguous arrays.
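A small demonstration of that difference, continuing the tensor from above:

import torch

a = torch.tensor([[1, 2], [3, 4], [5, 6]])
b = a.permute(1, 0)  # non-contiguous view over the same memory
print(b.is_contiguous())  # False
# b.view(3, 2)  # would raise a RuntimeError: view needs contiguous memory
print(b.reshape(3, 2))  # fine: reshape copies when it has to
print(b.contiguous().view(3, 2))  # the explicit equivalent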
I've got some short code that attempts to fit a function. But I am worried about how to feed the data to a tflearn rnn.
The X input is a [45,1,8] (45 samples, 4 timesteps, and 8 features) array. Therefore, the Y input should also be a [45,1,8] array, since the goal is to minimize the difference element-wise.
However it throws the following error when this is attempted
Cannot feed value of shape (45, 1, 8) for Tensor 'TargetsData/Y:0', which has shape '(?, 8)'
I can't seem to figure out my error. Any help would be appreciated.
Note: someone seems to have solved a similar problem, but I can't understand the answer
tensorflow/tflearn input shape
Full Code
def mod(rnn_output, state):
    Tau = tfl.variable(name='GRN', shape=[8],
                       initializer='uniform_scaling',
                       regularizer='L2')
    Timestep = tf.constant(6.0, shape=[8])

    one = tf.div(Timestep, Tau)
    two = rnn_output
    three = tf.mul(tf.sub(tf.ones(shape=[num_genes]), one), state)
    four = tf.mul(one, two)
    five = tf.add(four, three)
    return five

net = tfl.input_data(shape=[None, 4, 8])
out, state = tfl.layers.recurrent.simple_rnn(net, 8, return_state=True,
                                             name='RNN')
net = tfl.layers.core.custom_layer(out, mod, state=state)
net = tfl.layers.estimator.regression(net)

# Define model
model = tfl.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(train_x, train_y, n_epoch=10, batch_size=45, show_metric=True)
I've constructed a LSTM recurrent NNet using lasagne that is loosely based on the architecture in this blog post. My input is a text file that has around 1,000,000 sentences and a vocabulary of 2,000 word tokens. Normally, when I construct networks for image recognition my input layer will look something like the following:
l_in = nn.layers.InputLayer((32, 3, 128, 128))
(where the dimensions are batch size, channel, height and width) which is convenient because all the images are the same size so I can process them in batches. Since each instance in my LSTM network has a varying sentence length, I have an input layer that looks like the following:
l_in = nn.layers.InputLayer((None, None, 2000))
As described in the above-referenced blog post:

"Masks: Because not all sequences in each minibatch will always have the same length, all recurrent layers in lasagne accept a separate mask input which has shape (batch_size, n_time_steps), which is populated such that mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When no mask is provided, it is assumed that all sequences in the minibatch are of length n_time_steps."
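For concreteness, a minimal sketch of building such a mask from a list of (hypothetical) sequence lengths:

import numpy as np

lengths = [5, 3, 1]  # hypothetical lengths of the sequences in one minibatch
n_time_steps = max(lengths)

mask = np.zeros((len(lengths), n_time_steps), dtype='float32')
for i, l in enumerate(lengths):
    mask[i, :l] = 1.0  # ones up to the sequence's length, zeros afterwards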
My question is: Is there a way to process this type of network in mini-batches without using a mask?
Here is a simplified version of my network.
# -*- coding: utf-8 -*-

import theano
import theano.tensor as T
import lasagne as nn

softmax = nn.nonlinearities.softmax


def build_model():
    l_in = nn.layers.InputLayer((None, None, 2000))
    lstm = nn.layers.LSTMLayer(l_in, 4096, grad_clipping=5)
    rs = nn.layers.SliceLayer(lstm, 0, 0)
    dense = nn.layers.DenseLayer(rs, num_units=2000, nonlinearity=softmax)
    return l_in, dense


model = build_model()
l_in, l_out = model

all_params = nn.layers.get_all_params(l_out)
target_var = T.ivector("target_output")
output = nn.layers.get_output(l_out)
loss = T.nnet.categorical_crossentropy(output, target_var).sum()
updates = nn.updates.adagrad(loss, all_params, 0.005)
train = theano.function([l_in.input_var, target_var], loss, updates=updates)
From there I have a generator that spits out (X, y) pairs, and I am computing train(X, y) and updating the gradient with each iteration. What I want to do is perform N training steps and then update the parameters with the average gradient.
To do this, I tried creating a compute_gradient function:
gradient = theano.grad(loss, all_params)
compute_gradient = theano.function(
    [l_in.input_var, target_var],
    outputs=gradient
)
and then looping over several training instances to create a "batch" and collect the gradient calculations to a list:
grads = []
for _ in xrange(1024):
    X, y = train_gen.next()  # generator for producing training data
    grads.append(compute_gradient(X, y))
this produces a list of lists
>>> grads
[[<CudaNdarray at 0x7f83b5ff6d70>,
<CudaNdarray at 0x7f83b5ff69f0>,
<CudaNdarray at 0x7f83b5ff6270>,
<CudaNdarray at 0x7f83b5fc05f0>],
[<CudaNdarray at 0x7f83b5ff66f0>,
<CudaNdarray at 0x7f83b5ff6730>,
<CudaNdarray at 0x7f83b5ff6b70>,
<CudaNdarray at 0x7f83b5ff64f0>] ...
From here I would need to take the mean of the gradient at each layer and then update the model parameters. Is this possible to do in pieces like this, or does the gradient calculation/parameter update need to happen all in one theano function?
Thanks.
NOTE: this is a solution, but by no means do I have enough experience to verify it's the best one, and the code is just a sloppy example.
You need 2 theano functions. The first is the gradient one, which you seem to have already, judging from the information provided in your question.
So after computing the batched gradients, you want to immediately feed them as input arguments back into another theano function dedicated to updating the shared variables. For this you need to specify the expected batch size at the compile time of your neural network. So you could do something like this (for simplicity I will assume you have a global list variable where all your params are stored):
params  # list of params you wish to update
BATCH_SIZE = 1024  # size of the expected training batch
learning_rate = 0.005  # stand-in; match the rate of your usual update rule

# Placeholders for the gradient results, flattened (batch-major: batch entry 0's
# gradients first, then entry 1's, ...) so they can be fed into a theano function.
# This assumes every param is a matrix; use matching types for vector params.
G = [T.matrix() for i in range(BATCH_SIZE) for param in params]

# Start from the gradients of the first batch entry
updates = [G[i] for i in range(len(params))]

# Sum the gradients over the batch for each individual param
for i in range(len(params)):
    for j in range(1, BATCH_SIZE):
        updates[i] += G[j * len(params) + i]

# Make a list of tuples for theano.function's updates argument;
# a plain SGD step over the mean gradient, substitute your own rule
for i in range(len(params)):
    updates[i] = (params[i], params[i] - learning_rate * updates[i] / BATCH_SIZE)

update = theano.function(G, [], updates=updates)
This way theano takes the mean of the gradients and updates the params as usual.
I don't know if you need to flatten the inputs as I did, but probably.
EDIT: gathering from how you edited your question, it seems important that the batch size can vary. In that case, you could add 2 theano functions to your existing one:
- the first theano function takes two of your per-param gradient lists and returns their sum; you could apply this theano function using python's reduce() to get the sum over the whole batch of gradients;
- the second theano function takes those summed param gradients and a scalar (the batch size) as input, and is hence able to update the NN params with the mean of the summed gradients.
A hedged sketch of both follows.
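Here is a hedged sketch of those two functions, under the same simplifying assumption as above (every param is a matrix; a plain SGD step stands in for whatever update rule you actually use):

import theano
import theano.tensor as T

# 'params' is your existing list of shared variables

# 1) Sum two per-param gradient lists, pairwise
g_a = [T.matrix() for _ in params]
g_b = [T.matrix() for _ in params]
sum_grads = theano.function(g_a + g_b, [a + b for a, b in zip(g_a, g_b)])

# 2) Divide the summed gradients by a variable batch size and update the params
g_sum = [T.matrix() for _ in params]
n = T.scalar()  # the batch size, passed in at call time
learning_rate = 0.005  # stand-in value; substitute your own update rule
updates = [(p, p - learning_rate * g / n) for p, g in zip(params, g_sum)]
apply_mean_update = theano.function(g_sum + [n], [], updates=updates)

# Usage with the 'grads' list of per-example gradients from the question:
# total = reduce(lambda u, v: sum_grads(*(u + v)), grads)
# apply_mean_update(*(total + [float(len(grads))]))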