I have a VGG16 model implemented with Keras/tensorflow.
When I call model.fit, I pass in a generator of data. The generator applies the transforms a VGGNet needs:
Preprocess the images with vgg16.preprocess_input
Convert the label to a one-hot vector via to_categorical
The generator can be seen below and works. Unfortunately, since there are multiple epochs, I have to set dataset.repeat(-1) (repeat infinitely) so the generator doesn't run out. This in turn requires passing a steps_per_epoch so a given iteration of training can complete. As you're probably thinking, this is brittle: it hinges on knowing the dataset's cardinality!
I have decided it's best to preprocess the training Dataset once up front using Dataset.map. However, I am struggling with the construction of a mapping function, it seems to_categorical doesn't work with a tf.Tensor. Down below is what I have right now, but I am not sure if there's a latent bug.
How can I correctly translate the below Dataset generator into a Dataset.map function?
Current Dataset Generator
This is implemented (and known to work) with Python 3.8 and tensorflow==2.4.4.
from typing import Iterable, Tuple
import numpy as np
import tensorflow as tf
def make_vgg_preprocessing_generator(
dataset: tf.data.Dataset, num_repeat: int = -1
) -> Iterable[Tuple[tf.Tensor, np.ndarray]]:
num_classes = len(dataset.class_names)
for batch_images, batch_labels in dataset.repeat(num_repeat):
pre_images = tf.keras.applications.vgg16.preprocess_input(batch_images)
pre_labels = tf.keras.utils.to_categorical(batch_labels, num_classes)
yield pre_images, pre_labels
train_ds: tf.data.Dataset # Not provided in this sample
model.fit(
    make_vgg_preprocessing_generator(train_ds),
    epochs=10,
    steps_per_epoch=10,  # Required since num_repeat defaults to repeating indefinitely
)
Dataset.map Function
Here is my current translation that I would like to improve.
def vgg_preprocess_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
num_classes = len(dataset.class_names)
def _preprocess(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
pre_x = tf.keras.applications.vgg16.preprocess_input(x)
pre_y = tf.one_hot(y, depth=num_classes)
return pre_x, pre_y
return dataset.map(_preprocess)
Yes, you're on the right track! You'll want to replace to_categorical with tf.one_hot, just as you have: tf.one_hot operates directly on tf.Tensors and is designed for this context. Next, you might want to experiment with some of the other tf.data.Dataset methods and add them to your pipeline. Right now, your batches are single, un-shuffled samples. An example of some other processing you might do:
def vgg_preprocess_dataset(dataset: tf.data.Dataset, batch_size=32, shuffle_buffer=1000) -> tf.data.Dataset:
num_classes = len(dataset.class_names)
def _preprocess(x: tf.Tensor, y: tf.Tensor):
pre_x = tf.keras.applications.vgg16.preprocess_input(x)
pre_y = tf.one_hot(y, depth=num_classes)
        # pre_y = to_categorical(y, num_classes)  # the generator's version; tf.one_hot is the tensor-friendly equivalent
return pre_x, pre_y
    # a bigger shuffle buffer mixes more thoroughly, but costs more memory and fill time
dataset = dataset.shuffle(shuffle_buffer)
# do your mapping after shuffle
dataset = dataset.map(_preprocess)
# next batch it
dataset = dataset.batch(batch_size)
# this allows your CPU to fetch the next batch (do the above shuffling, mapping, etc) during the
# current GPU pass, so that the GPU has minimal downtime
dataset = dataset.prefetch(2)
return dataset
ds = vgg_preprocess_dataset(ds)
# and you just pass it right to fit!
model.fit(ds)
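As a quick sanity check (a sketch, assuming ds yields (image, label) pairs as in the question), you can pull one batch from the mapped dataset and confirm shapes and dtypes before training:
# Pull a single batch from the pipeline and inspect it
images, labels = next(iter(ds))
print(images.shape, images.dtype)  # e.g. (32, 224, 224, 3) float32, preprocessed for VGG16
print(labels.shape, labels.dtype)  # e.g. (32, num_classes) float32, one-hot encoded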
I'm trying to fit my deep learning model with a custom generator.
When i fit the model, it shows me this error:
I tried to find similar questions, but all the answers were about converting lists to numpy arrays. I don't think that's the issue in this error; my lists are all already numpy arrays. This custom generator is based on a custom generator from here
This is the part of code where I fit the model:
train_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
filenames=training_filenames, batch_size=batch_size)
val_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
filenames=validation_filenames, batch_size=batch_size)
self.model_semantic.fit_generator(train_generator,
epochs=10,
verbose=1,
validation_data=val_generator,
)
return 0
where the variables are:
representations_path - a string with the path to the directory where I store the training files; each file there is one input to the model
target_path - a string with the path to the directory where I store the target files; each file there is one target (output) of the model
training_filenames - a list with the names of the training and target files (both have the same name, but they are in different folders)
batch_size - an integer with the size of the batch. It has the value 7.
My generator class is below:
import numpy as np
from tensorflow_core.python.keras.utils.data_utils import Sequence
class RepresentationGenerator(Sequence):
def __init__(self, representation_path, target_path, filenames, batch_size):
self.filenames = np.array(filenames)
self.batch_size = batch_size
self.representation_path = representation_path
self.target_path = target_path
def __len__(self):
return (np.ceil(len(self.filenames) / float(self.batch_size))).astype(np.int)
def __getitem__(self, idx):
files_to_batch = self.filenames[idx * self.batch_size: (idx + 1) * self.batch_size]
batch_x, batch_y = [], []
for file in files_to_batch:
batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True))
batch_y.append(np.load(self.target_path + file + ".npy", allow_pickle=True))
return np.array(batch_x), np.array(batch_y)
These are the values when the method fit is called:
How can I fix this error?
Thank you mates!
When I call the method fit_generator, it calls the method fit.
The method fit calls func.fit, passing the variable y, which is set to None.
The error occurs in this line:
Final solution:
Import from the correct place:
from tensorflow.keras.utils import Sequence
Old answers:
If __getitem__ is never called, the problem might be in __len__. You're not returning an int, you're returning a np.int.
I suggest you try:
def __len__(self):
length = len(self.filenames) // self.batch_size
if len(self.filenames) % self.batch_size > 0:
length += 1
return length
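Equivalently (a sketch), math.ceil already returns a plain Python int in Python 3, so this avoids the np.int as well:
import math

def __len__(self):
    # math.ceil returns a plain Python int, unlike np.ceil(...).astype(np.int)
    return math.ceil(len(self.filenames) / self.batch_size)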
But if __getitem__ is being called and your data returned, then you should inspect your arrays.
Get an item from the generator yourself and check the content:
x, y = train_generator[0]
Are they single arrays? Or are they arrays of arrays? (Must be single)
What are their shapes? Do they have the expected shapes?
What are their types? Usually they should be float, sometimes int (for inputs to embedding layers), very rarely string (for inputs to custom layers that know how to treat strings).
The outputs must always be float, at most int (for sparse losses)
Another supposition: you're using fit with batch_size while using a generator... this is strange, and the "if" clauses inside the method may not be well prepared for it; you might be falling into another training case.
Go straight to the usual options:
self.model_semantic.fit_generator(train_generator,
epochs=10,
verbose=1,
validation_data=val_generator)
Your generator is a Sequence; it already has a __len__, so you don't need to specify steps_per_epoch or validation_steps.
Every Sequence generator yields whole batches: every step is one batch, and that's it. You don't need to specify batch_size in fit_generator.
If you're going to use fit, go like this:
...fit(train_generator, steps_per_epoch = len(train_generator),
epochs = 10, verbose = 1,
validation_data = val_generator, validation_steps = len(val_generator))
Finally, you should be hunting for anything that might be None (as the error message suggests) in your code.
Check if every function has a return line.
Check all inputs of your generator in __init__.
Print the filenames inside the generator.
Get the __len__ of the generator.
Try to get an item from the generator: x, y = train_generator[0]
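Putting those checks together, a quick diagnostic sketch (names follow the question's code):
print(len(train_generator))       # should be a plain int: the number of batches
x, y = train_generator[0]         # fetch the first batch
print(type(x), x.shape, x.dtype)  # expect one ndarray of floats
print(type(y), y.shape, y.dtype)  # dtype == object usually means ragged arrays were stacked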
I'm trying to come up with a Keras model that would do binary classification on image sequences. I split my dataset into train (1200 samples), valid (320 samples) and test (764 samples) partitions.
In the first step, I used InceptionV3 to extract features from my dataset consisting of image sequences. I saved those sequences of extracted features to disk.
The next step is to feed those sequences as input data to my LSTM based model.
Here is a minimal workable example that compiles:
import numpy as np
import copy
import random
from keras.layers.recurrent import LSTM
from keras.layers import Dense, Flatten, Dropout
from keras.models import Sequential
from keras.optimizers import Adam
def get_sequence(sequence_no, sample_type):
# Load sequence from disk according to 'sequence_no' and 'sample_type'
# eg. 'sample_' + sample_type + '_' + sequence_no
return np.zeros([100, 2048]) # added only for demo purposes, so it compiles
def frame_generator(batch_size, generator_type, labels):
# Define list of sample indexes
sample_list = []
for i in range(0, len(labels)):
sample_list.append(i)
# sample_list = [0, 1, 2, 3 ...]
while 1:
X, y = [], []
# Generate batch_size samples
for _ in range(batch_size):
# Reset to be safe
sequence = None
# Get a random sample
random_sample = random.choice(sample_list)
# Get sequence from disk
sequence = get_sequence(random_sample, generator_type)
X.append(sequence)
y.append(labels[random_sample])
yield np.array(X), np.array(y)
# Data mimicking
train_samples_num = 1200
valid_samples_num = 320
labels_train = np.random.randint(0, 2, size=(train_samples_num), dtype='uint8')
labels_valid = np.random.randint(0, 2, size=(valid_samples_num), dtype='uint8')
batch_size = 32
train_generator = frame_generator(batch_size, 'train', labels_train)
val_generator = frame_generator(batch_size, 'valid', labels_valid)
# Model
model = Sequential()
model.add(LSTM(2048, input_shape=(100, 2048), dropout = 0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()
# Training
model.fit_generator(
generator=train_generator,
steps_per_epoch=len(labels_train) // batch_size,
epochs=10,
validation_data=val_generator,
validation_steps=len(labels_valid) // batch_size)
This example was compiled with:
python 3.6.8
keras 2.2.4
tensorflow 1.13.1
This works well, I ran 3 training sessions on this model (same setup, same train/valid/test partitioning) and the test accuracy mean was 96.9%.
Then I started to ask myself whether it's a good idea to always choose a completely random sample from the sample list inside the frame_generator() function, specifically the line
random_sample = random.choice(sample_list). Picking samples completely at random makes it possible that some samples are used far more often than others during a training session, and that some are possibly never used at all. This made me think that the model would generalize less in this setup compared to one where it sees all the samples equally often during training.
As a fix, changes applied to frame_generator():
Make a backup copy of the sample list, then remove a sample from the sample list every time after using it. When the sample list becomes empty, replace the sample list with its backup copy. Do this for the whole training session. This ensures that ALL of the samples are seen by the model at training with almost the same frequency.
This is the new version of frame_generator(); the 4 added lines are marked with # ADDED:
def frame_generator(batch_size, generator_type, labels):
# Define list of sample indexes
sample_list = []
for i in range(0, len(labels)):
sample_list.append(i)
# sample_list = [0, 1, 2, 3 ...]
sample_list_backup = copy.deepcopy(sample_list) # ADDED
while 1:
X, y = [], []
# Generate batch_size samples
for _ in range(batch_size):
# Reset to be safe
sequence = None
# Get a random sample
random_sample = random.choice(sample_list)
sample_list.remove(random_sample) # ADDED
if len(sample_list) == 0: # ADDED
sample_list = copy.deepcopy(sample_list_backup) # ADDED
# Get sequence from disk
sequence = get_sequence(random_sample, generator_type)
X.append(sequence)
y.append(labels[random_sample])
yield np.array(X), np.array(y)
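For reference, the same guarantee (every sample seen with near-equal frequency) is usually achieved by shuffling the index list once per pass rather than removing elements one by one; a minimal sketch with the same signature:
def frame_generator(batch_size, generator_type, labels):
    sample_list = list(range(len(labels)))
    while True:
        random.shuffle(sample_list)  # a new random order for each pass over the data
        for start in range(0, len(sample_list) - batch_size + 1, batch_size):
            X, y = [], []
            for sample in sample_list[start:start + batch_size]:
                X.append(get_sequence(sample, generator_type))
                y.append(labels[sample])
            yield np.array(X), np.array(y)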
What I expected:
Given that the model should now be able to generalize a bit better, since all the samples are seen during training with equal frequency, I expected the test accuracy to increase slightly.
What happened:
I ran 3 training sessions on this model (same train/valid/test partitioning) and the test accuracy mean this time was 96.2%. That is a 0.7% decrease in accuracy from the 1st setup. So it seems that the model now generalizes worse.
Loss plots for each run:
Question:
Why does more randomness in frame_generator() result in better test set accuracy?
Seems rather counterintuitive to me.
I am converting Keras code into PyTorch because I am more familiar with the latter than the former. However, I found that it is not learning (or only barely).
Below I have provided almost all of my PyTorch code, including the initialisation code, so that you can try it out yourself. The only thing you would need to provide yourself is the word embeddings (I'm sure you can find many word2vec models online). The first input file should be a file with tokenised text; the second input file should be a file with floating-point numbers, one per line. Because I have provided all the code, this question may seem huge and too broad. However, my question is specific enough, I think: what is wrong in my model or training loop that causes my model to not improve, or barely improve? (See below for results.)
I have tried to provide many comments where applicable, and I have provided the shape transformations as well so you do not have to run the code to see what is going on. The data prep methods are not important to inspect.
The most important parts are the forward method of the RegressorNet, and the training loop of RegressionNN (admittedly, these names were badly chosen). I think the mistake is there somewhere.
from pathlib import Path
import time
import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import gensim
from scipy.stats import pearsonr
from LazyTextDataset import LazyTextDataset
class RegressorNet(nn.Module):
def __init__(self, hidden_dim, embeddings=None, drop_prob=0.0):
super(RegressorNet, self).__init__()
self.hidden_dim = hidden_dim
self.drop_prob = drop_prob
# Load pretrained w2v model, but freeze it: don't retrain it.
self.word_embeddings = nn.Embedding.from_pretrained(embeddings)
self.word_embeddings.weight.requires_grad = False
self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim, bidirectional=True, dropout=drop_prob)
self.dropout = nn.Dropout(drop_prob)
self.linear = nn.Linear(hidden_dim * 2, 1)
        # LeakyReLU rather than ReLU so that we don't get stuck in dead nodes
self.lrelu = nn.LeakyReLU()
def forward(self, batch_size, sentence_input):
# shape sizes for:
# * batch_size 128
# * embeddings of dim 146
# * hidden dim of 200
# * sentence length of 20
# sentence_input: torch.Size([128, 20])
# Get word2vec vector representation
embeds = self.word_embeddings(sentence_input)
# embeds: torch.Size([128, 20, 146])
# embeds.view(-1, batch_size, embeds.size(2)): torch.Size([20, 128, 146])
# Input vectors into GRU, only keep track of output
w2v_out, _ = self.w2v_rnode(embeds.view(-1, batch_size, embeds.size(2)))
# w2v_out = torch.Size([20, 128, 400])
# Leaky ReLU it
w2v_out = self.lrelu(w2v_out)
# Dropout some nodes
if self.drop_prob > 0:
w2v_out = self.dropout(w2v_out)
        # w2v_out: torch.Size([20, 128, 400])
# w2v_out[-1, :, :]: torch.Size([128, 400])
# Only use the last output of a sequence! Supposedly that cell outputs the final information
regression = self.linear(w2v_out[-1, :, :])
        # regression: torch.Size([128, 1])
return regression
class RegressionRNN:
def __init__(self, train_files=None, test_files=None, dev_files=None):
print('Using torch ' + torch.__version__)
self.datasets, self.dataloaders = RegressionRNN._set_data_loaders(train_files, test_files, dev_files)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model = self.w2v_vocab = self.criterion = self.optimizer = self.scheduler = None
    @staticmethod
def _set_data_loaders(train_files, test_files, dev_files):
# labels must be the last input file
datasets = {
'train': LazyTextDataset(train_files) if train_files is not None else None,
'test': LazyTextDataset(test_files) if test_files is not None else None,
'valid': LazyTextDataset(dev_files) if dev_files is not None else None
}
dataloaders = {
'train': DataLoader(datasets['train'], batch_size=128, shuffle=True, num_workers=4) if train_files is not None else None,
'test': DataLoader(datasets['test'], batch_size=128, num_workers=4) if test_files is not None else None,
'valid': DataLoader(datasets['valid'], batch_size=128, num_workers=4) if dev_files is not None else None
}
return datasets, dataloaders
    @staticmethod
def prepare_lines(data, split_on=None, cast_to=None, min_size=None, pad_str=None, max_size=None, to_numpy=False,
list_internal=False):
""" Converts the string input (line) to an applicable format. """
out = []
for line in data:
line = line.strip()
if split_on:
line = line.split(split_on)
line = list(filter(None, line))
else:
line = [line]
if cast_to is not None:
line = [cast_to(l) for l in line]
if min_size is not None and len(line) < min_size:
# pad line up to a number of tokens
line += (min_size - len(line)) * ['#pad#']
elif max_size and len(line) > max_size:
line = line[:max_size]
if list_internal:
line = [[item] for item in line]
if to_numpy:
line = np.array(line)
out.append(line)
if to_numpy:
out = np.array(out)
return out
def prepare_w2v(self, data):
idxs = []
for seq in data:
tok_idxs = []
for word in seq:
# For every word, get its index in the w2v model.
# If it doesn't exist, use #unk# (available in the model).
try:
tok_idxs.append(self.w2v_vocab[word].index)
except KeyError:
tok_idxs.append(self.w2v_vocab['#unk#'].index)
idxs.append(tok_idxs)
idxs = torch.tensor(idxs, dtype=torch.long)
return idxs
def train(self, epochs=10):
valid_loss_min = np.Inf
train_losses, valid_losses = [], []
for epoch in range(1, epochs + 1):
epoch_start = time.time()
train_loss, train_results = self._train_valid('train')
valid_loss, valid_results = self._train_valid('valid')
# Calculate Pearson correlation between prediction and target
try:
train_pearson = pearsonr(train_results['predictions'], train_results['targets'])
except FloatingPointError:
train_pearson = "Could not calculate Pearsonr"
try:
valid_pearson = pearsonr(valid_results['predictions'], valid_results['targets'])
except FloatingPointError:
valid_pearson = "Could not calculate Pearsonr"
# calculate average losses
train_loss = np.mean(train_loss)
valid_loss = np.mean(valid_loss)
train_losses.append(train_loss)
valid_losses.append(valid_loss)
# print training/validation statistics
print(f'----------\n'
f'Epoch {epoch} - completed in {(time.time() - epoch_start):.0f} seconds\n'
f'Training Loss: {train_loss:.6f}\t Pearson: {train_pearson}\n'
f'Validation loss: {valid_loss:.6f}\t Pearson: {valid_pearson}')
# validation loss has decreased
if valid_loss <= valid_loss_min and train_loss > valid_loss:
print(f'!! Validation loss decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving model ...')
valid_loss_min = valid_loss
if train_loss <= valid_loss:
print('!! Training loss is lte validation loss. Might be overfitting!')
# Optimise with scheduler
if self.scheduler is not None:
self.scheduler.step(valid_loss)
print('Done training...')
def _train_valid(self, do):
""" Do training or validating. """
if do not in ('train', 'valid'):
raise ValueError("Use 'train' or 'valid' for 'do'.")
results = {'predictions': np.array([]), 'targets': np.array([])}
losses = np.array([])
self.model = self.model.to(self.device)
if do == 'train':
self.model.train()
torch.set_grad_enabled(True)
else:
self.model.eval()
torch.set_grad_enabled(False)
for batch_idx, data in enumerate(self.dataloaders[do], 1):
# 1. Data prep
sentence = data[0]
target = data[-1]
curr_batch_size = target.size(0)
# Returns list of tokens, possibly padded #pad#
sentence = self.prepare_lines(sentence, split_on=' ', min_size=20, max_size=20)
# Converts tokens into w2v IDs as a Tensor
sent_w2v_idxs = self.prepare_w2v(sentence)
# Converts output to Tensor of floats
target = torch.Tensor(self.prepare_lines(target, cast_to=float))
# Move input to device
sent_w2v_idxs, target = sent_w2v_idxs.to(self.device), target.to(self.device)
# 2. Predictions
pred = self.model(curr_batch_size, sentence_input=sent_w2v_idxs)
loss = self.criterion(pred, target)
# 3. Optimise during training
if do == 'train':
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# 4. Save results
pred = pred.detach().cpu().numpy()
target = target.cpu().numpy()
results['predictions'] = np.append(results['predictions'], pred, axis=None)
results['targets'] = np.append(results['targets'], target, axis=None)
losses = np.append(losses, float(loss))
torch.set_grad_enabled(True)
return losses, results
if __name__ == '__main__':
HIDDEN_DIM = 200
# Load embeddings from pretrained gensim model
embed_p = Path('path-to.w2v_model').resolve()
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(str(embed_p))
# add a padding token with only zeros
w2v_model.add(['#pad#'], [np.zeros(w2v_model.vectors.shape[1])])
embed_weights = torch.FloatTensor(w2v_model.vectors)
# Text files are used as input. Every line is one datapoint.
# *.tok.low.*: tokenized (space-separated) sentences
# *.cross: one floating point number per line, which we are trying to predict
regr = RegressionRNN(train_files=(r'train.tok.low.en',
r'train.cross'),
dev_files=(r'dev.tok.low.en',
r'dev.cross'),
test_files=(r'test.tok.low.en',
r'test.cross'))
regr.w2v_vocab = w2v_model.vocab
regr.model = RegressorNet(HIDDEN_DIM, embed_weights, drop_prob=0.2)
regr.criterion = nn.MSELoss()
regr.optimizer = optim.Adam(list(regr.model.parameters())[0:], lr=0.001)
regr.scheduler = optim.lr_scheduler.ReduceLROnPlateau(regr.optimizer, 'min', factor=0.1, patience=5, verbose=True)
regr.train(epochs=100)
For the LazyTextDataset, you can refer to the class below.
from torch.utils.data import Dataset
import linecache
class LazyTextDataset(Dataset):
def __init__(self, paths):
# labels are in the last path
self.paths, self.labels_path = paths[:-1], paths[-1]
with open(self.labels_path, encoding='utf-8') as fhin:
lines = 0
for line in fhin:
if line.strip() != '':
lines += 1
self.num_entries = lines
def __getitem__(self, idx):
data = [linecache.getline(p, idx + 1) for p in self.paths]
label = linecache.getline(self.labels_path, idx + 1)
return (*data, label)
def __len__(self):
return self.num_entries
As I wrote before, I am trying to convert a Keras model to PyTorch. The original Keras code does not use an embedding layer; it uses pre-built word2vec vectors per sentence as input. In the model below, there is no embedding layer. The Keras summary looks like this (I don't have access to the base model setup).
Layer (type) Output Shape Param # Connected to
====================================================================================================
bidirectional_1 (Bidirectional) (200, 400) 417600
____________________________________________________________________________________________________
dropout_1 (Dropout) (200, 800) 0 merge_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (200, 1) 801 dropout_1[0][0]
====================================================================================================
The issue is that with identical input, the Keras model works and gets a +0.5 Pearson correlation between predicted and actual labels. The PyTorch model above, though, does not seem to work at all. To give you an idea, here is the loss (mean squared error) and Pearson (correlation coefficient, p-value) after the first epoch:
Epoch 1 - completed in 11 seconds
Training Loss: 1.684495 Pearson: (-0.0006077809280690612, 0.8173368901481127)
Validation loss: 1.708228 Pearson: (0.017794288315261794, 0.4264098054188664)
And after the 100th epoch:
Epoch 100 - completed in 11 seconds
Training Loss: 1.660194 Pearson: (0.0020315421756790806, 0.4400929436716754)
Validation loss: 1.704910 Pearson: (-0.017288118524826892, 0.4396865964324158)
The loss is plotted below (when you look at the Y-axis, you can see the improvements are minimal).
A final indicator that something may be wrong is that, for my 140K lines of input, each epoch takes only 10 seconds on my GTX 1080TI. I feel that this is not much, and I would guess that the optimisation is not working/running. I cannot figure out why, though. The issue will probably be in my training loop or the model itself, but I cannot find it.
Again, something must be going wrong because:
- the Keras model does perform well;
- the training speed is 'too fast' for 140K sentences
- almost no improvements after training.
What am I missing? The issue is more than likely present in the training loop or in the network structure.
TL;DR: Use permute instead of view when swapping axes; see the end of the answer for an intuition about the difference.
About RegressorNet (neural network model)
No need to freeze the embedding layer if you are using from_pretrained. As the documentation states, it does not use gradient updates.
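A quick check of that (a sketch with a hypothetical weight matrix):
import torch
from torch import nn

weights = torch.randn(5, 3)                  # hypothetical pretrained embedding matrix
emb = nn.Embedding.from_pretrained(weights)  # freeze=True is the default
print(emb.weight.requires_grad)              # False: no manual requires_grad needed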
This part:
self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim, bidirectional=True, dropout=drop_prob)
and especially dropout without providing num_layers is totally pointless: RNN dropout is only applied between stacked layers, so a shallow one-layer network has nothing to drop.
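If you do want recurrent dropout, it only takes effect once layers are stacked, e.g. (a sketch with hypothetical sizes):
from torch import nn

# dropout=0.2 is applied between the two stacked GRU layers
gru = nn.GRU(input_size=146, hidden_size=200, num_layers=2, dropout=0.2, bidirectional=True)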
BUG AND MAIN ISSUE: in your forward function you are using view instead of permute, here:
w2v_out, _ = self.w2v_rnode(embeds.view(-1, batch_size, embeds.size(2)))
See this answer and the appropriate documentation for each of those functions, and try this line instead:
w2v_out, _ = self.w2v_rnode(embeds.permute(1, 0, 2))
You may consider using batch_first=True argument during w2v_rnode creation, you won't have to permute indices that way.
Check the documentation of torch.nn.GRU: you are after the last step of the sequence, not after all of the sequences you have there, so you should use:
_, last_hidden = self.w2v_rnode(embeds.permute(1, 0, 2))
but I think this part is fine otherwise.
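Putting those fixes together, a minimal sketch of the model (hypothetical class name; everything else follows the question's code):
import torch
from torch import nn

class RegressorNetFixed(nn.Module):
    def __init__(self, hidden_dim, embeddings):
        super().__init__()
        # from_pretrained freezes the weights by default
        self.word_embeddings = nn.Embedding.from_pretrained(embeddings)
        # batch_first=True: inputs/outputs are (batch, seq, features), so no permute needed
        self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim,
                                bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_dim * 2, 1)

    def forward(self, sentence_input):
        embeds = self.word_embeddings(sentence_input)  # (batch, seq, emb)
        _, last_hidden = self.w2v_rnode(embeds)        # (2, batch, hidden): one row per direction
        features = torch.cat((last_hidden[0], last_hidden[1]), dim=1)  # (batch, 2 * hidden)
        return self.linear(features)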
Data preparation
No offence, but prepare_lines is very unreadable and seems pretty hard to maintain, not to mention spotting an eventual bug (I suppose it lies in here).
First of all, it seems like you are padding manually. Please don't do it that way; use torch.nn.utils.rnn.pad_sequence to work with batches!
In essence, you first encode each word in every sentence as an index pointing into the embedding (as you seem to do in prepare_w2v); after that, you use torch.nn.utils.rnn.pad_sequence and torch.nn.utils.rnn.pack_padded_sequence, or torch.nn.utils.rnn.pack_sequence if the lines are already sorted by length.
Proper batching
This part is very important and it seems you are not doing that at all (and likely this is the second error in your implementation).
PyTorch's RNN cells take inputs not as padded tensors, but as torch.nn.utils.rnn.PackedSequence objects. This is an efficient object storing indices which specify the unpadded length of each sequence.
See more information on the topic here, here, and in many other blog posts throughout the web.
The first sequence in the batch has to be the longest, and all others have to be provided in descending length order. What follows is:
You have to sort your batch by sequence length each time, and sort your targets in an analogous way, OR
Sort your batch, push it through the network, and unsort it afterwards to match it with your targets.
Either is fine; it's your call which seems more intuitive for you.
What I like to do is more or less the following, hope it helps:
Create unique indices for each word and map each sentence appropriately (you've already done it).
Create a regular torch.utils.data.Dataset object returning a single sentence per __getitem__, where it is returned as a tuple consisting of features (torch.Tensor) and label (a single value); it seems like you're doing this as well.
Create a custom collate_fn for use with torch.utils.data.DataLoader, which is responsible for sorting and padding each batch in this scenario (+ it returns the lengths of each sentence, to be passed into the neural network).
Using the sorted and padded features and their lengths, I use torch.nn.utils.rnn.pack_padded_sequence inside the neural network's forward method (do it after embedding!) to push them through the RNN layer.
Depending on the use-case, I unpack them using torch.nn.utils.rnn.pad_packed_sequence. In your case, you only care about the last hidden state, hence you don't have to do that. If you were using all of the hidden outputs (as is the case with, say, attention networks), you would add this part.
When it comes to the third point, here is a sample implementation of collate_fn; you should get the idea:
import torch
def length_sort(features):
# Get length of each sentence in batch
sentences_lengths = torch.tensor(list(map(len, features)))
# Get indices which sort the sentences based on descending length
_, sorter = sentences_lengths.sort(descending=True)
# Pad batch as you have the lengths and sorter saved already
padded_features = torch.nn.utils.rnn.pad_sequence(features, batch_first=True)
return padded_features, sentences_lengths, sorter
def pad_collate_fn(batch):
    # DataLoader returns the batch as a list of (features, label) tuples, unfortunately; check it on your own
features, labels = (
[element[0] for element in batch],
[element[1] for element in batch],
)
padded_features, sentences_lengths, sorter = length_sort(features)
    # Sort features, labels and lengths consistently (descending sentence length)
    sorted_padded_features, sorted_labels, sorted_lengths = (
        padded_features[sorter],
        torch.tensor(labels)[sorter],
        sentences_lengths[sorter],
    )
    return sorted_padded_features, sorted_labels, sorted_lengths
Use those as collate_fn in your DataLoaders and you should be just about fine (maybe with minor adjustments, so it's essential you understand the idea behind it).
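For the fourth point, a minimal sketch of a forward pass consuming those sorted, padded batches (hypothetical module; lengths arrive in descending order from the collate_fn above):
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedRegressor(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_dim * 2, 1)

    def forward(self, padded_sentences, lengths):
        embeds = self.embedding(padded_sentences)          # (batch, seq, emb)
        # pack after embedding; lengths must already be in descending order
        packed = pack_padded_sequence(embeds, lengths, batch_first=True)
        _, last_hidden = self.gru(packed)                  # (2, batch, hidden)
        return self.linear(torch.cat((last_hidden[0], last_hidden[1]), dim=1))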
Other possible problems and tips
Training loop: a great place for a lot of small errors; you may want to minimise those by using PyTorch Ignite. I am having an unbelievably hard time going through your Tensorflow-like-Estimator-like-API-like training loop (e.g. self.model = self.w2v_vocab = self.criterion = self.optimizer = self.scheduler = None and the like). Please don't do it this way: separate each task (data creation, data loading, data preparation, model setup, training loop, logging) into its own respective module. All in all, there is a reason why PyTorch/Keras is more readable and sanity-preserving than Tensorflow.
Make the first row of your embedding equal to a vector containing zeros: by default, torch.nn.functional.embedding expects the first row to be used for padding. Hence you should start your unique indexing for each word at 1, or specify padding_idx with a different value (though I highly discourage this approach; it's confusing at best).
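A one-line sketch of that setup (hypothetical sizes):
from torch import nn

vocab_size, emb_dim = 10000, 146  # hypothetical vocabulary and embedding sizes
# Row 0 is reserved for padding: it stays a zero vector and receives no gradient updates
embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)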
I hope this answer helps you at least a little bit; if something is unclear, post a comment below and I'll try to explain it from a different perspective/in more detail.
Some final comments
This code is not reproducible, nor is the question specific. We don't have the data you are using, nor your word vectors; the random seed is not fixed, etc.
PS. One last thing: check your performance on a really small subset of your data (say, 96 examples); if it does not converge, it is very likely you indeed have a bug in your code.
About the times: they are probably off (due to not sorting and not padding, I suppose); usually Keras's and PyTorch's times are quite similar (if I understood this part of your question as intended) for correct and efficient implementations.
Permute vs view vs reshape explanation
This simple example shows the differences between permute() and view(). The first one swaps axes, while the second does not change the memory layout, just chunks the array into the desired shape (if possible).
import torch
a = torch.tensor([[1, 2], [3, 4], [5, 6]])
print(a)
print(a.permute(1, 0))
print(a.view(2, 3))
And the output would be:
tensor([[1, 2],
[3, 4],
[5, 6]])
tensor([[1, 3, 5],
[2, 4, 6]])
tensor([[1, 2, 3],
[4, 5, 6]])
reshape is almost like view; it was added for those coming from numpy, so it's easier and more natural for them, but it has one important difference:
view never copies data and works only on contiguous memory (so after a permutation like the one above, your data may not be contiguous, hence access to it might be slower)
reshape can copy data if needed, so it will also work for non-contiguous arrays.
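A short demonstration of that difference (a sketch; the commented-out line raises a RuntimeError):
import torch

a = torch.tensor([[1, 2], [3, 4], [5, 6]])
b = a.permute(1, 0)               # swaps axes: b is a non-contiguous view of a's data
# b.view(3, 2)                    # would raise RuntimeError: tensor is not contiguous
print(b.reshape(3, 2))            # works: copies into a contiguous layout when needed
print(b.contiguous().view(3, 2))  # the equivalent manual fix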