I'm using PyTorch to implement a classification network for skeleton-based action recognition. The model consists of three convolutional layers and two fully connected layers. This base model gave me an accuracy of around 70% on the NTU RGB+D dataset. I wanted to learn more about batch normalization, so I added batch normalization to all the layers except the last one. To my surprise, the evaluation accuracy dropped to 60% rather than increasing, but the training accuracy increased from 80% to 90%. Can anyone tell me what I am doing wrong? Or does adding batch normalization not necessarily increase the accuracy?
The model with batch normalization
import torch
import torch.nn as nn


class BaseModelV0p2(nn.Module):
    def __init__(self, num_person, num_joint, num_class, num_coords):
        super().__init__()
        self.name = 'BaseModelV0p2'
        self.num_person = num_person
        self.num_joint = num_joint
        self.num_class = num_class
        self.channels = num_coords
        self.out_channel = [32, 64, 128]
        self.loss = loss      # loss and metric are assumed to be defined elsewhere
        self.metric = metric
        self.bn_momentum = 0.01

        self.bn_cv1 = nn.BatchNorm2d(self.out_channel[0], momentum=self.bn_momentum)
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels=self.channels, out_channels=self.out_channel[0],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv1,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_cv2 = nn.BatchNorm2d(self.out_channel[1], momentum=self.bn_momentum)
        self.conv2 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[0], out_channels=self.out_channel[1],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv2,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_cv3 = nn.BatchNorm2d(self.out_channel[2], momentum=self.bn_momentum)
        self.conv3 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[1], out_channels=self.out_channel[2],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv3,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_fc1 = nn.BatchNorm1d(256 * 2, momentum=self.bn_momentum)
        self.fc1 = nn.Sequential(nn.Linear(self.out_channel[2] * 8 * 3, 256 * 2),
                                 self.bn_fc1,
                                 nn.ReLU(),
                                 nn.Dropout2d(p=0.5))  # TO check

        self.fc2 = nn.Sequential(nn.Linear(256 * 2, self.num_class))

    def forward(self, input):
        list_bn_layers = [self.bn_fc1, self.bn_cv3, self.bn_cv2, self.bn_cv1]
        # set the momentum of the batch norm layers to the given momentum value during training and 0 during evaluation
        # ref: https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561
        # ref: https://github.com/pytorch/pytorch/issues/4741
        for bn_layer in list_bn_layers:
            if self.training:
                bn_layer.momentum = self.bn_momentum
            else:
                bn_layer.momentum = 0

        logits = []
        for i in range(self.num_person):
            out = self.conv1(input[:, :, :, :, i])
            out = self.conv2(out)
            out = self.conv3(out)
            logits.append(out)

        out = torch.max(logits[0], logits[1])
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.fc2(out)

        t = out
        assert not ((t != t).any())  # find out nan in tensor
        assert not (t.abs().sum() == 0)  # find out 0 tensor

        return out
My interpretation of the phenomenon you are observing is that instead of reducing the covariate shift, which is what Batch Normalization is meant for, you are increasing it. In other words, instead of decreasing the distribution differences between train and test, you are increasing them, and that is what is causing the bigger gap between train and test accuracy. Batch Normalization does not guarantee better performance; for some problems it simply does not work well. I have several ideas that could lead to an improvement:
Increase the batch size if it is small, which would help the mean and std calculated in the Batch Norm layers be more robust estimates of the population parameters.
Decrease the bn_momentum parameter a bit, to see if that also stabilizes the Batch Norm parameters.
I am not sure you should set bn_momentum to zero at test time; I think you should just call model.train() when you want to train and model.eval() when you want to use your trained model to perform inference (a minimal sketch follows this list).
You could alternatively try Layer Normalization instead of Batch Normalization, because it does not require accumulating any statistics and usually works well.
Try regularizing your model a bit with dropout.
Make sure you shuffle your training set every epoch. Not shuffling the data set may lead to correlated batches that make the batch normalization statistics cycle, which may hurt your generalization.
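For the model.train()/model.eval() point, here is a minimal sketch (assuming the BaseModelV0p2 class from the question, with example constructor arguments, and with the momentum manipulation removed from forward) of relying on the built-in mode switch instead:

model = BaseModelV0p2(num_person=2, num_joint=25, num_class=60, num_coords=3)  # example args

# training: BatchNorm normalizes with batch statistics and updates the running estimates
model.train()
# ... forward pass, loss.backward(), optimizer.step() ...

# evaluation: BatchNorm normalizes with the accumulated running estimates
model.eval()
with torch.no_grad():
    eval_logits = model(eval_batch)  # eval_batch is a placeholder tensor here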
I hope some of these ideas work for you.
The problem may be with your momentum. I see you are using 0.01.
Here is how I tried different betas when fitting points with momentum: with beta = 0.01 I got bad results. Usually beta = 0.1 is used.
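For reference, a one-line sketch of PyTorch's default: BatchNorm layers use momentum=0.1, where the running statistics are updated as running = (1 - momentum) * running + momentum * batch_stat.

import torch.nn as nn

bn = nn.BatchNorm2d(32, momentum=0.1)  # 32 channels is just an example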
This most likely happens for two major reasons: 1. a non-stationary training procedure and 2. different train/test distributions.
If possible, try other regularization techniques such as dropout. I faced this problem and found that my test and train distributions might be different, so after I removed BN and used dropout instead, I got reasonable results (a small sketch of such a block follows). Read this for more.
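A minimal sketch (the layer sizes are illustrative, not from the question) of a convolutional block that uses spatial dropout in place of batch normalization:

import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.3),                      # spatial dropout instead of BatchNorm2d
    nn.MaxPool2d(kernel_size=2, stride=2),
)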
Use nn.BatchNorm2d(out_channels, track_running_stats=False). This disables the running statistics and uses the current batch's mean and variance to do the normalization.
In training mode, run some forward passes on data inside a with torch.no_grad() block. This stabilizes the running_mean / running_var values.
Use the same batch_size for both model.train() and model.eval().
Increase the momentum of the BN layers, so that the means and stds learned are more stable during training.
This is helpful whenever you use a pre-trained model:
for child in model.children():
    for ii in range(len(child)):
        if isinstance(child[ii], nn.BatchNorm2d):
            child[ii].track_running_stats = False
I have a dataset consisting of small RGB images. Each image is split into a specific number of patches, each of which is then resized and blurred (Gaussian). The input of my model (see Thermal Image Enhancement using CNN (10.1109/IROS.2016.7759059), a shallow 3-layer network for increasing the resolution and handling blurring issues in thermal images) is the resized + blurred patch, while the expected result is just the resized patch.
The network is rather simple:
from collections import OrderedDict

import torch.nn as nn


class TEN_Network(nn.Module):
    def __init__(self) -> None:
        super(TEN_Network, self).__init__()
        self.model = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(1, 64, 7, stride=1, padding=3)),
            ('relu1', nn.ReLU(True)),
            ('conv2', nn.Conv2d(64, 32, 5, stride=1, padding=2)),
            ('relu2', nn.ReLU(True)),
            ('conv3', nn.Conv2d(32, 32, 3, stride=1, padding=1)),
            ('relu3', nn.ReLU(True)),
            ('conv4', nn.Conv2d(32, 1, 3, stride=1, padding=1))
        ]))

    def forward(self, x):
        x = self.model(x)
        return x
I picked this CNN for numerous reasons, simplicity being one of them. Since I am quite new to neural networks and PyTorch, I thought it would provide a nice playground.
My question is this: given that each sample is split into patches, should I zero the gradients (and respectively step() the optimizer) for each patch, should I calculate the average loss (sum of all losses for all patches in a sample divided by the number of patches), or should I run these two steps once per sample, at the beginning of training on the sample and after all its patches have been processed (resulting in the above-mentioned average loss per sample)?
Currently I have the following (pseudo-code, optimizer is Adam, loss is MSE):
# ...
for epoch_id in range(0, epochs_total):
    # Load dataset with dataloader
    dataloader = DataLoader(dataset=custom_dataset, ...)

    # Train
    for sample_expected, sample_input in next(iter(dataloader)):
        loss_sample_avg = 0.0
        patches_count = len(sample_expected)
        # optimizer.zero_grad() <---- HERE(1)
        for patch_expected, patch_input in zip(sample_expected, sample_input):
            # optimizer.zero_grad() <---- HERE(2)
            # ...
            patch_predicted = model(patch_input)
            loss = criterion(patch_predicted, patch_expected)
            loss_sample_avg += loss.item()
            loss.backward()
            # optimizer.step() <---- HERE(2)?
        # optimizer.step() <---- HERE(1)

    # Validate
    ...
This is probably me not getting something very basic. I know that:
an optimizer step() should always be followed by a zero_grad()
a backward() call is possible only after another forward() pass has been processed
Since the paper does not provide any code (I tried reaching out to the authors but, as expected, nothing came of it), I am trying to figure out how to implement it myself.
You should pass all patches as a batch. Performing a gradient step on a single patch constitutes pure stochastic gradient descent, which is generally not preferred because it yields a very noisy estimate of the desired gradient. Furthermore, looping over each patch from the image is computationally inefficient.
At a minimum, batch all patches from one image together. Ideally, though, you would pass patches from multiple images as a batch. So patch_input and patch_expected should each be [batch_size x color_channels x patch_height x patch_width] in size, where batch_size is the number of sampled patches per image times the number of images in a batch.
Likely this will require very little modification of your code other than adding a few lines to collate the inputs and targets in their existing form into a single tensor each. I'd provide specifics, but it isn't evident what form your targets and inputs currently take.
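A minimal sketch of the per-image variant (assuming sample_input and sample_expected are sequences of [C x H x W] patch tensors for one image, and model, criterion, and optimizer are as in the question):

import torch

for sample_expected, sample_input in dataloader:
    batch_input = torch.stack(list(sample_input))        # [num_patches, C, H, W]
    batch_expected = torch.stack(list(sample_expected))  # same shape as batch_input

    optimizer.zero_grad()
    prediction = model(batch_input)                       # one forward pass over all patches
    loss = criterion(prediction, batch_expected)          # MSE averaged over the whole batch
    loss.backward()
    optimizer.step()

Collating patches from several images is the same idea, just stacking across images as well before the forward pass.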
I am currently trying to train two models (VGG16 and ResNet18) on two datasets (MNIST and CIFAR10). The goal is to later test the effect different changes (like another loss function, or a manipulated dataset) have on the accuracy of the model. To make my results comparable I tried to make the learning process deterministic. To achieve this I set a fixed seed for all the random generators with the following code.
import os
import random

import numpy as np
import torch


def update_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    os.environ['PYTHONHASHSEED'] = str(seed)
And for the ResNet18 model this works perfectly fine (the results are deterministic). But for the VGG16 model it does not. And that is the point I don't understand: why is the above enough for ResNet18 to be deterministic, but not for VGG16?
So where is this extra randomness for VGG16 coming from, and how can I disable it?
To get VGG16 deterministic I currently have to disable CUDA and use the CPU only, but this makes the whole computing process very slow and is therefore not really an option.
The only difference between the two models is the setup seen below and the learning rate when using CIFAR10.
import torch.nn as nn
from torchvision import models


def setup_vgg16(is_mnist_used):
    vgg16_model = models.vgg16()
    if is_mnist_used:
        vgg16_model.features[0] = nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1)
    vgg16_model.classifier[-1] = nn.Linear(4096, 10, bias=True)
    return vgg16_model


def setup_resnet(is_mnist_used):
    resnet_model = models.resnet18()
    if is_mnist_used:
        resnet_model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    resnet_model.fc = nn.Linear(512, 10, bias=True)
    return resnet_model
What I have already tried (but with no success):
Adding bias=False to the VGG16 model, as it is the obvious difference between the two models.
Testing the model before training (maybe the model is initialized with random values), but without training the model is deterministic.
Adding more things to the update_seed(seed) function:
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled = False (these two just decrease the performance)
torch.use_deterministic_algorithms(True) -> this results in a CUDA error
Setting num_worker=0 in the dataloader (this was suggested as a workaround for a similar problem in another thread).
This is the training function. Before this function the model is deterministic, and after it is called for the first time VGG16 is no longer deterministic.
def train_loop(dataloader, f_model, f_loss_fn, f_optimizer):
    # setting the model into the train mode
    f_model.train()

    for batch, (x, y) in tqdm(enumerate(dataloader)):
        # Moving the data to the same device as the model
        x, y = x.to(device), y.to(device)

        # Compute prediction and loss
        pred = f_model(x)
        loss = f_loss_fn(pred, y)

        # Backpropagation
        f_optimizer.zero_grad()
        loss.backward()
        f_optimizer.step()
I think that's because torchvision's VGG models use AdaptiveAvgPool2d, and AdaptiveAvgPool2d cannot be made deterministic; it will throw a runtime error when used together with torch.use_deterministic_algorithms(True).
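One possible workaround sketch (an assumption on my part, not from the answer above): if you feed 224x224 inputs, the feature maps reaching the adaptive pool are already 7x7, so the layer is a no-op and can be swapped for an identity, taking the non-deterministic kernel out of the graph entirely.

import torch.nn as nn
from torchvision import models

vgg16_model = models.vgg16()
# AdaptiveAvgPool2d((7, 7)) does nothing to 7x7 feature maps, so replace it
vgg16_model.avgpool = nn.Identity()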
Please add a minimal comment on your thoughts so that I can improve my query. Thank you. :-)
I'm trying to understand and implement a research work on Triple Attention Learning, which consists of:
- channel-wise attention (a)
- element-wise attention (b)
- scale-wise attention (c)
The mechanism is integrated experimentally into the DenseNet model. The architecture diagram of the whole model is here. The channel-wise attention module is simply the squeeze-and-excitation block, which gives a sigmoid output that is passed on to the element-wise attention module. Below is the more precise feature-flow diagram of these modules (a, b, and c).
Theory
For the most part I was able to understand and implement it, but I was a bit lost in the element-wise attention section (part b of the above diagram). This is where I need your assistance. :-)
Here is a little theory on this topic to give you a rough idea of what this is all about. Please note, the paper is not openly accessible now, but at the early stage of its release on the publisher's page it was free to get and I saved it at that time. To be fair to all, I'm sharing it with you: Link. Anyway, the paper (Section 4.3) shows:
So, first of all, the f(att) function (in the first in-place diagram, left-middle part, or b) consists of three convolution layers: 512 kernels with 1 x 1, 512 kernels with 3 x 3, and C kernels with 1 x 1, where C is the number of classes. And with Softmax activation!
Next, it is applied to the output of the channel-wise attention module, which, as mentioned, is simply an SENet module and gives a sigmoid probability score, i.e. X(CA). So, from the f(att) function we get C softmax probability scores, each of which gets multiplied with the sigmoid output, finally producing the feature maps A (according to equation 4 of the above diagram).
Second, there are C linear classifiers, implemented as a 1 x 1 convolution layer with C kernels. This layer is also applied to the SENet module's output, i.e. X(CA), to each feature vector pixel-wise, and in the end it gives an output of feature maps S (equation 5 of the diagram below).
And third, they element-wise multiply each confidence score (of S) with the corresponding attention element of A. This multiplication is on purpose: they do it to prevent unnecessary attention on the feature maps. To make it effective, they also use a weighted cross-entropy loss function, minimized between the classification ground truth and the score vector.
My Query
Mostly, I don't properly get the minimization strategy in the middle of the network. I would like someone to give me a proper understanding and implementation of this element-wise attention mechanism, in detail, as proposed in the mentioned paper (Section 4.3).
Implement
Here is some minimal code to get started. It should be enough, I guess. This is a shallow implementation, but it is far from the original element-wise module and I'm not sure how to implement it properly. For now, I want it as a layer that can be plugged into any model. I was trying with MNIST and a simple conv net.
In summary, for MNIST, we should have a network that contains both the channel-wise and the element-wise attention modules, followed by a final 10-unit softmax layer. So, for example:
Net: Conv2D - Attentions-Module - GAP - Softmax(10)
The Attention-Module consists of two parts, channel-wise and element-wise, and the element-wise part is supposed to have a Softmax too, minimizing a weighted CE loss function between the ground truth and the score vector coming from this module (according to the paper, as described above). The module also passes weighted feature maps to the subsequent layers. For more clarity, here is a simple schematic diagram of what we're looking for.
OK, for the channel-wise attention, which should give us a single probability score (sigmoid), let's use a fake layer for now for simplicity:
class FakeSE(tf.keras.layers.Layer):
    def __init__(self):
        super(FakeSE, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(10, padding='same',
                                           kernel_size=3)

    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        return tf.math.sigmoid(x)
And for the element-wise attention part, the following is the failed attempt so far:
class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ElementWiseAttention, self).__init__()
        # for simplicity the f(attn) function here has 2 convolutions instead of 3
        # self.conv1 and self.conv2
        self.conv1 = tf.keras.layers.Conv2D(16,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.silu)
        self.conv2 = tf.keras.layers.Conv2D(10,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # fake SENet or channel-wise attention module
        self.cam = FakeSE()
        # a linear layer
        self.linear = tf.keras.layers.Conv2D(10,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)

    def call(self, inputs):
        # 2 stacked conv layers (in the paper it's 3; we use 2 for simplicity)
        # this is the f(att)
        x = self.conv1(inputs)
        x = self.conv2(x)

        # this is the A = f(att)*X(CA)
        camx = self.cam(x) * x

        # this is S = X(CA)*Linear_Classifier
        linx = self.cam(self.linear(inputs))

        # element-wise multiply to prevent unnecessary attention
        # supposed to be minimized with a weighted cross-entropy loss
        out = tf.multiply(camx, linx)
        return out
The above is the layer of interest. If I understand the paper correctly, this layer should not only minimize a weighted loss function between the ground truth and the score vector but also produce some weighted feature maps (2D).
Run
Here is the toy data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32, 32])  # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)

# Model
input = tf.keras.Input(shape=(32, 32, 1))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top=False,
                                          input_tensor=input)

em = ElementWiseAttention()(efnet.output)

# Now we apply global max pooling.
gap = tf.keras.layers.GlobalMaxPooling2D()(em)

# classification layer.
output = tf.keras.layers.Dense(10, activation='softmax')(gap)

# bind all
func_model = tf.keras.Model(efnet.input, output)

func_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=tf.keras.metrics.CategoricalAccuracy(),
    optimizer=tf.keras.optimizers.Adam())

# fit
func_model.fit(x_train, y_train, batch_size=32, epochs=3, verbose=1)
Understanding the element-wise attention
When the paper introduces the method, it says:
The attention modules aim to exploit the relationship between disease labels and (1) diagnosis-specific feature channels, (2) diagnosis-specific locations on images (i.e. the regions of thoracic abnormalities), and (3) diagnosis-specific scales of the feature maps.
(1), (2), (3) correspond to channel-wise attention, element-wise attention, and scale-wise attention.
We can tell that element-wise attention is meant to deal with disease location & weight info, i.e. at each location on the image, how likely it is that there is a disease, as is mentioned again when the paper introduces the element-wise attention:
The element-wise attention learning aims to enhance the sensitivity of feature representations to thoracic abnormal regions, while suppressing the activations when there is no abnormality.
OK, we could easily get location & weight info for one disease, but we have multiple diseases:
Since there are multiple thoracic diseases, we choose to estimate an element-wise attention map for each category in this work.
We can store the location & weight info for multiple diseases by using a tensor A with shape (height, width, number of diseases):
The all-category attention map is denoted by A ∈ R^(H×W×C), where each element a_ijc is expected to represent the relative importance at location (i, j) for identifying the c-th category of thoracic abnormalities.
And we have linear classifiers to produce a tensor S with the same shape as A, which can be interpreted as:
at each location on the feature maps X(CA), how confident the linear classifiers are that a certain disease is present at that location.
Now we element-wise multiply S and A to get M, i.e. we:
prevent the attention maps from paying unnecessary attention to those location with non-existent labels
So after all that, we get a tensor M which tells us:
the location & weight info about certain diseases that the linear classifiers are confident about.
Then, if we do global average pooling over M, we get a predicted weight for each disease; adding another softmax (or sigmoid) gives a predicted probability for each disease.
Now, since we have labels and predictions, we can naturally minimize a loss function to optimize the model.
Implementation
The following code was tested on Colab and shows how to implement channel-wise attention and element-wise attention, and how to build and train a simple model based on your code with DenseNet121 and without scale-wise attention:
import tensorflow as tf
import numpy as np

ALPHA = 1 / 16
C = 10
D = 128


class ChannelWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ChannelWiseAttention, self).__init__()
        # squeeze
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        # excitation
        self.fc0 = tf.keras.layers.Dense(int(ALPHA * D), use_bias=False, activation=tf.nn.relu)
        self.fc1 = tf.keras.layers.Dense(D, use_bias=False, activation=tf.nn.sigmoid)
        # reshape so we can do channel-wise multiplication
        self.rs = tf.keras.layers.Reshape((1, 1, D))

    def call(self, inputs):
        # calculate channel-wise attention vector
        z = self.gap(inputs)
        u = self.fc0(z)
        u = self.fc1(u)
        u = self.rs(u)
        return u * inputs


class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ElementWiseAttention, self).__init__()
        # f(att)
        self.conv0 = tf.keras.layers.Conv2D(512,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv1 = tf.keras.layers.Conv2D(512,
                                            kernel_size=3,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv2 = tf.keras.layers.Conv2D(C,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # linear classifier
        self.linear = tf.keras.layers.Conv2D(C,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)
        # for calculating the score vector used to train the element-wise attention module
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        self.sfm = tf.keras.layers.Softmax()

    def call(self, inputs):
        # f(att)
        a = self.conv0(inputs)
        a = self.conv1(a)
        a = self.conv2(a)
        # confidence score
        s = self.linear(inputs)
        # element-wise multiply to prevent unnecessary attention
        m = s * a
        # used to minimize the weighted cross-entropy loss
        y_hat = self.gap(m)
        # could also use sigmoid as in the paper
        out = self.sfm(y_hat)
        return m, out


(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32, 32])  # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)

# Model
input = tf.keras.Input(shape=(32, 32, 1))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top=False,
                                          input_tensor=input)

xca = ChannelWiseAttention()(efnet.get_layer("conv3_block1_0_bn").output)
m, output = ElementWiseAttention()(xca)

# bind all
func_model = tf.keras.Model(efnet.input, output)

func_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=tf.keras.metrics.CategoricalAccuracy(),
    optimizer=tf.keras.optimizers.Adam())

# fit
func_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose=1)
PS: Serendipitously, I also answered another question of yours related to this paper a few months back:
How to place custom layer inside a in-built pre trained model?
My question is regarding the use of autoencoders (in PyTorch). I have a tabular dataset with a categorical feature that has 10 different categories. The names of these categories are quite different - some consist of one word, some of two or three words - but all in all I have 10 unique category names. What I'm trying to do is create an autoencoder which will encode the names of these categories - for example, if I have a category named 'Medium size class', I want to see if it is possible to train the autoencoder to encode this name as something like 'mdmsc'. The use of it would be to find out which data points are hard to encode, are not typical, or something like that. I tried to adapt autoencoder architectures from various tutorials online, but nothing seems to work for me, or I simply do not know how to use them, as they are all about images. Maybe someone has an idea how this type of autoencoder might be accomplished, if it is possible at all?
Edit: here's the model I have so far (I just tried to adapt some architectures I found online):
import torch.nn as nn


class Autoencoder(nn.Module):
    def __init__(self, input_shape, encoding_dim):
        super(Autoencoder, self).__init__()
        self.encode = nn.Sequential(
            nn.Linear(input_shape, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True),
            nn.Linear(64, encoding_dim),
        )
        self.decode = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True),
            nn.Linear(128, input_shape)
        )

    def forward(self, x):
        x = self.encode(x)
        x = self.decode(x)
        return x


model = Autoencoder(input_shape=10, encoding_dim=5)
I also use LabelEncoder() and then OneHotEncoder() to give these features/categories I mentioned a numerical form. However, after training, the output is the same as the input (no change in the category name), and when I try to use only the encoder part I'm unable to apply LabelEncoder() and then OneHotEncoder() because of dimension issues. I feel like maybe I can do something differently at the beginning, where I give those features numerical form, but I'm not sure what I should do.
First you will need to set up a train_loader, depending on your data, that will iterate over your data points.
Then you need to figure out what kind of loss and optimizer you are going to use:
import torch.nn as nn
import torch.optim as optim

# mean-squared error loss
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=.001)  # learning rate depends on your task
Once you have that ready, you can train your autoencoder with basic steps:
for epoch in range(epochs):
    for features in train_loader:
        optimizer.zero_grad()
        outputs = model(features)
        train_loss = criterion(outputs, features)
        train_loss.backward()
        optimizer.step()
Once the model is done training, you can examine the embeddings using:
embedding = model.encode(your_input)
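To sidestep the dimension issue from the question, here is a minimal sketch (the category list and name are hypothetical) of building a one-hot input for a single category by hand and passing it through the encoder only:

import torch

categories = ['Medium size class', 'Small size class']  # hypothetical category names
num_categories = 10                                      # as in the question
idx = categories.index('Medium size class')              # what LabelEncoder would give you

one_hot = torch.zeros(1, num_categories)                 # a batch with a single sample
one_hot[0, idx] = 1.0                                    # what OneHotEncoder would give you

model.eval()
with torch.no_grad():
    embedding = model.encode(one_hot)                    # shape: [1, encoding_dim]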
Sorry if I do not present my problem clearly, English is not my first language.
Problem
Short description:
I want to train a model which maps an input x (with shape [n_sample, timestamp, feature]) to an output y (with exactly the same shape). It's like mapping between 2 spaces.
Longer version:
I have 2 float ndarrays of shape [n_sample, timestamp, feature], representing the MFCC features of n_sample audio files. These 2 ndarrays are 2 speakers' speech of the same corpus, aligned by DTW. Let's name these 2 arrays x and y. I want to train a model which predicts y[k] given x[k]. It's like mapping from space x to space y, and the output must have exactly the same shape as the input.
What I've tried
It's a time-series problem, so I decided to use an RNN approach. Here is my code in PyTorch (I put comments along the code; I removed the calculation of the average loss for simplicity). Note that I've tried many options for the learning rate, and the behaviour stays the same.
Class definition:
class Net(nn.Module):
    def __init__(self, in_size, hidden_size, out_size, nb_lstm_layers):
        super().__init__()
        self.in_size = in_size
        self.hidden_size = hidden_size
        self.out_size = out_size
        self.nb_lstm_layers = nb_lstm_layers

        # self.fc1 = nn.Linear()
        self.lstm = nn.LSTM(input_size=self.in_size, hidden_size=self.hidden_size, num_layers=self.nb_lstm_layers, batch_first=True, bias=True)
        # self.fc = nn.Linear(self.hidden_size, self.out_size)
        self.fc1 = nn.Linear(self.hidden_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, self.out_size)

    def forward(self, x, h_state):
        out, h_state = self.lstm(x, h_state)
        output_fc = []
        for frame in out:
            output_fc.append(self.fc3(torch.tanh(self.fc1(frame))))  # I added fully connected layer to each frame, to make an output with same shape as input
        return torch.stack(output_fc), h_state

    def hidden_init(self):
        if use_cuda:
            h_state = torch.stack([torch.zeros(nb_lstm_layers, batch_size, 20) for _ in range(2)]).cuda()
        else:
            h_state = torch.stack([torch.zeros(nb_lstm_layers, batch_size, 20) for _ in range(2)])
        return h_state
Training step:
net = Net(20, 20, 20, nb_lstm_layers)
optimizer = optim.Adam(net.parameters(), lr=0.0001, weight_decay=0.0001)
criterion = nn.MSELoss()

for epoch in range(nb_epoch):
    count = 0
    loss_sum = 0
    batch_x = None
    for i in range(len(data)):
        # data is my entire data, which contains the x and y arrays I specified above.
        temp_x = torch.tensor(data[i][0])
        temp_y = torch.tensor(data[i][1])

        for ii in range(0, data[i][0].shape[0] - nb_frame_in_batch * 2 + 1):  # Create batches
            batch_x, batch_y = get_batches(temp_x, temp_y, ii, batch_size, nb_frame_in_batch)
            # this will return 2 tensors of shape (batch_size, nb_frame_in_batch, 20),
            # with `batch_size` the number of samples fed to the net each time,
            # and nb_frame_in_batch the number of frames in each sample
            optimizer.zero_grad()
            h_state = net.hidden_init()

            prediction, h_state = net(batch_x.float(), h_state)
            loss = criterion(prediction.float(), batch_y.float())

            h_state = (h_state[0].detach(), h_state[1].detach())
            loss.backward()
            optimizer.step()
The problem is that the loss does not seem to decrease; it just fluctuates a lot, without a clear pattern.
Please help me. Any suggestion will be greatly appreciated. If somebody can inspect my code and provide some comments, that would be very kind.
Thanks in advance!
It seems the network is learning nothing from your data, hence the loss fluctuation (since the weights then depend on the random initialization only). There are some things you can try:
Try to normalize the data (this suggestion is quite broad, and I can't give you more details since I don't have your data, but normalizing it to a specific range like [0, 1], or to a given mean and std, is worth trying; see the sketch after this list).
One very typical problem with LSTM in PyTorch is that its input dimensions are quite different from other types of neural network: you must feed the network a tensor of shape (seq_len, batch, input_size), or (batch, seq_len, input_size) if you set batch_first=True as in your code. See the LSTM section of the docs for details.
One more thing: try to tune your hyperparameters. LSTM is harder to train compared to FC or CNN (in my experience).
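A minimal normalization sketch (assuming x and y are NumPy arrays of MFCC features shaped [n_sample, timestamp, feature]; computing the statistics over samples and timesteps is just one reasonable choice):

import numpy as np

mean = x.mean(axis=(0, 1), keepdims=True)        # per-feature mean
std = x.std(axis=(0, 1), keepdims=True) + 1e-8   # per-feature std, guarded against zero
x_norm = (x - mean) / std
y_norm = (y - mean) / std                        # reuse the input statistics, or compute y's own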
Tell me if you see any improvement. Debugging a neural network is always hard and full of potential coding mistakes.
With most ML algorithms it is tough to diagnose without seeing the data. Based on the inconsistency of your loss results, this might be an issue with your data pre-processing. Have you tried normalizing the data first? Often, with large fluctuations in results, one of your input neuron values may be skewing your loss function, making it unable to find a good direction.
How to normalize a NumPy array to within a certain range?
This is an example for audio normalization, but I would also try adjusting the learning rate, as it looks high, and possibly removing a hidden layer.
Maybe the problem is in the calculation of the loss. Try to sum the losses of each time step in a sequence and then take the average over the batch. It may help.
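A minimal sketch of that idea (assuming prediction and batch_y are shaped (batch_size, nb_frame_in_batch, 20) as in the question):

import torch.nn as nn

criterion = nn.MSELoss(reduction='none')          # keep per-element losses
loss_elems = criterion(prediction.float(), batch_y.float())
loss = loss_elems.sum(dim=(1, 2)).mean()          # sum over time and features, average over the batch
loss.backward()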