What's the performance difference between unstructured pruning and structured pruning? - python

I am new to neural network pruning.I know the unstructured pruning sets the weights to zero while the structured pruning does actually change the network architecture. However, I am curious about the difference between their performance.
To be specific , I have created two models, a big one ,a small one.
class MyNet(nn.Module):
def __init__(self,channels):
super(MyNet,self).__init__()
self.conv1 = nn.Conv2d(3, channels, kernel_size=3, stride=1, padding=1, bias=False)
self.fc = nn.Linear(channels,10,bias=False)
def forward(self,x):
out = self.conv1(x)
out = F.avg_pool2d(out,32)
out = out.view(out.size(0), -1)
out = self.fc(out)
return out
big = MyNet(channels=64)
small = MyNet(channels=32)
The small model copies weights from the big one, then the big only remains the weights that the small model has copied from.(the remaining weights will be set to zeros)
in the first iteration they output the same, but after that,the two models output differently.
So I wonder why they output differently after the second iteration?

Related

How to efficiently implement a non-fully connected Linear Layer in PyTorch?

I made an example diagram of a scaled down version of what I'm trying to implement:
So the top two input nodes are only fully connected to the top three output nodes, and the same design applies to the bottom two nodes. So far I've come up with two ways of implementing this in PyTorch, neither of which are optimal.
The first would be to create a nn.ModuleList of many smaller Linear Layers, and during the forward pass, iterate the input through them. For the diagram's example, that would look something like this:
class Module(nn.Module):
def __init__(self):
self.layers = nn.Module([nn.Linear(2, 3) for i in range(2)])
def forward(self, input):
output = torch.zeros(2, 3)
for i in range(2):
output[i, :] = self.layers[i](input.view(2, 2)[i, :])
return output.flatten()
So this accomplishes the network in the diagram, the main issue is its very slow. I assume this is because PyTorch has to process the for loop sequentially, and can't process the input tensor in parallel.
To "vectorize" the module such that PyTorch can run it quicker, I have this implementation:
class Module(nn.Module):
def __init__(self):
self.layer = nn.Linear(4, 6)
self.mask = # create mask of ones and zeros to "block" certain layer connections
def forward(self, input):
prune.custom_from_mask(self.layer, name='weight', mask=self.mask)
return self.layer(input)
This also accomplishes the diagram's network, by using weight pruning to ensure certain weights in the fully connected layer are always zero (ex. the weight connecting the top input node to the bottom out node will always be zero, so its effectively "disconnected"). This module is much faster than the previous, as there is no for loop. The problem now is this module takes up significantly more memory. This is likely due to the fact that, even though most of the layer's weights will be zero, PyTorch still treats the network as if they are there. This implementation essentially keeps way more weights around than it needs to.
Has anyone encountered this issue before and come up with an efficient solution?
If weight sharing is ok, then 1D convolutions should solve the problem:
class Module(nn.Module):
def __init__(self):
self.layers = nn.Conv1d(in_channels=2, out_channels=3, kernel_size=1)
self._n_splits = 2
def forward(self, input):
B, C = input.shape
output = self.layers(input.view(B, C//self._n_splits, -1))
return output.view(B, C)
If weight sharing is NOT ok, then you can use the group convolutions: self.layers = nn.Conv1d(in_channels=4, out_channels=4, kernel_size=1, stride=1, groups=2). However, I am not sure if this can implement an arbitrary number of channel splits, you can check the documentation: https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
A 1D convolution is a fully connected layer on all the channels of the input. A Group convolution will divide the channels into groups and perform separate conv operations on them (which is what you want).
The implementation will look something like:
class Module(nn.Module):
def __init__(self):
self.layers = nn.Conv1d(in_channels=2, out_channels=4, kernel_size=1, groups=2)
def forward(self, input):
B, C = input.shape
output = self.layers(input.unsqueeze(-1))
return output.squeeze()
EDIT:
If you need an odd number of output channels you can combine two group convs.
class Module(nn.Module):
def __init__(self):
self.layers = nn.Sequence(
nn.Conv1d(in_channels=2, out_channels=4, kernel_size=1, groups=2),
nn.Conv1d(in_channels=4, out_channels=3, kernel_size=1, groups=3))
def forward(self, input):
B, C = input.shape
output = self.layers(input.unsqueeze(-1))
return output.squeeze()
That will effectively define the input channels as required in the diagram and allow you for an arbitrary number of output channels. Notice that if the second convolution has groups=1 you will allow for mixing channels and will effectively render the first group conv layer useless.
From a theoretical perspective, there is no need for activation functions in between those two convolutions. We are combining them in a linear matter. However, it is possible that adding an activation function will improve performance.

How to update a pretrained model after Pruning of filters in its conv layer in PyTorch?

I have a pretrained model LeNet5 defined from scratch. I am performing pruning over filters in the convolution layers present in the model shown below.
class LeNet5(nn.Module):
def __init__(self, n_classes):
super(LeNet5, self).__init__()
self.feature_extractor = nn.Sequential(
nn.Conv2d(in_channels=1, out_channels=20, kernel_size=5, stride=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2),
nn.Conv2d(in_channels=20, out_channels=50, kernel_size=5, stride=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2)
)
self.classifier = nn.Sequential(
nn.Linear(in_features=800, out_features=500),
nn.ReLU(),
nn.Linear(in_features=500, out_features=10), # 10 - possible classes
)
def forward(self, x):
#x = x.view(x.size(0), -1)
x = self.feature_extractor(x)
x = torch.flatten(x, 1)
logits = self.classifier(x)
probs = F.softmax(logits, dim=1)
return logits, probs
I have successfully removed 2 filters from 20 in layer 1 (now 18 filters in conv2d layer1) and 5 filters from 50 in layer 2 (now 45 filters in conv2d layer3). So, now I need to update the model with the changes done as follows -
out_channel of layer 1 - 20 to 18
in_channel of layer 3 - 20 to 18
out_channel of layer 3 - 50 to 45
However, I'm unable to run the model as it gives dimension error.
RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x720 and 800x500)
How to update the no. of filters layers present in the model using Pytorch to perform pruning? Is there any library I can use for the same?
Assuming you do not want the model to automatically change structure during runtime, you can easily update the structure of the model by simply changing the input parameters to the constructor. For instance:
nn.Conv2d(in_channels = 1, out_channels = 18, kernel_size = 5, stride = 1),
nn.Conv2d(in_channels = 18, out_channels = 45, kernel_size = 5, stride = 1),
and so on.
If you are retraining from scratch every time you change the model structure, that's all you need to do. However, if you would like to maintain portions of the already learned parameters when you change the model, you'll need to select these relevant values and reassign them to the model parameters. For instance, consider the parameters associated with the first convolutional layer, 1 input, 20 outputs, and kernel size of 5. The weights and biases for this layer have size [1,20,5,5] and [1,20]. You need to modify these parameters such that they have size [1,18,5,5] and [1,18]. You'd thus need the indices for the particular kernels/filters you want to maintain and which kernels you'd like to prune. The code syntax for doing this is roughly:
params = net.state_dict()
params["feature_extractor"]["conv1.weight"] = params["feature_extractor"]["conv1.weight"][:,:18,:,:]
params["feature_extractor"]["conv1.bias"] = params["feature_extractor"]["conv1.bias"][:,:18]
# and so on for the other layers
net.load_state_dict(params)
Here, I simply drop the last two kernels/bias values for the first convolutional layer. (Note that the actual dictionary key names may differ slightly; I didn't code this up to check because, as indicated in the comments above, you included a picture of code rather than real, copy-able, code so try to do the latter in the future.)

Adding batch normalization decreases the performance

I'm using PyTorch to implement a classification network for skeleton-based action recognition. The model consists of three convolutional layers and two fully connected layers. This base model gave me an accuracy of around 70% in the NTU-RGB+D dataset. I wanted to learn more about batch normalization, so I added a batch normalization for all the layers except for the last one. To my surprise, the evaluation accuracy dropped to 60% rather than increasing But the training accuracy has increased from 80% to 90%. Can anyone say what am I doing wrong? or Adding batch normalization need not increase the accuracy?
The model with batch normalization
class BaseModelV0p2(nn.Module):
def __init__(self, num_person, num_joint, num_class, num_coords):
super().__init__()
self.name = 'BaseModelV0p2'
self.num_person = num_person
self.num_joint = num_joint
self.num_class = num_class
self.channels = num_coords
self.out_channel = [32, 64, 128]
self.loss = loss
self.metric = metric
self.bn_momentum = 0.01
self.bn_cv1 = nn.BatchNorm2d(self.out_channel[0], momentum=self.bn_momentum)
self.conv1 = nn.Sequential(nn.Conv2d(in_channels=self.channels, out_channels=self.out_channel[0],
kernel_size=3, stride=1, padding=1),
self.bn_cv1,
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.bn_cv2 = nn.BatchNorm2d(self.out_channel[1], momentum=self.bn_momentum)
self.conv2 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[0], out_channels=self.out_channel[1],
kernel_size=3, stride=1, padding=1),
self.bn_cv2,
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.bn_cv3 = nn.BatchNorm2d(self.out_channel[2], momentum=self.bn_momentum)
self.conv3 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[1], out_channels=self.out_channel[2],
kernel_size=3, stride=1, padding=1),
self.bn_cv3,
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.bn_fc1 = nn.BatchNorm1d(256 * 2, momentum=self.bn_momentum)
self.fc1 = nn.Sequential(nn.Linear(self.out_channel[2]*8*3, 256*2),
self.bn_fc1,
nn.ReLU(),
nn.Dropout2d(p=0.5)) # TO check
self.fc2 = nn.Sequential(nn.Linear(256*2, self.num_class))
def forward(self, input):
list_bn_layers = [self.bn_fc1, self.bn_cv3, self.bn_cv2, self.bn_cv1]
# set the momentum of the batch norm layers to given momentum value during trianing and 0 during evaluation
# ref: https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561
# ref: https://github.com/pytorch/pytorch/issues/4741
for bn_layer in list_bn_layers:
if self.training:
bn_layer.momentum = self.bn_momentum
else:
bn_layer.momentum = 0
logits = []
for i in range(self.num_person):
out = self.conv1(input[:, :, :, :, i])
out = self.conv2(out)
out = self.conv3(out)
logits.append(out)
out = torch.max(logits[0], logits[1])
out = out.view(out.size(0), -1)
out = self.fc1(out)
out = self.fc2(out)
t = out
assert not ((t != t).any()) # find out nan in tensor
assert not (t.abs().sum() == 0) # find out 0 tensor
return out
My interpretation of the phenomenon you are observing,, is that instead of reducing the covariance shift, which is what the Batch Normalization is meant for, you are increasing it. In other words, instead of decrease the distribution differences between train and test, you are increasing it and that's what it is causing you to have a bigger difference in the accuracies between train and test. Batch Normalization does not assure better performance always, but for some problems it doesn't work well. I have several ideas that could lead to an improvement:
Increase the batch size if it is small, what would help the mean and std calculated in the Batch Norm layers to be more robust estimates of the population parameters.
Decrease the bn_momentum parameter a bit, to see if that also stabilizes the Batch Norm parameters.
I am not sure you should set bn_momentum to zero when test, I think you should just call model.train() when you want to train and model.eval() when you want to use your trained model to perform inference.
You could alternatively try Layer Normalization instead of Batch Normalization, cause it does not require accumulating any statistic and usually works well
Try regularizing a bit your model using dropout
Make sure you shuffle your training set in every epoch. Not shuffling the data set may lead to correlated batches that make the statistics in batch normalization cycle. That may impact your generalization
I hope any of these ideas work for you
The problem may be with your momentum. I see you are using 0.01.
Here is how I tried different betas to fit to points with momentum and with beta=0.01 I got bad results. Usually beta=0.1 is used.
It's almost happen because of two major reasons 1.non-stationary training'procedure and 2.train/test different distributions
If It's possible try other regularization technique's like Drop-out,I face to this problem and i found that my test and train distribution might be different so after i remove BN and use drop-out instead, got the reasonable result. read this for more
Use nn.BatchNorm2d(out_channels, track_running_stats=False) this disables the running statistics of the batches and uses the current batch’s mean and variance to do the normalization
In Training mode run some forward passes on data in with torch.no_grad() block. this stabilize the running_mean / running_std values
Use same batch_size in your dataset for both model.train() and model.eval()
Increase momentum of the BN. This means that the means and stds learned will be much more stable during the process of training
this is helpful whenever you use pre-trained model
for child in model.children():
for ii in range(len(child)):
if type(child[ii])==nn.BatchNorm2d:
child[ii].track_running_stats = False

How to properly setup an RNN in Keras for sequence to sequence modelling?

Although not new to Machine Learning, I am still relatively new to Neural Networks, more specifically how to implement them (In Keras/Python). Feedforwards and Convolutional architectures are fairly straightforward, but I am having trouble with RNNs.
My X data consists of variable length sequences, each data-point in that sequence having 26 features. My y data, although of variable length, each pair of X and y have the same length, e.g:
X_train[0].shape: (226,26)
y_train[0].shape: (226,)
X_train[1].shape: (314,26)
y_train[1].shape: (314,)
X_train[2].shape: (189,26)
y_train[2].shape: (189,)
And my objective is to classify each item in the sequence into one of 39 categories.
What I can gather thus far from reading example code, is that we do something like the following:
encoder_inputs = Input(shape=(None, 26))
encoder = GRU(256, return_state=True)
encoder_outputs, state_h = encoder(encoder_inputs)
decoder_inputs = Input(shape=(None, 39))
decoder_gru= GRU(256, return_sequences=True)
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
decoder_dense = Dense(39, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
Which makes sense to me, because each of the sequences have different lengths.
So with a for loop that loops over all sequences, we use None in the input shape of the first GRU layer because we are unsure what the sequence length will be, and then return the hidden state state_h of that encoder. With the second GRU layer returning sequences, and the initial state being the state returned from the encoder, we then pass the outputs to a final softmax activation layer.
Obviously something is flawed here because I get:
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
File "/usr/local/lib/python3.6/dist-
packages/tensorflow/python/framework/ops.py", line 458, in __iter__
"Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is
enabled. To iterate over this tensor use tf.map_fn.
This link points to a proposed solution, but I don't understand why you would add encoder states to a tuple for as many layers you have in the network.
I'm really looking for help in being able to successfully write this RNN to do this task, but also understanding. I am very interested in RNNs and want to understand them more in depth so I can apply them to other problems.
As an extra note, each sequence is of shape (sequence_length, 26), but I expand the dimension to be (1, sequence_length, 26) for X and (1, sequence_length) for y, and then pass them in a for loop to be fit, with the decoder_target_data one step ahead of the current input:
for idx in range(X_train.shape[0]):
X_train_s = np.expand_dims(X_train[idx], axis=0)
y_train_s = np.expand_dims(y_train[idx], axis=0)
y_train_s1 = np.expand_dims(y_train[idx+1], axis=0)
encoder_input_data = X_train_s
decoder_input_data = y_train_s
decoder_target_data = y_train_s1
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
epochs=50,
validation_split=0.2)
With other networks I have wrote (FeedForward and CNN), I specify the model by adding layers on top of Keras's Sequential class. Because of the inherent complexity of RNNs I see the general format of using Keras's Input class like above and retrieving hidden states (and cell states for LSTM) etc... to be logical, but I have also seen them built from using Keras's Sequential Class. Although these were many to one type tasks, I would be interested in how you would write it that way too.
The problem is that the decoder_gru layer does not return its state, therefore you should not use _ as the return value for the state (i.e. just remove , _):
decoder_outputs = decoder_gru(decoder_inputs, initial_state=state_h)
Since the input and output lengths are the same and there is a one to one mapping between the elements of input and output, you can alternatively construct the model this way:
inputs = Input(shape=(None, 26))
gru = GRU(64, return_sequences=True)(inputs)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Now you can make this model more complex (i.e. increase its capacity) by stacking multiple GRU layers on top of each other:
inputs = Input(shape=(None, 26))
gru = GRU(256, return_sequences=True)(inputs)
gru = GRU(128, return_sequences=True)(gru)
gru = GRU(64, return_sequences=True)(gru)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Further, instead of using GRU layers, you can use LSTM layers which has more representational capacity (of course this may come at the cost of increasing computational cost). And don't forget that when you increase the capacity of the model you increase the chance of overfitting as well. So you must keep that in mind and consider solutions that prevent overfitting (e.g. adding regularization).
Side note: If you have a GPU available, then you can use CuDNNGRU (or CuDNNLSTM) layer instead, which has been optimized for GPUs so it runs much faster compared to GRU.

Concatenate Embedding layers

I'm trying to create a model which has words as inputs. Most of those words are in the glove word vector set (~50000). However, some of the frequent words are not (~1000). The question is, how do I concatenate the following two embedding layers to create one giant Embedding lookup table?
trained_em = Embedding(50000, 50,
weights=np.array([word2glove[w] for w in words_in_glove]),
trainable=False)
untrained_em = Embedding(1000, 50)
As far as I understand these are simply two lookup tables with same number of dimensions. So I'm hoping that there is a way to stack these two lookup tables.
Edit 1:
I just realised that this is probably going to be more than stacking Embedding layers because the input sequence would be a number from 0-50999. However untrained_em above only expect a number from 0-999. So perhaps a different solution is required.
Edit 2:
This is what I would expect to do in a numpy array representing the Embedding:
np.random.seed(42) # Set seed for reproducibility
pretrained = np.random.randn(15,3)
untrained = np.random.randn(5,3)
final_embedding = np.vstack([pretrained, untrained])
word_idx = [2, 5, 19]
np.take(final_embedding, word_idx, axis=0)
I believe the last bit can be done with something to do with keras.backend.gather but not sure how to put it all together.
Turns out that I need to implement a custom layer. Which was implemented by tweaking the orignial Embedding class.
The two most important parts shown in the class below are self.embeddings = K.concatenate([fixed_weight, variable_weight], axis=0) and out = K.gather(self.embeddings, inputs). The first is hopefully self explanatory while the second picks out the relevant input rows from the embeddings table.
However, in the particular application that I'm working on, turns out that it works out better using an Embedding layer instead of the modified layer. Perhaps because the learning rate is too high. I will report back on this after I have experimented more.
from keras.engine.topology import Layer
import keras.backend as K
from keras import initializers
import numpy as np
class Embedding2(Layer):
def __init__(self, input_dim, output_dim, fixed_weights, embeddings_initializer='uniform',
input_length=None, **kwargs):
kwargs['dtype'] = 'int32'
if 'input_shape' not in kwargs:
if input_length:
kwargs['input_shape'] = (input_length,)
else:
kwargs['input_shape'] = (None,)
super(Embedding2, self).__init__(**kwargs)
self.input_dim = input_dim
self.output_dim = output_dim
self.embeddings_initializer = embeddings_initializer
self.fixed_weights = fixed_weights
self.num_trainable = input_dim - len(fixed_weights)
self.input_length = input_length
def build(self, input_shape, name='embeddings'):
initializer = initializers.get(self.embeddings_initializer)
shape1 = (self.num_trainable, self.output_dim)
variable_weight = K.variable(initializer(shape1), dtype=K.floatx(), name=name+'_var')
fixed_weight = K.variable(self.fixed_weights, name=name+'_fixed')
self._trainable_weights.append(variable_weight)
self._non_trainable_weights.append(fixed_weight)
self.embeddings = K.concatenate([fixed_weight, variable_weight], axis=0)
self.built = True
def call(self, inputs):
if K.dtype(inputs) != 'int32':
inputs = K.cast(inputs, 'int32')
out = K.gather(self.embeddings, inputs)
return out
def compute_output_shape(self, input_shape):
if not self.input_length:
input_length = input_shape[1]
else:
input_length = self.input_length
return (input_shape[0], input_length, self.output_dim)
So, my suggestion is to use only one Embedding layer (taking into consideration your indexing problem), and transfer the weights from the old layer to the new one.
So, what you're going to to in this suggestion is...
Create your new model with 51000 words:
inp = Input((1,))
emb = Embedding(51000,50)(inp)
out = the rest of the model.....
model = Model(inp,out)
Now take the embedding layer and give it the weights you had:
weights = np.array([word2glove[w] for w in words_in_glove])
newWeights = model.layers[1].get_weights()[0]
newWeights[:50000,:] = weights
model.layers[1].set_weights([newWeights])
This will give you a new embedding, larger than the previous one, with a great part of its weights already trained, and the remaining randomly initialized.
Unfortunately, you will have to let everything be trained.

Categories