My neural network has the following architecture:
input -> 128x (separate fully connected layers) -> output averaging
I am using a ModuleList to hold the list of fully connected layers. Here's how it looks at this point:
import torch
import torch.nn as nn
import torch.optim as optim

class MultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHead, self).__init__()

        self.networks = nn.ModuleList()
        for _ in range(nb_heads):
            network = nn.Sequential(
                nn.Linear(dim_state, hidden_size),
                nn.Tanh(),
                nn.Linear(hidden_size, dim_action)
            )
            self.networks.append(network)

        self.cuda()
        self.optimizer = optim.Adam(self.parameters())
Then, when I need to calculate the output, I use a for ... in construct to perform the forward and backward pass through all the layers:
q_values = torch.cat([net(observations) for net in self.networks])
# skipped code which ultimately computes the loss I need
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
This works! But I am wondering if I couldn't do this more efficiently. I feel like by doing a for...in, I am actually going through each separate FC layer one by one, while I'd expect this operation could be done in parallel.
In the case of Convnd in place of Linear you could use the groups argument for "grouped convolutions" (a depthwise convolution is the special case where groups equals the number of input channels). This lets you handle all the parallel networks simultaneously.
If you use a convolution kernel of size 1, the convolution does nothing other than apply a Linear layer, where each channel is treated as an input dimension. So the rough structure of your network would look like this:
Modify the input tensor of shape B x dim_state: replicate it nb_heads times along the feature dimension and add a trailing singleton dimension, going from B x dim_state to B x (dim_state * nb_heads) x 1
Replace the two Linear layers with
nn.Conv1d(in_channels=dim_state * nb_heads, out_channels=hidden_size * nb_heads, kernel_size=1, groups=nb_heads)
and
nn.Conv1d(in_channels=hidden_size * nb_heads, out_channels=dim_action * nb_heads, kernel_size=1, groups=nb_heads)
You now have a tensor of size B x (dim_action * nb_heads) x 1, which you can reshape to whatever you want (e.g. B x nb_heads x dim_action)
While CUDA natively supports grouped convolutions, there were some issues in PyTorch with the speed of grouped convolutions (see e.g. here), but I think that has since been resolved.
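For illustration, here is a minimal sketch of that restructuring (the class name GroupedMultiHead and the final reshape to B x nb_heads x dim_action are choices made here, not prescribed by the question):
import torch
import torch.nn as nn

# Minimal sketch of the grouped-convolution variant described above.
# GroupedMultiHead is a made-up name; the output layout (B, nb_heads, dim_action)
# is an assumption about how you want the heads arranged.
class GroupedMultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super().__init__()
        self.nb_heads = nb_heads
        self.dim_action = dim_action
        self.network = nn.Sequential(
            nn.Conv1d(dim_state * nb_heads, hidden_size * nb_heads, kernel_size=1, groups=nb_heads),
            nn.Tanh(),
            nn.Conv1d(hidden_size * nb_heads, dim_action * nb_heads, kernel_size=1, groups=nb_heads),
        )

    def forward(self, x):  # x: (B, dim_state)
        B = x.size(0)
        x = x.repeat(1, self.nb_heads).unsqueeze(-1)  # (B, dim_state * nb_heads, 1)
        out = self.network(x)                         # (B, dim_action * nb_heads, 1)
        return out.view(B, self.nb_heads, self.dim_action)

q_values = GroupedMultiHead(dim_state=4, dim_action=2, nb_heads=128)(torch.randn(32, 4))
print(q_values.shape)  # torch.Size([32, 128, 2])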
Related
I made an example diagram of a scaled down version of what I'm trying to implement:
So the top two input nodes are only fully connected to the top three output nodes, and the same design applies to the bottom two nodes. So far I've come up with two ways of implementing this in PyTorch, neither of which are optimal.
The first would be to create a nn.ModuleList of many smaller Linear Layers, and during the forward pass, iterate the input through them. For the diagram's example, that would look something like this:
class Module(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(2, 3) for i in range(2)])

    def forward(self, input):
        output = torch.zeros(2, 3)
        for i in range(2):
            output[i, :] = self.layers[i](input.view(2, 2)[i, :])
        return output.flatten()
So this accomplishes the network in the diagram; the main issue is that it's very slow. I assume this is because PyTorch has to process the for loop sequentially and can't process the input tensor in parallel.
To "vectorize" the module such that PyTorch can run it quicker, I have this implementation:
from torch.nn.utils import prune

class Module(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 6)
        self.mask = # create mask of ones and zeros to "block" certain layer connections

    def forward(self, input):
        prune.custom_from_mask(self.layer, name='weight', mask=self.mask)
        return self.layer(input)
This also accomplishes the diagram's network, by using weight pruning to ensure that certain weights in the fully connected layer are always zero (e.g. the weight connecting the top input node to the bottom output node will always be zero, so it's effectively "disconnected"). This module is much faster than the previous one, as there is no for loop. The problem now is that this module takes up significantly more memory. This is likely because, even though most of the layer's weights will be zero, PyTorch still treats the network as if they were there. This implementation essentially keeps far more weights around than it needs.
Has anyone encountered this issue before and come up with an efficient solution?
If weight sharing is ok, then 1D convolutions should solve the problem:
class Module(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Conv1d(in_channels=2, out_channels=3, kernel_size=1)
        self._n_splits = 2

    def forward(self, input):
        B, C = input.shape
        # reshape to (B, C // n_splits, n_splits); the same 2 -> 3 weights are applied at each position
        output = self.layers(input.view(B, C // self._n_splits, -1))
        return output.view(B, -1)
If weight sharing is NOT ok, then you can use grouped convolutions: self.layers = nn.Conv1d(in_channels=4, out_channels=4, kernel_size=1, stride=1, groups=2). However, I am not sure whether this can implement an arbitrary number of channel splits; you can check the documentation: https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
A 1D convolution with kernel size 1 is a fully connected layer over the channels of the input. A grouped convolution divides the channels into groups and performs a separate convolution on each of them (which is what you want).
The implementation will look something like:
class Module(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Conv1d(in_channels=2, out_channels=4, kernel_size=1, groups=2)

    def forward(self, input):
        B, C = input.shape
        output = self.layers(input.unsqueeze(-1))  # (B, C, 1) -> (B, 4, 1)
        return output.squeeze(-1)
EDIT:
If you need an odd number of output channels you can combine two group convs.
class Module(nn.Module):
    def __init__(self):
        super().__init__()
        # in_channels of each conv must be divisible by its groups,
        # so the intermediate width is a multiple of both group counts
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels=2, out_channels=6, kernel_size=1, groups=2),
            nn.Conv1d(in_channels=6, out_channels=3, kernel_size=1, groups=3))

    def forward(self, input):
        B, C = input.shape
        output = self.layers(input.unsqueeze(-1))
        return output.squeeze(-1)
That will effectively define the input channels as required in the diagram and allow for an arbitrary number of output channels. Notice that if the second convolution has groups=1 you will allow channel mixing, which effectively renders the first grouped conv layer useless.
From a theoretical perspective, there is no need for activation functions between those two convolutions; we are combining them in a linear manner. However, it is possible that adding an activation function will improve performance.
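For reference, if all you need is the exact topology in the diagram (two groups, each mapping 2 inputs to 3 outputs with its own weights), a single grouped convolution is already enough. This is just a sketch added here, not part of the original answer:
import torch
import torch.nn as nn

# Minimal sketch (under the assumption of the diagram's 2 x (2 -> 3) topology):
# 4 input channels split into 2 groups, each group mapped to 3 outputs with its own weights.
class BlockDiagonalLinear(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=6, kernel_size=1, groups=2)

    def forward(self, input):                  # input: (B, 4)
        out = self.conv(input.unsqueeze(-1))   # (B, 6, 1)
        return out.squeeze(-1)                 # (B, 6): outputs 0-2 come from inputs 0-1, outputs 3-5 from inputs 2-3

x = torch.randn(64, 4)
print(BlockDiagonalLinear()(x).shape)  # torch.Size([64, 6])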
So, I'm trying to create a custom layer in TensorFlow 2.4.1, using a function for a neuron I defined.
# NOTE: this is not the actual neuron I want to use,
# it's just a simple example.
def neuron(x, W, b):
    return W @ x + b
Where the W and b it gets would be of shape (1, x.shape[0]) and (1, 1) respectively. This means this is like a single neuron in a dense layer. So, I want to create a dense layer by stacking however many of these individual neurons I want.
class Layer(tf.keras.layers.Layer):
    def __init__(self, n_units=5):
        super(Layer, self).__init__()  # handles standard arguments
        self.n_units = n_units  # number of neurons in the layer

    def build(self, input_shape):
        # Create weights and biases for all neurons individually
        for i in range(self.n_units):
            # Create weights and bias for ith neuron
            ...

    def call(self, inputs):
        # Compute outputs for all neurons
        ...
        # Concatenate outputs to create layer output
        ...
        return output
How can I create a layer as a stack of individual neurons (also in a way it can train)? I have roughly outlined the idea for the layer in the above code, but the answer doesn't need to follow that as a blueprint.
Finally: yes, I'm aware that to create a dense layer you don't need to go about it in such a roundabout way (you just need one weight matrix and one bias matrix), but in my actual use case this is necessary. Thanks!
So, as the person who asked this question, I have found a way to do it by dynamically creating variables and operations.
First, let's re-define the neuron to use tensorflow operations:
def neuron(x, W, b):
    return tf.add(tf.matmul(W, x), b)
Then, let's create the layer (this uses the blueprint laid out in the question):
class Layer(tf.keras.layers.Layer):
    def __init__(self, n_units=5):
        super(Layer, self).__init__()
        self.n_units = n_units

    def build(self, input_shape):
        for i in range(self.n_units):
            exec(f'self.kernel_{i} = self.add_weight("kernel_{i}", shape=[1, int(input_shape[0])])')
            exec(f'self.bias_{i} = self.add_weight("bias_{i}", shape=[1, 1])')

    def call(self, inputs):
        for i in range(self.n_units):
            exec(f'out_{i} = neuron(inputs, self.kernel_{i}, self.bias_{i})')
        return eval(f'tf.concat([{", ".join([f"out_{i}" for i in range(self.n_units)])}], axis=0)')
As you can see, we're using exec and eval to dynamically create variables and perform operations.
That's it! We can perform a few checks to see if TensorFlow could use this:
# Check to see if it outputs the correct thing
layer = Layer(5) # With 5 neurons, it should return a (5, 6)
print(layer(tf.zeros([10, 6])))
# Check to see if it has the right trainable parameters
print(layer.trainable_variables)
# Check to see if TensorFlow can find the gradients
layer = Layer(5)
x = tf.ones([10, 6])
with tf.GradientTape() as tape:
    z = layer(x)
print(f"Parameter: {layer.trainable_variables[2]}")
print(f"Gradient: {tape.gradient(z, layer.trainable_variables[2])}")
This solution works, but it's not very elegant... I wonder if there's a better way to do it, some magical TF method that can map the neuron to create a layer, I'm too inexperienced to know for the moment. So, please answer if you have a (better) answer, I'll be happy to accept it :)
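For what it's worth, one way to avoid exec/eval while keeping the exact same per-neuron structure is to store the weights in plain Python lists. This is only a sketch of that idea, assuming the same neuron function defined above:
import tensorflow as tf

class ListLayer(tf.keras.layers.Layer):
    def __init__(self, n_units=5):
        super().__init__()
        self.n_units = n_units

    def build(self, input_shape):
        # Keras tracks variables stored in list attributes, so these remain trainable
        self.kernels = [self.add_weight(f"kernel_{i}", shape=[1, int(input_shape[0])])
                        for i in range(self.n_units)]
        self.biases = [self.add_weight(f"bias_{i}", shape=[1, 1])
                       for i in range(self.n_units)]

    def call(self, inputs):
        # Apply each neuron separately, then stack the results as in the exec/eval version
        outs = [neuron(inputs, W, b) for W, b in zip(self.kernels, self.biases)]
        return tf.concat(outs, axis=0)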
This code comes from https://www.kaggle.com/dkaraflos/1-geomean-nn-and-6featlgbm-2-259-private-lb. The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The person in this link won first place among more than 4,000 teams.
def get_model():
    inp = Input(shape=(1, train_sample.shape[1]))
    x = BatchNormalization()(inp)
    x = LSTM(128, return_sequences=True)(x)  # LSTM as first layer performed better than Dense.
    x = Convolution1D(128, (2), activation='relu', padding="same")(x)
    x = Convolution1D(84, (2), activation='relu', padding="same")(x)
    x = Convolution1D(64, (2), activation='relu', padding="same")(x)
    x = Flatten()(x)
    x = Dense(64, activation="relu")(x)
    x = Dense(32, activation="relu")(x)

    # outputs
    ttf = Dense(1, activation='relu', name='regressor')(x)  # Time to Failure
    tsf = Dense(1)(x)  # Time Since Failure
    classifier = Dense(1, activation='sigmoid')(x)  # Binary for TTF < 0.5 seconds

    model = models.Model(inputs=inp, outputs=[ttf, tsf, classifier])
    opt = optimizers.Nadam(lr=0.008)

    # We are fitting to 3 targets simultaneously: Time to Failure (TTF), Time Since Failure (TSF),
    # and a binary target for TTF < 0.5 seconds.
    # We weight the model to optimize heavily for TTF.
    # Optimizing for TSF and binary TTF < 0.5 helps to reduce overfitting and helps generalization.
    model.compile(optimizer=opt, loss=['mae', 'mae', 'binary_crossentropy'],
                  loss_weights=[8, 1, 1], metrics=['mae'])
    return model
However, according to my derivation, I think x = Convolution1D(128, (2), activation='relu', padding="same")(x) and x = Dense(128, activation='relu')(x) have the same effect, because the convolution kernel performs convolution over a sequence with a single time step. In principle, it is very similar to a fully connected layer. Why use Conv1D here instead of directly using a fully connected layer? Is my derivation wrong?
1) Assuming you would input a sequence to the LSTM (the normal use case):
It would not be the same since the LSTM returns a sequence (return_sequences=True), thereby not reducing the input dimensionality. The output shape is therefore (Batch, Sequence, Hid). This is being fed to the Convolution1D layer which performs convolution on the Sequence dimension, i.e. on (Sequence, Hid). So in effect, the purpose of the 1D Convolutions is to extract local 1D subsequences/patches after the LSTM.
If we had return_sequences=False, the LSTM would return the final state h_t. To ensure the same behavior as a Dense layer, you need a fully connected convolutional layer, i.e. a kernel size of Sequence length, and we need as many filters as we have Hid in the output shape. This would then make the 1D Convolution equivalent to a Dense layer.
2) Assuming you do not input a sequence to the LSTM (your example):
In your example, the LSTM is used as a replacement for a Dense layer.
It serves the same function, though it gives you a slightly different
result as the gates do additional transformations (even though we
have no sequence).
Since the Convolution is then performed on (Sequence, Hid) = (1, Hid), it is indeed operating per timestep. Since we have 128 inputs and 128 filters, it is fully connected and the kernel size is large enough to operate on the single element. This meets the above defined criteria for a 1D Convolution to be equivalent to a Dense layer, so you're correct.
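As a quick sanity check of that equivalence (a snippet added here, using tf.keras and arbitrary shapes, not from the original answer), you can copy a kernel-size-1 Conv1D's weights into a Dense layer and compare the outputs:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# A kernel-size-1 Conv1D equals a Dense layer applied independently at every timestep.
x = np.random.rand(4, 7, 16).astype("float32")   # (batch, sequence, hid)

conv = layers.Conv1D(filters=32, kernel_size=1)
dense = layers.Dense(32)

y_conv = conv(x)
_ = dense(x)  # build the Dense layer so we can copy weights into it

# Conv1D kernel has shape (1, 16, 32); dropping the leading 1 gives the Dense kernel (16, 32)
kernel, bias = conv.get_weights()
dense.set_weights([kernel[0], bias])

y_dense = dense(x)  # Dense acts on the last axis, i.e. per timestep
print(np.allclose(y_conv.numpy(), y_dense.numpy(), atol=1e-6))  # True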
As a side note, this type of architecture is something you would typically get with a Neural Architecture Search. The "replacements" used here are not really commonplace and not generally guaranteed to be better than the more established counterparts. In a lot of cases, using Reinforcement Learning or Evolutionary Algorithms can however yield slightly better accuracy using "untraditional" solutions since very small performance gains can just happen by chance and don't have to necessarily reflect back on the usefulness of the architecture.
I'd like to train a small neural network in Pytorch that takes as an input an 8-dimensional vector and predicts one of three possible categories. The first hidden layer should contain 6 neurons, where each neuron takes the activations of only 3 consecutive dimensions in the input layer. The second hidden layer should also contain 6 nodes and be fully connected, and the last layer should be the output layer with 3 neurons. Thus the topology is:
network topology
Let's say that a mini batch consists of 64 (8-dimensional) data points.
I tried to implement the first layer using 1D convolution. Since a 1D convolution filter assumes the input is a sequence of points, I thought a good approach is to define 6 filters operating on 8 1-dimensional points:
import torch.nn as nn
import torch.nn.functional as functional

class ExampleNet(nn.Module):
    def __init__(self, batch_size, input_channels, output_channels):
        super(ExampleNet, self).__init__()
        self._layer1 = nn.Conv1d(in_channels=1, out_channels=input_channels - 2, kernel_size=3, stride=1)
        self._layer2 = nn.Linear(in_features=input_channels - 2, out_features=input_channels - 2)
        self._layer3 = nn.Linear(in_features=input_channels - 2, out_features=output_channels)

    def forward(self, x):
        x = functional.relu(self._layer1(x))
        x = functional.relu(self._layer2(x))
        x = functional.softmax(self._layer3(x))
        return x

net = ExampleNet(64, 8, 3)
I know that Pytorch expects a sequence of arrays of size 64 x 8 x 1 each when training the network. However, since I apply 1D convolutional filters in an untraditional way, I think I should have input arrays of size 64 x 1 x 8, and I am expecting an output of size 64 x 3. I use the following mini-batch of random points to run through the network:
# Generate a mini-batch of 64 samples
input = torch.randn(64, 1, 8)
out = net(input)
print(out.size())
And the output I get tells me that I defined a wrong topology. How would you advise me to define the layers I need? Is using Conv1d a good approach in my case? I saw that another approach is to use a masked layer but I don't know how to define it.
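For reference, tracing the shapes through the layers above (a snippet added here, not from the question) shows why the output is not (64, 3): the Conv1d produces 6 channels at 6 positions, and the Linear layers then act only on the last dimension:
import torch
import torch.nn as nn

x = torch.randn(64, 1, 8)
h = nn.Conv1d(in_channels=1, out_channels=6, kernel_size=3, stride=1)(x)
print(h.shape)          # torch.Size([64, 6, 6]) -> 6 channels x 6 positions
h = nn.Linear(6, 6)(h)  # Linear applies to the last dimension only
h = nn.Linear(6, 3)(h)
print(h.shape)          # torch.Size([64, 6, 3]), not the expected (64, 3)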
I'm trying to get Keras to train a multiclass classification model that can be written in a network like this:
The only trainable parameters are the coefficients a_pk; all the rest is given. The functions f_i are combinations of usual mathematical functions, Σ stands for summing the previous terms, and softmax is the usual function. The (x_1, x_2, ..., x_n) are elements of the train or test set, and the x_pk are a specific subset of the original data, already selected.
The model in more depth:
Specifically, given an input (x_1, x_2, ..., x_n) in the train or test set, the network evaluates, for each row p = 1, ..., m, the quantity sum_k a_pk * f(x_k, x_pk), where the f are given mathematical functions, the rows (x_p1, ..., x_pn) come from a particular subset of the original data, and the coefficients a_pk are the parameters I want to train.
As I'm using keras, I expect it to add a bias term to each row.
After the above evaluation, I will apply a softmax layer (each of the m rows above is a number that will be an input to the softmax function).
At the end I want to compile the model and run model.fit as usual.
The problem is that I couldn't translate the expression into Keras syntax.
My attempt:
Following the network sketch above, I first tried to treat each of the per-row expressions sum_k a_pk * f(x_k, x_pk) as a Lambda layer in a Sequential model, but the best I could get to work was a combination of a Dense layer with linear activation (which would play the role of a row's parameters a_p1, ..., a_pn) followed by a Lambda layer outputting a vector without the required summation, as follows:
model = Sequential()

# single row considered:
model.add(Lambda(lambda x: f_fixedRow(x), input_shape=(nFeatures,)))

# parameters set after the lambda layer to get (a1*f(x1,y1),...,an*f(xn,yn))
# and not (f(a1*x1,y1),...,f(an*xn,yn))
model.add(Dense(nFeatures, activation='linear'))

# missing summation: sum(x)
# missing evaluation of f in all other rows
model.add(Dense(classes, activation='softmax', trainable=False))  # should get all rows

model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Also, I had to define the function passed to the Lambda layer with its second argument already fixed (because the lambda function can only take the input layer as its variable):
def f_fixedRow(x):
    # pick a particular row (as a vector) to evaluate f on (f works element-wise)
    y = tf.constant(value=x[0,:], dtype=tf.float32)
    return f(x, y)
I managed to write the f function with TensorFlow (working element-wise on a row), although this is a possible source of problems in my code (and the above workaround seems unnatural).
I also thought that if I could properly write the element-wise sum of the vector in the aforementioned attempt, I could repeat the same procedure in a parallelized manner with the Keras functional API and then feed the output of each parallel model into the softmax function, as I need.
Another approach I considered was to train the parameters while keeping their natural matrix structure from the network description, maybe by writing a matrix Lambda layer, but I could not find anything related to this idea.
Anyway, I'm not sure what a good way to work with this model within Keras is; maybe I'm missing an important point because of the non-standard way the parameters are written, or because of my lack of experience with TensorFlow. Any suggestions are welcome.
For this answer, it's important that f be a tensor function that operates elementwise (no iterating). This is reasonably easy to achieve; just check the Keras backend functions.
Assumptions:
The x_pk set is constant, otherwise this solution must be reviewed.
The function f is elementwise (if not, please show f for better code)
Your model will need x_pk as a tensor input. And you should do that in a functional API model.
import keras.backend as K
from keras.layers import Input, Lambda, Activation, Layer
from keras.models import Model
#x_pk data
x_pk_numpy = select_X_pk_samples(x_train)
x_pk_tensor = K.variable(x_pk_numpy)
#number of rows in x_pk
m = len(x_pk_numpy)
#I suggest a fixed batch size for simplicity
batch = some_batch_size
First let's work on the function that will take x and x_pk calling f.
def calculate_f(inputs):  # inputs will be a list with x and x_pk
    x, x_pk = inputs

    # since f will work elementwise, let's replicate x and x_pk so they have equal shapes
    # please explain f for better optimization

    # x from (batch, n) to (batch, m, n)
    x = K.stack([x]*m, axis=1)

    # x_pk from (m, n) to (batch, m, n)
    x_pk = K.stack([x_pk]*batch, axis=0)

    # a batch size of 1 could make this even simpler
    # a variable batch size would make this more complicated
    # certain f functions could make this process unnecessary

    return f(x, x_pk)
Now, unlike a Dense layer, this formula uses the a_pk weights multiplied elementwise, so we need a custom layer:
class ElementwiseWeights(Layer):
    def __init__(self, **kwargs):
        super(ElementwiseWeights, self).__init__(**kwargs)

    def build(self, input_shape):
        weight_shape = (1,) + input_shape[1:]  # shape (1, m, n)

        self.kernel = self.add_weight(name='kernel',
                                      shape=weight_shape,
                                      initializer='uniform',
                                      trainable=True)
        super(ElementwiseWeights, self).build(input_shape)

    def compute_output_shape(self, input_shape):
        return input_shape

    def call(self, inputs):
        return self.kernel * inputs
Now let's build our functional API model:
#x_pk model tensor input
x_pk = Input(tensor=x_pk_tensor) #shape (m, n)
#x usual input with fixed batch size
x = Input(batch_shape=(batch,n)) #shape (batch, n)
#calculate F
out = Lambda(calculate_f)([x, x_pk]) #shape (batch, m, n)
#multiply a_pk
out = ElementwiseWeights()(out) #shape (batch, m, n)
#sum n elements, keep m rows:
out = Lambda(lambda x: K.sum(x, axis=-1))(out) #shape (batch, m)
#softmax
out = Activation('softmax')(out) #shape (batch,m)
Continue this model with whatever you want and finish it:
model = Model([x, x_pk], out)
model.compile(.....)
model.fit(x_train, y_train, ....) #perhaps you might need .fit([x_train], y_train, ...)
Edit for function f
You can have the proposed f like this:
#create the n coefficients:
coefficients = np.array([c0, c1, .... , cn])
coefficients = coefficients.reshape((1,1,n))
def f(x, x_pk):
    c = K.variable(coefficients)  # shape (1, 1, n)

    out = (x - x_pk) / c
    return K.exp(out)
This f would accept x with shape (batch, 1, n), without the stack used in the calculate_f function.
Or could accept x_pk with shape (1, m, n), allowing variable batch size.
But I'm not sure it's possible to have both of these shapes together. Testing this might be interesting.
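As a quick check of that last point (a snippet added here, not from the original answer): plain elementwise broadcasting already combines the two shapes, since (batch, 1, n) against (1, m, n) broadcasts to (batch, m, n).
import numpy as np

# Quick broadcasting check with arbitrary sizes
batch, m, n = 4, 3, 5
x = np.random.rand(batch, 1, n)
x_pk = np.random.rand(1, m, n)
c = np.random.rand(1, 1, n)

out = np.exp((x - x_pk) / c)
print(out.shape)  # (4, 3, 5) == (batch, m, n)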