I am trying to implement an LSTM VAE (following this example I found) that also accepts variable-length sequences by using Masking layers. I tried to combine that code with the ideas from this SO question, which seems to deal with padding the "best way" by cropping the gradients to get the most accurate loss possible. However, my implementation cannot reproduce sequences even on a small set of data, so I am fairly confident that something is amiss with it, but I cannot pinpoint what exactly is wrong. The relevant part is here:
x = Input(shape=(None, input_dim))
x_masked = Masking(mask_value=0.0, input_shape=(None, input_dim))(x)
h = LSTM(intermediate_dim)(x_masked)
z_mean = Dense(latent_dim)(h)
z_log_sigma = Dense(latent_dim)(h)

def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=epsilon_std)
    return z_mean + z_log_sigma * epsilon

z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])
decoder_h = LSTM(intermediate_dim, return_sequences=True)
decoder_mean = LSTM(latent_dim, return_sequences=True)
h_decoded = RepeatVector(max_timesteps)(z)
h_decoded = decoder_h(h_decoded)
x_decoded_mean = decoder_mean(h_decoded)

def crop_outputs(x):
    # x[0] is the decoder output, x[1] is the zero-padded input
    padding = K.cast(K.not_equal(x[1], 0), dtype=K.floatx())
    return x[0] * padding

x_decoded_mean = Lambda(crop_outputs, output_shape=(max_timesteps, input_dim))([x_decoded_mean, x])
vae = Model(x, x_decoded_mean)

def vae_loss(x, x_decoded_mean):
    xent_loss = objectives.mse(x, x_decoded_mean)
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma))
    loss = xent_loss + kl_loss
    return loss
vae.compile(optimizer='adam', loss=vae_loss)
# Here, X is variable length time series data of shape
# (num_examples, max_timesteps, input_dim) and is zero padded
# on the right for all the examples of length less than max_timesteps
# X has been appropriately scaled using the StandardScaler.
vae.fit(X, X, epochs = num_epochs, batch_size=batch_size)
As always, any help is much appreciated. Thank you!
I came across your question while trying to do exactly the same thing. I gave up on VAEs, but found a way to apply masking to layers that don't support it. I predefined a binary mask (you can build it with NumPy, see Code 1) and multiplied my output by the mask. During backpropagation the gradient passes through the multiplication, so it is propagated where the mask is 1 and zeroed where the mask is 0. It is not as clever as the Masking layer in Keras, but it worked for me.
#Code 1
#making a numpy binary mask
# expecting a sequence with shape (Time_Steps, Features)
# let's say that my sequence has Features = 10 and a max_Length of 15
max_Len = 15
seq = np.linspace(0,1,100).reshape((10,10))
# You must pad/truncate the sequence here
mask = np.concatenate([np.ones(seq.shape[0]),np.zeros(max_Len-seq.shape[0])],axis=-1)
# This mask can be thrown as input to the model afterwards
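As a rough illustration of the same idea for a whole batch, the sketch below builds a (batch, max_len) mask from the sequence lengths and uses it so that padded timesteps contribute nothing to a mean squared error. `make_mask` and `masked_mse` are hypothetical helper names, not Keras functions, and this is plain NumPy rather than the code used in the answer.

```python
import numpy as np

def make_mask(lengths, max_len):
    """Return a (batch, max_len) float mask: 1 for real steps, 0 for padding."""
    lengths = np.asarray(lengths)
    return (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.float32)

def masked_mse(y_true, y_pred, mask):
    """Mean squared error over unpadded timesteps only.

    y_true, y_pred: (batch, max_len, features); mask: (batch, max_len).
    """
    sq_err = (y_true - y_pred) ** 2 * mask[:, :, None]
    n_valid = mask.sum() * y_true.shape[-1]  # number of real scalar entries
    return sq_err.sum() / n_valid

# Two right-padded sequences of lengths 2 and 3, max_len = 3, 1 feature
lengths = [2, 3]
mask = make_mask(lengths, max_len=3)
y_true = np.array([[[1.0], [2.0], [0.0]],
                   [[1.0], [1.0], [1.0]]])
y_pred = np.array([[[1.0], [0.0], [9.0]],   # the 9.0 sits on a padded step
                   [[1.0], [1.0], [1.0]]])
loss = masked_mse(y_true, y_pred, mask)
```

Dividing by the number of valid entries, rather than by the full padded size, keeps the loss comparable between batches with different amounts of padding.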
A few considerations:
1- It resulted in a weak regression model. I don't know the impact on VAEs, since I never tested it, but I think it will generate a lot of noise.
2- The computational cost went up, so it is worth estimating the cost of the forward and backward passes for this workaround (or "gambiarra", as we say here) if you are on a budget like me.
3- It won't solve the problem completely; you could dig deeper and implement a more stable solution in pure TensorFlow.
4- A more "accurate" solution would be to implement a custom masking layer (Code 2).
Regarding point 4, it is easy: define the layer as a regular custom layer, have its call function receive a mask, and output the product of the mask and the input. Like this:
# Code 2
class MyCoolMaskingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # init stuff here

    def compute_mask(self, inputs, mask=None):
        # pass the incoming mask on to the next layer unchanged
        return mask

    def call(self, inputs, mask=None):
        if mask is None:
            return inputs
        # add a trailing axis so the mask broadcasts over the feature axis
        bc_mask = tf.expand_dims(tf.cast(mask, "float32"), -1)
        return inputs * bc_mask
This might not work for you as is: it is really problem-specific and comes from a noob (me), but it worked for me. I just cannot share the entire code because my Master's tutor doesn't allow it.
(A little bit of context: I wrap it in a TimeDistributed layer so that each timestep of the LSTM output is processed individually by this masking layer, because inside call I perform some transformations on the data.)
I am trying to manually calculate a gradient from the output of my network and then use it in a loss function. I have managed to get an example working in Keras, but converting it to PyTorch has proven more difficult.
I have a model like:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 50)
        self.fc2 = nn.Linear(50, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = F.sigmoid(self.fc1(x))
        x = F.sigmoid(self.fc2(x))
        x = self.fc3(x)
        return x
and some data:
x = torch.unsqueeze(torch.linspace(-1, 1, 101), dim=1)
x = Variable(x)
I can then try find a gradient like:
output = net(x)
grad = torch.autograd.grad(outputs=output, inputs=x, retain_graph=True)[0]
I want to be able to find the gradient of each point, then do something like:
err_sqr = (grad - x)**2
loss = torch.mean(err_sqr)**2
However, at the moment if I try to do this I get the error:
grad can be implicitly created only for scalar outputs
I have tried changing the shape of my network output to fix this, but if I change it too much it says it's not part of the graph. I can get rid of that error by allowing that, but then it says my gradient is None. I've managed to get this working in Keras, so I'm confident that it's possible here too; I just need a hand!
My question is:
Is there a way to "fix" what I have so that I can calculate the gradient?
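PyTorch expects an upstream gradient (grad_outputs) in the grad call whenever the output is not a scalar. For the usual scalar loss functions, the upstream gradient is implicitly assumed to be 1.
For a non-scalar output you can do the same thing explicitly by passing a tensor of ones, with the same shape as the output, as the upstream gradient: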
grad = torch.autograd.grad(outputs=output, inputs=x, grad_outputs=torch.ones_like(output), retain_graph=True)[0]
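One way to see why ones work here: each output row depends only on its own input row, so the Jacobian of the outputs with respect to the inputs is diagonal, and the vector-Jacobian product with a vector of ones returns exactly the per-point derivatives. The NumPy sketch below checks this numerically. It is an illustration under assumptions: `net` is a small stand-in with random weights, not the model from the question, and finite differences replace autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 1)), rng.normal(size=1)

def net(x):
    """Tiny pointwise MLP: x has shape (N, 1), output has shape (N, 1)."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))  # sigmoid hidden layer
    return h @ W2 + b2

x = np.linspace(-1.0, 1.0, 4).reshape(-1, 1)
N, eps = x.shape[0], 1e-6

# Full Jacobian J[i, j] = d output_i / d x_j by central differences
J = np.zeros((N, N))
for j in range(N):
    dx = np.zeros_like(x)
    dx[j, 0] = eps
    J[:, j] = ((net(x + dx) - net(x - dx)) / (2 * eps))[:, 0]

# Vector-Jacobian product with ones: what grad_outputs=ones asks for
vjp = np.ones(N) @ J

# Analytic per-point derivative for comparison
h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
analytic = ((h * (1 - h)) * W1) @ W2  # shape (N, 1)
```

In PyTorch itself, the input must have requires_grad=True for the grad call to work, and if the resulting gradient is used inside a loss that is backpropagated again, the call also needs create_graph=True.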
I'm trying to get Keras to train a multiclass classification model that can be written in a network like this:
The only trainable parameters are the coefficients a_pk; all the rest is given. The functions f_i are combinations of usual mathematical functions, Sigma stands for summing the previous terms, and softmax is the usual function. The (x1,x2,...,xn) are elements of the train or test set, and the rows x_pk are a specific subset of the original data that has already been selected.
The model in more depth:
Specifically, given an input (x_1,x_2,...,x_n) in the train or test set, the network evaluates one weighted sum per row p = 1, ..., m of the selected subset: the sum over k = 1, ..., n of a_pk * f(x_k, x_pk),
where f stands for the given mathematical functions f_i, the x_pk are rows of a particular subset of the original data, and the coefficients a_pk are the parameters I want to train.
As I'm using keras, I expect it to add a bias term to each row.
After the above evaluation, I will apply a softmax layer (each of the m lines above are numbers that will be inputs for the softmax function).
At the end I want to compile the model and run model.fit as usual.
The problem is that I couldn't translate the expression into Keras syntax.
My attempt:
Following the network sketch above, I first tried to treat each of the row expressions as Lambda layers in a Sequential model, but the best I could get to work was a combination of a Lambda layer and a Dense layer with linear activation (which would play the role of one row's parameters), outputting a vector without the required summation, as follows:
model = Sequential()
#single row considered:
model.add(Lambda(lambda x: f_fixedRow(x), input_shape=(nFeatures,)))
#parameters set after lambda layer to get (a1*f(x1,y1),...,an*f(xn,yn)) and not (f(a1*x1,y1),...,f(an*xn,yn))
model.add(Dense(nFeatures, activation='linear'))
#missing summation: sum(x)
#missing evaluation of f in all other rows
model.add(Dense(classes,activation='softmax',trainable=False)) #should get all rows
Also, I had to define the function in the lambda function call with the argument already fixed (because the lambda function could have only the input layers as variable):
def f_fixedRow(x):
    #picking a particular row (as a vector) to evaluate f in (f works element-wise)
    return f(x,y)
I managed to write the f function with tensorflow (working element-wise in a row), although this is a possible source for problems in my code (and the above workaround seems unnatural).
I also thought that if I could properly write the element-wise sum of the vector in the aforementioned attempt I could repeat the same procedure in a parallelized manner with the keras Functional API and then insert the output of each parallel model in a softmax function, as I need.
Another approach that I considered was to train the parameters keeping their natural matrix structure seen in Network Description, maybe writing a matrix Lambda layer, but I could not find anything related to this idea.
Anyway, I'm not sure what is a good way to work with this model within keras, maybe I'm missing an important point because of the non standard way the parameters are written or lack of experience with tensorflow. Any suggestions are welcome.
For this answer, it's important that f be a tensor function that operates elementwise. (No iterating). This is reasonably easy to have, just check the keras backend functions.
The x_pk set is constant, otherwise this solution must be reviewed.
The function f is elementwise (if not, please show f for better code)
Your model will need x_pk as a tensor input. And you should do that in a functional API model.
import keras.backend as K
from keras.layers import Input, Lambda, Activation, Layer
from keras.models import Model
#x_pk data
x_pk_numpy = select_X_pk_samples(x_train)
x_pk_tensor = K.variable(x_pk_numpy)
#number of rows in x_pk
m = len(x_pk_numpy)
#I suggest a fixed batch size for simplicity
batch = some_batch_size
First let's work on the function that will take x and x_pk calling f.
def calculate_f(inputs): #inputs will be a list with x and x_pk
    x, x_pk = inputs
    #since f will work elementwise, let's replicate x and x_pk so they have equal shapes
    #please explain f for better optimization
    # x from (batch, n) to (batch, m, n)
    x = K.stack([x]*m, axis=1)
    # x_pk from (m, n) to (batch, m, n)
    x_pk = K.stack([x_pk]*batch, axis=0)
    #a batch size of 1 could make this even simpler
    #a variable batch size would make this more complicated
    #certain f functions could make this process unnecessary
    return f(x, x_pk)
Now, different from a Dense layer, this formula is using the a_pk weights multiplied elementwise. So we need a custom layer:
class ElementwiseWeights(Layer):
    def __init__(self, **kwargs):
        super(ElementwiseWeights, self).__init__(**kwargs)

    def build(self, input_shape):
        weight_shape = (1,) + input_shape[1:] #shape (1, m, n)
        #the original call is cut off after name=...; the initializer below is an assumption
        self.kernel = self.add_weight(name='kernel',
                                      shape=weight_shape,
                                      initializer='ones',
                                      trainable=True)
        super(ElementwiseWeights, self).build(input_shape)

    def compute_output_shape(self,input_shape):
        return input_shape

    def call(self, inputs):
        return self.kernel * inputs
Now let's build our functional API model:
#x_pk model tensor input
x_pk = Input(tensor=x_pk_tensor) #shape (m, n)
#x usual input with fixed batch size
x = Input(batch_shape=(batch,n)) #shape (batch, n)
#calculate F
out = Lambda(calculate_f)([x, x_pk]) #shape (batch, m, n)
#multiply a_pk
out = ElementwiseWeights()(out) #shape (batch, m, n)
#sum n elements, keep m rows:
out = Lambda(lambda x: K.sum(x, axis=-1))(out) #shape (batch, m)
out = Activation('softmax')(out) #shape (batch,m)
Continue this model with whatever you want and finish it:
model = Model([x, x_pk], out)
model.fit(x_train, y_train, ....) #perhaps you might need .fit([x_train], ytrain,...)
Edit for function f
You can have the proposed f like this:
#create the n coefficients:
coefficients = np.array([c0, c1, .... , cn])
coefficients = coefficients.reshape((1,1,n))
def f(x, x_pk):
    c = K.variable(coefficients) #shape (1, 1, n)
    out = (x - x_pk) / c
    return K.exp(out)
This f would accept x with shape (batch, 1, n), without the stack used in the calculate_f function.
Or could accept x_pk with shape (1, m, n), allowing variable batch size.
But I'm not sure it's possible to have both of these shapes together. Testing this might be interesting.
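Whether the two reduced shapes can be combined can at least be checked in NumPy. The sketch below does that and then carries the result through the weighting, the sum and the softmax. It rests on assumptions: that the Keras backend ops broadcast the same way NumPy does is not verified here, the coefficient values are made up, and `a` is only a stand-in for the ElementwiseWeights kernel.

```python
import numpy as np

batch, m, n = 4, 3, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, n))
x_pk = rng.normal(size=(m, n))
c = np.full((1, 1, n), 2.0)      # made-up coefficients
a = rng.normal(size=(1, m, n))   # stand-in for the trainable a_pk

def f(x, x_pk):
    return np.exp((x - x_pk) / c)

# Broadcasting: (batch, 1, n) with (1, m, n) -> (batch, m, n), no stacking
fx = f(x[:, None, :], x_pk[None, :, :])

# Reference result using explicit stacking, as in calculate_f
fx_stacked = f(np.stack([x] * m, axis=1), np.stack([x_pk] * batch, axis=0))

scores = (a * fx).sum(axis=-1)                         # (batch, m)
scores = scores - scores.max(axis=-1, keepdims=True)   # stabilise the softmax
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
```

In NumPy the broadcast result matches the stacked one, which suggests the stacking step and the fixed batch size may both be avoidable for this particular f.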
Sorry if I don't present my problem clearly; English is not my first language.
Short description:
I want to train a model which maps an input x (with shape [n_sample, timestamp, feature]) to an output y (with exactly the same shape). It's like mapping between two spaces.
Longer version:
I have 2 float ndarrays of shape [n_sample, timestamp, feature], representing the MFCC features of n_sample audio files. These 2 ndarrays are two speakers' speech from the same corpus, aligned by DTW. Let's name these 2 arrays x and y. I want to train a model which predicts y[k] given x[k]. It's like mapping from space x to space y, and the output must have exactly the same shape as the input.
What I've tried
It's a time-series problem, so I decided to use an RNN approach. Here is my code in PyTorch (I put comments along the code; I removed the calculation of the average loss for simplicity). Note that I've tried many options for the learning rate, and the behavior is still the same.
Class define
class Net(nn.Module):
    def __init__(self, in_size, hidden_size, out_size, nb_lstm_layers):
        super(Net, self).__init__()
        self.in_size = in_size
        self.hidden_size = hidden_size
        self.out_size = out_size
        self.nb_lstm_layers = nb_lstm_layers
        # self.fc1 = nn.Linear()
        self.lstm = nn.LSTM(input_size=self.in_size, hidden_size=self.hidden_size, num_layers=self.nb_lstm_layers, batch_first=True, bias=True)
        # self.fc = nn.Linear(self.hidden_size, self.out_size)
        self.fc1 = nn.Linear(self.hidden_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, self.out_size)

    def forward(self, x, h_state):
        out, h_state = self.lstm(x, h_state)
        output_fc = []
        for frame in out:
            output_fc.append(self.fc3(torch.tanh(self.fc1(frame)))) # I added fully connected layer to each frame, to make an output with same shape as input
        return torch.stack(output_fc), h_state

    def hidden_init(self):
        if use_cuda:
            h_state = torch.stack([torch.zeros(nb_lstm_layers, batch_size, 20) for _ in range(2)]).cuda()
        else:
            h_state = torch.stack([torch.zeros(nb_lstm_layers, batch_size, 20) for _ in range(2)])
        return h_state
Training step:
net = Net(20, 20, 20, nb_lstm_layers)
optimizer = optim.Adam(net.parameters(), lr=0.0001, weight_decay=0.0001)
criterion = nn.MSELoss()
for epoch in range(nb_epoch):
    count = 0
    loss_sum = 0
    batch_x = None
    for i in (range(len(data))):
        # data is my entire data, which contain A and B i specify above.
        temp_x = torch.tensor(data[i][0])
        temp_y = torch.tensor(data[i][1])
        for ii in range(0, data[i][0].shape[0] - nb_frame_in_batch*2 + 1): # Create batches
            batch_x, batch_y = get_batches(temp_x, temp_y, ii, batch_size, nb_frame_in_batch)
            # this will return 2 tensor of shape (batch_size, nb_frame_in_batch, 20),
            # with `batch_size` is the number of sample each time I feed to the net,
            # nb_frame_in_batch is the number of frame in each sample
            h_state = net.hidden_init()
            prediction, h_state = net(batch_x.float(), h_state)
            loss = criterion(prediction.float(), batch_y.float())
            h_state = (h_state[0].detach(), h_state[1].detach())
The problem is that the loss does not seem to decrease; it fluctuates a lot without any clear trend.
Please help me. Any suggestion will be greatly appreciated. If somebody could inspect my code and comment on it, that would be very kind.
Thanks in advance!
It seems the network is learning nothing from your data, hence the loss fluctuation (the weights still depend only on the random initialization). There are some things you can try:
Try to normalize the data (this suggestion is quite broad and I can't give more details since I don't have your data, but normalizing to a specific range like [0, 1], or to zero mean and unit standard deviation, is worth trying).
One very typical problem with LSTMs in PyTorch is that the expected input layout differs from other types of neural network. By default nn.LSTM expects a tensor with shape (seq_len, batch, input_size); with batch_first=True, as in your code, it expects (batch, seq_len, input_size). Make sure your batches match the layout you configured. See the LSTM section of the documentation for more details.
One more thing: try to tune your hyperparameters. In my experience an LSTM is harder to train than an FC network or a CNN.
Tell me if you see any improvement. Debugging a neural network is always hard and full of potential coding mistakes.
With most ML algorithms it is tough to diagnose without seeing the data. Based on the inconsistency of your loss results this might be an issue with your data pre-processing. Have you tried normalizing the data first? Often times with large fluctuations in results, one of your input neuron values may be skewing your loss function making it unable to find a good direction.
How to normalize a NumPy array to within a certain range?
This is an example for audio normalization but I would also try adjusting the learning rate as it looks high and possibly removing a hidden layer.
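One common way to normalize features such as MFCCs is to standardize each feature dimension with statistics computed over all samples and frames of the training set. The sketch below assumes an array shaped (n_sample, timestep, feature); `standardize` is a hypothetical helper and the random data is only a placeholder.

```python
import numpy as np

def standardize(train, eps=1e-8):
    """Zero-mean, unit-variance scaling per feature.

    train: (n_sample, timestep, feature). Returns the scaled data plus the
    statistics, which should be reused unchanged on validation/test data.
    """
    mean = train.mean(axis=(0, 1), keepdims=True)
    std = train.std(axis=(0, 1), keepdims=True)
    return (train - mean) / (std + eps), mean, std

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(8, 30, 20))  # placeholder data
x_norm, mean, std = standardize(x)
```

For a mapping task like the one in the question, the targets y would typically get the same treatment with their own statistics, and the predictions would be mapped back with `pred * std_y + mean_y`.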
Maybe the problem is in how the loss is calculated. Try summing the losses of each time-step in a sequence and then averaging over the batch. Maybe that helps.
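This reduction differs from the default of nn.MSELoss, which averages over every element. A NumPy sketch of both, with made-up shapes, shows how they relate:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, frames, feat = 4, 6, 20
pred = rng.normal(size=(batch, frames, feat))
target = rng.normal(size=(batch, frames, feat))

sq_err = (pred - target) ** 2

# Default-style reduction: mean over every element
loss_mean = sq_err.mean()

# Suggested reduction: per-frame error, summed over time, averaged over batch
per_frame = sq_err.mean(axis=-1)             # (batch, frames)
loss_sum_time = per_frame.sum(axis=1).mean()
```

When every sequence in the batch has the same number of frames, the two differ only by the constant factor `frames`, so switching mainly rescales the gradients, much like changing the learning rate. The choice matters more once sequence lengths vary.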
My question is similar to the one posed here:
keras combining two losses with adjustable weights
However, the outputs have a different dimensionality resulting in the outputs not being able to be concatenated. Hence, the solution is not applicable, is there another way to solve this problem?
The question:
I have a keras functional model with two layers with outputs x1 and x2.
x1 = Dense(1,activation='relu')(prev_inp1)
x2 = Dense(2,activation='relu')(prev_inp2)
I need to use x1 and x2 in a weighted loss function like the one in the attached image, propagating the 'same loss' into both branches. Alpha should be free to vary across iterations.
For this question, a more elaborated solution is necessary. Since we're going to use a trainable weight, we will need a custom layer.
Also, we will be needing a different form of training, since our loss doesn't work like the others taking only y_true and y_pred and considers joining two different outputs.
Thus, we're going to create two versions of the same model, one for prediction, another for training, and the training version will contain the loss in itself, using a dummy keras loss function in compilation.
The prediction model
Let's use a very basic example of model with two outputs and one input:
#any input your true model takes
inp = Input((5,5,2))
#represents the localization output
outImg = Conv2D(1,3,activation='sigmoid')(inp)
#represents the classification output
outClass = Flatten()(inp)
outClass = Dense(2,activation='sigmoid')(outClass)
#the model
predictionModel = Model(inp, [outImg,outClass])
You use this one regularly for predictions. It's not necessary to compile this one.
The losses for each branch
Now, let's create custom loss functions for each branch, one for LossCls and another for LossLoc.
Using dummy examples here, you can elaborate these losses better if necessary. The most important is that they output batches shaped like (batch, 1) or (batch,). Both output the same shape so they can be summed later.
def calcImgLoss(x):
    true,pred = x
    loss = binary_crossentropy(true,pred)
    return K.mean(loss, axis=[1,2])

def calcClassLoss(x):
    true,pred = x
    return binary_crossentropy(true,pred)
These will be used in Lambda layers in the training model.
The loss weighting layer - (WARNING! EDITED! - See explanation at the end)
Now, let's weight the losses with the trainable alpha. Trainable parameters need custom layers to be implemented.
class LossWeighter(Layer):
    def __init__(self, **kwargs): #kwargs can have 'name' and other things
        super(LossWeighter, self).__init__(**kwargs)

    #create the trainable weight here, notice the constraint between 0 and 1
    def build(self, inputShape):
        #the original call is cut off after name=...; the shape and the initial
        #value 0.5 are assumptions, Constant and Between come from the text and imports
        self.weight = self.add_weight(name='loss_weight',
                                      shape=(1,),
                                      initializer=Constant(0.5),
                                      constraint=Between(0,1),
                                      trainable=True)
        super(LossWeighter,self).build(inputShape)

    def call(self,inputs):
        #old answer: will always tend to completely ignore the biggest loss
        #return (self.weight * firstLoss) + ((1-self.weight)*secondLoss)
        #problem: alpha tends to 0 or 1, eliminating the biggest of the two losses

        #proposal of working alpha optimization
        #return K.square((self.weight * firstLoss) - ((1-self.weight)*secondLoss))
        #problem: might not train any of the losses, and even increase one of them
        #in order to minimize the difference between the two losses

        #new answer - a mix between the two, applying gradients to the right weights
        loss1, loss2 = inputs #trainable
        static_loss1 = K.stop_gradient(loss1) #non_trainable
        static_loss2 = K.stop_gradient(loss2) #non_trainable
        a1 = self.weight #trainable
        a2 = 1 - a1 #trainable
        static_a1 = K.stop_gradient(a1) #non_trainable
        static_a2 = 1 - static_a1 #non_trainable

        #this trains only alpha to minimize the difference between both losses
        alpha_loss = K.square((a1 * static_loss1) - (a2 * static_loss2))
        #or K.abs (.....)

        #this trains only the original model weights to minimize both original losses
        model_loss = (static_a1 * loss1) + (static_a2 * loss2)
        return alpha_loss + model_loss

    def compute_output_shape(self,inputShape):
        return inputShape[0]
Notice that there is a custom constraint to keep this weight between 0 and 1. This constraint is implemented with:
class Between(Constraint):
    def __init__(self,min_value,max_value):
        self.min_value = min_value
        self.max_value = max_value

    def __call__(self,w):
        return K.clip(w,self.min_value, self.max_value)

    def get_config(self):
        return {'min_value': self.min_value,
                'max_value': self.max_value}
The training model
This model will take the prediction model as base, add the loss calculations and loss weighter at the end and output only the loss value. Because it outputs only a loss, we will use the true targets as inputs, and a dummy loss function defined like:
def ignoreLoss(true,pred):
    return pred #this just tries to minimize the prediction without any extra computation
Model inputs:
#true targets
trueImg = Input((3,3,1))
trueClass = Input((2,))
#predictions from the prediction model
predImg = predictionModel.outputs[0]
predClass = predictionModel.outputs[1]
Model outputs = losses:
imageLoss = Lambda(calcImgLoss, name='loss_loc')([trueImg, predImg])
classLoss = Lambda(calcClassLoss, name='loss_cls')([trueClass, predClass])
weightedLoss = LossWeighter(name='weighted_loss')([imageLoss,classLoss])
trainingModel = Model([predictionModel.input, trueImg, trueClass], weightedLoss)
trainingModel.compile(optimizer='sgd', loss=ignoreLoss)
Dummy training
inputImages = np.zeros((7,5,5,2))
outputImages = np.ones((7,3,3,1))
outputClasses = np.ones((7,2))
dummyOut = np.zeros((7,))
trainingModel.fit([inputImages,outputImages,outputClasses], dummyOut, epochs = 50)
Necessary imports
from keras.layers import *
from keras.models import Model
from keras.constraints import Constraint
from keras.initializers import Constant
from keras.losses import binary_crossentropy #or another you need
(EDIT) Explaining the problem with the old answer:
The formula used in the old answer would make alpha always go to 0 or 1, meaning only the smallest of the two losses would be ever trained. (Useless)
A new formula leads alpha to make both losses have the same value. Alpha would be trained properly and not tend to 0 or 1. But, still, the losses would not be properly trained because "increasing one loss to reach the other" would be a possibility for the model, and once both losses were equal, the model would stop training.
The new solution is a mix of the two proposals above, while the first actually trains the losses but with wrong alpha; and the second trains alpha with wrong losses. The mixed solution adds both, but uses K.stop_gradient to prevent the wrong part of the training from happening.
The result of this will be: the "easiest" loss (not the biggest) will be more trained than the hardest. We may use K.abs or K.square, as compared to "mae" or "mse" between the two losses. The best option is up to experiment.
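For fixed loss values, the alpha-only term (a*L1 - (1-a)*L2)^2 is minimized at a = L2 / (L1 + L2), so the smaller ("easier") loss ends up with the larger weight. The sketch below checks this with plain gradient descent on alpha in NumPy. The loss values, learning rate and step count are arbitrary choices for illustration, and the clipping only mimics the Between(0, 1) constraint.

```python
import numpy as np

def train_alpha(loss1, loss2, a=0.5, lr=0.01, steps=2000):
    """Gradient descent on alpha for (a*loss1 - (1-a)*loss2)**2, losses fixed."""
    for _ in range(steps):
        diff = a * loss1 - (1.0 - a) * loss2
        grad = 2.0 * diff * (loss1 + loss2)   # d/da of diff**2
        a = float(np.clip(a - lr * grad, 0.0, 1.0))
    return a

loss1, loss2 = 0.2, 1.8          # loss1 is the "easy" one
alpha = train_alpha(loss1, loss2)
expected = loss2 / (loss1 + loss2)
```

In the real layer the losses move while alpha is trained, so alpha chases a shifting target rather than settling at a single value.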
See this table comparing the old and new proposals:
This does not guarantee the best optimization though!!!
Training the easiest loss will not always give the best result, though. It may be better than favoring a huge loss just because its formula is different. But the expected result might still need some manual weighting of the losses.
I fear there is no automatic training for this weight. If you have a target metric, you can try to train this metric (when possible, but metrics that depend on sorting, getting an index, rounding or anything that breaks backpropagation may not be possible to be transformed in losses).
There is no need to concatenate your outputs. To pass multiple arguments to a loss function, you can wrap it as follows:
def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        return (1-alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss
And then compile your functional model as:
x1 = Dense(1, activation='relu')(prev_inp1)
x2 = Dense(2, activation='relu')(prev_inp2)
y1 = Input((1,))
y2 = Input((2,))
model.compile(optimizer='adam',  # assumption: the opening of this call is missing in the original
              loss=custom_loss(x1, x2, y1, y2, 0.5),
              target_tensors=[y1, y2])
NOTE: Not tested.
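The wrapping itself is an ordinary Python closure: the outer function captures the extra tensors and alpha, and returns a function with the (y_true, y_pred) signature that compile expects. Below is a framework-free NumPy sketch of the pattern only. `loss_cls` and `loss_loc` are hypothetical mean-squared-error stand-ins, since the answer does not define them.

```python
import numpy as np

def loss_cls(y, x):
    return np.mean((y - x) ** 2)   # stand-in, not a real classification loss

def loss_loc(y, x):
    return np.mean((y - x) ** 2)   # stand-in, not a real localization loss

def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        # y_true / y_pred are ignored; everything comes from the closure
        return (1 - alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss

x1, y1 = np.zeros((4, 1)), np.ones((4, 1))        # branch 1: error 1.0
x2, y2 = np.zeros((4, 2)), np.full((4, 2), 2.0)   # branch 2: error 4.0
loss_fn = custom_loss(x1, x2, y1, y2, alpha=0.25)
value = loss_fn(None, None)
```

Because alpha is captured as a plain number when the loss is created, it stays fixed afterwards. Varying it across iterations would need something mutable in its place, for example a backend variable updated during training, which this sketch does not attempt.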
I'm implementing a Restricted Boltzmann Machine with Rectified Linear Units. I haven't found a simple implementation anywhere so wanted to ask if somebody would kindly verify the design.
Here is the CD1 calculation:
def propup(self, vis):
    activation = numpy.dot(vis, self.W) + self.hbias
    # ReLU activation of hidden units
    return activation * (activation > 0)

def sample_h_given_v(self, v0_sample):
    h1_mean = self.propup(v0_sample)
    # Sampling from a rectified Normal distribution
    h1_sample = numpy.maximum(0, h1_mean + numpy.random.normal(0, sigmoid(h1_mean)))
    return [h1_mean, h1_sample]

def propdown(self, hid):
    activation = numpy.dot(hid, self.W.T) + self.vbias
    return sigmoid(activation)

def sample_v_given_h(self, h0_sample):
    v1_mean = self.propdown(h0_sample)
    v1_sample = self.numpy_rng.binomial(size=v1_mean.shape, n=1, p=v1_mean)
    return [v1_mean, v1_sample]
This is how I calculate the gradient:
def get_cost_updates(self, lr, decay, mom, l1_penalty, p_noise, epoch, persistent=None, k=1):
    # positive phase on the training batch stored in self.input
    ph_mean, ph_sample = self.sample_h_given_v(self.input)
    # negative phase: one Gibbs step starting from the hidden sample
    nv_means, nv_samples, nh_means, nh_samples = self.gibbs_hvh(ph_sample)
    W_grad = numpy.dot(self.input.T, ph_mean) - numpy.dot(nv_samples.T, nh_means)
    vbias_grad = numpy.mean(self.input - nv_samples, axis=0)
    hbias_grad = numpy.mean(ph_mean - nh_means, axis=0)
My question is, how do I layer these into a DBN?
The aim is to construct an autoencoder, but I'm not sure how to handle the visible units also being real number variables in the second layer.
I can see that this question was asked some time ago, but as there is no answer, I will add mine.
A DBN, as you wrote, is trained with a greedy layer-wise algorithm that treats each layer as if it were an RBM. I actually gave a lecture about it recently, and you can find the presentation, with the numeric example I used, here: https://www.slideshare.net/mobile/AvnerGidron/generative-models/AvnerGidron/generative-models
I think that once you understand the presentation, it shouldn't take long for you to do it yourself.
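The greedy layer-wise scheme can be sketched roughly as follows: train the first RBM on the data, freeze it, feed its hidden activations to the second RBM as if they were data, and so on. The NumPy sketch below uses plain binary (sigmoid) units and CD-1 for brevity, not the ReLU units from the question. `TinyRBM` and `train_dbn` are hypothetical names and the data is random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRBM:
    def __init__(self, n_vis, n_hid, rng):
        self.rng = rng
        self.W = 0.01 * rng.normal(size=(n_vis, n_hid))
        self.vbias = np.zeros(n_vis)
        self.hbias = np.zeros(n_hid)

    def propup(self, v):
        return sigmoid(v @ self.W + self.hbias)

    def propdown(self, h):
        return sigmoid(h @ self.W.T + self.vbias)

    def cd1_step(self, v0, lr=0.1):
        """One contrastive-divergence (CD-1) update on a batch v0."""
        ph = self.propup(v0)
        h0 = (self.rng.random(ph.shape) < ph).astype(float)  # sample hidden units
        v1 = self.propdown(h0)                                # mean reconstruction
        nh = self.propup(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ ph - v1.T @ nh) / n
        self.vbias += lr * (v0 - v1).mean(axis=0)
        self.hbias += lr * (ph - nh).mean(axis=0)

def train_dbn(data, layer_sizes, epochs=5, seed=0):
    """Greedy layer-wise training: each RBM sees the previous layer's output."""
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_hid in layer_sizes:
        rbm = TinyRBM(layer_input.shape[1], n_hid, rng)
        for _ in range(epochs):
            rbm.cd1_step(layer_input)
        rbms.append(rbm)
        layer_input = rbm.propup(layer_input)  # real-valued means feed the next RBM
    return rbms

data = (np.random.default_rng(1).random((32, 12)) > 0.5).astype(float)
dbn = train_dbn(data, layer_sizes=[8, 4])
codes = dbn[1].propup(dbn[0].propup(data))
```

Here the second RBM receives hidden means in (0, 1) as real-valued "visible" data. With ReLU hidden units those inputs would be unbounded, so the visible-unit model of the next layer would have to be chosen accordingly; the sketch does not cover that case.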