I got the idea of implementing my own version of deep feature selection from this paper: http://link.springer.com/chapter/10.1007%2F978-3-319-16706-0_20
The basic idea of deep feature selection, according to this paper, is to add a one-to-one mapping layer before any fully connected hidden layer, and then to produce zeros in the input-layer weights by adding a regularization term (lasso or elastic net) to the cost.
My question is: even though I seem to have implemented the deep feature selection framework correctly, testing on random data generated by numpy.random.rand(1000, 50) fails to give me any zeros in the input-layer weights. Is this common for lasso-like regularization? Should I adjust the parameters I used for this framework (e.g. train for even more epochs)? Or did I do something wrong in my code?
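For reference, the input-layer penalty implemented in the cost below is the elastic net from the paper,

lambda1 * ( (1 - lambda2)/2 * ||w||_2^2 + lambda2 * ||w||_1 )

applied to the one-to-one layer's weights w, with an analogous alpha1/alpha2 term on the hidden-layer and softmax weights.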
import numpy as np
import tensorflow as tf

class DeepFeatureSelectionMLP:
    def __init__(self, X, Y, hidden_dims=[100], epochs=1000,
                 lambda1=0.001, lambda2=1.0, alpha1=0.001, alpha2=0.0, learning_rate=0.1):
        # Initiate the input layer
        # Get the dimension of the input X
        n_sample, n_feat = X.shape
        n_classes = len(np.unique(Y))

        # One-hot encode Y
        one_hot_Y = np.zeros((len(Y), n_classes))
        for i, j in enumerate(Y):
            one_hot_Y[i][j] = 1

        self.epochs = epochs
        Y = one_hot_Y

        # Store the original values
        self.X = X
        self.Y = Y

        # Two placeholders with undetermined batch size are created
        self.var_X = tf.placeholder(dtype=tf.float32, shape=[None, n_feat], name='x')
        self.var_Y = tf.placeholder(dtype=tf.float32, shape=[None, n_classes], name='y')

        self.input_layer = One2OneInputLayer(self.var_X)
        self.hidden_layers = []
        layer_input = self.input_layer.output

        # Create hidden layers
        for dim in hidden_dims:
            self.hidden_layers.append(DenseLayer(layer_input, dim))
            layer_input = self.hidden_layers[-1].output

        # Final classification layer, variable Y is passed
        self.softmax_layer = SoftmaxLayer(self.hidden_layers[-1].output, n_classes, self.var_Y)

        n_hidden = len(hidden_dims)

        # Regularization terms on the coefficients of the input layer
        self.L1_input = tf.reduce_sum(tf.abs(self.input_layer.w))
        self.L2_input = tf.nn.l2_loss(self.input_layer.w)

        # Regularization terms on the weights of the hidden layers
        L1s = []
        L2_sqrs = []
        for i in xrange(n_hidden):
            L1s.append(tf.reduce_sum(tf.abs(self.hidden_layers[i].w)))
            L2_sqrs.append(tf.nn.l2_loss(self.hidden_layers[i].w))
        L1s.append(tf.reduce_sum(tf.abs(self.softmax_layer.w)))
        L2_sqrs.append(tf.nn.l2_loss(self.softmax_layer.w))
        self.L1 = tf.add_n(L1s)
        self.L2_sqr = tf.add_n(L2_sqrs)

        # Cost with the two elastic-net regularization terms
        self.cost = self.softmax_layer.cost \
            + lambda1*(1.0-lambda2)*0.5*self.L2_input + lambda1*lambda2*self.L1_input \
            + alpha1*(1.0-alpha2)*0.5*self.L2_sqr + alpha1*alpha2*self.L1

        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(self.cost)
        self.y = self.softmax_layer.y

    def train(self, batch_size=100):
        sess = tf.Session()
        sess.run(tf.initialize_all_variables())
        for i in xrange(self.epochs):
            # get_batch is a helper that samples a mini-batch (not shown in the question)
            x_batch, y_batch = get_batch(self.X, self.Y, batch_size)
            sess.run(self.optimizer, feed_dict={self.var_X: x_batch, self.var_Y: y_batch})
            if (i + 1) % 50 == 0:
                l = sess.run(self.cost, feed_dict={self.var_X: x_batch, self.var_Y: y_batch})
                print('epoch {0}: global loss = {1}'.format(i, l))
        self.selected_w = sess.run(self.input_layer.w)
        print(self.selected_w)

class One2OneInputLayer(object):
    # One-to-one mapping!
    def __init__(self, input):
        """
        Each row of the input is a sample and each column
        is a feature. Since this is a one-to-one mapping,
        n_in equals the number of features.
        """
        n_in = input.get_shape()[1].value
        self.input = input
        # Initiate the weights for the input layer (one scalar per feature)
        w = tf.Variable(tf.zeros([n_in, ]), name='w')
        self.w = w
        self.output = self.w * self.input
        self.params = [w]

class DenseLayer(object):
    # Canonical dense layer
    def __init__(self, input, n_out, activation='sigmoid'):
        """
        Each row of the input is a sample and each column is a
        feature, so n_in is the number of input features.
        n_out defines how many nodes there are in the hidden layer.
        """
        n_in = input.get_shape()[1].value
        self.input = input
        # Initiate the weights and biases for this layer
        w = tf.Variable(tf.ones([n_in, n_out]), name='w')
        b = tf.Variable(tf.ones([n_out]), name='b')
        output = tf.add(tf.matmul(input, w), b)
        # activate is a helper that applies the named activation (not shown in the question)
        output = activate(output, activation)
        self.w = w
        self.b = b
        self.output = output
        self.params = [w]

class SoftmaxLayer(object):
    def __init__(self, input, n_out, y):
        """
        Each row of the input is a sample and each column is a
        feature. n_out defines the number of output classes.
        """
        n_in = input.get_shape()[1].value
        self.input = input
        # Initiate the weights and biases for this layer
        w = tf.Variable(tf.random_normal([n_in, n_out]), name='w')
        b = tf.Variable(tf.random_normal([n_out]), name='b')
        pred = tf.add(tf.matmul(input, w), b)
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
        self.y = y
        self.w = w
        self.b = b
        self.cost = cost
        self.params = [w]
Gradient-descent algorithms such as Adam do not give exact zeros when using L1 regularization, because they only follow (sub)gradients of the penalty. Optimizers such as FTRL or proximal AdaGrad, which handle the L1 penalty with a proximal step, can give you exact zeros.
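With the TF1 API in the question, that amounts to swapping the optimizer line and letting the optimizer apply the L1 penalty itself. A minimal sketch, not tested against the code above; note that the optimizer-level L1 applies to every trainable variable, not just the input layer, and that if the L1 term is also kept inside self.cost, the penalty is effectively applied twice:

# FTRL applies L1 proximally, so weights can become exactly zero
self.optimizer = tf.train.FtrlOptimizer(
    learning_rate=learning_rate,
    l1_regularization_strength=0.001,  # plays the role of lambda1*lambda2, needs tuning
    l2_regularization_strength=0.0,
).minimize(self.cost)

# alternatively:
# self.optimizer = tf.train.ProximalAdagradOptimizer(
#     learning_rate=learning_rate,
#     l1_regularization_strength=0.001,
# ).minimize(self.cost)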
Related
I am new to machine learning in general and to PyTorch, so I apologize if my terminology is incorrect. I am trying to understand the code that was used to train a temporally dependent VAE based on this paper, and I am trying to follow the architecture of the model based on the answers here. The answer using torchviz is not working for me, but torchview is. The issue is that it only gives me the architecture included in the forward function (i.e. the PreProcess and LSTM modules in the code), as shown in the image below. There is another function which is used to calculate the loss, and I would like to be able to generate a similar flow chart, following the input and output dimensions, for that part of the loss function (DBlock in the code below). Is this possible to visualize?
'''
import numpy as np
import torch
import torch.nn as nn

class DBlock(nn.Module):
    # A basic building block for parameterizing a normal distribution.
    # It corresponds to the D operation in the reference Appendix.
    def __init__(self, input_size, hidden_size, output_size):
        super(DBlock, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(input_size, hidden_size)
        self.fc_mu = nn.Linear(hidden_size, output_size)
        self.fc_logsigma = nn.Linear(hidden_size, output_size)

    def forward(self, input):
        t = torch.tanh(self.fc1(input))
        t = t * torch.sigmoid(self.fc2(input))
        mu = self.fc_mu(t)
        logsigma = self.fc_logsigma(t)
        return mu, logsigma

class PreProcess(nn.Module):
    # The pre-process layer for the MNIST image
    def __init__(self, input_size, processed_x_size):
        super(PreProcess, self).__init__()
        self.input_size = input_size
        self.fc1 = nn.Linear(input_size, processed_x_size)
        self.fc2 = nn.Linear(processed_x_size, processed_x_size)

    def forward(self, input):
        t = torch.relu(self.fc1(input))
        t = torch.relu(self.fc2(t))
        return t

class Decoder(nn.Module):
    # The decoder layer converting state to observation.
    # Because the observation is an MNIST image whose elements are values
    # between 0 and 1, the outputs of this layer are the probabilities of
    # elements being 1.
    def __init__(self, z_size, hidden_size, x_size):
        super(Decoder, self).__init__()
        self.fc1 = nn.Linear(z_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, x_size)

    def forward(self, z):
        t = torch.tanh(self.fc1(z))
        t = torch.tanh(self.fc2(t))
        p = torch.sigmoid(self.fc3(t))
        return p

class TD_VAE(nn.Module):
    """The full TD_VAE model with jumpy prediction.

    First, let's go through some definitions that help in understanding
    what is going on in the following code.

    Belief: As the model is fed a sequence of observations, x_t, the
        model updates its belief state, b_t, through an LSTM network.
        It is a deterministic function of x_t. We call b_t the belief
        at time t instead of the belief state, because we call the
        latent variable z the state.

    State: The latent state variable, z.

    Observation: The observed variable, x. In this case, it represents
        binarized MNIST images.
    """
    def __init__(self, x_size, processed_x_size, b_size, z_size):
        super(TD_VAE, self).__init__()
        self.x_size = x_size
        self.processed_x_size = processed_x_size
        self.b_size = b_size
        self.z_size = z_size

        ## input pre-process layer
        self.process_x = PreProcess(self.x_size, self.processed_x_size)

        ## one-layer LSTM for aggregating belief states
        ## (one layer is used here and I am not sure how many layers
        ## are used in the original paper)
        self.lstm = nn.LSTM(input_size=self.processed_x_size,
                            hidden_size=self.b_size,
                            batch_first=True)

        ## Two-layer state model is used
        ## belief to state (b to z)
        ## (this corresponds to the P_B distribution in the reference;
        ## weights are shared across time but not across layers.)
        self.l2_b_to_z = DBlock(b_size, 50, z_size)  # layer 2
        # TODO: input size is to clean, what does this mean?
        self.l1_b_to_z = DBlock(b_size + z_size, 50, z_size)  # layer 1

        ## Given belief and state at time t2, infer the state at time t1
        self.l2_infer_z = DBlock(b_size + 2*z_size, 50, z_size)  # layer 2
        self.l1_infer_z = DBlock(b_size + 2*z_size + z_size, 50, z_size)  # layer 1

        ## Given the state at time t1, model the state at time t2 through the state transition
        self.l2_transition_z = DBlock(2*z_size, 50, z_size)
        self.l1_transition_z = DBlock(2*z_size + z_size, 50, z_size)

        ## state to observation
        self.z_to_x = Decoder(2*z_size, 200, x_size)

    def forward(self, images):
        self.batch_size = images.size()[0]
        self.x = images
        ## pre-process image x
        self.processed_x = self.process_x(self.x)
        ## aggregate the belief b
        # TODO: are h_n and c_n used internally by pytorch?
        self.b, (h_n, c_n) = self.lstm(self.processed_x)

    def calculate_loss(self, t1, t2):
        """
        Calculate the jumpy TD-VAE loss, which corresponds to
        equations (6) and (8) in the reference.
        """
        ## Because the loss is based on variational inference, we need to
        ## draw samples from the variational distribution in order to estimate
        ## the loss function.

        ## sample a state at time t2 (note the reparameterization trick is used)
        ## z in layer 2
        t2_l2_z_mu, t2_l2_z_logsigma = self.l2_b_to_z(self.b[:, t2, :])
        t2_l2_z_epsilon = torch.randn_like(t2_l2_z_mu)
        t2_l2_z = t2_l2_z_mu + torch.exp(t2_l2_z_logsigma)*t2_l2_z_epsilon

        ## z in layer 1
        t2_l1_z_mu, t2_l1_z_logsigma = self.l1_b_to_z(
            torch.cat((self.b[:, t2, :], t2_l2_z), dim=-1))
        t2_l1_z_epsilon = torch.randn_like(t2_l1_z_mu)
        t2_l1_z = t2_l1_z_mu + torch.exp(t2_l1_z_logsigma)*t2_l1_z_epsilon

        ## concatenate z from layer 1 and layer 2
        t2_z = torch.cat((t2_l1_z, t2_l2_z), dim=-1)

        ## sample a state at time t1
        ## infer state at time t1 based on states at time t2
        t1_l2_qs_z_mu, t1_l2_qs_z_logsigma = self.l2_infer_z(
            torch.cat((self.b[:, t1, :], t2_z), dim=-1))
        t1_l2_qs_z_epsilon = torch.randn_like(t1_l2_qs_z_mu)
        t1_l2_qs_z = t1_l2_qs_z_mu + torch.exp(t1_l2_qs_z_logsigma)*t1_l2_qs_z_epsilon

        t1_l1_qs_z_mu, t1_l1_qs_z_logsigma = self.l1_infer_z(
            torch.cat((self.b[:, t1, :], t2_z, t1_l2_qs_z), dim=-1))
        t1_l1_qs_z_epsilon = torch.randn_like(t1_l1_qs_z_mu)
        t1_l1_qs_z = t1_l1_qs_z_mu + torch.exp(t1_l1_qs_z_logsigma)*t1_l1_qs_z_epsilon

        t1_qs_z = torch.cat((t1_l1_qs_z, t1_l2_qs_z), dim=-1)

        #### After sampling states z from the variational distribution, we can calculate
        #### the loss.

        ## state distribution at time t1 based on belief at time t1
        t1_l2_pb_z_mu, t1_l2_pb_z_logsigma = self.l2_b_to_z(self.b[:, t1, :])
        t1_l1_pb_z_mu, t1_l1_pb_z_logsigma = self.l1_b_to_z(
            torch.cat((self.b[:, t1, :], t1_l2_qs_z), dim=-1))

        ## state distribution at time t2 based on states at time t1 and the state transition
        t2_l2_t_z_mu, t2_l2_t_z_logsigma = self.l2_transition_z(t1_qs_z)
        t2_l1_t_z_mu, t2_l1_t_z_logsigma = self.l1_transition_z(
            torch.cat((t1_qs_z, t2_l2_z), dim=-1))

        ## observation distribution at time t2 based on state at time t2
        t2_x_prob = self.z_to_x(t2_z)

        #### start calculating the loss

        #### KL divergence between the z distribution at time t1 based on the variational
        #### distribution (inference model) and the z distribution at time t1 based on the belief.
        #### This divergence is between two normal distributions and it can be
        #### calculated analytically.

        ## KL divergence between t1_l2_pb_z and t1_l2_qs_z
        loss = 0.5*torch.sum(((t1_l2_pb_z_mu - t1_l2_qs_z)/torch.exp(t1_l2_pb_z_logsigma))**2, -1) + \
            torch.sum(t1_l2_pb_z_logsigma, -1) - torch.sum(t1_l2_qs_z_logsigma, -1)

        ## KL divergence between t1_l1_pb_z and t1_l1_qs_z
        loss += 0.5*torch.sum(((t1_l1_pb_z_mu - t1_l1_qs_z)/torch.exp(t1_l1_pb_z_logsigma))**2, -1) + \
            torch.sum(t1_l1_pb_z_logsigma, -1) - torch.sum(t1_l1_qs_z_logsigma, -1)

        #### The following four terms estimate the KL divergence between
        #### the z distribution at time t2 based on the variational distribution
        #### (inference model) and the z distribution at time t2 based on the transition.
        #### In contrast with the above KL divergence for the z distribution at time t1,
        #### this KL divergence cannot be calculated analytically because
        #### the transition distribution depends on z_t1, which is sampled after z_t2.
        #### Therefore, the KL divergence is estimated using samples.

        ## state log probability at time t2 based on belief
        loss += torch.sum(-0.5*t2_l2_z_epsilon**2 - 0.5*t2_l2_z_epsilon.new_tensor(2*np.pi) - t2_l2_z_logsigma, dim=-1)
        loss += torch.sum(-0.5*t2_l1_z_epsilon**2 - 0.5*t2_l1_z_epsilon.new_tensor(2*np.pi) - t2_l1_z_logsigma, dim=-1)

        ## state log probability at time t2 based on transition
        loss += torch.sum(0.5*((t2_l2_z - t2_l2_t_z_mu)/torch.exp(t2_l2_t_z_logsigma))**2 + 0.5*t2_l2_z.new_tensor(2*np.pi) + t2_l2_t_z_logsigma, -1)
        loss += torch.sum(0.5*((t2_l1_z - t2_l1_t_z_mu)/torch.exp(t2_l1_t_z_logsigma))**2 + 0.5*t2_l1_z.new_tensor(2*np.pi) + t2_l1_t_z_logsigma, -1)

        ## observation probability at time t2
        loss += -torch.sum(self.x[:, t2, :]*torch.log(t2_x_prob) + (1 - self.x[:, t2, :])*torch.log(1 - t2_x_prob), -1)

        loss = torch.mean(loss)
        return loss
'''
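Since torchview only traces what runs inside forward, calculate_loss will never show up in the TD_VAE graph. One workaround is to trace a DBlock on its own (it is an ordinary nn.Module), or to wrap calculate_loss in a small module whose forward calls it. A minimal sketch of the first option, where the sizes are illustrative placeholders rather than values from the training script:

from torchview import draw_graph

# Trace one DBlock in isolation; input_size=50 etc. are placeholder values
block = DBlock(input_size=50, hidden_size=50, output_size=8)

# torchview runs forward() on dummy data of the given (batch, features) shape
graph = draw_graph(block, input_size=(1, 50))
graph.visual_graph.render('dblock', format='png')  # writes dblock.png

The rendered graph then shows the fc1/fc2 gating and the fc_mu/fc_logsigma heads with their input and output dimensions.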
I am new to PyTorch and I have compiled the code below from different articles and code snippets. The code takes in a sequence of products and then predicts the next product in the sequence.
I am trying to find the accuracy of this model, but I am not sure how to do it. Any help or suggestions would be appreciated.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(50)

prod_list = ['AA105045091',
             'C2106264154',
             'B2106691381',
             'AA105045091',
             'B2106691381',
             'X3106692282',
             'V2106350393',
             'C2106264154',
             'V6104504285',
             'A2106329636',
             'M6M100936257',
             'N2101433968',
             'X2M200042701',
             'V3M200052002',
             'K5101434063',
             'B1106334744',
             'P1103790575',
             'K1106031596',
             'E3D227124S6',
             'D1105834415',
             'M4102794084',
             'B4101250283',
             'C2102794082',
             'D1106816721',
             'B5106788450',
             'A3106805351',
             'C2106788452',
             'C2106805373',
             'B2106788454',
             'A1104146375']
prod_list

# build sliding windows of 4 products: 3 inputs plus 1 target
sequences = []
for i in range(3, len(prod_list)):
    words = prod_list[i-3:i+1]
    sequences.append(words)

# split each sequence into an input list and an output list
X = []
y = []
for i in sequences:
    X.append(i[0:3])
    y.append(i[3])

# create integer-to-token mapping
int2token = {}
cnt = 0
for w in set(" ".join(prod_list).split()):
    int2token[cnt] = w
    cnt += 1

# create token-to-integer mapping
token2int = {t: i for i, t in int2token.items()}

def get_integer_seq_train(seq):
    new_list = []
    for i in seq:
        new_list.append(token2int[i])
    return new_list

# convert text sequences to integer sequences
x_int = [get_integer_seq_train(i) for i in X]

# convert lists to numpy arrays
x_int = np.array(x_int)

vocab_size = len(int2token)
vocab_size

def get_integer_seq_test(seq):
    return [token2int[w] for w in seq.split()]

# convert target tokens to integer sequences
y_int = [get_integer_seq_test(i) for i in y]

# convert lists to numpy arrays
y_int = np.array(y_int)

def get_batches(arr_x, arr_y, batch_size):
    # iterate through the arrays in batch-size steps
    prv = 0
    for n in range(batch_size, arr_x.shape[0], batch_size):
        x = arr_x[prv:n, :]
        y = arr_y[prv:n, :]
        prv = n
        yield x, y

class WordLSTM(nn.Module):
    def __init__(self, n_hidden=256, n_layers=4, drop_prob=0.3, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        self.emb_layer = nn.Embedding(vocab_size, 200)
        ## define the LSTM
        self.lstm = nn.LSTM(200, n_hidden, n_layers,
                            dropout=drop_prob, batch_first=True)
        ## define a dropout layer
        self.dropout = nn.Dropout(drop_prob)
        ## define the fully-connected layer
        self.fc = nn.Linear(3*n_hidden, vocab_size)
        torch.manual_seed(50)

    def forward(self, x, hidden):
        ''' Forward pass through the network.
        The inputs are x and the hidden/cell state `hidden`. '''
        ## pass input through embedding layer
        embedded = self.emb_layer(x)
        ## get the outputs and the new hidden state from the lstm
        lstm_output, hidden = self.lstm(embedded, hidden)
        ## pass through a dropout layer
        out = self.dropout(lstm_output)
        # out = out.contiguous().view(-1, self.n_hidden)
        out = out.reshape(x.shape[0], -1)
        ## put "out" through the fully-connected layer
        out = self.fc(out)
        # return the final output and the hidden state
        return out, hidden

    def init_hidden(self, batch_size):
        ''' initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for the hidden state and cell state of the LSTM
        weight = next(self.parameters()).data
        # if GPU is available
        if torch.cuda.is_available():
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        # if GPU is not available
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        return hidden

net = WordLSTM()

def train(net, epochs=10, batch_size=32, lr=0.001, clip=1, print_every=32):
    # print_every is accepted but not used in this version
    torch.manual_seed(50)
    # optimizer
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    # loss
    criterion = nn.CrossEntropyLoss()
    counter = 0
    net.train()
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        for x, y in get_batches(x_int, y_int, batch_size):
            counter += 1
            # convert numpy arrays to PyTorch tensors
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            # detach hidden states from their history
            h = tuple([each.data for each in h])
            # zero accumulated gradients
            net.zero_grad()
            # get the output from the model
            output, h = net(inputs, h)
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(-1))
            # back-propagate the error
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            # update weights
            opt.step()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {}...".format(loss))

train(net, batch_size=32, epochs=20, print_every=256)

def predict(net, tkn, h=None):
    # tensor inputs
    new_inp = []
    for t1 in tkn:
        x = np.array([token2int[t1]])
        new_inp.append(x)
    new_inp = np.asarray(new_inp).reshape(1, -1)
    inputs = torch.from_numpy(new_inp)
    # detach hidden state from history
    h = tuple([each.data for each in h])
    # get the output of the model
    out, h = net(inputs, h)
    # get the token probabilities
    p = F.softmax(out, dim=1).data
    p = p.cpu()
    p = p.numpy()
    p = p.reshape(p.shape[1],)
    # get the index of the highest-probability token
    top_n_idx = p.argsort()[-1:][::-1]
    # sampled_token_index = top_n_idx[random.sample([0], 1)[0]]
    sampled_token_index = top_n_idx[0]
    # return the predicted token
    return int2token[sampled_token_index]

# function to generate the next token
def sample(net, prime):
    net.eval()
    # batch size is 1
    h = net.init_hidden(1)
    token = predict(net, prime, h)
    return token

sample(net, prime=['AA105045091', 'C2106264154', 'B2106691381'])
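One way to measure accuracy here is top-1 accuracy: take the argmax of the logits for each input sequence and compare it with the target token index. A minimal sketch, assuming the x_int/y_int arrays above; on such a tiny dataset this measures training fit rather than generalization, so for a real estimate you would hold out a test split first:

def accuracy(net, arr_x, arr_y):
    # fraction of sequences whose top-scoring prediction matches the target token
    net.eval()
    with torch.no_grad():
        inputs = torch.from_numpy(arr_x)
        targets = torch.from_numpy(arr_y).view(-1)
        h = net.init_hidden(arr_x.shape[0])
        output, _ = net(inputs, h)
        preds = output.argmax(dim=1)
        return (preds == targets).float().mean().item()

print('top-1 accuracy:', accuracy(net, x_int, y_int))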
Given 5 features on a time series, we want to predict the following values using an LSTM recurrent neural network in PyTorch. The problem is that the loss starts very low (e.g. 0.04) and increases slightly as training runs; it seems to converge to a slightly higher value, but it never decreases.
Moreover, the dataset is normalized, and we tried different values of the learning rate, number of epochs, batch size, etc.
An example of loss during training:
step : 0 loss : 0.0016425768844783306
step : 1 loss : 0.0028163508977741003
step : 2 loss : 0.009786984883248806
This is the class:
class MV_LSTM(torch.nn.Module):
    def __init__(self, n_features, seq_length):
        super(MV_LSTM, self).__init__()
        self.n_features = n_features
        self.seq_len = seq_length
        self.n_hidden = 40  # number of hidden states
        self.n_layers = 1   # number of LSTM layers (stacked)
        self.l_lstm = torch.nn.LSTM(input_size=n_features,
                                    hidden_size=self.n_hidden,
                                    num_layers=self.n_layers,
                                    batch_first=True)
        # according to the pytorch docs, the LSTM output is
        # (batch_size, seq_len, num_directions * hidden_size)
        # when batch_first = True
        self.l_linear = torch.nn.Linear(self.n_hidden*self.seq_len, 5)

    def init_hidden(self, batch_size):
        hidden_state = torch.randn(self.n_layers, batch_size, self.n_hidden)
        cell_state = torch.randn(self.n_layers, batch_size, self.n_hidden)
        self.hidden = (hidden_state, cell_state)

    def forward(self, x):
        batch_size, seq_len, _ = x.size()
        lstm_out, self.hidden = self.l_lstm(x, self.hidden)
        x = lstm_out.contiguous().view(batch_size, -1)
        return self.l_linear(x)
This is the main code:
n_features = 5    # number of parallel input features
n_timesteps = 24  # number of timesteps

# convert the dataset into input/output pairs
# (split_sequences is a helper defined elsewhere in our code)
X, y = split_sequences(dataset, n_timesteps)
print(X.shape, y.shape)
X
y

# create the NN
mv_net = MV_LSTM(n_features, n_timesteps)
criterion = torch.nn.MSELoss()  # reduction='sum' created a huge loss value
optimizer = torch.optim.Adam(mv_net.parameters(), lr=1e-4)

train_episodes = 50
batch_size = 16
This is the training:
mv_net.train()
for t in range(train_episodes):
    X, y = sklearn.utils.shuffle(X, y)
    for b in range(0, len(X), batch_size):
        inpt = X[b:b+batch_size, :, :]
        target = y[b:b+batch_size, :]
        x_batch = torch.tensor(inpt, dtype=torch.float32)
        y_batch = torch.tensor(target, dtype=torch.float32)
        mv_net.init_hidden(x_batch.size(0))
        output = mv_net(x_batch)
        loss = criterion(output.view(-1, 5), y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print('step : ', t, 'loss : ', loss.item())
Thank you for your time, and sorry for our inexperience (this is our first RNN).
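Two details in the snippets above may be worth a second look. init_hidden draws the hidden and cell states from torch.randn for every batch, so each forward pass starts from a different random state; zero-initializing them is the more common choice (a suggestion, not something from the original post). Also, the conventional ordering is optimizer.zero_grad() before loss.backward(), although zeroing after optimizer.step() on every iteration is functionally equivalent. A sketch of both adjustments:

def init_hidden(self, batch_size):
    # zeros instead of randn: start each batch from a deterministic state
    hidden_state = torch.zeros(self.n_layers, batch_size, self.n_hidden)
    cell_state = torch.zeros(self.n_layers, batch_size, self.n_hidden)
    self.hidden = (hidden_state, cell_state)

# inner training loop
for b in range(0, len(X), batch_size):
    x_batch = torch.tensor(X[b:b+batch_size, :, :], dtype=torch.float32)
    y_batch = torch.tensor(y[b:b+batch_size, :], dtype=torch.float32)
    mv_net.init_hidden(x_batch.size(0))
    optimizer.zero_grad()  # clear gradients before computing new ones
    output = mv_net(x_batch)
    loss = criterion(output.view(-1, 5), y_batch)
    loss.backward()
    optimizer.step()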
I compared the R2 score for different numbers of hidden layers and hidden units using a for loop, and selected the layer and unit counts with a high score and an acceptable convergence time.
However, re-calculating with the selected layers and units yields different R2 scores.
Even fixing the number of layers and the number of units and just re-running the loop results in different R2 scores, as shown below.
[same number of layers and units, but different results][1]
I have been considering two possible reasons: first, that the session is not re-initialized within the for loop, and second, that reproducibility in neural networks is not guaranteed.
I searched other articles to address both, but I am asking because I still could not find the answer. Thank you in advance for your help.
The main code is below. To eliminate randomness during the data split, scikit-learn was used with a fixed random_state.
# (inside an outer loop over i that sets the number of layers;
#  the loop header is not shown in the post)
n_layer = i+1
x_con[i] = n_layer
for j in range(m_neuron):
    n_neuron = 2**(j+1)
    y_con[j] = n_neuron
    print('n_layer: ', n_layer, 'n_neuron:', n_neuron)

    # Launch the graph in a session.
    sess = tf.Session()
    tf.set_random_seed(777)  # for reproducibility

    # Create model and solver
    m1 = FCNN(str(i)+str(j), n_feature, n_output, n_layer, n_neuron, learning_rate, use_batchnorm=True)
    m1_solver = Solver(sess, m1)

    # Initialize global variables in the graph
    init = tf.global_variables_initializer()
    sess.run(init)

    cost_val_old = np.full((n_output), 0.)
    for step in range(n_epoch):
        cost_val, y_train_predict, _ = m1_solver.train(x_train_scaled, y_train_scaled)
        diff_tmp = m1_solver.convergence_criterion(cost_val, cost_val_old)
        cost_val_old = cost_val
        if (step % n_print == 0 and step > 0) or diff_tmp <= tol:
            print("{0} Cost: {1} Diff: {2:.10f}".format(step, cost_val, diff_tmp))
            if diff_tmp <= tol:
                cost_train[j,i,:] = cost_val[0:3]
                iter_train[j,i] = step
                break

    y_valid_predict = np.squeeze(np.array(m1_solver.predict(x_valid_scaled)), axis=0)
    y_test_predict = np.squeeze(np.array(m1_solver.predict(x_test_scaled)), axis=0)

    # Evaluate the R2 score
    for k in range(n_output):
        r2_train_tmp = m1_solver.evaluate_r2(y_train_scaled[:,i], y_train_predict[:,i])
        r2_valid_tmp = m1_solver.evaluate_r2(y_valid_scaled[:,i], y_valid_predict[:,i])
        r2_test_tmp = m1_solver.evaluate_r2(y_test_scaled[:,i], y_test_predict[:,i])
        r2_train[j,i,k] = r2_train_tmp[0]
        r2_valid[j,i,k] = r2_valid_tmp[0]
        r2_test[j,i,k] = r2_test_tmp[0]

    # Close the session
    sess.close()
The class for the model is below as well. It is mainly based on https://github.com/hunkim/DeepLearningZeroToAll/blob/master/lab-10-6-mnist_nn_batchnorm.ipynb.
class FCNN:
    def __init__(self, name, n_feature, n_output, n_layer, n_neuron, lr, use_batchnorm=True):
        with tf.variable_scope(name):
            self.x = tf.placeholder(tf.float32, shape=[None, n_feature], name='x')
            self.y = tf.placeholder(tf.float32, shape=[None, n_output], name='y')
            self.mode = tf.placeholder(tf.bool, name='train_mode')
            self.y_target = tf.placeholder(tf.float32, shape=[None])
            self.y_prediction = tf.placeholder(tf.float32, shape=[None])
            self.cost_new = tf.placeholder(tf.float32, shape=[n_output])
            self.cost_old = tf.placeholder(tf.float32, shape=[n_output])

            # Loop over hidden layers
            net = self.x
            hidden_dims = np.full((n_layer), n_neuron)
            for i, h_dim in enumerate(hidden_dims):
                with tf.variable_scope('layer{}'.format(i)):
                    net = tf.layers.dense(net, h_dim)
                    if use_batchnorm:
                        net = tf.layers.batch_normalization(net, training=self.mode)
                    net = tf.nn.relu(net)

            # Attach the fully connected output layer
            net = tf.contrib.layers.flatten(net)
            self.hypothesis = tf.layers.dense(net, n_output)
            self.cost = tf.reduce_mean(tf.square(self.hypothesis - self.y), axis=0, name='cost')

            # When using batch-normalization layers, it is necessary to manually
            # add the update operations, because the moving averages are not
            # included in the graph otherwise
            update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope=name)
            with tf.control_dependencies(update_ops):
                optimizer = tf.train.AdamOptimizer(learning_rate=lr)
                self.train_op = optimizer.minimize(self.cost)

            # Convergence criterion
            self.diff = tf.sqrt(tf.reduce_sum(tf.square(self.cost_new - self.cost_old)))

            # R2 score
            total_error = tf.reduce_sum(tf.square(self.y_target - tf.reduce_mean(self.y_target)))
            unexplained_error = tf.reduce_sum(tf.square(self.y_target - self.y_prediction))
            self.acc_R2 = 1. - unexplained_error/total_error
[1]: https://i.stack.imgur.com/Fo42X.png
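On the first suspected cause: in TF1, tf.set_random_seed seeds the current default graph, and because the loop above never resets that graph, variables and ops accumulate across iterations and the per-op seeds shift from run to run. A sketch of one way to isolate each configuration, assuming the model can be rebuilt from scratch each time; note that even with fixed seeds, some GPU ops are nondeterministic, so bit-exact reproducibility is still not guaranteed:

import numpy as np
import tensorflow as tf

for j in range(m_neuron):
    n_neuron = 2**(j+1)
    tf.reset_default_graph()  # drop all variables/ops from previous iterations
    tf.set_random_seed(777)   # seed the *new* graph before building the model
    np.random.seed(777)       # seed numpy-side randomness as well

    with tf.Session() as sess:  # a fresh session per configuration
        m1 = FCNN(str(i)+str(j), n_feature, n_output, n_layer, n_neuron,
                  learning_rate, use_batchnorm=True)
        m1_solver = Solver(sess, m1)
        sess.run(tf.global_variables_initializer())
        # ... training and evaluation as in the loop above ...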
My data is 4123 rows of inputs and outputs of an XOR gate.
I want to write a neural network with three input-layer neurons (the third one is the bias), a hidden layer, and an output layer.
Here's my implementation:
import numpy as np

class TwoLayerNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """
        input_size: the number of neurons in the input layer
        hidden_size: the number of neurons in the hidden layer
        output_size: the number of neurons in the output layer
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.params = {}
        self.params['W1'] = 0.01 * np.random.randn(input_size, hidden_size)  # FxH
        self.params['b1'] = np.zeros((hidden_size, 1))  # Hx1
        self.params['W2'] = 0.01 * np.random.randn(hidden_size, output_size)  # HxO
        self.params['b2'] = np.zeros((output_size, 1))  # Ox1
        self.optimal_weights = []
        self.errors = {}

    def train(self, X, y, epochs):
        """
        X: input data matrix, NxF
        y: output vector, Nx1
        returns:
            the optimal set of parameters that best minimize the loss function
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        for iteration in range(epochs):
            forward_to_hidden = X.dot(W1)  # NxH
            activate_hidden = sigmoid(forward_to_hidden)  # NxH
            forward_to_output = activate_hidden.dot(W2)  # NxO
            output = sigmoid(forward_to_output)  # NxO
            self.errors[iteration] = np.mean(0.5 * (y**2 - output**2))
            output_error = y - output  # NxO
            output_layer_delta = output_error * sigmoidPrime(output)  # NxO
            hidden_layer_error = output_layer_delta.dot(W2.T)  # NxO . OxH = NxH
            hidden_layer_delta = hidden_layer_error * sigmoidPrime(activate_hidden)  # NxH
            W1_update = X.T.dot(hidden_layer_delta)  # FxN . NxH = FxH
            W2_update = activate_hidden.T.dot(output_layer_delta)  # HxN . NxO = HxO
            W1 += W1_update
            W2 += W2_update
        self.optimal_weights.append(W1)
        self.optimal_weights.append(W2)

    def predict(self, X):
        W1, W2 = self.optimal_weights[0], self.optimal_weights[1]
        forward = sigmoid(X.dot(W1))  # NxH
        forward = forward.dot(W2)  # NxO
        forward = sigmoid(forward)  # NxO
        return forward

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoidPrime(x):
    return sigmoid(x) * (1 - sigmoid(x))
I realize that's very vanilla, but that's intentional: I want to understand the most basic form of an NN architecture first.
Now, my problem is that my error plot is confusing: the neural network simply stops learning.
My second problem is that the weights are blowing up, reaching values around -10000, which causes overflow in np.exp inside the sigmoid function.
My third problem is that the output vector only ever outputs 0.5 instead of 1 or 0.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

data = pd.read_csv('xor.csv').sample(frac=1)

X = data.iloc[:, [0, 1]]  # 1st and 2nd cols are the input
X = np.hstack((X, np.ones((data.shape[0], 1))))  # adding the bias 1's
y = data.iloc[:, 2][:, np.newaxis]  # 3rd col is the output

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# the construction of nn is not shown in the post; something like
# nn = TwoLayerNetwork(3, hidden_size, 1) with the chosen hidden size
nn.train(X_train, y_train, 100)

plt.plot(range(100), [i for i in nn.errors.values()])
plt.show()
The link for the dataset
So, if I read your code correctly, your network is specified correctly, but it is missing a few key points in order to learn XOR by backpropagation.
The fun part is that your error specification is weird. I changed it to
self.errors[iteration] = np.mean(0.5 * (y - output)**2)
for visualization.
With the x-axis denoting the epoch and the y-axis denoting the error:
So what happens is that backpropagation hits a plateau and then rapidly blows up the weights. To slow down the blow-up and allow the network some time to re-evaluate its mistakes, you can scale the updates with a so-called "learning rate" != 1. This addresses one of the pitfalls.
Another one is visible in the second figure: you hit oscillatory behaviour in the updates, and the program never reaches its optimum state. To address this, you can deliberately introduce an imperfection in the form of "momentum".
Additionally, the initial conditions matter for the speed at which you converge, so you need enough epochs to overcome the local plateaux:
Last, but certainly not least, I did find an error in your specification, although all of the above still applies: in your layer deltas you compute sigmoidPrime(sigmoid(forward)), which is one call to sigmoid too many.
last_update = np.zeros((X.shape[1], W1.shape[1]))    # momentum buffer for W1
last_update2 = np.zeros((W1.shape[1], W2.shape[1]))  # momentum buffer for W2

# deltas on the pre-activation values (sigmoidPrime applies sigmoid itself)
output_layer_delta = output_error * sigmoidPrime(forward_to_output)  # NxO
hidden_layer_delta = hidden_layer_error * sigmoidPrime(forward_to_hidden)  # NxH

# learning rate 0.001 with momentum coefficient 0.5
W1 += 0.001*(W1_update + last_update * 0.5)
W2 += 0.001*(W2_update + last_update2 * 0.5)
# W1 = 0.001*W1_update
# W2 = 0.001*W2_update
last_update = W1_update.copy()
last_update2 = W2_update.copy()
That did the final trick for me. Now please verify and appease this grumbling man who spent the better part of a night and a day figuring it out. ;)
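Putting the fragments together, the body of train would then look roughly like this; the learning rate and momentum values are the ones from the snippets above, and the rest is a sketch rather than verbatim code:

lr, momentum = 0.001, 0.5
last_update = np.zeros_like(W1)   # momentum buffers
last_update2 = np.zeros_like(W2)

for iteration in range(epochs):
    # forward pass
    forward_to_hidden = X.dot(W1)                  # NxH
    activate_hidden = sigmoid(forward_to_hidden)   # NxH
    forward_to_output = activate_hidden.dot(W2)    # NxO
    output = sigmoid(forward_to_output)            # NxO
    self.errors[iteration] = np.mean(0.5 * (y - output)**2)

    # backward pass: deltas use the pre-activation values,
    # since sigmoidPrime applies sigmoid internally
    output_error = y - output
    output_layer_delta = output_error * sigmoidPrime(forward_to_output)
    hidden_layer_error = output_layer_delta.dot(W2.T)
    hidden_layer_delta = hidden_layer_error * sigmoidPrime(forward_to_hidden)

    W1_update = X.T.dot(hidden_layer_delta)
    W2_update = activate_hidden.T.dot(output_layer_delta)

    # damped update: learning rate plus momentum from the previous step
    W1 += lr * (W1_update + momentum * last_update)
    W2 += lr * (W2_update + momentum * last_update2)
    last_update, last_update2 = W1_update.copy(), W2_update.copy()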