Why evaluate self._initial_state when training RNN in Tensorflow - python

In the RNN tutorial ptb_word_lm.py, when training the RNN using run_epoch, why is it necessary to evaluate self._initial_state?
def run_epoch(session, m, data, eval_op, verbose=False):
  """Runs the model on the given data."""
  epoch_size = ((len(data) // m.batch_size) - 1) // m.num_steps
  start_time = time.time()
  costs = 0.0
  iters = 0
  state = m.initial_state.eval()
  for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
                                                    m.num_steps)):
    cost, state, _ = session.run([m.cost, m.final_state, eval_op],
                                 {m.input_data: x,
                                  m.targets: y,
                                  m.initial_state: state})
    costs += cost
    iters += m.num_steps
    if verbose and step % (epoch_size // 10) == 10:
      print("%.3f perplexity: %.3f speed: %.0f wps" %
            (step * 1.0 / epoch_size, np.exp(costs / iters),
             iters * m.batch_size / (time.time() - start_time)))
  return np.exp(costs / iters)
The initial state is defined as follows and is never changed during training:
self._initial_state = cell.zero_state(batch_size, tf.float32)

In the PTB example, the sentences are concatenated and split into batches (of size batch_size x num_steps). After each batch, the last state of the RNN is passed as the initial state of the next batch. This effectively allows you to train the RNN as if it were one very long chain over the entire PTB corpus (and this explains why model.final_state is evaluated and why the state is passed into m.initial_state in the feed_dict). So you see that the initial_state actually does change at every step.
At the very beginning of an epoch, we have no previous state to pass as the initial_state and so we use all zeros, represented by state = m.initial_state.eval(). Perhaps it would be less confusing if there was another property called m.zero_state that you evaluated to get this initial state. You could, for example, also use a numpy array of zeros of the appropriate size and this would work just fine too. The eval is just a convenient way to get a tensor of zeros of the appropriate size.
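As a minimal sketch of that NumPy alternative, assuming the TF 0.x concatenated-state layout (cell.zero_state() returning a single [batch_size, state_size] tensor) and a hypothetical m.state_size attribute that the example model does not actually expose:

import numpy as np

# Hypothetical: m.state_size stands in for the cell's total state size,
# which is not exposed by the example PTBModel as written.
# Instead of: state = m.initial_state.eval()
state = np.zeros((m.batch_size, m.state_size), dtype=np.float32)

# ...then feed it exactly as before:
# session.run(..., {m.input_data: x, m.targets: y, m.initial_state: state})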
Hope this makes sense!

Related

Pytorch backward does not compute the gradients for requested variables

I'm trying to train a resnet18 model in PyTorch (with pytorch-lightning) using Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (i.e. the cross-entropy loss of the model) with respect to tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
    x, y = train_batch
    unlabeled_idx = y is None
    d = torch.rand(x.shape).to(x.device)
    d = d / (torch.norm(d) + 1e-8)
    pred_y = self.classifier(x)
    y[unlabeled_idx] = pred_y[unlabeled_idx]
    l = self.criterion(pred_y, y)
    R_adv = torch.zeros_like(x)
    for _ in range(self.ip):
        r = self.xi * d
        r.requires_grad = True
        pred_hat = self.classifier(x + r)
        # pred_hat = F.log_softmax(pred_hat, dim=1)
        D = self.criterion(pred_hat, pred_y)
        self.classifier.zero_grad()
        D.requires_grad = True
        D.backward()
        R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
    R_adv /= 32
    loss = l + R_adv * self.a
    loss.backward()
    self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
    return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback excluded because this error is not helpful and technically "solved" as I know the cause for it, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True and I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?
In Lightning, .backward() and optimizer step are all handled under the hood. If you do it yourself like in the code above, it will mess with Lightning because it doesn't know you called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
    super().__init__()
    # put this in your init
    self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling optimizer step + zero grad yourself. Don't forget to add that in your code above. You can access the optimizer and scheduler like so in your training step:
def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()
    scheduler = self.lr_schedulers()

    # do your training step
    # don't forget to call:
    # 1) backward 2) optimizer step 3) zero grad
Read more about manual optimization here.
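For concreteness, here is a rough sketch of what manual optimization can look like in a LightningModule. The class name and the classifier/criterion attributes are placeholders borrowed from the question, not a definitive implementation of the VAT step itself:

import torch
import pytorch_lightning as pl

class ManualOptModule(pl.LightningModule):
    def __init__(self, classifier, criterion, lr=1e-3):
        super().__init__()
        self.automatic_optimization = False  # take over backward/step/zero_grad
        self.classifier = classifier
        self.criterion = criterion
        self.lr = lr

    def configure_optimizers(self):
        return torch.optim.Adam(self.classifier.parameters(), lr=self.lr)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        x, y = batch
        pred_y = self.classifier(x)
        loss = self.criterion(pred_y, y)
        # any extra backward passes for intermediate gradients (e.g. r.grad
        # in the VAT code above) can be done here without confusing Lightning
        opt.zero_grad()
        self.manual_backward(loss)  # use this instead of loss.backward()
        opt.step()
        return loss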

Tricky forward pass pytorch

I have a precipitation map timeseries dataset with input shape (None, seq_length=7, c=75, w=112, h=112) and output shape (None, lead_times=60, c=51, w=28, h=28). The model (Conv downsampler + ConvGRU + Axial Attention) predicts precipitation in a 28x28 region in the middle, with 51 categorical precipitation intervals, and is conditioned on 60 different lead times (5, 10, ..., 300 minutes).
Right now my forward pass looks like this:
def forward(self, imgs):
    """It takes a rank 5 tensor
    - imgs [bs, seq_len, channels, h, w]
    """
    # Compute all timesteps, probably can be parallelized
    res = []
    for i in range(self.forecast_steps):
        x_i = self.encode_timestep(imgs, i)
        out = self.head(x_i)
        res.append(out)
    res = torch.stack(res, dim=1)
    return res
Here imgs is the input tensor without lead time encoding, so only 15 channels. imgs is then one-hot encoded for each respective lead time, and the output is the entire predicted time series (5-300 min). However, this leads to severe memory issues even with batch_size = 1, so I want the forward loop to only do one random lead time at a time. I am training this with a pytorch-lightning module for easier parallelization, so I don't have much control over the training loop.
The issue is that the effective batch size with this training loop is 60*batch_size. The paper solves this by only doing one random lead time per sample, which now makes sense to me. This solves the memory issue by allowing the effective minimum batch size to be 1. How can I pass a random integer (the lead time) to the forward pass and couple it with the correct Y when pytorch-lightning computes the loss?
I want
y_hat = forward(self, X[n], lead_time=random)
...
loss(y_hat-Y[n,lead_time,:,:])
My code is available at https://github.com/ValterFallenius/metnet.
I figured out how to fix it. Once I explained the problem to someone else, I realized how simple the solution was...
def forward(self, imgs, lead_time):
    """It takes a rank 5 tensor
    - imgs [bs, seq_len, channels, h, w]
    - lead_time: random int between 0 and self.forecast_steps
    """
    x_i = self.encode_timestep(imgs, lead_time)
    out = self.head(x_i)
    # only one lead time per forward pass now, so no stacking over lead times
    return out
The trick was to simply add the lead_time variable to the training_step method:
def training_step(self, batch, batch_idx):
    x, y = batch
    lead_time = np.random.randint(0, self.forecast_steps)
    y_hat = self(x.float(), lead_time)
    loss = F.mse_loss(y_hat, y[:, lead_time])
    pbar = {"training_loss": loss}
    return {"loss": loss, "progress_bar": pbar}

Neural networks very bad accuracy when using more than one hidden layer

I have created the following neural network:
def init_weights(m, n=1):
    """
    initialize a matrix/vector of weights with xavier initialization
    :param m: out dim
    :param n: in dim
    :return: matrix/vector of random weights
    """
    limit = (6 / (n * m)) ** 0.5
    weights = np.random.uniform(-limit, limit, size=(m, n))
    if n == 1:
        weights = weights.reshape((-1,))
    return weights


def softmax(v):
    exp = np.exp(v)
    return exp / np.tile(exp.sum(1), (v.shape[1], 1)).T


def relu(x):
    return np.maximum(x, 0)


def sign(x):
    return (x > 0).astype(int)


class Model:
    """
    A class for neural network model
    """
    def __init__(self, sizes, lr):
        self.lr = lr
        self.weights = []
        self.biases = []
        self.memory = []
        for i in range(len(sizes) - 1):
            self.weights.append(init_weights(sizes[i + 1], sizes[i]))
            self.biases.append(init_weights(sizes[i + 1]))

    def forward(self, X):
        self.memory = [X]
        X = np.dot(self.weights[0], X.T).T + self.biases[0]
        for W, b in zip(self.weights[1:], self.biases[1:]):
            X = relu(X)
            self.memory.append(X)
            X = np.dot(W, X.T).T + b
        return softmax(X)

    def backward(self, y, y_pred):
        # calculate the errors for each layer
        y = np.eye(y_pred.shape[1])[y]
        errors = [y_pred - y]
        for i in range(len(self.weights) - 1, 0, -1):
            new_err = sign(self.memory[i]) * \
                np.dot(errors[0], self.weights[i])
            errors.insert(0, new_err)
        # update weights
        for i in range(len(self.weights)):
            self.weights[i] -= self.lr * \
                np.dot(self.memory[i].T, errors[i]).T
            self.biases[i] -= self.lr * errors[i].sum(0)
The data has 10 classes. When using a single hidden layer the accuracy is almost 40%. When using 2 or 3 hidden layers, the accuracy is around 9-10% from the first epoch and remains that way. The accuracy on the train set is also in that range. Is there a problem with my implementation that could cause such a thing?
You asked about improving the accuracy of a machine learning model, which is a very broad question, because the answer varies across model types and data types.
In your case the model is a neural network, whose accuracy depends on several factors. You are trying to optimize accuracy on the basis of activation functions, weights, or the number of hidden layers alone, which is not enough. To increase accuracy you have to consider other factors too; a basic checklist could be the following:
Increase Hidden Layers
Change Activation Functions
Experiment with initial weight initialization
Normalize Training Data
Scale Training Data
Check for Class Imbalance
Right now you are trying to achieve state-of-the-art accuracy on the basis of very few factors. I don't know about your dataset since you haven't shown the preprocessing code, but I recommend that you double-check it: correctly normalizing the dataset may increase accuracy, as may scaling it. Most importantly, check whether one class is heavily over-represented compared to the others; class imbalance will also lead to poor accuracy.
For more details check this; it contains the mathematical proof and an explanation of how these factors affect your ML model's accuracy.
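As a small illustration of the normalization and class-balance items on that checklist, here is a sketch with made-up X_train / y_train arrays (not from the question's code):

import numpy as np

# Made-up data standing in for the question's dataset
X_train = np.random.rand(1000, 784)
y_train = np.random.randint(0, 10, size=1000)

# Normalize each feature to zero mean and unit variance
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8
X_train = (X_train - mean) / std

# Check for class imbalance: roughly equal counts are what you want
classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))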

Python Minibatch Dictionary Learning

I'd like to implement error tracking with dictionary learning in Python, using sklearn's MiniBatchDictionaryLearning, so that I can record how the error decreases over the iterations. I have two methods to do it, neither of which really worked. Setup:
Input data X, numpy array shape (n_samples, n_features) = (298143, 300). These are patches of shape (10, 10), generated from an image of shape (642, 480, 3).
Dictionary learning parameters: No. of columns (or atoms) = 100, alpha = 2, transform algorithm = OMP, total no. of iterations = 500 (keep it small first, just as a test case)
Calculating error: After learning the dictionary, I encode the original image again based on the learnt dictionary. Since both the encoding and the original are numpy arrays of the same shape (642, 480, 3), I'm just doing elementwise Euclidean distance for now:
err = np.sqrt(np.sum((reconstruction - original)**2))
I did a test run with these parameters, and the full fit was able to produce a pretty good reconstruction with a low error, so that's good. Now on to the two methods:
Method 1: Save the learnt dictionary every 100 iterations, and record the error. For 500 iterations, this gives us 5 runs of 100 iterations each. After each run, I compute the error, then use the currently learnt dictionary as an initialization for the next run.
# Fit an initial dictionary, V, as a first run
dico = MiniBatchDictionaryLearning(n_components=100,
                                   alpha=2,
                                   n_iter=100,
                                   transform_algorithm='omp')
dl = dico.fit(patches)
V = dl.components_

# Now do another 4 runs of 100 iterations each.
# Note the warm restart parameter, dict_init=V.
n_runs = 4
n_iterations = 100
for i in range(n_runs):
    print("Run %s..." % i, end="")
    dico = MiniBatchDictionaryLearning(n_components=100,
                                       alpha=2,
                                       n_iter=n_iterations,
                                       transform_algorithm='omp',
                                       dict_init=V)
    dl = dico.fit(patches)
    V = dl.components_
    img_r = reconstruct_image(dico, V, patches)
    err = np.sqrt(np.sum((img - img_r)**2))
    print("Err = %s" % err)
Problem: The error isn't decreasing, and was pretty high. The dictionary wasn't learnt very well either.
Method 2: Cut the input data X into batches of, say, 500 samples each, and do partial fitting, using the partial_fit() method.
batch_size = 500
n_batches = X.shape[0] // batch_size
print(n_batches)  # 596

for iternum in range(n_batches):
    batch = patches[iternum * batch_size : (iternum + 1) * batch_size]
    V = dico.partial_fit(batch)
Problem: this seems to take about 5000 times longer.
I'd like to know if there's a way to retrieve the error over the fitting process?
Each call to fit re-initializes the model and forgets any previous call to fit: this is the expected behavior of all estimators in scikit-learn.
I think using partial_fit in a loop is the right solution, but you should call it on small batches (as done in the fit method; the default batch_size value is just 3) and then only compute the cost every 100 or 1000 calls to partial_fit, for instance:
batch_size = 3
n_epochs = 20
n_batches = X.shape[0] // batch_size
print(n_batches)

n_updates = 0
for epoch in range(n_epochs):
    for i in range(n_batches):
        batch = patches[i * batch_size:(i + 1) * batch_size]
        dico.partial_fit(batch)
        n_updates += 1
        if n_updates % 100 == 0:
            img_r = reconstruct_image(dico, dico.components_, patches)
            err = np.sqrt(np.sum((img - img_r)**2))
            print("[epoch #%02d] Err = %s" % (epoch, err))
I ran into the same problem and finally was able to make the code much faster. In case it's still useful to someone, I'm adding the solution here. The catch is that while constructing the MiniBatchDictionaryLearning object, we need to set n_iter to a low value (e.g., 1), so that each partial_fit call does not run a single batch for too many epochs.
# Construct an initial dictionary object. The partial fits will be done later
# inside the loop; here we only specify that each partial_fit() should run
# just 1 epoch (n_iter=1) with batch_size=batch_size on the batch provided.
# (Otherwise, by default it can run up to 1000 iterations with batch_size=3
# for a single partial_fit() on each batch, which makes a single call to
# partial_fit() very slow. Since we control the epochs ourselves and restart
# when all the batches are done, we need not provide more than 1 iteration
# here. This makes the code execute fast.)
batch_size = 128  # e.g.
dico = MiniBatchDictionaryLearning(n_components=100,
                                   alpha=2,
                                   n_iter=1,  # epochs per partial_fit()
                                   batch_size=batch_size,
                                   transform_algorithm='omp')
followed by @ogrisel's code:
n_epochs = 20
n_batches = X.shape[0] // batch_size

n_updates = 0
for epoch in range(n_epochs):
    for i in range(n_batches):
        batch = patches[i * batch_size:(i + 1) * batch_size]
        dico.partial_fit(batch)
        n_updates += 1
        if n_updates % 100 == 0:
            img_r = reconstruct_image(dico, dico.components_, patches)
            err = np.sqrt(np.sum((img - img_r)**2))
            print("[epoch #%02d] Err = %s" % (epoch, err))

Predicting the next word using the LSTM ptb model tensorflow example

I am trying to use the tensorflow LSTM model to make next word predictions.
As described in this related question (which has no accepted answer), the example contains pseudocode to extract next word probabilities:
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
I am confused about how to interpret the probabilities vector. I modified the __init__ function of the PTBModel in ptb_word_lm.py to store the probabilities and logits:
class PTBModel(object):
  """The PTB model."""

  def __init__(self, is_training, config):
    # General definition of LSTM (unrolled)
    # identical to tensorflow example ...
    # omitted for brevity ...

    # computing the logits (also from example code)
    logits = tf.nn.xw_plus_b(output,
                             tf.get_variable("softmax_w", [size, vocab_size]),
                             tf.get_variable("softmax_b", [vocab_size]))
    loss = seq2seq.sequence_loss_by_example([logits],
                                            [tf.reshape(self._targets, [-1])],
                                            [tf.ones([batch_size * num_steps])],
                                            vocab_size)
    self._cost = cost = tf.reduce_sum(loss) / batch_size
    self._final_state = states[-1]

    # my addition: storing the probabilities and logits
    self.probabilities = tf.nn.softmax(logits)
    self.logits = logits

    # more model definition ...
Then printed some info about them in the run_epoch function:
def run_epoch(session, m, data, eval_op, verbose=True):
  """Runs the model on the given data."""
  # first part of function unchanged from example

  for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
                                                    m.num_steps)):
    # evaluate probability and logit tensors too:
    cost, state, probs, logits, _ = session.run(
        [m.cost, m.final_state, m.probabilities, m.logits, eval_op],
        {m.input_data: x,
         m.targets: y,
         m.initial_state: state})
    costs += cost
    iters += m.num_steps
    if verbose and step % (epoch_size // 10) == 10:
      print("%.3f perplexity: %.3f speed: %.0f wps, n_iters: %s" %
            (step * 1.0 / epoch_size, np.exp(costs / iters),
             iters * m.batch_size / (time.time() - start_time), iters))
      chosen_word = np.argmax(probs, 1)
      print("Probabilities shape: %s, Logits shape: %s" %
            (probs.shape, logits.shape))
      print(chosen_word)
      print("Batch size: %s, Num steps: %s" % (m.batch_size, m.num_steps))
  return np.exp(costs / iters)
This produces output like this:
0.000 perplexity: 741.577 speed: 230 wps, n_iters: 220
(20, 10000) (20, 10000)
[ 14 1 6 589 1 5 0 87 6 5 3 5 2 2 2 2 6 2 6 1]
Batch size: 1, Num steps: 20
I was expecting the probs vector to be an array of probabilities, with one for each word in the vocabulary (eg with shape (1, vocab_size)), meaning that I could get the predicted word using np.argmax(probs, 1) as suggested in the other question.
However, the first dimension of the vector is actually equal to the number of steps in the unrolled LSTM (20 if the small config settings are used), which I'm not sure what to do with. To access the predicted word, do I just need to use the last value (because it's the output of the final step)? Or is there something else that I'm missing?
I tried to understand how the predictions are made and evaluated by looking at the implementation of seq2seq.sequence_loss_by_example, which must perform this evaluation, but this ends up calling gen_nn_ops._sparse_softmax_cross_entropy_with_logits, which doesn't seem to be included in the github repo, so I'm not sure where else to look.
I'm quite new to both tensorflow and LSTMs, so any help is appreciated!
The output tensor contains the concatenation of the LSTM cell outputs for each timestep (see its definition here). Therefore you can find the prediction for the next word by taking chosen_word[-1] (or chosen_word[sequence_length - 1] if the sequence has been padded to match the unrolled LSTM).
The tf.nn.sparse_softmax_cross_entropy_with_logits() op is documented in the public API under a different name. For technical reasons, it calls a generated wrapper function that does not appear in the GitHub repository. The implementation of the op is in C++, here.
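Putting that together, a small sketch of extracting the predicted next word from probs; the word_to_id dictionary built by the PTB reader is assumed to be available (it is not exposed in the example code as written):

import numpy as np

# probs has shape [num_steps, vocab_size] when batch_size == 1, so the
# distribution over the next word is in the last row.
next_word_probs = probs[-1]
next_word_id = int(np.argmax(next_word_probs))

# Assumed: word_to_id from the PTB reader; invert it to map ids back to words.
id_to_word = {i: w for w, i in word_to_id.items()}
print(id_to_word[next_word_id])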
I am implementing a seq2seq model too, so let me try to explain with my understanding:
The outputs of your LSTM model form a list (of length num_steps) of 2D tensors of size [batch_size, size].
The code line:
output = tf.reshape(tf.concat(1, outputs), [-1, size])
will produce a new output which is a 2D tensor of size [batch_size x num_steps, size].
For your case, batch_size = 1 and num_steps = 20 --> output shape is [20, size].
Code line:
logits = tf.nn.xw_plus_b(output, tf.get_variable("softmax_w", [size, vocab_size]), tf.get_variable("softmax_b", [vocab_size]))
i.e., output of size [batch_size x num_steps, size] times softmax_w of size [size, vocab_size] gives logits of size [batch_size x num_steps, vocab_size].
For your case, logits has size [20, vocab_size],
so the probs tensor has the same size as logits, i.e. [20, vocab_size].
Code line:
chosen_word = np.argmax(probs, 1)
will output a chosen_word array of shape (20,), where each value is the predicted index of the next word for the corresponding position.
Code line:
loss = seq2seq.sequence_loss_by_example([logits], [tf.reshape(self._targets, [-1])], [tf.ones([batch_size * num_steps])])
computes the softmax cross-entropy loss for a batch of batch_size sequences.
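If it helps, the shape bookkeeping above can be checked with a tiny NumPy mock-up (dummy sizes, no real model):

import numpy as np

batch_size, num_steps, size, vocab_size = 1, 20, 200, 10000

output = np.zeros((batch_size * num_steps, size))       # [20, 200]
softmax_w = np.zeros((size, vocab_size))                # [200, 10000]
softmax_b = np.zeros(vocab_size)                        # [10000]

logits = output.dot(softmax_w) + softmax_b              # [20, 10000]
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
chosen_word = np.argmax(probs, 1)                       # shape (20,)

print(logits.shape, probs.shape, chosen_word.shape)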
