I am trying to modify the projection layer of my NMT (neural machine translation) model. I want to be able to update the number of units without reinitializing all of the weights. I followed the TensorFlow NMT tutorial found here. Here is the code for my decoder:
# Decoder
train_decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, train_helper, decoder_initial_state)
maximum_iterations = tf.round(tf.reduce_max(encoder_input_lengths) * 2)
# Dynamic decoding
train_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(train_decoder)
# Projection layer -- THIS IS WHAT I WANT TO MODIFY
projection_layer = layers_core.Dense(
    len(language_base.vocabulary), use_bias=False)
train_logits = projection_layer(train_outputs.rnn_output)
train_crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=train_logits)
# Target weights
target_weights = tf.sequence_mask(
    decoder_input_lengths, params.tgt_max_len, dtype=train_logits.dtype)
target_weights = tf.transpose(target_weights)
# Loss function
train_loss = (tf.reduce_sum(train_crossent * target_weights) /
              tf.to_float(params.batch_size))
# Calculate and clip gradients
train_vars = tf.trainable_variables()
gradients = tf.gradients(train_loss, train_vars)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, params.max_gradient_norm)
# Optimization
optimizer = tf.train.AdamOptimizer(params.learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, train_vars))
TensorFlow doesn't really let you change the shape of a variable (which is what the number of units corresponds to here) without some effort.
Instead, you're better off preallocating a larger-than-needed number of units, masking the units you don't use to 0 during the early stages of training, and updating your mask as you go (store the mask in a variable to make updating it easier).
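A rough sketch of that idea (the 2x upper bound and the names max_units / unit_mask are illustrative assumptions, not part of the original model):
# Preallocate more output units than the current vocabulary needs.
max_units = 2 * len(language_base.vocabulary)   # assumed upper bound
projection_layer = layers_core.Dense(max_units, use_bias=False)

# Non-trainable mask: 1.0 for units in use, 0.0 for the spare ones.
active = len(language_base.vocabulary)
unit_mask = tf.Variable(
    tf.concat([tf.ones([active]), tf.zeros([max_units - active])], axis=0),
    trainable=False, name="unit_mask")

# Because the mask is part of the graph, the kernel columns of masked-out units
# receive zero gradient. Growing the vocabulary later only requires a tf.assign
# on unit_mask, not re-creating the projection weights.
train_logits = projection_layer(train_outputs.rnn_output) * unit_mask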
Related
I'm trying to combine a few "networks" into one final loss function. I'm wondering if what I'm doing is "legal"; as of now I can't seem to make this work. I'm using TensorFlow Probability:
The main problem is here:
# Get gradients of the loss wrt the weights.
gradients = tape.gradient(loss, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
# Update the weights of our linear layer.
optimizer.apply_gradients(zip(gradients, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights]))
This gives me None gradients and throws on apply_gradients:
AttributeError: 'list' object has no attribute 'device'
Full code:
univariate_gmm = tfp.distributions.MixtureSameFamily(
    mixture_distribution=tfp.distributions.Categorical(probs=phis_true),
    components_distribution=tfp.distributions.Normal(loc=mus_true, scale=sigmas_true)
)
x = univariate_gmm.sample(n_samples, seed=random_seed).numpy()
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(buffer_size=1024).batch(64)
m_phis = keras.layers.Dense(2, activation=tf.nn.softmax)
m_mus = keras.layers.Dense(2)
m_sigmas = keras.layers.Dense(2, activation=tf.nn.softplus)
def neg_log_likelihood(y, phis, mus, sigmas):
    a = tfp.distributions.Normal(loc=mus[0], scale=sigmas[0]).prob(y)
    b = tfp.distributions.Normal(loc=mus[1], scale=sigmas[1]).prob(y)
    c = np.log(phis[0]*a + phis[1]*b)
    return tf.reduce_sum(-c, axis=-1)
# Use the negative log-likelihood defined above as the loss function.
loss_fn = neg_log_likelihood
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
# Iterate over the batches of the dataset.
for step, y in enumerate(dataset):
    yy = np.expand_dims(y, axis=1)
    # Open a GradientTape.
    with tf.GradientTape() as tape:
        # Forward pass.
        phis = m_phis(yy)
        mus = m_mus(yy)
        sigmas = m_sigmas(yy)
        # Loss value for this batch.
        loss = loss_fn(yy, phis, mus, sigmas)
    # Get gradients of the loss wrt the weights.
    gradients = tape.gradient(loss, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
    # Update the weights of our linear layer.
    optimizer.apply_gradients(zip(gradients, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights]))
    # Logging.
    if step % 100 == 0:
        print("Step:", step, "Loss:", float(loss))
There are two separate problems to take into account.
1. Gradients are None:
Typically this happens if non-TensorFlow operations are executed in the code watched by the GradientTape. Concretely, this concerns the computation of np.log in your neg_log_likelihood function. If you replace np.log with tf.math.log, the gradients should compute. It is a good habit to avoid numpy in your "internal" TensorFlow components, since it avoids errors like this one. For most numpy operations there is a good TensorFlow substitute.
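For reference, this is the loss function with only that substitution applied:
def neg_log_likelihood(y, phis, mus, sigmas):
    a = tfp.distributions.Normal(loc=mus[0], scale=sigmas[0]).prob(y)
    b = tfp.distributions.Normal(loc=mus[1], scale=sigmas[1]).prob(y)
    c = tf.math.log(phis[0] * a + phis[1] * b)  # tf.math.log keeps the op differentiable on the tape
    return tf.reduce_sum(-c, axis=-1)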
2. apply_gradients for multiple trainables:
This mainly has to do with the input that apply_gradients expects. There you have two options:
First option: Call apply_gradients three times, each time with different trainables
optimizer.apply_gradients(zip(m_phis_gradients, m_phis.trainable_weights))
optimizer.apply_gradients(zip(m_mus_gradients, m_mus.trainable_weights))
optimizer.apply_gradients(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
The alternative would be to create a single list of tuples, as indicated in the tensorflow documentation (quote: "grads_and_vars: List of (gradient, variable) pairs.").
This would mean calling something like
optimizer.apply_gradients(
    list(zip(m_phis_gradients, m_phis.trainable_weights)) +
    list(zip(m_mus_gradients, m_mus.trainable_weights)) +
    list(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
)
Both options require you to split the gradients. You can either compute all gradients in one call and index the result (gradients[0], ...), as sketched below, or compute the gradients separately for each layer. Note that the latter may require persistent=True in your GradientTape.
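For completeness, the indexing variant could look roughly like this (it reuses the single tape.gradient call from the original code; tape.gradient returns a structure that mirrors the nested list of sources):
gradients = tape.gradient(loss, [m_phis.trainable_weights,
                                 m_mus.trainable_weights,
                                 m_sigmas.trainable_weights])
# gradients[0], gradients[1], gradients[2] mirror the three weight lists
optimizer.apply_gradients(zip(gradients[0], m_phis.trainable_weights))
optimizer.apply_gradients(zip(gradients[1], m_mus.trainable_weights))
optimizer.apply_gradients(zip(gradients[2], m_sigmas.trainable_weights))
The separate-computation variant with a persistent tape: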
# [...]
# Open a GradientTape.
with tf.GradientTape(persistent=True) as tape:
    # Forward pass.
    phis = m_phis(yy)
    mus = m_mus(yy)
    sigmas = m_sigmas(yy)
    # Loss value for this batch.
    loss = loss_fn(yy, phis, mus, sigmas)
# Get gradients of the loss wrt the weights.
m_phis_gradients = tape.gradient(loss, m_phis.trainable_weights)
m_mus_gradients = tape.gradient(loss, m_mus.trainable_weights)
m_sigmas_gradients = tape.gradient(loss, m_sigmas.trainable_weights)
# Update the weights of our linear layer.
optimizer.apply_gradients(
    list(zip(m_phis_gradients, m_phis.trainable_weights)) +
    list(zip(m_mus_gradients, m_mus.trainable_weights)) +
    list(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
)
# [...]
There is a famous trick in the U-Net architecture of using custom weight maps to increase accuracy.
Now, by asking here and at multiple other places, I have learned about two approaches. I want to know which one is correct, or whether there is another, more correct approach.
The first is to use torch.nn.functional in the training loop:
loss = torch.nn.functional.cross_entropy(output, target, w) where w is the calculated custom weight.
Second is to use reduction='none' in the calling of loss function outside the training loop
criterion = torch.nn.CrossEntropyLoss(reduction='none')
and then in the training loop multiplying with the custom weight:
gt # Ground truth, format torch.long
pd # Network output
W # per-element weighting based on the distance map from UNet
loss = criterion(pd, gt)
loss = W*loss # Ensure that weights are scaled appropriately
loss = torch.sum(loss.flatten(start_dim=1), dim=1) # Sums the loss per image
loss = torch.mean(loss) # Average across a batch
Now, I am kinda confused which one is right or is there any other way, or both are right?
The weighting portion looks like simply weighted cross entropy, which is performed like this for the number of classes (2 in the example below).
weights = torch.FloatTensor([.3, .7])
loss_func = nn.CrossEntropyLoss(weight=weights)
EDIT:
Have you seen this implementation from Patrick Black?
import torch
import torch.nn.functional as F

# Set properties
batch_size = 10
out_channels = 2
W = 10
H = 10
# Initialize logits etc. with random
logits = torch.FloatTensor(batch_size, out_channels, H, W).normal_()
target = torch.LongTensor(batch_size, H, W).random_(0, out_channels)
weights = torch.FloatTensor(batch_size, 1, H, W).random_(1, 3)
# Calculate log probabilities
logp = F.log_softmax(logits, dim=1)
# Gather log probabilities with respect to target
logp = logp.gather(1, target.view(batch_size, 1, H, W))
# Multiply with weights
weighted_logp = (logp * weights).view(batch_size, -1)
# Rescale so that loss is in approx. same interval
weighted_loss = weighted_logp.sum(1) / weights.view(batch_size, -1).sum(1)
# Average over mini-batch
weighted_loss = -1. * weighted_loss.mean()
Note that torch.nn.CrossEntropyLoss() is a class that calls torch.nn.functional.cross_entropy under the hood.
See https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss
You can pass the weights when you define the criterion. Functionally, both methods are the same.
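A quick way to convince yourself of the equivalence (toy tensors, class weights only):
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 2)
target = torch.randint(0, 2, (4,))
w = torch.tensor([0.3, 0.7])

loss_module = nn.CrossEntropyLoss(weight=w)(logits, target)
loss_functional = F.cross_entropy(logits, target, weight=w)
print(torch.allclose(loss_module, loss_functional))  # True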
Now, I do not understand your idea of computing the loss inside the training loop in method 1 and outside the training loop in method 2. If you compute the loss outside the loop, how will you backpropagate?
Problem settings
As a beginner with RNNs, I am currently building a 3-to-1 autocompletion RNN model for 4-letter words, where the input is a 3-letter incomplete word and the output is a single letter which completes the word. For example, I would like the model to make the following prediction:
input : "C", "A", "F"
output : "E"
Codes - generate dataset
To get the desired result from an RNN model, I have made an (imbalanced) dataset as follows:
import string
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
alphList = list(string.ascii_uppercase) # Define a list of alphabets
alphToNum = {n: i for i, n in enumerate(alphList)} # dic of alphabet-numbers
# Make dataset
# define words of interest
fourList = ['CARE', 'CODE', 'COME', 'CANE', 'COPE', 'FISH', 'JAZZ', 'GAME', 'WALK', 'QUIZ']
# (len(Sequence), len(Batch), len(Observation)) following tensorflow-style
first3Data = np.zeros((3, len(fourList), len(alphList)), dtype=np.int32)
last1Data = np.zeros((len(fourList), len(alphList)), dtype=np.int32)
for idxObs, word in enumerate(fourList):
    # Make an array of one-hot vectors consisting of first 3 letters
    first3 = [alphToNum[n] for n in word[:-1]]
    first3Data[:, idxObs, :] = np.eye(len(alphList))[first3]
    # Make an array of one-hot vectors consisting of last 1 letter
    last1 = alphToNum[word[3]]
    last1Data[idxObs, :] = np.eye(len(alphList))[last1]
So fourList contains the training data information, first3Data contains all the one-hot encoded first 3 letters of the training data, and last1Data contains all the one-hot encoded last 1 letter of the training data.
Codes - build model
Following the standard setup of a 3-to-1 RNN model, I have written the following code.
# Hyperparameters
n_data = len(fourList)
n_input = len(alphList) # number of input units
n_hidden = 128 # number of hidden units
n_output = len(alphList) # number of output units
learning_rate = 0.01
total_epoch = 100000
# Variables (separate version)
W_in = tf.Variable(tf.random_normal([n_input, n_hidden]))
W_rec = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_rec = tf.Variable(tf.random_normal([n_hidden]))
W_out = tf.Variable(tf.random_normal([n_hidden, n_output]))
b_out = tf.Variable(tf.random_normal([n_output]))
# Manual calculation of RNN output
def RNNoutput(Xinput):
    h_state = tf.random_normal([1, n_hidden])  # initial hidden state
    for iX in Xinput:
        h_state = tf.nn.tanh(iX @ W_in + (h_state @ W_rec + b_rec))
    rnn_output = h_state @ W_out + b_out
    return rnn_output
Note that the manual calculation of the RNN output basically rolls the hidden state forward three times (once per input letter) using matrix multiplication and the tanh activation function, as follows:
tf.nn.tanh(iX @ W_in + (h_state @ W_rec + b_rec))
Here, every time the whole data is passed, one epoch is completed. Thus I initialize the h_state every time I pass the data. Additionally, note that I have not used a placeholder, which may be a cause of the learning instability.
Codes - train
I have used the following code to train the network.
# Cost / optimizer definition
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=RNNoutput(first3Data), labels=last1Data))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
# Train and keep track of the loss history
sess = tf.Session()
sess.run(tf.global_variables_initializer())
lossHistory = []
for epoch in range(total_epoch):
    _, loss = sess.run([optimizer, cost])
    lossHistory.append(loss)
Question
The resulting learning curve looks as follows. Indeed, it shows an exponential decay.
However, it looks too wiggly to me for such a simple example, showing some instability even late in training.
plt.plot(range(total_epoch), lossHistory)
plt.show()
Possible explanations?
I think the learning curve should show a square-like, stable decay pattern, as I would expect from TensorFlow's built-in functions (*). But I think this instability may be explained plausibly as follows:
Instability in the random initialization of parameters
Numerical instability due to the successive additions when defining RNNoutput
Not using a TensorFlow loop construct but looping over the data with a plain Python for loop
But I don't think any of these plays a crucial role. Is there any other solution that could help me out?
(*) I have seen a nearly square-patterned loss decay when using TensorFlow's built-in functions for a simple RNN. Sorry that I have not included those results for comparison, since I ran out of time; I will try to edit them in shortly.
This modification, where the initial state is set to zero, seems to solve the problem. (With tf.random_normal, the initial hidden state is re-sampled every time the graph is evaluated, so every training step starts from a different random state, which injects noise into the loss.)
# Variables (separate version)
W_in = tf.Variable(tf.random_normal([n_input, n_hidden]))
W_rec = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_rec = tf.Variable(tf.random_normal([n_hidden]))
W_out = tf.Variable(tf.random_normal([n_hidden, n_output]))
b_out = tf.Variable(tf.random_normal([n_output]))
h_init = tf.zeros([1,n_hidden])
# Manual calculation of RNN output
def RNNoutput(Xinput):
    h_state = h_init  # initial hidden state
    for iX in Xinput:
        h_state = tf.nn.tanh(iX @ W_in + (h_state @ W_rec + b_rec))
    rnn_output = h_state @ W_out + b_out
    return rnn_output
I want to create a network where in the input layer nodes are just connected to some nodes in the next layer. Here is a small example:
My solution so far is to set the weight of the edge between i1 and h1 to zero, and after every optimization step I multiply the weights with a matrix (I call it the mask matrix) in which every entry is 1 except the entry for the weight of the edge between i1 and h1, which is 0.
(See code below)
Is this approach right? Or does it affect gradient descent? Is there another way to create this kind of network in TensorFlow?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
tf.enable_eager_execution()
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation=tf.sigmoid, input_shape=(2,)),  # input shape required
    tf.keras.layers.Dense(2, activation=tf.sigmoid)
])
#set the weights
weights=[np.array([[0, 0.25],[0.2,0.3]]),np.array([0.35,0.35]),np.array([[0.4,0.5],[0.45, 0.55]]),np.array([0.6,0.6])]
model.set_weights(weights)
model.get_weights()
features = tf.convert_to_tensor([[0.05,0.10 ]])
labels = tf.convert_to_tensor([[0.01,0.99 ]])
mask =np.array([[0, 1],[1,1]])
#define the loss function
def loss(model, x, y):
    y_ = model(x)
    return tf.losses.mean_squared_error(labels=y, predictions=y_)
#define the gradient calculation
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
#create optimizer and global step
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
global_step = tf.train.get_or_create_global_step()
#optimization step
loss_value, grads = grad(model, features, labels)
optimizer.apply_gradients(zip(grads, model.variables),global_step)
#masking the optimized weights
weights = model.get_weights()
weights[0] = weights[0] * mask  # zero out the masked connection again
model.set_weights(weights)
If you are looking for a solution for the specific example you provided, you can simply use tf.keras Functional API and define two Dense layers where one is connected to both neurons in the previous layer and the other one is only connected to one of the neurons:
from tensorflow.keras.layers import Input, Lambda, Dense, concatenate
from tensorflow.keras.models import Model
inp = Input(shape=(2,))
inp2 = Lambda(lambda x: x[:,1:2])(inp) # get the second neuron
h1_out = Dense(1, activation='sigmoid')(inp2) # only connected to the second neuron
h2_out = Dense(1, activation='sigmoid')(inp) # connected to both neurons
h_out = concatenate([h1_out, h2_out])
out = Dense(2, activation='sigmoid')(h_out)
model = Model(inp, out)
# simply train it using `fit`
model.fit(...)
The problem with your solution (and some others suggested by other answers to this post) is that they do not prevent training of this weight. They allow gradient descent to train the non-existent weight and then overwrite it retrospectively. This results in a network that has a zero in this location as desired, but it will negatively affect your training process: the backpropagation calculation will not see the masking step, since it is not part of the TensorFlow graph, so gradient descent will follow a path that assumes this weight does have an effect on the outcome (it does not).
A better solution is to include the masking step as part of your TensorFlow graph, so that it can be factored into the gradient descent. Since the masking step is simply an element-wise multiplication by your sparse, binary matrix mask, you can include the mask matrix as an element-wise multiplication in the graph definition using tf.multiply.
Sadly this means saying goodbye to the user-friendly keras.layers methods and embracing a more nuts-and-bolts approach to TensorFlow. I can't see an obvious way to do it using the layers API.
See the implementation below; I have tried to provide comments explaining what is happening at each stage.
import tensorflow as tf

## Graph definition for model
# set up tf.placeholders for inputs x, and outputs y_
# these remain fixed during training and can have values fed to them during the session
with tf.name_scope("Placeholders"):
    x = tf.placeholder(tf.float32, shape=[None, 2], name="x")    # input layer
    y_ = tf.placeholder(tf.float32, shape=[None, 2], name="y_")  # output layer

# set up tf.Variables for the weights at each layer from l1 to l3, and set up feeding of initial values
# also set up mask as a variable and set it to be un-trainable
with tf.name_scope("Variables"):
    w_l1_values = [[0, 0.25], [0.2, 0.3]]
    w_l1 = tf.Variable(w_l1_values, name="w_l1")
    w_l2_values = [[0.4, 0.5], [0.45, 0.55]]
    w_l2 = tf.Variable(w_l2_values, name="w_l2")
    mask_values = [[0., 1.], [1., 1.]]
    mask = tf.Variable(mask_values, trainable=False, name="mask")

# link each set of weights as matrix multiplications in the graph. Include an elementwise multiplication by mask.
# Sequence takes us from inputs x to output final_out, which will be compared to labels fed to placeholder y_
l1_out = tf.nn.relu(tf.matmul(x, tf.multiply(w_l1, mask)), name="l1_out")
final_out = tf.nn.relu(tf.matmul(l1_out, w_l2), name="output")

## define loss function and training operation
with tf.name_scope("Loss"):
    # some loss defined as a function of graph output: final_out and labels: y_
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=final_out, labels=y_, name="loss")

with tf.name_scope("Train"):
    # some optimisation strategy, arbitrary learning rate
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name="optimizer_adam")
    train_op = optimizer.minimize(loss, name="train_op")

# create session, initialise variables and train according to inputs and corresponding labels
# This should show that the values of the first layer weights change, but the one set to 0 remains at 0
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
    print(initial_l1_weights.eval())

    inputs = [[0.05, 0.10]]
    labels = [[0.01, 0.99]]
    ans = sess.run(train_op, feed_dict={"Placeholders/x:0": inputs, "Placeholders/y_:0": labels})

    train_steps = 1
    for i in range(train_steps):
        initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
        print(initial_l1_weights.eval())
Or use the answer provided by today (above) for a Keras-friendly option.
You have multiple options here.
First, you could use the dynamic masking approach in your example. I believe this will work as expected since the gradients w.r.t. the masked-out parameters will be zero (the output is constant when you change the unused parameters). This approach is simple and it can be used even when your mask is not constant during the training.
Second, if you know beforehand which weights will be always zero, you can compose your weight matrix using tf.get_variable to get a submatrix, and then concatenate it with a tf.constant tensor, e.g.:
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)
This example makes one column of your weight matrix always zero.
Finally, if your mask is more complex, you can use tf.get_variable on a flattened vector and then compose a tf.SparseTensor with the variable values on the used indices:
weights_used = tf.get_variable("w", [num_used_vars])
indices = ... # get your indices in a 2-D matrix of shape [num_used_vars, 2]
dense_shape = tf.constant([dim_in, dim_out]) # this is the final shape of the weight matrix
weights = tf.SparseTensor(indices, weights_used, dense_shape)
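A minimal sketch of how the sparse weights could then be used in the rest of the graph (this assumes the indices are sorted in row-major order and that inputs is a [batch, dim_in] tensor; tf.sparse.to_dense is the newer name, older TF 1.x releases call it tf.sparse_tensor_to_dense):
dense_weights = tf.sparse.to_dense(weights)   # densify for use in a normal matmul
layer_out = tf.matmul(inputs, dense_weights)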
EDIT: This probably won't work in combination with Keras' set_weights method, as it expects Numpy arrays, not Tensors.
I am trying to develop my own deep learning library to enhance my knowledge and gain some experience. I am using Tensorflow. I am done with fully connected layers and trying to implement LSTM. I will use LSTM for character level joke generation.
My question is: I want to store the previous hidden state in my LSTM class and update the TensorFlow Variable. Is this possible in TensorFlow? I am not sure it works, since TensorFlow variables are initialized beforehand. I have implemented the LSTM like the following:
def feed(self, input_tensor):
    with self.graph.as_default():
        with tf.name_scope(self.name):
            i_t = tf.sigmoid(
                tf.matmul(input_tensor, self.weights['W_i']) +
                tf.matmul(self.hidden_state, self.weights['U_i']) +
                self.biases['b_i']
            )
            C_head_t = tf.tanh(
                tf.matmul(input_tensor, self.weights['W_c']) +
                tf.matmul(self.hidden_state, self.weights['U_c']) +
                self.biases['b_c']
            )
            f_t = tf.sigmoid(
                tf.matmul(input_tensor, self.weights['W_f']) +
                tf.matmul(self.hidden_state, self.weights['U_f']) +
                self.biases['b_f']
            )
            next_cell_state = tf.multiply(i_t, C_head_t) + tf.multiply(f_t, self.cell_state)
            o_t = tf.sigmoid(
                tf.matmul(input_tensor, self.weights['W_o']) +
                tf.matmul(self.hidden_state, self.weights['U_o']) +
                self.biases['b_o']
            )
            next_hidden_state = tf.multiply(o_t, tf.tanh(next_cell_state))
            self.hidden_state = tf.assign(ref=self.hidden_state, value=next_hidden_state)
            self.cell_state = tf.assign(ref=self.cell_state, value=next_cell_state)
            tf.summary.histogram('hidden_state', self.hidden_state)
            tf.summary.histogram('cell_state', self.cell_state)
            for k, v in self.weights.items():
                tf.summary.histogram(k, v)
            for k, v in self.biases.items():
                tf.summary.histogram(k, v)
            return next_hidden_state
And I form Tensorflow Graph like the following:
self.feed_forward = self.input_tensor
for l in self.layers:
    self.feed_forward = l.feed(self.feed_forward)
self.loss_opt = self.loss_function(self.feed_forward, self.output_tensor)
self.fit_opt = self.optimizer.minimize(self.loss_opt)
init = tf.global_variables_initializer()
self.sess.run(init)
Then I run the feed_forward op in a session to predict, and the fit_opt op to update the model.
I connected the LSTM layer to a fully connected layer. I believe the fully connected layer works fine, since I tested it on a basic dataset. The LSTM's hidden state is the input to the fully connected layer. softmax_cross_entropy is used as the loss function and AdamOptimizer is used to update the model.
I'm getting meaningless results when I train my LSTM. I think the hidden state and cell state updates do not work properly. What is the best way to debug my model? I looked at my graph and tensor histograms through TensorBoard. The graph looks fine and the histograms are updated over time.
I suspect the following part
self.hidden_state = tf.assign(ref=self.hidden_state, value=next_hidden_state)
self.cell_state = tf.assign(ref=self.cell_state, value=next_cell_state)
PS: I use truncated_normal to initialize the tensors. The cell state and hidden state variables have trainable=False and their initial values are zero vectors.
https://github.com/ceteke/MyNN here you can find the whole code.
I am using the algorithm: http://deeplearning.net/tutorial/lstm.html