Is there a way to speed up Embedding layer in tf.keras?

I'm trying to implement an LSTM model for DNA sequence classification, but at the moment it is unusable because of how long it takes to train (25 seconds per epoch over 6.5K sequences, about 4ms per sample, and we need to train several versions of the model over 100s of thousands of sequences).
A DNA sequence can be represented as a string of A, C, G, and T, e.g. "ACGGGTGACAT" could be a single DNA sequence. Each sequence contains 1000 characters and belongs to one of two categories, which I am trying to predict.
Initially, my model did not include an Embedding layer and instead I manually converted each sequence into a one-hot encoded matrix (4 rows by 1000 columns) and the model didn't work great but was incredibly fast. At this point though I had seen online that using an embedding layer has clear advantages. So I added an embedding layer and instead of using the one-hot encoded matrix I converted the sequences into integers with each character represented by a different integer.
Indeed the model works much better now, but it is about 30 times slower and impossible to work with. Is there something I can do here to speed up the embedding layer?
Here are the functions for constructing and fitting my model:
from tensorflow.keras.layers import Embedding, Dense, LSTM, Activation
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
def build_model():
    # initialize a sequential model
    model = Sequential()
    # add embedding layer
    model.add(Embedding(5, 1, input_length=1000, mask_zero=True))
    # Add LSTM layer
    model.add(
        LSTM(5)
    )
    # Add Dense NN layer
    model.add(
        Dense(units=2)
    )
    model.add(Activation('softmax'))
    optimizer = Adam(clipnorm=1.)
    model.compile(
        loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy']
    )
    return model
def train_model(X_train, y_train, epochs, batch_size):
    model = build_model()
    # y_train is initially a list of zeroes and ones, needs to be converted to categorical
    y_train = to_categorical(y_train)
    history = model.fit(
        X_train, y_train, epochs=epochs, batch_size=batch_size
    )
    return model, history
Any help will be greatly appreciated - after much googling and trial-and-error, I can't seem to speed this up.
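For reference, the two encodings described in the question can be sketched roughly as follows. This is only an illustration: the mapping `BASE_TO_INT` and the helper names are assumptions, not the asker's actual preprocessing. Index 0 is left unused so that it can act as padding, which is what `mask_zero=True` and a vocabulary size of 5 would suggest.

```python
import numpy as np

# Hypothetical mapping: 0 is reserved for padding, matching mask_zero=True
# and the vocabulary size of 5 passed to Embedding in the question.
BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_integers(seq):
    """Return a 1-D integer array with one entry per base."""
    return np.array([BASE_TO_INT[base] for base in seq], dtype=np.int64)


def encode_one_hot(seq):
    """Return a (4, len(seq)) matrix with a single 1 in each column."""
    ids = encode_integers(seq) - 1  # shift to 0..3 to use as row indices
    one_hot = np.zeros((4, len(seq)), dtype=np.float32)
    one_hot[ids, np.arange(len(seq))] = 1.0
    return one_hot


example = "ACGGGTGACAT"
int_encoded = encode_integers(example)
one_hot_encoded = encode_one_hot(example)
print(int_encoded)            # [1 2 3 3 3 4 3 1 2 1 4]
print(one_hot_encoded.shape)  # (4, 11)
```

The integer form is what an Embedding layer expects as input, while the one-hot matrix is what the original (faster) model consumed.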

One suggestion is to use a "cheaper" RNN, such as SimpleRNN, instead of LSTM. It has fewer parameters to train. In some simple testing, I got a ~3x speed-up over LSTM with the same Embedding processing that you currently have. I'm not sure whether you can reduce the sequence length from 1000 to a lower number, but that might be a direction to explore as well. I hope this helps.
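To make the "fewer parameters" point concrete, here is a small back-of-the-envelope sketch. It assumes the standard layer formulas with a bias term, and the function names are made up for illustration. Parameter count is only one factor in training time; stepping through 1000 time steps sequentially matters too.

```python
def simple_rnn_params(input_dim, units):
    # One input kernel (input_dim x units), one recurrent kernel
    # (units x units), and one bias vector (units).
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    # An LSTM holds four such weight sets: the input, forget, and output
    # gates plus the cell candidate.
    return 4 * simple_rnn_params(input_dim, units)


# Values from the question: Embedding output size 1, followed by LSTM(5).
input_dim, units = 1, 5
rnn_count = simple_rnn_params(input_dim, units)
lstm_count = lstm_params(input_dim, units)
print(rnn_count, lstm_count)  # 35 140
```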

Related

Surrogate model for [parameter vector] to [time series]

Say I have a function F that takes in a parameter vector P (say, a 5-element vector), and produces a (numerical) time series Y[t] of length T (eg T=100, so t=1,...,100). The function could be complicated (eg enzyme reaction models)
I want to make a neural network that predicts the output (Y[t]) that would result from feeding a new parameter set (P') into the function. How can this be done?
A simple feed-forward network can work, but it requires a very large number of output nodes, and doesn't take into account the temporal correlation / relationships between points. Is it possible/better to use a RNN or Transformer instead?

Using RNN might work for you. Here is some example code in Keras to get you started:
import tensorflow as tf
param_length = 5
time_length = 100
hidden_size = 20
model = tf.keras.Sequential([
    # Encode input parameters.
    tf.keras.layers.Dense(hidden_size, input_shape=[param_length]),
    # Generate a sequence.
    tf.keras.layers.RepeatVector(time_length),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))
])
model.compile(loss="mse", optimizer="nadam")
# train_x, train_y, val_x, and val_y are your own data arrays.
model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=10)
The first Dense layer converts the input parameters to a hidden representation. RepeatVector copies it to every time step, and the LSTM units then generate the time sequence. You will need to experiment with hyperparameters such as the number of Dense and LSTM layers, the size of the hidden layers, etc.
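If the shapes are hard to picture, this NumPy sketch mimics the first two layers. The random weight matrix `W` is just a stand-in for the Dense layer and the batch size of 4 is arbitrary; it is not how Keras computes things internally.

```python
import numpy as np

batch, param_length, hidden_size, time_length = 4, 5, 20, 100

rng = np.random.default_rng(0)
params = rng.normal(size=(batch, param_length))

# Stand-in for the Dense layer: a random linear map (hypothetical weights).
W = rng.normal(size=(param_length, hidden_size))
encoded = params @ W  # shape (4, 20)

# Equivalent of RepeatVector(time_length): copy the vector at every time step.
repeated = np.repeat(encoded[:, np.newaxis, :], time_length, axis=1)
print(encoded.shape, repeated.shape)  # (4, 20) (4, 100, 20)
```

The LSTM then sees the same encoded vector at each of the 100 steps, and the final TimeDistributed(Dense(1)) produces an output of shape (batch, 100, 1), so target arrays of that shape line up with it.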
One more thing you can try is a different loss function, such as Huber loss, together with early stopping:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    monitor="val_mae", patience=50, restore_best_weights=True)
model.compile(loss=tf.keras.losses.Huber(), optimizer="nadam", metrics=["mae"])
history = model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=500,
                    callbacks=[early_stopping_cb])
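For intuition about what Huber loss does differently from MSE, here is a plain NumPy sketch of the usual definition with `delta=1.0`. The helper `huber` and the sample values are made up for illustration; this is not the Keras implementation itself.

```python
import numpy as np


def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    error = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * error ** 2
    linear = delta * error - 0.5 * delta ** 2
    return float(np.mean(np.where(error <= delta, quadratic, linear)))


y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 3.0])

huber_value = huber(y_true, y_pred)
mse_value = float(np.mean((y_true - y_pred) ** 2))
print(huber_value, mse_value)
```

The error of 3.0 contributes 2.5 to the Huber sum but 9.0 to the squared-error sum, which is why Huber loss tends to be less sensitive to occasional outliers in the time series.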

What is the timestep in Keras' LSTM?

I'm having some trouble with the LSTM implementation in Keras.
My training set is structured as follows:
number of sequences: 5358
the length of each sequence is 300
each element of the sequence is a vector of 54 features
I'm unsure on how to shape the input for a stateful LSTM.
Following this tutorial: http://philipperemy.github.io/keras-stateful-lstm/, I've created the subsequences (in my case there are 1452018 subsequences with a window_size = 30).
What is the best option to reshape the data for a stateful LSTM's input?
What does the timestep of the input mean in this case? And why?
Is the batch_size related to the timestep?

I'm unsure on how to shape the input for a stateful LSTM.
LSTM(100, stateful=True)
Before using a stateful LSTM, though, ask yourself whether you really need one. See here and here for more details.
What is the best option to reshape the data for a stateful LSTM's input?
It really depends on the problem at hand. However, I think you do not need any reshaping; just feed the data directly into Keras:
input_layer = Input(shape=(300, 54))
What does the timestep of the input mean in this case? And why?
In your example the number of timesteps is 300. See here for further details on timesteps. In the following picture, we have 5 timesteps that we feed into the LSTM network.
Is the batch_size related to the timestep?
No, it has nothing to do with batch_size. More details on batch_size can be found here.
Here is simple code based on the description that you provide. It might give you some intuition:
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense
# Placeholder data with the shapes described in the question:
# (samples, timesteps, features) for x and one binary label per sample for y.
x_train = np.zeros(shape=(5358, 300, 54))
y_train = np.zeros(shape=(5358, 1))
input_layer = Input(shape=(300, 54))
lstm = LSTM(100)(input_layer)
dense1 = Dense(20, activation='relu')(lstm)
dense2 = Dense(1, activation='sigmoid')(dense1)
model = Model(inputs=input_layer, outputs=dense2)
model.compile("adam", loss='binary_crossentropy')
model.fit(x_train, y_train, batch_size=512)
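As a side note on the subsequences mentioned in the question: if the windows are taken with a stride of 1 (an assumption, but it reproduces the 1452018 figure exactly), the arithmetic and the resulting (samples, timesteps, features) shape look like this. The zeros are placeholder data.

```python
import numpy as np

n_sequences, seq_len, n_features = 5358, 300, 54
window_size = 30

# One placeholder sequence; real data would replace the zeros.
sequence = np.zeros((seq_len, n_features))

# Slide a window of 30 time steps over the sequence with stride 1.
windows = np.stack(
    [sequence[start:start + window_size]
     for start in range(seq_len - window_size + 1)]
)

windows_per_sequence = windows.shape[0]
total_windows = n_sequences * windows_per_sequence
print(windows.shape)   # (271, 30, 54)
print(total_windows)   # 1452018
```

With that windowing, each sample fed to the LSTM has 30 timesteps of 54 features, whereas feeding the full sequences directly gives 300 timesteps per sample.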

How to train a model with only an Embedding layer in Keras and no labels

I have some text without any labels. Just a bunch of text files. And I want to train an Embedding layer to map the words to embedding vectors. Most of the examples I've seen so far are like this:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
model.fit(x_train, y_train,
          epochs=10,
          batch_size=32,
          validation_data=(x_val, y_val))
They all assume that the Embedding layer is part of a bigger model which tries to predict a label. But in my case, I have no labels. I'm not trying to classify anything. I just want to train the mapping from words (more precisely, integers) to embedding vectors. But the fit method of the model asks for x_train and y_train (as in the example above).
How can I train a model only with an Embedding layer and no labels?
[UPDATE]
Based on the answer I got from @Daniel Möller, the Embedding layer in Keras implements a supervised algorithm and thus cannot be trained without labels. Initially, I thought it was a variation of Word2Vec and so did not need labels to be trained. Apparently, that's not the case. Personally, I ended up using FastText, which has nothing to do with Keras or Python.

Does it make sense to do that without a label/target?
How will your model decide which values in the vectors are good for anything if there is no objective?
All embeddings are "trained" for a purpose. If there is no purpose, there is no target, if there is no target, there is no training.
If you really want to transform words in vectors without any purpose/target, you've got two options:
Make one-hot encoded vectors. You may use the Keras to_categorical function for that.
Use a pretrained embedding. There are some available, such as GloVe, embeddings from Google, etc. (All of them were trained at some point for some purpose.)
A very naive approach based on our chat, considering word distance
Warning: I don't really know anything about Word2Vec, but I'll try to show how to add the rules for your embedding using some naive kind of word distance and how to use dummy "labels" just to satisfy Keras' way of training.
from keras.layers import Input, Embedding, Subtract, Lambda
import keras.backend as K
from keras.models import Model
input1 = Input((1,)) #word1
input2 = Input((1,)) #word2
embeddingLayer = Embedding(...params...)
word1 = embeddingLayer(input1)
word2 = embeddingLayer(input2)
#naive distance rule, subtract, expect zero difference
word_distance = Subtract()([word1,word2])
#reduce all dimensions to a single dimension
word_distance = Lambda(lambda x: K.mean(x, axis=-1))(word_distance)
model = Model([input1,input2], word_distance)
Now that our model outputs directly a word distance, our labels will be "zero", they're not really labels for a supervised training, but they're the expected result of the model, something necessary for Keras to work.
We can have as loss function the mae (mean absolute error) or mse (mean squared error), for instance.
model.compile(optimizer='adam', loss='mse')
And training with word2 being the word after word1:
import numpy as np
# entireText is assumed to be a 1-D array of integer word ids
xTrain = entireText
xTrain1 = entireText[:-1]
xTrain2 = entireText[1:]
yTrain = np.zeros((len(xTrain1),))
model.fit([xTrain1,xTrain2], yTrain, .... more params.... )
Although this may be completely wrong regarding what Word2Vec really does, it shows the main points that are:
Embedding layers don't have special properties, they're just trainable lookup tables
Rules for creating an embedding should be defined by the model and expected outputs
A Keras model will need "targets", even if those targets are not "labels" but a mathematical trick for an expected result.
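The "trainable lookup table" point can be checked with a few lines of NumPy. The table here is random and the sizes are arbitrary; it only shows that indexing rows by word id gives the same result as multiplying a one-hot vector by the table.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)

# Stand-in for the Embedding layer's weight matrix (random, untrained).
table = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([2, 0, 4, 2])

# What an embedding lookup does: pick one row per word id.
looked_up = table[word_ids]

# The same thing written as one-hot vectors times the table.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ table

print(looked_up.shape)                     # (4, 3)
print(np.allclose(looked_up, via_matmul))  # True
```

Training only changes the values stored in the table; which values are "good" is decided entirely by the loss that the rest of the model defines.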

Neural network accuracy optimization

I have constructed an ANN in Keras which has one input layer (3 inputs), one output layer (1 output), and two hidden layers with 12 and 3 nodes respectively.
The way I construct and train my network is:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
dataset = numpy.loadtxt("sorted output.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:3]
Y = dataset[:,3]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=3, kernel_initializer='uniform', activation='relu'))
model.add(Dense(3, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=10)
Sorted output csv file looks like:
After 150 epochs I get: loss: 0.6932 - acc: 0.5000 - val_loss: 0.6970 - val_acc: 0.1429
My question is: how could I modify my NN in order to achieve higher accuracy?

You could try the following things. I have written this roughly in the order of importance - i.e. the order I would try things to fix the accuracy problem you are seeing:
Normalise your input data. Usually you would take mean and standard deviation of training data, and use them to offset+scale all further inputs. There is a standard normalising function in sklearn for this. Remember to treat your test data in the same way (using the mean and std from the training data, not recalculating it)
Train for more epochs. For problems with small numbers of features and limited training set sizes, you often have to run for thousands of epochs before the network will converge. You should plot the training and validation loss values to see whether the network is still learning, or has converged as best as it can.
For your simple data, I would avoid relu activations. You may have heard they are somehow "best", but like most NN options, there are types of problems where they work well and others where they are not the best choice. I think you would be better off with tanh or sigmoid activations in the hidden layers for your problem. Save relu for very deep networks and/or convolutional problems on images/audio.
Use more training data. Not clear how much you are feeding it, but NNs work best with large amounts of training data.
Provided you already have lots of training data - increase size of hidden layers. More complex relationships require more hidden neurons (and sometimes more layers) for the NN to be able to express the "shape" of the decision surface. Here is a handy browser-based network allowing you to play with that idea and get a feel for it.
Add one or more dropout layers after the hidden layers or add some other regularisation. The network could be over-fitting (although with a training accuracy of 0.5 I suspect it isn't). Unlike relu, using dropout is pretty close to a panacea for tougher NN problems - it improves generalisation in many cases. A small amount of dropout (~0.2) might help with your problem, but like most hyper-parameters, you will need to search for the best values.
Finally, it is always possible that the relationship you want to find that allows you to predict Y from X is not really there. In which case it would be a correct result from the NN to be no better than guessing at Y.
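To illustrate the first point about normalisation, here is a minimal NumPy sketch of reusing the training mean and standard deviation on the test data. The data is randomly generated and purely hypothetical; sklearn's StandardScaler does the equivalent when you call fit on the training set and transform on both sets.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: 3 features on very different scales.
X = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0], size=(100, 3))
X_train, X_test = X[:67], X[67:]

# Statistics come from the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same offset and scale are applied to both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```

The scaled test set will not have exactly zero mean or unit variance, and that is expected: it is being measured on the scale learned from the training data.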

Neil Slater already provided a long list of helpful general advice.
In your specific example, normalization is the important thing. If you add the following lines to your code
...
X = dataset[:,0:3]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
you will get 100% accuracy on your toy data, even with much simpler network structures. Without normalization, the optimizer won't work.
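As a quick sanity check on the numbers in the question: the reported loss of 0.6932 is very close to ln 2, which is what binary cross-entropy gives when the model outputs 0.5 for every sample. That is consistent with the training accuracy of 0.5, i.e. the network has not learned anything yet. The helper below is a plain-Python sketch of the per-sample formula, not the Keras function.

```python
import math


def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one sample with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


loss_at_half = binary_crossentropy(1, 0.5)
print(round(loss_at_half, 4))  # 0.6931
```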

How to deal with situation where LSTM fails to learn (constantly makes the same incorrect prediction)

I am trying to use LSTM neural networks to make a song composer. Basically, this is based on a text generator (which tries to predict the next character after looking at a sequence of characters), but instead of characters, it tries to predict notes.
Structure of the midi file that serves as the input (Y-axis is the pitch or note value while X-axis is time):
And this is the predicted note values:
I trained for 50 epochs, but the LSTM's loss does not seem to decrease; most of the time it does not improve at all.
I suspect this is because there is an overwhelming number of a particular note (in this case, note value 65) which makes the LSTM lazy during training phase and predict 65 each and every time.
I feel like this is a common problem among LSTMs and time-series based learning algorithms. How would I solve a problem like this? If what I mentioned is not the problem, then what is the problem and how do I solve that?
Here is the code I am using to train if you need it:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
seq_length = 100
read_path = '../matrices/input/world-is-mine/world-is-mine-y-0.npy'
raw_text = numpy.load(read_path)
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i,c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)
# prepare the dataset of input to output pairs encoded as integers
dataX = []
dataY = []
# dataX is the encoding version of the sequence
# dataY is an encoded version of the next prediction
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
print('X: ', X.shape)
print('Y: ', y.shape)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
#model.add(Dropout(0.05))
model.add(LSTM(256))
#model.add(Dropout(0.05))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.
# We are interested in a generalization of the dataset that minimizes the chosen loss function
# We are seeking a balance between generalization of the dataset and overfitting but short of memorization
# define the check point
filepath="../checkpoints/weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
model.fit(X, y, epochs=50, batch_size=64, callbacks=callbacks_list)

I have no experience working with music data. From my experience with text data, this looks like an under-fitted model. Enlarging the training dataset with more varied note values should help with the underfitting. It seems that there are not enough training examples to learn the note variation. For example, for a character-level language model, 1 MB of data is too small to train a reasonable LSTM model. Also, try training with a smaller sequence length (say, 20) first. With limited training data, a shorter sequence length will be easier to learn than a longer one.
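Before changing the model, it may also be worth checking how dominant the most frequent note really is, since a model that always predicts it would score exactly that share as accuracy. A rough sketch with made-up data (the `notes` list is hypothetical; the real values would come from the loaded .npy file):

```python
from collections import Counter

# Hypothetical note sequence; real data would come from the loaded .npy file.
notes = [65, 65, 67, 65, 65, 69, 65, 65, 65, 72]

counts = Counter(notes)
most_common_note, occurrences = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the most frequent note.
majority_share = occurrences / len(notes)
print(most_common_note, majority_share)  # 65 0.7
```

If the share is very high on the real data, more varied training data (as suggested above) is likely to matter more than extra epochs.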
