Using tensorflow data pipelines for nlp text generator

I am new to Stack Overflow as a question asker. I usually browse for answers here and have never had to ask a question until now. I am building a deep learning network using the tf.data.Dataset API, and the network doesn't seem to be reading the dataset correctly.
To set the stage: I am working with a text dataset. I have already split the text into tokens, created a dictionary of unique words, and created an embedding matrix to convert the tokens into vectors. I then planned to use tf.data.Dataset to build an internal pipeline and batch large datasets to keep training manageable.
The 'vect_doc' variable is an array with shape of (35054, 300).
vect_dataset = tf.data.Dataset.from_tensor_slices(vect_doc)
From here I shuffle the dataset so I can split it into train, test and validation sets.
vect_data_shuffle = vect_dataset.shuffle(len(proc_doc), reshuffle_each_iteration = False)
train_dataset = vect_data_shuffle.take(train_size)
test_dataset = vect_data_shuffle.skip(train_size)
val_dataset = test_dataset.skip(val_size)
test_dataset = test_dataset.take(test_size)
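A minimal sketch of the same take/skip split on a toy dataset, so the partition can be checked by eye. The sizes and the seed here are made up for illustration and are not from the post. Note that in this sketch the validation slice is whatever remains after skipping test_size elements, which keeps the three subsets disjoint even when the test and validation sizes differ.

```python
import tensorflow as tf

# Toy stand-in for the vectorised document: 10 elements instead of 35054.
toy = tf.data.Dataset.range(10)

# Hypothetical split sizes, chosen only for this illustration.
train_size, test_size, val_size = 6, 2, 2

# A fixed seed plus reshuffle_each_iteration=False keeps the order identical
# every time the dataset is iterated, so take/skip see the same permutation.
shuffled = toy.shuffle(10, seed=0, reshuffle_each_iteration=False)

train_ds = shuffled.take(train_size)
rest = shuffled.skip(train_size)
test_ds = rest.take(test_size)
val_ds = rest.skip(test_size).take(val_size)

train_ids = [int(v) for v in train_ds.as_numpy_iterator()]
test_ids = [int(v) for v in test_ds.as_numpy_iterator()]
val_ids = [int(v) for v in val_ds.as_numpy_iterator()]
print(train_ids, test_ids, val_ids)
```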
Then I batch the datasets to create samples of length 2*sequence_length + 1. I will demonstrate with just the training dataset for simplicity's sake.
train_batch_ds = train_dataset.batch(2*self.sequence_length + 1, drop_remainder=True)
Once the dataset has been broken up into batches, I run the following process:
def vect_split_dataset(self, sample):
    dataset_Xy = tf.data.Dataset.from_tensors((sample[:self.sequence_length],
                                               sample[self.sequence_length]))
    for i in range(1, (len(sample) - 1) // 2):
        X_seq_batch = sample[i: i + self.sequence_length]
        y_nxwrd_batch = sample[i + self.sequence_length]
        Xy_samp = tf.data.Dataset.from_tensors((X_seq_batch, y_nxwrd_batch))
        Xy_dataset = dataset_Xy.concatenate(Xy_samp)
    return Xy_dataset
Xy_dataset = train_batch_ds.flat_map(train_set.vect_split_dataset)
Xy_dataset = Xy_dataset.repeat(len(proc_doc)).shuffle(len(proc_doc)).batch(param_dict['batch_size'], drop_remainder=True)
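To make the intended windowing concrete, here is a NumPy-only sketch of turning one chunk of 2*sequence_length + 1 vectors into (sequence, next vector) pairs. The helper name sliding_windows is hypothetical, and unlike the loop in the post it emits every possible window in the chunk. It is meant as a reference for the expected shapes, not as a replacement for the tf.data code. As posted, the loop appears to rebuild Xy_dataset from the first pair on each pass instead of accumulating, so comparing element counts against a reference like this may help.

```python
import numpy as np

def sliding_windows(chunk, sequence_length):
    """Return (X, y): every run of `sequence_length` vectors and the vector after it."""
    n_windows = len(chunk) - sequence_length
    X = np.stack([chunk[i:i + sequence_length] for i in range(n_windows)])
    y = np.stack([chunk[i + sequence_length] for i in range(n_windows)])
    return X, y

# Sizes taken from the post: sequences of 30 vectors, 300 dimensions each.
sequence_length, embed_dim = 30, 300
chunk = np.random.rand(2 * sequence_length + 1, embed_dim).astype("float32")

X, y = sliding_windows(chunk, sequence_length)
print(X.shape, y.shape)  # (31, 30, 300) (31, 300)
```

Each X[i] has shape (30, 300) and each y[i] has shape (300,), which matches the per-sample shapes reported for the batched dataset.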
The above Xy_dataset returns a shape of ((60, 30, 300), (60, 300)). Now that I have a dataset I can pass to my DNN model, this is where the problems start. This is the code I am using to build the model:
LSTM = tf.keras.layers.LSTM(units=self.rnn_units,
                            kernel_initializer=self.initializer,
                            activation=self.activation,
                            recurrent_activation=self.activation_out,
                            return_sequences=True)
for i in range(self.num_layers):
    # Different layers should have different setups as indicated below
    if i == 0:  # Initial layer also referred to as the input layer
        self.model.add(tf.keras.layers.Embedding(input_dim=self.input_dim,
                                                 input_shape=(self.sequence_length, self.spacy_len),
                                                 output_dim=self.spacy_len,
                                                 input_length=self.batch_size))
    elif i+1 == self.num_layers:  # Output layer
        self.model.add(tf.keras.layers.Dropout(self.drop_rate))
        self.model.add(tf.keras.layers.Dense(units=self.num_units_out,
                                             kernel_initializer=self.initializer,
                                             activation=self.activation_out))
        self.model.add(tf.keras.layers.Activation(self.activation_out))
    else:  # hidden layers basically anything that isn't an input or output layer
        self.model.add(tf.keras.layers.Bidirectional(LSTM))
The error I keep getting is: 'ValueError: Input 0 of layer bidirectional is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 30, 300, 300]'
I am not sure whether this comes from how I am handling the embedding layer or from something else. When I comment out the embedding layer and put the Bidirectional layer in its place, I get an incompatibility error between two shapes, (60, 300) and (60, 30, 300).
My goal is to iterate over the entire dataset in batches of a defined size (60 in this example) for each epoch. When calling model.fit, I set the steps per epoch to the length of the entire document minus the sequence length, divided by the batch size: 'steps_per_epoch = (len(processed_doc) - self.sequence_length) // self.batch_size'.
I appreciate any comments or guidance that can be provided on fixing this issue.
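One plausible reading of the ndim=4 error, sketched with NumPy only: an embedding layer is a lookup table indexed by integer token ids, and it appends one axis of size output_dim to whatever it receives. Feeding it inputs that are already 300-dimensional vectors would therefore add a fourth axis. The vocabulary size and the small batch below are assumptions made to keep the example light; only the 30 and 300 come from the post.

```python
import numpy as np

vocab_size, embed_dim = 50, 300   # hypothetical vocabulary size; 300 matches the post
batch, seq_len = 2, 30            # the post uses a batch of 60; 2 keeps memory low

# An embedding layer is essentially a lookup table indexed by integer token ids.
table = np.random.rand(vocab_size, embed_dim).astype("float32")

# Case 1: integer ids of shape (batch, seq_len) give a 3-D result,
# which is the rank a recurrent layer expects.
token_ids = np.random.randint(0, vocab_size, size=(batch, seq_len))
embedded = table[token_ids]
print(embedded.shape)         # (2, 30, 300)

# Case 2: inputs that are already vectors, shape (batch, seq_len, embed_dim).
# Treating each of the 300 components as an id adds a fourth axis.
already_vectors = np.zeros((batch, seq_len, embed_dim), dtype="int64")
double_embedded = table[already_vectors]
print(double_embedded.shape)  # (2, 30, 300, 300)
```

If this is what is happening, the two usual options are to feed integer token ids to the embedding layer, or to drop the embedding layer and feed the pre-computed vectors straight to the first recurrent layer. The second error, (60, 300) against (60, 30, 300), would be consistent with a recurrent layer built with return_sequences=True producing one output per timestep while the target holds one vector per sample.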

Related

Keras sequence models - how to generate data during test/generation?

Is there a way to use the already trained RNN (SimpleRNN or LSTM) model to generate new sequences in Keras?
I'm trying to modify an exercise from the Coursera Deep Learning Specialization - Sequence Models course, where you train an RNN to generate dinosaur names. In the exercise you build the RNN using only numpy, but I want to use Keras.
One of the problems is different lengths of the sequences (dino names), so I used padding and set sequence length to the max size appearing in the dataset (I padded with 0, which is also the code for '\n').
My question is how to generate the actual sequence once training is done? In the numpy version of the exercise you take the softmax output of the previous cell and use it as a distribution to sample a new input for the next cell. But is there a way to connect the output of the previous cell as the input of the next cell in Keras, during testing/generation time?
Also, a couple of side questions:
Since I'm using padding, I suspect the accuracy is way too optimistic. Is there a way to tell Keras not to include the padding values in its accuracy calculations?
Am I even doing this right? Is there a better way to use Keras with sequences of different lengths?
You can check my (WIP) code here.

Inferring from a model that has been trained on a sequence
So it's a pretty common thing to do in RNN models and in Keras the best way (at least from what I know) is to create two different models.
One model for training (which uses sequences instead of individual items)
Another model for predicting (which uses a single element instead of a sequence)
So let's see an example. Suppose you have the following model.
from tensorflow.keras import models, layers
n_chars = 26
timesteps = 10
inp = layers.Input(shape=(timesteps, n_chars))
lstm = layers.LSTM(100, return_sequences=True)
out1 = lstm(inp)
dense = layers.Dense(n_chars, activation='softmax')
out2 = layers.TimeDistributed(dense)(out1)
model = models.Model(inp, out2)
model.summary()
Now to infer from this model, you create another model which looks like the one below.
inp_infer = layers.Input(shape=(1, n_chars))
# Inputs to feed LSTM states back in
h_inp_infer = layers.Input(shape=(100,))
c_inp_infer = layers.Input(shape=(100,))
# We need return_state=True so we are creating a new layer
lstm_infer = layers.LSTM(100, return_state=True, return_sequences=True)
out1_infer, h, c = lstm_infer(inp_infer, initial_state=[h_inp_infer, c_inp_infer])
out2_infer = layers.TimeDistributed(dense)(out1_infer)
# Our model takes the previous states as inputs and spits out new states as outputs
model_infer = models.Model([inp_infer, h_inp_infer, c_inp_infer], [out2_infer, h, c])
# We are setting the weights from the trained model
lstm_infer.set_weights(lstm.get_weights())
model_infer.summary()
So what's different? We have defined a new input layer that accepts an input with only one timestep (in other words, a single item). The model then produces an output with a single timestep (technically we don't need the TimeDistributed layer, but I've kept it for consistency). Other than that, the model takes the previous LSTM state as an input and produces the new state as an output. More specifically, the inference model has the following inputs and outputs.
Input: [(None, 1, n_chars), (None, 100), (None, 100)] list of tensors
Output: [(None, 1, n_chars), (None, 100), (None, 100)] list of tensors
Note that I'm updating the weights of the new layers from the trained model or using the existing layers from the training model. It will be a pretty useless model if you don't reuse the trained layers and weights.
Now we can write inference logic.
import numpy as np
x = np.random.randint(0,2,size=(1, 1, n_chars))
h = np.zeros(shape=(1, 100))
c = np.zeros(shape=(1, 100))
seq_len = 10
for _ in range(seq_len):
    print(x)
    y_pred, h, c = model_infer.predict([x, h, c])
    # Drop the time axis of the prediction: (1, 1, n_chars) -> (1, n_chars)
    y_pred = y_pred[:,0,:]
    # One-hot encode the most likely character and feed it back as the next input
    y_onehot = np.zeros(shape=(x.shape[0],n_chars))
    y_onehot[np.arange(x.shape[0]),np.argmax(y_pred,axis=1)] = 1.0
    x = np.expand_dims(y_onehot, axis=1)
This part starts with an initial x, h and c. It gets the prediction y_pred along with the new h and c, converts the prediction into an input in the following lines, and assigns the results back to x, h and c. You keep going for as many iterations as you like.
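The loop above always picks the most likely character with argmax. The question asks about sampling from the softmax output instead, as in the numpy exercise. A minimal sketch of that step follows; the helper name sample_next_onehot and the toy probabilities are made up for illustration, and in the loop the probs argument would be the prediction with its time axis removed.

```python
import numpy as np

def sample_next_onehot(probs, rng):
    """Draw one class per row from `probs` (batch, n_classes) and one-hot encode it."""
    probs = np.asarray(probs, dtype="float64")
    probs = probs / probs.sum(axis=1, keepdims=True)  # guard against rounding drift
    n_classes = probs.shape[1]
    ids = np.array([rng.choice(n_classes, p=row) for row in probs])
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(ids)), ids] = 1.0
    return ids, onehot

rng = np.random.default_rng(0)
probs = np.array([[0.1, 0.7, 0.2]])  # toy softmax output for a batch of one
ids, onehot = sample_next_onehot(probs, rng)

# Add the time axis back so the result can be fed to the model as the next x.
x_next = np.expand_dims(onehot, axis=1)
print(ids, x_next.shape)
```

Sampling gives varied names from run to run, while argmax always produces the same sequence for the same starting input.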
About masking zeros
Keras does offer a Masking layer which can be used for this purpose. And the second answer in this question seems to be what you're looking for.
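To see why padding inflates accuracy, here is a small NumPy sketch that compares plain accuracy with accuracy computed only over non-padded positions. It is not the Keras Masking layer itself, just the idea behind it, and it assumes id 0 is reserved for padding (in the question 0 also encodes '\n', so a separate padding id would be needed there). The function name and the toy labels are hypothetical.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_id=0):
    """Accuracy over positions whose true label is not the padding id."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    keep = y_true != pad_id
    if not keep.any():
        return 0.0
    return float((y_true[keep] == y_pred[keep]).mean())

# Two padded sequences of class ids; trailing zeros are padding.
y_true = np.array([[3, 5, 2, 0, 0],
                   [4, 1, 0, 0, 0]])
y_pred = np.array([[3, 5, 1, 0, 0],
                   [4, 2, 0, 0, 0]])

plain = float((y_true == y_pred).mean())
masked = masked_accuracy(y_true, y_pred)
print(plain, masked)  # 0.8 0.6
```

The padded positions are trivially correct, so they lift the plain score from 0.6 to 0.8 in this toy case.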

Error when feeding numpy sequence to bidirectional LSTM in Keras

I am trying to feed the features extracted from 2 fine-tuned VGG16 (each on a different stream), then for each sequence of 9 data pairs, concatenate their numpy arrays and feed the sequence of 9 outputs (concatenated) to a bi-directional LSTM in Keras.
The problem is that I am running into an error when trying to build the LSTM part. The following shows the generator I wrote to read both RGB and Optical flow streams, extract features and concatenate each pair :
def generate_generator_multiple(generator,dir1, dir2, batch_rgb, batch_flow, img_height,img_width):
    print("Processing inside generate multiple")
    genX1 = generator.flow_from_directory(dir1,
                                          target_size = (img_height,img_width),
                                          class_mode = 'categorical',
                                          batch_size = batch_rgb,
                                          shuffle=False
                                          )
    genX2 = generator.flow_from_directory(dir2,
                                          target_size = (img_height,img_width),
                                          class_mode = 'categorical',
                                          batch_size = batch_flow,
                                          shuffle=False
                                          )
    while True:
        imgs, labels = next(genX1)
        X1i = RGB_model.predict(imgs, verbose=0)
        imgs2, labels2 = next(genX2)
        X2i = FLOW_model.predict(imgs2,verbose=0)
        Xi = []
        for i in range(9):
            Xi.append(np.concatenate([X1i[i+1],X2i[i]]))
        Xi = np.asarray(Xi)
        if np.array_equal(labels[1:],labels2)==False:
            print("ERROR !! problem of labels matching: RGB and FLOW have different labels")
        yield Xi, labels2[2]
I am expecting the generator to yield a sequence of 9 arrays, so the shape of Xi when I force the loop to run twice is: (9, 14, 7, 512)
When I use while True (as in the code above) and call the method to check what it returns, after 3 iterations I get the error:
ValueError: too many values to unpack (expected 2)
Now, assuming that there is no problem with the generator, I try to feed the data returned by the generator to the bidirectional LSTM like the following:
n_frames = 9
seq = 100
Bi_LSTM = Sequential()
Bi_LSTM.add(Bidirectional(LSTM(seq, return_sequences=True, dropout=0.25, recurrent_dropout=0.1),input_shape=(n_frames,14,7,512)))
Bi_LSTM.add(GlobalMaxPool1D())
Bi_LSTM.add(TimeDistributed(Dense(100, activation="relu")))
Bi_LSTM.add(layers.Dropout(0.25))
Bi_LSTM.add(Dense(4, activation="relu"))
model.compile(Adam(lr=.00001), loss='categorical_crossentropy', metrics=['accuracy'])
But I keep getting the following error: (the error log is a bit long)
InvalidArgumentError: Shape must be rank 4 but is rank 2 for 'bidirectional_2/Tile_1' (op: 'Tile') with input shapes: [?,7,512,1], [2].
It seems to be caused by this line:
Bi_LSTM.add(Bidirectional(LSTM(seq, return_sequences=True, dropout=0.25, recurrent_dropout=0.1),input_shape=(n_frames,14,7,512)))
I am not sure anymore if the problem is the way I try to build the LSTM, the way I return the data from the generator, or the way I define the input of LSTM.
Thanks a lot for any help you can provide.

It seems this error specifically is caused by the following line:
input_shape=(n_frames,14,7,512)
I was confused about the input for LSTM. Instead of explicitly giving the shape of the input, we just need to specify the dimensions of the input. In my case, this is 3 since the input is a 3D np array. I still have other problems with my code, but for this specific error, the solution is to replace that part with:
input_shape=(n_frames,3)
Edit:
When predicting, we need to take the mean of the prediction since LSTM expects a 1D input.
Another issue in my code was the shape of Xi. It needs to be reshaped before it is yielded so that it matches the input expected by LSTM.
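As a sketch of the reshape mentioned above: Keras recurrent layers take each sample as a 2D array of shape (timesteps, features), so one way to prepare Xi is to flatten every frame's feature map into a single vector. The random array below is only a stand-in with the shape reported in the question, and flattening is one option among several (averaging over the two spatial axes would give 512 features per frame instead).

```python
import numpy as np

n_frames = 9
# Stand-in for one yielded sequence of concatenated feature maps,
# using the shape reported in the question.
Xi = np.random.rand(n_frames, 14, 7, 512).astype("float32")

# Flatten each frame's (14, 7, 512) feature map into a single feature vector.
Xi_flat = Xi.reshape(n_frames, -1)
print(Xi_flat.shape)   # (9, 50176)

# Alternative: average over the spatial axes to get a much smaller vector.
Xi_pooled = Xi.mean(axis=(1, 2))
print(Xi_pooled.shape) # (9, 512)

# Keras models consume batches, so add a leading batch axis of size 1.
batch = Xi_flat[np.newaxis, ...]
print(batch.shape)     # (1, 9, 50176)
```

With the flattened version, the matching layer argument under these assumptions would be input_shape=(n_frames, 50176), that is, (timesteps, features).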

Getting name of images per batch in Keras ResNet50 model

I'm finetuning a ResNet50 model with a few additional layers using Keras.
I need to know which images are trained per batch.
The problem is that only the image data and their labels, not the image names, can be passed to fit and fit_generator, so I cannot write the names of the images trained in each batch to a file.

You can make your own generator so you could track what is fed into the network, and do whatever you like with the data (i.e. match indices to images).
Here is a basic example of a generator function which you can build upon:
import numpy as np
def gen_data():
    x_train = np.random.rand(100, 784)
    # randint's upper bound is exclusive, so (0, 2) gives labels 0 and 1
    y_train = np.random.randint(0, 2, 100)
    i = 0
    while True:
        indices = np.arange(i*10, 10*i+10)
        # Those are indices being fed to network which can be saved to a file.
        print(indices)
        out = x_train[indices], y_train[indices]
        i = (i+1) % 10
        yield out
And then use fit_generator with the new defined generator function:
model.fit_generator(gen_data(), steps_per_epoch=10, epochs=20)
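Building on that idea, here is a sketch of a generator that records file names next to each batch. The function name, the log list and the img_*.jpg names are all hypothetical; in practice the names would come from wherever the images are loaded. Names are logged when a batch is produced, which can run slightly ahead of when the model consumes it if batches are queued in advance.

```python
import numpy as np

def gen_named_batches(x, y, names, batch_size, log):
    """Yield (x_batch, y_batch) forever and record the names used in each batch."""
    n = len(x)
    start = 0
    while True:
        idx = np.arange(start, start + batch_size) % n  # wrap around at the end
        log.append([names[i] for i in idx])
        yield x[idx], y[idx]
        start = (start + batch_size) % n

# Hypothetical data and file names, for illustration only.
x_train = np.random.rand(6, 4)
y_train = np.random.randint(0, 2, 6)
names = [f"img_{i}.jpg" for i in range(6)]

batch_log = []
gen = gen_named_batches(x_train, y_train, names, batch_size=4, log=batch_log)
xb, yb = next(gen)
xb2, yb2 = next(gen)
print(batch_log)
```

After training, batch_log can be written to a file, one line per batch.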

keras lstm nmt model compiles but does not run -- dimension size?

I'm working on what I hope will be a simple nmt translator in keras. Below is a link to some inspiring examples of seq2seq in keras.
https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
I want a model that takes a vector of 300 as a word input and takes 25 of them at a time. This 25 number is the length of a sentence. Units is the 300 number and tokens_per_sentence is the 25 number. The code below only has the training model. I have omitted the inference model. My model compiles, but when I run it with training data I get a dimension error. I have tried reshaping the output of dense_layer_b but I'm repeatedly told that the output of the operation needs to have the same size as the input. I'm using Python 3 with TensorFlow as the backend. My Keras is v2.1.4. My OS is Ubuntu.
an error message:
ValueError: Error when checking target: expected dense_layer_b to have 2 dimensions, but got array with shape (1, 25, 300)
some terminal output:
Tensor("lstm_1/while/Exit_2:0", shape=(?, 25), dtype=float32)
(?, 25) (?, 25) h c
some code:
def model_lstm():
    x_shape = (None,units)
    valid_word_a = Input(shape=x_shape)
    valid_word_b = Input(shape=x_shape)
    ### encoder for training ###
    lstm_a = LSTM(units=tokens_per_sentence, return_state=True)
    recurrent_a, lstm_a_h, lstm_a_c = lstm_a(valid_word_a)
    lstm_a_states = [lstm_a_h , lstm_a_c]
    print(lstm_a_h)
    ### decoder for training ###
    lstm_b = LSTM(units=tokens_per_sentence ,return_state=True)
    recurrent_b, inner_lstmb_h, inner_lstmb_c = lstm_b(valid_word_b, initial_state=lstm_a_states)
    print(inner_lstmb_h.shape, inner_lstmb_c.shape,'h c')
    dense_b = Dense(tokens_per_sentence , activation='softmax', name='dense_layer_b')
    decoder_b = dense_b(recurrent_b)
    model = Model([valid_word_a,valid_word_b], decoder_b)
    return model
I was hoping that the question marks in the terminal output could be replaced with my vector size when the code was actually used with data.
edit:
I've been trying to work this out by switching around the dimensions in the code. I have updated the code and the error message. I still have basically the same problem. The Dense layer doesn't seem to work.

So I think I needed to set return_sequences to True for both lstm_a and lstm_b.
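A short sketch of what return_sequences changes, assuming TensorFlow 2 with its bundled Keras. The 25 and 300 come from the question; the hidden size of 8 is arbitrary and chosen only to keep the example small. A target of shape (1, 25, 300) needs one output per timestep, and the final Dense layer then needs 300 units to match the last axis.

```python
import numpy as np
import tensorflow as tf

tokens_per_sentence, units = 25, 300   # values from the question
x = np.random.rand(1, tokens_per_sentence, units).astype("float32")

# Default behaviour: only the output of the final timestep is returned.
last_only = tf.keras.layers.LSTM(8)(x)
# With return_sequences=True there is one output per timestep.
per_step = tf.keras.layers.LSTM(8, return_sequences=True)(x)
# A Dense layer applied to a 3-D tensor acts on the last axis only.
projected = tf.keras.layers.Dense(units)(per_step)

print(tuple(last_only.shape))  # (1, 8)
print(tuple(per_step.shape))   # (1, 25, 8)
print(tuple(projected.shape))  # (1, 25, 300)
```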

Tensorflow LSTM: Predict next action based on a series of previous ones

My input data consists of 10 samples, each of which has 200 time steps, while each time step is described by a vector of 30 dimensions.
In addition, each time step has a 3-dimensional vector (one-hot encoding) that describes the action taken at that particular time step. I am trying to build a model which is fed all previous actions and then predicts which action would be best to take next.
I tried to get this working with tflearn and tensorflow but with limited success so far.
Simple sample code:
import numpy as np
import operator
import tflearn
from tflearn import regression
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.embedding_ops import embedding
from tflearn.layers.recurrent import bidirectional_rnn, BasicLSTMCell
from tflearn.data_utils import to_categorical, pad_sequences
SAMPLES = 10
TIME_STEPS = 200
DATA_DIMENSIONS = 30
LABEL_CLASSES = 3
x = []
y = []
# Generate fake data.
for i in range(SAMPLES):
    sequences = []
    outputs = []
    for i in range(TIME_STEPS):
        d = []
        for i in range(DATA_DIMENSIONS):
            d.append(1)
        sequences.append(d)
        outputs.append([0,0,1])
    x.append(sequences)
    y.append(outputs)
print("X1:", len(x), ", X2:", len(x[0]), ", X3:", len(x[0][0]))
print("Y1:", len(y), ", Y2:", len(y[0]), ", Y3:", len(y[0][0]))
# Define model
net = tflearn.input_data([None, TIME_STEPS, DATA_DIMENSIONS], name='input')
net = tflearn.lstm(net, 128, dropout=0.8, return_seq=True)
net = tflearn.fully_connected(net, LABEL_CLASSES, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy', name='targets')
model = tflearn.DNN(net)
# Fit model.
model.fit({'input': x}, {'targets': y},
          n_epoch=1,
          snapshot_step=1000,
          show_metric=True, run_id='test', batch_size=32)
Error
ValueError: Cannot feed value of shape (10, 200, 3) for Tensor
'targets/Y:0', which has shape '(?, 3)'
As far as I understand, the input_data should be correct. However, the output data is apparently wrong, at least, Tensorflow throws an error. That is probably because my model expects one label per sample rather than one label per time step.
Can I even achieve my goal with an LSTM, and if so, how do I have to set up my model?
Thanks,
Robert

As the error suggests, there is a shape mismatch between the expected size of your targets tensor, and the one of the data you actually provide for it. Let us break it down.
From what I understand, you have labeled action for every timestep of your sequences. This means that the labels that you provide should have a shape (10, 200, 3). This seems to be the case from the error message. Good.
So we now know the error comes from what the network generates.
=================
Input data -> (10, 200, 30)
LSTM -> (10, 128) (because return_seq=False)
FullyConnected -> (10, 3).
=================
So that explains the second part of the error message, your network indeed produces an output with shape (10, 3) which mismatches the one of your data.
I think you missed the return_seq argument of the LSTM. As is usually the case with RNN implementations, you have a parameter telling if you want the layer to return outputs for the whole sequence, or only for the last timestep. Here by default it is the second option, that is why you don't get an output with the expected shape. Use return_seq=True.
