Movie Review Classification with Recurrent Networks - python

As far as I know and research, the sequences in a data set can be of different lengths; we do not need to pad or truncate them provided that each batch in the training process contains the sequences with the same length.
To realize and apply it, I decided to set the batch size to 1 and trained my RNN model over the IMDB movie classification dataset. I added the code that I had written below.
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding
max_features = 10000
batch_size = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(SimpleRNN(units=32, input_shape=(None, 32)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop",
loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=batch_size, epochs=10,
validation_split=0.2)
acc = history.history["acc"]
loss = history.history["loss"]
val_acc = history.history["val_acc"]
val_loss = history.history["val_loss"]
epochs = range(len(acc) + 1)
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()
What error I have been encountered is to fail to convert the input to tensor format because of the list components in the input numpy array. However, when I change them, I continue to get similar kinds of errors.
The error message:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I could not handle the problem. Could anyone help me on this point?

With Sequence Padding
There are two issues. You need to use pad_sequences on the text sequence first. And also there is no such param input_shape in SimpleRNN. Try with the following code:
max_features = 20000 # Only consider the top 20k words
maxlen = 200 # Only consider the first 200 words of each movie review
batch_size = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size,
epochs=10, validation_split=0.2)
Here is the official code example, it might help you.
With Sequence Padding with Mask in Embedding Layer
Based on your comments and information, It seems that it's possible to use a variable-length input sequence, check this and this too. But still, I can say, in most of the cases practitioner would prefer to pad the sequences for uniform length; as it's convincing. Choosing non-uniform or variable input sequence length is some kind of special case; similar to when we want variable input image sizes for vision models.
However, here we will add info on padding and how we can mask out the padded value in training time which technically seems variable-length input training. Hope that convinces you. Let's first understand what pad_sequences do. Normally in sequence data, it's very much a common case that, each training samples are in a different length. Let's consider the following inputs:
raw_inputs = [
[711, 632, 71],
[73, 8, 3215, 55, 927],
[83, 91, 1, 645, 1253, 927],
]
These 3 training samples are in different lengths, 3, 5, and 6 respectively. What we do next is to make them all equal lengths by adding some value (typically 0 or -1) - whether at the beginning or at the end of the sequence.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
array([[ 0, 0, 0, 711, 632, 71],
[ 0, 73, 8, 3215, 55, 927],
[ 83, 91, 1, 645, 1253, 927]], dtype=int32)
We can set padding = "post" to set pad value at the end of the sequence. But it recommends using "post" padding when working with RNN layers in order to be able to use the CuDNN implementation of the layers. However, FYI, you may notice we set maxlen = 6 which is the highest input sequence length. But it does not have to be the highest input sequence length as it may get computationally expensive if the dataset gets bigger. We can set it to 5 assuming that our model can learn feature representation within this length, it's a kind of hyper-parameter. And that brings another parameter truncating.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)
array([[ 0, 0, 711, 632, 71],
[ 73, 8, 3215, 55, 927],
[ 91, 1, 645, 1253, 927]], dtype=int32
Okay, now we have a padded input sequence, all inputs are uniform length. Now, we can mask out those additional padded values in training time. We will tell the model some part of the data is padding and those should be ignored. That mechanism is masking. So, it's a way to tell sequence-processing layers that certain timesteps in the input are missing, and thus should be skipped when processing the data. There are three ways to introduce input masks in Keras models:
Add a keras. layers.Masking layer.
Configure a keras.layers.Embedding layer with mask_zero=True.
Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
Here we will show only by configuring the Embedding layer. It has a parameter called mask_zero and set False by default. If we set it True then 0 containing indices in the sequences will be skipped. False entry indicates that the corresponding timestep should be ignored during processing.
padd_input = tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padd_input)
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padd_input)
print(masked_output._keras_mask)
[[ 0 0 0 711 632 71]
[ 0 73 8 3215 55 927]
[ 83 91 1 645 1253 927]]
tf.Tensor(
[[False False False True True True]
[False True True True True True]
[ True True True True True True]], shape=(3, 6), dtype=bool)
And here is how it's computed in the class Embedding(Layer).
def compute_mask(self, inputs, mask=None):
if not self.mask_zero:
return None
return tf.not_equal(inputs, 0)
And here is one catch, if we set mask_zero as True, as a consequence, index 0 cannot be used in the vocabulary. According to the doc
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
So, we have to use max_features + 1 at least. Here is a nice explanation on this.
Here is the complete example using these of your code.
# get the data
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)
# check highest sequence lenght
max_list_length = lambda list: max( [len(i) for i in list])
print(max_list_idx(x_train))
max_features = 20000 # Only consider the top 20k words
maxlen = 350 # Only consider the first 350 words out of `max_list_idx(x_train)`
batch_size = 512
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
# (1). padding with value 0 at the end of the sequence - padding="post", value=0.
# (2). truncate 'maxlen' words
# out of `max_list_idx(x_train)` at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
maxlen=maxlen, dtype="int32",
padding="post", truncating="post",
value=0.)
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
Your model definition should be now
model = Sequential()
model.add(Embedding(
input_dim=max_features + 1,
output_dim=32,
mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=256,
epochs=1, validation_split=0.2)
639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036
References
Masking and padding with Keras
Embedding layer, - Pads sequences
Recurrent Neural Networks (RNN) with Keras

Without Sequence Padding
Padding is not MUST for the variable length of the input sequence in sequence modeling. In TensorFlow, a tensor with variable numbers of elements along some axis is called ragged and we use tf.ragged.RaggedTensor for ragged data. For example:
# variable length input sequences
ragged_list = [
[0, 1, 2, 3],
[4, 5],
[6, 7, 8],
[9]]
# convert to ragged tensor that handle such variable length inputs
tf.ragged.constant(ragged_list).shape
shape: [4, None]
So, we can use ragged input data in sequence modeling and we no longer need to pad the sequence for uniform input length.
DataSet
import tensorflow as tf
import warnings, numpy as np
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
# maxlen = 200 # No maximum length but whatever
batch_size = 256
max_features = 20000 # Only consider the top 20k words
(x_train, y_train), (x_test, y_test) = \
tf.keras.datasets.imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
25000 Training sequences
25000 Validation sequences
# quick check
x_train[:3]
array([list([1, 14, 22, 16, 43, 53, ....]),
list([....]),
list([...]),
Convert to Ragged
Now, we convert it to a ragged tensor which deals with variable size sequences.
x_train = tf.ragged.constant(x_train)
x_test = tf.ragged.constant(x_test)
# quick check
x_train[:3]
<tf.RaggedTensor [[1, 14, 22, 16, 43, 53, ...] [...] [...]]
x_train.shape, x_test.shape
(TensorShape([25000, None]), TensorShape([25000, None]))
Model
# Input for variable-length sequences of integers
inputs = tf.keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = tf.keras.layers.Embedding(max_features, 128)(inputs)
x = tf.keras.layers.SimpleRNN(units=32)(x)
# Add a classifier
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, None)] 0
_________________________________________________________________
embedding_1 (Embedding) (None, None, 128) 2560000
_________________________________________________________________
simple_rnn (SimpleRNN) (None, 32) 5152
_________________________________________________________________
dense (Dense) (None, 1) 33
=================================================================
Total params: 2,565,185
Trainable params: 2,565,185
Non-trainable params: 0
_________________________________________________________________
Compile and Train
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(x_train, y_train, batch_size=batch_size, verbose=2,
epochs=10, validation_data=(x_test, y_test))
Epoch 1/10
113s 1s/step - loss: 0.6273 - acc: 0.6295 - val_loss: 0.4188 - val_acc: 0.8206
Epoch 2/10
109s 1s/step - loss: 0.4895 - acc: 0.8041 - val_loss: 0.4703 - val_acc: 0.8040
Epoch 3/10
109s 1s/step - loss: 0.3513 - acc: 0.8661 - val_loss: 0.3996 - val_acc: 0.8337
Epoch 4/10
110s 1s/step - loss: 0.2450 - acc: 0.9105 - val_loss: 0.3945 - val_acc: 0.8420
Epoch 5/10
109s 1s/step - loss: 0.1437 - acc: 0.9559 - val_loss: 0.4085 - val_acc: 0.8422
Epoch 6/10
109s 1s/step - loss: 0.0767 - acc: 0.9807 - val_loss: 0.4310 - val_acc: 0.8429
Epoch 7/10
109s 1s/step - loss: 0.0380 - acc: 0.9932 - val_loss: 0.4784 - val_acc: 0.8437
Epoch 8/10
110s 1s/step - loss: 0.0288 - acc: 0.9946 - val_loss: 0.5039 - val_acc: 0.8564
Epoch 9/10
110s 1s/step - loss: 0.0957 - acc: 0.9615 - val_loss: 0.5687 - val_acc: 0.8575
Epoch 10/10
109s 1s/step - loss: 0.1008 - acc: 0.9637 - val_loss: 0.5166 - val_acc: 0.8550

Related

Error while using VGG16 pretrained model for grayscale images

I am working on sign language detection using VGG16 pre-trained model with grayscale images. When I am trying to run the model.fit command, I am getting the following error.
CLARIFICATION
I already have images as RGB form but I want to use them as grayscale to check if they would work with grayscale. The reason being, with color images, I am not getting the accuracy which I am expecting. It is having test accuracy of max 40% only and getting overfitted on dataset.
Also, this is my model command
vgg = VGG16(input_shape= [128, 128] + [3], weights='imagenet', include_top=False)
This is my model.fit command
history = model.fit(
train_x,
train_y,
epochs=15,
validation_data=(test_x, test_y),
callbacks=[early_stop, checkpoint],
batch_size=32,shuffle=True)
I am new to working with pre-trained models. When I am trying to run the code with color images with 3 channels, my model is getting into overfitting and val_accuracy doesn't rise above 40% so I want to give try the grayscale images as I have added many data augmentation techniques but accuracy is not improving. Any leads are welcomed as I am stuck into this for long time now.
The simplest (and likely fastest) solution I can think of is to just convert your image to rgb. You can do this as part of your model.
model = Sequential([
tf.keras.layers.Lambda(tf.image.grayscale_to_rgb),
vgg
])
This will fix your issue with VGG. I also see that you're missing the last dimensionality for your images. Images in grayscale are expected to be of shape [height, width, 1], but you simply have [height, width]. You can fix this using tf.expand_dims:
model = Sequential([
tf.keras.layers.Lambda(
lambda x: tf.image.grayscale_to_rgb(tf.expand_dims(x, -1))
),
vgg,
])
Note that this solution solves the problem in the graph, so it runs online. Meaning, at runtime, you can feed data exactly the same way you have it now (in the shape [128, 128], without a channels dimension) and it will still functionally work. If this is your expected dimensionality during runtime, this will be faster than manipulating your data before throwing it into the model.
By the way, none of this is ideal, given that VGG was trained specifically to work best with color images. Just thought I should add that.
Why are you getting overfitting?
Maybe for different reasons:
Your images and labels don't equally exist in the train, Val, test. (maybe you have images in train and don't have them in test.) Or your train, Val, test data don't stratify correctly and you train your model on a specific area in your data and features.
You Dataset is very small and you need more data.
Maybe you have noise in your datase, first make sure to remove noise from the dataset. (if you have noise, model fit on your noise.)
How can you input grayscale images to VGG16?
For Using VGG16, you need to input 3 channels images. For this reason, you need to concatenate your images like below to get three channels images from grayscale:
image = tf.concat([image, image, image], -1)
Example of training VGG16 on grayscale images from fashion_mnist dataset:
from tensorflow.keras.applications.vgg16 import VGG16
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
train, val, test = tfds.load(
'fashion_mnist',
shuffle_files=True,
as_supervised=True,
split = ['train[:85%]', 'train[85%:]', 'test']
)
def resize_preprocess(image, label):
image = tf.image.resize(image, (32, 32))
image = tf.concat([image, image, image], -1)
image = tf.keras.applications.densenet.preprocess_input(image)
return image, label
train = train.map(resize_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
test = test.map(resize_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
val = val.map(resize_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
train = train.repeat(15).batch(64).prefetch(tf.data.AUTOTUNE)
test = test.batch(64).prefetch(tf.data.AUTOTUNE)
val = val.batch(64).prefetch(tf.data.AUTOTUNE)
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(32,32,3))
base_model.trainable = False ## Not trainable weights
model = tf.keras.Sequential()
model.add(base_model)
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(1024, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer='Adam',
metrics=['accuracy'])
model.summary()
fit_callbacks = [tf.keras.callbacks.EarlyStopping(
monitor='val_accuracy', patience = 4, restore_best_weights = True)]
history = model.fit(train, steps_per_epoch=150, epochs=5, batch_size=64, validation_data=val, callbacks=fit_callbacks)
model.evaluate(test)
Output:
Model: "sequential_17"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
vgg16 (Functional) (None, 1, 1, 512) 14714688
flatten_3 (Flatten) (None, 512) 0
dense_9 (Dense) (None, 1024) 525312
dropout_6 (Dropout) (None, 1024) 0
dense_10 (Dense) (None, 256) 262400
dropout_7 (Dropout) (None, 256) 0
dense_11 (Dense) (None, 10) 2570
=================================================================
Total params: 15,504,970
Trainable params: 790,282
Non-trainable params: 14,714,688
_________________________________________________________________
Epoch 1/5
150/150 [==============================] - 6s 35ms/step - loss: 0.8056 - accuracy: 0.7217 - val_loss: 0.5433 - val_accuracy: 0.7967
Epoch 2/5
150/150 [==============================] - 4s 26ms/step - loss: 0.5560 - accuracy: 0.7965 - val_loss: 0.4772 - val_accuracy: 0.8224
Epoch 3/5
150/150 [==============================] - 4s 26ms/step - loss: 0.5287 - accuracy: 0.8080 - val_loss: 0.4698 - val_accuracy: 0.8234
Epoch 4/5
150/150 [==============================] - 5s 32ms/step - loss: 0.5012 - accuracy: 0.8149 - val_loss: 0.4334 - val_accuracy: 0.8329
Epoch 5/5
150/150 [==============================] - 4s 25ms/step - loss: 0.4791 - accuracy: 0.8315 - val_loss: 0.4312 - val_accuracy: 0.8398
157/157 [==============================] - 2s 15ms/step - loss: 0.4457 - accuracy: 0.8325
[0.44566288590431213, 0.8324999809265137]

Set the dimension for simpleRNN (Please provide data which shares the same first dimension)

I am trying to test the simpleRNN
stft_librosa is numpy data (257, 958)
My Idea is put sliced (10,958) for input and get (1,958) for output.
epochs = 10
batch = 24
model.add(
SimpleRNN(1, activation=None, input_shape=(958,1), return_sequences=True)
)
model.add(Dense(1, activation="linear"))
model.compile(loss="mean_squared_error", optimizer="sgd")
print(model.summary())
for num in range(0, epochs):
print(num + 1, '/', epochs, ' start')
for i,data in enumerate(stft_librosa):
if i == 0:continue
in_data = stft_librosa[i - 1]
out_data = stft_librosa[i]
model.fit(in_data, out_data, epochs=1, shuffle=False, batch_size=batch)
model.reset_states()
print(num+1, '/', epochs, ' epoch is done!')
model.save('/data/mymodel')
There comes error like this, how should I fix it??
ValueError: Data cardinality is ambiguous:
x sizes: 9
y sizes: 958
Please provide data which shares the same first dimension.
model.summary() is here.
Output Shape Param #
simple_rnn (SimpleRNN) (None, 9, 958) 1836486
dense (Dense) (None, 9, 1) 959
Total params: 1,837,445
Trainable params: 1,837,445
Non-trainable params: 0
That input_shape=(9,958) is wrong (this is after you have edited the question). Correct one is input_shape=(958, 1) - the one that you have in your original question.(please, don't edit your questions like that, you have changed both the model, by adding the Dense layer, and the error by changing the input shape).
The original error, before you updated the question, is due to the incorrect shape of the inputs that you are feeding into the model. This model expects 3-dimensional input - the first one being the batch size but you are feeding it a single sample at a time.
Here is how you can reshape your data.
x = stft_librosa[:-1, :].reshape((-1, 958, 1))
y = stft_librosa[1:, :].reshape((-1, 958, 1))
print(x.shape, y.shape)
# ((256, 958, 1), (256, 958, 1))
This will shift y one index ahead and remove the last index from x.
Once you have that, you can just call the fit method once without the need for that loop.
model = Sequential([
SimpleRNN(1, activation=keras.activations.linear, input_shape=(958, 1), return_sequences=True),
Dense(1, activation=keras.activations.linear)
])
model.compile(optimizer=SGD(lr=0.000001), loss=keras.losses.MeanSquaredError())
model.fit(x, y, epochs=5, batch_size=24)
#Epoch 1/5
#11/11 [==============================] - 3s 266ms/step - loss: 2.8277
#Epoch 2/5
#11/11 [==============================] - 3s 266ms/step - loss: 2.0028
#Epoch 3/5
#11/11 [==============================] - 2s 224ms/step - loss: 1.8698
#Epoch 4/5
#11/11 [==============================] - 3s 229ms/step - loss: 1.7877
#Epoch 5/5
#11/11 [==============================] - 2s 226ms/step - loss: 1.7307
I suggest you revert the edit to your question because the original code made more sense than the current one
The first dimension is batch dimension. It should be the same for input and output.
Try this:
in_data = stft_librosa[i - 1][tf.newaxis, :]
out_data = stft_librosa[i][tf.newaxis, :]

could not broadcast input array from shape (27839,1) into shape (27839)

I'm building a chain classifier for a multiclass problem that uses Keras binary Classifier model in a chain. I have 17 labels as classification target and shape of X_train is (111300,107) and y_train is (111300,17). After training, I got following Error in predict method;
*could not broadcast input array from shape (27839,1) into shape (27839)*
My code is here:
def create_model():
input_size=length_long_sentence
embedding_size=128
lstm_size=64
output_size=len(unique_tag_set)
#----------------------------Model--------------------------------
current_input=Input(shape=(input_size,))
emb_current = Embedding(vocab_size, embedding_size, input_length=input_size)(current_input)
out_current=Bidirectional(LSTM(units=lstm_size))(emb_current )
#out_current = Reshape((1,2*lstm_size))(out_current)
output = Dense(units=1, activation= 'sigmoid')(out_current)
#output = Dense(units=1, activation='softmax')(out_current)
model = Model(inputs=current_input, outputs=output)
#-------------------------------compile-------------
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
return model
model = KerasClassifier(build_fn=create_model, epochs=1,batch_size=256, shuffle = True, verbose = 1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
history=chain.fit(X_train, y_train)
the result for chain.classes_ is given below:
[array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8)]
then trying to predict on Test data:
Y_pred_chain = chain.predict(X_test)
The summary of the model is given below:
Full Trace of error is here:
109/109 [==============================] - 22s 202ms/step
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-34a25ad06cd4> in <module>()
----> 1 Y_pred_chain = chain.predict(X_test)
/usr/local/lib/python3.6/dist-packages/sklearn/multioutput.py in predict(self, X)
523 else:
524 X_aug = np.hstack((X, previous_predictions))
--> 525 Y_pred_chain[:, chain_idx] = estimator.predict(X_aug)
526
527 inv_order = np.empty_like(self.order_)
ValueError: could not broadcast input array from shape (27839,1) into shape (27839)
Can any one help about how to fix this error?
Stage 1
Going by the model summary as posted in the question, I start with that the input size of 107 and the output size is 1 (binary classification task)
Lets break it into pieces and understand.
The Model architecture
input_size = 107
# define the model
def create_model():
global input_size
embedding_size=128
lstm_size=64
output_size=1
vocab_size = 100
current_input=Input(shape=(input_size,))
emb_current = Embedding(vocab_size, embedding_size, input_length=input_size)(current_input)
out_current=Bidirectional(LSTM(units=lstm_size))(emb_current )
output = Dense(units=output_size, activation= 'sigmoid')(out_current)
model = Model(inputs=current_input, outputs=output)
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
return model
Some dummy data
X = np.random.randint(0,100,(111, 107))
y = np.random.randint(0,2,(111,1)) # NOTE: The y should have two dimensions
Lets test the keras model directly
model = KerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle = True, verbose = 1,validation_split=0.2)
model.fit(X, y)
y_hat = model.predict(X)
Output:
Train on 88 samples, validate on 23 samples
Epoch 1/1
88/88 [==============================] - 2s 21ms/step - loss: 0.6951 - accuracy: 0.4432 - val_loss: 0.6898 - val_accuracy: 0.5652
111/111 [==============================] - 0s 2ms/step
(111, 1)
Ta-da! it works
Now lets chain them and run
model=KerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
print (chain.predict(X).shape)
oops! it trains but predictions fails as OP points out
Error:
ValueError: could not broadcast input array from shape (111,1) into shape (111)
The problem
This error is because of the below line in sklearn
--> 525 Y_pred_chain[:, chain_idx] = estimator.predict(X_aug)
It is because classifier chain runs the estimators one at a time and saves each estimators predictions in Y_pred_chain at the estimators index (determined by the order parameter). It assumes that the estimators return the predictions in a 1D array. But keras models return output of shape batch_size x output_size which in out our case is 111 x 1.
The solution
We need a way to reshape the predictions of shape 111 X 1 to 111 or in general batch_size x 1 to batch_size. Lets bank on the concepts of OOPS and overload the predict method of KerasClassifier
class MyKerasClassifier(KerasClassifier):
def __init__(self, **args):
super().__init__(**args)
def predict(self, X):
return super().predict(X).reshape(len(X)) # Here we are flattening 2D array to 1D
model=MyKerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
print (chain.predict(X).shape)
Output:
Epoch 1/1
88/88 [==============================] - 2s 19ms/step - loss: 0.6919 - accuracy: 0.5227 - val_loss: 0.6892 - val_accuracy: 0.5652
111/111 [==============================] - 0s 3ms/step
(111, 1)
Ta-da! it works
Stage 2
Lets look deeper into ClassifierChain class
A multi-label model that arranges binary classifiers into a chain.
Each model makes a prediction in the order specified by the chain
using all of the available features provided to the model plus the
predictions of models that are earlier in the chain.
So what we really need is a y of shape 111 X 17 so that the chain contains 17 estimators. Lets try it
The real ClassifierChain
y = np.random.randint(0,2,(111,17))
model=MyKerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
Output:
ValueError: Error when checking input: expected input_62 to have shape (107,) but got array with shape (108,)
It cannot train the model; the reason is pretty simple. The chain first trains the first estimator with 107 feature with works fine. Next the chain picks up the next estimator and then trains it with 107 features + the single output of the previous estimator (=108). But since our model has input size of 107 it will fail as so the error message. Each estimator will get 107 input features + the output of all the previous estimators.
The solution [hacky]
We need a way to change the input_size of the model as they are created from the ClassifierChain. There seem to be no callbacks or hooks into the ClassifierChain, so I have a hacky solution.
input_size = 107
# define the model
def create_model():
global input_size
embedding_size=128
lstm_size=64
output_size=1
vocab_size = 100
current_input=Input(shape=(input_size,))
emb_current = Embedding(vocab_size, embedding_size, input_length=input_size)(current_input)
out_current=Bidirectional(LSTM(units=lstm_size))(emb_current )
output = Dense(units=output_size, activation= 'sigmoid')(out_current)
model = Model(inputs=current_input, outputs=output)
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
input_size += 1 # <-- This does the magic
return model
X = np.random.randint(0,100,(111, 107))
y = np.random.randint(0,2,(111,17))
model=MyKerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
print (chain.predict(X).shape)
Output:
Train on 88 samples, validate on 23 samples
Epoch 1/1
88/88 [==============================] - 2s 22ms/step - loss: 0.6901 - accuracy: 0.6023 - val_loss: 0.7002 - val_accuracy: 0.4783
Train on 88 samples, validate on 23 samples
Epoch 1/1
88/88 [==============================] - 2s 22ms/step - loss: 0.6976 - accuracy: 0.5000 - val_loss: 0.7070 - val_accuracy: 0.3913
Train on 88 samples, validate on 23 samples
Epoch 1/1
----------- [Output truncated] ----------------
111/111 [==============================] - 0s 3ms/step
111/111 [==============================] - 0s 3ms/step
(111, 17)
As expected it trains 17 estimators and predict method returns output of shape 111 x 17 each column corresponding to the predictions made by the corresponding estimator.
here a complete working example...
I solved using sequential model and softmax as the last activation
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from sklearn.multioutput import ClassifierChain
n_sample = 20
vocab_size = 33
input_size = 100
X = np.random.randint(0,vocab_size, (n_sample,input_size))
y = np.random.randint(0,2, (n_sample,17))
def create_model():
global input_size
embedding_size = 128
lstm_size = 64
model = Sequential([
Embedding(vocab_size, embedding_size, input_length=input_size),
Bidirectional(LSTM(units=lstm_size)),
Dense(units=2, activation= 'softmax')
])
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
input_size += 1
return model
model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model, epochs=1, batch_size=256,
shuffle = True, verbose = 1, validation_split=0.2)
chain = ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
chain.predict_proba(X)
here the running code: https://colab.research.google.com/drive/1aVjjh6VPmAyBddwU4ff2w9y_LmmC02W_?usp=sharing

Linear Classifier Tensorflow2 not training(1 neuron model)

I'm currently working on the CIFAR-10 Dataset which is an image classification problem with 10 classes.
I have started to develop with Tensorflow 2 a Linear Classification without the LinearClassifier Object.
X shape corresponds to 10 000 images of 32*32 pixels RBG = (10000, 3072)
Y_one_hot is a one hot vector = (10000, 10)
model creation code:
model = tf.keras.Sequential()
model.add(Dense(1, activation="linear", input_dim=32*32*3))
model.add(Dense(10, activation="softmax", input_dim=1))
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])
training code:
model.fit(X, Y_one_hot, batch_size=10000, verbose=1, epochs=100)
predict code:
img = X[0].reshape(1, 3072) # Select image 0
res = np.argmax((model.predict(img))) # select the max in output
Problem:
res value is always the same. It seems my model is not learning.
Model.summary
Summary displays :
dense (Dense) (None, 1) 3073
dense_1 (Dense) (None, 10) 20
Total params: 3,093
Trainable params: 3,093
Non-trainable params: 0
Accuracy & loss:
Epoch 1/100
10000/10000 [==============================] - 2s 184us/sample - loss: 0.0949 - accuracy: 0.1005
Epoch 50/100
10000/10000 [==============================] - 0s 10us/sample - loss: 0.0901 - accuracy: 0.1000
Epoch 100/100
10000/10000 [==============================] - 0s 8us/sample - loss: 0.0901 - accuracy: 0.1027
Do you have any idea why my model is always prediciting the same value ?
Thanks,
One remarks:
The loss you used loss="mean_squared_error"is not meant for classification. Is meant for regression. Two very different problems. Try a cross entropy. For example
`model.compile(optimizer=AdamOpt,
loss='categorical_crossentropy', metrics=['accuracy'])`
You can find an example here: https://github.com/michelucci/oreilly-london-ai/blob/master/day1/Beginner%20friendly%20networks/First_Example_of_a_CNN_(CIFAR10).ipynb. Is a note book I used for a training I gave. The network is CNN but you can change it with yours.
Try that...
Best of luck, Umberto

Text classification with LSTM Network and Keras 0.0% accuracy

I have csv file with two columns:
category, description
1030 categories in the file and only about 12,600 lines
I need to get a model for text classification, trained on this data. I use keras with LSTM model.
I found an article describing how to make a binary classification, and slightly modified it to use several categories.
My code:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from numpy import array
from keras.preprocessing.text import one_hot
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing import sequence
import keras
df = pd.read_csv('/tmp/input_data.csv')
#one hot encode your documents
# integer encode the documents
vocab_size = 2000
encoded_docs = [one_hot(d, vocab_size) for d in df['description']]
def load_data_from_arrays(strings, labels, train_test_split=0.9):
data_size = len(strings)
test_size = int(data_size - round(data_size * train_test_split))
print("Test size: {}".format(test_size))
print("\nTraining set:")
x_train = strings[test_size:]
print("\t - x_train: {}".format(len(x_train)))
y_train = labels[test_size:]
print("\t - y_train: {}".format(len(y_train)))
print("\nTesting set:")
x_test = strings[:test_size]
print("\t - x_test: {}".format(len(x_test)))
y_test = labels[:test_size]
print("\t - y_test: {}".format(len(y_test)))
return x_train, y_train, x_test, y_test
encoder = LabelEncoder()
categories = encoder.fit_transform(df['category'])
num_classes = np.max(categories) + 1
print('Categories count: {}'.format(num_classes))
#Categories count: 1030
X_train, y_train, x_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)
# Truncate and pad the review sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
# Build the model
embedding_vector_length = 32
top_words = 10000
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
print(model.summary())
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_8 (Embedding) (None, 500, 32) 320000
_________________________________________________________________
lstm_8 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_8 (Dense) (None, 1030) 104030
=================================================================
Total params: 477,230
Trainable params: 477,230
Non-trainable params: 0
_________________________________________________________________
None
#Train the model
model.fit(X_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64)
Train on 10118 samples, validate on 2530 samples
Epoch 1/5
10118/10118 [==============================] - 60s 6ms/step - loss: 6.5086 - acc: 0.0019 - val_loss: 10.0911 - val_acc: 0.0000e+00
Epoch 2/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3281 - acc: 0.0028 - val_loss: 10.8270 - val_acc: 0.0000e+00
Epoch 3/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3120 - acc: 0.0024 - val_loss: 11.0078 - val_acc: 0.0000e+00
Epoch 4/5
10118/10118 [==============================] - 64s 6ms/step - loss: 6.2891 - acc: 0.0030 - val_loss: 11.8264 - val_acc: 0.0000e+00
Epoch 5/5
10118/10118 [==============================] - 69s 7ms/step - loss: 6.2559 - acc: 0.0032 - val_loss: 12.1625 - val_acc: 0.0000e+00
#Evaluate the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 0.00%
What mistake did I make when preparing the data?
why accuracy is always 0?
I have curated end-to-end code with some inputs from my end and tested working on this data, you can use the same with your data with no or minimal changes as I have removed specifics and made it generic. Also at the end, I have highlighted what points I have worked on top of the code you provided above.
Code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from nltk.tokenize import word_tokenize
def load_data_from_arrays(strings, labels, train_test_split=0.9):
data_size = len(strings)
test_size = int(data_size - round(data_size * train_test_split))
print("Test size: {}".format(test_size))
print("\nTraining set:")
x_train = strings[test_size:]
print("\t - x_train: {}".format(len(x_train)))
y_train = labels[test_size:]
print("\t - y_train: {}".format(len(y_train)))
print("\nTesting set:")
x_test = strings[:test_size]
print("\t - x_test: {}".format(len(x_test)))
y_test = labels[:test_size]
print("\t - y_test: {}".format(len(y_test)))
return x_train, y_train, x_test, y_test
# estimating the vocab length with the help of nltk
def get_vocab_length(strings):
vocab = []
for sent in strings:
words = word_tokenize(sent)
vocab.extend(words)
vocab = list(set(vocab))
vocab_length = len(vocab)
return vocab_length
def clean_text(sent):
# <your cleaning code here>
# clean func 1
# clean func 2
# ...
# clean func n
return sent
# load input data
df = pd.read_csv('/tmp/input_data.csv')
strings = df['description'].values
labels = df['category'].values
clean_strings = [clean_text(sent) for sent in strings]
vocab_length = get_vocab_length(clean_strings)
# create onehot encodings of strings
encoded_docs = [one_hot(sent, vocab_length) for sent in strings]
# create onehot encodings of labels
ohe = OneHotEncoder()
categories = ohe.fit_transform(labels.reshape(-1,1)).toarray()
# split data
X_train, y_train, X_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)
# assuming max input to be not more than 512 words
max_input_len = 512
# padding data
X_train = pad_sequences(X_train, maxlen=max_input_len, padding= 'post')
X_test = pad_sequences(X_test, maxlen=max_input_len, padding= 'post')
# setting embedding vector length
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_length, embedding_vector_length, input_length=max_input_len, name= 'embedding') )
model.add(Flatten())
model.add(Dense(5, activation= 'softmax'))
model.compile('adam', loss= 'categorical_crossentropy', metrics= ['accuracy'])
model.summary()
# training the model
model.fit(X_train, y_train, epochs= 10, batch_size= 128, validation_split= 0.2, verbose= 1)
# evaluating the model
score = model.evaluate(X_test, y_test, verbose=0)
print("Test Loss:", score[0])
print("Test Acc:", score[1])
Additional areas I have worked on
1. Text Cleaning
Created a function to clean the text. It is extremely important as it will remove unnecessary noise from the data and also note this step will totally depend on the type of data you have. To help you simplify, I have created a clean_text function in the above code where you can place your cleaning code. It should be used in such a way that it takes in raw text and provides clean text. Some of the libraries you may like to look into are re, string, and emoji.
2. Estimating Vocab Size
If you have enough data, it is good to estimate the vocab size rather than putting some number directly while passing it to Keras one_hot function. I have created a basic get_vocab_length function using nltk word_tokenize. You can use the same or enhance it further as per your data.
What Else?
You can work further on hyperparameter tuning and a few different neural network designs.
Final Words
It still may not work as it totally depends on the data quality and amount of data you have. There is a good chance you may not get results after trying everything if you have poor quality data or a very less amount of data.
I would then suggest you try transfer learning on some pre-trained models like BERT, RoBERTa, etc. HuggingFace provides good support for state-of-art pre-trained models, you can get started at the following links -
https://huggingface.co/docs/transformers/index#supported-models
https://towardsdatascience.com/text-classification-with-hugging-face-transformers-in-tensorflow-2-without-tears-ee50e4f3e7ed
https://towardsdatascience.com/an-introduction-to-transformers-and-hugging-face-13052ec9d72d
I guess that your vocab_size is way too low. If you are dealing with usual text, try 10.000 - 100.000 as a starting point.
What one_hot does is to use the hashing trick. That means all of your words are hashed and projected into an 2000 vector space. It does not only mean that your dict is 2000 words long, it does mean every word will be projected to into this space, which effectively causes a lot of collisions, where words have the same index and are considered as equal in the LSTM.
Furthermore you should take a look at the transformed text, just too get an understanding of what happens here. To do so, build an reverse lookup and transform all the indices back.
As a further improvement it is feasible to preprocess the text with common techniques like stemming, normalizing etc. and the usage of a vocabulary or discard bag of words and use word embeddings.
from keras.preprocessing.text import one_hot, Tokenizer, hashing_trick
text1 = 'I love you'
text2 = 'you love I'
print('one_hot: ')
print(one_hot(text1, n=20))
print(one_hot(text2, n=20))
print('--------------------------------------')
print('Tokenizer: ')
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])
print(tokenizer.word_index)
print(tokenizer.index_word)
print('--------------------------------------')
print('hashing_trick: ')
print(hashing_trick(text1, n=20))
print(hashing_trick(text2, n=20))
print('--------------------------------------')
out:
one_hot:
[14, 7, 14]
[14, 7, 14]
--------------------------------------
Tokenizer:
{'i': 1, 'love': 2, 'you': 3}
{1: 'i', 2: 'love', 3: 'you'}
--------------------------------------
hashing_trick:
[14, 7, 14]
[14, 7, 14]
--------------------------------------
Run more times and you will find that the results of one_hot and hashing_trick are not unique.
You should use Tokenizer to convert text.

Categories