I'm building a classifier chain for a multi-label problem, using a Keras binary classifier as the base estimator of the chain. I have 17 labels as classification targets; the shape of X_train is (111300, 107) and the shape of y_train is (111300, 17). After training, I got the following error from the predict method:
*could not broadcast input array from shape (27839,1) into shape (27839)*
My code is here:
def create_model():
    input_size = length_long_sentence
    embedding_size = 128
    lstm_size = 64
    output_size = len(unique_tag_set)
    #----------------------------Model--------------------------------
    current_input = Input(shape=(input_size,))
    emb_current = Embedding(vocab_size, embedding_size, input_length=input_size)(current_input)
    out_current = Bidirectional(LSTM(units=lstm_size))(emb_current)
    #out_current = Reshape((1,2*lstm_size))(out_current)
    output = Dense(units=1, activation='sigmoid')(out_current)
    #output = Dense(units=1, activation='softmax')(out_current)
    model = Model(inputs=current_input, outputs=output)
    #-------------------------------compile-------------
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=create_model, epochs=1,batch_size=256, shuffle = True, verbose = 1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
history=chain.fit(X_train, y_train)
The result for chain.classes_ is given below:
[array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8),
array([0, 1], dtype=uint8)]
Then I try to predict on the test data:
Y_pred_chain = chain.predict(X_test)
The summary of the model is given below:
Full Trace of error is here:
109/109 [==============================] - 22s 202ms/step
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-34a25ad06cd4> in <module>()
----> 1 Y_pred_chain = chain.predict(X_test)
/usr/local/lib/python3.6/dist-packages/sklearn/multioutput.py in predict(self, X)
523 else:
524 X_aug = np.hstack((X, previous_predictions))
--> 525 Y_pred_chain[:, chain_idx] = estimator.predict(X_aug)
526
527 inv_order = np.empty_like(self.order_)
ValueError: could not broadcast input array from shape (27839,1) into shape (27839)
Can anyone help me fix this error?
Stage 1
Going by the model summary posted in the question, I start with an input size of 107 and an output size of 1 (a binary classification task).
Let's break it into pieces and understand it.
The Model architecture
input_size = 107
# define the model
def create_model():
    global input_size
    embedding_size = 128
    lstm_size = 64
    output_size = 1
    vocab_size = 100
    current_input = Input(shape=(input_size,))
    emb_current = Embedding(vocab_size, embedding_size, input_length=input_size)(current_input)
    out_current = Bidirectional(LSTM(units=lstm_size))(emb_current)
    output = Dense(units=output_size, activation='sigmoid')(out_current)
    model = Model(inputs=current_input, outputs=output)
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
Some dummy data
X = np.random.randint(0,100,(111, 107))
y = np.random.randint(0,2,(111,1)) # NOTE: The y should have two dimensions
Let's test the Keras model directly
model = KerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle = True, verbose = 1,validation_split=0.2)
model.fit(X, y)
y_hat = model.predict(X)
print(y_hat.shape)
Output:
Train on 88 samples, validate on 23 samples
Epoch 1/1
88/88 [==============================] - 2s 21ms/step - loss: 0.6951 - accuracy: 0.4432 - val_loss: 0.6898 - val_accuracy: 0.5652
111/111 [==============================] - 0s 2ms/step
(111, 1)
Ta-da! it works
Now let's chain them and run
model=KerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
print (chain.predict(X).shape)
Oops! It trains, but prediction fails, as the OP points out.
Error:
ValueError: could not broadcast input array from shape (111,1) into shape (111)
The problem
This error comes from the line below in sklearn:
--> 525 Y_pred_chain[:, chain_idx] = estimator.predict(X_aug)
The classifier chain runs the estimators one at a time and stores each estimator's predictions in Y_pred_chain at that estimator's index (determined by the order parameter). It assumes that each estimator returns its predictions as a 1D array, but Keras models return output of shape batch_size x output_size, which in our case is 111 x 1.
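The failing assignment can be reproduced with NumPy alone. The following is a minimal sketch, not sklearn's code: the array names mirror the traceback, while the sizes (5 samples, 3 labels) are made up for illustration.

```python
import numpy as np

n_samples, n_labels = 5, 3

# Buffer that the chain fills one column at a time.
Y_pred_chain = np.zeros((n_samples, n_labels))

# A Keras-style prediction: a single column, but still two-dimensional.
keras_style_pred = np.ones((n_samples, 1))

try:
    # Y_pred_chain[:, 0] is 1D with 5 elements, so a (5, 1) array
    # cannot be broadcast into it.
    Y_pred_chain[:, 0] = keras_style_pred
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Flattening the prediction to 1D makes the same assignment valid.
Y_pred_chain[:, 0] = keras_style_pred.ravel()
print(Y_pred_chain[:, 0])
```

Any flattening call (`ravel()`, `reshape(-1)`, or `reshape(len(X))`) would do here; the point is only that the right-hand side has to be 1D.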
The solution
We need a way to reshape the predictions from 111 x 1 to 111, or in general from batch_size x 1 to batch_size. Let's rely on OOP and override the predict method of KerasClassifier:
class MyKerasClassifier(KerasClassifier):
    def __init__(self, **args):
        super().__init__(**args)

    def predict(self, X):
        return super().predict(X).reshape(len(X))  # Here we are flattening 2D array to 1D
model=MyKerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
print (chain.predict(X).shape)
Output:
Epoch 1/1
88/88 [==============================] - 2s 19ms/step - loss: 0.6919 - accuracy: 0.5227 - val_loss: 0.6892 - val_accuracy: 0.5652
111/111 [==============================] - 0s 3ms/step
(111, 1)
Ta-da! it works
Stage 2
Let's look deeper into the ClassifierChain class:
A multi-label model that arranges binary classifiers into a chain.
Each model makes a prediction in the order specified by the chain
using all of the available features provided to the model plus the
predictions of models that are earlier in the chain.
So what we really need is a y of shape 111 x 17, so that the chain contains 17 estimators. Let's try it.
The real ClassifierChain
y = np.random.randint(0,2,(111,17))
model=MyKerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
Output:
ValueError: Error when checking input: expected input_62 to have shape (107,) but got array with shape (108,)
The model cannot be trained, and the reason is simple. The chain trains the first estimator with 107 features, which works fine. It then picks the next estimator and trains it with the 107 features plus one extra column from the previous estimator (= 108). Since our model has an input size of 107, it fails with the error message above. In general, each estimator gets the 107 input features plus one column for every estimator earlier in the chain.
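The growth of the input width along the chain can be shown with a small NumPy sketch. It is a simplification under two assumptions: the chain order is taken to be the natural column order (not `order='random'`), and the true labels of earlier positions are used as the extra columns, which is what sklearn documents for fitting when `cv` is None.

```python
import numpy as np

n_samples, n_features, n_labels = 111, 107, 17
X = np.random.randint(0, 100, (n_samples, n_features))
Y = np.random.randint(0, 2, (n_samples, n_labels))

# The estimator at chain position i sees the original features plus one
# column for each of the i positions that come before it.
input_widths = []
for chain_idx in range(n_labels):
    X_aug = np.hstack((X, Y[:, :chain_idx]))
    input_widths.append(X_aug.shape[1])

print(input_widths[:3], "...", input_widths[-1])
```

So a model whose input layer is fixed at 107 can only ever serve as the first estimator; the last one needs 107 + 16 = 123 inputs.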
The solution [hacky]
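We need a way to change the input_size of each model as the ClassifierChain builds it. There seem to be no callbacks or hooks in ClassifierChain for this, so I have a hacky solution: a global input_size that create_model increments every time it is called. Note that input_size has to be reset to 107 before fitting another chain.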
input_size = 107
# define the model
def create_model():
    global input_size
    embedding_size = 128
    lstm_size = 64
    output_size = 1
    vocab_size = 100
    current_input = Input(shape=(input_size,))
    emb_current = Embedding(vocab_size, embedding_size, input_length=input_size)(current_input)
    out_current = Bidirectional(LSTM(units=lstm_size))(emb_current)
    output = Dense(units=output_size, activation='sigmoid')(out_current)
    model = Model(inputs=current_input, outputs=output)
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    input_size += 1  # <-- This does the magic
    return model
X = np.random.randint(0,100,(111, 107))
y = np.random.randint(0,2,(111,17))
model=MyKerasClassifier(build_fn=create_model, epochs=1, batch_size=8, shuffle=True, verbose=1,validation_split=0.2)
chain=ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
print (chain.predict(X).shape)
Output:
Train on 88 samples, validate on 23 samples
Epoch 1/1
88/88 [==============================] - 2s 22ms/step - loss: 0.6901 - accuracy: 0.6023 - val_loss: 0.7002 - val_accuracy: 0.4783
Train on 88 samples, validate on 23 samples
Epoch 1/1
88/88 [==============================] - 2s 22ms/step - loss: 0.6976 - accuracy: 0.5000 - val_loss: 0.7070 - val_accuracy: 0.3913
Train on 88 samples, validate on 23 samples
Epoch 1/1
----------- [Output truncated] ----------------
111/111 [==============================] - 0s 3ms/step
111/111 [==============================] - 0s 3ms/step
(111, 17)
As expected, it trains 17 estimators, and the predict method returns output of shape 111 x 17, with each column holding the predictions for the corresponding label.
Here is a complete working example.
I solved it using a Sequential model with softmax as the last activation:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from sklearn.multioutput import ClassifierChain
n_sample = 20
vocab_size = 33
input_size = 100
X = np.random.randint(0,vocab_size, (n_sample,input_size))
y = np.random.randint(0,2, (n_sample,17))
def create_model():
    global input_size
    embedding_size = 128
    lstm_size = 64
    model = Sequential([
        Embedding(vocab_size, embedding_size, input_length=input_size),
        Bidirectional(LSTM(units=lstm_size)),
        Dense(units=2, activation='softmax')
    ])
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    input_size += 1
    return model
model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model, epochs=1, batch_size=256,
shuffle = True, verbose = 1, validation_split=0.2)
chain = ClassifierChain(model, order='random', random_state=42)
chain.fit(X, y)
chain.predict_proba(X)
Here is the running code: https://colab.research.google.com/drive/1aVjjh6VPmAyBddwU4ff2w9y_LmmC02W_?usp=sharing
As far as I know from my research, the sequences in a data set can have different lengths; we do not need to pad or truncate them, provided that each batch in the training process contains sequences of the same length.
To apply this, I set the batch size to 1 and trained my RNN model on the IMDB movie review classification dataset. My code is below.
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding
max_features = 10000
batch_size = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(SimpleRNN(units=32, input_shape=(None, 32)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop",
loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=batch_size, epochs=10,
validation_split=0.2)
acc = history.history["acc"]
loss = history.history["loss"]
val_acc = history.history["val_acc"]
val_loss = history.history["val_loss"]
epochs = range(1, len(acc) + 1)  # one x-value per epoch, matching len(acc)
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()
The error I encounter is a failure to convert the input to a tensor, because the elements of the input NumPy array are lists. When I change them, I still get similar kinds of errors.
The error message:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I could not handle the problem. Could anyone help me on this point?
With Sequence Padding
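There are two issues. First, you need to apply pad_sequences to the text sequences. Second, the input_shape argument on SimpleRNN is unnecessary here: the layer is not the first one in the model and gets its input shape from the Embedding layer. Try the following code: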
max_features = 20000 # Only consider the top 20k words
maxlen = 200 # Only consider the first 200 words of each movie review
batch_size = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size,
epochs=10, validation_split=0.2)
Here is the official code example, it might help you.
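To see why the unpadded data triggers the conversion error in the first place, here is a minimal NumPy sketch. The three short sequences are made up; they only stand in for the raw IMDB reviews, which are likewise stored as lists of different lengths.

```python
import numpy as np

# Three "reviews" of different lengths, like the raw IMDB sequences.
sequences = [[1, 14, 22], [1, 194, 1153, 194], [1, 14]]

# These cannot form a rectangular integer array, so NumPy can only hold
# them as a 1D object array whose elements are Python lists.
ragged = np.array(sequences, dtype=object)
print(ragged.dtype, ragged.shape)

# After padding to a common length (zeros in front), the data becomes a
# regular 2D integer array, which converts to a tensor without trouble.
maxlen = max(len(s) for s in sequences)
padded = np.array([[0] * (maxlen - len(s)) + s for s in sequences])
print(padded.dtype.kind, padded.shape)
```

The object array is what the "Unsupported object type list" message refers to: each element is a list, not a number.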
With Sequence Padding with Mask in Embedding Layer
Based on your comments and information, it seems that it is possible to use variable-length input sequences; check this and this too. Still, in most cases practitioners prefer to pad the sequences to a uniform length, as it is convenient. Using non-uniform or variable input sequence lengths is something of a special case, similar to wanting variable input image sizes for vision models.
However, here we will add information on padding and on how to mask out the padded values at training time, which technically amounts to variable-length input training. Hopefully that convinces you. Let's first understand what pad_sequences does. In sequence data, it is very common for the training samples to have different lengths. Consider the following inputs:
raw_inputs = [
[711, 632, 71],
[73, 8, 3215, 55, 927],
[83, 91, 1, 645, 1253, 927],
]
These 3 training samples have different lengths: 3, 5, and 6 respectively. What we do next is make them all the same length by adding some value (typically 0 or -1), either at the beginning or at the end of the sequence.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
array([[ 0, 0, 0, 711, 632, 71],
[ 0, 73, 8, 3215, 55, 927],
[ 83, 91, 1, 645, 1253, 927]], dtype=int32)
We can set padding = "post" to add the pad value at the end of the sequence instead. The Keras documentation recommends "post" padding when working with RNN layers, so that the CuDNN implementation of the layers can be used. Note that we set maxlen = 6, which is the longest input sequence length. It does not have to be the longest length, though, since that may get computationally expensive as the dataset grows. We can set it to 5, assuming that our model can learn a feature representation within this length; it is a kind of hyper-parameter. That brings in another parameter, truncating.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)
array([[   0,    0,  711,  632,   71],
       [  73,    8, 3215,   55,  927],
       [  91,    1,  645, 1253,  927]], dtype=int32)
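The padding and truncating rules can be spelled out in a few lines of NumPy. This is a simplified stand-in written for illustration, not the Keras implementation; the function name `pad_and_truncate` is hypothetical, and only the "pre"/"post" options described above are handled.

```python
import numpy as np

def pad_and_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Simplified illustration of padding and truncating integer sequences."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        seq = list(seq)
        # Truncate first: "pre" drops items from the front, "post" from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # Then place the remaining items at the end ("pre") or the start ("post").
        if padding == "pre":
            out[row, maxlen - len(seq):] = seq
        else:
            out[row, :len(seq)] = seq
    return out

raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

pre = pad_and_truncate(raw_inputs, maxlen=5)
post = pad_and_truncate(raw_inputs, maxlen=5, padding="post", truncating="post")
print(pre)
print(post)
```

With "pre" truncation the longest sample loses its first token (83); with "post" truncation it loses its last one (927).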
Okay, now we have a padded input sequence, all inputs are uniform length. Now, we can mask out those additional padded values in training time. We will tell the model some part of the data is padding and those should be ignored. That mechanism is masking. So, it's a way to tell sequence-processing layers that certain timesteps in the input are missing, and thus should be skipped when processing the data. There are three ways to introduce input masks in Keras models:
Add a keras.layers.Masking layer.
Configure a keras.layers.Embedding layer with mask_zero=True.
Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
Here we will show only the Embedding layer approach. It has a parameter called mask_zero, which is False by default. If we set it to True, timesteps whose index is 0 are skipped. In the resulting mask, a False entry indicates that the corresponding timestep should be ignored during processing.
padd_input = tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padd_input)
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padd_input)
print(masked_output._keras_mask)
[[ 0 0 0 711 632 71]
[ 0 73 8 3215 55 927]
[ 83 91 1 645 1253 927]]
tf.Tensor(
[[False False False True True True]
[False True True True True True]
[ True True True True True True]], shape=(3, 6), dtype=bool)
And here is how it's computed in the class Embedding(Layer).
def compute_mask(self, inputs, mask=None):
if not self.mask_zero:
return None
return tf.not_equal(inputs, 0)
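The same rule is easy to check outside TensorFlow. The sketch below uses plain NumPy on the padded array shown earlier; it only mirrors the `not_equal(inputs, 0)` comparison and is not how Keras propagates masks internally.

```python
import numpy as np

padded = np.array([
    [0, 0, 0, 711, 632, 71],
    [0, 73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
])

# True marks a real token, False marks a padded position.
mask = padded != 0
print(mask)

# Counting the True entries per row recovers the original sequence lengths.
lengths = mask.sum(axis=1)
print(lengths)  # [3 5 6]
```

This also makes the catch below concrete: a genuine token with index 0 would be indistinguishable from padding.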
And here is one catch, if we set mask_zero as True, as a consequence, index 0 cannot be used in the vocabulary. According to the doc
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
So, we have to use max_features + 1 at least. Here is a nice explanation on this.
Here is the complete example applying this to your code.
max_features = 20000  # Only consider the top 20k words
maxlen = 350  # Only consider the first 350 words out of `max_list_length(x_train)`
batch_size = 512
# get the data
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)
# check highest sequence length
max_list_length = lambda seqs: max(len(i) for i in seqs)
print(max_list_length(x_train))
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
# (1). padding with value 0 at the end of the sequence - padding="post", value=0.
# (2). truncate 'maxlen' words
# out of `max_list_length(x_train)` at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
maxlen=maxlen, dtype="int32",
padding="post", truncating="post",
value=0.)
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
Your model definition should now be:
model = Sequential()
model.add(Embedding(
input_dim=max_features + 1,
output_dim=32,
mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=256,
epochs=1, validation_split=0.2)
639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036
References
Masking and padding with Keras
Embedding layer, - Pads sequences
Recurrent Neural Networks (RNN) with Keras
Without Sequence Padding
Padding is not a must for variable-length input sequences in sequence modeling. In TensorFlow, a tensor with a variable number of elements along some axis is called ragged, and we use tf.RaggedTensor for ragged data. For example:
# variable length input sequences
ragged_list = [
[0, 1, 2, 3],
[4, 5],
[6, 7, 8],
[9]]
# convert to ragged tensor that handle such variable length inputs
tf.ragged.constant(ragged_list).shape
shape: [4, None]
So, we can use ragged input data in sequence modeling and we no longer need to pad the sequence for uniform input length.
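One way to picture why no padding is needed: a ragged tensor can be stored as a flat list of values plus the positions where each row starts and ends (tf.RaggedTensor exposes these as `values` and `row_splits`). The following pure-Python sketch builds that encoding by hand for the list above; the helper `get_row` is hypothetical and only there for illustration.

```python
ragged_list = [[0, 1, 2, 3], [4, 5], [6, 7, 8], [9]]

# All elements laid out end to end, with no padding values in between.
values = [item for row in ragged_list for item in row]

# row_splits[i] and row_splits[i + 1] delimit row i inside `values`.
row_splits = [0]
for row in ragged_list:
    row_splits.append(row_splits[-1] + len(row))

print(values)      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(row_splits)  # [0, 4, 6, 9, 10]

def get_row(i):
    """Recover row i from the flat encoding."""
    return values[row_splits[i]:row_splits[i + 1]]

print(get_row(2))  # [6, 7, 8]
```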
DataSet
import tensorflow as tf
import warnings, numpy as np
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
# maxlen = 200 # No maximum length but whatever
batch_size = 256
max_features = 20000 # Only consider the top 20k words
(x_train, y_train), (x_test, y_test) = \
tf.keras.datasets.imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
25000 Training sequences
25000 Validation sequences
# quick check
x_train[:3]
array([list([1, 14, 22, 16, 43, 53, ....]),
list([....]),
list([...]),
Convert to Ragged
Now, we convert it to a ragged tensor which deals with variable size sequences.
x_train = tf.ragged.constant(x_train)
x_test = tf.ragged.constant(x_test)
# quick check
x_train[:3]
<tf.RaggedTensor [[1, 14, 22, 16, 43, 53, ...] [...] [...]]
x_train.shape, x_test.shape
(TensorShape([25000, None]), TensorShape([25000, None]))
Model
# Input for variable-length sequences of integers
inputs = tf.keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = tf.keras.layers.Embedding(max_features, 128)(inputs)
x = tf.keras.layers.SimpleRNN(units=32)(x)
# Add a classifier
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, None)] 0
_________________________________________________________________
embedding_1 (Embedding) (None, None, 128) 2560000
_________________________________________________________________
simple_rnn (SimpleRNN) (None, 32) 5152
_________________________________________________________________
dense (Dense) (None, 1) 33
=================================================================
Total params: 2,565,185
Trainable params: 2,565,185
Non-trainable params: 0
_________________________________________________________________
Compile and Train
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(x_train, y_train, batch_size=batch_size, verbose=2,
epochs=10, validation_data=(x_test, y_test))
Epoch 1/10
113s 1s/step - loss: 0.6273 - acc: 0.6295 - val_loss: 0.4188 - val_acc: 0.8206
Epoch 2/10
109s 1s/step - loss: 0.4895 - acc: 0.8041 - val_loss: 0.4703 - val_acc: 0.8040
Epoch 3/10
109s 1s/step - loss: 0.3513 - acc: 0.8661 - val_loss: 0.3996 - val_acc: 0.8337
Epoch 4/10
110s 1s/step - loss: 0.2450 - acc: 0.9105 - val_loss: 0.3945 - val_acc: 0.8420
Epoch 5/10
109s 1s/step - loss: 0.1437 - acc: 0.9559 - val_loss: 0.4085 - val_acc: 0.8422
Epoch 6/10
109s 1s/step - loss: 0.0767 - acc: 0.9807 - val_loss: 0.4310 - val_acc: 0.8429
Epoch 7/10
109s 1s/step - loss: 0.0380 - acc: 0.9932 - val_loss: 0.4784 - val_acc: 0.8437
Epoch 8/10
110s 1s/step - loss: 0.0288 - acc: 0.9946 - val_loss: 0.5039 - val_acc: 0.8564
Epoch 9/10
110s 1s/step - loss: 0.0957 - acc: 0.9615 - val_loss: 0.5687 - val_acc: 0.8575
Epoch 10/10
109s 1s/step - loss: 0.1008 - acc: 0.9637 - val_loss: 0.5166 - val_acc: 0.8550
OS Platform: Linux Centos 7.6
Distribution: Intel Xeon Gold 6152 (22x3.70 GHz);
GPU Model: NVIDIA Tesla V100 32 GB;
Number of nodes/CPU/Cores/GPU: 26/52/1144/104;
TensorFlow installed from (source or binary): official webpage
TensorFlow version (use command below): 2.1.0
Python version: 3.6.8
Description of issue:
While implementing my proposed method using the second style of implementation (see below), I realized that the algorithm behaves strangely. To be more precise, the accuracy decreases and the loss value increases as the number of epochs increases.
So I narrowed down the problem and finally decided to modify some code from the official TensorFlow page to check what is happening. As explained on the official TF v2 webpage, there are two styles of implementation, which I have adopted as follows.
I have modified the code provided in "getting started of TF v2" the link below:
TensorFlow 2 quickstart for beginners
as follows:
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
learning_rate = 1e-4
batch_size = 100
n_classes = 2
n_units = 80
# Generate synthetic data / load data sets
x_in, y_in = make_classification(n_samples=1000, n_features=10, n_informative=4, n_redundant=2, n_repeated=2, n_classes=2, n_clusters_per_class=2, weights=[0.5, 0.5],
flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=42)
x_in = x_in.astype('float32')
y_in = y_in.astype('float32').reshape(-1, 1)
one_hot_encoder = OneHotEncoder(sparse=False)
y_in = one_hot_encoder.fit_transform(y_in)
y_in = y_in.astype('float32')
x_train, x_test, y_train, y_test = train_test_split(x_in, y_in, test_size=0.4, random_state=42, shuffle=True)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=42, shuffle=True)
print("shapes:", x_train.shape, y_train.shape, x_test.shape, y_test.shape, x_val.shape, y_val.shape)
V = x_train.shape[1]
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(n_units, activation='relu', input_shape=(V,)),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(n_classes)
])
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)
The output is as expected, as one can see below:
600/600 [==============================] - 0s 419us/sample - loss: 0.7114 - accuracy: 0.5350
Epoch 2/5
600/600 [==============================] - 0s 42us/sample - loss: 0.6149 - accuracy: 0.6050
Epoch 3/5
600/600 [==============================] - 0s 39us/sample - loss: 0.5450 - accuracy: 0.6925
Epoch 4/5
600/600 [==============================] - 0s 46us/sample - loss: 0.4895 - accuracy: 0.7425
Epoch 5/5
600/600 [==============================] - 0s 40us/sample - loss: 0.4579 - accuracy: 0.7825
test: 200/200 - 0s - loss: 0.4110 - accuracy: 0.8350
To be more precise, the training accuracy increases and the loss value decreases as the number of epochs increases (which is expected and normal).
HOWEVER, the following chunk of code which is adapted from the link below:
TensorFlow 2 quickstart for experts
as follows:
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
learning_rate = 1e-4
batch_size = 100
n_classes = 2
n_units = 80
# Generate synthetic data / load data sets
x_in, y_in = make_classification(n_samples=1000, n_features=10, n_informative=4, n_redundant=2, n_repeated=2, n_classes=2, n_clusters_per_class=2, weights=[0.5, 0.5],flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=42)
x_in = x_in.astype('float32')
y_in = y_in.astype('float32').reshape(-1, 1)
one_hot_encoder = OneHotEncoder(sparse=False)
y_in = one_hot_encoder.fit_transform(y_in)
y_in = y_in.astype('float32')
x_train, x_test, y_train, y_test = train_test_split(x_in, y_in, test_size=0.4, random_state=42, shuffle=True)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=42, shuffle=True)
print("shapes:", x_train.shape, y_train.shape, x_test.shape, y_test.shape, x_val.shape, y_val.shape)
training_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
valid_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size)
testing_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size)
V = x_train.shape[1]
class MyModel(tf.keras.models.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.d1 = tf.keras.layers.Dense(n_units, activation='relu', input_shape=(V,))
        self.d2 = tf.keras.layers.Dropout(0.2)
        self.d3 = tf.keras.layers.Dense(n_classes,)

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        return self.d3(x)
# Create an instance of the model
model = MyModel()
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.BinaryCrossentropy(name='train_accuracy')
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.BinaryCrossentropy(name='test_accuracy')
#tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        # training=True is only needed if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        predictions = model(images,)  # training=True
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(labels, predictions)
#tf.function
def test_step(images, labels):
    # training=False is only needed if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    predictions = model(images,)  # training=False
    t_loss = loss_object(labels, predictions)
    test_loss(t_loss)
    test_accuracy(labels, predictions)
EPOCHS = 5
for epoch in range(EPOCHS):
    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()
    for images, labels in training_dataset:
        train_step(images, labels)
    for test_images, test_labels in testing_dataset:
        test_step(test_images, test_labels)
    template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
    print(template.format(epoch + 1, train_loss.result(), train_accuracy.result(), test_loss.result(), test_accuracy.result()))
This code behaves strangely indeed. Here is its output:
Epoch 1, Loss: 0.7299721837043762, Accuracy: 3.8341376781463623, Test Loss: 0.7290592193603516, Test Accuracy: 3.6925911903381348
Epoch 2, Loss: 0.6725851893424988, Accuracy: 3.1141700744628906, Test Loss: 0.6695905923843384, Test Accuracy: 3.2315549850463867
Epoch 3, Loss: 0.6256862878799438, Accuracy: 2.75959849357605, Test Loss: 0.6216427087783813, Test Accuracy: 2.920461416244507
Epoch 4, Loss: 0.5873140096664429, Accuracy: 2.4249706268310547, Test Loss: 0.5828182101249695, Test Accuracy: 2.575272560119629
Epoch 5, Loss: 0.555053174495697, Accuracy: 2.2128372192382812, Test Loss: 0.5501811504364014, Test Accuracy: 2.264410972595215
As one can see, not only are the accuracy values strange, but they also decrease instead of increasing as the number of epochs increases.
Could you please explain what is happening here?
As pointed out in the comments, I made a mistake in choosing the evaluation metric: I should have used BinaryAccuracy instead of BinaryCrossentropy.
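The difference between the two quantities can be illustrated with NumPy. This is a sketch under stated assumptions, not the Keras metric code: the two helper functions are hand-written approximations, and the labels and logits are made-up values.

```python
import numpy as np

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean cross-entropy; a loss-like value that is not bounded by 1."""
    probs = np.clip(probs, eps, 1 - eps)
    return float(np.mean(-(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))))

def binary_accuracy(labels, probs, threshold=0.5):
    """Fraction of entries on the correct side of the threshold; always in [0, 1]."""
    return float(np.mean((probs > threshold) == (labels > 0.5)))

# One-hot labels and raw model outputs (logits) for four samples.
labels = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
logits = np.array([[2.0, -1.5], [-0.5, 1.0], [0.3, -0.2], [1.2, -0.8]])
probs = 1 / (1 + np.exp(-logits))  # sigmoid turns logits into probabilities

acc = binary_accuracy(labels, probs)
bce = binary_crossentropy(labels, probs)
# Feeding raw logits where probabilities are expected inflates the value further.
bce_on_logits = binary_crossentropy(labels, logits)

print(acc, bce, bce_on_logits)
```

A cross-entropy value going down over the epochs is therefore good news, which matches the "accuracy" column in the output above. When switching to BinaryAccuracy on a model that outputs logits, it is probably also worth either applying a sigmoid first or setting the metric's `threshold` to 0, since the default threshold of 0.5 is meant for probabilities.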
Moreover, it is better to edit the call method in the advanced version as follows:
def call(self, x, training=False):
    x = self.d1(x)
    if training:
        x = self.d2(x, training=training)
    return self.d3(x)
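Why the training flag matters for Dropout can be seen from a small NumPy sketch of inverted dropout. This is an illustrative re-implementation under the usual definition (drop with probability `rate`, scale the survivors by 1 / (1 - rate)), not the Keras layer itself; the function name and the random generator argument are assumptions.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: active only when training is True."""
    if not training:
        # At inference time the input passes through unchanged.
        return x
    keep = rng.random(x.shape) >= rate
    # Surviving units are scaled up so the expected activation stays the same.
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones((2, 5))

train_out = dropout(x, rate=0.2, training=True, rng=rng)
infer_out = dropout(x, rate=0.2, training=False, rng=rng)

print(train_out)  # entries are either 0 or 1.25
print(infer_out)  # identical to x
```

If the flag is never passed, the layer cannot tell the two phases apart, which is why the train and test steps should call the model with training=True and training=False respectively.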
I have a time-series data set with 1 feature that represents multiple games. The goal is to classify each game as a win or loss - binary classification. Each game has 61 rows, and the feature has been scaled to be between 0 and 1:
x_train = array([[0.55340617],
[0.54956823],
[0.54588505],
...,
[0.87483364],
[0.8956947 ],
[0.90343248]])
y_train = array([0, 0, 0, ..., 0, 0, 0])
The problem should be quite easy, and I was expecting around 70% accuracy based on other models.
I'm trying to train an LSTM with the data. I think I should be resetting the state on every game, and so the batch shape is defined by 61 timesteps, and 1 feature:
timesteps = 61
n_features = 1
# Reshape data for LSTM
x_train = x_train.reshape(len(x_train)//timesteps, timesteps, n_features)
# Get class of each game
y_train = y_train[0: len(y_train): timesteps]
model = Sequential()
# Hidden layer
n_neurons = 8
model.add(LSTM(n_neurons,
input_shape=(timesteps, n_features),
stateful=False))
model.add(Dense(1, activation='softmax'))
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
model.fit(x_train,
y_train,
epochs=3,
batch_size=1)
But when I train the model, the accuracy remains constant:
Epoch 1/3
301/301 [==============================] - 7s 23ms/step - loss: 8.2524 - accuracy: 0.4618
Epoch 2/3
301/301 [==============================] - 6s 21ms/step - loss: 8.2524 - accuracy: 0.4618
Epoch 3/3
301/301 [==============================] - 6s 21ms/step - loss: 8.2524 - accuracy: 0.4618
I have tried switching the optimiser to 'RMSprop', but I get exactly the same result. Could the problem lie with the batch shape?
Any help would be greatly appreciated!
EDIT: Fixed some typos in the code. Sorry!
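One thing worth checking, offered as a likely cause rather than a confirmed diagnosis: a softmax over a single output unit always returns 1.0, whatever the input, so the prediction never changes and neither do the loss and accuracy. The pure-Python sketch below shows this and contrasts it with a sigmoid; the function names and sample inputs are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, the usual activation for a single binary output."""
    return 1.0 / (1.0 + math.exp(-z))


sample_logits = (-5.0, 0.0, 3.2)
# With one unit, softmax normalises a single value by itself.
single_unit_softmax = [softmax([z])[0] for z in sample_logits]
single_unit_sigmoid = [sigmoid(z) for z in sample_logits]
print(single_unit_softmax)
print(single_unit_sigmoid)
```

If this is the issue, using activation='sigmoid' on the Dense(1) layer together with binary_crossentropy would be the usual pairing.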
I have a CSV file with two columns:
category, description
There are 1030 categories in the file and only about 12,600 lines.
I need to train a text classification model on this data. I am using Keras with an LSTM model.
I found an article describing how to do binary classification and slightly modified it to handle several categories.
My code:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from numpy import array
from keras.preprocessing.text import one_hot
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing import sequence
import keras
df = pd.read_csv('/tmp/input_data.csv')
#one hot encode your documents
# integer encode the documents
vocab_size = 2000
encoded_docs = [one_hot(d, vocab_size) for d in df['description']]
def load_data_from_arrays(strings, labels, train_test_split=0.9):
    data_size = len(strings)
    test_size = int(data_size - round(data_size * train_test_split))
    print("Test size: {}".format(test_size))
    print("\nTraining set:")
    x_train = strings[test_size:]
    print("\t - x_train: {}".format(len(x_train)))
    y_train = labels[test_size:]
    print("\t - y_train: {}".format(len(y_train)))
    print("\nTesting set:")
    x_test = strings[:test_size]
    print("\t - x_test: {}".format(len(x_test)))
    y_test = labels[:test_size]
    print("\t - y_test: {}".format(len(y_test)))
    return x_train, y_train, x_test, y_test
encoder = LabelEncoder()
categories = encoder.fit_transform(df['category'])
num_classes = np.max(categories) + 1
print('Categories count: {}'.format(num_classes))
#Categories count: 1030
X_train, y_train, x_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)
# Truncate and pad the review sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
# Build the model
embedding_vector_length = 32
top_words = 10000
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_8 (Embedding)      (None, 500, 32)           320000
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               53200
_________________________________________________________________
dense_8 (Dense)              (None, 1030)              104030
=================================================================
Total params: 477,230
Trainable params: 477,230
Non-trainable params: 0
_________________________________________________________________
None
#Train the model
model.fit(X_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64)
Train on 10118 samples, validate on 2530 samples
Epoch 1/5
10118/10118 [==============================] - 60s 6ms/step - loss: 6.5086 - acc: 0.0019 - val_loss: 10.0911 - val_acc: 0.0000e+00
Epoch 2/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3281 - acc: 0.0028 - val_loss: 10.8270 - val_acc: 0.0000e+00
Epoch 3/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3120 - acc: 0.0024 - val_loss: 11.0078 - val_acc: 0.0000e+00
Epoch 4/5
10118/10118 [==============================] - 64s 6ms/step - loss: 6.2891 - acc: 0.0030 - val_loss: 11.8264 - val_acc: 0.0000e+00
Epoch 5/5
10118/10118 [==============================] - 69s 7ms/step - loss: 6.2559 - acc: 0.0032 - val_loss: 12.1625 - val_acc: 0.0000e+00
#Evaluate the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 0.00%
What mistake did I make when preparing the data?
Why is the accuracy always 0?
I have put together end-to-end code, with some additions of my own, and tested it on this data. You can use it with your data with little or no change, since I have removed the specifics and made it generic. At the end, I highlight what I added on top of the code you provided above.
Code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from nltk.tokenize import word_tokenize
def load_data_from_arrays(strings, labels, train_test_split=0.9):
    data_size = len(strings)
    test_size = int(data_size - round(data_size * train_test_split))
    print("Test size: {}".format(test_size))
    print("\nTraining set:")
    x_train = strings[test_size:]
    print("\t - x_train: {}".format(len(x_train)))
    y_train = labels[test_size:]
    print("\t - y_train: {}".format(len(y_train)))
    print("\nTesting set:")
    x_test = strings[:test_size]
    print("\t - x_test: {}".format(len(x_test)))
    y_test = labels[:test_size]
    print("\t - y_test: {}".format(len(y_test)))
    return x_train, y_train, x_test, y_test
# estimating the vocab length with the help of nltk
def get_vocab_length(strings):
    vocab = []
    for sent in strings:
        words = word_tokenize(sent)
        vocab.extend(words)
    vocab = list(set(vocab))
    vocab_length = len(vocab)
    return vocab_length
def clean_text(sent):
    # <your cleaning code here>
    # clean func 1
    # clean func 2
    # ...
    # clean func n
    return sent
# load input data
df = pd.read_csv('/tmp/input_data.csv')
strings = df['description'].values
labels = df['category'].values
clean_strings = [clean_text(sent) for sent in strings]
vocab_length = get_vocab_length(clean_strings)
# create onehot encodings of strings
encoded_docs = [one_hot(sent, vocab_length) for sent in clean_strings]
# create onehot encodings of labels
ohe = OneHotEncoder()
categories = ohe.fit_transform(labels.reshape(-1,1)).toarray()
# split data
X_train, y_train, X_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)
# assuming max input to be not more than 512 words
max_input_len = 512
# padding data
X_train = pad_sequences(X_train, maxlen=max_input_len, padding= 'post')
X_test = pad_sequences(X_test, maxlen=max_input_len, padding= 'post')
# setting embedding vector length
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_length, embedding_vector_length, input_length=max_input_len, name= 'embedding') )
model.add(Flatten())
# one output unit per category (number of columns in the one-hot labels)
model.add(Dense(categories.shape[1], activation= 'softmax'))
model.compile('adam', loss= 'categorical_crossentropy', metrics= ['accuracy'])
model.summary()
# training the model
model.fit(X_train, y_train, epochs= 10, batch_size= 128, validation_split= 0.2, verbose= 1)
# evaluating the model
score = model.evaluate(X_test, y_test, verbose=0)
print("Test Loss:", score[0])
print("Test Acc:", score[1])
Additional areas I have worked on
1. Text Cleaning
I created a function to clean the text. Cleaning is extremely important because it removes unnecessary noise from the data; note that this step depends entirely on the type of data you have. To simplify things, I added a clean_text function to the code above where you can place your cleaning code. It should take raw text and return clean text. Some libraries you may want to look into are re, string, and emoji.
2. Estimating Vocab Size
If you have enough data, it is better to estimate the vocab size than to pass an arbitrary number to the Keras one_hot function. I created a basic get_vocab_length function using nltk's word_tokenize. You can use it as is or enhance it to suit your data.
What Else?
You can work further on hyperparameter tuning and a few different neural network designs.
Final Words
It may still not work, since the outcome depends entirely on the quality and amount of your data. If the data is of poor quality or there is very little of it, there is a good chance you will not get good results even after trying everything.
In that case I would suggest trying transfer learning with pre-trained models such as BERT or RoBERTa. HuggingFace provides good support for state-of-the-art pre-trained models; you can get started at the following links:
https://huggingface.co/docs/transformers/index#supported-models
https://towardsdatascience.com/text-classification-with-hugging-face-transformers-in-tensorflow-2-without-tears-ee50e4f3e7ed
https://towardsdatascience.com/an-introduction-to-transformers-and-hugging-face-13052ec9d72d
I guess that your vocab_size is way too low. If you are dealing with ordinary text, try 10,000 to 100,000 as a starting point.
one_hot uses the hashing trick: every word is hashed and projected into an index space of size 2000. This does not merely limit your dictionary to 2000 words; every word in the text is mapped into this space, which causes many collisions, where different words get the same index and are treated as identical by the LSTM.
Furthermore, you should take a look at the transformed text, just to get an understanding of what happens here. To do so, build a reverse lookup and transform all the indices back.
As a further improvement, you can preprocess the text with common techniques such as stemming and normalization and use an explicit vocabulary, or discard the bag-of-words approach and use word embeddings.
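The collision effect described above can be sketched in plain Python. This is only an illustration under assumptions: it uses an md5-based hash folded into the range [1, n - 1], which mirrors the general idea of the hashing trick but is not guaranteed to reproduce the indices Keras returns. The names hashed_index and collision_report, and the synthetic vocabulary, are hypothetical.

```python
import hashlib


def hashed_index(word, n):
    """Map a word to an index in [1, n - 1] using a stable md5 hash."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def collision_report(words, n):
    """Group distinct words by hashed index and list the shared indices."""
    buckets = {}
    for word in sorted(set(w.lower() for w in words)):
        buckets.setdefault(hashed_index(word, n), []).append(word)
    collisions = {i: ws for i, ws in buckets.items() if len(ws) > 1}
    return buckets, collisions


# 50 distinct made-up words squeezed into at most 19 indices.
vocabulary = [f"word{i}" for i in range(50)]
buckets, collisions = collision_report(vocabulary, n=20)
print(len(vocabulary), "words ->", len(buckets), "distinct indices")
print(len(collisions), "indices are shared by more than one word")
```

Whenever the number of distinct words exceeds the number of available indices, collisions are unavoidable, which is why a small vocab_size merges unrelated words.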
from keras.preprocessing.text import one_hot, Tokenizer, hashing_trick
text1 = 'I love you'
text2 = 'you love I'
print('one_hot: ')
print(one_hot(text1, n=20))
print(one_hot(text2, n=20))
print('--------------------------------------')
print('Tokenizer: ')
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])
print(tokenizer.word_index)
print(tokenizer.index_word)
print('--------------------------------------')
print('hashing_trick: ')
print(hashing_trick(text1, n=20))
print(hashing_trick(text2, n=20))
print('--------------------------------------')
out:
one_hot:
[14, 7, 14]
[14, 7, 14]
--------------------------------------
Tokenizer:
{'i': 1, 'love': 2, 'you': 3}
{1: 'i', 2: 'love', 3: 'you'}
--------------------------------------
hashing_trick:
[14, 7, 14]
[14, 7, 14]
--------------------------------------
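Run it several times and you will see that the results of one_hot and hashing_trick are not unique: different words can land on the same index (above, 'I' and 'you' both map to 14), so text1 and text2 get the same encoding.
You should use Tokenizer to convert the text instead.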
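To show why a fitted vocabulary avoids this problem, here is a minimal pure-Python sketch of the idea behind a tokenizer: each distinct word gets its own index, and a reverse lookup turns indices back into words. It is a simplified stand-in with whitespace splitting and lowercasing only, not the Keras Tokenizer; build_word_index, encode, and decode are hypothetical helper names.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices from 1 upward, most frequent words first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def encode(text, word_index):
    """Convert a text into a list of word indices, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def decode(indices, index_word):
    """Reverse lookup: convert indices back into a space-joined string."""
    return " ".join(index_word[i] for i in indices)


texts = ["I love you", "you love I"]
word_index = build_word_index(texts)
index_word = {i: w for w, i in word_index.items()}
encoded = [encode(t, word_index) for t in texts]
print(word_index)
print(encoded)
print(decode(encoded[1], index_word))
```

Unlike the hashed output above, the two sentences now receive different encodings, and every index maps back to exactly one word.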
I'm new to TensorFlow, Keras, and the Dataset API. Can anyone help me understand why the following code doesn't work?
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from tensorflow.python.data.ops import dataset_ops
from tensorflow.python.data.ops import iterator_ops
from tensorflow.python.keras.utils import multi_gpu_model
from tensorflow.python.keras import backend as K
data = np.random.random((1000,32))
labels = np.random.random((1000,10))
dataset = tf.data.Dataset.from_tensor_slices((data,labels))
print( dataset)
print( dataset.output_types)
print( dataset.output_shapes)
dataset.batch(10)
dataset.repeat(100)
inputs = keras.Input(shape=(32,)) # Returns a placeholder tensor
# A layer instance is callable on a tensor, and returns a tensor.
x = keras.layers.Dense(64, activation='relu')(inputs)
x = keras.layers.Dense(64, activation='relu')(x)
predictions = keras.layers.Dense(10, activation='softmax')(x)
# Instantiate the model given inputs and outputs.
model = keras.Model(inputs=inputs, outputs=predictions)
# The compile step specifies the training configuration.
model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
loss='categorical_crossentropy',
metrics=['accuracy'])
# Trains for 5 epochs
model.fit(dataset, epochs=5, steps_per_epoch=100)
It failed with the following error:
model.fit(x=dataset, y=None, epochs=5, steps_per_epoch=100)
File "/home/wuxinyu/pyEnv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 1510, in fit
validation_split=validation_split)
File "/home/wuxinyu/pyEnv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 994, in _standardize_user_data
class_weight, batch_size)
File "/home/wuxinyu/pyEnv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 1113, in _standardize_weights
exception_prefix='input')
File "/home/wuxinyu/pyEnv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_utils.py", line 325, in standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_1 to have 2 dimensions, but got array with shape (32,)
According to tf.keras guide, I should be able to directly pass the dataset to model.fit, as this example shows:
Input tf.data datasets
Use the Datasets API to scale to large datasets or multi-device training. Pass a tf.data.Dataset instance to the fit method:
# Instantiates a toy dataset instance:
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
dataset = dataset.batch(32)
dataset = dataset.repeat()
Don't forget to specify steps_per_epoch when calling fit on a dataset.
model.fit(dataset, epochs=10, steps_per_epoch=30)
Here, the fit method uses the steps_per_epoch argument—this is the number of training steps the model runs before it moves to the next epoch. Since the Dataset yields batches of data, this snippet does not require a batch_size.
Datasets can also be used for validation:
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
dataset = dataset.batch(32).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((val_data, val_labels))
val_dataset = val_dataset.batch(32).repeat()
model.fit(dataset, epochs=10, steps_per_epoch=30,
validation_data=val_dataset,
validation_steps=3)
What's the problem with my code, and what's the correct way of doing it?
To your original question as to why you're getting the error:
Error when checking input: expected input_1 to have 2 dimensions, but got array with shape (32,)
Your code breaks because you did not assign the result of .batch() back to the dataset variable, like so:
dataset = dataset.batch(10)
You simply called dataset.batch(10) and discarded the result. The same applies to dataset.repeat(100).
Without batch(), the output tensors are not batched, i.e. each element has shape (32,) instead of (batch_size, 32), which is (10, 32) here.
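The point about reassignment can be checked directly. The sketch below assumes TensorFlow 2.x, where datasets expose element_spec (the question uses the older output_shapes attribute); the variable names are illustrative. It compares the element shape before and after the result of batch() is actually kept.

```python
import numpy as np
import tensorflow as tf

data = np.random.random((1000, 32))
labels = np.random.random((1000, 10))
dataset = tf.data.Dataset.from_tensor_slices((data, labels))

# batch() returns a new dataset; calling it without assignment changes nothing.
dataset.batch(10)
unbatched_shape = dataset.element_spec[0].shape

# Keeping the returned dataset adds the leading batch dimension.
dataset = dataset.batch(10)
batched_shape = dataset.element_spec[0].shape

print(unbatched_shape)  # feature shape of a single, unbatched element
print(batched_shape)    # feature shape once batching is applied
```

The batch dimension is reported as unknown (None) because the last batch may be smaller unless drop_remainder=True is passed.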
You have not defined an iterator, which is why you get the error.
from tensorflow.keras.layers import Input, Dense
data = np.random.random((1000,32))
labels = np.random.random((1000,10))
dataset = tf.data.Dataset.from_tensor_slices((data,labels))
dataset = dataset.batch(10).repeat()
inputs = Input(shape=(32,)) # Returns a placeholder tensor
# A layer instance is callable on a tensor, and returns a tensor.
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
# Instantiate the model given inputs and outputs.
model = keras.Model(inputs=inputs, outputs=predictions)
# The compile step specifies the training configuration.
model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
loss='categorical_crossentropy',
metrics=['accuracy'])
# Trains for 5 epochs
model.fit(dataset.make_one_shot_iterator(), epochs=5, steps_per_epoch=100)
Epoch 1/5
100/100 [==============================] - 1s 8ms/step - loss: 11.5787 - acc: 0.1010
Epoch 2/5
100/100 [==============================] - 0s 4ms/step - loss: 11.4846 - acc: 0.0990
Epoch 3/5
100/100 [==============================] - 0s 4ms/step - loss: 11.4690 - acc: 0.1270
Epoch 4/5
100/100 [==============================] - 0s 4ms/step - loss: 11.4611 - acc: 0.1300
Epoch 5/5
100/100 [==============================] - 0s 4ms/step - loss: 11.4546 - acc: 0.1360
This is the result on my system.