Here is my code to train an RNN with LSTM nodes:
# LSTM RNN with dropout for sequence classification
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import pickle, numpy, pandas as pd
###################################### CONSTANTS #############################################
SEED = 7 # Fixes random seed for reproducibility.
URL = 'ibcData.tsv' # Specified dataset to gather data from.
SEPARATOR = '\t' # Separator the dataset uses to divide data.
RANDOM_STATE = 1 # Pseudo-random number generator state used for random sampling.
TOP_WORDS = 5000 # Most used words in the dataset.
MAX_REVIEW_LENGTH = 500 # Length of each sentence being sent in (necessary).
EMBEDDING_VECTOR_LENGTH = 32 # The Embedding layer will use 32-length vectors to
                             # represent each word.
BATCH_SIZE = 64 # Takes 64 sentences at a time and continually retrains RNN.
NUMBER_OF_EPOCHS = 3 # Fits RNN to more accurately guess the data's political bias.
DROPOUT = 0.2 # Helps slow down overfitting of data (slower convergence rate)
RECURRENT_DROPOUT = 0.2 # Helps slow down overfitting of data when recurrently training
##############################################################################################
# fix random seed for reproducibility
numpy.random.seed(SEED)
readData = pd.read_csv(URL, header=None, names=['label', 'message'], sep=SEPARATOR)
# convert label to a numerical variable
readData['label_num'] = readData.label.map({'Liberal' : 0, 'Neutral': 0.5, 'Conservative' : 1})
X = readData.message # Contains the dataset's actual sentences that were labeled
Y = readData.label_num # Either 0.0, 0.5, or 1.0 depending on label mapped to
# load the dataset into training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=RANDOM_STATE)
# truncate and pad input sequences
for sentence in X_train:
    sentence.zfill(MAX_REVIEW_LENGTH)
for sentence in X_test:
    sentence.zfill(MAX_REVIEW_LENGTH)
# create the model
model = Sequential()
model.add(Embedding(TOP_WORDS, EMBEDDING_VECTOR_LENGTH, input_length=MAX_REVIEW_LENGTH))
model.add(LSTM(100, recurrent_dropout=RECURRENT_DROPOUT, dropout=DROPOUT)) # Dropouts help prevent overfitting
model.add(Dense(2, activation='sigmoid')) # Layers deal with a 2D tensor, and output a 2D tensor
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=NUMBER_OF_EPOCHS, batch_size=BATCH_SIZE)
# Final evaluation of the model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
It trains on a .tsv file that has data like this:
"Liberal","Forcing middle-class workers to bear a greater share of the cost of government weakens their support for needed investments and stirs resentment toward those who depend on public services the most ."
"Liberal", "Because it would not be worthwhile to bring a case for $ 30.22 , the arbitration clause would , as a practical matter , deny the Concepcions any relief and , more important , eliminate a class action that might punish AT&T for its pattern of fraudulent behavior ."
When I run it, I get the following console output, and neither I nor the professors helping me with this research have any idea how to fix it:
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 500, 32) 160000
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_1 (Dense) (None, 2) 202
=================================================================
Total params: 213,402
Trainable params: 213,402
Non-trainable params: 0
_________________________________________________________________
None
Traceback (most recent call last):
  File "LSTM-RNN.py", line 55, in <module>
    model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=NUMBER_OF_EPOCHS, batch_size=BATCH_SIZE)
  File "C:\Users\Hydur\Anaconda3\lib\site-packages\keras\keras\models.py", line 871, in fit
    initial_epoch=initial_epoch)
  File "C:\Users\Hydur\Anaconda3\lib\site-packages\keras\keras\engine\training.py", line 1525, in fit
    batch_size=batch_size)
  File "C:\Users\Hydur\Anaconda3\lib\site-packages\keras\keras\engine\training.py", line 1379, in _standardize_user_data
    exception_prefix='input')
  File "C:\Users\Hydur\Anaconda3\lib\site-packages\keras\keras\engine\training.py", line 144, in _standardize_input_data
    str(array.shape))
ValueError: Error when checking input: expected embedding_1_input to have shape (None, 500) but got array with shape (3244, 1)
The main problem seems to be that X contained raw strings, while the Embedding layer expected data already encoded numerically. The Keras text preprocessing utilities will take care of that:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=MAX_REVIEW_LENGTH)
tokenizer.fit_on_texts(readData.message)
X = numpy.array(tokenizer.texts_to_matrix(readData.message)) # shape (None, 500)
This will encode each message as 500 integers, with a unique integer assigned to each word.
Once that was fixed, I also got an error on the "dense_1" layer. The last layer in your network was specified to have two output nodes, but the loss function you used (binary_crossentropy) expects a single column coded as 0/1. I edited it so that layer had only one output node so the process would complete, but I doubt using 0, 0.5, 1 with binary crossentropy will do what you want. I think you'd probably be better off with a 3-level one-hot encoding and categorical_crossentropy, but that's out of scope for this question.
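For reference, a minimal sketch of that 3-class alternative (untested on this data; it assumes the same readData frame and model as above) could look like:
from keras.utils import to_categorical
# map labels to integer classes instead of 0 / 0.5 / 1
readData['label_num'] = readData.label.map({'Liberal': 0, 'Neutral': 1, 'Conservative': 2})
Y = to_categorical(readData.label_num, num_classes=3) # shape (n_samples, 3)
# ...and the final layer plus loss would become:
model.add(Dense(3, activation='softmax')) # one output node per class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])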
Here is the full edited script that ran for me. I was only able to run it on the two observations you provided but it did complete.
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import os, pickle, numpy, pandas as pd
from keras.preprocessing.text import Tokenizer
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
################################### CONSTANTS ################################################
SEED = 7 # Fixes random seed for reproducibility.
URL = 'ibcData.tsv' # Specified dataset to gather data from.
SEPARATOR = '\t' # Separator the dataset uses to divide data.
RANDOM_STATE = 1 # Pseudo-random number generator state used for random sampling.
TOP_WORDS = 5000 # Most used words in the dataset.
MAX_REVIEW_LENGTH = 500 # Length of each sentence being sent in (necessary).
EMBEDDING_VECTOR_LENGTH = 32 # The Embedding layer will use 32-length vectors to
                             # represent each word.
BATCH_SIZE = 64 # Takes 64 sentences at a time and continually retrains RNN.
NUMBER_OF_EPOCHS = 3 # Fits RNN to more accurately guess the data's political bias.
# fix random seed for reproducibility
numpy.random.seed(SEED)
readData = pd.read_csv(URL, header=None, names=['label', 'message'], sep=SEPARATOR)
# tokenize the messages, then convert the label to a numerical variable
tokenizer = Tokenizer(num_words=MAX_REVIEW_LENGTH)
tokenizer.fit_on_texts(readData.message)
X = numpy.array(tokenizer.texts_to_matrix(readData.message)) # shape (None, 500)
readData['label_num'] = readData.label.map({'Liberal' : 0, 'Neutral': 0.5, 'Conservative' : 1})
Y = numpy.array(readData.label_num) # Either 0.0, 0.5, or 1.0 depending on label mapped to
# load the dataset into training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=RANDOM_STATE)
# create the model
model = Sequential()
model.add(Embedding(TOP_WORDS, EMBEDDING_VECTOR_LENGTH, input_length=MAX_REVIEW_LENGTH))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid')) # Layers deal with a 2D tensor, and output a 2D tensor
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=NUMBER_OF_EPOCHS, batch_size=BATCH_SIZE)
# Final evaluation of the model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
I then received the following output:
Using TensorFlow backend.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 500, 32) 160000
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_1 (Dense) (None, 1) 101
=================================================================
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 1 samples, validate on 1 samples
Epoch 1/3
1/1 [==============================] - 0s - loss: 0.6953 - acc: 0.0000e+00 - val_loss: 0.6814 - val_acc: 1.0000
Epoch 2/3
1/1 [==============================] - 0s - loss: 0.6814 - acc: 1.0000 - val_loss: 0.6670 - val_acc: 1.0000
Epoch 3/3
1/1 [==============================] - 0s - loss: 0.6670 - acc: 1.0000 - val_loss: 0.6516 - val_acc: 1.0000
Hope that helps.
I am trying to get the size of the DistilBERT model trained on the ENRON dataset.
I believe that I can get an approximation of the size of the model based on the number of parameters at each layer, its weights, and biases. But what is the way to get it? How can I calculate the size of the model?
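For example, I assume a back-of-the-envelope calculation like the following (parameter count times 4 bytes per float32 weight) would give a rough in-memory size, but I am not sure this is the right way:
n_params = model.count_params() # 66,955,010 for the model below
approx_mb = n_params * 4 / (1024 ** 2) # roughly 255 MB of float32 weights
print("approx. %.0f MB" % approx_mb)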
Below I have described the model in detail along with the code.
Here is my model:
import datetime
import pandas as pd
import tensorflow as tf
import transformers
from transformers import DistilBertTokenizer
from transformers import TFAutoModelForSequenceClassification
pd.set_option('display.max_colwidth', None)
MODEL_NAME = 'distilbert-base-uncased'
# loading the data and processing it.
df = pd.read_csv("enron6dataset.csv", error_bad_lines=False)
train_df = df.sample(frac=0.8, random_state=25)
test_df = df.drop(train_df.index)
X_train = train_df.email
X_test = test_df.email
y_train = train_df.label
y_test = test_df.label
y_train_encoded = tf.one_hot(y_train.values, 2)
y_test_encoded = tf.one_hot(y_test.values, 2)
# tokenization and model compilation
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(list(X_train.values),
truncation=True,
padding=True)
test_encodings = tokenizer(list(X_test.values),
truncation=True,
padding=True)
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),list(y_train_encoded )))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),list(y_test_encoded )))
test_dataset2 = test_dataset.shuffle(buffer_size=1024).take(1000).batch(16)
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
# fine-tuning the model
# note: tensorboard_callback was not defined in the snippet; a standard definition would be
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
model.fit(train_dataset.shuffle(len(X_train)).batch(16),
          epochs=2,
          batch_size=16, callbacks=[tensorboard_callback])
#accuracy
results = model.evaluate(test_dataset2)
print("test loss, test acc:", results)
I am getting the following accuracy -
63/63 [==============================] - 2s 32ms/step - loss: 0.0135 - accuracy: 0.9950
test loss, test acc: [0.01349691767245531, 0.9950000047683716]
model.summary() gives the following result -
Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_19 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________
Firstly, why is it showing "multiple" in the Output Shape column?
The following are the layers of the model (model.layers):
[<transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertMainLayer at 0x2b268cbbfb20>,
<tensorflow.python.keras.layers.core.Dense at 0x2b268cbdd670>,
<tensorflow.python.keras.layers.core.Dense at 0x2b268cbddfa0>,
<tensorflow.python.keras.layers.core.Dropout at 0x2b268cbdf730>]
I have a 3-dimensional dataset of audio files where X.shape is (329, 20, 85). I want to have a simple bare-bones model running, so please don't nitpick and address only the issue at hand. Here is the code:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(32, return_sequences=True, stateful=False, input_shape = (20,85,1)))
model.add(tf.keras.layers.LSTM(20))
model.add(tf.keras.layers.Dense(nb_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()
print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, epochs=50, validation_data=(X_test, y_test))
But then I had the error mentioned in the title:
ValueError: Shapes (None, 1) and (None, 3) are incompatible
Here is the model.summary()
Model: "sequential_13"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_21 (LSTM) (None, 20, 32) 15104
_________________________________________________________________
lstm_22 (LSTM) (None, 20) 4240
_________________________________________________________________
dense_8 (Dense) (None, 3) 63
=================================================================
Total params: 19,407
Trainable params: 19,407
Non-trainable params: 0
_________________________________________________________________
Train...
For this, I followed this post and updated TensorFlow to the latest version, but the issue persists. This post is completely unrelated and highly unreliable. This post, although somewhat related, has been unanswered for a while now.
Update 1.0:
I strongly think the problem has something to do with the final Dense layer where I pass nb_classes as 3, since I am classifying for 3 categories in y.
So I changed the Dense layer's nb_classes to 1, which let the model run and gave me this output, which I am positive is wrong.
Train...
9/9 [==============================] - 2s 177ms/step - loss: 0.0000e+00 - accuracy: 0.1520 - val_loss: 0.0000e+00 - val_accuracy: 0.3418
<tensorflow.python.keras.callbacks.History at 0x7f50f1dcebe0>
Update 2.0:
I one-hot encoded the ys and resolved the shape issue. But the above output ending in <tensorflow.python.keras.callbacks.History at 0x7f50f1dcebe0> persists. Any help with this? Or should I post a new question for this? Thanks for all the help.
How should I proceed, or what should I be changing?
The first problem is with the LSTM input_shape: input_shape=(20, 85, 1).
From the doc: https://keras.io/layers/recurrent/
An LSTM layer expects a 3D tensor with shape (batch_size, timesteps, input_dim), so the per-sample input_shape here should be (20, 85), not (20, 85, 1).
model.add(tf.keras.layers.Dense(nb_classes, activation='softmax')) - this suggests you're doing a multi-class classification.
So your y_train and y_test have to be one-hot encoded. That means they must have dimension (number_of_samples, 3), where 3 denotes the number of classes.
You need to apply tensorflow.keras.utils.to_categorical to them.
y_train = to_categorical(y_train, 3)
y_test = to_categorical(y_test, 3)
ref: https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical
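Putting both fixes together, a minimal sketch (assuming the (329, 20, 85) data and 3 classes described above) would be:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
# each sample is 20 timesteps of 85 features, so input_shape=(20, 85)
model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(20, 85)),
    tf.keras.layers.LSTM(20),
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# one-hot encode the integer labels so they match the (None, 3) output
y_train = to_categorical(y_train, 3)
y_test = to_categorical(y_test, 3)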
tf.keras.callbacks.History() - this callback is automatically applied to every Keras model. The History object gets returned by the fit method of models.
ref: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/History
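So the <tensorflow.python.keras.callbacks.History at 0x...> line from Update 2.0 is not an error; it is just the notebook echoing fit's return value. You can capture and inspect it, for example:
history = model.fit(X_train, y_train, batch_size=batch_size, epochs=50,
                    validation_data=(X_test, y_test))
print(history.history['accuracy']) # per-epoch training accuracy
print(history.history['val_accuracy']) # per-epoch validation accuracy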
Check that the last (output) Dense layer has the same number of units as the number of target classes in the training dataset. I made a similar mistake while training on my dataset, and correcting it helped me.
Another thing to check is whether your labels are one-hot encoded or integers only. See this post: https://www.kaggle.com/general/197993
The error arose because 'categorical_crossentropy' works on one-hot encoded targets, while 'sparse_categorical_crossentropy' works on integer targets.
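For example, if you keep the labels as plain integers (0, 1, 2) instead of one-hot vectors, a sketch of the alternative compile would be:
# integer targets -> sparse loss; no one-hot encoding needed
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])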
model.add(tf.keras.layers.Dense(nb_classes, activation='softmax'))
The nb_classes should be the same as y_train.shape[1].
The issue was using the wrong variables after one-hot encoding the targets of the classification problem.
trainY = tf.keras.utils.to_categorical(y_train, num_classes=9)
testY = tf.keras.utils.to_categorical(y_test, num_classes=9)
Modeling was done with y_train and y_test as below:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50,
batch_size = 128)
Correction: fit on the encoded variables instead, and it worked as expected:
model.fit(X_train, trainY, validation_data=(X_test, testY), epochs=50,
          batch_size = 128)
I'm very new to machine learning, so this question might sound stupid.
I'm following a tutorial on text classification but I'm facing an error that I have no idea how to solve.
This is the code I have (it is basically what is found in the tutorial):
import pandas as pd
filepath_dict = {'yelp': 'data/yelp_labelled.txt',
'amazon': 'data/amazon_cells_labelled.txt',
'imdb': 'data/imdb_labelled.txt'}
df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source
    df_list.append(df)
df = pd.concat(df_list)
print(df.iloc[0:4])
from sklearn.feature_extraction.text import CountVectorizer
df_yelp = df[df['source'] == 'yelp']
sentences = df_yelp['sentence'].values
y = df_yelp['label'].values
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
from keras.models import Sequential
from keras import layers
input_dim = X_train.shape[1]
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.summary()
history = model.fit(X_train, y_train,
epochs=100,
verbose=False,
validation_data=(X_test, y_test),
batch_size=10)
When I reach the last line, I get an error
"TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]"
I guess I'll have to perform some kind of transformation on the data I'm using, or try to load the data in a different way. I searched on Stack Overflow already but, being new to all this, I couldn't find anything helpful.
How do I make this work? Ideally I'd like to get not only the solution but also a brief explanation of why the error happened and what the solution does to solve it.
Thanks!
The reason you're facing this difficulty is that your X_train and X_test are of type <class scipy.sparse.csr.csr_matrix>, whereas your model expects a numpy array.
Try casting them to dense, and you're good to go:
X_train = X_train.todense()
X_test = X_test.todense()
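Equivalently, .toarray() gives a plain ndarray rather than a numpy matrix, which is generally safer to feed to Keras:
X_train = X_train.toarray()
X_test = X_test.toarray()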
Not sure why you're getting an error for this script.
The following script works fine, even with a sparse matrix. Maybe give it a try on your machine.
sentences = ['i want to test this','let us try this',
'would this work','how about this',
'even this','this should not work']
y= [0,0,0,0,0,1]
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
from keras.models import Sequential
from keras import layers
input_dim = X_train.shape[1]
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train,
epochs=2,
verbose=True,
validation_data=(X_test, y_test),
batch_size=2)
Output:
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 10) 110
_________________________________________________________________
dense_6 (Dense) (None, 1) 11
=================================================================
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________
Train on 4 samples, validate on 2 samples
Epoch 1/2
4/4 [==============================] - 1s 169ms/step - loss: 0.7570 - acc: 0.2500 - val_loss: 0.6358 - val_acc: 1.0000
Epoch 2/2
4/4 [==============================] - 0s 3ms/step - loss: 0.7509 - acc: 0.2500 - val_loss: 0.6328 - val_acc: 1.0000
I have a CSV file with two columns:
category, description
There are 1,030 categories in the file and only about 12,600 lines.
I need a model for text classification, trained on this data. I use Keras with an LSTM model.
I found an article describing how to do binary classification, and slightly modified it to use several categories.
My code:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from numpy import array
from keras.preprocessing.text import one_hot
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing import sequence
import keras
df = pd.read_csv('/tmp/input_data.csv')
#one hot encode your documents
# integer encode the documents
vocab_size = 2000
encoded_docs = [one_hot(d, vocab_size) for d in df['description']]
def load_data_from_arrays(strings, labels, train_test_split=0.9):
    data_size = len(strings)
    test_size = int(data_size - round(data_size * train_test_split))
    print("Test size: {}".format(test_size))
    print("\nTraining set:")
    x_train = strings[test_size:]
    print("\t - x_train: {}".format(len(x_train)))
    y_train = labels[test_size:]
    print("\t - y_train: {}".format(len(y_train)))
    print("\nTesting set:")
    x_test = strings[:test_size]
    print("\t - x_test: {}".format(len(x_test)))
    y_test = labels[:test_size]
    print("\t - y_test: {}".format(len(y_test)))
    return x_train, y_train, x_test, y_test
encoder = LabelEncoder()
categories = encoder.fit_transform(df['category'])
num_classes = np.max(categories) + 1
print('Categories count: {}'.format(num_classes))
#Categories count: 1030
X_train, y_train, x_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)
# Truncate and pad the review sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
# Build the model
embedding_vector_length = 32
top_words = 10000
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
print(model.summary())
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_8 (Embedding) (None, 500, 32) 320000
_________________________________________________________________
lstm_8 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_8 (Dense) (None, 1030) 104030
=================================================================
Total params: 477,230
Trainable params: 477,230
Non-trainable params: 0
_________________________________________________________________
None
#Train the model
model.fit(X_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64)
Train on 10118 samples, validate on 2530 samples
Epoch 1/5
10118/10118 [==============================] - 60s 6ms/step - loss: 6.5086 - acc: 0.0019 - val_loss: 10.0911 - val_acc: 0.0000e+00
Epoch 2/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3281 - acc: 0.0028 - val_loss: 10.8270 - val_acc: 0.0000e+00
Epoch 3/5
10118/10118 [==============================] - 63s 6ms/step - loss: 6.3120 - acc: 0.0024 - val_loss: 11.0078 - val_acc: 0.0000e+00
Epoch 4/5
10118/10118 [==============================] - 64s 6ms/step - loss: 6.2891 - acc: 0.0030 - val_loss: 11.8264 - val_acc: 0.0000e+00
Epoch 5/5
10118/10118 [==============================] - 69s 7ms/step - loss: 6.2559 - acc: 0.0032 - val_loss: 12.1625 - val_acc: 0.0000e+00
#Evaluate the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 0.00%
What mistake did I make when preparing the data?
Why is the accuracy always 0?
I have put together end-to-end code with some input from my side and tested that it works on this data. You can use the same with your data with no or minimal changes, as I have removed specifics and made it generic. At the end, I have highlighted the points I worked on beyond the code you provided above.
Code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from nltk.tokenize import word_tokenize
def load_data_from_arrays(strings, labels, train_test_split=0.9):
    data_size = len(strings)
    test_size = int(data_size - round(data_size * train_test_split))
    print("Test size: {}".format(test_size))
    print("\nTraining set:")
    x_train = strings[test_size:]
    print("\t - x_train: {}".format(len(x_train)))
    y_train = labels[test_size:]
    print("\t - y_train: {}".format(len(y_train)))
    print("\nTesting set:")
    x_test = strings[:test_size]
    print("\t - x_test: {}".format(len(x_test)))
    y_test = labels[:test_size]
    print("\t - y_test: {}".format(len(y_test)))
    return x_train, y_train, x_test, y_test
# estimating the vocab length with the help of nltk
def get_vocab_length(strings):
    vocab = []
    for sent in strings:
        words = word_tokenize(sent)
        vocab.extend(words)
    vocab = list(set(vocab))
    vocab_length = len(vocab)
    return vocab_length
def clean_text(sent):
    # <your cleaning code here>
    # clean func 1
    # clean func 2
    # ...
    # clean func n
    return sent
# load input data
df = pd.read_csv('/tmp/input_data.csv')
strings = df['description'].values
labels = df['category'].values
clean_strings = [clean_text(sent) for sent in strings]
vocab_length = get_vocab_length(clean_strings)
# create onehot encodings of strings
encoded_docs = [one_hot(sent, vocab_length) for sent in clean_strings]
# create onehot encodings of labels
ohe = OneHotEncoder()
categories = ohe.fit_transform(labels.reshape(-1,1)).toarray()
# split data
X_train, y_train, X_test, y_test = load_data_from_arrays(encoded_docs, categories, train_test_split=0.8)
# assuming max input to be not more than 512 words
max_input_len = 512
# padding data
X_train = pad_sequences(X_train, maxlen=max_input_len, padding= 'post')
X_test = pad_sequences(X_test, maxlen=max_input_len, padding= 'post')
# setting embedding vector length
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_length, embedding_vector_length, input_length=max_input_len, name= 'embedding') )
model.add(Flatten())
model.add(Dense(categories.shape[1], activation= 'softmax')) # one output unit per class
model.compile('adam', loss= 'categorical_crossentropy', metrics= ['accuracy'])
model.summary()
# training the model
model.fit(X_train, y_train, epochs= 10, batch_size= 128, validation_split= 0.2, verbose= 1)
# evaluating the model
score = model.evaluate(X_test, y_test, verbose=0)
print("Test Loss:", score[0])
print("Test Acc:", score[1])
Additional areas I have worked on
1. Text Cleaning
Created a function to clean the text. This is extremely important, as it removes unnecessary noise from the data; note that this step depends entirely on the type of data you have. To help you simplify, I have created a clean_text function in the above code where you can place your cleaning code. It should take in raw text and return clean text. Some libraries you may want to look into are re, string, and emoji; a minimal sketch follows.
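For illustration only (this assumes plain English text; adapt the steps to your data):
import re
import string
def clean_text(sent):
    sent = sent.lower() # normalize case
    sent = re.sub(r'https?://\S+', ' ', sent) # strip URLs
    sent = sent.translate(str.maketrans('', '', string.punctuation)) # drop punctuation
    sent = re.sub(r'\s+', ' ', sent).strip() # collapse whitespace
    return sent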
2. Estimating Vocab Size
If you have enough data, it is good to estimate the vocab size rather than passing an arbitrary number directly to the Keras one_hot function. I have created a basic get_vocab_length function using nltk's word_tokenize. You can use it as is or enhance it further to suit your data.
What Else?
You can work further on hyperparameter tuning and a few different neural network designs.
Final Words
It still may not work, as it totally depends on the quality and amount of data you have. There is a good chance you may not get results after trying everything if you have poor-quality data or very little data.
I would then suggest you try transfer learning with pre-trained models like BERT, RoBERTa, etc. HuggingFace provides good support for state-of-the-art pre-trained models; you can get started at the following links -
https://huggingface.co/docs/transformers/index#supported-models
https://towardsdatascience.com/text-classification-with-hugging-face-transformers-in-tensorflow-2-without-tears-ee50e4f3e7ed
https://towardsdatascience.com/an-introduction-to-transformers-and-hugging-face-13052ec9d72d
I guess that your vocab_size is way too low. If you are dealing with usual text, try 10,000 - 100,000 as a starting point.
What one_hot does is use the hashing trick: all of your words are hashed and projected into a 2000-dimensional index space. It does not just mean that your dictionary is 2000 words long; it means every word is projected into this space, which effectively causes a lot of collisions, where different words get the same index and are treated as equal by the LSTM.
Furthermore, you should take a look at the transformed text, just to get an understanding of what happens here. To do so, build a reverse lookup and transform all the indices back.
As a further improvement, you could preprocess the text with common techniques like stemming and normalizing, and either use a proper vocabulary, or discard bag-of-words and use word embeddings.
from keras.preprocessing.text import one_hot, Tokenizer, hashing_trick
text1 = 'I love you'
text2 = 'you love I'
print('one_hot: ')
print(one_hot(text1, n=20))
print(one_hot(text2, n=20))
print('--------------------------------------')
print('Tokenizer: ')
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])
print(tokenizer.word_index)
print(tokenizer.index_word)
print('--------------------------------------')
print('hashing_trick: ')
print(hashing_trick(text1, n=20))
print(hashing_trick(text2, n=20))
print('--------------------------------------')
out:
one_hot:
[14, 7, 14]
[14, 7, 14]
--------------------------------------
Tokenizer:
{'i': 1, 'love': 2, 'you': 3}
{1: 'i', 2: 'love', 3: 'you'}
--------------------------------------
hashing_trick:
[14, 7, 14]
[14, 7, 14]
--------------------------------------
Run it a few more times and you will find that the results of one_hot and hashing_trick are not unique.
You should use Tokenizer to convert the text.
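A minimal sketch of that Tokenizer-based pipeline (assuming the df from the question; num_words and maxlen here are illustrative, not tuned) might be:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=20000) # cap the vocabulary size
tokenizer.fit_on_texts(df['description'])
sequences = tokenizer.texts_to_sequences(df['description']) # one unique integer per word
X = pad_sequences(sequences, maxlen=500)
# reverse lookup to inspect the transformed text
print([tokenizer.index_word[i] for i in X[0] if i != 0])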
I'm relatively new to neural nets so please excuse my ignorance. I'm trying to adapt the keras BLSTM example here. The example reads in texts and classifies them as 0 or 1. I want a BLSTM that does something very much like POS tagging, though extras like lemmatizing or other advanced features are not necessary; I just want a basic model. My data is a list of sentences and each word is given a category 1-8. I want to train a BLSTM that can use this data to predict the category for each word in an unseen sentence.
e.g. input = ['The', 'dog', 'is', 'red'] gives output = [2, 4, 3, 7]
If the keras example is not the best route, I'm open to other suggestions.
I currently have this:
'''Train a Bidirectional LSTM.'''
from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Dense, Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
np.random.seed(1337) # for reproducibility
max_features = 20000
batch_size = 16
maxlen = 18
print('Loading data...')
(X_train, y_train), (X_test, y_test) = prep_scan(nb_words=max_features,
test_split=0.2)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print("Pad sequences (samples x time)")
# type issues here? float/int?
X_train = sequence.pad_sequences(X_train, value=0.)
X_test = sequence.pad_sequences(X_test, value=0.) # pad with zeros
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
# need to pad y too, because more than 1 output value, not classification?
y_train = sequence.pad_sequences(np.array(y_train), value=0.)
y_test = sequence.pad_sequences(np.array(y_test), value=0.)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')
# this embedding layer will transform the sequences of integers
# into vectors of size 128
embedded = Embedding(max_features, 128, input_length=maxlen)(sequence)
# apply forwards LSTM
forwards = LSTM(64)(embedded)
# apply backwards LSTM
backwards = LSTM(64, go_backwards=True)(embedded)
# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
after_dp = Dropout(0.5)(merged)
# number after dense has to correspond to output matrix?
output = Dense(17, activation='sigmoid')(after_dp)
model = Model(input=sequence, output=output)
# try using different optimizers and different optimizer configs
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])
print('Train...')
model.fit(X_train, y_train,
batch_size=batch_size,
nb_epoch=4,
validation_data=[X_test, y_test])
X_test_new = np.array([[0,0,0,0,0,0,0,0,0,12,3,55,4,34,5,45,3,9],[0,0,0,0,0,0,0,1,7,65,34,67,34,23,24,67,54,43,]])
classes = model.predict(X_test_new, batch_size=16)
print(classes)
My output has the right dimensions, but is giving me floats between 0 and 1. I think this is because it's still looking for binary classification. Does anyone know how to fix this?
SOLVED
Just make sure the labels are each binary arrays:
(X_train, y_train), (X_test, y_test), maxlen, word_ids, tags_ids = prep_model(
nb_words=nb_words, test_len=75)
W = (y_train > 0).astype('float')
print(len(X_train), 'train sequences')
print(int(len(X_train)*val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')
# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')
# this embedding layer will transform the sequences of integers
# into vectors of size 256
embedded = Embedding(nb_words, output_dim=hidden,
input_length=maxlen, mask_zero=True)(sequence)
# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
go_backwards=True)(embedded)
# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
after_dp = Dropout(0.15)(merged)
# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
Dense(output_dim=nb_classes,
activation='softmax'))(after_dp)
model = Model(input=sequence, output=output)
# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
metrics=['accuracy'], optimizer='adam',
sample_weight_mode='temporal')
print('Train...')
model.fit(X_train, y_train,
batch_size=batch_size,
nb_epoch=epochs,
shuffle=True,
validation_split=val_split,
sample_weight=W)
Solved. The main issue was reshaping the data for the classification categories as binary arrays. Also used TimeDistributed and set return_sequences to True.
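For reference, a minimal sketch of that reshaping (names are illustrative; index 0 is reserved for padding, so nb_classes = 8 tags + 1) could be:
import numpy as np
# y_train: (n_samples, maxlen) integer tags after padding, 0 = pad
nb_classes = 9
y_train_onehot = np.eye(nb_classes)[y_train] # -> (n_samples, maxlen, nb_classes)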
I know that this thread is very old, but I hope I can help.
I modified the model for a binary model:
sequence = Input(shape=(X_train.shape[1],), dtype='int32')
embedded = Embedding(max_fatures,embed_dim,input_length=X_train.shape[1], mask_zero=True)(sequence)
# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,go_backwards=True)(embedded)
# concatenate the outputs of the 2 LSTMs
merged = concatenate([forwards, backwards])
after_dp = Dropout(0.15)(merged)
# now add an LSTM layer without return_sequences
lstm_normal = LSTM(hidden)(merged)
# TimeDistributed for sequence
# change activation to sigmoid?
#output = TimeDistributed(Dense(output_dim=2,activation='sigmoid'))(after_dp)
# I changed the output layer from TimeDistributed to a Dense because of the dimensionality problem, with output_dim = 1 (binary output)
output = Dense(output_dim=1,activation='sigmoid')(lstm_normal)
model = Model(input=sequence, output=output)
# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
# I changed model.compile to a binary loss and removed the sample_weight_mode parameter
model.compile(loss='binary_crossentropy',
metrics=['accuracy'], optimizer='adam',
)
print(model.summary())
###################################
# this is the training call
model.fit(X_train, Y_train,
batch_size=128,
epochs=10,
shuffle=True,
validation_split=0.2,
#sample_weight=W
)
# At this point it trains fine:
Train on 536000 samples, validate on 134000 samples
Epoch 1/10
536000/536000 [==============================] - 1814s 3ms/step - loss: 0.4794 - acc: 0.7679 - val_loss: 0.4624 - val_acc: 0.7784
Epoch 2/10
536000/536000 [==============================] - 1829s 3ms/step - loss: 0.4502 - acc: 0.7857 - val_loss: 0.4551 - val_acc: 0.7837
Epoch 3/10
99584/536000 [====>.........................] - ETA: 23:10 - loss: 0.4291 - acc: 0.7980