Word2Vec + LSTM Good Training and Validation but Poor on Test - python

I am currently training a Word2Vec + LSTM model for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embedding, because performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that during training my validation accuracy and loss get stuck at 0.88 and 0.34 respectively, and my confusion matrix also seems wrong. Here are the steps I carried out before fitting the model:
Text Pre processing:
Lower casing
Remove hashtag, mentions, URLs, numbers, change words to numbers, non-ASCII characters, retweets "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=0.9,
        test_size=0.1,
        stratify=y,
        random_state=0)
    return X_train, X_test, y_train, y_test
After the split, the training set has 2060 samples: 708 positive, 837 negative, and 515 neutral.
Then, I applied EDA (Easy Data Augmentation) text augmentation to all the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()
    def replace_synonym(self, text):
        augmented_text_portion = int(len(text)*0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced
    def random_insert(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted
    def random_swap(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped
    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted
text_augmentation = TextAugmentation()
After augmentation, the training set has 10300 samples: 3540 positive, 4185 negative, and 2575 neutral.
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
Finally, I fit my model, which uses only the weights of the pre-trained GoogleNewsVectorNegative300 word embedding together with LSTM layers, and I split my training data again, holding out 10% for validation, as follows:
# Build single LSTM model
def build_lstm_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,), dtype='int32')
    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)
    # LSTM model layer
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=True)(embedding_layer)
    batch_normalization = BatchNormalization()(lstm_layer)
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=False)(batch_normalization)
    batch_normalization = BatchNormalization()(lstm_layer)
    # Dense model layer
    dense_layer = Dense(units=128, activation='relu')(batch_normalization)
    dropout_layer = Dropout(rate=0.5)(dense_layer)
    batch_normalization = BatchNormalization()(dropout_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_normalization)
    lstm_model = Model(inputs=input_layer, outputs=output_layer)
    return lstm_model
# Building single LSTM model
sinovac_lstm_model = build_lstm_model(SINOVAC_EMBEDDING_MATRIX, SINOVAC_MAX_SEQUENCE)
sinovac_lstm_model.summary()
sinovac_lstm_model.compile(loss='categorical_crossentropy',
                           optimizer=Adam(learning_rate=0.001),
                           metrics=['accuracy'])
sinovac_lstm_history = sinovac_lstm_model.fit(x=X_sinovac_train,
                                              y=y_sinovac_train,
                                              batch_size=64,
                                              epochs=20,
                                              validation_split=0.1,
                                              verbose=1)
The training result:
The evaluation result:
I really need some suggestions or insights to have a good accuracy on my test

Without reviewing everything, a few high-order things that may be limiting your results:
The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era & domain-of-writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization & parameterization choices) has a good chance of working better, so you may want to revisit that choice.
The preprocessing of the GoogleNews training texts, while as far as I can tell never fully documented, did not appear to flatten all casing, remove stopwords, or involve lemmatization. It didn't mutate obvious negations into antonyms, but it did statistically combine some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors, even throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble, and note that a sufficient modern word2vec model, trained on tweets using the same preprocessing for word2vec training and for the later steps, would match vocabularies perfectly.
Both the word2vec model, and any deeper neural network, often need lots of data to train well, and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train ~130k parameters – at least 520KB of state – from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not be really giving the model any info that the word-vector similarity between synonyms didn't already provide.) So: consider shrinking your model, and getting much more training data - potentially even from other domains than your main classification interest, as long as the use-of-language is similar.
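One way to check the concordance point above is to measure how much of the preprocessed tweet vocabulary is actually present in the pretrained vector set. The following is a minimal sketch, not the asker's pipeline: `vocabulary_coverage`, `pretrained_vocab` and the toy `tweets` are hypothetical names and data made up for illustration. With gensim 4, the `key_to_index` mapping of a loaded `KeyedVectors` object could presumably play the role of `pretrained_vocab`.

```python
from collections import Counter

def vocabulary_coverage(tokenized_texts, pretrained_vocab):
    """Return (type coverage, token coverage, most common missing tokens)."""
    counts = Counter(token for text in tokenized_texts for token in text)
    known = {token for token in counts if token in pretrained_vocab}
    total_tokens = sum(counts.values())
    # Share of distinct tokens that have a pretrained vector.
    type_cov = len(known) / len(counts) if counts else 0.0
    # Share of all token occurrences that have a pretrained vector.
    token_cov = sum(counts[t] for t in known) / total_tokens if total_tokens else 0.0
    missing = Counter({t: c for t, c in counts.items() if t not in known})
    return type_cov, token_cov, missing.most_common(10)

# Hypothetical stand-ins: a tiny "pretrained" vocabulary and preprocessed tweets.
pretrained_vocab = {"vaccine", "good", "bad", "not", "the", "The"}
tweets = [["vaccine", "good"], ["vaccine", "notbad"], ["pfizer", "good"]]

type_cov, token_cov, missing = vocabulary_coverage(tweets, pretrained_vocab)
print(f"type coverage:  {type_cov:.2f}")
print(f"token coverage: {token_cov:.2f}")
print("most common missing tokens:", missing)
```

Running this before and after each preprocessing step (lowercasing, lemmatization, antonym replacement) would show which steps push tokens out of the pretrained vocabulary.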

Related

Word2Vec + CNN Overfitting

I am currently training a Word2Vec + CNN model for Twitter sentiment analysis in the COVID-19 vaccine domain. I used the pre-trained GoogleNewsVectorNegative300 word embedding, because performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that the model overfits heavily during training. Here are the steps I carried out before fitting the model:
Text Pre processing:
Lower casing
Remove hashtag, mentions, URLs, numbers, change words to numbers, non-ASCII characters, retweets "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=0.9,
        test_size=0.1,
        stratify=y,
        random_state=0)
    return X_train, X_test, y_train, y_test
After the split, the training set has 2060 samples: 708 positive, 837 negative, and 515 neutral.
Training:
Testing:
Then, I applied EDA (Easy Data Augmentation) text augmentation to all the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()
    def replace_synonym(self, text):
        augmented_text_portion = int(len(text)*0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced
    def random_insert(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted
    def random_swap(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped
    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted
text_augmentation = TextAugmentation()
After augmentation, the training set has 10300 samples: 3540 positive, 4185 negative, and 2575 neutral.
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
Finally, I fit my model, which uses the pre-trained GoogleNewsVectorNegative300 word embedding and a CNN, and I split my training data again, holding out 10% for validation, as follows:
# Build single CNN model
def build_cnn_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,))
    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)
    # CNN model layer
    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(embedding_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)
    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)
    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)
    flatten = Flatten()(batch_norm_layer)
    # Dense model layer
    dense_layer = Dense(units=10, activation='relu')(flatten)
    batch_norm_layer = BatchNormalization()(dense_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_norm_layer)
    cnn_model = Model(inputs=input_layer, outputs=output_layer)
    return cnn_model
sinovac_cnn_history = sinovac_cnn_model.fit(x=X_sinovac_train,
                                            y=y_sinovac_train,
                                            batch_size=128,
                                            epochs=100,
                                            validation_split=0.1,
                                            verbose=1)
The training result:
I really need some suggestions or insights because I have been doing this without any performance progress to my model
That's quite a complex problem. It sure looks like overfitting as you said yourself. Meaning the model can't generalize well from your training set to new data.
Intuitively, I would suggest for you to cycle hyperparameters (epochs, batch size, learning rate, dropout layers), if you didn't already, to seek a better combination. Also, I would suggest to use cross-validation to get a better idea of the performance of your classifier. This would also shuffle the training data and avoid that the model learns the data by heart.
Have you tried classifying the original data without the data augmentation? It's not a lot of data, but it could be enough to see if the performance on the test set is better than the final version, and thus see whether the data augmentation might be screwing something up in your data.
Did you try another embedding? I don't really think this is the problem, but in the search for the error I would probably switch it to see what happens.
Last but not least, do you know for a fact that this model structure can handle this task? Meaning, did you find a working example somewhere? It sure sounds like it could do it, but there is a chance that the CNN model, for example, just doesn't generalize well over the embeddings. Have you considered using a model designed for text classification, like a Transformer or an LSTM?
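The cross-validation suggestion above can be sketched without any ML library. This is a minimal, stdlib-only illustration under stated assumptions: `stratified_k_fold` is a hypothetical helper written for this example, and the label list only mirrors the class counts quoted in the question. In practice scikit-learn's `StratifiedKFold` offers the same idea.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs with class ratios roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val

# Labels mirroring the class counts of the original (un-augmented) training set.
labels = ["pos"] * 708 + ["neg"] * 837 + ["neu"] * 515
splits = list(stratified_k_fold(labels, k=5))
for fold_number, (train, val) in enumerate(splits):
    print(fold_number, len(train), len(val))
```

If augmented copies are used, it is probably safer to split the original tweets first and augment only the training part of each fold, so that variants of the same tweet do not end up on both sides of the split.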

Implementation of Gradient-Reversal-layer into a functioning keras model for multiclassification

My question is about the practical implementation of "Domain Adaptation" into a functional model in keras with tensorflow backend.
Description of the problem:
I have a collection of particle collision samples, each consisting of n variables. One half of them is simulated data with known class labels (e.g. "W-Boson"). The other half is real collision data, which is not labeled. The key idea is to set up a keras model with two outputs: one for classifying the class of a sample and one for classifying the domain, i.e. whether it is simulated or real data. The model should be trained so that the domain classifier performs very poorly. This is achieved by flipping the sign of the incoming gradient from the domain end of the network during training. This technique is called "Domain Adaptation". The model is expected to learn domain-invariant features, or in other words, to perform the same on simulated and real collision data.
The framework I am working with has an existing functional keras model, which I wanted to expand with said domain classifier. This is a prototype I came up with:
# common layers
inputs = keras.Input(shape=(n_variables, ))
X = layers.Dense(units=50, activation="relu")(inputs)
# domain end
flip_layer = flipGradientTF.GradientReversal(hp_lambda=0.3)(X)
X_domain = layers.Dense(units=50, activation="relu")(flip_layer)
domain_out = layers.Dense(units=2, activation="softmax", name="domain_out")(X_domain)
# class end
X_class = layers.Dense(units=50, activation="relu")(X)
class_out = layers.Dense(units=n_classes, activation="softmax", name="class_out")(X_class)
The code for flipGradientTF is taken from https://github.com/michetonu/gradient_reversal_keras_tf
And further on for compiling and training the model:
model = keras.Model(inputs=inputs, outputs=[class_out, domain_out])
model.compile(optimizer="adam", loss=loss_function, metrics="accuracy")
# train model
model.fit(
    x=train_data,
    y=[train_class_labels, train_domain_labels],
    batch_size=200,
    epochs=200,
    sample_weight={"class_out": class_weights, "domain_out": None}
)
For train_data I am passing the dataframe which consists of the data from both domains. As I have tried using either "categorical_crossentropy" or "sparse_categorical_crossentropy" as the loss_function, train_class_labels and train_domain_labels were either in the one-hot representation or in the integer representation. My biggest issue is figuring out what to use for the class labels of the unlabeled data, and this led to a gut feeling that I am on the wrong track here.
So in a nutshell:
Is this implementation strategy legit and assuming it is, what should I do about the class labels for the unlabeled data? And if it is not legit, what would be a better way of attacking this problem?
Any help would be much appreciated :)
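For reference, the gradient flip described in the problem statement can be illustrated in a few lines of plain Python. This is only a conceptual sketch, not the code of the linked repository: `grl_forward` and `grl_backward` are hypothetical names, and only the parameter name `hp_lambda` is borrowed from the prototype above. The layer is assumed to act as the identity in the forward pass and to multiply the incoming gradient by `-hp_lambda` in the backward pass.

```python
def grl_forward(x):
    """Forward pass: the gradient reversal layer passes its input through unchanged."""
    return list(x)

def grl_backward(upstream_grad, hp_lambda=0.3):
    """Backward pass: flip the sign of the incoming gradient and scale it by hp_lambda."""
    return [-hp_lambda * g for g in upstream_grad]

features = [1.0, -2.0]
domain_grad = [0.5, -1.0]  # made-up gradient arriving from the domain classifier

passed_on = grl_forward(features)
reversed_grad = grl_backward(domain_grad, hp_lambda=0.3)
print(passed_on)
print(reversed_grad)
```

The shared layers therefore receive a gradient that pushes them to make the domain classifier's job harder, while the domain branch itself is still trained normally.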

How to make prediction on new text dataset using saved text classification model

I trained a text classifier under this guide: https://developers.google.com/machine-learning/guides/text-classification/step-4
I saved the model as
model.save('~./output/model.h5')
In this case, how do I use this model to classify texts in a new dataset?
Thank you
import tensorflow as tf
# Recreate the exact same model, including its weights and the optimizer
new_model = tf.keras.models.load_model('~./output/model.h5')
# Show the model architecture
new_model.summary()
# Apply the same process of data preparation while training the model.
# Lets say after Data preprocessing you have stored the processed data in test_data
# check model accuracy from unseen/new dataset
loss, acc = new_model.evaluate(test_data, test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100*acc))
You can use TensorFlow's text tokenization utility class (Tokenizer) to deal with unknown words in the test data.
num_words is the vocabulary size (it keeps the most frequent words).
Set oov_token to some string; it is used for all tokens outside the vocabulary (basically, new words in the test data are mapped to the oov_token string).
Fit on the training data and then generate token sequences for both train and test data.
tf.keras.preprocessing.text.Tokenizer(
    num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
    split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)
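To make the fit-on-train, map-unknowns-to-OOV idea above concrete, here is a small pure-Python imitation. It is a sketch of the concept only, not Keras's actual implementation: `fit_word_index` and `texts_to_sequences` are hypothetical helpers, the texts are made up, and the exact index assignment in Keras may differ in detail.

```python
from collections import Counter

def fit_word_index(train_texts, num_words=None, oov_token="<OOV>"):
    """Build a word -> index mapping from the training texts only."""
    counts = Counter(word for text in train_texts for word in text.lower().split())
    # Index 0 is left free for padding; index 1 is the out-of-vocabulary token.
    word_index = {oov_token: 1}
    for word, _ in counts.most_common(num_words):  # num_words=None keeps every word
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its index; unseen words get the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]

train_texts = ["good vaccine", "bad vaccine", "good news"]
test_texts = ["good booster news"]  # "booster" never appears in the training texts

word_index = fit_word_index(train_texts)
train_seqs = texts_to_sequences(train_texts, word_index)
test_seqs = texts_to_sequences(test_texts, word_index)
print(word_index)
print(test_seqs)
```

The important point is that the mapping fitted on the training data is reused unchanged for the new dataset; fitting a fresh tokenizer on the new texts would assign different indices and make the saved model's embedding lookups meaningless.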

How to improve accuracy of model for categorical, non-binary, foreign language sentiment analysis in TensorFlow?

TLDR
My aim is to categorize sentences in a foreign language (Hungarian) to 3 sentiment categories: negative, neutral & positive. I would like to improve the accuracy of the model used, which can be found below in the "Define, Compile, Fit the Model" section. The rest of the post is here for completeness and reproducibility.
I am new to asking questions on Machine Learning topics, suggestions are welcome here as well: How to ask a good question on Machine Learning?
Data preparation
For this I have 10000 sentences, given to 5 human annotators, categorized as negative, neutral or positive, available from here. The first few lines look like this:
I categorize the sentence positive (denoted by 2) if sum of the scores by annotators is positive, neutral if it is 0 (denoted by 1), and negative (denoted by 0) if the sum is negative:
import pandas as pd
sentences_df = pd.read_excel('/content/OpinHuBank_20130106.xls')
sentences_df['annotsum'] = sentences_df['Annot1'] +\
sentences_df['Annot2'] +\
sentences_df['Annot3'] +\
sentences_df['Annot4'] +\
sentences_df['Annot5']
def categorize(integer):
    if 0 < integer: return 2
    if 0 == integer: return 1
    else: return 0
sentences_df['sentiment'] = sentences_df['annotsum'].apply(categorize)
Following this tutorial, I use SubwordTextEncoder to proceed. From here, I download web2.2-freq-sorted.top100k.nofreqs.txt, which contains the 100000 most frequently used words in the target language. (Both the sentiment data and this word list were recommended by this.)
Reading in list of most frequent words:
wordlist = pd.read_csv('/content/web2.2-freq-sorted.top100k.nofreqs.txt',sep='\n',header=None,encoding = 'ISO-8859-1')[0].dropna()
Encoding data, conversion to tensors
Initializing encoder using build_from_corpus method:
import tensorflow_datasets as tfds
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
corpus_generator=(word for word in wordlist), target_vocab_size=2**16)
Building on this, encoding the sentences:
import numpy as np
import tensorflow as tf
def applyencoding(string):
    return tf.convert_to_tensor(np.asarray(encoder.encode(string)))
sentences_df['encoded_sentences'] = sentences_df['Sentence'].apply(applyencoding)
Convert to a tensor each sentence's sentiment:
def tensorise(input):
    return tf.convert_to_tensor(input)
sentences_df['sentiment_as_tensor'] = sentences_df['sentiment'].apply(tensorise)
Defining how much data to be preserved for testing:
test_fraction = 0.2
train_fraction = 1-test_fraction
From the pandas dataframe, let's create numpy array of encoded sentence train tensors:
nparrayof_encoded_sentence_train_tensors = \
np.asarray(sentences_df['encoded_sentences'][:int(train_fraction*len(sentences_df['encoded_sentences']))])
These tensors have different lengths, so lets use padding to make them have the same:
padded_nparrayof_encoded_sentence_train_tensors = tf.keras.preprocessing.sequence.pad_sequences(
nparrayof_encoded_sentence_train_tensors, padding="post")
Let's stack these tensors together:
stacked_padded_nparrayof_encoded_sentence_train_tensors = tf.stack(padded_nparrayof_encoded_sentence_train_tensors)
Stacking the sentiment tensors together as well:
stacked_nparray_sentiment_train_tensors = \
tf.stack(np.asarray(sentences_df['sentiment_as_tensor'][:int(train_fraction*len(sentences_df['encoded_sentences']))]))
Define, Compile, Fit the Model (ie the main point)
Define & compile the model as follows:
### THE QUESTION IS ABOUT THESE ROWS ###
model = tf.keras.Sequential([
tf.keras.layers.Embedding(encoder.vocab_size, 64),
tf.keras.layers.Conv1D(128, 5, activation='sigmoid'),
tf.keras.layers.GlobalAveragePooling1D(),
tf.keras.layers.Dense(6, activation='sigmoid'),
tf.keras.layers.Dense(3, activation='sigmoid')
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True), optimizer='adam', metrics=['accuracy'])
Fit it:
NUM_EPOCHS = 40
history = model.fit(stacked_padded_nparrayof_encoded_sentence_train_tensors,
stacked_nparray_sentiment_train_tensors,
epochs=NUM_EPOCHS)
The first few lines of the output are:
Testing results
As in TensorFlow's RNN tutorial, let's plot the results we gained so far:
import matplotlib.pyplot as plt
def plot_graphs(history):
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['loss'])
    plt.xlabel("Epochs")
    plt.ylabel('accuracy / loss')
    plt.legend(['accuracy','loss'])
    plt.show()
plot_graphs(history)
Which gives us:
Prepare the testing data as we prepared the training data:
nparrayof_encoded_sentence_test_tensors = \
np.asarray(sentences_df['encoded_sentences'][int(train_fraction*len(sentences_df['encoded_sentences'])):])
padded_nparrayof_encoded_sentence_test_tensors = tf.keras.preprocessing.sequence.pad_sequences(
nparrayof_encoded_sentence_test_tensors, padding="post")
stacked_padded_nparrayof_encoded_sentence_test_tensors = tf.stack(padded_nparrayof_encoded_sentence_test_tensors)
stacked_nparray_sentiment_test_tensors = \
tf.stack(np.asarray(sentences_df['sentiment_as_tensor'][int(train_fraction*len(sentences_df['encoded_sentences'])):]))
Evaluate the model using only test data:
test_loss, test_acc = model.evaluate(stacked_padded_nparrayof_encoded_sentence_test_tensors,stacked_nparray_sentiment_test_tensors)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
Giving result:
Full notebook available here.
The question
How can I change the model definition and compilation rows above to have higher accuracy on the test set after no more than 1000 epochs?
You are using word piece subwords; you could try BPE instead. You could also build your model on top of BERT and use transfer learning, which is likely to improve your results substantially.
First, change the kernel size in your Conv1D layer and try various values for it; [3, 5, 7] would be a reasonable set. Then consider adding layers. Also, increase the number of units in the second-to-last layer (the Dense layer), which might help. Alternatively, you can try a network with just LSTM layers, or LSTM layers followed by a Conv1D layer.
Try it out: if it works, great; otherwise, repeat. The training loss gives a hint, though: if the loss is not going down smoothly, you may assume that your network lacks predictive power (i.e. it is underfitting), and increase the number of neurons in it.
Yes, more data does help. But if the fault is in your network, i.e. it is underfitting, more data won't help. You should explore the limits of the model you have before looking for faults in the data.
Yes, using the most common words is the usual norm, because the less frequently used words will, probabilistically, occur rarely and thus won't affect the predictions greatly.

Python: LSTM model and word embedding

My problem is mainly theoretical. I would like to use an LSTM model to classify the sentiment of sentences as follows: 1 = positive, 0 = neutral and -1 = negative. I have a bag of words (BOW) that I would like to use to train the model. The BOW is a dataframe with two columns like this:
Text | Sentiment
hello dear... 1
I hate you... -1
... ...
According to the example proposed by keras I should transform the sentences of the 'Text' column of my BOW into numerical vectors where each number represents a word of the vocabulary.
Now my question is: how do I turn my sentences into vectors of numbers, and what are the best techniques to do it?
For now my code is this; what am I doing wrong?
model = Sequential()
model.add(LSTM(units=50))
model.add(Dense(2, activation='softmax')) # 2 because I have 3 classes
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Sentiment'], test_size=0.3, random_state=1) # capitalized 'Sentiment' for the other dataframe
clf = model.fit(X_train, y_train)
predicted = clf.predict(X_test)
print(predicted)
First of all, as Marat commented, you are not using the term Bag of Words (BOW) correctly here. What you are claiming to be your BOW is simply just a labeled dataset of sentences. While there are a lot of questions here, I will try to answer the first one on how to convert your sentences into vectors that can be used in an LSTM model.
The most basic way to do this is to create one-hot-encoding vectors for each word in each sentence. To create these, you first need to iterate through your dataset and assign a unique index to each word. So for example:
vocab = {
    'hello': 0,
    'dear': 1,
    # ...
    'hate': 999,
}
Once you have this dictionary created, you can then go through each sentence and assign each word in each sentence a vector of len(vocab) with zeros at every index except for the index corresponding to that word. For example, using vocab above, dear would look like:
[0,1,0,0,0,...,0,0].
The pros of one-hot-encoding vectors are that they are easy to create and pretty simple to work with. The downside is that you can pretty quickly end up working with very high-dimensional vectors if you have a large vocabulary. That's where word embeddings come into play, and honestly they are the superior route compared to one-hot-encoding vectors. However, they are a bit more complex, and it is harder to understand what exactly they are doing behind the scenes. You can read more about that here if you want: https://towardsdatascience.com/what-the-heck-is-word-embedding-b30f67f01c81
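The vocabulary and one-hot steps described above can be sketched as follows. This is a minimal illustration, assuming simple lowercase whitespace tokenization; `build_vocab` and `one_hot` are hypothetical helper names and the two sentences are taken from the example table in the question.

```python
def build_vocab(sentences):
    """Assign a unique index to every distinct word, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a vector of len(vocab) zeros with a 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

sentences = ["hello dear", "I hate you"]
vocab = build_vocab(sentences)
# Each sentence becomes a list of one-hot vectors, one per word.
encoded = [[one_hot(word, vocab) for word in sentence.lower().split()]
           for sentence in sentences]
print(vocab)
print(one_hot("dear", vocab))
```

Each vector has as many entries as there are words in the vocabulary, which is why this representation becomes unwieldy for large vocabularies.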
You should first create an index of your vocabulary, i.e. assign an index to each token in your dataset, and then transform the text to a numeric form by replacing each token by its corresponding index. Your model should then be:
model = Sequential()
model.add(Embedding(len(vocab), 64, input_length=sent_len))
model.add(LSTM(units=50))
model.add(Dense(3, activation='softmax'))
Note that you need to pad your sentences to a common length before feeding them to the network. You can use np.pad (or Keras's pad_sequences) to do so.
Another alternative is to use pre-trained word embeddings; you can download them from fastText.
P.S. You are misusing the term BOW; however, BOW is a good baseline model you can use for sentiment analysis.
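The index-and-pad preparation described in this answer could look roughly like the sketch below. It is an illustration under stated assumptions rather than the answerer's code: `build_index` and `encode_and_pad` are hypothetical helpers, index 0 is reserved for padding and index 1 for unknown words, and padding is added at the end of each sentence.

```python
def build_index(sentences):
    """Map each word to an integer; 0 is reserved for padding, 1 for unknown words."""
    index = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            index.setdefault(word, len(index))
    return index

def encode_and_pad(sentences, index, sent_len):
    """Replace words by indices, then truncate or right-pad every row to sent_len."""
    rows = []
    for sentence in sentences:
        ids = [index.get(word, index["<unk>"]) for word in sentence.lower().split()]
        ids = ids[:sent_len]
        rows.append(ids + [index["<pad>"]] * (sent_len - len(ids)))
    return rows

sentences = ["hello dear", "I hate you"]
index = build_index(sentences)
# "stranger" was not seen when building the index, so it maps to <unk>.
padded = encode_and_pad(sentences + ["hello stranger"], index, sent_len=4)
print(index)
print(padded)
```

With this layout the Embedding layer's input dimension would be `len(index)`, since the padding and unknown entries also need a row in the embedding table.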
