Python: LSTM model and word embedding


My problem is mainly theoretical. I would like to use an LSTM model to classify the sentiment of sentences, where 1 = positive, 0 = neutral and -1 = negative. I have a bag of words (BOW) that I would like to use to train the model. The BOW is a dataframe with two columns, like this:
Text | Sentiment
hello dear... 1
I hate you... -1
... ...
According to the example proposed by Keras, I should transform the sentences of the 'Text' column of my BOW into numerical vectors in which each number represents a word of the vocabulary.
Now my question is: how do I turn my sentences into vectors of numbers, and what are the best techniques for doing so?
For now my code is this. What am I doing wrong?
model = Sequential()
model.add(LSTM(units=50))
model.add(Dense(2, activation='softmax')) # 2 because I have 3 classes
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Sentiment'], test_size=0.3, random_state=1) # 'Sentiment' is capitalized for the other dataframe
clf = model.fit(X_train, y_train)
predicted = clf.predict(X_test)
print(predicted)

First of all, as Marat commented, you are not using the term Bag of Words (BOW) correctly here. What you are claiming to be your BOW is simply just a labeled dataset of sentences. While there are a lot of questions here, I will try to answer the first one on how to convert your sentences into vectors that can be used in an LSTM model.
The most basic way to do this is to create one-hot-encoding vectors for each word in each sentence. To create these, you first need to iterate through your dataset and assign a unique index to each word. So for example:
vocab =
{ 'hello': 0,
'dear': 1,
.
.
.
'hate': 999}
Once you have this dictionary created, you can then go through each sentence and assign each word in each sentence a vector of len(vocab) with zeros at every index except for the index corresponding to that word. For example, using vocab above, dear would look like:
[0,1,0,0,0,...,0,0].
The advantages of one-hot-encoding vectors are that they are easy to create and simple to work with. The downside is that you can quickly end up with very high-dimensional vectors if you have a large vocabulary. That's where word embeddings come into play, and honestly they are the superior route compared to one-hot-encoding vectors. However, they are a bit more complex, and it is harder to understand what exactly they are doing behind the scenes. You can read more about that here if you want: https://towardsdatascience.com/what-the-heck-is-word-embedding-b30f67f01c81
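The vocabulary and one-hot steps described above can be sketched in plain Python. This is only an illustration on a toy corpus: the `sentences` list, the lowercase whitespace tokenization and the `one_hot` helper are assumptions, not part of the original answer.

```python
# Toy corpus standing in for the 'Text' column (hypothetical data).
sentences = ["hello dear", "I hate you"]

# Assign each distinct lowercase token a unique index, in order of first appearance.
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)

def one_hot(word, vocab):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

print(vocab)                   # {'hello': 0, 'dear': 1, 'i': 2, 'hate': 3, 'you': 4}
print(one_hot("dear", vocab))  # [0, 1, 0, 0, 0]
```

Each vector has as many entries as there are words in the vocabulary, which is why this representation grows quickly on real datasets.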

You should first create an index of your vocabulary, i.e. assign an index to each token in your dataset. Then transform the text into numeric form by replacing each token with its corresponding index. Your model should then be:
model = Sequential()
model.add(Embedding(len(vocab), 64, input_length=sent_len))
model.add(LSTM(units=50))
model.add(Dense(3, activation='softmax'))
Note that you need to pad your sentences to a common length before feeding them to the network. You can use np.pad to do so.
Another alternative is to use pre-trained word embeddings, which you can download from fastText.
P.S. You are misusing the term BOW; however, BOW is a good baseline model that you can use for sentiment analysis.
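The indexing and padding steps from this answer could look roughly like the sketch below. It is written in plain Python so it runs without dependencies. The reserved indices `PAD` and `OOV`, the helper names and the choice to pad on the right are all assumptions made for illustration.

```python
PAD, OOV = 0, 1  # assumed reserved indices: 0 for padding, 1 for unknown words

def build_vocab(sentences):
    """Map each token to an index starting at 2 (0 and 1 are reserved)."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab) + 2)
    return vocab

def encode(sentence, vocab, sent_len):
    """Replace tokens by indices, then truncate or right-pad to sent_len."""
    ids = [vocab.get(token, OOV) for token in sentence.lower().split()]
    ids = ids[:sent_len]
    return ids + [PAD] * (sent_len - len(ids))

train = ["hello dear", "I hate you"]
vocab = build_vocab(train)
print(encode("I hate you", vocab, sent_len=4))      # [4, 5, 6, 0]
print(encode("hello stranger", vocab, sent_len=4))  # [2, 1, 0, 0]

# Sentiment labels -1/0/1 shifted to class indices 0/1/2 for a 3-unit softmax.
labels = [1, -1]
class_ids = [label + 1 for label in labels]         # [2, 0]
```

With two reserved indices, the first argument of the Embedding layer would be `len(vocab) + 2` rather than `len(vocab)`.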

Related

Word2Vec + LSTM Good Training and Validation but Poor on Test

Currently I am training my Word2Vec + LSTM model for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embedding. I chose the pre-trained GoogleNewsVectorNegative300 because performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that during training my validation accuracy and loss are stuck at 0.88 and 0.34 respectively. My confusion matrix also seems wrong. Here are the steps I carried out before fitting the model:
Text Pre processing:
Lower casing
Remove hashtag, mentions, URLs, numbers, change words to numbers, non-ASCII characters, retweets "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        train_size=0.9,
                                                        test_size=0.1,
                                                        stratify=y,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test
The split results in a training set of 2060 samples: 708 in the positive sentiment class, 837 in the negative class and 515 in the neutral class.
Then I applied text augmentation, namely EDA (Easy Data Augmentation), to all the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text)*0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted

text_augmentation = TextAugmentation()
After augmentation the training set has 10300 samples: 3540 in the positive sentiment class, 4185 in the negative class and 2575 in the neutral class.
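For readers unfamiliar with EDA, two of its four operations can be sketched in plain Python. This is not the `EDA` library's implementation, only a minimal illustration on word lists. The function names, the `rng` argument and the example sentence are assumptions. Synonym replacement and random insertion are left out because they need a thesaurus.

```python
import random

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times; length is unchanged."""
    words = list(words)
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, but always keep at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
tokens = "the vaccine rollout went better than expected".split()
swapped = random_swap(tokens, n=1, rng=rng)
deleted = random_deletion(tokens, p=0.5, rng=rng)

# The original plus four augmented copies give five times the data.
print(2060 * 5)  # 10300
```

The class counts reported above are consistent with that factor of five: 708, 837 and 515 become 3540, 4185 and 2575.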
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
Finally, I fit my model using the pre-trained GoogleNewsVectorNegative300 word embedding, taking only its weights and feeding them to an LSTM, and I split my training data again, holding out 10% for validation, as follows:
# Build single LSTM model
def build_lstm_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,), dtype='int32')
    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)
    # LSTM model layer
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=True)(embedding_layer)
    batch_normalization = BatchNormalization()(lstm_layer)
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=False)(batch_normalization)
    batch_normalization = BatchNormalization()(lstm_layer)
    # Dense model layer
    dense_layer = Dense(units=128, activation='relu')(batch_normalization)
    dropout_layer = Dropout(rate=0.5)(dense_layer)
    batch_normalization = BatchNormalization()(dropout_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_normalization)
    lstm_model = Model(inputs=input_layer, outputs=output_layer)
    return lstm_model
# Building single LSTM model
sinovac_lstm_model = build_lstm_model(SINOVAC_EMBEDDING_MATRIX, SINOVAC_MAX_SEQUENCE)
sinovac_lstm_model.summary()
sinovac_lstm_model.compile(loss='categorical_crossentropy',
optimizer=Adam(learning_rate=0.001),
metrics=['accuracy'])
sinovac_lstm_history = sinovac_lstm_model.fit(x=X_sinovac_train,
y=y_sinovac_train,
batch_size=64,
epochs=20,
validation_split=0.1,
verbose=1)
The training result:
The evaluation result:
I really need some suggestions or insights to have a good accuracy on my test

Without reviewing everything, a few high-order things that may be limiting your results:
The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era & domain-of-writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization & parameterization choices) has a good chance of working better, so you may want to revisit that choice.
The preprocessing of the GoogleNews training texts, while as far as I can tell never fully documented, did not appear to flatten all casing, remove stopwords, or involve lemmatization. It didn't mutate obvious negations into antonyms, but it did statistically combine some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors, and are even throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble, and note that a sufficient modern word2vec model, trained on tweets using the same preprocessing for word2vec training and for later steps, would match vocabularies perfectly.
Both the word2vec model, and any deeper neural network, often need lots of data to train well, and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train ~130k parameters – at least 520KB of state – from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not be really giving the model any info that the word-vector similarity between synonyms didn't already provide.) So: consider shrinking your model, and getting much more training data - potentially even from other domains than your main classification interest, as long as the use-of-language is similar.
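The sizes quoted in the last point can be checked with a few lines of arithmetic. The sketch assumes 4-byte float32 parameters and the published GoogleNews shape of about 3 million tokens by 300 dimensions. `lstm_params` is a hypothetical helper implementing the usual parameter count of a Keras-style LSTM layer.

```python
BYTES_PER_FLOAT32 = 4

# GoogleNews vectors: about 3 million tokens x 300 dimensions.
googlenews_params = 3_000_000 * 300
print(googlenews_params)                 # 900000000

# The answer's rough estimate of trainable parameters outside the embedding.
model_params = 130_000
print(model_params * BYTES_PER_FLOAT32)  # 520000 bytes, i.e. about 520 KB

# LSTM layer parameter count: 4 gates, each with input weights,
# recurrent weights and a bias.
def lstm_params(input_dim, units):
    return 4 * ((input_dim + units) * units + units)

print(lstm_params(128, 128))             # 131584
```

A single 128-unit LSTM fed by 128-dimensional inputs already has about 130k parameters. The full model in the question has further layers, so its total is larger than that, which only strengthens the point about model size versus 2060 training texts.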

How to combine embedding vectors of BERT with other features?

I am working on a classification task with 3 labels (0,1,2 = neg, pos, neu). The data are sentences. To produce vectors/embeddings of the sentences, I use a BERT encoder to get an embedding for each sentence, and then I use a simple kNN to make predictions.
Each sentence has a label and other numerical classification values.
For example, my data look like this:
Sentence | embeddings_BERT | level | sub-level | label
je mange | [0.21, 0.56] | 2 | 2.1 | pos
il hait | [0.25, 0.39] | 3 | 3.1 | neg
.....
As you can see, each sentence has other categories. They are not the final label, but indicators that helped the human annotators decide on the label. I want my model to take those two values into consideration when predicting the label. I was wondering whether I have to concatenate them with the embeddings generated by the BERT encoder, or whether there is another way?

There is no single perfect way to tackle this problem, but a simple solution is to concatenate the BERT embeddings with the hard-coded features. The BERT embeddings (sentence embeddings) will be of dimension 768 (if you have used BERT base). These embeddings can be treated as features of the sentence itself. The additional features can be concatenated to them to form a higher-dimensional vector. If the features are categorical, it is best to convert them to one-hot vectors before concatenating. For example, if you want to use level from your example as an input feature, convert it into a one-hot feature vector and then concatenate it with the BERT embeddings. However, in some cases your hard-coded features can dominate and bias the classifier, and in other cases they can have no influence at all. It all depends on the data that you have.
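A minimal sketch of that concatenation, using the two rows from the question. The 2-dimensional embeddings are the question's toy values, not real BERT output, and the `one_hot` helper and `rows` structure are assumptions made for illustration.

```python
def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

# Hypothetical 2-dimensional embeddings, as in the question's table;
# real BERT-base sentence embeddings would have 768 dimensions.
rows = [
    {"embedding": [0.21, 0.56], "level": 2, "label": "pos"},
    {"embedding": [0.25, 0.39], "level": 3, "label": "neg"},
]

levels = sorted({row["level"] for row in rows})  # [2, 3]

# Feature vector = sentence embedding followed by the one-hot level.
features = [row["embedding"] + one_hot(row["level"], levels) for row in rows]
print(features[0])  # [0.21, 0.56, 1.0, 0.0]
print(features[1])  # [0.25, 0.39, 0.0, 1.0]
```

With a distance-based classifier such as kNN, the relative scale of the two parts affects how much each contributes, so it may be worth experimenting with scaling the one-hot block.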

How to use the ground truth answers from the previous time step for the decoders

I found this on an article here:
Decoder takes the hidden state of the last Encoder RNN cell as the initial state of its first RNN cell along with the token as the initial input to produce an output sequence. We use Teacher Forcing for faster and efficient training of the decoder. Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the ground truth from a prior time step as input. In this method, the right answer is given as the beginning of training so that the model will train quickly and efficiently.
Now, I am unable to use the Teacher Forcing Method for the decoder. Can someone help me out?
What I have done so far
I take many English sentences, tokenize them, and keep the result in a variable X. The shape of X is [7000, 5], which means that there are 7000 English sentences, each of which has 5 words in it. I do the same with the German sentences and keep the result in a variable Y, which has a shape of [7000, 10, 1], which means that there are 7000 German sentences in total, each having 10 words in it (I am not one-hot encoding it, as I am using sparse categorical cross-entropy as the loss function).
Then, I defined my model which is this:
Model = Sequential([
Embedding(english_vocab_size, 256, input_length=english_max_len, mask_zero=True),
LSTM(256, activation='relu'),
RepeatVector(german_max_len),
LSTM(256, activation='relu', return_sequences=True),
Dense(german_vocab_size, activation='softmax')
])
The english_vocab_size and german_vocab_size are the total numbers of English and German words present in the English and German vocabularies. The english_max_len and german_max_len are the numbers of words in each English and German sentence.
Now, from here what should I do to use the Teacher Forcing Technique?
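The data-preparation side of teacher forcing, as described in the quoted passage, can be sketched as follows. The decoder input is the target sequence shifted right by one step, so that at each step the decoder receives the ground-truth token of the previous step. The `START` and `PAD` ids and the helper name are hypothetical.

```python
START, PAD = 1, 0  # hypothetical reserved token ids

def make_decoder_io(target_ids):
    """Return (decoder_input, decoder_target) for one padded target sequence.

    decoder_input is the target shifted right by one step with START in front,
    so at step t the decoder sees the true token from step t-1.
    """
    decoder_input = [START] + target_ids[:-1]
    return decoder_input, target_ids

german = [5, 9, 4, 7, 0, 0]  # toy padded German sentence
dec_in, dec_out = make_decoder_io(german)
print(dec_in)   # [1, 5, 9, 4, 7, 0]
print(dec_out)  # [5, 9, 4, 7, 0, 0]
```

Feeding `dec_in` to the decoder requires a model with two inputs (the English sentence and the shifted German sentence). A Sequential model built around RepeatVector takes only one, so this would typically mean moving to the Keras functional API.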

How to generate independent(X) variable using Word2vec?

I have a movie review data set which has two columns Review(Sentences) and Sentiment(1 or 0).
I want to create a classification model using word2vec for the embedding and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show similar words, like this:
model= gensim.models.Word2Vec(cleaned_dataset, min_count = 2, size = 100, window = 5)
words= model.wv.vocab
simalar= model.wv.most_similar("bad")
I already have my dependent variable(y) which is my 'Sentiment' column all I need is the independent variable(X) which I can pass on to my CNN model.
Before using word2vec I used the Bag Of Words(BOW) model which generated a sparse matrix which was my independent(X) variable. How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.

To get the word vector, you have to do this:
model.wv['word_that_you_want']
You may also want to handle the KeyError that could arise if you don't find that given word in your model. You also might want to read about what an embedding layer is, which is usually used as the first layer of the neural network (for NLP generally) and is basically a lookup mapping of a word to its corresponding word vector.
To get the word vectors for an entire sentence, you need to first initialize a numpy array of zeros to the dimensions you want.
You might need other variables such as the length of the longest sentence so that you can pad all sentences to that length. The documentation of the pad_sequences method for Keras is here.
A simple example of getting a sentence of word vectors is:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over the index of embedding_matrix and add to it, if you find a word vector in your model.
I use this resource which has a lot of examples and I have referenced some of the code there (which I have also used myself sometimes):
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    # Look the word up in your w2v model; rows for unknown words stay zero
    if word in model.wv:
        embedding_matrix[index] = model.wv[word]
And in your model (I'm assuming Tensorflow with Keras)
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
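To make the "sentence of word vectors" idea concrete, here is a dependency-free sketch that turns one tokenized sentence into a fixed-size matrix, with zero rows for unknown words and for padding. The plain dict `toy_vectors` stands in for a trained word2vec model, and `sentence_matrix` is a hypothetical helper. In practice a NumPy array would be used.

```python
def sentence_matrix(words, word_vectors, max_len, dim):
    """Return max_len rows of dim floats: one word vector per position.

    Unknown words and padding positions stay as zero vectors.
    """
    matrix = [[0.0] * dim for _ in range(max_len)]
    for position, word in enumerate(words[:max_len]):
        vector = word_vectors.get(word)  # None instead of a KeyError
        if vector is not None:
            matrix[position] = list(vector)
    return matrix

# Hypothetical 3-dimensional vectors standing in for a trained word2vec model.
toy_vectors = {"bad": [0.9, -0.1, 0.0], "movie": [0.2, 0.4, 0.1]}
m = sentence_matrix(["bad", "unknownword", "movie"], toy_vectors, max_len=4, dim=3)
print(m)
# [[0.9, -0.1, 0.0], [0.0, 0.0, 0.0], [0.2, 0.4, 0.1], [0.0, 0.0, 0.0]]
```

Stacking one such matrix per review gives an input of shape (number of reviews, max_len, dim), which is what a 1D CNN would consume when no Embedding layer is used.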
I hope this helps.

Word2Vec doesn't inherently create vectors for a text (set of words) – just individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
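The filtering mentioned above could be done along these lines. This is a plain-Python sketch in which the dict `toy_vectors` stands in for `model.wv`. The helper name and the zero-vector fallback for texts with no known words are assumptions.

```python
def average_vector(words, word_vectors, dim):
    """Mean of the vectors of known words; zero vector if none are known."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 2-dimensional vectors standing in for a trained model.
toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}
print(average_vector(["good", "film", "unseen"], toy_vectors, dim=2))  # [0.5, 0.5]
```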

Character embeddings with Keras

I am trying to implement the type of character level embeddings described in this paper in Keras. The character embeddings are calculated using a bidirectional LSTM.
To recreate this, I've first created a matrix containing, for each word, the indexes of the characters making up the word:
char2ind = {char: index for index, char in enumerate(chars)}
max_word_len = max([len(word) for sentence in sentences for word in sentence])
X_char = []
for sentence in X:
    for word in sentence:
        word_chars = []
        for character in word:
            word_chars.append(char2ind[character])
        X_char.append(word_chars)
X_char = sequence.pad_sequences(X_char, maxlen = max_word_len)
I then define a BiLSTM model with an embedding layer for the word-character matrix. I assume the input_dimension will have to be equal to the number of characters. I want a size of 64 for my character embeddings, so I set the hidden size of the BiLSTM to 32:
char_lstm = Sequential()
char_lstm.add(Embedding(len(char2ind) + 1, 64))
char_lstm.add(Bidirectional(LSTM(hidden_size, return_sequences=True)))
And this is where I get confused. How can I retrieve the embeddings from the model? I'm guessing I would have to compile the model and fit it then retrieve the weights to get the embeddings, but what parameters should I use to fit it ?
Additional details:
This is for an NER task, so the dataset technically could be anything in the word - label format, although I am specifically working with the WikiGold ConLL corpus available here: https://github.com/pritishuplavikar/Resume-NER/blob/master/wikigold.conll.txt
The expected output from the network are the labels (I-MISC, O, I-PER...)
I expect the dataset to be large enough to train character embeddings directly from it. All words are encoded with the indexes of their constituent characters; the alphabet size is roughly 200 characters. The words are padded / cut to 20 characters. There are around 30 000 different words in the dataset.
I hope to be able to learn embeddings for each character based on the info from the different words. Then, as in the paper, I would concatenate the character embeddings with the word's GloVe embedding before feeding them into a Bi-LSTM network with a final CRF layer.
I would also like to be able to save the embeddings so I can reuse them for other similar NLP tasks.

Generally speaking, the Keras approach to building models (even seemingly complex ones) is dead simple. For example, the kind of model you want to build would simply look like this (note that this is for a binary classification problem):
model = Sequential()
model.add(Embedding(max_features, out_dims, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
This is no different from a plain vanilla NN, with the exception of having the Embedding and Bidirectional layers in place of Dense layers. This is one of the things that makes Keras amazing.
Usually it's helpful to look for a working example (Keras has loads) that is doing more or less the same thing you are trying to do. In this case you could first look at this model and then "reverse engineer" its workings to answer your questions. Usually things come down to formatting the data in the right way, and a working example model works wonders here, as you can carefully investigate the data format it's using.
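On the data-formatting point, one detail worth checking in a character-level setup is whether index 0 is shared between a real character and the padding value. The sketch below builds the word-by-character index matrix in plain Python with indices starting at 1, so that 0 is used for padding only. The toy `sentences`, the `encode_word` helper and the left-padding choice are assumptions.

```python
sentences = [["John", "lives"], ["in", "Oslo"]]  # toy tokenized data

chars = sorted({ch for sentence in sentences for word in sentence for ch in word})
# Start at 1 so that 0 can be used only for padding.
char2ind = {ch: index for index, ch in enumerate(chars, start=1)}

max_word_len = max(len(word) for sentence in sentences for word in sentence)

def encode_word(word):
    """Character indices, left-padded with 0 to max_word_len."""
    ids = [char2ind[ch] for ch in word]
    return [0] * (max_word_len - len(ids)) + ids

X_char = [encode_word(word) for sentence in sentences for word in sentence]
print(len(X_char), len(X_char[0]))  # 4 5
```

With this numbering, an Embedding layer with `len(char2ind) + 1` input rows covers every character plus the padding index.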
