I'm really stuck building a NN for text classification with Keras, using an LSTM and adding an attention layer on top. I'm sure I am pretty close, but I'm confused:
Do I have to add a TimeDistributed Dense layer after the LSTM?
And how do I retrieve the attention weights from my network (for visualization purposes), so that I know which sentence was 'responsible' for the document being classified as good or bad?
Say I have 10 documents consisting of 100 sentences each, and each sentence is represented as a 500-element vector. So my document matrix containing the sentence sequences looks like: X = np.array(Matrix).reshape(10, 100, 500)
The documents should be classified according to sentiment, 1 = good, 0 = bad, so
y= [1,0,0,1,1]
yy= np.array(y)
I don't need an embedding layer because each sentence of each document is already a sparse vector.
The attention layer is taken from: https://github.com/richliao/textClassifier/blob/master/textClassifierHATT.py
MAX_SENTS = 100
MAX_SENT_LENGTH = 500
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH))
l_lstm_sent = LSTM(100, activation='tanh', return_sequences=True)(review_input)
l_att_sent = AttLayer(100)(l_lstm_sent)
preds = Dense(1, activation='softmax')(l_att_sent)
model = Model(review_input, preds)
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
model.fit(X, yy, nb_epoch=10, batch_size=50)
So I think my model should be set up correctly, but I'm not quite sure. How do I get the attention weights from it (e.g. so I know which sentence caused a classification as 1)? Help much appreciated.
1. Time distributed
In this case, you don't have to wrap Dense into TimeDistributed, although it may be a little bit faster if you do, especially if you can provide a mask that masks out a large part of the LSTM output.
However, Dense operates on the last dimension, regardless of what the shape in front of the last dimension is.
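For instance, a minimal sketch using the shapes from the question (the variable names here are illustrative, not from the original code) shows that both variants produce the same output shape:
from keras.layers import Input, LSTM, Dense, TimeDistributed
from keras.models import Model

x = Input(shape=(100, 500))
h = LSTM(100, return_sequences=True)(x)      # shape (batch, 100, 100)
a = Dense(1)(h)                              # Dense applied to the last axis only
b = TimeDistributed(Dense(1))(h)             # wrapped version, same result
print(Model(x, a).output_shape)              # (None, 100, 1)
print(Model(x, b).output_shape)              # (None, 100, 1)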
2. Attention weights
Yes, it is as you suggest in the comment. You need to modify the AttLayer so that it is capable of returning both its output and the attention weights:
return output, ait
And then create a model that contains both prediction and attention weight tensors and get the predictions for them:
l_att_sent, att_weights = AttLayer(100)(l_lstm_sent)
...
predictions, att_weights = attmodel.predict(X)
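As a rough illustration of the whole idea (this is my own sketch, not the code from the linked repository; the class name AttLayerWithWeights and the weight names are made up, and the softmax-weighted sum just follows the usual hierarchical-attention recipe), the two-output layer and the models could look like this:
from keras import backend as K
from keras.layers import Layer, Input, LSTM, Dense
from keras.models import Model

class AttLayerWithWeights(Layer):
    def __init__(self, attention_dim, **kwargs):
        self.attention_dim = attention_dim
        super(AttLayerWithWeights, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape is (batch, sentences, features)
        self.W = self.add_weight(name='W', shape=(input_shape[-1], self.attention_dim),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(name='b', shape=(self.attention_dim,),
                                 initializer='zeros', trainable=True)
        self.u = self.add_weight(name='u', shape=(self.attention_dim, 1),
                                 initializer='glorot_uniform', trainable=True)
        super(AttLayerWithWeights, self).build(input_shape)

    def call(self, x):
        uit = K.tanh(K.bias_add(K.dot(x, self.W), self.b))       # (batch, sentences, attention_dim)
        ait = K.softmax(K.squeeze(K.dot(uit, self.u), axis=-1))  # (batch, sentences) attention weights
        output = K.sum(x * K.expand_dims(ait), axis=1)           # weighted sum over sentences
        return [output, ait]

    def compute_output_shape(self, input_shape):
        return [(input_shape[0], input_shape[-1]), (input_shape[0], input_shape[1])]

review_input = Input(shape=(100, 500))
l_lstm_sent = LSTM(100, activation='tanh', return_sequences=True)(review_input)
l_att_sent, att_weights = AttLayerWithWeights(100)(l_lstm_sent)
preds = Dense(1, activation='sigmoid')(l_att_sent)    # sigmoid, not softmax, for a single binary output

model = Model(review_input, preds)                    # train this one as before
attmodel = Model(review_input, [preds, att_weights])  # shares the same trained layers
predictions, att = attmodel.predict(X)                # att[i, j] = weight of sentence j in document i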
Related
I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles, with the aim of classifying them based on political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy curves looked very off.
I tried to play around with different Dropout probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the maximum number of words and the maximum sequence length.
I have managed to get the curves to align a bit more; however, that has led to the model having lower accuracy on the training data (and the overfitting problem is still bad).
Additionally, I am not sure why the validation accuracy always seems to be higher than the training accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros.
I also did a third attempt, which was just the second attempt with the number of epochs increased.
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt clearly shows overfitting, with early divergence of your test and train loss. I would try a lower learning rate here (in addition to the steps you took for regularisation with dropout layers); using the default rate does not guarantee the best results (see the sketch after this list).
Allow your model to find the global minimum / avoid getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict, i.e. increase the patience threshold. Consider other optimisers as well, including SGD with a learning rate scheduler.
Too large a network leads to overfitting on the training set and difficulty in generalisation. Too many neurons may cause the network to 'memorize' your entire training set and overfit. I would try 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing & cleaning. Check your pad_sequences call: it is probably padding the start of each text with zeros. I would pad after the text instead (padding='post').
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get a sizable amount of training text (empirically >= 1M words). I would also try several techniques such as feature engineering and improving data quality, e.g. spell checking. Are the classes imbalanced? You may need to balance them out by over-/undersampling.
Consider using transfer learning and incorporating a trained language model as your embedding layer instead of training one from scratch, e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
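A small sketch pulling several of these suggestions together (the values here are illustrative starting points, not tuned settings; it assumes the X, X_train, Y_train and constants from the question's code):
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

# pad after the text instead of before it
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')
# (then re-run train_test_split on the re-padded X)

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(16, dropout=0.2, recurrent_dropout=0.2))   # far fewer units than 64
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))

# lower learning rate than Adam's default of 0.001
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])

# a less strict early stop, keeping the best weights seen so far
history = model.fit(X_train, Y_train, epochs=25, batch_size=64, validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=6, restore_best_weights=True)])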
I'm new to Keras and the deep learning field. I want to make a dense vector for each document in my data, so I built a simple autoencoder using the Keras library.
The input data are normalized using Word2vec with an embedding size of 200, and all features are between -1 and 1. I prepared a 3D tensor that contains 137 samples (the number of documents) by 469 columns (the maximum number of words), with the embedding size as the third dimension. I used the MSE loss function and a GRU as the recurrent neural network. I am getting the same vector for all documents as the autoencoder's prediction output, while the loss starts at a very low value and becomes constant after a few epochs.
I tried different numbers of epochs but got the same thing. I also tried changing the batch size, but nothing changed. Can anyone help me find the problem, please?
input = Input(shape=(469,200))
encoder = GRU(120,activation='sigmoid',dropout=0.2)(input)
neck = Dense(20)(encoder)
decoder1 = RepeatVector(469)(neck)
decoder1 = GRU(120,return_sequences=True,activation='sigmoid',dropout=0.2)(decoder1)
decoder1 = TimeDistributed(Dense(200,activation='tanh'))(decoder1)
model = Model(inputs=input, outputs=decoder1)
model.compile(optimizer='adam', loss='mse')
history = model.fit(x_train, x_train,validation_data=(x_test,x_test) ,epochs=10, batch_size=8)
This is the input data x_train:
print(model.predict(x_train)) returns these values (the same vector for every sample):
Why does model.predict(x_train) return the same vector for all 137 samples?
Thank you in advance.
I have some text without any labels. Just a bunch of text files. And I want to train an Embedding layer to map the words to embedding vectors. Most of the examples I've seen so far are like this:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['acc'])
model.fit(x_train, y_train,
epochs=10,
batch_size=32,
validation_data=(x_val, y_val))
They all assume that the Embedding layer is part of a bigger model which tries to predict a label. But in my case, I have no labels. I'm not trying to classify anything. I just want to train the mapping from words (more precisely, integers) to embedding vectors. But the fit method of the model asks for x_train and y_train (as in the example given above).
How can I train a model only with an Embedding layer and no labels?
[UPDATE]
Based on the answer I got from @Daniel Möller, the Embedding layer in Keras implements a supervised algorithm and thus cannot be trained without labels. Initially, I was thinking that it is a variation of Word2Vec and thus does not need labels to be trained. Apparently, that's not the case. Personally, I ended up using FastText, which has nothing to do with Keras or Python.
Does it make sense to do that without a label/target?
How will your model decide which values in the vectors are good for anything if there is no objective?
All embeddings are "trained" for a purpose. If there is no purpose, there is no target, if there is no target, there is no training.
If you really want to transform words in vectors without any purpose/target, you've got two options:
Make one-hot encoded vectors. You may use the Keras to_categorical function for that.
Use a pretrained embedding. There are some available, such as GloVe, embeddings from Google, etc. (All of them were trained at some point for some purpose.)
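A short sketch of both options (vocab_size, word_ids and embedding_matrix are placeholders you would build from your own vocabulary, e.g. from GloVe vectors for the second option):
import numpy as np
from keras.utils import to_categorical
from keras.layers import Embedding

# Option 1: one-hot encode integer word ids
word_ids = np.array([3, 17, 5])                     # hypothetical ids
one_hot = to_categorical(word_ids, num_classes=vocab_size)

# Option 2: load a pretrained matrix into a frozen Embedding layer
embedding_layer = Embedding(vocab_size, 100,
                            weights=[embedding_matrix],  # (vocab_size, 100) array built beforehand
                            trainable=False)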
A very naive approach based on our chat, considering word distance
Warning: I don't really know anything about Word2Vec, but I'll try to show how to add the rules for your embedding using some naive kind of word distance and how to use dummy "labels" just to satisfy Keras' way of training.
from keras.layers import Input, Embedding, Subtract, Lambda
import keras.backend as K
from keras.models import Model
input1 = Input((1,)) #word1
input2 = Input((1,)) #word2
embeddingLayer = Embedding(...params...)
word1 = embeddingLayer(input1)
word2 = embeddingLayer(input2)
#naive distance rule, subtract, expect zero difference
word_distance = Subtract()([word1,word2])
#reduce all dimensions to a single dimension
word_distance = Lambda(lambda x: K.mean(x, axis=-1))(word_distance)
model = Model([input1,input2], word_distance)
Now that our model outputs a word distance directly, our labels will be "zero". They're not really labels for supervised training, but they are the expected result of the model, which is necessary for Keras to work.
We can use mae (mean absolute error) or mse (mean squared error) as the loss function, for instance.
model.compile(optimizer='adam', loss='mse')
And we train with word2 being the word that comes right after word1:
xTrain = entireText
xTrain1 = entireText[:-1]
xTrain2 = entireText[1:]
yTrain = np.zeros((len(xTrain1),))
model.fit([xTrain1,xTrain2], yTrain, .... more params.... )
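After fitting, the learned vectors can be read back from the shared embedding layer (get_weights is standard Keras; some_word_id below is just whatever integer encodes a word of interest):
vectors = embeddingLayer.get_weights()[0]   # shape (vocabulary_size, embedding_dim)
word_vector = vectors[some_word_id]         # the learned vector for that word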
Although this may be completely wrong regarding what Word2Vec really does, it shows the main points, which are:
Embedding layers don't have special properties, they're just trainable lookup tables
Rules for creating an embedding should be defined by the model and expected outputs
A Keras model will need "targets", even if those targets are not "labels" but a mathematical trick for an expected result.
I have built and trained a sequential binary classification model using Keras layers. Everything seems to work fine until I start using the predict method. This method gives me a weird exponential value rather than the probabilities of the two classes.
This is what I get after training and using the predict method on the model.
This classification model has two classes, let's say a cat or a dog, so I was expecting the result to be something like [99.9999, 0.0001] suggesting that it's a cat. I'm not sure how to interpret the value that I'm getting back instead.
Here is the code I have:
# Get the data.
(train_texts, train_labels), (val_texts, val_labels) = data
train_labels = np.asarray(train_labels).astype('float32')
val_labels = np.asarray(val_labels).astype('float32')
# Vectorizing data
train_texts,val_texts, word_index = vectorize_data.sequence_vectorize(
train_texts, val_texts)
# Building the model architecture( adding layers to the model)
model = build_model.simple_model_layers(train_texts.shape[1:])
# Setting and compiling with the features like the optimizer, loss and metrics functions
model = build_model.simple_model_compile(model=model)
# This is when the learning happens
history = model.fit(train_texts,
train_labels,
epochs=EPOCHS,
validation_data=(val_texts, val_labels),
verbose=VERBOSE_OFF, batch_size=BATCH_SIZE)
print('Validation accuracy: {acc}, loss: {loss}'.format(
acc=history.history['val_acc'][-1], loss=history.history['val_loss'][-1]))
# loading data to predict on
test_text = any
with open('text_req.pickle', 'rb') as pickle_file:
test_text = pickle.load(pickle_file)
print('Lets make a prediction of this requirement:')
prediction = model.predict(test_text, batch_size=None, verbose=0, steps=None)
print(prediction)
Here is what the simple model function looks like:
model = models.Sequential()
model.add(Dense(26, activation='relu', input_shape=input_shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
return model
Gradient descent functions:
optimizer='adam', loss='binary_crossentropy'
The sample data is of string type, which I convert to constant-size matrices of 1's and 0's using padding. The data has two classes, so the labels are simply 1 and 0. That's all for the data. In my opinion, the data doesn't seem to be the problem; it could be something more trivial that I'm overlooking and have failed to recognize.
Thank you guys, this last problem was resolved, but I need a better understanding of this:
I read that sigmoid returns the probability of each possible class and that all the probabilities should add up to 1. The values that I am getting back are:
Validation accuracy: 0.792168688343232, loss: 2.8360600299145804
Let's make a prediction of this requirement:
[[2.7182817, 1. ]
[2.7182817, 1. ]
[1., 2.7182817]
[1. , 2.7182817]]
They don't add up to 1, and just looking at these values, 1 or otherwise, it is not intuitive what to make of them.
Your model only has one output. If your training labels are set to 0 for cat and 1 for dog, then the network thinks it's a cat if the output is [[2.977094e-12]]. If you want the probabilities of the two classes like you were expecting, then you need to change the output of your model as follows:
model = models.Sequential()
model.add(Dense(26, activation='relu', input_shape=input_shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(2, activation='softmax'))
Your labels would also need to change to [1, 0] and [0, 1] for cat and dog.
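A minimal way to do that conversion, assuming the train_labels/val_labels arrays from the question (to_categorical is the standard Keras helper):
from keras.utils import to_categorical

# 0 = cat, 1 = dog  ->  [1, 0] = cat, [0, 1] = dog
train_labels = to_categorical(train_labels, num_classes=2)
val_labels = to_categorical(val_labels, num_classes=2)

# with a 2-unit softmax output, compile with categorical_crossentropy
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])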
I want to clarify that you don't get a weird exponential value, you just get a weird value. The e is scientific notation for "times ten to the power of", so you are basically getting 2.7 x 10^-12. I'd love to help, but I can't check your data or your model. I tried to Google some parts of your code in the hope of finding some clarification, but I can't seem to find out what's under the hood of these two lines:
model = build_model.simple_model_layers(train_texts.shape[1:])
model = build_model.simple_model_compile(model=model)
I've no clue what network has been built. I'd like to know at least the loss function and the full final layer; that would already be much more to go by. Are you also sure that your data is correct?
EDIT:
Sigmoid does not do what you describe; softmax does. Sigmoid is often used for multilabel classification, since it can mark multiple labels as True. A sigmoid output could look like [0.99, 0.3], for example; it looks at each label separately. Softmax, on the other hand, doesn't: a softmax output could look like [0.99, 0.01], and the sum of all probabilities is always 1.
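A tiny numeric sketch of that difference (the logits are made-up numbers):
import numpy as np

logits = np.array([4.6, -1.2])                    # hypothetical raw scores for two labels
sigmoid = 1 / (1 + np.exp(-logits))               # each label judged independently
softmax = np.exp(logits) / np.exp(logits).sum()   # one distribution over all labels

print(sigmoid)   # ~[0.99, 0.23]  -> does not sum to 1
print(softmax)   # ~[0.997, 0.003] -> sums to 1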
That resolves that confusion. Now, about your output: I have no clue what that is; it should be between 0 and 1, unless I'm missing something here.
To answer the data question you asked K. Streutker:
The goal of a neural network is to reproduce the labels you feed it, on new data. If you want a probability distribution, then you also need to feed one. Every cat image should have the label [1, 0] and every dog image [0, 1], or reversed, whatever you like. Then, once the model is trained, it will be able to give you two outputs that make sense. The loss function, most likely cross entropy, takes these labels and the output of your model and tries to minimize the difference over time. So this is roughly what you need:
image (dog)--> model --> loss --> optimizer that updates the weights
labels ([0,1]) ------------------┘
then predicting will look like this
image --> model --> labels
Hope I helped a bit!
I'm a beginner with neural networks and am trying to predict temperature values (the output) from 5 inputs in Python. I used the Keras package in Python to build the neural networks.
I used two algorithms, a feedforward neural network (regression) and a recurrent neural network (LSTM), to predict the values. However, neither algorithm worked well for forecasting.
For the feedforward neural network (regression), I used 3 hidden layers (with 100, 200 and 300 neurons), as in the code below:
def baseline_model():
# create model
model = Sequential()
model.add(Dense(100, input_dim=5, kernel_initializer='normal', activation='sigmoid'))
model.add(Dense(200, kernel_initializer = 'normal', activation='sigmoid'))
model.add(Dense(300, kernel_initializer = 'normal', activation='sigmoid'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
return model
df = DataFrame({'Time': TIME_list, 'input1': input1_list, 'input2': input2_list, 'input3': input3_list, 'input4': input4_list, 'input5': input5_list, 'output': output_list})
df.index = pd.to_datetime(df.Time)
df = df.values
#Setting training data and test data
train_size_x = int(len(df)*0.8) #The user can change the range of training data
print(train_size_x)
X_train = df[0:train_size_x, 0:5]
t_train = df[0:train_size_x, 6]
X_test = df[train_size_x:int(len(df)), 0:5]
t_test = df[train_size_x:int(len(df)), 6]
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)
#Regression in Keras package
clf = KerasRegressor(build_fn=baseline_model, nb_epoch=50, batch_size=5, verbose=0)
clf.fit(X_train,t_train)
res = clf.predict(X_test)
However, the error was quite big; the maximum absolute error was 78.4834. So I tried to minimize that error by changing the number of hidden layers or the number of neurons per hidden layer, but the error stayed about the same.
After the feedforward NN, I used a Recurrent Neural Network (LSTM) algorithm, which in my implementation predicts using only one input; in my case, that input is temperature. It gives me a much smaller error than the feedforward NN, but I have doubts about the LSTM I implemented, because it is a little ambiguous for my case: it didn't use the 5 inputs that affect the output (the temperature value), unlike the feedforward regression I implemented above.
And now I am lost as to what other kinds of algorithms I should use.
Any suggestions or ideas for my case..?
Thanks in advance.
I have to agree with the commenter on your question: you are jumping a little ahead of yourself. Neural networks can seem like black magic at times, and it's worth taking the time to understand what's actually going on under the hood. A good place to start learning and experimenting is sklearn, because you can try different techniques easily; this will help you quickly learn how to structure your problems. There is also an abundance of info and tutorials.
From there, you will be better equipped to tackle your own NN from scratch. Additionally, sklearn has many useful functions to pre-process/normalize your training data, which is a whole art in itself.
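For instance, a minimal sketch of the kind of quick experiment being suggested (scaling plus a small regressor; X_train, t_train, X_test and t_test are the arrays from the question, and the MLPRegressor settings are arbitrary starting points):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

# scaling and the model in one object, so preprocessing is never forgotten
reg = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=1000, random_state=7))
reg.fit(X_train, t_train)

pred = reg.predict(X_test)
print(mean_absolute_error(t_test, pred))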
There are tons of good networks already available for common situations. Most of the work is in choosing the right structure for your problem, getting good data to train on, and massaging that data so it can be utilized properly.
Check it out... http://scikit-learn.org/stable/