How to encode text for an nlp in tensorflow


model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["acc"])
train_data, train_labels, test_data, test_labels = getdata()
model.fit(train_data, train_labels, epochs=10, batch_size=BATCH_SIZE, shuffle=True)
results = model.evaluate(test_data, test_labels)
print(results)
This is my basic TensorFlow model. I have my datasets, but I'm completely unsure how to convert the strings in the data into something the network can take as input. I'm very new to TF, by the way.

You can use the gensim library to embed text without losing its meaning or word order. Example code is below. Feel free to adapt it to your needs: choose the algorithms, optimizer, etc. as your task requires.
from numpy import zeros, asarray
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, GlobalMaxPool1D, Dropout
from tensorflow.keras.optimizers import Adadelta
# first, we initialize a tokenizer which will convert text corpus to numbers (vectorization)
reviews_text = ["list of strings you want to analyze"]
sentiment = <array of labels/target>
top_words = 1500
tokenizer = Tokenizer(num_words=top_words)
# second, we update the vocabulary of the tokenizer by providing text
tokenizer.fit_on_texts(reviews_text) # fit on reviews before converting to matrix
# third, we transform each text into a sequence of integers
reviews_seq = tokenizer.texts_to_sequences(reviews_text)
len(tokenizer.word_index) # size of the tokenizer's vocabulary, i.e. the number of unique words
# in order for the math to work, we convert each review into same length. Padding...
max_len = 300
reviews_seq = pad_sequences(reviews_seq, maxlen=max_len)
# word2vec requires list of lists as input
documents = [doc.split(' ') for doc in reviews_text]
# gensim model training, it will translate a word into 128 numbers
embedding_size = 128
window_size = 128
w2v = Word2Vec(
    documents,                   # input list of lists
    min_count=3,                 # a word must appear 3 times or more to be trained on
    # workers=3,                 # CPU cores to be used for training the model
    vector_size=embedding_size,  # no. of numbers used to represent a word (named `size` before gensim 4.0)
    window=window_size,          # how many neighbors to look at on either side of the word
    epochs=20,                   # no. of passes over the documents (named `iter` before gensim 4.0)
)
w2v.wv.vectors.shape # shape of the model
# create embeddings to be added into keras model
embedding_matrix = zeros((top_words + 1, embedding_size))
# build matrix from pre-trained word to vec model
for word, i in sorted(tokenizer.word_index.items(), key=lambda x: x[1]):
    if i > top_words:
        break
    if word in w2v.wv:  # words dropped by min_count have no vector and stay all-zero
        embedding_vector = w2v.wv[word]
        embedding_matrix[i] = embedding_vector
model = Sequential()
model.add(Embedding(top_words + 1, embedding_size, input_length=max_len, weights=[embedding_matrix], trainable=False))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
optimizer = Adadelta(learning_rate=0.001, rho=0.92)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(reviews_seq, asarray(sentiment), validation_split=0.2, epochs=1000, batch_size=32)
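
The tokenize-and-pad steps above boil down to two small operations, sketched here in plain Python so the idea is visible without any library. The helper names `build_vocab`, `encode` and `pad` are made up for this illustration; they mimic (but are not) the Keras `Tokenizer` and `pad_sequences`, and they assume whitespace tokenization, index 0 reserved for padding and index 1 for unknown words.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed convention: 0 = padding, 1 = out-of-vocabulary

def build_vocab(texts, max_words):
    """Map the most frequent words to integer ids starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}

def encode(text, vocab):
    """Replace each word by its id, or OOV if it is not in the vocabulary."""
    return [vocab.get(word, OOV) for word in text.lower().split()]

def pad(seq, max_len):
    """Left-pad with PAD (or keep only the last max_len ids) so all rows match."""
    seq = seq[-max_len:]
    return [PAD] * (max_len - len(seq)) + seq

texts = ["the movie was great", "the movie was bad", "great cast"]
vocab = build_vocab(texts, max_words=10)
batch = [pad(encode(t, vocab), max_len=5) for t in texts + ["unseen words"]]
for row in batch:
    print(row)
```

Every row of `batch` has the same length, which is the shape an `Embedding` layer expects: one integer per token, one row per text.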

Related

How to train a keras tokenizer on a large corpus that doesn't fit in memory?

I am trying to train a language model that based on a 2-word input tries to predict a 1-word output. This is the model definition (all the layers are imported from keras.layers):
model = Sequential()
model.add(Embedding(vocab_size, 2, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
The problem is that my dataset has 87 million lines of 3-word data (2 for input, 1 for output) and it does not fit into my memory. I heard that keras.preprocessing.text.Tokenizer creates tokens based on their frequency in text. I am training my tokenizer like this:
tokenizer = Tokenizer(oov_token=1)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
How am I supposed to fit my tokenizer on all texts if they don't fit into memory?
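
One way to avoid loading everything at once is to build the vocabulary in a streaming pass and encode lazily afterwards. The sketch below uses only the standard library; `iter_lines`, `build_vocab_streaming` and `encode_stream` are hypothetical helpers, an in-memory `io.StringIO` stands in for the 87-million-line file, and every line is assumed to hold exactly three words. The Keras `Tokenizer` should allow the same pattern, since `fit_on_texts` iterates over whatever it is given and there is a `texts_to_sequences_generator` method, but check that against your Keras version.

```python
import io
from collections import Counter

def iter_lines(handle):
    """Yield one stripped, non-empty line at a time."""
    for line in handle:
        line = line.strip()
        if line:
            yield line

def build_vocab_streaming(lines, oov_token="<oov>"):
    """Count words in one pass; only the counts are held in memory."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    vocab = {oov_token: 1}  # id 0 is left free for padding
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab

def encode_stream(lines, vocab, oov_token="<oov>"):
    """Lazily turn each 3-word line into (two input ids, one target id)."""
    for line in lines:
        ids = [vocab.get(w, vocab[oov_token]) for w in line.lower().split()]
        yield ids[:2], ids[2]

corpus = io.StringIO("the cat sat\nthe cat ran\na dog sat\n")
vocab = build_vocab_streaming(iter_lines(corpus))
corpus.seek(0)  # second pass over the same data
pairs = list(encode_stream(iter_lines(corpus), vocab))
print(len(vocab), pairs[0])
```

In real use the second pass would feed batches to `model.fit` from a generator instead of collecting them with `list(...)`.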

NLP Sentiment Analysis net is not learning

I want to train a neural net for sentiment analysis. I followed the tutorials on the Keras website, but I had to adapt the code to my use case so that I can use the net afterwards.
For this purpose I decode the texts of the Keras IMDB dataset from numbers back to text, and then I stem the text because I need to work with stemmed text. After that, since I want to control how the embeddings are built rather than using texts_to_sequences and pad_sequences, I train Doc2Vec embeddings and apply them to the training set, so that I can obtain embeddings for the text I want to classify.
The problem is that the net does not learn anything: the accuracy does not improve and the loss does not decrease. I have tried many things, such as changing the architecture of the net and all the hyperparameters, changing the last layer from 2 units to 1, and switching from sparse_categorical_crossentropy to binary_crossentropy. Maybe somebody can help and shed some light on my problem. The code is below; thanks in advance.
from keras.datasets import imdb
max_features = 40000
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=max_features)
import numpy as np
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()])
decoded = " ".join([reverse_index.get(i - 3, "") for i in data[0]])
import nltk
from nltk.stem import LancasterStemmer
toke_corpus = list()
lan = LancasterStemmer()
from tqdm import tqdm
lista_reviews = list()
for review in tqdm(data):
    lista_reviews.append(np.array([lan.stem(reverse_index.get(i - 3, '')) for i in review][1:]))
train_x, test_x = lista_reviews[10000:], lista_reviews[:10000]
train_y, test_y = targets[10000:], targets[:10000]
from gensim.models.callbacks import CallbackAny2Vec
class EpochLogger(CallbackAny2Vec):
    '''Callback to log information about training'''
    def __init__(self):
        self.epoch = 0
    def on_epoch_begin(self, model):
        print("Epoch #{} start".format(self.epoch))
    def on_epoch_end(self, model):
        print("Epoch #{} end".format(self.epoch))
        self.epoch += 1
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(lista_reviews)]
print("DOcuments already built")
epoch_logger = EpochLogger()
model = Doc2Vec(documents, vector_size=512, window=5, min_count=3, workers=8, epochs = 7, callbacks=[epoch_logger])
encoded_x_train, encoded_x_test = list(), list()
from tqdm import tqdm
for i in tqdm(train_x):
    encoded_x_train.append(model.infer_vector(i))
for k in tqdm(test_x):
    encoded_x_test.append(model.infer_vector(k))
import keras
reduce_lr = keras.callbacks.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.50, patience=2, verbose=1, mode='auto', cooldown=0, min_lr=0.00001)
early = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=4, verbose=1, mode='auto')
from keras import models
from keras.models import Sequential
from keras import layers
from keras.layers import Embedding, Bidirectional, Dense, LSTM, Conv1D, MaxPooling1D, Flatten
model1 = Sequential()
model1.add(Embedding(input_dim = max_features, input_length=512, output_dim=128, trainable=False))
model1.add(Conv1D(filters=64,
                  kernel_size=5,
                  padding='valid',
                  activation='linear',
                  strides=1))
model1.add(MaxPooling1D(pool_size=4))
model1.add(Dense(64, activation='linear'))
model1.add(LSTM(32, activation='tanh'))
# model1.add(Dense(32, activation='relu'))
# model1.add(Flatten())
# model1.add(Dense(1, activation='sigmoid'))
model1.add(Dense(2, activation='softmax'))
model1.summary()
from keras import optimizers
# sgd = optimizers.SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True)
adam = optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, amsgrad=False)
model1.compile(loss='sparse_categorical_crossentropy',
               optimizer=adam,
               metrics=['accuracy'])
history = model1.fit(np.array(encoded_x_train), np.array(train_y),
                     epochs=20,
                     batch_size=500,
                     validation_data=(np.array(encoded_x_test), np.array(test_y)),
                     callbacks=[reduce_lr, early])

You use Doc2Vec to create one embedding per sample. For this reason, I don't think the Embedding, Conv1D and MaxPooling1D layers are useful in your network. They are useful with word2vec, where you extract an embedding for each token and use those inside the network.
Try feeding your embeddings directly to the network, like this:
model1 = Sequential()
model1.add(Dense(128, activation='relu', input_shape=(512,)))
# ....
model1.add(Dense(2, activation='softmax'))
adam = optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, amsgrad=False)
model1.compile(loss='sparse_categorical_crossentropy',
               optimizer=adam,
               metrics=['accuracy'])
history = model1.fit(np.array(encoded_x_train), np.array(train_y),
                     epochs=20,
                     batch_size=500,
                     validation_data=(np.array(encoded_x_test), np.array(test_y)),
                     callbacks=[reduce_lr, early])

How to implement Merge function in keras using functional API

print("Building model...")
ques1_enc = Sequential()
ques1_enc.add(Embedding(output_dim=64, input_dim=vocab_size, weights=[embedding_weights], mask_zero=True))
ques1_enc.add(LSTM(100, input_shape=(64, seq_maxlen), return_sequences=False))
ques1_enc.add(Dropout(0.3))
ques2_enc = Sequential()
ques2_enc.add(Embedding(output_dim=64, input_dim=vocab_size, weights=[embedding_weights], mask_zero=True))
ques2_enc.add(LSTM(100, input_shape=(64, seq_maxlen), return_sequences=False))
ques2_enc.add(Dropout(0.3))
model = Sequential()
model.add(Merge([ques1_enc, ques2_enc], mode="sum"))
model.add(Dense(2, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
print("Building model costs:", time.time() - start)
print("Training...")
checkpoint = ModelCheckpoint(filepath=os.path.join("C:/Users/", "quora_dul_best_lstm.hdf5"), verbose=1, save_best_only=True)
model.fit([x_ques1train, x_ques2train], ytrain, batch_size=32, epochs=1, validation_split=0.1, verbose=2, callbacks=[checkpoint])
print("Training neural network costs:", time.time() - start)
I want to convert the above code to the Keras functional API, because the Merge() function is not supported in the Sequential API. I have been trying for a long time but keep getting errors. Details of the variables:
ques_pairs contains the preprocessed data,
word2index contains the word count,
seq_maxlen contains the maximum length of question one or two.
I am trying to implement this model on the Quora Question Pairs dataset: https://www.kaggle.com/c/quora-question-pairs

I will give you a small example, that you can apply to your own model:
from keras.layers import Input, Dense, Add
from keras.models import Model
input1 = Input(shape=(16,))
output1 = Dense(8, activation='relu')(input1)
output1 = Dense(4, activation='relu')(output1) # Add as many layers as you like like this
input2 = Input(shape=(16,))
output2 = Dense(8, activation='relu')(input2)
output2 = Dense(4, activation='relu')(output2) # Add as many layers as you like like this
output_full = Add()([output1, output2])
output_full = Dense(1, activation='sigmoid')(output_full) # Add as many layers as you like like this
model_full = Model(inputs=[input1, input2], outputs=output_full)
You need to define an Input for each of your model parts first, then add layers (as shown in the code) to both models. Then you can add them using the Add layer. Finally you call Model with a list of the input layers and the output layer.
model_full can then be compiled and trained like any other model.

Are you trying to achieve something like the following?
import numpy as np
from tensorflow.keras.layers import Concatenate, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential, Model
vocab_size = 1000
seq_maxlen = 32
embedding_weights = np.zeros((vocab_size, 64))
print("Building model...")
ques1_enc = Sequential()
ques1_enc.add(Embedding(output_dim=64, input_dim=vocab_size, weights=[embedding_weights], mask_zero=True))
ques1_enc.add(LSTM(100, input_shape=(64, seq_maxlen), return_sequences=False))
ques1_enc.add(Dropout(0.3))
ques2_enc = Sequential()
ques2_enc.add(Embedding(output_dim=64, input_dim=vocab_size, weights=[embedding_weights], mask_zero=True))
ques2_enc.add(LSTM(100, input_shape=(64, seq_maxlen), return_sequences=False))
ques2_enc.add(Dropout(0.3))
merge = Concatenate(axis=1)([ques1_enc.output, ques2_enc.output])
output = Dense(2, activation="softmax")(merge)
model = Model([ques1_enc.input, ques2_enc.input], output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
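
Note that the original `Merge(..., mode="sum")` adds the two encodings element-wise, whereas `Concatenate` joins them end to end, so the two answers hand differently shaped inputs to the final `Dense` layer. The toy sketch below shows the difference with plain Python lists standing in for the two LSTM outputs (shortened to 4 values); `merge_sum` and `merge_concat` are names invented here, not Keras functions.

```python
def merge_sum(a, b):
    """Element-wise sum, like Keras Add(): output keeps the input width."""
    if len(a) != len(b):
        raise ValueError("sum merge needs equal-length vectors")
    return [x + y for x, y in zip(a, b)]

def merge_concat(a, b):
    """Concatenation, like Keras Concatenate(): widths add up."""
    return list(a) + list(b)

ques1_vec = [0.1, 0.2, 0.3, 0.4]  # stand-in for the first encoder's output
ques2_vec = [1.0, 1.0, 1.0, 1.0]  # stand-in for the second encoder's output

summed = merge_sum(ques1_vec, ques2_vec)
joined = merge_concat(ques1_vec, ques2_vec)
print(len(summed), len(joined))
```

To stay closest to the original code, the `Add()` layer is the functional counterpart of `mode="sum"`; `Concatenate` is a reasonable alternative but gives the next layer twice as many inputs.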

How to use keras embedding layer with 3D tensor input?

I am facing difficulty in using Keras embedding layer with one hot encoding of my input data.
Following is the toy code.
Import packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np
import openpyxl
import pandas as pd
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
The input data is text based as follows.
Train and Test data
X_train_orignal= np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])
X_test_orignal=np.array(['OC(=O)C1=CC=C(Cl)C=C1Cl', 'CCOC(N)=O',
'OC1=C(Cl)C(=C(Cl)C=C1Cl)Cl'])
Y_train=np.array(([[2.33],
[2.59],
[2.59],
[2.54],
[4.06]]))
Y_test=np.array([[2.20],
[2.81],
[2.00]])
Creating dictionaries
Now I create two dictionaries, mapping characters to indices and vice versa. The number of unique characters is len(charset), and the maximum string length plus 5 additional characters is stored in embed. Each string is prefixed with ! at the start and padded with E at the end.
charset = set("".join(list(X_train_orignal))+"!E")
char_to_int = dict((c,i) for i,c in enumerate(charset))
int_to_char = dict((i,c) for i,c in enumerate(charset))
embed = max([len(smile) for smile in X_train_orignal]) + 5
print (str(charset))
print(len(charset), embed)
One hot encoding
I convert all the train data into one hot encoding as follows.
def vectorize(smiles):
    one_hot = np.zeros((smiles.shape[0], embed, len(charset)), dtype=np.int8)
    for i, smile in enumerate(smiles):
        # encode the start char
        one_hot[i, 0, char_to_int["!"]] = 1
        # encode the rest of the chars
        for j, c in enumerate(smile):
            one_hot[i, j+1, char_to_int[c]] = 1
        # encode the end char
        one_hot[i, len(smile)+1:, char_to_int["E"]] = 1
    return one_hot[:, 0:-1, :]
X_train = vectorize(X_train_orignal)
print(X_train.shape)
X_test = vectorize(X_test_orignal)
print(X_test.shape)
After one-hot encoding, the shape of the data becomes (5, 44, 14) for train and (3, 44, 14) for test. For train, there are 5 examples, 44 is the padded sequence length, and 14 is the number of unique characters. Examples with fewer characters are padded with E up to the maximum length.
Verifying the correct padding
Following is the code to verify if we have done the padding rightly.
mol_str_train=[]
mol_str_test=[]
for x in range(5):
    mol_str_train.append("".join([int_to_char[idx] for idx in np.argmax(X_train[x,:,:], axis=1)]))
for x in range(3):
    mol_str_test.append("".join([int_to_char[idx] for idx in np.argmax(X_test[x,:,:], axis=1)]))
Let's see what the train set looks like.
mol_str_train
['!OC(=O)C1=C(Cl)C=CC=C1ClEEEEEEEEEEEEEEEEEEEE',
'!OC(=O)C1=C(Cl)C=C(Cl)C=C1ClEEEEEEEEEEEEEEEE',
'!OC(=O)C1=CC=CC(=C1Cl)ClEEEEEEEEEEEEEEEEEEEE',
'!OC(=O)C1=CC(=CC=C1Cl)ClEEEEEEEEEEEEEEEEEEEE',
'!OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=OEEE']
Now is the time to build model.
Model
model = Sequential()
model.add(Embedding(len(charset), 10, input_length=embed))
model.add(Flatten())
model.add(Dense(1, activation='linear'))
def coeff_determination(y_true, y_pred):
    from keras import backend as K
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr
optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])
callbacks_list = [
ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto',cooldown=0),
ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]
history =model.fit(x=X_train, y=Y_train,
batch_size=1,
epochs=10,
validation_data=(X_test,Y_test),
callbacks=callbacks_list)
Error
ValueError: Error when checking input: expected embedding_3_input to have 2 dimensions, but got array with shape (5, 44, 14)
The embedding layer expects a two-dimensional array. How can I deal with this issue so that it can accept the one-hot encoded data?
All the above code can be run.

The Keras embedding layer works with indices, not directly with one-hot encodings.
So you don't need to have (5,44,14), just (5,44) works fine.
E.g. get indices with argmax:
X_test = np.argmax(X_test, axis=2)
X_train = np.argmax(X_train, axis=2)
Although it's probably better to not one-hot encode it first =)
Besides that, your 'embed' variable says size 45, while your data is size 44.
If you change those, your model runs fine:
model = Sequential()
model.add(Embedding(len(charset), 10, input_length=44))
model.add(Flatten())
model.add(Dense(1, activation='linear'))
def coeff_determination(y_true, y_pred):
    from keras import backend as K
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr
optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])
callbacks_list = [
ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto',cooldown=0),
ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]
history =model.fit(x=np.argmax(X_train, axis=2), y=Y_train,
batch_size=1,
epochs=10,
validation_data=(np.argmax(X_test, axis=2),Y_test),
callbacks=callbacks_list)
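
Following the remark that one-hot encoding is unnecessary, the SMILES strings can be mapped straight to integer indices of shape (samples, length), which is what `Embedding` looks up. This is only a sketch in plain Python: `encode_smiles` is a hypothetical helper, it assumes the same `!` start marker and `E` padding as the question, and it sorts the charset so the mapping is reproducible between runs.

```python
def encode_smiles(smiles_list, char_to_int, length):
    """Return one list of character ids per string: '!' + chars + 'E' padding."""
    rows = []
    for smile in smiles_list:
        padded = ("!" + smile).ljust(length, "E")
        rows.append([char_to_int[c] for c in padded])
    return rows

train = ["OC(=O)C1=C(Cl)C=CC=C1Cl", "CCOC(N)=O"]
charset = sorted(set("".join(train) + "!E"))  # sorted for a stable mapping
char_to_int = {c: i for i, c in enumerate(charset)}
length = max(len(s) for s in train) + 4       # room for '!' and 'E' padding

X = encode_smiles(train, char_to_int, length)
print(len(X), len(X[0]))  # number of samples, sequence length
```

A character that never occurs in the training strings would raise a `KeyError` here, so a real pipeline would need an extra "unknown" index for unseen symbols.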

Your input shape was not defined properly in the embedding layer. The following code works for me; instead of converting your data to 2D, it passes the 3D input directly to the embedding layer.
#THE MISSING STUFF
#_________________________________________
Y_train = Y_train.reshape(5) #Dense layer contains a single unit so need to input single dimension array
max_len = len(charset)
max_features = embed-1
inputshape = (max_features, max_len) #input shape didn't define. Embedding layer can accept 3D input by using input_shape
#__________________________________________
model = Sequential()
#model.add(Embedding(len(charset), 10, input_length=14))
model.add(Embedding(max_features, 10, input_shape=inputshape))#input_length=max_len))
model.add(Flatten())
model.add(Dense(1, activation='linear'))
print(model.summary())
optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])
callbacks_list = [
ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto',cooldown=0),
ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]
history =model.fit(x=X_train, y=Y_train,
batch_size=10,
epochs=10,
validation_data=(X_test,Y_test),
callbacks=callbacks_list)

Returning sequence of hidden states for time steps in Keras

I have this Keras model that performs sentiment classification on reviews.
embedding_layer = Embedding(vocabulary_size +1,
embeddings_dim,
weights=[embedding_weights],
input_length=max_sent_len,
trainable=False)
sentence_input = Input(shape=(max_sent_len,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
gru_ = GRU(300, input_shape =(timesteps, embeddings_dim), return_sequences=True, dropout_W = 0.3,dropout_U = 0.3)(embedded_sequences)
attention, word_score = AttLayer(return_attention= True)(gru_)
preds=Dense(1, activation='sigmoid')(attention)
model= Model(sentence_input,preds)
adam = Adam(lr=0.001,decay=1e-6,clipvalue=5)
model.compile(loss='binary_crossentropy',
optimizer=adam,
metrics=['accuracy'])
history=model.fit(train_sequences,train_labels,batch_size=32, nb_epoch=5, validation_data=(test_sequences, test_labels), verbose=1, shuffle=True)
After the attention layer, I need to plot the representation over time for the whole sequence of the input review. In a heatmap, I need to plot the representations obtained after combining the current word embedding with the previously built embeddings. Each column should correspond to the output at one time step. Right now I only have a single float score for each time step.
