How to test NLP model against many strings - python

I have trained a classifier model using logistic regression on a set of strings; it classifies each string as 0 or 1. Currently I can only test one string at a time. How can I have my model run through more than one sentence at a time, maybe from a .csv file, so I don't have to input each sentence individually?
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    return classifier, metrics.accuracy_score(predictions, valid_y)
then
model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count,test_y)
This is how I currently test my model:
sent = ['here I copy a string']
# converting text to count bag-of-words vectors
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2))
x_feature_vector = count_vect.transform(sent)
pred = model.predict(x_feature_vector)
and I get the sentence and its prediction
I want the model to classify all my new sentences at once and give a classification for each sentence.

model.predict(X) takes a list of samples, and the same goes for count_vect.transform(X), so you can read sentences from a file and predict them all together like this:
with open('file.txt', 'r') as f:
    samples = f.readlines()
vecs = count_vect.transform(samples)
preds = model.predict(vecs)
for s, p in zip(samples, preds):
    # print each sentence with its predicted label (cast the label to str for concatenation)
    print(s.strip() + " Label: " + str(p))

A much easier way to go is with pandas:
vecs = count_vect.transform(test['column_name_on_which_you_want_to_predict'])
pred = model.predict(vecs)
data = pd.DataFrame({'Text': test['column_name_on_which_you_want_to_predict'], 'SECTION': pred})
You can then export it however you want (e.g. with data.to_csv()).
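Putting it together, here is a minimal sketch of reading new sentences from a CSV and predicting them in one go. Note that the vectorizer must be the same CountVectorizer instance that was fitted on the training data, not a freshly constructed one; the file name 'new_sentences.csv' and the column name 'sentence' are illustrative assumptions, not from the question:
import pandas as pd

# assumes `count_vect` was already fitted on the training corpus
# and `model` is the trained LogisticRegression returned by train_model()
new_df = pd.read_csv('new_sentences.csv')             # hypothetical file name
new_vecs = count_vect.transform(new_df['sentence'])   # hypothetical column name
new_df['prediction'] = model.predict(new_vecs)

# save every sentence together with its predicted class (0 or 1)
new_df.to_csv('predictions.csv', index=False)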

Related

Word2Vec + LSTM Good Training and Validation but Poor on Test

Currently I'm training my Word2Vec + LSTM for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embedding; I chose it because performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that my training process has validation accuracy and loss stuck at 0.88 and 0.34 respectively. My confusion matrix also seems wrong. Here are the steps I performed before fitting the model.
Text pre-processing (a minimal sketch follows this list):
Lower casing
Remove hashtags, mentions, URLs, numbers, change words to numbers, non-ASCII characters, and retweet markers "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
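For context, here is a minimal sketch of a cleaning pipeline along these lines, assuming NLTK for stopwords and lemmatization; it omits contraction expansion and negation/antonym replacement, and the exact regexes used in the question may differ:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text):
    text = text.lower()                             # lower casing
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # URLs
    text = re.sub(r'[@#]\w+', ' ', text)            # mentions and hashtags
    text = re.sub(r'\brt\b', ' ', text)             # retweet marker
    text = re.sub(r'[^a-z\s]', ' ', text)           # numbers, punctuation, non-ASCII
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)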
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        train_size=0.9,
                                                        test_size=0.1,
                                                        stratify=y,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test
The split results in a training set of 2060 samples: 708 positive, 837 negative, and 515 neutral sentiment samples.
Then I applied text augmentation, namely EDA (Easy Data Augmentation), to all the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted

text_augmentation = TextAugmentation()
After augmentation the training set has 10300 samples: 3540 positive, 4185 negative, and 2575 neutral sentiment samples.
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
Finally, I fit the model using the pre-trained GoogleNewsVectorNegative300 word embedding (only its weights) with an LSTM, and I split my training data again with 10% held out for validation, as follows:
# Build single LSTM model
def build_lstm_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,), dtype='int32')
    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)
    # LSTM model layer
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=True)(embedding_layer)
    batch_normalization = BatchNormalization()(lstm_layer)
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=False)(batch_normalization)
    batch_normalization = BatchNormalization()(lstm_layer)
    # Dense model layer
    dense_layer = Dense(units=128, activation='relu')(batch_normalization)
    dropout_layer = Dropout(rate=0.5)(dense_layer)
    batch_normalization = BatchNormalization()(dropout_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_normalization)
    lstm_model = Model(inputs=input_layer, outputs=output_layer)
    return lstm_model

# Building single LSTM model
sinovac_lstm_model = build_lstm_model(SINOVAC_EMBEDDING_MATRIX, SINOVAC_MAX_SEQUENCE)
sinovac_lstm_model.summary()
sinovac_lstm_model.compile(loss='categorical_crossentropy',
                           optimizer=Adam(learning_rate=0.001),
                           metrics=['accuracy'])
sinovac_lstm_history = sinovac_lstm_model.fit(x=X_sinovac_train,
                                              y=y_sinovac_train,
                                              batch_size=64,
                                              epochs=20,
                                              validation_split=0.1,
                                              verbose=1)
The training result (plots not shown here):
The evaluation result (plots not shown here):
I really need some suggestions or insights to get good accuracy on my test set.
Without reviewing everything, a few high-order things that may be limiting your results:
The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era & domain-of-writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization & parameterization choices) has a good chance of working better, so you may want to revisit that choice.
The GoogleNews training-text preprocessing, while as far as I can tell never fully documented, did not appear to flatten all casing, remove stopwords, or involve lemmatization. It didn't mutate obvious negations into antonyms, but it did perform statistical combination of some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors, even throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble, and note that a sufficiently modern word2vec model trained on tweets, built using the same preprocessing for both word2vec training and the later steps, would match vocabularies perfectly.
Both the word2vec model and any deeper neural network often need lots of data to train well and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train roughly 130k parameters, at least 520KB of state, from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not really be giving the model any info that the word-vector similarity between synonyms didn't already provide.) So: consider shrinking your model, and getting much more training data, potentially even from other domains than your main classification interest, as long as the use of language is similar.
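If you do revisit the pretrained-vector choice, here is a minimal sketch of training your own word2vec on tweet text with gensim (version 4 API); the corpus file name and the parameter values are illustrative assumptions, not recommendations:
from gensim.models import Word2Vec

# each line of the (hypothetical) corpus file is one preprocessed tweet
with open('tweets_corpus.txt', encoding='utf-8') as f:
    sentences = [line.split() for line in f]

w2v = Word2Vec(sentences=sentences,
               vector_size=100,   # smaller than GoogleNews' 300d; tune to your data size
               window=5,
               min_count=2,       # keep rarer words from a small corpus
               workers=4,
               epochs=10)

w2v.save('tweets_w2v.model')
# per-word vectors for building an embedding matrix: w2v.wv[word]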

How to make prediction on new text dataset using saved text classification model

I trained a text classifier under this guide: https://developers.google.com/machine-learning/guides/text-classification/step-4
And saved the model as
model.save('~./output/model.h5')
In this case, how do I use this model to classify texts in another, new dataset?
Thank you
import tensorflow as tf
# Recreate the exact same model, including its weights and the optimizer
new_model = tf.keras.models.load_model('~./output/model.h5')
# Show the model architecture
new_model.summary()
# Apply the same process of data preparation while training the model.
# Lets say after Data preprocessing you have stored the processed data in test_data
# check model accuracy from unseen/new dataset
loss, acc = new_model.evaluate(test_data, test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100*acc))
You can use TensorFlow's text tokenization utility class (Tokenizer) to deal with unknown words in the test data.
num_words is the vocabulary size (it keeps the most frequent words).
Assign oov_token='some string'; it is used for all the tokens/words outside the vocabulary (basically, new words in the test data will be treated as the oov_token string).
Fit on the train data and then generate token sequences for both train and test data.
tf.keras.preprocessing.text.Tokenizer(
    num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
    split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)
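A minimal sketch of that workflow against the reloaded model above; train_texts and new_texts are hypothetical lists of strings, and num_words/maxlen should match whatever was used when the model was originally trained:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)          # fit ONLY on the training texts

new_seqs = tokenizer.texts_to_sequences(new_texts)   # unseen words become '<OOV>'
new_padded = pad_sequences(new_seqs, maxlen=100)

predictions = new_model.predict(new_padded)  # new_model from load_model(...) above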

Evaluate Machine Learning Text Classifier

I have built a binary text classifier and trained it to recognize client sentences as 'New' or 'Return'. My issue is that real data may not always have a clear distinction between new or return, even to an actual person reading the sentence.
My model was trained to 0.99 accuracy using supervised learning with Logistic Regression.
# train model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    return classifier, metrics.accuracy_score(predictions, valid_y)

# Linear Classifier on Count Vectors
model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count, test_y)
print('::: Accuracy on Test Set :::')
print('Linear Classifier, BoW Vectors: ', accuracy)
And this would give me an accuracy of 0.998.
I can now pass a whole list of sentences through this model and it will catch whether each sentence contains a 'new' or 'return' word, but I need an evaluation metric because some sentences have no real chance of being new or return; real data is messy as always.
My question is: What evaluation metrics can I use so that each new sentence that gets passed through the model shows a score?
Right now I only use the following code
with open('realdata.txt', 'r') as f:
    samples = f.readlines()
vecs = count_vect.transform(samples)
visit = model.predict(vecs)
num_to_label = {0: 'New', 1: 'Return'}
for s, p in zip(samples, visit):
    # print each sentence with the predicted label
    print(s.strip() + ' ' + num_to_label[p])
For example I would expect
Sentence Visit (Metric X)
New visit 2nd floor New 0.95
Return visit Evening Return 0.98
Afternoon visit North New 0.43
Therefore I'd know not to trust predictions whose metric falls below a certain percentage, because the tool isn't reliable for them.
You can use predict_proba() instead of predict(). This will give you probability estimates of your predictions for each possible label.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
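A minimal sketch of how that could look with the code from the question (samples, count_vect, model, and num_to_label as defined above); the confidence shown is the probability of the predicted class:
vecs = count_vect.transform(samples)
probs = model.predict_proba(vecs)              # shape (n_samples, 2), columns follow model.classes_
preds = model.classes_[probs.argmax(axis=1)]   # predicted label per sentence
confidence = probs.max(axis=1)                 # probability of that predicted label

for s, p, c in zip(samples, preds, confidence):
    print(f"{s.strip()}  {num_to_label[p]}  {c:.2f}")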

Making predictions on text data using keras

I trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, by tokenizing the sentences and assigning unique integers:
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd = []
embedding_dict = {}
file = open(filepath_glove, 'r', encoding='UTF-8')
for line in file.readlines():
    row = line.strip().split(' ')
    vocab_word = row[0]
    glove_vocab.append(vocab_word)
    embed_vector = [float(i) for i in row[1:]]  # convert to list of float
    embedding_dict[vocab_word] = embed_vector
file.close()

embedding_matrix = np.zeros((vocab_size, 200))  # rows stay zero for words missing from GloVe
for word, index in tokenizer.word_index.items():
    if word in embedding_dict:
        embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it with some new data, but that would mean the embedding matrix won't include some words from the new data. This makes me wonder whether I shouldn't have included the test data when creating the embedding matrix in the first place. And if not, how does the embedding layer work for those new words? This part is similar to the following question, but I couldn't find an answer:
How does the Keras Embedding Layer work if word is not found?
Thanks
It's quite simple. You are providing vocab_size, which is the number of words the embedding layer knows. If you pass an index that is out of bounds of vocab_size (a new word), it will either be ignored or an error will be thrown by Keras.
This answers your question regarding whether you should include all data in your embedding matrix. Yes, you should.
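A small sketch of the practical consequence, assuming the tokenizer fitted during training is reused on the new data (variable names other than tokenizer and maxSeqLength are illustrative): words never seen in fit_on_texts are simply dropped from the sequences (or mapped to the oov_token index if one was configured), so the indices that reach the embedding layer stay within vocab_size.
from tensorflow.keras.preprocessing.sequence import pad_sequences

new_texts = ["some brand new tweet with unseen words"]

# reuse the tokenizer fitted on the training text; fitting a new tokenizer
# would produce indices the embedding layer has never seen
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_padded = pad_sequences(new_sequences, maxlen=maxSeqLength)

predictions = loaded_model.predict(new_padded)  # loaded_model: the reloaded CNN (hypothetical name)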

PCA applied to MFCC for feeding a GMM classifier (sklearn library)

I'm facing a (probably simple) problem where I have to reduce the dimensionality of my feature vectors using PCA. The point of all of this is to create a classifier that predicts a sentence composed of phonemes. I train my model on hours of sentences pronounced by people (there are only 10 distinct sentences); each sentence has a label composed of a set of phonemes (see below).
What I have done so far is the following:
import glob
import numpy as np
import scipy.io.wavfile as wav
import mdp
from sklearn import mixture
from features import mfcc

def extract_mfcc():
    X_train = []
    directory = test_audio_folder
    # Iterate through each .wav file and extract the mfcc
    for audio_file in glob.glob(directory):
        (rate, sig) = wav.read(audio_file)
        mfcc_feat = mfcc(sig, rate)
        X_train.append(mfcc_feat)
    return np.array(X_train)

def extract_labels():
    Y_train = []
    # here I have all the labels - each label is a sentence composed of a set of phonemes
    with open(labels_files) as f:
        for line in f:  # Ex: line = AH0 P IY1 S AH0 V K EY1 K
            Y_train.append(line)
    return np.array(Y_train)
def main():
    __X_train = extract_mfcc()
    Y_train = extract_labels()

    # Now, according to every paper I read, I need to reduce the dimensionality
    # of my mfcc vectors before feeding my gaussian mixture model
    X_test = []
    for feat in __X_train:
        pca = mdp.pca(feat)
        X_test.append(pca)

    n_classes = 10  # I'm trying to predict only 10 sentences (each composed of the phonemes described above)
    gmm_classifier = mixture.GMM(n_components=n_classes, covariance_type='full')
    gmm_classifier.fit(X_train)  # error here! reason: each "pca" appended above has a different shape (same number of columns though)
How can I reduce the dimensionality and, at the same time, get the same shape for each PCA result that I extract?
I also tried something new: calling gmm_classifier.fit(...) within the for loop where I obtain the PCA vector (see code below). The fit() function works, but I'm not sure whether I'm actually training the GMM correctly.
n_classes = 10
gmm_classifier = mixture.GMM(n_components=n_classes, covariance_type='full')
X_test = []
for feat in __X_train:
    pca = mdp.pca(feat)
    gmm_classifier.fit(pca)  # in this way it works, but I'm not sure the model is actually trained correctly
Thanks a lot
Regarding your last comment/question:
gmm_classifier.fit(pca)  # in this way it works, but I'm not sure the model is actually trained correctly
Whenever you call fit, the classifier forgets the previous information and is trained only on the last data it saw. Try appending the features inside the loop and then fitting once.
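A minimal sketch of that idea: stack the per-sentence PCA frames into one array first, then call fit a single time. (GMM has since been renamed GaussianMixture in modern scikit-learn; mdp.pca and __X_train are taken from the question.)
import numpy as np
from sklearn.mixture import GaussianMixture  # mixture.GMM in older sklearn versions

reduced_frames = []
for feat in __X_train:
    reduced_frames.append(mdp.pca(feat))   # each entry: (n_frames_i, n_components)

# one 2-D array with a consistent number of columns and a variable number of rows per sentence
all_frames = np.vstack(reduced_frames)

gmm = GaussianMixture(n_components=10, covariance_type='full')
gmm.fit(all_frames)                         # fit once on all the stacked frames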
