Use Tf-Idf in a Keras Model - python

I've read my train, test and validation sentences into train_sentences, test_sentences and val_sentences.
Then I applied a TF-IDF vectorizer to them:
vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)
X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)
And my model looks like this
model = Sequential()
model.add(Input(????))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Normally, in the case of word2vec, we pass an embedding matrix to the Embedding layer.
How should I use TF-IDF in a Keras model? Please provide an example.
Thanks.

I cannot imagine a good reason for combining TF/IDF values with embedding vectors, but here is a possible solution: use the functional API, multiple Inputs and the concatenate function.
To concatenate layer outputs, their shapes must be aligned (except for the axis that is being concatenated). One method is to average the embeddings over the sequence and then concatenate the result with the vector of TF/IDF values.
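The shape logic can be sketched in plain NumPy, independent of Keras. The batch size of 4 and the random values are assumptions for illustration only; `maxlen = 50` and the two 300-dimensional sizes match the code in this answer.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, maxlen, embedding_size, tfidf_size = 4, 50, 300, 300

# Stand-in for the Embedding layer output: one vector per token position.
embedded = rng.normal(size=(batch, maxlen, embedding_size))
# Stand-in for the dense TF/IDF features of the same 4 sentences.
tfidf = rng.random(size=(batch, tfidf_size))

# Averaging over the token axis (axis=1) removes the sequence dimension.
mean_embedding = embedded.mean(axis=1)

# Both arrays are now 2-D with the same batch size, so they can be joined
# along the feature axis.
concatenated = np.concatenate([tfidf, mean_embedding], axis=1)

print(mean_embedding.shape)  # (4, 300)
print(concatenated.shape)    # (4, 600)
```

Without the averaging step the embedded input would still be 3-D, and it could not be concatenated with the 2-D TF/IDF input.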
Setting up, and some sample data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import keras
from keras.models import Model
from keras.layers import Dense, Activation, concatenate, Embedding, Input
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# some sample training data
bunch = fetch_20newsgroups()
all_sentences = []
for document in bunch.data:
    sentences = document.split("\n")
    all_sentences.extend(sentences)
all_sentences = all_sentences[:1000]
X_train, X_test = train_test_split(all_sentences, test_size=0.1)
len(X_train), len(X_test)
vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(X_train)
df_train = vectorizer.transform(X_train)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
maxlen = 50
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_train = pad_sequences(sequences_train, maxlen=maxlen)
Model definition
vocab_size = len(tokenizer.word_index) + 1
embedding_size = 300
input_tfidf = Input(shape=(300,))
input_text = Input(shape=(maxlen,))
embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)
# this averaging method taken from:
# https://stackoverflow.com/a/54217709/1987598
mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)
concatenated = concatenate([input_tfidf, mean_embedding])
dense1 = Dense(256, activation='relu')(concatenated)
dense2 = Dense(32, activation='relu')(dense1)
dense3 = Dense(8, activation='sigmoid')(dense2)
model = Model(inputs=[input_tfidf, input_text], outputs=dense3)
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Model Summary Output
Model: "model_2"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_11 (InputLayer) (None, 50) 0
__________________________________________________________________________________________________
embedding_5 (Embedding) (None, 50, 300) 633900 input_11[0][0]
__________________________________________________________________________________________________
input_10 (InputLayer) (None, 300) 0
__________________________________________________________________________________________________
lambda_1 (Lambda) (None, 300) 0 embedding_5[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate) (None, 600) 0 input_10[0][0]
lambda_1[0][0]
__________________________________________________________________________________________________
dense_5 (Dense) (None, 256) 153856 concatenate_4[0][0]
__________________________________________________________________________________________________
dense_6 (Dense) (None, 32) 8224 dense_5[0][0]
__________________________________________________________________________________________________
dense_7 (Dense) (None, 8) 264 dense_6[0][0]
==================================================================================================
Total params: 796,244
Trainable params: 796,244
Non-trainable params: 0

Related

how to feed LSTM model in Keras python?

I have read about LSTMs, and I understand that the algorithm takes the values of the previous words into account when processing the next word.
Now I am trying to apply my first LSTM.
I have this code:
model = Sequential()
model.add(LSTM(units=6, input_shape = (X_train_count.shape[0], X_train_count.shape[1]), return_sequences = True))
model.add(LSTM(units=6, return_sequences=True))
model.add(LSTM(units=6, return_sequences=True))
model.add(LSTM(units=ytrain.shape[1], return_sequences=True, name='output'))
model.compile(loss='cosine_proximity', optimizer='sgd', metrics = ['accuracy'])
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
model.summary()
cp=ModelCheckpoint('model_cnn.hdf5',monitor='val_acc',verbose=1,save_best_only=True)
history = model.fit(X_train_count, ytrain,
epochs=20,
verbose=False,
validation_data=(X_test_count, yval),
batch_size=10,
callbacks=[cp])
1- I cannot see how the LSTM would know the word sequence when my dataset is built from TF-IDF values.
2- I am getting this error:
ValueError: Input 0 of layer sequential_8 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 18644]
The issue seems to be the shape of X_train_count. Getting the LSTM input shape right is always tricky.
If your X_train_count is not 3D, reshape it using the line below.
X_train_count = X_train_count.reshape(X_train_count.shape[0], X_train_count.shape[1], 1)
In the LSTM layer, input_shape should be (timesteps, data_dim).
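A small NumPy sketch of that reshape, assuming the TF-IDF matrix has already been converted to a dense array (a SciPy sparse matrix would need `.toarray()` first). The array sizes are made up for illustration.

```python
import numpy as np

# Hypothetical dense TF-IDF matrix: 5 documents, 8 features.
X_train_count = np.zeros((5, 8))

# Add a trailing axis so each feature becomes one timestep with one value:
# (samples, features) -> (samples, timesteps, data_dim).
X_3d = X_train_count.reshape(X_train_count.shape[0], X_train_count.shape[1], 1)

# input_shape for the first LSTM layer excludes the batch dimension.
input_shape = X_3d.shape[1:]

print(X_3d.shape)   # (5, 8, 1)
print(input_shape)  # (8, 1)
```

Note that this only satisfies the shape check: the "timesteps" here are vocabulary columns, not word positions, so the word order of the original text is not recovered.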
Below is an example that illustrates this.
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
X = ["first example","one more","good morning"]
Y = ["first example","one more","good morning"]
vectorizer = TfidfVectorizer().fit(X)
tfidf_vector_X = vectorizer.transform(X).toarray()
tfidf_vector_Y = vectorizer.transform(Y).toarray()
tfidf_vector_X = tfidf_vector_X[:, :, None]
tfidf_vector_Y = tfidf_vector_Y[:, :, None]
X_train, X_test, y_train, y_test = train_test_split(tfidf_vector_X, tfidf_vector_Y, test_size = 0.2, random_state = 1)
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM
model = Sequential()
model.add(LSTM(units=6, input_shape = X_train.shape[1:], return_sequences = True))
model.add(LSTM(units=6, return_sequences=True))
model.add(LSTM(units=6, return_sequences=True))
model.add(LSTM(units=1, return_sequences=True, name='output'))
model.compile(loss='cosine_proximity', optimizer='sgd', metrics = ['accuracy'])
Model Summary:
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_9 (LSTM) (None, 6, 6) 192
_________________________________________________________________
lstm_10 (LSTM) (None, 6, 6) 312
_________________________________________________________________
lstm_11 (LSTM) (None, 6, 6) 312
_________________________________________________________________
output (LSTM) (None, 6, 1) 32
=================================================================
Total params: 848
Trainable params: 848
Non-trainable params: 0
_________________________________________________________________
None
Here X_train is of shape (2, 6, 1)
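The parameter counts in the summary above can be checked by hand. This is a sketch that assumes the usual count for an LSTM layer with biases (four gates, each with an input kernel, a recurrent kernel and a bias); `lstm_params` is a helper name introduced here, not a Keras function.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

# (input_dim, units) for the four stacked layers in the summary above.
layers = [(1, 6), (6, 6), (6, 6), (6, 1)]
counts = [lstm_params(d, u) for d, u in layers]

print(counts)       # [192, 312, 312, 32]
print(sum(counts))  # 848
```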
To add to the solution, I would suggest using dense vectors instead of the sparse vectors produced by the TF-IDF representation: initialize the Embedding layer with weights from pre-trained models such as the Google News vectors or GloVe, which would likely give better performance and results.
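As a rough sketch of what that involves, the weight matrix for the Embedding layer can be assembled from a word index and a table of pre-trained vectors. Both `word_index` and `pretrained` below are tiny hypothetical stand-ins; in practice the first would come from a fitted tokenizer and the second from a downloaded vector file.

```python
import numpy as np

# Hypothetical stand-ins for a tokenizer's word index and a table of
# pre-trained word vectors.
word_index = {"good": 1, "morning": 2, "example": 3}
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "morning": np.array([0.5, 0.6, 0.7, 0.8]),
}
embedding_dim = 4

# Row 0 is left as zeros for padding; words without a pre-trained vector
# also stay at zero.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector

print(embedding_matrix.shape)  # (4, 4)
```

Such a matrix is typically passed to the Embedding layer as its initial weights. The exact argument for doing so differs between Keras versions, so check the documentation for the version in use.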

Model subclassing ignores the weights of the Keras layers that are appended to a Python list

I am trying to create a subclassed model with a variable number of layers and variable hidden layer sizes.
Since the number and size of the hidden layers are not fixed, I appended the instantiated Keras layers into a list according to constructor parameters. But I do not see why when I use the list self.W to keep the Keras layers, the model ignores the weights of them.
class MLP(keras.Model):
    def __init__(self, first_size, num_hidden_layers, hidden_activation, num_classes, **kwargs):
        super(MLP, self).__init__()
        self.W = [Dense(units=first_size//(2**i), activation=hidden_activation) for i in range(num_hidden_layers)]
        # Regression task
        if num_classes == 0:
            self.W.append(Dense(units=1, activation='linear'))
        # Classification task
        else:
            self.W.append(Dense(units=num_classes, activation='softmax'))

    def call(self, inputs):
        x = inputs
        for w in self.W:
            x = w(x)
        return x
model = MLP(first_size=128, num_hidden_layers=4, hidden_activation='relu', num_classes=10)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, batch_size=64, epochs=20, validation_data=(x_val, y_val))
model.summary()
Model: "mlp_23"
_________________________________________________________________
Layer (type)    Output Shape    Param #
================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________
I think you can do this easily with the functional API, like this.
As an example, I used the iris dataset from sklearn.
from keras.models import Model
from keras.layers import Input, Dense
import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
x_train = iris_dataset["data"]
y_train = iris_dataset["target"]
inputs = Input(shape=x_train[0].shape)
x = inputs
num_hidden_layers=4
num_classes=10
hidden_activation='relu'
first_size=128
for i in range(num_hidden_layers):
    x = Dense(units=first_size//(2**i), activation=hidden_activation)(x)
outputs=Dense(units=num_classes, activation='softmax')(x)
model = Model(inputs=inputs,outputs=outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model.summary()
output
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 4) 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 640
_________________________________________________________________
dense_2 (Dense) (None, 64) 8256
_________________________________________________________________
dense_3 (Dense) (None, 32) 2080
_________________________________________________________________
dense_4 (Dense) (None, 16) 528
_________________________________________________________________
dense_5 (Dense) (None, 10) 170
=================================================================
Total params: 11,674
Trainable params: 11,674
Non-trainable params: 0
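The layer widths and parameter counts in this summary can be reproduced with a few lines of arithmetic. This is only a sketch; `dense_params` is a helper name introduced here for illustration.

```python
def dense_params(input_dim, units):
    # One weight per input/unit pair plus one bias per unit.
    return input_dim * units + units

first_size, num_hidden_layers, num_classes = 128, 4, 10
input_dim = 4  # the iris data has 4 features per sample

# Hidden widths halve at every layer: first_size // 2**i.
widths = [first_size // (2 ** i) for i in range(num_hidden_layers)] + [num_classes]

counts = []
for units in widths:
    counts.append(dense_params(input_dim, units))
    input_dim = units  # this layer's output feeds the next layer

print(widths)       # [128, 64, 32, 16, 10]
print(counts)       # [640, 8256, 2080, 528, 170]
print(sum(counts))  # 11674
```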

What determines input dimension of dense layer at end of a CNN

I'm following a CNN tutorial at Analytics Vidhya.
I'm having difficulty visualizing the connection between the flattened layer and the dense layer with 2 nodes and an input dimension of 50. This is a binary classification problem, so I understand the 2 nodes. However, what determines the input dimensions? We can also omit this parameter, in which case there will just be fewer weights to train for this dense layer?
import os
import numpy as np
import pandas as pd
import scipy
import sklearn
import keras
from keras.models import Sequential
import cv2
from skimage import io
%matplotlib inline
#Defining the File Path
cat=os.listdir("/mnt/hdd/datasets/dogs_cats/train/cat")
dog=os.listdir("/mnt/hdd/datasets/dogs_cats/train/dog")
filepath="/mnt/hdd/datasets/dogs_cats/train/cat/"
filepath2="/mnt/hdd/datasets/dogs_cats/train/dog/"
#Loading the Images
images=[]
label = []
for i in cat:
    image = scipy.misc.imread(filepath+i)
    images.append(image)
    label.append(0) #for cat images
for i in dog:
    image = scipy.misc.imread(filepath2+i)
    images.append(image)
    label.append(1) #for dog images
#resizing all the images
for i in range(0,23000):
    images[i]=cv2.resize(images[i],(300,300))
#converting images to arrays
images=np.array(images)
label=np.array(label)
# Defining the hyperparameters
filters=10
filtersize=(5,5)
epochs =5
batchsize=128
input_shape=(300,300,3)
#Converting the target variable to the required size
from keras.utils.np_utils import to_categorical
label = to_categorical(label)
#Defining the model
model = Sequential()
model.add(keras.layers.InputLayer(input_shape=input_shape))
model.add(keras.layers.convolutional.Conv2D(filters, filtersize, strides=(1, 1), padding='valid', data_format="channels_last", activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(units=2, input_dim=50,activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(images, label, epochs=epochs, batch_size=batchsize,validation_split=0.3)
model.summary()
However, what determines the input dimensions? We can also omit this
parameter, in which case there will just be fewer weights to train for
this dense layer?
It is determined by the output shape of the previous layer. As seen in model.summary(), the output shape of the Flatten layer is (None, 219040), so the input dimension of the Dense layer is 219040. So in this case the layer has far more than 50 inputs, and correspondingly more weights to train.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 296, 296, 10) 760
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 148, 148, 10) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 219040) 0
_________________________________________________________________
dense_1 (Dense) (None, 2) 438082
=================================================================
Total params: 438,842
Trainable params: 438,842
Non-trainable params: 0
_________________________________________________________________
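The shapes and parameter counts in the summary can be traced by hand. The sketch below assumes unpadded ('valid') windows, as in the question's code; `conv_out` is a helper name introduced here for illustration.

```python
def conv_out(size, kernel, stride=1):
    # Output length of a 'valid' (unpadded) convolution or pooling window.
    return (size - kernel) // stride + 1

height = 300
channels, filters, kernel = 3, 10, 5

# Conv2D(10, (5, 5), padding='valid'): 300 -> 296
conv_side = conv_out(height, kernel)
conv_params = kernel * kernel * channels * filters + filters

# MaxPooling2D(pool_size=(2, 2)) uses stride 2 by default: 296 -> 148
pool_side = conv_out(conv_side, 2, stride=2)

# Flatten: 148 * 148 * 10 values per image
flat = pool_side * pool_side * filters

# Dense(2): one weight per flattened value per unit, plus 2 biases
dense_params = flat * 2 + 2

print(conv_side, pool_side, flat)  # 296 148 219040
print(conv_params, dense_params)   # 760 438082
```

The value 50 appears nowhere in this trace, which is consistent with input_dim=50 having no effect on the layer.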
As can be seen from the code snippet below, the weights of the Dense layer are created from the input_shape argument (which is the output shape of the previous layer). The input_dim passed by the user when constructing the Dense layer is ignored.
input_dim = input_shape[-1]
self.kernel = self.add_weight(shape=(input_dim, self.units),
https://github.com/keras-team/keras/blob/3bda5520b787f84f687bb116c460f3aedada039b/keras/layers/core.py#L891

How would I improve my model such that it will work on more characters not in the dataset?

In my last post linked here, it was said that I have to modify my model for it to be better. To quote the only answerer's comment to my questions (again, thank you, Sir):
The accuracy of prediction is a metric of how good your neural network architecture is and it also depends on your train/validation data. You will have to tune your neural network in such a way that you generalize well by adjusting the hyper parameters such as number of layers, type of layers, learning rate, optimizer etc. ...
I would like to know how I would do these mentioned. Or at the least, be pointed in the right direction. I am honestly both lost in theory and practice.
The only thing I have been able to do is to adjust the epoch above 100. I have also cleaned the images to be identified as much as I can.
Currently, here is how I create my model. It is only based on Tensorflow 2.0's tutorial.
import numpy as np
import tensorflow as tf
from tensorflow import keras
# Load and prepare the MNIST dataset. Convert the samples from integers to floating-point numbers:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
def createModel():
    # Build the tf.keras.Sequential model by stacking layers.
    # Choose an optimizer and loss function used for training:
    model = tf.keras.models.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
model = createModel()
model.fit(x_train, y_train, epochs=102, validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)
It gave out a validation accuracy of around .9800 for me. But its performance against images of handwritten characters I've extracted from documents is dismal. I would also like it to be extended such that it can also read other selected characters, but I guess that can be another question for another day.
Thanks!
You could put several convolution/max-pooling layers at the beginning to perform feature extraction by scanning the image. After that, use a fully connected network like the one you had before, followed by a softmax.
You could create a CNN model this way:
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.models import Sequential
# Create the model
model = Sequential()
# Add the 1st Convolution/ max pool
model.add(Conv2D(40, kernel_size=5, padding="same",input_shape=(28, 28, 1), activation = 'relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# 2nd convolution / max pool
model.add(Conv2D(200, kernel_size=3, padding="same", activation = 'relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(1, 1)))
# 3rd convolution/ max pool
model.add(Conv2D(512, kernel_size=3, padding="valid", activation = 'relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(1, 1)))
# Reduce dimensions from 2d to 1d
model.add(Flatten())
model.add(Dense(units=100, activation='relu'))
# Add dropout to prevent overfitting
model.add(Dropout(0.5))
# Final fullyconnected layer
model.add(Dense(10, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Which returns the following model:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 28, 28, 40) 1040
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 40) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 14, 14, 200) 72200
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 200) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 10, 10, 512) 922112
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 8, 8, 512) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_1 (Dense) (None, 100) 3276900
_________________________________________________________________
dropout_1 (Dropout) (None, 100) 0
_________________________________________________________________
dense_2 (Dense) (None, 10) 1010
=================================================================
Total params: 4,273,262
Trainable params: 4,273,262
Non-trainable params: 0
_________________________________________________________________
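Note that this model expects inputs of shape (28, 28, 1) and is compiled with categorical_crossentropy, whereas the data in the question has shape (28, 28) with integer labels. A minimal NumPy sketch of the two conversions follows; the arrays are small hypothetical stand-ins for the MNIST data.

```python
import numpy as np

num_classes = 10

# Hypothetical stand-ins for MNIST-style data: 6 grayscale 28x28 images
# with integer class labels.
x = np.zeros((6, 28, 28), dtype="float32")
y = np.array([0, 3, 9, 1, 3, 7])

# Conv2D expects a channel axis: (28, 28) -> (28, 28, 1).
x_cnn = x[..., np.newaxis]

# categorical_crossentropy expects one-hot rows rather than integer labels.
y_onehot = np.eye(num_classes, dtype="float32")[y]

print(x_cnn.shape)     # (6, 28, 28, 1)
print(y_onehot.shape)  # (6, 10)
```

Alternatively, keeping the integer labels and compiling with loss='sparse_categorical_crossentropy', as in the question, avoids the one-hot step.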

Expected dense_3_input to have shape (None, 40) but got array with shape (40, 1)

I am a beginner at Deep Learning and am attempting to practice the implementation of Neural Networks in Python by performing audio analysis on a dataset. I have been following the Urban Sound Challenge tutorial and have completed the code for training the model, but I keep running into errors when trying to run the model on the test set.
Here is my code for creation of the model and training:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
num_labels = y.shape[1]
filter_size = 2
model = Sequential()
model.add(Dense(256, input_shape = (40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
model.fit(X, y, batch_size=32, epochs=40, validation_data=(val_X, val_Y))
Running model.summary() before fitting the model gives me:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 256) 10496
_________________________________________________________________
activation_3 (Activation) (None, 256) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_4 (Dense) (None, 10) 2570
_________________________________________________________________
activation_4 (Activation) (None, 10) 0
=================================================================
Total params: 13,066
Trainable params: 13,066
Non-trainable params: 0
_________________________________________________________________
After fitting the model, I attempt to run it on one file so that it can classify the sound.
file_name = ".../UrbanSoundClassifier/test/Test/5.wav"
test_X, sample_rate = librosa.load(file_name,res_type='kaiser_fast')
mfccs = np.mean(librosa.feature.mfcc(y=test_X, sr=sample_rate, n_mfcc=40).T,axis=0)
test_X = np.array(mfccs)
print(model.predict(test_X))
However, I get
ValueError: Error when checking : expected dense_3_input to have shape
(None, 40) but got array with shape (40, 1)
Would someone kindly like to point me in the right direction as to how I should be testing the model? I do not know what the input for model.predict() should be.
Full code can be found here.
So:
The easiest fix is simply to reshape test_X:
test_X = test_X.reshape((1, 40))
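To see why the reshape is needed, here is a short NumPy sketch. The `mfccs` array is a hypothetical stand-in for the 40 averaged MFCC values computed in the question.

```python
import numpy as np

# Hypothetical stand-in for the 40 averaged MFCC values of one audio file.
mfccs = np.arange(40, dtype="float32")
print(mfccs.shape)  # (40,)

# The model was built with input_shape=(40,), so predict() expects a 2-D
# array of shape (number_of_samples, 40), even for a single sample.
test_X = mfccs.reshape((1, 40))
print(test_X.shape)  # (1, 40)

# np.expand_dims gives the same result without hard-coding the length.
same = np.expand_dims(mfccs, axis=0)
```

With the batch axis in place, each row of the array is treated as one sample of 40 features.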
A more robust approach is to reuse the pipeline that created the training and validation sets for the test set as well. Note that the processing you currently apply to the test file is entirely different from the processing applied to the other data files. I'd create a test dataframe:
test_dataframe = pd.DataFrame({'filename': ["here path to test file"]})
and then reuse the existing pipeline that creates the validation set.
