I want to implement a ResNet50+LSTM model to classify video frames into 7 different phases (classes). In my training files, I have 5 folders; each one contains a video captured as a set of frames, where each frame shows one phase of a specific action (the action is the same for all the videos). Now I want to use ResNet50+LSTM for action phase recognition, and I want to use 4 nearby frames. I implemented the following code with Keras, but I have some questions.
import tensorflow as tf
from tensorflow.keras.layers import Input, TimeDistributed, GlobalAveragePooling2D, LSTM, LeakyReLU, Dropout, Dense
from tensorflow.keras.applications import ResNet50

inputs = Input((4, 224, 224, 3))
resnet = ResNet50(include_top=False, input_shape=(224, 224, 3), weights='imagenet')
for layer in resnet.layers:
    layer.trainable = False
output = GlobalAveragePooling2D()(resnet.output)
cnn = tf.keras.Model(inputs=resnet.input, outputs=output)
encoded_frames = TimeDistributed(cnn)(inputs)
lstm = LSTM(2048)(encoded_frames)
out_leaky = LeakyReLU()(lstm)
out_drop = Dropout(0.4)(out_leaky)
out_dense = Dense(2048, activation='relu')(out_drop)
out_1 = Dense(1, activation='sigmoid')(out_dense)
model = tf.keras.Model(inputs=[inputs], outputs=out_1)
I have used 'GlobalAveragePooling2D' to handle the 4 nearby frames. But I read that I should load 4 nearby frames in each iteration of the dataloader; that is, in each iteration the dataloader should yield a batch of shape (B, N, 3, H, W) (batch_size, number of nearby frames, channels, H, W). What should I do?
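From what I understand, the generator would have to yield batches of shape (B, 4, 224, 224, 3). I imagine something like the rough sketch below (frame_paths and labels stand for my own frame file lists and per-frame phase labels; I am not sure this is correct):

import numpy as np
import tensorflow as tf

class FrameWindowSequence(tf.keras.utils.Sequence):
    # rough sketch: frame_paths is a list of frame file paths in temporal order,
    # and labels is the phase label of each frame (both are placeholders for my own data)
    def __init__(self, frame_paths, labels, n_nearby=4, batch_size=8):
        self.frame_paths = frame_paths
        self.labels = labels
        self.n_nearby = n_nearby
        self.batch_size = batch_size

    def __len__(self):
        n_windows = len(self.frame_paths) - self.n_nearby + 1
        return int(np.ceil(n_windows / self.batch_size))

    def __getitem__(self, idx):
        x_batch, y_batch = [], []
        start = idx * self.batch_size
        stop = min(start + self.batch_size, len(self.frame_paths) - self.n_nearby + 1)
        for i in range(start, stop):
            # stack n_nearby consecutive frames -> (4, 224, 224, 3)
            window = [tf.keras.preprocessing.image.img_to_array(
                          tf.keras.preprocessing.image.load_img(p, target_size=(224, 224)))
                      for p in self.frame_paths[i:i + self.n_nearby]]
            x_batch.append(np.stack(window))
            y_batch.append(self.labels[i + self.n_nearby - 1])  # label of the last frame in the window
        # shapes: (B, 4, 224, 224, 3) and (B,)
        return np.array(x_batch, dtype='float32') / 255.0, np.array(y_batch)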
I want to use my model in a PyTorch environment. Can you help me to convert it?
Also, about the inputs of the ResNet50 and the LSTM: I chose these numbers based on the errors that I received. Can you explain them to me?
Thank you in advance.
What should I do?
Why not use a 3D convolutional network? It gives the best results according to Papers with Code (https://paperswithcode.com/sota/action-classification-on-kinetics-400). You feed this type of network tensors of shape (B, N, 3, H, W), and the model then classifies the set of frames fed to it.
Can you help me to convert it?
If your destination framework is PyTorch, why not use the models that already exist in TorchVision for this task? (A 3D conv model is used below.)
from torchvision.io.video import read_video
from torchvision.models.video import r3d_18, R3D_18_Weights
vid, _, _ = read_video("/path/to/your/test/video.avi", output_format="TCHW")
vid = vid[:32] # optionally shorten duration
# Step 1: Initialize model with the best available weights
weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()
# Step 2: Initialize the inference transforms
preprocess = weights.transforms()
# Step 3: Apply inference preprocessing transforms
batch = preprocess(vid).unsqueeze(0)
# Step 4: Use the model and print the predicted category
prediction = model(batch).squeeze(0).softmax(0)
label = prediction.argmax().item()
score = prediction[label].item()
category_name = weights.meta["categories"][label]
print(f"{category_name}: {100 * score}%")
I am working on a CNN model using the TensorFlow framework in Google Colab. I am unable to extract the latent vectors from the convolutional layers. I want to extract the output of the convolutional layers, i.e. the layers before the fully connected layer.
I have tried the following code:
a = dropout()(classifier_model.output)
print(a)
I am unable to understand the solution suggested in this link: Stack Overflow solution to print the value of a TensorFlow object after applying a conv-pool layer.
Does anyone have any suggestions?
You can use the get_layer method of the Model class to get a layer by its name. Below is an example with a dummy 1D CNN and a binary classifier:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv1D, ReLU, MaxPool1D, Flatten, Dense
from tensorflow.keras.models import Model

timesteps = 100
nfeatures = 2
# build the model using the functional API
# example of a 1D CNN inspired by your Stack Overflow link, but using a model instead of successive *raw* layers
# the values of the Conv1D filters and kernels are different
input = Input((timesteps, nfeatures))
p = Conv1D(filters=16, kernel_size=10)(input)
p = ReLU()(p)
p = MaxPool1D(pool_size=2)(p)
p = Conv1D(filters=32, kernel_size=10)(p)
p = ReLU()(p)
p = MaxPool1D(pool_size=2)(p)
p = Conv1D(filters=64, kernel_size=10)(p)
p = ReLU()(p)
p = MaxPool1D(pool_size=2, name='conv1Dfeat')(p) # give a name to the CNN output
# fully connected part
p = Flatten()(p)
p = Dense(10)(p)
# could add a dropout layer to ease optimization
finaloutput = Dense(1, activation='sigmoid')(p)
# full model
model = Model(inputs=input, outputs=finaloutput)
# compile network, i.e. define optimizer, loss and metrics
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
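For completeness, a quick training run on dummy random data (just to make the example executable end to end; replace it with your real arrays) could be:

X = np.random.normal(size=(32, timesteps, nfeatures))   # 32 dummy samples
y = np.random.randint(0, 2, size=(32, 1))               # dummy binary labels
model.fit(X, y, epochs=2, batch_size=8)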
Once the model has been trained with the fit method (as sketched above), you can get the output of the layer named conv1Dfeat (the last layer of the convolutional part) by defining a new model:
modelCNN = Model(inputs=input, outputs=model.get_layer('conv1Dfeat').output)
modelCNN.summary()
If you want to get the output of the convolutional part, say for a single NumPy input array of shape (timesteps, nfeatures), you can use the predict method of the Model class on batched data:
data = np.random.normal(size=(timesteps, nfeatures)) # dummy data
data_tf = tf.expand_dims(data, axis=0) # convert to TF tensor and add batch dimension at the same time
cnn_out_np = modelCNN.predict(data_tf)
cnn_out_np = np.squeeze(cnn_out_np, axis=0) # remove batch dimension
print(cnn_out_np.shape)
(4, 64)
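Applied to your own case, the idea is the same; here classifier_model and the layer name 'last_conv' are placeholders, so check classifier_model.summary() for the real name of the layer you want:

# 'last_conv' is a placeholder: use the actual name of the last convolutional/pooling layer
latent_model = Model(inputs=classifier_model.input,
                     outputs=classifier_model.get_layer('last_conv').output)
latent_vectors = latent_model.predict(some_images)  # some_images: a batch of preprocessed inputs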
I am a newbie to machine learning. For some reason, my CNN doesn't learn at all. I tried different datasets, but the result is the same: the loss and accuracy change only a little (most likely this is just noise). Maybe I configured something incorrectly or left something out.
The task is to create a lip movement generator based on face photos and audio recordings (supervised learning)
[audio.mp3 + face.jpg] -> [1.jpg, 2.jpg, 3.jpg, 4.jpg, 5.jpg, 6.jpg ... ]
I do normalize input data,
I set loss function = mean_squared_error
I set metrics = [accuracy, mean_squared_error]
I set optimizer = Adam,
The dataset is large enough to see at least some results = 2000 videos
The epoch number = 2,
batch_size = 16
Learning_rate = 0.001 (default)
The model structure diagram is below. I chose the number of layers and their types at random.
Input frame = face, picture 50x60, channels = 3. Shape = (50, 60, 3)
Input mfcc = numerical coefficients. Shape = (20, 43)
Output: 24 images 50x60
I'm trying to generate output = lips movement (video = many frames) by audio (mfcc) and face image.
Here are my logs and model structure.
Is there a way to use an already trained RNN (SimpleRNN or LSTM) model to generate new sequences in Keras?
I'm trying to modify an exercise from the Coursera Deep Learning Specialization - Sequence Models course, where you train an RNN to generate dinosaur names. In the exercise you build the RNN using only NumPy, but I want to use Keras.
One of the problems is different lengths of the sequences (dino names), so I used padding and set sequence length to the max size appearing in the dataset (I padded with 0, which is also the code for '\n').
My question is how to generate the actual sequence once training is done? In the numpy version of the exercise you take the softmax output of the previous cell and use it as a distribution to sample a new input for the next cell. But is there a way to connect the output of the previous cell as the input of the next cell in Keras, during testing/generation time?
Also - some additional side-question:
Since I'm using padding, I suspect the accuracy is way too optimistic. Is there a way to tell Keras not to include the padding values in its accuracy calculations?
Am I even doing this right? Is there a better way to use Keras with sequences of different lengths?
You can check my (WIP) code here.
Inferring from a model that has been trained on a sequence
This is a pretty common thing to do with RNN models, and in Keras the best way (at least as far as I know) is to create two different models.
One model for training (which uses sequences instead of individual items)
Another model for predicting (which uses a single element instead of a sequence)
So let's see an example. Suppose you have the following model.
from tensorflow.keras import models, layers
n_chars = 26
timesteps = 10
inp = layers.Input(shape=(timesteps, n_chars))
lstm = layers.LSTM(100, return_sequences=True)
out1 = lstm(inp)
dense = layers.Dense(n_chars, activation='softmax')
out2 = layers.TimeDistributed(dense)(out1)
model = models.Model(inp, out2)
model.summary()
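For reference, training this model on dummy one-hot sequences (the targets here are just shifted copies, standing in for real next-character labels) might look like:

import numpy as np
import tensorflow as tf

model.compile(optimizer='adam', loss='categorical_crossentropy')

# dummy one-hot sequences: 64 samples, `timesteps` steps, `n_chars` characters
char_ids = np.random.randint(0, n_chars, size=(64, timesteps))
X = tf.keras.utils.to_categorical(char_ids, num_classes=n_chars)   # (64, 10, 26)
Y = np.roll(X, -1, axis=1)        # stand-in "next character" targets, same shape as the output
model.fit(X, Y, epochs=2, batch_size=16)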
Now to infer from this model, you create another model which looks like the one below.
inp_infer = layers.Input(shape=(1, n_chars))
# Inputs to feed LSTM states back in
h_inp_infer = layers.Input(shape=(100,))
c_inp_infer = layers.Input(shape=(100,))
# We need return_state=True so we are creating a new layer
lstm_infer = layers.LSTM(100, return_state=True, return_sequences=True)
out1_infer, h, c = lstm_infer(inp_infer, initial_state=[h_inp_infer, c_inp_infer])
out2_infer = layers.TimeDistributed(dense)(out1_infer)
# Our model takes the previous states as inputs and spits out new states as outputs
model_infer = models.Model([inp_infer, h_inp_infer, c_inp_infer], [out2_infer, h, c])
# We are setting the weights from the trained model
lstm_infer.set_weights(lstm.get_weights())
model_infer.summary()
So what's different? You can see that we have defined a new input layer which accepts an input with only one timestep (in other words, just a single item). The model then outputs a result with a single timestep (technically we don't need the TimeDistributed layer, but I've kept it for consistency). Other than that, we take the previous LSTM states as inputs and produce the new states as outputs. More specifically, we have the following inference model.
Input: [(None, 1, n_chars), (None, 100), (None, 100)] — a list of tensors
Output: [(None, 1, n_chars), (None, 100), (None, 100)] — a list of tensors
Note that I'm either copying the weights of the new layers from the trained model (the LSTM) or reusing the existing layers of the training model directly (the Dense layer). The model will be pretty useless if you don't reuse the trained layers and weights.
Now we can write inference logic.
import numpy as np

x = np.random.randint(0, 2, size=(1, 1, n_chars))
h = np.zeros(shape=(1, 100))
c = np.zeros(shape=(1, 100))
seq_len = 10
for _ in range(seq_len):
    print(x)
    y_pred, h, c = model_infer.predict([x, h, c])
    y_pred = y_pred[:, 0, :]
    y_onehot = np.zeros(shape=(x.shape[0], n_chars))
    y_onehot[np.arange(x.shape[0]), np.argmax(y_pred, axis=1)] = 1.0
    x = np.expand_dims(y_onehot, axis=1)
This part starts with an initial x, h, c, gets the prediction y_pred, h, c, converts that into the next input in the following lines, and assigns it back to x, h, c. Then you keep going for as many iterations as you like.
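To turn the generated vectors back into an actual name, you would map the argmax indices through the index-to-character dictionary from your preprocessing (a hypothetical idx_to_char here) and stop at the padding/'\n' index 0, e.g. a variant of the same loop:

# idx_to_char is a hypothetical mapping from your preprocessing, e.g. {0: '\n', 1: 'a', 2: 'b', ...}
x = np.random.randint(0, 2, size=(1, 1, n_chars))
h = np.zeros(shape=(1, 100))
c = np.zeros(shape=(1, 100))
name = []
for _ in range(seq_len):
    y_pred, h, c = model_infer.predict([x, h, c])
    char_id = int(np.argmax(y_pred[:, 0, :], axis=1)[0])
    if char_id == 0:                       # 0 encodes '\n' / padding, so stop here
        break
    name.append(idx_to_char[char_id])
    x = np.zeros(shape=(1, 1, n_chars))    # feed the chosen character back in as a one-hot vector
    x[0, 0, char_id] = 1.0
print(''.join(name))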
About masking zeros
Keras does offer a Masking layer which can be used for this purpose. And the second answer to this question seems to be what you're looking for.
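For illustration, a minimal sketch of masking padded timesteps (assuming the padded positions are all-zero vectors) so that downstream layers, and the loss/metrics computed on them, can ignore them; max_len and n_chars here are assumed values:

from tensorflow.keras import layers, models

max_len = 27    # assumed maximum name length after padding
n_chars = 27    # assumed vocabulary size

inp = layers.Input(shape=(max_len, n_chars))
# timesteps whose features are all 0.0 are masked and skipped downstream
masked = layers.Masking(mask_value=0.0)(inp)
h = layers.LSTM(64, return_sequences=True)(masked)
out = layers.TimeDistributed(layers.Dense(n_chars, activation='softmax'))(h)
masked_model = models.Model(inp, out)
masked_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])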
I'm finetuning a ResNet50 model with a few additional layers using Keras.
I need to know which images are trained per batch.
The problem I have is that only the image data and their labels, but no image names, can be passed to fit and fit_generator, so I cannot write the image names that are trained in each batch to a file.
You can write your own generator so you can track what is fed into the network, and do whatever you like with the data (e.g. match indices to images).
Here is a basic example of a generator function which you can build upon:
import numpy as np

def gen_data():
    x_train = np.random.rand(100, 784)
    y_train = np.random.randint(0, 2, 100)
    i = 0
    while True:
        indices = np.arange(i*10, 10*i + 10)
        # These are the indices being fed to the network, which can be saved to a file.
        print(indices)
        out = x_train[indices], y_train[indices]
        i = (i + 1) % 10
        yield out
And then use fit_generator with the newly defined generator function:
model.fit_generator(gen_data(), steps_per_epoch=10, epochs=20)
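If what you actually want to log are the image file names rather than indices, the same pattern works with a list of paths; in the sketch below, filenames, labels and load_image are placeholders for your own file list, label array and image-loading code:

import numpy as np

def gen_named_data(filenames, labels, batch_size=16, log_path='batch_images.txt'):
    i = 0
    n_batches = len(filenames) // batch_size
    while True:
        batch_files = filenames[i * batch_size:(i + 1) * batch_size]
        # record which images go into this batch
        with open(log_path, 'a') as f:
            f.write('\n'.join(batch_files) + '\n')
        x = np.stack([load_image(p) for p in batch_files])   # load_image: your own loader
        y = np.asarray(labels[i * batch_size:(i + 1) * batch_size])
        i = (i + 1) % n_batches
        yield x, y

# model.fit_generator(gen_named_data(filenames, labels), steps_per_epoch=len(filenames)//16, epochs=20)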
(Apologies for the long post)
All,
I want to use the bottleneck features from a pretrained Inceptionv3 model to predict classification for my input images. Before training a model and predicting classification, I tried 3 different approaches for extracting the bottleneck features.
My 3 approaches yielded different bottleneck features (not just different values; even the sizes were different).
Size of my bottleneck features from Approach 1 and 2: (number of input images) x 3 x 3 x 2048
Size of my bottleneck features from Approach 3: (number of input images) x 2048
Why are the sizes different between the Keras-based InceptionV3 model and the native TensorFlow model? My guess is that when I say include_top=False in Keras, I'm not extracting the 'pool_3/_reshape:0' layer. Is this correct? If yes, how do I extract the 'pool_3/_reshape:0' layer in Keras? If my guess is incorrect, what am I missing?
I compared the bottleneck feature values from Approach 1 and 2 and they were significantly different. I think I'm feeding them the same input images, because I resize and rescale my images before I even read them as input for my script. I have no options set for my ImageDataGenerator in Approach 1, and according to the documentation for that function, all the default values do not change my input images. I have set shuffle to false, so I assumed that predict_generator and predict read the images in the same order. What am I missing?
Please note:
My input images are in RGB format (so the number of channels = 3) and I resized all of them to 150x150. I used the preprocess_input function in inceptionv3.py to preprocess all my images.
def preprocess_input(image):
    image /= 255.
    image -= 0.5
    image *= 2.
    return image
Approach 1: Used Keras with TensorFlow as the backend, an ImageDataGenerator to read my data, and model.predict_generator to compute the bottleneck features
I followed the example (section "Using the bottleneck features of a pre-trained network: 90% accuracy in a minute") from the Keras blog. Instead of the VGG model listed there, I used InceptionV3. Below is the snippet of code I used.
(Code not shown here, but what I did before the code below): read all input images, resize to 150x150x3, rescale according to the preprocess_input function mentioned above, save the resized and rescaled images.
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.inception_v3 import InceptionV3

train_datagen = ImageDataGenerator()
train_generator = train_datagen.flow_from_directory(my_input_dir, target_size=(150,150),shuffle=False, batch_size=16)
# get bottleneck features
# use pre-trained model and exclude top layer - which is used for classification
pretrained_model = InceptionV3(include_top=False, weights='imagenet', input_shape=(150,150,3))
bottleneck_features_train_v1 = pretrained_model.predict_generator(train_generator,len(train_generator.filenames)//16)
Approach 2: Used Keras with TensorFlow as the backend, my own reader, and model.predict to compute the bottleneck features
The only difference between this approach and the earlier one is that I used my own reader to read the input images.
(Code not shown here, but what I did before the code below): read all input images, resize to 150x150x3, rescale according to the preprocess_input function mentioned above, save the resized and rescaled images.
# inputImages is a numpy array of size <number of input images x 150 x 150 x 3>
inputImages = readAllJPEGsInFolderAndMergeAsRGB(my_input_dir)
# get bottleneck features
# use pre-trained model and exclude top layer - which is used for classification
pretrained_model = InceptionV3(include_top=False, weights='imagenet', input_shape=(img_width, img_height, 3))
bottleneck_features_train_v2 = pretrained_model.predict(inputImages, batch_size=16)
Approach 3: Used TensorFlow (no Keras) to compute the bottleneck features
I followed retrain.py to extract bottleneck features for my input images. Please note that the weights for that script can be obtained from http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz
As mentioned in that example, I used bottleneck_tensor_name = 'pool_3/_reshape:0' as the layer to extract and compute the bottleneck features. Similar to the first 2 approaches, I used the resized and rescaled images as input to the script, and I called this feature list bottleneck_features_train_v3.
Thank you so much
Different results between 1 and 2
Since you haven't shown your code, I (maybe wrongly) suspect that the problem is that you did not use preprocess_input when declaring the ImageDataGenerator:
from keras.applications.inception_v3 import preprocess_input
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
Make sure, though, that your saved image files range from 0 to 255. (Bit depth 24).
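A quick sanity check on one of your saved files (with PIL/NumPy) could be:

import numpy as np
from PIL import Image

img = np.array(Image.open('/path/to/one/saved/image.jpg'))   # path is a placeholder
print(img.dtype, img.shape, img.min(), img.max())            # expect uint8, (150, 150, 3), values in [0, 255]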
Different shapes between 1 and 3
There are three possible types of model in this case:
include_top=True -> this will return classes
include_top=False (only) -> this implies pooling=None (no final pooling layer)
include_top=False, pooling='avg' or pooling='max' -> has a final pooling layer
So your model, declared without an explicit pooling=something, doesn't have the final pooling layer in Keras, and the outputs still have the spatial dimensions.
Solve that simply by adding a pooling at the end. One of these:
pretrained_model = InceptionV3(include_top=False, pooling = 'avg', weights='imagenet', input_shape=(img_width, img_height, 3))
pretrained_model = InceptionV3(include_top=False, pooling = 'max', weights='imagenet', input_shape=(img_width, img_height, 3))
Not sure which one the model in the tgz file is using.
As an alternative, you can also get another layer from the Tensorflow model, the one coming immediately before 'pool_3'.
You can look at the Keras implementation of InceptionV3 here:
https://github.com/keras-team/keras/blob/master/keras/applications/inception_v3.py
The default parameters are:
def InceptionV3(include_top=True,
                weights='imagenet',
                input_tensor=None,
                input_shape=None,
                pooling=None,
                classes=1000):
Notice that the default is pooling=None; then, when building the model, the code is:
if include_top:
    # Classification block
    x = GlobalAveragePooling2D(name='avg_pool')(x)
    x = Dense(classes, activation='softmax', name='predictions')(x)
else:
    if pooling == 'avg':
        x = GlobalAveragePooling2D()(x)
    elif pooling == 'max':
        x = GlobalMaxPooling2D()(x)

# Ensure that the model takes into account
# any potential predecessors of `input_tensor`.
if input_tensor is not None:
    inputs = get_source_inputs(input_tensor)
else:
    inputs = img_input
# Create model.
model = Model(inputs, x, name='inception_v3')
So if you do not specify pooling, the bottleneck features are extracted without any pooling; you need to specify whether you want average pooling or max pooling on top of these features.
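Equivalently, if you already have the include_top=False model, you can append the pooling yourself and get features of size (number of images) x 2048; a short sketch reusing the inputImages array from Approach 2:

from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D
from keras.models import Model

base = InceptionV3(include_top=False, weights='imagenet', input_shape=(150, 150, 3))
pooled = GlobalAveragePooling2D()(base.output)               # (None, 3, 3, 2048) -> (None, 2048)
feature_extractor = Model(inputs=base.input, outputs=pooled)
bottleneck_features = feature_extractor.predict(inputImages, batch_size=16)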