I am a newbie to Machine Learning. For some reason, my CNN doesn't learn at all. I tried different datasets, but the result is the same: loss and accuracy change only slightly (most likely this is just noise). Maybe I configured something incorrectly or did not do something at all.
The task is to create a lip movement generator based on face photos and audio recordings (supervised learning)
[audio.mp3 + face.jpg] -> [1.jpg, 2.jpg, 3.jpg, 4.jpg, 5.jpg, 6.jpg ... ]
I do normalize input data,
I set loss function = mean_squared_error
I set metrics = [accuracy, mean_squared_error]
I set optimizer = Adam,
The dataset is large enough to see at least some results = 2000 videos
The epoch number = 2,
batch_size = 16
Learning_rate = 0.001 (default)
The model structure diagram is below. I chose the number of layers and their types at random.
Input frame = face, picture 50x60, channels = 3. Shape = (50, 60, 3)
Input mfcc = numerical coefficients. Shape = (20, 43)
Output: 24 images 50x60
I'm trying to generate output = lips movement (video = many frames) by audio (mfcc) and face image.
Here are my logs and model structure.
I want to implement a Resnet50+LSTM to classify video frames into 7 different phases (classes). In my train files, I have 5 folders, each of which includes a video captured as frames that show one phase of a specific action (the action is identical for all the videos). Now I want to use Resnet50+LSTM for action phase recognition. Also, I want to use 4 nearby frames. I implemented the following code with Keras, but I have some questions.
inputs = Input((4, 224, 224, 3))
resnet = ResNet50(include_top=False,input_shape =(224,224,3), weights='imagenet')
for layer in resnet.layers:
    layer.trainable=False
output = GlobalAveragePooling2D()(resnet.output)
cnn = tf.keras.Model(inputs=resnet.input, outputs=output)
encoded_frames = TimeDistributed(cnn)(inputs)
lstm = LSTM(2048)(encoded_frames)
out_leaky = LeakyReLU()(lstm)
out_drop = Dropout(0.4)(out_leaky)
out_dense = Dense(2048,input_dim=inputs,activation='relu')(out_drop)
out_1 = Dense(1,activation='sigmoid')(out_dense)
model = tf.keras.Model(inputs=[inputs], outputs=out_1)
I have used 'GlobalAveragePooling2D' to have 4 nearby frames. But I was reading that I should load 4 nearby frames in each iteration of the dataloader. It means that in each iteration, the dataloader should load (B, N, 3, H, W) (batch_size, # of nearby frames, channels, H, W). What should I do?
I want to use my model in a PyTorch environment. Can you help me to convert it?
Also, about the input of resnet50 and LSTM, I use these numbers based on the error that I received. Can you explain them to me?
Thank you in advance.
What should I do?
Why not use a 3D conv network? It gives the best results according to Papers with Code (https://paperswithcode.com/sota/action-classification-on-kinetics-400). To this type of network you feed (B, N, 3, H, W). The model then classifies the set of frames input to it.
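As a rough illustration of the (B, N, 3, H, W) layout mentioned above, the following sketch groups a decoded video into overlapping clips of N nearby frames using NumPy. The function name `make_clips`, the stride, and the toy frame size are assumptions made for illustration; they do not come from the question. Some pretrained video models expect the channel axis before the time axis, i.e. (B, 3, N, H, W), so it is worth checking the documentation of the model you pick.

```python
import numpy as np

def make_clips(frames, n_frames=4, stride=1):
    """Group a (T, C, H, W) video into overlapping clips of shape (n_frames, C, H, W)."""
    num_clips = (len(frames) - n_frames) // stride + 1
    if num_clips <= 0:
        raise ValueError("video is shorter than one clip")
    return np.stack(
        [frames[i * stride : i * stride + n_frames] for i in range(num_clips)]
    )

# toy video: 10 RGB frames of 8x8 pixels (sizes are illustrative only)
video = np.zeros((10, 3, 8, 8), dtype=np.float32)

clips = make_clips(video, n_frames=4)  # all windows of 4 consecutive frames
batch = clips[:2]                      # a batch with B=2, N=4
print(clips.shape, batch.shape)
```

A dataloader would then return one such clip (plus its label) per item, and batching stacks them along the first axis.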
Can you help me to convert it?
If your destination framework is PyTorch, why not use the already existing models in TorchVision for this task (a 3D conv model is used below)?
from torchvision.io.video import read_video
from torchvision.models.video import r3d_18, R3D_18_Weights
vid, _, _ = read_video("/path/to/your/test/video.avi", output_format="TCHW")
vid = vid[:32] # optionally shorten duration
# Step 1: Initialize model with the best available weights
weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()
# Step 2: Initialize the inference transforms
preprocess = weights.transforms()
# Step 3: Apply inference preprocessing transforms
batch = preprocess(vid).unsqueeze(0)
# Step 4: Use the model and print the predicted category
prediction = model(batch).squeeze(0).softmax(0)
label = prediction.argmax().item()
score = prediction[label].item()
category_name = weights.meta["categories"][label]
print(f"{category_name}: {100 * score}%")
I've been working on a Keras network to classify images as to whether they contain traffic lights or not, but so far I've had 0 success. I have a dataset of 11000+ images, and for my first test I used 240 images (or rather, text files for each image with the grayscale pixel values). There is only one output - a 0 or 1 saying whether the image contains traffic lights.
However, when I ran the test, it only predicted one class. Given that 53/240 images had traffic lights, it was achieving about a 79% accuracy rate because it was just predicting 0 all the time. I read that this might be down to imbalanced data, so I downscaled to just 4 images - 2 with traffic lights, 2 without.
Even with this test, it still stuck at 50% accuracy after 5 epochs; it's just predicting one class! Similar questions have been answered but I haven't found anything that is working for me :(
Here is the code I am using:
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
import numpy as np
import os
train_images = []
train_labels = []
#The following is just admin tasks - extracting the grayscale pixel values
#from the text files, adding them to the input array. Same with the labels, which
#are extracted from text files and added to output array. Not important to performance.
for fileName in os.listdir('pixels1/'):
    newRead = open(os.path.join('pixels1/', fileName))
    currentList = []
    for pixel in newRead:
        rePixel = int(pixel.replace('\n', ''))/255
        currentList.append(rePixel)
    train_images.append(currentList)
for fileName in os.listdir('labels1/'):
    newRead = open(os.path.join('labels1/', fileName))
    line = newRead.readline()
    train_labels.append(int(line))
train_images = np.array(train_images)
train_labels = np.array(train_labels)
train_images = train_images.reshape((4,13689))
#model
model = models.Sequential()
model.add(layers.Dense(13689, input_dim=13689, activation='relu'))
model.add(layers.Dense(13689, activation='relu'))
model.add(layers.Dense(1, activation='softmax'))
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=1)
I was hoping at the very least it would be able to recognise the images at the end. I really want to move onto running a training session on my full 11,000 examples, but at this point I can't get it to work with 4.
Rough points:
You seem to believe that the number of units in your dense layers should be equal to your data dimension (13689); this is not the case. Change both of them to something smaller (in the range of 100-200) - they do not even have to be equal. A model that big is not recommended with your relatively small number of data samples (images).
Since you are in a binary classification setting with a single node in your last layer, you should use activation=sigmoid for this (last) layer, and compile your model with loss='binary_crossentropy'.
In imaging applications, normally the first couple of layers are convolutional ones.
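To make the second point above concrete, here is a small NumPy sketch (not Keras itself) of why a single output unit with softmax can only ever predict one class: softmax normalises across the units of a layer, and with one unit there is nothing to normalise against. The helper names and the example logits are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis (the units of the output layer)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# three samples, ONE output unit each (illustrative values)
logits = np.array([[-3.0], [0.0], [2.5]])

softmax_out = softmax(logits)  # each row is normalised over a single value
sigmoid_out = sigmoid(logits)  # each logit is squashed independently

print(softmax_out.ravel())
print(sigmoid_out.ravel())
```

Whatever the weights are, the softmax column stays constant, so no gradient step can change the prediction; the sigmoid column varies with the logit, which is what a binary classifier needs.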
I am currently working on a speech classification problem. I have 1000 audio files in each class and have 7 such classes. I need to augment data to achieve better accuracy. I am using librosa library for data augmentation. For every audio file, I am using the below code.
fbank_train = []
labels_train = []
for wav in x_train_one[:len(x_train_one)]:
    samples, sample_rate = librosa.load(wav, sr=16000)
    if (len(samples)) == 16000:
        label = wav.split('/')[6]
        fbank = logfbank(samples, sample_rate, nfilt=16)
        fbank_train.append(fbank)
        labels_train.append(label)
        y_shifted = librosa.effects.pitch_shift(samples, sample_rate, n_steps=4, bins_per_octave=24)
        fbank_y_shifted = logfbank(y_shifted, sample_rate, nfilt=16)
        fbank_train.append(fbank_y_shifted)
        labels_train.append(label)
        change_speed = librosa.effects.time_stretch(samples, rate=0.75)
        if(len(change_speed)>=16000):
            change_speed = change_speed[:16000]
            fbank_change_speed = logfbank(change_speed, sample_rate, nfilt=16)
            fbank_train.append(fbank_change_speed)
            labels_train.append(label)
        change_speedp = librosa.effects.time_stretch(samples, rate=1.25)
        if(len(change_speedp)<=16000):
            change_speedp = np.pad(change_speedp, (0, max(0, 16000 - len(change_speedp))), "constant")
            fbank_change_speedp = logfbank(change_speedp, sample_rate, nfilt=16)
            fbank_train.append(fbank_change_speedp)
            labels_train.append(label)
That is, I am augmenting each audio file (pitch-shifting and time-shifting). I would like to know, is this the correct way of augmenting a training dataset?
And if not, what is the proportion of audio files that need to be augmented?
The most common way of performing augmentation is doing it to the whole dataset with a random chance for each sample to be augmented or not.
Also in most cases, the augmentation is done during runtime.
For example a pseudocode for your case could look like:
for e in epochs:
    reshuffle_training_set
    for x, y in training_set:
        if np.random.random() > 0.5:
            x = randomly_shift_pitch(x)
        if np.random.random() > 0.5:
            x = randomly_shift_time(x)
        model.fit(x, y)
This means that each sample has a 25% chance of not being augmented at all, a 25% chance of being only time-shifted, a 25% chance of being only pitch-shifted and a 25% chance of being both time and pitch-shifted.
During the next epoch, that same sample is augmented again with the above strategies. If you train your model for multiple epochs, each sample will pass through each combination of augmentations (with a high probability), so the model will learn from them all.
Also, if each of the shifts is done randomly, even if a sample passed through the same augmentor twice, it wouldn't result in the same perturbed sample.
A benefit of augmenting the samples at runtime rather than performing the full augmentation beforehand is that, to get the same effect offline, you'd need to create multiple new datasets (i.e. a few time-shifted ones, pitch-shifted ones and combinations of both) and train the model on the combined large dataset.
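The scheme described above can be sketched in runnable form as follows. This is only a sketch under stated assumptions: `random_time_shift` and `random_gain` are simple NumPy stand-ins invented for this example (a real pipeline would call something like the `librosa.effects.pitch_shift` and `librosa.effects.time_stretch` functions from the question with randomly drawn parameters), and the loop only counts which augmentations were applied instead of training a model.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_time_shift(x, rng, max_shift=1600):
    """Stand-in augmentation: circularly shift the waveform by a random offset."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1))

def random_gain(x, rng, low=0.8, high=1.2):
    """Stand-in augmentation: scale the amplitude by a random factor."""
    return x * rng.uniform(low, high)

def augment(x, rng, p=0.5):
    """Apply each augmentation independently with probability p."""
    applied = []
    if rng.random() < p:
        x = random_time_shift(x, rng)
        applied.append("shift")
    if rng.random() < p:
        x = random_gain(x, rng)
        applied.append("gain")
    return x, tuple(applied)

# one second of synthetic audio at 16 kHz (illustrative signal)
wave = np.sin(np.linspace(0, 100, 16000)).astype(np.float32)

# draw many augmentations of the same sample and count the combinations
n_draws = 4000
counts = {}
for _ in range(n_draws):
    _, applied = augment(wave, rng)
    counts[applied] = counts.get(applied, 0) + 1
fractions = {combo: n / n_draws for combo, n in counts.items()}
print(fractions)
```

With p = 0.5 for each augmentation, the four combinations each show up in roughly a quarter of the draws, matching the 25% figures above.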
I am trying to train a semantic-segmentation network (E-Net) in particular for high-quality human segmentation. For that, I have collected the "Supervisely Person" data-set and extracted the annotation masks using the provided API. This data-set holds high quality masks, thus I think it will provide better results in comparison to e.g. COCO data-set.
Supervisely - Example below : original image - ground truth.
First I want to give some details of the model. The network itself (Enet_arch) returns logits from the last convolution layer and probabilities which are produced through tf.nn.sigmoid(logits,name='logits_to_softmax').
I am using sigmoid cross-entropy on the ground truth and the returned logits, momentum and exponential decay on the learning rate. The model instance and the training pipeline is as follows.
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.momentum = tf.Variable(0.9, trainable=False)
# introducing weight decay
#with slim.arg_scope(ENet_arg_scope(weight_decay=2e-4)):
self.logits, self.probabilities = Enet_arch(inputs=self.input_data, num_classes=self.num_classes, batch_size=self.batch_size) # returns logits (2d), probabilities (2d)
#self.gt is int32 with values 0 or 1 (coming from read_tfrecords.Read_TFRecords annotation images + placeholder defined to int)
self.gt = self.input_masks
# self.probabilities is output of sigmoid, pixel-wise between probablities [0, 1].
# self.predictions is filtered probabilities > 0.5 = 1 else 0
self.predictions = tf.to_int32(self.probabilities > 0.5)
# capture segmentation accuracy
self.accuracy, self.accuracy_update = tf.metrics.accuracy(labels=self.gt, predictions=self.predictions)
# losses and updates
# calculate cross entropy loss on logits
loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.gt, logits=self.logits)
# add the loss to total loss and average (?)
self.total_loss = tf.losses.get_total_loss()
# decay_steps = depend on the number of epochs
self.learning_rate = tf.train.exponential_decay(self.starter_learning_rate, global_step=self.global_step, decay_steps=123893, decay_rate=0.96, staircase=True)
#Now we can define the optimizer
#optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, epsilon=1e-8)
optimizer = tf.train.MomentumOptimizer(self.learning_rate, self.momentum)
#Create the train_op.
self.train_op = optimizer.minimize(loss, global_step=self.global_step)
I first tried to over-fit the model on a single image to identify the depth of details that this network can capture. To increase the output quality I resized all the images to 1080p before feeding them to the network. On this trial I trained the network for 10K iterations and the total error reached ~30% (captured from tf.losses.get_total_loss() ).
The results while training on a single image are pretty good as you can see below.
Supervisely - Example below : (1) Loss (2) input (before resizing) | ground truth (before resizing) | 1080p out
Later, I tried to train on the whole data-set, but the training loss oscillates a lot. That means that the network performs well on some images and poorly on others. As a result, after 743360 iterations (which is 160 epochs, since the training set holds 4646 images) I stopped training, since obviously there is something wrong with the hyper-parameter selection that I made.
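For reference, here is a small worked calculation of what the decay schedule in the code above does over this run. It assumes the usual staircase rule, rate = start * decay_rate ** floor(step / decay_steps), with the `decay_steps=123893` and `decay_rate=0.96` values from the post; the starting rate of 1e-3 is a hypothetical placeholder, because `starter_learning_rate` is not given.

```python
def staircase_lr(start_lr, step, decay_steps=123893, decay_rate=0.96):
    """Learning rate after `step` updates with staircase exponential decay."""
    return start_lr * decay_rate ** (step // decay_steps)

start_lr = 1e-3        # hypothetical; the post does not state starter_learning_rate
total_steps = 743360   # iterations reported above

num_decays = total_steps // 123893          # how many times the rate was lowered
final_lr = staircase_lr(start_lr, total_steps)
ratio = final_lr / start_lr                 # fraction of the initial rate left

print(num_decays, round(ratio, 4))
```

Under these assumptions the rate is lowered only six times and ends at about 78% of its starting value. This does not by itself explain the oscillations, but it shows that the schedule barely changed the step size during the 160 epochs.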
Supervisely - Example below : (1) Loss (2) learning rate (3) input (before resizing) | ground truth (before resizing) | 1080p out
On the other hand, on some instances of the training set the network produces fair (though not very good) results, like below.
Supervisely - Example below : input (before resizing) | ground truth (before resizing) | 1080p out
Why do I see those differences between these training instances? Are there any obvious changes that I should make to the model or to the hyper-parameters? Is it possible that this model is just not suitable for this use-case (e.g. low network capacity)?
Thanks in advance.
It turns out that the problem here is indeed the E-Net architecture. I replaced the architecture with DeepLabV3 and saw a big difference in loss behaviour and performance, even at small resolution!
I have a bunch of images that look like this of someone playing a videogame (a simple game I created in Tkinter):
The idea of the game is that the user controls the box at the bottom of the screen in order to dodge the falling balls (they can only dodge left and right).
My goal is to have the neural network output the position of the player on the bottom of the screen. If they're totally on the left, the neural network should output a 0, if they're in the middle, a .5, and all the way right, a 1, and all the values in-between.
My images are 300x400 pixels. I stored my data very simply. I recorded each of the images and position of the player as a tuple for each frame in a 50-frame game. Thus my result was a list in the form [(image, player position), ...] with 50 elements. I then pickled that list.
So in my code I try to create an extremely basic feed-forward network that takes in the image and outputs a value between 0 and 1 representing where the box on the bottom of the image is. But my neural network is only outputting 1s.
What should I change in order to get it to train and output values close to what I want?
Of course, here is my code:
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0:1]
def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = [list(val) for val in zip(*data)]
    for index, image in enumerate(inputs):
        # convert the PIL images into numpy arrays so Keras can process them
        inputs[index] = pil_image_to_np_array(image)
    return (inputs, outputs)
if __name__ == "__main__":
    # fix random seed for reproducibility
    np.random.seed(7)
    # load data
    # data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
    with open("position_data1.pkl", "rb") as pickled_data:
        data = pickle.load(pickled_data)
    X, Y = data_to_training_set(data)
    # get the width of the images
    width = X[0].shape[1] # == 400
    # convert the player position (a value between 0 and the width of the image) to values between 0 and 1
    for index, output in enumerate(Y):
        Y[index] = output / width
    # flatten the image inputs so they can be passed to a neural network
    for index, inpt in enumerate(X):
        X[index] = np.ndarray.flatten(inpt)
    # keras expects an array (not a list) of image-arrays for input to the neural network
    X = np.array(X)
    Y = np.array(Y)
    # create model
    model = Sequential()
    # my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
    # keep it super simple by not having any deep learning
    model.add(Dense(1, input_dim=120000, activation='sigmoid'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    # Fit the model
    model.fit(X, Y, epochs=15, batch_size=10)
    # see what the model is doing
    predictions = model.predict(X, batch_size=10)
    print(predictions) # this prints all 1s! # TODO fix
EDIT: print(Y) gives me:
so it's definitely not all zeroes.
Of course, a deeper model might give you a better accuracy, but considering the fact that your images are simple, a pretty simple (shallow) model with only one hidden layer should give a medium to high accuracy. So here are the modifications you need to make this happen:
Make sure X and Y are of type float32 (currently, X is of type uint8):
X = np.array(X, dtype=np.float32)
Y = np.array(Y, dtype=np.float32)
When training a neural network it is much better to normalize the training data. Normalization helps the optimization process go smoothly and speeds up convergence to a solution. It also prevents large input values from causing large gradient updates, which would be disruptive. Usually, the values of each feature in the input data should fall in a small range, where two common ranges are [-1,1] and [0,1]. Here we standardize instead: we subtract from each feature its mean and divide it by its standard deviation, which gives every feature zero mean and unit variance (so most values end up in a small range around zero, though not strictly inside [-1,1]):
X_mean = X.mean(axis=0)
X -= X_mean
X_std = X.std(axis=0)
X /= X_std + 1e-8 # add a very small constant to prevent division by zero
Note that we are normalizing each feature (i.e. each pixel in this case) here not each image. When you want to predict on new data, i.e. in inference or testing mode, you need to subtract X_mean from test data and divide it by X_std (you should NEVER EVER subtract from test data its own mean or divide it by its own standard deviation; rather, use the mean and std of training data):
X_test -= X_mean
X_test /= X_std + 1e-8
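The effect of this step on a single sigmoid unit can be seen in a small NumPy experiment. Everything here is a toy stand-in: the random "images", the array sizes, and the constant weight vector `w` are assumptions chosen to keep the example short, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# toy stand-in data: 20 train / 5 test "images" of 1000 pixels with values 0..255
X_train = rng.integers(0, 256, size=(20, 1000)).astype(np.float32)
X_test = rng.integers(0, 256, size=(5, 1000)).astype(np.float32)
w = np.full(1000, 0.01, dtype=np.float32)  # hypothetical small positive weights

# raw pixels: the weighted sum is in the hundreds, so the sigmoid saturates
raw_out = sigmoid(X_train @ w)

# standardise each pixel with statistics computed on the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8
X_train_n = (X_train - mean) / std
X_test_n = (X_test - mean) / std  # reuse the training mean/std for test data

# standardised pixels: the weighted sum stays near zero, outputs are not saturated
norm_out = sigmoid(X_train_n @ w)

print(raw_out.min(), norm_out.min(), norm_out.max())
```

With raw 0..255 inputs every prediction is pinned at 1.0, where the sigmoid's gradient is essentially zero, which is consistent with the "all 1s" output in the question; after standardization the outputs sit around 0.5 and can move in either direction during training.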
If you apply the two changes above, you might notice that the network no longer predicts only ones or only zeros. Rather, it shows some faint signs of learning and predicts a mix of zeros and ones. This is not bad, but it is far from good and we have high expectations! The predictions should be much better than a mix of only zeros and ones. Here, you should take into account the (forgotten!) learning rate. Since the network has a relatively large number of parameters for a relatively simple problem (and there are only a few training samples), you should choose a smaller learning rate to smooth the gradient updates and the learning process:
from keras import optimizers
# older Keras releases call this argument `lr`; newer ones name it `learning_rate`
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(learning_rate=0.0001))
You will notice the difference: the loss value reaches around 0.01 after 10 epochs, and the network no longer predicts a mix of zeros and ones; rather, the predictions are much more accurate and close to what they should be (i.e. Y).
Don't forget! We have high (logical!) expectations. So, how can we do better without adding any new layers to the network (obviously, we assume that adding more layers might help!!)?
4.1. Gather more training data.
4.2. Add weight regularization. Common ones are L1 and L2 regularization (I highly recommend the Jupyter notebooks of the book Deep Learning with Python written by François Chollet, the creator of Keras. Specifically, here is the one which discusses regularization.)
You should always evaluate your model in a proper and unbiased way. Evaluating it on the training data (that you have used to train it) does not tell you anything about how well your model would perform on unseen (i.e. new or real world) data points (e.g. consider a model which memorizes all the training data. It would perform perfectly on the training data, but it would be a useless model and perform poorly on new data). So we should have test and train datasets: we train the model on the training data and evaluate it on the test (i.e. new) data. However, during the process of coming up with a good model you perform lots of experiments: for example, you first change the type and number of layers, train the model and then evaluate it on test data to make sure it is good. Then you change another thing, say the learning rate, train it again and then evaluate it again on test data... In short, these cycles of tuning and evaluation effectively cause over-fitting to the test data. Therefore, we need a third dataset called validation data (read more: What is the difference between test set and validation set?):
# first shuffle the data to make sure it isn't in any particular order
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]
# you have 200 images
# we select 100 images for training,
# 50 images for validation and 50 images for test data
X_train = X[:100]
X_val = X[100:150]
X_test = X[150:]
Y_train = Y[:100]
Y_val = Y[100:150]
Y_test = Y[150:]
# train and tune the model
# you can attempt train and tune the model multiple times,
# each time with different architecture, hyper-parameters, etc.
model.fit(X_train, Y_train, epochs=15, batch_size=10, validation_data=(X_val, Y_val))
# only and only after completing the tuning of your model
# you should evaluate it on the test data for just one time
model.evaluate(X_test, Y_test)
# after you are satisfied with the model performance
# and want to deploy your model for production use (i.e. real world)
# you can train your model once more on the whole data available
# with the best configurations you have found out in your tunings
model.fit(X, Y, epochs=15, batch_size=10)
(Actually, when we have little training data available, it would be wasteful to separate validation and test data from the whole available data. In this case, and if the model is not computationally expensive, instead of separating out a fixed validation set, which is called hold-out validation, one can do K-fold cross-validation, or iterated K-fold cross-validation in case of having very few data samples.)
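A minimal sketch of the K-fold idea mentioned above, written with plain NumPy index arrays. The helper name `k_fold_indices`, the choice of k = 5, and the sample count of 50 are assumptions for illustration; model building and training are left out on purpose.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=50, k=5))

# In practice, a fresh model would be built and trained on X[train_idx],
# evaluated on X[val_idx], and the k validation scores averaged.
for fold, (train_idx, val_idx) in enumerate(splits):
    print(fold, len(train_idx), len(val_idx))
```

The averaged validation score is then used for tuning, while a separate test set (if one can be afforded) is still kept for the final, one-time evaluation.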
It is around 4 AM at the time of writing this answer and I am feeling sleepy, but I would like to mention one more thing which is not directly related to your question: by using the Numpy library and its functions and methods you can write more concise and efficient code and also save yourself a lot of time. So make sure you practice using it more, as it is heavily used in the machine learning community and its libraries. To demonstrate this, here is the same code you have written but with more use of Numpy (note that I have not applied all the changes I mentioned above in this code):
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0]
def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = zip(*data)
    inputs = [pil_image_to_np_array(image) for image in inputs]
    inputs = np.array(inputs, dtype=np.float32)
    outputs = np.array(outputs, dtype=np.float32)
    return (inputs, outputs)
if __name__ == "__main__":
    # fix random seed for reproducibility
    np.random.seed(7)
    # load data
    # data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
    with open("position_data1.pkl", "rb") as pickled_data:
        data = pickle.load(pickled_data)
    X, Y = data_to_training_set(data)
    # get the width of the images
    width = X.shape[2] # == 400
    # convert the player position (a value between 0 and the width of the image) to values between 0 and 1
    Y /= width
    # flatten the image inputs so they can be passed to a neural network
    X = np.reshape(X, (X.shape[0], -1))
    # create model
    model = Sequential()
    # my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
    # keep it super simple by not having any deep learning
    model.add(Dense(1, input_dim=120000, activation='sigmoid'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    # Fit the model
    model.fit(X, Y, epochs=15, batch_size=10)
    # see what the model is doing
    predictions = model.predict(X, batch_size=10)
    print(predictions) # this prints all 1s! # TODO fix