https://colab.research.google.com/drive/1EdCL6YXCAvKqpEzgX8zCqWv51Yum2PLO?usp=sharing
Hello,
Above, I'm trying to identify 5 different types of restorations on dental x-rays with TensorFlow. I'm following the steps in the official documentation, but now I'm kind of stuck and I need help. Here are my questions:
1- I have my data on my local disk. The TF example at the link above downloads the data from a different repository. When I want to test my images, do I have any other way than to use the code below?
import numpy as np
from keras.preprocessing import image
from google.colab import files

uploaded = files.upload()

# predicting images
for fn in uploaded.keys():
    path = fn
    img = image.load_img(path, target_size=(180, 180))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    images = np.vstack([x])
    classes = model.predict(images)
    print(fn)
    print(classes)
I'm asking this because the official documentation only shows how to test images one by one, like this:
img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0)  # Create a batch

predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])

print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
2- I'm using the "image_dataset_from_directory" method, so I don't have a separate validation directory. Is that OK, or should I use ImageDataGenerator? For testing, I picked some images by hand from all 5 categories and put them in my test folder, which has 5 subfolders, one per category. Is this what I am supposed to do for prediction, i.e. also separating the test data into different folders? If yes, how can I load all 5 folders simultaneously at test time?
3- I'm also supposed to create the confusion matrix, but I couldn't figure out how to apply this to my code. Some people say to use scikit-learn's confusion matrix, but then I have to define y_true and y_pred values, which I can't fit into this code. Am I supposed to evaluate 5 different confusion matrices for 5 different predictions, and how?
4- Sometimes I observe that the validation accuracy starts much higher than the training accuracy. Is this unusual? After 3-4 epochs the training accuracy catches up with the validation accuracy and they continue in a more balanced way. I thought this should not happen. Is everything alright?
5- Final question: why does the first epoch take much, much longer than the other epochs? In my setup it takes about 30-40 minutes to complete the first epoch, and then only about a minute or so for every other epoch. Is there a way to fix it, or does it always happen this way?
thanks.
I am no expert in image processing with tf, but let me try to answer as much as possible:
1
I don't really understand this question, because you are using image_dataset_from_directory, which should handle the file loading for you. What you are doing there looks good to me so far.
2
Let me cite tf.keras.preprocessing.image_dataset_from_directory:
Then calling image_dataset_from_directory(main_directory,
labels='inferred') will return a tf.data.Dataset that yields batches
of images from the subdirectories class_a and class_b, together with
labels 0 and 1 (0 corresponding to class_a and 1 corresponding to
class_b).
And ImageDataGenerator:
Generate batches of tensor image data with real-time data augmentation.
The data will be looped over (in batches).
As your data is handpicked, there is no need for ImageDataGenerator; image_dataset_from_directory returns what you want. If you have test and validation data (which you should), you can use the tf.data.Dataset functions for splitting data into train, validation and test sets. This can be a bit clunky, but time spent learning tf.data.Dataset is well spent.
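To load your 5-subfolder test directory in one go and predict on all of it at once, here is a minimal sketch (assuming the folder is called "test_data" and the model was trained on 180x180 images; adapt the names to your setup):

import numpy as np
import tensorflow as tf

# one dataset over all 5 class subfolders; labels are inferred from folder names
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "test_data",            # assumed path, adapt to yours
    labels="inferred",
    label_mode="int",
    image_size=(180, 180),
    shuffle=False)          # keep file order so predictions line up with labels

predictions = model.predict(test_ds)             # one row of scores per image
predicted_classes = np.argmax(predictions, axis=1)

This also replaces the one-by-one upload loop from your first question.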
3
From the confusion matrix you can derive the F1 score, precision and recall values. But as the confusion matrix is normally used for binary classification (which is not your case), it only returns those values for one class (and for not-this-class). Better to use the metrics TensorFlow relies on: TensorFlow can calculate recall, precision and the F1 score for you as metrics, so if you ask me, use them.
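If you do want the matrix itself, you need a single 5x5 matrix (rows = true classes, columns = predicted classes), not 5 separate ones. A rough sketch, assuming a test_ds like the one above that yields (images, labels) batches in a fixed order:

import numpy as np
import tensorflow as tf

# collect true and predicted labels over the whole test set
y_true = np.concatenate([labels.numpy() for images, labels in test_ds])
y_pred = np.argmax(model.predict(test_ds), axis=1)

cm = tf.math.confusion_matrix(y_true, y_pred, num_classes=5)
print(cm)

The same y_true and y_pred also plug straight into scikit-learn's confusion_matrix if you prefer that.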
4
Depending on how the data is shuffled and structured, this can be normal. When there are more difficult cases in the training data, the model will have more trouble predicting them correctly. When there are more easy cases among the validation samples, the model will do better there, which gives you a higher validation accuracy at that point. It is indeed an indicator that the classes in your training and validation data might not be equally distributed.
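To check that suspicion, you could count the labels in each split, e.g. (assuming train_ds and val_ds are the datasets returned by image_dataset_from_directory):

import numpy as np

train_labels = np.concatenate([labels.numpy() for images, labels in train_ds])
val_labels = np.concatenate([labels.numpy() for images, labels in val_ds])

# per-class counts; the two distributions should look roughly alike
print(np.unique(train_labels, return_counts=True))
print(np.unique(val_labels, return_counts=True))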
5
tf.data.Dataset loads the data lazily, when it is needed. This means the files are not loaded into memory until the training process has started, which results in a very long first epoch (all images are loaded first) and very short epochs afterwards (oh cool, all images are already there). You can confirm this by checking the GPU usage of your machine during the first epoch; it should often be idle or very low.
To fix this, you can use .prefetch(z) on your dataset variable. prefetch() makes the dataset prefetch the next z values while the GPU is already doing some calculations. This might speed up the first epoch.
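For example (assuming your datasets are called train_ds and val_ds; AUTOTUNE lets tf.data pick the buffer size itself, and the extra .cache() call is a related trick that keeps already-decoded images in memory so later epochs skip the disk entirely):

import tensorflow as tf

train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(tf.data.AUTOTUNE)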
Related
I'm learning CNN and wondering why is my network stuck at 0% accuracy even after multiple epochs? I'm sharing the entire code as it's really simple.
I have a dataset with faces and respective ages. I'm using keras and tf to train a convolution neural network to determine age.
However, my accuracy is always reporting as 0%. I'm very new to neural networks and I'm hoping you could tell me what I am doing wrong?
path = "dataset"
pixels = []
age = []
for img in os.listdir(path):
ages = img.split("_")[0]
img = cv2.imread(str(path)+"/"+str(img))
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
pixels.append(np.array(img))
age.append(np.array(ages))
age = np.array(age,dtype=np.int64)
pixels = np.array(pixels)
x_train,x_test,y_train,y_test = train_test_split(pixels,age,random_state=100)
input = Input(shape=(200,200,3))
conv1 = Conv2D(70,(3,3),activation="relu")(input)
conv2 = Conv2D(65,(3,3),activation="relu")(conv1)
batch1 = BatchNormalization()(conv2)
pool3 = MaxPool2D((2,2))(batch1)
conv3 = Conv2D(60,(3,3),activation="relu")(pool3)
batch2 = BatchNormalization()(conv3)
pool4 = MaxPool2D((2,2))(batch2)
flt = Flatten()(pool4)
#age
age_l = Dense(128,activation="relu")(flt)
age_l = Dense(64,activation="relu")(age_l)
age_l = Dense(32,activation="relu")(age_l)
age_l = Dense(1,activation="relu")(age_l)
model = Model(inputs=input,outputs=age_l)
model.compile(optimizer="adam",loss=["mse","sparse_categorical_crossentropy"],metrics=['mae','accuracy'])
save = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=2)
Well, you have to decide if you want to do a classification model or a regression model. As it stands now, it looks like you are trying to do a regression model.
Let's start at the outset. Apparently you have a dataset of image files whose names contain the age, something like 27_01.jpg I assume. So you split the name on the "_" to get the age associated with the image file. You then read in the image using cv2 and convert it to RGB. Now, cv2 already returns the image as an array, so you don't need to convert it to an np array; just use
pixels.append(img)
Now, the variable ages is a string which you want to convert into an integer, so just use the code
ages = int(img.split("_")[0])
This is now a scalar integer value, not an array, so just use
age.append(ages)
You now have two lists, pixels and age. To use them in a model you need to convert them to np arrays, so use
age = np.array(age)
pixels = np.array(pixels)
Now the next thing you want to do is create a train set and a test set using the train_test_split function. Let's assume you want 90% of the data set for training and 10% for testing, so use
x_train,x_test,y_train,y_test = train_test_split(pixels,age,train_size=.9, shuffle=True, random_state=100)
Now let's look at your model. This is what decides whether you are doing regression or classification, and you want to do regression. Your model is OK but needs some changes. You have 4 dense layers; I suspect this will lead to your model over-fitting, so I recommend that prior to the last layer you add a dropout layer. Use the code

drop = Dropout(rate=.4, seed=123)(age_l)
age_l = Dense(1, activation="linear")(drop)

Note the activation is set to linear. That way the output can take a range of values that can be compared to the integer values of the age array.
Now when you compile your model you want your loss to be mse, so it measures the error between the model's output and the ages. sparse_categorical_crossentropy is used when you are doing classification, which is NOT what you are doing. As for the metrics, accuracy is used for classification models, so you only want mae. So your compile code should be
model.compile(optimizer="adam",loss="mse",metrics=['mae'])
Now, model.fit looks OK, but you should run for more epochs, say 20. When you run your model, look at the training loss and the validation loss. As the training loss decreases, on AVERAGE the validation loss should trend downward. If it starts to trend upward, your model is over-fitting; in that case you may want to add an additional dropout layer.
At some point your model will stop improving if you run a sufficient number of epochs. You can usually get an improvement in performance if you use an adjustable learning rate. Since you are new to this you may not have experience using callbacks. Callbacks are used within model.fit and there are many types; documentation for callbacks can be found here. To implement an adjustable learning rate you can use the callback ReduceLROnPlateau; the documentation for that is here. Set it up to monitor the validation loss: if the validation loss fails to reduce for a "patience" number of epochs, the callback reduces the learning rate by the parameter "factor", where

new_learning_rate = current_learning_rate * factor

and factor is a float between 0 and 1.0. My recommended code for this callback is shown below:
rlronp=tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",factor=0.5,
patience=2, verbose=1)
I also recommend you use the callback EarlyStopping; the documentation for that is here. Set it up to monitor validation loss: if the loss fails to reduce for "patience" consecutive epochs, training will be halted. Set the parameter restore_best_weights=True; that way, if the callback halts training, it leaves your model with the weights from the epoch that had the lowest validation loss. My recommended code for the callback is shown below:
estop=tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4,
verbose=1, restore_best_weights=True)
To use the callbacks in model.fit, include the code
save = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=20,
callbacks=[rlronp,estop])
By the way, I think I am familiar with this dataset, or a similar one. Do not expect a great root mean squared error, as I have seen many models for this and none had a small error margin. Incidentally, if you want to learn machine learning there is an excellent set of about 200 tutorials by a guy called Gabriel Atkin. You can see his tutorials, called Data Everyday, here. The specific tutorial dealing with this kind of age dataset is located here.
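Putting all of the pieces above together, a sketch of the revised script might look like this (untested; the data loading and train/test split are as described earlier):

import tensorflow as tf
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     MaxPool2D, Flatten, Dense, Dropout)
from tensorflow.keras.models import Model

# x_train, x_test, y_train, y_test prepared as described above

inputs = Input(shape=(200, 200, 3))
x = Conv2D(70, (3, 3), activation="relu")(inputs)
x = Conv2D(65, (3, 3), activation="relu")(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)
x = Conv2D(60, (3, 3), activation="relu")(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)
x = Flatten()(x)
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
x = Dropout(rate=.4, seed=123)(x)
outputs = Dense(1, activation="linear")(x)  # linear head for regression

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=2, verbose=1)
estop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4,
                                         verbose=1, restore_best_weights=True)

save = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                 epochs=20, callbacks=[rlronp, estop])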
I have been doing neural network analysis on 20 thousand "images", each image represented in the form of the intensity of 100 * 100 * 100 neurons.
x = np.loadtxt('imgfile')
x = x.reshape(-1, img_channels, 100, 100, 100)
# similarly for the target variable 'y'
Above, the first dimension of x will be the number of images. I am using DataLoader to get the appropriate number of images for training during each iteration, as shown below.
from torch import Tensor
from torch.utils.data import TensorDataset, DataLoader

batch_size = 16
traindataset = TensorDataset(Tensor(x[:-testdatasize]), Tensor(y[:-testdatasize]))
train_loader = DataLoader(dataset=traindataset, batch_size=batch_size, shuffle=True)

for epoch in range(num_epochs):
    for i, (data, targets) in enumerate(train_loader):
        ...
I hope to increase the number of images to 50k but am restricted by the computer memory (imgfile is ~50 GB).
I was wondering if there is an efficient way to handle all the data? Like, rather than loading the whole imgfile, can we first divide them into sets, each with batch_size number of images, and load the sets periodically during training. I am not completely sure how to implement this.
I found some similar ideas using Keras here: https://machinelearningmastery.com/how-to-load-large-datasets-from-directories-for-deep-learning-with-keras/
Please point me towards any similar ideas implemented with pytorch or you have any ideas.
Digging a while after posting the question, I found out there is, of course, a way, using torch.utils.data.Dataset. Each image can be saved in a separate file, with all the filenames listed in 'filelistdata'. Only batch_size images are loaded into memory when called via the DataLoader (in the background, the __getitem__ method fetches the images). The following worked for me:
traindataset = CustDataset(filename='filelistdata', root_dir=root_dir)
train_loader = DataLoader(dataset=traindataset, batch_size=batch_size, num_workers = 16)
num_workers is really important for performance and should be higher than the number of CPUs you are using (I am using 4 CPUs above). I found the following resources useful for answering this question:
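CustDataset above is my own class, not something built into PyTorch. A minimal sketch of what it can look like (the "one .npy file per image, one '<file> <target>' pair per line of the file list" layout is an assumption; adapt it to your data):

import os
import numpy as np
import torch
from torch.utils.data import Dataset

class CustDataset(Dataset):
    def __init__(self, filename, root_dir):
        # assumed format: each line of 'filename' is "<imagefile> <target>"
        self.samples = []
        with open(filename) as f:
            for line in f:
                path, target = line.split()
                self.samples.append((path, float(target)))
        self.root_dir = root_dir

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # only this one sample is read from disk; DataLoader does the batching
        path, target = self.samples[idx]
        img = np.load(os.path.join(self.root_dir, path))
        return torch.from_numpy(img).float(), torch.tensor(target)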
How to split and load huge dataset that doesn't fit into memory into pytorch Dataloader?
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
https://www.youtube.com/watch?v=ZoZHd0Zm3RY
I have a set of 20,000 images that I am importing from disk like below.
import os
import numpy as np
from PIL import Image

imgs_dict = {}
path = "Documents/data/img"
valid_images = [".png"]
for f in os.listdir(path):
    ext = os.path.splitext(f)[1]
    if ext.lower() not in valid_images:
        continue
    img_name = os.path.basename(f)
    img_name = os.path.splitext(img_name)[0]
    img = np.asarray(Image.open(os.path.join(path, f)))
    imgs_dict.update([(img_name, img)])
The reason I am converting this to a dictionary at the end is that I also have two other dictionaries specifying, for each image id, the classification and whether it is part of the training or validation set. One of these dictionaries corresponds to all the data that should be part of the training set and the other to the data that should be part of the validation set. After I separate them out, I need to get them back into the standard array format for images (height, width, channels). How can I take a dictionary of images and convert it back into the format I want here? When I do the below, it produces an array with a shape of (8500,), which is the number of images in my training set but obviously not reflective of the height, width and channels.
x_train=np.array(list(training_images.values()))
np.shape(x_train)
(8500,)
Or, secondarily, am I going about this all wrong? Is there an easier way to handle images than this? It would seem much nicer to just keep the images in a numpy array from the beginning, but as far as I can tell there's no way for arrays to have a key value/label of any sort, so I can't pull out specific images.
Edit: For some more context, what I'm essentially trying to do is get my data into a format like what is described in the following link.
https://elitedatascience.com/keras-tutorial-deep-learning-in-python
The specific part in question I'm having trouble with is this:
from keras.datasets import mnist
# Load pre-shuffled MNIST data into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
When we load the MNIST data, how is the relation between X_train and y_train determined? How can I replicate that with my data?
Yes, there is an easier way of handling image data in Keras. Specifically, when dealing with a large dataset you want to use a generator instead of loading all of the images into memory, so please refer to the ImageDataGenerator class. This class is a data generator already implemented in Keras, so unless you need any special operations it can be the "go-to guy", at least for basic projects. It will also allow you to define basic augmentations and normalization (for example rescaling, normalizing the data, rotation, etc.).
Specifically, you can automatically load images per class, either by arranging them in subdirectories (put all the images of a single label under the same subdirectory) or by creating a data frame that indicates, for each image path, what its label is. Refer to flow_from_directory and flow_from_dataframe accordingly.
For train-test splitting, the easiest way is to keep your train and test sets in different directories (e.g. data/train and data/test) and create 2 different generators; see, for example, the figure in this tutorial.
In case you don't want to put the train and test data in different directories, you can use the validation_split argument when initializing the generator (e.g. validation_split=0.2); then, when invoking flow_from_directory, add the argument subset='validation' or subset='training'.
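For example (directory name and image size are placeholders):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_gen = datagen.flow_from_directory("data/img", target_size=(224, 224),
                                        subset="training")
val_gen = datagen.flow_from_directory("data/img", target_size=(224, 224),
                                      subset="validation")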
Having said all that, in case you want to load all of the images into memory as you did and just split them easily, you can use scikit-learn's train_test_split, as described here, for example.
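A minimal sketch, assuming images and labels are numpy arrays of the same length:

from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(images, labels,
                                                  test_size=0.2,
                                                  random_state=42)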
PS
Regarding MNIST: this is a well-established benchmark, which is strictly divided into train and test sets so that everyone can compare their evaluations on the exact same images. This is the reason it is already split in advance.
I am currently transitioning my TF code towards TFRecords and tf datasets. In my application, the trained model usually converges long before it has seen all training samples. I therefore usually set the data generator length myself to the number of batches that I want to fit in one epoch, and ensure in my generator that in the next epoch it picks up after the last sample from the previous epoch. This way all callbacks work as desired (especially early stopping), while I can still train my models on unseen data in each epoch.
How can I achieve this behaviour with tf datasets and TFRecords? I have read through the dataset definitions on the TensorFlow GitHub but am unsure whether this is possible.
I think there are two possible solutions to this if I set steps_per_epoch:
Overwriting the part of the code that specifies where the next sample is read from, so that it picks up right after the last sample from the previous epoch.
Trying to mimic the behaviour described above with a custom tf dataset implementation. I would be worried that this could have unforeseen impacts on parallelisation and performance.
However, I do not know how to accomplish either. So if you have any insights on this, I would be very grateful.
For now I can use an inelegant work-around in which I always train for one epoch and then initialise a new dataset with new tfrecord files, but I hope there is a better way, especially with regards to callbacks.
I am not sure I fully understand what you are trying to achieve. You want that:
During an epoch, your model does NOT see the whole dataset
The following epochs do not use samples from the previous ones
That's it?
From my point of view, the steps_per_epoch argument is your best bet. If you have a Dataset with, for example, 100 items (samples or batches) and you set steps_per_epoch=20, then during the first epoch your model will see items 0 to 19, items 20 to 39 during the second epoch, and so on. No need to overwrite any part of the code.
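In code that is simply (a sketch; dataset is assumed to already yield batches, and .repeat() keeps it from running dry):

# the dataset iterator is not reset between epochs, so every epoch
# continues where the previous one stopped
model.fit(dataset.repeat(), epochs=5, steps_per_epoch=20)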
Trying to mimic the Dataset behavior yourself is probably not a good idea (too many things to take care of, a lot of hard work involved).
From your last paragraph, I understand that you want each epoch to be fed with data from specific TFRecord files. Maybe you can look at tf.data.Dataset.flat_map. Build a list of your TFRecord files (the same file can appear multiple times) and "flat_map" TFRecordDataset over it:
files = tf.data.Dataset.from_tensor_slices([
    "file1.tfrecord", "file2.tfrecord",
    "file1.tfrecord", "file3.tfrecord"
])
dataset = files.flat_map(tf.data.TFRecordDataset)
Iterating over dataset will give you Examples from file1, then from file2, then from file1 again and then from file3.
Hope this can help.
I have a dataset of roughly 34000 images divided into 2 sets: a train set (30000 images) and a validation set (4000 images). Each image is the result of the difference between two frames taken from a video (the time offset between the images in each pair is about 1 second). The videos have a static background, so the diff images contain mostly black with only one or two small colored regions. Each diff image has a label (there has been an action or not: 1 or 0), so this is a sort of binary classification.

Briefly, I'm using the slim models pretrained on ImageNet to do the finetuning on my dataset. I've launched 5 separate trainings using 5 different networks: InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile and NASNet. I got very good results using the first 4 networks (InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile), but that was not the case with NASNet. The thing is that the area under the ROC curve on the validation set is always = 0.5, and the logits of the validation images all have roughly the same values, which is really weird. In fact, I got this kind of result with NASNet-mobile on the first 10000 mini-batches, but after that the model did converge. Here are the values of the hyperparameters I have in my script:
batch_size=10
weight_decay = 0.00004
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9
learning_rate_decay_type = exponential
learning_rate = 0.01
learning_rate_decay_factor = 0.94
num_epochs_per_decay = 2.0  # number of epochs after which the learning rate decays
I'm still a newbie in TensorFlow and I did not find anything related anywhere else. This is a really weird behavior, because I'm using the same parameters and the same inputs, but it seems that with NASNet there is a problem somewhere. I'm not only looking for a solution (if possible, because I know it is tough to troubleshoot such things without many details about the model); insights about where to look and how to troubleshoot would also be great. Has anybody had this problem finetuning NASNet before, e.g. the model not converging? Finally, I know it is really hard to get answers to such questions, but I hope to get at least some insights so I can move forward with my investigations.
EDIT:
Here are the plots of the cross entropy and regularization losses:
EDIT:
As proposed in the answer, I set the drop_path_keep_prob param to 1, and now the model converges and I get good accuracy on the validation set. But now the question is: what does this param mean? Is it one of the params that we should adapt to our dataset (like the learning rate, etc.)?
The simplest sanity check you can do is to run the finetuning on a single minibatch. Any deep network should be able to overfit to that if there aren't any big problems. If you see that it can't, then there must be some problem with the definition, or with the way you're using it.
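A sketch of that check in a Keras-style setup (names are placeholders; with the slim training scripts the equivalent is to point the script at a tiny dataset of just a few images):

# train on one cached minibatch over and over; the loss should drop to ~0
single_batch = train_ds.take(1).cache().repeat()
model.fit(single_batch, epochs=50, steps_per_epoch=1)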
The only guess I have in your case is that it could be something to do with the drop_path implementation. It's disabled in the mobile version, but it is enabled during training on the large model. It could make the model unstable enough that it wouldn't fine tune, so it may be worth trying to train with it disabled.