I have a collection of images with open and closed eyes.
The data is collected from the current directory using keras in this way:
from keras.preprocessing.image import ImageDataGenerator

batch_size = 64
N_images = 84898  # total number of images

datagen = ImageDataGenerator(rescale=1./255)
data_iterator = datagen.flow_from_directory(
    './Eyes',
    shuffle=False,  # the boolean False, not the string 'False' (which is truthy and keeps shuffling on)
    color_mode='grayscale',
    target_size=(h, w),
    batch_size=batch_size,
    class_mode='binary')
I've got a .csv file with the state of each eye.
I've built this Sequential model:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_filters = 8
filter_size = 3
pool_size = 2

model = Sequential([
    Conv2D(num_filters, filter_size, input_shape=(90, 90, 1)),
    MaxPooling2D(pool_size=pool_size),
    Flatten(),
    Dense(16, activation='relu'),
    Dense(2, activation='sigmoid'),  # two classes: "open" and "closed"
])
Model compilation.
model.compile(
'adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
Finally I fit all the data with the following:
model.fit(
train_images,
to_categorical(train_labels),
epochs=3,
validation_data=(test_images, to_categorical(test_labels)),
)
The result fluctuates around 50% and I do not understand why.
Your current model essentially has one convolutional layer. That is, num_filters convolutional filters (which in this case are 3 x 3 arrays) are defined and fit such that when they are convolved with the image, they produce features that are as discriminative as possible between classes. You then perform maxpooling to slightly reduce the dimension of the output CNN features before passing to 2 dense layers.
I'd start by saying that one convolutional layer is almost certainly insufficient, especially with 3x3 filters. Basically, with a single convolutional layer, the most meaningful information you can get are edges or lines. These features are only marginally more useful to a function approximator (i.e. your fully connected layers) than the raw pixel intensity values because they still have an extremely high degree of variability both within a class and between classes. Consider that shifting an image of an eye 2 pixels to the left would result in completely different values output from your 1-layer CNN. You'd like the outputs of your CNN to be invariant to scale, rotation, illumination, etc.
In practice, this means you're going to need more convolutional layers. The relatively simple VGG-16 network has 13 convolutional layers, and modern residual networks often have over 100. Try writing a routine that defines sequentially more complex networks until you start seeing performance gains.
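For instance, a slightly deeper stack of Conv/MaxPool blocks could be a starting point. This is only a sketch: the filter counts (16/32/64), the dense width, and the 90x90 grayscale input are assumptions to tune.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Sketch of a deeper network: three conv/pool blocks before the dense head.
model = Sequential([
    Conv2D(16, 3, activation='relu', input_shape=(90, 90, 1)),
    MaxPooling2D(2),
    Conv2D(32, 3, activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, 3, activation='relu'),
    MaxPooling2D(2),
    Flatten(),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax'),  # softmax for two mutually exclusive classes
])

Each extra conv/pool block lets later filters combine the edges found by earlier ones into larger, more invariant patterns.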
As a secondary point, you generally don't want to use a sigmoid() activation on your final layer outputs during training. It saturates and shrinks the gradients, making it much slower to backpropagate your loss. You don't actually care that the output values fall between 0 and 1, only about their relative magnitudes. Common practice is to use a cross-entropy loss that combines a log-softmax (whose gradient is more stable than a plain softmax) with negative log-likelihood, as you've already done. Since the log-softmax portion transforms the output values into the desired range, there's no need for the sigmoid activation.
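In Keras terms, a minimal sketch of that advice (assuming tf.keras, where the loss object accepts from_logits) is to drop the sigmoid from the final layer and let the loss apply the numerically stable softmax:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Sketch: the final Dense emits raw logits; the loss applies log-softmax internally.
model = Sequential([
    Conv2D(8, 3, activation='relu', input_shape=(90, 90, 1)),
    MaxPooling2D(2),
    Flatten(),
    Dense(16, activation='relu'),
    Dense(2),  # no sigmoid here
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

Equivalently, you can keep loss='categorical_crossentropy' as a string and put activation='softmax' on the last layer; what matters is not squashing each output independently with a sigmoid.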
Related
My lack of good intuition about LSTMs and how they work, paired with an awkward dataset and regression problem, leaves me with questions about how to approach and solve my scenario.
I don't want any in-depth answers; I'm just looking for intuition and suggestions.
My dataset consists of:
X flights, where each flight has Y timesteps and each timestep has Z features. Every flight is characterized by K, a 2-value vector (K_1, K_2), and that's the regression target: predicting these 2 variables.
I've tried several different regression methods and they perform really well. Because I have time-dependent trajectories, those methods computed statistics for the Z features across each trajectory, transforming each whole trajectory into flat [Z*l] -> [K_1, K_2] supervised data, where l is a factor reflecting the newly computed features (for example, one of these features could be the mean of a feature across the whole trajectory).
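As a rough sketch of what that aggregation step looks like (the specific statistics computed here are assumptions):

import numpy as np

# Sketch: collapse each (Y timesteps, Z features) trajectory into summary
# statistics, giving one flat feature vector per flight for classical regressors.
def aggregate_trajectories(X_raw):
    # X_raw has shape (num_flights, Y, Z)
    means = X_raw.mean(axis=1)
    stds = X_raw.std(axis=1)
    mins = X_raw.min(axis=1)
    maxs = X_raw.max(axis=1)
    return np.concatenate([means, stds, mins, maxs], axis=1)  # (num_flights, Z*4)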
Problem:
I want to implement an LSTM regression pipeline that takes the raw dataset (X, Y, Z) as input and, after a dense layer, outputs the 2 values K_1 and K_2 (with 1 model, or 2 separate models for each value), backpropagating the loss against K_1_target and K_2_target correctly.
I've tried it and it seems to perform really poorly, and I don't know whether it's a technical mistake or a theoretical one.
Below I provide the architecture I use at the moment.
from keras.models import Sequential
from keras.layers import Masking, LSTM, Flatten, Dense
from keras.optimizers import Adam

samples, timesteps, features = x_train.shape[0], x_train.shape[1], x_train.shape[2]

# One model per target value, each trained separately on K_1 and K_2.
model1 = Sequential()
model1.add(Masking(mask_value=-10.0))
model1.add(LSTM(hidden_units, return_sequences=True))
model1.add(Flatten())
model1.add(Dense(hidden_units, activation="relu"))
model1.add(Dense(1, activation="linear"))
model1.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model1.fit(x_train, y_train[:, 0], validation_data=(x_test, y_test[:, 0]),
           epochs=epochs, batch_size=batch, shuffle=False)

model2 = Sequential()
model2.add(Masking(mask_value=-10.0))
model2.add(LSTM(hidden_units, return_sequences=True))
model2.add(Flatten())
model2.add(Dense(hidden_units, activation="relu"))
model2.add(Dense(1, activation="linear"))
model2.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model2.fit(x_train, y_train[:, 1], validation_data=(x_test, y_test[:, 1]),
           epochs=epochs, batch_size=batch, shuffle=False)
In my mind it makes sense, but it doesn't seem to be working well.
I'm not entirely sure how the LSTM's trainable parameters learn through backpropagation when a dense layer follows and the loss comes from the same 2 target values being backpropagated again and again for the same trajectory.
Any kind of clarification, correction or intuition will help me a lot!
Lastly, I'll provide some real details.
K_1 takes discrete values from 0 to 100.
K_2 takes floating-point values from 0 to 1 (1 decimal place of precision).
x_input is a subset of a (6991, 527, 6) array obtained with k-fold CV (k = 10), so about (6292, 527, 6), and y_input has shape (6292, 2). The test split is formed accordingly.
I've used pre-padding so all trajectories have equal length, plus a masking layer that ignores rows with no data.
I've normalized all my features and target values independently with MinMax normalization, and inverse-transformed the model's output and y_test for correct loss calculation.
The best results I've got so far, as MAE:
K_1 (whose range is 0 - 100): ~6.0 (even lasso performs better, despite this being a non-linear problem)
K_2 (whose range is 0 - 1): ~0.003 (pretty good)
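For reference, a single-model variant of the pipeline described above could look like the sketch below; the hyperparameter names are reused from the code above, and having the LSTM return only its final state so that one MSE loss covers both K_1 and K_2 is an assumption, not a verified fix.

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
from keras.optimizers import Adam

# Sketch: one model for both targets; the LSTM's final state summarizes the
# whole trajectory and feeds a 2-unit regression head [K_1, K_2].
model = Sequential()
model.add(Masking(mask_value=-10.0, input_shape=(timesteps, features)))
model.add(LSTM(hidden_units, return_sequences=False))
model.add(Dense(hidden_units, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          epochs=epochs, batch_size=batch, shuffle=False)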
I'm trying to improve my model so it becomes a bit more accurate. Right now, for every epoch I get a training accuracy of 0.0003 and a validation accuracy of 0. I know this isn't good, but I don't know how to fix it.
Data is normalized with MinMaxScaler. 4 of the 8 features are normalized (the other 4 are hour, day, day_of_week and month).
Update:
I've also tried to normalize the entire dataset and it doesn't make a difference.
scaling = MinMaxScaler(feature_range=(0, 1)).fit(df[cols])
df[cols] = scaling.transform(df[cols])  # df[cols], not df[[cols]], since cols is already a list
My model (the data shape is (5351, 1, 8), so the input_shape is (1, 8)):
model = keras.Sequential()
model.add(keras.layers.Bidirectional(keras.layers.LSTM(2,input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, activation='linear')))
model.add(keras.layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['acc'])
history = model.fit(
X_train, y_train,
epochs=200,
batch_size=24,
validation_split=0.35,
shuffle=False,
)
I tried using the answer to this question:
Keras model accuracy not improving
but it didn't work.
A mean_squared_error loss is for regression tasks while an acc metric is for classification problems, so it makes no sense to use them together.
If you work on a classification problem, use binary_crossentropy or categorical_crossentropy as the loss and keep the metric parameter as you did.
If it is a regression task, change the metric to [mse] for mean squared error instead of [acc].
Your model "works" and you have applied the standard formula for backpropagation by using the mean squared error loss. But measuring accuracy makes Keras check whether your model's output is EXACTLY equal to the expected values. Since the loss function is for regression, it will hardly ever be exactly equal.
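Concretely, a sketch of the compile call with a regression metric in place of accuracy, reusing the model from the question:

# Sketch: keep the regression loss, report MAE/MSE instead of accuracy.
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['mae', 'mse'])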
Three last points because that little change won't correct everything.
Firstly, your last dense layer should have an explicit activation function (it's safer).
Secondly, I'm pretty sure a Bidirectional LSTM layer placed before a Dense layer should have return_sequences=False. An LSTM layer (with or without Bidirectional) can return the full sequence of vectors (like a matrix), but a Dense layer takes vectors as input. In this case it happens to work anyway because of the third point.
The last point is about the shape of your data. You have 5351 examples of shape (1, 8), each a sequence containing a single vector of size 8. But an LSTM layer takes a sequence of vectors, and here the length of your sequence is one, so I'm not sure an RNN-type layer is relevant at all.
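Putting those points together, a revised sketch of the model (the tiny unit count is kept from the question, and whether an RNN is the right tool here remains an open question):

import keras

# Sketch: return only the last LSTM output so the Dense layer receives a vector,
# make the output activation explicit, and report a regression metric.
model = keras.Sequential()
model.add(keras.layers.Bidirectional(
    keras.layers.LSTM(2, return_sequences=False),
    input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(keras.layers.Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['mae'])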
I believe this is my first question here.
I am very new to Neural Networks. I just started working on one in Python that is supposed to look at levels of glucose in patients with a risk of diabetes and rank them from 1 to 3 on their risk of developing the disease. With 1 being high risk, and 3 being low risk.
Right now, I have ~110 graphs previously ranked by a doctor (42 of risk 1, 51 of risk 2, 10 of risk 3). I randomly took 25% of each group as the test set, put the rest in training, and gave it to Keras for learning.
It works just fine. Here's my code:
print("Convoluting")
classifier.add(Convolution2D(32, 3, 3, input_shape = (64, 64, 3), activation = 'relu'))
print("Pooling")
classifier.add(MaxPooling2D(pool_size = (2,2)))
print("Flattening")
classifier.add(Flatten())
print("Connecting")
classifier.add(Dense(activation = 'relu', units=128))
classifier.add(Dense(activation = 'softmax', units=3))
print("Compiling CNN")
classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
print("Generating images")
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator()
test_datagen = ImageDataGenerator()
print("Setting sets")
training_set = train_datagen.flow_from_directory(
'dataset/train_set',
target_size=(64,64),
batch_size=Batches,
class_mode='categorical')
test_set = test_datagen.flow_from_directory(
'dataset/test_set',
target_size=(64,64),
batch_size=Batches,
class_mode='categorical')
print("training nn...")
from IPython.display import display
from PIL import Image
classifier.fit_generator(
training_set,
steps_per_epoch=StepsPerEpoch,
epochs=Epochs,
validation_data=test_set,
validation_steps=ValidationSteps)
However, the accuracy after the training won't go above 0.4. Now, I know I have a relatively small sample for training a neural network, but I currently don't have access to information from more patients. I do, however, have access to demographic data from those patients, like weight, height, and age.
Basically, I would like to somehow include the weight, height, and age of each patient along with the graph showing their levels of glucose. So my program knows to take that information into account when making a judgement.
I haven't been able to find anything similar when searching online, though that may be due to my limited knowledge of the matter. What should I do?
Thanks for your time.
If I had to do something like that, I would concatenate the image features and the numerical features once they have the same form - a feature vector. For that, you can view the convolutional part of the network as a feature extractor that produces a list of features after the last pooling layer, i.e. something with a shape like [batch_size, 1, 1, N]. At this point you can easily append/concatenate your regular numerical features before feeding everything into a dense layer.
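As a rough illustration (a sketch only: the layer sizes and the three demographic features are assumptions), the Keras functional API lets you merge the two branches:

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Concatenate

# Image branch: a small conv stack ending in a flat feature vector.
img_in = Input(shape=(64, 64, 3))
x = Conv2D(32, (3, 3), activation='relu')(img_in)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)

# Numerical branch: weight, height, age (three features assumed here).
num_in = Input(shape=(3,))

# Concatenate image features with the numerical features, then classify.
merged = Concatenate()([x, num_in])
h = Dense(128, activation='relu')(merged)
out = Dense(3, activation='softmax')(h)

model = Model(inputs=[img_in, num_in], outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Training then takes a list of two inputs, e.g. model.fit([images, demographics], labels, ...), so flow_from_directory would need to be replaced or wrapped by a generator that yields both inputs.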
Couple things I would be on the lookout for:
be sure that numerical and convolutional features are from the same distribution i.e. BatchNorm is applied to both
be sure that they have roughly the same size i.e. if you have 2048 conv features and only 5 numerical features it might not quite work as is.
You can get more inspiration from Wide & Deep Learning for Recommender Systems.
First of all, you're using a small deep network, so you'll need well over 100 instances; I suggest looking up "Data Augmentation" to learn how to increase your data. Secondly, since you have little data for training your network, I think 10 percent or less for the test set is enough. Finally, to use the other features for training, you could extract features from the images with a convolutional network, then concatenate those with the other features and train a simpler network to do the final classification.
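For the data augmentation part, here is a sketch with Keras' ImageDataGenerator; which transformations are safe for glucose plots (small shifts and zooms, but probably not flips) is an assumption to verify on your data.

from keras.preprocessing.image import ImageDataGenerator

# Sketch: small random shifts and zooms to multiply the effective training data.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.1,
)
training_set = train_datagen.flow_from_directory(
    'dataset/train_set',
    target_size=(64, 64),
    batch_size=32,
    class_mode='categorical',
)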
I am trying to train a simple LSTM to fit a line. My hypothesis is that I should be able to fit a linearly decreasing trend with zero input since the LSTM can decide how much it listens to its input vs. internal state, and can thus learn to just operate on the internal state. Basically a degenerate case for testing whether the LSTM can fit an expected result with zero input.
I create my input and target data:
import numpy as np

seq_len = 1000
x_train = np.zeros((1, seq_len, 1))  # [batch_size, seq_len, num_feat]
target = np.linspace(100, 0, num=seq_len).reshape(1, -1, 1)
I create a pretty simple network:
from keras.models import Model
from keras.layers import LSTM, Dense, Input, TimeDistributed
x_in = Input((seq_len, 1))
seq1 = LSTM(8, return_sequences=True)(x_in)
dense1 = TimeDistributed(Dense(8))(seq1)
seq2 = LSTM(8, return_sequences=True)(dense1)
dense2 = TimeDistributed(Dense(8))(seq2)
out = TimeDistributed(Dense(1))(dense2)
model = Model(inputs=x_in, outputs=out)
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(x_train, target, batch_size=1, epochs=1000,
validation_split=0.)
I also created a custom callback that calls model.predict(x_train) after every epoch and appends the results to an array so I can see how my model's output evolves over time. Basically the model just learns to predict a constant value which gradually (asymptotically) approaches the mean of my target line (in the plot the target line is in red; not sure why the legend didn't show).
So basically nothing is driving my response to fit the actual line, I'm just gradually approaching the mean of the line. I suspect I am not getting any gradient with respect to time (data index), just an average gradient over time. But I would have thought LSTM losses would automagically give you gradient through time.
I've tried:
different activation functions for the LSTM layers (None, 'relu' for both the regular activation and recurrent activation)
different optimizers ('nadam', 'adadelta', 'rmsprop')
the 'mean_absolute_error' loss function, which I didn't expect to improve the results, and it behaved about the same
passing sequences of random numbers drawn from a normal distribution as input
replacing LSTM with GRU
Nothing seems to do it.
Anybody have a suggestion as to how I can force this thing to train on the gradient as a function of my sequence index, i.e. g(t)? Or any other suggestions on how I can get this to work?
Note: with the trend as shown, if the LSTM outputs a constant value at exactly the mean (50), the minimum mean absolute error will be about 25 and the minimum mean squared error will be about 835. So if we don't see anything better than that, we probably aren't fitting the line, just its mean.
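A quick sketch to check those baseline numbers, using the same target definition as above:

import numpy as np

# Baseline error if the model only ever predicts the mean of the target line.
seq_len = 1000
target = np.linspace(100, 0, num=seq_len)
const_pred = np.full_like(target, target.mean())   # always predict 50
print(np.abs(target - const_pred).mean())          # ~25   (minimum MAE)
print(((target - const_pred) ** 2).mean())         # ~835  (minimum MSE)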
Those are just reference values in case you run this yourself.
I'm building an LSTM model in Keras to classify entities from sentences. I'm experimenting with two approaches: zero-padded sequences with the mask_zero parameter, or a generator that trains the model on one sentence (or batches of same-length sentences) at a time so I don't need to pad them with zeros.
If I define my model as such:
model = Sequential()
model.add(Embedding(input_dim=vocab_size+1, output_dim=200, mask_zero=True,
weights=[pretrained_weights], trainable = True))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(target_size, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics = ['accuracy'])
Can I expect the padded sequences with the mask_zero parameter to perform similarly to feeding the model non-padded sequences one sentence at a time? Essentially:
model.fit(padded_x, padded_y, batch_size=128, epochs=n_epochs,
validation_split=0.1, verbose=1)
or
def iter_sentences():
while True:
for i in range(len(train_x)):
yield np.array([train_x[i]]), to_categorical([train_y[i]], num_classes = target_size)
model.fit_generator(iter_sentences(), steps_per_epoch=less_steps, epochs=way_more_epochs, verbose=1)
I'm just not sure if there is a general preference for one method over the other, or the exact effect the mask_zero parameter has on the model.
Note: There are slight parameter differences for the model initialization based on which training method I'm using - I've left those out for brevity.
The biggest difference will be performance and training stability; otherwise, padding and then masking is equivalent to processing a single sentence at a time.
performance: training one data point at a time might not exploit any parallelism that is available on the hardware. Often, we adjust the batch size to get the best performance from the machine during training and prediction.
training stability: when you set the batch size to 1 you are no longer performing mini-batch training. The training routine applies updates after every data point, which can be detrimental for momentum-based algorithms such as Adam. Accumulating gradients over a batch instead tends to give more stable convergence, especially if the data is noisy.
So to answer the question: no, you can't expect them to perform similarly.
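For completeness, a sketch of the padding route described in the question (assuming train_x holds variable-length, integer-encoded sentences):

from keras.preprocessing.sequence import pad_sequences

# Sketch: zero-pad all sentences to a common length so they can be batched;
# Embedding(mask_zero=True) then ignores the padded positions (index 0 must be
# reserved for padding, which the vocab_size+1 above already accounts for).
padded_x = pad_sequences(train_x, padding='post', value=0)
# The label sequences need matching padding (and one-hot encoding) as well.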