Keras output layer gives unexpected error

Keras output layer gives unexpected error - python

I am using Keras for a binary classification problem. I am using the following adaptation of LeNet:
lenet_model = models.Sequential()
lenet_model.add(Convolution2D(filters=filt_size, kernel_size=(kern_size,
kern_size), padding='valid', input_shape=input_shape))
lenet_model.add(Activation('relu'))
lenet_model.add(BatchNormalization())
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Convolution2D(filters=64, kernel_size=(kern_size,
kern_size), padding='valid'))
lenet_model.add(Activation('relu'))
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Convolution2D(filters=128, kernel_size=(kern_size,
kern_size), padding='valid'))
lenet_model.add(Activation('relu'))
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Flatten())
lenet_model.add(Dense(1024, kernel_initializer='uniform'))
lenet_model.add(Activation('relu'))
lenet_model.add(Dense(512, kernel_initializer='uniform'))
lenet_model.add(Activation('relu'))
lenet_model.add(Dropout(0.2))
lenet_model.add(Dense(1, kernel_initializer='uniform'))
lenet_model.add(Activation('sigmoid'))
lenet_model.compile(loss='binary_crossentropy', optimizer=Adam(),
metrics=['accuracy'])
But I am getting this:
ValueError: Error when checking model target: expected activation_6 to have shape (None, 1) but got array with shape (1652, 2). It gets resolved if I use 2 in the final Dense layer.

I would suggest first check the dimensionality of your data. The training dataset target is 2 dimensional, but the model takses 1 dimensional data.
You have set lenet_model.add(Dense(1, kernel_initializer='uniform')) to accept 2 dimensional data. You need to set the final dense layer shape such that it accepts target shape (None,2)
lenet_model.add(Dense(2, kernel_initializer='uniform')) is what it should be else preprocess your data such that target data is 1 dimensional data.
Consider reading the documentaion before writing the code next time.

It seems that in your preprocessing steps, you have used functions to turn your numerical class labels into categorical ones, i.e., representing numerical classes in the one-hot coding scheme (in Keras, to_categorical(y, num_classes=2) would do this job for you).
Since you are dealing with a binary problem, if the original labels are 0s and 1s, the coded categorical labels would be 01s and 10s (in labels coded in the one-hot scheme, counting from right to left, the nth digit would be 1 if the numerical class for this instance is n while the rest of that label would be 0). This would explain why your data dimension in the error traceback is (1652, 2).
However, since you have set the output dimension in your model to 1, your output layer would expect the desired labels in data to be of 1 digit only, which would correspond to the raw data before you applied any preprocessing steps mentioned above.
So, you could fix this problem either by taking out the preprocessing for the labels or changing the output dimension to 2. If you stick with using categorical labels coded in the one-hot fashion, you should also switch the sigmoid activation in the last layer to softmax activation since sigmoid only deals with binary numerical classes, i.e., 0 or 1. For a binary classification problem, these two choices should not differ in performance much.
One thing worth mentioning is that you should also pay attention to the cost function you use when you compile this model. Generally speaking, categorical labels work the best with cost functions like categorical crossentropy. Especially for multi-class classification (more than 2 classes) problems where you would have to use categorical labels together with a softmax activation, categorical crossentropy should pretty much be your default choice since it has many benefits over some other common cost functions such as MSE and raw error count.
One of the many benefits of categorical crossentropy would be the fact that it penalizes a "very confident mistake" much more than the case where the classifier "almost got it right", which makes sense. For example, in a binary classification setting using categorical crossentropy as the cost function, a classifier that is 95% sure that a given instance is of class 0 whereas the instance actually belongs to class 1 would be penalized more than a classifier that is 51% percent sure when it made this mistake. Some other cost functions like raw error count are insensitive to how "sure" the classifier is when it makes decisions and those cost functions only take into consideration the final classification result, which essentially means losing a great deal of useful information. Some other cost functions such as MSE would give more emphasis on the wrongly classified instances, which is not always the desired feature to have.

Related

Keras LSTM trained with masking and custom loss function breaks after first iteration

I am attempting to train an LSTM that reads a variable length input sequence and has a custom loss function applied to it. In order to be able to train on batches, I pad my inputs to all be the maxmimum length.
My input data is a float tensor of shape (7789, 491, 11) where the form is (num_samples, max_sequence_length, dimension).
Any sample that is shorter than the maximum length I pad with -float('inf'), so a sequence with 10 values would start with 481 sets of 11 '-inf' values followed by the real values at the end.
The way I am attempting to evaluate this model doesn't fit into any standard loss functions, so I had to make my own. I've tested it and it performs as expected on sample tensors. I don't believe this is the source of the issue so I won't go into details, but I could be wrong.
The problem I'm having comes from the model itself. Here is how I define and train it:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Masking(mask_value=-float('inf'),
input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(30),
kernel_initializer=tf.keras.initializers.zeros())
model.add(tf.keras.layers.Reshape((3, 10)))
model.compile(loss=batched_custom_loss, optimizer='rmsprop', run_eagerly=True)
model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)
No errors are thrown when I try to fit the model, but it only works on the first batch of training. As soon as the second batch starts, the loss becomes 'nan'. Upon closer inspection, it seems like the LSTM layer is outputting 'nan' after the first epoch of training.
My two guesses for what is going on are:
I set up the masking layer wrong, and it for some reason fails to mask out all of the -inf values after the first training iteration. Thus, -inf gets passed through the LSTM and it goes haywire.
I did something wrong with the format of my loss function, and the when the optimizer applies my calculated loss to the model it ruins the weights of the LSTM. For reference, my loss function outputs a 1D tensor with length equal to the number of samples in the batch. Each item in the output is a float with the loss of that sample.
I know that the math in my loss function is good since I've tested it on sample data, but maybe the output format is wrong even though it seems to match what I've found online.
Let me know if the problem is obvious from what I've shown or if you need more information.

Neural network with linear activation output. Calculate output range for each of the output neurons

Let's assume I have a neural network like the following:
model = keras.models.Sequential()
model.add(keras.layers.Dense(10, input_shape=(5,), activation='relu'))
model.add(keras.layers.Dense(4, activation='linear'))
With n output neurons with a linear activation function.
The training process is not important here, so we can take a look at the random weights that keras initialized using:
model.weights
Of course, in a real example, these weights should be adjusted in the training process.
Depending on these model.weights, each of the output neurons returns values in a range.
I would like to calculate this exact range.
Does keras offer any function to calculate it?
I built a flawed piece of code to make an approximation of it, using a loop and predicting random inputs. But this would not be really useful in a real example with much more inputs/neurons/weights.
Here a few examples trying to clarify my question (All of them assume that the input values are between and 1):
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,),
activation='linear', use_bias=False))
model.set_weights([np.array([1, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between 0 and 2
model.set_weights([np.array([-0.5, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between -0.5 and 1
model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(2,), activation='linear', use_bias=False))
model.add(keras.layers.Dense(1, activation='linear', use_bias=False))
model.set_weights([np.array([1, 1, 1, 1]).reshape(2,2), np.array([1, 1]).reshape(2,1)])
For the previous example, the output neuron results would be between 0 and 4
These are simplified examples. In a real scenario with a much complex network structure, activation functions, bias..... these ranges are not obvious to calculate.

It sounds like you are roughly interested in what is referred to as neural network verification. This field broadly consists of answering the question: given a range of possible inputs, what is the range of possible outputs from a neural network with a given set of weights? A few things to note:
A neural network is essentially a complex, non-linear function. That is, it maps the input space to the output space. Defining an output range does not make sense except with respect to an input range. In your question you make no reference to the inputs, so your examples are flawed/incomplete.
In general, neural network verification is an emerging field with most published works being fairly recent (last 5-7 years). That being said, there are exact and approximate methods for fully connected networks with a variety of activation functions. I'll list a few such methods here:
https://arxiv.org/abs/2004.05519 - MATLAB toolbox, but you could export your neural network in ONNX format and then use MATLAB for the verification/output range analysis.
https://arxiv.org/abs/1804.10829 - specifically for ReLU activation function.
https://anwu1219.github.io/download/Marabou.pdf with python API available here: https://github.com/NeuralNetworkVerification/Marabou
The field is still evolving so you may have to do some of the codings yourself rather than using pre-existing libraries in some cases, but these papers/ a search query for neural network verification should at least give you some ideas of where to start.

IMO, there is no such a function, as far as I know, to estimate the output value's range( without imposing your restriction).
For example, a dense function without bias is just a plain linear function of a=bx, in your case, you are restricting x to 0-1 range and explicitly setting b to your desired values.
You will always get the value in those ranges you`ve cited in your question. A hypothetical example is to choose b randomly and the range in your questions would not hold the ground.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), activation='linear', use_bias=False))
import matplotlib.pyplot as plt
#model.set_weights([np.array([1, 1]).reshape(2, 1)])
eval_func = keras.backend.function([model.input], model.layers[-1].output)
outputs = eval_func(np.array([[2,1]]))
counts, bins = np.histogram(outputs)
plt.hist(bins[:-1], bins, weights=counts)

Scaling the sigmoid output

I am training a Network on images for binary classification. The input images are normalized to have pixel values in the range[0,1]. Also, the weight matrices are initialized from a normal distribution. However, the output from my last Dense layer with sigmoid activation yields values with a very minute difference for the two classes. For example -
output for class1- 0.377525 output for class2- 0.377539
The difference for the classes comes after 4 decimal places. Is there any workaround to make sure that the output for class 1 falls around 0 to 0.5 and for class 2 , it falls between 0.5 to 1.
Edit:
I have tried both the cases.
Case 1 - Dense(1, 'sigmoid') with binary crossentropy
Case 2- Dense(2, 'softmax') with binary crossentropy
For case1, the output values differ by a very small amount as mentioned in the problem above. As such , i am taking mean of the predicted values to act as threshold for classification. This works upto some extent, but not a permanent solution.
For case 2 - the prediction overfits to one class only.
A sample code : -
inputs = Input(shape = (128,156,1))
x = Conv2D(.....)(inputs)
x = BatchNormalization()(x)
x = Maxpooling2D()(x)
...
.
.
flat=Flatten()(x)
out = Dense(1,'sigmoid')(x)
model = Model(inputs,out)
model.compile(optimizer='adamax',loss='binary_crossentropy',metrics=['binary_accuracy'])

It seems you are confusing a binary classification architecture with a 2 label multi-class classification architecture setup.
Since you mention the probabilities for the 2 classes, class1 and class2, you have, set up a single label multi-class setup. That means, you are trying to predict the probabilities of 2 classes, where a sample can have only one of the labels at a time.
In this setup, it's proper to use softmax instead of sigmoid. Your loss function would be binary_crossentropy as well.
Right now, with the multi-label setup and sigmoid activation, you are independently predicting the probability of a sample being class1 and class2 simultaneously (aka, multi-label multi-class classification).
Once you change to softmax you should see more significant differences between the probabilities IF the sample actually definitively belongs to one of the 2 classes and if your model is well trained & confident about its predictions (validation vs training results)

First, I would like to say the information you provided is insufficient to exactly debug your problem, because you didn't provide any code of your model and optimizer. I suspect there might be an error in the labels, and I also suggest you use a softmax activation fuction instead of the sigmoid function in the final layer, although it will still work through your approach, binary classification problems must output one single node and loss must be binary cross entropy.
If you want to receive an accurate solution, please provide more information.

Proper and definitive explanation about how to build a CNN 1D in Keras

Ciao,
this is the second part of a problem I'm facing with CNN 1d. The first part is this
How does it works the input_shape variable in Conv1d in Keras?
I'using this code:
from keras.models import Sequential
from keras.layers import Dense, Conv1D
import numpy as np
N_FEATURES=5
N_TIMESTEPS=10
X = np.random.rand(100, N_FEATURES)
Y = np.random.randint(0,2, size=100)
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=N_TIMESTEPS, activation='relu', input_shape=(N_TIMESTEPS, N_FEATURES)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Now, what I want to do?
I want to train a CNN 1d over a timeseries with 5 features. Actually I want to work with time windows og length N_TIMESTEPS rather than timeserie it self. This means that I want to use a sort of "magnifier" of dimension N_TIMESTEPS x N_FEATURES on the time series to work locally. That's why I've decided to use CNN
Here come the first question. It is not clear at all if I have to transform the time series into a tensor or it is something that Keras will do for me since I've specified the kernel_size variable.
In case I must provide a tensor I would do something like this:
X_tensor = []
for i in range(len(X)-N_TIMESTEPS):
X_tensor+=[X_tensor[range(i, N_TIMESTEPS+i), :]]
X_tensor = np.asarray(X_tensor)
In this case of course I should also provide a Y_tensor vector computed from Y according to some criteria. Suppose I have already this Y_tensor boolean vector of the same length of X_tensor, which is len(X)-N_TIMESTEPS-1.
Y_tensor = np.random.randint(0,2,len(X)-N_TIMESTEPS-1)
Now if I try to feed the model I get of the most common error for CNN 1d which is:
ValueError: Error when checking input: expected conv1d_4_input to have 3 dimensions, but got array with shape (100, 5)
By looking to a dozen of posts about it I cannot understand what I did wrong. This is what I've tried:
model.fit(X,Y)
model.fit(np.expand_dims(X, axis=0),Y)
model.fit(np.expand_dims(X, axis=2),Y)
model.fit(X_tensor,Y_tensor)
For all of these cases I get always the same error (with different dimensional values in the final tuple).
Questions:
What Keras expects from my data? Can I feed the model with the whole time series or I have to slice it into a tensor?
How I have to feed the model in term of data structure?I.e. I have to specify in some strange way the dimension of the data?
Can you help me? I find out that this is one the most confusing point of CNN implementation in Keras that there are different posts with different solutions that do not fit with structure of my data (even if they have a very common structure according to me).
Note: There are some post suggesting to pass in the input_shape variable the length of the data. This is meaningless to me since I should not provide the dimension of the data (which is a variable) to the model. The only thing I should give to it, according to the theory, is the filter dimension and number of features (namely the dimension of the matrix that will roll over the time series).
Thanks,
am

Simply, Conv1D requires 3 dimensions:
Number of series (1)
Number of steps (100 - your entire data)
Number of features (5)
So, model.fit(np.expand_dims(X, axis=0),Y) is correct for X.
Now, if X is (1, 100, 5), naturally your input_shape=(100,5).
If your Y has 100 steps, then you need to make sure your Conv1D will output 100 steps. You need padding='same', otherwise it will become 91. (I suggest you actually work with 91, since you want a result for each 10 steps and probably don't want border effects spoiling your results)
Y must also follow the same rules for shape:
Number of series (1)
Number of steps (100 if padding='same'; 91 if padding='valid')
Number of features (1 = Dense output)
So, Y = Y.reshape((1,-1,1)).
Since you have only one class (true/false), it's pointless to use 'categorical_crossentropy'. You should go with 'binary_crossentropy'.
In general, your overall idea of using this convolution with kernel_size=10 to simulate sliding windows of 10 steps will work as expected (whether it will be efficient or not is another question, answered only by trying).
If you want better networks for sequences, you should probably try LSTM layers. The dimensions work exactly the same way. You will need return_sequences=False.
The main difference is that you will need to separate the data as you did in that loop. Then:
X.shape == (91, 10, 5)
Y.shape == (91, 1)

I think you don't have a clear idea on how 1d convolutional neural networks work:
if you want to predict the y values from the timeseries x and you just have 1 timeseries, your approach won't work. the network needs lots of samples to train, and having just 1 will allow it to easily memorize the input and not learn. For example, if the timeseries is the humidity of a given day, and y is the chance of rain at a specific timestep, what you have now is the data for just one day (timesteps being for example hours of the day). In order for the network to learn you need to gather data for many days, ending up with datasets of shape x=(n_days, timesteps, features) y=(n_days, timesteps, 1).
If you describe your actual problem there's better chance to get more helpful answers
[Edit] by sticking to your code, and using just one timeseries, you are better off with other methods that don't involve deep learning. You could split your timeseries at regular interval, obtaining n samples that would allow your network to train, but unless you have a very long timeseries that may not be a valid alternative.

Loss function for class imbalanced multi-class classifier in Keras

I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
model = Sequential()
model.add(LSTM(
units=10, # number of units returned by LSTM
return_sequences=True,
input_shape=(timestamps,nb_features),
dropout=0.2,
recurrent_dropout=0.2
)
)
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes,
activation='softmax'))
model.compile(loss="categorical_crossentropy",
metrics = ['accuracy'],
optimizer='adadelta')
Unfortunately, all predictions belong to class 1!!! The model always predicts 1 for any input...
Appreciate any pointers on how I can solve this task.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically in the train data I have 94981 samples. Each sample contains a sequence of 20 timestamps. There are 18 features.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.

First of all, you have ~100k samples. Start with something smaller, like 100 samples and multiple epochs and see whether your model overfits to this smaller training dataset (if it can't, you either have an error in your code or the model is not capable to model the dependencies [I would go with the second case]). Seriously, start with this one. And remember about representing all of your classes in this small dataset.
Secondly, hidden size of LSTM may be too small, you have 18 features for each sequence and sequences have length of 20, while your hidden is only 10. And you apply dropout to top it off and regularize the network even further.
Furthermore, you may want to add some dense outputs units instead of merely returning a linear layer of size 10 x 1 for each timestamp.
Last but not least, you may want to upsample the underrepresented data. 0 class would have to be repeated say 50 times (or maybe 25), class 2 something around 4 times and your one around 10-15 times, so the network is trained on them.
Oh, and use cross-validation for your hyperparameters like the hidden size, number of dense units etc.
Plus I don't know for how many epochs you've been training this network, what is your test dataset (it is entirely possible it only constitutes of the first class if you haven't done stratification).
I think this will get you started, hit me up with any doubts in the comments.
EDIT: When it comes to metrics, you may want to check something different than mere accuracy; maybe F1 score and your loss monitoring + accuracy to see how it performs. There are other available choices, for inspiration you can check sklearn's documentation as they provide quite a few options.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.