I've developed a 5-layer neural network for classification, and it always predicts the same probability for every row, which ends up predicting the same class.
I'm using ReLU as the activation function (if I use sigmoid or tanh it outputs NaNs) and tf.nn.sigmoid_cross_entropy_with_logits as the loss.
The model is this:
X, Y = create_placeholders(n_x, n_y)
parameters = initialize_parameters()
Z5 = forward_propagation(X, parameters)
cost = cost_function(Z5, Y)
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
And the prediction:
predicted = tf.nn.sigmoid(Z5)
correct_pred = tf.equal(tf.round(predicted), Y)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
The cost drops to 0.24 within the first 100 epochs, but the predicted values don't change.
At first I thought it was a class-imbalance problem, but I fixed that (upsampling and downsampling the 1s and 0s), so it should not be the issue.
Thanks in advance
I am trying to implement a neural network in PyTorch to predict h1_hemoglobin. Since this is regression, I kept 1 unit in the output layer, but I got the error below and I'm not able to understand the mistake. Keeping a large value like 100 in the output layer removes the error, but it renders the model useless as I am trying to implement regression.
Data:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
##### Creating Tensors
X_train=torch.tensor(X_train)
X_test=torch.tensor(X_test)
y_train=torch.LongTensor(y_train)
y_test=torch.LongTensor(y_test)
class ANN_Model(nn.Module):
    def __init__(self, input_features=4, hidden1=20, hidden2=20, out_features=1):
        super().__init__()
        self.f_connected1 = nn.Linear(input_features, hidden1)
        self.f_connected2 = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, out_features)

    def forward(self, x):
        x = F.relu(self.f_connected1(x))
        x = F.relu(self.f_connected2(x))
        x = self.out(x)
        return x
model = ANN_Model()
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epochs = 500
final_losses = []
for i in range(epochs):
    i = i + 1
    y_pred = model.forward(X_train.float())
    loss = loss_function(y_pred, y_train)
    final_losses.append(loss.item())
    if i % 10 == 1:
        print("Epoch number: {} and the loss: {}".format(i, loss.item()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Error:
You are performing regression, but CrossEntropyLoss() is a classification loss: internally it applies NLLLoss(). CrossEntropyLoss() expects C logits for C classes, but you have specified only one. NLLLoss() tries to index into the prediction logits based on the ground-truth value. E.g., in your case the ground truth is a single value, 14, so the loss step tries to index into the 14th logit of your predictions to get its corresponding value and compute the negative log likelihood on it, which is essentially -log(probability_k), where k is the index given by the ground truth. Since you have only one logit in your predictions, it throws an index-out-of-bounds error.
For regression problems, you should use distance-based losses such as MSELoss().
Try replacing your loss function loss_function = nn.CrossEntropyLoss() with loss_function = nn.MSELoss().
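A minimal sketch of that change (assuming y_train and y_test are the raw array-like targets from train_test_split, and reusing the model defined above): the targets become float tensors shaped (N, 1) to match the single output unit, and MSELoss replaces CrossEntropyLoss.

# Hypothetical sketch: switch the setup above to a regression loss.
y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)

loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(500):
    y_pred = model(X_train.float())            # shape (N, 1)
    loss = loss_function(y_pred, y_train)      # both sides are float and the same shape
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()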
Your response variable h1_hemoglobin looks like a continuous response variable. If that's the case, change the torch tensor type for y_train and y_test from LongTensor to FloatTensor or DoubleTensor.
According to the PyTorch docs, CrossEntropyLoss is meant for classification problems with a number of classes. Change your loss_function from CrossEntropyLoss to one that is suitable for a continuous response variable such as h1_hemoglobin.
In your case, the following might do it:
y_train=torch.DoubleTensor(y_train)
y_test=torch.DoubleTensor(y_test)
...
...
loss_function = nn.MSELoss()
Pytorch MSELoss
Pytorch CrossEntropyLoss
Suppose we have a simple Keras model that uses BatchNormalization:
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(1,)),
    tf.keras.layers.BatchNormalization()
])
How do I actually use it with GradientTape? The following doesn't seem to work, as it doesn't update the moving averages:
# model training... we want the output values to be close to 150
for i in range(1000):
    x = np.random.randint(100, 110, 10).astype(np.float32)
    with tf.GradientTape() as tape:
        y = model(np.expand_dims(x, axis=1))
        loss = tf.reduce_mean(tf.square(y - 150))
    grads = tape.gradient(loss, model.variables)
    opt.apply_gradients(zip(grads, model.variables))
In particular, if you inspect the moving averages, they remain the same (inspect model.variables; the averages are always 0 and 1). I know one can use .fit() and .predict(), but I would like to use GradientTape, and I'm not sure how to do this. Some version of the documentation suggests updating update_ops, but that doesn't seem to work in eager mode.
For example, the following code will not output anything close to 150 after the above training.
x = np.random.randint(200, 210, 100).astype(np.float32)
print(model(np.expand_dims(x, axis=1)))
In gradient tape (eager) mode, the BatchNormalization layer should be called with the argument training=True.
Example:
# KL / KM aliases for the Keras layers and models modules
from tensorflow.keras import layers as KL
from tensorflow.keras import models as KM

inp = KL.Input((64, 64, 3))
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
x = KL.BatchNormalization()(x, training=True)
model = KM.Model(inp, x)
Then the moving variables are properly updated:
>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy
=array([-0.00062087, 0.00015137, -0.00013239], dtype=float32)>
I just give up. I spent quite a bit of time trying to make sense of a model that looks like:
model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),
])
And I do give up, because of what the results look like.
My intuition was that BatchNorm these days is not as straightforward as it used to be, and that is why it scales the original distribution but not so much a new distribution (which is a shame), but ain't nobody got time for that.
Edit: the reason for that behavior is that BN only computes batch moments and normalizes batches during training. During training it maintains running averages of the mean and deviation, and once you switch to evaluation, those parameters are used as constants. That is, evaluation should not depend on batch normalization, because evaluation can be run even on a single input and cannot rely on batch statistics. Since the constants are calculated on a different distribution, you get a higher error during evaluation.
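A minimal sketch of that train/inference difference (my own toy example, TF 2.x assumed): with training=True the layer normalizes with the current batch's statistics and updates its moving averages; with training=False it uses the stored moving mean/variance as constants.

import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = np.random.normal(loc=100.0, scale=5.0, size=(32, 1)).astype(np.float32)

# Training call: normalizes with batch statistics and updates the moving averages.
y_train_mode = bn(x, training=True)

# Inference call: uses the accumulated moving_mean / moving_variance instead.
y_infer_mode = bn(x, training=False)

print(bn.moving_mean.numpy(), bn.moving_variance.numpy())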
With gradient tape mode, you would usually compute gradients like this:
with tf.GradientTape() as tape:
    y_pred = model(features)
    loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
However, if your model contains a BatchNormalization or Dropout layer (or any layer that has different train/test phases), then tf will fail to build the graph.
A good practice is to explicitly pass the training argument when calling the model. When optimizing, use model(features, training=True), and when predicting, use model(features, training=False), to explicitly choose the train/test phase for such layers.
For the PREDICT and EVAL phases, use:

training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, training=training)
For the TRAIN phase, use:

with tf.GradientTape() as tape:
    y_pred = model(features, training=training)
    loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Note that iperov's answer works as well, except that you will need to set the training phase manually for those layers:
x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)
x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)
I'd recommend having one get_model function that returns the model, and choosing the phase via the training parameter when calling the model.
Note:
If you use model.variables when finding gradients, you'll get this warning
Gradients do not exist for variables
['layer_1_bn/moving_mean:0',
'layer_1_bn/moving_variance:0',
'layer_2_bn/moving_mean:0',
'layer_2_bn/moving_variance:0']
when minimizing the loss.
This can be resolved by computing gradients only against trainable variables: replace model.variables with model.trainable_variables.
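Putting those recommendations together, a minimal sketch (my own get_model, loss, and optimizer choices, TF 2.x assumed) might look like this:

import tensorflow as tf

def get_model():
    # One place that builds the model; the phase is chosen at call time.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1),
    ])

model = get_model()
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(features, y_true):
    with tf.GradientTape() as tape:
        y_pred = model(features, training=True)        # train phase for BN/Dropout
        loss = loss_fn(y_true, y_pred)
    # Only trainable variables: moving_mean / moving_variance have no gradients.
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

def predict_step(features):
    return model(features, training=False)             # eval phase: use moving statistics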
I am experimenting with TensorFlow 2.0 (alpha). I want to implement a simple feed-forward network with two output nodes for binary classification (it's a 2.0 version of this model).
This is a simplified version of the script. After I defined a simple Sequential() model, I set:
# import layers + dropout & activation
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.activations import elu, softmax

# Neural Network Architecture
n_input = X_train.shape[1]
n_hidden1 = 15
n_hidden2 = 10
n_output = y_train.shape[1]

model = tf.keras.models.Sequential([
    Dense(n_input, input_shape=(n_input,), activation=elu),  # Input layer
    Dropout(0.2),
    Dense(n_hidden1, activation=elu),  # hidden layer 1
    Dropout(0.2),
    Dense(n_hidden2, activation=elu),  # hidden layer 2
    Dropout(0.2),
    Dense(n_output, activation=softmax)  # Output layer
])
# define loss and accuracy
bce_loss = tf.keras.losses.BinaryCrossentropy()
accuracy = tf.keras.metrics.BinaryAccuracy()

# define optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.001)

# save training progress in lists
loss_history = []
accuracy_history = []

# loop over 1000 epochs
for epoch in range(1000):
    with tf.GradientTape() as tape:
        # take binary cross-entropy (bce_loss)
        current_loss = bce_loss(model(X_train), y_train)
        # Update weights based on the gradient of the loss function
        gradients = tape.gradient(current_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    # save in history vectors
    current_loss = current_loss.numpy()
    loss_history.append(current_loss)
    accuracy.update_state(model(X_train), y_train)
    current_accuracy = accuracy.result().numpy()
    accuracy_history.append(current_accuracy)
    # print loss and accuracy scores each 100 epochs
    if (epoch+1) % 100 == 0:
        print(str(epoch+1) + '.\tTrain Loss: ' + str(current_loss) + ',\tAccuracy: ' + str(current_accuracy))
    accuracy.reset_states()

print('\nTraining complete.')
Training runs without errors, however strange things happen:
Sometimes the network doesn't learn anything: all loss and accuracy scores are constant throughout all the epochs.
Other times the network is learning, but very badly: accuracy never went beyond 0.4 (while in TensorFlow 1.x I got an effortless 0.95+). Such low performance suggests to me that something went wrong in the training.
Other times the accuracy improves very slowly, while the loss remains constant all the time.
What can cause these problems? Please help me understand my mistakes.
UPDATE:
After some corrections, I can make the network learn. However, its performance is extremely poor. After 1000 epochs it reaches about 40% accuracy, which clearly means something is still wrong. Any help is appreciated.
tf.GradientTape records every operation that happens inside its scope.
You don't want to record the gradient calculation on the tape; you only want the forward computation of the loss inside it.
with tf.GradientTape() as tape:
    # take binary cross-entropy (bce_loss)
    current_loss = bce_loss(model(df), classification)
# End of tape scope

# Update weights based on the gradient of the loss function
gradients = tape.gradient(current_loss, model.trainable_variables)
# The tape is now consumed
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
More importantly, I don't see a loop over the training set, so I suppose the complete code looks like this:

for epoch in range(n_epochs):
    for df, classification in dataset:
        # your code that computes loss and trains
Moreover, the usage of the metrics is wrong.
You want to accumulate, i.e. update the internal state of, the accuracy operation at every training step and measure the overall accuracy at the end of every epoch.
Thus you have to:
# Measure the accuracy inside the training loop
# (the metric signature is update_state(y_true, y_pred))
accuracy.update_state(classification, model(df))
Call accuracy.result() only at the end of the epoch, when all the accuracy values have been accumulated into the metric.
Remember to call the .reset_states() method to clear the metric's state, resetting it to zero at the end of every epoch.
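A minimal sketch pulling these pieces together (a hypothetical batched dataset and the names used above are assumed; the loss stays inside the tape, the metric accumulates per step, and result()/reset_states() run once per epoch):

for epoch in range(n_epochs):
    for df, classification in dataset:              # dataset yields (features, labels) batches
        with tf.GradientTape() as tape:
            y_pred = model(df, training=True)
            current_loss = bce_loss(classification, y_pred)
        gradients = tape.gradient(current_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        accuracy.update_state(classification, y_pred)   # accumulate per training step
    print("epoch", epoch, "loss", float(current_loss), "accuracy", float(accuracy.result()))
    accuracy.reset_states()                             # start fresh for the next epoch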
I am experimenting with a simple 2-layer neural network in PyTorch, feeding in only three inputs of size 10 each, with a single value as output. I have normalised the inputs and lowered the learning rate. My understanding is that a two-layer fully connected neural network should be able to fit this data trivially.
Features:
0.8138 1.2342 0.4419 0.8273 0.0728 2.4576 0.3800 0.0512 0.6872 0.5201
1.5666 1.3955 1.0436 0.1602 0.1688 0.2074 0.8810 0.9155 0.9641 1.3668
1.7091 0.9091 0.5058 0.6149 0.3669 0.1365 0.3442 0.9482 1.2550 1.6950
[torch.FloatTensor of size 3x10]
Targets
[124, 125, 122]
[torch.FloatTensor of size 3]
The code is adapted from a simple example and I am using MSELoss as the loss function. The loss diverges to infinity after just a few iterations:
import numpy as np
import torch
from torch.autograd import Variable

features = torch.from_numpy(np.array(features))
x_data = Variable(torch.Tensor(features))
y_data = Variable(torch.Tensor(targets))
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10, 5)
        self.linear2 = torch.nn.Linear(5, 1)

    def forward(self, x):
        l_out1 = self.linear(x)
        y_pred = self.linear2(l_out1)
        return y_pred
model = Model()
criterion = torch.nn.MSELoss(size_average=False)
optim = torch.optim.SGD(model.parameters(), lr=0.001)

def main():
    for iteration in range(1000):
        y_pred = model(x_data)
        loss = criterion(y_pred, y_data)
        print(iteration, loss.data[0])
        optim.zero_grad()
        loss.backward()
        optim.step()
Any help would be appreciated. Thanks
EDIT:
Indeed, it seems that this was simply due to the learning rate being too high. Setting it to 0.00001 fixes the convergence issues, albeit with very slow convergence.
This is because you're not using a non-linearity between the layers, so your network is still linear.
You can use ReLU to make it non-linear. You can change the forward method like this:
...
l_out1 = torch.nn.functional.relu(self.linear(x))
y_pred = self.linear2(l_out1)
...
You could also try predicting log(y) instead of y to improve convergence further. The Adam optimizer (adaptive learning rate) should also help, as should BatchNorm (for example between your linear layers).
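A minimal sketch combining those suggestions (my own layer and variable choices, reusing x_data and y_data from the question, and assuming the targets are positive so log(y) is defined):

import torch
import torch.nn as nn

class LogRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)
        self.bn = nn.BatchNorm1d(5)                     # normalize hidden activations
        self.linear2 = nn.Linear(5, 1)

    def forward(self, x):
        h = torch.relu(self.bn(self.linear(x)))         # non-linearity between layers
        return self.linear2(h)

model = LogRegressor()
criterion = nn.MSELoss()
optim = torch.optim.Adam(model.parameters(), lr=0.001)  # adaptive learning rate

log_targets = torch.log(y_data).unsqueeze(1)            # regress log(y) instead of y
for iteration in range(1000):
    y_pred = model(x_data)
    loss = criterion(y_pred, log_targets)
    optim.zero_grad()
    loss.backward()
    optim.step()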
I want to train a neural network with 12 inputs and 2 outputs. Here I have a simple TensorFlow neural network that has two outputs. When I run the code, it always consistently gives one output: if the two outputs are labelled 'l1' and 'l2', the model always chooses 'l1'. Is this a problem with my input (that it doesn't vary enough between 'l1' and 'l2'), or is it a problem with choosing to use just two outputs? If it's the latter, what do I do to remedy this? My model is supposed to detect skin tones in a photo ('l1' = skin tone, 'l2' = not skin tone); I'm not sure this makes sense. It is adapted from the MNIST example, but that code has ten outputs.
def nn_setup(self):
    input_num = 4 * 3
    mid_num = 3
    output_num = 2

    x = tf.placeholder(tf.float32, [None, input_num])
    W_1 = tf.Variable(tf.zeros([input_num, mid_num]))
    b_1 = tf.Variable(tf.zeros([mid_num]))
    y_mid = tf.nn.softmax(tf.matmul(x, W_1) + b_1)

    W_2 = tf.Variable(tf.zeros([mid_num, output_num]))
    b_2 = tf.Variable(tf.zeros([output_num]))
    y = tf.nn.softmax(tf.matmul(y_mid, W_2) + b_2)
    y_ = tf.placeholder(tf.float32, [None, output_num])

    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

    init = tf.initialize_all_variables()
    self.sess = tf.Session()
    self.sess.run(init)

    for i in range(1000):
        batch_xs, batch_ys = self.get_nn_next_train()
        self.sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    self.nn_test.images, self.nn_test.labels = self.get_nn_next_test()
    print(self.sess.run(accuracy, feed_dict={x: self.nn_test.images, y_: self.nn_test.labels}))
There are a few "odd" things with your network, such as having a softmax in your middle layer.
There are two major issues I can find with your implementation.
1. Weight initialisation
W_1 = tf.Variable(tf.zeros([input_num, mid_num]))
W_2 = tf.Variable(tf.zeros([mid_num, output_num]))
This initialises the weights to be identical, so they will have identical gradient values and be changed identically at each step.
Effectively, by doing this you have created a network with one neuron in each layer (which is then copied to fill the layer matrix that you use).
Use a different initial value; it is usual to take a small random matrix like this:
W_1 = tf.Variable(tf.random_normal([input_num, mid_num], stddev=0.5))
In general you will want a smaller standard deviation the larger your layers are. You don't have to do this for biases as well, but you can if you like.
This won't fix everything with your network, but it should at least start to calculate different values from input data and train a little.
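For instance, a sketch of random initialisation for both layers (the 1/sqrt(fan_in) scaling below is a common heuristic I'm adding, not something from the answer above):

# Random initialisation breaks the symmetry between neurons in a layer.
W_1 = tf.Variable(tf.random_normal([input_num, mid_num], stddev=1.0 / input_num ** 0.5))
b_1 = tf.Variable(tf.zeros([mid_num]))          # zero biases are fine
W_2 = tf.Variable(tf.random_normal([mid_num, output_num], stddev=1.0 / mid_num ** 0.5))
b_2 = tf.Variable(tf.zeros([output_num]))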
2. Use of cost function
You have used this loss function incorrectly:
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(y, y_))
...because softmax_cross_entropy_with_logits is designed to work with the input to the softmax (the logits), not its output. So your cost function is incorrect. Instead, you want to reference y_logits, computed where you currently calculate y:
y_logits = tf.matmul(y_mid, W_2) + b_2
y = tf.nn.softmax(y_logits)
Then your cross-entropy would be:

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y_logits, labels=y_))
After the hidden layer initialization, you have calculated the softmax of the hidden layer's logits: y_mid = tf.nn.softmax(tf.matmul(x, W_1) + b_1). In a classification problem, softmax should be applied to the values obtained from the output layer. Try something like y_mid = tf.nn.relu(tf.matmul(x, W_1) + b_1) to compute the activations of the hidden layer and see if your classification improves.
If that does not solve your problem, check the populations of 'l1' and 'l2' in your training data. If your training data is highly skewed towards 'l1', you will always get 'l1' as the output. You may consider minority oversampling or undersampling techniques to resolve the class-imbalance problem.
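Combining both answers, a sketch of the corrected graph (same variable names as the question, TF 1.x-style API; treat this as illustrative rather than the answerers' exact code):

# Hidden layer: random initialisation + ReLU instead of softmax
W_1 = tf.Variable(tf.random_normal([input_num, mid_num], stddev=0.5))
b_1 = tf.Variable(tf.zeros([mid_num]))
y_mid = tf.nn.relu(tf.matmul(x, W_1) + b_1)

# Output layer: keep the raw logits for the loss, softmax only for predictions
W_2 = tf.Variable(tf.random_normal([mid_num, output_num], stddev=0.5))
b_2 = tf.Variable(tf.zeros([output_num]))
y_logits = tf.matmul(y_mid, W_2) + b_2
y = tf.nn.softmax(y_logits)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y_logits, labels=y_))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)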