I want to train a neural network with 12 inputs and 2 outputs. Below is a simple TensorFlow network with two outputs. When I run the code, it consistently gives the same output: if the two outputs are labeled 'l1' and 'l2', the model always chooses 'l1'. Is this a problem with my input (that it doesn't vary enough between 'l1' and 'l2'), or is it a problem with using just two outputs? If it's the latter, what can I do to remedy it? My model is supposed to detect skin tones in a photo ('l1' = skin tone, 'l2' = not skin tone). I'm not sure this makes sense. The code is adapted from the MNIST example, but that code has ten outputs.
def nn_setup(self):
    input_num = 4 * 3
    mid_num = 3
    output_num = 2
    x = tf.placeholder(tf.float32, [None, input_num])
    W_1 = tf.Variable(tf.zeros([input_num, mid_num]))
    b_1 = tf.Variable(tf.zeros([mid_num]))
    y_mid = tf.nn.softmax(tf.matmul(x,W_1) + b_1)
    W_2 = tf.Variable(tf.zeros([mid_num, output_num]))
    b_2 = tf.Variable(tf.zeros([output_num]))
    y = tf.nn.softmax(tf.matmul(y_mid, W_2) + b_2)
    y_ = tf.placeholder(tf.float32, [None, output_num])
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
    init = tf.initialize_all_variables()
    self.sess = tf.Session()
    self.sess.run(init)
    for i in range(1000):
        batch_xs, batch_ys = self.get_nn_next_train()
        self.sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    self.nn_test.images, self.nn_test.labels = self.get_nn_next_test()
    print(self.sess.run(accuracy, feed_dict={x: self.nn_test.images, y_: self.nn_test.labels}))
There are a few "odd" things with your network, such as having softmax in your middle layer.
You have two major issues I can find with your implementation.
1. Weight initialisation
W_1 = tf.Variable(tf.zeros([input_num, mid_num]))
W_2 = tf.Variable(tf.zeros([mid_num, output_num]))
This initialises all the weights in a layer to the same value, so they receive identical gradients and are changed identically at each step.
Effectively, by doing this you have created a network with one neuron in each layer (which is then copied to fill the layer matrix that you use).
Use a different initial value. It is usual to take a small random matrix, like this:
W_1 = tf.Variable(tf.random_normal([input_num, mid_num], stddev=0.5))
In general you will want a smaller standard deviation the larger your layers are. You don't have to do this for biases as well, but you can if you like.
This won't fix everything with your network, but it should at least start to calculate different values from input data and train a little.
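To see the symmetry problem concretely, here is a minimal pure-Python sketch. It is not the asker's TensorFlow graph: it assumes a hypothetical tiny network with a sigmoid hidden layer, one sigmoid output and a squared-error loss, and works out by hand the gradient for each hidden unit's incoming weights. The helper name hidden_weight_grads and all numeric values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_weight_grads(x, target, w1, w2):
    """Gradient of 0.5 * (y - target)**2 w.r.t. each hidden unit's input weights.

    w1[j] holds the weights into hidden unit j; w2[j] is the weight from
    hidden unit j to the single sigmoid output.
    """
    h = [sigmoid(sum(w * xi for w, xi in zip(w_j, x))) for w_j in w1]
    y = sigmoid(sum(w * hj for w, hj in zip(w2, h)))
    delta_out = (y - target) * y * (1.0 - y)  # error at the output pre-activation
    grads = []
    for j in range(len(w1)):
        delta_h = delta_out * w2[j] * h[j] * (1.0 - h[j])  # error at hidden unit j
        grads.append([delta_h * xi for xi in x])
    return grads

x = [0.2, 0.7, 0.1]

# Every weight starts at the same value: all hidden units get the same gradient.
same_init = hidden_weight_grads(x, 1.0, [[0.5] * 3 for _ in range(3)], [0.5] * 3)
symmetric = all(g == same_init[0] for g in same_init)

# Distinct starting weights: the hidden units receive different gradients.
mixed_init = hidden_weight_grads(
    x, 1.0,
    [[0.1, -0.4, 0.3], [0.6, 0.2, -0.5], [-0.3, 0.8, 0.1]],
    [0.4, -0.7, 0.2])
broken = all(g == mixed_init[0] for g in mixed_init)

print(symmetric, broken)
```

With identical starting weights the hidden units stay copies of each other after every update, which is why a random start is needed to break the tie.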
2. Use of cost function
You have used this loss function incorrectly:
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(y, y_) )
. . . because softmax_cross_entropy_with_logits is designed to work with the input to softmax (the logits), not its output, and it applies softmax internally. Passing y therefore applies softmax twice, so your cost function is incorrect. Instead, keep a reference to the logits, y_logits, at the point where you currently calculate y:
y_logits = tf.matmul(y_mid, W_2) + b_2
y = tf.nn.softmax(y_logits)
Then your cross-entropy would be
# TensorFlow 1.x releases require named arguments for this function
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y_logits, labels=y_) )
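As a rough illustration of why feeding softmax output into a loss that expects logits is harmful, the following pure-Python sketch applies softmax once and then twice to the same scores. The logit values are hypothetical, and the softmax helper is a stand-in for what the TensorFlow op does internally.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, -4.0]     # hypothetical, very confident scores for two classes
once = softmax(logits)   # what the loss should compute internally
twice = softmax(once)    # what happens when softmax output is passed as "logits"

# Cross-entropy when the true class is class 0
loss_once = -math.log(once[0])
loss_twice = -math.log(twice[0])

print(once, loss_once)
print(twice, loss_twice)
```

Because softmax outputs lie between 0 and 1, a second softmax over two classes can never give the winning class more than about 0.73, so the loss cannot approach zero however well the network fits the data.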
After initializing the hidden layer, you apply softmax to the hidden layer's logits: y_mid = tf.nn.softmax(tf.matmul(x,W_1) + b_1). In a classification problem, softmax should be applied only to the values obtained from the output layer. Try something like y_mid = tf.nn.relu(tf.matmul(x,W_1) + b_1) to compute the hidden-layer activations and see if your classification improves. If that does not solve your problem, check the proportion of 'l1' and 'l2' examples in your training data. If the training data is highly skewed towards 'l1', the model may simply learn to output 'l1' nearly every time. You may consider oversampling the minority class or undersampling the majority class to address the imbalance.
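A minimal sketch of checking the class balance and oversampling the minority class, using only the standard library. The function name oversample_minority, the 90/10 split and the integer stand-ins for samples are all illustrative assumptions, not part of the asker's code.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical skewed dataset: 90 'l1' examples and 10 'l2' examples
labels = ['l1'] * 90 + ['l2'] * 10
samples = list(range(100))
print(Counter(labels))

balanced_samples, balanced_labels = oversample_minority(samples, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

The result is grouped by class, so it would need shuffling before being split into training batches.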
I am reading about backpropagation in deep neural networks and, as I understand it, the algorithm for this type of network can be summarized as follows:
1- Input x : Set the corresponding activation for the input layer
2- Feedforward: calculate the error of the forward-propagation
3- Output error: Calculate the output error
4- Backpropagate the error : calculate the error of the back-propagation
5- Output: Using the gradient of the cost function
That's fine. I then looked at many code samples for this type of network; below is an example with explanation:
### imports
import tensorflow as tf
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant( x , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    for step in range(500) :
        sess.run(train)
    results = sess.run([m1,b1,m2,b2,y_out,loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label,result in zip(*(labels,results)) :
        print("")
        print(label)
        print(result)
print("")
My question: the above code calculates the error of the forward propagation, but I don't see any step that calculates the backpropagation error. In other words, following the description above, I can see steps 1 (Input x), 2 (Feedforward), 3 (Output error) and 5 (Output), but step 4 (Backpropagate the error) is not shown in the code! Is that right, or is something missing from the code? The problem is that all the code I found online follows the same steps for backpropagation in deep neural networks.
Could you please describe how the backpropagation step happens in this code, or what I should add to perform that step?
Thank you
In simple terms, when you build the TF graph up to the point you are computing the loss in your code, TF will know on which tf.Variable (weights) the loss depends. Then, when you create the node train = tf.train.GradientDescentOptimizer(1.0).minimize(loss), and later run it in a tf.Session, the backpropagation is done for you in the background. To be more specific, the train = tf.train.GradientDescentOptimizer(1.0).minimize(loss) merges the following steps:
# 1. Create a GD optimizer with a learning rate of 1.0
optimizer = tf.train.GradientDescentOptimizer(1.0)
# 2. Compute the gradients for each of the variables (weights) with respect to the loss
gradients, variables = zip(*optimizer.compute_gradients(loss))
# 3. Update the variables (weights) based on the computed gradients
train = optimizer.apply_gradients(zip(gradients, variables))
In particular, steps 2 and 3 together carry out backpropagation and the resulting weight update. I hope this makes things clearer for you!
Besides, I want to restructure the steps in your question:
1. Input X: The input of the neural network.
2. Forward pass: Propagating the input through the neural network in order to get the output. In other words, combining the input X with each tf.Variable in your code, layer by layer.
3. Loss: The mismatch between the output obtained in step 2 and the expected output.
4. Computing the gradients: Computing the gradient of the loss with respect to each tf.Variable (weight).
5. Updating the weights: Updating each tf.Variable (weight) according to its corresponding gradient.
Please note that steps 4 and 5 encapsulate backpropagation.
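The five steps above can be sketched without TensorFlow for the smallest possible case. This is only an illustration under stated assumptions: a hypothetical one-weight model y = w * x with a squared-error loss, where the gradient is written out by hand instead of being derived automatically from a graph. The function name train_step and the numbers are made up for the example.

```python
def train_step(w, x, target, learning_rate):
    """One gradient-descent step for the model y = w * x with squared error."""
    y = w * x                            # step 2: forward pass
    loss = (y - target) ** 2             # step 3: loss
    grad_w = 2.0 * (y - target) * x      # step 4: gradient of the loss w.r.t. w
    new_w = w - learning_rate * grad_w   # step 5: weight update
    return new_w, loss

w = 0.0
losses = []
for _ in range(20):
    # step 1: the input x (and its expected output) is supplied here
    w, loss = train_step(w, x=2.0, target=6.0, learning_rate=0.1)
    losses.append(loss)

print(w, losses[0], losses[-1])
```

In TensorFlow the lines for steps 4 and 5 are what minimize(loss) builds for you, for every variable the loss depends on.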
I want to use a BERT model to do multi-label classification with TensorFlow.
To do so, I want to adapt the example run_classifier.py from the BERT GitHub repository, which shows how to use BERT for simple classification using the pre-trained weights released by Google Research (for example, BERT-Base, Cased).
I have X different labels, each with a value of either 0 or 1, so I want to add a new Dense layer of size X to the original BERT model and use the sigmoid_cross_entropy_with_logits loss function.
So, for the theoretical part, I think I am OK.
The problem is that I don't know how to append a new output layer and retrain only this new layer with my dataset, using the existing BertModel class.
Here is the original create_model() function from run_classifier.py, where I guess I have to make my modifications, but I am a bit lost on what to do.
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
    """Creates a classification model."""
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)
    output_layer = model.get_pooled_output()
    hidden_size = output_layer.shape[-1].value
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())
    with tf.variable_scope("loss"):
        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
        return (loss, per_example_loss, logits, probabilities)
And here is the same function with some of my modifications, but some things are still missing (and some may be wrong too):
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, labels, num_labels):
    """Creates a classification model."""
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids)
    output_layer = model.get_pooled_output()
    hidden_size = output_layer.shape[-1].value
    output_weights = tf.get_variable("output_weights", [num_labels, hidden_size],initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable("output_bias", [num_labels], initializer=tf.zeros_initializer())
    with tf.variable_scope("loss"):
        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
        loss = tf.reduce_mean(per_example_loss)
        return (loss, per_example_loss, logits, probabilities)
The other things I have adapted in the code, and which gave me no problems:
DataProcessor to load and parse my custom dataset
Changing the type of labels variable from numerical values to arrays everywhere it is used
So, if anyone knows what I should do to resolve my problem, or can point out some obvious mistake I may have made, I would be glad to hear it.
Notes:
I found this article that corresponds pretty well to what I am trying to do, but it uses PyTorch, and I cannot translate it into TensorFlow.
You want to replace the softmax, which models a single distribution over the possible outputs (all scores sum to one), with a sigmoid, which models an independent distribution for each class (a separate yes/no distribution for each output).
So, you correctly change the loss function, but you also need to change how you compute the probabilities. It should be:
probabilities = tf.sigmoid(logits)
In this case, you don't need the log_probs.
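The difference between the two ways of turning logits into probabilities can be seen in a small pure-Python sketch. The logit values and the 0.5 decision threshold are assumptions chosen for illustration; the helpers stand in for tf.nn.softmax and tf.sigmoid.

```python
import math

def softmax(logits):
    """One distribution over all classes: the outputs sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    """Independent yes/no probability for a single label."""
    return 1.0 / (1.0 + math.exp(-logit))

logits = [2.0, 1.5, -3.0]   # hypothetical scores for three labels

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(v) for v in logits]

# Multi-label prediction: each label is decided on its own.
predicted = [int(p > 0.5) for p in sigmoid_probs]

print(softmax_probs, sum(softmax_probs))
print(sigmoid_probs, predicted)
```

With softmax the first two labels compete for the same probability mass, whereas with sigmoid both can be predicted as present at once.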
I'm using a fully connected neural network with normalized data, such that every sample's values range from 0 to 1. I have used 100 neurons in the first layer and 10 in the second layer, and almost 50 lakh (5 million) samples during training. I want to classify my data into two classes, but my network's performance is too low: almost 49 percent accuracy on both training and test data. I tried to improve it by changing the hyperparameter values, but that didn't work. Can someone please tell me what I should do to get better performance?
x = tf.placeholder(tf.float32, [None, nPixels])
W1 = tf.Variable(tf.random_normal([nPixels, nNodes1], stddev=0.01))
b1 = tf.Variable(tf.zeros([nNodes1]))
y1 = tf.nn.relu(tf.matmul(x, W1) + b1)
W2 = tf.Variable(tf.random_normal([nNodes1, nNodes2], stddev=0.01))
b2 = tf.Variable(tf.zeros([nNodes2]))
y2 = tf.nn.relu(tf.matmul(y1, W2) + b2)
W3 = tf.Variable(tf.random_normal([nNodes2, nLabels], stddev=0.01))
b3 = tf.Variable(tf.zeros([nLabels]))
y = tf.nn.softmax(tf.matmul(y2, W3) + b3)
y_ = tf.placeholder(dtype=tf.float32, shape=[None, 2])
cross_entropy = -1*tf.reduce_sum(y_* tf.log(y), axis=1)
loss = tf.reduce_mean(cross_entropy)
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
correct_prediction = tf.equal(tf.argmax(y_,axis=1), tf.argmax(y, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
Your computational model knows nothing about "images"; it only sees numbers. So if you trained it with pixel values in the range 0-255, it has learned what "light" means, what "dark" means, and how these combine to give whatever target value you are trying to model.
By normalizing, you forced all pixels into the range 0-1. As far as a model trained on 0-255 values is concerned, they are all black as night, so it is no surprise that it cannot extract anything meaningful.
You need to apply the same input normalization during both training and testing.
And speaking of normalization for NN models, it is better to normalize to zero mean.
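One way to keep the normalization identical for training and test data is to compute the statistics once, on the training set, and reuse them everywhere. The sketch below assumes simple per-dataset mean and standard deviation scaling; the function names and pixel values are hypothetical.

```python
def fit_normalizer(train_values):
    """Compute mean and standard deviation from the training data only."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = variance ** 0.5 or 1.0   # fall back to 1.0 if all values are equal
    return mean, std

def normalize(values, mean, std):
    """Shift to zero mean and scale to unit variance with the given statistics."""
    return [(v - mean) / std for v in values]

train_pixels = [0.0, 64.0, 128.0, 255.0]   # hypothetical raw pixel values
test_pixels = [32.0, 200.0]

mean, std = fit_normalizer(train_pixels)
train_norm = normalize(train_pixels, mean, std)
test_norm = normalize(test_pixels, mean, std)   # same statistics as training

print(mean, std)
print(train_norm)
print(test_norm)
```

The test set is deliberately not used to compute the mean or standard deviation, so both sets go through exactly the same transformation.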
I would like to train my neural network using a custom loss value of my own. To do that, I would like to run forward propagation for one mini-batch so that the activations are stored in memory, and then run backpropagation using my own loss value. This is to be done in TensorFlow.
In the end, I need to do something like:
sess.run(optimizer, feed_dict={x: training_data, loss: my_custom_loss_value})
Is that possible? I am assuming that the optimizer depends on the loss, which itself depends on the input. Therefore, I want the inputs to be fed into the graph, but I want to use my own value for the loss.
I guess that since the optimizer depends on the activations, they will be evaluated; in other words, the input is going to be fed into the network. Here is an example:
import tensorflow as tf
a = tf.Variable(tf.constant(8.0))
a = tf.Print(input_=a, data=[a], message="a:")
b = tf.Variable(tf.constant(6.0))
b = tf.Print(input_=b, data=[b], message="b:")
c = a * b
optimizer = tf.train.AdadeltaOptimizer(learning_rate=0.1).minimize(c)
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    value, _ = sess.run([c, optimizer], feed_dict={c: 1})
    print(value)
Finally, the printed value is 1.0, while the console shows: a:[8]b:[6] which means that the inputs got evaluated.
Exactly so.
When you run the optimizer, using gradient descent or any other optimization algorithm such as AdamOptimizer(), it minimizes your loss function. That could be a softmax cross-entropy (tf.nn.softmax_cross_entropy_with_logits) for multi-class classification, a squared-error loss (tf.losses.mean_squared_error) for regression, or your own custom loss. The loss function is evaluated using the model hypothesis.
So TensorFlow uses this cascade approach to train the model hypothesis when you call tf.Session().run() on the optimizer. See the following rough example in a multi-class classification setting:
batch_size = 128
# input_X, input_y (placeholders), weight and bias are assumed to be defined earlier
# build the linear model
hypothesis = tf.add(tf.matmul(input_X, weight), bias)
# softmax cross entropy loss or cost function for logistic regression
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=input_y,
                                                              logits=hypothesis))
# optimizer to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001).minimize(loss)
# execute in Session
with tf.Session() as sess:
    # initialize all variables
    sess.run(tf.global_variables_initializer())
    # Train the model
    for steps in range(1000):
        mini_batch = zip(range(0, X_features.shape[0], batch_size),
                         range(batch_size, X_features.shape[0]+1, batch_size))
        # train using mini-batches
        for (start, end) in mini_batch:
            sess.run(optimizer, feed_dict = {input_X: X_features[start:end],
                                             input_y: y_targets[start:end]})
I have the following question: I'm trying to learn TensorFlow, and I still can't find where to set the training mode to online or batch. For example, suppose I have the following code to train a neural network:
loss_op = tf.reduce_mean(tf.pow(neural_net(X) - Y, 2))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
If I give all the data at the same time (i.e. batch_x contains all the data), does that mean it is doing batch training? Or does the TensorFlow optimizer work in a different way behind the scenes? Is it wrong to write a for loop that feeds one data sample at a time? Does that count as single-step (online) training? Thank you for your help.
There are three main types of gradient descent:
Stochastic Gradient Descent
Batch Gradient Descent
Mini-Batch Gradient Descent
Here is a good tutorial (https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/) on these three methods, with their upsides and downsides.
For your question, the following is a standard sample of TensorFlow training code:
N_EPOCHS = 10  # placeholder value; set the number of epochs here
BATCH_SIZE = 32  # placeholder value; set the batch size here
with tf.Session() as sess:
    train_count = len(train_x)
    for i in range(1, N_EPOCHS + 1):
        for start, end in zip(range(0, train_count, BATCH_SIZE),
                              range(BATCH_SIZE, train_count + 1,BATCH_SIZE)):
            sess.run(train_op, feed_dict={X: train_x[start:end],
                                          Y: train_y[start:end]})
Here N_EPOCHS is the number of passes over the whole training dataset, and you can set BATCH_SIZE according to your gradient descent method.
For Stochastic Gradient Descent, BATCH_SIZE = 1.
For Batch Gradient Descent, BATCH_SIZE = training dataset size.
For Mini-Batch Gradient Descent, 1 << BATCH_SIZE << training dataset size.
Of the three methods, the most popular is Mini-Batch Gradient Descent. However, you need to set the BATCH_SIZE parameter according to your requirements. A good default for BATCH_SIZE might be 32.
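To make the three settings concrete, here is a small pure-Python sketch of how the batch size alone determines the slices fed to the training op. The helper batch_ranges is a hypothetical name, and the dataset size of 10 is chosen only to keep the output short.

```python
def batch_ranges(n_samples, batch_size):
    """Return (start, end) index pairs that cover all samples in order."""
    return [(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]

n = 10
stochastic = batch_ranges(n, 1)    # 10 batches of one sample each
full_batch = batch_ranges(n, n)    # [(0, 10)]
mini_batch = batch_ranges(n, 4)    # [(0, 4), (4, 8), (8, 10)]

print(stochastic)
print(full_batch)
print(mini_batch)
```

Unlike the zip-of-ranges pattern, which for 10 samples and a batch size of 4 yields only (0, 4) and (4, 8), this variant also keeps the final, smaller batch (8, 10).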
Hope this helps.
Normally, the first dimension of the data placeholders in TensorFlow is the batch size, and TensorFlow does not impose a training strategy by default. You can set that first dimension to determine whether training is online (first dimension is 1) or mini-batch (typically tens of samples). For example:
self.enc_batch = tf.placeholder(tf.int32, [hps.batch_size, None], name='enc_batch')