I created a neural network and attempted to train it; all was well until I added a bias.
From what I gather, during training the bias adjusts to shift the output up or down, and the weights tend towards values that help YHat emulate some target function, so for a two-layer network:
output = tanh(tanh(X0W0 + b0)W1 + b1)
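To make that concrete, here is a minimal numpy sketch of the forward pass I have in mind (the names and shapes are illustrative only, not my actual implementation):

import numpy as np

def forward(X0, W0, b0, W1, b1):
    # first layer: affine transform followed by tanh
    X1 = np.tanh(np.dot(X0, W0) + b0)
    # second layer: same pattern, giving YHat
    return np.tanh(np.dot(X1, W1) + b1)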
In practice what I've found is that all the weights in W end up near 0, and b almost echoes the training targets Y. This makes the output match the training data perfectly, but when you give the network different kinds of data it always produces the same output.
This has caused quite some confusion. I know that the bias's role is to shift the activation graph up or down, but when it comes to training it seems to make the entire purpose of the neural network irrelevant. Here is the code from my training method:
def train(self, X, Y, loss, epoch=10000):
    for i in range(epoch):
        YHat = self.forward(X)
        loss.append(sum(Y - YHat))
        err = -(Y - YHat)
        for l in self.__layers[::-1]:
            werr = np.sum(np.dot(l.localWGrad, err.T), axis=1)
            werr.shape = (l.height, 1)
            l.adjustWeights(werr)

            err = np.sum(err, axis=1)
            err.shape = (X.shape[0], 1)
            l.adjustBiases(err)

            err = np.multiply(err, l.localXGrad)
and the code for adjusting the weights and biases (note: epsilon is my training rate and lambda is the regularisation rate):
def adjustWeights(self, err):
    self.__weights = self.__weights - (err * self.__epsilon + self.__lambda * self.__weights)

def adjustBiases(self, err):
    a = np.sum(np.multiply(err, self.localPartialGrad), axis=1) * self.__epsilon
    a.shape = (err.shape[0], 1)
    self.__biases = self.__biases - a
And here is the math I've done for this network.
Z0 = X0W0 + b0
X1 = relu(Z0)
Z1 = X1W1 + b1
X2 = relu(Z1)
a = YHat-X2
#Note the second part is for regularisation
loss = ((1/2)*(a^2)) + (lambda*(1/2)*(sum(W0^2) + sum(W1^2)))
And now the derivatives
dloss/dW1 = -(YHat-X2)*relu'(X1W1 + b1)X1
dloss/dW0 = -(YHat-X2)*relu'(X1W1 + b1)W1*relu'(X0W0 + b0)X0
dloss/db1 = -(YHat-X2)*relu'(X1W1 + b1)
dloss/db0 = -(YHat-X2)*relu'(X1W1 + b1)W1*relu'(X0W0 + b0)
I'm guessing I'm doing something wrong, but I have no idea what it is. I tried training this network on the following inputs
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Xnorm = X / np.amax(X)
Y = np.array([[0.0], [2.0], [4.0], [6.0]])
Ynorm = Y / np.amax(Y)
And I get this as the output:
post training:
shape: (4, 1)
[[0. ]
[1.99799666]
[3.99070622]
[5.72358125]]
Expected:
[[0.]
[2.]
[4.]
[6.]]
Which seems great... until you forward something else:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
Then I get:
shape: (4, 1)
[[0.58289512]
[2.59967085]
[4.31654068]
[5.74322541]]
Expected:
[[4.]
[6.]
[8.]
[10.]]
I thought "perhapse this is the evil 'Overfitting I've heard of" and decided to add in some regularisation, but even that doesn't really solve the issue, why would it when it makes sense from a logical perspective that it's faster, and more optimal to set the biases to equal the output and make the weights zero... Can someone explain what's going wrong in my thinking?
Here is the network structure post training (note: if you multiply the output by the max of the training Y you will get the expected output):
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 3)
[[0.05539559 0.05539442 0.05539159]]
Biases: (4, 1)
[[0. ]
[0.22897166]
[0.56300199]
[1.30167665]]
==============\Layer 0 :===============
===============Layer 1 :===============
Weights: (3, 1)
[[0.29443245]
[0.29442639]
[0.29440642]]
Biases: (4, 1)
[[0. ]
[0.13199981]
[0.32762199]
[1.10023446]]
==============\Layer 1 :===============
==========================\NeuralNetwork===========================
The graph y = 2x crosses the y axis at 0, and thus it would make sense for all the biases to be 0, since we aren't moving the graph up or down... right?
Thanks for reading this far!
edit:
Here is the loss graph:
edit 2:
I just tried this with a single weight and output, and here is the network structure I got:
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 1)
[[0.47149317]]
Biases: (4, 1)
[[0. ]
[0.18813419]
[0.48377987]
[1.33644038]]
==============\Layer 0 :===============
==========================\NeuralNetwork===========================
and for this input:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I got this output:
shape: (4, 1)
[[4.41954787]
[5.53236625]
[5.89599366]
[5.99257962]]
when again it should be:
Expected:
[[4.]
[6.]
[8.]
[10.]]
Note the problem with the biases persists; you would think in this situation the weight would be 2 and the biases would be 0.
Moved answer from OP's question
Turns out I never dealt with my training data properly. The input vector:
[[0.0], [1.0], [2.0], [3.0]]
was normalised: I divided this vector by the max value in the input, which was 3, and thus I got
[[0.0], [0.3333], [0.6666], [1.0]]
And for the Y training vector I had
[[0.0], [2.0], [4.0], [6.0]]
and I foolishly decided to do the same with this vector, but with the max of Y, which was 6:
[[0.0], [0.333], [0.666], [1.0]]
So basically I was saying "hey network, mimic my input". This was my first error. The second error came from a further misunderstanding of the scaling.
On the training data the scaling happened to cancel out: 1 normalises to 0.333, and its target 2 normalises to 2/6 = 0.333 as well, so scaling the network's output back up by the max of Y (6) gives back 2. But if I try this again with a different set of data, say:
[[2.0], [3.0], [4.0], [5.0]]
then 2 would be 2/5 = 0.4 and 0.4*2 = 0.8, which times 5 would be 4. However, in the real world we would have no way of knowing that 5 was the max output of the dataset, and I thought maybe it would have been the max of the training Y, which was 6. So instead of 2/5 = 0.4, 0.4*2 = 0.8, 0.8*5 = 4, I did 2/5 = 0.4, 0.4*2 = 0.8, 0.8*6 = 4.8.
So then I got some strange behaviours of the biases and weights as a result. After essentially getting rid of the normalisation, I was free to tweak the hyperparameters, and now for the base training data:
input:
X:
[[0.]
[1.]
[2.]
[3.]]
I get this output:
shape: (4, 1)
[[0.30926124]
[2.1030826 ]
[3.89690395]
[5.6907253 ]]
and for the extra testing data (not trained on):
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I get this output:
shape: (4, 1)
[[3.89690395]
[5.6907253 ]
[7.48454666]
[9.27836801]]
So now I'm happy. I also changed my activation to a leaky ReLU, as it should fit a linear equation better (I think). I'm sure that with more testing data and more tweaking of the hyperparameters it would be a perfect fit. Thanks for the help, everyone. Trying to explain my problem really put things into perspective.
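For completeness, if I had kept the normalisation, the consistent way would have been to compute the scaling constants from the training set only and reuse them at inference time. A minimal sketch (names here are just illustrative, not my actual class):

import numpy as np

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
Y_train = np.array([[0.0], [2.0], [4.0], [6.0]])

# scaling constants come from the training set only
x_max = np.amax(X_train)
y_max = np.amax(Y_train)

Xnorm = X_train / x_max
Ynorm = Y_train / y_max

# at inference time, scale new inputs with the *training* x_max and
# un-scale the predictions with the *training* y_max
X_new = np.array([[2.0], [3.0], [4.0], [5.0]])
X_new_norm = X_new / x_max
# Y_pred = net.forward(X_new_norm) * y_max   # hypothetical network call

With the training constants reused like this, an input of 2 maps to 2/3 ≈ 0.667, and 0.667 * 6 = 4, which is the expected output.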
For a multiple linear regression model in TensorFlow in Python, how can you print out the equation that the model is using to predict the label? The model I am currently using takes two features to predict one label, so I think the general equation is this, but how could I get the unknown parameters and the values of all the constants using TensorFlow?
Code:
fundingFeatures = fundingTrainSet.copy()
fundingLabels = fundingFeatures.pop('% of total funding spent')
fundingFeatures = np.array(fundingFeatures)
normalizer = preprocessing.Normalization()
normalizer.adapt(fundingFeatures)

model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

model.compile(loss = tf.losses.MeanSquaredError(),
              optimizer = tf.keras.optimizers.SGD(
                  learning_rate=0.06, momentum=0.0, nesterov=True, name="SGD",
              ))

model.fit(fundingFeatures, fundingLabels, epochs=1000)
I will explain how you can write the equation of your NN.
In order to do that, I have modified your code and added fixed values for your features and labels. I'm doing that to show the whole calculation step by step, so that next time you can do it yourself.
Based on all the information you have provided, it seems that you have
NN with 2 layers.
First layer is a Normalization layer
Second layer is a Dense layer
You have 2 features in your input tensor and 1 single output
Let's start with the normalization layer. For normalization layers, it is kind of "strange" in my opinion to use the term "weights". The weights are basically the mean and variance which will be applied to each input in order to normalize the data.
I will call the 2 input features x0 and x1.
If you run my code (which is your code with my fixed data), you will see that the weights for the normalization layer are
[5. 4.6]
[ 5.4 11.24]
It means that the means for your [x0 x1] columns are [5. 4.6] and the variances are [5.4 11.24]
Can we verify that? Yes, we can. Let's check for x0.
[1, 4, 8, 7, 3, 6, 6, 5, 2, 8]
mean = 5
stddev = 2.323790008
variance = 5.4 ( variance = stddev^2)
As you can see, it matches the "weights" of the normalization layer.
As data is pushed through the normalization layer, each value will be normalized based on
x' = (x - mean)/stddev   (stddev, not variance)
You can check that by applying the normalization to the data.
In the code, if you run these 2 lines
normalized_data = normalizer(fundingFeatures)
print(normalized_data)
You will get
[[-1.7213259 1.31241 ]
[-0.43033147 1.014135 ]
[ 1.2909944 0.41758505]
[ 0.86066294 -0.47723997]
[-0.86066294 -1.07379 ]
[ 0.43033147 1.31241 ]
[ 0.43033147 -1.07379 ]
[ 0. -1.07379 ]
[-1.2909944 0.71586 ]
[ 1.2909944 -1.07379 ]]
Let's verify the first number.
x0[0] = 1
x0'[0] = (1-5)/2.323790008 = -1.7213 ( it does match)
At this point, we should be able to write the equations for the normalization layer
y[0]' = (x0-5)/2.323790008 # (x-mean)/stddev
y[1]' = (x1-4.6)/3.352610923
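If you want to double-check these two equations with plain numpy (using the first row [1, 9] of my fixed data), something like this should reproduce the first row of the normalized output:

import numpy as np

x = np.array([1.0, 9.0])           # first row of fundingFeatures
mean = np.array([5.0, 4.6])        # the "weights" reported by the Normalization layer
var = np.array([5.4, 11.24])

print((x - mean) / np.sqrt(var))   # approx. [-1.7213, 1.3124]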
Now, these 2 outputs will be injected into the next layer. Remember, you have a Dense layer and therefore it is fully connected. It means that both values will be injected into the single neuron.
These lines show the value of both weights and bias for the Dense layer.
weights = model.layers[1].get_weights()[0]
biases = model.layers[1].get_weights()[1]
print(weights)
print(biases)
[[-0.12915221]
[-0.41322172]]
[0.32663438]
A neuron multiplies each input by its weight, sums the results, and adds the bias.
Let's modify y[0]' and y[1]' to include the weights.
y[0]' = ((x0-5)/2.323790008) * -0.12915221
y[1]' = ((x1-4.6)/3.352610923) * -0.41322172
We are close, we just need to sum up these 2 and add the bias
y' = ((x0-5)/2.323790008)* -0.12915221 + (x1-4.6)/3.352610923 * -0.41322172 + 0.32663438
Since you don't have an activation function, we can stop here.
How can we verify if the formula is right?
Let's use the model to predict the label for a random input and see if it matches the result we get when we put the same values in our equation.
First, let's run a model prediction for [4,5]
print(model.predict( [[4,5]] ))
[[0.3329112]]
Now, let's plug the same inputs to our equation
y' = (((4-5)/2.323790008)* -0.12915221) + ((5-4.6)/3.352610923 * -0.41322172) + 0.32663438
y' = 0.332911
It seems that we are good. I cut some precision just to make my life easier.
Here is the function for your model. Just replace my numbers with your numbers.
y' = ((x0-5)/2.323790008)* -0.12915221 + (x1-4.6)/3.352610923 * -0.41322172 + 0.32663438
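If it is easier to reuse, here is the same equation as a small Python helper (my numbers hard-coded; swap in yours):

def predict_manually(x0, x1):
    # normalization of each feature, then the Dense layer's weights and bias
    y0 = (x0 - 5.0) / 2.323790008 * -0.12915221
    y1 = (x1 - 4.6) / 3.352610923 * -0.41322172
    return y0 + y1 + 0.32663438

print(predict_manually(4, 5))   # approx. 0.3329, same as model.predict([[4, 5]])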
And here is the code. I have also added tensorboard so you can verify yourself what I have said here.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from matplotlib import pyplot as plt
import numpy as np
import datetime
fundingFeatures = tf.constant([[1, 9], [4, 8], [8, 6], [7 ,3], [3 ,1], [6, 9], [6, 1], [5, 1], [2, 7], [8, 1]], dtype=tf.int32)
fundingLabels = tf.constant([ 0.8160469, -0.05249139, 1.1515405, 1.0792135, 0.80369186, -1.7353221, 1.0092108, 0.19228514, -0.10366996, 0.10583907])
normalizer = preprocessing.Normalization()
normalizer.adapt(fundingFeatures)
normalized_data = normalizer(fundingFeatures)
print(normalized_data)
print("Features mean raw: %.2f" % (fundingFeatures[:,0].numpy().mean()))
print("Features std raw: %.2f" % (fundingFeatures[:,0].numpy().std()))
print("Features mean raw: %.2f" % (fundingFeatures[:,1].numpy().mean()))
print("Features std raw: %.2f" % (fundingFeatures[:,1].numpy().std()))
print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))
model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

model.compile(loss = tf.losses.MeanSquaredError(),
              optimizer = tf.keras.optimizers.SGD(
                  learning_rate=0.06, momentum=0.0, nesterov=True, name="SGD",
              ))
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
model.summary()
print('--------------')
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
print('--------------')
model.fit(fundingFeatures, fundingLabels, epochs=1000, callbacks=[tensorboard_callback])
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
print(weights)
print(biases)
print ("\n")
weights = model.layers[1].get_weights()[0]
biases = model.layers[1].get_weights()[1]
print(weights)
print(biases)
print('\n--------- Prediction ------')
print(model.predict( [[4,5]] ))
I'm doing sequence classification. I've got a batch size of 1, 5 outcomes, and a variable number of time steps (14 in this example). My sample weights w are the same shape as my label y:
y = tf.convert_to_tensor(np.ones(shape = (1,14,5)))
w = tf.convert_to_tensor(np.random.uniform(size = (1,14,5)))
y.shape
Out[53]: TensorShape([1, 14, 5])
w.shape
Out[54]: TensorShape([1, 14, 5])
When I try to run this through the loss function, I get the following error:
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=False)
loss_object(y_true = y,
            y_pred = y,
            sample_weight = w)
InvalidArgumentError: Can not squeeze dim[2], expected a dimension of 1, got 5 [Op:Squeeze]
What's going on? It should be a straightforward multiplication of the loss matrix (pre-reduction) with the weights. How do I fix this?
Super simple fix! TensorFlow squeezes the last dimension of the sample weights because they are supposed to be applied per sample. Therefore, all you need to do is add one dimension to your weight matrix along the last axis:
y = tf.convert_to_tensor(np.ones(shape = (1,14,5)))
w = tf.convert_to_tensor(np.random.uniform(size = (1,14,5,1))) # Change made here
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=False)
loss_object(y_true = y,
            y_pred = y,
            sample_weight = w)
You can also just change the shape of the weights matrix after creation:
w = tf.expand_dims(w, axis=-1)
I am writing a custom loss function for semi-supervised learning on the CIFAR-10 dataset, for which I need to duplicate the columns of a tensor to create a sort of mask, which I then multiply with the activation values and later sum over.
My loss function is a sum of entropy for unlabelled samples and cross-entropy for labelled samples. I add an extra class and set it to 1 for unlabelled samples.
I then create a mask identifying the row indices of unlabelled samples from the y_true tensor. From that I get a (n_samples, 1) tensor, which I need to repeat/duplicate/copy into a (n_samples, 11) tensor that I can multiply with the activation values in y_pred.
Loss function code:
def loss_fn(y_true, y_pred):
    a = np.ones((mini_batch_size, 1)) * 10
    a_var = K.variable(value=a)
    v = K.cast(K.equal(K.cast(K.argmax(y_true, axis=1), 'float32'), a_var), 'float32')
    e_loss = K.sum(K.concatenate([v,v,v,v,v,v,v,v,v,v,v], axis=-1) * K.log(y_pred) * y_pred)
    m_u = K.sum(K.cast(K.equal(K.cast(K.argmax(y_true, axis=1), 'float32'), a_var), 'float32'))

    b = np.ones((mini_batch_size, 1)) * 10
    b_var = K.variable(value=b)
    v2 = K.cast(K.not_equal(K.cast(K.argmax(y_true, axis=1), 'float32'), b_var), 'float32')
    ce_loss = K.sum(K.concatenate([v2, v2, v2, v2, v2, v2, v2, v2, v2, v2, v2], axis=1) * K.log(y_pred))

    m_l = K.variable(value=float(mini_batch_size), dtype='float32') #- m_u
    return -((e_loss/m_u) + (ce_loss/m_l))
The error I get is:
InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [40,11] vs. [40,440]
[[{{node loss_36/dense_74_loss/mul_2}}]]
[[metrics_28/acc/Mean/_2627]]
(1) Invalid argument: Incompatible shapes: [40,11] vs. [40,440]
[[{{node loss_36/dense_74_loss/mul_2}}]]
0 successful operations.
0 derived errors ignored.
My batch size is 40.
I need my concatenated tensor to be of size [40, 11] not [40, 440]
I don't have real data to test whether the loss properly works, but this got rid of that InvalidArgumentError and did work with model.fit() for a dense model.
A few changes I made:
You don't have to repeat your v 11 times to multiply it with y_pred. All you need to do is reshape it to (-1, 1), which will also save you memory (see the small numpy illustration after the code below).
Got rid of all the K.variable calls. Now this is something I want to check with you: you are not trying to optimize a_var and b_var, right (i.e. they are not part of the model)? Apparently that's what was causing the issue (I need to dive deeper to see why). It seems the whole point of a_var and b_var is to perform the boolean logic equal and not_equal, which works just fine with constants.
Made m_l a K.constant
def loss_fn(y_true, y_pred):
    v = K.cast(K.equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32')
    e_loss = K.sum(K.reshape(v, (-1,1)) * K.log(y_pred) * y_pred)
    m_u = K.sum(K.cast(K.equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32'))

    v2 = K.cast(K.not_equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32')
    ce_loss = K.sum(K.reshape(v2, (-1,1)) * K.log(y_pred))

    m_l = K.constant(value=float(mini_batch_size), dtype='float32') #- m_u

    return -((e_loss/m_u) + (ce_loss/m_l))
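To see why the (-1, 1) reshape is enough, here is a tiny numpy illustration of the broadcasting (made-up numbers, not your data):

import numpy as np

mask = np.array([1.0, 0.0, 1.0]).reshape(-1, 1)          # shape (3, 1)
activations = np.arange(12, dtype=float).reshape(3, 4)   # shape (3, 4)

# the (3, 1) mask broadcasts across the columns, so there is no need
# to concatenate 11 copies of it
print(mask * activations)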
Note: depending on the batch size within the loss function is a bad idea. Try to get rid of any batch-size-dependent operations (especially for shapes of tensors). You can see that I have only kept mini_batch_size to set m_l, but I would suggest setting this to some constant instead of mini_batch_size, because if a batch with fewer than 40 samples comes through, you are using a different loss function for that batch, and your results aren't comparable between different batch sizes, as your loss function changes.
When setting up a model, I sometimes see code like this:
# Scenario 1
# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=Y))
or
# Scenario 2
# Evaluate model (with test logits, for dropout to be disabled)
prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
The definition of tf.reduce_mean states that it "calculates the mean of tensor elements along various dimensions of the tensor." I am confused about what it does; can someone explain it in simpler language? When do we need to use it, maybe with reference to Scenario 1 & 2? Thank you.
As far as I understand, tensorflow.reduce_mean is the same as numpy.mean. It creates an operation in the underlying tensorflow graph which computes the mean of a tensor.
The most important keyword argument of tensorflow.reduce_mean is axis. Basically, if you have a tensor with shape (4, 3, 2) and axis=1, an empty array with shape (4, 2) will be created, and the mean values along the selected axis will be computed to fill in the empty array. (This is just a pseudo-process to help you make sense of the output, but may not be the actual process)
Here is a simple example to help you understand
import tensorflow as tf
import numpy as np

one = np.linspace(1, 30, 30).reshape(5, 3, 2)

x = tf.placeholder('float32', shape=[5, 3, 2])
op_1 = tf.reduce_mean(x)
op_2 = tf.reduce_mean(x, axis=0)
op_3 = tf.reduce_mean(x, axis=1)
op_4 = tf.reduce_mean(x, axis=2)

with tf.Session() as sess:
    print(sess.run(op_1, feed_dict={x: one}))
    print(sess.run(op_2, feed_dict={x: one}))
    print(sess.run(op_3, feed_dict={x: one}))
    print(sess.run(op_4, feed_dict={x: one}))
The first output is a number because we didn't provide an axis. The shapes of the rest of the outputs are (3, 2), (5, 2) and (5, 3), respectively.
reduce_mean can be useful when the target value is a matrix.
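Since tf.reduce_mean mirrors numpy.mean, you can cross-check the shapes above with plain numpy:

import numpy as np

one = np.linspace(1, 30, 30).reshape(5, 3, 2)
print(one.mean())               # a single number, like op_1
print(one.mean(axis=0).shape)   # (3, 2), like op_2
print(one.mean(axis=1).shape)   # (5, 2), like op_3
print(one.mean(axis=2).shape)   # (5, 3), like op_4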
User #meTchaikovsky explained the general case of tf.reduce_mean. In both of your cases tf.reduce_mean simply works as any mean calculator, i.e., you're not taking the mean along any particular axis of the tensor; you simply divide the sum of the elements in the tensor by the number of elements.
Let's decode what exactly is happening in both cases. For both, assume batch_size = 2 and num_classes = 5, meaning that there are two examples per batch.
Now for the first case, tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y) returns an array of shape (2,).
>>import numpy as np
>>import tensorflow as tf
>>sess= tf.InteractiveSession()
>>batch_size = 2
>>num_classes = 5
>>logits = np.random.rand(batch_size,num_classes)
>>print(logits)
[[0.94108451 0.68186329 0.04000461 0.25996487 0.50391948]
[0.22781201 0.32305269 0.93359371 0.22599208 0.05942905]]
>>labels = np.array([[1,0,0,0,0],[0,1,0,0,0]])
>>print(labels)
[[1 0 0 0 0]
[0 1 0 0 0]]
>>logits_ = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>Y_ = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>loss_op = tf.nn.softmax_cross_entropy_with_logits(logits=logits_, labels=Y_)
>>loss_per_example = sess.run(loss_op,feed_dict={Y_:labels,logits_:logits})
>>print(loss_per_example)
array([1.2028817, 1.6912657], dtype=float32)
You can see that loss_per_example is of shape (2,). If we take the mean of this variable then we can approximate the average loss for the full batch. Hence we calculate
>>loss_per_example_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>final_loss_per_batch = tf.reduce_mean(loss_per_example_holder)
>>final_loss = sess.run(final_loss_per_batch,feed_dict={loss_per_example_holder:loss_per_example})
>>print(final_loss)
1.4470737
Coming to your second case:
>>predictions_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>labels_holder = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>prediction_tf = tf.equal(tf.argmax(predictions_holder, 1), tf.argmax(labels_holder, 1))
>>labels_match = sess.run(prediction_tf,feed_dict={predictions_holder:logits,labels_holder:labels})
>>print(labels_match)
[ True False]
The above output is expected because only for the first example of the variable logits is the neuron with the highest activation (0.9410) the zeroth one, which matches the label. Now we want to calculate the accuracy, which means we have to take the average of the variable labels_match.
>>labels_match_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>accuracy_calc = tf.reduce_mean(tf.cast(labels_match_holder, tf.float32))
>>accuracy = sess.run(accuracy_calc, feed_dict={labels_match_holder:labels_match})
>>print(accuracy)
0.5
Here is an implementation of the AND function with a single neuron using TensorFlow:
import numpy as np
import tensorflow as tf

def tf_sigmoid(x):
    return 1 / (1 + tf.exp(-x))

data = [
    (0, 0),
    (0, 1),
    (1, 0),
    (1, 1),
]

labels = [
    0,
    0,
    0,
    1,
]

n_steps = 1000
learning_rate = .1

x = tf.placeholder(dtype=tf.float32, shape=[2])
y = tf.placeholder(dtype=tf.float32, shape=None)

w = tf.get_variable('W', shape=[2], initializer=tf.random_normal_initializer(), dtype=tf.float32)
b = tf.get_variable('b', shape=[], initializer=tf.random_normal_initializer(), dtype=tf.float32)

h = tf.reduce_sum(x * w) + b
output = tf_sigmoid(h)

error = tf.abs(output - y)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(error)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

for step in range(n_steps):
    for i in np.random.permutation(range(len(data))):
        sess.run(optimizer, feed_dict={x: data[i], y: labels[i]})
Sometimes it works perfectly, but on some parameters it gets stuck and doesn't want to learn. For example with these initial parameters:
w = tf.Variable(initial_value=[-0.31199348, -0.46391705], dtype=tf.float32)
b = tf.Variable(initial_value=-1.94877, dtype=tf.float32)
it will hardly make any improvement in the cost function. What am I doing wrong? Maybe I should somehow adjust the initialization of the parameters?
Aren't you missing a mean(error) ?
Your problem is the particular combination of the sigmoid, the cost function, and the optimizer.
Don't feel bad, AFAIK this exact problem stalled the entire field for a few years.
Sigmoid is flat when you're far from the middle, and you're initializing it with relatively large numbers; try dividing them by 1000.
So your abs-error (or square-error) is flat too, and the GradientDescent optimizer takes steps proportional to the slope.
Either of these should fix it:
Use cross-entropy for the error - it's convex.
Use a better optimizer, like Adam, whose step size is much less dependent on the slope.
Bonus: Don't roll your own sigmoid, use tf.nn.sigmoid, you'll get a lot fewer NaN's that way.
Have fun!
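If it helps, here is a minimal sketch of those changes applied to your graph (reusing your placeholders x and y and the logit h; treat it as a starting point rather than a drop-in fix):

output = tf.nn.sigmoid(h)  # built-in sigmoid instead of a hand-rolled one

# cross-entropy on the logit is convex and numerically stable
error = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=h)

# Adam's step size depends far less on the local slope than plain gradient descent
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(error)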