Back propagation algorithm gets stuck on training AND function - python

Here is an implementation of the AND function with a single neuron, using TensorFlow:
import numpy as np
import tensorflow as tf

def tf_sigmoid(x):
    return 1 / (1 + tf.exp(-x))

data = [
    (0, 0),
    (0, 1),
    (1, 0),
    (1, 1),
]

labels = [
    0,
    0,
    0,
    1,
]

n_steps = 1000
learning_rate = .1

x = tf.placeholder(dtype=tf.float32, shape=[2])
y = tf.placeholder(dtype=tf.float32, shape=None)

w = tf.get_variable('W', shape=[2], initializer=tf.random_normal_initializer(), dtype=tf.float32)
b = tf.get_variable('b', shape=[], initializer=tf.random_normal_initializer(), dtype=tf.float32)

h = tf.reduce_sum(x * w) + b
output = tf_sigmoid(h)

error = tf.abs(output - y)

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(error)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

for step in range(n_steps):
    for i in np.random.permutation(range(len(data))):
        sess.run(optimizer, feed_dict={x: data[i], y: labels[i]})
Sometimes it works perfectly, but on some parameters it gets stuck and doesn't want to learn. For example with these initial parameters:
w = tf.Variable(initial_value=[-0.31199348, -0.46391705], dtype=tf.float32)
b = tf.Variable(initial_value=-1.94877, dtype=tf.float32)
it will hardly make any improvement in the cost function. What am I doing wrong? Maybe I should somehow adjust the initialization of the parameters?

Aren't you missing a mean(error)?
Your problem is the particular combination of the sigmoid, the cost function, and the optimizer.
Don't feel bad, AFAIK this exact problem stalled the entire field for a few years.
Sigmoid is flat when you're far from the middle, and you're initializing it with relatively large numbers; try dividing them by 1000.
So your abs-error (or square-error) is flat too, and the GradientDescent optimizer takes steps proportional to the slope.
Either of these should fix it (see the sketch after this list):
Use cross-entropy for the error - it's convex.
Use a better optimizer, like Adam, whose step size depends much less on the slope and more on the consistency of the slope.
Bonus: don't roll your own sigmoid; use tf.nn.sigmoid, you'll get a lot fewer NaNs that way.
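For concreteness, here is a minimal sketch (TF 1.x API, as in the question) that applies both suggestions at once: it trains the same AND gate full-batch with a logits-based cross-entropy loss (tf.nn.sigmoid_cross_entropy_with_logits) and Adam. The variable names and the 0.1 learning rate are my own choices, not from the original post.

import numpy as np
import tensorflow as tf

# Same AND data as in the question, fed as a full batch.
data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
labels = np.array([0, 0, 0, 1], dtype=np.float32)

x = tf.placeholder(tf.float32, shape=[None, 2])
y = tf.placeholder(tf.float32, shape=[None])

w = tf.get_variable('W', shape=[2, 1], initializer=tf.random_normal_initializer())
b = tf.get_variable('b', shape=[], initializer=tf.zeros_initializer())

# Raw pre-sigmoid output (the "logits").
logits = tf.squeeze(tf.matmul(x, w), axis=1) + b

# Cross-entropy computed from the logits is numerically stable and convex in the logits.
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op, feed_dict={x: data, y: labels})
    # Should end up close to [0, 0, 0, 1].
    print(sess.run(tf.nn.sigmoid(logits), feed_dict={x: data}))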
Have fun!

Related

When and why do we use tf.reduce_mean?

In setting up the model I sometimes see the code:
# Scenario 1
# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=Y))
or
# Scenario 2
# Evaluate model (with test logits, for dropout to be disabled)
prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
The definition of tf.reduce_mean states that it "calculates the mean of tensor elements along various dimensions of the tensor." I am confused about what it does, in simpler language. When do we need to use it, maybe with reference to Scenario 1 & 2? Thank you.
As far as I understand, tensorflow.reduce_mean is the same as numpy.mean. It creates an operation in the underlying tensorflow graph which computes the mean of a tensor.
The most important keyword argument of tensorflow.reduce_mean is axis. Basically, if you have a tensor with shape (4, 3, 2) and axis=1, an empty array with shape (4, 2) will be created, and the mean values along the selected axis will be computed to fill in the empty array. (This is just a pseudo-process to help you make sense of the output, but may not be the actual process)
Here is a simple example to help you understand
import tensorflow as tf
import numpy as np
one = np.linspace(1, 30, 30).reshape(5, 3, 2)
x = tf.placeholder('float32', shape=[5, 3, 2])
op_1 = tf.reduce_mean(x)
op_2 = tf.reduce_mean(x, axis=0)
op_3 = tf.reduce_mean(x, axis=1)
op_4 = tf.reduce_mean(x, axis=2)
with tf.Session() as sess:
    print(sess.run(op_1, feed_dict={x: one}))
    print(sess.run(op_2, feed_dict={x: one}))
    print(sess.run(op_3, feed_dict={x: one}))
    print(sess.run(op_4, feed_dict={x: one}))
The first output is a number because we didn't provide an axis. The shapes of the rest of the outputs are (3, 2), (5, 2) and (5, 3), respectively.
reduce_mean can be useful when the target value is a matrix.
User #meTchaikovsky explained the general case of tf.reduce_mean. In both of your cases tf.reduce_mean simply works as an ordinary mean calculator, i.e., you're not taking the mean along any particular axis of a tensor; you simply divide the sum of the elements in the tensor by the number of elements.
Let's decode exactly what is happening in both cases. For both, assume batch_size = 2 and num_classes = 5, meaning there are two examples per batch and five classes.
Now for the first case, tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y) returns an array of shape (2,).
>>import numpy as np
>>import tensorflow as tf
>>sess= tf.InteractiveSession()
>>batch_size = 2
>>num_classes = 5
>>logits = np.random.rand(batch_size,num_classes)
>>print(logits)
[[0.94108451 0.68186329 0.04000461 0.25996487 0.50391948]
[0.22781201 0.32305269 0.93359371 0.22599208 0.05942905]]
>>labels = np.array([[1,0,0,0,0],[0,1,0,0,0]])
>>print(labels)
[[1 0 0 0 0]
[0 1 0 0 0]]
>>logits_ = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>Y_ = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>loss_op = tf.nn.softmax_cross_entropy_with_logits(logits=logits_, labels=Y_)
>>loss_per_example = sess.run(loss_op,feed_dict={Y_:labels,logits_:logits})
>>print(loss_per_example)
array([1.2028817, 1.6912657], dtype=float32)
You can see that loss_per_example is of shape (2,). If we take the mean of this variable, we get the average loss over the full batch. Hence we calculate:
>>loss_per_example_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>final_loss_per_batch = tf.reduce_mean(loss_per_example_holder)
>>final_loss = sess.run(final_loss_per_batch,feed_dict={loss_per_example_holder:loss_per_example})
>>print(final_loss)
1.4470737
Coming to your second case:
>>predictions_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>labels_holder = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>prediction_tf = tf.equal(tf.argmax(predictions_holder, 1), tf.argmax(labels_holder, 1))
>>labels_match = sess.run(prediction_tf,feed_dict={predictions_holder:logits,labels_holder:labels})
>>print(labels_match)
[ True False]
The above output is expected because only for the first example in logits does the neuron with the highest activation (0.9410) have index zero, which matches the label. Now we want to calculate the accuracy, which means we have to take the average of the variable labels_match.
>>labels_match_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>accuracy_calc = tf.reduce_mean(tf.cast(labels_match_holder, tf.float32))
>>accuracy = sess.run(accuracy_calc, feed_dict={labels_match_holder:labels_match})
>>print(accuracy)
0.5

Neural network bias training

I created a neural network and attempted to train it; all was well until I added in a bias.
From what I gather, during training the bias adjusts to move the expected output up or down, and the weights tend towards values that help YHat emulate some function, so for a two-layer network:
output = tanh(tanh(X0W0 + b0)W1 + b1)
In practice what I've found is that W sets all weights to near 0, and b almost echoes the trained output Y. This essentially makes the output work perfectly for the trained data, but when you give it different kinds of data it will always give the same output.
This has caused quite some confusion. I know that the bias's role is to move the activation graph up or down, but when it comes to training it seems to make the entire purpose of the neural network irrelevant. Here is the code from my training method:
def train(self, X, Y, loss, epoch=10000):
    for i in range(epoch):
        YHat = self.forward(X)
        loss.append(sum(Y - YHat))
        err = -(Y - YHat)
        for l in self.__layers[::-1]:
            werr = np.sum(np.dot(l.localWGrad, err.T), axis=1)
            werr.shape = (l.height, 1)
            l.adjustWeights(werr)
            err = np.sum(err, axis=1)
            err.shape = (X.shape[0], 1)
            l.adjustBiases(err)
            err = np.multiply(err, l.localXGrad)
and the code for adjusting the weights and biases. (Note: epsilon is my training rate and lambda is the regularisation rate.)
def adjustWeights(self, err):
    self.__weights = self.__weights - (err * self.__epsilon + self.__lambda * self.__weights)

def adjustBiases(self, err):
    a = np.sum(np.multiply(err, self.localPartialGrad), axis=1) * self.__epsilon
    a.shape = (err.shape[0], 1)
    self.__biases = self.__biases - a
And here is the math I've done for this network.
Z0 = X0W0 + b0
X1 = relu(Z0)
Z1 = X1W1 + b1
X2 = relu(Z1)
a = YHat-X2
#Note the second part is for regularisation
loss = ((1/2)*(a^2)) + (lambda*(1/2)*(sum(W1^2) + sum(W2^2)))
And now the derivatives
dloss/dW1 = -(YHat-X2)*relu'(X1W1 + b1)X1
dloss/dW0 = -(YHat-X2)*relu'(X1W1 + b1)W1*relu'(X0W0 + b0)X0
dloss/db1 = -(YHat-X2)*relu'(X1W1 + b1)
dloss/db0 = -(YHat-X2)*relu'(X1W1 + b1)W1*relu'(X0W0 + b0)
I'm guessing I'm doing something wrong, but I have no idea what it is. I tried training this network on the following inputs
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Xnorm = X / np.amax(X)
Y = np.array([[0.0], [2.0], [4.0], [6.0]])
Ynorm = Y / np.amax(Y)
And I get this as the output:
post training:
shape: (4, 1)
[[0. ]
[1.99799666]
[3.99070622]
[5.72358125]]
Expected:
[[0.]
[2.]
[4.]
[6.]]
Which seems great... until you forward something else:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
Then I get:
shape: (4, 1)
[[0.58289512]
[2.59967085]
[4.31654068]
[5.74322541]]
Expected:
[[4.]
[6.]
[8.]
[10.]]
I thought "perhapse this is the evil 'Overfitting I've heard of" and decided to add in some regularisation, but even that doesn't really solve the issue, why would it when it makes sense from a logical perspective that it's faster, and more optimal to set the biases to equal the output and make the weights zero... Can someone explain what's going wrong in my thinking?
Here is the network structure post training, (note if you multiply the output by the max of the training Y you will get the expected output:)
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 3)
[[0.05539559 0.05539442 0.05539159]]
Biases: (4, 1)
[[0. ]
[0.22897166]
[0.56300199]
[1.30167665]]
==============\Layer 0 :===============
===============Layer 1 :===============
Weights: (3, 1)
[[0.29443245]
[0.29442639]
[0.29440642]]
Biases: (4, 1)
[[0. ]
[0.13199981]
[0.32762199]
[1.10023446]]
==============\Layer 1 :===============
==========================\NeuralNetwork===========================
The graph y = 2x has its y-intercept at x = 0, and thus it would make sense for all the biases to be 0, as we aren't moving the graph up or down... right?
Thanks for reading this far!
edit:
Here is the loss graph:
edit 2:
I just tried to do this with a single weight and output and here is the output structure I got:
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 1)
[[0.47149317]]
Biases: (4, 1)
[[0. ]
[0.18813419]
[0.48377987]
[1.33644038]]
==============\Layer 0 :===============
==========================\NeuralNetwork===========================
and for this input:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I got this output:
shape: (4, 1)
[[4.41954787]
[5.53236625]
[5.89599366]
[5.99257962]]
when again it should be:
Expected:
[[4.]
[6.]
[8.]
[10.]]
Note that the problem with the biases persists; you would think in this situation the weight would be 2 and the bias would be 0.
Moved answer from OP's question
Turns out I never dealt with my training data properly. The input vector:
[[0.0], [1.0], [2.0], [3.0]]
was normalised: I divided this vector by the max value in the input, which was 3, and thus I got
[[0.0], [0.3333], [0.6666], [1.0]]
And for the input Y training vector I had
[[0.0], [2.0], [4.0], [6.0]]
and I foolishly decided to do the same with this vector, but with the max of Y, which was 6:
[[0.0], [0.333], [0.666], [1.0]]
So basically I was saying "hey network, mimic my input". This was my first error. The second error came as a result of more misunderstanding of the scaling.
Although 1 was 0.333, and 0.333*2 = 0.666, which I then multiplied by the max of Y (6) to get 6*0.666 = 2, if I try this again with a different set of data, say:
[[2.0], [3.0], [4.0], [5.0]]
then 2 would be 2/5 = 0.4, and 0.4*2 = 0.8, which times 5 would be 2. However, in the real world we would have no way of knowing that 5 was the max output of the dataset, and so I thought maybe it would be the max of the training Y, which was 6; so instead of 2/5 = 0.4, 0.4*2 = 0.8, *5, I did 2/5 = 0.4, 0.4*2 = 0.8, *6 = 4.8.
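For what it's worth, a minimal sketch of what consistent scaling would look like: scale both X and Y by statistics computed on the training set only, and reuse those same constants at prediction time (the net.forward call below is hypothetical, standing in for the network above).

import numpy as np

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
Y_train = np.array([[0.0], [2.0], [4.0], [6.0]])

x_max = X_train.max()   # 3.0, fixed from the training set
y_max = Y_train.max()   # 6.0, fixed from the training set

Xn, Yn = X_train / x_max, Y_train / y_max   # what the network is trained on

# At prediction time, new inputs are scaled by the SAME x_max,
# and the network output is un-scaled by the SAME y_max:
X_test = np.array([[2.0], [3.0], [4.0], [5.0]])
# y_pred = net.forward(X_test / x_max) * y_max   # hypothetical forward pass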
So I got some strange behaviour of the biases and weights as a result. After essentially getting rid of the normalisation, I was free to tweak the hyperparameters, and now for the base training data:
input:
X:
[[0.]
[1.]
[2.]
[3.]]
I get this output:
shape: (4, 1)
[[0.30926124]
[2.1030826 ]
[3.89690395]
[5.6907253 ]]
and for the extra testing data (not trained on):
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I get this output:
shape: (4, 1)
[[3.89690395]
[5.6907253 ]
[7.48454666]
[9.27836801]]
So now I'm happy. I also changed my activation to a leaky ReLU, as it should fit a linear equation better (I think). I'm sure with more testing data and more tweaking of the hyperparameters it would be a perfect fit. Thanks for the help everyone. Trying to explain my problem really put things into perspective.
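For reference, a leaky ReLU and its derivative are only a couple of lines in NumPy (a sketch; the 0.01 slope is a common default, not necessarily what was used above):

import numpy as np

def leaky_relu(z, alpha=0.01):
    # Pass positive values through unchanged, scale negatives by a small slope.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Gradient is 1 for positives and alpha for negatives.
    return np.where(z > 0, 1.0, alpha)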

simple linear regression failed to converge in tensorflow

I am new to machine learning and TensorFlow. Currently I am trying to follow the tutorial's logic to create a simple linear regression model of the form y = a*x (there is no bias term here). However, for some reason, the model fails to converge to the correct value of a. The data set was created by me in Excel, as shown below:
Here is my code that tries to run TensorFlow on this dummy data set I generated:
import tensorflow as tf
import pandas as pd
w = tf.Variable([[5]],dtype=tf.float32)
b = tf.Variable([-5],dtype=tf.float32)
x = tf.placeholder(shape=(None,1),dtype=tf.float32)
y = tf.add(tf.matmul(x,w),b)
label = tf.placeholder(dtype=tf.float32)
loss = tf.reduce_mean(tf.squared_difference(y,label))
data = pd.read_csv("D:\\dat2.csv")
xs = data.iloc[:,:1].as_matrix()
ys = data.iloc[:,1].as_matrix()
optimizer = tf.train.GradientDescentOptimizer(0.000001).minimize(loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for i in range(10000):
    sess.run(optimizer, {x: xs, label: ys})
    if i % 100 == 0: print(i, sess.run(w))
print(sess.run(w))
Below is the printout in the IPython console; as you can see, after the 10000th iteration the value of w is around 4.53 instead of the correct value 6.
I would really appreciate it if anyone could shed some light on what is going wrong here. I have played around with different learning rates from 0.01 to 0.0000001, and none of the settings makes w converge to 6. I have read suggestions to normalize the feature to a standard normal distribution, and I would like to know whether this normalization is a must. Without normalization, is gradient descent not able to find the solution? Thank you very much!
It is a shaping problem: y and label don't have the same shape ([batch_size, 1] vs [batch_size]). In loss = tf.reduce_mean(tf.squared_difference(y, label)), it causes tensorflow to interpret things differently from what you want, probably by using some broadcasting... Anyway, the result is that your loss is not at all the one you want.
To correct that, simply replace
y = tf.add(tf.matmul(x, w), b)
by
y = tf.add(tf.matmul(x, w), b)
y = tf.reshape(y, shape=[-1])
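To see what the unwanted broadcasting actually does to the loss, here is a small NumPy illustration with made-up numbers: a (3, 1) output minus a (3,) label produces a (3, 3) matrix, so the mean is taken over nine pairwise differences instead of three per-example errors.

import numpy as np

y = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like the model output
label = np.array([1.0, 2.0, 3.0])     # shape (3,), like the label placeholder

diff = y - label                      # broadcasts to shape (3, 3), not (3,)
print(diff.shape)                     # (3, 3)
print(np.mean(diff ** 2))             # averages 9 pairwise errors, not 3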
My full working code below:
import tensorflow as tf
import pandas as pd
w = tf.Variable([[4]], dtype=tf.float64)
b = tf.Variable([10.0], dtype=tf.float64, trainable=True)
x = tf.placeholder(shape=(None, 1), dtype=tf.float64)
y = tf.add(tf.matmul(x, w), b)
y = tf.reshape(y, shape=[-1])
label = tf.placeholder(shape=(None), dtype=tf.float64)
loss = tf.reduce_mean(tf.squared_difference(y, label))
my_path = "/media/sf_ShareVM/data2.csv"
data = pd.read_csv(my_path, sep=";")
max_n_samples_to_use = 50
xs = data.iloc[:max_n_samples_to_use, :1].as_matrix()
ys = data.iloc[:max_n_samples_to_use, 1].as_matrix()
lr = 0.000001
optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr).minimize(loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for i in range(100000):
    _, loss_value, w_value, b_value, y_val, lab_val = sess.run([optimizer, loss, w, b, y, label], {x: xs, label: ys})
    if i % 100 == 0: print(i, loss_value, w_value, b_value)
    if (i % 2000 == 0 and 0 < i < 10000):  # We use a smaller LR at first to avoid exploding gradient. It would be MUCH cleaner to use gradient clipping (by global norm)
        lr *= 2
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr).minimize(loss)
print(sess.run(w))
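Since the comment in the training loop mentions gradient clipping by global norm, here is roughly what that would look like with the same optimizer (a sketch reusing the loss defined above; the learning rate and the clip norm of 5.0 are placeholder values):

import tensorflow as tf

opt = tf.train.GradientDescentOptimizer(learning_rate=0.001)
grads_and_vars = opt.compute_gradients(loss)          # list of (gradient, variable) pairs
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = opt.apply_gradients(list(zip(clipped_grads, variables)))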

Model not learning in tensorflow

I am new to tensorflow and neural networks, and I am trying to create a model that just multiplies two float values together.
I wasn't sure how many neurons I would want, but I picked 10 neurons and tried to see where I could go from that. I figured that would probably introduce enough complexity in order to semi-accurately learn that operation.
Anyways, here is my code:
import tensorflow as tf
import numpy as np
# Teach how to multiply
def generate_data(how_many):
    data = np.random.rand(how_many, 2)
    answers = data[:, 0] * data[:, 1]
    return data, answers
sess = tf.InteractiveSession()
# Input data
input_data = tf.placeholder(tf.float32, shape=[None, 2])
correct_answers = tf.placeholder(tf.float32, shape=[None])
# Use 10 neurons--just one layer for now, but it'll be fully connected
weights_1 = tf.Variable(tf.truncated_normal([2, 10], stddev=.1))
bias_1 = tf.Variable(.1)
# Output of this will be a [None, 10]
hidden_output = tf.nn.relu(tf.matmul(input_data, weights_1) + bias_1)
# Weights
weights_2 = tf.Variable(tf.truncated_normal([10, 1], stddev=.1))
bias_2 = tf.Variable(.1)
# Softmax them together--this will be [None, 1]
calculated_output = tf.nn.softmax(tf.matmul(hidden_output, weights_2) + bias_2)
cross_entropy = tf.reduce_mean(correct_answers * tf.log(calculated_output))
optimizer = tf.train.GradientDescentOptimizer(.5).minimize(cross_entropy)
sess.run(tf.initialize_all_variables())
for i in range(1000):
    x, y = generate_data(100)
    sess.run(optimizer, feed_dict={input_data: x, correct_answers: y})
error = tf.reduce_sum(tf.abs(calculated_output - correct_answers))
x, y = generate_data(100)
print("Total Error: ", error.eval(feed_dict={input_data: x, correct_answers: y}))
It seems that the error is always around 7522.1, which is very, very bad for just 100 data points, so I assume it is not learning.
My questions: Is my machine learning? If so, what can I do to make it more accurate? If not, how can I make it learn?
There are a few major issues with the code. Aaron has already identified some of them, but there's another important one: calculated_output and correct_answers are not the same shape, so you're creating a 2D matrix when you subtract them. (The shape of calculated_output is (100, 1) and the shape of correct_answers is (100).) So you need to adjust the shape (for example, by using tf.squeeze on calculated_output).
This problem also doesn't really require any non-linearities, so you could get by with no activations and only one layer. The following code gets a total error of about 6 (~0.06 error on average for each test point). Hope that helps!
import tensorflow as tf
import numpy as np
# Teach how to multiply
def generate_data(how_many):
    data = np.random.rand(how_many, 2)
    answers = data[:, 0] * data[:, 1]
    return data, answers
sess = tf.InteractiveSession()
input_data = tf.placeholder(tf.float32, shape=[None, 2])
correct_answers = tf.placeholder(tf.float32, shape=[None])
weights_1 = tf.Variable(tf.truncated_normal([2, 1], stddev=.1))
bias_1 = tf.Variable(.0)
output_layer = tf.matmul(input_data, weights_1) + bias_1
mean_squared = tf.reduce_mean(tf.square(correct_answers - tf.squeeze(output_layer)))
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(mean_squared)
sess.run(tf.initialize_all_variables())
for i in range(1000):
    x, y = generate_data(100)
    sess.run(optimizer, feed_dict={input_data: x, correct_answers: y})
error = tf.reduce_sum(tf.abs(tf.squeeze(output_layer) - correct_answers))
x, y = generate_data(100)
print("Total Error: ", error.eval(feed_dict={input_data: x, correct_answers: y}))
The way you are using softmax is weird. Softmax is normally used when you want to have a probability distribution over a set of classes. In your code it looks like you have a one dimensional output. The softmax is not helping you there.
The cross entropy loss function is appropriate in classification problems but you are doing regression. You should try using a mean squared error loss function instead.
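Concretely, the minimal change to the code in the question would look something like this (a sketch reusing the question's variable names): drop the softmax so the output is linear, and minimize a mean squared error instead.

# Linear output instead of softmax; still shape [None, 1].
calculated_output = tf.matmul(hidden_output, weights_2) + bias_2
# Squeeze to [None] so it matches correct_answers, then use MSE.
mean_squared = tf.reduce_mean(tf.square(correct_answers - tf.squeeze(calculated_output, axis=1)))
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(mean_squared)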

Recovering probability distribution from binary observations - what are the reasons for the defects of this implementation?

I am trying to recover a probability distribution (not a probability density; any function with range in [0,1] where f(x) encodes the probability of success for an observation at x). I use a hidden layer with 10 neurons and softmax. Here's my code:
import tensorflow as tf
import numpy as np
import random
import math
#Make binary observations encoded as one-hot vectors.
def makeObservations(probabilities):
    observations = np.zeros((len(probabilities),2), dtype='float32')
    for i in range(0, len(probabilities)):
        if random.random() <= probabilities[i]:
            observations[i,0] = 1
            observations[i,1] = 0
        else:
            observations[i,0] = 0
            observations[i,1] = 1
    return observations
xTrain = np.linspace(0, 4*math.pi, 2001).reshape(1,-1)
distribution = map(lambda x: math.sin(x)**2, xTrain[0])
yTrain = makeObservations(distribution)
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
x = tf.placeholder("float", [1,None])
hiddenDim = 10
b = bias_variable([hiddenDim,1])
W = weight_variable([hiddenDim, 1])
b2 = bias_variable([2,1])
W2 = weight_variable([2, hiddenDim])
hidden = tf.nn.sigmoid(tf.matmul(W, x) + b)
y = tf.transpose(tf.matmul(W2, hidden) + b2)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, yTrain))
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.2, step, 1, 0.9999)
optimizer = tf.train.AdamOptimizer(rate)
train = optimizer.minimize(loss, global_step=step)
predict_op = tf.argmax(y, 1)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
for i in range(50001):
    sess.run(train, feed_dict={x: xTrain})
    if i % 200 == 0:
        # proportion of correct predictions
        print i, np.mean(np.argmax(yTrain, axis=1) ==
                         sess.run(predict_op, feed_dict={x: xTrain}))
import matplotlib.pyplot as plt
ys = tf.nn.softmax(y).eval({x:xTrain}, sess)
plt.plot(xTrain[0],ys[:,0])
plt.plot(xTrain[0],distribution)
plt.plot(xTrain[0], yTrain[:,0], 'ro')
plt.show()
Here are two typical results:
Questions:
What is the difference between doing tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, yTrain)) and applying softmax manually with minimizing cross entropy?
It is typical for the model not to snap to the last period of the distribution. I've had it do so successfully only once. Perhaps it will be fixed by doing more training runs, but it doesn't look like it as the results often stabilise for the last ~20k runs. Would it most likely be improved by better selection of the optimising algorithm, by more hidden layers, or by more dimensions of the hidden layer? (partially answered by Edit)
The aberrations close to x=0 are typical. What causes them?
Edit: The fit has improved a lot by doing
hiddenDim = 15
(...)
optimizer = tf.train.AdagradOptimizer(0.5)
and changing the activations to tanh from sigmoids.
Further questions:
Is it typical that a higher hidden dimension makes breaking out of local minima easier?
What is the approximate typical relation between the optimal dimension of the hidden layers and the dimension of the inputs, dim(hidden) = f(dim(input))? Linear, weaker than linear, or stronger than linear?
It's over-fitting on the left and under-fitting on the right.
Because of the small random biases, your hidden units all get near-zero activation near x=0, and because of the asymmetry and large range of the x values, most of the hidden units are saturated out around x = 10.
The gradients can't flow through saturated units, so they all get used up to overfit the values they can feel, near zero.
I think centering the data on x=0 will help.
Try reducing the weight-initialization-variance, and/or increasing the bias-initialization-variance (or equivalently, reducing the range of the data to a smaller region, like [-1,1]).
You would get the same problem if you used RBFs and initialized them all near zero. With the linear-sigmoid units, the second layer is using pairs of linear-sigmoids to make RBFs.
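As a concrete example of the last suggestion, rescaling the inputs to [-1, 1] before feeding them to the network could look like this (a sketch using the same xTrain as in the question):

import numpy as np

xTrain = np.linspace(0, 4 * np.pi, 2001).reshape(1, -1)
x_min, x_max = xTrain.min(), xTrain.max()
xScaled = 2.0 * (xTrain - x_min) / (x_max - x_min) - 1.0   # now centered on 0, in [-1, 1]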
