I'm learning about policy gradients and I'm having a hard time understanding how the gradient passes through a random operation. From here: "It is not possible to directly backpropagate through random samples. However, there are two main methods for creating surrogate functions that can be backpropagated through."
They have an example of the score function:
probs = policy_network(state)
# Note that this is equivalent to what used to be called multinomial
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
Here is an example I tried to create based on that:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal
import matplotlib.pyplot as plt
from tqdm import tqdm
softplus = torch.nn.Softplus()
class Model_RL(nn.Module):
def __init__(self):
super(Model_RL, self).__init__()
self.fc1 = nn.Linear(1, 20)
self.fc2 = nn.Linear(20, 30)
self.fc3 = nn.Linear(30, 2)
def forward(self, x):
x1 = self.fc1(x)
x = torch.relu(x1)
x2 = self.fc2(x)
x = torch.relu(x2)
x3 = softplus(self.fc3(x))
return x3, x2, x1
# basic
net_RL = Model_RL()
features = torch.tensor([1.0])
x = torch.tensor([1.0])
y = torch.tensor(3.0)
baseline = 0
baseline_lr = 0.1
epochs = 3
opt_RL = optim.Adam(net_RL.parameters(), lr=1e-3)
losses = []
xs = []
for _ in tqdm(range(epochs)):
out_RL = net_RL(x)
mu, std = out_RL[0]
dist = Normal(mu, std)
print(dist)
a = dist.sample()
log_p = dist.log_prob(a)
out = features * a
reward = -torch.square((y - out))
baseline = (1-baseline_lr)*baseline + baseline_lr*reward
loss = -(reward-baseline)*log_p
opt_RL.zero_grad()
loss.backward()
opt_RL.step()
losses.append(loss.item())
This seems to work magically fine, which again I don't understand: how does the gradient get computed if, as they say, it can't pass through the random operation (and yet somehow it does)?
Now since the gradient can't flow through the random operation I tried to replace
mu, std = out_RL[0] with mu, std = out_RL[0].detach() and that caused the error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. If the gradient doesn't pass through the random operation, I don't understand why detaching a tensor before that operation would matter.
It is indeed true that sampling is not a differentiable operation per se. However, there exist two (broad) ways to mitigate this: [1] the REINFORCE way and [2] the reparameterization way. Since your example is related to [1], I will stick to REINFORCE in my answer.
What REINFORCE does is remove the sampling operation from the computation graph entirely; the sampling still happens, but outside the graph. So, your statement
.. how the gradient passes through a random operation ..
isn't correct. The gradient does not pass through any random operation. Let's look at your example:
mu, std = out_RL[0]
dist = Normal(mu, std)
a = dist.sample()
log_p = dist.log_prob(a)
The computation of a does not create any computation graph. It is technically equivalent to plugging in some offline data from a dataset (as in supervised learning):
mu, std = out_RL[0]
dist = Normal(mu, std)
# a = dist.sample()
a = torch.tensor([1.23, 4.01, -1.2, ...], device='cuda')
log_p = dist.log_prob(a)
Since we don't have offline data beforehand, we create it on the fly, and the .sample() method does merely that.
So, there is no random operation on the graph. log_p depends on mu and std deterministically, just as in any standard computation graph. If you cut that connection like this
mu, std = out_RL[0].detach()
.. of course it is going to complain.
Also, do not get confused by this operation
dist = Normal(mu, std)
log_p = dist.log_prob(a)
as it does not contain any randomness by itself. It is merely a shortcut for writing out the tedious log-likelihood formula of the Normal distribution.
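To make that concrete, here is a small self-contained check (my own addition, not from the original post): log_prob is just the closed-form Normal log-density evaluated at the sampled value, so gradients flow into mu and std but never through the sample itself.
import math
import torch
from torch.distributions import Normal
mu = torch.tensor(0.5, requires_grad=True)
std = torch.tensor(1.2, requires_grad=True)
dist = Normal(mu, std)
a = dist.sample()  # a plain tensor with no grad_fn
# the Normal log-density written out by hand
manual = -((a - mu) ** 2) / (2 * std ** 2) - torch.log(std) - 0.5 * math.log(2 * math.pi)
print(torch.allclose(manual, dist.log_prob(a)))  # True
print(a.requires_grad)                           # False: the sample is a constant
dist.log_prob(a).backward()
print(mu.grad, std.grad)                         # gradients exist for mu and std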
@ayandas explained the first way very well.
The second way, the reparameterization method, is quite different.
In contrast to sample(), reparameterization via rsample() returns a sample that keeps the computation graph intact.
It does this by writing the sample as a deterministic function of the distribution's parameters plus independent random noise (for a Normal: mu + std * eps with eps drawn from N(0, 1)), so the parameters stay in the graph.
Check this explanation with the simple code.
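As a minimal sketch of what that looks like (my own example, assuming only standard torch.distributions behaviour), rsample() keeps mu and std on the graph:
import torch
from torch.distributions import Normal
mu = torch.tensor(0.5, requires_grad=True)
std = torch.tensor(1.2, requires_grad=True)
a = Normal(mu, std).rsample()   # internally a = mu + std * eps, eps ~ N(0, 1)
print(a.requires_grad)          # True: the sample stays on the graph
loss = (a - 3.0) ** 2           # any downstream loss
loss.backward()
print(mu.grad, std.grad)        # gradients flow back through the sample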
I'm trying to train a resnet18 model in PyTorch (+ pytorch-lightning) using Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (i.e. the cross-entropy loss of the model) with respect to the tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
x, y = train_batch
unlabeled_idx = y is None
d = torch.rand(x.shape).to(x.device)
d = d/(torch.norm(d) + 1e-8)
pred_y = self.classifier(x)
y[unlabeled_idx] = pred_y[unlabeled_idx]
l = self.criterion(pred_y, y)
R_adv = torch.zeros_like(x)
for _ in range(self.ip):
r = self.xi * d
r.requires_grad = True
pred_hat = self.classifier(x + r)
# pred_hat = F.log_softmax(pred_hat, dim=1)
D = self.criterion(pred_hat, pred_y)
self.classifier.zero_grad()
D.requires_grad=True
D.backward()
R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
R_adv /= 32
loss = l + R_adv * self.a
loss.backward()
self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback excluded because this error is not helpful and technically "solved" as I know the cause for it, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True and I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?
In Lightning, .backward() and the optimizer step are handled under the hood. If you call them yourself like in the code above, it will interfere with Lightning, because it doesn't know you called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
super().__init__()
# put this in your init
self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling optimizer step + zero grad yourself. Don't forget to add that in your code above. You can access the optimizer and scheduler like so in your training step:
def training_step(self, batch, batch_idx):
optimizer = self.optimizers()
scheduler = self.lr_schedulers()
# do your training step
# don't forget to call:
# 1) backward 2) optimizer step 3) zero grad
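Putting that together, a minimal sketch of what the manual step could look like (self.classifier and self.criterion are taken from your snippet; treat this as an outline rather than the exact solution):
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    x, y = batch
    pred_y = self.classifier(x)
    loss = self.criterion(pred_y, y)
    opt.zero_grad()
    self.manual_backward(loss)  # use this instead of loss.backward() in Lightning
    opt.step()
    return loss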
Read more about manual optimization here.
I'm trying to learn some PyTorch and am referencing this discussion here
The author provides a minimum working piece of code that illustrates how you can use PyTorch to solve for an unknown linear function that has been polluted with random noise.
This code runs fine for me.
However, when I change the function so that the target is t = X^2, the parameters do not seem to converge.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
# Let's make some data for a linear regression.
A = 3.1415926
b = 2.7189351
error = 0.1
N = 100 # number of data points
# Data
X = Variable(torch.randn(N, 1))
# (noisy) Target values that we want to learn.
t = X * X + Variable(torch.randn(N, 1) * error)
# Creating a model, making the optimizer, defining loss
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
# Run training
niter = 50
for _ in range(0, niter):
optimizer.zero_grad()
predictions = model(X)
loss = loss_fn(predictions, t)
loss.backward()
optimizer.step()
print("-" * 50)
print("error = {}".format(loss.data[0]))
print("learned A = {}".format(list(model.parameters())[0].data[0, 0]))
print("learned b = {}".format(list(model.parameters())[1].data[0]))
When I execute this code, the learned A and b parameters come out seemingly random, so it does not converge. I thought this should converge, because you can approximate any function with a slope and an offset. My theory is that I'm using PyTorch incorrectly.
Can anyone identify a problem with my t = X * X + Variable(torch.randn(N, 1) * error) line of code?
You cannot fit a 2nd-degree polynomial with a linear function, so you cannot expect the learned parameters to be anything but essentially random (your targets are noisy samples from that polynomial).
What you can do is use two inputs, x and x^2, and fit a linear model on those:
model = nn.Linear(2, 1) # you have 2 inputs now
X_input = torch.cat((X, X**2), dim=1) # have 2 inputs per entry
# ...
predictions = model(X_input) # 2 inputs -> 1 output
loss = loss_fn(predictions, t)
# ...
# learning t = c*x^2 + a*x + b
print("learned a = {}".format(list(model.parameters())[0].data[0, 0]))
print("learned c = {}".format(list(model.parameters())[0].data[0, 1]))
print("learned b = {}".format(list(model.parameters())[1].data[0]))
I have the neural network seen below. It works, or at least appears to work, but the problem is with the training. I'm trying to train it to act as an OR gate, but it never seems to get there; the output tends to look like this:
prior to training:
[[0.50181624]
[0.50183743]
[0.50180414]
[0.50182533]]
post training:
[[0.69641759]
[0.754652 ]
[0.75447178]
[0.79431198]]
expected output:
[[0]
[1]
[1]
[1]]
I have this loss graph:
It's strange: it appears to be training, but at the same time it never quite reaches the expected output. I know it will never exactly hit the 0s and 1s, but I expect it to get noticeably closer to the expected output than this.
I had some issues figuring out how to backpropagate the error, since I wanted this network to support any number of hidden layers, so I stored the local gradient in each layer, alongside the weights, and sent the error back from the end.
The main functions I suspect are the culprits are NeuralNetwork.train and both forward methods.
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
class NeuralNetwork:
class __Layer:
def __init__(self,args):
self.__epsilon = 1e-6
self.localGrad = 0
self.__weights = np.random.randn(
args["previousLayerHeight"],
args["height"]
)*0.01
self.__biases = np.zeros(
(args["biasHeight"],1)
)
def __str__(self):
return str(self.__weights)
def forward(self,X):
a = np.dot(X, self.__weights) + self.__biases
self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
return self.__sigmoid(a)
def adjustWeights(self, err):
self.__weights -= (err * self.__epsilon)
def __sigmoid(self, z):
return 1/(1 + np.exp(-z))
def __sigmoidPrime(self, a):
return self.__sigmoid(a)*(1 - self.__sigmoid(a))
def __init__(self,args):
self.__inputDimensions = args["inputDimensions"]
self.__outputDimensions = args["outputDimensions"]
self.__hiddenDimensions = args["hiddenDimensions"]
self.__layers = []
self.__constructLayers()
def __constructLayers(self):
self.__layers.append(
self.__Layer(
{
"biasHeight": self.__inputDimensions[0],
"previousLayerHeight": self.__inputDimensions[1],
"height": self.__hiddenDimensions[0][0]
if len(self.__hiddenDimensions) > 0
else self.__outputDimensions[0]
}
)
)
for i in range(len(self.__hiddenDimensions)):
self.__layers.append(
self.__Layer(
{
"biasHeight": self.__hiddenDimensions[i + 1][0]
if i + 1 < len(self.__hiddenDimensions)
else self.__outputDimensions[0],
"previousLayerHeight": self.__hiddenDimensions[i][0],
"height": self.__hiddenDimensions[i + 1][0]
if i + 1 < len(self.__hiddenDimensions)
else self.__outputDimensions[0]
}
)
)
def forward(self,X):
out = self.__layers[0].forward(X)
for i in range(len(self.__layers) - 1):
out = self.__layers[i+1].forward(out)
return out
def train(self,X,Y,loss,epoch=5000000):
for i in range(epoch):
YHat = self.forward(X)
delta = -(Y-YHat)
loss.append(sum(Y-YHat))
err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
err.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(err)
i=0
for l in reversed(self.__layers[:-1]):
err = np.dot(l.localGrad, err)
l.adjustWeights(err)
i += 1
def printLayers(self):
print("Layers:\n")
for l in self.__layers:
print(l)
print("\n")
def main(args):
X = np.array([[x,y] for x,y in product([0,1],repeat=2)])
Y = np.array([[0],[1],[1],[1]])
nn = NeuralNetwork(
{
#(height,width)
"inputDimensions": (4,2),
"outputDimensions": (1,1),
"hiddenDimensions":[
(6,1)
]
}
)
print("input:\n\n",X,"\n")
print("expected output:\n\n",Y,"\n")
nn.printLayers()
print("prior to training:\n\n",nn.forward(X), "\n")
loss = []
nn.train(X,Y,loss)
print("post training:\n\n",nn.forward(X), "\n")
nn.printLayers()
fig,ax = plt.subplots()
x = np.array([x for x in range(5000000)])
loss = np.array(loss)
ax.plot(x,loss)
ax.set(xlabel="epoch",ylabel="loss",title="logic gate training")
plt.show()
if(__name__=="__main__"):
main(sys.argv[1:])
Could someone please point out what I'm doing wrong here? I strongly suspect it has to do with the way I'm handling the matrices, but I don't have the slightest idea what's actually going on.
Thanks for taking the time to read my question, and taking the time to respond (if relevant).
edit:
Actually quite a lot is wrong with this, but I'm still a bit confused about how to fix it. Although the loss graph looks like it's training, and it kind of is, the math I've done above is wrong.
Look at the training function.
def train(self,X,Y,loss,epoch=5000000):
for i in range(epoch):
YHat = self.forward(X)
delta = -(Y-YHat)
loss.append(sum(Y-YHat))
err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
err.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(err)
i=0
for l in reversed(self.__layers[:-1]):
err = np.dot(l.localGrad, err)
l.adjustWeights(err)
i += 1
Note how I get delta = -(Y-Yhat) and then dot product it with the "local gradient" of the last layer. The "local gradient" is the local W gradient.
def forward(self,X):
a = np.dot(X, self.__weights) + self.__biases
self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
return self.__sigmoid(a)
I'm skipping a step in the chain rule. I should really be multiplying by W * sigprime(XW + b) first, as that's the local gradient of X, and then by the local W gradient. I tried that, but I'm still getting issues. Here is the new forward method (note that the layers' __init__ needs to be extended for the new variables, and I changed the activation function to tanh):
def forward(self, X):
a = np.dot(X, self.__weights) + self.__biases
self.localPartialGrad = self.__tanhPrime(a)
self.localWGrad = np.dot(X.T, self.localPartialGrad)
self.localXGrad = np.dot(self.localPartialGrad,self.__weights.T)
return self.__tanh(a)
and updated the training method to look something like this:
def train(self, X, Y, loss, epoch=5000):
for e in range(epoch):
Yhat = self.forward(X)
err = -(Y-Yhat)
loss.append(sum(err))
print("loss:\n",sum(err))
for l in self.__layers[::-1]:
l.adjustWeights(err)
if(l != self.__layers[0]):
err = np.multiply(err,l.localPartialGrad)
err = np.multiply(err,l.localXGrad)
The new graphs I'm getting are all over the place; I have no idea what's going on. Here is the final bit of code I changed:
def adjustWeights(self, err):
perr = np.multiply(err, self.localPartialGrad)
werr = np.sum(np.dot(self.__weights,perr.T),axis=1)
werr = werr * self.__epsilon
werr.shape = (self.__weights.shape[0],1)
self.__weights = self.__weights - werr
Your network is learning, as can be seen from the loss chart, so the backprop implementation is correct (congrats!). The main problem with this particular architecture is the choice of activation function: sigmoid. Replacing sigmoid with tanh makes it work much better right away.
From this discussion on CV.SE:
There are two reasons for that choice (assuming you have normalized your data, and this is very important):
1. Having stronger gradients: since the data is centered around 0, the derivatives are higher. To see this, calculate the derivative of the tanh function and note that its values lie in [0, 1]. The range of the tanh function is [-1, 1] and that of the sigmoid function is [0, 1].
2. Avoiding bias in the gradients. This is explained very well in the paper, and it is worth reading it to understand these issues.
Though I'm sure a sigmoid-based NN can be trained as well, it looks like it is much more sensitive to the input values (note that they are not zero-centered), because the activation itself is not zero-centered. tanh is better than sigmoid in these respects, so the simpler approach is just to use that activation function.
The key change is this:
def __tanh(self, z):
return np.tanh(z)
def __tanhPrime(self, a):
return 1 - self.__tanh(a) ** 2
... instead of __sigmoid and __sigmoidPrime.
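To see the "stronger gradients" point numerically (my own check, not part of the original answer): at z = 0 the tanh derivative is 1.0 while the sigmoid derivative is only 0.25, so error signals shrink much faster through sigmoid layers.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
z = np.array([0.0, 1.0, 2.0])
sig_grad = sigmoid(z) * (1 - sigmoid(z))  # roughly [0.25, 0.20, 0.10]
tanh_grad = 1 - np.tanh(z) ** 2           # roughly [1.00, 0.42, 0.07]
print(sig_grad, tanh_grad)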
I have also tuned hyperparameters a little bit, so that the network now learns in 100k epochs, instead of 5m:
prior to training:
[[ 0. ]
[-0.00056925]
[-0.00044885]
[-0.00101794]]
post training:
[[0. ]
[0.97335842]
[0.97340917]
[0.98332273]]
The complete code is in this gist.
Well I'm an idiot. I was right about being wrong but I was wrong about how wrong I was. Let me explain.
Within the backwards training method I got the last layer trained correctly, but every layer before it wasn't trained correctly, which is why the above network was still producing a result: it was indeed training, but only one layer.
So what did I do wrong? Well, I was only multiplying by the local gradient of the weights with respect to the output, so the chain rule was only partially applied.
Let's say the loss function was this:
t = Y-X2
loss = 1/2*(t)^2
a2 = X1W2 + b
X2 = activation(a2)
a1 = X0W1 + b
X1 = activation(a1)
We know that the derivative of the loss with respect to W2 would be -(Y-X2)*X1. This was done in the first part of my training function:
def train(self,X,Y,loss,epoch=5000000):
for i in range(epoch):
#First part
YHat = self.forward(X)
delta = -(Y-YHat)
loss.append(sum(Y-YHat))
err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
err.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(err)
i=0
#Second part
for l in reversed(self.__layers[:-1]):
err = np.dot(l.localGrad, err)
l.adjustWeights(err)
i += 1
However, the second part is where I screwed up. To compute the gradient of the loss with respect to W1, I must first multiply the original error -(Y-X2) by W2, since W2 is the local X gradient of the last layer, and by the chain rule this has to happen first. Only then can I multiply by the local W gradient (the input to that layer, X0) to get the gradient with respect to W1. I failed to apply the local X gradient first, so the last layer was indeed training, but every layer before it received an error that became more and more wrong the further back it was.
To solve this I updated the train method:
def train(self,X,Y,loss,epoch=10000):
for i in range(epoch):
YHat = self.forward(X)
err = -(Y-YHat)
loss.append(sum(Y-YHat))
werr = np.sum(np.dot(self.__layers[-1].localWGrad,err.T), axis=1)
werr.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(werr)
for l in reversed(self.__layers[:-1]):
err = np.multiply(err, l.localXGrad)
werr = np.sum(np.dot(l.weights,err.T),axis=1)
l.adjustWeights(werr)
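As a quick sanity check of that ordering (my own addition, ignoring biases and using the same notation as above), the analytic gradient for W1 can be compared against a finite difference:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
rng = np.random.RandomState(0)
X0 = rng.randn(4, 2)   # network input
Y = rng.rand(4, 1)     # targets
W1 = rng.randn(2, 3)
W2 = rng.randn(3, 1)
def forward(W1, W2):
    X1 = sigmoid(X0 @ W1)  # hidden activations
    X2 = sigmoid(X1 @ W2)  # output activations
    return X1, X2, 0.5 * np.sum((Y - X2) ** 2)
X1, X2, _ = forward(W1, W2)
# chain rule in the order described above (activation derivatives included)
err = -(Y - X2) * X2 * (1 - X2)      # dLoss/da2
err = (err @ W2.T) * X1 * (1 - X1)   # local X gradient (W2) first, then dX1/da1
grad_W1 = X0.T @ err                 # finally the local W gradient (X0)
# finite-difference check on one entry of W1
eps = 1e-6
W1_eps = W1.copy()
W1_eps[0, 0] += eps
numeric = (forward(W1_eps, W2)[2] - forward(W1, W2)[2]) / eps
print(grad_W1[0, 0], numeric)        # the two numbers should agree closely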
Now the loss graph I got looks like this:
I am playing with vanilla RNNs, training with gradient descent (non-batch version), and I am having an issue with the gradient computation for the (scalar) cost; here's the relevant portion of my code:
class Rnn(object):
# ............ [skipping the trivial initialization]
def recurrence(x_t, h_tm_prev):
h_t = T.tanh(T.dot(x_t, self.W_xh) +
T.dot(h_tm_prev, self.W_hh) + self.b_h)
return h_t
h, _ = theano.scan(
recurrence,
sequences=self.input,
outputs_info=self.h0
)
y_t = T.dot(h[-1], self.W_hy) + self.b_y
self.p_y_given_x = T.nnet.softmax(y_t)
self.y_pred = T.argmax(self.p_y_given_x, axis=1)
def negative_log_likelihood(self, y):
return -T.mean(T.log(self.p_y_given_x)[:, y])
def testRnn(dataset, vocabulary, learning_rate=0.01, n_epochs=50):
# ............ [skipping the trivial initialization]
index = T.lscalar('index')
x = T.fmatrix('x')
y = T.iscalar('y')
rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
nll = rnn.negative_log_likelihood(y)
cost = T.lscalar('cost')
gparams = [T.grad(cost, param) for param in rnn.params]
updates = [(param, param - learning_rate * gparam)
for param, gparam in zip(rnn.params, gparams)
]
train_model = theano.function(
inputs=[index],
outputs=nll,
givens={
x: train_set_x[index],
y: train_set_y[index]
},
)
sgd_step = theano.function(
inputs=[cost],
outputs=[],
updates=updates
)
done_looping = False
while(epoch < n_epochs) and (not done_looping):
epoch += 1
tr_cost = 0.
for idx in xrange(n_train_examples):
tr_cost += train_model(idx)
# perform sgd step after going through the complete training set
sgd_step(tr_cost)
For certain reasons I don't want to pass the complete (training) data to train_model(..); instead I want to pass individual examples one at a time. The problem is that each call to train_model(..) returns the cost (negative log-likelihood) of that particular example, and I then have to aggregate all the costs (over the complete training set), take the derivative, and perform the relevant update to the weight parameters in sgd_step(..). With my current implementation I am, for obvious reasons, getting this error: theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W_xh. I don't understand how to make 'cost' a part of the computational graph (given that in my case I have to wait for it to be aggregated), or whether there is a better/more elegant way to achieve the same thing?
Thanks.
It turns out you cannot take the gradient with respect to a symbolic variable that is not part of the computational graph. Therefore, I had to change the way data is passed to train_model(..); passing the complete training data instead of individual examples fixed the issue.
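In other words, the cost has to stay a symbolic expression built from the parameters; it cannot be a plain scalar fed back in from outside. A structural sketch of that idea, reusing names from the question (Rnn, train_set_x, train_set_y, learning_rate), could look like this; note that it applies an update per example (plain SGD) rather than the full-batch pass described above:
import theano
import theano.tensor as T
x = T.fmatrix('x')
y = T.iscalar('y')
rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
# cost is a symbolic function of rnn.params, so T.grad can trace it
cost = rnn.negative_log_likelihood(y)
gparams = [T.grad(cost, param) for param in rnn.params]
updates = [(param, param - learning_rate * gparam)
           for param, gparam in zip(rnn.params, gparams)]
index = T.lscalar('index')
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,  # the gradient step happens inside the same compiled function
    givens={x: train_set_x[index], y: train_set_y[index]},
)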
I am trying to recover a probability distribution (not a probability density; any function with range in [0,1] where f(x) encodes the probability of success for an observation at x). I use a hidden layer with 10 neurons and softmax. Here's my code:
import tensorflow as tf
import numpy as np
import random
import math
#Make binary observations encoded as one-hot vectors.
def makeObservations(probabilities):
observations = np.zeros((len(probabilities),2), dtype='float32')
for i in range(0, len(probabilities)):
if random.random() <= probabilities[i]:
observations[i,0] = 1
observations[i,1] = 0
else:
observations[i,0] = 0
observations[i,1] = 1
return observations
xTrain = np.linspace(0, 4*math.pi, 2001).reshape(1,-1)
distribution = map(lambda x: math.sin(x)**2, xTrain[0])
yTrain = makeObservations(distribution)
def weight_variable(shape):
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)
def bias_variable(shape):
initial = tf.constant(0.1, shape=shape)
return tf.Variable(initial)
x = tf.placeholder("float", [1,None])
hiddenDim = 10
b = bias_variable([hiddenDim,1])
W = weight_variable([hiddenDim, 1])
b2 = bias_variable([2,1])
W2 = weight_variable([2, hiddenDim])
hidden = tf.nn.sigmoid(tf.matmul(W, x) + b)
y = tf.transpose(tf.matmul(W2, hidden) + b2)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, yTrain))
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.2, step, 1, 0.9999)
optimizer = tf.train.AdamOptimizer(rate)
train = optimizer.minimize(loss, global_step=step)
predict_op = tf.argmax(y, 1)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
for i in range(50001):
sess.run(train, feed_dict={x: xTrain})
if i%200 == 0:
#proportion of correct predictions
print i, np.mean(np.argmax(yTrain, axis=1) ==
sess.run(predict_op, feed_dict={x: xTrain}))
import matplotlib.pyplot as plt
ys = tf.nn.softmax(y).eval({x:xTrain}, sess)
plt.plot(xTrain[0],ys[:,0])
plt.plot(xTrain[0],distribution)
plt.plot(xTrain[0], yTrain[:,0], 'ro')
plt.show()
Here are two typical results:
Questions:
What is the difference between doing tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, yTrain)) and applying softmax manually and then minimizing the cross entropy?
It is typical for the model not to snap to the last period of the distribution. I've had it do so successfully only once. Perhaps it will be fixed by doing more training runs, but it doesn't look like it as the results often stabilise for the last ~20k runs. Would it most likely be improved by better selection of the optimising algorithm, by more hidden layers, or by more dimensions of the hidden layer? (partially answered by Edit)
The aberrations close to x=0 are typical. What causes them?
Edit: The fit has improved a lot by doing
hiddenDim = 15
(...)
optimizer = tf.train.AdagradOptimizer(0.5)
and changing the activations to tanh from sigmoids.
Further questions:
Is it typical that a higher hidden dimension makes breaking out of local minima easier?
What is the approximate typical relation between the optimal dimension of hidden layers and dimension of inputs dim(hidden) = f(dim(input))? Linear, weaker than linear or stronger than linear?
It's over-fitting on the left and under-fitting on the right.
Because of the small random biases, your hidden units all have near-zero activation near x = 0, and because of the asymmetry and large range of the x values, most of the hidden units are saturated out around x = 10.
The gradients can't flow through saturated units, so they all get used up to overfit the values they can feel, near zero.
I think centering the data on x=0 will help.
Try reducing the weight-initialization-variance, and/or increasing the bias-initialization-variance (or equivalently, reducing the range of the data to a smaller region, like [-1,1]).
You would get the same problem if you used RBFs and initialized them all near zero. With the linear-sigmoid units, the second layer is effectively using pairs of linear-sigmoids to build RBFs.
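For instance, a small change along those lines (my own suggestion of how the rescaling might look in the question's code; everything else stays the same):
import math
import numpy as np
# the original domain is [0, 4*pi]; map it linearly onto [-1, 1] so it is
# zero-centered and small enough that the sigmoids don't saturate
xTrain = np.linspace(0, 4 * math.pi, 2001).reshape(1, -1)
xTrainScaled = (xTrain - 2 * math.pi) / (2 * math.pi)
# then feed xTrainScaled wherever xTrain was used, e.g.
# sess.run(train, feed_dict={x: xTrainScaled})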