I am reading about backpropagation in deep neural networks and, as I understand it, the algorithm for this type of network can be summarized as follows:
1- Input x: Set the corresponding activation for the input layer
2- Feedforward: calculate the error of the forward-propagation
3- Output error: Calculate the output error
4- Backpropagate the error: calculate the error of the back-propagation
5- Output: Using the gradient of the cost function
That's fine, but I then checked many code samples for this type of deep network. Below is an example with explanation:
### imports
import tensorflow as tf
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant( x , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    for step in range(500) :
        sess.run(train)
    results = sess.run([m1,b1,m2,b2,y_out,loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label,result in zip(labels,results) :
        print("")
        print(label)
        print(result)
print("")
My question: the above code calculates the forward-propagation error, but I don't see any step that calculates the back-propagation error. In other words, following the description above, I can see steps 1 (Input x), 2 (Feedforward), 3 (Output error) and 5 (Output), but step 4 (Backpropagate the error) does not appear in the code. Is that right, or is something missing from the code? All the code I found online follows the same steps for backpropagation in deep neural networks.
Could you please describe how the backpropagation step happens in this code, or what I should add to perform that step?
Thank you
In simple terms, when you build the TF graph up to the point you are computing the loss in your code, TF will know on which tf.Variable (weights) the loss depends. Then, when you create the node train = tf.train.GradientDescentOptimizer(1.0).minimize(loss), and later run it in a tf.Session, the backpropagation is done for you in the background. To be more specific, the train = tf.train.GradientDescentOptimizer(1.0).minimize(loss) merges the following steps:
# 1. Create a GD optimizer with a learning rate of 1.0
optimizer = tf.train.GradientDescentOptimizer(1.0)
# 2. Compute the gradients for each of the variables (weights) with respect to the loss
gradients, variables = zip(*optimizer.compute_gradients(loss))
# 3. Update the variables (weights) based on the computed gradients
train = optimizer.apply_gradients(zip(gradients, variables))
In particular, steps 2 and 3 together cover the backpropagation step: compute_gradients derives the gradients and apply_gradients uses them to update the weights. Hope that this makes things clearer for you!
Besides, I want to restructure the steps in your question:
1. Input X: The input of the neural network.
2. Forward pass: Propagating the input through the neural network in order to get the output. In other words, multiplying the input X with each of the weight tf.Variable in your code, adding the biases and applying the activations.
3. Loss: The mismatch between the output obtained in step 2 and the expected output.
4. Computing the gradients: Computing the gradient of the loss with respect to each tf.Variable (weight).
5. Updating the weights: Updating each tf.Variable (weight) according to its corresponding gradient.
Please note that steps 4 and 5 encapsulate backpropagation.
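To make the five steps above concrete, here is a minimal sketch of the same 2-3-1 sigmoid network written without TensorFlow, so the gradient computation that minimize() hides is spelled out by hand. It is only an illustration under my own assumptions: the helper names (forward, gradients, apply_gradients), the seed, the learning rate of 0.5 and the 200 steps are not from the original code.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 1: inputs and expected outputs (same toy data as the question)
X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
Y = [0.0, 0.0, 1.0, 1.0]

random.seed(0)  # hypothetical seed, only for reproducibility
N_IN, N_HID = 2, 3
W1 = [[random.uniform(0.1, 0.9) for _ in range(N_HID)] for _ in range(N_IN)]
b1 = [random.uniform(0.1, 0.9) for _ in range(N_HID)]
W2 = [random.uniform(0.1, 0.9) for _ in range(N_HID)]
b2 = [random.uniform(0.1, 0.9)]  # one-element list so it can be updated in place

def forward(x):
    # Step 2: forward pass through the hidden layer and the output neuron
    h = [sigmoid(sum(x[i] * W1[i][j] for i in range(N_IN)) + b1[j])
         for j in range(N_HID)]
    out = sigmoid(sum(h[j] * W2[j] for j in range(N_HID)) + b2[0])
    return h, out

def total_loss():
    # Step 3: sum of squared errors over all examples
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(X, Y))

def gradients():
    # Step 4: chain rule, from the loss back to every weight and bias
    gW1 = [[0.0] * N_HID for _ in range(N_IN)]
    gb1 = [0.0] * N_HID
    gW2 = [0.0] * N_HID
    gb2 = [0.0]
    for x, y in zip(X, Y):
        h, out = forward(x)
        d_out = 2.0 * (out - y) * out * (1.0 - out)  # dLoss/d(output pre-activation)
        gb2[0] += d_out
        for j in range(N_HID):
            gW2[j] += d_out * h[j]
            d_hid = d_out * W2[j] * h[j] * (1.0 - h[j])  # error pushed back to hidden unit j
            gb1[j] += d_hid
            for i in range(N_IN):
                gW1[i][j] += d_hid * x[i]
    return gW1, gb1, gW2, gb2

def apply_gradients(grads, lr):
    # Step 5: plain gradient descent update
    gW1, gb1, gW2, gb2 = grads
    b2[0] -= lr * gb2[0]
    for j in range(N_HID):
        W2[j] -= lr * gW2[j]
        b1[j] -= lr * gb1[j]
        for i in range(N_IN):
            W1[i][j] -= lr * gW1[i][j]

initial_loss = total_loss()
for _ in range(200):
    apply_gradients(gradients(), lr=0.5)
final_loss = total_loss()
print(initial_loss, final_loss)
```

In the TensorFlow version, gradients() corresponds roughly to what compute_gradients builds from the graph, and apply_gradients() to the update op that sess.run(train) executes.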
I am inspecting the output of the BatchNormalization layer in Keras.
My model is:
#Import libraries
import numpy as np
import keras
from keras import layers
from keras.layers import Input, Dense, Activation, BatchNormalization, Flatten, Conv2D
from keras.models import Model
#Model
def HappyModel3(input_shape):
    X_input = Input(input_shape, name='input_layer')
    X = BatchNormalization(axis = 1, name = 'batchnorm_layer')(X_input)
    X = Dense(1, activation='sigmoid', name='sigmoid_layer')(X)
    model = Model(inputs = X_input, outputs = X, name='HappyModel3')
    return model
Compiling and fitting the model (here the number of epochs is 1):
X_train=np.array([[1,1,-1],[2,1,1]])
Y_train=np.array([0,1])
happyModel_1=HappyModel3(X_train[0].shape)
happyModel_1.compile(optimizer=keras.optimizers.RMSprop(), loss=keras.losses.mean_squared_error)
happyModel_1.fit(x = X_train, y = Y_train, epochs = 1 , batch_size = 2, verbose=0 )
finding Batch Normalisation layer's output for model with epochs=1:
for i in range(0, len(happyModel_1.layers)):
    tmp_model = Model(happyModel_1.layers[0].input, happyModel_1.layers[i].output)
    tmp_output = tmp_model.predict(X_train)
    if i in (0,1) :
        print(happyModel_1.layers[i].name)
        print(tmp_output.shape)
        print(tmp_output)
        print('\n')
Code Output is:
input_layer
(2, 3)
[[ 1. 1. -1.]
[ 2. 1. 1.]]
batchnorm_layer
(2, 3)
[[ 0.99003249 0.99388224 -0.99551398]
[ 1.99647105 0.99388224 0.9971655 ]]
We've normalized at axis=1.
Batch norm layer output: along axis=1, the 1st dimension's mean is about 1.5, the 2nd dimension's mean is about 1 and the 3rd dimension's mean is about 0.
Since it's batch norm, I expected the mean to be close to 0 for all 3 dimensions.
This is what happens when I increase epochs to 1000:
happyModel_2=HappyModel3(X_train[0].shape)
happyModel_2.compile(optimizer=keras.optimizers.RMSprop(), loss=keras.losses.mean_squared_error)
happyModel_2.fit(x = X_train, y = Y_train, epochs = 1000 , batch_size = 2, verbose=0 )
finding Batch Normalisation layer's output for model with epochs=1000:
for i in range(0, len(happyModel_2.layers)):
    tmp_model = Model(happyModel_2.layers[0].input, happyModel_2.layers[i].output)
    tmp_output = tmp_model.predict(X_train)
    if i in (0,1) :
        print(happyModel_2.layers[i].name)
        print(tmp_output.shape)
        print(tmp_output)
        print('\n')
#Code output
input_layer
(2, 3)
[[ 1. 1. -1.]
[ 2. 1. 1.]]
batchnorm_layer
(2, 3)
[[ -1.95576239e+00 8.08715820e-04 -1.86621261e+00]
[ 1.95795488e+00 8.08715820e-04 1.86590290e+00]]
We've normalized at axis=1. Now the batch norm layer output along axis=1 has a mean of about 0 in the 1st, 2nd and 3rd dimensions. This is the expected output.
My question is: Is the output of Batch Normalization in Keras dependent on the number of epochs?
(Probably yes: since we do backpropagation, the batch normalization parameters will be affected by increasing the number of epochs.)
The keras documentation for BatchNormalization gives an answer to your question:
Importantly, batch normalization works differently during training and
during inference.
What happens during training, i.e. when calling model.fit()?
During training [...], the layer normalizes its output
using the mean and standard deviation of the current batch of inputs.
But what happens during inference, i.e. when calling model.predict() as in your examples?
During inference [...], the layer normalizes its output using a moving average of
the mean and standard deviation of the batches it has seen during
training. That is to say, it returns (batch - self.moving_mean) / (self.moving_var + epsilon) * gamma + beta.
self.moving_mean and self.moving_var are non-trainable variables that
are updated each time the layer is called in training mode [...].
It's important to understand that batch normalization estimates the statistics (mean and variance) of your whole training data during training: it looks at the statistics of each single batch and internally updates the moving_mean and moving_variance parameters with a running average computed from those batch statistics. These two parameters are therefore not affected by backpropagation. Ideally, after your model has seen enough training examples (or has trained for enough epochs), moving_mean and moving_variance will correspond to the statistics of your whole training set. They are then used during inference to normalize test examples. At the start of training the two parameters are initialized to 0 and 1, respectively. Furthermore, batch norm has two more parameters called gamma and beta, which are updated by the optimizer and therefore depend on your loss.
In essence, yes: the output of batch normalization during inference depends on the number of epochs you have trained your model for. Firstly, because the moving averages for mean and variance change, and secondly, because of the learned parameters gamma and beta.
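The effect of the moving averages alone can be sketched in plain Python. This is a simplified illustration, not Keras' implementation: it assumes the Keras defaults momentum=0.99 and epsilon=0.001, keeps gamma=1 and beta=0 fixed (so it will not reproduce the exact numbers in the question, where gamma and beta were also trained), and the function names are mine.

```python
import math

MOMENTUM = 0.99  # assumed: Keras default for BatchNormalization
EPSILON = 1e-3   # assumed: Keras default for BatchNormalization

batch = [[1.0, 1.0, -1.0], [2.0, 1.0, 1.0]]  # X_train from the question
N_FEATURES = 3

def batch_stats(rows):
    # Per-feature mean and (biased) variance over the batch dimension
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(N_FEATURES)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n
                 for j in range(N_FEATURES)]
    return means, variances

def run_training_updates(steps):
    # moving_mean starts at 0 and moving_var at 1; each training step
    # nudges them towards the statistics of the current batch
    moving_mean = [0.0] * N_FEATURES
    moving_var = [1.0] * N_FEATURES
    mean, var = batch_stats(batch)
    for _ in range(steps):
        moving_mean = [MOMENTUM * m + (1 - MOMENTUM) * b
                       for m, b in zip(moving_mean, mean)]
        moving_var = [MOMENTUM * v + (1 - MOMENTUM) * b
                      for v, b in zip(moving_var, var)]
    return moving_mean, moving_var

def normalize_inference(rows, moving_mean, moving_var):
    # Inference-mode normalization with gamma=1 and beta=0
    return [[(r[j] - moving_mean[j]) / math.sqrt(moving_var[j] + EPSILON)
             for j in range(N_FEATURES)] for r in rows]

for steps in (1, 1000):
    mm, mv = run_training_updates(steps)
    print(steps, mm, normalize_inference(batch, mm, mv))
```

After one update the moving mean has barely moved away from 0, so predict() returns values close to the raw inputs; after 1000 updates it is close to the batch mean and the per-feature means of the output are close to 0, which matches the pattern seen in the question.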
For a deeper understanding of how batch normalization works and why it is needed, have a look at the original publication.
I want to create a network where in the input layer nodes are just connected to some nodes in the next layer. Here is a small example:
My solution so far is to set the weight of the edge between i1 and h1 to zero and, after every optimization step, multiply the weights element-wise by a matrix (I call it the mask matrix) in which every entry is 1 except the entry for the weight of the edge between i1 and h1, which is 0.
(See code below)
Is this approach right? Or does it have an effect on the gradient descent? Is there another approach to create this kind of network in TensorFlow?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
tf.enable_eager_execution()
model = tf.keras.Sequential([
tf.keras.layers.Dense(2, activation=tf.sigmoid, input_shape=(2,)), # input shape required
tf.keras.layers.Dense(2, activation=tf.sigmoid)
])
#set the weights
weights=[np.array([[0, 0.25],[0.2,0.3]]),np.array([0.35,0.35]),np.array([[0.4,0.5],[0.45, 0.55]]),np.array([0.6,0.6])]
model.set_weights(weights)
model.get_weights()
features = tf.convert_to_tensor([[0.05,0.10 ]])
labels = tf.convert_to_tensor([[0.01,0.99 ]])
mask =np.array([[0, 1],[1,1]])
#define the loss function
def loss(model, x, y):
    y_ = model(x)
    return tf.losses.mean_squared_error(labels=y, predictions=y_)
#define the gradient calculation
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
#create optimizer and global step
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
global_step = tf.train.get_or_create_global_step()
#optimization step
loss_value, grads = grad(model, features, labels)
optimizer.apply_gradients(zip(grads, model.variables),global_step)
#masking the optimized weights
weights=(model.get_weights())[0]
masked_weights=tf.multiply(weights,mask)
model.set_weights([masked_weights])
If you are looking for a solution for the specific example you provided, you can simply use tf.keras Functional API and define two Dense layers where one is connected to both neurons in the previous layer and the other one is only connected to one of the neurons:
from tensorflow.keras.layers import Input, Lambda, Dense, concatenate
from tensorflow.keras.models import Model
inp = Input(shape=(2,))
inp2 = Lambda(lambda x: x[:,1:2])(inp) # get the second neuron
h1_out = Dense(1, activation='sigmoid')(inp2) # only connected to the second neuron
h2_out = Dense(1, activation='sigmoid')(inp) # connected to both neurons
h_out = concatenate([h1_out, h2_out])
out = Dense(2, activation='sigmoid')(h_out)
model = Model(inp, out)
# simply train it using `fit`
model.fit(...)
The problem with your solution, and with some of the other answers to this question, is that they do not prevent training of this weight. They allow gradient descent to train the non-existent weight and then overwrite it retrospectively. This results in a network that has a zero in this location, as desired, but it negatively affects your training process: the backpropagation calculation does not see the masking step, because it is not part of the TensorFlow graph, so gradient descent follows a path that assumes this weight has an effect on the outcome (it does not).
A better solution is to include the masking step as part of your TensorFlow graph, so that it can be factored into the gradient descent. Since the masking step is simply an element-wise multiplication by your sparse, binary mask matrix, you can include it in the graph definition using tf.multiply.
Sadly, this means saying goodbye to the user-friendly keras.layers methods and embracing a more nuts-and-bolts approach to TensorFlow. I can't see an obvious way to do it using the layers API.
See the implementation below; I have tried to provide comments explaining what is happening at each stage.
import tensorflow as tf
## Graph definition for model
# set up tf.placeholders for inputs x, and outputs y_
# these remain fixed during training and can have values fed to them during the session
with tf.name_scope("Placeholders"):
    x = tf.placeholder(tf.float32, shape=[None, 2], name="x") # input layer
    y_ = tf.placeholder(tf.float32, shape=[None, 2], name="y_") # output layer
# set up tf.Variables for the weights at each layer from l1 to l3, and set up feeding of initial values
# also set up mask as a variable and set it to be un-trainable
with tf.name_scope("Variables"):
    w_l1_values = [[0, 0.25],[0.2,0.3]]
    w_l1 = tf.Variable(w_l1_values, name="w_l1")
    w_l2_values = [[0.4,0.5],[0.45, 0.55]]
    w_l2 = tf.Variable(w_l2_values, name="w_l2")
    mask_values = [[0., 1.], [1., 1.]]
    mask = tf.Variable(mask_values, trainable=False, name="mask")
# link each set of weights as matrix multiplications in the graph. Include an elementwise multiplication by mask.
# Sequence takes us from inputs x to output final_out, which will be compared to labels fed to placeholder y_
l1_out = tf.nn.relu(tf.matmul(x, tf.multiply(w_l1, mask)), name="l1_out")
final_out = tf.nn.relu(tf.matmul(l1_out, w_l2), name="output")
## define loss function and training operation
with tf.name_scope("Loss"):
    # some loss defined as a function of graph output: final_out and labels: y_
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=final_out, labels=y_, name="loss")
with tf.name_scope("Train"):
    # some optimisation strategy, arbitrary learning rate
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name="optimizer_adam")
    train_op = optimizer.minimize(loss, name="train_op")
# create session, initialise variables and train according to inputs and corresponding labels
# This should show that the values of the first layer weights change, but the one set to 0 remains at 0
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
    print(initial_l1_weights.eval())
    inputs = [[0.05, 0.10]]
    labels = [[0.01, 0.99]]
    ans = sess.run(train_op, feed_dict={"Placeholders/x:0": inputs, "Placeholders/y_:0": labels})
    train_steps = 1
    for i in range(train_steps):
        initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
        print(initial_l1_weights.eval())
Or use the answer provided by today for a keras friendly option.
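Why putting the mask inside the forward pass matters can be checked by hand. The sketch below is a simplified stand-in, not the graph above: it assumes a single linear layer with a squared-error loss and no activation, and the function names are hypothetical. By the chain rule, the gradient with respect to each raw weight is multiplied by the corresponding mask entry, so the masked weight receives a gradient of exactly zero.

```python
x = [0.05, 0.10]                  # features from the question
target = [0.01, 0.99]             # labels from the question
W = [[0.0, 0.25], [0.2, 0.3]]     # first-layer weights from the question
mask = [[0.0, 1.0], [1.0, 1.0]]   # 0 marks the missing i1 -> h1 edge

def outputs(W_raw, use_mask):
    # y_j = sum_i x_i * W_eff[i][j], where W_eff is W_raw, optionally masked
    return [sum(x[i] * W_raw[i][j] * (mask[i][j] if use_mask else 1.0)
                for i in range(2)) for j in range(2)]

def loss(W_raw, use_mask=True):
    y = outputs(W_raw, use_mask)
    return sum((y[j] - target[j]) ** 2 for j in range(2))

def grad(W_raw, use_mask=True):
    # dLoss/dW_raw[i][j] = mask[i][j] * x[i] * 2 * (y[j] - target[j])
    y = outputs(W_raw, use_mask)
    return [[(mask[i][j] if use_mask else 1.0) * x[i] * 2.0 * (y[j] - target[j])
             for j in range(2)] for i in range(2)]

masked_grad = grad(W, use_mask=True)      # mask is part of the forward pass
unmasked_grad = grad(W, use_mask=False)   # mask only applied after the update
print(masked_grad[0][0], unmasked_grad[0][0])
```

With the mask outside the forward pass, the entry for the missing edge still gets a non-zero gradient even though its current value is 0, which is the situation the answer above describes.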
You have multiple options here.
First, you could use the dynamic masking approach in your example. I believe this will work as expected since the gradients w.r.t. the masked-out parameters will be zero (the output is constant when you change the unused parameters). This approach is simple and it can be used even when your mask is not constant during the training.
Second, if you know beforehand which weights will be always zero, you can compose your weight matrix using tf.get_variable to get a submatrix, and then concatenate it with a tf.constant tensor, e.g.:
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)
This example makes one column of your weight matrix always zero.
Finally, if your mask is more complex, you can use tf.get_variable on a flattened vector and then compose a tf.SparseTensor with the variable values on the used indices:
weights_used = tf.get_variable("w", [num_used_vars])
indices = ... # get your indices in a 2-D matrix of shape [num_used_vars, 2]
dense_shape = tf.constant([dim_in, dim_out]) # this is the final shape of the weight matrix
weights = tf.SparseTensor(indices, weights_used, dense_shape)
EDIT: This probably won't work in combination with Keras' set_weights method, as it expects Numpy arrays, not Tensors.
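The idea behind the last option, storing only the used weights and scattering them into the full matrix, can be illustrated without TensorFlow. The helper below is a hypothetical plain-Python analogue (compose_weights is my name, not a TensorFlow function); it only shows how a flat vector of trainable values and a list of indices define the dense weight matrix.

```python
def compose_weights(used_values, indices, shape):
    """Scatter the trainable values into a dense matrix; all other entries stay 0."""
    rows, cols = shape
    dense = [[0.0] * cols for _ in range(rows)]
    for value, (i, j) in zip(used_values, indices):
        dense[i][j] = value
    return dense

# The three existing edges from the question; (0, 0) is the missing i1 -> h1 edge
used_values = [0.25, 0.2, 0.3]
indices = [(0, 1), (1, 0), (1, 1)]
weights = compose_weights(used_values, indices, shape=(2, 2))
print(weights)
```

Because the missing entry never exists as a trainable value, there is nothing for the optimizer to update at that position.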
If an assign operation is applied to a weight tensor after that weight tensor is used in its portion of the forward pass of a network, does TensorFlow's backpropagation take into account the assign operation when determining the gradient for that weight? For example, if I have
weights = tf.Variable(...)
bias = tf.Variable(...)
output = tf.tanh(tf.matmul(weights, input) + bias)
weight_assign_op = weights.assign(weights + 1.0)
with tf.control_dependencies([weight_assign_op]):
    output2 = tf.identity(output)
the output is calculated, and then a change is made to the weights. If the output is then used to calculate a loss and gradients to update the variables, will the gradients be created taking into account the change to weights? That is, will the gradients for weights be the correct gradients for old_weights + 1.0 or will they still be the gradients for old_weights which when applied to the new weights won't necessarily be "correct" gradients for gradient descent?
I ended up testing it experimentally. The gradient calculation does take the assign op into account. I used the code below to test. Running it as is results in a positive gradient. Commenting out the weight assign op line and the control dependency lines results in a negative gradient. This is because the gradient is evaluated either at the original starting weight of 0.0 or at the updated weight of 2.0 after the assign.
import tensorflow as tf
data = [[1.0], [2.0], [3.0]]
labels = [[1.0], [2.1], [2.9]]
input_data = tf.placeholder(dtype=tf.float32, shape=[3, 1])
input_labels = tf.placeholder(dtype=tf.float32, shape=[3, 1])
weights = tf.Variable(tf.constant([0.0]))
bias = tf.Variable(tf.constant([0.0]))
output = (weights * input_data) + bias
weight_assign_op = weights.assign(tf.constant([2.0]))
with tf.control_dependencies([weight_assign_op]):
    output = tf.identity(output)
loss = tf.reduce_sum(tf.norm(output - input_labels))
weight_gradient = tf.gradients(loss, weights)
initialize_op = tf.global_variables_initializer()
session = tf.Session()
session.run([initialize_op])
weight_gradient_value = session.run([weight_gradient], feed_dict={input_data: data, input_labels: labels})
print(weight_gradient_value)
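The sign argument can be checked by hand, independently of TensorFlow. For loss = ||w * x + b - labels||, the derivative with respect to w is sum(r * x) / ||r|| with r = w * x + b - labels. The small sketch below (the function name is mine) evaluates it at the two candidate weights using the same data:

```python
import math

data = [1.0, 2.0, 3.0]
labels = [1.0, 2.1, 2.9]

def weight_gradient(w, b=0.0):
    # d/dw of the Euclidean norm of the residuals r = w * x + b - label
    residuals = [w * x + b - y for x, y in zip(data, labels)]
    norm = math.sqrt(sum(r * r for r in residuals))
    return sum(r * x for r, x in zip(residuals, data)) / norm

print(weight_gradient(0.0))  # negative: the loss falls as w increases from 0
print(weight_gradient(2.0))  # positive: the loss rises as w increases from 2
```

So a positive gradient from the TensorFlow test is consistent with the gradient being evaluated at the assigned weight of 2.0, and a negative one with the original weight of 0.0.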
I want to train a neural network with 12 inputs and 2 outputs. Here I have a simple TensorFlow neural network that has two outputs. When I run the code it consistently gives the same output. That is, if the two outputs are labeled 'l1' and 'l2', the model always chooses 'l1'. Is this a problem with my input (that it doesn't vary enough between 'l1' and 'l2'), or is it a problem with choosing to use just two outputs? That is my question. If it's the latter, what do I do to remedy this? My model is supposed to detect skin tones in a photo ('l1' = skin tone, 'l2' = not skin tone). I'm not sure this makes sense. The code is adapted from the MNIST example, but that code has ten outputs.
def nn_setup(self):
    input_num = 4 * 3
    mid_num = 3
    output_num = 2
    x = tf.placeholder(tf.float32, [None, input_num])
    W_1 = tf.Variable(tf.zeros([input_num, mid_num]))
    b_1 = tf.Variable(tf.zeros([mid_num]))
    y_mid = tf.nn.softmax(tf.matmul(x,W_1) + b_1)
    W_2 = tf.Variable(tf.zeros([mid_num, output_num]))
    b_2 = tf.Variable(tf.zeros([output_num]))
    y = tf.nn.softmax(tf.matmul(y_mid, W_2) + b_2)
    y_ = tf.placeholder(tf.float32, [None, output_num])
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
    init = tf.initialize_all_variables()
    self.sess = tf.Session()
    self.sess.run(init)
    for i in range(1000):
        batch_xs, batch_ys = self.get_nn_next_train()
        self.sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    self.nn_test.images, self.nn_test.labels = self.get_nn_next_test()
    print(self.sess.run(accuracy, feed_dict={x: self.nn_test.images, y_: self.nn_test.labels}))
There are a few "odd" things with your network, such as having softmax in your middle layer.
You have two major issues I can find with your implementation.
1. Weight initialisation
W_1 = tf.Variable(tf.zeros([input_num, mid_num]))
W_2 = tf.Variable(tf.zeros([mid_num, output_num]))
This will initialise the weights to be identical. So they will have identical gradient values, and be changed at each step identically.
Effectively by doing this you have created a network with one neuron in each layer (which is then copied to create the layer matrix that you use).
Use a different initial value; it is usual to take a small random matrix like this:
W_1 = tf.Variable(tf.random_normal([input_num, mid_num], stddev=0.5))
In general you will want a smaller standard deviation the larger your layers are. You don't have to do this for biases as well, but you can if you like.
This won't fix everything with your network, but it should at least start to calculate different values from input data and train a little.
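The symmetry problem can be reproduced in a few lines of plain Python. This is a simplified sketch, not your network: it assumes two inputs, two sigmoid hidden units, one sigmoid output, no biases, a squared-error loss and three made-up training examples. With all-zero initial weights the two hidden units stay exact copies of each other; with small random weights they diverge.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: (inputs, target)
DATA = [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]

def train(W1, W2, steps=50, lr=0.5):
    """W1 is n_in x n_hid, W2 has one weight per hidden unit. Returns trained copies."""
    W1 = [row[:] for row in W1]
    W2 = W2[:]
    n_in, n_hid = len(W1), len(W2)
    for _ in range(steps):
        for x, y in DATA:
            h = [sigmoid(sum(x[i] * W1[i][j] for i in range(n_in)))
                 for j in range(n_hid)]
            out = sigmoid(sum(h[j] * W2[j] for j in range(n_hid)))
            d_out = 2.0 * (out - y) * out * (1.0 - out)
            # hidden errors are computed before W2 is changed
            d_hid = [d_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(n_hid)]
            for j in range(n_hid):
                W2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    W1[i][j] -= lr * d_hid[j] * x[i]
    return W1, W2

# All-zero initialisation: both hidden units receive identical updates
zero_W1, zero_W2 = train([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])

# Small random initialisation breaks the symmetry
random.seed(1)
rand_init_W1 = [[random.gauss(0.0, 0.5) for _ in range(2)] for _ in range(2)]
rand_init_W2 = [random.gauss(0.0, 0.5) for _ in range(2)]
rand_W1, rand_W2 = train(rand_init_W1, rand_init_W2)

print(zero_W1)  # the two columns (hidden units) are identical
print(rand_W1)  # the two columns differ
```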
2. Use of cost function
You have used this loss function incorrectly:
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(y, y_) )
. . . because softmax_cross_entropy_with_logits is designed to work with the input to softmax, not the output. So your cost function is incorrect. Instead you want to reference y_logits like this where currently you calculate y:
y_logits = tf.matmul(y_mid, W_2) + b_2
y = tf.nn.softmax(y_logits)
Then your cross-entropy would be
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(y_logits, y_) )
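To see how much passing already-softmaxed values distorts the loss, here is a small plain-Python sketch. The softmax and softmax_cross_entropy helpers are my own stand-ins for what the TensorFlow function does internally, and the logits are made-up values for a confident, correct prediction.

```python
import math

def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, one_hot):
    # Applies softmax itself, then takes the cross-entropy with the label
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [10.0, -10.0]   # hypothetical: very confident in class 0
label = [1.0, 0.0]       # and class 0 is correct

correct_loss = softmax_cross_entropy(logits, label)
double_softmax_loss = softmax_cross_entropy(softmax(logits), label)
print(correct_loss, double_softmax_loss)
```

When probabilities are fed in, the second softmax squeezes them together, so with two classes the loss can never drop below log(1 + e^-1), roughly 0.31, however good the prediction is. That leaves the optimizer with a weak and misleading training signal.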
After the hidden layer initialization, you have calculated softmax of the logits for the hidden layer: y_mid = tf.nn.softmax(tf.matmul(x,W_1) + b_1). In a classification problem, softmax should be applied to the values obtained from the output layer. Try something like: y_mid = tf.nn.relu(tf.matmul(x,W_1) + b_1) to compute the activations from the hidden layer and see if your classification improves. If that does not solve your problem, check for the population of 'l1' and 'l2' in your training data. If your training data is highly skewed towards 'l1', you will always get 'l1' as the output. You may consider minority-oversampling or undersampling techniques to resolve population imbalance problem.
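A quick way to check the class balance, and to try naive oversampling if it is skewed, is sketched below. The function name and the sample data are hypothetical, and random duplication is only the simplest form of minority oversampling.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Hypothetical skewed data set: nine 'l1' samples and a single 'l2' sample
samples = [[float(i)] * 12 for i in range(10)]
labels = ['l1'] * 9 + ['l2']
print(Counter(labels))

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))
```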
So I tried implementing the neural network from:
http://iamtrask.github.io/2015/07/12/basic-python-network/
but using TensorFlow instead. I printed out the cost function twice during training and the error appears to be getting smaller, yet all the values in the output layer are close to 1 when only two of them should be. I imagine it might be something wrong with my maths, but I'm not sure. There is no difference when I try a hidden layer or use squared error as the cost function. Here is my code:
import tensorflow as tf
import numpy as np
input_layer_size = 3
output_layer_size = 1
x = tf.placeholder(tf.float32, [None, input_layer_size]) #holds input values
y = tf.placeholder(tf.float32, [None, output_layer_size]) # holds true y values
tf.set_random_seed(1)
input_weights = tf.Variable(tf.random_normal([input_layer_size, output_layer_size]))
input_bias = tf.Variable(tf.random_normal([1, output_layer_size]))
output_layer_vals = tf.nn.sigmoid(tf.matmul(x, input_weights) + input_bias)
cross_entropy = -tf.reduce_sum(y * tf.log(output_layer_vals))
training = tf.train.AdamOptimizer(0.1).minimize(cross_entropy)
x_data = np.array(
    [[0,0,1],
     [0,1,1],
     [1,0,1],
     [1,1,1]])
y_data = np.reshape(np.array([0,0,1,1]).T, (4, 1))
with tf.Session() as ses:
    init = tf.initialize_all_variables()
    ses.run(init)
    for _ in range(1000):
        ses.run(training, feed_dict={x: x_data, y:y_data})
        if _ % 500 == 0:
            print(ses.run(output_layer_vals, feed_dict={x: x_data}))
            print(ses.run(cross_entropy, feed_dict={x: x_data, y:y_data}))
            print('\n\n')
And this is what it outputs:
[[ 0.82036656]
[ 0.96750367]
[ 0.87607527]
[ 0.97876281]]
0.21947 #first cross_entropy error
[[ 0.99937409]
[ 0.99998224]
[ 0.99992537]
[ 0.99999785]]
0.00062825 #second cross_entropy error, as you can see, it's smaller
First of all: you have no hidden layer. As far as I remember, basic perceptrons could possibly model the XOR problem, but that needed some adjustments. However, AI is just inspired by biology; it does not model real neural networks exactly. Thus, you have to build at least an MLP (multilayer perceptron), which consists of at least one input, one hidden and one output layer. The XOR problem needs at least two neurons plus bias in the hidden layer to be solved correctly (with high precision).
Additionally, your learning rate is too high. 0.1 is a very high learning rate. To put it simply: it basically means that you update/adapt your current state by 10% of the gradient in one single learning step. This lets your network quickly forget invariants it has already learned. Usually the learning rate is somewhere between 1e-2 and 1e-6, depending on your problem, network size and general architecture.
Moreover, you implemented the simplified (short) version of cross-entropy. See Wikipedia for the full version: cross-entropy. However, to avoid some edge cases, TensorFlow already has its own versions of cross-entropy, for example tf.nn.softmax_cross_entropy_with_logits.
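The difference between the short and the full form is easy to see in plain Python. In the sketch below (function names and prediction values are mine), the short form only penalises examples whose target is 1, so predicting close to 1 everywhere already gives a tiny loss. The full binary form adds a (1 - y) * log(1 - p) term that penalises wrong predictions for targets of 0.

```python
import math

def simplified_cross_entropy(targets, preds):
    # The form used in the question: -sum(y * log(p))
    return -sum(t * math.log(p) for t, p in zip(targets, preds))

def binary_cross_entropy(targets, preds):
    # Full binary form: -sum(y * log(p) + (1 - y) * log(1 - p))
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(targets, preds))

targets = [0.0, 0.0, 1.0, 1.0]             # y_data from the question
all_near_one = [0.999, 0.999, 0.999, 0.999]  # hypothetical: everything close to 1
good = [0.001, 0.001, 0.999, 0.999]          # hypothetical: correct predictions

print(simplified_cross_entropy(targets, all_near_one),
      simplified_cross_entropy(targets, good))
print(binary_cross_entropy(targets, all_near_one),
      binary_cross_entropy(targets, good))
```

The short form gives the same small value for both sets of predictions, which fits the output in the question: the loss shrinks while every output drifts towards 1. For a single sigmoid output, tf.nn.sigmoid_cross_entropy_with_logits computes the full binary form from the pre-sigmoid values.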
Finally, you should remember that the cross-entropy error is a logistic loss function that operates on the probabilities of your classes. Although your sigmoid function squashes the output layer into the interval [0, 1], this only works in your case because you have one single output neuron. As soon as you have more than one output neuron, you also need the sum of the output layer to be exactly 1.0 in order to really describe probabilities for every class on the output layer.