I wrote a very basic tensorflow model where I want to predict a number:
import tensorflow as tf
import numpy as np
def HW_numbers(x):
    y = (2 * x) + 1
    return y
x = np.array([1.0,2.0,3.0,4.0,5.0,6.0,7.0], dtype=float)
y = np.array(HW_numbers(x))
model = tf.keras.models.Sequential([tf.keras.layers.Dense(units=1,input_shape=[1])])
model.compile(optimizer='sgd',loss='mean_squared_error')
model.fit(x,y,epochs = 30)
print(model.predict([10.0]))
The code above works fine. But if I add an activation function to the Dense layer, the predictions become weird. I have tried 'relu', 'sigmoid', 'tanh', etc.
My question is: why is that? What exactly is the activation function doing in that single layer that messes up the prediction?
I have used Tensorflow 2.0
Currently, you are learning a linear function. Since it can be described by a single neuron, a single neuron is all you need to learn it. An activation function, on the other hand, is meant:
to learn and make sense of something really complicated and Non-linear complex functional mappings between the inputs and response variable. It introduces non-linear properties to our Network. Their main purpose is to convert an input signal of a node in an A-NN to an output signal. That output signal now is used as an input in the next layer in the stack.
Hence, as you have just a single neuron here (a special case), you do not need to pass the value on to a next layer. In other words, the input, hidden, and output layers are all merged together. The activation function is therefore not helpful in your case, unless you want to make a decision based on the output of the neuron.
Your network consists of just one neuron. With no activation function, all it does is multiply your input by the neuron's weight and add its bias. The weight will eventually converge to something close to 2, the slope of your target function.
But with relu as the activation function, only positive values are propagated through your network. So if your neuron's weight is initialized with a negative number (and the bias starts at its default of zero), your positive inputs always give zero as the output, and the gradient is zero as well, so the neuron cannot recover. With relu you therefore have roughly a 50:50 chance of getting good results.
With the activation functions tanh and sigmoid, the output of the neuron is limited to the ranges (-1, 1) and (0, 1) respectively, so your output can never be more than one.
So for such a small neural network, these activation functions don't match the problem.
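The effect described above can be checked numerically with a small NumPy sketch. It is not the Keras model itself: `neuron` is a hypothetical helper, and the weight 2.0 and bias 1.0 are assumed values that an ideal linear neuron would reach for y = 2x + 1.

```python
import numpy as np

def neuron(x, w, b, activation=None):
    """Single neuron: activation(w * x + b). Hypothetical helper, not a Keras API."""
    z = w * x + b
    if activation == "relu":
        return np.maximum(z, 0.0)
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))
    if activation == "tanh":
        return np.tanh(z)
    return z  # linear (no activation)

x = 10.0
print(neuron(x, 2.0, 1.0))             # linear neuron can reach the target 2*10 + 1
print(neuron(x, 2.0, 1.0, "sigmoid"))  # squashed, can never exceed 1
print(neuron(x, 2.0, 1.0, "tanh"))     # squashed, can never exceed 1
print(neuron(x, -0.5, 0.0, "relu"))    # negative weight, positive input: output is 0
```

Whatever weights the squashing activations learn, the target value 21 stays out of reach, which is why the predictions look wrong.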
Let's assume I have a neural network like the following:
model = keras.models.Sequential()
model.add(keras.layers.Dense(10, input_shape=(5,), activation='relu'))
model.add(keras.layers.Dense(4, activation='linear'))
It has n output neurons (4 in this case) with a linear activation function.
The training process is not important here, so we can take a look at the random weights that keras initialized using:
model.weights
Of course, in a real example, these weights should be adjusted in the training process.
Depending on these model.weights, each of the output neurons returns values in a range.
I would like to calculate this exact range.
Does keras offer any function to calculate it?
I wrote a rough piece of code to approximate it, using a loop that predicts on random inputs. But this would not be very useful in a real example with many more inputs/neurons/weights.
Here are a few examples to clarify my question (all of them assume that the input values are between 0 and 1):
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,),
                             activation='linear', use_bias=False))
model.set_weights([np.array([1, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between 0 and 2
model.set_weights([np.array([-0.5, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between -0.5 and 1
model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(2,), activation='linear', use_bias=False))
model.add(keras.layers.Dense(1, activation='linear', use_bias=False))
model.set_weights([np.array([1, 1, 1, 1]).reshape(2,2), np.array([1, 1]).reshape(2,1)])
For the previous example, the output neuron results would be between 0 and 4
These are simplified examples. In a real scenario with a much more complex network structure, activation functions, biases, and so on, these ranges are not obvious to calculate.
It sounds like you are roughly interested in what is referred to as neural network verification. This field broadly consists of answering the question: given a range of possible inputs, what is the range of possible outputs from a neural network with a given set of weights? A few things to note:
A neural network is essentially a complex, non-linear function. That is, it maps the input space to the output space. Defining an output range does not make sense except with respect to an input range. In your question you make no reference to the inputs, so your examples are flawed/incomplete.
In general, neural network verification is an emerging field with most published works being fairly recent (last 5-7 years). That being said, there are exact and approximate methods for fully connected networks with a variety of activation functions. I'll list a few such methods here:
https://arxiv.org/abs/2004.05519 - MATLAB toolbox, but you could export your neural network in ONNX format and then use MATLAB for the verification/output range analysis.
https://arxiv.org/abs/1804.10829 - specifically for ReLU activation function.
https://anwu1219.github.io/download/Marabou.pdf with python API available here: https://github.com/NeuralNetworkVerification/Marabou
The field is still evolving, so in some cases you may have to do some of the coding yourself rather than use pre-existing libraries, but these papers, or a search for "neural network verification", should at least give you some ideas of where to start.
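As a rough illustration of what such output range analysis looks like, here is a minimal NumPy sketch of interval bound propagation, one of the simplest approximate methods. It is not taken from the tools listed above. The helper names `linear_bounds` and `relu_bounds` are hypothetical, and weights are assumed to be stored as (inputs, outputs) matrices, as Keras Dense kernels are. The bounds are exact for a single linear layer, but for deeper networks they are only guaranteed to contain the true range and may be loose.

```python
import numpy as np

def linear_bounds(lower, upper, W, b=None):
    """Propagate the input box [lower, upper] through y = x @ W + b."""
    W_pos = np.maximum(W, 0.0)  # positive weights: lower input gives lower output
    W_neg = np.minimum(W, 0.0)  # negative weights: upper input gives lower output
    out_lower = lower @ W_pos + upper @ W_neg
    out_upper = upper @ W_pos + lower @ W_neg
    if b is not None:
        out_lower = out_lower + b
        out_upper = out_upper + b
    return out_lower, out_upper

def relu_bounds(lower, upper):
    """ReLU is monotonic, so it can be applied to both bounds directly."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Inputs between 0 and 1, as in the question.
lo, hi = np.zeros(2), np.ones(2)

# One layer with weights [-0.5, 1].
print(linear_bounds(lo, hi, np.array([[-0.5], [1.0]])))

# Two linear layers with all weights equal to 1.
l1, u1 = linear_bounds(lo, hi, np.ones((2, 2)))
print(linear_bounds(l1, u1, np.ones((2, 1))))
```

For the examples in the question this reproduces the ranges -0.5 to 1 and 0 to 4. For a layer with a ReLU activation, `relu_bounds` would be applied to the result of `linear_bounds` before moving on to the next layer.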
As far as I know, there is no such function to estimate the range of the output values (without imposing restrictions of your own).
For example, a dense layer without bias is just a plain linear function a = bx. In your case, you are restricting x to the range 0-1 and explicitly setting b to your desired values.
With those choices you will always get values in the ranges you've cited in your question. If, hypothetically, b were chosen randomly, the ranges in your question would no longer hold.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), activation='linear', use_bias=False))
#model.set_weights([np.array([1, 1]).reshape(2, 1)])
eval_func = keras.backend.function([model.input], model.layers[-1].output)
outputs = eval_func(np.array([[2,1]]))
counts, bins = np.histogram(outputs)
plt.hist(bins[:-1], bins, weights=counts)
The input to my custom activation function is going to be a 19 * 19 * 5 tensor, say x. The function needs to apply sigmoid to the first layer, i.e. x[:,:,0:1], and relu to the remaining layers, i.e. x[:,:,1:5]. I have defined a custom activation function with the following code:
def custom_activation(x):
    return tf.concat([tf.sigmoid(x[:,:,:,0:1]) , tf.nn.relu(x[:,:,:,1:5])],axis = 3)
get_custom_objects().update({'custom_activation': Activation(custom_activation)})
The fourth dimension comes into the picture because the input that the function custom_activation receives has the batch size as an additional dimension. So the input tensor has shape [batch_size,19,19,5].
Could someone tell me if this is the correct way to do it?
Keras Activations are designed to work on arbitrarily sized layers of almost any imaginable feed forward layer (e.g. Tanh, Relu, Softmax, etc). The transformation you describe sounds specific to a particular layer in the architecture you are using. As a result, I would recommend accomplishing the task using a Lambda Layer:
from keras.layers import Lambda
def custom_activation_shape(input_shape):
    # Ensure the input is a rank 4 tensor
    assert len(input_shape) == 4
    # Ensure the last dimension has 5 components
    assert input_shape[3] == 5
    return input_shape  # Shape is unchanged
Which can then be added to your model using
Lambda(custom_activation, output_shape=custom_activation_shape)
However, if you intend to use this transformation after many different layers in your network, and thus would truly like a custom defined Activation, see How do you create a custom activation function with Keras?, which suggests doing what you wrote in your question.
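To sanity-check what the transformation from the question should produce, here is a NumPy-only sketch of the same channel-wise logic. It deliberately avoids Keras and TensorFlow, and `custom_activation_np` is a hypothetical name. It assumes the (batch_size, 19, 19, 5) layout described in the question.

```python
import numpy as np

def custom_activation_np(x):
    """Sigmoid on channel 0 and ReLU on channels 1..4 of a (batch, 19, 19, 5) array."""
    first = 1.0 / (1.0 + np.exp(-x[..., 0:1]))  # sigmoid, keeps the channel axis
    rest = np.maximum(x[..., 1:5], 0.0)         # relu on the remaining channels
    return np.concatenate([first, rest], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 19, 19, 5))
out = custom_activation_np(x)
print(out.shape)                                # same shape as the input
print(out[..., 0].min(), out[..., 0].max())     # strictly between 0 and 1
print(out[..., 1:].min())                       # never negative
```

Slicing with `0:1` rather than `0` keeps the last axis, so the two pieces can be concatenated back into a tensor of the original shape. The TensorFlow code in the question relies on the same property.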
I'm new in Keras and Neural Networks. I'm writing a thesis and trying to create a SimpleRNN in Keras as it is illustrated below:
As it is shown in the picture, I need to create a model with 4 inputs + 2 outputs and with any number of neurons in the hidden layer.
This is my code:
model = Sequential()
model.add(SimpleRNN(4, input_shape=(1, 4), activation='sigmoid', return_sequences=True))
model.add(Dense(2))
model.compile(loss='mean_absolute_error', optimizer='adam')
model.fit(data, target, epochs=5000, batch_size=1, verbose=2)
predict = model.predict(data)
1) Does my model implement the graph?
2) Is it possible to specify connections between neurons Input and Hidden layers or Output and Input layers?
Explanation:
I am going to use backpropagation to train my network.
I have input and target values
Input is a 10*4 array and target is a 10*2 array which I then reshape:
input = input.reshape((10, 1, 4))
target = target.reshape((10, 1, 2))
It is crucial for me to be able to specify the connections between neurons, as they can be different. For instance, here is an example:
1) Not really. But I'm not sure about what exactly you want in that graph. (Let's see how Keras recurrent layers work below)
2) Yes, it's possible to connect every layer to every layer, but you can't use Sequential for that, you must use Model.
This answer may not be what you're looking for. What exactly do you want to achieve? What kind of data do you have, what output do you expect, what is the model supposed to do, etc.?
1 - How does a recurrent layer work?
Documentation
Recurrent layers in Keras work with an "input sequence" and may output a single result or a sequence of results. The recurrence is entirely contained within the layer and doesn't interact with other layers.
You should have inputs with shape (NumberOfExamples, TimeStepsInSequence, DimensionOfEachStep). This means input_shape=(TimeSteps,Dimension).
The recurrent layer will work internally with each time step. The cycles happen from step to step and this behavior is totally invisible. The layer seems to work just like any other layer.
This doesn't seem to be what you want, unless you have a "sequence" to input. The only way I know of using recurrent layers in Keras that is similar to your graph is when you have a segment of a sequence and want to predict the next step. If that's the case, see some examples by searching for "predicting the next element" in Google.
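To make the "invisible" step-to-step cycle concrete, here is a minimal NumPy sketch of what a simple recurrent layer computes internally. It mirrors the usual SimpleRNN recurrence with a tanh activation. The function name, the random weights, and the choice of 5 units are assumptions for illustration only, not the questioner's model.

```python
import numpy as np

def simple_rnn_forward(inputs, kernel, recurrent_kernel, bias):
    """inputs has shape (examples, time_steps, dim); returns all hidden states."""
    n_examples, n_steps, _ = inputs.shape
    units = kernel.shape[1]
    h = np.zeros((n_examples, units))  # initial state
    outputs = []
    for t in range(n_steps):
        # The previous state h feeds back into the same layer at every step.
        h = np.tanh(inputs[:, t, :] @ kernel + h @ recurrent_kernel + bias)
        outputs.append(h)
    return np.stack(outputs, axis=1)   # (examples, time_steps, units)

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3, 4))     # 10 examples, 3 time steps, 4 features
kernel = rng.normal(size=(4, 5))
recurrent_kernel = rng.normal(size=(5, 5))
bias = np.zeros(5)

states = simple_rnn_forward(data, kernel, recurrent_kernel, bias)
print(states.shape)
```

With input_shape=(1, 4), as in the question, the loop runs exactly once, so the recurrent weights never come into play and the layer behaves like an ordinary dense layer.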
2 - How to connect layers using Model:
Instead of adding layers to a sequential model (which will always follow a straight line), start using the layers independently, starting from an input tensor:
from keras.layers import *
from keras.models import Model
inputTensor = Input(shapeOfYourInput) #it seems the shape is "(2,)", but we must see your data.
#A dense layer with 2 outputs:
myDense = Dense(2, activation=ItsAGoodIdeaToUseAnActivation)
#The output tensor of that layer when you give it the input:
denseOut1 = myDense(inputTensor)
#You can do as many cycles as you want here:
denseOut2 = myDense(denseOut1)
#you can even make a loop:
denseOut = Activation(ItsAGoodIdeaToUseAnActivation)(inputTensor) #you may create a layer and call it with the input tensor in just one line if you're not going to reuse the layer
#I'm applying this activation layer here because since we defined an activation for the dense layer and we're going to cycle it, it's not going to behave very well receiving huge values in the first pass and small values the next passes....
for i in range(n):
    denseOut = myDense(denseOut)
This kind of usage allows you to create any kind of model, with branches, alternative ways, connections from anywhere to anywhere, provided you respect the shape rules. For a cycle like that, inputs and outputs must have the same shape.
At the end, you must define a model from one or many inputs to one or many outputs (you must have training data to match all inputs and outputs you choose):
model = Model(inputTensor,denseOut)
But notice that this model is static. If you want to change the number of cycles, you will have to create a new model.
In this case, it would be as simple as repeating the loop step denseOut = myDense(denseOut) and creating another model2=Model(inputTensor,denseOut).
3 - Trying to create something like the image below:
I am supposing C and F will participate in all iterations. If not,
Since there are four actual inputs, and we are going to treat them all separately, let's create 4 inputs instead, all like (1,).
Your input array should be divided in 4 arrays, all being (10,1).
from keras.models import Model
from keras.layers import *
inputA = Input((1,))
inputB = Input((1,))
inputC = Input((1,))
inputF = Input((1,))
Now the layers N2 and N3, that will be used only once, since C and F are constant:
outN2 = Dense(1)(inputC)
outN3 = Dense(1)(inputF)
Now the recurrent layer N1, without giving it the tensors yet:
layN1 = Dense(1)
For the loop, let's create outA and outB. They start as actual inputs and will be given to the layer N1, but in the loop they will be replaced
outA = inputA
outB = inputB
Now in the loop, let's do the "passes":
for i in range(n):
    #unite A and B in one
    inputAB = Concatenate()([outA,outB])
    #pass through N1
    outN1 = layN1(inputAB)
    #sum results of N1 and N2 into A
    outA = Add()([outN1,outN2])
    #this is constant for all the passes except the first
    outB = outN3 #looks like B is never changing in your image....
Now the model:
finalOut = Concatenate()([outA,outB])
model = Model([inputA,inputB,inputC,inputF], finalOut)
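To see what the unrolled graph above computes, here is a plain NumPy sketch of the same forward pass. All weight values are made up for illustration, `unrolled_forward` is a hypothetical helper, and the Dense(1) layers are modelled as simple affine maps without activation, as in the Keras code.

```python
import numpy as np

def unrolled_forward(a, b, c, f, w_n1, b_n1, w_n2, b_n2, w_n3, b_n3, n):
    """Forward pass of the unrolled loop; a, b, c, f have shape (batch, 1)."""
    out_n2 = c * w_n2 + b_n2   # N2 is applied once to the constant input C
    out_n3 = f * w_n3 + b_n3   # N3 is applied once to the constant input F
    out_a, out_b = a, b
    for _ in range(n):
        input_ab = np.concatenate([out_a, out_b], axis=1)  # (batch, 2)
        out_n1 = input_ab @ w_n1 + b_n1                    # shared N1 weights
        out_a = out_n1 + out_n2                            # Add()([outN1, outN2])
        out_b = out_n3                                     # B is replaced by N3's output
    return np.concatenate([out_a, out_b], axis=1)          # (batch, 2)

ones = np.ones((10, 1))
w_n1 = np.array([[0.5], [0.5]])  # assumed weights for N1
result = unrolled_forward(ones, ones, ones, ones,
                          w_n1, 0.0, 1.0, 0.0, 1.0, 0.0, n=2)
print(result.shape)
print(result[0])
```

The same `w_n1` is used in every pass, which is what reusing the layer object `layN1` achieves in the Keras version.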
This is going to be long and hard to describe so apologies in advance.
I have a regular CNN like network with standard MLP layers on top of it. On top of the MLP, I have a softmax layer too, however, unlike conventional networks, this is NOT fully connected to the MLP below and it consists of subgroups.
To further describe the softmax, it looks like this:
Neur1A Neur2A ... NeurNA Neur1B Neur2B ... NeurNB Neur1C Neur2C ...NeurNC
Group A Group B Group C
There are many more groups. Each group has a softmax that is independent from the other groups. So it is in a way, several independent classifications (even though it actually is not).
What I need is for the index of the activated neuron to be monotonically increasing between groups. For example, if I have Neuron5 in Group A activated, I want the activated neuron in group B to be >=5. Same with Group B and Group C and so on..
This softmax layer containing all the neurons for all groups is actually NOT my last layer and it is interestingly an intermediate one.
To achieve this monotonicity, I add another term to my loss function that penalizes non monotonic activated neuron indices. Here is some of the code:
The code for softmax layer and its output:
def compute_image_estimate(layer2_input):
    estimated_yps= tf.zeros([FLAGS.batch_size,0],dtype=tf.int64)
    for pix in xrange(NUM_CLASSES):
        pixrow= int( pix/width)
        rowdata= image_pixels[:, pixrow*width:(pixrow+1)*width]
        with tf.variable_scope('layer2_'+'_'+str(pix)) as scope:
            weights = _variable_with_weight_decay('weights', shape=[layer2_input.get_shape()[1], width], stddev=0.04, wd=0.0000000)
            biases = _variable_on_cpu('biases', [width], tf.constant_initializer(0.1))
            y = tf.nn.softmax(tf.matmul(layer2_input,weights) + biases)
            argyp=width-1-tf.argmax(y,1)
            argyp= tf.reshape(argyp,[FLAGS.batch_size,1])
            estimated_yps=tf.concat(1,[estimated_yps,argyp])
    return estimated_yps
The estimated_yps are passed onto a function that quantifies monotonicity:
def compute_monotonicity(yp):
    sm= tf.zeros([FLAGS.batch_size])
    for curr_row in xrange(height):
        for curr_col in xrange(width-1):
            pix= curr_row *width + curr_col
            sm=sm+alpha * tf.to_float(tf.square(tf.minimum(0,tf.to_int32(yp[:,pix]-yp[:,pix+1]))))
    return sm
and the loss function is:
def loss(estimated_yp, SOME_OTHER_THINGS):
    tf.add_to_collection('losses', SOME_OTHER_THINGS)
    monotonicity_metric= tf.reduce_mean( compute_monotonicity(estimated_yp) )
    tf.add_to_collection('losses', monotonicity_metric)
    return tf.add_n(tf.get_collection('losses'), name='total_loss')
Now my problem is that when I do not use SOME_OTHER_THINGS, which are conventional metrics, I get ValueError: No gradients provided for any variable for the monotonicity metric.
Seems like gradients are not defined when the softmax layer outputs are used like this.
Am I doing something wrong? Any help would be appreciated.
Apologies.. I realized that the problem is that tf.argmax function obviously does not have a gradient defined.
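One common workaround, which is not what the code above does, is to replace the hard argmax with a "soft-argmax": the expected index under the softmax distribution. That quantity is a smooth function of the logits, so a penalty built on it can receive gradients. The NumPy sketch below only illustrates the idea. `soft_argmax` and the `temperature` parameter are illustrative names, and in a real model the same arithmetic would be written with differentiable framework ops.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_argmax(logits, temperature=1.0):
    """Expected index under softmax(logits / temperature)."""
    probs = softmax(logits / temperature)
    idx = np.arange(logits.shape[-1])
    return (probs * idx).sum(axis=-1)

logits = np.array([[0.0, 1.0, 10.0, 2.0]])
print(np.argmax(logits, axis=-1))            # hard index, piecewise constant
print(soft_argmax(logits))                   # close to the hard index, but smooth
print(soft_argmax(logits, temperature=0.1))  # lower temperature: closer still
```

The trade-off is that the soft index is only an approximation. When two neurons in a group have similar activations, it lands between their indices rather than on either one.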
Recently I started toying with neural networks. I was trying to implement an AND gate with Tensorflow. I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, no hidden layers.
First I tried to implement it this way. As you can see, this is a poor implementation, but I think it gets the job done, at least in some way. I used only the real outputs, not one-hot true outputs. For the activation function I used a sigmoid, and for the cost function I used the squared error cost function (I think it's called that, correct me if I'm wrong).
I've tried using ReLU and Softmax as activation functions (with the same cost function) and it doesn't work. I figured out why they don't work. I also tried the sigmoid function with Cross Entropy cost function, it also doesn't work.
import tensorflow as tf
import numpy
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 1])
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))
activation = tf.nn.sigmoid(tf.matmul(x, W)+b)
cost = tf.reduce_sum(tf.square(activation - y))/4
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
    result = sess.run(activation, feed_dict={x:train_X})
    print(result)
After 5000 iterations:
[[ 0.0031316 ]
[ 0.12012422]
[ 0.12012422]
[ 0.85576665]]
Question 1 - Are there any other activation and cost functions that can work (learn) for the above network without changing the parameters (meaning without changing W, x, b)?
Question 2 - I read from a StackOverflow post here:
[Activation Function] selection depends on the problem.
So there are no cost functions that can be used anywhere? I mean there is no standard cost function that can be used on any neural network. Right? Please correct me on this.
I also implemented the AND gate with a different approach, with the output as one-hot true. As you can see the train_Y [1,0] means that the 0th index is 1, so the answer is 0. I hope you get it.
Here I have used a softmax activation function, with cross entropy as cost function. Sigmoid function as activation function fails miserably.
import tensorflow as tf
import numpy
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 2])
W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))
activation = tf.nn.softmax(tf.matmul(x, W)+b)
cost = -tf.reduce_sum(y*tf.log(activation))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
    result = sess.run(activation, feed_dict={x:train_X})
    print(result)
After 5000 iterations:
[[ 1.00000000e+00 1.41971401e-09]
[ 9.98996437e-01 1.00352429e-03]
[ 9.98996437e-01 1.00352429e-03]
[ 1.40495342e-03 9.98595059e-01]]
Question 3 So in this case what cost function and activation function can I use? How do I understand what type of cost and activation functions I should use? Is there a standard way or rule, or just experience only? Should I have to try every cost and activation function in a brute force manner? I found an answer here. But I am hoping for a more elaborate explanation.
Question 4 I have noticed that it takes many iterations to converge to a near accurate prediction. I think the convergence rate depends on the learning rate (too large a value will overshoot the solution) and the cost function (correct me if I'm wrong). So, is there any optimal (meaning fastest) way or cost function for converging to a correct solution?
I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.
Activation functions. Different activation functions do, in fact, have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as a nonlinearity. If you do not put an activation function between two layers, then the two layers together will serve no better than one, because their combined effect will still be just a linear transformation. For a long while people used the sigmoid function and tanh, choosing pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonlinearity.
The reason people use ReLU between layers is that it is non-saturating (and is also faster to compute). Think about the graph of a sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid function is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.
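The saturation argument can be checked with a few numbers. The sketch below uses NumPy only, and the function names are illustrative. It compares the derivative of the sigmoid with that of ReLU, then shows how quickly a product of sigmoid derivatives shrinks over an assumed depth of 10 layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # at most 0.25, reached at z = 0

def relu_grad(z):
    return (np.asarray(z) > 0).astype(float)  # 1 for every positive input

z = np.array([0.5, 5.0, 10.0])
print(sigmoid_grad(z))     # shrinks rapidly as |z| grows
print(relu_grad(z))        # stays at 1 for all positive inputs

# Backpropagation multiplies one such factor per layer.
depth = 10                 # assumed depth, for illustration only
print(sigmoid_grad(0.0) ** depth)  # even the best case, 0.25 per layer, decays fast
print(sigmoid_grad(5.0) ** depth)  # a saturated unit makes it far worse
```

Even at its maximum the sigmoid derivative is 0.25, so ten layers scale the gradient by less than one millionth, while active ReLU units pass it through unchanged.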
For the last layer of the network the activation unit also depends on the task. For regression you will want sigmoid or tanh only if the result should lie in a bounded range, such as between 0 and 1; for unbounded targets, a linear output is the usual choice. For classification you will want only one of your outputs to be one and all others zero, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it.
Your example. Now let's look at your example. Your first example tries to compute the output of AND in the following form:
sigmoid(W1 * x1 + W2 * x2 + B)
Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output of (x2, x1). Therefore, the model that you are fitting is:
sigmoid(W * (x1 + x2) + B)
x1 + x2 can only take one of three values (0, 1 or 2) and you want to return 0 for the case when x1 + x2 < 2 and 1 for the case when x1 + x2 = 2. Since the sigmoid function is rather smooth, it will take very large values of W and B to make the output close to the desired, but because of a small learning rate they can't get to those large values fast. Increasing the learning rate in your first example will increase the speed of convergence.
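A quick numerical check of this point, using hand-picked values of W and B that are assumptions for illustration rather than trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_model(x1, x2, w, b):
    """The symmetric model sigmoid(W * (x1 + x2) + B) from the text."""
    return sigmoid(w * (x1 + x2) + b)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Moderate parameters: the decision is right, but the outputs are far from 0 and 1.
small = [and_model(x1, x2, 2.0, -3.0) for x1, x2 in inputs]
# Large parameters: the outputs get very close to the AND truth table.
large = [and_model(x1, x2, 20.0, -30.0) for x1, x2 in inputs]

print(np.round(small, 3))
print(np.round(large, 3))
```

Gradient descent with a small learning rate has to move W and B all the way out to such large magnitudes, which is why 5000 iterations still leave visible errors in the question's output.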
Your second example converges better because the softmax function is good at making precisely one output be equal to 1 and all others to 0. Since this is precisely your case, it does converge quickly. Note that sigmoid would also eventually converge to good values, but it will take significantly more iterations (or higher learning rate).
What to use. Now to the last question: how does one choose which activation and cost functions to use? This advice will work for the majority of cases:
If you do classification, use softmax for the last layer's nonlinearity and cross entropy as a cost function.
If you do regression, use squared error as a cost function; use sigmoid or tanh for the last layer's nonlinearity only when the targets lie in a bounded range, and a linear output otherwise.
Use ReLU as a nonlinearity between layers.
Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence.
Cost function and activation function play an important role in the learning phase of a neural network.
The activation function, as explained in the first answer, makes it possible for the network to learn non-linear functions, while also ensuring that a small change in the input produces a small change in the output. A sigmoid activation function satisfies these assumptions. Other activation functions do the same but may be less computationally expensive; see activation functions for completeness. In general, however, the sigmoid activation function should be avoided because of the vanishing gradient problem.
The cost function C plays a crucial role in the speed of learning of the neural network. Gradient-based neural networks learn iteratively by minimising the cost function, that is, by computing the gradient of the cost function and changing the weights according to it. If a quadratic cost function is used, its gradient with respect to the weights is proportional to the first derivative of the activation function. Now, if a sigmoid activation function is used, this implies that when the output is close to 0 or 1 the derivative is very small, as you can see from the image, and so the neuron learns slowly.
The cross-entropy cost function avoids this problem. Even if you are using a sigmoid function, using cross-entropy as the cost function means that its derivatives with respect to the weights are no longer proportional to the first derivative of the activation function, as happened with the quadratic function, but are instead proportional to the output error. This implies that when the predicted output is far from the target your network learns more quickly, and vice versa.
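The difference can be seen for a single sigmoid neuron with pre-activation z, output a = sigmoid(z) and target y. For the quadratic cost C = (a - y)^2 / 2 the gradient with respect to z is (a - y) * a * (1 - a), while for the cross-entropy cost C = -[y ln a + (1 - y) ln(1 - a)] it simplifies to a - y. The sketch below just evaluates these two standard formulas. The function names and the chosen z are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_quadratic(z, y):
    """dC/dz for C = 0.5 * (a - y)**2 with a = sigmoid(z)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def grad_cross_entropy(z, y):
    """dC/dz for C = -(y*log(a) + (1-y)*log(1-a)) with a = sigmoid(z)."""
    return sigmoid(z) - y

# A confidently wrong neuron: the output is near 1 while the target is 0.
z, y = 5.0, 0.0
print(grad_quadratic(z, y))      # tiny, because sigmoid'(z) is almost 0
print(grad_cross_entropy(z, y))  # close to 1, proportional to the error
```

The weight gradients are these values multiplied by the neuron's inputs, so the same contrast carries over to the weight updates.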
For the reasons explained above, the cross-entropy cost function should always be used instead of a quadratic cost function for classification problems.
Note that in neural networks the cross-entropy function does not always have the same meaning as the cross-entropy function you meet in probability, where it is used to compare two probability distributions. In neural networks this interpretation holds if you have a single sigmoid output in the final layer and want to think of it as a probability distribution. But it loses this meaning if you have multiple sigmoid neurons in the final layer.