After studying autograd, I tried to write a loss function myself. Here is my loss:
import torch
import torch.nn.functional as F

def myCEE(outputs, targets):
    # log of the sum of exponentials over the class dimension
    exp = torch.exp(outputs)
    A = torch.log(torch.sum(exp, dim=1))
    # pick out the logit of the target class via a one-hot mask
    hadamard = F.one_hot(targets, num_classes=10).float() * outputs
    B = torch.sum(hadamard, dim=1)
    return torch.sum(A - B)
and I compared it with torch.nn.CrossEntropyLoss. Here are the results:
for i, j in train_dl:
    inputs = i
    targets = j
    break

outputs = model(inputs)
myCEE(outputs, targets)
# tensor(147.5397, grad_fn=<SumBackward0>)

loss_func = nn.CrossEntropyLoss(reduction='sum')
loss_func(outputs, targets)
# tensor(147.5397, grad_fn=<NllLossBackward>)
The values were the same.

I thought that because those are different functions the grad_fn differs, and that this wouldn't cause any problems.

But something happened! After 4 epochs, the loss values turned to nan.

In contrast to myCEE, learning with nn.CrossEntropyLoss went well. So I wonder if there is a problem with my function.

After reading some posts about nan problems, I stacked more convolutions onto the model. As a result, a 39-epoch training run did not produce an error.

Nevertheless, I'd like to know the difference between myCEE and nn.CrossEntropyLoss.
torch.nn.CrossEntropyLoss differs from your implementation because it uses a trick to counter the unstable computation of the exponential for numerically large values. Given the logits output {l_1, ..., l_j, ..., l_n}, the softmax is defined as:
softmax(l_i) = exp(l_i) / sum_j(exp(l_j))
The trick is to multiply both the numerator and the denominator by exp(-β):
softmax(l_i) = exp(l_i)*exp(-β) / [sum_j(exp(l_j))*exp(-β)]
= exp(l_i-β) / sum_j(exp(l_j-β))
Then the log-softmax comes down to:
logsoftmax(l_i) = l_i - β - log[sum_j(exp(l_j-β))]
In practice β is chosen as the highest logit value i.e. β = max_j(l_j).
You can read more about it on this question: Numerically Stable Softmax.
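For reference, here is a minimal sketch of the same trick applied to your function (assuming 10 classes, as in your code); PyTorch's torch.logsumexp does the same thing internally:

import torch
import torch.nn.functional as F

def myCEE_stable(outputs, targets):
    # subtract the per-row maximum logit before exponentiating, so that
    # exp() never sees a large positive argument
    beta = outputs.max(dim=1, keepdim=True).values
    A = beta.squeeze(1) + torch.log(torch.sum(torch.exp(outputs - beta), dim=1))
    # equivalently: A = torch.logsumexp(outputs, dim=1)
    hadamard = F.one_hot(targets, num_classes=10).float() * outputs
    B = torch.sum(hadamard, dim=1)
    return torch.sum(A - B)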
After writing a simple neural network with NumPy, I wanted to compare it numerically with the PyTorch implementation. Running alone, my neural network implementation seems to converge, so it seems to have no errors.

I've also checked that the forward pass matches PyTorch, so the basic setup is correct.

But something different happens during the backward pass, because the weights after one backpropagation step are different.

I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.

I'm looking at the simplest example, the fully connected weights of the last layer, because if those differ, everything further back will also differ:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta)
where

output_delta = self.expected - self.output

self.expected is the expected value and self.output is the forward-pass result. There is no activation or anything further here.
The PyTorch part is:
optimizer = torch.optim.SGD(nn.parameters(), lr=1.0)
criterion = torch.nn.MSELoss(reduction='sum')

output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss, PyTorch uses some different delta or backpropagation function, not the basic one mentioned above? If so, I'd like to know how to numerically check my NumPy solution against PyTorch.
I just want to know whether PyTorch does "basic" gradient descent or something different.
If you set torch.optim.SGD, this means stochastic gradient descent.

There are different implementations of gradient descent, and the one used in PyTorch is applied to mini-batches.

There are GD implementations that only update parameters after a full epoch; as you may guess they are very "slow", which may be fine on supercomputers. There are also GD implementations that update after every single sample; as you may guess, their imperfection is "huge" gradient fluctuations. These are all relative terms, hence the quotation marks.

Note that you are using a very large learning rate (lr = 1.0), which suggests you haven't normalized your data first, but this is a skill you will pick up over time.
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above?

It uses exactly what you told it to use.
Here is an example in PyTorch and in plain Python showing that gradient computation (used in backpropagation) works as expected:
x = torch.tensor([5.], requires_grad=True)
print(x)  # tensor([5.], requires_grad=True)

y = 3*x**2
y.backward()
print(x.grad)  # tensor([30.])
How would you get this value of 30 in plain Python?

def y(x):
    return 3*x**2

x = 5
e = 0.01  # small finite-difference step
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299

As expected, we got ~30; it would be even closer with a smaller step.
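As for numerically checking your NumPy solution against PyTorch: you can compare W.grad with the gradient your update rule implies. A minimal sketch, with made-up shapes and data:

import torch

torch.manual_seed(0)
h = torch.randn(4, 3)                     # hidden-layer activations (made up)
t = torch.randn(4, 2)                     # targets (made up)
W = torch.randn(3, 2, requires_grad=True)

out = h @ W                               # linear last layer, no activation
loss = torch.nn.functional.mse_loss(out, t, reduction='sum')
loss.backward()

# manual gradient of sum((out - t)^2) w.r.t. W is 2 * h^T (out - t);
# note the factor of 2, which a plain (expected - output) delta omits
with torch.no_grad():
    manual = 2 * h.T @ (h @ W - t)
print(torch.allclose(W.grad, manual))     # True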
I am a newbie in this field of study and probably this is a pretty silly question. I want to build a normal ANN, but I am not sure if I can use a weighted mean squared error as the loss function.

If we are not treating each sample equally, meaning we care about prediction precision more for some categories of samples than for others, then we want to form a weighted loss function.

Let's say we have a categorical feature c_i, where i is the index of the sample, and for simplicity assume this feature takes a binary value, either 0 or 1. So we can form the loss function as
(c_i + 1) * (yhat_i - y_i)^2
# and take the sum over all i
Are there going to be any problems with the back-propagation? I don't see any issue with calculating the gradient or updating the weights between layers.

And, if there is no issue, how can I program this loss function in Keras? It seems that the loss function only takes two parameters, y_true and y_pred, so how can I plug in the vector c?
There is absolutely nothing wrong with that. Functions can declare constants within themselves or even take constants from an outer scope:
import keras
import keras.backend as K

c = K.constant([c1, c2, c3, c4, ..., cn])  # constant weight vector, one entry per sample

def weighted_loss(y_true, y_pred):
    loss = keras.losses.get('mse')
    return c * loss(y_true, y_pred)
Exactly like yours:

def weighted_loss(y_true, y_pred):
    weighted = (c + 1) * K.square(y_true - y_pred)
    return K.sum(weighted)
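A hedged usage sketch (the model, shapes, and data here are made up for illustration, assuming c was defined with eight entries); with a per-sample c, the batch size has to line up with len(c):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(1, input_dim=4)])
model.compile(optimizer='adam', loss=weighted_loss)

X = np.random.rand(8, 4)       # 8 samples, 4 features (made up)
y = np.random.rand(8, 1)
model.fit(X, y, batch_size=8)  # batch size matches len(c) here

As an aside, model.fit also accepts a sample_weight argument, which is another way to weight samples without writing a custom loss.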
I'm having trouble trying to teach a neural network the XOR logic function. I've already trained the network with successful results using the hyperbolic tangent and ReLU as activation functions (regarding ReLU, I know it's not appropriate for this kind of problem, but I still wanted to test it). Still, I can't make it work with the logistic function. My definition of the function is:
def logistic(data):
    return 1.0 / (1.0 + np.exp(-data))
and its derivative:
def logistic_prime(data):
    output = logistic(data)
    return output * (1.0 - output)
where np is the name given to the imported NumPy package. As XOR uses 0s and 1s, the logistic function should be an appropriate activation function. Still, the results I get are close to 0.5 in all cases, i.e. any input combination of 0s and 1s results in a value close to 0.5. Is there any error in what I'm saying?
Don't hesitate to ask me for more context or more code. Thanks in advance.
I had the same problem as you.

The problem happens when the data cannot be divided by a linear hyperplane. Try to train on this data:

X = [[-1,0],[0,1],[1,0],[0,-1]]
Y = [1,0,1,0]

If you draw it on a coordinate plane, you will find it is not linearly separable. Trained with the logistic function, the parameters all stay close to 0 and the result stays close to 0.5.

Another, linearly separable, example is Y = [1,1,0,0], and with it the logistic function works.
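A minimal sketch of that behaviour, assuming a single-layer logistic model trained with plain gradient descent on the data above:

import numpy as np

X = np.array([[-1, 0], [0, 1], [1, 0], [0, -1]], dtype=float)
Y = np.array([1, 0, 1, 0], dtype=float)

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(10000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # logistic output
    grad = p - Y                            # cross-entropy gradient
    w -= lr * (X.T @ grad) / len(Y)
    b -= lr * grad.mean()

print(p)  # all four outputs stay at 0.5: the symmetric, non-separable
          # data keeps every gradient at exactly zero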
Recently I started toying with neural networks. I was trying to implement an AND gate with TensorFlow. I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, no hidden layers.

First I tried to implement it in the following way. As you can see this is a poor implementation, but I think it gets the job done, at least in some way. I tried only the real outputs, no one-hot outputs. For the activation function I used a sigmoid, and for the cost function I used the squared error cost function (I think it's called that; correct me if I'm wrong).

I've tried using ReLU and Softmax as activation functions (with the same cost function) and it doesn't work. I figured out why they don't work. I also tried the sigmoid function with the cross-entropy cost function; it also doesn't work.
import tensorflow as tf
import numpy

train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 1])

W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))

activation = tf.nn.sigmoid(tf.matmul(x, W) + b)
cost = tf.reduce_sum(tf.square(activation - y)) / 4
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})

    result = sess.run(activation, feed_dict={x: train_X})
    print(result)
after 5000 iterations:

[[ 0.0031316 ]
 [ 0.12012422]
 [ 0.12012422]
 [ 0.85576665]]
Question 1 - Is there any other activation function and cost function that can work (learn) for the above network, without changing the parameters (meaning without changing W, x, b)?

Question 2 - I read in a StackOverflow post here:

[Activation function] selection depends on the problem.

So there is no cost function that can be used anywhere? I mean, there is no standard cost function that can be used on any neural network, right? Please correct me on this.

I also implemented the AND gate with a different approach, with the output as one-hot. As you can see, the train_Y [1,0] means that the 0th index is 1, so the answer is 0. I hope you get it.

Here I have used a softmax activation function, with cross-entropy as the cost function. The sigmoid as activation function fails miserably here.
import tensorflow as tf
import numpy

train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 2])

W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))

activation = tf.nn.softmax(tf.matmul(x, W) + b)
cost = -tf.reduce_sum(y * tf.log(activation))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})

    result = sess.run(activation, feed_dict={x: train_X})
    print(result)
after 5000 iterations:

[[ 1.00000000e+00  1.41971401e-09]
 [ 9.98996437e-01  1.00352429e-03]
 [ 9.98996437e-01  1.00352429e-03]
 [ 1.40495342e-03  9.98595059e-01]]
Question 3 - So in this case, what cost function and activation function can I use? How do I understand what type of cost and activation functions I should use? Is there a standard way or rule, or is it just experience? Should I have to try every cost and activation function in a brute-force manner? I found an answer here, but I am hoping for a more elaborate explanation.

Question 4 - I have noticed that it takes many iterations to converge to a near-accurate prediction. I think the convergence rate depends on the learning rate (too large a value will miss the solution) and on the cost function (correct me if I'm wrong). So, is there an optimal way (meaning the fastest), or an optimal cost function, for converging to a correct solution?
I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.
Activation functions. Different activation functions do, in fact, have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as a nonlinearity. If you do not put an activation function between two layers, then the two layers together will serve no better than one, because their combined effect is still just a linear transformation. For a long while people used the sigmoid function and tanh, choosing pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonlinearity.

The reason people use ReLU between layers is that it is non-saturating (and is also faster to compute). Think about the graph of a sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid is small, which means that as we propagate the error backwards, the gradient vanishes very quickly as we go back through the layers. With ReLU, the derivative is 1 for all positive inputs, so the gradient of those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.
For the last layer of the network the activation unit also depends on the task. For regression you will want to use a sigmoid or tanh activation, because you want the result to lie in a bounded range (between 0 and 1 for the sigmoid, between -1 and 1 for tanh). For classification you want only one of your outputs to be one and all the others zeros, but there is no differentiable way to achieve precisely that, so you use a softmax to approximate it.
Your example. Now let's look at your example. Your first example tries to compute the output of AND in the following form:
sigmoid(W1 * x1 + W2 * x2 + B)
Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should equal the output for (x2, x1). Therefore, the model you are fitting is:
sigmoid(W * (x1 + x2) + B)
x1 + x2 can take only one of three values (0, 1 or 2), and you want to return 0 when x1 + x2 < 2 and 1 when x1 + x2 = 2. Since the sigmoid function is rather smooth, it takes very large values of W and B to make the output close to the desired one, but because of the small learning rate the parameters cannot reach those large values quickly. Increasing the learning rate in your first example will increase the speed of convergence.
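To see why large values are needed, here is a minimal sketch with hand-picked W and B (purely illustrative, not learned values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W, B = 10.0, -15.0  # hand-picked large values for illustration
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), sigmoid(W * (x1 + x2) + B))
# (0, 0) -> ~0.0000003, (0, 1)/(1, 0) -> ~0.0067, (1, 1) -> ~0.9933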
Your second example converges better because the softmax function is good at making precisely one output equal to 1 and all the others 0. Since this is precisely your case, it converges quickly. Note that a sigmoid would also eventually converge to good values, but it would take significantly more iterations (or a higher learning rate).
What to use. Now to the last question: how does one choose which activation and cost functions to use? The following advice will work for the majority of cases:
If you do classification, use softmax for the last layer's nonlinearity and cross entropy as a cost function.
If you do regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as a cost function.
Use ReLU as a nonlienearity between layers.
Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence. See the one-line change after this list.
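For instance, a possible drop-in change to the snippets above (TF1-era API; the learning rate here is an assumption, not tuned):

# swap the gradient-descent optimizer for Adam
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)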
Cost function and activation function play an important role in the learning phase of a neural network.
The activation function, as explained in the first answer, gives the network the ability to learn non-linear functions, besides ensuring a small change in the output in response to a small change in the input. A sigmoid activation function works well for these assumptions. Other activation functions do the same but may be less computationally expensive; see activation functions for completeness. However, in general the sigmoid activation function should be avoided because of the vanishing gradient problem.
The cost function C plays a crucial role in the speed of learning of the neural network. Gradient-based neural networks learn iteratively by minimising the cost function: computing the gradient of the cost function and changing the weights according to it. If a quadratic cost function is used, its gradient with respect to the weights is proportional to the first derivative of the activation function. Now, if a sigmoid activation function is used, this implies that when the output is close to 1 the derivative is very small, as you can see from the image, and so the neurons learn slowly.
The cross-entropy cost function avoids this problem. Even if you are using a sigmoid, using cross-entropy as the cost function means that its derivatives with respect to the weights are no longer proportional to the first derivative of the activation function, as happened with the quadratic function; instead they are proportional to the output error. This implies that when the prediction is far away from the target your network learns more quickly, and vice versa.
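A minimal sketch of this difference for a single sigmoid neuron (the values of z and y are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = 5.0, 0.0  # saturated neuron (output near 1) with target 0
a = sigmoid(z)

# d/dz of the quadratic cost (a - y)^2 / 2: carries the sigma'(z) factor
grad_quadratic = (a - y) * a * (1 - a)
# d/dz of the cross-entropy cost: the sigma'(z) factor cancels
grad_xent = a - y

print(grad_quadratic)  # ~0.0066  -> learns slowly when saturated
print(grad_xent)       # ~0.9933  -> learns fast, proportional to the error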
For classification problems, the cross-entropy cost function should always be used instead of a quadratic cost function, for the reasons explained above.
Note that in neural networks the cross-entropy function does not always have the same meaning as the cross-entropy function you meet in probability theory, where it is used to compare two probability distributions. In neural networks this interpretation can hold if you have a single sigmoid output at the final layer and want to think of it as a probability distribution. But it loses this meaning if you have multiple sigmoid neurons at the final layer.