Was just wondering if anyone else has encountered this behavior before. I couldn't find this specific issue in my searches. I wrote my own (very simple) neural net using just numpy, and it appears to work: the cost function decreases as I iterate. However, when I change the random weight initialization from np.random.randn(shape) to np.random.randn(shape)*0.01 (I heard that smaller initial weights might speed up learning because I'm using a sigmoid layer), my cost function starts at about 0.69 ≈ ln(2) and pretty much gets stuck there. This happens no matter how many times I restart the neural net and no matter what inputs I feed it. I find this very odd. I should add that if I start without the 0.01 multiplication factor, the cost function does decrease below 0.69.
The neural net uses a cross-entropy cost function, and uses gradient descent to make steps. No regularization has been implemented. This behavior doesn't seem to depend on what the neural network's dimensions (number of layers, neurons per layer) are, or the learning rate, only on whether I initialize the starting weights with or without multiplying them by 0.01.
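One way to see where the 0.69 comes from: with very small weights every sigmoid output is close to 0.5, and the cross-entropy of predicting 0.5 for a binary label is -ln(0.5) = ln(2) ≈ 0.693. The sketch below is only an illustration under assumptions, not the poster's code: it assumes a two-layer sigmoid network without biases, random inputs and random binary labels, and the function name, layer sizes and seed are made up. It also prints the size of the gradient reaching the first layer, which is scaled by the (tiny) second-layer weights and may be part of why progress is so slow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def initial_cost_and_grad(scale, seed=0):
    """Cost and first-layer gradient norm of an untrained 2-layer sigmoid net."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((4, 200))          # 4 features, 200 examples (made up)
    y = rng.integers(0, 2, size=(1, 200))      # random binary labels (made up)
    W1 = rng.standard_normal((5, 4)) * scale   # hidden layer with 5 units
    W2 = rng.standard_normal((1, 5)) * scale   # output layer with 1 unit
    A1 = sigmoid(W1 @ X)
    A2 = sigmoid(W2 @ A1)
    cost = -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))
    # Backpropagate the cross-entropy loss to the first-layer weights.
    dZ2 = (A2 - y) / X.shape[1]
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T
    return cost, np.linalg.norm(dW1)

small_cost, small_grad = initial_cost_and_grad(0.01)
large_cost, large_grad = initial_cost_and_grad(1.0)
print(small_cost, np.log(2))   # the small-weight cost sits very close to ln(2)
print(small_grad, large_grad)  # compare the first-layer gradient sizes
```

Starting at ln(2) is therefore expected for this kind of initialization; the open question is only why the cost does not move away from it.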
I am using pytorch and autograd to build my neural network architecture. It is a small 3-layer network with a single input and output. Suppose I have to predict some output function based on some initial conditions, and I am using a custom loss function.
The problem I am facing is:
My loss converges initially but gradients vanish eventually.
I have tried sigmoid activation and tanh. tanh gives slightly better results in terms of loss convergence.
I tried using ReLU, but since I don't have many weights in my neural network, the neurons die and it doesn't give good results.
Is there any other activation function apart from sigmoid and tanh that handles the problem of vanishing gradients well enough for small sized neural networks?
Any suggestions on what else I can try?
In the deep learning world, ReLU is usually preferred over other activation functions because it mitigates the vanishing gradient problem, allowing models to learn faster and perform better. But it has downsides.
Dying ReLU problem
The dying ReLU problem refers to the scenario when a large number of ReLU neurons only output values of 0. When most of these neurons return output zero, the gradients fail to flow during backpropagation and the weights do not get updated. Ultimately a large part of the network becomes inactive and it is unable to learn further.
What causes the Dying ReLU problem?
High learning rate: If the learning rate is set too high, a single large update can push the weights far into the negative range, so the neuron's input to ReLU is negative for most or all examples.
Large negative bias: A large negative bias term can likewise make the inputs to the ReLU activation negative.
How to solve the Dying ReLU problem?
Use of a smaller learning rate: It can be a good idea to decrease the learning rate during training.
Variations of ReLU: Leaky ReLU is a common and effective way to address the dying ReLU problem; it does so by adding a slight slope in the negative range. There are other variations such as PReLU, ELU and GELU. If you want to dig deeper, check out this link.
Modification of initialization procedure: It has been demonstrated that a randomized asymmetric initialization can help prevent the dying ReLU problem. Do check out the arXiv paper for the mathematical details.
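To make the large-negative-bias cause and the Leaky ReLU remedy concrete, here is a small numpy sketch. It is only an illustration: the single-unit setup, the bias of -50 and the slope of 0.01 are assumptions chosen to force the effect, not values from any of the sources.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 where the input is positive, 0 elsewhere.
    return (z > 0).astype(float)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    # Derivative of Leaky ReLU: 1 for positive inputs, `slope` otherwise.
    return np.where(z > 0, 1.0, slope)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1000))   # 3 inputs, 1000 samples (made up)
w = rng.standard_normal((1, 3))
b = -50.0                            # large negative bias, chosen to force the effect
z = w @ x + b                        # pre-activation is negative for every sample

print(relu(z).max())                 # 0.0: the unit never fires
print(relu_grad(z).sum())            # 0.0: no gradient reaches w or b
print(leaky_relu_grad(z).min())      # 0.01: a small gradient still flows
```

With plain ReLU the unit gets no gradient and cannot recover; with the leaky variant the weights and bias still receive a small update signal.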
Sources:
Practical guide for ReLU
ReLU variants
Dying ReLU problem
I'm implementing a residual CNN (a modified, smaller version of Xception) in a low-latency environment. I've done a lot of manual tuning to minimize the run time of my network (reducing the number of filters, removing layers, etc.).
But now I want to try allowing my network to make its classification prediction (final fcnn layer) on the residual connection after each residual block.
Basic logic:
  attempt final prediction with residual connection as input
  if this fcnn layer predicts a certain class with a probability > a set threshold:
      return fcnn output as if it was the normal final layer
  else:
      do the next residual block as normal and try the previous conditional again, unless we are already at the final block
My hope is this will allow my network to learn to solve easier problems with less computation while allowing it to still do the additional layers if it is still unsure of the classification.
So my basic question is: in pytorch, what's the best way to implement this conditional so that my nn can decide at run time whether to do more processing or not?
Currently I've tested returning the intermediate x's after the blocks in the forward function, but I don't know how best to set up the conditional to choose which x to return.
Also, a side note: I believe I may end up needing another cnn layer between the residual and the fcnn, to convert the internal representation used for processing into a representation the fcnn understands for classification.
This has already been done and was presented at ICLR 2018.
It appears that in ResNets the first few bottlenecks learn representations (and therefore cannot be skipped), while the remaining bottlenecks refine the features and can therefore be skipped at a moderate loss of accuracy (Stanisław Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, Yoshua Bengio, "Residual Connections Encourage Iterative Inference", ICLR 2018).
This idea was taken to the extreme by sharing weights across bottlenecks in Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, Jan Kautz, "IamNN: Iterative and Adaptive Mobile Neural Network for efficient image classification", ICLR 2018.
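As for the conditional itself, ordinary Python control flow inside forward works in eager-mode PyTorch. The following is a minimal sketch under assumptions, not the method from the papers above and not the poster's network: the class name EarlyExitNet, the layer sizes, the per-block linear heads and the threshold value are all hypothetical, and the early-exit path assumes a batch size of 1 at inference time.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Hypothetical early-exit network: one classifier head per residual block."""

    def __init__(self, channels=8, num_classes=3, num_blocks=3, threshold=0.9):
        super().__init__()
        self.threshold = threshold
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            for _ in range(num_blocks)
        ])
        # One small head per block: global average pool, then a linear layer.
        self.heads = nn.ModuleList([
            nn.Linear(channels, num_classes) for _ in range(num_blocks)
        ])

    def forward(self, x):
        """Inference path for a single example (batch size 1)."""
        x = self.stem(x)
        last = len(self.blocks) - 1
        for i, (block, head) in enumerate(zip(self.blocks, self.heads)):
            x = x + block(x)                       # residual connection
            logits = head(x.mean(dim=(2, 3)))      # pool to shape (1, channels)
            confidence = torch.softmax(logits, dim=1).max().item()
            # Stop as soon as a head is confident enough, or at the final block.
            if confidence > self.threshold or i == last:
                return logits, i                   # i = index of the exit used

model = EarlyExitNet().eval()
with torch.no_grad():
    logits, exit_index = model(torch.randn(1, 1, 16, 16))
print(logits.shape, exit_index)
```

For training, one common choice is to run all blocks and sum the losses of every head, since the intermediate heads have to learn to classify too. With batches larger than 1, each example may want a different exit, so the comparison would have to be made per example (for instance with a mask) instead of a single if.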
I just read the book "Make Your Own Neural Network". Now I am trying to create a NeuralNetwork class in Python. I use the sigmoid activation function. I wrote some basic code and tried to test it, but my implementation didn't work properly at all. After a lot of debugging and comparison with the code from the book, I found out that the sigmoid of a very big number is 1 because Python rounds it. I used numpy.random.rand() to generate the weights, and this function returns only values from 0 to 1, so after summing all the products of weights and inputs I got a very big number. I fixed this problem with the numpy.random.normal() function, which generates random numbers centered on zero (mostly within (-1, 1), for example). But I have some questions.
1. Is sigmoid a good activation function?
2. What should I do if the output of a node is still so big that Python rounds the result to 1, which is impossible for sigmoid?
3. How can I prevent Python from rounding floats that are very close to an integer?
4. Any advice for me as a beginner in neural networks (books, techniques, etc.)?
1. The answer to this question depends on context, namely what is meant by "good". The sigmoid activation function produces outputs between 0 and 1. As such, it is the standard output activation for binary classification, where you want your neural network to output a number between 0 and 1 that is interpreted as the probability of the input being in the specified class. However, if you are using sigmoid activations throughout your neural network (i.e. in intermediate layers as well), you might consider switching to the ReLU activation function. Historically, sigmoid was used throughout neural networks to introduce non-linearity so that a network could do more than approximate linear functions. However, sigmoid activations were found to suffer heavily from the vanishing gradients problem because the function is so flat far from 0. As such, nowadays most intermediate layers use ReLU activations (or something fancier, e.g. SELU or Leaky ReLU). The ReLU activation function is 0 for inputs less than 0 and equals the input for inputs greater than 0. It has been found to be sufficient for introducing non-linearity into a neural network.
2. Generally you don't want to be in a regime where your outputs are so huge or so small that computation becomes numerically unstable. One way to help fix this, as mentioned earlier, is to use a different activation function (e.g. ReLU). Another, perhaps even better, way is to initialize the weights better, e.g. with the Xavier/Glorot initialization scheme, or simply to initialize them to smaller values, e.g. within the range [-0.01, 0.01]. Basically, you scale the random initializations so that your outputs are in a good range of values rather than gigantic or minuscule. You can certainly also do both.
3. You can use higher-precision floats to make Python keep more decimals around, e.g. np.float64 instead of np.float32. However, this increases the computational cost and probably isn't necessary. Most neural networks today use 32-bit floats and work just fine. See points 1 and 2 for better ways to solve your problem.
4. This question is overly broad. My strongest recommendation for learning neural networks is the Coursera course and specialization by Prof. Andrew Ng.
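The saturation described in the question and the effect of scaled initialization can be reproduced in a few lines of numpy. This is a sketch under assumptions: the layer sizes (784 inputs, 100 units) and the seed are made up, and the Glorot/Xavier uniform variant is used, which draws weights from [-limit, limit] with limit = sqrt(6 / (n_in + n_out)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_out = 784, 100                     # layer sizes chosen for illustration
x = rng.random(n_in)                       # inputs in [0, 1)

# Weights drawn from [0, 1): every product is non-negative, so the sums get large.
w_rand = rng.random((n_out, n_in))
z_rand = w_rand @ x

# Glorot/Xavier uniform: limit = sqrt(6 / (n_in + n_out)), centered on zero.
limit = np.sqrt(6.0 / (n_in + n_out))
w_glorot = rng.uniform(-limit, limit, size=(n_out, n_in))
z_glorot = w_glorot @ x

print(z_rand.min(), np.all(sigmoid(z_rand) == 1.0))  # large sums, outputs saturate at 1.0
print(np.abs(z_glorot).max())                        # sums stay in a moderate range
print(sigmoid(40.0) == 1.0)                          # True: exp(-40) is below float64 resolution
```

The exact 1.0 is a limit of floating-point precision, not something Python does deliberately, which is why keeping the pre-activations small works better than asking for more decimals.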
I implemented a simple neural network for classification (one class) of images in python. Layers are simple (image_matrix, 5,1). Using relu and sigmoid for the hidden layers.
I am iterating 5000 times. At first it looks like the cost goes down gradually in a sensible way.
However, no matter how many training examples I use, or what my learning_rate is, the cost starts behaving erratically after around 3000 iterations every time...
[Image: plot of the cost]
Can someone help me understand what's going on?
Thanks
When training models, remember that the cost function can have multiple local minima. Your graph suggests that the cost is moving around a local minimum while the optimizer searches for the global minimum, which is the goal when looking for the best model performance.
1st - you should probably track accuracy, F1-score, or loss per iteration/epoch to check whether performance is actually improving.
2nd - do cross-validation and check the same metrics on the validation data.
3rd - implement an early stopping function that checks whether your model is still improving.
*Note: tune alpha (the learning rate) to help the optimizer get closer to the global minimum.
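The early stopping check in the third point could look roughly like this. It is a minimal sketch: the function name, the patience and min_delta parameters and the loss values are all made up for illustration, and it works on a list of validation losses that has already been recorded.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Stops once the validation loss has failed to improve by more than
    min_delta for `patience` consecutive epochs; returns the last index
    if that never happens.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improving at first, then erratic.
losses = [0.70, 0.55, 0.48, 0.45, 0.47, 0.52, 0.46, 0.60]
print(early_stopping_epoch(losses))  # 6
```

In a real training loop the same counter would be updated after each epoch, and the weights from the best epoch would usually be kept.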
Background: I'm writing a three-layer neural network in Python, trained with mini-batch stochastic gradient descent, to distinguish between the three classes of iris plants in the famous iris data set. The input layer has four neurons, one for each feature in the data. The hidden layer has three neurons (but the code allows a different number of hidden neurons) and the output layer has three neurons (one for each species). All neurons use sigmoid activation functions.
Problem: The loss (mean-squared error) generally decreases over time, but the accuracy is stagnant (usually below 55.55%, or even 33.33%). I've tried different numbers of epochs and different learning rates, but nothing worked. Interestingly, more often than not, the outputs of the algorithm remain fixed no matter what the input values are. I'm fairly certain of my math, since the loss seems to decrease as the number of epochs increases.
To replicate the problem: Just run the Python code and look at the LEARNING_RESULTS.txt file. (Make sure the iris.txt file from the repo is in the same directory.)
Question: How can I improve performance for this neural network?
Link to GitHub repo: https://github.com/kwonkyo/neural-networks
Thanks!
UPDATE: Problem solved. I was adding a constant value (numerical sum of the sum of the mini-batch matrices) to the weight and bias matrices instead of the sum of the mini-batch gradient matrices. Updated code has been pushed to github.
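The exact lines are in the repo, not in this post, so the following is only a guess at what this kind of bug can look like in numpy: calling np.sum on a list of per-example gradient matrices without an axis collapses everything into a single number, while axis=0 keeps the element-wise sum. The gradient values, learning rate and variable names are made up.

```python
import numpy as np

# Hypothetical per-example weight gradients for one mini-batch of 3 examples,
# each with the same shape as a 2x2 weight matrix.
grads = [
    np.array([[0.1, -0.2], [0.0, 0.3]]),
    np.array([[0.2, 0.1], [-0.1, 0.0]]),
    np.array([[-0.3, 0.4], [0.2, 0.1]]),
]

scalar_sum = np.sum(grads)           # collapses everything into one number
matrix_sum = np.sum(grads, axis=0)   # element-wise sum, keeps the 2x2 shape

weights = np.zeros((2, 2))
learning_rate = 0.1

# Buggy update: the same constant is applied to every weight.
buggy = weights - learning_rate * scalar_sum
# Intended update: each weight moves along its own averaged gradient.
fixed = weights - learning_rate * matrix_sum / len(grads)

print(np.shape(scalar_sum), matrix_sum.shape)
print(buggy)
print(fixed)
```

An update that shifts every weight by the same amount would be consistent with the symptom described above, where the outputs stay fixed regardless of the input.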