I am taking Intro to ML on Coursera, offered by Duke, which I recommend if you are interested in ML. The instructors of this course explained that "We typically include nonlinearities between layers of a neural network. There's a number of reasons to do so. For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." I am curious: if I apply ReLU, aren't we losing information, since ReLU transforms every negative value to 0? Then how is this transformation more expressive than one without ReLU?
In a multilayer perceptron, I tried running an MLP on the MNIST dataset without the ReLU transformation, and the performance didn't change much (92% with ReLU and 90% without). Still, I am curious why this transformation gives us more information rather than losing information.
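For reference, a minimal sketch of the kind of comparison I mean, in PyTorch (the exact layer sizes here are placeholders, not necessarily the ones I used):

```python
import torch.nn as nn

# MLP with a ReLU between the two fully connected layers.
mlp_with_relu = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Same architecture with the ReLU removed: the two Linear layers compose
# into a single affine map, so this is effectively just a linear classifier.
mlp_without_relu = nn.Sequential(
    nn.Linear(784, 256),
    nn.Linear(256, 10),
)
```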
The first point is that without nonlinearities, such as the ReLU function, a neural network is limited to performing linear combinations of its input. In other words, the network can only learn linear relationships between the input and the output, which means it cannot approximate complex functions that are not linear, such as polynomials.
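To make the collapse concrete, here is a small numpy illustration (biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)          # a toy input vector
W1 = rng.normal(size=(4, 5))    # weights of the first "layer"
W2 = rng.normal(size=(3, 4))    # weights of the second "layer"

# Two stacked linear transforms...
two_layers = W2 @ (W1 @ x)
# ...are exactly equivalent to one linear transform with matrix W2 @ W1.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True

# Insert a ReLU between them and the equivalence breaks down.
relu = lambda z: np.maximum(z, 0)
print(np.allclose(W2 @ relu(W1 @ x), one_layer))   # False in general
```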
Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.
The ReLU function increases the expressiveness of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. ReLU is defined as f(x) = max(0, x), which sets all negative values to zero. Because different inputs cause different subsets of units to be zeroed, the network is split into many linear regions, so as a whole it can represent a much more complex, piecewise-linear function.
For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation. The first layer can only perform a linear transformation of the input, while the second layer performs a non-linear one; together they can learn more complex representations of the data.
In the case of your experiment, it is not surprising that the performance did not change much when you removed the ReLU: the dataset and the problem you were solving may simply not be complex enough to require it (a plain linear classifier already does quite well on MNIST). In other words, a linear model might be nearly sufficient for that problem, but for more complex problems ReLU can be a critical component for achieving good performance.
It's also important to note that ReLU is not the only function that introduces non-linearity; other non-linear activation functions such as sigmoid and tanh can be used as well. The choice of activation function depends on the problem and the dataset you are working with.
Neural networks are inspired by the structure of the brain. Neurons in the brain transmit information between different areas using electrical impulses and chemical signals. Some signals are strong and some are weak, and neurons receiving weak signals are not activated.
Neural networks work in a similar fashion. Some input features carry weak signals and some carry strong ones, depending on the features. If a signal is weak, the related neurons aren't activated and don't pass the information forward. We know that some features or inputs aren't crucial players in determining the label; for the same reason, we don't bother much with feature engineering in neural networks - the model takes care of it. Activation functions help here: they tell the model which neurons should fire and how much information they should transmit.
Related
I am using PyTorch and autograd to build my neural network architecture. It is a small three-layer network with a single input and a single output. I have to predict some output function based on some initial conditions, and I am using a custom loss function.
The problem I am facing is:
My loss converges initially, but the gradients eventually vanish.
I have tried sigmoid and tanh activations; tanh gives slightly better results in terms of loss convergence.
I tried using ReLU, but since I don't have many weights in my neural network, the units go dead and it doesn't give good results.
Is there any other activation function, apart from sigmoid and tanh, that handles the vanishing gradient problem well enough for small neural networks?
Any suggestions on what else I can try?
In the deep learning world, ReLU is usually preferred over other activation functions because it mitigates the vanishing gradient problem, allowing models to learn faster and perform better. But it has its own downsides.
Dying ReLU problem
The dying ReLU problem refers to the scenario in which a large number of ReLU neurons only output the value 0. When most of these neurons output zero, gradients fail to flow through them during backpropagation and their weights do not get updated. Ultimately, a large part of the network becomes inactive and is unable to learn further.
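A tiny PyTorch illustration of the mechanism (the numbers are arbitrary):

```python
import torch

# A ReLU unit whose pre-activation is negative outputs 0 and, crucially,
# passes back a zero gradient, so its incoming weights get no update.
w = torch.tensor([0.5, -1.5], requires_grad=True)
x = torch.tensor([1.0, 2.0])

pre_activation = w @ x               # 0.5 - 3.0 = -2.5 (negative)
out = torch.relu(pre_activation)     # 0.0
out.backward()
print(out.item(), w.grad)            # 0.0 tensor([0., 0.])
```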
What causes the Dying ReLU problem?
High learning rate: if the learning rate is set too high, there is a significant chance that the weight updates will push the weights into the negative range, so the pre-activations become negative for most inputs.
Large negative bias: a large negative bias term can likewise cause the inputs to the ReLU activation to become negative.
How to solve the Dying ReLU problem?
Use of a smaller learning rate: it can be a good idea to decrease the learning rate during training.
Variations of ReLU: Leaky ReLU is a common and effective way to address the dying ReLU problem; it adds a slight slope in the negative range so some gradient can still flow. There are other variations such as PReLU, ELU and GELU (a short sketch comparing some of them follows this list). If you want to dig deeper, check out this link.
Modification of the initialization procedure: it has been demonstrated that using a randomized asymmetric initialization can help prevent the dying ReLU problem. Check out the arXiv paper for the mathematical details.
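Since the question mentions PyTorch, here is a minimal sketch of how the ReLU variants behave on negative inputs (the printed values are approximate):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])

relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)   # small slope for negative inputs
elu = nn.ELU(alpha=1.0)                     # smooth saturation toward -alpha

print(relu(x))    # tensor([0.0000, 0.0000, 0.0000, 1.0000])
print(leaky(x))   # tensor([-0.0200, -0.0050, 0.0000, 1.0000])
print(elu(x))     # tensor([-0.8647, -0.3935, 0.0000, 1.0000])
```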
Sources:
Practical guide for ReLU
ReLU variants
Dying ReLU problem
I'm interested in fitting a linear mixed model using the variational inference capabilities of TensorFlow Probability and Keras. However, I cannot find a straightforward answer on how to implement such an analysis. Using the regression example in TensorFlow Probability (see Case 3 here), I am able to grasp how to fit these models when we have only random variables in the model (the example is regression using a single feature). Following the radon example here, we have two features: floor (fixed) and county (random). My understanding is that the latter should only be passed to the DenseVariational layers, while the former can be passed to a regular Dense layer. So I guess I would have to jointly train two networks, one for the fixed and one for the random features, and somehow merge their outputs.
So my questions are:
(1) If these are fit jointly, can the same loss function be applied to both? I often see mean squared error used, while in VI the negative log likelihood is used (I think this is equivalent to maximizing the evidence lower bound).
(2) Does the input need to be split beforehand and fed as input to two networks? A rough sketch of the setup I have in mind is below.
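For concreteness, a sketch of the two-branch idea using the Keras functional API, with a regular Dense layer for the fixed effect and tfp.layers.DenseVariational for the random effect. The county count, observation count and prior/posterior choices are placeholders, not a vetted implementation:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

n_counties = 85    # placeholder: number of counties (one-hot encoded)
n_obs = 919        # placeholder: number of training observations

def posterior(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n], scale=1e-5 + tf.nn.softplus(t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

def prior(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1.0), reinterpreted_batch_ndims=1)),
    ])

floor_in = tf.keras.Input(shape=(1,), name="floor")
county_in = tf.keras.Input(shape=(n_counties,), name="county_onehot")

# Fixed effect: an ordinary Dense layer (point estimates of the weights).
fixed_eff = tf.keras.layers.Dense(1)(floor_in)
# Random effect: a variational layer that learns a posterior over the weights.
random_eff = tfp.layers.DenseVariational(
    1, make_posterior_fn=posterior, make_prior_fn=prior,
    kl_weight=1.0 / n_obs)(county_in)

# Merge the branches and model the response as a Normal distribution.
merged = tf.keras.layers.Add()([fixed_eff, random_eff])
out = tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1.0))(merged)

model = tf.keras.Model(inputs=[floor_in, county_in], outputs=out)

# A single loss for the whole model: the negative log likelihood; the KL term
# from DenseVariational is added automatically as a layer regularization loss.
negloglik = lambda y, rv_y: -rv_y.log_prob(y)
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss=negloglik)
```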
I have had an overview of how neural networks work and have come up with some interconnected questions to which I am not able to find an answer.
Consider a one-hidden-layer feedforward neural network: if the function for each of the hidden-layer neurons is the same,
a1 = relu(w1*x1 + w2*x2), a2 = relu(w3*x1 + w4*x2), ...,
How do we make the model learn different values of weights?
I do understand the point of manually established connections between neurons. As shown in the picture (manually established connections between neurons), that way we define the possible functions of functions (i.e., house size and number of bedrooms taken together might represent the possible family size which the house would accommodate). But the fully-connected network doesn't make sense to me.
I get the point that a fully-connected neural network should somehow automatically determine which functions of functions make sense, but how does it do that?
Not being able to answer this question, I also don't understand why increasing the number of neurons should increase the accuracy of the model's predictions.
How do we make the model learn different values of weights?
By initializing the parameters randomly before training starts. In a fully connected neural network, if every parameter started from the same value, every parameter would also receive the same update step at each iteration and the hidden units would never differentiate from one another - that is where your confusion is coming from. Initialization, either random or more sophisticated (e.g. Glorot), breaks this symmetry.
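A small PyTorch sketch of why constant initialization fails (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 2)
y = torch.randn(8, 1)

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))

# Force every weight to the same constant value instead of random values.
with torch.no_grad():
    for layer in (model[0], model[2]):
        layer.weight.fill_(0.1)
        layer.bias.fill_(0.0)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Every hidden unit receives exactly the same gradient, so after the update
# they are still identical; random initialization breaks this symmetry.
print(model[0].weight.grad)
```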
Why should increasing the number of neurons increase the accuracy of the model prediction?
This is only partially true: increasing the number of neurons should improve your training accuracy (it is a different game for your validation and test performance). By adding units, your model is able to store additional information or incorporate outliers, and hence improve the accuracy of its predictions on the training data. Think of a 2D problem (predicting house prices per sqm as a function of the sqm of a property). With two parameters you can fit a line, with three a curve, and so on; the more parameters, the more complex your curve can get and the closer it can pass through each of your training points.
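A toy numpy illustration of the "more parameters, more flexible curve" point (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
sqm = np.linspace(0.2, 2.0, 30)                      # property size (in 100 sqm)
price = 30 + 15 * sqm + rng.normal(0, 3, size=30)    # noisy toy prices

for degree in (1, 3, 5):                             # 2, 4 and 6 parameters
    coeffs = np.polyfit(sqm, price, deg=degree)
    mse = np.mean((price - np.polyval(coeffs, sqm)) ** 2)
    print(degree, mse)   # training error shrinks as the curve gets more flexible
```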
A great next step for a deep dive: Karpathy's lecture on Computer Vision at Stanford.
I'm currently studying TensorFlow 2.0 and Keras. I know that activation functions are used to calculate the output of each layer of a neural network, based on mathematical functions. However, when searching about layers, I can't find concise and easy-to-read information for a beginner in deep learning.
There's the Keras documentation, but I would like to know, in short:
what are the most common layers used to create a model (Dense, Flatten, MaxPooling2D, Dropout, ...)?
in which cases should each of them be used (classification, regression, other)?
what is the appropriate way to use each layer depending on the case?
Depending on the problem you want to solve, there are different activation functions and loss functions that you can use.
Regression problem: you want to predict the price of a building and you have N features. The price of the building is a real number, so you need mean_squared_error as the loss function and a linear activation for your last node. In this case, you can have a couple of Dense() layers with relu activation, while your last layer is Dense(1, activation='linear').
In between the Dense() layers, you can add Dropout() to mitigate overfitting (if present); a minimal sketch of such a model follows.
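A minimal Keras sketch of such a regression model (the layer widths and N are placeholders):

```python
import tensorflow as tf

N = 10   # placeholder: number of input features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(N,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear'),   # one real-valued output
])
model.compile(optimizer='adam', loss='mean_squared_error')
```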
Classification problem: you want to detect whether or not someone is diabetic, taking several factors/features into account. In this case, you can again use stacked Dense() layers, but your last layer will be Dense(1, activation='sigmoid'), since you want a yes/no answer for each patient. The loss function in this case is 'binary_crossentropy'. In between the Dense() layers, you can again add Dropout() to mitigate overfitting (if present).
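And a corresponding sketch for the binary classification case (feature count and widths are placeholders):

```python
import tensorflow as tf

n_features = 8   # placeholder: number of patient features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # probability of being diabetic
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```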
Image processing problems: here you typically have stacks of [Conv2D(), MaxPool2D(), Dropout()]. MaxPooling2D is an operation typical of image processing and also of some natural language processing (not going to expand on that here). Sometimes, in convolutional neural network architectures, a Flatten() layer is used. Its purpose is to reshape the feature maps into a 1D vector whose length equals the total number of elements across the entire feature map depth. For example, if you had a matrix of shape [28, 28], flattening it would turn it into (1, 784), where 784 = 28*28.
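A minimal convolutional sketch in the same spirit, assuming 28x28 grayscale images and 10 classes:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),                        # (13, 13, 32) -> 5408 values
    tf.keras.layers.Dense(10, activation='softmax'),  # one probability per class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```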
Although the question is quite broad and some people may vote to close it, I have tried to give you a short overview of what you asked. I recommend that you start by learning the basics behind neural networks and then delve deeper into using a framework such as TensorFlow or PyTorch.
I just read the book "Make Your Own Neural Network" and am now trying to create a NeuralNetwork class in Python. I use the sigmoid activation function. I wrote basic code and tried to test it, but my implementation didn't work properly at all. After long debugging and comparison with the code from the book, I found out that the sigmoid of a very big number comes out as 1 because Python rounds it. I use numpy.random.rand() to generate weights, and this function returns only values from 0 to 1, so after summing all the products of weights and inputs I get a very big number. I fixed this problem with the numpy.random.normal() function, which generates values centered around zero, so the weights can also be negative. But I have some questions.
Is sigmoid a good activation function?
What should I do if the output of a node is still so big that Python rounds the result to 1, which is impossible for a sigmoid?
How can I prevent Python from rounding floats that are very close to an integer?
Any advice for a beginner in neural networks (books, techniques, etc.)?
The answer to this question obviously depends on context, i.e. on what is meant by "good". The sigmoid activation function produces outputs between 0 and 1. As such, it is the standard output activation for binary classification, where you want your neural network to output a number between 0 and 1 - interpreted as the probability of your input belonging to the specified class. However, if you are using sigmoid activations throughout your neural network (i.e. in intermediate layers as well), you might consider switching to the ReLU activation function. Historically, sigmoid was used throughout neural networks as a way to introduce non-linearity so that a network could do more than approximate linear functions. However, it was found that sigmoid activations suffer heavily from the vanishing gradient problem because the function is so flat far from 0. As such, nowadays most intermediate layers use ReLU (or something fancier, e.g. SELU, Leaky ReLU, etc.). The ReLU activation function is 0 for inputs less than 0 and equals the input for inputs greater than 0; it has been found sufficient for introducing non-linearity into a neural network.
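A quick PyTorch check of the "flat far from 0" claim:

```python
import torch

for x0 in (0.0, 2.0, 10.0):
    x = torch.tensor(x0, requires_grad=True)
    torch.sigmoid(x).backward()
    print(f"sigmoid'({x0}) = {x.grad.item():.6f}")
# sigmoid'(0.0)  = 0.250000
# sigmoid'(2.0)  = 0.104994
# sigmoid'(10.0) = 0.000045   <- the gradient has all but vanished

x = torch.tensor(10.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad.item())   # 1.0 -> ReLU passes the gradient through unchanged
```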
Generally you don't want to be in a regime where your outputs are so huge or so small that computation becomes unstable. One way to fix this, as mentioned above, is to use a different activation function (e.g. ReLU). Another, perhaps even better, way is to initialize the weights more carefully, e.g. with the Xavier/Glorot initialization scheme, or simply to initialize them to smaller values, e.g. within the range [-0.01, 0.01]. Basically, you scale the random initializations so that your outputs land in a reasonable range of values rather than some gigantic or minuscule number. You can certainly do both.
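A numpy sketch of the difference the initialization scale makes (784 inputs chosen to mirror MNIST pixels scaled to [0, 1)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in = 784                      # e.g. one MNIST image, pixels scaled to [0, 1)
x = rng.random(n_in)

# Weights from numpy.random.rand() are all positive, so the weighted sum is
# in the hundreds and the sigmoid saturates to 1.0 in floating point.
w_uniform = rng.random(n_in)
print(sigmoid(w_uniform @ x))   # 1.0

# Glorot-style scaling (std ~ 1/sqrt(fan_in)) keeps the sum around zero.
w_scaled = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=n_in)
print(sigmoid(w_scaled @ x))    # a sensible value strictly between 0 and 1
```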
You can use higher-precision floats to make Python keep more decimals around, e.g. np.float64 instead of np.float32. However, this increases the computational cost and probably isn't necessary; most neural networks today use 32-bit floats and work just fine. See points 1 and 2 for better ways of solving your problem.
This question is overly broad, but I would say that the Coursera course and specialization by Prof. Andrew Ng is my strongest recommendation for learning neural networks.