Sigmoid of x is 1 - python

I just read the book "Make Your Own Neural Network" and am now trying to create a NeuralNetwork class in Python. I use the sigmoid activation function. I wrote basic code and tried to test it, but my implementation didn't work properly at all. After long debugging and comparison with the code from the book, I found out that the sigmoid of a very big number is 1, because Python rounds it. I use numpy.random.rand() to generate weights, and this function only returns values from 0 to 1. After summing all the products of weights and inputs, I get a very big number. I fixed this problem with the numpy.random.normal() function, which generates random numbers centred around 0, so they can also be negative (roughly in (-1, 1), for example). But I have some questions:
1. Is sigmoid a good activation function?
2. What should I do if the output of a node is still so big that Python rounds the result to 1, which is impossible for the sigmoid?
3. How can I prevent Python from rounding floats that are very close to an integer?
4. Any advice for a beginner in neural networks (books, techniques, etc.)?

The answer to this question obviously depends on the context, i.e. on what is meant by "good". The sigmoid activation function produces outputs between 0 and 1. As such, it is the standard output activation for binary classification, where you want your neural network to output a number between 0 and 1 that can be interpreted as the probability of your input belonging to the specified class. However, if you are using sigmoid activations throughout your neural network (i.e. in intermediate layers as well), you might consider switching to the ReLU activation function.

Historically, the sigmoid was used throughout neural networks as a way to introduce non-linearity, so that a neural network could do more than approximate linear functions. However, it was found that sigmoid activations suffer heavily from the vanishing-gradient problem, because the function is very flat far from 0. As such, nowadays most intermediate layers use ReLU activations (or something fancier, e.g. SELU, Leaky ReLU, etc.). The ReLU activation function is 0 for inputs less than 0 and equals the input for inputs greater than 0. It has been found to be sufficient for introducing non-linearity into a neural network.
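For reference, here is a minimal NumPy sketch of the two activation functions discussed above (not taken from the book, just for illustration):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU: 0 for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

x = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
print(sigmoid(x))  # saturates towards 0 and 1 at the extremes
print(relu(x))     # [ 0.  0.  0.  1. 20.]
```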
Generally you don't want to be in a regime where your outputs are so huge or so small that things become numerically unstable. One way to help fix this, as mentioned above, is to use a different activation function (e.g. ReLU). Another, and perhaps even better, way to help with this issue is to initialize the weights better, e.g. with the Xavier-Glorot initialization scheme, or simply by initializing them to smaller values, e.g. within the range [-0.01, 0.01]. Basically, you scale the random initializations so that your outputs are in a reasonable range of values rather than some gigantic or minuscule number. You can certainly also do both.
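As a rough sketch of how the initialization scale affects the pre-activations feeding a sigmoid (the layer sizes here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 784, 100  # illustrative layer sizes

# Naive: uniform in [0, 1) -- all-positive weights, so sums over ~784 terms easily saturate a sigmoid.
w_naive = rng.random((n_in, n_out))

# Small zero-centred values, roughly in [-0.01, 0.01].
w_small = rng.normal(0.0, 0.01, size=(n_in, n_out))

# Xavier/Glorot-style scaling: the variance shrinks with the number of inputs to the layer.
w_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

x = rng.normal(size=n_in)
for name, w in [("naive", w_naive), ("small", w_small), ("xavier", w_xavier)]:
    z = x @ w
    print(name, "typical pre-activation magnitude:", np.abs(z).mean().round(3))
```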
You can use higher-precision floats to make Python keep more decimals around, e.g. np.float64 instead of np.float32. However, this increases the computational cost and probably isn't necessary. Most neural networks today use 32-bit floats and work just fine. See points 1 and 2 for better ways of solving your problem.
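A quick check of the precision point, assuming the usual 1/(1+exp(-x)) formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x32 = np.float32(17.0)
x64 = np.float64(17.0)
print(sigmoid(x32))  # 1.0 exactly: exp(-17) is below float32 precision relative to 1
print(sigmoid(x64))  # ~0.9999999586, still distinguishable from 1 in float64
```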
This question is overly broad. I would say that the Coursera course and specialization by Prof. Andrew Ng is my strongest recommendation for learning about neural networks.


Why ReLU function after every layer in CNN?

I am taking the intro to ML course on Coursera offered by Duke, which I recommend if you are interested in ML. The instructors of this course explain that "We typically include nonlinearities between layers of a neural network. There's a number of reasons to do so. For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." What I am curious about is this: if I apply ReLU, aren't we losing information, since ReLU transforms every negative value to 0? How is this transformation more expressive than one without ReLU?
In a multilayer perceptron, I tried to run an MLP on the MNIST dataset without the ReLU transformation, and it seems that the performance didn't change much (92% with ReLU and 90% without ReLU). But still, I am curious why this transformation gives us more information rather than losing information.
The first point is that without nonlinearities, such as the ReLU function, a neural network is limited to performing linear combinations of the input. In other words, the network can only learn linear relationships between the input and output. This means that the network can't approximate complex functions that are not linear, such as polynomials or other non-linear mappings.
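A quick NumPy check of that collapse, with arbitrary layer sizes and biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # small batch of 5 inputs with 3 features (illustrative sizes)
W1 = rng.normal(size=(3, 4))   # "layer 1" weights
W2 = rng.normal(size=(4, 2))   # "layer 2" weights

# Two stacked linear layers with no nonlinearity in between...
y_two_layers = (x @ W1) @ W2

# ...are exactly one linear layer with the combined weight matrix W1 @ W2.
y_one_layer = x @ (W1 @ W2)

print(np.allclose(y_two_layers, y_one_layer))  # True: the stack collapses to a single linear map
```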
Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.
The ReLU function increases the expressiveness of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. The ReLU function is defined as f(x) = max(0, x), which sets all negative values to zero. By zeroing out negative values, the ReLU function creates multiple linear regions in the network, which allows the network to represent more complex functions.
For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation function. The first layer can only perform a linear transformation on the input, while the second layer can perform a non-linear transformation. By having a non-linear function in the second layer, the network can learn more complex representations of the data.
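In PyTorch terms (the layer sizes here are hypothetical), the only difference between the two variants is whether a nonlinearity sits between the Linear layers:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just to contrast the two variants.
linear_only = nn.Sequential(nn.Linear(2, 16), nn.Linear(16, 1))                # collapses to one linear map
with_relu = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))       # piecewise-linear, more expressive

x = torch.randn(4, 2)
print(linear_only(x).shape, with_relu(x).shape)  # both torch.Size([4, 1])
```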
In the case of your experiment, it's normal that the performance did not change much when you removed the ReLU function, because the dataset and the problem you were trying to solve might not be complex enough to require a ReLU function. In other words, a linear model might be sufficient for that problem, but for more complex problems, ReLU can be a critical component to achieve good performance.
It's also important to note that ReLU is not the only function that introduces non-linearity; other non-linear activation functions such as sigmoid and tanh can be used as well. The choice of activation function depends on the problem and dataset you are working with.
Neural networks are inspired by the structure of the brain. Neurons in the brain transmit information between different areas by using electrical impulses and chemical signals. Some signals are strong and some are not, and neurons with weak signals are not activated.
Neural networks work in a similar fashion. Some input features carry weak signals and some carry strong ones. If a signal is weak, the related neurons aren't activated and don't transmit the information forward. We know that some features or inputs aren't crucial players in contributing to the label; for the same reason, we don't bother as much with feature engineering in neural networks, since the model takes care of it. Activation functions help here: they tell the model which neurons should fire and how much information they should transmit.

When using the cross-entropy function for binary classification, a big gap between the model output scalar and a two-dimensional vector

When I use a U-Net for two-class semantic segmentation, I set the output of the last layer of the model to 1 channel and 2 channels respectively, and then measure with the corresponding cross-entropy losses: BCELoss and CrossEntropyLoss.
But the gap between the two is large: the performance of the former is normal, while the latter has a very low precision and a very high recall.
I am using PyTorch.
Mathematically, BCE loss (with logits) is just a special case of cross-entropy loss for the case of two classes.
Are you using a sigmoid or softmax at the output of the network? In PyTorch, CrossEntropyLoss takes the raw output of the last layer (there is no need to apply softmax to the output); that is done internally for numerical stability.
BCELoss only takes inputs between 0 and 1, so a sigmoid is needed there. However, PyTorch has BCEWithLogitsLoss, which applies the sigmoid for you; this version is more numerically stable.
One more thing that it seems you are not doing correctly (it would be nicer to have some minimal amount of code to better understand your problem): CrossEntropyLoss requires one channel per class, so if you have 2 classes, you have to give it an input with two channels. BCELoss (or its logits version) takes only one channel, with numbers between 0 and 1 after the sigmoid. If I understood correctly, you are somehow giving it 2 channels, and that will cause problems in training.
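For illustration, a minimal shape sketch of the two setups in PyTorch, with random tensors standing in for the network output and the ground-truth mask:

```python
import torch
import torch.nn as nn

N, H, W = 2, 4, 4  # tiny illustrative batch and spatial size

# Option A: 1-channel output + BCEWithLogitsLoss (applies the sigmoid internally).
logits_1ch = torch.randn(N, 1, H, W)
target_bce = torch.randint(0, 2, (N, 1, H, W)).float()   # 0/1 mask, same shape as the logits, float
loss_bce = nn.BCEWithLogitsLoss()(logits_1ch, target_bce)

# Option B: 2-channel output + CrossEntropyLoss (applies log-softmax internally).
logits_2ch = torch.randn(N, 2, H, W)
target_ce = torch.randint(0, 2, (N, H, W))               # class indices, no channel dimension, long
loss_ce = nn.CrossEntropyLoss()(logits_2ch, target_ce)

print(loss_bce.item(), loss_ce.item())
```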
My best guess is that the gap in performance between the two is due to misuse of the loss functions. The PyTorch documentation has improved a lot; I recommend spending a few minutes understanding the difference between those three loss functions: https://pytorch.org/docs/stable/nn.html#loss-functions

Activation Function in Machine learning

What is meant by "activation function" in machine learning? I have gone through most of the articles and videos, and everyone explains or compares it with neural networks. I'm a newbie to machine learning and not that familiar with deep learning and neural networks. So, can anyone explain to me what exactly an activation function is, without explaining it in terms of neural networks? I got stuck on this ambiguity while learning about the sigmoid function for logistic regression.
It's rather difficult to describe activation functions without some reference to automated learning, because that's exactly their application, as well as the rationale behind a collective term. They help us focus learning in a stream of functional transformations. I'll try to reduce the complexity of the description.
Very simply, an activation function is a filter that alters an output signal (series of values) from its current form into one we find more "active" or useful for the purpose at hand.
For instance, a very simple activation function would be a cut-off score for college admissions. My college requires a score of at least 500 on each section of the SAT. Thus, any applicant passes through this filter: if they don't meet that requirement, the "admission score" is dropped to zero. This "activates" the other candidates.
Another common function is the sigmoid you studied: the idea is to differentiate the obviously excellent values (map them close to 1) from the obviously undesirable values (map them close to 0), while preserving the ability to discriminate or learn about the ones in the middle (map them to something with a gradient useful for further work).
A third type might accentuate differences at the top end of a spectrum -- say, football goals and assists. In trying to judge relative levels of skill between players, we have to consider: is the difference between 15 and 18 goals in a season the same as between 0 and 3 goals? Some argue that the larger numbers show a greater differentiation in scoring skill: the more you score, the more opponents focus to stop you. Also, we might want to consider that there's a little "noise" in the metric: the first two goals in a season don't really demonstrate much.
In this case, we might choose an activation function for goals g such as
1.2 ^ max(0, g-2)
This evaluation would then be added to other factors to obtain a metric for the player.
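As a toy Python version of that goal-scoring activation:

```python
def goal_score(g):
    # 1.2 ** max(0, g - 2): flat for the first couple of goals, then growing faster the more a player scores.
    return 1.2 ** max(0, g - 2)

for goals in [0, 3, 15, 18]:
    print(goals, round(goal_score(goals), 2))
# The gap between 15 and 18 goals (about 10.7 vs 18.49) is much larger than between 0 and 3 (1.0 vs 1.2).
```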
Does this help explain things for you?
Activation functions are really important for an artificial neural network to learn and make sense of something really complicated: non-linear, complex functional mappings between the inputs and the response variable. They introduce non-linear properties into our network. Their main purpose is to convert the input signal of a node in an ANN into an output signal. That output signal is then used as an input to the next layer in the stack.
Specifically, in an ANN we take the sum of the products of the inputs (X) and their corresponding weights (W), apply an activation function f to it to get the output of that layer, and feed that as an input to the next layer.
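A small sketch of that computation, with made-up numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single layer: 3 inputs feeding 2 neurons.
x = np.array([0.5, -1.2, 3.0])        # inputs X
W = np.array([[0.1, -0.4],
              [0.7,  0.2],
              [-0.3, 0.5]])           # weights W, one column per neuron
b = np.array([0.0, 0.1])              # biases

output = sigmoid(x @ W + b)           # sum of products, then the activation f
print(output)                          # this becomes the input to the next layer
```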
Simply put, an activation function is a function that is added to an artificial neural network to help the network learn complex patterns in the data. Compared with the neuron-based model in our brains, the activation function is what decides, at the end, what is to be fired on to the next neuron. That is exactly what an activation function does in an ANN as well: it takes in the output signal from the previous cell and converts it into some form that can be taken as input by the next cell.

Python: neural net cost function stuck at ln(2)

I was just wondering if anyone else has encountered this behavior before; I couldn't find this specific issue being brought up in my searches. I wrote my own neural net (a very simple one) using just NumPy, and it appears to work: the cost function decreases as I iterate over it. However, when I change the random initialization of the weights from np.random.randn(shape) to np.random.randn(shape)*0.01 (I heard that using smaller initial weights might speed up learning because I'm using a sigmoid layer), my cost function starts at about 0.69 ≈ ln(2) and pretty much gets stuck there. This happens no matter how many times I restart the neural net, and no matter what kind of inputs I put into it. I find this very odd indeed. I should add that if I start off without the 0.01 multiplication factor, the cost function decreases to less than 0.69.
The neural net uses a cross-entropy cost function and takes steps with gradient descent. No regularization has been implemented. This behavior doesn't seem to depend on the neural network's dimensions (number of layers, neurons per layer) or on the learning rate, only on whether I initialize the starting weights with or without multiplying them by 0.01.
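For what it's worth, the 0.69 figure itself is just what binary cross-entropy gives when every sigmoid output sits at 0.5, which is what near-zero weights produce. A quick arithmetic check (this only explains the number, not the training behavior):

```python
import numpy as np

# Binary cross-entropy for a single prediction p against label y.
def bce(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# With near-zero weights, a sigmoid unit outputs almost exactly 0.5 for every example,
# and the cross-entropy cost then sits at ln(2) regardless of the labels.
print(bce(0.5, 0), bce(0.5, 1), np.log(2))  # 0.6931... 0.6931... 0.6931...
```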

Python Neurolab - fixing output range

I am learning a model from examples $\{((x_{i1},x_{i2},\dots,x_{ip}),y_i)\}_{i=1,\dots,N}$ using a feed-forward multilayer perceptron (newff) from the Python library neurolab. I expect the output of the NN to be positive in any further simulation of the NN.
How can I make sure that the results of simulating my learned NN are always positive?
(How do I do that in neurolab?)
Simply use a standard sigmoid/logistic activation function on the output neuron. sigmoid(x) > 0 for all real-valued x, so that should do what you want.
By default, many neural network libraries will use either linear or symmetric sigmoid outputs (which can go negative).
Just note that it takes longer to train networks with a standard sigmoid output function. It's usually better in practice to let the values go negative and instead transform the outputs from the network into the range [0, 1] after the fact: shift up by the minimum, then divide by the range (max - min).
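A generic sketch of that post-hoc transform in plain NumPy (not neurolab-specific):

```python
import numpy as np

# Post-hoc rescaling of raw network outputs into [0, 1]:
# shift up by the minimum, then divide by the range (max - min).
def rescale_01(y):
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())

raw_outputs = np.array([-2.3, 0.4, 1.7, 5.0])   # made-up simulation results
print(rescale_01(raw_outputs))                  # [0.     0.3698...  0.5479...  1.    ]
```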
