I am learning a model from examples $\{((x_{i1},x_{i2},\dots,x_{ip}),y_i)\}_{i=1,\dots,N}$ using a feed-forward multilayer perceptron (newff) from the Python library neurolab. I expect the output of the NN to be positive for any further simulation of the NN.
How can I make sure that the results of simulation of my learned NN are always positive?
(How do I do that in neurolab?)
Simply use a standard sigmoid/logistic activation function on the output neuron. sigmoid(x) > 0 for all real-valued x, so that should do what you want.
By default, many neural network libraries will use either linear or symmetric sigmoid outputs (which can go negative).
Just note that it takes longer to train networks with a standard sigmoid output function. It's usually better in practice to let the values go negative and instead transform the outputs from the network into the range [0, 1] after the fact: shift up by the minimum, then divide by the range (max minus min).
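For neurolab specifically, something like the following sketch should work (I'm assuming the `transf` argument of `newff` and the `LogSig` transfer function behave as in the library's docs; the toy data is made up):

```python
import numpy as np
import neurolab as nl

# Toy data: p = 2 input features, targets in [0, 1].
x = np.random.uniform(-1, 1, (100, 2))
y = ((x[:, 0] ** 2 + x[:, 1] ** 2) / 2.0).reshape(-1, 1)

# newff(minmax_per_input, layer_sizes, transfer_functions):
# TanSig on the hidden layer, LogSig on the output layer, so every
# simulated output lies strictly in (0, 1) and is therefore positive.
net = nl.net.newff([[-1, 1], [-1, 1]], [5, 1],
                   [nl.trans.TanSig(), nl.trans.LogSig()])
net.train(x, y, epochs=200, show=100)

out = net.sim(x)
print(out.min() > 0)  # True: LogSig outputs are always positive
```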
I am taking the intro to ML course on Coursera offered by Duke, which I recommend if you are interested in ML. The instructors of this course explained that "We typically include nonlinearities between layers of a neural network. There's a number of reasons to do so. For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." I am curious: if I apply ReLU, aren't we losing information, since ReLU transforms every negative value to 0? How is this transformation more expressive than having no ReLU at all?
I tried running a multilayer perceptron on the MNIST dataset without the ReLU transformation, and the performance didn't change much (92% with ReLU and 90% without). Still, I am curious why this transformation gives us more information rather than losing information.
The first point is that without nonlinearities, such as the ReLU function, a neural network is limited to performing linear combinations of its input. In other words, the network can only learn linear relationships between the input and output. This means that the network can't approximate complex functions that are not linear, such as polynomials or other non-linear relationships.
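To make this concrete, here is a small numpy check (toy matrices, purely for illustration) that two stacked linear layers are exactly equivalent to one linear layer, while a ReLU in between breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                       # a toy input
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5,))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3,))

# Two linear layers with no activation in between...
two_linear = W2 @ (W1 @ x + b1) + b2
# ...collapse into a single linear layer with weights W2@W1 and bias W2@b1+b2.
one_linear = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear, one_linear))      # True

# With a ReLU in between, no single linear layer reproduces the mapping.
relu = lambda z: np.maximum(z, 0)
with_relu = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(two_linear, with_relu))       # False (in general)
```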
Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.
The ReLU function increases the expressive capacity of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. The ReLU function is defined as f(x) = max(0, x), which sets all negative values to zero. By switching units on and off in this way, ReLU carves the input space into multiple linear regions, which lets the network represent more complex, piecewise-linear functions.
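As a tiny illustration of those piecewise-linear regions (a made-up example): a sum of just two ReLU units already represents |x|, which no single linear layer can produce:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

x = np.linspace(-3, 3, 7)
# Hidden layer: two ReLU units with input weights +1 and -1, zero bias.
h = np.stack([relu(x), relu(-x)])
# Output layer: sum the two units -> |x|, piecewise linear but not linear.
y = h.sum(axis=0)
print(np.allclose(y, np.abs(x)))   # True
```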
For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation function. The first layer can only perform a linear transformation on the input, while the second layer can perform a non-linear transformation. By having a non-linear function in the second layer, the network can learn more complex representations of the data.
In the case of your experiment, it's normal that the performance did not change much when you removed the ReLU function, because the dataset and the problem you were trying to solve might not be complex enough to require a ReLU function. In other words, a linear model might be sufficient for that problem, but for more complex problems, ReLU can be a critical component to achieve good performance.
It's also important to note that ReLU is not the only way to introduce non-linearity; other non-linear activation functions, such as sigmoid and tanh, can be used as well. The choice of activation function depends on the problem and dataset you are working with.
Neural networks are inspired by the structure of the brain. Neurons in the brain transmit information between different areas using electrical impulses and chemical signals. Some signals are strong and some are not, and neurons with weak signals are not activated.
Neural networks work in the same fashion. Some input features carry weak signals and some carry strong ones, depending on the features. If a signal is weak, the related neurons aren't activated and don't transmit the information forward. We know that some features or inputs aren't crucial players in contributing to the label; for the same reason, we don't bother with heavy feature engineering in neural networks, since the model takes care of it. Activation functions help here: they tell the model which neurons should fire and how much information they should transmit.
I'm a bit of a beginner in the art of machine learning. Here is a rather conceptual question I've been wondering:
Suppose I have a function X -> Y, say y = x^2. Then, by generating enough (x, y) data, I can train a neural network to perform regression on the function and get x^2 for any input x. This is basically also what the Universal Approximation Theorem suggests.
Now, my question is: what if I want the inverse relation, Y -> X? In this case, X is a multi-valued function of Y; for instance, for y > 0, x = ±sqrt(y). I can swap X and Y as input/output data to train the network alright, but for any given y there should be a 1/2 - 1/2 chance that x = sqrt(y) or x = -sqrt(y). Of course, if one trains it with mean squared error, the network doesn't know this is a multi-valued function; it just follows SGD on the loss and outputs x = 0, the average value, for any given y.
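Here is the quick check I mean (a plain linear least-squares fit stands in for the network; the data is made up):

```python
import numpy as np

# Toy data for the inverse problem: input y = x**2, target x = +/- sqrt(y).
x = np.random.uniform(-1, 1, 10000)
y = x ** 2

# The best any model trained with mean-squared error can do, for each y, is
# predict the conditional mean of x -- which is 0 here by symmetry. A plain
# linear least-squares fit of x on y already shows the collapse:
A = np.stack([y, np.ones_like(y)], axis=1)
coef, *_ = np.linalg.lstsq(A, x, rcond=None)
print(coef)   # both coefficients ~ 0, i.e. the fit predicts x ~ 0 for every y
```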
Therefore, I wonder whether there is any way a neural network can model a multi-valued function. For instance, my guesses would be:
(1) The neural network could output a collection of, say, the top 2 possible values for X and be trained with cross-entropy. The problem is, if X is a vector or even a matrix (like a bitmap image) instead of a number, we don't know how many solutions x a given y has (there could very well be infinitely many, i.e. a continuous range), so a "list" of possible values and probabilities won't work; ideally the neural network should output values randomly and continuously distributed across the possible X solutions.
(2) Perhaps this falls into the realm of probabilistic neural networks (PNN)? Can a PNN model functions whose output follows a given probability distribution (continuous or discrete) over vectors? If so, is it possible to implement a PNN with popular frameworks like TensorFlow + Keras?
(Also, note that this is different from a "multivariate" function, where X and Y can be multi-component vectors; a traditional network can still easily train on that. The actual problem here is that the output should be a probability distribution over vectors, which a simple feed-forward network doesn't capture, since it has no inherent randomness.)
Thank you for your kind help!
[Image: the forward function y = x², which can easily be modeled by a regression network.]
[Image: the inverse relation x = ±√y; the network cannot capture the two-valued relation and outputs the average value x = 0 for any y.]
Try to read the following paper:
https://onlinelibrary.wiley.com/doi/abs/10.1002/ecjc.1028
Mifflin's algorithm (or its more general version SLQP-GS) mentioned in this paper is available here, and the corresponding paper with a description is here.
I'm trying to build an agent that can play Pocket Tanks using RL. The problem I'm facing now is how to train a neural network to output the correct power and angle, i.e., regression instead of action classification.
In order to output the correct power and angle, all you need to do is go into your neural network architecture and change the activation of your last layer.
In your question, you stated that you are currently using an action classification output, so it is most likely a softmax output layer. We can do two things here:
If the power and angle have hard constraints, e.g. the angle cannot be greater than 360° or the power cannot exceed 700 kW, we can change the softmax output to a tanh output (hyperbolic tangent) and multiply it by the constraint on power/angle. This creates a "scaling effect" because tanh's output is between -1 and 1; multiplying it by the power/angle constraint ensures the constraints are always satisfied and the output is a valid power/angle.
If there are no constraints on your problem, we can simply delete the softmax output altogether. Removing the softmax means the output is no longer constrained between 0 and 1; the last layer of the neural network then simply acts as a linear mapping, i.e., y = Wx + b.
I hope this helps!
EDIT: In both cases, the loss function used to train your neural network can simply be MSE. Example: loss = (real_power - estimated_power)^2 + (real_angle - estimated_angle)^2
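For illustration, here is a rough sketch of the constrained (tanh) option with an MSE loss; I'm assuming TensorFlow/Keras, and the 700/360 bounds, state size, and layer sizes are made up:

```python
import numpy as np
import tensorflow as tf

MAX_POWER, MAX_ANGLE = 700.0, 360.0                     # assumed hard constraints
scale = np.array([MAX_POWER, MAX_ANGLE], dtype="float32")

inputs = tf.keras.Input(shape=(8,))                     # 8 state features (made up)
h = tf.keras.layers.Dense(64, activation="relu")(inputs)
raw = tf.keras.layers.Dense(2, activation="tanh")(h)    # each output in (-1, 1)
# Scale the tanh outputs up to the physical ranges so the constraints always hold.
outputs = tf.keras.layers.Lambda(lambda t: t * scale)(raw)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")             # MSE over (power, angle)

# Toy supervised data: random states -> random (power, angle) targets.
states = np.random.rand(256, 8).astype("float32")
targets = (np.random.rand(256, 2) * scale).astype("float32")
model.fit(states, targets, epochs=2, batch_size=32, verbose=0)
```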
I just read the book "Make Your Own Neural Network". Now I am trying to create a NeuralNetwork class in Python, using a sigmoid activation function. I wrote the basic code and tried to test it, but my implementation didn't work properly at all. After long debugging and comparison with the code from the book, I found out that the sigmoid of a very big number is 1, because Python rounds it. I use numpy.random.rand() to generate weights, and this function returns only values from 0 to 1, so after summing all the products of weights and inputs I get a very big number. I fixed this problem with the numpy.random.normal() function, which generates random numbers centered around 0 (e.g. mostly within (-1, 1) for a small standard deviation). But I have some questions:
1. Is sigmoid a good activation function?
2. What should I do if the output of a node is still so big that Python rounds the result to 1, which is impossible for a sigmoid?
3. How can I prevent Python from rounding floats that are very close to an integer?
4. Any advice for me as a beginner in neural networks (books, techniques, etc.)?
The answer to this question obviously depends on context and on what you mean by "good". The sigmoid activation function produces outputs between 0 and 1. As such, it is the standard output activation for binary classification, where you want your neural network to output a number between 0 and 1, interpreted as the probability of your input being in the specified class. However, if you are using sigmoid activation functions throughout your neural network (i.e. in intermediate layers as well), you might consider switching to the ReLU activation function. Historically, the sigmoid activation function was used throughout neural networks as a way to introduce non-linearity so that a neural network could do more than approximate linear functions. However, it was found that sigmoid activations suffer heavily from the vanishing gradients problem, because the function is so flat far from 0. As such, nowadays, most intermediate layers use ReLU activation functions (or something fancier, e.g. SELU/Leaky ReLU/etc.). The ReLU activation function is 0 for inputs less than 0 and equals the input for inputs greater than 0. It's been found to be sufficient for introducing non-linearity into a neural network.
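A quick numpy check of how flat the sigmoid's gradient gets compared with ReLU's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.0, 2.0, 5.0, 10.0])
grad_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
grad_relu = (z > 0).astype(float)                # derivative of ReLU (for z != 0)

print(grad_sigmoid)  # ~[0.25, 0.105, 0.0066, 4.5e-05] -> shrinks fast away from 0
print(grad_relu)     # [0., 1., 1., 1.] -> stays 1 for positive inputs
```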
Generally you don't want to be in a regime where your outputs are so huge or so small that computation becomes unstable. One way to fix this, as mentioned above, is to use a different activation function (e.g. ReLU). Another, and perhaps even better, way is to initialize the weights better, e.g. with the Xavier-Glorot initialization scheme, or simply by initializing them to smaller values, e.g. within the range [-0.01, 0.01]. Basically, you scale the random initializations so that your outputs are in a reasonable range of values and not some gigantic or minuscule number. You can certainly also do both.
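A minimal numpy sketch of the difference (the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 784, 100

# Naive init: uniform in [0, 1) -> huge pre-activations, sigmoid saturates at 1.
w_naive = rng.random((n_in, n_out))
# Xavier/Glorot-style init: scale shrinks with the layer's fan-in and fan-out.
limit = np.sqrt(6.0 / (n_in + n_out))
w_glorot = rng.uniform(-limit, limit, (n_in, n_out))

x = rng.random((32, n_in))                 # a batch of inputs in [0, 1]
print(np.abs(x @ w_naive).mean())          # huge pre-activations (hundreds)
print(np.abs(x @ w_glorot).mean())         # modest pre-activations (around 1 or less)
```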
You can use higher-precision floats to make Python keep more decimals around, e.g. np.float64 instead of np.float32. However, this increases the computational cost and probably isn't necessary; most neural networks today use 32-bit floats and work just fine. See points 1 and 2 for better ways of solving your problem.
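For example, with an input of 20 to the sigmoid:

```python
import numpy as np

z = 20.0
for dtype in (np.float32, np.float64):
    s = 1 / (1 + np.exp(np.asarray(-z, dtype=dtype)))
    print(dtype.__name__, s)
# float32 prints 1.0 (rounded to exactly 1),
# float64 prints ~0.9999999979, still distinguishable from 1
```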
This question is overly broad, but I would say that the Coursera course and specialization by Prof. Andrew Ng are my strongest recommendation for learning neural networks.
I have a neural network with one input, three hidden neurons and one output. I have 720 input and corresponding target values, 540 for training, 180 for testing.
When I train my network using the logistic sigmoid or tangent sigmoid (tanh) activation function, I get the same output while testing, i.e. I get the same number for all 180 output values. When I use a linear activation function, I get NaN, because apparently the values get too high.
Is there any activation function to use in such a case? Or any improvements to be done? I can update the question with details and code if required.
Neural nets are not stable when fed input data on arbitrary scales (such as between approximately 0 and 1000 in your case). If your output units are tanh they can't even predict values outside the range -1 to 1 (or 0 to 1 for logistic units)!
You should try recentering/scaling the data so it has zero mean and unit variance; this is called standard scaling in the data science community. Since it is a lossless transformation, you can revert back to your original scale once you've trained the net and predicted on the data.
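A minimal numpy sketch of that recipe (the 0-1000 input range and 0-400 target range are just stand-ins for your data):

```python
import numpy as np

X = np.random.uniform(0, 1000, size=(540, 1))   # raw inputs on a 0-1000 scale
y = np.random.uniform(0, 400, size=(540, 1))    # raw targets

# Standard scaling: zero mean, unit variance (store the stats to invert later).
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = y.mean(axis=0), y.std(axis=0)
X_scaled = (X - X_mean) / X_std
y_scaled = (y - y_mean) / y_std

# ...train the net on (X_scaled, y_scaled), predict, then undo the scaling:
pred_scaled = y_scaled                          # stand-in for the net's predictions
pred = pred_scaled * y_std + y_mean             # back on the original scale
print(np.allclose(pred, y))                     # True: the transform is lossless
```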
Additionally, a linear output unit is probably the best choice, as it makes no assumptions about the output space; I've also found tanh units to do much better on recurrent neural networks with low-dimensional input/hidden/output nets.
Newmu is right that scaling is probably the issue here; you need to scale your inputs to lie in the valid range. (Standardization to zero mean, unit variance, as they suggest, isn't a great choice, though, since it means about a third of your data will lie outside [-1, 1]....) I don't know about pybrain, but in scikit-learn you'd want sklearn.preprocessing.MinMaxScaler.
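For example (random data standing in for yours):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.uniform(0, 1000, size=(540, 1))   # stand-in for your raw inputs

scaler = MinMaxScaler(feature_range=(0, 1))     # or (-1, 1) for tanh units
X_scaled = scaler.fit_transform(X)              # fit on the training data only
print(X_scaled.min(), X_scaled.max())           # 0.0 1.0

X_back = scaler.inverse_transform(X_scaled)     # undo the scaling if needed
print(np.allclose(X_back, X))                   # True
```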
But, also, in the comments you said your dataset looks like this:
[Image: scatter plot of the data, horizontal axis = inputs, vertical axis = targets.]
So, when you see an input of 200, you have one training example saying the target is 80 and one saying it's 320; what do you want the net to say then? An "optimal" neural network (which may be hard to achieve) would predict 200 or so.
You may need to think about how to reframe your learning problem into a more consistent function from inputs to targets.