Activation Function in Machine learning - python

What is meant by an activation function in machine learning? I have gone through most of the articles and videos, and all of them explain it in terms of, or compare it with, neural networks. I'm a newbie to machine learning and not that familiar with deep learning and neural networks. So, can anyone explain what exactly an activation function is, without explaining it in terms of neural networks? I got stuck on this ambiguity while learning about the sigmoid function for logistic regression.

It's rather difficult to describe activation functions without some reference to automated learning, because that's exactly their application, as well as the rationale behind a collective term. They help us focus learning in a stream of functional transformations. I'll try to reduce the complexity of the description.
Very simply, an activation function is a filter that alters an output signal (series of values) from its current form into one we find more "active" or useful for the purpose at hand.
For instance, a very simple activation function would be a cut-off score for college admissions. My college requires a score of at least 500 on each section of the SAT. Thus, any applicant passes through this filter: if they don't meet that requirement, the "admission score" is dropped to zero. This "activates" the other candidates.
Another common function is the sigmoid you studied: the idea is to differentiate the obviously excellent values (map them close to 1) from obviously undesirable values (map them close to 0), and preserve the ability to discriminate or learn about the ones in the middle (map them to something with a gradient useful for further work).
A third type might accentuate differences at the top end of a spectrum -- say, football goals and assists. In trying to judge relative levels of skill between players, we have to consider: is the difference between 15 and 18 goals in a season the same as between 0 and 3 goals? Some argue that the larger numbers show a greater differentiation in scoring skill: the more you score, the more opponents focus to stop you. Also, we might want to consider that there's a little "noise" in the metric: the first two goals in a season don't really demonstrate much.
In this case, we might choose an activation function for goals g such as
1.2 ^ max(0, g-2)
This evaluation would then be added to other factors to obtain a metric for the player.
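To make the three examples concrete, here is a minimal sketch in plain Python. The function names and the default threshold of 500 are only illustrative choices of mine; the formulas themselves are the ones described above.

```python
import math

def cutoff(score, threshold=500):
    """Admissions-style filter: scores below the threshold are dropped to zero."""
    return score if score >= threshold else 0

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def goal_activation(goals):
    """Ignore the first two goals, then grow exponentially: 1.2 ^ max(0, g - 2)."""
    return 1.2 ** max(0, goals - 2)

print(cutoff(480), cutoff(620))
print(sigmoid(-6), sigmoid(0), sigmoid(6))
# The same 3-goal gap counts for more at the top end than at the bottom.
print(goal_activation(3) - goal_activation(0))
print(goal_activation(18) - goal_activation(15))
```

None of these needs a neural network; each one simply reshapes a raw value into something more useful for the next step.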
Does this help explain things for you?

Activation functions are really important for an artificial neural network (ANN) to learn and make sense of complicated, non-linear functional mappings between the inputs and the response variable. They introduce non-linear properties to the network. Their main purpose is to convert the input signal of a node in an ANN to an output signal. That output signal is then used as an input to the next layer in the stack.
Specifically, in an ANN we compute the sum of products of the inputs (X) and their corresponding weights (W), and apply an activation function f(x) to it to get the output of that layer, which is fed as an input to the next layer.
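As a rough illustration of that sum-of-products step, here is a small sketch in plain Python. The bias term, the choice of sigmoid, and the example numbers are assumptions added for illustration; the answer above only mentions inputs, weights and f(x).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of inputs followed by an activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

def layer_output(inputs, weight_rows, biases, activation=sigmoid):
    """One output per neuron; the resulting list feeds the next layer."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]

# Two inputs -> hidden layer with two neurons -> one output neuron.
hidden = layer_output([1.0, 2.0], [[0.5, -0.25], [0.1, 0.2]], [0.0, 0.0])
output = neuron_output(hidden, [1.0, -1.0])
print(hidden, output)
```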
More info here

Simply put, an activation function is a function that is added into an artificial neural network in order to help the network learn complex patterns in the data. When comparing with a neuron-based model that is in our brains, the activation function is at the end deciding what is to be fired to the next neuron. That is exactly what an activation function does in an ANN as well. It takes in the output signal from the previous cell and converts it into some form that can be taken as input to the next cell.

Related

Weight prediction using NNs

I’m relatively new to the topic of machine learning, so naturally I have a couple of issues that I hope you can help me with or lead me in the right direction. I had a project before, during which we collected data of people walking normally and also with a stone in their shoe. We recorded accelerometer and gyroscope signals. Based on this data I built a neural network that classifies the signals as normal or impaired walking, so there are two possible outputs.
Now my idea is this: using the same data, I want to build a network that can predict the weights of the participants (these were also recorded).
Based on this my three questions:
- What kind of network structure is most suitable for such a task? (Dense, CNN, LSTM,…)
- Before the network basically had two options to answer from (normal or impaired walking) but now I have a continuous range of answers… How can this be approached?
- How can I make sure the network initializes with a sensible prediction?
I hope all the questions make sense. Any help will be much appreciated!
You can use whichever NN architecture you prefer:
If you work with sequences, use 1D convolutions or RNNs.
As you are dealing with a regression problem, the output layer should be a single neuron without an activation function (i.e. a linear output).
Take a look here to learn how to solve a regression problem with RNNs.
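A framework-free sketch of what "a single neuron without activation" means, together with one possible way to get a sensible first prediction: start the output bias at the mean of the training targets. That bias trick is a suggestion of mine rather than something from the answer above, and all the numbers below are made up for illustration.

```python
def linear_output(features, weights, bias):
    """Single output neuron with no activation: the raw weighted sum is the prediction."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def mean_squared_error(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Hypothetical body weights in kg, used only for illustration.
train_weights_kg = [62.0, 75.0, 81.0, 90.0]

# Assumption: output bias starts at the mean target and output weights at zero,
# so the untrained model already predicts a plausible value.
initial_bias = sum(train_weights_kg) / len(train_weights_kg)
initial_weights = [0.0, 0.0, 0.0]

features = [0.3, -1.2, 0.8]  # hypothetical outputs of the last hidden layer
first_prediction = linear_output(features, initial_weights, initial_bias)
baseline_mse = mean_squared_error(
    train_weights_kg, [first_prediction] * len(train_weights_kg))
print(first_prediction, baseline_mse)
```

Mean squared error (or mean absolute error) is the usual loss for this kind of continuous target, in place of the cross-entropy used for the two-class problem.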

A2C algorithm in tf.keras: actor loss function

I'm learning about Actor-Critic Reinforcement Learning techniques, in particular the A2C algorithm.
I've found a good description of a simple version of the algorithm (i.e. without experience replay, batching or other tricks) with implementation here: https://link.medium.com/yi55uKWwV2. The complete code from that article is available on GitHub.
I think I understand ok-ish what's happening here, but to make sure I actually do, I'm trying to reimplement it from scratch using higher-level tf.keras APIs. Where I'm getting stuck is how to implement the training loop correctly, and how to formulate the actor's loss function.
What is the correct way to pass action and advantage into the loss function?
Actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems to be wrong, but I'm not sure how.
The code I have so far: https://gist.github.com/nevkontakte/beb59f29e0a8152d99003852887e7de7
Edit: I suppose some of my confusion stems from a poor understanding of magic behind gradient computation in Keras/TensorFlow, so any pointers there would be appreciated.
First, credit where credit is due: information provided by ralf htp and Simon was instrumental in helping me to figure out the right answers eventually.
Before I go into detailed answers to my own questions, here's the original code I was trying to rewrite in tf.keras terms, and here's my result.
What is the correct way to pass action and advantage into a loss function in Keras?
There is a difference between what raw TF optimizer considers a loss function and what Keras does. When using an optimizer directly, it simply expects a tensor (lazy or eager depending on your configuration), which will be evaluated under tf.GradientTape() to compute the gradient and update weights.
Example from https://medium.com/#asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c:
# Below norm_dist is the output tensor of the neural network we are training.
loss_actor = -tfc.log(norm_dist.prob(action_placeholder) + 1e-5) * delta_placeholder
training_op_actor = tfc.train.AdamOptimizer(
    lr_actor, name='actor_optimizer').minimize(loss_actor)
# Later, in the training loop...
_, loss_actor_val = sess.run([training_op_actor, loss_actor],
                             feed_dict={action_placeholder: np.squeeze(action),
                                        state_placeholder: scale_state(state),
                                        delta_placeholder: td_error})
In this example, sess.run() evaluates the whole graph: it makes an inference, captures the gradient and adjusts the weights. So to pass whatever values you need into the loss function/gradient computation, you just feed them into the computation graph.
Keras is a bit more formal in what loss function should look like:
loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.
Keras will do the inference (forward pass) for you and pass the output into the loss function. The loss function is supposed to do some extra computation on the predicted value and y_true label, and return the result. This whole process will be tracked for the purpose of gradient computation.
Although it is very convenient for traditional training, this is a bit restrictive when we want to pass some extra data in, like TD error. It is possible to work around that and shove all the extra data into y_true, and pull it apart inside the loss function (I found this trick somewhere on the web, but unfortunately lost the link to source).
Here's how I rewrote the above in the end:
def loss(y_true, y_pred):
    action_true = y_true[:, :n_outputs]
    advantage = y_true[:, n_outputs:]
    return -tfc.log(y_pred.prob(action_true) + 1e-5) * advantage

# Below, in the training loop...
# A trick to pass TD error *and* actual action to the loss function: join them into a tensor
# and split them apart inside the loss function.
annotated_action = tf.concat([action, td_error], axis=1)
actor_model.train_on_batch([scale_state(state)], [annotated_action])
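To show what the pack-and-split trick computes without needing TensorFlow, here is a plain-Python sketch for a single sample and a one-dimensional action. The helper names (normal_pdf, pack, actor_loss) are made up for this illustration and are not part of Keras or of the code above; mu and sigma stand in for the distribution parameters the actor network would output.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def pack(action, advantage):
    """Join the taken action and its advantage (TD error) into one y_true row."""
    return [action, advantage]

def actor_loss(y_true_row, mu, sigma, n_outputs=1):
    """Split the packed row apart, then compute -log(prob(action)) * advantage."""
    action = y_true_row[:n_outputs][0]
    advantage = y_true_row[n_outputs:][0]
    return -math.log(normal_pdf(action, mu, sigma) + 1e-5) * advantage

row = pack(action=0.4, advantage=2.0)
loss_near = actor_loss(row, mu=0.4, sigma=1.0)  # policy mean on the taken action
loss_far = actor_loss(row, mu=3.0, sigma=1.0)   # policy mean far from the taken action
print(loss_near, loss_far)
```

With a positive advantage, the loss is lower when the policy assigns more probability to the action that was actually taken, which is the direction the gradient step pushes mu and sigma.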
Actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
When I asked this question, I didn't understand well enough how the TF compute graph works. So the answer is simple: every time sess.run() is invoked, it must compute the whole graph from scratch. Parameters of the distribution will be the same (or similar) as long as the graph inputs (e.g. observed state) and NN weights are the same (or similar).
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems to be wrong, but I'm not sure how.
What's wrong is the assumption "the actor's loss function doesn't care about y_pred" :) Actor's loss function involves norm_dist (which is action probability distribution), which is effectively an analog of y_pred in this context.
As far as I understand A2C, it is the machine learning implementation of activator-inhibitor systems, which are also called two-component reaction-diffusion systems (https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system). Activator-inhibitor models are important in many fields of science, as they describe pattern formation such as the Turing mechanism (simply search the net for activator-inhibitor model and you will find a vast amount of information; a very common application is predator-prey models). Also compare the graphic
source of graphic : https://www.researchgate.net/figure/Activator-Inhibitor-System_fig1_23671770/
with the explanatory graphic of the A2C algorithm in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
Activator-inhibitor models are closely linked to the theory of nonlinear dynamical systems (or 'chaos theory'). This also becomes obvious when comparing the bifurcation-tree-like structure in https://medium.com/#asteinbach/rl-introduction-simple-actor-critic-for-continuous-actions-4e22afb712 with the bifurcation tree of a nonlinear dynamical system such as the logistic map (https://en.wikipedia.org/wiki/Logistic_map; the logistic map is one of the simplest predator-prey or activator-inhibitor models). Another similarity is the sensitivity to initial conditions in A2C models, which is described as
This introduces in inherent high variability in log probabilities (log of the policy distribution) and cumulative reward values, because each trajectories during training can deviate from each other at great degrees.
in https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f and the curse of dimensionality appears also in chaos theory, i.e. in attractor reconstruction
From the viewpoint of systems theory, the A2C algorithm tries to adapt the initial value (start state) so that it ends up at a given endpoint when increasing the growth rate of a dynamical system such as the logistic map (the r-value is increased and the initial value (start state) is constantly re-adapted to choose the correct bifurcations (actions) in the bifurcation tree).
So A2C tries to numerically solve a chaos theory problem, namely finding the initial value for a given outcome of a nonlinear dynamical system in its chaotic region. Analytically, this problem is in most cases not solvable.
The actions are the bifurcation points in the bifurcation tree; the states are the future bifurcations.
Both, actions and states, are modeled by two coupled neural networks and this coupling of two neural nets is the great innovation of A2C algorithms.
In https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 is well documented keras code for implementing A2C, so you have a possible implementation there.
The loss function here is defined as the temporal difference (TD) function, that is, the exact difference between the state at the actual bifurcation point and the state at the estimated future one. However, this mathematically exact definition is prone to stochastic error (or noise), so the stochastic error is included in the definition of exact, because in the end machine learning is based on stochastic systems or error calculus, meaning systems that are composed of a deterministic and a stochastic component. To drive this error toward zero, stochastic gradient descent is used. In Keras this is simply implemented by choosing optimizer='sgd'.
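For reference, the one-step TD error as it is commonly written in actor-critic code looks roughly like the sketch below. The function names, the discount factor gamma and the example values are illustrative assumptions, not taken from the linked article.

```python
def td_error(reward, value_state, value_next_state, gamma=0.99, done=False):
    """One-step TD error: r + gamma * V(s') - V(s); V(s') is dropped at episode end."""
    target = reward if done else reward + gamma * value_next_state
    return target - value_state

def critic_loss(reward, value_state, value_next_state, gamma=0.99, done=False):
    """Squared TD error, the quantity gradient descent would drive toward zero."""
    return td_error(reward, value_state, value_next_state, gamma, done) ** 2

# The critic estimated V(s) = 0.5 and V(s') = 2.0, and the step gave a reward of 1.0.
print(td_error(1.0, 0.5, 2.0, gamma=0.5))
print(critic_loss(1.0, 0.5, 2.0, gamma=0.5))
```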
This interaction of actual and future step is implemented as memory on https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 in the function remember and this function also links the actor and the critic network (or activator and inhibitor network). This general structure of trial (action), call predict (TD function ), remember and train (i.e. stochastic gradient descent) is fundamental to all reinforcement learning algorithms, and is linked to the structure actual state, action, reward, new state :
The prediction code is also very much the same as it was in previous reinforcement learning algorithms. That is, we just have to iterate through the trial and call predict, remember, and train on the agent:
In this implementation, your first question is solved by applying remember on the critic and then training the critic with these values (this is in the main function). Training always evaluates the loss function, so action and reward are passed to the loss function by remember in this implementation:
actor_critic.remember(cur_state, action, reward, new_state, done)
actor_critic.train()
Regarding your second question: I am not sure, but I think this is achieved by the optimization algorithm (i.e. stochastic gradient descent).
Third question: In the predator-prey model the actor or activator is the prey, and the behavior of the prey is only determined by the size or capacity of the habitat (the amount of grass) and the size of the predator (inhibitor) population, so modelling it in this way is again consistent with nature or an activator-inhibitor system. In the main function in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69, too, only the critic or inhibitor / predator is trained.

Pytorch method for conditional use of intermediate layer instead of final cnn layer output. ie: allow nn to learn to use more layers or less

I'm implementing a residual CNN (a modified, smaller version of Xception) in a low-latency environment. I've done a lot of manual tuning to minimize the run time of my network (reducing the number of filters, removing layers, etc.).
But now I want to try allowing my network to make its classification prediction(final fcnn layer) on the residual connection after each residual block.
basic logic:
attempt final prediction with residual connection as input
if this fcnn layer predicts a certain class with a probability > a set threshold:
    return fcnn output as if it was normal final layer
else:
    do next residual block like normal and try the previous conditional again unless we are already at final block
My hope is this will allow my network to learn to solve easier problems with less computation while allowing it to still do the additional layers if it is still unsure of the classification.
So my basic question is: in PyTorch, what's the best way to implement this conditional so that my NN can decide at run time whether to do more processing or not?
Currently I've tested returning the intermediate x's after the blocks in the forward function, but I don't know how best to set up the conditional to choose which x to return.
Also, a side note: I believe I may end up needing another CNN layer between the residual and the fcnn, to convert the internal representation used for processing into a representation the fcnn understands for classification.
It has already been done and presented in ICLR 2018.
It appears as if in ResNets the first few bottlenecks learn representations (and therefore cannot be skipped) while the remaining bottlenecks refine the features and therefore can be skipped at a moderate loss of accuracy. (Stanisław Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, Yoshua Bengio Residual Connections Encourage Iterative Inference, ICLR 2018).
This idea was taken to the extreme with sharing weights across bottlenecks in Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, Jan Kautz IamNN: Iterative and Adaptive Mobile Neural Network for efficient image classification, ICLR 2018.
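The control flow described in the question can be sketched independently of any framework. Below, blocks and heads are plain Python callables standing in for the residual blocks and per-block classifier heads; in PyTorch the same loop would live inside the forward method of an nn.Module, and the data-dependent branch is easiest to reason about for one sample at a time. All names and the toy blocks are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_forward(x, blocks, heads, threshold=0.9):
    """Run blocks in order; stop as soon as a head is confident enough.

    Returns (class_probabilities, number_of_blocks_used).
    """
    for i, (block, head) in enumerate(zip(blocks, heads)):
        x = block(x)
        probs = softmax(head(x))
        is_last = i == len(blocks) - 1
        if max(probs) > threshold or is_last:
            return probs, i + 1

def double(features):
    """Toy stand-in for a residual block: sharpens the features."""
    return [2.0 * v for v in features]

def identity_head(features):
    """Toy stand-in for a classifier head: features are used as logits."""
    return list(features)

blocks = [double] * 4
heads = [identity_head] * 4

print(early_exit_forward([2.0, 0.0], blocks, heads))  # easy input
print(early_exit_forward([0.5, 0.0], blocks, heads))  # harder input
print(early_exit_forward([0.0, 0.0], blocks, heads))  # never confident
```

During training, one common option is to compute a loss on every head so that the early heads also learn to classify; the thresholded exit is then only applied at inference time.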

Neural network regression with multi-value (probabilistic) functions

I'm a bit of a beginner in the art of machine learning. Here is a rather conceptual question I've been wondering:
Suppose I have a function X->Y, say y=x^2, then, generating enough data of X->Y, I can train a neural network to perform regression on the function, and get x^2 with any input x. This is basically also what the Universal Approximation Theorem suggests.
Now, my question is, what if I want the inverse relation, Y->X? In this case, X is a multi-valued function of Y: for instance, for y>0, x=+-sqrt(y). I can swap X and Y as input/output data to train the network all right, but for any given y there should be a random 1/2 - 1/2 chance of x=sqrt(y) and x=-sqrt(y). But of course, if one trains it with mean squared error, the network wouldn't know this is a multi-valued function, and would just follow SGD on the loss function and get x=0, the average value, for any given y.
Therefore, I wonder if there is any way a neural network can model a multi-valued function? For instance, my guess would be
(1) the neural network can output a collection of, say, the top 2 possible values for X and train it with cross-entropy. The problem is, if X is a vector or even a matrix (like a bitmap image) instead of a number, we don't know how many solutions X a given Y has (which could very well be an infinite number, i.e. a continuous range), so a "list" of possible values and probabilities won't work - ideally the neural network should output values randomly and continuously distributed across possible X solutions.
(2) perhaps does this fall into the realm of probabilistic neural networks (PNN)? Does PNN model functions that support a given probabilistic distribution (continuous or discrete) of vectors as its output? If so, is it possible to implement PNN with popular frameworks like Tensorflow+Keras?
(Also, note that this is different from a "multivariate" function, which is the case where X,Y could be multi-component vectors, which is still something a traditional network can easily train on. The actual problem in question here is where the output could be a probabilistic distribution of vectors, which is something that a simple feed-forward network doesn't capture, since it doesn't have the inherent randomness.)
Thank you for your kind help!
Image of forward function Y=X^2 (can be easily modeled by network with regression)
Image of inverse function X=+-sqrt(Y) (the network cannot capture the two-value function and outputs the average value X=0 for any Y)
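The averaging effect described in the question can be checked without any network at all: for a fixed y, search for the single prediction that minimizes the mean squared error against both branches. This is a minimal sketch; the value y = 4 and the candidate grid are arbitrary illustrative choices.

```python
import math

def mse(prediction, targets):
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

y = 4.0
targets = [math.sqrt(y), -math.sqrt(y)]  # both branches are equally likely

# Try every prediction from -3.0 to 3.0 in steps of 0.1.
candidates = [i / 10 for i in range(-30, 31)]
best = min(candidates, key=lambda c: mse(c, targets))

print(best, mse(best, targets))        # the MSE-optimal single answer
print(mse(math.sqrt(y), targets))      # committing to one true branch scores worse
```

So under squared error the best single output is the mean of the two branches, even though that value is never a valid solution.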
Try to read the following paper:
https://onlinelibrary.wiley.com/doi/abs/10.1002/ecjc.1028
Mifflin's algorithm (or its more general version SLQP-GS) mentioned in this paper is available here and corresponding paper with description is here.

Sigmoid of x is 1

I just read the book "Make Your Own Neural Network". Now I am trying to create a NeuralNetwork class in Python. I use the sigmoid activation function. I wrote basic code and tried to test it, but my implementation didn't work properly at all. After long debugging and comparison with the code from the book, I found out that the sigmoid of a very big number is 1 because Python rounds it. I use numpy.random.rand() to generate weights, and this function returns only values from 0 to 1. After summing all products of weights and inputs I get a very big number. I fixed this problem with the numpy.random.normal() function, which generates random numbers from a range, (-1, 1) for example. But I have some questions.
1. Is sigmoid a good activation function?
2. What should I do if the output of a node is still so big that Python rounds the result to 1, which is impossible for sigmoid?
3. How can I prevent Python from rounding floats that are very close to an integer?
4. Any advice for me as a beginner in neural networks (books, techniques, etc.)?
1. The answer to this question obviously depends on context and on what is meant by "good". The sigmoid activation function produces outputs between 0 and 1. As such, it is the standard output activation for binary classification, where you want your neural network to output a number between 0 and 1 that is interpreted as the probability of your input being in the specified class. However, if you are using sigmoid activation functions throughout your neural network (i.e. in intermediate layers as well), you might consider switching to the ReLU activation function. Historically, the sigmoid activation function was used throughout neural networks as a way to introduce non-linearity, so that a neural network could do more than approximate linear functions. However, it was found that sigmoid activations suffer heavily from the vanishing gradients problem, because the function is so flat far from 0. As such, nowadays most intermediate layers use ReLU activation functions (or something even more fancy, e.g. SELU, Leaky ReLU, etc.). The ReLU activation function is 0 for inputs less than 0 and equals the input for inputs greater than 0. It has been found to be sufficient for introducing non-linearity into a neural network.
2. Generally you don't want to be in a regime where your outputs are so huge or so small that the computation becomes numerically unstable. One way to help fix this issue, as mentioned earlier, is to use a different activation function (e.g. ReLU). Another way, and perhaps an even better one, is to initialize the weights better, e.g. with the Xavier/Glorot initialization scheme, or simply to initialize them to smaller values, e.g. within the range [-.01, .01]. Basically, you scale the random initializations so that your outputs are in a good range of values and not some gigantic or minuscule number. You can certainly also do both.
3. Python is not rounding to an integer here; the true result is simply closer to 1 than the floating-point format can represent. You can use higher-precision floats to keep more decimals around, e.g. np.float64 instead of np.float32. However, this increases the computational cost and probably isn't necessary. Most neural networks today use 32-bit floats and they work just fine. See points 1 and 2 for better alternatives for solving your problem.
4. This question is overly broad. I would say that the Coursera course and specialization by Prof. Andrew Ng is my strongest recommendation in terms of learning neural networks.
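The saturation and initialization points in the answer above can be checked with a few lines of plain Python (Python floats are 64-bit). The layer sizes, the all-ones input and the seed are arbitrary illustrative choices; random.random() stands in for numpy.random.rand(), and the scaled range follows the Glorot/Xavier uniform formula sqrt(6 / (fan_in + fan_out)).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pre_activation(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

random.seed(0)
n_in, n_out = 100, 100
inputs = [1.0] * n_in

# Weights drawn from [0, 1): all positive, so the weighted sum grows with n_in.
naive_weights = [random.random() for _ in range(n_in)]

# Glorot/Xavier uniform initialization: symmetric range scaled by layer size.
limit = math.sqrt(6.0 / (n_in + n_out))
scaled_weights = [random.uniform(-limit, limit) for _ in range(n_in)]

z_naive = pre_activation(inputs, naive_weights)
z_scaled = pre_activation(inputs, scaled_weights)

print(sigmoid(30.0) < 1.0)   # still distinguishable from 1 in 64-bit floats
print(sigmoid(40.0) == 1.0)  # exp(-40) is too small to change 1.0 when added to it
print(z_naive, sigmoid(z_naive))
print(z_scaled, sigmoid(z_scaled))
```

With the all-positive weights the sum lands deep in the flat part of the sigmoid, where the gradient is effectively zero; with the scaled, symmetric weights it stays in a range where the sigmoid still responds to changes in its input.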
