Maxout activation function: implementation in NumPy for forward and backpropagation - Python

I am building a vanilla neural network from scratch using NumPy and trialling the model performance for different activation functions. I am especially keen to see how the 'Maxout' activation function would affect my model performance.
After doing some searching, I was not able to find a NumPy implementation, only the definition (https://ibb.co/kXCpjKc). The formula for forward propagation is clear: I take max(Z), where Z = w.T * x + b. But the derivative that I will need in backpropagation is not clear to me.
What does j = argmax(z) mean in this context? How do I implement it in NumPy?
Any help would be much appreciated! Thank you!

Changing any of the non-maximum values slightly does not affect the output, so their gradient is zero. The gradient is passed from the next layer only to the neuron that achieved the max (gradient = 1 in the link you provided). See this answer: https://datascience.stackexchange.com/a/11703.
In a neural network setting you would need the gradient with respect to each of the x_i, so you would need the full derivative. In the link you provided, only a partial derivative is defined. That partial derivative is a vector (almost all zeros, with a 1 at the position of the maximal neuron), so the full gradient becomes a matrix.
You can implement this in NumPy using np.argmax.
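Here is a minimal sketch of both the forward and backward pass. The shapes are an assumption on my part (W stacks k affine pieces per output unit, so the pre-activation z has shape (batch, units, k)); the function names are mine, not from any library.

import numpy as np

def maxout_forward(x, W, b):
    # x: (batch, d_in); W: (d_in, units, k); b: (units, k)
    z = np.einsum('bd,duk->buk', x, W) + b  # all k affine pieces per unit
    j = np.argmax(z, axis=-1)               # j = argmax(z): index of the winning piece
    a = np.max(z, axis=-1)                  # maxout activation, shape (batch, units)
    return a, (x, W, j, z.shape)

def maxout_backward(da, cache):
    # da: (batch, units), gradient arriving from the next layer
    x, W, j, z_shape = cache
    dz = np.zeros(z_shape)
    # Route the incoming gradient only to the argmax piece; every other
    # piece receives zero, exactly as described above.
    np.put_along_axis(dz, j[..., None], da[..., None], axis=-1)
    dx = np.einsum('buk,duk->bd', dz, W)    # gradient w.r.t. the layer input
    dW = np.einsum('bd,buk->duk', x, dz)
    db = dz.sum(axis=0)
    return dx, dW, db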

Related

Learn the matrix function of a sort of bilinear form (Neural Networks)

I am thinking about the problem of doing regression on a scalar function f: R^n -> R, where I have a set of training samples (x1, y1), ..., (xN, yN), with yi = f(xi).
I know I could in principle apply any neural network architecture to do regression on this function; however, I want to exploit a property I know it has when designing the network.
Precisely, I know that f(x) = x^T A(x) x for an n x n matrix-valued function A(x) which I do not know explicitly, but which I know is symmetric and positive definite.
Since I know this structure of the function, I think it is not efficient to apply a "standard" architecture to this problem. In fact, the problem looks like that of finding an approximation of a metric on R^n.
Since A(x) is symmetric positive definite, I thought of rewriting it as A(x) = B(x)^T B(x) for an unknown matrix-valued function B(x). The function then rewrites in a much simpler way: f(x) = x^T B(x)^T B(x) x = (B(x)x)^T (B(x)x) = |B(x)x|^2, where the only unknown is the matrix function B(x).
Now, is there some known architecture which is well suited for this context?
Generating the training data with B(x) constant, I solved the problem very easily by defining a single weight matrix to be optimized, and it works very well. However, if the matrix B(x) is x-dependent, I am not completely sure how to proceed.
Up to now I have implemented a network mapping R^n to R^{n^2}, whose output is reshaped into the n x n matrix B(x) to be learned. However, this works well only for simple choices of B(x), and it is still not clear to me why.
You can write the expression f(x) = |B(x)x|^2 in TensorFlow and solve for B via standard gradient-descent minimization. TensorFlow can minimize anything you can write in TensorFlow.
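As a rough sketch in the TF 1.x graph API (the hidden size, learning rate, and layer choices are placeholders I picked; the network emitting B(x) is the R^n -> R^{n^2} design described in the question):

import tensorflow as tf  # TF 1.x-style API assumed

n = 4
x = tf.placeholder(tf.float32, [None, n])
y = tf.placeholder(tf.float32, [None])

# Hypothetical small network mapping x to the n*n entries of B(x)
hidden = tf.layers.dense(x, 64, activation=tf.nn.relu)
B = tf.reshape(tf.layers.dense(hidden, n * n), [-1, n, n])

# f(x) = |B(x) x|^2, computed batch-wise
Bx = tf.squeeze(tf.matmul(B, tf.expand_dims(x, -1)), -1)
f = tf.reduce_sum(tf.square(Bx), axis=1)

loss = tf.reduce_mean(tf.square(f - y))
train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)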

TensorFlow: how to compute the gradient of the output with respect to the input?

Recently, I have been doing some experiments with a neural network D(x), where x is an input image batch of size 64. I want to compute the gradient of D(x) with respect to x. Should I do the computation as follows?
grad = tf.gradients(D(x), [x])
Thank you everybody!
Yes, you will need to use tf.gradients. For more details see https://www.tensorflow.org/api_docs/python/tf/gradients.
During the training of a neural network, the gradient is generally computed for a loss function with respect to the input. This is because the loss function is well defined, along with its gradient.
However, if you are talking about the gradient of your output D(x), which I assume is some set of vectors, you will need to define how that gradient is computed with respect to its input (i.e., the layer which generates the output).
The exact details of the implementation depend on the framework you are using.
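For concreteness, a small sketch in the TF 1.x graph API; the input shape and the one-layer stand-in for D are assumptions, not the asker's actual model:

import tensorflow as tf

x = tf.placeholder(tf.float32, [64, 28, 28, 1])  # hypothetical image batch
D_x = tf.layers.dense(tf.layers.flatten(x), 1)   # stand-in for D(x)

# tf.gradients returns a list with one gradient tensor per entry of the
# second argument; here grad has the same shape as x.
grad = tf.gradients(D_x, [x])[0]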

TensorFlow: passing a tensor to the optimizer's minimize function trains better

I am encountering something a bit strange (to me) in tensorflow and was hoping someone could shed some light on the situation.
I have a simple neural network that processes images. The cost function I am minimizing is the simple MSE.
At first I implemented the following:
cost = tf.square(DECONV - Y)
which I then passed to my optimizer as follows:
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)
I was able to obtain great results with this implementation. However, as I tried to implement a regularizer, I realized that I wasn't passing a scalar value to optimizer.minimize() but was in fact passing a tensor of shape [batch, dim_x, dim_y].
I changed my implementation to the following:
cost = tf.losses.mean_squared_error(Y, DECONV)
as well as many variations of this like:
cost = tf.reduce_mean(tf.square(tf.subtract(DECONV, Y)))
etc.
My issue is that with these new implementations of the MSE I am not able to even come close to the results I obtained using the original "wrong" implementation.
Is the original way a valid way to train? If so, how can I implement regularizers? If not, what am I doing wrong with the new implementations? Why can't I replicate the results?
Can you clarify what you mean by
"I was able to obtain great results [..]"?
I assume that you have a metric other than cost, this time an actual scalar, which enables you to compare the models trained by each method.
Also, have you tried adjusting the learning rate for the second method? I ask because my intuition is that when you ask TensorFlow to minimize a tensor (which has no mathematical meaning as far as I know), it minimizes the scalar obtained by summing over all the axes of the tensor. This is how tf.gradients works, and the reason I think this is happening. So maybe in the second method, if you multiply the learning rate by batch*dim_x*dim_y, you would get the same behavior as in the first method.
Even if this works, I don't think passing a tensor to the minimize function is a good idea: minimizing a d-dimensional value has no meaning, as there is no order relation in such spaces.
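One way to test this intuition (a sketch; the spatial dimensions and the placeholder standing in for your network output are assumptions): compare the gradients of the raw tensor, its sum, and its mean. If minimizing a tensor implicitly sums it, the first two should coincide, and the mean's gradient should be smaller by the factor batch*dim_x*dim_y.

import tensorflow as tf

Y = tf.placeholder(tf.float32, [None, 32, 32])       # hypothetical dims
DECONV = tf.placeholder(tf.float32, [None, 32, 32])  # stand-in for the network output

cost_tensor = tf.square(DECONV - Y)  # shape [batch, dim_x, dim_y]
g_tensor = tf.gradients(cost_tensor, [DECONV])[0]
g_sum = tf.gradients(tf.reduce_sum(cost_tensor), [DECONV])[0]
g_mean = tf.gradients(tf.reduce_mean(cost_tensor), [DECONV])[0]
# Expectation: g_tensor == g_sum elementwise, and g_mean is g_sum scaled
# down by the number of elements in the cost tensor.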

How can I compute the hessian of the loss function in MXNet?

I'm learning MXNet at the moment and I'm working on a problem using neural nets. I'm interested in observing the curvature of my loss function with respect to the network weights, but as best I can tell, higher-order gradients are not supported for neural network functions at the moment. Is there any (possibly hacky) way I could still do this?
You can follow the discussion here. The gist of it is that not all operators support higher-order gradients at the moment.
In Gluon you can try the following:
import mxnet as mx

with mx.autograd.record():
    output = net(x)
    loss = loss_func(output)
    # [z] is the parameter(s) you want second-order gradients for
    dz = mx.autograd.grad(loss, [z], create_graph=True)
dz[0].backward()  # now the actual parameters should have second-order gradients
Taken from this forum thread

Should I consider x0, threshold when using a Tensorflow placeholder function?

I am currently doing a mood classification project via supervised learning, using TensorFlow.
In machine learning theory, as you know, there is always an x0 term which equals +1. When making a placeholder for the input dataset, does the function automatically produce the x0 part, or should I add it manually?
Thanks
There are two ways of thinking about x0. Either your input has an extra dimension that always holds a 1, in which case a linear regression or a fully connected layer in a neural network is represented as:
out = W * in
where * is matrix-vector multiplication; or, more commonly, you do not add that extra dimension and instead model it as
out = W * in + b
This is, in part, to highlight the difference between W, which is how we "weight" the input, and b, which is how much we "shift" it (b is called a "bias" term). Another reason this representation is more desirable is that it is common to regularize W, but not b.
Now, back to your question: TensorFlow's neural network library models a fully connected layer in terms of a weight matrix and a bias vector, so you do not need to add an extra 1 to your input vector.
If you use low-level tensor operations instead of the high-level predefined layers, TensorFlow makes no assumptions about your input. If you want to express your model in terms of operations on a vector with an extra 1 in it, it is your responsibility to add that 1 to the vector; TensorFlow will not do it for you.
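To make the contrast concrete, here is a small sketch (the input size is arbitrary): tf.layers.dense creates and manages the bias for you, while the low-level version must append the extra 1 explicitly.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 10])

# High-level: the layer owns both W and b (use_bias=True by default),
# so no x0 column is needed in the input.
out = tf.layers.dense(x, 1)

# Low-level: to use the "extra 1" formulation, append it yourself.
ones = tf.ones([tf.shape(x)[0], 1])
x_aug = tf.concat([x, ones], axis=1)  # shape [batch, 11]
W = tf.get_variable("W", [11, 1])     # last row of W plays the role of the bias
out_manual = tf.matmul(x_aug, W)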
