Softmax derivative in NumPy approaches 0 (implementation) - python

I'm trying to implement the softmax function for a neural network written in NumPy. Let h be the softmax value of a given signal i.
I've struggled to implement the softmax activation function's partial derivative.
I'm currently stuck at an issue where all the partial derivatives approach 0 as the training progresses. I've cross-referenced my math with this excellent answer, but my math does not seem to work out.
import numpy as np

def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    e_x = np.exp( signal )
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply( signal, 1 - signal ) + sum(
            # handle the off-diagonal values
            - signal * np.roll( signal, i, axis = 1 )
            for i in xrange(1, signal.shape[1] )
        )
    else:
        # Return the activation signal
        return signal
#end activation function
The signal parameter contains the input signal sent into the activation function and has the shape (n_samples, n_features).
# sample signal (3 samples, 3 features)
signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
The following code snippet is a fully working activation function and is only included as a reference and proof (mostly for myself) that the conceptual idea actually works.
from scipy.special import expit
import numpy as np

def sigmoid_function( signal, derivative=False ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )

    # Calculate activation signal
    signal = expit( signal )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply(signal, 1 - signal)
    else:
        # Return the activation signal
        return signal
#end activation function
Edit
The problem intuitively persists with simple single-layer networks. The softmax (and its derivative) is applied at the final layer.

This is an answer on how to calculate the derivative of the softmax function in a more vectorized numpy fashion. However, the fact that the partial derivatives approach zero might not be a math issue, and might just be a problem of the learning rate or the known dying weight issue with complex deep neural networks. Layers like ReLU help prevent the latter issue.
First, I've used the following signal (just duplicating your last entry) to make it 4 samples x 3 features, so it is easier to see what is going on with the dimensions.
>>> signal = np.array([[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]])
>>> signal.shape
(4, 3)
Next, you want to compute the Jacobian matrix of your softmax function. According to the cited page it is defined as -hi * hj for the off-diagonal entries (the majority of the matrix for n_features > 2), so let's start there. In numpy, you can efficiently calculate that Jacobian matrix using broadcasting:
>>> J = - signal[..., None] * signal[:, None, :]
>>> J.shape
(4, 3, 3)
The first signal[..., None] (equivalent to signal[:, :, None]) reshapes the signal to (4, 3, 1) while the second signal[:, None, :] reshapes the signal to (4, 1, 3). Then, the * just multiplies both matrices element-wise. Numpy's internal broadcasting repeats both matrices to form the n_features x n_features matrix for every sample.
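If the broadcasting is hard to picture, here is a minimal sketch (assuming signal is the (4, 3) softmax output defined above) that checks the broadcast product against an explicit per-sample outer-product loop:
import numpy as np

# broadcast version: one negated outer product per sample
J_broadcast = -signal[..., None] * signal[:, None, :]   # shape (4, 3, 3)

# explicit loop over samples for comparison
J_loop = np.empty_like(J_broadcast)
for m in range(signal.shape[0]):
    J_loop[m] = -np.outer(signal[m], signal[m])

print(np.allclose(J_broadcast, J_loop))  # True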
Then, we need to fix the diagonal elements:
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
The above lines extract the diagonal indices for an n_features x n_features matrix. It is equivalent to doing iy = np.arange(n_features); ix = np.arange(n_features). Then it replaces the diagonal entries with your definition hi * (1 - hi).
Last, according to the linked source, you need to sum across rows for each of the samples. That can be done as:
>>> J = J.sum(axis=1)
>>> J.shape
(4, 3)
Find below a summarized version:
if derivative:
    J = - signal[..., None] * signal[:, None, :]  # off-diagonal Jacobian
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = signal * (1. - signal)  # diagonal
    return J.sum(axis=1)  # sum across rows for each sample
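For completeness, a sketch of the whole activation function with this derivative branch dropped in (same structure as the function in the question, only the derivative part replaced):
import numpy as np

def softmax_function(signal, derivative=False):
    # Calculate activation signal
    e_x = np.exp(signal)
    signal = e_x / np.sum(e_x, axis=1, keepdims=True)

    if derivative:
        # Jacobian-based partial derivatives, summed across rows per sample
        J = -signal[..., None] * signal[:, None, :]    # off-diagonal entries
        iy, ix = np.diag_indices_from(J[0])
        J[:, iy, ix] = signal * (1. - signal)          # diagonal entries
        return J.sum(axis=1)
    else:
        # Return the activation signal
        return signal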
Comparison of the derivatives:
>>> signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
>>> e_x = np.exp( signal )
>>> signal = e_x / np.sum( e_x, axis = 1, keepdims = True )
Yours:
>>> np.multiply( signal, 1 - signal ) + sum(
...     # handle the off-diagonal values
...     - signal * np.roll( signal, i, axis = 1 )
...     for i in xrange(1, signal.shape[1] )
... )
array([[ 2.77555756e-17, -2.77555756e-17, 0.00000000e+00],
[ -2.77555756e-17, -2.77555756e-17, -2.77555756e-17],
[ 2.77555756e-17, 0.00000000e+00, 2.77555756e-17],
[ 2.77555756e-17, 0.00000000e+00, 2.77555756e-17]])
Mine:
>>> J = - signal[..., None] * signal[:, None, :]
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
>>> J.sum(axis=1)
array([[ 4.16333634e-17, -1.38777878e-17, 0.00000000e+00],
[ -2.77555756e-17, -2.77555756e-17, -2.77555756e-17],
[ 2.77555756e-17, 1.38777878e-17, 2.77555756e-17],
[ 2.77555756e-17, 1.38777878e-17, 2.77555756e-17]])

@Imanol Luengo's solution is wrong the moment he takes sums across rows.
@Harveyslash also makes a good point, since he noticed extremely low "gradients" in his solution: the NN won't learn, or will learn in the wrong direction.
We have 4 samples x 3 inputs
The thing is that softmax is not a scalar function that takes just 1 input, but 3 in this case. Remember that all the outputs for one sample must add up to one, and therefore you can't compute the value of one output without knowing the others. This implies that the gradient must be a square matrix, because we also need to take into account the partial derivative of each output with respect to each input.
TLDR: The output of the gradient of softmax(sample) in this case must be a 3x3 matrix.
This is correct:
J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
iy, ix = np.diag_indices_from(J[0])
J[:, iy, ix] = signal * (1. - signal) # diagonal
Up until this point Imanol uses fast vector operations to compute the Jacobian of the softmax function at 4 points, resulting in a 3x3 matrix stacked 4 times: 4 x 3 x 3.
Now I think what the OP really wants is dJdZ, the first step in ANN backprop:
dJdZ(4x3) = dJdy(4x3) * gradSoftmax[layer signal(4x3)](?,?)
The problem is that we usually (with sigmoid, ReLU, ... any scalar activation function) can compute the gradient as a stacked vector and then multiply it element-wise with dJdy, but here we have a stacked matrix. How can we marry the two concepts?
The vector can be seen as the non-zero elements of a diagonal matrix -> all this time we have been getting away with easy element-wise multiplication just because our activation function was scalar! For our softmax it's not that simple, and therefore we have to use matrix multiplication:
dJdZ(4x3) = dJdy(4 stacked 1x3) * anygradient[layer signal(4x3)](4 stacked 3x3)
Now we multiply each 1x3 dJdy vector by the 3x3 gradient of its sample, for each of the 4 samples, but the usual element-wise operations won't do this. We need to specify along which dimensions we multiply (use np.einsum). End result:
# For reference: 'mn,mnr->mr' = 4x3, 4x3x3 -> 4x3 (each 1x3 row of dJdy times its 3x3 Jacobian)
dJdZ = np.einsum('mn,mnr->mr', dJdy, gradient)
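As a sanity check, a small sketch (assuming a hypothetical dJdy of shape (4, 3), and the stacked 3x3 Jacobians built as above from the (4, 3) softmax output signal) showing that the einsum matches a per-sample vector-matrix product done with np.matmul:
import numpy as np

# hypothetical upstream gradient: one 1x3 row per sample
dJdy = np.random.randn(4, 3)

# stacked 3x3 softmax Jacobians, as computed earlier
gradient = -signal[..., None] * signal[:, None, :]
iy, ix = np.diag_indices_from(gradient[0])
gradient[:, iy, ix] = signal * (1. - signal)

dJdZ_einsum = np.einsum('mn,mnr->mr', dJdy, gradient)

# batched (1x3) @ (3x3) for every sample, then drop the singleton axis
dJdZ_matmul = np.matmul(dJdy[:, None, :], gradient)[:, 0, :]

print(np.allclose(dJdZ_einsum, dJdZ_matmul))  # True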

Related

Machine Learning - implementing a Gradient Descent in Python from Octave code

I am trying to implement from scratch a gradient descent function in Python which I have already implemented, and which works, in GNU Octave. Unfortunately I am stuck. I fiddled with it for a while and checked the NumPy documentation, but so far no luck.
I am aware of libraries such as scikit-learn, however my purpose is to learn to code such a function from scratch. Perhaps I am going about it the wrong way.
Below you will find all the code necessary to reproduce the error.
Thanks in advance for your help.
Actual result: test fails with error -> "ValueError: matmul: Input operand 0 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)"
Expected result: an array with values [5.2148, -0.5733]
Function gradientDescent() in Octave:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta - (alpha/m)*X'*(X*theta-y);
        J_history(iter) = computeCost(X, y, theta);
    end
Function gradient_descent() in python:
from numpy import zeros


def compute_cost(X, y, theta):
    m = len(y)
    ans = (X.T @ theta).T - y
    J = (ans @ ans.T) / (2 * m)
    return J[0, 0]


def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = zeros((num_iters, 1), dtype=int)
    for iter in range(num_iters):
        theta = theta - (alpha / m) @ X.T @ (X @ theta - y)
        J_history[iter] = compute_cost(X, y, theta)
    return theta
the test file: test_ml_utils.py
import unittest
import numpy as np

from ml.ml_utils import compute_cost, gradient_descent


class TestGradientDescent(unittest.TestCase):
    # TODO: implement tests for Gradient Descent function
    # [theta J_hist] = gradientDescent([1 5; 1 2; 1 4; 1 5],[1 6 4 2]',[0 0]',0.01,1000);

    def test_gradient_descent_00(self):
        X = np.array([[1, 5], [1, 2], [1, 4], [1, 5]])
        y = np.array([1, 6, 4, 2])
        theta = np.zeros(2)
        alpha = 0.01
        num_iter = 1000
        r_theta = np.array([5.2148, -0.5733])
        result = gradient_descent(X, y, theta, alpha, num_iter)
        self.assertEqual((round(result, 4), r_theta), 'Result is wrong!')


if __name__ == '__main__':
    unittest.main()
The __matmul__ operator @ in Python binds more tightly than -. That means you're trying to do matrix multiplication with the operands (alpha / m), which is a scalar, and X.T, which is actually a matrix. See operator precedence.
In the Octave code, (alpha/m)*X' is doing scalar multiplication, not matrix multiplication, so if you want that same behavior in Python, use * rather than @. This is because Octave overloads the * operator to perform scalar multiplication if one operand is a scalar, but matrix multiplication if both operands are matrices.
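A minimal sketch of the corrected loop, assuming the same variable names as in the question (the scalar step size is applied with *, and @ is reserved for array-array products):
def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    for _ in range(num_iters):
        # scalar (alpha / m) times the matrix product X.T @ (X @ theta - y)
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta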
Adding to Adam's answer (which is correct with regard to the error you're getting).
However, I wanted to add more generally that this code is meaningless to a reader without some sort of hint (whether programmatic or in the form of a comment) about what dimensions the different variables take.
E.g., there is a hint in the code that y is likely to be 2-dimensional, and you're using len to get its size. Just as an example of how this might fail silently, consider this:
>>> y = numpy.array([[1,2,3,4,5]])
>>> len( y )
1
whereas presumably you want
>>> numpy.shape( y )
(1, 5)
or
>>> numpy.size( y )
5
I note in your unit test that you're passing a rank-1 vector instead of a rank-2 one, so it turns out y is 1D instead of 2D, yet it still operates with the 2D X thanks to broadcasting. Your code therefore works despite the logic implied, but in the absence of making such things explicit, this is a runtime error waiting to happen.
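To illustrate, a small sketch with made-up arrays, just to show why len is a fragile way to count samples:
import numpy as np

y_1d = np.array([1, 6, 4, 2])            # rank-1 vector, as in the unit test
y_col = np.array([[1], [6], [4], [2]])   # explicit column vector, shape (4, 1)
y_row = np.array([[1, 6, 4, 2]])         # row vector, shape (1, 4)

print(len(y_1d), y_1d.shape)    # 4 (4,)
print(len(y_col), y_col.shape)  # 4 (4, 1)
print(len(y_row), y_row.shape)  # 1 (1, 4)  <- len silently gives the wrong count here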

Scipy minimize returns a higher value than minimum

As a part of multi-start optimization, I am running differential evolution (DE), the output of which I feed as initial values to scipy minimization with SLSQP (I need constraints).
I am testing the procedure on the Ackley function. Even in situations in which DE returns the optimum (zeros), scipy minimization deviates from the optimal initial value and returns a value higher than at the optimum.
Do you know how to make scipy minimize return the optimum? I noticed it helps to specify tolerance for scipy minimize, but it does not solve the issue completely. Scaling the objective function makes things worse. The problem is not present for COBYLA solver.
Here are the optimization steps:
import numpy as np
from scipy.optimize import differential_evolution, minimize

# Set up
x0min = -20
x0max = 20
xdim = 4
fun = ackley
bounds = [(x0min, x0max)] * xdim
tol = 1e-12

# Get a DE solution
result = differential_evolution(fun, bounds,
                                maxiter=10000,
                                tol=tol,
                                workers=1,
                                init='latinhypercube')

# Initialize at DE output
x0 = result.x

# Estimate the model
r = minimize(fun, x0, method='SLSQP', tol=1e-18)
which in my case yields
result.fun = -4.440892098500626e-16
r.fun = 1.0008238682246429e-09
result.x = array([0., 0., 0., 0.])
r.x = array([-1.77227927e-10, -1.77062108e-10, 4.33179228e-10, -2.73031830e-12])
Here is the implementation of the Ackley function:
def ackley(x):
    # Computes the value of the Ackley benchmark function.
    # ACKLEY accepts a matrix of size (dim, N) and returns a vector
    # FVALS of size (N,)
    #
    # Parameters
    # ----------
    # x : 1-D array of size (dim,) or a 2-D array of size (dim, N)
    #     Each row of the matrix represents one dimension.
    #     Columns therefore have the interpretation of different points at which
    #     the function is evaluated. N is the number of points to be evaluated.
    #
    # Returns
    # -------
    # fvals : a scalar if x is a 1-D array, or
    #     a 1-D array of size (N,) if x is a 2-D array of size (dim, N),
    #     in which each entry contains the function value for each column of x.
    n = x.shape[0]
    ninverse = 1 / n
    sum1 = np.sum(x**2, axis=0)
    sum2 = np.sum(np.cos(2 * np.pi * x), axis=0)

    fvals = (20 + np.exp(1) - (20 * np.exp(-0.2 * np.sqrt(ninverse * sum1)))
             - np.exp(ninverse * sum2))

    return fvals
Decreasing the "step size used for numerical approximation of the Jacobian" in the SLSQP options solved the issue for me.
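For SLSQP that step size is the eps entry of the options dict; a minimal sketch, reusing fun and x0 from the snippet above (the value 1e-10 is just an example to tune for your problem):
from scipy.optimize import minimize

# shrink the finite-difference step used to approximate the Jacobian
r = minimize(fun, x0, method='SLSQP', tol=1e-18,
             options={'eps': 1e-10})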

Linear Regression with gradient descent: two questions

I'm trying to understand Linear Regression with Gradient Descent and I do not understand this part in my loss_gradients function below.
import numpy as np


def forward_linear_regression(X, y, weights):

    # dot product weights * inputs
    N = np.dot(X, weights['W'])

    # add bias
    P = N + weights['B']

    # compute loss with MSE
    loss = np.mean(np.power(y - P, 2))

    forward_info = {}
    forward_info['X'] = X
    forward_info['N'] = N
    forward_info['P'] = P
    forward_info['y'] = y

    return loss, forward_info
Here is where I'm stuck in my understanding, I have commented out my questions:
def loss_gradients(forward_info, weights):

    # to update weights, we need: dLdW = dLdP * dPdN * dNdW
    dLdP = -2 * (forward_info['y'] - forward_info['P'])
    dPdN = np.ones_like(forward_info['N'])
    dNdW = np.transpose(forward_info['X'], (1, 0))
    dLdW = np.dot(dNdW, dLdP * dPdN)
    # why do we mix matrix multiplication and dot product like this?
    # Why not dLdP * dPdN * dNdW instead?

    # to update biases, we need: dLdB = dLdP * dPdB
    dPdB = np.ones_like(forward_info[weights['B']])
    dLdB = np.sum(dLdP * dPdB, axis=0)
    # why do we sum those values along axis 0?
    # why not just dLdP * dPdB ?
It looks to me like this code is expecting a 'batch' of data. What I mean by that is, it's expecting that when you do forward_info and loss_gradients, you're actually passing a bunch of (X, y) pairs together. Let's say you pass B such pairs. The first dimension of all of your forward info stuff will have size B.
Now, the answers to both of your questions are the same: essentially, these lines compute the gradients (using the formulas you predicted) for each of the B terms, and then sum up all of the gradients so you get one gradient update. I encourage you to work out the logic behind the dot product yourself, because this is a very common pattern in ML, but it's a little tricky to get the hang of at first.
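A small sketch of that equivalence (with made-up shapes: a batch of B = 4 samples and 2 features), showing that the single np.dot in dLdW is the same as summing the per-sample gradients:
import numpy as np

B, D = 4, 2                      # batch size and number of features (made up)
X = np.random.randn(B, D)
dLdP = np.random.randn(B, 1)     # one loss gradient per sample
dPdN = np.ones((B, 1))

# vectorized: one dot product over the whole batch
dLdW_dot = np.dot(X.T, dLdP * dPdN)          # shape (D, 1)

# explicit: per-sample gradients, then summed
dLdW_sum = sum(np.outer(X[b], (dLdP * dPdN)[b]) for b in range(B))

print(np.allclose(dLdW_dot, dLdW_sum))  # True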

scipy's sparse eigs yielding wrong eigenvectors

I am a bit confused regarding the following issue:
I am computing fixed points of a quantum channel, which means I want to compute the leading eigenvector of a specific matrix. The matrix is such that its dimensionality is n^2 x n^2, and it is defined in such a way that the leading eigenvector, reshaped to a matrix with shape n x n, is a positive matrix (self-adjoint with positive eigenvalues).
If I do this with scipy.sparse.linalg.eigs, however, I get wrong results. The exact computation (using scipy.linalg.eig) works fine. I tried playing around with the arguments k and ncv for the solver, but didn't get it working properly unless I set k=n**2, in which case eigs just refers to eig. This, however, won't work in the case I actually have in mind, where the channel (super_op in the script below) is encoded as a LinearOperator. So I rely on using eigs :/
Anybody any idea how to get this run properly?
Thanks to everybody in advance!
import numpy as np
from numpy.random import rand
from numpy import tensordot as td
from scipy.sparse.linalg import eigs
from scipy.linalg import eig
n = 16
d = 3
kraus_op = .5 - rand(n, d, n) + 1j * (.5 - rand(n, d, n))
super_op = td(kraus_op, kraus_op.conj(), [[1], [1]]).transpose(0, 2, 1, 3)
########
# Sparse
########
vals, vecs = eigs(super_op.reshape(n**2, n**2), k=n*(n-1), which='LM')
rho = vecs[:,0].reshape(n, n)
print('is self adjoint: ', np.allclose(rho, rho.conj().T))
super_op_times_rho = td(super_op, rho, [[2, 3], [0, 1]])
print('super_op(rho) == lambda * rho :', np.allclose(rho, super_op_times_rho/vals[0]))
########
# Exact
########
vals, vecs = eig(super_op.reshape(n**2, n**2))
rho = vecs[:,0].reshape(n, n)
print('is self adjoint: ', np.allclose(rho, rho.conj().T))
super_op_times_rho = td(super_op, rho, [[2, 3], [0, 1]])
print('super_op(rho) == lambda * rho :', np.allclose(rho, super_op_times_rho/vals[0]))
the result is:
is self adjoint: False
super_op(rho) == lambda * rho : True
is self adjoint: True
super_op(rho) == lambda * rho : True
For completeness:
Python 3.5.2
numpy 1.16.1
scipy 1.2.1
After all I found the solution with some help of my colleagues:
While eig gives the eigenvectors (1) sorted (by magnitude) and (2) such that the first entry is real, eigs uses a different ordering that need not match that of eig, and it also does not regularize the complex phase of the eigenvectors. Correcting the phase is easily done by dividing the tensor by the phase of its first entry (which corresponds to ensuring that the first diagonal element is real, removing the freedom of choosing the complex phase of an eigenvector and making Hermiticity possible...).
So the corrected code snippet for the sparse case is:
vals, vecs = eigs(super_op.reshape(chi**2, chi**2), k=chi*(chi-1), which='LM')
# find the index corresponding to the largest eigenvalue
arg = np.argmax(np.abs(vals))
rho = vecs[:,arg].reshape(chi, chi)
# regularize the output array
rho *= np.abs(rho[0, 0])/rho[0, 0]
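The phase fix itself is the generic trick of rotating an eigenvector by a global phase so that a chosen reference entry becomes real and positive; a minimal sketch on an arbitrary complex vector (the values here are purely illustrative):
import numpy as np

v = np.array([1 + 1j, 2 - 0.5j, -0.3 + 2j])     # some complex eigenvector

phase = v[0] / np.abs(v[0])                     # e^{i*theta} of the reference entry
v_fixed = v / phase                             # reference entry is now real and positive

print(v_fixed[0])                               # approx (1.4142+0j)
print(np.allclose(np.abs(v), np.abs(v_fixed)))  # True: only the global phase changed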

Conflict between vectorization/broadcasting and solving an ODE with solve_ivp

Using a NumPy array and vectorization, I'm trying to create a population of n different individuals, with each individual having three properties: alpha, beta, and phenotype (the phenotype being calculated as the steady state of a differential equation that involves alpha and beta). So, I want each individual to have its own phenotype.
However, my code produces the same phenotype for every individual. Moreover, this unwanted behavior only occurs if there happen to be exactly n entries in solve_ivp's y0 array (which here is [0, 1]) -- otherwise, a broadcasting error is produced:
ValueError: operands could not be broadcast together with shapes (2,) (3,)
Here's the code:
import numpy as np
from scipy.integrate import solve_ivp


def create_population(n):
    """creates a population of n individuals"""
    pop = np.zeros(n, dtype=[('alpha', '<f8'), ('beta', '<f8'), ('phenotype', '<f8')])
    pop['alpha'] = np.random.randn(n)
    pop['beta'] = np.random.randn(n) + 5

    def phenotype(n):
        """creates the phenotype"""
        def pheno_ode(t_ode, y):
            """defines the ode for the phenotype"""
            dydt = 0.123 - y + pop['alpha'] * (y ** pop['beta'] / (1 + y ** pop['beta']))
            return dydt

        t_end = 1e06
        sol = solve_ivp(pheno_ode, [0, t_end], [0, 1], method='BDF')
        return sol.y[0][-1]  # last entry is assumed to be the steady state

    pop['phenotype'] = phenotype(n)
    return pop


popul = create_population(3)
print(popul)
In contrast, if the phenotype is calculated from alpha and beta via a "simple" equation, then vectorization works fine:
def phenotype(n):
    """creates the phenotype"""
    phenotype_simple = 2 * pop['alpha'] + pop['beta']
    return phenotype_simple
There are two problems that I can see:
First, you have the initial condition for the ODE set to [0, 1]. This sets the size of the solution vector for solve_ivp to 2, regardless of the value of n. However, the arrays pop['alpha'] and pop['beta'] have length n, and in your script you call create_population with n set to 3. So you have a mismatch in the array shapes in the formula for dydt: y has length 2, but pop['alpha'] and pop['beta'] have length 3. That causes the error that you see.
You can fix this by using, say, np.ones(n) instead of [0, 1] as the initial condition in your call to solve_ivp.
The second problem is in the statement return sol.y[0][-1] in the function phenotype(n). sol.y has shape (n, num_points), where num_points is the number of points computed by solve_ivp. So sol.y[0] is just the first component of the solution, and sol.y[0][-1] is the last value of the solution for the first component. It is a scalar, so when you execute pop['phenotype'] = phenotype(n), you are assigning the same value (the steady state of the first component) to all the phenotypes.
The return statement should be return sol.y[:, -1]. That returns the last column of the solution array (i.e. all the steady state phenotypes).
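Putting both fixes together, a sketch of just the inner phenotype function (assuming the surrounding create_population and pop from the question; np.ones(n) is one reasonable choice of initial condition):
    def phenotype(n):
        """creates one phenotype per individual"""
        def pheno_ode(t_ode, y):
            """defines the ode for the phenotype (one component per individual)"""
            return 0.123 - y + pop['alpha'] * (y ** pop['beta'] / (1 + y ** pop['beta']))

        t_end = 1e06
        # n initial values -> an n-component solution, one component per individual
        sol = solve_ivp(pheno_ode, [0, t_end], np.ones(n), method='BDF')
        # last column = all components at the final time (assumed steady state)
        return sol.y[:, -1]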
