Understanding JAX argnums parameter in its gradient function - python

I'm trying to understand the behaviour of argnums in JAX's gradient function.
Suppose I have the following function:
def make_mse(x, t):
    def mse(w, b):
        return jnp.sum(jnp.power(x.dot(w) + b - t, 2)) / 2
    return mse
And I'm taking the gradient in the following way:
w_gradient, b_gradient = grad(make_mse(train_data, y), (0,1))(w,b)
argnums=(0, 1) in this case, but what does it mean? With respect to which variables is the gradient calculated? What would be the difference if I used argnums=0 instead?
Also, can I use the same function to get the Hessian matrix?
I looked at the JAX help section about it, but couldn't figure it out.

When you pass multiple argnums to grad, the result is a function that returns a tuple of gradients, equivalent to computing each gradient separately:
def f(x, y):
    return x ** 2 + x * y + y ** 2

df_dxy = grad(f, argnums=(0, 1))
df_dx = grad(f, argnums=0)
df_dy = grad(f, argnums=1)

x = 3.0
y = 4.25
assert df_dxy(x, y) == (df_dx(x, y), df_dy(x, y))
If you want to compute mixed second derivatives, you can do so by applying the gradient repeatedly:
d2f_dxdy = grad(grad(f, argnums=0), argnums=1)
assert d2f_dxdy(x, y) == 1
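To answer the Hessian part of the question: yes, JAX composes these transformations for you via jax.hessian. A minimal sketch on the same toy f (not the poster's mse):

```python
from jax import hessian

def f(x, y):
    return x ** 2 + x * y + y ** 2

# hessian(f, argnums=(0, 1)) differentiates twice and returns a nested
# tuple ((d2f/dx2, d2f/dxdy), (d2f/dydx, d2f/dy2))
H = hessian(f, argnums=(0, 1))(3.0, 4.25)
print(H)  # values: ((2, 1), (1, 2)) for this quadratic
```

Internally this is jacfwd(jacrev(f)), which is usually the efficient composition for Hessians.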

Related

Using tf.custom_gradient to calculate a Taylor's series approximation

I am supposed to calculate the Taylor series approximation of the function cos(x) + 1 using a TensorFlow custom gradient.
I wrote the following code:
def approx_cos_p1(x, n=7, dtype=tf.float32):
    """Return the approximation of cos(x) + 1 using a Taylor series expansion up to order n"""
    result = tf.constant(2, dtype)
    for i in range(1, n//2 + 1):
        if i % 2 == 1:
            num = tf.math.pow(x, i*2)
            den = math.factorial(i*2)
            result = tf.math.subtract(result, tf.math.divide(num, den))
        else:
            num = tf.math.pow(x, i*2)
            den = math.factorial(i*2)
            result = tf.math.add(result, tf.math.divide(num, den))
    return result
@tf.custom_gradient
def approx_cos_p1_custom_grad(x):
    def backward(dy):
        return approx_cos_p1(x)
    return x, backward

x = tf.Variable(3.0, dtype=tf.float32)
with tf.GradientTape(persistent=True) as t:
    output = approx_cos_p1_custom_grad(x)
print(t.gradient(output, x))
But according to the TensorFlow documentation, tf.custom_gradient should be used as follows:

# Establish an identity operation, but clip during the gradient pass
@tf.custom_gradient
def clip_gradients(y):
    def backward(dy):
        return tf.clip_by_norm(dy, 0.5)
    return y, backward

v = tf.Variable(2.0)
with tf.GradientTape() as t:
    output = clip_gradients(v * v)
print(t.gradient(output, v))  # calls "backward", which clips 4 to 2
The approx_cos_p1() function works perfectly.
The problem here is that the dy parameter of backward() is never passed into approx_cos_p1(), which does not match the TensorFlow documentation, yet I get the desired output of -3.15. When I do pass dy into approx_cos_p1(), I get an undesired output of 1.14.
Is my implementation of the function correct?
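For reference, a sketch of the pattern the documentation describes, on the exact function cos(x) + 1 rather than the Taylor approximation (my example, not the poster's code): the forward pass returns f(x) itself, and backward returns dy times the analytic derivative, so the upstream gradient dy is multiplied in rather than ignored.

```python
import tensorflow as tf

@tf.custom_gradient
def cos_p1(x):
    def backward(dy):
        # chain rule: scale the analytic derivative f'(x) = -sin(x) by dy
        return dy * (-tf.sin(x))
    return tf.cos(x) + 1.0, backward

x = tf.Variable(3.0)
with tf.GradientTape() as t:
    out = cos_p1(x)
g = t.gradient(out, x)
print(g)  # about -0.1411, i.e. -sin(3.0)
```

Returning x from the forward pass (as in the question) makes the function an identity op whose gradient has been replaced wholesale, which is a different thing from defining a custom gradient for approx_cos_p1.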

Predict function for multiple linear regression

I am trying to write a predict function for a homework problem. It takes the dot product of a matrix (x) and a vector (y) and inserts the results into a NumPy array:
def predict(x, y):
    y_hat = np.empty
    for j in range(len(y)):
        y_hat[i] = np.dot(x, y)
    return y_hat
There is an error message on y_hat[i] = np.dot(x,y)
There are two errors in the code:
numpy.empty() is a function that takes the shape as an argument. Here you must call it as np.empty([len(y), len(x)]) (if x is a matrix and y is a vector, np.dot(x, y) yields a vector of length len(x)). It produces a placeholder array for the np.dot() results.
The variable i is not defined; the loop variable is j.
so:
def predict(x, y):
    y_hat = np.empty([len(y), len(x)])
    for j in range(len(y)):
        y_hat[j] = np.dot(x, y)
    return y_hat
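A quick sanity check of the corrected function with toy data (my example, not from the question). Note that every row of y_hat receives the same vector np.dot(x, y), which is what the loop as written computes:

```python
import numpy as np

def predict(x, y):
    y_hat = np.empty([len(y), len(x)])
    for j in range(len(y)):
        y_hat[j] = np.dot(x, y)
    return y_hat

x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # shape (3, 2)
y = np.array([1.0, 1.0])     # shape (2,)

out = predict(x, y)
print(out.shape)  # (2, 3): len(y) rows, each of length len(x)
print(out[0])     # [ 3.  7. 11.] = np.dot(x, y)
```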

How to Use AutoGrad Packages?

I am trying to do a simple thing: use autograd to get gradients and do gradient descent:
import tangent

def model(x):
    return a*x + b

def loss(x, y):
    return (y - model(x))**2.0
After getting loss for an input-output pair, I want to get gradients wrt loss:
l = loss(1, 2)
# grad_a = gradient of loss wrt a?
a = a - grad_a
b = b - grad_b
But the library tutorials (for both autograd and tangent) don't show how to obtain the gradient with respect to the parameters a or b.
You can specify this with the second argument of the grad function:
def f(x, y):
    return x*x + x*y

f_x = grad(f, 0)  # derivative with respect to first argument
f_y = grad(f, 1)  # derivative with respect to second argument

print("f(2,3) = ", f(2.0, 3.0))
print("f_x(2,3) = ", f_x(2.0, 3.0))
print("f_y(2,3) = ", f_y(2.0, 3.0))
In your case, a and b should be inputs to the loss function, which passes them on to the model, so that the derivatives can be calculated with respect to them.
There was a similar question I just answered:
Partial Derivative using Autograd
Here this may help:
import autograd.numpy as np
from autograd import grad

def tanh(x):
    y = np.exp(-x)
    return (1.0 - y)/(1.0 + y)

grad_tanh = grad(tanh)
print(grad_tanh(1.0))

e = 0.00001
g = (tanh(1 + e) - tanh(1))/e
print(g)
Output:
0.39322386648296376
0.39322295790622513
Here is what you may create:
import autograd.numpy as np
from autograd import grad  # grad(f) returns f'

def f(x):  # tanh
    y = np.exp(-x)
    return (1.0 - y) / (1.0 + y)

D_f = grad(f)      # obtain gradient function
D2_f = grad(D_f)   # 2nd derivative
D3_f = grad(D2_f)  # 3rd derivative
D4_f = grad(D3_f)  # etc.
D5_f = grad(D4_f)
D6_f = grad(D5_f)

import matplotlib.pyplot as plt

plt.subplots(figsize=(9, 6), dpi=153)
x = np.linspace(-7, 7, 100)
plt.plot(x, list(map(f, x)),
         x, list(map(D_f, x)),
         x, list(map(D2_f, x)),
         x, list(map(D3_f, x)),
         x, list(map(D4_f, x)),
         x, list(map(D5_f, x)),
         x, list(map(D6_f, x)))
plt.show()
Output: a plot of tanh and its first six derivatives (image omitted).

Cost Function and Gradient Seem to be Working, but scipy.optimize functions are not

I'm working through my Matlab code from the Andrew Ng Coursera course and turning it into Python. I am working on non-regularized logistic regression; after writing my gradient and cost functions I needed something similar to fminunc, and after some googling I found a couple of options. They both return the same results, but those results do not match Andrew Ng's expected output. Others seem to get this to work correctly, but I'm wondering why my specific code returns the desired result for the cost and gradient pieces earlier in the code, yet not when using the scipy.optimize functions.
The data I'm using can be found at the link below;
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op

# Machine Learning Online Class - Exercise 2: Logistic Regression

# Load Data
# The first two columns contain the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) # 100 x 2
y = np.array(data.iloc[:,2]) # 100 x 1
y.shape = (len(y), 1)

# Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]

# ==================== Part 1: Plotting ====================
# We start the exercise by first plotting the data to understand
# the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()

def sigmoid(z):
    '''
    SIGMOID Compute sigmoid function
    g = SIGMOID(z) computes the sigmoid of z.
    Instructions: Compute the sigmoid of each value of z (z can be a matrix,
    vector or scalar).
    '''
    g = 1 / (1 + np.exp(-z))
    return g

def costFunction(theta, X, y):
    '''
    COSTFUNCTION Compute cost and gradient for logistic regression
    J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
    parameter for logistic regression and the gradient of the cost
    w.r.t. to the parameters.
    '''
    m = len(y) # number of training examples
    h = sigmoid(X.dot(theta)) # logistic regression hypothesis
    J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
    # h is 100x1, y is 100x1; these end up as two vectors we subtract from each other,
    # then we sum the values by rows
    # cost function for logistic regression
    return J

def gradient(theta, X, y):
    m = len(y)
    grad = np.zeros((theta.shape))
    h = sigmoid(X.dot(theta))
    for i in range(len(theta)): # number of rows in theta
        XT = X[:,i]
        XT.shape = (len(X),1)
        grad[i] = (1/m) * np.sum((h-y)*XT) # updating each row of the gradient
    return grad

# ============ Part 2: Compute Cost and Gradient ============
# In this part of the exercise, you will implement the cost and gradient
# for logistic regression. You need to complete the code in costFunction.m

# Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))

# Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))

# Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')

# Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]])
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')

result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X, y))
result[1]

Result = op.minimize(fun = costFunction,
                     x0 = initial_theta,
                     args = (X, y),
                     method = 'TNC',
                     jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta

test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
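The pitfall is easy to reproduce in isolation; a minimal sketch of the 1-d vs 2-d behaviour (toy arrays, my own example):

```python
import numpy as np

# Subtracting a 1-d array from a 2-d column array broadcasts into a full
# matrix instead of producing the elementwise result.
h = np.arange(5.0)     # shape (5,), like sigmoid(X.dot(theta)) with a 1-d theta
y = np.ones((5, 1))    # shape (5, 1), a 2-d column vector

print((h - y).shape)   # (5, 5): broadcasting, not elementwise

h2 = h.reshape(-1, 1)  # force a column, as in theta.reshape(-1, 1)
print((h2 - y).shape)  # (5, 1): what the cost/gradient code expects
```

So a cost function that silently receives a 1-d theta can return a value computed from a 100x100 matrix instead of a 100x1 vector, which is exactly the kind of wrong-but-not-crashing result the optimizer then minimizes.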
I have had similar issues with Scipy dealing with the same problem as you. As senderle points out, the interface is not the easiest to deal with, especially combined with the NumPy array interface... Here is my implementation, which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns the grad.ravel() which has shape (3,) again. This is important as doing otherwise caused an error message with various optimization methods in Scipy.optimize.
Note that different methods have different behaviours but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def CostFunc(theta, X, y):
    # Initializing variables
    m = len(y)
    J = 0
    grad = np.zeros(theta.shape)
    # Vectorized computations
    z = X @ theta
    h = sigmoid(z)
    J = (1/m) * ((-y.T @ np.log(h)) - (1 - y).T @ np.log(1 - h))
    return J

def Gradient(theta, X, y):
    # Initializing variables
    m = len(y)
    theta = theta[:, np.newaxis]
    grad = np.zeros(theta.shape)
    # Vectorized computations
    z = X @ theta
    h = sigmoid(z)
    grad = (1/m) * (X.T @ (h - y))
    return grad.ravel()  # <-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:, 0:2].values
m, n = X.shape
X = np.concatenate((np.ones(m)[:, np.newaxis], X), 1)
y = data1.iloc[:, -1].values[:, np.newaxis]
initial_theta = np.zeros((n + 1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome; this Scipy interface is a mystery to me. Thanks!

scipy.optimize.minimize() in place of gradient desc

Given a set of points, we'd like to find the straight line that fits the data as well as possible. I have implemented a version where the minimization of the cost function is done via gradient descent, and now I'd like to use the algorithm from scipy (scipy.optimize.minimize).
I tried this:
def gradDescVect(X, y, theta, alpha, iters):
    m = shape(X)[0]
    grad = copy(theta)
    for c in range(0, iters):
        error_sum = hypo(X, grad) - y
        error_sum = X.T.dot(error_sum)
        grad -= (alpha/m)*error_sum
    return grad

def computeCostScipy(theta, X, y):
    """Compute cost, vectorized version"""
    m = len(y)
    term = hypo(X, theta) - y
    print((term.T.dot(term) / (2 * m))[0, 0])
    return (term.T.dot(term) / (2 * m))[0, 0]

def findMinTheta(theta, X, y):
    result = scipy.optimize.minimize(computeCostScipy, theta, args=(X, y), method='BFGS', options={"maxiter": 5000, "disp": True})
    return result.x, result.fun
This actually works pretty well and gives results very close to the original gradient-descent version.
The only problem seems to be that minimize() (the fminunc() analogue) stops execution before reaching the minimum of the cost function.
I'm sure that computeCostScipy() works fine, and the iteration limit passed to minimize() is set far higher than the number of iterations it actually takes.
Output:
Optimization terminated successfully.
Current function value: 15.024985
Iterations: 2
Function evaluations: 16
Gradient evaluations: 4
Cost : 15.0249848024
Thteta : [ 0.15232531 0.93072285]
while the correct results are :
Cost : 4.48338825659
Theta : [[-3.63029144]
[ 1.16636235]]
See how close the results are (comparison plot omitted).
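Based on the earlier answers in this thread, one thing worth trying is keeping theta a flat 1-d array throughout and supplying an analytic gradient via jac=. A self-contained sketch on toy data (my example, not the question's setup), where BFGS then runs to the true minimum:

```python
import numpy as np
import scipy.optimize as op

# Toy least-squares line fit: theta stays 1-d and jac= supplies the gradient.
X = np.column_stack((np.ones(5), np.arange(5.0)))  # bias column + one feature
y = 2.0 * np.arange(5.0) - 3.0                     # points on the line y = 2x - 3

def cost(theta, X, y):
    r = X.dot(theta) - y          # residuals, shape (5,)
    return r.dot(r) / (2 * len(y))

def grad(theta, X, y):
    return X.T.dot(X.dot(theta) - y) / len(y)

res = op.minimize(cost, np.zeros(2), args=(X, y), method='BFGS', jac=grad)
print(res.x)  # close to [-3., 2.]
```

Without jac=, BFGS falls back to finite-difference gradients, which combined with shape bugs in the cost function can make it declare convergence after very few iterations.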
