Multi-variable gradient descent for an n-by-n linear transformation - python

I have a training set of data where both the input (data['qvec']) and output (data['tvec']) are normalized 300-dimensional vectors. I want to train the linear transform theta (a 300x300 matrix) to minimize the cost function:
from scipy.spatial.distance import cosine

def cost_function(data, theta):
    dists = [cosine(data.iloc[i]['qvec'].dot(theta), data.iloc[i]['tvec'])
             for i in data.index]
    return sum(dists) / len(data)
I am assuming that there will be an update function that is similar to multi-variable gradient descent. That is:
def update_theta(data, theta, alpha):
    for m in range(300):
        for n in range(300):
            cost = [(data.iloc[i]['qvec'].dot(theta) - data.iloc[i]['tvec']) * ????
                    for i in data.index]
            theta[m,n] = theta[m,n] - alpha/len(data) * sum(cost)
    return theta
I know that when theta is a 300x1 matrix, ???? is data.iloc[i]['qvec'][m], but what would it be for a 300x300 matrix? If my approach is way off, or if there is already a package for this, I'd also appreciate it if anyone could point me in the right direction.
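For the squared-error analogue of this cost, the answer follows from the chain rule: with residual r = qvec.dot(theta) - tvec (a 300-vector), the partial derivative of ½‖r‖² with respect to theta[m,n] is r[n] * qvec[m], so the ???? slot pairs the n-th component of the residual with the m-th input coordinate. The cosine distance has a messier gradient, so one option (a sketch of an alternative, not part of the original post; Q, T and their sizes are made-up stand-ins) is to let an autograd library such as PyTorch derive it:

import torch
import torch.nn.functional as F

# Stand-in data: N rows of 300-dim inputs/targets, as if stacked from
# data['qvec'] and data['tvec'].
Q = torch.randn(1000, 300)
T = torch.randn(1000, 300)

theta = torch.eye(300, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    pred = Q @ theta                                     # (N, 300)
    # mean cosine distance, matching cost_function above
    loss = (1 - F.cosine_similarity(pred, T, dim=1)).mean()
    loss.backward()                                      # fills theta.grad (300x300)
    optimizer.step()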

Related

Stochastic Gradient Descent vs. Gradient Descent for the x**2 function

I would like to understand the difference between SGD and GD using the simplest example function: y = x**2.
The GD function is here:
def gradient_descent(gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector
And in order to find the minimum of the x**2 function we do the following (the answer is almost 0, which is correct):
gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)
As I understand it, in classical GD the gradient is calculated exactly from all the data points. What are "all the data points" in the implementation I showed above?
And further, how should we modify this function to turn it into SGD? (SGD uses a single data point to calculate the gradient. Where is that "single point" in the gradient_descent function?)
The function minimized in your example does not depend on any data, so it is not helpful to illustrate the difference between GD and SGD.
Consider this example:
import numpy as np

rng = np.random.default_rng(7263)
y = rng.normal(loc=10, scale=4, size=100)

def loss(y, mean):
    return 0.5 * ((y - mean)**2).sum()

def gradient(y, mean):
    return (mean - y).sum()

def mean_gd(y, learning_rate=0.005, n_iter=15, start=0):
    """Estimate the mean of y using gradient descent"""
    mean = start
    for i in range(n_iter):
        mean -= learning_rate * gradient(y, mean)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean

def mean_sgd(y, learning_rate=0.005, n_iter=15, start=0):
    """Estimate the mean of y using stochastic gradient descent"""
    mean = start
    for i in range(n_iter):
        rng.shuffle(y)
        for single_point in y:
            mean -= learning_rate * gradient(single_point, mean)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean

mean_gd(y)
mean_sgd(y)
y.mean()
Two (very simplistic) versions of GD and SGD are used to estimate the mean of a random sample y. Estimating the mean is achieved by minimizing the squared loss.
As you understood correctly, in GD each update uses the gradient computed on the whole dataset and in SGD we look at a single random point at a time.

Python - Solve time-dependent matrix differential equation

I am facing some problems solving a time-dependent matrix differential equation.
The difficulty is that the time-dependent coefficient does not just follow some prescribed shape; rather, it is itself the solution of another differential equation.
Up until now, I have considered the trivial case where my coefficient G(t) is just G(t)=pulse(t) where this pulse(t) is a function I define. Here is the code:
import numpy as np
from scipy.integrate import solve_ivp

# Matrix differential equation
def Leq(t, v, pulse):
    v = v.reshape(4,4)                                # covariance matrix
    M = np.array([[-kappa, 0, E_0*pulse(t), 0],       # coefficient matrix
                  [0, -kappa, 0, -E_0*pulse(t)],
                  [E_0*pulse(t), 0, -kappa, 0],
                  [0, -E_0*pulse(t), 0, -kappa]])
    Driff = kappa*np.ones((4,4), float)               # constant term
    dv = M.dot(v) + v.dot(M) + Driff                  # solve dot(v) = Mv + vM^(T) + D
    return dv.reshape(-1)                             # return vectorized matrix

# Pulse shape
def Gaussian(t):
    return np.exp(-(t - t0)**2.0 / (tau**2.0))

# scipy solver
cov0 = np.zeros((4,4), float)    # initial matrix
cov0 = cov0.reshape(-1)          # vectorize initial matrix
Tmax = 20                        # max value for time
Nmax = 10000                     # number of steps
dt = Tmax/Nmax                   # time increment
t = np.linspace(0.0, Tmax, Nmax+1)
Gaussian_sol = solve_ivp(Leq, [min(t), max(t)], cov0, t_eval=t, args=(Gaussian,))
And I get a nice result. The problem is that it is not exactly what I need. What I need is dot(G(t)) = -kappa*G(t) + pulse(t), i.e. the coefficient is itself the solution of a differential equation.
I have tried to implement this differential equation in a sort of vectorized way in Leq by returning another parameter G(t) that would be fed to M, but I was getting problems with the dimensions of the arrays.
Any idea of how should I proceed?
Thanks,
In principle you have the right idea, you just have to split and join the state and derivative vectors.
def Leq(t, u, pulse):
    v = u[:16].reshape(4,4)    # covariance matrix
    G = u[16:].reshape(4,4)
    ...                        # compute dG and dv
    return np.concatenate([dv.flatten(), dG.flatten()])
The initial vector likewise has to be such a composite.
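Reading the question's G(t) as the scalar coefficient that replaces pulse(t) inside M, a minimal runnable version of that composite-state idea might look like this (the constants kappa, E_0, t0, tau are placeholder values, not taken from the post):

import numpy as np
from scipy.integrate import solve_ivp

kappa, E_0, t0, tau = 1.0, 1.0, 5.0, 1.0     # placeholder constants

def pulse(t):
    return np.exp(-(t - t0)**2 / tau**2)

def Leq(t, u):
    v = u[:16].reshape(4, 4)          # covariance matrix
    G = u[16]                         # scalar coefficient G(t)
    M = np.array([[-kappa, 0, E_0*G, 0],
                  [0, -kappa, 0, -E_0*G],
                  [E_0*G, 0, -kappa, 0],
                  [0, -E_0*G, 0, -kappa]])
    D = kappa * np.ones((4, 4))
    dv = M @ v + v @ M.T + D          # dot(v) = M v + v M^T + D
    dG = -kappa * G + pulse(t)        # dot(G) = -kappa G + pulse(t)
    return np.concatenate([dv.ravel(), [dG]])

u0 = np.concatenate([np.zeros(16), [0.0]])   # v(0) = 0, G(0) = 0
t = np.linspace(0.0, 20.0, 10001)
sol = solve_ivp(Leq, (t[0], t[-1]), u0, t_eval=t)
v_t = sol.y[:16].reshape(4, 4, -1)    # covariance matrices over time
G_t = sol.y[16]                       # coefficient over time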

How to compute second-order derivative with respect to a vector in python (pytorch?)

I am trying to write up a code that performs metropolized iid sampling, and I am having trouble computing the second-order derivative of a function with respect to a numpy ndarray. The code I have written so far is shown below:
import numpy as np
from scipy.optimize import newton

w = np.array([4, 5, 3, 6, 2])
y = np.array([1, 0, 0, 1, 1])
lamb = np.array([0.001, 0.0045, 0.0072, 0.0083, 0.0069])
tau = np.array([0.0002, 0.00045, 0.000378, 0.00467, 0.00235])

# define function f
def f(gamma):
    return (-1) * (np.dot(y, np.dot(w, gamma))
                   - np.sum(np.log(np.exp(np.dot(w, gamma)) + 1))
                   - np.sum((gamma - lamb) * (gamma - lamb) / (2 * tau * tau)))

# mode for the metropolized iid sampler
gamma0 = np.zeros_like(lamb)   # initial guess (the original snippet passed an undefined gamma here)
gamma_hat = newton(f, gamma0, fprime=None, args=(), tol=1.48e-08, maxiter=50, fprime2=None)
# not sure how to calculate the variance for the same sampler
What I want to do:
To calculate the variance for the sampler, I want to take the second-order derivative of the function f with respect to the vector variable gamma, and then compute the value of the second derivative after substituting gamma = gamma_hat.
Can I do this in Python? If needed, I am willing to use PyTorch to calculate the value of the second-order derivative.
Thank you,
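This is straightforward with torch.autograd.functional.hessian. A sketch under two assumptions of mine: f must return a scalar for a Hessian to be defined (the original f broadcasts y against the scalar np.dot(w, gamma), so the first term is summed explicitly here), and gamma_hat is whatever mode the root finder produced (a zero vector stands in below):

import torch

w    = torch.tensor([4., 5., 3., 6., 2.])
y    = torch.tensor([1., 0., 0., 1., 1.])
lamb = torch.tensor([0.001, 0.0045, 0.0072, 0.0083, 0.0069])
tau  = torch.tensor([0.0002, 0.00045, 0.000378, 0.00467, 0.00235])

def f(gamma):
    wg = torch.dot(w, gamma)          # scalar, as in the original np.dot(w, gamma)
    return -(torch.sum(y * wg)
             - torch.log(torch.exp(wg) + 1)
             - torch.sum((gamma - lamb)**2 / (2 * tau**2)))

gamma_hat = torch.zeros(5)            # stand-in for the mode found by newton
H = torch.autograd.functional.hessian(f, gamma_hat)   # (5, 5) second-derivative matrix
print(H)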

Gradient descent from scratch in python not working

I am trying to implement a gradient descent algorithm from scratch in python, which should be fairly easy. However, I have been scratching my head for quite a while over my code now, unable to make it work.
I generate data as follow:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
#Defining the x array.
x=np.array(range(1,100))
#Defining the y array.
y=10+2*x.ravel()
y=y+np.random.normal(loc=0, scale=70, size=99)
Then define the parameters:
alpha = 0.01 # Which will be the learning rate
NbrIter = 100 # Representing the number of iteration
m = len(y)
theta = np.random.randn(2,1)
and my GD is as follow:
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta) - y) )
What I get is a huge matrix, meaning that I have some problem with the linear algebra. However, I really fail to see where the issue is.
(Playing around with the matrices to try to get them to match, I reached a theta with the correct shape (2x1) using:
theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta).T - y).T )
But it does look wrong, and the actual values are way off: array([[-8.92647663e+148], [-5.92079000e+150]]).)
I guess you were hit by broadcasting. Variable y's shape is (99,). When y is subtracted from X @ theta, which is a column vector of shape (99, 1), y is broadcast to a row vector of shape (1, 99), so the result of the subtraction is a (99, 99) matrix. To fix this, reshape y as a column vector with y.reshape(-1, 1).
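A quick standalone demo of that silent broadcast, using the 99 points this question generates:

import numpy as np

col = np.zeros((99, 1))                       # like X @ theta
vec = np.zeros(99)                            # like y
print((col - vec).shape)                      # (99, 99) -- silent broadcast
print((col - vec.reshape(-1, 1)).shape)       # (99, 1)  -- after the fix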
Now, a few optimizations:
X.T @ ((X @ theta) - y[:,None])
can be rewritten as:
(X.T @ X) @ theta - (X.T @ y[:,None])
The most costly computation can be taken out of the loop:
XtX = X.T @ X
Xty = X.T @ y[:,None]
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)
Now the loop operates on a 2x2 matrix rather than a 99x2 one.
Let's take a look at convergence.
Assuming that X is constructed as X = np.column_stack((x, np.ones_like(x))), it is possible to check the condition number of the matrix:
np.linalg.cond(XtX)
Which produced:
13475.851490419038
It means that the ratio between the maximal and minimal eigenvalues is about 13k. Therefore, using an alpha larger than about 1/13k will likely result in bad convergence.
If you use alpha=1e-5 the algorithm will converge.
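For completeness, a minimal end-to-end sketch with the fixes applied (data generation copied from the question; the iteration count is my choice, and the ill-conditioned intercept direction still converges slowly, as the condition number predicts):

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 100)
y = 10 + 2*x + rng.normal(loc=0, scale=70, size=99)
y = y.reshape(-1, 1)                       # column vector, avoiding the broadcast bug
X = np.column_stack((x, np.ones_like(x)))  # (99, 2) design matrix

m = len(y)
alpha = 1e-5
theta = rng.standard_normal((2, 1))

XtX = X.T @ X                              # hoisted out of the loop
Xty = X.T @ y
for _ in range(500_000):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)

print(theta.ravel())                       # approaches the least-squares slope/intercept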
Good luck!

Why does Scipy's fit() function for the gamma distribution produce a totally dissimilar distribution?

I'm trying to use scipy's fit function to fit a gamma distribution to observed data. The histogram for the observed data is below:
The mean and variance are
print("mean",np.mean(observed_data)) # mean 0.427611176580073
print("Var",np.var(observed_data)) # Var 0.6898193689790143
However, if I use scipy.stats.gamma.fit() to fit this observed data to a gamma distribution and then sample again from that distribution, the mean and variance are totally different:
I know that scipy is fitting based on MLE, but I'm not understanding the intuition behind why these key statistics are so off - the mean and variance are totally different. In fact, I can get much better results simply running this through my own solver:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from typing import List

def fit_gamma_distribution(data: List[float]):
    mean = np.mean(data)
    variance = np.var(data)

    def equations(p):
        k, theta = p
        return (k * theta - mean, k * theta**2 - variance)

    solved_k, solved_theta = fsolve(equations, (1, 1))
    if np.isclose(np.array([solved_k * solved_theta]), np.array([mean]), rtol=0.01):
        return fsolve(equations, (1, 1))

k, theta = fit_gamma_distribution(observed_data)
new_dist = np.random.gamma(shape=k, scale=theta, size=len(observed_data))
plt.hist(new_dist, alpha=0.5, bins=40)
plt.hist(observed_data, alpha=0.2, bins=40)
plt.xlim(0, 5)
plt.title(f"New sampled distribution: μ = {round(np.mean(new_dist), 2)} observed μ = {round(np.mean(observed_data), 2)}")
It's a much better fit than the scipy one in my opinion. Why is this the case?
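One thing worth checking (an observation of mine, not from the post): scipy.stats.gamma.fit estimates a location parameter in addition to shape and scale, and an unconstrained loc can pull the MLE far away from a moment-matched fit. Pinning loc to 0, which is what the moment-based solver above implicitly assumes, often brings the two into agreement:

import numpy as np
from scipy import stats

# observed_data as in the question
shape, loc, scale = stats.gamma.fit(observed_data, floc=0)   # fix loc = 0
resampled = stats.gamma.rvs(shape, loc=loc, scale=scale, size=len(observed_data))
print(np.mean(resampled), np.var(resampled))   # compare with the observed moments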
