Optimize with PyMC3 - python

Problem Summary
I have been optimizing my function VectorizedVcdfe, and I am still trying to optimize it. This function is responsible for 99% of the slowness of another function customFunc. This customFunc is used in a PyMC3 code block.
Please help me optimize VectorizedVcdfe.
Function to optimize
def VectorizedVcdfe(self, x, dataVector, recip_h_times_lambda_vector):
n = len(dataVector)
differenceVector = x - dataVector
stackedDiffVecAndRecipVec = pymc3.math.stack(differenceVector, recip_h_times_lambda_vector)
erfcTerm = 1. - pymc3.math.erf(self.neg_sqrt1_2 * pymc3.math.prod(stackedDiffVecAndRecipVec, axis=0))
# Calc F_Hat
F_Hat = (1. / float(n)) * pymc3.math.sum(0.5 * erfcTerm)
# Return F_Hat
return(F_Hat)
Arguments/variables
x is a TensorVariable.
dataVector is a 1Xn numpy matrix.
recip_h_times_lambda_vector is also a 1Xn numpy matrix.
neg_sqrt1_2 is a scalar constant.
How customFunc is used
with pymc3.Model() as model:
# Create likelihood
like = pymc3.DensityDist('X', customFunc, shape=2)
# Make samples
step = pymc3.NUTS()
trace = pymc3.sample(2000, tune=1000, init=None, step=step, cores=2)
EDIT:
To answer commenters, random values are OK for both dataVector and
recip_h_times_lambda_vector for the purposes of doing this optimization. In reality, recip_h_times_lambda_vector is dependent on dataVector and a scalar parameter h.
Some commenters were wondering about customFunc, so here it is...
def customFunc(X):
Y = []
for j in range(2):
x_j = X[j]
F_x_j = fittedKdEstimator.VCDFE(x_j)
y_j = myPPF(F_x_j)
Y.append(y_j)
logLikelihood = 0.
recipSqrtTwoPi = 1. / math.sqrt(2. * math.pi)
for j in range(2):
y_j = Y[j]
logLikelihood += pymc3.math.log(recipSqrtTwoPi * pymc3.math.exp(y_j * y_j / -2.))
return(pymc3.math.exp(logLikelihood))
The global variable fittedKdEstimator is an instance of the class that contains the functions VectorizedVcdfe and VCDFE.
Here is the Python code for VCDFE...
def VCDFE(self, x):
if not self.beenFit: raise Exception("Must first fit to data")
return(self.VectorizedVcdfe(x, self.__dataVector, self.__recip_h_times_lambda_vector))
On a separate note, the function myPPF is my implementation of the standard normal "percent-point function" (AKA: "quantile function"). I have timed the customFunc, and myPPF takes a fraction of the entire time. The vast majority of time is consumed by VectorizedVcdfe.
Last but not least, a typical value for n may range from 10,000 to 100,000.

Related

Using python built-in functions for coupled ODEs

THIS PART IS JUST BACKGROUND IF YOU NEED IT
I am developing a numerical solver for the Second-Order Kuramoto Model. The functions I use to find the derivatives of theta and omega are given below.
# n-dimensional change in omega
def d_theta(omega):
return omega
# n-dimensional change in omega
def d_omega(K,A,P,alpha,mask,n):
def layer1(theta,omega):
T = theta[:,None] - theta
A[mask] = K[mask] * np.sin(T[mask])
return - alpha*omega + P - A.sum(1)
return layer1
These equations return vectors.
QUESTION 1
I know how to use odeint for two dimensions, (y,t). for my research I want to use a built-in Python function that works for higher dimensions.
QUESTION 2
I do not necessarily want to stop after a predetermined amount of time. I have other stopping conditions in the code below that will indicate whether the system of equations converges to the steady state. How do I incorporate these into a built-in Python solver?
WHAT I CURRENTLY HAVE
This is the code I am currently using to solve the system. I just implemented RK4 with constant time stepping in a loop.
# This function randomly samples initial values in the domain and returns whether the solution converged
# Inputs:
# f change in theta (d_theta)
# g change in omega (d_omega)
# tol when step size is lower than tolerance, the solution is said to converge
# h size of the time step
# max_iter maximum number of steps Runge-Kutta will perform before giving up
# max_laps maximum number of laps the solution can do before giving up
# fixed_t vector of fixed points of theta
# fixed_o vector of fixed points of omega
# n number of dimensions
# theta initial theta vector
# omega initial omega vector
# Outputs:
# converges true if it nodes restabilizes, false otherwise
def kuramoto_rk4_wss(f,g,tol_ss,tol_step,h,max_iter,max_laps,fixed_o,fixed_t,n):
def layer1(theta,omega):
lap = np.zeros(n, dtype = int)
converges = False
i = 0
tau = 2 * np.pi
while(i < max_iter): # perform RK4 with constant time step
p_omega = omega
p_theta = theta
T1 = h*f(omega)
O1 = h*g(theta,omega)
T2 = h*f(omega + O1/2)
O2 = h*g(theta + T1/2,omega + O1/2)
T3 = h*f(omega + O2/2)
O3 = h*g(theta + T2/2,omega + O2/2)
T4 = h*f(omega + O3)
O4 = h*g(theta + T3,omega + O3)
theta = theta + (T1 + 2*T2 + 2*T3 + T4)/6 # take theta time step
mask2 = np.array(np.where(np.logical_or(theta > tau, theta < 0))) # find which nodes left [0, 2pi]
lap[mask2] = lap[mask2] + 1 # increment the mask
theta[mask2] = np.mod(theta[mask2], tau) # take the modulus
omega = omega + (O1 + 2*O2 + 2*O3 + O4)/6
if(max_laps in lap): # if any generator rotates this many times it probably won't converge
break
elif(np.any(omega > 12)): # if any of the generators is rotating this fast, it probably won't converge
break
elif(np.linalg.norm(omega) < tol_ss and # assert the nodes are sufficiently close to the equilibrium
np.linalg.norm(omega - p_omega) < tol_step and # assert change in omega is small
np.linalg.norm(theta - p_theta) < tol_step): # assert change in theta is small
converges = True
break
i = i + 1
return converges
return layer1
Thanks for your help!
You can wrap your existing functions into a function accepted by odeint (option tfirst=True) and solve_ivp as
def odesys(t,u):
theta,omega = u[:n],u[n:]; # or = u.reshape(2,-1);
return [ *f(omega), *g(theta,omega) ]; # or np.concatenate([f(omega), g(theta,omega)])
u0 = [*theta0, *omega0]
t = linspan(t0, tf, timesteps+1);
u = odeint(odesys, u0, t, tfirst=True);
#or
res = solve_ivp(odesys, [t0,tf], u0, t_eval=t)
The scipy methods pass numpy arrays and convert the return value into same, so that you do not have to care in the ODE function. The variant in comments is using explicit numpy functions.
While solve_ivp does have event handling, using it for a systematic collection of events is rather cumbersome. It would be easier to advance some fixed step, do the normalization and termination detection, and then repeat this.
If you want to later increase efficiency somewhat, use directly the stepper classes behind solve_ivp.

SGD implementation Python

I am aware that SGD has been asked before on SO but I wanted to have an opinion on my code as below:
import numpy as np
import matplotlib.pyplot as plt
# Generating data
m,n = 10000,4
x = np.random.normal(loc=0,scale=1,size=(m,4))
theta_0 = 2
theta = np.append([],[1,0.5,0.25,0.125]).reshape(n,1)
y = np.matmul(x,theta) + theta_0*np.ones(m).reshape((m,1)) + np.random.normal(loc=0,scale=0.25,size=(m,1))
# input features
x0 = np.ones([m,1])
X = np.append(x0,x,axis=1)
# defining the cost function
def compute_cost(X,y,theta_GD):
return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
# initializations
theta_GD = np.append([theta_0],[theta]).reshape(n+1,1)
alp = 1e-5
num_iterations = 10000
# Batch Sum
def batch(i,j,theta_GD):
batch_sum = 0
for k in range(i,i+9):
batch_sum += float((y[k]-np.transpose(theta_GD).dot(X[k]))*X[k][j])
return batch_sum
# Gradient Step
def gradient_step(theta_current, X, y, alp,i):
for j in range(0,n):
theta_current[j]-= alp*batch(i,j,theta_current)/10
theta_updated = theta_current
return theta_updated
# gradient descent
cost_vec = []
for i in range(num_iterations):
cost_vec.append(compute_cost(X[i], y[i], theta_GD))
theta_GD = gradient_step(theta_GD, X, y, alp,i)
plt.plot(cost_vec)
plt.xlabel('iterations')
plt.ylabel('cost')
I was trying a mini-batch GD with a batch size of 10. I am getting extremely oscillatory behavior for the MSE. Where's the issue? Thanks.
P.S. I was following NG's https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
This is a description of the underlying mathematical principle, not a code based solution...
The cost function is highly nonlinear (np.power()) and recursive and recursive and nonlinear systems can oscillate ( self-oscillation https://en.wikipedia.org/wiki/Self-oscillation ). In mathematics this is subject to chaos theory / theory of nonlinear dynamical systems ( https://pdfs.semanticscholar.org/8e0d/ee3c433b1806bfa0d98286836096f8c2681d.pdf ), cf the Logistic Map
( https://en.wikipedia.org/wiki/Logistic_map ). The logistic map oscillates if the growth factor r exceeds a threshold. The growth factor is a measure for how much energy is in the system.
In your code the critical parts are the cost function, the cost vector, that is the history of the system and the time steps :
def compute_cost(X,y,theta_GD):
return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
cost_vec = []
for i in range(num_iterations):
cost_vec.append(compute_cost(X[i], y[i], theta_GD))
theta_GD = gradient_step(theta_GD, X, y, alp,i)
# Gradient Step
def gradient_step(theta_current, X, y, alp,i):
for j in range(0,n):
theta_current[j]-= alp*batch(i,j,theta_current)/10
theta_updated = theta_current
return theta_updated
If you compare this to an implementation of the logistic map you see the similarities
from pylab import show, scatter, xlim, ylim
from random import randint
iter = 1000 # Number of iterations per point
seed = 0.5 # Seed value for x in (0, 1)
spacing = .0001 # Spacing between points on domain (r-axis)
res = 8 # Largest n-cycle visible
# Initialize r and x lists
rlist = []
xlist = []
def logisticmap(x, r): <------------------ nonlinear function
return x * r * (1 - x)
# Return nth iteration of logisticmap(x. r)
def iterate(n, x, r):
for i in range(1,n):
x = logisticmap(x, r)
return x
# Generate list values -- iterate for each value of r
for r in [i * spacing for i in range(int(1/spacing),int(4/spacing))]:
rlist.append(r)
xlist.append(iterate(randint(iter-res/2,iter+res/2), seed, r)) <--------- similar to cost_vector, the history of the system
scatter(rlist, xlist, s = .01)
xlim(0.9, 4.1)
ylim(-0.1,1.1)
show()
source of code : https://www.reddit.com/r/learnpython/comments/zzh28/a_simple_python_implementation_of_the_logistic_map/
Basing on this you can try to modify your cost function by introducing a factor similar to the growth factor in the logistic map to reduce the intensity of oscillation of the system
def gradient_step(theta_current, X, y, alp,i):
for j in range(0,n):
theta_current[j]-= alp*batch(i,j,theta_current)/10 <--- introduce a factor somewhere to keep the system under the oscillation threshold
theta_updated = theta_current
return theta_updated
or
def compute_cost(X,y,theta_GD):
return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2 <--- introduce a factor somewhere to keep the system under the oscillation threshold
If this is not working maybe follow the suggestions in https://www.reddit.com/r/MachineLearning/comments/3y9gkj/how_can_i_avoid_oscillations_in_gradient_descent/ ( timesteps,... )

Gradient Descent Variation doesn't work

I try to implement the Stochastic Gradient Descent Algorithm.
The first solution works:
def gradientDescent(x,y,theta,alpha):
xTrans = x.transpose()
for i in range(0,99):
hypothesis = np.dot(x,theta)
loss = hypothesis - y
gradient = np.dot(xTrans,loss)
theta = theta - alpha * gradient
return theta
This solution gives the right theta values but the following algorithm
doesnt work:
def gradientDescent2(x,y,theta,alpha):
xTrans = x.transpose();
for i in range(0,99):
hypothesis = np.dot(x[i],theta)
loss = hypothesis - y[i]
gradientThetaZero= loss * x[i][0]
gradientThetaOne = loss * x[i][1]
theta[0] = theta[0] - alpha * gradientThetaZero
theta[1] = theta[1] - alpha * gradientThetaOne
return theta
I don't understand why solution 2 does not work, basically it
does the same like the first algorithm.
I use the following code to produce data:
def genData():
x = np.random.rand(100,2)
y = np.zeros(shape=100)
for i in range(0, 100):
x[i][0] = 1
# our target variable
e = np.random.uniform(-0.1,0.1,size=1)
y[i] = np.sin(2*np.pi*x[i][1]) + e[0]
return x,y
And use it the following way:
x,y = genData()
theta = np.ones(2)
theta = gradientDescent2(x,y,theta,0.005)
print(theta)
I hope you can help me!
Best regards, Felix
Your second code example overwrites the gradient computation on each iteration over your observation data.
In the first code snippet, you properly adjust your parameters in each looping iteration based on the error (loss function).
In the second code snippet, you calculate the point-wise gradient computation in each iteration, but then don't do anything with it. That means that your final update effectively only trains on the very last data point.
If instead you accumulate the gradients within the loop by summing ( += ), it should be closer to what you're looking for (as an expression of the gradient of the loss function with respect to your parameters over the entire observation set).

Steepest descent spitting out unreasonably large values

My implementation of steepest descent for solving Ax = b is showing some weird behavior: for any matrix large enough (~10 x 10, have only tested square matrices so far), the returned x contains all huge values (on the order of 1x10^10).
def steepestDescent(A, b, numIter=100, x=None):
"""Solves Ax = b using steepest descent method"""
warnings.filterwarnings(action="error",category=RuntimeWarning)
# Reshape b in case it has shape (nL,)
b = b.reshape(len(b), 1)
exes = []
res = []
# Make a guess for x if none is provided
if x==None:
x = np.zeros((len(A[0]), 1))
exes.append(x)
for i in range(numIter):
# Re-calculate r(i) using r(i) = b - Ax(i) every five iterations
# to prevent roundoff error. Also calculates initial direction
# of steepest descent.
if (numIter % 5)==0:
r = b - np.dot(A, x)
# Otherwise use r(i+1) = r(i) - step * Ar(i)
else:
r = r - step * np.dot(A, r)
res.append(r)
# Calculate step size. Catching the runtime warning allows the function
# to stop and return before all iterations are completed. This is
# necessary because once the solution x has been found, r = 0, so the
# calculation below divides by 0, turning step into "nan", which then
# goes on to overwrite the correct answer in x with "nan"s
try:
step = np.dot(r.T, r) / np.dot( np.dot(r.T, A), r )
except RuntimeWarning:
warnings.resetwarnings()
return x
# Update x
x = x + step * r
exes.append(x)
warnings.resetwarnings()
return x, exes, res
(exes and res are returned for debugging)
I assume the problem must be with calculating r or step (or some deeper issue) but I can't make out what it is.
The code seems correct. For example, the following test work for me (both linalg.solve and steepestDescent give the close answer, most of the time):
import numpy as np
n = 100
A = np.random.random(size=(n,n)) + 10 * np.eye(n)
print(np.linalg.eig(A)[0])
b = np.random.random(size=(n,1))
x, xs, r = steepestDescent(A,b, numIter=50)
print(x - np.linalg.solve(A,b))
The problem is in the math. This algorithm is guaranteed to converge to the correct solution if A is positive definite matrix. By adding the 10 * identity matrix to a random matrix, we increase the probability that all the eigen-values are positive
If you test with large random matrices (for example A = random.random(size=(n,n)), you are almost certain to have a negative eigenvalue, and the algorithm will not converge.

Gradient Descent

So I am writing a program that handles gradient descent. Im using this method to solve equations of the form
Ax = b
where A is a random 10x10 matrix and b is a random 10x1 matrix
Here is my code:
import numpy as np
import math
import random
def steepestDistance(A,b,xO, e):
xPrev = xO
dPrev = -((A * xPrev) - b)
magdPrev = np.linalg.norm(dPrev)
danger = np.asscalar(((magdPrev * magdPrev)/(np.dot(dPrev.T,A * dPrev))))
xNext = xPrev + (danger * dPrev)
step = 1
while (np.linalg.norm((A * xNext) - b) >= e and np.linalg.norm((A * xNext) - b) < math.pow(10,4)):
xPrev = xNext
dPrev = -((A * xPrev) - b)
magdPrev = np.linalg.norm(dPrev)
danger = np.asscalar((math.pow(magdPrev,2))/(np.dot(dPrev.T,A * dPrev)))
xNext = xPrev + (danger * dPrev)
step = step + 1
return xNext
##print(steepestDistance(np.matrix([[5,2],[2,1]]),np.matrix([[1],[1]]),np.matrix([[0.5],[0]]), math.pow(10,-5)))
def chooseRandMatrix():
matrix = np.zeros(shape = (10,10))
for i in range(10):
for a in range(10):
matrix[i][a] = random.randint(0,100)
return matrix.T * matrix
def chooseRandColArray():
arra = np.zeros(shape = (10,1))
for i in range(10):
arra[i][0] = random.randint(0,100)
return arra
for i in range(4):
matrix = np.asmatrix(chooseRandMatrix())
array = np.asmatrix(chooseRandColArray())
print(steepestDistance(matrix, array, np.asmatrix(chooseRandColArray()),math.pow(10,-5)))
When I run the method steepestDistance on the random matrix and column, I keep getting an infinite loop. It works fine when simple 2x2 matrices are used for A, but it loops indefinitely for 10x10 matrices. The problem is in np.linalg.norm((A * xNext) - b); it keeps growing indefinitely. Thats why I put an upper bound on it; Im not supposed to do it for the algorithm however. Can someone tell me what the problem is?
Solving a linear system Ax=b with gradient descent means to minimize the quadratic function
f(x) = 0.5*x^t*A*x - b^t*x.
This only works if the matrix A is symmetric, A=A^t, since the derivative or gradient of f is
f'(x)^t = 0.5*(A+A^t)*x - b,
and additionally A must be positive definite. If there are negative eigenvalues,then the descent will proceed to minus infinity, there is no minimum to be found.
One work-around is to replace b by A^tb and A by a^t*A, that is to minimize the function
f(x) = 0.5*||A*x-b||^2
= 0.5*x^t*A^t*A*x - b^t*A*x + 0.5*b^t*b
with gradient
f'(x)^t = A^t*A*x - A^t*b
But for large matrices A this is not recommended since the condition number of A^t*A is about the square of the condition number of A.

Categories