Bayesian PCA using PyMC - Python

I'm trying to implement Bayesian PCA using the PyMC library for Python, but I'm stuck where I define the lower-dimensional coordinates...
The model is
x = Wz + e
where x is the observation vector, W is the transformation matrix, and z is the lower-dimensional coordinate vector.
First I define a distribution for the transformation matrix W. Each column is drawn from a normal distribution (zero mean and identity covariance, for simplicity):
def W_logp(value):
    logLikes = np.array([multivariate_normal.logpdf(value[:, i], mean=np.zeros(dimX), cov=1) for i in range(0, dimZ)])
    return logLikes.sum()

def W_random():
    W = np.zeros([dimX, dimZ])
    for i in range(0, dimZ):
        W[:, i] = multivariate_normal.rvs(mean=np.zeros(dimX), cov=1)
    return W

w0 = np.random.randn(dimX, dimZ)
w0 = np.random.randn(dimX, dimZ)
W = pymc.Stochastic(
    logp=W_logp,
    doc='Transformation',
    name='W',
    parents={},
    random=W_random,
    trace=True,
    value=w0,
    dtype=float,
    rseed=116.,
    observed=False,
    cache_depth=2,
    plot=False,
    verbose=0)
Then I want to define a distribution for z, which is again multivariate normal (zero mean and identity covariance). However, I need to draw a separate z for each observation, while W is common to all of them. So I tried
z = pymc.MvNormal('z', np.zeros(dimZ), np.eye(dimZ), size=N)
However, pymc.MvNormal does not have a size parameter, so it raises an error. The next step would be
m = Data.mean(axis=0) + np.dot(W, z)
obs = pymc.MvNormal('Obs', m, C, value=Data, observed=True)
I did not give the specification for C above since it is irrelevant for now. Any ideas on how to implement this?
Thanks
EDIT
After Chris Fonnesbeck's answer I changed my code as follows:
numD, dimX = Data.shape
dimZ = 3
mm = Data.mean(axis=0)
tau = pymc.Gamma('tau', alpha=10, beta=2)
tauW = pymc.Gamma('tauW', alpha=20, beta=2, size=dimZ)

@pymc.deterministic(dtype=float)
def C(tau=tau):
    return tau * np.eye(dimX)

@pymc.deterministic(dtype=float)
def CW(tau=tauW):
    return np.diag(tau)

W = [pymc.MvNormal('W%i' % i, np.zeros(dimZ), CW) for i in range(dimX)]
z = [pymc.MvNormal('z%i' % i, np.zeros(dimZ), np.eye(dimZ)) for i in range(numD)]
# i=i in the defaults binds the loop index at definition time; without it,
# every lambda would see only the final value of i
mu = [pymc.Lambda('mu%i' % i, lambda W=W, z=z, i=i: mm + np.dot(np.array(W), np.array(z[i]))) for i in range(numD)]
obs = [pymc.MvNormal('Obs%i' % i, mu[i], C, value=Data[i, :], observed=True) for i in range(numD)]
model = pymc.Model([tau, tauW] + obs + W + z)
mcmc = pymc.MCMC(model)
But this time it tries to allocate a huge amount of memory (more than 8 GB) when running pymc.MCMC(model), with numD=45 and dimX=504. Even when I try it with only numD=1 (so creating only one z, mu, and obs), it does the same. Any idea why?

Unfortunately, PyMC does not easily let you define vectors of multivariate stochastics. Hopefully we can make this happen in PyMC 3. For now, you would have to specify this using a container. For example:
z = [pymc.MvNormal('z_%i' % i, np.zeros(dimZ), np.eye(dimZ)) for i in range(N)]

Regarding the memory issue, try using a different backend for the traces. The default ("ram") keeps everything in RAM. You can try something like "pickle" or "sqlite" instead.
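A minimal sketch of switching backends, assuming PyMC 2's db/dbname arguments and the model object from your edit (the filename is just an example):

# Keep traces on disk in an SQLite database instead of in memory
mcmc = pymc.MCMC(model, db='sqlite', dbname='bpca_traces.sqlite')
mcmc.sample(iter=10000, burn=5000)
mcmc.db.close()  # close the on-disk trace database when finished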
Regarding the plate notation, it might be something we could pursue for PyMC 3. Feel free to create an issue suggesting this in our issue tracker.

Related

Got stuck while making an ML model from scratch

I have a CSV file of various persons with 8 parameters to determine whether the person is diabetic or not.
You will get the CSV file from here
I am making a model that will train and predict whether a person is diabetic or not, without using third-party libraries like TensorFlow or scikit-learn. I am making it from scratch.
Here is my code:
from numpy import genfromtxt
import numpy as np

my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
X, Y = my_data[1:, :-1], my_data[1:, -1:]  # stripping data and output from my_data

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

m = X.shape[0]

def propagate(W, b, X, Y):
    # forward propagation
    A = sigmoid(np.dot(X, W) + b)
    cost = (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    print(cost)
    # backward propagation
    dw = (1 / m) * np.dot(X.T, (A - Y))
    db = (1 / m) * np.sum(A - Y)
    return dw, db, cost

def optimizer(W, b, X, Y, number_of_iterations, learning_rate):
    for i in range(number_of_iterations):
        dw, db, cost = propagate(W, b, X, Y)
        W = W - learning_rate * dw
        b = b - learning_rate * db
    return W, b

W = np.zeros((X.shape[1], 1))
b = 0
W, b = optimizer(W, b, X, Y, 100, 0.05)
The output that is generated is in this link; please take a look.
I have tried to:
- initialize the value of W with random numbers
- spend a lot of time debugging, but I cannot find what I have done wrong
The short answer is that your learning rate is about 500x too big for this problem. Think of it like you're trying to pilot your W vector into a canyon in the cost function. At each step, the gradient tells you which way is downhill, but the steps you take in that direction are so big that you jump over the canyon and end up on the other side. Each time this happens, your cost goes up because you're getting farther and farther out of the canyon, until after 2 iterations it blows up.
If you replace the line
W,b = optimizer(W, b, X, Y, 100, 0.05)
with
W,b = optimizer(W, b, X, Y, 100, 0.0001)
it will converge, though still not at a reasonable speed. (Side note: there's no good way to know in advance the learning rate you need for a given problem. You just try lower and lower values until your cost doesn't diverge.)
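For instance, a quick (hypothetical) sweep using the question's own optimizer; propagate() prints the cost at every step, so a rate is too big whenever the printed costs grow instead of shrink:

# Try successively smaller learning rates for a few iterations each
for lr in [0.05, 0.01, 0.001, 0.0001]:
    print('--- learning rate:', lr)
    W_try, b_try = optimizer(np.zeros((X.shape[1], 1)), 0, X, Y, 10, lr)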
The longer answer is that your features are all on different scales.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
yields
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
This means that the variations in the second feature are about 100x as large as the variations in the second-to-last feature, which in turn means that the second value in your W vector will have to be tuned to about 100x the precision of the second-to-last value in your W vector.
There are two ways to deal with this in practice. First, you could use a fancier optimizer: instead of basic gradient descent, you could use gradient descent with momentum, but that would change all your code (there's a sketch at the end of this answer). The second, simpler way is just to scale your features so they're all about the same size:
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
X -= col_means
X /= col_stds
W, b = optimizer(W, b, X, Y, 100, 1.0)
Here we subtract the mean value of each feature and divide each feature's value by its standard deviation. Sometimes newbies are thrown off by this -- "you can't change your data values, that changes the problem" -- but it makes sense if you realize that it's just another mathematical transformation, just like multiplying by W, adding b, taking the sigmoid, etc. The only catch is that you've got to make sure you do the same thing for any future data. Just like the values of your W vector are learned parameters of your model, the values of the col_means and col_stds are too, so you've got to save them like W and b and use them if you want to perform inference with this model on new data in the future.
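Concretely, a sketch of prediction time (new_data is a hypothetical array of future observations; everything else comes from the training code above):

# Apply exactly the same transformation that the model was trained with
X_new = (new_data - col_means) / col_stds
prediction = sigmoid(np.dot(X_new, W) + b)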
That lets us use a much bigger learning rate of 1.0, because now all the features are approximately the same size.
Now if you try, you'll get the following output:
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
0.6931471805599452
0.5902957589079032
0.5481784378158732
0.5254804089153315
...
0.4709931321295562
0.4709931263193595
0.47099312122176273
0.4709931167488006
0.470993112823447
This is what you want. Your cost function is going down at each step, and at the end of your 100 iterations the cost is stable to ~8 significant figures, so pushing it lower probably won't do much.
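As an aside, if you wanted to try the fancier-optimizer route mentioned earlier, a minimal momentum variant of the optimizer might look like the following sketch (beta, the momentum decay factor, is my addition, not something from the original code):

def optimizer_momentum(W, b, X, Y, number_of_iterations, learning_rate, beta=0.9):
    # Accumulate an exponentially decayed average of past gradients and
    # step along that average instead of the raw gradient
    vW = np.zeros_like(W)
    vb = 0.0
    for i in range(number_of_iterations):
        dw, db, cost = propagate(W, b, X, Y)
        vW = beta * vW + (1 - beta) * dw
        vb = beta * vb + (1 - beta) * db
        W = W - learning_rate * vW
        b = b - learning_rate * vb
    return W, b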
Welcome to machine learning!
The problem is with your initialization of the weights and bias. It's important that you not initialize at least the weights to zero; initialize them with small random numbers instead. The value of A is coming out to be zero, making your cost function undefined.
Update:
Try something like this:
from numpy import genfromtxt
import numpy as np

# my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
# X, Y = my_data[1:, :-1], my_data[1:, -1:]  # stripping data and output from my_data

# Using random data
n_points = 100
n_neurons = 5
X = np.random.rand(n_points, n_neurons)  # 5-dimensional data from uniform distribution [0, 1)
Y = np.random.randint(low=0, high=2, size=(n_points, 1))  # binary labels

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

m = X.shape[0]

def propagate(W, b, X, Y):
    # forward propagation
    A = sigmoid(np.dot(X, W) + b)
    cost = (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    print(cost)
    # backward propagation
    dw = (1 / m) * np.dot(X.T, (A - Y))
    db = (1 / m) * np.sum(A - Y)
    return dw, db, cost

def optimizer(W, b, X, Y, number_of_iterations, learning_rate):
    for i in range(number_of_iterations):
        dw, db, cost = propagate(W, b, X, Y)
        W = W - learning_rate * dw
        b = b - learning_rate * db
    return W, b

W = np.random.normal(loc=0, scale=0.01, size=(n_neurons, 1))  # drawing random initialization from a Gaussian
b = 0
W, b = optimizer(W, b, X, Y, 100, 0.05)
Your NaN problem is simply due to np.log encountering a zero value. You always want to scale your X values. Statistical (mean, std) normalization will work, but I find min-max scaling works best. Here is code for that:
def minmax_scaler(x):
    # nanmin/nanmax ignore NaNs when computing the per-column extremes
    x_min = np.nanmin(x, axis=0)
    x_max = np.nanmax(x, axis=0)
    return (x - x_min) / (x_max - x_min)
Also, your neural net has only one neuron. When you call np.dot(X, W), these should be matrices of shape (cases, features) and (features, neurons) respectively. So now your initialization code looks like this:
X = minmax_scaler(X)
neurons = 10
learning_rate = 0.05
W = np.random.random((X.shape[1], neurons))
b = np.zeros((1, neurons)) # b width to match W
I got decent convergence without needing to change the learning rate.
This is such a small dataset that, even with 10-20 neurons, you are in danger of overfitting it. Ordinarily, you would code a predict() method and an accuracy check, and then set aside some of the data to test for overfitting; a sketch of those helpers follows.
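A minimal sketch of what that could look like for a single output neuron, reusing the sigmoid from above (the function names and the 0.5 threshold are my choices, not part of the original answer):

def predict(X, W, b, threshold=0.5):
    # Classify as 1 wherever the sigmoid activation clears the threshold
    return (sigmoid(np.dot(X, W) + b) >= threshold).astype(int)

def accuracy(X, Y, W, b):
    # Fraction of labels predicted correctly
    return np.mean(predict(X, W, b) == Y)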

Recursive Least Squares in Python

I'm pretty new to Python and trying to make an RLS filter work. I have a simple linear forecasting regression d = b*x + v, for which I would like to recursively estimate d by incorporating the data for x one point at a time and measuring the error of the filter's estimate against the actual d. The filter examples online look like this:
import numpy as np
import padasip as pa  # the examples use the padasip library

N = 500
x = np.random.normal(0, 1, (N, 4))  # input matrix
v = np.random.normal(0, 0.1, N)  # noise
d = 2*x[:,0] + 0.1*x[:,1] - 4*x[:,2] + 0.5*x[:,3] + v  # target identification
f = pa.filters.FilterRLS(n=4, mu=0.1, w="random")
y, e, w = f.run(d, x)
But how do I make this work? I don't have a matrix for x; I only have a simple regression with one independent variable. And why do I need to give the noise v? That's something I would like to get from the filter. I would like to give actual data for x and d as input.
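For what it's worth, in the example above v is only used to synthesize the target d for the demo; with real data you pass your measured d directly and never supply noise. A sketch of the single-regressor case, assuming the padasip API used in the example (x_data and d_data are hypothetical 1-D arrays holding your actual data):

import numpy as np
import padasip as pa

# x_data, d_data: your own 1-D series of equal length (hypothetical names)
x_matrix = x_data.reshape(-1, 1)  # run() expects a 2-D input matrix, here one column
f = pa.filters.FilterRLS(n=1, mu=0.99, w="random")  # n=1: a single regressor
y, e, w = f.run(d_data, x_matrix)  # y: predictions, e: errors (d - y), w: weight history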

How to update a value of a shared variable in Theano efficiently?

I'm trying to implement non-negative matrix factorization in Theano. In more detail, I try to find two matrices L and R such that their product L x R represents a given matrix M as accurately as possible.
To find the L and R matrices I use back-propagation. At some point I noticed that the values in L and R can become negative (of course, nothing prevents backprop from doing that). I tried to correct this behavior by adding the following lines after the back-propagation step:
self.L.set_value(T.abs_(self.L).eval())
self.R.set_value(T.abs_(self.R).eval())
After that, my program became much slower.
Am I doing something wrong? Am I updating the values of the tensors in the wrong way? Is there a way to do it faster?
ADDED
As requested in the comments, I provide more code. This is how I define the function in the __init__.
self.L = theano.shared(value=np.random.rand(n_rows, n_hids), name='L', borrow=True)
self.R = theano.shared(value=np.random.rand(n_hids, n_cols), name='R', borrow=True)
Y = theano.dot(self.L, self.R)
diff = X - Y
D = T.pow(diff, 2)
E = T.sum(D)
gr_L = T.grad(cost=E, wrt=self.L)
gr_R = T.grad(cost=E, wrt=self.R)
self.l_rate = theano.shared(value=0.000001)
L_ups = self.L - self.l_rate*gr_L
R_ups = self.R - self.l_rate*gr_R
updates = [(self.L, L_ups), (self.R, R_ups)]
self.backprop = theano.function([X], E, updates=updates)
Then in my train function I had this code:
for i in range(self.n_iter):
    costs = self.backprop(X, F)
    self.L.set_value(T.abs_(self.L).eval())
    self.R.set_value(T.abs_(self.R).eval())
A minor remark: I use the abs_ function, but it would actually make more sense to use a function that replaces negative values with zero.
You can force the symbolic update values for L and R to always be positive like this:
self.l_rate = theano.shared(value=0.000001)
L_ups = self.L - self.l_rate*gr_L
R_ups = self.R - self.l_rate*gr_R
# This force R and L to always be updated to a positive value
L_ups_abs = T.abs_(L_ups)
R_ups_abs = T.abs_(R_ups)
# Use the update L_ups_abs instead of L_ups (same with R_ups)
updates = [(self.L, L_ups_abs), (self.R, R_ups_abs)]
self.backprop = theano.function([X], E, updates=updates)
and remove the lines
self.L.set_value(T.abs_(self.L).eval())
self.R.set_value(T.abs_(self.R).eval())
from your training loop.
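As a side note on the remark about replacing negative values with zero: the same trick works with clipping instead of the absolute value, and keeping the projection inside the compiled update also avoids the per-iteration eval() calls, each of which builds and compiles a separate graph and is the likely source of the slowdown. A sketch, assuming the same gr_L/gr_R expressions as above:

# Clip the updated values at zero (a projected gradient step)
L_ups_clipped = T.maximum(L_ups, 0)
R_ups_clipped = T.maximum(R_ups, 0)
updates = [(self.L, L_ups_clipped), (self.R, R_ups_clipped)]
self.backprop = theano.function([X], E, updates=updates)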

Is there Implementation of Hawkes Process in PyMC?

I want to use a Hawkes process to model some data. I could not find whether PyMC supports Hawkes processes. More specifically, I want an observed variable following a Hawkes process, and to learn a posterior on its parameters.
If it is not there, could I define it in PyMC in some way, e.g. with @deterministic, etc.?
It's been quite a long time since your question, but I've worked it out on PyMC today, so I thought I'd share the gist of my implementation for other people who might run into the same problem. We're going to infer the parameters λ and α of a Hawkes process. I'm not going to cover the temporal scale parameter β; I'll leave that as an exercise for the readers.
First, let's generate some data:
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm  # assuming the PyMC3-era API, which matches the pm.sample/trace usage below

def hawkes_intensity(mu, alpha, points, t):
    # lambda(t) = mu + alpha * sum of exp(t_i - t) over past points t_i <= t
    p = np.array(points)
    p = p[p <= t]
    p = np.exp(p - t)
    return mu + alpha * np.sum(p)

def simulate_hawkes(mu, alpha, window):
    t = 0
    points = []
    lambdas = []
    while t < window:
        m = hawkes_intensity(mu, alpha, points, t)
        s = np.random.exponential(scale=1/m)
        ratio = hawkes_intensity(mu, alpha, points, t + s)
        t = t + s
        if t < window:
            points.append(t)
            lambdas.append(ratio)
        else:
            break
    points = np.sort(np.array(points, dtype=np.float32))
    lambdas = np.array(lambdas, dtype=np.float32)
    return points, lambdas

# parameters
window = 1000
mu = 8
alpha = 0.25

points, lambdas = simulate_hawkes(mu, alpha, window)
num_points = len(points)
We just generated some temporal points using functions that I adapted from here: https://nbviewer.jupyter.org/github/MatthewDaws/PointProcesses/blob/master/Temporal%20points%20processes.ipynb
Now, the trick is to create a matrix of size (num_points, num_points) that contains the temporal distance of the ith point from all the other points. So the (i, j) entry of the matrix is the temporal interval separating the ith point from the jth. This matrix will be used to compute the sum of exponentials of the Hawkes process, i.e. the self-exciting part. The way to create this matrix, as well as the sum of exponentials, is a bit tricky. I'd recommend checking every line yourself so you can see what it does.
tile = np.tile(points, num_points).reshape(num_points, num_points)  # row i repeats all the points
tile = np.clip(points[:, None] - tile, 0, np.inf)  # entry (i, j) = t_i - t_j, clipped so future points contribute 0
tile = np.tril(np.exp(-tile), k=-1)  # exp(-(t_i - t_j)) for j < i only (strictly lower triangle)
Σ = np.sum(tile, axis=1)[:-1]  # this is our self-exciting sum term, one value per interarrival
We have the points, and we have a matrix containing the sum of the excitation terms.
The duration between two consecutive events of a Hawkes process follows an exponential distribution with parameter λ = λ0 + ∑ excitation. This is what we are going to model, but first we have to compute the duration between two consecutive points of our generated data.
interval = points[1:] - points[:-1]
We're now ready for inference:
with pm.Model() as model:
    λ = pm.Exponential("λ", 1)
    α = pm.Uniform("α", 0, 1)
    lam = pm.Deterministic("lam", λ + α * Σ)
    interarrival = pm.Exponential("interarrival", lam, observed=interval)
    trace = pm.sample(2000, tune=4000)

pm.plot_posterior(trace, var_names=["λ", "α"])
plt.show()

print(np.mean(trace["λ"]))
print(np.mean(trace["α"]))
7.829
0.284
Note: the tile matrix can become quite large if you have many data points.
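For a sense of scale: with 10,000 events, a float32 matrix of shape (10000, 10000) already takes 10000² × 4 bytes ≈ 400 MB.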

MLE for a Polya Distribution

I'm working on programming an MLE for the Polya distribution using scipy. The Nelder-Mead method works; however, I get a "Desired error not necessarily achieved due to precision loss." error when running BFGS. The Nelder-Mead method seems too slow for my needs (I have a lot of fairly big data, say 1000 tables, in some cases as big as 10x10000). I've tried using the check_grad function and the result is smallish on the example below (order 10^-2), so I'm not sure if that means there's a bug in the gradient of the log-likelihood or the likelihood is just very strongly peaked. For what it's worth, I've stared quite hard at my code and I can't see the issue. Here's some example code to recreate the problem:
# setup some data
import warnings
import numpy as np
import pandas as pd
from numpy.random import dirichlet, multinomial
from scipy.optimize import minimize, check_grad
from scipy.special import gammaln, psi

alpha = [10, 30, 50]
p = pd.DataFrame(dirichlet(alpha, 200))
data = p.apply(lambda x: multinomial(500, x), 1)
a = np.array(data.mean(0))

# optimize
result = minimize(lambda a: -1 * llike(data, np.exp(a)),
                  x0=np.log(a),
                  method='Nelder-Mead')
x0 = result.x
result = minimize(lambda a: -1 * llike(data, np.exp(a)),
                  x0=x0,
                  jac=lambda a: -1 * gradient_llike(data, np.exp(a)),
                  method='BFGS')
np.exp(result.x)  # should be close to alpha

# uh oh, let's check that this is right
check_grad(func=lambda a: -1 * llike(data, a), grad=lambda a: -1 * gradient_llike(data, a), x0=alpha)
Here's the code for my functions:
def log_polya(Z, alpha):
    """
    Z is a vector of counts
    https://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution
    http://mimno.infosci.cornell.edu/info6150/exercises/polya.pdf
    """
    if not isinstance(alpha, np.ndarray):
        alpha = np.array(alpha)
    if not isinstance(Z, np.ndarray):
        Z = np.array(Z)
    # concentration parameter
    A = sum(alpha)
    # number of datapoints
    N = sum(Z)
    return gammaln(A) - gammaln(N + A) + sum(gammaln(Z + alpha) - gammaln(alpha))

def llike(data, alpha):
    return sum(data.apply(log_polya, 1, alpha=alpha))

def log_polya_derivative(Z, alpha):
    if not isinstance(alpha, np.ndarray):
        alpha = np.array(alpha)
    if not isinstance(Z, np.ndarray):
        Z = np.array(Z)
    if 0. in Z + alpha:
        warnings.warn("invalid prior parameter, NaNs should be produced")
    # concentration parameter
    A = sum(alpha)
    # number of datapoints
    N = sum(Z)
    K = len(Z)
    return np.array([psi(A) - psi(N + A) + psi(Z[i] + alpha[i]) - psi(alpha[i]) for i in range(K)])

def gradient_llike(data, alpha):
    return np.array(data.apply(log_polya_derivative, 1, alpha=alpha).sum(0))
UPDATE: Still curious about this, but for those interested in a working implementation for this problem, the following code implementing the Minka fixed-point algorithm seems to work well (i.e. it quickly recovers values close to the true Dirichlet parameters).
def minka_mle_polya(data):
    """
    http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf
    """
    data = np.array(data)
    K = np.shape(data)[1]
    alpha = np.array(data.mean(0))
    alpha_new = np.zeros(K)
    precision = 10
    while precision > 10**-5:
        for k in range(K):
            A = sum(alpha)
            N = data.sum(1)
            numerator = sum(
                psi(data[:, k] + alpha[k]) - psi(alpha[k])
            )
            denominator = sum(
                psi(N + A) - psi(A)
            )
            alpha_new[k] = alpha[k] * numerator / denominator
        precision = sum(abs(alpha_new - alpha))
        alpha = np.array(alpha_new)
        print("Gap", precision)
    return alpha  # added so the fitted parameters are actually returned
