This is a follow up on PyMC: Parameter estimation in a Markov system
I have a system which is defined by its position and velocity at each timestep. The behavior of the system is defined as:
vel = vel + damping * dt
pos = pos + vel * dt
Here is my PyMC model, which should estimate vel, pos and, most importantly, damping.
# PRIORS
damping = pm.Normal("damping", mu=-4, tau=(1 / .5**2))
# we assume some system noise
tau_system_noise = (1 / 0.1**2)
# the state consist of (pos, vel); save in lists
# vel: we can't judge the initial velocity --> assume it's 0 with big std
vel_states = [pm.Normal("v0", mu=-4, tau=(1 / 2**2))]
# pos: the first pos is just the observation
pos_states = [pm.Normal("p0", mu=observations[0], tau=tau_system_noise)]
for i in range(1, len(observations)):
    new_vel = pm.Normal("v" + str(i),
                        mu=vel_states[-1] + damping * dt,
                        tau=tau_system_noise)
    vel_states.append(new_vel)
    pos_states.append(
        pm.Normal("s" + str(i),
                  mu=pos_states[-1] + new_vel * dt,
                  tau=tau_system_noise)
    )
# we assume some observation noise
tau_observation_noise = (1 / 0.5**2)
obs = pm.Normal("obs", mu=pos_states, tau=tau_observation_noise, value=observations, observed=True)
This is how I run the sampling:
mcmc = pm.MCMC([damping, obs, vel_states, pos_states])
mcmc.sample(50000, 25000)
pm.Matplot.plot(mcmc.get_node("damping"))
damping_samples = mcmc.trace("damping")[:]
print "damping -- mean:%f; std:%f" % (mean(damping_samples), std(damping_samples))
print "real damping -- %f" % true_damping
The value for damping is dominated by the prior. Even if I change the prior to Uniform or whatever, it is still the case.
What am I doing wrong? It's pretty much like the previous example, just with another layer.
The full IPython notebook of this problem is available here: http://nbviewer.ipython.org/github/sotte/random_stuff/blob/master/PyMC%20-%20HMM%20Dynamic%20System.ipynb
[EDIT: Some clarifications & code for sampling.]
[EDIT2: @Chris's answer didn't help. I could not use AdaptiveMetropolis since the *_states don't seem to be part of the model.]
There are a couple of issues with the model, looking at it again. First and foremost, you did not add all of your PyMC objects to the model. You have only added [damping, obs]. You should pass all of the PyMC nodes to the model.
Also, note that you don't need to call both Model and MCMC. This is fine:
model = pm.MCMC([damping, obs, vel_states, pos_states])
The best workflow for PyMC is to keep your model in a separate file from the running logic. That way, you can just import the model and pass it to MCMC:
import my_model
model = pm.MCMC(my_model)
Alternately, you can write your model as a function, returning locals (or vars), then calling the function as the argument for MCMC. For example:
def generate_model():
    # put your model definition here
    return locals()
model = pm.MCMC(generate_model())
Assuming you know the structure of your model -- you are doing parameter estimation, not system identification -- you can construct your PyMC model as a regression, with unknown damping, initial position and initial velocity as parameters and the array of positions, your observations.
That is, with class PM representing the point-mass system:
pm = PM(true_damping)
positions, velocities = pm.integrate(true_pos, true_vel, N, dt)
# Assume little system noise
std_system_noise = 0.05
tau_system_noise = 1.0/std_system_noise**2
# Treat the real positions as observations
observations = positions + np.random.randn(N,)*std_system_noise
# Damping is modelled with a Uniform prior
damping = mc.Uniform("damping", lower=-4.0, upper=4.0, value=-0.5)
# Initial position & velocity unknown -> assume Uniform priors
init_pos = mc.Uniform("init_pos", lower=-1.0, upper=1.0, value=0.5)
init_vel = mc.Uniform("init_vel", lower=0.0, upper=2.0, value=1.5)
@mc.deterministic
def det_pos(d=damping, pi=init_pos, vi=init_vel):
    # Apply damping, init_pos and init_vel estimates and integrate
    pm.damping = d.item()
    pos, vel = pm.integrate(pi, vi, N, dt)
    return pos
# Standard deviation is modelled with a Uniform prior
std_pos = mc.Uniform("std", lower=0.0, upper=1.0, value=0.5)
@mc.deterministic
def det_prec_pos(s=std_pos):
    # Precision, based on standard deviation
    return 1.0/s**2
# The observations are based on the estimated positions and precision
obs_pos = mc.Normal("obs", mu=det_pos, tau=det_prec_pos, value=observations, observed=True)
# Create the model and sample
model = mc.Model([damping, init_pos, init_vel, det_prec_pos, obs_pos])
mcmc = mc.MCMC(model)
mcmc.sample(50000, 25000)
The full listing is here:
https://gist.github.com/stuckeyr/7762371
Increasing N and decreasing dt will improve your estimates markedly.
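The PM class itself is only shown in the linked gist. For readers without it, here is a minimal sketch of what such a point-mass integrator could look like, assuming the Euler update from the question (vel = vel + damping * dt, pos = pos + vel * dt). The class name PointMass and the convention that the returned lists start with the initial state are assumptions of this sketch, not the gist's code.

```python
class PointMass:
    """Hypothetical stand-in for the PM class used above."""

    def __init__(self, damping):
        self.damping = damping

    def integrate(self, pos0, vel0, n_steps, dt):
        """Return n_steps positions and velocities, starting from the initial state."""
        positions, velocities = [pos0], [vel0]
        for _ in range(n_steps - 1):
            # Same update rule as in the question (explicit Euler step).
            vel = velocities[-1] + self.damping * dt
            pos = positions[-1] + vel * dt
            velocities.append(vel)
            positions.append(pos)
        return positions, velocities


system = PointMass(damping=-1.5)
positions, velocities = system.integrate(pos0=0.0, vel0=1.0, n_steps=3, dt=0.1)
print(positions, velocities)
```

Because the deterministic node only needs integrate(), any object with this interface could be dropped into det_pos.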
What do you mean by unreasonable? Are they shrunken toward the prior? Damping seems to have a pretty tight variance -- what if you give it a more diffuse prior?
Also, you might try using the AdaptiveMetropolis sampler on the *_states arrays:
my_model.use_step_method(AdaptiveMetropolis, my_model.vel_states)
It sometimes mixes better for correlated variables, as these likely are.
I think that your initial approach is fine and should work, except that the "obs" variable has not been included in the list of nodes supplied to MCMC (see In[10] in your notebook). After including this variable, the MCMC solver runs fine and does enforce the conditional dependencies specified by your model. I'd like to repeat the point made by Chris that it is best to define the model in a different file or under a function to avoid such mistakes.
The reason you don't get the right results is that your priors have been chosen arbitrarily and, in some cases, the values make it very difficult for the model to mix properly and converge. Your toy problem tries to estimate a damping value such that the positions converge to the vector of observed positions. For this, your model should have the flexibility to choose velocity and damping values in a wide range, so that stochastic errors in the position/velocity can be corrected when going from one time step to the next. Otherwise, as a result of your Euler integration scheme, the errors just keep getting propagated. I think Chris referred to the same thing when he suggested choosing a more diffuse prior.
I suggest playing around with the tau values for each of the Normal variables. For instance, I changed the following values:
damping = pm.Normal("damping", mu=0, tau=1/20.**2) # was tau=1/2.**2
new_vel = pm.Normal("v" + str(i),
mu=vel_states[-1] + damping * dt,
tau=(1/2.**2)) # was tau=tau_system_noise=(1 / 0.5**2)
tau_observation_noise = (1 / 0.005**2) # was 1 / 0.5**2
You can see the modified file here.
The plots at the bottom show that the positions are indeed converging. The velocities are all over the place. The estimated mean value of damping is 6.9, which is very different from -1.5. Perhaps you can achieve better estimates by choosing appropriate values for the priors.
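Since PyMC 2's Normal is parameterized by the precision tau = 1/std**2 rather than by the standard deviation, it can help to tabulate what the modified values mean on both scales. A small sketch (the helper name precision_from_std is made up for this example):

```python
def precision_from_std(std):
    """Convert a standard deviation to the precision tau = 1 / std**2."""
    return 1.0 / std ** 2


# Standard deviations taken from the modified values above.
settings = [
    ("damping prior", 20.0),
    ("velocity step", 2.0),
    ("observation noise", 0.005),
]
for label, std in settings:
    print("%-18s std=%-6g tau=%g" % (label, std, precision_from_std(std)))
```

A larger standard deviation means a smaller tau, i.e. a more diffuse distribution.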
Related
I'm trying to infer 2 parameters (beta and gamma) given a deterministic equation and simulated noisy data. For some reason, the equation I'm using seems to be problematic, as I just copied the basic pymc3 tutorial and used my own deterministic equation. Here is the model I'm using:
# True parameter values
beta, gamma = 0.21, 0.07
# Size of dataset
days = 50
# Predictor variable
time = np.arange(0,days,1)
# Simulate outcome variable
data = []
for t in time:
    data.append((beta/((beta-gamma))*(np.exp(t*(beta-gamma))-1)+1) + np.random.normal(0,1))
basic_model = pm.Model()
def smodel(beta,gamma):
    s = beta/((beta-gamma))*(tt.exp(time*(beta-gamma))-1)+1
    return s
with basic_model:
    # Priors for unknown model parameters
    beta = pm.Normal("beta", mu=0, sigma=10)
    gamma = pm.Normal("gamma", mu=0, sigma=10)
    # Expected value of outcome
    #smodel_pm = pm.Deterministic('smodel', smodel(inputParam))
    y_obs = pm.Normal('obs', mu=smodel(beta,gamma), sigma=1,observed=data)
    # Draw the specified number of samples
    trace = pm.sample(step=pm.Metropolis())
However, when I run a summary of the trace, I'm getting 0's for everything. Anyone know what the issue is?
The parameterization is rather poor here (correlated variables, symmetric solutions in domain), plus Metropolis-Hastings simply needs to run for a long time, whereas the default settings assume NUTS.
Here's a suggested alternative parameterization, plus tuning and draw counts more reasonable for this sampling strategy:
basic_model = pm.Model()
def smodel(a, b):
    s = a*(tt.exp(b*time)-1)+1
    return s
with basic_model:
    # priors for pre-transformed model parameters
    a = pm.Normal("a", mu=0, sigma=10)
    b = pm.HalfNormal("b", sigma=10)
    # (transformed) parameters of interest
    beta = pm.Deterministic("beta", a*b)
    gamma = pm.Deterministic("gamma", (a-1)*b)
    # expected value of outcome
    y_obs = pm.Normal('obs', mu=smodel(a, b), sigma=1, observed=data)
    # Draw the specified number of samples
    trace = pm.sample(step=pm.Metropolis(), tune=100000, draws=50000)
One could probably up the draws even further, since the effective sample sizes (ESS) are so small, but the numbers for the parameters of interest are about where they should be:
Looking at the pairs plots, one can see the correlation for a,b is still very high, which explains why the samples are so highly auto-correlated.
The traces and densities (per chain) look decent to me:
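To see why the Deterministic nodes recover the parameters of interest, note that the reparameterization corresponds to a = beta/(beta-gamma) and b = beta-gamma, so a*b = beta and (a-1)*b = gamma. The following plain-Python check (the function names are made up for this sketch) confirms the round trip and that both forms of the curve agree. The HalfNormal prior on b additionally assumes beta > gamma.

```python
import math


def to_ab(beta, gamma):
    """Map (beta, gamma) to the reparameterized (a, b)."""
    b = beta - gamma
    return beta / b, b


def to_beta_gamma(a, b):
    """Inverse map, matching the Deterministic nodes above."""
    return a * b, (a - 1) * b


def curve_original(t, beta, gamma):
    return beta / (beta - gamma) * (math.exp(t * (beta - gamma)) - 1) + 1


def curve_reparam(t, a, b):
    return a * (math.exp(b * t) - 1) + 1


a, b = to_ab(0.21, 0.07)  # true values from the question
beta, gamma = to_beta_gamma(a, b)
print(a, b, beta, gamma)
print(curve_original(10, 0.21, 0.07), curve_reparam(10, a, b))
```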
I am new to the Bayesian world and PyMC3, and am struggling with a simple model setup. Specifically, how do I deal with a setup where the 'observed' data are themselves modified by the random variables? As an example, let's say I have a collection of 2d points [Xi, Yi] that form an arc about a circle whose central point [Xc,Yc] I don't know. However, I expect that the distances between the points and the circle center, Ri, should be normally distributed about a known radius, R. I therefore initially thought I could assign Xc and Yc uniform priors (on some arbitrarily large range) and then re-calculate Ri within the model and assign Ri as the 'observed' data to get posterior estimates on Xc and Yc:
import pymc3 as pm
import numpy as np
points = np.array([[2.95, 4.98], [3.28, 4.88], [3.84, 4.59], [4.47, 4.09], [2.1,5.1], [5.4, 1.8]])
Xi = points[:,0]
Yi = points[:,1]
#known [Xc,Yc] = [2.1, 1.8]
R = 3.3
with pm.Model() as Cir_model:
    Xc = pm.Uniform('Xc', lower=-20, upper=20)
    Yc = pm.Uniform('Yc', lower=-20, upper=20)
    Ri = pm.math.sqrt((Xi-Xc)**2 + (Yi-Yc)**2)
    y = pm.Normal('y', mu=R, sd=1.0, observed=Ri)
    samples = pm.fit(random_seed=2020).sample(1000)
pm.plot_posterior(samples, var_names=['Xc'])
pm.plot_posterior(samples, var_names=['Yc']);
While this code runs and gives me something, it clearly isn't working properly, which isn't surprising because it didn't seem right to be feeding a variable (Ri) in as 'observed' data. However, while I know there is something seriously wrong with my model setup (and my understanding more generally), I can't seem to recognize it. Any help greatly appreciated!
This model is actually doing fine, but there are a few things you might improve:
Using a variable as an observation is not great, in that you should think about what it is doing to the distribution you are fitting. It will fit a distribution, but you should think about whether you are double-counting variables in a prior and a likelihood. That doesn't matter so much for this toy model though!
You are using pm.fit(...), which uses variational inference, but MCMC is fine here, so replacing that whole line with samples = pm.sample() works.
The points you provide are almost exactly on a circle -- the empirical standard deviation is around 0.004, but the standard deviation you supply in the likelihood is 1: around 250x the true value! Sampling from the model as-is allows for the center of the points to be in two different places:
If you change the likelihood to y = pm.Normal('y', mu=R, sd=0.01, observed=Ri), you still get two possible centers, though there's a little more mass near the true center:
Finally, you could take an approach where you put a prior on the scale, and also learn that, which happily feels the most principled and gives results closest to the true ones. Here's the model:
with pm.Model():
    Xc = pm.Uniform('Xc', lower=-20, upper=20)
    Yc = pm.Uniform('Yc', lower=-20, upper=20)
    Ri = pm.math.sqrt((Xi-Xc)**2 + (Yi-Yc)**2)
    obs_sd = pm.HalfNormal('obs_sd', 1)
    y = pm.Normal('y', mu=R, sd=obs_sd, observed=Ri)
    samples = pm.sample()
and here's the output:
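As a quick check on the spread mentioned earlier in this answer, the distances from the known center [2.1, 1.8] can be computed directly with the standard library. This is only a sanity-check sketch. It uses the population standard deviation, and the exact figure depends on that choice.

```python
import math
import statistics

points = [(2.95, 4.98), (3.28, 4.88), (3.84, 4.59), (4.47, 4.09), (2.1, 5.1), (5.4, 1.8)]
center = (2.1, 1.8)  # the known center given in the question

# Distance of every point from the known center.
radii = [math.hypot(x - center[0], y - center[1]) for x, y in points]
spread = statistics.pstdev(radii)  # population standard deviation of the distances

print("radii:", [round(r, 4) for r in radii])
print("std of radii: %.4f" % spread)
```

The distances all sit within a couple of hundredths of R = 3.3, which is why a likelihood scale of 1.0 is so much wider than the data.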
I have recently been working with gpflow, in particular Gaussian process regression, to model a process for which I have access to approximated moments for each input. I have a vector of input values X of size (N,1) and a vector of responses Y of size (N,1). However, I also know, for each (x,y) pair, an approximation of the associated variance, skewness, kurtosis and so on for the particular y value.
From this, I know properties that inform me of appropriate likelihoods to use for each data point.
In the simplest case, I just assume all likelihoods are Gaussian, and specify the variance at each point. I've created a minimal example of my code by adapting the tutorial on: https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/advanced/varying_noise.ipynb#Demo-2:-grouped-noise-variances.
import numpy as np
import gpflow
def generate_data(N=100):
    X = np.random.rand(N)[:, None] * 10 - 5 # Inputs, shape N x 1
    F = 2.5 * np.sin(6 * X) + np.cos(3 * X) # Mean function values
    groups = np.arange( 0, N, 1 ).reshape(-1,1)
    NoiseVar = np.array([i/100.0 for i in range(N)])[groups]
    Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar) # Noisy data
    return X, Y, groups, NoiseVar
# Get data
X, Y, groups, NoiseVar = generate_data()
Y_data = np.hstack([Y, groups])
# Generate one likelihood per data-point
likelihood = gpflow.likelihoods.SwitchedLikelihood( [gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])])
# model construction (notice that num_latent is 1)
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)
# Specify the likelihood as non-trainable
model.likelihood.set_trainable(False)
# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
session = model.enquire_session()
session.run(natgrad_tensor)
# update the cache of the variational parameters in the current session
model.anchor(session)
# Stop Adam from optimising the variational parameters
model.q_mu.trainable = False
model.q_sqrt.trainable = False
# Create Adam tensor
adam_tensor = gpflow.train.AdamOptimizer(learning_rate=0.1).make_optimize_tensor(model)
for i in range(200):
    session.run(natgrad_tensor)
    session.run(adam_tensor)
# update the cache of the parameters in the current session
model.anchor(session)
print(model)
The above code works for a Gaussian likelihood and known variances. Inspecting my real data, I see that it is very often skewed and, as a result, I want to use non-Gaussian likelihoods to model it, but I am unsure how to specify these other likelihood parameters given what I know.
So my question is: Given this setup, how can I adapt my code so far to include non-Gaussian likelihoods at each step, in particular specifying and fixing their parameters based on my known variances, skewness, kurtosis and so on associated with each individual y value?
Firstly, you will need to choose which non-Gaussian likelihood you use. GPflow includes various ones in likelihoods.py. You then need to adapt the line
likelihood = gpflow.likelihoods.SwitchedLikelihood(
[gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])]
)
to give a list of your non-Gaussian likelihoods.
Which likelihood can take advantage of your skewness and kurtosis information is a statistical question. Depending on what you come up with, you may need to implement your own likelihood class, which can be done by inheriting from Likelihood. You should be able to follow some other examples from likelihoods.py.
I've implemented a single-variable linear regression model in Python that uses gradient descent to find the intercept and slope of the best-fit line (I'm using gradient descent rather than computing the optimal values for intercept and slope directly because I'd eventually like to generalize to multiple regression).
The data I am using are below. sales is the dependent variable (in dollars) and temp is the independent variable (degrees Celsius) (think ice cream sales vs temperature, or something similar).
sales temp
215 14.20
325 16.40
185 11.90
332 15.20
406 18.50
522 22.10
412 19.40
614 25.10
544 23.40
421 18.10
445 22.60
408 17.20
And this is my data after it has been normalized:
sales temp
0.06993007 0.174242424
0.326340326 0.340909091
0 0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1 1
0.836829837 0.871212121
0.55011655 0.46969697
0.606060606 0.810606061
0.51981352 0.401515152
My code for the algorithm:
import numpy as np
import pandas as pd
from scipy import stats
class SLRegression(object):
    def __init__(self, learnrate = .01, tolerance = .000000001, max_iter = 10000):
        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter
    # Define the gradient descent algorithm.
    def fit(self, data):
        # data : array-like, shape = [m_observations, 2_columns]
        # Initialize local variables.
        converged = False
        m = data.shape[0]
        # Track number of iterations.
        self.iter_ = 0
        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0
        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: ', J)
        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])
            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)
            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1
            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s' % (new_J))
            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!' % (self.iter_))
            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1
            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')
        return self
    def point_forecast(self, x):
        # Given feature value x, returns the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x
# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))
    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()
    # Call the fit function and pass in the data.
    slregression.fit(data)
    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.' % (slregression.iter_, slregression.theta0_, slregression.theta1_))
    # Compare our model to scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:,1], data[:,0])
    print('Scipy linear regression gives intercept: %s and slope = %s.' % (intercept, slope))
    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.' % (slregression.point_forecast(.87))) # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')
I'm having trouble understanding exactly what allows the algorithm to converge versus return values that are completely wrong. Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005. This brings changes in the cost function from iteration to iteration down to around 614, but I can't get it to go any lower.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why? Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values? For instance, if I were going to deliver this algorithm to a client so they could make predictions of their own (I'm not, but for the sake of argument..), wouldn't I want them to simply be able to plug in the un-normalized x-value?
All and all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time. Is this normal, or are there flaws in my algorithm that are contributing to this issue?
Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005
That's kind of to be expected the way you have your algorithm set up.
The normalization of the data makes it so the y-intercept of the best fit is around 0.0. Otherwise, you could have a y-intercept thousands of units off of the starting guess, and you'd have to trek there before you ever really started the optimization part.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?
No, absolutely not, but if you don't normalize, you should pick a starting point more intelligently (you're starting at (m,b) = (0,0)). Your learnrate may also be too small if you don't normalize your data, and same with your tolerance.
Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?
Apply whatever transformation you applied to the original data to get the normalized data to your new x-value. (The code for normalization is outside of what you have shown.) If this test point falls within the (minx,maxx) range of your original data, then once transformed it should fall within 0 <= x <= 1. Once you have this normalized test point, plug it into your theta equation of a line (remember, your thetas are the b and m of the slope-intercept form of the equation of a line).
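A minimal sketch of that transformation, assuming the data were min-max scaled (which is consistent with the two tables in the question, e.g. 14.20 maps to 0.174242424). The function names and the theta values in the last line are made up for illustration.

```python
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def normalize(value, lo, hi):
    """Min-max scale a raw value into the range used for training."""
    return (value - lo) / (hi - lo)


def denormalize(value, lo, hi):
    """Undo the min-max scaling."""
    return value * (hi - lo) + lo


def forecast_raw(temp, theta0, theta1):
    """Predict sales in dollars from a raw temperature, given thetas fitted on normalized data."""
    x = normalize(temp, min(temps), max(temps))
    y = theta0 + theta1 * x
    return denormalize(y, min(sales), max(sales))


# Hypothetical fitted parameters, for illustration only.
print(forecast_raw(20.0, theta0=0.03, theta1=0.93))
```

Wrapping both steps in one function lets a user pass in un-normalized temperatures and get back dollars, as long as the min and max of the training data are stored alongside the thetas.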
All and all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.
For a well-formed problem, if you're in fact diverging it often means your step size is too large. Try lowering it.
If it's simply not converging before it hits the max iterations, that could be a few issues:
Your step size is too small,
Your tolerance is too small,
Your max iterations is too small,
Your starting point is poorly chosen
In your case, using the non-normalized data puts your starting point of (0,0) very far from the solution: the (intercept, slope) of the non-normalized data is around (-159, 30), while that of your normalized data is (0.10, 0.79). As a result, most if not all of your iterations are spent just getting to the area of interest.
The problem is that increasing the step size to reach the area of interest faster also makes the algorithm less likely to converge once it gets there.
To account for this, some gradient descent algorithms have dynamic step size (or learnrate) such that large steps are taken at the beginning, and smaller ones as it nears convergence.
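One simple variant of this idea is sketched below on the un-normalized data from the question: a step is only accepted if it lowers the cost, the learning rate grows after an accepted step and is halved after a rejected one. This is just one of several possible schemes. The function names and the grow/shrink factors are assumptions of this sketch.

```python
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def cost(theta0, theta1):
    m = len(temps)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(temps, sales)) / (2.0 * m)


def gradient(theta0, theta1):
    m = len(temps)
    residuals = [theta0 + theta1 * x - y for x, y in zip(temps, sales)]
    return sum(residuals) / m, sum(r * x for r, x in zip(residuals, temps)) / m


def adaptive_descent(learnrate=0.01, grow=1.1, shrink=0.5, max_iter=2000):
    theta0 = theta1 = 0.0
    costs = [cost(theta0, theta1)]
    for _ in range(max_iter):
        g0, g1 = gradient(theta0, theta1)
        cand0, cand1 = theta0 - learnrate * g0, theta1 - learnrate * g1
        new_cost = cost(cand0, cand1)
        if new_cost < costs[-1]:
            # Accept the step and be a little bolder next time.
            theta0, theta1 = cand0, cand1
            costs.append(new_cost)
            learnrate *= grow
        else:
            # Reject the step and retry from the same point with a smaller step.
            learnrate *= shrink
    return theta0, theta1, costs


theta0, theta1, costs = adaptive_descent()
print("cost went from %.1f to %.1f in %d accepted steps" % (costs[0], costs[-1], len(costs) - 1))
```

Because overshooting steps are rejected, the cost can never blow up to NaN, although on badly scaled data the approach to the optimum can still be slow.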
It may also be helpful to keep a history of the theta pairs throughout the algorithm, then plot them. You'll be able to see the difference immediately between using normalized and non-normalized input data.
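A minimal sketch of recording such a history, using plain batch gradient descent on the min-max normalized data. The learning rate of 0.5 and the iteration count are choices made for this example, not values from the question.

```python
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def descend(xs, ys, learnrate, n_iter):
    """Plain batch gradient descent that records every (theta0, theta1) pair."""
    m = len(xs)
    theta0 = theta1 = 0.0
    history = [(theta0, theta1)]
    for _ in range(n_iter):
        residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(residuals) / m
        grad1 = sum(r * x for r, x in zip(residuals, xs)) / m
        theta0, theta1 = theta0 - learnrate * grad0, theta1 - learnrate * grad1
        history.append((theta0, theta1))
    return history


history = descend(min_max(temps), min_max(sales), learnrate=0.5, n_iter=5000)
print("start:", history[0], "end:", history[-1])
```

The list of pairs can be handed to any plotting library as a path in the (theta0, theta1) plane. Running the same function on the raw columns gives the path for the non-normalized case.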
I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)
def sanitize(value):
    return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size, MLlib falls back to its default of 1. This does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves (in the Python API the keyword arguments are iterations and step):
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
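To illustrate the update rule w(i+1) = w(i) - s * g outside of Spark, here is a plain-Python sketch that fits a single weight to the seven sample rows, using only the last column as a feature (called visits here, a made-up name) and the third column as the label. It is full-batch gradient descent on one feature, not MLlib's implementation, so the step sizes that work here will not carry over directly.

```python
visits = [5330569, 5946290, 6097388, 5304694, 5471836, 5849862, 5460289]
revenue = [41401.387, 51517.886, 55059.838, 43780.977, 46447.196, 50656.121, 44494.476]


def descend(step, n_iter):
    """Full-batch gradient descent for revenue ~ w * visits (one weight, no intercept)."""
    w = 0.0
    n = len(visits)
    for _ in range(n_iter):
        # Gradient g of the mean squared error (up to a constant factor).
        g = sum((w * x - y) * x for x, y in zip(visits, revenue)) / n
        # Update rule: w(i+1) = w(i) - s * g
        w = w - step * g
    return w


print("step=1.0   ->", descend(step=1.0, n_iter=10))
print("step=1e-14 ->", descend(step=1e-14, n_iter=100))
```

With features in the millions the gradient is enormous, so a step of 1.0 overshoots further on every iteration, while a very small step settles on a sensible weight. Rescaling the features before training is another way to make ordinary step sizes usable.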