Say I try to estimate the slope of a simple y = m * x problem using the following data:
x_data = np.array([0,1,2,3])
y_data = np.array([0,1,2,3])
Clearly the slope is 1. However, when I run this in PyMC I get 10
slope = pm.Uniform('slope', lower=0, upper=20)

@pm.deterministic
def y_gen(value=y_data, x=x_data, slope=slope, observed=True):
    return slope * x

model = pm.Model([slope])
mcmc = pm.MCMC(model)
mcmc.sample(100000, 5000)

# This returns 10
final_guess = mcmc.trace('slope')[:].mean()
but it should be 1!
Note: The above is with PyMC2.
You need to define a likelihood. Try this:
import pymc as pm
import numpy as np

x_data = np.linspace(0, 1, 100)
y_data = np.linspace(0, 1, 100)

slope = pm.Normal('slope', mu=0, tau=10**-2)
tau = pm.Uniform('tau', lower=0, upper=20)

@pm.deterministic
def y_gen(x=x_data, slope=slope):
    return slope * x

like = pm.Normal('likelihood', mu=y_gen, tau=tau, observed=True, value=y_data)

model = pm.Model([slope, y_gen, like, tau])
mcmc = pm.MCMC(model)
mcmc.sample(100000, 5000)

# This now returns ~1 (the posterior mean of the slope)
final_guess = mcmc.trace('slope')[:].mean()
It returns 10 because you're just sampling from your uniform prior and 10 is the expected value of that.
You need to set value=y_data, observed=True for the likelihood. Also, a minor point, you don't need to instantiate a Model object. Just pass your nodes (or a call to locals()) to MCMC.
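For example, with the nodes defined as above, the tail of the snippet could be reduced to something like this (a minimal sketch, still PyMC2):
# no explicit Model object: MCMC accepts a dict of nodes, e.g. locals()
mcmc = pm.MCMC(locals())      # or pm.MCMC([slope, tau, y_gen, like])
mcmc.sample(100000, 5000)
final_guess = mcmc.trace('slope')[:].mean()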
I know how I can get normal Q-Q plots in Python, but how can I get quantile residual Q-Q plots?
I tried to do the three steps written here (Chapter 20.2.6.1):
First I tried to adapt this solution for use with smf.glm (I need to use smf because I have a huge dataframe with hundreds of variables I need to pass):
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)
df_test = pd.DataFrame({'y': y, 'x': x})

loglike_method = 'nb2'  # or use 'nb1'

#res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
res = smf.glm(formula="y ~ x", data=df_test, family=sm.families.NegativeBinomial()).fit(start_params=[0.1, 0.1])

print(dist0.mean())
print(res.params)

mu = res.predict()  # use this for mean if not constant
mu = mu.mean()
#mu = np.exp(res.params[0])  # shortcut, we just regress on a constant
alpha = res.params[0]

if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0

size = 1. / alpha * mu**Q
prob = size / (size + mu)

print('data generating parameters', n, p)
print('estimated params', size, prob)

# estimated distribution
dist_est = stats.nbinom(size, prob)
But the estimated parameters are totally off.
The next step would be to call stats.nbinom.cdf with those parameters to simulate values ...
Is this the right way?
And how can I get the correct values for size and prob from my fitted model?
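For what it's worth, here is a rough sketch of how I imagine the next step; the randomized-quantile part is my own assumption of what the reference means, using dist_est from above:
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# randomized quantile residuals for a discrete fit, then a normal Q-Q plot
F_lower = dist_est.cdf(y - 1)                 # P(Y <= y-1)
F_upper = dist_est.cdf(y)                     # P(Y <= y)
u = np.random.uniform(F_lower, F_upper)       # randomize within the jump of the discrete CDF
quantile_resid = stats.norm.ppf(u)            # map to the standard normal scale

sm.qqplot(quantile_resid, line='45')
plt.show()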
I'm quite new to probabilistic programming and pymc3...
Currently, I want to implement the Kennedy-O’Hagan framework in pymc3.
The setup is according to the paper of Kennedy and O'Hagan as follows:
We have n observations z_i of the form
z_i = f(x_i, theta) + g(x_i) + e_i,
where the x_i are known inputs, theta is a vector of unknown calibration parameters, and the e_i are iid error terms. We also have m model evaluations y_j of the form
y_j = f(x'_j, theta_j), where both x'_j (different from the x_i above) and theta_j are known. The data therefore consist of all z_i and y_j. In the paper, Kennedy and O'Hagan model f and g with Gaussian processes:
f ~ GP( m1(.,.), Sigma1((.,.),(.,.)) )
g ~ GP( m2(.), Sigma2((.),(.)) )
Among other things, the goal is to get posterior samples for the unknown calibration parameters theta.
What I've done so far:
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import freeze_support
import sys
import theano
import theano.tensor as tt
from mpl_toolkits.mplot3d import Axes3D
import pyDOE
from scipy.stats.distributions import uniform

def physical_system(x):
    return 0.65 * x / (1 + x / 5)

def observation(x):
    return physical_system(x[:]) + np.random.normal(0, 0.01, len(x))

def computational_system(input):
    return input[:, 0] * input[:, 1]

if __name__ == "__main__":
    freeze_support()

    # observations with noise
    x_obs = np.linspace(0, 4, 10)
    y_real = physical_system(x_obs[:])
    y_obs = observation(x_obs[:])

    # computer model evaluations on a Latin hypercube design
    N = 60
    design = pyDOE.lhs(2, samples=N, criterion='center')
    left = [-0.2, -0.2]; right = [4.2, 1.2]
    for i in range(2):
        design[:, i] = uniform(loc=left[i], scale=right[i] - left[i]).ppf(design[:, i])

    x_comp = design[:, 0][:, None]; t_comp = design[:, 1][:, None]
    input_comp = np.hstack((x_comp, t_comp))
    y_comp = computational_system(input_comp)

    x_obs_shared = theano.shared(x_obs[:, None])

    with pm.Model() as model:
        noise = pm.HalfCauchy('noise', beta=5)
        ls_1 = pm.Gamma('ls_1', alpha=1, beta=1, shape=2)
        cov = pm.gp.cov.ExpQuad(2, ls=ls_1)
        gp1 = pm.gp.Marginal(cov_func=cov)
        # train the GP f with data from the computer model:
        f_0 = gp1.marginal_likelihood('f_0', X=input_comp, y=y_comp, noise=noise)
        trace = pm.sample(500, pm.Metropolis(), chains=4)
        burned_trace = trace[300:]
Up to here, everything is fine. My GP f is trained on the data from the computer model.
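(As a minimal sketch of how the trained GP could be sanity-checked at this point; the test inputs and the use of gp1.predict are my own additions, not part of the original formulation:)
# hypothetical test grid: x in [0, 4], calibration input fixed at 0.5
X_test = np.column_stack([np.linspace(0, 4, 20), np.full(20, 0.5)])
with model:
    # predictive mean and variance of the trained GP at one point of the burned trace
    mu_pred, var_pred = gp1.predict(X_test, point=burned_trace[0], diag=True)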
Now, I want to test if I could fit this trained GP to my observed data:
# gp f is now trained to data from computer model
# now I want to fit this trained gp to observed data and find posterior for theta
with model:
    sd = pm.Gamma('eta', alpha=1, beta=1)
    theta = pm.Normal('theta', mu=0, sd=sd)
    sigma = pm.Gamma('sigma', alpha=1, beta=1)
    input_1 = tt.concatenate([x_obs_shared, tt.tile(theta, len(x_obs[:, None]), ndim=2).T], axis=1)
    f_1 = gp1.conditional('f_1', Xnew=input_1, shape=(10,))
    y_ = pm.Normal('y_', mu=f_1, sd=sigma, observed=y_obs)
    step = pm.Metropolis()
    trace_ = pm.sample(30000, step, start=pm.find_MAP(), chains=4)
Is this formulation correct? I get very unstable results...
The full formulation according to KOH should be something like this:
with pm.Model() as model:
    theta = pm.Normal('theta', mu=0, sd=10)
    noise = pm.HalfCauchy('noise', beta=5)
    ls_1 = pm.Gamma('ls_1', alpha=1, beta=1, shape=2)
    cov = pm.gp.cov.ExpQuad(2, ls=ls_1)
    gp1 = pm.gp.Marginal(cov_func=cov)
    gp2 = pm.gp.Marginal(cov_func=cov)
    gp = gp1 + gp2
    input_1 = tt.concatenate([x_obs_shared, tt.tile(theta, len(x_obs), ndim=2).T], axis=1)
    f_0 = gp1.marginal_likelihood('f_0', X=input_comp, y=y_comp, noise=noise)
    f_1 = gp1.marginal_likelihood('f_1', X=input_1, y=y_obs, noise=noise)
    f = gp.marginal_likelihood('f', X=input_1, y=y_obs, noise=noise)
Could somebody give me some advice on how to formulate the KOH model properly with pymc3? I am desperate... I would appreciate any help. Thank you!
You might have found the solution already, but if not, this is a good reference: Guidelines for the Bayesian calibration of building energy models.
Following the recommendations in this answer, I have used several combinations of values for beta0 and, as shown here, the values from polyfit.
This example is UPDATED in order to show the effect of relative scales of values of X versus Y (X range is 0.1 to 100 times Y):
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt

seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])

for num in range(1, 5):
    plt.subplot(2, 2, num)
    plt.title('X range is %.1f times Y' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):
        return B[0]*X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp + b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp + z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()
It seems that no combination of beta0 helps... The only way to get the polyfit and ODR fits to be similar is to swap X and Y, OR, as shown here, to increase the range of values of X relative to Y, which is still not really a solution :)
=== EDIT ===
I do not want ODR to be the same as polyfit. I am showing polyfit just to emphasize that the ODR fit is wrong and it is not a problem of the data.
=== SOLUTION ===
Thanks to @norok2's answer, this works when the Y range is 0.001 to 100000 times X:
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt

seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])

plt.figure(figsize=(12, 12))
for num in range(1, 10):
    plt.subplot(3, 3, num)
    plt.title('Y range is %.1f times X' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):
        return B[0]*X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y,
                          sy=min(1/np.var(Y), 1/np.var(X)))  # here the trick!! :)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp + b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp + z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()
The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit works under the assumption that the error on x is negligible. If this assumption is violated, like it is in your data, you cannot expect the two methods to produce similar results.
In particular, ODR() is very sensitive to the errors you specify.
If you do not specify any error/weighting, it will assign a value of 1 for both x and y, meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
On the contrary, polyfit(), before computing the fit, applies some sort of pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change your numerical conditioning, i.e. change:
mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)
to:
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
(EDIT: note there was a typo on the line above)
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
Note that this would only make sense for well-conditioned fits.
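A minimal self-contained sketch of the idea (the toy data below are made up to mimic your scale mismatch; they are not your data):
import numpy as np
from scipy import odr

# toy data: x and y live on very different scales
rng = np.random.default_rng(0)
X = rng.random(1000) / 1000
Y = np.arange(1000) + rng.random(1000)**2

z = np.polyfit(X, Y, 1)                      # ordinary least-squares baseline

def f(B, X):
    return B[0]*X + B[1]

linear = odr.Model(f)
# rescale the y-error to fix the numerical conditioning
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)                    # explicit ODR
out = myodr.run()

print('polyfit:', z)
print('ODR:    ', out.beta)                  # should now be close to the polyfit result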
I cannot format source code in a comment, so I place it here. This code uses ODR to calculate fit statistics; note the line marked "parameter order for odr": I use a wrapper function so that ODR can call my "actual" function.
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats

x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])

def f(x, b0, b1):
    return b0 + (b1 * x)

def f_wrapper_for_odr(beta, x):  # parameter order for odr
    return f(x, *beta)

parameters, cov = curve_fit(f, x, y)

model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x, y)
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()

df_e = len(x) - len(parameters)                # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta        # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta

ci = []
t_df = scipy.stats.t.ppf(0.975, df_e)
for i in range(len(parameters)):
    ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i],
               parameters[i] + t_df * parameterStatistics.sd_beta[i]])

tstat_beta = parameters / parameterStatistics.sd_beta                   # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0  # coef. p-values

for i in range(len(parameters)):
    print('parameter:', parameters[i])
    print('   conf interval:', ci[i][0], ci[i][1])
    print('   tstat:', tstat_beta[i])
    print('   pstat:', pstat_beta[i])
    print()
I am using Scipy's odrpack to fit a linear function to some data that has uncertainties in both the x and y dimensions. Each data point has its own uncertainty, and the uncertainties are asymmetric.
I can fit a function using symmetric uncertainties, but this is not a true representation of my data.
How can I perform the fit with this in mind?
This is my code so far. It receives input data as a command line argument, and the uncertainties I'm using are just random numbers at the moment. (Also, two fits are happening, one for the positive data points and another for the negative; the reasons are unrelated to this question.)
import sys
import numpy as np
import scipy.odr.odrpack as odrpack

def f(B, x):
    return B[0]*x + B[1]

xdata = sys.argv[1].split(',')
xdata = [float(i) for i in xdata]
xdata = np.array(xdata)

# find indices of +/- data
zero_ind = np.where(xdata >= 0)[0][0]
x_p = xdata[zero_ind:]
x_m = xdata[:zero_ind+1]

ydata = sys.argv[2].split(',')
ydata = [float(i) for i in ydata]
ydata = np.array(ydata)
y_p = ydata[zero_ind:]
y_m = ydata[:zero_ind+1]

sx_m = np.random.random(len(x_m))
sx_p = np.random.random(len(x_p))
sy_m = np.random.random(len(y_m))
sy_p = np.random.random(len(y_p))

linear = odrpack.Model(f)

data_p = odrpack.RealData(x_p, y_p, sx=sx_p, sy=sy_p)
odr_p = odrpack.ODR(data_p, linear, beta0=[1., 2.])
out_p = odr_p.run()

data_m = odrpack.RealData(x_m, y_m, sx=sx_m, sy=sy_m)
odr_m = odrpack.ODR(data_m, linear, beta0=[1., 2.])
out_m = odr_m.run()
Thanks!
I will just give you a solution with random data; I did not bother to import your data.
import numpy as np
import scipy.odr.odrpack as odrpack

np.random.seed(1)

N = 10
x = np.linspace(0, 5, N) * (-1)
y = 2*x - 1 + np.random.random(N)
sx = np.random.random(N)
sy = np.random.random(N)

def f(B, x):
    return B[0]*x + B[1]

linear = odrpack.Model(f)
# mydata = odrpack.Data(x, y, wd=1./np.power(sx, 2), we=1./np.power(sy, 2))
mydata = odrpack.RealData(x, y, sx=sx, sy=sy)
myodr = odrpack.ODR(mydata, linear, beta0=[1., 2.])
myoutput = myodr.run()
myoutput.pprint()
Then we get:
Beta: [ 1.92743947 -0.94409236]
Beta Std Error: [ 0.03117086 0.11273067]
Beta Covariance: [[ 0.02047196 0.06690713]
[ 0.06690713 0.26776027]]
Residual Variance: 0.04746112419196648
Inverse Condition #: 0.10277763521624257
Reason(s) for Halting:
Sum of squares convergence
Currently, I am trying to solve a problem from astrophysics which can be simplified as follows:
I want to fit a linear model (say y = a + b*x) to observed data, and I wish to use PyMC to characterize the posterior of a and b in a discrete grid parameter space, like in this figure:
I know PyMC has a DiscreteMetropolis class for finding the posterior in discrete space, but that works on integers, not on a custom discrete grid. So I am thinking of defining a potential to force PyMC to search in the grid, but it is not working well... Can anyone help with this, or has anyone solved a similar problem? Any thoughts will be greatly appreciated :)
Here is my draft code, commented out potential class is my idea to force PyMC to search in the grid:
import sys
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import pymc

#------------------------------------------------------------
# Generate the data
size = 200
slope_true = 12.3
y_intercept_true = 22.4
x = np.linspace(0, 1, size)
# y = a + b*x
y_true = y_intercept_true + slope_true * x
# add noise
y = y_true + np.random.normal(scale=.03, size=size)

# Define searching parameter space
# Note: this is discrete but not in the form of integer
slope_search_space = np.linspace(1, 30, 51)
y_intercept_search_space = np.linspace(1, 30, 51)

#------------------------------------------------------------
# Start initializing PyMC
@pymc.stochastic(dtype=int)
def slope(value=5, t_l=1, t_h=30):
    """The switchpoint for the rate of disaster occurrence."""
    def logp(value, t_l, t_h):
        if value > t_h or value < t_l:
            return -np.inf
        else:
            return -np.log(t_h - t_l + 1)

#@pymc.potential
#def slope_prior(val=slope, t_l=-30, t_h=30):
#    if val not in slope_search_space:
#        return -np.inf
#    return -np.log(t_h - t_l + 1)

#---

@pymc.stochastic(dtype=int)
def y_intercept(value=4, t_l=1, t_h=30):
    """The switchpoint for the rate of disaster occurrence."""
    def logp(value, t_l, t_h):
        if value > t_h or value < t_l:
            return -np.inf
        else:
            return -np.log(t_h - t_l + 1)

#@pymc.potential
#def y_intercept_prior(val=y_intercept, t_l=-30, t_h=30):
#    if val not in y_intercept_search_space:
#        return -np.inf
#    return -np.log(t_h - t_l + 1)

# Define observed data
@pymc.deterministic
def mu(x=x, slope=slope, y_intercept=y_intercept):
    # Linear age-price model
    return y_intercept + slope*x

# Sampling distribution of prices
p = pymc.Poisson('p', mu, value=y, observed=True)

model = dict(slope=slope, y_intercept=y_intercept, mu=mu, p=p)

#-----------------------------------------------------------
# perform the MCMC
M = pymc.MCMC(model)
trace = M.sample(iter=10000, burn=5000)

# Plot
pymc.Matplot.plot(M)
plt.figure()
pymc.Matplot.summary_plot([M.slope, M.y_intercept])
plt.show()
I managed to solve my problem a few days ago. To my surprise, some of my astronomy friends in a Facebook group are also interested in this question, so I think it might be useful to post my solution in case other people are having the same issue. Please note, this solution may not be the best way to tackle the problem; in fact, I believe there is a more elegant way. But for now, this is the best I can come up with. I hope it is helpful to some of you.
The way I solved the problem is very straightforward, and I summarize it as follows:
1> Define the slope and y_intercept stochastic variables in continuous form (PyMC will then use Metropolis to sample them).
2> Define a function find_nearest to map the continuous random variables slope and y_intercept onto the grid, e.g. Grid_slope = np.array([1,2,3,4,…51]); if slope = 4.678, then find_nearest(Grid_slope, slope) returns 5, because 5 is the grid point closest to the slope value. Similarly for the y_intercept variable (see the small sketch after this list).
3> When computing the likelihood, this is where I do the trick: I apply find_nearest to the model inside the likelihood function, i.e. I change model(slope, y_intercept) to model(find_nearest(Grid_slope, slope), find_nearest(Grid_y_intercept, y_intercept)), which evaluates the likelihood only on the grid parameter space.
4> The trace returned for slope and y_intercept by PyMC may not be strictly on the grid; you can use find_nearest to map the trace onto the grid and then make any statistical inference from it. In my case, I just use the trace straight away to get the statistics, and the result is nice :)
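As a quick standalone illustration of step 2 (the grid here is shortened for brevity):
import numpy as np

def find_nearest(array, value):
    # return the element of 'array' closest to 'value'
    idx = (np.abs(array - value)).argmin()
    return array[idx]

Grid_slope = np.array([1, 2, 3, 4, 5, 6])    # shortened example grid
print(find_nearest(Grid_slope, 4.678))       # -> 5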
import sys
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import pymc

#------------------------------------------------------------
# Generate the data
size = 200
slope_true = 12.3
y_intercept_true = 22.4
x = np.linspace(0, 1, size)
# y = a + b*x
y_true = y_intercept_true + slope_true * x
# add noise
y = y_true + np.random.normal(scale=.03, size=size)

# Define searching parameter space
# Note: this is discrete but not in the form of integer
slope_search_space = np.linspace(1, 30, 51)
y_intercept_search_space = np.linspace(1, 30, 51)

#------------------------------------------------------------
# Start initializing PyMC
from pymc import Normal, Gamma, deterministic, MCMC, Matplot, Uniform

# Constant priors for parameters
slope = Uniform('slope', 1, 30)
y_intercept = Uniform('y_intp', 1, 30)

# Precision of normal distribution of y values
tau = Uniform('tau', 0, 10000)

@deterministic
def mu(x=x, slope=slope, y_intercept=y_intercept):
    def find_nearest(array, value):
        """
        This function maps 'value' to the nearest point in 'array'
        """
        idx = (np.abs(array - value)).argmin()
        return array[idx]

    # Linear model
    iso = find_nearest(y_intercept_search_space, y_intercept) + find_nearest(slope_search_space, slope)*x
    return iso

# Sampling distribution of y
p = Normal('p', mu, tau, value=y, observed=True)

model = dict(slope=slope, y_intercept=y_intercept, tau=tau, mu=mu, p=p)

#-----------------------------------------------------------
# perform the MCMC
M = pymc.MCMC(model)
trace = M.sample(40000, 20000)

# Plot
pymc.Matplot.plot(M)
M.slope.summary()
M.y_intercept.summary()
plt.figure()
pymc.Matplot.summary_plot([M.slope, M.y_intercept])
plt.show()