Only save transformed parameters in PyMC3

I want to shift my variables in PyMC3. I'm currently just using deterministic transforms, but when I perform the inference, it saves both the original samples and the shifted samples (which is expected behavior).
Example code:
import pymc3 as pm

x_lower = -3

with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = pm.Deterministic("x_shift", x + x_lower)
    trace = pm.sample(1000, tune=1000)

trace.remove_values("x")  # my current solution
tp = pm.traceplot(trace)
# other analysis...
Now trace stores all the x samples and all the x_shift samples, which is clearly a waste when the numbers of variables and samples increase. I can do trace.remove_values("x") before continuing with the analysis, but I would prefer to simply not save x at all.
Another option is to not save x_shift at all, but I can't find how to add x_lower onto the samples after inference. So this isn't really a solution if I want to use the in-built analysis tools.
Can I save only the x_shift samples, and not the x samples, when I sample?

You can specify exactly what you want to save by setting the trace argument in the pm.sample() function, e.g.,
trace = pm.sample(1000, tune=1000, trace=[x_shift])

In case anyone finds this...
I went for a still-unsatisfying solution: I'm no longer using Deterministic transformations. I still transform the variables, but I don't save them; I just apply the transformation to the saved (original) samples after sampling.
The above code now looks like this:
import pymc3 as pm

x_lower = -3

def f_x(x):
    return x + x_lower

with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = f_x(x)
    # keep = {"x": f_x}  # something like this for more variables
    trace = pm.sample(1000, tune=1000)

trace = {"x": f_x(trace["x"])}  # note this merges all chains
# trace = {varname: f(trace[varname]) for varname, f in keep.items()}
tp = pm.traceplot(trace)
# other analysis...
I think pm.traceplot(trace) still works with trace in this form, but if not, just import arviz and use it directly; it does work with dicts like this.
Note: take care with more complicated transformations; e.g., you'll have to use pm.math.exp in the model but np.exp in the post-sampling transformation.
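For example, a minimal sketch of that caveat (the exponential transform here is purely illustrative, not part of the original model):
import numpy as np
import pymc3 as pm

x_lower = -3

def f_x_model(x):
    # symbolic version, used on the random variable inside the model
    return pm.math.exp(x) + x_lower

def f_x_post(x):
    # numpy version, applied to the sampled arrays after inference
    return np.exp(x) + x_lower

with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_exp = f_x_model(x)  # used in the model, not saved
    trace = pm.sample(1000, tune=1000)

trace = {"x": f_x_post(trace["x"])}  # transform the saved samples with numpy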

Related

Summarise the posterior of a single parameter from an array with arviz

I am estimating a model using the pyMC3 library in python. In my "real" model, there are four parameter arrays, two of which have over 170,000 parameters in them. Summarising this array of parameters is too computationally intensive on my computer. I have been trying to figure out if the summary function in arviz will allow me to only summarise one (or a small number) of parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters in it b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1] or alternatively for just a single parameter, e.g., b[0].
import pandas as pd
import pymc3 as pm
import arviz as az

d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values

with pm.Model() as model:
    b = pm.Normal("b", mu=0, sigma=10, shape=3)
    sig = pm.HalfCauchy("sig", beta=2)
    mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
    like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
    fit = pm.fit(10000, method='advi')
    samp = fit.sample(1500)

with model:
    smry = az.summary(samp, var_names=["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples, like the one here with plot_posterior() instead of summary(), I was unable to get something to work. In particular, I tried the following in the hopes that it would return the summary for b[0] and b[1].
with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": range(1)})
or this to return the summary of b[0]:
with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)
To use coords for this, you need to update to the development version of ArviZ (which will still show 0.11.2 but has the code from GitHub) or to any release newer than 0.11.2. Up to and including 0.11.2, the coords argument in summary was not used to subset the data (as it is in all the plotting functions); instead it was only taken into account if the input was not already InferenceData, in which case it was passed to the converter.
With older versions, you need to use xarray to subset the data before passing it to summary. Therefore you need to explicitly convert the trace to InferenceData beforehand. In the example above it would look like:
with model:
    ...
    samp = fit.sample(1500)
    idata = az.from_pymc3(samp)

az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to tell summary to compute only a subset of the stats/diagnostics, as shown in the docstring examples.
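For instance, a minimal sketch (kind="stats" is a standard az.summary argument that skips the sampling diagnostics; selecting b[0] and b[1] here is just an example):
# summary statistics only, for the first two elements of b
az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0, 1]}), kind="stats")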

using scipy curve_fit with dask/xarray

I'm trying to use scipy.optimize.curve_fit on a large latitude/longitude/time xarray using dask.distributed as computing backend.
The idea is to run an individual data fitting for every (latitude, longitude) using the time series.
All of this runs fine outside xarray/dask. I tested it using the time series of a single location passed as a pandas dataframe. However, if I try to run the same process on the same (latitude, longitude) directly on the xarray, the curve_fit operation returns the initial parameters.
I am performing this operation using xr.apply_ufunc like so (here I'm providing only the code that is strictly relevant to the problem):
# function to perform the fit
def _fit_rti_curve(data, data_rti, fit, loc=False):
    fit_func, linearize, find_init_params = _get_fit_functions(fit)
    # remove nans
    x, y = _filter_nodata(data_rti, data)
    # remove outliers
    x, y = _filter_for_outliers(x, y, linearize=linearize)
    # find a first guess for the maximum achievable value
    yscale = np.max(y) * 1.05
    # find a first guess for the other parameters
    # here loc can be manually passed if you have a good estimation
    init_parms = find_init_params(x, y, yscale, loc=loc, linearize=linearize)
    # fit the curve and return parameters
    parms = curve_fit(fit_func, x, y, p0=init_parms, maxfev=10000)
    parms = parms[0]
    return parms

# shell around _fit_rti_curve
def find_rti_func_parms(data, rti, fit):
    # sort and fit highest n values
    top_data = np.sort(data)
    top_data = top_data[-len(rti):]
    # convert to float64 if needed
    top_data = top_data.astype(np.float64)
    rti = rti.astype(np.float64)
    # run the fit
    parms = _fit_rti_curve(top_data, rti, fit, loc=0)  # TODO maybe add function to allow a free loc
    return parms

# call for the apply_ufunc
# `fit` is a string that defines the distribution type
# `rti` is an array for the x values
parms_data = xr.apply_ufunc(
    find_rti_func_parms,
    xr_obj,
    input_core_dims=[['time']],
    output_core_dims=[[fit + ' parameters']],
    output_sizes={fit + ' parameters': len(signature(fit_func).parameters) - 1},
    vectorize=True,
    kwargs={'rti': return_time_interval, 'fit': fit},
    dask='parallelized',
    output_dtypes=['float64']
)
My guess would be that it is a problem related to threading, or at least to some shared memory space that is not properly passed between the workers and the scheduler.
However, I am just not knowledgeable enough to test this within dask.
Any idea on this problem?
You should have a look at this issue https://github.com/pydata/xarray/issues/4300
I had the same problem and solved it using apply_ufunc. It is not optimized, since it has to perform rechunking operations, but it works!
I've created a GitHub Gist for it https://gist.github.com/clausmichele/8350e1f7f15e6828f29579914276de71
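For what it's worth, here is a minimal sketch of the rechunking idea, reusing the names from the question and assuming the dimensions are called latitude/longitude/time (I have not checked whether this matches the gist exactly):
# make sure the core dimension ("time") is a single chunk per block, so each
# call of find_rti_func_parms sees the full time series for its pixel
xr_obj = xr_obj.chunk({"latitude": 50, "longitude": 50, "time": -1})

parms_data = xr.apply_ufunc(
    find_rti_func_parms,
    xr_obj,
    input_core_dims=[["time"]],
    output_core_dims=[[fit + " parameters"]],
    output_sizes={fit + " parameters": len(signature(fit_func).parameters) - 1},
    vectorize=True,
    kwargs={"rti": return_time_interval, "fit": fit},
    dask="parallelized",
    output_dtypes=["float64"],
)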
This previous answer might be helpful? It's using numpy.polyfit but I think the general approach should be similar.
Applying numpy.polyfit to xarray Dataset
Also, I haven't tried it but xr.polyfit() just got merged recently! Could also be something to look into. http://xarray.pydata.org/en/stable/generated/xarray.DataArray.polyfit.html#xarray.DataArray.polyfit
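If the curve happens to be a polynomial, a minimal usage sketch (assuming xr_obj is a DataArray with a numeric or datetime 'time' coordinate; degree 2 is just an example):
# fit a degree-2 polynomial along "time" for every (latitude, longitude)
fit_result = xr_obj.polyfit(dim="time", deg=2)
print(fit_result.polyfit_coefficients)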

Statsmodels: vector_ar and IRAnalysis

I'm trying to estimate impulse response functions of a -1 standard-deviation shock to a 3-dimensional VAR using statsmodels.tsa; however, I'm currently having issues with setting the shock magnitude.
This gives me the IRFs for a 1 s.d. shock, the default:
import numpy as np
import statsmodels.tsa as sm

model = sm.vector_ar.var_model.VAR(endog=data)
fitted = model.fit()
shock = -1*fitted.sigma_u
irf = sm.vector_ar.irf.IRAnalysis(model=fitted)
The IRAnalysis function takes an argument P, an upper diagonal matrix that sets the shocks; I found this by looking at the source code. However, passing P as shown below doesn't seem to do anything.
irf = statsmodels.tsa.vector_ar.irf.IRAnalysis(model = fitted, P = -np.linalg.cholesky(model.fitted_U))
I would really appreciate some help.
Thanks in advance.
I have had the same question and finally found something that works on my end.
Instead of using IRAnalysis explicitly, I found that transforming the VAR model into its MA representation was the best way to adjust the size of the shock.
from statsmodels.tsa.vector_ar.irf import IRAnalysis
J = fitted.ma_rep(T)
J = shock*np.array(J)
This will give you the IRFs for T periods.
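For concreteness, a minimal sketch of one way to apply that idea for an orthogonalized -1 standard-deviation shock to a single variable (the variable index j and the use of the Cholesky factor of sigma_u are my assumptions, not part of the original answer):
import numpy as np

T = 20                                    # number of horizons (illustrative)
P = np.linalg.cholesky(fitted.sigma_u)    # impact matrix of orthogonalized shocks
Psi = np.array(fitted.ma_rep(T))          # MA coefficients, shape (T+1, k, k)
j = 0                                     # index of the shocked variable
shock_vec = -P[:, j]                      # -1 s.d. orthogonalized shock
irfs = Psi @ shock_vec                    # responses over T+1 periods, shape (T+1, k)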
I also wanted the standard error bands on my plots, so I did something similar to that particular function as well.
G, H = fitted.irf_errband_mc(orth=False, repl=1000, steps=T, signif=0.05, seed=None, burn=100, cum=False)
Hope this helps

SVM with python and CPLEX, load the quadratic part of the objective function

''In general, you get better performance creating batches of linear constraints rather than creating them one at a time. I'm just wondering whether that still holds with a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset, and I want to do SVM on it. I need to produce the Gram matrix of this dataset, which is fine, but passing the coefficients to CPLEX is a mess; it takes hours. Here is my code:
nn = 35000
XXt = np.random.rand(nn, nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)       # the label vector of the dataset

temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg, xg])

quadratic_part = []
for ii in range(nn):
    for indd in indici[ii][ii:]:
        quadratic_part.append([indd[0], indd[1], temp[indd[0], indd[1]]])
'quadratic_part' is a list of triples of the form [i, j, c_ij], where c_ij is the coefficient stored in temp. It will be passed to the 'objective.set_quadratic_coefficients()' function of the CPLEX Python API.
Is there a wiser way to do this?
P.S. I may also have a memory problem, so it would be better, instead of storing the whole 'quadratic_part' list, to call 'objective.set_quadratic_coefficients()' several times... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library, whereas objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex

nn = 5  # a small example size here
XXt = np.random.rand(nn, nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)       # the label vector of the dataset
temp = ((yy*XXt).T)*yy

# create symmetric matrix
tempu = np.triu(temp)     # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1] # copy upper into lower

ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
    qmat.append([np.arange(nn), tempu[i]])

c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense, so depending on the amount of memory you have, this technique may not scale. When it is possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.
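Regarding the memory concern from the question, a rough sketch of such a batched approach might look like the following (reusing c and temp from the example above; the batch size is arbitrary, and this assumes set_quadratic_coefficients accepts a sequence of (variable, variable, value) triples, as described in the question):
# feed the upper triangle of the Q matrix to CPLEX in batches instead of
# building one huge list of [i, j, c_ij] triples
batch = []
for ii in range(nn):
    for jj in range(ii, nn):
        batch.append((ii, jj, temp[ii, jj]))
        if len(batch) >= 100000:  # arbitrary batch size
            c.objective.set_quadratic_coefficients(batch)
            batch = []
if batch:
    c.objective.set_quadratic_coefficients(batch)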

How to define General deterministic function in PyMC

In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
Following is PyMC3 code which shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm

# Predefine values on a two-parameter grid (x, w) for a set of i values (1, 2, 3)
idata = np.array([1, 2, 3])
size = 20
gridlength = size*size
Grid = np.empty((gridlength, 2 + len(idata)))
for x in range(size):
    for w in range(size):
        # A silly version of my real model evaluated on the grid.
        Grid[x*size + w, :] = np.array([x, w] + [(x**i + w**i) for i in idata])

# A function to find the nearest value in Grid and return its product with the third variable z
def FindFromGrid(x, w, z):
    return Grid[int(x)*size + int(w), 2:] * z

# Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size + 12, 2:]*3.6 + yerror  # i.e. true x=16, w=12 and z=3.6

with pm.Model() as model:
    # Priors
    x = pm.Uniform('x', lower=0, upper=size)
    w = pm.Uniform('w', lower=0, upper=size)
    z = pm.Uniform('z', lower=-5, upper=10)
    # Expected value
    y_hat = pm.Deterministic('y_hat', FindFromGrid(x, w, z))
    # Data likelihood
    ysigmas = np.ones(len(idata))*9.0
    y_like = pm.Normal('y_like', mu=y_hat, sd=ysigmas, observed=ydata)
    # Inference...
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(state=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(1000, step, start=start, progressbar=False)  # draw 1000 posterior samples using NUTS sampling

print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z': 3.6})
fig.show()
When I run this code, I get an error at the y_hat stage, because the int() function inside FindFromGrid(x, w, z) needs an integer, not a FreeRV.
Finding y_hat from a pre-calculated grid is important because my real model for y_hat does not have an analytical form.
I earlier tried to use OpenBUGS, but I found out here that it is not possible to do this in OpenBUGS. Is it possible in PyMC?
Update
Based on an example on the PyMC GitHub page, I found I need to add the following decorator to my FindFromGrid(x, w, z) function.
@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
This seems to solve the above mentioned issue. But I cannot use NUTS sampler anymore since it needs gradient.
Metropolis seems to be not converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: Are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this might not converge is that Metropolis() by default samples in the joint space, while Gibbs within Metropolis is often more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice samplers to be non-blocked (so within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.
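A minimal sketch of trying those options, reusing the model from the question (which step method actually converges is something you would have to check):
with model:
    start = pm.find_MAP()
    # non-blocked Metropolis (Gibbs-style, one variable at a time)
    step = pm.Metropolis(blocked=False)
    # or try the Slice sampler instead:
    # step = pm.Slice()
    trace = pm.sample(5000, step=step, start=start)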
