Bayesian update in pymc3: adding more data doesn't work - python

I am new to pymc3, but I've heard it can be used to build a Bayesian update model. So I tried, without success. My goal was to predict which day of the week a person buys a certain product, based on prior information from a number of customers, as well as that person's shopping history.
So let's suppose I know that customers in general buy this product only on Mondays, Tuesdays, Wednesdays, and Thursdays, and that the number of customers who bought the product in the past on those days is 3, 2, 1, and 1, respectively. I thought I would set up my model like this:
import numpy as np
import pymc3 as pm

dow = ['m', 'tu', 'w', 'th']
c = np.array([3, 2, 1, 1])
# hyperparameters (initially all equal)
alphas = np.array([1, 1, 1, 1])
with pm.Model() as model:
    # Parameters of the Multinomial are from a Dirichlet
    parameters = pm.Dirichlet('parameters', a=alphas, shape=4)
    # Observed data is from a Multinomial distribution
    observed_data = pm.Multinomial(
        'observed_data', n=7, p=parameters, shape=4, observed=c)
So this set up my model without any issues. Then I have an individual customer's data from 5 weeks: 1 means they bought the product, 0 means they didn't, for a given day of the week. I thought updating the model would be as simple as:
c = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1]])
with pm.Model() as model:
    # Parameters are a dirichlet distribution
    parameters = pm.Dirichlet('parameters', a=alphas, shape=4)
    # Observed data is a multinomial distribution
    observed_data = pm.Multinomial(
        'observed_data', n=1, p=parameters, shape=4, observed=c)
    trace = pm.sample(draws=100, chains=2, tune=50, discard_tuned_samples=True)
This didn't work.
My questions are:
Does this still take into account the priors I set up before, or does it create a brand-new model?
As written above, the code didn't work: it gave me a "bad initial energy" error. Through trial and error I found that parameter "n" has to be the sum of the elements in the observations (so I can't have observations adding up to different n's). Why is that? Surely the situation I described above (where some weeks they shop only on Mondays, and others on Mondays and Thursdays) is not impossible?
Is there a better way of using pymc3 or a different package for this type of problem? Thank you!

To answer your specific questions first:
The second model is a new model. You can reuse context managers by changing the line to just with model:, but looking at the code, that is probably not what you intended to do.
A multinomial distribution makes n draws, using the provided probabilities, and returns one vector of counts, so each observed row has to sum to its n. pymc3 will broadcast for you if you provide an array for n. Here's a tidied version of your model:
with pm.Model() as model:
    parameters = pm.Dirichlet('parameters', a=alphas)
    observed_data = pm.Multinomial(
        'observed_data', n=c.sum(axis=-1), p=parameters, observed=c)
    trace = pm.sample()
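To see why n has to match the row sums, it can help to evaluate the multinomial log-probability by hand. The following is a minimal sketch in plain NumPy, not PyMC3 code; the helper name multinomial_logpmf and the uniform probabilities are assumptions made for illustration. A row whose counts do not add up to n has probability zero, so its log-probability is minus infinity, which is the kind of value that can stop a sampler from initialising.

```python
import math
import numpy as np

def multinomial_logpmf(x, n, p):
    """Log-probability of the count vector x under Multinomial(n, p).

    Hypothetical helper for illustration; returns -inf when the counts
    do not sum to n, because such an outcome is impossible.
    """
    x = np.asarray(x)
    p = np.asarray(p, dtype=float)
    if x.sum() != n:
        return -math.inf
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(xi + 1) for xi in x)
    return log_coef + float(np.sum(x * np.log(p)))

c = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1]])
p = np.full(4, 0.25)  # assumed uniform probabilities, just for the check

row_sums = c.sum(axis=-1)
print(row_sums)  # the last week has two purchases

# With n fixed at 1 the last row is impossible; with n taken per row it is fine.
print([multinomial_logpmf(row, 1, p) for row in c])
print([multinomial_logpmf(row, n, p) for row, n in zip(c, row_sums)])
```

Passing n=c.sum(axis=-1), as in the tidied model, gives every row its own n and avoids the impossible row.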
You also ask about whether pymc3 is the right library for this question, which is great! The two models you wrote down are well known, and you can solve the posterior by hand, which is much faster: in the first model, it is a Dirichlet([4, 3, 2, 2]), and in the second Dirichlet([5, 2, 1, 2]). You can confirm this with PyMC3, or read up here.
If you wanted to expand your model, or chose distributions that were not conjugate, then PyMC3 might be a better choice.
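The by-hand posteriors quoted above follow from Dirichlet-multinomial conjugacy: the posterior concentration is the prior concentration plus the observed counts. Here is a small sketch of that arithmetic in NumPy. The last step, which feeds the first posterior in as the prior for the individual customer, is one possible way to carry the general information forward; it is an assumption about what the question was after, not something the answer spells out.

```python
import numpy as np

alphas = np.array([1, 1, 1, 1])          # flat Dirichlet prior
general_counts = np.array([3, 2, 1, 1])  # purchases by customers in general
customer = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
                     [1, 0, 0, 0], [1, 0, 0, 1]])
customer_counts = customer.sum(axis=0)   # per-day totals for this customer

# Conjugate update: posterior concentration = prior concentration + counts
posterior_general = alphas + general_counts
posterior_customer_only = alphas + customer_counts

# Assumed "Bayesian update": start from the general posterior instead of the flat prior
posterior_chained = posterior_general + customer_counts
posterior_mean = posterior_chained / posterior_chained.sum()

print(posterior_general)        # matches Dirichlet([4, 3, 2, 2])
print(posterior_customer_only)  # matches Dirichlet([5, 2, 1, 2])
print(posterior_chained, posterior_mean)
```

In PyMC3 the same chaining would amount to passing the updated concentration vector as the a argument of the Dirichlet in the second model.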

Related

Time series data prediction with multiple n numbers

I am studying time series data.
The time series examples I have worked through so far all have only two columns: one is a date, and one is a value.
For example, in the case of a stock price increase forecast, we predict a 'single' stock.
If so, can you predict multiple stocks simultaneously in time series data analysis?
For example, suppose subjects took medicines that affect their liver levels, and liver measurements have been recorded by date for each of them so far. Based on this, I would like to experiment with predicting at which point the liver level will rise or fall in the future. I need to predict for several patients at the same time, not just one patient. How do I specify the data set in this case?
Is it possible to label by adding one column? Or am I not really understanding the nature of time series data analysis?
If anyone knows anything related, I would be really grateful if you can advise me or give me a reference site.
You should do the predictions for each patient separately. You probably don't want the prediction for one patient to vary because of what happens to the others at the same time.
Machine learning is not just about giving data to a model and getting back results; you also have to think about the model and what its input and output should be. For time series, you would probably give as input what was observed for a patient on the previous days, and try to predict what will happen on the next one. For one patient, you do not need the data of the other patients, and if you give it to your model, it will try to use it and capture some noise from the training data, which is not what you want.
However, since you can expect similar behavior across patients, you can build one model for all the patients rather than one model per patient. The typical input would be of the form:
[X(t - k, i), X(t - k + 1, i), ..., X(t - 1, i)]
where X(t, i) is the observation at time t for the patient i, to predict X(t, i). Train your model with the data of all the patients.
Since you give a medical example, note that if you have covariates such as the weight or the gender of the patients, you can include them in your model to capture their individual characteristics. In this case the input of the model to predict X(t, i) would be:
[X(t - k, i), X(t - k + 1, i), ..., X(t - 1, i), C1(i), ..., Cp(i)]
where C1(i)...Cp(i) are the covariates of the patient. If you do not have these covariates, it is not a problem; they can just improve the results in some cases. Note that not all covariates are necessarily useful.
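The input layout described above can be sketched as follows. This is a minimal illustration with made-up data; the function name make_windows, the patient identifiers, and the single covariate are all hypothetical. Each row holds the k previous observations of one patient, optionally followed by that patient's covariates, and the target is the observation at time t. Rows from all patients are stacked into one training set, but a window never mixes values from different patients.

```python
import numpy as np

def make_windows(series_by_patient, k, covariates=None):
    """Build (X, y) where each row of X is [X(t-k, i), ..., X(t-1, i), C1(i), ...]
    and y is X(t, i). Hypothetical helper for illustration."""
    rows, targets = [], []
    for patient_id, series in series_by_patient.items():
        series = np.asarray(series, dtype=float)
        extra = [] if covariates is None else list(covariates[patient_id])
        for t in range(k, len(series)):
            rows.append(list(series[t - k:t]) + extra)  # lags, then covariates
            targets.append(series[t])
    return np.array(rows), np.array(targets)

# Made-up measurements for two patients, plus one covariate each (e.g. weight)
data = {"patient_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "patient_b": [10.0, 9.0, 8.0, 7.0]}
weights = {"patient_a": [70.0], "patient_b": [82.0]}

X, y = make_windows(data, k=2, covariates=weights)
print(X)
print(y)
```

The resulting X and y can be passed to any regression model; the choice of k is a tuning decision.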

Why SARIMA has seasonal limits?

The original ARMA algorithm has the following formula:
Here you can see that ARMA needs p + q + 1 numbers to compute. There are no questions about that; it's pretty clear.
But there is one thing about the SARIMA algorithm that I can't understand. The SARIMA formula looks like ARMA with extra terms:
where S is a constant that stands for the seasonal period.
So SARIMA must compute p + q + P + Q + 1 numbers, just P + Q extra. That's not much if P = 1 and Q = 2.
But if we use a long period, for example 365 days for a daily time series, SARIMA just can't finish fitting. Look at these two models. The first one takes 9 seconds to fit, while the second one hasn't finished fitting after 2 hours!
import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(
    df.meantemp_box,
    order=(1, 0, 2),
    seasonal_order=(1, 1, 1, 7)
).fit()
model = sm.tsa.statespace.SARIMAX(
    df.meantemp_box,
    order=(1, 0, 2),
    seasonal_order=(1, 1, 1, 365)
).fit()
And I can't understand that. Mathematically these models are the same: they both take the same p, q, P and Q. But the second one either takes too long to fit, or cannot be fitted at all.
Am I getting something wrong?
First, a possible solution: if you are using Statsmodels v0.11 or the development version, then you can use the following when you have long seasonal effects:
mod = sm.tsa.arima.ARIMA(endog, order=(1, 0, 2), seasonal_order=(1, 1, 1, 365))
res = mod.fit(method='innovations_mle', low_memory=True, cov_type='none')
The main restriction is that your time series cannot have missing entries. If you are missing values, then you will need to impute them somehow before creating the model.
Also, not all of our results features will be available to you, but you can still print the summary with the parameters, compute the loglikelihood and information criteria, compute in-sample predictions, and do out-of-sample forecasting.
Now to explain what the problem is:
The problem is that these models are estimated by putting them in state space form and then applying the Kalman filter to compute the log-likelihood. The dimension of the state space form of an ARIMA model grows quickly with the number of periods in a complete season - for your model with s=365, the dimension of the state vector is 733.
The Kalman filter requires multiplying matrices with this dimension, and by default, memory is allocated for matrices of this dimension for each of your time periods. That's why it takes forever to run (and it takes up a lot of memory too).
For the solution above, instead of computing the log-likelihood using the Kalman filter, we compute it using something called the innovations algorithm. Then we only run the Kalman filter once to compute the results object (this allows for e.g. forecasting). The low_memory=True option instructs the model not to store all of the large-dimensional matrices for each time step, and the cov_type='none' option instructs the model not to try to compute standard errors for the model's parameters (which would require a lot more log-likelihood evaluations).
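The state dimension of 733 quoted above can be reproduced with a little arithmetic. The sketch below assumes the default state space representation with differencing handled inside the state vector; the function sarima_state_dim is a made-up helper for illustration, not part of statsmodels. It shows that the number of estimated parameters stays the same while the state vector grows with the seasonal period.

```python
def sarima_state_dim(order, seasonal_order):
    """Approximate state vector length for a SARIMA model.

    Assumed formula: the ARMA part needs max(AR lags, MA lags + 1) states,
    and each differencing operation adds its own lags to the state.
    """
    p, d, q = order
    P, D, Q, s = seasonal_order
    ar_lags = p + s * P            # highest lag of the combined AR polynomial
    ma_lags = q + s * Q            # highest lag of the combined MA polynomial
    arma_states = max(ar_lags, ma_lags + 1)
    diff_states = d + s * D        # states used to undo the differencing
    return arma_states + diff_states

for s in (7, 365):
    k = sarima_state_dim((1, 0, 2), (1, 1, 1, s))
    megabytes = k * k * 8 / 1e6    # one k-by-k float64 matrix
    print(s, k, round(megabytes, 3))
```

With s=365 a single state covariance matrix is already a few megabytes, and by default one is stored per time period, which is consistent with the slow fit and high memory use described above.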

PyMC3 binomial switchpoint model highly dependent on testval

I've set up the following binomial switchpoint model in PyMC3:
with pm.Model() as switchpoint_model:
    switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(), upper=df['covariate'].max())
    # Priors for pre- and post-switch parameters
    early_rate = pm.Beta('early_rate', 1, 1)
    late_rate = pm.Beta('late_rate', 1, 1)
    # Allocate appropriate binomial probabilities to years before and after current
    p = pm.math.switch(switchpoint >= df['covariate'].values, early_rate, late_rate)
    p = pm.Deterministic('p', p)
    y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)
It seems to run fine, except that it entirely centers in on one value for the switchpoint (999), as shown below.
Upon further investigation it seems that the results for this model are highly dependent on the starting value (in PyMC3, "testval"). The below shows what happens when I set the testval = 750.
switchpoint = pm.DiscreteUniform('switchpoint', lower=gp['covariate'].min(),
                                 upper=gp['covariate'].max(), testval=750)
I get similarly different results with additional different starting values.
For context, this is what my dataset looks like:
My questions are:
Is my model somehow incorrectly specified?
If it's correctly specified, how should I interpret these results? In particular, how do I compare / select results generated by different testvals? The only idea I've had has been using WAIC to evaluate out of sample performance...
Models with discrete variables can be problematic: the sampling techniques that rely on derivatives no longer apply to them, and the posterior can behave a lot like a multimodal distribution. I don't really see why this would be that problematic in this case, but you could try using a continuous variable for the switchpoint instead (wouldn't that also make more sense conceptually?).
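One way to check whether the sampler or the model is responsible, independently of PyMC3, is to enumerate the switchpoint exactly. This is a different technique from the continuous-switchpoint suggestion above: with Beta(1, 1) priors on both rates, the rates can be integrated out analytically, which leaves a plain sum over the candidate switchpoints. The sketch below is a rough illustration on synthetic data; the function names, the data, and the true switchpoint of 59 are all made up.

```python
import numpy as np
from math import lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def switchpoint_log_posterior(covariate, trials, successes, candidates):
    """Unnormalised log posterior of each candidate switchpoint, with a uniform
    prior on the switchpoint and Beta(1, 1) priors on both rates integrated out.
    Binomial coefficients are dropped because they do not depend on the switchpoint."""
    out = []
    for s in candidates:
        early = covariate <= s  # same split as switch(switchpoint >= covariate, ...)
        lp = 0.0
        for mask in (early, ~early):
            k = successes[mask].sum()
            f = (trials[mask] - successes[mask]).sum()
            lp += log_beta(1 + k, 1 + f) - log_beta(1, 1)
        out.append(lp)
    return np.array(out)

# Synthetic data: the rate jumps from 0.2 to 0.7 after covariate 59
rng = np.random.default_rng(0)
covariate = np.arange(100)
trials = np.full(100, 20)
true_p = np.where(covariate <= 59, 0.2, 0.7)
successes = rng.binomial(trials, true_p)

log_post = switchpoint_log_posterior(covariate, trials, successes, covariate)
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()            # normalised posterior over the switchpoint
best = covariate[np.argmax(weights)]
print(best, weights.max())
```

If the exact posterior turns out to be sharply peaked at one value while MCMC runs with different testvals disagree, that would suggest the chains are stuck rather than that the model is misspecified.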

Sequential updating in PyMC

I'm teaching myself PyMC but got stuck with the following problem:
I have a model whose parameters should be determined from successive measurements. In the beginning the parameter's prior is uninformative, but should be updated after each measurement (i.e. replaced by the posterior). In short, I want to do sequential updating with PyMC.
Consider the following (somewhat constructed) example:
Measurement 1: 10 questions, 9 correct answers
Measurement 2: 5 questions, 3 correct answers
Of course, this can be solved analytically with beta/binomial conjugate priors, but this is not the point here :)
Alternatively, both measurements could be combined to n=15 and k=12. However, this is too simple. I want to take the hard way for educational purposes.
I found a solution in this answer, where new priors are sampled from the posterior. This is almost what I want, but sampling the prior feels a bit messy because the result depends on the number of samples and other settings.
My attempted solution puts both measurements and priors separately in one model, like this:
n1, k1 = 10, 9
n2, k2 = 5, 3
theta1 = pymc.Beta('theta', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=n1, p=theta1, value=k1, observed=True)
theta2 = ? # should be the posterior of theta1
outcome2 = pymc.Binomial('outcome2', n=n2, p=theta2, value=k2, observed=True)
How can I get the posterior of theta1 as the prior of theta2?
Is this even possible, or did I just demonstrate ultimate ignorance about Bayesian statistics?
The only way sequential updating works sensibly is in two different models. Specifying them in the same model does not make any sense, since we have no posteriors until after MCMC has completed.
In principle, you would examine the distribution of theta1 and specify a prior that best resembles it. In this simple case it is easy -- it would be:
theta2 = pymc.Beta('theta2', alpha=10, beta=2)
since you don't need MCMC to determine what the posterior of theta is. More generally, you could fit a Beta distribution to the posterior, say using scipy.stats.beta.fit.
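The Beta(10, 2) prior above comes from the beta-binomial conjugate update: add the successes to alpha and the failures to beta. A short sketch in plain Python (the helper name update_beta is made up for illustration) shows the two-step update and confirms that it matches combining both measurements into n=15, k=12.

```python
def update_beta(alpha, beta, n, k):
    """Conjugate update of a Beta(alpha, beta) prior after k successes in n trials."""
    return alpha + k, beta + (n - k)

prior = (1, 1)                                       # uninformative Beta(1, 1)
after_first = update_beta(*prior, n=10, k=9)         # posterior after measurement 1
after_second = update_beta(*after_first, n=5, k=3)   # use it as the next prior
batch = update_beta(*prior, n=15, k=12)              # both measurements at once

print(after_first)
print(after_second)
print(batch)
```

For posteriors without a closed form, the fitted-distribution approach mentioned above (for example with scipy.stats.beta.fit on the posterior samples) plays the role of after_first.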

Debugging pymc probability calculations

I've tried to model a mixture of exponentials by copying the mixture-of-Gaussians example given here. The code is below. I know there are some funky aspects to the inference here, but my question is more about how to debug the calculations in models like this.
The idea is that it's a mixture of three exponentials, with scale parameters taken from the Gamma assigned to scales. However, all observations get assigned to the zeroth exponential during the ElemwiseCategoricalStep. You can see that the assignments of the observations to the exponential components are initially diverse by looking at initial_assignments, and you can see that all observations are assigned to the zeroth component on all iterations from the fact that set(tr['exp'].flatten()) contains only 0.
I assume this is because all of the values assigned to p in the expression array([logp(v * self.sh) for v in self.values]) in ElemwiseCategoricalStep.astep are minus infinity. I would like to know why that is and how to correct it, but even more, I would like to know what tools are available to debug this kind of thing. Is there any way for me to step through the calculation of logp(v * self.sh) to see how the result is determined? If I try to do it using pdb, I think I get stymied at outputs = self.fn() in theano.compile.function_module.Function.__call__, which I guess I can't step into because it's a native function.
Even knowing how to compute the pdf for a given set of model parameters would be a useful start.
import numpy as np
import pymc as pm
from pymc import Model, Gamma, Normal, Dirichlet, Exponential
from pymc import Categorical
from pymc import sample, Metropolis, ElemwiseCategoricalStep

durations = np.concatenate(
    [np.random.exponential(1/lam, 10)
     for lam in [1e-3,7e-5,2e-6]])
initial_assignments = np.random.randint(0, 3, len(durations))
print 'initial_assignments', initial_assignments

with Model() as model:
    scales = Gamma('hp', 1, 1, shape=3)
    props = Dirichlet('props', a=np.array([1., 1., 1.]), shape=3)
    category = Categorical('exp', p=props, shape=len(durations))
    points = Exponential('obs', lam=scales[category], observed=durations)
    step1 = pm.Metropolis(vars=[props,scales])
    step2 = ElemwiseCategoricalStep(var=category, values=[0,1,2])
    start = {'exp': initial_assignments,
             'hp': np.ones(3),
             'props': np.ones(3),}
    tr = sample(3000, step=[step1, step2], start=start)
print set(tr['exp'].flatten())
Excellent question. One thing you can do is look at the pdf for each of the components.
The Model and each of the variables should have both a .logp and a .elemwise_logp property, each of which returns a function that can take a point or parameter values.
Thus you can say something like print scales.logp(start) or print model.logp(start) or print scales.dlogp()(start).
For now, I think you unfortunately have to specify all the parameter values (even ones that don't affect the result for a particular variable).
Model, FreeRV and ObservedRV all inherit from Factor which provides this functionality and has a few other methods. You'll probably want the non fast versions since those are more forgiving in the kinds of arguments they accept.
Does that help? Please let me know if you have other ideas for things that might help you in debugging. This is one area where we know pymc3 and theano need some work.
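As a cross-check that does not depend on pymc or theano at all, the per-observation log-density can also be computed by hand. The following is a minimal NumPy sketch, not the library's internal calculation; the helper exponential_logpdf is made up for illustration, and the rates mirror the start values in the question (all ones). It builds a table with one row per observation and one column per candidate component, which is roughly the quantity the categorical step has to compare.

```python
import numpy as np

def exponential_logpdf(x, lam):
    """Log-density of an Exponential with rate lam at x >= 0: log(lam) - lam * x."""
    x = np.asarray(x, dtype=float)
    return np.log(lam) - lam * x

rng = np.random.default_rng(0)
durations = np.concatenate(
    [rng.exponential(1 / lam, 10) for lam in [1e-3, 7e-5, 2e-6]])
scales = np.ones(3)  # same values as start['hp'] in the question

# Rows: observations, columns: candidate component 0, 1, 2
logp_table = exponential_logpdf(durations[:, None], scales[None, :])

print(np.isfinite(logp_table).all())
print(logp_table.min(), logp_table.max())
```

Values of minus infinity in such a table would point at a parameter or observation outside the distribution's support. Values that are finite but hugely negative would instead suggest a mismatch between the scale of the data and the starting values, and such values underflow to zero if they are exponentiated directly.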
