After fitting a local level model using UnobservedComponents from statsmodels , we are trying to find ways to simulate new time series with the results. Something like:
import numpy as np
import statsmodels as sm
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = sm.tsa.arima_process.ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * x + np.random.normal(size=100)
y[70:] += 10
plt.plot(X, label='X')
plt.plot(y, label='y')
plt.axvline(69, linestyle='--', color='k')
plt.legend();
ss = {}
ss["endog"] = y[:70]
ss["level"] = "llevel"
ss["exog"] = X[:70]
model = UnobservedComponents(**ss)
trained_model = model.fit()
Is it possible to use trained_model to simulate new time series given the exogenous variable X[70:]? Just as we have the arma_process.generate_sample(nsample=100), we were wondering if we could do something like:
trained_model.generate_random_series(nsample=100, exog=X[70:])
The motivation behind it is so that we can compute the probability of having a time series as extreme as the observed y[70:] (p-value for identifying the response is bigger than the predicted one).
[EDIT]
After reading Josef's and cfulton's comments, I tried implementing the following:
mod1 = UnobservedComponents(np.zeros(y_post), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
But this resulted in simulations that doesn't seem to track the predicted_mean of the forecast for X_post as exog. Here's an example:
While the y_post meanders around 100, the simulation is at -400. This approach always leads to p_value of 50%.
So when I tried using the initial_sate=0 and the random shocks, here's the result:
It seemed now that the simulations were following the predicted mean and its 95% credible interval (as cfulton commented below, this is actually a wrong approach as well as it's replacing the level variance of the trained model).
I tried using this approach just to see what p-values I'd observe. Here's how I compute the p-value:
samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
sim = mod1.simulate(f_model.params, len(X_post), initial_state=0, state_shocks=np.random.normal(size=len(X_post)))
r += sim.sum() >= y_post_sum
print(r / samples)
For context, this is the Causal Impact model developed by Google. As it's been implemented in R, we've been trying to replicate the implementation in Python using statsmodels as the core to process time series.
We already have a quite cool WIP implementation but we still need to have the p-value to know when in fact we had an impact that is not explained by mere randomness (the approach of simulating series and counting the ones whose summation surpasses y_post.sum() is also implemented in Google's model).
In my example I used y[70:] += 10. If I add just one instead of ten, Google's p-value computation returns 0.001 (there's an impact in y) whereas in Python's approach it's returning 0.247 (no impact).
Only when I add +5 to y_post is that the model returns p_value of 0.02 and as it's lower than 0.05, we consider that there's an impact in y_post.
I'm using python3, statsmodels version 0.9.0
[EDIT2]
After reading cfulton's comments I decided to fully debug the code to see what was happening. Here's what I found:
When we create an object of type UnobservedComponents, eventually the representation of the Kalman Filter is initiated. As default, it receives the parameter initial_variance as 1e6 which sets the same property of the object.
When we run the simulate method, the initial_state_cov value is created using this same value:
initial_state_cov = (
np.eye(self.k_states, dtype=self.ssm.transition.dtype) *
self.ssm.initial_variance
)
Finally, this same value is used to find initial_state:
initial_state = np.random.multivariate_normal(
self._initial_state, self._initial_state_cov)
Which results in a normal distribution with 1e6 of standard deviation.
I tried running the following then:
mod1 = UnobservedComponents(np.zeros(len(X_post)), level='llevel', exog=X_post, initial_variance=1)
sim = mod1.simulate(f_model.params, len(X_post))
plt.plot(sim, label='simul')
plt.plot(y_post, label='y')
plt.legend();
print(sim.sum() > y_post.sum())
Which resulted in:
I tested then the p-value and finally for a variation of +1 in y_post the model now is identifying correctly the added signal.
Still, when I tested with the same data that we have in R's Google package the p-value was still off. Maybe it's a matter of further tweaking the input to increase its accuracy.
#Josef is correct and you did the right thing with:
mod1 = UnobservedComponents(np.zeros(y_post), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
The simulate method simulates data according to the model in question, which is why you can't directly use trained_model to simulate when you have exogenous variables.
But for some reason the simulations always ended up being lower than y_post.
I think this should be expected - running your example and looking at the estimated coefficients, we get:
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
sigma2.irregular 0.9278 0.194 4.794 0.000 0.548 1.307
sigma2.level 0.0021 0.008 0.270 0.787 -0.013 0.018
beta.x1 1.1882 0.058 20.347 0.000 1.074 1.303
The variance of the level is very small, which means that it is extremely unlikely that the level would shift upwards by nearly 10 percent in a single period, based on the model you specified.
When you used:
mod1.simulate(f_model.params, len(X_post), initial_state=0, state_shocks=np.random.normal(size=len(X_post))
what happened is that the level term is the only unobserved state here, and by providing your own shocks with a variance equal to 1, you essentially overrode the level variance actually estimated by the model. I don't think that setting the initial state to 0 has much of an effect here. (see edit).
You write:
the p-value computation was closer, but still is not correct.
I'm not sure what this means - why would you expect the model to think such a jump was a likely occurrence? What p-value are you expecting to achieve?
Edit:
Thanks for investigating further (in Edit 2). First, what I think you should do is:
mod1 = UnobservedComponents(np.zeros(y_post), 'llevel', exog=X_post)
initial_state = np.random.multivariate_normal(
f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
Now, the explanation:
In Statsmodels 0.9, we didn't yet have exact treatment of states with a diffuse initialization (it has been merged in since then, though, and this is one reason that I wasn't able to replicate your results until I tested your example with the 0.9 codebase). These "initially diffuse" states don't have a long-run mean that we can solve for (e.g. a random walk process), and the state in the local level case is such a state.
The "approximate" diffuse initialization involves setting the initial state mean to zero and the initial state variance to a large number (as you discovered).
For simulations, the initial state is, by default, sampled from the given initial state distribution. Since this model is initialized with approximate diffuse initialization, that explains why your process was initialized around some random number.
Your solution is a good patch, but it's not optimal because it doesn't base the initial state for the simulated period on the last state from the estimated model / data. These values are given by f_model.predicted_state[..., -1] and f_model.predicted_state_cov[..., -1].
Related
I'm running a deep learning model which requires me to scale my dataset. I'm using scikit-learn's MinMaxScaler. After I make the prediction, if I compare the prediction with the target column I get a certain relative error. But if I rescale the dataset and the prediction, the relative error increases massively.
For reference, it's not a good model and the error when using the scaled dataset is around 40% and when I re-scale the error jumps to over 60%. I'm also calculating the relative error this way:
def calculate_error(prediction, y):
rel_error = 2 * np.absolute(y - prediction) / (np.absolute(y) + np.absolute(prediction))
return rel_error
From this I get the mean and the standard deviation using numpy's mean() and std() functions. An example is the following
predicted_scaled = [0.26652822, 0.2384195, 0.26829958, 0.25697553, 0.28840747]
real_scaled = [0.16201117, 0.37243948, 0.42085661, 0.49534451, 0.23649907]
rel_error.mean() = 44.02%
rel_error.std() = 14.03%
---
predicted_rescaled = [12.012565, 10.503127, 12.107687, 11.499586, 13.187481]
real_rescaled = [6.4, 17.7, 20.3, 24.3, 10.4]
rel_error.mean() = 51.54%
rel_error.std() = 17.8%
Why does this happen and how can I prevent it? Furthermore, what's the correct error: the one that compares prediction and target while scaled or the one I get after scaling?
It's because of your min value in your min/max scaler shifting the shape of your modelled distribution. Let us, for example, take a single datapoint, pred=0.6, true=0.8.
Let us calculate your error according to this point without scaling:
error = 2*|0.6-0.8|/ (1.4)
error = 2/7 = 0.28
Now we can calculate this scaled according to a (randomly-chosen) scaler with a min of 2.2 and max of 10.1:
error = 2*|6.94-8.52|/(16.46)
error = 0.19
So, this is not an error in the code, but rather the fact that you are calculating a relative error between two different distributions which will result in a different value!
In regards to which one is the 'correct' result to display, I would suggest it depends on what you're discussing. If you're conveying the real results, then I would suggest that you use the re-scaled results. If you're conveying model performance then either will suffice.
Also, I think it is important to scale your outputs/inputs as a model will learn better (generally) with scaled outputs/inputs with an activated output (ie. scaling with a sigmoid of tanh function at the output layer).
When performed a logistic regression using the two API, they give different coefficients.
Even with this simple example it doesn't produce the same results in terms of coefficients. And I follow advice from older advice on the same topic, like setting a large value for the parameter C in sklearn since it makes the penalization almost vanish (or setting penalty="none").
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
n = 200
x = np.random.randint(0, 2, size=n)
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)
display(pd.crosstab( y, x ))
max_iter = 100
#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)
#### Scikit-Learn
res_sk = LogisticRegression( solver='newton-cg', multi_class='multinomial', max_iter=max_iter, fit_intercept=True, C=1e8 )
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.coef_)
For example I just run the above code and get 1.72276655 for statsmodels and 1.86324749 for sklearn. And when run multiple times it always gives different coefficients (sometimes closer than others, but anyway).
Thus, even with that toy example the two APIs give different coefficients (so odds ratios), and with real data (not shown here), it almost get "out of control"...
Am I missing something? How can I produce similar coefficients, for example at least at one or two numbers after the comma?
There are some issues with your code.
To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:
An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and own answer there as well).
The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:
There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.
which makes the two models again non-comparable in principle, but you have successfully addressed it here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none' since, according to the docs:
If ‘none’ (not supported by the liblinear solver), no regularization is applied.
which should now be considered the canonical way to switch off the regularization.
So, incorporating these changes in your code, we have:
np.random.seed(42) # for reproducibility
#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)
Which gives the result:
Optimization terminated successfully.
Current function value: 0.403297
Iterations: 5
Function evaluations: 6
Gradient evaluations: 10
Hessian evaluations: 5
[-1.65822763 3.65065752]
with the first element of the array being the intercept and the second the coefficient of x. While for scikit learn we have:
#### Scikit-Learn
res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)
with the result being:
[-1.65822806] [[3.65065707]]
These results are practically identical, within the machine's numeric precision.
Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end-gauge of actual (unknown) length at room-temperature length, and measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient and dT is the deviation from room temperature and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty)
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215,5.8e-6*np.sqrt(1000),1000)
with basic_model:
#prior distributions
theta = pm.Normal('theta',mu=-.1,sd=.04)
alpha = pm.Normal('alpha',mu=.0000115,sd=.0000012)
length = pm.Normal('length',mu=50,sd=1)
mumeas = length*(1+alpha*theta)
with basic_model:
obs = pm.Normal('obs',mu=mumeas,sd=5.8e-6,observed=xdata)
#yobs = Normal('yobs',)
start = pm.find_MAP()
#trace = pm.sample(2000, step=pm.Metropolis, start=start)
step = pm.Metropolis()
trace = pm.sample(10000, tune=200000,step=step,start=start,njobs=1)
length_samples = trace['length']
fig,ax=plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
data = jags_data,
n.chains = 1,
n.adapt = 30000)
post_samples <- coda.samples(model = jags_code,
variable.names =
c("l","mu","alpha","theta"),#,"ypred"),
n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
l~dnorm(50,1.0)
alpha~dnorm(0.0000115,694444444444)
theta~dnorm(-0.1,625)
mu<-l*(1+alpha*theta)
xbar~dnorm(mu,29726516052)
}
(note I think the dnorm distribution takes 1/sigma^2, hence the weird-looking variances)
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to a situation of highly correlated posterior parameters. Add to that the extreme scale imbalances, then it makes sense this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
# tau === precision parameterization
dT = pm.Normal('dT', mu=-0.1, tau=625)
alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
length = pm.Normal('length', mu=50.0, tau=1.0)
mu = pm.Deterministic('mu', length*(1+alpha*dT))
# only one input observation; tau indicates the 5.8 nm sd
obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
Trace Plots
Forest Plot
Pair Plot
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
I'm teaching myself PyMC but got stuck with the following problem:
I have a model whose parameters should be determined from successive measurements. In the beginning the parameter's prior is uninformative, but should be updated after each measurement (i.e. replaced by the posterior). In short, I want to do sequential updating with PyMC.
Consider the following (somewhat constructed) example:
Measurement 1: 10 questions, 9 correct answers
Measurement 2: 5 questions, 3 correct answers
Of course, this can be solved analytically with beta/binomial conjugate priors, but this is not the point here :)
Alternatively, both measurements could be combined to n=15 and k=12. However, this is too simple. I want to take the hard way for educational purposes.
I found a solution in this answer, where new priors are sampled from the posterior. This is almost what I want, but sampling the prior feels a bit messy because the results depends on the number of samples and other settings.
My attempted solution puts both measurement and priors separately in one model, like this:
n1, k1 = 10, 9
n2, k2 = 5, 3
theta1 = pymc.Beta('theta', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=n1, p=theta1, value=k1, observed=True)
theta2 = ? # should be the posterior of theta1
outcome2 = pymc.Binomial('outcome2', n=n2, p=theta2, value=k2, observed=True)
How can I get the posterior of theta1 as the prior of theta2?
Is this even possible, or did I just demonstrate ultimate ignorance about Bayesian statistics?
The only way sequential updating works sensibly is in two different models. Specifying them in the same model does not make any sense, since we have no posteriors until after MCMC has completed.
In principle, you would examine the distribution of theta1 and specify a prior that best resembles it. In this simple case it is easy -- it would be:
theta2 = pymc.Beta('theta2', alpha=10, beta=2)
since you don't need MCMC to determine what the posterior of theta is. More generally, you could fit a Beta distribution to the posterior, say using scipy.stats.beta.fit.
I am trying to fit a Poisson distribution to my data using statsmodels but I am confused by the results that I am getting and how to use the library.
My real data will be a series of numbers that I think that I should be able to describe as having a poisson distribution plus some outliers so eventually I would like to do a robust fit to the data.
However for testing purposes, I just create a dataset using scipy.stats.poisson
samp = scipy.stats.poisson.rvs(4,size=200)
So to fit this using statsmodels I think that I just need to have a constant 'endog'
res = sm.Poisson(samp,np.ones_like(samp)).fit()
print res.summary()
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 200
Model: Poisson Df Residuals: 199
Method: MLE Df Model: 0
Date: Fri, 27 Jun 2014 Pseudo R-squ.: 0.000
Time: 14:28:29 Log-Likelihood: -404.37
converged: True LL-Null: -404.37
LLR p-value: nan
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.3938 0.035 39.569 0.000 1.325 1.463
==============================================================================
Ok, that doesn't look right, But if I do
res.predict()
I get an array of 4.03 (which was the mean for this test sample).
So basically, firstly I very confused how to interpret this result from statsmodel and secondly I should probably being doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends but how should I go about doing that?
Edit
I should really have given more detail in order to answer the second part of my question.
I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply find the mean time. I suppose that I could exclude them manually but I thought that I could find something more exacting.
Edit
On further reflection, I will be considering other distributions instead of sticking with a Poissonion and the details of my issue are probably a distraction from the original question but I've left them here anyway.
The Poisson model, as most other models in generalized linear model families or for other discrete data, assumes that we have a transformation that bounds the prediction in the appropriate range.
Poisson works for nonnegative numbers and the transformation is exp, so the model that is estimated assumes that the expected value of an observation, conditional on the explanatory variables is
E(y | x) = exp(X dot params)
To get the lambda parameter of the poisson distribution, we need to use exp, i.e.
>>> np.exp(1.3938)
4.0301355071650118
predict does this by default, but you can request just the linear part (X dot params) with a keyword argument.
BTW: statsmodels' controversial terminology
endog is y
exog is x (has x in it)
(http://statsmodels.sourceforge.net/devel/endog_exog.html )
Outlier Robust Estimation
The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.
For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial Regression. For outliers in Poisson we would have to use R/Rpy or do manual trimming of outliers.
Outlier identification could be based on one of the standardized residuals.
It will not be available in statsmodels for some time, unless someone is contributing this.