I am trying to fit a Poisson distribution to my data using statsmodels but I am confused by the results that I am getting and how to use the library.
My real data will be a series of numbers that I think I should be able to describe as having a Poisson distribution plus some outliers, so eventually I would like to do a robust fit to the data.
However, for testing purposes, I just create a dataset using scipy.stats.poisson:
samp = scipy.stats.poisson.rvs(4,size=200)
So to fit this using statsmodels I think that I just need to have a constant 'exog':
res = sm.Poisson(samp, np.ones_like(samp)).fit()
print(res.summary())
                         Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      199
Method:                           MLE   Df Model:                            0
Date:                Fri, 27 Jun 2014   Pseudo R-squ.:                   0.000
Time:                        14:28:29   Log-Likelihood:                -404.37
converged:                       True   LL-Null:                       -404.37
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3938      0.035     39.569      0.000         1.325     1.463
==============================================================================
OK, that doesn't look right. But if I do
res.predict()
I get an array of 4.03 (which was the mean for this test sample).
So basically, firstly, I am very confused about how to interpret this result from statsmodels. Secondly, I should probably be doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends, but how should I go about doing that?
Edit
I should really have given more detail in order to answer the second part of my question.
I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points, which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply find the mean time. I suppose that I could exclude them manually, but I thought that I could find something more exact.
Edit
On further reflection, I will be considering other distributions instead of sticking with a Poissonian, and the details of my issue are probably a distraction from the original question, but I've left them here anyway.
The Poisson model, like most other models in the generalized linear model family or for other discrete data, assumes that there is a transformation that bounds the prediction in the appropriate range.
Poisson works for nonnegative numbers and the transformation is exp, so the model that is estimated assumes that the expected value of an observation, conditional on the explanatory variables, is
E(y | x) = exp(X dot params)
To get the lambda parameter of the Poisson distribution, we need to use exp, i.e.
>>> np.exp(1.3938)
4.0301355071650118
predict does this by default, but you can request just the linear part (X dot params) with a keyword argument.
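For example, with the fitted result res from the question (a sketch; the exact keyword for the linear predictor varies across statsmodels versions, e.g. linear=True in older releases and which='linear' in newer ones):

import numpy as np

# The estimated coefficient is on the log scale; exponentiating it
# recovers the Poisson mean (lambda).
print(np.exp(res.params))    # roughly array([ 4.03]), matches samp.mean()

# predict() returns exp(X dot params) by default
print(res.predict()[:5])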
BTW: statsmodels' controversial terminology
endog is y
exog is x (has x in it)
(http://statsmodels.sourceforge.net/devel/endog_exog.html)
Outlier Robust Estimation
The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.
For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial Regression. For outliers in Poisson we would have to use R/Rpy or do manual trimming of outliers.
Outlier identification could be based on one of the standardized residuals.
It will not be available in statsmodels for some time, unless someone is contributing this.
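As a rough sketch of the manual-trimming idea (assuming the constant-only fit res and the sample samp from the question; the cutoff of 2 is an arbitrary choice):

import numpy as np
import statsmodels.api as sm

mu = res.predict()                      # fitted means
pearson = (samp - mu) / np.sqrt(mu)     # Pearson residuals computed by hand
keep = np.abs(pearson) < 2              # arbitrary outlier cutoff

res_trimmed = sm.Poisson(samp[keep], np.ones(keep.sum())).fit()
print(np.exp(res_trimmed.params))       # estimate of lambda without the flagged points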
Related
I have a fitted Poisson model in statsmodels. For each of my observations I want to calculate the probability of observing a value that is at least that high. In other words I want to calculate:
P(y >= y_i | x_i)
(This should be possible, because the fitted model predicts some value lambda as a function of my independent variable x. This lambda_i value defines a Poisson distribution, from which I should be able to derive a probability.)
My question is really about the implementation in statsmodels, less about the statistics. Although if you believe it is relevant, please do elaborate.
For Poisson we can just use the distribution from scipy.stats to compute results for given predicted means.
e.g.
mu = my_results.predict(...)
# sf(k, mu) is P(Y > k), so for P(Y >= counts) shift the argument by one
stats.poisson.sf(counts - 1, mu)
similar usage with pmf is in
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L3922
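A self-contained sketch of that approach (hypothetical arrays y and X standing in for the observed counts and the design matrix):

import scipy.stats as stats
import statsmodels.api as sm

res = sm.Poisson(y, X).fit()        # hypothetical fitted count model
mu = res.predict()                  # lambda_i for each observation

# P(Y >= y_i | x_i); sf(k) is P(Y > k), hence the - 1
p_at_least = stats.poisson.sf(y - 1, mu)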
After fitting a local level model using UnobservedComponents from statsmodels, we are trying to find ways to simulate new time series with the results. Something like:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = sm.tsa.arima_process.ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
y[70:] += 10
plt.plot(X, label='X')
plt.plot(y, label='y')
plt.axvline(69, linestyle='--', color='k')
plt.legend();
ss = {}
ss["endog"] = y[:70]
ss["level"] = "llevel"
ss["exog"] = X[:70]
model = UnobservedComponents(**ss)
trained_model = model.fit()
Is it possible to use trained_model to simulate new time series given the exogenous variable X[70:]? Just as we have the arma_process.generate_sample(nsample=100), we were wondering if we could do something like:
trained_model.generate_random_series(nsample=100, exog=X[70:])
The motivation behind it is so that we can compute the probability of having a time series as extreme as the observed y[70:] (a p-value for identifying whether the response is bigger than the predicted one).
[EDIT]
After reading Josef's and cfulton's comments, I tried implementing the following:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
But this resulted in simulations that don't seem to track the predicted_mean of the forecast with X_post as exog. Here's an example:
While y_post meanders around 100, the simulation is at -400. This approach always leads to a p_value of 50%.
So when I tried using initial_state=0 and random state shocks, here's the result:
It seemed that the simulations were now following the predicted mean and its 95% credible interval (as cfulton commented below, this is actually a wrong approach as well, since it replaces the level variance of the trained model).
I tried using this approach just to see what p-values I'd observe. Here's how I compute the p-value:
samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=0,
                        state_shocks=np.random.normal(size=len(X_post)))
    r += sim.sum() >= y_post_sum
print(r / samples)
For context, this is the Causal Impact model developed by Google. As it's been implemented in R, we've been trying to replicate the implementation in Python using statsmodels as the core to process time series.
We already have a quite cool WIP implementation, but we still need the p-value to know when we in fact had an impact that is not explained by mere randomness (the approach of simulating series and counting the ones whose sum surpasses y_post.sum() is also implemented in Google's model).
In my example I used y[70:] += 10. If I add just one instead of ten, Google's p-value computation returns 0.001 (there's an impact in y) whereas in Python's approach it's returning 0.247 (no impact).
Only when I add +5 to y_post does the model return a p_value of 0.02, and as it's lower than 0.05, we consider that there's an impact in y_post.
I'm using python3, statsmodels version 0.9.0
[EDIT2]
After reading cfulton's comments I decided to fully debug the code to see what was happening. Here's what I found:
When we create an object of type UnobservedComponents, the representation of the Kalman filter is eventually initialized. By default, it receives the parameter initial_variance as 1e6, which sets the same property on the object.
When we run the simulate method, the initial_state_cov value is created using this same value:
initial_state_cov = (
    np.eye(self.k_states, dtype=self.ssm.transition.dtype) *
    self.ssm.initial_variance
)
Finally, this same value is used to find initial_state:
initial_state = np.random.multivariate_normal(
    self._initial_state, self._initial_state_cov)
This results in a draw from a normal distribution with a variance of 1e6.
I then tried running the following:
mod1 = UnobservedComponents(np.zeros(len(X_post)), level='llevel', exog=X_post, initial_variance=1)
sim = mod1.simulate(f_model.params, len(X_post))
plt.plot(sim, label='simul')
plt.plot(y_post, label='y')
plt.legend();
print(sim.sum() > y_post.sum())
Which resulted in:
I then tested the p-value, and now, for a variation of +1 in y_post, the model correctly identifies the added signal.
Still, when I tested with the same data that we have in R's Google package, the p-value was still off. Maybe it's a matter of further tweaking the input to increase its accuracy.
@Josef is correct and you did the right thing with:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
The simulate method simulates data according to the model in question, which is why you can't directly use trained_model to simulate when you have exogenous variables.
But for some reason the simulations always ended up being lower than y_post.
I think this should be expected - running your example and looking at the estimated coefficients, we get:
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
sigma2.irregular 0.9278 0.194 4.794 0.000 0.548 1.307
sigma2.level 0.0021 0.008 0.270 0.787 -0.013 0.018
beta.x1 1.1882 0.058 20.347 0.000 1.074 1.303
The variance of the level is very small, which means that it is extremely unlikely that the level would shift upwards by nearly 10 percent in a single period, based on the model you specified.
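As a rough back-of-the-envelope check using the estimated sigma2.level above (a sketch with the numbers from this example):

import numpy as np

sigma_level = np.sqrt(0.0021)   # estimated standard deviation of the level shock
print(10 / sigma_level)         # a one-period level jump of 10 is roughly 218 sigma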
When you used:
mod1.simulate(f_model.params, len(X_post), initial_state=0, state_shocks=np.random.normal(size=len(X_post)))
what happened is that the level term is the only unobserved state here, and by providing your own shocks with a variance equal to 1, you essentially overrode the level variance actually estimated by the model. I don't think that setting the initial state to 0 has much of an effect here. (see edit).
You write:
the p-value computation was closer, but still is not correct.
I'm not sure what this means - why would you expect the model to think such a jump was a likely occurrence? What p-value are you expecting to achieve?
Edit:
Thanks for investigating further (in Edit 2). First, what I think you should do is:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)

initial_state = np.random.multivariate_normal(
    f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])

mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
Now, the explanation:
In Statsmodels 0.9, we didn't yet have exact treatment of states with a diffuse initialization (it has been merged in since then, though, and this is one reason that I wasn't able to replicate your results until I tested your example with the 0.9 codebase). These "initially diffuse" states don't have a long-run mean that we can solve for (e.g. a random walk process), and the state in the local level case is such a state.
The "approximate" diffuse initialization involves setting the initial state mean to zero and the initial state variance to a large number (as you discovered).
For simulations, the initial state is, by default, sampled from the given initial state distribution. Since this model is initialized with approximate diffuse initialization, that explains why your process was initialized around some random number.
Your solution is a good patch, but it's not optimal because it doesn't base the initial state for the simulated period on the last state from the estimated model / data. These values are given by f_model.predicted_state[..., -1] and f_model.predicted_state_cov[..., -1].
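Putting this together with the earlier p-value loop, a rough sketch (assuming f_model, X_post and y_post from the question; 1000 draws is an arbitrary choice):

import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)

samples = 1000
count = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    # draw the starting state from the last predicted state of the trained model
    initial_state = np.random.multivariate_normal(
        f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
    count += sim.sum() >= y_post_sum

print(count / samples)   # simulated p-value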
I am trying to perform a KS goodness-of-fit test for my data and an estimated distribution.
The plot looks like this:
The code I am using and the results are as follows:
sp.stats.kstest(df['col'], 'norm', args = (mean, sd), N = 1000000)
KstestResult(statistic=0.06905359838747682, pvalue=0.0)
from df I am taking my data points.
'norm' because I assume a normal distribution.
args is a tuple with the parameters for the theoretical distribution function that I estimated using my dataset.
N = 1000000 as the sample size.
Of course, the fit is not perfect, but I cannot understand why the p-value is exactly 0.0. Am I doing something wrong with the function, or is the fit really that bad? I would expect the p-value to be small, even as small as 0.01 or 0.000000536 or whatever, but not dead zero.
Any ideas what is wrong or what can be done to make it work?
BTW: the raw data is originally log-normally distributed (the plot here shows it after a log transformation of the original).
I have a database of features, a 2D np.array (2000 samples and each sample contains 100 features, 2000 X 100). I want to fit gaussian distributions to my database using python. My code is the following:
from sklearn import mixture

data = load_my_data()  # loads a np.array with size 2000x100
clf = mixture.GaussianMixture(n_components=50, covariance_type='full')
clf.fit(data)
I am not sure about the parameters, for example covariance_type, and how I can investigate whether the fit occurred successfully or not.
EDIT: I debugged the code to investigate what is happening with clf.means_, and apparently it produced a matrix of shape n_components X size_of_features (50 X 100). Is there a way that I can check that the fitting was successful, or to plot the data? What are the alternatives to Gaussian mixtures (mixtures of exponentials, for example; I cannot find any available implementation)?
I think you are using the sklearn package.
Once you have fit the model, type
print(clf.means_)
If it prints output, the model has been fitted; if it raises an error, it has not been fitted.
Hope this helps you.
You can do dimensionality reduction using PCA to 3D space (let's say) and then plot means and data.
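A minimal sketch of both checks (assuming data and clf from the question; the projection to 2D here is just for plotting convenience):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# basic sanity checks on the fitted mixture
print(clf.converged_)       # True if EM converged
print(clf.means_.shape)     # (n_components, n_features)
print(clf.lower_bound_)     # lower bound on the average log-likelihood of the best fit

# project the data and the component means into 2D and plot them together
pca = PCA(n_components=2).fit(data)
data_2d = pca.transform(data)
means_2d = pca.transform(clf.means_)

plt.scatter(data_2d[:, 0], data_2d[:, 1], s=5, alpha=0.3, label='data')
plt.scatter(means_2d[:, 0], means_2d[:, 1], c='red', marker='x', label='component means')
plt.legend()
plt.show()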
It is always preferable to choose a reduced set of candidate distributions before trying to identify the distribution (in other words, use Cullen & Frey to reject the unlikely candidates) and then run a goodness-of-fit test and select the best result.
You can just create a list of all available distributions in scipy. An example with two distributions and random data:
import numpy as np
import scipy.stats as st

data = np.random.random(10000)

# Specify all candidate distributions here
distributions = [st.laplace, st.norm]

mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)   # negative log-likelihood; smaller is better
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]

best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1]))
I understand you may want to compare two different distributions against each other rather than fit each of them to an analytic curve. If that is the case, you may be interested in plotting one against the other and performing a linear (or polynomial) regression, checking the coefficients.
A linear regression of the two distributions may tell you whether they are linearly dependent or not.
Linear Regression using Scipy documentation
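A rough sketch of that idea, a quantile-quantile style comparison of two samples using scipy.stats.linregress (hypothetical random samples of equal size):

import numpy as np
import scipy.stats as st

sample_a = np.sort(np.random.normal(size=10000))
sample_b = np.sort(np.random.laplace(size=10000))

# regress the sorted values of one sample on the other (Q-Q style);
# a high r and a stable slope suggest a linear relationship between the two
slope, intercept, r, p, stderr = st.linregress(sample_a, sample_b)
print(slope, intercept, r)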
I am following the Orthogonal distance regression method to fit data with errors on both the dependent and independent variables.
I am fitting the data with a simple straight line, my model is y = ax + b.
Now, I am able to write the code and plot the line fitting the data, but I am not able to interpret the results:
Beta: [ 2.08346947 0.0024333 ]
Beta Std Error: [ 0.03654482 0.00279946]
Beta Covariance: [[ 2.06089823e-03 -9.99220260e-05]
[ -9.99220260e-05 1.20935366e-05]]
Residual Variance: 0.648029925546
Inverse Condition #: 0.011825289654
Reason(s) for Halting:
Sum of squares convergence
The Beta is just the array containing the values of the parameters of my model (a, b), and Beta Std Error, the associated errors.
Regarding the other values, I don't know their meaning.
Especially, I would like to know which one is indicative of a goodness-of-fit, something like the chi-square when one fits with the errors only on the dependent variable.
Beta Covariance is the covariance matrix of your fitted parameters. It can be thought of as a matrix describing how inter-connected your two parameters are, with respect to both themselves and each other.
Residual Variance, I believe, is a measure of the goodness of fit, where the smaller the value, the better the fit to your data.
Inverse Condition is the inverse (1/x) of the condition number. The condition number defines how sensitive your fitted function is to changes in the input.
scipy.odr is a wrapper around a much older FORTRAN-77 package known as ODRPACK. The documentation for ODRPACK can actually be found on the scipy website. This may help you in understanding what you need to know, as it contains the mathematical descriptions of the parameters.
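For reference, a minimal scipy.odr sketch for the straight-line model that produces this kind of output (hypothetical data arrays x, y and their errors sx, sy):

import numpy as np
from scipy.odr import ODR, Model, RealData

def linear(beta, x):
    # straight-line model y = a*x + b
    return beta[0] * x + beta[1]

# hypothetical data with errors on both variables
x = np.linspace(0, 10, 50)
y = 2.0 * x + 0.1 + np.random.normal(scale=0.2, size=x.size)
sx = np.full_like(x, 0.1)
sy = np.full_like(y, 0.2)

data = RealData(x, y, sx=sx, sy=sy)
odr = ODR(data, Model(linear), beta0=[1.0, 0.0])
out = odr.run()

print(out.beta)      # fitted (a, b)
print(out.sd_beta)   # their standard errors
print(out.cov_beta)  # covariance matrix of the parameters
print(out.res_var)   # residual variance

out.res_var corresponds to the Residual Variance line in the output above; when the supplied sx and sy are genuine measurement errors, it behaves like a reduced chi-square, so values close to 1 suggest a reasonable fit.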