I am having trouble using WAIC (widely applicable information criterion) in PyMC3. I have data which I know to be distributed according to a multivariate Dirichlet distribution. I try to fit the data by assuming, in one case, that the marginal distributions are beta distributions and, in the other, that they are log-normal distributions. As expected, in the first case I get a lower (better) WAIC value than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that the log-normal fit is bad. This is easily seen by the naked eye, but I would like to have a formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm
# generate the data
df=pd.DataFrame(np.random.dirichlet([10,10,10],size=2000))
# fit the first case (assuming beta marginal distributions)
betaModel=pm.Model()
with betaModel:
    alpha = pm.Uniform("alpha", lower=0, upper=20, shape=3)
    beta = pm.Uniform("beta", lower=0, upper=40, shape=3)
    observed = pm.Beta("obs", alpha=alpha, beta=beta, observed=df.values, shape=df.shape)
    betaTrace = pm.sample()
# fit the second case (assuming log-normal marginal distributions)
lognormalModel=pm.Model()
with lognormalModel:
    mu = pm.Normal("mu", mu=0, sd=3, shape=3)
    sd = pm.HalfNormal("sd", sd=3, shape=3)
    observed = pm.Lognormal("obs", mu=mu, sd=sd, observed=df.values, shape=df.shape)
    lognormalTrace = pm.sample()
# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel=pm.Model()
with dirichletModel:
    alpha = pm.HalfNormal("alpha", sd=3, shape=3)
    observed = pm.Dirichlet("obs", a=alpha, observed=df.values, shape=df.shape)
    dirichletTrace = pm.sample()
# compare WAIC
print(pm.waic(betaTrace,betaModel))
print(pm.waic(lognormalTrace,lognormalModel))
print(pm.waic(dirichletTrace,dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions on how to obtain a reasonable comparison?
Edit
After thinking about the problem, I have come to believe that it is somewhat "improper". I tend to think so because WAIC is a relative measure, so it is likely that only similar statistical models can be reasonably compared. If the models are too dissimilar, you get what I got.
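(For reference, the deviance-scale WAIC that I believe pm.waic reports, with N data points and S posterior draws, is

$$\mathrm{WAIC} = -2\left(\sum_{n=1}^{N} \log\!\Big(\tfrac{1}{S}\sum_{s=1}^{S} p(y_n \mid \theta^{(s)})\Big) \;-\; \sum_{n=1}^{N} \operatorname{Var}_{s}\big[\log p(y_n \mid \theta^{(s)})\big]\right)$$

so the value depends directly on what counts as a single data point y_n.)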
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases. But since the first two statistical models (wrongly) assume that the components are independent, adding up the log-probabilities does not change anything; the WAIC values remain the same.
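To illustrate what I mean (a sketch; it uses logp_elemwise, which is what pymc3's waic uses internally, evaluated at a single posterior draw, and assumes trace.point(0) returns a dict of parameter values):
pt = betaTrace.point(0)
print(betaModel.obs.logp_elemwise(pt).shape)        # (2000, 3) -> 6000 pointwise terms
pt = dirichletTrace.point(0)
print(dirichletModel.obs.logp_elemwise(pt).shape)   # (2000,)   -> 2000 pointwise terms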
Currently I think that a small cheat is possible: fit the data to the Dirichlet distribution, but calculate WAIC as if I had fitted the beta distribution. This gives the expected result: the WAIC for the Dirichlet fit is slightly larger than the WAIC for the beta fit, but smaller than the WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp
def cheat_logp(tracePoint, model):
    # the marginal of Dirichlet component i is Beta(alpha_i, sum(alpha) - alpha_i)
    values = model.obs.eval()
    _, components = values.shape
    cb = [None] * components
    beta = np.sum(tracePoint["alpha"])
    for i in range(components):
        cheatBeta = pm.Beta.dist(alpha=tracePoint["alpha"][i], beta=beta - tracePoint["alpha"][i])
        cb[i] = cheatBeta.logp(values[:, i]).eval()
    return np.array(cb).T

def _log_post_trace(trace, model):
    # copy the contents of the _log_post_trace function from pymc3/stats.py,
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt, model)"
    # <...>

def mywaic(trace, model=None, pointwise=False):
    # copy the contents of the waic function from pymc3/stats.py
    # <...>
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible.
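For reference, here is a more self-contained sketch of the same idea that avoids copying pymc3 internals. It scores every model on the same 2000 row-wise points by summing the per-component log-probabilities within each row (it assumes trace.point(idx) returns a dict of parameter values and that the observed variable is accessible as model.obs, as in the cheat above):
import numpy as np
from scipy.special import logsumexp

def rowwise_waic(trace, model):
    # WAIC on the deviance scale, treating each observed row (vector) as one data point
    obs = model.obs
    log_py = []
    for idx in range(len(trace)):
        pt = trace.point(idx)                         # one dict of parameter values per draw
        lp = np.asarray(obs.logp_elemwise(pt))
        lp = lp.reshape(lp.shape[0], -1).sum(axis=1)  # sum the per-component terms within each row
        log_py.append(lp)
    log_py = np.stack(log_py)                         # shape: (n_draws, n_rows)
    lppd = logsumexp(log_py, axis=0) - np.log(log_py.shape[0])
    p_waic = np.var(log_py, axis=0, ddof=1)
    return -2.0 * np.sum(lppd - p_waic)

print(rowwise_waic(betaTrace, betaModel))
print(rowwise_waic(lognormalTrace, lognormalModel))
print(rowwise_waic(dirichletTrace, dirichletModel))
For the beta and log-normal models this is still the "cheat" in spirit (they wrongly assume independent components), but at least all three WAICs are then computed over the same 2000 pointwise terms.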
Related
I'm currently scratching my head about how I might implement a classic ARIMA(X) model using base TensorFlow (and optionally Keras). The equation I am attempting to set up has roughly the following form:

$$\Delta^d y_t = c + \sum_{i=1}^{p} \phi_i\, \Delta^d y_{t-i} + \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} + \sum_{k} \beta_k\, x_{k,t} + \epsilon_t$$

where d represents the level of differencing applied to the observed input time series, p is the auto-regressive order, and q is the moving-average order. The part which is currently stumping me is the calculation/estimation of the residuals epsilon. The auto-regression portion is a simple linear regression on the lagged samples, and the same is true for the terms involving the exogenous series (X). When I am estimating the residuals, should I simply feed the q previous steps into the current estimated parameters and compute the residuals as y_true - y_predict? This also raises the question: how does one estimate the residuals for observations that have no previous observations? Do we simply draw residuals 0 through q from a chosen random distribution of set variance (e.g. Normal, Poisson, etc.) with a mean of 0?
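To make the question concrete, here is a rough NumPy sketch of the recursion I have in mind (names are made up, the exogenous terms are omitted, and the first max(p, q) residuals are simply initialised to zero); a TensorFlow version would express the same loop with tensor ops:
import numpy as np

def arma_residuals(y, phi, theta, c=0.0):
    # one-step-ahead residuals for an ARMA(p, q) on an already-differenced series y
    # phi: AR coefficients (length p), theta: MA coefficients (length q)
    p, q = len(phi), len(theta)
    eps = np.zeros_like(y, dtype=float)
    for t in range(max(p, q), len(y)):
        ar_part = np.dot(phi, y[t - p:t][::-1])       # phi_1*y_{t-1} + ... + phi_p*y_{t-p}
        ma_part = np.dot(theta, eps[t - q:t][::-1])   # theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}
        eps[t] = y[t] - (c + ar_part + ma_part)
    return eps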
I have looked at the source for the statsmodels package to try to understand it, but it is quite opaque. Part of the reason for implementing the model this way is that it needs to fit into a fairly standard ecosystem at the company I work for, and we need control over what slices of data the model is fitted to at a given time step. This is because some data may arrive (much) later than the time stamp it relates to, due to lag at the source etc.
Thank you for any help you might be able to offer.
I'm trying to use KernelPCA for reducing the dimensionality of a dataset to 2D (both for visualization purposes and for further data analysis).
I experimented with computing KernelPCA using an RBF kernel at various values of gamma, but the result is unstable:
(each frame is a slightly different value of gamma, which varies continuously from 0 to 1)
Looks like it is not deterministic.
Is there a way to stabilize it/make it deterministic?
Code used to generate transformed data:
from sklearn.decomposition import KernelPCA

def pca(X, gamma1):
    kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=gamma1)
    X_kpca = kpca.fit_transform(X)
    #X_back = kpca.inverse_transform(X_kpca)
    return X_kpca
KernelPCA should be deterministic and evolve continuously with gamma. It is different from RBFSampler, which does have built-in randomness in order to provide an efficient (more scalable) approximation of the RBF kernel.
However, what can change in KernelPCA is the order of the principal components: in scikit-learn they are returned sorted in descending order of eigenvalue, so if you have two eigenvalues close to each other, their order may change with gamma.
My guess (from the gif) is that this is what is happening here: the axes along which you are plotting are not constant, so your data seems to jump around.
Could you provide the code you used to produce the gif?
I'm guessing it is a plot of the data points along the 2 first principal components but it would help to see how you produced it.
You could try to further inspect it by looking at the values of kpca.alphas_ (the eigenvectors) for each value of gamma.
Hope this makes sense.
EDIT: As you noted, it looks like the points are reflected across an axis; the most plausible explanation is that one of the eigenvectors flips sign (note that this does not affect the eigenvalue).
I put together a simple gist to reproduce the issue (you'll need a Jupyter notebook to run it). You can see the sign flipping when you change the value of gamma.
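As a sketch of one possible workaround (the sign convention is arbitrary, so any consistent choice will do): flip the sign of each component so that consecutive fits stay aligned, for example:
import numpy as np
from sklearn.decomposition import KernelPCA

def aligned_kpca_frames(X, gammas):
    # fit KernelPCA for each gamma and flip component signs so that consecutive frames agree
    frames, previous = [], None
    for g in gammas:
        Z = KernelPCA(n_components=2, kernel="rbf", gamma=g).fit_transform(X)
        if previous is not None:
            signs = np.sign(np.sum(Z * previous, axis=0))  # anti-correlated components get flipped
            signs[signs == 0] = 1.0
            Z = Z * signs
        frames.append(Z)
        previous = Z
    return frames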
As a complement, note that this kind of discrepancy happens only because you fit the KernelPCA object several times. Once you have settled on a particular gamma value and fit kpca once, you can call transform several times and get consistent results.
For the classical PCA the docs mention that:
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
I don't know about the behavior of a single KernelPCA object that you would fit several times (I did not find anything relevant in the docs).
It does not apply to your case though as you have to fit the object with several gamma values.
So... I can't give you a definitive answer on why KernelPCA is not deterministic. The behavior resembles the differences I've observed between the results of PCA and RandomizedPCA. PCA is deterministic, but RandomizedPCA is not, and sometimes the eigenvectors are flipped in sign relative to the PCA eigenvectors.
That leads me to my vague idea of how you might get more deterministic results... maybe. Use RBFSampler with a fixed seed:
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import PCA

def pca(X, gamma1):
    # approximate the RBF kernel feature map with a fixed seed, then run ordinary
    # (deterministic) PCA on the mapped features
    kernvals = RBFSampler(gamma=gamma1, random_state=0).fit_transform(X)
    X_kpca = PCA(n_components=2).fit_transform(kernvals)
    return X_kpca
I'm running a simulation for a class project that relies heavily on random number generators, and as a result we're asked to test the random number generator to see just how "random" it is, using the chi-square statistic. After looking through some posts here, I used the following code to find the answer:
from random import randint
import numpy as np
from scipy.stats import chisquare
numIterations = 1000  # I've run it with other numbers as well
observed = []
for i in range(0, numIterations):
    observed.append(randint(0, 100))
data = np.array(observed)
print "(chi squared statistic, p-value) with", numIterations, "samples generated: ", chisquare(data)
However, I'm getting a p-value of zero when numIterations is greater than 10, which doesn't really make sense considering the null hypothesis is that the data is uniform. Am I misinterpreting the results? Or is my code simply wrong?
A chi-square test checks how many items you observed in a bin versus how many you expected in that bin. It does so by summing the squared deviations between observed and expected counts across all bins. You can't just feed it raw data; you need to bin it first, using something like np.histogram (or the older scipy.stats.histogram).
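For example, for the uniform generator in the question, a binned version could look like this (a sketch using np.histogram):
import numpy as np
from random import randint
from scipy.stats import chisquare

numIterations = 1000
observed = [randint(0, 100) for _ in range(numIterations)]

# count how many draws fall into each of the 101 possible values 0..100
observed_counts, _ = np.histogram(observed, bins=np.arange(-0.5, 101.5, 1.0))
expected_counts = np.full(101, numIterations / 101.0)  # uniform null hypothesis

print(chisquare(observed_counts, f_exp=expected_counts))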
Depending on what distribution you're aiming for, you can test for it. Remember that more samples approximate the distribution better (if you could take an infinite number of samples you would have the actual distribution). Since in real life we can't run our number generators an infinite number of times, we only deal with approximations, so we bin the distribution and see how many numbers fall into each bin (think of a bean machine: http://en.wikipedia.org/wiki/Bean_machine). If you ran your bean machine and found that one of the bins was significantly higher than the expected distribution (Gaussian in that case), you would say the process is not Gaussian. The chi-squared test works the same way, except that the reference curve has a different shape than a Gaussian, because the statistic comes from summing squared (approximately normal) deviations. If you want to find out whether your data is normal/Gaussian (think of shapes; the shape is determined by the distribution's parameters, i.e. mean, std, kurtosis), here is an example of how to do that: http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/chi-square-test-for-normality/
I don't know what your data is, so I can't tell you exactly what to look for. In general you need to know what distribution your data is supposed to follow, fit it to that model, and then ask whether the data matches the model (here with the chi-squared test; you're probably trying to find out whether it is Gaussian/normal or not, which you can do with a chi-squared goodness-of-fit test). You should google chi-squared, Gaussian, normal, etc.
I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. I've read that RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the following way:
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

modelspec = 'cost ~ np.log(units) + np.log(units):item + item'  # where item is a categorical variable
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print results.summary()
the summary results show a z statistic, and the significance test for the coefficients seems to be based on this rather than on a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases each item has 40 or more observations, but some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample environment? The line it produces should be the best-fit line after down-weighting outliers, but is the z-test effective for samples of this size? In other words, is there reason to believe that the confidence intervals produced by smf.rlm() do not have 95% coverage? I know this can potentially be an issue for t-tests.
Thanks!
I have mostly only a general answer; I have never read any small-sample Monte Carlo studies for M-estimators.
To 1.
In many models, such as M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except for maybe a few special cases. The asymptotic results give conditions under which the estimator is normally distributed. Given this, statsmodels defaults to the normal distribution for all models outside the linear regression model (OLS) and similar ones, and to the chi-square distribution instead of the F distribution for Wald tests with joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release, and in the current development version, of statsmodels users can choose to use the t and F distribution for the results, instead of the normal and chisquare distribution. The defaults stay the same as they are now.
There are other cases where it is not clear whether the t distribution should be used at all, or which small-sample degrees of freedom to use. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity or correlation robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness of the assumptions about the outliers and on the choice of robust estimator. Another cross-check: does OLS with the outliers dropped (the observations that receive small weights in the RLM estimate) give a similar answer?
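A minimal sketch of that cross-check with statsmodels, reusing the formula from the question (dataset and the 0.5 cutoff are placeholders; the final IRLS weights are available as results.weights, and this assumes no rows are dropped during formula handling):
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
rlm_res = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()

# keep only the observations that the M-estimator did not heavily down-weight
keep = np.asarray(rlm_res.weights) > 0.5
ols_res = smf.ols(modelspec, data=dataset[keep]).fit()

print(rlm_res.params)
print(ols_res.params)  # similar coefficients suggest the fit is not driven by a few outliers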
Finally, as a general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?
I have a list of many float numbers representing the duration of an operation performed several times.
For each type of operation, the numbers follow a different trend.
I'm aware of the many random generators provided by some Python modules, like numpy.random.
For example, there are binomial, exponential, normal, Weibull, and so on...
I'd like to know if there's a way to find the random generator, given a list of values, that best fits each list of numbers that I have.
I.e., the generator (with its params) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of time lengths for each operation, so that I can simulate it over n years, without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fits the array of numbers I already have. The only problem I see is that each probability distribution has input parameters that may affect the result. So I'll have to figure out how to set these parameters automatically in order to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions rather than random number generators. You can then think in terms of testing the goodness of fit of your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest, in terms of the math behind it, is a Q-Q plot: using the random number generators, create a sample of the same size as your data, sort both, and plot them against one another. If the distributions are the same, you should get a straight line.
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
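A sketch of both steps with scipy (the variable name for your data is a placeholder): fit a candidate distribution by maximum likelihood and then inspect a probability (Q-Q) plot against it:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

durations = np.asarray(my_durations)          # placeholder: your list of operation lengths

params = stats.expon.fit(durations)           # maximum-likelihood fit of a candidate distribution
stats.probplot(durations, dist=stats.expon, sparams=params, plot=plt)
plt.show()                                    # points close to the straight line => good fit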
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least-squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: Adding a reference to a numpy least-squares fitting example.
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, or the gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called maximum likelihood estimation. It is not a least-squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the curvature matrix (the negative Hessian of the log likelihood) at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
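In Python, scipy.stats covers most of this: each distribution's fit method returns maximum likelihood estimates, and the resulting log likelihoods can be compared directly. A sketch with made-up data:
import numpy as np
from scipy import stats

x = np.random.gamma(shape=2.0, scale=3.0, size=500)   # stand-in for your sample

candidates = {
    "exponential": stats.expon,
    "gamma": stats.gamma,
    "lognormal": stats.lognorm,
    "weibull": stats.weibull_min,
}

for name, dist in candidates.items():
    params = dist.fit(x)                               # maximum likelihood estimates
    loglike = np.sum(dist.logpdf(x, *params))          # log likelihood at those estimates
    print(name, params, loglike)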
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it might be just as easy to work out what the maximum likelihood method requires the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum likelihood equation for the exponential distribution is the sum of (1/la - x_i) set to zero, from which la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
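The same closed-form estimate in Python, for comparison (the data here is just a placeholder sample):
import numpy as np

data = np.random.exponential(scale=1.0, size=10)  # placeholder sample
la_hat = 1.0 / np.mean(data)                      # maximum likelihood estimate of the rate
print(la_hat)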