Checking for randomness using the Chi-Square Test

Checking for randomness using the Chi-Square Test - python

I'm running a simulation for a class project that relies heavily on random number generators, and as a result we're asked to test the random number generator to see just how "random" it is using the Chi-Square static. After looking through the some posts here, I used the follow code to find the answer:
from random import randint
import numpy as np
from scipy.stats import chisquare
numIterations = 1000 #I've run it with other numbers as well
observed = []
for i in range(0, numIterations):
observed.append(randint(0, 100))
data = np.array(observed)
print "(chi squared statistic, p-value) with", numOfIter, "samples generated: ", chisquare(data)
However, I'm getting a p-value of zero when numIterations is greater than 10, which doesn't really make sense considering the null hypothesis is that the data is uniform. Am I misinterpreting the results? Or is my code simply wrong?

A chi-square test checks how many items you observed in a bin vs how many you expected to have in that bin. It does so by summing the squared deviations between observed and expected across all bins. You can't just feed it raw data, you need to bin it first using something like scipy.stats.histogram.

Depending on what distribution your going for you can test for it, remember that having more samples will approximate the distribution better (if you could take an infinite number of samples you would have the actual distribution). Since in real life we can't run our number generators an infinite number of times we only deal with approximated situations, so we bin the distribution (see how many numbers fall into a bin http://en.wikipedia.org/wiki/Bean_machine). Now if you ran your bean machine and you found that one of the bins was significantly higher than the expected distribution (in this case Gaussian) then you would say that the process is not Gaussian. Same thing with chi squared except your the shape is different than Gaussian because your sampling multiple normal (special case Gaussian) distributions. Since you want to find out if your data is normal/gaussian (think of shapes, the shapes are determined by the distributions parameters ie mean std kurtosis) here is an example of how to do that: http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/chi-square-test-for-normality/
I don't know what your data is so I can't really tell you what to look for. All in all you will need to know what your statistical data that your given is then try to fit it to a model (in this case chi-squared) then ask yourself if it matches up with the model (the curve, your probably trying to find if its Gaussian/normal or not which you can do with the chi-squared test). You should google chi-squared, Gaussian normal ect ect.

Related

Comparing models fitting multivariate data

I have trouble using WAIC (widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to multivariate Dirichlet distribution. I try to fit the data by assuming that marginal distributions are in one case the beta distributions, while in the other lognormal distributions. Obviously in the first case I get lower (better) WAIC value, than in the second case.
The problem arises in the third case then I assume that data is distributed according to Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that log-normal fit is bad. This is easily seen by the naked eye, but I would like to have formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm
# generate the data
df=pd.DataFrame(np.random.dirichlet([10,10,10],size=2000))
# fit the first case (assuming beta marginal distributions)
betaModel=pm.Model()
with betaModel:
alpha=pm.Uniform("alpha",lower=0,upper=20,shape=3)
beta=pm.Uniform("beta",lower=0,upper=40,shape=3)
observed=pm.Beta("obs",alpha=alpha,beta=beta,observed=df.values,shape=df.shape)
betaTrace=pm.sample()
# fit the second case (assuming log-normal marginal distributions)
lognormalModel=pm.Model()
with lognormalModel:
mu=pm.Normal("mu",mu=0,sd=3,shape=3)
sd=pm.HalfNormal("sd",sd=3,shape=3)
observed=pm.Lognormal("obs",mu=mu,sd=sd,observed=df.values,shape=df.shape)
lognormalTrace=pm.sample()
# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel=pm.Model()
with dirichletModel:
alpha=pm.HalfNormal("alpha",sd=3,shape=3)
observed=pm.Dirichlet("obs",a=alpha,observed=df.values,shape=df.shape)
dirichletTrace=pm.sample()
# compare WAIC
print(pm.waic(betaTrace,betaModel))
print(pm.waic(lognormalTrace,lognormalModel))
print(pm.waic(dirichletTrace,dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions how to obtain a reasonable comparison?
Edit
After thinking about the problem I would believe that it is somewhat "improper". I tend to think so because WAIC is a relative measure, thus it is likely that only similar statistical models can be reasonably compared. If the models are too dissimilar, then you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case random vector as whole is treated as a random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases. But as the first two statistical models (wrongly) assume that components are independent, adding log-probabilities does not change anything. WAIC values remain the same.
Currently I think that a small cheat is possible. Namely to fit data to the Dirichlet distribution, but calculate WAIC as if I would have fitted the beta distribution. This gives an expected result - WAIC for the Dirichlet fit is slightly larger than WAIC for the beta fit, but smaller than WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp
def cheat_logp(tracePoint,model):
values=model.obs.eval()
_,components=values.shape
cb=[None]*components
beta=np.sum(tracePoint["alpha"])
for i in range(components):
cheatBeta=pm.Beta.dist(alpha=tracePoint["alpha"][i],beta=beta-tracePoint["alpha"][i])
cb[i]=cheatBeta.logp(values[:,i]).eval()
return np.array(cb).T
def _log_post_trace(trace, model):
# copy the contents of _log_post_trace function from pymc3/stats.py
# but replace "var.logp_elemwise(pt)" with "cheat_logp(pt,model)"
# <...>
def mywaic(trace, model=None, pointwise=False):
# copy the contents of waic function from pymc3/stats.py
# <...>
Obviously this cheat is not very "nice" and I am still very much interested on how to achieve similar results, but in a proper manner. Of course if it is possible.

How do you compute the expected value of an infinite distribution? (particularly in python)

I was trying to compute the expected value of a distribution (assume I know the parameters or I can estimate them) but it might be a distribution over a sample space that is infinite. Is there a library (for example in python, numpy or something) that is able to compute such an expected value with reasonable speed and accuracy?
For an arbitrary distribution it seemed hard, but the only thoughts I had was, if it was normally, then we can approximate this by adding small enough chunks in cap where the probability is highly concentrated or something...but I wanted to do something less ad-hoc and more established, since I am sure I am not the first one to try to compute an expected value in a computer.

Having a probability space with infinite support is not uncommon.
The normal or t distribution have support over the real line, the Poisson distribution is over all positive integers.
The distribution in scipy.stats implement an expect method, which in the continuous case just uses scipy.integrate.quad, and in the discrete case uses expanding summation with some heuristic stopping criterion.
This works quite well with well behaved functions but can run into problems in some cases, like shifted support of the function or fat tails.
variance of standard normal:
>>> from scipy import stats
>>> stats.norm.expect(lambda x: x**2)
1.000000000000001
variance of Poisson:
>>> stats.poisson.expect(lambda x: (x - 5)**2, args=(5,))
4.9999999999999973

python scipy.stats pdf and expect functions

I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused with what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value , called "inf" and a larger value "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that i can get the probability that any area is between sup and inf.
Can anyone help me by explaining in a simple way what the functions do and any hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise

rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values and subtract to find the probability that a value drawn from the unknown distribution.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.

The cumulative density function might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
There's a tutorial about probabilities here and here
They are all related. The pdf is the "density" of the probabilities. They must be greater than zero and sum to 1. I think of it as indicating how likely something is. The expectation is is a generalisation of the idea of average.
E[x] = sum(x.P(x))

Sampling methods

Can you help me out with these questions? I'm using Python
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R) then we can use the transformation sampling method described in Bishop Sec. 11.1.1 to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(−\lambda y) , y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop: Slice sampling involves augmenting z with an additional variable u and then drawing samples from the joint (z,u) space.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
in the above distribution, we then use the sample mean
$$b_y = \frac1L \sum^L_{\ell=1} y(\ell)$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute diﬀerence
$$|\mu_y − b_y|$$
between the true mean $µ_y$ and estimate $b_y$
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of expected absolute deviation to get a more or less straight line and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?

There are a few shortcuts you can take. Python has some built-in methods to do sampling, mainly in the Scipy library. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers + convergence, which is what you are describing. The paper deals with Poisson random variables, but you should be able to adapt the code to your own situation.

Find a random method that best fit list of values

I have a list of many float numbers, representing the length of an operation made several times.
For each type of operation, I have a different trend in numbers.
I'm aware of many random generators presented in some python modules, like in numpy.random
For example, I have binomial, exponencial, normal, weibul, and so on...
I'd like to know if there's a way to find the best random generator, given a list of values, that best fit each list of numbers that I have.
I.e, the generator (with its params) that best fit the trend of the numbers on the list
That's because I'd like to automatize the generation of time lengths, of each operation, so that I can simulate it during n years, without having to find by hand what method fits best what list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fit the array of numbers I already have. The only problem I see is that each probability distribution has input params that may interfer on the result. So I'll have to figure out how to enter this params automatically, trying to best fit the list.
Any idea?

You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.

Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT Adding reference to numpy least squares fitting example

Given a parameterized univariate distirbution (e.g. exponential depends on lambda, or gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the Maximum Likelyhood procedure. It is not a least squares procedure, which would require binning and thus loosing information! Some Wikipedia distribution articles give expressions for the maximum likelyhood estimates of parameters, but many do not, and even the ones that do are missing expressions for error bars and covarainces. If you know calculus, you can derive these results by expressing the log likeyhood of your data set in terms of the parameters, setting the second derivative to zero to maximize it, and using the inverse of the curvature matrix at the minimum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelyhood ratio test. Basically, you just pick the one with the larger log likelyhood.

Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it might be easy to figure what maximum likelihood method commands the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the estimated parameter for exponential distribution is sum (1/la -x_i) from where la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.