Kolmogorov test for Python

I'm trying to test whether data follows a normal distribution, but kstest is not working as I expect. I use normal from numpy, which "draw[s] random samples from a normal (Gaussian) distribution":
from scipy.stats import kstest, norm
from numpy.random import seed, normal
seed(42)
data = normal(80, 6, 1000)
# data = norm.rvs(loc=80, scale=6, size=1000)
ksstat, p_value = kstest(data, "norm")
if p_value > 0.05:
    print('it looks like Gaussian (fail to reject H0)')
else:
    print("it doesn't look like Gaussian (reject H0)")
I already checked two ways of generating the normal distribution, with numpy and scipy, but the test does not conclude that the data is normally distributed.
However, after transforming the data with (data - np.mean(data)) / np.std(data), the test does conclude it is a normal distribution.
What am I missing here? Why doesn't the test report normality directly?

scipy.stats.kstest tests the data against the given distribution--with the given distribution parameters (if any). When you use kstest(data, "norm"), the distribution is the standard normal distribution, with mean 0 and standard deviation 1. You generated the data with mean 80 and standard deviation 6, so naturally it does not match.
You can normalize the data as you show in the question, or, if you happen to know the parameters, you can pass them to kstest using the args parameter:
ksstat, p_value = kstest(data, "norm", args=(80, 6))
Or, you could estimate the parameters from the data:
ksstat, p_value = kstest(data, "norm", args=(data.mean(), data.std()))
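Putting both options together, here is a minimal runnable sketch (the exact statistic and p-value will vary with the random seed):
from numpy.random import seed, normal
from scipy.stats import kstest
seed(42)
data = normal(80, 6, 1000)
# Option 1: standardize the data and test against the standard normal
standardized = (data - data.mean()) / data.std()
print(kstest(standardized, "norm"))
# Option 2: pass the estimated location and scale to kstest
print(kstest(data, "norm", args=(data.mean(), data.std())))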

Related

Generate random sample for exponentiated Weibull distribution using Python

I ran this scipy function on a distribution to detect the best fit; it came out as an exponentiated Weibull distribution, and the function outputs 4 parameter values. But how do I generate a sample list of data of size n that honours this kind of distribution?
I don't want to re-write the function. Any Python package that does this would be helpful.
Usually you will use a ppf (percent point function, the inverse cdf) to generate samples from uniform random values.
For a simple, completely fake example, let's say we fit a uniform random variable (with values from 0 to 15) to a Weibull distribution. Then, create uniform random values (from 0 to 1, because they are the quantiles you will evaluate) and put them into the ppf function.
import scipy.stats as st
import numpy as np
# fitting part
samples = np.random.rand(10000)*15
dist = st.exponweib.fit(samples)
# generation part
sample_seed = np.random.rand(10000)
random_exponweib_samples = st.exponweib.ppf(sample_seed, *dist)
# plotting
import matplotlib.pyplot as plt
plt.hist(samples, label="uniform", alpha=.5)
plt.hist(random_exponweib_samples, label="exponweib", alpha=.5)
plt.legend()
plt.show()
You'll have something like the following plot: the two histograms (uniform samples and exponweib samples) overlaid.
Please be careful and check the documentation of ppf for the Weibull distribution. In my call st.exponweib.ppf(sample_seed, *dist) I just unpack *dist, but it might be the case that the parameters should be passed differently; you can tell from the shape of the exponweib histogram, which might not be correct.
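As a cross-check, scipy documents that fit returns the shape parameters first, followed by loc and scale, so for exponweib the tuple is (a, c, loc, scale). A small sketch of passing them explicitly, and of using rvs to draw samples directly (continuing the example above):
import numpy as np
import scipy.stats as st
samples = np.random.rand(10000) * 15
a, c, loc, scale = st.exponweib.fit(samples)  # shape a, shape c, location, scale
u = np.random.rand(10000)
explicit = st.exponweib.ppf(u, a, c, loc=loc, scale=scale)  # same as ppf(u, *dist)
direct = st.exponweib.rvs(a, c, loc=loc, scale=scale, size=10000)  # let scipy draw the samples for you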

Obtaining the percentile from a distribution

How can I obtain the percentiles (for example the mean, or the 10% and 90% percentile) of a distribution received from some program or experiments? In the sample below I generate a normal distribution just for illustration.
import numpy as np
from scipy.stats import norm
x = np.linspace(1, 10, 1001)
count = norm.pdf(x, 5, 1)
This will be a Gaussian curve (for this particular illustration) if plotted as plt.plot(x, count). Note that this is not the data points but the distribution itself (similar to what you could obtain with, e.g., count, x, _ = plt.hist(data)), so I can't use p10 = np.percentile(count, 10)
but I would want something similar, such as
p10 = module.percentile(x,dist,10)
Does any of you know of such a module, or do you know of some other means of obtaining the percentile?
I am not sure if this is what you are looking for, but scipy.stats distributions have a ppf method that computes their percentiles (quantiles). For example, to get the 30th percentile of the normal distribution with mean 5 and standard deviation 1 you can use:
from scipy.stats import norm
norm.ppf(0.3, loc=5, scale=1)
This gives:
4.475599487291959
Then, you can select elements of an array x which are in this percentile:
x[x < norm.ppf(0.3, loc=5, scale=1)]
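If you only have the tabulated curve x, count (rather than a scipy distribution object with known parameters), one approximate approach is to integrate the curve numerically into a cdf and invert it by interpolation. A rough sketch, assuming the curve is sampled on a fine, evenly spaced grid:
import numpy as np
from scipy.stats import norm
x = np.linspace(1, 10, 1001)
count = norm.pdf(x, 5, 1)
# Accumulate the pdf into an (unnormalized) cdf, then normalize it
cdf = np.cumsum(count)
cdf /= cdf[-1]
def percentile_from_curve(x, cdf, q):
    """Approximate value below which q percent of the distribution lies."""
    return np.interp(q / 100.0, cdf, x)
print(percentile_from_curve(x, cdf, 10))  # close to norm.ppf(0.1, loc=5, scale=1)
print(percentile_from_curve(x, cdf, 90))  # close to norm.ppf(0.9, loc=5, scale=1)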

Mean of normal distribution generated using numpy.random.randn() is not '0'

I am trying to follow this tutorial from quantopian where they are trying to show that samples progressively exhibit the characteristics of a normal distribution as the sample size increases.
I tried to generate a normal distribution using the numpy.random.randn() method as shown in the tutorial.
I understand that this method returns a sample of the standard normal distribution and that for a normal distribution, mean = 0 and standard deviation = 1
But, when I check the mean and standard deviation of this distribution, they show weird values, i.e. mean = 0.23 and standard deviation = 0.49.
CODE:
import numpy as np
import matplotlib.pyplot as plt
#np.random.seed(123)
normal = np.random.randn(6)
print (normal.mean())
print (normal.std())
RESULT:
0.231567632423
0.488577812058
I am guessing this could be because I am looking at just a sample and not the whole distribution and it is not perfectly normal. But if that is the case:
What characteristics should I expect from this sample?
Isn't the tutorial's suggestion wrong, since it will never be a normal distribution?
You have a sample size of 6. That is not large enough to closely approximate the normal distribution. Try it with 600 or 6000 to get a good representation of the distribution.
import numpy as np
x = np.random.randn(600)
x.mean(), x.std()
# returns:
(-0.07760043571247623, 0.9664411074909558)
x = np.random.randn(6000)
x.mean(), x.std()
# returns:
(0.003908119246211815, 1.0001989021750033)
The average roll of a 6-sided die should be 3.5. However, if you only roll it 6 times, it is unlikely you will have an average of 3.5.
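To see the convergence directly, here is a small sketch that prints the sample mean and standard deviation for increasing sample sizes (the exact numbers vary from run to run):
import numpy as np
for n in (6, 60, 600, 6000, 60000):
    sample = np.random.randn(n)
    print(n, sample.mean(), sample.std())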

numpy.random.normal different distribution: selecting values from distribution

I have a power-law distribution of energies and I want to pick n random energies based on the distribution. I tried doing this manually using random numbers, but it is too inefficient for what I want to do. I'm wondering whether there is a method in numpy (or elsewhere) that works like numpy.random.normal, except that instead of using a normal distribution, the distribution may be specified. So in my mind an example might look like this (similar to numpy.random.normal):
import numpy as np
# Energies from within which I want values drawn
eMin = 50.
eMax = 2500.
# Amount of energies to be drawn
n = 10000
photons = []
for i in range(n):
    # Method that I just made up which would work like random.normal,
    # i.e. return an energy on the distribution based on its probability,
    # but take a distribution other than a normal distribution
    photons.append(np.random.distro(eMin, eMax, lambda e: e**(-1.)))
print(photons)
Printing photons should give me a list of length 10000 populated by energies in this distribution. If I were to histogram this it would have much greater bin values at lower energies.
I am not sure if such a method exists but it seems like it should. I hope it is clear what I want to do.
EDIT:
I have seen numpy.random.power but my exponent is -1 so I don't think this will work.
Sampling from arbitrary PDFs well is actually quite hard. There are large and dense books just about how to efficiently and accurately sample from the standard families of distributions.
It looks like you could probably get by with a custom inversion method for the example that you gave.
If you want to sample from an arbitrary distribution you need the inverse of the cumulative distribution function (not the pdf).
You then sample a probability uniformly from range [0,1] and feed this into the inverse of the cdf to get the corresponding value.
It is often not possible to obtain the cdf from the pdf analytically.
However, if you're happy to approximate the distribution, you could do so by calculating f(x) at regular intervals over its domain, then doing a cumsum over this vector to get an approximation of the cdf and from this approximate the inverse.
Rough code snippet:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate
def f(x):
    """
    Substitute this function with your arbitrary distribution;
    it must be positive over the domain.
    """
    return 1 / float(x)

# You should vary inputVals to cover the domain of f (for better accuracy you
# can be clever about the spacing of values as well). Here they are spaced
# logarithmically up to 1 and then at regular intervals, but you could
# definitely do better.
inputVals = np.hstack([10.**np.arange(-6, 0, 0.01), np.arange(1, 10000)])

# Everything else should just work.
funcVals = np.array([f(x) for x in inputVals])
cdf = np.zeros(len(funcVals))
diff = np.diff(inputVals)  # spacing between the grid points
for i in range(1, len(funcVals)):
    cdf[i] = cdf[i - 1] + funcVals[i - 1] * diff[i - 1]
cdf /= cdf[-1]

# You could also improve the approximation by choosing an appropriate interpolator.
inverseCdf = scipy.interpolate.interp1d(cdf, inputVals)

# Grab samples from the distribution.
samples = [inverseCdf(u) for u in np.random.uniform(0, 1, size=100000)]
plt.hist(samples, bins=500)
plt.show()
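For the specific power law in the question (exponent -1 between eMin and eMax), the inverse cdf even has a closed form, so no numerical approximation is needed. A minimal sketch, reusing the names eMin, eMax and n from the question:
import numpy as np
import matplotlib.pyplot as plt
eMin, eMax = 50., 2500.
n = 10000
# For a pdf proportional to 1/e on [eMin, eMax], the cdf is
# ln(e / eMin) / ln(eMax / eMin), so its inverse is eMin * (eMax / eMin)**u.
u = np.random.uniform(0, 1, size=n)
photons = eMin * (eMax / eMin) ** u
plt.hist(photons, bins=100)
plt.show()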
Why don't you use eval and put the distribution in a string?
>>> import numpy
>>> cmd = "numpy.random.normal(500)"
>>> eval(cmd)
You can manipulate the string as you wish to set the distribution.

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
param = stats.expon.fit(x)
grid = np.linspace(0, 100, 10000)
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))
Calculating the Kolmogorov-Smirnov test is very elegant:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal probability bins:
Estimate the parameters of the distribution.
Use the inverse cdf (ppf, if it's a scipy.stats distribution) to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
Then, use np.histogram to count the number of observations in each bin.
Finally, use the chisquare test on the frequencies (a sketch follows below).
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of bin edges based on the data affects the asymptotic distribution.
I haven't looked into this in a long time.
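A rough sketch of the equal-probability-bin recipe above, applied to the exponential example from the question (the number of bins is an arbitrary choice, and ddof=2 only approximately accounts for the two estimated parameters):
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
# 1. Estimate the parameters of the distribution
params = stats.expon.fit(x)
# 2. Bin edges for a regular probability grid via the inverse cdf (ppf)
n_bins = 20
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)
edges[-1] = x.max()  # ppf(1) is inf; cap the last edge at the data maximum
# 3. Count the observations in each bin
observed, _ = np.histogram(x, bins=edges)
# 4. Expected counts are equal by construction; run chisquare on the frequencies
expected = np.full(n_bins, size / n_bins)
print(stats.chisquare(observed, expected, ddof=2))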
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential, and the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
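For reference, one way to draw a qq-plot against the fitted exponential in scipy is stats.probplot; a minimal sketch, reusing x and the fitted param from the question:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
param = stats.expon.fit(x)  # (loc, scale)
# Sample quantiles of x against quantiles of the fitted exponential
stats.probplot(x, sparams=param, dist=stats.expon, plot=plt)
plt.show()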
I tried your problem with OpenTURNS.
The beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the factory needs an ot.Sample() as input, I needed to format x and reshape it as 10,000 points of dimension 1.
Let's now assess this fit using the ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)
