What significance test is used for spearmanr in SciPy?

What significance test is used for spearmanr in SciPy? - python

What type of significance test is used in scipy.stats.spearmanr to produce the p-value it spits out? The documentation simply says that its a two-sided p-value, but with respect to what distribution? Is it a t-distribution?

According to the documentation,
the p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
When you look into the source code, you can see that they calculate a t-value:
% rs is rho
t = rs * np.sqrt((n-2) / ((rs+1.0)*(1.0-rs)))
and then calculate the p value assuming a t-distribution with two degrees of freedom:
prob = distributions.t.sf(np.abs(t),n-2)*2
This is also explained on Wikipedia as one option for calculating statistical significance.

Related

Calculate KL Divergence between two gamma distribution list

I have two list. Both include normalized percent:
actual_population_distribution = [0.2,0.3,0.3,0.2]
sample_population_distribution = [0.1,0.4,0.2,0.3]
I wish to fit these two list in to gamma distribution and then calculate the returned two list in order to get the KL value.
I have already able to get KL.
This is the function I used to calculate gamma:
def gamma_random_sample(data_list):
mean = np.mean(data_list)
var = np.var(data_list)
g_alpha = mean * mean / var
g_beta = mean / var
for i in range(len(data_list)):
yield random.gammavariate(g_alpha, 1/g_beta)
Fit two lists into gamma distribution:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
This is the code I used to calculate KL:
kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))
The code above does not produce any errors.
But I suspect the way I did for gamma is wrong because of np.mean/var to get mean and variance.
Indeed, the number is different to:
mean, var, skew, kurt = gamma.stats(fit_alpha, loc = fit_loc, scale = fit_beta, moments = 'mvsk')
if I use this way.
By using "mean, var, skew, kurt = gamma.stats(fit_alpha, loc = fit_loc, scale = fit_beta, moments = 'mvsk')", I will get a KL value way larger than 1 so both two ways are invalid for getting a correct KL.
What do I miss?

See this stack overflow post: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std
I don't understand what you are trying to do with:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
It doesn't look like you are fitting to a gamma distribution, it looks like you are using the Method of Moment estimator to get the parameters of the gamma distribution and then you are drawing a single random number for each element of your actual(sample)_population_distribution lists given the distribution statistics of the list.
The gamma distribution is notoriously hard to fit. I hope your actual data has a longer list -- 4 data points are hardly sufficient for estimating a two parameter distribution. The estimates are kind of garbage until you get hundreds of elements or more, take a look at this document on the MLE estimator for the fisher information of a gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf .
I don't know what you are trying to do with the kl divergence either. Your actual population is already normalized to 1 and so is the sample distribution. You can plug in those elements directly into the KL divergence for a discrete score -- what you are doing with your code is a stretching and addition of gamma noise to your original list values with your defined gamma function. You are more likely to have a larger deviation with the KL divergence after the gamma corruption of your original population data.
I'm sorry, I just don't see what you are trying to accomplish here. If I were to guess your original intent, I'd say your problem is that you need hundreds of data points to guarantee convergence with any gamma fitting program.
EDIT: I just wanted to add that with regards to the KL divergence. If you intend to score your fit gamma distributions with the KL divergence, it's better to use an analytical solution where the scale and shape parameters of your two gamma distributions are your two inputs. Randomly sampling noisy data points won't be helpful unless you take 100,000 random samples and histogram them into 1,000 bins or so and then normalize your histogram -- I'm just throwing those numbers out, but you are going to want to approximate a continuous distribution as best as you can and it will be hard because the gamma distributions have long tails. This document has the analytical solution for a generalized distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set that third parameter to 1 and simplify and then code up a function.

Estimate the needed sample size for a Chi Squared test

I want to estimate the needed sample size to compute a Chi Squared (Test for homogenity) test for discrete data using Python and need a hint how to do it.
In general I want to estimate if the failure rates of two production processes differ significantly (alpha = 5%) or not.
I have only found the statsmodels.stats.gof.chisquare_effectsize() function but this seems to work only for a goodness of fit test.
Is there any way how I can determine the needed sample size?
I appreciate every answer.

You can use statsmodels.stats.GofChisquarePower().solve_power()
However, you need to adjust the degrees of freedom (df) to account for the number of variables. You can accomplish this with the n_bins parameter.
>>>import statsmodels.stats.power as smp
>>>n_levels_variable_a = 2
>>>n_levels_variable_b = 3
>>>smp.GofChisquarePower().solve_power(0.346, power=.8, n_bins=(n_levels_variable_a-1)*(n_levels_variable_b-1), alpha=0.05)
115.94688728433769

How to interpret the upper/lower bound of a datapoint with confidence intervals?

Given a list of values:
>>> from scipy import stats
>>> import numpy as np
>>> x = list(range(100))
Using student t-test, I can find the confidence interval of the distribution at the mean with an alpha of 0.1 (i.e. at 90% confidence) with:
def confidence_interval(alist, v, itv):
return stats.t.interval(itv, df=len(alist)-1, loc=v, scale=stats.sem(alist))
x = list(range(100))
confidence_interval(x, np.mean(x), 0.1)
[out]:
(49.134501289005009, 49.865498710994991)
But if I were to find the confidence interval at every datapoint, e.g. for the value 10:
>>> confidence_interval(x, 10, 0.1)
(9.6345012890050086, 10.365498710994991)
How should the interval of the values be interpreted? Is it statistically/mathematical sound to interpret that at all?
Does it goes something like:
At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991),
aka.
At 90% confidence, we can say that the data point falls at 10 +- 0.365...
So can we interpret the interval as some sort of a box plot of the datapoint?

In short
Your call gives the interval of confidence for the mean parameter of a normal law of unknown parameters of which you observed 100 observations with an average of 10 and a stdv of 29. It is furthermore not sound to interpret it, since your distribution is clearly not normal, and because 10 is not the observed mean.
TL;DR
There are a lot misconceptions floating around confidence intervals, most of which seemingly stems from a misunderstanding of what we are confident about. Since there is some confusion in your understanding of confidence interval maybe a broader explanation will give a deeper understanding of the concepts you are handling, and hopefully definitely rule out any source of error.
Clearing out misconceptions
Very briefly to set things up. We are in a situation where we want to estimate a parameter, or rather, we want to test a hypothesis for the value of a parameter parameterizing the distribution of a random variable. e.g: Let's say I have a normally distributed variable X with mean m and standard deviation sigma, and I want to test the hypothesis m=0.
What is a parametric test
This a process for testing a hypothesis on a parameter for a random variable. Since we only have access to observations which are concrete realizations of the random variable, it generally procedes by computing a statistic of these realizations. A statistic is roughly a function of the realizations of a random variable. Let's call this function S, we can compute S on x_1,...,x_n which are as many realizations of X.
Therefore you understand that S(X) is a random variable as well with distribution, parameters and so on! The idea is that for standard tests, S(X) follows a very well known distribution for which values are tabulated. e.g: http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf
What is a confidence interval?
Given what we've just said, a definition for a confidence interval would be: the range of values for the tested parameter, such that if the observations were to have been generated from a distribution parametrized by a value in that range, it would not have probabilistically improbable.
In other words, a confidence interval gives an answer to the question: given the following observations x_1,...,x_n n realizations of X, can we confidently say that X's distribution is parametrized by such value. 90%, 95%, etc... asserts the level of confidence. Usually, external constraints fix this level (industrial norms for quality assessment, scientific norms e.g: for the discovery of new particles).
I think it is now intuitive to you that:
The higher the confidence level, the larger the confidence interval. e.g. for a confidence of 100% the confidence interval would range across all the possible values as soon as there is some uncertainty
For most tests, under conditions I won't describe, the more observations we have, the more we can restrain the confidence interval.
At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991)
It is wrong to say that and it is the most common source of mistakes. A 90% confidence interval never means that the estimated parameter has 90% percent chance of falling into that interval. When the interval is computed, it covers the parameter or it does not, it is not a matter of probability anymore. 90% is an assessment of the reliability of the estimation procedure.
What is a student test?
Now let's come to your example and look at it under the lights of what we've just said. You to apply a Student test to your list of observations.
First: a Student test aims at testing a hypothesis of equality between the mean m of a normally distributed random variable with unknown standard deviation, and a certain value m_0.
The statistic associated with this test is t = (np.mean(x) - m_0)/(s/sqrt(n)) where x is your vector of observations, n the number of observations and s the empirical standard deviation. With no surprise, this follows a Student distribution.
Hence, what you want to do is:
compute this statistic for your sample, compute the confidence interval associated with a Student distribution with this many degrees of liberty, this theoretical mean, and confidence level
see if your computed t falls into that interval, which tells you if you can rule out the equality hypothesis with such level of confidence.
I wanted to give you an exercise but I think I've been lengthy enough.
To conclude on the use of scipy.stats.t.interval. You can use it one of two ways. Either computing yourself the t statistic with the formula shown above and check if t fits in the interval returned by interval(alpha, df) where df is the length of your sampling. Or you can directly call interval(alpha, df, loc=m, scale=s) where m is your empirical mean, and s the empirical standard deviatation (divided by sqrt(n)). In such case, the returned interval will directly be the confidence interval for the mean.
So in your case your call gives the interval of confidence for the mean parameter of a normal law of unknown parameters of which you observed 100 observations with an average of 10 and a stdv of 29. It is furthermore not sound to interpret it, beside the error of interpretation I've already pointed out, since your distribution is clearly not normal, and because 10 is not the observed mean.
Resources
You can check out the following resources to go further.
wikipedia links to have quick references and an elborated overview
https://en.wikipedia.org/wiki/Confidence_interval
https://en.wikipedia.org/wiki/Student%27s_t-test
https://en.wikipedia.org/wiki/Student%27s_t-distribution
To go further
http://osp.mans.edu.eg/tmahdy/papers_of_month/0706_statistical.pdf
I haven't read it but the one below seems quite good.
https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/162/Handouts/StatsTests04.pdf
You should also check out p-values, you will find a lot of similarities and hopefully you understand them better after reading this post.
https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation

Confidence intervals are hopelessly counter-intuitive. Especially for programmers, I dare say as a programmer.
Wikipedida uses a 90% confidence to illustrate a possible interpretation:
Were this procedure to be repeated on numerous samples, the fraction of calculated confidence intervals (which would differ for each sample) that encompass the true population parameter would tend toward 90%.
In other words
The confidence interval provides information about a statistical parameter (such as the mean) of a sample.
The interpretation of e.g. a 90% confidence interval would be: If you repeat the experiment an infinite number of times 90% of the resulting confidence intervals will contain the true parameter.
Assuming the code to compute the interval is correct (which I have not checked) you can use it to calculate the confidence interval of the mean (because of the t-distribution, which models the sample mean of a normally distributed population with unknown standard deviation).
For practical purposes it makes sense to pass in the sample mean. Otherwise you are saying "if I pretended my data had a sample mean of e.g. 10, the confidence interval of the mean would be [9.6, 10.3]".
The particular data passed into the confidence interval does not make sense either. Numbers increasing in a range from 0 to 99 are very unlikely to be drawn from a normal distribution.

What is the difference between numpy var() and statistics variance() in python?

I was trying one Dataquest exercise and I figured out that the variance I am getting is different for the two packages.
e.g for [1,2,3,4]
from statistics import variance
import numpy as np
print(np.var([1,2,3,4]))
print(variance([1,2,3,4]))
//1.25
//1.6666666666666667
The expected answer of the exercise is calculated with np.var()
Edit
I guess it has to do that the later one is sample variance and not variance. Anyone could explain the difference?

Use this
print(np.var([1,2,3,4],ddof=1))
1.66666666667
Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.
In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Statistical libraries like numpy use the variance n for what they call var or variance and the standard deviation
For more information refer this documentation : numpy doc

It is correct that dividing by N-1 gives an unbiased estimate for the mean, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the minimum variance estimate for the mean, which is likely to be closer to the true mean than the unbiased estimate, as well as being somewhat simpler.

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
param = stats.expon.fit(x)
plt.hist(x, normed=True, color='white', hatch='/')
plt.plot(grid, distr.pdf(np.linspace(0, 100, 10000), *param))
It is very elegant to calculate the Kolmogorov-Smirnov test.
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodel, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.

An approximate solution for equal probability bins:
Estimate the parameters of the distribution
Use the inverse cdf, ppf if it's a scipy.stats.distribution, to get the binedges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args)
Then, use np.histogram to count the number of observations in each bin
then use chisquare test on the frequencies.
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.
I haven't looked into this into a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.

Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that is isn't ultimately exponential & the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy

I tried you problem with OpenTURNS.
Beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As Factory needs a an ot.Sample() as input, I needed format x and reshape it as 10.000 points of dimension 1.
Let's now assess this fitting using ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.