I implemented the KS test to check which distributions fit my data best. At the moment I pass the CDFs as input, because the standard KS test computes the maximum difference between the CDFs. Is this the right way to do it, or should I use the PDFs as input? The test statistics and p-values look good to me. With the critical value of the KS test I can choose which null hypotheses I should not reject.
Code example
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# h4 (raw sample), lnspc (x-axis grid), and n4 (histogram counts) are defined elsewhere
gammafit = stats.gamma.fit(h4)
pdf_gamma = stats.gamma.pdf(lnspc, *gammafit)
cdf_gamma = stats.gamma.cdf(lnspc, *gammafit)
plt.plot(lnspc, pdf_gamma, label="Gamma")
gamma_kstest999 = stats.ks_2samp(np.cumsum(n4), cdf_gamma)
You should pass the raw samples (what you call the PDFs), not CDFs: ks_2samp takes the two data samples as input and constructs the empirical CDFs inside the code. According to the function source code:
data1 = np.sort(data1)
data2 = np.sort(data2)
n1 = data1.shape[0]
n2 = data2.shape[0]
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0*n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0*n2)
d = np.max(np.absolute(cdf1 - cdf2))
# Note: d absolute not signed distance
en = np.sqrt(n1 * n2 / float(n1 + n2))
try:
    prob = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
except:
    prob = 1.0
return Ks_2sampResult(d, prob)
The cdf1 and cdf2 variables hold the empirical cumulative distributions built from the two samples.
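For reference, here is a minimal sketch of how the inputs are usually supplied (synthetic data standing in for h4): the one-sample kstest gets the raw sample plus the fitted distribution's name and parameters, and ks_2samp gets two raw samples; no cumsum or precomputed CDF arrays.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=500)   # stand-in for h4

# One-sample KS test: raw data against the fitted gamma CDF
# caveat: the parameters were fit on the same data, which makes this p-value optimistic
params = stats.gamma.fit(sample)
stat, p = stats.kstest(sample, 'gamma', args=params)

# Two-sample KS test: two raw samples, no CDFs or cumulative sums
sample2 = rng.gamma(shape=2.0, scale=3.0, size=500)
stat2, p2 = stats.ks_2samp(sample, sample2)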
Related
I wanted to calculate the 1, 2, and 3 sigma error of a distribution using Python, as described on the 68–95–99.7 rule Wikipedia page. So far I have written the following code. Is this the correct way to compute such KPIs? Thanks.
import numpy as np
# sensor and reference value
temperature_measured = np.random.rand(1000) # value from a sensor under test
temperature_reference = np.random.rand(1000) # value from the best sensor on the market
# error computation
error = temperature_reference - temperature_measured
error_sigma = np.std(error)
error_mean = np.mean(error)
# KPI computation
expected_sigma = 1 # 1 degree deviation is allowed (Customer requirement)
kpi_1_sigma = (abs(error - error_mean) < 1*expected_sigma).mean()*100.0 >= 68.27
kpi_2_sigma = (abs(error - error_mean) < 2*expected_sigma).mean()*100.0 >= 95.45
kpi_3_sigma = (abs(error - error_mean) < 3*expected_sigma).mean()*100.0 >= 99.73
I would recommend using the definition you found on Wikipedia and just calculating the percentiles, i.e., computing the difference
((mu + sigma) - (mu - sigma)) / 2:
# half-width of the central 68.2% interval (mu ± 1 sigma)
sigma1 = (np.percentile(error, 50 + 34.1, axis=0) - np.percentile(error, 50 - 34.1, axis=0)) / 2.
# half-width of the central 95.4% interval (mu ± 2 sigma)
sigma2 = (np.percentile(error, 50 + 34.1 + 13.6, axis=0) - np.percentile(error, 50 - 34.1 - 13.6, axis=0)) / 2.
# half-width of the central ~99.7% interval (mu ± 3 sigma)
sigma3 = (np.percentile(error, 50 + 34.1 + 13.6 + 2.1, axis=0) - np.percentile(error, 50 - 34.1 - 13.6 - 2.1, axis=0)) / 2.
An easier way could be like so (taken from here):
NumPy's std yields the standard deviation, which is usually denoted by "sigma". To get the 2-sigma or 3-sigma range, you can simply multiply sigma by 2 or 3:
print([x.mean() - 3 * x.std(), x.mean() + 3 * x.std()])  # x: a 1-D data array
result:
[-27.545797458510656, 52.315028227741429]
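As a quick sanity check with synthetic data (not from the question), the percentile-based estimate and np.std should agree closely for roughly normal errors:
import numpy as np

rng = np.random.default_rng(42)
error = rng.normal(loc=0.0, scale=2.0, size=100_000)  # made-up errors

# half-width of the central 68.2% interval vs. the sample standard deviation
sigma1 = (np.percentile(error, 50 + 34.1) - np.percentile(error, 50 - 34.1)) / 2
print(sigma1, error.std())  # both should be close to 2.0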
I am using the following code to find the sample size for given means and SDs in two groups:
import numpy as np
from statsmodels.stats.power import TTestIndPower
power = 0.8
alpha = 0.05
mean1 = 10; sd1 = 7
mean2 = 17; sd2 = 7
diffmean = mean2-mean1
pooled_sd = np.sqrt(sd1**2 + sd2**2)
es = diffmean / pooled_sd
res = TTestIndPower().solve_power(effect_size=es,
power=power, alpha=alpha, nobs1=None)
print(res)
Output is:
32.38441093768741
The statsmodels help page says that the sample size returned (nobs1) is for one (the first) group.
However, I checked the sample size with the same means and SD values in the OpenEpi online sample size calculator, and the result there clearly states that 32 is the sample size for both groups combined, not for one group.
I also tested the same problem with the pwr package in R, as detailed on this page:
> pwr.t.test(n = , d = (17-10)/sqrt(7^2+7^2),
sig.level = 0.05, power = 0.8, type = "two.sample")
where d is effect size: (difference in means)/ pooled_sd
The output was:
Two-sample t test power calculation
n = 32.38441
d = 0.7071068
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
It clearly states that 32 is the number for each group (hence the total sample size needed is 64).
Where is the problem? Which of these sites is correct?
statsmodels and pwr use Cohen's d as a measure of effect size. This is difference in mean divided by some estimate of the population standard deviation.
Cohen's d = (M2 - M1) / SD_pooled
SD_pooled = sqrt((SD1**2 + SD2**2) / 2)
see for example https://www.socscistatistics.com/effectsize/default3.aspx
The estimate of the population variance is a weighted average of the individual variances, and the weights need to add up to 1. In the question's example, a division by 2 is missing from the pooled SD computation, so the effect size is understated (0.71 instead of 1.0) and the required sample size is overstated.
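A minimal sketch of the corrected computation with the question's numbers (the printed value is what statsmodels should return under these assumptions):
import numpy as np
from statsmodels.stats.power import TTestIndPower

mean1, sd1 = 10, 7
mean2, sd2 = 17, 7

# Cohen's d with the correctly pooled SD (note the division by 2)
pooled_sd = np.sqrt((sd1**2 + sd2**2) / 2)   # = 7.0
es = (mean2 - mean1) / pooled_sd             # = 1.0

n_per_group = TTestIndPower().solve_power(effect_size=es, power=0.8,
                                          alpha=0.05, nobs1=None)
print(n_per_group)   # ~ 16.7 observations per group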
I am using the SciPy implementation of the kernel density estimate (KDE) (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html), which is working fine so far. However, I would now like to obtain the gradient of the KDE at a particular point.
I have looked at the Python source for the library, but haven't been able to figure out how to easily implement this functionality. Is anybody aware of a method to do this?
If you look at the source you referenced, you'll see that the density estimate is constructed from contributions from all points in the dataset. Assume for the moment that there's only one point points[:, i] you want to evaluate (lines 219–222):
diff = self.dataset - points[:, i, newaxis]
tdiff = dot(self.inv_cov, diff)
energy = sum(diff * tdiff, axis=0) / 2.0
result[i] = sum(exp(-energy), axis=0)
In matrix notation (no LaTeX available?) this would be written, for a single point D from the dataset and a point p to be evaluated, as
d = D - p
t = Cov^-1 d
e = 1/2 d^T t
r = exp(-e)
The gradient you're looking for is grad(r) = (dr/dx, dr/dy):
dr/dx = d(exp(-e))/dx
= -de/dx exp(-e)
= -d(1/2 d^T Cov^-1 d)/dx exp(-e)
= -(Cov^-1 d) exp(-e)
Likewise for dr/dy. Hence all you need to do is calculate the term Cov^-1 d and multiply it with the result you already obtained.
result = zeros((self.d,m), dtype=float)
[...]
diff = self.dataset - points[:, i, newaxis]
tdiff = dot(self.inv_cov, diff)
energy = sum(diff * tdiff, axis=0) / 2.0
grad = dot(self.inv_cov, diff)
result[:,i] = sum(grad * exp(-energy), axis=1)
For some reason I needed to drop the -1 when calculating grad to obtain results consistent with evaluating the density estimate at p and p+delta in all four directions. This is in fact expected: the gradient is taken with respect to p, and since d = D - p, the inner derivative contributes another factor of -1 that cancels the minus sign.
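For completeness, here is a self-contained sketch of the same calculation that avoids patching SciPy; it assumes the default uniform sample weights and relies on gaussian_kde's dataset, covariance, and inv_cov attributes. The finite-difference check at the end confirms the sign:
import numpy as np
from scipy.stats import gaussian_kde

def kde_gradient(kde, p):
    """Gradient of a Gaussian KDE at a single point p of shape (d,)."""
    d, n = kde.dataset.shape
    diff = kde.dataset - p[:, np.newaxis]   # D - p for every sample, shape (d, n)
    tdiff = kde.inv_cov @ diff              # Cov^-1 (D - p)
    energy = np.sum(diff * tdiff, axis=0) / 2.0
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(kde.covariance))
    # grad f(p) = (1 / (n * norm)) * sum_i Cov^-1 (D_i - p) * exp(-energy_i)
    return (tdiff @ np.exp(-energy)) / (n * norm)

rng = np.random.default_rng(0)
kde = gaussian_kde(rng.standard_normal((2, 500)))
p = np.array([0.3, -0.5])

# finite-difference check in both coordinate directions
eps = 1e-6
fd = np.array([(kde(p + eps * e) - kde(p - eps * e))[0] / (2 * eps)
               for e in np.eye(2)])
print(kde_gradient(kde, p), fd)   # the two should agree closely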
I have done a number of calculations to estimate μ, σ, and N for my two samples. Due to a number of approximations I don't have the arrays that are expected as input to scipy.stats.ttest_ind. Unless I am mistaken, I only need μ, σ, and N to do a Welch's t-test. Is there a way to do this in Python?
Here’s a straightforward implementation based on this:
import scipy.stats as stats
import numpy as np
def welch_t_test(mu1, s1, N1, mu2, s2, N2):
    # Construct arrays to make calculations more succinct.
    N_i = np.array([N1, N2])
    dof_i = N_i - 1
    v_i = np.array([s1, s2]) ** 2
    # Calculate t-statistic and degrees of freedom; use scipy for the p-value.
    t = (mu1 - mu2) / np.sqrt(np.sum(v_i / N_i))
    dof = (np.sum(v_i / N_i) ** 2) / np.sum((v_i ** 2) / ((N_i ** 2) * dof_i))
    p = stats.distributions.t.sf(np.abs(t), dof) * 2
    return t, p
It yields virtually identical results:
sample1 = np.random.rand(10)
sample2 = np.random.rand(15)
result_test = welch_t_test(np.mean(sample1), np.std(sample1, ddof=1), sample1.size,
np.mean(sample2), np.std(sample2, ddof=1), sample2.size)
result_scipy = stats.ttest_ind(sample1, sample2, equal_var=False)
np.allclose(result_test, result_scipy)
True
As an update: the function is now available in scipy.stats since version 0.16.0:
http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.ttest_ind_from_stats.html
scipy.stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=True)
T-test for means of two independent samples from descriptive statistics.
This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.
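Pass equal_var=False to get the Welch variant; a short sketch, reusing the summary numbers from the sample-size question above with made-up sample sizes:
from scipy.stats import ttest_ind_from_stats

# Welch's t-test from summary statistics alone (equal_var=False)
t, p = ttest_ind_from_stats(mean1=10.0, std1=7.0, nobs1=30,
                            mean2=17.0, std2=7.0, nobs2=30,
                            equal_var=False)
print(t, p)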
I have written t-test and z-test functions that take the summary statistics for statsmodels.
Those were intended mainly as internal shortcuts to avoid code duplication, and are not well documented.
For example http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats._tstat_generic.html
The list of related functions is here:
http://statsmodels.sourceforge.net/devel/stats.html#basic-statistics-and-t-tests-with-frequency-weights
edit: in reply to comment
The function just does the core calculation, the actual calculation of the standard deviation of the difference under different assumptions is added to the calling method.
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py#L713
edit
Here is an example of how to use the methods of the CompareMeans class, which includes the t-test based on summary statistics. We need to create a class that holds the relevant summary statistics as attributes. At the end there is a function that just wraps the relevant calls.
"""
Created on Wed Jul 23 05:47:34 2014
Author: Josef Perktold
License: BSD-3
"""
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import CompareMeans, ttest_ind
class SummaryStats(object):
    def __init__(self, nobs, mean, std):
        self.nobs = nobs
        self.mean = mean
        self.std = std
        self._var = std**2

np.random.seed(123)
nobs = 20
x1 = 1 + np.random.randn(nobs)
x2 = 1 + 1.5 * np.random.randn(nobs)

print(stats.ttest_ind(x1, x2, equal_var=False))
print(ttest_ind(x1, x2, usevar='unequal'))

s1 = SummaryStats(x1.shape[0], x1.mean(0), x1.std(0))
s2 = SummaryStats(x2.shape[0], x2.mean(0), x2.std(0))

print(CompareMeans(s1, s2).ttest_ind(usevar='unequal'))

def ttest_ind_summ(summ1, summ2, usevar='unequal'):
    """t-test for equality of means based on summary statistics

    Parameters
    ----------
    summ1, summ2 : tuples of (nobs, mean, std)
        summary statistics for the two samples
    """
    s1 = SummaryStats(*summ1)
    s2 = SummaryStats(*summ2)
    return CompareMeans(s1, s2).ttest_ind(usevar=usevar)

print(ttest_ind_summ((x1.shape[0], x1.mean(0), x1.std(0)),
                     (x2.shape[0], x2.mean(0), x2.std(0)),
                     usevar='unequal'))
''' result
(array(1.1590347327654558), 0.25416326823881513)
(1.1590347327654555, 0.25416326823881513, 35.573591346616553)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
'''
In R there is a very useful function that helps with determining parameters for a two-sided test in order to obtain a target statistical power.
The function is called power.prop.test.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/power.prop.test.html
You can call it using:
power.prop.test(p1 = .50, p2 = .75, power = .90)
And it will tell you n, the sample size needed to obtain this power. This is extremely useful for determining sample sizes for tests.
Is there a similar function in the scipy package?
I've managed to replicate the function using the formula for n below and the inverse survival function norm.isf from scipy.stats:
from scipy.stats import norm

def sample_power_probtest(p1, p2, power=0.8, sig=0.05):
    z = norm.isf([sig / 2])        # two-sided test
    zp = -1 * norm.isf([power])
    d = p1 - p2
    s = 2 * ((p1 + p2) / 2) * (1 - ((p1 + p2) / 2))
    n = s * ((zp + z)**2) / (d**2)
    return int(round(n[0]))

def sample_power_difftest(d, s, power=0.8, sig=0.05):
    z = norm.isf([sig / 2])
    zp = -1 * norm.isf([power])
    n = s * ((zp + z)**2) / (d**2)
    return int(round(n[0]))

if __name__ == '__main__':
    n = sample_power_probtest(0.1, 0.11, power=0.8, sig=0.05)
    print(n)   # 14752

    n = sample_power_difftest(0.1, 0.5, power=0.8, sig=0.05)
    print(n)   # 392
Some of the basic power calculations are now available in statsmodels
http://statsmodels.sourceforge.net/devel/stats.html#power-and-sample-size-calculations
http://jpktd.blogspot.ca/2013/03/statistical-power-in-statsmodels.html
The blog article does not yet take the latest changes to the statsmodels code into account. Also, I haven't decided yet how many wrapper functions to provide, since many power calculations just reduce to the basic distribution.
>>> import statsmodels.stats.api as sms
>>> es = sms.proportion_effectsize(0.5, 0.75)
>>> sms.NormalIndPower().solve_power(es, power=0.9, alpha=0.05, ratio=1)
76.652940372066908
In R stats
> power.prop.test(p1 = .50, p2 = .75, power = .90)
Two-sample comparison of proportions power calculation
n = 76.7069301141077
p1 = 0.5
p2 = 0.75
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
using R's pwr package
> library(pwr)
> h<-ES.h(0.5,0.75)
> pwr.2p.test(h=h, power=0.9, sig.level=0.05)
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.5235987755982985
n = 76.6529406106181
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: same sample sizes
Matt's answer for getting the needed n (per group) is almost right, but there is a small error.
Given d (difference in means), s (standard deviation), sig (significance level, typically .05), and power (typically .80), the formula for calculating the number of observations per group is:
n = 2*s^2 * (z_(sig/2) + z_power)^2 / d^2
As you can see in his formula, he has
n = s * ((zp + z)**2) / (d**2)
the "s" part is wrong. a correct function that reproduces r's functionality is:
from scipy.stats import norm

def sample_power_difftest(d, s, power=0.8, sig=0.05):
    z = norm.isf([sig / 2])
    zp = -1 * norm.isf([power])
    n = (2 * (s**2)) * ((zp + z)**2) / (d**2)
    return int(round(n[0]))
Hope this helps.
You also have:
from statsmodels.stats.power import tt_ind_solve_power
and put "None" in the value you want to obtain. For instande, to obtain the number of observations in the case of effect_size = 0.1, power = 0.8 and so on, you should put:
tt_ind_solve_power(effect_size=0.1, nobs1 = None, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')
and obtain 1570.7330663315456 as the required number of observations (in the first group).
Or, to obtain the power attainable with the other values fixed:
tt_ind_solve_power(effect_size= 0.2, nobs1 = 200, alpha=0.05, power=None, ratio=1, alternative='two-sided')
and you obtain: 0.5140816347005553