Perform 2 sample t-test - python

I have the mean, std dev and n of sample 1 and sample 2 - the samples are taken from the same population, but measured by different labs.
n is different for sample 1 and sample 2. I want to do a weighted (take n into account) two-tailed t-test.
I tried using the scipy.stats module by creating my numbers with np.random.normal, since it only takes data and not stat values like mean and std dev (is there any way to use these values directly?). But it didn't work, since the data arrays have to be of equal size.
Any help on how to get the p-value would be highly appreciated.

If you have the original data as arrays a and b, you can use scipy.stats.ttest_ind with the argument equal_var=False:
t, p = ttest_ind(a, b, equal_var=False)
If you have only the summary statistics of the two data sets, you can calculate the t value using scipy.stats.ttest_ind_from_stats (added to scipy in version 0.16) or from the formula (http://en.wikipedia.org/wiki/Welch%27s_t_test).
The following script shows the possibilities.
from __future__ import print_function
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr
np.random.seed(1)
# Create sample data.
a = np.random.randn(40)
b = 4*np.random.randn(50)
# Use scipy.stats.ttest_ind.
t, p = ttest_ind(a, b, equal_var=False)
print("ttest_ind: t = %g p = %g" % (t, p))
# Compute the descriptive statistics of a and b.
abar = a.mean()
avar = a.var(ddof=1)
na = a.size
adof = na - 1
bbar = b.mean()
bvar = b.var(ddof=1)
nb = b.size
bdof = nb - 1
# Use scipy.stats.ttest_ind_from_stats.
t2, p2 = ttest_ind_from_stats(abar, np.sqrt(avar), na,
                              bbar, np.sqrt(bvar), nb,
                              equal_var=False)
print("ttest_ind_from_stats: t = %g p = %g" % (t2, p2))
# Use the formulas directly.
tf = (abar - bbar) / np.sqrt(avar/na + bvar/nb)
dof = (avar/na + bvar/nb)**2 / (avar**2/(na**2*adof) + bvar**2/(nb**2*bdof))
pf = 2*stdtr(dof, -np.abs(tf))
print("formula: t = %g p = %g" % (tf, pf))
The output:
ttest_ind: t = -1.5827 p = 0.118873
ttest_ind_from_stats: t = -1.5827 p = 0.118873
formula: t = -1.5827 p = 0.118873

As of Scipy 0.12.0, this functionality is built in (and it does in fact operate on samples of different sizes). In scipy.stats, the ttest_ind function performs Welch's t-test when the flag equal_var is set to False.
For example:
>>> import numpy as np
>>> import scipy.stats as stats
>>> sample1 = np.random.randn(10, 1)
>>> sample2 = 1 + np.random.randn(15, 1)
>>> t_stat, p_val = stats.ttest_ind(sample1, sample2, equal_var=False)
>>> t_stat
array([-3.94339083])
>>> p_val
array([ 0.00070813])

Related

3D Fourier transformation of a gaussian function in python

I'm trying to get the 3D Fourier transform of the Gaussian function e^(-r^2/2) in Python using the numpy.fft library.
I've attempted using different FFTs from the library with different inputs, shifting the results with np.fft.fftshift, trying to find a multiplicative factor, and many other things. The last thing I tried was using the 1D FFT function and then cubing the result; here's the corresponding source code:
import numpy as np
R = float(10)
N = float(100)
y= np.dtype(np.float64)
dr = R/N
def F(x):
    return np.exp(-((x*dr)**2)/2)

Frange = np.arange(1, int(N)+1)
y = np.zeros(int(N))
i = 0
while i < int(N):
    y[i] = F(Frange[i])
    i += 1
y = y/3
y_fft = np.fft.fftshift(np.abs(np.fft.fft(y)))**3
print(y_fft)
The first values I get:
4.62e-03, 4.63e-03, 4.65e-03, 4.69e-03, 4.74e-03
According to Lado, F. (1971), "Numerical Fourier transforms in one, two, and three dimensions for liquid state calculations", the analytic solution to the problem is: (2*pi)^(3/2) * e^(-k^2/2)
And the first values of the analytic solution with the same values of R and N are:
14.99, 12.92, 10.10, 7.15, 4.58
I also created a DFT program using a formula provided in the previous article which gives the expected results, but I haven't been able to replicate the analytic results in any of my attempts using the NumPy or SciPy fft libraries.
Here's my program for the analytic and DFT results:
import math
import numpy as np
def F(r):
    x = math.exp((-1/2)*(r**2))
    return x

def FT(r):
    x = ((2*math.pi)**(3/2))*(math.exp((-1/2)*(r**2)))
    return x
R = float(10)
N = int(100)
ft = np.zeros(N)
fta = np.zeros(N)
dr = R/N
dk = math.pi/R
print ("\tk \t\t\t Discrete \t\t\t Analytic")
for j in range(1, N):
    kj = j*dk
    # Discrete transform
    sum = 0
    for i in range(1, N):
        ri = i*dr
        sum = sum + (dr*ri*(F(ri))*(math.sin(kj*ri)))
    ft[j] = ((4*math.pi)/kj)*sum
    # Analytic transform
    fta[j] = FT(kj)
    # Print results
    print(kj, f" \t\t{ft[j]:.10E} \t\t{fta[j]:.10E}")
And these are the first few results:
k Discrete Analytic
0.3141592653589793 1.4991263193E+01 1.4991263193E+01
0.6283185307179586 1.2928362116E+01 1.2928362116E+01
0.9424777960769379 1.0101494686E+01 1.0101494686E+01
1.2566370614359172 7.1509645344E+00 7.1509645344E+00
1.5707963267948966 4.5864901093E+00 4.5864901093E+00
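For reference, the loop-based discrete transform above can also be written in a vectorized form with NumPy broadcasting. This is only a sketch of the same radial sine-transform formula (it does not use np.fft at all), assuming the same R, N, dr and dk as above:
import numpy as np

R = 10.0
N = 100
dr = R / N
dk = np.pi / R

r = np.arange(1, N) * dr                      # radial grid r_i = i*dr
k = np.arange(1, N) * dk                      # frequency grid k_j = j*dk
f = np.exp(-r**2 / 2)                         # F(r) = exp(-r^2/2)

# ft[j] = (4*pi/k_j) * sum_i dr * r_i * F(r_i) * sin(k_j * r_i)
ft = (4 * np.pi / k) * (np.sin(np.outer(k, r)) @ (dr * r * f))
fta = (2 * np.pi)**1.5 * np.exp(-k**2 / 2)    # analytic transform

print(ft[:5])    # reproduces the discrete column of the table above
print(fta[:5])   # reproduces the analytic column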

1D Wasserstein distance in Python

The formula below is a special case of the Wasserstein distance/optimal transport when the source and target distributions, x and y (also called marginal distributions), are 1D, that is, are vectors:
W_p(u, v) = ( ∫_0^1 |F_u^{-1}(z) - F_v^{-1}(z)|^p dz )^(1/p)
where the F^{-1} are inverse probability distribution functions (quantile functions) of the cumulative distributions of the marginals u and v, derived from real data called x and y, both generated from the normal distribution:
import numpy as np
from numpy.random import randn
import scipy.stats as ss
n = 100
x = randn(n)
y = randn(n)
How can the integral in the formula be coded in python and scipy? I'm guessing the x and y have to be converted to ranked marginals, which are non-negative and sum to 1, while Scipy's ppf could be used to calculate the inverse F^{-1}'s?
Note that when n gets large we have that a sorted set of n samples approaches the inverse CDF sampled at 1/n, 2/n, ..., n/n. E.g.:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.plot(norm.ppf(np.linspace(0, 1, 1000)), label="invcdf")
plt.plot(np.sort(np.random.normal(size=1000)), label="sortsample")
plt.legend()
plt.show()
Also note that your integral from 0 to 1 can be approximated as a sum over 1/n, 2/n, ..., n/n.
Thus we can simply answer your question:
def W(p, u, v):
    assert len(u) == len(v)
    return np.mean(np.abs(np.sort(u) - np.sort(v))**p)**(1/p)
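For p = 1 and equal sample sizes this should coincide (up to floating point) with scipy.stats.wasserstein_distance, which computes the 1D 1-Wasserstein distance between the two empirical distributions; a quick check:
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
u = rng.normal(size=500)
v = rng.normal(loc=1.0, size=500)

w1_sorted = np.mean(np.abs(np.sort(u) - np.sort(v)))  # W(1, u, v) from above
w1_scipy = wasserstein_distance(u, v)
print(w1_sorted, w1_scipy)  # the two numbers should be essentially identical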
Note that if len(u) != len(v) you can still apply the method with linear interpolation:
def W(p, u, v):
    u = np.sort(u)
    v = np.sort(v)
    if len(u) != len(v):
        if len(u) > len(v):
            u, v = v, u
        us = np.linspace(0, 1, len(u))
        vs = np.linspace(0, 1, len(v))
        # interpolate the shorter sample's quantile values onto the finer grid
        # (np.interp; there is no np.linalg.interp)
        u = np.interp(vs, us, u)
    return np.mean(np.abs(u - v)**p)**(1/p)
An alternative method, if you have prior information about the type of distribution of your data but not its parameters, is to find the best-fitting distribution for both u and v (e.g. with scipy.stats.norm.fit) and then do the integral with the desired precision. E.g.:
from scipy.stats import norm as gauss

def W_gauss(p, u, v, num_steps):
    ud = gauss(*gauss.fit(u))
    vd = gauss(*gauss.fit(v))
    # evaluate the quantile functions at the midpoints of num_steps equal bins on (0, 1)
    z = np.linspace(0, 1, num_steps, endpoint=False) + 1/(2*num_steps)
    return np.mean(np.abs(ud.ppf(z) - vd.ppf(z))**p)**(1/p)
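For example, one might compare the empirical and the fitted-Gaussian estimates on samples of different sizes (a usage sketch only, assuming the interpolating W and W_gauss definitions above; the two numbers will differ slightly because one estimate is nonparametric and the other assumes normality):
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(0.0, 1.0, size=200)
v = rng.normal(0.5, 1.2, size=300)

print(W(2, u, v))              # empirical estimate (interpolating version above)
print(W_gauss(2, u, v, 1000))  # parametric estimate from fitted normals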
I guess I am a bit late, but this is what I would do for an exact solution (using only numpy):
import numpy as np
from numpy.random import randn
n = 100
m = 80
p = 2
x = np.sort(randn(n))
y = np.sort(randn(m))
a = np.ones(n)/n
b = np.ones(m)/m
# cdfs
ca = np.cumsum(a)
cb = np.cumsum(b)
# points on which we need to evaluate the quantile functions
cba = np.sort(np.hstack([ca, cb]))
# weights for integral
h = np.diff(np.hstack([0, cba]))
# construction of first quantile function
bins = ca + 1e-10 # small tolerance to avoid rounding errors and enforce right continuity
index_qx = np.digitize(cba, bins, right=True)  # right=True because the quantile function is right continuous
qx = x[index_qx]  # quantile function F^{-1}
# construction of second quantile function
bins = cb + 1e-10
index_qy = np.digitize(cba, bins, right=True)  # right=True because the quantile function is right continuous
qy = y[index_qy]  # quantile function G^{-1}
ot_cost = np.sum((qx - qy)**p * h)
print(ot_cost)
In case you are interested, here you can find a more detailed numpy-based implementation of the OT problem on the real line, with dual and primal solutions as well: https://github.com/gnies/1d-optimal-transport. (I am still working on it though.)
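If you want to cross-check the construction against SciPy, note that for p = 1 the cost on the combined quantile grid (using absolute differences) should match scipy.stats.wasserstein_distance on the raw samples; assuming x, y, qx, qy and h from the script above:
from scipy.stats import wasserstein_distance

w1_quantile = np.sum(np.abs(qx - qy) * h)       # 1-Wasserstein cost on the quantile grid
print(w1_quantile, wasserstein_distance(x, y))  # should agree up to floating point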

compute an integral using scipy where the integrand is a product with parameters coming from a (arbitrarily long) list

I want to solve the coupon collector's problem in the general case (with varying probabilities for each coupon) using Flajolet's formula that I found on Wikipedia (see https://en.wikipedia.org/wiki/Coupon_collector%27s_problem). According to the formula I have to compute an integral where the integrand is a product. I'm using scipy.integrate.quad and lambda notation to integrate. The problem is that the number of factors in the integrand is not fixed (it has parameters coming from a list). When I try to multiply the integrand factors I get an error, since I seemingly cannot multiply formal expressions. But if I don't, I don't know how to get the integration variable x in.
I found ways to integrate a product if there are, for example, only 2 factors, and it doesn't seem to involve double integration or anything like that. Can anyone please help (I'm quite new to this stuff)?
import numpy as np
from scipy import integrate
....
def compute_general_case(p_list):
    integrand = 1
    for p in p_list:
        integrand_factor = lambda x: 1 - np.exp(-p * x)
        integrand *= integrand_factor
    integrand = 1 - integrand
    erg = integrate.quad(integrand, 0, np.inf)
    print(erg)
You can define the integration function with an arbitrary number of arguments, provided you pass them to quad using args=:
from scipy.integrate import quad

def integrand(x, *p_list):
    p_list = np.asarray(p_list)
    return 1 - np.product(1 - np.exp(-x * p_list))  # no need to for-loop a product in numpy

result, abserr = quad(integrand, 0, np.inf, args=[1,1,1,1])
print(result, abserr)
>> 2.083333333333334 2.491001112400493e-10
For more information, see here
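As a quick sanity check on that 2.0833... value: with all rates equal to 1, the integral reduces to the harmonic number H_4 (substitute u = 1 - exp(-t)), so the quad result can be verified exactly:
from fractions import Fraction

# H_4 = 1 + 1/2 + 1/3 + 1/4 = 25/12
h4 = sum(Fraction(1, k) for k in range(1, 5))
print(float(h4))  # 2.0833..., matching the quad result above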
Thanks for addressing this question: as a follow-on to @Mstaino's answer, you don't even need the args, as you can pass them into the function via a closure:
def coupon_collector_expected_samples(probs):
    """
    Find the expected number of samples before all "coupons" (with a
    non-uniform probability mass) are "collected".

    Args:
        probs (ndarray): probability mass for each unique item

    References:
        https://en.wikipedia.org/wiki/Coupon_collector%27s_problem
        https://www.combinatorics.org/ojs/index.php/eljc/article/view/v20i2p33/pdf
        https://stackoverflow.com/questions/54539128/scipy-integrand-is-product

    Example:
        >>> import numpy as np
        >>> import ubelt as ub
        >>> # Check EV of samples for a non-uniform distribution
        >>> probs = [0.38, 0.05, 0.36, 0.16, 0.05]
        >>> ev = coupon_collector_expected_samples(probs)
        >>> print('ev = {}'.format(ub.repr2(ev, precision=4)))
        ev = 30.6537
        >>> # Use general solution on a uniform distribution
        >>> probs = np.ones(4) / 4
        >>> ev = coupon_collector_expected_samples(probs)
        >>> print('ev = {}'.format(ub.repr2(ev, precision=4)))
        ev = 8.3333
        >>> # Check that this is the same as the solution for the uniform case
        >>> import sympy
        >>> n = len(probs)
        >>> uniform_ev = float(sympy.harmonic(n) * n)
        >>> assert np.isclose(ev, uniform_ev)
    """
    import numpy as np
    from scipy import integrate
    probs = np.asarray(probs)

    # Philippe Flajolet's generalized expected value integral
    def _integrand(t):
        return 1 - np.product(1 - np.exp(-probs * t))

    ev, abserr = integrate.quad(func=_integrand, a=0, b=np.inf)
    return ev
You can also see that my implementation is faster:
import timerit
import numpy as np
from scipy import integrate

ti = timerit.Timerit(100, bestof=10, verbose=2)
probs = np.random.rand(100)

def orig_method(p_list):
    def integrand(x, *p_list):
        p_list = np.asarray(p_list)
        return 1 - np.product(1 - np.exp(-x * p_list))
    result, abserr = integrate.quad(integrand, 0, np.inf, args=p_list)
    return result

for timer in ti.reset('orig_implementation'):
    with timer:
        orig_method(probs)

for timer in ti.reset('my_implementation'):
    with timer:
        coupon_collector_expected_samples(probs)

# Results:
# Timed orig_implementation for: 100 loops, best of 10
#     time per loop: best=7.267 ms, mean=7.954 ± 0.5 ms
# Timed my_implementation for: 100 loops, best of 10
#     time per loop: best=5.618 ms, mean=5.648 ± 0.0 ms

How do you compute the confidence interval for Pearson's r in Python?

In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)
According to [1], calculating the confidence interval directly from Pearson's r is complicated by the fact that it is not normally distributed. The following steps are needed:
1. Convert r to z'.
2. Calculate the z' confidence interval. The sampling distribution of z' is approximately normally distributed and has a standard error of 1/sqrt(n-3).
3. Convert the confidence interval back to r.
Here is some sample code:
import math
from scipy import stats

def r_to_z(r):
    return math.log((1 + r) / (1 - r)) / 2.0

def z_to_r(z):
    e = math.exp(2 * z)
    return (e - 1) / (e + 1)

def r_confidence_interval(r, alpha, n):
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha/2)  # 2-tailed z critical value
    lo = z - z_crit * se
    hi = z + z_crit * se
    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))
Reference:
http://onlinestatbook.com/2/estimation/correlation_ci.html
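As a side note, if you are on a recent SciPy (1.9 or later), pearsonr itself returns a result object whose confidence_interval() method gives this interval (also via the Fisher transformation), so you may not need to code it by hand; a minimal sketch:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

res = stats.pearsonr(x, y)
ci = res.confidence_interval(confidence_level=0.95)
print(res.statistic, ci.low, ci.high)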
Using rpy2 and the psychometric library (you will need R installed and to run install.packages("psychometric") within R first)
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
Where 0.9 is your correlation, n the sample size and 0.95 the confidence level
Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np

def pearsonr_ci(x, y, ci=95, n_boots=10000):
    x = np.asarray(x)
    y = np.asarray(y)
    # (n_boots, n_observations) paired arrays
    rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
    x_boots = x[rand_ixs]
    y_boots = y[rand_ixs]
    # differences from mean
    x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
    y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
    # sums of squares
    x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
    y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
    # pearson correlations
    r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
    # upper and lower bounds for confidence interval
    ci_low = np.percentile(r_boots, (100 - ci) / 2)
    ci_high = np.percentile(r_boots, (ci + 100) / 2)
    return ci_low, ci_high
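Example usage (the exact interval will vary from run to run, since the resampling uses NumPy's global random state):
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)

print(pearsonr_ci(x, y, ci=95, n_boots=2000))  # returns a (ci_low, ci_high) tuple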
The answer given by bennylp is mostly correct; however, there is a small error in calculating the critical value in the 3rd function.
It should instead be:
def r_confidence_interval(r, alpha, n):
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf((1 + alpha)/2)  # 2-tailed z critical value
    lo = z - z_crit * se
    hi = z + z_crit * se
    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
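With this parameterization, alpha is the confidence level; for example, for r = 0.9 and n = 100 at the 95% level (assuming the r_to_z and z_to_r helpers from the earlier answer):
import math
from scipy import stats

# assumes r_to_z, z_to_r and the corrected r_confidence_interval above
print(r_confidence_interval(0.9, 0.95, 100))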
I know bootstrapping has been suggested above; here is another variation of it that may suit some other setups better.
#1 Sample your data (paired X & Y, and you can also add other things, say weights), fit the original model on it, compute R2, and append it. Then extract your confidence interval from the distribution of all recorded R2s.
#2 Additionally, you can fit on the sampled data and, using the sampled-data model, predict on the non-sampled X (you could also supply a continuous range to extend your predictions instead of using the original X) to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x, a, b):
    return np.exp(b) * np.exp(a * x)

### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x,
                              np.log(y),
                              1,
                              # w= could supply weight for observations here
                              )

for i in range(how_many_straps + 1):
    ### zip into tuples attaching X to Y, can combine more variables as well
    zipped_for_boot = pd.Series(tuple(zip(x, y)))

    ### sample zipped X & Y pairs above with replacement
    zipped_resampled = zipped_for_boot.sample(frac=1, replace=True)

    ### create your sampled X & Y
    boot_x = []
    boot_y = []
    for sample in zipped_resampled:
        boot_x.append(sample[0])
        boot_y.append(sample[1])

    ### predict sampled using original fit
    y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
                                                   polyfit_original[0],
                                                   polyfit_original[1])

    ### calculate r2 and append
    r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))

    ### fit sampled
    polyfit_boot = np.polyfit(boot_x,
                              np.log(boot_y),
                              1,
                              # w= could supply weight for observations here
                              )

    ### predict original via sampled fit or on a range of min(x) to Z
    y_hat_original_via_sampled_fit = func_exponential(x,
                                                      polyfit_boot[0],
                                                      polyfit_boot[1])

    ### insert y hat into dataframe for calculating y hat confidence intervals
    ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"

scipy.stats.ttest_ind without array (python)

I have done a number of calculations to estimate μ, σ and N for my two samples. Due to a number of approximations I don't have the arrays that are expected as input to scipy.stats.ttest_ind. Unless I am mistaken, I only need μ, σ and N to do Welch's t-test. Is there a way to do this in Python?
Here’s a straightforward implementation based on this:
import scipy.stats as stats
import numpy as np

def welch_t_test(mu1, s1, N1, mu2, s2, N2):
    # Construct arrays to make calculations more succinct.
    N_i = np.array([N1, N2])
    dof_i = N_i - 1
    v_i = np.array([s1, s2]) ** 2
    # Calculate t-stat, degrees of freedom, use scipy to find p-value.
    t = (mu1 - mu2) / np.sqrt(np.sum(v_i / N_i))
    dof = (np.sum(v_i / N_i) ** 2) / np.sum((v_i ** 2) / ((N_i ** 2) * dof_i))
    p = stats.distributions.t.sf(np.abs(t), dof) * 2
    return t, p
It yields virtually identical results:
sample1 = np.random.rand(10)
sample2 = np.random.rand(15)
result_test = welch_t_test(np.mean(sample1), np.std(sample1, ddof=1), sample1.size,
                           np.mean(sample2), np.std(sample2, ddof=1), sample2.size)
result_scipy = stats.ttest_ind(sample1, sample2, equal_var=False)
np.allclose(result_test, result_scipy)
True
As an update
The function is now available in scipy.stats, since version 0.16.0
http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.ttest_ind_from_stats.html
scipy.stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=True)
T-test for means of two independent samples from descriptive statistics.
This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.
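For the Welch version asked about here, pass equal_var=False along with the summary statistics; a minimal sketch with made-up numbers:
from scipy.stats import ttest_ind_from_stats

# mu, s (sample std dev, ddof=1) and N for each lab's sample (illustrative values only)
t, p = ttest_ind_from_stats(mean1=0.2, std1=1.1, nobs1=40,
                            mean2=0.5, std2=1.6, nobs2=55,
                            equal_var=False)
print(t, p)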
I have written t-test and z-test functions that take the summary statistics for statsmodels.
Those were intended mainly as internal shortcuts to avoid code duplication, and are not well documented.
For example http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats._tstat_generic.html
The list of related functions is here:
http://statsmodels.sourceforge.net/devel/stats.html#basic-statistics-and-t-tests-with-frequency-weights
edit: in reply to comment
The function just does the core calculation; the actual calculation of the standard deviation of the difference under different assumptions is added in the calling method.
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py#L713
edit
Here is an example of how to use the methods of the CompareMeans class, which includes the t-test based on summary statistics. We need to create a class that holds the relevant summary statistics as attributes. At the end there is a function that just wraps the relevant calls.
"""
Created on Wed Jul 23 05:47:34 2014
Author: Josef Perktold
License: BSD-3
"""
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import CompareMeans, ttest_ind
class SummaryStats(object):

    def __init__(self, nobs, mean, std):
        self.nobs = nobs
        self.mean = mean
        self.std = std
        self._var = std**2

np.random.seed(123)
nobs = 20
x1 = 1 + np.random.randn(nobs)
x2 = 1 + 1.5 * np.random.randn(nobs)

print(stats.ttest_ind(x1, x2, equal_var=False))
print(ttest_ind(x1, x2, usevar='unequal'))

s1 = SummaryStats(x1.shape[0], x1.mean(0), x1.std(0))
s2 = SummaryStats(x2.shape[0], x2.mean(0), x2.std(0))

print(CompareMeans(s1, s2).ttest_ind(usevar='unequal'))

def ttest_ind_summ(summ1, summ2, usevar='unequal'):
    """t-test for equality of means based on summary statistics

    Parameters
    ----------
    summ1, summ2 : tuples of (nobs, mean, std)
        summary statistic for the two samples

    """
    s1 = SummaryStats(*summ1)
    s2 = SummaryStats(*summ2)
    return CompareMeans(s1, s2).ttest_ind(usevar=usevar)

print(ttest_ind_summ((x1.shape[0], x1.mean(0), x1.std(0)),
                     (x2.shape[0], x2.mean(0), x2.std(0)),
                     usevar='unequal'))
''' result
(array(1.1590347327654558), 0.25416326823881513)
(1.1590347327654555, 0.25416326823881513, 35.573591346616553)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
'''
