I have done several calculations to estimate μ, σ and N for my two samples. Because of various approximations, I don't have the arrays that scipy.stats.ttest_ind expects as input. Unless I am mistaken, I only need μ, σ and N to do a Welch's t-test. Is there a way to do this in Python?
Here's a straightforward implementation based on the Welch's t-test formulas:
import scipy.stats as stats
import numpy as np
def welch_t_test(mu1, s1, N1, mu2, s2, N2):
    # Construct arrays to make calculations more succinct.
N_i = np.array([N1, N2])
dof_i = N_i - 1
v_i = np.array([s1, s2]) ** 2
# Calculate t-stat, degrees of freedom, use scipy to find p-value.
t = (mu1 - mu2) / np.sqrt(np.sum(v_i / N_i))
dof = (np.sum(v_i / N_i) ** 2) / np.sum((v_i ** 2) / ((N_i ** 2) * dof_i))
p = stats.distributions.t.sf(np.abs(t), dof) * 2
return t, p
It yields virtually identical results:
sample1 = np.random.rand(10)
sample2 = np.random.rand(15)
result_test = welch_t_test(np.mean(sample1), np.std(sample1, ddof=1), sample1.size,
np.mean(sample2), np.std(sample2, ddof=1), sample2.size)
result_scipy = stats.ttest_ind(sample1, sample2, equal_var=False)
np.allclose(result_test, result_scipy)
True
Update
The function is now available in scipy.stats as of version 0.16.0:
http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.ttest_ind_from_stats.html
scipy.stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=True)
T-test for means of two independent samples from descriptive statistics.
This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.
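For the summary-statistics case in the question, a call might look like the sketch below (the numbers are made up purely for illustration; equal_var=False selects the Welch variant):
from scipy import stats

# Hypothetical summary statistics (mean, std, n) for the two samples
t, p = stats.ttest_ind_from_stats(mean1=2.1, std1=0.8, nobs1=12,
                                  mean2=1.7, std2=1.1, nobs2=17,
                                  equal_var=False)  # Welch's t-test
print(t, p)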
I have written t-test and z-test functions for statsmodels that take summary statistics as input.
Those were intended mainly as internal shortcuts to avoid code duplication, and are not well documented.
For example http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats._tstat_generic.html
The list of related functions is here:
http://statsmodels.sourceforge.net/devel/stats.html#basic-statistics-and-t-tests-with-frequency-weights
Edit: in reply to a comment
The function just does the core calculation; the actual calculation of the standard deviation of the difference under the different assumptions is added in the calling method.
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py#L713
Edit
Here is an example of how to use the methods of the CompareMeans class, which includes the t-test based on summary statistics. We need to create a class that holds the relevant summary statistics as attributes. At the end there is a function that just wraps the relevant calls.
"""
Created on Wed Jul 23 05:47:34 2014
Author: Josef Perktold
License: BSD-3
"""
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import CompareMeans, ttest_ind
class SummaryStats(object):
def __init__(self, nobs, mean, std):
self.nobs = nobs
self.mean = mean
self.std = std
self._var = std**2
np.random.seed(123)
nobs = 20
x1 = 1 + np.random.randn(nobs)
x2 = 1 + 1.5 * np.random.randn(nobs)
print(stats.ttest_ind(x1, x2, equal_var=False))
print(ttest_ind(x1, x2, usevar='unequal'))
s1 = SummaryStats(x1.shape[0], x1.mean(0), x1.std(0))
s2 = SummaryStats(x2.shape[0], x2.mean(0), x2.std(0))
print(CompareMeans(s1, s2).ttest_ind(usevar='unequal'))
def ttest_ind_summ(summ1, summ2, usevar='unequal'):
"""t-test for equality of means based on summary statistic
Parameters
----------
summ1, summ2 : tuples of (nobs, mean, std)
summary statistic for the two samples
"""
s1 = SummaryStats(*summ1)
s2 = SummaryStats(*summ2)
return CompareMeans(s1, s2).ttest_ind(usevar=usevar)
print(ttest_ind_summ((x1.shape[0], x1.mean(0), x1.std(0)),
                     (x2.shape[0], x2.mean(0), x2.std(0)),
                     usevar='unequal'))
''' result
(array(1.1590347327654558), 0.25416326823881513)
(1.1590347327654555, 0.25416326823881513, 35.573591346616553)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
'''
Related
I am looking for a function in Numpy or Scipy (or any rigorous Python library) that will give me the cumulative normal distribution function in Python.
Here's an example:
>>> from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
In other words, approximately 95% of the area under the standard normal curve lies within 1.96 standard deviations of the mean of zero.
If you need the inverse CDF:
>>> norm.ppf(norm.cdf(1.96))
array(1.9599999999999991)
It may be too late to answer the question, but since Google still leads people here, I decided to write my solution here.
Since Python 2.7, the math library has included the error function math.erf(x).
The erf() function can be used to compute traditional statistical functions such as the cumulative standard normal distribution:
from math import *
def phi(x):
    """Cumulative distribution function for the standard normal distribution."""
    return (1.0 + erf(x / sqrt(2.0))) / 2.0
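As a quick check (not part of the original snippet), this matches the scipy values quoted above:
print(phi(1.96))   # ≈ 0.9750021
print(phi(-1.96))  # ≈ 0.0249979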
Ref:
https://docs.python.org/2/library/math.html
https://docs.python.org/3/library/math.html
How are the Error Function and Standard Normal distribution function related?
Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the cumulative distribution function (cdf - probability that a random sample X will be less than or equal to x) for a given mean (mu) and standard deviation (sigma):
from statistics import NormalDist
NormalDist(mu=0, sigma=1).cdf(1.96)
# 0.9750021048517796
This can be simplified for the standard normal distribution (mu = 0 and sigma = 1):
NormalDist().cdf(1.96)
# 0.9750021048517796
NormalDist().cdf(-1.96)
# 0.024997895148220428
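NormalDist also provides inv_cdf, the inverse of cdf (the quantile function), if you need to go from a probability back to a value:
from statistics import NormalDist
NormalDist().inv_cdf(0.975)
# ≈ 1.959964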
Adapted from here http://mail.python.org/pipermail/python-list/2000-June/039873.html
from math import *
def erfcc(x):
"""Complementary error function."""
z = abs(x)
t = 1. / (1. + 0.5*z)
r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
t*(.09678418+t*(-.18628806+t*(.27886807+
t*(-1.13520398+t*(1.48851587+t*(-.82215223+
t*.17087277)))))))))
if (x >= 0.):
return r
else:
return 2. - r
def ncdf(x):
return 1. - 0.5*erfcc(x/(2**0.5))
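As a quick check (not part of the original post), this approximation matches the scipy values quoted in the other answers to about seven decimal places:
print(ncdf(1.96))   # ≈ 0.97500
print(ncdf(-1.96))  # ≈ 0.02500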
To build upon Unknown's example, the Python equivalent of the function normdist() implemented in a lot of libraries would be:
def normcdf(x, mu, sigma):
    t = x - mu
    y = 0.5 * erfcc(-t / (sigma * sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y
def normpdf(x, mu, sigma):
    u = (x - mu) / abs(sigma)
    y = (1 / (sqrt(2 * pi) * abs(sigma))) * exp(-u * u / 2)
    return y
def normdist(x, mu, sigma, f):
    if f:
        y = normcdf(x, mu, sigma)
    else:
        y = normpdf(x, mu, sigma)
    return y
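For illustration (my addition, reusing erfcc and the math imports from the answers above), a couple of calls for the standard normal case:
print(normdist(1.96, 0, 1, True))   # CDF -> ≈ 0.9750
print(normdist(0, 0, 1, False))     # PDF at the mean -> ≈ 0.3989 (1/sqrt(2*pi))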
Alex's answer shows you a solution for the standard normal distribution (mean = 0, standard deviation = 1). If you have a normal distribution with mean and std (which is sqrt(var)) and you want to calculate:
from scipy.stats import norm
# cdf(x < val)
print(norm.cdf(val, m, s))
# cdf(x > val)
print(1 - norm.cdf(val, m, s))
# cdf(v1 < x < v2)
print(norm.cdf(v2, m, s) - norm.cdf(v1, m, s))
Read more about the cdf here, and about the scipy implementation of the normal distribution (with many formulas) here.
Taken from above:
from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
For a two-tailed test:
import numpy as np
z = 1.96
p_value = 2 * norm.cdf(-np.abs(z))
0.04999579029644087
It's as simple as this:
import math
def my_cdf(x):
return 0.5*(1+math.erf(x/math.sqrt(2)))
I found the formula on this page: https://www.danielsoper.com/statcalc/formulas.aspx?id=55
I have been working with the following question:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
I applied my data to the code from that link and found that the distribution that best fits my data is the non-central Student's t distribution. I couldn't find that distribution in the pymc3 package, so I decided to look at scipy to understand how the distribution is formed. I created a custom distribution and I have a few questions:
Is my approach to creating the distribution right?
How can I implement the custom distribution into models?
Regarding the prior distributions, do I use the same approach as for normal-distribution priors (mu and sigma), combined with half-normal priors for the degrees of freedom and the noncentrality value?
My custom distribution:
import numpy as np
import theano.tensor as tt
from scipy import stats
from scipy.special import hyp1f1, nctdtr
import warnings
from pymc3.theanof import floatX
from pymc3.distributions.dist_math import bound, gammaln
from pymc3.distributions.continuous import assert_negative_support, get_tau_sigma
from pymc3.distributions.distribution import Continuous, draw_values, generate_samples
class NonCentralStudentT(Continuous):
"""
Parameters
----------
nu: float
Degrees of freedom, also known as normality parameter (nu > 0).
mu: float
Location parameter.
sigma: float
Scale parameter (sigma > 0). Converges to the standard deviation as nu increases. (only required if lam is not specified)
lam: float
Scale parameter (lam > 0). Converges to the precision as nu increases. (only required if sigma is not specified)
"""
def __init__(self, nu, nc, mu=0, lam=None, sigma=None, sd=None, *args, **kwargs):
super().__init__(*args, **kwargs)
if sd is not None:
sigma = sd
warnings.warn("sd is deprecated, use sigma instead", DeprecationWarning)
self.nu = nu = tt.as_tensor_variable(floatX(nu))
self.nc = nc = tt.as_tensor_variable(floatX(nc))
lam, sigma = get_tau_sigma(tau=lam, sigma=sigma)
self.lam = lam = tt.as_tensor_variable(lam)
self.sigma = self.sd = sigma = tt.as_tensor_variable(sigma)
self.mean = self.median = self.mode = self.mu = mu = tt.as_tensor_variable(mu)
self.variance = tt.switch((nu > 2) * 1, (1 / self.lam) * (nu / (nu - 2)), np.inf)
assert_negative_support(lam, 'lam (sigma)', 'NonCentralStudentT')
assert_negative_support(nu, 'nu', 'NonCentralStudentT')
assert_negative_support(nc, 'nc', 'NonCentralStudentT')
def random(self, point=None, size=None):
"""
Draw random values from Non-Central Student's T distribution.
Parameters
----------
point: dict, optional
Dict of variable values on which random values are to be
conditioned (uses default point if not specified).
size: int, optional
Desired size of random sample (returns one sample if not
specified).
Returns
-------
array
"""
nu, nc, mu, lam = draw_values([self.nu, self.nc, self.mu, self.lam], point=point, size=size)
return generate_samples(stats.nct.rvs, nu, nc, loc=mu, scale=lam ** -0.5, dist_shape=self.shape, size=size)
def logp(self, value):
"""
Calculate log-probability of Non-Central Student's T distribution at specified value.
Parameters
----------
value: numeric
Value(s) for which log-probability is calculated. If the log probabilities for multiple
values are desired the values must be provided in a numpy array or theano tensor
Returns
-------
TensorVariable
"""
nu = self.nu
nc = self.nc
mu = self.mu
lam = self.lam
n = nu * 1.0
nc = nc * 1.0
x2 = value * value
ncx2 = nc * nc * x2
fac1 = n + x2
trm1 = n / 2. * tt.log(n) + gammaln(n + 1)
trm1 -= n * tt.log(2) + nc * nc / 2. + (n / 2.) * tt.log(fac1) + gammaln(n / 2.)
Px = tt.exp(trm1)
valF = ncx2 / (2 * fac1)
trm1 = tt.sqrt(2) * nc * value * hyp1f1(n / 2 + 1, 1.5, valF)
trm1 /= np.asarray(fac1 * tt.gamma((n + 1) / 2))
trm2 = hyp1f1((n + 1) / 2, 0.5, valF)
trm2 /= np.asarray(np.sqrt(fac1) * tt.gamma(n / 2 + 1))
Px *= trm1 + trm2
return bound(Px, lam > 0, nu > 0, nc > 0)
def logcdf(self, value):
"""
Compute the log of the cumulative distribution function for Non-Central Student's T distribution
at the specified value.
Parameters
----------
value: numeric
Value(s) for which log CDF is calculated. If the log CDF for multiple
values are desired the values must be provided in a numpy array or theano tensor.
Returns
-------
TensorVariable
"""
nu = self.nu
nc = self.nc
return nctdtr(nu, nc, value)
My Custom model:
with pm.Model() as model:
# Prior Distributions for unknown model parameters:
    mu = pm.Normal('mu', 0, 10)
sigma = pm.Normal('sigma', 0, 10)
nc= pm.HalfNormal('nc', sigma=10)
nu= pm.HalfNormal('nu', sigma=1)
# Observed data is from a Likelihood distributions (Likelihood (sampling distribution) of observations):
    observed_data = pm.Beta('observed_data', alpha=alpha, beta=beta, observed=data)  # <-- replace this with the custom distribution
# draw 5000 posterior samples
trace = pm.sample(draws=5000, tune=2000, chains=3, cores=1)
# Obtaining Posterior Predictive Sampling:
post_pred = pm.sample_posterior_predictive(trace, samples=3000)
print(post_pred['observed_data'].shape)
print('\nSummary: ')
print(pm.stats.summary(data=trace))
print(pm.stats.summary(data=post_pred))
Edit 1:
I redesigned the custom model to include the custom distribution. However, I keep getting errors based on the equations used to get the likelihood distribution, or sometimes the tensor locks down and the code just freezes. My code is below:
with pm.Model() as model:
# Prior Distributions for unknown model parameters:
mu = pm.Normal('mu', mu=0, sigma=1)
sd = pm.HalfNormal('sd', sigma=1)
nc = pm.HalfNormal('nc', sigma=10)
nu = pm.HalfNormal('nu', sigma=1)
# Custom distribution:
# observed_data = pm.DensityDist('observed_data', NonCentralStudentT, observed=data_list)
# Observed data is from a Likelihood distributions (Likelihood (sampling distribution) of observations):
observed_data = NonCentralStudentT('observed_data', mu=mu, sd=sd, nc=nc, nu=nu, observed=data_list)
# draw 5000 posterior samples
trace_S = pm.sample(draws=5000, tune=2000, chains=3, cores=1)
# Obtaining Posterior Predictive Sampling:
post_pred_S = pm.sample_posterior_predictive(trace_S, samples=3000)
print(post_pred_S['observed_data'].shape)
print('\nSummary: ')
print(pm.stats.summary(data=trace_S))
print(pm.stats.summary(data=post_pred_S))
Edit 2:
I have been looking online for how to convert the function to theano; the only thing I found that defines the function is the following GitHub link: hyp1f1 function GitHub
Will this be enough to convert the function into theano?
In addition, is it okay to use NumPy arrays with theano?
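A rough sketch of what I have in mind (untested, and the scalar signature is only illustrative) would wrap the SciPy function as a gradient-free Theano Op using as_op; since such an Op has no gradient, NUTS could not be used and PyMC3 would have to fall back to Metropolis or Slice:
import numpy as np
import theano.tensor as tt
from theano.compile.ops import as_op
from scipy.special import hyp1f1

# Wrap scipy.special.hyp1f1 so it can appear inside a Theano graph.
# The Op has no gradient, so gradient-based samplers cannot use it.
@as_op(itypes=[tt.dscalar, tt.dscalar, tt.dscalar], otypes=[tt.dscalar])
def hyp1f1_op(a, b, x):
    return np.asarray(hyp1f1(a, b, x), dtype=np.float64)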
Also, I thought of another way, but I am not sure whether it can be implemented. I looked into the nct function in scipy, and the documentation says the following:
If Y is a standard normal random variable and V is an independent chi-square random variable (chi2) with k degrees of freedom, then
X = (Y + c) / sqrt(V / k)
has a non-central Student's t distribution on the real line. The degrees of freedom parameter k (denoted df in the implementation) satisfies k > 0 and the noncentrality parameter c (denoted nc in the implementation) is a real number.
The probability density above is defined in the "standardized" form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, nct.pdf(x, df, nc, loc, scale) is identically equivalent to nct.pdf(y, df, nc) / scale with y = (x - loc) / scale.
So, I thought of using normal and chi2 random variables as priors and plugging the degrees-of-freedom variable from the code above into the equation given in SciPy. Would that be enough to get the distribution?
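A rough sketch of that idea (untested; the variable names are only illustrative) would express the SciPy construction with standard PyMC3 building blocks and a Deterministic:
import pymc3 as pm

with pm.Model() as nct_sketch:
    nu = pm.HalfNormal('nu', sigma=1)         # degrees of freedom k
    nc = pm.HalfNormal('nc', sigma=10)        # noncentrality c
    mu = pm.Normal('mu', mu=0, sigma=10)      # loc
    sigma = pm.HalfNormal('sigma', sigma=10)  # scale
    y = pm.Normal('y', mu=0, sigma=1)         # standard normal Y
    v = pm.ChiSquared('v', nu=nu)             # chi-square V with k degrees of freedom
    x = pm.Deterministic('x', mu + sigma * (y + nc) / pm.math.sqrt(v / nu))
This only gives a latent noncentral-t variable, though; to use it as a likelihood for observed data I would still need a logp, which is what the custom distribution above is meant to provide.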
Edit 3:
I managed to run the code in the link about fitting an empirical distribution and found that the second-best fit was the Student's t distribution, so I will be using that. Thank you for your help. I just have a side question: I ran my model with the Student's t distribution, but I got these warnings:
There were 52 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.7037574708196309, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 10% for some parameters.
I am just confused about these warnings. Do you have any idea what they mean? I know this won't affect my code, but can I reduce the divergences? And regarding the effective samples, do I need to increase the number of samples in the trace code?
I have a problem with optimizing the rejection method for generating continuous random variables. I've got the density f(x) = 3/2 (1 - x^2). Here's my code:
import random
import matplotlib.pyplot as plt
import numpy as np
import time
import scipy.stats as ss
a=0 # xmin
b=1 # xmax
m=3/2 # ymax
variables = [] #list for variables
def f(x):
return 3/2 * (1 - x**2) #probability density function
reject = 0 # number of rejections
start = time.time()
while len(variables) < 100000: #I want to generate 100 000 variables
u1 = random.uniform(a,b)
u2 = random.uniform(0,m)
if u2 <= f(u1):
variables.append(u1)
else:
reject +=1
end = time.time()
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a,b,1000)
plt.hist(variables,50, density=1)
plt.plot(x, f(x))
plt.show()
ss.probplot(variables, plot=plt)
plt.show()
My first question: Is my probability plot made properly?
And the second is the one in the title: how to optimize that method? I would like some advice on optimizing the code. Right now the code takes about 0.5 seconds and there are about 50 000 rejections. Is it possible to reduce the time and the number of rejections? If needed, I can use a different method of generating the variables.
My first question: Is my probability plot made properly?
No. It is made against the default normal distribution. You have to pack your function f(x) into a class derived from stats.rv_continuous, make it the _pdf method, and pass it to probplot.
And the second, what is in the title. How to optimise that method? Is it possible to reduce the time and number of rejections?
Sure, you have the power of NumPy's vectorized operations at your hands. Don't ever write explicit loops: vectorize, vectorize and vectorize!
Look at the modified code below: not a single loop, everything is done via NumPy vectors. On my computer (Xeon, Win10 x64, Anaconda Python 3.7), the time for 100000 samples went down from 0.19 s to 0.003 s.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import time
a = 0. # xmin
b = 1. # xmax
m = 3.0/2.0 # ymax
def f(x):
return 1.5 * (1.0 - x*x) # probability density function
start = time.time()
N = 100000
u1 = np.random.uniform(a, b, N)
u2 = np.random.uniform(0.0, m, N)
negs = np.empty(N)
negs.fill(-1)
variables = np.where(u2 <= f(u1), u1, negs) # accepted samples are positive or 0, rejected are -1
end = time.time()
accept = np.extract(variables>=0.0, variables)
reject = N - len(accept)
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a, b, 1000)
plt.hist(accept, 50, density=True)
plt.plot(x, f(x))
plt.show()
ss.probplot(accept, plot=plt) # against normal distribution
plt.show()
Concerning reducing the number of rejections, you could sample with zero rejects by using the inverse-transform method; the CDF is a cubic equation, so it can be inverted fairly easily.
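Here is a sketch of that inverse approach (my own derivation, so double-check it): the CDF is F(x) = (3x - x^3)/2 on [0, 1], so setting F(x) = u gives the cubic x^3 - 3x + 2u = 0, whose root in [0, 1] has a closed form via the trigonometric solution of the depressed cubic:
import numpy as np

def sample_inverse(n):
    # Inverse-transform sampling for f(x) = 1.5*(1 - x**2) on [0, 1].
    # Solving F(x) = u, i.e. x**3 - 3*x + 2*u = 0, the root lying in [0, 1] is
    # x = 2*cos((arccos(-u) - 2*pi)/3).
    u = np.random.uniform(size=n)
    return 2.0 * np.cos((np.arccos(-u) - 2.0 * np.pi) / 3.0)

samples = sample_inverse(100000)  # fully vectorized, zero rejections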
UPDATE
Here is the code to use for probplot:
class my_pdf(ss.rv_continuous):
def _pdf(self, x):
return 1.5 * (1.0 - x*x)
ss.probplot(accept, dist=my_pdf(a=a, b=b, name='my_pdf'), plot=plt)
and you should get a probability plot made against your f(x) distribution instead of the default normal one.
Regarding your first question, scipy.stats.probplot compares your sample against the quantiles of the normal distribution. If you'd like it to compare against the quantiles of your f(x) distribution, check out the dist parameter of probplot.
In terms of making this sampling procedure faster, avoiding loops is generally the way to go. Replacing the code between start = ... and end = ... with the following resulted in a >20x speedup for me.
n_before_accept_reject = 150000
u1 = np.random.uniform(a, b, size=n_before_accept_reject)
u2 = np.random.uniform(0, m, size=n_before_accept_reject)
variables = u1[u2 <= f(u1)]
reject = n_before_accept_reject - len(variables)
Note that this will give you approximately 100000 accepted samples each time you run it. You could raise the value of n_before_accept_reject slightly to effectively guarantee that variables will always have >100000 accepted values, and then just cap the size of variables to return exactly 100000 if necessary.
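For example, one way to guarantee exactly 100000 accepted samples (a sketch reusing a, b, m, f, np and n_before_accept_reject from above; n_target is just an illustrative name) is to keep drawing batches until enough are accepted and then truncate:
n_target = 100000
chunks = []
n_accepted = 0
while n_accepted < n_target:
    u1 = np.random.uniform(a, b, size=n_before_accept_reject)
    u2 = np.random.uniform(0, m, size=n_before_accept_reject)
    accepted = u1[u2 <= f(u1)]
    chunks.append(accepted)
    n_accepted += len(accepted)
variables = np.concatenate(chunks)[:n_target]  # exactly n_target accepted samples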
Others have spoken to the probability plotting; I'm going to address the efficiency of the rejection algorithm.
Acceptance/rejection schemes are based on m(x), a "majorizing function". A majorizing function should have two properties: 1) m(x)≥ f(x) ∀ x; and 2) m(x), when scaled to be a distribution, should be easy to generate values from.
You went with the constant function m = 3/2, which meets both requirements but does not bound f(x) very closely. Integrated from zero to one, that has an area of 3/2. Your f(x), being a valid density function, has an area of 1. Consequently, ∫f(x) dx / ∫m(x) dx = 1 / (3/2) = 2/3. In other words, 2/3 of the values you generate from the majorizing function are accepted, and you are rejecting 1/3 of the attempts.
You need an m(x) which provides a tighter bound for f(x). I went with a line which is tangent to f(x) at x = 1/2. With a little bit of calculus to get the slope, I derived m(x) = 15/8 - 3x/2.
This choice of m(x) has an area of 9/8, so only 1/9 of the values will be rejected. A bit more calculus yielded that the inverse-transform generator for x's based on this m(x) is x = (5 - sqrt(25 - 24U)) / 4, where U is a uniform(0,1) random variable.
Here's an implementation, based off your original version. I wrapped the rejection scheme in a function, and created the values with a list comprehension rather than appending to a list. As you'll see if you run this, it produces a lot fewer rejections than your original version.
import random
import matplotlib.pyplot as plt
import numpy as np
import time
import math
import scipy.stats as ss
a = 0 # xmin
b = 1 # xmax
reject = 0 # number of rejections
def f(x):
return 3.0 / 2.0 * (1.0 - x**2) #probability density function
def m(x):
return 1.875 - 1.5 * x
def generate_x():
global reject
while True:
x = (5.0 - math.sqrt(25.0 - random.uniform(0.0, 24.0))) / 4.0
u = random.uniform(0, m(x))
if u <= f(x):
return x
reject += 1
start = time.time()
variables = [generate_x() for _ in range(100000)]
end = time.time()
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a,b,1000)
plt.hist(variables,50, density=1)
plt.plot(x, f(x))
plt.show()
I am trying to find a correct way of fitting a beta distribution. It's not a real-world problem; I am just testing the effects of a few different methods, and in doing this something is puzzling me.
Here is the python code I am working on, in which I tested 3 different approaches:
1>: fit using moments (sample mean and variance).
2>: fit by minimizing the negative log-likelihood (by using scipy.optimize.fmin()).
3>: simply call scipy.stats.beta.fit()
from scipy.optimize import fmin
from scipy.stats import beta
from scipy.special import gamma as gammaf
import matplotlib.pyplot as plt
import numpy
def betaNLL(param,*args):
'''Negative log likelihood function for beta
<param>: list for parameters to be fitted.
<args>: 1-element array containing the sample data.
Return <nll>: negative log-likelihood to be minimized.
'''
a,b=param
data=args[0]
pdf=beta.pdf(data,a,b,loc=0,scale=1)
lg=numpy.log(pdf)
#-----Replace -inf with 0s------
lg=numpy.where(lg==-numpy.inf,0,lg)
nll=-1*numpy.sum(lg)
return nll
#-------------------Sample data-------------------
data=beta.rvs(5,2,loc=0,scale=1,size=500)
#----------------Normalize to [0,1]----------------
#data=(data-numpy.min(data))/(numpy.max(data)-numpy.min(data))
#----------------Fit using moments----------------
mean=numpy.mean(data)
var=numpy.var(data,ddof=1)
alpha1=mean**2*(1-mean)/var-mean
beta1=alpha1*(1-mean)/mean
#------------------Fit using mle------------------
result=fmin(betaNLL,[1,1],args=(data,))
alpha2,beta2=result
#----------------Fit using beta.fit----------------
alpha3,beta3,xx,yy=beta.fit(data)
print('\n# alpha,beta from moments:', alpha1, beta1)
print('# alpha,beta from mle:', alpha2, beta2)
print('# alpha,beta from beta.fit:', alpha3, beta3)
#-----------------------Plot-----------------------
plt.hist(data,bins=30,normed=True)
fitted=lambda x,a,b:gammaf(a+b)/gammaf(a)/gammaf(b)*x**(a-1)*(1-x)**(b-1) #pdf of beta
xx=numpy.linspace(0,max(data),len(data))
plt.plot(xx,fitted(xx,alpha1,beta1),'g')
plt.plot(xx,fitted(xx,alpha2,beta2),'b')
plt.plot(xx,fitted(xx,alpha3,beta3),'r')
plt.show()
The problem I have is about the normalization process (z=(x-a)/(b-a)) where a and b are the min and max of the sample, respectively.
When I don't do the normalization, everything works OK; there are slight differences among the fitting methods, but all are reasonably good.
But when I do the normalization, the result plot I get is clearly off.
Only the moment method (green line) looks Ok.
The scipy.stats.beta.fit() method (red line) is uniform always, no matter what parameters I use to generate the random numbers.
And the MLE (blue line) fails.
So it seems like the normalization is creating these issues. But I think it is legal to have x = 0 and x = 1 in the beta distribution. And given a real-world problem, isn't the first step to normalize the sample observations so that they fall within [0, 1]? In that case, how should I fit the curve?
The problem is that beta.pdf() sometimes returns 0 or inf at 0 and 1. For example:
>>> from scipy.stats import beta
>>> beta.pdf(1,1.05,0.95)
/usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1165: RuntimeWarning: divide by zero encountered in power
Px = (1.0-x)**(b-1.0) * x**(a-1.0)
inf
>>> beta.pdf(0,1.05,0.95)
0.0
Your normalization process guarantees that you will have one data sample at 0 and one at 1. Although you "correct" for values at which the pdf is 0, you are not correcting for those which return inf. To account for this you can just remove all the values which are not finite:
import numpy as np
from scipy.stats import beta

def betaNLL(param, *args):
"""
Negative log likelihood function for beta
<param>: list for parameters to be fitted.
<args>: 1-element array containing the sample data.
Return <nll>: negative log-likelihood to be minimized.
"""
a, b = param
data = args[0]
pdf = beta.pdf(data,a,b,loc=0,scale=1)
lg = np.log(pdf)
mask = np.isfinite(lg)
nll = -lg[mask].sum()
return nll
Really you shouldn't be normalizing like this though, because you are essentially throwing two data points out of the fit.
Without a docstring for beta.fit, it was a little tricky to find, but if you know the upper and lower limits you want to force upon beta.fit, you can use the kwargs floc and fscale.
I ran your code using only the beta.fit method, both with and without the floc and fscale kwargs. Also, I checked it with the arguments as ints and floats to make sure that wouldn't affect your answer. It didn't (on this test; I can't say it never would).
>>> from scipy.stats import beta
>>> import numpy
>>> def betaNLL(param,*args):
'''Negative log likelihood function for beta
<param>: list for parameters to be fitted.
<args>: 1-element array containing the sample data.
Return <nll>: negative log-likelihood to be minimized.
'''
a,b=param
data=args[0]
pdf=beta.pdf(data,a,b,loc=0,scale=1)
lg=numpy.log(pdf)
#-----Replace -inf with 0s------
lg=numpy.where(lg==-numpy.inf,0,lg)
nll=-1*numpy.sum(lg)
return nll
>>> data=beta.rvs(5,2,loc=0,scale=1,size=500)
>>> beta.fit(data)
(5.696963536654355, 2.0005252702837009, -0.060443307228404922, 1.0580278414086459)
>>> beta.fit(data,floc=0,fscale=1)
(5.0952451826831462, 1.9546341057106007, 0, 1)
>>> beta.fit(data,floc=0.,fscale=1.)
(5.0952451826831462, 1.9546341057106007, 0.0, 1.0)
In conclusion, it seems this doesn't change your data (through normalization) or throw out data. I just think it should be noted that care should be taken when using this. In your case, you knew the limits were 0 and 1 because you got data out of a defined distribution that was between 0 and 1. In other cases, limits might be known, but if they are not known, beta.fit will provide them. In this case, without specifying the limits of 0 and 1, beta.fit calculated them to be loc=-0.06 and scale=1.058.
I used the method proposed in doi:10.1080/00949657808810232 to fit the beta parameters:
from scipy.special import psi
from scipy.special import polygamma
from scipy.optimize import root_scalar
from numpy.random import beta
import numpy as np
def ipsi(y):
if y >= -2.22:
x = np.exp(y) + 0.5
else:
x = - 1/ (y + psi(1))
for i in range(5):
x = x - (psi(x) - y)/(polygamma(1,x))
return x
#%%
# q satisfies
#   psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) - lng2 = 0
# That is, we look for the root of
#   f(q) = psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) - lng2
# and then:
#   p = ipsi(lng1 - lng2 + psi(q))
def f(q, lng1, lng2):
    return psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) - lng2
#%%
def ml_beta_pq(sample):
lng1 = np.log(sample).mean()
lng2 = np.log(1-sample).mean()
def g(q):
return f(q,lng1,lng2)
q=root_scalar(g,x0=1,x1=1.1).root
p = ipsi(lng1 - lng2 + psi(q))
return p, q
#%%
p = 2
q = 5
n = 1500
sample = beta(p,q,n)
ps, qs = ml_beta_pq(sample)  # "s" suffix marks the estimated ("hat") values
print(f'Estimating the parameters of a beta({p}, {q}) \nfrom a sample of size n = {n}')
print(f'\nn ={n:5d} | p | q')
print(f'---------+-------+------')
print(f'original | {p:2.3f} | {q:2.3f}')
print(f'estimate | {ps:2.3f} | {qs:2.3f}')
I have the mean, std dev and n of sample 1 and sample 2; the samples are taken from the same population, but measured by different labs.
n is different for sample 1 and sample 2. I want to do a weighted (take n into account) two-tailed t-test.
I tried using the scipy.stats module by creating my numbers with np.random.normal, since it only takes data and not stat values like mean and std dev (is there any way to use these values directly?). But it didn't work, since the data arrays have to be of equal size.
Any help on how to get the p-value would be highly appreciated.
If you have the original data as arrays a and b, you can use scipy.stats.ttest_ind with the argument equal_var=False:
t, p = ttest_ind(a, b, equal_var=False)
If you have only the summary statistics of the two data sets, you can calculate the t value using scipy.stats.ttest_ind_from_stats (added to scipy in version 0.16) or from the formula (http://en.wikipedia.org/wiki/Welch%27s_t_test).
The following script shows the possibilities.
from __future__ import print_function
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr
np.random.seed(1)
# Create sample data.
a = np.random.randn(40)
b = 4*np.random.randn(50)
# Use scipy.stats.ttest_ind.
t, p = ttest_ind(a, b, equal_var=False)
print("ttest_ind: t = %g p = %g" % (t, p))
# Compute the descriptive statistics of a and b.
abar = a.mean()
avar = a.var(ddof=1)
na = a.size
adof = na - 1
bbar = b.mean()
bvar = b.var(ddof=1)
nb = b.size
bdof = nb - 1
# Use scipy.stats.ttest_ind_from_stats.
t2, p2 = ttest_ind_from_stats(abar, np.sqrt(avar), na,
bbar, np.sqrt(bvar), nb,
equal_var=False)
print("ttest_ind_from_stats: t = %g p = %g" % (t2, p2))
# Use the formulas directly.
tf = (abar - bbar) / np.sqrt(avar/na + bvar/nb)
dof = (avar/na + bvar/nb)**2 / (avar**2/(na**2*adof) + bvar**2/(nb**2*bdof))
pf = 2*stdtr(dof, -np.abs(tf))
print("formula: t = %g p = %g" % (tf, pf))
The output:
ttest_ind: t = -1.5827 p = 0.118873
ttest_ind_from_stats: t = -1.5827 p = 0.118873
formula: t = -1.5827 p = 0.118873
As of SciPy 0.12.0, this functionality is built in (and it does in fact operate on samples of different sizes). In scipy.stats, the ttest_ind function performs Welch's t-test when the flag equal_var is set to False.
For example:
>>> import scipy.stats as stats
>>> sample1 = np.random.randn(10, 1)
>>> sample2 = 1 + np.random.randn(15, 1)
>>> t_stat, p_val = stats.ttest_ind(sample1, sample2, equal_var=False)
>>> t_stat
array([-3.94339083])
>>> p_val
array([ 0.00070813])