Simple t-test in Python with CIs of the difference

What is the most straightforward way to perform a t-test in Python that also gives CIs of the difference? I've seen various posts, but they all do it differently, and when I tried to calculate the CIs myself the result seemed slightly off. Here:
import numpy as np
from scipy import stats
g1 = np.array([48.7107107, 36.8587287, 67.7129929, 39.5538852, 35.8622661])
g2 = np.array([62.4993857, 49.7434833, 67.7516511, 54.3585559, 71.0933957])
m1, m2 = np.mean(g1), np.mean(g2)
dof = (len(g1)-1) + (len(g2)-1)
MSE = (np.var(g1) + np.var(g2)) / 2
stderr_diffs = np.sqrt((2 * MSE)/len(g1))
tcl = stats.t.ppf([.975], dof)
lower_limit = (m1-m2) - (tcl) * (stderr_diffs)
upper_limit = (m1-m2) + (tcl) * (stderr_diffs)
print(lower_limit, upper_limit)
returns:
[-30.12845447] [-0.57070077]
However, when I run the same test in SPSS, although I get the same t and p values, the CIs are -31.87286 to 1.17371, and R agrees with SPSS. I can't seem to find the correct way to do this and would appreciate some help.

You're subtracting 1 from each group size when you compute dof, but when you compute the variance you're not using the sample variance (np.var defaults to ddof=0):
MSE = (np.var(g1) + np.var(g2)) / 2
should be
MSE = (np.var(g1, ddof=1) + np.var(g2, ddof=1)) / 2
which gives me
[-31.87286426] [ 1.17370902]
That said, instead of doing the manual implementation, I'd probably use statsmodels' CompareMeans:
In [105]: import statsmodels.stats.api as sms
In [106]: r = sms.CompareMeans(sms.DescrStatsW(g1), sms.DescrStatsW(g2))
In [107]: r.tconfint_diff()
Out[107]: (-31.872864255548553, 1.1737090155485568)
(really we should be using a DataFrame here, not an ndarray, but I'm lazy).
Remember, though, that you'll want to consider what assumption you're making about the variance:
In [110]: r.tconfint_diff(usevar='pooled')
Out[110]: (-31.872864255548553, 1.1737090155485568)
In [111]: r.tconfint_diff(usevar='unequal')
Out[111]: (-32.28794665832114, 1.5887914183211436)
and if your g1 and g2 are representative, the assumption of equal variance might not be a good one.
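As a side note (not in the original answer): if you just want to cross-check the t and p values themselves, scipy's ttest_ind runs the same test directly, and its equal_var flag mirrors the pooled/unequal choice above. A minimal sketch using g1 and g2 from the question:
import numpy as np
from scipy import stats

g1 = np.array([48.7107107, 36.8587287, 67.7129929, 39.5538852, 35.8622661])
g2 = np.array([62.4993857, 49.7434833, 67.7516511, 54.3585559, 71.0933957])

# pooled-variance (Student) test, matching usevar='pooled'
t_pooled, p_pooled = stats.ttest_ind(g1, g2, equal_var=True)

# Welch test, matching usevar='unequal'
t_welch, p_welch = stats.ttest_ind(g1, g2, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)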

SymPy lambdify gives wrong result, while *.subs gives the accurate one

Sorry for bothering you with this. I have a serious issue and I'm on the clock to solve it, so here is my question.
When I lambdify a quantity, its value differs from the .subs result; sometimes it's way off, or it's a NaN where in reality there is a real number (which subs finds).
Here is a small MWE where you can see the issue. Thanks in advance for your time!
import sympy as sy
import numpy as np
# some quantities needed before you see the problem
r = sy.Symbol('r', real=True)
th = sy.Symbol('th', real=True)
e_c = 1e51
lf0 = 100
A = 1.6726e-24
# here are some quantities I define on the way to the problem
lfac = lf0 + 2
rd = 4*3.14/4/sy.pi/A/lfac**2
xi = r/rd  # rescaled r
# now to the problem:
# QUANTITY
lfxi = xi**(-3)*(lfac+1)/2*(sy.sqrt( 1 + 4*lfac/(lfac+1)*xi**(3) + (2*xi**(3)/(lfac+1))**2) -1)
#RESULT WITH SUBS
print(lfxi.subs({th:1.00,r:1.00}).evalf())
#RESULT WITH LAMBDIFY
lfxi_l = sy.lambdify((r,th),lfxi)
print(lfxi_l(1.00,1.00))
## gives 0.0, while subs gives ~102
The issue is that your mpmath precision needs to be set higher!
By default mpmath uses prec=53 and dps=15, but your expression requires much higher precision than that. Print it and you can see why:
# print(lfxi)
3.0256512324559e+62*(sqrt(1.09235114769539e-125*pi**6*r**6 + 6.74235013645028e-61*pi**3*r**3 + 1) - 1)/(pi**3*r**3)
...
from mpmath import mp
lfxi_l = sy.lambdify((r,th),lfxi, modules=["mpmath"])
mp.dps = 125
print(lfxi_l(1.00,1.00))
# 101.999... result
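An alternative that avoids touching the precision at all is to remove the catastrophic cancellation algebraically before lambdifying: sqrt(1 + u) - 1 is mathematically identical to u/(sqrt(1 + u) + 1), but the second form is stable in float64. A minimal sketch of the idea with a toy expression (not the full lfxi):
import sympy as sy

u = sy.Symbol('u', positive=True)
naive  = sy.sqrt(1 + u) - 1          # cancels catastrophically for tiny u
stable = u / (sy.sqrt(1 + u) + 1)    # algebraically identical, no cancellation

f_naive  = sy.lambdify(u, naive,  modules="numpy")
f_stable = sy.lambdify(u, stable, modules="numpy")

tiny = 1e-60
print(f_naive(tiny))   # 0.0 -- every significant digit lost
print(f_stable(tiny))  # 5e-61 -- correct at float64 precision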
Changing a couple of the constants to "modest" values:
In [89]: e_c=1; A=1
The different methods produce essentially the same thing:
In [91]: lfxi.subs({th:1.00,r:1.00}).evalf()
Out[91]: 1.00000000461176
In [92]: lfxi_l = sy.lambdify((r,th),lfxi)
In [93]: lfxi_l(1.0,1.00)
Out[93]: 1.000000004611762
In [94]: lfxi_m = sy.lambdify((r,th),lfxi, modules=["mpmath"])
In [95]: lfxi_m(1.0,1.00)
Out[95]: mpf('1.0000000046117619')
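Incidentally, the reason subs followed by evalf gets this right without any manual settings is that evalf raises its internal working precision until the digits it returns are reliable. A tiny illustration with the same kind of cancellation:
import math
import sympy as sy

expr = sy.sqrt(1 + sy.Rational(1, 10)**60) - 1
print(expr.evalf())              # ~5.0e-61: evalf detects the cancellation and retries
print(math.sqrt(1 + 1e-60) - 1)  # 0.0: plain float64 loses everything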

numpy vectorized approach to regression - multiple dependent columns (x) on a single independent column (y)

Consider the (3, 13) np.array below:
import numpy as np
import pandas as pd
from scipy.stats import linregress

a = [-0.00845,-0.00568,-0.01286,-0.01302,-0.02212,-0.01501,-0.02132,-0.00783,-0.00942,0.00158,-0.00016,0.01422,0.01241]
b = [0.00115,0.00623,0.00160,0.00660,0.00951,0.01258,0.00787,0.01854,0.01462,0.01479,0.00980,0.00607,-0.00106]
c = [-0.00233,-0.00467,0.00000,0.00000,-0.00952,-0.00949,-0.00958,-0.01696,-0.02212,-0.01006,-0.00270,0.00763,0.01005]
array = np.array([a, b, c])
yvalues = pd.to_datetime(['2019-12-15','2019-12-16','2019-12-17','2019-12-18','2019-12-19',
                          '2019-12-22','2019-12-23','2019-12-24','2019-12-25','2019-12-26',
                          '2019-12-29','2019-12-30','2019-12-31'], errors='coerce')
I can run the OLS regression on one row at a time successfully, as below:
out = linregress(array[0], y=yvalues.to_julian_date())
print(out)
LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252878, stderr=105.71465449878443)
However, what I wish to accomplish is to run the regression on the whole matrix in one go, with the 'y' variable (yvalues) held constant for every row (a loop is a possible solution, but tiresome). I tried to extend yvalues to match the array's shape with np.tile, but that doesn't seem to be the right approach. Thank you all for your help.
IIUC, you are looking for a vectorized version of the following list comprehension:
out = [linregress(array[i], y=yvalues.to_julian_date()) for i in range(array.shape[0])]
out
[LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252876, stderr=105.71465449878443),
LinregressResult(slope=178.44888292241782, intercept=2458838.7056912296, rvalue=0.1911788042719021, pvalue=0.5315353013148307, stderr=276.24376878908953),
LinregressResult(slope=106.86168938856262, intercept=2458840.7656617565, rvalue=0.17721031419860186, pvalue=0.5624701260912525, stderr=178.940293876864)]
To be honest, I've never seen what you are looking for implemented using scipy or statsmodels. We can, however, implement it ourselves by exploiting numpy broadcasting:
x = array
y = np.array(yvalues.to_julian_date())

# means of the inputs and of the output
x_mean = np.mean(x, axis=1)
y_mean = np.mean(y)

# slope = sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)**2), row by row
num = np.sum((x - x_mean[:, np.newaxis]) * (y - y_mean), axis=1)
den = np.sum((x - x_mean[:, np.newaxis])**2, axis=1)
slopes = num / den
intercepts = y_mean - slopes * x_mean
slopes
array([329.14108704, 178.44888292, 106.86168939])
intercepts
array([2458842.41173136, 2458838.70569123, 2458840.76566176])
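If you also want the r-values that linregress reports, the same broadcasting pattern extends naturally; a sketch reusing x, y, x_mean, and y_mean from above:
# Pearson r for each row: covariance / (std_x * std_y), fully vectorized
x_c = x - x_mean[:, np.newaxis]   # centered predictors, one row per regression
y_c = y - y_mean                  # centered response, shared by all rows
rvalues = (x_c * y_c).sum(axis=1) / np.sqrt((x_c**2).sum(axis=1) * (y_c**2).sum())
print(rvalues)  # should match the rvalue fields from the linregress loop above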

.std() & .skew() giving wrong answer with .rolling

I am using pandas version: '0.23.4'
While debugging my code I realized that std and skew do not give the correct results with a rolling window.
Check the code below:
import pandas as pd
import numpy as np
import scipy.stats as sp
df = pd.DataFrame(np.random.randint(1,10,(5)))
df_w = df.rolling(window=3, min_periods=1)
m1 = df_w.apply(lambda x: np.mean(x))
m2 = df_w.mean()
s1 = df_w.apply(lambda x: np.std(x))
s2 = df_w.std()
sk1 = df_w.apply(lambda x: sp.skew(x))
sk2 = df_w.skew()
The results for mean are the same, but those for std and skew are not.
Is this expected behavior, or am I missing something?
The difference is in the delta degrees of freedom (ddof).
NumPy defaults to ddof=0, while pandas defaults to ddof=1. This value affects how the std is normalized: the divisor of the variance is n - ddof.
If you specify ddof=0 in both, the results are the same:
s1 = df_w.apply(lambda k: np.std(k, ddof=0), raw=True)
s2 = df_w.std(ddof=0)
>>> (s1==s2).all()
True
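To see exactly what that normalization means, a quick standalone check of both divisors:
import numpy as np

x = np.array([4.0, 7.0, 2.0])
n = len(x)
ss = ((x - x.mean())**2).sum()

print(np.sqrt(ss / n), np.std(x))                # ddof=0: population std (numpy default)
print(np.sqrt(ss / (n - 1)), np.std(x, ddof=1))  # ddof=1: sample std (pandas default)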
Similarly, for skew, pandas calculates the unbiased skewness while scipy calculates the biased one by default. To get the same results, pass bias=False to scipy:
sk1 = df_w.apply(lambda x: sp.skew(x, bias=False))
sk2 = df_w.skew()
>>> (sk1==sk2).all()
True
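The skew case works the same way: with bias=False, scipy multiplies the biased estimate by the small-sample correction factor sqrt(n*(n-1))/(n-2). A quick check of that relationship:
import numpy as np
import scipy.stats as sp

x = np.array([5.0, 1.0, 8.0, 3.0])
n = len(x)

g1 = sp.skew(x)                            # biased (population) skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)   # adjusted Fisher-Pearson estimate
print(G1, sp.skew(x, bias=False))          # the two agree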

Estimate formants using LPC in Python

I'm new to signal processing (and numpy, scipy, and matlab for that matter). I'm trying to estimate vowel formants with LPC in Python by adapting this matlab code:
http://www.mathworks.com/help/signal/ug/formant-estimation-with-lpc-coefficients.html
Here is my code so far:
#!/usr/bin/env python
import sys
import numpy
import wave
import math
from scipy.signal import lfilter, hamming
from scikits.talkbox import lpc
"""
Estimate formants using LPC.
"""
def get_formants(file_path):
    # Read from file.
    spf = wave.open(file_path, 'r') # http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ae.wav

    # Get file as numpy array.
    x = spf.readframes(-1)
    x = numpy.fromstring(x, 'Int16')

    # Get Hamming window.
    N = len(x)
    w = numpy.hamming(N)

    # Apply window and high pass filter.
    x1 = x * w
    x1 = lfilter([1., -0.63], 1, x1)

    # Get LPC.
    A, e, k = lpc(x1, 8)

    # Get roots.
    rts = numpy.roots(A)
    rts = [r for r in rts if numpy.imag(r) >= 0]

    # Get angles.
    angz = numpy.arctan2(numpy.imag(rts), numpy.real(rts))

    # Get frequencies.
    Fs = spf.getframerate()
    frqs = sorted(angz * (Fs / (2 * math.pi)))

    return frqs

print get_formants(sys.argv[1])
Using this file as input, my script returns this list:
[682.18960189917243, 1886.3054773107765, 3518.8326108511073, 6524.8112723782951]
I didn't even get to the last steps, where they filter the frequencies by bandwidth, because the frequencies in the list aren't right. According to Praat, I should get something like this (the formant listing for the middle of the vowel):
Time_s F1_Hz F2_Hz F3_Hz F4_Hz
0.164969 731.914588 1737.980346 2115.510104 3191.775838
What am I doing wrong?
Thanks very much
UPDATE:
I changed this
x1 = lfilter([1., -0.63], 1, x1)
to
x1 = lfilter([1], [1., 0.63], x1)
as per Warren Weckesser's suggestion and am now getting
[631.44354635609318, 1815.8629524985781, 3421.8288991389031, 6667.5030877036006]
I feel like I'm missing something since F3 is very off.
UPDATE 2:
I realized that the order being passed to scikits.talkbox.lpc was off due to a difference in sampling frequency. Changed it to:
Fs = spf.getframerate()
ncoeff = 2 + Fs / 1000
A, e, k = lpc(x1, ncoeff)
Now I'm getting:
[257.86573127888488, 774.59006835496086, 1769.4624576002402, 2386.7093679399809, 3282.387975973973, 4413.0428174593926, 6060.8150432549655, 6503.3090645887842, 7266.5069407315023]
Much closer to Praat's estimation!
The problem had to do with the order being passed to the lpc function: 2 + fs/1000, where fs is the sampling frequency, is the rule of thumb according to:
http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html
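For anyone following along, the remaining bandwidth-filtering step from the MathWorks example translates roughly as below; this is a sketch, assuming rts, angz, and Fs from the code above (the 90 Hz and 400 Hz thresholds come from the linked MATLAB page):
# Bandwidths follow from the root magnitudes: bw = -1/2 * (Fs/(2*pi)) * ln|r|
bws = -0.5 * (Fs / (2 * math.pi)) * numpy.log(numpy.abs(rts))
frqs_all = angz * (Fs / (2 * math.pi))

# Keep roots with frequency > 90 Hz and bandwidth < 400 Hz -- the likely formants
formants = sorted(f for f, bw in zip(frqs_all, bws) if f > 90 and bw < 400)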
I have not been able to get the results you expect, but I do notice two things which might cause some differences:
Your code uses [1, -0.63] where the MATLAB code from the link you provided has [1 0.63].
Your processing is applied to the entire x vector at once instead of to smaller segments of it (see where the MATLAB code does this: x = mtlb(I0:Iend);).
Hope that helps.
There are at least two problems:
According to the link, the "pre-emphasis filter is a highpass all-pole (AR(1)) filter". The signs of the coefficients given there are correct: [1, 0.63]. If you use [1, -0.63], you get a lowpass filter.
You have the first two arguments to scipy.signal.lfilter reversed.
So, try changing this:
x1 = lfilter([1., -0.63], 1, x1)
to this:
x1 = lfilter([1.], [1., 0.63], x1)
I haven't tried running your code yet, so I don't know if those are the only problems.

Plotting a mixture distribution in sympy.stats

I'd like to create a mixture of two Gamma distributions and plot the result, evaluated over a given range.
It would appear that sympy.stats is capable of this, because it can compute the expectation of the mixture and sample from it. I'm quite new to sympy, so I'm not sure whether there's a better way of evaluating and plotting in this situation than the one I've been using.
%matplotlib inline
from matplotlib import pyplot as plt
from sympy.stats import Gamma, E, density
import numpy as np
G1 = Gamma("G1", 5, 2.5)
G2 = Gamma("G2", 4, 1.5)
f1 = 0.7; f2 = 1-f1
G3 = f1*G1 + f2*G2
Expectation gives me a single sensible number for all three:
In [19]: E(G1)
Out[19]: 12.5000000000000
In [20]: E(G2)
Out[20]: 6.00000000000000
In [21]: E(G3)
Out[21]: 10.5500000000000
...but plotting fails on the mixture
u = np.linspace(0, 50)
D1 = density(G1); D2 = density(G2); D3 = density(G3)
v1 = [D1.args[1].subs(D1.args[0][0], i).evalf() for i in u]
v2 = [D2.args[1].subs(D2.args[0][0], i).evalf() for i in u]
v3 = [D3.args[1].subs(D3.args[0][0], i).evalf() for i in u]
plt.plot(u, v1)
plt.plot(u, v2)
plt.plot(u, v3) # this one fails with error 'can't convert expression to float'
The problem would appear to be that the mixture terms still contain free symbols
In [44]: v1[0].free_symbols
Out[44]: set()
In [45]: v3[0].free_symbols
Out[45]: {x}
...as I said, sympy.stats appears to deal with this fine somehow when computing the expectation, so I assume I need to apply the same machinery here to evaluate and plot the mixture density(?)
It looks like this was fixed. I can reproduce your error in SymPy 0.7.3, but it works just fine in 0.7.4.1, the latest version.
First off, you don't need the finagling with .args. The expressions returned by density are callable; just call D1(i).evalf() to get the numerical value of D1 at i, like
D1 = density(G1); D2 = density(G2); D3 = density(G3)
v1 = [D1(i).evalf() for i in u]
v2 = [D2(i).evalf() for i in u]
v3 = [D3(i).evalf() for i in u]
I've uploaded a working version to http://nbviewer.ipython.org/gist/asmeurer/8486176.
