Python, Pandas & Chi-Squared Test of Independence

Python, Pandas & Chi-Squared Test of Independence - python

I am quite new to Python as well as Statistics. I'm trying to apply the Chi Squared Test to determine whether previous success affects the level of change of a person (percentage wise, this does seem to be the case, but I wanted to see whether my results were statistically significant).
My question is: Did I do this correctly? My results say the p-value is 0.0, which means that there is a significant relationship between my variables (which is what I want of course...but 0 seems a little bit too perfect for a p-value, so I'm wondering whether I did it incorrectly coding wise).
Here's what I did:
import numpy as np
import pandas as pd
import scipy.stats as stats
d = {'Previously Successful' : pd.Series([129.3, 182.7, 312], index=['Yes - changed strategy', 'No', 'col_totals']),
'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972], index=['Yes - changed strategy', 'No', 'col_totals']),
'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)], index=['Yes - changed strategy', 'No', 'col_totals'])}
total_summarized = pd.DataFrame(d)
observed = total_summarized.ix[0:2,0:2]
Output:
Observed
expected = np.outer(total_summarized["row_totals"][0:2],
total_summarized.ix["col_totals"][0:2])/1000
expected = pd.DataFrame(expected)
expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(chi_squared_stat)
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
df = 8) # *
print("Critical value")
print(crit)
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, # Find the p-value
df=8)
print("P value")
print(p_value)
stats.chi2_contingency(observed= observed)
Output
Statistics

A few corrections:
Your expected array is not correct. You must divide by observed.sum().sum(), which is 1284, not 1000.
For a 2x2 contingency table such as this, the degrees of freedom is 1, not 8.
Your calculation of chi_squared_stat does not include a continuity correction. (But it isn't necessarily wrong to not use it--that's a judgment call for the statistician.)
All the calculations that you perform (expected matrix, statistics, degrees of freedom, p-value) are computed by chi2_contingency:
In [65]: observed
Out[65]:
Previously Successful Previously Unsuccessful
Yes - changed strategy 129.3 260.17
No 182.7 711.83
In [66]: from scipy.stats import chi2_contingency
In [67]: chi2, p, dof, expected = chi2_contingency(observed)
In [68]: chi2
Out[68]: 23.383138325890453
In [69]: p
Out[69]: 1.3273696199438626e-06
In [70]: dof
Out[70]: 1
In [71]: expected
Out[71]:
array([[ 94.63757009, 294.83242991],
[ 217.36242991, 677.16757009]])
By default, chi2_contingency uses a continuity correction when the contingency table is 2x2. If you prefer to not use the correction, you can disable it with the argument correction=False:
In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)
In [74]: chi2
Out[74]: 24.072616672232893
In [75]: p
Out[75]: 9.2770200776879643e-07

degrees of freedom = (row-1)x(column-1). For a 2x2 table it is (2-1)x(2-1) = 1

Related

Instability in fitting data using Scipy Optimize library

I am attempting to fit a function to a set of data I have. The function in question is:
x(t) = - B + sqrt(AB(t-t0) + (x0 + B)^2)
I have tried to fit my data (included at the bottom) to this using two different methods but have found that whatever I do the fit for B is extremely unstable. Changing either the method or the initial guess wildly changes the output value. In addition, when I look at the error for this fit using curve_fit the error is almost two orders of magnitude higher than the value. Does anyone have some suggestions on what I should do to decrease the error?
import numpy as np
import scipy.optimize as spo
def modelFun(t,A,B):
return -B + np.sqrt(A*B*(t-t0) + np.power(x0 + B,2))
def errorFun(k,time,data):
A = k[0]
B = k[1]
return np.sum((data-modelFun(time,A,B))**2)
data = np.genfromtxt('testdata.csv',delimiter=',',skip_header = 1)
time = data[:,0]
xt = data[:,1]
t0 = data[0,0]
x0 = data[0,1]
minErrOut = spo.minimize(errorFun,[1,1000],args=(time,xt),bounds=((0,None),(0,None)))
(curveOut, curveCovar) = spo.curve_fit(modelFun,time,xt,p0=[1,1000],method='dogbox',bounds=([-np.inf,0],np.inf))
print('minimize result: A={}; B={}'.format(*minErrOut.x))
print('curveFit result: A={}; B={}'.format(*curveOut))
print('curveFit Error: A={}; B={}'.format(*np.sqrt(np.diag(curveCovar))))
Datafile:
Time,x
201,2.67662
204,3.28159
206,3.44378
208,3.72537
210,3.94826
212,4.36716
214,4.65373
216,5.26766
219,5.59502
221,6
223,6.22189
225,6.49652
227,6.799
229,7.30846
231,7.54229
232,7.76517
233,7.6209
234,7.89552
235,7.94826
236,8.17015
237,8.66965
238,8.66965
239,8.8398
240,8.88856
241,9.00697
242,9.45075
243,9.51642
244,9.63483
245,9.63483
246,10.07861
247,10.02687
248,10.24876
249,10.31443
250,10.47164
251,10.99502
252,10.92935
253,11.0995
254,11.28358
255,11.58209
256,11.53035
257,11.62388
258,11.93632
259,11.98806
260,12.26269
261,12.43284
262,12.60299
263,12.801
264,12.99502
265,13.08557
266,13.25572
267,13.32139
268,13.57114
269,13.76617
270,13.88358
271,13.83184
272,14.10647
273,14.27662
274,14.40796

TL;DR;
Your dataset is linear and misses observations at larger timescale. Therefore you can capture A which is proportional to the slope while your model needs to keep B large (and potentially unbounded) to inhibit the square root trend.
This can be confirmed by developing Taylor series of your model and analyzing the MSE surface associated to the regression.
In a nutshell, considering this kind of dataset and the given model, accept A don't trust B.
MCVE
First, let's reproduce your problem:
import io
import numpy as np
import pandas as pd
from scipy import optimize
stream = io.StringIO("""Time,x
201,2.67662
204,3.28159
...
273,14.27662
274,14.40796""")
data = pd.read_csv(stream)
# Origin Shift:
data = data.sub(data.iloc[0,:])
data = data.set_index("Time")
# Simplified model:
def model(t, A, B):
return -B + np.sqrt(A*B*t + np.power(B, 2))
# NLLS Fit:
parameters, covariance = optimize.curve_fit(model, data.index, data["x"].values, p0=(1, 1000), ftol=1e-8)
# array([3.23405915e-01, 1.59960168e+05])
# array([[ 3.65068730e-07, -3.93410484e+01],
# [-3.93410484e+01, 9.77198860e+12]])
The adjustment is fair:
But as you noticed model parameters differ from several order of magnitude which can prevent optimization to perform properly.
Notice that your dataset is quite linear. The observed effect is not surprising and is inherent the chosen model. B parameter must be several orders of magnitude bigger than A to keep the linear behaviour.
This claim is supported by the analysis of the first terms of the Taylor series:
def taylor(t, A, B):
return (A/2*t - A**2/B*t**2/8)
parameters, covariance = optimize.curve_fit(taylor, data.index, data["x"].values, p0=(1, 100), ftol=1e-8)
parameters
# array([3.23396685e-01, 1.05237134e+09])
Without surprise the slope of your linear dataset can be captured while the parameter B can be arbitrary large and will cause float arithmetic errors during optimization (hence the minimize warning bellow you have got).
Analyzing Error Surface
The problem can be reformulated as a minimization problem:
def objective(beta, t, x):
return np.sum(np.power(model(t, beta[0], beta[1]) - x, 2))
result = optimize.minimize(objective, (1, 100), args=(data.index, data["x"].values))
# fun: 0.6594398116927569
# hess_inv: array([[8.03062155e-06, 2.94644208e-04],
# [2.94644208e-04, 1.14979735e-02]])
# jac: array([2.07304955e-04, 6.40749931e-07])
# message: 'Desired error not necessarily achieved due to precision loss.'
# nfev: 389
# nit: 50
# njev: 126
# status: 2
# success: False
# x: array([3.24090627e-01, 2.11891188e+03])
If we plot the MSE associated to your dataset, we get the following surface:
We have a canyon that is narrow on A space but seems unbounded at least for first decades on B space. This is supporting your observations in your post and comments. It also brings a technical insight on why we cannot fit B properly.
Performing the same operation on synthetic dataset:
t = np.linspace(0, 1000, 100)
x = model(t, 0.35, 20)
data = pd.DataFrame(x, index=t, columns=["x"])
To have the square root shape in addition of the linear trend at the beginning.
result = optimize.minimize(objective, (1, 0), args=(data.index, data["x"].values), tol=1e-8)
# fun: 1.9284246829733202e-10
# hess_inv: array([[ 4.34760333e-05, -4.42855253e-03],
# [-4.42855253e-03, 4.59219063e-01]])
# jac: array([ 4.35726463e-03, -2.19158602e-05])
# message: 'Desired error not necessarily achieved due to precision loss.'
# nfev: 402
# nit: 94
# njev: 130
# status: 2
# success: False
# x: array([ 0.34999987, 20.000013 ])
This version of the problem has following MSE surface:
Showing a potential convex valley around the known solution which explain why you are able to fit both parameters when there are sufficient large time acquisition.
Notice the valley is strongly stretched meaning that in this scenario your problem will benefit from normalization.

What test (null hypothesis) does a model's `f_pvalue` correspond to?

What is the null hypothesis behind an OLSResults's f_pvalue attribute? This docstring is not particularly useful.
At first I thought the null hypothesis was that all estimated coefficients are simultaneously zero (including the constant term). However, I am starting to think that the hypothesis being tested is that all estimated parameters except for the constant term are simultaneously zero (i.e. b1 = b2 = ... = bp = 0, excluding b0).
For example, suppose y is an array of targets and X is a numpy matrix of features (a constant term and p features).
# Silly example
from statsmodels.api import OLS
m = OLS(endog=y, exog=X).fit()
# What is being tested here?
print(m.f_pvalue)
Does anyone know what the null hypothesis is?

Thanks to #Josef for clearing things up. As per the documentation:
F-statistic of the fully specified model.
Calculated as the mean squared error of the model divided by the mean squared error of the residuals if the nonrobust covariance is used. Otherwise computed using a Wald-like quadratic form that tests whether all coefficients (excluding the constant) are zero.
And just to prove that this is the case:
# Libraries
import numpy as np
import pandas as pd
from statsmodels.api import OLS
from sklearn.datasets import load_boston
# Load target
y = pd.DataFrame(load_boston()['target'], columns=['price'])
# Load features
X = pd.DataFrame(load_boston()['data'], columns=load_boston()['feature_names'])
# Add constant
X['CONST'] = 1
# One feature
m1 = OLS(endog=y, exog=X[['CONST','CRIM']]).fit()
print(f'm1 pvalue: {m1.f_pvalue}')
# Multiple features
m2 = OLS(endog=y, exog=X[['CONST','CRIM','AGE']]).fit()
print(f'm2 pvalue: {m2.f_pvalue}')
# Manually test H0: all coefficients are zero (excluding b0)
print('Manual F-test for m1', m1.f_test(r_matrix=np.matrix([[0,0],[0,1]])),
'Manual F-test for m2', m2.f_test(r_matrix=np.matrix([[0,0,0],[0,1,0],[0,0,1]])),
sep='\n')
# Output
"""
> m1 pvalue: 1.1739870821944483e-19
> m2 pvalue: 2.2015246345918656e-27
> Manual F-test for m1
> <F test: F=array([[89.48611476]]), p=1.1739870821945733e-19, df_denom=504, df_num=1>
> Manual F-test for m2
> <F test: F=array([[69.51929476]]), p=2.2015246345920063e-27, df_denom=503, > df_num=2>
"""
So yes, f_pvalue matches the p-value of manually entering the null hypothesis.

Confidence Interval for Sample Mean in Python (Different from Manual)

I'm trying to create some material for introductory statistics for a seminar. The above code computes a 95% confidence interval for estimating the mean, but the result is not the same from the one implemented in Python. Is there something wrong with my math / code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)

You've got 2 errors.
First, you are using 2 for the multiplier for the CI. The more accurate value is 1.96. "2" is just a convenient estimator. That is making your CI generated manually too fat.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in difference because you have 199 degrees of freedom for the t-dist, which is basically the normal.
Below is the z-score of 1.96 and computation of CI with apples-to-apples comparison to the norm distribution vs. t.
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)

Different result for std between pandas and numpy

I am trying to subtract every element in the column from its mean and divide by the standard deviation. I did it in two different ways (numeric_data1 and numeric_data2):
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0))
type(numeric_data1) # -> pandas.core.frame.DataFrame
type(numeric_data2) # -> pandas.core.frame.DataFrame
Both are pandas dataframes and they should have the same result. However, I get different results:
numeric_data2 == numeric_data1 # -> False
I think the problem stems from how numpy and pandas handle numeric precision:
numeric_data.mean() == np.mean(numeric_data, axis=0) # -> True
numeric_data.std(axis=0) == np.std(numeric_data, axis=0) # -> False
For mean numpy and pandas gave me the same thing, but for standard deviation, I got little different results.
Is my assessment correct or am I making some blunder?

When calculating the standard deviation it matters whether you are estimating the standard deviation of an entire population with a smaller sample of that population or are you calculating the standard deviation of the entire population.
If it is a smaller sample of a larger population, you need what is called the sample standard deviation. As it turns out, when you divide the sum of squared differences from the mean by the number of observations, you end up with a biased estimator. We correct for that by dividing by one less than the number of observations. We control for this with the argument ddof=1 for sample standard deviation or ddof=0 for population standard deviation.
Truth is, it doesn't matter much if your sample size is large. But you will see small differences.
Use the degrees of freedom argument in your pandas.DataFrame.std call:
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std(ddof=0)) # <<<
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0))
np.isclose(numeric_data1, numeric_data2).all() # -> True
Or in the np.std call:
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0, ddof=1)) # <<<
np.isclose(numeric_data1, numeric_data2).all() # -> True

Custom PDF from scipy.stats.rv_continuous unwanted upper-bound

I am attempting to generate a random probability density function of QSO's of certain luminosity with the form:
1/( (L/L_B^* )^alpha + (L/L_B^* )^beta )
where L_B^*, alpha, and beta are all constants. To do this, the following code is used:
import scipy.stats as st
logLbreak = 43.88
alpha = 3.4
beta = 1.6
class my_pdf(st.rv_continuous):
def _pdf(self,l_L):
#"l_L" in this is always log L
L = 10**(l_L/logLbreak)
D = 1/(L**alpha + L**beta)
return D
dist_Log_L = my_pdf(momtype = 0, a = 0,name='l_L_dist')
distro = dist_Log_L.rvs(size = 10000)
(L/L^* is rased to a power of 10 since everything is being done in a log scale)
The distribution is supposed to produce a graph that approximates this, trailing off to infinity, but in reality the graph it produces looks like this (10,000 samples). The upper bound is the same regardless of the amount of samples that are used. Is there a reason it is being restricted in the way it is?

Your PDF is not properly normalized. The integral of a PDF over the domain must be 1. Your PDF integrates to approximately 3.4712:
In [72]: from scipy.integrate import quad
In [73]: quad(dist_Log_L._pdf, 0, 100)
Out[73]: (3.4712183965415373, 2.0134487716044682e-11)
In [74]: quad(dist_Log_L._pdf, 0, 800)
Out[74]: (3.4712184965748905, 2.013626296581202e-11)
In [75]: quad(dist_Log_L._pdf, 0, 1000)
Out[75]: (3.47121849657489, 8.412130378805368e-10)
This will break the class's implementation of inverse transform sampling. It will only generate samples from the domain up to where the integral of the PDF from 0 to x first reaches 1.0, which in your case is about 2.325
In [81]: quad(dist_Log_L._pdf, 0, 2.325)
Out[81]: (1.0000875374350238, 1.1103202107010366e-14)
That is, in fact, what you see in your histogram.
As a quick fix to verify the issue, I modified the return statement of the _pdf() method to:
return D/3.47121849657489
and ran your script again. (In a real fix, that value will be a function of the other parameters.) Then the commands
In [85]: import matplotlib.pyplot as plt
In [86]: plt.hist(distro, bins=31)
generates this plot:

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python, Pandas & Chi-Squared Test of Independence - python

degrees of freedom = (row-1)x(column-1). For a 2x2 table it is (2-1)x(2-1) = 1

Related

Instability in fitting data using Scipy Optimize library

What test (null hypothesis) does a model's `f_pvalue` correspond to?

Confidence Interval for Sample Mean in Python (Different from Manual)

Different result for std between pandas and numpy

Custom PDF from scipy.stats.rv_continuous unwanted upper-bound

Categories

Resources