I'm running a chi-square test on some categorical data about race and whether members of different racial groups participated in a clinic. As there are about a dozen different races in this data, I bucketed them down to 'White', 'Black' and 'Other' just for the purposes of testing (the correlations indicated most of the activity occurring between 'White' and 'Black'). However, using Python's chi2_contingency() function, I'm getting results back that seem unusual. The table is below:
Appointment Status      No       Yes
Black                 9170     33372
White                15137    152307
Other                 8864     56165
The Python method returns the following:
X^2: 5207.16
p-value: 0.0
df: 2
expected values array:
array([[  5131.21350472,  37410.78649528],
       [  7843.48838791,  57185.51161209],
       [ 20196.29810738, 147247.70189262]])
The df is good, but the chi-square statistic and p-value both don't seem right. Is there anything methodological I might be doing that would produce these values, or might there be something going on behind the scenes in Python that's causing this? Thanks!
The test statistic and p-value are correct (and, on reflection, understandable). Let me explain the outcome step by step. The section entitled "Example chi-squared test for categorical data" on Wikipedia (https://en.wikipedia.org/wiki/Chi-squared_test#Example_chi-squared_test_for_categorical_data) might help as well.
The expected count is the number of observations that would end up in a given cell of the table if we assume independence. The fractions of Black and No are 0.15468974 and 0.12061524, respectively. Under independence, we therefore expect 0.15468974 × 0.12061524 × 275015 = 5131.21350472 observations in the sample to be labeled both Black and No (note: 275015 is the total number of observations).
All other expected counts are calculated similarly. Note that the differences between the expected and observed counts (i.e. the numbers in your table) are rather large. This is a first indication that the null hypothesis of independence might be false.
The test statistic is calculated by computing (Obs-Exp)^2/Exp for each cell and summing over all cells in the table. The result is indeed 5207.162302393083 (see the code below). Under the null hypothesis, this test statistic is chi-squared distributed with 2 df (as you already mentioned). Compared to this distribution, the value 5207.162302393083 lies extremely far in the tail, making it vanishingly unlikely to observe such an outcome under the null of independence. The p-value is therefore reported as zero.
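To see the underflow explicitly (this check is not part of the code below, just an illustration), you can evaluate the survival function of the chi-squared distribution with 2 df at the test statistic; analytically it equals exp(-5207.16/2), roughly 1e-1131, which is far below the smallest positive double (~1e-308) and is therefore reported as exactly 0.0:
from scipy.stats import chi2

# P(X >= 5207.16) for a chi2 distribution with 2 degrees of freedom
print(chi2.sf(5207.162302393083, df=2))  # 0.0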
The code posted below replicates all the numbers and plots the PDF of the chi2 distribution with 2 degrees of freedom. I hope this helps.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
# Data and properties
TrueCounts = np.array( [ [9170,33372],[15137,152307],[8864,56165] ])
Datadimension = TrueCounts.shape
TotalCounts = np.sum(TrueCounts)
print(TotalCounts)
# Fractions
fracAnswer = np.sum(TrueCounts, axis=0)/TotalCounts
fracRace = np.sum(TrueCounts, axis=1)/TotalCounts
# Calculate expected counts under independence
ExpCounts = np.zeros(np.shape(TrueCounts))
for iter1 in range(Datadimension[0]):
    for iter2 in range(Datadimension[1]):
        ExpCounts[iter1, iter2] = fracRace[iter1]*fracAnswer[iter2]*TotalCounts
print('=== Fractions ===')
print(fracAnswer)
print(fracRace)
print('=== True and expected counts ===')
print(TrueCounts)
print(ExpCounts)
print('=== Test summary ===')
TestStat = np.sum( (TrueCounts-ExpCounts)**2/ExpCounts )
print(TestStat)
# Plot the chi2 pdf with 2 df for comparison
x = np.arange(0, 20, 0.05)
plt.plot(x, chi2.pdf(x, df=2))
plt.show()
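For completeness (not part of the original answer), scipy.stats.chi2_contingency, the function used in the question, reproduces the statistic, p-value, degrees of freedom and expected counts directly from the observed table:
from scipy.stats import chi2_contingency

# Reuses TrueCounts from the block above; no Yates correction is applied to tables larger than 2x2
stat, p, dof, expected = chi2_contingency(TrueCounts)
print(stat, p, dof)
print(expected)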
I have two pandas DataFrames, data1 and data2, and both have an integer column h filled with values varying from 1 to 50.
data1 has a sample size of roughly 55000, whereas data2 has a sample size of roughly 8000. I am not able to upload the exact data due to their sizes, but below are the histograms I created of data1['h'] vs. data2['h']:
(I applied matplotlib.pyplot.yscale('log') for easier comparison)
To compare the distribution, I used ks_2samp from scipy.stats. I composed one two-tailed test and two one-tailed tests to observe both directions of superiority:
# h indices are significantly different
print(ks_2samp(data1['h'], data2['h']))
# data1 h indices are greater
print(ks_2samp(data1['h'], data2['h'], alternative='greater'))
# data2 h indices are greater
print(ks_2samp(data1['h'], data2['h'], alternative='less'))
The results were the following:
Ks_2sampResult(statistic=0.1293719140156916, pvalue=3.448839769104661e-105)
Ks_2sampResult(statistic=0.0, pvalue=1.0)
Ks_2sampResult(statistic=0.1293719140156916, pvalue=1.5636837258561576e-105)
I have used ks_2samp before for other projects, but seeing such extreme p-values is new to me. The second result especially makes me wonder whether I'm performing the test incorrectly, as a p-value of exactly 1.0 seems absurd.
I've researched similar issues, including the following StackOverflow question (scipy p-value returns 0.0), but unfortunately this issue does not appear identical to anything reported so far.
I'd love any insight into interpreting these results or fixing my approach.
The problem does not seem to be with your code, but with your interpretation. We can see that data1 is shifted to the right, so I construct two normal distributions with the same kind of shift, plot their histograms, and run the two-sample Kolmogorov-Smirnov test to show that the results you got are in line with expectations.
Setup:
from scipy.stats import ks_2samp
from numpy import random
import pandas as pd
from matplotlib import pyplot
random.seed(1)
n = 4000
l1 = [random.normal(1) for x in range(n)]   # sample 1: normal with mean 1 (shifted right)
l2 = [random.normal() for x in range(n)]    # sample 2: standard normal with mean 0
df = pd.DataFrame(list(zip(l1, l2)), columns=['1', '2'])
Tests:
print(ks_2samp(df['1'], df['2']))
print(ks_2samp(df['1'], df['2'], alternative='greater'))
print(ks_2samp(df['1'], df['2'], alternative='less'))
Returns:
KstestResult(statistic=0.3965, pvalue=3.8418108959960396e-281)
KstestResult(statistic=0.0, pvalue=1.0)
KstestResult(statistic=0.3965, pvalue=1.9209054479980054e-281)
Graphical representation:
bins = 50
pyplot.hist(l1, bins, alpha=.5, label='Sample 1')
pyplot.hist(l2, bins, alpha=.5, label='Sample 2')
pyplot.legend()
pyplot.show()
So what's going on here?
The first KS test rejects the null hypothesis that the distributions are equivalent, and it does this with high confidence (pvalue is basically zero). The second one tells us that we cannot reject the hypothesis that sample 1 is greater than sample 2. This is obvious from what we know - sample 1 is pulled from the same population as sample 2, but shifted to the right. The third again rejects a null hypothesis, but this h0 is that sample 1 is smaller than sample 2. Notice that the pvalue here is the smallest - there is a smaller chance that sample 1 is less than sample 2 than that they are pulled from equivalent distributions. This is again as expected.
Also notice, with this example, that both distributions are normal and very similar. But the KS test tells you that "the populations may differ in median, variability or the shape of the distribution"(reference). Here, they differ in median but not shape, and this is detected.
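To make the one-sided statistics concrete, here is a small sketch (not part of the original answer) that rebuilds the empirical CDFs of the two simulated samples by hand; the 'greater' statistic is essentially the largest amount by which sample 1's CDF exceeds sample 2's, which is about zero here because sample 1 is shifted to the right:
import numpy as np

# Evaluate both empirical CDFs on the pooled, sorted data
pooled = np.sort(np.concatenate([l1, l2]))
cdf1 = np.searchsorted(np.sort(l1), pooled, side='right') / len(l1)
cdf2 = np.searchsorted(np.sort(l2), pooled, side='right') / len(l2)

print(np.max(cdf1 - cdf2))          # ~0.0: sample 1's CDF never rises above sample 2's
print(np.max(cdf2 - cdf1))          # ~0.3965: matches the 'less' statistic
print(np.max(np.abs(cdf1 - cdf2)))  # the two-sided KS statistic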
I have data in this format:
[0.266465 0.9203907 1.007363 ... 0. 0.09623989 0.39632136]
That is the value in the first row, first column. This is the value in the first row, second column:
[0.9042176 1.135085 1.2988662 ... 0. 0.13614458 0.28000486]
I have 2200 such rows, and I want to train a classifier to identify whether the two sets of values are similar or not.
P.S.- These are extracted feature vector values.
If you assume the relation between the two extracted feature vectors to be linear, you could try using the Pearson correlation:
import numpy as np
from scipy.stats import pearsonr
list1 = np.random.random(100)
list2 = np.random.random(100)
pearsonr(list1, list2)
An example output is:
(0.0746901299996632, 0.4601843257734832)
Here the first value is the correlation (about 7%) and the second is its significance, i.e. the p-value (with p > 0.05 you fail to reject the null hypothesis that the correlation is insignificant at significance level alpha = 5%). If the vectors are correlated, they are in that sense similar. More about the method here.
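If the goal is to score each of the 2200 pairs, one correlation per row is a natural similarity feature. A minimal sketch, where X, Y and the random data are placeholders I made up for the two sets of extracted feature vectors:
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.random((2200, 100))   # placeholder for the first set of feature vectors
Y = rng.random((2200, 100))   # placeholder for the second set

# One correlation and p-value per pair of vectors; the correlations can then be
# thresholded or used as an input feature for a classifier.
corrs, pvals = [], []
for x, y in zip(X, Y):
    r, p = pearsonr(x, y)
    corrs.append(r)
    pvals.append(p)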
Also, I came across normalized cross-correlation, which is used for identifying similarity between images (I'm not an expert, so rather check this).
I have a data set with about 10 continuous features and 1000 binary (categorical) features. After scaling and normalizing the data so that every feature has a mean of 0.0, I perform PCA on the data to get a reduced matrix z, keeping about 90% of the variance (~700 principal components).
If I now want to fit a normal distribution to the data, I use the code below:
import numpy as np
from scipy.stats import multivariate_normal
mean = np.mean(z, axis=0)
cov = np.cov(z, rowvar=0)
g = multivariate_normal(mean=mean, cov=cov)
# put a random sample of z through the pdf
# to check the probability of the sample occurring
print(g.pdf(z[np.random.randint(z.shape[0]),:]))
>>> 0.0
The problem is, no matter how many times I run print(g.pdf(z[np.random.randint(z.shape[0]),:])), I get an output of 0.0. I appreciate that some samples in z will lie further from the mean than average, giving an answer close to 0.0, but I would have thought that at least some samples in z would be closer to the mean and therefore give a much larger answer when put through the pdf.
This may be something to do with my original data, or the reduced dataset z and how these are distributed. But I have performed multiple checks on both the original data set and z to ensure there are no nan values, ensuring the mean of each column of z is in fact 0.0 etc.
My results suggest that I have a Gaussian with extremely thin tails (a very narrow Gaussian), so that everything is far from the peak. I don't think this should be the case.
Am I using multivariate_normal correctly? Are there any other checks I could perform on the data or otherwise? I know I am making a big assumption that the data is normally distributed, but surely not all values in z should yield a pdf value of 0.0.
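For what it's worth (this is not an answer from the original thread, just a sanity check): in roughly 700 dimensions the density of even a perfectly typical sample is far below the smallest representable double, so pdf returning 0.0 is expected and logpdf is the more informative quantity. A sketch with a standard normal, using an identity covariance as a stand-in for the fitted one:
import numpy as np
from scipy.stats import multivariate_normal

d = 700                           # roughly the number of retained principal components
rng = np.random.default_rng(0)
x = rng.standard_normal(d)        # a typical draw from a d-dimensional standard normal

g = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))
print(g.pdf(x))     # 0.0 -- the density is about exp(-1000), which underflows double precision
print(g.logpdf(x))  # around -1000, finite and usable for comparing samples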
My problem:
I have an array of ufloats (i.e. a uarray) from Python's uncertainties package.
All values in the array have their own errors, and I need a function that gives me the average of the array with respect to both the error I get when calculating the mean of the nominal values and the influence of the values' errors.
I have an uarray:
2 +/- 1
3 +/- 2
4 +/- 3
and need a function that gives me an average value of the array.
Thanks
Assuming Gaussian statistics, the uncertainties stem from Gaussian parent distributions. In such a case, it is standard to weight the measurements (nominal values) by the inverse variance. Applying this weighting to the general weighted average gives,
$$ \frac{\sum_i w_i x_i}{\sum_i w_i} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} $$.
One need only perform good ol' error propagation on this to get the uncertainty of the weighted average as,
$$ \sqrt{\frac{N}{\sum_i 1/\sigma_i^2}} $$
(where $N$ is the number of measurements, matching the code below).
I don't have an n-element implementation written out on hand, but here's how one could get the weighted average and its uncertainty in a simple two-value case:
import numpy as np
import uncertainties as un

a = un.ufloat(5, 2)
b = un.ufloat(8, 4)
wavg = un.ufloat((a.n/a.s**2 + b.n/b.s**2)/(1/a.s**2 + 1/b.s**2),
                 np.sqrt(2/(1/a.s**2 + 1/b.s**2)))
print(wavg)
>>> 5.6+/-2.5298221281347035
As one would expect, the result tends more towards the value with the smaller uncertainty. This is good, since a smaller uncertainty in a measurement implies that its associated nominal value is closer to the true value of the parent distribution than one with a larger uncertainty.
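Not part of the original answer, but the same two formulas can be generalized to an arbitrary-length uarray; the function name weighted_mean is my own, and the uncertainty follows the sqrt(N / sum(1/sigma_i^2)) convention used in the code above:
import numpy as np
import uncertainties as un
from uncertainties import unumpy

def weighted_mean(uarr):
    # Inverse-variance weighted mean of an array of ufloats
    x = unumpy.nominal_values(uarr)
    s = unumpy.std_devs(uarr)
    w = 1.0 / s**2
    return un.ufloat(np.sum(w * x) / np.sum(w), np.sqrt(len(x) / np.sum(w)))

print(weighted_mean(unumpy.uarray([5, 8], [2, 4])))  # 5.6+/-2.5, matching the example above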
Unless I'm missing something, you could calculate the sum divided by the length of the array:
from uncertainties import unumpy, ufloat
import numpy as np
arr = np.array([ufloat(2, 1), ufloat(3, 2), ufloat(4,3)])
print(sum(arr)/len(arr))
# 3.0+/-1.2
You can also define it like this:
arr1 = unumpy.uarray([2, 3, 4], [1, 2, 3])
print(sum(arr1)/len(arr1))
# 3.0+/-1.2
uncertainties takes care of the rest.
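For reference, the propagated uncertainty of this plain mean is sqrt(1^2 + 2^2 + 3^2)/3 = sqrt(14)/3 ≈ 1.25, which uncertainties displays as 1.2.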
I used Captain Morgan's answer to serve up some sweet Python code for a project and discovered that it needed a little extra ingredient:
import numpy as np
import uncertainties as un
from uncertainties import unumpy as unp

# `values` is the iterable of ufloats to be averaged.
# epsilon is a tiny offset that prevents division by zero for values recorded with zero uncertainty.
epsilon = unp.nominal_values(values).mean()/(1e12)
wavg = un.ufloat(sum([v.n/(v.s**2+epsilon) for v in values])/sum([1/(v.s**2+epsilon) for v in values]),
                 np.sqrt(len(values)/sum([1/(v.s**2+epsilon) for v in values])))
if wavg.s <= np.sqrt(epsilon):
    wavg = un.ufloat(wavg.n, 0.0)
Without that little something (epsilon) we'd get div/0 errors from observations recorded with zero uncertainty.
If you already have a .csv file which stores variables in 'mean+/-std' format, you could try the code below; it works for me.
import pandas as pd
from uncertainties import ufloat_fromstr

df = pd.read_csv(r'Z:\compare\SL2P_PAR.csv')
for i in range(len(df['uncertainty'])):
    df.loc[i, 'mean'] = ufloat_fromstr(df['uncertainty'][i]).n
    df.loc[i, 'std'] = ufloat_fromstr(df['uncertainty'][i]).s
I used scipy.stats.normaltest() to test the normality of the data generated by numpy.random.normal(). Here is the code:
import numpy
import scipy.stats

for i in range(0, 10):
    d = numpy.random.normal(size=50000)
    n = scipy.stats.normaltest(d)
    print(n)
Here are the results:
(1.554124262066523, 0.45975472830684272)
(2.4982341884494002, 0.28675786530134384)
(2.0918010143075256, 0.35137526093176125)
(0.90623072927961634, 0.63564479846313271)
(2.3015160217986934, 0.31639684620041014)
(3.4005006481463624, 0.18263779969208352)
(2.5241123233368978, 0.28307138716898311)
(12.705060069198185, 0.001742333391388526)
(0.83646951793409796, 0.65820769012847313)
(0.12008522338293379, 0.94172440425950443)
According to the documentation here, the second element of the value returned by normaltest() is
pvalue : float or array
A 2-sided chi squared probability for the hypothesis test.
If my understanding is correct, it indicates how likely it is that the input data is normally distributed. I had expected all the p-values generated by the above code to be very close to 1. However, some of them can be as small as 0.001742333391388526. What's wrong here?
If my understanding is correct, it indicates how likely it is that the input data is normally distributed. I had expected all the p-values generated by the above code to be very close to 1.
Your understanding is incorrect, I'm afraid. The p-value is the probability of getting a result that is at least as extreme as the observation under the null hypothesis (i.e. under the assumption that the data actually is normally distributed). It does not need to be close to 1. Usually, p-values greater than 0.05 are considered not significant, which means that normality has not been disproved by the test.
As pointed out by Victor Chubukov, you can get low p-values simply by chance, even if the data is really normally distributed.
Statistical hypothesis testing is rather complex and can appear somewhat counterintuitive. If you need more details, Cross Validated is the place to get them.
Someone can come and yell at me about how this isn't the proper definition of the p-value, but as a back-of-the-envelope estimate, you can expect to get a p-value as low as x with probability x. So you'll get a p-value as low as 0.00174 about once every 575 tries.
import numpy as np
from scipy.stats import normaltest
import matplotlib.pyplot as plt
%matplotlib inline
L = []
for i in range(0, 10000):
    d = np.random.normal(size=50000)
    n = normaltest(d)
    L.append(n.pvalue)

plt.hist(L, bins=20)
plt.show()
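Under the null hypothesis the p-values are approximately uniformly distributed between 0 and 1, so the histogram should come out roughly flat, and about 5% of the 10,000 runs can be expected to fall below 0.05; an occasional p-value near 0.0017 is therefore nothing to worry about.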