Confidence Interval for Sample Mean in Python (Different from Manual) - python

I'm trying to create some material for an introductory statistics seminar. The code below computes a 95% confidence interval for estimating the mean, but the result is not the same as the one produced by the Python library. Is there something wrong with my math / code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)

You've got 2 errors.
First, you are using 2 as the multiplier for the CI. The more accurate value is 1.96; "2" is just a convenient approximation, and it is making your manually generated CI too wide.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in the difference, because you have 199 degrees of freedom for the t-distribution, which is essentially the normal.
Below is a check that 1.96 is the 97.5th percentile of the standard normal, followed by the CI computed from the normal distribution, so you have an apples-to-apples comparison of the manual calculation against the library result.
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)
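For completeness, here is a small sketch (not part of the original answer) that computes the exact multipliers with ppf, so the manual interval lines up with st.t.interval:
# exact two-sided multipliers for a 95% interval
z_mult = st.norm.ppf(0.975)        # ~1.96, the normal multiplier
t_mult = st.t.ppf(0.975, df=199)   # ~1.97 with 199 degrees of freedom
# manual t-based interval; this should match st.t.interval(0.95, 199, loc=sample_mean, scale=se)
print('(', sample_mean - t_mult * standard_error, ',', sample_mean + t_mult * standard_error, ')')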

Related

Gauss-Markov process in python: how to filter properly a white noise sequence

I'm pretty new to Python. I would like to synthesize a first-order Gauss-Markov process from white Gaussian noise. I know from signal processing theory that this can be done with a properly designed noise-shaping filter. See https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_process for details. First-order Gauss-Markov processes have two parameters: sigma, which is the standard deviation of the process, and the time constant beta.
The shaping filter should have a transfer function equal to the one in this figure:
Here it is my code:
import scipy.signal as dsp
import numpy as np
Nsamples = 2000
fs = 100
time = np.arange(Nsamples) / fs
rng = np.random.default_rng()
gaussianNoise = rng.standard_normal(size=time.shape)
wgn = (gaussianNoise - np.mean(gaussianNoise)) / np.std(gaussianNoise)
print('\n\n\nWGN MEAN: ', np.mean(wgn))
print('WGN STD: ', np.std(wgn))
beta = 0.01
sigma = 0.1
b = np.array([np.sqrt(2 * beta * sigma**2)])
a = np.array([1, beta])
gaussMarkovNoise = dsp.lfilter(b, a, wgn)
Unfortunately, something is wrong: gaussMarkovNoise should have an autocorrelation with an exponential decay (see the Wikipedia link above), but filtered this way it still has a single spike at the origin, like a white noise sequence. What am I missing?
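No answer is included with this question here, but one likely issue (an assumption on my part, not a confirmed diagnosis) is that lfilter expects discrete-time coefficients, while the b and a above describe the continuous-time shaping filter. A minimal sketch using the exact discrete-time equivalent of the first-order Gauss-Markov recursion:
import scipy.signal as dsp
import numpy as np
fs = 100
dt = 1 / fs
beta = 0.01    # time constant parameter from the question
sigma = 0.1    # desired steady-state standard deviation
rng = np.random.default_rng()
wgn = rng.standard_normal(2000)
# discrete-time model: x[k] = exp(-beta*dt) * x[k-1] + w[k],
# with Var(w) = sigma**2 * (1 - exp(-2*beta*dt)) so that Var(x) = sigma**2
phi = np.exp(-beta * dt)
b = [sigma * np.sqrt(1 - phi**2)]
a = [1, -phi]
gaussMarkovNoise = dsp.lfilter(b, a, wgn)
# the autocorrelation of gaussMarkovNoise should now decay roughly as sigma**2 * exp(-beta*|tau|)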

More efficient alternative to scipy.stats expect function

I am working on a simple simulation for a stock of products. Specifically, I want to calculate the expected shortage of products for various service levels. For example, if I assume that the demand for a product follows a normal distribution with a mean of 100 and a standard deviation of 20, then for a service level of 90% it would be necessary to have 125.63 units of the product in stock, and I would still expect a shortage of 0.9469 units.
My current approach to the problem is the following:
# Import libraries
import pandas as pd
import numpy as np
from scipy.stats import norm
# Create an exemplary dataset
idx = pd.Index(range(0, 1000), name='productid')
df = pd.DataFrame({'loc': np.random.normal(100, 30, 1000),
'scale': np.random.normal(20, 5, 1000)}, index=idx)
# Calculate quantile
df['quantile'] = norm.ppf(0.9, df['loc'], df['scale'])
# Calculate expected shortage
df['shortage'] = df.apply(lambda row: norm(row['loc'], row['scale'])\
.expect(lambda x: x-row['quantile'], lb=row['quantile']), axis=1)
The code actually works quite well, but there is a problem with performance. The calculation of the expected shortage takes around 15 seconds for 1000 products. In the real dataset I have 10000 products, and I need to repeat the operation around 100 times because I want to run the simulation for various service levels.
So if anyone knows a more efficient alternative to the scipy.stats expect function, or knows how to boost performance by tweaking the existing code, I would be really happy.
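No answer is attached to this question in this collection, but since the demand is assumed normal, the expected shortage has a closed form (the standard normal loss function), which removes the need for expect and the row-wise apply. A vectorized sketch under that assumption:
# E[max(X - q, 0)] for X ~ N(loc, scale) equals scale * (pdf(z) - z * sf(z)),
# where z = (q - loc) / scale; this reproduces the 0.9469 figure from the example.
z = (df['quantile'] - df['loc']) / df['scale']
df['shortage'] = df['scale'] * (norm.pdf(z) - z * norm.sf(z))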

How to fix: frozen and None value

The problem was to create a normal distribution with mean 32 and standard deviation 4.5, set the random seed to 1, and create a random sample of 100 elements from the distribution defined above. Finally, compute the absolute difference between the sample mean and the distribution mean.
This is some of the beginner stats problems in the course. I have had experience in Python but not in stats.
from scipy import stats
import numpy as np
x = stats.norm(loc=32, scale=4.5)
y = np.random.seed(1)
mean1 = np.mean(x)
mean2 = np.mean(y)
diff = abs(mean1 - mean2)
The problem I've been encountering is that x is a frozen distribution object and y has a value of None.
np.random.seed(1) sets the state of the pseudorandom number generator so that every run of this script will give the same output - and give identical results for all students...
You need to execute this before generating your random numbers. The seed function doesn't have anything to return, so it returns None, the default return value in Python for functions that don't return anything explicit.
Then you create your sample of size 100, and calculate its mean. As it is a sample, its mean will differ from the mean of the distribution (32): we calculate the absolute difference between these means.
You can experiment with different sample sizes, and see how the difference tends towards 0 when the size of the sample grows - you'll learn more about it in your course!
from scipy.stats import norm
import numpy as np
np.random.seed(1)
distribution_mean = 32
sample = norm.rvs(loc=distribution_mean, scale=4.5, size=100)
sample_mean = np.mean(sample)
print('sample:', sample)
print('sample mean:', sample_mean)
abs_diff = abs(sample_mean - distribution_mean)
print('absolute difference:', abs_diff)
Output:
sample: [39.30955414 29.24709614 29.62322711 27.1716412 35.89433433 21.64307586
39.85165294 28.57456895 33.43567593 30.87783331 38.57948572 22.72936681
30.54912258 30.2717554 37.10196249 27.0504893 31.22407307 28.04963712
32.18996186 34.62266846 27.0472137 37.15125669 36.05715824 34.26122453
36.05385177 28.92322463 31.44699399 27.78903755 30.79450364 34.3865996
28.88752662 30.21460913 28.90772285 28.19657461 28.97939241 31.9430093
26.97210343 33.05487064 39.4691098 35.33919872 31.13674001 28.00566966
28.63778768 39.6160457 32.2286349 29.13351959 32.85911968 41.45114811
32.54071529 34.77741399 33.35076644 30.41487569 26.85866811 30.42795775
31.05997595 34.63980436 35.77542536 36.18995937 33.28514296 35.98313524
28.60520927 37.6379067 34.30818419 30.65858224 34.19833166 31.65992729
37.09233224 38.83917567 41.83508933 25.71576649 25.50148788 29.72990362
32.72016681 35.94276015 33.42035726 22.90009453 30.62208194 35.72588589
33.03542631 35.42905031 30.99952336 31.09658869 32.83952626 33.84523241
32.89234874 32.53553891 28.98201971 33.69903704 32.54819572 37.08267759
37.39513046 32.83320388 30.31121772 29.12571317 33.90572459 32.34803031
30.45265846 32.19618586 29.2099962 35.14114415]
sample mean: 32.27262283434065
absolute difference: 0.2726228343406518
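As a small extra sketch (not part of the original answer), you can watch the difference shrink as the sample size grows:
for size in (10, 100, 1000, 10000):
    sample = norm.rvs(loc=32, scale=4.5, size=size)
    print(size, abs(np.mean(sample) - 32))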

Custom PDF from scipy.stats.rv_continuous unwanted upper-bound

I am attempting to generate a random probability density function of QSO's of certain luminosity with the form:
1/( (L/L_B^* )^alpha + (L/L_B^* )^beta )
where L_B^*, alpha, and beta are all constants. To do this, the following code is used:
import scipy.stats as st
logLbreak = 43.88
alpha = 3.4
beta = 1.6
class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        # "l_L" in this is always log L
        L = 10**(l_L / logLbreak)
        D = 1 / (L**alpha + L**beta)
        return D
dist_Log_L = my_pdf(momtype=0, a=0, name='l_L_dist')
distro = dist_Log_L.rvs(size = 10000)
(L/L^* appears as a power of 10 since everything is being done on a log scale.)
The distribution is supposed to produce a graph that approximates this, trailing off to infinity, but in reality the graph it produces looks like this (10,000 samples). The upper bound is the same regardless of the number of samples used. Is there a reason it is being restricted in this way?
Your PDF is not properly normalized. The integral of a PDF over the domain must be 1. Your PDF integrates to approximately 3.4712:
In [72]: from scipy.integrate import quad
In [73]: quad(dist_Log_L._pdf, 0, 100)
Out[73]: (3.4712183965415373, 2.0134487716044682e-11)
In [74]: quad(dist_Log_L._pdf, 0, 800)
Out[74]: (3.4712184965748905, 2.013626296581202e-11)
In [75]: quad(dist_Log_L._pdf, 0, 1000)
Out[75]: (3.47121849657489, 8.412130378805368e-10)
This will break the class's implementation of inverse transform sampling. It will only generate samples from the domain up to where the integral of the PDF from 0 to x first reaches 1.0, which in your case is about 2.325.
In [81]: quad(dist_Log_L._pdf, 0, 2.325)
Out[81]: (1.0000875374350238, 1.1103202107010366e-14)
That is, in fact, what you see in your histogram.
As a quick fix to verify the issue, I modified the return statement of the _pdf() method to:
return D/3.47121849657489
and ran your script again. (In a real fix, that value will be a function of the other parameters.) Then the commands
In [85]: import matplotlib.pyplot as plt
In [86]: plt.hist(distro, bins=31)
generates this plot:
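Here is a sketch of what the "real fix" mentioned above could look like (my own suggestion, not the answerer's code): compute the normalization constant from the parameters with quad and divide by it inside _pdf.
import numpy as np
import scipy.stats as st
from scipy.integrate import quad

logLbreak = 43.88
alpha = 3.4
beta = 1.6

def unnormalized(l_L):
    # same shape as the original _pdf, before normalization
    L = 10**(l_L / logLbreak)
    return 1 / (L**alpha + L**beta)

norm_const, _ = quad(unnormalized, 0, np.inf)   # ~3.4712 for these parameters

class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        return unnormalized(l_L) / norm_const

dist_Log_L = my_pdf(momtype=0, a=0, name='l_L_dist')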

Python, Pandas & Chi-Squared Test of Independence

I am quite new to Python as well as Statistics. I'm trying to apply the Chi Squared Test to determine whether previous success affects the level of change of a person (percentage wise, this does seem to be the case, but I wanted to see whether my results were statistically significant).
My question is: Did I do this correctly? My results say the p-value is 0.0, which means that there is a significant relationship between my variables (which is what I want of course...but 0 seems a little bit too perfect for a p-value, so I'm wondering whether I did it incorrectly coding wise).
Here's what I did:
import numpy as np
import pandas as pd
import scipy.stats as stats
d = {'Previously Successful' : pd.Series([129.3, 182.7, 312], index=['Yes - changed strategy', 'No', 'col_totals']),
'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972], index=['Yes - changed strategy', 'No', 'col_totals']),
'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)], index=['Yes - changed strategy', 'No', 'col_totals'])}
total_summarized = pd.DataFrame(d)
observed = total_summarized.iloc[0:2, 0:2]
Output:
Observed
expected = np.outer(total_summarized["row_totals"][0:2],
                    total_summarized.loc["col_totals"][0:2]) / 1000
expected = pd.DataFrame(expected)
expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(chi_squared_stat)
crit = stats.chi2.ppf(q=0.95,  # find the critical value for 95% confidence
                      df=8)
print("Critical value")
print(crit)
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, # Find the p-value
df=8)
print("P value")
print(p_value)
stats.chi2_contingency(observed= observed)
Output
Statistics
A few corrections:
Your expected array is not correct. You must divide by observed.sum().sum(), which is 1284, not 1000.
For a 2x2 contingency table such as this, the degrees of freedom is 1, not 8.
Your calculation of chi_squared_stat does not include a continuity correction. (But it isn't necessarily wrong to not use it--that's a judgment call for the statistician.)
All the calculations that you perform (expected matrix, statistics, degrees of freedom, p-value) are computed by chi2_contingency:
In [65]: observed
Out[65]:
Previously Successful Previously Unsuccessful
Yes - changed strategy 129.3 260.17
No 182.7 711.83
In [66]: from scipy.stats import chi2_contingency
In [67]: chi2, p, dof, expected = chi2_contingency(observed)
In [68]: chi2
Out[68]: 23.383138325890453
In [69]: p
Out[69]: 1.3273696199438626e-06
In [70]: dof
Out[70]: 1
In [71]: expected
Out[71]:
array([[ 94.63757009, 294.83242991],
[ 217.36242991, 677.16757009]])
By default, chi2_contingency uses a continuity correction when the contingency table is 2x2. If you prefer to not use the correction, you can disable it with the argument correction=False:
In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)
In [74]: chi2
Out[74]: 24.072616672232893
In [75]: p
Out[75]: 9.2770200776879643e-07
degrees of freedom = (row-1)x(column-1). For a 2x2 table it is (2-1)x(2-1) = 1
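For reference, a quick sketch (not in the original answer) that recomputes the p-value from the uncorrected statistic with the correct single degree of freedom:
from scipy.stats import chi2
p_manual = chi2.sf(24.072616672232893, df=1)   # ~9.28e-07, matching correction=False above
print(p_manual)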
