Python - generate array of specific autocorrelation - python

I am interested in generating an array (or Series) of length N that will exhibit a specific autocorrelation at lag 1. Ideally, I want to specify the mean and variance as well, and have the data drawn from a (multivariate) normal distribution. But most importantly, I want to specify the autocorrelation. How do I do this with numpy or scikit-learn?
Just to be explicit and precise, this is the autocorrelation I want to control:
numpy.corrcoef(x[0:len(x) - 1], x[1:])[0][1]

If you are interested only in the auto-correlation at lag one, you can generate an auto-regressive process of order one, AR(1), with its parameter equal to the desired auto-correlation. This property is mentioned on the Wikipedia page, and it is not hard to prove.
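For completeness, here is a sketch of that proof. The notation is my own, chosen to match the Wikipedia article: the process is X_t = c + phi X_{t-1} + eps_t with white noise of variance sigma_e^2, and it is assumed to be stationary (|phi| < 1).

```latex
\begin{aligned}
X_t &= c + \varphi X_{t-1} + \varepsilon_t,
  \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma_e^2),\ |\varphi| < 1 \\
\mu &= \mathrm{E}[X_t] = c + \varphi \mu
  \;\Rightarrow\; \mu = \frac{c}{1 - \varphi} \\
\sigma^2 &= \mathrm{Var}(X_t) = \varphi^2 \sigma^2 + \sigma_e^2
  \;\Rightarrow\; \sigma^2 = \frac{\sigma_e^2}{1 - \varphi^2} \\
\mathrm{Cov}(X_t, X_{t-1}) &= \varphi \, \mathrm{Var}(X_{t-1}) = \varphi \sigma^2 \\
\rho(1) &= \frac{\mathrm{Cov}(X_t, X_{t-1})}{\sigma^2} = \varphi
\end{aligned}
```

Solving the second and third lines for c and sigma_e gives c = mu (1 - phi) and sigma_e = sigma sqrt(1 - phi^2), which are the two quantities a sampler needs in order to hit a requested mean and variance.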
Here is some sample code:
import numpy as np

def sample_signal(n_samples, corr, mu=0, sigma=1):
    assert 0 < corr < 1, "Auto-correlation must be between 0 and 1"
    # Find out the offset `c` and the std of the white noise `sigma_e`
    # that produce a signal with the desired mean and variance.
    # See https://en.wikipedia.org/wiki/Autoregressive_model
    # under section "Example: An AR(1) process".
    c = mu * (1 - corr)
    sigma_e = np.sqrt((sigma ** 2) * (1 - corr ** 2))
    # Sample the auto-regressive process.
    # The first value is drawn from the stationary distribution N(mu, sigma^2)
    # so that the signal has the requested mean and variance from the start.
    signal = [np.random.normal(mu, sigma)]
    for _ in range(1, n_samples):
        signal.append(c + corr * signal[-1] + np.random.normal(0, sigma_e))
    return np.array(signal)

def compute_corr_lag_1(signal):
    return np.corrcoef(signal[:-1], signal[1:])[0][1]

# Examples.
print(compute_corr_lag_1(sample_signal(5000, 0.5)))
print(np.mean(sample_signal(5000, 0.5, mu=2)))
print(np.std(sample_signal(5000, 0.5, sigma=3)))
The parameter corr lets you set the desired auto-correlation at lag one and the optional parameters, mu and sigma, let you control the mean and standard deviation of the generated signal.
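For long signals the Python loop can become slow. The following is a minimal sketch of a vectorized variant, not part of the answer above: the function name `sample_signal_vectorized` and its `seed` parameter are hypothetical, and it swaps the explicit loop for `scipy.signal.lfilter`, which evaluates the same recursion.

```python
import numpy as np
from scipy.signal import lfilter


def sample_signal_vectorized(n_samples, corr, mu=0.0, sigma=1.0, seed=None):
    """Hypothetical vectorized variant of sample_signal (an illustration only)."""
    rng = np.random.default_rng(seed)
    sigma_e = sigma * np.sqrt(1.0 - corr ** 2)
    # Innovations for the zero-mean process y[t] = corr * y[t-1] + e[t].
    noise = rng.normal(0.0, sigma_e, size=n_samples)
    # Draw the first value from the stationary distribution N(0, sigma^2).
    noise[0] = rng.normal(0.0, sigma)
    # lfilter with b=[1], a=[1, -corr] computes y[t] = e[t] + corr * y[t-1].
    centered = lfilter([1.0], [1.0, -corr], noise)
    # Shift the zero-mean process to the requested mean.
    return mu + centered


def compute_corr_lag_1(signal):
    return np.corrcoef(signal[:-1], signal[1:])[0, 1]


signal = sample_signal_vectorized(20000, 0.5, mu=2.0, sigma=3.0, seed=0)
print(compute_corr_lag_1(signal), signal.mean(), signal.std())
```

The three printed values should land near 0.5, 2 and 3 respectively, up to sampling noise.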

Related

Unable to reproduce simple figure from textbook (possible numerical instability)

I am trying to reproduce figure 5.6 (attached) from the textbook "Modeling Infectious Diseases in Humans and Animals (official code repo)" (Keeling 2008) to verify whether my implementation of a seasonally forced SEIR (epidemiological model) is correct. An official program from the textbook that implements seasonal forcing indicates that large values of Beta 1 can lead to numerical errors, but if the figure has Beta 1 values that did not lead to numerical errors, then in principle this should not be the cause of the problem. My implementation correctly produces the graphs in row 0, column 0 and row 1, column 0 of figure 5.6 but there is no output in my figure for the remaining cells due to the numerical solution for the fraction of infected (see code at bottom) producing 0 (and the ln(0) --> -inf).
I do receive the following warnings:
ODEintWarning: Excess work done on this call
C:\Users\jared\AppData\Local\Temp\ipykernel_24972\2802449019.py:68: RuntimeWarning: divide by zero encountered in log
  infected = np.log(odeint(
C:\Users\jared\AppData\Local\Temp\ipykernel_24972\2802449019.py:68: RuntimeWarning: invalid value encountered in log
  infected = np.log(odeint(
Here is the textbook figure:
My figure within the same time range (990 - 1000 years). Natural log taken of fraction infected:
My figure but with a shorter time range (0 - 100 years). Natural log taken of fraction infected. The numerical solution for the infected population seems to fail between the 5 and 20 year mark for most of the seasonal parameters (Beta 1 and R0):
My figure with a shorter time range as above, but with no natural log taken of fraction infected.
Code to reproduce my figure:
# Code to minimally reproduce figure
import itertools
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

def seir(y, t, mu, sigma, gamma, omega, beta_zero, beta_one):
    """System of diff eqs for epidemiological model.

    SEIR stands for susceptibles, exposed, infectious, and
    recovered populations.

    References:
    [SEIR Python Program from Textbook](http://homepages.warwick.ac.uk/~masfz/ModelingInfectiousDiseases/Chapter2/Program_2.6/Program_2_6.py)
    [Seasonally Forced SIR Program from Textbook](http://homepages.warwick.ac.uk/~masfz/ModelingInfectiousDiseases/Chapter5/Program_5.1/Program_5_1.py)
    """
    s, e, i = y
    beta = beta_zero * (1 + beta_one * np.cos(omega * t))
    sdot = mu - (beta*i + mu)*s
    edot = beta*s*i - (mu + sigma)*e
    idot = sigma*e - (mu + gamma)*i
    return sdot, edot, idot

def solve_beta_zero(basic_reproductive_rate, gamma):
    """Defined in the last paragraph of pg. 159 of textbook Keeling 2008."""
    return gamma * basic_reproductive_rate

# Model parameters (see Figure 5.6 description)
mu = 0.02 / 365
sigma = 1/8
gamma = 1/5
omega = 2 * np.pi / 365  # angular frequency of one oscillation per year (rad/day)

# Seasonal forcing parameters
r0s = [17, 10, 3]
b1s = [0.02, 0.1, 0.225]

# Permutes params to get tuples matching row i column j params in figure
# e.g., [(0.02, 17), (0.02, 10) ... ]
seasonal_params = list(itertools.product(b1s, r0s))

# Initial Conditions: I assume these are proportions of some total population
s0 = 6e-2
e0 = i0 = 1e-3
initial_conditions = [s0, e0, i0]

# Timesteps
nyears = 1000
days_per_year = 365
ndays = nyears * days_per_year
timesteps = np.arange(1, ndays+1, 1)

# Range to slice data to reproduce my figures
# NOTE: Change the min slice or max slice for different ranges
min_slice = 990  # or 0
max_slice = 1000  # or 100
sliced = slice(min_slice * days_per_year, max_slice * days_per_year)
x_ticks = timesteps[sliced]/days_per_year

# Define figure
nrows = 3
ncols = 3
fig, ax = plt.subplots(nrows, ncols, sharex=True, figsize=(15, 8))

# Iterate through parameters and recreate figure
for i in range(nrows):
    for j in range(ncols):
        # Get seasonal parameters for this subplot
        beta_one = seasonal_params[i * ncols + j][0]
        basic_reproductive_rate = seasonal_params[i * ncols + j][1]
        # Compute beta zero given the reproductive rate
        beta_zero = solve_beta_zero(
            basic_reproductive_rate=basic_reproductive_rate,
            gamma=gamma)
        # Numerically solve the model, extract only the infected solutions,
        # slice those solutions to the desired time range, and then natural
        # log scale them
        solutions = odeint(
            seir,
            initial_conditions,
            timesteps,
            args=(mu, sigma, gamma, omega, beta_zero, beta_one))
        infected_solutions = solutions[:, 2]
        log_infected = np.log(infected_solutions[sliced])
        # NOTE: To inspect results without natural log, uncomment the
        # below line
        # log_infected = infected_solutions[sliced]
        # DEBUG: For shape and parameter printing
        # print(
        #     infected_solutions.shape, 'R0=', basic_reproductive_rate, 'B1=', beta_one)
        # Plot results
        ax[i,j].plot(x_ticks, log_infected)
        # label subplot
        ax[i,j].title.set_text(rf'$(R_0=${basic_reproductive_rate}, $\beta_1=${beta_one})')

fig.supylabel('NaturalLog(Fraction Infected)')
fig.supxlabel('Time (years)')
Disclaimer:
My short term solution is to simply change the list of seasonal parameters to values that will produce data for that range, and this adequately illustrates the effects of seasonal forcing. The point is to reproduce the figure, though, and if the author was able to do it, others should be able to as well.
Your first (and possibly main) problem is one of scale. This diagnosis is also consistent with the observations in your later experiments. If the system is started with positive values, the exact solution stays positive. Negative values can only be reached if the step errors of the numerical method are too large.
As you can see in the original graphs, the values range from exp(-7) ~= 9e-4 down to exp(-12) ~= 6e-6. The absolute tolerance should force at least 3 digits to be exact, so atol = 1e-10 or smaller. The relative tolerance should be adapted similarly. Viewing all components together shows that the first component has values around exp(-2.5) ~= 8e-2, so per-component tolerances should provide better results. The corresponding call is
solutions = odeint(
    seir,
    initial_conditions,
    timesteps,
    args=(mu, sigma, gamma, omega, beta_zero, beta_one),
    atol=[1e-9, 1e-13, 1e-13], rtol=1e-11)
With these parameters I get the plots below
The first row and first column are as in the cited graphic, the others look different.
As a test and a general method to integrate in the range of small positive solutions, reformulate for the integration of the logarithms of the components. This can be done with a simple wrapper
def seir_log(log_y, t, *args):
    y = np.exp(log_y)
    dy = np.array(seir(y, t, *args))
    return dy / y  # = d(log(y))/dt
Now the expected values have a scale of 1 to 10, so the tolerances are no longer so critical. The default tolerances should be sufficient, but it is better to work with documented tolerances.
log_solution = odeint(
    seir_log,
    np.log(initial_conditions),
    timesteps,
    args=(mu, sigma, gamma, omega, beta_zero, beta_one),
    atol=1e-8, rtol=1e-9)
log_infected = log_solution[sliced, 2]
The bottom-left plot is still sensitive to atol; with 1e-7 one gets a wavier picture. Bounding the step size with hmax=5 also stabilizes that. With the code as above the plots are
The center plot still differs from the reference. It might be that there are different stable cycles.
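To try the log-transformed formulation in isolation, here is a minimal self-contained sketch. It reuses the model and parameters from the question for the top-left panel only (R0 = 17, beta_1 = 0.02). The 20-year horizon is an arbitrary choice of mine to keep the run short, and passing hmax=5 together with the tolerances is just one of the combinations discussed above, not a recommendation.

```python
import numpy as np
from scipy.integrate import odeint


def seir(y, t, mu, sigma, gamma, omega, beta_zero, beta_one):
    """Seasonally forced SEIR right-hand side, as in the question."""
    s, e, i = y
    beta = beta_zero * (1 + beta_one * np.cos(omega * t))
    sdot = mu - (beta * i + mu) * s
    edot = beta * s * i - (mu + sigma) * e
    idot = sigma * e - (mu + gamma) * i
    return sdot, edot, idot


def seir_log(log_y, t, *args):
    """Same system written for the logarithms of the components."""
    y = np.exp(log_y)
    return np.array(seir(y, t, *args)) / y


mu, sigma, gamma = 0.02 / 365, 1 / 8, 1 / 5
omega = 2 * np.pi / 365
r0, beta_one = 17, 0.02          # top-left panel of the figure
beta_zero = gamma * r0

# 20 years of daily output: an arbitrary short horizon for a quick check.
timesteps = np.arange(0, 20 * 365 + 1)
log_solution = odeint(
    seir_log,
    np.log([6e-2, 1e-3, 1e-3]),
    timesteps,
    args=(mu, sigma, gamma, omega, beta_zero, beta_one),
    atol=1e-8, rtol=1e-9, hmax=5)
log_infected = log_solution[:, 2]

# exp() of a finite number is strictly positive, so the infected fraction
# recovered from the log form cannot cross zero.
print(np.isfinite(log_infected).all(), np.exp(log_infected).min() > 0)
```

Because the integrator works on log(i) directly, `log_infected` can be plotted as is, without a further `np.log` call that could hit zero or negative values.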

How to calculate the probability between two numbers from a probability distribution in python

I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
More edits:
Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
Make a more "normal" distribution with sampling means in order to incorporate cdfs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples. Is this not encouraged?)
or
If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools, in particular because there are several statistical approaches to choose from.
0. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Equivalently, p is the area under the graph of the probability density function (PDF) f over the interval [lower, upper].
However, when the CDF/PDF is unknown, this becomes a statistical question. In a nutshell, estimating the PDF f and computing the area under its graph over the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka location) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting: it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameters
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s='p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular variant. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area under the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows
import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''wrapper function to evaluate the estimated density at a single point x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
# quad returns a tuple: (integral estimate, absolute error estimate)
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s='p=' + str(round(p[0], 3)))
plt.show()
and yields
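As a possible alternative to wrapping scikit-learn in quad(), SciPy's own `gaussian_kde` has an `integrate_box_1d` method that integrates a Gaussian-kernel estimate over an interval in closed form. The sketch below is not part of the answer above. It reuses the same sample, but `gaussian_kde` picks its bandwidth by Scott's rule by default, so the number will generally differ from the bandwidth=.9 result.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gaussian_kde

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95,
     -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# Gaussian-kernel density estimate with SciPy's default (Scott) bandwidth.
kde = gaussian_kde(x)

# Closed-form integral of the estimated density over [lower, upper].
p_box = kde.integrate_box_1d(lower, upper)

# Cross-check by numerically integrating the same density.
p_quad, abs_err = quad(lambda t: kde(t)[0], lower, upper)

print(round(p_box, 3), round(p_quad, 3))
```

Both numbers should agree to within the integration error, which makes the closed-form call a convenient sanity check on any hand-rolled integration.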
I do see a bug in the get_probability function, but that bug makes the result too high, not too low: in np.sum(kd_vals * step), it multiplies N sample values by a step with N-1 in the denominator, effectively producing an output that is a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
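A rough way to see the size of both effects without fitting anything: in the large-sample limit, a Gaussian-kernel estimate of standard normal data with bandwidth h behaves like the normal density convolved with the kernel, i.e. like N(0, 1 + h^2). The sketch below rests on that idealization (my assumption, not something computed in the question) and uses the interval [-1, 1] in place of the sample mean plus or minus the sample standard deviation.

```python
import numpy as np
from scipy.stats import norm

h = 0.5  # kernel bandwidth used in the question
# Idealized large-sample KDE of N(0, 1) data: variance inflated by h^2.
smoothed = norm(loc=0.0, scale=np.sqrt(1.0 + h ** 2))

# Probability within one standard deviation for the true distribution.
p_true = norm.cdf(1.0) - norm.cdf(-1.0)
# Same interval under the over-smoothed model.
p_smoothed = smoothed.cdf(1.0) - smoothed.cdf(-1.0)

# Rectangle sum as in get_probability: N values times (b - a) / (N - 1).
N = 100
grid = np.linspace(-1.0, 1.0, N)
step = 2.0 / (N - 1)
p_rectangle = np.sum(smoothed.pdf(grid) * step)

print(p_true, p_smoothed, p_rectangle)
```

Under these assumptions the smoothing alone pulls the probability from about 0.68 down to about 0.63, and the rectangle sum then nudges it slightly upward, close to the 0.6338 reported in the question. A much smaller bandwidth should bring the estimate back toward 0.68.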

Implementation of the Metropolis-Hastings algorithm for solving gaussian integrals

I am currently having an issue with my implementation of the Metropolis-Hastings algorithm.
I am trying to use the algorithm to calculate integrals of the form ∫ x^2 exp(-x^2) dx over the real line.
Using this algorithm, we can obtain a long chain of configurations (in this case, each configuration is just a single number) such that in the tail end of the chain the probability of having a particular configuration follows (or rather tends to) a gaussian distribution.
My code seems to go wrong in producing these gaussian distributions. There is a strange dependence on the transition probability (the probability of picking a new candidate configuration given the previous configuration in the chain). However, if this transition probability is symmetric, there should be no dependence on this function at all (it only affects the speed at which phase space [the space of potential configurations] is explored and how quickly the chain converges to the desired distribution)!
In my case I am using a normal distribution as the transition function (which satisfies the need to be symmetric), with width d.
For each d I use, I do indeed get a gaussian distribution. However, the standard deviation, sigma, depends on my choice of d.
The resulting gaussian should have a sigma of roughly 0.707 (that is, 1/sqrt(2)), but I find that the value I actually get depends on the parameter d, when it shouldn't.
I am not sure where the error in this code is. Any help would be greatly appreciated!
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
'''
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-Hastings alg will generate configurations (in this case, single numbers) such that
the probability of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the approximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
'''
'''
Implementing Metropolis-Hastings algorithm
'''
#Setting up the initial config for our chain, generating first 2 to generate numpy array
x_0 = np.random.uniform(-20,-10,2)
#Defining function that generates the next N steps in the chain, given a starting config x
#Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
#Success and failures implemented to see roughly the success rate of each step
def next_steps(x,N):
    i = 0
    Success = 0
    Failures = 0
    Data = np.array(x)
    d = 1.5 #Spread of (normal) transition function
    while i < N:
        r = np.random.uniform(0,1)
        delta = np.random.normal(0,d)
        x_old = Data[-1]
        x_new = x_old + delta
        hasting_ratio = np.exp(-(x_new**2-x_old**2) )
        if hasting_ratio > r:
            i = i+1
            Data = np.append(Data,x_new)
            Success = Success +1
        else:
            Failures = Failures + 1
    print(Success)
    print(Failures)
    return Data
#Number of steps in the chain
N_iteration = 50000
#Generating the data
Data = next_steps(x_0,N_iteration)
#Plotting data to see convergence of chain to gaussian distribution
plt.plot(Data)
plt.show()
#Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
#Plotting a histogram to visually see if gaussian
plt.hist(Data, bins = 300)
plt.show()
You need to save x even when it doesn't change. Otherwise the center values are under-counted, and more so as d increases, which increases the variance.
import numpy as np
from scipy.stats import norm
"""
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-Hastings alg will generate configurations (in this case, single numbers) such that
the probability of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the approximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
"""
"""
Implementing Metropolis-Hastings algorithm
"""
# Setting up the initial config for our chain
x_0 = np.random.uniform(-20, -10)
# Defining function that generates the next N steps in the chain, given a starting config x
# Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
# Success and failures implemented to see roughly the success rate of each step
def next_steps(x, N):
    Success = 0
    Failures = 0
    Data = np.empty((N,))
    d = 1.5  # Spread of (normal) transition function
    for i in range(N):
        r = np.random.uniform(0, 1)
        delta = np.random.normal(0, d)
        x_new = x + delta
        hasting_ratio = np.exp(-(x_new ** 2 - x ** 2))
        if hasting_ratio > r:
            x = x_new
            Success = Success + 1
        else:
            Failures = Failures + 1
        # Record the current state on every step, accepted or not
        Data[i] = x
    print(Success)
    print(Failures)
    return Data
# Number of steps in the chain
N_iteration = 50000
# Generating the data
Data = next_steps(x_0, N_iteration)
# Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
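To check that the fitted sigma no longer depends on d once rejected steps are recorded, the chain can be rerun for a few proposal widths. The sketch below is an illustration rather than a drop-in replacement: the function name `metropolis_chain`, the fixed seed, the starting point 0 and the burn-in of 10000 steps are all my own choices. It compares in log space, which is equivalent to the ratio test used above. It also assumes the target is normalized with the factor sqrt(pi), so the integral of x^2 exp(-x^2) is estimated as mean(x^2) * sqrt(pi).

```python
import numpy as np


def metropolis_chain(n_steps, d, x0=0.0, seed=0):
    """Random-walk Metropolis chain targeting p(x) proportional to exp(-x^2)."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(0.0, d, size=n_steps)
    # log(1 - u) with u in [0, 1) avoids taking log(0).
    log_u = np.log(1.0 - rng.random(n_steps))
    chain = np.empty(n_steps)
    x = x0
    for i in range(n_steps):
        x_new = x + deltas[i]
        # Accept when log(u) < log p(x_new) - log p(x).
        if log_u[i] < -(x_new ** 2 - x ** 2):
            x = x_new
        # Record the current state whether or not the move was accepted.
        chain[i] = x
    return chain


for d in (0.5, 1.5, 3.0):
    tail = metropolis_chain(50000, d)[10000:]
    sigma_hat = tail.std()
    # Integral of x^2 exp(-x^2): E[x^2] under p times the normalizer sqrt(pi).
    integral_hat = np.mean(tail ** 2) * np.sqrt(np.pi)
    print(d, sigma_hat, integral_hat)
```

For every d the printed sigma should sit near 1/sqrt(2) and the integral estimate near sqrt(pi)/2, with only the sampling noise changing between proposal widths.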

SciPy discrete cosine transform (DCT) power in non-existing frequencies

I try to transform a simple cosine signal using the Discrete Cosine Transform (DCT) scipy.fft.dct, however it seems there is an issue as there is power in frequencies that should not exist.
Suppose a domain from zero to one, both endpoints included, for the cosine function:
import numpy as np
x = np.linspace(0, 1, 8, endpoint = True)
f = np.cos(1 * np.pi * x)
This simple signal offers a single frequency, so I do expect significant powers only at a single frequency of the DCT:
import scipy.fft
f_FT = scipy.fft.dct(f, type = 1, norm = "ortho")
I select the DCT type I according to the Wikipedia classification (that is also referenced in SciPy's documentation) because the endpoints are included and the signal is even at both boundaries. But this yields as result:
array([ 3.35699888e-16, 2.09223516e+00, -1.48359792e-17, 2.21406462e-01,
-1.92867730e-16, 2.21406462e-01, 1.18687834e-16, 1.56558011e-01])
Thus, there is still significant energy in k=3pi, 5pi, 7pi (second and last column).
Am I doing something wrong? As written above, I expect power only at k=1pi. The Discrete Sine Transform (DST) does not show this kind of problem: there, I find power only in the frequencies that I generate.
Thank you in advance for your help.
Update - origin of problem found
I found that the problem is caused by norm = "ortho". Apparently, the library modifies the first and last point of the input signal before the transform (in the documentation this is indicated by "If norm='ortho', x[0] and x[N-1] are multiplied by a scaling factor of sqrt(2)") to make sure that Parseval's theorem still holds afterwards. However, the power in the different modes then no longer corresponds to the original signal.
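This diagnosis can be checked directly by comparing the two normalizations on the same input. The following is a minimal sketch using the signal from above; the threshold of 1e-9 for "numerically zero" is an arbitrary choice of mine.

```python
import numpy as np
import scipy.fft

x = np.linspace(0, 1, 8, endpoint=True)
f = np.cos(np.pi * x)

# Unscaled transform: the input samples are used as they are.
coeffs_backward = scipy.fft.dct(f, type=1, norm="backward")
# Orthonormal variant: the endpoints of the input are rescaled first.
coeffs_ortho = scipy.fft.dct(f, type=1, norm="ortho")

# Count the coefficients that are not numerically zero.
print(np.sum(np.abs(coeffs_backward) > 1e-9))
print(np.sum(np.abs(coeffs_ortho) > 1e-9))
```

With norm = "backward" only the k=1 coefficient survives, whereas norm = "ortho" leaves several non-zero coefficients, as in the output shown in the question.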
Solution
This modification of the original signal is confusing and I propose the following to anybody who also wants Parseval's theorem to hold while still knowing in which modes the original input signal has power:
f_DCT = scipy.fft.dct(f, type = 1, norm = "backward")
# Apply manual normalisation similar to: norm = "ortho"
# See documentation of SciPy DCT (y[k]).
scaling_factors = np.zeros(np.shape(f_DCT))
scaling_factors[1:-1] = 0.5 * (np.sqrt(2) / np.sqrt(len(f_DCT) -1))
scaling_factors[0] = 0.5 * (1 / np.sqrt(len(f_DCT) -1))
scaling_factors[-1] = 0.5 * (1 / np.sqrt(len(f_DCT) -1))
f_DCT = f_DCT * scaling_factors
del scaling_factors
# Now f_DCT is scaled as expected for norm = "ortho"
# To check Parseval's theorem, one must scale the weight of the first
# and last data point because of the specific type I of the DCT.
# See documentation of SciPy DCT (x[0], x[N-1]).
scaling_factors = np.ones(np.shape(f))
scaling_factors[0] = 1 / np.sqrt(2)
scaling_factors[-1] = 1 / np.sqrt(2)
# Compute the signal weighted properly for checking Parseval's theorem (PT).
f_PT = f * scaling_factors
del scaling_factors
# Note that there is now only energy in one single mode (at k=1pi):
array([ 2.93737402e-16, 1.87082869e+00, -2.31984713e-17, 1.78912008e-16,
-1.78912008e-16, 2.31984713e-17, 0.00000000e+00, -1.25887458e-16])
# Also, Parseval's theorem holds:
np.sum(f_PT * f_PT) # 3.499999999999999
np.sum(f_DCT * f_DCT) # 3.499999999999999

Chi-squared goodness of fit test in Python: way too low p-values, but the fitting function is correct

Despite having searched for two days in related questions, I have not really found an answer to this problem yet...
In the following code, I generate n normally distributed random variables, which are then represented in a histogram:
import numpy as np
import matplotlib.pyplot as plt
n = 10000 # number of generated random variables
x = np.random.normal(0,1,n) # generate n random variables
# plot this in a non-normalized histogram:
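# `density` replaces the `normed` keyword, which newer Matplotlib/NumPy releases no longer accept
plt.hist(x, bins='auto', density=False)
# get the arrays containing the bin counts and the bin edges:
histo, bin_edges = np.histogram(x, bins='auto', density=False)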
number_of_bins = len(bin_edges)-1
After that, a curve fitting function and its parameters are found.
It is normally distributed with the parameters a1 and b1, and scaled with scaling_factor to meet the fact that the sample is unnormalized.
It indeed fits the histogram quite well:
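import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded
a1, b1 = sp.stats.norm.fit(x)
scaling_factor = n*(x.max()-x.min())/number_of_bins
# x_achse ("x axis") is not defined in the snippet; a grid over the data range is assumed here
x_achse = np.linspace(x.min(), x.max(), 500)
plt.plot(x_achse,scaling_factor*sp.stats.norm.pdf(x_achse,a1,b1),'b')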
Here's the plot of the histogram with the fitting function in red.
After that, I want to test how well this function fits the histogram using the chi-squared test.
This test uses the observed values and the expected values in those points. To calculate the expected values, I first calculate the location of the middle of each bin, this information is contained in the array x_middle. I then calculate the value of the fitting function at the middle point of each bin, which gives the expected_value array:
observed_values = histo
bin_width = bin_edges[1] - bin_edges[0]
# array containing the middle point of each bin:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
                       bin_edges[0] + (0.5 + number_of_bins)*bin_width,
                       num=number_of_bins)
expected_values = scaling_factor*sp.stats.norm.pdf(x_middle,a1,b1)
Plugging this into the chisquare function of SciPy, I get p-values on the order of 1e-5 to 1e-15, which tells me the fitting function does not describe the histogram:
print(sp.stats.chisquare(observed_values,expected_values,ddof=2))
But this is not true, the function fits the histogram very well!
Does anybody know where I made a mistake?
Thanks a lot!!
Charles
p.s.: I set the number of delta degrees of freedom to 2, because the 2 parameters a1 and b1 are estimated from the sample. I tried using other ddof, but the results were still as poor!
Your calculation of the end-point of the array x_middle is off by one; it should be:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
                       bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
                       num=number_of_bins)
Note the extra - 1 in the second argument of linspace().
A more concise version is
x_middle = 0.5*(bin_edges[1:] + bin_edges[:-1])
A different (and possibly more accurate) approach to computing expected_values is to use the differences of the CDF, instead of approximating those differences using the PDF in the middle of each interval:
In [75]: from scipy import stats
In [76]: cdf = stats.norm.cdf(bin_edges, a1, b1)
In [77]: expected_values = n * np.diff(cdf)
With that calculation, I get the following result from the chi-squared test:
In [85]: stats.chisquare(observed_values, expected_values, ddof=2)
Out[85]: Power_divergenceResult(statistic=61.168393496775181, pvalue=0.36292223875686402)
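For reference, the CDF-difference approach can be put together as one self-contained script. This is a sketch under a few assumptions of mine: a seeded generator replaces np.random.normal so the run is repeatable, and the expected counts are rescaled to the observed total. The rescaling step is not part of the session above. As far as I know, recent SciPy releases raise an error when the two totals differ, which they do here because the fitted normal has some mass outside [x.min(), x.max()].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10000
x = rng.normal(0, 1, n)

# Bin counts and bin edges of the unnormalized histogram.
observed_values, bin_edges = np.histogram(x, bins='auto')
# Fitted location and scale of the normal distribution.
a1, b1 = stats.norm.fit(x)

# Expected counts per bin from differences of the fitted CDF.
expected_values = n * np.diff(stats.norm.cdf(bin_edges, a1, b1))
# Assumption: rescale so that both totals match (see the note above).
expected_values *= observed_values.sum() / expected_values.sum()

statistic, pvalue = stats.chisquare(observed_values, expected_values, ddof=2)
print(statistic, pvalue)
```

The p-value depends on the random sample, so it will not reproduce the number shown above exactly.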
