I have code here that draws from two Gaussian distributions with a fixed total number of points.
Ultimately, I want to simulate noise, but first I'm trying to understand why, when I have two Gaussians whose means are really far apart, curve_fit doesn't return their average mean value the way I expected.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import gauss
N_tot = 1000
# Draw from the major gaussian. Note the number N. It is
# the main parameter in obtaining your estimators.
mean = 0; sigma = 1; var = sigma**2; N = 100
A = 1/np.sqrt((2*np.pi*var))
points = gauss.draw_1dGauss(mean,var,N)
# Now draw from a minor gaussian. Note Np
meanp = 10; sigmap = 1; varp = sigmap**2; Np = N_tot-N
pointsp = gauss.draw_1dGauss(meanp,varp,Np)
Ap = 1/np.sqrt((2*np.pi*varp))
# Now implement the sum of the draws by concatenating the two arrays.
points_tot = np.concatenate([points, pointsp])
bins_tot = len(points_tot)//5   # integer number of bins
hist_tot, bin_edges_tot = np.histogram(points_tot,bins_tot,density=True)
bin_centres_tot = (bin_edges_tot[:-1] + bin_edges_tot[1:])/2.0
# Initial guess
p0 = [A, mean, sigma]
# Result of the fit
coeff, var_matrix = curve_fit(gauss.gaussFun, bin_centres_tot, hist_tot, p0=p0)
# Get the fitted curve
hist_fit = gauss.gaussFun(bin_centres, *coeff)
plt.figure(5); plt.title('Gaussian Estimate')
plt.suptitle('Gaussian Parameters: Mu = '+ str(coeff[1]) +' , Sigma = ' + str(coeff[2]) + ', Amplitude = ' + str(coeff[0]))
plt.plot(bin_centres,hist_fit)
plt.draw()
# Error on the estimates
error_parameters = np.sqrt(np.array([var_matrix[0][0],var_matrix[1][1],var_matrix[2][2]]))
The returned parameters are still centered about 0 and I'm not sure why. It should be centered around 10.
Edit: Changed the integer division portions but still not returning good fit value.
I should get a mean of about ~10 since most of my points are being drawn from that distribution (i.e. the minor distribution)
You find that the least-squares optimization converges to the larger of the two peaks.
The least-squares optimum does not find the "average mean value" of the two component distributions; the algorithm merely minimizes the squared error, so it will usually end up fitting the biggest peak.
When the distribution is this lopsided (90% of the samples come from the larger of the two peaks), the error terms from the main peak wash out the local minima at the smaller peak and the minimum between the peaks.
You can get the fit to converge to a point in the center only when the peaks are nearly equal in size, otherwise you should expect least-squares to find the "strongest" peak if it doesn't get stuck in a local minimum.
With the following pieces, I can run your code:
bin_centres = bin_centres_tot
def draw_1dGauss(mean, var, N):
    from scipy.stats import norm
    from numpy import sqrt
    # draw N samples from a normal distribution with the given mean and variance
    return norm.rvs(loc=mean, scale=sqrt(var), size=N)
def gaussFun(bin_centres, *coeff):
    from numpy import sqrt, exp, pi
    A, mean, sigma = coeff[0], coeff[1], coeff[2]
    # use A as the amplitude so that all three fitted parameters actually enter the model
    return A * exp(-(bin_centres-mean)**2 / 2. / sigma**2) / sigma / sqrt(2*pi)
plt.hist(points_tot, density=True, bins=40)
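If the goal is to recover both components rather than only the dominant peak, one option is to fit a sum of two Gaussians. This is a sketch of my own (gauss2 and p0_two are illustrative names, not part of the original code), reusing bin_centres_tot and hist_tot from the question:
import numpy as np
from scipy.optimize import curve_fit

def gauss2(x, w1, mu1, s1, w2, mu2, s2):
    # weighted sum of two normalized Gaussians
    g1 = w1 * np.exp(-(x - mu1)**2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
    g2 = w2 * np.exp(-(x - mu2)**2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
    return g1 + g2

# initial guesses that straddle both peaks
p0_two = [0.1, 0.0, 1.0, 0.9, 10.0, 1.0]
coeff2, _ = curve_fit(gauss2, bin_centres_tot, hist_tot, p0=p0_two)
With a reasonable starting point for each component, the fit returns means near 0 and 10 instead of collapsing onto the strongest peak.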
Related
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
# Number of evaluation points
N = eval_points
step = (end_value - start_value) / (N - 1) # Step size
x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges to 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
# Number of evaluation points
N = eval_points
step = (end_value - start_value) / (N - 1) # Step size
x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
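(For reference, one way that could work, sketched here with a and b as hypothetical interval endpoints rather than anything from my data: integrate the kdeplot's sampled curve directly with the trapezoidal rule instead of interpolating.)
import numpy as np
# xs, ys are the sampled KDE curve pulled from the kdeplot above
xs, ys = np.asarray(xs), np.asarray(ys)
a, b = 0.0, 1.0                       # hypothetical interval endpoints
mask = (xs >= a) & (xs <= b)
prob = np.trapz(ys[mask], xs[mask])   # area under the KDE curve on [a, b]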
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
+ str(YEAR) + '.csv.gz?raw=True',compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
y = np.random.choice(df, 500)
avg = np.mean(y)
sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
1. make a more "normal" distribution with sampling means in order to incorporate CDFs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than of individual samples; is this not encouraged?), or
2. if the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools to do so, in particular because there are several statistical approaches available.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the probability density function (PDF) f's graph on the interval [lower, upper].
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph encloses over the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka loc, the location) and sigma (aka scale, the standard deviation). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat) ,
facecolor='red',
alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
'''wrapper function to compute probability'''
return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
facecolor='red',
alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but it makes the result too high rather than too low: in np.sum(kd_vals * step), N sample values are multiplied by a step whose denominator is N - 1, so the output comes out a factor of N/(N-1) too high. (If they wanted a trapezoid-rule computation of the integral, they should have halved the left and right endpoint values first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
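To illustrate both points, here is a rough sketch of my own (interval_probability is an illustrative name, not from the question's code): use a data-driven bandwidth such as Silverman's rule of thumb instead of the fixed 0.5, and integrate the KDE with the trapezoidal rule instead of multiplying N samples by a step.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)

# data-driven bandwidth (Silverman's rule of thumb, ~1.06*sigma*n**(-1/5))
bw = 1.06 * x.std() * len(x) ** (-1 / 5)
kd = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x.reshape(-1, 1))

def interval_probability(lo, hi, kd, n_eval=200):
    grid = np.linspace(lo, hi, n_eval)
    pdf = np.exp(kd.score_samples(grid[:, None]))   # KDE values on the grid
    return np.trapz(pdf, grid)                      # trapezoidal rule, no N/(N-1) bias

print(interval_probability(x.mean() - x.std(), x.mean() + x.std(), kd))
With the narrower bandwidth and the corrected integral, the result lands much closer to 0.68.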
Background: I observe a sample of a variable z that is the sum of two independent and identically distributed variables x and y. I'm trying to recover the distribution of x, y (call it f) from the distribution of z (call it g), under the assumption that f is symmetric about zero. According to Horowitz and Markatou (1996) we have that the Fourier Transform of f is equal to sqrt(|G|), where G is the Fourier transform of g.
Example:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, laplace
# sample size
size = 10000
# Number of points to preform FFT on
N = 501
# Scale of the laplace rvs
scale = 3.0
# Test deconvolution
laplace_f = laplace(scale=scale)
x = laplace_f.rvs(size=size)
y = laplace_f.rvs(size=size)
z = x + y
t = np.linspace(-4 * scale, 4 * scale, size)
laplace_pdf = laplace_f.pdf(t)
t2 = np.linspace(-4 * scale, 4 * scale, N)
# Get density from z. Kind of cheating using gaussian
z_density = gaussian_kde(z)(t2)
z_density = (z_density + z_density[::-1]) / 2
z_density_half = z_density[:((N - 1) // 2) + 1]
ft_z_density = np.fft.hfft(z_density_half)
inv_fz_density = np.fft.ihfft(np.sqrt(np.abs(ft_z_density)))
inv_fz_density = np.r_[inv_fz_density, inv_fz_density[::-1][:-1]]
f_deconv_shifted = np.real(np.fft.fftshift(inv_fz_density))
f_deconv = np.real(inv_fz_density)
# Normalize to be a pdf (divide by the integral over t2)
f_deconv_shifted /= np.trapz(f_deconv_shifted, t2)
f_deconv /= np.trapz(f_deconv, t2)
# Plot
plt.subplot(221)
plt.plot(t, laplace_pdf)
plt.title('laplace pdf')
plt.subplot(222)
plt.plot(t2, z_density)
plt.title("z density")
plt.subplot(223)
plt.plot(t2, f_deconv_shifted)
plt.title('Deconvolved with shift')
plt.subplot(224)
plt.plot(t2, f_deconv)
plt.title('Deconvolved without shift')
plt.tight_layout()
plt.show()
Which results in
Issue: there's clearly something wrong here. I don't think I should need the shift, yet the shifted pdf seems to be closer to the truth. I suspect it has something to do with the domain of the IFFT changing with the sqrt(abs()) operation, but I'm really not sure.
The FFT is defined such that the times associated to the input samples are t=0..N-1. That is, the origin is in the first sample. The same is true for the output, the associated frequencies are k=0..N-1.
Your distribution is symmetric about zero, but the FFT knows nothing about your t axis (you cannot pass it to the FFT function). Given the sample positions implied by the FFT definition, your distribution is actually shifted, and that shift adds a phase component in the frequency domain. You discard that phase by using hfft instead of fft, which effectively shifts your input signal so that it becomes symmetric about the origin as defined by the FFT (not your origin).
fftshift shifts the signal resulting from the IFFT so that the origin is back where you want it to be. I recommend that you use ifftshift before calling hfft, just to ensure that your signal really is symmetric in the way that function expects. I don't know if it will make a difference; it depends on how that function is implemented.
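For concreteness, here is a rough sketch of that suggestion (my own illustration, reusing z_density and t2 from the question): shift the density so its centre sits at sample 0, transform, take the square root, transform back, and shift the origin back to the centre.
import numpy as np

zd0 = np.fft.ifftshift(z_density)        # move the centre (t = 0) to sample 0
G = np.fft.fft(zd0)                      # real and (numerically) even, since zd0 is even
F = np.sqrt(np.abs(G))                   # |G|**0.5, per Horowitz and Markatou
f_est = np.fft.fftshift(np.real(np.fft.ifft(F)))   # back to the signal domain, origin recentred
f_est /= np.trapz(f_est, t2)             # normalize so it integrates to 1 over t2
Because the shifted array is symmetric about the origin in the FFT's own convention, no phase is discarded and no extra shift is needed afterwards beyond the final fftshift.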
I can compute the autocorrelation using numpy's built in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, and compute the correlation on each resulting window, then average them all together to compute cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to compute partition and loop through the data myself?
UPDATE
This is motivated by @kazemakase's answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that @kazemakase is correct that the AC function naturally averages out the noise. However, averaging the AC over short windows has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than the O(n log n) I would expect if the correlation were calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = arma_generate_sample(ar, ma, 10000)
def calc_rxx(x):
x = x-x.mean()
N = len(x)
    Rxx = np.correlate(x,x,mode="same")[N//2:]/N
    #Rxx = np.correlate(x,x,mode="same")[N//2:]/np.arange(N,N//2,-1)
return Rxx/x.var()
def avg_rxx(x,nperseg=1024):
rxx_windows = []
Nw = int(np.floor(len(x)/nperseg))
    print(Nw)
first = True
for i in range(Nw-1):
xw = x[i*nperseg:nperseg*(i+1)]
y = calc_rxx(xw)
if i%1 == 0:
if first:
plt.semilogx(y,"k",alpha=0.2,label="Short AC")
first = False
else:
plt.semilogx(y,"k",alpha=0.2)
rxx_windows.append(y)
    print(np.shape(rxx_windows))
return np.mean(rxx_windows,axis=0)
plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
TL-DR: To decrease noise in the autocorrelation function increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation (up to normalization) is defined as
Rxx[l] = sum_{n=0}^{N-1} x[n] * x[n+l]
Let's say we partition the data into two windows. Their autocorrelations become
Rxx1[l] = sum_{n=0}^{N/2-1} x[n] * x[n+l]   and   Rxx2[l] = sum_{n=N/2}^{N-1} x[n] * x[n+l]
Note how they differ only in the limits of the summations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back at the original autocorrelation! So we did not gain anything.
The conclusion is, there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
to keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2, i.e. the mean of the normalized partial autocorrelations equals the complete normalized autocorrelation.
to keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partial autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
Code heretic! I don't believe your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
ry = np.correlate(y, y, mode='same') / m # partial ACF
rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here is the good news: autocorrelation is already a noise reduction technique! Well, in some way at least: one application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
x = np.random.randn(n)
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()
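As a side note on the speed concern raised in the question's update: the full-length ACF itself can be computed in O(n log n) with an FFT-based correlation, e.g. via scipy.signal.fftconvolve. A sketch of mine (not part of the discussion above):
from scipy.signal import fftconvolve
import numpy as np

n = 2**12
x = np.random.randn(n)
x = x - x.mean()
rx_direct = np.correlate(x, x, mode='full') / n    # O(n^2) direct correlation
rx_fft = fftconvolve(x, x[::-1], mode='full') / n  # same result via FFT, O(n log n)
print(np.allclose(rx_direct, rx_fft))              # agree to floating-point precision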
Despite having searched related questions for two days, I have not really found an answer to this problem yet...
In the following code, I generate n normally distributed random variables, which are then represented in a histogram:
import numpy as np
import matplotlib.pyplot as plt
n = 10000 # number of generated random variables
x = np.random.normal(0,1,n) # generate n random variables
# plot this in a non-normalized histogram:
plt.hist(x, bins='auto', density=False)
# get the arrays containing the bin counts and the bin edges:
histo, bin_edges = np.histogram(x, bins='auto', density=False)
number_of_bins = len(bin_edges)-1
After that, a fitting function and its parameters are found.
It is a normal distribution with parameters a1 and b1, scaled by scaling_factor to account for the fact that the histogram is not normalized.
It indeed fits the histogram quite well:
import scipy as sp
import scipy.stats   # makes sp.stats available
a1, b1 = sp.stats.norm.fit(x)
scaling_factor = n*(x.max()-x.min())/number_of_bins
x_achse = np.linspace(x.min(), x.max(), 1000)   # x values at which to draw the fitted curve
plt.plot(x_achse, scaling_factor*sp.stats.norm.pdf(x_achse, a1, b1), 'b')
Here's the plot of the histogram with the fitting function in red.
After that, I want to test how well this function fits the histogram using the chi-squared test.
This test uses the observed values and the expected values in those points. To calculate the expected values, I first calculate the location of the middle of each bin, this information is contained in the array x_middle. I then calculate the value of the fitting function at the middle point of each bin, which gives the expected_value array:
observed_values = histo
bin_width = bin_edges[1] - bin_edges[0]
# array containing the middle point of each bin:
x_middle = np.linspace( bin_edges[0] + 0.5*bin_width,
bin_edges[0] + (0.5 + number_of_bins)*bin_width,
num = number_of_bins)
expected_values = scaling_factor*sp.stats.norm.pdf(x_middle,a1,b1)
Plugging this into SciPy's chisquare function, I get p-values on the order of 1e-5 to 1e-15, which tells me the fitting function does not describe the histogram:
print(sp.stats.chisquare(observed_values,expected_values,ddof=2))
But this is not true, the function fits the histogram very well!
Does anybody know where I made a mistake?
Thanks a lot!!
Charles
p.s.: I set the number of delta degrees of freedom to 2 because the two parameters a1 and b1 are estimated from the sample. I tried other values of ddof, but the results were still just as poor!
Your calculation of the end-point of the array x_middle is off by one; it should be:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
num=number_of_bins)
Note the extra - 1 in the second argument of linspace().
A more concise version is
x_middle = 0.5*(bin_edges[1:] + bin_edges[:-1])
A different (and possibly more accurate) approach to computing expected_values is to use the differences of the CDF, instead of approximating those differences using the PDF in the middle of each interval:
In [75]: from scipy import stats
In [76]: cdf = stats.norm.cdf(bin_edges, a1, b1)
In [77]: expected_values = n * np.diff(cdf)
With that calculation, I get the following result from the chi-squared test:
In [85]: stats.chisquare(observed_values, expected_values, ddof=2)
Out[85]: Power_divergenceResult(statistic=61.168393496775181, pvalue=0.36292223875686402)
I need code to do 2D Kernel Density Estimation (KDE), and I've found the SciPy implementation is too slow. So, I've written an FFT based implementation, but several things confuse me. (The FFT implementation also enforces periodic boundary conditions, which is what I want.)
The implementation is based on creating a simple histogram from the samples and then convolving this with a gaussian. Here's code to do this and compare it with the SciPy result.
from numpy import *
from scipy.stats import *
from numpy.fft import *
from matplotlib.pyplot import *
from time import perf_counter
ion()
#PARAMETERS
N = 512 #number of histogram bins; want 2^n for maximum FFT speed?
nSamp = 1000 #number of samples if using the random variable
h = 0.1 #width of gaussian
wh = 1.0 #width and height of square domain
#VARIABLES FROM PARAMETERS
rv = uniform(loc=-wh,scale=2*wh) #random variable that can generate samples
xyBnds = linspace(-1.0, 1.0, N+1) #boundaries of histogram bins
xy = (xyBnds[1:] + xyBnds[:-1])/2 #centers of histogram bins
xx, yy = meshgrid(xy,xy)
#DEFINE SAMPLES, TWO OPTIONS
#samples = rv.rvs(size=(nSamp,2))
samples = array([[0.5,0.5],[0.2,0.5],[0.2,0.2]])
#DEFINITIONS FOR FFT IMPLEMENTATION
ker = exp(-(xx**2 + yy**2)/2/h**2)/h/sqrt(2*pi) #Gaussian kernel
fKer = fft2(ker) #DFT of kernel
#FFT IMPLEMENTATION
stime = perf_counter()
#generate normalized histogram. Not sure why .T is needed:
hst = histogram2d(samples[:,0], samples[:,1], bins=xyBnds)[0].T / (xy[-1] - xy[0])**2
#convolve histogram with kernel. Not sure why fftshift is needed:
KDE1 = fftshift(ifft2(fft2(hst)*fKer))/N
etime = perf_counter()
print("FFT method time:", etime - stime)
#DEFINITIONS FOR NON-FFT IMPLEMENTATION FROM SCIPY
#points to sample the KDE at, in a form gaussian_kde likes:
grid_coords = append(xx.reshape(-1,1),yy.reshape(-1,1),axis=1)
#NON-FFT IMPLEMENTATION FROM SCIPY
stime = perf_counter()
KDEfn = gaussian_kde(samples.T, bw_method=h)
KDE2 = KDEfn(grid_coords.T).reshape((N,N))
etime = perf_counter()
print("SciPy time:", etime - stime)
#PLOT FFT IMPLEMENTATION RESULTS
fig = figure()
ax = fig.add_subplot(111, aspect='equal')
c = contour(xy, xy, KDE1.real)
clabel(c)
title("FFT Implementation Results")
#PLOT SCIPY IMPLEMENTATION RESULTS
fig = figure()
ax = fig.add_subplot(111, aspect='equal')
c = contour(xy, xy, KDE2)
clabel(c)
title("SciPy Implementation Results")
There are two sets of samples above. The 1000 random points are for benchmarking and are commented out; the three points are for debugging.
The resulting plots for the latter case are at the end of this post.
Here are my questions:
Can I avoid the .T for the histogram and the fftshift for KDE1? I'm not sure why they're needed, but the gaussians show up in the wrong places without them.
How is the scalar bandwidth defined for SciPy? The gaussians have much different widths in the two implementations.
Along the same lines, why are the gaussians in the SciPy implementation not radially symmetric even though I gave gaussian_kde a scalar bandwidth?
How could I implement the other bandwidth methods available in SciPy for the FFT code?
(Let me note that the FFT code is ~390x faster than the SciPy code in the 1000 random points case.)
The differences you're seeing are due to the bandwidth and scaling factors, as you've already noticed.
By default, gaussian_kde chooses the bandwidth using Scott's rule. Dig into the code if you're curious about the details. The code snippets below are from something I wrote quite a while ago to do something similar to what you're doing. (If I remember right, there's an obvious error in that particular version and it really shouldn't use scipy.signal for the convolution, but the bandwidth estimation and normalization are correct.)
# (This snippet assumes xyi is a 2xN array of the point coordinates and n is the number of points.)
# Calculate the covariance matrix (in pixel coords)
cov = np.cov(xyi)
# Scaling factor for bandwidth
scotts_factor = np.power(n, -1.0 / 6) # For 2D
#---- Make the gaussian kernel -------------------------------------------
# First, determine how big the gridded kernel needs to be (2 stdev radius)
# (do we need to convolve with a 5x5 array or a 100x100 array?)
std_devs = np.diag(np.sqrt(cov))
kern_nx, kern_ny = np.round(scotts_factor * 2 * np.pi * std_devs)
# Determine the bandwidth to use for the gaussian kernel
inv_cov = np.linalg.inv(cov * scotts_factor**2)
After the convolution, the grid is then normalized:
# Normalization factor to divide result by so that units are in the same
# units as scipy.stats.kde.gaussian_kde's output. (Sums to 1 over infinity)
norm_factor = 2 * np.pi * cov * scotts_factor**2
norm_factor = np.linalg.det(norm_factor)
norm_factor = n * dx * dy * np.sqrt(norm_factor)   # dx, dy: grid cell sizes
# Normalize the result
grid /= norm_factor
Hopefully that helps clarify things a touch.
As for your other questions:
Can I avoid the .T for the histogram and the fftshift for KDE1? I'm
not sure why they're needed, but the gaussians show up in the wrong
places without them.
I could be misreading your code, but I think you just have the transpose because you're going from point coordinates to index coordinates (i.e. from <x, y> to <y, x>).
Along the same lines, why are the gaussians in the SciPy
implementation not radially symmetric even though I gave gaussian_kde
a scalar bandwidth?
This is because scipy uses the full covariance matrix of the input x, y points to determine the gaussian kernel. Your formula assumes that x and y aren't correlated. gaussian_kde tests for and uses the correlation between x and y in the result.
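To make that concrete, here is a sketch of my own (not from the answer's original code), reusing xx, yy and inv_cov from the snippets above, of an anisotropic kernel built from the scaled covariance rather than a single scalar h. The normalization is left to the norm_factor computation shown earlier.
import numpy as np

# quadratic form (x, y) @ inv_cov @ (x, y)^T evaluated on the grid
a, b, c = inv_cov[0, 0], inv_cov[0, 1], inv_cov[1, 1]
ker = np.exp(-0.5 * (a * xx**2 + 2 * b * xx * yy + c * yy**2))   # unnormalized anisotropic Gaussian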
How could I implement the other bandwidth methods available in SciPy
for the FFT code?
I'll leave that one for you to figure out. :) It's not too hard, though. Basically, instead of scotts_factor, you'd change the formula and have some other scalar factor. Everything else is the same.
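For example, a minimal sketch of the two rule-of-thumb factors as scipy.stats.gaussian_kde defines them, where n is the number of samples and d the number of dimensions:
import numpy as np

def scotts_factor(n, d):
    # Scott's rule: n ** (-1 / (d + 4))
    return np.power(n, -1.0 / (d + 4))

def silverman_factor(n, d):
    # Silverman's rule: (n * (d + 2) / 4) ** (-1 / (d + 4))
    return np.power(n * (d + 2) / 4.0, -1.0 / (d + 4))

# scotts_factor(n, 2) reproduces the n**(-1/6) used in the 2D snippet above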