Faster method for creating spatially correlated noise? - python

In my current project, I am interested in calculating spatially correlated noise for a large model grid. The noise should be strongly correlated over short distances, and uncorrelated over large distances. My current approach uses multivariate Gaussians with a covariance matrix specifying the correlation between all cells.
Unfortunately, this approach is extremely slow for large grids. Do you have a recommendation of how one might generate spatially correlated noise more efficiently? (It doesn't have to be Gaussian)
import scipy.stats
import numpy as np
import scipy.spatial.distance
import matplotlib.pyplot as plt
# Create a 50-by-50 grid; My actual grid will be a LOT larger
X,Y = np.meshgrid(np.arange(50),np.arange(50))
# Create a vector of cells
XY = np.column_stack((np.ndarray.flatten(X),np.ndarray.flatten(Y)))
# Calculate a matrix of distances between the cells
dist = scipy.spatial.distance.pdist(XY)
dist = scipy.spatial.distance.squareform(dist)
# Convert the distance matrix into a covariance matrix
correlation_scale = 50
cov = np.exp(-dist**2/(2*correlation_scale)) # This will do as a covariance matrix
# Sample some noise !slow!
noise = scipy.stats.multivariate_normal.rvs(
    mean = np.zeros(50**2),
    cov = cov)
# Plot the result
plt.contourf(X,Y,noise.reshape((50,50)))

Faster approach:
Generate spatially uncorrelated noise.
Blur with Gaussian filter kernel to make noise spatially correlated.
Since the filter kernel is rather large, it is a good idea to use a convolution method based on Fast Fourier Transform.
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
# Compute filter kernel with radius correlation_scale (can probably be a bit smaller)
correlation_scale = 50
x = np.arange(-correlation_scale, correlation_scale)
y = np.arange(-correlation_scale, correlation_scale)
X, Y = np.meshgrid(x, y)
dist = np.sqrt(X*X + Y*Y)
filter_kernel = np.exp(-dist**2/(2*correlation_scale))
# Generate n-by-n grid of spatially correlated noise
n = 50
noise = np.random.randn(n, n)
noise = scipy.signal.fftconvolve(noise, filter_kernel, mode='same')
plt.contourf(np.arange(n), np.arange(n), noise)
plt.savefig("fast.png")
[Figures not reproduced: sample output of this method, sample output of the slow method from the question, and a plot of image size vs. running time.]
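For reference, here is a minimal sketch of the same "white noise, then blur" idea wrapped in a function. It swaps fftconvolve for scipy.ndimage.gaussian_filter, which applies the Gaussian blur separably, so it is a variant and not the exact code above. The function name correlated_noise, the seed argument and the rescaling to unit variance are my own assumptions. Note that sigma here is a standard deviation in grid cells, so the kernel exp(-dist**2/(2*correlation_scale)) above roughly corresponds to sigma = sqrt(correlation_scale).

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def correlated_noise(shape, sigma, seed=None):
    """Return spatially correlated noise with unit sample variance.

    shape : grid shape, e.g. (ny, nx)
    sigma : Gaussian blur width in grid cells (sets the correlation length)
    seed  : optional seed for reproducibility (hypothetical convenience argument)
    """
    rng = np.random.default_rng(seed)
    # Step 1: spatially uncorrelated (white) Gaussian noise
    white = rng.standard_normal(shape)
    # Step 2: blur it; mode="wrap" treats the grid as periodic, which keeps
    # the variance uniform up to the edges
    smooth = gaussian_filter(white, sigma=sigma, mode="wrap")
    # Blurring shrinks the variance, so rescale back to unit variance
    return smooth / smooth.std()


noise = correlated_noise((128, 128), sigma=5, seed=0)
```

If periodic edges are not wanted, one option is to generate a grid that is a few sigma larger in each direction and crop it afterwards.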

Related

Covariance matrix for circular variables?

In my current project, I have a collection of three-dimensional samples such as [-0.5,-0.1,0.2]*pi, [0.8,-0.1,-0.4]*pi. These variables are circular/periodic, with their values ranging from -pi to +pi. It is my goal to calculate a 3-by-3 covariance matrix for these circular variables.
SciPy has a built-in function to calculate circular standard deviations (scipy.stats.circstd), which I can use to calculate the standard deviations along each dimension, then use them to create a diagonal covariance matrix (i.e., without any correlation). Ideally, however, I would like to consider correlations between the parameters as well. Is there a way to calculate correlations between circular variables, or to directly compute the covariance matrix between them?
import numpy as np
import scipy.stats
# A collection of N circular samples
samples = np.asarray(
[[0.384917, 1.28862, -2.034],
[0.384917, 1.28862, -2.034],
[0.759245, 1.16033, -2.57942],
[0.45797, 1.31103, 2.9846],
[0.898047, 1.20955, -3.02987],
[1.25694, 1.74957, 2.46946],
[1.02173, 1.26477, 1.83757],
[1.22435, 1.62939, 1.99264]])
# Calculate the circular standard deviations
stds = scipy.stats.circstd(samples, high = np.pi, low = -np.pi, axis = 0)
# Create a diagonal covariance matrix
cov = np.identity(3)
np.fill_diagonal(cov,stds**2)
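One possible way to fill in the off-diagonal terms, offered only as a sketch: compute a circular correlation coefficient from the sines of the deviations from the circular means (the Jammalamadaka-SenGupta coefficient), then scale it by the circular standard deviations from above. The function name circular_covariance is hypothetical, and combining this correlation with circstd into a "covariance" is an assumption of this sketch, not an established SciPy feature.

```python
import numpy as np
from scipy.stats import circmean, circstd


def circular_covariance(samples):
    """Sketch of a covariance-like matrix for angles in [-pi, pi].

    samples : array of shape (n_samples, n_dims)
    """
    samples = np.asarray(samples, dtype=float)
    # Circular mean of each dimension
    means = circmean(samples, high=np.pi, low=-np.pi, axis=0)
    # Sine of the deviation from the mean is insensitive to wrap-around
    dev = np.sin(samples - means)
    # Circular correlation: normalized cross-products of the sine deviations
    num = dev.T @ dev
    norm = np.sqrt(np.diag(num))
    corr = num / np.outer(norm, norm)
    # Scale the correlations by the circular standard deviations
    stds = circstd(samples, high=np.pi, low=-np.pi, axis=0)
    return corr * np.outer(stds, stds)
```

The diagonal of the result equals stds**2, so it reduces to the diagonal matrix built above when the dimensions are uncorrelated.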

Fast way to reduce noise of autocorrelation function in python?

I can compute the autocorrelation using numpy's built-in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, and compute the correlation on each resulting window, then average them all together to compute cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to compute partition and loop through the data myself?
UPDATE
This is motivated by @kazemakase's answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that @kazemakase is correct that the AC function naturally averages out the noise. However, averaging the AC over windows has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than the O(n log n) that I would expect if the correlation were calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = arma_generate_sample(ar, ma, 10000)
def calc_rxx(x):
    # Normalized autocorrelation for non-negative lags
    x = x-x.mean()
    N = len(x)
    Rxx = np.correlate(x,x,mode="same")[N//2:]/N
    #Rxx = np.correlate(x,x,mode="same")[N//2:]/np.arange(N,N//2,-1)
    return Rxx/x.var()
def avg_rxx(x,nperseg=1024):
    # Average the autocorrelations of consecutive windows of length nperseg
    rxx_windows = []
    Nw = len(x)//nperseg
    print(Nw)
    first = True
    for i in range(Nw-1):
        xw = x[i*nperseg:nperseg*(i+1)]
        y = calc_rxx(xw)
        if i%1 == 0:
            if first:
                plt.semilogx(y,"k",alpha=0.2,label="Short AC")
                first = False
            else:
                plt.semilogx(y,"k",alpha=0.2)
        rxx_windows.append(y)
    print(np.shape(rxx_windows))
    return np.mean(rxx_windows,axis=0)
plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
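Regarding the O(n^2) scaling mentioned in the update: np.correlate evaluates the sum directly. A rough sketch of the FFT route is below; the function name acf_fft and the max_lag argument are my own. The signal is zero-padded so that the circular correlation computed by the FFT matches the linear one. As far as I know, scipy.signal.correlate with method='fft' does something comparable.

```python
import numpy as np


def acf_fft(x, max_lag=None):
    """Normalized autocorrelation for lags 0..n-1, computed via the FFT."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to a power of two of at least 2n-1 samples so the circular
    # correlation does not wrap around
    nfft = 1 << (2 * n - 1).bit_length()
    spec = np.fft.rfft(x, nfft)
    # Inverse transform of the power spectrum gives the autocovariance
    acov = np.fft.irfft(spec * np.conj(spec), nfft)[:n] / n
    acf = acov / acov[0]
    return acf if max_lag is None else acf[:max_lag + 1]
```

For the 10000-sample series above, acf_fft(x, max_lag=150) should give the same curve as calc_rxx(x) over the plotted range, up to floating-point error.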
TL;DR: To decrease noise in the autocorrelation function, increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation is defined as
$$R_{xx}(l) = \sum_{i=0}^{n-1} x_i \, x_{i+l}$$
Let's say we partition the data into two windows. Their autocorrelations become
$$R_{xx}^{(1)}(l) = \sum_{i=0}^{n/2-1} x_i \, x_{i+l}, \qquad R_{xx}^{(2)}(l) = \sum_{i=n/2}^{n-1} x_i \, x_{i+l}$$
Note how they differ only in the limits of the summations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back to the original autocorrelation! So we did not gain anything.
The conclusion is, there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
To keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2, i.e. the mean of the normalized partial autocorrelations is equal to the complete normalized autocorrelation.
To keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partial autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
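A quick numerical check of the splitting argument, as a sketch. The helper windowed_sum is a hypothetical name; it sums x[i] * x[i + lag] over a range of start indices i and lets the first window reach into the second one, which mirrors the "indexable beyond the window" assumption from the last remark.

```python
import numpy as np


def windowed_sum(x, start, stop, lag):
    """Sum x[i] * x[i + lag] for i in [start, stop), skipping terms past the end of x."""
    i = np.arange(start, min(stop, len(x) - lag))
    return float(np.sum(x[i] * x[i + lag]))


rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
n = len(x)
lag = 5

full = windowed_sum(x, 0, n, lag)             # whole signal
first_half = windowed_sum(x, 0, n // 2, lag)  # i = 0 .. n/2 - 1
second_half = windowed_sum(x, n // 2, n, lag) # i = n/2 .. n - 1

# The two partial sums add up to the full (unnormalized) autocorrelation
print(np.isclose(full, first_half + second_half))
```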
Code heretic! I don't believe your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
    ry = np.correlate(y, y, mode='same') / m # partial ACF
    rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here is the good news: Autocorrelation is already a noise reduction technique! Well, in some way at least: An application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
    x = np.random.randn(n)
    rx = np.correlate(x, x, mode='same') / n # ACF
    l1 = np.arange(-n//2, n//2) # Lags
    plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()
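To put a number on this instead of judging the plot by eye, here is a small sketch that measures the RMS of the ACF at non-zero lags. The function name acf_noise_level and its arguments are my own. For Gaussian white noise the value should be roughly 1/sqrt(n), so it shrinks as the signal gets longer.

```python
import numpy as np


def acf_noise_level(n, max_lag=20, seed=0):
    """RMS of the white-noise ACF over lags 1..max_lag for a signal of length n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    # 'full' mode returns lags -(n-1)..(n-1); keep the non-negative lags
    r = np.correlate(x, x, mode="full")[n - 1:] / n
    # Lag 0 is the signal power, not noise, so it is excluded
    return float(np.sqrt(np.mean(r[1:max_lag + 1] ** 2)))


for n in [2**6, 2**8, 2**12]:
    print(n, acf_noise_level(n), 1 / np.sqrt(n))
```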

Discrete cosine transform to differentiate real symmetric function

I'd like to differentiate a real, periodic function on (0,2*pi) which is also symmetric about x=pi using a discrete Fourier transform. I have written a Python code which does this using a FFT/IFFT but this does not take into account the symmetry of the function and so is a bit wasteful.
(The overall aim is to make a pseudospectral fluid flow solver and the periodicity and symmetry in one of the directions should allow me to expand the variables in that direction using only the cosine part of a Fourier series)
I know I need to use a Discrete Cosine Transform (DCT) to do this but cannot work out what needs to be changed about my domain (x), wavenumber vector (k) and implementation of the DCT/IDCT save that the former two should be half the length.
import sympy as sp
import numpy as np
import matplotlib.pylab as plt
from scipy.fftpack import fft, ifft
# Number of grid points
N = 2**5
# Test function to check results with (using SymPy)
w = 3.; X = sp.Symbol('x'); Y=sp.cos(w*X)
# Domain of regularly spaced points in [0,2pi)
x=(2*np.pi/N)*np.arange(0,N)
# Calc exact derivatives using SymPy then turn into functions
dY = Y.diff(X)
d2Y = dY.diff(X)
d3Y = d2Y.diff(X)
f = sp.lambdify(X, Y,'numpy')
df_ex = sp.lambdify(X, dY, 'numpy')
d2f_ex = sp.lambdify(X, d2Y, 'numpy')
d3f_ex = sp.lambdify(X, d3Y, 'numpy')
# Wavenumber vector
k=np.hstack(( np.arange(0,N/2), 0, np.arange(-N/2+1,0) ));
k2=k**2; k3=k**3;
# Transform to Fourier domain, differentiate, then return to physical space
F = fft(f(x))
df = np.real(ifft(1j*k*F))
d2f = np.real(ifft( -k2*F))
d3f = np.real(ifft(-1j*k3*F))
# Plot result
fh=plt.figure(figsize=(8,4)); ah=fh.add_subplot(111)
plt.plot(x,f(x),'b-',x,df_ex(x), 'r-',x,d2f_ex(x),'g-',x,d3f_ex(x),'k-')
plt.plot(x,df,'ro',x,d2f,'go',x,d3f,'ko')
plt.xlim([0,2*np.pi])
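A possible starting point for the DCT version, written as a sketch under some assumptions that differ from the code above. The function is sampled only on the half domain, at the N half-shifted points x_j = pi*(j + 1/2)/N in (0, pi), which is the grid the type-II DCT in scipy.fft works on. The derivative of a cosine series is a sine series, so the way back to physical space uses a type-III DST. The function name cosine_series_derivative is hypothetical.

```python
import numpy as np
from scipy.fft import dct, dst


def cosine_series_derivative(f_vals):
    """First derivative of an even, 2*pi-periodic function.

    f_vals : samples at x_j = pi * (j + 1/2) / N, j = 0..N-1
    Returns the derivative at the same points.
    """
    f_vals = np.asarray(f_vals, dtype=float)
    N = len(f_vals)
    # Cosine coefficients a_k such that f(x_j) = sum_k a_k * cos(k * x_j)
    a = dct(f_vals, type=2) / N
    a[0] /= 2
    k = np.arange(N)
    # d/dx of a_k*cos(k x) is -k*a_k*sin(k x); b[m] is the coefficient of
    # sin((m + 1) x), and the highest slot (wavenumber N) is left at zero
    b = np.zeros(N)
    b[:-1] = -k[1:] * a[1:]
    # The type-III DST sums the sine series on the half-shifted grid
    return dst(b, type=3) / 2


N = 32
x = np.pi * (np.arange(N) + 0.5) / N
f = np.cos(3 * x)
df = cosine_series_derivative(f)
```

Even derivatives stay cosine series, so a second derivative can presumably be obtained by multiplying a by -k**2 and inverting the DCT instead of switching to the DST.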

How to set a maximum distance between points for interpolation when using scipy.interpolate.griddata?

I have a spatial set of data with Z values that I want to interpolate using some matplotlib or scipy module. My XY points have a concave shape and I don't want interpolated values in the empty zone. Is there a method that easily allows the user to set a maximum distance between points to avoid interpolation in the empty zone?
I struggled with the same question and found a workaround by re-using the kd-tree implementation that scipy itself uses for the nearest neighbour interpolation, masking the interpolated result array with the result of the kd-tree query.
Consider the example code below:
import numpy as np
import scipy.interpolate
import matplotlib.pyplot as plt
# Generate some random data
xy = np.random.random((2**15, 2))
z = np.sin(10*xy[:,0]) * np.cos(10*xy[:,1])
grid = np.meshgrid(
np.linspace(0, 1, 512),
np.linspace(0, 1, 512)
)
# Interpolate
result1 = scipy.interpolate.griddata(xy, z, tuple(grid), 'linear')
# Show
plt.figimage(result1)
plt.show()
# Remove rectangular window
mask = np.logical_and.reduce((xy[:,0] > 0.2, xy[:,0] < 0.8, xy[:,1] > 0.2, xy[:,1] < 0.8))
xy, z = xy[~mask], z[~mask]
# Interpolate
result2 = scipy.interpolate.griddata(xy, z, tuple(grid), 'linear')
# Show
plt.figimage(result2)
plt.show()
This generates the following two images. Notice the strong interpolation artefacts because of the missing rectangular window in the centre of the data.
Now if we run the code below on the same example data, the following image is obtained.
THRESHOLD = 0.01
from scipy.interpolate.interpnd import _ndim_coords_from_arrays
from scipy.spatial import cKDTree
# Construct kd-tree, functionality copied from scipy.interpolate
tree = cKDTree(xy)
xi = _ndim_coords_from_arrays(tuple(grid), ndim=xy.shape[1])
dists, indexes = tree.query(xi)
# Copy original result but mask missing values with NaNs
# (result2[:] would only be a view, so masking it would also change result2)
result3 = result2.copy()
result3[dists > THRESHOLD] = np.nan
# Show
plt.figimage(result3)
plt.show()
I realize it may not be the visual effect you're after exactly. Especially if your dataset is not very dense you'll need to have a high distance threshold value in order for legitimately interpolated data not to be masked. If your data is dense enough, you might be able to get away with a relatively small radius, or maybe come up with a smarter cut-off function. Hope that helps.
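The same masking idea can be written without the private helper _ndim_coords_from_arrays, since private SciPy modules can move between versions. This is only a sketch; the function name mask_far_from_data and its argument names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def mask_far_from_data(values, grid_x, grid_y, points, max_dist):
    """Return a copy of values with NaN wherever the nearest data point
    is farther away than max_dist.

    values         : interpolated grid, same shape as grid_x and grid_y
    grid_x, grid_y : coordinate arrays as returned by np.meshgrid
    points         : (n, 2) array of the scattered data locations
    """
    tree = cKDTree(points)
    # Stack the grid coordinates into shape (..., 2); the last axis holds (x, y)
    query = np.stack([grid_x, grid_y], axis=-1)
    # Distance from every grid node to its nearest data point
    dists, _ = tree.query(query)
    masked = np.array(values, dtype=float, copy=True)
    masked[dists > max_dist] = np.nan
    return masked
```

With the variables from the example above, this would be called as mask_far_from_data(result2, grid[0], grid[1], xy, THRESHOLD).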

Implementing a 2D, FFT-based Kernel Density Estimator in python, and comparing it to the SciPy implementation

I need code to do 2D Kernel Density Estimation (KDE), and I've found the SciPy implementation is too slow. So, I've written an FFT based implementation, but several things confuse me. (The FFT implementation also enforces periodic boundary conditions, which is what I want.)
The implementation is based on creating a simple histogram from the samples and then convolving this with a gaussian. Here's code to do this and compare it with the SciPy result.
from numpy import *
from scipy.stats import *
from numpy.fft import *
from matplotlib.pyplot import *
from time import perf_counter as clock #time.clock was removed in Python 3.8
ion()
#PARAMETERS
N = 512 #number of histogram bins; want 2^n for maximum FFT speed?
nSamp = 1000 #number of samples if using the random variable
h = 0.1 #width of gaussian
wh = 1.0 #width and height of square domain
#VARIABLES FROM PARAMETERS
rv = uniform(loc=-wh,scale=2*wh) #random variable that can generate samples
xyBnds = linspace(-1.0, 1.0, N+1) #boundaries of histogram bins
xy = (xyBnds[1:] + xyBnds[:-1])/2 #centers of histogram bins
xx, yy = meshgrid(xy,xy)
#DEFINE SAMPLES, TWO OPTIONS
#samples = rv.rvs(size=(nSamp,2))
samples = array([[0.5,0.5],[0.2,0.5],[0.2,0.2]])
#DEFINITIONS FOR FFT IMPLEMENTATION
ker = exp(-(xx**2 + yy**2)/2/h**2)/h/sqrt(2*pi) #Gaussian kernel
fKer = fft2(ker) #DFT of kernel
#FFT IMPLEMENTATION
stime = clock()
#generate normalized histogram. Not sure why .T is needed:
hst = histogram2d(samples[:,0], samples[:,1], bins=xyBnds)[0].T / (xy[-1] - xy[0])**2
#convolve histogram with kernel. Not sure why fftshift is needed:
KDE1 = fftshift(ifft2(fft2(hst)*fKer))/N
etime = clock()
print("FFT method time:", etime - stime)
#DEFINITIONS FOR NON-FFT IMPLEMENTATION FROM SCIPY
#points to sample the KDE at, in a form gaussian_kde likes:
grid_coords = append(xx.reshape(-1,1),yy.reshape(-1,1),axis=1)
#NON-FFT IMPLEMENTATION FROM SCIPY
stime = clock()
KDEfn = gaussian_kde(samples.T, bw_method=h)
KDE2 = KDEfn(grid_coords.T).reshape((N,N))
etime = clock()
print("SciPy time:", etime - stime)
#PLOT FFT IMPLEMENTATION RESULTS
fig = figure()
ax = fig.add_subplot(111, aspect='equal')
c = contour(xy, xy, KDE1.real)
clabel(c)
title("FFT Implementation Results")
#PRINT SCIPY IMPLEMENTATION RESULTS
fig = figure()
ax = fig.add_subplot(111, aspect='equal')
c = contour(xy, xy, KDE2)
clabel(c)
title("SciPy Implementation Results")
There are two sets of samples above. The 1000 random points are for benchmarking and are commented out; the three points are for debugging.
The resulting plots for the latter case are at the end of this post.
Here are my questions:
Can I avoid the .T for the histogram and the fftshift for KDE1? I'm not sure why they're needed, but the gaussians show up in the wrong places without them.
How is the scalar bandwidth defined for SciPy? The gaussians have much different widths in the two implementations.
Along the same lines, why are the gaussians in the SciPy implementation not radially symmetric even though I gave gaussian_kde a scalar bandwidth?
How could I implement the other bandwidth methods available in SciPy for the FFT code?
(Let me note that the FFT code is ~390x faster than the SciPy code in the 1000 random points case.)
The differences you're seeing are due to the bandwidth and scaling factors, as you've already noticed.
By default, gaussian_kde chooses the bandwidth using Scott's rule. Dig into the code if you're curious about the details. The code snippets below are from something I wrote quite a while ago to do something similar to what you're doing. (If I remember right, there's an obvious error in that particular version and it really shouldn't use scipy.signal for the convolution, but the bandwidth estimation and normalization are correct.)
# Calculate the covariance matrix (in pixel coords)
cov = np.cov(xyi)
# Scaling factor for bandwidth
scotts_factor = np.power(n, -1.0 / 6) # For 2D
#---- Make the gaussian kernel -------------------------------------------
# First, determine how big the gridded kernel needs to be (2 stdev radius)
# (do we need to convolve with a 5x5 array or a 100x100 array?)
std_devs = np.sqrt(np.diag(cov)) # sqrt of the diagonal only; avoids NaNs from negative off-diagonal terms
kern_nx, kern_ny = np.round(scotts_factor * 2 * np.pi * std_devs)
# Determine the bandwidth to use for the gaussian kernel
inv_cov = np.linalg.inv(cov * scotts_factor**2)
After the convolution, the grid is then normalized:
# Normalization factor to divide result by so that units are in the same
# units as scipy.stats.kde.gaussian_kde's output. (Sums to 1 over infinity)
norm_factor = 2 * np.pi * cov * scotts_factor**2
norm_factor = np.linalg.det(norm_factor)
norm_factor = n * dx * dy * np.sqrt(norm_factor)
# Normalize the result
grid /= norm_factor
Hopefully that helps clarify things a touch.
As for your other questions:
"Can I avoid the .T for the histogram and the fftshift for KDE1? I'm not sure why they're needed, but the gaussians show up in the wrong places without them."
I could be misreading your code, but I think you just have the transpose because you're going from point coordinates to index coordinates (i.e. from <x, y> to <y, x>).
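A tiny check of that index ordering, as a sketch with made-up sample values: np.histogram2d puts x along the first axis and y along the second, while arrays built with meshgrid (and plotted with contour) use rows for y and columns for x.

```python
import numpy as np

# One sample at x = 0.75, y = -0.75 on a 4-by-4 grid of bins over [-1, 1]
edges = np.linspace(-1, 1, 5)
H, _, _ = np.histogram2d([0.75], [-0.75], bins=edges)

# histogram2d indexes as H[x_bin, y_bin]
ix, iy = np.argwhere(H == 1)[0]
print(ix, iy)  # x falls in the last bin (3), y in the first bin (0)

# After transposing, rows follow y and columns follow x, like meshgrid output
print(H.T[iy, ix])
```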
"Along the same lines, why are the gaussians in the SciPy implementation not radially symmetric even though I gave gaussian_kde a scalar bandwidth?"
This is because scipy uses the full covariance matrix of the input x, y points to determine the gaussian kernel. Your formula assumes that x and y aren't correlated. gaussian_kde tests for and uses the correlation between x and y in the result.
"How could I implement the other bandwidth methods available in SciPy for the FFT code?"
I'll leave that one for you to figure out. :) It's not too hard, though. Basically, instead of scotts_factor, you'd change the formula and have some other scalar factor. Everything else is the same.
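To see what that scalar factor does, here is a short sketch that reads it back from gaussian_kde. The helper name kernel_covariance is my own. The kernel covariance SciPy uses is the data covariance multiplied by factor**2, where factor is n**(-1/(d+4)) for Scott's rule, or the scalar itself if a scalar bw_method is given.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kernel_covariance(samples, bw_method=None):
    """Return (factor, kernel covariance) that gaussian_kde uses.

    samples   : array of shape (n_samples, n_dims)
    bw_method : passed straight through to gaussian_kde
    """
    kde = gaussian_kde(samples.T, bw_method=bw_method)
    return kde.factor, kde.covariance


rng = np.random.default_rng(0)
samples = rng.uniform(-1, 1, size=(200, 2))

scott_factor, scott_cov = kernel_covariance(samples)
scalar_factor, scalar_cov = kernel_covariance(samples, bw_method=0.1)

# Scott's rule in 2D: n ** (-1 / 6)
print(scott_factor, len(samples) ** (-1.0 / 6))
# A scalar bandwidth still scales the full data covariance
print(np.allclose(scalar_cov, np.cov(samples.T) * 0.1**2))
```

So an FFT implementation that wants to match SciPy would have to build its Gaussian kernel from this covariance matrix, not from the scalar h alone.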
