Related
I have two signals which are related to each other and have been captured by two different measurement devices simultaneously.
Since the two measurements are not time synchronized there is a small time delay between them which I want to calculate. Additionally, I need to know which signal is the leading one.
The following can be assumed:
no or only very less noise present
speed of the algorithm is not an issue, only accuracy and robustness
signals are captured with an high sampling rate (>10 kHz) for several seconds
expected time delay is < 0.5s
I though of using-cross correlation for that purpose.
Any suggestions how to implement that in Python are very appreciated.
Please let me know if I should provide more information in order to find the most suitable algorithmn.
A popular approach: timeshift is the lag corresponding to the maximum cross-correlation coefficient. Here is how it works with an example:
import matplotlib.pyplot as plt
from scipy import signal
import numpy as np
def lag_finder(y1, y2, sr):
n = len(y1)
corr = signal.correlate(y2, y1, mode='same') / np.sqrt(signal.correlate(y1, y1, mode='same')[int(n/2)] * signal.correlate(y2, y2, mode='same')[int(n/2)])
delay_arr = np.linspace(-0.5*n/sr, 0.5*n/sr, n)
delay = delay_arr[np.argmax(corr)]
print('y2 is ' + str(delay) + ' behind y1')
plt.figure()
plt.plot(delay_arr, corr)
plt.title('Lag: ' + str(np.round(delay, 3)) + ' s')
plt.xlabel('Lag')
plt.ylabel('Correlation coeff')
plt.show()
# Sine sample with some noise and copy to y1 and y2 with a 1-second lag
sr = 1024
y = np.linspace(0, 2*np.pi, sr)
y = np.tile(np.sin(y), 5)
y += np.random.normal(0, 5, y.shape)
y1 = y[sr:4*sr]
y2 = y[:3*sr]
lag_finder(y1, y2, sr)
In the case of noisy signals, it is common to apply band-pass filters first. In the case of harmonic noise, they can be removed by identifying and removing frequency spikes present in the frequency spectrum.
Numpy has function correlate which suits your needs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html
To complement Reveille's answer above (I reproduce his algorithm), I would like to point out some ideas for preprocessing the input signals.
Since there seems to be no fit-for-all (duration in periods, resolution, offset, noise, signal type, ...) you may play with it.
In my example the application of a window function improves the detected phase shift (within resolution of the discretization).
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
r2d = 180.0/np.pi # conversion factor RAD-to-DEG
delta_phi_true = 50.0/r2d
def detect_phase_shift(t, x, y):
'''detect phase shift between two signals from cross correlation maximum'''
N = len(t)
L = t[-1] - t[0]
cc = signal.correlate(x, y, mode="same")
i_max = np.argmax(cc)
phi_shift = np.linspace(-0.5*L, 0.5*L , N)
delta_phi = phi_shift[i_max]
print("true delta phi = {} DEG".format(delta_phi_true*r2d))
print("detected delta phi = {} DEG".format(delta_phi*r2d))
print("error = {} DEG resolution for comparison dphi = {} DEG".format((delta_phi-delta_phi_true)*r2d, dphi*r2d))
print("ratio = {}".format(delta_phi/delta_phi_true))
return delta_phi
L = np.pi*10+2 # interval length [RAD], for generality not multiple period
N = 1001 # interval division, odd number is better (center is integer)
noise_intensity = 0.0
X = 0.5 # amplitude of first signal..
Y = 2.0 # ..and second signal
phi = np.linspace(0, L, N)
dphi = phi[1] - phi[0]
'''generate signals'''
nx = noise_intensity*np.random.randn(N)*np.sqrt(dphi)
ny = noise_intensity*np.random.randn(N)*np.sqrt(dphi)
x_raw = X*np.sin(phi) + nx
y_raw = Y*np.sin(phi+delta_phi_true) + ny
'''preprocessing signals'''
x = x_raw.copy()
y = y_raw.copy()
window = signal.windows.hann(N) # Hanning window
#x -= np.mean(x) # zero mean
#y -= np.mean(y) # zero mean
#x /= np.std(x) # scale
#y /= np.std(y) # scale
x *= window # reduce effect of finite length
y *= window # reduce effect of finite length
print(" -- using raw data -- ")
delta_phi_raw = detect_phase_shift(phi, x_raw, y_raw)
print(" -- using preprocessed data -- ")
delta_phi_preprocessed = detect_phase_shift(phi, x, y)
Without noise (to be deterministic) the output is
-- using raw data --
true delta phi = 50.0 DEG
detected delta phi = 47.864788975654 DEG
...
-- using preprocessed data --
true delta phi = 50.0 DEG
detected delta phi = 49.77938053468019 DEG
...
Numpy has a useful function, called correlation_lags for this, which uses the underlying correlate function mentioned by other answers to find the time lag. The example displayed at the bottom of that page is useful:
from scipy import signal
from numpy.random import default_rng
rng = default_rng()
x = rng.standard_normal(1000)
y = np.concatenate([rng.standard_normal(100), x])
correlation = signal.correlate(x, y, mode="full")
lags = signal.correlation_lags(x.size, y.size, mode="full")
lag = lags[np.argmax(correlation)]
Then lag would be -100
In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)
According to [1], calculation of confidence interval directly with Pearson r is complicated due to the fact that it is not normally distributed. The following steps are needed:
Convert r to z',
Calculate the z' confidence interval. The sampling distribution of z' is approximately normally distributed and has standard error of 1/sqrt(n-3).
Convert the confidence interval back to r.
Here are some sample codes:
def r_to_z(r):
return math.log((1 + r) / (1 - r)) / 2.0
def z_to_r(z):
e = math.exp(2 * z)
return((e - 1) / (e + 1))
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(1 - alpha/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Reference:
http://onlinestatbook.com/2/estimation/correlation_ci.html
Using rpy2 and the psychometric library (you will need R installed and to run install.packages("psychometric") within R first)
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
Where 0.9 is your correlation, n the sample size and 0.95 the confidence level
Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np
def pearsonr_ci(x, y, ci=95, n_boots=10000):
x = np.asarray(x)
y = np.asarray(y)
# (n_boots, n_observations) paired arrays
rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
x_boots = x[rand_ixs]
y_boots = y[rand_ixs]
# differences from mean
x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
# sums of squares
x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
# pearson correlations
r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
# upper and lower bounds for confidence interval
ci_low = np.percentile(r_boots, (100 - ci) / 2)
ci_high = np.percentile(r_boots, (ci + 100) / 2)
return ci_low, ci_high
Answer given by bennylp is mostly correct, however, there is a small error in calculating the critical value in the 3rd function.
It should instead be:
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf((1 + alpha)/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
I know bootstrapping has been suggested above, proposing another variation of it below, which may suit some other set ups better.
#1
Sample your data (paired X & Ys and can also add other say weight) , fit original model on it, record r2, append it. Then extract your confidence intervals from your distribution of all R2s recorded.
#2 Additionally can fit on sampled data and using sampled data model predict on non sampled X (could also supply a continuous range to extend your predictions instead of using original X)
to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
return np.exp(b) * np.exp(a * x)
### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
,np.log(y)
,1
,# w= could supply weight for observations here)
)
for i in range(how_many_straps+1):
### zip into tuples attaching X to Y, can combine more variables as well
zipped_for_boot = pd.Series(tuple(zip(x,y)))
### sample zipped X & Y pairs above with replacement
zipped_resampled = zipped_for_boot.sample(frac=1,
replace=True)
### creater your sampled X & Y
boot_x = []
boot_y = []
for sample in zipped_resampled:
boot_x.append(sample[0])
boot_y.append(sample[1])
### predict sampled using original fit
y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
polyfit_original[0],
polyfit_original[1])
### calculate r2 and append
r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
### fit sampled
polyfit_boot = np.polyfit(boot_x
,np.log(boot_y)
,1
,# w= could supply weight for observations here)
)
### predict original via sampled fit or on a range of min(x) to Z
y_hat_original_via_sampled_fit = func_exponential(x,
polyfit_boot[0],
polyfit_boot[1])
### insert y hat into dataframe for calculating y hat confidence intervals
ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"
Short summary: How do I quickly calculate the finite convolution of two arrays?
Problem description
I am trying to obtain the finite convolution of two functions f(x), g(x) defined by
To achieve this, I have taken discrete samples of the functions and turned them into arrays of length steps:
xarray = [x * i / steps for i in range(steps)]
farray = [f(x) for x in xarray]
garray = [g(x) for x in xarray]
I then tried to calculate the convolution using the scipy.signal.convolve function. This function gives the same results as the algorithm conv suggested here. However, the results differ considerably from analytical solutions. Modifying the algorithm conv to use the trapezoidal rule gives the desired results.
To illustrate this, I let
f(x) = exp(-x)
g(x) = 2 * exp(-2 * x)
the results are:
Here Riemann represents a simple Riemann sum, trapezoidal is a modified version of the Riemann algorithm to use the trapezoidal rule, scipy.signal.convolve is the scipy function and analytical is the analytical convolution.
Now let g(x) = x^2 * exp(-x) and the results become:
Here 'ratio' is the ratio of the values obtained from scipy to the analytical values. The above demonstrates that the problem cannot be solved by renormalising the integral.
The question
Is it possible to use the speed of scipy but retain the better results of a trapezoidal rule or do I have to write a C extension to achieve the desired results?
An example
Just copy and paste the code below to see the problem I am encountering. The two results can be brought to closer agreement by increasing the steps variable. I believe that the problem is due to artefacts from right hand Riemann sums because the integral is overestimated when it is increasing and approaches the analytical solution again as it is decreasing.
EDIT: I have now included the original algorithm 2 as a comparison which gives the same results as the scipy.signal.convolve function.
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt
import math
def convolveoriginal(x, y):
'''
The original algorithm from http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
'''
P, Q, N = len(x), len(y), len(x) + len(y) - 1
z = []
for k in range(N):
t, lower, upper = 0, max(0, k - (Q - 1)), min(P - 1, k)
for i in range(lower, upper + 1):
t = t + x[i] * y[k - i]
z.append(t)
return np.array(z) #Modified to include conversion to numpy array
def convolve(y1, y2, dx = None):
'''
Compute the finite convolution of two signals of equal length.
#param y1: First signal.
#param y2: Second signal.
#param dx: [optional] Integration step width.
#note: Based on the algorithm at http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
'''
P = len(y1) #Determine the length of the signal
z = [] #Create a list of convolution values
for k in range(P):
t = 0
lower = max(0, k - (P - 1))
upper = min(P - 1, k)
for i in range(lower, upper):
t += (y1[i] * y2[k - i] + y1[i + 1] * y2[k - (i + 1)]) / 2
z.append(t)
z = np.array(z) #Convert to a numpy array
if dx != None: #Is a step width specified?
z *= dx
return z
steps = 50 #Number of integration steps
maxtime = 5 #Maximum time
dt = float(maxtime) / steps #Obtain the width of a time step
time = [dt * i for i in range (steps)] #Create an array of times
exp1 = [math.exp(-t) for t in time] #Create an array of function values
exp2 = [2 * math.exp(-2 * t) for t in time]
#Calculate the analytical expression
analytical = [2 * math.exp(-2 * t) * (-1 + math.exp(t)) for t in time]
#Calculate the trapezoidal convolution
trapezoidal = convolve(exp1, exp2, dt)
#Calculate the scipy convolution
sci = signal.convolve(exp1, exp2, mode = 'full')
#Slice the first half to obtain the causal convolution and multiply by dt
#to account for the step width
sci = sci[0:steps] * dt
#Calculate the convolution using the original Riemann sum algorithm
riemann = convolveoriginal(exp1, exp2)
riemann = riemann[0:steps] * dt
#Plot
plt.plot(time, analytical, label = 'analytical')
plt.plot(time, trapezoidal, 'o', label = 'trapezoidal')
plt.plot(time, riemann, 'o', label = 'Riemann')
plt.plot(time, sci, '.', label = 'scipy.signal.convolve')
plt.legend()
plt.show()
Thank you for your time!
or, for those who prefer numpy to C. It will be slower than the C implementation, but it's just a few lines.
>>> t = np.linspace(0, maxtime-dt, 50)
>>> fx = np.exp(-np.array(t))
>>> gx = 2*np.exp(-2*np.array(t))
>>> analytical = 2 * np.exp(-2 * t) * (-1 + np.exp(t))
this looks like trapezoidal in this case (but I didn't check the math)
>>> s2a = signal.convolve(fx[1:], gx, 'full')*dt
>>> s2b = signal.convolve(fx, gx[1:], 'full')*dt
>>> s = (s2a+s2b)/2
>>> s[:10]
array([ 0.17235682, 0.29706872, 0.38433313, 0.44235042, 0.47770012,
0.49564748, 0.50039326, 0.49527721, 0.48294359, 0.46547582])
>>> analytical[:10]
array([ 0. , 0.17221333, 0.29682141, 0.38401317, 0.44198216,
0.47730244, 0.49523485, 0.49997668, 0.49486489, 0.48254154])
largest absolute error:
>>> np.max(np.abs(s[:len(analytical)-1] - analytical[1:]))
0.00041657780840698155
>>> np.argmax(np.abs(s[:len(analytical)-1] - analytical[1:]))
6
Short answer: Write it in C!
Long answer
Using the cookbook about numpy arrays I rewrote the trapezoidal convolution method in C. In order to use the C code one requires three files (https://gist.github.com/1626919)
The C code (performancemodule.c).
The setup file to build the code and make it callable from python (performancemodulesetup.py).
The python file that makes use of the C extension (performancetest.py)
The code should run upon downloading by doing the following
Adjust the include path in performancemodule.c.
Run the following
python performancemodulesetup.py build
python performancetest.py
You may have to copy the library file performancemodule.so or performancemodule.dll into the same directory as performancetest.py.
Results and performance
The results agree neatly with one another as shown below:
The performance of the C method is even better than scipy's convolve method. Running 10k convolutions with array length 50 requires
convolve (seconds, microseconds) 81 349969
scipy.signal.convolve (seconds, microseconds) 1 962599
convolve in C (seconds, microseconds) 0 87024
Thus, the C implementation is about 1000 times faster than the python implementation and a bit more than 20 times as fast as the scipy implementation (admittedly, the scipy implementation is more versatile).
EDIT: This does not solve the original question exactly but is sufficient for my purposes.
I am attempting to generate map overlay images that would assist in identifying hot-spots, that is areas on the map that have high density of data points. None of the approaches that I've tried are fast enough for my needs.
Note: I forgot to mention that the algorithm should work well under both low and high zoom scenarios (or low and high data point density).
I looked through numpy, pyplot and scipy libraries, and the closest I could find was numpy.histogram2d. As you can see in the image below, the histogram2d output is rather crude. (Each image includes points overlaying the heatmap for better understanding)
My second attempt was to iterate over all the data points, and then calculate the hot-spot value as a function of distance. This produced a better looking image, however it is too slow to use in my application. Since it's O(n), it works ok with 100 points, but blows out when I use my actual dataset of 30000 points.
My final attempt was to store the data in an KDTree, and use the nearest 5 points to calculate the hot-spot value. This algorithm is O(1), so much faster with large dataset. It's still not fast enough, it takes about 20 seconds to generate a 256x256 bitmap, and I would like this to happen in around 1 second time.
Edit
The boxsum smoothing solution provided by 6502 works well at all zoom levels and is much faster than my original methods.
The gaussian filter solution suggested by Luke and Neil G is the fastest.
You can see all four approaches below, using 1000 data points in total, at 3x zoom there are around 60 points visible.
Complete code that generates my original 3 attempts, the boxsum smoothing solution provided by 6502 and gaussian filter suggested by Luke (improved to handle edges better and allow zooming in) is here:
import matplotlib
import numpy as np
from matplotlib.mlab import griddata
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math
from scipy.spatial import KDTree
import time
import scipy.ndimage as ndi
def grid_density_kdtree(xl, yl, xi, yi, dfactor):
zz = np.empty([len(xi),len(yi)], dtype=np.uint8)
zipped = zip(xl, yl)
kdtree = KDTree(zipped)
for xci in range(0, len(xi)):
xc = xi[xci]
for yci in range(0, len(yi)):
yc = yi[yci]
density = 0.
retvalset = kdtree.query((xc,yc), k=5)
for dist in retvalset[0]:
density = density + math.exp(-dfactor * pow(dist, 2)) / 5
zz[yci][xci] = min(density, 1.0) * 255
return zz
def grid_density(xl, yl, xi, yi):
ximin, ximax = min(xi), max(xi)
yimin, yimax = min(yi), max(yi)
xxi,yyi = np.meshgrid(xi,yi)
#zz = np.empty_like(xxi)
zz = np.empty([len(xi),len(yi)])
for xci in range(0, len(xi)):
xc = xi[xci]
for yci in range(0, len(yi)):
yc = yi[yci]
density = 0.
for i in range(0,len(xl)):
xd = math.fabs(xl[i] - xc)
yd = math.fabs(yl[i] - yc)
if xd < 1 and yd < 1:
dist = math.sqrt(math.pow(xd, 2) + math.pow(yd, 2))
density = density + math.exp(-5.0 * pow(dist, 2))
zz[yci][xci] = density
return zz
def boxsum(img, w, h, r):
st = [0] * (w+1) * (h+1)
for x in xrange(w):
st[x+1] = st[x] + img[x]
for y in xrange(h):
st[(y+1)*(w+1)] = st[y*(w+1)] + img[y*w]
for x in xrange(w):
st[(y+1)*(w+1)+(x+1)] = st[(y+1)*(w+1)+x] + st[y*(w+1)+(x+1)] - st[y*(w+1)+x] + img[y*w+x]
for y in xrange(h):
y0 = max(0, y - r)
y1 = min(h, y + r + 1)
for x in xrange(w):
x0 = max(0, x - r)
x1 = min(w, x + r + 1)
img[y*w+x] = st[y0*(w+1)+x0] + st[y1*(w+1)+x1] - st[y1*(w+1)+x0] - st[y0*(w+1)+x1]
def grid_density_boxsum(x0, y0, x1, y1, w, h, data):
kx = (w - 1) / (x1 - x0)
ky = (h - 1) / (y1 - y0)
r = 15
border = r * 2
imgw = (w + 2 * border)
imgh = (h + 2 * border)
img = [0] * (imgw * imgh)
for x, y in data:
ix = int((x - x0) * kx) + border
iy = int((y - y0) * ky) + border
if 0 <= ix < imgw and 0 <= iy < imgh:
img[iy * imgw + ix] += 1
for p in xrange(4):
boxsum(img, imgw, imgh, r)
a = np.array(img).reshape(imgh,imgw)
b = a[border:(border+h),border:(border+w)]
return b
def grid_density_gaussian_filter(x0, y0, x1, y1, w, h, data):
kx = (w - 1) / (x1 - x0)
ky = (h - 1) / (y1 - y0)
r = 20
border = r
imgw = (w + 2 * border)
imgh = (h + 2 * border)
img = np.zeros((imgh,imgw))
for x, y in data:
ix = int((x - x0) * kx) + border
iy = int((y - y0) * ky) + border
if 0 <= ix < imgw and 0 <= iy < imgh:
img[iy][ix] += 1
return ndi.gaussian_filter(img, (r,r)) ## gaussian convolution
def generate_graph():
n = 1000
# data points range
data_ymin = -2.
data_ymax = 2.
data_xmin = -2.
data_xmax = 2.
# view area range
view_ymin = -.5
view_ymax = .5
view_xmin = -.5
view_xmax = .5
# generate data
xl = np.random.uniform(data_xmin, data_xmax, n)
yl = np.random.uniform(data_ymin, data_ymax, n)
zl = np.random.uniform(0, 1, n)
# get visible data points
xlvis = []
ylvis = []
for i in range(0,len(xl)):
if view_xmin < xl[i] < view_xmax and view_ymin < yl[i] < view_ymax:
xlvis.append(xl[i])
ylvis.append(yl[i])
fig = plt.figure()
# plot histogram
plt1 = fig.add_subplot(221)
plt1.set_axis_off()
t0 = time.clock()
zd, xe, ye = np.histogram2d(yl, xl, bins=10, range=[[view_ymin, view_ymax],[view_xmin, view_xmax]], normed=True)
plt.title('numpy.histogram2d - '+str(time.clock()-t0)+"sec")
plt.imshow(zd, origin='lower', extent=[view_xmin, view_xmax, view_ymin, view_ymax])
plt.scatter(xlvis, ylvis)
# plot density calculated with kdtree
plt2 = fig.add_subplot(222)
plt2.set_axis_off()
xi = np.linspace(view_xmin, view_xmax, 256)
yi = np.linspace(view_ymin, view_ymax, 256)
t0 = time.clock()
zd = grid_density_kdtree(xl, yl, xi, yi, 70)
plt.title('function of 5 nearest using kdtree\n'+str(time.clock()-t0)+"sec")
cmap=cm.jet
A = (cmap(zd/256.0)*255).astype(np.uint8)
#A[:,:,3] = zd
plt.imshow(A , origin='lower', extent=[view_xmin, view_xmax, view_ymin, view_ymax])
plt.scatter(xlvis, ylvis)
# gaussian filter
plt3 = fig.add_subplot(223)
plt3.set_axis_off()
t0 = time.clock()
zd = grid_density_gaussian_filter(view_xmin, view_ymin, view_xmax, view_ymax, 256, 256, zip(xl, yl))
plt.title('ndi.gaussian_filter - '+str(time.clock()-t0)+"sec")
plt.imshow(zd , origin='lower', extent=[view_xmin, view_xmax, view_ymin, view_ymax])
plt.scatter(xlvis, ylvis)
# boxsum smoothing
plt3 = fig.add_subplot(224)
plt3.set_axis_off()
t0 = time.clock()
zd = grid_density_boxsum(view_xmin, view_ymin, view_xmax, view_ymax, 256, 256, zip(xl, yl))
plt.title('boxsum smoothing - '+str(time.clock()-t0)+"sec")
plt.imshow(zd, origin='lower', extent=[view_xmin, view_xmax, view_ymin, view_ymax])
plt.scatter(xlvis, ylvis)
if __name__=='__main__':
generate_graph()
plt.show()
This approach is along the lines of some previous answers: increment a pixel for each spot, then smooth the image with a gaussian filter. A 256x256 image runs in about 350ms on my 6-year-old laptop.
import numpy as np
import scipy.ndimage as ndi
data = np.random.rand(30000,2) ## create random dataset
inds = (data * 255).astype('uint') ## convert to indices
img = np.zeros((256,256)) ## blank image
for i in xrange(data.shape[0]): ## draw pixels
img[inds[i,0], inds[i,1]] += 1
img = ndi.gaussian_filter(img, (10,10))
A very simple implementation that could be done (with C) in realtime and that only takes fractions of a second in pure python is to just compute the result in screen space.
The algorithm is
Allocate the final matrix (e.g. 256x256) with all zeros
For each point in the dataset increment the corresponding cell
Replace each cell in the matrix with the sum of the values of the matrix in an NxN box centered on the cell. Repeat this step a few times.
Scale result and output
The computation of the box sum can be made very fast and independent on N by using a sum table. Every computation just requires two scan of the matrix... total complexity is O(S + WHP) where S is the number of points; W, H are width and height of output and P is the number of smoothing passes.
Below is the code for a pure python implementation (also very un-optimized); with 30000 points and a 256x256 output grayscale image the computation is 0.5sec including linear scaling to 0..255 and saving of a .pgm file (N = 5, 4 passes).
def boxsum(img, w, h, r):
st = [0] * (w+1) * (h+1)
for x in xrange(w):
st[x+1] = st[x] + img[x]
for y in xrange(h):
st[(y+1)*(w+1)] = st[y*(w+1)] + img[y*w]
for x in xrange(w):
st[(y+1)*(w+1)+(x+1)] = st[(y+1)*(w+1)+x] + st[y*(w+1)+(x+1)] - st[y*(w+1)+x] + img[y*w+x]
for y in xrange(h):
y0 = max(0, y - r)
y1 = min(h, y + r + 1)
for x in xrange(w):
x0 = max(0, x - r)
x1 = min(w, x + r + 1)
img[y*w+x] = st[y0*(w+1)+x0] + st[y1*(w+1)+x1] - st[y1*(w+1)+x0] - st[y0*(w+1)+x1]
def saveGraph(w, h, data):
X = [x for x, y in data]
Y = [y for x, y in data]
x0, y0, x1, y1 = min(X), min(Y), max(X), max(Y)
kx = (w - 1) / (x1 - x0)
ky = (h - 1) / (y1 - y0)
img = [0] * (w * h)
for x, y in data:
ix = int((x - x0) * kx)
iy = int((y - y0) * ky)
img[iy * w + ix] += 1
for p in xrange(4):
boxsum(img, w, h, 2)
mx = max(img)
k = 255.0 / mx
out = open("result.pgm", "wb")
out.write("P5\n%i %i 255\n" % (w, h))
out.write("".join(map(chr, [int(v*k) for v in img])))
out.close()
import random
data = [(random.random(), random.random())
for i in xrange(30000)]
saveGraph(256, 256, data)
Edit
Of course the very definition of density in your case depends on a resolution radius, or is the density just +inf when you hit a point and zero when you don't?
The following is an animation built with the above program with just a few cosmetic changes:
used sqrt(average of squared values) instead of sum for the averaging pass
color-coded the results
stretching the result to always use the full color scale
drawn antialiased black dots where the data points are
made an animation by incrementing the radius from 2 to 40
The total computing time of the 39 frames of the following animation with this cosmetic version is 5.4 seconds with PyPy and 26 seconds with standard Python.
Histograms
The histogram way is not the fastest, and can't tell the difference between an arbitrarily small separation of points and 2 * sqrt(2) * b (where b is bin width).
Even if you construct the x bins and y bins separately (O(N)), you still have to perform some ab convolution (number of bins each way), which is close to N^2 for any dense system, and even bigger for a sparse one (well, ab >> N^2 in a sparse system.)
Looking at the code above, you seem to have a loop in grid_density() which runs over the number of bins in y inside a loop of the number of bins in x, which is why you're getting O(N^2) performance (although if you are already order N, which you should plot on different numbers of elements to see, then you're just going to have to run less code per cycle).
If you want an actual distance function then you need to start looking at contact detection algorithms.
Contact Detection
Naive contact detection algorithms come in at O(N^2) in either RAM or CPU time, but there is an algorithm, rightly or wrongly attributed to Munjiza at St. Mary's college London, which runs in linear time and RAM.
you can read about it and implement it yourself from his book, if you like.
I have written this code myself, in fact
I have written a python-wrapped C implementation of this in 2D, which is not really ready for production (it is still single threaded, etc) but it will run in as close to O(N) as your dataset will allow. You set the "element size", which acts as a bin size (the code will call interactions on everything within b of another point, and sometimes between b and 2 * sqrt(2) * b), give it an array (native python list) of objects with an x and y property and my C module will callback to a python function of your choice to run an interaction function for matched pairs of elements. it's designed for running contact force DEM simulations, but it will work fine on this problem too.
As I haven't released it yet, because the other bits of the library aren't ready yet, I'll have to give you a zip of my current source but the contact detection part is solid. The code is LGPL'd.
You'll need Cython and a c compiler to make it work, and it's only been tested and working under *nix environemnts, if you're on windows you'll need the mingw c compiler for Cython to work at all.
Once Cython's installed, building/installing pynet should be a case of running setup.py.
The function you are interested in is pynet.d2.run_contact_detection(py_elements, py_interaction_function, py_simulation_parameters) (and you should check out the classes Element and SimulationParameters at the same level if you want it to throw less errors - look in the file at archive-root/pynet/d2/__init__.py to see the class implementations, they're trivial data holders with useful constructors.)
(I will update this answer with a public mercurial repo when the code is ready for more general release...)
Your solution is okay, but one clear problem is that you're getting dark regions despite there being a point right in the middle of them.
I would instead center an n-dimensional Gaussian on each point and evaluate the sum over each point you want to display. To reduce it to linear time in the common case, use query_ball_point to consider only points within a couple standard deviations.
If you find that he KDTree is really slow, why not call query_ball_point once every five pixels with a slightly larger threshold? It doesn't hurt too much to evaluate a few too many Gaussians.
You can do this with a 2D, separable convolution (scipy.ndimage.convolve1d) of your original image with a gaussian shaped kernel. With an image size of MxM and a filter size of P, the complexity is O(PM^2) using separable filtering. The "Big-Oh" complexity is no doubt greater, but you can take advantage of numpy's efficient array operations which should greatly speed up your calculations.
Just a note, the histogram2d function should work fine for this. Did you play around with different bin sizes? Your initial histogram2d plot seems to just use the default bin sizes... but there's no reason to expect the default sizes to give you the representation you want. Having said that, many of the other solutions are impressive too.
The Question:
What is the best way to calculate inverse distance weighted (IDW) interpolation in Python, for point locations?
Some Background:
Currently I'm using RPy2 to interface with R and its gstat module. Unfortunately, the gstat module conflicts with arcgisscripting which I got around by running RPy2 based analysis in a separate process. Even if this issue is resolved in a recent/future release, and efficiency can be improved, I'd still like to remove my dependency on installing R.
The gstat website does provide a stand alone executable, which is easier to package with my python script, but I still hope for a Python solution which doesn't require multiple writes to disk and launching external processes. The number of calls to the interpolation function, of separate sets of points and values, can approach 20,000 in the processing I'm performing.
I specifically need to interpolate for points, so using the IDW function in ArcGIS to generate rasters sounds even worse than using R, in terms of performance.....unless there is a way to efficiently mask out only the points I need. Even with this modification, I wouldn't expect performance to be all that great. I will look into this option as another alternative. UPDATE: The problem here is you are tied to the cell size you are using. If you reduce the cell-size to get better accuracy, processing takes a long time. You also need to follow up by extracting by points.....over all an ugly method if you want values for specific points.
I have looked at the scipy documentation, but it doesn't look like there is a straight forward way to calculate IDW.
I'm thinking of rolling my own implementation, possibly using some of the scipy functionality to locate the closest points and calculate distances.
Am I missing something obvious? Is there a python module I haven't seen that does exactly what I want? Is creating my own implementation with the aid of scipy a wise choice?
changed 20 Oct: this class Invdisttree combines inverse-distance weighting and
scipy.spatial.KDTree.
Forget the original brute-force answer;
this is imho the method of choice for scattered-data interpolation.
""" invdisttree.py: inverse-distance-weighted interpolation using KDTree
fast, solid, local
"""
from __future__ import division
import numpy as np
from scipy.spatial import cKDTree as KDTree
# http://docs.scipy.org/doc/scipy/reference/spatial.html
__date__ = "2010-11-09 Nov" # weights, doc
#...............................................................................
class Invdisttree:
""" inverse-distance-weighted interpolation using KDTree:
invdisttree = Invdisttree( X, z ) -- data points, values
interpol = invdisttree( q, nnear=3, eps=0, p=1, weights=None, stat=0 )
interpolates z from the 3 points nearest each query point q;
For example, interpol[ a query point q ]
finds the 3 data points nearest q, at distances d1 d2 d3
and returns the IDW average of the values z1 z2 z3
(z1/d1 + z2/d2 + z3/d3)
/ (1/d1 + 1/d2 + 1/d3)
= .55 z1 + .27 z2 + .18 z3 for distances 1 2 3
q may be one point, or a batch of points.
eps: approximate nearest, dist <= (1 + eps) * true nearest
p: use 1 / distance**p
weights: optional multipliers for 1 / distance**p, of the same shape as q
stat: accumulate wsum, wn for average weights
How many nearest neighbors should one take ?
a) start with 8 11 14 .. 28 in 2d 3d 4d .. 10d; see Wendel's formula
b) make 3 runs with nnear= e.g. 6 8 10, and look at the results --
|interpol 6 - interpol 8| etc., or |f - interpol*| if you have f(q).
I find that runtimes don't increase much at all with nnear -- ymmv.
p=1, p=2 ?
p=2 weights nearer points more, farther points less.
In 2d, the circles around query points have areas ~ distance**2,
so p=2 is inverse-area weighting. For example,
(z1/area1 + z2/area2 + z3/area3)
/ (1/area1 + 1/area2 + 1/area3)
= .74 z1 + .18 z2 + .08 z3 for distances 1 2 3
Similarly, in 3d, p=3 is inverse-volume weighting.
Scaling:
if different X coordinates measure different things, Euclidean distance
can be way off. For example, if X0 is in the range 0 to 1
but X1 0 to 1000, the X1 distances will swamp X0;
rescale the data, i.e. make X0.std() ~= X1.std() .
A nice property of IDW is that it's scale-free around query points:
if I have values z1 z2 z3 from 3 points at distances d1 d2 d3,
the IDW average
(z1/d1 + z2/d2 + z3/d3)
/ (1/d1 + 1/d2 + 1/d3)
is the same for distances 1 2 3, or 10 20 30 -- only the ratios matter.
In contrast, the commonly-used Gaussian kernel exp( - (distance/h)**2 )
is exceedingly sensitive to distance and to h.
"""
# anykernel( dj / av dj ) is also scale-free
# error analysis, |f(x) - idw(x)| ? todo: regular grid, nnear ndim+1, 2*ndim
def __init__( self, X, z, leafsize=10, stat=0 ):
assert len(X) == len(z), "len(X) %d != len(z) %d" % (len(X), len(z))
self.tree = KDTree( X, leafsize=leafsize ) # build the tree
self.z = z
self.stat = stat
self.wn = 0
self.wsum = None;
def __call__( self, q, nnear=6, eps=0, p=1, weights=None ):
# nnear nearest neighbours of each query point --
q = np.asarray(q)
qdim = q.ndim
if qdim == 1:
q = np.array([q])
if self.wsum is None:
self.wsum = np.zeros(nnear)
self.distances, self.ix = self.tree.query( q, k=nnear, eps=eps )
interpol = np.zeros( (len(self.distances),) + np.shape(self.z[0]) )
jinterpol = 0
for dist, ix in zip( self.distances, self.ix ):
if nnear == 1:
wz = self.z[ix]
elif dist[0] < 1e-10:
wz = self.z[ix[0]]
else: # weight z s by 1/dist --
w = 1 / dist**p
if weights is not None:
w *= weights[ix] # >= 0
w /= np.sum(w)
wz = np.dot( w, self.z[ix] )
if self.stat:
self.wn += 1
self.wsum += w
interpol[jinterpol] = wz
jinterpol += 1
return interpol if qdim > 1 else interpol[0]
#...............................................................................
if __name__ == "__main__":
import sys
N = 10000
Ndim = 2
Nask = N # N Nask 1e5: 24 sec 2d, 27 sec 3d on mac g4 ppc
Nnear = 8 # 8 2d, 11 3d => 5 % chance one-sided -- Wendel, mathoverflow.com
leafsize = 10
eps = .1 # approximate nearest, dist <= (1 + eps) * true nearest
p = 1 # weights ~ 1 / distance**p
cycle = .25
seed = 1
exec "\n".join( sys.argv[1:] ) # python this.py N= ...
np.random.seed(seed )
np.set_printoptions( 3, threshold=100, suppress=True ) # .3f
print "\nInvdisttree: N %d Ndim %d Nask %d Nnear %d leafsize %d eps %.2g p %.2g" % (
N, Ndim, Nask, Nnear, leafsize, eps, p)
def terrain(x):
""" ~ rolling hills """
return np.sin( (2*np.pi / cycle) * np.mean( x, axis=-1 ))
known = np.random.uniform( size=(N,Ndim) ) ** .5 # 1/(p+1): density x^p
z = terrain( known )
ask = np.random.uniform( size=(Nask,Ndim) )
#...............................................................................
invdisttree = Invdisttree( known, z, leafsize=leafsize, stat=1 )
interpol = invdisttree( ask, nnear=Nnear, eps=eps, p=p )
print "average distances to nearest points: %s" % \
np.mean( invdisttree.distances, axis=0 )
print "average weights: %s" % (invdisttree.wsum / invdisttree.wn)
# see Wikipedia Zipf's law
err = np.abs( terrain(ask) - interpol )
print "average |terrain() - interpolated|: %.2g" % np.mean(err)
# print "interpolate a single point: %.2g" % \
# invdisttree( known[0], nnear=Nnear, eps=eps )
Edit: #Denis is right, a linear Rbf (e.g. scipy.interpolate.Rbf with "function='linear'") isn't the same as IDW...
(Note, all of these will use excessive amounts of memory if you're using a large number of points!)
Here's a simple exampe of IDW:
def simple_idw(x, y, z, xi, yi):
dist = distance_matrix(x,y, xi,yi)
# In IDW, weights are 1 / distance
weights = 1.0 / dist
# Make weights sum to one
weights /= weights.sum(axis=0)
# Multiply the weights for each interpolated point by all observed Z-values
zi = np.dot(weights.T, z)
return zi
Whereas, here's what a linear Rbf would be:
def linear_rbf(x, y, z, xi, yi):
dist = distance_matrix(x,y, xi,yi)
# Mutual pariwise distances between observations
internal_dist = distance_matrix(x,y, x,y)
# Now solve for the weights such that mistfit at the observations is minimized
weights = np.linalg.solve(internal_dist, z)
# Multiply the weights for each interpolated point by the distances
zi = np.dot(dist.T, weights)
return zi
(Using the distance_matrix function here:)
def distance_matrix(x0, y0, x1, y1):
obs = np.vstack((x0, y0)).T
interp = np.vstack((x1, y1)).T
# Make a distance matrix between pairwise observations
# Note: from <http://stackoverflow.com/questions/1871536>
# (Yay for ufuncs!)
d0 = np.subtract.outer(obs[:,0], interp[:,0])
d1 = np.subtract.outer(obs[:,1], interp[:,1])
return np.hypot(d0, d1)
Putting it all together into a nice copy-paste example yields some quick comparison plots:
(source: jkington at www.geology.wisc.edu)
(source: jkington at www.geology.wisc.edu)
(source: jkington at www.geology.wisc.edu)
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import Rbf
def main():
# Setup: Generate data...
n = 10
nx, ny = 50, 50
x, y, z = map(np.random.random, [n, n, n])
xi = np.linspace(x.min(), x.max(), nx)
yi = np.linspace(y.min(), y.max(), ny)
xi, yi = np.meshgrid(xi, yi)
xi, yi = xi.flatten(), yi.flatten()
# Calculate IDW
grid1 = simple_idw(x,y,z,xi,yi)
grid1 = grid1.reshape((ny, nx))
# Calculate scipy's RBF
grid2 = scipy_idw(x,y,z,xi,yi)
grid2 = grid2.reshape((ny, nx))
grid3 = linear_rbf(x,y,z,xi,yi)
print grid3.shape
grid3 = grid3.reshape((ny, nx))
# Comparisons...
plot(x,y,z,grid1)
plt.title('Homemade IDW')
plot(x,y,z,grid2)
plt.title("Scipy's Rbf with function=linear")
plot(x,y,z,grid3)
plt.title('Homemade linear Rbf')
plt.show()
def simple_idw(x, y, z, xi, yi):
dist = distance_matrix(x,y, xi,yi)
# In IDW, weights are 1 / distance
weights = 1.0 / dist
# Make weights sum to one
weights /= weights.sum(axis=0)
# Multiply the weights for each interpolated point by all observed Z-values
zi = np.dot(weights.T, z)
return zi
def linear_rbf(x, y, z, xi, yi):
dist = distance_matrix(x,y, xi,yi)
# Mutual pariwise distances between observations
internal_dist = distance_matrix(x,y, x,y)
# Now solve for the weights such that mistfit at the observations is minimized
weights = np.linalg.solve(internal_dist, z)
# Multiply the weights for each interpolated point by the distances
zi = np.dot(dist.T, weights)
return zi
def scipy_idw(x, y, z, xi, yi):
interp = Rbf(x, y, z, function='linear')
return interp(xi, yi)
def distance_matrix(x0, y0, x1, y1):
obs = np.vstack((x0, y0)).T
interp = np.vstack((x1, y1)).T
# Make a distance matrix between pairwise observations
# Note: from <http://stackoverflow.com/questions/1871536>
# (Yay for ufuncs!)
d0 = np.subtract.outer(obs[:,0], interp[:,0])
d1 = np.subtract.outer(obs[:,1], interp[:,1])
return np.hypot(d0, d1)
def plot(x,y,z,grid):
plt.figure()
plt.imshow(grid, extent=(x.min(), x.max(), y.max(), y.min()))
plt.hold(True)
plt.scatter(x,y,c=z)
plt.colorbar()
if __name__ == '__main__':
main()
I also needed something fast and i started with #joerington solution and ended up finally at numba
I always experiment between scipy, numpy and numba and choose best one. For this problem I use numba, for extra tmp memory is negligible giving super speed.
With using numpy there is a trade-of with memory and speed. For example on a 16GB ram if you want to calculate interpolation of 50000 points on other 50000 points it will go out of memory or be incredibly slow, no matter what.
So to save on memory we need to use for loops so as to have minimum temp memory allocation. But writing for loops in numpy would mean loosing possible vectorization. For this we have numba. You can add numba jit for a function accepting with for loops on numpy and it will effectively vectorize in on hardware + additional parallelism on multi-core. It will give better speed up for huge arrays case and also you can run it on GPU without writing cuda
An extremely simple snippet would be to calculate distance matrix, in IDW case we need inverse distance matrix. But even for methods other than IDW you can do something similar
Also on custom methods for calculation of hypotenuse I have few caution points here
#nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
print('Numba with parallel')
res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
for i in nb.prange(d0.shape[0]):
for j in range(d1.shape[0]):
res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
return res
Also recent numba becoming compatible with scikit, so that is +1
Refer:
Why np.hypot and np.subtract.outer very fast compared to vanilla broadcast ? Using Numba for speedup numpy in parallel for distance matrix calculation
Custom dtype in numpy for lattitude, longitude for faster distance matrix/krigging/IDW interpolation calculations