Python baseline correction library

I am currently working with some Raman spectra data, and I am trying to correct the skewing in my data caused by fluorescence. Take a look at the graph below:
I am pretty close to achieving what I want. As you can see, I am trying to fit a polynomial to all of my data, whereas I should really just be fitting a polynomial through the local minima.
Ideally I would want a polynomial fit which, when subtracted from my original data, would result in something like this:
Are there any built-in libraries that do this already?
If not, is there a simple algorithm you can recommend?

I found an answer to my question; I am just sharing it for everyone who stumbles upon this.
There is an algorithm called "Asymmetric Least Squares Smoothing" by P. Eilers and H. Boelens (2005). The paper is free and you can find it on Google.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    # second-order difference matrix, built densely and then converted to sparse
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in xrange(niter):  # Python 2; on Python 3 use range() (see the answer below)
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z

The following code works on Python 3.6.
It is adapted from the accepted answer to avoid the dense matrix diff computation (which can easily cause memory issues) and uses range (not xrange).
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L-2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
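To see how the function is typically used, here is a minimal sketch; the synthetic signal and the parameter values (lam=1e5, p=0.01) are illustrative assumptions, not part of the original answer:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1000, 1000)
signal = 100 * np.exp(-((x - 500) / 20)**2)   # one synthetic peak
background = 0.05 * x + 20                    # slowly varying baseline
y = signal + background + np.random.randn(len(x))

baseline = baseline_als(y, lam=1e5, p=0.01)
corrected = y - baseline

plt.plot(x, y, label='raw')
plt.plot(x, baseline, label='estimated baseline')
plt.plot(x, corrected, label='corrected')
plt.legend()
plt.show()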

There is a Python library available for baseline correction/removal. It has the ModPoly, IModPoly, and ZhangFit algorithms, which return baseline-corrected results when you input the original values as a Python list or pandas Series and specify the polynomial degree.
Install the library with pip install BaselineRemoval. Below is an example.
from BaselineRemoval import BaselineRemoval
input_array=[10,20,1.5,5,2,9,99,25,47]
polynomial_degree=2 #only needed for Modpoly and IModPoly algorithm
baseObj=BaselineRemoval(input_array)
Modpoly_output=baseObj.ModPoly(polynomial_degree)
Imodpoly_output=baseObj.IModPoly(polynomial_degree)
Zhangfit_output=baseObj.ZhangFit()
print('Original input:',input_array)
print('Modpoly base corrected values:',Modpoly_output)
print('IModPoly base corrected values:',Imodpoly_output)
print('ZhangFit base corrected values:',Zhangfit_output)
Original input: [10, 20, 1.5, 5, 2, 9, 99, 25, 47]
Modpoly base corrected values: [-1.98455800e-04 1.61793368e+01 1.08455179e+00 5.21544654e+00
7.20210508e-02 2.15427531e+00 8.44622093e+01 -4.17691125e-03
8.75511661e+00]
IModPoly base corrected values: [-0.84912125 15.13786196 -0.11351367 3.89675187 -1.33134142 0.70220645
82.99739548 -1.44577432 7.37269705]
ZhangFit base corrected values: [ 8.49924691e+00  1.84994576e+01 -3.31739230e-04  3.49854060e+00
 4.97412948e-01  7.49628529e+00  9.74951576e+01  2.34940300e+01
 4.54929023e+01]
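If you need the estimated baseline itself rather than the corrected signal, it can be recovered by subtraction. A minimal sketch, assuming the outputs above are NumPy arrays of baseline-corrected intensities:
import numpy as np

# the library returns the corrected signal, so the baseline is the difference
baseline_zhang = np.array(input_array) - Zhangfit_output
print('Estimated ZhangFit baseline:', baseline_zhang)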

Recently, I needed to use this method. The code from the other answers works well, but it clearly overuses memory. So here is my version with optimized memory usage.
def baseline_als_optimized(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L-2))
    D = lam * D.dot(D.transpose())  # Precompute this term since it does not depend on `w`
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(niter):
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values
        Z = W + D
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
According to my benchmarks below, it is also about 1.5 times faster.
%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als(y, 10000, 0.05) # function from @jpantina's answer
# 20.5 ms ± 382 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)
%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als_optimized(y, 10000, 0.05)
# 13.3 ms ± 874 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)
NOTE 1: The original article says:
To emphasize the basic simplicity of the algorithm, the number of iterations has been fixed to 10. In practical applications one should check whether the weights show any change; if not, convergence has been attained.
So the more correct way to stop the iteration is to check that ||w_new - w|| < tolerance.
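A sketch of that stopping rule applied to the loop above (the tolerance value and the relative-norm form of the test are my own assumptions):
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als_conv(y, lam, p, niter=100, tol=1e-6):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L-2))
    D = lam * D.dot(D.transpose())
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(niter):
        W.setdiag(w)
        z = spsolve(W + D, w*y)
        w_new = p * (y > z) + (1-p) * (y < z)
        # stop once the weights no longer change appreciably
        if np.linalg.norm(w_new - w) < tol * np.linalg.norm(w):
            w = w_new
            break
        w = w_new
    return z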
NOTE 2: Another useful quote (from @glycoaddict's comment) gives an idea of how to choose values of the parameters.
There are two parameters: p for asymmetry and λ for smoothness. Both have to be tuned to the data at hand. We found that generally 0.001 ≤ p ≤ 0.1 is a good choice (for a signal with positive peaks) and 10² ≤ λ ≤ 10⁹, but exceptions may occur. In any case one should vary λ on a grid that is approximately linear for log λ. Often visual inspection is sufficient to get good parameter values.
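One simple way to follow that advice is to scan λ over a log-spaced grid and inspect the resulting baselines visually. A sketch, assuming y holds the raw spectrum (the grid bounds and p value are illustrative):
import numpy as np
import matplotlib.pyplot as plt

lams = np.logspace(2, 9, 8)   # 10^2 ... 10^9, linear in log(lambda)
p = 0.01

plt.plot(y, color='k', label='raw signal')
for lam in lams:
    z = baseline_als_optimized(y, lam, p)
    plt.plot(z, label='lam = %.0e' % lam)
plt.legend()
plt.show()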

I implemented the version of the algorithm referenced by glinka in a previous comment, which is an improvement of the penalized weighted least squares method published in a relatively recent paper. I took Rustam Guliev's code as a starting point to build this one:
from scipy import sparse
from scipy.sparse import linalg
import numpy as np
from numpy.linalg import norm
def baseline_arPLS(y, ratio=1e-6, lam=100, niter=10, full_output=False):
    L = len(y)
    diag = np.ones(L - 2)
    D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L - 2)
    H = lam * D.dot(D.T)  # The transposes are flipped w.r.t the Algorithm on pg. 252
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    crit = 1
    count = 0
    while crit > ratio:
        z = linalg.spsolve(W + H, W * y)
        d = y - z
        dn = d[d < 0]
        m = np.mean(dn)
        s = np.std(dn)
        w_new = 1 / (1 + np.exp(2 * (d - (2*s - m))/s))
        crit = norm(w_new - w) / norm(w)
        w = w_new
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values
        count += 1
        if count > niter:
            print('Maximum number of iterations exceeded')
            break
    if full_output:
        info = {'num_iter': count, 'stop_criterion': crit}
        return z, d, info
    else:
        return z
In order to test the algorithm, I created a spectrum similar to the one shown in Fig. 3 of the paper, by first generating a simulated spectrum consisting of multiple Gaussian peaks:
def spectra_model(x):
    coeff = np.array([100, 200, 100])
    mean = np.array([300, 750, 800])
    stdv = np.array([15, 30, 15])
    terms = []
    for ind in range(len(coeff)):
        term = coeff[ind] * np.exp(-((x - mean[ind]) / stdv[ind])**2)
        terms.append(term)
    spectra = sum(terms)
    return spectra
x_vals = np.arange(1, 1001)
spectra_sim = spectra_model(x_vals)
Then, I created a third-order interpolating polynomial using 4 points taken directly from the paper:
from scipy.interpolate import CubicSpline
x_poly = np.array([0, 250, 700, 1000])
y_poly = np.array([200, 180, 230, 200])
poly = CubicSpline(x_poly, y_poly)
baseline = poly(x_vals)
noise = np.random.randn(len(x_vals)) * 0.1
spectra_base = spectra_sim + baseline + noise
Finally, I used the baseline correction algorithm to subtract the baseline out of the altered spectra (spectra_base):
_, spectra_arPLS, info = baseline_arPLS(spectra_base, lam=1e4, niter=10,
full_output=True)
The results were (for reference, I compared with the pure ALS implementation by Rustam Guliev, using lam = 1e4 and p = 0.001):
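A sketch of how such a comparison could be plotted (the plotting details are my own, not from the original post):
import matplotlib.pyplot as plt

baseline_als_est = baseline_als_optimized(spectra_base, lam=1e4, p=0.001)  # Rustam Guliev's ALS
spectra_als = spectra_base - baseline_als_est

plt.plot(x_vals, spectra_base, label='spectrum + baseline + noise')
plt.plot(x_vals, spectra_sim, 'k--', label='true spectrum')
plt.plot(x_vals, spectra_arPLS, label='arPLS corrected')
plt.plot(x_vals, spectra_als, label='ALS corrected')
plt.legend()
plt.show()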

I know this is an old question, but I stumbled upon it a few months ago and implemented the equivalent answer using scipy.sparse routines.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# Baseline removal
def baseline_als(y, lam, p, niter=10):
    s = len(y)
    # assemble difference matrix
    D0 = sparse.eye(s)
    d1 = [np.ones(s - 1) * -2]
    D1 = sparse.diags(d1, [-1])
    d2 = [np.ones(s - 2) * 1]
    D2 = sparse.diags(d2, [-2])
    D = D0 + D2 + D1
    w = np.ones(s)
    for i in range(niter):
        W = sparse.diags([w], [0])
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z
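As a quick sanity check (my own, not part of the original answer), the assembled sparse matrix agrees with the dense second-difference matrix from the accepted answer on its first s-2 columns:
import numpy as np
from scipy import sparse

s = 10
D0 = sparse.eye(s)
D1 = sparse.diags([np.ones(s - 1) * -2], [-1])
D2 = sparse.diags([np.ones(s - 2)], [-2])
D = D0 + D1 + D2
# columns 0..s-3 carry the same [1, -2, 1] second-difference stencil
assert np.allclose(D.toarray()[:, :s - 2], np.diff(np.eye(s), 2))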
Cheers,
Pedro.

Related

Numba parallel code slower than its sequential counterpart

I'm new to Numba and I'm trying to port an old Fortran code to Python using Numba (version 0.54.1), but when I add parallel=True the program actually slows down. My program is very simple: I change the positions x and y in an L x L grid, and for each position in the grid I perform a summation:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)
# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)
lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8 s to run; however, with parallel=True it takes about 14 s. I also tested another code from https://github.com/animator/mandelbrot-numba, and in this case the parallelization works.
import math
import numpy as np
import numba as nb
WIDTH = 1000
MAX_ITER = 1000
@nb.njit(parallel=True)
def mandelbrot(width, max_iter):
    pixels = np.zeros((width, width, 3), dtype=np.uint8)
    for y in nb.prange(width):
        for x in range(width):
            c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
            c = 0
            for i in range(1, max_iter):
                if abs(c) > 2:
                    log_iter = math.log(i)
                    pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
                                                int(255*(1+math.cos(0.774*log_iter))/2),
                                                int(255*(1+math.cos(0.412*log_iter))/2)],
                                               dtype=np.uint8)
                    break
                c = c * c + c0
    return pixels
# compile
_ = mandelbrot(WIDTH, 10)
calcpixels = mandelbrot(WIDTH, MAX_ITER)
One main issue is that the second function call compiles the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (int transformed to a np.int_), while in the second call the third argument (k) is a floating-point number (float transformed to a np.float64). Numba recompiles the function for different parameter types because they are deduced from the types of the arguments, and it does not know you want to use a np.float64 type for the third argument (since the first time the function is compiled for a np.int_ type). One simple solution to fix the problem is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so it will compile the function at declaration time. This also removes the need to artificially call the function (with useless parameters).
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
Note that (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) is zero the first time resulting in a division by 0.
Another main issue comes from the allocation of many small arrays in the loop, causing contention in the standard allocator (see this post for more information). While Numba could theoretically optimize it (i.e. replace the array with local variables), it actually does not, resulting in a huge slowdown and contention. Fortunately, in your case, you do not need to create the array in the innermost loop: you can create it once in the encompassing loop and modify it in place. Here is the optimized code:
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        J = np.ones((2, 2), dtype=np.float64)
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J[0, 1] = -k*np.cos(x)
                J[1, 1] = 1.0 - k*np.cos(x)
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
Here are the results on an old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The optimized parallel version scales very well compared to the original one (2.9 times faster than the optimized sequential one). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.

How to subtract baseline from spectrum with rising tail in python?

I have a spectrum that I want to subtract a baseline from. The spectrum data are:
1.484043000000000001e+00 1.121043091000000004e+03
1.472555999999999976e+00 1.140899658000000045e+03
1.461239999999999872e+00 1.135047851999999921e+03
1.450093000000000076e+00 1.153286499000000049e+03
1.439112000000000169e+00 1.158624877999999853e+03
1.428292000000000117e+00 1.249718872000000147e+03
1.417629999999999946e+00 1.491854857999999922e+03
1.407121999999999984e+00 2.524922362999999677e+03
1.396767000000000092e+00 4.102439940999999635e+03
1.386559000000000097e+00 4.013319579999999860e+03
1.376497999999999999e+00 3.128252441000000090e+03
1.366578000000000070e+00 2.633181152000000111e+03
1.356797999999999949e+00 2.340077147999999852e+03
1.347154999999999880e+00 2.099404540999999881e+03
1.337645999999999891e+00 2.012083983999999873e+03
1.328268000000000004e+00 2.052154540999999881e+03
1.319018999999999942e+00 2.061067871000000196e+03
1.309895999999999949e+00 2.205770507999999609e+03
1.300896999999999970e+00 2.199266602000000148e+03
1.292019000000000029e+00 2.317792235999999775e+03
1.283260000000000067e+00 2.357031494000000293e+03
1.274618000000000029e+00 2.434981689000000188e+03
1.266089999999999938e+00 2.540746337999999923e+03
1.257675000000000098e+00 2.605709472999999889e+03
1.249370000000000092e+00 2.667244141000000127e+03
1.241172999999999860e+00 2.800522704999999860e+03
I've taken only every 20th data point from the actual data file, but the general shape is preserved.
import matplotlib.pyplot as plt
share = the_above_array
plt.plot(share)
Original_spectrum
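For completeness, a sketch of one way to load the pasted two-column data into share; the file name spectrum.txt is an assumption:
import numpy as np

# two whitespace-separated columns: x (first column) and intensity (second column)
share = np.loadtxt('spectrum.txt')
print(share.shape)   # (number_of_points, 2)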
There is a clear tail around the high x values. Assume the tail is an artifact and needs to be removed. I've tried solutions using the ALS algorithm by P. Eilers, a rubberband approach, and the peakutils package, but these end up subtracting the tail and creating a rise around the low x values or not creating a suitable baseline.
ALS algorithm: in this example I am using lam=1E6 and p=0.001; these were the best parameters I was able to find manually:
# ALS approach
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
baseline = baseline_als(share[:,1], 1E6, 0.001)
baseline_subtracted = share[:,1] - baseline
plt.plot(baseline_subtracted)
ALS_plot
Rubberband approach:
# rubberband approach
from scipy.spatial import ConvexHull
def rubberband(x, y):
    # Find the convex hull of the (x, y) points
    v = ConvexHull(np.column_stack((x, y))).vertices
    # Rotate convex hull vertices until they start from the lowest one
    v = np.roll(v, v.argmax())
    # Leave only the ascending part
    v = v[:v.argmax()]
    # Create baseline using linear interpolation between vertices
    return np.interp(x, x[v], y[v])
baseline_rubber = rubberband(share[:,0], share[:,1])
intensity_rubber = share[:,1] - baseline_rubber
plt.plot(intensity_rubber)
Rubber_plot
peakutils package:
# peakutils approach
import peakutils
baseline_peakutils = peakutils.baseline(share[:,1])
intensity_peakutils = share[:,1] - baseline_peakutils
plt.plot(intensity_peakutils)
Peakutils_plot
Are there any suggestions, aside from masking the low x value data, for constructing a baseline and subtracting the tail without creating a rise in the low x values?
I found a set of similar ALS algorithms here. One of these algorithms, asymmetrically reweighted penalized least squares smoothing (arpls), gives a slightly better fit than als.
# arpls approach
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
from scipy.linalg import cholesky

def arpls(y, lam=1e4, ratio=0.05, itermax=100):
    r"""
    Baseline correction using asymmetrically
    reweighted penalized least squares smoothing
    Sung-June Baek, Aaron Park, Young-Jin Ahna and Jaebum Choo,
    Analyst, 2015, 140, 250 (2015)
    """
    N = len(y)
    D = sparse.eye(N, format='csc')
    D = D[1:] - D[:-1]  # numpy.diff( ,2) does not work with sparse matrix. This is a workaround.
    D = D[1:] - D[:-1]
    H = lam * D.T * D
    w = np.ones(N)
    for i in range(itermax):
        W = sparse.diags(w, 0, shape=(N, N))
        WH = sparse.csc_matrix(W + H)
        C = sparse.csc_matrix(cholesky(WH.todense()))
        z = spsolve(C, spsolve(C.T, w * y))
        d = y - z
        dn = d[d < 0]
        m = np.mean(dn)
        s = np.std(dn)
        wt = 1. / (1 + np.exp(2 * (d - (2 * s - m)) / s))
        if np.linalg.norm(w - wt) / np.linalg.norm(w) < ratio:
            break
        w = wt
    return z
baseline = baseline_als(share[:,1], 1E6, 0.001)
baseline_subtracted = share[:,1] - baseline
plt.plot(baseline_subtracted, 'r', label='als')
baseline_arpls = arpls(share[:,1], 1e5, 0.1)
intensity_arpls = share[:,1] - baseline_arpls
plt.plot(intensity_arpls, label='arpls')
plt.legend()
ARPLS plot
Fortunately, this improvement becomes even clearer when using the data from the entire spectrum:
Note that the parameters for the two algorithms were different. For now, I think the arpls algorithm is as close as I can get, at least for spectra that look like this. We'll see how robustly the algorithm can fit spectra with different shapes. Of course, I am always open to suggestions or improvements!
Have a look at the RamPy library in Python, which provides various baseline subtraction algorithms, including splines, arPLS, ALS, polynomial functions, and many more. It also offers various other features, such as resampling, normalisation, and peak-fitting examples.
In your case, a simple spline function fitted before and after the peak should easily do the job. Have a look at this example Jupyter notebook.
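If you prefer to stay with plain SciPy, here is a minimal sketch of that spline idea, using the share array from the question; the anchor regions are illustrative assumptions and would need to be adjusted to the actual peak position:
import numpy as np
from scipy.interpolate import UnivariateSpline

x = share[:, 0]
y = share[:, 1]

# baseline anchor regions on either side of the peak
# (these index ranges are illustrative and must be adapted to the real spectrum)
mask = np.zeros(len(x), dtype=bool)
mask[:5] = True
mask[12:] = True

# the spline fit needs strictly increasing x; the pasted data is in decreasing x order
order = np.argsort(x[mask])
spl = UnivariateSpline(x[mask][order], y[mask][order], k=3)
baseline_spline = spl(x)
corrected = y - baseline_spline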

How to calculate the Kolmogorov-Smirnov statistic between two weighted samples

Let's say that we have two samples data1 and data2 with their respective weights weight1 and weight2 and that we want to calculate the Kolmogorov-Smirnov statistic between the two weighted samples.
The way we do that in python follows:
import numpy as np
def ks_w(data1, data2, wei1, wei2):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data1 = data1[ix1]
    data2 = data2[ix2]
    d = 0.
    fn1 = 0.
    fn2 = 0.
    j1 = 0
    j2 = 0
    j1w = 0.
    j2w = 0.
    while (j1 < len(data1)) & (j2 < len(data2)):
        d1 = data1[j1]
        d2 = data2[j2]
        w1 = wei1[j1]
        w2 = wei2[j2]
        if d1 <= d2:
            j1 += 1
            j1w += w1
            fn1 = j1w / sum(wei1)
        if d2 <= d1:
            j2 += 1
            j2w += w2
            fn2 = j2w / sum(wei2)
        if abs(fn2 - fn1) > d:
            d = abs(fn2 - fn1)
    return d
where we have just adapted to our purpose the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling, Numerical Recipes in C, Cambridge University Press, 1992, p. 626.
Our questions are:
is anybody aware of any other way to do it?
is there any library in python/R/* that performs it?
what about the test? Does it exist or should we use a reshuffling procedure in order to evaluate the statistic?
This solution is based on the code for scipy.stats.ks_2samp and runs in about 1/10000 the time (notebook):
import numpy as np
def ks_w2(data1, data2, wei1, wei2):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[[np.searchsorted(data1, data, side='right')]]
    cdf2we = cwei2[[np.searchsorted(data2, data, side='right')]]
    return np.max(np.abs(cdf1we - cdf2we))
Here's a test of its accuracy and performance:
ds1 = np.random.rand(10000)
ds2 = np.random.randn(40000) + .2
we1 = np.random.rand(10000) + 1.
we2 = np.random.rand(40000) + 1.
ks_w2(ds1, ds2, we1, we2)
# 0.4210415232236593
ks_w(ds1, ds2, we1, we2)
# 0.4210415232236593
%timeit ks_w2(ds1, ds2, we1, we2)
# 100 loops, best of 3: 17.1 ms per loop
%timeit ks_w(ds1, ds2, we1, we2)
# 1 loop, best of 3: 3min 44s per loop
To add to Luca Jokull's answer, if you want to also return a p-value (similar to the unweighted scipy.stats.ks_2samp function), the suggested ks_w2() function can be modified as follows:
import numpy as np
from scipy.stats import distributions

def ks_weighted(data1, data2, wei1, wei2, alternative='two-sided'):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    d = np.max(np.abs(cdf1we - cdf2we))
    # calculate p-value
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    m, n = sorted([float(n1), float(n2)], reverse=True)
    en = m * n / (m + n)
    if alternative == 'two-sided':
        prob = distributions.kstwo.sf(d, np.round(en))
    else:
        z = np.sqrt(en) * d
        # Use Hodges' suggested approximation Eqn 5.3
        # Requires m to be the larger of (n1, n2)
        expt = -2 * z**2 - 2 * z * (m + 2*n)/np.sqrt(m*n*(m+n))/3.0
        prob = np.exp(expt)
    return d, prob
This is the asymptotic method that scipy's original unweighted function uses.
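If you would rather not rely on the asymptotic distribution (for example, with strongly unequal effective sample sizes), a permutation test is one alternative, in the spirit of the reshuffling procedure mentioned in the question. A sketch, reusing the ks_w2() function from above (the number of permutations is an arbitrary assumption):
import numpy as np

def ks_weighted_permutation_pvalue(data1, data2, wei1, wei2, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    d_obs = ks_w2(data1, data2, wei1, wei2)
    pooled_data = np.concatenate([data1, data2])
    pooled_wei = np.concatenate([wei1, wei2])
    n1 = len(data1)
    count = 0
    for _ in range(n_perm):
        # reshuffle the pooled sample, keeping each observation's weight attached to it
        perm = rng.permutation(len(pooled_data))
        d_perm = ks_w2(pooled_data[perm[:n1]], pooled_data[perm[n1:]],
                       pooled_wei[perm[:n1]], pooled_wei[perm[n1:]])
        if d_perm >= d_obs:
            count += 1
    return (count + 1) / (n_perm + 1)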
This is an R version of a two-tailed weighted KS statistic following the suggestion of Numerical Methods of Statistics by Monahan, p. 334 in the 1st edition and p. 358 in the 2nd edition.
# ewcdf() comes from the spatstat package
ks_weighted <- function(vector_1, vector_2, weights_1, weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1, vector_2)
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))
    ## P-VALUE with NORMAL SAMPLE
    # n_vector_1 <- length(vector_1)
    # n_vector_2 <- length(vector_2)
    # n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)
    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)
    pkstwo <- function(x, tol = 1e-06) {
        if (is.numeric(x))
            x <- as.double(x)
        else stop("argument 'x' must be numeric")
        p <- rep(0, length(x))
        p[is.na(x)] <- NA
        IND <- which(!is.na(x) & (x > 0))
        if (length(IND))
            p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
        p
    }
    pval <- 1 - pkstwo(sqrt(n) * d)
    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}

Vectorizing for loop with repeated indices in python

I am trying to optimize a snippet that gets called a lot (millions of times) so any type of speed improvement (hopefully removing the for-loop) would be great.
I am computing a correlation function of some j'th particle with all others
C_j(|r-r'|) = sqrt(E((s_j(r')-s_k(r))^2)) averaged over k.
My idea is to have a variable corrfun which bins data into some bins (the r, defined elsewhere). I find which bin of r each s_k belongs to, and this is stored in ind. So ind[0] is the index of r (and thus of corrfun) to which the first point corresponds. Multiple points can fall into the same bin (in fact I want bins to be big enough to contain multiple points), so I sum together all of the (s_j(r')-s_k(r))^2 and then divide by the number of points in that bin (stored in the variable rw). The code I ended up making for this is the following (np is numpy):
for k, v in enumerate(ind):
    if j == k:
        continue
    corrfun[v] += (s[k]-s[j])**2
    rw[v] += 1
rw2 = rw.copy()  # copy so the clamping below does not overwrite the returned rw
rw2[rw < 1] = 1
corrfun = np.sqrt(np.divide(corrfun, rw2))
Note, the rw2 business was because I want to avoid divide-by-0 problems, but I do return the rw array and I want to be able to differentiate between the rw=0 and rw=1 elements. Perhaps there is a more elegant solution for this as well.
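One slightly more direct option for the divide-by-zero issue, sketched here, is the where argument of np.divide, which leaves rw untouched and keeps empty bins at zero:
# divide only where rw > 0; empty bins keep the value 0 and rw itself is not modified
corrfun = np.sqrt(np.divide(corrfun, rw, out=np.zeros_like(corrfun, dtype=float),
                            where=rw > 0))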
Is there a way to make the for-loop faster? While I would prefer not to add the self-interaction (j==k), I am even OK with having the self-interaction if it means I can get a significantly faster calculation (the length of ind is ~1E6, so the self-interaction is probably insignificant anyway).
Thank you!
Ilya
Edit:
Here is the full code. Note, in the full code I am averaging over j as well.
import numpy as np
def twopointcorr(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    print(r)
    corrfun = r*0
    rw = r*0
    print(maxR)
    # go through all points
    for j in range(0, n-1):
        hypot = np.sqrt((x[j]-x)**2 + (y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j == k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1
    rw2 = rw.copy()  # copy so the clamping below does not overwrite the returned rw
    rw2[rw < 1] = 1
    corrfun = np.sqrt(np.divide(corrfun, rw2))
    return r, corrfun, rw
I test it in the following way:
from twopointcorr import twopointcorr
import numpy as np
import matplotlib.pyplot as plt
import time
n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)
print('running two point corr function')
start_time = time.time()
r,corrfun,rw = twopointcorr(x,y,s,0.1)
print("--- Execution time is %s seconds ---" % (time.time() - start_time))
fig1=plt.figure()
plt.plot(r, corrfun,'-x')
fig2=plt.figure()
plt.plot(r, rw,'-x')
plt.show()
Again, the main issue is that in the real dataset n~1E6. I can resample to make it smaller, of course, but I would love to actually crank through the dataset.
Here is code that uses broadcasting, hypot, round, and bincount to remove all the loops:
def twopointcorr2(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    osub = lambda x: np.subtract.outer(x, x)
    ind = np.clip(np.round(np.hypot(osub(x), osub(y)) / dr), 0, len(r)-1).astype(int)
    rw = np.bincount(ind.ravel())
    rw[0] -= len(x)
    corrfun = np.bincount(ind.ravel(), (osub(s)**2).ravel())
    return r, corrfun, rw
To compare, I modified your code as follows:
def twopointcorr(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        hypot = np.sqrt((x[j]-x)**2 + (y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j == k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1
    return r, corrfun, rw
and here is the code to check the results:
import numpy as np
n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)
r1, corrfun1, rw1 = twopointcorr(x,y,s,0.1)
r2, corrfun2, rw2 = twopointcorr2(x,y,s,0.1)
assert np.allclose(r1, r2)
assert np.allclose(corrfun1, corrfun2)
assert np.allclose(rw1, rw2)
and the %timeit results:
%timeit twopointcorr(x,y,s,0.1)
%timeit twopointcorr2(x,y,s,0.1)
outputs:
1 loop, best of 3: 5.16 s per loop
10 loops, best of 3: 134 ms per loop
Your original code on my system runs in about 5.7 seconds. I fully vectorized the inner loop and got it to run in 0.39 seconds. Simply replace your "go through all points" loop with this:
import scipy.spatial.distance  # needed for cdist

points = np.column_stack((x, y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(np.int)
# go through all points
for j in range(n):  # n.b. previously n-1, not sure why
    ind = inds[j]
    np.add.at(corrfun, ind, (s - s[j])**2)
    np.add.at(rw, ind, 1)
    rw[ind[j]] -= 1  # subtract self
The first observation was that your hypot code was computing 2D distances, so I replaced that with cdist from SciPy to do it all in a single call. The second was that the inner for loop was slow, and thanks to an insightful comment from @hpaulj I vectorized that as well using np.add.at().
Since you asked how to vectorize the inner loop as well, I did that later. It now takes 0.25 seconds to run, for a total speedup of over 20x. Here's the final code:
points = np.column_stack((x,y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(np.int)
sn = np.tile(s, (n,1)) # n copies of s
diffs = (sn - sn.T)**2 # squares of pairwise differences
np.add.at(corrfun, inds, diffs)
rw = np.bincount(inds.flatten(), minlength=len(r))
np.subtract.at(rw, inds.diagonal(), 1) # subtract self
This uses more memory but does produce a substantial speedup vs. the single-loop version above.
OK, so as it turns out, outer products are incredibly memory expensive. However, using the answers from @HYRY and @JohnZwinck, I was able to make code that is still roughly linear in n in memory and computes fast (0.5 seconds for the test case):
import numpy as np
def twopointcorr(x, y, s, dr, maxR=-1):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    if maxR < dr:
        maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR+dr, dr)
    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        ind = np.clip(np.round(np.hypot(x[j]-x, y[j]-y) / dr), 0, len(r)-1).astype(int)
        np.add.at(corrfun, ind, (s - s[j])**2)
        np.add.at(rw, ind, 1)
    rw[0] -= n
    corrfun = np.sqrt(np.divide(corrfun, np.maximum(rw, 1)))
    r = np.delete(r, -1)
    rw = np.delete(rw, -1)
    corrfun = np.delete(corrfun, -1)
    return r, corrfun, rw

Fully vectorise numpy polyfit

Overview
I am running into performance issues with polyfit because it doesn't appear able to accept broadcast arrays. I am aware from this post that the dependent data y can be multidimensional if you use numpy.polynomial.polynomial.polyfit. However, the x dimension cannot be multidimensional. Is there any way around this?
Motivation
I need to compute the rate of change of some data. To match with an experiment I want to use the following method: take data y and x, fit a polynomial to short sections of the data, then use the fitted coefficient as an estimate of the rate of change.
Illustration
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = np.linspace(0, 10, n)
y = np.sin(x)
window_length = 10
ydot = [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
        for j in range(n - window_length)]
x_mids = [x[j + window_length//2] for j in range(n - window_length)]
plt.plot(x, y)
plt.plot(x_mids, ydot)
plt.show()
The blue line is the original data (a sine curve), while the green is the first differential (a cosine curve).
The problem
To vectorise this I did the following:
window_length = 10
vert_idx_list = np.arange(0, len(x) - window_length, 1)
hori_idx_list = np.arange(window_length)
A, B = np.meshgrid(hori_idx_list, vert_idx_list)
idx_array = A + B
x_array = x[idx_array]
y_array = y[idx_array]
This broadcasts the two 1D vectors to 2D vectors of shape (n-window_length, window_length). Now I was hoping that polyfit would have an axis argument so I could parallelise the calculation, but no such luck.
Does anyone have any suggestions for how to do this? I am open to alternative approaches as well.
The way polyfit works is by solving a least-square problem of the form:
y = [X].a
where y are your dependent coordinates, [X] is the Vandermonde matrix of the corresponding independent coordinates, and a is the vector of fitted coefficients.
In your case you are always computing a 1st degree polynomial approximation, and are actually only interested in the coefficient of the 1st degree term. This has a well known closed form solution you can find in any statistics book, or produce yourself by creating a 2x2 linear system of equations, premultiplying both sides of the above equation by the transpose of [X]. This all adds up to the value you want to calculate being:
>>> n = 10
>>> x = np.random.random(n)
>>> y = np.random.random(n)
>>> np.polyfit(x, y, 1)[0]
-0.29207474654700277
>>> (n*(x*y).sum() - x.sum()*y.sum()) / (n*(x*x).sum() - x.sum()*x.sum())
-0.29207474654700216
On top of that you have a sliding window running over your data, so you can use something akin to a 1D summed area table as follows:
def sliding_fitted_slope(x, y, win):
    x = np.concatenate(([0], x))
    y = np.concatenate(([0], y))
    Sx = np.cumsum(x)
    Sy = np.cumsum(y)
    Sx2 = np.cumsum(x*x)
    Sxy = np.cumsum(x*y)
    Sx = Sx[win:] - Sx[:-win]
    Sy = Sy[win:] - Sy[:-win]
    Sx2 = Sx2[win:] - Sx2[:-win]
    Sxy = Sxy[win:] - Sxy[:-win]
    return (win*Sxy - Sx*Sy) / (win*Sx2 - Sx*Sx)
With this code you can easily check that (notice I extended the range by 1):
>>> np.allclose(sliding_fitted_slope(x, y, window_length),
...             [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
...              for j in range(n - window_length + 1)])
True
And:
%timeit sliding_fitted_slope(x, y, window_length)
10000 loops, best of 3: 34.5 us per loop
%%timeit
[np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
for j in range(n - window_length + 1)]
100 loops, best of 3: 10.1 ms per loop
So it is about 300x faster for your sample data.
Sorry for answering my own question, but after 20 more minutes of trying to get to grips with it I have the following solution:
ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
One confusing part is that np.polyfit returns the coefficients with the highest power first, while in np.polynomial.polynomial.polyfit the highest power is last (hence the -1 instead of 0 index).
Another confusing part is that we use only the first slice of x (x_array[0]). I think this is okay because it is not the absolute values of the independent vector x that are used, but the differences between them. Alternatively, it is like changing the reference x value.
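As a quick check of this reasoning (my own sketch; it relies on x being uniformly spaced, as in the example above, so every window has the same x spacing):
ydot_loop = np.array([np.polyfit(x_array[k], y_array[k], 1)[0]
                      for k in range(len(x_array))])
ydot_vec = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
print(np.allclose(ydot_loop, ydot_vec))   # True for evenly spaced x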
If there is a better way to do this I am still happy to hear about it!
Using an alternative method for calculating the rate of change may be the solution for both speed and accuracy increase.
n = 1000
x = np.linspace(0, 10, n)
y = np.sin(x)
def timingPolyfit(x, y):
    window_length = 10
    vert_idx_list = np.arange(0, len(x) - window_length, 1)
    hori_idx_list = np.arange(window_length)
    A, B = np.meshgrid(hori_idx_list, vert_idx_list)
    idx_array = A + B
    x_array = x[idx_array]
    y_array = y[idx_array]
    ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
    x_mids = [x[j + window_length//2] for j in range(n - window_length)]
    return ydot, x_mids

def timingSimple(x, y):
    dy = (y[2:] - y[:-2])/2
    dx = x[1] - x[0]
    dydx = dy/dx
    return dydx, x[1:-1]
y1, x1 = timingPolyfit(x,y)
y2, x2 = timingSimple(x,y)
polyfitError = np.abs(y1 - np.cos(x1))
simpleError = np.abs(y2 - np.cos(x2))
print("polyfit average error: {:.2e}".format(np.average(polyfitError)))
print("simple average error: {:.2e}".format(np.average(simpleError)))
result = %timeit -o timingPolyfit(x,y)
result2 = %timeit -o timingSimple(x,y)
print("simple is {0} times faster".format(result.best / result2.best))
polyfit average error: 3.09e-03
simple average error: 1.09e-05
100 loops, best of 3: 3.2 ms per loop
100000 loops, best of 3: 9.46 µs per loop
simple is 337.995634151131 times faster
