How to efficiently interpolate data in a Pandas DataFrame row-wise? - python

I have several thousand "observations". Each observation consists of location (x,y) and sensor reading (z), see example below.
I would like to fit a bi-linear surface to the x,y, and z data. I am currently doing it with the code-snippet from amroamroamro/gist:
import numpy as np
import scipy.linalg

def bi2Dlinter(xdata, ydata, zdata, gridrez):
    X, Y = np.meshgrid(
        np.linspace(min(xdata), max(xdata), endpoint=True, num=gridrez),
        np.linspace(min(ydata), max(ydata), endpoint=True, num=gridrez))
    A = np.c_[xdata, ydata, np.ones(len(zdata))]
    C, _, _, _ = scipy.linalg.lstsq(A, zdata)
    Z = C[0]*X + C[1]*Y + C[2]
    return Z
My current approach is to cycle through the rows of the DataFrame. (This works great for 1000 observations but is not usable for larger data-sets.)
ZZ = []
for index, row in df2.iterrows():
    x = row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y = row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z = row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append(np.median(bi2Dlinter(x, y, z, gridrez)))
df2['ZZ'] = ZZ
I would be surprised if there is not a more efficient way to do this.
Is there a way to vectorize the linear interpolation?
I put the code here which also generates dummy entries.
Thanks

Looping over DataFrames like this is generally not recommended. Instead, you should try to vectorize your code as much as possible.
First, we create arrays from your inputs:
x_vals = df2[['x1','x2','x3','x4','x5']].values
y_vals = df2[['y1','y2','y3','y4','y5']].values
z_vals = df2[['z1','z2','z3','z4','z5']].values
Next, we need a version of bi2Dlinter that handles batched inputs. This means changing linspace/meshgrid to work on whole arrays and replacing the least-squares call. NumPy's linear-algebra routines generally broadcast over stacked matrices, but as far as I'm aware scipy.linalg.lstsq does not. Instead, we can use an SVD to replicate the same least-squares solve over a stack of systems.
def create_ranges(start, stop, N, endpoint=True):
    if endpoint:
        divisor = N - 1
    else:
        divisor = N
    steps = (1.0/divisor) * (stop - start)
    return steps[:, None]*np.arange(N) + start[:, None]

def linspace_nd(x, y, gridrez):
    a1 = create_ranges(x.min(axis=1), x.max(axis=1), N=gridrez, endpoint=True)
    a2 = create_ranges(y.min(axis=1), y.max(axis=1), N=gridrez, endpoint=True)
    out_shp = a1.shape + (a2.shape[1],)
    Xout = np.broadcast_to(a1[:, None, :], out_shp)
    Yout = np.broadcast_to(a2[:, :, None], out_shp)
    return Xout, Yout
def stacked_lstsq(L, b, rcond=1e-10):
    """
    Solve L x = b, via SVD least squares cutting of small singular values.
    L is an array of shape (..., M, N) and b of shape (..., M).
    Returns x of shape (..., N).
    """
    u, s, v = np.linalg.svd(L, full_matrices=False)
    s_max = s.max(axis=-1, keepdims=True)
    s_min = rcond*s_max
    inv_s = np.zeros_like(s)
    inv_s[s >= s_min] = 1/s[s >= s_min]
    x = np.einsum('...ji,...j->...i', v,
                  inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
    return np.conj(x, x)
def vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez):
    X, Y = linspace_nd(x_vals, y_vals, gridrez)
    A = np.stack((x_vals, y_vals, np.ones_like(z_vals)), axis=2)
    C = stacked_lstsq(A, z_vals)
    n_bcast = C.shape[0]
    return (C.T[0].reshape((n_bcast, 1, 1))*X
            + C.T[1].reshape((n_bcast, 1, 1))*Y
            + C.T[2].reshape((n_bcast, 1, 1)))
Upon testing this on data for n=10000 rows, the vectorized function was significantly faster.
%%timeit
ZZ = []
for index, row in df2.iterrows():
    x = row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y = row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z = row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append((bi2Dlinter(x, y, z, gridrez)))
df2['ZZ'] = ZZ
Out: 5.52 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
res = vectorized_bi2Dlinter(x_vals,y_vals,z_vals,gridrez)
Out: 74.6 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
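To get back the per-row summary column from the original loop, the full grid returned above can be reduced along its last two axes. A minimal sketch, assuming res has shape (n_rows, gridrez, gridrez) as returned by vectorized_bi2Dlinter:
# Reduce each row's fitted surface to a single value, as in the original loop,
# then attach it to the DataFrame.
df2['ZZ'] = np.median(res, axis=(1, 2))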
You should pay careful attention to what is going on in this vectorized function and familiarize yourself with broadcasting in NumPy. I cannot take credit for the first three functions; instead, here are the Stack Overflow answers they come from so you can get an understanding:
Vectorized NumPy linspace for multiple start and stop values
how to solve many overdetermined systems of linear equations using vectorized codes?
How to use numpy.c_ properly for arrays

Related

Loop speed up of FFT in python (with `np.einsum`)

Problem: I want to speed up my Python loop containing a lot of products and summations with np.einsum, but I'm also open to any other solutions.
My function takes a vector configuration S of shape (n,n,3) (in my case n=72) and performs a Fourier transform of the correlation function over N*N points. The correlation function is defined as the product of every vector with every other vector. This is multiplied by a cosine of the vector positions times the kx and ky values. Every position pair i,j is then summed to give one point p,m in k-space:
def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)
    x = np.reshape(triangular(n)[0], (n**2))
    y = np.reshape(triangular(n)[1], (n**2))
    for p in range(N):
        for m in range(N):
            for i in range(n**2):
                for j in range(n**2):
                    chi[p, m] += 2/(n**2)*np.dot(conf[i], conf[j])*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
    return (chi, kx, ky)
My problem is that I need roughly 100*100 points, denoted by kx*ky, and the loop takes too many hours to finish this job for a lattice of 72*72 vectors.
Number of calculations: 72*72*72*72*100*100
I cannot use NumPy's built-in FFT because of my triangular grid, so I need some other way to reduce the computational cost.
My idea: First I noticed that reshaping the configuration into a list of vectors instead of a matrix reduces the computational cost. Furthermore, I used the numba package, which also reduced the cost, but it is still too slow. I found out that a good way of calculating this kind of object is the np.einsum function. Calculating the product of every vector with every other vector is done with the following:
np.einsum('ij,kj -> ik',np.reshape(S,(72**2,3)),np.reshape(S,(72**2,3)))
The tricky part is the calculation of the term inside the np.cos. Here I want to calculate the product between a list of shape (100,1) and the positions of the vectors (e.g. np.shape(x) = (72**2,1)). In particular, I really don't know how to implement the distances in the x-direction and y-direction with np.einsum.
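As a rough sketch of the broadcasting involved (not a full solution, and the pairwise arrays grow as n**4 in memory), the terms in the formula above can be precomputed once; each k-space point (p, m) then reduces to a single sum. This assumes S, n, x, y, kx and ky are defined as in the code above and p, m are loop indices:
conf = np.reshape(S, (n**2, 3))
dots = np.einsum('ij,kj->ik', conf, conf)  # dot(conf[i], conf[j]) for all i, j
dx = x[:, None] - x[None, :]               # x[i] - x[j] for all i, j
dy = y[:, None] - y[None, :]               # y[i] - y[j] for all i, j
# one k-space point then becomes a single reduction over i, j
chi_pm = 2/(n**2)*np.sum(dots*np.cos(kx[p]*dx + ky[m]*dy))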
To reproduce the code (you probably won't need this): First you need a vector configuration. You can simply use np.ones((72,72,3)) or take random vectors as an example with:
import random

def spherical_to_cartesian(r, theta, phi):
    '''Convert spherical coordinates (physics convention) to cartesian coordinates'''
    sin_theta = np.sin(theta)
    x = r * sin_theta * np.cos(phi)
    y = r * sin_theta * np.sin(phi)
    z = r * np.cos(theta)
    return x, y, z  # return a tuple

def random_directions(n, r):
    '''Return ``n`` 3-vectors in random directions with radius ``r``'''
    out = np.empty(shape=(n, 3), dtype=np.float64)
    for i in range(n):
        # Pick directions randomly in solid angle
        phi = random.uniform(0, 2*np.pi)
        theta = np.arccos(random.uniform(-1, 1))
        # unpack a tuple
        x, y, z = spherical_to_cartesian(r, theta, phi)
        out[i] = x, y, z
    return out
S = np.reshape(random_directions(72**2,1),(72,72,3))
(The reshape in this example is needed so that spin_spin can reshape it back to the (72**2,3) shape.)
For the positions of vectors I use a triangular grid defined by
def triangular(nsize):
    '''Positional arguments of the spin configuration'''
    X = np.zeros((nsize, nsize))
    Y = np.zeros((nsize, nsize))
    for i in range(nsize):
        for j in range(nsize):
            X[i, j] += 1/2*j + i
            Y[i, j] += np.sqrt(3)/2*j
    return (X, Y)
Optimized Numba implementation
The main problem in your code is calling the external BLAS function np.dot repeatedly with extremely small inputs. In this code it would make more sense to calculate those dot products only once, but if you have to do such calculations in a loop, write a Numba implementation. Example:
Optimized function (brute-force)
import numpy as np
import numba as nb

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N).astype(np.float32)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N).astype(np.float32)
    x = np.reshape(triangular(n)[0], (n**2)).astype(np.float32)
    y = np.reshape(triangular(n)[1], (n**2)).astype(np.float32)
    # precalculate some values
    fact = nb.float32(2/(n**2))
    conf_dot = np.dot(conf, conf.T).astype(np.float32)
    for p in nb.prange(N):
        for m in range(N):
            # accumulating on a scalar is often beneficial
            acc = nb.float32(0)
            for i in range(n**2):
                for j in range(n**2):
                    acc += conf_dot[i, j]*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
            chi[p, m] = fact*acc
    return (chi, kx, ky)
Optimized function (removing redundant calculations)
There are a lot of redundant calculations being done. Here is an example of how to remove them. This version also does the calculations in double precision.
@nb.njit()
def precalc(S):
    # Not all redundancies may be removed
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    conf_dot = np.dot(conf, conf.T)
    x = np.reshape(triangular(n)[0], (n**2))
    y = np.reshape(triangular(n)[1], (n**2))
    x_s = set()
    y_s = set()
    for i in range(n**2):
        for j in range(n**2):
            x_s.add((x[i]-x[j]))
            y_s.add((y[i]-y[j]))
    x_arr = np.sort(np.array(list(x_s)))
    y_arr = np.sort(np.array(list(y_s)))
    conf_dot_sel = np.zeros((x_arr.shape[0], y_arr.shape[0]))
    for i in range(n**2):
        for j in range(n**2):
            ii = np.searchsorted(x_arr, x[i]-x[j])
            jj = np.searchsorted(y_arr, y[i]-y[j])
            conf_dot_sel[ii, jj] += conf_dot[i, j]
    return x_arr, y_arr, conf_dot_sel

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def spin_spin_opt_2(S, N):
    chi = np.empty((N, N))
    n = len(S)
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)
    x_arr, y_arr, conf_dot_sel = precalc(S)
    fact = 2/(n**2)
    for p in nb.prange(N):
        for m in range(N):
            acc = nb.float32(0)
            for i in range(x_arr.shape[0]):
                for j in range(y_arr.shape[0]):
                    acc += fact*conf_dot_sel[i, j]*np.cos(kx[p]*x_arr[i] + ky[m]*y_arr[j])
            chi[p, m] = acc
    return (chi, kx, ky)
Timings
#brute-force
%timeit res=spin_spin(S,100)
#48 s ± 671 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#new version
%timeit res_2=spin_spin_opt_2(S,100)
#5.33 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=spin_spin_opt_2(S,1000)
#1min 23s ± 2.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit (SVML-check)
import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def foo(n):
    x = np.empty(n*8, dtype=np.float64)
    ret = np.empty_like(x)
    for i in range(ret.size):
        ret[i] += np.cos(x[i])
    return ret

foo(1000)
if 'intel_svmlcc' in foo.inspect_llvm(foo.signatures[0]):
    print("found")
else:
    print("not found")
# found
If it prints "not found", read this link. It should work on Linux and Windows, but I haven't tested it on macOS.
Here is one approach to speed things up. I didn't end up using np.einsum because a little tweaking of your loops was sufficient.
The main thing slowing down your code is redundant recalculation of the same quantities. The nested loop here is the culprit:
for p in range(N):
    for m in range(N):
        for i in range(n**2):
            for j in range(n**2):
                chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
It contains a lot of redundancy, recalculating the same vector operations many times.
Consider the np.dot(...): this calculation is completely independent of the points kx and ky, which only need to be indexed by p and m. So you can run the dot products over all i and j just once and save the result, rather than recalculating them for each p, m (which would be 10,000 times!).
Similarly, the vector differences between positions do not need to be recalculated at each point in the lattice. At every point you currently recompute every pairwise distance, when all that is needed is to compute the distances once and merely reuse them at each lattice point.
So, having fixed the loops and used dictionaries with indices (i,j) as keys to store all the values, you can just look up the relevant value during the loop over i, j. Here is my code:
import itertools

def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)
    # Minor point; no need to call triangular twice
    x, y = triangular(n)
    x, y = np.reshape(x, (n**2)), np.reshape(y, (n**2))
    # Build look-ups for all the dot products and differences to avoid recalculating them
    dot_prods = dict()
    x_diffs, y_diffs = dict(), dict()
    for i, j in itertools.product(range(n**2), range(n**2)):
        dot_prods[(i, j)] = np.dot(conf[i], conf[j])
        x_diffs[(i, j)], y_diffs[(i, j)] = x[i] - x[j], y[i] - y[j]
    # Minor point; improve syntax by converting nested for loops to one line
    for p, m in itertools.product(range(N), range(N)):
        for i, j in itertools.product(range(n**2), range(n**2)):
            # All vector operations are replaced by look-ups into the dictionaries defined above
            chi[p, m] += 2/(n**2)*dot_prods[(i, j)]*np.cos(kx[p]*(x_diffs[(i, j)]) + ky[m]*(y_diffs[(i, j)]))
    return (chi, kx, ky)
I am running this at the moment with the dimensions you provide, on a decent machine, and the loop over i, j finishes in two minutes. That only needs to happen once; then it is just the loop over p, m. Each of those takes about 90 seconds, so it is still a 2-3 hour run time. I welcome any suggestions on how to optimise the cos calculation to speed that up!
I only hit the low-hanging fruit of optimisation, but to give a sense of the gain: the loop over i, j takes 2 minutes, and this way it runs 9,999 fewer times!

What is the quickest way to evaluate an interpolated function on a polar grid in Python?

I have an interpolated function of two Cartesian variables (created using RectBivariateSpline), and I'm looking for the fastest means of evaluating that function over a polar grid.
I've approached this problem by first defining spaces in r (the radial coordinate) and t (the angular coordinate), creating a meshgrid from these, converting this meshgrid to Cartesian coordinates, and then evaluating the function at each point by looping over the Cartesian meshgrid. The below code demonstrates this.
import numpy as np
import scipy as sp
from scipy.interpolate import RectBivariateSpline
# this shows the type of data/function I'm working with:
n = 1000
x = np.linspace(-10, 10, n)
y = np.linspace(-10, 10, n)
z = np.random.rand(n,n)
fun = RectBivariateSpline(x, y, z)
# this defines the polar grid and converts it to a Cartesian one:
nr = 1000
nt = 360
r = np.linspace(0, 10, nr)
t = np.linspace(0, 2*np.pi, nt)
R, T = np.meshgrid(r, t, indexing = 'ij')
kx = R*np.cos(T)
ky = R*np.sin(T)
# this evaluates the interpolated function over the converted grid:
eval = np.empty((nr, nt))
for i in range(0, nr):
    for j in range(0, nt):
        eval[i][j] = fun(kx[i][j], ky[i][j])
In this way, I get an array whose elements match up with the R and T arrays, where i corresponds to R, and j to T. This is important, because for each radius I need to sum the evaluated function over the angular coordinate.
This approach works, but is dreadfully slow... in reality I am working with much, much larger arrays than those here. I'm looking for a way to speed this up.
One thing I've noticed is that you can pass two 1D arrays to a 2-variable function and get back a 2D array of the function evaluated at every combination of the input points. Because my function isn't a polar one, however, I can't just pass my radial and angular arrays to it. Ideally there'd be a way to convert an interpolated function to accept polar arguments, but I don't think this is possible.
I should note further that there is no way I can define the function in terms of radial coordinates in the first place: the data I'm using is output from a discrete Fourier transform, which requires rectangularly-gridded data.
Any help would be appreciated!
By examining the __call__ method of RectBivariateSpline here, you can use the grid=False option to avoid the slow double loop.
This alone provides an order of magnitude speed up on the example you gave. I would expect the speedup to be even better on larger data sets.
Also the answers are the same between the methods as expected.
import numpy as np
import scipy as sp
from scipy.interpolate import RectBivariateSpline
# this shows the type of data/function I'm working with:
n = 1000
x = np.linspace(-10, 10, n)
y = np.linspace(-10, 10, n)
z = np.random.rand(n,n)
fun = RectBivariateSpline(x, y, z)
# this defines the polar grid and converts it to a Cartesian one:
nr = 1000
nt = 360
r = np.linspace(0, 10, nr)
t = np.linspace(0, 2*np.pi, nt)
R, T = np.meshgrid(r, t, indexing = 'ij')
kx = R*np.cos(T)
ky = R*np.sin(T)
# this evaluates the interpolated function over the converted grid:
def evaluate_slow(kx, ky):
    eval = np.empty((nr, nt))
    for i in range(0, nr):
        for j in range(0, nt):
            eval[i][j] = fun(kx[i][j], ky[i][j])
    return eval

def evaluate_fast(kx, ky):
    eval = fun(kx.ravel(), ky.ravel(), grid=False)
    return eval
%timeit evaluate_slow(kx, ky)
%timeit evaluate_fast(kx, ky)
eval1 = evaluate_slow(kx, ky)
eval2 = evaluate_fast(kx, ky)
print(np.all(np.isclose(eval1, eval2.reshape((nr, nt)))))
1.69 s ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
262 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
True
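As a small follow-up (not part of the timing above): since the question also needs, for each radius, the sum over the angular coordinate, the flat output of evaluate_fast can simply be reshaped back to the (nr, nt) grid and reduced along the angular axis:
eval_grid = eval2.reshape((nr, nt))      # rows index r, columns index t
radial_profile = eval_grid.sum(axis=1)   # sum over the angular coordinate for each radius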

Scipy interpolate on structured 2d data, but evaluate at unstructured points?

I have the following minimum code using scipy.interpolate.interp2d to do interpolation on 2d grid data.
import numpy as np
from scipy import interpolate
x = np.arange(-5.01, 5.01, 0.25)
y = np.arange(-5.01, 5.01, 0.25)
xx, yy = np.meshgrid(x, y)
z = np.sin(xx**2+yy**2)
f = interpolate.interp2d(x, y, z, kind='cubic')
Now f here can be used to evaluate other points. The problem is the points I want to evaluate are totally random points not forming a regular grid.
# Evaluate at point (x_new, y_new), in total 256*256 points
x_new = np.random.random(256*256)
y_new = np.random.random(256*256)
f(x_new, y_new)
This causes a runtime error on my PC: it seems to treat x_new and y_new as a mesh grid and tries to generate a 65536x65536 evaluation matrix, which is not what I want.
RuntimeError: Cannot produce output of size 65536x65536 (size too large)
One way to get things done is to evaluate points one by one, using code:
z_new = np.array([f(i, j) for i, j in zip(x_new, y_new)])
However, it is slow!!!
%timeit z_new = np.array([f(i, j) for i, j in zip(x_new, y_new)])
1.26 s ± 46.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there any faster way to evaluate random points?
Faster here I mean comparable with time below:
x_new = np.random.random(256)
y_new = np.random.random(256)
%timeit f(x_new, y_new)
The same 256*256 = 65536 evaluations; the time for this on my PC:
1.21 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It does not have to be as fast as 1.21 ms; 121 ms would be totally acceptable.
The function you are looking for is scipy.interpolate.RegularGridInterpolator
Given a set of points (x,y,z), where x & y are defined on a regular grid, it allows you to sample the z-value of intermediate (x,y) points. In your case, this would look as follows
import numpy as np
from scipy import interpolate
x = np.arange(-5.01, 5.01, 0.25)
y = np.arange(-5.01, 5.01, 0.25)
def f(x, y):
    return np.sin(x**2 + y**2)

z = f(*np.meshgrid(x, y, indexing='ij', sparse=True))
func = interpolate.RegularGridInterpolator((x,y), z)
x_new = np.random.random(256*256)
y_new = np.random.random(256*256)
xy_new = list(zip(x_new,y_new))
z_new = func(xy_new)
For more details, see https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.interpolate.RegularGridInterpolator.html
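As a minor variant (a style choice on my part, not required by the answer): RegularGridInterpolator also accepts an (N, 2) array of points, so the list(zip(...)) step can be replaced by np.column_stack, which stays entirely in NumPy:
xy_new = np.column_stack((x_new, y_new))  # shape (N, 2), same points as list(zip(...))
z_new = func(xy_new)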

Reshape numpy array with minimum rate

I have an array that is not monotonically increasing. I would like to make it monotonically increasing by applying a constant rate where the array decreases.
I have created a small example here where the rate is 0.2:
import numpy as np
import matplotlib.pyplot as plt

# Rate
rate = 0.2
# Array to interpolate
arr1 = np.array([0,1,2,3,4,4,4,3,2,2.5,3.5,5.2,7,10,9.5,np.nan,np.nan,np.nan,11.2, 11.4, 12,10,9,9.5,10.2,10.5,10.8,12,12.5,15],dtype=float)
# Line with constant rate at first monotonic decrease (index 6)
xx1 = 6
xr1 = np.array(np.arange(0,arr1.shape[0]+1),dtype=float)
yr1 = rate*xr1 + (arr1[xx1]-rate*xx1)
# Line with constant rate at second monotonic decrease (index 13)
xx2 = 13
xr2 = np.array(np.arange(0,arr1.shape[0]+1),dtype=float)
yr2 = rate*xr2 + (arr1[xx2]-rate*xx2)
# Line with constant rate at third monotonic decrease (index 20)
xx3 = 20
xr3 = np.array(np.arange(0,arr1.shape[0]+1),dtype=float)
yr3 = rate*xr3 + (arr1[xx3]-rate*xx3)

plt.figure()
plt.plot(arr1,'.-',label='Original')
plt.plot(xr1,yr1,label='Const Rate line 1')
plt.plot(xr2,yr2,label='Const Rate line 2')
plt.plot(xr3,yr3,label='Const Rate line 3')
plt.legend()
plt.grid()
The "Original" array is my dataset.
The final result I would like is the blue + red-dashed line. In the figure I also highlighted the "constant rate curves".
Since I have very large arrays (millions of records), I would like to avoid for-loops over the entire array.
Thanks a lot to everybody for the help!
Here's a different option: if you are interested in plotting a monotonically increasing curve from your data, then you can simply skip the unwanted points between two successive increasing points, e.g. between arr1[6] = 4 and arr1[11] = 5.2, by connecting them with a line.
import numpy as np
import matplotlib.pyplot as plt
arr1 = np.array([0,1,2,3,4,4,4,3,2,2.5,3.5,5.2,7,10,9.5,np.nan,np.nan,np.nan,11.2, 11.4, 12,10,9,9.5,10.2,10.5,10.8,12,12.5,15],dtype=float)
mask = (arr1 == np.maximum.accumulate(np.nan_to_num(arr1)))
x = np.arange(len(arr1))
plt.figure()
plt.plot(x, arr1,'.-',label='Original')
plt.plot(x[mask], arr1[mask], 'r-', label='Interp.')
plt.legend()
plt.grid()
arr2 = arr1[1:] - arr1[:-1]
ind = np.where(arr2 < 0)[0]
for i in ind:
    arr1[i] = arr1[i - 1] + rate
You may first need to replace any np.nan values, e.g. with np.nanmin(arr1).
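A minimal sketch of that replacement (using np.nanmin here, since np.amin itself returns NaN when NaNs are present):
arr1 = np.where(np.isnan(arr1), np.nanmin(arr1), arr1)  # fill NaNs with the smallest finite value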
I would like to avoid for-loops over the entire array.
Frankly speaking, it is hard to avoid for-loops entirely, because NumPy itself, being implemented in C/C++, runs loops under the hood, and functions like np.argwhere and np.all still have to iterate over the data and do comparisons.
Instead, I suggest using at least one explicit loop in Python (the iteration is made only once):
arr0 = np.zeros_like(arr1)
arr0[0] = arr1[0]
num = 1
rate = .2
while num < len(arr1):
    if arr1[num] < arr1[num-1] or np.isnan(arr1[num]):
        start = arr1[num-1]
        while num < len(arr1) and (start > arr1[num] or np.isnan(arr1[num])):
            arr0[num] = arr0[num-1] + rate
            num += 1
        continue
    arr0[num] = arr1[num]
    num += 1
Your problem can be expressed in one simple recursive difference equation:
y[n] = max(y[n-1] + 0.2, x[n])
So the direct Python form would be
def func(a):
    out = np.zeros_like(a)
    out[0] = a[0]
    for i in range(1, len(a)):
        out[i] = max(out[i-1] + 0.2, a[i])
    return out
Unfortunately, this equation is recursive and non-linear, so finding a vectorized algorithm may be difficult.
However, using Numba we can speed up this loop-based algorithm by a factor of 300:
import numba

fastfunc = numba.jit(func)
arr1 = np.random.rand(1000000)
%timeit func(arr1)
# 599 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fastfunc(arr1)
# 2.22 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I finally managed to do what I wanted with a while loop.
# data['myvar'] is the original dataset I want to reshape
data['myvar_corrected'] = data['myvar'].values
temp_d = data['myvar'].fillna(0).values*1.0
dtc = np.maximum.accumulate(temp_d)
data.loc[temp_d < np.maximum.accumulate(dtc), 'myvar_corrected'] = float('nan')

stay_in_while = True
min_rate = 5/200000/(24*60)
idx_next = 0
while stay_in_while:
    df_temp = data.iloc[idx_next:]
    if df_temp['myvar'].isnull().sum() > 0:
        idx_first_nan = df_temp.reset_index()['myvar_corrected'].isnull().argmax()
        idx_nan_or = (data_new.index.values == df_temp.index.values[idx_first_nan]).argmax()
        x = np.arange(idx_first_nan-1, df_temp.shape[0])
        y0 = df_temp.iloc[idx_first_nan-1]['myvar_corrected']
        rate_curve = min_rate*x + (y0 - min_rate*(idx_first_nan-1))
        damage_m_rate = df_temp.iloc[idx_first_nan-1:]['myvar_corrected'] - rate_curve
        try:
            idx_intercept = (data_new.index.values == damage_m_rate[damage_m_rate > 0].index.values[0]).argmax()
            data_new.iloc[idx_nan_or:idx_intercept]['myvar'] = rate_curve[0:(damage_m_rate.index.values == damage_m_rate[damage_m_rate > 0].index.values[0]).argmax()-1]
            idx_next = idx_intercept + 1
        except:
            stay_in_while = False
    else:
        stay_in_while = False
# Finally I have my result stored in data_new['myvar']
The following picture shows the result.
Thanks to everybody for the contribution!

Efficient online linear regression algorithm in python

I have a 2-D dataset with two columns, x and y. I would like to get the linear regression coefficient and intercept dynamically as new data feeds in. Using scikit-learn, I can fit all currently available data like this:
import numpy as np
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
x = np.arange(100).reshape(-1, 1)
y = np.arange(100) + 10*np.random.random_sample((100,))
regr.fit(x, y)
print(regr.coef_)
print(regr.intercept_)
However, I have quite a big dataset (more than 10k rows in total) and I want to calculate the coefficient and intercept as fast as possible whenever new rows come in. Currently, calculating on 10k rows takes about 600 microseconds, and I want to accelerate this process.
Scikit-learn does not seem to have an online-update function for its linear regression module. Is there a better way to do this?
I've found a solution in this paper: updating simple linear regression. The implementation is below:
def lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    """
    x_avg: average of previous x, if no previous sample, set to 0
    y_avg: average of previous y, if no previous sample, set to 0
    Sxy: covariance of previous x and y, if no previous sample, set to 0
    Sx: variance of previous x, if no previous sample, set to 0
    n: number of previous samples
    new_x: new incoming 1-D numpy array x
    new_y: new incoming 1-D numpy array y
    """
    new_n = n + len(new_x)

    new_x_avg = (x_avg*n + np.sum(new_x))/new_n
    new_y_avg = (y_avg*n + np.sum(new_y))/new_n

    if n > 0:
        x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
    elif n == 0:
        x_star = new_x_avg
        y_star = new_y_avg
    else:
        raise ValueError

    new_Sx = Sx + np.sum((new_x-x_star)**2)
    new_Sxy = Sxy + np.sum((new_x-x_star).reshape(-1) * (new_y-y_star).reshape(-1))

    beta = new_Sxy/new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
Performance comparison:
Scikit-learn version that fits all 10k samples at once:
from sklearn.linear_model import LinearRegression
x = np.arange(10000).reshape(-1,1)
y = np.arange(10000)+100*np.random.random_sample((10000,))
regr = LinearRegression()
%timeit regr.fit(x,y)
# 419 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My version, assuming 9k samples are already calculated:
Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x.reshape(-1,1)[:9000], y[:9000])
new_x, new_y = x.reshape(-1,1)[9000:], y[9000:]
%timeit lr(new_x_avg, new_y_avg, Sxy,Sx,n,new_x, new_y)
# 38.7 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10 times faster, which is expected.
Nice! Thanks for sharing your findings :) Here is an equivalent implementation of this solution written with dot products:
class SimpleLinearRegressor(object):
    def __init__(self):
        self.dots = np.zeros(5)
        self.intercept = None
        self.slope = None

    def update(self, x: np.ndarray, y: np.ndarray):
        self.dots += np.array(
            [
                x.shape[0],
                x.sum(),
                y.sum(),
                np.dot(x, x),
                np.dot(x, y),
            ]
        )
        size, sum_x, sum_y, sum_xx, sum_xy = self.dots
        det = size * sum_xx - sum_x ** 2
        if det > 1e-10:  # determinant may be zero initially
            self.intercept = (sum_xx * sum_y - sum_xy * sum_x) / det
            self.slope = (sum_xy * size - sum_x * sum_y) / det
When working with time series data, we can extend this idea to do sliding window regression with a soft (EMA-like) window.
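A rough sketch of what that could look like, building on the SimpleLinearRegressor above and assuming numpy is imported as np; the decay factor and its default value are my own choice, not part of the original idea:
class EwmLinearRegressor(SimpleLinearRegressor):
    """Soft sliding-window variant: old observations are geometrically
    down-weighted by `decay` on every update (EMA-like behaviour)."""
    def __init__(self, decay: float = 0.99):
        super().__init__()
        self.decay = decay

    def update(self, x: np.ndarray, y: np.ndarray):
        self.dots *= self.decay  # fade out the accumulated sums before adding new data
        super().update(x, y)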
You can use accelerated libraries that implement faster algorithms - particularly
https://github.com/intel/scikit-learn-intelex
For linear regression you would get much better performance.
First, install the package:
pip install scikit-learn-intelex
Then add this to your Python script:
from sklearnex import patch_sklearn
patch_sklearn()
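A minimal usage sketch with the same 10k-sample example as above; note that, per the project's documentation, patch_sklearn() has to be called before the scikit-learn estimators are imported:
import numpy as np
from sklearnex import patch_sklearn
patch_sklearn()  # must run before importing the estimators below

from sklearn.linear_model import LinearRegression

x = np.arange(10000).reshape(-1, 1)
y = np.arange(10000) + 100*np.random.random_sample((10000,))

regr = LinearRegression().fit(x, y)  # now dispatched to the accelerated implementation
print(regr.coef_, regr.intercept_)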
