I have a large collection (26,214,400 sets, to be exact) of data on which I want to perform linear regressions, i.e. each of the 26,214,400 data sets consists of n x values and n y values, and I want to find y = m * x + b. For any single set of points I can use sklearn or numpy.linalg.lstsq, something like:
A = np.vstack([x, np.ones(len(x))]).T
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
Is there a way to set up the matrices such that I can avoid a python loop through 26,214,400 items? Or do I have to use a loop and would be better served using something like Numba?
I ended up going the Numba route, which yielded a ~20x speedup on my laptop. It used all my cores, so I assume more CPUs would help further. The answer looked something like:
import numpy as np
from numpy.linalg import lstsq
import numba
@numba.jit(nogil=True, parallel=True)
def fit(XX, yy):
    """Fit a large set of points to a regression"""
    assert XX.shape == yy.shape, "Inputs mismatched"
    n_pnts, n_samples = XX.shape
    scale = np.empty(n_pnts)
    offset = np.empty(n_pnts)
    for i in numba.prange(n_pnts):
        X, y = XX[i], yy[i]
        A = np.vstack((np.ones_like(X), X)).T
        offset[i], scale[i] = lstsq(A, y)[0]
    return offset, scale
Running it:
XX, yy = np.random.randn(2, 1000, 10)
offset, scale = fit(XX, yy)
%timeit offset, scale = fit(XX, yy)
1.87 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The non-jitted version has this timing:
41.7 ms ± 620 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
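For reference, there is also a way to avoid the loop entirely: simple linear regression has a closed form, m = sum((x - xbar)*(y - ybar)) / sum((x - xbar)**2) and b = ybar - m*xbar, which vectorizes directly across all sets. A minimal sketch (the function name is illustrative, not from the original answer):

import numpy as np

def fit_all(XX, yy):
    # XX, yy: shape (n_sets, n_points); returns per-set slope and intercept
    xm = XX.mean(axis=1, keepdims=True)
    ym = yy.mean(axis=1, keepdims=True)
    xc, yc = XX - xm, yy - ym
    m = (xc * yc).sum(axis=1) / (xc * xc).sum(axis=1)
    b = ym[:, 0] - m * xm[:, 0]
    return m, b

This should compute the same scale and offset as the lstsq-based fit above, up to floating point round-off.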
I have two 2D point clouds with an equal number of elements. For these elements I know their correspondence, i.e. for each point in PC1 I know the corresponding element in PC2 and vice versa.
I would now like to estimate the rotation between these two point clouds. That is, I would like to find the angle alpha by which I must rotate all points in PC1 around the origin such that the distance between corresponding points in PC1 and PC2 is minimized.
I can solve this using scipy's linear optimizer (see below); however, this optimization sits inside a loop along the critical path of my code and is the current bottleneck.
import numpy as np
from scipy.optimize import minimize_scalar
from math import sin, cos
# generate some data for demonstration purposes
# points in each point cloud are ordered by correspondence
num_points = 10
distance = np.random.rand(num_points) * 3
radii = np.random.rand(num_points) * 2*np.pi
pc1 = distance[:, None] * np.stack([np.cos(radii), np.sin(radii)], axis=1)
distance = np.random.rand(num_points) * 3
radii = np.random.rand(num_points) * 2*np.pi
pc2 = distance[:, None] * np.stack([np.cos(radii), np.sin(radii)], axis=1)
# solve using scipy
def score(alpha):
    rot_matrix = np.array([
        [cos(alpha), -sin(alpha)],
        [sin(alpha), cos(alpha)]
    ])
    pc1_rotated = (rot_matrix @ pc1.T).T
    sum_of_squares = np.sum((pc2 - pc1_rotated)**2, axis=1)
    mse = np.mean(sum_of_squares)
    return mse
# simple solution via scipy
result = minimize_scalar(
    score,
    bounds=(0, 2*np.pi),
    method="bounded",
    options={"maxiter": 1000},
)

if result.success:
    print(f"Best angle: {result.x}")
else:
    raise RuntimeError(f"IK failed. Reason: {result.message}")
Is there a faster (potentially analytic) solution to this problem?
Since minimize_scalar only uses derivative-free methods, the optimization runtime depends heavily on the time needed to evaluate your objective function score. Consequently, I'd recommend accelerating this function as much as possible.
Let's time your function and the optimizer as benchmark reference:
In [68]: %timeit score(0.5)
20.2 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [69]: %timeit result = minimize_scalar(score,bounds=(0, 2*np.pi),method="bounded",options={"maxiter": 1000})
415 µs ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Firstly, note that (rot_matrix @ pc1.T).T is the same as pc1 @ rot_matrix.T, i.e. we only need to transpose one matrix instead of two.
Next, note that -sin(alpha) = cos(alpha + 5*pi/2) and sin(alpha) = cos(alpha + 3*pi/2) (both are just cosine shifted by multiples of pi/2). This means that we only need one call of np.cos to create the rot_matrix instead of four calls of math.sin or math.cos.
Lastly, you can compute the mse more efficiently with np.einsum.
Taking all these points into account, the function can look like this:
k1 = 5*np.pi/2
k2 = 3*np.pi/2
def score2(alpha):
    rot_matrixT = np.cos((alpha, alpha + k2, alpha + k1, alpha)).reshape(2, 2)
    pc1_rotated = pc1 @ rot_matrixT
    diff = pc2 - pc1_rotated
    return np.einsum('ij,ij->', diff, diff) / num_points
Timing the function again yields
In [70]: %timeit score2(0.5)
9.26 µs ± 84.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and therefore, the optimizer is much faster:
In [71]: %timeit result = minimize_scalar(score, bounds=(0, 2*np.pi), method="bounded", options={"maxiter": 1000})
279 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If that still is not fast enough, you can just-in-time compile your function with Numba:
In [60]: from numba import njit
In [61]: @njit
    ...: def score3(alpha):
    ...:     rot_matrix = np.array([
    ...:         [cos(alpha), -sin(alpha)],
    ...:         [sin(alpha), cos(alpha)]
    ...:     ])
    ...:     pc1_rotated = (rot_matrix @ pc1.T).T
    ...:     sum_of_squares = np.sum((pc2 - pc1_rotated)**2, axis=1)
    ...:     mse = np.mean(sum_of_squares)
    ...:     return mse
In [62]: %timeit score3(0.5)
2.97 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
or rewrite it using Cython. Just for the sake of completeness, here's a fast Cython implementation:
In [45]: %%cython -c=-O3 -c=-march=native -c=-Wno-deprecated-declarations -c=-Wno-#warnings
...:
...: from libc.math cimport cos, sin
...: cimport numpy as np
...: import numpy as np
...: from cython cimport wraparound, boundscheck
...:
    ...: @wraparound(False)
    ...: @boundscheck(False)
    ...: cpdef double score4(double alpha, double[:, ::1] pc1, double[:, ::1] pc2):
    ...:     cdef int i
    ...:     cdef int N = pc1.shape[0]
    ...:     cdef double diff1 = 0.0
    ...:     cdef double diff2 = 0.0
    ...:     cdef double mse = 0.0
    ...:     cdef double rmT00 = cos(alpha)
    ...:     cdef double rmT01 = sin(alpha)
    ...:     cdef double rmT10 = -rmT01
    ...:     cdef double rmT11 = rmT00
    ...:
    ...:     for i in range(N):
    ...:         diff1 = pc2[i,0] - (pc1[i,0]*rmT00 + pc1[i,1]*rmT10)
    ...:         diff2 = pc2[i,1] - (pc1[i,0]*rmT01 + pc1[i,1]*rmT11)
    ...:         mse += diff1*diff1 + diff2*diff2
    ...:     return mse / N
which yields
In [48]: %timeit score4(0.5, pc1, pc2)
1.05 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Last but not least, you can write down the first-order necessary condition of your problem and check whether it can be solved analytically. Otherwise, you can try to solve the resulting nonlinear equation numerically.
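In this particular case the first-order condition can be solved analytically. Expanding sum ||R(alpha) p - q||^2 over corresponding points p in PC1 and q in PC2 gives C - 2*(cos(alpha)*S_dot + sin(alpha)*S_cross), where S_dot = sum(p . q) and S_cross = sum(p_x*q_y - p_y*q_x), so the minimizer is alpha = atan2(S_cross, S_dot). A sketch, assuming the same pc1/pc2 arrays as above (the function name is illustrative):

import numpy as np

def best_angle(pc1, pc2):
    # closed-form minimizer of sum ||R(alpha) p - q||^2 over alpha
    s_dot = np.sum(pc1 * pc2)                                    # sum of p . q
    s_cross = np.sum(pc1[:, 0]*pc2[:, 1] - pc1[:, 1]*pc2[:, 0])  # sum of 2D cross products
    return np.arctan2(s_cross, s_dot) % (2*np.pi)

Since this involves no iteration at all, it should comfortably beat any of the optimizer-based variants.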
I have written a piece of code that draws random angles from a uniform distribution and totals their cosines until the total reaches a number L.
I have tried to optimise it using Cython, but I would like suggestions on how it could be optimised further, since it will be called with large L values and so would take quite long.
This is the code I have written in Jupyter so far
%%cython
import numpy as np
cimport numpy
import numpy.random
def f(int L):
    cdef double r = 0
    cdef int i = 0
    cdef float theta
    while r <= L:
        theta = np.random.uniform(0, 2*np.pi, size=None)
        r += np.cos(theta)
        i += 1
    return i
I'd like to speed it up as much as possible.
One way you can speed this up, without using Cython, is to call np.random.uniform less frequently. The cost of calling it to return 1 value versus 100,000 values is nearly the same, so calling it once for 1,000 values instead of 1,000 times for 1 value reaps huge time savings:
def call1000():
    return [np.random.uniform(0, 2*np.pi, size=None) for i in range(1000)]
%timeit call1000()
762 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.random.uniform(0, 2*np.pi, size = 1000)
10.8 µs ± 13.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can implement this and ensure that you don't run out of values by doing something like this:
def f(L):
    r = 0
    i = 0
    j = 0
    theta = np.random.uniform(0, 2*np.pi, size=100000)
    while r <= L:
        if j == len(theta):
            # exhausted the batch: draw a fresh one and start over
            j = 0
            theta = np.random.uniform(0, 2*np.pi, size=100000)
        r += np.cos(theta[j])
        i += 1
        j += 1  # advance to the next pre-drawn value
    return i
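Taking the batching idea one step further (a sketch, not part of the original answer): you can also take the cosines and their running total a whole batch at a time, and only draw a new batch when the total has not yet crossed L:

import numpy as np

def f_batched(L, batch=100000):
    # keep a running total of cos(theta) over whole batches; the answer is
    # one plus the index of the first element whose cumulative sum exceeds L
    total = 0.0
    count = 0
    while True:
        cum = total + np.cumsum(np.cos(np.random.uniform(0, 2*np.pi, size=batch)))
        crossed = np.nonzero(cum > L)[0]
        if crossed.size > 0:
            return count + crossed[0] + 1
        total = cum[-1]
        count += batch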
The following method calculates a Gaussian kernel:
import numpy as np
def gaussian_kernel(X, X2, sigma):
    """
    Calculate the Gaussian kernel matrix
    k_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    :param X: array-like, shape=(n_samples_1, n_features), feature-matrix
    :param X2: array-like, shape=(n_samples_2, n_features), feature-matrix
    :param sigma: scalar, bandwidth parameter
    :return: array-like, shape=(n_samples_1, n_samples_2), kernel matrix
    """
    norm = np.square(np.linalg.norm(X[None,:,:] - X2[:,None,:], axis=2).T)
    return np.exp(-norm/(2*np.square(sigma)))
# Usage example
%timeit gaussian_kernel(np.random.rand(5000, 10), np.random.rand(5000, 10), 1)
1.43 s ± 39.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My question is: is there any way to increase performance using numpy?
For quite small arrays you can write a simple loop implementation and compile it using Numba. For larger arrays the algebraic reformulation using np.dot() (shown in the next answer) will be faster.
Example
#from version 0.43 until 0.47 this has to be set before importing numba
#Bug: https://github.com/numba/numba/issues/4689
from llvmlite import binding
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
import numpy as np
@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def gaussian_kernel_2(X, X2, sigma):
    res = np.empty((X.shape[0], X2.shape[0]), dtype=X.dtype)
    for i in nb.prange(X.shape[0]):
        for j in range(X2.shape[0]):
            acc = 0.
            for k in range(X.shape[1]):
                acc += (X[i,k] - X2[j,k])**2 / (2*sigma**2)
            res[i,j] = np.exp(-1*acc)
    return res
Timings
X1=np.random.rand(5000, 10)
X2=np.random.rand(5000, 10)
#Your solution
%timeit gaussian_kernel(X1,X2, 1)
#511 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit gaussian_kernel_2(X1,X2, 1)
#90.1 ms ± 9.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This post: https://stackoverflow.com/a/47271663/9539058 gave an answer. In short, it expands the squared distance as ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y, so that the cross term becomes a single matrix product. To copy the numpy part:
import numpy as np
def gaussian_kernel(X, X2, sigma):
    """
    Calculate the Gaussian kernel matrix
    k_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    :param X: array-like, shape=(n_samples_1, n_features), feature-matrix
    :param X2: array-like, shape=(n_samples_2, n_features), feature-matrix
    :param sigma: scalar, bandwidth parameter
    :return: array-like, shape=(n_samples_1, n_samples_2), kernel matrix
    """
    X_norm = np.sum(X ** 2, axis=-1)
    X2_norm = np.sum(X2 ** 2, axis=-1)
    norm = X_norm[:,None] + X2_norm[None,:] - 2 * np.dot(X, X2.T)
    return np.exp(-norm/(2*np.square(sigma)))
# Timing
%timeit gaussian_kernel(np.random.rand(5000, 10), np.random.rand(5000, 10), 1)
976 ms ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have a situation similar to the "Count values in a certain range" question, but instead of a column vector I have a matrix intervals with two columns [lower, upper] and another column vector true_values.
I want to check element-wise whether the values in the true_values vector fall within the ranges defined by [lower, upper].
The answer provided in the linked question would do 4 passes:
((true_values >= intervals[:, 0]) & (true_values <= intervals[:, 1])).sum()
One pass for each greater/less than check, one for the and clause, and one for the sum.
Given that these are potentially huge matrices, I'm wondering if it's possible to reduce the number of passes, ideally to one pass for the interval checks and one for the sum (which I think is unavoidable). I was thinking of something like broadcasting a function over intervals' rows.
Here's a minimal example:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
n_samples = 2000
n_features = 10
rng = np.random.RandomState(0)
X = rng.normal(size=(n_samples, n_features))
w = rng.normal(size=n_features)
# simple linear function without noise
y = np.dot(X, w)
gbrt = GradientBoostingRegressor(loss='quantile', alpha=0.95)
gbrt.fit(X, y)
# Get upper interval
upper_interval = gbrt.predict(X)
# Get lower interval
gbrt.set_params(alpha=0.05)
gbrt.fit(X, y)
lower_interval = gbrt.predict(X)
intervals = np.concatenate((lower_interval[:, np.newaxis], upper_interval[:, np.newaxis]), axis=1)
# This is 4 passes:
perc_correct_intervals = ((y >= intervals[:, 0]) & (y <= intervals[:, 1])).sum() / y.shape[0]
There are some savings with np.count_nonzero vs .sum(), and more if you don't actually need to build the intervals matrix for other uses (note that np.less is a strict inequality, unlike the >= / <= above):
%%timeit
intervals = np.concatenate((lower_interval[:, np.newaxis], upper_interval[:, np.newaxis]), axis=1)
perc_correct_intervals = ((y >= intervals[:, 0]) & (y <= intervals[:, 1])).sum() / y.shape[0]
15.7 µs ± 78.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.count_nonzero(np.less(lower_interval, y)*np.less(y, upper_interval))/y.size
3.93 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
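If you want a genuinely single-pass version, a compiled loop can fuse both comparisons and the count, so no boolean intermediates are materialised at all. A minimal Numba sketch (the function name is illustrative):

import numba as nb

@nb.njit(parallel=True)
def frac_in_interval(lower, upper, values):
    # one pass: compare and count in the same loop, no temporary arrays
    count = 0
    for i in nb.prange(values.shape[0]):
        if values[i] >= lower[i] and values[i] <= upper[i]:
            count += 1
    return count / values.shape[0]

Called as frac_in_interval(lower_interval, upper_interval, y), this keeps the >= / <= semantics of the original expression.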
What I am doing now is:
import numpy as np
eps = np.finfo(float).eps
def sindiv(x):
    x = np.abs(x)
    return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
But that involves quite a lot of additional array operations. Is there a better way?
You could use numpy.sinc, which computes sin(pi x)/(pi x):
In [20]: x = 2.4
In [21]: np.sin(x)/x
Out[21]: 0.28144299189631289
In [22]: x_over_pi = x / np.pi
In [23]: np.sinc(x_over_pi)
Out[23]: 0.28144299189631289
In [24]: np.sinc(0)
Out[24]: 1.0
In numpy array notation (so you get back a np array):
def sindiv(x):
    return np.where(np.abs(x) < 0.01, 1.0 - x*x/6.0, np.sin(x)/x)
Here I've made "epsilon" fairly large for testing and used the first two terms of the Taylor series for the approximation. In practice, I'd change 0.01 to some small multiple of your eps (machine epsilon).
xx = np.arange(-0.1, 0.1, 0.001)
yy = sindiv(xx)
type(yy)
outputs numpy.ndarray, and the values are continuous (and differentiable, if that's important) near the origin.
If you don't want the double evaluation (i.e. both branches are evaluated in the above), then I think you have to go with a loop, as I don't believe there is any sort of "lazy where" option.
def sindiv(x):
    sox = np.zeros(x.size)
    for i in range(x.size):
        xv = x[i]
        if np.abs(xv) < 0.001:  # for testing; use a small multiple of machine epsilon
            sox[i] = 1.0 - xv * xv / 6.0
        else:
            sox[i] = np.sin(xv) / xv
    return sox
To make this really Pythonic, though, it would be best to check the type of x and just do the non-array version if it is not an array, as sketched below.
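A minimal sketch of that dispatch, reusing the illustrative 0.001 threshold from above:

import numpy as np

def sindiv(x):
    # scalars: plain branch, no array machinery
    if np.isscalar(x):
        return 1.0 - x*x/6.0 if abs(x) < 0.001 else np.sin(x)/x
    # arrays: vectorised two-branch version; silence the harmless 0/0
    # warning produced by the eagerly evaluated np.sin(x)/x branch
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(np.abs(x) < 0.001, 1.0 - x*x/6.0, np.sin(x)/x)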
As others have said, numpy.sinc() is the easiest.
I want to include a copy of its current implementation in NumPy 1.21.2 (link) to show there are no special tricks:
y = pi * where(x == 0, 1.0e-20, x)
return sin(y)/y
It's basically just sin(x)/x. Note that in creating y, the multiplication by pi, the where(), and the x == 0 comparison will create at least 2 intermediate arrays plus the final array for y, and sin(y)/y then creates two more. In total at least 5 arrays are created by numpy.sinc(); by my count your sindiv() also creates at least 5 arrays, so it's not actually that wasteful.
Here is another implementation:
TINY = np.finfo(float).tiny  # ≈ 2e-308 (smallest 'normal' float)

def mysinc(x):
    y = np.abs(np.pi*x) + TINY
    return np.sin(y)/y
I'm pretty sure this returns values identical to numpy.sinc(). The reason is that sin(x) == x holds in floating point even for relatively 'large' values of x:
x = np.ldexp(1, -26, dtype=np.double) # x = 2**-26 ≈ 1.5e-8
print(np.sin(x) == x) # True
x = np.ldexp(1, -32, dtype=np.longdouble) # x = 2**-32 ≈ 2.3e-10
print(np.sin(x) == x) # True
So for small enough x (ignoring the pi factors), mysinc(x) evaluates (x + TINY)/(x + TINY) = 1, which is exactly what np.sinc(x) returns there. The exact threshold at which this happens does not matter too much, so long as TINY < np.spacing(x) once x is large enough, so that x + TINY == x from then on.
(The cutoff is around the square-root of the machine epsilon as can be understood from the Taylor series sin(x) = x - x**3/6 + ... = x(1-x**2/6) + .... So TINY is always small enough to not matter.)
Timings
import numpy as np
eps = np.finfo(float).eps
tiny = np.finfo(float).tiny
def npsinc(x):
    y = np.pi * np.where(x == 0, 1.0e-20, x)
    return np.sin(y)/y

def sindiv(x):
    x = np.pi * np.abs(x)
    return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)

def mysinc(x):
    y = np.abs(np.pi*x) + tiny
    return np.sin(y)/y

def mysinc2(x):
    y = np.abs(np.pi*x)
    y += tiny  # in-place addition
    return np.sin(y)/y
# Test data
x = np.random.rand(100)
x[np.random.randint(100, size=10)] = 0
%timeit npsinc(x)
# 10.9 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit sindiv(x)
# 9.4 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc(x)
# 7.38 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc2(x)
# 8.64 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Curiously using mysinc2() with in-place addition seems to be slower, and using in-place numpy.abs() and in-place numpy.sin() is even slower. Not entirely sure why, but see this related question.
Regardless, if you really need performance, you can try using Cython to generate C code and do things properly instead of playing tricks with NumPy:
%%cython
from libc.math cimport M_PI, sin
cimport cython
cimport numpy as np
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef _cysinc(double[:] x, double[:] out):
    cdef size_t i
    for i in range(x.shape[0]):
        if x[i] == 0:
            out[i] = 1
        else:
            out[i] = sin(M_PI*x[i])/(M_PI*x[i])

def cysinc(np.ndarray x):
    out = np.empty_like(x)
    _cysinc(x.ravel(), out.ravel())
    return out
%timeit cysinc(x)
# 4.38 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As always, don't prematurely optimize, just use numpy.sinc() to begin with.
Side note
There's a question Is boost::math::sinc_pi unnecessarily complicated? that asks about the benefits of using a Taylor expansion about x=0. In summary, almost none, but maybe they are doing it for other reasons.
To emphasise, there is nothing unstable about floating point division, or dividing a small number by a small number since you're just dividing the significands and subtracting the exponents.
If you calculate sinc(x) as sin(x)/x, instead of via a direct Taylor series or another method that sums to convergence beyond the machine epsilon np.spacing(sinc(x)), you will be off by at most np.spacing(sinc(x)), coming from the round-off error in the division /, just as you'd get with multiplication *. (Assuming no subnormal business, which even here does not matter in the treatment of sin(x)/x.)
What about allowing division by zero and replacing the NaNs afterwards?
import numpy as np
def sindiv(x):
    a = np.sin(x)/x
    a = np.nan_to_num(a, nan=1.0)  # 0/0 yields NaN; the limit at x = 0 is 1
    return a
If you don't want warnings, suppress them via np.seterr.
Of course, the intermediate variable a could be eliminated:

def sindiv(x):
    return np.nan_to_num(np.sin(x)/x, nan=1.0)
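If you would rather silence the warnings only around this computation, np.errstate is the context-manager counterpart of np.seterr; a sketch:

import numpy as np

def sindiv(x):
    # suppress the 0/0 warnings locally instead of globally
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.nan_to_num(np.sin(x)/x, nan=1.0)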