I would like to compute an RBF or "Gaussian" kernel for a data matrix X with n rows and d columns. The resulting square kernel matrix is given by:
K[i,j] = var * exp(-gamma * ||X[i] - X[j]||^2)
var and gamma are scalars.
What is the fastest way to do this in python?
I am going to present four different methods for computing such a kernel, followed by a comparison of their run-time.
Using pure numpy
Here, I use the fact that ||x-y||^2 = ||x||^2 + ||y||^2 - 2 * x^T * y.
import numpy as np
X_norm = np.sum(X ** 2, axis = -1)
K = var * np.exp(-gamma * (X_norm[:,None] + X_norm[None,:] - 2 * np.dot(X, X.T)))
Using numexpr
numexpr is a python package that allows for efficient and parallelized array operations on numpy arrays. We can use it as follows to perform the same computation as above:
import numpy as np
import numexpr as ne
X_norm = np.sum(X ** 2, axis = -1)
K = ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
'A' : X_norm[:,None],
'B' : X_norm[None,:],
'C' : np.dot(X, X.T),
'g' : gamma,
'v' : var
})
Using scipy.spatial.distance.pdist
We could also use scipy.spatial.distance.pdist to compute a non-redundant array of pairwise squared euclidean distances, compute the kernel on that array and then transform it to a square matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform
K = squareform(var * np.exp(-gamma * pdist(X, 'sqeuclidean')))
K[np.arange(K.shape[0]), np.arange(K.shape[1])] = var
Using sklearn.metrics.pairwise.rbf_kernel
sklearn provides a built-in method for direct computation of an RBF kernel:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
K = var * rbf_kernel(X, gamma = gamma)
Run-time comparison
I use 25,000 random samples of 512 dimensions for testing and perform experiments on an Intel Core i7-7700HQ (4 cores # 2.8 GHz). More precisely:
X = np.random.randn(25000, 512)
gamma = 0.01
var = 5.0
Each method is run 7 times and the mean and standard deviation of the time per execution is reported.
| Method | Time |
|-------------------------------------|-------------------|
| numpy | 24.2 s ± 1.06 s |
| numexpr | 8.89 s ± 314 ms |
| scipy.spatial.distance.pdist | 2min 59s ± 312 ms |
| sklearn.metrics.pairwise.rbf_kernel | 13.9 s ± 757 ms |
First of all, scipy.spatial.distance.pdist is surprisingly slow.
numexpr is almost 3 times faster than the pure numpy method, but this speed-up factor will vary with the number of available CPUs.
sklearn.metrics.pairwise.rbf_kernel is not the fastest way, but only a bit slower than numexpr.
Well you are doing a lot of optimizations in your answer post. I would like to add few more (mostly tweaks). I would build upon the winner from the answer post, which seems to be numexpr based on.
Tweak #1
First off, np.sum(X ** 2, axis = -1) could be optimized with np.einsum. Though this part isn't the biggest overhead, but optimization of any sort won't hurt. So, that summation could be expressed as -
X_norm = np.einsum('ij,ij->i',X,X)
Tweak #2
Secondly, we could leverage Scipy supported blas functions and if allowed use single-precision dtype for noticeable performance improvement over its double precision one. Hence, np.dot(X, X.T) could be computed with SciPy's sgemm like so -
sgemm(alpha=1.0, a=X, b=X, trans_b=True)
Few more tweaks on rearranging the negative sign with gamma lets us feed more to sgemm. Also, we would push in gamma into the alpha term.
Tweaked implementations
Thus, with these two optimizations, we would have two more variants (if I could put it that way) of the numexpr method, listed below -
from scipy.linalg.blas import sgemm
def app1(X, gamma, var):
X_norm = -np.einsum('ij,ij->i',X,X)
return ne.evaluate('v * exp(g * (A + B + 2 * C))', {\
'A' : X_norm[:,None],\
'B' : X_norm[None,:],\
'C' : np.dot(X, X.T),\
'g' : gamma,\
'v' : var\
})
def app2(X, gamma, var):
X_norm = -gamma*np.einsum('ij,ij->i',X,X)
return ne.evaluate('v * exp(A + B + C)', {\
'A' : X_norm[:,None],\
'B' : X_norm[None,:],\
'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),\
'g' : gamma,\
'v' : var\
})
Runtime test
Numexpr based one from your answer post -
def app0(X, gamma, var):
X_norm = np.sum(X ** 2, axis = -1)
return ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
'A' : X_norm[:,None],
'B' : X_norm[None,:],
'C' : np.dot(X, X.T),
'g' : gamma,
'v' : var
})
Timings and verification -
In [165]: # Setup
...: X = np.random.randn(10000, 512)
...: gamma = 0.01
...: var = 5.0
In [166]: %timeit app0(X, gamma, var)
...: %timeit app1(X, gamma, var)
...: %timeit app2(X, gamma, var)
1 loop, best of 3: 1.25 s per loop
1 loop, best of 3: 1.24 s per loop
1 loop, best of 3: 973 ms per loop
In [167]: np.allclose(app0(X, gamma, var), app1(X, gamma, var))
Out[167]: True
In [168]: np.allclose(app0(X, gamma, var), app2(X, gamma, var))
Out[168]: True
In the case that you are evaluating X against a high number of gammas, it is useful to save the negative pairwise distances matrix using the tricks done by #Callidior and #Divakar.
from numpy import exp, matmul, power, einsum, dot
from scipy.linalg.blas import sgemm
from numexpr import evaluate
def pdist2(X):
X_norm = - einsum('ij,ij->i', X, X)
return evaluate('A + B + C', {
'A' : X_norm[:,None],
'B' : X_norm[None,:],
'C' : sgemm(alpha=2.0, a=X, b=X, trans_b=True),
})
pairwise_distance_matrix = pdist2(X)
Then, the best solution would be to use numexpr to compute the exponential.
def rbf_kernel2(gamma, p_matrix):
return evaluate('exp(g * m)', {
'm' : p_matrix,
'g' : gamma,
})
Example:
import numpy as np
np.random.seed(1001)
X= np.random.rand(1001, 5).astype('float32')
p_matrix_test = pdist2(X)
gamma_test_list = (10 ** np.linspace(-2, 1, 11)).astype('float32')
def app2(gamma, X):
X_norm = - gamma * einsum('ij,ij->i', X, X)
return evaluate('exp(A + B + C)', {\
'A' : X_norm[:, None],\
'B' : X_norm[None, :],\
'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),\
'g' : gamma,
})
I have the results:
%timeit y = [app2(gamma_test, x_test) for gamma_test in gamma_test_list]
70.8 ms ± 5.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = [rbf_kernel2(gamma_test, p_matrix_test) for gamma_test in gamma_test_list]
33.6 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that you need to add the overhead to compute the pairwise distance matrix before but it shouldn't be much if you are evaluating against a large number of gammas.
The NumPy solution with a given X, Y and gamma would be:
import numpy as np
def rbf_kernel(X, Y, gamma):
X_norm = np.sum(X ** 2, axis=-1)
Y_norm = np.sum(Y ** 2, axis=-1)
K = np.exp(-gamma * (X_norm[:, None] + Y_norm[None, :] - 2 * np.dot(X, Y.T)))
return K
Related
I am translating this code from Matlab to Python. The code function fine but it is painfully slow in python. In Matlab, the code runs in way less then a minute, in python it took 30 min!!! Someone with mode experience in python could help me?
# P({ai})
somai = 0
for i in range(1, n):
somaj = 0
for j in range(1, n):
exponencial = math.exp(-((a[i] - a[j]) * (a[i] - a[j])) / dev_a2 - ((b[i] - b[j]) * (b[i] - b[j])) / dev_b2)
somaj = somaj + exponencial
somai = somai + somaj
As with MATLAB, I'd recommend you vectorize your code. Iterating by for-loops can be much slower than the lower level implementation of MATLAB and numpy.
Your operations (a[i] - a[j])*(a[i] - a[j]) are pairwise squared-Euclidean distance for all N data points. You can calculate a pairwise distance matrix using scipy's pdist and squareform functions -- pdist, squareform.
Then you calculate the difference between pairwise distance matrices A and B, and sum the exponential decay. So you could get a vectorized code like:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
# Example data
N = 1000
a = np.random.rand(N,1)
b = np.random.rand(N,1)
dev_a2 = np.random.rand()
dev_b2 = np.random.rand()
# `a` is an [N,1] matrix (i.e. column vector)
A = pdist(a, 'sqeuclidean')
# Change to pairwise distance matrix
A = squareform(A)
# Divide all elements by same divisor
A = A / dev_a2
# Then do the same for `b`'s
# `b` is an [N,1] matrix (i.e. column vector)
B = pdist(b, 'sqeuclidean')
B = squareform(B)
B = B / dev_b2
# Calculate exponential decay
expo = np.exp(-(A-B))
# Sum all elements
total = np.sum(expo)
Here's a quick timing comparison between the iterative method and this vectorized code.
N: 1000 | Iter Output: 2729989.851117 | Vect Output: 2732194.924364
Iter time: 6.759 secs | Vect time: 0.031 secs
N: 5000 | Iter Output: 24855530.997400 | Vect Output: 24864471.007726
Iter time: 171.795 secs | Vect time: 0.784 secs
Note that the final results are not exactly the same. I'm not sure why this is, it might be rounding error or math error on my part, but I'll leave that to you.
TLDR
Use numpy
Why Numpy?
Python, by default, is slow. One of the powers of python is that it plays nicely with C and has tons of libraries. The one that will help you hear is numpy. Numpy is mostly implemented in C and, when used properly, is blazing fast. The trick is to phrase the code in such a way that you keep the execution inside numpy and outside of python proper.
Code and Results
import math
import numpy as np
n = 1000
np_a = np.random.rand(n)
a = list(np_a)
np_b = np.random.rand(n)
b = list(np_b)
dev_a2, dev_b2 = (1, 1)
def old():
somai = 0.0
for i in range(0, n):
somaj = 0.0
for j in range(0, n):
tmp_1 = -((a[i] - a[j]) * (a[i] - a[j])) / dev_a2
tmp_2 = -((b[i] - b[j]) * (b[i] - b[j])) / dev_b2
exponencial = math.exp(tmp_1 + tmp_2)
somaj += exponencial
somai += somaj
return somai
def new():
tmp_1 = -np.square(np.subtract.outer(np_a, np_a)) / dev_a2
tmp_2 = -np.square(np.subtract.outer(np_b, np_b)) / dev_a2
exponential = np.exp(tmp_1 + tmp_2)
somai = np.sum(exponential)
return somai
old = 1.76 s ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new = 24.6 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is about a 70x improvement
old yields 740919.6020840995
new yields 740919.602084099
Explanation
You'll notice I broke up your code with the tmp_1 and tmp_2 a bit for clarity.
np.random.rand(n): This creates an array of length n that has random floats going from 0 to 1 (excluding 1) (documented here).
np.subtract.outer(a, b): Numpy has modules for all the operators that allow you do various things with them. Lets say you had np_a = [1, 2, 3], np.subtract.outer(np_a, np_a) would yield
array([[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0]])
Here's a stackoverflow link if you want to go deeper on this. (also the word "outer" comes from "outer product" like from linear algebra)
np.square: simply squares every element in the matrix.
/: In numpy when you do arithmetic operators between scalars and matrices it does the appropriate thing and applies that operation to every element in the matrix.
np.exp: like np.square
np.sum: sums every element together and returns a scalar.
I got a 2-D dataset with two columns x and y. I would like to get the linear regression coefficients and interception dynamically when new data feed in. Using scikit-learn I could calculate all current available data like this:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
x = np.arange(100)
y = np.arange(100)+10*np.random.random_sample((100,))
regr.fit(x,y)
print(regr.coef_)
print(regr.intercept_)
However, I got quite big dataset (more than 10k rows in total) and I want to calculate coefficient and intercept as fast as possible whenever there's new rows coming in. Currently calculate 10k rows takes about 600 microseconds, and I want to accelerate this process.
Scikit-learn looks like does not have online update function for linear regression module. Is there any better ways to do this?
I've found solution from this paper: updating simple linear regression. The implementation is as below:
def lr(x_avg,y_avg,Sxy,Sx,n,new_x,new_y):
"""
x_avg: average of previous x, if no previous sample, set to 0
y_avg: average of previous y, if no previous sample, set to 0
Sxy: covariance of previous x and y, if no previous sample, set to 0
Sx: variance of previous x, if no previous sample, set to 0
n: number of previous samples
new_x: new incoming 1-D numpy array x
new_y: new incoming 1-D numpy array x
"""
new_n = n + len(new_x)
new_x_avg = (x_avg*n + np.sum(new_x))/new_n
new_y_avg = (y_avg*n + np.sum(new_y))/new_n
if n > 0:
x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
elif n == 0:
x_star = new_x_avg
y_star = new_y_avg
else:
raise ValueError
new_Sx = Sx + np.sum((new_x-x_star)**2)
new_Sxy = Sxy + np.sum((new_x-x_star).reshape(-1) * (new_y-y_star).reshape(-1))
beta = new_Sxy/new_Sx
alpha = new_y_avg - beta * new_x_avg
return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
Performance comparison:
Scikit learn version that calculate 10k samples altogether.
from sklearn.linear_model import LinearRegression
x = np.arange(10000).reshape(-1,1)
y = np.arange(10000)+100*np.random.random_sample((10000,))
regr = LinearRegression()
%timeit regr.fit(x,y)
# 419 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My version assume 9k sample is already calculated:
Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x.reshape(-1,1)[:9000], y[:9000])
new_x, new_y = x.reshape(-1,1)[9000:], y[9000:]
%timeit lr(new_x_avg, new_y_avg, Sxy,Sx,n,new_x, new_y)
# 38.7 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10 times faster, which is expected.
Nice! Thanks for sharing your findings :) Here is an equivalent implementation of this solution written with dot products:
class SimpleLinearRegressor(object):
def __init__(self):
self.dots = np.zeros(5)
self.intercept = None
self.slope = None
def update(self, x: np.ndarray, y: np.ndarray):
self.dots += np.array(
[
x.shape[0],
x.sum(),
y.sum(),
np.dot(x, x),
np.dot(x, y),
]
)
size, sum_x, sum_y, sum_xx, sum_xy = self.dots
det = size * sum_xx - sum_x ** 2
if det > 1e-10: # determinant may be zero initially
self.intercept = (sum_xx * sum_y - sum_xy * sum_x) / det
self.slope = (sum_xy * size - sum_x * sum_y) / det
When working with time series data, we can extend this idea to do sliding window regression with a soft (EMA-like) window.
You can use accelerated libraries that implement faster algorithms - particularly
https://github.com/intel/scikit-learn-intelex
For linear regression you would get much better performance
First install package
pip install scikit-learn-intelex
And then add in your python script
from sklearnex import patch_sklearn
patch_sklearn()
What I am doing now is:
import numpy as np
eps = np.finfo(float).eps
def sindiv(x):
x = np.abs(x)
return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
But there is quite a lot of additional array operation. Is there a better way?
You could use numpy.sinc, which computes sin(pi x)/(pi x):
In [20]: x = 2.4
In [21]: np.sin(x)/x
Out[21]: 0.28144299189631289
In [22]: x_over_pi = x / np.pi
In [23]: np.sinc(x_over_pi)
Out[23]: 0.28144299189631289
In [24]: np.sinc(0)
Out[24]: 1.0
In numpy array notation (so you get back a np array):
def sindiv(x):
return np.where(np.abs(x) < 0.01, 1.0 - x*x/6.0, np.sin(x)/x)
Here I've made "epsilon" fairly large for testing and used the first two terms of the taylor series for the approximation. In practice, I'd change 0.01 to some small multiple of your eps (machine epsilon).
xx = np.arange(-0.1, 0.1, 0.001)
yy = sinxdiv(xx)
type(yy)
outputs numpy.ndarray and the values are continuous (and differentiable, if that's important) near the origin.
If you don't want the double evaluation (i.e. both branches are evaluated in the above), then I think you have to go with a loop as I don't believe there is any sort of "lazy where" option.
def sindiv(x):
sox = np.zeros(x.size)
for i in xrange(x.size):
xv = x[i]
if np.abs(xv) < 0.001: # For testing, use a small multiple of machine epsilon
sox[i] = 1.0 - xv * xv / 6.0
else:
sox[i] = np.sin(xv) / xv
return sox
To make this really pythonic though it would be best to check the type of x and just do the non-array version if it is not an array.
As others have said, numpy.sinc() is the easiest.
I want to include a copy of its current implementation in NumPy 1.21.2 (link) to show there's no special tricks:
y = pi * where(x == 0, 1.0e-20, x)
return sin(y)/y
It's basically just sin(x)/x. Note that in creating y: multiplication by pi, where(), and x == 0 will create at least 2 intermediate arrays plus the final array for y. And then sin(y)/y creates two more arrays. In total at least 5 arrays are created by numpy.sinc(); and by my count your sindiv() also creates at least 5 arrays, so it's not actually that wasteful.
Here is another implementation:
TINY = np.finfo(float).tiny # ≈ 2e-308 (smallest 'normal' float)
def mysinc(x):
y = np.abs(np.pi*x) + TINY
return np.sin(y)/y
I'm pretty sure this returns identical values to numpy.sinc(). The reason being sin(x) == x for relatively 'large' values of x:
x = np.ldexp(1, -26, dtype=np.double) # x = 2**-26 ≈ 1.5e-8
print(np.sin(x) == x) # True
x = np.ldexp(1, -32, dtype=np.longdouble) # x = 2**-32 ≈ 2.3e-10
print(np.sin(x) == x) # True
So for small enough x (ignore pi factors), mysinc(x) = (x+TINY)/(x+TINY) = x/x = np.sinc(x). The exact threshold this happens does not matter too much so long as TINY < np.spacing(x) when it occurs so that x + TINY = x in this regime.
(The cutoff is around the square-root of the machine epsilon as can be understood from the Taylor series sin(x) = x - x**3/6 + ... = x(1-x**2/6) + .... So TINY is always small enough to not matter.)
Timings
import numpy as np
eps = np.finfo(float).eps
tiny = np.finfo(float).tiny
def npsinc(x):
y = np.pi * np.where(x == 0, 1.0e-20, x)
return np.sin(y)/y
def sindiv(x):
x = np.pi * np.abs(x)
return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
def mysinc(x):
y = np.abs(np.pi*x) + tiny
return np.sin(y)/y
def mysinc2(x):
y = np.abs(np.pi*x)
y += tiny # in-place addition
return np.sin(y)/y
# Test data
x = np.random.rand(100)
x[np.random.randint(100, size=10)] = 0
%timeit npsinc(x)
# 10.9 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit sindiv(x)
# 9.4 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc(x)
# 7.38 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc2(x)
# 8.64 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Curiously using mysinc2() with in-place addition seems to be slower, and using in-place numpy.abs() and in-place numpy.sin() is even slower. Not entirely sure why, but see this related question.
Regardless, if you really need performance, you can try using Cython to generate C code and do things properly instead of playing tricks with NumPy:
%%cython
from libc.math cimport M_PI, sin
cimport cython
cimport numpy as np
import numpy as np
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
cdef _cysinc(double[:] x, double[:] out):
cdef size_t i
for i in range(x.shape[0]):
if x[i] == 0:
out[i] = 1
else:
out[i] = sin(M_PI*x[i])/(M_PI*x[i])
def cysinc(np.ndarray x):
out = np.empty_like(x)
_cysinc(x.ravel(), out.ravel())
return out
%timeit cysinc(x)
# 4.38 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As always, don't prematurely optimize, just use numpy.sinc() to begin with.
Side note
There's a question Is boost::math::sinc_pi unnecessarily complicated? that asks about the benefits of using a Taylor expansion about x=0. In summary, almost none, but maybe they are doing it for other reasons.
To emphasise, there is nothing unstable about floating point division, or dividing a small number by a small number since you're just dividing the significands and subtracting the exponents.
If you calculate sinc(x) as sin(x)/x, instead of a direct Taylor series or other method that sums to convergence beyond the machine epsilon np.spacing(sinc(x)), you will be off by at most np.spacing(sinc(x)) coming from the round-off error in division /, just as you'd get with multiplication *. (Assuming no subnormal business, which even here does not matter in the treatment of sin(x)/x.)
What about allowing div by zero and replace NaNs later?
import numpy as np
def sindiv(x):
a = np.sin(x)/x
a = np.nan_to_num(a)
return a
If you don't want warnings, supress them via seterr
Of course, using a could be eliminated:
def sindiv(x):
return np.nan_to_num(np.sin(x)/x)
I have a large sparse matrix - using sparse.csr_matrix from scipy. The values are binary. For each row, I need to compute the Jaccard distance to every row in the same matrix. What's the most efficient way to do this? Even for a 10.000 x 10.000 matrix, my runtime takes minutes to finish.
Current solution:
def jaccard(a, b):
intersection = float(len(set(a) & set(b)))
union = float(len(set(a) | set(b)))
return 1.0 - (intersection/union)
def regions(csr, p, epsilon):
neighbors = []
for index in range(len(csr.indptr)-1):
if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon:
neighbors.append(index)
return neighbors
csr = scipy.sparse.csr_matrix("file")
regions(csr, 0.51) #this is called for every row
Vectorization is relatively easy if you use matrix multiplication to calculate the set intersections and then the rule |union(a, b)| == |a| + |b| - |intersection(a, b)| to determine the unions:
# Not actually necessary for sparse matrices, but it is for
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division
def pairwise_jaccard(X):
"""Computes the Jaccard distance between the rows of `X`.
"""
X = X.astype(bool).astype(int)
intrsct = X.dot(X.T)
row_sums = intrsct.diagonal()
unions = row_sums[:,None] + row_sums - intrsct
dist = 1.0 - intrsct / unions
return dist
Note the cast to bool and then int, because the dtype of X must be large enough to accumulate twice the maximum row sum and that entries of X must be either zero or one. The downside of this code is that it's heavy on RAM, because unions and dists are dense matrices.
If you're only interested in distances smaller than some cut-off epsilon, the code can be tuned for sparse matrices:
from scipy.sparse import csr_matrix
def pairwise_jaccard_sparse(csr, epsilon):
"""Computes the Jaccard distance between the rows of `csr`,
smaller than the cut-off distance `epsilon`.
"""
assert(0 < epsilon < 1)
csr = csr_matrix(csr).astype(bool).astype(int)
csr_rownnz = csr.getnnz(axis=1)
intrsct = csr.dot(csr.T)
nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
dists = 1.0 - intrsct.data / unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
indptr = np.r_[0, np.cumsum(rownnz)]
out = csr_matrix((data, indices, indptr), intrsct.shape)
return out
If this still takes to much RAM you could try to vectorize over one dimension and Python-loop over the other.
To add to the accepted answer: I had use for a weighted version of the above method which is simply implemented as:
def pairwise_jaccard_sparse_weighted(csr, epsilon, weight):
csr = scipy.sparse.csr_matrix(csr).astype(bool).astype(int)
csr_w = csr * scipy.sparse.diags(weight)
csr_rowsum = numpy.array(csr_w.sum(axis = 1)).flatten()
intrsct = csr.dot(csr_w.T)
rowsum_i = numpy.repeat(csr_rowsum, intrsct.getnnz(axis = 1))
unions = rowsum_i + csr_rowsum[intrsct.indices] - intrsct.data
dists = 1.0 - 1.0 * intrsct.data / unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = numpy.add.reduceat(mask, intrsct.indptr[:-1])
indptr = numpy.r_[0, numpy.cumsum(rownnz)]
out = scipy.sparse.csr_matrix((data, indices, indptr), intrsct.shape)
return out
I doubt this is the most efficient implementation, but it's a damn sight quicker than the dense implementation in scipy.spatial.distance.jaccard.
Here a solution that has a scikit-learn-like API.
def pairwise_sparse_jaccard_distance(X, Y=None):
"""
Computes the Jaccard distance between two sparse matrices or between all pairs in
one sparse matrix.
Args:
X (scipy.sparse.csr_matrix): A sparse matrix.
Y (scipy.sparse.csr_matrix, optional): A sparse matrix.
Returns:
numpy.ndarray: A similarity matrix.
"""
if Y is None:
Y = X
assert X.shape[1] == Y.shape[1]
X = X.astype(bool).astype(int)
Y = Y.astype(bool).astype(int)
intersect = X.dot(Y.T)
x_sum = X.sum(axis=1).A1
y_sum = Y.sum(axis=1).A1
xx, yy = np.meshgrid(x_sum, y_sum)
union = ((xx + yy).T - intersect)
return (1 - intersect / union).A
Here some testing and benchmarking:
>>> import timeit
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn.metrics import pairwise_distances
>>> X = csr_matrix(np.random.choice(a=[False, True], size=(10000, 1000), p=[0.9, 0.1]))
>>> Y = csr_matrix(np.random.choice(a=[False, True], size=(1000, 1000), p=[0.9, 0.1]))
Asserting that all results are approximately equivalent
>>> custom_jaccard_distance = pairwise_sparse_jaccard_distance(X, Y)
>>> sklearn_jaccard_distance = pairwise_distances(X.todense(), Y.todense(), "jaccard")
>>> np.allclose(custom_jaccard_distance, sklearn_jaccard_distance)
True
Benchmarking runtime (from Jupyer Notebook)
>>> %timeit pairwise_jaccard_index(X, Y)
795 ms ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit 1 - pairwise_distances(X.todense(), Y.todense(), "jaccard")
14.7 s ± 694 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).
This much works, but I also want to calculate r (coefficient of correlation) and r-squared(coefficient of determination). I am comparing my results with Excel's best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.
Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?
Here's my function:
import numpy
# Polynomial Regression
def polyfit(x, y, degree):
results = {}
coeffs = numpy.polyfit(x, y, degree)
# Polynomial Coefficients
results['polynomial'] = coeffs.tolist()
correlation = numpy.corrcoef(x, y)[0,1]
# r
results['correlation'] = correlation
# r-squared
results['determination'] = correlation**2
return results
A very late reply, but just in case someone needs a ready function for this:
scipy.stats.linregress
i.e.
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
as in #Adam Marples's answer.
From yanl (yet-another-library) sklearn.metrics has an r2_score function;
from sklearn.metrics import r2_score
coefficient_of_dermination = r2_score(y, p(x))
From the numpy.polyfit documentation, it is fitting linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function
E(y|x) = p_d * x**d + p_{d-1} * x **(d-1) + ... + p_1 * x + p_0
So you just need to calculate the R-squared for that fit. The wikipedia page on linear regression gives full details. You are interested in R^2 which you can calculate in a couple of ways, the easisest probably being
SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST
Where I use 'y_bar' for the mean of the y's, and 'y_ihat' to be the fit value for each point.
I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct
import numpy
# Polynomial Regression
def polyfit(x, y, degree):
results = {}
coeffs = numpy.polyfit(x, y, degree)
# Polynomial Coefficients
results['polynomial'] = coeffs.tolist()
# r-squared
p = numpy.poly1d(coeffs)
# fit values, and mean
yhat = p(x) # or [p(z) for z in x]
ybar = numpy.sum(y)/len(y) # or sum(y)/len(y)
ssreg = numpy.sum((yhat-ybar)**2) # or sum([ (yihat - ybar)**2 for yihat in yhat])
sstot = numpy.sum((y - ybar)**2) # or sum([ (yi - ybar)**2 for yi in y])
results['determination'] = ssreg / sstot
return results
I have been using this successfully, where x and y are array-like.
Note: for linear regression only
def rsquared(x, y):
""" Return R^2 where x and y are array-like."""
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
return r_value**2
I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.
statsmodels has the capability to calculate the r^2 of a polynomial fit directly, here are 2 methods...
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Construct the columns for the different powers of x
def get_r2_statsmodels(x, y, k=1):
xpoly = np.column_stack([x**i for i in range(k+1)])
return sm.OLS(y, xpoly).fit().rsquared
# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula(x, y, k=1):
formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
data = {'x': x, 'y': y}
return smf.ols(formula, data).fit().rsquared # or rsquared_adj
To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.
model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()
Below is my original Answer where I benchmarked various linear regression r^2 methods...
The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.
def get_r2_numpy_corrcoef(x, y):
return np.corrcoef(x, y)[0, 1]**2
These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:
Pure Python (direct r calculation)
1000 loops, best of 3: 1.59 ms per loop
Numpy polyfit (applicable to n-th degree polynomial fits)
1000 loops, best of 3: 326 µs per loop
Numpy Manual (direct r calculation)
10000 loops, best of 3: 62.1 µs per loop
Numpy corrcoef (direct r calculation)
10000 loops, best of 3: 56.6 µs per loop
Scipy (linear regression with r as an output)
1000 loops, best of 3: 676 µs per loop
Statsmodels (can do n-th degree polynomial and many other fits)
1000 loops, best of 3: 422 µs per loop
The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than the scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.
Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.
import numpy as np
from scipy import stats
import statsmodels.api as sm
import math
n=1000
x = np.random.rand(1000)*10
x.sort()
y = 10 * x + (5+np.random.randn(1000)*10-5)
x_list = list(x)
y_list = list(y)
def get_r2_numpy(x, y):
slope, intercept = np.polyfit(x, y, 1)
r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
return r_squared
def get_r2_scipy(x, y):
_, _, r_value, _, _ = stats.linregress(x, y)
return r_value**2
def get_r2_statsmodels(x, y):
return sm.OLS(y, sm.add_constant(x)).fit().rsquared
def get_r2_python(x_list, y_list):
n = len(x_list)
x_bar = sum(x_list)/n
y_bar = sum(y_list)/n
x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
zx = [(xi-x_bar)/x_std for xi in x_list]
zy = [(yi-y_bar)/y_std for yi in y_list]
r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
return r**2
def get_r2_numpy_manual(x, y):
zx = (x-np.mean(x))/np.std(x, ddof=1)
zy = (y-np.mean(y))/np.std(y, ddof=1)
r = np.sum(zx*zy)/(len(x)-1)
return r**2
def get_r2_numpy_corrcoef(x, y):
return np.corrcoef(x, y)[0, 1]**2
print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)
7/28/21 Benchmark results. (Python 3.7, numpy 1.19, scipy 1.6, statsmodels 0.12)
Python
2.41 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy polyfit
318 µs ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy Manual
79.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy corrcoef
83.8 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Scipy
221 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Statsmodels
375 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):
from __future__ import division
import numpy as np
def compute_r2_weighted(y_true, y_pred, weight):
sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
tse = (weight * (y_true - np.average(
y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
r2_score = 1 - (sse / tse)
return r2_score, sse, tse
Example:
from __future__ import print_function, division
import sklearn.metrics
def compute_r2_weighted(y_true, y_pred, weight):
sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
tse = (weight * (y_true - np.average(
y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
r2_score = 1 - (sse / tse)
return r2_score, sse, tse
def compute_r2(y_true, y_predicted):
sse = sum((y_true - y_predicted)**2)
tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
r2_score = 1 - (sse / tse)
return r2_score, sse, tse
def main():
'''
Demonstrate the use of compute_r2_weighted() and checks the results against sklearn
'''
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
weight = [1, 5, 1, 2]
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))
r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
print('r2_score: {0}'.format(r2_score))
r2_score = sklearn.metrics.r2_score(y_true, y_pred,weight)
print('r2_score weighted: {0}'.format(r2_score))
r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
print('r2_score weighted: {0}'.format(r2_score))
if __name__ == "__main__":
main()
#cProfile.run('main()') # if you want to do some profiling
outputs:
r2_score: 0.9486081370449679
r2_score: 0.9486081370449679
r2_score weighted: 0.9573170731707317
r2_score weighted: 0.9573170731707317
This corresponds to the formula (mirror):
with f_i is the predicted value from the fit, y_{av} is the mean of the observed data y_i is the observed data value. w_i is the weighting applied to each data point, usually w_i=1. SSE is the sum of squares due to error and SST is the total sum of squares.
If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f (mirror)
Here's a very simple python function to compute R^2 from the actual and predicted values assuming y and y_hat are pandas series:
def r_squared(y, y_hat):
y_bar = y.mean()
ss_tot = ((y-y_bar)**2).sum()
ss_res = ((y-y_hat)**2).sum()
return 1 - (ss_res/ss_tot)
R-squared is a statistic that only applies to linear regression.
Essentially, it measures how much variation in your data can be explained by the linear regression.
So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean. . .
where y_bar is the mean of the y's.
Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean
and find the ratio of those two.
Now, all you would have to do for a polynomial fit is plug in the y_hat's from that model, but it's not accurate to call that r-squared.
Here is a link I found that speaks to it a little.
The wikipedia article on r-squareds suggests that it may be used for general model fitting rather than just linear regression.
Using the numpy module (tested in python3):
import numpy as np
def linear_regression(x, y):
coefs = np.polynomial.polynomial.polyfit(x, y, 1)
ffit = np.poly1d(coefs)
m = ffit[0]
b = ffit[1]
eq = 'y = {}x + {}'.format(round(m, 3), round(b, 3))
rsquared = np.corrcoef(x, y)[0, 1]**2
return rsquared, eq, m, b
rsquared, eq, m, b = linear_regression(x,y)
print(rsquared, m, b)
print(eq)
Output:
0.013378252355751777 0.1316331351105754 0.7928782850418713
y = 0.132x + 0.793
Note: r² ≠ R²
r² is called the "Coefficient of Determination"
R² is the square of the Pearson Coefficient
R², officially conflated as r², is probably the one you want, as it's a least-square fit, which is better than the simple fraction of sums that r² is. Numpy is not afraid to call it "corrcoef", which presupposes Pearson is the de-facto correlation coefficient.
You can execute this code directly, this will find you the polynomial, and will find you the R-value you can put a comment down below if you need more explanation.
from scipy.stats import linregress
import numpy as np
x = np.array([1,2,3,4,5,6])
y = np.array([2,3,5,6,7,8])
p3 = np.polyfit(x,y,3) # 3rd degree polynomial, you can change it to any degree you want
xp = np.linspace(1,6,6) # 6 means the length of the line
poly_arr = np.polyval(p3,xp)
poly_list = [round(num, 3) for num in list(poly_arr)]
slope, intercept, r_value, p_value, std_err = linregress(x, poly_list)
print(r_value**2)
From scipy.stats.linregress source. They use the average sum of squares method.
import numpy as np
x = np.array(x)
y = np.array(y)
# average sum of squares:
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
r_num = ssxym
r_den = np.sqrt(ssxm * ssym)
r = r_num / r_den
if r_den == 0.0:
r = 0.0
else:
r = r_num / r_den
if r > 1.0:
r = 1.0
elif r < -1.0:
r = -1.0