To simplify, I have the test code below:
from scipy.sparse import csr_matrix, dok_array, issparse
import numpy as np
from tqdm import tqdm

X = np.load('dense.npy')
# convert it to csr sparse matrix
#X = csr_matrix(X)
print(repr(X))
n = X.shape[0]
with tqdm(total=n*(n-1)//2) as pbar:
    cooccur = dok_array((n, n), dtype='float32')
    for i in range(n):
        for j in range(i+1, n):
            u, v = X[i], X[j]
            if issparse(u):
                u = u.toarray()[0]
                v = v.toarray()[0]
            #import pdb; pdb.set_trace()
            m = u - v
            min_uv = u - np.maximum(m, 0)
            val = np.sum(min_uv - np.abs(m) * min_uv)
            pbar.update()
Case 1: Run as it is - the run time is 2min 54sec.
Case 2: Uncomment the line X = csr_matrix(X) (just for the sake of comparison) - the run time is 1min 56sec.
It is so weird and I can't figure out why it is even slower to operate on the dense array. I subsampled the array for this test; for the original array, the run time difference between sparse and dense is big (due to the large number of iterations).
I put the code into a function and used line_profiler to see the time usage. My findings are: 1. slicing is indeed much slower for the sparse matrix; 2. the three computation lines (m, min_uv, val) are much faster in Case 2; 3. the total run time is smaller for Case 2 even though it takes extra time for slicing and converting to dense vectors.
I am so confused why these three lines cost different run times in Case 1 and Case 2 - they operate on exactly the same numpy vectors in both cases. Any explanations?
The dense.npy file is uploaded here so the observation can be reproduced.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import issparse

n = 1_000
sparsity = 0.98
A = np.random.rand(n, n)
A[A < sparsity] = 0
As = csr_matrix(A)

def _test(X):
    n = X.shape[0]
    for i in range(n):
        for j in range(i+1, n):
            u, v = X[i], X[j]
            if issparse(u):
                u = u.toarray()[0]
                v = v.toarray()[0]
            m = u - v
            min_uv = u - np.maximum(m, 0)
            val = np.sum(min_uv - np.abs(m) * min_uv)
Running this on dense:
%timeit _test(A)
5.3 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Running this on sparse:
%timeit _test(As)
1min 10s ± 1.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
This makes sense as you're not actually using a sparse data structure for anything - you're just expensively and inefficiently converting it back to a dense data structure every time your inner loop iterates.
I don't know how you got the runtimes you report; an order-of-magnitude difference between dense and sparse is exactly what I would expect for the code you have provided.
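If the data genuinely starts out sparse, a minimal fix (my sketch, reusing the A, As and _test definitions above) is to densify once, outside the loop, rather than once per row:

# Convert back to dense a single time instead of calling .toarray()
# n*(n-1)/2 times inside the inner loop (assumes the matrix fits in memory).
Ad = As.toarray()
_test(Ad)  # now each X[i] is a cheap dense row view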
In scipy.spatial there is the Delaunay function. The documentation includes an example of how to calculate barycentric coordinates.
Following that example, the following code will calculate barycentric coordinates using a loop.
import numpy as np
from scipy.spatial import Delaunay

points = np.array([(0,0),(0,1),(1,0),(1,1)])
samples = np.array([(0.5,0.5),(0,0),(0.1,0.1)])
dim = len(points[0])               # determine the dimension of the samples
simp = Delaunay(points)            # create simplexes for the defined points
s = simp.find_simplex(samples)     # find the corresponding simplex for each sample
b0 = np.zeros((len(samples),dim))  # reserve space for each barycentric coordinate
for ii in range(len(samples)):
    b0[ii,:] = simp.transform[s[ii],:dim].dot((samples[ii] - simp.transform[s[ii],dim]).transpose())
coord = np.c_[b0, 1 - b0.sum(axis=1)]
This is ok for a short list of samples to convert to barycentric coordinates; however, for very large lists of samples the performance is poor. How can this be modified to take advantage of vectorized math in numpy/scipy to improve performance?
Consider the following modification (for-loop replaced with numpy methods):
import numpy as np
import scipy.spatial as ssp

def f_1(points, samples):
    """ original """
    dim = len(points[0])
    simp = ssp.Delaunay(points)
    s = simp.find_simplex(samples)
    b0 = np.zeros((len(samples), dim))
    for ii in range(len(samples)):
        b0[ii, :] = simp.transform[s[ii], :dim].dot(
            (samples[ii] - simp.transform[s[ii], dim]).transpose())
    coord = np.c_[b0, 1 - b0.sum(axis=1)]
    return coord

def f_2(points, samples):
    """ modified """
    simp = ssp.Delaunay(points)
    s = simp.find_simplex(samples)
    b0 = (simp.transform[s, :points.shape[1]].transpose([1, 0, 2]) *
          (samples - simp.transform[s, points.shape[1]])).sum(axis=2).T
    coord = np.c_[b0, 1 - b0.sum(axis=1)]
    return coord
Test case:
import itertools

N = 100
points = np.array(list(itertools.product(range(N), repeat=2)))
samples = np.random.rand(100_000, 2) * N
Result:
%timeit f_1(points, samples)
712 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(points, samples)
422 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
With the modified version, the line simp.find_simplex(samples) accounts for about 95% of the running time. So I guess there is nothing else you can do with vectorization. To improve performance further you would need another implementation of the find_simplex method or another approach to the problem.
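One quick way to confirm that (a measurement sketch of my own, reusing points and samples from the test case above):

import time

simp = ssp.Delaunay(points)
t0 = time.perf_counter()
s = simp.find_simplex(samples)
print(f'find_simplex: {time.perf_counter() - t0:.3f} s')  # compare against f_2's total time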
I have several thousand "observations". Each observation consists of a location (x,y) and a sensor reading (z); see the example below.
I would like to fit a bi-linear surface to the x,y, and z data. I am currently doing it with the code-snippet from amroamroamro/gist:
import numpy as np
import scipy.linalg

def bi2Dlinter(xdata, ydata, zdata, gridrez):
    X,Y = np.meshgrid(
        np.linspace(min(xdata), max(xdata), endpoint=True, num=gridrez),
        np.linspace(min(ydata), max(ydata), endpoint=True, num=gridrez))
    A = np.c_[xdata, ydata, np.ones(len(zdata))]
    C,_,_,_ = scipy.linalg.lstsq(A, zdata)
    Z = C[0]*X + C[1]*Y + C[2]
    return Z
My current approach is to cycle through the rows of the DataFrame. (This works great for 1000 observations but is not usable for larger data-sets.)
ZZ = []
for index, row in df2.iterrows():
    x = row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y = row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z = row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append(np.median(bi2Dlinter(x, y, z, gridrez)))
df2['ZZ'] = ZZ
I would be surprised if there is not a more efficient way to do this.
Is there a way to vectorize the linear interpolation?
I put the code here which also generates dummy entries.
Thanks
Looping over DataFrames like this is generally not recommended. Instead, you should try to vectorize your code as much as possible.
First we create arrays from your inputs:
x_vals = df2[['x1','x2','x3','x4','x5']].values
y_vals = df2[['y1','y2','y3','y4','y5']].values
z_vals = df2[['z1','z2','z3','z4','z5']].values
Next we need to create a bi2Dlinter function that handles vector inputs. This involves changing linspace/meshgrid to work on arrays and changing the least-squares function. Normally scipy.linalg functions work over an array, but as far as I'm aware the .lstsq method doesn't. Instead we can use the SVD to replicate the same functionality over an array.
def create_ranges(start, stop, N, endpoint=True):
    if endpoint==1:
        divisor = N-1
    else:
        divisor = N
    steps = (1.0/divisor) * (stop - start)
    return steps[:,None]*np.arange(N) + start[:,None]

def linspace_nd(x,y,gridrez):
    a1 = create_ranges(x.min(axis=1), x.max(axis=1), N=gridrez, endpoint=True)
    a2 = create_ranges(y.min(axis=1), y.max(axis=1), N=gridrez, endpoint=True)
    out_shp = a1.shape + (a2.shape[1],)
    Xout = np.broadcast_to(a1[:,None,:], out_shp)
    Yout = np.broadcast_to(a2[:,:,None], out_shp)
    return Xout, Yout

def stacked_lstsq(L, b, rcond=1e-10):
    """
    Solve L x = b, via SVD least squares cutting of small singular values
    L is an array of shape (..., M, N) and b of shape (..., M).
    Returns x of shape (..., N)
    """
    u, s, v = np.linalg.svd(L, full_matrices=False)
    s_max = s.max(axis=-1, keepdims=True)
    s_min = rcond*s_max
    inv_s = np.zeros_like(s)
    inv_s[s >= s_min] = 1/s[s >= s_min]
    x = np.einsum('...ji,...j->...i', v,
                  inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
    return np.conj(x, x)

def vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez):
    X,Y = linspace_nd(x_vals, y_vals, gridrez)
    A = np.stack((x_vals, y_vals, np.ones_like(z_vals)), axis=2)
    C = stacked_lstsq(A, z_vals)
    n_bcast = C.shape[0]
    return (C.T[0].reshape((n_bcast,1,1))*X +
            C.T[1].reshape((n_bcast,1,1))*Y +
            C.T[2].reshape((n_bcast,1,1)))
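As a quick sanity check (my own sketch, not part of the original answer, using random data), stacked_lstsq should agree with a per-system np.linalg.lstsq loop:

L = np.random.rand(4, 6, 3)   # 4 independent systems, each 6 equations in 3 unknowns
b = np.random.rand(4, 6)
x_stacked = stacked_lstsq(L, b)
x_loop = np.stack([np.linalg.lstsq(L[k], b[k], rcond=None)[0] for k in range(4)])
print(np.allclose(x_stacked, x_loop))  # expected: True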
Upon testing this on data with n=10000 rows, the vectorized function was significantly faster.
%%timeit
ZZ = []
for index, row in df2.iterrows():
    x = row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y = row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z = row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append((bi2Dlinter(x, y, z, gridrez)))
df2['ZZ'] = ZZ
Out: 5.52 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
res = vectorized_bi2Dlinter(x_vals,y_vals,z_vals,gridrez)
Out: 74.6 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You should pay careful attention to what's going on in this vectorized function and familiarize yourself with broadcasting in numpy. I cannot take credit for the first three functions; instead I will link the Stack Overflow answers they came from so you can get an understanding.
Vectorized NumPy linspace for multiple start and stop values
how to solve many overdetermined systems of linear equations using vectorized codes?
How to use numpy.c_ properly for arrays
Some friends and I are running a small language competition calculating some neural networks. Some are doing it in C, others in Fortran, and me: Python.
The code is simple: it is just a bunch of vector dot operations and a summation, after which a signal function is applied that returns -1 or 1 (activated or not).
With that we send a bunch of random numbers and check (right now single-process only) which language does it faster.
My code is as simple as this:
import numpy as np

def sgn(h):
    """Signal function"""
    return -1 if h < 0 else 1

def lincomb(A, B):
    """Linear combinator between two matrices"""
    return np.einsum('ji,ij->', A, B)

def lincombrav(A, B):
    return A.ravel().dot(B.ravel('F'))

def functional_test():
    w1 = np.random.random(50**2).reshape(50,50)
    w2 = np.random.random(50**2).reshape(50,50)
    return sgn(lincombrav(w1, w2))
Here A and B are matrices that represent each layer in a neural network. We dot the ith column of the first matrix with the ith row of the second matrix, sum all the results, and pass the sum to the signal function. Something like:
w1 = 2*np.random.random(100**2).reshape(100,100)-1
w2 = 2*np.random.random(100**2).reshape(100,100)-1
then we time it with
%timeit sgn(lincomb(w1, w2))
Python is losing to Fortran by 38x :-(
Is there any way to improve that Python "code"?
EDIT: Added timeit results:
Python version (already with the ravel mode)
In [10]: %timeit functional_test()
8.72 µs ± 406 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Python version (with einsum)
In [16]: %timeit functional_test()
10.27 µs ± 490 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Fortran version
In [13]: %timeit fort.test()
235 ns ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The Fortran version was created using the "f2py" program to generate a Python-loadable module from the Fortran code.
The test functions do the following (in each language):
Create the matrix A
Create the matrix B
call sgn(lincomb(A,B)) # from each respective language implementation
I also moved the matrix creation outside, to time only the mathematical operation rather than the memory handling as well. Still, Python is behind by the same order of magnitude.
EDIT2: Good Python news. Python has won in all but the small-matrix tests. The whole code follows:
Python functions (bla.py)
import numpy as np
from numba import jit
import timeit
import matplotlib.pyplot as plt

def sgn(h):
    """Signal function"""
    return -1 if h < 0 else 1

def lincomb(A, B):
    """Linear combinator between two matrices"""
    return np.einsum('ji,ij->', A, B)

def lincombrav(A, B):
    return A.ravel().dot(B.ravel('F'))

def functional_test_ravel(n):
    """Functional tests (Victor experiment)"""
    w = 2*np.random.random(n**2).reshape(n,n)-1
    x = 2*np.random.random(n**2).reshape(n,n)-1
    return sgn(lincombrav(w, x))

def functional_test_einsum(n):
    """Functional tests (Victor experiment)"""
    w = 2*np.random.random(n**2).reshape(n,n)-1
    x = 2*np.random.random(n**2).reshape(n,n)-1
    return sgn(lincomb(w, x))

@jit()
def functional_test_numbaein(n):
    """Functional tests (Victor experiment)"""
    w = 2*np.random.random(n**2).reshape(n,n)-1
    x = 2*np.random.random(n**2).reshape(n,n)-1
    return sgn(lincomb(w, x))

@jit()
def functional_test_numbarav(n):
    """Functional tests (Victor experiment)"""
    w = 2*np.random.random(n**2).reshape(n,n)-1
    x = 2*np.random.random(n**2).reshape(n,n)-1
    return sgn(lincombrav(w, x))
Fortran functions (fbla.f95)
module fbla
    implicit none
    integer, parameter::dp = selected_real_kind(12,100)
    public
contains
    real(kind=dp) function sgn(x)
        integer, parameter::dp = selected_real_kind(12,100)
        real(kind=dp), intent(in):: x
        if(x >= 0.0 ) then
            sgn = +1.0
        else if (x < 0.0) then
            sgn = -1.0
        end if
    end function sgn

    real(kind=dp) function lincomb(A, B, n)
        integer, parameter :: sp = selected_int_kind(r=8)
        integer, parameter :: dp = selected_real_kind(12,100)
        integer(kind=sp) :: i
        integer(kind=sp), intent(in):: n
        real(kind=DP), intent(in) :: A(n,n)
        real(kind=DP), intent(in) :: B(n,n)
        lincomb = 0
        do i=1,n
            lincomb = lincomb + dot_product(A(:,i),B(i,:))
        end do
    end function lincomb

    real(kind=dp) function functional_test(n)
        integer, parameter::dp = selected_real_kind(12,100)
        integer, parameter::sp = selected_int_kind(r=8)
        integer(kind=sp), intent(in):: n
        integer(kind=sp):: i, j
        real(kind=dp), allocatable, dimension(:,:):: x, w, wt
        ALLOCATE(wt(n,n),w(n,n),x(n,n))
        do i=1,n
            do j=1,n
                w(i,j) = 2*rand(0)-1
                x(i,j) = 2*rand(0)-1
            end do
        end do
        wt = transpose(w)
        functional_test = sgn(lincomb(wt, x, n))
    end function functional_test
end module fbla
Test execution functions (tests.py)
import numpy as np
import timeit
import matplotlib.pyplot as plt
import bla
from fbla import fbla

def run_test(test_functions, N, runs=1000):
    results = []
    global rank
    for n in N:
        rank = n
        for t in test_functions:
            # print(f'Rank {globals()["rank"]}')
            print(f'Running {t} to matrix size {rank}', end='')
            r = min(timeit.Timer(t, globals=globals()).repeat(repeat=5, number=runs))
            print(f' total time {r} per run {r/runs}')
            results.append((t, n, r, r/runs))
    return results

def plotbars(results, test_functions, N):
    Nsz = len(N)
    M = len(test_functions)
    fig, ax = plt.subplots()
    ind = np.arange(int(Nsz))
    width = 1/(M+1)
    p = []
    for n in range(M):
        g = [w*1000 for (x,y,z,w) in results if x==test_functions[n]]
        p.append(ax.bar(ind+n*width, g, width, bottom=0))
    ax.legend([l[0] for l in p], test_functions)
    ax.set_xticks(ind-width/2+((M/2) * width))
    ax.set_xticklabels(np.array(N).astype(str))
    ax.set_xlabel('Rank of square random matrix')
    ax.set_ylabel('Average time(ms) per run')
    ax.set_yscale('log')
    return fig

N = (10, 50, 100, 1000)
test_functions = [
    'bla.functional_test_einsum(rank)',
    'fbla.functional_test(rank)'
]
results = run_test(test_functions, N)
plot = plotbars(results, test_functions, N)
plot.show()
The results are:
[('bla.functional_test_einsum(rank)', 10, 0.023221354000270367, 2.3221354000270368e-05),
('fbla.functional_test(rank)', 10, 0.005375514010665938, 5.375514010665938e-06),
('bla.functional_test_einsum(rank)', 50, 0.07035048000398092, 7.035048000398091e-05),
('fbla.functional_test(rank)', 50, 0.1242617039824836, 0.0001242617039824836),
('bla.functional_test_einsum(rank)', 100, 0.22694124400732107, 0.00022694124400732108),
('fbla.functional_test(rank)', 100, 0.5518505079962779, 0.0005518505079962779),
('bla.functional_test_einsum(rank)', 1000, 37.88827919398318, 0.03788827919398318),
('fbla.functional_test(rank)', 1000, 74.09929457501858, 0.07409929457501857)]
Some standard timeit output from an ipython3 session; fbla is the Fortran library while bla is the standard Python library.
In : n=1000
In : w1 = 2*np.random.random(n**2).reshape(n,n)-1
In : w2 = 2*np.random.random(n**2).reshape(n,n)-1
In : bla.sgn(bla.lincomb(w1,w2))
Out: -1
In : fbla.sgn(fbla.lincomb(w1,w2))
Out: -1.0
In : %timeit fbla.sgn(fbla.lincomb(w1,w2))
11.3 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In : %timeit bla.sgn(bla.lincomb(w1,w2))
3.81 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can improve a bit with matrix-multiplication -
sgn(w1.ravel().dot(w2.ravel('F')))
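For the record, here is a small check (my addition, not part of the original answer) that this ravel-dot form, the original einsum, and the closed-form trace all compute the same scalar:

import numpy as np

A = np.random.rand(5, 5)
B = np.random.rand(5, 5)
r1 = np.einsum('ji,ij->', A, B)   # sum_i dot(A[:, i], B[i, :])
r2 = A.ravel().dot(B.ravel('F'))  # same contraction via flattened views
r3 = np.trace(A @ B)              # mathematically equivalent, but O(n^3)
print(np.allclose(r1, r2), np.allclose(r2, r3))  # True True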
If you want Numpy to be faster, get a faster Numpy. Try uninstalling Numpy and installing the Intel-optimized version. Intel's optimized Numpy includes a number of CPU-level optimizations that should significantly improve the performance of operations such as matrix multiplication on machines with an Intel CPU.
pip uninstall numpy
pip install intel-numpy
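Whichever build you end up with, you can check which BLAS/LAPACK libraries your NumPy is linked against (this check is my addition, not part of the original answer):

import numpy as np
np.show_config()  # prints the BLAS/LAPACK build configuration for this install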
I want to compute sin(x)/x in a way that safely handles x = 0 (where the limit is 1). What I am doing now is:
import numpy as np
eps = np.finfo(float).eps

def sindiv(x):
    x = np.abs(x)
    return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
But this involves quite a lot of additional array operations. Is there a better way?
You could use numpy.sinc, which computes sin(pi x)/(pi x):
In [20]: x = 2.4
In [21]: np.sin(x)/x
Out[21]: 0.28144299189631289
In [22]: x_over_pi = x / np.pi
In [23]: np.sinc(x_over_pi)
Out[23]: 0.28144299189631289
In [24]: np.sinc(0)
Out[24]: 1.0
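So a drop-in sindiv could be a thin wrapper (a sketch of the idea above; note the argument must be divided by pi):

import numpy as np

def sindiv(x):
    # np.sinc(t) computes sin(pi*t)/(pi*t), so pass x/pi to obtain sin(x)/x
    return np.sinc(np.asarray(x) / np.pi)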
In numpy array notation (so you get back an np array):

def sindiv(x):
    return np.where(np.abs(x) < 0.01, 1.0 - x*x/6.0, np.sin(x)/x)
Here I've made "epsilon" fairly large for testing and used the first two terms of the Taylor series as the approximation. In practice, I'd change 0.01 to some small multiple of your eps (machine epsilon).
xx = np.arange(-0.1, 0.1, 0.001)
yy = sindiv(xx)
type(yy)
outputs numpy.ndarray, and the values are continuous (and differentiable, if that's important) near the origin.
If you don't want the double evaluation (i.e. both branches are evaluated in the above), then I think you have to go with a loop as I don't believe there is any sort of "lazy where" option.
def sindiv(x):
    sox = np.zeros(x.size)
    for i in range(x.size):
        xv = x[i]
        if np.abs(xv) < 0.001:  # for testing; use a small multiple of machine epsilon
            sox[i] = 1.0 - xv * xv / 6.0
        else:
            sox[i] = np.sin(xv) / xv
    return sox
To make this really pythonic though it would be best to check the type of x and just do the non-array version if it is not an array.
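(One more option, my addition rather than part of the original answer: newer NumPy can avoid both the Python loop and the division-by-zero by using the out/where parameters of the ufuncs. A sketch, with a hypothetical name:)

import numpy as np

def sindiv_masked(x):
    x = np.asarray(x, dtype=float)
    out = 1.0 - x * x / 6.0          # Taylor fallback, kept where |x| is small
    mask = np.abs(x) >= 0.01
    # the division is only performed where mask is True; other entries
    # keep the Taylor value already stored in out
    np.divide(np.sin(x), x, out=out, where=mask)
    return out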
As others have said, numpy.sinc() is the easiest.
I want to include a copy of its current implementation in NumPy 1.21.2 (link) to show there are no special tricks:
y = pi * where(x == 0, 1.0e-20, x)
return sin(y)/y
It's basically just sin(x)/x. Note that in creating y: multiplication by pi, where(), and x == 0 will create at least 2 intermediate arrays plus the final array for y. And then sin(y)/y creates two more arrays. In total at least 5 arrays are created by numpy.sinc(); and by my count your sindiv() also creates at least 5 arrays, so it's not actually that wasteful.
Here is another implementation:
TINY = np.finfo(float).tiny  # ≈ 2e-308 (smallest 'normal' float)

def mysinc(x):
    y = np.abs(np.pi*x) + TINY
    return np.sin(y)/y
I'm pretty sure this returns identical values to numpy.sinc(). The reason being sin(x) == x for relatively 'large' values of x:
x = np.ldexp(1, -26, dtype=np.double) # x = 2**-26 ≈ 1.5e-8
print(np.sin(x) == x) # True
x = np.ldexp(1, -32, dtype=np.longdouble) # x = 2**-32 ≈ 2.3e-10
print(np.sin(x) == x) # True
So for small enough x (ignoring pi factors), mysinc(x) = sin(x+TINY)/(x+TINY) = (x+TINY)/(x+TINY) = 1, which is exactly np.sinc(x). The exact threshold at which this happens does not matter too much, so long as TINY < np.spacing(x) when it occurs, so that x + TINY == x in this regime.
(The cutoff is around the square root of the machine epsilon, as can be understood from the Taylor series sin(x) = x - x**3/6 + ... = x(1 - x**2/6) + .... So TINY is always small enough not to matter.)
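(A quick test of the "identical values" claim, my own check using the mysinc and TINY defined above:)

x = np.concatenate(([0.0], np.random.randn(100_000)))
print(np.array_equal(mysinc(x), np.sinc(x)))  # expected: True, per the claim above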
Timings
import numpy as np

eps = np.finfo(float).eps
tiny = np.finfo(float).tiny

def npsinc(x):
    y = np.pi * np.where(x == 0, 1.0e-20, x)
    return np.sin(y)/y

def sindiv(x):
    x = np.pi * np.abs(x)
    return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)

def mysinc(x):
    y = np.abs(np.pi*x) + tiny
    return np.sin(y)/y

def mysinc2(x):
    y = np.abs(np.pi*x)
    y += tiny  # in-place addition
    return np.sin(y)/y
# Test data
x = np.random.rand(100)
x[np.random.randint(100, size=10)] = 0
%timeit npsinc(x)
# 10.9 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit sindiv(x)
# 9.4 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc(x)
# 7.38 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc2(x)
# 8.64 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Curiously using mysinc2() with in-place addition seems to be slower, and using in-place numpy.abs() and in-place numpy.sin() is even slower. Not entirely sure why, but see this related question.
Regardless, if you really need performance, you can try using Cython to generate C code and do things properly instead of playing tricks with NumPy:
%%cython
from libc.math cimport M_PI, sin
cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef _cysinc(double[:] x, double[:] out):
    cdef size_t i
    for i in range(x.shape[0]):
        if x[i] == 0:
            out[i] = 1
        else:
            out[i] = sin(M_PI*x[i])/(M_PI*x[i])

def cysinc(np.ndarray x):
    out = np.empty_like(x)
    _cysinc(x.ravel(), out.ravel())
    return out
%timeit cysinc(x)
# 4.38 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As always, don't prematurely optimize, just use numpy.sinc() to begin with.
Side note
There's a question Is boost::math::sinc_pi unnecessarily complicated? that asks about the benefits of using a Taylor expansion about x=0. In summary, almost none, but maybe they are doing it for other reasons.
To emphasise, there is nothing unstable about floating-point division, or about dividing a small number by a small number, since you're just dividing the significands and subtracting the exponents.
If you calculate sinc(x) as sin(x)/x, instead of a direct Taylor series or other method that sums to convergence beyond the machine epsilon np.spacing(sinc(x)), you will be off by at most np.spacing(sinc(x)) coming from the round-off error in division /, just as you'd get with multiplication *. (Assuming no subnormal business, which even here does not matter in the treatment of sin(x)/x.)
What about allowing division by zero and replacing the NaNs afterwards?
import numpy as np

def sindiv(x):
    a = np.sin(x)/x
    a = np.nan_to_num(a, nan=1.0)  # sin(0)/0 gives NaN; map it to the limit value 1
    return a
If you don't want warnings, suppress them via np.seterr.
Of course, the temporary a could be eliminated:

def sindiv(x):
    return np.nan_to_num(np.sin(x)/x, nan=1.0)
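(A sketch of the warning suppression mentioned above, using the errstate context manager rather than a global seterr; this variant is my addition:)

import numpy as np

def sindiv(x):
    # silence the 0/0 warning locally instead of changing global state
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.nan_to_num(np.sin(x) / x, nan=1.0)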