Some friends and I are running a small language competition around evaluating neural networks: some are writing C, others Fortran, and me: Python.
The code is simple: just a bunch of vector dot products and a summation, after which we apply a signal function and return -1 or 1 (activated or not).
With that, we feed in a bunch of random numbers and check (right now single-process only) which language does it faster.
My code is as simple as this:
def sgn(h):
"""Signal function"""
return -1 if h < 0 else 1
def lincomb(A, B):
"""Linear combinator between two matrices"""
return np.einsum('ji,ij->', A, B)
def lincombrav(A, B):
return A.ravel().dot(B.ravel('F'))
def functional_test():
w1 = np.random.random(50**2).reshape(50,50)
w2 = np.random.random(50**2).reshape(50,50)
return sgn(lincombrav(w1, w2))
Here A and B are matrices that represent layers in a neural network. We dot the ith column of the first matrix with the ith row of the second matrix, sum all the results, and send that to the signal function. Something like:
w1 = 2*np.random.random(100**2).reshape(100,100)-1
w2 = 2*np.random.random(100**2).reshape(100,100)-1
Then we time it with
%timeit sgn(lincomb(w1, w2))
Python is losing to Fortran by 38x :-(
Is there any way to improve that Python code?
EDIT: Added timeit results:
Python version (already with the ravel mode)
In [10]: %timeit functional_test()
8.72 µs ± 406 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Python version (with einsum)
In [16]: %timeit functional_test()
10.27 µs ± 490 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Fortran version
In [13]: %timeit fort.test()
235 ns ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The Fortran version was created using the f2py program, which generates a Python-loadable module from Fortran code.
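For reference, a typical f2py invocation for the module shown later would be something like this (the exact flags may vary with your setup):
f2py -c fbla.f95 -m fbla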
The test functions do the following (in each language):
Create the matrix A
Create the matrix B
call sgn(lincomb(A,B)) # from each respective language implementation
I also moved the matrix creation outside the timed function, so that only the mathematical operation is measured rather than the memory handling as well. Still, Python is behind by the same order of magnitude.
EDIT2: Good Python news. Python has won in all but the small-matrix tests. The whole code follows:
Python functions (bla.py)
import numpy as np
from numba import jit
import timeit
import matplotlib.pyplot as plt
def sgn(h):
"""Signal function"""
return -1 if h < 0 else 1
def lincomb(A, B):
"""Linear combinator between two matrices"""
return np.einsum('ji,ij->', A, B)
def lincombrav(A, B):
return A.ravel().dot(B.ravel('F'))
def functional_test_ravel(n):
"""Functional tests (Victor experiment)"""
w = 2*np.random.random(n**2).reshape(n,n)-1
x = 2*np.random.random(n**2).reshape(n,n)-1
return sgn(lincombrav(w, x))
def functional_test_einsum(n):
"""Functional tests (Victor experiment)"""
w = 2*np.random.random(n**2).reshape(n,n)-1
x = 2*np.random.random(n**2).reshape(n,n)-1
return sgn(lincomb(w, x))
@jit()
def functional_test_numbaein(n):
"""Functional tests (Victor experiment)"""
w = 2*np.random.random(n**2).reshape(n,n)-1
x = 2*np.random.random(n**2).reshape(n,n)-1
return sgn(lincomb(w, x))
@jit()
def functional_test_numbarav(n):
"""Functional tests (Victor experiment)"""
w = 2*np.random.random(n**2).reshape(n,n)-1
x = 2*np.random.random(n**2).reshape(n,n)-1
return sgn(lincombrav(w, x))
Fortran functions (fbla.f95)
module fbla
implicit none
integer, parameter::dp = selected_real_kind(12,100)
public
contains
real(kind=dp) function sgn(x)
integer, parameter::dp = selected_real_kind(12,100)
real(kind=dp), intent(in):: x
if(x >= 0.0 ) then
sgn = +1.0
else if (x < 0.0) then
sgn = -1.0
end if
end function sgn
real(kind=dp) function lincomb(A, B, n)
integer, parameter :: sp = selected_int_kind(r=8)
integer, parameter :: dp = selected_real_kind(12,100)
integer(kind=sp) :: i
integer(kind=sp), intent(in):: n
real(kind=DP), intent(in) :: A(n,n)
real(kind=DP), intent(in) :: B(n,n)
lincomb = 0
do i=1,n
lincomb = lincomb + dot_product(A(:,i),B(i,:))
end do
end function lincomb
real(kind=dp) function functional_test(n)
integer, parameter::dp = selected_real_kind(12,100)
integer, parameter::sp = selected_int_kind(r=8)
integer(kind=sp), intent(in):: n
integer(kind=sp):: i, j
real(kind=dp), allocatable, dimension(:,:):: x, w, wt
ALLOCATE(wt(n,n),w(n,n),x(n,n))
do i=1,n
do j=1,n
w(i,j) = 2*rand(0)-1
x(i,j) = 2*rand(0)-1
end do
end do
wt = transpose(w)
functional_test = sgn(lincomb(wt, x, n))
end function functional_test
end module fbla
Test execution functions (tests.py)
import numpy as np
import timeit
import matplotlib.pyplot as plt
import bla
from fbla import fbla
def run_test(test_functions, N, runs=1000):
results = []
global rank
for n in N:
rank = n
for t in test_functions:
# print(f'Rank {globals()["rank"]}')
print(f'Running {t} to matrix size {rank}', end='')
r = min(timeit.Timer(t , globals=globals()).repeat(repeat=5, number=runs))
print(f' total time {r} per run {r/runs}')
results.append((t, n, r, r/runs))
return results
def plotbars(results, test_functions, N):
Nsz = len(N)
M = len(test_functions)
fig, ax = plt.subplots()
ind = np.arange(int(Nsz))
width = 1/(M+1)
p = []
for n in range(M):
g = [ w*1000 for (x,y,z,w) in results if x==test_functions[n]]
p.append(ax.bar(ind+n*width, g, width, bottom=0))
ax.legend([ l[0] for l in p ], test_functions)
ax.set_xticks(ind-width/2+((M/2) * width))
ax.set_xticklabels(np.array(N).astype(str))
ax.set_xlabel('Rank of square random matrix')
ax.set_ylabel('Average time(ms) per run')
ax.set_yscale('log')
return fig
N = (10, 50, 100, 1000)
test_functions = [
'bla.functional_test_einsum(rank)',
'fbla.functional_test(rank)'
]
results = run_test(test_functions, N)
plot = plotbars(results, test_functions, N)
plot.show()
The results are:
[('bla.functional_test_einsum(rank)', 10, 0.023221354000270367, 2.3221354000270368e-05),
('fbla.functional_test(rank)', 10, 0.005375514010665938, 5.375514010665938e-06),
('bla.functional_test_einsum(rank)', 50, 0.07035048000398092, 7.035048000398091e-05),
('fbla.functional_test(rank)', 50, 0.1242617039824836, 0.0001242617039824836),
('bla.functional_test_einsum(rank)', 100, 0.22694124400732107, 0.00022694124400732108),
('fbla.functional_test(rank)', 100, 0.5518505079962779, 0.0005518505079962779),
('bla.functional_test_einsum(rank)', 1000, 37.88827919398318, 0.03788827919398318),
('fbla.functional_test(rank)', 1000, 74.09929457501858, 0.07409929457501857)]
Some standard timeit output from an ipython3 session; fbla is the Fortran library while bla is the standard Python library.
In : n=1000
In : w1 = 2*np.random.random(n**2).reshape(n,n)-1
In : w2 = 2*np.random.random(n**2).reshape(n,n)-1
In : bla.sgn(bla.lincomb(w1,w2))
Out: -1
In : fbla.sgn(fbla.lincomb(w1,w2))
Out: -1.0
In : %timeit fbla.sgn(fbla.lincomb(w1,w2))
11.3 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In : %timeit bla.sgn(bla.lincomb(w1,w2))
3.81 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can improve a bit with matrix multiplication:
sgn(w1.ravel().dot(w2.ravel('F')))
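Both this and the einsum form compute the same scalar, namely trace(A @ B); a quick sanity check (a sketch with small random matrices):
import numpy as np

A = np.random.random((4, 4))
B = np.random.random((4, 4))

# All three expressions sum A[:, i] . B[i, :] over i, i.e. trace(A @ B)
t1 = np.einsum('ji,ij->', A, B)
t2 = A.ravel().dot(B.ravel('F'))  # row-major A against column-major B
t3 = np.trace(A @ B)
assert np.allclose([t1, t2], t3)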
If you want NumPy to be faster, get a faster NumPy. Try uninstalling NumPy and installing the Intel-optimized version, which includes a number of CPU-level optimizations that should significantly improve the performance of operations such as matrix multiplication on machines with an Intel CPU.
pip uninstall numpy
pip install intel-numpy
I have two 2D point clouds with an equal number of elements. For these elements I know their correspondence, i.e. for each point in PC1 I know the corresponding element in PC2 and vice versa.
I would now like to estimate the rotation between these two point clouds. That is, I would like to find the angle alpha by which I must rotate all points in PC1 around the origin such that the distance between corresponding points in PC1 and PC2 is minimized.
I can solve this using scipy's scalar optimizer (see below); however, this optimization sits inside a loop on the critical path of my code and is the current bottleneck.
import numpy as np
from scipy.optimize import minimize_scalar
from math import sin, cos
# generate some data for demonstration purposes
# points in each point cloud are ordered by correspondence
num_points = 10
distance = np.random.rand(num_points) * 3
radii = np.random.rand(num_points) * 2*np.pi
pc1 = distance[:, None] * np.stack([np.cos(radii), np.sin(radii)], axis=1)
distance = np.random.rand(num_points) * 3
radii = np.random.rand(num_points) * 2*np.pi
pc2 = distance[:, None] * np.stack([np.cos(radii), np.sin(radii)], axis=1)
# solve using scipy
def score(alpha):
rot_matrix = np.array([
[cos(alpha), -sin(alpha)],
[sin(alpha), cos(alpha)]
])
    pc1_rotated = (rot_matrix @ pc1.T).T
sum_of_squares = np.sum((pc2 - pc1_rotated)**2, axis=1)
mse = np.mean(sum_of_squares)
return mse
# simple solution via scipy
result = minimize_scalar(
score,
bounds=(0, 2*np.pi),
method="bounded",
options={"maxiter": 1000},
)
if result.success:
print(f"Best angle: {result.x}")
else:
raise RuntimeError(f"IK failed. Reason: {result.message}")
Is there a faster (potentially analytic) solution to this problem?
Since minimize_scalar only uses derivative-free methods, the optimization runtime depends heavily on the time needed to evaluate your objective function score. Consequently, I'd recommend accelerating this function as much as possible.
Let's time your function and the optimizer as benchmark reference:
In [68]: %timeit score(0.5)
20.2 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [69]: %timeit result = minimize_scalar(score,bounds=(0, 2*np.pi),method="bounded",options={"maxiter": 1000})
415 µs ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Firstly, note that (rot_matrix @ pc1.T).T is the same as pc1 @ rot_matrix.T, i.e. we only need to transpose one matrix instead of two.
Next, note that -sin(alpha) = cos(alpha + 5*pi/2) and sin(alpha) = cos(alpha + 3*pi/2). This means that we only need one function call of np.cos to create the rot_matrix instead of four calls of math.sin or math.cos.
Lastly, you can compute the mse more efficiently with np.einsum.
Putting it all together, the function can look like this:
k1 = 5*np.pi/2
k2 = 3*np.pi/2
def score2(alpha):
rot_matrixT = np.cos((alpha, alpha+k2, alpha + k1, alpha)).reshape(2,2)
    pc1_rotated = pc1 @ rot_matrixT
diff = pc2 - pc1_rotated
return np.einsum('ij,ij->', diff, diff) / num_points
Timing the new function yields
In [70]: %timeit score2(0.5)
9.26 µs ± 84.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and therefore, the optimizer is much faster:
In [71]: %timeit result = minimize_scalar(score, bounds=(0, 2*np.pi), method="bounded", options={"maxiter": 1000})
279 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If that still is not fast enough, you can just-in-time compile your function with Numba:
In [60]: from numba import njit
In [61]: @njit
...: def score3(alpha):
...: rot_matrix = np.array([
...: [cos(alpha), -sin(alpha)],
...: [sin(alpha), cos(alpha)]
...: ])
...:     pc1_rotated = (rot_matrix @ pc1.T).T
...: sum_of_squares = np.sum((pc2 - pc1_rotated)**2, axis=1)
...: mse = np.mean(sum_of_squares)
...: return mse
In [62]: %timeit score3(0.5)
2.97 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
or rewrite it using Cython. Just for the sake of completeness, here's a fast Cython implementation:
In [45]: %%cython -c=-O3 -c=-march=native -c=-Wno-deprecated-declarations -c=-Wno-#warnings
...:
...: from libc.math cimport cos, sin
...: cimport numpy as np
...: import numpy as np
...: from cython cimport wraparound, boundscheck
...:
...: @wraparound(False)
...: @boundscheck(False)
...: cpdef double score4(double alpha, double[:, ::1] pc1, double[:, ::1] pc2):
...: cdef int i
...: cdef int N = pc1.shape[0]
...: cdef double diff1 = 0.0
...: cdef double diff2 = 0.0
...: cdef double mse = 0.0
...: cdef double rmT00 = cos(alpha)
...: cdef double rmT01 = sin(alpha)
...: cdef double rmT10 = -rmT01
...: cdef double rmT11 = rmT00
...:
...: for i in range(N):
...: diff1 = pc2[i,0] - (pc1[i,0]*rmT00 + pc1[i,1]*rmT10)
...: diff2 = pc2[i,1] - (pc1[i,0]*rmT01 + pc1[i,1]*rmT11)
...: mse += diff1*diff1 + diff2*diff2
...: return mse / N
which yields
In [48]: %timeit score4(0.5, pc1, pc2)
1.05 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Last but not least, you can write down the first-order necessary condition of your problem and check whether it can be solved analytically. Otherwise, you can try to solve the resulting nonlinear equation numerically.
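In this particular case the first-order condition does have a closed-form solution (a sketch of the standard 2D Procrustes result; best_angle is a name introduced here, not taken from the code above):

import numpy as np

def best_angle(pc1, pc2):
    """Closed-form minimizer of mean ||pc2 - R(alpha) @ pc1||^2.

    Setting d(MSE)/d(alpha) = 0 gives
    tan(alpha) = sum(x1*y2 - y1*x2) / sum(x1*x2 + y1*y2).
    """
    x1, y1 = pc1[:, 0], pc1[:, 1]
    x2, y2 = pc2[:, 0], pc2[:, 1]
    return np.arctan2(np.sum(x1 * y2 - y1 * x2),
                      np.sum(x1 * x2 + y1 * y2)) % (2 * np.pi)

The result can be checked against minimize_scalar's result.x on the same data.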
I have written a piece of code that draws random numbers from a uniform distribution and totals them until the total reaches a number L = x.
I have tried to optimise it using Cython, but I would like suggestions on how it could be further optimised, since it will be called for large L values and so will take quite a long time.
This is the code I have written in Jupyter so far
%%cython
import numpy as np
cimport numpy
import numpy.random
def f(int L):
cdef double r=0
cdef int i=0
cdef float theta
while r<=L:
theta=np.random.uniform(0, 2*np.pi, size = None)
r+=np.cos(theta)
i+=1
return i
I'd like to speed it up as much as possible
One way to speed this up, without using Cython, is to call np.random.uniform less frequently. Calling the function to return 1 value costs almost the same as calling it to return 100,000 values, so calling it once for 1,000 values instead of 1,000 times for one value reaps huge time savings:
def call1000():
return [np.random.uniform(0, 2*np.pi, size = None) for i in range(1000)]
%timeit call1000()
762 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.random.uniform(0, 2*np.pi, size = 1000)
10.8 µs ± 13.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can implement this and ensure that you don't run out of values by doing something like this:
def f(L):
r = 0
i = 0
j = 0
theta = np.random.uniform(0, 2*np.pi, size = 100000)
while r<=L:
if j == len(theta):
j=0
theta=np.random.uniform(0, 2*np.pi, size = 100000)
r+=np.cos(theta[j])
i+=1
return i
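Pushing the same idea further, the cosine and the running total can also be computed per block, so the Python-level loop runs only once per 100,000 draws (a sketch that preserves the first-crossing semantics of the original loop; f_batched is an introduced name):

import numpy as np

def f_batched(L, block=100_000):
    total = 0.0
    count = 0
    while True:
        # Draw and transform a whole block of angles at once
        steps = np.cos(np.random.uniform(0, 2 * np.pi, size=block))
        cum = total + np.cumsum(steps)
        over = np.nonzero(cum > L)[0]
        if over.size:  # first index where the running total exceeds L
            return count + over[0] + 1
        total = cum[-1]
        count += block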
I'm using Python to build a physics simulation model. I have two numpy 3D arrays, arr_A and arr_B, of size 50*50*15 (which may be enlarged to 1000*1000*50 in the future), and I want to see how these two arrays evolve under a certain computation. I tried to accelerate my program with parallel computing on my 12-core machine, but the outcome was not good, and I have come to realize that pure Python is very slow for this kind of scientific computing.
Do I have to rewrite my program in C? That would be quite a tough job. I heard that Cython might be a solution. Should I use it? I really need some advice on how to accelerate my program, since I'm a beginner in programming.
I'm working on a win10 x64 machine with 12 cores.
My computation is something like this:
The value in arr_A is either 1 or 0. For every "1" in arr_A, I need to calculate a certain value according to arr_B.
For example, if arr_A[x,y,z] == 1, C[x,y,z] = 1/(arr_B[x-1,y,z]+arr_B[x,y-1,z]+arr_B[x,y,z-1]+arr_B[x+1,y,z]+arr_B[x,y+1,z]+arr_B[x,y,z+1]).
Then I use the minimum in array C as a parameter for a function. The function can slightly change arr_A and arr_B so that they can evolve. Then we compute the "result" again and the loop keeps going.
Notice that for every single C[x,y,z], many values in arr_B are involved. If only the same cell were involved, I could do something like this:
C = arr_B[arr_A>0]**2
I hope the solution can be as simple as that, but I can't find any feasible indexing method other than a triple-nested for loop.
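For the interior of the grid, shifted slices can express the 6-neighbour sum directly (a sketch; boundary cells and the evolving update would still need separate handling):

import numpy as np

def neighbour_inverse(arr_A, arr_B):
    # Sum of the 6 face neighbours of arr_B, for all interior cells
    nbr = (arr_B[:-2, 1:-1, 1:-1] + arr_B[2:, 1:-1, 1:-1] +
           arr_B[1:-1, :-2, 1:-1] + arr_B[1:-1, 2:, 1:-1] +
           arr_B[1:-1, 1:-1, :-2] + arr_B[1:-1, 1:-1, 2:])
    C = np.zeros_like(nbr)
    mask = arr_A[1:-1, 1:-1, 1:-1] == 1
    C[mask] = 1.0 / nbr[mask]  # only where arr_A is 1
    return C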
After reading this and some documents about multithreading and multiprocessing, I tried using multiprocessing, but the simulation is not much faster.
I slice the arrays like this for the multiprocessing. To be specific, carrier_3d and potential_3d are the arr_A and arr_B mentioned above, respectively. I put the slices into different sub-processes. The details of the functions are not given here, but you can get the main idea.
chunk = np.shape(carrier_3d)[0] // cores
p = Pool(processes=cores)
for i in range(cores):
slice_of_carrier_3d = slice(i*chunk,
np.shape(carrier_3d)[0] if i == cores-1 else (i+1)*chunk+2)
p.apply_async(hopping_x_section, args=(i, chunk,carrier_3d[slice_of_carrier_3d, :, :],
potential_3d[slice_of_carrier_3d, :, :]),
callback=paral_site_record)
p.close()
p.join()
In case you want to know more about the computation, the following code is basically how it works without multiprocessing; I have explained the process above.
def vab(carrier_3d, potential_3d, a, b):
try:
Ea = potential_3d[a[0], a[1], a[2]]
Eb = potential_3d[b[0], b[1], b[2]]
if carrier_3d[b[0], b[1], b[2]] > 0:
return 0
elif b[2] < t_ox:
return 0
elif b[0] < 0 or b[1] < 0:
return 0
elif Eb > Ea:
return math.exp(-10*math.sqrt((b[0]-a[0])**2+
(b[1]-a[1])**2+(b[2]-a[2])**2)-
q*(Eb-Ea)/(kB*T))
else:
return math.exp(-10*math.sqrt((b[0]-a[0])**2+
(b[1]-a[1])**2+(b[2]-a[2])**2))
except IndexError:
return 0
#Given a point, get the vij to all 26 directions at the point
def v_all_drt(carrier_3d, potential_3d, x, y, z):
x_neighbor = [-1, 0, 1]
y_neighbor = [-1, 0, 1]
z_neighbor = [-1, 0, 1]
v = []#v is the hopping probability
drtn = []#direction
for i in x_neighbor:
for j in y_neighbor:
for k in z_neighbor:
v.append(vab(carrier_3d, potential_3d,
[x, y, z], [x+i, y+j, z+k]))
drtn.append([x+i, y+j, z+k])
return np.array(v), np.array(drtn)
#v is a list of probability(v_ij) hopping to nearest sites.
#drt is the corresponding dirction(site).
def hopping():
global sys_time
global time_counter
global hop_ini
global hop_finl
global carrier_3d
global potential_3d
    rt_min = 1000  # 1000 is meaningless, just a large enough number to start
for x in range(np.shape(carrier_3d)[0]):
for y in range(np.shape(carrier_3d)[1]):
for z in range(t_ox, np.shape(carrier_3d)[2]):
if carrier_3d[x, y, z] == 1:
v, drt = v_all_drt(carrier_3d, potential_3d, x, y, z)
if v.sum() > 0:
rt_i = -math.log(random.random())/v.sum()/v0
if rt_i < rt_min:
rt_min = rt_i
v_hop = v
drt_hop = drt
hop_ini = np.array([x, y, z], dtype = int)
#Above loop finds the carrier that hops.
#Yet we still need the hopping direction.
rdm2 = random.random()
for i in range(len(v_hop)):
if (rdm2 > v_hop[:i].sum()/v_hop.sum()) and\
(rdm2 <= v_hop[:i+1].sum()/v_hop.sum()):
hop_finl = np.array(drt_hop[i], dtype = int)
break
carrier_3d[hop_ini[0], hop_ini[1], hop_ini[2]] = 0
carrier_3d[hop_finl[0], hop_finl[1], hop_finl[2]] = 1
def update_carrier():
pass
def update_potential():
pass
#-------------------------------------
carrier_3d = np.random.randn(len_x, len_y, len_z)
carrier_3d[carrier_3d>.5] = 1
carrier_3d[carrier_3d<=.5] = 0
carrier_3d = carrier_3d.astype(int)
potential_3d = np.random.randn(len_x, len_y, len_z)
while time_counter <= set_time:# set the running time of the simulation
hopping()
update_carrier()
update_potential()
time_counter += 1
You can use numba to create a JIT-compiled version of your analysis function. This alone will be the biggest speedup to your code, and it tends to work very well when your problem fits numba's constraints. You will have to write a more sophisticated analysis inside your for loop, but I don't see any reason why what you've outlined wouldn't work. See the following code, which shows a 330-fold speedup from compiling with numba. You can also mark certain numba functions to execute in parallel; however, the associated overhead only pays off once the problem gets sufficiently large, so that is something you will have to consider for yourself.
from numpy import *
from numba import njit
def function(A, B):
C = zeros(shape=B.shape)
X, Y, Z = B.shape
for x in range(X):
for y in range(Y):
for z in range(Z):
if A[x, y, z] == 1:
C[x, y, z] = B[x, y, z]**2
return C
cfunction = njit(function)
cfunction_parallel = njit(function, parallel=True)
X, Y, Z = 50, 50, 10
A = random.randint(0, 2, size=X*Y*Z).reshape(X, Y, Z)
B = random.random(size=X*Y*Z).reshape(X, Y, Z)
_ = cfunction(A, B) # force compilation so as not to slow down timers
_ = cfunction_parallel(A, B)
print('uncompiled function')
%timeit function(A, B)
print('\nfor smaller computations, the parallel overhead makes it slower')
%timeit cfunction(A, B)
%timeit cfunction_parallel(A, B)
X, Y, Z = 1000, 1000, 50
A = random.randint(0, 2, size=X*Y*Z).reshape(X, Y, Z)
B = random.random(size=X*Y*Z).reshape(X, Y, Z)
print('\nfor larger computations, parallelization helps')
%timeit cfunction(A, B)
%timeit cfunction_parallel(A, B)
this prints:
uncompiled function
23.2 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
for smaller computations, the parallel overhead makes it slower
77.5 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
121 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
for larger computations, parallelization helps
138 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
40.1 ms ± 633 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
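If you want explicit control over which loop is split across threads, numba also provides prange (a sketch along the same lines, not part of the timings above; cfunction_prange is an introduced name):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cfunction_prange(A, B):
    C = np.zeros_like(B)
    X, Y, Z = B.shape
    for x in prange(X):  # outer loop explicitly parallelized
        for y in range(Y):
            for z in range(Z):
                if A[x, y, z] == 1:
                    C[x, y, z] = B[x, y, z]**2
    return C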
Per https://stackoverflow.com/a/48981834/1840471, this is an implementation of the weighted Gini coefficient in Python:
import numpy as np
def gini(x, weights=None):
if weights is None:
weights = np.ones_like(x)
# Calculate mean absolute deviation in two steps, for weights.
count = np.multiply.outer(weights, weights)
mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
rmad = mad / np.average(x, weights=weights)
# Gini equals half the relative mean absolute deviation.
return 0.5 * rmad
This is clean and works well for medium-sized arrays, but as warned in its initial suggestion (https://stackoverflow.com/a/39513799/1840471) it's O(n²). On my computer that means it breaks after ~20k rows:
n = 20000 # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))
Can this be adjusted to work for larger datasets? Mine is ~150k rows.
Here is a version which is much faster than the one provided above; it also uses a simplified formula for the unweighted case to get even faster results there. The idea in both branches is to sort once and then work with cumulative sums.
def gini(x, w=None):
# The rest of the code requires numpy arrays.
x = np.asarray(x)
if w is not None:
w = np.asarray(w)
sorted_indices = np.argsort(x)
sorted_x = x[sorted_indices]
sorted_w = w[sorted_indices]
# Force float dtype to avoid overflows
cumw = np.cumsum(sorted_w, dtype=float)
cumxw = np.cumsum(sorted_x * sorted_w, dtype=float)
return (np.sum(cumxw[1:] * cumw[:-1] - cumxw[:-1] * cumw[1:]) /
(cumxw[-1] * cumw[-1]))
else:
sorted_x = np.sort(x)
n = len(x)
cumx = np.cumsum(sorted_x, dtype=float)
# The above formula, with all weights equal to 1 simplifies to:
return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n
Here is some test code to check we get (mostly) the same results:
>>> x = np.random.rand(1000000)
>>> w = np.random.rand(1000000)
>>> gini_max_ghenis(x, w)
0.33376310938610521
>>> gini(x, w)
0.33376310938610382
But the speed is very different:
%timeit gini(x, w)
203 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit gini_max_ghenis(x, w)
55.6 s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you remove the pandas ops from the function, it is already much faster:
%timeit gini_max_ghenis_no_pandas_ops(x, w)
1.62 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to squeeze out the last drop of performance, you could use numba or cython, but that would only gain a few percent, because most of the time is spent in sorting.
%timeit ind = np.argsort(x); sx = x[ind]; sw = w[ind]
180 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit: gini_max_ghenis is the code used in Max Ghenis's answer.
Adapting the StatsGini R function from here:
import numpy as np
import pandas as pd
def gini(x, w=None):
# Array indexing requires reset indexes.
x = pd.Series(x).reset_index(drop=True)
if w is None:
w = np.ones_like(x)
w = pd.Series(w).reset_index(drop=True)
n = x.size
wxsum = sum(w * x)
wsum = sum(w)
sxw = np.argsort(x)
sx = x[sxw] * w[sxw]
sw = w[sxw]
pxi = np.cumsum(sx) / wxsum
pci = np.cumsum(sw) / wsum
g = 0.0
for i in np.arange(1, n):
g = g + pxi.iloc[i] * pci.iloc[i - 1] - pci.iloc[i] * pxi.iloc[i - 1]
return g
This works for large vectors, at least up to 10M rows:
n = int(1e7)  # np.random.rand needs an integer count
gini(np.random.rand(n), np.random.rand(n))  # Takes ~15s.
It also produces the same result as the function provided in the question; for example, it gives 0.2553 for this input:
gini(np.array([3, 1, 6, 2, 1]), np.array([4, 2, 2, 10, 1]))
I want to compute sin(x)/x, handling x = 0 gracefully. What I am doing now is:
import numpy as np
eps = np.finfo(float).eps
def sindiv(x):
x = np.abs(x)
return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
But there are quite a lot of extra array operations. Is there a better way?
You could use numpy.sinc, which computes sin(pi x)/(pi x):
In [20]: x = 2.4
In [21]: np.sin(x)/x
Out[21]: 0.28144299189631289
In [22]: x_over_pi = x / np.pi
In [23]: np.sinc(x_over_pi)
Out[23]: 0.28144299189631289
In [24]: np.sinc(0)
Out[24]: 1.0
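So a vectorized sin(x)/x based on this is just a rescaling (a sketch):

import numpy as np

def sindiv(x):
    # np.sinc(y) computes sin(pi*y)/(pi*y), so divide the argument by pi
    return np.sinc(np.asarray(x) / np.pi)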
In numpy array notation (so you get back a np array):
def sindiv(x):
return np.where(np.abs(x) < 0.01, 1.0 - x*x/6.0, np.sin(x)/x)
Here I've made "epsilon" fairly large for testing and used the first two terms of the Taylor series for the approximation. In practice, I'd change 0.01 to some small multiple of your eps (machine epsilon).
xx = np.arange(-0.1, 0.1, 0.001)
yy = sindiv(xx)
type(yy)
outputs numpy.ndarray and the values are continuous (and differentiable, if that's important) near the origin.
If you don't want the double evaluation (i.e. both branches are evaluated in the above), then I think you have to go with a loop as I don't believe there is any sort of "lazy where" option.
def sindiv(x):
sox = np.zeros(x.size)
    for i in range(x.size):
xv = x[i]
if np.abs(xv) < 0.001: # For testing, use a small multiple of machine epsilon
sox[i] = 1.0 - xv * xv / 6.0
else:
sox[i] = np.sin(xv) / xv
return sox
To make this really Pythonic, though, it would be best to check the type of x and just do the non-array version if it is not an array.
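Alternatively, boolean masks evaluate each branch only where it is needed, with no explicit loop (a sketch, not part of the answer above; sindiv_masked is an introduced name):

import numpy as np

def sindiv_masked(x):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    small = np.abs(x) < 0.01
    xs = x[small]
    out[small] = 1.0 - xs * xs / 6.0  # Taylor branch near 0
    xl = x[~small]
    out[~small] = np.sin(xl) / xl     # direct evaluation elsewhere
    return out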
As others have said, numpy.sinc() is the easiest.
I want to include a copy of its current implementation in NumPy 1.21.2 (link) to show there's no special tricks:
y = pi * where(x == 0, 1.0e-20, x)
return sin(y)/y
It's basically just sin(x)/x. Note that in creating y: multiplication by pi, where(), and x == 0 will create at least 2 intermediate arrays plus the final array for y. And then sin(y)/y creates two more arrays. In total at least 5 arrays are created by numpy.sinc(); and by my count your sindiv() also creates at least 5 arrays, so it's not actually that wasteful.
Here is another implementation:
TINY = np.finfo(float).tiny # ≈ 2e-308 (smallest 'normal' float)
def mysinc(x):
y = np.abs(np.pi*x) + TINY
return np.sin(y)/y
I'm pretty sure this returns values identical to numpy.sinc(). The reason is that sin(x) == x holds in floating point even for relatively 'large' values of x:
x = np.ldexp(1, -26, dtype=np.double) # x = 2**-26 ≈ 1.5e-8
print(np.sin(x) == x) # True
x = np.ldexp(1, -32, dtype=np.longdouble) # x = 2**-32 ≈ 2.3e-10
print(np.sin(x) == x) # True
So for small enough x (ignore pi factors), mysinc(x) = (x+TINY)/(x+TINY) = x/x = np.sinc(x). The exact threshold this happens does not matter too much so long as TINY < np.spacing(x) when it occurs so that x + TINY = x in this regime.
(The cutoff is around the square-root of the machine epsilon as can be understood from the Taylor series sin(x) = x - x**3/6 + ... = x(1-x**2/6) + .... So TINY is always small enough to not matter.)
Timings
import numpy as np
eps = np.finfo(float).eps
tiny = np.finfo(float).tiny
def npsinc(x):
y = np.pi * np.where(x == 0, 1.0e-20, x)
return np.sin(y)/y
def sindiv(x):
x = np.pi * np.abs(x)
return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
def mysinc(x):
y = np.abs(np.pi*x) + tiny
return np.sin(y)/y
def mysinc2(x):
y = np.abs(np.pi*x)
y += tiny # in-place addition
return np.sin(y)/y
# Test data
x = np.random.rand(100)
x[np.random.randint(100, size=10)] = 0
%timeit npsinc(x)
# 10.9 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit sindiv(x)
# 9.4 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc(x)
# 7.38 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc2(x)
# 8.64 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Curiously, using mysinc2() with in-place addition seems to be slower, and using in-place numpy.abs() and in-place numpy.sin() is even slower. I'm not entirely sure why, but see this related question.
Regardless, if you really need performance, you can try using Cython to generate C code and do things properly instead of playing tricks with NumPy:
%%cython
from libc.math cimport M_PI, sin
cimport cython
cimport numpy as np
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef _cysinc(double[:] x, double[:] out):
cdef size_t i
for i in range(x.shape[0]):
if x[i] == 0:
out[i] = 1
else:
out[i] = sin(M_PI*x[i])/(M_PI*x[i])
def cysinc(np.ndarray x):
out = np.empty_like(x)
_cysinc(x.ravel(), out.ravel())
return out
%timeit cysinc(x)
# 4.38 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As always, don't prematurely optimize, just use numpy.sinc() to begin with.
Side note
There's a question Is boost::math::sinc_pi unnecessarily complicated? that asks about the benefits of using a Taylor expansion about x=0. In summary, almost none, but maybe they are doing it for other reasons.
To emphasise: there is nothing unstable about floating-point division, or about dividing a small number by a small number, since you're just dividing the significands and subtracting the exponents.
If you calculate sinc(x) as sin(x)/x, instead of a direct Taylor series or other method that sums to convergence beyond the machine epsilon np.spacing(sinc(x)), you will be off by at most np.spacing(sinc(x)) coming from the round-off error in division /, just as you'd get with multiplication *. (Assuming no subnormal business, which even here does not matter in the treatment of sin(x)/x.)
What about allowing division by zero and replacing the NaNs afterwards?
import numpy as np
def sindiv(x):
    a = np.sin(x) / x
    a = np.nan_to_num(a, nan=1.0)  # the limit of sin(x)/x at 0 is 1, not 0
    return a
If you don't want warnings, suppress them via seterr.
Of course, the temporary a could be eliminated:
def sindiv(x):
    return np.nan_to_num(np.sin(x) / x, nan=1.0)
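With np.errstate (the context-manager form of seterr) the suppression might look like this (a sketch):

import numpy as np

def sindiv(x):
    with np.errstate(divide='ignore', invalid='ignore'):
        a = np.sin(x) / x
    return np.nan_to_num(a, nan=1.0)  # limit of sin(x)/x at 0 is 1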