Numpy: Negative Execution Time for exponentiation operation

Numpy: Negative Execution Time for exponentiation operation - python

I am multiplying two large matrices, and it turns out the operation is faster when I first perform exponentiation on the first input:
import time
import numpy as np
a = np.asarray(np.random.uniform(-1,1, (100,40000)), dtype=np.float32)
b = np.asarray(np.random.uniform(-1,1, (40000,20000)), dtype=np.float32)
start = time.time()
d0 = np.dot(a,b)
print "\nA.B - {:.2f} seconds".format((time.time()-start))
start = time.time()
d1 = np.dot(np.exp(a), b)
print "exp(A).B - {:.2f} seconds".format((time.time()-start))
start = time.time()
d2 = np.dot(a, np.exp(b))
print "A.exp(B) - {:.2f} seconds".format((time.time()-start))
start = time.time()
d3 = np.dot(np.exp(a), np.exp(b))
print "exp(A).exp(B) - {:.2f} seconds".format((time.time()-start))
Here are the results:
A.B 1.27 seconds
exp(A).B 1.15 seconds
A.exp(B) 7.31 seconds
exp(A).exp(B) 7.38 seconds
Can anyone explain what am I doing wrong, or how is this possible?

What you are seeing is the reason why you would never run an operation only once when you are benchmarking, which is kind of what you are attempting here. Many things will affect the result you are seeing when you're only doing it once. In this case, probably some caching-effect. Also, the b array is so much bigger than the a array that you cannot draw any conclusions with a vs np.exp(a) when doing np.exp(b) unless you're running in a very controlled environment.
To more properly benchmark this, we can cut the two last benchmarks and focus on a vs exp(a). Also, we repeat the operation 10,000 times and reduce the size of the arrays to avoid having to wait for several minutes:
import time
import numpy as np
a = np.asarray(np.random.uniform(-1,1, (100,400)), dtype=np.float32)
b = np.asarray(np.random.uniform(-1,1, (400,2000)), dtype=np.float32)
start = time.time()
for i in xrange(10000):
d0 = np.dot(a,b)
print "\nA.B - {:.2f} seconds".format((time.time()-start))
start = time.time()
for i in xrange(10000):
d0 = np.dot(np.exp(a), b)
print "exp(A).B - {:.2f} seconds".format((time.time()-start))
This yields the following result on my computer:
A.B - 7.87 seconds
exp(A).B - 13.24 seconds
As you see, doing np.exp(a) now takes more time than just accessing a, as expected.

I suspect this to be a matter of memory-caching, since inverting the order of the executions
import time
import numpy as np
a = np.asarray(np.random.uniform(-1,1, (100,4000)), dtype=np.float32)
b = np.asarray(np.random.uniform(-1,1, (4000,20000)), dtype=np.float32)
start = time.time()
d1 = np.dot(np.exp(a), b)
print "exp(A).B - {:.2f} seconds".format((time.time()-start))
start = time.time()
d0 = np.dot(a,b)
print "A.B - {:.2f} seconds".format((time.time()-start))
start = time.time()
d1 = np.dot(np.exp(a), b)
print "exp(A).B - {:.2f} seconds".format((time.time()-start))
start = time.time()
d0 = np.dot(a,b)
print "A.B - {:.2f} seconds".format((time.time()-start))
will still cause the second statement to be quicker than the first, but second executions of the same statements run also quicker the second time.
exp(A).B - 0.72 seconds
A.B - 0.70 seconds
exp(A).B - 0.70 seconds
A.B - 0.69 seconds
(I had to reduce the size of the arrays to avoid memory-issues.)

Related

How to measure cython run time and compare with python code?

I was using time.time for measuring inside pyx file and printing it to check times, but I think its not correct times for running, it showed different times for same lines of code and times are changing in each run, also it shows different times for each of for loops but all are doing same job, 0.010 and 0.020, and sometimes shows 0 for that part while shows 0.010 and 0.020 for 3 other sections.
Please help me how should I measure it correctly, couldn't find any good way to measure time in cython docs
for this part of code it shows those 2 times and it changes sometimes in each run :
t4 = time.time()
# print('T3 =', t4 - t3)
for j in range(np.shape(im1)[1]):
# sum_c1[j] = np.shape(im1)[0] - (np.sum(im1[:, j]))
sum_c1[j] = np.shape(im1)[0] - (np.count_nonzero(im1[:, j]))
tt3 = time.time()
print('TT3 =', tt3 - t4)
cdef int amc1 = np.argmax(sum_c1) # argmax sum_c1
tt4 = time.time()
# print('TT4 =', tt4 - tt3)
for j in range(np.shape(im2)[1]):
# sum_c2[j] = np.shape(im2)[0] - (np.sum(im2[:, j]))
sum_c2[j] = np.shape(im2)[0] - (np.count_nonzero(im2[:, j]))
t5 = time.time()
print('TT5 =', t5 - tt4)
# print('T4 =', t5 - t4)
## find of max zeros in row
for j in range(np.shape(im1)[0]):
# sum_r1[j] = np.shape(im1)[1] - (np.sum(im1[j, :]))
sum_r1[j] = np.shape(im1)[1] - (np.count_nonzero(im1[j, :]))
tt1 = time.time()
print('TT1 =', tt1 - t5)
cdef int amr1 = np.argmax(sum_r1) # argmax sum_r1
tt2 = time.time()
# print('TT2 =', tt2 - tt1)
for j in range(np.shape(im2)[0]):
# sum_r2[j] = np.shape(im2)[1] - (np.sum(im2[j, :]))
sum_r2[j] = np.shape(im2)[1] - (np.count_nonzero(im2[j, :]))
t6 = time.time()
print('T5 =', t6 - t5)
('TT3 =', 0.020589590072631836)
('TT5 =', 0.011527061462402344)
('TT1 =', 0.0)
('T5 =', 0.009999990463256836)
-----------
('TT3 =', 0.0100250244140625)
('TT5 =', 0.00996851921081543)
('TT1 =', 0.01003265380859375)
('T5 =', 0.020001888275146484)
these are 2 different runs on same code

What about using perf_counter()?
start = time.perf_counter()
# your code
print(time.perf_counter()-start)
More details here: https://stackoverflow.com/a/52228375/3872144

the first calculation with torch.einsum is much slower

When I run several calculations with torch.einsum in a row, the first one is always much slower than the following calculations.
The following code and plot illustrates the problem:
import torch as tor
from timeit import default_timer as timer
N = 1000
L = 10
time_arr = np.zeros(L)
for i in range(L):
a = tor.randn(N, N).to("cuda:0") #3 random 1000x1000 matrices for each cycle
b = tor.randn(N, N).to("cuda:0")
c = tor.randn(N, N).to("cuda:0")
time_start = timer()
tor.einsum("ij, kj",tor.einsum("ij, ja", aa, ab), ac)
time_end = timer()
time_arr[i] = time_end - time_start
Plot of the different times for each cylce of the loop

Fast String Array (Individual) Hashing in Python

As the title suggests, I would love some suggestions on how to make the following code faster. I have tried multiple methods including utilizing Numba, a combination of the np.fromiter() and map() functions, as well as Numpy vectorizing the hash function so as to get it all to work faster.
Take a look at the code below! Below the code are the results of the code from running it on my machine.
Some notes for the future:
the CODES variable holds the dimensions of the arrays that need to be hashed (generated random strings for StackOverflow purposes)
the first compilation time for Numba is quite high, but I am willing to take on this cost if it means speeding up the code in the long run when running hundreds of thousands of times
import numpy as np
import numba as nb
import time
# GENERATE RANDOM "<U64" CHARACTER ARRAYS
A, Z = np.array(["A","Z"]).view("int32")
CODES = (25058, 64272, 61425)
LENGTH = 64
tmp1 = np.random.randint(low = A, high = Z, size = CODES[0] * LENGTH, dtype = "int32").view(f"U{LENGTH}")
tmp2 = np.random.randint(low = A, high = Z, size = CODES[1] * LENGTH, dtype = "int32").view(f"U{LENGTH}")
tmp3 = np.random.randint(low = A, high = Z, size = CODES[2] * LENGTH, dtype = "int32").view(f"U{LENGTH}")
# NUMBA ITERATION AND HASHING ONE BY ONE
#nb.jit(nopython=True, fastmath=True, parallel=True, cache=True)
def form_hashed_array(to_hash: np.ndarray) -> np.ndarray:
hashed = np.empty_like(to_hash, dtype=np.int64)
for i in nb.prange(len(to_hash)):
hashed[i] = hash(to_hash[i])
return hashed
print("--------")
t2 = time.monotonic()
t = time.monotonic()
tmp4 = form_hashed_array(tmp1)
print(time.monotonic() - t) # this first one will be larger due to compilation time
t = time.monotonic()
tmp5 = form_hashed_array(tmp2)
print(time.monotonic() - t)
t = time.monotonic()
tmp6 = form_hashed_array(tmp3)
print(time.monotonic() - t)
print("NUMBA ITERATION TOOK: " + str(time.monotonic() - t2))
# MAP + FROMITER COMBINATION
print("--------")
t2 = time.monotonic()
t = time.monotonic()
tmp7 = np.fromiter((map(hash, tmp1)), dtype=np.int64)
print(time.monotonic() - t)
t=time.monotonic()
tmp8 = np.fromiter((map(hash, tmp2)), dtype=np.int64)
print(time.monotonic()-t)
t = time.monotonic()
tmp9 = np.fromiter((map(hash, tmp3)), dtype=np.int64)
print(time.monotonic()-t)
print("MAP + FROMITER COMBINATION TOOK : " + str(time.monotonic()-t2))
# NUMPY FUNCTION VECTORIZATION
vfunc = np.vectorize(hash)
print("--------")
t2 = time.monotonic()
t = time.monotonic()
tmp10 = vfunc(tmp1)
print(time.monotonic() - t)
t = time.monotonic()
tmp11 = vfunc(tmp2)
print(time.monotonic() - t)
t = time.monotonic()
tmp12 = vfunc(tmp3)
print(time.monotonic() - t)
print("NUMPY FUNCTION VECTORIZATION TOOK: " + str(time.monotonic()-t2))
# SANITY CHECKS
print("--------")
print(not (tmp4 - tmp7).any() and not (tmp7 - tmp10).any(), end=" ")
print(not (tmp5 - tmp8).any() and not (tmp8 - tmp11).any(), end=" ")
print(not (tmp6 - tmp9).any() and not (tmp9 - tmp12).any())
print("--------")
breakpoint()
This code will be run a significant number of times, so it is important that it is as fast as possible.
--------
6.9208437809720635
0.08914285799255595
0.09502507897559553
NUMBA ITERATION TOOK: 7.1051117710303515
--------
0.009926816972438246
0.02683716599131003
0.026946138008497655
MAP + FROMITER COMBINATION TOOK : 0.06381386297289282
--------
0.011753249040339142
0.02864329604199156
0.029279633017722517
NUMPY FUNCTION VECTORIZATION TOOK: 0.06976548000238836
--------
True True True
--------
Thanks for the help in advance!

for loop in python is 10x slower than matlab

I run python 2.7 and matlab R2010a on the same machine, doing nothing, and it gives me 10x different in speed
I looked online, and heard it should be the same order.
Python will further slow down as if statement and math operator in the for loop
My question: is this the reality? or there is some other way let them in the same speed order?
Here is python code
import time
start_time = time.time()
for r in xrange(1000):
for c in xrange(1000):
continue
elapsed_time = time.time() - start_time
print 'time cost = ',elapsed_time
Output: time cost = 0.0377440452576
Here is matlab code
tic
for i = 1:1000
for j = 1:1000
end
end
toc
Output: Escaped time is 0.004200 seconds

The reason this is happening is related to the JIT compiler, which is optimizing the MATLAB for loop. You can disable/enable the JIT accelerator using feature accel off and feature accel on. When you disable the accelerator, the times change dramatically.
MATLAB with accel on: Elapsed time is 0.009407 seconds.
MATLAB with accel off: Elapsed time is 0.287955 seconds.
python: time cost = 0.0511920452118
Thus the JIT accelerator is directly causing the speedup that you are noticing. There is another thing that you should consider, which is related to the way that you defined the iteration indices. In both cases, MATLAB and python, you used Iterators to define your loops. In MATLAB you create the actual values by adding the square brackets ([]), and in python you use range instead of xrange. When you make these changes
% MATLAB
for i = [1:1000]
for j = [1:1000]
# python
for r in range(1000):
for c in range(1000):
The times become
MATLAB with accel on: Elapsed time is 0.338701 seconds.
MATLAB with accel off: Elapsed time is 0.289220 seconds.
python: time cost = 0.0606048107147
One final consideration is if you were to add a quick computation to the loop. ie t=t+1. Then the times become
MATLAB with accel on: Elapsed time is 1.340830 seconds.
MATLAB with accel off: Elapsed time is 0.905956 seconds. (Yes off was faster)
python: time cost = 0.147221088409
I think that the moral here is that the computation speeds of for loops, out-of-the box, are comparable for extremely simple loops, depending on the situation. However, there are other, numerical tools in python which can speed things up significantly, numpy and PyPy have been brought up so far.

The basic Python implementation, CPython, is not meant to be super-speedy. If you need efficient matlab-style numerical manipulation, use the numpy package or an implementation of Python that is designed for fast work, such as PyPy or even Cython. (Writing a Python extension in C, which will of course be pretty fast, is also a possible solution, but in that case you may as well just use numpy and save yourself the effort.)

If Python execution performance is really crucial for you, you might take a look at PyPy
I did your test:
import time
for a in range(10):
start_time = time.time()
for r in xrange(1000):
for c in xrange(1000):
continue
elapsed_time = time.time()-start_time
print elapsed_time
with standard Python 2.7.3, I get:
0.0311839580536
0.0310959815979
0.0309510231018
0.0306520462036
0.0302460193634
0.0324130058289
0.0308878421783
0.0307397842407
0.0304911136627
0.0307500362396
whereas, using PyPy 1.9.0 (which corresponds to Python 2.7.2), I get:
0.00921821594238
0.0115230083466
0.00851202011108
0.00808095932007
0.00496387481689
0.00499391555786
0.00508499145508
0.00618195533752
0.005126953125
0.00482988357544
The acceleration of PyPy is really stunning and really becomes visible when its JIT compiler optimizations outweigh their cost. That's also why I introduced the extra for loop. For this example, absolutely no modification of the code was needed.

This is just my opinion, but I think the process is a bit more complex. Basically Matlab is an optimized layer of C, so with the appropriate initialization of matrices and minimization of function calls (avoid "." objects-like operators in Matlab) you obtain extremely different results. Consider the simple following example of wave generator with cosine function. Matlab time = 0.15 secs in practical debug session, Python time = 25 secs in practical debug session (Spyder), thus Python becomes 166x slower. Run directly by Python 3.7.4. machine the time is = 5 secs aprox, so still be a non negligible 33x.
MATLAB:
AW(1,:) = [800 , 0 ]; % [amp frec]
AW(2,:) = [300 , 4E-07];
AW(3,:) = [200 , 1E-06];
AW(4,:) = [ 50 , 4E-06];
AW(5,:) = [ 30 , 9E-06];
AW(6,:) = [ 20 , 3E-05];
AW(7,:) = [ 10 , 4E-05];
AW(8,:) = [ 9 , 5E-04];
AW(9,:) = [ 7 , 7E-04];
AW(10,:)= [ 5 , 8E-03];
phas = 0
tini = -2*365 *86400; % 2 years backwards in seconds
dt = 200; % step, 200 seconds
tfin = 0; % present
vec_t = ( tini: dt: tfin)'; % vector_time
nt = length(vec_t);
vec_t = vec_t - phas;
wave = zeros(nt,1);
for it = 1:nt
suma = 0;
t = vec_t(it,1);
for iW = 1:size(AW,1)
suma = suma + AW(iW,1)*cos(AW(iW,2)*t);
end
wave(it,1) = suma;
end
PYTHON:
import numpy as np
AW = np.zeros((10,2))
AW[0,:] = [800 , 0.0]
AW[1,:] = [300 , 4E-07]; # [amp frec]
AW[2,:] = [200 , 1E-06];
AW[3,:] = [ 50 , 4E-06];
AW[4,:] = [ 30 , 9E-06];
AW[5,:] = [ 20 , 3E-05];
AW[6,:] = [ 10 , 4E-05];
AW[7,:] = [ 9 , 5E-04];
AW[8,:] = [ 7 , 7E-04];
AW[9,:] = [ 5 , 8E-03];
phas = 0
tini = -2*365 *86400 # 2 years backwards
dt = 200
tfin = 0 # present
nt = round((tfin-tini)/dt) + 1
vec_t = np.linspace(tini,tfin1,nt) - phas
wave = np.zeros((nt))
for it in range(nt):
suma = 0
t = vec_t[fil]
for iW in range(np.size(AW,0)):
suma = suma + AW[iW,0]*np.cos(AW[iW,1]*t)
#endfor iW
wave[it] = suma
#endfor it
To deal such aspects in Python I would suggest to compile into executable directly to binary the numerical parts that may compromise the project (or for example C or Fortran into executable and be called by Python afterwards). Of course, other suggestions are appreciated.

I tested a FIR filter with MATLAB and same (adapted) code in Python, including a frequency sweep. The FIR filter is pretty huge, N = 100 order, I post below the two codes, but leave you here the timing results:
MATLAB: Elapsed time is 11.149704 seconds.
PYTHON: time cost = 247.8841781616211 seconds.
PYTHON IS 25 TIMES SLOWER !!!
MATLAB CODE (main):
f1 = 4000; % bandpass frequency (response = 1).
f2 = 4200; % bandreject frequency (response = 0).
N = 100; % FIR filter order.
k = 0:2*N;
fs = 44100; Ts = 1/fs; % Sampling freq. and time.
% FIR Filter numerator coefficients:
Nz = Ts*(f1+f2)*sinc((f2-f1)*Ts*(k-N)).*sinc((f2+f1)*Ts*(k-N));
f = 0:fs/2;
w = 2*pi*f;
z = exp(-i*w*Ts);
% Calculation of the expected response:
Hz = polyval(Nz,z).*z.^(-2*N);
figure(1)
plot(f,abs(Hz))
title('Gráfica Respuesta Filtro FIR (Filter Expected Response)')
xlabel('frecuencia f (Hz)')
ylabel('|H(f)|')
xlim([0, 5000])
grid on
% Sweep Frequency Test:
tic
% Start and Stop frequencies of sweep, t = tmax = 50 seconds = 5000 Hz frequency:
fmin = 1; fmax = 5000; tmax = 50;
t = 0:Ts:tmax;
phase = 2*pi*fmin*t + 2*pi*((fmax-fmin).*t.^2)/(2*tmax);
x = cos(phase);
y = filtro2(Nz, 1, x); % custom filter function, not using "filter" library here.
figure(2)
plot(t,y)
title('Gráfica Barrido en Frecuencia Filtro FIR (Freq. Sweep)')
xlabel('Tiempo Barrido: t = 10 seg = 1000 Hz')
ylabel('y(t)')
xlim([0, 50])
grid on
toc
MATLAB CUSTOM FILTER FUNCTION
function y = filtro2(Nz, Dz, x)
Nn = length(Nz);
Nd = length(Dz);
N = length(x);
Nm = max(Nn,Nd);
x1 = [zeros(Nm-1,1) ; x'];
y1 = zeros(Nm-1,1);
for n = Nm:N+Nm-1
y1(n) = Nz(Nn:-1:1)*x1(n-Nn+1:n)/Dz(1);
if Nd > 1
y1(n) = y1(n) - Dz(Nd:-1:2)*y1(n-Nd+1:n-1)/Dz(1);
end
end
y = y1(Nm:Nm+N-1);
end
PYTHON CODE (main):
import numpy as np
from matplotlib import pyplot as plt
import FiltroDigital as fd
import time
j = np.array([1j])
pi = np.pi
f1, f2 = 4000, 4200
N = 100
k = np.array(range(0,2*N+1),dtype='int')
fs = 44100; Ts = 1/fs;
Nz = Ts*(f1+f2)*np.sinc((f2-f1)*Ts*(k-N))*np.sinc((f2+f1)*Ts*(k-N));
f = np.arange(0, fs/2, 1)
w = 2*pi*f
z = np.exp(-j*w*Ts)
Hz = np.polyval(Nz,z)*z**(-2*N)
plt.figure(1)
plt.plot(f,abs(Hz))
plt.title("Gráfica Respuesta Filtro FIR")
plt.xlabel("frecuencia f (Hz)")
plt.ylabel("|H(f)|")
plt.xlim(0, 5000)
plt.grid()
plt.show()
start_time = time.time()
fmin = 1; fmax = 5000; tmax = 50;
t = np.arange(0, tmax, Ts)
fase = 2*pi*fmin*t + 2*pi*((fmax-fmin)*t**2)/(2*tmax)
x = np.cos(fase)
y = fd.filtro(Nz, [1], x)
plt.figure(2)
plt.plot(t,y)
plt.title("Gráfica Barrido en Frecuencia Filtro FIR")
plt.xlabel("Tiempo Barrido: t = 10 seg = 1000 Hz")
plt.ylabel("y(t)")
plt.xlim(0, 50)
plt.grid()
plt.show()
elapsed_time = time.time() - start_time
print('time cost = ', elapsed_time)
PYTHON CUSTOM FILTER FUNCTION
import numpy as np
def filtro(Nz, Dz, x):
Nn = len(Nz);
Nd = len(Dz);
Nz = np.array(Nz,dtype=float)
Dz = np.array(Dz,dtype=float)
x = np.array(x,dtype=float)
N = len(x);
Nm = max(Nn,Nd);
x1 = np.insert(x, 0, np.zeros((Nm-1,), dtype=float))
y1 = np.zeros((N+Nm-1,), dtype=float)
for n in range(Nm-1,N+Nm-1) :
y1[n] = sum(Nz*np.flip( x1[n-Nn+1:n+1]))/Dz[0] # = y1FIR[n]
if Nd > 1:
y1[n] = y1[n] - sum(Dz[1:]*np.flip( y1[n-Nd+1:n]))/Dz[0]
print(y1[n])
y = y1[Nm-1:]
return y

Numpy modify ndarray diagonal

is there any way in numpy to get a reference to the array diagonal?
I want my array diagonal to be divided by a certain factor
Thanks

If X is your array and c is the factor,
X[np.diag_indices_from(X)] /= c
See diag_indices_from in the Numpy manual.

A quick way to access the diagonal of a square (n,n) numpy array is with arr.flat[::n+1]:
n = 1000
c = 20
a = np.random.rand(n,n)
a[np.diag_indices_from(a)] /= c # 119 microseconds
a.flat[::n+1] /= c # 25.3 microseconds

The np.fill_diagonal function is quite fast:
np.fill_diagonal(a, a.diagonal() / c)
where a is your array and c is your factor. On my machine, this method was as fast as #kwgoodman's a.flat[::n+1] /= c method, and in my opinion a bit clearer (but not as slick).

Comparing the above 3 methods:
import numpy as np
import timeit
n = 1000
c = 20
a = np.random.rand(n,n)
a1 = a.copy()
a2 = a.copy()
a3 = a.copy()
t1 = np.zeros(1000)
t2 = np.zeros(1000)
t3 = np.zeros(1000)
for i in range(1000):
start = timeit.default_timer()
a1[np.diag_indices_from(a1)] /= c
stop = timeit.default_timer()
t1[i] = start-stop
start = timeit.default_timer()
a2.flat[::n+1] /= c
stop = timeit.default_timer()
t2[i] = start-stop
start = timeit.default_timer()
np.fill_diagonal(a3,a3.diagonal() / c)
stop = timeit.default_timer()
t3[i] = start-stop
print([t1.mean(), t1.std()])
print([t2.mean(), t2.std()])
print([t3.mean(), t3.std()])
[-4.5693619907979154e-05, 9.3142851395411316e-06]
[-2.338075107036275e-05, 6.7119609571872443e-06]
[-2.3731951987429056e-05, 8.0455946813059586e-06]
So you can see that the np.flat method is the fastest but marginally. When I ran this for a few more times there were times when the fill_diagonal method was slightly faster. But readability wise its probably worth using the fill_diagonal method.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.