Given the Fourier series coefficients a[n] and b[n] (for cosines and sines respectively) of a function with period T, and t an equally spaced array of sample points, the following code evaluates the partial sum at every point of t (a, b and t are all numpy arrays). Note that len(t) != len(a).
yn = ones(len(t))*a[0]
for n in range(1, len(a)):
    yn = yn + (a[n]*cos(2*pi*n*t/T) - b[n]*sin(2*pi*n*t/T))
My question is: Can this for loop be vectorized?
Here's one vectorized approach that makes use of broadcasting to create the 2D array of cosine/sine arguments, 2*pi*n*t/T, and then uses matrix multiplication with np.dot for the sum-reduction -
r = np.arange(1,len(a))
S = 2*np.pi*r[:,None]*t/T
cS = np.cos(S)
sS = np.sin(S)
out = a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Further performance boost
For a further boost, we can use the numexpr module to compute those trigonometric steps -
import numexpr as ne
cS = ne.evaluate('cos(S)')
sS = ne.evaluate('sin(S)')
Runtime test -
Approaches -
def original_app(t,a,b,T):
    yn = np.ones(len(t))*a[0]
    for n in range(1,len(a)):
        yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
    return yn

def vectorized_app(t,a,b,T):
    r = np.arange(1,len(a))
    S = (2*np.pi/T)*r[:,None]*t
    cS = np.cos(S)
    sS = np.sin(S)
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]

def vectorized_app_v2(t,a,b,T):
    r = np.arange(1,len(a))
    S = (2*np.pi/T)*r[:,None]*t
    cS = ne.evaluate('cos(S)')
    sS = ne.evaluate('sin(S)')
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Also including the function PP from @Paul Panzer's post.
Timings -
In [22]: # Setup inputs
...: n = 10000
...: t = np.random.randint(0,9,(n))
...: a = np.random.randint(0,9,(n))
...: b = np.random.randint(0,9,(n))
...: T = 3.45
...:
In [23]: print np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), vectorized_app_v2(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), PP(t,a,b,T))
...:
True
True
True
In [25]: %timeit original_app(t,a,b,T)
...: %timeit vectorized_app(t,a,b,T)
...: %timeit vectorized_app_v2(t,a,b,T)
...: %timeit PP(t,a,b,T)
...:
1 loops, best of 3: 6.49 s per loop
1 loops, best of 3: 6.24 s per loop
1 loops, best of 3: 1.54 s per loop
1 loops, best of 3: 1.96 s per loop
Can't beat numexpr, but if it's not available we can save on the transcendentals (testing and benchmarking code heavily based on @Divakar's code, in case you didn't notice ;-) ):
import numpy as np
from timeit import timeit
def PP(t,a,b,T):
    CS = np.empty((len(t), len(a)-1), np.complex128)
    CS[...] = np.exp(2j*np.pi*(t[:, None])/T)
    np.cumprod(CS, axis=-1, out=CS)
    return a[1:].dot(CS.T.real) - b[1:].dot(CS.T.imag) + a[0]
def original_app(t,a,b,T):
    yn = np.ones(len(t))*a[0]
    for n in range(1,len(a)):
        yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
    return yn

def vectorized_app(t,a,b,T):
    r = np.arange(1,len(a))
    S = 2*np.pi*r[:,None]*t/T
    cS = np.cos(S)
    sS = np.sin(S)
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
n = 1000
t = 2000
t = np.random.randint(0,9,(t))
a = np.random.randint(0,9,(n))
b = np.random.randint(0,9,(n))
T = 3.45
print(np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T)))
print(np.allclose(original_app(t,a,b,T), PP(t,a,b,T)))
print('{:18s} {:9.6f}'.format('orig', timeit(lambda: original_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('Divakar no numexpr', timeit(lambda: vectorized_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('PP', timeit(lambda: PP(t,a,b,T), number=10)/10))
Prints:
True
True
orig 0.166903
Divakar no numexpr 0.179617
PP 0.060817
Btw., if the time step Δt divides T one can potentially save more, or even run the full FFT and discard what's not needed.
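For reference, here is a minimal sketch of that FFT route, assuming t is the uniform grid t_k = k*T/M for some M >= len(a) (the function name fft_app is made up):

import numpy as np

def fft_app(a, b, M, T):
    # a[n]*cos(th) - b[n]*sin(th) is the real part of (a[n] + 1j*b[n])*exp(1j*th),
    # so pack the coefficients into a complex array and let the inverse FFT
    # evaluate the whole sum on the grid t_k = k*T/M in one go.
    c = np.zeros(M, dtype=complex)
    c[:len(a)] = a + 1j*np.asarray(b)[:len(a)]
    c[0] = a[0]                      # the constant term has no sine part
    return M*np.fft.ifft(c).real     # np.fft.ifft carries a 1/M factor; undo it

On such a grid, t = np.arange(M)*T/M, this should match original_app(t, a, b, T) up to floating-point error.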
This is not really another answer but a comment on @Paul Panzer's, written as an answer because I needed to post some code. If there is a way to post properly formatted code in a comment, please advise.
Inspired by @Paul Panzer's cumprod idea, I came up with the following:
an = ones((len(a)-1,len(te)))*2j*pi*te/T
CS = exp(cumsum(an,axis=0))
out = (a[1:].dot(CS.real) - b[1:].dot(CS.imag)) + a[0]
Although it seems properly vectorized and produces correct results, its performance is miserable. It is not only much slower than the cumprod version, which is expected since len(a)-1 more exponentiations are performed, but also 50% slower than the original unvectorized version. What is the cause of this poor performance?
Related
I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since I have to use a matrix for the calculations, I thought numpy should be way faster than a list of lists, but when I tested both, the Python lists seemed to be way faster. I found this thread (Why use numpy over list based on speed?) but still: I'm comparing a preallocated numpy array against preallocated lists, and the list of lists is the clear winner.
Here is my code:
Lists
def edirDistance(x, y):
    x_dim = len(x)+1
    y_dim = len(y)+1
    D = []
    for i in range(x_dim):
        D.append([0] * (y_dim))
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D
Numpy
def NP_edirDistance(x, y):
    x_dim = len(x)+1
    y_dim = len(y)+1
    D = np.zeros((x_dim,y_dim))
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D
I'm not timing the np import.
a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)
total: 1.41 ms
%%time
NP_edirDistance(a, b)
total: 4.43 ms
Replacing D[i][j] by D[i,j] greatly improved the time (the change is sketched below), but it is still slower. (Thanks @Learning is a mess!)
total: 2.64 ms
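For reference, the change amounts to this (a sketch of the inner loop of NP_edirDistance only, same variable names as above):

# inside NP_edirDistance: index the 2D array with a single tuple instead of chained [i][j]
distHor = D[i, j-1] + 1
distVer = D[i-1, j] + 1
if x[i-1] == y[j-1]:
    distDiag = D[i-1, j-1]
else:
    distDiag = D[i-1, j-1] + 1
D[i, j] = min(distHor, distVer, distDiag)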
I tested with even larger DNA sequences (around 10,000 letters each) and the lists are still winning.
Can someone help me improve timing?
Are lists better for this use?
One way to get a faster run is to use accelerators such as numba and …. I tested your code with that a and b on Google Colab (TPU runtime) without using any accelerator:
1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop # NumPy
But using numba with nopython=True, without any changes to your code:
import numba as nb

@nb.njit()
def edirDistance(x, y):
    .
    .

@nb.njit()
def NP_edirDistance(x, y):
    .
    .
It gets:
1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop # NumPy
The difference between them will become significant with huge samples, or by improving and vectorizing your NumPy code. For samples of length 10,000, this method gives:
35.50053691864014
22.95994758605957 # NumPy (seconds)
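For completeness, here is a minimal sketch of how a jitted version might look if the sequences are first converted to integer arrays, so the compiled kernel only touches NumPy types (the function names are made up; numba is assumed to be installed):

import numpy as np
import numba as nb

@nb.njit(cache=True)
def _edit_distance_kernel(xa, ya):
    # same dynamic-programming recurrence as edirDistance, on uint8 codes
    D = np.zeros((xa.size + 1, ya.size + 1), dtype=np.int64)
    for i in range(xa.size + 1):
        D[i, 0] = i
    for j in range(ya.size + 1):
        D[0, j] = j
    for i in range(1, xa.size + 1):
        for j in range(1, ya.size + 1):
            cost = 0 if xa[i-1] == ya[j-1] else 1
            D[i, j] = min(D[i, j-1] + 1, D[i-1, j] + 1, D[i-1, j-1] + cost)
    return D

def nb_edirDistance(x, y):
    # one uint8 code per letter; keeps string handling out of the jitted code
    xa = np.frombuffer(x.encode(), dtype=np.uint8)
    ya = np.frombuffer(y.encode(), dtype=np.uint8)
    return _edit_distance_kernel(xa, ya)

nb_edirDistance(a, b) should produce the same distance matrix as NP_edirDistance(a, b), just with integer entries.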
I'm concerned with the speed of the following function:
def cch(tau):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)
Where "cartprod" is a variable for a list that looks like this:
cartprod = np.ndarray([[0.0123,0.0123],[0.0123,0.0459],...])
The length of this list is about 25 million. Basically, I'm trying to find a significantly faster way to return a list of differences for every pair list in that np.ndarray. Is there an algorithmic way or function that's faster than np.diff? Or, is np.diff the end all be all? I'm also open to anything else.
EDIT: Thank you all for your solutions!
I think you're hitting a wall by repeatedly returning multiple arrays of length ~25 million, rather than np.diff itself being slow. I wrote an equivalent function that iterates over the array and tallies the results as it goes along. The function needs to be jitted with numba to be fast. I hope that is acceptable.
import numpy as np
from numba import jit

arr = np.random.rand(25000000, 2)

def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)

%timeit cch(0.01, arr)

@jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count

%timeit cch_jit(0.01, arr)
produces
294 ms ± 2.82 ms
42.7 ms ± 483 µs
which is about ~6 times faster.
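One detail worth keeping in mind when timing jitted code: the first call includes compilation. A minimal warm-up sketch:

cch_jit(0.01, arr[:10])      # warm-up on a small slice triggers compilation
%timeit cch_jit(0.01, arr)   # now only the compiled code is being timed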
We can leverage multi-core processing via the numexpr module for large data, and gain memory efficiency (and hence performance) with some help from array slicing -
import numexpr as ne

def cch_numexpr(a, tau):
    d = {'a0':a[:,0], 'a1':a[:,1]}
    return np.count_nonzero(ne.evaluate('abs(a0-a1-tau)<0.001', d))
Sample run and timings on 25M sized data -
In [83]: cartprod = np.random.rand(25000000,2)
In [84]: cch(cartprod, tau=0.5) == cch_numexpr(cartprod, tau=0.5)
Out[84]: True
In [85]: %timeit cch(cartprod, tau=0.5)
10 loops, best of 3: 150 ms per loop
In [86]: %timeit cch_numexpr(cartprod, tau=0.5)
10 loops, best of 3: 25.5 ms per loop
Around 6x speedup.
This was with 8 threads. Thus, with more threads available for compute, it should improve further. There is a related post on how to control numexpr's multi-core functionality.
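For reference, a minimal sketch of capping numexpr's thread count (the number 4 is arbitrary):

import numexpr as ne

ne.set_num_threads(4)   # numexpr will use at most 4 threads from here on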
Just out of curiosity I compared the solutions of @Divakar (numexpr) and @alexdor (numba.jit). Judging by the numbers below, the numba jit implementation is roughly twice as fast as numexpr.evaluate. The results are shown for 100 runs each:
np.sum: 111.07543396949768
numexpr: 12.282189846038818
JIT: 6.2505223751068115
'np.sum' returns same result as 'numexpr'
'np.sum' returns same result as 'jit'
'numexpr' returns same result as 'jit'
Script to reproduce the results:
import numpy as np
import time
import numba
import numexpr
arr = np.random.rand(25000000, 2)
runs = 100
def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)

def cch_ne(tau, cartprod):
    d = {'a0':cartprod[:,0], 'a1':cartprod[:,1], 'tau': tau}
    count = np.count_nonzero(numexpr.evaluate('abs(a0-a1-tau)<0.001', d))
    return count

@numba.jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count

start = time.time()
for x in range(runs):
    x1 = cch(0.01, arr)
print('np.sum:\t\t', time.time() - start)

start = time.time()
for x in range(runs):
    x2 = cch_ne(0.01, arr)
print('numexpr:\t', time.time() - start)

x3 = cch_jit(0.01, arr)  # warm-up call so the timing below excludes compilation
start = time.time()
for x in range(runs):
    x3 = cch_jit(0.01, arr)
print('JIT:\t\t', time.time() - start)

if x1 == x2: print('\'np.sum\' returns same result as \'numexpr\'')
if x1 == x3: print('\'np.sum\' returns same result as \'jit\'')
if x2 == x3: print('\'numexpr\' returns same result as \'jit\'')
I have these 2 vectors A and B:
import numpy as np
A=np.array([1,2,3])
B=np.array([8,7])
and I want to add them up with this expression:
Result = sum((A-B)**2)
The expected result that I need is:
Result = np.array([X,Y])
Where:
X = (1-8)**2 + (2-8)**2 + (3-8)**2 = 110
Y = (1-7)**2 + (2-7)**2 + (3-7)**2 = 77
How can I do it? The two arrays are just an example; in my case I have very large arrays and I cannot do it manually.
You can make A a 2d array and utilize numpy's broadcasting property to vectorize the calculation:
((A[:, None] - B) ** 2).sum(0)
# array([110, 77])
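For clarity, a quick look at the intermediate shapes involved:

A[:, None].shape                        # (3, 1)
(A[:, None] - B).shape                  # (3, 2), broadcast against B's shape (2,)
((A[:, None] - B) ** 2).sum(0).shape    # (2,), one entry per element of B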
Since you have mentioned that you are working with large arrays, and with the focus on performance, here's an approach with np.einsum that performs the combined squaring and sum-reduction in one efficient step, like so -
def einsum_based(A,B):
    subs = A[:,None] - B
    return np.einsum('ij,ij->j', subs, subs)
Sample run -
In [16]: A = np.array([1,2,3])
...: B = np.array([8,7])
...:
In [17]: einsum_based(A,B)
Out[17]: array([110, 77])
Runtime test with large arrays scaling up the given sample 1000x -
In [8]: A = np.random.rand(3000)
In [9]: B = np.random.rand(2000)
In [10]: %timeit ((A[:, None] - B) ** 2).sum(0) # #Psidom's soln
10 loops, best of 3: 21 ms per loop
In [11]: %timeit einsum_based(A,B)
100 loops, best of 3: 12.3 ms per loop
I want to Solve Polynomial equation of 6th order with Python.
I've tried the "basic" version:
avgIrms = 19.61
c_val = (0.000002324*avgIrms**6) - (0.0001527*avgIrms**5) + (0.003961843*avgIrms**4) - (0.052211292*avgIrms**3) + (0.379269091*avgIrms**2) -(0.404399274*avgIrms) + 0.000682896
print(c_val)
After that I've used the numpy with the following code:
import numpy as np
avgIrms = 19.61
ppar = [0.000002324, -0.0001527, 0.003961843, -0.052211292, 0.379269091, -0.404399274, 0.000682896]
p = np.poly1d(ppar)
print(p(avgIrms))
Both ways, the Raspberry Pi takes more than five seconds to process... It's too much! Any help to solve polynomial equations efficiently? (less than one second...)
Thanks in advance,
Daniel
First, what you want is to evaluate a polynomial for a given x, not to solve it. Second, I still don't see how you are getting such slow timings.
Here are a couple of timings:
>>> import numpy as np
>>> x = 19.61
>>> pr = [0.000002324, -0.0001527, 0.003961843, -0.052211292, 0.379269091, -0.404399274, 0.000682896]
>>> p = pr[::-1] # reverse the order
Hardcoded solution:
>>> %timeit p[0] + x * p[1] + p[2] * x**2 + p[3] * x**3 + p[4] * x**4 + p[5] * x**5 + p[6] * x**6
809 ns
Loopy solution:
>>> %%timeit
val = 0
for i in range(len(p)):
    val += p[i] * x**i
1.24 µs
Functional programming solution:
>>> %timeit reduce(lambda acc, i: acc + p[i] * x**i, range(len(p)))
1.61 µs
Using numpy's polyval:
>>> %timeit np.polyval(pr, x)
6.12 µs
Using numpy's poly1d
>>> %%timeit
c = np.poly1d(pr)
c(x)
9.46 µs
So clearly numpy is slower, as for such a small array it adds some overhead in the Python <-> C communication, but still it is on the order of 6-9 µs. I'm using a desktop computer, but I would be pretty impressed if a Raspberry Pi really took 5 seconds to do that operation. Are you sure you did the timings properly?
Anyway, either the hardcoded or the loopy solution seems faster than the functional programming one (the equivalent of the one you defined as horner in your comment).
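For reference, a minimal sketch of a textbook Horner evaluation with the same highest-power-first coefficients pr (a guess; the exact horner code from the comment isn't shown here):

def horner(coeffs, x):
    # coefficients given highest power first, as in pr
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

horner(pr, 19.61)   # matches the hardcoded evaluation above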
I want to find the N largest values in a numpy array (and their indices). I know I can do it like the following:
import numpy as np
N=10
a=np.arange(1,100,1)
a.argsort()[-N:]
However, it is very slow since it does a full sort.
I wonder whether numpy provide some methods the do it fast.
numpy 1.8 implements partition and argpartition, which perform a partial sort in O(n) time, as opposed to a full sort which is O(n log n).
import numpy as np
test = np.array([9,1,3,4,8,7,2,5,6,0])
temp = np.argpartition(-test, 4)
result_args = temp[:4]
temp = np.partition(-test, 4)
result = -temp[:4]
Result:
>>> result_args
array([0, 4, 8, 5]) # indices of highest vals
>>> result
array([9, 8, 6, 7]) # highest vals
Timing:
In [16]: a = np.arange(10000)
In [17]: np.random.shuffle(a)
In [18]: %timeit np.argsort(a)
1000 loops, best of 3: 1.02 ms per loop
In [19]: %timeit np.argpartition(a, 100)
10000 loops, best of 3: 139 us per loop
In [20]: %timeit np.argpartition(a, 1000)
10000 loops, best of 3: 141 us per loop
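If you also need those largest values in sorted order, a cheap follow-up sort of just the selected elements does it (a minimal sketch reusing the test array from above):

idx = np.argpartition(test, -4)[-4:]            # indices of the 4 largest, in no particular order
idx_sorted = idx[np.argsort(test[idx])][::-1]   # reorder them, largest value first
# test[idx_sorted] -> array([9, 8, 7, 6])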
The bottleneck module has a fast partial sort method that works directly with Numpy arrays: bottleneck.partition().
Note that bottleneck.partition() returns the actual values sorted, if you want the indexes of the sorted values (what numpy.argsort() returns) you should use bottleneck.argpartition().
I've benchmarked:
z = -bottleneck.partition(-a, 10)[:10]
z = a.argsort()[-10:]
z = heapq.nlargest(10, a)
where a is a random 1,000,000-element array.
The timings were as follows:
bottleneck.partition(): 25.6 ms per loop
np.argsort(): 198 ms per loop
heapq.nlargest(): 358 ms per loop
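For the index variant mentioned above, a minimal sketch using the modern bottleneck API (bottleneck assumed installed; a is a random array as before):

import numpy as np
import bottleneck as bn

a = np.random.rand(1000000)
idx = bn.argpartition(a, a.size - 10)[-10:]   # indices of the 10 largest values, unordered
top10 = a[idx]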
I had this problem and, since this question is 5 years old, I had to redo all benchmarks and change the syntax of bottleneck (there is no partsort anymore, it's partition now).
I used the same arguments as kwgoodman, except the number of elements retrieved, which I increased to 50 (to better fit my particular situation).
I got these results:
bottleneck 1: 01.12 ms per loop
bottleneck 2: 00.95 ms per loop
pandas : 01.65 ms per loop
heapq : 08.61 ms per loop
numpy : 12.37 ms per loop
numpy 2 : 00.95 ms per loop
So, bottleneck_2 and numpy_2 (adas's solution) were tied.
But using np.percentile (numpy_2) you get those top-N elements already sorted, which is not the case for the other solutions. On the other hand, if you are also interested in the indices of those elements, percentile is not useful.
I added pandas too, which uses bottleneck underneath, if available (http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies). If you already have a pandas Series or DataFrame to start with, you are in good hands, just use nlargest and you're done.
The code used for the benchmark is as follows (python 3, please):
import time
import numpy as np
import bottleneck as bn
import pandas as pd
import heapq
def bottleneck_1(a, n):
    return -bn.partition(-a, n)[:n]

def bottleneck_2(a, n):
    return bn.partition(a, a.size-n)[-n:]

def numpy(a, n):
    return a[a.argsort()[-n:]]

def numpy_2(a, n):
    M = a.shape[0]
    perc = (np.arange(M-n,M)+1.0)/M*100
    return np.percentile(a,perc)

def pandas(a, n):
    return pd.Series(a).nlargest(n)

def hpq(a, n):
    return heapq.nlargest(n, a)

def do_nothing(a, n):
    return a[:n]

def benchmark(func, size=1000000, ntimes=100, topn=50):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a, topn)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop
t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(pandas)
t4 = benchmark(hpq)
t5 = benchmark(numpy)
t6 = benchmark(numpy_2)
t0 = benchmark(do_nothing)
print("bottleneck 1: {:05.2f} ms per loop".format(t1 - t0))
print("bottleneck 2: {:05.2f} ms per loop".format(t2 - t0))
print("pandas : {:05.2f} ms per loop".format(t3 - t0))
print("heapq : {:05.2f} ms per loop".format(t4 - t0))
print("numpy : {:05.2f} ms per loop".format(t5 - t0))
print("numpy 2 : {:05.2f} ms per loop".format(t6 - t0))
Each negative sign in the proposed bottleneck solution
-bottleneck.partsort(-a, 10)[:10]
makes a copy of the data. We can remove the copies by doing
bottleneck.partsort(a, a.size-10)[-10:]
Also the proposed numpy solution
a.argsort()[-10:]
returns indices not values. The fix is to use the indices to find the values:
a[a.argsort()[-10:]]
The relative speed of the two bottleneck solutions depends on the ordering of the elements in the initial array because the two approaches partition the data at different points.
In other words, timing with any one particular random array can make either method look faster.
Averaging the timing across 100 random arrays, each with 1,000,000 elements, gives
-bn.partsort(-a, 10)[:10]: 1.76 ms per loop
bn.partsort(a, a.size-10)[-10:]: 0.92 ms per loop
a[a.argsort()[-10:]]: 15.34 ms per loop
where the timing code is as follows:
import time
import numpy as np
import bottleneck as bn
def bottleneck_1(a):
    return -bn.partsort(-a, 10)[:10]

def bottleneck_2(a):
    return bn.partsort(a, a.size-10)[-10:]

def numpy(a):
    return a[a.argsort()[-10:]]

def do_nothing(a):
    return a

def benchmark(func, size=1000000, ntimes=100):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop
t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(numpy)
t4 = benchmark(do_nothing)
print "-bn.partsort(-a, 10)[:10]: %0.2f ms per loop" % (t1 - t4)
print "bn.partsort(a, a.size-10)[-10:]: %0.2f ms per loop" % (t2 - t4)
print "a[a.argsort()[-10:]]: %0.2f ms per loop" % (t3 - t4)
Perhaps heapq.nlargest
import numpy as np
import heapq
x = np.array([1,-5,4,6,-3,3])
z = heapq.nlargest(3,x)
Result:
>>> z
[6, 4, 3]
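If the indices are needed as well, heapq.nlargest can rank positions through its key argument (a minimal sketch on the same x):

idx = heapq.nlargest(3, range(len(x)), key=x.__getitem__)
# idx -> [3, 2, 5], the positions of the 3 largest values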
If you want to find the indices of the n largest elements using bottleneck you could use
bottleneck.argpartsort
>>> x = np.array([1,-5,4,6,-3,3])
>>> z = bottleneck.argpartsort(-x, 3)[:3]
>>> z
array([3, 2, 5])
You can also use numpy's percentile function. In my case it was slightly faster than bottleneck.partsort():
import timeit
import numpy as np
import bottleneck as bn

N,M,K = 10,1000000,100

start = timeit.default_timer()
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = -bn.partsort(-a, N)[:N]
stop = timeit.default_timer()
print (stop - start)/K

start = timeit.default_timer()
perc = (np.arange(M-N,M)+1.0)/M*100
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = np.percentile(a,perc)
stop = timeit.default_timer()
print (stop - start)/K
Average time per loop:
bottleneck.partsort(): 59 ms
np.percentile(): 54 ms
If storing the array as a list of numbers isn't problematic, you can use
import heapq
heapq.nlargest(N, a)
to get the N largest members.