Code runs much faster in C than in NumPy

Code runs much faster in C than in NumPy - python

I wrote physics simulation code in python using numpy and than rewrote it to C++. in C++ it takes only 0.5 seconds while in python around 40s. Can someone please help my find what I did horribly wrong?
import numpy as np
def myFunc(i):
uH = np.copy(u)
for j in range(1, xmax-1):
u[i][j] = a*uH[i][j-1]+(1-2*a)*uH[i][j]+a*uH[i][j+1]
u[i][0] = u[i][0]/b
for x in range(1, xmax):
u[i][x] = (u[i][x]+a*u[i][x-1])/(b+a*c[x-1])
for x in range(xmax-2,-1,-1):
u[i][x]=u[i][x]-c[x]*u[i][x+1]
xmax = 101
tmax = 2000
#All other variables are defined here but I removed that for visibility
uH = np.zeros((xmax,xmax))
u = np.zeros((xmax,xmax))
c = np.full(xmax,-a)
uH[50][50] = 10000
for t in range(1, tmax):
if t % 2 == 0:
for i in range(0,xmax):
myFunc(i)
else:
for i in range(0, xmax):
myFunc(i)
In case someones wants to run it here is whole code: http://pastebin.com/20ZSpBqQ
EDIT: all variables are defined in the whole code which can be found on pastebin. Sorry for confusion, I thought removing all the clutter will make the code easier to understand

fundamentally, C is a compiled language, when Python is a interpreted one, speed against ease of use.
Numpy can fill the gap, but you must avoid for loop on items, which need often
some skills.
For exemple,
def block1():
for i in range(xmax):
for j in range(1, xmax-1):
u[i][j] = a*uH[i][j-1]+(1-2*a)*uH[i][j]+a*uH[i][j+1]
is in numpy style :
def block2():
u[:,1:-1] += a*np.diff(u,2)
with is shorter and faster ( and easier to read and understand ?) :
In [37]: %timeit block1()
10 loops, best of 3: 25.8 ms per loop
In [38]: %timeit block2()
10000 loops, best of 3: 123 µs per loop
At last, you can speed numpy code with Just In Time compilation, what is allowed with Numba. Just change the beginning of your code like :
import numba
#numba.jit
def myFunc(u,i):
...
and the calls by myFunc(u,i) at the end of the script (u must be a parameter for automatic determination of types) and you will reach the same performance (0,4 s on my PC).

So when I ran your numpy python code it took four minutes to run, once I removed the numpy code and replaced it with standard python code it only took one minute! (I have a not so fast computer)
Here's that code:
#import numpy as np
def impl(i,row):
if row:
uH = u[:][:] # this copys the array 'u'
for j in range(1, xmax-1):
u[i][j] = a*uH[i][j-1]+(1-2*a)*uH[i][j]+a*uH[i][j+1]
u[i][0] = u[i][0]/b
for x in range(1, xmax):
u[i][x] = (u[i][x]+a*u[i][x-1])/(b+a*c[x-1])
for x in range(xmax-2,-1,-1):
u[i][x]=u[i][x]-c[x]*u[i][x+1]
else:
uH = u[:][:] # this copys the array 'u'
for j in range(1, xmax-1):
u[j][i]= a*uH[j-1][i]+(1-2*a)*uH[j][i]+a*uH[j+1][i]
u[0][i] = u[0][i]/b
for y in range(1, xmax):
u[y][i] = (u[y][i]+a*u[y-1][i])/(b+a*c[y-1])
for y in range(xmax-2,-1,-1):
u[y][i]=u[y][i]-c[y]*u[y+1][i]
#Init
xmax = 101
tmax = 2000
D = 0.5
l = 1
tSec = 0.1
uH = [[0.0]*xmax]*xmax #np.zeros((xmax,xmax))
u = [[0.0]*xmax]*xmax #np.zeros((xmax,xmax))
dx = l / xmax
dt = tSec / tmax
a = (D*dt)/(dx*dx);
b=1+2*a
print("dx=="+str(dx))
print("dt=="+str(dt))
print(" a=="+str(a))
#koeficient c v trojdiagonalnej matici
c = [-a]*xmax #np.full(xmax,-a)
c[0]=c[0]/b
for i in range(1, xmax):
c[i]=c[i]/(b+a*c[i-1])
uH[50][50] = 10000
u = uH
for t in range(1, tmax):
if t % 2 == 0:
for i in range(0,xmax):
impl(i,False)
else:
for i in range(0, xmax):
impl(i,True)
I believe that this could be much faster if you were to have used numpy the correct way rather than as a substitute for arrays, however, not using numpy arrays cut the time to 1/4th of the original.

Related

Numpy array vs list of lists - editing values one by one (help implementing)

I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since i have to use a matrix for the calculations, i thought numpy should be way faster than a list of lists, but as I tested both, the python lists seem to be way faster. I found this thread (Why use numpy over list based on speed?) but still; i'm using preallocated numpy vs preallocated lists and list of lists are the clear winners.
Here is my code:
Lists
def edirDistance(x, y):
x_dim = len(x)+1
y_dim = len(y)+1
D = []
for i in range(x_dim):
D.append([0] * (y_dim))
#Filling the matrix borders
for i in range(x_dim):
D[i][0] = i
for i in range(y_dim):
D[0][i] = i
for i in range(1, x_dim):
for j in range(1, y_dim):
distHor = D[i][j-1] + 1
distVer = D[i-1][j] + 1
if x[i-1] == y[j-1]:
distDiag = D[i-1][j-1]
else:
distDiag = D[i-1][j-1] + 1
D[i][j] = min(distHor, distVer,distDiag)
return D
Numpy
def NP_edirDistance(x, y):
x_dim = len(x)+1
y_dim = len(y)+1
D = np.zeros((x_dim,y_dim))
#Filling the matrix borders
for i in range(x_dim):
D[i][0] = i
for i in range(y_dim):
D[0][i] = i
for i in range(1, x_dim):
for j in range(1, y_dim):
distHor = D[i][j-1] + 1
distVer = D[i-1][j] + 1
if x[i-1] == y[j-1]:
distDiag = D[i-1][j-1]
else:
distDiag = D[i-1][j-1] + 1
D[i][j] = min(distHor, distVer,distDiag)
return D
I'm not timing the np import.
a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)
total: 1.41 ms
%%time
NP_edirDistance(a, b)
total: 4.43 ms
Replacing D[i][j] by D[i,j] greatly improved time, but still slower. (Thanks #Learning is a mess !)
total: 2.64 ms
I tested with even larger DNA sequences (around 10.000 letters each) and still lists are winning.
Can someone help me improve timing?
Are lists better for this use?

One way to have faster run is to use GPU/TPU-aided accelerators such as numba and …. I have tested your codes by that a and b on google colab TPU without using accelerators:
1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop # NumPy
But with using numba as nopython=True, without any changes to your codes:
import numba as nb
#nb.njit()
def edirDistance(x, y):
.
.
#nb.njit()
def NP_edirDistance(x, y):
.
.
It gets:
1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop # NumPy
Which will get significant difference between them using huge samples or by improving and vectorizing your NumPy codes. This method results as below for samples with 10000 length:
35.50053691864014
22.95994758605957 # NumPy (seconds)

Trying to optimize my complex function to excute in a polynomial time

I have this code that generate all the 2**40 possible binary numbers, and from this binary numbers, i will try to get all the vectors that match my objectif function conditions which is:
1- each vector in the matrix must have 20 of ones(1).
2- the sum of s = s + (the index of one +1)* the rank of the one must equal 4970.
i wrote this code but it will take a lot of time maybe months, to give the results. Now, i am looking for an alternative way or an optimization of this code if possible.
import time
from multiprocessing import Process
from multiprocessing import Pool
import numpy as np
import itertools
import numpy
CC = 20
#test if there is 20 numbers of 1
def test1numebers(v,x=1,x_l=CC):
c = 0
for i in range(len(v)):
if(v[i]==x):
c+=1
if c == x_l:
return True
else:
return False
#s = s+ the nth of 1 * (index+1)
def objectif_function(v,x=1):
s = 0
for i in range(len(v)):
if(v[i]==x):
s = s+((i+1)*nthi(v,i))
return s
#calculate the nth of 1 in a vecteur
def nthi(v,i):
c = 0
for j in range(0,i+1):
if(v[j] == 1):
c+=1
return c
#generate 2**40 of all possible binray numbers
def generateMatrix(N):
l = itertools.product([0, 1], repeat=N)
return l
#function that get the number of valide vector that match our objectif function
def main_algo(N=40,S=4970):
#N = 40
m = generateMatrix(N)
#S = 4970
c = 0
ii = 0
for i in m:
ii+=1
print("\n count:",ii)
xx = i
if(test1numebers(xx)):
if(objectif_function(xx)==S):
c+=1
print('found one')
print('\n',xx,'\n')
if ii>=1000000:
break
t_end = time.time()
print('time taken for 10**6 is: ',t_end-t_start)
print(c)
#main_algo()
if __name__ == '__main__':
'''p = Process(target=main_algo, args=(40,4970,))
p.start()
p.join()'''
p = Pool(150)
print(p.map(main_algo, [40,4970]))

While you could make a lot of improvements in readability and make your code more pythonic.
I recommend that you use numpy which is the fastest way of working with matrixes.
Avoid working with matrixes on a "pixel by pixel" loop. With numpy you can make those calculations faster and with all the data at once.
Also numpy has support for generating matrixes really fast. I think that you could make a random [0,1] matrix in less lines of code and quite faster.
Also i recommend that you install OPENBLAS, ATLAS and LAPACK which make linear algebra calculations quite faster.
I hope this helps you.

list comprehension convolution

I have a working code like this, but it is rather slow.
def halfconvolution(g,w,dz):
convo=np.zeros_like(g)
for i in range(0,len(g)):
sum=0
for j in range(0,i):
sum+=g[j]*w[(i-j)]*dz
convo[i] = -sum
return convo
I am trying to turn it into a list comprehension, but I am struggling.
I tried:
convo=[-g*w[i-j] for i in g for j in w]

I am not sure if this improves the performance, but it is a list comprehension as you asked
convo = [-sum(g[j] * w[i - j] * dz for j in range(0, i)) for i in range(0, len(g))]
A faster implementation using NumPy:
# make the matrices square
g = np.repeat(g, g.shape[0]).reshape(g.shape[0], g.shape[0], order='F')
w = np.repeat(w, w.shape[0]).reshape(w.shape[0], w.shape[0], order='F')
# take the lower half of g
g = np.tril(g, k=-1)
# shift each column by its index number
# see: https://stackoverflow.com/questions/20360675/roll-rows-of-a-matrix-independently
rows_w, column_indices_w = np.ogrid[:w.shape[0], :w.shape[1]]
shift = np.arange(w.shape[0])
shift[shift < 0] += w.shape[1]
w = w[rows_w, column_indices_w - shift[:,np.newaxis]].T
convo = np.sum(g * w, axis=1) * dz
For it to work it needs both w and g to be of the same size, but otherwise I'm sure a workaround can be found.
I hope this is a more acceptable speedup for you? Always try to rewrite your logic/problem into vector/matrix multiplications.

The inner loop can be replaced by the sum function (don't override it with a variable of the same name)
Then you append the outer loop to the end of that
[-sum(g[j]*w[i-j]*dz for j in range(i)) for i in range(len(g))]

Don't use list comprehensions for performance reasons
Use
Numba
Cython
Vectorized Numpy operations
Numba
import numba as nb
import numpy as np
import time
#nb.njit(fastmath=True)
def halfconvolution(g,w,dz):
convo=np.empty(g.shape[0],dtype=g.dtype)
for i in range(g.shape[0]):
sum=0.
for j in range(0,i):
sum+=g[j]*w[(i-j)]*dz
convo[i] = -sum
return convo
g=np.random.rand(1000)
w=np.random.rand(1000)
dz=0.15
t1=time.time()
for i in range(1000):
#res=halfconvolution(g,w,dz)
res=[-sum(g[j]*w[i-j]*dz for j in range(i)) for i in range(len(g))]
print(time.time()-t1)
print("Done")
Performance
List Comprehension: 0.27s per iteration
Numba Version: 0.6ms per iteration
So there is a factor 500 between this two versions. If you wan't to call this function on multiple arrays at once, you can also parallelize this problem easily and you should get at least another "Number of Cores" speed up.

How to vectorize fourier series partial sum in numpy

Given the Fourier series coefficients a[n] and b[n] (for cosines and sines respectively) of a function with period T and t an equally spaced interval the following code will evaluate the partial sum for all points in interval t (a,b,t are all numpy arrays). It is clarified that len(t) <> len(a).
yn=ones(len(t))*a[0]
for n in range(1,len(a)):
yn=yn+(a[n]*cos(2*pi*n*t/T)-b[n]*sin(2*pi*n*t/T))
My question is: Can this for loop be vectorized?

Here's one vectorized approach making use broadcasting to create the 2D array version of cosine/sine input : 2*pi*n*t/T and then using matrix-multiplication with np.dot for the sum-reduction -
r = np.arange(1,len(a))
S = 2*np.pi*r[:,None]*t/T
cS = np.cos(S)
sS = np.sin(S)
out = a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Further performance boost
For further boost, we can make use of numexpr module to compute those trignometric steps -
import numexpr as ne
cS = ne.evaluate('cos(S)')
sS = ne.evaluate('sin(S)')
Runtime test -
Approaches -
def original_app(t,a,b,T):
yn=np.ones(len(t))*a[0]
for n in range(1,len(a)):
yn=yn+(a[n]*np.cos(2*np.pi*n*t/T)-b[n]*np.sin(2*np.pi*n*t/T))
return yn
def vectorized_app(t,a,b,T):
r = np.arange(1,len(a))
S = (2*np.pi/T)*r[:,None]*t
cS = np.cos(S)
sS = np.sin(S)
return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
def vectorized_app_v2(t,a,b,T):
r = np.arange(1,len(a))
S = (2*np.pi/T)*r[:,None]*t
cS = ne.evaluate('cos(S)')
sS = ne.evaluate('sin(S)')
return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Also, including function PP from #Paul Panzer's post.
Timings -
In [22]: # Setup inputs
...: n = 10000
...: t = np.random.randint(0,9,(n))
...: a = np.random.randint(0,9,(n))
...: b = np.random.randint(0,9,(n))
...: T = 3.45
...:
In [23]: print np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), vectorized_app_v2(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), PP(t,a,b,T))
...:
True
True
True
In [25]: %timeit original_app(t,a,b,T)
...: %timeit vectorized_app(t,a,b,T)
...: %timeit vectorized_app_v2(t,a,b,T)
...: %timeit PP(t,a,b,T)
...:
1 loops, best of 3: 6.49 s per loop
1 loops, best of 3: 6.24 s per loop
1 loops, best of 3: 1.54 s per loop
1 loops, best of 3: 1.96 s per loop

Can't beat numexpr, but if it's not available we can save on the transcendentals (testing and benchmarking code heavily based on #Divakar's code in case you didn't notice ;-) ):
import numpy as np
from timeit import timeit
def PP(t,a,b,T):
CS = np.empty((len(t), len(a)-1), np.complex)
CS[...] = np.exp(2j*np.pi*(t[:, None])/T)
np.cumprod(CS, axis=-1, out=CS)
return a[1:].dot(CS.T.real) - b[1:].dot(CS.T.imag) + a[0]
def original_app(t,a,b,T):
yn=np.ones(len(t))*a[0]
for n in range(1,len(a)):
yn=yn+(a[n]*np.cos(2*np.pi*n*t/T)-b[n]*np.sin(2*np.pi*n*t/T))
return yn
def vectorized_app(t,a,b,T):
r = np.arange(1,len(a))
S = 2*np.pi*r[:,None]*t/T
cS = np.cos(S)
sS = np.sin(S)
return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
n = 1000
t = 2000
t = np.random.randint(0,9,(t))
a = np.random.randint(0,9,(n))
b = np.random.randint(0,9,(n))
T = 3.45
print(np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T)))
print(np.allclose(original_app(t,a,b,T), PP(t,a,b,T)))
print('{:18s} {:9.6f}'.format('orig', timeit(lambda: original_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('Divakar no numexpr', timeit(lambda: vectorized_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('PP', timeit(lambda: PP(t,a,b,T), number=10)/10))
Prints:
True
True
orig 0.166903
Divakar no numexpr 0.179617
PP 0.060817
Btw. if delta t divides T one can potentially save more, or even run the full fft and discard what's too much.

This is not really another answer but a comment on #Paul Panzer's one, written as an answer because I needed to post some code. If there is a way to post propely formatted code in a comment please advice.
Inspired by #Paul Panzer cumprod idea, I came up with the following:
an = ones((len(a)-1,len(te)))*2j*pi*te/T
CS = exp(cumsum(an,axis=0))
out = (a[1:].dot(CS.real) - b[1:].dot(CS.imag)) + a[0]
Although it seems properly vectorized and produces correct results, its performance is miserable. It is not only much slower than the cumprod, which is expected as len(a)-1 exponentiations more are made, but 50% slower than the original unvectorized version. What is the cause of this poor performance?

Speeding up newton-raphson in pandas/python

I'm currently iterating through a very large set of data ~85GB (~600M lines) and simply using newton-raphson to compute a new parameter. As of right now my code is extremely slow, any tips on how to speed it up? The methods from BSCallClass & BSPutClass are closed-form, so there's nothing really to speed up there. Thanks.
class NewtonRaphson:
def __init__(self, theObject):
self.theObject = theObject
def solve(self, Target, Start, Tolerance, maxiter=500):
y = self.theObject.Price(Start)
x = Start
i = 0
while (abs(y - Target) > Tolerance):
i += 1
d = self.theObject.Vega(x)
x += (Target - y) / d
y = self.theObject.Price(x)
if i > maxiter:
x = nan
break
return x
def main():
for row in a.iterrows():
print row[1]["X.1"]
T = (row[1]["X.7"] - row[1]["X.8"]).days
Spot = row[1]["X.2"]
Strike = row[1]["X.9"]
MktPrice = abs(row[1]["X.10"]-row[1]["X.11"])/2
CPflag = row[1]["X.6"]
if CPflag == 'call':
option = BSCallClass(0, 0, T, Spot, Strike)
elif CPflag == 'put':
option = BSPutClass(0, 0, T, Spot, Strike)
a["X.15"][row[0]] = NewtonRaphson(option).solve(MktPrice, .05, .0001)
EDIT:
For those curious, I ended up speeding this entire process significantly by using the scipy suggestion, as well as using the multiprocessing module.

Don't code your own Newton-Raphson method in Python. You'll get better performance using one of the root finders in scipy.optimize such as brentq or newton.
(Presumably, if you have pandas, you'd also install scipy.)
Back of the envelope calculation:
Making 600M calls to brentq should be manageable on standard hardware:
import scipy.optimize as optimize
def f(x):
return x**2 - 2
In [28]: %timeit optimize.brentq(f, 0, 10)
100000 loops, best of 3: 4.86 us per loop
So if each call to optimize.brentq takes 4.86 microseconds, 600M calls will take about 4.86 * 600 ~ 3000 seconds ~ 1 hour.
newton may be slower, but still manageable:
def f(x):
return x**2 - 2
def fprime(x):
return 2*x
In [40]: %timeit optimize.newton(f, 10, fprime)
100000 loops, best of 3: 8.22 us per loop

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Code runs much faster in C than in NumPy - python

Related

Numpy array vs list of lists - editing values one by one (help implementing)

Trying to optimize my complex function to excute in a polynomial time

list comprehension convolution

How to vectorize fourier series partial sum in numpy

Speeding up newton-raphson in pandas/python

Categories

Resources