Cython not fast enough - python

I rewrote my python loop in cython expecting a large improvement in speed.
I only get about a factor of four. Am I doing something wrong?
This is the code without cython:
import numpy as np
import itertools as itr
import math

def Pk(b, f, mu, k):  # k is in Mpc
    isoPk = 200*math.exp(-(k-0.02)**2/2/0.01**2)  # Isotropic power spectrum
    power = (b+mu**2*f)**2*isoPk
    return power

def Gendk(N, kvec, Pk, b, f, deltak3d):
    Nhalf = int(N/2)
    for xx, yy, zz in itr.product(range(0,N), range(0,N), range(0,Nhalf+1)):
        kx = kvec[xx]
        ky = kvec[yy]
        kz = kvec[zz]
        kk = math.sqrt(kx**2+ky**2+kz**2)
        if kk == 0:
            continue
        mu = kz/kk
        power = Pk(b, f, mu, kk)
        if power==0:
            deltaRe = 0
            deltaIm = 0
        else:
            deltaRe = np.random.normal(0, power/2.0)
            if (xx==0 or xx==Nhalf) and (yy==0 or yy==Nhalf) and (zz==0 or zz==Nhalf):
                deltaIm = 0
            else:
                deltaIm = np.random.normal(0, power/2.0)
        x_conj = (2*N-xx)%N
        y_conj = (2*N-yy)%N
        z_conj = (2*N-zz)%N
        deltak3d[xx,yy,zz] = deltaRe + deltaIm*1j
        deltak3d[x_conj,y_conj,z_conj] = deltaRe - deltaIm*1j
Ntot = 300000
L = 1000
N = 128
Nhalf = int(N/2)
kmax = 5.0
dk = kmax/N
kvec = np.fft.fftfreq(N, L/N)
dL = L/N
deltak3d = np.zeros((N,N,N), dtype=complex)
deltak3d[0,0,0] = Ntot
Gendk(N, kvec, Pk, 2, 1, deltak3d)
This is the version with cython:
import numpy as np
import pyximport; pyximport.install(setup_args={"include_dirs":np.get_include()})
import testGauss as tG
Ntot = 300000
L = 1000
N = 128
Nhalf = int(N/2)
kmax = 5.0
dk = kmax/N
kvec = np.fft.fftfreq(N, L/N)
dL = L/N
deltak3d = np.zeros((N,N,N), dtype=complex)
deltak3d[0,0,0] = Ntot
tG.Gendk(N, kvec, tG.Pk, 2, 1, deltak3d)
and the testGauss.pyx file is:
import math
import numpy as np
cimport numpy as np
import itertools as itr

def Pk(double b, double f, double mu, double k):  # k is in Mpc
    cdef double isoPk, power
    isoPk = 200*math.exp(-(k-0.02)**2/2/0.01**2)  # Isotropic power spectrum
    power = (b+mu**2*f)**2*isoPk
    return power

def Gendk(int N, np.ndarray[np.float64_t,ndim=1] kvec, Pk, double b, double f, np.ndarray[np.complex128_t,ndim=3] deltak3d):
    cdef int Nhalf = int(N/2)
    cdef int xx, yy, zz
    cdef int x_conj, y_conj, z_conj
    cdef double kx, ky, kz, kk
    cdef mu
    cdef power
    cdef deltaRe, deltaIm
    for xx, yy, zz in itr.product(range(0,N), range(0,N), range(0,Nhalf+1)):
        kx = kvec[xx]
        ky = kvec[yy]
        kz = kvec[zz]
        kk = math.sqrt(kx**2+ky**2+kz**2)
        if kk == 0:
            continue
        mu = kz/kk
        power = Pk(b, f, mu, kk)
        if power==0:
            deltaRe = 0
            deltaIm = 0
        else:
            deltaRe = np.random.normal(0, power/2.0)
            if (xx==0 or xx==Nhalf) and (yy==0 or yy==Nhalf) and (zz==0 or zz==Nhalf):
                deltaIm = 0
            else:
                deltaIm = np.random.normal(0, power/2.0)
        x_conj = (2*N-xx)%N
        y_conj = (2*N-yy)%N
        z_conj = (2*N-zz)%N
        deltak3d[xx,yy,zz] = deltaRe + deltaIm*1j
        deltak3d[x_conj,y_conj,z_conj] = deltaRe - deltaIm*1j
Thank you very much in advance!

You could get some speedup by replacing
import math
with
from libc cimport math
That will avoid a Python function call when you do sqrt and exp, replacing it with a direct C call (which should be a lot faster).
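For example, a minimal sketch of what that change could look like in testGauss.pyx (only the math calls change; the rest of the file stays the same):
from libc cimport math

def Pk(double b, double f, double mu, double k):
    cdef double isoPk = 200*math.exp(-(k-0.02)**2/2/0.01**2)  # C-level exp, no Python call overhead
    return (b + mu**2*f)**2*isoPk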
I'm also slightly concerned at the calls to np.random.normal inside your loop, which add a reasonable python overhead each time. It might well be quicker to call this before the loop to generate a large array of random numbers (with the overhead of a single python call) then overwrite them with 0 if they aren't needed inside the loop.
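A rough sketch of that idea (untested; it relies on the fact that scaling a standard normal by s gives a normal with scale s, and is_boundary is just a stand-in for the (xx==0 or xx==Nhalf)... check from the original loop):
randn = np.random.normal(0.0, 1.0, size=(N, N, Nhalf + 1, 2))  # one big draw before the loop
# ... then inside the loop, instead of calling np.random.normal each time:
deltaRe = randn[xx, yy, zz, 0]*power/2.0
deltaIm = 0 if is_boundary else randn[xx, yy, zz, 1]*power/2.0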
The general advice for optimising Cython still applies: run
cython -a your_file.pyx
Look at the HTML, and worry about bits highlighted yellow (but only if they're called often)

Use cProfile to profile your Python code. Maybe the most CPU intensive tasks are in NumPy already. Then there is not so much to gain from Cython.
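For example, sorting by cumulative time shows at a glance whether the time goes into your own loop or into NumPy internals:
python -m cProfile -s cumtime your_script.py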

Turning your code (slightly modified) into a native module with Pythran gives me a 50x speedup.
import numpy as np
import itertools as itr
import math
from random import gauss as normal

def Pk(b, f, mu, k):  # k is in Mpc
    isoPk = 200*math.exp(-(k-0.02)**2/2/0.01**2)  # Isotropic power spectrum
    power = (b+mu**2*f)**2*isoPk
    return power

#pythran export Gendk(int, float[], int, int, complex[][][])
def Gendk(N, kvec, b, f, deltak3d):
    Nhalf = int(N/2)
    for xx, yy, zz in itr.product(range(0, N), range(0, N), range(0, Nhalf+1)):
        kx = kvec[xx]
        ky = kvec[yy]
        kz = kvec[zz]
        kk = math.sqrt(kx**2+ky**2+kz**2)
        if kk == 0:
            continue
        mu = kz/kk
        power = Pk(b, f, mu, kk)
        if power == 0:
            deltaRe = 0
            deltaIm = 0
        else:
            # deltaRe = np.random.normal(0, power/2.0)
            deltaRe = normal(0, power/2.0)
            if (xx == 0 or xx == Nhalf) and (yy == 0 or yy == Nhalf) and (zz == 0 or zz == Nhalf):
                deltaIm = 0
            else:
                #deltaIm = np.random.normal(0, power/2.0)
                deltaIm = normal(0, power/2.0)
        x_conj = (2*N-xx)%N
        y_conj = (2*N-yy)%N
        z_conj = (2*N-zz)%N
        deltak3d[xx, yy, zz] = deltaRe + deltaIm*1j
        deltak3d[x_conj, y_conj, z_conj] = deltaRe - deltaIm*1j
Compiled with:
$ pythran tg.py
And tested with:
$ python -m timeit -s 'import numpy as np; Ntot = 30000; L = 1000; N = 12; Nhalf = int(N/2); kmax = 5.0; dk = kmax/N; kvec = np.fft.fftfreq(N, L/N); dL = L/N; deltak3d = np.zeros((N, N, N), dtype=complex); deltak3d[0, 0, 0] = Ntot; from tg import Gendk' 'Gendk(N, kvec, 2, 1, deltak3d)'
I get 10 loops, best of 3: 29.4 msec per loop for the CPython run and 1000 loops, best of 3: 587 usec per loop for the Pythran run.
Disclaimer: I'm a Pythran dev.

Related

Runge Kutta 4th order Python

I am trying to solve this equation using Runge Kutta 4th order:
applying d2Q/dt2 = F(y, x, v) and dQ/dt = u, Q = y in my program.
I try to run the code but I get this error:
Traceback (most recent call last):
File "C:\Users\Egw\Desktop\Analysh\Askhsh1\asdasda.py", line 28, in <module>
k1 = F(y, u, x) #(x, v, t)
File "C:\Users\Egw\Desktop\Analysh\Askhsh1\asdasda.py", line 13, in F
return ((Vo/L -(R0/L)*u -(R1/L)*u**3 - y*(1/L*C)))
OverflowError: (34, 'Result too large')
I tried using the decimal library but I still couldn't make it work properly. I might not have used it properly, though.
My code is this one:
import numpy as np
from math import pi
from numpy import arange
from matplotlib.pyplot import plot, show

#parameters
R0 = 200
R1 = 250
L = 15
h = 0.002
Vo = 1000
C = 4.2*10**(-6)
t = 0.93

def F(y, u, x):
    return ((Vo/L -(R0/L)*u -(R1/L)*u**3 - y*(1/L*C)))

xpoints = arange(0,t,h)
ypoints = []
upoints = []
y = 0.0
u = Vo/L
for x in xpoints:
    ypoints.append(y)
    upoints.append(u)
    m1 = u
    k1 = F(y, u, x) #(x, v, t)
    m2 = h*(u + 0.5*k1)
    k2 = (h*F(y+0.5*m1, u+0.5*k1, x+0.5*h))
    m3 = h*(u + 0.5*k2)
    k3 = h*F(y+0.5*m2, u+0.5*k2, x+0.5*h)
    m4 = h*(u + k3)
    k4 = h*F(y+m3, u+k3, x+h)
    y += (m1 + 2*m2 + 2*m3 + m4)/6
    u += (k1 + 2*k2 + 2*k3 + k4)/6

plot(xpoints, upoints)
show()
plot(xpoints, ypoints)
show()
I expected to get the plots of u and y against t.
It turns out I messed up the equations I was using for Runge-Kutta.
The correct code is the following:
import numpy as np
from math import pi
from numpy import arange
from matplotlib.pyplot import plot, show

#parameters
R0 = 200
R1 = 250
L = 15
h = 0.002
Vo = 1000
C = 4.2*10**(-6)
t0 = 0

#dz/dx
def G(x, y, z):
    return Vo/L -(R0/L)*z -(R1/L)*z**3 - y/(L*C)

#dy/dx
def F(x, y, z):
    return z

t = np.arange(t0, 0.93, h)
x = np.zeros(len(t))
y = np.zeros(len(t))
z = np.zeros(len(t))
y[0] = 0.0
z[0] = 0

for i in range(1, len(t)):
    k0 = h*F(x[i-1], y[i-1], z[i-1])
    l0 = h*G(x[i-1], y[i-1], z[i-1])
    k1 = h*F(x[i-1]+h*0.5, y[i-1]+k0*0.5, z[i-1]+l0*0.5)
    l1 = h*G(x[i-1]+h*0.5, y[i-1]+k0*0.5, z[i-1]+l0*0.5)
    k2 = h*F(x[i-1]+h*0.5, y[i-1]+k1*0.5, z[i-1]+l1*0.5)
    l2 = h*G(x[i-1]+h*0.5, y[i-1]+k1*0.5, z[i-1]+l1*0.5)
    k3 = h*F(x[i-1]+h, y[i-1]+k2, z[i-1]+l2)
    l3 = h*G(x[i-1]+h, y[i-1]+k2, z[i-1]+l2)
    y[i] = y[i-1] + (k0+2*k1+2*k2+k3)/6
    z[i] = z[i-1] + (l0+2*l1+2*l2+l3)/6

Q = y
I = z
plot(t, Q)
show()
plot(t, I)
show()
If I may draw your attention to these 4 lines
m1 = u
k1 = F(y, u, x) #(x, v, t)
m2 = h*(u + 0.5*k1)
k2 = (h*F(y+0.5*m1, u+0.5*k1, x+0.5*h))
You should note a fundamental structural difference between the first two lines and the second pair of lines.
You need to multiply with the step size h also in the first pair.
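That is, the first pair should also carry the step size, along the lines of (only these lines change; the rest of the loop stays as it is):
m1 = h*u
k1 = h*F(y, u, x)
m2 = h*(u + 0.5*k1)
k2 = h*F(y+0.5*m1, u+0.5*k1, x+0.5*h)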
The next problem is the step size and the cubic term. It contributes a term of size 3*(R1/L)*u^2 ~ 50*u^2 to the Lipschitz constant. In the original IVP per the question, with u = Vo/L ~ 70, this term is of size 2.5e+5. To compensate only that term and stay in the stability region of the method, the step size has to be smaller than 1e-5.
With the corrected initial conditions, u = 0 at the start and the velocity u remains below 0.001, so the cubic term does not determine stability; it is now governed by the last term, which contributes a Lipschitz term of 1/sqrt(L*C) ~ 125. The step size for stability is then about 0.02, and with 0.002 one can expect quantitatively useful results.
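As a quick check of those numbers with the parameters from the question: 3*(R1/L) = 3*250/15 = 50, so with u = Vo/L ~ 70 the cubic term contributes roughly 50*70^2 ~ 2.5e5, while 1/sqrt(L*C) = 1/sqrt(15*4.2e-6) ~ 126. The stability interval of classical RK4 has a radius of roughly 2.5 to 2.8, which gives the step-size bounds of about 1e-5 and 0.02 quoted above.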
You can use the decimal library for more precision (it handles more digits), but it is a bit annoying: every value has to be of the same class (decimal.Decimal).
For example:
import numpy as np
from math import pi
from numpy import arange
from matplotlib.pyplot import plot, show
# Import decimal.Decimal as D
import decimal
from decimal import Decimal as D

# Precision
decimal.getcontext().prec = 10_000_000

#parameters
# Every value should be D class (decimal.Decimal class)
R0 = D(200)
R1 = D(250)
L = D(15)
h = D(0.002)
Vo = D(1000)
C = D(4.2*10**(-6))
t = D(0.93)

def F(y, u, x):
    # Decomposed for use of D
    a = D(Vo/L)
    b = D(-(R0/L)*u)
    c = D(-(R1/L)*u**D(3))
    d = D(-y*(D(1)/L*C))
    return ((a + b + c + d))

xpoints = arange(0,t,h)
ypoints = []
upoints = []
y = D(0.0)
u = D(Vo/L)
for x in xpoints:
    ypoints.append(y)
    upoints.append(u)
    m1 = u
    k1 = F(y, u, x) #(x, v, t)
    m2 = (h*(u + D(0.5)*k1))
    k2 = (h*F(y+D(0.5)*m1, u+D(0.5)*k1, x+D(0.5)*h))
    m3 = h*(u + D(0.5)*k2)
    k3 = h*F(y+D(0.5)*m2, u+D(0.5)*k2, x+D(0.5)*h)
    m4 = h*(u + k3)
    k4 = h*F(y+m3, u+k3, x+h)
    y += (m1 + D(2)*m2 + D(2)*m3 + m4)/D(6)
    u += (k1 + D(2)*k2 + D(2)*k3 + k4)/D(6)

plot(xpoints, upoints)
show()
plot(xpoints, ypoints)
show()
But even with ten million digits of precision I still get an overflow error. Check the components of the formula: their values are way too high. You can increase the precision to handle them, but you'll notice it takes a long time to compute them.
Here is an implementation of the problem using scipy.integrate.odeint and scipy.integrate.solve_ivp.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint, solve_ivp

# Input data
ti = 0.0
tf = 0.5
N = 100000
h = (tf-ti)/N

# Initial conditions
u0 = 0.0
Q0 = 0.0
t_span = np.linspace(ti,tf,N)
r0 = np.array([Q0,u0])

# Parameters
R0 = 200
R1 = 250
L = 15
C = 4.2*10**(-6)
V0 = 1000

# System of first order equations
# This function is used with odeint, as specified in the documentation for scipy.integrate.odeint
def f(r,t,R0,R1,L,C,V0):
    Q,u = r
    ode1 = u
    ode2 = -((R0/L)*u)-((R1/L)*u**3)-((1/(L*C))*Q)+(V0/L)
    return np.array([ode1,ode2])

# This function is used in our 4th-order Runge-Kutta implementation and in scipy.integrate.solve_ivp
def F(t,r,R0,R1,L,C,V0):
    Q,u = r
    ode1 = u
    ode2 = -((R0/L)*u)-((R1/L)*u**3)-((1/(L*C))*Q)+(V0/L)
    return np.array([ode1,ode2])

# Resolution with odeint and solve_ivp
sol_1 = odeint(f,r0,t_span,args=(R0,R1,L,C,V0))
sol_2 = solve_ivp(fun=F,t_span=(ti,tf), y0=r0, method='LSODA',args=(R0,R1,L,C,V0))
Q_odeint, u_odeint = sol_1[:,0], sol_1[:,1]
Q_solve_ivp, u_solve_ivp = sol_2.y[0,:], sol_2.y[1,:]

# Figures
plt.figure(figsize=[30.0,10.0])
plt.subplot(3,1,1)
plt.grid(color = 'red',linestyle='--',linewidth=0.4)
plt.plot(t_span,Q_odeint,'r',t_span,u_odeint,'b')
plt.xlabel('t(s)')
plt.ylabel('Q(t), u(t)')

plt.subplot(3,1,2)
plt.plot(sol_2.t,Q_solve_ivp,'g',sol_2.t,u_solve_ivp,'y')
plt.grid(color = 'yellow',linestyle='--',linewidth=0.4)
plt.xlabel('t(s)')
plt.ylabel('Q(t), u(t)')

plt.subplot(3,1,3)
plt.plot(Q_solve_ivp,u_solve_ivp,'green')
plt.grid(color = 'yellow',linestyle='--',linewidth=0.4)
plt.xlabel('Q(t)')
plt.ylabel('u(t)')
plt.show()
Runge-Kutta 4th order:
# Code development of Runge-Kutta 4th order
# Parameters
R0 = 200
R1 = 250
L = 15
C = 4.2*10**(-6)
V0 = 1000

# Input data
ti = 0.0
tf = 0.5
N = 100000
h = (tf-ti)/N

# Initial conditions
u0 = 0.0
Q0 = 0.0

# First order ordinary differential equations
def f1(t,Q,u):
    return u
def f2(t,Q,u):
    return -((R0/L)*u)-((R1/L)*u**3)-((1/(L*C))*Q)+(V0/L)

t = np.zeros(N); Q = np.zeros(N); u = np.zeros(N)
t[0] = ti
Q[0] = Q0
u[0] = u0

for i in range(0,N-1,1):
    k1 = h*f1(t[i],Q[i],u[i])
    l1 = h*f2(t[i],Q[i],u[i])
    k2 = h*f1(t[i]+(h/2),Q[i]+(k1/2),u[i]+(l1/2))
    l2 = h*f2(t[i]+(h/2),Q[i]+(k1/2),u[i]+(l1/2))
    k3 = h*f1(t[i]+(h/2),Q[i]+(k2/2),u[i]+(l2/2))
    l3 = h*f2(t[i]+(h/2),Q[i]+(k2/2),u[i]+(l2/2))
    k4 = h*f1(t[i]+h,Q[i]+k3,u[i]+l3)
    l4 = h*f2(t[i]+h,Q[i]+k3,u[i]+l3)
    Q[i+1] = Q[i] + ((k1+2*k2+2*k3+k4)/6)
    u[i+1] = u[i] + ((l1+2*l2+2*l3+l4)/6)
    t[i+1] = t[i] + h

plt.figure(figsize=[20.0,10.0])
plt.subplot(1,2,1)
plt.plot(t,Q_solve_ivp,'r',t,Q_odeint,'y',t,Q,'b')
plt.grid(color = 'yellow',linestyle='--',linewidth=0.4)
plt.xlabel('t(s)')
plt.ylabel(r'$Q(t)_{Odeint}$, $Q(t)_{RK4}$')

plt.subplot(1,2,2)
plt.plot(t,Q_solve_ivp,'g',t,Q_odeint,'y',t,Q,'b')
plt.grid(color = 'yellow',linestyle='--',linewidth=0.4)
plt.xlabel('t(s)')
plt.ylabel(r'$Q(t)_{solve_ivp}$, $Q(t)_{RK4}$')

Full algorithm (math) of natural cubic splines computation in Python?

I'm interested in full Python code (with math formulas) with all computations needed to calculate natural Cubic Splines from scratch. If possible, fast (e.g. Numpy-based).
I created this question only to share my code (as an answer), which I recently programmed from scratch (based on Wikipedia) while learning cubic splines.
I programmed the following code based on the Russian Wikipedia article; almost the same description and formulas appear in the English article.
To speed up computation I used both NumPy and Numba.
To check the correctness of the code I compared it against the reference natural cubic spline implementation in scipy.interpolate.CubicSpline; the np.allclose(...) assertion in my code confirms that the formulas agree.
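For reference, the tridiagonal system that tri_diag_solve is given inside calc_spline_params below corresponds, as far as I can tell, to the standard natural-spline conditions on the second-derivative coefficients $c_i$, with $h_i = x_{i+1} - x_i$ and $a_i = y_i$:
$$h_{i-1} c_{i-1} + 2(h_{i-1} + h_i)\,c_i + h_i c_{i+1} = 3\left(\frac{a_{i+1} - a_i}{h_i} - \frac{a_i - a_{i-1}}{h_{i-1}}\right), \qquad c_0 = c_n = 0,$$
after which the remaining coefficients follow as $d_i = (c_{i+1} - c_i)/(3 h_i)$ and $b_i = (a_{i+1} - a_i)/h_i + h_i\,(2 c_{i+1} + c_i)/3$.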
Also, I did timings:
calc (spline_scipy): Timed best=2.712 ms, mean=2.792 +- 0.1 ms
calc (spline_numba): Timed best=916.000 us, mean=938.868 +- 17.9 us
speedup: 2.973
use (spline_scipy): Timed best=5.262 ms, mean=5.320 +- 0.1 ms
use (spline_numba): Timed best=4.745 ms, mean=5.420 +- 0.3 ms
speedup: 0.981
which shows that my spline-parameter computation is around 3x faster than the SciPy version, while evaluating the spline (computation for given x) is about the same speed as SciPy.
Running the code below requires a one-time install of the following packages: python -m pip install numpy numba scipy timerit (scipy and timerit are only needed for testing purposes, not for the actual algorithm).
The code draws plots showing the original piecewise-linear data and both spline approximations (SciPy and Numba versions); as one can see, the SciPy and Numba lines coincide, meaning the two spline computations agree.
Code:
Try it online!
import numpy as np, numba

# Solves linear system given by tridiagonal matrix
# Helper for calculating cubic splines
@numba.njit(
    [f'f{ii}[:](f{ii}[:], f{ii}[:], f{ii}[:], f{ii}[:])' for ii in (4, 8)],
    cache = True, fastmath = True, inline = 'always')
def tri_diag_solve(A, B, C, F):
    n = B.size
    assert A.ndim == B.ndim == C.ndim == F.ndim == 1 and (
        A.size == B.size == C.size == F.size == n
    ) #, (A.shape, B.shape, C.shape, F.shape)
    Bs, Fs = np.zeros_like(B), np.zeros_like(F)
    Bs[0], Fs[0] = B[0], F[0]
    for i in range(1, n):
        Bs[i] = B[i] - A[i] / Bs[i - 1] * C[i - 1]
        Fs[i] = F[i] - A[i] / Bs[i - 1] * Fs[i - 1]
    x = np.zeros_like(B)
    x[-1] = Fs[-1] / Bs[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (Fs[i] - C[i] * x[i + 1]) / Bs[i]
    return x

# Calculate cubic spline params
@numba.njit(
    #[f'(f{ii}, f{ii}, f{ii}, f{ii})(f{ii}[:], f{ii}[:])' for ii in (4, 8)],
    cache = True, fastmath = True, inline = 'always')
def calc_spline_params(x, y):
    a = y
    h = np.diff(x)
    c = np.concatenate((np.zeros((1,), dtype = y.dtype),
        np.append(tri_diag_solve(h[:-1], (h[:-1] + h[1:]) * 2, h[1:],
            ((a[2:] - a[1:-1]) / h[1:] - (a[1:-1] - a[:-2]) / h[:-1]) * 3), 0)))
    d = np.diff(c) / (3 * h)
    b = (a[1:] - a[:-1]) / h + (2 * c[1:] + c[:-1]) / 3 * h
    return a[1:], b, c[1:], d

# Spline value calculating function, given params and "x"
@numba.njit(
    [f'f{ii}[:](f{ii}[:], i8[:], f{ii}[:], f{ii}[:], f{ii}[:], f{ii}[:], f{ii}[:])' for ii in (4, 8)],
    cache = True, fastmath = True, inline = 'always')
def func_spline(x, ix, x0, a, b, c, d):
    dx = x - x0[1:][ix]
    return a[ix] + (b[ix] + (c[ix] + d[ix] * dx) * dx) * dx

@numba.njit(
    [f'i8[:](f{ii}[:], f{ii}[:], b1)' for ii in (4, 8)],
    cache = True, fastmath = True, inline = 'always')
def searchsorted_merge(a, b, sort_b):
    ix = np.zeros((len(b),), dtype = np.int64)
    if sort_b:
        ib = np.argsort(b)
    pa, pb = 0, 0
    while pb < len(b):
        if pa < len(a) and a[pa] < (b[ib[pb]] if sort_b else b[pb]):
            pa += 1
        else:
            ix[pb] = pa
            pb += 1
    return ix

# Compute piece-wise spline function for "x" out of sorted "x0" points
@numba.njit([f'f{ii}[:](f{ii}[:], f{ii}[:], f{ii}[:], f{ii}[:], f{ii}[:], f{ii}[:])' for ii in (4, 8)],
    cache = True, fastmath = True, inline = 'always')
def piece_wise_spline(x, x0, a, b, c, d):
    xsh = x.shape
    x = x.ravel()
    #ix = np.searchsorted(x0[1 : -1], x)
    ix = searchsorted_merge(x0[1 : -1], x, False)
    y = func_spline(x, ix, x0, a, b, c, d)
    y = y.reshape(xsh)
    return y

def test():
    import matplotlib.pyplot as plt, scipy.interpolate
    from timerit import Timerit
    Timerit._default_asciimode = True
    np.random.seed(0)

    def f(n):
        x = np.sort(np.random.uniform(0., n / 5 * np.pi, (n,))).astype(np.float64)
        return x, (np.sin(x) * 5 + np.sin(1 + 2.5 * x) * 3 + np.sin(2 + 0.5 * x) * 2).astype(np.float64)

    def spline_numba(x0, y0):
        a, b, c, d = calc_spline_params(x0, y0)
        return lambda x: piece_wise_spline(x, x0, a, b, c, d)

    def spline_scipy(x0, y0):
        f = scipy.interpolate.CubicSpline(x0, y0, bc_type = 'natural')
        return lambda x: f(x)

    def timings():
        x0, y0 = f(10000)
        s, t = {}, []
        gs = [spline_scipy, spline_numba]
        spline_numba(np.copy(x0[::3]), np.copy(y0[::3])) # pre-compile numba
        for g in gs:
            print('calc (', g.__name__, '): ', sep = '', end = '', flush = True)
            tim = Timerit(num = 150, verbose = 1)
            for _ in tim:
                s_ = g(x0, y0)
            s[g.__name__] = s_
            t.append(tim.mean())
            if len(t) >= 2:
                print('speedup:', round(t[-2] / t[-1], 3))
        print()
        x = np.linspace(x0[0], x0[-1], 50000, dtype = np.float64)
        t = []
        s['spline_numba'](np.copy(x[::3])) # pre-compile numba
        for i in range(len(s)):
            print('use (', gs[i].__name__, '): ', sep = '', end = '', flush = True)
            tim = Timerit(num = 100, verbose = 1)
            sg = s[gs[i].__name__]
            for _ in tim:
                sg(x)
            t.append(tim.mean())
            if len(t) >= 2:
                print('speedup:', round(t[-2] / t[-1], 3))

    x0, y0 = f(50)
    timings()
    shift = 3
    x = np.linspace(x0[0], x0[-1], 1000, dtype = np.float64)
    ys = spline_scipy(x0, y0)(x)
    yn = spline_numba(x0, y0)(x)
    assert np.allclose(ys, yn), np.absolute(ys - yn).max()
    plt.plot(x0, y0, label = 'orig')
    plt.plot(x, ys, label = 'spline_scipy')
    plt.plot(x, yn, '-.', label = 'spline_numba')
    plt.legend()
    plt.show()

if __name__ == '__main__':
    test()

General minimal residual method with right-preconditioner of SSOR

I am trying to implement the GMRES algorithm with a right preconditioner P for solving the linear system Ax = b. The code runs without error; however, it gives an imprecise result: the residual error I get is very large. For plain GMRES (without the preconditioning matrix, i.e. removing P from the algorithm), the error I get is around 1e-12 and it converges for the same matrix.
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt
from scipy.linalg import norm as norm
import scipy.sparse as sp
from scipy.sparse import diags

"""The program splits the matrix into D - diagonal; L - strictly lower matrix; U - strictly upper matrix,
satisfying: A = D - L - U """
def splitMat(A):
    n,m = A.shape
    if (n == m):
        diagval = np.diag(A)
        D = diags(diagval,0).toarray()
        L = (-1)*np.tril(A,-1)
        U = (-1)*np.triu(A,1)
    else:
        print("A needs to be a square matrix")
    return (L,D,U)

"""Preconditioner matrix for symmetric successive over-relaxation (SSOR): """
def P_SSOR(A,w):
    ## Split up matrix A:
    L,D,U = splitMat(A)
    Comp1 = (D - w*U)
    Comp2 = (D - w*L)
    Comp1inv = np.linalg.inv(Comp1)
    Comp2inv = np.linalg.inv(Comp2)
    P = w*(2-w)*np.matmul(Comp1inv, np.matmul(D,Comp2inv))
    return P

"""GMRES_SSOR using right preconditioning P:
A - matrix of linear system Ax = b
x0 - initial guess
tol - tolerance
maxit - maximum iteration """
def myGMRES_SSOR(A, x0, b, tol, maxit):
    matrixSize = A.shape[0]
    e = np.zeros((maxit+1,1))
    rr = 1
    rstart = 2
    X = x0
    w = 1.9 ## in SSOR
    P = P_SSOR(A,w) ### preconditioner matrix
    ### Starting the GMRES ####
    for rs in range(0, rstart+1):
        ### first check on the residual:
        if rr < tol:
            break
        else:
            r0 = (b - A.dot(x0))
            rho = norm(r0)
            e[0] = rho
            H = np.zeros((maxit+1, maxit))
            Qcol = np.zeros((matrixSize, maxit+1))
            Qcol[:,0:1] = r0/rho
            for k in range(1, maxit+1):
                ### Arnoldi procedure ##
                Qcol[:,k] = np.matmul(np.matmul(A,P), Qcol[:,k-1]) ### This step applies P here:
                for j in range(0,k):
                    H[j,k-1] = np.dot(np.transpose(Qcol[:,k]), Qcol[:,j])
                    Qcol[:,k] = Qcol[:,k] - (np.dot(H[j,k-1], Qcol[:,j]))
                H[k,k-1] = norm(Qcol[:,k])
                Qcol[:,k] = Qcol[:,k]/H[k,k-1]
                ### QR decomposition step ###
                n = k
                Q = np.zeros((n+1, n))
                R = np.zeros((n, n))
                R[0, 0] = norm(H[0:n+2, 0])
                Q[:, 0] = H[0:n+1, 0] / R[0,0]
                for j in range(0, n+1):
                    t = H[0:n+1, j-1]
                    for i in range(0, j-1):
                        R[i, j-1] = np.dot(Q[:, i], t)
                        t = t - np.dot(R[i, j-1], Q[:, i])
                    R[j-1, j-1] = norm(t)
                    Q[:, j-1] = t / R[j-1, j-1]
                g = np.dot(np.transpose(Q), e[0:k+1])
                Y = np.dot(np.linalg.inv(R), g)
                Res = e[0:n] - np.dot(H[0:n, 0:n], Y[0:n])
                rr = norm(Res)
                #### second check on the residual ###
                if rr < tol:
                    break
            #### Updating the solution with the preconditioner matrix ####
            X = X + np.matmul(np.matmul(P, Qcol[:, 0:k]), Y) ### This step applies P here:
    return X

######
A = np.random.rand(100,100)
x = np.random.rand(100,1)
b = np.matmul(A,x)
x0 = np.zeros((100,1))
maxit = 100
tol = 0.00001
x = myGMRES_SSOR(A,x0,b,tol,maxit)
res = b - np.matmul(A,x)
print(norm(res))
print("Solution with gmres\n", np.matmul(A,x))
print("---------------------------------------")
print("b matrix:", b)
I hope someone can help me figure this out!
I'm not sure where you got your SSOR ("symmetric successive over-relaxation") code from, but it appears to be wrong. You also seem to be assuming that A is a symmetric matrix, but in your random test case it is not.
Following SSOR's Wikipedia entry, I replaced your P_SSOR function with
def P_SSOR(A,w):
    L,D,U = splitMat(A)
    P = 2/(2-w) * (1/w*D+L)*np.linalg.inv(D)*(1/w*D+L).T
    return P
and your test matrix with
A = np.random.rand(100,100)
A = A + A.T
and your code works up to a 12 digit residual error.

Transport equation in 1D (python)

I'm trying to write a Python program to solve the 1D convection equation using the finite difference method (upwind scheme). The problem is as follows:
Here's what I've attempted
from numpy import *
from numpy.linalg import *
from matplotlib.pyplot import *

def u0(x):
    if (0.4 <= x <= 0.5):
        y = 10*(x - 0.4)
    elif (0.5 <= x <= 0.6):
        y = 10*(0.6 - x)
    else:
        y = 0
    return y

print('Choix de la vitesse de transport c : ')
c = float(input('c = '))

def solex(x, t):
    return u0(x - c*t)

print('Choix de pas h : ')
h = float(input('h = '))
print('Choix du pas dt et du temps final T : ')
dt = float(input('dt = '))
T = float(input('T = '))

# Maillage
N = int((1/h) - 1)
x = linspace(0, 1, N + 2)
M = int((T/dt) - 1)
t = linspace(0, T, M + 2)

# Itération
U1 = zeros(N)
U2 = zeros(N)
sol = zeros((N, M + 2))
for i in range(1, N + 1):
    U1[i - 1] = u0(x[i])
sol[:, 0] = U1
for j in range(1, size(t)):
    for i in range(1, N-1):
        U2[i] = U1[i] - c*(dt/h)*(U1[i] - U1[i - 1])
    sol[:, j] = U2
    U1 = U2
It doesn't seem to work, and I don't know why.
Though you said you already solved your problem, I would still like to suggest some general improvements:
wildcard imports like from numpy import * are considered bad practice; better to use import numpy as np and refer to the necessary functions as np.linspace etc.
the power of numpy comes from vectorization, so try to replace as many for-loops as possible with vectorized operations.
at least from what you showed us, the variables U1 and U2 are not really necessary.
using input for every single parameter might be overkill
The following code includes my suggested improvements. Note how I replaced your u0 with a vectorized version using np.piecewise and replaced several for-loops. I also added a visualisation.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def u0(x):
    y = np.piecewise(
        x,
        [(0.4 <= x)&(x <= 0.5), (0.5 <= x)&(x <= 0.6)],
        [lambda x: 10*(x - 0.4), lambda x: 10*(0.6 - x), 0])
    return y

c = 0.9
h = 0.01
dt = 0.01
T = 2

N = int(np.ceil(1/h))
x = np.linspace(0, 1, N)
M = int(np.ceil(T/dt))
t = np.linspace(0, T, M)

#solve with upwind scheme
sol = np.zeros((N, M))
sol[:,0] = u0(x)
#you could add boundary values here by setting
#sol[0,:] = <your_boundary_data>
for i in range(1, len(t)):
    sol[1:,i] = sol[1:,i-1] - c*(dt/h)*(sol[1:,i-1] - sol[:-1,i-1])

#Visualization
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.set_xlabel('x')
ax.set_ylabel('t')
T, X = np.meshgrid(t, x)
surf = ax.plot_surface(X, T, sol)

Cython optimization of the code

I'm struggling to boost the performance of my python particle tracking code with Cython.
Here's my pure Python code:
from scipy.integrate import odeint
import numpy as np
from numpy import sqrt, pi, sin, cos
from time import time as Time
import multiprocessing as mp
from functools import partial

cLight = 299792458.
Dim = 6

class Integrator:
    def __init__(self, ring):
        self.ring = ring

    def equations(self, X, s):
        dXds = np.zeros(Dim)
        E, B = self.ring.getEMField( [X[0], X[2], s], X[4] )
        h = 1 + X[0]/self.ring.ringRadius
        p_s = np.sqrt(X[5]**2 - self.ring.particle.mass**2 - X[1]**2 - X[3]**2)
        dtds = h*X[5]/p_s
        gamma = X[5]/self.ring.particle.mass
        beta = np.array( [X[1], X[3], p_s] ) / X[5]
        dXds[0] = dtds*beta[0]
        dXds[2] = dtds*beta[1]
        dXds[1] = p_s/self.ring.ringRadius + self.ring.particle.charge*(dtds*E[0] + dXds[2]*B[2] - h*B[1])
        dXds[3] = self.ring.particle.charge*(dtds*E[1] + h*B[0] - dXds[0]*B[2])
        dXds[4] = dtds
        dXds[5] = self.ring.particle.charge*(dXds[0]*E[0] + dXds[2]*E[1] + h*E[2])
        return dXds

    def odeSolve(self, X0, sRange):
        sol = odeint(self.equations, X0, sRange)
        return sol

class Ring:
    def __init__(self, particle):
        self.particle = particle
        self.ringRadius = 7.112
        self.magicB0 = self.particle.magicMomentum/self.ringRadius

    def getEMField(self, pos, time):
        x, y, s = pos
        theta = (s/self.ringRadius*180/pi) % 360
        r = sqrt(x**2 + y**2)
        arg = 0 if r == 0 else np.angle( complex(x/r, y/r) )
        rn = r/0.045
        k2 = 37*24e3
        k10 = -4*24e3
        E = np.zeros(3)
        B = np.array( [ 0, self.magicB0, 0 ] )
        for i in range(4):
            if ((21.9+90*i < theta < 34.9+90*i or 38.9+90*i < theta < 64.9+90*i) and (-0.05 < x < 0.05 and -0.05 < y < 0.05)):
                E = np.array( [ k2*x/0.045 + k10*rn**9*cos(9*arg), -k2*y/0.045 -k10*rn**9*sin(9*arg), 0] )
                break
        return E, B

class Particle:
    def __init__(self):
        self.mass = 105.65837e6
        self.charge = 1.
        self.gm2 = 0.001165921
        self.magicMomentum = self.mass/sqrt(self.gm2)
        self.magicEnergy = sqrt(self.magicMomentum**2 + self.mass**2)
        self.magicGamma = self.magicEnergy/self.mass
        self.magicBeta = self.magicMomentum/(self.magicGamma*self.mass)

def runSimulation(nParticles, tEnd):
    particle = Particle()
    ring = Ring(particle)
    integrator = Integrator(ring)
    Xs = np.array( [ np.array( [45e-3*(np.random.rand()-0.5)*2, 0, 0, 0, 0, particle.magicEnergy] ) for i in range(nParticles) ] )
    sRange = np.arange(0, tEnd, 1e-9)*particle.magicBeta*cLight
    ode = partial(integrator.odeSolve, sRange=sRange)

    t1 = Time()
    pool = mp.Pool()
    sol = np.array(pool.map(ode, Xs))
    t2 = Time()
    print ("%.3f sec" %(t2-t1))
    return t2-t1
Obviously, the most time-consuming part is integrating the ODE, defined in odeSolve() and equations() of class Integrator. Also, the getEMField() method of class Ring is called as often as the equations() method during the solve.
I tried to get a significant speedup (at least 10x~20x) using Cython, but I only got about a 1.5x speedup with the following Cython script:
import cython
import numpy as np
cimport numpy as np
from libc.math cimport sqrt, pi, sin, cos
from scipy.integrate import odeint
from time import time as Time
import multiprocessing as mp
from functools import partial

cdef double cLight = 299792458.
cdef int Dim = 6

@cython.boundscheck(False)
cdef class Integrator:
    cdef Ring ring

    def __init__(self, ring):
        self.ring = ring

    cpdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] equations(self,
            np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] X,
            double s):
        cdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] dXds = np.zeros(Dim)
        cdef double h, p_s, dtds, gamma
        cdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] beta, E, B

        E, B = self.ring.getEMField( [X[0], X[2], s], X[4] )
        h = 1 + X[0]/self.ring.ringRadius
        p_s = np.sqrt(X[5]*X[5] - self.ring.particle.mass*self.ring.particle.mass - X[1]*X[1] - X[3]*X[3])
        dtds = h*X[5]/p_s
        gamma = X[5]/self.ring.particle.mass
        beta = np.array( [X[1], X[3], p_s] ) / X[5]
        dXds[0] = dtds*beta[0]
        dXds[2] = dtds*beta[1]
        dXds[1] = p_s/self.ring.ringRadius + self.ring.particle.charge*(dtds*E[0] + dXds[2]*B[2] - h*B[1])
        dXds[3] = self.ring.particle.charge*(dtds*E[1] + h*B[0] - dXds[0]*B[2])
        dXds[4] = dtds
        dXds[5] = self.ring.particle.charge*(dXds[0]*E[0] + dXds[2]*E[1] + h*E[2])
        return dXds

    cpdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] odeSolve(self,
            np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] X0,
            np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] sRange):
        sol = odeint(self.equations, X0, sRange)
        return sol

@cython.boundscheck(False)
cdef class Ring:
    cdef Particle particle
    cdef double ringRadius
    cdef double magicB0

    def __init__(self, particle):
        self.particle = particle
        self.ringRadius = 7.112
        self.magicB0 = self.particle.magicMomentum/self.ringRadius

    cpdef tuple getEMField(self,
            list pos,
            double time):
        cdef double x, y, s
        cdef double theta, r, rn, arg, k2, k10
        cdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] E, B

        x, y, s = pos
        theta = (s/self.ringRadius*180/pi) % 360
        r = sqrt(x*x + y*y)
        arg = 0 if r == 0 else np.angle( complex(x/r, y/r) )
        rn = r/0.045
        k2 = 37*24e3
        k10 = -4*24e3
        E = np.zeros(3)
        B = np.array( [ 0, self.magicB0, 0 ] )
        for i in range(4):
            if ((21.9+90*i < theta < 34.9+90*i or 38.9+90*i < theta < 64.9+90*i) and (-0.05 < x < 0.05 and -0.05 < y < 0.05)):
                E = np.array( [ k2*x/0.045 + k10*rn**9*cos(9*arg), -k2*y/0.045 -k10*rn**9*sin(9*arg), 0] )
                #E = np.array( [ k2*x/0.045, -k2*y/0.045, 0] )
                break
        return E, B

cdef class Particle:
    cdef double mass
    cdef double charge
    cdef double gm2
    cdef double magicMomentum
    cdef double magicEnergy
    cdef double magicGamma
    cdef double magicBeta

    def __init__(self):
        self.mass = 105.65837e6
        self.charge = 1.
        self.gm2 = 0.001165921
        self.magicMomentum = self.mass/sqrt(self.gm2)
        self.magicEnergy = sqrt(self.magicMomentum**2 + self.mass**2)
        self.magicGamma = self.magicEnergy/self.mass
        self.magicBeta = self.magicMomentum/(self.magicGamma*self.mass)

def runSimulation(nParticles, tEnd):
    particle = Particle()
    ring = Ring(particle)
    integrator = Integrator(ring)
    #nParticles = 5
    Xs = np.array( [ np.array( [45e-3*(np.random.rand()-0.5)*2, 0, 0, 0, 0, particle.magicEnergy] ) for i in range(nParticles) ] )
    sRange = np.arange(0, tEnd, 1e-9)*particle.magicBeta*cLight
    ode = partial(integrator.odeSolve, sRange=sRange)

    t1 = Time()
    pool = mp.Pool()
    sol = np.array(pool.map(ode, Xs))
    t2 = Time()
    print ("%.3f sec" %(t2-t1))
    return t2-t1
What should I do to get the maximum effect from Cython?
(I tried Numba instead of Cython, and the performance gain from Numba was actually enormous (around a 20x speedup). But I had an extremely hard time using Numba with Python class instances, so I decided to use Cython instead of Numba.)
For reference, the following is the Cython annotation from its compilation:
This is a very incomplete answer since I haven't profiled or timed anything or even checked that it gives the same answer. However, here are some suggestions that reduce the amount of Python code that Cython generates:
Add the @cython.cdivision(True) compilation directive. This means that a ZeroDivisionError won't be raised on float division and you'll get a NaN value instead. (Only do this if you don't want the error to be raised.)
Change p_s = np.sqrt(...) to p_s = sqrt(...). This removes a numpy call that only operates on a single value. You seem to have done this elsewhere so I don't know why you missed this line.
Where possible use fixed size C arrays instead of numpy arrays:
cdef double beta[3]
# ...
beta[0] = X[1]/X[5]
beta[1] = X[3]/X[5]
beta[2] = p_s/X[5]
You can do this when the size is known at compile time (and fairly small) and when you don't want to return it. This avoids a call to np.zeros and some subsequent type-checking to assign it to the typed numpy array. I think beta is the only place you can do this.
np.angle( complex(x/r, y/r) ) can be replaced by atan2(y/r, x/r) (using atan2 from libc.math). You can also lose the division by r.
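That is, something along these lines (assuming atan2 is added to the existing libc.math cimport):
from libc.math cimport atan2
# inside getEMField; the division by r is no longer needed:
arg = 0 if r == 0 else atan2(y, x)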
cdef int i helps make your for loop faster in getEMField (Cython is often good at automatically picking up the types of loop variables but seems to have failed here)
I suspect it's quicker to assign E element-by-element than as a whole array:
E[0] = k2*x/0.045 + k10*rn**9*cos(9*arg)
E[1] = -k2*y/0.045 -k10*rn**9*sin(9*arg)
There isn't much value in specifying types like list and tuple and it may actually make the code slightly slower (because it will waste time checking the types).
A bigger change would be to pass E and B into getEMField as pointers rather than allocating them with np.zeros. This would let you allocate them as static C arrays in equations (cdef double E[3]). The downside is that getEMField would have to be cdef and so no longer callable from Python (but you could make a Python-callable wrapper function too if you like).
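A rough, untested sketch of that idea (the exact signature below is my own suggestion, not code from the question):
# in Ring:
cdef void getEMField(self, double x, double y, double s, double time,
                     double* E, double* B):
    # fill the caller-provided arrays in place instead of allocating numpy arrays
    E[0] = 0.; E[1] = 0.; E[2] = 0.
    B[0] = 0.; B[1] = self.magicB0; B[2] = 0.
    # ... same field logic as before, writing into E[0] and E[1]

# in Integrator.equations:
cdef double E[3]
cdef double B[3]
self.ring.getEMField(X[0], X[2], s, X[4], E, B)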
