Improving Numpy speed for Gauss-Seidel (Jacobi) Solver

This question is a follow-up to a recent question posted regarding MATLAB being twice as fast as Numpy.
I currently have a Gauss-Seidel solver implemented in both MATLAB and Numpy which acts on a 2D axisymmetric domain (cylindrical coordinates). The code was originally written in MATLAB and then ported to Python. The MATLAB code runs in ~20 s, whereas the Numpy code takes ~30 s. I would like to use Numpy; however, since this code is part of a larger program, a simulation time that is up to almost twice as long is a significant drawback.
The algorithm simply solves the discretized Laplace equation on a rectangular mesh (in cylindrical coordinates). It finishes when the maximum difference between updates on the mesh is less than the indicated tolerance.
The code in Numpy is:
import numpy as np
import time
T = np.transpose
# geometry
length = 0.008
width = 0.002
# mesh
nz = 256
nr = 64
# step sizes
dz = length/nz
dr = width/nr
# node position matrices
r = np.tile(np.linspace(0,width,nr+1), (nz+1, 1)).T
ri = r/dr
# equation coefficients
cr = dz**2 / (2*(dr**2 + dz**2))
cz = dr**2 / (2*(dr**2 + dz**2))
# initial/boundary conditions
v = np.zeros((nr+1,nz+1))
v[:,0] = 1100
v[:,-1] = 0
v[31:,29:40] = 1000
v[19:,54:65] = -200
# convergence parameters
tol = 1e-4
# Gauss-Seidel solver
tic = time.time()
max_v_diff = 1
while (max_v_diff > tol):
    v_old = v.copy()
    # left boundary updates
    v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+1])
    # internal updates
    v[1:nr,1:nz] = cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz] + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz]) + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1])
    # right boundary updates
    v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
    # reapply grid potentials
    v[31:,29:40] = 1000
    v[19:,54:65] = -200
    # check for convergence
    v_diff = v - v_old
    max_v_diff = np.absolute(v_diff).max()
toc = time.time() - tic
print(toc)
This is actually not the full algorithm which I use. The full algorithm uses successive overrelaxation and a checkerboard iteration scheme to improve speed and remove solver directionality, but for purposes of simplicity I provided this easier to understand version. The speed drawbacks in Numpy are more pronounced for the full version (17s vs. 9s simulation times respectively in Numpy and MATLAB).
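For readers unfamiliar with the checkerboard scheme mentioned above, here is a minimal sketch of red-black successive overrelaxation in NumPy. It is not the author's full solver: it assumes the plain Cartesian 5-point Laplace stencil rather than the cylindrical coefficients, and the function name red_black_sor, the relaxation factor omega=1.5 and the mask argument fixed are illustrative choices.

```python
import numpy as np

def red_black_sor(v, fixed, omega=1.5, tol=1e-8, max_iter=10_000):
    """Checkerboard (red-black) SOR for the 5-point Cartesian Laplace stencil.

    v     : 2D array; the outer ring and the nodes flagged in `fixed` are held.
    fixed : boolean mask, same shape as v, marking additional fixed nodes.
    omega : relaxation factor (assumed value, must lie between 0 and 2).
    """
    v = np.array(v, dtype=float)
    i, j = np.indices(v.shape)
    free = ~fixed                      # new array, `fixed` itself is untouched
    free[[0, -1], :] = False           # hold the outer boundary
    free[:, [0, -1]] = False
    colors = [free & ((i + j) % 2 == c) for c in (0, 1)]
    avg = np.zeros_like(v)
    for iteration in range(1, max_iter + 1):
        max_diff = 0.0
        for mask in colors:
            # Neighbour average; nodes of one colour only depend on the other colour,
            # so the second half-sweep already sees the first half-sweep's values.
            avg[1:-1, 1:-1] = 0.25 * (v[:-2, 1:-1] + v[2:, 1:-1]
                                      + v[1:-1, :-2] + v[1:-1, 2:])
            delta = omega * (avg[mask] - v[mask])
            v[mask] += delta
            if delta.size:
                max_diff = max(max_diff, np.abs(delta).max())
        if max_diff < tol:
            break
    return v, iteration

# Small check: boundary values that vary linearly should give a linear interior.
n = 17
v0 = np.zeros((n, n))
v0[:] = np.linspace(1.0, 0.0, n)       # every row runs linearly from 1 to 0
v0[1:-1, 1:-1] = 0.0                   # wipe the interior
solution, n_iter = red_black_sor(v0, np.zeros((n, n), dtype=bool))
print(n_iter, np.abs(solution - np.linspace(1.0, 0.0, n)).max())
```

Because each colour is updated with one sliced expression, the sweep stays vectorised while still using freshly updated neighbours, which is what distinguishes it from a pure Jacobi update.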
I tried the solution from the previous question, changing v to a column-major order array, but there was no performance increase.
Any suggestions?
Edit: The Matlab code for reference is:
% geometry
length = 0.008;
width = 0.002;
% mesh
nz = 256;
nr = 64;
% step sizes
dz = length/nz;
dr = width/nr;
% node position matrices
r = repmat(linspace(0,width,nr+1)', 1, nz+1);
ri = r./dr;
% equation coefficients
cr = dz^2/(2*(dr^2+dz^2));
cz = dr^2/(2*(dr^2+dz^2));
% initial/boundary conditions
v = zeros(nr+1,nz+1);
v(1:nr+1,1) = 1100;
v(1:nr+1,nz+1) = 0;
v(32:nr+1,30:40) = 1000;
v(20:nr+1,55:65) = -200;
% convergence parameters
tol = 1e-4;
max_v_diff = 1;
% Gauss-Seidel Solver
tic
while (max_v_diff > tol)
v_old = v;
% left boundary updates
v(1,2:nz) = cr.*2.*v(2,2:nz) + cz.*( v(1,1:nz-1) + v(1,3:nz+1) );
% internal updates
v(2:nr,2:nz) = cr.*( (1 - 1./(2.*ri(2:nr,2:nz))).*v(1:nr-1,2:nz) + (1 + 1./(2.*ri(2:nr,2:nz))).*v(3:nr+1,2:nz) ) + cz.*( v(2:nr,1:nz-1) + v(2:nr,3:nz+1) );
% right boundary updates
v(nr+1,2:nz) = cr.*2.*v(nr,2:nz) + cz.*( v(nr+1,1:nz-1) + v(nr+1,3:nz+1) );
% reapply grid potentials
v(32:nr+1,30:40) = 1000;
v(20:nr+1,55:65) = -200;
% check for convergence
max_v_diff = max(max(abs(v - v_old)));
end
toc

I've been able to reduce the running time on my laptop from 66 to 21 seconds by following this process:
Find the bottleneck. I profiled the code using line_profiler from the IPython console to find the lines that took the most time. It turned out that over 80% of the time was spent in the line that does the "internal updates".
Choose a way to optimise it. There are several tools to speed up numpy code (Cython, numexpr, weave...). In particular, scipy.weave.blitz is well suited to compiling numpy expressions, like the offending line, into fast code. In theory, that line could be wrapped inside "..." and executed as weave.blitz("..."), but the array being updated is also used in the computation, so, as stated in point #4 of the docs, a temporary array must be used to keep the same result:
expr = "temp = cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz] + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz]) + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]); v[1:nr,1:nz] = temp"
temp = np.empty((nr-1, nz-1))
...
while ...
# internal updates
weave.blitz(expr)
After checking that the results are correct, runtime checks are disabled by using weave.blitz(expr, check_size=0). The code now runs in 34 seconds.
Building on Jaime's work, precompute the constant factors A and B in the expression. The code now runs in 21 seconds (with minimal changes, but it now needs a compiler).
This is the core of the code:
from scipy import weave
# [...] Set up code till "# Gauss-Seidel solver"
tic = time.time()
max_v_diff = 1
A = cr * (1 - 1/(2*ri[1:nr,1:nz]))
B = cr * (1 + 1/(2*ri[1:nr,1:nz]))
expr = "temp = A*v[0:nr-1,1:nz] + B*v[2:nr+1,1:nz] + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]); v[1:nr,1:nz] = temp"
temp = np.empty((nr-1, nz-1))
while (max_v_diff > tol):
    v_old = v.copy()
    # left boundary updates
    v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+1])
    # internal updates
    weave.blitz(expr, check_size=0)
    # right boundary updates
    v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
    # reapply grid potentials
    v[31:,29:40] = 1000
    v[19:,54:65] = -200
    # check for convergence
    v_diff = v - v_old
    max_v_diff = np.absolute(v_diff).max()
toc = time.time() - tic

On my laptop your code runs in about 45 seconds. By trying to reduce creation of intermediate arrays to the bare minimum, including reuse of pre-allocated work arrays, I have managed to reduce that time to 27 seconds. That should put you back at the level of MATLAB, but your code would be less readable. Anyway, find below code to replace everything below your # Gauss-Seidel solver comment:
# work arrays
v_old = np.empty_like(v)
w1 = np.empty_like(v[0, 1:nz])
w2 = np.empty_like(v[1:nr,1:nz])
w3 = np.empty_like(v[nr, 1:nz])
# constants
A = cr * (1 - 1/(2*ri[1:nr,1:nz]))
B = cr * (1 + 1/(2*ri[1:nr,1:nz]))
# Gauss-Seidel solver
tic = time.time()
max_v_diff = 1
while (max_v_diff > tol):
    v_old[:] = v
    # left boundary updates
    np.add(v_old[0, 0:nz-1], v_old[0, 2:nz+1], out=v[0, 1:nz])
    v[0, 1:nz] *= cz
    np.multiply(2*cr, v_old[1, 1:nz], out=w1)
    v[0, 1:nz] += w1
    # internal updates
    np.add(v_old[1:nr, 0:nz-1], v_old[1:nr, 2:nz+1], out=v[1:nr, 1:nz])
    v[1:nr,1:nz] *= cz
    np.multiply(A, v_old[0:nr-1, 1:nz], out=w2)
    v[1:nr,1:nz] += w2
    np.multiply(B, v_old[2:nr+1, 1:nz], out=w2)
    v[1:nr,1:nz] += w2
    # right boundary updates
    np.add(v_old[nr, 0:nz-1], v_old[nr, 2:nz+1], out=v[nr, 1:nz])
    v[nr, 1:nz] *= cz
    np.multiply(2*cr, v_old[nr-1, 1:nz], out=w3)
    v[nr,1:nz] += w3
    # reapply grid potentials
    v[31:,29:40] = 1000
    v[19:,54:65] = -200
    # check for convergence
    v_old -= v
    max_v_diff = np.absolute(v_old).max()
toc = time.time() - tic
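When an expression is split into in-place steps like this, it is worth checking on a small random array that the result still matches the original one-line formula. The sketch below does that for the internal update only; the grid size, the coefficients cr and cz and the random data are arbitrary test values, not those of the real problem.

```python
import numpy as np

rng = np.random.default_rng(0)
nr, nz = 8, 16                         # small test mesh (assumed sizes)
cr, cz = 0.3, 0.2                      # arbitrary test coefficients
v_old = rng.random((nr + 1, nz + 1))
ri = np.tile(np.arange(nr + 1, dtype=float), (nz + 1, 1)).T
inner = ri[1:nr, 1:nz]                 # excludes the axis row where ri == 0

# One-line expression from the question, evaluated on v_old
direct = (cr * ((1 - 1/(2*inner)) * v_old[0:nr-1, 1:nz]
                + (1 + 1/(2*inner)) * v_old[2:nr+1, 1:nz])
          + cz * (v_old[1:nr, 0:nz-1] + v_old[1:nr, 2:nz+1]))

# In-place version with precomputed constants and one work array
A = cr * (1 - 1/(2*inner))
B = cr * (1 + 1/(2*inner))
v = v_old.copy()
w2 = np.empty_like(v[1:nr, 1:nz])
np.add(v_old[1:nr, 0:nz-1], v_old[1:nr, 2:nz+1], out=v[1:nr, 1:nz])
v[1:nr, 1:nz] *= cz
np.multiply(A, v_old[0:nr-1, 1:nz], out=w2)
v[1:nr, 1:nz] += w2
np.multiply(B, v_old[2:nr+1, 1:nz], out=w2)
v[1:nr, 1:nz] += w2

matches = np.allclose(v[1:nr, 1:nz], direct)
print(matches)
```

Note that the rewritten loop reads every neighbour from v_old, so the internal update no longer sees the boundary row that was refreshed a few lines earlier. The converged solution should be the same, but the number of iterations may differ slightly from the question's loop.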

Related

Runtime error: Factor is exactly singular

I am trying to implement 2 temperature models, the following equations:
C_e ∂T_e/∂t = ∇·[k_e ∇T_e] - G (T_e - T_ph) + A(r,t)
C_ph ∂T_ph/∂t = ∇·[k_ph ∇T_ph] + G (T_e - T_ph)
Code
from fipy.tools import numerix
import scipy
import fipy
import numpy as np
from fipy import CylindricalGrid1D
from fipy import Variable, CellVariable, TransientTerm, DiffusionTerm, Viewer, LinearLUSolver, LinearPCGSolver, \
LinearGMRESSolver, ImplicitDiffusionTerm, Grid1D
FIPY_SOLVERS = scipy
## Mesh
nr = 50
dr = 1e-7
# r = nr * dr
mesh = CylindricalGrid1D(nr=nr, dr=dr, origin=0)
x = mesh.cellCenters[0]
# Variables
T_e = CellVariable(name="electronTemp", mesh=mesh,hasOld=True)
T_e.setValue(300)
T_ph = CellVariable(name="phononTemp", mesh=mesh, hasOld=True)
T_ph.setValue(300)
G = CellVariable(name="EPC", mesh=mesh)
t = Variable()
# Material parameters
C_e = CellVariable(name="C_e", mesh=mesh)
k_e = CellVariable(name="k_e", mesh=mesh)
C_ph = CellVariable(name="C_ph", mesh=mesh)
k_ph = CellVariable(name="k_ph", mesh=mesh)
C_e = 4.15303 - (4.06897 * numerix.exp(T_e / -85120.8644))
C_ph = 4.10446 - 3.886 * numerix.exp(-T_ph / 373.8)
k_e = 0.1549 * T_e**-0.052
k_ph =1.24 + 16.29 * numerix.exp(-T_ph / 151.57)
G = numerix.exp(21.87 + 10.062 * numerix.log(numerix.log(T_e )- 5.4))
# Boundary conditions
T_e.constrain(300, where=x > 4.5e-6)
T_ph.constrain(300, where=x > 4.5e-6)
# Source: A(r,t) = a * D(r) * tau^-1 * exp(-t/tau),  D(r) = S_e * exp(-r^2/sigma^2) / sqrt(2*pi*sigma^2)
sig = 1.0e-6
tau = 1e-15
S_e = 35
d_r = (S_e * 1.6e-9 * numerix.exp(-x**2 /sig**2)) / (numerix.sqrt(2. * 3.14 * sig**2))
A_t = numerix.exp(-t/tau)
a = (numerix.sqrt(2. * 3.14)) / (3.14 * sig)
A_r = a * d_r * tau**-1 * A_t
eq0 = (TransientTerm(var=T_e, coeff=C_e) == DiffusionTerm(var=T_e, coeff=k_e) - G*(T_e - T_ph) + A_r)
eq1 = (TransientTerm(var=T_ph, coeff=C_ph) == DiffusionTerm(var=T_ph, coeff=k_ph) + G*(T_e - T_ph))
eq = eq0 & eq1
dt = 1e-18
steps = 7000
elapsed = 0.
vi = Viewer((T_e, T_ph), datamin=0., datamax=2e4)
for step in range(steps):
    T_e.updateOld()
    T_ph.updateOld()
    vi.plot()
    res = 1e100
    dt *= 1.1
    while res > 1:
        res = eq.sweep(dt=dt)
        print(t, res)
    t.setValue(t + dt)
Problem
The code works fine with a very small dt = 1e-18, but I need to run it until an elapsed time of 1e-10.
With this time step it is going to take a very long time, and with dt *= 1.1 the residuals at some point start to increase, and then
the following runtime error is raised:
factor is exactly singular
Even with a very small increment, dt *= 1.005, the same issue pops up.
Using dt *= 1.001 runs the code for quite a long time, then the residual gets stuck at a certain value.
Questions
Is there any error in the fipy formalism of the equations?
What causes the error?
Is the error because of time step increase? If yes, how can I increase my time step?

I've made a few more changes to the code that can get you to an elapsed time of 1e-10. The main changes are:
Using ImplicitSourceTerm for the terms with G. This stabilizes the solution.
Applied underRelaxation=0.5 in the sweep step. This slows down the updates in the sweep loop so the feedback loop is damped down.
Removed FIPY_SOLVERS=scipy. This isn't doing anything. FIPY_SOLVERS is an environment variable that you set outside of the Python environment.
The way the boundary conditions were applied seemed strange so I applied them in a more canonical way.
The sweep loop is fixed to 10 sweeps to get to a steady state quickly. Note that as the solution gets close to a stable steady state, the residual won't get better necessarily. Probably want to go back to residual checks if you need an accurate transient.
from fipy.tools import numerix
import scipy
import fipy
import numpy as np
from fipy import CylindricalGrid1D
from fipy import Variable, CellVariable, TransientTerm, DiffusionTerm, Viewer, LinearLUSolver, LinearPCGSolver, \
LinearGMRESSolver, ImplicitDiffusionTerm, Grid1D, ImplicitSourceTerm
## Mesh
nr = 50
dr = 1e-7
# r = nr * dr
mesh = CylindricalGrid1D(nr=nr, dr=dr, origin=0)
x = mesh.cellCenters[0]
# Variables
T_e = CellVariable(name="electronTemp", mesh=mesh,hasOld=True)
T_e.setValue(300)
T_ph = CellVariable(name="phononTemp", mesh=mesh, hasOld=True)
T_ph.setValue(300)
G = CellVariable(name="EPC", mesh=mesh)
t = Variable()
# Material parameters
C_e = CellVariable(name="C_e", mesh=mesh)
k_e = CellVariable(name="k_e", mesh=mesh)
C_ph = CellVariable(name="C_ph", mesh=mesh)
k_ph = CellVariable(name="k_ph", mesh=mesh)
C_e = 4.15303 - (4.06897 * numerix.exp(T_e / -85120.8644))
C_ph = 4.10446 - 3.886 * numerix.exp(-T_ph / 373.8)
k_e = 0.1549 * T_e**-0.052
k_ph =1.24 + 16.29 * numerix.exp(-T_ph / 151.57)
G = numerix.exp(21.87 + 10.062 * numerix.log(numerix.log(T_e )- 5.4))
# Boundary conditions
T_e.constrain(300, where=mesh.facesRight)
T_ph.constrain(300, where=mesh.facesRight)
# Source: A(r,t) = a * D(r) * tau^-1 * exp(-t/tau),  D(r) = S_e * exp(-r^2/sigma^2) / sqrt(2*pi*sigma^2)
sig = 1.0e-6
tau = 1e-15
S_e = 35
d_r = (S_e * 1.6e-9 * numerix.exp(-x**2 /sig**2)) / (numerix.sqrt(2. * 3.14 * sig**2))
A_t = numerix.exp(-t/tau)
a = (numerix.sqrt(2. * 3.14)) / (3.14 * sig)
A_r = a * d_r * tau**-1 * A_t
eq0 = (
    TransientTerm(var=T_e, coeff=C_e) == \
    DiffusionTerm(var=T_e, coeff=k_e) - \
    ImplicitSourceTerm(coeff=G, var=T_e) + \
    ImplicitSourceTerm(var=T_ph, coeff=G) + \
    A_r)
eq1 = (TransientTerm(var=T_ph, coeff=C_ph) == DiffusionTerm(var=T_ph, coeff=k_ph) + ImplicitSourceTerm(var=T_e, coeff=G) - ImplicitSourceTerm(coeff=G, var=T_ph))
eq = eq0 & eq1
dt = 1e-18
steps = 7000
elapsed = 0.
vi = Viewer((T_e, T_ph), datamin=0., datamax=2e4)
for step in range(steps):
    T_e.updateOld()
    T_ph.updateOld()
    vi.plot()
    res = 1e100
    dt *= 1.1
    count = 0
    while count < 10:
        res = eq.sweep(dt=dt, underRelaxation=0.5)
        print(t, res)
        count += 1
    print('elapsed:', t.value)
    t.setValue(t + dt)
Regarding your questions.
Is there any error in the fipy formalism of the equations?
Actually, no. Nothing is wrong with the formalism, but it is better to use ImplicitSourceTerm.
What causes the error?
There are two sources of instability in this system. The source terms inside the equation, when written explicitly, are unstable above a certain time step. Using an ImplicitSourceTerm removes this instability. There is also some sort of instability in the coupling of the equations. I think that using under-relaxation helps with that.
Is the error because of time step increase? If yes, how can I increase my time step?
Explained above.
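The time-step limit of an explicitly written source term can be illustrated without FiPy on the scalar model problem dT/dt = -G*T. The sketch below is only that toy problem, not the two-temperature system: the values of G and dt and the function names are made up for illustration. An explicit update multiplies T by (1 - G*dt) each step, which grows in magnitude once G*dt > 2, while the implicit update divides by (1 + G*dt) and decays for any positive dt.

```python
def explicit_step(T, G, dt):
    # Source evaluated at the old value: T_new = (1 - G*dt) * T
    return T - dt * G * T

def implicit_step(T, G, dt):
    # Source evaluated at the new value: T_new = T / (1 + G*dt)
    return T / (1.0 + dt * G)

def run(step, T0, G, dt, n_steps):
    T = T0
    for _ in range(n_steps):
        T = step(T, G, dt)
    return T

G = 1.0e3                      # assumed coupling strength for the toy problem
T0 = 1.0
for dt in (1.0e-4, 1.0e-2):    # G*dt = 0.1 (safe) and G*dt = 10 (too large)
    print(dt, run(explicit_step, T0, G, dt, 20), run(implicit_step, T0, G, dt, 20))
```

In the coupled FiPy system the same idea applies term by term, which is presumably why moving the G terms into ImplicitSourceTerm allows larger steps.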

In addition to @wd15's answer:
Your equations are extremely non-linear. You will likely benefit from Newton iterations to get decent convergence.
As @TimRoberts said, geometrically increasing the time step without bound is probably not a good idea.
I've recently posted a package called steppyngstounes that takes care of adapting timesteps. Although a standalone package, it's intended to work with FiPy. For example, you could change your solve loop to this:
from steppyngstounes import FixedStepper, PIDStepper
T_e.updateOld()
T_ph.updateOld()
for checkpoint in FixedStepper(start=0, stop=1e-10, size=1e-12):
    for step in PIDStepper(start=checkpoint.begin,
                           stop=checkpoint.end,
                           size=dt):
        res = 1e100
        for sweep in range(10):
            res = eq.sweep(dt=dt, underRelaxation=0.5)
            print(t, sweep, res)
        if step.succeeded(error=res / 1000):
            T_e.updateOld()
            T_ph.updateOld()
            t.value = step.end
        else:
            T_e.value = T_e.old
            T_ph.value = T_ph.old
        print('elapsed:', t.value)
    # the last step might have been smaller than possible,
    # if it was near the end of the checkpoint range
    dt = step.want
    _ = checkpoint.succeeded()
    vi.plot()
This code will update the viewer every 1e-12 time units, and adaptively make its way between those checkpoints. There are other steppers in the package that would facilitate taking geometrically or exponentially increasing checkpoints, if that kept things more interesting.
You could probably get better overall performance by sweeping fewer times and letting the adapter take much smaller time steps in the beginning. I found that no time step was small enough to get the initial residual lower than 777.9. After the first couple of steps, the error metric could probably be much more aggressive, giving more accurate results.
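To put a number on the concern about unbounded geometric growth, the following sketch simply counts the steps that the original dt *= 1.1 rule needs to reach 1e-10 and reports the size of the final step. It does not solve anything; count_growing_steps and the cap dt_max=1e-12 are hypothetical, chosen only to match the checkpoint size used above.

```python
def count_growing_steps(dt0, growth, t_end, dt_max=None):
    """Return (number of steps, last dt) when dt is multiplied by `growth`
    before every step, optionally capped at dt_max."""
    dt, elapsed, n = dt0, 0.0, 0
    while elapsed < t_end:
        dt *= growth
        if dt_max is not None:
            dt = min(dt, dt_max)
        elapsed += dt
        n += 1
    return n, dt

n_free, dt_free = count_growing_steps(1e-18, 1.1, 1e-10)
n_capped, dt_capped = count_growing_steps(1e-18, 1.1, 1e-10, dt_max=1e-12)
print("uncapped:", n_free, dt_free)
print("capped:  ", n_capped, dt_capped)
```

With a growth factor of 1.1 the final uncapped step alone spans roughly a tenth of the whole simulated interval, which is far coarser than the start of the run. Capping the step, or letting an adaptive stepper choose it, costs more steps but keeps the step size tied to the accuracy of the solution.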

Using python built-in functions for coupled ODEs

THIS PART IS JUST BACKGROUND IF YOU NEED IT
I am developing a numerical solver for the Second-Order Kuramoto Model. The functions I use to find the derivatives of theta and omega are given below.
# n-dimensional change in theta
def d_theta(omega):
    return omega
# n-dimensional change in omega
def d_omega(K,A,P,alpha,mask,n):
    def layer1(theta,omega):
        T = theta[:,None] - theta
        A[mask] = K[mask] * np.sin(T[mask])
        return - alpha*omega + P - A.sum(1)
    return layer1
These equations return vectors.
QUESTION 1
I know how to use odeint for two dimensions, (y,t). For my research I want to use a built-in Python function that works for higher dimensions.
QUESTION 2
I do not necessarily want to stop after a predetermined amount of time. I have other stopping conditions in the code below that will indicate whether the system of equations converges to the steady state. How do I incorporate these into a built-in Python solver?
WHAT I CURRENTLY HAVE
This is the code I am currently using to solve the system. I just implemented RK4 with constant time stepping in a loop.
# This function randomly samples initial values in the domain and returns whether the solution converged
# Inputs:
# f         change in theta (d_theta)
# g         change in omega (d_omega)
# tol_ss    when the norm of omega is below this tolerance, the nodes are considered close to the steady state
# tol_step  when the change per step is below this tolerance, the solution is said to converge
# h         size of the time step
# max_iter  maximum number of steps Runge-Kutta will perform before giving up
# max_laps  maximum number of laps the solution can do before giving up
# fixed_o   vector of fixed points of omega
# fixed_t   vector of fixed points of theta
# n         number of dimensions
# theta     initial theta vector
# omega     initial omega vector
# Outputs:
# converges True if the nodes restabilize, False otherwise
def kuramoto_rk4_wss(f,g,tol_ss,tol_step,h,max_iter,max_laps,fixed_o,fixed_t,n):
    def layer1(theta,omega):
        lap = np.zeros(n, dtype = int)
        converges = False
        i = 0
        tau = 2 * np.pi
        while(i < max_iter): # perform RK4 with constant time step
            p_omega = omega
            p_theta = theta
            T1 = h*f(omega)
            O1 = h*g(theta,omega)
            T2 = h*f(omega + O1/2)
            O2 = h*g(theta + T1/2,omega + O1/2)
            T3 = h*f(omega + O2/2)
            O3 = h*g(theta + T2/2,omega + O2/2)
            T4 = h*f(omega + O3)
            O4 = h*g(theta + T3,omega + O3)
            theta = theta + (T1 + 2*T2 + 2*T3 + T4)/6 # take theta time step
            mask2 = np.array(np.where(np.logical_or(theta > tau, theta < 0))) # find which nodes left [0, 2pi]
            lap[mask2] = lap[mask2] + 1 # increment the lap counter of those nodes
            theta[mask2] = np.mod(theta[mask2], tau) # take the modulus
            omega = omega + (O1 + 2*O2 + 2*O3 + O4)/6
            if(max_laps in lap): # if any generator rotates this many times it probably won't converge
                break
            elif(np.any(omega > 12)): # if any of the generators is rotating this fast, it probably won't converge
                break
            elif(np.linalg.norm(omega) < tol_ss and # assert the nodes are sufficiently close to the equilibrium
                 np.linalg.norm(omega - p_omega) < tol_step and # assert change in omega is small
                 np.linalg.norm(theta - p_theta) < tol_step): # assert change in theta is small
                converges = True
                break
            i = i + 1
        return converges
    return layer1
Thanks for your help!

You can wrap your existing functions into a function accepted by odeint (option tfirst=True) and solve_ivp as
import numpy as np
from scipy.integrate import odeint, solve_ivp
def odesys(t, u):
    theta, omega = u[:n], u[n:]  # or = u.reshape(2,-1)
    return [*f(omega), *g(theta, omega)]  # or np.concatenate([f(omega), g(theta,omega)])
u0 = [*theta0, *omega0]
t = np.linspace(t0, tf, timesteps+1)
u = odeint(odesys, u0, t, tfirst=True)
# or
res = solve_ivp(odesys, [t0, tf], u0, t_eval=t)
The scipy methods pass numpy arrays and convert the return value into same, so that you do not have to care in the ODE function. The variant in comments is using explicit numpy functions.
While solve_ivp does have event handling, using it for a systematic collection of events is rather cumbersome. It would be easier to advance some fixed step, do the normalization and termination detection, and then repeat this.
If you want to later increase efficiency somewhat, use directly the stepper classes behind solve_ivp.
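The "advance some fixed step, normalize, check, repeat" idea could look roughly like the sketch below. It is an illustration under assumptions, not the asker's solver: make_rhs and integrate_until_steady are hypothetical helpers, the coupling is written for a dense matrix K instead of the masked version, the chunk length and tolerances are arbitrary, and the steady-state test (small norm of the full right-hand side) stands in for the richer stopping rules of the RK4 loop.

```python
import numpy as np
from scipy.integrate import solve_ivp

def make_rhs(K, P, alpha):
    """Second-order Kuramoto right-hand side for the stacked state u = [theta, omega]."""
    def rhs(t, u):
        n = u.size // 2
        theta, omega = u[:n], u[n:]
        coupling = (K * np.sin(theta[:, None] - theta)).sum(axis=1)
        return np.concatenate([omega, -alpha * omega + P - coupling])
    return rhs

def integrate_until_steady(rhs, u0, chunk=1.0, max_chunks=200, tol=1e-6):
    """Integrate in chunks of length `chunk`; after each chunk wrap theta into
    [0, 2*pi) and stop once the derivative norm drops below `tol`."""
    u = np.asarray(u0, dtype=float)
    n = u.size // 2
    t = 0.0
    for _ in range(max_chunks):
        sol = solve_ivp(rhs, (t, t + chunk), u, rtol=1e-8, atol=1e-10)
        t = sol.t[-1]
        u = sol.y[:, -1].copy()
        u[:n] = np.mod(u[:n], 2 * np.pi)       # normalization step
        if np.linalg.norm(rhs(t, u)) < tol:    # termination detection
            return True, t, u
    return False, t, u

# Two coupled nodes: one generator (P = +0.5) and one consumer (P = -0.5)
K = np.array([[0.0, 1.0], [1.0, 0.0]])
P = np.array([0.5, -0.5])
converged, t_end, u_end = integrate_until_steady(make_rhs(K, P, alpha=1.0), np.zeros(4))
print(converged, t_end, np.sin(u_end[0] - u_end[1]))
```

Other stopping conditions, such as a lap counter or a cap on omega, can be added next to the norm check, since the state is available after every chunk.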

Solving heat equation with python (NumPy)

I solve the heat equation for a metal rod with one end kept at 100 °C and the other at 0 °C as follows:
import numpy as np
import matplotlib.pyplot as plt
dt = 0.0005
dy = 0.0005
k = 10**(-4)
y_max = 0.04
t_max = 1
T0 = 100
def FTCS(dt,dy,t_max,y_max,k,T0):
    s = k*dt/dy**2
    y = np.arange(0,y_max+dy,dy)
    t = np.arange(0,t_max+dt,dt)
    r = len(t)
    c = len(y)
    T = np.zeros([r,c])
    T[:,0] = T0
    for n in range(0,r-1):
        for j in range(1,c-1):
            T[n+1,j] = T[n,j] + s*(T[n,j-1] - 2*T[n,j] + T[n,j+1])
    return y,T,r,s
y,T,r,s = FTCS(dt,dy,t_max,y_max,k,T0)
plot_times = np.arange(0.01,1.0,0.01)
for t in plot_times:
    plt.plot(y,T[int(round(t/dt)),:])  # the row index must be an integer
If the boundary condition is changed to a Neumann condition, so that one end is insulated (no flux),
how should the update term
T[n+1,j] = T[n,j] + s*(T[n,j-1] - 2*T[n,j] + T[n,j+1])
be modified?

A typical approach to Neumann boundary condition is to imagine a "ghost point" one step beyond the domain, and calculate the value for it using the boundary condition; then proceed normally (using the PDE) for the points that are inside the grid, including the Neumann boundary.
The ghost point allows us to use the symmetric finite difference approximation to the derivative at the boundary, that is (T[n, j+1] - T[n, j-1]) / (2*dy) if y is the space variable. The non-symmetric approximation (T[n, j] - T[n, j-1]) / dy, which does not involve a ghost point, is much less accurate: it is only first-order accurate in dy, whereas the discretization of the PDE itself is second-order accurate in dy.
So, when j is the maximal possible index for T, the boundary condition says that "T[n, j+1]" should be understood as T[n, j-1] and this is what is done below.
for j in range(1, c-1):
    T[n+1,j] = T[n,j] + s*(T[n,j-1] - 2*T[n,j] + T[n,j+1]) # as before
j = c-1
T[n+1, j] = T[n,j] + s*(T[n,j-1] - 2*T[n,j] + T[n,j-1]) # note the last term here
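Setting the symmetric difference (T[n, J+1] - T[n, J-1]) / (2*dy) to zero at the last index J gives the ghost value T[n, J+1] = T[n, J-1], so the boundary update reduces to T[n+1, J] = T[n, J] + 2*s*(T[n, J-1] - T[n, J]). A vectorised sketch of the whole scheme with that boundary row is given below; the function name ftcs_insulated is made up here, and the parameters are the ones from the question.

```python
import numpy as np

def ftcs_insulated(dt, dy, t_max, y_max, k, T0):
    """FTCS for the 1D heat equation: T = T0 at y = 0, insulated end at y = y_max."""
    s = k * dt / dy**2                      # must stay <= 0.5 for stability
    y = np.arange(0, y_max + dy / 2, dy)
    n_steps = int(round(t_max / dt))
    T = np.zeros((n_steps + 1, y.size))
    T[:, 0] = T0                            # Dirichlet end
    for n in range(n_steps):
        # interior nodes, all at once
        T[n + 1, 1:-1] = T[n, 1:-1] + s * (T[n, :-2] - 2 * T[n, 1:-1] + T[n, 2:])
        # insulated end: ghost value equals T[n, -2]
        T[n + 1, -1] = T[n, -1] + 2 * s * (T[n, -2] - T[n, -1])
    return y, T, s

y, T, s = ftcs_insulated(dt=0.0005, dy=0.0005, t_max=1, y_max=0.04, k=1e-4, T0=100)
print(s, T[-1, -1])
```

Replacing the inner j loop by slices is optional; the boundary line is the only part that differs from the original scheme.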

Build an approximately uniform grid from random sample (python)

I want to build a grid from sampled data. I could use a machine learning - clustering algorithm, like k-means, but I want to restrict the centres to be roughly uniformly distributed.
I have come up with an approach using the scikit-learn nearest neighbours search: pick a point at random, delete all points within radius r, then repeat. This works well, but I am wondering if anyone has a better (faster) way of doing this.
In response to comments I have tried two alternate methods, one turns out much slower the other is about the same...
Method 0 (my first attempt):
def get_centers0(X, r):
    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0,D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    while N > 0:
        nearest.fit(X)
        x = X[int(random()*N), :]
        _, del_x = nearest.radius_neighbors(x)
        X = np.delete(X, del_x[0], axis = 0)
        grid = np.vstack([grid, x])
        N = X.shape[0]
    return grid
Method 1 (using the precomputed graph):
def get_centers1(X, r):
    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0,D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    nearest.fit(X)
    graph = nearest.radius_neighbors_graph(X)
    #This method is very slow even before doing any 'pruning'
Method 2:
def get_centers2(X, r, k):
    N = X.shape[0]
    D = X.shape[1]
    k = k
    grid = np.zeros([0,D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    while N > 0:
        nearest.fit(X)
        x = X[np.random.randint(0,N,k), :]
        #min_dist = near.NearestNeighbors().fit(x).kneighbors(x, n_neighbors = 1, return_distance = True)
        min_dist = dist(x, k, 2, np.ones(k)) # where dist is a cython compiled function
        x = x[min_dist < 0.1,:]
        _, del_x = nearest.radius_neighbors(x)
        X = np.delete(X, del_x[0], axis = 0)
        grid = np.vstack([grid, x])
        N = X.shape[0]
    return grid
Running these as follows:
N = 50000
r = 0.1
x1 = np.random.rand(N)
x2 = np.random.rand(N)
X = np.vstack([x1, x2]).T
tic = time.time()
grid0 = get_centers0(X, r)
toc = time.time()
print 'Method 0: ' + str(toc - tic)
tic = time.time()
get_centers1(X, r)
toc = time.time()
print 'Method 1: ' + str(toc - tic)
tic = time.time()
grid2 = get_centers2(X, r, 10)
toc = time.time()
print 'Method 2: ' + str(toc - tic)
Method 0 and 2 are about the same...
Method 0: 0.840130090714
Method 1: 2.23365592957
Method 2: 0.774812936783

I'm not sure from the question exactly what you are trying to do. You mention wanting to create an "approximate grid", or a "uniform distribution", while the code you provide selects a subset of points such that no pairwise distance is less than r.
A couple of possible suggestions:
If what you want is an approximate grid, I would construct the grid you want to approximate, and then query for the nearest neighbor of each grid point. Depending on your application, you might further trim these results to cut out points whose distance from the grid point is larger than is useful for you.
If what you want is an approximately uniform distribution drawn from among the points, I would do a kernel density estimate (sklearn.neighbors.KernelDensity) at each point, and do a randomized sub-selection from the dataset weighted by the inverse of the local density at each point.
If what you want is a subset of points such that no pairwise distance is less than r, I would start by constructing a radius_neighbors_graph with radius r, which will, in one go, give you a list of all points which are too close together. You can then use a pruning algorithm similar to the one you wrote above to remove points based on these sparse graph distances.
I hope that helps!
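The first suggestion (build the target grid, then look up the nearest sample for each grid point) might be sketched as follows. This version uses scipy.spatial.cKDTree instead of scikit-learn purely to keep the example short; snap_to_grid, the grid spacing and the optional max_dist cut-off are illustrative names and values.

```python
import numpy as np
from scipy.spatial import cKDTree

def snap_to_grid(X, spacing, max_dist=None):
    """Return the samples of X that are nearest to the nodes of a regular grid.

    Grid nodes whose nearest sample is farther than max_dist are skipped.
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    axes = [np.arange(a, b + spacing / 2, spacing) for a, b in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, X.shape[1])
    dist, idx = cKDTree(X).query(grid)        # nearest sample for every grid node
    keep = np.ones(len(grid), dtype=bool) if max_dist is None else dist <= max_dist
    return X[np.unique(idx[keep])]            # drop samples picked by several nodes

rng = np.random.default_rng(0)
X = rng.random((5000, 2))
centers = snap_to_grid(X, spacing=0.1, max_dist=0.05)
print(centers.shape)
```

Unlike the greedy radius-based pruning, this gives centres that follow the grid layout, at the cost of needing a bounding box and a spacing up front.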

I have come up with a very simple method which is much more efficient than my previous attempts.
This one simply loops over the data set and adds the current point to the list of grid points only if it is greater than r distance from all existing centers. This method is around 20 times faster than my previous attempts. Because there are no external libraries involved I can run this all in cython...
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def get_centers_fast(np.ndarray[DTYPE_t, ndim = 2] x, double radius):
    cdef int N = x.shape[0]
    cdef int D = x.shape[1]
    cdef int m = 1
    cdef np.ndarray[DTYPE_t, ndim = 2] xc = np.zeros([10000, D])
    cdef double r = 0
    cdef double r_min = 10
    cdef int i, j, k
    for k in range(D):
        xc[0,k] = x[0,k]
    for i in range(1, N):
        r_min = 10
        for j in range(m):
            r = 0
            for k in range(D):
                r += (x[i, k] - xc[j, k])**2
            r = r**0.5
            if r < r_min:
                r_min = r
        if r_min > radius:
            m = m + 1
            for k in range(D):
                xc[m - 1,k] = x[i,k]
    nonzero = np.nonzero(xc[:,0])[0]
    xc = xc[nonzero,:]
    return xc
Running these methods as follows:
N = 40000
r = 0.1
x1 = np.random.normal(size = N)
x1 = (x1 - min(x1)) / (max(x1)-min(x1))
x2 = np.random.normal(size = N)
x2 = (x2 - min(x2)) / (max(x2)-min(x2))
X = np.vstack([x1, x2]).T
tic = time.time()
grid0 = gt.get_centers0(X, r)
toc = time.time()
print 'Method 0: ' + str(toc - tic)
tic = time.time()
grid2 = gt.get_centers2(X, r, 10)
toc = time.time()
print 'Method 2: ' + str(toc - tic)
tic = time.time()
grid3 = gt.get_centers_fast(X, r)
toc = time.time()
print 'Method 3: ' + str(toc - tic)
The new method is around 20 times faster. It could be made even faster, if I stopped looping early (e.g. if k successive iterations fail to produce a new center).
Method 0: 0.219595909119
Method 2: 0.191949129105
Method 3: 0.0127329826355
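For comparison, the same greedy rule (keep a point only if it is farther than the radius from every centre kept so far) can be written in plain NumPy. This is a sketch for checking results, not a replacement for the Cython version, and it will be slower; get_centers_numpy and max_centers are names introduced here. It returns the first m rows of the buffer instead of filtering on nonzero first coordinates, which avoids dropping a centre whose first coordinate happens to be exactly zero.

```python
import numpy as np

def get_centers_numpy(X, radius, max_centers=10000):
    """Greedy selection: a point becomes a centre if it is farther than
    `radius` from all centres selected so far."""
    X = np.asarray(X, dtype=float)
    centers = np.empty((min(max_centers, len(X)), X.shape[1]))
    centers[0] = X[0]
    m = 1
    r2 = radius ** 2
    for x in X[1:]:
        d2 = ((centers[:m] - x) ** 2).sum(axis=1)   # squared distances to centres
        if d2.min() > r2:
            if m == len(centers):
                raise ValueError("max_centers is too small")
            centers[m] = x
            m += 1
    return centers[:m]

rng = np.random.default_rng(0)
X = rng.random((2000, 2))
grid = get_centers_numpy(X, 0.1)
print(grid.shape)
```

By construction, every pair of returned centres is more than the radius apart and every input point lies within the radius of at least one centre.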

Maybe you could re-fit the nearest object only every k << N deletions to speed up the process. Most of the time the neighborhood structure should not change much.

Sounds like you are trying to reinvent one of the following:
cluster features (see BIRCH)
data bubbles (see "Data bubbles: Quality preserving performance boosting for hierarchical clustering")
canopy pre-clustering
i.e. this concept has already been invented at least three times with small variations.
Technically, it is not clustering. K-means isn't really clustering either.
It is much more adequately described as vector quantization.

for loop in python is 10x slower than matlab

I run Python 2.7 and MATLAB R2010a on the same machine, with nothing else running, and I see a 10x difference in speed.
I looked online and read that they should be of the same order.
Python slows down further once there are if statements and math operators in the for loop.
My question: is this the reality? Or is there some other way to get them to the same order of speed?
Here is python code
import time
start_time = time.time()
for r in xrange(1000):
    for c in xrange(1000):
        continue
elapsed_time = time.time() - start_time
print 'time cost = ',elapsed_time
Output: time cost = 0.0377440452576
Here is matlab code
tic
for i = 1:1000
for j = 1:1000
end
end
toc
Output: Elapsed time is 0.004200 seconds

The reason this is happening is related to the JIT compiler, which is optimizing the MATLAB for loop. You can disable/enable the JIT accelerator using feature accel off and feature accel on. When you disable the accelerator, the times change dramatically.
MATLAB with accel on: Elapsed time is 0.009407 seconds.
MATLAB with accel off: Elapsed time is 0.287955 seconds.
python: time cost = 0.0511920452118
Thus the JIT accelerator is directly causing the speedup that you are noticing. There is another thing that you should consider, which is related to the way that you defined the iteration indices. In both cases, MATLAB and python, you used Iterators to define your loops. In MATLAB you create the actual values by adding the square brackets ([]), and in python you use range instead of xrange. When you make these changes
% MATLAB
for i = [1:1000]
    for j = [1:1000]
# python
for r in range(1000):
    for c in range(1000):
The times become
MATLAB with accel on: Elapsed time is 0.338701 seconds.
MATLAB with accel off: Elapsed time is 0.289220 seconds.
python: time cost = 0.0606048107147
One final consideration is what happens if you add a quick computation to the loop, i.e. t=t+1. Then the times become
MATLAB with accel on: Elapsed time is 1.340830 seconds.
MATLAB with accel off: Elapsed time is 0.905956 seconds. (Yes off was faster)
python: time cost = 0.147221088409
I think that the moral here is that the computation speeds of for loops, out-of-the box, are comparable for extremely simple loops, depending on the situation. However, there are other, numerical tools in python which can speed things up significantly, numpy and PyPy have been brought up so far.

The basic Python implementation, CPython, is not meant to be super-speedy. If you need efficient matlab-style numerical manipulation, use the numpy package or an implementation of Python that is designed for fast work, such as PyPy or even Cython. (Writing a Python extension in C, which will of course be pretty fast, is also a possible solution, but in that case you may as well just use numpy and save yourself the effort.)
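As a small illustration of what "use numpy" means for a nested loop, the sketch below computes the same quantity twice: once with two Python loops and once with a single array expression. The function names are made up, the example is written for Python 3 (range rather than xrange), and no timings are claimed here since they depend on the machine.

```python
import numpy as np

def loop_sum(n):
    # Interpreted double loop: n*n bytecode-level iterations
    total = 0
    for r in range(n):
        for c in range(n):
            total += r * c
    return total

def numpy_sum(n):
    # Same sum over all (r, c) pairs, evaluated inside compiled NumPy code
    idx = np.arange(n, dtype=np.int64)
    return int(np.outer(idx, idx).sum())

print(loop_sum(200), numpy_sum(200))
```

The two functions can be timed with the timeit module to see the difference on a given machine.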

If Python execution performance is really crucial for you, you might take a look at PyPy
I did your test:
import time
for a in range(10):
    start_time = time.time()
    for r in xrange(1000):
        for c in xrange(1000):
            continue
    elapsed_time = time.time()-start_time
    print elapsed_time
with standard Python 2.7.3, I get:
0.0311839580536
0.0310959815979
0.0309510231018
0.0306520462036
0.0302460193634
0.0324130058289
0.0308878421783
0.0307397842407
0.0304911136627
0.0307500362396
whereas, using PyPy 1.9.0 (which corresponds to Python 2.7.2), I get:
0.00921821594238
0.0115230083466
0.00851202011108
0.00808095932007
0.00496387481689
0.00499391555786
0.00508499145508
0.00618195533752
0.005126953125
0.00482988357544
The acceleration of PyPy is really stunning and really becomes visible when its JIT compiler optimizations outweigh their cost. That's also why I introduced the extra for loop. For this example, absolutely no modification of the code was needed.

This is just my opinion, but I think the picture is a bit more complex. Basically, MATLAB is an optimized layer over C, so with appropriate initialization of matrices and minimization of function calls (avoid "." object-like operators in MATLAB) you obtain extremely different results. Consider the following simple example of a wave generator using the cosine function. MATLAB time = 0.15 s in a practical debug session; Python time = 25 s in a practical debug session (Spyder), so Python is about 166x slower. Run directly with the Python 3.7.4 interpreter, the time is approximately 5 s, which is still a non-negligible 33x.
MATLAB:
AW(1,:) = [800 , 0 ]; % [amp freq]
AW(2,:) = [300 , 4E-07];
AW(3,:) = [200 , 1E-06];
AW(4,:) = [ 50 , 4E-06];
AW(5,:) = [ 30 , 9E-06];
AW(6,:) = [ 20 , 3E-05];
AW(7,:) = [ 10 , 4E-05];
AW(8,:) = [ 9 , 5E-04];
AW(9,:) = [ 7 , 7E-04];
AW(10,:)= [ 5 , 8E-03];
phas = 0;
tini = -2*365 *86400; % 2 years backwards in seconds
dt = 200; % step, 200 seconds
tfin = 0; % present
vec_t = ( tini: dt: tfin)'; % vector_time
nt = length(vec_t);
vec_t = vec_t - phas;
wave = zeros(nt,1);
for it = 1:nt
    suma = 0;
    t = vec_t(it,1);
    for iW = 1:size(AW,1)
        suma = suma + AW(iW,1)*cos(AW(iW,2)*t);
    end
    wave(it,1) = suma;
end
PYTHON:
import numpy as np
AW = np.zeros((10,2))
AW[0,:] = [800 , 0.0]    # [amp freq]
AW[1,:] = [300 , 4E-07]
AW[2,:] = [200 , 1E-06]
AW[3,:] = [ 50 , 4E-06]
AW[4,:] = [ 30 , 9E-06]
AW[5,:] = [ 20 , 3E-05]
AW[6,:] = [ 10 , 4E-05]
AW[7,:] = [ 9 , 5E-04]
AW[8,:] = [ 7 , 7E-04]
AW[9,:] = [ 5 , 8E-03]
phas = 0
tini = -2*365 *86400 # 2 years backwards in seconds
dt = 200 # step, 200 seconds
tfin = 0 # present
nt = round((tfin-tini)/dt) + 1
vec_t = np.linspace(tini,tfin,nt) - phas
wave = np.zeros((nt))
for it in range(nt):
    suma = 0
    t = vec_t[it]
    for iW in range(np.size(AW,0)):
        suma = suma + AW[iW,0]*np.cos(AW[iW,1]*t)
    #endfor iW
    wave[it] = suma
#endfor it
To handle such cases in Python, I would suggest compiling the numerically heavy parts that may compromise the project directly to a binary (or, for example, writing them in C or Fortran, building an executable, and calling it from Python afterwards). Of course, other suggestions are appreciated.
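Before reaching for a compiled extension, the double loop in the Python example above can also be expressed as a single NumPy array expression, which moves the per-element work out of the interpreter. The following is a minimal sketch, not a measured benchmark; the function name `wave_vectorized` is made up for this illustration, and the table values are copied from the AW array above.

```python
import numpy as np

# Amplitude / frequency table copied from the AW array in the post.
AW = np.array([
    [800, 0.0],
    [300, 4e-07],
    [200, 1e-06],
    [50, 4e-06],
    [30, 9e-06],
    [20, 3e-05],
    [10, 4e-05],
    [9, 5e-04],
    [7, 7e-04],
    [5, 8e-03],
])


def wave_vectorized(vec_t, AW):
    """Sum of cosines for every time step, without Python-level loops."""
    # np.outer builds an (nt, 10) array of phase angles t * freq;
    # the matrix product with the amplitude column sums the ten terms.
    return np.cos(np.outer(vec_t, AW[:, 1])) @ AW[:, 0]


phas = 0
tini = -2 * 365 * 86400   # 2 years backwards in seconds
dt = 200                  # step, 200 seconds
tfin = 0                  # present
nt = round((tfin - tini) / dt) + 1
vec_t = np.linspace(tini, tfin, nt) - phas

wave = wave_vectorized(vec_t, AW)
print(wave.shape)
```

The intermediate (nt, 10) array costs about 25 MB of memory for this grid, which is the usual trade-off of vectorizing a loop this way.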

I tested an FIR filter in MATLAB and the same (adapted) code in Python, including a frequency sweep. The FIR filter is fairly large (order N = 100). Both versions of the code are posted below, but here are the timing results first:
MATLAB: Elapsed time is 11.149704 seconds.
PYTHON: time cost = 247.8841781616211 seconds.
PYTHON IS ABOUT 22 TIMES SLOWER !!!
MATLAB CODE (main):
f1 = 4000; % bandpass frequency (response = 1).
f2 = 4200; % bandreject frequency (response = 0).
N = 100; % FIR filter order.
k = 0:2*N;
fs = 44100; Ts = 1/fs; % Sampling freq. and time.
% FIR Filter numerator coefficients:
Nz = Ts*(f1+f2)*sinc((f2-f1)*Ts*(k-N)).*sinc((f2+f1)*Ts*(k-N));
f = 0:fs/2;
w = 2*pi*f;
z = exp(-i*w*Ts);
% Calculation of the expected response:
Hz = polyval(Nz,z).*z.^(-2*N);
figure(1)
plot(f,abs(Hz))
title('FIR Filter Expected Response')
xlabel('frequency f (Hz)')
ylabel('|H(f)|')
xlim([0, 5000])
grid on
% Sweep Frequency Test:
tic
% Start and Stop frequencies of sweep, t = tmax = 50 seconds = 5000 Hz frequency:
fmin = 1; fmax = 5000; tmax = 50;
t = 0:Ts:tmax;
phase = 2*pi*fmin*t + 2*pi*((fmax-fmin).*t.^2)/(2*tmax);
x = cos(phase);
y = filtro2(Nz, 1, x); % custom filter function, not using "filter" library here.
figure(2)
plot(t,y)
title('FIR Filter Frequency Sweep')
xlabel('Sweep time: t = 10 s = 1000 Hz')
ylabel('y(t)')
xlim([0, 50])
grid on
toc
MATLAB CUSTOM FILTER FUNCTION
function y = filtro2(Nz, Dz, x)
    Nn = length(Nz);
    Nd = length(Dz);
    N = length(x);
    Nm = max(Nn,Nd);
    x1 = [zeros(Nm-1,1) ; x'];
    y1 = zeros(Nm-1,1);
    for n = Nm:N+Nm-1
        y1(n) = Nz(Nn:-1:1)*x1(n-Nn+1:n)/Dz(1);
        if Nd > 1
            y1(n) = y1(n) - Dz(Nd:-1:2)*y1(n-Nd+1:n-1)/Dz(1);
        end
    end
    y = y1(Nm:Nm+N-1);
end
PYTHON CODE (main):
import numpy as np
from matplotlib import pyplot as plt
import FiltroDigital as fd
import time
j = np.array([1j])
pi = np.pi
f1, f2 = 4000, 4200
N = 100
k = np.array(range(0,2*N+1),dtype='int')
fs = 44100; Ts = 1/fs;
Nz = Ts*(f1+f2)*np.sinc((f2-f1)*Ts*(k-N))*np.sinc((f2+f1)*Ts*(k-N));
f = np.arange(0, fs/2, 1)
w = 2*pi*f
z = np.exp(-j*w*Ts)
Hz = np.polyval(Nz,z)*z**(-2*N)
plt.figure(1)
plt.plot(f,abs(Hz))
plt.title("FIR Filter Expected Response")
plt.xlabel("frequency f (Hz)")
plt.ylabel("|H(f)|")
plt.xlim(0, 5000)
plt.grid()
plt.show()
start_time = time.time()
fmin = 1; fmax = 5000; tmax = 50;
t = np.arange(0, tmax, Ts)
fase = 2*pi*fmin*t + 2*pi*((fmax-fmin)*t**2)/(2*tmax)
x = np.cos(fase)
y = fd.filtro(Nz, [1], x)
plt.figure(2)
plt.plot(t,y)
plt.title("FIR Filter Frequency Sweep")
plt.xlabel("Sweep time: t = 10 s = 1000 Hz")
plt.ylabel("y(t)")
plt.xlim(0, 50)
plt.grid()
plt.show()
elapsed_time = time.time() - start_time
print('time cost = ', elapsed_time)
PYTHON CUSTOM FILTER FUNCTION
import numpy as np
def filtro(Nz, Dz, x):
    Nn = len(Nz)
    Nd = len(Dz)
    Nz = np.array(Nz,dtype=float)
    Dz = np.array(Dz,dtype=float)
    x = np.array(x,dtype=float)
    N = len(x)
    Nm = max(Nn,Nd)
    x1 = np.insert(x, 0, np.zeros((Nm-1,), dtype=float))
    y1 = np.zeros((N+Nm-1,), dtype=float)
    for n in range(Nm-1,N+Nm-1) :
        y1[n] = sum(Nz*np.flip( x1[n-Nn+1:n+1]))/Dz[0] # = y1FIR[n]
        if Nd > 1:
            y1[n] = y1[n] - sum(Dz[1:]*np.flip( y1[n-Nd+1:n]))/Dz[0]
        print(y1[n])
    y = y1[Nm-1:]
    return y
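
For the pure FIR case used in the test above (Dz = [1]), the per-sample Python loop in filtro computes a causal convolution of x with Nz, starting from zero initial conditions. A minimal sketch of the same calculation with `np.convolve` follows; the name `fir_filter` and the small input arrays are made up for this illustration, and no timing is claimed for it. For the general case with a denominator, `scipy.signal.lfilter(b, a, x)` is the usual library routine.

```python
import numpy as np


def fir_filter(b, x):
    """FIR-only counterpart of filtro(b, [1], x) with zero initial state."""
    b = np.asarray(b, dtype=float)
    x = np.asarray(x, dtype=float)
    # The full convolution has len(x) + len(b) - 1 samples;
    # keeping the first len(x) matches the output length of filtro.
    return np.convolve(x, b)[:len(x)]


# Small illustrative input (hypothetical values, not the sweep from the post).
b = np.array([0.25, 0.5, 0.25])
x = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = fir_filter(b, x)
print(y)
```

In the main script this would be called as `fir_filter(Nz, x)` in place of `fd.filtro(Nz, [1], x)`.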
