for loop in python is 10x slower than matlab - python

I run python 2.7 and matlab R2010a on the same machine, doing nothing, and it gives me 10x different in speed
I looked online, and heard it should be the same order.
Python will further slow down as if statement and math operator in the for loop
My question: is this the reality? or there is some other way let them in the same speed order?
Here is python code
import time
start_time = time.time()
for r in xrange(1000):
for c in xrange(1000):
continue
elapsed_time = time.time() - start_time
print 'time cost = ',elapsed_time
Output: time cost = 0.0377440452576
Here is matlab code
tic
for i = 1:1000
for j = 1:1000
end
end
toc
Output: Escaped time is 0.004200 seconds

The reason this is happening is related to the JIT compiler, which is optimizing the MATLAB for loop. You can disable/enable the JIT accelerator using feature accel off and feature accel on. When you disable the accelerator, the times change dramatically.
MATLAB with accel on: Elapsed time is 0.009407 seconds.
MATLAB with accel off: Elapsed time is 0.287955 seconds.
python: time cost = 0.0511920452118
Thus the JIT accelerator is directly causing the speedup that you are noticing. There is another thing that you should consider, which is related to the way that you defined the iteration indices. In both cases, MATLAB and python, you used Iterators to define your loops. In MATLAB you create the actual values by adding the square brackets ([]), and in python you use range instead of xrange. When you make these changes
% MATLAB
for i = [1:1000]
for j = [1:1000]
# python
for r in range(1000):
for c in range(1000):
The times become
MATLAB with accel on: Elapsed time is 0.338701 seconds.
MATLAB with accel off: Elapsed time is 0.289220 seconds.
python: time cost = 0.0606048107147
One final consideration is if you were to add a quick computation to the loop. ie t=t+1. Then the times become
MATLAB with accel on: Elapsed time is 1.340830 seconds.
MATLAB with accel off: Elapsed time is 0.905956 seconds. (Yes off was faster)
python: time cost = 0.147221088409
I think that the moral here is that the computation speeds of for loops, out-of-the box, are comparable for extremely simple loops, depending on the situation. However, there are other, numerical tools in python which can speed things up significantly, numpy and PyPy have been brought up so far.

The basic Python implementation, CPython, is not meant to be super-speedy. If you need efficient matlab-style numerical manipulation, use the numpy package or an implementation of Python that is designed for fast work, such as PyPy or even Cython. (Writing a Python extension in C, which will of course be pretty fast, is also a possible solution, but in that case you may as well just use numpy and save yourself the effort.)

If Python execution performance is really crucial for you, you might take a look at PyPy
I did your test:
import time
for a in range(10):
start_time = time.time()
for r in xrange(1000):
for c in xrange(1000):
continue
elapsed_time = time.time()-start_time
print elapsed_time
with standard Python 2.7.3, I get:
0.0311839580536
0.0310959815979
0.0309510231018
0.0306520462036
0.0302460193634
0.0324130058289
0.0308878421783
0.0307397842407
0.0304911136627
0.0307500362396
whereas, using PyPy 1.9.0 (which corresponds to Python 2.7.2), I get:
0.00921821594238
0.0115230083466
0.00851202011108
0.00808095932007
0.00496387481689
0.00499391555786
0.00508499145508
0.00618195533752
0.005126953125
0.00482988357544
The acceleration of PyPy is really stunning and really becomes visible when its JIT compiler optimizations outweigh their cost. That's also why I introduced the extra for loop. For this example, absolutely no modification of the code was needed.

This is just my opinion, but I think the process is a bit more complex. Basically Matlab is an optimized layer of C, so with the appropriate initialization of matrices and minimization of function calls (avoid "." objects-like operators in Matlab) you obtain extremely different results. Consider the simple following example of wave generator with cosine function. Matlab time = 0.15 secs in practical debug session, Python time = 25 secs in practical debug session (Spyder), thus Python becomes 166x slower. Run directly by Python 3.7.4. machine the time is = 5 secs aprox, so still be a non negligible 33x.
MATLAB:
AW(1,:) = [800 , 0 ]; % [amp frec]
AW(2,:) = [300 , 4E-07];
AW(3,:) = [200 , 1E-06];
AW(4,:) = [ 50 , 4E-06];
AW(5,:) = [ 30 , 9E-06];
AW(6,:) = [ 20 , 3E-05];
AW(7,:) = [ 10 , 4E-05];
AW(8,:) = [ 9 , 5E-04];
AW(9,:) = [ 7 , 7E-04];
AW(10,:)= [ 5 , 8E-03];
phas = 0
tini = -2*365 *86400; % 2 years backwards in seconds
dt = 200; % step, 200 seconds
tfin = 0; % present
vec_t = ( tini: dt: tfin)'; % vector_time
nt = length(vec_t);
vec_t = vec_t - phas;
wave = zeros(nt,1);
for it = 1:nt
suma = 0;
t = vec_t(it,1);
for iW = 1:size(AW,1)
suma = suma + AW(iW,1)*cos(AW(iW,2)*t);
end
wave(it,1) = suma;
end
PYTHON:
import numpy as np
AW = np.zeros((10,2))
AW[0,:] = [800 , 0.0]
AW[1,:] = [300 , 4E-07]; # [amp frec]
AW[2,:] = [200 , 1E-06];
AW[3,:] = [ 50 , 4E-06];
AW[4,:] = [ 30 , 9E-06];
AW[5,:] = [ 20 , 3E-05];
AW[6,:] = [ 10 , 4E-05];
AW[7,:] = [ 9 , 5E-04];
AW[8,:] = [ 7 , 7E-04];
AW[9,:] = [ 5 , 8E-03];
phas = 0
tini = -2*365 *86400 # 2 years backwards
dt = 200
tfin = 0 # present
nt = round((tfin-tini)/dt) + 1
vec_t = np.linspace(tini,tfin1,nt) - phas
wave = np.zeros((nt))
for it in range(nt):
suma = 0
t = vec_t[fil]
for iW in range(np.size(AW,0)):
suma = suma + AW[iW,0]*np.cos(AW[iW,1]*t)
#endfor iW
wave[it] = suma
#endfor it
To deal such aspects in Python I would suggest to compile into executable directly to binary the numerical parts that may compromise the project (or for example C or Fortran into executable and be called by Python afterwards). Of course, other suggestions are appreciated.

I tested a FIR filter with MATLAB and same (adapted) code in Python, including a frequency sweep. The FIR filter is pretty huge, N = 100 order, I post below the two codes, but leave you here the timing results:
MATLAB: Elapsed time is 11.149704 seconds.
PYTHON: time cost = 247.8841781616211 seconds.
PYTHON IS 25 TIMES SLOWER !!!
MATLAB CODE (main):
f1 = 4000; % bandpass frequency (response = 1).
f2 = 4200; % bandreject frequency (response = 0).
N = 100; % FIR filter order.
k = 0:2*N;
fs = 44100; Ts = 1/fs; % Sampling freq. and time.
% FIR Filter numerator coefficients:
Nz = Ts*(f1+f2)*sinc((f2-f1)*Ts*(k-N)).*sinc((f2+f1)*Ts*(k-N));
f = 0:fs/2;
w = 2*pi*f;
z = exp(-i*w*Ts);
% Calculation of the expected response:
Hz = polyval(Nz,z).*z.^(-2*N);
figure(1)
plot(f,abs(Hz))
title('Gráfica Respuesta Filtro FIR (Filter Expected Response)')
xlabel('frecuencia f (Hz)')
ylabel('|H(f)|')
xlim([0, 5000])
grid on
% Sweep Frequency Test:
tic
% Start and Stop frequencies of sweep, t = tmax = 50 seconds = 5000 Hz frequency:
fmin = 1; fmax = 5000; tmax = 50;
t = 0:Ts:tmax;
phase = 2*pi*fmin*t + 2*pi*((fmax-fmin).*t.^2)/(2*tmax);
x = cos(phase);
y = filtro2(Nz, 1, x); % custom filter function, not using "filter" library here.
figure(2)
plot(t,y)
title('Gráfica Barrido en Frecuencia Filtro FIR (Freq. Sweep)')
xlabel('Tiempo Barrido: t = 10 seg = 1000 Hz')
ylabel('y(t)')
xlim([0, 50])
grid on
toc
MATLAB CUSTOM FILTER FUNCTION
function y = filtro2(Nz, Dz, x)
Nn = length(Nz);
Nd = length(Dz);
N = length(x);
Nm = max(Nn,Nd);
x1 = [zeros(Nm-1,1) ; x'];
y1 = zeros(Nm-1,1);
for n = Nm:N+Nm-1
y1(n) = Nz(Nn:-1:1)*x1(n-Nn+1:n)/Dz(1);
if Nd > 1
y1(n) = y1(n) - Dz(Nd:-1:2)*y1(n-Nd+1:n-1)/Dz(1);
end
end
y = y1(Nm:Nm+N-1);
end
PYTHON CODE (main):
import numpy as np
from matplotlib import pyplot as plt
import FiltroDigital as fd
import time
j = np.array([1j])
pi = np.pi
f1, f2 = 4000, 4200
N = 100
k = np.array(range(0,2*N+1),dtype='int')
fs = 44100; Ts = 1/fs;
Nz = Ts*(f1+f2)*np.sinc((f2-f1)*Ts*(k-N))*np.sinc((f2+f1)*Ts*(k-N));
f = np.arange(0, fs/2, 1)
w = 2*pi*f
z = np.exp(-j*w*Ts)
Hz = np.polyval(Nz,z)*z**(-2*N)
plt.figure(1)
plt.plot(f,abs(Hz))
plt.title("Gráfica Respuesta Filtro FIR")
plt.xlabel("frecuencia f (Hz)")
plt.ylabel("|H(f)|")
plt.xlim(0, 5000)
plt.grid()
plt.show()
start_time = time.time()
fmin = 1; fmax = 5000; tmax = 50;
t = np.arange(0, tmax, Ts)
fase = 2*pi*fmin*t + 2*pi*((fmax-fmin)*t**2)/(2*tmax)
x = np.cos(fase)
y = fd.filtro(Nz, [1], x)
plt.figure(2)
plt.plot(t,y)
plt.title("Gráfica Barrido en Frecuencia Filtro FIR")
plt.xlabel("Tiempo Barrido: t = 10 seg = 1000 Hz")
plt.ylabel("y(t)")
plt.xlim(0, 50)
plt.grid()
plt.show()
elapsed_time = time.time() - start_time
print('time cost = ', elapsed_time)
PYTHON CUSTOM FILTER FUNCTION
import numpy as np
def filtro(Nz, Dz, x):
Nn = len(Nz);
Nd = len(Dz);
Nz = np.array(Nz,dtype=float)
Dz = np.array(Dz,dtype=float)
x = np.array(x,dtype=float)
N = len(x);
Nm = max(Nn,Nd);
x1 = np.insert(x, 0, np.zeros((Nm-1,), dtype=float))
y1 = np.zeros((N+Nm-1,), dtype=float)
for n in range(Nm-1,N+Nm-1) :
y1[n] = sum(Nz*np.flip( x1[n-Nn+1:n+1]))/Dz[0] # = y1FIR[n]
if Nd > 1:
y1[n] = y1[n] - sum(Dz[1:]*np.flip( y1[n-Nd+1:n]))/Dz[0]
print(y1[n])
y = y1[Nm-1:]
return y

Related

Numba parallel code slower than its sequential counterpart

I'm new to Numba and I'm trying to implement an old Fortran code in Python using Numba (version 0.54.1), but when I add parallel = True the program actually slows down. My program is very simple: I change the positions x and y in a L x L grid and for each position in the grid I perform a summation
import numpy as np
import numba as nb
#nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
L = len(x_grid)
lypnv = np.zeros((L, L))
for ii in nb.prange(L):
for jj in range(L):
x = x_grid[ii]
y = y_grid[jj]
beta0 = 0
sumT11 = 0
for j in range(N):
y = (y - k*np.sin(x)) % (2*np.pi)
x = (x + y) % (2*np.pi)
J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
sumT11 += np.log(abs(T11))/np.log(2)
beta0 = beta
lypnv[ii, jj] = sumT11/N
return lypnv
# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)
# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)
lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8s to run, however with parallel=True it takes about 14s. I also tested with another code from https://github.com/animator/mandelbrot-numba and in this case the parallelization works.
import math
import numpy as np
import numba as nb
WIDTH = 1000
MAX_ITER = 1000
#nb.njit(parallel=True)
def mandelbrot(width, max_iter):
pixels = np.zeros((width, width, 3), dtype=np.uint8)
for y in nb.prange(width):
for x in range(width):
c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
c = 0
for i in range(1, max_iter):
if abs(c) > 2:
log_iter = math.log(i)
pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
int(255*(1+math.cos(0.774*log_iter))/2),
int(255*(1+math.cos(0.412*log_iter))/2)],
dtype=np.uint8)
break
c = c * c + c0
return pixels
# compile
_ = mandelbrot(WIDTH, 10)
calcpixels = mandelbrot(WIDTH, MAX_ITER)
One main issue is that the second function call compile the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (int transformed to a np.int_) while in the second call the third argument (k) is a floating point number (float transformed to a np.float64). Numba recompiles the function for different parameter types because they are deduced from the type of the arguments and it does not know you want to use a np.float64 type for the third argument (since the first time the function is compiled with for a np.int_ type). One simple solution to fix the problem is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so it will compile the function at declaration time. This also remove the need to artificially call the function (with useless parameters).
#nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
Note that (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) is zero the first time resulting in a division by 0.
Another main issue comes from the allocations of many small arrays in the loop causing a contention of the standard allocator (see this post for more information). While Numba could theoretically optimize it (ie. replace the array with local variables), it actually does not, resulting in a huge slowdown and a contention. Hopefully, in your case, you do not need to actually create the array. At last, you can create it only in the encompassing loop and modify it in the innermost loop. Here is the optimized code:
#nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
L = len(x_grid)
lypnv = np.zeros((L, L))
for ii in nb.prange(L):
J = np.ones((2, 2), dtype=np.float64)
for jj in range(L):
x = x_grid[ii]
y = y_grid[jj]
beta0 = 0
sumT11 = 0
for j in range(N):
y = (y - k*np.sin(x)) % (2*np.pi)
x = (x + y) % (2*np.pi)
J[0, 1] = -k*np.cos(x)
J[1, 1] = 1.0 - k*np.cos(x)
beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
sumT11 += np.log(abs(T11))/np.log(2)
beta0 = beta
lypnv[ii, jj] = sumT11/N
return lypnv
Here is the results on a old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The parallel optimized version scale very well compared than the original one (2.9 times faster than the sequential one). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.

Threading or multiprocess to run code in parallel seems slower

I've been searching with little success how to solve this problem. The script below is supposed to perform planet simulations. planet1_pars will define 1st planet parameters. set_grids_fakePlanet will create a grid for each of the parameters of a hypothetical planet put into the system. This function will return a generator not a list/array with tons of parameter values. planet2_pars will give me a set of parameters previously created in set_grids_fakePlanet, hence each time I execute planet2_pars it will give me a different set of parameters from the hypothetical planet. ComputeTTV will do some calculations and return what I need each time I execute run_rebound, which is my main function that will call all these mentioned functions above. Whenever I execute run_rebound, I need to give it the hypothetical planet parameter so it run the simulation.
def planet1_pars():
P_p1,m_p1,e_p1 = 0.7920639164 / 365.25, 29.32*3.0027e-6, 0.0 #P[yrs], m[solar],e[fixed]
inc_p1,omega_p1,M_p1 = 77.4041697839 * np.pi/180, 90., 0.
return P_p1,m_p1,e_p1,inc_p1,omega_p1,M_p1
def set_grids_fakePlanet(pars_p1):
P_p1,m_p1,e_p1,inc_p1,omega_p1,M_p1 = [*pars_p1]
#min max periods in which to put a planet
Pmin = P_p1 * 2.02 # Pmin ~ 0.0196/365.25 #shortest period found so far in exoplanet.eu
Pmax = P_p1 * 2.05
#set grids
P_grid = np.arange(Pmin, Pmax, P_p1 * 0.005)
m_p2_grid = np.arange(.5, 320, 1) * 3.0027e-6 # every 1 Earth mass to 1 Jupiter
e_grid = [0.0]#np.linspace(0,0.1, 10) # e=1 may cause code to blow up
inc_grid = [inc_p1]#np.linspace(60,90, 5)
omega_grid = [0.0]#np.linspace(0,360, 5)
M_grid = [0.0]#np.linspace(0,360, 5)
#store grid vals, each column is a parameter, last column TTV amplitude
#[n,m] n is max_size(P,e,inc,omega,M) ** m. m is # of orbital parameters + 1 ttv amp
size = len(P_grid) * len(m_p2_grid) * len(e_grid) * len(inc_grid) * len(omega_grid) * len(M_grid)
results = np.zeros([size,6+1]) * np.nan
peiom_grid = ((x,k,y,w,j,z) for x in P_grid for k in m_p2_grid for y in e_grid for w in inc_grid
for j in omega_grid for z in M_grid)
return peiom_grid
def planet2_pars():
for pars_p2 in peiom_grid:
return pars_p2
#2nd planet
# m_p2, P_p2, e_p2, inc_p2, omega_p2, M_p2 = system_parameters(n*m_p1, P_p2,e_p2,inc_p2,omega_p2,M_p2)
def computeTTVs(sim, P_p1, P_p2):
N=34
transittimes = np.zeros(int(N))
p = sim.particles
i = 0
while i<N:
y_old = p[1].y - p[0].y # (Thanks to David Martin for pointing out a bug in this line!)
t_old = sim.t
if P_p1 > P_p2:
sim.integrate(sim.t+ (P_p2 * 0.05)) # check for transits every 0.5 time units. Note that 0.5 is shorter than one orbit
else:
sim.integrate(sim.t+ (P_p1 * 0.05)) #5% of period ~ 1h which is shorter than Tdur=2h
t_new = sim.t
if y_old*(p[1].y-p[0].y)<0. and p[1].x-p[0].x>0.: # sign changed (y_old*y<0), planet in front of star (x>0)
while t_new-t_old>1e-7: # bisect until prec of 1e-5 reached
if y_old*(p[1].y-p[0].y)<0.:
t_new = sim.t
else:
t_old = sim.t
sim.integrate( (t_new+t_old)/2.)
transittimes[i] = sim.t
i += 1
sim.integrate(sim.t+ P_p1 * 0.01) # integrate 0.05 to be past the transit
A = np.vstack([np.ones(N), range(N)]).T
c, m = np.linalg.lstsq(A, transittimes, rcond=-1)[0] # fits a linear model and get period m and t0 c
comp_t0s = c + m*np.array(range(N))
OC = transittimes-comp_t0s # in years
OC *= 365.25*24*60
amp = rms(OC)
# amp = np.diff([np.min(OC), np.max(OC)])[0]
return amp #in minutes
def run_rebound(pars_p2):
ms=1.02 #solar unit
P_p1,m_p1,e_p1,inc_p1,omega_p1,M_p1 = planet1_pars()
P_p2,m_p2,e_p2,inc_p2,omega_p2,M_p2 = [*pars_p2] #fake planet
#start simulation
sim = rebound.Simulation()
sim.G = 39.478 #AU^3 yr^-2 Ms^-1
sim.add(m=ms)
sim.add(m=m_p1, P=P_p1, e=e_p1, inc=inc_p1, omega=omega_p1, M=M_p1)
sim.add(m=m_p2, P=P_p2, e=e_p2, inc=inc_p2, omega=omega_p2, M=M_p2)
#put outcomes in a list
results = [P_p2,m_p2,e_p2,inc_p2*(180/np.pi),omega_p2,M_p2, computeTTVs(sim, P_p1, P_p2)]
return results
Question: I tried to make it parallel using the threading library as in:
peiom_grid = set_grids_fakePlanet(planet1_pars()) #make the fake planet grid as a generator variable
import threading
start = time.time()
for pars in peiom_grid:
t1 = threading.Thread(target=run_rebound, args=(pars,))
t1.start()
t1.join()
end = time.time()
print((end-start) /60, 'min')
In this manner, I see the 8 CPU I got is being used but at a rate which is less than 50%.
And it takes ~ 1.2 min to run (the grids are small because I am testing, but ideally the grids should be lager so it may take days to run).
I also tried MultiProcessing
from multiprocessing import Process
start = time.time()
if __name__ == '__main__':
for pars in peiom_grid:
p = Process(target=run_rebound, args=(pars,))
p.start()
p.join()
end = time.time()
print((end-start) /60, 'min')
it takes ~ 1.7min
and without any parallelization
start = time.time()
for pars in peiom_grid:
run_rebound(pars)
end = time.time()
print((end-start)/60, 'min')
it takes ~ 1.34 min
I think I am not doing any parallelization because the difference between the runs above with/without parallelization isn't significant. I cannot find where the issue is. I followed a few examples and check several examples on stack overflow but nothing... Hope you guys can give me some feedback.
In case of multithreading - Mark is right, the bottleneck is Python GIL. However, multiprocessing is free of this limitation (but is subject to a different overhead, minimal in this case).
The reason you don't see any improvement is because .join() waits for process execution. So, this implementation starts a single process and then immediately blocks until it is complete. To fix this, move .join() out of the process creation loop:
processes = []
for pars in peiom_grid:
p = Process(target=run_rebound, args=(pars,))
p.start()
processes.append(p)
for p in processes:
p.join()
A more straightforward way to do this would be to use process pool:
from multiprocessing import Pool
with Pool() as pool: # will use the number of CPUs in the system by default
results = pool.map(run_rebound, peiom_grid)

Extremely slow calculations

I am trying to write an equation that does the following things:
1) Integrates an equation
2) Stores that equation for later use
3) Numerically integrate the first and evaluate the 2nd equation on 100 different intervals, increasing by a fixed amount each time
import math
from sympy import *
import kvalues
import time
import random
import pandas as pd
import matplotlib.pyplot as plt
The first task is very simple, I completed it like so:
def integration_gas(number,Fa_0,Fb_0,Fc_0,v_0,a,b,c,d,e):
Ca_0 = Fa_0/v_0
Cb_0 = Fb_0/v_0
Cc_0 = Fc_0/v_0
Ft_0 = Fb_0 + Fa_0 + Fc_0
theta1 = Cb_0/Ca_0
stoic1 = b/a
theta2 = Cc_0/Ca_0
stoic2 = c/a
stoic3 = d/a
stoic4 = e/a
Cd = stoic3*x
Ce = stoic4*x
sigma = e+d-c-b-1
epsilon = (Fa_0/Ft_0)*sigma
Ca_eq = Ca_0*((1-x)/(1+epsilon*x))
Cb_eq = Ca_0*((1*theta1-stoic1*x)/(1+epsilon*x))
Cc_eq = Ca_0*((1*theta2-stoic2*x)/(1+epsilon*x))
ra = 1*(Ca_eq**a)*(Cb_eq**b)*(Cc_eq**c)*final_k[number-1]
equation = Fa_0/ra
int1 = Integral(equation,x)
pprint(int1)
evaluate = int1.doit()
pprint(evaluate)
return equation
This part of the code works perfectly fine, so on to the 2nd part.
def Ra_gas(number,Fa_0,Fb_0,Fc_0,v_0,a,b,c,d,e):
Ca_0 = Fa_0/v_0
Cb_0 = Fb_0/v_0
Cc_0 = Fc_0/v_0
Ft_0 = Fb_0 + Fa_0 + Fc_0
theta1 = Cb_0/Ca_0
stoic1 = b/a
theta2 = Cc_0/Ca_0
stoic2 = c/a
sigma = e+d-c-b-1
epsilon = (Fa_0/Ft_0)*sigma
Ca_eq = Ca_0*((1-x)/(1+epsilon*x))
Cb_eq = Ca_0*((1*theta1-stoic1*x)/(1+epsilon*x))
Cc_eq = Ca_0*((1*theta2-stoic2*x)/(1+epsilon*x))
ra = 1*(Ca_eq**a)*(Cb_eq**b)*(Cc_eq**c)*final_k[number-1]
pprint(ra)
return ra
This part of the code also works perfectly fine. So for the last part I have the following code:
Number = 4
FA0 = 10
FB0 = 25
FC0 = 5
V0 = 2
A = 1
B = 2
C = 0.5
D = 1
E = 1
Ra = []
volume = []
Xff = []
eq1 = integration_gas(Number,FA0,FB0,FC0,V0,A,B,C,D,E)
Ra1 = Ra_gas(Number,FA0,FB0,FC0,V0,A,B,C,D,E)
#print(Ra1)
Xf = 0.01
# Calculates the reaction rate and volume for every interval of conversion
while Xf <=1:
int2 = Integral(eq1,(x,0,Xf))
volume.append(int2.doit())
f = lambdify(x,Ra1,"math")
f(Xf)
Ra.append(f(Xf))
Xff.append(Xf)
Xf += 0.01
I then take the results and plot them. Everything i've written works perfectly fine for some situations and is completed in around 10~15 seconds. However, in situations like this one in particular, i've been running this code for 5+ hours with no solutions. How can I optimize this code?
Take a look at sympy, it can symbolically integrate your original equation and you then can evaluate it via numpy. For "real" maths Python is a bit slow, the scipy Stack (numpy, matplotlib, sympy...) is WAY faster.
Though 5+ hours is a bit long, are you sure that it actually executes?
EDIT: A simple thing to try
Sorry, Just now noticed you're lambdifying, you might want to include your imports so people see what you're using.
At the beginning:
import numpy as np
Let's go through this piece of your Code:
Xf = 0.01
while Xf <=1:
int2 = Integral(eq1,(x,0,Xf))
volume.append(int2.doit())
f = lambdify(x,Ra1,"math") #you're lambdifying each iteration that takes time
f(Xf) # no assignment here, unless you're doing something in place this line does nothing
Ra.append(f(Xf))
Xff.append(Xf)
Xf += 0.01
With something along the lines of this:
Xf = np.arange(0.01, 1.01, 0.01) #vector with values from 0.01 to 1 in steps of 0.01
f = np.vectorize(lambdify(x,Ra1,"math")) # you anonymous function but able to take np vectors/np arrays
Ra = f(Xf)
#Xff would be Xf

Optimization of a timeline builder function

I've got a squared signal with a frequency f, and I'm interested in the time at which the square starts.
def time_builder(f, t0=0, tf=300):
"""
Function building the time line in ms between t0 and tf with a frequency f.
f: Hz
t0 and tf: ms
"""
time = [t0] # /!\ time in ms
i = 1
while time[len(time)-1] < tf:
if t0 + (i/f)*1000 < tf:
time.append(t0 + (i/f)*1000)
else:
break
i += 1
return time
So this function loops between t0 and tf to create a list in which is the timing at which a square starts. I'm quite sure it's not the best way to do it, and I'd like to know how to improve it.
Thanks.
If I am interpreting this correct, you are looking for a list of the times of the waves, starting at t0 and ending at tf.
def time_builder(f, t0=0, tf=300):
"""
Function building the time line in ms between t0 and tf with a frequency f.
f: Hz
t0 and tf: ms
"""
T = 1000 / f # period [ms]
n = int( (tf - t0) / T + 0.5 ) # n integer number of wavefronts, +0.5 added for rounding consistency
return [t0 + i*T for i in range(n)]
Using standard library python for this might not be the best approach... particularly considering that you might want to do other things later on.
An alternative is to use numpy. This will let you to do the following
from numpy import np
from scipy import signal
t = np.linspace(0, 1, 500, endpoint=False)
s = signal.square(2 * np.pi * 5 * t) # we create a square signal usign scipy
d = np.diff(s) # obtaining the differences, this tell when there is a step.
# In this particular case, 2 means step up -2 step down.
starts = t[np.where(d == 2)] # take the times array t filtered by which
# elements in the differences array d equal to 2

Improving Numpy speed for Gauss-Seidel (Jacobi) Solver

This question is a follow-up to a recent question posted regarding MATLAB being twice as fast as Numpy.
I currently have a Gauss-Seidel solver implemented in both MATLAB and Numpy which acts on a 2D axisymmetric domain (cylindrical coordinates). The code was originally written in MATLAB and then transferred to Python. The Matlab code runs in ~20 s whereas the Numpy codes takes ~30 s. I would like to use Numpy, however, since this code is part of a larger program, the almost twice as long simulation time is a significant drawback.
The algorithm simply solves the discretized Laplace equation on a rectangular mesh (in cylindrical coordinates). It finishes when the maximum difference between updates on the mesh is less than the indicated tolerance.
The code in Numpy is:
import numpy as np
import time
T = np.transpose
# geometry
length = 0.008
width = 0.002
# mesh
nz = 256
nr = 64
# step sizes
dz = length/nz
dr = width/nr
# node position matrices
r = np.tile(np.linspace(0,width,nr+1), (nz+1, 1)).T
ri = r/dr
# equation coefficients
cr = dz**2 / (2*(dr**2 + dz**2))
cz = dr**2 / (2*(dr**2 + dz**2))
# initial/boundary conditions
v = np.zeros((nr+1,nz+1))
v[:,0] = 1100
v[:,-1] = 0
v[31:,29:40] = 1000
v[19:,54:65] = -200
# convergence parameters
tol = 1e-4
# Gauss-Seidel solver
tic = time.time()
max_v_diff = 1;
while (max_v_diff > tol):
v_old = v.copy()
# left boundary updates
v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+2])
# internal updates
v[1:nr,1:nz] = cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz] + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz]) + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1])
# right boundary updates
v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
# reapply grid potentials
v[31:,29:40] = 1000
v[19:,54:65] = -200
# check for convergence
v_diff = v - v_old
max_v_diff = np.absolute(v_diff).max()
toc = time.time() - tic
print(toc)
This is actually not the full algorithm which I use. The full algorithm uses successive overrelaxation and a checkerboard iteration scheme to improve speed and remove solver directionality, but for purposes of simplicity I provided this easier to understand version. The speed drawbacks in Numpy are more pronounced for the full version (17s vs. 9s simulation times respectively in Numpy and MATLAB).
I tried the solution from the previous question, changing v to a column-major order array, but there was no performance increase.
Any suggestions?
Edit: The Matlab code for reference is:
% geometry
length = 0.008;
width = 0.002;
% mesh
nz = 256;
nr = 64;
% step sizes
dz = length/nz;
dr = width/nr;
% node position matrices
r = repmat(linspace(0,width,nr+1)', 1, nz+1);
ri = r./dr;
% equation coefficients
cr = dz^2/(2*(dr^2+dz^2));
cz = dr^2/(2*(dr^2+dz^2));
% initial/boundary conditions
v = zeros(nr+1,nz+1);
v(1:nr+1,1) = 1100;
v(1:nr+1,nz+1) = 0;
v(32:nr+1,30:40) = 1000;
v(20:nr+1,55:65) = -200;
% convergence parameters
tol = 1e-4;
max_v_diff = 1;
% Gauss-Seidel Solver
tic
while (max_v_diff > tol)
v_old = v;
% left boundary updates
v(1,2:nz) = cr.*2.*v(2,2:nz) + cz.*( v(1,1:nz-1) + v(1,3:nz+1) );
% internal updates
v(2:nr,2:nz) = cr.*( (1 - 1./(2.*ri(2:nr,2:nz))).*v(1:nr-1,2:nz) + (1 + 1./(2.*ri(2:nr,2:nz))).*v(3:nr+1,2:nz) ) + cz.*( v(2:nr,1:nz-1) + v(2:nr,3:nz+1) );
% right boundary updates
v(nr+1,2:nz) = cr.*2.*v(nr,2:nz) + cz.*( v(nr+1,1:nz-1) + v(nr+1,3:nz+1) );
% reapply grid potentials
v(32:nr+1,30:40) = 1000;
v(20:nr+1,55:65) = -200;
% check for convergence
max_v_diff = max(max(abs(v - v_old)));
end
toc
I've been able to reduce the running time in my laptop from 66 to 21 seconds by following this process:
Find the bottleneck. I profiled the code using line_profiler from the IPython console to find the lines that took most time. It turned out that over 80% of the time was spent in the line that does "internal updates".
Choose a way to optimise it. There are several tools to speed code up in numpy (Cython, numexpr, weave...). In particular, scipy.weave.blitz is well suited to compile numpy expressions, like the offending line, into fast code. In theory, that line could be wrapped inside "..." and executed as weave.blitz("...") but the array that's being updated is used in the computation, so as stated by point #4 in the docs a temporary array must be used to keep the same result:
expr = "temp = cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz] + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz]) + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]); v[1:nr,1:nz] = temp"
temp = np.empty((nr-1, nz-1))
...
while ...
# internal updates
weave.blitz(expr)
After checking that the results are correct, runtime checks are disabled by using weave.blitz(expr, check_size=0). The code now runs in 34 seconds.
Building up on Jaime's work, precompute the constant factors A and B in the expression. The code runs in 21 seconds (with minimal changes but it now needs a compiler).
This is the core of the code:
from scipy import weave
# [...] Set up code till "# Gauss-Seidel solver"
tic = time.time()
max_v_diff = 1;
A = cr * (1 - 1/(2*ri[1:nr,1:nz]))
B = cr * (1 + 1/(2*ri[1:nr,1:nz]))
expr = "temp = A*v[0:nr-1,1:nz] + B*v[2:nr+1,1:nz] + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]); v[1:nr,1:nz] = temp"
temp = np.empty((nr-1, nz-1))
while (max_v_diff > tol):
v_old = v.copy()
# left boundary updates
v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+2])
# internal updates
weave.blitz(expr, check_size=0)
# right boundary updates
v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
# reapply grid potentials
v[31:,29:40] = 1000
v[19:,54:65] = -200
# check for convergence
v_diff = v - v_old
max_v_diff = np.absolute(v_diff).max()
toc = time.time() - tic
On my laptop your code runs in about 45 seconds. By trying to reduce creation of intermediate arrays to the bare minimum, including reuse of pre-allocated work arrays, I have managed to reduce that time to 27 seconds. That should put you back at the level of MATLAB, but your code would be less readable. Anyway, find below code to replace everything below your # Gauss-Seidel solver comment:
# work arrays
v_old = np.empty_like(v)
w1 = np.empty_like(v[0, 1:nz])
w2 = np.empty_like(v[1:nr,1:nz])
w3 = np.empty_like(v[nr, 1:nz])
# constants
A = cr * (1 - 1/(2*ri[1:nr,1:nz]))
B = cr * (1 + 1/(2*ri[1:nr,1:nz]))
# Gauss-Seidel solver
tic = time.time()
max_v_diff = 1;
while (max_v_diff > tol):
v_old[:] = v
# left boundary updates
np.add(v_old[0, 0:nz-1], v_old[0, 2:nz+2], out=v[0, 1:nz])
v[0, 1:nz] *= cz
np.multiply(2*cr, v_old[1, 1:nz], out=w1)
v[0, 1:nz] += w1
# internal updates
np.add(v_old[1:nr, 0:nz-1], v_old[1:nr, 2:nz+1], out=v[1:nr, 1:nz])
v[1:nr,1:nz] *= cz
np.multiply(A, v_old[0:nr-1, 1:nz], out=w2)
v[1:nr,1:nz] += w2
np.multiply(B, v_old[2:nr+1, 1:nz], out=w2)
v[1:nr,1:nz] += w2
# right boundary updates
np.add(v_old[nr, 0:nz-1], v_old[nr, 2:nz+1], out=v[nr, 1:nz])
v[nr, 1:nz] *= cz
np.multiply(2*cr, v_old[nr-1, 1:nz], out=w3)
v[nr,1:nz] += w3
# reapply grid potentials
v[31:,29:40] = 1000
v[19:,54:65] = -200
# check for convergence
v_old -= v
max_v_diff = np.absolute(v_old).max()
toc = time.time() - tic

Categories