CUDA Python GPU NumbaPro 3D loop poor performance

I am trying to set up a 3D loop with the assignment
C(i,j,k) = A(i,j,k) + B(i,j,k)
using Python on my GPU. This is my GPU:
http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications
The sources I'm looking at / comparing with are:
http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43
http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb
It's possible that I've imported more modules than necessary. This is my code:
import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *

@autojit
def myAdd(a, b):
    return a + b

myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)

@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z

    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z

    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z

    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd

    if i >= c.shape[0]:
        return
    if j >= c.shape[1]:
        return
    if k >= c.shape[2]:
        return

    for i in xrange(0, c.shape[0]):
        for j in xrange(0, c.shape[1]):
            for k in xrange(0, c.shape[2]):
                # c[i,j,k] = a[i,j,k] + b[i,j,k]
                c[i, j, k] = myAdd_gpu(a[i, j, k], b[i, j, k])
def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8, 2: 32, 3: 192}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per multiprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores

    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))
    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct

    a = np.ones((N, N, N), dtype=np.float32)
    b = np.ones((N, N, N), dtype=np.float32)
    c = np.zeros((N, N, N), dtype=np.float32)

    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a, b, cg)
    cg.to_host()
    dt = timer() - start

    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])

if __name__ == '__main__':
    main()
My result from running this is the following:
Running on GPU: GeForce GT 520
Compute capability: 2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per multiprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2. 2. 2.]
c[-3:,:,:] = [ 2. 2. 2.]
When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.
I believe I've indexed correctly and safely. Maybe the problem is with my blockdim? or griddim? Or maybe I'm using the wrong types for my GPU. I think I read that they must be a certain type. I'm very new to this so the problem very well could be trivial!
Any and all help is greatly appreciated!

You are creating your indexes correctly but then you're ignoring them.
Running the nested loop
for i in xrange(0, c.shape[0]):
    for j in xrange(0, c.shape[1]):
        for k in xrange(0, c.shape[2]):
is forcing all your threads to loop through all values in all dimensions, which is not what you want. You want each thread to compute one value in a block and then move on.
I think something like this should work better...
i = tx + bx * bw
while i < c.shape[0]:
    j = ty + by * bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i, j, k] = myAdd_gpu(a[i, j, k], b[i, j, k])
            k += cuda.blockDim.z * cuda.gridDim.z
        j += cuda.blockDim.y * cuda.gridDim.y
    i += cuda.blockDim.x * cuda.gridDim.x
Try to compile and run it. Also make sure to validate it, as I have not.
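For reference, here is how the whole kernel and launch might look with one thread per element and a 3D launch configuration. This is only a sketch written against the current numba.cuda API (rather than the old numbapro one), and I have not benchmarked or validated it either:

from numba import cuda
import numpy as np
import math

@cuda.jit
def add3d_kernel(a, b, c):
    # each thread starts at its own (i, j, k) and strides by the total grid size,
    # so the whole array is covered even if the grid is smaller than the array
    i0 = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    j0 = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y
    k0 = cuda.threadIdx.z + cuda.blockIdx.z * cuda.blockDim.z
    si = cuda.blockDim.x * cuda.gridDim.x
    sj = cuda.blockDim.y * cuda.gridDim.y
    sk = cuda.blockDim.z * cuda.gridDim.z
    for i in range(i0, c.shape[0], si):
        for j in range(j0, c.shape[1], sj):
            for k in range(k0, c.shape[2], sk):
                c[i, j, k] = a[i, j, k] + b[i, j, k]

N = 100
a = np.ones((N, N, N), dtype=np.float32)
b = np.ones((N, N, N), dtype=np.float32)
c = np.zeros((N, N, N), dtype=np.float32)
threads = (8, 8, 4)                                  # 256 threads per block
blocks = tuple(math.ceil(N / t) for t in threads)    # enough blocks to cover N in every dimension
add3d_kernel[blocks, threads](a, b, c)               # numba copies the arrays to and from the device

With a 3D grid like this, the y and z block indices are no longer always zero, so the index math in the kernel actually spreads the work across threads instead of every thread repeating the full loop.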

I don't see you using imshow or show, so there is no need to import those.
It doesn't appear as though you use your import of math (I didn't see any calls of math.some_function).
Your imports from numba and numbapro seem repetitive. Your "from numba import cuda" overrides your "from numbapro import cuda", since it comes after it, so your calls to cuda use the cuda in numba, not numbapro. When you call "from numba import *", you import everything from numba, not just cuda, which seems to be the only thing you use. Also, (I believe) "import numba.cuda" is equivalent to "from numba import cuda". Why not replace all your imports from numba and numbapro with a single "from numba import cuda"?
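For illustration, a trimmed import block along those lines might look like this (a sketch; the f8 and float32 type aliases used in the signatures still have to come from somewhere, so they are imported explicitly here):

import math
import numpy as np
from timeit import default_timer as timer
from numba import cuda, f8, float32   # cuda for the kernels; f8/float32 for the signatures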

Related

Multiprocessing not using whole CPU

I'm testing Python's "multiprocessing" module. I'm trying to compute pi using a Monte Carlo technique on my 12-thread Ryzen 5 5600.
The problem is that my CPU is not fully used; only 47% is used. I leave my code below. Changing the value of n_cpu leads to not-so-different core usage, while increasing N by an order of magnitude can increase the load up to 77%... but I believed that N shouldn't affect the number of processes...
Please help me understand how to correctly parallelize my code, thanks.
import random
import math
import numpy as np
import multiprocessing
from multiprocessing import Pool

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

N_test = 1000
N = 12*10**4
n_cpu = 12
pi = 0

for j in range(N_test):
    part_count = [int(N/n_cpu)] * n_cpu
    pool = Pool(processes=n_cpu)
    results = pool.map(sample, part_count)
    pool.close()
    pi += sum(results)/(N*1.0)*4

print(pi/N_test)
The lack of CPU use is because you are sending chunks of data to multiple new process pools instead of all at once to a single process pool.
Simply using
pool = Pool(processes=n_cpu)
for j in range(N_test):
    part_count = [int(N/n_cpu)] * n_cpu
    results = pool.map(sample, part_count)
    pi += sum(results)/(N*1.0)*4
pool.close()
should give some speedup.
To optimize this further:
We can change the way the jobs are split up so that a single process gets more samples.
We can use NumPy's vectorized random functions, which run faster than random.random().
Finally, for the last bit of speed, we can use Numba with a thread pool to reduce overhead even more.
import time
import numpy as np
from multiprocessing.pool import ThreadPool
from numba import jit

@jit(nogil=True, parallel=True, fastmath=True)
def sample(n):
    x = np.random.random(n)
    y = np.random.random(n)
    inside_circle = np.square(x) + np.square(y) < 1.0
    return int(np.sum(inside_circle))

total_samples = int(3e9)
function_limit = int(1e7)
n_cpu = 12
pi = 0

assert total_samples % function_limit == 0

start = time.perf_counter()
with ThreadPool(n_cpu) as pool:
    part_count = [function_limit] * (total_samples//function_limit)
    results = pool.map(sample, part_count)
    pi = 4*sum(results)/(total_samples)
end = time.perf_counter()

print(pi)
print(round(end-start, 3), "seconds taken")
resulting in
3.141589756
6.982 seconds taken

Solving Linear Equations on the GPU with NumPy and PyTorch

I am trying to solve a lot of linear equations as fast as possible. To find out the fastest way I benchmarked NumPy and PyTorch, each on the CPU and on my GeForce 1080 GPU (using Numba for NumPy). The results really confused me.
This is the code I used with Python 3.8:
import timeit

import torch
import numpy
from numba import njit

def solve_numpy_cpu(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)
    for _ in range(1000):
        numpy.linalg.solve(a, b)

def solve_numpy_njit_a(dim: int = 5):
    njit(solve_numpy_cpu, dim=dim)

@njit
def solve_numpy_njit_b(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)
    for _ in range(1000):
        numpy.linalg.solve(a, b)

def solve_torch_cpu(dim: int = 5):
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    for _ in range(1000):
        torch.solve(b, a)

def solve_torch_gpu(dim: int = 5):
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    solve_torch_cpu(dim=dim)

def main():
    for f in (solve_numpy_cpu, solve_torch_cpu, solve_torch_gpu, solve_numpy_njit_a, solve_numpy_njit_b):
        time = timeit.timeit(f, number=1)
        print(f"{f.__name__:<20s}: {time:f}")

if __name__ == "__main__":
    main()
And these are the results:
solve_numpy_cpu : 0.007275
solve_torch_cpu : 0.012244
solve_torch_gpu : 5.239126
solve_numpy_njit_a : 0.000158
solve_numpy_njit_b : 1.273660
The slowest is CUDA accelerated PyTorch. I verified that PyTorch is using my GPU with
import torch
torch.cuda.is_available()
torch.cuda.get_device_name(0)
returning
True
'GeForce GTX 1080'
I can get behind the fact that, on the CPU, PyTorch is slower than NumPy. What I cannot understand is why PyTorch on the GPU is so much slower. Not that important, but actually even more confusing, is that Numba's njit decorator makes performance orders of magnitude slower, until you don't use the @ decorator syntax anymore.
Is it my setup? Occasionally I get a weird message about the Windows page/swap file not being big enough. In case I've taken a completely obscure path to solving linear equations on the GPU, I'd be happy to be pointed in another direction.
Edit
So, I focused on Numba and changed my benchmarking a bit. As suggested by @max9111, I rewrote the functions to receive input and produce output because, in the end, that's what anyone would want to use them for. Now, I also perform a first compile run for the Numba-accelerated function so the subsequent timing is fairer. Finally, I checked the performance against matrix size and plotted the results.
TL/DR: Up to matrix sizes of 500x500, Numba acceleration doesn't really make a difference for numpy.linalg.solve.
Here is the code:
import time
from typing import Tuple

import numpy
from matplotlib import pyplot
from numba import jit

@jit(nopython=True)
def solve_numpy_njit(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b

def main():
    a, b = get_data(10)
    # compile numba function
    p = solve_numpy_njit(a, b)

    matrix_size = [(x + 1) * 10 for x in range(50)]
    non_accelerated = []
    accelerated = []
    results = non_accelerated, accelerated

    for j, each_matrix_size in enumerate(matrix_size):
        for m, f in enumerate((solve_numpy, solve_numpy_njit)):
            average_time = -1.
            for k in range(5):
                time_start = time.time()
                for i in range(100):
                    a, b = get_data(each_matrix_size)
                    p = f(a, b)
                d_t = time.time() - time_start
                print(f"{each_matrix_size:d} {f.__name__:<30s}: {d_t:f}")
                average_time = (average_time * k + d_t) / (k + 1)
            results[m].append(average_time)

    pyplot.plot(matrix_size, non_accelerated, label="not numba")
    pyplot.plot(matrix_size, accelerated, label="numba")
    pyplot.legend()
    pyplot.show()

if __name__ == "__main__":
    main()
And these are the results (runtime against matrix edge length):
Edit 2
Seeing that Numba doesn't make much of a difference in my case, I came back to benchmarking PyTorch. And indeed, it appears to be roughly 4x faster than Numpy without even using a CUDA device.
Here is the code I used:
import time
from typing import Tuple

import numpy
import torch
from matplotlib import pyplot

def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b

def get_data_torch(dim: int) -> Tuple[torch.tensor, torch.tensor]:
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    return a, b

def solve_torch(a: torch.tensor, b: torch.tensor) -> torch.tensor:
    parameters, _ = torch.solve(b, a)
    return parameters

def experiment_numpy(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data(matrix_size)
        p = solve_numpy(a, b)

def experiment_pytorch(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data_torch(matrix_size)
        p = solve_torch(a, b)

def main():
    matrix_size = [x for x in range(5, 505, 5)]
    experiments = experiment_numpy, experiment_pytorch
    results = tuple([] for _ in experiments)

    for i, each_experiment in enumerate(experiments):
        for j, each_matrix_size in enumerate(matrix_size):
            time_start = time.time()
            each_experiment(each_matrix_size, repetitions=100)
            d_t = time.time() - time_start
            print(f"{each_matrix_size:d} {each_experiment.__name__:<30s}: {d_t:f}")
            results[i].append(d_t)

    for each_experiment, each_result in zip(experiments, results):
        pyplot.plot(matrix_size, each_result, label=each_experiment.__name__)
    pyplot.legend()
    pyplot.show()

if __name__ == "__main__":
    main()
And here's the result (runtime against matrix edge length):
So for now, I'll be sticking with torch.solve. However, the original question remains:
How can I exploit my GPU to solve linear equations even faster?

Numba, strange type modification

I am trying to further speed up some code written in python, compiled using Numba. When looking at the assembly generated by numba, I noticed double-precision operations being generated, which I felt was odd since the inputs and outputs are all supposed to be float32.
I declare the variable/array types as float32 outside of the jitted loop and pass them into the function. Strangely, I find that after running my tests, the variable "scalarout" is converted to a Python float, which is actually a 64-bit value.
My code:
from scipy import ndimage, misc
import matplotlib.pyplot as plt
import numpy.fft
from timeit import default_timer as timer
import numba
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32
from numba import jit, njit, prange
from numba import cuda
import numpy as np
import scipy as sp
# import llvmlite.binding as llvm
# llvm.set_option('', '--debug-only=loop-vectorize')

@njit(fastmath=True, parallel=False)
def mydot(a, b, xlen, ylen, scalarout):
    scalarout = (np.float32)(0.0)
    for y in prange(ylen):
        for x in prange(xlen):
            scalarout += a[y, x] * b[y, x]
    return scalarout

# ======================================== TESTS ========================================
print()
xlen = 100000
ylen = 16
a = np.random.rand(ylen, xlen).astype(np.float32)
b = np.random.rand(ylen, xlen).astype(np.float32)
print("a type = ", type(a[1,1]))

scalarout = (np.float32)(0.0)
print("scalarout type, before execution = ", type(scalarout))

iters = 1000
time = 100.0
for n in range(iters):
    start = timer()
    scalarout = mydot(a, b, xlen, ylen, scalarout)
    end = timer()
    if (end-start < time):
        time = end-start

print("Numba njit function time, in us = %16.10f" % ((end-start)*10**6))
print("function output = %f" % scalarout)
print("scalarout type, after execution = ", type(scalarout))
This is more of an extended comment than an answer. If you change scalarout to a float32 array of length 1 and modify that, your output is float32.
@njit(fastmath=True, parallel=False)
def mydot(a, b, xlen, ylen):
    scalarout = np.array([0.0], dtype=np.float32)
    for y in prange(ylen):
        for x in prange(xlen):
            scalarout[0] += a[y, x] * b[y, x]
    return scalarout
If you change return scalarout to return scalarout[0], then the output is again a python float.
In your original code for mydot, the result is a python float even if you write return np.float32(scalarout).
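As a side note (my own suggestion, not part of the original post): one way to see what Numba actually inferred is to ask the dispatcher after the first call, for example:

p = mydot(a, b, xlen, ylen)   # trigger compilation of the version above
print(mydot.signatures)       # the argument types Numba compiled for
mydot.inspect_types()         # per-line type annotations, handy for spotting float64 promotion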

Cannot replicate results comparing Python, Numpy and Numba matrix multiplication

So while evaluating possibilities to speed up Python code, I came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication
I was quite impressed with Numba's performance and implemented some of our functions in Numba. Unfortunately the speedup was only there for very small matrices, and for large matrices the code became very slow compared to the previous scipy sparse implementation. I thought this made sense, but nevertheless I repeated the test from the original post (code below).
When using a 1000 x 1000 matrix, according to that post even the Python implementation should take roughly 0.01 s. Here are my results though:
python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds
What am I doing wrong to get results so different from the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64-bit) and Python 2.7.10 (32-bit); a colleague tried the same code with the same results. This is the result for a 100x100 matrix:
python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds
Did I make some obvious mistake?
import numpy as np
import numba as nb
import scipy.sparse
import time

class benchmark(object):
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, ty, val, tb):
        end = time.time()
        print("%s : %0.4f seconds" % (self.name, end-self.start))
        return False

def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def dot_np(A, B):
    C = np.dot(A, B)
    return C

def dot_scipy(A, B):
    C = A * B
    return C

dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)

dim_x = 1000
dim_y = 1000
a = scipy.sparse.rand(dim_x, dim_y, density=0.01)
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()
b_full = b.toarray()

print("starting test")
with benchmark("python"):
    dot_py(a_full, b_full)
with benchmark("numpy"):
    dot_np(a_full, b_full)
with benchmark("numba"):
    dot_nb(a_full, b_full)
with benchmark("scipy"):
    dot_scipy(a, b)
print("finishing test")
Edit:
For anyone seeing this at a later time: these are the results I got when using sparse n x n matrices (1% of elements are nonzero).
In the linked Stack Overflow question where you got the code from, m = n = 3 and p is variable, whereas you are using m = n = p = 1000, which is going to make a huge difference in the timings.
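For a rough sense of scale (my own back-of-the-envelope numbers, assuming p is on the order of 1000 in the linked timings): the naive triple loop performs m*p*n multiply-adds, so

small = 3 * 3 * 1000         # linked post: m = n = 3, p ~ 1000   -> about 9,000 inner iterations
large = 1000 * 1000 * 1000   # this post:   m = n = p = 1000      -> 1,000,000,000 inner iterations
print(large // small)        # ~111,111x more work, turning hundredths of a second into hundreds of seconds

which is roughly consistent with the ~770 seconds you measured for the pure Python version.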

For loop in Python is 10x slower than MATLAB

I run Python 2.7 and MATLAB R2010a on the same machine, doing nothing, and it gives me a 10x difference in speed.
I looked online and heard they should be of the same order.
Python slows down further with an if statement and math operators in the for loop.
My question: is this the reality? Or is there some other way to get them to the same order of speed?
Here is the Python code:
import time

start_time = time.time()
for r in xrange(1000):
    for c in xrange(1000):
        continue
elapsed_time = time.time() - start_time
print 'time cost = ', elapsed_time
Output: time cost = 0.0377440452576
Here is the MATLAB code:
tic
for i = 1:1000
    for j = 1:1000
    end
end
toc
Output: Elapsed time is 0.004200 seconds
The reason this is happening is related to the JIT compiler, which is optimizing the MATLAB for loop. You can disable/enable the JIT accelerator using feature accel off and feature accel on. When you disable the accelerator, the times change dramatically.
MATLAB with accel on: Elapsed time is 0.009407 seconds.
MATLAB with accel off: Elapsed time is 0.287955 seconds.
python: time cost = 0.0511920452118
Thus the JIT accelerator is directly causing the speedup that you are noticing. There is another thing that you should consider, which is related to the way that you defined the iteration indices. In both cases, MATLAB and Python, you used iterators to define your loops. In MATLAB you create the actual values by adding the square brackets ([]), and in Python you use range instead of xrange. When you make these changes
% MATLAB
for i = [1:1000]
    for j = [1:1000]

# Python
for r in range(1000):
    for c in range(1000):
The times become
MATLAB with accel on: Elapsed time is 0.338701 seconds.
MATLAB with accel off: Elapsed time is 0.289220 seconds.
python: time cost = 0.0606048107147
One final consideration is if you were to add a quick computation to the loop, i.e. t = t + 1. Then the times become
MATLAB with accel on: Elapsed time is 1.340830 seconds.
MATLAB with accel off: Elapsed time is 0.905956 seconds. (Yes off was faster)
python: time cost = 0.147221088409
I think the moral here is that the computation speeds of for loops, out of the box, are comparable for extremely simple loops, depending on the situation. However, there are other numerical tools in Python which can speed things up significantly; numpy and PyPy have been brought up so far.
The basic Python implementation, CPython, is not meant to be super-speedy. If you need efficient matlab-style numerical manipulation, use the numpy package or an implementation of Python that is designed for fast work, such as PyPy or even Cython. (Writing a Python extension in C, which will of course be pretty fast, is also a possible solution, but in that case you may as well just use numpy and save yourself the effort.)
If Python execution performance is really crucial for you, you might take a look at PyPy
I did your test:
import time

for a in range(10):
    start_time = time.time()
    for r in xrange(1000):
        for c in xrange(1000):
            continue
    elapsed_time = time.time() - start_time
    print elapsed_time
with standard Python 2.7.3, I get:
0.0311839580536
0.0310959815979
0.0309510231018
0.0306520462036
0.0302460193634
0.0324130058289
0.0308878421783
0.0307397842407
0.0304911136627
0.0307500362396
whereas, using PyPy 1.9.0 (which corresponds to Python 2.7.2), I get:
0.00921821594238
0.0115230083466
0.00851202011108
0.00808095932007
0.00496387481689
0.00499391555786
0.00508499145508
0.00618195533752
0.005126953125
0.00482988357544
The acceleration of PyPy is really stunning and really becomes visible when its JIT compiler optimizations outweigh their cost. That's also why I introduced the extra for loop. For this example, absolutely no modification of the code was needed.
This is just my opinion, but I think the process is a bit more complex. Basically, MATLAB is an optimized layer over C, so with appropriate initialization of matrices and minimization of function calls (avoid "."-style object operators in MATLAB) you obtain extremely different results. Consider the following simple example of a wave generator with a cosine function: MATLAB time = 0.15 s in a practical debug session, Python time = 25 s in a practical debug session (Spyder), so Python becomes 166x slower. Run directly with Python 3.7.4, the time is about 5 s, which is still a non-negligible 33x.
MATLAB:
AW(1,:) = [800 , 0 ]; % [amp frec]
AW(2,:) = [300 , 4E-07];
AW(3,:) = [200 , 1E-06];
AW(4,:) = [ 50 , 4E-06];
AW(5,:) = [ 30 , 9E-06];
AW(6,:) = [ 20 , 3E-05];
AW(7,:) = [ 10 , 4E-05];
AW(8,:) = [ 9 , 5E-04];
AW(9,:) = [ 7 , 7E-04];
AW(10,:)= [ 5 , 8E-03];
phas = 0
tini = -2*365 *86400; % 2 years backwards in seconds
dt = 200; % step, 200 seconds
tfin = 0; % present
vec_t = ( tini: dt: tfin)'; % vector_time
nt = length(vec_t);
vec_t = vec_t - phas;
wave = zeros(nt,1);
for it = 1:nt
    suma = 0;
    t = vec_t(it,1);
    for iW = 1:size(AW,1)
        suma = suma + AW(iW,1)*cos(AW(iW,2)*t);
    end
    wave(it,1) = suma;
end
PYTHON:
import numpy as np

AW = np.zeros((10,2))
AW[0,:] = [800 , 0.0]    # [amp frec]
AW[1,:] = [300 , 4E-07]
AW[2,:] = [200 , 1E-06]
AW[3,:] = [ 50 , 4E-06]
AW[4,:] = [ 30 , 9E-06]
AW[5,:] = [ 20 , 3E-05]
AW[6,:] = [ 10 , 4E-05]
AW[7,:] = [  9 , 5E-04]
AW[8,:] = [  7 , 7E-04]
AW[9,:] = [  5 , 8E-03]

phas = 0
tini = -2*365*86400      # 2 years backwards, in seconds
dt = 200                 # step, 200 seconds
tfin = 0                 # present
nt = round((tfin-tini)/dt) + 1
vec_t = np.linspace(tini, tfin, nt) - phas

wave = np.zeros((nt))
for it in range(nt):
    suma = 0
    t = vec_t[it]
    for iW in range(np.size(AW,0)):
        suma = suma + AW[iW,0]*np.cos(AW[iW,1]*t)
    # endfor iW
    wave[it] = suma
# endfor it
To deal with such aspects in Python, I would suggest compiling the numerical parts that may compromise the project directly to a binary executable (or, for example, writing them in C or Fortran, compiling to an executable, and calling it from Python afterwards). Of course, other suggestions are appreciated.
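As an aside (my own sketch, not part of the original comparison): the double loop in the Python version can also be removed entirely by letting NumPy broadcast over both the time vector and the amplitude/frequency table, which is usually the first thing to try before reaching for compiled extensions:

import numpy as np

# AW[:, 0] are the amplitudes, AW[:, 1] the frequencies, vec_t the time vector from above
wave_vec = (AW[:, 0][:, None] * np.cos(AW[:, 1][:, None] * vec_t[None, :])).sum(axis=0)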
I tested a FIR filter with MATLAB and the same (adapted) code in Python, including a frequency sweep. The FIR filter is pretty big, order N = 100. I post the two codes below, but leave the timing results here:
MATLAB: Elapsed time is 11.149704 seconds.
PYTHON: time cost = 247.8841781616211 seconds.
PYTHON IS 25 TIMES SLOWER !!!
MATLAB CODE (main):
f1 = 4000; % bandpass frequency (response = 1).
f2 = 4200; % bandreject frequency (response = 0).
N = 100; % FIR filter order.
k = 0:2*N;
fs = 44100; Ts = 1/fs; % Sampling freq. and time.
% FIR Filter numerator coefficients:
Nz = Ts*(f1+f2)*sinc((f2-f1)*Ts*(k-N)).*sinc((f2+f1)*Ts*(k-N));
f = 0:fs/2;
w = 2*pi*f;
z = exp(-i*w*Ts);
% Calculation of the expected response:
Hz = polyval(Nz,z).*z.^(-2*N);
figure(1)
plot(f,abs(Hz))
title('Gráfica Respuesta Filtro FIR (Filter Expected Response)')
xlabel('frecuencia f (Hz)')
ylabel('|H(f)|')
xlim([0, 5000])
grid on
% Sweep Frequency Test:
tic
% Start and Stop frequencies of sweep, t = tmax = 50 seconds = 5000 Hz frequency:
fmin = 1; fmax = 5000; tmax = 50;
t = 0:Ts:tmax;
phase = 2*pi*fmin*t + 2*pi*((fmax-fmin).*t.^2)/(2*tmax);
x = cos(phase);
y = filtro2(Nz, 1, x); % custom filter function, not using "filter" library here.
figure(2)
plot(t,y)
title('Gráfica Barrido en Frecuencia Filtro FIR (Freq. Sweep)')
xlabel('Tiempo Barrido: t = 10 seg = 1000 Hz')
ylabel('y(t)')
xlim([0, 50])
grid on
toc
MATLAB CUSTOM FILTER FUNCTION
function y = filtro2(Nz, Dz, x)
    Nn = length(Nz);
    Nd = length(Dz);
    N = length(x);
    Nm = max(Nn,Nd);
    x1 = [zeros(Nm-1,1) ; x'];
    y1 = zeros(Nm-1,1);
    for n = Nm:N+Nm-1
        y1(n) = Nz(Nn:-1:1)*x1(n-Nn+1:n)/Dz(1);
        if Nd > 1
            y1(n) = y1(n) - Dz(Nd:-1:2)*y1(n-Nd+1:n-1)/Dz(1);
        end
    end
    y = y1(Nm:Nm+N-1);
end
PYTHON CODE (main):
import numpy as np
from matplotlib import pyplot as plt
import FiltroDigital as fd
import time
j = np.array([1j])
pi = np.pi
f1, f2 = 4000, 4200
N = 100
k = np.array(range(0,2*N+1),dtype='int')
fs = 44100; Ts = 1/fs;
Nz = Ts*(f1+f2)*np.sinc((f2-f1)*Ts*(k-N))*np.sinc((f2+f1)*Ts*(k-N));
f = np.arange(0, fs/2, 1)
w = 2*pi*f
z = np.exp(-j*w*Ts)
Hz = np.polyval(Nz,z)*z**(-2*N)
plt.figure(1)
plt.plot(f,abs(Hz))
plt.title("Gráfica Respuesta Filtro FIR")
plt.xlabel("frecuencia f (Hz)")
plt.ylabel("|H(f)|")
plt.xlim(0, 5000)
plt.grid()
plt.show()
start_time = time.time()
fmin = 1; fmax = 5000; tmax = 50;
t = np.arange(0, tmax, Ts)
fase = 2*pi*fmin*t + 2*pi*((fmax-fmin)*t**2)/(2*tmax)
x = np.cos(fase)
y = fd.filtro(Nz, [1], x)
plt.figure(2)
plt.plot(t,y)
plt.title("Gráfica Barrido en Frecuencia Filtro FIR")
plt.xlabel("Tiempo Barrido: t = 10 seg = 1000 Hz")
plt.ylabel("y(t)")
plt.xlim(0, 50)
plt.grid()
plt.show()
elapsed_time = time.time() - start_time
print('time cost = ', elapsed_time)
PYTHON CUSTOM FILTER FUNCTION
import numpy as np

def filtro(Nz, Dz, x):
    Nn = len(Nz)
    Nd = len(Dz)
    Nz = np.array(Nz, dtype=float)
    Dz = np.array(Dz, dtype=float)
    x = np.array(x, dtype=float)
    N = len(x)
    Nm = max(Nn, Nd)
    x1 = np.insert(x, 0, np.zeros((Nm-1,), dtype=float))
    y1 = np.zeros((N+Nm-1,), dtype=float)
    for n in range(Nm-1, N+Nm-1):
        y1[n] = sum(Nz*np.flip(x1[n-Nn+1:n+1]))/Dz[0]   # = y1FIR[n]
        if Nd > 1:
            y1[n] = y1[n] - sum(Dz[1:]*np.flip(y1[n-Nd+1:n]))/Dz[0]
        print(y1[n])
    y = y1[Nm-1:]
    return y
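For what it is worth (my own note, not from the original answer): the loop above implements a standard direct-form difference equation, which SciPy already ships as a compiled routine, so in practice one would usually call something like:

from scipy import signal

# Nz and x as defined in the main script; [1.0] is the denominator for a pure FIR filter
y_fast = signal.lfilter(Nz, [1.0], x)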
