Cannot replicate results comparing Python, Numpy and Numba matrix multiplication

Cannot replicate results comparing Python, Numpy and Numba matrix multiplication - python

So while evaluating possibilities to speed up Python code i came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication
I was quite impressed with numba's performance and implemented some of our function in numba. Unfortunately the speedup was only there for very small matrices and for large matrices the code became very slow compared to the previous scipy sparse implementation. I thought this made sense but nevertheless i repeated the test in the original post (code below).
When using a 1000 x 1000 matrix, according to that post even the python implementation should take roughly 0,01 s. Here's my results though:
python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds
What am i doing wrong to get such different results than the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64 bit) and Python 2.7.10 (32 bit), a colleague tried the same code with the same results. This is the result for a 100x100 matrix:
python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds
Did i make some obvious mistakes?
import numpy as np
import numba as nb
import scipy.sparse
import time
class benchmark(object):
def __init__(self, name):
self.name = name
def __enter__(self):
self.start = time.time()
def __exit__(self, ty, val, tb):
end = time.time()
print("%s : %0.4f seconds" % (self.name, end-self.start))
return False
def dot_py(A, B):
m, n = A.shape
p = B.shape[1]
C = np.zeros((m, p))
for i in range(0, m):
for j in range(0, p):
for k in range(0, n):
C[i, j] += A[i, k] * B[k, j]
return C
def dot_np(A, B):
C = np.dot(A,B)
return C
def dot_scipy(A, B):
C = A * B
return C
dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)
dim_x = 1000
dim_y = 1000
a = scipy.sparse.rand(dim_x, dim_y, density=0.01)
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()
b_full = b.toarray()
print("starting test")
with benchmark("python"):
dot_py(a_full, b_full)
with benchmark("numpy"):
dot_np(a_full, b_full)
with benchmark("numba"):
dot_nb(a_full, b_full)
with benchmark("scipy"):
dot_scipy(a, b)
print("finishing test")
edit:
for anyone seeing this at a later time. this is the results i got when using sparse nxn matrices (1% of elements are nonzero).

In the linked stackoverflow question where you got the code from, m = n = 3 and p is variable, whereas you are using m = n = 1000, which is going to make a huge difference in the timings.

Related

How to implement multiprocessing in Monte Carlo integration

I created a Python program that integrates a given function over a given interval using Monte Carlo simulation. It works well, except for the fact that it runs painfully slow when you want higher levels of accuracy (larger N value). I figured I'd give multiprocessing a try in order to speed it up, but then I realized I have no clue how to implement it. Here's what I have right now:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Process
import os
# GOAL: Approximate the integral of a function f(x) from lower bound a to upper bound b using Monte Carlo simulation
# bounds of integration
a = 0
b = np.pi
# function to integrate
def f(x):
return np.sin(x)
N = 10000
areas = []
def mcIntegrate():
for i in range(N):
# array filled with random numbers between limits
xrand = random.uniform(a, b, N)
# sum the return values of the function of each random number
integral = 0.0
for i in range(N):
integral += f(xrand[i])
# scale integral by difference of bounds divided by amount of random values
ans = integral * ((b - a) / float(N))
# add approximation to list of other approximations
areas.append(ans)
if __name__ == "__main__":
processes = []
numProcesses = os.cpu_count()
for i in range(numProcesses):
process = Process(target=mcIntegrate)
processes.append(process)
for process in processes:
process.start()
for process in processes:
process.start()
# graph approximation distribution
plt.title("Distribution of Approximated Integrals")
plt.hist(areas, bins=30, ec='black')
plt.xlabel("Areas")
plt.show()
Can I get some help with this implementation?
Took advice from the comments and used multiprocessor.Pool, and also cut down on some operations by using NumPy instead. Went from taking about 5min to run to now about 6sec (for N = 10000). Here's my implementation:
import scipy
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import os
# GOAL: Approximate the integral of function f from lower bound a to upper bound b using Monte Carlo simulation
a = 0 # lower bound of integration
b = np.pi # upper bound of integration
f = np.sin # function to integrate
N = 10000 # sample size
def mcIntegrate(p):
xrand = scipy.random.uniform(a, b, N) # create array filled with random numbers within bounds
integral = np.sum(f(xrand)) # sum return values of function of each random number
approx = integral * ((b - a) / float(N)) # scale integral by difference of bounds divided by sample size
return approx
if __name__ == "__main__":
# run simulation N times in parallel and store results in array
with multiprocessing.Pool(os.cpu_count()) as pool:
areas = pool.map(mcIntegrate, range(N))
# graph approximation distribution
plt.title("Distribution of Approximated Integrals")
plt.hist(areas, bins=30, ec='black')
plt.xlabel("Areas")
plt.show()

This turned out to be a more interesting problem than I thought it would when I got to optimising it. The basic method is very simple:
from multiprocessing import pool
def f(x):
return x
results = pool.map(f, range(100))
Here is your mcIntegerate adapted for multiprocessing:
from tqdm import tqdm
def mcIntegrate(steps):
tasks = []
print("Setting up simulations")
# linear
for _ in tqdm(range(steps)):
xrand = random.uniform(a, b, steps)
for i in range(steps):
tasks.append(xrand[i])
pool = Pool(cpu_count())
print("Simulating (no progress)")
results = pool.map(f, tasks)
pool.close()
print("summing")
areas = []
for chunk in tqdm(range(steps)):
vals = results[chunk * steps : (chunk + 1) * steps]
integral = sum(vals)
ans = integral * ((b - a) / float(steps))
areas.append(ans)
return areas
tqdm is just used to display a progress bar.
This is the basic workflow for multiprocessing: break the question up into tasks, solve all the tasks, then add them all back together again. And indeed the code as given works. (Note that I've changed your N for steps).
For completeness, the script now begins:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
# function to integrate
def f(x):
return np.sin(x)
and ends
areas = mcIntegrate(3_000)
a = 0
b = np.pi
plt.title("Distribution of Approximated Integrals")
plt.hist(areas, bins=30, ec="black")
plt.xlabel("Areas")
plt.show()
Optimisation
I deliberately split the problem up at the smallest possible level. Was this a good idea? To answer that, consider: how might we optimise the linear process of generating the tasks? This does take a considerable while at the moment. We could parallelise it:
def _prepare(steps):
xrand = random.uniform(a, b, steps)
return [xrand[i] for i in range(steps)]
def mcIntegrate(steps):
...
tasks = []
for res in tqdm(pool.imap(_prepare, (steps for _ in range(steps))), total=steps):
tasks += res # slower except for very large steps
Here I've used pool.imap, which returns an iterator which we can iterate as soon as the results are available, allowing us to build a progress bar. If you do this and compare, you will see that it runs slower than the linear solution. Removing the progress bar (on my machine) and replace with:
import time
start = time.perf_counter()
results = pool.map(_prepare, (steps for _ in range(steps)))
tasks = []
for res in results:
tasks += res
print(time.perf_counter() - start)
Is only marginally faster: it's still slower than running linear. Serialising data to a process and then deserialising it has an overhead. If you try to get a progress bar on the whole thing, it becomes excruciatingly slow:
results = []
for result in tqdm(pool.imap(f, tasks), total=len(tasks)):
results.append(result)
So what about iterating at a higher level? Here's another adaption of your mcIterate:
a = 0
b = np.pi
def _mcIntegrate(steps):
xrand = random.uniform(a, b, steps)
integral = 0.0
for i in range(steps):
integral += f(xrand[i])
ans = integral * ((b - a) / float(steps))
return ans
def mcIntegrate(steps):
areas = []
p = Pool(cpu_count())
for ans in tqdm(p.imap(_mcIntegrate, ((steps) for _ in range(steps))), total=steps):
areas.append(ans)
return areas
This, on my machine, is considerably faster. It's also considerably simpler. I was expecting a difference, but not such a considerable difference.
Takeaways
Multiprocessing isn't free. Something as simple as np.sin() is too cheap to multprocess: we pay to serialise, deserialise, append, and so on, all for one sin() calculation. But if you do too many calculations, you will waste time as you lose granularity. Here the effect is more striking than I was expecting. The only way to know the right level of granularity for a particular problem... is to profile and try.

My experience is that multiprocessing is often not very efficient (a ton of overhead). The more you push your code into numpy the faster it will be, with one caveat; you can overload your memory if you're not careful (10k x 10k is getting large). Lastly, it looks like N is doing double duty, both defining sample size for each estimate, and also serving as the number of trial estimates.
Here is how I would do this (with minor style changes):
import numpy as np
f = np.sin
a = 0
b = np.pi
# number samples for each trial, trial count, and number calculated at once
N = 10000
TRIALS = 10000
BATCH_SIZE=1000
def mc_integrate(f, a, b, N, batch_size=BATCH_SIZE):
# compute everything carrying `batch_size` copies by extending the array dimension.
# samples.shape == (N, batch_size)
samples = np.random.uniform(a, b, size=(N, batch_size))
integrals = np.sum(f(samples), axis=0)
mc_estimates = integrals * ((b - a) / N)
return mc_estimates
# loop over batch values to get final result
n, r = divmod(TRIALS, BATCH_SIZE)
results = []
for j in [BATCH_SIZE]*n + [r]:
results.extend(mc_integrate(f, a, b, N, batch_size=j))
On my machine this takes a few seconds.

Solving Linear Equations on the GPU with NumPy and PyTorch

I am trying to solve a lot of linear equations as fast as possible. To find out the fastest way I benchmarked NumPy and PyTorch, each on the CPU and on my GeForce 1080 GPU (using Numba for NumPy). The results really confused me.
This is the code I used with Python 3.8:
import timeit
import torch
import numpy
from numba import njit
def solve_numpy_cpu(dim: int = 5):
a = numpy.random.rand(dim, dim)
b = numpy.random.rand(dim)
for _ in range(1000):
numpy.linalg.solve(a, b)
def solve_numpy_njit_a(dim: int = 5):
njit(solve_numpy_cpu, dim=dim)
#njit
def solve_numpy_njit_b(dim: int = 5):
a = numpy.random.rand(dim, dim)
b = numpy.random.rand(dim)
for _ in range(1000):
numpy.linalg.solve(a, b)
def solve_torch_cpu(dim: int = 5):
a = torch.rand(dim, dim)
b = torch.rand(dim, 1)
for _ in range(1000):
torch.solve(b, a)
def solve_torch_gpu(dim: int = 5):
torch.set_default_tensor_type("torch.cuda.FloatTensor")
solve_torch_cpu(dim=dim)
def main():
for f in (solve_numpy_cpu, solve_torch_cpu, solve_torch_gpu, solve_numpy_njit_a, solve_numpy_njit_b):
time = timeit.timeit(f, number=1)
print(f"{f.__name__:<20s}: {time:f}")
if __name__ == "__main__":
main()
And these are the results:
solve_numpy_cpu : 0.007275
solve_torch_cpu : 0.012244
solve_torch_gpu : 5.239126
solve_numpy_njit_a : 0.000158
solve_numpy_njit_b : 1.273660
The slowest is CUDA accelerated PyTorch. I verified that PyTorch is using my GPU with
import torch
torch.cuda.is_available()
torch.cuda.get_device_name(0)
returning
True
'GeForce GTX 1080'
I can get behind that, on the CPU, PyTorch is slower than NumPy. What I cannot understand is why PyTorch on the GPU is so much slower. Not that important but actually even more confusing is that Numba's njit decorator makes performance orders of magnitude slower, until you don't use the # decorator syntax anymore.
Is it my setup? Occasionally I get a weird message about the windows page / swap file not being big enough. In case I've taken a completely obscure path to solving linear equations on the GPU, I'd be happy to be directed into another direction.
Edit
So, I focussed on Numba and changed my benchmarking a bit. As suggested by #max9111 I rewrote the functions to receive input and produce output because, in the end, that's what anyone would want to use them for. Now, I also perform a first compile run for the Numba accelerated function so the subsequent timing is fairer. Finally, I checked the performance against matrix size and plotted the results.
TL/DR: Up to matrix sizes of 500x500, Numba acceleration doesn't really make a difference for numpy.linalg.solve.
Here is the code:
import time
from typing import Tuple
import numpy
from matplotlib import pyplot
from numba import jit
#jit(nopython=True)
def solve_numpy_njit(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
parameters = numpy.linalg.solve(a, b)
return parameters
def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
parameters = numpy.linalg.solve(a, b)
return parameters
def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
a = numpy.random.random((dim, dim))
b = numpy.random.random(dim)
return a, b
def main():
a, b = get_data(10)
# compile numba function
p = solve_numpy_njit(a, b)
matrix_size = [(x + 1) * 10 for x in range(50)]
non_accelerated = []
accelerated = []
results = non_accelerated, accelerated
for j, each_matrix_size in enumerate(matrix_size):
for m, f in enumerate((solve_numpy, solve_numpy_njit)):
average_time = -1.
for k in range(5):
time_start = time.time()
for i in range(100):
a, b = get_data(each_matrix_size)
p = f(a, b)
d_t = time.time() - time_start
print(f"{each_matrix_size:d} {f.__name__:<30s}: {d_t:f}")
average_time = (average_time * k + d_t) / (k + 1)
results[m].append(average_time)
pyplot.plot(matrix_size, non_accelerated, label="not numba")
pyplot.plot(matrix_size, accelerated, label="numba")
pyplot.legend()
pyplot.show()
if __name__ == "__main__":
main()
And these are the results (runtime against matrix edge length):
Edit 2
Seeing that Numba doesn't make much of a difference in my case, I came back to benchmarking PyTorch. And indeed, it appears to be roughly 4x faster than Numpy without even using a CUDA device.
Here is the code I used:
import time
from typing import Tuple
import numpy
import torch
from matplotlib import pyplot
def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
parameters = numpy.linalg.solve(a, b)
return parameters
def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
a = numpy.random.random((dim, dim))
b = numpy.random.random(dim)
return a, b
def get_data_torch(dim: int) -> Tuple[torch.tensor, torch.tensor]:
a = torch.rand(dim, dim)
b = torch.rand(dim, 1)
return a, b
def solve_torch(a: torch.tensor, b: torch.tensor) -> torch.tensor:
parameters, _ = torch.solve(b, a)
return parameters
def experiment_numpy(matrix_size: int, repetitions: int = 100):
for i in range(repetitions):
a, b = get_data(matrix_size)
p = solve_numpy(a, b)
def experiment_pytorch(matrix_size: int, repetitions: int = 100):
for i in range(repetitions):
a, b = get_data_torch(matrix_size)
p = solve_torch(a, b)
def main():
matrix_size = [x for x in range(5, 505, 5)]
experiments = experiment_numpy, experiment_pytorch
results = tuple([] for _ in experiments)
for i, each_experiment in enumerate(experiments):
for j, each_matrix_size in enumerate(matrix_size):
time_start = time.time()
each_experiment(each_matrix_size, repetitions=100)
d_t = time.time() - time_start
print(f"{each_matrix_size:d} {each_experiment.__name__:<30s}: {d_t:f}")
results[i].append(d_t)
for each_experiment, each_result in zip(experiments, results):
pyplot.plot(matrix_size, each_result, label=each_experiment.__name__)
pyplot.legend()
pyplot.show()
if __name__ == "__main__":
main()
And here's the result (runtime against matrix edge length):
So for now, I'll be sticking with torch.solve. However, the original question remains:
How can I exploit my GPU to solve linear equations even faster?

Numba, strange type modification

I am trying to further speed up some code written in python, compiled using Numba. When looking at the assembly generated by numba, I noticed double-precision operations being generated, which I felt was odd since the inputs and outputs are all supposed to be float32.
I declare the variable/array types as float32 outside of the jitted loop and pass them into the function. Strangely, I find that after running my tests, the variable "scalarout" is converted to python float, which is actually a 64 bit value.
My code:
from scipy import ndimage, misc
import matplotlib.pyplot as plt
import numpy.fft
from timeit import default_timer as timer
import numba
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32
from numba import jit, njit, prange
from numba import cuda
import numpy as np
import scipy as sp
# import llvmlite.binding as llvm
# llvm.set_option('', '--debug-only=loop-vectorize')
#njit(fastmath=True, parallel=False)
def mydot(a, b, xlen, ylen, scalarout):
scalarout = (np.float32)(0.0)
for y in prange(ylen):
for x in prange(xlen):
scalarout += a[y, x] * b[y, x]
return scalarout
# ======================================== TESTS ========================================
print()
xlen = 100000
ylen = 16
a = np.random.rand(ylen, xlen).astype(np.float32)
b = np.random.rand(ylen, xlen).astype(np.float32)
print("a type = ", type(a[1,1]))
scalarout = (np.float32)(0.0)
print("scalarout type, before execution = ", type(scalarout))
iters=1000
time = 100.0
for n in range(iters):
start = timer()
scalarout = mydot(a, b, xlen, ylen, scalarout)
end = timer()
if(end-start < time):
time = end-start
print("Numba njit function time, in us = %16.10f" % ((end-start)*10**6))
print("function output = %f" % scalarout)
print("scalarout type, after execution = ", type(scalarout))

This is more of an extended comment than an answer. If you change the scalarout to be a float32 array of length 1 and modify that, your output is float32.
#njit(fastmath=True, parallel=False)
def mydot(a, b, xlen, ylen):
scalarout = np.array([0.0], dtype=np.float32)
for y in prange(ylen):
for x in prange(xlen):
scalarout[0] += a[y, x] * b[y, x]
return scalarout
If you change return scalarout to return scalarout[0], then the output is again a python float.
In your original code for mydot, the result is a python float even if you write return np.float32(scalarout).

Performance: Matlab vs Python

I recently switched from Matlab to Python. While converting one of my lengthy codes, I was surprised to find Python being very slow. I profiled and traced the problem with one function hogging up time. This function is being called from various places in my code (being part of other functions which are recursively called). Profiler suggests that 300 calls are made to this function in both Matlab and Python.
In short, following codes summarizes the issue at hand:
MATLAB
The class containing the function:
classdef ExampleKernel1 < handle
methods (Static)
function [kernel] = kernel_2D(M,x,N,y)
kernel = zeros(M,N);
for i= 1 : M
for j= 1 : N
% Define the custom kernel function here
kernel(i , j) = sqrt((x(i , 1) - y(j , 1)) .^ 2 + ...
(x(i , 2) - y(j , 2)) .^2 );
end
end
end
end
end
and the script to call test.m:
xVec=[
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
];
N=size(xVec,1);
kex1=ExampleKernel1;
tic
for i=1:300
K=kex1.kernel_2D(N,xVec,N,xVec);
end
toc
Gives the output
clear all
>> test
Elapsed time is 0.022426 seconds.
>> test
Elapsed time is 0.009852 seconds.
PYTHON 3.4
Class containing the function CustomKernels.py:
from numpy import zeros
from math import sqrt
class CustomKernels:
"""Class for defining the custom kernel functions"""
#staticmethod
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
kernel = zeros([M, N])
for i in range(0, M):
for j in range(0, N):
# Define the custom kernel function here
kernel[i, j] = sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
and the script to call test.py:
import numpy as np
from CustomKernels import CustomKernels
from time import perf_counter
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
N = xVec.shape[0]
kex1 = CustomKernels.exampleKernelA
start=perf_counter()
for i in range(0,300):
K = kex1(N, xVec, N, xVec)
print(' %f secs' %(perf_counter()-start))
Gives the output
%run test.py
0.940515 secs
%run test.py
0.884418 secs
%run test.py
0.940239 secs
RESULTS
Comparing the results it seems Matlab is about 42 times faster after a "clear all" is called and then 100 times faster if script is run multiple times without calling "clear all". That is at least and order of magnitude if not two orders of magnitudes faster. This is a very surprising result for me. I was expecting the result to be the other way around.
Can someone please shed some light on this?
Can someone suggest a faster way to perform this?
SIDE NOTE
I have also tried to use numpy.sqrt which makes the performance worse, therefore I am using math.sqrt in Python.
EDIT
The for loops for calling the functions are purely fictitious. They are there just to "simulate" 300 calls to the function. As I described earlier, the kernel functions (kernel_2D in Matlab and kex1 in Python) are called from various different places in the program. To make the problem shorter, I "simulate" the 300 calls using the for loop. The for loops inside the kernel functions are essential and unavoidable because of the structure of the kernel matrix.
EDIT 2
Here is the larger problem: https://github.com/drfahdsiddiqui/bbfmm2d-python

You want to get rid of those for loops. Try this:
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
i, j = np.indices((N, M))
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
You can also do it with broadcasting, which may be even faster, but a little less intuitive coming from MATLAB.

Upon further investigation I have found that using indices as indicated in the answer is still slower.
Solution: Use meshgrid
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = meshgrid(y[:, 0], x[:, 0])
x1, y1 = meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
Result: Very very fast, 10 times faster than indices approach. I am getting times which are closer to C.
However: Using meshgrid with Matlab beats C and Numpy by being 10 times faster than both.
Still wondering why!

Matlab uses commerical MKL library. If you use free python distribution, check whether you have MKL or other high performance blas library used in python or it is the default ones, which could be much slower.

Comparing Jit-Compilers
It has been mentioned that Matlab uses an internal Jit-compiler to get good performance on such tasks. Let's compare Matlabs jit-compiler with a Python jit-compiler (Numba).
Code
import numba as nb
import numpy as np
import math
import time
#If the arrays are somewhat larger it makes also sense to parallelize this problem
#cache ==True may also make sense
#nb.njit(fastmath=True)
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
def exampleKernelB(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = np.meshgrid(y[:, 0], x[:, 0])
x1, y1 = np.meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = np.sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
#nb.njit()
def exampleKernelC(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
#Your test data
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
#compilation on first callable
#can be avoided with cache=True
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
t1=time.time()
for i in range(10_000):
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelB(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
Performance
exampleKernelA: 0.03s
exampleKernelC: 0.03s
exampleKernelB: 1.02s
Matlab_2016b (your code, but 10000 rep., after few runs): 0.165s

I got ~5x speed improvement over the meshgrid solution using only broadcasting:
def exampleKernelD(M, x, N, y):
return np.sqrt((x[:,1:] - y[:,1:].T) ** 2 + (x[:,:1] - y[:,:1].T) ** 2)

whether to use python to the huge computing?

Sorry for my bad English.I am currently work with python and i have a problem with too slow to filling of matrix size of 10000x10000.I program a "relaxation method" where i need to test the different size of matrix.What do you think about that?...P.S I wait 3-5 min to fill one matrix 10000x10000.
def CalcMatrix(self):
for i in range(0, self._n + 1): #(0,10000)
for j in range(0, self._n + 1): #(0,10000)
one = sin(pi * self._x[i]) # x is the vector of size 10000
two = sin(pi * self._y[j]) # y too
self._f[i][j] = 2 * pi * pi * one * two #fill

Native Python loop speed is very slow. People working with arrays and matrices in Python usually use numpy. There are other tools like cython and numba which can dramatically improve the speed in certain circumstances, but the basic idea of numpy is to vectorize the operations and push the hard work down to fast libraries implemented in C and fortran.
The following code takes only a few seconds on my not-very-fast notebook:
import numpy as np
from numpy import pi
x = np.linspace(0,1,10**4)
y = np.linspace(2,5,10**4)
ans = 2*pi**2 * np.outer(np.sin(pi*x), np.sin(pi*y))
(PS: If your _n == 10000, then won't your matrix be 10001x10001, not 10000x10000?)

Some improvements are possible. Consider moving some of the computations out of the loop
def CalcMatrix(self):
for i in range(0, self._n + 1): #(0,10000)
one = sin(pi * self._x[i]) # x is the vector of size 10000
for j in range(0, self._n + 1): #(0,10000)
two = sin(pi * self._y[j]) # y too
self._f[i][j] = 2 * pi * pi * one * two #fill
This value 2 * pi * pi can be precomputed and stored in a variable so it doesn't have to be recomputed each time in the loop.
If that is still not enough, consider using a native language like C or Fortran.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cannot replicate results comparing Python, Numpy and Numba matrix multiplication - python

In the linked stackoverflow question where you got the code from, m = n = 3 and p is variable, whereas you are using m = n = 1000, which is going to make a huge difference in the timings.

Related

How to implement multiprocessing in Monte Carlo integration

Solving Linear Equations on the GPU with NumPy and PyTorch

Numba, strange type modification

Performance: Matlab vs Python

whether to use python to the huge computing?

Categories

Resources