I recently played with Cython and Numba to accelerate small pieces of a Python program that does numerical simulation. At first, developing with Numba seems easier. Yet, I found it difficult to understand when Numba will provide better performance and when it will not.
One example of unexpected performance drop is when I use the function np.zeros() to allocate a big array in a compiled function. For example, consider the three function definitions:
import numpy as np
from numba import jit

def pure_python(n):
    mat = np.zeros((n,n), dtype=np.double)
    # do something
    return mat.reshape((n**2))

@jit(nopython=True)
def pure_numba(n):
    mat = np.zeros((n,n), dtype=np.double)
    # do something
    return mat.reshape((n**2))

def mixed_numba1(n):
    return mixed_numba2(np.zeros((n,n)))

@jit(nopython=True)
def mixed_numba2(array):
    n = len(array)
    # do something
    return array.reshape((n,n))

# To compile
pure_numba(10)
mixed_numba1(10)
Since the #do something is empty, I do not expect the pure_numba function to be faster. Yet, I was not expecting such a performance drop:
n=10000
%timeit x = pure_python(n)
%timeit x = pure_numba(n)
%timeit x = mixed_numba1(n)
I obtain (python 3.7.7, numba 0.48.0 on a mac)
4.96 µs ± 65.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
344 ms ± 7.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.8 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Here, the numba code is much slower when I use the function np.zeros() inside the compiled function. It works normally when the np.zeros() is outside the function.
Am I doing something wrong here or should I always allocate big arrays like these outside functions that are compiled by numba?
Update
This seems related to a lazy initialization of the matrices by np.zeros((n,n)) when n is large enough (see Performance of zeros function in Numpy ).
for n in [1000, 2000, 5000]:
    print('n=',n)
    %timeit x = pure_python(n)
    %timeit x = pure_numba(n)
    %timeit x = mixed_numba1(n)
gives me:
n = 1000
468 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
296 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
300 µs ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 2000
4.79 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.45 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.54 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n = 5000
270 µs ± 4.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
104 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
119 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
tl;dr Numpy uses C memory functions whereas Numba must assign zeros
I wrote a script to plot the time it takes for several options to complete and it appears that Numba has a severe drop in performance when the size of the np.zeros array reaches 2048*2048*8 = 32 MB on my machine as shown in the diagram below.
Numba's implementation of np.zeros is just as fast as creating an empty array and filling it with zeros by iterating over the dimensions of the array (this is the Numba nested loop green curve of the diagram). This can actually be double-checked by setting the NUMBA_DUMP_IR environment variable before running the script (see below). When comparing to the dump for numba_loop there is not much difference.
Interestingly, np.zeros gets a little boost past the 32 MB threshold.
My best guess, although I am far from an expert, is that the 32 MB limit is an OS or hardware bottleneck coming from the amount of data that can fit in a cache for the same process. If this is exceeded, the operation of moving data in and out of the cache to operate on it is very time consuming.
By contrast, Numpy uses calloc to get a memory segment with a promise that the data will be filled with zeros when it is accessed.
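The lazy zeroing can be observed without Numba at all. The following is a minimal sketch, assuming a platform where large calloc requests are served from untouched zero pages (common on Linux and macOS, but not guaranteed). The helper names `time_call`, `zeros_only`, `zeros_then_touch` and `empty_then_fill` are hypothetical, and the timings will differ between machines.

```python
import time
import numpy as np

def time_call(func, n, repeats=5):
    """Return the best wall-clock time (seconds) of func(n) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(n)
        best = min(best, time.perf_counter() - start)
    return best

def zeros_only(n):
    # Allocation only: the memory may not have been physically touched yet.
    return np.zeros((n, n), dtype=np.double)

def zeros_then_touch(n):
    # Writing to every element forces the OS to back each page with real memory.
    mat = np.zeros((n, n), dtype=np.double)
    mat += 1.0
    return mat

def empty_then_fill(n):
    # Explicit zero fill, similar in spirit to what the Numba nested loop does.
    mat = np.empty((n, n), dtype=np.double)
    mat.fill(0.0)
    return mat

n = 2048  # 2048 * 2048 * 8 bytes = 32 MiB
for func in (zeros_only, zeros_then_touch, empty_then_fill):
    print(f"{func.__name__:18s} {time_call(func, n) * 1e3:8.3f} ms")
```

If `zeros_only` is far cheaper than the other two, the cost of zeroing is being deferred until the pages are first written, which would be consistent with the calloc explanation.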
This is how far I got and I realise it's only half an answer but maybe someone more knowledgeable can shed some light on what is actually going on.
Numba IR dump:
---------------------------IR DUMP: pure_numba_zeros----------------------------
label 0:
n = arg(0, name=n) ['n']
$2load_global.0 = global(np: <module 'numpy' from '/lib/python3.8/site-packages/numpy/__init__.py'>) ['$2load_global.0']
$4load_attr.1 = getattr(value=$2load_global.0, attr=zeros) ['$2load_global.0', '$4load_attr.1']
del $2load_global.0 []
$10build_tuple.4 = build_tuple(items=[Var(n, script.py:15), Var(n, script.py:15)]) ['$10build_tuple.4', 'n', 'n']
$12load_global.5 = global(np: <module 'numpy' from '/lib/python3.8/site-packages/numpy/__init__.py'>) ['$12load_global.5']
$14load_attr.6 = getattr(value=$12load_global.5, attr=double) ['$12load_global.5', '$14load_attr.6']
del $12load_global.5 []
$18call_function_kw.8 = call $4load_attr.1($10build_tuple.4, func=$4load_attr.1, args=[Var($10build_tuple.4, script.py:15)], kws=[('dtype', Var($14load_attr.6, script.py:15))], vararg=None) ['$10build_tuple.4', '$14load_attr.6', '$18call_function_kw.8', '$4load_attr.1']
del $4load_attr.1 []
del $14load_attr.6 []
del $10build_tuple.4 []
mat = $18call_function_kw.8 ['$18call_function_kw.8', 'mat']
del $18call_function_kw.8 []
$24load_method.10 = getattr(value=mat, attr=reshape) ['$24load_method.10', 'mat']
del mat []
$const28.12 = const(int, 2) ['$const28.12']
$30binary_power.13 = n ** $const28.12 ['$30binary_power.13', '$const28.12', 'n']
del n []
del $const28.12 []
$32call_method.14 = call $24load_method.10($30binary_power.13, func=$24load_method.10, args=[Var($30binary_power.13, script.py:16)], kws=(), vararg=None) ['$24load_method.10', '$30binary_power.13', '$32call_method.14']
del $30binary_power.13 []
del $24load_method.10 []
$34return_value.15 = cast(value=$32call_method.14) ['$32call_method.14', '$34return_value.15']
del $32call_method.14 []
return $34return_value.15 ['$34return_value.15']
The script to produce the diagram:
import numpy as np
from numba import jit
from time import time
import os
import matplotlib.pyplot as plt

os.environ['NUMBA_DUMP_IR'] = '1'

def numpy_zeros(n):
    mat = np.zeros((n,n), dtype=np.double)
    return mat.reshape((n**2))

@jit(nopython=True)
def numba_zeros(n):
    mat = np.zeros((n,n), dtype=np.double)
    return mat.reshape((n**2))

@jit(nopython=True)
def numba_loop(n):
    mat = np.empty((n * 2,n), dtype=np.float32)
    for i in range(mat.shape[0]):
        for j in range(mat.shape[1]):
            mat[i, j] = 0.
    return mat.reshape((2 * n**2))

# To compile
numba_zeros(10)
numba_loop(10)

os.environ['NUMBA_DUMP_IR'] = '0'

max_n = 4100
time_deltas = {
    'numpy_zeros': [],
    'numba_zeros': [],
    'numba_loop': [],
}
call_count = 10

for n in range(0, max_n, 10):
    for f in (numpy_zeros, numba_zeros, numba_loop):
        start = time()
        for i in range(call_count):
            x = f(n)
        delta = time() - start
        time_deltas[f.__name__].append(delta / call_count)
        print(f'{f.__name__:25} n = {n}: {delta}')
    print()

size = np.arange(0, max_n, 10) ** 2 * 8 / 1024 ** 2

fig, ax = plt.subplots()
plt.xticks(np.arange(0, size[-1], 16))
plt.axvline(x=32, color='gray', lw=0.5)
ax.plot(size, time_deltas['numpy_zeros'], label='Numpy zeros (calloc)')
ax.plot(size, time_deltas['numba_zeros'], label='Numba zeros')
ax.plot(size, time_deltas['numba_loop'], label='Numba nested loop')
ax.set_xlabel('Size of array in MB')
ax.set_ylabel(r'Mean $\Delta$t in s')
plt.legend(loc='upper left')
plt.show()
Related
I was wondering if anyone has an idea on how to speed up identifying which indices of an array fall between a set of values.
Let's say I have a 1d array of sorted values (~50k) and a large list (>100k) of min/max pairs, and I want to determine which indices of the 1d array (if any) fall within each pair. I must also be able to do this many times, where the 1d array changes in size/shape.
My current approach is to use numpy and numba and list comprehension but unfortunately it doesn't really scale. It's okay if I try to look for ~1k values but when the number is much larger, it's too slow to be able to repeat it 1000s of times.
Current code:
import numpy as np
import numba

@numba.njit()
def find_between_batch(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Find indices between specified boundaries for many items."""
    res = []
    for i in range(len(min_value)):
        res.append(np.where(np.logical_and(array >= min_value[i], array <= max_value[i]))[0])
    return res
Here is an example of the input:
x = np.linspace(0, 2000, 50000) # input 1d array
# these are the boundaries for which we should find the indices
mins = np.sort(np.random.choice(x, 10000)) - 0.01 # lower values to search for
maxs = mins + 0.02 # upper values to search for
And the current performance
# pre-compile
result = find_between_batch(x, mins, maxs)
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
And example output
result
[array([11]),
array([14]),
array([19]),
array([23]),
...
]
Does anyone have a suggestion on how to speed this up or if there is another approach that could give me the same results?
Thanks for the suggestion to use np.searchsorted - I've come up with a solution that is approx. 10-100x faster than my initial attempt.
@numba.njit()
def find_between_batch2(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Find indices between specified boundaries for many items."""
    min_indices = np.searchsorted(array, min_value, side="left")
    max_indices = np.searchsorted(array, max_value, side="right")
    res = []
    for i in range(len(min_value)):
        _array = array[min_indices[i]:max_indices[i]]
        # find_between (not shown here) handles a single min/max pair
        res.append(min_indices[i] + find_between(_array, min_value[i], max_value[i]))
    return res
Original code:
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
Updated code:
%timeit -r 3 -n 10 find_between_batch2(x, mins, maxs)
6.36 ms ± 73.6 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
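Because the array is sorted, the two np.searchsorted calls already bound each interval, so the matching indices form one contiguous range. The following is a minimal sketch of that idea in plain NumPy (no Numba), assuming an ascending array and inclusive bounds; `find_between_sorted` is a hypothetical name, not a function from the post.

```python
import numpy as np

def find_between_sorted(array, min_value, max_value):
    """For each pair j, return indices i with min_value[j] <= array[i] <= max_value[j].

    Assumes `array` is sorted in ascending order.
    """
    lo = np.searchsorted(array, min_value, side="left")   # first index with value >= min
    hi = np.searchsorted(array, max_value, side="right")  # first index with value > max
    return [np.arange(start, stop) for start, stop in zip(lo, hi)]

x = np.linspace(0, 2000, 50000)             # sorted input array, as in the question
rng = np.random.default_rng(0)
mins = np.sort(rng.choice(x, 1000)) - 0.01  # lower bounds
maxs = mins + 0.02                          # upper bounds
result = find_between_sorted(x, mins, maxs)
```

If only the extent of each match is needed, keeping the `lo` and `hi` arrays and skipping the list of index arrays avoids the Python-level loop entirely.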
I am currently working with an array, containing categorical data.
Categories are organised like this: None, zoneA, zoneB
My array holds sensor measurements; it tells me whether, at any time, the sensor is in zoneA, zoneB or not in a zone.
My goal here is to smooth those values.
For example, the sensor could be out of zoneA or zoneB for a period of 30 measures; if that happens, I want those measures to be "smoothed".
Ex :
array[zoneA, zoneA, zoneA, None, None, zoneA, zoneA, None, None, None, zoneA]
should give
array[zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, None, None, None, zoneA]
with a threshold of 2.
Currently, I am iterating over the array, but this is too expensive and can take 1 or 2 minutes. Is there an existing algorithm that solves this problem?
My current code :
def smooth(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns:dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= self.delay)
            and (
                df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")]
                == "None"
            )
        ):
            df_iter.iloc[
                last_index: (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Python implementation
I like to start this kind of thing clean and simple. Therefore I just wrote a simple class that does exactly what is needed, without thinking too much about optimization. I call it Interpolator as this looks like categorical interpolation to me.
class Interpolator:
    def __init__(self, data):
        self.data = data
        self.current_idx = 0
        self.current_nan_region_start = None
        self.result = None
        self.maxgap = 1

    def run(self, maxgap=2):
        # Initialization
        self.result = [None] * len(self.data)
        self.maxgap = maxgap
        self.current_nan_region_start = None
        prev_isnan = 0
        for idx, item in enumerate(self.data):
            isnan = item is None
            self.current_idx = idx
            if isnan:
                if prev_isnan:
                    # Result is already filled with empty data.
                    # Do nothing.
                    continue
                else:
                    self.entered_nan_region()
                    prev_isnan = 1
            else:  # not nan
                if prev_isnan:
                    self.exited_nan_region()
                    prev_isnan = 0
                else:
                    self.continuing_in_categorical_region()

    def entered_nan_region(self):
        self.current_nan_region_start = self.current_idx

    def continuing_in_categorical_region(self):
        self.result[self.current_idx] = self.data[self.current_idx]

    def exited_nan_region(self):
        nan_region_end = self.current_idx - 1
        nan_region_length = nan_region_end - self.current_nan_region_start + 1
        # Always copy the empty region endpoint even if gap is not filled
        self.result[self.current_idx] = self.data[self.current_idx]
        if nan_region_length > self.maxgap:
            # Do not interpolate as exceeding maxgap
            return
        if self.current_nan_region_start == 0:
            # Special case. data starts with "None"
            # -> Cannot interpolate
            return
        if self.data[self.current_nan_region_start - 1] != self.data[self.current_idx]:
            # Do not fill as both ends of missing data
            # region do not have same value
            return
        # Fill the gap
        for idx in range(self.current_nan_region_start, self.current_idx):
            self.result[idx] = self.data[self.current_idx]


def interpolate(data, maxgap=2):
    """
    Interpolate categorical variables over missing
    values (None's).

    Parameters
    ----------
    data: list of objects
        The data to interpolate. Holds
        categorical data, such as 'cat', 'dog'
        or 108. None is handled as missing data.
    maxgap: int
        The maximum gap to interpolate over.
        For example, with maxgap=2, ['car', None,
        None, 'car', None, None, None, 'car']
        would become ['car', 'car', 'car', 'car',
        None, None, None, 'car'].

    Note: Interpolation will only occur on missing
    data regions where both ends contain the same value.
    For example, [1, None, 2, None, 2] will become
    [1, None, 2, 2, 2].
    """
    interpolator = Interpolator(data)
    interpolator.run(maxgap=maxgap)
    return interpolator.result
This is how one would use it (code for get_data() below):
data = get_data(k=100)
interpolated_data = interpolate(data)
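To make the filling rule concrete, here is a compact, self-contained restatement of the same logic applied to the example from the question. It is only a sketch and not the class above: `fill_short_gaps` is a hypothetical name, and it assumes that missing values are represented by None.

```python
def fill_short_gaps(data, maxgap=2):
    """Fill runs of None no longer than maxgap when both neighbours are equal."""
    result = list(data)
    gap_start = None  # index where the current run of None began, if any
    for idx, item in enumerate(data):
        if item is None:
            if gap_start is None:
                gap_start = idx
            continue
        if gap_start is not None:
            gap_length = idx - gap_start
            # Fill only interior gaps that are short and enclosed by the same category.
            if gap_start > 0 and gap_length <= maxgap and data[gap_start - 1] == item:
                result[gap_start:idx] = [item] * gap_length
            gap_start = None
    return result

readings = ["zoneA", "zoneA", "zoneA", None, None, "zoneA", "zoneA",
            None, None, None, "zoneA"]
smoothed = fill_short_gaps(readings, maxgap=2)
print(smoothed)
# ['zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', None, None, None, 'zoneA']
```

The gap of two None values is filled, while the gap of three is left alone because it exceeds maxgap.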
Copy-paste Cython implementation
Most probably the Python implementation is fast enough: with an array size of 1,000,000, the time needed to process the data is 0.504 seconds on my laptop. Anyway, creating Cython versions is fun and might give a small additional speedup.
Needed steps:
Copy-paste the Python implementation into a new file, called fast_categorical_interpolate.pyx
Create setup.py in the same folder, with the following contents:
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "fast_categorical_interpolate.pyx",
        language_level="3",
    ),
)
Run python setup.py build_ext --inplace to build the Cython extension. You'll see something like fast_categorical_interpolate.cp38-win_amd64.pyd in the same folder.
Now, you may use the interpolator like this:
import fast_categorical_interpolate as fpi
data = get_data(k=100)
interpolated_data = fpi.interpolate(data)
Of course, there might be some optimizations that you could do in the Cython code to make this even faster, but on my machine the speed improvement was 38% out of the box with N=1,000,000 and 126% when N=10,000.
Timings on my machine
When N=100 (number of items in the list), the Python implementation is about 160x faster and the Cython implementation about 250x faster than smooth.
In [8]: timeit smooth(test_df, delay=2)
10.2 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: timeit interpolate(data)
64.8 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: timeit fpi.interpolate(data)
41.3 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When N=10,000, the timing difference is about 190x (Python) to 302x (Cython).
In [5]: timeit smooth(test_df, delay=2)
1.08 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit interpolate(data)
5.69 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: timeit fpi.interpolate(data)
3.57 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When N=1,000,000, the Python implementation is about 210x faster and the Cython implementation is about 287x faster.
In [9]: timeit smooth(test_df, delay=2)
1min 45s ± 24.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit interpolate(data)
504 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: timeit fpi.interpolate(data)
365 ms ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appendix
Test data creator get_data()
import random

random.seed(0)

def get_data(k=100):
    return random.choices(population=[None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k)
Function and test data for testing smooth()
import pandas as pd

data = get_data(k=1000)
test_df = pd.DataFrame(dict(landlot=data)).fillna("None")

def smooth(df: pd.DataFrame, delay=2) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns:dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= delay)
            and (df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")] == "None")
        ):
            df_iter.iloc[
                last_index : (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Note on the "current code"
I think there must be some copy-paste error somewhere, as the "current code" does not work at all. I replaced self.delay with a delay=2 keyword argument to indicate the max gap. I assume that is what it was supposed to be. Even with that, the logic did not work correctly with the simple example data you provided.
Long story short, I'm applying a function to multiple different time intervals and then storing the resulting arrays at different indices in an ndarray. Presently, I'm doing this with a for loop and the numpy equivalent of the enumerate function. As I understand it, this eliminates the major advantage of numpy: vectorisation. Is there a particular way my routine could be implemented that retains this advantage?
Here is my code:
Most of it is the working parts for the function psi_t
import numpy as np
from scipy import linalg as LA  # assumed: LA is scipy.linalg, which provides expm

# Number of Walks and Number of Positions
N = 100
P = 2*N +1
hopping_rate = 0.5
psi_t0 = np.zeros(P)
psi_t0[N] = 1
#creates the line upon which the particle moves
#index N is the central position

def hamiltonian(line_length, hopping_rate):
    '''
    creates the simple non time dependent hamiltonian for H = γA
    where A is the adjacency matrix
    '''
    return hopping_rate * line_adjacency_matrix(line_length)

def measurement_operator(positions,finished_quantum_state):
    '''
    Converts the finished quantum state into an array of probabilities for
    being in each position.
    Uses the measurement operator from Susan Blog
    https://susan-stepney.blogspot.com/2014/02/mathjax.html
    Improved on by this guy
    https://github.com/Driminary/python-randomwalk-project/blob/master/quantum-2D.py
    Apart from the fact that the measurement operator drops the extra dimensions of the spin space,
    which isn't present in the continuous walk.
    '''
    probabilities = np.empty(P)
    #M_hat = np.zeros((2*P,2*P,2*P))
    for k in range(P):
        posn = np.zeros(P) # values of positions to nought ..
        posn[k] = 1 #except for the value we're interested in
        #M_hat = np.kron(np.outer(posn,posn)) #perform measurement at the current pos
        M_hat = np.outer(posn,posn)
        proj = M_hat.dot(finished_quantum_state) #find the state the system is in
        probabilities[k] = proj.dot(proj.conjugate()).real #Calculate Prob of Particle being there
    return probabilities

def psi_t(initial_wave_function,positions,hopping_rate,time):
    '''
    Acts upon the initial state to give the 'position' of the quantum particle at time t. Applies the measurement operator
    to return the probability of being at any position at time t.
    '''
    psi_t = np.matmul((LA.expm(-1j*hamiltonian(positions,hopping_rate)*time)),initial_wave_function) #state after the continuous walk after time evolution
    probablities = measurement_operator(P, psi_t)
    return probablities

time_evolution = 150 #how many 'seconds' the wavefunction is evolved for
time_interval = 0.5
number_of_intervals =int(time_evolution / time_interval )
number_of_positions = P
probabilities_at_t =np.ndarray((number_of_intervals,number_of_positions)) #creates the empty ndarray ready for the probabilities at time t
array_of_times = np.linspace(0,time_evolution,number_of_intervals) #produces the individual times at which psi_t is calculated,

for idx,time in np.ndenumerate(array_of_times):
    probabilities_at_t[idx] = psi_t(psi_t0,P,hopping_rate,time) #the array probabilities_at_t is filled at index idx with the array of probabilities produced by psi_t.
    #This is the step I am trying to vectorise
The function psi_t is called in a for loop to act on each of the times in array_of_times individually. Is there a way for psi_t to act on the array array_of_times, like one can do x**2 for the array x? Can it be done in one fell swoop?
P.S Eagle Eyed Overflowers will note that within the measurement_operator there is a for loop anyway. I don't think there's a way to get rid of this however !
The question is not really reproducible because some of the functions being called are missing, but here is my vectorised implementation of measurement_operator. It assumes that finished_quantum_state has a shape of (P, ) (I am not sure that is the case, because I could not reproduce the code up to that point).
def measurement_operator_vectorized(positions, finished_quantum_state):
    M_hat = np.zeros((P, P, P))
    M_hat[np.arange(0, P), np.arange(0, P), np.arange(0, P)] = 1
    proj = np.tensordot(M_hat, finished_quantum_state, axes=((2), (0)))
    probabilities = (proj * proj.conjugate()).sum(axis=1).real
    return probabilities
Here are some benchmarks:
P = 1000
a = np.random.rand(P)
b = np.random.rand(P)
%timeit c1 = measurement_operator(a, b)
%timeit c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
print(np.allclose(c1, c2))
Gives -
1.18 s ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
308 ms ± 6.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
peak memory: 86.43 MiB, increment: 0.00 MiB
peak memory: 90.34 MiB, increment: 3.91 MiB
True
The vectorised version is faster with comparable memory usage for P~1000.
Note that for really high values of P, the memory usage will increase a lot for the vectorised version.
This isn't exactly what the OP asked for, but to vectorise the other loop, a more complete code would be helpful.
However, this benchmark is valid only if finished_quantum_state is real. For complex values the tensordot operation is very slow and inefficient (in memory) so you might actually be better off with the non-vectorized version.
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
np.allclose(c1, c2)
2.97 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.69 MiB, increment: 0.03 MiB
peak memory: 15365.38 MiB, increment: 15262.69 MiB
However, if you really want the best performance, you are better off forgetting the physics details about measurement etc temporarily and just doing
def measurement_operator_fastest(positions, finished_quantum_state):
    return (finished_quantum_state * finished_quantum_state.conjugate()).real
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%timeit -n1 -r1 c3 = measurement_operator_fastest(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
%memit c3 = measurement_operator_fastest(a, b)
print(np.allclose(c1, c2))
print(np.allclose(c1, c3))
2.87 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
16.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.70 MiB, increment: 0.00 MiB
peak memory: 15365.39 MiB, increment: 15262.69 MiB
peak memory: 102.69 MiB, increment: -0.01 MiB
True
True
By taking the inner product directly, you can make the function around 10^5 times faster (2.87 s versus 16.6 µs in the single-run timings above). Of course that assumes the measurement operator as defined.
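The outer loop over times, which is what the question originally asks about, can also be removed. The following is a minimal sketch that swaps the per-step matrix exponential for a single eigendecomposition; it assumes the Hamiltonian is time independent and real symmetric. Since line_adjacency_matrix is not shown in the question, the version here is a hypothetical stand-in (nearest-neighbour hopping on an open line), and `probabilities_over_time` is likewise a hypothetical name.

```python
import numpy as np

def line_adjacency_matrix(num_positions):
    """Hypothetical stand-in: adjacency matrix of an open line of sites."""
    off_diagonal = np.ones(num_positions - 1)
    return np.diag(off_diagonal, 1) + np.diag(off_diagonal, -1)

def probabilities_over_time(psi0, hopping_rate, times):
    """Return an array of shape (len(times), len(psi0)) of position probabilities."""
    hamiltonian = hopping_rate * line_adjacency_matrix(psi0.size)
    # H is real symmetric, so H = V diag(w) V^T and exp(-iHt) = V diag(exp(-i w t)) V^T.
    eigenvalues, eigenvectors = np.linalg.eigh(hamiltonian)
    coefficients = eigenvectors.T @ psi0                  # psi0 expressed in the eigenbasis
    phases = np.exp(-1j * np.outer(times, eigenvalues))   # one row of phase factors per time
    states = (phases * coefficients) @ eigenvectors.T     # back to the position basis
    return np.abs(states) ** 2                            # |psi|^2, as in measurement_operator_fastest

N = 100
P = 2 * N + 1
psi_t0 = np.zeros(P)
psi_t0[N] = 1
array_of_times = np.linspace(0, 150, 300)
probabilities_at_t = probabilities_over_time(psi_t0, 0.5, array_of_times)
```

Each row of probabilities_at_t corresponds to one entry of array_of_times, so the result has the same layout as the array filled by the original loop.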
I have to compute a large number of 3x3 linear transformations (eg. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit

n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag

def dot1():
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2():
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3():
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4():
    """ using sparse block diag matrix """
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a macbook pro 2012, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Apart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
Edit
(The cuda approach is the best when available. The following is comparing the non-cuda versions)
Following the various suggestions, I modified dot2, added the Op@A method, and a version based on #59356461.
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def dot2(Op, A):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in prange(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot5(Op, A):
    """ using matmul """
    return Op@A

@njit(fastmath=True, parallel=True)
def dot6(Op, A):
    """ another numba.jit with parallel (based on #59356461) """
    new = np.empty_like(A)
    for i_n in prange(A.shape[0]):
        for i_k in range(A.shape[2]):
            for i_x in range(3):
                acc = 0.0j
                for i_y in range(3):
                    acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
                new[i_n, i_x, i_k] = acc
    return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
    Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
    A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
    return Op, A

# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100,1000,10000,100000,1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)
You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp

n = 100000 # number of transformations
k = 100 # number of vectors for each transformation

# CPU version
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators

def dot5(): # the suggested, best CPU approach
    return Op@A

# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)

# run once to ignore JIT overhead before benchmarking
gOp@gA;

%timeit dot5()
%timeit gOp@gA; cp.cuda.Device().synchronize() # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use Op@A as suggested by @hpaulj in the comments.
Here is a comparison using benchit:
def dot1(A,Op):
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2(A,Op):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3(A,Op):
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4(A,Op):
    """ using sparse block diag matrix """
    n = A.shape[0]
    sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)

def dot5(A,Op):
    return Op@A

in_ = {n:[np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100,1000,10000,100000,1000000]}
They seem to be close in performance for larger scale with dot5 being slightly faster.
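For readers unfamiliar with how `@` handles stacked matrices: np.matmul treats the last two axes as matrices and broadcasts over the leading axis, so one call replaces the whole loop. The following small check, with illustrative sizes that are not the ones benchmarked above, shows that the three formulations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5
Op = rng.random((n, 3, 3))   # one 3x3 operator per item
A = rng.random((n, 3, k))    # k column vectors per item

batched = Op @ A                                      # matmul on the last two axes, batched over axis 0
looped = np.stack([o.dot(a) for o, a in zip(Op, A)])  # explicit loop of 2-D products
via_einsum = np.einsum("ijk,ikl->ijl", Op, A)         # same contraction written with einsum

print(batched.shape)                 # (1000, 3, 5)
print(np.allclose(batched, looped))  # True
```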
In one answer Nick mentioned using the GPU - which is the best solution of course.
But - as a general rule - what you're doing is likely CPU limited. Therefore (with the exception of the GPU approach), the best bang you can get is if you make use of all the cores on your machine to work in parallel.
So for that you would want to use multiprocessing (not python's multithreading!), to split the job up into pieces running on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
But if you had an 8-core machine, it would likely give you an almost 8x speed increase, as long as you're careful to avoid memory bottlenecks: don't pass many small objects between processes, but pass them all in one group at the start.
I am currently working through the beginner Codebat track. Both pieces of code work; however, is there anything fundamentally wrong/different between the two ways of writing the code below?
thanks,
def mine(myStr, x):
    myResult = myStr * x
    return myResult

def codebat(thierStr, i):
    codeResult = ''
    for i in range(i):
        codeResult += thierStr
    return codeResult
import string # string.ascii_letters = 'abcde...ABCDE...'

def mine(s, x):
    return s * x # fixed your code so it multiplies by x, not 4

def theirs(s, x): # renamed but the same as codebat
    res = ''
    for _ in range(x):
        res += s
    return res
We can see they give the same results
mine(string.ascii_letters, 10) == theirs(string.ascii_letters, 10) # --> True
We can test the time efficiency of these functions however
%timeit mine(string.ascii_letters, 1000)
2.27 µs ± 9.69 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit theirs(string.ascii_letters, 1000)
202 µs ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see mine is almost 100 times more efficient because under the hood python pre-allocates the memory needed for the new string. In theirs it has to keep reallocating memory each time the string length is increased.
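When the pieces are not all identical, so that multiplication does not apply, `''.join` is the usual way to avoid repeated reallocation. The following is a minimal sketch that times all three variants with the standard timeit module; the function names are hypothetical and the numbers will vary by machine and Python version.

```python
import string
import timeit

def repeat_multiply(s, x):
    return s * x

def repeat_concat(s, x):
    res = ''
    for _ in range(x):
        res += s
    return res

def repeat_join(s, x):
    # In CPython, str.join computes the total length first and copies each piece once.
    return ''.join([s] * x)

for func in (repeat_multiply, repeat_concat, repeat_join):
    # 1000 calls in total, so seconds * 1e3 is the mean time per call in microseconds.
    seconds = timeit.timeit(lambda: func(string.ascii_letters, 1000), number=1000)
    print(f"{func.__name__:16s} {seconds * 1e3:8.2f} µs per call")
```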