I want to calculate a rolling quantile of a large 2D matrix with dimensions (1e6, 1e5), column-wise. I am looking for the fastest way, since I need to perform this operation thousands of times and it is very computationally expensive. For the experiments below, window=1000 and q=0.1 are used.
import numpy as np
import pandas as pd
import multiprocessing as mp
from functools import partial
import numba as nb
X = np.random.random((10000,1000)) # Original array has dimensions of about (1e6, 1e5)
My current approaches:
Pandas: %timeit: 5.8 s ± 15.5 ms per loop
def pd_rolling_quantile(X, window, q):
    return pd.DataFrame(X).rolling(window).quantile(quantile=q)
Numpy Strided: %timeit: 2min 42s ± 3.29 s per loop
def strided_app(a, L, S):
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

def np_1d(x, window, q):
    return np.pad(np.percentile(strided_app(x, window, 1), q*100, axis=-1), (window-1, 0), mode='constant')

def np_rolling_quantile(X, window, q):
    results = []
    for i in np.arange(X.shape[1]):
        results.append(np_1d(X[:,i], window, q))
    return np.column_stack(results)
Multiprocessing: %timeit: 1.13 s ± 27.6 ms per loop
def mp_rolling_quantile(X, window, q):
    pool = mp.Pool(processes=12)
    results = pool.map(partial(pd_rolling_quantile, window=window, q=q), [X[:,i] for i in np.arange(X.shape[1])])
    pool.close()
    pool.join()
    return np.column_stack(results)
Numba: %timeit: 2min 28s ± 182 ms per loop
@nb.njit
def nb_1d(x, window, q):
    out = np.zeros(x.shape[0])
    for i in np.arange(x.shape[0]-window+1)+window:
        out[i-1] = np.quantile(x[i-window:i], q=q)
    return out

def nb_rolling_quantile(X, window, q):
    results = []
    for i in np.arange(X.shape[1]):
        results.append(nb_1d(X[:,i], window, q))
    return np.column_stack(results)
The timings are not great, and ideally I would target a 10-50x speed improvement. I would appreciate any suggestions on how to speed it up. Maybe someone has ideas on using lower-level languages (Cython), or other ways to speed it up with NumPy/Numba/TensorFlow-based methods. Thanks!
I would recommend the new rolling-quantiles package.
To demonstrate, even the somewhat naive approach of constructing a separate filter for each column outperforms the above single-threaded pandas experiment:
import rolling_quantiles as rq
pipes = [rq.Pipeline(rq.LowPass(window=1000, quantile=0.1)) for i in range(1000)]
%timeit [pipe.feed(X[:, i]) for i, pipe in enumerate(pipes)]
1.34 s ± 7.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
versus
df = pd.DataFrame(X)
%timeit df.rolling(1000).quantile(0.1)
5.63 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both can be trivially parallelized by means of multiprocessing, as you showed.
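For instance, a minimal sketch of that combination (assuming rolling_quantiles is installed and imported as rq; the helper name rq_column is mine, and the pipeline is built inside the worker because it carries state):

import numpy as np
import multiprocessing as mp
from functools import partial
import rolling_quantiles as rq

def rq_column(col, window, q):
    # one stateful pipeline per column, constructed in the worker
    pipe = rq.Pipeline(rq.LowPass(window=window, quantile=q))
    return pipe.feed(col)

def rq_rolling_quantile(X, window=1000, q=0.1, processes=12):
    with mp.Pool(processes=processes) as pool:
        results = pool.map(partial(rq_column, window=window, q=q),
                           [X[:, i] for i in range(X.shape[1])])
    return np.column_stack(results)

This mirrors the question's multiprocessing variant: one column per task, results reassembled with np.column_stack.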
Related
I have pure Python code here, except for just making a NumPy array. My problem is that the result I get is completely wrong when I use @jit, but when I remove it, it's good. Could anyone give me any tips on why this is?
import numpy as np
from numba import jit

@jit
def grayFun(image: np.array) -> np.array:
    gray_image = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            gray = gray_image[i][j][0]*0.21 + gray_image[i][j][1]*0.72 + gray_image[i][j][2]*0.07
            gray_image[i][j] = (gray,gray,gray)
    gray_image = gray_image.astype("uint8")
    return gray_image
This will return a grayscale image with your conversion formula. USUALLY, you do not need to duplicate the columns; a grayscale image with shape (X,Y) can be used just like an image with shape (X,Y,3).
def gray(image):
    return image[:,:,0]*0.21 + image[:,:,1]*0.72 + image[:,:,2]*0.07
This should work just fine with numba. @TimRoberts's answer is definitely fast, so you may just want to go with that implementation. But the biggest win is simply from vectorization. I'm sure others could find additional performance tweaks, but at this point I think we've whittled down most of the runtime and issues:
import numpy as np
import numba

# your implementation, but fixed so that `gray` is calculated from `image`
def grayFun(image: np.array) -> np.array:
    gray_image = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            gray = image[i][j][0]*0.21 + image[i][j][1]*0.72 + image[i][j][2]*0.07
            gray_image[i][j] = (gray, gray, gray)
    gray_image = gray_image.astype("uint8")
    return gray_image

# a vectorized numpy version of your implementation
def grayQuick(image: np.array) -> np.array:
    return np.tile(
        np.expand_dims(
            (image[:, :, 0]*0.21 + image[:, :, 1]*0.72 + image[:, :, 2]*0.07), -1
        ),
        (1, 1, 3)
    ).astype(np.uint8)

# a parallelized implementation in numba (prange requires parallel=True)
@numba.njit(parallel=True)
def gray_numba(image: np.array) -> np.array:
    out = np.empty_like(image)
    for i in numba.prange(image.shape[0]):
        for j in numba.prange(image.shape[1]):
            gray = np.uint8(image[i, j, 0]*0.21 + image[i, j, 1]*0.72 + image[i, j, 2]*0.07)
            out[i, j, :] = gray
    return out

# a 2D solution leveraging @TimRoberts's speedup
def gray_2D(image):
    return image[:,:,0]*0.21 + image[:,:,1]*0.72 + image[:,:,2]*0.07
I loaded a reasonably large image:
In [69]: img = matplotlib.image.imread(os.path.expanduser(
...: "~/Desktop/Screen Shot.png"
...: ))
...: image = (img[:, :, :3] * 256).astype('uint8')
...:
In [70]: image.shape
Out[70]: (1964, 3024, 3)
Now, running these four reveals a further speedup from numba over the vectorized NumPy version, while the fastest is the 2D solution:
In [71]: %%timeit
...: grey = grayFun(image) # watch out - this takes ~21 minutes
...:
...:
2min 56s ± 1min 58s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: %%timeit
...: grey_np = grayQuick(image)
...:
...:
556 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [73]: %%timeit
...: grey = gray_numba(image)
...:
...:
246 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [74]: %%timeit
...: grey = gray_2D(image)
...:
...:
117 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that numba will be noticeably slower on the first iteration, so the vectorized numpy solutions will significantly outperform numba if you're only doing this once. But if you're going to call the function repeatedly within the same python session numba is a good option. You could of course use numba for the 2D result to get a further speedup - I'm not sure if this would outperform numpy though.
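For what it's worth, a minimal sketch of that last idea (gray_2D_numba is a hypothetical name; whether it beats the plain NumPy 2D version depends on your hardware):

import numpy as np
import numba

@numba.njit(parallel=True)
def gray_2D_numba(image):
    # 2D output of shape (X, Y), like gray_2D, but compiled and parallelized over rows
    out = np.empty((image.shape[0], image.shape[1]), dtype=np.uint8)
    for i in numba.prange(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.uint8(image[i, j, 0]*0.21 + image[i, j, 1]*0.72 + image[i, j, 2]*0.07)
    return out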
Toy example
I have two arrays which have different shapes, for example:
import numpy as np
matrix = np.arange(5*6*7*8).reshape(5, 6, 7, 8)
vector = np.arange(1, 20, 2)
What I want to do is to multiply each element of the matrix by one of the elements of 'vector' and then take the sum over the last two axes. For that, I have an array with the same shape as 'matrix' that tells me which index to use, for example:
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
I know that one of the solution would be to do:
matVec = vector[Idx]
res = np.sum(matrix*matVec, axis=(2, 3))
or even:
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
Problem
However, my problem is that my arrays are big, and the construction of matVec costs both time and memory. So is there a way to bypass that and still achieve the same result?
More realistic example
Here is a more realistic example of what I'm actually doing:
import numpy as np
order = 20
dim = 23
listOrder = np.arange(-order, order+1, 1)
N, P = np.meshgrid(listOrder, listOrder)
K = np.arange(-2*dim+1, 2*dim+1, 1)
X = np.arange(-2*dim, 2*dim, 1)
tN = np.einsum('..., p, x -> ...px', N, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tP = np.einsum('..., p, x -> ...px', P, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tK = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), K, np.ones(X.shape, dtype=int))#, optimize=pathInt)
tX = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), np.ones(K.shape, dtype=int), X)#, optimize=pathInt)
tL = tK + tX
mini, maxi = -4*dim+1, 4*dim-1
NmPp2L = np.arange(2*mini-2*order, 2*maxi+2*order+1)
Idx = (2*tL+tN-tP) - NmPp2L[0]
np.random.seed(0)
matrix = (np.random.rand(Idx.size) + 1j*np.random.rand(Idx.size)).reshape(Idx.shape)
vector = np.random.rand(np.max(Idx)+1) + 1j*np.random.rand(np.max(Idx)+1)
res = np.sum(matrix*vector[Idx], axis=(2, 3))
For larger data arrays
import numpy as np
matrix = np.arange(50*60*70*80).reshape(50, 60, 70, 80)
vector = np.arange(1, 50, 2)
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
Parallel Numba speeds up the computation and avoids creating matVec. On a 4-core Intel Xeon Platinum 8259CL:
matVec = vector[Idx]
# %timeit 48.4 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
# %timeit 26.9 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
against an unoptimized numba implementation
import numba as nb

@nb.njit(parallel=True)
def func(matrix, idx, vector):
    res = np.zeros((matrix.shape[0], matrix.shape[1]), dtype=matrix.dtype)
    for i in nb.prange(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            for k in range(matrix.shape[2]):
                for l in range(matrix.shape[3]):
                    res[i, j] += matrix[i, j, k, l] * vector[idx[i, j, k, l]]
    return res

func(matrix, Idx, vector)  # compile run
func(matrix, Idx, vector)
# %timeit 21.7 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# (48.4 + 26.9) / 21.7 = ~3.47x speed up

np.testing.assert_equal(func(matrix, Idx, vector), np.einsum('ijkl, ijkl -> ij', matrix, matVec))
Performance and further optimizations
The above Numba code should be memory-bound when dealing with complex numbers. Indeed, matrix and Idx are big and must be read entirely from the relatively slow RAM. matrix has a size of 41*41*92*92*8*2 = 217.10 MiB and Idx a size of either 41*41*92*92*8 = 108.55 MiB or 41*41*92*92*4 = 54.28 MiB depending on the target platform (it should be of type int32 on most x86-64 Windows platforms and int64 on most x86-64 Linux platforms). This is also why vector[Idx] was slow: NumPy needs to write a big array to memory (and writing data can be about twice as slow as reading it on x86-64 platforms in this case).
Assuming the code is memory-bound, the only way to speed it up is to reduce the amount of data read from the RAM. This can be achieved by storing Idx in a uint16 array instead of the default np.int_ datatype (which is 2-4x bigger). This is possible since vector.size is small. On Linux with an i5-9600KF processor and 38-40 GiB/s RAM, this optimization resulted in a ~29% speed-up while the code was still mainly memory-bound. The implementation is nearly optimal on this platform.
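A minimal sketch of that dtype change, assuming the arrays from the realistic example above (Idx16 is my name; uint16 is only safe while vector.size <= 65536):

import numpy as np

# shrink the index array so the kernel reads 2-4x less index data from RAM
assert vector.size <= np.iinfo(np.uint16).max + 1  # every index must fit in uint16
Idx16 = Idx.astype(np.uint16)
res16 = func(matrix, Idx16, vector)  # same result as before, less memory traffic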
I have an Nx3 array of N points, each with X, Y and Z coordinates. I need to rotate each point, and I have the 3x3 rotation matrix. To do this, I need the dot product of the rotation matrix and each point. The problem is that the array of points is quite massive (~1_000_000x3), so it takes a noticeable amount of time to compute the rotated points.
The only solution I have come up with so far is a simple for loop iterating over the array of points. Here is an example with a piece of the data:
# Given set of points Nx3: np.ndarray
points = np.array([
[285.679, 219.75, 524.733],
[285.924, 219.404, 524.812],
[285.116, 219.217, 524.813],
[285.839, 219.557, 524.842],
[285.173, 219.507, 524.606]
])
points_t = points.T
# Array to store results
points_rotated = np.empty(points_t.shape)
# Given rotation matrix 3x3: np.ndarray
rot_matrix = np.array([
    [0.57423549, 0.81862056, -0.01067613],
    [-0.81866133, 0.57405696, -0.01588193],
    [-0.00687256, 0.0178601, 0.99981688]
])
for i in range(points.shape[0]):
    points_rotated[:, i] = np.dot(rot_matrix, points_t[:, i])
# Result
points_rotated.T
# [[ 338.33677093 -116.05910163 526.59831864]
# [ 338.1933725 -116.45955203 526.6694408 ]
# [ 337.5762975 -115.90543822 526.67265381]
# [ 338.26949115 -116.30261156 526.70275207]
# [ 337.84863885 -115.78233784 526.47047941]]
I don't feel confident using numpy as I'm quite new to it, so I'm sure there is at least a more elegant way to do this. I've heard about np.einsum(), but I don't understand yet how to implement it here, and I'm not sure it would make things quicker. The main problem is still the computation time, so I want to know how to make this work faster.
Thank you very much in advance!
You can write the code using numba parallelization in nopython mode as:

import numpy as np
import numba as nb

@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def dot(a, b):  # dot(points, rot_matrix)
    dot_ = np.zeros((a.shape[0], b.shape[0]))
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            dot_[i, j] = a[i, 0] * b[j, 0] + a[i, 1] * b[j, 1] + a[i, 2] * b[j, 2]
    return dot_

It should be faster than ko3's answer due to the parallelization, the explicit signature, and writing the algebra out instead of calling np.dot. I have tested similar code (applied just to the upper triangle of an array) instead of a dot product in another answer, and it was at least 2 times faster as far as I can remember.
I have tried some code to see the performance of each solution (on my system) on a 1,000,000 x 3 matrix:
Using np.matmul
Using numba njit with the @ matrix-multiply operator
Using numba njit with your code
Here are the results of the three functions.
import numpy as np
from numba import njit

points = np.random.rand(1000000, 3)
rot_matrix = np.array([
    [0.57423549, 0.81862056, -0.01067613],
    [-0.81866133, 0.57405696, -0.01588193],
    [-0.00687256, 0.0178601, 0.99981688]
])

# Using np.matmul
def function_matmul(points, rot_matrix):
    return np.matmul(rot_matrix, points.T).T

# Using numba njit with the @ matrix-multiply operator
@njit
def function_numba_dot(points, rot_matrix):
    return (rot_matrix @ points.T).T

# Using numba njit with your code
@njit
def function_ori_code(points, rot_matrix):
    points_t = points.T
    points_rotated = np.empty(points_t.shape)
    for i in range(points.shape[0]):
        points_rotated[:, i] = np.dot(rot_matrix, points_t[:, i])
    return points_rotated.T
%timeit function_matmul(points, rot_matrix)
16.5 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit function_numba_dot(points, rot_matrix)
16.3 ms ± 69.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit function_ori_code(points, rot_matrix)
260 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
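For completeness, the np.einsum() route mentioned in the question is a one-liner too; a minimal sketch that computes the same rotation (its performance is typically comparable to np.matmul rather than faster):

import numpy as np

# 'ij,nj->ni': apply rot_matrix (i,j) to every point row (n,j), giving rotated rows (n,i)
points_rotated = np.einsum('ij,nj->ni', rot_matrix, points)

# identical to (rot_matrix @ points.T).T up to floating-point noise
np.testing.assert_allclose(points_rotated, (rot_matrix @ points.T).T)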
My goal is to convert a list of pixels from RGB to hex as quickly as possible. The format is a NumPy dimensional array (RGB colorspace), and ideally it would be converted from RGB to hex while maintaining its shape.
My attempt at doing this uses list comprehension, and with the exception of performance, it solves the problem. Performance-wise, adding the ravel and list comprehension really slows this down. Unfortunately I just don't know enough math to know how to speed this up:
Edited: Updated code to reflect the most recent changes. Currently running around 24 ms on a 35,000-pixel image.
def np_array_to_hex(array):
    array = np.asarray(array, dtype='uint32')
    array = (1 << 24) + ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
    return [hex(x)[-6:] for x in array.ravel()]
>>> np_array_to_hex(img)
['afb3bc', 'abaeb5', 'b3b4b9', ..., '8b9dab', '92a4b2', '9caebc']
I tried it with a LUT ("Look Up Table") - it takes a few seconds to initialise and it uses 100MB (0.1GB) of RAM, but that's a small price to pay amortised over a million images:
#!/usr/bin/env python3
import numpy as np

def np_array_to_hex1(array):
    array = np.asarray(array, dtype='uint32')
    array = ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
    return array

def np_array_to_hex2(array):
    array = np.asarray(array, dtype='uint32')
    array = (1 << 24) + ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
    return [hex(x)[-6:] for x in array.ravel()]

def me(array, LUT):
    h, w, d = array.shape
    # Reshape to a color vector
    z = np.reshape(array, (-1, 3))
    # Make array and fill with 32-bit colour number
    y = np.zeros((h*w), dtype=np.uint32)
    y = z[:, 0]*65536 + z[:, 1]*256 + z[:, 2]
    return LUT[y]

# Define dummy image of 35,000 RGB pixels
w, h = 175, 200
im = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)

# %timeit np_array_to_hex1(im)
# 112 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# %timeit np_array_to_hex2(im)
# 8.42 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# This may take time to set up, but amortize that over a million images...
LUT = np.zeros((256*256*256), dtype='a6')
for i in range(256*256*256):
    h = hex(i)[2:].zfill(6)
    LUT[i] = h

# %timeit me(im,LUT)
# 499 µs ± 8.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So that appears to be 4x slower than your fastest version (which doesn't work), and 17x faster than your slowest version (which does work).
My next suggestion is to use multi-threading or multi-processing so all your CPU cores get busy in parallel and reduce your overall time by a factor of 4 or more assuming you have a reasonably modern 4+ core CPU.
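A minimal sketch of that idea (the helper names are mine; on fork-based platforms such as Linux the workers inherit LUT and im from the parent rather than receiving a 100MB pickled copy):

import numpy as np
from multiprocessing import Pool

def convert_band(band):
    # look up one horizontal band of the image in the global LUT
    z = band.reshape(-1, 3).astype(np.uint32)
    return LUT[z[:, 0]*65536 + z[:, 1]*256 + z[:, 2]]

if __name__ == '__main__':
    bands = np.array_split(im, 4, axis=0)  # one band per core, assuming 4 cores
    with Pool(4) as pool:
        parts = pool.map(convert_band, bands)
    result = np.concatenate(parts)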
The following data represent 2 given histograms split into 13 bins:
key 0 1-9 10-18 19-27 28-36 37-45 46-54 55-63 64-72 73-81 82-90 91-99 100
A 1.274580708 2.466224824 5.045757621 7.413716262 8.958855646 10.41325305 11.14150951 10.91949012 11.29095648 10.95054297 10.10976255 8.128781795 1.886568472
B 0 1.700493692 4.059243006 5.320899616 6.747120132 7.899067471 9.434997257 11.24520022 12.94569391 12.83598464 12.6165661 10.80636314 4.388370817
I'm trying to follow this article in order to calculate the intersection between those 2 histograms, using this method:
def histogram_intersection(h1, h2, bins):
    bins = numpy.diff(bins)
    sm = 0
    for i in range(len(bins)):
        sm += min(bins[i]*h1[i], bins[i]*h2[i])
    return sm
Since my data is already calculated as a histogram, I can't use the NumPy built-in function, so I'm failing to provide the function with the necessary data.
How can I process my data to fit the algorithm?
Since you have the same number of bins for both of the histograms you can use:
def histogram_intersection(h1, h2):
    sm = 0
    for i in range(13):
        sm += min(h1[i], h2[i])
    return sm
You can calculate it faster and more simply with Numpy:
#!/usr/bin/env python3
import numpy as np
A = np.array([1.274580708,2.466224824,5.045757621,7.413716262,8.958855646,10.41325305,11.14150951,10.91949012,11.29095648,10.95054297,10.10976255,8.128781795,1.886568472])
B = np.array([0,1.700493692,4.059243006,5.320899616,6.747120132,7.899067471,9.434997257,11.24520022,12.94569391,12.83598464,12.6165661,10.80636314,4.388370817])
def histogram_intersection(h1, h2):
    sm = 0
    for i in range(13):
        sm += min(h1[i], h2[i])
    return sm
print(histogram_intersection(A,B))
print(np.sum(np.minimum(A,B)))
Output
88.44792356099998
88.447923561
But if you time it, Numpy only takes 60% of the time:
%timeit histogram_intersection(A,B)
5.02 µs ± 65.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.sum(np.minimum(A,B))
3.22 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Some caveats first: in your data, bins are ranges, while in the algorithm they are numbers, so you must redefine the bins for that.
Furthermore, min(bins[i]*h1[i], bins[i]*h2[i]) is bins[i]*min(h1[i], h2[i]), so the result can be obtained by:

import numpy as np
import pandas as pd

hists = pd.read_clipboard(index_col=0)  # your data
bins = np.arange(-4, 112, 9)            # a try at numeric bins, but the edges are different here
mins = hists.min(axis='rows')
intersection = np.dot(mins, bins)
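Putting that together as a self-contained sketch, with one hypothetical choice of edges for the ranges in the question (and using min(w*h1, w*h2) = w*min(h1, h2) so NumPy does the loop):

import numpy as np

# hypothetical edges for the 13 bins: 0, 1-9, 10-18, ..., 91-99, 100
edges = np.array([0, 1, 10, 19, 28, 37, 46, 55, 64, 73, 82, 91, 100, 101])
widths = np.diff(edges)  # one width per bin

A = np.array([1.274580708, 2.466224824, 5.045757621, 7.413716262, 8.958855646,
              10.41325305, 11.14150951, 10.91949012, 11.29095648, 10.95054297,
              10.10976255, 8.128781795, 1.886568472])
B = np.array([0, 1.700493692, 4.059243006, 5.320899616, 6.747120132, 7.899067471,
              9.434997257, 11.24520022, 12.94569391, 12.83598464, 12.6165661,
              10.80636314, 4.388370817])

# width-weighted intersection, equivalent to the article's loop
print(np.dot(widths, np.minimum(A, B)))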