I've got the following situation.
I have a list of i coordinates (x, y, z) and have to compute all triples inside a cutoff sphere, such that r_ij and r_ik are smaller than a cutoff value.
Therefore I am computing a matrix r_ij that contains all distances.
To compute the triples my idea is, to construct a r_ijk matrix.
I've done this with a loop over the number of elements i as
import tensorflow as tf
r_ij = tf.reshape(tf.range(4*4), (4, 4))
r_ijk = []
for i in range(len(x)):
r_ijk.append(tf.roll(r_ij, shift=-i, axis=1))
tf.stack(r_ijk)
I want to improve this code because of two issues.
Primarly because I assume, that it could be fully vectorized.
But also to use this in my model, I need to alter it:
#tf.function
def get_triplets(full_r_ij, r_cut):
r_ij = tf.norm(full_r_ij, axis=-1) # Shape of full_r_ij is (n_timesteps, n_atoms, n_atoms, 3)
n_atoms = tf.shape(r_ij)[1]
r_ijk = r_ij[None]
for atom in range(1, n_atoms):
tf.autograph.experimental.set_loop_options(
shape_invariants=[(r_ijk, tf.TensorShape([None, None, None, None]))]
)
tmp = tf.roll(r_ij, shift=-atom, axis=2)
r_ijk = tf.concat([r_ijk, tmp[None]], axis=0) # shape is (n_atoms, n_timesteps, n_atoms, n_atoms)
r_ijk = tf.transpose(r_ijk, perm=(1, 0, 2, 3))
r_ijk = tf.where(r_ijk == 0, tf.ones_like(r_ijk) * r_cut, r_ijk)
intermediate_indices = tf.where(
tf.math.logical_and(r_ijk[:, 0, None] == 3.0, r_ijk[:, 1:] == 3.0)
)
n_atoms = tf.cast(n_atoms, dtype=tf.int64)
t, n, i, j = tf.unstack(intermediate_indices, axis=1)
k = j + n + 1
k = tf.where(k >= n_atoms, k - n_atoms, k)
triples = tf.stack([t, i, j, k], axis=1)
return triples
and use tf.autograph.experimental.set_loop_options because I am kind of looping over the r_ij tensor.
Is there a way to improve the first code example (or the second as well)?
I tested two further implementations using tf.vectorized_mad and tf.map_fn which both performed worse than the initial function I wrote. All tests were performed with r_ij = tf.random.normal((32, 150, 150))
#tf.function
def roll_loop(r_ij, n_atoms):
r_ijk = r_ij[None]
for atom in range(1, n_atoms):
tf.autograph.experimental.set_loop_options(
shape_invariants=[(r_ijk, tf.TensorShape([None, None, None, None]))]
)
tmp = tf.roll(r_ij, shift=-atom, axis=2)
r_ijk = tf.concat([r_ijk, tmp[None]], axis=0) # shape is (n_atoms, n_timesteps, n_atoms, n_atoms)
return r_ijk
It took 129 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#tf.function
def roll_vect(r_ij, n_atoms):
r_ijk = tf.repeat(r_ij[None], repeats=n_atoms, axis=0)
def roll(args):
x, shift = args
return tf.roll(x, shift=shift, axis=2)
return tf.vectorized_map(roll, [r_ijk, tf.range(n_atoms)])
It took 225 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#tf.function
def roll_map(r_ij, n_atoms):
r_ijk = tf.repeat(r_ij[None], repeats=n_atoms, axis=0)
def roll(args):
x, shift = args
return tf.roll(x, shift=shift, axis=2)
return tf.map_fn(roll, (r_ijk, tf.range(n_atoms)), fn_output_signature=tf.float32)
It took 327 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it seems like going for tf.function with python for loop is fastest (so far). All functions were compiled before testing.
EDIT:
Using tf.TensorArray seems to be the best way for this task.
I tested it with a few different inputs and it performs as good or even a little better, than tf.autograph.experimental.set_loop_options
#tf.function
def roll_loop(r_ij, n_atoms):
r_ijk = tf.TensorArray(tf.float32, size=n_atoms)
for atom in range(0, n_atoms):
tmp = tf.roll(r_ij, shift=-atom, axis=2)
r_ijk = r_ijk.write(atom, tmp)
return r_ijk.stack()
It took 128 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Related
I currently have sixteen images (A,B,C,D,E,F,G,...) which must be concatenated into one as part of a Tensorflow Dataset workflow. Each image is 128 x 128 and has the shape of (128, 128, 3). The final output should be a 512 x 512 image of shape (512,512,3). All of the images come from an image sequence, known as img_seq. This img_seq has the shape of (None, 128, 128, 3)
Right now, this is accomplished through the following code:
#tf.function
def glue_to_one(imgs_seq):
first_row= tf.concat((imgs_seq[0], imgs_seq[1],imgs_seq[2],imgs_seq[3]), 0)
second_row = tf.concat((imgs_seq[4], imgs_seq[5], imgs_seq[6], imgs_seq[7]), 0)
third_row = tf.concat((imgs_seq[8], imgs_seq[9], imgs_seq[10], imgs_seq[11]), 0)
fourth_row = tf.concat((imgs_seq[12], imgs_seq[13], imgs_seq[14], imgs_seq[15]), 0)
img_glue = tf.stack((first_row, second_row, third_row, fourth_row), axis=1)
img_glue = tf.reshape(img_glue, [512,512,3])
return img_glue
It is suspected that this method is inefficient and is learning to a bottleneck.
A different approach would be to allocate a 512 x 512 tensor and then replace the elements. Would this be more efficient? How would it be done? Can you please recommend a better approach?
Simply use tf.split method instead of writing that much code...,
**Your Inputs seems to be a list of inputs**
def stack_and_concat(x):
t = tf.split(x , 16 , axis=0)
t = tf.reshape(tf.stack([tf.concat(t[(i*4): 4 * (i+1)] , axis=1) for i in range(4)],axis=2) , (512,512,3))
return t
concat_inputs(x).shape
TensorShape([512, 512, 3])
for thousand iterations my method took 3.28 secs but your's took 10.35 secs
You can improve it about 3 times using something like this:
def glue_answer(imgs_seq):
image = tf.reshape(imgs_seq, (4, 4, 128, 128, 3))
image = tf.concat(image, axis=1)
image = tf.concat(image, axis=1)
return image
I tested the performance as follows:
def glue_to_one(imgs_seq):
first_row= tf.concat((imgs_seq[0], imgs_seq[1],imgs_seq[2],imgs_seq[3]), 0)
second_row = tf.concat((imgs_seq[4], imgs_seq[5], imgs_seq[6], imgs_seq[7]), 0)
third_row = tf.concat((imgs_seq[8], imgs_seq[9], imgs_seq[10], imgs_seq[11]), 0)
fourth_row = tf.concat((imgs_seq[12], imgs_seq[13], imgs_seq[14], imgs_seq[15]), 0)
img_glue = tf.stack((first_row, second_row, third_row, fourth_row), axis=1)
img_glue = tf.reshape(img_glue, [512,512,3])
return img_glue
def glue_answer(imgs_seq):
image = tf.reshape(imgs_seq, (4, 4, 128, 128, 3))
image = tf.concat(image, axis=1)
image = tf.concat(image, axis=1)
return image
print("Method in question:")
%timeit -n 1000 -r 10 glue_to_one(imgs_seq)
print("Method in answe:")
%timeit -n 1000 -r 10 glue_answer(imgs_seq)
Output:
Method in question:
1.7 ms ± 212 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
Method in answe:
540 µs ± 28.8 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
tf.squeeze(tf.concat(tf.split(tf.concat(imgs_seq, axis=0),4),axis=1))
Testing:
imgs_seq = [ tf.random.normal(shape=(128,128,3)) for _ in range(16)]
out1 = glue_to_one(imgs_seq)
out2 = tf.squeeze(tf.concat(tf.split(tf.concat(imgs_seq, axis=0),4),axis=1))
#check whether both outputs are same.
np.testing.assert_allclose(out1.numpy(), out2.numpy())
%timeit glue_to_one(imgs_seq)
1.52 ms ± 123 µs per loop
%timeit tf.squeeze(tf.concat(tf.split(tf.concat(imgs_seq,axis=0),4),axis=1))
308 µs ± 13.6 µs per loop
I have here pure python code, except just making a NumPy array. My problem here is that the result I get is completely wrong when I use #jit, but when I remove it its good. Could anyone give me any tips on why this is?
#jit
def grayFun(image: np.array) -> np.array:
gray_image = np.empty_like(image)
for i in range(image.shape[0]):
for j in range(image.shape[1]):
gray = gray_image[i][j][0]*0.21 + gray_image[i][j][1]*0.72 + gray_image[i][j][2]*0.07
gray_image[i][j] = (gray,gray,gray)
gray_image = gray_image.astype("uint8")
return gray_image
This will return a grayscale image with your conversion formula. USUALLY, you do not need to duplicate the columns; a grayscale image with shape (X,Y) can be used just like an image with shape (X,Y,3).
def gray(image):
return image[:,:,0]*0.21+image[:,:,1]*0.72 + image[:,:,2]*0.07
This should work just fine with numba. #TimRobert's answer is definitely fast, so you may just want to go with that implementation. But the biggest win is simply from vectorization. I'm sure others could find additional performance tweaks but at this point I think we've whittled down most of the runtime & issues:
# your implementation, but fixed so that `gray` is calculated from `image`
def grayFun(image: np.array) -> np.array:
gray_image = np.empty_like(image)
for i in range(image.shape[0]):
for j in range(image.shape[1]):
gray = image[i][j][0]*0.21 + image[i][j][1]*0.72 + image[i][j][2]*0.07
gray_image[i][j] = (gray,gray,gray)
gray_image = gray_image.astype("uint8")
return gray_image
# a vectorized numpy version of your implementation
def grayQuick(image: np.array) -> np.array:
return np.tile(
np.expand_dims(
(image[:, :, 0]*0.21 + image[:, :, 1]*0.72 + image[:, :, 2]*0.07), -1
),
(1,1, 3)
).astype(np.uint8)
# a parallelized implementation in numba
#numba.jit
def gray_numba(image: np.array) -> np.array:
out = np.empty_like(image)
for i in numba.prange(image.shape[0]):
for j in numba.prange(image.shape[1]):
gray = np.uint8(image[i, j, 0]*0.21 + image[i, j, 1]*0.72 + image[i, j, 2]*0.07)
out[i, j, :] = gray
return out
# a 2D solution leveraging #TimRoberts's speedup
def gray_2D(image):
return image[:,:,0]*0.21+image[:,:,1]*0.72 + image[:,:,2]*0.07
I loaded a reasonably large image:
In [69]: img = matplotlib.image.imread(os.path.expanduser(
...: "~/Desktop/Screen Shot.png"
...: ))
...: image = (img[:, :, :3] * 256).astype('uint8')
...:
In [70]: image.shape
Out[70]: (1964, 3024, 3)
Now, running these three reveals a slight speedup from numba, while the fastest is the 2D solution:
In [71]: %%timeit
...: grey = grayFun(image) # watch out - this takes ~21 minutes
...:
...:
2min 56s ± 1min 58s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: %%timeit
...: grey_np = grayQuick(image)
...:
...:
556 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [73]: %%timeit
...: grey = gray_numba(image)
...:
...:
246 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [74]: %%timeit
...: grey = gray_2D(image)
...:
...:
117 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that numba will be noticeably slower on the first iteration, so the vectorized numpy solutions will significantly outperform numba if you're only doing this once. But if you're going to call the function repeatedly within the same python session numba is a good option. You could of course use numba for the 2D result to get a further speedup - I'm not sure if this would outperform numpy though.
Toy example
I have two arrays, which have different shape, for example:
import numpy as np
matrix = np.arange(5*6*7*8).reshape(5, 6, 7, 8)
vector = np.arange(1, 20, 2)
What I want to do is to multiply each element of the matrix by one of the element of 'vector' and then do the sum over the last two axis. For that, I have an array with the same shape as 'matrix' that tells me which index to use, for example:
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
I know that one of the solution would be to do:
matVec = vector[Idx]
res = np.sum(matrix*matVec, axis=(2, 3))
or even:
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
Problem
However, my problems is that my arrays are big and the construction of matVec takes both times and memory. So is there a way to bypass that and still achieve the same result ?
More realistic example
Here is a more realistic example of what I'm actually doing:
import numpy as np
order = 20
dim = 23
listOrder = np.arange(-order, order+1, 1)
N, P = np.meshgrid(listOrder, listOrder)
K = np.arange(-2*dim+1, 2*dim+1, 1)
X = np.arange(-2*dim, 2*dim, 1)
tN = np.einsum('..., p, x -> ...px', N, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tP = np.einsum('..., p, x -> ...px', P, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tK = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), K, np.ones(X.shape, dtype=int))#, optimize=pathInt)
tX = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), np.ones(K.shape, dtype=int), X)#, optimize=pathInt)
tL = tK + tX
mini, maxi = -4*dim+1, 4*dim-1
NmPp2L = np.arange(2*mini-2*order, 2*maxi+2*order+1)
Idx = (2*tL+tN-tP) - NmPp2L[0]
np.random.seed(0)
matrix = (np.random.rand(Idx.size) + 1j*np.random.rand(Idx.size)).reshape(Idx.shape)
vector = np.random.rand(np.max(Idx)+1) + 1j*np.random.rand(np.max(Idx)+1)
res = np.sum(matrix*vector[Idx], axis=(2, 3))
For larger data arrays
import numpy as np
matrix = np.arange(50*60*70*80).reshape(50, 60, 70, 80)
vector = np.arange(1, 50, 2)
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
parallel numba speeds up the computation and avoids creating matVec.
On a 4-core Intel Xeon Platinum 8259CL
matVec = vector[Idx]
# %timeit 48.4 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
# %timeit 26.9 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
against an unoptimized numba implementation
import numba as nb
#nb.njit(parallel=True)
def func(matrix, idx, vector):
res = np.zeros((matrix.shape[0],matrix.shape[1]), dtype=matrix.dtype)
for i in nb.prange(matrix.shape[0]):
for j in range(matrix.shape[1]):
for k in range(matrix.shape[2]):
for l in range(matrix.shape[3]):
res[i,j] += matrix[i,j,k,l] * vector[idx[i,j,k,l]]
return res
func(matrix, Idx, vector) # compile run
func(matrix, Idx, vector)
# %timeit 21.7 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# (48.4 + 26.9) / 21.7 = ~3.47x speed up
np.testing.assert_equal(func(matrix, Idx, vector), np.einsum('ijkl, ijkl -> ij', matrix, matVec))
Performance and further optimizations
The above Numba code should be memory-bound when dealing with complex numbers. Indeed, matrix and Idx are big and must be completely read from the relatively-slow RAM. matrix has a size of 41*41*92*92*8*2 = 217.10 MiB and Idx a size of either 41*41*92*92*8 = 108.55 MiB or 41*41*92*92*4 = 54.28 MiB regarding the target platform (it should be of type int32 on most x86-64 Windows platforms and int64 on most Linux x86-64 platforms). This is also why vector[Idx] was slow: Numpy needs to write a big array in memory (not to mention writing data should be twice slower than reading it on x86-64 platforms in this case).
Assuming the code is memory bound, the only way to speed it up is to reduce the amount of data read from the RAM. This can be achieve by storing Idx in a uint16 array instead of the default np.int_ datatype (2~4 bigger). This is possible since vector.size is small. On a Linux with a i5-9600KF processor and a 38-40 GiB/s RAM, this optimization resulted in a ~29% speed up while the code is still mainly memory bound. The implementation is nearly optimal on this platform.
My goal is to convert a list of pixels from RGB to Hex as quickly as possible. The format is a Numpy dimensional array (rgb colorspace) and ideally it would be converted from RGB to Hex and maintain it's shape.
My attempt at doing this uses list comprehension and with the exception of performance, it solves it. Performance wise, adding the ravel and list comprehension really slows this down. Unfortunately I just don't know enough math to know the solution of how to to speed this up:
Edited: Updated code to reflex most recent changes. Current running around 24ms on 35,000 pixel image.
def np_array_to_hex(array):
array = np.asarray(array, dtype='uint32')
array = (1 << 24) + ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
return [hex(x)[-6:] for x in array.ravel()]
>>> np_array_to_hex(img)
['afb3bc', 'abaeb5', 'b3b4b9', ..., '8b9dab', '92a4b2', '9caebc']
I tried it with a LUT ("Look Up Table") - it takes a few seconds to initialise and it uses 100MB (0.1GB) of RAM, but that's a small price to pay amortised over a million images:
#!/usr/bin/env python3
import numpy as np
def np_array_to_hex1(array):
array = np.asarray(array, dtype='uint32')
array = ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
return array
def np_array_to_hex2(array):
array = np.asarray(array, dtype='uint32')
array = (1 << 24) + ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
return [hex(x)[-6:] for x in array.ravel()]
def me(array, LUT):
h, w, d = array.shape
# Reshape to a color vector
z = np.reshape(array,(-1,3))
# Make array and fill with 32-bit colour number
y = np.zeros((h*w),dtype=np.uint32)
y = z[:,0]*65536 + z[:,1]*256 + z[:,2]
return LUT[y]
# Define dummy image of 35,000 RGB pixels
w,h = 175, 200
im = np.random.randint(0,256,(h,w,3),dtype=np.uint8)
# %timeit np_array_to_hex1(im)
# 112 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit np_array_to_hex2(im)
# 8.42 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# This may take time to set up, but amortize that over a million images...
LUT = np.zeros((256*256*256),dtype='a6')
for i in range(256*256*256):
h = hex(i)[2:].zfill(6)
LUT[i] = h
# %timeit me(im,LUT)
# 499 µs ± 8.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So that appears to be 4x slower than your fastest which doesn't work, and 17x faster that your slowest which does work.
My next suggestion is to use multi-threading or multi-processing so all your CPU cores get busy in parallel and reduce your overall time by a factor of 4 or more assuming you have a reasonably modern 4+ core CPU.
Is there a way to calculate many histograms along an axis of an nD-array? The method I currently have uses a for loop to iterate over all other axes and calculate a numpy.histogram() for each resulting 1D array:
import numpy
import itertools
data = numpy.random.rand(4, 5, 6)
# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
numpy.arange(4), numpy.arange(5), [slice(None)]
]
# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
out[idx] = numpy.histogram(
data[idx],
bins=2 * 100000 + 1,
range=(-100000 - 0.5, 100000 + 0.5),
)[0]
out.shape # (4, 5, 200001)
Needless to say this is very slow, however I couldn't find a way to solve this using numpy.histogram, numpy.histogram2d or numpy.histogramdd.
Here's a vectorized approach making use of the efficient tools np.searchsorted and np.bincount. searchsorted gives us the loactions where each element is to be placed based on the bins and bincount does the counting for us.
Implementation -
def hist_laxis(data, n_bins, range_limits):
# Setup bins and determine the bin location for each element for the bins
R = range_limits
N = data.shape[-1]
bins = np.linspace(R[0],R[1],n_bins+1)
data2D = data.reshape(-1,N)
idx = np.searchsorted(bins, data2D,'right')-1
# Some elements would be off limits, so get a mask for those
bad_mask = (idx==-1) | (idx==n_bins)
# We need to use bincount to get bin based counts. To have unique IDs for
# each row and not get confused by the ones from other rows, we need to
# offset each row by a scale (using row length for this).
scaled_idx = n_bins*np.arange(data2D.shape[0])[:,None] + idx
# Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
limit = n_bins*data2D.shape[0]
scaled_idx[bad_mask] = limit
# Get the counts and reshape to multi-dim
counts = np.bincount(scaled_idx.ravel(),minlength=limit+1)[:-1]
counts.shape = data.shape[:-1] + (n_bins,)
return counts
Runtime test
Original approach -
def org_app(data, n_bins, range_limits):
R = range_limits
m,n = data.shape[:2]
out = np.zeros((m, n, n_bins), dtype="int64")
indices = [
np.arange(m), np.arange(n), [slice(None)]
]
# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
out[idx] = np.histogram(
data[idx],
bins=n_bins,
range=(R[0], R[1]),
)[0]
return out
Timings and verification -
In [2]: data = np.random.randn(4, 5, 6)
...: out1 = org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
...: out2 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
...: print np.allclose(out1, out2)
...:
True
In [3]: %timeit org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop
In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop
Since, in the loopy solution, we are looping through the first two axes. So, let's increase their lengths as that would show us how good is the vectorized one -
In [59]: data = np.random.randn(400, 500, 6)
In [60]: %timeit org_app(data, n_bins=21, range_limits=(- 2.5, 2.5))
1 loops, best of 3: 9.59 s per loop
In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop
In [62]: 9590/44.2 # Speedup number
Out[62]: 216.9683257918552
The first solution provided a nice short idiom which uses numpy sortedsearch which has the cost of a sort and many searches. But numpy has a fast route in its source code which is done in Python in fact, which can deal with equal bin edge range mathematically. This solution uses only a vectorized subtraction and multiplication and some comparisons instead.
This solution will follow numpy code for the search sorted, type imputations, and handles weights as well as complex numbers. It is basically the first solution combined with numpy histogram fast route, and some extra type, and iteration details, etc.
_range = range
def hist_np_laxis(a, bins=10, range=None, weights=None):
# Initialize empty histogram
N = a.shape[-1]
data2D = a.reshape(-1,N)
limit = bins*data2D.shape[0]
# gh-10322 means that type resolution rules are dependent on array
# shapes. To avoid this causing problems, we pick a type now and stick
# with it throughout.
bin_type = np.result_type(range[0], range[1], a)
if np.issubdtype(bin_type, np.integer):
bin_type = np.result_type(bin_type, float)
bin_edges = np.linspace(range[0],range[1],bins+1, endpoint=True, dtype=bin_type)
# Histogram is an integer or a float array depending on the weights.
if weights is None:
ntype = np.dtype(np.intp)
else:
ntype = weights.dtype
n = np.zeros(limit, ntype)
# Pre-compute histogram scaling factor
norm = bins / (range[1] - range[0])
# We set a block size, as this allows us to iterate over chunks when
# computing histograms, to minimize memory usage.
BLOCK = 65536
# We iterate over blocks here for two reasons: the first is that for
# large arrays, it is actually faster (for example for a 10^8 array it
# is 2x as fast) and it results in a memory footprint 3x lower in the
# limit of large arrays.
for i in _range(0, data2D.shape[0], BLOCK):
tmp_a = data2D[i:i+BLOCK]
block_size = tmp_a.shape[0]
if weights is None:
tmp_w = None
else:
tmp_w = weights[i:i + BLOCK]
# Only include values in the right range
keep = (tmp_a >= range[0])
keep &= (tmp_a <= range[1])
if not np.logical_and.reduce(np.logical_and.reduce(keep)):
tmp_a = tmp_a[keep]
if tmp_w is not None:
tmp_w = tmp_w[keep]
# This cast ensures no type promotions occur below, which gh-10322
# make unpredictable. Getting it wrong leads to precision errors
# like gh-8123.
tmp_a = tmp_a.astype(bin_edges.dtype, copy=False)
# Compute the bin indices, and for values that lie exactly on
# last_edge we need to subtract one
f_indices = (tmp_a - range[0]) * norm
indices = f_indices.astype(np.intp)
indices[indices == bins] -= 1
# The index computation is not guaranteed to give exactly
# consistent results within ~1 ULP of the bin edges.
decrement = tmp_a < bin_edges[indices]
indices[decrement] -= 1
# The last bin includes the right edge. The other bins do not.
increment = ((tmp_a >= bin_edges[indices + 1])
& (indices != bins - 1))
indices[increment] += 1
((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1)
#indices = scaled_idx.reshape(-1)
# We now compute the histogram using bincount
if ntype.kind == 'c':
n.real += np.bincount(indices, weights=tmp_w.real,
minlength=limit)
n.imag += np.bincount(indices, weights=tmp_w.imag,
minlength=limit)
else:
n += np.bincount(indices, weights=tmp_w,
minlength=limit).astype(ntype)
n.shape = a.shape[:-1] + (bins,)
return n
data = np.random.randn(4, 5, 6)
out1 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
out2 = hist_np_laxis(data, bins=200001, range=(- 2.5, 2.5))
print(np.allclose(out1, out2))
True
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
92.1 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
55.1 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Although the first solution is faster in the small example and even the larger example:
data = np.random.randn(400, 500, 6)
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
264 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
71.6 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is not ALWAYS faster:
data = np.random.randn(400, 6, 500)
%timeit hist_np_laxis(data, bins=101, range=(- 2.5, 2.5))
71.5 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit hist_laxis(data, n_bins=101, range_limits=(- 2.5, 2.5))
76.9 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, the numpy variation is only faster when the last axis is large. And its a very slight increase. In all other cases I tried, the first solution is much faster regardless of bin count and size of the first 2 dimensions. The only important line ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1) might be more optimizable though I have not found a faster method yet.
This would also imply the sheer number of vectorized operations of O(n) is outdoing the O(n log n) of the sort and repeated incremental searches.
However, realistic use cases will have a last axis with a lot of data and the prior axes with few. So in reality the samples in the first solution are too contrived to fit the desired performance.
Axis addition for histogram is noted as an issue in the numpy repo: https://github.com/numpy/numpy/issues/13166.
An xhistogram library has sought to solve this problem as well: https://xhistogram.readthedocs.io/en/latest/