I have a simple NumPy array of shape (N, 3), like:
v = np.array([[-3.33829, -3.42467, -3.53332],
[-2.67681, -2.6082 , -3.49502],
[-3.49497, -2.73177, -2.61499],
[-2.76056, -3.57753, -2.67334],
[-1.96801, -3.47521, -3.51974],
[-1.25571, -2.69451, -3.45554],
[-1.94568, -2.59504, -2.72568],
[-1.28991, -3.47927, -2.73176],
[-0.51201, -3.50684, -3.40448],
[ 0.22398, -2.70244, -3.43421]])
Here N = 10, but in my real case it is much larger (500+). Each row is a point in Euclidean coordinates.
I would like to carry out:
where i, j and k indicate different rows from v.
How can I implement it on Python in a fast way?
You can do this using numpy broadcasting operations:
diffs = ((v[:, None] - v) ** 2).sum(-1)
d = np.exp(diffs + diffs[:, None]).sum((0, 1))
print(d)
# [3.08316899e+11 2.37020625e+07 4.05357364e+12 8.22697743e+08
# 8.85209202e+04 2.55340202e+05 7.33879459e+04 1.88175133e+05
# 8.10134295e+08 6.62122925e+12]
Even for an array of size 500, the result is computed in just a few seconds:
%%time
v = np.random.rand(500, 3)
diffs = np.sum((v[:, None] - v) ** 2, -1)
d = np.exp(diffs + diffs[:, None]).sum((0, 1))
# CPU times: user 2.74 s, sys: 5.5 ms, total: 2.75 s
# Wall time: 2.75 s
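Because the exponent is a sum of one term that depends only on j and one that depends only on k, the double sum factorizes into the square of a single sum. The following is a minimal sketch, assuming (as the code above does) that j and k each run independently over all rows, including j = k and j = i; the names d_cubic and d_factored are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.random((50, 3))

# squared distances between all pairs of rows, shape (N, N)
diffs = ((v[:, None] - v) ** 2).sum(-1)

# formulation from the answer above: builds an (N, N, N) temporary
d_cubic = np.exp(diffs + diffs[:, None]).sum((0, 1))

# factorized form: sum_j sum_k exp(a_j + a_k) == (sum_j exp(a_j)) ** 2
d_factored = np.exp(diffs).sum(0) ** 2
```

This variant needs only (N, N) memory. If some index combinations have to be excluded from the sum, the factorization no longer applies directly.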
IIUC, the equation suggests pairwise vector differences, and not squared distance between vectors.
The pairwise difference between N vectors will be N*N vectors.
Finally, I would assume since you are only reducing over j and k axes, the output vector is (10,3) and not (10,). Do correct me if I am wrong.
import numpy as np
d = np.exp(((v[:,None]-v)**2)[:,None] + ((v[:,None]-v)**2)).sum((0,1))
print(d)
#### Stepwise breakdown
#v #i,3 -> 10,3
#diff = (v[:,None]-v)**2 #j,i,3 -> 10,10,3
#power = diff[:,None]+diff #k,j,i,3 -> 10,10,10,3
#exp = np.exp(power) #k,j,i,3 -> 10,10,10,3
#d = np.sum(exp,(0,1)) #i,3 -> 10,3
array([[4.38558108e+11, 2.11224470e+02, 2.08153285e+02],
[6.10332697e+09, 2.42309774e+02, 2.00079357e+02],
[1.37237360e+12, 2.11552094e+02, 2.32739462e+02],
[9.98934092e+09, 2.51158071e+02, 2.16562340e+02],
[1.77827910e+08, 2.22151678e+02, 2.05163797e+02],
[1.91234145e+08, 2.19457894e+02, 1.92858561e+02],
[1.63391357e+08, 2.46419838e+02, 2.04498335e+02],
[1.67512751e+08, 2.23119070e+02, 2.03232700e+02],
[8.45322705e+09, 2.30065042e+02, 1.85024981e+02],
[1.14468558e+12, 2.17683864e+02, 1.89388595e+02]])
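To double-check which axes are reduced, the broadcast expression can be compared against explicit loops on a tiny input. This is only a sketch under the per-component interpretation used in this answer; the array size and the name ref are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.random((4, 3))
n = len(v)

diff = (v[:, None] - v) ** 2      # shape (4, 4, 3)
power = diff[:, None] + diff      # shape (4, 4, 4, 3)
d = np.exp(power).sum((0, 1))     # shape (4, 3): one 3-vector per row i

# explicit triple loop over i, j, k for reference
ref = np.zeros_like(v)
for i in range(n):
    for j in range(n):
        for k in range(n):
            ref[i] += np.exp((v[j] - v[i]) ** 2 + (v[k] - v[i]) ** 2)
```

The two results agree, which confirms that summing over axes (0, 1) leaves the i axis in place.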
Benchmark -
%%timeit
np.exp(((v[:,None]-v)**2)[:,None] + ((v[:,None]-v)**2)).sum((0,1))
# 21.2 s ± 3.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Toy example
I have two arrays, which have different shape, for example:
import numpy as np
matrix = np.arange(5*6*7*8).reshape(5, 6, 7, 8)
vector = np.arange(1, 20, 2)
What I want to do is multiply each element of the matrix by one of the elements of 'vector' and then sum over the last two axes. For that, I have an array with the same shape as 'matrix' that tells me which index to use, for example:
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
I know that one solution would be:
matVec = vector[Idx]
res = np.sum(matrix*matVec, axis=(2, 3))
or even:
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
Problem
However, my problem is that my arrays are big and the construction of matVec takes both time and memory. So is there a way to bypass that and still achieve the same result?
More realistic example
Here is a more realistic example of what I'm actually doing:
import numpy as np
order = 20
dim = 23
listOrder = np.arange(-order, order+1, 1)
N, P = np.meshgrid(listOrder, listOrder)
K = np.arange(-2*dim+1, 2*dim+1, 1)
X = np.arange(-2*dim, 2*dim, 1)
tN = np.einsum('..., p, x -> ...px', N, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tP = np.einsum('..., p, x -> ...px', P, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tK = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), K, np.ones(X.shape, dtype=int))#, optimize=pathInt)
tX = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), np.ones(K.shape, dtype=int), X)#, optimize=pathInt)
tL = tK + tX
mini, maxi = -4*dim+1, 4*dim-1
NmPp2L = np.arange(2*mini-2*order, 2*maxi+2*order+1)
Idx = (2*tL+tN-tP) - NmPp2L[0]
np.random.seed(0)
matrix = (np.random.rand(Idx.size) + 1j*np.random.rand(Idx.size)).reshape(Idx.shape)
vector = np.random.rand(np.max(Idx)+1) + 1j*np.random.rand(np.max(Idx)+1)
res = np.sum(matrix*vector[Idx], axis=(2, 3))
For larger data arrays
import numpy as np
matrix = np.arange(50*60*70*80).reshape(50, 60, 70, 80)
vector = np.arange(1, 50, 2)
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
Parallel Numba speeds up the computation and avoids creating matVec.
Timings on a 4-core Intel Xeon Platinum 8259CL:
matVec = vector[Idx]
# %timeit 48.4 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
# %timeit 26.9 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
against an unoptimized numba implementation
import numba as nb
@nb.njit(parallel=True)
def func(matrix, idx, vector):
    # one accumulator per (i, j); the outer loop is split across threads
    res = np.zeros((matrix.shape[0],matrix.shape[1]), dtype=matrix.dtype)
    for i in nb.prange(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            for k in range(matrix.shape[2]):
                for l in range(matrix.shape[3]):
                    res[i,j] += matrix[i,j,k,l] * vector[idx[i,j,k,l]]
    return res
func(matrix, Idx, vector) # compile run
func(matrix, Idx, vector)
# %timeit 21.7 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# (48.4 + 26.9) / 21.7 = ~3.47x speed up
np.testing.assert_equal(func(matrix, Idx, vector), np.einsum('ijkl, ijkl -> ij', matrix, matVec))
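If Numba is not an option, a plain NumPy loop over the first axis also avoids materializing the full matVec, at the cost of some Python overhead. This is a sketch rather than a benchmarked solution; the small random inputs below are stand-ins for the real arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.random((5, 6, 7, 8))
vector = np.arange(1, 20, 2).astype(float)
Idx = rng.integers(0, vector.size, size=matrix.shape)

res = np.empty(matrix.shape[:2], dtype=np.result_type(matrix, vector))
for i in range(matrix.shape[0]):
    # the temporary vector[Idx[i]] has shape matrix.shape[1:], not the full 4-D shape
    res[i] = np.einsum('jkl,jkl->j', matrix[i], vector[Idx[i]])
```

Peak extra memory drops by a factor of matrix.shape[0] compared with building vector[Idx] in one go.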
Performance and further optimizations
The above Numba code should be memory-bound when dealing with complex numbers. Indeed, matrix and Idx are big and must be completely read from the relatively slow RAM. matrix has a size of 41*41*92*92*8*2 = 217.10 MiB and Idx a size of either 41*41*92*92*8 = 108.55 MiB or 41*41*92*92*4 = 54.28 MiB depending on the target platform (it should be of type int32 on most x86-64 Windows platforms and int64 on most Linux x86-64 platforms). This is also why vector[Idx] was slow: Numpy needs to write a big array in memory (not to mention that writing data should be about twice as slow as reading it on x86-64 platforms in this case).
Assuming the code is memory-bound, the only way to speed it up is to reduce the amount of data read from RAM. This can be achieved by storing Idx in a uint16 array instead of the default np.int_ datatype (which is 2 to 4 times bigger). This is possible since vector.size is small. On Linux with an i5-9600KF processor and 38-40 GiB/s RAM, this optimization resulted in a ~29% speed-up while the code is still mainly memory-bound. The implementation is nearly optimal on this platform.
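The index narrowing described above might look like the following sketch. The array shapes are made up for illustration, and the range check is an added safeguard rather than part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
vector = rng.random(500)
# Generator.integers returns int64 by default
Idx = rng.integers(0, vector.size, size=(20, 20, 30, 30))

# uint16 can hold indices up to 65535, so check before narrowing
assert Idx.min() >= 0 and Idx.max() <= np.iinfo(np.uint16).max
Idx16 = Idx.astype(np.uint16)

# factor by which the amount of index data to read shrinks
ratio = Idx.nbytes / Idx16.nbytes
```

Indexing with Idx16 gives the same values as indexing with Idx, so the narrowed array can be passed to the Numba function unchanged.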
I have three 1D arrays:
idxs: the index data
weights: the weight of each index in idxs
bins: the bins within which the minimum weight is calculated.
Here's my current method: use idxs to determine which bin each entry of weights falls into, then calculate the min/max of the binned weights:
Get slices that shows which bin each idxs element belongs to.
Sort slices and weights at the same time.
Calculate the minimum of weights in each bin (slice).
numpy method
import random
import numpy as np
# create example data
out_size = int(10)
bins = np.arange(3, out_size-3)
idxs = np.arange(0, out_size)
#random.shuffle(idxs)
# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]
weights = idxs
# get which bin idxs belong to
slices = np.digitize(idxs, bins)
# get index and weights in bins
valid = (bins.max() >= idxs) & (idxs >= bins.min())
valid_slices = slices[valid]
valid_weights = weights[valid]
# sort slice and weights
sort_index = valid_slices.argsort()
valid_slices_sort = valid_slices[sort_index]
valid_weights_sort = valid_weights[sort_index]
# get the index of the first occurrence of each unique slice
unique_slices, unique_index = np.unique(valid_slices_sort, return_index=True)
# calculate the minimum
res_sub = np.minimum.reduceat(valid_weights_sort, unique_index)
# save results
res = np.full((out_size), np.nan)
res[unique_slices-1] = res_sub
print(res)
Results:
array([ 3., nan, 5., nan, nan, nan, nan, nan, nan, nan])
If I increase the out_size to 1e7 and shuffle the data, the speed (from np.digitize to the end) is slow:
13.5 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And, here's the consumed time of each part:
np.digitize: 10.8 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
valid: 171 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
argsort and slice: 2.02 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
unique: 9.9 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.minimum.reduceat: 5.11 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.digitize() takes most of the time (10.8 s), followed by argsort (2.02 s).
I also checked the time needed to calculate the mean using np.histogram:
counts, _ = np.histogram(idxs, bins=out_size, range=(0, out_size))
sums, _ = np.histogram(idxs, bins=out_size, range=(0, out_size), weights = weights, density=False)
mean = sums / np.where(counts == 0, np.nan, counts)
33.2 s ± 3.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is similar to my method of calculating minimum.
scipy method
from scipy.stats import binned_statistic
statistics, _, _ = binned_statistic(idxs, weights, statistic='min', bins=bins)
print(statistics)
The result is a little different, and it is much slower (about 6x) for the longer (1e7) shuffled data:
array([ 3., nan, 5.])
1min 20s ± 6.93 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Summary
I want to figure out a quicker method. If the method is also suitable for dask, that would be excellent!
Use Case
Here's what my real data (1D) looks like:
SultanOrazbayev showed a quick approach; I'll add a cool one.
mask = bins[:, None] == idxs[None, :]
result = np.nanmin(np.where(mask, weights, np.nan), axis=-1)
# Note: may produce (expected) runtime warning if bin has no values
Of course, you can also do np.nanmax, np.nanmean, etc.
The above assumes that your bins are indeed single values. If they are ranges, you need slightly more work to construct the mask:
lower_mask = idxs[None, :] >= bins[:, None]
upper_mask = np.empty_like(lower_mask)
upper_mask[:-1, ...] = idxs[None, :] < bins[1:, None]
upper_mask[-1, ...] = False
mask = lower_mask & upper_mask
at which point you can use np.nanmin like above.
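Here is the range-mask construction applied end to end to the toy data from the question, as a small self-contained sketch. The warning suppression is an addition for convenience, since empty bins otherwise trigger NumPy's all-NaN RuntimeWarning:

```python
import warnings
import numpy as np

bins = np.array([3, 4, 5, 6])
idxs = np.array([0, 1, 2, 3, 3, 5, 7, 7, 8, 9])
weights = idxs.astype(np.float64)

lower_mask = idxs[None, :] >= bins[:, None]
upper_mask = np.empty_like(lower_mask)
upper_mask[:-1] = idxs[None, :] < bins[1:, None]
upper_mask[-1] = False  # as above: the last row never matches
mask = lower_mask & upper_mask

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    result = np.nanmin(np.where(mask, weights, np.nan), axis=-1)
print(result)
```

The rows for bins 3 and 5 contain the minima 3.0 and 5.0; the other two rows have no members and come out as NaN.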
Of course, np.where and the broadcast used to create the mask will each create a new array of shape (len(bins), len(idxs)) with their respective datatype. If this is of no concern to you, then the above is probably good enough.
If it is a problem (because you are pressed for RAM), then my first suggestion is to buy more RAM. If - for some stupid reason (say, money) - that doesn't work, you can avoid the copy of weights by using a masked array over a manually re-strided view into weights:
import numpy.ma as ma
mask = ...
restrided_weights = np.lib.stride_tricks.as_strided(weights, shape=(bins.size, idxs.size), strides=(0, weights.strides[0]))
masked = ma.masked_array(restrided_weights, mask=~mask, fill_value=np.nan, dtype=np.float64)
result = masked.min(axis=-1).filled(np.nan)
This avoids both a copy of weights and the above-mentioned runtime warning.
If you don't even have enough memory to construct mask, then you can try processing the data in chunks.
Last I checked, Dask used to have funny behavior when fed with manually strided arrays. There was some work on this though, so you may want to double-check if that is resolved, in which case you can happily parallelize the above.
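If the (len(bins), len(idxs)) mask itself is the bottleneck, one alternative worth trying is an unbuffered ufunc reduction: np.minimum.at scatters each weight into its bin without sorting or building a mask. This is a minimal sketch on the question's toy data, assuming the bin index from np.digitize is what you want; ufunc.at has historically been slow in NumPy, so measure before relying on it:

```python
import numpy as np

bins = np.arange(3, 7)
idxs = np.array([0, 1, 2, 3, 3, 5, 7, 7, 8, 9])
weights = idxs.astype(np.float64)

slices = np.digitize(idxs, bins)                     # 1-based bin number per element
valid = (idxs >= bins.min()) & (idxs <= bins.max())  # same validity test as the question

res = np.full(len(bins), np.inf)
# unbuffered scatter-min: res[b] = min(res[b], w) for every (b, w) pair
np.minimum.at(res, slices[valid] - 1, weights[valid])
res[np.isinf(res)] = np.nan                          # bins that received no value
```

np.maximum.at works the same way for the maximum (start from -np.inf instead).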
Update based on your further comments to this answer and the other:
You can do this computation in chunks to avoid memory issues due to your large number of bins (1e4 in magnitude). Putting the concrete numbers into a full example and adding a progress bar indicates <40 seconds runtime on a single core.
import numpy.ma as ma
from tqdm import trange
import numpy as np
import random
# create example data
out_size = int(1.5e5)
#bins = np.arange(3, out_size-3)
bins = np.arange(3, int(3.8e4-3), dtype=np.int64)
idxs = np.arange(0, out_size)
random.shuffle(idxs)
# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]
weights = idxs
chunk_size = 100
# preallocate buffers to avoid array creation in main loop
extended_bins = np.empty(len(bins) + 1, dtype=bins.dtype)
extended_bins[:-1] = bins
extended_bins[-1] = np.iinfo(bins.dtype).max # last bin goes to infinity
mask_buffer = np.empty((chunk_size, len(idxs)), dtype=bool)
result = np.empty_like(bins, dtype=np.float64)
for low in trange(0, len(bins), chunk_size):
    high = min(low + chunk_size, len(bins))
    chunk_size = high - low
    mask_buffer[:chunk_size, ...] = ~((bins[low:high, None] <= idxs[None, :]) & (extended_bins[low+1:high+1, None] > idxs[None, :]))
    mask = mask_buffer[:chunk_size, ...]
    restrided_weights = np.lib.stride_tricks.as_strided(weights, shape=mask.shape, strides=(0, weights.strides[0]))
    masked = ma.masked_array(restrided_weights, mask=mask, fill_value=np.nan, dtype=np.float64)
    result[low:high] = masked.min(axis=-1).filled(np.nan)
Bonus: For min and max only, there is a cool trick that you can use: sort idxs and weights based on weights (ascending for min, descending for max). This way, you can immediately look up the min/max value and can avoid the masked array and the custom strides altogether. This relies on some not so well documented behavior of np.argmax, which takes a fast path for boolean arrays and doesn't search the full array.
It only works for these two cases, and you'd have to fall back to the above for more sophisticated things (mean), but for those two it shaves off another ~70%, and a run on a single core clocks in at <12 seconds.
# fast min/max
from tqdm import trange
import numpy as np
import random
# create example data
out_size = int(1.5e5)
#bins = np.arange(3, out_size-3)
bins = np.arange(3, int(3.8e4-3), dtype=np.int64)
idxs = np.arange(0, out_size)
random.shuffle(idxs)
# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]
weights = idxs
order = np.argsort(weights)
weights_sorted = np.empty((len(weights) + 1), dtype=np.float64)
weights_sorted[:-1] = weights[order]
weights_sorted[-1] = np.nan
idxs_sorted = idxs[order]
extended_bins = np.empty(len(bins) + 1, dtype=bins.dtype)
extended_bins[:-1] = bins
extended_bins[-1] = np.iinfo(bins.dtype).max # last bin goes to infinity
# preallocate buffers to avoid array creation in main loop
chunk_size = 1000
mask_buffer = np.empty((chunk_size, len(idxs) + 1), dtype=bool)
mask_buffer[:, -1] = True
result = np.empty_like(bins, dtype=np.float64)
for low in trange(0, len(bins), chunk_size):
    high = min(low + chunk_size, len(bins))
    chunk_size = high - low
    mask_buffer[:chunk_size, :-1] = (bins[low:high, None] <= idxs_sorted[None, :]) & (extended_bins[low+1:high+1, None] > idxs_sorted[None, :])
    mask = mask_buffer[:chunk_size, ...]
    weight_idx = np.argmax(mask, axis=-1)
    result[low:high] = weights_sorted[weight_idx]
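The sort-then-argmax trick is easier to see on a handful of values. The following sketch uses made-up data and single-value bins; the trailing always-True column plays the same sentinel role as in the code above:

```python
import numpy as np

weights = np.array([5.0, 1.0, 4.0, 2.0])
idxs = np.array([0, 1, 1, 0])
bins = np.array([0, 1, 2])  # bin 2 has no members

order = np.argsort(weights)                    # ascending, so the first hit is the minimum
weights_sorted = np.append(weights[order], np.nan)
idxs_sorted = idxs[order]

mask = np.empty((len(bins), len(idxs) + 1), dtype=bool)
mask[:, :-1] = bins[:, None] == idxs_sorted[None, :]
mask[:, -1] = True                             # sentinel column: points at the NaN slot

# argmax returns the first True per row, i.e. the smallest weight in that bin
result = weights_sorted[np.argmax(mask, axis=-1)]
```

Here result holds 2.0 for bin 0, 1.0 for bin 1 and NaN for the empty bin 2.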
A quick approach to achieve this is with dask.dataframe and pd.cut. I first show the pandas version:
import numpy as np
from scipy.stats import binned_statistic as bs
import pandas as pd
nrows=10**7
df = pd.DataFrame(np.random.rand(nrows, 2), columns=['x', 'val'])
bins = np.linspace(df['x'].min(), df['x'].max(), 10)
df['binned_x'] = pd.cut(df['x'], bins=bins, right=False)
result_pandas = df.groupby('binned_x')['val'].min().values
result_scipy = bs(df['x'], df['val'], 'min', bins=bins)[0]
print(np.isclose(result_pandas, result_scipy))
# [ True True True True True True True True True]
Now to go from pandas to dask, you will need to make sure that bins are consistent across partitions, so take a look here. Once every partition is binned consistently, you want to apply the desired operation (min/max/sum/count):
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)
def f(df, bins):
    df = df.copy()
    df['binned_x'] = pd.cut(df['x'], bins=bins, right=False)
    result = df.groupby('binned_x', as_index=False)['val'].min()
    return result
result_dask = ddf.map_partitions(f, bins).groupby('binned_x')['val'].min().compute()
print(np.isclose(result_pandas, result_dask))
# [ True True True True True True True True True]
On my laptop, the first code takes about 3 seconds, and the second is about 10 times faster (I originally forgot that I was double-counting, with both pandas and scipy performing the same operation). There is scope for playing around with partitioning, but that's context-dependent, so it is something you can try optimizing on your data/hardware.
Update: note that this approach will work for min/max, but for mean you will want to calculate sum and count and then divide them. There is probably a good way of keeping track of weights in doing this calculation in one go, but it might not be worth the added code complexity.
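The sum-and-count idea for the mean can be sketched without pandas or dask: each partition produces per-bin sums and counts, which are then added up and divided. The helper name partial_sums and the use of np.array_split to mimic partitions are illustrative assumptions, not part of the approach above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
val = rng.random(1000)
bins = np.linspace(0.0, 1.0, 11)
n_bins = len(bins) - 1

def partial_sums(x_part, val_part):
    # bin index per element; every partition uses the same bin edges
    b = np.clip(np.searchsorted(bins, x_part, side='right') - 1, 0, n_bins - 1)
    sums = np.bincount(b, weights=val_part, minlength=n_bins)
    counts = np.bincount(b, minlength=n_bins)
    return sums, counts

# pretend the data lives in 4 partitions
parts = [partial_sums(xp, vp)
         for xp, vp in zip(np.array_split(x, 4), np.array_split(val, 4))]
sums = sum(p[0] for p in parts)
counts = sum(p[1] for p in parts)
mean = sums / np.where(counts == 0, np.nan, counts)
```

Averaging the per-partition means directly would be wrong whenever partitions contribute different numbers of points to a bin, which is why sums and counts are carried separately.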
I've got the following situation.
I have a list of coordinates (x, y, z), indexed by i, and have to find all triples (i, j, k) inside a cutoff sphere, i.e. such that r_ij and r_ik are both smaller than a cutoff value.
Therefore I am computing a matrix r_ij that contains all distances.
To compute the triples, my idea is to construct an r_ijk matrix.
I've done this with a loop over the number of elements i:
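import tensorflow as tf
r_ij = tf.reshape(tf.range(4*4), (4, 4))
r_ijk = []
# originally range(len(x)); x (the coordinate list) is not defined in this toy
# example, so the number of rows of r_ij is used instead
for i in range(r_ij.shape[0]):
    r_ijk.append(tf.roll(r_ij, shift=-i, axis=1))
tf.stack(r_ijk)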
I want to improve this code because of two issues.
Primarily because I assume that it could be fully vectorized.
But also because, to use this in my model, I need to alter it:
@tf.function
def get_triplets(full_r_ij, r_cut):
    r_ij = tf.norm(full_r_ij, axis=-1) # Shape of full_r_ij is (n_timesteps, n_atoms, n_atoms, 3)
    n_atoms = tf.shape(r_ij)[1]
    r_ijk = r_ij[None]
    for atom in range(1, n_atoms):
        tf.autograph.experimental.set_loop_options(
            shape_invariants=[(r_ijk, tf.TensorShape([None, None, None, None]))]
        )
        tmp = tf.roll(r_ij, shift=-atom, axis=2)
        r_ijk = tf.concat([r_ijk, tmp[None]], axis=0) # shape is (n_atoms, n_timesteps, n_atoms, n_atoms)
    r_ijk = tf.transpose(r_ijk, perm=(1, 0, 2, 3))
    r_ijk = tf.where(r_ijk == 0, tf.ones_like(r_ijk) * r_cut, r_ijk)
    intermediate_indices = tf.where(
        tf.math.logical_and(r_ijk[:, 0, None] == 3.0, r_ijk[:, 1:] == 3.0)
    )
    n_atoms = tf.cast(n_atoms, dtype=tf.int64)
    t, n, i, j = tf.unstack(intermediate_indices, axis=1)
    k = j + n + 1
    k = tf.where(k >= n_atoms, k - n_atoms, k)
    triples = tf.stack([t, i, j, k], axis=1)
    return triples
Here I use tf.autograph.experimental.set_loop_options because I am effectively looping over the r_ij tensor.
Is there a way to improve the first code example (or the second as well)?
I tested two further implementations using tf.vectorized_map and tf.map_fn, both of which performed worse than the initial function I wrote. All tests were performed with r_ij = tf.random.normal((32, 150, 150))
@tf.function
def roll_loop(r_ij, n_atoms):
    r_ijk = r_ij[None]
    for atom in range(1, n_atoms):
        tf.autograph.experimental.set_loop_options(
            shape_invariants=[(r_ijk, tf.TensorShape([None, None, None, None]))]
        )
        tmp = tf.roll(r_ij, shift=-atom, axis=2)
        r_ijk = tf.concat([r_ijk, tmp[None]], axis=0) # shape is (n_atoms, n_timesteps, n_atoms, n_atoms)
    return r_ijk
It took 129 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@tf.function
def roll_vect(r_ij, n_atoms):
    r_ijk = tf.repeat(r_ij[None], repeats=n_atoms, axis=0)
    def roll(args):
        x, shift = args
        return tf.roll(x, shift=shift, axis=2)
    return tf.vectorized_map(roll, [r_ijk, tf.range(n_atoms)])
It took 225 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@tf.function
def roll_map(r_ij, n_atoms):
    r_ijk = tf.repeat(r_ij[None], repeats=n_atoms, axis=0)
    def roll(args):
        x, shift = args
        return tf.roll(x, shift=shift, axis=2)
    return tf.map_fn(roll, (r_ijk, tf.range(n_atoms)), fn_output_signature=tf.float32)
It took 327 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it seems like going for tf.function with python for loop is fastest (so far). All functions were compiled before testing.
EDIT:
Using tf.TensorArray seems to be the best way for this task.
I tested it with a few different inputs, and it performs as well as, or even a little better than, tf.autograph.experimental.set_loop_options.
@tf.function
def roll_loop(r_ij, n_atoms):
    r_ijk = tf.TensorArray(tf.float32, size=n_atoms)
    for atom in range(0, n_atoms):
        tmp = tf.roll(r_ij, shift=-atom, axis=2)
        r_ijk = r_ijk.write(atom, tmp)
    return r_ijk.stack()
It took 128 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
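A loop-free formulation is also conceivable, because stacking all rolls is just an indexed lookup: entry [s, i, j] of the stack reads r_ij[i, (j + s) % n]. The sketch below demonstrates that index arithmetic with NumPy on the 2-D toy case from the question (np.roll standing in for tf.roll). In TensorFlow the same lookup could presumably be written with tf.gather(r_ij, cols, axis=1) followed by tf.transpose; that translation is an assumption and has not been benchmarked here:

```python
import numpy as np

n = 4
r_ij = np.arange(n * n).reshape(n, n)

# loop version from the question, with np.roll standing in for tf.roll
looped = np.stack([np.roll(r_ij, shift=-s, axis=1) for s in range(n)])

# cols[s, j] = (j + s) % n is the source column for shift s and target column j
cols = (np.arange(n)[None, :] + np.arange(n)[:, None]) % n

# fancy indexing yields axes (i, s, j); reorder to (s, i, j) to match the stack
vectorized = r_ij[:, cols].transpose(1, 0, 2)
```

Whether this beats the TensorArray loop depends on how the gather is executed on the target device, so it should be timed on realistic shapes.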
Is there a way to calculate many histograms along an axis of an nD-array? The method I currently have uses a for loop to iterate over all other axes and calculate a numpy.histogram() for each resulting 1D array:
import numpy
import itertools
data = numpy.random.rand(4, 5, 6)
# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
    numpy.arange(4), numpy.arange(5), [slice(None)]
]
# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
    out[idx] = numpy.histogram(
        data[idx],
        bins=2 * 100000 + 1,
        range=(-100000 - 0.5, 100000 + 0.5),
    )[0]
out.shape # (4, 5, 200001)
Needless to say, this is very slow. However, I couldn't find a way to solve this using numpy.histogram, numpy.histogram2d or numpy.histogramdd.
Here's a vectorized approach making use of the efficient tools np.searchsorted and np.bincount. searchsorted gives us the locations where each element is to be placed based on the bins, and bincount does the counting for us.
Implementation -
def hist_laxis(data, n_bins, range_limits):
    # Setup bins and determine the bin location for each element for the bins
    R = range_limits
    N = data.shape[-1]
    bins = np.linspace(R[0],R[1],n_bins+1)
    data2D = data.reshape(-1,N)
    idx = np.searchsorted(bins, data2D,'right')-1
    # Some elements would be off limits, so get a mask for those
    bad_mask = (idx==-1) | (idx==n_bins)
    # We need to use bincount to get bin based counts. To have unique IDs for
    # each row and not get confused by the ones from other rows, we need to
    # offset each row by a scale (using row length for this).
    scaled_idx = n_bins*np.arange(data2D.shape[0])[:,None] + idx
    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
    limit = n_bins*data2D.shape[0]
    scaled_idx[bad_mask] = limit
    # Get the counts and reshape to multi-dim
    counts = np.bincount(scaled_idx.ravel(),minlength=limit+1)[:-1]
    counts.shape = data.shape[:-1] + (n_bins,)
    return counts
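The row-offset trick at the centre of this function is easier to follow on a two-row example. The following sketch uses made-up data whose values all lie strictly inside the range, so the out-of-range handling is left out:

```python
import numpy as np

data = np.array([[0.1, 0.2, 0.9],
                 [0.6, 0.6, 0.7]])
n_bins = 2
bins = np.linspace(0.0, 1.0, n_bins + 1)   # edges 0.0, 0.5, 1.0

# bin index of every element, per row
idx = np.searchsorted(bins, data, 'right') - 1

# shift row r into the ID range [r * n_bins, (r + 1) * n_bins)
offset_idx = n_bins * np.arange(data.shape[0])[:, None] + idx

# one flat bincount, then fold back into (rows, n_bins)
counts = np.bincount(offset_idx.ravel(), minlength=n_bins * data.shape[0])
counts = counts.reshape(data.shape[0], n_bins)
```

Row 0 ends up as [2, 1] and row 1 as [0, 3], the same as calling np.histogram on each row separately.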
Runtime test
Original approach -
def org_app(data, n_bins, range_limits):
    R = range_limits
    m,n = data.shape[:2]
    out = np.zeros((m, n, n_bins), dtype="int64")
    indices = [
        np.arange(m), np.arange(n), [slice(None)]
    ]
    # Iterate over all axes, calculate histogram for each cell
    for idx in itertools.product(*indices):
        out[idx] = np.histogram(
            data[idx],
            bins=n_bins,
            range=(R[0], R[1]),
        )[0]
    return out
Timings and verification -
In [2]: data = np.random.randn(4, 5, 6)
...: out1 = org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
...: out2 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
...: print(np.allclose(out1, out2))
...:
True
In [3]: %timeit org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop
In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop
Since the loopy solution iterates over the first two axes, let's increase their lengths to show how well the vectorized version scales -
In [59]: data = np.random.randn(400, 500, 6)
In [60]: %timeit org_app(data, n_bins=21, range_limits=(- 2.5, 2.5))
1 loops, best of 3: 9.59 s per loop
In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop
In [62]: 9590/44.2 # Speedup number
Out[62]: 216.9683257918552
The first solution provided a nice short idiom built on np.searchsorted, which pays for many searches over the sorted bin edges. But NumPy's own histogram source code has a fast path (written in Python, in fact) that handles equal-width bins arithmetically. This solution uses only a vectorized subtraction and multiplication and some comparisons instead.
This solution follows the NumPy code for the bin lookup and type resolution, and handles weights as well as complex numbers. It is basically the first solution combined with the NumPy histogram fast path, plus some extra type and iteration details.
_range = range
def hist_np_laxis(a, bins=10, range=None, weights=None):
    # Initialize empty histogram
    N = a.shape[-1]
    data2D = a.reshape(-1,N)
    limit = bins*data2D.shape[0]
    # gh-10322 means that type resolution rules are dependent on array
    # shapes. To avoid this causing problems, we pick a type now and stick
    # with it throughout.
    bin_type = np.result_type(range[0], range[1], a)
    if np.issubdtype(bin_type, np.integer):
        bin_type = np.result_type(bin_type, float)
    bin_edges = np.linspace(range[0],range[1],bins+1, endpoint=True, dtype=bin_type)
    # Histogram is an integer or a float array depending on the weights.
    if weights is None:
        ntype = np.dtype(np.intp)
    else:
        ntype = weights.dtype
    n = np.zeros(limit, ntype)
    # Pre-compute histogram scaling factor
    norm = bins / (range[1] - range[0])
    # We set a block size, as this allows us to iterate over chunks when
    # computing histograms, to minimize memory usage.
    BLOCK = 65536
    # We iterate over blocks here for two reasons: the first is that for
    # large arrays, it is actually faster (for example for a 10^8 array it
    # is 2x as fast) and it results in a memory footprint 3x lower in the
    # limit of large arrays.
    for i in _range(0, data2D.shape[0], BLOCK):
        tmp_a = data2D[i:i+BLOCK]
        block_size = tmp_a.shape[0]
        if weights is None:
            tmp_w = None
        else:
            tmp_w = weights[i:i + BLOCK]
        # Only include values in the right range
        keep = (tmp_a >= range[0])
        keep &= (tmp_a <= range[1])
        if not np.logical_and.reduce(np.logical_and.reduce(keep)):
            tmp_a = tmp_a[keep]
            if tmp_w is not None:
                tmp_w = tmp_w[keep]
        # This cast ensures no type promotions occur below, which gh-10322
        # make unpredictable. Getting it wrong leads to precision errors
        # like gh-8123.
        tmp_a = tmp_a.astype(bin_edges.dtype, copy=False)
        # Compute the bin indices, and for values that lie exactly on
        # last_edge we need to subtract one
        f_indices = (tmp_a - range[0]) * norm
        indices = f_indices.astype(np.intp)
        indices[indices == bins] -= 1
        # The index computation is not guaranteed to give exactly
        # consistent results within ~1 ULP of the bin edges.
        decrement = tmp_a < bin_edges[indices]
        indices[decrement] -= 1
        # The last bin includes the right edge. The other bins do not.
        increment = ((tmp_a >= bin_edges[indices + 1])
                     & (indices != bins - 1))
        indices[increment] += 1
        # Offset each row's bin indices by row_number * bins and flatten, so a
        # single bincount can serve all rows. The result must be assigned back
        # to indices; the original listing dropped the assignment.
        indices = ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1)
        # We now compute the histogram using bincount
        if ntype.kind == 'c':
            n.real += np.bincount(indices, weights=tmp_w.real,
                                  minlength=limit)
            n.imag += np.bincount(indices, weights=tmp_w.imag,
                                  minlength=limit)
        else:
            n += np.bincount(indices, weights=tmp_w,
                             minlength=limit).astype(ntype)
    n.shape = a.shape[:-1] + (bins,)
    return n
data = np.random.randn(4, 5, 6)
out1 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
out2 = hist_np_laxis(data, bins=200001, range=(- 2.5, 2.5))
print(np.allclose(out1, out2))
True
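Stripped of blocks, weights and the extra axes, the arithmetic bin lookup reduces to a few lines. This is a 1-D sketch of the idea, with made-up data and the edge corrections mirrored from the function above; it is not a replacement for np.histogram:

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi, n_bins = -2.5, 2.5, 21
x = rng.uniform(lo, hi, size=1000)
edges = np.linspace(lo, hi, n_bins + 1)

# equal-width bins: the index is a scaled offset, no search needed
idx = ((x - lo) * (n_bins / (hi - lo))).astype(np.intp)
idx[idx == n_bins] -= 1                      # values exactly on the right edge

# floating-point rounding can be off by one bin near an edge, so correct it
idx[x < edges[idx]] -= 1
idx[(x >= edges[idx + 1]) & (idx != n_bins - 1)] += 1

counts = np.bincount(idx, minlength=n_bins)
reference = np.histogram(x, bins=n_bins, range=(lo, hi))[0]
```

counts matches reference, and each element costs a constant number of operations instead of a binary search over the edges.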
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
92.1 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
55.1 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Although the first solution is faster in the small example and even the larger example:
data = np.random.randn(400, 500, 6)
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
264 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
71.6 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is not ALWAYS faster:
data = np.random.randn(400, 6, 500)
%timeit hist_np_laxis(data, bins=101, range=(- 2.5, 2.5))
71.5 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit hist_laxis(data, n_bins=101, range_limits=(- 2.5, 2.5))
76.9 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, the NumPy variation is only faster when the last axis is large, and even then the gain is very slight. In all other cases I tried, the first solution is much faster regardless of bin count and size of the first 2 dimensions. The only important line, ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1), might be more optimizable, though I have not found a faster method yet.
This would also imply that the sheer number of O(n) vectorized operations outweighs the O(n log n) cost of the sort and repeated incremental searches.
However, realistic use cases will have a last axis with a lot of data and prior axes with little, so the samples used in the first solution are too contrived to reflect the performance that matters in practice.
Axis addition for histogram is noted as an issue in the numpy repo: https://github.com/numpy/numpy/issues/13166.
An xhistogram library has sought to solve this problem as well: https://xhistogram.readthedocs.io/en/latest/
I have two lists of coordinates:
l1 = [[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]]
l2 = [[x,y,z],[x,y,z],[x,y,z]]
I want to find the shortest pairwise distance between l1 and l2. Distance between two coordinates is simply:
numpy.linalg.norm(l1_element - l2_element)
So how do I use numpy to efficiently apply this operation to each pair of elements?
Here is a quick performance analysis of the four methods presented so far:
import numpy
import scipy
from itertools import product
from scipy.spatial.distance import cdist
from scipy.spatial import cKDTree as KDTree
n = 100
l1 = numpy.random.randint(0, 100, size=(n,3))
l2 = numpy.random.randint(0, 100, size=(n,3))
# by @Phillip
def a(l1,l2):
    return min(numpy.linalg.norm(l1_element - l2_element) for l1_element,l2_element in product(l1,l2))
# by @Kasra
def b(l1,l2):
    return numpy.min(numpy.apply_along_axis(
        numpy.linalg.norm,
        2,
        l1[:, None, :] - l2[None, :, :]
    ))
# mine
def c(l1,l2):
    return numpy.min(scipy.spatial.distance.cdist(l1,l2))
# just checking that numpy.min is indeed faster.
def c2(l1,l2):
    return min(scipy.spatial.distance.cdist(l1,l2).reshape(-1))
# by @BrianLarsen
def d(l1,l2):
    # make KDTrees for both sets of points
    t1 = KDTree(l1)
    t2 = KDTree(l2)
    # we need a distance to not look beyond, if you have real knowledge use it, otherwise guess
    maxD = numpy.linalg.norm(l1[0] - l2[0]) # this could be the closest pair, but anything further is certainly not
    # get a sparse matrix of all the distances
    ans = t1.sparse_distance_matrix(t2, maxD)
    # get the minimum distance and points involved
    minD = min(ans.values())
    return minD
for x in (a,b,c,c2,d):
    print("Timing variant", x.__name__, ':', flush=True)
    print(x(l1,l2), flush=True)
    %timeit x(l1,l2)
    print(flush=True)
For n=100
Timing variant a :
2.2360679775
10 loops, best of 3: 90.3 ms per loop
Timing variant b :
2.2360679775
10 loops, best of 3: 151 ms per loop
Timing variant c :
2.2360679775
10000 loops, best of 3: 136 µs per loop
Timing variant c2 :
2.2360679775
1000 loops, best of 3: 844 µs per loop
Timing variant d :
2.2360679775
100 loops, best of 3: 3.62 ms per loop
For n=1000
Timing variant a :
0.0
1 loops, best of 3: 9.16 s per loop
Timing variant b :
0.0
1 loops, best of 3: 14.9 s per loop
Timing variant c :
0.0
100 loops, best of 3: 11 ms per loop
Timing variant c2 :
0.0
10 loops, best of 3: 80.3 ms per loop
Timing variant d :
0.0
1 loops, best of 3: 933 ms per loop
Using newaxis and broadcasting, l1[:, None, :] - l2[None, :, :] is an array of the pairwise difference vectors. You can reduce this array to an array of norms using apply_along_axis and then take the min:
numpy.min(numpy.apply_along_axis(
numpy.linalg.norm,
2,
l1[:, None, :] - l2[None, :, :]
))
Of course, this only works if l1 and l2 are numpy arrays, so if your lists in the question weren't pseudo-code, you'll have to add l1 = numpy.array(l1); l2 = numpy.array(l2).
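Since numpy.linalg.norm accepts an axis argument, the apply_along_axis call can be replaced by a single vectorized norm over the last axis. The following is a small sketch on made-up points; recovering the index pair with argmin is an addition for illustration:

```python
import numpy as np

l1 = np.array([[0, 0, 0], [4, 5, 6], [7, 6, 7]])
l2 = np.array([[100, 3, 4], [1, 0, 0], [10, 15, 16]])

diff = l1[:, None, :] - l2[None, :, :]   # pairwise differences, shape (len(l1), len(l2), 3)
dists = np.linalg.norm(diff, axis=2)     # pairwise distances, shape (len(l1), len(l2))

# position of the smallest entry: row i in l1, column j in l2
i, j = np.unravel_index(np.argmin(dists), dists.shape)
shortest = dists[i, j]
```

For these points the closest pair is l1[0] and l2[1], at distance 1.0.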
You can use itertools.product to get all the combinations, then use min:
l1 = [[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]]
l2 = [[x,y,z],[x,y,z],[x,y,z]]
from itertools import product
min(numpy.linalg.norm(l1_element - l2_element) for l1_element,l2_element in product(l1,l2))
If you have many, many, many points this is a great use for a KDTree. Totally overkill for this example but a good learning experience and really fast for a certain class of problems, and can give a bit more information on number of points within a certain distance.
import numpy as np
from scipy.spatial import cKDTree as KDTree
#sample data
l1 = [[0,0,0], [4,5,6], [7,6,7], [4,5,6]]
l2 = [[100,3,4], [1,0,0], [10,15,16], [17,16,17], [14,15,16], [-34, 5, 6]]
# make them arrays
l1 = np.asarray(l1)
l2 = np.asarray(l2)
# make KDTrees for both sets of points
t1 = KDTree(l1)
t2 = KDTree(l2)
# we need a distance to not look beyond, if you have real knowledge use it, otherwise guess
maxD = np.linalg.norm(l1[-1] - l2[-1]) # this could be the closest pair, but anything further is certainly not
# get a sparse matrix of all the distances
ans = t1.sparse_distance_matrix(t2, maxD)
# get the minimum distance and points involved
# each key is an (index into l1, index into l2) pair
minA = min([(i,k) for k, i in ans.items()])
print("Minimum distance is {0} between l1={1} and l2={2}".format(minA[0], l1[minA[1][0]], l2[minA[1][1]] ))
What this does is make a KDTree for each set of points, then find all the distances for points within the guess distance and give back the distance and closest point. This post has a writeup of how a KDTree works.