Use NumPy to apply a fixed palette to an image?

I have a NumPy image in RGB bytes, let's say it's this 2x3 image:
img = np.array([[[  0, 255,   0], [255, 255, 255]],
                [[255,   0, 255], [  0, 255, 255]],
                [[255,   0, 255], [  0,   0,   0]]])
I also have a palette that covers every color used in the image. Let's say it's this palette:
palette = np.array([[255,   0, 255],
                    [  0, 255,   0],
                    [  0, 255, 255],
                    [  0,   0,   0],
                    [255, 255, 255]])
Is there some combination of indexing the image against the palette (or vice versa) that will give me a paletted image equivalent to this?
img_p = np.array([[1, 4],
                  [0, 2],
                  [0, 3]])
For comparison, I know the reverse is pretty simple. palette[img_p] will give a result equivalent to img. I'm trying to figure out if there's a similar approach in the opposite direction that will let NumPy do all the heavy lifting.
I know I can just iterate over all the image pixels individually and build my own paletted image. I'm hoping there's a more elegant option.
Okay, so I implemented the various solutions below and ran them over a moderate test set: 20 images, each one 2000x2000 pixels, with a 32-element palette of three-byte colors. Pixels were given random palette indexes. All algorithms were run over the same images.
Timing results:
mostly empty lookup array - 0.89 seconds
np.searchsorted approach - 3.20 seconds
Pandas lookup, single integer - 38.7 seconds
Using == and then aggregating the boolean results - 66.4 seconds
inverting the palette into a dict and using np.apply_along_axis() - Probably ~500 seconds, based on a smaller test set
Pandas lookup with a MultiIndex - Probably ~3000 seconds, based on a smaller test set
Given that the lookup array has a significant memory penalty (and a prohibitive one if there's an alpha channel), I'm going to go with the np.searchsorted approach. The lookup array is significantly faster if you want to spend the RAM on it.

Edit: Here is a faster way that uses np.searchsorted.
def rev_lookup_by_sort(img, palette):
    # Encode each color as a single integer in base (palette.max() + 1).
    M = (1 + palette.max())**np.arange(3)
    p1d, ix = np.unique(palette @ M, return_index=True)
    # Assumes every color in img actually occurs in the palette.
    return ix[np.searchsorted(p1d, img @ M)]
Correctness (by equivalence to rev_lookup_by_dict() in the original answer below):
np.array_equal(
    rev_lookup_by_sort(img, palette),
    rev_lookup_by_dict(img, palette),
)
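To see why the M encoding works (my illustration, not from the original answer): each color becomes a unique base-(palette.max() + 1) integer, so comparing colors reduces to comparing scalars:
M = 256**np.arange(3)  # [1, 256, 65536] when palette.max() == 255
print(np.array([255, 0, 255]) @ M)  # 255 + 0*256 + 255*65536 == 16711935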
Speedup (for a 1000 x 1000 image and a 1000 colors palette):
orig = %timeit -o rev_lookup_by_dict(img, palette)
# 2.47 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
v2 = %timeit -o rev_lookup_by_sort(img, palette)
# 71.8 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> orig.average / v2.average
34.46
So the answer using np.searchsorted is about 34x faster at that size.
Original answer
An initial shot gives a slowish version (hopefully we can do better). It uses a dict, where keys are colors as tuples.
def rev_lookup_by_dict(img, palette):
    # Map each palette color, as a tuple, to its palette index.
    d = {tuple(v): k for k, v in enumerate(palette)}
    def func(pix):
        return d.get(tuple(pix), -1)
    return np.apply_along_axis(func, -1, img)

img_p = rev_lookup_by_dict(img, palette)
Notice that "color not found" is expressed as -1 in img_p.
On your (modified) data:
>>> img_p
array([[1, 4],
       [0, 2],
       [0, 3]])
Larger example:
# setup
from math import isqrt
w, h = 1000, 1000
s = isqrt(w * h)
palette = np.random.randint(0, 256, (s, 3))
img = palette[np.random.randint(0, s, (w, h))]
Test:
img_p = rev_lookup_by_dict(img, palette)
>>> np.array_equal(palette[img_p], img)
True
Timing:
%timeit rev_lookup_by_dict(img, palette)
# 2.48 s ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's quite awful, but hopefully we can do better.

Faster than a dictionary, but with a 64 MB lookup array.
d = np.zeros((256,256,256), np.int32) # 64 MB!
d[tuple(palette.T)] = np.arange(len(palette))
img_p = d[tuple(img.reshape(-1,3).T)].reshape(*img.shape[:2])
# %%timeit 10 loops, best of 5: 25.8 ms per loop (1000 x 1000)
np.testing.assert_equal(img, palette[img_p])
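If there's an alpha channel, which the question flagged as prohibitive for this approach, a dense (256, 256, 256, 256) table would need 256**4 * 4 bytes = 16 GiB. A hedged sketch of a workaround (my assumption, not part of the original answer): pack RGBA into a single integer key and reuse the searchsorted idea, trading the big table for a sort:
def rev_lookup_rgba(img, palette):
    # Pack each RGBA pixel into one int64 key: r + 256*g + 256**2*b + 256**3*a.
    M = 256**np.arange(4)
    keys, ix = np.unique(palette.astype(np.int64) @ M, return_index=True)
    # Assumes every image color occurs in the palette.
    return ix[np.searchsorted(keys, img.astype(np.int64) @ M)]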

If you can use Pandas in addition to NumPy, you can use a Pandas MultiIndex as a sort of sparse array:
inverse_palette = pd.Series(np.arange(len(palette)),
                            index=pd.MultiIndex.from_arrays(palette.T)).sort_index()
img_p = np.apply_along_axis(lambda px: inverse_palette[tuple(px)], 2, img)
That's really slow, though. You can do a bit better by converting the colors into integers first:
def collapse_bytes(array):
    # Pack the last axis (e.g. R, G, B) into a single uint32 per pixel.
    result = np.zeros(array.shape[:-1], np.uint32)
    for i in range(array.shape[-1]):
        result = result * 256 + array[..., i]
    return result

inverse_palette = pd.Series(np.arange(len(palette)),
                            index=collapse_bytes(palette)).sort_index()
img_p = inverse_palette[collapse_bytes(img).flat].to_numpy()\
        .reshape(img.shape[:-1])
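A quick sanity check (my addition), reusing the round-trip test from above:
np.array_equal(palette[img_p], img)  # True when every image color is in the palette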

Related

Find the biggest cross shape in an array of coordinates

Please see this image before reading :)
I am finding a centroid coordinate based on the biggest product of the left/right/top/down arm lengths.
The code below works, but it never finishes with a bigger array.
How can I optimize this?
(If numpy matters, I am passing region to find_centroid with region=region_coordinates.tolist())
def find_centroid(region):
    centroid = region[0]
    coord_weight = 0
    for coord in region:
        new_coord_weight = weight_calc(region, coord, -1, 0) * weight_calc(
            region, coord, 1, 0) * weight_calc(region, coord, 0, -1) * weight_calc(region, coord, 0, 1)
        if new_coord_weight > coord_weight:
            coord_weight = new_coord_weight
            centroid = coord
    return centroid

def weight_calc(region, coord, xinc, yinc):
    weight = 1
    x = coord[0]
    y = coord[1]
    while True:
        if [x, y] in region:
            weight += 1
            x += xinc
            y += yinc
        else:
            break
    return weight
And for a test:
array_test = [[0, 0], [1, 0], [1, 1], [0, 1], [2, 1], [2, 2], [3, 1], [3, 0], [2, 0], [3, 2]]
print(find_centroid(array_test))
The infinite loop, explained
This part of your code will get you stuck in an infinite loop if region is a numpy array:
while True:
if [x, y] in region:
...
That's because, when used on an array, the in operator returns True if any element of the list matches any element of the array's sublists.
Instead, you can use NumPy's any and all methods:
if (np.array(region) == [x, y]).all(axis=1).any(axis=0):
The all(axis=1) checks, for every sublist, whether both values are equal and in the correct order. That gives an array of booleans; if any of them is True, there is at least one match.
Casting either of the two lists into a numpy array is enough to make this test possible.
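Wrapped as a small helper (my sketch), since the same membership test recurs throughout this answer:
def in_region(region, coord):
    # True if coord matches one of region's rows exactly
    return (np.asarray(region) == coord).all(axis=1).any()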
But it will work if...
If both operands are lists, the in operator works as expected, but in that case you should make sure that region and each of its sublists are all plain lists, not numpy arrays. Casting alone won't work. Here is why:
import numpy as np
array_test = [[0, 0], [1, 0]]
print([1, 1] in array_test)  # prints False, as expected

# numpy always compares element-wise when both operands have the same length
print([1, 1] == np.array([1, 0]))            # prints [True, False]
print(np.array([1, 1]) == np.array([1, 0]))  # prints [True, False]  (*)

# Errors when "in" is ambiguous
print([1, 1] == np.array(array_test))        # prints [[False False] [ True False]]
print([1, 1] in np.array(array_test))        # prints True as explained, because there is at least one True
print([1, 1] in list(np.array(array_test)))  # error: numpy doesn't know how to evaluate the result of (*)
Another version
Here is my way of doing this. There might be better ways, it's just my two cents.
Pre-filtering the potential centroids
First, I'd compose all possible intersections (I'll call them "centers" from now on) in the region. To do that, I count every x coordinate and every y coordinate. To make it easy, I will use numpy.
import numpy as np
from itertools import product

array_test = np.array(array_test)  # needed for the [:, 0] indexing below

# We count every x value. We keep those that are present at least twice.
x_counts = dict(zip(*np.unique(array_test[:, 0], return_counts=True)))
y_counts = dict(zip(*np.unique(array_test[:, 1], return_counts=True)))
# If an x is present only once, then there cannot be any center in this column.
x_inter = [coord for coord, count in x_counts.items() if count >= 2]
# Same with y and rows.
y_inter = [coord for coord, count in y_counts.items() if count >= 2]
# Next, we create all combinations of (x, y)
# and keep only the combinations present in our region.
centers = np.array([(x, y) for x, y in product(x_inter, y_inter)
                    if (array_test == np.array([x, y])).all(axis=1).any()])
Measuring arm lengths
To calculate the power of our centers, we first use a function to measure the arm length. Let's make it a bit configurable, with a direction argument.
# Since we are in 2D with no diagonals, there are four possible directions.
directions = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])

def get_arm_length(center, direction):
    position = center + direction  # going one step in the direction
    # We keep track of the length in the direction. Starting at 1, like
    # weight_calc in the question, so the product over the four arms
    # is never zeroed out.
    length = 1
    # adding 1 as long as the next step in the direction is in the region
    while (array_test == position).all(axis=1).any():
        position += direction
        length += 1
    return length
Measuring every potential centroid
Now we can test the four directions for each of our potential centroids (selected above), keeping the best one along the way.
best_center = (0, [-1, -1])  # => (power, center_coords)
for center in centers:
    # Setting to 1, which is the identity element of the product (x * 1 == x).
    power = 1
    for direction in directions:
        # We multiply by the arm length along each of the four directions.
        power *= get_arm_length(center, direction)
    # If a more powerful one is found, we store its power and coords.
    if power > best_center[0]:
        best_center = power, center
# At this point, we have found the most powerful center, which is our centroid.
Putting it all together
Here is the full code.
def find_centroid2(region):
    region = np.array(region)
    # Directions:
    directions = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])

    def get_arm_length(center, direction):
        position = center + direction
        length = 1
        while (region == position).all(axis=1).any(axis=0):
            position += direction
            length += 1
        return length

    # Intersections:
    x_counts = dict(zip(*np.unique(region[:, 0], return_counts=True)))
    y_counts = dict(zip(*np.unique(region[:, 1], return_counts=True)))
    x_inter = [coord for coord, count in x_counts.items() if count >= 2]
    y_inter = [coord for coord, count in y_counts.items() if count >= 2]
    centers = np.array([(x, y) for x, y in product(x_inter, y_inter)
                        if (region == np.array([x, y])).all(axis=1).any()])

    # Measuring each center's "power":
    best_center = (0, [-1, -1])  # => (power, center_coords)
    for center in centers:
        power = 1
        for direction in directions:
            power *= get_arm_length(center, direction)
        if power > best_center[0]:
            best_center = power, center
    return best_center[1]
An optimisation's optimisation
Instead of testing all virtual centers to keep the ones that belong to our region, we can instead filter our region and keep the cells that have a coordinate counted twice or more.
def find_centroid3(region):
    region = np.array(region)
    # Directions:
    directions = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])

    def get_arm_length(center, direction):
        position = center + direction
        length = 1
        while (region == position).all(axis=1).any(axis=0):
            position += direction
            length += 1
        return length

    # Intersections:
    # It's better to filter the cells instead of computing and testing all combinations.
    x_counts = [x[0] for x in zip(*np.unique(region[:, 0], return_counts=True)) if x[1] >= 2]
    y_counts = [y[0] for y in zip(*np.unique(region[:, 1], return_counts=True)) if y[1] >= 2]
    centers = [[x, y] for x, y in region if x in x_counts or y in y_counts]

    # Measuring each center's "power":
    best_center = (0, [-1, -1])  # => (power, center_coords)
    for center in centers:
        power = 1
        for direction in directions:
            power *= get_arm_length(center, direction)
        if power > best_center[0]:
            best_center = power, center
    return best_center[1]
Comparing V2
The random region preparation, with a lot of cells.
# Keeping the grid fairly big and filled:
# 150*150 grid (22'500 cells) with 15'000 filled cells max.
array_test = np.random.randint(150, size=(15000, 2))  # => len = 15'000
# Getting rid of duplicates, else they will mess with the counting.
# Assuming your own grids also don't have any.
new_array = [list(array_test[0])]
for elem in array_test[1:]:
    if (elem != np.array(new_array)).any(axis=1).all():
        new_array.append(elem)
array_test = np.array(new_array)  # => len = 10'959, all are unique cells
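As an aside (my note, not in the original): np.unique can deduplicate rows directly, although it also sorts them, which the loop above avoids:
array_test = np.unique(array_test, axis=0)  # axis=0 requires NumPy >= 1.13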
Results:
find_centroid(array_test)   # Original version.   Result = [64 127]
# 16 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
find_centroid2(array_test)  # Proposed version 1. Result = [61 127]
# 13.1 s ± 87.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
find_centroid3(array_test)  # Proposed version 2. Result = [61, 127]
# 9.49 s ± 47.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I tried several grid sizes, keeping it half filled at max.
Comparing V1
[Obsolete]
Your original code (corrected for dealing with the infinite loop):
%%timeit
find_centroid(array_test) # Result => array([73, 16])
# 21.4 s ± 397 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The proposed code:
%%timeit
find_centroid2(array_test) # Result => array([73, 16])
# 17.2 s ± 76.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's not a tremendous optimisation, but it's an optimisation nonetheless.
Maybe some other reviews and other ideas can make it better.
For anybody else who needs a better (maybe perfect) answer: use a small shape passed through erosion over the array in a while loop, removing one border per iteration, until you find:
- one cross: its center coordinates are the centroid
- many crosses: compare them separately for the best one
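A minimal sketch of that erosion idea (my interpretation; the original gives no code, and SciPy is assumed available): rasterise the region into a boolean grid and erode it with a plus-shaped structuring element in a loop; the cells surviving the last non-empty erosion are the candidate cross centers.
import numpy as np
from scipy.ndimage import binary_erosion

def centroid_by_erosion(region):
    region = np.asarray(region)
    # Rasterise the coordinate list into a dense boolean grid.
    grid = np.zeros(region.max(axis=0) + 1, dtype=bool)
    grid[tuple(region.T)] = True
    # Plus-shaped structuring element: a cell survives only if its
    # four direct neighbours are also filled.
    cross = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=bool)
    prev = grid
    while grid.any():
        prev, grid = grid, binary_erosion(grid, structure=cross)
    # One survivor: the centroid. Several: compare them separately.
    return np.argwhere(prev)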

Broadcasting N-dim array to (N+1)-dim array and summing on all but 1 dim

Assume you have a numpy array with shape (a,b,c) and a boolean mask of shape (a,b,c,d).
I would like to apply the mask to the array iterating over the last axis, sum the masked array along the first three axes, and obtain a list (or an array) of length/shape (d,).
I tried to do this with a list comprehension:
Result = [np.sum(Array[Mask[:,:,:,i]]) for i in range(d)]
It works, but it does not look very pythonic and it is a bit slow as well.
I also tried something like
Array = Array[:,:,:,np.newaxis]
Result = np.sum(Array[Mask], axis=(0,1,2))
but of course this doesn't work, since the dimension of the Mask along the last axis, d, is larger than the dimension of the last axis of the Array, 1.
Also, consider that each axis could have dimension of order 100 or 200, so repeating the Array d times along a new last axis using np.repeat would be really memory intensive, and I would like to avoid this.
Are there any other faster and more pythonic alternatives to the list comprehension?
The most straightforward way of broadcasting an N-dimensional array to a matching (N + 1)-dimensional array is to use np.broadcast_to():
import numpy as np
arr = np.random.randint(0, 100, (2, 3))
mask = np.random.randint(0, 2, (2, 3, 4), dtype=bool)
b_arr = np.broadcast_to(arr[..., None], mask.shape)
print(mask.shape == b_arr.shape)
# True
However, as @hpaulj already pointed out, you cannot use mask for slicing b_arr without losing the dimensions.
Given that you just want to sum the elements together and summing zeroes "does not hurt", you could simply multiply your array and your mask element-wise, so as to keep the correct dimensions; the elements that are False in the mask are then irrelevant to the subsequent sum of the corresponding array elements:
result = np.sum(b_arr * mask, axis=tuple(range(mask.ndim - 1)))
or, since * will do the broadcasting automatically:
result = np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
without the need to use np.broadcast_to() in the first place (but you still need to match the number of dimensions, i.e. using arr[..., None] and not just arr).
As @PaulPanzer already pointed out, since you want to sum up over all but one dimension, this can be further simplified using np.matmul()/@:
result2 = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(np.all(result == result2))
# True
For fancier operations involving the summation, please have a look at np.einsum().
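For instance (my addition), the same reduction written directly with np.einsum, which upcasts the boolean mask automatically:
result3 = np.einsum('ab,abd->d', arr, mask)  # 'abc,abcd->d' for the question's 3-D case
print(np.all(result == result3))
# True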
EDIT
The catch with broadcasting is that it will create temporary arrays during the evaluation of your expressions.
With the numbers you seem to be dealing with, I simply cannot use the broadcasted arrays, as I run into MemoryError, but time-wise the element-wise multiplication may still be a better approach than what you originally proposed.
Alternatively, if you are after speed, you could do this at a somewhat lower level with explicit looping in Cython or Numba.
Below you can find a couple of Numba-based solutions (working on ravel()-ed data):
_vector_matrix_product(): does not use any temporary arrays
_vector_matrix_product_mp(): same as above, but using parallel execution
_vector_matrix_product_sum(): uses np.sum() and parallel execution
import numpy as np
import numba as nb


@nb.jit(nopython=True)
def _vector_matrix_product(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in range(rows):
            for j in range(cols):
                result_arr[i] += vect_arr[j] * mat_arr[i, j]
    else:
        for i in range(rows):
            for j in range(cols):
                result_arr[j] += vect_arr[i] * mat_arr[i, j]


@nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_mp(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in nb.prange(rows):
            for j in nb.prange(cols):
                result_arr[i] += vect_arr[j] * mat_arr[i, j]
    else:
        for i in nb.prange(rows):
            for j in nb.prange(cols):
                result_arr[j] += vect_arr[i] * mat_arr[i, j]


@nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_sum(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in nb.prange(rows):
            result_arr[i] = np.sum(vect_arr * mat_arr[i, :])
    else:
        for j in nb.prange(cols):
            result_arr[j] = np.sum(vect_arr * mat_arr[:, j])


def vector_matrix_product(
        vect_arr,
        mat_arr,
        swap=False,
        dtype=None,
        mode=None):
    rows, cols = mat_arr.shape
    if not dtype:
        dtype = (vect_arr[0] * mat_arr[0, 0]).dtype
    if not swap:
        result_arr = np.zeros(cols, dtype=dtype)
    else:
        result_arr = np.zeros(rows, dtype=dtype)
    if mode == 'sum':
        _vector_matrix_product_sum(vect_arr, mat_arr, result_arr)
    elif mode == 'mp':
        _vector_matrix_product_mp(vect_arr, mat_arr, result_arr)
    else:
        _vector_matrix_product(vect_arr, mat_arr, result_arr)
    return result_arr


np.random.seed(0)
arr = np.random.randint(0, 100, (2, 3, 4))
mask = np.random.randint(0, 2, (2, 3, 4, 5), dtype=bool)

target = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(target)
# [820 723 861 486 408]

result1 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
print(result1)
# [820 723 861 486 408]

result2 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
print(result2)
# [820 723 861 486 408]

result3 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
print(result3)
# [820 723 861 486 408]
with improved timing over any list-comprehension-based solutions:
arr = np.random.randint(0, 100, (256, 256, 256))
mask = np.random.randint(0, 2, (256, 256, 256, 128), dtype=bool)
%timeit np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
# MemoryError
%timeit arr.ravel() @ mask.reshape(-1, mask.shape[-1])
# MemoryError
%timeit np.array([np.sum(arr * mask[..., i], axis=tuple(range(mask.ndim - 1))) for i in range(mask.shape[-1])])
# 24.1 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array([np.sum(arr[mask[..., i]]) for i in range(mask.shape[-1])])
# 46 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
# 408 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
# 1.63 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
# 7.17 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As expected, the JIT-accelerated version is the fastest, and enforcing parallelism does not result in improved speed-ups.
Note also that the approach with element-wise multiplication is faster than slicing (approx. twice as fast for these benchmarks).
EDIT 2
Following @max9111's suggestion, looping first over rows and then over columns makes the most time-consuming loop run on contiguous data, resulting in a significant speed-up.
Without this trick, _vector_matrix_product_sum() and _vector_matrix_product_mp() would run at essentially the same speed.
How about
Array.reshape(-1) @ Mask.reshape(-1, d)
Since you are summing over the first three axes anyway, you may as well merge them, after which it is easy to see that the operation can be written as a matrix-vector product.
Example:
a, b, c, d = 4, 5, 6, 7
Mask = np.random.randint(0, 2, (a, b, c, d), bool)
Array = np.random.randint(0, 10, (a, b, c))
[np.sum(Array[Mask[:, :, :, i]]) for i in range(d)]
# [310, 237, 253, 261, 229, 268, 184]
Array.reshape(-1) @ Mask.reshape(-1, d)
# array([310, 237, 253, 261, 229, 268, 184])

Best way to create a three channel image with a fixed color?

I want to create a numpy three channel image with dimensions 10x5 and a fixed color of [0, 1, 2]. I'm currently doing it using the following code:
x = np.array([0, 1, 2])
x = np.array((x,) * 10)
x = np.array((x,) * 5)
This works, but is not very elegant. What is the best / most efficient way to achieve the same with less code?
Alternatively, you can use np.full:
np.full((10, 5, 3), [0, 1, 2])
It creates an array of the given shape (10, 5, 3) and fills it with the constant value [0, 1, 2].
Use np.broadcast_to to get a view into the input 1D array -
np.broadcast_to([0, 1, 2],(5,10,3))
If you need a copy that has its own data, simply append .copy() -
np.broadcast_to([0, 1, 2],(5,10,3)).copy()
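One caveat worth noting (my addition): the broadcast view is read-only, so writes require the copy:
b = np.broadcast_to([0, 1, 2], (5, 10, 3))
b[0, 0, 0] = 9  # raises ValueError: assignment destination is read-only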
Or use np.tile -
np.tile([0,1,2],(5,10,1))
The benefit of having a view is that there's no extra memory overhead, and creating it is virtually free regardless of the input size:
In [17]: x0 = np.arange(3)
In [18]: %timeit np.broadcast_to(x0,(5,10,len(x0)))
100000 loops, best of 3: 3.16 µs per loop
In [19]: x0 = np.arange(3000)
In [20]: %timeit np.broadcast_to(x0,(5,10,len(x0)))
100000 loops, best of 3: 3.08 µs per loop
What about slice notation?
a = np.empty((10, 5, 3))
a[:, :, 0] = 0
a[:, :, 1] = 1
a[:, :, 2] = 2
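A small caveat here (my addition): np.empty defaults to float64, so for byte images you may want an explicit dtype:
a = np.empty((10, 5, 3), dtype=np.uint8)  # then fill the channels as above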

More efficient way to multiply each column of a 2-d matrix by each slice of a 3-d matrix

I have an 8x8x25000 array W and an 8x25000 array r. I want to multiply each 8x8 slice of W by each column (8x1) of r and save the result in Wres, which will end up being an 8x25000 matrix.
I am accomplishing this using a for loop as such:
for i in range(0, 25000):
    Wres[:, i] = np.matmul(W[:, :, i], r[:, i])
But this is slow and I am hoping there is a quicker way to accomplish this.
Any ideas?
Matmul can broadcast as long as the two arrays' stacked (batch) dimensions match. From the docs:
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
Thus, you have to perform 2 operations prior to matmul:
import numpy as np
a = np.random.rand(8,8,100)
b = np.random.rand(8, 100)
transpose a and b so that the first axis holds the 100 slices
add an extra dimension to b so that b.shape == (100, 8, 1)
Then:
at = a.transpose(2, 0, 1) # swap to shape 100, 8, 8
bt = b.T[..., None] # swap to shape 100, 8, 1
c = np.matmul(at, bt)
c is now 100, 8, 1, reshape back to 8, 100:
c = np.squeeze(c).swapaxes(0, 1)
or
c = np.squeeze(c).T
And last, a one-liner just for convenience:
c = np.squeeze(np.matmul(a.transpose(2, 0, 1), b.T[..., None])).T
An alternative to np.matmul is np.einsum, which does this in one shorter and arguably more palatable line of code, with no method chaining.
Example arrays:
np.random.seed(123)
w = np.random.rand(8,8,25000)
r = np.random.rand(8,25000)
wres = np.einsum('ijk,jk->ik',w,r)
# a quick check on result equivalency to your loop
print(np.allclose(np.matmul(w[:, :, 1], r[:, 1]), wres[:, 1]))
True
Timing is equivalent to @Imanol's solution, so take your pick of the two; both are roughly 25x faster than looping here (2.5 ms vs. 64 ms). einsum is competitive because of the size of the arrays: with arrays larger than these it would likely win out, and it would lose for smaller ones. See this discussion for more.
def solution1():
    return np.einsum('ijk,jk->ik', w, r)

def solution2():
    return np.squeeze(np.matmul(w.transpose(2, 0, 1), r.T[..., None])).T

def solution3():
    Wres = np.empty((8, 25000))
    for i in range(0, 25000):
        Wres[:, i] = np.matmul(w[:, :, i], r[:, i])
    return Wres
%timeit solution1()
100 loops, best of 3: 2.51 ms per loop
%timeit solution2()
100 loops, best of 3: 2.52 ms per loop
%timeit solution3()
10 loops, best of 3: 64.2 ms per loop
Credit to: @Divakar

Why is numpy's einsum faster than numpy's built in functions?

Let's start with three arrays of dtype=np.double. Timings are performed on an Intel CPU using numpy 1.7.1 compiled with icc and linked to Intel's MKL. An AMD CPU with numpy 1.6.1 compiled with gcc without MKL was also used to verify the timings. Please note that the timings scale nearly linearly with system size; they are not due to the small overhead incurred by the if statements in the numpy functions, which would show up in microseconds, not milliseconds:
arr_1D=np.arange(500,dtype=np.double)
large_arr_1D=np.arange(100000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)
First let's look at the np.sum function:
np.all(np.sum(arr_3D)==np.einsum('ijk->',arr_3D))
True
%timeit np.sum(arr_3D)
10 loops, best of 3: 142 ms per loop
%timeit np.einsum('ijk->', arr_3D)
10 loops, best of 3: 70.2 ms per loop
Powers:
np.allclose(arr_3D*arr_3D*arr_3D,np.einsum('ijk,ijk,ijk->ijk',arr_3D,arr_3D,arr_3D))
True
%timeit arr_3D*arr_3D*arr_3D
1 loops, best of 3: 1.32 s per loop
%timeit np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D)
1 loops, best of 3: 694 ms per loop
Outer product:
np.all(np.outer(arr_1D,arr_1D)==np.einsum('i,k->ik',arr_1D,arr_1D))
True
%timeit np.outer(arr_1D, arr_1D)
1000 loops, best of 3: 411 us per loop
%timeit np.einsum('i,k->ik', arr_1D, arr_1D)
1000 loops, best of 3: 245 us per loop
All of the above are roughly twice as fast with np.einsum. These should be apples-to-apples comparisons, as everything is specifically of dtype=np.double. I would expect the speed-up in an operation like this:
np.allclose(np.sum(arr_2D*arr_3D),np.einsum('ij,oij->',arr_2D,arr_3D))
True
%timeit np.sum(arr_2D*arr_3D)
1 loops, best of 3: 813 ms per loop
%timeit np.einsum('ij,oij->', arr_2D, arr_3D)
10 loops, best of 3: 85.1 ms per loop
Einsum seems to be at least twice as fast for np.inner, np.outer, np.kron, and np.sum, regardless of axis selection. The primary exception is np.dot, as it calls DGEMM from a BLAS library. So why is np.einsum faster than other numpy functions that are equivalent?
The DGEMM case for completeness:
np.allclose(np.dot(arr_2D,arr_2D),np.einsum('ij,jk',arr_2D,arr_2D))
True
%timeit np.einsum('ij,jk',arr_2D,arr_2D)
10 loops, best of 3: 56.1 ms per loop
%timeit np.dot(arr_2D,arr_2D)
100 loops, best of 3: 5.17 ms per loop
The leading theory is from @seberg's comment that np.einsum can make use of SSE2, while numpy's ufuncs will not until numpy 1.8 (see the change log). I believe this is the correct answer, but have not been able to confirm it. Some limited proof can be found by changing the dtype of the input arrays and observing the speed difference, and by the fact that not everyone observes the same trends in timings.
First off, there's been a lot of past discussion about this on the numpy list. For example, see:
http://numpy-discussion.10968.n7.nabble.com/poor-performance-of-sum-with-sub-machine-word-integer-types-td41.html
http://numpy-discussion.10968.n7.nabble.com/odd-performance-of-sum-td3332.html
Some of it boils down to the fact that einsum is new, and is presumably trying to be better about cache alignment and other memory access issues, while many of the older numpy functions focus on an easily portable implementation over a heavily optimized one. I'm just speculating there, though.
However, some of what you're doing isn't quite an "apples-to-apples" comparison.
In addition to what @Jamie already said, sum uses a more appropriate accumulator for arrays.
sum is more careful about checking the type of the input and using an appropriate accumulator. For example, consider the following:
In [1]: x = 255 * np.ones(100, dtype=np.uint8)
In [2]: x
Out[2]:
array([255, 255, 255, ..., 255, 255, 255], dtype=uint8)
Note that the sum is correct:
In [3]: x.sum()
Out[3]: 25500
While einsum will give the wrong result:
In [4]: np.einsum('i->', x)
Out[4]: 156
But if we use a less limited dtype, we'll still get the result you'd expect:
In [5]: y = 255 * np.ones(100)
In [6]: np.einsum('i->', y)
Out[6]: 25500.0
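If needed (my addition), einsum can also be given a wider accumulator explicitly via its dtype argument, which restores the expected result for the uint8 input:
In [7]: np.einsum('i->', x, dtype=np.int64)
Out[7]: 25500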
Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double-check that Seberg's comment about SSE2 was valid.
To perform the test, a new python 2.7 install was created; numpy 1.7 and 1.8 were compiled with icc using standard options on an AMD Opteron core running Ubuntu.
This is the test run both before and after the 1.8 upgrade:
import numpy as np
import timeit

arr_1D = np.arange(5000, dtype=np.double)
arr_2D = np.arange(500**2, dtype=np.double).reshape(500, 500)
arr_3D = np.arange(500**3, dtype=np.double).reshape(500, 500, 500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print '----------------------\n'

print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print '----------------------\n'

print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print '----------------------\n'

print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5) / 5
print '----------------------\n'
Numpy 1.7.1:
Summation test:
0.172988510132
0.0934836149216
----------------------
Power test:
1.93524689674
0.839519000053
----------------------
Outer test:
0.130380821228
0.121401786804
----------------------
Einsum test:
0.979052495956
0.126066613197
Numpy 1.8:
Summation test:
0.116551589966
0.0920487880707
----------------------
Power test:
1.23683619499
0.815982818604
----------------------
Outer test:
0.131808176041
0.127472200394
----------------------
Einsum test:
0.781750011444
0.129271841049
I think this is fairly conclusive evidence that SSE plays a large role in the timing differences. It should be noted that, on repeating these tests, the timings vary by only ~0.003 s. The remaining difference should be covered in the other answers to this question.
I think these timings explain what's going on:
a = np.arange(1000, dtype=np.double)
%timeit np.einsum('i->', a)
100000 loops, best of 3: 3.32 us per loop
%timeit np.sum(a)
100000 loops, best of 3: 6.84 us per loop
a = np.arange(10000, dtype=np.double)
%timeit np.einsum('i->', a)
100000 loops, best of 3: 12.6 us per loop
%timeit np.sum(a)
100000 loops, best of 3: 16.5 us per loop
a = np.arange(100000, dtype=np.double)
%timeit np.einsum('i->', a)
10000 loops, best of 3: 103 us per loop
%timeit np.sum(a)
10000 loops, best of 3: 109 us per loop
So you basically have an almost constant 3 µs overhead when calling np.sum rather than np.einsum; they run essentially equally fast, but np.sum takes a little longer to get going. Why could that be? My money is on the following:
a = np.arange(1000, dtype=object)
%timeit np.einsum('i->', a)
Traceback (most recent call last):
...
TypeError: invalid data type for einsum
%timeit np.sum(a)
10000 loops, best of 3: 20.3 us per loop
Not sure what is going on exactly, but it seems that np.einsum skips some checks that extract type-specific functions for the multiplications and additions, and goes directly with * and + for standard C types only.
The multidimensional cases are not different:
n = 10; a = np.arange(n**3, dtype=np.double).reshape(n, n, n)
%timeit np.einsum('ijk->', a)
100000 loops, best of 3: 3.79 us per loop
%timeit np.sum(a)
100000 loops, best of 3: 7.33 us per loop
n = 100; a = np.arange(n**3, dtype=np.double).reshape(n, n, n)
%timeit np.einsum('ijk->', a)
1000 loops, best of 3: 1.2 ms per loop
%timeit np.sum(a)
1000 loops, best of 3: 1.23 ms per loop
So: a mostly constant overhead, not faster running once they get down to it.
An update for numpy 1.21.2: numpy's native functions are faster than einsum in almost all cases. Only einsum's outer variant and the sum23 test are faster than the non-einsum versions.
If you can use numpy's native functions, do that.
(Images created with perfplot, a project of mine.)
Code to reproduce the plots:
import numpy
import perfplot


def setup1(n):
    return numpy.arange(n, dtype=numpy.double)


def setup2(n):
    return numpy.arange(n ** 2, dtype=numpy.double).reshape(n, n)


def setup3(n):
    return numpy.arange(n ** 3, dtype=numpy.double).reshape(n, n, n)


def setup23(n):
    return (
        numpy.arange(n ** 2, dtype=numpy.double).reshape(n, n),
        numpy.arange(n ** 3, dtype=numpy.double).reshape(n, n, n),
    )


def numpy_sum(a):
    return numpy.sum(a)


def einsum_sum(a):
    return numpy.einsum("ijk->", a)


perfplot.save(
    "sum.png",
    setup=setup3,
    kernels=[numpy_sum, einsum_sum],
    n_range=[2 ** k for k in range(10)],
)


def numpy_power(a):
    return a * a * a


def einsum_power(a):
    return numpy.einsum("ijk,ijk,ijk->ijk", a, a, a)


perfplot.save(
    "power.png",
    setup=setup3,
    kernels=[numpy_power, einsum_power],
    n_range=[2 ** k for k in range(9)],
)


def numpy_outer(a):
    return numpy.outer(a, a)


def einsum_outer(a):
    return numpy.einsum("i,k->ik", a, a)


perfplot.save(
    "outer.png",
    setup=setup1,
    kernels=[numpy_outer, einsum_outer],
    n_range=[2 ** k for k in range(13)],
)


def dgemm_numpy(a):
    return numpy.dot(a, a)


def dgemm_einsum(a):
    return numpy.einsum("ij,jk", a, a)


def dgemm_einsum_optimize(a):
    return numpy.einsum("ij,jk", a, a, optimize=True)


perfplot.save(
    "dgemm.png",
    setup=setup2,
    kernels=[dgemm_numpy, dgemm_einsum],
    n_range=[2 ** k for k in range(13)],
)


def dot_numpy(a):
    return numpy.dot(a, a)


def dot_einsum(a):
    return numpy.einsum("i,i->", a, a)


perfplot.save(
    "dot.png",
    setup=setup1,
    kernels=[dot_numpy, dot_einsum],
    n_range=[2 ** k for k in range(20)],
)


def sum23_numpy(data):
    a, b = data
    return numpy.sum(a * b)


def sum23_einsum(data):
    a, b = data
    return numpy.einsum("ij,oij->", a, b)


perfplot.save(
    "sum23.png",
    setup=setup23,
    kernels=[sum23_numpy, sum23_einsum],
    n_range=[2 ** k for k in range(10)],
)
