Outer sum of two numpy arrays along specified axes - python

I have two numpy.array objects x and y where x.shape is (P, K) and y.shape is (T, K). I want to do an outer sum on these two objects such that the result has shape (P, T, K). I'm aware of the np.add.outer and the np.einsum functions but I couldn't get them to do what I wanted.
The following gives the intended result.
x_plus_y = np.zeros((P, T, K))
for k in range(K):
    x_plus_y[:, :, k] = np.add.outer(x[:, k], y[:, k])
But I've got to imagine there's a faster way!

One option is to add a new dimension to x and add using numpy broadcasting:
out = x[:, None] + y
or, as @FirefoxMetzger pointed out, it's more readable to be explicit with the dimensions:
out = x[:, None, :] + y[None, :, :]
import numpy as np
P, K, T = np.random.randint(10, 30, size=3)
x = np.random.rand(P, K)
y = np.random.rand(T, K)
# Reference result from the loop in the question.
x_plus_y = np.zeros((P, T, K))
for k in range(K):
    x_plus_y[:, :, k] = np.add.outer(x[:, k], y[:, k])
# The broadcast sum matches the loop exactly.
assert (x_plus_y == x[:, None] + y).all()
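For comparison, np.add.outer can be made to produce the same result, although it builds a much larger intermediate. The following is only a sketch with made-up sizes (P, T, K and the rng seed are assumptions for illustration): np.add.outer(x, y) has shape (P, K, T, K), and the wanted entries are those where the two K indices agree.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T, K = 3, 4, 5  # illustrative sizes only
x = rng.random((P, K))
y = rng.random((T, K))

# Broadcasting: (P, 1, K) + (1, T, K) -> (P, T, K).
out_broadcast = x[:, None, :] + y[None, :, :]

# np.add.outer pairs every element of x with every element of y,
# giving shape (P, K, T, K).
full = np.add.outer(x, y)

# Keep only the entries where both K indices agree. np.diagonal puts
# the diagonal axis last, so the result has shape (P, T, K).
out_outer = np.diagonal(full, axis1=1, axis2=3)
```

The broadcasting version is preferable in practice because it never allocates the (P, K, T, K) array.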


Filling a matrix in a numpythonic way

I have a L x L matrix A, which I currently fill in using the following code:
A = np.zeros((L, L))
for J in range(X):
    for a in range(L):
        for b in range(L):
            A[a][b] += alpha[J, a] * O[b, J] * A_old[a, b] * betas[J+2, b]
Here X is an integer defined elsewhere, alpha and betas are of shape (X, L), O is of shape (L, X), and A_old is of shape (L, L). I'm concerned about the speed of this code and am trying to find a more numpythonic way to fill in this matrix. My instinct is to do something like:
for J in range(X):
    A += alpha[J, :] * O[:, J] * A_old[:, :] * betas[J+2, :]
But this doesn't broadcast the operations correctly because of the A_old matrix (the resulting shape is right, but the values are not). What's a good way to condense this loop using numpy?
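One possible way to condense the loop, sketched under assumptions: A_old[a, b] does not depend on J, so it can be factored out, and what remains is a sum over J that is a plain matrix product. The attempt above goes wrong because alpha[J, :] broadcasts along the last axis (b) instead of the first (a). The sizes below are made up, and betas is assumed to have at least X + 2 rows so that betas[J + 2] is valid for every J.

```python
import numpy as np

rng = np.random.default_rng(0)
L, X = 4, 6  # illustrative sizes only
alpha = rng.random((X, L))
O = rng.random((L, X))
A_old = rng.random((L, L))
# Assumption: betas has X + 2 rows, so betas[J + 2] exists for all J < X.
betas = rng.random((X + 2, L))

# Reference: the triple loop from the question.
A_loop = np.zeros((L, L))
for J in range(X):
    for a in range(L):
        for b in range(L):
            A_loop[a, b] += alpha[J, a] * O[b, J] * A_old[a, b] * betas[J + 2, b]

# (O.T * betas[2:X + 2])[J, b] == O[b, J] * betas[J + 2, b], shape (X, L).
# alpha.T is (L, X), so the matrix product sums over J and yields (L, L).
A_vec = A_old * (alpha.T @ (O.T * betas[2:X + 2]))

# The same contraction spelled out with einsum.
A_einsum = A_old * np.einsum("ja,bj,jb->ab", alpha, O, betas[2:X + 2])
```

The matrix-product form hands the sum over J to BLAS, which is usually the faster of the two for large X and L.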

Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?

I am trying to increase the speed of an aerodynamics function in Python.
Function Set:
import numpy as np
from numba import njit
def calculate_velocity_induced_by_line_vortices(
    points, origins, terminations, strengths, collapse=True
):
    # Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
    # This will allow NumPy to broadcast the upcoming subtractions.
    points = np.expand_dims(points, axis=1)
    # Define the vectors from the vortex to the points. r_1 and r_2 now both are of
    # shape (N x M x 3). Each row/column pair holds the vector associated with each
    # point/vortex pair.
    r_1 = points - origins
    r_2 = points - terminations
    r_0 = r_1 - r_2
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = (
        r_1_cross_r_2[:, :, 0] ** 2
        + r_1_cross_r_2[:, :, 1] ** 2
        + r_1_cross_r_2[:, :, 2] ** 2
    )
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)
    # Define the radius of the line vortices. This is used to get rid of any
    # singularities.
    radius = 3.0e-16
    # Set the lengths and the absolute magnitudes to zero, at the places where the
    # lengths and absolute magnitudes are less than the vortex radius.
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
    # Calculate the vector dot products.
    r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
    r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)
    # Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
    # errors. k is of shape (N x M)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
        )
        # Set the shape of k to be (N x M x 1) to support numpy broadcasting in the
        # subsequent multiplication.
        k = np.expand_dims(k, axis=2)
        induced_velocities = k * r_1_cross_r_2
    # Set the values of the induced velocity to zero where there are singularities.
    induced_velocities[np.isinf(induced_velocities)] = 0
    induced_velocities[np.isnan(induced_velocities)] = 0
    if collapse:
        induced_velocities = np.sum(induced_velocities, axis=1)
    return induced_velocities
@njit
def nb_2d_explicit_norm(vectors):
    return np.sqrt(
        (vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
    )
@njit
def nb_2d_explicit_cross(a, b):
    e = np.zeros_like(a)
    e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
    e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
    e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
    return e
This function is used by Ptera Software, an open-source solver for flapping wing aerodynamics. As shown by the profile output below, it is by far the largest contributor to Ptera Software's run time.
Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this below 1 minute.
The function takes in a group of points, origins, terminations, and strengths. At every point, it finds the induced velocity due to the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, then the output is the cumulative velocity induced at each point due to the vortices. If false, the function outputs each vortex's contribution to the velocity at each point.
During a typical run, the velocity function is called approximately 2000 times. At first, the calls involve vectors with relatively small input arguments (around 200 points, origins, terminations, and strengths). Later calls involve large input arguments (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for all size inputs, but increasing the speed of large input calls is more important.
For testing, I recommend running the following script with your own implementation of the function:
import timeit
import matplotlib.pyplot as plt
import numpy as np
n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3
times_py = []
for i in range(max_oom - min_oom + 1):
    n_elem = 10 ** i
    n_elem_pretty = np.format_float_scientific(n_elem, 0)
    print("Number of elements: " + n_elem_pretty)
    # Benchmark Python.
    print("\tBenchmarking Python...")
    setup = '''
import numpy as np
these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')
def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
                                                strengths, collapse=True):
    pass  # Placeholder: paste your implementation of the function body here.
'''
    statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
                                                           these_terminations,
                                                           these_strengths)
'''
    times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
    time_py = min(times)/n_execute
    time_py_pretty = np.format_float_scientific(time_py, 2)
    print("\t\tAverage Time per Loop: " + time_py_pretty + " s")
    # Record the times.
    times_py.append(time_py)
sizes = [10 ** i for i in range(max_oom - min_oom + 1)]
fig, ax = plt.subplots()
ax.plot(sizes, times_py, label='Python')
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
    "Comparison of Different Optimization Methods\nBest of "
    + str(n_repeat)
    + " Runs, each with "
    + str(n_execute)
    + " Loops"
)
ax.legend()
plt.show()
Previous Attempts:
My prior attempts at speeding up this function involved vectorizing it (which worked great, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I tried to use Numba on a modified version of the entire velocity function, my results were much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.
Update 1:
Based on Mercury's comment (which has since been deleted), I replaced
points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations
with two calls to the following function:
@njit
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c
This resulted in a speed increase from 227 s to 220 s. This is better! However, it is still not fast enough.
I also have tried setting the njit fastmath flag to true, and using a numba function instead of calls to np.einsum. Neither increased the speed.
Update 2:
With Jérôme Richard's answer, the run time is now 156 s, which is a decrease of 29%! I'm satisfied enough to accept this answer, but feel free to make other suggestions if you think you can improve on their work!
First of all, Numba can run computations in parallel, which gives faster code, but only if you request it explicitly, mainly through parallel=True and prange. This is useful for big arrays (but not for small ones).
Moreover, your computation is mainly memory bound. Thus, you should avoid creating big arrays when they are not reused multiple times, or more generally when they cannot be recomputed on the fly (in a relatively cheap way). This is the case for r_0 for example.
In addition, the memory access pattern matters: vectorization is more efficient when accesses are contiguous in memory, because the cache and RAM are used more efficiently. Consequently, arr[0, :, :] = 0 should be faster than arr[:, :, 0] = 0. Similarly, arr[:, :, 0] = arr[:, :, 1] = 0 should be much slower than arr[:, :, 0:2] = 0, since the former performs two non-contiguous memory passes while the latter performs only one, more contiguous, pass. Sometimes it can be beneficial to transpose your data so that the following calculations are much faster.
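The contiguity argument can be checked directly from NumPy's stride metadata. This is a minimal sketch, assuming a C-ordered float64 array; the array name and shape are made up for illustration and are not taken from the solver.

```python
import numpy as np

# C order: the last axis varies fastest in memory.
arr = np.zeros((4, 5, 3), dtype=np.float64)

# Bytes to step along each axis: one float64 is 8 bytes.
strides = arr.strides

# Fixing the first index leaves one solid block of memory.
plane = arr[0, :, :]
plane_is_contiguous = plane.flags["C_CONTIGUOUS"]

# Fixing the last index touches only every third element,
# so each write jumps over the other two components.
component = arr[:, :, 0]
component_is_contiguous = component.flags["C_CONTIGUOUS"]

print(strides, plane_is_contiguous, component_is_contiguous)
```

A view that is not contiguous still works, but every pass over it pulls in cache lines that are mostly unused, which is where the slowdown comes from.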
Moreover, Numpy tends to create many temporary arrays that are costly to allocate. This is a huge problem when the input arrays are small. The Numba jit can avoid that in most cases.
Finally, regarding your computation, it may be a good idea to use GPUs for big arrays (definitely not for small ones). You can take a look at cupy or clpy to do that quite easily.
Here is an optimized implementation working on the CPU:
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in prange(c.shape[0]):
        for j in range(c.shape[1]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c
@njit(parallel=True)
def nb_2d_explicit_norm(vectors):
    res = np.empty((vectors.shape[0], vectors.shape[1]))
    for i in prange(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sqrt(vectors[i, j, 0] ** 2 + vectors[i, j, 1] ** 2 + vectors[i, j, 2] ** 2)
    return res
# NOTE: better memory access pattern
@njit(parallel=True)
def nb_2d_explicit_cross(a, b):
    e = np.empty(a.shape)
    for i in prange(e.shape[0]):
        for j in range(e.shape[1]):
            e[i, j, 0] = a[i, j, 1] * b[i, j, 2] - a[i, j, 2] * b[i, j, 1]
            e[i, j, 1] = a[i, j, 2] * b[i, j, 0] - a[i, j, 0] * b[i, j, 2]
            e[i, j, 2] = a[i, j, 0] * b[i, j, 1] - a[i, j, 1] * b[i, j, 0]
    return e
# NOTE: avoid the slow building of temporary arrays
@njit
def cross_absolute_magnitude(cross):
    return cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2
# NOTE: avoid the slow building of temporary arrays again and multiple pass in memory
# Warning: do the work in-place
@njit(parallel=True)
def discard_singularities(arr):
    for i in prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(3):
                if np.isinf(arr[i, j, k]) or np.isnan(arr[i, j, k]):
                    arr[i, j, k] = 0.0
@njit
def compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length):
    return (strengths
        / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
        * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length))
# Compute r_0 = b - c on the fly and return (r_0 . b, r_0 . c).
@njit(parallel=True)
def rDotProducts(b, c):
    assert b.shape == c.shape and b.shape[2] == 3
    n, m = b.shape[0], b.shape[1]
    ab = np.empty((n, m))
    ac = np.empty((n, m))
    for i in prange(n):
        for j in range(m):
            ab[i, j] = 0.0
            ac[i, j] = 0.0
            for k in range(3):
                a = b[i, j, k] - c[i, j, k]
                ab[i, j] += a * b[i, j, k]
                ac[i, j] += a * c[i, j, k]
    return (ab, ac)
# Compute `np.sum(arr, axis=1)` in parallel.
@njit(parallel=True)
def collapseArr(arr):
    assert arr.shape[2] == 3
    n, m = arr.shape[0], arr.shape[1]
    res = np.empty((n, 3))
    for i in prange(n):
        res[i, 0] = np.sum(arr[i, :, 0])
        res[i, 1] = np.sum(arr[i, :, 1])
        res[i, 2] = np.sum(arr[i, :, 2])
    return res
def calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=True):
    r_1 = subtract(points, origins)
    r_2 = subtract(points, terminations)
    # NOTE: r_0 is computed on the fly by rDotProducts
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = cross_absolute_magnitude(r_1_cross_r_2)
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)
    radius = 3.0e-16
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
    r_0_dot_r_1, r_0_dot_r_2 = rDotProducts(r_1, r_2)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length)
        k = np.expand_dims(k, axis=2)
        induced_velocities = k * r_1_cross_r_2
    # Zero out inf/nan entries in place, as the original function does.
    discard_singularities(induced_velocities)
    if collapse:
        induced_velocities = collapseArr(induced_velocities)
    return induced_velocities
On my machine, this code is 2.5 times faster than the initial implementation on arrays of size 10**3. It also uses a bit less memory.

Numpy insert matrix values with matrix index

I have the following code which creates a 4D grid matrix and I am looking to insert the rolled 2D vals matrix into this grid.
import numpy as np
k = 100
x = 20
y = 10
z = 3
grid = np.zeros((y, k, x, z))
insert_map = np.random.randint(low=0, high=y, size=(5, k, x))
vals = np.random.random((5, k))
for i in range(x):
    grid[insert_map[:, :, i], i, 0] = np.roll(vals, i)
If vals were a 1D array and I used a 1D insert_map array as a reference, it would work. Using it in multiple dimensions, however, seems to be an issue, and it raises an error:
ValueError: shape mismatch: value array of shape (5,100) could not be broadcast to indexing result of shape (5,100,3)
I'm confused as to why it raises that error, since grid[insert_map[:, :, i], i, 0] should, in my mind, give a (5, 100) insert location for the y and k portion of the grid array and then fix the x and z portion with i and 0.
Is there any way to insert the 2D (5, 100) rolled vals matrix into the 4D (10, 100, 20, 3) grid matrix by 2D indexing?
grid is (y, k, x, z)
insert_map is (5, k, x). insert_map[:, :, i] is then (5,k).
grid[insert_map[:, :, i], i, 0] will then be (5,k,z). insert_map[] only indexes the first y dimension.
vals is (5,k); roll doesn't change that.
np.roll(vals, i)[...,None] can broadcast to fill the z dimension, if that's what you want.
Your insert_map can't select values along the k dimension. Its values, created by randint, are valid only for the y dimension.
If the i and 0 are supposed to apply to the last two dimensions, you still need an index for the k dimension. Possibilities are:
grid[insert_map[:, j, i], j, i, 0]
grid[insert_map[:, :, i], 0, i, 0]
grid[insert_map[:, :, i], :, i, 0]
grid[insert_map[:, :, i], np.arange(k), i, 0]
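As a sketch of the last possibility, assuming the intent is that column j of vals goes to position j along the k dimension: np.arange(k) has shape (k,) and broadcasts against the (5, k) index array, so the indexed result is exactly (5, k) and the assignment goes through. The rng seed is an arbitrary choice for reproducibility.

```python
import numpy as np

k, x, y, z = 100, 20, 10, 3
rng = np.random.default_rng(0)
grid = np.zeros((y, k, x, z))
insert_map = rng.integers(low=0, high=y, size=(5, k, x))
vals = rng.random((5, k))

for i in range(x):
    # insert_map[:, :, i] is (5, k) and np.arange(k) is (k,); together
    # they broadcast to (5, k), pairing each entry with its own column.
    grid[insert_map[:, :, i], np.arange(k), i, 0] = np.roll(vals, i)

# Shape of what the index expression selects for one value of i.
selected_shape = grid[insert_map[:, :, 0], np.arange(k), 0, 0].shape
print(selected_shape)
```

Note that insert_map can hold the same y index more than once within a column. In that case several values target the same cell and only one of them survives, so the grid may hold fewer than 5 distinct entries per column.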

Getting diagonal elements over several dimensions

I'd like to transform a tensor T of size (n x n x m x m) into a tensor U of size (n x m x m) while retrieving only the diagonal elements of T over the (n x n) chunks (i.e. U[i, k, l] = T[i, i, k, l]). torch.diag() only works with 2-D tensors and I really fail to see how to do this without looping over the indices of the elements (which I'd like to avoid because I think it is computationally inefficient). To be clear, I'd like to vectorize the following code:
U = torch.zeros(n, m, m)
for i in range(n):
    for k in range(m):
        for l in range(m):
            U[i][k][l] = T[i][i][k][l]
I'm totally new to PyTorch and I tried many combinations of functions, but none of them gives me a satisfying result. Does anyone have an idea?
You can generate the indexes using np.meshgrid
i, k, l = np.meshgrid(range(n), range(m), range(m))
U[i, k, l] = T[i, i, k, l]
for completeness I did:
n = 3
m = 5
T = torch.arange(n * n * m * m).view(n, n, m, m)
U = torch.zeros(n, m, m)
U_ = torch.zeros(n, m, m)
i, k, l = np.meshgrid(range(n), range(m), range(m))
U_[i, k, l] = T[i, i, k, l]
for i in range(n):
    for k in range(m):
        for l in range(m):
            U[i][k][l] = T[i][i][k][l]
U = U.view(-1)
U_ = U_.view(-1)
print ((U == U_).all())
The output is True so I assume it is correct.
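For 2-D matrices, torch.diag() returns the same values as torch.diagonal().
diagonal itself lets you specify which two dimensions of a tensor of arbitrary rank the diagonal is taken from; by default these are 0 and 1. The diagonal is placed in the last dimension of the result, so for T of shape (n, n, m, m) it has to be moved back to the front to get shape (n, m, m):
U = T.diagonal(dim1=0, dim2=1).permute(2, 0, 1)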
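The shape bookkeeping is easy to get wrong, so here is a small check of where the diagonal axis ends up. It is written with NumPy rather than PyTorch purely so that it runs without torch installed; numpy.diagonal appends the diagonal as the last axis, the same convention as torch.diagonal. The sizes are the illustrative ones used above.

```python
import numpy as np

n, m = 3, 5
T = np.arange(n * n * m * m).reshape(n, n, m, m)

# Reference: U[i, k, l] = T[i, i, k, l], one (m, m) block per i.
U_loop = np.zeros((n, m, m), dtype=T.dtype)
for i in range(n):
    U_loop[i] = T[i, i]

# The diagonal over axes 0 and 1 is appended as the LAST axis: (m, m, n).
diag_last = np.diagonal(T, axis1=0, axis2=1)

# Move it to the front to recover the (n, m, m) layout.
U_diag = np.moveaxis(diag_last, -1, 0)

# A repeated index in einsum extracts the same diagonal directly.
U_einsum = np.einsum("iikl->ikl", T)
```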

evaluate many monomials at many points

The following problem concerns evaluating many monomials (x**k * y**l * z**m) at many points.
I would like to compute the "inner power" of two numpy arrays, i.e.,
import numpy
a = numpy.random.rand(10, 3)
b = numpy.random.rand(3, 5)
out = numpy.ones((10, 5))
for i in range(10):
    for j in range(5):
        for k in range(3):
            out[i, j] *= a[i, k]**b[k, j]
If instead the line would read
out[i, j] += a[i, k]*b[j, k]
this would be a number of inner products, computable with a simple dot or einsum.
Is it possible to perform the above loop in just one numpy line?
What about thinking of it in terms of logarithms:
import numpy
a = numpy.random.rand(10, 3)
b = numpy.random.rand(3, 5)
out = numpy.exp(numpy.matmul(numpy.log(a), b))
Since c_ij = prod(a_ik ** b_kj, k=1..K), then log(c_ij) = sum(log(a_ik) * b_kj, k=1..K).
Note: Having zeros in a may mess up the result (also negatives, but then the result wouldn't be well defined anyway). I have given it a try and it doesn't seem to actually break somehow; I don't know if that behavior is guaranteed by NumPy but, to be safe, you can add something at the end like:
out[numpy.logical_or.reduce(a < eps, axis=1)] = 0  # eps: a small tolerance of your choosing
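To convince yourself that the identity holds, the exp-log form can be compared against the explicit loop. This is a minimal sketch that assumes strictly positive entries in a (the 0.1 offset and the rng seed are arbitrary choices), so that log(a) is finite everywhere.

```python
import numpy

rng = numpy.random.default_rng(0)
# Assumption: strictly positive entries, so log(a) is finite.
a = rng.random((10, 3)) + 0.1
b = rng.random((3, 5))

# Reference: the triple loop from the question.
out_loop = numpy.ones((10, 5))
for i in range(10):
    for j in range(5):
        for k in range(3):
            out_loop[i, j] *= a[i, k] ** b[k, j]

# prod_k a_ik ** b_kj == exp(sum_k log(a_ik) * b_kj)
out_log_exp = numpy.exp(numpy.log(a) @ b)
```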
You can use broadcasting after extending those arrays to 3D versions -
(a[:, :, None] ** b[None, :, :]).prod(axis=1)
Simply put -
(a[..., None] ** b[None]).prod(1)
Basically, we are keeping the last axis and first axis from the two arrays aligned, while performing element-wise powers between the first and last axes from the two inputs. Schematically put using the given sample on shapes -
10 x 3 x 1
1 x 3 x 5
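Spelled out with the intermediate shapes, the schematic above looks roughly like this. It is a sketch on random sample data; the names powers and out are chosen for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)
a = rng.random((10, 3))
b = rng.random((3, 5))

# (10, 3, 1) ** (1, 3, 5) broadcasts to (10, 3, 5):
# powers[i, k, j] == a[i, k] ** b[k, j]
powers = a[:, :, None] ** b[None, :, :]

# Multiplying over the shared middle axis k leaves shape (10, 5).
out = powers.prod(axis=1)
```

The price of this one-liner is the (10, 3, 5) temporary, which grows with the product of all three sizes.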
Two more solutions:
numpy.array([
    numpy.prod([a[:, i]**bb[i] for i in range(len(bb))], axis=0)
    for bb in b.T
]).T
and using power.outer:
numpy.prod([numpy.power.outer(a[:, k], b[k]) for k in range(len(b))], axis=0)
Both are a bit slower than the broadcasting solution.
Even with some logic to accommodate for zero and negative values, the exp-log solution takes the cake.
Code to reproduce the plot:
import numpy
import perfplot
def loop(data):
    a, b = data
    m = a.shape[0]
    n = b.shape[1]
    out = numpy.ones((m, n))
    for i in range(m):
        for j in range(n):
            for k in range(3):
                out[i, j] *= a[i, k]**b[k, j]
    return out
def broadcasting(data):
    a, b = data
    return (a[..., None]**b[None]).prod(1)
def log_exp(data):
    a, b = data
    # Track the sign: a negative base raised to an odd power is negative.
    neg_a = numpy.zeros(a.shape, dtype=int)
    neg_a[a < 0.0] = 1
    odd_b = numpy.zeros(b.shape, dtype=int)
    odd_b[b % 2 == 1] = 1
    negative_count = numpy.dot(neg_a, odd_b)
    # out= gives the entries skipped by where= a defined value (0.0).
    out = (-1)**negative_count * numpy.exp(
        numpy.matmul(
            numpy.log(abs(a), out=numpy.zeros(a.shape), where=abs(a) > 0.0),
            b
        ))
    # A zero base raised to a positive power makes the whole product zero.
    zero_a = numpy.zeros(a.shape, dtype=int)
    zero_a[a == 0.0] = 1
    pos_b = numpy.zeros(b.shape, dtype=int)
    pos_b[b > 0] = 1
    zero_count = numpy.dot(zero_a, pos_b)
    out[zero_count > 0] = 0.0
    return out
def inline(data):
    a, b = data
    return numpy.array([
        numpy.prod([a[:, i]**bb[i] for i in range(len(bb))], axis=0)
        for bb in b.T
    ]).T
def outer_power(data):
    a, b = data
    return numpy.prod([
        numpy.power.outer(a[:, k], b[k]) for k in range(len(b))
    ], axis=0)
perfplot.show(
    setup=lambda n: (
        numpy.random.rand(n, 3) - 0.5,
        numpy.random.randint(0, 10, (3, n))
    ),
    kernels=[loop, broadcasting, log_exp, inline, outer_power],
    n_range=[2**k for k in range(11)],
)
import numpy
a = numpy.random.rand(10, 3)
b = numpy.random.rand(3, 5)
out = [[numpy.prod([a[i, k]**b[k, j] for k in range(3)]) for j in range(5)] for i in range(10)]
