I have two arrays a & b
a.shape
(5, 4, 3)
array([[[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0.10772717, 0.604584 , 0.41664413]],
[[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0.10772717, 0.604584 , 0.41664413],
[ 0.95879616, 0.85575133, 0.46135877]],
[[ 0. , 0. , 0. ],
[ 0.10772717, 0.604584 , 0.41664413],
[ 0.95879616, 0.85575133, 0.46135877],
[ 0.70442301, 0.74126523, 0.88965603]],
[[ 0.10772717, 0.604584 , 0.41664413],
[ 0.95879616, 0.85575133, 0.46135877],
[ 0.70442301, 0.74126523, 0.88965603],
[ 0.8039435 , 0.62802183, 0.58885027]],
[[ 0.95879616, 0.85575133, 0.46135877],
[ 0.70442301, 0.74126523, 0.88965603],
[ 0.8039435 , 0.62802183, 0.58885027],
[ 0.95848603, 0.72429311, 0.71461332]]])
and b
array([ 0.79212707, 0.66629398, 0.58676553], dtype=float32)
b.shape
(3,)
I want to get array
ab.shape
(5,5,3)
I do as below
first
b = b.reshape(1,1,3)
then
b=np.concatenate((b, b,b, b, b), axis = 0)
And
ab=np.concatenate((a, b), axis = 1)
ab.shape
(5, 5, 3)
I get the right result, but it's not very convenient especially at the step
b=np.concatenate((b, b,b, b, b), axis = 0)
where I have to type b many times (the real dataset has much larger dimensions). Is there a faster way to get this result?
Simply broadcast b to 3D and then concatenate along second axis -
b3D = np.broadcast_to(b,(a.shape[0],1,len(b)))
out = np.concatenate((a,b3D),axis=1)
The broadcasting step with np.broadcast_to doesn't actually replicate or copy anything; it simply creates a replicated view. The concatenation in the next step then does the replication on the fly.
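To make the "view, not copy" point concrete, here is a small sketch. The array values are placeholders chosen for illustration; the zero strides on the repeated axes are what show that no data was duplicated.

```python
import numpy as np

a = np.random.rand(5, 4, 3)
b = np.array([0.79212707, 0.66629398, 0.58676553], dtype=np.float32)

# Read-only view of b with shape (5, 1, 3); no data is copied.
b3D = np.broadcast_to(b, (a.shape[0], 1, len(b)))
print(b3D.strides)                 # stride 0 on the two broadcast axes
print(np.shares_memory(b3D, b))    # the view still points at b's buffer

# The actual copy happens once, when concatenate fills its output.
out = np.concatenate((a, b3D), axis=1)
print(out.shape)                   # (5, 5, 3)
```

Because the view is read-only, writing into b3D raises an error; the concatenated result is an ordinary writable array.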
Benchmarking
We are comparing the np.repeat version from @cᴏʟᴅsᴘᴇᴇᴅ's solution against the np.broadcast_to one in this section, with a focus on performance. The broadcasting-based one does the replication and concatenation in the second step, as a merged command so to speak, while the np.repeat version makes a copy and then concatenates in two separate steps.
Timing the approaches as whole :
Case #1 : a = (500,400,300) and b = (300,)
In [321]: a = np.random.rand(500,400,300)
In [322]: b = np.random.rand(300)
In [323]: %%timeit
...: b3D = b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
...: r = np.concatenate((a, b3D), axis=1)
10 loops, best of 3: 72.1 ms per loop
In [325]: %%timeit
...: b3D = np.broadcast_to(b,(a.shape[0],1,len(b)))
...: out = np.concatenate((a,b3D),axis=1)
10 loops, best of 3: 72.5 ms per loop
For smaller input shapes, call to np.broadcast_to would take a bit longer than np.repeat given the work needed for setting up the broadcasting is apparently more complicated, as the timings suggest below :
In [360]: a = np.random.rand(5,4,3)
In [361]: b = np.random.rand(3)
In [366]: %timeit np.broadcast_to(b,(a.shape[0],1,len(b)))
100000 loops, best of 3: 3.12 µs per loop
In [367]: %timeit b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
1000000 loops, best of 3: 957 ns per loop
But the broadcasting part would take constant time irrespective of the shapes of the inputs, i.e. the 3 µs part would stay around that mark. The timing for its counterpart, b.reshape(1, 1, -1).repeat(a.shape[0], axis=0), would depend on the input shapes. So, let's dig deeper and see how the concatenation steps for the two approaches fare.
Digging deeper
Trying to dig deeper to see how much the concatenation part is consuming :
In [353]: a = np.random.rand(500,400,300)
In [354]: b = np.random.rand(300)
In [355]: b3D = np.broadcast_to(b,(a.shape[0],1,len(b)))
In [356]: %timeit np.concatenate((a,b3D),axis=1)
10 loops, best of 3: 72 ms per loop
In [357]: b3D = b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
In [358]: %timeit np.concatenate((a,b3D),axis=1)
10 loops, best of 3: 72 ms per loop
Conclusion : Doesn't seem too different.
Now, let's try a case where the replication needed for b is a bigger number and b has noticeably high number of elements as well.
In [344]: a = np.random.rand(10000, 10, 1000)
In [345]: b = np.random.rand(1000)
In [346]: b3D = np.broadcast_to(b,(a.shape[0],1,len(b)))
In [347]: %timeit np.concatenate((a,b3D),axis=1)
10 loops, best of 3: 130 ms per loop
In [348]: b3D = b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
In [349]: %timeit np.concatenate((a,b3D),axis=1)
10 loops, best of 3: 141 ms per loop
Conclusion : Seems like the merged concatenate+replication with np.broadcast_to is doing a bit better here.
Let's try the original case of (5,4,3) shape :
In [360]: a = np.random.rand(5,4,3)
In [361]: b = np.random.rand(3)
In [362]: b3D = np.broadcast_to(b,(a.shape[0],1,len(b)))
In [363]: %timeit np.concatenate((a,b3D),axis=1)
1000000 loops, best of 3: 948 ns per loop
In [364]: b3D = b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
In [365]: %timeit np.concatenate((a,b3D),axis=1)
1000000 loops, best of 3: 950 ns per loop
Conclusion : Again, not too different.
So, the final conclusion is that if there are a lot of elements in b and if the first axis of a is also a big number (as the replication number is that one), np.broadcast_to would be a good option, otherwise np.repeat based version takes care of the other cases pretty well.
You can use np.repeat:
r = np.concatenate((a, b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)), axis=1)
What this does, is first reshape your b array to match the dimensions of a, and then repeat its values as many times as needed according to a's first axis:
b3D = b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
array([[[1, 2, 3]],
[[1, 2, 3]],
[[1, 2, 3]],
[[1, 2, 3]],
[[1, 2, 3]]])
b3D.shape
(5, 1, 3)
This intermediate result is then concatenated with a -
r = np.concatenate((a, b3D), axis=1)
r.shape
(5, 5, 3)
This differs from your current answer mainly in the fact that the repetition of values is not hard-coded (i.e., it is taken care of by the repeat).
If you need to handle a different number of dimensions (not 3D arrays), some changes are needed, mainly to remove the hardcoded reshape of b.
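One possible way to drop the hardcoded reshape is sketched below. Note that it swaps in np.broadcast_to instead of reshape/repeat, and that append_along_axis is a hypothetical helper name, not something from the answers. It assumes len(b) == a.shape[-1] and that axis is not the last axis.

```python
import numpy as np

def append_along_axis(a, b, axis=1):
    """Hypothetical helper: append 1-D b as one extra slice of a along `axis`.

    Assumes len(b) == a.shape[-1] and that `axis` is not the last axis.
    """
    axis = axis % a.ndim
    target = list(a.shape)
    target[axis] = 1                      # one new slice along `axis`
    b_view = np.broadcast_to(b, target)   # b lines up with the last axis
    return np.concatenate((a, b_view), axis=axis)

a = np.zeros((2, 5, 4, 3))
b = np.array([1.0, 2.0, 3.0])
print(append_along_axis(a, b, axis=1).shape)   # (2, 6, 4, 3)
print(append_along_axis(a, b, axis=2).shape)   # (2, 5, 5, 3)
```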
Timings
a = np.random.randn(100, 99, 100)
b = np.random.randn(100)
# Tai's answer
%timeit np.insert(a, 4, b, axis=1)
100 loops, best of 3: 3.7 ms per loop
# Divakar's answer
%%timeit
b3D = np.broadcast_to(b,(a.shape[0],1,len(b)))
np.concatenate((a,b3D),axis=1)
100 loops, best of 3: 3.67 ms per loop
# solution in this post
%timeit np.concatenate((a, b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)), axis=1)
100 loops, best of 3: 3.62 ms per loop
These are all pretty competitive solutions. However, note that performance depends on your actual data, so make sure you test things first!
Here are some simple timings based on cᴏʟᴅsᴘᴇᴇᴅ's and Divakar's solutions:
%timeit np.concatenate((a, b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)), axis=1)
Output:
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.68 µs per loop
%timeit np.concatenate((a, np.broadcast_to(b[None,None], (a.shape[0], 1, len(b)))), axis=1)
Output:
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.7 µs per loop
Now here is the timing based on your original code:
%timeit original_func(a, b)
Output:
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.69 µs per loop
Since the question asked for faster ways to get the same result, I would go for cᴏʟᴅsᴘᴇᴇᴅ's solution based on these timings.
You can also use np.insert.
b_broad = np.expand_dims(b, axis=0) # b_broad.shape = (1, 3)
ab = np.insert(a, 4, b_broad, axis=1)
"""
Because now we are inserting along axis 1
a's shape without axis 1 = (5, 3)
b_broad's shape (1, 3)
can be aligned and broadcast b_broad to (5, 3)
"""
In this example, we insert along axis 1 and put b_broad before the given index, 4 here. In other words, b_broad will occupy index 4 along that axis, making ab.shape equal to (5, 5, 3).
Note again that before the insertion we turn b into b_broad to safely get the broadcasting you want. b has fewer dimensions than a, so it is broadcast at insertion. We can use expand_dims to achieve this.
If a is of shape (3, 4, 5), you will need b_broad to have shape (3, 1) to match up dimensions if inserting along axis 1. This can be achieved by
b_broad = np.expand_dims(b, axis=1) # shape = (3, 1)
It is good practice to give b_broad the right shape explicitly, because you might have a.shape = (3, 4, 3), and then you really need to specify which way to broadcast!
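As a quick check of the first case described above, this sketch uses placeholder values for a and b and inserts at a.shape[1] (which is 4 here) so the new slice lands at the end of axis 1.

```python
import numpy as np

a = np.zeros((5, 4, 3))
b = np.array([1.0, 2.0, 3.0])

b_broad = np.expand_dims(b, axis=0)              # shape (1, 3)
ab = np.insert(a, a.shape[1], b_broad, axis=1)   # insert after the last slice

print(ab.shape)    # (5, 5, 3)
print(ab[0, 4])    # [1. 2. 3.]
```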
Timing Results
From OP's dataset: COLDSPEED's answer is about 3 times faster than np.insert.
def Divakar():  # Divakar's answer
    b3D = b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)
    r = np.concatenate((a, b3D), axis=1)
# COLDSPEED's result
%timeit np.concatenate((a, b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)), axis=1)
2.95 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# Divakar's result
%timeit Divakar()
3.03 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# Mine
%timeit np.insert(a, 4, b, axis=1)
10.1 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Dataset 2 (borrowing the timing experiment from COLDSPEED): nothing can be concluded in this case because all three share nearly the same mean and standard deviation.
a = np.random.randn(100, 99, 100)
b = np.random.randn(100)
# COLDSPEED's result
%timeit np.concatenate((a, b.reshape(1, 1, -1).repeat(a.shape[0], axis=0)), axis=1)
2.37 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Divakar's
%timeit Divakar()
2.31 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Mine
%timeit np.insert(a, 99, b, axis=1)
2.34 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Speed will depend on the data's size and shape. Please test on your own dataset if speed is your concern.
Suppose we have
an n-dimensional numpy.array A
a numpy.array B with dtype=int and shape of (n, m)
How do I index A by B so that the result is an array of shape (m,), with values taken from the positions indicated by the columns of B?
For example, consider this code that does what I want when B is a python list:
>>> a = np.arange(27).reshape(3,3,3)
>>> a[[0, 1, 2], [0, 0, 0], [1, 1, 2]]
array([ 1, 10, 20]) # the result we're after
>>> bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
>>> a[bl]
array([ 1, 10, 20]) # also works when indexing with a python list
>>> a[bl].shape
(3,)
However, when B is a numpy array, the result is different:
>>> b = np.array(bl)
>>> a[b].shape
(3, 3, 3, 3)
Now, I can get the desired result by casting B into a tuple, but surely that cannot be the proper/idiomatic way to do it?
>>> a[tuple(b)]
array([ 1, 10, 20])
Is there a numpy function to achieve the same without casting B to a tuple?
One alternative would be converting to linear indices and then index with np.take or index into its flattened version -
np.take(a,np.ravel_multi_index(b, a.shape))
a.flat[np.ravel_multi_index(b, a.shape)]
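A small sketch of the linear-index route, reusing the a and b from the question. Since a is np.arange(27) reshaped, each value equals its own flat position, which makes the result easy to check by eye.

```python
import numpy as np

a = np.arange(27).reshape(3, 3, 3)
b = np.array([[0, 1, 2], [0, 0, 0], [1, 1, 2]])

# Each column of b is one (i, j, k) position; convert it to a flat offset.
flat_idx = np.ravel_multi_index(b, a.shape)
print(flat_idx)               # [ 1 10 20]
print(np.take(a, flat_idx))   # [ 1 10 20]
print(a.flat[flat_idx])       # [ 1 10 20]
```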
Custom np.ravel_multi_index for performance boost
We could implement a custom version to simulate the behaviour of np.ravel_multi_index to boost the performance, like so -
def ravel_index(b, shp):
    return np.concatenate((np.asarray(shp[1:])[::-1].cumprod()[::-1],[1])).dot(b)
Using it, the desired output would be found in two ways -
np.take(a,ravel_index(b, a.shape))
a.flat[ravel_index(b, a.shape)]
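To see what the custom version computes, the sketch below splits it into named steps and checks it against np.ravel_multi_index on random indices. The shape (3, 4, 5) and the random test indices are assumptions for illustration. One difference worth keeping in mind: np.ravel_multi_index raises on out-of-range indices by default, while this dot-product version does no such checking.

```python
import numpy as np

def ravel_index(b, shp):
    # Row-major weights: product of the trailing dimensions,
    # e.g. shape (3, 4, 5) -> weights [20, 5, 1].
    weights = np.concatenate((np.asarray(shp[1:])[::-1].cumprod()[::-1], [1]))
    return weights.dot(b)

shp = (3, 4, 5)
rng = np.random.default_rng(0)
# One row of valid indices per dimension, 10 positions in total.
b = np.stack([rng.integers(0, n, size=10) for n in shp])

same = np.array_equal(ravel_index(b, shp), np.ravel_multi_index(b, shp))
print(same)
```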
Benchmarking
Additionally incorporating the tuple-based method from the question and the map-based one from @Kanak's post.
Case #1 : dims = 3
In [23]: a = np.random.randint(0,9,([20]*3))
In [24]: b = np.random.randint(0,20,(a.ndim,1000000))
In [25]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
100 loops, best of 3: 6.56 ms per loop
100 loops, best of 3: 6.58 ms per loop
100 loops, best of 3: 6.95 ms per loop
100 loops, best of 3: 9.17 ms per loop
100 loops, best of 3: 6.31 ms per loop
100 loops, best of 3: 8.52 ms per loop
Case #2 : dims = 6
In [29]: a = np.random.randint(0,9,([10]*6))
In [30]: b = np.random.randint(0,10,(a.ndim,1000000))
In [31]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 40 ms per loop
10 loops, best of 3: 20 ms per loop
10 loops, best of 3: 29.9 ms per loop
100 loops, best of 3: 15.7 ms per loop
10 loops, best of 3: 25.8 ms per loop
Case #3 : dims = 10
In [32]: a = np.random.randint(0,9,([4]*10))
In [33]: b = np.random.randint(0,4,(a.ndim,1000000))
In [34]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
10 loops, best of 3: 60.7 ms per loop
10 loops, best of 3: 60.1 ms per loop
10 loops, best of 3: 27.8 ms per loop
10 loops, best of 3: 38 ms per loop
100 loops, best of 3: 18.7 ms per loop
10 loops, best of 3: 29.3 ms per loop
So, it makes sense to look for alternatives when working with higher-dimensional inputs and with large data.
Another alternative that fits your need involves the use of np.ravel
>>> a[map(np.ravel, b)]
array([ 1, 10, 20])
However, this is not fully NumPy-based.
Performance concerns
Updated following the comments below.
Be that as it may, your approach is better than mine, but not better than any of @Divakar's.
import numpy as np
import timeit
a = np.arange(27).reshape(3,3,3)
bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
b = np.array(bl)
imps = "from __main__ import np,a,b"
reps = 100000
tup_cas_t = timeit.Timer("a[tuple(b)]", imps).timeit(reps)
map_rav_t = timeit.Timer("a[map(np.ravel, b)]", imps).timeit(reps)
fla_rp1_t = timeit.Timer("np.take(a,np.ravel_multi_index(b, a.shape))", imps).timeit(reps)
fla_rp2_t = timeit.Timer("a.flat[np.ravel_multi_index(b, a.shape)]", imps).timeit(reps)
print tup_cas_t/map_rav_t ## 0.505382211881
print tup_cas_t/fla_rp1_t ## 1.18185817386
print tup_cas_t/fla_rp2_t ## 1.71288705886
Are you looking for numpy.ndarray.tolist() ?
>>> a = np.arange(27).reshape(3,3,3)
>>> bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
>>> b = np.array(bl)
>>> a[b.tolist()]
array([ 1, 10, 20])
Or for arrays indexing arrays which is quite similar to list indexing :
>>> a[np.array([0, 1, 2]), np.array([0, 0, 0]), np.array([1, 1, 2])]
array([ 1, 10, 20])
However, as you can see from the previous link, indexing an array a with an array b directly means you are indexing only the first axis of a with your whole b array, which can lead to confusing output.
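The difference between the two kinds of indexing can be sketched as follows, using the a and b from the question. With a single array index, every entry of b selects a whole (3, 3) block along the first axis; with a tuple, each row of b indexes its own axis.

```python
import numpy as np

a = np.arange(27).reshape(3, 3, 3)
b = np.array([[0, 1, 2], [0, 0, 0], [1, 1, 2]])

# One array index: result shape is b.shape + a.shape[1:].
print(a[b].shape)                                # (3, 3, 3, 3)
print(np.array_equal(a[b][1, 2], a[b[1, 2]]))    # True

# Tuple of arrays: one index array per axis.
print(a[tuple(b)])                               # [ 1 10 20]
```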
I have the following function which is returning an array calculating the nearest neighbor:
def p_batch(U,X,Y):
    return [nearest(u,X,Y) for u in U]
I would like to replace the for loop using numpy. I've been looking into numpy.vectorize() as this seems to be the right approach, but I can't get it to work. This is what I've tried so far:
def n_batch(U,X,Y):
    vbatch = np.vectorize(nearest)
    return vbatch(U,X,Y)
Can anyone give me a hint where I went wrong?
Edit:
Implementation of nearest:
def nearest(u,X,Y):
    return Y[np.argmin(np.sqrt(np.sum(np.square(np.subtract(u,X)),axis=1)))]
Data for U,X,Y (with M=20,N=100,d=50):
U = numpy.random.mtrand.RandomState(123).uniform(0,1,[M,d])
X = numpy.random.mtrand.RandomState(456).uniform(0,1,[N,d])
Y = numpy.random.mtrand.RandomState(789).randint(0,2,[N])
Approach #1
You could use Scipy's cdist to generate all those euclidean distances and then simply use argmin and index into Y -
from scipy.spatial.distance import cdist
out = Y[cdist(U,X).argmin(1)]
Sample run -
In [76]: M,N,d = 5,6,3
...: U = np.random.mtrand.RandomState(123).uniform(0,1,[M,d])
...: X = np.random.mtrand.RandomState(456).uniform(0,1,[N,d])
...: Y = np.random.mtrand.RandomState(789).randint(0,2,[N])
...:
# Using a list comprehension to verify values
In [77]: [nearest(U[i], X,Y) for i in range(len(U))]
Out[77]: [1, 0, 0, 1, 1]
In [78]: Y[cdist(U,X).argmin(1)]
Out[78]: array([1, 0, 0, 1, 1])
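If SciPy is not available, the same idea can be sketched with plain NumPy broadcasting. This is a swapped-in technique, not the cdist call above, and nearest_batch is a hypothetical name. It builds an (M, N, d) intermediate array, so it may use a lot of memory for large inputs. It compares squared distances, which gives the same argmin as the true distances.

```python
import numpy as np

def nearest_batch(U, X, Y):
    # diff[m, n, :] = U[m] - X[n], shape (M, N, d)
    diff = U[:, None, :] - X[None, :, :]
    # Squared Euclidean distances, shape (M, N)
    sq_dist = np.einsum('mnd,mnd->mn', diff, diff)
    # Label of the closest X row for every row of U
    return Y[sq_dist.argmin(axis=1)]

M, N, d = 5, 6, 3
U = np.random.RandomState(123).uniform(0, 1, [M, d])
X = np.random.RandomState(456).uniform(0, 1, [N, d])
Y = np.random.RandomState(789).randint(0, 2, [N])
print(nearest_batch(U, X, Y).shape)   # (5,)
```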
Approach #2
Another way is to use sklearn.metrics.pairwise.pairwise_distances_argmin, which gives us those argmin indices directly -
from sklearn.metrics import pairwise
Y[pairwise.pairwise_distances_argmin(U,X)]
Runtime test with M=20,N=100,d=50 -
In [90]: M,N,d = 20,100,50
...: U = np.random.mtrand.RandomState(123).uniform(0,1,[M,d])
...: X = np.random.mtrand.RandomState(456).uniform(0,1,[N,d])
...: Y = np.random.mtrand.RandomState(789).randint(0,2,[N])
...:
Testing between cdist and pairwise_distances_argmin -
In [91]: %timeit cdist(U,X).argmin(1)
10000 loops, best of 3: 55.2 µs per loop
In [92]: %timeit pairwise.pairwise_distances_argmin(U,X)
10000 loops, best of 3: 90.6 µs per loop
Timings against loopy version -
In [93]: %timeit [nearest(U[i], X,Y) for i in range(len(U))]
1000 loops, best of 3: 298 µs per loop
In [94]: %timeit Y[cdist(U,X).argmin(1)]
10000 loops, best of 3: 55.6 µs per loop
In [95]: %timeit Y[pairwise.pairwise_distances_argmin(U,X)]
10000 loops, best of 3: 91.1 µs per loop
In [96]: 298.0/55.6 # Speedup with cdist over loopy one
Out[96]: 5.359712230215827
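As for the np.vectorize attempt in the question: without a signature, np.vectorize feeds the function one scalar element at a time, which does not fit a function that expects a whole row u plus the full X and Y. A minimal sketch of one way to make it work is to pass a signature (the sample data below is made up for illustration). Keep in mind that np.vectorize is still a Python-level loop, so this is about convenience rather than speed.

```python
import numpy as np

def nearest(u, X, Y):
    return Y[np.argmin(np.sqrt(np.sum(np.square(np.subtract(u, X)), axis=1)))]

# Loop over rows of U (core dimension d); pass X (n, d) and Y (n) whole.
vbatch = np.vectorize(nearest, signature='(d),(n,d),(n)->()')

M, N, d = 5, 6, 3
U = np.random.RandomState(123).uniform(0, 1, [M, d])
X = np.random.RandomState(456).uniform(0, 1, [N, d])
Y = np.random.RandomState(789).randint(0, 2, [N])

out = vbatch(U, X, Y)
print(out.shape)   # (5,)
```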
I want to create a toy training set from the XOR function:
xor = [[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 0]]
input_x = np.random.choice(a=xor, size=200)
However, this is giving me
{ValueError} 'a' must be 1-dimensional
But, if I add e.g. a number to this list:
xor = [[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
1337] # With this it will work
input_x = np.random.choice(a=xor, size=200)
it starts to work. Why is this the case and how can I make this work without having to add another primitive to the xor list?
In case of an array I would do the following:
xor = np.array([[0,0,0],
[0,1,1],
[1,0,1],
[1,1,0]])
rnd_indices = np.random.choice(len(xor), size=200)
xor_data = xor[rnd_indices]
If you want a random list from xor, you should probably be doing this.
xor[np.random.choice(len(xor),1)]
You can use the random package instead:
import random
input_x = [random.choice(xor) for _ in range(200)]
Interesting! It seems that NumPy implicitly converts the input to an np.array first. So, for your first input,
np.array(xor).shape == (4, 3)
while for the second input,
np.array(xor).shape == (5, )
so NumPy sees the second input as 1-D.
So, to pick a random row, just pick a random index and then take the corresponding row:
ind = np.random.choice(len(xor))
random_row = np.asarray(xor)[ind, :]
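For what it's worth, the newer Generator API (NumPy 1.17 and later) can sample whole rows directly. This is a different function from the np.random.choice used in the question, so treat it as an alternative sketch rather than a fix for that call; the seed 0 is arbitrary.

```python
import numpy as np

xor = np.array([[0, 0, 0],
                [0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]])

rng = np.random.default_rng(0)
# Generator.choice accepts a multi-dimensional array and samples along `axis`.
input_x = rng.choice(xor, size=200, axis=0)
print(input_x.shape)   # (200, 3)
```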
With focus on performance, we could use the decimal number equivalents of those four numbers, feed those to np.random.choice() to generate 200 such numbers randomly chosen and finally get their binary equivalents with bit-shift operation.
Thus, an implementation would be -
def bitshift_approach(N):
    nums = np.random.choice(np.array([0,3,5,6]),size=(N))
    return ((nums & (1 << np.arange(3))[:,None])!=0).T.astype(int)
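To see why 0, 3, 5 and 6 are the right numbers, the decoding step can be run on those four values alone. Bit 0 becomes the first column, so the rows come out in a different order than in the xor list, but the set of rows is the same.

```python
import numpy as np

nums = np.array([0, 3, 5, 6])
# Test bits 0, 1, 2 of every number, then put one number per row.
bits = ((nums & (1 << np.arange(3))[:, None]) != 0).T.astype(int)
print(bits)
# [[0 0 0]
#  [1 1 0]
#  [1 0 1]
#  [0 1 1]]
```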
Another approach, very similar to what others have suggested, is to use np.random.choice(len(xor), size=N) to generate the row indices and then use row-indexing to select rows of xor. A slight modification is to use np.take with axis=0 to select those rows (without axis, np.take indexes into the flattened array). With repeated indices, as is the case here, this should be performant.
Thus, an alternative approach would be -
np.take(xor, np.random.choice(len(xor), size=N), axis=0)
Runtime test -
In [42]: N = 200
In [43]: %timeit xor[np.random.choice(np.arange(len(xor)), size=N)]
...: %timeit xor[np.random.choice(len(xor), size=N)]
...: %timeit bitshift_approach(N)
...: %timeit np.take(xor,np.random.choice(len(xor), size=N))
...:
10000 loops, best of 3: 43.3 µs per loop
10000 loops, best of 3: 38.3 µs per loop
10000 loops, best of 3: 59.4 µs per loop
10000 loops, best of 3: 35 µs per loop
In [44]: N = 1000
In [45]: %timeit xor[np.random.choice(np.arange(len(xor)), size=N)]
...: %timeit xor[np.random.choice(len(xor), size=N)]
...: %timeit bitshift_approach(N)
...: %timeit np.take(xor,np.random.choice(len(xor), size=N))
...:
10000 loops, best of 3: 69.5 µs per loop
10000 loops, best of 3: 64.7 µs per loop
10000 loops, best of 3: 77.7 µs per loop
10000 loops, best of 3: 38.7 µs per loop
In [46]: N = 10000
In [47]: %timeit xor[np.random.choice(np.arange(len(xor)), size=N)]
...: %timeit xor[np.random.choice(len(xor), size=N)]
...: %timeit bitshift_approach(N)
...: %timeit np.take(xor,np.random.choice(len(xor), size=N))
...:
1000 loops, best of 3: 363 µs per loop
1000 loops, best of 3: 351 µs per loop
1000 loops, best of 3: 225 µs per loop
10000 loops, best of 3: 134 µs per loop
You can use random.choice() directly and just run it 200 times to get 200 samples,
since np.random.choice() requires the values to be in a 1-D sequence like ["1","2","3"]
and can't work with a list of lists or a list of tuples, only a list of scalar values.
Say I have two arrays a and b,
a.shape = (5,2,3)
b.shape = (2,3)
then c = a * b will give me an array c of shape (5,2,3) with c[i,j,k] = a[i,j,k]*b[j,k].
Now the situation is,
a.shape = (5,2,3)
b.shape = (2,3,8)
and I want c to have a shape (5,2,3,8) with c[i,j,k,l] = a[i,j,k]*b[j,k,l].
How to do this efficiently? My a and b are actually quite large.
This should work:
a[..., numpy.newaxis] * b[numpy.newaxis, ...]
Usage:
In : a = numpy.random.randn(5,2,3)
In : b = numpy.random.randn(2,3,8)
In : c = a[..., numpy.newaxis]*b[numpy.newaxis, ...]
In : c.shape
Out: (5, 2, 3, 8)
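A short sketch to check the result against the formula from the question. It also shows that the leading newaxis on b is optional, because broadcasting pads missing leading axes on its own. The spot-checked position (4, 1, 2, 7) is arbitrary.

```python
import numpy as np

a = np.random.randn(5, 2, 3)
b = np.random.randn(2, 3, 8)

# a[..., np.newaxis] has shape (5, 2, 3, 1); b is treated as (1, 2, 3, 8).
c = a[..., np.newaxis] * b
print(c.shape)                                              # (5, 2, 3, 8)
print(np.isclose(c[4, 1, 2, 7], a[4, 1, 2] * b[1, 2, 7]))   # True
```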
Ref: Array Broadcasting in numpy
I think the following should work:
import numpy as np
a = np.random.normal(size=(5,2,3))
b = np.random.normal(size=(2,3,8))
c = np.einsum('ijk,jkl->ijkl',a,b)
and:
In [5]: c.shape
Out[5]: (5, 2, 3, 8)
In [6]: a[0,0,1]*b[0,1,2]
Out[6]: -0.041308376453821738
In [7]: c[0,0,1,2]
Out[7]: -0.041308376453821738
np.einsum can be a bit tricky to use, but is quite powerful for these sorts of indexing problems:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
Also note that this requires numpy >= v1.6.0
I'm not sure about efficiency for your particular problem, but if it doesn't perform as well as needed, definitely look into using Cython with explicit for loops, and possibly parallelize it using prange
UPDATE
In [18]: %timeit np.einsum('ijk,jkl->ijkl',a,b)
100000 loops, best of 3: 4.78 us per loop
In [19]: %timeit a[..., np.newaxis]*b[np.newaxis, ...]
100000 loops, best of 3: 12.2 us per loop
In [20]: a = np.random.normal(size=(50,20,30))
In [21]: b = np.random.normal(size=(20,30,80))
In [22]: %timeit np.einsum('ijk,jkl->ijkl',a,b)
100 loops, best of 3: 16.6 ms per loop
In [23]: %timeit a[..., np.newaxis]*b[np.newaxis, ...]
100 loops, best of 3: 16.6 ms per loop
In [2]: a = np.random.normal(size=(500,20,30))
In [3]: b = np.random.normal(size=(20,30,800))
In [4]: %timeit np.einsum('ijk,jkl->ijkl',a,b)
1 loops, best of 3: 3.31 s per loop
In [5]: %timeit a[..., np.newaxis]*b[np.newaxis, ...]
1 loops, best of 3: 2.6 s per loop