Is there any quicker way to physically transpose a large 2D numpy matrix than arr.transpose().copy()? And are there any routines for doing it with efficient memory use?
It may be worth looking at what transpose does, just so we are clear about what you mean by 'physically transposing'.
Start with a small (4,3) array:
In [51]: arr = np.array([[1,2,3],[10,11,12],[22,23,24],[30,32,34]])
In [52]: arr
Out[52]:
array([[ 1, 2, 3],
[10, 11, 12],
[22, 23, 24],
[30, 32, 34]])
This is stored with a 1d data buffer, which we can display with ravel:
In [53]: arr.ravel()
Out[53]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
and strides which tell it to step columns by 8 bytes, and rows by 24 (3*8):
In [54]: arr.strides
Out[54]: (24, 8)
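Those numbers follow directly from the shape and item size; a quick sanity check (assuming the 8-byte integer dtype this session is evidently using):
arr.itemsize                                 # 8 bytes per element
(arr.shape[1] * arr.itemsize, arr.itemsize)  # (24, 8), matching arr.strides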
We can ravel with order 'F' - that's going down the rows, column by column:
In [55]: arr.ravel(order='F')
Out[55]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
While [53] is a view, [55] is a copy.
Now the transpose:
In [57]: arrt=arr.T
In [58]: arrt
Out[58]:
array([[ 1, 10, 22, 30],
[ 2, 11, 23, 32],
[ 3, 12, 24, 34]])
This is a view; we can traverse the [53] data buffer, going down the rows with 8-byte steps. Doing calculations with arrt is basically just as fast as with arr. With strided iteration, order 'F' is just as fast as order 'C'.
In [59]: arrt.strides
Out[59]: (8, 24)
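As a quick check (np.shares_memory is standard NumPy, though not part of the session above), we can confirm that the transpose is a view into the same buffer:
np.shares_memory(arr, arrt)         # True: arrt is a view
np.shares_memory(arr, arrt.copy())  # False: the copy has its own buffer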
An 'F' ravel of the transpose recovers the original order:
In [60]: arrt.ravel(order='F')
Out[60]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
but doing a 'C' ravel creates a copy, same as [55]:
In [61]: arrt.ravel(order='C')
Out[61]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
Making a copy of the transpose produces an array that is the transpose, stored in 'C' order. This is your 'physical transpose':
In [62]: arrc = arrt.copy()
In [63]: arrc.strides
Out[63]: (32, 8)
Raveling a transpose in 'C' order, as in [61], does make a copy, but usually we don't need to make the copy explicitly. I think the only reason to do so is to avoid several redundant copies in later calculations.
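As an aside, if you do want the physical transpose in one step, np.ascontiguousarray spells it directly; a minimal sketch, equivalent to arr.T.copy() here since arr.T is F-contiguous:
arrc = np.ascontiguousarray(arr.T)   # copies into a new C-ordered buffer
arrc.flags['C_CONTIGUOUS']           # True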
I assume that you need to do a row-wise operation that uses the CPU cache more efficiently if rows are contiguous in memory, and you don't have enough memory available to make a copy.
Wikipedia has an article on in-place matrix transposition. It turns out that such a transposition is nontrivial. Here is a follow-the-cycles algorithm as described there:
import numpy as np
from numba import njit
@njit  # comment this line out for debugging
def transpose_in_place(a):
    """In-place matrix transposition for a rectangular matrix.
    https://stackoverflow.com/a/62507342/6228891
    Parameter:
    - a: 2D array. Unless it's a square matrix, it will be scrambled
      in the process.
    Return:
    - transposed array, using the same in-memory data storage as the
      input array.
    This algorithm is typically 10x slower than a.T.copy().
    Only use it if you are short on memory.
    """
    if a.shape == (1, 1):
        return a  # special case
    n, m = a.shape
    # find max length L of permutation cycle by starting at a[0,1].
    # k is the index in the flat buffer; i, j are the indices in a.
    L = 0
    k = 1
    while True:
        j = k % m
        i = k // m
        k = n*j + i
        L += 1
        if k == 1:
            break
    permut = np.zeros(L, dtype=np.int32)
    # Now do the permutations, one cycle at a time
    seen = np.full(n*m, False)
    aflat = a.reshape(-1)  # flat view
    for k0 in range(1, n*m-1):
        if seen[k0]:
            continue
        # construct cycle
        k = k0
        permut[0] = k0
        q = 1  # size of permutation array
        while True:
            seen[k] = True
            # note that this is slightly faster than the formula
            # on Wikipedia, k = n*k % (n*m-1)
            i = k // m
            j = k - i*m
            k = n*j + i
            if k == k0:
                break
            permut[q] = k
            q += 1
        # apply cyclic permutation
        tmp = aflat[permut[q-1]]
        aflat[permut[1:q]] = aflat[permut[:q-1]]
        aflat[permut[0]] = tmp
    aT = aflat.reshape(m, n)
    return aT
def test_transpose(n, m):
    a = np.arange(n*m).reshape(n, m)
    aT = a.T.copy()
    assert np.all(transpose_in_place(a) == aT)

def roundtrip_inplace(a):
    a = transpose_in_place(a)
    a = transpose_in_place(a)

def roundtrip_copy(a):
    a = a.T.copy()
    a = a.T.copy()

if __name__ == '__main__':
    test_transpose(1, 1)
    test_transpose(3, 4)
    test_transpose(5, 5)
    test_transpose(1, 5)
    test_transpose(5, 1)
    test_transpose(19, 29)
Even though I'm using numba.njit here so that the loops in the transpose function are compiled, it's still quite a bit slower than a copy-transpose.
n, m = 1000, 10000
a_big = np.arange(n*m, dtype=np.float64).reshape(n, m)
%timeit -r2 -n10 roundtrip_copy(a_big)
54.5 ms ± 153 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
%timeit -r2 -n1 roundtrip_inplace(a_big)
614 ms ± 141 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Whatever you do will require O(n^2) time and memory. I would assume that .transpose and .copy (written in C) will be the most efficient choice for your application.
Edit: this assumes you actually need to copy the matrix
Say I have an np.array like this:
a = np.array([1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
Is there a quick method to get the indices of all locations where 3 consecutive numbers are all above some threshold? That is, for some threshold th, get all x where this holds:
a[x]>th and a[x+1]>th and a[x+2]>th
Example: for threshold 40 and the list given above, x should be [4,8,9].
Many thanks.
Approach #1
Use convolution on the boolean mask obtained after comparison -
In [40]: a # input array
Out[40]: array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [42]: N = 3 # compare N consecutive numbers
In [44]: T = 40 # threshold for comparison
In [45]: np.flatnonzero(np.convolve(a>T, np.ones(N, dtype=int),'valid')>=N)
Out[45]: array([4, 8, 9])
Approach #2
Use binary_erosion -
In [77]: from scipy.ndimage import binary_erosion
In [31]: np.flatnonzero(binary_erosion(a>T, np.ones(N, dtype=int), origin=-(N//2)))
Out[31]: array([4, 8, 9])
Approach #3 (specific case): small number of consecutive numbers
For checking such a small number of consecutive numbers (three in this case), we can also slice the compared mask directly, for better performance -
m = a>T
out = np.flatnonzero(m[:-2] & m[1:-1] & m[2:])
Benchmarking
Timings on 100000 repeated/tiled array from given sample -
In [78]: a
Out[78]: array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [79]: a = np.tile(a,100000)
In [80]: N = 3
In [81]: T = 40
# Approach #3
In [82]: %%timeit
...: m = a>T
...: out = np.flatnonzero(m[:-2] & m[1:-1] & m[2:])
1000 loops, best of 3: 1.83 ms per loop
# Approach #1
In [83]: %timeit np.flatnonzero(np.convolve(a>T, np.ones(N, dtype=int),'valid')>=N)
100 loops, best of 3: 10.9 ms per loop
# Approach #2
In [84]: %timeit np.flatnonzero(binary_erosion(a>T,np.ones(N, dtype=int), origin=-(N//2)))
100 loops, best of 3: 11.7 ms per loop
Try:
th = 40
results = [x for x in range(len(array) - 2) if array[x:x+3].min() > th]
which is a list comprehension for
th = 40
results = []
for x in range(len(array) - 2):
    if array[x:x+3].min() > th:
        results.append(x)
Another approach, using numpy.lib.stride_tricks.as_strided:
In [59]: import numpy as np
In [60]: from numpy.lib.stride_tricks import as_strided
Define the input data:
In [61]: a = np.array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [62]: N = 3
In [63]: threshold = 40
Compute the result; q is the boolean mask for the "big" values.
In [64]: q = a > threshold
In [65]: result = np.all(as_strided(q, shape=(len(q)-N+1, N), strides=(q.strides[0], q.strides[0])), axis=1).nonzero()[0]
In [66]: result
Out[66]: array([4, 8, 9])
Do it again with N = 4:
In [67]: N = 4
In [68]: result = np.all(as_strided(q, shape=(len(q)-N+1, N), strides=(q.strides[0], q.strides[0])), axis=1).nonzero()[0]
In [69]: result
Out[69]: array([8])
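On NumPy 1.20 or newer, sliding_window_view wraps the same trick without hand-computed strides. A minimal sketch of the same computation:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
N = 3
threshold = 40
q = a > threshold
# each row of the window view holds N consecutive mask values
result = np.all(sliding_window_view(q, N), axis=1).nonzero()[0]
# result: array([4, 8, 9])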
I have four square matrices with dimension 3Nx3N, called A, B, C and D.
I want to combine them into a single matrix.
The code with for loops is the following:
import numpy
N = 3
A = numpy.random.random((3*N, 3*N))
B = numpy.random.random((3*N, 3*N))
C = numpy.random.random((3*N, 3*N))
D = numpy.random.random((3*N, 3*N))
final = numpy.zeros((6*N, 6*N))
for i in range(N):
    for j in range(N):
        for k in range(3):
            for l in range(3):
                final[6*i + k][6*j + l] = A[3*i+k][3*j+l]
                final[6*i + k + 3][6*j + l + 3] = B[3*i+k][3*j+l]
                final[6*i + k + 3][6*j + l] = C[3*i+k][3*j+l]
                final[6*i + k][6*j + l + 3] = D[3*i+k][3*j+l]
Is it possible to write the previous for loops in a numpythonic way?
Great problem for practicing array-slicing into multi-dimensional tensors/arrays!
We will initialize the output array as a 6D array and simply slice it, assigning the four input arrays reshaped as 4D arrays. The intention is to avoid any stacking/concatenation, which would be expensive, especially when working with large arrays, by instead working with reshapes of the input arrays, which are merely views.
Here's the implementation -
out = np.zeros((N,2,3,N,2,3),dtype=A.dtype)
out[:,0,:,:,0,:] = A.reshape(N,3,N,3)
out[:,0,:,:,1,:] = D.reshape(N,3,N,3)
out[:,1,:,:,0,:] = C.reshape(N,3,N,3)
out[:,1,:,:,1,:] = B.reshape(N,3,N,3)
out.shape = (6*N,6*N)
Just to explain a bit more, we had:
             |-------------------- axis for selecting A, B, C, D
np.zeros((N, 2, 3, N, 2, 3), dtype=A.dtype)
                      |----------- axis for selecting A, B, C, D
Thus, those two axes (the second and the fifth), of length 2 each, give 2x2 = 4 combinations and were used to select between the four inputs.
Runtime test
Approaches -
def original_app(A, B, C, D):
    final = np.zeros((6*N,6*N),dtype=A.dtype)
    for i in range(N):
        for j in range(N):
            for k in range(3):
                for l in range(3):
                    final[6*i + k][6*j + l] = A[3*i+k][3*j+l]
                    final[6*i + k + 3][6*j + l + 3] = B[3*i+k][3*j+l]
                    final[6*i + k + 3][6*j + l] = C[3*i+k][3*j+l]
                    final[6*i + k][6*j + l + 3] = D[3*i+k][3*j+l]
    return final

def slicing_app(A, B, C, D):
    out = np.zeros((N,2,3,N,2,3),dtype=A.dtype)
    out[:,0,:,:,0,:] = A.reshape(N,3,N,3)
    out[:,0,:,:,1,:] = D.reshape(N,3,N,3)
    out[:,1,:,:,0,:] = C.reshape(N,3,N,3)
    out[:,1,:,:,1,:] = B.reshape(N,3,N,3)
    return out.reshape(6*N,6*N)
Timings and verification -
In [147]: # Setup input arrays
...: N = 200
...: A = np.random.randint(11,99,(3*N,3*N))
...: B = np.random.randint(11,99,(3*N,3*N))
...: C = np.random.randint(11,99,(3*N,3*N))
...: D = np.random.randint(11,99,(3*N,3*N))
...:
In [148]: np.allclose(slicing_app(A, B, C, D), original_app(A, B, C, D))
Out[148]: True
In [149]: %timeit original_app(A, B, C, D)
1 loops, best of 3: 1.63 s per loop
In [150]: %timeit slicing_app(A, B, C, D)
100 loops, best of 3: 9.26 ms per loop
I'll start with a couple of generic observations.
For numpy arrays we normally use the [ , ] syntax rather than [][]:
final[6*i + k][6*j + l]
final[6*i + k, 6*j + l]
For new arrays built from others, we often use things like reshape and slicing, so that pieces can be added together as blocks rather than with iterative loops.
For a simple example, to take successive differences (what np.diff does):
y = x[1:] - x[:-1]
Regarding the title, 'matrix creation' is clearer. 'load' has more of the sense of reading data from a file, as in np.loadtxt.
=================
So with N=1,
In [171]: A=np.arange(0,9).reshape(3,3)
In [172]: B=np.arange(10,19).reshape(3,3)
In [173]: C=np.arange(20,29).reshape(3,3)
In [174]: D=np.arange(30,39).reshape(3,3)
In [178]: final
Out[178]:
array([[ 0, 1, 2, 30, 31, 32],
[ 3, 4, 5, 33, 34, 35],
[ 6, 7, 8, 36, 37, 38],
[20, 21, 22, 10, 11, 12],
[23, 24, 25, 13, 14, 15],
[26, 27, 28, 16, 17, 18]])
Which can be created with one call to bmat:
In [183]: np.bmat([[A,D],[C,B]]).A
Out[183]:
array([[ 0, 1, 2, 30, 31, 32],
[ 3, 4, 5, 33, 34, 35],
[ 6, 7, 8, 36, 37, 38],
[20, 21, 22, 10, 11, 12],
[23, 24, 25, 13, 14, 15],
[26, 27, 28, 16, 17, 18]])
bmat uses a mix of hstack and vstack. It also produces an np.matrix, hence the need for .A. @Divakar's solution is bound to be faster.
This does not match for N=3; the 3x3 blocks are out of order. But expanding the array to 6d (as Divakar does) and swapping some axes puts the sub-blocks into the right order.
For N=3:
In [57]: block=np.bmat([[A,D],[C,B]])
In [58]: b1=block.A.reshape(2,3,3,2,3,3)
In [59]: b2=b1.transpose(1,0,2,4,3,5)
In [60]: b3=b2.reshape(18,18)
In [61]: np.allclose(b3,final)
Out[61]: True
In quick time tests (N=3), my approach is about half the speed of slicing_app.
As a matter of curiosity, bmat works with a string input: np.bmat('A,D;C,B'). That's because np.matrix was trying, years ago, to give a MATLAB feel.
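A small sketch of that string form (the names are looked up in the calling namespace; the tiny matrices here are just for illustration):
import numpy as np
A = np.ones((2, 2)); B = 2 * np.ones((2, 2))
C = 3 * np.ones((2, 2)); D = 4 * np.ones((2, 2))
M1 = np.bmat('A,D;C,B')         # string form, MATLAB style
M2 = np.bmat([[A, D], [C, B]])  # list form, same result
assert (M1 == M2).all()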
You can just concat 'em:
concat A and B horizontally
concat C and D horizontally
concat the concatenation of AB with the concatenation of CD vertically
example:
AB = numpy.concatenate([A, B], 1)
CD = numpy.concatenate([C, D], 1)
ABCD = numpy.concatenate([AB, CD], 0)
I hope that helps :)
Yet another way to do that, using view_as_blocks:
from skimage.util import view_as_blocks

def by_blocks():
    final = numpy.empty((6*N, 6*N))
    a, b, c, d, f = [view_as_blocks(X, (3, 3)) for X in [A, B, C, D, final]]
    f[0::2, 0::2] = a
    f[1::2, 1::2] = b
    f[1::2, 0::2] = c
    f[0::2, 1::2] = d
    return final
You just have to think block by block, letting view_as_blocks manage strides and shapes for you. It is as fast as other numpy solutions.
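A quick way to sanity-check the block placement, assuming the A, B, C, D and N from the question are in scope:
final = by_blocks()
assert (final[0:3, 0:3] == A[0:3, 0:3]).all()  # an A block lands top-left
assert (final[3:6, 3:6] == B[0:3, 0:3]).all()  # a B block lands one step down-right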
I want to mask a numpy array a with mask. The mask doesn't have exactly the same shape as a, but it is possible to mask a anyway (I guess because of the additional dimension being 1-dimensional (broadcasting?)).
a.shape
>>> (3, 9, 31, 2, 1)
mask.shape
>>> (3, 9, 31, 2)
masked_a = ma.masked_array(a, mask)
The same logic, however, does not apply to array b, which has 5 elements in its last dimension.
ext_mask = mask[..., np.newaxis] # extending or not extending has same effect
ext_mask.shape
>>> (3, 9, 31, 2, 1)
b.shape
>>> (3, 9, 31, 2, 5)
masked_b = ma.masked_array(b, ext_mask)
>>> numpy.ma.core.MaskError: Mask and data not compatible: data size is 8370, mask size is 1674.
How can I create a (3, 9, 31, 2, 5) mask from a (3, 9, 31, 2) mask by expanding any True value in the last dimension of the (3, 9, 31, 2) mask to [True, True, True, True, True] (and False respectively)?
This gives the desired result:
masked_b = ma.masked_array(*np.broadcast_arrays(b, ext_mask))
I have not profiled this method, but it should be faster than allocating a new mask. According to the documentation, no data is copied:
These arrays are views on the original arrays. They are typically not
contiguous. Furthermore, more than one element of a broadcasted array
may refer to a single memory location. If you need to write to the
arrays, make copies first.
It is possible to verify the no-copying behavior:
bb, mb = np.broadcast_arrays(b, ext_mask)
print(mb.shape) # (3, 9, 31, 2, 5) - same shape as b
print(mb.base.shape) # (3, 9, 31, 2) - the shape of the original mask
print(mb.strides) # (558, 62, 2, 1, 0) - that's how it works: 0 stride
Pretty impressive how the numpy developers implemented broadcasting: values are repeated by using a stride of 0 along the last dimension. Wow!
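The same zero-stride view can be requested explicitly with np.broadcast_to, which may read more clearly than broadcasting both arrays. A sketch; note the result is a read-only view, so copy it first if you need to modify the mask later:
m_view = np.broadcast_to(ext_mask, b.shape)  # zero-stride, read-only view
masked_b = ma.masked_array(b, m_view)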
Edit
I compared the speed of broadcasting and allocating with this code:
import numpy as np
from numpy import ma
a = np.random.randn(30, 90, 31, 2, 1)
b = np.random.randn(30, 90, 31, 2, 5)
mask = np.random.randn(30, 90, 31, 2) > 0
ext_mask = mask[..., np.newaxis]
def broadcasting(a=a, b=b, ext_mask=ext_mask):
    mb1 = ma.masked_array(*np.broadcast_arrays(b, ext_mask))

def allocating(a=a, b=b, ext_mask=ext_mask):
    m2 = np.empty(b.shape, dtype=bool)
    m2[:] = ext_mask
    mb2 = ma.masked_array(b, m2)
Broadcasting is clearly faster than allocating, here:
# array size: (30, 90, 31, 2, 5)
In [23]: %timeit broadcasting()
The slowest run took 10.39 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39.4 µs per loop
In [24]: %timeit allocating()
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 982 µs per loop
Note that I had to increase the array size for the difference in speed to become apparent. With the original array dimensions, allocating was slightly faster than broadcasting:
# array size: (3, 9, 31, 2, 5)
In [28]: %timeit broadcasting()
The slowest run took 9.36 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39 µs per loop
In [29]: %timeit allocating()
The slowest run took 9.22 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.6 µs per loop
The broadcasting solution's runtime seems not to depend on array size.
In order to find the index of the smallest value, I can use argmin:
import numpy as np
A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
print(A.argmin())  # 4 because A[4] = 0.1
But how can I find the indices of the k-smallest values?
I'm looking for something like:
print(A.argmin(numberofvalues=3))
# [4, 0, 7] because A[4] <= A[0] <= A[7] <= all other A[i]
Note: in my use case A has between ~ 10 000 and 100 000 values, and I'm interested for only the indices of the k=10 smallest values. k will never be > 10.
Use np.argpartition. It does not sort the entire array. It only guarantees that the kth element is in sorted position and all smaller elements will be moved before it. Thus the first k elements will be the k-smallest elements.
import numpy as np
A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
k = 3
idx = np.argpartition(A, k)
print(idx)
# [4 0 7 3 1 2 6 5]
Indexing with the first k elements of idx gives the k smallest values. Note that these may not be in sorted order:
print(A[idx[:k]])
# [ 0.1 1. 1.5]
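If sorted order matters, sorting just those k candidates afterwards stays cheap for small k; a small sketch:
smallest = idx[:k][np.argsort(A[idx[:k]])]  # k smallest indices, ordered by value
print(A[smallest])
# [ 0.1  1.   1.5]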
To obtain the k largest values, use
idx = np.argpartition(A, -k)
A[idx[-k:]]
# [ 9. 17. 17.]
WARNING: Do not (re)use idx = np.argpartition(A, k); A[idx[-k:]] to obtain the k-largest.
That won't always work. For example, these are NOT the 3 largest values in x:
x = np.array([100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 0])
idx = np.argpartition(x, 3)
x[idx[-3:]]
array([ 70, 80, 100])
Here is a comparison against np.argsort, which also works but just sorts the entire array to get the result.
In [2]: x = np.random.randn(100000)
In [3]: %timeit idx0 = np.argsort(x)[:100]
100 loops, best of 3: 8.26 ms per loop
In [4]: %timeit idx1 = np.argpartition(x, 100)[:100]
1000 loops, best of 3: 721 µs per loop
In [5]: np.alltrue(np.sort(np.argsort(x)[:100]) == np.sort(np.argpartition(x, 100)[:100]))
Out[5]: True
You can use numpy.argsort with slicing
>>> import numpy as np
>>> A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
>>> np.argsort(A)[:3]
array([4, 0, 7], dtype=int32)
For n-dimensional arrays, this function works well. The indices are returned in a form that can be used directly to index the array. If you want a list of the indices instead, transpose the result before converting it to a list.
To retrieve the k largest, simply pass in -k.
def get_indices_of_k_smallest(arr, k):
    idx = np.argpartition(arr.ravel(), k)
    return tuple(np.array(np.unravel_index(idx, arr.shape))[:, range(min(k, 0), max(k, 0))])
    # if you want it in a list of indices . . .
    # return np.array(np.unravel_index(idx, arr.shape))[:, range(k)].transpose().tolist()
Example:
r = np.random.RandomState(1234)
arr = r.randint(1, 1000, 2 * 4 * 6).reshape(2, 4, 6)
indices = get_indices_of_k_smallest(arr, 4)
indices
# (array([1, 0, 0, 1], dtype=int64),
# array([3, 2, 0, 1], dtype=int64),
# array([3, 0, 3, 3], dtype=int64))
arr[indices]
# array([ 4, 31, 54, 77])
%%timeit
get_indices_of_k_smallest(arr, 4)
# 17.1 µs ± 651 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
numpy.partition(your_array, k) is an alternative if you want the values rather than the indices: it guarantees that the kth element is in sorted position and everything smaller comes before it, so the first k elements of the result are the k smallest values.
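A quick sketch of that usage:
import numpy as np
A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
vals = np.partition(A, 3)[:3]  # the 3 smallest values, not necessarily sorted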
Let's say we have two matrices A and B, and let matrix C be A*B (matrix multiplication, not element-wise). We wish to get only the diagonal entries of C, which can be done via np.diagonal(C). However, this causes unnecessary time overhead, because we are computing the full product A*B even though we only need the multiplications of each row in A with the column of B that has the same 'id', that is, row 1 of A with column 1 of B, row 2 of A with column 2 of B, and so on: the multiplications that form the diagonal of C. Is there a way to achieve that efficiently using Numpy? I want to avoid using loops to control which row is multiplied with which column; instead, I wish for a built-in numpy method that does this kind of operation to optimize performance.
Thanks in advance.
I might use einsum here:
>>> a = np.random.randint(0, 10, (3,3))
>>> b = np.random.randint(0, 10, (3,3))
>>> a
array([[9, 2, 8],
[5, 4, 0],
[8, 0, 6]])
>>> b
array([[5, 5, 0],
[3, 5, 5],
[9, 4, 3]])
>>> a.dot(b)
array([[123, 87, 34],
[ 37, 45, 20],
[ 94, 64, 18]])
>>> np.diagonal(a.dot(b))
array([123, 45, 18])
>>> np.einsum('ij,ji->i', a,b)
array([123, 45, 18])
For larger arrays, it'll be much faster than doing the multiplication directly:
>>> a = np.random.randint(0, 10, (1000,1000))
>>> b = np.random.randint(0, 10, (1000,1000))
>>> %timeit np.diagonal(a.dot(b))
1 loops, best of 3: 7.04 s per loop
>>> %timeit np.einsum('ij,ji->i', a, b)
100 loops, best of 3: 7.49 ms per loop
[Note: originally I'd done the elementwise version, ii,ii->i, instead of matrix multiplication. The same einsum tricks work.]
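Equivalently, since entry i of the diagonal is the dot product of row i of a with column i of b, the same result can be had with plain broadcasting; a sketch of the same idea:
(a * b.T).sum(axis=1)
# array([123,  45,  18])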
def diag(A, B):
    diags = []
    for x in range(len(A)):
        # entry x of the diagonal of A*B is the dot product of
        # row x of A with column x of B
        diags.append(sum(A[x][j] * B[j][x] for j in range(len(B))))
    return diags
I believe the above code is what you're looking for.