I need to compute the trace of a matrix across all its diagonals. That is, for an nxm matrix, the operation should produce n+m-1 'traces'. Here is an example program:
import numpy as np
A = np.arange(12).reshape(3, 4)

def function_1(A):
    output = np.zeros(A.shape[0] + A.shape[1] - 1)
    for i in range(A.shape[0] + A.shape[1] - 1):
        output[i] = np.trace(A, A.shape[1] - 1 - i)
    return output
A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
function_1(A)
array([ 3., 9., 18., 15., 13., 8.])
My hope is to find a way to replace the loop in the program, since I need to do this computation many times on very large matrices. One avenue that looks promising is
to use numpy.einsum, but I can't quite figure out how to do it. Alternatively I have looked into rewriting the problem entirely with loops in cython:
%load_ext cythonmagic
%%cython
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def function_2(long [:,:] A):
    cdef int n=A.shape[0]
    cdef int m=A.shape[1]
    cdef long [::1] output = np.empty(n+m-1,dtype=np.int64)
    cdef size_t l1
    cdef int i,j, k1
    cdef long out
    it_list1=range(m)
    it_list2=range(m,m+n-1)
    for l1 in range(len(it_list1)):
        k1=it_list1[l1]
        i=0
        j=m-1-k1
        out=0
        while (i<n)&(j<m):
            out+=A[i,j]
            i+=1
            j+=1
        output[k1]=out
    for l1 in range(len(it_list2)):
        k1=it_list2[l1]
        i=k1-m+1
        j=0
        out=0
        while (i<n)&(j<m):
            out+=A[i,j]
            i+=1
            j+=1
        output[k1]=out
    return np.array(output)
The cython program outperforms the program looping through np.trace:
%timeit function_1(A)
10000 loops, best of 3: 62.7 µs per loop
%timeit function_2(A)
100000 loops, best of 3: 9.66 µs per loop
So, basically, I want feedback on whether there is a more efficient way to use numpy/scipy routines for this, or whether I have probably already achieved the fastest approach with Cython.
If you want to stay away from Cython, building a diagonal index array and using np.bincount may do the trick:
>>> import numpy as np
>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> rows, cols = a.shape
>>> rows_arr = np.arange(rows)
>>> cols_arr = np.arange(cols)
>>> diag_idx = rows_arr[:, None] - (cols_arr - (cols - 1))
>>> diag_idx
array([[3, 2, 1, 0],
       [4, 3, 2, 1],
       [5, 4, 3, 2]])
>>> np.bincount(diag_idx.ravel(), weights=a.ravel())
array([ 3., 9., 18., 15., 13., 8.])
By my timings, for your example input, it is 4x faster than your original pure Python method. So I don't think it is going to be faster than your Cython code, but you may want to time it.
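If it helps, here is the same trick wrapped as a reusable function (the wrapper and its name are mine; the logic is exactly the bincount recipe above):
import numpy as np

def all_traces_bincount(A):
    # Label every element with the index of its diagonal (reversed so the
    # output matches function_1 from the question) and let bincount add up
    # the weights per label.
    rows, cols = A.shape
    diag_idx = np.arange(rows)[:, None] - (np.arange(cols) - (cols - 1))
    return np.bincount(diag_idx.ravel(), weights=A.ravel())

# all_traces_bincount(np.arange(12).reshape(3, 4))
# -> array([ 3.,  9., 18., 15., 13.,  8.])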
If your matrix shape is sufficiently far away from being square, i.e. if it is tall or wide, then you can use stride tricks efficiently to do this. You can use stride tricks in any case, but it may not be super memory efficient if the matrix is near square.
What you need to do is create a new array view on the same data which is constructed in a way that the step going from one line to the next also causes an increment in the column. This is achieved by changing the strides of the array.
The problem that one needs to take care of lies at the borders of the array, where one needs to zero-pad. If the array is far from being square, this does not matter. If it is square, then we need twice the size of the array to pad.
If you do not need the smaller traces at the edges, then you do not need to zero-pad.
Here goes (assuming more columns than lines, but easily adapted):
import numpy as np
from numpy.lib.stride_tricks import as_strided
A = np.arange(30).reshape(3, 10)
A_embedded = np.hstack([np.zeros([3, 2]), A, np.zeros([3, 2])])
A = A_embedded[:, 2:-2] # We are now sure that the memory around A is padded with 0, but actually we never really need A again
new_strides = (A.strides[0] + A.strides[1], A.strides[1])
B = as_strided(A_embedded, shape=A_embedded[:, :-2].shape, strides=new_strides)
traces = B.sum(0)
print(A)
print(B)
print(traces)
In order to conform with the output you show in your example, you need to reverse it (see @larsmans' comment):
traces = traces[::-1]
This is a specific example with concrete numbers. If this is useful to your usecase I can turn it into a general function.
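For what it's worth, here is one way the recipe above could be turned into such a general function; the function name and the choice to return the sums in the same order as function_1 from the question are mine:
import numpy as np
from numpy.lib.stride_tricks import as_strided

def all_traces_strided(A):
    # Zero-pad n-1 columns on each side, then build a skewed view in which
    # stepping down one row also steps one column right, so every column of
    # the view holds one diagonal of A.
    n, m = A.shape
    pad = np.zeros((n, n - 1), dtype=A.dtype)
    padded = np.hstack([pad, A, pad])      # shape (n, m + 2*(n-1)), C-contiguous
    row_stride, col_stride = padded.strides
    skewed = as_strided(padded, shape=(n, n + m - 1),
                        strides=(row_stride + col_stride, col_stride))
    # Reverse to match the output order of function_1 in the question.
    return skewed.sum(axis=0)[::-1]

# all_traces_strided(np.arange(12).reshape(3, 4))
# -> array([ 3,  9, 18, 15, 13,  8])
Since the padding adds 2*(n-1) zero columns, applying this to A.T (which should simply reverse the output) keeps the overhead small for tall matrices.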
Here's an improved version of your Cython function.
Honestly, this is how I'd do it if Cython is an option.
import numpy as np
from libc.stdint cimport int64_t as i64
from cython cimport boundscheck, wraparound
@boundscheck(False)
@wraparound(False)
def all_trace_int64(i64[:,::1] A):
    cdef:
        int i,j
        i64[:] t = np.zeros(A.shape[0] + A.shape[1] - 1, dtype=np.int64)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            t[A.shape[0]-i+j-1] += A[i,j]
    return np.array(t)
This will be significantly faster than the version you give in your question because it iterates over the array in the order in which it is stored in memory.
For small arrays, the two approaches are nearly the same, though this one is marginally faster on my machine.
I wrote this function so that it requires a C-contiguous array.
If you have a Fortran contiguous array, transpose it, then reverse the order of the output.
This does return the answers in the opposite order from the function shown in your example, so you will need to reverse the order of the array if the order is particularly important.
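For example (a usage sketch I'm adding, assuming the function has been compiled as above), reversing the result reproduces function_1's output from the question:
import numpy as np

A = np.arange(12).reshape(3, 4).astype(np.int64)  # C-contiguous int64, as the signature requires
print(all_trace_int64(A)[::-1])                   # [ 3  9 18 15 13  8]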
You may also improve performance by compiling with heavier optimizations.
For example, you could build your Cython code in the IPython notebook with additional compiler flags by replacing
%%cython
with something like
%%cython -c=-O3 -c=-march=native -c=-funroll-loops -f
Edit:
When doing this, you will also want to check whether your values are actually generated by an outer product. If they are, this whole operation can be folded together with the outer product into a single call to np.convolve.
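A small sketch of what that means, with made-up factor vectors u and v: if A = np.outer(u, v), the diagonal sums of A are just a 1-D convolution of u with the reversed v (the reversal is what lines the result up with function_1's ordering from the question):
import numpy as np

u = np.array([1, 2, 3])             # hypothetical factor vectors
v = np.array([1, 10, 100, 1000])
A = np.outer(u, v)

# Summing the diagonals of an outer product collapses to a 1-D convolution.
via_convolve = np.convolve(u, v[::-1])
via_traces = np.array([np.trace(A, A.shape[1] - 1 - k) for k in range(sum(A.shape) - 1)])

print(via_convolve)                               # [1000 2100 3210  321   32    3]
print(np.array_equal(via_convolve, via_traces))   # True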
This is competitive if the array is large:
def f5(A):
    rows, cols = A.shape
    N = rows + cols - 1
    out = np.zeros(N, A.dtype)
    for idx in range(rows):
        out[N-idx-cols:N-idx] += A[idx]
    return out[::-1]
Although it uses a Python loop, it's faster than the bincount solution (for large arrays, on my system).
This method does have high sensitivity to the array column/row ratio, because this ratio determines how much looping is done in Python relative to Numpy.
As @Jaime pointed out, it's efficient to iterate over the smallest dimension, e.g.:
def f6(A):
    rows, cols = A.shape
    N = rows + cols - 1
    out = np.zeros(N, A.dtype)
    if rows > cols:
        for idx in range(cols):
            out[N-idx-rows:N-idx] += A[:, idx]
    else:
        for idx in range(rows):
            out[N-idx-cols:N-idx] += A[idx]
    out = out[::-1]
    return out
But it should be noted that for larger array sizes (e.g. 100000 x 500 on my system), accessing the array row by row, as in the first code I posted, can still be faster, probably because of how the array is laid out in RAM (it's faster to fetch contiguous chunks than scattered ones).
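As a quick sanity check (mine, not part of the original post), both variants reproduce the example output from the question:
import numpy as np

A = np.arange(12).reshape(3, 4)
print(f5(A))   # [ 3  9 18 15 13  8]
print(f6(A))   # [ 3  9 18 15 13  8]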
This can be done by (slightly abusively) using scipy.sparse.dia_matrix in two ways, one sparser than the other.
The first one, yielding the exact result, uses the dia_matrix stored data vector
import numpy as np
from scipy.sparse import dia_matrix
A = np.arange(30).reshape(3, 10)
traces = dia_matrix(A).data.sum(1)[::-1]
A less memory-intensive method would be to work the other way round:
import numpy as np
from scipy.sparse import dia_matrix
A = np.arange(30).reshape(3, 10)
A_dia = dia_matrix((A, range(len(A))), shape=(A.shape[1],) * 2)
traces = np.array(A_dia.sum(1)).ravel()[::-1]
Note, however, that two entries are missing in this solution. This may be correctable in a smart way, but I am not sure yet.
@moarningsun found the solution:
rows, cols = A.shape
A_dia = dia_matrix((A, np.arange(rows)), shape=(cols,)*2)
traces1 = A_dia.sum(1).A.ravel()
A_dia = dia_matrix((A, np.arange(-rows+1, 1)), shape=(rows,)*2)
traces2 = A_dia.sum(1).A.ravel()
traces = np.concatenate((traces1[::-1], traces2[-2::-1]))
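Running this recipe on the 3x4 example from the question (a check I'm adding here) gives the expected result:
import numpy as np
from scipy.sparse import dia_matrix

A = np.arange(12).reshape(3, 4)
rows, cols = A.shape
traces1 = dia_matrix((A, np.arange(rows)), shape=(cols,)*2).sum(1).A.ravel()
traces2 = dia_matrix((A, np.arange(-rows+1, 1)), shape=(rows,)*2).sum(1).A.ravel()
print(np.concatenate((traces1[::-1], traces2[-2::-1])))   # [ 3.  9. 18. 15. 13.  8.]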
np.trace does what you want:
import numpy as np
A = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
n, m = A.shape
[np.trace(A, i) for i in range(-n + 1, m)]
Edit: Changed np.sum(np.diag()) to np.trace() according to the suggestion from @user2357112.
Use the numpy array trace method:
import numpy as np
A = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
A.trace()
returns:
15
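Note that trace also accepts an offset argument, so the other diagonals can be summed the same way (an example I'm adding for completeness):
A.trace(offset=1)    # 1 + 6 + 11 == 18
A.trace(offset=-1)   # 4 + 9 == 13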
Related
I am trying to run something like:
np.bincount(array1, weights = array2, minlength=7)
where both array1 and array2 are 2-D NumPy arrays of shape (m, n). My goal is to run np.bincount() once per row, using the corresponding rows of array1 and array2 together.
I have tried np.apply_along_axis(), but as far as I can tell it only lets the function run on each row of array1, without passing the matching row of array2 to np.bincount as the weights. I was hoping to do this cleanly with a NumPy function rather than iteration, since this is a performance-critical function, but so far I can't find another way.
For example, given these arrays:
array1 = [[1,2,3],[4,5,6]]
array2 = [[7,8,9],[10,11,12]]
I would want to compute:
[np.bincount([1,2,3], weights=[7,8,9], minlength=7), np.bincount([4,5,6], weights=[10,11,12], minlength=7)]
A simple solution is to use a list comprehension:
result = [np.bincount(v, weights=w) for v,w in zip(array1, array2)]
Because the resulting arrays can have different sizes (and actually do in your example), the result cannot be a NumPy array; it has to be a regular list. Most NumPy functions are not able to work on a list of variable-sized arrays, or even to produce one.
If you have a lot of rows in the arrays, you can mitigate the cost of the CPython interpreter loop using Numba's JIT (or possibly Cython in this case). Note that the input arrays must be converted to NumPy arrays before calling the Numba function for the sake of performance. If you know that all the arrays are the same size, you can write a more efficient implementation with Numba (by preallocating the resulting array and doing the bincount yourself).
Update
With fixed-size arrays, here is a fast implementation in Numba:
import numpy as np
import numba as nb
array1 = np.array([[1,2,3],[4,5,6]], dtype=np.int32)
array2 = np.array([[7,8,9],[10,11,12]], dtype=np.int32)
@nb.njit('i4[:,::1](i4[:,::1],i4[:,::1])')
def compute(array1, array2):
    assert array1.shape == array2.shape
    n, m = array1.shape
    res = np.zeros((n, 7), dtype=np.int32)
    for i in range(n):
        for j in range(m):
            v = array1[i, j]
            assert v >= 0 and v < 7  # Can be removed if the input is safe
            res[i, v] += array2[i, j]
    return res
result = compute(array1, array2)
# result is
# array([[ 0, 7, 8, 9, 0, 0, 0],
# [ 0, 0, 0, 0, 10, 11, 12]])
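If you'd rather stay in plain NumPy than depend on Numba, here is a sketch of the usual offset trick (my addition, assuming the same fixed minlength of 7): shift each row's bins into its own block, do one flat bincount, and reshape.
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([[7, 8, 9], [10, 11, 12]])
n_rows, minlength = array1.shape[0], 7

# Shift each row's bin indices into a separate block of size `minlength`,
# run a single flat bincount, then fold the result back into rows.
offsets = np.arange(n_rows)[:, None] * minlength
flat = np.bincount((array1 + offsets).ravel(),
                   weights=array2.ravel(),
                   minlength=n_rows * minlength)
result = flat.reshape(n_rows, minlength)
print(result)
# [[ 0.  7.  8.  9.  0.  0.  0.]
#  [ 0.  0.  0.  0. 10. 11. 12.]]
Note that bincount with weights returns floats, so cast with .astype(array2.dtype) if you need integers.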
Slice a 3d numpy array using a 1d lookup between indices
import numpy as np
a = np.arange(12).reshape(2, 3, 2)
b = np.array([2, 0])
b maps i to j, where i and j are the first two indices of a, so the lookup selects a[i, j, k] with j = b[i].
Desired result after applying b to a is:
[[4 5]
 [6 7]]
Naive solution:
c = np.empty(shape=(2, 2), dtype=int)
for i in range(2):
    j = b[i]
    c[i, :] = a[i, j, :]
Question: Is there a way to do this using a numpy or scipy routine or routines or fancy indexing?
Application: reinforcement learning with finite MDPs, where b is a deterministic policy vector pi(a|s), a holds the state transition probabilities p(s'|s,a), and c is the state transition matrix for that policy, p(s'|s). The arrays will be large and this operation will be repeated a large number of times, so it needs to be scalable and fast.
What I have tried:
Compiling using numba but line profiler suggests my code is slower compared to a similarly sized numpy routine. Also numpy is more widely understood and used.
Maintaining pi(a|s) as a sparse matrix (all zero except one 1 per row) b_as_a_matrix and then using einsum but this involves storing and updating the matrix and creates more work (an extra loop over j and sum operation).
c = np.einsum('ij,ijk->ik', b_as_a_matrix, a)
Numpy arrays can be indexed using other arrays as indices. See also: NumPy selecting specific column index per row by using a list of indexes.
With that in mind, we can vectorize your loop to simply use b for indexing:
>>> import numpy as np
>>> a = np.arange(12).reshape(2, 3, 2)
>>> b = np.array([2, 0])
>>> i = np.arange(len(b))
>>> i
array([0, 1])
>>> a[i, b, :]
array([[4, 5],
       [6, 7]])
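An equivalent formulation (my addition, not from the original answer) uses np.take_along_axis, which avoids building the arange index explicitly:
>>> idx = b[:, None, None]                    # shape (2, 1, 1), broadcasts over the last axis
>>> np.take_along_axis(a, idx, axis=1)[:, 0, :]
array([[4, 5],
       [6, 7]])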
I'm having trouble finding the proper way to do something I think should be trivial using numpy. I have an array (1000x1000) and I want to calculate the sum of a specific pattern across the array.
For example:
If I have the array [[5,3,7,1,2],[3,2,9,4,7],[8,9,4,2,3]] and want to calculate the sum of each two-cell down-right diagonal pair, I would expect [7,12,11,8,12,6,11,7] (a total of 8 sums).
How can I do this?
This operation is called a 2-dimensional convolution:
>>> import numpy as np
>>> from scipy.signal import convolve2d
>>> kernel = np.eye(2, dtype=int)
>>> a = np.array([[5,3,7,1,2],[3,2,9,4,7],[8,9,4,2,3]])
>>> convolve2d(a, kernel, mode='valid')
array([[ 7, 12, 11,  8],
       [12,  6, 11,  7]])
Should you want to generalize it to arbitrary dimensions, there is also scipy.ndimage.convolve available. It will also work for this 2d case, but does not offer the mode='valid' convenience.
l = [[5,3,7,1,2],[3,2,9,4,7],[8,9,4,2,3]]
[q+l[w+1][t+1] for w,i in enumerate(l[:-1]) for t,q in enumerate(i[:-1])]
then you can avoid using numpy :) and the output is
[7,12,11,8,12,6,11,7]
Let's say I have following numpy arrays:
import numpy as np
a = np.array([1, 2])
b = np.array([1])
c = np.array([1, 4, 8, 10])
How can I do something like np.vstack((a, b, c)) without any error? I know there is a pure python way l = [a, b, c] but that's not efficient enough. I'd like to implement it in a numpy method. Do you have any idea? Thanks in advance!
In [863]: a = np.array([1, 2])
In [864]: b = np.array([1])
In [865]: c = np.array([1, 4, 8, 10])
A list of these 3 arrays:
In [866]: ll=[a,b,c]
An object dtype array made from this list:
In [867]: A=np.array(ll)
In [868]: A
Out[868]: array([array([1, 2]), array([1]), array([ 1, 4, 8, 10])], dtype=object)
A, like ll, contains pointers to data objects elsewhere in memory. In terms of memory use they are equally efficient.
In [870]: id(A[1]),id(b)
Out[870]: (3032501768, 3032501768)
You can perform a limited number of math operations on the elements of A; for example, addition works as one might expect:
In [871]: A+3
Out[871]: array([array([4, 5]), array([4]), array([ 4, 7, 11, 13])], dtype=object)
But there's little to no speed advantage, e.g.
In [876]: timeit [x+3 for x in ll]
100000 loops, best of 3: 9.52 µs per loop
In [877]: timeit A+3
100000 loops, best of 3: 14.6 µs per loop
and other things like np.max don't work. You have to test this case by case.
More details here: Maintaining numpy subclass inside a container after applying ufunc and other object array questions.
To get numpy speed, you need to embed the vectors into an array. Either a 2D array or a 1D array could work. You could make an array of zeros that is large enough to hold all the values, then put the vectors into that array. Or you could make a large 1D array and concatenate the vectors end to end.
import numpy as np
a = np.array([1, 2])
b = np.array([1])
c = np.array([1, 4, 8, 10])
# Embed the vectors in a 2D array
A = np.zeros((3, max(a.size, b.size, c.size)))
A[0, :a.size] = a
A[1, :b.size] = b
A[2, :c.size] = c
# 1D array embedding
B = np.zeros(a.size + b.size + c.size)
B[:a.size] = a
B[a.size:(a.size+b.size)] = b
B[(a.size+b.size):] = c
%timeit A+3
1000000 loops, best of 3: 780 ns per loop
%timeit B+3
1000000 loops, best of 3: 764 ns per loop
This has the advantage of numpy speed. But it involves more coding work, and it is less easy to interpret the values of your arrays.
Also, to decide whether the 1D or 2D solution is better, it makes sense to think about how you're using the arrays. For example, if the values are Fourier series coefficients, then the 2D array would probably be better. With a 2D array you can keep specific elements of your vectors aligned.
However, I could also imagine applications where concatenating vectors into a single 1D array would make more sense. I hope this was helpful.
Is there a simpler and more memory efficient way to do the following in numpy alone.
import numpy as np
ar = np.array(a[l:r])
ar += c
a = a[0:l] + ar.tolist() + a[r:]
It may look primitive, but it involves obtaining a subarray copy of the given array, then preparing two more copies of it to join on the left and right, in addition to the scalar add. I was hoping to find a more optimized way of doing this. I would like a solution that works entirely with Python lists or entirely with NumPy arrays, but not both, since converting from one form to the other as shown above causes serious overhead when the data is huge.
You can just do the addition in place on the slice, as follows:
import numpy as np
a = np.array([1, 1, 1, 1, 1])
a[2:4] += 5
>>> a
array([1, 1, 6, 6, 1])
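And if a has to stay a plain Python list rather than becoming a NumPy array, the same in-place update is just a loop over the slice (a sketch I'm adding, not part of the original answer):
a = [1, 1, 1, 1, 1]
l, r, c = 2, 4, 5          # add c to a[l:r] in place
for i in range(l, r):
    a[i] += c
print(a)                   # [1, 1, 6, 6, 1]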