Looping over large sparse array

Looping over large sparse array - python

Let's say I have a (sparse) matrix M size (N*N, N*N). I want to select elements from this matrix where the outer product of grid (a (n,m) array, where n*m=N) is True (it is a boolean 2D array, and na=grid.sum()). This can be done as follows
result = M[np.outer( grid.flatten(),grid.flatten() )].reshape (( N, N ) )
result is an (na,na) sparse array (and na < N). The previous line is what I want to achieve: get the elements of M that are true from the product of grid, and squeeze the ones that aren't true out of the array.
As n and m (and hence N) grow, and M and result are sparse matrices, I am not able to do this efficiently in terms of memory or speed. Closest I have tried is:
result = sp.lil_matrix ( (1, N*N), dtype=np.float32 )
# Calculate outer product
A = np.einsum("i,j", grid.flatten(), grid.flatten())
cntr = 0
it = np.nditer ( A, flags=['multi_index'] )
while not it.finished:
if it[0]:
result[0,cntr] = M[it.multi_index[0], it.multi_index[1]]
cntr += 1
# reshape result to be a N*N sparse matrix
The last reshape could be done by this approach, but I haven't got there yet, as the while loop is taking forever.
I have also tried selecting nonzero elements of A too, and looping over but this eats up all of the memory:
A=np.einsum("i,j", grid.flatten(), grid.flatten())
nzero = A.nonzero() # This eats lots of memory
cntr = 0
for (i,j) in zip (*nzero):
temp_mat[0,cntr] = M[i,j]
cnt += 1
'n' and 'm' in the example above are around 300.

I don't know if it was a typo, or code error, but your example is missing an iternext:
R=[]
it = np.nditer ( A, flags=['multi_index'] )
while not it.finished:
if it[0]:
R.append(M[it.multi_index])
it.iternext()
I think appending to a list is simpler and faster than R[ctnr]=.... It's competitive if R is a regular array, and sparse indexing is slower (even the fastest lil format).
ndindex wraps this use of a nditer as:
R=[]
for index in np.ndindex(A.shape):
if A[index]:
R.append(M[index])
ndenumerate also works:
R = []
for index,a in np.ndenumerate(A):
if a:
R.append(M[index])
But I wonder if you really want to advance the cntr each it step, not just the True cases. Otherwise reshaping result to (N,N) doesn't make much sense. But in that case, isn't your problem just
M[:N, :N].multiply(A)
or if M was a dense array:
M[:N, :N]*A
In fact if both M and A are sparse, then the .data attribute of that multiply will be the same as the R list.
In [76]: N=4
In [77]: M=np.arange(N*N*N*N).reshape(N*N,N*N)
In [80]: a=np.array([0,1,0,1])
In [81]: A=np.einsum('i,j',a,a)
In [82]: A
Out[82]:
array([[0, 0, 0, 0],
[0, 1, 0, 1],
[0, 0, 0, 0],
[0, 1, 0, 1]])
In [83]: M[:N, :N]*A
Out[83]:
array([[ 0, 0, 0, 0],
[ 0, 17, 0, 19],
[ 0, 0, 0, 0],
[ 0, 49, 0, 51]])
In [84]: c=sparse.csr_matrix(M)[:N,:N].multiply(sparse.csr_matrix(A))
In [85]: c.data
Out[85]: array([17, 19, 49, 51], dtype=int32)
In [89]: [M[index] for index, a in np.ndenumerate(A) if a]
Out[89]: [17, 19, 49, 51]

Related

How to do <= and >= on sparse matrices?

Is it possible to do <= or >= operations on Scipy sparse matrices, such that the expression returns True if the operation is true for all corresponding elements? For example, a <= b means that for all corresponding elements (a, b) in matrices (A, B), a <= b? Here's an example to consider:
import numpy as np
from scipy.sparse import csr_matrix
np.random.seed(0)
mat = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(mat.A)
print()
np.random.seed(1)
matb = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(matb.A)
Running this gives the warning: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead and gives the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I'd like to be able to take 2 sparse matrices, A and B, and determine if A <= B for each pair of corresponding elements (a, b) in (A, B). Is this possible? What would the performance of such an operation be?

In [402]: np.random.seed = 0
...: mat = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [403]: mat
Out[403]:
<10x12 sparse matrix of type '<class 'numpy.int64'>'
with 40 stored elements in Compressed Sparse Row format>
In [404]: mat.A
Out[404]:
array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
...
[0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1],
[0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]], dtype=int64)
In [405]: np.random.seed = 1
...: matb = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [407]: mat<matb
Out[407]:
<10x12 sparse matrix of type '<class 'numpy.bool_'>'
with 27 stored elements in Compressed Sparse Row format>
In [408]: mat>=matb
/home/paul/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py:295: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead.
"using <, >, or !=, instead.", SparseEfficiencyWarning)
Out[408]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 93 stored elements in Compressed Sparse Row format>
In your case, neither mat or matb are particularly sparse, 40 and 36 nonzeros out of a possible 120. Even so the mat<matb results in 27 nonzero (True) values, while the >= test results in 93. Where ever both matrices are 0, the result is True.
It's warning us that using sparse matrices isn't going to save us space or time (compared to dense arrays) if we do this kind of testing. It's not going to kill us, it just won't be as efficient.

(Pulling some comments together for this answer):
To simply do elementwise <= on two sparse matrices A and B, you can do (A <= B). However, as #hpaulj points out, this is inefficient because any pair of corresponding 0 elements (i.e. (1,1) is 0 in both A and B) will be turned into a 1 with this operation. Assuming both A and B are sparse (mostly 0s), you will destroy their sparsity by making them mostly 1s.
To get around this, consider the following:
A = csr_matrix((3, 3))
A[1, 1] = 1
print(A.A)
print()
B = csr_matrix((3, 3))
B[0, 0] = 1
B[1, 1] = 2
print(B.A)
print(not (A > B).count_nonzero())
To explain that last line, A > B will do the opposite of A <= B, so corresponding 0s will remain 0, and any place where a > b will become a 1. Therefore, if the resulting matrix has any non-zero elements, it means that there is some (a, b) in (A, B) where a > b. This means that it is not the case that A <= B (elementwise).

Replace values in subarray based upon dynamic condition in Numpy

I have a Python Numpy array that is a 2D array where the second dimension is a subarray of 3 elements of integers. For example:
[ [2, 3, 4], [9, 8, 7], ... [15, 14, 16] ]
For each subarray I want to replace the lowest number with a 1 and all other numbers with a 0. So the desired output from the above example would be:
[ [1, 0, 0], [0, 0, 1], ... [0, 1, 0] ]
This is a large array, so I want to exploit Numpy performance. I know about using conditions to operate on array elements, but how do I do this when the condition is dynamic? In this instance the condition needs to be something like:
newarray = (a == min(a)).astype(int)
But how do I do this across each subarray?

You can specify the axis parameter to calculate a 2d array of mins(if you keep the dimension of the result), then when you do a == a.minbyrow, you will get trues at the minimum position for each sub array:
(a == a.min(1, keepdims=True)).astype(int)
#array([[1, 0, 0],
# [0, 0, 1],
# [0, 1, 0]])

How about this?
import numpy as np
a = np.random.random((4,3))
i = np.argmin(a, axis=-1)
out = np.zeros(a.shape, int)
out[np.arange(out.shape[0]), i] = 1
print(a)
print(out)
Sample output:
# [[ 0.58321885 0.18757452 0.92700724]
# [ 0.58082897 0.12929637 0.96686648]
# [ 0.26037634 0.55997658 0.29486454]
# [ 0.60398426 0.72253012 0.22812904]]
# [[0 1 0]
# [0 1 0]
# [1 0 0]
# [0 0 1]]
It appears to be marginally faster than the direct approach:
from timeit import timeit
def dense():
return (a == a.min(1, keepdims=True)).astype(int)
def sparse():
i = np.argmin(a, axis=-1)
out = np.zeros(a.shape, int)
out[np.arange(out.shape[0]), i] = 1
return out
for shp in ((4,3), (10000,3), (100,10), (100000,1000)):
a = np.random.random(shp)
d = timeit(dense, number=40)/40
s = timeit(sparse, number=40)/40
print('shape, dense, sparse, ratio', '({:6d},{:6d}) {:9.6g} {:9.6g} {:9.6g}'.format(*shp, d, s, d/s))
Sample run:
# shape, dense, sparse, ratio ( 4, 3) 4.22172e-06 3.1274e-06 1.34992
# shape, dense, sparse, ratio ( 10000, 3) 0.000332396 0.000245348 1.35479
# shape, dense, sparse, ratio ( 100, 10) 9.8944e-06 5.63165e-06 1.75693
# shape, dense, sparse, ratio (100000, 1000) 0.344177 0.189913 1.81229

Combine list of numpy arrays and reshape

I'm hoping anybody could help me with the following.
I have 2 lists of arrays, which should be linked to each-other. Each list stands for a certain object. arr1 and arr2 are the attributes of that object.
For example:
import numpy as np
arr1 = [np.array([1, 2, 3]), np.array([1, 2]), np.array([2, 3])]
arr2 = [np.array([20, 50, 30]), np.array([50, 50]), np.array([75, 25])]
The arrays are linked to each other as in the 1 in arr1, first array belongs to the 20 in arr2 first array. The result I'm looking for in this example would be a numpy array with size 3,4. The 'columns' stand for 0, 1, 2, 3 (the numbers in arr1, plus 0) and the rows are filled with the corresponding values of arr2. When there are no corresponding values this cell should be 0.
Example:
array([[ 0, 20, 50, 30],
[ 0, 50, 50, 0],
[ 0, 0, 75, 25]])
How would I link these two list of arrays and reshape them in the desired format as shown in the above example?
Many thanks!

Here's an almost* vectorized approach -
lens = np.array([len(i) for i in arr1])
N = len(arr1)
row_idx = np.repeat(np.arange(N),lens)
col_idx = np.concatenate(arr1)
M = col_idx.max()+1
out = np.zeros((N,M),dtype=int)
out[row_idx,col_idx] = np.concatenate(arr2)
*: Almost because of the loop comprehension at the start, but that should be computationally negligible as it doesn't involve any computation there.

Here is a solution with for-loops. Showing each step in detail.
import numpy as np
arr1 = [np.array([1, 2, 3]), np.array([1, 2]), np.array([2, 3])]
arr2 = [np.array([20, 50, 30]), np.array([50, 50]), np.array([75, 25])]
maxi = []
for i in range(len(arr1)):
maxi.append(np.max(arr1[i]))
maxi = np.max(maxi)
output = np.zeros((len(arr2),maxi))
for i in range(len(arr1)):
for k in range(len(arr1[i])):
output[i][k]=arr2[i][k]

This is a straight forward approach, with only one level of iteration:
In [261]: res=np.zeros((3,4),int)
In [262]: for i,(idx,vals) in enumerate(zip(arr1, arr2)):
...: res[i,idx]=vals
...:
In [263]: res
Out[263]:
array([[ 0, 20, 50, 30],
[ 0, 50, 50, 0],
[ 0, 0, 75, 25]])
I suspect it is faster than #Divakar's approach for this example. And it should remain competitive as long as the number of columns is quite a bit larger than the number of rows.

Comparing value with neighbor elements in numpy

Let's say I have a numpy array
a b c
A = i j k
u v w
I want to compare the value central element with some of its eight neighbor elements (along the axis or along the diagonal). Is there any faster way except the nested for loop (it's too slow for big matrix)?
To be more specific, what I want to do is compare value of element with it's neighbors and assign new values.
For example:
if (j == 1):
if (j>i) & (j>k):
j = 999
else:
j = 0
if (j == 2):
if (j>c) & (j>u):
j = 999
else:
j = 0
...
something like this.

Your operation contains lots of conditionals, so the most efficient way to do it in the general case (any kind of conditionals, any kind of operations) is using loops. This could be done efficiently using numba or cython. In special cases, you can implement it using higher level functions in numpy/scipy. I'll show a solution for the specific example you gave, and hopefully you can generalize from there.
Start with some fake data:
A = np.asarray([
[1, 1, 1, 2, 0],
[1, 0, 2, 2, 2],
[0, 2, 0, 1, 0],
[1, 2, 2, 1, 0],
[2, 1, 1, 1, 2]
])
We'll find locations in A where various conditions apply.
1a) The value is 1
1b) The value is greater than its horizontal neighbors
2a) The value is 2
2b) The value is greater than its diagonal neighbors
Find locations in A where the specified values occur:
cond1a = A == 1
cond2a = A == 2
This gives matrices of boolean values, of the same size as A. The value is true where the condition holds, otherwise false.
Find locations in A where each element has the specified relationships to its neighbors:
# condition 1b: value greater than horizontal neighbors
f1 = np.asarray([[1, 0, 1]])
cond1b = A > scipy.ndimage.maximum_filter(
A, footprint=f1, mode='constant', cval=-np.inf)
# condition 2b: value greater than diagonal neighbors
f2 = np.asarray([
[0, 0, 1],
[0, 0, 0],
[1, 0, 0]
])
cond2b = A > scipy.ndimage.maximum_filter(
A, footprint=f2, mode='constant', cval=-np.inf)
As before, this gives matrices of boolean values indicating where the conditions are true. This code uses scipy.ndimage.maximum_filter(). This function iteratively shifts a 'footprint' to be centered over each element of A. The returned value for that position is the maximum of all elements for which the footprint is 1. The mode argument specifies how to treat implicit values outside boundaries of the matrix, where the footprint falls off the edge. Here, we treat them as negative infinity, which is the same as ignoring them (since we're using the max operation).
Set values of the result according to the conditions. The value is 999 if conditions 1a and 1b are both true, or if conditions 2a and 2b are both true. Else, the value is 0.
result = np.zeros(A.shape)
result[(cond1a & cond1b) | (cond2a & cond2b)] = 999
The result is:
[
[ 0, 0, 0, 0, 0],
[999, 0, 0, 999, 999],
[ 0, 0, 0, 999, 0],
[ 0, 0, 999, 0, 0],
[ 0, 0, 0, 0, 999]
]
You can generalize this approach to other patterns of neighbors by changing the filter footprint. You can generalize to other operations (minimum, median, percentiles, etc.) using other kinds of filters (see scipy.ndimage). For operations that can be expressed as weighted sums, use 2d cross correlation.
This approach should be much faster than looping in python. But, it does perform unnecessary computations (for example, it's only necessary to compute the max when the value is 1 or 2, but we're doing it for all elements). Looping manually would let you avoid these computations. Looping in python would probably be much slower than the code here. But, implementing it in numba or cython would probably be faster because these tools generate compiled code.

I used numpy's:
concatenate to pad with zeroes
dstack and roll to align correctly
Apply custom_roll twice along different dimensions and subtract original.
import numpy as np
def custom_roll(a, axis=0):
n = 3
a = a.T if axis==1 else a
pad = np.zeros((n-1, a.shape[1]))
a = np.concatenate([a, pad], axis=0)
ad = np.dstack([np.roll(a, i, axis=0) for i in range(n)])
a = ad.sum(2)[1:-1, :]
a = a.T if axis==1 else a
return a
Consider the following ndarray:
A = np.arange(25).reshape(5, 5)
A
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
sum_of_eight_around_me = custom_roll(custom_roll(A), axis=1) - A
sum_of_eight_around_me
array([[ 12., 20., 25., 30., 20.],
[ 28., 48., 56., 64., 42.],
[ 53., 88., 96., 104., 67.],
[ 78., 128., 136., 144., 92.],
[ 52., 90., 95., 100., 60.]])

Find first nonzero column in scipy.sparse matrix

I am looking for the first column containing a nonzero element in a sparse matrix (scipy.sparse.csc_matrix). Actually, the first column starting with the i-th one to contain a nonzero element.
This is part of a certain type of linear equation solver. For dense matrices I had the following: (Relevant line is pcol = ...)
import numpy
D = numpy.matrix([[1,0,0],[2,0,0],[3,0,1]])
i = 1
pcol = i + numpy.argmax(numpy.any(D[:,i:], axis=0))
if pcol != i:
# Pivot columns i, pcol
D[:,[i,pcol]] = D[:,[pcol,i]]
print(D)
# Result should be numpy.matrix([[1,0,0],[2,0,0],[3,1,0]])
The above should swap columns 1 and 2. If we set i = 0 instead, D is unchanged since column 0 already contains nonzero entries.
What is an efficient way to do this for scipy.sparse matrices? Are there analogues for the numpy.any() and numpy.argmax() functions?

With a csc matrix it is easy to find the nonzero columns.
In [302]: arr=sparse.csc_matrix([[0,0,1,2],[0,0,0,2]])
In [303]: arr.A
Out[303]:
array([[0, 0, 1, 2],
[0, 0, 0, 2]])
In [304]: arr.indptr
Out[304]: array([0, 0, 0, 1, 3])
In [305]: np.diff(arr.indptr)
Out[305]: array([0, 0, 1, 2])
The last line shows how many nonzero terms there are in each column.
np.nonzero(np.diff(arr.indptr))[0][0] would be the index of the first nonzero value in that diff.
Do the same on a csr matrix for find the 1st nonzero row.
I can elaborate on indptr if you want.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Looping over large sparse array - python

Related

How to do <= and >= on sparse matrices?

Replace values in subarray based upon dynamic condition in Numpy

Combine list of numpy arrays and reshape

Comparing value with neighbor elements in numpy

Find first nonzero column in scipy.sparse matrix

Categories

Resources