Is it possible to do <= or >= operations on Scipy sparse matrices, such that the expression returns True if the relation holds for all corresponding elements? For example, A <= B would mean that a <= b for every pair of corresponding elements (a, b) in (A, B). Here's an example to consider:
import numpy as np
from scipy.sparse import csr_matrix
np.random.seed(0)
mat = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(mat.A)
print()
np.random.seed(1)
matb = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(matb.A)
Trying a comparison such as mat <= matb gives the warning: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead, and using the result in a boolean context gives the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I'd like to be able to take 2 sparse matrices, A and B, and determine if A <= B for each pair of corresponding elements (a, b) in (A, B). Is this possible? What would the performance of such an operation be?
In [402]: np.random.seed = 0
...: mat = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [403]: mat
Out[403]:
<10x12 sparse matrix of type '<class 'numpy.int64'>'
with 40 stored elements in Compressed Sparse Row format>
In [404]: mat.A
Out[404]:
array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
...
[0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1],
[0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]], dtype=int64)
In [405]: np.random.seed = 1
...: matb = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [407]: mat<matb
Out[407]:
<10x12 sparse matrix of type '<class 'numpy.bool_'>'
with 27 stored elements in Compressed Sparse Row format>
In [408]: mat>=matb
/home/paul/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py:295: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead.
"using <, >, or !=, instead.", SparseEfficiencyWarning)
Out[408]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 93 stored elements in Compressed Sparse Row format>
In your case, neither mat nor matb is particularly sparse: 40 and 36 nonzeros out of a possible 120. Even so, mat<matb results in 27 nonzero (True) values, while the >= test results in 93. Wherever both matrices are 0, the result is True.
It's warning us that using sparse matrices isn't going to save us space or time (compared to dense arrays) if we do this kind of testing. It's not going to kill us; it just won't be as efficient.
(Pulling some comments together for this answer):
To simply do an elementwise <= on two sparse matrices A and B, you can write (A <= B). However, as @hpaulj points out, this is inefficient because every pair of corresponding 0 elements (e.g. position (1, 1) is 0 in both A and B) is turned into a 1 by this operation. Assuming both A and B are sparse (mostly 0s), you will destroy their sparsity by making them mostly 1s.
To get around this, consider the following:
A = csr_matrix((3, 3))
A[1, 1] = 1
print(A.A)
print()
B = csr_matrix((3, 3))
B[0, 0] = 1
B[1, 1] = 2
print(B.A)
print(not (A > B).count_nonzero())
To explain that last line: A > B is the complement of A <= B, so corresponding 0s remain 0, and any place where a > b becomes a 1. Therefore, if the resulting matrix has any nonzero elements, there is some (a, b) in (A, B) with a > b, which means it is not the case that A <= B elementwise.
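For convenience, the test can be wrapped in a small helper. This is a minimal sketch built from the same idea (sparse_le is a made-up name, not from the answer above):
import numpy as np
from scipy.sparse import csr_matrix

def sparse_le(A, B):
    # True iff a <= b for every corresponding pair (a, b);
    # (A > B) stays sparse, because matching zeros compare False
    return (A > B).count_nonzero() == 0

A = csr_matrix([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
B = csr_matrix([[1, 0, 0], [0, 2, 0], [0, 0, 0]])
print(sparse_le(A, B))  # True
print(sparse_le(B, A))  # False, since B[0, 0] > A[0, 0]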
Related
Given a numpy array (a bit array for simplicity), how can I construct a new array of the same shape in which a 1 stands exactly at the positions where the original array has a zero preceded by at least N-1 consecutive zeros?
For example, what is the best way to implement function nzeros having two arguments, a numpy array and the minimal required number of consecutive zeros:
import numpy as np
a = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
b = nzeros(a, 3)
Function nzeros(a, 3) should return
array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
Approach #1
We can use 1D convolution -
def nzeros(a, n):
    # Define kernel for 1D convolution
    k = np.ones(n, dtype=int)
    # Get sliding summations for zero matches with that kernel
    s = np.convolve(a==0, k)
    # Look for summations that are equal to n, which occur at runs of
    # n consecutive 0s. Remember that we are using the "full" version of
    # convolution, so there's a one-off offset because of the way the
    # kernel slides across the input data. Also, we need to create 1s at
    # places where n consecutive 0s end, so we slice off the trailing
    # elements. Thus, we end up with the following after int dtype conversion
    return (s==n).astype(int)[:-n+1]
Sample run -
In [46]: a
Out[46]: array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
In [47]: nzeros(a,3)
Out[47]: array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
In [48]: nzeros(a,2)
Out[48]: array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])
Approach #2
Another way to solve this, which could be considered a variant of the 1D convolution approach, is to use erosion: looking at the outputs, we simply erode the mask of 0s from the start of each run by n-1 places. For this we can use binary_erosion from scipy.ndimage.morphology, which also lets us specify where the kernel is centered with its origin argument, so we avoid any slicing. The implementation would look something like this -
from scipy.ndimage.morphology import binary_erosion
out = binary_erosion(a==0,np.ones(n),origin=(n-1)//2).astype(int)
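For reference, here is the erosion approach as a self-contained function. This is a sketch (nzeros_erosion is a made-up name), and it assumes current scipy, where binary_erosion can be imported directly from scipy.ndimage:
import numpy as np
from scipy.ndimage import binary_erosion

def nzeros_erosion(a, n):
    # Erode the mask of 0s; the origin shift places each 1 at the *end*
    # of a run of n zeros, so no slicing is needed
    return binary_erosion(a == 0, np.ones(n), origin=(n - 1) // 2).astype(int)

a = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
print(nzeros_erosion(a, 3))  # [0 0 1 1 0 0 0 1 0 0]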
Using a for loop:
def nzeros(a, n):
    # Create a numpy array of zeros of length n
    b = np.zeros(n)
    # Create a numpy array of zeros of the same length as array a
    c = np.zeros(len(a), dtype=int)
    for i in range(len(a) - n + 1):  # + 1 so runs ending at the last element are caught
        if (b == a[i : i+n]).all():  # Check if the slice of a is all zeros
            c[i+n-1] = 1
    return c
Sample Output:
print(nzeros(a, 3))
[0 0 1 1 0 0 0 1 0 0]
This was the only question I found about standard basis vectors in numpy, but it's not really related to my question.
I have a numpy array of integers and I want to determine the co-occurrence matrix, which stores the number of times indices have the same value in the same column. This question describes the problem in more detail.
I have a method of solving my problem but it doesn't scale well.
My question then is this:
Is it possible to store standard basis vectors in a numpy array in a memory efficient manner?
I want to be able to do the following:
Given an array
M = e1 e2 e1
e1 e2 e2
e3 e1 e3
e2 e3 e3
where ei is the transposed i-th standard basis vector of the vector space (R3 in this case), perform matrix multiplication with the transpose of M, i.e. determine np.dot(M, M.T). To be clear, the matrix M above could be written as:
M = 1 0 0 0 1 0 1 0 0
1 0 0 0 1 0 0 1 0
0 0 1 1 0 0 0 0 1
0 1 0 0 0 1 0 0 1
(extra spaces added for emphasis).
The issue with representing the matrix like this is that it isn't scalable in memory with the number of rows and dimension of the vector space.
EDIT: I should mention that the number of columns can increase as well. The memory complexity is D * R * C, where D is the dimension of the vector space, R is the number of rows and C is the number of columns. In an average working example I have roughly D == 150, R == 2000 and C == 1000 (already 3 * 10^8 entries), though R can go up to 20,000 and C is unbounded (though 10,000 is a reasonable estimate).
The rules for standard basis vector multiplication are simple (ei * ei.T == 1, ei * ej.T == 0 if i != j) so I was wondering if it's possible to store these rules in a numpy array to save memory.
Let's encode the basis vectors with numbers: e1 -> 1, e2 -> 2, ... This allows a very memory-efficient storage.
M = np.array([[1, 2, 1], [1, 2, 2], [3, 1, 3], [2, 3, 3]], dtype=np.uint8)
# if more than 255 basis vectors, use uint16.
Now we only need to implement a special dot product that works with these basis vectors. Basically we only replace the multiplication with a comparison:
def basis_dot(a, b):
    return np.sum(a[:, :, np.newaxis] == b[np.newaxis, :, :], axis=1)
print(basis_dot(M, M.T))
# [[3 2 0 0]
# [2 3 0 0]
# [0 0 3 1]
# [0 0 1 3]]
Let's verify the result:
M = np.array([[1, 0, 0, 0, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1, 0, 0, 1]])
np.dot(M, M.T)
# array([[3, 2, 0, 0],
# [2, 3, 0, 0],
# [0, 0, 3, 1],
# [0, 0, 1, 3]])
A potential drawback with the approach is the large temporary array required in basis_dot. The memory requirement can be reduced by explicitly coding the loops, at the cost of performance (unless you use a jit compiler).
# slower but more memory friendly
def basis_dot(a, b):
    out = np.empty((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            out[i, j] = np.sum(a[i, :] == b[:, j])
    return out
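A middle ground, as a sketch (basis_dot_blocked and block are my names, not from the answer): process the output in row blocks, which bounds the boolean temporary while keeping the inner work vectorized.
import numpy as np

def basis_dot_blocked(a, b, block=256):
    # Same result as basis_dot above, but the temporary at each step is
    # (block, K, b.shape[1]) rather than (a.shape[0], K, b.shape[1])
    out = np.empty((a.shape[0], b.shape[1]), dtype=int)
    for start in range(0, a.shape[0], block):
        stop = min(start + block, a.shape[0])
        out[start:stop] = np.sum(
            a[start:stop, :, np.newaxis] == b[np.newaxis, :, :], axis=1)
    return out
For very large problems the columns of b could be blocked the same way.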
So, my assumption based on your example is that you're actually working with a higher dimensionality than just 3. My other assumption is that you're not computing any basis vectors, just auto-generating basis vectors for R^N. I'll ignore, for now, the question of exactly what you're trying to accomplish or why you're storing vectors that you can easily auto-generate.
If all of the above assumptions are accurate, then you can likely gain a lot by storing the matrix in a sparse data format. This only improves storage if you have a preponderance of zeroes, but that seems like a reasonable assumption. There are a number of sparse formats available in scipy.sparse; my best guess for you would be the coo_matrix class.
from scipy.sparse import coo_matrix
new_matrix = coo_matrix(<your_matrix>)
Then save the new matrix in your format of choice.
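One way to combine this with the integer encoding above, as a sketch (S, rows, cols are my names): expand the coded matrix into a sparse one-hot layout, and the co-occurrence counts become an ordinary sparse matrix product.
import numpy as np
from scipy.sparse import coo_matrix

M = np.array([[1, 2, 1], [1, 2, 2], [3, 1, 3], [2, 3, 3]], dtype=np.uint8)
R, C = M.shape
D = 3  # dimension of the vector space

# entry (i, j) == v becomes a 1 at sparse column j*D + (v - 1)
rows = np.repeat(np.arange(R), C)
cols = np.tile(np.arange(C), R) * D + (M.ravel() - 1)
S = coo_matrix((np.ones(R * C, dtype=int), (rows, cols)),
               shape=(R, C * D)).tocsr()
print((S @ S.T).toarray())
# [[3 2 0 0]
#  [2 3 0 0]
#  [0 0 3 1]
#  [0 0 1 3]]
Only R * C values are stored, regardless of D, and the multiplication happens in sparse form.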
I have a large (79,000 x 480,000) sparse csr matrix. I am trying to remove all columns (within a certain range) for which every value is < k.
With regular numpy arrays this is simply done with a mask:
m = np.array([[0, 2, 1, 1],
              [0, 4, 2, 0],
              [0, 3, 4, 0]])
mask = (m < 2)
idx = mask.all(axis=0)
result = m[:, ~idx]
print(result)
# [[2 1]
#  [4 2]
#  [3 4]]
The unary bitwise negation operator ~ and boolean mask functionality are not available for sparse matrices, however. What is the best method to:
Obtain the indices of columns where all values fulfill condition e < k.
Remove these columns based on the list of indices.
Some things to note:
The columns represent ngram text features: there are no columns in the matrix for which each element is zero.
Is using the csr matrix format even a plausible choice for this?
Do I transpose and make use of .nonzero()? I have a fair amount of working memory (192GB) so time efficiency is preferable to memory efficiency.
If I do
M = sparse.csr_matrix(m)
M < 2
I get an efficiency warning, since all the 0 values of M satisfy the condition:
In [1754]: print(M)
(0, 1) 2
(0, 2) 1
(0, 3) 1
(1, 1) 4
(1, 2) 2
(2, 1) 3
(2, 2) 4
In [1755]: print(M<2)
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:275: SparseEfficiencyWarning: Comparing a sparse matrix with a scalar greater than zero using < is inefficient, try using >= instead.
warn(bad_scalar_msg, SparseEfficiencyWarning)
(0, 0) True # not in M
(0, 2) True
(0, 3) True
(1, 0) True # not in M
(1, 3) True
(2, 0) True # not in M
(2, 3) True
In [1756]: print(M>=2) # all a subset of M
(0, 1) True
(1, 1) True
(1, 2) True
(2, 1) True
(2, 2) True
With I = M>=2, there isn't an all method, but there is a sum.
In [1760]: I.sum(axis=0)
Out[1760]: matrix([[0, 3, 2, 0]], dtype=int32)
sum is actually performed using a matrix multiplication:
In [1769]: np.ones((1,3),int)*I
Out[1769]: array([[0, 3, 2, 0]], dtype=int32)
Using nonzero to find the nonzero columns:
In [1778]: np.nonzero(I.sum(axis=0))
Out[1778]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
In [1779]: M[:,np.nonzero(I.sum(axis=0))[1]]
Out[1779]:
<3x2 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [1780]: M[:,np.nonzero(I.sum(axis=0))[1]].A
Out[1780]:
array([[2, 1],
[4, 2],
[3, 4]], dtype=int32)
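Pulling those steps together, here's a sketch using the same approach (drop_small_columns is a made-up name):
import numpy as np
from scipy import sparse

def drop_small_columns(M, k):
    # keep a column if at least one entry is >= k
    keep = np.asarray((M >= k).sum(axis=0)).ravel() > 0
    return M[:, np.nonzero(keep)[0]]

m = np.array([[0, 2, 1, 1],
              [0, 4, 2, 0],
              [0, 3, 4, 0]])
M = sparse.csr_matrix(m)
print(drop_small_columns(M, 2).A)
# [[2 1]
#  [4 2]
#  [3 4]]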
General points:
watch out for those 0 values when doing comparisons
watch out for False values when doing logic on sparse matrices
sparse matrices are optimized for math, especially matrix multiplication
sparse indexing isn't quite as powerful as array indexing, and not as fast either
note when operations produce a dense matrix
Let's say I have a (sparse) matrix M of size (N*N, N*N). I want to select the elements of this matrix where the outer product of grid (an (n, m) boolean array, where n*m == N and na = grid.sum()) is True. This can be done as follows:
result = M[np.outer(grid.flatten(), grid.flatten())].reshape((N, N))
result is an (na, na) sparse array (and na < N). The previous line is what I want to achieve: get the elements of M that are True under the outer product of grid, and squeeze the ones that aren't out of the array.
As n and m (and hence N) grow, and M and result are sparse matrices, I am not able to do this efficiently in terms of memory or speed. Closest I have tried is:
result = sp.lil_matrix((1, N*N), dtype=np.float32)
# Calculate outer product
A = np.einsum("i,j", grid.flatten(), grid.flatten())
cntr = 0
it = np.nditer(A, flags=['multi_index'])
while not it.finished:
    if it[0]:
        result[0, cntr] = M[it.multi_index[0], it.multi_index[1]]
        cntr += 1
# reshape result to be a N*N sparse matrix
The last reshape could be done by this approach, but I haven't got there yet, as the while loop is taking forever.
I have also tried selecting the nonzero elements of A and looping over them, but this eats up all of the memory:
A = np.einsum("i,j", grid.flatten(), grid.flatten())
nzero = A.nonzero()  # This eats lots of memory
cntr = 0
for (i, j) in zip(*nzero):
    temp_mat[0, cntr] = M[i, j]
    cntr += 1
'n' and 'm' in the example above are around 300.
I don't know if it was a typo or a code error, but your example is missing an iternext:
R = []
it = np.nditer(A, flags=['multi_index'])
while not it.finished:
    if it[0]:
        R.append(M[it.multi_index])
    it.iternext()
I think appending to a list is simpler and faster than R[cntr] = .... It would be competitive even if R were a regular array, and sparse indexing is slower (even with lil, the fastest format for element assignment).
ndindex wraps this use of an nditer as:
R = []
for index in np.ndindex(A.shape):
    if A[index]:
        R.append(M[index])
ndenumerate also works:
R = []
for index, a in np.ndenumerate(A):
    if a:
        R.append(M[index])
But I wonder if you really want to advance cntr at every iteration, not just for the True cases. Otherwise reshaping result to (N, N) doesn't make much sense. But in that case, isn't your problem just
M[:N, :N].multiply(A)
or if M was a dense array:
M[:N, :N]*A
In fact, if both M and A are sparse, then the .data attribute of that multiply result will be the same as the R list.
In [76]: N=4
In [77]: M=np.arange(N*N*N*N).reshape(N*N,N*N)
In [80]: a=np.array([0,1,0,1])
In [81]: A=np.einsum('i,j',a,a)
In [82]: A
Out[82]:
array([[0, 0, 0, 0],
[0, 1, 0, 1],
[0, 0, 0, 0],
[0, 1, 0, 1]])
In [83]: M[:N, :N]*A
Out[83]:
array([[ 0, 0, 0, 0],
[ 0, 17, 0, 19],
[ 0, 0, 0, 0],
[ 0, 49, 0, 51]])
In [84]: c=sparse.csr_matrix(M)[:N,:N].multiply(sparse.csr_matrix(A))
In [85]: c.data
Out[85]: array([17, 19, 49, 51], dtype=int32)
In [89]: [M[index] for index, a in np.ndenumerate(A) if a]
Out[89]: [17, 19, 49, 51]
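If the goal really is to squeeze out the False rows and columns (an (na, na) result), the outer-product mask never needs to be formed: since it is the outer product of one boolean vector with itself, fancy indexing with that vector's True positions does the job. A sketch with made-up names (g, idx), assuming M is (N, N) as in the M[:N, :N] reading above:
import numpy as np
from scipy import sparse

g = np.array([0, 1, 0, 1], dtype=bool)    # stand-in for grid.flatten()
M = sparse.csr_matrix(np.arange(16).reshape(4, 4))
idx = np.nonzero(g)[0]
result = M[idx, :][:, idx]                 # (na, na), still sparse
print(result.A)
# [[ 5  7]
#  [13 15]]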
I am looking for the first column of a sparse matrix (scipy.sparse.csc_matrix) that contains a nonzero element; more precisely, the first such column starting from the i-th one.
This is part of a certain type of linear equation solver. For dense matrices I had the following (the relevant line is pcol = ...):
import numpy
D = numpy.matrix([[1,0,0],[2,0,0],[3,0,1]])
i = 1
pcol = i + numpy.argmax(numpy.any(D[:,i:], axis=0))
if pcol != i:
    # Pivot columns i, pcol
    D[:,[i,pcol]] = D[:,[pcol,i]]
print(D)
# Result should be numpy.matrix([[1,0,0],[2,0,0],[3,1,0]])
The above should swap columns 1 and 2. If we set i = 0 instead, D is unchanged since column 0 already contains nonzero entries.
What is an efficient way to do this for scipy.sparse matrices? Are there analogues for the numpy.any() and numpy.argmax() functions?
With a csc matrix it is easy to find the nonzero columns.
In [302]: arr=sparse.csc_matrix([[0,0,1,2],[0,0,0,2]])
In [303]: arr.A
Out[303]:
array([[0, 0, 1, 2],
[0, 0, 0, 2]])
In [304]: arr.indptr
Out[304]: array([0, 0, 0, 1, 3])
In [305]: np.diff(arr.indptr)
Out[305]: array([0, 0, 1, 2])
The last line shows how many nonzero terms there are in each column.
np.nonzero(np.diff(arr.indptr))[0][0] would be the index of the first nonzero column.
Do the same on a csr matrix to find the first nonzero row.
I can elaborate on indptr if you want.
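Put together, the pivot search from the question could look like this for a csc matrix, as a sketch (nnz_per_col is a made-up name):
import numpy as np
from scipy import sparse

D = sparse.csc_matrix([[1, 0, 0], [2, 0, 0], [3, 0, 1]])
i = 1
nnz_per_col = np.diff(D.indptr)              # stored entries per column
pcol = i + np.argmax(nnz_per_col[i:] > 0)    # first column >= i with a nonzero
print(pcol)  # 2
One caveat: indptr counts stored entries, which can include explicit zeros; call D.eliminate_zeros() first if that matters. And, like the dense argmax version, this returns i when no later column has a nonzero.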