I have a large (79 000 x 480 000) sparse csr matrix. I am trying to remove all columns (within a certain range) for which each value < k.
In regular numpy matrices this is simply done by a mask:
m = np.array([[0,2,1,1],
[0,4,2,0],
[0,3,4,0]])
mask = (m < 2)
idx = mask.all(axis=0)
result = m[:, ~idx]
print result
>>> [[2 1]
[4 2]
[3 4]]
However, the unary bitwise negation operator ~ and boolean masking are not available for sparse matrices. What is the best method to:
Obtain the indices of columns where all values fulfill condition e < k.
Remove these columns based on the list of indices.
Some things to note:
The columns represent ngram text features: there are no columns in the matrix for which each element is zero.
Is using the csr matrix format even a plausible choice for this?
Do I transpose and make use of .nonzero()? I have a fair amount of working memory (192GB) so time efficiency is preferable to memory efficiency.
If I do
M = sparse.csr_matrix(m)
M < 2
I get an efficiency warning, because all the 0 values of M satisfy the condition:
In [1754]: print(M)
(0, 1) 2
(0, 2) 1
(0, 3) 1
(1, 1) 4
(1, 2) 2
(2, 1) 3
(2, 2) 4
In [1755]: print(M<2)
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:275: SparseEfficiencyWarning: Comparing a sparse matrix with a scalar greater than zero using < is inefficient, try using >= instead.
warn(bad_scalar_msg, SparseEfficiencyWarning)
(0, 0) True # not in M
(0, 2) True
(0, 3) True
(1, 0) True # not in M
(1, 3) True
(2, 0) True # not in M
(2, 3) True
In [1756]: print(M>=2) # all a subset of M
(0, 1) True
(1, 1) True
(1, 2) True
(2, 1) True
(2, 2) True
If I = M>=2, there isn't an all method, but there is a sum.
In [1760]: I.sum(axis=0)
Out[1760]: matrix([[0, 3, 2, 0]], dtype=int32)
sum is actually performed using a matrix multiplication
In [1769]: np.ones((1,3),int)*I
Out[1769]: array([[0, 3, 2, 0]], dtype=int32)
Using nonzero to find the nonzero columns:
In [1778]: np.nonzero(I.sum(axis=0))
Out[1778]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
In [1779]: M[:,np.nonzero(I.sum(axis=0))[1]]
Out[1779]:
<3x2 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [1780]: M[:,np.nonzero(I.sum(axis=0))[1]].A
Out[1780]:
array([[2, 1],
[4, 2],
[3, 4]], dtype=int32)
General points:
watch out for those 0 values when doing comparisons
watch out for False values when doing logic on sparse matrices
sparse matrices are optimized for math, especially matrix multiplication
sparse indexing isn't quite as powerful as array indexing; and not as fast either.
note when operations produce a dense matrix
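Pulling the steps above together, a minimal sketch of the column filter might look like the following. The helper name drop_small_columns is made up for illustration, and it assumes k > 0 so that the implicit zeros can never satisfy M >= k.

```python
import numpy as np
from scipy import sparse

def drop_small_columns(M, k):
    """Drop every column of sparse M whose values are all < k (assumes k > 0).

    Hypothetical helper for illustration, not part of scipy.
    """
    hits = M >= k                                  # sparse boolean; only stored values can be True
    counts = np.asarray(hits.sum(axis=0)).ravel()  # number of values >= k in each column
    keep = np.flatnonzero(counts)                  # columns with at least one such value
    return M[:, keep], keep

m = np.array([[0, 2, 1, 1],
              [0, 4, 2, 0],
              [0, 3, 4, 0]])
M = sparse.csr_matrix(m)
result, keep = drop_small_columns(M, 2)
print(keep)               # column indices that survive
print(result.toarray())   # dense view of the filtered matrix
```

For a matrix as wide as the one in the question, selecting columns from a CSC copy (M.tocsc()) is probably faster than from CSR, at the cost of the conversion; that is worth timing on the real data.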
I am working on argmax function of PyTorch which is defined as:
torch.argmax(input, dim=None, keepdim=False)
Consider an example
a = torch.randn(4, 4)
print(a)
print(torch.argmax(a, dim=1))
Here when I use dim=1 instead of searching column vectors, the function searches for row vectors as shown below.
print(a) :
tensor([[-1.7739, 0.8073, 0.0472, -0.4084],
[ 0.6378, 0.6575, -1.2970, -0.0625],
[ 1.7970, -1.3463, 0.9011, -0.8704],
[ 1.5639, 0.7123, 0.0385, 1.8410]])
print(torch.argmax(a, dim=1))
tensor([1, 1, 0, 3])
As far as my assumption goes, dim=0 represents rows and dim=1 represents columns.
It's time to correctly understand how the axis or dim argument works in PyTorch.
The following example should make sense once you see that dim-0 runs down the rows and dim-1 runs across the columns:
|
v
dim-0 ---> -----> dim-1 ------> -----> --------> dim-1
| [[-1.7739, 0.8073, 0.0472, -0.4084],
v [ 0.6378, 0.6575, -1.2970, -0.0625],
| [ 1.7970, -1.3463, 0.9011, -0.8704],
v [ 1.5639, 0.7123, 0.0385, 1.8410]]
|
v
# argmax (indices where max values are present) along dimension-1
In [215]: torch.argmax(a, dim=1)
Out[215]: tensor([1, 1, 0, 3])
Note: dim (short for 'dimension') is the torch equivalent of 'axis' in NumPy.
Dimensions are defined as shown in the above excellent answer. I have highlighted the way I understand dimensions in Torch and Numpy (dim and axis respectively) and hope that this will be helpful to others.
Notice that only the specified dimension's index varies during the argmax operation, and the specified dimension's index range reduces to a single index once the operation is completed. Let tensor A have M rows and N columns and consider the sum operation for simplicity. The shape of A is (M, N). If dim=0 is specified, then the vectors A[0,:], A[1,:], ..., A[M-1,:] are summed elementwise and the result is another tensor with 1 row and N columns. Notice that only the 0th dimension's indices vary, from 0 through M-1. Similarly, if dim=1 is specified, then the vectors A[:,0], A[:,1], ..., A[:,N-1] are summed elementwise and the result is another tensor with M rows and 1 column.
An example is given below:
>>> A = torch.tensor([[1,2,3], [4,5,6]])
>>> A
tensor([[1, 2, 3],
[4, 5, 6]])
>>> S0 = torch.sum(A, dim = 0)
>>> S0
tensor([5, 7, 9])
>>> S1 = torch.sum(A, dim = 1)
>>> S1
tensor([ 6, 15])
In the above sample code, the first sum operation specifies dim=0, therefore A[0,:] and A[1,:], which are [1,2,3] and [4,5,6], are summed, resulting in [5, 7, 9]. When dim=1 was specified, the vectors A[:,0], A[:,1], and A[:,2], which are the vectors [1, 4], [2, 5], and [3, 6], are elementwise added to find [6, 15].
Note also that the specified dimension collapses. Again let A have the shape (M, N). If dim=0, then the result will have the shape (1, N), where dimension 0 is reduced from M to 1. Similarly if dim=1, then the result would have the shape (M, 1), where N is reduced to 1. Note also that shapes (1, N) and (M,1) are represented by a single-dimensional tensor with N and M elements respectively.
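The collapsing of the reduced dimension can be checked directly. The sketch below uses NumPy's axis and keepdims as a stand-in for torch's dim and keepdim (the note above equates axis with dim); it is only an illustration of the shapes, not PyTorch code.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])                # shape (M, N) = (2, 3)

s0 = A.sum(axis=0)                       # axis 0 collapses, N values remain
s1 = A.sum(axis=1)                       # axis 1 collapses, M values remain
s0_keep = A.sum(axis=0, keepdims=True)   # collapsed axis kept with length 1
s1_keep = A.sum(axis=1, keepdims=True)

print(s0.shape, s1.shape)                # (3,) (2,)
print(s0_keep.shape, s1_keep.shape)      # (1, 3) (2, 1)
print(A.argmax(axis=1))                  # [2 2], position of the max within each row
```

With keepdims=True the result keeps the (1, N) or (M, 1) shape described above; without it the length-1 dimension is dropped.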
Is it possible to do <= or >= operations on Scipy sparse matrices, such that the expression returns True if the operation is true for all corresponding elements? For example, a <= b means that for all corresponding elements (a, b) in matrices (A, B), a <= b? Here's an example to consider:
import numpy as np
from scipy.sparse import csr_matrix
np.random.seed(0)
mat = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(mat.A)
print()
np.random.seed(1)
matb = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(matb.A)
Running this gives the warning: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead and gives the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I'd like to be able to take 2 sparse matrices, A and B, and determine if A <= B for each pair of corresponding elements (a, b) in (A, B). Is this possible? What would the performance of such an operation be?
In [402]: np.random.seed = 0
...: mat = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [403]: mat
Out[403]:
<10x12 sparse matrix of type '<class 'numpy.int64'>'
with 40 stored elements in Compressed Sparse Row format>
In [404]: mat.A
Out[404]:
array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
...
[0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1],
[0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]], dtype=int64)
In [405]: np.random.seed = 1
...: matb = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [407]: mat<matb
Out[407]:
<10x12 sparse matrix of type '<class 'numpy.bool_'>'
with 27 stored elements in Compressed Sparse Row format>
In [408]: mat>=matb
/home/paul/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py:295: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead.
"using <, >, or !=, instead.", SparseEfficiencyWarning)
Out[408]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 93 stored elements in Compressed Sparse Row format>
In your case, neither mat nor matb is particularly sparse: 40 and 36 nonzeros out of a possible 120. Even so, the mat<matb test results in 27 nonzero (True) values, while the >= test results in 93. Wherever both matrices are 0, the result is True.
It's warning us that using sparse matrices isn't going to save us space or time (compared to dense arrays) if we do this kind of testing. It's not going to kill us, it just won't be as efficient.
(Pulling some comments together for this answer):
To simply do elementwise <= on two sparse matrices A and B, you can do (A <= B). However, as @hpaulj points out, this is inefficient because any pair of corresponding 0 elements (e.g. (1,1) is 0 in both A and B) will be turned into a 1 by this operation. Assuming both A and B are sparse (mostly 0s), you will destroy their sparsity by making them mostly 1s.
To get around this, consider the following:
A = csr_matrix((3, 3))
A[1, 1] = 1
print(A.A)
print()
B = csr_matrix((3, 3))
B[0, 0] = 1
B[1, 1] = 2
print(B.A)
print(not (A > B).count_nonzero())
To explain that last line, A > B will do the opposite of A <= B, so corresponding 0s will remain 0, and any place where a > b will become a 1. Therefore, if the resulting matrix has any non-zero elements, it means that there is some (a, b) in (A, B) where a > b. This means that it is not the case that A <= B (elementwise).
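The same idea can be wrapped in a small function. This is only a sketch; the name all_leq is invented for illustration and is not a scipy function.

```python
from scipy.sparse import csr_matrix

def all_leq(A, B):
    """Return True if a <= b for every pair of corresponding elements.

    Hypothetical helper: A > B stores True only where the inequality is
    violated, so the intermediate result stays sparse.
    """
    return (A > B).count_nonzero() == 0

A = csr_matrix([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
B = csr_matrix([[1, 0, 0], [0, 2, 0], [0, 0, 0]])
print(all_leq(A, B))   # True
print(all_leq(B, A))   # False
```

The reverse test A >= B follows the same pattern with (A < B).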
How can I get the values of a sparse matrix? For example:
x = sp.sparse.csr_matrix([[0,0,-1,1,0],[0,0,0,0,-1]])
print(x)
(0, 2) -1
(0, 3) 1
(1, 4) -1
I am just looking for the values of the data, i.e., [-1, 1, -1].
This can be accessed through the data property:
x = sp.sparse.csr_matrix([[0,0,-1,1,0],[0,0,0,0,-1]])
print(x.data)
[-1 1 -1]
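If the positions are needed along with the values, one option (a sketch, not the only way) is to convert to COO format, which exposes row, col and data as parallel arrays. Note that data holds the stored entries, so it can also contain explicitly stored zeros if the matrix was built that way.

```python
from scipy.sparse import csr_matrix

x = csr_matrix([[0, 0, -1, 1, 0],
                [0, 0, 0, 0, -1]])

values = x.data.tolist()   # stored values only

coo = x.tocoo()            # parallel arrays: coo.row, coo.col, coo.data
for r, c, v in zip(coo.row, coo.col, coo.data):
    print((int(r), int(c)), int(v))
```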
np.nditer automatically iterates over the elements of an array row-wise. Is there a way to iterate over the elements of an array column-wise?
x = np.array([[1,3],[2,4]])
for i in np.nditer(x):
print i
# 1
# 3
# 2
# 4
What I want is:
for i in Columnwise Iteration(x):
print i
# 1
# 2
# 3
# 4
Is my best bet just to transpose my array before doing the iteration?
For completeness, you don't necessarily have to transpose the matrix before iterating through the elements. With np.nditer you can specify the order of how to iterate through the matrix. The default is usually row-major or C-like order. You can override this behaviour and choose column-major, or FORTRAN-like order which is what you desire. Simply specify an additional argument order and set this flag to 'F' when using np.nditer:
In [16]: x = np.array([[1,3],[2,4]])
In [17]: for i in np.nditer(x,order='F'):
....: print i
....:
1
2
3
4
You can read more about how to control the order of iteration here: http://docs.scipy.org/doc/numpy-1.10.0/reference/arrays.nditer.html#controlling-iteration-order
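For reference, a Python 3 sketch of the same iteration, collecting the values instead of printing them. The ravel call is an alternative that gives the same ordering without an explicit loop.

```python
import numpy as np

x = np.array([[1, 3], [2, 4]])

# nditer yields 0-d arrays; int() turns each one into a plain Python int
col_major = [int(v) for v in np.nditer(x, order='F')]

# same ordering as a flat list, without iterating in Python
flat = x.ravel(order='F').tolist()

print(col_major)   # [1, 2, 3, 4]
print(flat)        # [1, 2, 3, 4]
```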
You could use the shape and slice each column
>>> [x[:, i] for i in range(x.shape[1])]
[array([1, 2]), array([3, 4])]
You could transpose it?
>>> x = np.array([[1,3],[2,4]])
>>> [y for y in x.T]
[array([1, 2]), array([3, 4])]
Or less elegantly:
>>> [np.array([x[j,i] for j in range(x.shape[0])]) for i in range(x.shape[1])]
[array([1, 2]), array([3, 4])]
nditer is not the best iteration tool for this case. It is useful when working toward a compiled (cython) solution, but not in pure Python coding.
Look at some regular iteration strategies:
In [832]: x=np.array([[1,3],[2,4]])
In [833]: x
Out[833]:
array([[1, 3],
[2, 4]])
In [834]: for i in x:print i # print each row
[1 3]
[2 4]
In [835]: for i in x.T:print i # print each column
[1 2]
[3 4]
In [836]: for i in x.ravel():print i # print values in order
1
3
2
4
In [837]: for i in x.T.ravel():print i # print values in column order
1
2
3
4
You commented: I need to fill values into an array based on the index of each cell in the array
What do you mean by index?
A crude 2d iteration with indexing:
In [838]: for i in range(2):
.....: for j in range(2):
.....: print (i,j),x[i,j]
(0, 0) 1
(0, 1) 3
(1, 0) 2
(1, 1) 4
ndindex uses nditer to generate similar indexes
In [841]: for i,j in np.ndindex(x.shape):
.....: print (i,j),x[i,j]
.....:
(0, 0) 1
(0, 1) 3
(1, 0) 2
(1, 1) 4
enumerate is a good Python way of getting both values and indexes:
In [847]: for i,v in enumerate(x):print i,v
0 [1 3]
1 [2 4]
Or you can use meshgrid to generate all the indexes, as arrays
In [843]: I,J=np.meshgrid(range(2),range(2))
In [844]: I
Out[844]:
array([[0, 1],
[0, 1]])
In [845]: J
Out[845]:
array([[0, 0],
[1, 1]])
In [846]: x[I,J]
Out[846]:
array([[1, 2],
[3, 4]])
Note that most of these iterative methods just treat your array as a list of lists. They don't take advantage of the array nature, and will be slow compared to methods that work with the whole x.
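As a rough illustration of working with the whole array at once, the index-based fill can often be written without a Python loop. The fill rule 10*i + j below is purely hypothetical, chosen only to show the pattern.

```python
import numpy as np

x = np.array([[1, 3], [2, 4]])

# np.indices returns one array per axis: i[r, c] == r and j[r, c] == c
i, j = np.indices(x.shape)

# hypothetical rule: each cell gets a value computed from its own (row, col) index
filled = 10 * i + j

print(filled.tolist())   # [[0, 1], [10, 11]]
```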
I am looking for the first column containing a nonzero element in a sparse matrix (scipy.sparse.csc_matrix). Actually, the first column starting with the i-th one to contain a nonzero element.
This is part of a certain type of linear equation solver. For dense matrices I had the following: (Relevant line is pcol = ...)
import numpy
D = numpy.matrix([[1,0,0],[2,0,0],[3,0,1]])
i = 1
pcol = i + numpy.argmax(numpy.any(D[:,i:], axis=0))
if pcol != i:
# Pivot columns i, pcol
D[:,[i,pcol]] = D[:,[pcol,i]]
print(D)
# Result should be numpy.matrix([[1,0,0],[2,0,0],[3,1,0]])
The above should swap columns 1 and 2. If we set i = 0 instead, D is unchanged since column 0 already contains nonzero entries.
What is an efficient way to do this for scipy.sparse matrices? Are there analogues for the numpy.any() and numpy.argmax() functions?
With a csc matrix it is easy to find the nonzero columns.
In [302]: arr=sparse.csc_matrix([[0,0,1,2],[0,0,0,2]])
In [303]: arr.A
Out[303]:
array([[0, 0, 1, 2],
[0, 0, 0, 2]])
In [304]: arr.indptr
Out[304]: array([0, 0, 0, 1, 3])
In [305]: np.diff(arr.indptr)
Out[305]: array([0, 0, 1, 2])
The last line shows how many nonzero terms there are in each column.
np.nonzero(np.diff(arr.indptr))[0][0] would be the index of the first nonzero value in that diff.
Do the same on a csr matrix to find the first nonzero row.
I can elaborate on indptr if you want.
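To connect this back to the question: in CSC format, indptr[j] and indptr[j+1] delimit the slice of indices/data that belongs to column j, so their difference is the number of stored entries in that column. A minimal sketch of the "first nonzero column starting from i" search follows; the function name first_nonzero_col is made up, and it assumes the matrix holds no explicitly stored zeros.

```python
import numpy as np
from scipy import sparse

def first_nonzero_col(A, i=0):
    """Index of the first column >= i with a stored entry, or None if there is none.

    Hypothetical helper; explicitly stored zeros would be counted as nonzero.
    """
    A = sparse.csc_matrix(A)                 # no-op conversion if A is already CSC
    counts = np.diff(A.indptr)[i:]           # stored entries per column, from column i on
    hits = np.flatnonzero(counts)
    return i + int(hits[0]) if hits.size else None

D = sparse.csc_matrix([[1, 0, 0], [2, 0, 0], [3, 0, 1]])
print(first_nonzero_col(D, 1))   # 2
print(first_nonzero_col(D, 0))   # 0
```

With the dense example from the question, i = 1 gives pcol = 2, which is the column to pivot with.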