Python sparse intersection of matrices non-zero values - python

I have two sparse* adjacency matrices A1 and A2 of type 'numpy.int64'.
The nodes of the corresponding graphs are labeled by integers and the indices of the matrices correspond to these nodes (the matrix value being the link weight between the nodes).
I'm trying to compute a similarity measure between the graphs. To do this I need to find the adjacency matrix for the subgraph of each graph, which contains the nodes common to both graphs.
Nothing is guaranteed about the matrices having equal sizes, or about the graphs sharing any nodes.
The result should be the same adjacency matrices with values for nodes not in both graphs equal to zero.
Example:
A1:
array([[ 0, 1, 2, 1],
[ 1, 0, 0, 0],
[ 2, 0, 0, 0],
[ 1, 0, 0, 0]])
A2:
array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
Outcome:
A1':
array([[ 0, 0, 2, 0],
[ 0, 0, 0, 0],
[ 2, 0, 0, 0],
[ 0, 0, 0, 0]])
A2':
array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
The matrices I'm using are on the order of 10^5 x 10^5. The resulting size doesn't matter; I'll slice down to the size of the smallest afterwards.
I'll be repeating this operation many times and so speed is important.
Attempts so far:
I can get the list of common nodes by:
np.intersect1d(A1.nonzero()[0], A2.nonzero()[0])
But I can't find a way of using this as a filter to map the values for indices not in this list to 0.
*I don't think I necessarily need to use sparse matrices, though they are strongly preferable for scalability later.

If I understand your question correctly, based on the example you have provided, you can simply use numpy.in1d (numpy.isin in newer NumPy versions) to get a boolean index array, for example
A1 = np.array([[ 0, 1, 2, 1],
[ 1, 0, 0, 0],
[ 2, 0, 0, 0],
[ 1, 0, 0, 0]])
A2 = np.array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
idx = np.in1d(A1,A2).reshape(A1.shape)
A1[idx] = 0
print(A1)
# prints
[[0 0 2 0]
[0 0 0 0]
[2 0 0 0]
[0 0 0 0]]
For sparse matrices, the right solution depends on which sparse format you are using. If you are using the csr or csc format, you can apply the same technique to the stored coefficients (V_IJ) in A1.data, and then use the resulting array (idx) to modify the corresponding index arrays (I and J), i.e. A1.indices and A1.indptr.
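Note that the in1d call above matches on matrix values, which happens to reproduce the example. As an alternative sketch, assuming "common nodes" means the index list from np.intersect1d in the question, the rows and columns of every other node can be zeroed with a boolean mask. The helper name restrict_to_common is hypothetical, and the dense outer-product mask is only practical for small matrices.

```python
import numpy as np

def restrict_to_common(A, common):
    """Zero every row and column of A whose node index is not in `common`."""
    keep = np.zeros(A.shape[0], dtype=bool)
    # Ignore node labels that fall outside this (possibly smaller) matrix.
    keep[common[common < A.shape[0]]] = True
    # The outer product is True only where both endpoints are common nodes.
    return np.where(np.outer(keep, keep), A, 0)

A1 = np.array([[0, 1, 2, 1],
               [1, 0, 0, 0],
               [2, 0, 0, 0],
               [1, 0, 0, 0]])
A2 = np.array([[0, 0, 1],
               [0, 0, 0],
               [1, 0, 0]])

common = np.intersect1d(A1.nonzero()[0], A2.nonzero()[0])
A1p = restrict_to_common(A1, common)
A2p = restrict_to_common(A2, common)
```

For scipy.sparse matrices, the same masking can probably be expressed without densifying by multiplying on both sides with a diagonal matrix built from the mask, e.g. D = scipy.sparse.diags(keep.astype(A.dtype)) followed by D @ A @ D.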

Related

Numpy - Matrix multiplication to return ndarray, not sum

All, I have an application that requires returning a numpy ndarray, rather than a simple sum, when multiplying two matrices; e.g.:
import numpy as np
x = np.array([[1, 1, 0], [0, 1, 1]])
y = np.array([[1, 0, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0]])
w = x @ y
>>> array([[2, 0, 1, 1],
[1, 0, 1, 0]])
However, the requirement is to return an ndarray (in this case..):
array([[[1,1,0], [0,0,0], [0,1,0], [1,0,0]],
[[0,1,0], [0,0,0], [0,1,0], [0,0,0]]])
Note that the matrix multiplication operation may be repeated; the output will be used as the left-side matrix of ndarrays for the next matrix multiplication operation, which would yield a higher-order ndarray after the second matrix multiplication operation, etc..
Any way to achieve this? I've looked at overloading __add__, and __radd__ by subclassing np.ndarray as discussed here, but mostly got dimension incompatibility errors.
Ideas?
Update:
Addressing @Divakar's answer: e.g., for a chained operation, adding
z = np.array([[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0]])
s1 = x[...,None] * y
s2 = s1[...,None] * z
results in an undesired output.
I suspect the issue starts with s1, which in the case above returns s1.shape = (2,3,4). It should be (2,4,3) since [2x3][3x4] = [2x4], but we're not really summing here, just return an array of length 3.
Similarly, s2.shape should be (2,3,4,3), which [incidentally] it is, but with undesired output (it's not 'wrong', just not what we're looking for).
To elaborate, s1*z should be a [2x4][4x3] = [2x3] matrix. Each element of that matrix is itself an ndarray of shape [4x3], since we have 4 rows in z to multiply the elements in s1, and each element in s1 is itself 3 elements long (again, we're not arithmetically adding elements, but returning ndarrays, with the extended dimension being the row count of the right-hand matrix in the operation).
Ultimately, the desired output would be:
s2 = array([[[[1, 1, 0],
[0, 0, 0],
[0, 1, 0],
[0, 0, 0]],
[[1, 1, 0],
[0, 0, 0],
[0, 0, 0],
[1, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]],
[[[0, 1, 0],
[0, 0, 0],
[0, 1, 0],
[0, 0, 0]],
[[0, 1, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]]])
Extend them to 3D and leverage broadcasting -
x[:,None] * y.T
Or with np.einsum -
np.einsum('ij,jk->ikj',x,y)
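As a quick check of the two expressions above, the following sketch uses the x and y from the question and confirms that both forms agree, and that summing over the new last axis recovers the ordinary matrix product. The variable names are only illustrative.

```python
import numpy as np

x = np.array([[1, 1, 0], [0, 1, 1]])
y = np.array([[1, 0, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0]])

# (2, 1, 3) * (4, 3) broadcasts to (2, 4, 3): entry [i, k, j] is x[i, j] * y[j, k].
by_broadcast = x[:, None] * y.T
by_einsum = np.einsum('ij,jk->ikj', x, y)

# Summing the un-summed products over j gives back the usual matmul result.
collapsed = by_broadcast.sum(axis=-1)
```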
Going by OP's comment and the quote from the question :
... matrix multiplication operation may be repeated; the output will
be used as the left-side matrix of ndarrays for the next matrix
multiplication operation, which would yield a higher-order ndarray
after the second matrix multiplication operation, etc..
It seems, we need to do something along these lines -
s1 = x[...,None] * y
s2 = s1[...,None] * z # and so on.
The order of the axes would be different in this case, but this seems to be the simplest way to extend the solution to a generic number of incoming 2D arrays.
Following the edits in the question, seems like you are placing the incoming arrays from the first axis onwards for element-wise multiplication. So, if I got that right, you can swap axes to get the correct order, like so -
s1c = (x[...,None] * y).swapaxes(1,-1)
s2c = (s1c.swapaxes(1,-1)[...,None] * z).swapaxes(1,-1) # and so on.
If you are only interested in the final output, swap axes only at the final stage and skip those in the intermediate ones.
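For the three-matrix case, the swapaxes chain above can also be written as a single einsum call. This is a sketch under the assumption that the desired layout is s2[i, l, k, j] = x[i, j] * y[j, k] * z[k, l], which is what the swapaxes version produces for the x, y and z from the question.

```python
import numpy as np

x = np.array([[1, 1, 0], [0, 1, 1]])
y = np.array([[1, 0, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0]])
z = np.array([[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0]])

# The swapaxes chain from the answer, kept for comparison.
s1c = (x[..., None] * y).swapaxes(1, -1)
s2c = (s1c.swapaxes(1, -1)[..., None] * z).swapaxes(1, -1)

# Same result in one call: keep every index, only reorder the output axes.
s2 = np.einsum('ij,jk,kl->ilkj', x, y, z)

# Summing over the two un-summed axes recovers the chained matrix product.
collapsed = s2.sum(axis=(2, 3))
```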

Python: Cumulative insertion of values in a sparse matrix (lil_matrix) due to repeated indices

my situation is as follows:
I have an array of results, say
S = np.array([2,3,10,-1,12,1,2,4,4]), which I would like to insert in the last row of a scipy.sparse.lil_matrix M according to an array of column indices with possibly repeated elements (with no specific pattern), e.g.:
j = np.array([3,4,5,14,15,16,3,4,5]).
When column indices are repeated, the sum of their corresponding values in S should be inserted in the matrix M. Thus, in the example above, results [4,7,14] should be placed in columns [3,4,5] of the last row of M. In other words, I would like to achieve something like:
M[-1,j] = np.array([2+2,3+4,10+4,-1,12,1]).
Calculation speed is very important for my program, such that I should avoid using loops. Looking forward to your clever solutions! Thanks!
That kind of summation is the normal behavior for sparse matrices, especially in the csr format.
define the 3 input arrays:
In [408]: S = np.array([2,3,10,-1,12,1,2,4,4])
In [409]: j=np.array([3,4,5,14,15,16,3,4,5])
In [410]: i=np.ones(S.shape,int)
The coo format takes those 3 arrays, as is, without change
In [411]: c0=sparse.coo_matrix((S,(i,j)))
In [412]: c0.data
Out[412]: array([ 2, 3, 10, -1, 12, 1, 2, 4, 4])
But when converted to csr format, it sums repeated indices:
In [413]: c1=c0.tocsr()
In [414]: c1.data
Out[414]: array([ 4, 7, 14, -1, 12, 1], dtype=int32)
In [415]: c1.A
Out[415]:
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 4, 7, 14, 0, 0, 0, 0, 0, 0, 0, 0, -1, 12, 1]], dtype=int32)
That summation is also done when converting the coo to a dense matrix or array (c0.A), and when converting to lil:
In [419]: cl=c0.tolil()
In [420]: cl.data
Out[420]: array([[], [4, 7, 14, -1, 12, 1]], dtype=object)
In [421]: cl.rows
Out[421]: array([[], [3, 4, 5, 14, 15, 16]], dtype=object)
lil_matrix does not accept the (data,(i,j)) input directly, so you have to go through coo if that is your target.
http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.sparse.coo_matrix.html
By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
To do this as an insertion in an existing lil use an intermediate csr:
In [443]: L=sparse.lil_matrix((3,17),dtype=S.dtype)
In [444]: L[-1,:]=sparse.csr_matrix((S,(np.zeros(S.shape),j)))
In [445]: L.A
Out[445]:
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 4, 7, 14, 0, 0, 0, 0, 0, 0, 0, 0, -1, 12, 1]])
This statement is faster than the one using csr_matrix:
L[-1,:]=sparse.coo_matrix((S,(np.zeros(S.shape),j)))
Examine L.__setitem__ if you are really worried about speed. Offhand, it looks like it normally converts a sparse matrix to an array, so
L[-1,:]=sparse.coo_matrix((S,(np.zeros(S.shape),j))).A
takes the same time. With a small test case like this, the overhead of creating an intermediate matrix can swamp any time spent adding these duplicate indices.
In general, inserting or appending values to an existing sparse matrix is slow, regardless of whether you do this summation or not. Where possible it is best to create the data, i and j arrays for the whole matrix first, and then make the sparse matrix.
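If only the summed values per column are needed, the duplicates can also be collapsed in plain NumPy before touching the sparse matrix. This is a sketch of that alternative, not part of the answer above; it combines np.unique with np.bincount, and the names cols and sums are illustrative.

```python
import numpy as np

S = np.array([2, 3, 10, -1, 12, 1, 2, 4, 4])
j = np.array([3, 4, 5, 14, 15, 16, 3, 4, 5])

# cols holds each distinct column once; inverse maps every entry of j to its slot in cols.
cols, inverse = np.unique(j, return_inverse=True)

# bincount with weights adds up the S values that share a slot (result is float64).
sums = np.bincount(inverse, weights=S)
```

The pair (cols, sums) could then be written in one assignment such as M[-1, cols] = sums, assuming M is a lil_matrix with enough columns.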
You could use a defaultdict that maps the M column indices to their summed values, and use the map function to update this defaultdict, like so (on Python 3, map is lazy, so it is wrapped in list() to force the calls):
from collections import defaultdict
d = defaultdict(int)  # use your array's value type here
def f(j, s):
    d[j] += s
list(map(f, j, S))  # list() forces evaluation on Python 3
M[-1, list(d.keys())] = list(d.values())  # keys and values are always in the same order
Instead of map, you can use filter if you don't want to create a list of None uselessly (g returns None, so the filtered result is empty):
d = defaultdict(int)  # use your array's value type here
def g(e):
    d[e[1]] += S[e[0]]
list(filter(g, enumerate(j)))  # list() forces evaluation on Python 3
M[-1, list(d.keys())] = list(d.values())  # keys and values are always in the same order

Numpy: increment elements of an array given the indices required to increment

I am trying to turn a second order tensor into a binary third order tensor. Given a second order tensor as a m x n numpy array: A, I need to take each element value: x, in A and replace it with a vector: v, with dimensions equal to the maximum value of A, but with a value of 1 incremented at the index of v corresponding to the value x (i.e. v[x] = 1). I have been following this question: Increment given indices in a matrix, which addresses producing an array with increments at indices given by 2 dimensional coordinates. I have been reading the answers and trying to use np.ravel_multi_index() and np.bincount() to do the same but with 3 dimensional coordinates, however I keep on getting a ValueError: "invalid entry in coordinates array". This is what I have been using:
def expand_to_tensor_3(array):
    (x, y) = array.shape
    (a, b) = np.indices((x, y))
    a = a.reshape(x*y)
    b = b.reshape(x*y)
    tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)), (x, y, np.amax(array))))
    return tensor_3
If you know what is wrong here or know an even better method to accomplish my goal, both would be really helpful, thanks.
You can use (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int).
Here's a demonstration:
In [52]: A
Out[52]:
array([[2, 0, 0, 2],
[3, 1, 2, 3],
[3, 2, 1, 0]])
In [53]: B = (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int)
In [54]: B
Out[54]:
array([[[0, 0, 1, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0]],
[[0, 0, 0, 1],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]],
[[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 0]]])
Check a few individual elements of A:
In [55]: A[0,0]
Out[55]: 2
In [56]: B[0,0,:]
Out[56]: array([0, 0, 1, 0])
In [57]: A[1,3]
Out[57]: 3
In [58]: B[1,3,:]
Out[58]: array([0, 0, 0, 1])
The expression A[:,:,np.newaxis] == np.arange(A.max()+1) uses broadcasting to compare each element of A to np.arange(A.max()+1). For a single value, this looks like:
In [63]: 3 == np.arange(A.max()+1)
Out[63]: array([False, False, False, True], dtype=bool)
In [64]: (3 == np.arange(A.max()+1)).astype(int)
Out[64]: array([0, 0, 0, 1])
A[:,:,np.newaxis] is a three-dimensional view of A with shape (3,4,1). The extra dimension is added so that the comparison to np.arange(A.max()+1) will broadcast to each element, giving a result with shape (3, 4, A.max()+1).
With a trivial change, this will work for an n-dimensional array. Indexing a numpy array with the ellipsis ... means "all the other dimensions". So
(A[..., np.newaxis] == np.arange(A.max()+1)).astype(int)
converts an n-dimensional array to an (n+1)-dimensional array, where the last dimension is the binary indicator of the integer in A. Here's an example with a one-dimensional array:
In [6]: a = np.array([3, 4, 0, 1])
In [7]: (a[...,np.newaxis] == np.arange(a.max()+1)).astype(int)
Out[7]:
array([[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]])
You can make it work this way:
tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)),
(x, y, np.amax(array) + 1)))
The difference is that I add 1 to the amax() result, because ravel_multi_index() expects that the indexes are all strictly less than the dimensions, not less-or-equal.
I'm not 100% sure if this is what you wanted; another way to make the code run is to specify mode='clip' or mode='wrap' in ravel_multi_index(), which does something a bit different and I'm guessing is less correct. But you can try it.
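With the + 1 fix alone, np.bincount still returns a flat array whose length depends on the largest flat index that occurs. The sketch below is one way to complete the function from the question, assuming non-negative integer input; it pads with minlength and reshapes, and the name depth is illustrative.

```python
import numpy as np

def expand_to_tensor_3(array):
    """One-hot encode a 2-D non-negative integer array along a new last axis."""
    x, y = array.shape
    depth = int(array.max()) + 1  # values 0..max need max + 1 slots
    a, b = np.indices((x, y))
    flat = np.ravel_multi_index((a.ravel(), b.ravel(), array.ravel()), (x, y, depth))
    # minlength pads the counts so the result can be reshaped to (x, y, depth).
    return np.bincount(flat, minlength=x * y * depth).reshape(x, y, depth)

A = np.array([[2, 0, 0, 2],
              [3, 1, 2, 3],
              [3, 2, 1, 0]])
B = expand_to_tensor_3(A)
```

For this A the result matches the broadcasting expression (A[..., np.newaxis] == np.arange(A.max()+1)).astype(int) from the other answer, which is shorter and is likely the simpler choice.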

matlab find() for nonzero element in python

I have a sparse matrix (numpy.array) and I would like to have the index of the nonzero elements in it.
In Matlab I would write:
[i, j] = find(CM)
and in Python what should I do?
I have tried numpy.nonzero (but I don't know how to take the indices from that) and flatnonzero (but it's not convenient for me, I need both the row and column index).
Thanks in advance!
Assuming that by "sparse matrix" you don't actually mean a scipy.sparse matrix, but merely a numpy.ndarray with relatively few nonzero entries, then I think nonzero is exactly what you're looking for. Starting from an array:
>>> a = (np.random.random((5,5)) < 0.10)*1
>>> a
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
nonzero returns the indices (here x and y) where the nonzero entries live:
>>> a.nonzero()
(array([1, 2, 3]), array([4, 2, 0]))
We can assign these to i and j:
>>> i, j = a.nonzero()
We can also use them to index back into a, which should give us only 1s:
>>> a[i,j]
array([1, 1, 1])
We can even modify a using these indices:
>>> a[i,j] = 2
>>> a
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 2],
[0, 0, 2, 0, 0],
[2, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
If you want a combined array from the indices, you can do that too:
>>> np.array(a.nonzero()).T
array([[1, 4],
[2, 2],
[3, 0]])
(there are lots of ways to do this reshaping; I chose one almost at random.)
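As a compact recap of the steps above, the sketch below uses the same 5x5 array with fixed values instead of random ones. np.argwhere is one of the other ways to get the combined (row, column) array.

```python
import numpy as np

a = np.array([[0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])

# Equivalent of MATLAB's [i, j] = find(CM), with 0-based indices in row-major order.
i, j = np.nonzero(a)

# One (row, col) pair per non-zero entry, same as np.array(a.nonzero()).T.
pairs = np.argwhere(a)
```

If the input is an actual scipy.sparse matrix rather than an ndarray, scipy.sparse.find should give the row indices, column indices and values directly.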
This goes slightly beyond what you asked, and I only mention it since I once faced a similar problem. If you want the indices in order to access some other array, there is some very simple syntax:
import numpy as np
array = np.random.randint(0, 2, size=(3, 3))
data = np.random.random(size=(3, 3))
Now array looks something like
>>> array
array([[0, 1, 0],
[1, 0, 1],
[1, 1, 0]])
while data could be
>>> data
array([[ 0.92824816, 0.43605604, 0.16627849],
[ 0.00301434, 0.94342538, 0.95297402],
[ 0.32665135, 0.03504204, 0.86902492]])
Then if we want the elements of data at the positions where array is zero:
>>> data[array==0]
array([ 0.92824816, 0.16627849, 0.94342538, 0.86902492])
Which is nice and simple.

numpy matrix multiplication to triangular/sparse storage?

I'm working with a very large sparse matrix multiplication (matmul) problem. As an example let's say:
A is a binary ( 75 x 200,000 ) matrix. It's sparse, so I'm using csc for storage. I need to do the following matmul operation:
B = A.transpose() * A
The output is going to be a sparse and symmetric matrix of size 200Kx200K.
Unfortunately, B is going to be way too large to store in RAM (or "in core") on my laptop. On the other hand, I'm lucky because B has some properties that should solve this problem.
Since B is going to be symmetric along the diagonal and sparse, I could use a triangular matrix (upper/lower) to store the results of the matmul operation and a sparse matrix storage format could further reduce the size.
My question is...can numpy or scipy be told, ahead of time, what the output storage requirements are going to look like so that I can select a storage solution using numpy and avoid the "matrix is too big" runtime error after several minutes (hours) of calculation?
In other words, can storage requirements for the matrix multiply be approximated by analyzing the contents of the two input matrices using an approximate counting algorithm?
https://en.wikipedia.org/wiki/Approximate_counting_algorithm
If not, I'm looking into a brute-force solution: something involving map/reduce, out-of-core storage, or a matmul subdivision solution (Strassen's algorithm) from the following web links:
A couple Map/Reduce problem subdivision solutions
http://www.norstad.org/matrix-multiply/index.html
http://bpgergo.blogspot.com/2011/08/matrix-multiplication-in-python.html
An out-of-core (PyTables) storage solution
Very large matrices using Python and NumPy
A matmul subdivision solution:
https://en.wikipedia.org/wiki/Strassen_algorithm
http://facultyfp.salisbury.edu/taanastasio/COSC490/Fall03/Lectures/FoxMM/example.pdf
http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/
Thanks in advance for any recommendations, comments, or guidance!
Since you are after the product of a matrix with its transpose, the value at [m, n] is basically going to be the dot product of columns m and n in your original matrix.
I am going to use the following matrix as a toy example
a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
>>> np.dot(a.T, a)
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2]])
It is of shape (3, 12) and has 7 non-zero entries. The product of its transpose with it is of course of shape (12, 12) and has 16 non-zero entries, 6 of them on the diagonal, so storing one triangle (the diagonal plus half of the 10 off-diagonal entries) only requires 11 elements.
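The counts above can be checked on the sparse side as well. This is a minimal sketch, assuming scipy.sparse is available; it forms the product sparsely and keeps only the upper triangle with scipy.sparse.triu.

```python
import numpy as np
import scipy.sparse as sp

a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])

A = sp.csc_matrix(a)
B = (A.T @ A).tocsr()   # sparse product, never densified
B.eliminate_zeros()     # drop any explicitly stored zeros before counting

# Keep the diagonal and everything above it; the rest follows by symmetry.
upper = sp.triu(B, format='csr')
```

Here B.nnz is the full count and upper.nnz is the number of entries that actually need to be stored.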
You can get a good idea of what the size of your output matrix is going to be in one of two ways:
CSR FORMAT
If your original matrix has C non-zero columns, your new matrix will have at most C**2 non-zero entries, of which C are in the diagonal, and are assured not to be zero, and of the remaining entries you only need to keep half, so that is at most (C**2 + C) / 2 non-zero elements. Of course, many of these will also be zero, so this is probably a gross overestimate.
If your matrix is stored in csr format, then the indices attribute of the corresponding scipy object has an array with the column indices of all non zero elements, so you can easily compute the above estimate as:
>>> a_csr = scipy.sparse.csr_matrix(a)
>>> a_csr.indices
array([ 2, 11, 1, 7, 10, 4, 11])
>>> np.unique(a_csr.indices).shape[0]
6
So there are 6 columns with non-zero entries, and so the estimate would be for at most 36 non-zero entries, way more than the real 16.
CSC FORMAT
If instead of column indices of non-zero elements we have row indices, we can actually do a better estimate. For the dot product of two columns to be non-zero, they must have a non-zero element in the same row. If there are R non-zero elements in a given row, they will contribute R**2 non-zero elements to the product. When you sum this for all rows, you are bound to count some elements more than once, so this is also an upper bound.
The row indices of the non-zero elements of your matrix are in the indices attribute of a sparse csc matrix, so this estimate can be computed as follows:
>>> a_csc = scipy.sparse.csc_matrix(a)
>>> a_csc.indices
array([1, 0, 2, 1, 1, 0, 2])
>>> rows, where = np.unique(a_csc.indices, return_inverse=True)
>>> where = np.bincount(where)
>>> rows
array([0, 1, 2])
>>> where
array([2, 3, 2])
>>> np.sum(where**2)
17
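This is very close to the real 16! And it is no coincidence that, for a binary matrix like this one, the estimate equals the sum of all entries of the product, since each row with R ones contributes exactly R**2 to that sum: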
>>> np.sum(np.dot(a.T,a),axis=None)
17
In any case, the following code should allow you to see that the estimation is pretty good:
import numpy as np
import scipy.sparse

def estimate(a):
    # Upper bound on the non-zero count of a.T * a from per-row non-zero counts.
    a_csc = scipy.sparse.csc_matrix(a)
    _, where = np.unique(a_csc.indices, return_inverse=True)
    where = np.bincount(where)
    return np.sum(where**2)

def test(shape=(10, 1000), count=100):
    a = np.zeros(np.prod(shape), dtype=int)
    # Random positions may repeat, so fewer than `count` entries can end up set.
    a[np.random.randint(np.prod(shape), size=count)] = 1
    print('a non-zero = {0}'.format(np.sum(a)))
    a = a.reshape(shape)
    print('a.T * a non-zero = {0}'.format(np.flatnonzero(np.dot(a.T, a)).shape[0]))
    print('csc estimate = {0}'.format(estimate(a)))
>>> test(count=100)
a non-zero = 100
a.T * a non-zero = 1065
csc estimate = 1072
>>> test(count=200)
a non-zero = 199
a.T * a non-zero = 4056
csc estimate = 4079
>>> test(count=50)
a non-zero = 50
a.T * a non-zero = 293
csc estimate = 294
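For a matrix that is already sparse, the same row-count estimate can be read from the CSR index pointer without building any dense array. This is a sketch under the assumption that the matrix stores no explicit zeros; the function name estimate_nnz_upper_bound is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def estimate_nnz_upper_bound(A):
    """Upper bound on nnz(A.T @ A): sum over rows of (stored entries in row)**2."""
    A = sp.csr_matrix(A)
    # Consecutive differences of indptr give the number of stored entries per row.
    per_row = np.diff(A.indptr).astype(np.int64)
    return int(np.sum(per_row ** 2))

a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])

bound = estimate_nnz_upper_bound(sp.csc_matrix(a))
```

Since only one triangle has to be kept, the bound can be tightened: the diagonal count D is the number of non-empty columns of A, so at most (bound + D) / 2 entries need storing.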
