I'm working with a very large sparse matrix multiplication (matmul) problem. As an example, let's say:
A is a binary (75 x 200,000) matrix. It's sparse, so I'm using CSC for storage. I need to do the following matmul operation:
B = A.transpose() * A
The output is going to be a sparse, symmetric matrix of size 200K x 200K.
Unfortunately, B is going to be way too large to store in RAM (or "in core") on my laptop. On the other hand, I'm lucky because there are some properties of B that should solve this problem.
Since B is going to be symmetric and sparse, I could use a triangular matrix (upper or lower) to store the results of the matmul operation, and a sparse matrix storage format could further reduce the size.
My question is...can numpy or scipy be told, ahead of time, what the output storage requirements are going to look like so that I can select a storage solution using numpy and avoid the "matrix is too big" runtime error after several minutes (hours) of calculation?
In other words, can storage requirements for the matrix multiply be approximated by analyzing the contents of the two input matrices using an approximate counting algorithm?
https://en.wikipedia.org/wiki/Approximate_counting_algorithm
If not, I'm looking into a brute-force solution: something involving map/reduce, out-of-core storage, or a matmul subdivision solution (Strassen's algorithm) from the following web links:
A couple of Map/Reduce problem-subdivision solutions:
http://www.norstad.org/matrix-multiply/index.html
http://bpgergo.blogspot.com/2011/08/matrix-multiplication-in-python.html
An out-of-core (PyTables) storage solution:
Very large matrices using Python and NumPy
A matmul subdivision solution:
https://en.wikipedia.org/wiki/Strassen_algorithm
http://facultyfp.salisbury.edu/taanastasio/COSC490/Fall03/Lectures/FoxMM/example.pdf
http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/
Thanks in advance for any recommendations, comments, or guidance!
Since you are after the product of a matrix with its transpose, the value at [m, n] is basically going to be the dot product of columns m and n in your original matrix.
I am going to use the following matrix as a toy example
a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
>>> np.dot(a.T, a)
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2]])
It is of shape (3, 12) and has 7 non-zero entries. The product of its transpose with it is of course of shape (12, 12) and has 16 non-zero entries, 6 of them on the diagonal, so it only requires storage of 11 elements.
You can get a good idea of what the size of your output matrix is going to be in one of two ways:
CSR FORMAT
If your original matrix has C non-zero columns, your new matrix will have at most C**2 non-zero entries, of which C are on the diagonal and are assured not to be zero; of the remaining entries you only need to keep half, so that is at most (C**2 + C) / 2 non-zero elements. Of course, many of these will also be zero, so this is probably a gross overestimate.
If your matrix is stored in csr format, then the indices attribute of the corresponding scipy object holds the column indices of all non-zero elements, so you can easily compute the above estimate as:
>>> a_csr = scipy.sparse.csr_matrix(a)
>>> a_csr.indices
array([ 2, 11, 1, 7, 10, 4, 11])
>>> np.unique(a_csr.indices).shape[0]
6
So there are 6 columns with non-zero entries, and so the estimate would be for at most 36 non-zero entries, way more than the real 16.
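For completeness, the triangular-storage bound from the paragraph above can be computed the same way; a quick sketch with the toy matrix:
>>> C = np.unique(a_csr.indices).shape[0]
>>> C**2, (C**2 + C) // 2
(36, 21)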
CSC FORMAT
If instead of the column indices of non-zero elements we have the row indices, we can actually make a better estimate. For the dot product of two columns to be non-zero, they must have a non-zero element in the same row. If there are R non-zero elements in a given row, they will contribute R**2 non-zero elements to the product. When you sum this over all rows, you are bound to count some elements more than once, so this is also an upper bound.
The row indices of the non-zero elements of your matrix are in the indices attribute of a sparse csc matrix, so this estimate can be computed as follows:
>>> a_csc = scipy.sparse.csc_matrix(a)
>>> a_csc.indices
array([1, 0, 2, 1, 1, 0, 2])
>>> rows, where = np.unique(a_csc.indices, return_inverse=True)
>>> where = np.bincount(where)
>>> rows
array([0, 1, 2])
>>> where
array([2, 3, 2])
>>> np.sum(where**2)
17
This is darn close to the real 16! And it is no coincidence that this estimate is the same as:
>>> np.sum(np.dot(a.T, a), axis=None)
17
Summing every entry of np.dot(a.T, a) amounts to summing, for each row of a, (row sum)**2; for a binary matrix the row sum is exactly the number of non-zero elements R in that row, so the total is the same sum of R**2 computed above.
In any case, the following code should allow you to see that the estimation is pretty good:
import numpy as np
import scipy.sparse

def estimate(a):
    a_csc = scipy.sparse.csc_matrix(a)
    _, where = np.unique(a_csc.indices, return_inverse=True)
    where = np.bincount(where)
    return np.sum(where**2)

def test(shape=(10, 1000), count=100):
    a = np.zeros(np.prod(shape), dtype=int)
    a[np.random.randint(np.prod(shape), size=count)] = 1
    print('a non-zero = {0}'.format(np.sum(a)))
    a = a.reshape(shape)
    print('a.T * a non-zero = {0}'.format(np.flatnonzero(np.dot(a.T, a)).shape[0]))
    print('csc estimate = {0}'.format(estimate(a)))
>>> test(count=100)
a non-zero = 100
a.T * a non-zero = 1065
csc estimate = 1072
>>> test(count=200)
a non-zero = 199
a.T * a non-zero = 4056
csc estimate = 4079
>>> test(count=50)
a non-zero = 50
a.T * a non-zero = 293
csc estimate = 294
Related
I have the following task to solve.
I have an image (a numpy array) where everything that is not the main object is 0 and the main object's pixels have some values (let's set all of them to 1).
What I need is to get the number of all the pixels on the contour of this object, i.e. the pixels with value 1 that border the background. The objects can have different forms.
Is there any way to achieve this?
Note: the goal is to have a method that can adapt to the shape of the figure, because it would be run on multiple images simultaneously.
I propose a solution similar to @user2640045's, using convolution.
We can slide a filter over the array that counts the number of neighbours (left, right, top, bottom):
import numpy as np
from scipy import signal
a = np.array(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
]
)
filter = np.array([[0, 1, 0],
[1, 0, 1],
[0, 1, 0]])
Now we convolve the image array with the filter:
conv = signal.convolve2d(a, filter, mode='same')
Every element that has more than zero and fewer than four neighbours, while being active itself, is a boundary element:
bounds = (a > 0) & np.logical_and(conv > 0, conv < 4)
We can apply this mask to get the boundary pixels and sum them up:
>>> a[bounds].sum()
8
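If diagonal neighbours should also count as contact (8-connectivity), the same idea works with a different kernel and threshold; this is my own variation, not part of the answer above:
filter8 = np.ones((3, 3), dtype=int)
filter8[1, 1] = 0                       # the 8 surrounding cells, centre excluded
conv8 = signal.convolve2d(a, filter8, mode='same')
bounds8 = (a > 0) & (conv8 < 8)         # active pixels missing at least one neighbour
print(a[bounds8].sum())                 # 12 here: only the centre pixel is fully surrounded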
This is interesting, and I have an elegant solution for you.
Since we can agree that the contour is defined as any array value that is greater than 0 and has at least one neighbour with a value of 0, we can solve it pretty straightforwardly and make sure it works for every image you will ever get (as a NumPy array, of course...):
import numpy as np
image_pxs = np.array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
def get_contour(two_d_arr):
    contour_pxs = 0
    # Iterate over the array:
    for i, row in enumerate(two_d_arr):
        for j, pixel in enumerate(row):
            # Check whether each neighbour is empty (out-of-bounds counts as empty)
            up = two_d_arr[i-1][j] == 0 if i > 0 else True
            down = two_d_arr[i+1][j] == 0 if i < len(two_d_arr)-1 else True
            left = two_d_arr[i][j-1] == 0 if j > 0 else True
            right = two_d_arr[i][j+1] == 0 if j < len(row)-1 else True
            # If at least 1 neighbour is empty and the current value > 0, it is on the contour
            if any([up, down, left, right]) and pixel > 0:
                # Add the pixel value at i, j
                contour_pxs += pixel
    return contour_pxs

print(get_contour(image_pxs))
The output is of course 8:
8
[Finished in 97ms]
Suppose I have a numpy vector with n elements that I'd like to encode in binary notation, so the resulting shape will be (n, m), where m is log2(maxnumber). For example:
x = numpy.array([32,5,67])
Because the maximum number I have is 67, I need numpy.ceil(numpy.log2(67)) == 7 bits to encode this vector, so the shape of the result will be (3, 7):
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 1]])
The problem arises because I have no quick way to move the binary representation produced by numpy.binary_repr into a numpy array. Right now I have to iterate over the result and put each bit in separately:
brepr = numpy.binary_repr(x[i], width=7)
j = 0
for bin in brepr:
    X[i][j] = bin
    j += 1
This is a very costly and clumsy way to do it; how can I make it efficient?
Here is one way using np.unpackbits and broadcasting:
>>> max_size = np.ceil(np.log2(x.max())).astype(int)
>>> np.unpackbits(x[:,None].astype(np.uint8), axis=1)[:,-max_size:]
array([[0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 0, 1, 1]], dtype=uint8)
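Note that np.unpackbits operates on uint8, so the cast above only works for values below 256. For larger (non-negative) integers, a bit-shift variant of the same idea should work; a quick sketch:
>>> m = int(np.ceil(np.log2(x.max() + 1)))        # number of bits needed
>>> (x[:, None] >> np.arange(m - 1, -1, -1)) & 1  # most significant bit first
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 1]])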
You can use a numpy byte-string array.
For the case you have in hand:
res = numpy.empty(len(x), dtype='S7')
for i in range(len(x)):
    res[i] = numpy.binary_repr(x[i], width=7)
Or, more compactly:
res = numpy.array([numpy.binary_repr(val, width=7) for val in x])
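For reference, the loop version then holds fixed-width byte strings rather than the (3, 7) integer array asked for in the question:
>>> res
array([b'0100000', b'0000101', b'1000011'], dtype='|S7')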
I have two sparse* adjacency matrices A1 and A2 of type 'numpy.int64'.
The nodes of the corresponding graphs are labeled by integers and the indices of the matrices correspond to these nodes (the matrix value being the link weight between the nodes).
I'm trying to compute a similarity measure between the graphs. To do this I need to find the adjacency matrix for the subgraph of each graph, which contains the nodes common to both graphs.
Neither equal sizes of the matrices nor common nodes between them are guaranteed.
The result should be the same adjacency matrices with values for nodes not in both graphs equal to zero.
Example:
A1:
array([[ 0, 1, 2, 1],
[ 1, 0, 0, 0],
[ 2, 0, 0, 0],
[ 1, 0, 0, 0]])
A2:
array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
Outcome:
A1':
array([[ 0, 0, 2, 0],
[ 0, 0, 0, 0],
[ 2, 0, 0, 0],
[ 0, 0, 0, 0]])
A2':
array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
The sizes of the matrices I'm using are on the order of 10^5 x 10^5. The resulting size doesn't matter; I'll slice down the size of the smallest afterwards.
I'll be repeating this operation many times and so speed is important.
Attempts so far:
I can get the list of common nodes by:
np.intersect1d(A1.nonzero()[0], A2.nonzero()[0])
But I can't find a way of using this as a filter to map the values for indices not in this list to 0.
*I don't think I necessarily need to use sparse matrices, though it is very preferable for scalability later.
If I understand your question correctly, based on the example you have provided, you can simply use the numpy.in1d method to give you a boolean index array, for example:
A1 = np.array([[ 0, 1, 2, 1],
[ 1, 0, 0, 0],
[ 2, 0, 0, 0],
[ 1, 0, 0, 0]])
A2 = np.array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
idx = np.in1d(A1,A2).reshape(A1.shape)
A1[idx] = 0
print(A1)
# prints
[[0 0 2 0]
[0 0 0 0]
[2 0 0 0]
[0 0 0 0]]
For sparse matrices, the right solution depends on which sparse format you are using. If you are using the csr or csc formats, you can apply the same technique to the stored coefficients (A1.data) and then use the resulting boolean array (idx) together with the corresponding index structures (A1.indices and A1.indptr) to modify the matrix.
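A minimal sketch of that idea for the csr case, mirroring the dense example above (the variable names are mine):
import numpy as np
from scipy import sparse

A1_csr = sparse.csr_matrix(A1)
A2_csr = sparse.csr_matrix(A2)

idx = np.in1d(A1_csr.data, A2_csr.data)   # boolean mask over the stored values
A1_csr.data[idx] = 0                      # zero the matching coefficients
A1_csr.eliminate_zeros()                  # drop the explicit zeros from storage

print(A1_csr.toarray())
# [[0 0 2 0]
#  [0 0 0 0]
#  [2 0 0 0]
#  [0 0 0 0]]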
My situation is as follows:
I have an array of results, say
S = np.array([2,3,10,-1,12,1,2,4,4]), which I would like to insert into the last row of a scipy.sparse.lil_matrix M according to an array of column indices with possibly repeated elements (with no specific pattern), e.g.:
j = np.array([3,4,5,14,15,16,3,4,5]).
When column indices are repeated, the sum of their corresponding values in S should be inserted in the matrix M. Thus, in the example above, results [4,7,14] should be placed in columns [3,4,5] of the last row of M. In other words, I would like to achieve something like:
M[-1,j] = np.array([2+2,3+4,10+4,-1,12,1]).
Calculation speed is very important for my program, such that I should avoid using loops. Looking forward to your clever solutions! Thanks!
That kind of summation is the normal behavior for sparse matrices, especially in the csr format.
Define the 3 input arrays:
In [408]: S = np.array([2,3,10,-1,12,1,2,4,4])
In [409]: j=np.array([3,4,5,14,15,16,3,4,5])
In [410]: i=np.ones(S.shape,int)
The coo format takes those 3 arrays as-is, without change:
In [411]: c0=sparse.coo_matrix((S,(i,j)))
In [412]: c0.data
Out[412]: array([ 2, 3, 10, -1, 12, 1, 2, 4, 4])
But when converted to csr format, it sums repeated indices:
In [413]: c1=c0.tocsr()
In [414]: c1.data
Out[414]: array([ 4, 7, 14, -1, 12, 1], dtype=int32)
In [415]: c1.A
Out[415]:
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 4, 7, 14, 0, 0, 0, 0, 0, 0, 0, 0, -1, 12, 1]], dtype=int32)
That summation is also done when converting the coo matrix to a dense array (c0.A),
and when converting to lil:
In [419]: cl=c0.tolil()
In [420]: cl.data
Out[420]: array([[], [4, 7, 14, -1, 12, 1]], dtype=object)
In [421]: cl.rows
Out[421]: array([[], [3, 4, 5, 14, 15, 16]], dtype=object)
lil_matrix does not accept the (data,(i,j)) input directly, so you have to go through coo if that is your target.
http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.sparse.coo_matrix.html
By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
To do this as an insertion in an existing lil use an intermediate csr:
In [443]: L=sparse.lil_matrix((3,17),dtype=S.dtype)
In [444]: L[-1,:]=sparse.csr_matrix((S,(np.zeros(S.shape),j)))
In [445]: L.A
Out[445]:
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 4, 7, 14, 0, 0, 0, 0, 0, 0, 0, 0, -1, 12, 1]])
This statement is faster than the one using csr_matrix:
L[-1,:]=sparse.coo_matrix((S,(np.zeros(S.shape),j)))
Examine L.__setitem__ if you are really worried about speed. Offhand it looks like it normally converts a sparse matrix to an array;
L[-1,:]=sparse.coo_matrix((S,(np.zeros(S.shape),j))).A
takes the same time. With a small test case like this, the overhead of creating an intermediate matrix can swamp any time spent adding these duplicate indices.
In general, inserting or appending values to an existing sparse matrix is slow, regardless of whether you do this summation or not. Where possible it is best to create the data, i and j arrays for the whole matrix first, and then make the sparse matrix.
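As a rough sketch of that pattern (the loop and the shape here are made up for illustration): collect data, i and j for everything first, then build the matrix once and let the coo/csr conversion do the duplicate summation:
data, rows, cols = [], [], []
for r in range(3):                 # e.g. one pass per result row
    data.extend(S)                 # values computed for this row
    rows.extend([r] * len(S))
    cols.extend(j)
M = sparse.coo_matrix((data, (rows, cols)), shape=(3, 17)).tocsr()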
You could use a defaultdict that maps the M column indices to their summed values, and use the map function to update this defaultdict, like so:
from collections import defaultdict
d = defaultdict(int) #Use your array type here
def f(j, s):
    d[j] += s
map(f, j, S)
M[-1, d.keys()] = d.values() #keys and values are always in the same order
Instead of map, you can use filter if you don't want to create a list of None uselessly:
d = defaultdict(int) #Use your array type here
def g(e):
    d[e[1]] += S[e[0]]
filter(g, enumerate(j))
M[-1, d.keys()] = d.values() #keys and values are always in the same order
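A vectorized alternative to the same bookkeeping (my own suggestion, not part of the approach above) is to let np.bincount do the summation of the repeated column indices:
import numpy as np

cols = np.unique(j)                       # the distinct column indices
sums = np.bincount(j, weights=S)[cols]    # summed values per distinct column
M[-1, cols] = sums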
I have this list:
row = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
I need to then shuffle or randomize the list:
shuffle(row)
And then I need to go through and find any adjacent 1's and move them so that they are separated by at least one 0. For example, I need the result to look like this:
row = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
I am not sure what the most efficient way is to search for adjacent 1's and then move them so that they aren't adjacent... I will also be doing this repeatedly to come up with multiple combinations of this row.
Originally, when the list was shorter, I did it this way:
row = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rowlist = set(list(permutations(row)))
rowschemes = [(0, 0) + x for x in rowlist if '1, 1' not in str(x)]
But now that my row is 20 elements long this takes forever to come up with all the possible permutations.
Is there an efficient way to go about this?
I had a moderately clever partition-based approach in mind, but since you said there are always 20 numbers and 6 1's, and 6 is a pretty small number, you can construct all the possible sets of locations (20 choose 6 = 38760) and toss the ones which are invalid. Then you can draw uniformly from those and build the resulting row:
import random
from itertools import combinations
def is_valid(locs):
    return all(y - x >= 2 for x, y in zip(locs, locs[1:]))

def fill_from(size, locs):
    locs = set(locs)
    return [int(i in locs) for i in range(size)]
and then
>>> size = 20
>>> num_on = 6
>>> on_locs = list(filter(is_valid, combinations(range(size), num_on)))
>>> len(on_locs)
5005
>>> fill_from(size, random.choice(on_locs))
[0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
>>> fill_from(size, random.choice(on_locs))
[0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
>>> fill_from(size, random.choice(on_locs))
[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
Why not go directly for what you want? Something like:
row = ["0","0","0","0","0","0","0","0","0","01","01","01","01","01","01"]
random.shuffle(row)
print(list(map(int, "".join(row)[1:])))
Each 1 carries a 0 in front of it, so no two 1's can end up adjacent; dropping the first character brings the row back to 20 elements.
Since the number of 1's in the row is fixed and you don't want any 1's to be adjacent, let m be the number of 1's and k the number of 0's in the row. Then you want to place the m 1's randomly into the (k+1) gaps around the 0's so that there is at most one 1 in each gap. This amounts to choosing a random subset of size m from the set {1, 2, ..., k+1} (there are (k+1 choose m) such subsets), which is easy to do. Given the random choice of subset, you can construct your random arrangement of 0's and 1's so that no two 1's are adjacent. The random choice algorithm takes O(m) time.
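A minimal sketch of that construction (the function name and the 6/14 split are taken from the question):
import random

def random_separated_row(m=6, k=14):
    chosen = set(random.sample(range(k + 1), m))   # pick m of the k+1 gaps around the 0's
    row = []
    for g in range(k + 1):
        if g in chosen:
            row.append(1)                          # at most one 1 per gap
        if g < k:
            row.append(0)                          # a 0 separates consecutive gaps
    return row

print(random_separated_row())   # e.g. [0, 1, 0, 0, 1, 0, ...] with no adjacent 1's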
Place the 6 1's and 5 of the 0's in a list, giving:
row = [1,0,1,0,1,0,1,0,1,0,1]
Then insert the remaining nine 0's one by one at random positions in the (growing) list:
for i in range(11, 20):
    row.insert(random.randint(0, i), 0)