Repeat a scipy csr sparse matrix along axis 0 - python

I wanted to repeat the rows of a scipy csr sparse matrix, but when I called numpy's repeat on it, numpy simply treated the sparse matrix as an opaque object and only repeated that object inside an ndarray. I looked through the documentation, but I couldn't find any utility that repeats the rows of a scipy csr sparse matrix.
I wrote the following code that operates on the internal data, which seems to work
import numpy as np
from scipy import sparse
def csr_repeat(csr, repeats):
    # accept a single repeat count or one count per row
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)  # nonzeros per original row
    ndata = rnnz.dot(repeats)  # nonzeros in the result
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    # indmap starts as per-element steps and is cumsum'd into gather indices
    indmap = np.ones(ndata, dtype=int)  # np.int was removed in NumPy 1.24
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))
and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?
Edit
As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix, and then apply the repetition afterwards. My assumption is that post repetition will always be the better way to approach this problem than any of the potential answers.

import numpy as np
from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row
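As a rough, self-contained illustration of the product above, assuming the input is a single 1 x n sparse row: the values of `sparse_row` and `repeat_number` below are made up for the example, and `@` is used for the matrix product.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 1 x 3 sparse row and repeat count, chosen only for illustration
sparse_row = csr_matrix(np.array([[0, 1, 2]]))
repeat_number = 4

# A (repeat_number x 1) column of ones times a (1 x n) row stacks the row
ones_column = csr_matrix(np.ones((repeat_number, 1)))
repeated_row_matrix = ones_column @ sparse_row

print(repeated_row_matrix.toarray())
```

Because the column of ones is floating point, the result here is a float matrix even though the row holds integers.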

It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).
In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.
I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.
I worked out a similar function for the lil format (working from lil.copy):
def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape = list(S.shape)
    if isinstance(repeat, int):
        shape[0] = shape[0] * repeat
    else:
        shape[0] = sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat)  # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new
But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:
S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print(S.toarray().repeat([1,2,3], axis=0))
print(S.toarray()[(0,1,1,2,2,2),:])
print(lil_repeat(S,[1,2,3]).toarray())
print(S[(0,1,1,2,2,2),:].toarray())
all give the same result.
And best of all?
print(S[np.arange(3).repeat([1,2,3]),:].toarray())
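The same index trick can be applied to a csr matrix directly. The following is a minimal sketch, assuming a SciPy version whose csr format supports indexing with an integer array; the names `row_index` and `repeated` are just for this example.

```python
import numpy as np
from scipy import sparse

S = sparse.csr_matrix([[0, 1, 2], [0, 0, 0], [1, 0, 0]])
repeats = [1, 2, 3]

# Row i appears repeats[i] times in the index array: [0, 1, 1, 2, 2, 2]
row_index = np.repeat(np.arange(S.shape[0]), repeats)
repeated = S[row_index, :]

print(repeated.toarray())
```

The result stays sparse, and it can be checked against `np.repeat` applied to the dense array.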

After someone posted a really clever response for how best to do this, I revisited my original question to see if there was an even better way. I came up with one more way that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.
repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))
This technique repeats the vector repeat_num times while only modifying the indptr. The downside is that, due to the way csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal, as any operation will be quick given that each of those rows is 0, and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.
I think the marked answer is probably still the "best" way to do this.

One of the most efficient ways to repeat the sparse matrix would be the way OP suggested. I modified indptr so that it doesn't output rows of 0s.
## original sparse matrix
import numpy as np
import scipy.sparse
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
x.toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])
To repeat this, you need to repeat data and indices, and you need to fix-up the indptr. This is not the most elegant way, but it works.
## repeated sparse matrix
repeat = 5
new_indptr = indptr
for r in range(1,repeat):
    new_indptr = np.concatenate((new_indptr, new_indptr[-1]+indptr[1:]))
x = scipy.sparse.csr_matrix((np.tile(data,repeat), np.tile(indices,repeat), new_indptr))
x.toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])
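The indptr fix-up can also be done without a Python loop. This is only a sketch of one possible variant, not the approach of the answer above: it tiles the per-row nonzero counts and takes a cumulative sum, and it compares the result with `scipy.sparse.vstack`, which stacks copies of the matrix directly. The names `row_nnz`, `tiled` and `stacked` are chosen for this example.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

repeat = 5

# Nonzeros per row, tiled once per copy, then accumulated into a new indptr
row_nnz = np.diff(indptr)
new_indptr = np.concatenate(([0], np.tile(row_nnz, repeat).cumsum()))
tiled = sparse.csr_matrix(
    (np.tile(data, repeat), np.tile(indices, repeat), new_indptr),
    shape=(x.shape[0] * repeat, x.shape[1]),
)

# The same result using the library's own stacking helper
stacked = sparse.vstack([x] * repeat, format="csr")
```

Passing `shape` explicitly avoids relying on the number of columns being inferred from the largest column index.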

Related

diagonalize multiple vectors using numpy

Say I have a matrix of shape (2,3). I need to turn each 3-element row into a diagonal matrix of shape (3,3), for both rows at once. That is, I need to return an array of shape (2,3,3). How can I do that elegantly with NumPy?
Given data = np.array([[1,2,3],[4,5,6]])
I want the result [[[1,0,0],
                    [0,2,0],
                    [0,0,3]],
                   [[4,0,0],
                    [0,5,0],
                    [0,0,6]]]
Thanks
tl;dr, my one-liner: mydiag=np.vectorize(np.diag, signature='(n)->(n,n)')
I suppose here that by "diagonalize" you mean "applying np.diag".
Which, as a teacher of linear algebra, tickles me a bit, since "diagonalizing" has a specific meaning, which is not that: it means computing eigenvectors and eigenvalues and, from there, writing M=P⁻¹ΛP, which you cannot do from the inputs you have.
So, I suppose that if input matrix is
[[1, 2, 3],
[9, 8, 7]]
The output matrix you want is
[[[1, 0, 0],
[0, 2, 0],
[0, 0, 3]],
[[9, 0, 0],
[0, 8, 0],
[0, 0, 7]]]
If not, you can ignore this answer [Edit: in the meantime, you explained exactly that. So you may continue to read].
There are many ways to do that.
My one-liner would be
mydiag=np.vectorize(np.diag, signature='(n)->(n,n)')
This builds a new function that does what you want (it interprets the input as a list of 1D arrays, calls np.diag on each of them to get a 2D array, and puts each 2D array in a numpy array, thus producing a 3D array).
Then, you just call mydiag(M)
One convenience of vectorize is that it handles the broadcasting over the leading dimensions for you. Do not expect a speed-up from it, though: the NumPy documentation says vectorize is provided primarily for convenience, not performance, and that its implementation is essentially a for loop. In my timings it is in fact slower than Michael's method (in the comments) on small matrices, and has exactly the same speed on large ones (which is frustrating, since the einsum documentation itself says that it sacrifices broadcasting).
Plus, it is a one-liner, which has no other interest than bragging on forums. But well, here we are.
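A quick sanity check of the one-liner on the question's data might look like this. It is a sketch: the comparison against an explicit loop with `np.stack`, and the names `stacked` and `looped`, are additions for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

# The signature tells vectorize that np.diag maps a length-n vector to an n x n matrix
mydiag = np.vectorize(np.diag, signature='(n)->(n,n)')
stacked = mydiag(data)

# Reference result: apply np.diag to each row in a plain Python loop
looped = np.stack([np.diag(row) for row in data])

print(stacked.shape)
```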
Here is one way with indexing:
out = np.zeros(data.shape+(data.shape[-1],), dtype=data.dtype)
x,y = np.indices(data.shape).reshape(2, -1)
out[x,y,y] = data.ravel()
output:
array([[[1, 0, 0],
        [0, 2, 0],
        [0, 0, 3]],
       [[4, 0, 0],
        [0, 5, 0],
        [0, 0, 6]]])
We use array indexing to precisely grab those elements that are on the diagonal. Note that array indexing allows broadcasting between the indices, so we have index1 contain the index of the array, and index2 contain the index of the diagonal element.
index1 = np.arange(2)[:, None] # 2 is the number of arrays
index2 = np.arange(3)[None, :] # 3 is the square size of each matrix
result = np.zeros((2, 3, 3))
result[index1, index2, index2] = data
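Another option, offered here as a sketch rather than something the answers above propose, is to let broadcasting do the work by multiplying the data with an identity matrix. It assumes the last axis of `data` holds the vectors to place on the diagonals.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
n = data.shape[-1]

# (2, 3, 1) * (3, 3) broadcasts to (2, 3, 3):
# result[i, j, k] = data[i, j] * eye[j, k], which is nonzero only when j == k
result = data[:, :, None] * np.eye(n, dtype=data.dtype)

print(result)
```

This does n times more multiplications than strictly needed, so for large n the indexing approaches above may be the better choice.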

Change the data type of one element in a matrix

I'm looking to implement a hardware-efficient multiplication of a list of large matrices (on the order of 200,000 x 200,000). The matrices are very nearly the identity matrix, but with some elements changed to irrational numbers.
In an effort to reduce the memory footprint and make the computation go faster, I want to store the 0s and 1s of the identity as single bytes like so.
import numpy as np
size = 200000
large_matrix = np.identity(size, dtype=np.uint8)
and just change a few elements to a different data type.
import sympy as sp
# sympy object
irr1 = sp.sqrt(2)
# float
irr2 = np.e
large_matrix[123456, 100456] = irr1
large_matrix[100456, 123456] = irr2
Is it possible to hold only these elements of the matrix with a different data type, while all the other elements are still bytes? I don't want to have to change everything to a float just because I need one element to be a float.
-----Edit-----
If it's not possible in numpy, then how can I find a solution without numpy?
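Maybe you can have a look at SciPy's coordinate-format sparse matrix (coo_matrix). SciPy then stores only the nonzero entries, which suits such large, mostly empty matrices, and the coordinate format lets you give the values as (row, column, value) triples. Note that coo_matrix itself does not support element assignment, so convert to another format such as lil if you need to modify entries afterwards.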
From its documentation:
>>> from scipy.sparse import coo_matrix
>>> # Constructing a matrix using ijv format
>>> row = np.array([0, 3, 1, 0])
>>> col = np.array([0, 3, 1, 2])
>>> data = np.array([4, 5, 7, 9])
>>> m = coo_matrix((data, (row, col)), shape=(4, 4))
>>> m.toarray()
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])
It does not create a matrix but a set of coordinates with values, which takes much less space than just filling a matrix with zeros.
>>> from sys import getsizeof
>>> getsizeof(m)
56
>>> getsizeof(m.toarray())
176
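Note that `sys.getsizeof` on a sparse matrix measures, as far as I know, only the Python object itself and not the arrays it refers to, so the 56 above understates the real footprint. A fairer comparison sums the `nbytes` of the underlying arrays. The sketch below does that for the matrix from the question, under the assumption that all values are stored as float64 (so sqrt(2) and e are floating-point approximations, not symbolic objects).

```python
import numpy as np
from scipy import sparse

size = 200_000

# Identity entries plus the two off-diagonal values, as (row, col, value) triples
diag = np.arange(size)
rows = np.concatenate((diag, [123456, 100456]))
cols = np.concatenate((diag, [100456, 123456]))
vals = np.concatenate((np.ones(size), [np.sqrt(2.0), np.e]))
m = sparse.coo_matrix((vals, (rows, cols)), shape=(size, size)).tocsr()

# Memory actually held by the csr arrays, versus a dense uint8 matrix
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
dense_uint8_bytes = size * size  # one byte per entry; never allocated here

print(sparse_bytes, dense_uint8_bytes)
```

Even with every stored value as an 8-byte float, the sparse matrix needs only a few megabytes, while the dense uint8 identity would need about 40 GB.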
By definition, NumPy arrays only have one dtype. You can see in the NumPy documentation:
A numpy array is homogeneous, and contains elements described by a dtype object. A dtype object can be constructed from different combinations of fundamental numeric types.
Further reading: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

why do I get warning on scipy sparse column slicing?

Scipy sparse documentation of csr_matrix says that this kind of matrix is efficient for row slicing. Using this code:
import numpy as np
from scipy import sparse
dok = sparse.dok_matrix((5,1))
dok[1,0] = 1
data = np.array([0,1,2,3,4])
row = np.array([0,1,2,3,4])
col = np.array([0,1,2,3,4])
csr = sparse.csr_matrix((data, (row, col)))
csr[:, 0] += dok
I get this warning:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Why am I getting this warning?
This is unrelated to row vs. column. Essentially, you are forcing scipy to insert elements in the middle of two arrays, which as the warning says is expensive.
Let's look at the internal representation of csr before and after the in-place modification to confirm this:
>>> csr.data
array([0, 1, 2, 3, 4], dtype=int64)
>>> csr.indices
array([0, 1, 2, 3, 4], dtype=int32)
>>>
>>> csr[:, 0] += dok
/home/paul/lib/python3.6/site-packages/scipy/sparse/compressed.py:742: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
>>> csr.data
array([0, 1, 1, 2, 3, 4], dtype=int64)
>>> csr.indices
array([0, 0, 1, 2, 3, 4], dtype=int32)
A bit of background: The compressed sparse row and column formats essentially only store nonzeros. They do this in a packed way using vectors to store the nonzero values and their coordinates in a specific order. If an operation adds new nonzeros they typically can't be appended but must be inserted, which is what we see in the example and what makes it expensive.
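Following the hint in the warning, one way to avoid it is to do the structural change in lil format and convert back afterwards. This is a minimal sketch under that assumption. The explicit loop over the nonzeros of `dok` is just one simple way to apply the update, and the `warnings` filter is there only to show that no SparseEfficiencyWarning is raised.

```python
import warnings
import numpy as np
from scipy import sparse

dok = sparse.dok_matrix((5, 1))
dok[1, 0] = 1

data = np.array([0, 1, 2, 3, 4])
row = np.array([0, 1, 2, 3, 4])
col = np.array([0, 1, 2, 3, 4])
csr = sparse.csr_matrix((data, (row, col)))

with warnings.catch_warnings():
    # Turn the efficiency warning into an error so it cannot pass unnoticed
    warnings.simplefilter("error", sparse.SparseEfficiencyWarning)
    lil = csr.tolil()  # lil stores one list per row, so inserts are cheap
    update = dok.tocoo()
    for i, j, v in zip(update.row, update.col, update.data):
        lil[int(i), int(j)] += v
    result = lil.tocsr()  # back to csr for fast arithmetic

print(result.toarray())
```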

Numpy 2D Array - Power Of - Not returning an answer?

I'm attempting to get the 'power' of a Python list/matrix using numpy. My only current working solution is an iterative function using np.dot():
def matr_power(matrix, power):
    matrix_a = list(matrix)
    matrix_b = list(matrix)
    for i in range(0, power-1):
        matrix_a = np.dot(matrix_a, matrix_b)
    return matrix_a
This works for my needs, but I'm aware it's probably not the most efficient method.
I've tried converting my list to a numpy array, performing power operations on it, and then back to a list so it's usable in the form I need. The conversions seem to happen, but the power calculation does not.
while (foo != bar):
    matr_x = np.asarray(matr_a)
    matr_y = matr_x ** n
    matr_out = matr_y.tolist()
    n += 1
    # Other code here to output certain results
The issue is, the matrix gets converted to an array as expected, but when performing the power operation (**) matr_y ends up being the same as matr_x as if no calculation was ever performed. I have tried using np.power(matr_y, n) and some other solutions found in related questions on Stack Overflow.
I've tried using the numpy documentation, but (either I'm misunderstanding it, or) it just confirms that this should be working as expected.
When checking the debugging console in PyCharm everything seems fine (all matrices / lists / arrays are converted as expected) except that the calculation matr_x ** i never seems to be calculated (or else never stored in matr_y).
Answer
Although it's possible to use a numpy matrix with the ** operator, the best solution is to use numpy arrays (as numpy matrices are deprecated) combined with numpy's linalg matrix_power method.
matr_x = np.array(mat_a)
matr_y = np.linalg.matrix_power(matr_x, path_length)
work_matr = matr_y.tolist()
It is also now apparent that the function of ** being element-wise may have been discovered earlier had I not been using an adjacency matrix (only zeros and ones).
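The point about adjacency matrices can be seen in a few lines. The sketch below uses a small 0/1 matrix chosen for illustration: the element-wise power leaves it unchanged, while `np.linalg.matrix_power` computes the actual matrix product.

```python
import numpy as np

a = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

# ** on an ndarray raises each element to the power, and 0**2 == 0, 1**2 == 1
elementwise = a ** 2

# matrix_power multiplies the matrix by itself
matrix_square = np.linalg.matrix_power(a, 2)

print(np.array_equal(elementwise, a))
print(matrix_square)
```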
There are (at least) two options for computing the power of a matrix using numpy without multiple calls to dot:
Use numpy.linalg.matrix_power.
Use the numpy matrix class, which defines ** to be the matrix algebraic power.
For example,
In [38]: a
Out[38]:
array([[0, 1, 0],
[1, 0, 1],
[0, 1, 0]])
In [39]: np.linalg.matrix_power(a, 2)
Out[39]:
array([[1, 0, 1],
[0, 2, 0],
[1, 0, 1]])
In [40]: np.linalg.matrix_power(a, 3)
Out[40]:
array([[0, 2, 0],
[2, 0, 2],
[0, 2, 0]])
In [41]: m = np.matrix(a)
In [42]: m ** 2
Out[42]:
matrix([[1, 0, 1],
[0, 2, 0],
[1, 0, 1]])
In [43]: m ** 3
Out[43]:
matrix([[0, 2, 0],
[2, 0, 2],
[0, 2, 0]])
Warren's answer is perfectly good.
Upon special request by the OP I briefly explain how to build an efficient integer power operator by hand.
This algorithm is commonly known as exponentiation by squaring, and it works like this:
Suppose you want to calculate X^35. If you do that naively it will cost you 34 multiplications. But you can do much better than that. Write X^35 = X^32 x X^2 x X. What you've done here is split the product according to the binary representation of 35, which is 100011. Now, calculating X^32 is actually cheap, because you only have to repeatedly (5 times) square X to get there. So in total you need just 7 multiplications, much better than 34.
In code:
def my_power(x, n):
    out = None
    p = x
    while True:
        if n % 2 == 1:
            if out is None:
                out = p
            else:
                out = out @ p # this requires a fairly up-to-date python
                              # if yours is too old use np.dot instead
        if n == 1:
            return out
        n //= 2
        p = p @ p
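For comparison, here is a compact variant of the same idea together with a check against `np.linalg.matrix_power`. It is a sketch: the function name `power_by_squaring` and the restriction to n >= 1 are choices made for this example.

```python
import numpy as np

def power_by_squaring(x, n):
    """Compute the matrix power x^n for an integer n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = None
    square = x  # holds x^(2^k) at step k
    while n > 0:
        if n & 1:  # this binary digit of n is set, so include the factor
            result = square if result is None else result @ square
        n >>= 1
        if n:  # skip the final, unneeded squaring
            square = square @ square
    return result

a = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(power_by_squaring(a, 35))
```

For n = 35 this performs five squarings and two further products, matching the count of seven multiplications given above.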

Numpy: get 1D array as 2D array without reshape

I need to hstack multiple arrays with the same number of rows (although the number of rows varies between uses) but different numbers of columns. However, some of the arrays only have one column, e.g.
array = np.array([1,2,3,4,5])
which gives
#array.shape = (5,)
but I'd like to have the shape recognized as a 2d array, eg.
#array.shape = (5,1)
So that hstack can actually combine them.
My current solution is:
array = np.atleast_2d([1,2,3,4,5]).T
#array.shape = (5,1)
So I was wondering, is there a better way to do this? Would
array = np.array([1,2,3,4,5]).reshape(len([1,2,3,4,5]), 1)
be better?
Note that my use of [1,2,3,4,5] is just a toy list to make the example concrete. In practice it will be a much larger list passed into a function as an argument. Thanks!
Check the code of hstack and vstack. One, or both of those, pass the arguments through atleast_1d or atleast_2d. That is a perfectly acceptable way of reshaping an array.
Some other ways:
arr = np.array([1,2,3,4,5]).reshape(-1,1) # saves the use of len()
arr = np.array([1,2,3,4,5])[:,None] # adds a new dim at end
np.array([1,2,3],ndmin=2).T # used by column_stack
hstack and vstack transform their inputs with, respectively:
arrs = [atleast_1d(_m) for _m in tup]   # hstack
[atleast_2d(_m) for _m in tup]          # vstack
test data:
a1=np.arange(2)
a2=np.arange(10).reshape(2,5)
a3=np.arange(8).reshape(2,4)
np.hstack([a1.reshape(-1,1),a2,a3])
np.hstack([a1[:,None],a2,a3])
np.column_stack([a1,a2,a3])
result:
array([[0, 0, 1, 2, 3, 4, 0, 1, 2, 3],
[1, 5, 6, 7, 8, 9, 4, 5, 6, 7]])
If you don't know ahead of time which arrays are 1d, then column_stack is easiest to use. The others require a little function that tests for dimensionality before applying the reshaping.
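The "little function" mentioned above could look like the following sketch. The helper name `as_column` is hypothetical, and it only handles 1d and 2d inputs.

```python
import numpy as np

def as_column(a):
    """Return a 1-D input as an (N, 1) column; pass 2-D input through unchanged."""
    a = np.asarray(a)
    return a[:, None] if a.ndim == 1 else a

a1 = np.arange(2)
a2 = np.arange(10).reshape(2, 5)
a3 = np.arange(8).reshape(2, 4)

# Normalise every input first, then hstack works for the mixed list
stacked = np.hstack([as_column(a) for a in (a1, a2, a3)])
print(stacked)
```

For this test data the result is the same as calling `np.column_stack([a1, a2, a3])`.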
Numpy: use reshape or newaxis to add dimensions
If I understand your intent correctly, you wish to convert an array of shape (N,) to an array of shape (N,1) so that you can apply np.hstack:
In [147]: np.hstack([np.atleast_2d([1,2,3,4,5]).T, np.atleast_2d([1,2,3,4,5]).T])
Out[147]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
In that case, you could avoid reshaping the arrays and use np.column_stack instead:
In [151]: np.column_stack([[1,2,3,4,5], [1,2,3,4,5]])
Out[151]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
I followed Ludo's work and just changed the size of v from 5 to 10000. I ran the code on my PC and the result shows that atleast_2d seems to be a more efficient method in the larger scale case.
import numpy as np
import timeit
v = np.arange(10000)
print('atleast2d:',timeit.timeit(lambda:np.atleast_2d(v).T))
print('reshape:',timeit.timeit(lambda:np.array(v).reshape(-1,1))) # saves the use of len()
print('v[:,None]:', timeit.timeit(lambda:np.array(v)[:,None])) # adds a new dim at end
print('np.array(v,ndmin=2).T:', timeit.timeit(lambda:np.array(v,ndmin=2).T)) # used by column_stack
The result is:
atleast2d: 1.3809496470021259
reshape: 27.099974197000847
v[:,None]: 28.58291715100131
np.array(v,ndmin=2).T: 30.141663907001202
My suggestion is to use [:, None] when dealing with a short vector and np.atleast_2d when your vector gets longer.
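One caveat when reading these timings: here v is already an ndarray, and `np.array(v)` copies it on every call, whereas `np.atleast_2d(v)` returns a view. Much of the gap is therefore likely the cost of copying 10000 elements rather than of the reshaping itself. The sketch below checks which results share memory with v; the variable names are chosen for this example.

```python
import numpy as np

v = np.arange(10000)

via_atleast_2d = np.atleast_2d(v).T      # view of v, no data copied
via_array_copy = np.array(v)[:, None]    # np.array copies v first
via_plain_index = v[:, None]             # view of v, no data copied

print(np.shares_memory(v, via_atleast_2d))
print(np.shares_memory(v, via_array_copy))
print(np.shares_memory(v, via_plain_index))
```

Timing `v[:, None]` or `v.reshape(-1, 1)` without the `np.array` call would give a like-for-like comparison when the input is already an array.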
Just to add info to hpaulj's answer: I was curious how fast the four methods described are. The winner is the method that adds a new dimension at the end of the 1d array.
Here is what I ran:
import numpy as np
import timeit
v = [1,2,3,4,5]
print('atleast2d:',timeit.timeit(lambda:np.atleast_2d(v).T))
print('reshape:',timeit.timeit(lambda:np.array(v).reshape(-1,1))) # saves the use of len()
print('v[:,None]:', timeit.timeit(lambda:np.array(v)[:,None])) # adds a new dim at end
print('np.array(v,ndmin=2).T:', timeit.timeit(lambda:np.array(v,ndmin=2).T)) # used by column_stack
And the results:
atleast2d: 4.455070924214851
reshape: 2.0535152913971615
v[:,None]: 1.8387219828073285
np.array(v,ndmin=2).T: 3.1735243063353664
