Why do I get a warning on scipy sparse column slicing? - python

Scipy sparse documentation of csr_matrix says that this kind of matrix is efficient for row slicing. Using this code:
import numpy as np
from scipy import sparse
dok = sparse.dok_matrix((5,1))
dok[1,0] = 1
data = np.array([0,1,2,3,4])
row = np.array([0,1,2,3,4])
col = np.array([0,1,2,3,4])
csr = sparse.csr_matrix((data, (row, col)))
csr[:, 0] += dok
I get this warning:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Why am I getting this warning?

This is unrelated to row vs. column. Essentially, you are forcing scipy to insert elements in the middle of two arrays, which as the warning says is expensive.
Let's look at the internal representation of csr before and after the in-place modification to confirm this:
>>> csr.data
array([0, 1, 2, 3, 4], dtype=int64)
>>> csr.indices
array([0, 1, 2, 3, 4], dtype=int32)
>>>
>>> csr[:, 0] += dok
/home/paul/lib/python3.6/site-packages/scipy/sparse/compressed.py:742: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
>>> csr.data
array([0, 1, 1, 2, 3, 4], dtype=int64)
>>> csr.indices
array([0, 0, 1, 2, 3, 4], dtype=int32)
A bit of background: The compressed sparse row and column formats essentially only store nonzeros. They do this in a packed way using vectors to store the nonzero values and their coordinates in a specific order. If an operation adds new nonzeros they typically can't be appended but must be inserted, which is what we see in the example and what makes it expensive.
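To make the warning's suggestion concrete, here is a minimal sketch (not the asker's exact code) that inserts a new nonzero once in CSR and once in LIL, and records whether SparseEfficiencyWarning fires. The helper name warns_on_insert and the diagonal test matrix are made up for illustration.

```python
import warnings

import numpy as np
from scipy import sparse


def warns_on_insert(matrix):
    """Return True if storing a new nonzero emits SparseEfficiencyWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        matrix[1, 0] = 1  # position (1, 0) is not stored yet
    return any(
        issubclass(w.category, sparse.SparseEfficiencyWarning) for w in caught
    )


idx = np.arange(5)
diag = sparse.csr_matrix((np.arange(1, 6), (idx, idx)))

csr = diag.copy()
csr_warned = warns_on_insert(csr)  # CSR has to grow its packed arrays

lil = diag.tolil()
lil_warned = warns_on_insert(lil)  # LIL keeps one Python list per row
result = lil.tocsr()               # convert back for fast arithmetic

print(csr_warned, lil_warned)
```

If many structural changes are needed, doing them in LIL (or assembling row, col and data arrays first) and converting to CSR once at the end is usually the cheaper route.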

Related

Efficiently create 2d numpy array given 1 dimension and a constant

Given an x-dataset,
x = np.array([1, 2, 3, 4, 5])
what is the most efficient way to create the NumPy array where each x coordinate is paired with a y-coordinate of value 0? I am wondering if there is a way specifically that doesn't require any hard coding, so that x could vary in length without causing failure.
As per your problem statement, the following is one way to do it.
# initialize an array of zeros
In [36]: res = np.zeros((2, *x.shape), dtype=x.dtype)
# fill `x` as first row
In [37]: res[0] = x
In [38]: res
Out[38]:
array([[1, 2, 3, 4, 5],
       [0, 0, 0, 0, 0]])
When we initialize the array of zeros, we use 2 for the axis-0 dimension since the result has two rows: one for x and one for the constant y. For the column size we simply take the length from the x array, so x can vary in length. For reasonably large arrays, this approach should be among the fastest.
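The same idea can be wrapped in a small function. This is only a sketch: the name pair_with_constant and the value parameter are hypothetical, and the transpose at the end is one option if (x, y) pairs along the rows are preferred.

```python
import numpy as np


def pair_with_constant(x, value=0):
    """Return a (2, N) array: first row is x, second row is the constant."""
    x = np.asarray(x)
    res = np.full((2, x.shape[0]), value, dtype=x.dtype)
    res[0] = x
    return res


x = np.array([1, 2, 3, 4, 5])
rows = pair_with_constant(x)  # shape (2, 5)
pairs = rows.T                # shape (5, 2), one (x, y) pair per row
print(rows)
print(pairs)
```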

Change the data type of one element in a matrix

I'm looking to implement a hardware-efficient multiplication of a list of large matrices (on the order of 200,000 x 200,000). The matrices are very nearly the identity matrix, but with some elements changed to irrational numbers.
In an effort to reduce the memory footprint and make the computation go faster, I want to store the 0s and 1s of the identity as single bytes like so.
import numpy as np
size = 200000
large_matrix = np.identity(size, dtype=np.uint8)
and just change a few elements to a different data type.
import sympy as sp
# sympy object
irr1 = sp.sqrt(2)
# float
irr2 = np.e
large_matrix[123456, 100456] = irr1
large_matrix[100456, 123456] = irr2
Is it possible to hold only these elements of the matrix with a different data type, while all the other elements are still bytes? I don't want to have to change everything to a float just because I need one element to be a float.
-----Edit-----
If it's not possible in numpy, then how can I find a solution without numpy?
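Maybe you can have a look at SciPy's coordinate-format sparse matrix (coo_matrix). SciPy stores only the nonzero entries, which suits such large, nearly empty matrices. You pass the values and their coordinates when constructing it; COO itself does not support item assignment, so to modify individual elements afterwards convert to a format such as LIL or DOK.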
From its documentation:
>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> # Constructing a matrix using ijv format
>>> row = np.array([0, 3, 1, 0])
>>> col = np.array([0, 3, 1, 2])
>>> data = np.array([4, 5, 7, 9])
>>> m = coo_matrix((data, (row, col)), shape=(4, 4))
>>> m.toarray()
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])
It does not create a matrix but a set of coordinates with values, which takes much less space than just filling a matrix with zeros.
>>> from sys import getsizeof
>>> getsizeof(m)
56
>>> getsizeof(m.toarray())
176
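One caveat about the measurement above: as far as I know, sys.getsizeof on a sparse matrix reports only the small Python wrapper object, not the arrays it references. A rough sketch of a fairer comparison sums the nbytes of the underlying arrays. The size n = 1000 is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import sparse

n = 1000
m = sparse.identity(n, dtype=np.uint8, format="coo")

# COO keeps three parallel arrays: values, row indices, column indices.
sparse_bytes = m.data.nbytes + m.row.nbytes + m.col.nbytes
dense_bytes = m.toarray().nbytes

print("sparse:", sparse_bytes, "bytes")
print("dense: ", dense_bytes, "bytes")
```

The index arrays cost more per stored element than a single byte, so the saving comes entirely from not storing the zeros.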
By definition, NumPy arrays only have one dtype. You can see in the NumPy documentation:
A numpy array is homogeneous, and contains elements described by a dtype object. A dtype object can be constructed from different combinations of fundamental numeric types.
Further reading: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
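Putting the two answers together, a minimal sketch of the near-identity matrix could look like the following. It assumes that float64 approximations of the irrational values are acceptable (exact sympy objects cannot be stored this way), and it builds the row, column and value arrays up front before creating the matrix.

```python
import numpy as np
from scipy import sparse

size = 200_000
diag = np.arange(size)

# Diagonal of ones plus the two off-diagonal entries from the question.
rows = np.concatenate((diag, [123456, 100456]))
cols = np.concatenate((diag, [100456, 123456]))
vals = np.concatenate((np.ones(size), [np.sqrt(2), np.e]))

m = sparse.coo_matrix((vals, (rows, cols)), shape=(size, size)).tocsr()

# Sparse matrix product; only stored entries take part in the computation.
product = m @ m
print(m.nnz, product.nnz)
```

Every stored entry is a float64 here, but only size + 2 entries are stored at all, so the memory footprint stays far below that of a dense uint8 matrix of the same shape.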

Numpy: smart matrix multiplication to sparse result matrix

In python with numpy, say I have two matrices:
S, a sparse x*x matrix
M, a dense x*y matrix
Now I want to do np.dot(M, M.T) which will return a dense x*x matrix S_.
However, I only care about the cells that are nonzero in S, which means that it would not make a difference for my application if I did
S_ = S*S_
Obviously, that would be a waste of operations, as I would like to leave out the irrelevant cells given in S altogether. Remember that for the product of M with M.T
S_[i,j] = np.sum(M[i,:]*M[j,:])
So I want to do this operation only for i,j such that S[i,j]=True.
Is this supported somehow by numpy implementations that run in C so that I do not need to implement it with python loops?
EDIT 1 [solved]: I still have this problem, actually M is now also sparse.
Now, given rows and cols of S, I implemented it like this:
data = np.array([ M[rows[i],:].dot(M[cols[i],:]).data[0] for i in range(len(rows)) ])
S_ = csr( (data, (rows,cols)) )
... but it is still slow. Any new ideas?
EDIT 2: jdehesa has given a great solution, but I would like to save more memory.
The solution was to do the following:
data = M[rows,:].multiply(M[cols,:]).sum(axis=1)
and then build a new sparse matrix from rows, cols and data.
However, when running the above line, scipy builds a (contiguous) numpy array with as many elements as nnz of the first submatrix plus nnz of the second submatrix, which can lead to MemoryError in my case.
In order to save more memory, I would like to multiply iteratively each row with its respective 'partner' column, then sum over and discard the result vector. Using simple python to implement this, basically I am back to the extremely slow version.
Is there a fast way of solving this problem?
Here is how you can do it with NumPy/SciPy, both for dense and sparse M matrices:
import numpy as np
import scipy.sparse as sp
# Coordinates where S is True
S = np.array([[0, 1],
[3, 6],
[3, 4],
[9, 1],
[4, 7]])
# Dense M matrix
# Random big matrix
M = np.random.random(size=(1000, 2000))
# Take relevant rows and compute values
values = np.sum(M[S[:, 0]] * M[S[:, 1]], axis=1)
# Make result matrix from values
result = np.zeros((len(M), len(M)), dtype=values.dtype)
result[S[:, 0], S[:, 1]] = values
# Sparse M matrix
# Construct sparse M as COO matrix or any other way
M = sp.coo_matrix(([10, 20, 30, 40, 50], # Data
([0, 1, 3, 4, 6], # Rows
[4, 4, 5, 5, 8])), # Columns
shape=(1000, 2000))
# Convert to CSR for fast row slicing
M_csr = M.tocsr()
# Take relevant rows and compute values
values = M_csr[S[:, 0]].multiply(M_csr[S[:, 1]]).sum(axis=1)
values = np.squeeze(np.asarray(values))
# Construct COO sparse matrix from values
result = sp.coo_matrix((values, (S[:, 0], S[:, 1])), shape=(M.shape[0], M.shape[0]))
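Regarding the memory concern in EDIT 2, one possible compromise is to process the (row, col) pairs in chunks, so that only a bounded number of row products exists at any time. This is a sketch, not a tuned implementation: the function name masked_gram and the chunk parameter are made up for illustration, and the pairs are assumed to be unique.

```python
import numpy as np
import scipy.sparse as sp


def masked_gram(M_csr, rows, cols, chunk=1000):
    """Compute (M @ M.T)[rows, cols] chunk by chunk and return a COO matrix."""
    values = np.empty(len(rows), dtype=np.float64)
    for start in range(0, len(rows), chunk):
        stop = start + chunk
        # Elementwise product of the paired rows, then a row-wise sum.
        prod = M_csr[rows[start:stop]].multiply(M_csr[cols[start:stop]])
        values[start:stop] = np.asarray(prod.sum(axis=1)).ravel()
    n = M_csr.shape[0]
    return sp.coo_matrix((values, (rows, cols)), shape=(n, n))


M = sp.random(50, 80, density=0.2, format="csr", random_state=0)
rows = np.array([0, 3, 3, 9, 4])
cols = np.array([1, 6, 4, 1, 7])
result = masked_gram(M, rows, cols, chunk=2)
```

A smaller chunk lowers peak memory at the cost of more Python-level iterations, so the value is worth tuning against the available RAM.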

Numpy: get 1D array as 2D array without reshape

I need to hstack multiple arrays with the same number of rows (although the number of rows is variable between uses) but different numbers of columns. However, some of the arrays only have one column, e.g.
array = np.array([1,2,3,4,5])
which gives
#array.shape = (5,)
but I'd like to have the shape recognized as a 2d array, eg.
#array.shape = (5,1)
So that hstack can actually combine them.
My current solution is:
array = np.atleast_2d([1,2,3,4,5]).T
#array.shape = (5,1)
So I was wondering, is there a better way to do this? Would
array = np.array([1,2,3,4,5]).reshape(len([1,2,3,4,5]), 1)
be better?
Note that my use of [1,2,3,4,5] is just a toy list to make the example concrete. In practice it will be a much larger list passed into a function as an argument. Thanks!
Check the code of hstack and vstack. One, or both of those, pass the arguments through atleast_nd. That is a perfectly acceptable way of reshaping an array.
Some other ways:
arr = np.array([1,2,3,4,5]).reshape(-1,1) # saves the use of len()
arr = np.array([1,2,3,4,5])[:,None] # adds a new dim at end
np.array([1,2,3],ndmin=2).T # used by column_stack
hstack and vstack transform their inputs with:
arrs = [atleast_1d(_m) for _m in tup]
[atleast_2d(_m) for _m in tup]
test data:
a1=np.arange(2)
a2=np.arange(10).reshape(2,5)
a3=np.arange(8).reshape(2,4)
np.hstack([a1.reshape(-1,1),a2,a3])
np.hstack([a1[:,None],a2,a3])
np.column_stack([a1,a2,a3])
result:
array([[0, 0, 1, 2, 3, 4, 0, 1, 2, 3],
[1, 5, 6, 7, 8, 9, 4, 5, 6, 7]])
If you don't know ahead of time which arrays are 1d, then column_stack is easiest to use. The others require a little function that tests for dimensionality before applying the reshaping.
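The "little function" mentioned above might look like this sketch. The names as_column and hstack_columns are hypothetical; the idea is just to promote 1d inputs to a single column and leave everything else alone.

```python
import numpy as np


def as_column(a):
    """Return 1d input as an (N, 1) column; pass other arrays through."""
    a = np.asarray(a)
    return a[:, None] if a.ndim == 1 else a


def hstack_columns(arrays):
    """hstack a mix of 1d and 2d arrays that share the same number of rows."""
    return np.hstack([as_column(a) for a in arrays])


a1 = np.arange(2)
a2 = np.arange(10).reshape(2, 5)
a3 = np.arange(8).reshape(2, 4)
stacked = hstack_columns([a1, a2, a3])
print(stacked)
```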
Numpy: use reshape or newaxis to add dimensions
If I understand your intent correctly, you wish to convert an array of shape (N,) to an array of shape (N,1) so that you can apply np.hstack:
In [147]: np.hstack([np.atleast_2d([1,2,3,4,5]).T, np.atleast_2d([1,2,3,4,5]).T])
Out[147]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
In that case, you could avoid reshaping the arrays and use np.column_stack instead:
In [151]: np.column_stack([[1,2,3,4,5], [1,2,3,4,5]])
Out[151]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
I followed Ludo's work and just changed the size of v from 5 to 10000. I ran the code on my PC and the result shows that atleast_2d seems to be a more efficient method in the larger scale case.
import numpy as np
import timeit
v = np.arange(10000)
print('atleast2d:',timeit.timeit(lambda:np.atleast_2d(v).T))
print('reshape:',timeit.timeit(lambda:np.array(v).reshape(-1,1))) # saves the use of len()
print('v[:,None]:', timeit.timeit(lambda:np.array(v)[:,None])) # adds a new dim at end
print('np.array(v,ndmin=2).T:', timeit.timeit(lambda:np.array(v,ndmin=2).T)) # used by column_stack
The result is:
atleast2d: 1.3809496470021259
reshape: 27.099974197000847
v[:,None]: 28.58291715100131
np.array(v,ndmin=2).T: 30.141663907001202
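My suggestion is to use [:, None] when dealing with a short vector and np.atleast_2d when the vector gets longer. Keep in mind that v is already an ndarray here, so np.array(v) copies it on every call while np.atleast_2d(v) returns a view; that copy probably accounts for much of the gap.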
Just to add info on hpaulj's answer. I was curious about how fast the four methods described were. The winner is the method adding a new axis at the end of the 1d array.
Here is what I ran:
import numpy as np
import timeit
v = [1,2,3,4,5]
print('atleast2d:',timeit.timeit(lambda:np.atleast_2d(v).T))
print('reshape:',timeit.timeit(lambda:np.array(v).reshape(-1,1))) # saves the use of len()
print('v[:,None]:', timeit.timeit(lambda:np.array(v)[:,None])) # adds a new dim at end
print('np.array(v,ndmin=2).T:', timeit.timeit(lambda:np.array(v,ndmin=2).T)) # used by column_stack
And the results:
atleast2d: 4.455070924214851
reshape: 2.0535152913971615
v[:,None]: 1.8387219828073285
np.array(v,ndmin=2).T: 3.1735243063353664
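Both sets of timings mix array construction with the reshaping itself. A sketch that tries to isolate the reshaping step starts from an existing ndarray and avoids the extra np.array call where possible. The number=10000 setting is an arbitrary choice to keep the run short, and no timings are quoted here because they depend on the machine.

```python
import timeit

import numpy as np

v = np.arange(10000)  # already an ndarray, so no list conversion is involved

candidates = {
    "atleast_2d(v).T": lambda: np.atleast_2d(v).T,
    "v.reshape(-1, 1)": lambda: v.reshape(-1, 1),
    "v[:, None]": lambda: v[:, None],
    # np.array copies by default, so this one still pays for a copy.
    "np.array(v, ndmin=2).T": lambda: np.array(v, ndmin=2).T,
}

timings = {
    name: timeit.timeit(fn, number=10000) for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}")
```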

Repeat a scipy csr sparse matrix along axis 0

I wanted to repeat the rows of a scipy csr sparse matrix, but when I tried to call numpy's repeat method, it simply treated the sparse matrix like an object, and would only repeat it as an object in an ndarray. I looked through the documentation, but I couldn't find any utility to repeat the rows of a scipy csr sparse matrix.
I wrote the following code that operates on the internal data, which seems to work
import numpy as np
from scipy import sparse

def csr_repeat(csr, repeats):
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    # number of stored entries per row
    rnnz = np.diff(csr.indptr)
    ndata = rnnz.dot(repeats)
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    # np.int was removed from NumPy; the builtin int is equivalent here
    indmap = np.ones(ndata, dtype=int)
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))
and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?
Edit
As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix, and then apply the repetition afterwards. My assumption is that post repetition will always be the better way to approach this problem than any of the potential answers.
from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row
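As a small illustration of "apply the repetition afterwards", the sketch below repeats a single sparse row via the ones-column product from the snippet above, and compares a matrix-vector product on the repeated matrix with repeating the result of the original product. The concrete row, the vector v and repeat_number = 4 are made-up values.

```python
import numpy as np
from scipy.sparse import csr_matrix

sparse_row = csr_matrix(np.array([[0.0, 3.0, 0.0, 5.0]]))
repeat_number = 4

# Repeat first: a (repeat_number, 1) column of ones times the (1, n) row.
repeated_row_matrix = csr_matrix(np.ones((repeat_number, 1))) @ sparse_row

v = np.arange(4.0)
repeat_then_multiply = repeated_row_matrix @ v
multiply_then_repeat = np.repeat(sparse_row @ v, repeat_number)

print(repeat_then_multiply)
print(multiply_then_repeat)
```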
It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).
In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.
I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.
I worked out a similar function for the lil format (working from lil.copy):
def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape=list(S.shape)
    if isinstance(repeat, int):
        shape[0]=shape[0]*repeat
    else:
        shape[0]=sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat) # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new
But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:
S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print(S.toarray().repeat([1,2,3], axis=0))
print(S.toarray()[[0,1,1,2,2,2],:])
print(lil_repeat(S,[1,2,3]).toarray())
print(S[[0,1,1,2,2,2],:].toarray())
give the same result
and best of all?
print(S[np.arange(3).repeat([1,2,3]),:].toarray())
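The indexing approach generalizes to a short helper. This is a sketch under the assumption that converting to CSR first is acceptable; the name repeat_rows is hypothetical.

```python
import numpy as np
from scipy import sparse


def repeat_rows(mat, repeats):
    """Repeat the rows of a sparse matrix using fancy row indexing."""
    row_index = np.repeat(np.arange(mat.shape[0]), repeats)
    return mat.tocsr()[row_index]


S = sparse.csr_matrix([[0, 1, 2], [0, 0, 0], [1, 0, 0]])
repeated = repeat_rows(S, [1, 2, 3])
print(repeated.toarray())
```

As with np.repeat, repeats can be a single integer or one count per row.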
After someone posted a really clever response for how best to do this I revisited my original question, to see if there was an even better way. I came up with one more way that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instead instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.
repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))
This technique repeats the vector repeat_num times, while only modifying the indptr. The downside is that due to the way the csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal as any operation will be quick given that each odd row is 0, and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.
I think the marked answer is probably still the "best" way to do this.
One of the most efficient ways to repeat the sparse matrix would be the way OP suggested. I modified indptr so that it doesn't output rows of 0s.
import numpy as np
import scipy.sparse
## original sparse matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
x.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
To repeat this, you need to repeat data and indices, and you need to fix-up the indptr. This is not the most elegant way, but it works.
## repeated sparse matrix
repeat = 5
new_indptr = indptr
for r in range(1,repeat):
new_indptr = np.concatenate((new_indptr, new_indptr[-1]+indptr[1:]))
x = scipy.sparse.csr_matrix((np.tile(data,repeat), np.tile(indices,repeat), new_indptr))
x.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
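The loop that fixes up indptr can also be written without a Python loop. The sketch below is one way to do it with broadcasting, and it cross-checks the result against scipy.sparse.vstack, which gives the same tiling with less bookkeeping. Variable names such as shifts and tiled are made up for illustration.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

repeat = 5
nnz = indptr[-1]

# Each copy's row pointers are the original ones shifted by a multiple of nnz.
shifts = nnz * np.arange(repeat)[:, None]
new_indptr = np.concatenate(([0], (indptr[1:] + shifts).ravel()))

tiled = sparse.csr_matrix(
    (np.tile(data, repeat), np.tile(indices, repeat), new_indptr),
    shape=(repeat * x.shape[0], x.shape[1]),
)

# Reference result using the library's own stacking routine.
stacked = sparse.vstack([x] * repeat, format="csr")
print(np.array_equal(tiled.toarray(), stacked.toarray()))
```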
