Efficient way to create a diagonal sparse matrix - python

I have the following code in Python using Numpy:
p = np.diag(1.0 / np.array(x))
How can I transform it to get the sparse matrix p2 with the same values as p without creating p first?

Use scipy.sparse.spdiags (which does a lot, and so may be confusing at first), scipy.sparse.dia_matrix, and/or scipy.sparse.lil_diags, depending on the format you want the sparse matrix in. Note that lil_diags is only available in older SciPy releases.
E.g. using spdiags:
import numpy as np
import scipy as sp
import scipy.sparse
x = np.arange(10)
# "0" here indicates the main diagonal...
# "y" will be a dia_matrix type of sparse array, by default
y = sp.sparse.spdiags(x, 0, x.size, x.size)

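Using the scipy.sparse module (imported with "from scipy import sparse"), pass the diagonal values together with the offset 0 of the main diagonal:
p2 = sparse.dia_matrix((1.0 / np.array(x), [0]), shape=(len(x), len(x)))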
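A minimal sketch of the same idea with scipy.sparse.diags, which builds the sparse matrix straight from the diagonal values without creating the dense p first. The input values in x are hypothetical and assumed to contain no zeros.

```python
import numpy as np
from scipy import sparse

# Hypothetical input; it must not contain zeros, since we take 1/x.
x = [1.0, 2.0, 4.0]
inv = 1.0 / np.asarray(x, dtype=float)

# With a single 1-D array and the default offset 0, diags() fills the main diagonal.
p2 = sparse.diags(inv)                    # DIA format by default
p2_csr = sparse.diags(inv, format="csr")  # request another format if needed

# Only for checking on small inputs: compare with the dense construction.
p = np.diag(inv)
same = np.allclose(p2.toarray(), p)
```

The format argument saves a separate conversion when the matrix is going to be used for arithmetic or slicing afterwards.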

Related

Fill scipy sparse.coo_matrix by for loop

I want to iteratively fill a scipy sparse coo_matrix
# Constructing an empty matrix
import numpy as np
from scipy.sparse import coo_matrix
m = coo_matrix((3, 4), dtype=np.int8)
by a for loop and some rules, e.g. all ones (I know there's a constructor with data but I can't use it since the rule is more complex). How can I do that? I haven't found any documentation about it.
You cannot directly fill a scipy.sparse coo_matrix in a for loop.
Your best option is to convert the coo_matrix to a dok_matrix that can be indexed like a dictionary, and later revert it back to coo_matrix if needed.
import numpy as np
from scipy.sparse import coo_matrix
m = coo_matrix((3, 4), dtype=np.int8)
m = m.todok()  # convert to dok
for i in range(m.shape[0]):  # stay inside the (3, 4) shape of m
    for j in range(m.shape[1]):
        m[i, j] = i + j  # update the matrix; note the [row, column] access format
m = m.tocoo()  # convert back to coo
This is reported to be the fastest way to fill the matrix element by element, as mentioned in this answer.
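An alternative sketch, in case the rule can be evaluated one entry at a time: collect the row indices, column indices and values in plain lists inside the loop, then call the coo_matrix data constructor once at the end. The rule i + j and the names rows, cols and vals are placeholders for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

n_rows, n_cols = 3, 4
rows, cols, vals = [], [], []

for i in range(n_rows):
    for j in range(n_cols):
        value = i + j  # placeholder for the more complex rule
        if value != 0:  # only nonzero entries need to be stored
            rows.append(i)
            cols.append(j)
            vals.append(value)

# Build the matrix in one step from the collected triplets.
m = coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols), dtype=np.int8)
```

Appending to lists avoids any per-element update of a sparse structure, so no intermediate dok_matrix is needed.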

Matlab vector in NumPy

I am new to NumPy and have only limited knowledge of Matlab. I have the following Matlab command to create row and column vectors:
X=(-N:N)
Y=column(X.^2)
I am trying to create the same thing in NumPy but the shape of the vector X and Y are the same despite doing transpose:
import numpy as np
N=10
X=np.arange(-N,N)
Y=X**2.T
print X.shape, Y.shape
Could you please let me know if np.arange() is the equivalent of (-N:N) in matlab and what is the problem with the column vector in NumPy?
It's a little more verbose in python:
import numpy as np
X = np.arange(-10,11) #same as X=-10:10; in matlab
Y = X**2 # same as X.^2 in matlab
Y.shape = (np.size(Y),1) #forces it to be column vec
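To spell out the two points raised in the question: np.arange excludes its stop value, so N + 1 is needed to match -N:N, and transposing a 1-D array does nothing because there is only one axis. A short sketch, with Y_alt as an illustrative name:

```python
import numpy as np

N = 10
X = np.arange(-N, N + 1)         # -N..N inclusive, like -N:N in Matlab
same_shape = X.T.shape           # still (21,): .T is a no-op on a 1-D array

Y = (X ** 2).reshape(-1, 1)      # column vector of shape (21, 1)
Y_alt = (X ** 2)[:, np.newaxis]  # equivalent, using an explicit new axis

print(X.shape, Y.shape)
```

reshape(-1, 1) lets NumPy work out the number of rows, so the size does not have to be passed explicitly.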

SciPy/numpy: Only keep maximum value of a sparse matrix block

I am trying to operate on a large sparse matrix (currently 12000 x 12000).
What I want to do is to set blocks of it to zero but keep the largest value within this block.
I already have a running solution for dense matrices:
import numpy as np
from scipy.sparse import random
np.set_printoptions(precision=2)
#x = random(10,10,density=0.5)
x = np.random.random((10,10))
x = x.T * x
print(x)
def keep_only_max(a,b,c,d):
    sub = x[a:b,c:d]
    z = np.max(sub)
    sub[sub < z] = 0
sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)
for i in range(1,len(sizes)):
    current_i_min = sizes_sum[i-1]
    current_i_max = sizes_sum[i]
    for j in range(1,len(sizes)):
        if i >= j:
            continue
        current_j_min = sizes_sum[j-1]
        current_j_max = sizes_sum[j]
        keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
        keep_only_max(current_j_min, current_j_max, current_i_min, current_i_max)
print(x)
This, however, doesn't work for sparse matrices (try uncommenting the line on top).
Any ideas how I could efficiently implement this without calling todense()?
def keep_only_max(a,b,c,d):
    sub = x[a:b,c:d]
    z = np.max(sub)
    sub[sub < z] = 0
For a sparse x, the sub slicing works in csr format. It won't be as fast as the equivalent dense slice, and it creates a copy of that part of x rather than a view, so changing sub does not change x.
I'd have to check the sparse max functions. But I can imagine converting sub to coo format, using np.argmax on the .data attribute, and, with the corresponding row and col values, constructing a new matrix of the same shape with just one nonzero value.
If your blocks covered x in a regular, nonoverlapping manner, I'd suggest constructing a new matrix with sparse.bmat. That basically collects the coo attributes of all the components, joins them into one set of arrays with the appropriate offsets, and makes a new coo matrix.
If the blocks are scattered or overlap, you might have to generate them and insert them back into x one by one. csr format should work for that, but it will issue a sparse efficiency warning. lil is supposed to be faster for changing values. I think it will accept blocks.
I can imagine doing this with sparse matrices, but it will take time to set up a test case and debug the process.
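The coo idea described above can be sketched as follows. The helper name block_max_only and the small test block are made up for illustration, and the sketch assumes the stored values are non-negative, so that the largest stored value is also the maximum of the whole block.

```python
import numpy as np
from scipy import sparse

def block_max_only(sub):
    """Return a matrix shaped like `sub` that keeps only its largest stored value."""
    coo = sub.tocoo()
    if coo.nnz == 0:
        # Nothing stored: return an empty matrix of the same shape.
        return sparse.coo_matrix(sub.shape, dtype=sub.dtype)
    k = np.argmax(coo.data)  # position of the largest stored value
    return sparse.coo_matrix(
        ([coo.data[k]], ([coo.row[k]], [coo.col[k]])), shape=sub.shape
    )

# Hypothetical block, standing in for a slice such as x[a:b, c:d].
dense = np.array([[0.0, 3.0, 0.0, 0.0],
                  [0.0, 0.0, 7.0, 0.0],
                  [2.0, 0.0, 0.0, 0.0]])
sub = sparse.csr_matrix(dense)
new = block_max_only(sub)
```

Blocks produced this way can then be assembled with sparse.bmat, as suggested for the regular, nonoverlapping case.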
Thanks to hpaulj I managed to implement a solution using scipy.sparse.bmat:
from scipy.sparse import coo_matrix
from scipy.sparse import csr_matrix
from scipy.sparse import rand
from scipy.sparse import bmat
import numpy as np
np.set_printoptions(precision=2)
# my matrices are symmetric, so generate random symmetric matrix
x = rand(10,10,density=0.4)
x = x.T * x
x = x
def keep_only_max(a,b,c,d):
    sub = x[a:b,c:d]
    z = np.unravel_index(sub.argmax(),sub.shape)
    i1 = z[0]
    j1 = z[1]
    new = csr_matrix(([sub[i1,j1]],([i1],[j1])),shape=(b-a,d-c))
    return new
def keep_all(a,b,c,d):
    return x[a:b,c:d].copy()
# we want to create a chessboard pattern where the first central block is 1x1, the second 5x5 and the last 4x4
sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)
# acquire 2D array to store our chessboard blocks
r = range(len(sizes)-1)
# use throwaway loop names so the matrix x is never shadowed
blocks = [[0 for _ in r] for _ in r]
for i in range(1,len(sizes)):
    current_i_min = sizes_sum[i-1]
    current_i_max = sizes_sum[i]
    for j in range(i,len(sizes)):
        current_j_min = sizes_sum[j-1]
        current_j_max = sizes_sum[j]
        if i == j:
            # keep the blocks on the diagonal completely
            sub = keep_all(current_i_min, current_i_max, current_j_min, current_j_max)
            blocks[i-1][j-1] = sub
        else:
            # the blocks not on the diagonal only keep their maximum value
            # we can leverage the matrix symmetry and only calculate one new matrix.
            m1 = keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
            m2 = m1.T
            blocks[i-1][j-1] = m1
            blocks[j-1][i-1] = m2
z = bmat(blocks)
print(z.todense())

L2 normalization of rows in scipy sparse matrix

As I want to use only numpy and scipy (I don't want to use scikit-learn), I was wondering how to perform a L2 normalization of rows in a huge scipy csc_matrix (2,000,000 x 500,000). The operation must consume as little memory as possible since it must fit in memory.
What I have so far is:
import scipy.sparse as sp
tf_idf_matrix = sp.lil_matrix((n_docs, n_terms), dtype=np.float16)
# ... perform several operations and fill up the matrix
tf_idf_matrix = tf_idf_matrix / l2_norm(tf_idf_matrix)
# l2_norm() is what I want
def l2_norm(sparse_matrix):
    pass
Since I couldn't find the answer anywhere, I will post here how I approached the problem.
def l2_norm(sparse_csc_matrix):
    # first, I convert the csc_matrix to a csr_matrix copy, which is done in linear time
    norm = sparse_csc_matrix.tocsr(copy=True)
    # compute the inverse of the l2 norm of each row (rows that are all zero keep a factor of 0)
    norm.data **= 2
    norm = np.asarray(norm.sum(axis=1)).ravel()
    n_nzeros = norm > 0
    norm[n_nzeros] = 1.0 / np.sqrt(norm[n_nzeros])
    # modify sparse_csc_matrix in place: in CSC format, indices holds the row
    # index of every stored value, so each value is scaled by its row's factor
    sparse_csc_matrix.data *= norm[sparse_csc_matrix.indices]
If anyone has a better approach, please post it.
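One possible alternative, sketched with public scipy.sparse calls only: scale the rows by multiplying from the left with a diagonal matrix of inverse norms. The function name l2_normalize_rows and the small example matrix are assumptions for illustration. Unlike the in-place version above, this returns a new matrix, so it needs memory for a second copy of the nonzeros.

```python
import numpy as np
import scipy.sparse as sp

def l2_normalize_rows(matrix):
    """Return a CSR copy of `matrix` whose nonzero rows have unit L2 norm."""
    matrix = sp.csr_matrix(matrix, dtype=np.float64)
    # Sum of squares per row, flattened to a 1-D array.
    sq = np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel()
    inv = np.zeros_like(sq)
    nonzero = sq > 0
    inv[nonzero] = 1.0 / np.sqrt(sq[nonzero])  # all-zero rows keep a factor of 0
    # Left-multiplying by a diagonal matrix scales each row by its factor.
    return sp.diags(inv).dot(matrix)

# Hypothetical small example with one all-zero row.
example = np.array([[3.0, 4.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 5.0, 12.0]])
normalized = l2_normalize_rows(example)
```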

adding a numpy ndarray to a sparse matrix

I am trying to add a numpy ndarray to a sparse matrix and I have been unsuccessful in doing so. I was wondering if there is a way to do so, without transforming my sparse matrix into a dense one.
Another question is whether adding two sparse matrices is possible.
x = np.dot(aSparseMatrix, weights)
y = x + bias
where x is my sparse matrix and bias is the numpy array. The error that I get is currently:
NotImplementedError: adding a scalar to a CSC or CSR matrix is not supported
aSparseMatrix.shape (1, 10063)
weights.shape (10063L, 2L)
bias.shape (2L,)
There are different kinds of scipy.sparse matrices: csr_matrix is fast for matrix algebra
but slow to update, coo_matrix slow for algebra / fast to update.
They're described in scipy.org/SciPyPackages/Sparse.
If a sparse matrix is 99 % 0, sparsematrix + 1 is 99 % ones -- dense.
You can hand-expand e.g.
y = dot( x + bias, npvec )
to dot( x, npvec ) + dot( bias, npvec ), with bias broadcast to the shape of x,
wherever y is used later -- possible for short bits of code, but no fun.
I highly recommend IPython for trying things out:
# add some combinations of scipy.sparse matrices + numpy vecs
# see http://www.scipy.org/SciPyPackages/Sparse
import numpy as np
from scipy import sparse as sp
npvec = np.tile( [0,0,0,0,1.], 20 )
Acsr = sp.csr_matrix(npvec)
Acoo = Acsr.tocoo()
for A in (Acsr, Acoo, npvec):
    print("\n%s" % type(A))
    for B in (Acsr, Acoo, npvec):
        print("+ %s = " % type(B), end=" ")
        try:
            AplusB = A + B
            print(type(AplusB))
        except Exception as errmsg:
            print("Error", errmsg)
Instead of the numpy dot product, use the * operator (or aSparseMatrix.dot(weights)). For scipy.sparse matrices, * is matrix multiplication, so the result has the expected shape.
The addition doesn't work because np.dot does not reliably treat a sparse matrix as a matrix, so the result does not have the expected shape and cannot be added to bias.
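A small sketch of that suggestion, with made-up values in place of the (1, 10063) matrix from the question. The sparse matrix's own dot method returns a dense ndarray when the other operand is dense, so the bias can be added with ordinary broadcasting. It also shows that two sparse matrices of the same shape can be added directly.

```python
import numpy as np
from scipy import sparse

# Hypothetical small stand-ins for aSparseMatrix, weights and bias.
aSparseMatrix = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0]]))  # shape (1, 3)
weights = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])                                # shape (3, 2)
bias = np.array([0.5, -1.0])                                    # shape (2,)

x = aSparseMatrix.dot(weights)  # sparse-aware matrix product, dense (1, 2) result
y = x + bias                    # plain NumPy broadcasting

# Adding two sparse matrices of the same shape stays sparse.
doubled = aSparseMatrix + aSparseMatrix
```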
