Sparse Matrix in Numba - python

I wish to speed up my machine learning algorithm (written in Python) using Numba (http://numba.pydata.org/). Note that this algorithm takes as its input data a sparse matrix. In my pure Python implementation, I used csr_matrix and related classes from Scipy, but apparently it is not compatible with Numba's JIT compiler.
I have also created my own custom class to implement the sparse matrix (basically a list of lists of (index, value) pairs), but again it is incompatible with Numba (i.e., I get an error message saying it doesn't recognize the extension type).
Is there an alternative, simple way to implement sparse matrix using only numpy (without resorting to SciPy) that is compatible with Numba? Any example code would be appreciated. Thanks!

If all you have to do is iterate over the values of a CSR matrix, you can pass the attributes data, indptr, and indices to a function instead of the CSR matrix object.
from scipy import sparse
from numba import njit

@njit
def print_csr(A, iA, jA):
    # A: nonzero values, iA: row pointers (indptr), jA: column indices
    for row in range(len(iA) - 1):
        for i in range(iA[row], iA[row + 1]):
            print(row, jA[i], A[i])

A = sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
print_csr(A.data, A.indptr, A.indices)
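The same pattern extends to real computations. Below is a minimal sketch (my own, not part of the original answer) of a CSR matrix-vector product written against the same three arrays; the function name csr_matvec is made up for illustration.
import numpy as np
from scipy import sparse
from numba import njit

@njit
def csr_matvec(data, indptr, indices, x):
    # y[row] = sum of data[i] * x[indices[i]] over that row's slice of data
    y = np.zeros(len(indptr) - 1)
    for row in range(len(indptr) - 1):
        for i in range(indptr[row], indptr[row + 1]):
            y[row] += data[i] * x[indices[i]]
    return y

A = sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
x = np.array([1.0, 2.0, 3.0])
print(csr_matvec(A.data, A.indptr, A.indices, x))  # should match A.dot(x)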

You can access the data of your sparse matrix as pure numpy arrays or Python lists. For example:
import numpy as np
from scipy import sparse

M = sparse.csr_matrix([[1, 0, 0], [1, 0, 1], [1, 1, 1]])
ML = M.tolil()
for d, r in zip(ML.data, ML.rows):
    # d and r are plain Python lists: the values and column indices of one row
    dr = np.array([d, r])
    print(dr)
produces:
[[1]
[0]]
[[1 1]
[0 2]]
[[1 1 1]
[0 1 2]]
Numba should be able to handle code that uses these arrays, provided, of course, that it does not expect every row's array to have the same size.
The lil format stores its values in two object-dtype arrays, with the data and the column indices stored as lists, one per row.
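If you do want to hand these per-row lists to a nopython-mode function, one option (a sketch of my own, not part of the answer) is to flatten them into contiguous arrays plus a row-pointer array, which is essentially the CSR layout used in the first answer:
import numpy as np
from scipy import sparse

M = sparse.csr_matrix([[1, 0, 0], [1, 0, 1], [1, 1, 1]])
ML = M.tolil()

# Flatten the per-row lists into flat arrays plus row pointers,
# a layout that numba's nopython mode can iterate over.
data = np.concatenate([np.asarray(d) for d in ML.data])
cols = np.concatenate([np.asarray(r) for r in ML.rows])
indptr = np.zeros(len(ML.rows) + 1, dtype=np.int64)
indptr[1:] = np.cumsum([len(r) for r in ML.rows])
print(data, cols, indptr)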

Related

Slice a 3d numpy array using a 1d lookup between indices
import numpy as np
a = np.arange(12).reshape(2, 3, 2)
b = np.array([2, 0])
b maps i to j, where i and j are the first two indices of a, as in a[i, j, k].
Desired result after applying b to a is:
[[4 5]
 [6 7]]
Naive solution:
c = np.empty(shape=(2, 2), dtype=int)
for i in range(2):
    j = b[i]
    c[i, :] = a[i, j, :]
Question: Is there a way to do this using a numpy or scipy routine or routines or fancy indexing?
Application: Reinforcement Learning finite MDPs where b is a deterministic policy vector pi(a|s), a is the state transition probabilities p(s'|s,a) and c is the state transition matrix for that policy vector p(s'|s). The arrays will be large and this operation will be repeated a large number of times so needs to be scaleable and fast.
What I have tried:
Compiling with numba, but the line profiler suggests my code is slower than a comparably sized numpy routine. Also, numpy is more widely understood and used.
Maintaining pi(a|s) as a sparse matrix b_as_a_matrix (all zeros except a single 1 per row) and then using einsum, but this involves storing and updating the matrix and creates more work (an extra loop over j and a sum operation).
c = np.einsum('ij,ijk->ik', b_as_a_matrix, a)
Numpy arrays can be indexed using other arrays as indices. See also: NumPy selecting specific column index per row by using a list of indexes.
With that in mind, we can vectorize your loop to simply use b for indexing:
>>> import numpy as np
>>> a = np.arange(12).reshape(2, 3, 2)
>>> b = np.array([2, 0])
>>> i = np.arange(len(b))
>>> i
array([0, 1])
>>> a[i, b, :]
array([[4, 5],
       [6, 7]])
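For the reinforcement-learning application mentioned in the question, the same fancy indexing gives the policy transition matrix directly. A small sketch with made-up names (n_states, n_actions, P_pi are mine, not from the question):
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
p = rng.random((n_states, n_actions, n_states))   # p(s'|s,a)
p /= p.sum(axis=2, keepdims=True)                 # normalise over s'
pi = rng.integers(0, n_actions, size=n_states)    # deterministic policy: a = pi(s)

# Transition matrix under the policy: P_pi[s, s'] = p[s, pi(s), s']
P_pi = p[np.arange(n_states), pi, :]
print(P_pi.shape)  # (n_states, n_states)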

An optimized matrix multiplication library in Python (similar to Matlab) but is NOT numpy

According to the NumPy documentation they may deprecate their np.matrix class. And while arrays do have their multitude of use cases, they cannot do everything. Specifically, they will "break" when doing pretty basic linear algebra operations (you can read more about it here).
Building my own matrix multiplication module in python is not too difficult, but it would not be optimized at all. I am looking for another library that has full linear algebra support which is optimized upon BLAS (Basic Linear Algebra Subprograms). Or at the least, is there any documents on how to DIY integrate a BLAS to python.
Edit: So some are suggesting the @ operator, which is like pushing a mole down a hole and having it pop up immediately in the neighbouring one. In essence, what is happening is a debugger's nightmare:
W*x == w*x.T
W@x == W@x.T
You would hope that an error is raised here letting you know that you made a mistake in defining your matrices. But since 1D arrays don't carry any row/column orientation, I am not sure that the issue can ever be solved with np.array. (These problems don't exist with np.matrix, but for some reason the developers seem insistent on removing it.)
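To make that complaint concrete: with a 1D array, .T is a no-op, so both products below succeed and agree, and no error is ever raised (a small illustration of my own):
import numpy as np

W = np.arange(15).reshape(5, 3)
x = np.array([1, 2, 3])                 # 1D array: x.T is just x again
print(np.array_equal(W @ x, W @ x.T))   # True -- the shape mistake goes unnoticed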
If you insist on the distinction between column and row vectors, you can do that.
>>> x = np.array([1, 2, 3]).reshape(-1, 1)
>>> W = np.arange(15).reshape(5, 3)
>>> x
array([[1],
       [2],
       [3]])
>>> W
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
>>> W @ x
array([[ 8],
       [26],
       [44],
       [62],
       [80]])
>>> W @ x.T
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)
You could create helper functions to create column and row vectors:
def rowvec(x):
    return np.array(x).reshape(1, -1)

def colvec(x):
    return np.array(x).reshape(-1, 1)
>>> rowvec([1, 2, 3])
array([[1, 2, 3]])
>>> colvec([1, 2, 3])
array([[1],
       [2],
       [3]])
I would recommend that you only use this type of construct when you're porting existing Matlab code. You'll have trouble reading numpy code written by others, and many library functions expect 1D arrays as inputs, not (1, n)-shaped arrays.
Actually, numpy offers BLAS-powered matrix multiplication through the matmul operator @. This invokes the __matmul__ magic method for a given class.
All you have to do in the above example is W @ x.
Other linear algebra stuff can be found on the np.linalg module.
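For reference, a couple of the LAPACK-backed routines in np.linalg (a small illustrative snippet of my own, not tied to the question's matrices):
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)   # solve A x = b
w, v = np.linalg.eigh(A)    # eigenvalues/eigenvectors of a symmetric matrix
print(x, w)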
Edit: I guess your problem is more about the language's style than about any technical issue. I found this answer very illuminating:
Transposing a NumPy array
Also, I find it very improbable that you will find something that is NOT numpy since most of the major machine learning/data science frameworks rely on it.

Building sparse COO matrix structure from Cartesian product of indices

Problem
Consider
P: A (N, n_x) matrix.
Then I want to find the indices of a sparse COO matrix such that
indices = []
for i in range(N):
    for j1 in range(n_x):
        for j2 in range(n_x):
            indices.append([P[i, j1], P[i, j2]])
indices = np.unique(indices, axis=0)
Faster Solution
The above solution is both inefficient in terms of time and memory. A faster option using Numpy is below
col_idx = np.reshape(np.tile(P, n_x), [N, n_x, n_x])
row_idx = np.transpose(col_idx, [0,2,1])
indices = np.concatenate((row_idx[:,None], col_idx[:, None]), axis=1)
indices = np.unique(indices, axis=0)
Note however that this still requires building 2 N*n_x*n_x arrays which can be much larger than necessary if we only have a small number of unique elements.
Question
How can I build a fast but also memory-efficient algorithm for doing the above? Currently the fast solution is not usable because it requires too much memory.
The solution could be in Python, but an algorithm that I could code in C would also suffice.
I think both in C++ and Python the way to go is using a set.
In the version below I used Numba, which gives approximately a 30x speedup over the pure Python version.
Python
import numba as nb
import numpy as np

N = 500
n_x = 600
P = np.random.randint(0, 50, N * n_x).reshape(N, n_x)

@nb.jit()
def nb_sparse_coo(P):
    indices = set()
    for i in range(P.shape[0]):
        for j1 in range(P.shape[1]):
            for j2 in range(j1, P.shape[1]):
                indices.add((P[i, j1], P[i, j2]))
    return np.array(list(indices))

indices = nb_sparse_coo(P)
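If the end goal is the sparse structure itself, the returned index pairs can be passed straight to scipy.sparse.coo_matrix. A minimal sketch, assuming you just want ones at those positions and that the values in P run from 0 to P.max():
from scipy import sparse

rows, cols = indices[:, 0], indices[:, 1]
n = P.max() + 1  # assumed matrix dimension
coo = sparse.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
print(coo.shape, coo.nnz)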

Multiplying an array by a designated row vector of another matrix

Good afternoon all, relatively simple question here from a mechanical standpoint.
I'm currently performing PCA and have successfully written code that computes the covariance matrix, the correlation matrix, and the associated eigenspectrum.
Now I have created an array that represents the eigenvectors row-wise, and I would like to compute the transformation C*v^T, where C is the observation matrix and v^T is a transposed eigenvector.
Since some of these matrices are pretty big, I'd like to be able to tell Python which row of the eigenvector matrix to multiply C by. So far I have tried some of the numpy functions, but to no avail.
(For those of you wondering, I don't want to compute the matrix product of all the eigenvectors; I only need to multiply by a small subset of them, the ones associated with the largest eigenvalues.)
Thanks!
To "slice" a vector of row n out of 2-dimensional array A, you use a syntax like A[n]. If it's slicing columns you wanted instead, the syntax is A[:,n].
For transformations with numpy arrays and vectors, the syntax is with matrix multiplication operator:
>>> A = np.array([[0, -1], [1, 0]])
>>> vs = np.array([[1, 2], [3, 4]])
>>> A @ vs[0]  # this is a rotation of the first row of vs by A
array([-2,  1])
>>> A @ vs[1]  # this is a rotation of the second row of vs by A
array([-4,  3])
Note: if you're on an older Python version (< 3.5), you might not have @ available yet. Then you'll have to use the function np.dot(array, vector) instead of the operator.
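Applied to the PCA setting in the question: if V holds the eigenvectors row-wise and C is the observation matrix, you can project onto just the leading k components by slicing before multiplying. A sketch with made-up data (the names C, V and k are mine):
import numpy as np

rng = np.random.default_rng(1)
C = rng.random((100, 5))   # observation matrix, one sample per row
V = rng.random((5, 5))     # stand-in for the eigenvector matrix, one eigenvector per row

k = 2                      # keep only the k eigenvectors with the largest eigenvalues
scores = C @ V[:k].T       # C times the transposed slice, shape (100, k)
print(scores.shape)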

Dot product between 1D numpy array and scipy sparse matrix

Say I have Numpy array p and a Scipy sparse matrix q such that
>>> p.shape
(10,)
>>> q.shape
(10,100)
I want to do a dot product of p and q. When I try with numpy I get the following:
>>> np.dot(p,q)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-96-8260c6752ee5>", line 1, in <module>
    np.dot(p,q)
ValueError: Cannot find a common data type.
I see in the Scipy documentation that
As of NumPy 1.7, np.dot is not aware of sparse matrices, therefore
using it will result in unexpected results or errors. The
corresponding dense matrix should be obtained first instead
But that defeats my purpose of using a sparse matrix. Soooo, how am I to do dot products between a sparse matrix and a 1D numpy array (numpy matrix, I am open to either) without losing the sparsity of my matrix?
I am using Numpy 1.8.2 and Scipy 0.15.1.
Use *:
p * q
Note that * uses matrix-like semantics rather than array-like semantics for sparse matrices, so it computes a matrix product rather than a broadcasted product.
A sparse matrix is not a numpy array or matrix, though most formats use several arrays to store their data. As a general rule, regular numpy functions aren't aware of sparse matrices, so you should count on using the sparse versions of functions and operators.
By popular demand, the latest np.dot is sparse-aware, though I don't know the details of how it handles sparse inputs. As of 1.18 we have several options.
user2357112 suggests p*q. With the dense array first, I was a little doubtful, wondering if it would try to use array element by element multiplication (and fail due to broadcasting errors). But it works. Sometimes operators like * pass control to the 2nd argument. But just to be sure I tried several alternatives:
q.T * p
np.dot(p, q.A)
q.T.dot(p)
all give the same dense (100,) array. Note - this is an array, not a sparse matrix result.
To get a sparse matrix I need to use
sparse.csr_matrix(p)*q # (1,100) shape
q could be in other sparse formats, but for calculations like this it is converted to csr or csc. And the .T operation is cheap because it just requires switching the format between csr and csc.
It would be a good idea to check whether these alternatives work if p is a 2d array, e.g. (2,10).
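Following up on that last point, a quick shape check of my own with a 2d p (the random q below is just a stand-in for the question's (10,100) sparse matrix):
import numpy as np
from scipy import sparse

q = sparse.random(10, 100, density=0.1, format='csr')
p2 = np.ones((2, 10))

print((p2 * q).shape)       # (2, 100) -- still treated as a matrix product
print(q.T.dot(p2.T).shape)  # (100, 2)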
Scipy has built-in methods for sparse matrix multiplication.
Example from documentation:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> Q = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
>>> p = np.array([1, 0, -1])
>>> Q.dot(p)
array([ 1, -3, -1], dtype=int64)
Check these resources:
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csc_matrix.dot.html
http://docs.scipy.org/doc/scipy/reference/sparse.html
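On Python 3.5+ with a reasonably recent SciPy, the @ operator also works directly with sparse matrices, which reads more like ordinary linear algebra; a short sketch:
import numpy as np
from scipy.sparse import csr_matrix

Q = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
p = np.array([1, 0, -1])

print(Q @ p)              # dense 1D result, same as Q.dot(p)
print((Q @ Q).toarray())  # sparse @ sparse stays sparse until you ask for a dense array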
