SciPy/numpy: Only keep maximum value of a sparse matrix block

SciPy/numpy: Only keep maximum value of a sparse matrix block - python

I am trying to operate on a large sparse matrix (currently 12000 x 12000).
What I want to do is to set blocks of it to zero but keep the largest value within this block.
I already have a running solution for dense matrices:
import numpy as np
from scipy.sparse import random
np.set_printoptions(precision=2)
#x = random(10,10,density=0.5)
x = np.random.random((10,10))
x = x.T * x
print(x)
def keep_only_max(a,b,c,d):
sub = x[a:b,c:d]
z = np.max(sub)
sub[sub < z] = 0
sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)
for i in range(1,len(sizes)):
current_i_min = sizes_sum[i-1]
current_i_max = sizes_sum[i]
for j in range(1,len(sizes)):
if i >= j:
continue
current_j_min = sizes_sum[j-1]
current_j_max = sizes_sum[j]
keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
keep_only_max(current_j_min, current_j_max, current_i_min, current_i_max)
print(x)
This, however, doesn't work for sparse matrices (try uncommenting the line on top).
Any ideas how I could efficiently implement this without calling todense()?

def keep_only_max(a,b,c,d):
sub = x[a:b,c:d]
z = np.max(sub)
sub[sub < z] = 0
For a sparse x, the sub slicing works for csr format. It won't be as fast as the equivalent dense slice, but it will create a copy of that part of x.
I'd have to check the sparse max functions. But I can imagine convertering sub to coo format, using np.argmax on the .data attribute, and with the corresponding row and col values, constructing a new matrix of the same shape but just one nonzero value.
If your blocks covered x in a regular, nonoverlapping manner, I'd suggest constructing a new matrix with sparse.bmat. That basically collects the coo attributes of all the components, joins them into one set of arrays with the appropriate offsets, and makes a new coo matrix.
If the blocks are scattered or overlap you might have to generate, and insert them back into x one by one. csr format should work for that, but it will issue a sparse efficiency warning. lil is supposed to be faster for changing values. I think it will accept blocks.
I can imagine doing this with sparse matrices, but it will take time to setup a test case and debug the process.

Thanks to hpaulj I managed to implement a solution using scipy.sparse.bmat:
from scipy.sparse import coo_matrix
from scipy.sparse import csr_matrix
from scipy.sparse import rand
from scipy.sparse import bmat
import numpy as np
np.set_printoptions(precision=2)
# my matrices are symmetric, so generate random symmetric matrix
x = rand(10,10,density=0.4)
x = x.T * x
x = x
def keep_only_max(a,b,c,d):
sub = x[a:b,c:d]
z = np.unravel_index(sub.argmax(),sub.shape)
i1 = z[0]
j1 = z[1]
new = csr_matrix(([sub[i1,j1]],([i1],[j1])),shape=(b-a,d-c))
return new
def keep_all(a,b,c,d):
return x[a:b,c:d].copy()
# we want to create a chessboard pattern where the first central block is 1x1, the second 5x5 and the last 4x4
sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)
# acquire 2D array to store our chessboard blocks
r = range(len(sizes)-1)
blocks = [[0 for x in r] for y in r]
for i in range(1,len(sizes)):
current_i_min = sizes_sum[i-1]
current_i_max = sizes_sum[i]
for j in range(i,len(sizes)):
current_j_min = sizes_sum[j-1]
current_j_max = sizes_sum[j]
if i == j:
# keep the blocks at the diagonal completely
sub = keep_all(current_i_min, current_i_max, current_j_min, current_j_max)
blocks[i-1][j-1] = sub
else:
# the blocks not on the digonal only keep their maximum value
current_j_min = sizes_sum[j-1]
current_j_max = sizes_sum[j]
# we can leverage the matrix symmetry and only calculate one new matrix.
m1 = keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
m2 = m1.T
blocks[i-1][j-1] = m1
blocks[j-1][i-1] = m2
z = bmat(blocks)
print(z.todense())

Related

incrementing a multidimensional numpy array (python) with products generated from a set of vectors corresponding to axes of the array

A is a k dimensional numpy array of floats (k could be pretty big, e.g. up to 10)
I need to implement an update to A by incrementing each of the values (as described below). I'm wondering if there is a numpy-style way that would be fast.
Let L_i be the length of axis i
An update to this array is generated in two steps follows:
For each axis of A a corresponding vector G is generated.
For example, corresponding to axis i a vector G_i of length L_i is generated (from data).
Update A at all positions by calculating an increment from the G vectors for each position in A
To do this at any particular position, let p be an array of k indices, corresponding to a position in A. Then A at p is incremented by a value calculated as the product:
Product(G_i[p[i]], for i from 0 to k-1)
A full update to A involves doing this operation for all locations in A (i.e. all possible values of p)
This operation would be very slow doing positions one by one via loops.
Is there a numpy style way to do this that would be fast?
edit
## this for three dimensions, final matrix at pos i,j,k has the
## product of c[i]*b[j]*a[k]
## but for arbitrary # of dimensions it will have a loop in a loop
## and will be slow
import numpy as np
a = np.array([1,2])
b = np.array([3,4,5])
c = np.array([6,7,8,9])
ab = []
for bval in b:
ab.append(bval*a)
ab = np.stack(ab)
abc = []
for cval in c:
abc.append(cval*ab)
abc = np.stack(abc)
as a function
def loopfunc(arraylist):
ndim = len(arraylist)
m = arraylist[0]
for i in range(1,ndim):
ml = []
for val in arraylist[i]:
ml.append(val*m)
m = np.stack(ml)
return m

This is a wacky problem, but I like it.
If I understand what you need from your example, you can accomplish this with some reshaping trickery and NumPy's usual broadcasting rules. The idea is to reshape each array so it has the right number of dimensions, then just directly multiply.
Here's a function that implements this.
from functools import reduce
import operator
import numpy as np
import scipy.linalg
def wacky_outer_product(*arrays):
assert len(arrays) >= 2
assert all(arr.ndim == 1 for arr in arrays)
ndim = len(arrays)
shapes = scipy.linalg.toeplitz((-1,) + (1,) * (ndim - 1))
reshaped = (arr.reshape(new_shape) for arr, new_shape in zip(arrays, shapes))
return reduce(operator.mul, reshaped).T
Testing this on your example arrays, we have:
>>> foo = wacky_outer_product(a, b, c)
>>> np.all(foo, abc)
True
Edit
Ok, the above function is fun, but the below is probably much better. No transposing, clearer, and much smaller:
from functools import reduce
import operator
import numpy as np
def wacky_outer_product(*arrays):
return reduce(operator.mul, np.ix_(*reversed(arrays)))

Scipy: Sparse indicator matrix from array(s)

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = csr((a[:,None]==b),shape=(len(a),len(b)))
The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?

One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))
##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []
##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
x = np.argwhere(a==value).ravel()
y = np.argwhere(b==value).ravel()
rows.append(np.repeat(x, len(x)))
cols.append(np.tile(y, len(y)))
##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)
##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )
##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here some code that uses two 1d arras of random integers to first generate the sparse matrix the way that the OP posted and then uses a generator expression to test all elements for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
*((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...

You could use numpy.isclose with small tolerance:
np.isclose(a,b)
Or pandas.DataFrame.eq:
a.eq(b)
Note this returns an array of True False.

Newton method in python for multivariables (system of equations)

My code is running fine for first iteration but after that it outputs the following error:
ValueError: matrix must be 2-dimensional
To the best of my knowledge (which is not much in python), my code is correct. but I don't know, why it is not running correctly for all given iterations. Could anyone help me in this problem.
from __future__ import division
import numpy as np
import math
import matplotlib.pylab as plt
import sympy as sp
from numpy.linalg import inv
#initial guesses
x = -2
y = -2.5
i1 = 0
while i1<5:
F= np.matrix([[(x**2)+(x*y**3)-9],[(3*y*x**2)-(y**3)-4]])
theta = np.sum(F)
J = np.matrix([[(2*x)+y**3, 3*x*y**2],[6*x*y, (3*x**2)-(3*y**2)]])
Jinv = inv(J)
xn = np.array([[x],[y]])
xn_1 = xn - (Jinv*F)
x = xn_1[0]
y = xn_1[1]
#~ print theta
print xn
i1 = i1+1

I believe xn_1 is a 2D matrix. Try printing it you and you will see [[something], [something]]
Therefore to get the x and y, you need to use multidimensional indexing. Here is what I did
x = xn_1[0,0]
y = xn_1[1,0]
This works because within the 2D matrix xn_1 are two single element arrays. Therefore we need to further index 0 to get that single element.
Edit: To clarify, xn_1[1,0] means to index 1 and then take that subarray and index 0 on that. And although according to Scipy it may seem that it should be functionally equivalent to xn_1[1][0], that only applies to the general np.array type and not the np.matrix type. Here is an excellent thread on SO that explains this.
So you should use the xn_1[1,0] way to get the element you want.

xn_1 is a numpy matrix, so it's elements are accessed with the item() method, not like an array. (with []s)
So just change
x = xn_1[0]
y = xn_1[1]
to
x = xn_1.item(0)
y = xn_1.item(1)

Diagonal Block Multiplication via Tensor Multiplication in Numpy

Is there an expression (perhaps using np.tensordot) that most expressively captures a sparse matrix vector multplicatiuon between a block matrix of diagonal blocks and a vector of the corresponding size?
I have a working implementation that performs the exact operations I want, but I used two python loops (see below) instead of an appropriate numpy command, which probably exists.
For example:
import numpy as np
outer_size = 2
inner_size = 5
shape = (outer_size, outer_size, inner_size)
diag_block = np.arange(np.prod(shape)).reshape(shape)
true_diag = np.bmat([[np.diag(diag_block[i,j]) for j in range(shape[1])] for i in range(shape[0])]).A
x = np.arange(shape[1] * shape[2])
def sparse_prod(diags, x):
outer_size = diags.shape[0]
return np.hstack(sum(diags[i] * x.reshape(outer_size, -1)) for i in range(outer_size))
print(true_diag.dot(x))
print(sparse_prod(diag_block, x))

Approach #1 : You could use np.einsum -
np.einsum('ijk,jk->ik', diag_block, x.reshape(outer_size, -1)).ravel()
Basically, we are keeping the last axis aligned between the two inputs and sum-reducing the second and first axes respectively.
Approach #2 : Depending on the shapes of the input arrays, you might want to use np.dot/np.tensordot in a loop as discussed in some detail in this post.
Here's such an approach with a loop -
m,_,n = diag_block.shape
x2D = x.reshape(outer_size, -1)
out = np.empty((m,n))
for i in range(n):
out[:,i] = np.dot(diag_block[...,i], x2D[:,i])
out.shape = out.size # Flatten

sparse 3d matrix/array in Python?

In scipy, we can construct a sparse matrix using scipy.sparse.lil_matrix() etc. But the matrix is in 2d.
I am wondering if there is an existing data structure for sparse 3d matrix / array (tensor) in Python?
p.s. I have lots of sparse data in 3d and need a tensor to store / perform multiplication. Any suggestions to implement such a tensor if there's no existing data structure?

Happy to suggest a (possibly obvious) implementation of this, which could be made in pure Python or C/Cython if you've got time and space for new dependencies, and need it to be faster.
A sparse matrix in N dimensions can assume most elements are empty, so we use a dictionary keyed on tuples:
class NDSparseMatrix:
def __init__(self):
self.elements = {}
def addValue(self, tuple, value):
self.elements[tuple] = value
def readValue(self, tuple):
try:
value = self.elements[tuple]
except KeyError:
# could also be 0.0 if using floats...
value = 0
return value
and you would use it like so:
sparse = NDSparseMatrix()
sparse.addValue((1,2,3), 15.7)
should_be_zero = sparse.readValue((1,5,13))
You could make this implementation more robust by verifying that the input is in fact a tuple, and that it contains only integers, but that will just slow things down so I wouldn't worry unless you're releasing your code to the world later.
EDIT - a Cython implementation of the matrix multiplication problem, assuming other tensor is an N Dimensional NumPy array (numpy.ndarray) might look like this:
#cython: boundscheck=False
#cython: wraparound=False
cimport numpy as np
def sparse_mult(object sparse, np.ndarray[double, ndim=3] u):
cdef unsigned int i, j, k
out = np.ndarray(shape=(u.shape[0],u.shape[1],u.shape[2]), dtype=double)
for i in xrange(1,u.shape[0]-1):
for j in xrange(1, u.shape[1]-1):
for k in xrange(1, u.shape[2]-1):
# note, here you must define your own rank-3 multiplication rule, which
# is, in general, nontrivial, especially if LxMxN tensor...
# loop over a dummy variable (or two) and perform some summation:
out[i,j,k] = u[i,j,k] * sparse((i,j,k))
return out
Although you will always need to hand roll this for the problem at hand, because (as mentioned in code comment) you'll need to define which indices you're summing over, and be careful about the array lengths or things won't work!
EDIT 2 - if the other matrix is also sparse, then you don't need to do the three way looping:
def sparse_mult(sparse, other_sparse):
out = NDSparseMatrix()
for key, value in sparse.elements.items():
i, j, k = key
# note, here you must define your own rank-3 multiplication rule, which
# is, in general, nontrivial, especially if LxMxN tensor...
# loop over a dummy variable (or two) and perform some summation
# (example indices shown):
out.addValue(key) = out.readValue(key) +
other_sparse.readValue((i,j,k+1)) * sparse((i-3,j,k))
return out
My suggestion for a C implementation would be to use a simple struct to hold the indices and the values:
typedef struct {
int index[3];
float value;
} entry_t;
you'll then need some functions to allocate and maintain a dynamic array of such structs, and search them as fast as you need; but you should test the Python implementation in place for performance before worrying about that stuff.

An alternative answer as of 2017 is the sparse package. According to the package itself it implements sparse multidimensional arrays on top of NumPy and scipy.sparse by generalizing the scipy.sparse.coo_matrix layout.
Here's an example taken from the docs:
import numpy as np
n = 1000
ndims = 4
nnz = 1000000
coords = np.random.randint(0, n - 1, size=(ndims, nnz))
data = np.random.random(nnz)
import sparse
x = sparse.COO(coords, data, shape=((n,) * ndims))
x
# <COO: shape=(1000, 1000, 1000, 1000), dtype=float64, nnz=1000000>
x.nbytes
# 16000000
y = sparse.tensordot(x, x, axes=((3, 0), (1, 2)))
y
# <COO: shape=(1000, 1000, 1000, 1000), dtype=float64, nnz=1001588>

Have a look at sparray - sparse n-dimensional arrays in Python (by Jan Erik Solem). Also available on github.

Nicer than writing everything new from scratch may be to use scipy's sparse module as far as possible. This may lead to (much) better performance. I had a somewhat similar problem, but I only had to access the data efficiently, not perform any operations on them. Furthermore, my data were only sparse in two out of three dimensions.
I have written a class that solves my problem and could (as far as I think) easily be extended to satisfiy the OP's needs. It may still hold some potential for improvement, though.
import scipy.sparse as sp
import numpy as np
class Sparse3D():
"""
Class to store and access 3 dimensional sparse matrices efficiently
"""
def __init__(self, *sparseMatrices):
"""
Constructor
Takes a stack of sparse 2D matrices with the same dimensions
"""
self.data = sp.vstack(sparseMatrices, "dok")
self.shape = (len(sparseMatrices), *sparseMatrices[0].shape)
self._dim1_jump = np.arange(0, self.shape[1]*self.shape[0], self.shape[1])
self._dim1 = np.arange(self.shape[0])
self._dim2 = np.arange(self.shape[1])
def __getitem__(self, pos):
if not type(pos) == tuple:
if not hasattr(pos, "__iter__") and not type(pos) == slice:
return self.data[self._dim1_jump[pos] + self._dim2]
else:
return Sparse3D(*(self[self._dim1[i]] for i in self._dim1[pos]))
elif len(pos) > 3:
raise IndexError("too many indices for array")
else:
if (not hasattr(pos[0], "__iter__") and not type(pos[0]) == slice or
not hasattr(pos[1], "__iter__") and not type(pos[1]) == slice):
if len(pos) == 2:
result = self.data[self._dim1_jump[pos[0]] + self._dim2[pos[1]]]
else:
result = self.data[self._dim1_jump[pos[0]] + self._dim2[pos[1]], pos[2]].T
if hasattr(pos[2], "__iter__") or type(pos[2]) == slice:
result = result.T
return result
else:
if len(pos) == 2:
return Sparse3D(*(self[i, self._dim2[pos[1]]] for i in self._dim1[pos[0]]))
else:
if not hasattr(pos[2], "__iter__") and not type(pos[2]) == slice:
return sp.vstack([self[self._dim1[pos[0]], i, pos[2]]
for i in self._dim2[pos[1]]]).T
else:
return Sparse3D(*(self[i, self._dim2[pos[1]], pos[2]]
for i in self._dim1[pos[0]]))
def toarray(self):
return np.array([self[i].toarray() for i in range(self.shape[0])])

I also need 3D sparse matrix for solving the 2D heat equations (2 spatial dimensions are dense, but the time dimension is diagonal plus and minus one offdiagonal.) I found this link to guide me. The trick is to create an array Number that maps the 2D sparse matrix to a 1D linear vector. Then build the 2D matrix by building a list of data and indices. Later the Number matrix is used to arrange the answer back to a 2D array.
[edit] It occurred to me after my initial post, this could be handled better by using the .reshape(-1) method. After research, the reshape method is better than flatten because it returns a new view into the original array, but flatten copies the array. The code uses the original Number array. I will try to update later.[end edit]
I test it by creating a 1D random vector and solving for a second vector. Then multiply it by the sparse 2D matrix and I get the same result.
Note: I repeat this many times in a loop with exactly the same matrix M, so you might think it would be more efficient to solve for inverse(M). But the inverse of M is not sparse, so I think (but have not tested) using spsolve is a better solution. "Best" probably depends on how large the matrix is you are using.
#!/usr/bin/env python3
# testSparse.py
# profhuster
import numpy as np
import scipy.sparse as sM
import scipy.sparse.linalg as spLA
from array import array
from numpy.random import rand, seed
seed(101520)
nX = 4
nY = 3
r = 0.1
def loadSpNodes(nX, nY, r):
# Matrix to map 2D array of nodes to 1D array
Number = np.zeros((nY, nX), dtype=int)
# Map each element of the 2D array to a 1D array
iM = 0
for i in range(nX):
for j in range(nY):
Number[j, i] = iM
iM += 1
print(f"Number = \n{Number}")
# Now create a sparse matrix of the "stencil"
diagVal = 1 + 4 * r
offVal = -r
d_list = array('f')
i_list = array('i')
j_list = array('i')
# Loop over the 2D nodes matrix
for i in range(nX):
for j in range(nY):
# Recall the 1D number
iSparse = Number[j, i]
# populate the diagonal
d_list.append(diagVal)
i_list.append(iSparse)
j_list.append(iSparse)
# Now, for each rectangular neighbor, add the
# off-diagonal entries
# Use a try-except, so boundry nodes work
for (jj,ii) in ((j+1,i),(j-1,i),(j,i+1),(j,i-1)):
try:
iNeigh = Number[jj, ii]
if jj >= 0 and ii >=0:
d_list.append(offVal)
i_list.append(iSparse)
j_list.append(iNeigh)
except IndexError:
pass
spNodes = sM.coo_matrix((d_list, (i_list, j_list)), shape=(nX*nY,nX*nY))
return spNodes
MySpNodes = loadSpNodes(nX, nY, r)
print(f"Sparse Nodes = \n{MySpNodes.toarray()}")
b = rand(nX*nY)
print(f"b=\n{b}")
x = spLA.spsolve(MySpNodes.tocsr(), b)
print(f"x=\n{x}")
print(f"Multiply back together=\n{x * MySpNodes}")

I needed a 3d look up table for x,y,z and came up with this solution..
Why not use one of the dimensions to be a divisor of the third dimension? ie. use x and 'yz' as the matrix dimensions
eg. if x has 80 potential members, y has 100 potential' and z has 20 potential'
you make the sparse matrix to be 80 by 2000 (i.e. xy=100x20)
x dimension is as usual
yz dimension: the first 100 elements will represent z=0, y=0 to 99
..............the second 100 will represent z=2, y=0 to 99 etc
so given element located at (x,y,z) would be in sparse matrix at (x, z*100 + y)
if you need to use negative numbers design a aritrary offset into your matrix translation. the solutio could be expanded to n dimensions if necessary
from scipy import sparse
m = sparse.lil_matrix((100,2000), dtype=float)
def add_element((x,y,z), element):
element=float(element)
m[x,y+z*100]=element
def get_element(x,y,z):
return m[x,y+z*100]
add_element([3,2,4],2.2)
add_element([20,15,7], 1.2)
print get_element(0,0,0)
print get_element(3,2,4)
print get_element(20,15,7)
print " This is m sparse:";print m
====================
OUTPUT:
0.0
2.2
1.2
This is m sparse:
(3, 402L) 2.2
(20, 715L) 1.2
====================

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

SciPy/numpy: Only keep maximum value of a sparse matrix block - python

Related

incrementing a multidimensional numpy array (python) with products generated from a set of vectors corresponding to axes of the array

Scipy: Sparse indicator matrix from array(s)

Newton method in python for multivariables (system of equations)

Diagonal Block Multiplication via Tensor Multiplication in Numpy

sparse 3d matrix/array in Python?

Categories

Resources