expanding (adding a row or column) a scipy.sparse matrix - python

Suppose I have a NxN matrix M (lil_matrix or csr_matrix) from scipy.sparse, and I want to make it (N+1)xN where M_modified[i,j] = M[i,j] for 0 <= i < N (and all j) and M[N,j] = 0 for all j. Basically, I want to add a row of zeros to the bottom of M and preserve the remainder of the matrix. Is there a way to do this without copying the data?

Scipy doesn't have a way to do this without copying the data, but you can do it yourself by changing the attributes that define the sparse matrix.
There are four attributes that make up the csr_matrix:
data: An array containing the actual values in the matrix
indices: An array containing the column index corresponding to each value in data
indptr: An array of row pointers: the values of row i live in data[indptr[i]:indptr[i+1]]. If a row is empty, its entry is the same as the previous row's.
shape: A tuple containing the shape of the matrix
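To see how these four attributes fit together, here is a quick inspection of a small matrix (my illustration; data, indices, indptr, and shape are the public csr_matrix attributes):
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1., 0., 2.],
                         [0., 0., 0.],
                         [3., 0., 0.]]))
m.data     # array([1., 2., 3.])   the nonzero values
m.indices  # array([0, 2, 0])      column index of each value
m.indptr   # array([0, 2, 2, 3])   row i occupies data[indptr[i]:indptr[i+1]]
m.shape    # (3, 3)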
If you are simply adding a row of zeros to the bottom all you have to do is change the shape and indptr for your matrix.
import numpy as np
from scipy.sparse import csr_matrix

x = np.ones((3,5))
x = csr_matrix(x)
x.toarray()
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
# reshape is not implemented for csr_matrix, but you can cheat and do it yourself.
x._shape = (4,5)
# Update indptr to record that the appended row is empty: just append the last
# value in indptr to the end.
# Note that this still copies the indptr array.
x.indptr = np.hstack((x.indptr, x.indptr[-1]))
x.toarray()
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.]])
Here is a function to handle the more general case of vstacking any two csr_matrices. You still end up copying the underlying numpy arrays, but it is still significantly faster than the scipy vstack method.
def csr_vappend(a, b):
    """Takes in two csr_matrices and appends the second one to the bottom
    of the first one. Much faster than scipy.sparse.vstack, but assumes both
    matrices are csr and overwrites the first matrix instead of copying it.
    The data, indices, and indptr arrays still get copied."""
    a.data = np.hstack((a.data, b.data))
    a.indices = np.hstack((a.indices, b.indices))
    a.indptr = np.hstack((a.indptr, (b.indptr + a.nnz)[1:]))
    a._shape = (a.shape[0] + b.shape[0], b.shape[1])
    return a
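A quick usage check of this function (my example, not part of the original answer; assumes the imports from the snippet above):
a = csr_matrix(np.ones((2, 3)))
b = csr_matrix(np.arange(6, dtype=float).reshape(2, 3))
c = csr_vappend(a, b)  # note: a is modified in place and returned
c.toarray()
# array([[1., 1., 1.],
#        [1., 1., 1.],
#        [0., 1., 2.],
#        [3., 4., 5.]])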

Not sure if you're still looking for a solution, but others may find scipy.sparse.hstack and scipy.sparse.vstack useful - http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html. You can define a csr_matrix for the single additional row and then vstack it onto the previous matrix.
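For example, something like this sketch (note that scipy.sparse.vstack does copy the data):
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.ones((3, 5)))
zero_row = sparse.csr_matrix((1, M.shape[1]))       # an all-zero 1xN row
M_new = sparse.vstack([M, zero_row], format='csr')  # the (N+1)xN result
M_new.shape  # (4, 5)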

I don't think there is any way to really escape copying. Both of these sparse matrix types store their data internally as Numpy arrays (in the data and indices attributes for csr and in the data and rows attributes for lil), and Numpy arrays can't be extended.
Update with more information:
LIL does stand for LInked List, but the current implementation doesn't quite live up to the name. The Numpy arrays used for data and rows both have dtype object; each element of these arrays is actually a Python list (an empty list when all values in a row are zero). Python lists aren't exactly linked lists, but they are close, and frankly a better choice given their O(1) element access. Personally, I don't immediately see the point of using a Numpy array of objects here rather than just a Python list. You could fairly easily change the current lil implementation to use Python lists instead, which would allow you to add a row without copying the whole matrix.
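You can see this layout directly; data and rows are public lil_matrix attributes (a small illustration; the exact repr of the object arrays varies across NumPy versions):
from scipy.sparse import lil_matrix

m = lil_matrix((3, 4))
m[0, 1] = 5.0
m[2, 3] = 7.0
m.rows  # [list([1]) list([]) list([3])]      column indices, one Python list per row
m.data  # [list([5.0]) list([]) list([7.0])]  values, one Python list per row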

Related

Converting from sparse to dense to sparse again decreases density after constructing sparse matrix

I am using scipy to generate a sparse finite difference matrix, constructing it initially from block matrices and then editing the diagonal to account for boundary conditions. The resulting sparse matrix is of the BSR type. I have found that if I convert the matrix to a dense matrix and then back to a sparse matrix using the scipy.sparse.bsr_matrix function, I am left with a sparser matrix than before. Here is the code I use to generate the matrix:
import numpy as np
import scipy as sp
import scipy.sparse

size = (4,4)
xDiff = np.zeros((size[0]+1,size[0]))
ix,jx = np.indices(xDiff.shape)
xDiff[ix==jx] = 1
xDiff[ix==jx+1] = -1
yDiff = np.zeros((size[1]+1,size[1]))
iy,jy = np.indices(yDiff.shape)
yDiff[iy==jy] = 1
yDiff[iy==jy+1] = -1
Ax = sp.sparse.dia_matrix(-np.matmul(np.transpose(xDiff),xDiff))
Ay = sp.sparse.dia_matrix(-np.matmul(np.transpose(yDiff),yDiff))
lap = sp.sparse.kron(sp.sparse.eye(size[1]),Ax) + sp.sparse.kron(Ay,sp.sparse.eye(size[0]))
#set up boundary conditions
BC_diag = np.array([2]+[1]*(size[0]-2)+[2]+([1]+[0]*(size[0]-2)+[1])*(size[1]-2)+[2]+[1]*(size[0]-2)+[2])
lap += sp.sparse.diags(BC_diag)
If I check the sparsity of this matrix I see the following:
lap
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 160 stored elements (blocksize = 4x4) in Block Sparse Row format>
However, if I convert it to a dense matrix and then back to the same sparse format I see a much sparser matrix:
sp.sparse.bsr_matrix(lap.todense())
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements (blocksize = 1x1) in Block Sparse Row format>
I suspect this is happening because I constructed the matrix using the sparse.kron function. My question is whether there is a way to arrive at the smaller sparse matrix without converting to dense first, for example if I end up wanting to simulate a very large domain.
BSR stores the data in dense blocks:
In [167]: lap.data.shape
Out[167]: (10, 4, 4)
In this case those blocks have quite a few zeros.
In [168]: lap1 = lap.tocsr()
In [170]: lap1
Out[170]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 160 stored elements in Compressed Sparse Row format>
In [171]: lap1.data
Out[171]:
array([-2., 1., 0., 0., 1., 0., 0., 0., 1., -3., 1., 0., 0.,
1., 0., 0., 0., 1., -3., 1., 0., 0., 1., 0., 0., 0.,
1., -2., 0., 0., 0., 1., 1., 0., 0., 0., -3., 1., 0.,
0., 1., 0., 0., 0., 0., 1., 0., 0., 1., -4., 1., 0.,
...
0., 0., 1., -2.])
In-place cleanup:
In [172]: lap1.eliminate_zeros()
In [173]: lap1
Out[173]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
If I specify the csr format when using kron:
In [181]: lap2 = sparse.kron(np.eye(size[1]), Ax, format='csr') + sparse.kron(Ay, np.eye(size[0]), format='csr')
In [182]: lap2
Out[182]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 64 stored elements in Compressed Sparse Row format>
[I have been informed that my answer is incorrect. The reason, if I understand correctly, is that Scipy is not using Lapack for creating matrices but is using its own code for this purpose. Interesting. The information, though unexpected, has the ring of authority. I shall defer to it!
I will leave the answer posted for reference, but no longer assert that it is correct.]
Generally speaking, when it comes to complicated data structures like sparse matrices, you have two cases:
the constructor knows the structure's full contents in advance; or
the structure is designed to be built up gradually so that the structure's full contents are known only after the structure is complete.
The classic case of the complicated data structure is the binary tree. You can make a binary tree more efficient by copying it after it is complete. Otherwise, the standard red-black implementation of the tree leaves some search paths up to twice as long as others, which is usually okay but is not optimal.
Now, you probably knew all that, but I mention it for a reason. Scipy depends on Lapack, and Lapack brings several different storage schemes, two of which are the general sparse scheme and the banded scheme. It would appear that Scipy begins by storing your matrix as general sparse, where the indices of each nonzero element are explicitly stored; but that, on copy, Scipy notices that the banded representation is the more appropriate, since your matrix is, after all, banded.

What are the differences between these Numpy array creation functions? [duplicate]

What is the difference between NumPy's np.array and np.asarray? When should I use one rather than the other? They seem to generate identical output.
The definition of asarray is:
def asarray(a, dtype=None, order=None):
    return array(a, dtype, copy=False, order=order)
So it is like array, except it has fewer options, and copy=False. array has copy=True by default.
The main difference is that array (by default) will make a copy of the object, while asarray will not unless necessary.
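A quick check makes this concrete (np.shares_memory is a standard NumPy helper):
import numpy as np

a = np.zeros(5)
np.shares_memory(a, np.asarray(a))  # True: asarray returned a itself
np.shares_memory(a, np.array(a))    # False: array made a copy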
Since other questions are being redirected to this one which ask about asanyarray or other array creation routines, it's probably worth having a brief summary of what each of them does.
The differences are mainly about when to return the input unchanged, as opposed to making a new array as a copy.
array offers a wide variety of options (most of the other functions are thin wrappers around it), including flags to determine when to copy. A full explanation would take just as long as the docs (see Array Creation), but briefly, here are some examples:
Assume a is an ndarray, and m is a matrix, and they both have a dtype of float32:
np.array(a) and np.array(m) will copy both, because that's the default behavior.
np.array(a, copy=False) returns a unchanged. np.array(m, copy=False) does not copy the underlying data either, but because subok defaults to False it returns a base-class ndarray view of m rather than m itself (the asarray demonstration further down relies on exactly this sharing).
np.array(a, copy=False, subok=True) and np.array(m, copy=False, subok=True) will copy neither, because m is a matrix, which is a subclass of ndarray.
np.array(a, dtype=int, copy=False, subok=True) will copy both, because the dtype is not compatible.
Most of the other functions are thin wrappers around array that control when copying happens:
asarray: The input will be returned uncopied iff it's a compatible ndarray (copy=False).
asanyarray: The input will be returned uncopied iff it's a compatible ndarray or subclass like matrix (copy=False, subok=True).
ascontiguousarray: The input will be returned uncopied iff it's a compatible ndarray in contiguous C order (copy=False, order='C').
asfortranarray: The input will be returned uncopied iff it's a compatible ndarray in contiguous Fortran order (copy=False, order='F').
require: The input will be returned uncopied iff it's compatible with the specified requirements string.
copy: The input is always copied.
fromiter: The input is treated as an iterable (so, e.g., you can construct an array from an iterator's elements, instead of an object array with the iterator); always copied.
There are also convenience functions, like asarray_chkfinite (same copying rules as asarray, but raises ValueError if there are any nan or inf values), and constructors for subclasses like matrix or for special cases like record arrays, and of course the actual ndarray constructor (which lets you create an array directly out of strides over a buffer).
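To see the asarray/asanyarray distinction with a subclass, here is a small illustration (np.matrix is deprecated in recent NumPy, so treat this as illustrative only):
import numpy as np

m = np.matrix([[1., 2.], [3., 4.]])
type(np.asarray(m))     # <class 'numpy.ndarray'>: subclass stripped
type(np.asanyarray(m))  # <class 'numpy.matrix'>: subclass passed through
np.asanyarray(m) is m   # True: returned unchanged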
The difference can be demonstrated by this example:
Generate a matrix.
>>> A = numpy.matrix(numpy.ones((3, 3)))
>>> A
matrix([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]])
Use numpy.array to modify A. Doesn't work because you are modifying a copy.
>>> numpy.array(A)[2] = 2
>>> A
matrix([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]])
Use numpy.asarray to modify A. This works because you are modifying A itself.
>>> numpy.asarray(A)[2] = 2
>>> A
matrix([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 2.,  2.,  2.]])
The differences are spelled out in the documentation of array and asarray; they lie in the argument lists, and hence in what each function does depending on those parameters.
The function definitions are :
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
and
numpy.asarray(a, dtype=None, order=None)
The following arguments may be passed to array but not to asarray, as mentioned in the documentation:
copy : bool, optional If true (default), then the object is copied.
Otherwise, a copy will only be made if __array__ returns a copy, if
obj is a nested sequence, or if a copy is needed to satisfy any of the
other requirements (dtype, order, etc.).
subok : bool, optional If True, then sub-classes will be
passed-through, otherwise the returned array will be forced to be a
base-class array (default).
ndmin : int, optional Specifies the minimum number of dimensions that
the resulting array should have. Ones will be pre-pended to the shape
as needed to meet this requirement.
asarray(x) is like array(x, copy=False).
Use asarray(x) when you want to ensure that x is an array before any other operations are done. If x is already an array, no copy is made, so there is no redundant performance hit.
Here is an example of a function that ensures x is converted into an array first:
def mysum(x):
    return np.asarray(x).sum()
Here's a simple example that demonstrates the difference. The main difference is that array makes a copy of the original data by default, while asarray avoids the copy when it can; note, though, that a dtype conversion forces a copy even with asarray:
import numpy as np
a = np.arange(0.0, 10.2, 0.12)
int_cvr = np.asarray(a, dtype=np.int64)
The contents of the original array a remain untouched: because the requested dtype differs, asarray had to make a copy, so any operation on int_cvr leaves a unmodified.
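Continuing the snippet, a quick check confirms the copy (np.shares_memory is a standard NumPy helper; my addition):
int_cvr[0] = 999
a[0]                          # still 0.0: the original float array is untouched
np.shares_memory(a, int_cvr)  # False: the dtype conversion forced a copy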
Let's understand the difference between np.array() and np.asarray() with an example:
np.array(): Converts input data (list, tuple, array, or other sequence type) to an ndarray and copies the input data by default.
np.asarray(): Converts input data to an ndarray but does not copy if the input is already an ndarray.
# Create an array...
arr = np.ones(5)          # array([1., 1., 1., 1., 1.])
# Now try to modify arr through np.array(). Let's see...
np.array(arr)[3] = 200
arr                       # array([1., 1., 1., 1., 1.])
No change in the array, because we modified a copy of arr.
Now modify arr through np.asarray():
np.asarray(arr)[3] = 200
arr                       # array([  1.,   1.,   1., 200.,   1.])
The change occurs because asarray returned the original array itself, not a copy.

Normalise 2D Numpy Array: Zero Mean Unit Variance

I have a 2D Numpy array, in which I want to normalise each column to zero mean and unit variance. Since I'm primarily used to C++, my instinct is to loop over the elements in a column, do the necessary operations, and then repeat this for all columns. I wanted to know about a pythonic way to do it.
Let class_input_data be my 2D array. I can get the column mean as:
column_mean = numpy.sum(class_input_data, axis = 0)/class_input_data.shape[0]
I then subtract the mean from all columns by:
class_input_data = class_input_data - column_mean
By now, the data should be zero mean. However, the value of:
numpy.sum(class_input_data, axis = 0)
isn't equal to 0, implying that I have done something wrong in my normalisation. By isn't equal to 0, I don't mean very small numbers which can be attributed to floating point inaccuracies.
Something like:
import numpy as np
eg_array = 5 + (np.random.randn(10, 10) * 2)
normed = (eg_array - eg_array.mean(axis=0)) / eg_array.std(axis=0)
normed.mean(axis=0)
Out[14]:
array([  1.16573418e-16,  -7.77156117e-17,  -1.77635684e-16,
         9.43689571e-17,  -2.22044605e-17,  -6.09234885e-16,
        -2.22044605e-16,  -4.44089210e-17,  -7.10542736e-16,
         4.21884749e-16])
normed.std(axis=0)
Out[15]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
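One practical caveat, as an addition to this answer: if a column is constant, its standard deviation is 0 and the division produces NaNs. A guarded variant (a sketch):
import numpy as np

def normalize_columns(x):
    # Zero mean and unit variance per column; constant columns are left at zero.
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x - mean) / std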

Combining 2-d arrays to form a 3-d array

I'm defining a function which will return a 3-d grid. Inside it, I use an already-defined function that returns a 2-d array. I want to join these 2-d arrays to form the 3-d array during an iteration. I've looked at functions like meshgrid(), dstack(), and concatenate(), but can't seem to get any of them to fit right into the code.
The program models the spread of waves from a point source on the 2-d array, and the 3-d array shows how the displacement of the medium changes over the course of a wavelength.
def make_wave_snapshot(size, wavelength, phase):
    waves_array = np.zeros((size, size), np.float)
    if size % 2 == 0:
        for y in range(size):
            for x in range(size):
                r = math.hypot(size/2 - x - 0.5, size/2 - y - 0.5)
                d = np.sin((2*math.pi*r/wavelength) - phase)/np.sqrt(r)
                waves_array[y, x] = d
        dp.display_2d_array(waves_array)  # This is in another module altogether
        return waves_array  # Displays array showing values
    else:
        return 'Please use integer of size.'

def make_wave_sequence(size, wavelength, nsteps):
    waves_sequence = np.zeros((nsteps, size, size), np.float)
    if nsteps % 1 == 0:
        for z in range(nsteps):
            make_wave_snapshot(size, wavelength, 2*math.pi*z/nsteps)
            waves_sequence = ???
        return waves_sequence  # Displays array showing values
    else:
        return 'Please use positive integer for number of steps'
The issue is turning the wave_arrays into a wave_sequence. Generous commenting would be much appreciated if you write any code. Many thanks!
If I understand correctly you have a three dimensional array, something like:
wave = np.zeros((2, 2, 2), np.float)
([[[0., 0.],
   [0., 0.]],
  [[0., 0.],
   [0., 0.]]])
And you want to insert a two dimensional array, returned from your function like:
([[ 1., 2.],
  [ 3., 4.]])
Such that your 3D array is now:
([[[1., 2.],
   [3., 4.]],
  [[0., 0.],
   [0., 0.]]])
After the first iteration of your for loop. If that is correct, then it's actually pretty simple and you're most of the way there. You can assign an "element" to your 3D array that is a 2D array as long as you select the correct entry:
for z in range(nsteps):
    waves_sequence[z] = make_wave_snapshot(size, wavelength, 2*math.pi*z/nsteps)
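Putting it together, the corrected make_wave_sequence would look like this (a sketch reusing the question's make_wave_snapshot; assumes import numpy as np and import math):
def make_wave_sequence(size, wavelength, nsteps):
    # One (size, size) snapshot per step, stacked along axis 0.
    waves_sequence = np.zeros((nsteps, size, size), np.float64)
    for z in range(nsteps):
        # Each step advances the phase by a fraction of a full cycle.
        waves_sequence[z] = make_wave_snapshot(size, wavelength, 2*math.pi*z/nsteps)
    return waves_sequence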

How do I add a column to a python (matrix) multi-dimensional array? [duplicate]

Possible Duplicate:
What's the simplest way to extend a numpy array in 2 dimensions?
I've been frustrated as a Matlab user switching over to python because I don't know all the tricks and get stuck hacking together code until it works. Below is an example where I have a matrix that I want to add a dummy column to. Surely there is a simpler way than the zip-vstack-zip method below. It works, but it is totally a noob attempt. Please enlighten me. Thank you in advance for taking the time for this tutorial.
# BEGIN CODE
from pylab import *
# Find that unlike most things in python i must build a dummy matrix to
# add stuff in a for loop.
H = ones((4,10-1))
print "shape(H):"
print shape(H)
print H
### enter for loop to populate dummy matrix with interesting data...
# stuff happens in the for loop, which is awesome and off topic.
### exit for loop
# more awesome stuff happens...
# Now I need a new column on H
H = zip(*vstack((zip(*H),ones(4)))) # THIS SEEMS LIKE THE DUMB WAY TO DO THIS...
print "shape(H):"
print shape(H)
print H
# in conclusion. I found a hack job solution to adding a column
# to a numpy matrix, but I'm not happy with it.
# Could someone educate me on the better way to do this?
# END CODE
Use np.column_stack:
In [12]: import numpy as np
In [13]: H = np.ones((4,10-1))
In [14]: x = np.ones(4)
In [15]: np.column_stack((H,x))
Out[15]:
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [16]: np.column_stack((H,x)).shape
Out[16]: (4, 10)
There are several functions that let you concatenate arrays in different dimensions:
np.vstack along axis=0
np.hstack along axis=1
np.dstack along axis=2
In your case, np.hstack looks like what you want. np.column_stack stacks a set of 1D arrays as a 2D array, but you already have a 2D array to start with.
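For completeness, the np.hstack version needs the new column as a 2D (4, 1) array (a small illustration):
import numpy as np

H = np.ones((4, 9))
x = np.ones((4, 1))        # note the 2D shape: one column, not a flat vector
H_new = np.hstack((H, x))  # same result as np.column_stack((H, np.ones(4)))
H_new.shape                # (4, 10)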
Of course, nothing prevents you from doing it the hard way:
>>> new = np.empty((a.shape[0], a.shape[1]+1), dtype=a.dtype)
>>> new.T[:a.shape[1]] = a.T
Here, we created an empty array with an extra column, then used some tricks to set the first columns to a (using the transpose operator T, so that new.T has an extra row compared to a.T...). Note that the last column of new is still uninitialized at this point; it has to be filled in separately, e.g. new[:, -1] = 1.
