How can I remove a column from a sparse matrix efficiently? - python

If I am using the sparse.lil_matrix format, how can I remove a column from the matrix easily and efficiently?

Much simpler and faster: just slice out the columns you want to keep (here col_list is the list of column indices to keep). You might not even need the conversion to CSR, but I know for sure that it works with CSR sparse matrices, and converting between formats shouldn't be an issue.
from scipy import sparse
x_new = sparse.lil_matrix(sparse.csr_matrix(x)[:,col_list])
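If what you have instead is a list of columns to drop, you can build the keep-list first; a quick sketch with made-up data (cols_to_drop is an assumed input, not part of the answer above):
import numpy as np
from scipy import sparse

x = sparse.lil_matrix(np.arange(20).reshape(4, 5))   # hypothetical 4x5 matrix
cols_to_drop = [1, 3]
col_list = [c for c in range(x.shape[1]) if c not in cols_to_drop]   # columns to keep
x_new = sparse.lil_matrix(sparse.csr_matrix(x)[:, col_list])
print(x_new.shape)   # (4, 3)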

I've been wanting this myself, and in truth there isn't a great built-in way to do it yet. Here's one approach: I chose to make a subclass of lil_matrix and add a removecol method. If you want, you can instead add removecol to the lil_matrix class in your lib/site-packages/scipy/sparse/lil.py file. Here's the code:
from scipy import sparse
from bisect import bisect_left
class lil2(sparse.lil_matrix):
    def removecol(self, j):
        if j < 0:
            j += self.shape[1]
        if j < 0 or j >= self.shape[1]:
            raise IndexError('column index out of bounds')
        rows = self.rows
        data = self.data
        for i in range(self.shape[0]):
            # find where column j would sit in this row's sorted index list
            pos = bisect_left(rows[i], j)
            if pos == len(rows[i]):
                continue
            elif rows[i][pos] == j:
                # column j has an entry in this row; drop it
                rows[i].pop(pos)
                data[i].pop(pos)
                if pos == len(rows[i]):
                    continue
            # shift the remaining column indices left by one
            for pos2 in range(pos, len(rows[i])):
                rows[i][pos2] -= 1
        self._shape = (self._shape[0], self._shape[1] - 1)
I have tried it out and don't see any bugs. I certainly think that it is better than slicing the column out, which just creates a new matrix as far as I know.
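A quick usage sketch (with made-up data) to show the class in action:
import numpy as np

m = lil2(np.arange(12).reshape(3, 4))   # hypothetical 3x4 matrix
m.removecol(1)
print(m.shape)       # (3, 3)
print(m.toarray())   # column 1 is gone, removed in place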
I decided to make a removerow function as well, but I don't think that it is as good as removecol. I'm limited by not being able to remove one row from an ndarray in the way that I would like. Here is removerow, which can be added to the above class (note that it also needs numpy imported):
def removerow(self, i):
    if i < 0:
        i += self.shape[0]
    if i < 0 or i >= self.shape[0]:
        raise IndexError('row index out of bounds')
    # rows and data hold one Python list per matrix row; delete row i from both
    self.rows = numpy.delete(self.rows, i, 0)
    self.data = numpy.delete(self.data, i, 0)
    self._shape = (self._shape[0] - 1, self.shape[1])
Perhaps I should submit these functions to the Scipy repository.

For a sparse csr matrix (X) and a list of indices to drop (index_to_drop):
to_keep = sorted(set(range(X.shape[1])) - set(index_to_drop))  # keep remaining columns in order
new_X = X[:,to_keep]
It is easy to convert lil_matrices to csr_matrices; check tocsr() in the lil_matrix documentation.
Note, however, that going from CSR back to LIL using tolil() is expensive, so this choice is good when you do not need to keep your matrix in LIL format.

I'm new to Python so my answer is probably wrong, but I was wondering why something like the following wouldn't be efficient.
Let's say your lil_matrix is called mat and that you want to remove the i-th column:
from scipy.sparse import hstack
mat = hstack([mat[:, 0:i], mat[:, i+1:]])
The result will be a coo_matrix, but you can turn it back into a lil_matrix.
OK, I understand that this has to create the two matrices inside the hstack before assigning to the mat variable, so it is like having the original matrix plus one more in memory at the same time. But if the sparsity is big enough, I don't think there should be any memory problems, since memory (and time) is the whole reason for using sparse matrices.
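For what it's worth, a minimal sketch of that idea wrapped in a helper (remove_col is just a name introduced here, not a scipy function):
from scipy import sparse
from scipy.sparse import hstack

def remove_col(mat, i):
    """Return a copy of the sparse matrix with column i removed."""
    return sparse.lil_matrix(hstack([mat[:, :i], mat[:, i+1:]]))
Calling it repeatedly rebuilds the matrix each time, so it is best suited to removing just a few columns.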

def removecols(W, col_list):
    if min(col_list) < 0 or max(col_list) >= W.shape[1]:
        raise IndexError('column index out of bounds')
    rows = W.rows
    data = W.data
    for i in range(W.shape[0]):
        # remove the higher column indices first so earlier removals
        # don't shift the positions of columns still to be removed
        for j in sorted(col_list, reverse=True):
            pos = bisect_left(rows[i], j)
            if pos == len(rows[i]):
                continue
            elif rows[i][pos] == j:
                rows[i].pop(pos)
                data[i].pop(pos)
                if pos == len(rows[i]):
                    continue
            for pos2 in range(pos, len(rows[i])):
                rows[i][pos2] -= 1
    W._shape = (W._shape[0], W._shape[1] - len(col_list))  # assumes col_list entries are unique
    return W
Just rewrote your code to work with col_list as input - maybe this will be helpful for somebody.

Looking at the notes for each sparse matrix format (in our case, the CSC matrix), the documentation [1] lists the following advantages:
efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
efficient column slicing
fast matrix vector products (CSR, BSR may be faster)
If you have the column indices you want to remove, just use slicing to select the others; see the sketch below.
For removing rows, use a CSR matrix instead, since it is efficient at row slicing.
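A minimal sketch of the column-slicing idea with made-up data (cols_to_drop is an assumed input):
import numpy as np
from scipy import sparse

X = sparse.csc_matrix(np.arange(20).reshape(4, 5))   # hypothetical 4x5 matrix
cols_to_drop = [0, 2]
keep = np.setdiff1d(np.arange(X.shape[1]), cols_to_drop)
X_new = X[:, keep]     # column slicing is efficient on CSC
print(X_new.shape)     # (4, 3)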

Related

Defining a matrix with unknown size in python

I want to use a matrix in my Python code but I don't know the exact size of my matrix to define it.
For other matrices, I have used np.zeros(a), where a is known.
What should I do to define a matrix with unknown size?
In this case, maybe an approach is to use a Python list and append to it until it has the desired size, then cast it to a NumPy array.
pseudocode:
matrix = []
while matrix not full:
    matrix.append(elt)
matrix = np.array(matrix)
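Concretely, the same idea with a made-up loop producing one row per iteration:
import numpy as np

rows = []
for i in range(5):                 # stand-in for "while matrix not full"
    rows.append([i, i**2, i**3])   # each iteration appends one row
matrix = np.array(rows)
print(matrix.shape)                # (5, 3)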
You could write a function that tries to modify the np.array, and expand if it encounters an IndexError:
val = 1.0                           # the value you want to assign
x = np.random.normal(size=(2, 2))
r, c = (5, 10)
try:
    x[r, c] = val
except IndexError:
    r0, c0 = x.shape
    r_ = r + 1 - r0                 # how many rows are missing
    c_ = c + 1 - c0                 # how many columns are missing
    if r_ > 0:
        x = np.concatenate([x, np.zeros((r_, x.shape[1]))], axis=0)
    if c_ > 0:
        x = np.concatenate([x, np.zeros((x.shape[0], c_))], axis=1)
    x[r, c] = val                   # retry the assignment on the grown array
There are problems with this implementation though: first, it makes a copy of the array and returns a concatenation of it, which is a possible bottleneck if you use it many times. Second, the code I provided only works if you're modifying a single element. You could extend it to slices, which would take more effort, or you can go the whole nine yards and create a new class inheriting from np.ndarray and override the __getitem__ and __setitem__ methods.
Or you could just use a huge matrix, or better yet, see if you can avoid having to work with matrices of unknown size.
If you have a python generator you can use np.fromiter:
def gen():
    yield 1
    yield 2
    yield 3
In [11]: np.fromiter(gen(), dtype='int64')
Out[11]: array([1, 2, 3])
Beware if you pass an infinite iterator you will most likely crash python, so it's often a good idea to cap the length (with the count argument):
In [21]: from itertools import count # an infinite iterator
In [22]: np.fromiter(count(), dtype='int64', count=3)
Out[22]: array([0, 1, 2])
Best practice is usually to either pre-allocate (if you know the size) or build the array as a list first (using list.append). But lists don't build in 2d very well, which I assume you want since you specified a "matrix."
In that case, I'd suggest pre-allocating an oversize scipy.sparse matrix. These can be defined to have a size much larger than your memory, and lil_matrix or dok_matrix can be built sequentially. Then you can pare it down once you enter all of your data.
import numpy as np
from scipy.sparse import dok_matrix

dummy = dok_matrix((1000000, 1000000))   # as big as you think you might need
for i, j, data in generator():           # generator() stands in for your data source
    dummy[i, j] = data

# pare it down to the largest index actually used
s = np.array(list(dummy.keys())).max() + 1
M = dummy.tocsr()[:s, :s]                # slice in CSR; convert to coo/bsr/dense afterwards if needed
This way you build your array as a Dictionary of Keys (dictionaries support dynamic assignment much better than ndarray does), but you still have a matrix-like output that can be (somewhat) efficiently used for math, even in a partially built state.

numpy fill an array with arrays

I want to combine an unspecified (finite) number of matrices under a Kroneckerproduct. In order to do this I want to save the matrices in an array but I don't know how to do this. At the moment I have:
for i in range(LNew-2):
    for j in range(LNew-2):
        Bulk = np.empty(shape=(LNew-1,LNew-1))
        if i == j:
            Bulk[i,j]=H2
        else:
            Bulk[i,j]=idm
Here H2 and idm are both matrices, which I want to combine under a Kronecker product. But since Bulk is an ndarray object, I suppose it won't accept array-like objects inside it.
edit:
This is the function in which I want to use this idea. I am using it to build a Hamiltonian matrix for a quantum spin chain. So H2 is the Hamiltonian for a two-particle chain (a 4x4 matrix) and idm is the 2x2 identity matrix.
and now the three particle chain is np.kron(H2,idm)+np.kron(idm,H2)
and for four particles
np.kron(np.kron(H2,idm),idm)+np.kron(idm,np.kron(H2,idm))+np.kron(idm,np.kron(idm,H2)) and so on.
def ExpandHN(LNew):
    idm = np.identity(2)
    H2 = GetH(2,'N')
    HNew = H2
    for i in range(LNew-2):
        for j in range(LNew-2):
            Bulk = np.empty(shape=(LNew-1,LNew-1))
            if i == j:
                Bulk[i,j]=H2
            else:
                Bulk[i,j]=idm
    i = 0
    for i in range(LNew-2):
        for j in range(LNew-3):
            HNew += np.kron(Bulk[i,j],Bulk[i,j+1]) #something like this
    return HNew
As you can see, the second set of for loops hasn't been worked out.
That being said, if someone has a totally different but working solution I would be happy with that too.
If I understand correctly, your question boils down to how to create arrays of arrays with numpy. I would suggest using the standard Python dict:
Bulk = dict()
for i in range(LNew-2):
    for j in range(LNew-2):
        if i == j:
            Bulk[(i,j)] = H2
        else:
            Bulk[(i,j)] = idm
The usage of tuples as keys allows you to maintain an array-like indexing of the matrices.
Also note that you should define Bulk outside of the two for loops (in any case).
HTH
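For what it's worth, here is a minimal sketch of how the sum of Kronecker products described in the question could be built directly, without storing the factors at all. It assumes GetH(2,'N') returns the 4x4 two-site Hamiltonian, as in the question:
import numpy as np

def ExpandHN(LNew):
    # sketch: sum of Kronecker products with H2 placed at each bond position
    idm = np.identity(2)
    H2 = GetH(2, 'N')                 # assumed: the asker's 4x4 two-site Hamiltonian
    HNew = np.zeros((2**LNew, 2**LNew))
    for pos in range(LNew - 1):       # position of the H2 factor in the chain
        term = H2 if pos == 0 else idm
        for k in range(1, LNew - 1):
            term = np.kron(term, H2 if k == pos else idm)
        HNew += term
    return HNew
For LNew = 3 this reproduces np.kron(H2,idm) + np.kron(idm,H2), and for LNew = 4 the three-term sum given above.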

Vectorization in Numpy - Broadcasting

I have a code in python with the following elements:
I have an intensities vector which is something like this:
array([ 1142., 1192., 1048., ..., 29., 18., 35.])
I have also an x vector which looks like this:
array([ 0, 1, 1, ..., 1060, 1060, 1061])
Then, I have the for loop where I fill another vector, radialDistribution like this:
for i in range(1000):
    radialDistribution[i] = sum(intensities[np.where(x == i)]) / len(np.where(x == i)[0])
The problem is that it takes 20 seconds to complete... therefore I want to vectorize it. But I am quite new to broadcasting in Numpy and didn't find much out there... hence I need your help.
I tried this, but didn't work:
i= np.ogrid[:1000]
intensities[i] = sum(sortedIntensities1D[np.where(sortedDists1D == i)]) / len(np.where(sortedDists1D == i)[0])
Could you help me just telling me where should I look to learn the vectorization procedures with Numpy?
Thanks in advance for your valuable help!
If your x vector has consecutive integers starting at 0, then you can simply do:
radialDistribution = np.bincount(x, weights=intensities) / np.bincount(x)
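A quick check of that on a toy example (made-up data):
import numpy as np

x = np.array([0, 1, 1, 2, 2, 2])
intensities = np.array([10., 20., 30., 5., 5., 20.])
radialDistribution = np.bincount(x, weights=intensities) / np.bincount(x)
print(radialDistribution)   # [10. 25. 10.]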
Here is my implementation of group_by functionality in numpy. It is conceptually similar to the pandas solution, except that it does not require pandas, and ought to become a part of the numpy core, in my opinion.
Using this functionality, your code would look like this:
radialDistribution = group_by(x).mean(intensities)
and would complete in no time.
Look also at the test_radial function defined at the end, which may come even closer to your end goal.
Here's a method that uses broadcasting:
# arrays need to be at least 2D for broadcasting
x = np.atleast_2d(x)
# create vector of indices
i = np.atleast_2d(np.arange(x.size))
# do the vectorized calculation
bool_eq = (x == i.T)
totals = np.sum(np.where(bool_eq, intensities, 0), axis=1)
rD = totals / np.sum(bool_eq, axis=1)
This uses broadcasting two times: in the operation x == i.T and in the call to np.where. Unfortunately the code above is very slow, even slower than the original. The main bottleneck here is np.where, which we can speed up in this case by taking the product of the Boolean array and the intensities (also by broadcasting):
totals = np.sum(bool_eq*intensities, axis=1)
And this is essentially the same as a matrix-vector product, so we can write:
totals = np.dot(intensities, bool_eq.T)
The end result is a faster code than the original (at least until the memory use for the intermediary array becomes the limiting factor), but you're probably better off with an iterative approach, as suggested by one of the other answers.
Edit: making use of np.einsum was faster still (in my trial):
totals = np.einsum('ij,j', bool_eq, intensities)
Building on my itertools.groupby solution in https://stackoverflow.com/a/22265803/901925 here's a solution that works on 2 small arrays.
import numpy as np
import itertools
intensities = np.arange(12,dtype=float)
x=np.array([1,0,1,2,2,1,0,0,1,2,1,0]) # general, not sorted or consecutive
first a bincount solution, adjusted for nonconsecutive values
# using bincount
# if 'x' are not consecutive
J=np.bincount(x)>0
print(np.bincount(x, weights=intensities)[J] / np.bincount(x)[J])
Now a groupby solution
# using groupby;
# sort if needed
I=np.argsort(x)
x=x[I]
intensities=intensities[I]
# make a record array for use by groupby
xi=np.zeros(shape=x.shape, dtype=[('intensities',float),('x',int)])
xi['intensities']=intensities
xi['x']=x
g=itertools.groupby(xi, lambda z:z['x'])
xx=np.array([np.array([z[0] for z in y[1]]).mean() for y in g])
print(xx)
Here's a compact numpy solution, using the return_index option of np.unique, and np.split. x should be sorted. I'm not optimistic about the speed for large arrays, since there will be iteration in unique and split in addition to the comprehension.
[values, index] = np.unique(x, return_index=True)
[y.mean() for y in np.split(intensities, index[1:])]

Null numpy array to be appended to

I'm writing feature selection code. Basically, I get the output from the featureselection function and concatenate it to the numpy array data:
data = np.zeros([1,4114]) # put feature length here
for i in range(1,N):
    filename = splitpath+str(i)+'.tiff'
    feature = featureselection(filename)
    data = np.vstack((data, feature))
data = data[1:,:] # remove the first zeros row
However, this is not a robust implementation as I need to know feature length (4114) beforehand.
Is there any null numpy array matrix, like in Python list we have []?
Appending to a numpy array in a loop is inefficient. There might be some situations when it cannot be avoided, but this doesn't seem to be one of them. If you know the size of the array that you'll end up with, it's best to just pre-allocate the array, something like this:
data = np.zeros([N, 4114])
for i in range(1, N):
    filename = splitpath+str(i)+'.tiff'
    feature = featureselection(filename)
    data[i] = feature
Sometimes you don't know the size of the final array. There are several ways to deal with this case, but the simplest is probably to use a temporary list, something like:
data = []
for i in range(1,N):
    filename = splitpath+str(i)+'.tiff'
    feature = featureselection(filename)
    data.append(feature)
data = np.array(data)
Just for completeness, you can also do data = np.zeros([0, 4114]), but I would recommend against that and suggest one of the methods above.
If you don't want to assume the size before creating the first array, you can use lazy initialization.
data = None
for i in range(1,N):
    filename = splitpath+str(i)+'.tiff'
    feature = featureselection(filename)
    if data is None:
        data = np.zeros((0, feature.size))
    data = np.vstack((data, feature))

if data is None:
    print('no features')
else:
    print(data.shape)

Numpy : how to increment values of an indexed array? [duplicate]

My question is about a specific array operation that I want to express using numpy.
I have an array of floats w and an array of indices idx of the same length as w and I want to sum up all w with the same idx value and collect them in an array v.
As a loop, this looks like this:
for i, x in enumerate(w):
    v[idx[i]] += x
Is there a way to do this with array operations?
My guess was v[idx] += w but that does not work, since idx contains the same index multiple times.
Thanks!
numpy.bincount was introduced for this purpose:
tmp = np.bincount(idx, w)
v[:len(tmp)] += tmp
I think as of 1.6 you can also pass a minlength to bincount.
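For example, with minlength you can skip the slicing step entirely; a small sketch with made-up data:
import numpy as np

w = np.array([0.5, 1.0, 2.0, 0.25])
idx = np.array([1, 3, 1, 0])
v = np.zeros(5)
v += np.bincount(idx, weights=w, minlength=len(v))
print(v)   # [0.25 2.5  0.   1.   0.  ]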
This is a known behavior and, though somewhat unfortunate, does not have a numpy-level workaround. (bincount can be used for this if you twist its arm.) Doing the loop yourself is really your best bet.
Note that your code might have been a bit more clear without re-using the name w and without introducing another set of indices, like
for i, w_thing in zip(idx, w):
    v[i] += w_thing
If you need to speed up this loop, you might have to drop down to C. Cython makes this relatively easy.
