I know that scipy.sparse.find(A) returns 3 arrays I, J, V, containing the rows, columns, and values of the nonzero elements respectively.
What I want is a way to do the same (except for the V array) for all zero elements, without having to iterate through the matrix, since it's too large.
Make a small sparse matrix with 10% density:
In [1]: from scipy import sparse
In [2]: M = sparse.random(10,10,.1)
In [3]: M
Out[3]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in COOrdinate format>
The 10 nonzero values:
In [5]: sparse.find(M)
Out[5]:
(array([6, 4, 1, 2, 3, 0, 1, 6, 9, 6], dtype=int32),
array([1, 2, 3, 3, 3, 4, 4, 4, 5, 8], dtype=int32),
array([ 0.91828586, 0.29763717, 0.12771201, 0.24986069, 0.14674883,
0.56018409, 0.28643427, 0.11654358, 0.8784731 , 0.13253971]))
If, out of the 100 elements of the matrix, 10 are nonzero, then 90 elements are zero. Do you really want the indices of all of those?
np.where or np.nonzero on the dense equivalent gives the same indices:
In [6]: A = M.A # dense
In [7]: np.where(A)
Out[7]:
(array([0, 1, 1, 2, 3, 4, 6, 6, 6, 9], dtype=int32),
array([4, 3, 4, 3, 3, 2, 1, 4, 8, 5], dtype=int32))
And the indices of the 90 zero values:
In [8]: np.where(A==0)
Out[8]:
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9], dtype=int32),
array([0, 1, 2, 3, 5, 6, 7, 8, 9, 0, 1, 2, 5, 6, 7, 8, 9, 0, 1, 2, 4, 5, 6,
7, 8, 9, 0, 1, 2, 4, 5, 6, 7, 8, 9, 0, 1, 3, 4, 5, 6, 7, 8, 9, 0, 1,
2, 3, 4, 5, 6, 7, 8, 9, 0, 2, 3, 5, 6, 7, 9, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 6, 7, 8, 9], dtype=int32))
That's 2 arrays of shape (90,), 180 integers, as opposed to the 100 values in the dense array itself. If your sparse matrix is too large to convert to dense, it will be too large to produce all the zero indices (assuming reasonable sparsity).
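If you really do need them, here's a minimal sketch of producing the zero coordinates without building the dense array, via a set difference on flattened indices (zero_indices is a hypothetical helper name; the output is still O(nrows*ncols), as warned above):
import numpy as np
from scipy import sparse

def zero_indices(M):
    # indices of the genuinely nonzero entries (find drops explicit zeros)
    I, J, _ = sparse.find(M)
    nz_flat = I.astype(np.int64) * M.shape[1] + J
    # every flat index of the matrix, minus the nonzero ones
    all_flat = np.arange(M.shape[0] * M.shape[1], dtype=np.int64)
    z_flat = np.setdiff1d(all_flat, nz_flat)
    return np.unravel_index(z_flat, M.shape)
This gives the same (rows, cols) tuple as np.where(A==0) above, in the same row-major order.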
print(M) shows the same triplets as find. The attributes of the COO format also give the nonzero indices:
In [13]: M.row
Out[13]: array([6, 6, 3, 4, 1, 6, 9, 2, 1, 0], dtype=int32)
In [14]: M.col
Out[14]: array([1, 4, 3, 2, 3, 8, 5, 3, 4, 4], dtype=int32)
(Sometimes manipulation of a matrix can set values to 0 without removing them from the attributes. So find/nonzero takes an added step to remove those, if any.)
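A minimal sketch of that situation, assuming a CSR copy of the example M above (whose (0, 4) entry is nonzero):
Mc = M.tocsr()
Mc[0, 4] = 0          # overwrites a stored value with 0; the entry stays in the structure
Mc.nnz                # still 10: the explicit zero is counted as stored
Mc.eliminate_zeros()  # prunes stored zeros so nonzero/find agree with the values again
Mc.nnz                # 9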
We could apply find to M==0 as well - but sparse will give us a warning.
In [15]: sparse.find(M==0)
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
", try using != instead.", SparseEfficiencyWarning)
It's the same thing that I've been warning about - the large size of this set. The resulting arrays are the same as in Out[8].
Assuming you have a scipy sparse array:
from itertools import product
from scipy.sparse import find

I, J, _ = find(your_sparse_array)
nonzero = set(zip(I, J))   # a set: membership tests are O(1) and, unlike a bare zip iterator, reusable
nrows, ncols = your_sparse_array.shape
for a, b in product(range(nrows), range(ncols)):
    if (a, b) not in nonzero:
        print(a, b)
This question already has an answer here:
Numpy to list over 2nd axis
(1 answer)
Closed 2 years ago.
Let a numpy array have shape (x, y, z).
I want it to have shape (x, y), with every element being a list of length z: [a, b, c, ..., z].
Is there any way to do it with numpy methods?
You can use tolist and assign to a preallocated object array:
import numpy as np

a = np.random.randint(0, 10, (100, 100, 100))

def f():
    A = np.empty(a.shape[:-1], object)   # preallocate a (100, 100) array of object dtype
    A[...] = a.tolist()                  # elementwise assignment keeps the inner lists intact
    return A
f()[99,99]
# [4, 5, 9, 2, 8, 9, 9, 6, 8, 5, 7, 9, 8, 7, 6, 1, 9, 6, 2, 9, 0, 7, 0, 1, 2, 8, 4, 4, 7, 0, 1, 2, 3, 8, 9, 6, 0, 1, 4, 7, 0, 7, 9, 3, 9, 1, 8, 7, 1, 2, 3, 6, 6, 2, 7, 0, 2, 8, 7, 0, 0, 1, 8, 2, 6, 3, 5, 4, 9, 6, 9, 0, 2, 5, 9, 5, 3, 7, 0, 1, 9, 0, 8, 2, 0, 7, 3, 6, 9, 9, 4, 4, 3, 8, 4, 7, 4, 2, 1, 8]
type(f()[99,99])
# <class 'list'>
from timeit import timeit
timeit(f,number=100)*10
# 28.67872992530465
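For contrast, a naive conversion recurses into the regular nested lists and rebuilds the full 3-D shape, which is why the preallocation above is needed (a sketch using the same a):
bad = np.array(a.tolist(), dtype=object)
bad.shape
# (100, 100, 100) - numpy recursed into the inner lists instead of stopping at 2-D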
I can't imagine why numpy would need such a method. Here is, more or less, a Pythonic solution.
import numpy as np

# an example array with shape [2,3,4]
a = np.random.random([2, 3, 4])

# create the target array shaped [2,3] with 'object' type (accepting other types than numbers)
b = np.array([[None for row in mat] for mat in a])

for i in range(b.shape[0]):
    for j in range(b.shape[1]):
        b[i, j] = list(a[i, j])
Given some numpy array a
array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
what is the best way to get all groups of n indices with each of them having a different value in a?
Obviously there is no group larger than the number of unique elements in a, here 4.
So for example, one group of size 4 is
array([0,2,5,13])
Consider that a might be quite long, let's say up to 250k.
If the result gets too large, it might also be desirable not to compute all such groups, but only the first k requested.
For integer inputs, we can use a solution based on this post: argsort the array, count each value with np.bincount, and use the running sum of the counts to locate where each value's block starts in sorted order -
In [41]: sidx = a.argsort() # use kind='mergesort' for first occurrences
In [42]: c = np.bincount(a)
In [43]: np.sort(sidx[np.r_[0,(c[c!=0])[:-1].cumsum()]])
Out[43]: array([ 0, 2, 5, 13])
Another method, closely related to the previous one, works for generic (not necessarily integer) inputs -
In [44]: b = a[sidx]
In [45]: np.sort(sidx[np.r_[True,b[:-1]!=b[1:]]])
Out[45]: array([ 0, 2, 5, 13])
Another option uses numba, for memory efficiency and hence performance too; it selects the first index of each unique value and supports the additional k arg -
import numpy as np
from numba import njit

@njit
def _numba1(a, notfound, out, k):
    iterID = 0
    for i, e in enumerate(a):
        if notfound[e]:
            notfound[e] = False
            out[iterID] = i
            iterID += 1
            if iterID >= k:
                break
    return out

def unique_elems(a, k, maxnum=None):
    # feed in max of the input array as maxnum value if known
    if maxnum is None:
        L = a.max() + 1
    else:
        L = maxnum + 1
    notfound = np.ones(L, dtype=bool)
    out = np.ones(k, dtype=a.dtype)
    return _numba1(a, notfound, out, k)
Sample run -
In [16]: np.random.seed(0)
...: a = np.random.randint(0,10,200)
In [17]: a
Out[17]:
array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5, 9,
8, 9, 4, 3, 0, 3, 5, 0, 2, 3, 8, 1, 3, 3, 3, 7, 0, 1, 9, 9, 0, 4,
7, 3, 2, 7, 2, 0, 0, 4, 5, 5, 6, 8, 4, 1, 4, 9, 8, 1, 1, 7, 9, 9,
3, 6, 7, 2, 0, 3, 5, 9, 4, 4, 6, 4, 4, 3, 4, 4, 8, 4, 3, 7, 5, 5,
0, 1, 5, 9, 3, 0, 5, 0, 1, 2, 4, 2, 0, 3, 2, 0, 7, 5, 9, 0, 2, 7,
2, 9, 2, 3, 3, 2, 3, 4, 1, 2, 9, 1, 4, 6, 8, 2, 3, 0, 0, 6, 0, 6,
3, 3, 8, 8, 8, 2, 3, 2, 0, 8, 8, 3, 8, 2, 8, 4, 3, 0, 4, 3, 6, 9,
8, 0, 8, 5, 9, 0, 9, 6, 5, 3, 1, 8, 0, 4, 9, 6, 5, 7, 8, 8, 9, 2,
8, 6, 6, 9, 1, 6, 8, 8, 3, 2, 3, 6, 3, 6, 5, 7, 0, 8, 4, 6, 5, 8,
2, 3])
In [19]: unique_elems(a, k=6)
Out[19]: array([0, 1, 2, 4, 5, 8])
Use np.unique for this job. It has several optional outputs; for instance, it can also return the number of times each unique item appears in a.
import numpy as np
# Sample data
a = np.array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
# The unique values are in 'u'
# The indices of the first occurrence of the unique values are in 'indices'
u, indices = np.unique(a, return_index=True)
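For example, also asking for the counts (the values shown assume the sample a above):
u, indices, counts = np.unique(a, return_index=True, return_counts=True)
u        # array([0, 1, 2, 3])
indices  # array([ 5, 13,  0,  2])
counts   # array([5, 2, 6, 3])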
I have a large numpy array of size 100x100. Among these 10000 values, there are only about 50 unique values. So I want to create a second array of length 50, containing these unique values, and then somehow map the large array to the smaller array. Effectively, I want to store just 50 values in my system instead of redundant 10000 values.
Slices of arrays seem to share memory, but as soon as I use specific indexing, memory sharing is lost.
a = np.array([1,2,3,4,5])
b = a[:3]
indices = [0,1,2]
c = a[indices]
print(b,c)
print(np.shares_memory(a,b),np.shares_memory(a,c))
This gives the output:
[1 2 3] [1 2 3]
True False
Even though b and c refer to the same values of a, b (the slice) shares memory with a, while c doesn't. If I execute b[0] = 100, a[0] also becomes 100, since they share memory. That is not the case with c.
I want to make c, which is a collection of values which are all from a, share memory with a.
In general it is not possible to save memory in this way. The reason is that your data consists of 64-bit integers, and pointers are also 64-bit integers, so if you try to store each value exactly once in some auxiliary array and then point at those values, you will end up using basically the same amount of space.
The answer would be different if, for example, some of your arrays were subsets of other ones, or if you were storing large types like long strings.
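A minimal sketch of that accounting, assuming default 64-bit integers (the numbers shift on platforms whose default int is 32-bit); a saving only appears once the index dtype is smaller than the value dtype:
import numpy as np

a = np.random.randint(0, 50, (100, 100))
u, inv = np.unique(a, return_inverse=True)

a.nbytes                       # 80000: 10000 int64 values
u.nbytes + inv.nbytes          # ~80400: one 64-bit index per element, no saving

inv8 = inv.astype(np.uint8)    # 50 unique values fit in uint8
u.nbytes + inv8.nbytes         # ~10400: now the indices are cheap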
So make a random array with a small set of unique values:
In [45]: x = np.random.randint(0,10,(10,10))
In [46]: x
Out[46]:
array([[4, 3, 8, 5, 4, 8, 8, 1, 8, 1],
[9, 2, 7, 2, 9, 5, 3, 9, 3, 3],
[6, 2, 6, 9, 4, 2, 3, 4, 6, 7],
[1, 0, 2, 1, 0, 9, 4, 2, 6, 2],
[8, 1, 6, 8, 3, 9, 5, 0, 8, 5],
[4, 9, 1, 4, 1, 2, 8, 4, 7, 2],
[4, 5, 2, 4, 8, 0, 1, 4, 4, 7],
[2, 2, 0, 5, 3, 0, 3, 3, 3, 9],
[3, 1, 0, 6, 4, 8, 8, 3, 5, 2],
[7, 5, 9, 2, 8, 0, 8, 1, 7, 8]])
Find the unique ones:
In [48]: np.unique(x)
Out[48]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Better yet, get the unique values plus an array that lets us map those values onto the original:
In [49]: np.unique(x, return_inverse=True)
Out[49]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([4, 3, 8, 5, 4, 8, 8, 1, 8, 1, 9, 2, 7, 2, 9, 5, 3, 9, 3, 3, 6, 2,
6, 9, 4, 2, 3, 4, 6, 7, 1, 0, 2, 1, 0, 9, 4, 2, 6, 2, 8, 1, 6, 8,
3, 9, 5, 0, 8, 5, 4, 9, 1, 4, 1, 2, 8, 4, 7, 2, 4, 5, 2, 4, 8, 0,
1, 4, 4, 7, 2, 2, 0, 5, 3, 0, 3, 3, 3, 9, 3, 1, 0, 6, 4, 8, 8, 3,
5, 2, 7, 5, 9, 2, 8, 0, 8, 1, 7, 8]))
There's a value in the inverse mapping for each element in the original.
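So the pair reconstructs the original; a quick sketch (older numpy returns the inverse flattened, newer versions may return it already in x's shape, so the reshape is harmless either way):
u, inv = np.unique(x, return_inverse=True)
x_back = u[inv].reshape(x.shape)   # map each inverse index back to its value
np.array_equal(x_back, x)          # True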
This question already has answers here:
Sort invariant for numpy.argsort with multiple dimensions
(3 answers)
Closed 3 years ago.
When we have a 1D numpy array, we can sort it the following way:
>>> temp = np.random.randint(1,10, 10)
>>> temp
array([5, 1, 1, 9, 5, 2, 8, 7, 3, 9])
>>> sort_inds = np.argsort(temp)
>>> sort_inds
array([1, 2, 5, 8, 0, 4, 7, 6, 3, 9], dtype=int64)
>>> temp[sort_inds]
array([1, 1, 2, 3, 5, 5, 7, 8, 9, 9])
Note: I know I can do this using np.sort; obviously, I need the sorting indices for a different array - this is just a simple example. Now we can continue to my actual question.
I tried to apply the same approach for a 2D array:
>>> d = np.random.randint(1,10,(5,10))
>>> d
array([[1, 6, 8, 4, 4, 4, 4, 4, 4, 8],
[3, 6, 1, 4, 5, 5, 2, 1, 8, 2],
[1, 2, 6, 9, 8, 6, 9, 2, 5, 8],
[8, 5, 1, 6, 6, 2, 4, 3, 7, 1],
[5, 1, 4, 4, 4, 2, 5, 9, 7, 9]])
>>> sort_inds = np.argsort(d)
>>> sort_inds
array([[0, 3, 4, 5, 6, 7, 8, 1, 2, 9],
[2, 7, 6, 9, 0, 3, 4, 5, 1, 8],
[0, 1, 7, 8, 2, 5, 4, 9, 3, 6],
[2, 9, 5, 7, 6, 1, 3, 4, 8, 0],
[1, 5, 2, 3, 4, 0, 6, 8, 7, 9]], dtype=int64)
This result looks good - notice that we can sort each row of d using the indices of the corresponding row from sort_inds as demonstrated in the 1D example. However, trying to get a sorted array using the same approach I used in the 1D example, I got this exception:
>>> d[sort_inds]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-63-e480a9fb309c> in <module>
----> 1 d[sort_inds]
IndexError: index 5 is out of bounds for axis 0 with size 5
So I have 2 questions:
What just happened? How did numpy interpret this code?
How can I still achieve what I want - that is, sorting d - or any other array of the same dimensions - using sort_inds?
Thanks
First, what happened: d[sort_inds] treats every entry of sort_inds as a row index along axis 0, so any index >= 5 (the number of rows) is out of bounds - hence the IndexError. To properly index the 2d array you need a little extra work. Here's a way using advanced indexing, where np.arange is used in the first axis so that each row in sort_inds extracts values from the corresponding row in d:
d[np.arange(d.shape[0])[:,None], sort_inds]
array([[1, 1, 2, 3, 3, 4, 4, 7, 8, 9],
[1, 3, 4, 5, 5, 5, 6, 8, 8, 9],
[1, 2, 3, 4, 5, 6, 7, 8, 8, 8],
[2, 2, 4, 7, 7, 8, 8, 9, 9, 9],
[1, 1, 2, 4, 4, 7, 7, 8, 8, 8]])
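If your numpy is 1.15 or newer, np.take_along_axis is a shorter equivalent; it pairs each row of the index array with the matching row of d:
np.take_along_axis(d, sort_inds, axis=1)   # same result as the advanced indexing above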
This question already has answers here:
How to copy a 2D array into a 3rd dimension, N times?
(7 answers)
Closed 4 years ago.
I have a 3x3 numpy array and I want to create a 3x3xC matrix where the new dimension consists of exact copies of the original 3x3 array. I am sure this is asked somewhere but I couldn't find the best way. I worked out how to do this for a simple 1 dimensional array x:
new_x = np.tile(np.array(x), (C, 1))
which repeats the array, then do:
np.transpose(np.expand_dims(new_x, axis=2),(2,1,0))
which expands the dimensions and switches the axes so that the array is repeated in the 3rd dimension (although this works, I'm not sure it's the best way either) - what is the most efficient way to do this for a general n x n numpy array?
For a readonly version, broadcast_to can be used:
In [370]: x = np.arange(9).reshape(3,3)
In [371]: x
Out[371]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [372]: x = np.broadcast_to(x[..., None],(3,3,10))
In [373]: x
Out[373]:
array([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
[[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]],
[[6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8]]])
Or with repeat:
In [378]: x=np.repeat(x[...,None],10,2)
In [379]: x
Out[379]:
array([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
[[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]],
[[6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8]]])
This is a larger array, whose elements can be changed individually.
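A quick sketch of that difference: broadcast_to returns a readonly view that costs no extra memory, while repeat materializes a full writable copy:
xb = np.broadcast_to(np.arange(9).reshape(3, 3)[..., None], (3, 3, 10))
xb.flags.writeable     # False: writing to the view would raise an error

xr = np.repeat(np.arange(9).reshape(3, 3)[..., None], 10, axis=2)
xr[0, 0, 0] = 99       # fine: repeat returned an independent copy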