Find all-zero columns in a scipy sparse matrix - python

For example, I have a coo_matrix A:
import scipy.sparse as sp
A = sp.coo_matrix([[3,0,3,0],
                   [0,0,2,0],
                   [2,5,1,0],
                   [0,0,0,0]])
How can I get the result [0,0,0,1], indicating that the first 3 columns contain non-zero values and only the 4th column is all zeros?
PS: I cannot convert A to another type.
PS2: I tried using np.nonzero, but my implementation was not very elegant.

Approach #1 We could do something like this -
# Get the columns indices of the input sparse matrix
C = sp.find(A)[1]
# Use np.in1d to create a mask of non-zero columns.
# So, we invert it and convert to int dtype for desired output.
out = (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Alternatively, to make the code shorter, we can use subtraction -
out = 1-np.in1d(np.arange(A.shape[1]),C)
Step-by-step run -
1) Input array and sparse matrix from it :
In [137]: arr   # Regular dense array
Out[137]:
array([[3, 0, 3, 0],
       [0, 0, 2, 0],
       [2, 5, 1, 0],
       [0, 0, 0, 0]])
In [138]: A = sp.coo_matrix(arr)   # Convert to sparse matrix as input here on
2) Get non-zero column indices :
In [139]: C = sp.find(A)[1]
In [140]: C
Out[140]: array([0, 2, 2, 0, 1, 2], dtype=int32)
3) Use np.in1d to get mask of non-zero columns :
In [141]: np.in1d(np.arange(A.shape[1]),C)
Out[141]: array([ True, True, True, False], dtype=bool)
4) Invert it :
In [142]: ~np.in1d(np.arange(A.shape[1]),C)
Out[142]: array([False, False, False, True], dtype=bool)
5) Finally convert to int dtype :
In [143]: (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Out[143]: array([0, 0, 0, 1])
Alternative subtraction approach :
In [145]: 1-np.in1d(np.arange(A.shape[1]),C)
Out[145]: array([0, 0, 0, 1])
Approach #2 Here's another way and possibly a faster one using matrix-multiplication -
out = 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
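This works because a boolean row of ones times a boolean matrix acts as a per-column "any" reduction. A quick check on the 4x4 example at the top (a sketch; assumes A is the coo_matrix defined there, the ones vector must have length A.shape[0], i.e. the number of rows, and the exact output container/dtype may vary across scipy versions):
1 - np.ones(A.shape[0], dtype=bool) * A.astype(bool)
# array([0, 0, 0, 1])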
Runtime test
Let's test out all the posted approaches on a big and really sparse matrix -
In [29]: A = sp.coo_matrix((np.random.rand(4000,4000)>0.998).astype(int))
In [30]: %timeit 1-np.in1d(np.arange(A.shape[1]),sp.find(A)[1])
100 loops, best of 3: 4.12 ms per loop   # Approach #1
In [31]: %timeit 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
1000 loops, best of 3: 771 µs per loop   # Approach #2
In [32]: %timeit 1 - (A.col==np.arange(A.shape[1])[:,None]).any(axis=1)
1 loops, best of 3: 236 ms per loop   # @hpaulj's soln
In [33]: %timeit (A!=0).sum(axis=0)==0
1000 loops, best of 3: 1.03 ms per loop   # @jez's soln
In [34]: %timeit (np.sum(np.absolute(A.toarray()), 0) == 0) * 1
10 loops, best of 3: 86.4 ms per loop   # @wwii's soln

The actual logical operation can be performed like this:
b = (A!=0).sum(axis=0)==0
# matrix([[False, False, False, True]], dtype=bool)
To match the requested output exactly, here's how to convert from booleans to integers (although for most applications you can do a lot more in numpy and friends if you stick with an array of bools):
b = b.astype(int)
# matrix([[0, 0, 0, 1]])
Either way, to then convert from a matrix to a list, you could do this:
c = list(b.flat)
# [0, 0, 0, 1]
...although again, I'm not sure this is the best thing to do: for most applications I can imagine, I would perhaps just convert to a one-dimensional numpy.array with c = b.A.flatten() instead.

Recent
A recent question, scipy.sparse.coo_matrix how to fast find all zeros column, fill with 1 and normalize, is similar, except it wants to fill those columns with 1s and normalize them.
I immediately suggested the lil format of the transpose: all-zero columns of A become empty row lists in that format.
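A minimal sketch of that lil idea (assumes A is the coo_matrix from the question; lil_matrix stores each row's column indices as a Python list in its .rows attribute):
Alil = A.T.tolil()
out = np.array([len(r) == 0 for r in Alil.rows]).astype(int)
# array([0, 0, 0, 1])
But sticking with the coo format, I suggested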
np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
or for this 1/0 format
1 - (Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1)
which is functionally the same as:
1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
sparse.find converts the matrix to csr to sum duplicates and eliminate explicit zeros, and then back to coo to get the data, row, and col attributes (which it returns).
Mo.nonzero uses Mo.data != 0 to eliminate explicit 0s before returning the row and col attributes.
The np.ones(A.shape[0],dtype=bool)*A.astype(bool) solution requires converting A to csr format for multiplication.
(A!=0).sum(axis=0) also converts to csr because column (or row) sum is done with a matrix multiplication.
So the no-convert requirement is unrealistic, at least within the bounds of sparse formats.
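Within those bounds, coo_matrix does expose a getnnz(axis=...) method that counts stored entries per column without any format conversion; a hedged sketch (explicitly stored zeros, if any, would be miscounted as nonzero entries here):
out = (A.getnnz(axis=0) == 0).astype(int)
# array([0, 0, 0, 1])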
===============
For Divakar's test case my == version is quite slow; it's fine with small matrices, but it creates too large a temporary array with that many columns.
Testing on a matrix that is sparse enough to have a number of 0 columns:
In [183]: Arr=sparse.random(1000,1000,.001)
In [184]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).any()
Out[184]: True
In [185]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).sum()
Out[185]: 367
In [186]: timeit 1-np.ones(Arr.shape[0],dtype=bool)*Arr.astype(bool)
1000 loops, best of 3: 334 µs per loop
In [187]: timeit 1-np.in1d(np.arange(Arr.shape[1]),Arr.col)
1000 loops, best of 3: 323 µs per loop
In [188]: timeit 1-(Arr.col==np.arange(Arr.shape[1])[:,None]).any(axis=1)
100 loops, best of 3: 3.9 ms per loop
In [189]: timeit (Arr!=0).sum(axis=0)==0
1000 loops, best of 3: 820 µs per loop

Convert to an array or dense matrix, sum the absolute values along the first axis, test the result against zero, and convert to int:
>>> import numpy as np
>>> (np.sum(np.absolute(a.toarray()), 0) == 0) * 1
array([0, 0, 0, 1])
>>> (np.sum(np.absolute(a.todense()), 0) == 0) * 1
matrix([[0, 0, 0, 1]])
>>>
>>> np.asarray((np.sum(np.absolute(a.todense()), 0) == 0), dtype = np.int32)
array([[0, 0, 0, 1]])
>>>
The first is the fastest - 24 µs for your example on my machine.
For a matrix made with np.random.randint(0,3,(1000,1000)), all are right at 25 ms on my machine.

Related

How could I get numpy array indices by some conditions

I've come across a problem like this:
Suppose I have arrays like this:
a = np.array([[1,2,3,4,5,4,3,2,1],])
label = np.array([[1,0,1,0,0,1,1,0,1],])
I need to obtain the index of a at which label is 1 and the value of a is the largest among all positions where label is 1.
It may be confusing; in the above example, the indices where label is 1 are: 0, 2, 5, 6, 8, and their corresponding values of a are: 1, 3, 4, 3, 1, among which 4 is the largest. Thus I need to get the result 5, which is the index of the number 4 in a. How can I do this with numpy?
Get the indices of the 1s, say as idx, then index into a with it, get the argmax and finally trace it back to the original order by indexing into idx -
idx = np.flatnonzero(label==1)
out = idx[a[idx].argmax()]
Sample run -
# Assuming inputs to be 1D
In [18]: a
Out[18]: array([1, 2, 3, 4, 5, 4, 3, 2, 1])
In [19]: label
Out[19]: array([1, 0, 1, 0, 0, 1, 1, 0, 1])
In [20]: idx = np.flatnonzero(label==1)
In [21]: idx[a[idx].argmax()]
Out[21]: 5
For a as ints and label as an array of 0s and 1s, we could optimize further by scaling label by the full range of values in a, so that every labeled element outranks every unlabeled one, like so -
(label*(a.max()-a.min()+1) + a).argmax()
Furthermore, if a has positive numbers only, it would simplify to -
(label*(a.max()+1) + a).argmax()
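A quick check of the scaling trick on the question's sample (1D versions): the offset a.max()-a.min()+1 exceeds any possible gap within a, so a labeled position always wins -
a = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1])
label = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1])
(label*(a.max()-a.min()+1) + a).argmax()   # -> 5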
Timings with a largish a of positive ints -
In [115]: np.random.seed(0)
...: a = np.random.randint(0,10,(100000))
...: label = np.random.randint(0,2,(100000))
In [117]: %%timeit
...: idx = np.flatnonzero(label==1)
...: out = idx[a[idx].argmax()]
1000 loops, best of 3: 592 µs per loop
In [116]: %timeit (label*(a.max()-a.min()+1) + a).argmax()
1000 loops, best of 3: 357 µs per loop
# @coldspeed's soln
In [120]: %timeit np.ma.masked_where(~label.astype(bool), a).argmax()
1000 loops, best of 3: 1.63 ms per loop
# won't work with negative numbers in a
In [119]: %timeit (label*(a.max()+1) + a).argmax()
1000 loops, best of 3: 292 µs per loop
# @klim's soln (won't work with negative numbers in a)
In [121]: %timeit np.argmax(a * (label == 1))
1000 loops, best of 3: 229 µs per loop
You can use masked arrays:
>>> np.ma.masked_where(~label.astype(bool), a).argmax()
5
Here is one of the simplest ways.
>>> np.argmax(a * (label == 1))
5
>>> np.argmax(a * (label == 1), axis=1)
array([5])
Coldspeed's method may take more time.

Aggregate data in a numpy array [duplicate]

How can I sum across rows that have equal values in the first column of a numpy array? For example:
In: np.array([[1,2,3],
              [1,4,6],
              [2,3,5],
              [2,6,2],
              [3,4,8]])
Out: [[1,6,9], [2,9,7], [3,4,8]]
Any help would be greatly appreciated.
Pandas has a very powerful groupby function which makes this very simple.
import pandas as pd
n = np.array([[1,2,3],
              [1,4,6],
              [2,3,5],
              [2,6,2],
              [3,4,8]])
df = pd.DataFrame(n, columns = ["First Col", "Second Col", "Third Col"])
df.groupby("First Col").sum()
Approach #1
Here's something in a numpythonic vectorized way based on np.bincount -
# Initial setup
N = A.shape[1]-1
unqA1, ids = np.unique(A[:, 0], return_inverse=True)   # ids, to avoid shadowing the builtin id
# Create subscripts and accumulate with bincount for tagged summations
subs = np.arange(N)*(ids.max()+1) + ids[:,None]
sums = np.bincount( subs.ravel(), weights=A[:,1:].ravel() )
# Append the unique elements from first column to get final output
out = np.append(unqA1[:,None],sums.reshape(N,-1).T,1)
Sample input, output -
In [66]: A
Out[66]:
array([[1, 2, 3],
       [1, 4, 6],
       [2, 3, 5],
       [2, 6, 2],
       [7, 2, 1],
       [2, 0, 3]])
In [67]: out
Out[67]:
array([[ 1.,  6.,  9.],
       [ 2.,  9., 10.],
       [ 7.,  2.,  1.]])
Approach #2
Here's another based on np.cumsum and np.diff -
# Sort A based on first column
sA = A[np.argsort(A[:,0]),:]
# Row mask of where each group ends
row_mask = np.append(np.diff(sA[:,0])!=0,[True])
# Get cumulative summations and then DIFF to get summations for each group
cumsum_grps = sA.cumsum(0)[row_mask,1:]
sum_grps = np.diff(cumsum_grps,axis=0)
# Prepend the first group's sums to get all the group summations
counts = np.concatenate((cumsum_grps[0,:][None],sum_grps),axis=0)
# Concatenate the first column of the input array for final output
out = np.concatenate((sA[row_mask,0][:,None],counts),axis=1)
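For reference, the benchmark below calls the approaches through wrapper functions; a minimal sketch of those wrappers, reconstructed from the code in this thread (the names cumsum_diff, bincount and add_at are assumed from the timings; add_at wraps the np.unique/np.add.at answer further down):
def bincount(A):
    N = A.shape[1]-1
    unqA1, ids = np.unique(A[:, 0], return_inverse=True)
    subs = np.arange(N)*(ids.max()+1) + ids[:,None]
    sums = np.bincount(subs.ravel(), weights=A[:,1:].ravel())
    return np.append(unqA1[:,None], sums.reshape(N,-1).T, 1)
def cumsum_diff(A):
    sA = A[np.argsort(A[:,0]),:]
    row_mask = np.append(np.diff(sA[:,0])!=0, [True])
    cumsum_grps = sA.cumsum(0)[row_mask,1:]
    sum_grps = np.diff(cumsum_grps, axis=0)
    counts = np.concatenate((cumsum_grps[0,:][None], sum_grps), axis=0)
    return np.concatenate((sA[row_mask,0][:,None], counts), axis=1)
def add_at(A):
    unq, unq_inv = np.unique(A[:, 0], return_inverse=True)
    out = np.zeros((len(unq), A.shape[1]), dtype=A.dtype)
    out[:, 0] = unq
    np.add.at(out[:, 1:], unq_inv, A[:, 1:])
    return out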
Benchmarking
Here are some runtime tests for the numpy-based approaches presented so far for this question -
In [319]: A = np.random.randint(0,1000,(100000,10))
In [320]: %timeit cumsum_diff(A)
100 loops, best of 3: 12.1 ms per loop
In [321]: %timeit bincount(A)
10 loops, best of 3: 21.4 ms per loop
In [322]: %timeit add_at(A)
10 loops, best of 3: 60.4 ms per loop
In [323]: A = np.random.randint(0,1000,(100000,20))
In [324]: %timeit cumsum_diff(A)
10 loops, best of 3: 32.1 ms per loop
In [325]: %timeit bincount(A)
10 loops, best of 3: 32.3 ms per loop
In [326]: %timeit add_at(A)
10 loops, best of 3: 113 ms per loop
Seems like Approach #2: cumsum + diff is performing quite well.
Try using pandas. Group by the first column and then sum row-wise. Something like
df.groupby(df.iloc[:, 0]).sum()   # .iloc instead of the deprecated .ix
With a little help from your friends np.unique and np.add.at:
>>> unq, unq_inv = np.unique(A[:, 0], return_inverse=True)
>>> out = np.zeros((len(unq), A.shape[1]), dtype=A.dtype)
>>> out[:, 0] = unq
>>> np.add.at(out[:, 1:], unq_inv, A[:, 1:])
>>> out   # A was the OP's array
array([[1, 6, 9],
       [2, 9, 7],
       [3, 4, 8]])
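Note that np.add.at accumulates unbuffered and in place, so rows of A that share a key in unq_inv are summed rather than overwritten; and since out[:, 1:] is a basic slice, it is a view, so the accumulation lands in out itself.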

Numpy: indicators to partition

I am trying to represent a partition of the numbers 0 to n-1 in Python.
I have a numpy array where the ith entry indicates the partition ID of number i. For instance, the numpy array
indicator = array([1, 1, 3, 0, 2, 3, 0, 0])
indicates that numbers 3, 6, and 7 belong to the partition with ID 0. Numbers 0 and 1 belong to partition 1. 4 belongs to partition 2. And 2 and 5 belong to partition 3. Let's call this the indicator representation.
Another way to represent the partition would be a list of lists where the ith list is the partition with ID i. For the array above, this maps to
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
Let's call this the explicit representation.
My question is what is the most efficient way to convert the indicator representation to the explicit representation? The naive way is to iterate through the indicator array and assign the elements to their respective slot in the explicit array, but iterating through numpy arrays is inefficient. Is there a more natural numpy construct to do this?
Here's an approach using sorted indices and then splitting those into groups -
def indicator_to_part(indicator):
    sidx = indicator.argsort()   # indicator.argsort(kind='mergesort') keeps order
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    return np.split(sidx, split_idx+1)
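A quick usage check on the question's sample (order within each group assumes the stable mergesort variant noted in the comment):
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
indicator_to_part(indicator)
# [array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]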
Runtime test -
In [326]: indicator = np.random.randint(0,100,(10000))
In [327]: %timeit from_ind_to_expl(indicator)   # @yogabonito's soln
100 loops, best of 3: 5.59 ms per loop
In [328]: %timeit indicator_to_part(indicator)
1000 loops, best of 3: 801 µs per loop
In [330]: indicator = np.random.randint(0,1000,(100000))
In [331]: %timeit from_ind_to_expl(indicator)   # @yogabonito's soln
1 loops, best of 3: 494 ms per loop
In [332]: %timeit indicator_to_part(indicator)
100 loops, best of 3: 11.1 ms per loop
Note that the output would be a list of arrays. If you have to get a list of lists as output, a simple way would be to use list(map(list, indicator_to_part(indicator))). Again, a performant alternative would involve a few more steps, like so -
def indicator_to_part_list(indicator):
    sidx = indicator.argsort()   # indicator.argsort(kind='mergesort') keeps order
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    sidx_list = sidx.tolist()
    start = np.append(0,split_idx+1)
    stop = np.append(split_idx+1,indicator.size)
    return [sidx_list[start[i]:stop[i]] for i in range(start.size)]
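On the same sample it returns plain lists:
indicator_to_part_list(np.array([1, 1, 3, 0, 2, 3, 0, 0]))
# [[3, 6, 7], [0, 1], [4], [2, 5]]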
Here is a solution for translating indicator to explicit using numpy only (no for loops, list comprehensions, itertools, etc.)
I haven't seen your iteration-based approach so I can't compare them but maybe you can tell me if it's fast enough for your needs :)
import numpy as np
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
def from_ind_to_expl(indicator):
    groups, group_sizes = np.unique(indicator, return_counts=True)
    group_sizes = np.cumsum(group_sizes)
    ordered = np.where(indicator==groups[:, np.newaxis])
    return np.hsplit(ordered[1], group_sizes[:-1])
from_ind_to_expl(indicator) gives
[array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]
I have also compared the times of @Divakar's and my solution. On my machine @Divakar's solution is 2-3 times faster than mine, so @Divakar definitely gets an upvote from me :)
In the last comparison in @Divakar's post there's no averaging for my solution because there's only one loop - this is slightly unfair :P ;)

Altering arrays of different dimensions to be broadcasted together

I am looking for a more optimized way to convert a (n,n) or (n,n,1) matrix to a (n,n,3) matrix. I start out with an (n,n,3), but my dimensions get reduced after I perform a sum over the second axis to (n,n). Essentially, I want to keep the original size of the array and have the second axis just repeated 3 times. The reason I need this is that I will later be broadcasting it with another (n,n,3) array, but they need the same dimensions.
My current method works, but does not seem elegant.
a=np.random.random((n,n))
b=a.flatten().tolist()
a=np.array(list(zip(b,b,b)))   # list(...) needed on Python 3
a.shape=n,n,3
This setup has the desired result, but is clunky and hard to follow. Is there perhaps a way to go directly from an (n,n) to an (n,n,3) by duplicating the second index? or perhaps a way to not downsize the array to begin with?
None or np.newaxis is a common way of adding a dimension to an array. reshape with (3,3,1) works just as well:
In [64]: arr=np.arange(9).reshape(3,3)
In [65]: arr1 = arr[...,None]
In [66]: arr1.shape
Out[66]: (3, 3, 1)
repeat, as a function or an array method, replicates this along the new axis.
In [72]: arr2=arr1.repeat(3,axis=2)
In [73]: arr2.shape
Out[73]: (3, 3, 3)
In [74]: arr2[0,0,:]
Out[74]: array([0, 0, 0])
But you might not need to do this. With broadcasting a (3,3,1) works with a (3,3,3).
In [75]: (arr1+arr2).shape
Out[75]: (3, 3, 3)
In fact it will broadcast with a (3,) to produce (3,3,3).
In [77]: arr1+np.ones(3,int)
Out[77]:
array([[[1, 1, 1],
        [2, 2, 2],
        ...
       [[7, 7, 7],
        [8, 8, 8],
        [9, 9, 9]]])
So arr1+np.zeros(3,int) is another way of expanding that (3,3,1) to (3,3,3).
The broadcasting rules are:
(3,3,1) + (3,) => (3,3,1) + (1,1,3) => (3,3,3)
broadcasting adds dimensions at the start as needed.
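For completeness, np.broadcast_to (NumPy 1.10+) produces the expanded result as a zero-copy view; a small sketch (the view is read-only, so copy it if you need to write):
view = np.broadcast_to(arr1, (3,3,3))   # no data copied
view[0,0,:]   # array([0, 0, 0])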
When you sum on an axis, you can keep the original number of dimensions with a parameter:
In [78]: arr2.sum(axis=2).shape
Out[78]: (3, 3)
In [79]: arr2.sum(axis=2, keepdims=True).shape
Out[79]: (3, 3, 1)
This is handy if you want to subtract the mean from an array along any dimension:
arr2-arr2.mean(axis=2, keepdims=True)
You can first create a new axis (axis 2) on a and then use np.repeat along this new axis:
np.repeat(a[:,:,None], 3, axis = 2)
Or another approach: flatten the array, repeat the elements, and then reshape:
np.repeat(a.ravel(), 3).reshape(n,n,3)
The result comparison:
import numpy as np
n = 4
a=np.random.random((n,n))
b=a.flatten().tolist()
a1=np.array(list(zip(b,b,b)))   # list(...) needed on Python 3
a1.shape=n,n,3
# a1 is the result from the original method
(np.repeat(a[:,:,None], 3, axis = 2) == a1).all()
# True
(np.repeat(a.ravel(), 3).reshape(4,4,3) == a1).all()
# True
Timing: using the built-in numpy.repeat also shows a speed-up:
import numpy as np
n = 4
a=np.random.random((n,n))
def rep():
    b=a.flatten().tolist()
    a1=np.array(list(zip(b,b,b)))
    a1.shape=n,n,3
%timeit rep()
# 100000 loops, best of 3: 7.11 µs per loop
%timeit np.repeat(a[:,:,None], 3, axis = 2)
# 1000000 loops, best of 3: 1.64 µs per loop
%timeit np.repeat(a.ravel(), 3).reshape(4,4,3)
# 1000000 loops, best of 3: 1.9 µs per loop

How can I select values along an axis of an nD array with an (n-1)D array of indices of that axis?

This is motivated by my answer here.
Given array A with shape (n0,n1) and array J with shape (n0,), I'd like to create an array B with shape (n0,) such that
B[i] = A[i,J[i]]
I'd also like to be able to generalize this to k-dimensional arrays, where A has shape (n0,n1,...,nk) and J has shape (n0,n1,...,n(k-1))
There are messy, flattening ways of doing this that make assumptions about index order:
import numpy as np
B = A.ravel()[ J+A.shape[-1]*np.arange(0,np.prod(J.shape)).reshape(J.shape) ]
The question is, is there a way to do this that doesn't rely on flattening arrays and dealing with indexes manually?
For the 2d A and 1d J case, this indexing works:
A[np.arange(J.shape[0]), J]
Which can be applied to more dimensions by reshaping to 2d (and back):
A.reshape(-1, A.shape[-1])[np.arange(np.prod(A.shape[:-1])).reshape(J.shape), J]
For 3d A this works:
A[np.arange(J.shape[0])[:,None], np.arange(J.shape[1])[None,:], J]
where the first two arange indices broadcast to the same shape as J.
With functions in lib.index_tricks, this can be expressed as:
A[np.ogrid[0:J.shape[0],0:J.shape[1]]+[J]]
A[np.ogrid[slice(J.shape[0]),slice(J.shape[1])]+[J]]
or for multiple dimensions:
A[np.ix_(*[np.arange(x) for x in J.shape])+(J,)]
A[np.ogrid[[slice(k) for k in J.shape]]+[J]]
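Newer NumPy (1.15+) packages this exact pattern as np.take_along_axis, which postdates the answers here; a hedged sketch:
B = np.take_along_axis(A, J[..., None], axis=-1)[..., 0]   # expand J to A's ndim, then squeeze
# equivalent to A[np.ix_(*[np.arange(x) for x in J.shape]) + (J,)]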
For small A and J (e.g. 2*3*4), J.choose(np.rollaxis(A,-1)) is faster. All of the extra time is in preparing the index tuple. np.ix_ is faster than np.ogrid.
np.choose has a size limit; at its upper end it is slower than ix_:
In [610]: Abig=np.arange(31*31).reshape(31,31)
In [611]: Jbig=np.arange(31)
In [612]: Jbig.choose(np.rollaxis(Abig,-1))
Out[612]:
array([ 0, 32, 64, 96, 128, 160, ... 960])
In [613]: timeit Jbig.choose(np.rollaxis(Abig,-1))
10000 loops, best of 3: 73.1 µs per loop
In [614]: timeit Abig[np.ix_(*[np.arange(x) for x in Jbig.shape])+(Jbig,)]
10000 loops, best of 3: 22.7 µs per loop
In [635]: timeit Abig.ravel()[Jbig+Abig.shape[-1]*np.arange(0,np.prod(Jbig.shape)).reshape(Jbig.shape) ]
10000 loops, best of 3: 44.8 µs per loop
I did similar indexing tests at https://stackoverflow.com/a/28007256/901925, and found that flat indexing was faster for much larger arrays (e.g. n0=1000). That's where I learned about the 32-array limit for choose.
It doesn't solve your problem exactly, but choose() should nevertheless help:
>>> A = np.array(range(1, 28)).reshape(3, 3, 3)
>>> B = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]).reshape(3, 3)
>>> B.choose(A)
array([[ 1,  2,  3],
       [13, 14, 15],
       [25, 26, 27]])
It selects among the first dimension instead of the last.
