I would like to know the fastest way to extract the indices of the first n non zero values per column in a 2D array.
For example, with the following array:
import numpy as np
arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
With n=2 I would have [0, 0, 1, 1, 2] as xs and [0, 3, 2, 5, 3] as ys. 2 values in the first and second columns and 1 in the third.
Here is how it is currently done:
x = []
y = []
n = 2
for i, c in enumerate(arr.T):
    a = c.nonzero()[0][:n]
    if len(a):
        x.extend([i] * len(a))
        y.extend(a)
In practice I have arrays of size (405, 256).
Is there a way to make it faster?
Here is a method that does not require sorting the array (a single linear scan is enough to find the non-zero values), although it is somewhat confusing because it chains several functions:
n = 2
# Get the indices of the non-zero values, column indices first
nnull = np.stack(np.where(arr.T != 0))
# Split the positions into groups, one per distinct column index
cols_ids = np.array_split(range(len(nnull[0])), np.where(np.diff(nnull[0]) > 0)[0] + 1)
# Take at most n from each group and concatenate the results
np.concatenate([nnull[:, u[:n]] for u in cols_ids], axis=1)
outputs:
array([[0, 0, 1, 1, 2],
[0, 3, 2, 5, 3]], dtype=int64)
Here is one approach using argsort; it returns the indices in a different order, though:
n = 2
m = arr != 0
# non-zero values first (a stable sort keeps the original row order among them)
idx = np.argsort(~m, axis=0, kind='stable')
# get the first n and ensure they are non-zero
m2 = np.take_along_axis(m, idx, axis=0)[:n]
y, x = np.where(m2)
# slice
x, idx[y, x]
# (array([0, 1, 2, 0, 1]), array([0, 2, 3, 3, 5]))
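Compare the column indices returned by nonzero on the transposed array with a copy of themselves shifted by n. An entry is kept if it is among the first n entries overall, or if the entry n positions before it belongs to a different column: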
>>> n = 2
>>> i, j = arr.T.nonzero()
>>> mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
>>> i[mask], j[mask]
(array([0, 0, 1, 1, 2], dtype=int64), array([0, 3, 2, 5, 3], dtype=int64))
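The shifted comparison can be wrapped in a small function. This is a minimal sketch rather than a benchmarked implementation: the name first_n_nonzero_per_column is hypothetical, and n is assumed to be at least 1. Building the mask with np.ones instead of a list of n True values also covers arrays that have fewer than n non-zero entries in total.

```python
import numpy as np

def first_n_nonzero_per_column(arr, n):
    """Return (column indices, row indices) of the first n non-zeros per column."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # On the transpose, nonzero() lists entries column by column, rows ascending.
    cols, rows = np.asarray(arr).T.nonzero()
    # Keep an entry if it is among the first n overall, or if the entry
    # n positions earlier belongs to a different column.
    mask = np.ones(len(cols), dtype=bool)
    mask[n:] = cols[n:] != cols[:-n]
    return cols[mask], rows[mask]

arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
xs, ys = first_n_nonzero_per_column(arr, 2)
print(xs.tolist(), ys.tolist())
```

For the example array this should reproduce the xs and ys given in the question.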
I'm trying to create a function that will transform a regular matrix into CSR form (I don't want to use the scipy.sparse one).
To do this, I'm using a nested for-loop to run through a given matrix to create a new matrix with three rows.
The first row ('Values') should contain all non-zero values. The second ('Cols') should contain the column index for each number in 'Values'. The third row should contain the index value in 'Values' for the first non-zero value on each row.
My question regards the second and third rows:
Is there a way of getting the column ID for the element 'i' in the for-loop?
M = array([[4, 0, 39],
           [0, 5, 0],
           [0, 0, 7]])
def Convert(x):
    CSRMatrix = []
    Values = []
    Cols = []
    Rows = []
    for k in x:
        for i in k:
            if i != 0:
                Values.append(i)
                Cols.append({#the column index value of 'i'})
                Rows.append[#the index in 'Values' of the first non-zero element on each row]
    CSRMatrix.append(Values)
    CSRMatrix.append(Cols)
    CSRMatrix.append(Rows)
    return(CSRMatrix)
Convert(M)
I'm not sure exactly what you want for Cols.append(), because of the way you wrote the placeholder between curly braces in the code.
Is it a dict mapping index to value for every non-zero value? A list of sets containing the indexes of all non-zero values (which would be odd)? Or all the indexes of the non-zero values in each row of your array?
In any case, I included the two most likely candidates (a dict, and a list of indexes for each row). Test each one and delete the one you don't want; if neither is right, please add some more specifics:
import numpy as np
m = np.array([[4, 0, 39],
              [0, 5, 0],
              [0, 0, 7]])
def Convert(x):
    CSRMatrix = []
    Values = []
    Cols = []
    Rows = []
    for num in x:
        for i in range(len(num)):
            if num[i] != 0:
                Values.append(num[i])
                Cols.append({i: num[i]})  # <- if dict. Remove if not what you wanted
                Rows.append(i)
                Cols.append(i)  # <- list of all indexes in the array for each row. Remove if not what you wanted
    CSRMatrix.append(Values)
    CSRMatrix.append(Cols)
    CSRMatrix.append(Rows)
    return CSRMatrix
x = Convert(m)
print(x)
enumerate() yields an index along with the item on every iteration.
The second row can therefore be created simply by appending num2.
For the third row, you have to check whether a value has already been added for the current row. If not, append the position in Values of the element just added (len(Values) - 1) and set the non_zero flag to False. For the next row, the flag is set to True again.
def Convert(x):
    CSRMatrix = []
    Values = []
    Cols = []
    Rows = []
    for num, k in enumerate(x):
        non_zero = True  # no non-zero value seen yet in this row
        for num2, i in enumerate(k):
            if i != 0:
                Values.append(i)
                Cols.append(num2)
                if non_zero:
                    # index in Values of the first non-zero value of this row
                    Rows.append(len(Values) - 1)
                    non_zero = False
    CSRMatrix.append(Values)
    CSRMatrix.append(Cols)
    CSRMatrix.append(Rows)
    return CSRMatrix
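Here is a NumPy-idiomatic implementation. Use the nonzero method to obtain the row and column indices of the non-zero elements directly, then compare each row index with its predecessor to build a mask that marks the first non-zero element of each row. Finally, call nonzero on the mask to get the positions in the values array where each row starts: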
>>> M = np.array([[ 4, 0, 39],
... [ 0, 5, 0],
... [ 0, 0, 7]])
>>> r, c = M.nonzero()
>>> mask = np.concatenate(([True], r[1:] != r[:-1]))
>>> [M[r, c], c, *mask.nonzero()]
[array([ 4, 39, 5, 7]), array([0, 2, 1, 2]), array([0, 2, 3])]
Test of a larger array:
>>> a = np.random.choice(10, size=(8, 8), p=[0.73] + [0.03] * 9)
>>> a
array([[0, 0, 0, 0, 8, 0, 0, 1],
[1, 0, 5, 4, 0, 0, 9, 0],
[0, 0, 9, 0, 0, 0, 0, 1],
[0, 0, 0, 8, 9, 0, 0, 4],
[0, 0, 5, 0, 0, 6, 0, 0],
[0, 8, 0, 0, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 0, 0, 9],
[0, 9, 0, 0, 0, 4, 0, 0]])
>>> r, c = a.nonzero()
>>> mask = np.concatenate(([True], r[1:] != r[:-1]))
>>> pp([a[r, c], c, *mask.nonzero()])
[array([8, 1, 1, 5, 4, 9, 9, 1, 8, 9, 4, 5, 6, 8, 9, 9, 9, 4]),
array([4, 7, 0, 2, 3, 6, 2, 7, 3, 4, 7, 2, 5, 1, 7, 7, 1, 5], dtype=int64),
array([ 0, 2, 6, 8, 11, 13, 15, 16], dtype=int64)]
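One caveat: a third row built this way has no entry for a matrix row that contains only zeros. The conventional CSR row-pointer array avoids that ambiguity by storing one offset per matrix row plus a final entry equal to the number of stored values. The following is a minimal sketch of that variant, assuming a 2D NumPy array as input; the function name dense_to_csr is hypothetical and not part of the answers above.

```python
import numpy as np

def dense_to_csr(M):
    """Return (values, column indices, row pointer) for a dense 2D array."""
    M = np.asarray(M)
    r, c = M.nonzero()                    # non-zero positions in row-major order
    values = M[r, c]
    counts = np.count_nonzero(M, axis=1)  # stored values per row (0 for empty rows)
    # Row i occupies values[indptr[i]:indptr[i + 1]].
    indptr = np.concatenate(([0], np.cumsum(counts)))
    return values, c, indptr

M = np.array([[4, 0, 39],
              [0, 5, 0],
              [0, 0, 7]])
values, cols, indptr = dense_to_csr(M)
print(values.tolist(), cols.tolist(), indptr.tolist())
```

Dropping the last entry of indptr gives the same third row as the mask-based version whenever no row is empty.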
I have a matrix (formed of a list of lists) that would look something like:
matrix = [[0, 0, 0, 0, 5],
[0, 0, 0, 4, 0],
[2, 0, 3, 0, 0],
[3, 2, 0, 2, 0],
[1, 0, 2, 0, 1]]
What I am struggling to create is a function that will take this matrix as an input, along with a position in the matrix - represented by a tuple - and return the two diagonals that intersect that point (without using NumPy). For example,
def getDiagonals(matrix, pos):
    (row, col) = pos
    # Smart diagonal finder code #
    return (diag1, diag2)
diagonals = getDiagonals(matrix, (1, 1))
print(diagonals[0])
print(diagonals[1])
print(' ')
diagonals = getDiagonals(matrix, (1, 3))
print(diagonals[0])
print(diagonals[1])
Expected output:
OUT: [5, 4, 3, 2, 1]
OUT: [2, 2, 2]
OUT:
OUT: [0, 2, 2]
OUT: [0, 0, 3, 2, 1]
It is worth pointing out that I don't mind in which direction (bottom-to-top or top-to-bottom) the elements of the diagonals are returned. They could easily be produced one way and reversed using reverse() if need be.
I have looked at similar questions such as this one, but it mainly deals with acquiring the leading diagonals of a matrix and provides less information on getting the diagonals through a given point.
Many thanks for your help and comments in advance!
A bit confusing, but I think this does it:
def getDiagonals(matrix, pos):
    row, col = pos
    nrows = len(matrix)
    ncols = len(matrix[0]) if nrows > 0 else 0
    # First diagonal
    d1_i, d1_j = nrows - 1 - max(row - col, 0), max(col - row, 0)
    d1_len = min(d1_i + 1, ncols - d1_j)
    diag1 = [matrix[d1_i - k][d1_j + k] for k in range(d1_len)]
    # Second diagonal
    t = min(row, ncols - col - 1)
    d2_i, d2_j = nrows - 1 - row + t, col + t
    d2_len = min(d2_i, d2_j) + 1
    diag2 = [matrix[d2_i - k][d2_j - k] for k in range(d2_len)]
    return (diag1, diag2)
# Test
matrix = [[0, 0, 0, 0, 5],
[0, 0, 0, 4, 0],
[2, 0, 3, 0, 0],
[3, 2, 0, 2, 0],
[1, 0, 2, 0, 1]]
diagonals = getDiagonals(matrix, (1, 1))
print(diagonals[0])
# [1, 2, 3, 4, 5]
print(diagonals[1])
# [2, 2, 2]
diagonals = getDiagonals(matrix, (1, 3))
print(diagonals[0])
# [2, 2, 0]
print(diagonals[1])
# [1, 2, 3, 0, 0]
diagonals = getDiagonals(matrix, (2, 2))
print(diagonals[0])
# [1, 2, 3, 4, 5]
print(diagonals[1])
# [1, 2, 3, 0, 0]
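Both the expected output in the question and the function above appear to count the row in pos from the bottom of the matrix. Under that assumption, a shorter way to express the same idea is to scan the rows and pick the one cell per row that lies on each diagonal: cells on the rising diagonal share the same i + j, and cells on the falling diagonal share the same j - i. This is only a sketch; the name get_diagonals and the top-to-bottom output order are choices made here, not taken from the answer.

```python
def get_diagonals(matrix, pos):
    """Return the two diagonals through pos, each listed top-to-bottom.

    pos is (row, col) with row counted from the bottom of the matrix
    (an assumption inferred from the expected output in the question).
    """
    row, col = pos
    nrows = len(matrix)
    ncols = len(matrix[0]) if nrows else 0
    r = nrows - 1 - row  # convert the bottom-up row to a list index
    # Rising diagonal: cells with i + j == r + col.
    rising = [matrix[i][r + col - i]
              for i in range(nrows) if 0 <= r + col - i < ncols]
    # Falling diagonal: cells with j - i == col - r.
    falling = [matrix[i][i - r + col]
               for i in range(nrows) if 0 <= i - r + col < ncols]
    return rising, falling

matrix = [[0, 0, 0, 0, 5],
          [0, 0, 0, 4, 0],
          [2, 0, 3, 0, 0],
          [3, 2, 0, 2, 0],
          [1, 0, 2, 0, 1]]
print(get_diagonals(matrix, (1, 1)))
print(get_diagonals(matrix, (1, 3)))
```

Because the lists run top-to-bottom, they come out reversed relative to the answer above; reversing each list gives the same values.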
I am working on a large array (3000 x 3000) to which I apply scipy.ndimage.label. It returns the labelled array and the number of labels, 3403. I would like to know the indices of each label; e.g. for label 1, I want the rows and columns where it occurs in the labelled array.
So basically like this
a[0] = array([[1, 1, 0, 0],
[1, 1, 0, 2],
[0, 0, 0, 2],
[3, 3, 0, 0]])
indices = [np.where(a[0]==t+1) for t in range(a[1])] #where a[1] = 3 is number of labels.
print(indices)
[(array([0, 0, 1, 1]), array([0, 1, 0, 1])), (array([1, 2]), array([3, 3])), (array([3, 3]), array([0, 1]))]
I would like to create a list of indices like the one above for all 3403 labels. The method above seems slow. I tried using generators, but that did not seem to improve things.
Are there any efficient ways?
The way to gain efficiency here is to minimize the work done inside the loop. A fully vectorized method isn't possible, given that each label has a variable number of elements. With those factors in mind, here's one solution -
a_flattened = a[0].ravel()
sidx = np.argsort(a_flattened)
afs = a_flattened[sidx]
cut_idx = np.r_[0,np.flatnonzero(afs[1:] != afs[:-1])+1,a_flattened.size]
row, col = np.unravel_index(sidx, a[0].shape)
row_indices = [row[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
col_indices = [col[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
Sample input, output -
In [59]: a[0]
Out[59]:
array([[1, 1, 0, 0],
[1, 1, 0, 2],
[0, 0, 0, 2],
[3, 3, 0, 0]])
In [60]: a[1]
Out[60]: 3
In [62]: row_indices # row indices
Out[62]:
[array([0, 0, 1, 2, 2, 2, 3, 3]), # for label-0
array([0, 0, 1, 1]), # for label-1
array([1, 2]), # for label-2
array([3, 3])] # for label-3
In [63]: col_indices # column indices
Out[63]:
[array([2, 3, 2, 0, 1, 2, 2, 3]), # for label-0
array([0, 1, 0, 1]), # for label-1
array([3, 3]), # for label-2
array([0, 1])] # for label-3
With their first elements dropped, row_indices and col_indices are the expected output. The first group in each list represents the 0-th (background) region, so you might want to skip it.
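The sort-and-split steps can also be packaged so that the result is keyed by label value, which makes skipping the background explicit. This is a sketch under a few assumptions: the function name indices_by_label is hypothetical, the input is a non-empty integer array, and a stable sort is requested so that the indices within each label stay in row-major order.

```python
import numpy as np

def indices_by_label(labels):
    """Map each label value to its (row indices, column indices)."""
    flat = labels.ravel()
    # One sort groups equal labels together; 'stable' keeps row-major order
    # within each group.
    order = np.argsort(flat, kind="stable")
    sorted_vals = flat[order]
    # Positions where the sorted label value changes.
    cuts = np.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    rows, cols = np.unravel_index(order, labels.shape)
    label_values = sorted_vals[np.r_[0, cuts]]
    return {int(v): (r, c)
            for v, r, c in zip(label_values,
                               np.split(rows, cuts),
                               np.split(cols, cuts))}

labelled = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 2],
                     [0, 0, 0, 2],
                     [3, 3, 0, 0]])
groups = indices_by_label(labelled)
groups.pop(0, None)  # drop the background region
for label, (r, c) in groups.items():
    print(label, r.tolist(), c.tolist())
```

The sort is done once for the whole array, so the per-label work is limited to slicing.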
Let's say I have a 2D array with positive integers:
a = numpy.array([[1, 1, 2],
[1, 2, 5],
[1, 3, 6],
[3, 3, 3],
[3, 4, 6],
[4, 5, 6],
])
and a threshold (positive integer). I want to count, for each row, how many occurrences are < threshold, how many are >= threshold and < threshold+2, and how many are >= threshold+2. The results are to be stored in an n x 3 array, where n = a.shape[0] and each of the 3 columns corresponds to one threshold partition.
For the example above and threshold = 3, it would be:
b = numpy.array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2],
])
My solution was to use a for loop combined with masks, so that I could apply the masks individually for each row. But using for loops on arrays feels wrong. Is there a more optimized way to accomplish that?
My solution so far:
b = []
for row in a:
    b.append((numpy.sum(row < threshold),
              numpy.sum((row >= threshold) * (row < threshold + 2)),
              numpy.sum(row >= threshold + 2)))
b = numpy.array(b)
Approach #1
Making use of elementwise comparison against the thresholds and summing each row -
t = 3 # threshold
mask0 = (a<t)
mask2 = a>=t+2
mask1 = (a>=t) & ~mask2
out = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]
Approach #2
If you think about it closely, we are creating three bins there. So, we could get the bin ID for each element and then count, for each row, how many elements carry each ID. We would use np.searchsorted to get those bin IDs, and then compare elementwise against each ID and sum along each row.
Thus, we would have a solution, like so -
t = 3 # threshold
bins = [t, t+2] # Create intervals
N = len(bins)+1 # Number of cols in output
idx = np.searchsorted(bins,a,'right') # Get bin IDs
out = np.column_stack([(idx==i).sum(1) for i in range(N)])
We can vectorize the last step with broadcasting -
out = (idx == np.arange(N)[:,None,None]).sum(2).T
And one more vectorized alternative, which would also be memory efficient with np.bincount -
M = a.shape[0]
r = N*np.arange(M)[:,None]
out = np.bincount((idx + r).ravel(),minlength=M*N).reshape(M,N)
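To check that the variants agree, the snippets above can be run side by side on the sample array from the question. This is a self-contained sketch; the names expected, out1, out2 and out3 are introduced here only for the comparison.

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3  # threshold
expected = np.array([[3, 0, 0],
                     [2, 0, 1],
                     [1, 1, 1],
                     [0, 3, 0],
                     [0, 2, 1],
                     [0, 1, 2]])

# Approach #1: three boolean masks, summed along each row.
mask0 = a < t
mask2 = a >= t + 2
mask1 = (a >= t) & ~mask2
out1 = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]

# Approach #2: bin IDs from searchsorted, counted with broadcasting.
bins = [t, t + 2]
N = len(bins) + 1
idx = np.searchsorted(bins, a, side='right')
out2 = (idx == np.arange(N)[:, None, None]).sum(2).T

# Approach #2 with bincount: offset each row's IDs into its own block of N slots.
M = a.shape[0]
r = N * np.arange(M)[:, None]
out3 = np.bincount((idx + r).ravel(), minlength=M * N).reshape(M, N)

print(np.array_equal(out1, expected),
      np.array_equal(out2, expected),
      np.array_equal(out3, expected))
```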
You have two break points, 3 and 5. We can use np.searchsorted to find where each element of a falls with respect to these break points.
np.searchsorted([3, 5], 1, side='right') will return 0 because 1 must be inserted at position 0 to keep the array sorted.
np.searchsorted([3, 5], 3, side='right') will return 1. A value of 3 could be inserted at position 0 or at position 1 and the array would stay sorted. The default behavior is to insert to the left of elements that are equal; side='right' changes this to insert to the right of all elements that are equal. This accounts for the condition < threshold.
np.searchsorted([3, 5], 5) will return 1 with the default side='left'; with side='right' it returns 2, which puts 5 in the last bin as the condition >= threshold+2 requires.
np.searchsorted([3, 5], 7) will return 2
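The effect of the side argument is easiest to see by running both settings on the same probe values. A small sketch, with variable names chosen here for illustration:

```python
import numpy as np

breaks = [3, 5]
probes = [1, 3, 5, 7]

# Default side='left': a value equal to a break point goes before it.
left = np.searchsorted(breaks, probes)
# side='right': a value equal to a break point goes after it,
# so 3 lands in bin 1 and 5 lands in bin 2.
right = np.searchsorted(breaks, probes, side='right')

print(left.tolist())   # [0, 0, 1, 2]
print(right.tolist())  # [0, 1, 2, 2]
```

Only the probes that equal a break point change bins between the two settings.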
I use np.eye to build sub arrays to sum over in order to count how many fall within each bin.
np.eye(3, dtype=int)[np.searchsorted([3, 5], a, side='right')].sum(1)
array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2]])
We can generalize this with a function
def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    return eye[edges.searchsorted(a, side='right')].sum(1)
count_bins(a, 3, [2])
array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2]])
Or
count_bins(a, 3, [1, 1])
array([[3, 0, 0, 0],
[2, 0, 0, 1],
[1, 1, 0, 1],
[0, 3, 0, 0],
[0, 1, 1, 1],
[0, 0, 1, 2]])
But I'd rather return a pandas dataframe to see things more clearly
def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    labels = ['{:0.0f} to {:0.0f}'.format(i, j) for i, j in zip(np.append(-np.inf, edges), np.append(edges, np.inf))]
    return pd.DataFrame(
        eye[edges.searchsorted(a, side='right')].sum(1),
        columns=labels
    )
count_bins(a, 3, [2])
-inf to 3 3 to 5 5 to inf
0 3 0 0
1 2 0 1
2 1 1 1
3 0 3 0
4 0 2 1
5 0 1 2
I am trying to turn a second-order tensor into a binary third-order tensor. Given a second-order tensor as an m x n numpy array A, I need to take each element value x in A and replace it with a vector v whose length is equal to the maximum value of A, with a 1 at the index of v corresponding to the value x (i.e. v[x] = 1).
I have been following this question: Increment given indices in a matrix, which addresses producing an array with increments at indices given by 2-dimensional coordinates. I have been reading the answers and trying to use np.ravel_multi_index() and np.bincount() to do the same with 3-dimensional coordinates, but I keep getting a ValueError: "invalid entry in coordinates array". This is what I have been using:
def expand_to_tensor_3(array):
    (x, y) = array.shape
    (a, b) = np.indices((x, y))
    a = a.reshape(x*y)
    b = b.reshape(x*y)
    tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)), (x, y, np.amax(array))))
    return tensor_3
If you know what is wrong here or know an even better method to accomplish my goal, both would be really helpful, thanks.
You can use (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int).
Here's a demonstration:
In [52]: A
Out[52]:
array([[2, 0, 0, 2],
[3, 1, 2, 3],
[3, 2, 1, 0]])
In [53]: B = (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int)
In [54]: B
Out[54]:
array([[[0, 0, 1, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0]],
[[0, 0, 0, 1],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]],
[[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 0]]])
Check a few individual elements of A:
In [55]: A[0,0]
Out[55]: 2
In [56]: B[0,0,:]
Out[56]: array([0, 0, 1, 0])
In [57]: A[1,3]
Out[57]: 3
In [58]: B[1,3,:]
Out[58]: array([0, 0, 0, 1])
The expression A[:,:,np.newaxis] == np.arange(A.max()+1) uses broadcasting to compare each element of A to np.arange(A.max()+1). For a single value, this looks like:
In [63]: 3 == np.arange(A.max()+1)
Out[63]: array([False, False, False, True], dtype=bool)
In [64]: (3 == np.arange(A.max()+1)).astype(int)
Out[64]: array([0, 0, 0, 1])
A[:,:,np.newaxis] is a three-dimensional view of A with shape (3,4,1). The extra dimension is added so that the comparison to np.arange(A.max()+1) will broadcast to each element, giving a result with shape (3, 4, A.max()+1).
With a trivial change, this will work for an n-dimensional array. Indexing a numpy array with the ellipsis ... means "all the other dimensions". So
(A[..., np.newaxis] == np.arange(A.max()+1)).astype(int)
converts an n-dimensional array to an (n+1)-dimensional array, where the last dimension is the binary indicator of the integer in A. Here's an example with a one-dimensional array:
In [6]: a = np.array([3, 4, 0, 1])
In [7]: (a[...,np.newaxis] == np.arange(a.max()+1)).astype(int)
Out[7]:
array([[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]])
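An equivalent way to get the same indicator array, assuming A holds only non-negative integers, is to index the rows of an identity matrix with A. This is offered as an alternative sketch rather than as part of the answer above; the names depth, by_comparison and by_lookup are introduced here for illustration.

```python
import numpy as np

A = np.array([[2, 0, 0, 2],
              [3, 1, 2, 3],
              [3, 2, 1, 0]])
depth = A.max() + 1

# Broadcasting comparison, as described above.
by_comparison = (A[..., np.newaxis] == np.arange(depth)).astype(int)

# Row x of the identity matrix is the indicator vector for value x,
# so indexing with A picks one such row per element.
by_lookup = np.eye(depth, dtype=int)[A]

print(by_lookup.shape)
print(np.array_equal(by_comparison, by_lookup))
```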
You can make it work this way:
tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)),
(x, y, np.amax(array) + 1)))
The difference is that I add 1 to the amax() result, because ravel_multi_index() expects that the indexes are all strictly less than the dimensions, not less-or-equal.
I'm not 100% sure if this is what you wanted; another way to make the code run is to specify mode='clip' or mode='wrap' in ravel_multi_index(), which does something a bit different and I'm guessing is less correct. But you can try it.
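Note that np.bincount returns a flat one-dimensional array whose length depends on the largest flat index it sees, so the result still has to be padded and reshaped to become a third-order tensor. A minimal sketch of that completion, assuming the input is a 2D array of non-negative integers (the minlength and reshape steps are additions made here, not part of the answer above):

```python
import numpy as np

def expand_to_tensor_3(array):
    x, y = array.shape
    depth = np.amax(array) + 1  # values 0..max need max + 1 slots
    a, b = np.indices((x, y))
    flat = np.ravel_multi_index((a.ravel(), b.ravel(), array.ravel()),
                                (x, y, depth))
    # minlength pads the counts so that every (row, col, value) cell exists.
    counts = np.bincount(flat, minlength=x * y * depth)
    return counts.reshape(x, y, depth)

A = np.array([[2, 0, 0, 2],
              [3, 1, 2, 3],
              [3, 2, 1, 0]])
T = expand_to_tensor_3(A)
print(T.shape)
print(T[0, 0].tolist(), T[1, 3].tolist())
```

Each (row, column) pair contributes exactly one flat index, so every count is 0 or 1 and the result is a binary tensor.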