Faster index computation from Scipy labelled array apart from np.where - python

I am working on a large array (3000 x 3000) on which I run scipy.ndimage.label. It returns the labelled array and the number of labels (3403). I would like to know the indices of each label; e.g., for label 1, I need the rows and columns it occupies in the labelled array.
So basically like this
a[0] = array([[1, 1, 0, 0],
[1, 1, 0, 2],
[0, 0, 0, 2],
[3, 3, 0, 0]])
indices = [np.where(a[0]==t+1) for t in range(a[1])] #where a[1] = 3 is number of labels.
print indices
[(array([0, 0, 1, 1]), array([0, 1, 0, 1])), (array([1, 2]), array([3, 3])), (array([3, 3]), array([0, 1]))]
And I would like to create a list of indices for all 3403 labels like above. The above method seems to be slow. I tried using generators, it doesn't look like there is improvement.
Are there any efficient ways?

Well, the idea for gaining efficiency is to minimize the work done inside the loop. A fully vectorized method isn't possible, given that the number of elements varies per label. So, with those factors in mind, here's one solution -
a_flattened = a[0].ravel()
sidx = np.argsort(a_flattened)
afs = a_flattened[sidx]
cut_idx = np.r_[0,np.flatnonzero(afs[1:] != afs[:-1])+1,a_flattened.size]
row, col = np.unravel_index(sidx, a[0].shape)
row_indices = [row[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
col_indices = [col[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
Sample input, output -
In [59]: a[0]
Out[59]:
array([[1, 1, 0, 0],
[1, 1, 0, 2],
[0, 0, 0, 2],
[3, 3, 0, 0]])
In [60]: a[1]
Out[60]: 3
In [62]: row_indices # row indices
Out[62]:
[array([0, 0, 1, 2, 2, 2, 3, 3]), # for label-0
array([0, 0, 1, 1]), # for label-1
array([1, 2]), # for label-2
array([3, 3])] # for label-3
In [63]: col_indices # column indices
Out[63]:
[array([2, 3, 2, 0, 1, 2, 2, 3]), # for label-0
array([0, 1, 0, 1]), # for label-1
array([3, 3]), # for label-2
array([0, 1])] # for label-3
With their first elements dropped, row_indices and col_indices are the expected output. The first group in each represents the 0-th region (the unlabelled background), so you might want to skip it.
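The steps of this answer can be wrapped into one function that also drops the background group. This is only a sketch: the name label_indices and the dictionary return type are made up for illustration, kind="stable" is an addition so that positions within a label stay in row-major order, and the input is assumed to be a non-empty integer array such as the first element returned by scipy.ndimage.label.

```python
import numpy as np

def label_indices(labelled):
    """Map each nonzero label to its (rows, cols) index arrays.

    Hypothetical helper; `labelled` is assumed to be a non-empty
    integer array where 0 marks the background.
    """
    flat = labelled.ravel()
    # One sort groups all positions that share a label.
    sidx = np.argsort(flat, kind="stable")
    sorted_labels = flat[sidx]
    # Boundaries between consecutive runs of equal labels.
    cut_idx = np.r_[0,
                    np.flatnonzero(sorted_labels[1:] != sorted_labels[:-1]) + 1,
                    flat.size]
    row, col = np.unravel_index(sidx, labelled.shape)
    groups = {}
    for i, j in zip(cut_idx[:-1], cut_idx[1:]):
        label = int(sorted_labels[i])
        if label != 0:  # skip the background region
            groups[label] = (row[i:j], col[i:j])
    return groups

labelled = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 2],
                     [0, 0, 0, 2],
                     [3, 3, 0, 0]])
groups = label_indices(labelled)
```

The loop only slices precomputed arrays, so the per-label cost is small compared with calling np.where once per label on the full array.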

Related

Find first n non zero values in numpy 2d array

I would like to know the fastest way to extract the indices of the first n non zero values per column in a 2D array.
For example, with the following array:
arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
With n=2 I would have [0, 0, 1, 1, 2] as xs and [0, 3, 2, 5, 3] as ys. 2 values in the first and second columns and 1 in the third.
Here is how it is currently done:
x = []
y = []
n = 2
for i, c in enumerate(arr.T):
    a = c.nonzero()[0][:n]
    if len(a):
        x.extend([i] * len(a))
        y.extend(a)
In practice I have arrays of size (405, 256).
Is there a way to make it faster?
Here is a method that does not require sorting the array (a single linear scan is enough to find the nonzero values), although it is quite confusing because it chains a lot of functions:
n = 2
# Get indices with non null values, columns indices first
nnull = np.stack(np.where(arr.T != 0))
# split indices by unique value of column
cols_ids= np.array_split(range(len(nnull[0])), np.where(np.diff(nnull[0]) > 0)[0] +1 )
# Take n in each (max) and concatenate the whole
np.concatenate([nnull[:, u[:n]] for u in cols_ids], axis = 1)
outputs:
array([[0, 0, 1, 1, 2],
[0, 3, 2, 5, 3]], dtype=int64)
Here is one approach using argsort, it gives a different order though:
n = 2
m = arr!=0
# non-zero values first
idx = np.argsort(~m, axis=0)
# get first 2 and ensure non-zero
m2 = np.take_along_axis(m, idx, axis=0)[:n]
y,x = np.where(m2)
# slice
x, idx[y,x]
# (array([0, 1, 2, 0, 1]), array([0, 2, 3, 3, 5]))
Compare the column indices returned by nonzero on the transposed array with a copy of themselves shifted by n positions:
>>> n = 2
>>> i, j = arr.T.nonzero()
>>> mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
>>> i[mask], j[mask]
(array([0, 0, 1, 1, 2], dtype=int64), array([0, 3, 2, 5, 3], dtype=int64))
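Here is the shifted comparison as a small function, in case that helps to see why it works. It is a sketch, not a tuned implementation: the function name is made up, and the mask is built with np.ones instead of a list so that it also handles arrays with fewer than n nonzero entries. It relies on arr.T.nonzero() returning the column indices in sorted order, so an entry is among the first n of its column exactly when the entry n positions earlier belongs to a different column.

```python
import numpy as np

def first_n_nonzero_per_column(arr, n):
    """Return (cols, rows) of the first n nonzero entries of each column.

    Hypothetical helper name; requires n >= 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # nonzero() on the transpose lists entries column by column.
    cols, rows = arr.T.nonzero()
    keep = np.ones(cols.size, dtype=bool)
    # From position n on, keep an entry only if the entry n places
    # earlier sits in a different column.
    keep[n:] = cols[n:] != cols[:-n]
    return cols[keep], rows[keep]

arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
xs, ys = first_n_nonzero_per_column(arr, 2)
```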

Repeat Array while Maintaining Order within group

I have the below array and would like to repeat each array n times.
x_array
[array([14.91488012, 1.2986064 , 4.98965322]),
array([2.39389187e+02, 1.04442059e-01, 3.06391338e-01]),
array([ 48.19437348, 201.09951372, 0.35223001]),
array([ 19.96978171, 367.52578786, 0.68676553]),
array([0.55120466, 0.27133609, 0.75646697]),
array([8.21287360e+02, 1.76495077e+02, 4.87263691e-01]),
array([184.03439377, 1.24823107, 5.33109884]),
array([575.59800297, 186.4650814 , 2.21028258]),
array([0.50308552, 3.09976082, 0.10537899]),
array([1.02259912e+00, 1.52282513e+02, 1.15085308e-01])]
I've tried np.repeat(x_array, 2) but this doesn't preserve the order of the matrix/array. I've also tried x_array*2, but this seems to just put the new array at the bottom. I was hoping to repeat x_array[0] n times and do the same for the next set of arrays, so that I have n total of each in order.
Thanks in advance.
Building off of the last example from https://numpy.org/doc/stable/reference/generated/numpy.repeat.html,
x_array = np.array(x_array) # Or a similar operation to convert x_array to an ndarray vs. a list of arrays.
expanded_x_array = np.repeat(x_array, n, axis=0)
print(expanded_x_array)
should produce what you are looking for.
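A tiny self-contained check of the difference, with made-up two-row data instead of the array from the question (the variable names are only for illustration):

```python
import numpy as np

# Made-up stand-in for the list of arrays in the question.
x_list = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
n = 2

stacked = np.array(x_list)                # shape (2, 3)
repeated = np.repeat(stacked, n, axis=0)  # each row repeated n times, in order
flat = np.repeat(stacked, n)              # no axis: flattens, then repeats each number
```

Here repeated has shape (4, 3) with each row followed by its copy, while flat is a 1d array of 12 numbers.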
You just need to specify the axis:
>>> np.repeat(x_array, 2, axis=0)
array([[1.49149e+01, 1.29861e+00, 4.98965e+00],
[1.49149e+01, 1.29861e+00, 4.98965e+00],
[2.39389e+02, 1.04442e-01, 3.06391e-01],
[2.39389e+02, 1.04442e-01, 3.06391e-01],
...,
[5.03086e-01, 3.09976e+00, 1.05379e-01],
[5.03086e-01, 3.09976e+00, 1.05379e-01],
[1.02260e+00, 1.52283e+02, 1.15085e-01],
[1.02260e+00, 1.52283e+02, 1.15085e-01]])
From the docs:
numpy.repeat(a, repeats, axis=None)
...
axis int, optional
The axis along which to repeat values. By default, use the flattened input array, and return a flat output array.
You could use a list comprehension:
n = 2
repeated_list = [row for row in x_array for _ in range(n)]
print(repeated_list)
Your terminology is confusing. You say it's an "array", but the display looks more like a list. And the fact that x_array*2 puts a "new array" at the bottom confirms that: that's the list use of *.
np.repeat(x_array) first makes an array (a real one!)
np.array(x_array)
is a (n,3) float dtype array. Without axis np.repeat flattens - as documented!
Specifying the axis=0 works because it's repeating on that first n dimension. The result is a (2*n,3) float dtype array (not a list).
It is possible to make a 1d object dtype array containing those arrays. With that repeat will work without the axis parameter.
Knowing what you have, and describing it accurately, can make this kind of task much easier - and the questions clearer.
illustration
Make a list of arrays:
In [21]: alist = [np.ones(3,int),np.zeros(3,int),np.arange(3)]
In [22]: alist
Out[22]: [array([1, 1, 1]), array([0, 0, 0]), array([0, 1, 2])]
List repeat:
In [23]: alist*2
Out[23]:
[array([1, 1, 1]),
array([0, 0, 0]),
array([0, 1, 2]),
array([1, 1, 1]),
array([0, 0, 0]),
array([0, 1, 2])]
Make a 2d array from the list:
In [24]: np.array(alist)
Out[24]:
array([[1, 1, 1],
[0, 0, 0],
[0, 1, 2]])
repeat without axis repeats elements in a flattened way:
In [25]: np.repeat(alist,2)
Out[25]: array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
repeat this 2d array on 0 axis:
In [26]: np.repeat(alist,2,axis=0)
Out[26]:
array([[1, 1, 1],
[1, 1, 1],
[0, 0, 0],
[0, 0, 0],
[0, 1, 2],
[0, 1, 2]])
Object dtype array from list:
In [27]: arr = np.empty(3,object); arr[:]=alist
In [28]: arr
Out[28]: array([array([1, 1, 1]), array([0, 0, 0]), array([0, 1, 2])], dtype=object)
Since the arrays have the same size we have to use this special construct. Otherwise we get the 2d array [24].
This array has a repeat method, and with only one dimension we don't need to specify the axis. It's repeating the object elements (the arrays), not the numbers in the 2d [24] array.
In [29]: arr.repeat(2)
Out[29]:
array([array([1, 1, 1]), array([1, 1, 1]), array([0, 0, 0]),
array([0, 0, 0]), array([0, 1, 2]), array([0, 1, 2])], dtype=object)

Concatenate two numpy arrays so that index order keeps the same?

Assume I have two numpy arrays as follows:
{0: array([ 2, 4, 8, 9, 12], dtype=int64),
1: array([ 1, 3, 5], dtype=int64)}
Now I want to replace the values of each array with the ID at the front, i.e. the values in array 0 become 0 and those in array 1 become 1. Then both arrays should be merged so that the result follows the order of the original values.
I.e. desired output:
array([1, 0, 1, 0, 1, 0, 0 ,0])
But that's what I get:
np.concatenate((h1,h2), axis=0)
array([0, 0, 0, 0, 0, 1, 1, 1])
(Each array contains only unique values, if this helps.)
How can this be done?
Your description of merging is a bit unclear. But here's something that makes sense
In [399]: dd ={0: np.array([ 2, 4, 8, 9, 12]),
...: 1: np.array([ 1, 3, 5])}
In [403]: res = np.zeros(13, int)
In [404]: res[dd[0]] = 0
In [405]: res[dd[1]] = 1
In [406]: res
Out[406]: array([0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
Or to make the assignments clearer:
In [407]: res = np.zeros(13, int)
In [408]: res[dd[0]] = 2
In [409]: res[dd[1]] = 1
In [410]: res
Out[410]: array([0, 1, 2, 1, 2, 1, 0, 0, 2, 2, 0, 0, 2])
Otherwise the talk of index positions doesn't make a whole lot of sense.
Something like this?
d = {0: np.array([ 2, 4, 8, 9, 12], dtype=np.int64),
1: np.array([ 1, 3, 5], dtype=np.int64)}
(np.concatenate([d[0],d[1]]).argsort(kind="stable")>=len(d[0])).view(np.uint8)
# array([1, 0, 1, 0, 1, 0, 0, 0], dtype=uint8)
np.concatenate just appends lists/arrays.
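The same argsort idea can be spelled out for a dictionary with any number of arrays. This is a sketch under the assumption that the desired output is "the ID of each value, listed in ascending order of the values"; the function name is made up for illustration.

```python
import numpy as np

def ids_in_value_order(groups):
    """Return the key of each value, ordered by ascending value.

    Hypothetical helper; `groups` maps an integer ID to a 1d array.
    """
    keys = list(groups)
    values = np.concatenate([groups[k] for k in keys])
    # One ID per value, in the same order as `values`.
    ids = np.repeat(keys, [len(groups[k]) for k in keys])
    # Reorder the IDs the way sorting would reorder the values.
    return ids[np.argsort(values, kind="stable")]

d = {0: np.array([2, 4, 8, 9, 12]), 1: np.array([1, 3, 5])}
merged = ids_in_value_order(d)
```

For the dictionary from the question, merged matches the desired output.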
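Maybe an unconventional way to go about it, but you could repeat the alternating ID pattern for the length of the shorter array (numpy.tile does this) and then append the ID of the longer array once for each leftover element. Note that this only reproduces the desired output when the values of the two arrays alternate until the shorter one runs out, as they do in the example.
import numpy
h1 = numpy.array([2, 4, 8, 9, 12])  # values with ID 0
h2 = numpy.array([1, 3, 5])  # values with ID 1
temp = min(len(h1), len(h2))  # length of the shorter array
diff = abs(len(h1) - len(h2))  # number of leftover elements
# Alternate the IDs, starting with the ID of the array that holds the smallest value
A = numpy.tile([1, 0], temp)
# Pad with the ID of the longer array (0 here, because h1 is longer)
B = numpy.repeat(0, diff)
C = numpy.concatenate((A, B), axis=0)
# C is array([1, 0, 1, 0, 1, 0, 0, 0])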
Maybe not the most dynamic or kindest way to go about this but if your solution requires just that, then it could do the job in the meantime.

indices of sparse_csc matrix are reversed after extracting some columns

I'm trying to extract columns of a scipy sparse column matrix, but the result is not stored as I'd expect. Here's what I mean:
In [77]: a = scipy.sparse.csc_matrix(np.ones([4, 5]))
In [78]: ind = np.array([True, True, False, False, False])
In [79]: b = a[:, ind]
In [80]: b.indices
Out[80]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
In [81]: a.indices
Out[81]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
How come b.indices is not [0, 1, 2, 3, 0, 1, 2, 3] ?
And since this behaviour is not the one I expect, is a[:, ind] not the correct way to extract columns from a csc matrix?
The indices are not sorted. You can either get sorted output by reversing a's rows in the indexing (a[::-1, ind]), which is not that intuitive, or enforce sorted indices with sorted_indices() (you can also sort in place with sort_indices(), but I prefer getting a copy). What I find funny is that the has_sorted_indices attribute does not always return a boolean; sometimes it returns the integer representation (0 or 1) instead.
a = scipy.sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])
b = a[::-1, ind]
b2 = a[:, ind]
b3 = b2.sorted_indices()
b.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b.has_sorted_indices
>>1
b2.indices
>>array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
b2.has_sorted_indices
>>0
b3.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b3.has_sorted_indices
>>True
csc and csr indices are not guaranteed to be sorted. I can't offhand find documentation to that effect, but the has_sorted_indices attribute and the sort methods suggest it.
In your case the order is the result of how the indexing is done. I found in previous SO questions, that multicolumn indexing is performed with a matrix multiplication:
In [165]: a = sparse.csc_matrix(np.ones([4,5]))
In [166]: b = a[:,[0,1]]
In [167]: b.indices
Out[167]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
This indexing is the equivalent to constructing a 'selection' matrix:
In [169]: I = sparse.csr_matrix(np.array([[1,0,0,0,0],[0,1,0,0,0]]).T)
In [171]: I.A
Out[171]:
array([[1, 0],
[0, 1],
[0, 0],
[0, 0],
[0, 0]], dtype=int32)
and doing this matrix multiplication:
In [172]: b1 = a * I
In [173]: b1.indices
Out[173]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
The order is the result of how the matrix multiplication was done. In fact a * a.T does the same reversal. We'd have to examine the multiplication code to know exactly why. Evidently the csc and csr calculation code doesn't require sorted indices, and doesn't bother to ensure the results are sorted.
https://docs.scipy.org/doc/scipy-0.19.1/reference/sparse.html#further-details
Further Details
CSR column indices are not necessarily sorted. Likewise for CSC row indices. Use the .sorted_indices() and .sort_indices() methods when sorted indices are required (e.g. when passing data to other libraries).
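To see the two methods side by side without depending on how a particular SciPy version performs the column indexing, one can build a CSC matrix with deliberately unsorted row indices. This is a sketch with made-up data; only the (data, indices, indptr) constructor, sorted_indices() and sort_indices() are used.

```python
import numpy as np
from scipy import sparse

# A 4x2 CSC matrix whose row indices are stored in reverse order
# within each column (made-up data for illustration).
data = np.arange(1.0, 9.0)
indices = np.array([3, 2, 1, 0, 3, 2, 1, 0], dtype=np.int32)
indptr = np.array([0, 4, 8], dtype=np.int32)
b = sparse.csc_matrix((data, indices, indptr), shape=(4, 2))

dense_before = b.toarray()

# sorted_indices() returns a sorted copy and leaves b untouched.
b_sorted = b.sorted_indices()

# sort_indices() sorts b itself, in place.
b.sort_indices()
```

Sorting reorders indices and data together, so the matrix itself is unchanged; only the storage order differs.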

Increment given indices in a matrix

Briefly: there is a similar question and the best answer suggests using numpy.bincount. I need the same thing, but for a matrix.
I've got two arrays:
array([1, 2, 1, 1, 2])
array([2, 1, 1, 1, 1])
together they make indices that should be incremented:
>>> np.array([a, b]).T
array([[1, 2],
[2, 1],
[1, 1],
[1, 1],
[2, 1]])
I want to get this matrix:
array([[0, 0, 0],
[0, 2, 1], # (1,1) twice, (1,2) once
[0, 2, 0]]) # (2,1) twice
The matrix will be small (like, 5×5), and the number of indices will be large (somewhere near 10^3 or 10^5).
So, is there anything better (faster) than a for-loop?
You can still use bincount(). The trick is to convert a and b into a single 1D array of flat indices.
If the matrix is nxm, you could apply bincount() to a * m + b, and construct the matrix from the result.
To take the example in your question:
In [15]: a = np.array([1, 2, 1, 1, 2])
In [16]: b = np.array([2, 1, 1, 1, 1])
In [17]: cnt = np.bincount(a * 3 + b)
In [18]: cnt.resize((3, 3))
In [19]: cnt
Out[19]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
If the shape of the array is more complicated, it might be easier to use np.ravel_multi_index() instead of computing flat indices by hand:
In [20]: cnt = np.bincount(np.ravel_multi_index(np.vstack((a, b)), (3, 3)))
In [21]: np.resize(cnt, (3, 3))
Out[21]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
(Hat tip @Jaime for pointing out ravel_multi_index.)
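One caveat with the resizing step: the ndarray.resize method pads with zeros, but the np.resize function fills a larger result with repeated copies of the input, so it only gives the right matrix when the repeated values happen to be zero. A variant that avoids resizing altogether passes minlength to bincount and reshapes. This is a sketch; the function name and its signature are made up for illustration.

```python
import numpy as np

def count_pairs(rows, cols, shape):
    """Count how often each (row, col) pair occurs, as a matrix of `shape`.

    Hypothetical helper built on np.bincount and np.ravel_multi_index.
    """
    flat = np.ravel_multi_index((rows, cols), shape)
    # minlength guarantees one bin per matrix cell, including cells
    # after the largest index that actually occurs.
    counts = np.bincount(flat, minlength=shape[0] * shape[1])
    return counts.reshape(shape)

a = np.array([1, 2, 1, 1, 2])
b = np.array([2, 1, 1, 1, 1])
cnt = count_pairs(a, b, (3, 3))
```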
m1 = m.view(np.ndarray) # Create a view of the existing matrix m
m1.shape = -1 # Make the view one-dimensional
m1 += np.bincount(a * m.shape[1] + b, minlength=m1.size) # a holds row indices, b column indices
