I have an array of shape (128, 36, 8) and I'd like to find the number of occurrences of the unique subarrays of length 8 in the last dimension.
I'm aware of np.unique and np.bincount, but those seem to be for elements rather than subarrays. I've seen this question but it's about finding the first occurrence of a particular subarray, rather than the counts of all unique subarrays.
The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension.
So, I am assuming that uniqueness is to be judged with the first two dimensions merged together. Let us assume A is the input 3D array.
Get the number of unique subarrays
import numpy as np

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])
# Perform a lexicographic sort and reorder the rows accordingly
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]
# Each row that differs from its predecessor starts a new unique subarray;
# count those rows and add 1 for the very first row
unq_out = np.any(np.diff(sorted_Ar, axis=0), 1).sum() + 1
Sample run -
In [159]: A # A is (2,2,3)
Out[159]:
array([[[0, 0, 0],
[0, 0, 2]],
[[0, 0, 2],
[2, 0, 1]]])
In [160]: unq_out
Out[160]: 3
Get the count of occurrences of unique subarrays
# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])
# Perform a lexicographic sort and reorder the rows accordingly
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]
# Assign an ID to each row: the cumulative sum increments whenever a row
# differs from the previous one, so identical rows share an ID
ids = np.append([0], np.any(np.diff(sorted_Ar, axis=0), 1).cumsum())
# Get counts for each ID as the final output
unq_count = np.bincount(ids)
Sample run -
In [64]: A
Out[64]:
array([[[0, 0, 2],
[1, 1, 1]],
[[1, 1, 1],
[1, 2, 0]]])
In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
Here I've modified @Divakar's very useful answer to return the counts of the unique subarrays as well as the subarrays themselves, so that the output matches that of collections.Counter.most_common():
import numpy as np

def most_common_subarrays(arr):
    # Get the array in 2D form.
    arr = arr.reshape(-1, arr.shape[-1])
    # Lexicographically sort the rows
    sorted_arr = arr[np.lexsort(arr.T), :]
    # Get the indices where a new row appears
    diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]
    # Get the unique rows
    unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]
    # Get the number of occurrences of each unique array (the -1 is needed
    # at the beginning, rather than 0, because of fencepost concerns)
    counts = np.diff(
        np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))
    # Return the (row, count) pairs sorted by count
    return sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)
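As a quick check, here is a usage sketch on a small array (the wrapper name most_common_subarrays above is mine, not from the original answer; the order among count-1 rows may vary):

import numpy as np

A = np.array([[[0, 0, 2],
               [1, 1, 1]],
              [[1, 1, 1],
               [1, 2, 0]]])
print(most_common_subarrays(A))
# e.g. [(array([1, 1, 1]), 2), (array([1, 2, 0]), 1), (array([0, 0, 2]), 1)]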
I am not sure that it's the most efficient way to do it, but this should work.
arr = arr.reshape(128 * 36, 8)
unique_ = []
occurrence_ = []
for sub in arr:
    if sub.tolist() not in unique_:
        unique_.append(sub.tolist())
        occurrence_.append(1)
    else:
        occurrence_[unique_.index(sub.tolist())] += 1
for index_, u in enumerate(unique_):
    print(u, "occurrence: %s" % occurrence_[index_])
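Since each subarray is short, the same result can also be had with collections.Counter by hashing the rows as tuples. A minimal sketch of my own, assuming arr is the (128, 36, 8) array from the question:

from collections import Counter

counts = Counter(map(tuple, arr.reshape(-1, arr.shape[-1])))
for row, n in counts.most_common():
    print(row, "occurrence: %s" % n)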
Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, for my specific case, for some rows, 2 or more columns may contain the maximum value. In that case, I want to select a column index randomly (not the first index as it is the case with .argmax(1)).
For example, for the following arr:
arr = np.array([
[0, 1, 0],
[1, 1, 0],
[2, 1, 3],
[3, 2, 2]
])
there can be two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]) each chosen with 1/2 probability.
I have code that returns the expected output using a list comprehension:
idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]
but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods to make the code feasible for bigger arrays?
Use scipy.stats.rankdata and apply_along_axis as follows.
import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis = 1, method = "min")
func = lambda x: np.random.choice(np.where(x==1)[0])
idx = np.apply_along_axis(func, 1, ranks)
print(idx)
It returns [1 0 2 0] or [1 1 2 0].
The main idea is that rankdata computes the rank of every value in each row, so the maximum value(s) of a row get rank 1. func randomly chooses one of the indices whose rank is 1, and apply_along_axis applies func to every row of ranks.
After some advice I got offline, it turns out that randomizing among the maximum values is possible when we multiply the boolean array that flags the row-wise maxima by a random array of the same shape. Then what remains is a simple argmax(1) call.
# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question runs for 11 sec, so this is about 63x faster.
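Applied to the arr from the question, a quick sketch of the intermediate steps (the result varies from run to run):

import numpy as np

arr = np.array([[0, 1, 0],
                [1, 1, 0],
                [2, 1, 3],
                [3, 2, 2]])
mxs = arr == arr.max(1, keepdims=True)          # True at each row-wise maximum
random_arr = np.random.rand(*arr.shape) * mxs   # random weights at maxima, 0 elsewhere
print(random_arr.argmax(1))                     # e.g. [1 0 2 0] or [1 1 2 0]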
I want to find the frequency of the elements of one 1D numpy array (arr1) in another 1D numpy array (arr2). The array arr1 contains elements with no repetitions. Also, all elements in arr1 are part of the array of unique elements of arr2.
Consider this as an example,
arr1 = np.array([1,2,6])
arr2 = np.array([2, 3, 6, 1, 2, 1, 2, 0, 2, 0])
At present, I am using the following:
freq = np.zeros(len(arr1))
for i in range(len(arr1)):
    mark = np.where(arr2 == arr1[i])
    freq[i] = len(mark[0])
print(freq)
>> [2. 4. 1.]
The aforementioned method gives me the correct answer, but I want to know if there is a better/more efficient method than the one I am following.
Here's a vectorized solution based on np.searchsorted -
idx = np.searchsorted(arr1,arr2)
idx[idx==len(arr1)] = 0
mask = arr1[idx]==arr2
out = np.bincount(idx[mask])
It assumes arr1 is sorted. If not, we have two options:
Sort arr1 as a pre-processing step. Since arr1 consists of unique elements of arr2, it should be a comparatively small array and hence inexpensive to sort.
Use the sorter arg of searchsorted to compute idx:
sidx = arr1.argsort()
idx = sidx[np.searchsorted(arr1,arr2,sorter=sidx)]
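A quick check against the sample arrays from the question (my own verification, not part of the original answer):

import numpy as np

arr1 = np.array([1, 2, 6])                       # already sorted
arr2 = np.array([2, 3, 6, 1, 2, 1, 2, 0, 2, 0])
idx = np.searchsorted(arr1, arr2)
idx[idx == len(arr1)] = 0
mask = arr1[idx] == arr2
print(np.bincount(idx[mask]))                    # [2 4 1]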
I have a list of ND arrays (vectors); each vector has shape (1, 300).
My goal is to find the duplicate vectors in the list, sum them, and divide by the size of the list; the resulting vector will replace each duplicate.
For example, a is a list of ND arrays, a = [[2,3,1],[5,65,-1],[2,3,1]], then the first and the last element are duplicates.
their sum would be :[4,6,2],
which will be divided by the size of a list of vectors, size = 3.
Output: a = [[4/3,6/3,2/3],[5,65,-1],[4/3,6/3,2/3]]
I have tried to use a Counter but it doesn't work for ndarrays.
What is the Numpy way?
Thanks.
If you have numpy 1.13 or higher, this is pretty simple:
import numpy as np

def f(a):
    u, inv, c = np.unique(a, return_counts=True, return_inverse=True, axis=0)
    p = np.where(c > 1, c / a.shape[0], 1)[:, None]
    return (u * p)[inv]
If you don't have 1.13, you'll need some trick to convert a into a 1-d array first. I recommend @Jaime's excellent answer using np.void here.
How it works:
u is the unique rows of a (usually not in their original order)
c is the number of times each row of u is repeated in a
inv is the indices that map u back to a, i.e. u[inv] == a
p is the multiplier for each row of u based on your requirements: 1 if c == 1, and c / n (where n is the number of rows in a) if c > 1. [:, None] turns it into a column vector so that it broadcasts well with u
u * p is then indexed back to the original locations by [inv]
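A quick run on the example from the question (my own check; the list is converted to an ndarray first):

import numpy as np

a = np.array([[2, 3, 1], [5, 65, -1], [2, 3, 1]])
print(f(a))
# [[ 1.33333333  2.          0.66666667]
#  [ 5.         65.         -1.        ]
#  [ 1.33333333  2.          0.66666667]]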
You can use np.unique with axis=0 and return_counts=True:
elements, count = np.unique(a, axis=0, return_counts=True)
return_counts makes it also return the number of occurrences of each unique row.
The output looks like this:
(array([[ 2, 3, 1],
[ 5, 65, -1]]), array([2, 1]))
Then you can multiply them like this:
(count * elements.T).T
Output:
array([[ 4, 6, 2],
[ 5, 65, -1]])
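This gives the sum of each duplicated row, but the question also asks to divide by the list size and scatter the results back to their original positions; return_inverse handles the scattering. A sketch building on this answer, not part of it:

import numpy as np

a = np.array([[2, 3, 1], [5, 65, -1], [2, 3, 1]])
elements, inverse, count = np.unique(a, axis=0, return_inverse=True,
                                     return_counts=True)
scale = np.where(count > 1, count / len(a), 1)[:, None]
print((elements * scale)[inverse])   # duplicates replaced by their scaled sum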
I have 3 numpy.ndarray vectors, X, Y and intensity. I would like to mix it in an numpy array, then sort by the third column (or the first one). I tried the following code:
m=np.column_stack((X,Y))
m=np.column_stack((m,intensity))
m=np.sort(m,axis=2)
Then I got the error: ValueError: axis(=2) out of bounds.
When I print m, I get:
array([[ 109430, 285103, 121],
[ 134497, 284907, 134],
[ 160038, 285321, 132],
...,
[12374406, 2742429, 148],
[12371858, 2741994, 148],
[12372221, 2742017, 161]])
How can I fix it, that is, get a sorted array?
axis=2 does not refer to the column index but to a dimension of the array: numpy would look for a third dimension in the data and sort along it, and a 2D array only has axes 0 and 1. Sorting along the first dimension (axis=0) sorts the values within each column from smallest to largest, top to bottom; sorting along the second dimension (axis=1) sorts the values within each row from smallest to largest, left to right. Examples are below.
Furthermore, sort behaves differently depending on the kind of array. Two kinds are considered here: unstructured and structured.
Unstructured
import numpy as np

X = np.random.randn(3)
Y = np.random.randn(3)
intensity = np.random.randn(3)
m = np.column_stack((X, Y))
m = np.column_stack((m, intensity))
m is treated as an unstructured array because no fields are attached to any of its columns. In other words, if you call np.sort() on m, it will just sort the values from smallest to largest, top to bottom if axis=0 and left to right if axis=1; the rows are not preserved.
Original:
[[ 1.20122251 1.41451461 -1.66427245]
[ 1.3657312 -0.2318793 -0.23870104]
[-0.30280613 0.79123814 -1.64082042]]
Axis=1:
[[-1.66427245 1.20122251 1.41451461]
[-0.23870104 -0.2318793 1.3657312 ]
[-1.64082042 -0.30280613 0.79123814]]
Axis = 0:
[[-0.30280613 -0.2318793 -1.66427245]
[ 1.20122251 0.79123814 -1.64082042]
[ 1.3657312 1.41451461 -0.23870104]]
Structured
As you can see, the row structure is not kept. If you would like to preserve the rows, you need to attach labels to the columns through a structured datatype and create a structured array. You can then sort by any column with order=label_name.
dtype = [("a", float), ("b", float), ("c", float)]
m = [tuple(x) for x in m]
labelled_arr = np.array(m, dtype)
print(np.sort(labelled_arr, order="a"))
This will get:
[(-0.30280612629541204, 0.7912381363389004, -1.640820419927318)
(1.2012225144719493, 1.4145146097431947, -1.6642724545574712)
(1.3657312047892836, -0.23187929505306418, -0.2387010374198555)]
Another, more convenient way of doing this is to pass the data into a pandas DataFrame, which automatically creates column names from 0 to n-1. You can then call the sort_values method with the column index you want, followed by axis=0 if you would like it sorted from top to bottom, just as in numpy.
Example:
import pandas as pd

pd.DataFrame(m).sort_values(0, axis=0)
Output:
0 1 2
2 -0.302806 0.791238 -1.640820
0 1.201223 1.414515 -1.664272
1 1.365731 -0.231879 -0.238701
You are getting that error because you don't have an axis with index 2; axes are zero-indexed. Regardless, np.sort will sort every column, or every row, independently. Consider the order argument from the docs:
order : str or list of str, optional When a is an array with fields
defined, this argument specifies which fields to compare first,
second, etc. A single field can be specified as a string, and not all
fields need be specified, but unspecified fields will still be used,
in the order in which they come up in the dtype, to break ties.
For example:
In [28]: a
Out[28]:
array([[0, 0, 1],
[1, 2, 3],
[3, 1, 8]])
In [29]: np.sort(a, axis = 0)
Out[29]:
array([[0, 0, 1],
[1, 1, 3],
[3, 2, 8]])
In [30]: np.sort(a, axis = 1)
Out[30]:
array([[0, 0, 1],
[1, 2, 3],
[1, 3, 8]])
So, I think what you really want is this neat little idiom:
In [32]: a[a[:,2].argsort()]
Out[32]:
array([[0, 0, 1],
[1, 2, 3],
[3, 1, 8]])
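Applied to the m from the question, sorting the rows by the intensity column is then a direct application of the idiom above:

m_sorted = m[m[:, 2].argsort()]   # rows reordered by the third column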
Suppose I have a binary matrix. I would like to cast that matrix into another matrix where each row has a single one, and the index of that one is chosen randomly for each row.
For instance, if one of the rows is [0,1,0,1,0,0,1], I would cast it to [0,0,0,1,0,0,0], where we select the 1's index randomly.
How could I do it in numpy?
Presently I find the max of each row (since the argmax function returns a single index) and set it to zero. It works if all rows have at most 2 ones, but fails if there are more than 2.
Extending @zhangxaochen's answer, given a random binary array
x = np.random.randint(0, 2, (8, 8))
you can populate another array with a randomly drawn 1 from x:
y = np.zeros_like(x)
ind = [np.random.choice(np.where(row)[0]) for row in x]
y[range(x.shape[0]), ind] = 1
I'd use np.where (or np.nonzero) to get the indices of the non-zero elements:
In [351]: a=np.array([0,1,0,1,0,0,1])
In [352]: from random import choice
...: idx=choice(np.where(a)[0]) #or np.nonzero instead of np.where
In [353]: b=np.zeros_like(a)
...: b[idx]=1
...: b
Out[353]: array([0, 1, 0, 0, 0, 0, 0])
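To apply this row by row to the matrix from the question, a straightforward sketch (my own extension of this answer; it assumes every row contains at least one 1):

import numpy as np

def random_one_per_row(x):
    # Keep a single, randomly chosen 1 in each row of a binary matrix
    out = np.zeros_like(x)
    for i, row in enumerate(x):
        out[i, np.random.choice(np.nonzero(row)[0])] = 1
    return out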