Counting occurrences of elements of one array in another array - python

I want to find the frequency of the elements of a given one-dimensional numpy array (arr1) in another one-dimensional numpy array (arr2). The array arr1 contains no repeated elements, and every element of arr1 appears among the unique elements of arr2.
Consider this as an example,
arr1 = np.array([1,2,6])
arr2 = np.array([2, 3, 6, 1, 2, 1, 2, 0, 2, 0])
At present, I am using the following:
freq = np.zeros(len(arr1))
for i in range(len(arr1)):
    mark = np.where(arr2 == arr1[i])
    freq[i] = len(mark[0])
print(freq)
>> [2. 4. 1.]
The aforementioned method gives me the correct answer. But, I want to know if there is a better/more efficient method than the one that I am following.

Here's a vectorized solution based on np.searchsorted -
idx = np.searchsorted(arr1,arr2)
idx[idx==len(arr1)] = 0
mask = arr1[idx]==arr2
out = np.bincount(idx[mask])
It assumes arr1 is sorted. If it is not, we have two options:
Sort arr1 as a pre-processing step. Since arr1 consists of unique elements drawn from arr2, it should be a comparatively small array, making the sort inexpensive.
Use the sorter arg with searchsorted to compute idx:
sidx = arr1.argsort()
idx = sidx[np.searchsorted(arr1, arr2, sorter=sidx)]
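Putting the pieces together, here's a minimal sketch of the whole approach as a self-contained function (the name count_in and the minlength argument are my additions; it assumes, as the question states, that arr1 has no duplicates):
import numpy as np

def count_in(arr1, arr2):
    # Sort arr1 once so searchsorted can binary-search into it
    sidx = arr1.argsort()
    pos = np.searchsorted(arr1, arr2, sorter=sidx)
    pos[pos == len(arr1)] = 0           # out-of-range hits; discarded by the mask below
    idx = sidx[pos]
    mask = arr1[idx] == arr2            # keep exact matches only
    return np.bincount(idx[mask], minlength=len(arr1))

arr1 = np.array([1, 2, 6])
arr2 = np.array([2, 3, 6, 1, 2, 1, 2, 0, 2, 0])
print(count_in(arr1, arr2))             # [2 4 1]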

Related

Get column indices of row-wise maximum values of a 2D array (with random tie-breaking)

Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, for my specific case, for some rows, 2 or more columns may contain the maximum value. In that case, I want to select a column index randomly (not the first index as it is the case with .argmax(1)).
For example, for the following arr:
arr = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 3],
    [3, 2, 2]
])
there can be two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]) each chosen with 1/2 probability.
I have code that returns the expected output using a list comprehension:
idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]
but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods to make the code feasible for bigger arrays?
Use scipy.stats.rankdata and np.apply_along_axis as follows.
import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis=1, method="min")
func = lambda x: np.random.choice(np.where(x == 1)[0])
idx = np.apply_along_axis(func, 1, ranks)
print(idx)
It returns [1 0 2 0] or [1 1 2 0].
The main idea is that rankdata ranks every value within its row, so with -arr and method="min" each row-wise maximum gets rank 1. func randomly chooses one of the indices whose rank is 1, and apply_along_axis applies func to every row of ranks.
After some advice I got offline, it turns out that randomized tie-breaking among the maximum values is possible when we multiply the boolean array that flags the row-wise maxima by a random array of the same shape. What remains then is a simple argmax(1) call.
# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question runs for 11 sec, so this is about 63x faster.
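For reference, a rough way to reproduce that comparison with the standard timeit module (the random test data and the name masked_argmax are made up for this sketch; timings will vary by machine):
import timeit
import numpy as np

rng = np.random.default_rng(0)               # seeded so the test is repeatable
arr = rng.integers(0, 4, (507_563, 12))      # same shape as in the test above

def masked_argmax(a):
    # maxima get random positive values, everything else becomes zero
    mxs = a == a.max(1, keepdims=True)
    return (rng.random(a.shape) * mxs).argmax(1)

print(timeit.timeit(lambda: masked_argmax(arr), number=10) / 10, "s per run")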

Indices of intersection between arrays

Is there a fast way to compare every element of an array against every element in a list of unique identifiers?
Using a for loop to loop through each of the unique values works but is way too slow to be usable. I have been searching for a vectorized solution but have not been successful. Any help would be greatly appreciated!
arrStart = []
startRavel = startInforce['pol_id'].ravel()
for policy in unique_policies:
    arrStart.append(np.argwhere(startRavel == policy))
Sample Input:
startRavel = [1,2,2,2,3,3]
unique_policies = [1,2,3]
Sample Output:
arrStart = [[0], [1,2,3],[4,5]]
The new array would have the same length as the unique values array but each element would be a list of all of the rows that match that unique value in the large array.
Here's a vectorized solution:
import numpy as np
startRavel = np.array([1,2,2,2,3,3])
unique_policies = np.array([1,2,3])
Sort startRavel using np.argsort.
ix = np.argsort(startRavel)
s_startRavel = startRavel[ix]
Use np.searchsorted to find the indices at which unique_policies would be inserted into s_startRavel to maintain order:
s_ix = np.searchsorted(s_startRavel, unique_policies)
# array([0, 1, 4])
Then use np.split to split the array at the obtained indices. np.argsort is applied again, this time to s_ix, to handle non-sorted inputs:
ix_r = np.argsort(s_ix)
ixs = np.split(ix, s_ix[ix_r][1:])
np.array(ixs)[ix_r]
# [array([0]), array([1, 2, 3]), array([4, 5])]
General solution
Let's wrap it all up in a function:
def ix_intersection(x, y):
    """
    Finds the indices where each unique
    value in x is found in y.
    Both x and y must be numpy arrays.

    Parameters
    ----------
    x: np.array
        Must contain unique values.
        Values in x are assumed to be in y.
    y: np.array

    Returns
    -------
    Array of arrays. Each array contains the indices where a
    value in x is found in y.
    """
    ix_y = np.argsort(y)
    s = np.searchsorted(y[ix_y], x)
    ix_r = np.argsort(s)
    ixs = np.split(ix_y, s[ix_r][1:])
    return np.array(ixs)[ix_r]
Other examples
Let's try with the following arrays:
startRavel = np.array([1,3,3,2,2,2])
unique_policies = np.array([1,2,3])
ix_intersection(unique_policies, startRavel)
# array([array([0]), array([3, 4, 5]), array([1, 2])])
Another example, this time with non-sorted inputs:
startRavel = np.array([1,3,3,2,2,2,5])
unique_policies = np.array([1,2,5,3])
ix_intersection(unique_policies, startRavel)
# array([array([0]), array([3, 4, 5]), array([6]), array([1, 2])])
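For comparison, here's a sketch of the same grouping built on np.unique with return_inverse instead of a second argsort; the function name is mine, and it makes the same assumption that every value of x occurs in y:
import numpy as np

def ix_intersection_unique(x, y):
    uniq, inv = np.unique(y, return_inverse=True)    # unique values + map back into y
    order = np.argsort(inv, kind="stable")           # positions of y grouped by value
    counts = np.bincount(inv, minlength=len(uniq))   # size of each group
    groups = np.split(order, np.cumsum(counts)[:-1])
    pos = np.searchsorted(uniq, x)                   # where each x sits in uniq
    return [groups[p] for p in pos]

ix_intersection_unique(np.array([1,2,5,3]), np.array([1,3,3,2,2,2,5]))
# [array([0]), array([3, 4, 5]), array([6]), array([1, 2])]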

Efficiently count the number of occurrences of unique subarrays in NumPy?

I have an array of shape (128, 36, 8) and I'd like to find the number of occurrences of the unique subarrays of length 8 in the last dimension.
I'm aware of np.unique and np.bincount, but those seem to be for elements rather than subarrays. I've seen this question but it's about finding the first occurrence of a particular subarray, rather than the counts of all unique subarrays.
The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension.
So I am assuming that uniqueness is defined along the first two dimensions merged together. Let us take A as the input 3D array.
Get the number of unique subarrays
# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])
# Perform lexicographic sort on the rows and get the sorted array
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx,:]
# Count row-to-row changes in the sorted array; each change marks the
# start of a new unique subarray (+1 accounts for the first row)
unq_out = np.any(np.diff(sorted_Ar,axis=0),1).sum()+1
Sample run -
In [159]: A   # A is (2,2,3)
Out[159]:
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])
In [160]: unq_out
Out[160]: 3
Get the count of occurrences of unique subarrays
# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])
# Perform lexicographic sort on the rows and get the sorted array
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx,:]
# Get an ID for each row based on uniqueness: start at 0 and increment
# whenever a new distinct row begins in the sorted array
ids = np.append([0], np.any(np.diff(sorted_Ar, axis=0), 1).cumsum())
# Get counts for each ID as the final output
unq_count = np.bincount(ids)
Sample run -
In [64]: A
Out[64]:
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])
In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
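As a side note, on NumPy 1.13 and later the unique rows and their counts can be obtained in one call via np.unique's axis argument (rows come back lexicographically sorted, so the ordering may differ from the bincount output above):
# equivalent one-liner on NumPy >= 1.13
Ar = A.reshape(-1, A.shape[2])
unq_rows, unq_count = np.unique(Ar, axis=0, return_counts=True)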
Here I've modified #Divakar's very useful answer to return the counts of the unique subarrays, as well as the subarrays themselves, so that the output is the same as that of collections.Counter.most_common():
import numpy as np

def most_common_subarrays(arr):
    # Get the array in 2D form.
    arr = arr.reshape(-1, arr.shape[-1])
    # Lexicographically sort
    sorted_arr = arr[np.lexsort(arr.T), :]
    # Get the indices where a new row appears
    diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]
    # Get the unique rows
    unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]
    # Get the number of occurrences of each unique array (the -1 is needed at
    # the beginning, rather than 0, because of fencepost concerns)
    counts = np.diff(np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))
    # Return the (row, count) pairs sorted by count
    return sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)
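For instance, on the second sample array from the answer above, this returns (row, count) pairs with the most frequent subarray first:
A = np.array([[[0, 0, 2], [1, 1, 1]],
              [[1, 1, 1], [1, 2, 0]]])
print(most_common_subarrays(A))
# [(array([1, 1, 1]), 2), (array([1, 2, 0]), 1), (array([0, 0, 2]), 1)]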
I am not sure that it's the most efficient way to do it, but this should work:
arr = arr.reshape(128 * 36, 8)
unique_ = []
occurence_ = []
for sub in arr:
    if sub.tolist() not in unique_:
        unique_.append(sub.tolist())
        occurence_.append(1)
    else:
        occurence_[unique_.index(sub.tolist())] += 1
for index_, u in enumerate(unique_):
    print(u, "occurrence: %s" % occurence_[index_])

Recover data from lists in Python

I have two lists in Python, say A and B.
List A is a list of lists of integer indexes, for example [[2,3,1,3,2,3,3], [4,2,1,4], [5,4,3,3,3,4], ...] and so on.
List B has the same structure but holds numpy arrays instead of integers:
[[array([0, 0]), array([0, 0]), array([0, 1]), ...], [array([0, 0]), array([0, 1]), ...], ...]
These lists are correlated, so each numpy array corresponds to an integer; in other words, the sublists of A and the sublists of B have the same size. For example
[[2,3,3,3,2,4]...]
[[array([0, 0]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 0]), array([1, 0])]...]
The first integer in the sublist of A, "2", is linked to the first numpy array in the sublist of B.
As you can see, there are repeated integers and, consequently, repeated numpy arrays. I want to recover the unique indexes without repetition, along with their corresponding arrays.
Taking the example above the return of the procedure should be something like this:
[[2,3,4]...]
[[array([0, 0]), array([0, 1]), array([1, 0])]...]
How can I recover the unique elements from list A together with their corresponding numpy arrays in list B?
My first attempt used the numpy.unique function so I can recover list A efficiently, but then I lose the information needed to recover the corresponding elements of B. The line was
A = np.array([np.unique(a) for a in A])
TO RECAP
I have the following
import numpy as np
A = [] # PUT A REAL (short) LIST HERE
B = [] # PUT A REAL (short) LIST HERE
uniqueA = np.array([np.unique(a) for a in A])
print(uniqueA)  # prints what I want/don't want
expectA = [1,4] # put what you would expect to get back
#ask additional questions here
First, get the unique indices of A, then take those positions from both A and B:
import numpy as np
A = [[2,3,1,3,2,3,3], [4,2,1,4]]
B = [[np.zeros(2)]*len(A[0]), [np.zeros(2)]*len(A[1])]
# First-occurrence index of each unique value, per sublist
indices = [np.unique(a, return_index=True)[1] for a in A]
A = [np.array(arr)[index] for arr, index in zip(A, indices)]
B = [np.array(arr)[index] for arr, index in zip(B, indices)]
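To make the pairing concrete, here's a minimal sketch on the example data from the question (plain lists, with np.unique's return_index doing the work):
import numpy as np

A = [[2, 3, 3, 3, 2, 4]]
B = [[np.array([0, 0]), np.array([0, 1]), np.array([0, 1]),
      np.array([0, 1]), np.array([0, 0]), np.array([1, 0])]]

uniqueA, uniqueB = [], []
for a, b in zip(A, B):
    vals, first = np.unique(a, return_index=True)  # values + first-occurrence indices
    uniqueA.append(vals.tolist())
    uniqueB.append([b[i] for i in first])

print(uniqueA)  # [[2, 3, 4]]
print(uniqueB)  # [[array([0, 0]), array([0, 1]), array([1, 0])]]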

Acquiring the Minimum array out of Multiple Arrays by order in Python

Say that I have 4 numpy arrays
[1,2,3]
[2,3,1]
[3,2,1]
[1,3,2]
In this case, I've determined [1,2,3] is the "minimum array" for my purposes: it is one of two arrays with the lowest value at index 0, and of those two it has the lowest value at index 1. If more arrays shared those values, I would need to compare the next index values, and so on.
How can I extract the array [1,2,3] in that same order from the pile?
How can I extend that to x arrays of size n?
Thanks
Using plain Python's .sort() or sorted() on a list of lists (not numpy arrays) automatically does this, e.g.
a = [[1,2,3],[2,3,1],[3,2,1],[1,3,2]]
a.sort()
gives
[[1,2,3],[1,3,2],[2,3,1],[3,2,1]]
numpy's sort only sorts along a single axis (it would sort the elements within each subarray) rather than comparing whole rows lexicographically, so the best way seems to be to convert to a Python list first. Assuming you have an array of arrays you want to pick the minimum of, you could get the minimum as
sorted(a.tolist())[0]
As someone pointed out, you could also use min(a.tolist()), which applies the same comparison rules as sort and is faster for large arrays (linear vs. n log n asymptotic running time).
Here's an idea using numpy:
import numpy
a = numpy.array([[1,2,3],[2,3,1],[3,2,1],[1,3,2]])
col = 0
while a.shape[0] > 1 and col < a.shape[1]:
    # keep only the rows whose value in the current column is smallest
    a = a[a[:, col] == a[:, col].min()]
    col += 1
print(a)
This checks column by column until only one row is left.
numpy's lexsort is close to what you want. It sorts on the last key first, but that's easy to get around:
>>> a = np.array([[1,2,3],[2,3,1],[3,2,1],[1,3,2]])
>>> order = np.lexsort(a[:, ::-1].T)
>>> order
array([0, 3, 1, 2])
>>> a[order]
array([[1, 2, 3],
       [1, 3, 2],
       [2, 3, 1],
       [3, 2, 1]])
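If only the minimum row itself is needed, the first index in order picks it out directly:
>>> a[order[0]]
array([1, 2, 3])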
