Consider the array and function definition shown:
import numpy as np
a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])
def grpCountSize(arr, grpCount, grpSize):
    count = [np.unique(row, return_counts=True) for row in arr]
    valid = [np.any(np.count_nonzero(row[1] == grpSize) == grpCount) for row in count]
    return valid
The point of the function is to return the rows of array a that have exactly grpCount groups of elements that each hold exactly grpSize identical elements.
For example:
# which rows have exactly 1 group that holds exactly 2 identical elements?
out = a[grpCountSize(a, 1, 2)]
As expected, the code outputs out = [[2, 2, 5, 6, 2, 5], [3, 3, 7, 7, 3, 3]].
The 1st output row has exactly 1 group of 2 (i.e. 5,5), while the 2nd output row also has exactly 1 group of 2 (i.e. 7,7).
Similarly:
# which rows have exactly 2 groups that each hold exactly 3 identical elements?
out = a[grpCountSize(a, 2, 3)]
This produces out = [[3, 6, 3, 6, 3, 6]], because only this row has exactly 2 groups each holding exactly 3 identical elements (i.e. 3,3,3 and 6,6,6).
PROBLEM: My actual arrays have just 6 columns, but they can have many millions of rows. The code works perfectly as intended, but it is VERY SLOW for long arrays. Is there a way to speed this up?
np.unique sorts the array, which makes it less efficient for your purpose. Use np.bincount instead; that way you will most likely save some time, depending on your array's shape and values (note that np.bincount requires non-negative integers). You also will not need np.any anymore:
def grpCountSize(arr, grpCount, grpSize):
    count = [np.bincount(row) for row in arr]
    valid = [np.count_nonzero(row == grpSize) == grpCount for row in count]
    return valid
Another way that might save even more time is to use the same number of bins for all rows and build a single 2D count array:
def grpCountSize(arr, grpCount, grpSize):
    m = arr.max()
    count = np.stack([np.bincount(row, minlength=m+1) for row in arr])
    return (count == grpSize).sum(1) == grpCount
Yet another upgrade is to use the vectorized 2D bincount from this post. For example (note that the Numba solutions tested in that post are faster; I provide the NumPy solution only as an example, and you can replace the function with any of the ones suggested in the linked post):
def grpCountSize(arr, grpCount, grpSize):
    count = bincount2D_vectorized(arr)
    return (count == grpSize).sum(1) == grpCount

# from the post above
def bincount2D_vectorized(a):
    N = a.max() + 1
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
Output of all the solutions above:
a[grpCountSize(a, 1, 2)]
#array([[2, 2, 5, 6, 2, 5],
# [3, 3, 7, 7, 3, 3]])
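For completeness, here is a minimal Numba sketch in the spirit of the solutions from the linked post (the function name, the explicit max_val parameter, and the loop structure are my own, not taken from that post):
import numpy as np
from numba import njit

@njit
def grp_count_size_numba(arr, grpCount, grpSize, max_val):
    # Reuses one counts buffer instead of allocating a rows x (max_val+1) array.
    # Assumes non-negative integer entries no larger than max_val.
    out = np.empty(arr.shape[0], dtype=np.bool_)
    counts = np.zeros(max_val + 1, dtype=np.int64)
    for i in range(arr.shape[0]):
        counts[:] = 0
        for j in range(arr.shape[1]):
            counts[arr[i, j]] += 1
        hits = 0
        for v in range(max_val + 1):
            if counts[v] == grpSize:
                hits += 1
        out[i] = hits == grpCount
    return out

# a[grp_count_size_numba(a, 1, 2, a.max())] should match the output above.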
I am trying to extract several values at once from an array but I can't seem to find a way to do it in a one-liner in Numpy.
Simply put, considering an array:
a = numpy.arange(10)
> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
I would like to be able to extract, say, 2 values, skip the next 2, extract the 2 following values etc. This would result in:
array([0, 1, 4, 5, 8, 9])
This is an example but I am ideally looking for a way to extract x values and skip y others.
I thought this could be done with slicing, doing something like:
a[:2:2]
but it only returns array([0]); that is the expected slicing behavior, but not the output I am looking for.
I know I could obtain the expected result by combining several slicing operations (similar to Numpy Array Slicing), but I was wondering whether I am missing some numpy feature.
If you want to avoid creating copies and allocating new memory, you could use a sliding window view of two elements:
win = np.lib.stride_tricks.sliding_window_view(a, 2)
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8],
       [8, 9]])
And then only take every 4th window view:
win[::4].ravel()
array([0, 1, 4, 5, 8, 9])
Or directly go with the more dangerous as_strided, but heed the warnings in the documentation:
np.lib.stride_tricks.as_strided(a, shape=(3,2), strides=(32,8))
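The hardcoded strides=(32, 8) assume 8-byte integers (8 is the int64 itemsize, and 32 = (x + y) * 8). Here is a small sketch (my own variable names) that derives the shape and strides from the array itself rather than hardcoding byte counts:
x, y = 2, 2                            # keep x, skip y
step = x + y
n_win = (a.shape[0] - x) // step + 1   # number of complete keep-windows
win = np.lib.stride_tricks.as_strided(
    a, shape=(n_win, x),
    strides=(step * a.strides[0], a.strides[0]))
win.ravel()                            # array([0, 1, 4, 5, 8, 9])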
You can use a modulo operator:
x = 2 # keep
y = 2 # skip
out = a[np.arange(a.shape[0])%(x+y)<x]
Output: array([0, 1, 4, 5, 8, 9])
Output with x = 2, y = 3:
array([0, 1, 5, 6])
Consider the np array sample below:
import numpy as np
arr = np.array([[1,2,5, 4,2,7, 5,2,9],
                [4,4,1, 4,2,0, 3,6,4],
                [1,2,1, 4,2,2, 5,2,0],
                [1,2,7, 2,4,1, 5,2,8],
                [1,2,9, 4,2,8, 5,2,1],
                [4,2,0, 4,4,1, 5,2,4],
                [4,4,0, 4,2,6, 3,6,6],
                [1,2,1, 4,2,2, 5,2,0]])
PROBLEM: We are concerned only with the first TWO columns of each element triplet. I want to remove array rows that duplicate these two elements of each triplet (in the same order).
In the example above, the rows with indices 0, 2, 4, and 7 are all of the form [1,2,_, 4,2,_, 5,2,_]. So we should keep arr[0] and drop the other three. Similarly, row 6 is dropped because it has the same pattern as row 1, namely [4,4,_, 4,2,_, 3,6,_].
In the example given, the output should look like:
[[1,2,5, 4,2,7, 5,2,9],
 [4,4,1, 4,2,0, 3,6,4],
 [1,2,7, 2,4,1, 5,2,8],
 [4,2,0, 4,4,1, 5,2,4]]
The part I'm struggling with most is that the solution should be general enough to handle arrays of 3, 6, 9, 12... columns (always a multiple of 3, and we are always interested in duplications of the first two columns of each triplet).
If you can create an array with only the values you are interested in, you can pass that to np.unique(), which has a return_index option.
One way to get the groups you want is to delete every third column. Pass that to np.unique() and get the indices:
import numpy as np
arr = np.array([[1,2,5, 4,2,7, 5,2,9],
                [4,4,1, 4,2,0, 3,6,4],
                [1,2,1, 4,2,2, 5,2,0],
                [1,2,7, 2,4,1, 5,2,8],
                [1,2,9, 4,2,8, 5,2,1],
                [4,2,0, 4,4,1, 5,2,4],
                [4,4,0, 4,2,6, 3,6,6],
                [1,2,1, 4,2,2, 5,2,0]])
unique_cols = np.delete(arr, slice(2, None, 3), axis=1)  # drop every third column
vals, indices = np.unique(unique_cols, axis=0, return_index=True)
arr[sorted(indices)]  # sorting the first-occurrence indices preserves row order
output:
array([[1, 2, 5, 4, 2, 7, 5, 2, 9],
       [4, 4, 1, 4, 2, 0, 3, 6, 4],
       [1, 2, 7, 2, 4, 1, 5, 2, 8],
       [4, 2, 0, 4, 4, 1, 5, 2, 4]])
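Since slice(2, None, 3) drops every third column regardless of width, the same code handles 3, 6, 9, 12, ... columns unchanged. A quick check on a hypothetical 3-column array:
arr3 = np.array([[1, 2, 5],
                 [1, 2, 7],   # duplicates row 0 in the first two columns
                 [4, 4, 0]])
unique_cols = np.delete(arr3, slice(2, None, 3), axis=1)
_, indices = np.unique(unique_cols, axis=0, return_index=True)
arr3[sorted(indices)]
# array([[1, 2, 5],
#        [4, 4, 0]])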
How can I sort an array in NumPy by the two first rows?
For example,
A = array([[9, 2, 2],
           [4, 5, 6],
           [7, 0, 5]])
And I'd like to sort columns by the first two rows, such that I get back:
A = array([[2, 2, 9],
           [5, 6, 4],
           [0, 5, 7]])
Thank you!
One approach is to transform the 2D array over which we want to take the argsort into an easier-to-handle 1D array. One idea is to weight the rows that matter for the sort by successively decreasing powers of 10, sum them, and then argsort the resulting 1D array (note: this method is numerically unstable for large k; it is meant for values of k up to ~20):
def sort_on_first_k_rows(x, k):
    # normalize each row so that its max value is 1
    a = (x[:k, :] / x[:k, :, None].max(1)).astype('float64')
    # multiply each row by 10^n, for n = k-1, k-2, ..., 0, so that the
    # contribution of each row to the sort is captured in the final sum
    a_pow = a * 10**np.arange(a.shape[0] - 1, -1, -1)[:, None]
    # sort with the argsort of the resulting sum
    return x[:, a_pow.sum(0).argsort()]
Checking with the shared example:
sort_on_first_k_rows(A, 2)
array([[2, 2, 9],
       [5, 6, 4],
       [0, 5, 7]])
Or with another example:
A = np.array([[9, 2, 2, 1, 5, 2, 9],
              [4, 7, 6, 0, 9, 3, 3],
              [7, 0, 5, 0, 2, 1, 2]])
sort_on_first_k_rows(A, 2)
array([[1, 2, 2, 2, 5, 9, 9],
       [0, 3, 6, 7, 9, 3, 4],
       [0, 1, 5, 0, 2, 2, 7]])
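As a side note (my own aside, not part of the answer above), np.lexsort reproduces the same column order exactly and without the numerical-stability caveat; its keys are passed least significant first:
import numpy as np

A = np.array([[9, 2, 2],
              [4, 5, 6],
              [7, 0, 5]])
# row 0 is the primary key (last in the tuple), row 1 breaks ties
A[:, np.lexsort((A[1], A[0]))]
# array([[2, 2, 9],
#        [5, 6, 4],
#        [0, 5, 7]])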
The pandas library is very flexible for sorting DataFrames, but only by columns. So I suggest transposing your array and converting it to a DataFrame like this (note that you need to specify column names so you can define the sorting criteria later):
import pandas as pd
df = pd.DataFrame(A.transpose(), columns=['col' + str(i) for i in range(len(A))])
Then sort it and convert it back like this:
A_new = df.sort_values(['col0', 'col1'], ascending=[True, True]).to_numpy().transpose()
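Put together on the example array, this is a quick end-to-end check (output shown as a comment):
import numpy as np
import pandas as pd

A = np.array([[9, 2, 2],
              [4, 5, 6],
              [7, 0, 5]])
df = pd.DataFrame(A.transpose(), columns=['col' + str(i) for i in range(len(A))])
A_new = df.sort_values(['col0', 'col1'], ascending=[True, True]).to_numpy().transpose()
# A_new:
# array([[2, 2, 9],
#        [5, 6, 4],
#        [0, 5, 7]])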
This is a follow-up to a previous question. If I have a NumPy array [0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2], for each repeat sequence (starting at each index), is there a fast way to then find all matches of that repeat sequence and return the index for those matches?
Here, the repeat sequences are [2, 2] and [5, 5] (note that the repeat length is specified by the user, is the same for all repeats, and can be much greater than 2). The repeats can be found at [2, 6, 8, 11, 13] via:
def consec_repeat_starts(a, n):
    N = n - 1
    m = a[:-1] == a[1:]
    return np.flatnonzero(np.convolve(m, np.ones(N, dtype=int)) == N) - N + 1
But for each unique type of repeat sequence (i.e., [2, 2] and [5, 5]) I want to return something like the repeat followed by the indices for where the repeat is located:
[([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]
Update
Additionally, given the repeat sequence, can you return the results from a second array. So, look for [2, 2] and [5, 5] in:
[2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2]
And the function would return:
[([2, 2], [0, 11, 12]), ([5, 5], [2, 8])]
Here's a way to do so -
def group_consec(a, n):
    idx = consec_repeat_starts(a, n)
    b = a[idx]
    sidx = b.argsort()
    c = b[sidx]
    cut_idx = np.flatnonzero(np.r_[True, c[:-1] != c[1:], True])
    idx_s = idx[sidx]
    indices = [idx_s[i:j] for (i, j) in zip(cut_idx[:-1], cut_idx[1:])]
    return c[cut_idx[:-1]], indices

# Perform lookup in another array, b
n = 2
v_a, indices_a = group_consec(a, n)
v_b, indices_b = group_consec(b, n)
idx = np.searchsorted(v_a, v_b)
idx[idx == len(v_a)] = 0
valid_mask = v_a[idx] == v_b
common_indices = [j for (i, j) in zip(valid_mask, indices_b) if i]
common_val = v_b[valid_mask]
Note that for simplicity and ease of use, the first output argument of group_consec holds the unique value per sequence. If you need them in (val, val, ...) format, simply replicate them at the end; likewise for common_val.
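For instance, a minimal sketch (variable names mine) that assembles the [(repeat, indices), ...] format the question asks for:
a = np.array([0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2])
vals, indices = group_consec(a, 2)
result = [([int(v)] * 2, list(map(int, ix))) for v, ix in zip(vals, indices)]
# [([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]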
Let's consider a 2d-array A
2 3 5 7
2 3 5 7
1 7 1 4
5 8 6 0
2 3 5 7
The first, second and last lines are identical. The algorithm I'm looking for should return the number of identical rows for each distinct row (i.e. the number of occurrences of each row). If the script can be easily modified to also count the number of identical columns, that would be great.
I use an inefficient naive algorithm to do that:
import numpy
A = numpy.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])
i = 0
end = len(A)
while i < end:
    print(i, end=' ')
    j = i + 1
    numberID = 1
    while j < end:
        print(j)
        if numpy.array_equal(A[i, :], A[j, :]):
            numberID += 1
        j += 1
    i += 1
print(A, len(A))
Expected result:
array([3, 1, 1]) # number of identical rows for each distinct line
My algorithm uses native Python loops over a NumPy array, which is why it is inefficient. Thanks for the help.
In numpy >= 1.9.0, np.unique has a return_counts keyword argument you can combine with the solution here to get the counts:
b = np.ascontiguousarray(A).view(np.dtype((np.void, A.dtype.itemsize * A.shape[1])))
unq_a, unq_cnt = np.unique(b, return_counts=True)
unq_a = unq_a.view(A.dtype).reshape(-1, A.shape[1])
>>> unq_a
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])
>>> unq_cnt
array([1, 3, 1])
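As an aside, on NumPy 1.13 and later np.unique accepts an axis argument, so the void-view trick is no longer needed:
unq_a, unq_cnt = np.unique(A, axis=0, return_counts=True)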
In an older numpy, you can replicate what np.unique does, which would look something like:
a_view = np.array(A, copy=True)
a_view = a_view.view(np.dtype((np.void,
                     a_view.dtype.itemsize * a_view.shape[1]))).ravel()
a_view.sort()
a_flag = np.concatenate(([True], a_view[1:] != a_view[:-1]))
a_unq = a_view[a_flag].view(A.dtype).reshape(-1, A.shape[1])
a_idx = np.concatenate(np.nonzero(a_flag) + ([a_view.size],))
a_cnt = np.diff(a_idx)
>>> a_unq
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])
>>> a_cnt
array([1, 3, 1])
You can lexsort on the row entries, which will give you the indices for traversing the rows in sorted order, making the search O(n) rather than O(n^2). Note that np.lexsort treats the last key as the primary one, so the rows are 'alphabetized' right to left rather than left to right.
In [9]: a
Out[9]:
array([[2, 3, 5, 7],
       [2, 3, 5, 7],
       [1, 7, 1, 4],
       [5, 8, 6, 0],
       [2, 3, 5, 7]])
In [10]: np.lexsort(a.T)
Out[10]: array([3, 2, 0, 1, 4])

In [11]: a[np.lexsort(a.T)]
Out[11]:
array([[5, 8, 6, 0],
       [1, 7, 1, 4],
       [2, 3, 5, 7],
       [2, 3, 5, 7],
       [2, 3, 5, 7]])
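To complete the idea, a sketch (my own continuation, not from the answer) that turns the sorted order into per-row counts:
import numpy as np

a = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])
order = np.lexsort(a.T)                  # traverse rows in sorted order
s = a[order]
# True where a new distinct row begins in the sorted sequence
new_row = np.concatenate(([True], np.any(s[1:] != s[:-1], axis=1)))
starts = np.flatnonzero(new_row)
counts = np.diff(np.concatenate((starts, [len(a)])))
# counts -> array([1, 1, 3]), one count per distinct row s[starts]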
You can use the Counter class from the collections module for this.
It works like this:
x = [2, 2, 1, 5, 2]
from collections import Counter
c = Counter(x)
print(c)
Output: Counter({2: 3, 1: 1, 5: 1})
The only issue you will face in your case is that every value of x is itself a list, which is not a hashable data structure.
If you convert every value of x to a tuple, it works:
x = [(2, 3, 5, 7), (2, 3, 5, 7), (1, 7, 1, 4), (5, 8, 6, 0), (2, 3, 5, 7)]
from collections import Counter
c = Counter(x)
print(c)
Output: Counter({(2, 3, 5, 7): 3, (5, 8, 6, 0): 1, (1, 7, 1, 4): 1})
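Applied directly to the NumPy array from the question, a one-liner sketch that converts each row to a tuple first (tolist yields plain Python ints, which keeps the printed output clean):
import numpy as np
from collections import Counter

A = np.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4],
              [5, 8, 6, 0], [2, 3, 5, 7]])
print(Counter(map(tuple, A.tolist())))
# Counter({(2, 3, 5, 7): 3, (1, 7, 1, 4): 1, (5, 8, 6, 0): 1})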