Efficiently handling duplicates in a Python list

I'm looking to compactly represent duplicates in a Python list / 1D numpy array. For instance, say we have
x = np.array([1, 0, 0, 3, 3, 0])
this array has several duplicate elements that can be represented with a
group_id = np.array([0, 1, 1, 2, 2, 1])
so that all duplicates in a given cluster are found with x[group_id==<some_id>].
The list of duplicate pairs can be efficiently computed with sorting,
s_idx = np.argsort(x)
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]
where the pairs s_idx[diff_idx] <-> s_idx[diff_idx+1] correspond to indices in the original array that are duplicates.
(here array([1, 2, 3]) <-> array([2, 5, 4])).
However, I'm not sure how to efficiently calculate group_id from this linkage information for large array sizes (N > 10⁶).
Edit: as suggested by @Chris_Rands, this can indeed be done with itertools.groupby,
import numpy as np
import itertools
def get_group_id(x):
    group_id = np.zeros(x.shape, dtype='int')
    for i, (key, _group) in enumerate(itertools.groupby(x)):
        group_id[x == key] = i
    return group_id
however the scaling appears to be O(n^2), and this would not scale to my use case (N > 10⁶),
for N in [50000, 100000, 200000]:
    %time _ = get_group_id(np.random.randint(0, N, size=N))
CPU times: total: 1.53 s
CPU times: total: 5.83 s
CPU times: total: 23.9 s
and I believe using the duplicate linkage information would be more efficient, as computing the duplicate pairs for N=200000 takes just 6.44 µs in comparison.
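For reference, a hedged sketch (mine, not from the original post) of how the same sort can be turned directly into group IDs in O(n log n), without the quadratic loop; the IDs come out in sorted-value order, like np.unique's return_inverse:
s_idx = np.argsort(x, kind='stable')
x_sorted = x[s_idx]
# a new group starts wherever the sorted value changes; a cumulative sum yields sequential IDs
ids_sorted = np.concatenate(([0], (x_sorted[1:] != x_sorted[:-1]).cumsum()))
group_id = np.empty(x.size, dtype=int)
group_id[s_idx] = ids_sorted  # scatter the IDs back to the original element order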

You could use numpy.unique:
In [13]: x = np.array([1, 0, 0, 3, 3, 0])
In [14]: values, cluster_id = np.unique(x, return_inverse=True)
In [15]: values
Out[15]: array([0, 1, 3])
In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])
(The cluster IDs are assigned in the order of the sorted unique values, not in the order of a value's first appearance in the input.)
Locations of the items in cluster 0:
In [22]: cid = 0
In [23]: values[cid]
Out[23]: 0
In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
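If you need the member indices of every cluster at once rather than one cid at a time, here is a hedged sketch built on the same inverse mapping (the variable names are mine):
order = np.argsort(cluster_id, kind='stable')
counts = np.bincount(cluster_id)
clusters = np.split(order, np.cumsum(counts)[:-1])  # one index array per cluster id
clusters[0]  # -> array([1, 2, 5]), the locations of value 0 in the example above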

Here's an approach using np.unique to keep the IDs in the order of each number's first appearance -
unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
out = first_idx.argsort().argsort()[ID]
Here, first_idx.argsort().argsort() ranks each unique value by where it first occurs in x, so indexing with ID remaps the sorted-order tags into first-appearance order.
Sample run -
In [173]: x
Out[173]: array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])
In [174]: unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
In [175]: first_idx.argsort().argsort()[ID]
Out[175]: array([0, 1, 1, 2, 2, 1, 3, 1, 4, 5, 1, 1, 6, 7])
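If the double argsort feels cryptic, a hedged equivalent that spells out the first-appearance ranking (my variable names, same result):
unq, first_idx, ID = np.unique(x, return_index=1, return_inverse=1)
rank_by_first_appearance = np.empty(unq.size, dtype=int)
rank_by_first_appearance[first_idx.argsort()] = np.arange(unq.size)
out = rank_by_first_appearance[ID]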

Related

Number of times an array is present in another array in Python

How can I count the number of times an array is present in a larger array?
a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([1, 1, 1])
The count of the number of times b is present in a should be 3.
b can be any combination of 1s and 0s.
I'm working with huge arrays, so for-loops are pretty slow.
If the subarray being searched for contains all 1s, you can count the number of times the subarray appears in the larger array by convolving the two arrays with np.convolve and counting the number of entries in the result that equal the size of the subarray:
# 'valid' = convolve only over the complete overlap of the signals
>>> np.convolve(a, b, mode='valid')
array([1, 1, 2, 3, 2, 2, 2, 3, 3, 2, 1, 1])
# ^ ^ ^ <= Matches
>>> win_size = min(a.size, b.size)
>>> np.count_nonzero(np.convolve(a, b, mode='valid') == win_size)
3
For subarrays that may contain 0s, you can start by using convolution to transform a into an array containing the binary numbers encoded by each window of size b.size. Then just compare each element of the transformed array with the binary number encoded by b (flipping the weights first, since convolution reverses its second argument) and count the matches:
>>> b = np.array([0, 1, 1]) # encodes '3'
>>> weights = 2 ** np.arange(b.size) # == [1, 2, 4, 8, ..., 2**(b.size-1)]
>>> np.convolve(a, weights, mode='valid')
array([4, 1, 3, 7, 6, 5, 3, 7, 7, 6, 4, 1])
# ^ ^ Matches
>>> target = (b * np.flip(weights)).sum() # target==3
>>> np.count_nonzero(np.convolve(a, weights, mode='valid') == target)
2
Not a super fast method, but you can view a as a windowed array using np.lib.stride_tricks.sliding_window_view:
window = np.lib.stride_tricks.sliding_window_view(a, b.shape)
You can now equate this to b directly and find where they match:
result = (window == b).all(-1).sum()
For older versions of numpy (pre-1.20.0), you can use np.lib.stride_tricks.as_strided to achieve a similar result:
window = np.lib.stride_tricks.as_strided(
    a, shape=(*(np.array(a.shape) - b.shape + 1), *b.shape),
    strides=a.strides + (a.strides[0],) * b.ndim)
The same (window == b).all(-1).sum() comparison then applies.
Here is a solution using a list comprehension (a and b are plain Python lists here, so == compares whole slices):
a = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
b = [1, 1, 1]
sum(a[i:i+len(b)]==b for i in range(len(a)-len(b)+1))
output: 3
Here are a few improvements on @Brian's answer:
Use np.correlate rather than np.convolve; they are nearly identical, but convolve reads a and b in opposite directions.
To deal with templates that have zeros, convert the zeros to -1. For example:
a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([0,1,1])
np.correlate(a,2*b-1)
# array([-1, 1, 2, 1, 0, 0, 2, 1, 1, 0, -1, 1])
The template fits where the correlation equals the number of ones in the template. The indices can be extracted like so:
(np.correlate(a,2*b-1)==np.count_nonzero(b)).nonzero()[0]
# array([2, 6])
If you only need the count use np.count_nonzero
np.count_nonzero((np.correlate(a,2*b-1)==np.count_nonzero(b)))
# 2
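Putting those pieces together, a hedged sketch of a small reusable counter built on the same correlation trick (the function name is mine):
import numpy as np

def count_subarray(a, b):
    # template zeros become -1, so a 1 in a under a template 0 subtracts from the score;
    # the score reaches np.count_nonzero(b) only on an exact match
    return np.count_nonzero(np.correlate(a, 2 * b - 1) == np.count_nonzero(b))

a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
count_subarray(a, np.array([1, 1, 1]))  # -> 3
count_subarray(a, np.array([0, 1, 1]))  # -> 2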

How to find a distance between elements in numpy array?

For example, I have such array z:
array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
How can I find the distances between successive 1s in this array, measured as the number of 0s between them?
For example, in the z array, such distances are:
[1, 3, 2]
I have such code for it:
distances = []
prev_idx = 0
for idx, element in enumerate(z):
    if element == 1:
        distances.append(idx - prev_idx)
        prev_idx = idx
distances = np.array(distances[1:]) - 1
Can this operation be done without a for-loop, and maybe in a more efficient way?
UPD
The solution in @warped's answer works fine in the 1-D case.
But what if z is a 2D array like np.array([z, z])?
You can use np.where to find the ones, and then np.diff to get the distances:
q=np.where(z==1)
np.diff(q[0])-1
out:
array([1, 3, 2], dtype=int64)
edit:
for 2d arrays:
You can use the minimum of the Manhattan distance (decremented by 1) between the positions that have ones to get the number of zeros in between:
def manhattan_distance(a, b):
    return np.abs(np.array(a) - np.array(b)).sum()

zeros_between = []
r, c = np.where(z == 1)
coords = list(zip(r, c))
for i, coord in enumerate(coords[:-1]):
    zeros_between.append(
        np.min([manhattan_distance(coord, coords[j]) - 1 for j in range(i + 1, len(coords))]))
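For completeness, a hedged loop-free variant of the same nearest-later-one computation using broadcasting (my names; memory is O(k²) in the number of ones, so it only suits arrays with moderately many ones):
r, c = np.where(z == 1)
coords = np.stack([r, c], axis=1).astype(float)                      # (k, 2) positions of the ones
pairwise = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)   # (k, k) Manhattan distances
upper = np.triu(pairwise, k=1)                                       # keep only distances to later ones
upper[upper == 0] = np.inf                                           # mask the diagonal and lower triangle
zeros_between = upper.min(axis=1)[:-1] - 1                           # matches the loop above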
If you don't want to use the for loop, you can use np.where and np.roll:
import numpy as np
x = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
pos = np.where(x==1)[0] #pos = array([0, 2, 6, 9])
shift = np.roll(pos,-1) # shift = array([2, 6, 9, 0])
result = ((shift-pos)-1)[:-1]
#shift-pos = array([ 2, 4, 3, -9])
#(shift-pos)-1 = array([ 1, 3, 2, -10])
#((shift-pos)-1)[:-1] = array([ 1, 3, 2])
print(result)

Using np.random.randint as fill_value

I want to create a numpy array where each element is the number of 1s in another numpy array of size x created with np.random.randint.
>>> x = 10
>>> np.random.randint(2, size=x)
array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> sum(array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]))
5
but using it like this results in the same value being repeated instead of a new random array being generated each time, because the fill_value expression is evaluated once before np.full is called:
>>> np.full((5,), sum(np.random.randint(2, size=10)), dtype="int")
array([5, 5, 5, 5, 5])
How can I do this, or is there a better way to do this? I also tried the following
>>> a = np.random.rand(10)
>>> len(a[a < 0.5])
7
>>> np.full((5,), len(np.random.rand(10)[np.random.rand(10) < 0.5]), dtype="int")
array([7, 7, 7, 7, 7])
but as you can see, that also resulted in the same numbers. I don't want to use for loops; instead I'd like a way to do this quickly using numpy.
You could just generate a matrix consisting of N arrays, each of size x, made of random ints, and then sum over each row:
import numpy as np
x = 10
N = 5
a = np.sum(np.random.randint(2, size=[N, x]), axis=1)
I'm fairly sure np.full is not what you want here as this is for array initialisation to a single value.
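For what it's worth, a hedged equivalent using the newer Generator API (the seed is my addition, there only to make the example reproducible):
import numpy as np

rng = np.random.default_rng(0)                     # seeded only for reproducibility
a = rng.integers(0, 2, size=(5, 10)).sum(axis=1)   # one count of ones per row
a.shape                                            # -> (5,)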
Using the binomial distribution as discussed above:
In [13]: np.random.binomial(10, 0.5, 5)
Out[13]: array([7, 4, 6, 7, 4])
This assumes that there are 10 distinct left/right decisions, each having 0.5 probability.

Vectorized relabeling of NumPy array to consecutive numbers and retrieving back

I have a huge training dataset with 4 classes. These classes are labeled non-consecutively. To be able to apply a sequential neural network, the classes have to be relabeled so that the unique values in the classes are consecutive. In addition, at the end of the script I have to relabel them back to their old values.
I know how to relabel them with loops:
def relabel(old_classes, new_classes):
    indexes = [np.where(old_classes == np.unique(old_classes)[i]) for i in range(len(new_classes))]
    for i in range(len(new_classes)):
        old_classes[indexes[i]] = new_classes[i]
    return old_classes
>>> old_classes = np.array([0,1,2,6,6,2,6,1,1,0])
>>> new_classes = np.arange(len(np.unique(old_classes)))
>>> relabel(old_classes,new_classes)
array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
But this isn't nice coding and it takes quite a lot of time.
Any idea how to vectorize this relabeling?
To be clear, I also want to be able to relabel them back to their old values:
>>> relabeled_classes=np.array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
>>> old_classes = np.array([0,1,2,6])
>>> relabel(relabeled_classes,old_classes )
array([0,1,2,6,6,2,6,1,1,0])
We can use the optional argument return_inverse with np.unique to get those unique sequential IDs/tags, like so -
unq_arr, unq_tags = np.unique(old_classes,return_inverse=1)
Index into unq_arr with unq_tags to retrieve back -
old_classes_retrieved = unq_arr[unq_tags]
Sample run -
In [69]: old_classes = np.array([0,1,2,6,6,2,6,1,1,0])
In [70]: unq_arr, unq_tags = np.unique(old_classes,return_inverse=1)
In [71]: unq_arr
Out[71]: array([0, 1, 2, 6])
In [72]: unq_tags
Out[72]: array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
In [73]: old_classes_retrieved = unq_arr[unq_tags]
In [74]: old_classes_retrieved
Out[74]: array([0, 1, 2, 6, 6, 2, 6, 1, 1, 0])
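Wrapping that idea into a relabel/restore pair, since the question needs both directions (a hedged sketch; the helper names are mine):
import numpy as np

def to_consecutive(labels):
    uniques, consecutive = np.unique(labels, return_inverse=True)
    return consecutive, uniques                    # consecutive IDs plus the lookup table

def to_original(consecutive, uniques):
    return uniques[consecutive]                    # invert the relabeling

new_classes, lookup = to_consecutive(np.array([0, 1, 2, 6, 6, 2, 6, 1, 1, 0]))
# new_classes -> array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
restored = to_original(new_classes, lookup)
# restored -> array([0, 1, 2, 6, 6, 2, 6, 1, 1, 0])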

numpy select fixed amount of values among duplicate values in array

Starting from a simple array with duplicate values:
a = np.array([2,3,2,2,3,3,2,1])
I'm trying to keep at most 2 occurrences of each unique value. The resulting array would appear as:
b = np.array([2,3,2,3,1])
no matter the order of the items. So far I tried to find unique values with:
In [20]: c = np.unique(a,return_counts=True)
In [21]: c
Out[21]: (array([1, 2, 3]), array([1, 4, 3]))
which is useful because it returns the frequency of each value as well, but I'm stuck on filtering by frequency.
You could use np.repeat to generate the desired array from the array of uniques and counts:
import numpy as np
a = np.array([2,3,2,2,3,3,2,1])
uniques, count = np.unique(a,return_counts=True)
np.repeat(uniques, np.clip(count, 0, 2))
yields
array([1, 2, 2, 3, 3])
np.clip is used to force all values in count to be between 0 and 2. Thus, you get at most two occurrences of each unique value.
You can use a list comprehension within np.concatenate() and limit the number of items by slicing:
>>> np.concatenate([a[a==i][:2] for i in np.unique(a)])
array([1, 2, 2, 3, 3])
Here's an approach to keep the order as in the input array -
N = 2 # Number of duplicates to keep for each unique element
sortidx = a.argsort()
_,id_arr = np.unique(a[sortidx],return_index=True)
valid_ind = np.unique( (id_arr[:,None] + np.arange(N)).ravel().clip(max=a.size-1) )
out = a[np.sort(sortidx[valid_ind])]
Sample run -
In [253]: a
Out[253]: array([ 0, -3, 0, 2, 0, 3, 2, 0, 2, 3, 3, 2, 1, 5, 0, 2])
In [254]: N
Out[254]: 3
In [255]: out
Out[255]: array([ 0, -3, 0, 2, 0, 3, 2, 2, 3, 3, 1, 5])
In [256]: np.unique(out,return_counts=True)[1] # Verify the counts to be <= N
Out[256]: array([1, 3, 1, 3, 3, 1])
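An alternative hedged sketch that also preserves the input order, built on a running per-value occurrence count (my names, not from the original answers):
def keep_first_n(a, N):
    order = np.argsort(a, kind='stable')
    sorted_a = a[order]
    # position of each element within its block of equal values
    block_start = np.concatenate(([True], sorted_a[1:] != sorted_a[:-1]))
    idx = np.arange(a.size)
    occurrence = idx - np.maximum.accumulate(np.where(block_start, idx, 0))
    keep = np.empty(a.size, dtype=bool)
    keep[order] = occurrence < N
    return a[keep]

keep_first_n(np.array([2, 3, 2, 2, 3, 3, 2, 1]), 2)  # -> array([2, 3, 2, 3, 1])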
