Removing rows in a 2D array that have the same value - python

I'm looking for a quick way to remove duplicate values present in a 2D array on a first come first serve basis. I know a way to remove the rows if they are identical, but not if only one of the values is present.
a = array([[0, 1],
[3, 4],
[3, 5],
[2, 5],
[1, 2]])
As 3 is present in a[1] and a[2], I would like to delete any later occurrence of a value. Similarly with 2 in a[3] and a[4]. So the output would be:
a = array([[0, 1],
[3, 4],
[2, 5]])
As can be seen, there is overlap with the value 5. Any suggestions are appreciated.

A pure-Python way is to use a set with a list comprehension:
>>> seen = set()
>>> np.array([x for x in a if seen.isdisjoint(x) and not seen.update(x)])
array([[0, 1],
[3, 4],
[2, 5]])
The one-liner simply abuses the fact that set.update returns None, so when seen.isdisjoint(x) is True we can update the seen set using not seen.update(x).
We can also write the above code as:
>>> seen = set()
>>> out = []
>>> for item in a:
...     # if none of items in current sub-array are present in seen set
...     # then add current sub-array to our list. Plus update the seen
...     # set with the items from current sub-array
...     if seen.isdisjoint(item):
...         out.append(item)
...         seen.update(item)
...
>>> out
[array([0, 1]), array([3, 4]), array([2, 5])]
>>> np.array(out)
array([[0, 1],
[3, 4],
[2, 5]])
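If the indices of the kept rows are also useful, the same loop can be wrapped in a small helper. This is only a sketch: the function name `drop_rows_with_seen_values` is made up for illustration, and it assumes the rows hold hashable scalars such as integers.

```python
import numpy as np

def drop_rows_with_seen_values(a):
    """Keep a row only if none of its values appeared in an earlier kept row."""
    seen = set()
    keep = []  # indices of the rows that survive
    for i, row in enumerate(a):
        values = row.tolist()  # plain Python scalars for the set
        if seen.isdisjoint(values):
            keep.append(i)
            seen.update(values)
    return a[keep], keep

a = np.array([[0, 1],
              [3, 4],
              [3, 5],
              [2, 5],
              [1, 2]])
result, kept = drop_rows_with_seen_values(a)
print(result)
print(kept)
```

Values from a rejected row (such as the 5 in [3, 5]) are never added to the set, which is why [2, 5] survives.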

Related

Python equivalent of Unique function in Matlab

I have a (654 x 2) matrix of integers in which many rows are just permutations of the same values (e.g. one row is [2,5] and another is [5,2]). I need a Python function that treats such rows as duplicates of each other and helps me delete the row that comes later when sorted.
Sort the elements of each sublist.
a = [[1,2], [3, 4], [2,1]]
# Sort each sublist and convert it to a tuple so that it is hashable and can be put into a set
li = [tuple(sorted(x)) for x in a]
print(li)
#[(1, 2), (3, 4), (1, 2)]
Then use set to eliminate duplicates (note that a set does not guarantee the original order).
#Convert tuple back to list
unique_li = [list(t) for t in set(li)]
print(unique_li)
#[[1, 2], [3, 4]]
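If the order of first appearance matters, one possible variation is to deduplicate with `dict.fromkeys` instead of `set`. This sketch assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
a = [[1, 2], [3, 4], [2, 1]]

# Same normalisation as above: sort each sublist and make it hashable.
keys = [tuple(sorted(x)) for x in a]

# dict.fromkeys keeps only the first occurrence of each key, in order.
unique_li = [list(t) for t in dict.fromkeys(keys)]
print(unique_li)
```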
You could use numpy to sort your array's rows.
a = np.array([[1,2], [3, 4], [2,1]])
a
array([[1, 2],
[3, 4],
[2, 1]])
np.ndarray.sort(a)
a
array([[1, 2],
[3, 4],
[1, 2]])
Then use array_equal to compare rows for equality:
np.array_equal(a[0], a[1])
False
np.array_equal(a[0], a[2])
True
And then remove rows using:
np.delete(a, 2, 0)
array([[1, 2],
[3, 4]])
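For a larger matrix, comparing rows pairwise gets tedious. A possible vectorized sketch, assuming a NumPy version where `np.unique` accepts the `axis` argument, sorts a copy of each row and keeps the first occurrence of every distinct sorted row:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [2, 1]])

# Sort within each row on a copy, so [2, 1] and [1, 2] compare equal.
sorted_rows = np.sort(a, axis=1)

# return_index gives the position of the first occurrence of each unique row.
_, first_idx = np.unique(sorted_rows, axis=0, return_index=True)

# Sorting the indices restores the original row order.
keep = np.sort(first_idx)
deduped = a[keep]
print(deduped)
```

Indexing the original `a` (rather than `sorted_rows`) keeps the surviving rows in the form they originally had.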

numpy sort 2d: rearrange rows without changing values in row

How can the rows of an array be sorted without changing the values within each row?
Furthermore, how can I get the indices of this sorting process?
input:
a = np.array([[4,3],[0,3],[3,0],[1,3],[1,2],[2,0]])
required sorting array:
b = np.array([1,4,3,5,2,0])
a = a[b]
output:
a = np.array([[0,3],[1,2],[1,3],[2,0],[3,0],[4,3]])
How do I get the array b ?
You need lexsort here:
b = np.lexsort((a[:, 1], a[:, 0]))
# array([1, 4, 3, 5, 2, 0], dtype=int64)
And applied to your initial array:
>>> a[b]
array([[0, 3],
[1, 2],
[1, 3],
[2, 0],
[3, 0],
[4, 3]])
As @miradulo pointed out, you may also use:
b = np.lexsort(np.fliplr(a).T)
Which is less verbose than explicitly stating the columns to sort on.
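The detail that is easy to miss is that `np.lexsort` treats the last key as the primary sort key. A short check, using the array from the question, that both spellings give the same index array:

```python
import numpy as np

a = np.array([[4, 3], [0, 3], [3, 0], [1, 3], [1, 2], [2, 0]])

# Keys are listed from least to most significant: column 1 breaks ties,
# column 0 is the primary key.
b_explicit = np.lexsort((a[:, 1], a[:, 0]))

# Reversing the columns and transposing yields the same key order
# for any number of columns.
b_flipped = np.lexsort(np.fliplr(a).T)

print(b_explicit)
print(a[b_explicit])
```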

How to replicate a row of an array with numpy?

I want to replicate the last row of an array in Python and found the following lines of code in the NumPy documentation:
>>> x = np.array([[1,2],[3,4]])
>>> np.repeat(x, [1, 2], axis=0)
In the above code, what does the second parameter "[1,2]" in np.repeat do?
If I want to replicate a row in a 3*3 array, how will this second parameter change?
It's the repeats parameter
repeats : int or array of ints
The number of repetitions for each element. repeats is broadcasted to fit the shape of the given axis.
It's the number of times you want to repeat a row or column based on the parameter axis.
x = np.array([[1,2],[3,4],[4,5]])
np.repeat(x, repeats = [1, 2, 1 ], axis=0)
This would lead to repetition of row 1 once, row 2 twice and row 3 once.
array([[1, 2],
[3, 4],
[3, 4],
[4, 5]])
Similarly, with axis=1 the repeats list needs one entry per column (2 here). The code below repeats column 1 once and column 2 twice.
x = np.array([[1,2],[3,4],[4,5]])
np.repeat(x, repeats = [1, 2 ], axis=1)
array([[1, 2, 2],
[3, 4, 4],
[4, 5, 5]])
If you want to repeat only the last row, repeat just that row and stack it onto the original, i.e.:
rep = 2
last = np.repeat([x[-1]],repeats= rep-1 ,axis=0)
np.vstack([x, last])
array([[1, 2],
[3, 4],
[4, 5],
[4, 5]])
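An alternative to stacking is to build the repeats list directly: one count per row, all ones except the last. A minimal sketch, where the helper name `repeat_last_row` and its `total_copies` parameter are made up for illustration:

```python
import numpy as np

def repeat_last_row(x, total_copies):
    """Return x with its last row appearing total_copies times in total."""
    counts = np.ones(x.shape[0], dtype=int)  # every row once...
    counts[-1] = total_copies                # ...except the last one
    return np.repeat(x, counts, axis=0)

x = np.array([[1, 2], [3, 4], [4, 5]])
out = repeat_last_row(x, 2)
print(out)
```

Because the counts array is sized from `x.shape[0]`, the same helper works unchanged for a 3*3 array or any other number of rows.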
I tested it using the following code:
>>> a
array([[1, 2],
[3, 4]])
>>> np.repeat(a, [2,3], axis = 0)
array([[1, 2],
[1, 2],
[3, 4],
[3, 4],
[3, 4]])
>>> np.repeat(a, [1,3], axis = 0)
array([[1, 2],
[3, 4],
[3, 4],
[3, 4]])
The second parameter seems to mean how many times the i-th element of a is repeated. As the code above shows, [2,3] repeats a[0] 2 times and a[1] 3 times, while [1,3] repeats a[0] once and a[1] 3 times.
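For the 3*3 case asked about in the question, the list therefore needs three entries, one per row. A small sketch of what happens when the length matches and when it does not (the array contents are arbitrary):

```python
import numpy as np

x = np.arange(9).reshape(3, 3)

# One count per row: only the middle row is repeated twice.
middle_twice = np.repeat(x, [1, 2, 1], axis=0)
print(middle_twice)

# A two-element list does not match the three rows, so NumPy rejects it.
try:
    np.repeat(x, [1, 2], axis=0)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```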

Is there any function in python which can perform the inverse of numpy.repeat function?

For example
x = np.repeat(np.array([[1,2],[3,4]]), 2, axis=1)
gives you
x = array([[1, 1, 2, 2],
[3, 3, 4, 4]])
but is there something which can perform
x = np.*inverse_repeat*(np.array([[1, 1, 2, 2],[3, 3, 4, 4]]), axis=1)
and gives you
x = array([[1,2],[3,4]])
Regular slicing should work. For the axis you want to inverse repeat, use ::number_of_repetitions
x = np.repeat(np.array([[1,2],[3,4]]), 4, axis=0)
x[::4, :] # axis=0
Out:
array([[1, 2],
[3, 4]])
x = np.repeat(np.array([[1,2],[3,4]]), 3, axis=1)
x[:,::3] # axis=1
Out:
array([[1, 2],
[3, 4]])
x = np.repeat(np.array([[[1],[2]],[[3],[4]]]), 5, axis=2)
x[:,:,::5] # axis=2
Out:
array([[[1],
[2]],
[[3],
[4]]])
This should work, and has the exact same signature as np.repeat:
def inverse_repeat(a, repeats, axis):
    if isinstance(repeats, int):
        # integer division keeps the count an int (np.int is gone from recent NumPy)
        indices = np.arange(a.shape[axis] // repeats) * repeats
    else:  # assume array_like of int
        indices = np.cumsum(repeats) - 1
    return a.take(indices, axis)
Edit: added support for per-item repeats as well, analogous to np.repeat
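A quick way to gain confidence in this kind of helper is a round trip through `np.repeat`. The sketch below is self-contained and uses its own hypothetical name, `undo_repeat`; it follows the same take-based idea but is not the answer's exact code.

```python
import numpy as np

def undo_repeat(a, repeats, axis):
    """Pick one element per repeated block along the given axis."""
    if np.isscalar(repeats):
        # first element of every block of length `repeats`
        indices = np.arange(a.shape[axis] // repeats) * repeats
    else:
        # last element of every block when block lengths differ
        indices = np.cumsum(repeats) - 1
    return np.take(a, indices, axis=axis)

original = np.array([[1, 2], [3, 4]])

scalar_case = undo_repeat(np.repeat(original, 3, axis=1), 3, axis=1)
list_case = undo_repeat(np.repeat(original, [2, 3], axis=0), [2, 3], axis=0)
print(scalar_case)
print(list_case)
```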
For the case where we know the axis and the repeat - and the repeat is a scalar (same value for all elements) we can construct a slicing index like this:
In [1117]: a=np.array([[1, 1, 2, 2],[3, 3, 4, 4]])
In [1118]: axis=1; repeats=2
In [1119]: ind=[slice(None)]*a.ndim
In [1120]: ind[axis]=slice(None,None,repeats)
In [1121]: ind
Out[1121]: [slice(None, None, None), slice(None, None, 2)]
In [1122]: a[tuple(ind)]
Out[1122]:
array([[1, 2],
[3, 4]])
@Eelco's use of take makes it easier to focus on one axis, but requires a list of indices, not a slice.
But repeat does allow for differing repeat counts.
In [1127]: np.repeat(a1,[2,3],axis=1)
Out[1127]:
array([[1, 1, 2, 2, 2],
[3, 3, 4, 4, 4]])
Knowing axis=1 and repeats=[2,3] we should be able to construct the right take indexing (probably with cumsum). Slicing won't work.
But if we only know the axis, and the repeats are unknown, then we probably need some sort of unique or set operation as in @redratear's answer.
In [1128]: a2=np.repeat(a1,[2,3],axis=1)
In [1129]: y=[list(set(c)) for c in a2]
In [1130]: y
Out[1130]: [[1, 2], [3, 4]]
A take solution with list repeats. This should select the last of each repeated block:
In [1132]: np.take(a2,np.cumsum([2,3])-1,axis=1)
Out[1132]:
array([[1, 2],
[3, 4]])
A deleted answer uses unique; here's my row by row use of unique
In [1136]: np.array([np.unique(row) for row in a2])
Out[1136]:
array([[1, 2],
[3, 4]])
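unique is more predictable than set for this use since it returns the values in sorted order, whereas the iteration order of a set is arbitrary; neither preserves the original column order. There's another problem with unique (or set): what if the original had repeated values, e.g. [[1,2,1,3],[3,3,4,1]]?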
Here is a case where it would be difficult to deduce the repeat pattern from the result. I'd have to look at all the rows first.
In [1169]: a=np.array([[2,1,1,3],[3,3,2,1]])
In [1170]: a1=np.repeat(a,[2,1,3,4], axis=1)
In [1171]: a1
Out[1171]:
array([[2, 2, 1, 1, 1, 1, 3, 3, 3, 3],
[3, 3, 3, 2, 2, 2, 1, 1, 1, 1]])
But cumsum on a known repeat solves it nicely:
In [1172]: ind=np.cumsum([2,1,3,4])-1
In [1173]: ind
Out[1173]: array([1, 2, 5, 9], dtype=int32)
In [1174]: np.take(a1,ind,axis=1)
Out[1174]:
array([[2, 1, 1, 3],
[3, 3, 2, 1]])
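When the repeats are not known, one possible heuristic is to look at all rows at once and start a new block wherever any row changes from one column to the next. This is only a sketch: it cannot tell apart two adjacent original columns that were identical, so it recovers the original only when neighbouring columns differ in at least one row.

```python
import numpy as np

a = np.array([[2, 1, 1, 3], [3, 3, 2, 1]])
a1 = np.repeat(a, [2, 1, 3, 4], axis=1)

# True where column k+1 differs from column k in at least one row.
changed = np.any(a1[:, 1:] != a1[:, :-1], axis=0)

# Column 0 always starts a block; every change starts another one.
starts = np.concatenate(([0], np.flatnonzero(changed) + 1))

recovered = a1[:, starts]
print(starts)
print(recovered)
```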
>>> import numpy as np
>>> x = np.repeat(np.array([[1,2],[3,4]]), 2, axis=1)
>>> y = [list(set(c)) for c in x]  # removes duplicates within each row, so it will not work for x = np.repeat(np.array([[1,1],[3,3]]), 2, axis=1), i.e. [[1,1,1,1],[3,3,3,3]]; the result would be [[1],[3]]
>>> print(y)
[[1, 2], [3, 4]]
You don't need to know the axis or the repeat amount...

Finding indices of non-unique elements in Numpy array

I have found other methods, such as this, to remove duplicate elements from an array. My requirement is slightly different. If I start with:
array([[1, 2, 3],
[2, 3, 4],
[1, 2, 3],
[3, 2, 1],
[3, 4, 5]])
I would like to end up with:
array([[2, 3, 4],
[3, 2, 1],
[3, 4, 5]])
That's what I would ultimately like to end up with, but there is an extra requirement. I would also like to store either an array of indices to discard, or to keep (a la numpy.take).
I am using Numpy 1.8.1
We want to find rows which are not duplicated in your array, while preserving the order.
I use this solution to combine each row of a into a single element, so that we can find the unique rows using np.unique(,return_index=True, return_inverse= True). Then, I modified this function to output the counts of the unique rows using the index and inverse. From there, I can select all unique rows which have counts == 1.
a = np.array([[1, 2, 3],
[2, 3, 4],
[1, 2, 3],
[3, 2, 1],
[3, 4, 5]])
# use a flexible data type, np.void, to combine the columns of `a`
# size of np.void is the number of bytes for an element in `a` multiplied by number of columns
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index=True, return_inverse=True)
def return_counts(index, inv):
    count = np.zeros(len(index), int)
    np.add.at(count, inv.ravel(), 1)
    return count
counts = return_counts(index, inv)
# index[i] is the position in `a` of the first occurrence of unique row i
# if you want the indices to discard replace with: counts[i] > 1
index_keep = sorted(idx for i, idx in enumerate(index) if counts[i] == 1)
>>>a[index_keep]
array([[2, 3, 4],
[3, 2, 1],
[3, 4, 5]])
# if you don't need the indices and just want the array returned while preserving the order
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>>a_unique
array([[2, 3, 4],
[3, 2, 1],
[3, 4, 5]])
For NumPy >= 1.9:
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index=True, return_counts=True)
index_keep = sorted(idx for idx, count in zip(index, counts) if count == 1)
>>>a[index_keep]
array([[2, 3, 4],
[3, 2, 1],
[3, 4, 5]])
You can proceed as follows:
# Assuming your array is a
uniq, uniq_idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)
# to return the array you want
new_arr = uniq[counts == 1]
# The indices of non-unique rows
a_idx = np.arange(a.shape[0]) # the indices of array a
nuniq_idx = a_idx[np.isin(a_idx, uniq_idx[counts==1], invert=True)]
You get:
#new_arr
array([[2, 3, 4],
[3, 2, 1],
[3, 4, 5]])
# nuniq_idx
array([0, 2])
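Note that `uniq[counts == 1]` comes back in sorted row order, which only happens to match the original order for this input. If the original order must be preserved, a possible variation, sketched here under the assumption that `np.unique` supports `axis` (NumPy 1.13 or later), maps the counts back onto every row through the inverse index:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

_, inverse, counts = np.unique(a, axis=0, return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # some NumPy versions return an extra dimension

# counts[inverse][k] is how many times row k occurs in `a`.
keep_mask = counts[inverse] == 1
keep_idx = np.flatnonzero(keep_mask)      # rows that occur exactly once
discard_idx = np.flatnonzero(~keep_mask)  # every copy of a duplicated row

print(a[keep_idx])
print(discard_idx)
```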
If you want to delete all instances of rows that have duplicates, you could iterate through the array, find the indices of rows that occur more than once, and then delete them:
import numpy
# The array to check:
array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])
# List that contains the indices of duplicates (which should be deleted)
deleteIndices = []
for i in range(len(array)):  # Loop through entire array
    indices = list(range(len(array)))  # All indices in array
    del indices[i]  # All indices in array, except the i'th element currently being checked
    for j in indices:  # Loop through every other element in array, except the i'th element, currently being checked
        if (array[i] == array[j]).all():  # Check if element being checked is equal to the j'th element
            deleteIndices.append(j)  # If i'th and j'th element are equal, j is appended to deleteIndices[]
# Sort deleteIndices in ascending order:
deleteIndices.sort()
# Delete duplicates
array = numpy.delete(array, deleteIndices, axis=0)
This outputs:
>>> array
array([[2, 3, 4],
[3, 2, 1],
[3, 4, 5]])
>>> deleteIndices
[0, 2]
That way you both delete the duplicates and get a list of indices to discard.
The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a vectorized manner:
import numpy_indexed as npi
index = npi.as_index(arr)  # arr is the 2D array from the question
keep = index.count == 1
discard = np.invert(keep)
print(index.unique[keep])
