I am trying to turn a second order tensor into a binary third order tensor. Given a second order tensor as a m x n numpy array: A, I need to take each element value: x, in A and replace it with a vector: v, with dimensions equal to the maximum value of A, but with a value of 1 incremented at the index of v corresponding to the value x (i.e. v[x] = 1). I have been following this question: Increment given indices in a matrix, which addresses producing an array with increments at indices given by 2 dimensional coordinates. I have been reading the answers and trying to use np.ravel_multi_index() and np.bincount() to do the same but with 3 dimensional coordinates, however I keep on getting a ValueError: "invalid entry in coordinates array". This is what I have been using:
def expand_to_tensor_3(array):
(x, y) = array.shape
(a, b) = np.indices((x, y))
a = a.reshape(x*y)
b = b.reshape(x*y)
tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)), (x, y, np.amax(array))))
return tensor_3
If you know what is wrong here or know an even better method to accomplish my goal, both would be really helpful, thanks.
You can use (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int).
Here's a demonstration:
In [52]: A
Out[52]:
array([[2, 0, 0, 2],
[3, 1, 2, 3],
[3, 2, 1, 0]])
In [53]: B = (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int)
In [54]: B
Out[54]:
array([[[0, 0, 1, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0]],
[[0, 0, 0, 1],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]],
[[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 0]]])
Check a few individual elements of A:
In [55]: A[0,0]
Out[55]: 2
In [56]: B[0,0,:]
Out[56]: array([0, 0, 1, 0])
In [57]: A[1,3]
Out[57]: 3
In [58]: B[1,3,:]
Out[58]: array([0, 0, 0, 1])
The expression A[:,:,np.newaxis] == np.arange(A.max()+1) uses broadcasting to compare each element of A to np.arange(A.max()+1). For a single value, this looks like:
In [63]: 3 == np.arange(A.max()+1)
Out[63]: array([False, False, False, True], dtype=bool)
In [64]: (3 == np.arange(A.max()+1)).astype(int)
Out[64]: array([0, 0, 0, 1])
A[:,:,np.newaxis] is a three-dimensional view of A with shape (3,4,1). The extra dimension is added so that the comparison to np.arange(A.max()+1) will broadcast to each element, giving a result with shape (3, 4, A.max()+1).
With a trivial change, this will work for an n-dimensional array. Indexing a numpy array with the ellipsis ... means "all the other dimensions". So
(A[..., np.newaxis] == np.arange(A.max()+1)).astype(int)
converts an n-dimensional array to an (n+1)-dimensional array, where the last dimension is the binary indicator of the integer in A. Here's an example with a one-dimensional array:
In [6]: a = np.array([3, 4, 0, 1])
In [7]: (a[...,np.newaxis] == np.arange(a.max()+1)).astype(int)
Out[7]:
array([[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]])
You can make it work this way:
tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)),
(x, y, np.amax(array) + 1)))
The difference is that I add 1 to the amax() result, because ravel_multi_index() expects that the indexes are all strictly less than the dimensions, not less-or-equal.
I'm not 100% sure if this is what you wanted; another way to make the code run is to specify mode='clip' or mode='wrap' in ravel_multi_index(), which does something a bit different and I'm guessing is less correct. But you can try it.
I need to find unique rows in a numpy.array.
For example:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
I know that i can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe that there is a way to set data type to void and then I could just use numpy.unique, but I couldn't figure out how to make it work.
As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, one can do:
unique_rows = np.unique(original_array, axis=0)
Yet another possible solution
np.vstack({tuple(row) for row in a})
Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:
a = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_a = a[idx]
>>> unique_a
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
EDIT
Added np.ascontiguousarray following #seberg's recommendation. This will slow the method down if the array is not already contiguous.
EDIT
The above can be slightly sped up, perhaps at the cost of clarity, by doing:
unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])
Also, at least on my system, performance wise it is on par, or even better, than the lexsort method:
a = np.random.randint(2, size=(10000, 6))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop
a = np.random.randint(2, size=(10000, 100))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop
If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.
The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.
As a quick example:
import numpy as np
data = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)
uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print uniq
To understand what's going on, have a look at the intermediary results.
Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)
In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
[(0, 1, 1, 1, 0, 0)],
[(0, 1, 1, 1, 0, 0)],
[(1, 1, 1, 0, 0, 0)],
[(1, 1, 1, 1, 1, 0)]],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
Once we run numpy.unique, we'll get a structured array back:
In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
That we then need to view as a "normal" array (_ stores the result of the last calculation in ipython, which is why you're seeing _.view...):
In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, give the number of columns):
In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
Obviously, if you wanted to be more concise, you could write it as:
import numpy as np
def unique_rows(data):
uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
return uniq.view(data.dtype).reshape(-1, data.shape[1])
data = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
print unique_rows(data)
Which results in:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
np.unique when I run it on np.random.random(100).reshape(10,10) returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:
array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)
That is the only way I see you changing the types to do what you want, and I am not sure if the list iteration to change to tuples is okay with your "not looping through"
np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:
ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]
This method does not use tuples, and should be much faster and simpler than other methods given here.
NOTE: A previous version of this did not have the ind right after a[, which mean that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:
b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]
This is faster and uses less memory.
Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:
b = a[lexsort(a.reshape((a.shape[0],-1)).T)];
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]
An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
Edit:
To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:
In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop
In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop
In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop
With a larger a, however, this version ends up being much, much faster:
In [96]: a = np.random.randint(0,2,size=(10000,6))
In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop
In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop
In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop
I've compared the suggested alternative for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're looking for speed, you'll want
numpy.unique(
a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])
I've implemented that fastest variant in npx.unique_rows.
There is a bug report on GitHub for this, too.
Code to reproduce the plot:
import numpy
import perfplot
def unique_void_view(a):
return (
numpy.unique(a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))))
.view(a.dtype)
.reshape(-1, a.shape[1])
)
def lexsort(a):
ind = numpy.lexsort(a.T)
return a[
ind[numpy.concatenate(([True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]
]
def vstack(a):
return numpy.vstack([tuple(row) for row in a])
def unique_axis(a):
return numpy.unique(a, axis=0)
perfplot.show(
setup=lambda n: numpy.random.randint(2, size=(n, 20)),
kernels=[unique_void_view, lexsort, vstack, unique_axis],
n_range=[2 ** k for k in range(15)],
xlabel="len(a)",
equality_check=None,
)
Here is another variation for #Greg pythonic answer
np.vstack(set(map(tuple, a)))
I didn’t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being “equal” means “within some 𝜀”. The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, took the threshold to be both element-wise and decimal precision, which works for some cases but isn’t as mathematically general as a true vector distance.
Here’s my version:
from scipy.spatial.distance import squareform, pdist
def uniqueRows(arr, thresh=0.0, metric='euclidean'):
"Returns subset of rows that are unique, in terms of Euclidean distance"
distances = squareform(pdist(arr, metric=metric))
idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
return arr[[x[0] for x in idxset]]
# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
return uniqueRows(arr.T, *args, **kwargs)
The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. Then it compares each each distance to a threshold to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.
As hinted, the distance metric needn’t be Euclidean—pdist can compute sundry distances including cityblock (Manhattan-norm) and cosine (the angle between vectors).
If thresh=0 (the default), then rows have to be bit-exact to be considered “unique”. Other good values for thresh use scaled machine-precision, i.e., thresh=np.spacing(1)*1e3.
Why not use drop_duplicates from pandas:
>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop
>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop
The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:
import numpy_indexed as npi
new_a = npi.unique(a) # unique elements over axis=0 (rows) by default
Based on the answer in this page I have written a function that replicates the capability of MATLAB's unique(input,'rows') function, with the additional feature to accept tolerance for checking the uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.
def unique_rows(data, prec=5):
import numpy as np
d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
_, ia = np.unique(b, return_index=True)
_, ic = np.unique(b, return_inverse=True)
return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
Beyond #Jaime excellent answer, another way to collapse a row is to uses a.strides[0] (assuming a is C-contiguous) which is equal to a.dtype.itemsize*a.shape[0]. Furthermore void(n) is a shortcut for dtype((void,n)). we arrive finally to this shortest version :
a[unique(a.view(void(a.strides[0])),1)[1]]
For
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
np.unique works given a list of tuples:
>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4]])
With a list of lists it raises a TypeError: unhashable type: 'list'
For general purpose like 3D or higher multidimensional nested arrays, try this:
import numpy as np
def unique_nested_arrays(ar):
origin_shape = ar.shape
origin_dtype = ar.dtype
ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
ar = np.ascontiguousarray(ar)
unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])
which satisfies your 2D dataset:
a = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)
gives:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
But also 3D arrays like:
b = np.array([[[1, 1, 1], [0, 1, 1]],
[[0, 1, 1], [1, 1, 1]],
[[1, 1, 1], [0, 1, 1]],
[[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)
gives:
array([[[0, 1, 1], [1, 1, 1]],
[[1, 1, 1], [0, 1, 1]],
[[1, 1, 1], [1, 1, 1]]])
None of these answers worked for me. I'm assuming as my unique rows contained strings and not numbers. However this answer from another thread did work:
Source: https://stackoverflow.com/a/38461043/5402386
You can use .count() and .index() list's methods
coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]
We can actually turn m x n numeric numpy array into m x 1 numpy string array, please try using the following function, it provides count, inverse_idx and etc, just like numpy.unique:
import numpy as np
def uniqueRow(a):
#This function turn m x n numpy array into m x 1 numpy array storing
#string, and so the np.unique can be used
#Input: an m x n numpy array (a)
#Output unique m' x n numpy array (unique), inverse_indx, and counts
s = np.chararray((a.shape[0],1))
s[:] = '-'
b = (a).astype(np.str)
s2 = np.expand_dims(b[:,0],axis=1) + s + np.expand_dims(b[:,1],axis=1)
n = a.shape[1] - 2
for i in range(0,n):
s2 = s2 + s + np.expand_dims(b[:,i+2],axis=1)
s3, idx, inv_, c = np.unique(s2,return_index = True, return_inverse = True, return_counts = True)
return a[idx], inv_, c
Example:
A = np.array([[ 3.17 9.502 3.291],
[ 9.984 2.773 6.852],
[ 1.172 8.885 4.258],
[ 9.73 7.518 3.227],
[ 8.113 9.563 9.117],
[ 9.984 2.773 6.852],
[ 9.73 7.518 3.227]])
B, inv_, c = uniqueRow(A)
Results:
B:
[[ 1.172 8.885 4.258]
[ 3.17 9.502 3.291]
[ 8.113 9.563 9.117]
[ 9.73 7.518 3.227]
[ 9.984 2.773 6.852]]
inv_:
[3 4 1 0 2 4 0]
c:
[2 1 1 1 2]
Lets get the entire numpy matrix as a list, then drop duplicates from this list, and finally return our unique list back into a numpy matrix:
matrix_as_list=data.tolist()
matrix_as_list:
[[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
uniq_list=list()
uniq_list.append(matrix_as_list[0])
[uniq_list.append(item) for item in matrix_as_list if item not in uniq_list]
unique_matrix=np.array(uniq_list)
unique_matrix:
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
The most straightforward solution is to make the rows a single item by making them strings. Each row then can be compared as a whole for its uniqueness using numpy. This solution is generalize-able you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)
Will Give:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
Send my nobel prize in the mail
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
# create a view that the subarray as tuple and return unique indeies.
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
return_index=True)
# get unique set
print(original[unique_index])