Split numpy 2D array based on separate label array - python

I have a 2D numpy array A. For example:
A = np.array([[1, 2],
              [3, 4],
              [5, 6],
              [7, 8],
              [9, 0]])
I have another label array B corresponding to rows of A. For example:
B = np.array([0, 1, 2, 0, 1])
I want to split A into 3 arrays based on their labels, so the result would be:
[[[1, 2],
  [7, 8]],
 [[3, 4],
  [9, 0]],
 [[5, 6]]]
Are there any NumPy built-in functions to achieve this?
Right now, my solution is rather ugly: it involves repeatedly calling numpy.where in a for-loop and slicing the resulting index tuples to keep only the rows.
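For reference, a minimal sketch of the kind of loop being described (an assumed reconstruction, not the asker's actual code):
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]])
B = np.array([0, 1, 2, 0, 1])

# One np.where call per label; np.where returns a tuple of index arrays,
# so [0] extracts just the row indices.
result = [A[np.where(B == label)[0]] for label in np.unique(B)]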

Here's one way to do it:
hstack the label array onto A as an extra column
sort the rows by that last column
split the array at the first index of each unique label
a = np.hstack((A, B[:, None]))   # append labels as a new column
a = a[a[:, -1].argsort()]        # sort rows by label
a = np.split(a[:, :-1], np.unique(a[:, -1], return_index=True)[1][1:])
OUTPUT:
[array([[1, 2],
        [7, 8]]),
 array([[3, 4],
        [9, 0]]),
 array([[5, 6]])]

If the output can always be an array because the labels are equally distributed, you only need to sort the data by label:
idx = B.argsort(kind='stable')   # stable sort keeps the original order within each label
n = np.flatnonzero(np.diff(B[idx]))[0] + 1   # rows per label in the sorted data
result = A[idx].reshape(A.shape[0] // n, n, A.shape[1])
If the labels aren't equally distributed, you'll have to make a list in the outer dimension:
_, inverse, counts = np.unique(B, return_inverse=True, return_counts=True)
result = np.split(A[inverse.argsort(kind='stable')], counts.cumsum()[:-1])
Using the equivalent of np.where is not very efficient, but you can do it without a loop:
b, idx = np.unique(B, return_inverse=True)
mask = idx[:, None] == np.arange(b.size)
result = np.split(A[idx.argsort()], np.count_nonzero(mask, axis=0).cumsum()[:-1])
The mask is computed simultaneously for all the labels and applied to the sorted A (A[idx.argsort()]) by counting the number of matching elements in each category (np.count_nonzero(mask, axis=0).cumsum()). The last index is stripped off the cumulative sum because np.split always splits through to the end of the array.

You could also use Pandas for this because it's designed for labelled data and has a powerful groupby method.
import pandas as pd
index = pd.Index(B, name='label')
df = pd.DataFrame(A, index=index)
groups = {k: v.values for k, v in df.groupby('label')}
print(groups)
This produces a dictionary of arrays of the grouped values:
{0: array([[1, 2],
           [7, 8]]),
 1: array([[3, 4],
           [9, 0]]),
 2: array([[5, 6]])}
For a list of the arrays you can do this instead:
groups = [v.values for k, v in df.groupby('label')]

This is probably the simplest way:
groups = [A[B == label, :] for label in np.unique(B)]
print(groups)
Output:
[array([[1, 2],
        [7, 8]]),
 array([[3, 4],
        [9, 0]]),
 array([[5, 6]])]

Related

numpy sort 2d: rearrange rows without changing values in row

How can the rows of an array be sorted without changing the values within each row?
Furthermore: how do I get the indices of this sorting?
input:
a = np.array([[4,3],[0,3],[3,0],[1,3],[1,2],[2,0]])
required sorting array:
b = np.array([1,4,3,5,2,0])
a = a[b]
output:
a = np.array([[0,3],[1,2],[1,3],[2,0],[3,0],[4,3]])
How do I get the array b?
You need lexsort here:
b = np.lexsort((a[:, 1], a[:, 0]))
# array([1, 4, 3, 5, 2, 0], dtype=int64)
And applied to your initial array:
>>> a[b]
array([[0, 3],
       [1, 2],
       [1, 3],
       [2, 0],
       [3, 0],
       [4, 3]])
As @miradulo pointed out, you may also use:
b = np.lexsort(np.fliplr(a).T)
This is less verbose than explicitly stating the columns to sort on.

How to delete last n rows from Numpy array?

I'm trying to delete the last few rows from a NumPy array. I'm able to delete the first i rows with the following code.
for i, line in enumerate(two_d_array1):
    if all(v == 0 for v in line):
        pass
    else:
        break
two_d_array2 = np.delete(two_d_array1, slice(0, i), axis=0)
Any suggestions on how to do the same for the end of the array?
for i, line in enumerate(reversed(two_d_array2)):
    if all(v == 0 for v in line):
        pass
    else:
        break
two_d_array3 = np.delete(two_d_array2, slice(0, i), axis=0)
You can use slice notation for your indexing.
To remove the last n rows from an array:
a = np.arange(10).reshape(5, 2)
>>> a
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
n = 2 # Remove last two rows of array.
>>> a[:-n, :]
array([[0, 1],
       [2, 3],
       [4, 5]])
To remove the first n rows from an array:
>>> a[n:, :] # Remove first two rows.
array([[4, 5],
       [6, 7],
       [8, 9]])
You can also use:
array_name[:-n]
This is just as efficient as the slicing above; both return a view of the array rather than a copy, unlike np.delete.
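If, as in the question, the number of trailing all-zero rows isn't known in advance, here is a loop-free sketch (assuming the array has at least one nonzero row):
import numpy as np

two_d_array1 = np.array([[0, 0], [1, 2], [3, 4], [0, 0], [0, 0]])
# Indices of rows that contain at least one nonzero value.
nonzero_rows = np.flatnonzero((two_d_array1 != 0).any(axis=1))
# Keep everything from the first to the last nonzero row, trimming both ends.
trimmed = two_d_array1[nonzero_rows[0]:nonzero_rows[-1] + 1]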

Finding indices of non-unique elements in Numpy array

I have found other methods, such as this, to remove duplicate elements from an array. My requirement is slightly different. If I start with:
array([[1, 2, 3],
       [2, 3, 4],
       [1, 2, 3],
       [3, 2, 1],
       [3, 4, 5]])
I would like to end up with:
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
That's what I would ultimately like to end up with, but there is an extra requirement. I would also like to store either an array of indices to discard, or to keep (a la numpy.take).
I am using Numpy 1.8.1
We want to find the rows which are not duplicated in your array, while preserving their order.
I use this solution to combine each row of a into a single element, so that we can find the unique rows using np.unique(b, return_index=True, return_inverse=True). Then, I modified this function to output the counts of the unique rows using the index and inverse. From there, I can select all unique rows which have counts == 1.
a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])
# Use a flexible data type, np.void, to combine the columns of `a`.
# The size of the np.void is the itemsize of `a` multiplied by the number of columns.
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index=True, return_inverse=True)
def return_counts(index, inv):
    count = np.zeros(len(index), int)
    np.add.at(count, inv, 1)
    return count
counts = return_counts(index, inv)
# If you want the indices to discard instead, replace the condition with: counts[i] > 1
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
# If you don't need the indices and just want the unique rows returned while preserving the order:
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>> a_unique
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
For NumPy >= 1.9:
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index=True, return_counts=True)
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
You can proceed as follows (np.unique with axis=0 requires NumPy >= 1.13):
# Assuming your array is a
uniq, uniq_idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)
# to return the array you want
new_arr = uniq[counts == 1]
# The indices of non-unique rows
a_idx = np.arange(a.shape[0]) # the indices of array a
nuniq_idx = a_idx[np.in1d(a_idx, uniq_idx[counts==1], invert=True)]
You get:
# new_arr
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
# nuniq_idx
array([0, 2])
If you want to delete all instances of elements that exist in duplicate versions, you can iterate through the array, find the indices of elements existing in more than one version, and finally delete these:
import numpy
# The array to check:
array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])
# List that contains the indices of duplicates (which should be deleted):
deleteIndices = []
for i in range(len(array)):              # Loop through entire array
    indices = list(range(len(array)))    # All indices in array
    del indices[i]                       # All indices except the i'th row currently being checked
    for j in indices:                    # Loop through every other row in the array
        if (array[i] == array[j]).all(): # Check if the i'th and j'th rows are equal
            deleteIndices.append(j)      # If so, append j to deleteIndices[]
# Sort deleteIndices in ascending order:
deleteIndices.sort()
# Delete duplicates:
array = numpy.delete(array, deleteIndices, axis=0)
This outputs:
>>> array
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
>>> deleteIndices
[0, 2]
Like that you both delete the duplicates and get a list of indices to discard.
The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a vectorized manner:
import numpy_indexed as npi

index = npi.as_index(arr)
keep = index.count == 1
discard = np.invert(keep)
print(index.unique[keep])

Numpy: Average of values corresponding to unique coordinate positions

So, I have been browsing Stack Overflow for quite some time now, but I can't seem to find the solution to my problem.
Consider this:
import numpy as np
coo = np.array([[1, 2], [2, 3], [3, 4], [3, 4], [1, 2], [5, 6], [1, 2]])
values = np.array([1, 2, 4, 2, 1, 6, 1])
The coo array contains the (x, y) coordinate positions
x = (1, 2, 3, 3, 1, 5, 1)
y = (2, 3, 4, 4, 2, 6, 2)
and the values array some sort of data for this grid point.
Now I want to get the average of all values for each unique grid point.
For example, the coordinate (1, 2) occurs at positions (0, 4, 6), so for this point I want the average of values[[0, 4, 6]].
How could I get this for all unique grid points?
You can sort coo with np.lexsort to bring the duplicate ones into succession. Then run np.diff along the rows to get a mask of the starts of unique XYs in the sorted version. Using that mask, you can create an ID array that has the same ID for all duplicates. The ID array can then be used with np.bincount to get the sum of all values with the same ID as well as their counts, and thus the average values, as the final output. Here's an implementation along those lines:
# Use lexsort to bring duplicate coo XYs into succession
sortidx = np.lexsort(coo.T)
sorted_coo = coo[sortidx]
# Get mask of the start of each unique coo XY
unqID_mask = np.append(True, np.any(np.diff(sorted_coo, axis=0), axis=1))
# Tag/ID each coo XY based on its uniqueness among the others
ID = unqID_mask.cumsum() - 1
# Get unique coo XYs
unq_coo = sorted_coo[unqID_mask]
# Finally use bincount to get the sum of all values within the same ID
# and their counts, and thus the average values
average_values = np.bincount(ID, values[sortidx]) / np.bincount(ID)
Sample run -
In [65]: coo
Out[65]:
array([[1, 2],
       [2, 3],
       [3, 4],
       [3, 4],
       [1, 2],
       [5, 6],
       [1, 2]])
In [66]: values
Out[66]: array([1, 2, 4, 2, 1, 6, 1])
In [67]: unq_coo
Out[67]:
array([[1, 2],
       [2, 3],
       [3, 4],
       [5, 6]])
In [68]: average_values
Out[68]: array([ 1.,  2.,  3.,  6.])
You can use where:
>>> values[np.where((coo == [1, 2]).all(1))].mean()
1.0
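To extend the same one-liner to every unique grid point, here's a short sketch (np.unique with axis=0 needs NumPy >= 1.13):
unique_coo = np.unique(coo, axis=0)
means = [values[(coo == xy).all(1)].mean() for xy in unique_coo]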
It is very likely going to be faster to flatten your indices, i.e.:
flat_index = coo[:, 0] * (coo[:, 1].max() + 1) + coo[:, 1]  # the +1 keeps distinct (x, y) pairs from colliding
then use np.unique on it:
unq, unq_idx, unq_inv, unq_cnt = np.unique(flat_index,
                                           return_index=True,
                                           return_inverse=True,
                                           return_counts=True)
unique_coo = coo[unq_idx]
unique_mean = np.bincount(unq_inv, values) / unq_cnt
than the similar approach using lexsort.
But under the hood the method is virtually the same.
This is a simple one-liner using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
unique, mean = npi.group_by(coo).mean(values)
Should be comparable to the currently accepted answer in performance, as it does similar things under the hood; but all in a well tested package with a nice interface.
Another way to do it is using JAX unique and grad. This approach might be particularly fast because it allows you to run on an accelerator (CPU, GPU, or TPU).
import functools
import jax
import jax.numpy as jnp

@jax.grad
def _unique_sum(unique_values: jnp.ndarray, unique_inverses: jnp.ndarray, values: jnp.ndarray):
    errors = unique_values[unique_inverses] - values
    return -0.5 * jnp.dot(errors, errors)

@functools.partial(jax.jit, static_argnames=['size'])
def unique_mean(indices, values, size):
    unique_indices, unique_inverses, unique_counts = jnp.unique(
        indices, axis=0, return_inverse=True, return_counts=True, size=size)
    unique_values = jnp.zeros(unique_indices.shape[0], dtype=float)
    return unique_indices, _unique_sum(unique_values, unique_inverses, values) / unique_counts

coo = jnp.array([[1, 2], [2, 3], [3, 4], [3, 4], [1, 2], [5, 6], [1, 2]])
values = jnp.array([1, 2, 4, 2, 1, 6, 1])
unique_coo, means = unique_mean(coo, values, size=4)
print(means.block_until_ready())
The only awkward part is the size argument, since JAX requires all array sizes to be fixed/known in advance. If you make size too small, it will drop good results; if too large, the extra slots come back as NaNs.
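If size isn't known in advance, one option (a sketch using plain NumPy, computed outside of jit) is to count the unique rows first:
import numpy as np
size = int(np.unique(np.asarray(coo), axis=0).shape[0])  # 4 for this example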

Filtering multiple NumPy arrays based on the intersection of one column

I have three rather large NumPy arrays with varying numbers of rows, whose first columns are all integers. My hope is to filter these arrays such that the only rows left are those for whom the value in the first column is shared by all three. This would leave three arrays of the same size. The entries in the other columns are not necessarily shared across arrays.
So, with input:
A =
[[1, 1],
 [2, 2],
 [3, 3]]
B =
[[2, 1],
 [3, 2],
 [4, 3],
 [5, 4]]
C =
[[2, 2],
 [3, 1],
 [5, 2]]
I hope to get back as output:
A =
[[2, 2],
 [3, 3]]
B =
[[2, 1],
 [3, 2]]
C =
[[2, 2],
 [3, 1]]
My current approach (sketched below) is to:
Find the intersection of the three first columns using numpy.intersect1d()
Use numpy.in1d() on this intersection and the first columns of each array to find the row indices that are not shared in each array (converting boolean to index using a modified version of the method found here: Python: intersection indices numpy array)
Finally, use numpy.delete() with each of these indices and its respective array to remove rows with non-shared entries in the first column.
I'm wondering, however, if there might be a faster or more elegantly Pythonic way to go about this, something suited to very large arrays.
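A minimal sketch of that three-step approach (an assumed reconstruction, not the asker's actual code):
from functools import reduce
import numpy as np

# 1. Intersect the first columns; 2. flag rows whose key is not shared;
# 3. delete those rows from each array.
common = reduce(np.intersect1d, (arr[:, 0] for arr in (A, B, C)))
filtered = [np.delete(arr, np.flatnonzero(~np.in1d(arr[:, 0], common)), axis=0)
            for arr in (A, B, C)]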
Your indices in your example are sorted and unique. Assuming this is no coincidence (and this situation often arises, or can easily be enforced), the following works:
from functools import reduce
import numpy as np

A = np.array(
    [[1, 1],
     [2, 2],
     [3, 3]])
B = np.array(
    [[2, 1],
     [3, 2],
     [4, 3],
     [5, 4]])
C = np.array(
    [[2, 2],
     [3, 1],
     [5, 2]])

I = reduce(
    lambda l, r: np.intersect1d(l, r, True),
    (i[:, 0] for i in (A, B, C)))

print(A[np.searchsorted(A[:, 0], I)])
print(B[np.searchsorted(B[:, 0], I)])
print(C[np.searchsorted(C[:, 0], I)])
In case the first column is not in sorted order (but is still unique):
C = np.array(
    [[9, 2],
     [1, 6],
     [5, 1],
     [2, 5],
     [3, 2]])

def index_by_first_column_entry(M, keys):
    colkeys = M[:, 0]
    sorter = np.argsort(colkeys)
    index = np.searchsorted(colkeys, keys, sorter=sorter)
    return M[sorter[index]]

print(index_by_first_column_entry(C, I))
Also make sure to change the True to False in:
I = reduce(
    lambda l, r: np.intersect1d(l, r, False),
    (i[:, 0] for i in (A, B, C)))
A generalization to duplicate key values can be made using np.unique, as in the sketch below.
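For instance (an assumed sketch, not the author's code): collapse each array to one row per key, then reuse the lookup helper above.
_, first = np.unique(C[:, 0], return_index=True)
C_dedup = C[np.sort(first)]  # one row per key, original row order kept
print(index_by_first_column_entry(C_dedup, I))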
One way to do this is to build an indicator array, or a hash table if you like, to indicate which integers are in all your input arrays. Then you can use boolean indexing based on this indicator array to get the subarrays. Something like this:
import numpy as np

# Setup
A = np.array(
    [[1, 1],
     [2, 2],
     [3, 3]])
B = np.array(
    [[2, 1],
     [3, 2],
     [4, 3],
     [5, 4]])
C = np.array(
    [[2, 2],
     [3, 1],
     [5, 2]])

def take_overlap(*input):
    n = len(input)
    maxIndex = max(array[:, 0].max() for array in input)
    indicator = np.zeros(maxIndex + 1, dtype=int)
    for array in input:
        indicator[array[:, 0]] += 1
    indicator = indicator == n
    result = []
    for array in input:
        # Look up each integer in the indicator array
        mask = indicator[array[:, 0]]
        # Use boolean indexing to get the sub array
        result.append(array[mask])
    return result

subA, subB, subC = take_overlap(A, B, C)
This should be quite fast, and this method does not assume the elements of the input arrays are unique or sorted. However, this method could take a lot of memory, and might be a bit slower, if the indexing integers are sparse, i.e. [1, 10, 10000], but should be close to optimal if the integers are more or less dense (for sparse keys, see the sketch below).
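For the sparse case, a variant built on np.intersect1d (a sketch, not part of the original answer) avoids allocating an indicator array over the full key range:
from functools import reduce
import numpy as np

def take_overlap_sparse(*arrays):
    # Intersect the key columns directly instead of building a dense
    # indicator array spanning the full key range.
    common = reduce(np.intersect1d, (arr[:, 0] for arr in arrays))
    return [arr[np.in1d(arr[:, 0], common)] for arr in arrays]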
This works but I'm not sure if it is faster than any of the other answers:
import numpy as np

A = np.array(
    [[1, 1],
     [2, 2],
     [3, 3]])
B = np.array(
    [[2, 1],
     [3, 2],
     [4, 3],
     [5, 4]])
C = np.array(
    [[2, 2],
     [3, 1],
     [5, 2]])

a = A[:, 0]
b = B[:, 0]
c = C[:, 0]

ab = np.where(a[:, np.newaxis] == b[np.newaxis, :])
bc = np.where(b[:, np.newaxis] == c[np.newaxis, :])

ab_in_bc = np.in1d(ab[1], bc[0])
bc_in_ab = np.in1d(bc[0], ab[1])

arows = ab[0][ab_in_bc]
brows = ab[1][ab_in_bc]
crows = bc[1][bc_in_ab]

anew = A[arows, :]
bnew = B[brows, :]
cnew = C[crows, :]
print(anew)
print(bnew)
print(cnew)
gives:
[[2 2]
 [3 3]]
[[2 1]
 [3 2]]
[[2 2]
 [3 1]]
