I have a 2D numpy array A. For example:
A = np.array([[1, 2],
              [3, 4],
              [5, 6],
              [7, 8],
              [9, 0]])
I have another label array B corresponding to rows of A. For example:
B = np.array([0, 1, 2, 0, 1])
I want to split A into 3 arrays based on their labels, so the result would be:
[[[1, 2],
  [7, 8]],
 [[3, 4],
  [9, 0]],
 [[5, 6]]]
Are there any numpy built in functions to achieve this?
Right now, my solution is rather ugly: it involves repeatedly calling numpy.where in a for-loop and slicing the resulting index tuples down to just the row indices.
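For context, a minimal sketch of the kind of loop I mean (simplified; this is my assumption of what such a solution looks like, using the A and B above):

result = []
for label in np.unique(B):
    rows = np.where(B == label)[0]  # indices of the rows carrying this label
    result.append(A[rows])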
Here's one way to do it:
hstack the label array onto A as an extra column.
sort the stacked array by that last column.
split the array at the first index of each unique label.
a = np.hstack((A, B[:, None]))                                          # append labels as a last column
a = a[a[:, -1].argsort()]                                               # sort rows by label
a = np.split(a[:, :-1], np.unique(a[:, -1], return_index=True)[1][1:])  # drop the labels and split
OUTPUT:
[array([[1, 2],
        [7, 8]]),
 array([[3, 4],
        [9, 0]]),
 array([[5, 6]])]
If the output can always be an array because the labels are equally distributed, you only need to sort the data by label:
idx = B.argsort()
n = np.flatnonzero(np.diff(B[idx]))[0] + 1   # size of each group
result = A[idx].reshape(A.shape[0] // n, n, A.shape[1])
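Since the B in this question is not equally distributed, here is a quick illustration with made-up data where it is (A2 and B2 are hypothetical):

A2 = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 0], [11, 12]])
B2 = np.array([0, 1, 2, 0, 1, 2])             # every label occurs exactly twice

idx = B2.argsort()
n = np.flatnonzero(np.diff(B2[idx]))[0] + 1   # group size: 2
result = A2[idx].reshape(A2.shape[0] // n, n, A2.shape[1])
# result[0] is the rows labelled 0: [[1, 2], [7, 8]]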
If the labels aren't equally distributed, you'll have to make a list in the outer dimension:
_, inv, counts = np.unique(B, return_inverse=True, return_counts=True)
result = np.split(A[inv.argsort()], counts.cumsum()[:-1])
Using the equivalent of np.where is not very efficient, but you can do it without a loop:
b, idx = np.unique(B, return_inverse=True)
mask = idx[:, None] == np.arange(b.size)
result = np.split(A[idx.argsort()], np.count_nonzero(mask, axis=0).cumsum()[:-1])
You can compute the mask simultaneously for all the labels, apply it to A sorted by label (A[idx.argsort()]), and count the number of matching elements in each category (np.count_nonzero(mask, axis=0).cumsum()). The last index is stripped off the cumulative sum because it equals the total length, and np.split already splits through to the end of the array.
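To make the intermediate steps concrete, this is what the mask and the counts look like for the example data:

b, idx = np.unique(B, return_inverse=True)
mask = idx[:, None] == np.arange(b.size)
print(mask.astype(int))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]
print(np.count_nonzero(mask, axis=0))            # [2 2 1] -> group sizes
print(np.count_nonzero(mask, axis=0).cumsum())   # [2 4 5] -> split points (last one dropped)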
You could also use Pandas for this because it's designed for labelled data and has a powerful groupby method.
import pandas as pd
index = pd.Index(B, name='label')
df = pd.DataFrame(A, index=index)
groups = {k: v.values for k, v in df.groupby('label')}
print(groups)
This produces a dictionary of arrays of the grouped values:
{0: array([[1, 2],
           [7, 8]]),
 1: array([[3, 4],
           [9, 0]]),
 2: array([[5, 6]])}
For a list of the arrays you can do this instead:
groups = [v.values for k, v in df.groupby('label')]
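If you'd rather not build a named index at all, groupby also accepts a plain array of the same length as the grouping key (a small variation, not part of the original answer):

groups = [v.values for _, v in pd.DataFrame(A).groupby(B)]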
This is probably the simplest way:
groups = [A[B == label, :] for label in np.unique(B)]
print(groups)
Output:
[array([[1, 2],
        [7, 8]]),
 array([[3, 4],
        [9, 0]]),
 array([[5, 6]])]
I want to create an array whose values originate from another array. My input array consists of three columns. Rows that share a value in the second column should be grouped together: their third-column values form one row of the result. In this example the first three rows have the same value in the second column, so the first row of the new array holds their three third-column values.
a =
[[1, 1, 4],
 [2, 1, 6],
 [3, 1, 7],
 [4, 2, 0],
 [5, 2, 7],
 [6, 3, 1]]
result:
b =
[[4, 6, 7],
 [0, 7],
 [1]]
I tried:
c = []
x = 1
for row in a:
    if row[0] == x:
        c.append(row[2])
    else:
        x = x + 1
        c.append(row[2])
But the result is just a flat list of all the third-column values.
import numpy as np

a = np.asarray(a)
c = []
for i in range(1, a[-1, 1] + 1):  # a[-1, 1] is the largest label that occurs
    save = a[a[:, 1] == i]        # take all the rows that have i in the second column
    c.append(save[:, 2])          # of those, keep the third-column entries
It's important that a is converted to a np.array for this.
If the second column is sorted, you can use np.diff to find the indices where the value changes and then split on them:
np.split(a[:,2], np.flatnonzero(np.diff(a[:,1]) != 0)+1)
# [array([4, 6, 7]), array([0, 7]), array([1])]
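If the second column happens not to be sorted, a stable sort on it first should give the same result (a small extension of the same idea):

order = np.argsort(a[:, 1], kind='stable')  # stable, so row order within each group is kept
s = a[order]
np.split(s[:, 2], np.flatnonzero(np.diff(s[:, 1]) != 0) + 1)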
The below works for me:
import numpy as np
c = [[]]
x = 1
for row in a:
    if row[1] == x:
        c[-1].append(row[2])
    else:
        x = x + 1
        c.append([row[2]])
c = np.asarray(c, dtype=object)  # the rows are ragged, so this has to be an object array
I have the following problem:
Let's say I have an array defined like this:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
What I would like to do is to make use of Numpy multiple indexing and set several elements to 0. To do that I'm creating a vector:
indices_to_remove = [1, 2, 0]
What I want it to mean is the following:
Remove element with index '1' from the first row
Remove element with index '2' from the second row
Remove element with index '0' from the third row
The result should be the array [[1,0,3],[4,5,0],[0,8,9]]
I've managed to get values of the elements I would like to modify by following code:
values = np.diagonal(np.take(A, indices_to_remove, axis=1))
However, that doesn't allow me to modify them. How could this be solved?
You could use integer array indexing to assign those zeros -
A[np.arange(len(indices_to_remove)), indices_to_remove] = 0
Sample run -
In [445]: A
Out[445]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [446]: indices_to_remove
Out[446]: [1, 2, 0]
In [447]: A[np.arange(len(indices_to_remove)), indices_to_remove] = 0
In [448]: A
Out[448]:
array([[1, 0, 3],
       [4, 5, 0],
       [0, 8, 9]])
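For reference, the same assignment can also be phrased through flat indices (an equivalent formulation, not part of the original answer):

flat = np.ravel_multi_index((np.arange(A.shape[0]), indices_to_remove), A.shape)
A.flat[flat] = 0   # flat is [1, 5, 6] for this example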
I have found other methods, such as this, to remove duplicate elements from an array. My requirement is slightly different. If I start with:
array([[1, 2, 3],
       [2, 3, 4],
       [1, 2, 3],
       [3, 2, 1],
       [3, 4, 5]])
I would like to end up with:
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
That's what I would ultimately like to end up with, but there is an extra requirement. I would also like to store either an array of indices to discard, or to keep (a la numpy.take).
I am using Numpy 1.8.1
We want to find rows which are not duplicated in your array, while preserving the order.
I use this solution to combine each row of a into a single element, so that we can find the unique rows using np.unique(..., return_index=True, return_inverse=True). Then, I modified this function to output the counts of the unique rows using the index and inverse. From there, I can select all unique rows which have counts == 1.
a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

# use a flexible data type, np.void, to combine the columns of `a`
# size of np.void is the number of bytes for an element in `a` multiplied by the number of columns
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index=True, return_inverse=True)

def return_counts(index, inv):
    count = np.zeros(len(index), int)
    np.add.at(count, inv, 1)
    return count

counts = return_counts(index, inv)

# if you want the indices to discard, replace the condition with: counts[i] > 1
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

# if you don't need the indices and just want the array returned while preserving the order
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>> a_unique
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
For np.version >= 1.9:
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index=True, return_counts=True)
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
You can proceed as follows:
# Assuming your array is a
uniq, uniq_idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)
# to return the array you want
new_arr = uniq[counts == 1]
# The indices of non-unique rows
a_idx = np.arange(a.shape[0]) # the indices of array a
nuniq_idx = a_idx[np.in1d(a_idx, uniq_idx[counts==1], invert=True)]
You get:
#new_arr
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
# nuniq_idx
array([0, 2])
If you want to delete all instances of the elements that exist in duplicate versions, you can iterate through the array, find the indices of elements existing in more than one version, and finally delete these:
import numpy

# The array to check:
array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])

# List that contains the indices of duplicates (which should be deleted)
deleteIndices = []
for i in range(len(array)):               # Loop through entire array
    indices = list(range(len(array)))     # All indices in array
    del indices[i]                        # All indices except the i'th element currently being checked
    for j in indices:                     # Loop through every other element
        if (array[i] == array[j]).all():  # Check if the i'th and j'th rows are equal
            deleteIndices.append(j)       # If they are, mark j for deletion

# Sort deleteIndices in ascending order:
deleteIndices.sort()

# Delete duplicates
array = numpy.delete(array, deleteIndices, axis=0)
This outputs:
>>> array
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
>>> deleteIndices
[0, 2]
That way you both delete the duplicates and get a list of indices to discard.
The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a vectorized manner:
import numpy as np
import numpy_indexed as npi

index = npi.as_index(arr)     # arr is the array from the question
keep = index.count == 1
discard = np.invert(keep)
print(index.unique[keep])
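For this question's example, a run would look roughly like this (the printed order follows index.unique, which I believe is sorted):

arr = np.array([[1, 2, 3],
                [2, 3, 4],
                [1, 2, 3],
                [3, 2, 1],
                [3, 4, 5]])
index = npi.as_index(arr)
print(index.unique[index.count == 1])  # the three rows that occur exactly once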
I need to find the location of the maximum entry in the upper triangle of a matrix. More specifically, I have a list of rows/columns that need to be ignored when choosing that entry. In other words, when choosing the max upper triangular entry, certain indices need to be skipped. In that case, what is the most efficient way to find the location of the max upper triangular entry?
For example:
>>> a
array([[0, 1, 1, 1],
       [1, 2, 3, 4],
       [4, 5, 6, 6],
       [4, 5, 6, 7]])
>>> indices_to_skip = [0, 1, 2]
I need to find the index of the max element among all elements in the upper triangle, except for the entries a[0,1], a[0,2], and a[1,2].
You can use np.triu_indices_from:
>>> inds = np.vstack(np.triu_indices_from(a, k=1)).T
>>> inds
array([[0, 1],
       [0, 2],
       [0, 3],
       [1, 2],
       [1, 3],
       [2, 3]])
>>> inds = inds[inds[:, 1] > 2]  # Or whatever columns you want to start from.
>>> inds
array([[0, 3],
       [1, 3],
       [2, 3]])
>>> a[inds[:, 0], inds[:, 1]]
array([1, 4, 6])
>>> max_index = np.argmax(a[inds[:, 0], inds[:, 1]])
>>> inds[max_index]
array([2, 3])
Or:
>>> inds = np.triu_indices_from(a, k=1)
>>> mask = (inds[1] > 2)  # Again, change 2 for whatever column you want to start at.
>>> a[inds][mask]
array([1, 4, 6])
>>> max_index = np.argmax(a[inds][mask])
>>> (inds[0][mask][max_index], inds[1][mask][max_index])
(2, 3)
For the above you can use inds[0] in the same way to skip certain rows.
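For instance, continuing the session above, to additionally drop everything from rows 0 and 1 (rows_to_skip here is a made-up example):

>>> rows_to_skip = [0, 1]
>>> row_mask = ~np.in1d(inds[0][mask], rows_to_skip)
>>> a[inds][mask][row_mask]
array([6])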
To skip specific rows or columns:
def ignore_upper(arr, k=0, skip_rows=None, skip_cols=None):
    rows, cols = np.triu_indices_from(arr, k=k)
    if skip_rows is not None:
        row_mask = ~np.in1d(rows, skip_rows)
        rows = rows[row_mask]
        cols = cols[row_mask]
    if skip_cols is not None:
        col_mask = ~np.in1d(cols, skip_cols)
        rows = rows[col_mask]
        cols = cols[col_mask]
    inds = np.ravel_multi_index((rows, cols), arr.shape)
    return np.take(arr, inds)

print(ignore_upper(a, skip_rows=1, skip_cols=2))  # Will also take numpy arrays for skipping.
[0 1 1 6 7]
The two can be combined and creative use of boolean indexing can help speed up specific cases.
Something interesting that I ran across, a faster way to take upper triu indices:
def fast_triu_indices(dim, k=0):
    tmp_range = np.arange(dim - k)
    rows = np.repeat(tmp_range, (tmp_range + 1)[::-1])
    cols = np.ones(rows.shape[0], dtype=int)
    inds = np.cumsum(tmp_range[1:][::-1] + 1)
    np.put(cols, inds, np.arange(dim * -1 + 2 + k, 1))
    cols[0] = k
    np.cumsum(cols, out=cols)
    return (rows, cols)
It's about 6x faster, although it does not work for k<0:
import time

dim = 5000
a = np.random.rand(dim, dim)
k = 50

t = time.time()
rows, cols = np.triu_indices(dim, k=k)
print(time.time() - t)
# 0.913508892059

t = time.time()
rows2, cols2 = fast_triu_indices(dim, k=k)
print(time.time() - t)
# 0.16515994072

print(np.allclose(rows, rows2))
# True
print(np.allclose(cols, cols2))
# True