I want to clean my data by reducing the number of duplicates. I do not want to delete ALL duplicates.
How can I get a numpy array with a certain number of duplicates?
Suppose I have
x = np.array([[1,2,3],[1,2,3],[5,5,5],[1,2,3],[1,2,3]])
and I set the number of duplicates to 2.
Then the output should be like
x
>>[[1,2,3],[1,2,3],[5,5,5]]
or
x
>>[[5,5,5],[1,2,3],[1,2,3]]
The order does not matter in my task.
Even though using list appending as an intermediate step is not always a good idea when you already have numpy arrays, in this case it is by far the cleanest way to do it:
import numpy as np

def n_uniques(arr, max_uniques):
    uniq, cnts = np.unique(arr, axis=0, return_counts=True)
    arr_list = []
    for i in range(cnts.size):
        num = cnts[i] if cnts[i] <= max_uniques else max_uniques
        arr_list.extend([uniq[i]] * num)
    return np.array(arr_list)
x = np.array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 3],
              [5, 5, 5],
              [1, 2, 3],
              [1, 2, 3]])
reduced_arr = n_uniques(x, 2)
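For the example above, this should give the following (note that np.unique sorts the rows, so the original order is not preserved):

print(reduced_arr)
# [[1 2 3]
#  [1 2 3]
#  [5 5 5]]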
This was kind of tricky, but you can actually do it without loops, while preserving the relative order in the original array, with something like this (here the first repetitions are the ones preserved):
import numpy as np

def drop_extra_repetitions(x, max_reps):
    # Find unique rows
    uniq, idx_inv, counts = np.unique(x, axis=0, return_inverse=True, return_counts=True)
    # Clip the number of repetitions of each distinct row
    counts_clip = np.minimum(counts, max_reps)
    # Array alternating between valid unique row indices and -1 ([0, -1, 1, -1, ...])
    idx_to_repeat = np.stack(
        [np.arange(len(uniq)), -np.ones(len(uniq), dtype=int)], axis=1).ravel()
    # Number of repetitions for each of the previous indices
    idx_repeats_clip = np.stack([counts_clip, counts - counts_clip], axis=1).ravel()
    # Valid unique row indices are repeated at most max_reps times;
    # extra repetitions are filled with -1
    idx_clip_sorted = np.repeat(idx_to_repeat, idx_repeats_clip)
    # Sorter for the inverse index - that is, sort the indices in the input array
    # according to their corresponding unique row index. The sort must be stable
    # so that the *first* occurrences are the ones that get kept.
    sorter = np.argsort(idx_inv, kind='stable')
    # The final inverse index is the same as the original but with -1 on extra repetitions
    idx_inv_final = np.empty(len(sorter), dtype=int)
    idx_inv_final[sorter] = idx_clip_sorted
    # Reconstruct the array from the inverse index, dropping the -1 positions
    return uniq[idx_inv_final[idx_inv_final >= 0]]
x = [[5, 5, 5], [1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]]
max_reps = 2
print(drop_extra_repetitions(x, max_reps))
# [[5 5 5]
# [1 2 3]
# [1 2 3]
# [5 5 5]]
If you do not need to preserve the order at all, then you can simply do:
import numpy as np

def drop_extra_repetitions(x, max_reps):
    uniq, counts = np.unique(x, axis=0, return_counts=True)
    # Repeat each unique row index at most max_reps times
    ret_idx = np.repeat(np.arange(len(uniq)), np.minimum(counts, max_reps))
    return uniq[ret_idx]
x = [[5, 5, 5], [1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]]
max_reps = 2
print(drop_extra_repetitions(x, max_reps))
# [[1 2 3]
# [1 2 3]
# [5 5 5]
# [5 5 5]]
For a machine learning project I am doing, I need to transform a 2D array of floats to another array of the same shape where elements to the right and below are at least as large as the given element.
For example,
In [135]: import numpy as np
...: A = np.array([[1, 2, 1, 1],
...: [1, 1, 6, 5],
...: [3, 2, 4, 2]])
...: print(A)
[[1 2 1 1]
[1 1 6 5]
[3 2 4 2]]
Because A[0,1] = 2, I need the following elements (below and to the right) to be >= 2: A[0,2], A[0,3], A[1,1].
Likewise, because A[1,2] = 6, I need the following elements (below and to the right) to be >= 6: A[1,3], A[2,2], A[2,3].
I need to do this for every element in the array. The end result is:
[[1 2 2 2]
[1 2 6 6]
[3 3 6 6]]
Here's code that works, but I'd rather use fewer loops. I'd like to use vector operations or apply the function set_val against all elements of the array A. I looked into meshgrid and vectorize, but didn't see how to pass the index of the array (i.e. row,col) to the function.
def set_val(A, cur_row, cur_col, min_val):
    for row_new in range(cur_row, A.shape[0]):
        for col_new in range(cur_col, A.shape[1]):
            if A[row_new, col_new] < min_val:
                A[row_new, col_new] = min_val

A_new = A.copy()
# Iterate over every element of A
for row, row_data in enumerate(A):
    for col, val in enumerate(row_data):
        # Set values to the right and below to be no smaller than the given value
        set_val(A_new, row, col, val)
print(A_new)
My question: Is there a more efficient (or at least more Pythonic) approach?
You can make use of two "cumulative maximum" calls:
from numpy import maximum as mx
mx.accumulate(mx.accumulate(A), axis=1)
The mx.accumulate call calculates the cumulative maximum. This means that for axis=0 (the default), B = mx.accumulate(A) satisfies b[i,j] = max(a[k,j] for k <= i). For axis=1, the same happens, but row-wise.
By doing this two times, we know that in the result R the value r[i,j] is the maximum of the whole upper-left subrectangle: r[i,j] = max(a[k,l] for k <= i, l <= j).
Indeed, if the largest element of that subrectangle sits at some position, then the first mx.accumulate(..) (over axis=0) will copy that value downwards, and thus eventually to the same row as the "target". The next mx.accumulate(.., axis=1) will then copy that value to the right, into the same column as the "target", and thus pass that value to the correct cell.
For the given sample input, we thus obtain:
>>> A
array([[1, 2, 1, 1],
       [1, 1, 6, 5],
       [3, 2, 4, 2]])
>>> mx.accumulate(mx.accumulate(A), axis=1)
array([[1, 2, 2, 2],
       [1, 2, 6, 6],
       [3, 3, 6, 6]])
Benchmarks: if we run the above algorithm for a random 1000×1000 matrix, and we repeat the experiment 100 times, we get the following benchmark:
>>> timeit(lambda: mx.accumulate(mx.accumulate(A), axis=1), number=100)
1.5123104000231251
This thus means that it calculates one such matrix in approximately 15 milliseconds.
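For reference, a self-contained version of that benchmark might look as follows (the matrix here is random and the timings will of course vary by machine):

import numpy as np
from timeit import timeit
from numpy import maximum as mx

A = np.random.randint(0, 100, size=(1000, 1000))
# Time 100 runs of the double cumulative maximum
total = timeit(lambda: mx.accumulate(mx.accumulate(A), axis=1), number=100)
print(total / 100, 'seconds per call')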
I have a numpy array like the following:
Xtrain = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [1, 7, 3]])
I want to shuffle the items of each row separately, but I do not want the shuffle to be the same for each row (several existing examples just shuffle the column order, applying the same permutation to every row).
For example, I want an output like the following:
output = np.array([[3, 2, 1],
                   [4, 6, 5],
                   [7, 3, 1]])
How can I shuffle each of the rows independently in an efficient way? My actual np array is over 100000 rows and 1000 columns.
Since you want to only shuffle the columns you can just perform the shuffling on the transpose of your matrix:
In [86]: np.random.shuffle(Xtrain.T)
In [87]: Xtrain
Out[87]:
array([[2, 3, 1],
       [5, 6, 4],
       [7, 3, 1]])
Note that np.random.shuffle() on a 2D array shuffles the rows, not the items within each row; i.e. it changes the position of the rows. Therefore, if you change the position of the rows of the transposed matrix, you are actually shuffling the columns of your original array.
If you still want a completely independent shuffle you can create random indexes for each row and then create the final array with a simple indexing:
In [172]: def crazyshuffle(arr):
     ...:     x, y = arr.shape
     ...:     rows = np.indices((x, y))[0]
     ...:     cols = [np.random.permutation(y) for _ in range(x)]
     ...:     return arr[rows, cols]
     ...:
Demo:
In [173]: crazyshuffle(Xtrain)
Out[173]:
array([[1, 3, 2],
       [6, 5, 4],
       [7, 3, 1]])
In [174]: crazyshuffle(Xtrain)
Out[174]:
array([[2, 3, 1],
       [4, 6, 5],
       [1, 3, 7]])
From: https://github.com/numpy/numpy/issues/5173
def disarrange(a, axis=-1):
    """
    Shuffle `a` in-place along the given axis.

    Apply numpy.random.shuffle to the given axis of `a`.
    Each one-dimensional slice is shuffled independently.
    """
    b = a.swapaxes(axis, -1)
    # Shuffle `b` in-place along the last axis. `b` is a view of `a`,
    # so `a` is shuffled in place, too.
    shp = b.shape[:-1]
    for ndx in np.ndindex(shp):
        np.random.shuffle(b[ndx])
    return
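A minimal usage sketch (the exact output varies, since the shuffle is random):

import numpy as np

Xtrain = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [1, 7, 3]])
disarrange(Xtrain, axis=1)
print(Xtrain)  # each row shuffled independently, in place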
This solution is not efficient by any means, but I had fun thinking about it, so I wrote it down. Basically, you ravel the array, and create an array of row labels and an array of indices. You shuffle the index array, and index the original and row label arrays with that. Then you apply a stable argsort to the row labels to gather the data into rows. Apply that index and reshape, and voilà, data shuffled independently by rows:
import numpy as np
r, c = 3, 4 # x.shape
x = np.arange(12) + 1 # Already raveled
inds = np.arange(x.size)
rows = np.repeat(np.arange(r).reshape(-1, 1), c, axis=1).ravel()
np.random.shuffle(inds)
x = x[inds]
rows = rows[inds]
inds = np.argsort(rows, kind='mergesort')
x = x[inds].reshape(r, c)
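For convenience, here is a sketch of the same idea wrapped into a reusable function (the function name is mine, and the shape is no longer hard-coded):

import numpy as np

def shuffle_rows_stable_sort(a):
    r, c = a.shape
    flat = a.ravel()
    # Row label for every element of the raveled array
    row_labels = np.repeat(np.arange(r), c)
    # Shuffle all elements globally
    perm = np.random.permutation(flat.size)
    # A stable sort on the row labels gathers each row's (shuffled) elements back together
    order = np.argsort(row_labels[perm], kind='mergesort')
    return flat[perm][order].reshape(r, c)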
We can create a random 2-dimensional matrix, sort it by each row, and then use the index matrix given by argsort to reorder the target matrix.
target = np.random.randint(10, size=(5, 5))
# [[7 4 0 2 5]
# [5 6 4 8 7]
# [6 4 7 9 5]
# [8 6 6 2 8]
# [8 1 6 7 3]]
shuffle_helper = np.argsort(np.random.rand(5,5), axis=1)
# [[0 4 3 2 1]
# [4 2 1 3 0]
# [1 2 3 4 0]
# [1 2 4 3 0]
# [1 2 3 0 4]]
target[np.arange(shuffle_helper.shape[0])[:, None], shuffle_helper]
# array([[7, 5, 2, 0, 4],
# [7, 4, 6, 8, 5],
# [4, 7, 9, 5, 6],
# [6, 6, 8, 2, 8],
# [1, 6, 7, 8, 3]])
Explanation
We use np.random.rand and argsort to mimic the effect from shuffling.
random.rand gives randomness.
Then, we use argsort with axis=1 to help rank each row. This creates the index that can be used for reordering.
Let's say you have an array a with shape 100000 x 1000.
b = np.random.choice(100000 * 1000, (100000, 1000), replace=False)
ind = np.argsort(b, axis=1)
a_shuffled = a[np.arange(100000)[:,np.newaxis], ind]
I don't know if this is faster than a loop, because it needs sorting, but with this solution maybe you will invent something better, for example with np.argpartition instead of np.argsort.
You may use Pandas:
import pandas as pd

df = pd.DataFrame(Xtrain)
# np.random.shuffle works in place; with raw=True each row is passed to it as a
# numpy view, so the underlying data of df is shuffled row by row
_ = df.apply(np.random.shuffle, axis=1, raw=True)
df.values
Change the keyword to axis=0 if you want to shuffle columns.
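As an aside, on NumPy 1.20 or newer (an assumption about your environment), the Generator API can do this independent per-row shuffle directly:

import numpy as np

rng = np.random.default_rng()
Xtrain = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [1, 7, 3]])
# permuted (unlike shuffle) rearranges each axis-1 slice independently
shuffled = rng.permuted(Xtrain, axis=1)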
In tensorflow, I would like to sum columns of a 2D tensor according to multiple sets of indices.
For example:
Summing the columns of the following tensor
[[1 2 3 4 5]
[5 4 3 2 1]]
according to the 2 sets of indices (first set to sum columns 0 1 2 and second set to sum columns 3 4)
[[0,1,2],[3,4]]
should give 2 columns
[[6 9]
[12 3]]
Remarks:
All columns' indices will appear in one and only one set of indices.
This has to be done in Tensorflow, so that gradient can flow through this operation.
Do you have any idea how to perform that operation? I suspect I need to use tf.slice and probably tf.while_loop.
You can do that with tf.segment_sum:
import tensorflow as tf

nums = [[1, 2, 3, 4, 5],
        [5, 4, 3, 2, 1]]
column_idx = [[0, 1, 2], [3, 4]]
with tf.Session() as sess:
    # Data as TF tensor
    data = tf.constant(nums)
    # Make segment ids
    segments = tf.concat([tf.tile([i], [len(lst)]) for i, lst in enumerate(column_idx)], axis=0)
    # Select columns
    data_cols = tf.gather(tf.transpose(data), tf.concat(column_idx, axis=0))
    col_sum = tf.transpose(tf.segment_sum(data_cols, segments))
    print(sess.run(col_sum))
Output:
[[ 6 9]
[12 3]]
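If you are on TensorFlow 2.x, where sessions are gone, roughly the same approach should work eagerly; note the op now lives at tf.math.segment_sum. A sketch under that assumption:

import tensorflow as tf

nums = [[1, 2, 3, 4, 5],
        [5, 4, 3, 2, 1]]
column_idx = [[0, 1, 2], [3, 4]]

data = tf.constant(nums)
# Segment id i, repeated once per column in the i-th index set
segments = tf.concat([tf.tile([i], [len(lst)]) for i, lst in enumerate(column_idx)], axis=0)
# Gather the columns in segment order, sum per segment, transpose back
data_cols = tf.gather(tf.transpose(data), tf.concat(column_idx, axis=0))
col_sum = tf.transpose(tf.math.segment_sum(data_cols, segments))
print(col_sum.numpy())  # [[ 6  9], [12  3]]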
I know of a crude way of solving this in NumPy, if you don't mind solving the problem outside of TensorFlow.
import numpy as np
mat = np.array([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])
grid1 = np.ix_([0], [0, 1, 2])
item1 = np.sum(mat[grid1])
grid2 = np.ix_([1], [0, 1, 2])
item2 = np.sum(mat[grid2])
grid3 = np.ix_([0], [3, 4])
item3 = np.sum(mat[grid3])
grid4 = np.ix_([1], [3, 4])
item4 = np.sum(mat[grid4])
result = np.array([[item1, item3], [item2, item4]])
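If, as in the example, each index set covers a contiguous block of columns, a much more compact NumPy alternative is np.add.reduceat, which sums between consecutive start offsets:

import numpy as np

mat = np.array([[1, 2, 3, 4, 5],
                [5, 4, 3, 2, 1]])
# One start offset per contiguous group of columns: [0, 1, 2] and [3, 4]
result = np.add.reduceat(mat, [0, 3], axis=1)
print(result)
# [[ 6  9]
#  [12  3]]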
I have found other methods, such as this, to remove duplicate elements from an array. My requirement is slightly different. If I start with:
array([[1, 2, 3],
       [2, 3, 4],
       [1, 2, 3],
       [3, 2, 1],
       [3, 4, 5]])
I would like to end up with:
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
That's what I would ultimately like to end up with, but there is an extra requirement. I would also like to store either an array of indices to discard, or to keep (a la numpy.take).
I am using Numpy 1.8.1
We want to find rows which are not duplicated in your array, while preserving the order.
I use this solution to combine each row of a into a single element, so that we can find the unique rows using np.unique(..., return_index=True, return_inverse=True). Then, I modified that approach to output the counts of the unique rows using the index and inverse. From there, I can select all unique rows which have counts == 1.
a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])
#use a flexible data type, np.void, to combine the columns of `a`
#size of np.void is the number of bytes for an element in `a` multiplied by number of columns
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))

_, index, inv = np.unique(b, return_index=True, return_inverse=True)

def return_counts(index, inv):
    count = np.zeros(len(index), int)
    np.add.at(count, inv, 1)
    return count

counts = return_counts(index, inv)

#if you want the indices to discard instead, replace the condition with: counts[i] > 1
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
#if you don't need the indices and just want the array returned while preserving the order
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>> a_unique
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
For numpy versions >= 1.9:
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index=True, return_counts=True)
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
You can proceed as follows:
# Assuming your array is a
uniq, uniq_idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)
# to return the array you want
new_arr = uniq[counts == 1]
# The indices of non-unique rows
a_idx = np.arange(a.shape[0]) # the indices of array a
nuniq_idx = a_idx[np.in1d(a_idx, uniq_idx[counts==1], invert=True)]
You get:
#new_arr
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
# nuniq_idx
array([0, 2])
If you want to delete all instances of the elements that exist in duplicate versions, you can iterate through the array, find the indices of elements existing in more than one version, and lastly delete these:
import numpy

# The array to check:
array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])

# List that contains the indices of duplicates (which should be deleted)
deleteIndices = []

for i in range(0, len(array)):             # Loop through entire array
    indices = list(range(0, len(array)))   # All indices in array
    del indices[i]                         # All indices except the i'th element currently being checked
    for j in indices:                      # Loop through every other element in array
        if (array[i] == array[j]).all():   # Check if element being checked is equal to the j'th element
            deleteIndices.append(j)        # If the i'th and j'th elements are equal, mark j for deletion

# Sort deleteIndices in ascending order:
deleteIndices.sort()

# Delete duplicates
array = numpy.delete(array, deleteIndices, axis=0)
This outputs:
>>> array
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
>>> deleteIndices
[0, 2]
That way, you both delete the duplicates and get a list of indices to discard.
The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a vectorized manner:
import numpy as np
import numpy_indexed as npi

# `arr` is the input array from the question
index = npi.as_index(arr)
keep = index.count == 1
discard = np.invert(keep)
print(index.unique[keep])
I need to find the location of the max upper triangular entry of a matrix. More specifically, I have a list of rows/columns that need to be ignored when choosing the max entry. In other words, when choosing the max upper triangular entry, certain indices need to be skipped. In that case, what is the most efficient way to find its location?
For example:
>>> a
array([[0, 1, 1, 1],
       [1, 2, 3, 4],
       [4, 5, 6, 6],
       [4, 5, 6, 7]])
>>> indices_to_skip = [0,1,2]
I need to find the index of the max element among all elements in the upper triangle, except for the entries a[0,1], a[0,2], and a[1,2].
You can use np.triu_indices_from:
>>> inds = np.vstack(np.triu_indices_from(a, k=1)).T
array([[0, 1],
       [0, 2],
       [0, 3],
       [1, 2],
       [1, 3],
       [2, 3]])
>>> inds=inds[inds[:,1]>2] #Or whatever columns you want to start from.
>>> inds
array([[0, 3],
       [1, 3],
       [2, 3]])
>>> a[inds[:,0],inds[:,1]]
array([1, 4, 6])
>>> max_index = np.argmax(a[inds[:,0],inds[:,1]])
>>> inds[max_index]
array([2, 3])
Or:
>>> inds = np.triu_indices_from(a, k=1)
>>> mask = (inds[1] > 2)  # Again, change 2 for whatever column you want to start at.
>>> a[inds][mask]
array([1, 4, 6])
>>> max_index = np.argmax(a[inds][mask])
>>> np.vstack(inds).T[mask][max_index]
array([2, 3])
For the above, you can similarly use inds[0] to skip certain rows.
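Tying it together for the exact example in the question, here is a minimal sketch that masks out a[0,1], a[0,2] and a[1,2] and then takes the argmax (the mask positions [0, 1, 3] are my reading of how those entries fall in the triu ordering, not something from the question):

import numpy as np

a = np.array([[0, 1, 1, 1],
              [1, 2, 3, 4],
              [4, 5, 6, 6],
              [4, 5, 6, 7]])

rows, cols = np.triu_indices_from(a, k=1)  # order: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
mask = np.ones(rows.size, dtype=bool)
mask[[0, 1, 3]] = False  # drop a[0,1], a[0,2] and a[1,2]

vals = a[rows[mask], cols[mask]]
best = np.argmax(vals)
print(rows[mask][best], cols[mask][best])  # -> 2 3 (the entry with value 6)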
To skip specific rows or columns:
def ignore_upper(arr, k=0, skip_rows=None, skip_cols=None):
    rows, cols = np.triu_indices_from(arr, k=k)
    if skip_rows is not None:
        row_mask = ~np.in1d(rows, skip_rows)
        rows = rows[row_mask]
        cols = cols[row_mask]
    if skip_cols is not None:
        col_mask = ~np.in1d(cols, skip_cols)
        rows = rows[col_mask]
        cols = cols[col_mask]
    inds = np.ravel_multi_index((rows, cols), arr.shape)
    return np.take(arr, inds)

print(ignore_upper(a, skip_rows=1, skip_cols=2))  # Will also take numpy arrays for skipping.
# [0 1 1 6 7]
The two can be combined and creative use of boolean indexing can help speed up specific cases.
Something interesting that I ran across: a faster way to generate the upper-triangular indices:
def fast_triu_indices(dim, k=0):
    tmp_range = np.arange(dim - k)
    rows = np.repeat(tmp_range, (tmp_range + 1)[::-1])
    cols = np.ones(rows.shape[0], dtype=int)
    inds = np.cumsum(tmp_range[1:][::-1] + 1)
    np.put(cols, inds, np.arange(dim * -1 + 2 + k, 1))
    cols[0] = k
    np.cumsum(cols, out=cols)
    return (rows, cols)
It's about ~6x faster, although it does not work for k<0:
import time

dim = 5000
a = np.random.rand(dim, dim)
k = 50

t = time.time()
rows, cols = np.triu_indices(dim, k=k)
print(time.time() - t)
# 0.913508892059

t = time.time()
rows2, cols2 = fast_triu_indices(dim, k=k)
print(time.time() - t)
# 0.16515994072

print(np.allclose(rows, rows2))
# True
print(np.allclose(cols, cols2))
# True