I have a numpy array like the following:
Xtrain = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [1, 7, 3]])
I want to shuffle the items of each row separately, but I do not want the same shuffle for every row (several existing examples just shuffle the column order, applying one permutation to all rows).
For example, I want an output like the following:
output = np.array([[3, 2, 1],
                   [4, 6, 5],
                   [7, 3, 1]])
How can I shuffle each row independently in an efficient way? My actual array has over 100,000 rows and 1,000 columns.
Since you only want to shuffle the columns, you can perform the shuffle on the transpose of your matrix:
In [86]: np.random.shuffle(Xtrain.T)
In [87]: Xtrain
Out[87]:
array([[2, 3, 1],
       [5, 6, 4],
       [7, 3, 1]])
Note that np.random.shuffle() on a 2D array shuffles the rows, not the items within each row; i.e. it only changes the positions of the rows. Therefore, if you change the positions of the transposed matrix's rows, you are actually shuffling the columns of your original array.
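A quick illustration of that behaviour (output varies from run to run):
a = np.arange(9).reshape(3, 3)
np.random.shuffle(a)   # permutes whole rows in place
# a might now be, e.g.:
# array([[6, 7, 8],
#        [0, 1, 2],
#        [3, 4, 5]])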
If you still want a completely independent shuffle you can create random indexes for each row and then create the final array with a simple indexing:
In [172]: def crazyshuffle(arr):
     ...:     x, y = arr.shape
     ...:     rows = np.indices((x, y))[0]
     ...:     cols = [np.random.permutation(y) for _ in range(x)]
     ...:     return arr[rows, cols]
     ...:
Demo:
In [173]: crazyshuffle(Xtrain)
Out[173]:
array([[1, 3, 2],
       [6, 5, 4],
       [7, 3, 1]])
In [174]: crazyshuffle(Xtrain)
Out[174]:
array([[2, 3, 1],
       [4, 6, 5],
       [1, 3, 7]])
From: https://github.com/numpy/numpy/issues/5173
def disarrange(a, axis=-1):
    """
    Shuffle `a` in-place along the given axis.

    Apply numpy.random.shuffle to the given axis of `a`.
    Each one-dimensional slice is shuffled independently.
    """
    b = a.swapaxes(axis, -1)
    # Shuffle `b` in-place along the last axis. `b` is a view of `a`,
    # so `a` is shuffled in place, too.
    shp = b.shape[:-1]
    for ndx in np.ndindex(shp):
        np.random.shuffle(b[ndx])
    return
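A minimal usage sketch (output will vary, since the shuffle is random):
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [1, 7, 3]])
disarrange(arr)   # shuffles arr in place along the last axis, returns None
print(arr)        # e.g. [[2 1 3]
                  #       [6 4 5]
                  #       [3 1 7]]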
This solution is not efficient by any means, but I had fun thinking about it, so I wrote it down. Basically: you ravel the array and create an array of row labels and an array of indices. You shuffle the index array and use it to index both the raveled data and the row labels. Then you apply a stable argsort to the row labels to gather the shuffled data back into rows. Apply that index, reshape, and voila, data shuffled independently by rows:
import numpy as np
r, c = 3, 4 # x.shape
x = np.arange(12) + 1 # Already raveled
inds = np.arange(x.size)
rows = np.repeat(np.arange(r).reshape(-1, 1), c, axis=1).ravel()
np.random.shuffle(inds)
x = x[inds]
rows = rows[inds]
inds = np.argsort(rows, kind='mergesort')
x = x[inds].reshape(r, c)
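For convenience, the same recipe can be wrapped into a function (a sketch of the steps above; the function name is mine):
def shuffle_rows_stable(x2d):
    r, c = x2d.shape
    flat = x2d.ravel()
    inds = np.random.permutation(flat.size)     # one global shuffle of all elements
    rows = np.repeat(np.arange(r), c)[inds]     # row label of each shuffled element
    order = np.argsort(rows, kind='mergesort')  # stable sort gathers rows back together
    return flat[inds][order].reshape(r, c)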
We can create a random 2-dimensional matrix, argsort each of its rows, and then use the resulting index matrix to reorder the target matrix.
target = np.random.randint(10, size=(5, 5))
# [[7 4 0 2 5]
#  [5 6 4 8 7]
#  [6 4 7 9 5]
#  [8 6 6 2 8]
#  [8 1 6 7 3]]
shuffle_helper = np.argsort(np.random.rand(5, 5), axis=1)
# [[0 4 3 2 1]
#  [4 2 1 3 0]
#  [1 2 3 4 0]
#  [1 2 4 3 0]
#  [1 2 3 0 4]]
target[np.arange(shuffle_helper.shape[0])[:, None], shuffle_helper]
# array([[7, 5, 2, 0, 4],
#        [7, 4, 6, 8, 5],
#        [4, 7, 9, 5, 6],
#        [6, 6, 8, 2, 8],
#        [1, 6, 7, 8, 3]])
Explanation
We use np.random.rand and argsort to mimic the effect of shuffling.
random.rand gives the randomness.
Then, we use argsort with axis=1 to rank each row. This produces the index array that can be used for reordering.
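For reference, the same trick packages into a small helper (a sketch; np.take_along_axis requires NumPy >= 1.15, and the function name is mine):
def shuffle_each_row(a):
    # Rank a random matrix row-wise; the ranks form one independent permutation per row
    idx = np.argsort(np.random.rand(*a.shape), axis=1)
    return np.take_along_axis(a, idx, axis=1)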
Let's say you have an array a with shape 100000 x 1000.
b = np.random.choice(100000 * 1000, (100000, 1000), replace=False)
ind = np.argsort(b, axis=1)
a_shuffled = a[np.arange(100000)[:,np.newaxis], ind]
I don't know if this is faster than a loop, because it needs sorting, but maybe with this solution you will invent something better, for example using np.argpartition instead of np.argsort.
You may use Pandas:
df = pd.DataFrame(X_train)
_ = df.apply(lambda x: np.random.permutation(x), axis=1, raw=True)
df.values
Change the keyword to axis=0 if you want to shuffle columns.
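Note that df.apply returns a new object, so depending on your pandas version you may need to capture the result explicitly (e.g. df = df.apply(...)) rather than relying on raw=True mutating df in place.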
Related
For a machine learning project I am doing, I need to transform a 2D array of floats to another array of the same shape where elements to the right and below are at least as large as the given element.
For example,
In [135]: import numpy as np
     ...: A = np.array([[1, 2, 1, 1],
     ...:               [1, 1, 6, 5],
     ...:               [3, 2, 4, 2]])
     ...: print(A)
[[1 2 1 1]
 [1 1 6 5]
 [3 2 4 2]]
Because A[0,1] = 2, I want the following elements (below and to the right) to be >= 2: A[0,2], A[0,3], A[1,1].
Likewise, because A[1,2] = 6, I want the following elements (below and to the right) to be >= 6: A[1,3], A[2,2], A[2,3].
I need to do this for every element in the array. The end result is:
[[1 2 2 2]
[1 2 6 6]
[3 3 6 6]]
Here's code that works, but I'd rather use fewer loops. I'd like to use vector operations or apply the function set_val to all elements of the array A. I looked into meshgrid and vectorize, but didn't see how to pass the index of the element (i.e. row, col) to the function.
def set_val(A, cur_row, cur_col, min_val):
    for row_new in range(cur_row, A.shape[0]):
        for col_new in range(cur_col, A.shape[1]):
            if A[row_new, col_new] < min_val:
                A[row_new, col_new] = min_val

A_new = A.copy()
# Iterate over every element of A
for row, row_data in enumerate(A):
    for col, val in enumerate(row_data):
        # Set values to the right and below to be no smaller than the given value
        set_val(A_new, row, col, val)
print(A_new)
My question: Is there a more efficient (or at least more Pythonic) approach?
You can make use of two "cumulative maximum" calls:
from numpy import maximum as mx
mx.accumulate(mx.accumulate(A), axis=1)
The mx.accumulate calculates the cumulative maximum. This means that for the default axis=0, B = mx.accumulate(A) satisfies b[i,j] = max(a[k,j] for k ≤ i): a running maximum down each column. For axis=1 the same happens, but along each row.
By doing this twice, we know that the result R satisfies r[i,j] = max(a[k,l] for k ≤ i, l ≤ j).
Indeed, if the largest element of that subrectangle sits at some cell (k, l), then the first mx.accumulate(..) copies its value downwards, eventually into the same row as the "target"; the next mx.accumulate(.., axis=1) then copies it rightwards along that row, passing the value to the correct cell.
For the given sample input, we thus obtain:
>>> A
array([[1, 2, 1, 1],
       [1, 1, 6, 5],
       [3, 2, 4, 2]])
>>> mx.accumulate(mx.accumulate(A), axis=1)
array([[1, 2, 2, 2],
       [1, 2, 6, 6],
       [3, 3, 6, 6]])
Benchmarks: if we run the above algorithm for a random 1000×1000 matrix, and we repeat the experiment 100 times, we get the following benchmark:
>>> timeit(lambda: mx.accumulate(mx.accumulate(A), axis=1), number=100)
1.5123104000231251
This thus means that it calculates one such matrix in approximately 15 milliseconds.
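For completeness, a minimal sketch of such a benchmark setup (my assumptions: a random integer matrix and timeit from the standard library):
import numpy as np
from timeit import timeit
from numpy import maximum as mx
A = np.random.randint(0, 1000, size=(1000, 1000))
print(timeit(lambda: mx.accumulate(mx.accumulate(A), axis=1), number=100))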
A question about slicing numpy arrays.
Say I have an array:
A = np.array([1,2,3,4,5,6,7,8,9]).reshape(3,3)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
and indices:
idx = [2,2,1]
and I want to slice each row up to its corresponding index value, i.e. [:2] in the first row, [:2] in the second, [:1] in the third. I would also like to sum the slices as I go.
I know I can achieve this doing the following:
for i, a in zip(idx, A):
    print(a[:i], sum(a[:i]))
output:
[1 2] 3
[4 5] 9
[7] 7
Is there any way this could be achieved without a for loop? The main focus is the irregular slicing; the sum was just an arbitrary operation I want to perform.
Something like:
A[:,:idx]
just to give context to what I mean
You could create a matrix of column indices and build a mask by checking whether each index is within the required range for its row.
idx = np.repeat(np.arange(0,3), 3, 0).reshape(3,3).T
row_limits = np.array([[2], [2], [1]])
mask = idx < row_limits
masked_A = np.multiply(A, mask)
# masked_A outputs:
array([[1, 2, 0],
       [4, 5, 0],
       [7, 0, 0]])
and then apply sum along axis=1
masked_A.sum(1)
# outputs: array([3, 9, 7])
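For what it's worth, the same mask can also be built by broadcasting a 1-D column range against the per-row limits (a small variation on the above; np.where zeroes out the masked entries):
mask = np.arange(A.shape[1]) < np.array([2, 2, 1])[:, None]
np.where(mask, A, 0).sum(axis=1)
# outputs: array([3, 9, 7])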
I want to clean my data by reducing the number of duplicates. I do not want to delete ALL duplicates.
How can I get a numpy array with certain number of duplicates?
Suppose, I have
x = np.array([[1,2,3],[1,2,3],[5,5,5],[1,2,3],[1,2,3]])
and I set the allowed number of duplicates to 2.
And the output should be like
x
>>[[1,2,3],[1,2,3],[5,5,5]]
or
x
>>[[5,5,5],[1,2,3],[1,2,3]]
It does not matter in my task which one I get.
Even though using list appending as an intermediate step is not always a good idea when you already have numpy arrays, in this case it is by far the cleanest way to do it:
def n_uniques(arr, max_uniques):
    uniq, cnts = np.unique(arr, axis=0, return_counts=True)
    arr_list = []
    for i in range(cnts.size):
        num = cnts[i] if cnts[i] <= max_uniques else max_uniques
        arr_list.extend([uniq[i]] * num)
    return np.array(arr_list)
x = np.array([[1,2,3],
              [1,2,3],
              [1,2,3],
              [5,5,5],
              [1,2,3],
              [1,2,3],])
reduced_arr = n_uniques(x, 2)
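For the x above this is deterministic: np.unique sorts the distinct rows, so reduced_arr comes out as [[1, 2, 3], [1, 2, 3], [5, 5, 5]].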
This was kind of tricky, but you can actually do it without loops, preserving the relative order in the original array, with something like this (in this case the first repetitions are preserved):
import numpy as np
def drop_extra_repetitions(x, max_reps):
    # Find unique rows
    uniq, idx_inv, counts = np.unique(x, axis=0, return_inverse=True, return_counts=True)
    # Compute the number of repetitions to keep for each different row
    counts_clip = np.minimum(counts, max_reps)
    # Array alternating between valid unique row indices and -1 ([0, -1, 1, -1, ...])
    idx_to_repeat = np.stack(
        [np.arange(len(uniq)), -np.ones(len(uniq), dtype=int)], axis=1).ravel()
    # Number of repetitions for each of the previous indices
    idx_repeats_clip = np.stack([counts_clip, counts - counts_clip], axis=1).ravel()
    # Valid unique row indices are repeated at most max_reps times;
    # extra repetitions are filled with -1
    idx_clip_sorted = np.repeat(idx_to_repeat, idx_repeats_clip)
    # Sorter for the inverse index - that is, sort the indices in the input array
    # according to their corresponding unique row index (stable, so the first
    # occurrences in the original array are the ones that get kept)
    sorter = np.argsort(idx_inv, kind='stable')
    # The final inverse index is the same as the original, but with -1 on extra repetitions
    idx_inv_final = np.empty(len(sorter), dtype=int)
    idx_inv_final[sorter] = idx_clip_sorted
    # Reconstruct the array from the inverse index, dropping the positions with -1
    return uniq[idx_inv_final[idx_inv_final >= 0]]
x = [[5, 5, 5], [1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]]
max_reps = 2
print(drop_extra_repetitions(x, max_reps))
# [[5 5 5]
#  [1 2 3]
#  [1 2 3]
#  [5 5 5]]
If you do not need to preserve the order at all, then you can simply do:
import numpy as np
def drop_extra_repetitions(x, max_reps):
    uniq, counts = np.unique(x, axis=0, return_counts=True)
    # Repeat each unique row index at most max_reps times
    ret_idx = np.repeat(np.arange(len(uniq)), np.minimum(counts, max_reps))
    return uniq[ret_idx]
x = [[5, 5, 5], [1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]]
max_reps = 2
print(drop_extra_repetitions(x, max_reps))
# [[1 2 3]
#  [1 2 3]
#  [5 5 5]
#  [5 5 5]]
In tensorflow, I would like to sum columns of a 2D tensor according to multiple sets of indices.
For example:
Summing the columns of the following tensor
[[1 2 3 4 5]
 [5 4 3 2 1]]
according to the 2 sets of indices (first set to sum columns 0 1 2 and second set to sum columns 3 4)
[[0,1,2],[3,4]]
should give 2 columns
[[ 6  9]
 [12  3]]
Remarks:
All columns' indices will appear in one and only one set of indices.
This has to be done in Tensorflow, so that gradient can flow through this operation.
Do you have any idea how to perform that operation? I suspect I need to use tf.slice and probably tf.while_loop.
You can do that with tf.segment_sum:
import tensorflow as tf
nums = [[1, 2, 3, 4, 5],
        [5, 4, 3, 2, 1]]
column_idx = [[0, 1, 2], [3, 4]]
with tf.Session() as sess:
    # Data as TF tensor
    data = tf.constant(nums)
    # Make segment ids
    segments = tf.concat([tf.tile([i], [len(lst)]) for i, lst in enumerate(column_idx)],
                         axis=0)
    # Select columns
    data_cols = tf.gather(tf.transpose(data), tf.concat(column_idx, axis=0))
    col_sum = tf.transpose(tf.segment_sum(data_cols, segments))
    print(sess.run(col_sum))
Output:
[[ 6  9]
 [12  3]]
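If you are on TensorFlow 2.x, a rough eager-mode translation of the same approach might look like this (my sketch; note that segment_sum now lives under tf.math):
import tensorflow as tf
nums = tf.constant([[1, 2, 3, 4, 5],
                    [5, 4, 3, 2, 1]])
column_idx = [[0, 1, 2], [3, 4]]
segments = tf.concat([tf.fill([len(lst)], i) for i, lst in enumerate(column_idx)], axis=0)
data_cols = tf.gather(tf.transpose(nums), tf.concat(column_idx, axis=0))
col_sum = tf.transpose(tf.math.segment_sum(data_cols, segments))
print(col_sum.numpy())   # [[ 6  9]
                         #  [12  3]]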
I know of a crude way of solving this with NumPy, if you don't mind a NumPy-based solution.
import numpy as np
mat = np.array([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])
grid1 = np.ix_([0], [0, 1, 2])
item1 = np.sum(mat[grid1])
grid2 = np.ix_([1], [0, 1, 2])
item2 = np.sum(mat[grid2])
grid3 = np.ix_([0], [3, 4])
item3 = np.sum(mat[grid3])
grid4 = np.ix_([1], [3, 4])
item4 = np.sum(mat[grid4])
result = np.array([[item1, item3], [item2, item4]])
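The same idea generalizes with a pair of loops instead of hardcoding each cell (a sketch; the helper name is mine):
def column_group_sums(mat, groups):
    # One np.ix_ selection and sum per (row, column-group) pair
    return np.array([[mat[np.ix_([r], grp)].sum() for grp in groups]
                     for r in range(mat.shape[0])])

column_group_sums(mat, [[0, 1, 2], [3, 4]])
# array([[ 6,  9],
#        [12,  3]])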
I have a matrix X of size (d,N). In other words, there are N vectors with d dimensions each. For example,
X = [[1,2,3,4],[5,6,7,8]]
there are N=4 vectors of d=2 dimensions.
Also, I have a ragged array I (a list of lists); its elements are indices into the columns of the X matrix. For example,
I = [ [0,1], [1,2,3] ]
The I[0]=[0,1] indexes columns 0 and 1 in matrix X. Similarly the element I[1] indexes columns 1,2 and 3. Notice that elements of I are lists that are not of the same length!
What I would like to do is index the columns of matrix X using each element of I, sum those vectors, and get one vector back. Repeating this for each element of I builds a new matrix Y, which should hold as many d-dimensional vectors as there are elements in I. In my example, the Y matrix will have 2 vectors of 2 dimensions.
In my example, the element I[0] says to take columns 0 and 1 from matrix X, sum these two 2-dimensional vectors, and put the result in Y (column 0). Then, the element I[1] says to sum columns 1, 2 and 3 of matrix X and put that new vector in Y (column 1).
I can do this easily using a loop, but I would like to vectorize this operation if possible. My matrix X has hundreds of thousands of columns, and the indexing array I has tens of thousands of elements (each element is a short list of indices).
My loopy code:
Y = np.zeros((d, len(I)))
for i, idx in enumerate(I):
    Y[:, i] = np.sum(X[:, idx], axis=1)
Here's an approach -
# Get a flattened version of the indices
idx0 = np.concatenate(I)
# Get the indices at which we need to do "intervaled summation" along axis=1
cut_idx = np.append(0, [len(i) for i in I])[:-1].cumsum()
# Finally, index into the columns of the array with the flattened indices & sum per interval
out = np.add.reduceat(X[:, idx0], cut_idx, axis=1)
Step-by-step run -
In [67]: X
Out[67]:
array([[ 1,  2,  3,  4],
       [15,  6, 17,  8]])
In [68]: I
Out[68]: array([[0, 2, 3, 1], [2, 3, 1], [2, 3]], dtype=object)
In [69]: idx0 = np.concatenate(I)
In [70]: idx0 # Flattened indices
Out[70]: array([0, 2, 3, 1, 2, 3, 1, 2, 3])
In [71]: cut_idx = np.append(0, [len(i) for i in I])[:-1].cumsum()
In [72]: cut_idx # We need to do addition in intervals limited by these indices
Out[72]: array([0, 4, 7])
In [74]: X[:,idx0]  # Select all of the indexed columns
Out[74]:
array([[ 1,  3,  4,  2,  3,  4,  2,  3,  4],
       [15, 17,  8,  6, 17,  8,  6, 17,  8]])
In [75]: np.add.reduceat(X[:,idx0], cut_idx, axis=1)
Out[75]:
array([[10,  9,  7],
       [46, 31, 25]])
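As a quick sanity check (my addition, reusing X, I, idx0 and cut_idx from the run above), the reduceat result matches the loopy version from the question:
Y = np.zeros((X.shape[0], len(I)))
for i, idx in enumerate(I):
    Y[:, i] = X[:, idx].sum(axis=1)
assert np.array_equal(Y, np.add.reduceat(X[:, idx0], cut_idx, axis=1))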