For a machine learning project I am doing, I need to transform a 2D array of floats to another array of the same shape where elements to the right and below are at least as large as the given element.
For example,
In [135]: import numpy as np
     ...: A = np.array([[1, 2, 1, 1],
     ...:               [1, 1, 6, 5],
     ...:               [3, 2, 4, 2]])
     ...: print(A)
[[1 2 1 1]
 [1 1 6 5]
 [3 2 4 2]]
Because A[0,1] = 2, I want the following elements (below and to the right) to be >= 2: A[0,2], A[0,3], A[1,1].
Likewise, because A[1,2] = 6, I want the following elements (below and to the right) to be >= 6: A[1,3], A[2,2], A[2,3].
I need to do this for every element in the array. The end result is:
[[1 2 2 2]
[1 2 6 6]
[3 3 6 6]]
Here's code that works, but I'd rather use fewer loops. I'd like to use vector operations or apply the function set_val to all elements of the array A. I looked into meshgrid and vectorize, but didn't see how to pass the index of the array (i.e. row, col) to the function.
def set_val(A, cur_row, cur_col, min_val):
    for row_new in range(cur_row, A.shape[0]):
        for col_new in range(cur_col, A.shape[1]):
            if A[row_new, col_new] < min_val:
                A[row_new, col_new] = min_val

A_new = A.copy()
# Iterate over every element of A
for row, row_data in enumerate(A):
    for col, val in enumerate(row_data):
        # Set values to the right and below to be no smaller than the given value
        set_val(A_new, row, col, val)
print(A_new)
My question: Is there a more efficient (or at least more Pythonic) approach?
You can make use of two "cumulative maximum" calls:

from numpy import maximum as mx

mx.accumulate(mx.accumulate(A), axis=1)
mx.accumulate calculates the cumulative maximum. For the default axis=0, B = mx.accumulate(A) satisfies b[i,j] = max(a[k,j] for k <= i); for axis=1 the same happens row-wise: b[i,j] = max(a[i,k] for k <= j).
By applying it twice, the result R satisfies r[i,j] = max(a[k,l] for k <= i, l <= j).
Indeed, wherever the largest element of that subrectangle sits, the first mx.accumulate(..) (over axis=0) propagates its value down the column to the "target" row, and the subsequent mx.accumulate(.., axis=1) then propagates it rightwards along that row into the "target" cell.
For the given sample input, we thus obtain:
>>> A
array([[1, 2, 1, 1],
       [1, 1, 6, 5],
       [3, 2, 4, 2]])
>>> mx.accumulate(mx.accumulate(A), axis=1)
array([[1, 2, 2, 2],
       [1, 2, 6, 6],
       [3, 3, 6, 6]])
Benchmarks: if we run the above algorithm on a random 1000×1000 matrix, repeating the experiment 100 times, we get the following timing:
>>> timeit(lambda: mx.accumulate(mx.accumulate(A), axis=1), number=100)
1.5123104000231251
This thus means that it calculates one such matrix in approximately 15 milliseconds.
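For reference, the timing above can be reproduced with a setup along these lines (the exact contents of the random matrix aren't given, so the matrix here is an assumption; any 1000×1000 array will do):

import numpy as np
from numpy import maximum as mx
from timeit import timeit

A = np.random.randint(0, 100, size=(1000, 1000))  # hypothetical test matrix
print(timeit(lambda: mx.accumulate(mx.accumulate(A), axis=1), number=100))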
Related question about slicing numpy arrays.
Say I have an array:
A = np.array([1,2,3,4,5,6,7,8,9]).reshape(3,3)
[1 2 3]
[4 5 6]
[7 8 9]
and indices:
idx = [2,2,1]
and I want to slice up to the index value for each row, i.e. [:2] in the first row, [:2] in the second, [:1] in the third. I would also like to sum the slices as I go.
I know I can achieve this doing the following:
for i, a in zip(idx, A):
    print(a[:i], sum(a[:i]))
output:
[1 2] 3
[4 5] 9
[7] 7
Is there any way this could be achieved without a for loop? The main focus is the irregular slicing; the sum is just an arbitrary operation I want to perform.
Something like:
A[:,:idx]
just to give context to what I mean
You could create a matrix of column indices and build a mask by checking whether each index is within the required range.
idx = np.repeat(np.arange(0,3), 3, 0).reshape(3,3).T
row_limits = np.array([[2], [2], [1]])
mask = idx < row_limits
masked_A = np.multiply(A, mask)
# masked_A outputs:
array([[1, 2, 0],
       [4, 5, 0],
       [7, 0, 0]])
and then apply sum along axis=1
masked_A.sum(1)
# outputs: array([3, 9, 7])
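As a variation on the same idea, the index matrix can also be built directly by broadcasting a 1D arange against the per-row limits; a minimal sketch, equivalent to the mask above:

import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(3, 3)
idx = np.array([2, 2, 1])

# Compare column positions [0, 1, 2] against each row's limit
mask = np.arange(A.shape[1]) < idx[:, None]
print((A * mask).sum(axis=1))  # [3 9 7]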
I want to clean my data by reducing the number of duplicates. I do not want to delete ALL duplicates.
How can I get a numpy array with a certain number of duplicates?
Suppose, I have
x = np.array([[1,2,3],[1,2,3],[5,5,5],[1,2,3],[1,2,3]])
and I set the number of allowed duplicates to 2.
And the output should be like
x
>>[[1,2,3],[1,2,3],[5,5,5]]
or
x
>>[[5,5,5],[1,2,3],[1,2,3]]
It does not matter in my task which one I get.
Even though using list appending as an intermediate step is not always a good idea when you already have numpy arrays, in this case it is by far the cleanest way to do it:
def n_uniques(arr, max_uniques):
    uniq, cnts = np.unique(arr, axis=0, return_counts=True)
    arr_list = []
    for i in range(cnts.size):
        num = min(cnts[i], max_uniques)
        arr_list.extend([uniq[i]] * num)
    return np.array(arr_list)
x = np.array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 3],
              [5, 5, 5],
              [1, 2, 3],
              [1, 2, 3]])
reduced_arr = n_uniques(x, 2)
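For the six-row x above this yields:

print(reduced_arr)
# [[1 2 3]
#  [1 2 3]
#  [5 5 5]]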
This was kind of tricky, but you can actually do it without loops, preserving the relative order in the original array, with something like this (here the first repetitions are the ones preserved):
import numpy as np
def drop_extra_repetitions(x, max_reps):
    # Find unique rows
    uniq, idx_inv, counts = np.unique(x, axis=0, return_inverse=True, return_counts=True)
    # Compute number of repetitions of each different row, clipped
    counts_clip = np.minimum(counts, max_reps)
    # Array alternating between valid unique row indices and -1 ([0, -1, 1, -1, ...])
    idx_to_repeat = np.stack(
        [np.arange(len(uniq)), -np.ones(len(uniq), dtype=int)], axis=1).ravel()
    # Number of repetitions for each of the previous indices
    idx_repeats_clip = np.stack([counts_clip, counts - counts_clip], axis=1).ravel()
    # Valid unique row indices are repeated at most max_reps times;
    # extra repetitions are filled with -1
    idx_clip_sorted = np.repeat(idx_to_repeat, idx_repeats_clip)
    # Sorter for the inverse index - that is, sort the indices in the input array
    # according to their corresponding unique row index (stable, so the
    # first occurrences are the ones kept)
    sorter = np.argsort(idx_inv, kind='stable')
    # The final inverse index is the same as the original but with -1 on extra repetitions
    idx_inv_final = np.empty(len(sorter), dtype=int)
    idx_inv_final[sorter] = idx_clip_sorted
    # Reconstruct the array from the inverse index, skipping positions marked -1
    return uniq[idx_inv_final[idx_inv_final >= 0]]
x = [[5, 5, 5], [1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]]
max_reps = 2
print(drop_extra_repetitions(x, max_reps))
# [[5 5 5]
#  [1 2 3]
#  [1 2 3]
#  [5 5 5]]
If you do not need to preserve the order at all, then you can simply do:
import numpy as np
def drop_extra_repetitions(x, max_reps):
    uniq, counts = np.unique(x, axis=0, return_counts=True)
    # Repeat each unique row index at most max_reps times
    ret_idx = np.repeat(np.arange(len(uniq)), np.minimum(counts, max_reps))
    return uniq[ret_idx]
x = [[5, 5, 5], [1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]]
max_reps = 2
print(drop_extra_repetitions(x, max_reps))
# [[1 2 3]
#  [1 2 3]
#  [5 5 5]
#  [5 5 5]]
I have a numpy array like the following:
Xtrain = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [1, 7, 3]])
I want to shuffle the items of each row separately, but do not want the shuffle to be the same for each row (as in several examples that just shuffle the column order).
For example, I want an output like the following:
output = np.array([[3, 2, 1],
                   [4, 6, 5],
                   [7, 3, 1]])
How can I shuffle each of the rows independently in an efficient way? My actual np array is over 100000 rows and 1000 columns.
Since you want to only shuffle the columns, you can just perform the shuffling on the transpose of your matrix:
In [86]: np.random.shuffle(Xtrain.T)
In [87]: Xtrain
Out[87]:
array([[2, 3, 1],
       [5, 6, 4],
       [7, 3, 1]])
Note that np.random.shuffle() on a 2D array shuffles the rows, not the items within each row; i.e. it changes the position of the rows. Therefore, if you change the position of the rows of the transposed matrix, you are actually shuffling the columns of your original array.
If you still want a completely independent shuffle you can create random indexes for each row and then create the final array with a simple indexing:
In [172]: def crazyshuffle(arr):
     ...:     x, y = arr.shape
     ...:     rows = np.indices((x, y))[0]
     ...:     cols = [np.random.permutation(y) for _ in range(x)]
     ...:     return arr[rows, cols]
     ...:
Demo:
In [173]: crazyshuffle(Xtrain)
Out[173]:
array([[1, 3, 2],
       [6, 5, 4],
       [7, 3, 1]])
In [174]: crazyshuffle(Xtrain)
Out[174]:
array([[2, 3, 1],
       [4, 6, 5],
       [1, 3, 7]])
From: https://github.com/numpy/numpy/issues/5173
def disarrange(a, axis=-1):
    """
    Shuffle `a` in-place along the given axis.

    Apply numpy.random.shuffle to the given axis of `a`.
    Each one-dimensional slice is shuffled independently.
    """
    b = a.swapaxes(axis, -1)
    # Shuffle `b` in-place along the last axis.  `b` is a view of `a`,
    # so `a` is shuffled in place, too.
    shp = b.shape[:-1]
    for ndx in np.ndindex(shp):
        np.random.shuffle(b[ndx])
    return
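A quick usage sketch (the printed result below is just one possible outcome, since the shuffle is random):

import numpy as np

a = np.arange(12).reshape(3, 4)
disarrange(a, axis=1)  # shuffles each row of `a` in place
print(a)  # e.g. [[ 2  0  3  1] [ 5  7  4  6] [11  8  9 10]]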
This solution is not efficient by any means, but I had fun thinking about it, so I wrote it down. Basically, you ravel the array and create an array of row labels and an array of indices. You shuffle the index array and index the original and row-label arrays with it. Then you apply a stable argsort to the row labels to gather the data back into rows. Apply that index, reshape, and voilà: data shuffled independently by rows:
import numpy as np
r, c = 3, 4 # x.shape
x = np.arange(12) + 1 # Already raveled
inds = np.arange(x.size)
rows = np.repeat(np.arange(r).reshape(-1, 1), c, axis=1).ravel()
np.random.shuffle(inds)
x = x[inds]
rows = rows[inds]
inds = np.argsort(rows, kind='mergesort')
x = x[inds].reshape(r, c)
We can create a random 2-dimensional matrix, sort it by each row, and then use the index matrix given by argsort to reorder the target matrix.
target = np.random.randint(10, size=(5, 5))
# [[7 4 0 2 5]
#  [5 6 4 8 7]
#  [6 4 7 9 5]
#  [8 6 6 2 8]
#  [8 1 6 7 3]]

shuffle_helper = np.argsort(np.random.rand(5, 5), axis=1)
# [[0 4 3 2 1]
#  [4 2 1 3 0]
#  [1 2 3 4 0]
#  [1 2 4 3 0]
#  [1 2 3 0 4]]

target[np.arange(shuffle_helper.shape[0])[:, None], shuffle_helper]
# array([[7, 5, 2, 0, 4],
#        [7, 4, 6, 8, 5],
#        [4, 7, 9, 5, 6],
#        [6, 6, 8, 2, 8],
#        [1, 6, 7, 8, 3]])
Explanation
We use np.random.rand and argsort to mimic the effect of shuffling.
random.rand gives randomness.
Then, we use argsort with axis=1 to help rank each row. This creates the index that can be used for reordering.
Let's say you have an array a with shape 100000 x 1000.
b = np.random.choice(100000 * 1000, (100000, 1000), replace=False)
ind = np.argsort(b, axis=1)
a_shuffled = a[np.arange(100000)[:,np.newaxis], ind]
I don't know if this is faster than a loop, because it needs sorting, but with this solution maybe you will invent something better, for example using np.argpartition instead of np.argsort.
You may use Pandas:
df = pd.DataFrame(X_train)
_ = df.apply(lambda x: np.random.permutation(x), axis=1, raw=True)
df.values
Change the keyword to axis=0 if you want to shuffle columns.
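As an aside: on NumPy 1.20 or newer, the Generator API has a permuted method that performs exactly this independent per-row shuffle in one call; a minimal sketch:

import numpy as np

rng = np.random.default_rng()
Xtrain = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [1, 7, 3]])
print(rng.permuted(Xtrain, axis=1))  # each row shuffled independently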
In a given array I want to replace the values by the index of this value in another array (which doesn't contain duplicates). Here is a simple example of what I'm trying to do:
import numpy as np
from copy import deepcopy
a = np.array([[0, 1, 2], [2, 1, 3], [0, 1, 3]])
chg = np.array([3, 0, 2, 1])
b = deepcopy(a)
for new, old in enumerate(chg):
    b[a == old] = new
print b
# [[1 3 2] [2 3 0] [1 3 0]]
But I need to do that on large arrays so having an explicit loop is not acceptable in terms of execution time.
I can't figure out how to do that using fancy numpy functions...
take is your friend.
a = np.array([[0, 1, 2], [2, 1, 3], [0, 1, 3]])
chg = np.array([3, 0, 2, 1])
inverse_chg = chg.argsort()  # inverse permutation: inverse_chg[old] = new
print(inverse_chg.take(a))
gives:
[[1 3 2]
 [2 3 0]
 [1 3 0]]
or more directly with fancy indexing: inverse_chg[a], but inverse_chg.take(a) is about three times faster. (Note that chg.take(chg) happens to equal the inverse for this particular chg, but argsort builds the inverse of any permutation.)
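If chg is long, you can also build the inverse permutation without sorting, by scattering positions; a small sketch:

import numpy as np

chg = np.array([3, 0, 2, 1])
inverse_chg = np.empty_like(chg)
inverse_chg[chg] = np.arange(chg.size)  # inverse_chg[old] = new, in O(n)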
You can convert chg to a 3D array by adding two new axes at the end, then perform the matching comparison with a, which brings in NumPy's broadcasting to give a 3D mask. Next, take the argmax of the mask along the first axis to simulate "b[a == old] = new". Finally, replace the positions that had no match along that axis with the corresponding values from a. The implementation looks like this:
mask = a == chg[:,None,None]
out = mask.argmax(0)
invalid_pos = ~mask.max(0)
out[invalid_pos] = a[invalid_pos]
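Put together with the example arrays from the question, a self-contained version looks like this:

import numpy as np

a = np.array([[0, 1, 2], [2, 1, 3], [0, 1, 3]])
chg = np.array([3, 0, 2, 1])

mask = a == chg[:, None, None]       # shape (len(chg), 3, 3) comparison cube
out = mask.argmax(0)                 # index of first match = new value
invalid_pos = ~mask.max(0)           # positions whose value is not in chg
out[invalid_pos] = a[invalid_pos]
print(out)  # [[1 3 2] [2 3 0] [1 3 0]]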
This type of replacement operation can be tricky to do in full generality with NumPy, although you could use searchsorted:
>>> s = np.argsort(chg)
>>> s[np.searchsorted(chg, a.ravel(), sorter=s).reshape(a.shape)]
array([[1, 3, 2],
       [2, 3, 0],
       [1, 3, 0]])
(Note: searchsorted doesn't just replace exact matches, so be careful if you have values in a that aren't in chg...)
pandas has a variety of tools which can make these operations on NumPy arrays much easier and potentially a lot quicker / more memory efficient for larger arrays. For this specific problem, pd.match could be used:
>>> pd.match(a.ravel(), chg).reshape(a.shape)
array([[1, 3, 2],
       [2, 3, 0],
       [1, 3, 0]])
This function also allows you to specify what value should be filled if a value is missing from chg.
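Note that pd.match was removed in later pandas releases; on modern pandas, Index.get_indexer gives the same positions (this translation is my assumption, so verify it on your version), returning -1 for values missing from chg:

import numpy as np
import pandas as pd

a = np.array([[0, 1, 2], [2, 1, 3], [0, 1, 3]])
chg = np.array([3, 0, 2, 1])

print(pd.Index(chg).get_indexer(a.ravel()).reshape(a.shape))
# [[1 3 2]
#  [2 3 0]
#  [1 3 0]]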
If you're working with numpy arrays, check this out:
a = np.array([3,4,1,2,0])
b = np.array([[0,0],[0,1],[0,2],[0,3],[0,4]])
c = b[a]
print(c)
It gives me back:
[[0 3]
 [0 4]
 [0 1]
 [0 2]
 [0 0]]
I want to select a submatrix of a numpy matrix based on whether the diagonal is less than some cutoff value. For example, given the matrix:
Test = array([[1,2,3,4,5],
              [2,3,4,5,6],
              [3,4,5,6,7],
              [4,5,6,7,8],
              [5,6,7,8,9]])
I want to select the rows and columns where the diagonal value is less than, say, 6. In this example, the diagonal values are sorted, so that I could just take Test[:3,:3], but in the general problem I want to solve this isn't the case.
The following snippet works:
def MatrixCut(M, Ecut):
    # Assumes `from numpy import *`, so diag/zeros/sum are NumPy's
    D = diag(M)
    indices = D < Ecut
    n = sum(indices)
    NewM = zeros((n, n), 'd')
    ii = -1
    for i, ibool in enumerate(indices):
        if ibool:
            ii += 1
            jj = -1
            for j, jbool in enumerate(indices):
                if jbool:
                    jj += 1
                    NewM[ii, jj] = M[i, j]
    return NewM

print MatrixCut(Test, 6)
[[ 1.  2.  3.]
 [ 2.  3.  4.]
 [ 3.  4.  5.]]
However, this is fugly code, with all kinds of dangerous things like initializing the ii/jj indices to -1, which won't cause an error if somehow I get into the loop and take M[-1,-1].
Plus, there must be a numpythonic way of doing this. For a one-dimensional array, you could do:
D = diag(A)
A[D<Ecut]
But the analogous thing for a 2d array doesn't work:
D = diag(Test)
Test[D<6,D<6]
array([1, 3, 5])
Is there a good way to do this? Thanks in advance.
This also works when the diagonals are not sorted:
In [7]: Test = array([[1,2,3,4,5],
   ...:               [2,3,4,5,6],
   ...:               [3,4,5,6,7],
   ...:               [4,5,6,7,8],
   ...:               [5,6,7,8,9]])
In [8]: d = np.argwhere(np.diag(Test) < 6).squeeze()
In [9]: Test[d][:,d]
Out[9]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])
Alternately, to use a single subscript call, you could do:
In [10]: d = np.argwhere(np.diag(Test) < 6)
In [11]: Test[d, d.flat]
Out[11]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])
[UPDATE]: Explanation of the second form.
At first, it may be tempting to just try Test[d, d] but that will only extract elements from the diagonal of the array:
In [75]: Test[d, d]
Out[75]:
array([[1],
       [3],
       [5]])
The problem is that d has shape (3, 1) so if we use d in both subscripts, the output array will have the same shape as d. The d.flat is equivalent to using d.flatten() or d.ravel() (except flat just returns an iterator instead of an array). The effect is that the result has shape (3,):
In [76]: d
Out[76]:
array([[0],
       [1],
       [2]])
In [77]: d.flatten()
Out[77]: array([0, 1, 2])
In [79]: print d.shape, d.flatten().shape
(3, 1) (3,)
The reason Test[d, d.flat] works is because numpy's general broadcasting rules cause the last dimension of d (which is 1) to be broadcast to the last (and only) dimension of d.flat (which is 3). Similarly, d.flat is broadcast to match the first dimension of d. The result is two (3,3) index arrays, which are equivalent to the following arrays i and j:
In [80]: dd = d.flatten()
In [81]: i = np.hstack((d, d, d))
In [82]: j = np.vstack((dd, dd, dd))
In [83]: print i
[[0 0 0]
 [1 1 1]
 [2 2 2]]
In [84]: print j
[[0 1 2]
 [0 1 2]
 [0 1 2]]
And just to make sure they work:
In [85]: Test[i, j]
Out[85]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])
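For what it's worth, NumPy also provides np.ix_, which constructs exactly this kind of broadcastable index pair from 1-D (or boolean) inputs, so the whole selection can be written as:

keep = np.diag(Test) < 6
print(Test[np.ix_(keep, keep)])
# [[1 2 3]
#  [2 3 4]
#  [3 4 5]]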
The only way I found to solve your task is somewhat tricky:
>>> Test[[[i] for i,x in enumerate(D<6) if x], D<6]
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])
possibly not the best one. Based on this answer.
Or (thanks to @bogatron for reminding me of argwhere):
>>> Test[np.argwhere(D<6), D<6]
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])