I have a numpy array and a mask specifying which entries from that array to shuffle while keeping their relative order. Let's have an example:
In [2]: arr = np.array([5, 3, 9, 0, 4, 1])
In [4]: mask = np.array([True, False, False, False, True, True])
In [5]: arr[mask]
Out[5]: array([5, 4, 1]) # These entries shall be shuffled inside arr, while keeping their order.
In [6]: np.where(mask==True)
Out[6]: (array([0, 4, 5]),)
In [7]: shuffle_array(arr, mask) # I'm looking for an efficient realization of this function!
Out[7]: array([3, 5, 4, 9, 0, 1]) # See how the entries 5, 4 and 1 haven't changed their order.
I've written some code that can do this, but it's really slow.
import numpy as np
def shuffle_array(arr, mask):
    perm = np.arange(len(arr)) # permutation array
    n = mask.sum()
    if n > 0:
        old_true_pos = np.where(mask == True)[0] # old positions for which mask is True
        old_false_pos = np.where(mask == False)[0] # old positions for which mask is False
        new_true_pos = np.random.choice(perm, n, replace=False) # draw new positions
        new_true_pos.sort()
        new_false_pos = np.setdiff1d(perm, new_true_pos)
        new_pos = np.hstack((new_true_pos, new_false_pos))
        old_pos = np.hstack((old_true_pos, old_false_pos))
        perm[new_pos] = perm[old_pos]
    return arr[perm]
To make things worse, I actually have two large matrices A and B with shape (M,N). Matrix A holds arbitrary values, while each row of matrix B is the mask to use for shuffling the corresponding row of matrix A according to the procedure outlined above. So what I want is shuffled_matrix = row_wise_shuffle(A, B).
The only way I have so far found to do it is via my shuffle_array() function and a for loop.
Can you think of any numpy'onic way to accomplish this task avoiding loops? Thank you so much in advance!
For 1d case:
import numpy as np
a = np.arange(8)
b = np.array([1,1,1,1,0,0,0,0])
# Get ordered values
ordered_values = a[np.where(b==1)]
# We'll shuffle both arrays
shuffled_ix = np.random.permutation(a.shape[0])
a_shuffled = a[shuffled_ix]
b_shuffled = b[shuffled_ix]
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = ordered_values
a_shuffled # Notice that 0, 1, 2, 3 keep their order.
>>>
array([0, 1, 2, 6, 3, 4, 7, 5])
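The 1d recipe above can be wrapped into a function with the signature the question asks for. This is only a sketch: the name shuffle_keep_order and the rng parameter are my own additions, and, as in the answer above, the unmasked entries are shuffled freely rather than kept in order.

```python
import numpy as np

def shuffle_keep_order(arr, mask, rng=None):
    """Shuffle arr while the masked entries keep their relative order."""
    # rng is a hypothetical convenience parameter for reproducible runs.
    rng = np.random.default_rng() if rng is None else rng
    arr = np.asarray(arr)
    mask = np.asarray(mask, dtype=bool)
    ordered_values = arr[mask]             # masked values in original order
    perm = rng.permutation(arr.shape[0])   # one permutation for both arrays
    out = arr[perm]                        # fancy indexing returns a copy
    out[mask[perm]] = ordered_values       # rewrite the masked slots in order
    return out

arr = np.array([5, 3, 9, 0, 4, 1])
mask = np.array([True, False, False, False, True, True])
result = shuffle_keep_order(arr, mask, rng=np.random.default_rng(0))
print(result)  # 5, 4 and 1 appear in that order, at random positions
```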
For the 2d case with a columnwise shuffle (along axis=1):
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# Get ordered values
i,j = np.where(b==1)
values = a[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*a.shape).argsort(axis=1)
a_shuffled = np.take_along_axis(a,idx,axis=1)
b_shuffled = np.take_along_axis(b,idx,axis=1)
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = values
# Get the result
a_shuffled # see that 4,5 | 6,7,8 | 12,13,14,15 | 20,21 keep their order
>>>
array([[ 4, 1, 0, 3, 2, 5],
[ 9, 6, 7, 11, 8, 10],
[12, 13, 16, 17, 14, 15],
[23, 20, 19, 22, 21, 18]])
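Since the question asks for row_wise_shuffle(A, B), where each row of A is shuffled under the mask in the matching row of B, the axis=1 code above can be packaged as follows. Treat it as a sketch: the rng parameter is an assumption of mine, and np.take_along_axis needs NumPy 1.15 or newer.

```python
import numpy as np

def row_wise_shuffle(A, B, rng=None):
    """Shuffle each row of A; entries where B is True keep their order."""
    # rng is a hypothetical convenience parameter for reproducible runs.
    rng = np.random.default_rng() if rng is None else rng
    A = np.asarray(A)
    B = np.asarray(B, dtype=bool)
    values = A[B]                                    # masked values, row-major order
    idx = rng.random(A.shape).argsort(axis=1)        # independent permutation per row
    A_shuffled = np.take_along_axis(A, idx, axis=1)  # new array, A is untouched
    B_shuffled = np.take_along_axis(B, idx, axis=1)
    # Each row keeps its number of True entries, so the row-major order
    # of the masked slots lines up with the order of `values`.
    A_shuffled[B_shuffled] = values
    return A_shuffled

a = np.arange(24).reshape(4, 6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
out = row_wise_shuffle(a, b, rng=np.random.default_rng(1))
print(out)
```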
For the 2d case with a rowwise shuffle (along axis=0), we can use the same code: transpose the arrays first, shuffle, then transpose back:
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# As you said rowwise, we first transpose
at = a.T
bt = b.T
# Get ordered values
i,j = np.where(bt==1)
values = at[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*at.shape).argsort(axis=1)
at_shuffled = np.take_along_axis(at,idx,axis=1)
bt_shuffled = np.take_along_axis(bt,idx,axis=1)
# Replace the values with correct order
at_shuffled[np.where(bt_shuffled==1)] = values
# Get the result
a_shuffled = at_shuffled.T
a_shuffled # see that 6,12 | 7,13 | 8,14,20 | 15,21 keep their order
>>>
array([[ 6, 7, 2, 3, 10, 17],
[18, 19, 8, 15, 16, 23],
[12, 13, 14, 21, 4, 5],
[ 0, 1, 20, 9, 22, 11]])
Say I have a series like
mySeries = pd.Series(range(1, 100, 1))
myArray = np.array([[3, 10],[6, 9]])
How to use values in myArray as indices to select mySeries?
I would like the resulting array to be np.array([[4,11],[7, 10]]).
For example, the (1,1) element in myArray is 3, so I would like the (1,1) element in my resulting array to be the element at index 3 of mySeries, which is 4.
Here is my solution, which first flattens the 2-dim array to 1-dim and then restores the original shape.
import pandas as pd
import numpy as np
mySeries = pd.Series(range(1, 100, 1))
myArray = np.array([[3, 10],[6, 9]])
flatArray = np.asarray(mySeries[myArray.ravel()])
resultArray = flatArray.reshape(myArray.shape)
# Output results
print(resultArray)
Which outputs:
[[ 4 11]
[ 7 10]]
resultArray = np.empty(shape=len(myArray), dtype=np.ndarray)
for i in range(len(myArray)):
    row = np.empty(shape=len(myArray[i]))
    for k in range(len(myArray[i])):
        v = mySeries[myArray[i,k]]
        row[k] = v
    resultArray[i] = row
Here's an alternative approach that I think is slightly cleaner:
>>> newArray = mySeries[myArray.flatten()].values
>>> newArray.shape = myArray.shape
>>> newArray
array([[ 4, 11],
[ 7, 10]], dtype=int64)
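Another option, sketched here on the assumption that the values in myArray are meant as positions (which matches the labels for the default integer index used above), is to index the underlying NumPy array directly. NumPy integer-array indexing returns a result with the shape of the index array, so no reshape is needed.

```python
import numpy as np
import pandas as pd

mySeries = pd.Series(range(1, 100, 1))
myArray = np.array([[3, 10], [6, 9]])

# Positional lookup on the Series' values; the result has myArray's shape.
resultArray = mySeries.to_numpy()[myArray]
print(resultArray)
```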
I'm trying to plot a 2-dimensional function (specifically, a 2-d Laplace solution). I defined my function and it returns the right value when I put in specific numbers, but when I try running through an array of values (x,y below), it still returns only one number. I tried with a random function of x and y (e.g., f(x,y) = x^2 + y^2) and it gives me an array of values.
def V_func(x,y):
    a = 5
    b = 4
    Vo = 4
    n = np.arange(1,100,2)
    sum_list = []
    for indx in range(len(n)):
        sum_term = (1/n[indx])*(np.cosh(n[indx]*np.pi*x/a))/(np.cosh(n[indx]*np.pi*b/a))*np.sin(n[indx]*np.pi*y/a)
        sum_list = np.append(sum_list,sum_term)
    summation = np.sum(sum_list)
    V = 4*Vo/np.pi * summation
    return V
x = np.linspace(-4,4,50)
y = np.linspace(0,5,50)
V_func(x,y)
Out: 53.633709914177224
Try this:
def V_func(x,y):
    a = 5
    b = 4
    Vo = 4
    n = np.arange(1,100,2)
    # sum_list = []
    sum_list = np.zeros(50) # 50 matches the length of x and y used in the question
    for indx in range(len(n)):
        sum_term = (1/n[indx])*(np.cosh(n[indx]*np.pi*x/a))/(np.cosh(n[indx]*np.pi*b/a))*np.sin(n[indx]*np.pi*y/a)
        # sum_list = np.append(sum_list,sum_term)
        sum_list += sum_term # accumulate elementwise instead of appending
    # summation = np.sum(sum_list)
    # V = 4*Vo/np.pi * summation
    V = 4*Vo/np.pi * sum_list
    return V
Define a pair of arrays:
In [6]: x = np.arange(3); y = np.arange(10,13)
In [7]: x,y
Out[7]: (array([0, 1, 2]), array([10, 11, 12]))
Try a simple function of the two:
In [8]: x + y
Out[8]: array([10, 12, 14])
Since they have the same size, they can be summed (or otherwise combined) elementwise. The result has the same shape as the 2 inputs.
Now try 'broadcasting'. x[:,None] has shape (3,1)
In [9]: x[:,None] + y
Out[9]:
array([[10, 11, 12],
[11, 12, 13],
[12, 13, 14]])
The result is (3,3), the first 3 from the reshaped x, the second from y.
I can generate the pair of arrays with meshgrid:
In [10]: I,J = np.meshgrid(x,y,sparse=True, indexing='ij')
In [11]: I
Out[11]:
array([[0],
[1],
[2]])
In [12]: J
Out[12]: array([[10, 11, 12]])
In [13]: I + J
Out[13]:
array([[10, 11, 12],
[11, 12, 13],
[12, 13, 14]])
Note the added parameters in meshgrid. So that's how we go about generating 2d values from a pair of 1d arrays.
Now look at what sum does. As you use it in the function:
In [14]: np.sum(I + J)
Out[14]: 108
the result is a scalar. See the docs. If I specify an axis I get an array.
In [15]: np.sum(I + J, axis=0)
Out[15]: array([33, 36, 39])
If you gave V_func the right x and y, sum_list could be a 3d array. That axis-less sum reduces it to a scalar.
In code like this you need to keep track of array shapes. Include test prints if needed; don't just assume anything; test it. Pay attention to how dimensions grow and shrink as they pass through various operations.
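To make the shape bookkeeping concrete, here is one way V_func could be evaluated on a full 2d grid with broadcasting. It is a sketch rather than the only way to do it: the name V_grid, the keyword defaults and the n_terms parameter are mine. The series index n gets its own trailing axis, and the sum is taken over that axis only.

```python
import numpy as np

def V_grid(x, y, a=5, b=4, Vo=4, n_terms=50):
    """Evaluate the series on the grid x[i], y[j]; returns shape (len(x), len(y))."""
    n = np.arange(1, 2 * n_terms, 2)   # odd n; 1, 3, ..., 99 when n_terms=50
    X = x[:, None, None]               # shape (len(x), 1, 1)
    Y = y[None, :, None]               # shape (1, len(y), 1)
    # Broadcasting gives terms the shape (len(x), len(y), len(n)).
    terms = (np.cosh(n * np.pi * X / a) / np.cosh(n * np.pi * b / a)
             * np.sin(n * np.pi * Y / a) / n)
    return 4 * Vo / np.pi * terms.sum(axis=-1)   # sum over n only

x = np.linspace(-4, 4, 50)
y = np.linspace(0, 5, 50)
V = V_grid(x, y)
print(V.shape)
```

The resulting 2d array can be passed to a contour or surface plot, transposed if the plotting call expects y along the first axis.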
x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
I want to grab the first 2 rows of array x from every block of 5. The result should be:
x[fancy_indexing] = [1,2, 6,7, 11,12]
It's easy enough to build up an index like that using a for loop.
Is there a one-liner slicing trick that will pull it off? Points for simplicity here.
Approach #1 Here's a vectorized one-liner using boolean-indexing -
x[np.mod(np.arange(x.size),M)<N]
Approach #2 If you are going for performance, here's another vectorized approach using NumPy strides -
n = x.strides[0]
shp = (x.size//M,N)
out = np.lib.stride_tricks.as_strided(x, shape=shp, strides=(M*n,n)).ravel()
Sample run -
In [61]: # Inputs
...: x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
...: N = 2
...: M = 5
...:
In [62]: # Approach 1
...: x[np.mod(np.arange(x.size),M)<N]
Out[62]: array([ 1, 2, 6, 7, 11, 12])
In [63]: # Approach 2
...: n = x.strides[0]
...: shp = (x.size//M,N)
...: out=np.lib.stride_tricks.as_strided(x,shape=shp,strides=(M*n,n)).ravel()
...:
In [64]: out
Out[64]: array([ 1, 2, 6, 7, 11, 12])
I first thought you need this to work for 2d arrays due to your phrasing of "first N rows of every block of M rows", so I'll leave my solution as this.
You could work some magic by reshaping your array into 3d:
M = 5 # size of blocks
N = 2 # number of columns to keep from each block
x = np.arange(3*4*M).reshape(4,-1) # (4,3*M)-shaped dummy input
x = x.reshape(x.shape[0],-1,M)[:,:,:N].reshape(x.shape[0],-1) # (4,3*N)-shaped output
This will extract every column according to your preference. In order to use it for your 1d case you'd need to make your 1d array into a 2d one using x = x[None,:].
Reshape the array to multiple rows of five columns then take (slice) the first two columns of each row.
>>> x
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> x.reshape(x.shape[0] // 5, 5)[:,:2]
array([[ 1, 2],
[ 6, 7],
[11, 12]])
Or
>>> x.reshape(x.shape[0] // 5, 5)[:,:2].flatten()
array([ 1, 2, 6, 7, 11, 12])
>>>
It only works with 1-d arrays that have a length that is a multiple of five.
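If the length is not a multiple of the block size, the reshape idea can still be used on the whole blocks, with the leftover partial block handled separately. The helper below is a sketch; its name and parameters are hypothetical.

```python
import numpy as np

def first_n_of_each_block(x, n, m):
    """Return the first n elements of every block of m, partial last block included."""
    x = np.asarray(x)
    full = (x.size // m) * m                        # length covered by whole blocks
    head = x[:full].reshape(-1, m)[:, :n].ravel()   # first n of each whole block
    tail = x[full:][:n]                             # first n of the trailing partial block
    return np.concatenate((head, tail))

x = np.arange(1, 18)   # 17 elements: three whole blocks of 5, then [16, 17]
y = first_n_of_each_block(x, 2, 5)
print(y)
```

This gives the same elements as the boolean-mask one-liner from Approach #1.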
import numpy as np
x = np.array(range(1, 16))
y = np.vstack([x[0::5], x[1::5]]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12])
Taking the first N rows of every block of M rows in the array [1, 2, ..., K]:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
y = np.vstack([x[i::M] for i in range(N)]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])
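Notice that .T is cheap: it returns a view and only changes the shape and strides of the array. .ravel() also returns a view when it can, but the transposed array here is not C-contiguous, so this particular .ravel() has to copy the data once.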
If you insist on getting your slice using fancy indexing:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
fancy_indexing = [i*M+n for i in range(len(x)//M) for n in range(N)]
x[fancy_indexing]
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])
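The same index list can also be built without a Python loop by broadcasting the block start offsets against the within-block offsets. A minimal sketch, using the K, M and N values from above (the names starts and idx are mine):

```python
import numpy as np

K, M, N = 30, 5, 2
x = np.arange(1, K + 1)

starts = np.arange(0, (K // M) * M, M)          # start index of each whole block
idx = (starts[:, None] + np.arange(N)).ravel()  # (blocks, N) offsets, flattened
print(x[idx])
```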
How can I convert the duplicate elements in an array 'data' into 0? It has to be done row-wise.
data = np.array([[1,8,3,3,4],
[1,8,9,9,4]])
The answer should be as follows:
ans = array([[1,8,3,0,4],
[1,8,9,0,4]])
Approach #1
One approach with np.unique -
# Find out the unique elements and their starting positions
unq_data, idx = np.unique(data,return_index=True)
# Find out the positions for each unique element, their duplicate positions
dup_idx = np.setdiff1d(np.arange(data.size),idx)
# Set those duplicate positioned elements to 0s
data[dup_idx] = 0
Sample run -
In [46]: data
Out[46]: array([1, 8, 3, 3, 4, 1, 3, 3, 9, 4])
In [47]: unq_data, idx = np.unique(data,return_index=True)
...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
...: data[dup_idx] = 0
...:
In [48]: data
Out[48]: array([1, 8, 3, 0, 4, 0, 0, 0, 9, 0])
Approach #2
You can also use sorting and differentiation as a faster approach -
# Get indices for sorted data
sort_idx = np.argsort(data)
# Get duplicate indices and set those in data to 0s
dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
data[dup_idx] = 0
Runtime tests -
In [110]: data = np.random.randint(0,100,(10000))
...: data1 = data.copy()
...: data2 = data.copy()
...:
In [111]: def func1(data):
...:     unq_data, idx = np.unique(data,return_index=True)
...:     dup_idx = np.setdiff1d(np.arange(data.size),idx)
...:     data[dup_idx] = 0
...:
...: def func2(data):
...:     sort_idx = np.argsort(data)
...:     dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
...:     data[dup_idx] = 0
...:
In [112]: %timeit func1(data1)
1000 loops, best of 3: 1.36 ms per loop
In [113]: %timeit func2(data2)
1000 loops, best of 3: 467 µs per loop
Extending to a 2D case :
Approach #2 could be extended to work for a 2D array case, avoiding any loop like so -
# Get indices for sorted data
sort_idx = np.argsort(data,axis=1)
# Get sorted linear indices
row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
sort_lin_idx = sort_idx[:,1::] + row_offset
# Get duplicate linear indices and set those in data as 0s
dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
data.ravel()[dup_lin_idx] = 0
Sample run -
In [6]: data
Out[6]:
array([[1, 8, 3, 3, 4, 0, 3, 3],
[1, 8, 9, 9, 4, 8, 7, 9],
[1, 8, 9, 9, 4, 8, 7, 3]])
In [7]: sort_idx = np.argsort(data,axis=1)
...: row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
...: sort_lin_idx = sort_idx[:,1::] + row_offset
...: dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
...: data.ravel()[dup_lin_idx] = 0
...:
In [8]: data
Out[8]:
array([[1, 8, 3, 0, 4, 0, 0, 0],
[1, 8, 9, 0, 4, 0, 7, 0],
[1, 8, 9, 0, 4, 0, 7, 3]])
Here's a simple pure-Python way to do it:
seen = set()
for i, x in enumerate(data):
    if x in seen:
        data[i] = 0
    else:
        seen.add(x)
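The loop above handles a 1d sequence. Since the question asks for row-wise behaviour on a 2d array, a straightforward extension is to restart the seen set for every row. This is a sketch, and the function name zero_dups_rows is hypothetical.

```python
import numpy as np

def zero_dups_rows(data):
    """Zero every repeated value within each row, keeping the first occurrence."""
    out = np.array(data)        # work on a copy so the input is left unchanged
    for row in out:             # each row is a view into out
        seen = set()
        for i, x in enumerate(row):
            if x in seen:
                row[i] = 0      # writes through the view into out
            else:
                seen.add(x)
    return out

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])
ans = zero_dups_rows(data)
print(ans)
```

It is much slower than the vectorized approaches on large arrays, but it is handy as a reference for checking them.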
You could use a nested for loop that compares each element of a 1-d array with every later element to check for duplicates. Syntax might be a bit off as I am not really familiar with numpy.
for x in range(len(data)):
    for y in range(x + 1, len(data)):
        if data[x] == data[y]:
            data[y] = 0  # zero the later duplicate so the first occurrence is kept
@Divakar has it almost right, but a few things can be optimized further, and they don't really fit in a comment. To begin:
rows, cols = data.shape
The first operation is to sort the array to identify the duplicates. Since we will want to undo the sorting, we need to use np.argsort, but if you want to make sure that it is the first occurrence of each repeated value that is kept, you need to use a stable sorting algorithm:
sort_idx = data.argsort(axis=1, kind='mergesort')
Once we have the indices to sort data, to get a sorted copy of the array it is faster to use the indices than to re-sort the array:
sorted_data = data[np.arange(rows)[:, None], sort_idx]
While the principle is similar to that in using np.diff, it is typically faster to use boolean operations. We want an array full of False where the first occurrences of each value happen, and True where the duplicates are:
sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                              sorted_data[:, :-1] == sorted_data[:, 1:]),
                             axis=1)
We now use that mask to set all the duplicates to zero:
sorted_data[sorted_mask] = 0
And we finally undo the sorting. To revert a permutation you can sort the indices that define it, i.e. you could do:
invert_idx = sort_idx.argsort(axis=1, kind='mergesort')
ans = sorted_data[np.arange(rows)[:, None], invert_idx]
But it is more efficient to use assignment, i.e.:
ans = np.empty_like(data)
ans[np.arange(rows)[:, None], sort_idx] = sorted_data
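A quick way to convince yourself that both ways of undoing the sort are equivalent is to run them on random data and compare with the original. The snippet below is only an illustrative check; the variable names restored_gather and restored_scatter are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(3, 6))
row_idx = np.arange(data.shape[0])[:, None]   # column vector of row numbers

sort_idx = data.argsort(axis=1, kind='mergesort')
sorted_data = data[row_idx, sort_idx]

# Method 1: argsort the permutation to get its inverse, then gather.
invert_idx = sort_idx.argsort(axis=1, kind='mergesort')
restored_gather = sorted_data[row_idx, invert_idx]

# Method 2: scatter the sorted values back to where they came from.
restored_scatter = np.empty_like(data)
restored_scatter[row_idx, sort_idx] = sorted_data

assert np.array_equal(restored_gather, data)
assert np.array_equal(restored_scatter, data)
```

The scatter version skips the second argsort, which is where the saving comes from.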
Putting it all together:
def zero_dups(data):
    rows, cols = data.shape
    sort_idx = data.argsort(axis=1, kind='mergesort')
    sorted_data = data[np.arange(rows)[:, None], sort_idx]
    sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                                  sorted_data[:, :-1] == sorted_data[:, 1:]),
                                 axis=1)
    sorted_data[sorted_mask] = 0
    ans = np.empty_like(data)
    ans[np.arange(rows)[:, None], sort_idx] = sorted_data
    return ans