replace missing values in a 3d array - python

I have a 3d array containing missing values:
arr = np.array([[[ 1, 13], [ 2, 14], [ 3, np.nan]],
                [[ 4, 16], [ 5, 17], [ 6, 18]],
                [[np.nan, 19], [ 8, 20], [ 9, 21]],
                [[10, 22], [11, 23], [12, np.nan]]])
I would like to perform imputation to replace those missing values, preferably using nearest neighbors. I tried looking into the sklearn.impute module, but none of its functions accept a 3d array. I know I could flatten the array, but that would result in a loss of spatial information. Are there any alternatives?
EDIT:
The array has a 3d spatial configuration and in the real world might look like this:
layer 2
13 14 nan
16 17 18
19 20 21
22 23 nan
layer 1
1 2 3
4 5 6
nan 8 9
10 11 12
for example, value 1 is a neighbor of 2 and 4 in layer 1.
By flattening arr,
[[ 1, 13],
[ 2, 14],
[ 3, np.nan],
[ 4, 16],
[ 5, 17],
[ 6, 18],
[ np.nan, 19],
[ 8, 20],
[ 9, 21],
[10, 22],
[11, 23],
[12, np.nan]]
it looks as though 4 is farther away from 1 than 2 is, but it isn't. 1 is just as close to 2 as it is to 4, just in different dimensions.
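One possible direction (not from the original question, just a sketch): scipy's NearestNDInterpolator can be fitted on the (i, j, k) index coordinates of the known entries and then evaluated at the coordinates of the missing ones, which keeps the 3d neighborhood structure. This assumes the grid spacing is the same along all three axes.
import numpy as np
from scipy.interpolate import NearestNDInterpolator

arr = np.array([[[1, 13], [2, 14], [3, np.nan]],
                [[4, 16], [5, 17], [6, 18]],
                [[np.nan, 19], [8, 20], [9, 21]],
                [[10, 22], [11, 23], [12, np.nan]]])

# boolean mask of known (non-NaN) entries
known = ~np.isnan(arr)

# fit a nearest-neighbour interpolator on the (i, j, k) coordinates
# of the known values, then evaluate it at the missing coordinates
interp = NearestNDInterpolator(np.argwhere(known), arr[known])
arr[~known] = interp(np.argwhere(~known))
print(arr)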

Related

Is there a method of vectorizing the printing of elements in a Numpy array?

I have a numpy array named "a":
a = numpy.array([
[[1, 2, 3], [11, 22, 33]],
[[4, 5, 6], [44, 55, 66]],
])
I want to print the following (in this exact format):
1 2 3
11 22 33
4 5 6
44 55 66
To accomplish this, I wrote the following:
for i in range(len(a)):
    block = a[i]
    for j in range(len(block)):
        a1 = block[j][0]
        a2 = block[j][1]
        a3 = block[j][2]
        print(a1, a2, a3)
The output is:
1 2 3
11 22 33
4 5 6
44 55 66
I would like to vectorize my solution (if possible) and discard the for loop. I understand that this problem might not benefit from vectorization. In reality (for work-related purposes), the array "a" has 52 elements, and each element contains hundreds of arrays. I'd like to solve a basic/trivial case and then move on to a more advanced, realistic case.
Also, I know that Numpy arrays were not meant to be iterated through.
I could have used Python lists to accomplish this, but I really want to vectorize it (if possible, of course).
You could use np.apply_along_axis, which maps a function over the array along an arbitrary axis. Applying it on axis=2 gives the desired result.
Using print directly as the callback:
>>> np.apply_along_axis(print, 2, a)
[1 2 3]
[11 22 33]
[4 5 6]
[44 55 66]
Or with a lambda wrapper:
>>> np.apply_along_axis(lambda r: print(' '.join([str(x) for x in r])), 2, a)
1 2 3
11 22 33
4 5 6
44 55 66
In [146]: a = numpy.array([
...: [[1, 2, 3], [11, 22, 33]],
...: [[4, 5, 6], [44, 55, 66]],
...: ])
...:
In [147]: a
Out[147]:
array([[[ 1, 2, 3],
[11, 22, 33]],
[[ 4, 5, 6],
[44, 55, 66]]])
A proper "vectorized" numpy output is:
In [148]: a.reshape(-1,3)
Out[148]:
array([[ 1, 2, 3],
[11, 22, 33],
[ 4, 5, 6],
[44, 55, 66]])
You could also convert that to a list of lists:
In [149]: a.reshape(-1,3).tolist()
Out[149]: [[1, 2, 3], [11, 22, 33], [4, 5, 6], [44, 55, 66]]
But you want the values printed without the standard numpy formatting (or list formatting). That iteration is easy:
In [150]: for row in a.reshape(-1,3):
...: print(*row)
...:
1 2 3
11 22 33
4 5 6
44 55 66
Since your desired output is a print, or at least "unformatted" strings, there's no "vectorized", i.e. whole-array, option. You have to iterate on each line!
np.savetxt creates a csv output by iterating on rows and writing a format tuple, e.g. f.write(fmt%tuple(row)).
In [155]: np.savetxt('test', a.reshape(-1,3), fmt='%d')
In [156]: cat test
1 2 3
11 22 33
4 5 6
44 55 66
To get that exact output without iterating, try this:
print(str(a.tolist()).replace('], [', '\n').replace('[', '').replace(']', '').replace(',', ''))
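If you prefer a single print call, the reshape idea from the earlier answer can be combined with string joining (a sketch, not part of the answers above; the generator expression still iterates, just at the string-building level):
import numpy as np

a = np.array([[[1, 2, 3], [11, 22, 33]],
              [[4, 5, 6], [44, 55, 66]]])

# build the whole output string row by row, then print once
print('\n'.join(' '.join(map(str, row)) for row in a.reshape(-1, 3)))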

Numpy argsort - what is happening?

I have a numpy array called arr1 defined like following.
arr1 = np.array([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9])
print(arr1.argsort())
array([ 0, 1, 2, 3, 4, 5, 6, 7, 9, 8, 10, 11, 12, 13, 14, 15, 16,
17], dtype=int64)
I expected all the indices of the array to be in numeric order, but indices 8 and 9 seem to have flipped.
Can someone help on why this is happening?
np.argsort by default uses the quicksort algorithm which is not stable. You can specify kind = "stable" to perform a stable sort, which will preserve the order of equal elements:
import numpy as np
arr1 = np.array([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9])
print(arr1.argsort(kind="stable"))
It gives:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17]
Because it sorts according to the quicksort algorithm; if you follow its steps, you will see why the indices of equal elements end up flipped. https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
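To see that the flip is harmless for the sorted values themselves, here is a quick check (a sketch building on the answers above): indices 8 and 9 both point at the value 5, so applying either ordering to the array gives the same sorted result.
import numpy as np

arr1 = np.array([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9])
order = arr1.argsort()  # default, non-stable quicksort

# the values at the swapped indices are equal, so the sorted output matches
print(arr1[order])
print(np.array_equal(arr1[order], np.sort(arr1)))  # True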

numpy select members with a certain shifted window

I want to extract some members from many large numpy arrays. A simple example is
A = np.arange(36).reshape(6,6)
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
I want to extract the members in a shifted window in each row, with minimum stride 2 and maximum stride 4. For example, in the first row, I would like to have
[2, 3, 4] # A[i,i+2:i+4+1] where i == 0
In the second row, I want to have
[9, 10, 11] # A[i,i+2:i+4+1] where i == 1
In the third row, I want to have
[16, 17, 0] # A[i,i+2:i+4+1] where i == 2, with out-of-bounds positions padded by 0
and so on. The full desired output is
[[2, 3, 4],
[9, 10, 11],
[16, 17, 0],
[23, 0, 0]]
I want to know efficient ways to do this. Thanks.
You can extract values from an array by providing a list of indices for each dimension. For example, if you want the second diagonal, you can use arr[np.arange(0, len(arr)-1), np.arange(1, len(arr))].
For your example, I would do something like the code below, although I did not account for different strides. If you want to account for strides, you can change how the index lists are created. If you struggle with adding the stride functionality, comment and I'll edit this answer.
import numpy as np
def slide(arr, window_len=3, start=0):
    # pad array to get zeros out of bounds
    arr_padded = np.zeros((arr.shape[0], arr.shape[1] + window_len - 1))
    arr_padded[:arr.shape[0], :arr.shape[1]] = arr
    # compute the number of window moves
    repeats = min(arr.shape[0], arr.shape[1] - start)
    # create index lists
    idx0 = np.arange(repeats).repeat(window_len)
    idx1 = np.concatenate(
        [np.arange(start + i, start + i + window_len)
         for i in range(repeats)])
    return arr_padded[idx0, idx1].reshape(-1, window_len)
A = np.arange(36).reshape(6,6)
print(f'A =\n{A}')
print(f'B =\n{slide(A, start = 2)}')
output:
A =
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
[30 31 32 33 34 35]]
B =
[[ 2. 3. 4.]
[ 9. 10. 11.]
[16. 17. 0.]
[23. 0. 0.]]
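As another sketch (not from the answer above, and assuming NumPy 1.20+ for sliding_window_view): you can pad each row with zeros, build all row-wise windows with np.lib.stride_tricks.sliding_window_view, and then pick the window starting at column i + start for row i.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

A = np.arange(36).reshape(6, 6)
window_len, start = 3, 2

# pad each row on the right so windows may run past the edge
padded = np.pad(A, ((0, 0), (0, window_len - 1)))

# all length-3 windows of every row: shape (rows, cols, window_len)
windows = sliding_window_view(padded, window_len, axis=1)

# row i takes the window starting at column i + start
rows = np.arange(min(A.shape[0], A.shape[1] - start))
B = windows[rows, rows + start]
print(B)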

Pythonic way to get both diagonals passing through a matrix entry (i,j)

What is the Pythonic way to get a list of diagonal elements in a matrix passing through entry (i,j)?
For e.g., given a matrix like:
[1 2 3 4 5]
[6 7 8 9 10]
[11 12 13 14 15]
[16 17 18 19 20]
[21 22 23 24 25]
and an entry, say, (1,3) (representing element 9), how can I get the elements of the diagonals passing through 9 in a Pythonic way? Basically, both [3,9,15] and [5,9,13,17,21].
Using np.diagonal with a little offset logic.
import numpy as np
lst = np.array([[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
i, j = 1, 3
major = np.diagonal(lst, offset=(j - i))
print(major)
array([ 3, 9, 15])
minor = np.diagonal(np.rot90(lst), offset=-lst.shape[1] + (j + i) + 1)
print(minor)
array([ 5, 9, 13, 17, 21])
The indices i and j are the row and column. By specifying the offset, numpy knows where to begin selecting elements for the diagonal.
For the major diagonal, you want to start collecting from 3 in the first row. So you need to subtract the current row index from the current column index to figure out the correct column index at the 0th row. Similarly for the minor diagonal, where the array is flipped (rotated by 90°) and the process repeats.
As an alternative method, by raveling the array (for a square matrix of shape (n, n)):
array = np.array([[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
x, y = 1, 3
a_mod = array.ravel()
size = array.shape[0]
if y >= x:
    diag = a_mod[y-x:(x+size-y)*size:size+1]
else:
    diag = a_mod[(x-y)*size::size+1]
if x-(size-1-y) >= 0:
    reverse_diag = array[:, ::-1].ravel()[(x-(size-1-y))*size::size+1]
else:
    reverse_diag = a_mod[x:x*size+1:size-1]
# diag --> [ 3 9 15]
# reverse_diag --> [ 5 9 13 17 21]
The correctness of the resulting arrays should be checked further. This approach can be extended to handle matrices with other shapes, e.g. (n, m).
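For reference, the np.diagonal approach from the first answer can also be wrapped in a small helper; this is a sketch of my own, using np.fliplr instead of np.rot90 for the anti-diagonal:
import numpy as np

def diagonals_through(a, i, j):
    # major diagonal through (i, j): offset is column minus row
    major = np.diagonal(a, offset=j - i)
    # flip left-right so the anti-diagonal becomes an ordinary diagonal;
    # (i, j) lands in column a.shape[1] - 1 - j of the flipped array
    minor = np.diagonal(np.fliplr(a), offset=(a.shape[1] - 1 - j) - i)
    return major, minor

lst = np.arange(1, 26).reshape(5, 5)
print(diagonals_through(lst, 1, 3))  # (array([ 3,  9, 15]), array([ 5,  9, 13, 17, 21]))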

Transforming multiindex to row-wise multi-dimensional NumPy array.

Suppose I have a MultiIndex DataFrame similar to an example from the MultiIndex docs.
>>> df
0 1 2 3
first second
bar one 0 1 2 3
two 4 5 6 7
baz one 8 9 10 11
two 12 13 14 15
foo one 16 17 18 19
two 20 21 22 23
qux one 24 25 26 27
two 28 29 30 31
I want to generate a NumPy array from this DataFrame with a 3-dimensional structure like
>>> desired_arr
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
How can I do so?
Hopefully it is clear what is happening here - I am effectively unstacking the DataFrame by the first level and then trying to turn each top level in the resulting column MultiIndex to its own 2-dimensional array.
I can get half way there with
>>> df.unstack(1)
0 1 2 3
second one two one two one two one two
first
bar 0 4 1 5 2 6 3 7
baz 8 12 9 13 10 14 11 15
foo 16 20 17 21 18 22 19 23
qux 24 28 25 29 26 30 27 31
but then I am struggling to find a nice way to turn each column into a 2-dimensional array and then join them together, beyond doing so explicitly with loops and lists.
I feel like there should be some way for me to specify the shape of my desired NumPy array beforehand, fill it with np.nan, and then use a specific iteration order to fill in the values from my DataFrame, but I have not managed to solve the problem with this approach yet.
To generate the sample DataFrame
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8*4).reshape((8, 4)), index=ind)
Some reshape and swapaxes magic -
df.values.reshape(4,2,-1).swapaxes(1,2)
Generalizable to -
m,n = len(df.index.levels[0]), len(df.index.levels[1])
arr = df.values.reshape(m,n,-1).swapaxes(1,2)
Basically, we split the first axis into two axes of lengths 4 and 2, creating a 3D array, and then swap the last two axes, i.e. push the axis of length 2 to the back (as the last one).
Sample output -
In [35]: df.values.reshape(4,2,-1).swapaxes(1,2)
Out[35]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
To complete the answer of @Divakar, for a multidimensional generalisation:
# sort values by index
A = df.sort_index()
# fill na
for idx in A.index.names:
    A = A.unstack(idx).fillna(0).stack(1)
# create a tuple with the right dimensions
reshape_size = tuple([len(x) for x in A.index.levels])
# reshape
arr = np.reshape(A.values, reshape_size).swapaxes(0, 1)
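A loop-based alternative (my own sketch, not from the answers above; it assumes a pandas version with DataFrame.droplevel) is to group by the first index level and stack one transposed block per group:
import numpy as np
import pandas as pd

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8 * 4).reshape((8, 4)), index=ind)

# one (columns x 'second') block per value of 'first', stacked into a 3D array
desired_arr = np.stack([g.droplevel('first').T.to_numpy()
                        for _, g in df.groupby(level='first')])
print(desired_arr.shape)  # (4, 4, 2)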
