NumPy array shape mismatch on masking/assignment command

I'm trying to run a loop where I build a mask and then use that mask to assign values from one array into specific rows and columns of another array. The following script works, but only when there are no duplicate values in column 0 of array y. If there are duplicates, the mask selects multiple rows in y and the assignment raises an error. Thanks for any help.
import numpy as np

x = np.zeros(shape=(100,10))
x[:,0] = np.arange(100)
# seed = 9 produces duplicate values in column 0 of y, which seems to cause the problem
# (no issues when there are no duplicate values in column 0 of y)
y = (np.random.default_rng(9).random((10,7))*100).astype(int)
for i in range(x.shape[0]):
    mask = y[:,0] == x[i,0]
    y[mask,[1,3,4,6]] = x[i,[1,2,3,4]]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [219], in <cell line: 2>()
2 for i in range(x.shape[0]):
3 mask = y[:,0] == x[i,0]
----> 4 y[mask,[1,3,4,6]] = x[i,[1,2,3,4]]
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (0,) (4,)

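In y[mask, [1, 3, 4, 6]] the boolean mask is converted to the integer indices of its True rows, and that index array must broadcast against the four column indices. This behaves as intended only when the mask selects exactly one row; with no match (the (0,) shape in the traceback) or with several matching rows the shapes cannot be paired up correctly. Because the loop assigns row by row, you can skip the iterations where the mask is empty and broadcast the indices explicitly: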
1. First solution: fixing the original loop
range_ = np.arange(y.shape[0], dtype=np.int64)
for i in range(x.shape[0]):
    mask = y[:, 0] == x[i, 0]
    true_counts = np.count_nonzero(mask)
    if true_counts != 0:
        broadcast_x = np.broadcast_to(x[i, [1, 2, 3, 4]], shape=(true_counts, 4))  # 4 is the length of [1, 2, 3, 4]
        broadcast_y = np.broadcast_to([1, 3, 4, 6], shape=(true_counts, 4))
        y[range_[mask][:, None], broadcast_y] = broadcast_x
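As a shorter variant of the loop above, here is a minimal sketch that uses np.ix_ to build the row/column mesh, so that zero, one, or several matching rows are all handled by the same assignment. The non-zero values written into x and the forced duplicate in y are assumptions added only to make the effect visible; they are not part of the question.

```python
import numpy as np

x = np.zeros((100, 10))
x[:, 0] = np.arange(100)
# Hypothetical payload: row k of x carries k+1, k+2, k+3, k+4 in columns 1..4
x[:, 1:5] = np.arange(100)[:, None] + np.array([1, 2, 3, 4])

y = (np.random.default_rng(9).random((10, 7)) * 100).astype(int)
y[1, 0] = y[0, 0]  # force a duplicate key in column 0 (assumption for the demo)

src_cols = [1, 2, 3, 4]
dst_cols = [1, 3, 4, 6]
for i in range(x.shape[0]):
    mask = y[:, 0] == x[i, 0]
    # np.ix_ turns the mask into (k, 1) row indices and the list into (1, 4)
    # column indices, so the 4 source values broadcast to all k matching rows
    # (including k == 0, where nothing is assigned).
    y[np.ix_(mask, dst_cols)] = x[i, src_cols]

print(y[:2])
```

The floats from x are truncated to integers on assignment because y has an integer dtype, as in the original script.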
2. Second solution: vectorized (the best)
Instead of looping, we can first find which values of y[:, 0] occur in x[:, 0] and then use advanced indexing:
mask = np.in1d(y[:, 0], x[:, 0])
y[mask, np.array([1, 3, 4, 6])[:, None]] = 0
If x[:, 0] was filled by np.arange and you want to assign an array instead of zero, that array has to be taken from x. First select the corresponding rows of x with x[y[:, 0] - x[0, 0]] (in your case this can be just x[y[:, 0]], because np.arange starts from 0 and so x[0, 0] = 0), then apply the masks to pick the needed values from the specified rows and columns:
mask = np.in1d(y[:, 0], x[:, 0]) # rows mask for y
new_arr = x[y[:, 0] - x[0, 0]][mask, np.array([1, 2, 3, 4])[:, None]]
y[mask, np.array([1, 3, 4, 6])[:, None]] = new_arr
If this raises IndexError: arrays used as indices must be of integer (or boolean) type, the indices are not integers (here x is a float array, so y[:, 0] - x[0, 0] is float as well). Cast them, for example with (y[:, 0] - x[0, 0]).astype(np.int64) or np.array([1, 2, 3, 4], dtype=np.int64).
A more general version, for when x[:, 0] was not filled by np.arange, finds the index in x of each value in y[:, 0]:
mask = np.in1d(y[:, 0], x[:, 0])
# finding common indices
unique_values, index = np.unique(x[:, 0], return_index=True)
idx = index[np.searchsorted(unique_values, y[:, 0])]
new_arr = x[idx][mask, np.array([1, 2, 3, 4])[:, None]]
y[mask, np.array([1, 3, 4, 6])[:, None]] = new_arr
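One detail to watch in the lookup above: np.searchsorted returns an insertion position, which equals the length of the sorted array when a value is larger than every key, so indexing with it directly can go out of bounds. The following is a sketch, under assumed data, that clips the positions, checks which ones are real matches, and compares the result against a plain loop. All array contents and the names order, pos and found are illustrative, not from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x has unique, unsorted keys in column 0 (even numbers 0..38)
x = rng.random((20, 10))
x[:, 0] = rng.permutation(np.arange(0, 40, 2))
y = rng.integers(0, 45, size=(10, 7)).astype(float)
y[0, 0] = y[1, 0] = x[3, 0]   # duplicate key that exists in x
y[2, 0] = 41                  # key absent from x and larger than every key

src_cols = np.array([1, 2, 3, 4])
dst_cols = np.array([1, 3, 4, 6])

# Reference result from a row-by-row loop
expected = y.copy()
for i in range(x.shape[0]):
    mask = expected[:, 0] == x[i, 0]
    expected[np.ix_(mask, dst_cols)] = x[i, src_cols]

# Vectorized lookup: sort the keys, locate each y key, keep only exact matches
order = np.argsort(x[:, 0])
sorted_keys = x[order, 0]
pos = np.searchsorted(sorted_keys, y[:, 0])
pos = np.clip(pos, 0, len(sorted_keys) - 1)   # guard against pos == len(sorted_keys)
found = sorted_keys[pos] == y[:, 0]           # True where the key really is in x

result = y.copy()
result[np.ix_(found, dst_cols)] = x[order[pos[found]]][:, src_cols]

print(np.array_equal(result, expected))
```

The exact equality test on keys is safe here only because the keys are whole numbers stored as floats; for arbitrary floating-point keys a tolerance-based match would be needed.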
3. Third solution: plain indexing (only for the toy example)
For the example in the question, you can do this with advanced indexing instead of the loop:
y[:, [1, 3, 4, 6]] = 0
This works on your data because every value in y[:, 0] (all < 100) appears in the first column of x (which runs from 0 to 99), so every row of y is matched.
Or, to assign an array instead of 0:
new_arr = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y[:, [1, 3, 4, 6]] = new_arr[:, None]

Related

Vectorizing this for-loop in numpy

I was wondering how I would vectorize this for loop. Given a 2x2x2 array x and an array y in which each element holds the indices i, j and k, I want to get x[i, j, k] for each element.
Given arrays x and y
x = np.arange(8).reshape((2, 2, 2))
y = [[0, 1, 1], [1, 1, 0]]
I want to get:
x[0, 1, 1] = 3 and x[1, 1, 0] = 6
I tried:
print(x[y])
But it prints:
array([[2, 3],
[6, 7],
[4, 5]])
So I ended up doing:
for y_ in y:
    print(x[y_[0], y_[1], y_[2]])
Which works, but I can't help but think there is a better way.

Use the transposed y, i.e. zip(*y), as the index. For advanced indexing to work, the index needs one element per dimension, each holding the indices for that dimension:
x[tuple(zip(*y))]
# array([3, 6])
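The same idea can be written with NumPy arrays instead of zip. This is a small sketch using the question's data; the names idx, picked and same are illustrative.

```python
import numpy as np

x = np.arange(8).reshape((2, 2, 2))
y = [[0, 1, 1], [1, 1, 0]]

idx = np.asarray(y)                  # shape (2, 3): one row per lookup
# One index array per dimension, spelled out explicitly
picked = x[idx[:, 0], idx[:, 1], idx[:, 2]]
# Equivalent: transpose so each row holds the indices for one dimension
same = x[tuple(idx.T)]

print(picked)  # [3 6]
print(same)    # [3 6]
```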

Example in np.argsort document

For some reason I cannot resolve this.
According to the example here for 1-dim array,
https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
x = np.array([3, 1, 2])
np.argsort(x)
array([1, 2, 0])
And I have tried this myself. But by default the result should be ascending, meaning
x[result]
returns
array([1, 2, 3])
So shouldn't the result be [2, 0, 1]?
What am I missing here?

From the docs, the first line states "Returns the indices that would sort an array." So the result lists positions in x, in the order that yields the sorted values:
x = np.array([3, 1, 2])
np.argsort(x)
>>> array([1, 2, 0])
Here we want the index positions of 1, 2 and 3 in x. The position of 1 is 1, the position of 2 is 2, and the position of 3 is 0, hence x[[1, 2, 0]] gives the sorted array [1, 2, 3].
Again from the notes, "argsort returns an array of indices of the same shape as a that index data along the given axis in sorted order."
A more intuitive way of looking at what that means is to use a for loop, where we loop over our returned argsort values, and then index the initial array with these values:
x = np.array([3, 1, 2])
srt_positions = np.argsort(x)
for k in srt_positions:
    print(x[k])
# prints 1, 2 and 3, one value per line
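The [2, 0, 1] that the question expected is a different quantity: the rank of each element, i.e. where each value of x ends up in the sorted array. A minimal sketch of the difference, assuming distinct values (the names order and ranks are illustrative):

```python
import numpy as np

x = np.array([3, 1, 2])

order = np.argsort(x)       # positions to read from x, smallest value first
ranks = np.argsort(order)   # where each element of x lands in the sorted array

print(order)     # [1 2 0]
print(x[order])  # [1 2 3]
print(ranks)     # [2 0 1]
```

So argsort answers "which index comes first?", while the argsort of that result answers "what place does each element take?".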

numpy advanced indexing with array [duplicate]

What is the most elegant way to access an n-dimensional array with an (n-1)-dimensional array along a given dimension, as in the dummy example
a = np.random.random_sample((3,4,4))
b = np.random.random_sample((3,4,4))
idx = np.argmax(a, axis=0)
How can I now use idx to get the maxima in a, as if I had used a.max(axis=0)? Or how can I retrieve the values specified by idx in b?
I thought about using np.meshgrid but I think it is overkill. Note that the axis can be any valid axis (0, 1, 2) and is not known in advance. Is there an elegant way to do this?

Make use of advanced-indexing -
m,n = a.shape[1:]
I,J = np.ogrid[:m,:n]
a_max_values = a[idx, I, J]
b_max_values = b[idx, I, J]
For the general case:
def argmax_to_max(arr, argmax, axis):
    """argmax_to_max(arr, arr.argmax(axis), axis) == arr.max(axis)"""
    new_shape = list(arr.shape)
    del new_shape[axis]
    # list(...) in case np.ogrid returns a tuple, which has no insert()
    grid = list(np.ogrid[tuple(map(slice, new_shape))])
    grid.insert(axis, argmax)
    return arr[tuple(grid)]
Quite a bit more awkward than such a natural operation should be, unfortunately.
For indexing an n-dim array with an (n-1)-dim array, we could simplify it a bit to give us the grid of indices for all axes, like so -
def all_idx(idx, axis):
    # list(...) in case np.ogrid returns a tuple, which has no insert()
    grid = list(np.ogrid[tuple(map(slice, idx.shape))])
    grid.insert(axis, idx)
    return tuple(grid)
Hence, use it to index into input arrays -
axis = 0
a_max_values = a[all_idx(idx, axis=axis)]
b_max_values = b[all_idx(idx, axis=axis)]

using indexing in numpy https://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.indexing.html#advanced-indexing
a = np.array([[1, 2], [3, 4], [5, 6]])
a
> array([[1, 2],
       [3, 4],
       [5, 6]])
idx = a.argmax(axis=1)
idx
> array([1, 1, 1], dtype=int64)
since you want all rows but only the columns given by idx, you can use [0, 1, 2] or np.arange(a.shape[0]) for the row indexes
rows = np.arange(a.shape[0])
a[rows, idx]
> array([2, 4, 6])
which is the same as a.max(axis=1)
a.max(axis=1)
> array([2, 4, 6])
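if a has 3 dimensions, idx = a.argmax(axis=1) has shape (a.shape[0], a.shape[2]), so the row indexes need a trailing axis to broadcast against it, and you add the indexes of the 3rd dimension as well:
rows = np.arange(a.shape[0])[:, None]
index2 = np.arange(a.shape[2])
a[rows, idx, index2]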

I suggest the following:
a = np.array([[1, 3], [2, -2], [1, -1]])
a
>array([[ 1, 3],
[ 2, -2],
[ 1, -1]])
idx = a.argmax(axis=1)
idx
> array([1, 0, 0], dtype=int64)
np.take_along_axis(a, idx[:, None], axis=1).squeeze()
>array([3, 2, 1])
a.max(axis=1)
>array([3, 2, 1])
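The np.take_along_axis call above extends to any number of dimensions and any axis, which is what the question asks for. A minimal sketch follows; the helper name values_at is hypothetical, and the random test arrays are assumptions for the demo.

```python
import numpy as np

def values_at(arr, idx, axis):
    """Pick arr's entries at positions idx along axis (hypothetical helper)."""
    idx_kept = np.expand_dims(idx, axis=axis)   # put back the reduced axis with length 1
    picked = np.take_along_axis(arr, idx_kept, axis=axis)
    return np.squeeze(picked, axis=axis)        # drop that axis again

rng = np.random.default_rng(0)
a = rng.random((3, 4, 5))
b = rng.random((3, 4, 5))

for axis in range(a.ndim):
    idx = np.argmax(a, axis=axis)
    a_max = values_at(a, idx, axis)      # same values as a.max(axis=axis)
    b_at_max = values_at(b, idx, axis)   # values of b where a is maximal
    print(axis, np.array_equal(a_max, a.max(axis=axis)), b_at_max.shape)
```

Because the axis is passed as a parameter, it does not need to be known in advance.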

How does the axis parameter from NumPy work?

Can someone explain exactly what the axis parameter in NumPy does?
I am terribly confused.
I'm trying to use the function myArray.sum(axis=num)
At first I thought that if the array itself has 3 dimensions, axis=0 will return three elements, consisting of the sum of all nested items in that same position. If each dimension contained five elements, I expected axis=1 to return a result of five items, and so on.
However, this is not the case, and the documentation does not do a good job of helping me out (it uses a 3x3x3 array, so it's hard to tell what's happening).
Here's what I did:
>>> e
array([[[1, 0],
        [0, 0]],
       [[1, 1],
        [1, 0]],
       [[1, 0],
        [0, 1]]])
>>> e.sum(axis=0)
array([[3, 1],
       [1, 1]])
>>> e.sum(axis=1)
array([[1, 0],
       [2, 1],
       [1, 1]])
>>> e.sum(axis=2)
array([[1, 0],
       [2, 1],
       [1, 1]])
Clearly the result is not intuitive.

Clearly,
e.shape == (3, 2, 2)
Sum over an axis is a reduction operation so the specified axis disappears. Hence,
e.sum(axis=0).shape == (2, 2)
e.sum(axis=1).shape == (3, 2)
e.sum(axis=2).shape == (3, 2)
Intuitively, we are "squashing" the array along the chosen axis, and summing the numbers that get squashed together.
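To make the "squashing" concrete, here is a short sketch on the array e from the question. It spells out which elements are added for each axis and uses keepdims=True to show where the reduced axis sat; the dictionary name shapes is illustrative.

```python
import numpy as np

e = np.array([[[1, 0], [0, 0]],
              [[1, 1], [1, 0]],
              [[1, 0], [0, 1]]])

# axis 0: the three 2x2 blocks are added element by element
assert np.array_equal(e.sum(axis=0), e[0] + e[1] + e[2])
# axis 1: the two rows inside each block are added
assert np.array_equal(e.sum(axis=1), e[:, 0, :] + e[:, 1, :])
# axis 2: the two columns inside each block are added
assert np.array_equal(e.sum(axis=2), e[:, :, 0] + e[:, :, 1])

# keepdims=True keeps the reduced axis with length 1, which shows which one was squashed
shapes = {
    axis: (e.sum(axis=axis).shape, e.sum(axis=axis, keepdims=True).shape)
    for axis in range(e.ndim)
}
print(shapes)
# {0: ((2, 2), (1, 2, 2)), 1: ((3, 2), (3, 1, 2)), 2: ((3, 2), (3, 2, 1))}
```

This also explains why axis=1 and axis=2 give the same result for this particular e: the row sums and column sums of each block happen to coincide.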

To understand the axis intuitively, refer to the picture below (source: Physics Dept, Cornell Uni)
The shape of the (boolean) array in the above figure is shape=(8, 3). ndarray.shape returns a tuple whose entries are the lengths of the corresponding dimensions. In our example, 8 is the length of axis 0 and 3 is the length of axis 1.

There are good answers with visualizations; however, it might help to think about this from a purely analytical perspective.
You can create an array of arbitrary dimension with numpy.
For example, here's a 5-dimensional array:
>>> a = np.random.rand(2, 3, 4, 5, 6)
>>> a.shape
(2, 3, 4, 5, 6)
You can access any element of this array by specifying indices. For example, here's the first element of this array:
>>> a[0, 0, 0, 0, 0]
0.0038908603263844155
Now if you replace one of the indices with a slice, you get all the elements along that dimension:
>>> a[0, 0, :, 0, 0]
array([0.00389086, 0.27394775, 0.26565889, 0.62125279])
When you apply a function like sum with the axis parameter, that dimension gets eliminated and an array with one fewer dimension is created. For each cell in the new array, the operator takes the list of elements along the eliminated axis and applies the reduction function to get a scalar.
>>> np.sum(a, axis=2).shape
(2, 3, 5, 6)
Now you can check that the first element of this array is sum of above elements:
>>> np.sum(a, axis=2)[0, 0, 0, 0]
1.1647502999560164
>>> a[0, 0, :, 0, 0].sum()
1.1647502999560164
axis=None has the special meaning of flattening the array and applying the function to all numbers.
Now you can think about more complex cases where axis is not just a number but a tuple:
>>> np.sum(a, axis=(2,3)).shape
(2, 3, 6)
Note that we can use the same technique to figure out how this reduction was done:
>>> np.sum(a, axis=(2,3))[0,0,0]
7.889432081931909
>>> a[0, 0, :, :, 0].sum()
7.88943208193191
You can also use the same reasoning for adding a dimension to an array instead of reducing one:
>>> x = np.random.rand(3, 4)
>>> y = np.random.rand(3, 4)
# New dimension is created on specified axis
>>> np.stack([x, y], axis=2).shape
(3, 4, 2)
>>> np.stack([x, y], axis=0).shape
(2, 3, 4)
# To retrieve item i of the stack, index with i along that axis
Hope this gives you a general and full understanding of this important parameter.
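The stacking example above stops at the shapes; the sketch below shows how an item is retrieved from the new axis. It uses a seeded generator instead of np.random.rand so the run is repeatable, and the names last and first are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3, 4))
y = rng.random((3, 4))

last = np.stack([x, y], axis=2)    # shape (3, 4, 2): new axis at the end
first = np.stack([x, y], axis=0)   # shape (2, 3, 4): new axis at the front

# Item i of the stack is recovered by indexing with i along the new axis
assert np.array_equal(last[:, :, 1], y)
assert np.array_equal(first[1], y)
# np.take does the same when the axis is only known at run time
assert np.array_equal(np.take(last, 0, axis=2), x)

print(last.shape, first.shape)  # (3, 4, 2) (2, 3, 4)
```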

Some answers are too specific or do not address the main source of confusion. This answer attempts to provide a more general but simple explanation of the concept, with a simple example.
The main source of confusion is related to expressions such as "Axis along which the means are computed", which is the documentation of the argument axis of the numpy.mean function. What the heck does "along which" even mean here? "Along which" essentially means that, if the axis is 0, you sum the rows together (and divide by the number of rows, given that we are computing the mean), and if the axis is 1, you sum the columns together. When the axis is 0 (or 1), the rows (or columns) can be scalars, vectors, or even other multi-dimensional arrays.
In [1]: import numpy as np
In [2]: a=np.array([[1, 2], [3, 4]])
In [3]: a
Out[3]:
array([[1, 2],
[3, 4]])
In [4]: np.mean(a, axis=0)
Out[4]: array([2., 3.])
In [5]: np.mean(a, axis=1)
Out[5]: array([1.5, 3.5])
So, in the example above, np.mean(a, axis=0) returns array([2., 3.]) because (1 + 3)/2 = 2 and (2 + 4)/2 = 3. It returns an array of two numbers because it returns the mean of the rows for each column (and there are two columns).

Both the 1st and 2nd replies are great for understanding the ndarray concept in numpy. I am giving a simple example.
And according to this image by #debaonline4u
https://i.stack.imgur.com/O5hBF.jpg
Suppose you have a 2D array -
[1, 2, 3]
[4, 5, 6]
In numpy format it will be -
c = np.array([[1, 2, 3],
[4, 5, 6]])
Now,
c.ndim = 2 (two axes: axis 0 runs over rows, axis 1 over columns)
c.shape = (2,3) (axis0, axis1)
c.sum(axis=0) = [1+4, 2+5, 3+6] = [5, 7, 9] (the rows are added element by element, so along axis 0)
c.sum(axis=1) = [1+2+3, 4+5+6] = [6, 15] (the elements within each row are added, so along axis 1)
So for your 3D array,
