Let's say I have a numpy.ndarray with shape (2,3,2) as below,
arr = np.array([[[1,3], [2,5], [1,2]],[[3,3], [6,5], [5,2]]])
I want to reshape it in such a way that:
arr.shape == (2,3)
arr == [[(1,3), (2,5), (1,2)],[(3,3), (6,5), (5,2)]]
and
each value of arr is a size 2 tuple
The reason I want to do this is that I want to take the minimum along axis 0 of the 3-dimensional array, but I want to preserve the value that the minimum of each row is paired with.
arr = np.array(
    [[[1, 4],
      [2, 1],
      [5, 2]],
     [[3, 3],
      [6, 5],
      [1, 7]]])
print(np.min(arr, axis=0))
>>> [[1,3],
     [2,1],
     [1,2]]
>>> Should be
    [[1,4],
     [2,1],
     [1,7]]
If the array contained tuples, it would be 2-dimensional, and the comparison operator for min would still function correctly,
so I would get the correct result. But I haven't found any way to do this besides iterating over the arrays, which is inefficient, however obvious the implementation may be.
Is it possible to perform this conversion efficiently in numpy?
Don't use tuples at all - just view it as a structured array, which supports the lexicographic comparison you're after:
a = np.array([[[1, 4], [2, 1], [5, 2]], [[3, 3], [6, 5], [1, 7]]])
a_pairs = a.view([('f0', a.dtype), ('f1', a.dtype)]).squeeze(axis=-1)
min_pair = np.partition(a_pairs, 0, axis=0)[0] # min doesn't work on structured types :(
array([(1, 4), (2, 1), (1, 7)],
      dtype=[('f0', '<i4'), ('f1', '<i4')])
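If you need the result back as a plain (3, 2) array afterwards, one option (a small sketch continuing the snippet above; it simply stacks the two fields defined in the view) is:

# Stack the two structured fields back into an ordinary 2-column array
plain = np.stack([min_pair['f0'], min_pair['f1']], axis=-1)
print(plain)  # [[1 4] [2 1] [1 7]]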
First, let's find out which pairs to take:
first_eq = arr[0, :, 0] == arr[1, :, 0]
which_compare = np.where(first_eq, 1, 0)
winner = arr[:, np.arange(arr.shape[1]), which_compare].argmin(axis=0)
Here, first_eq is True where the first elements match, so we would need to compare the second elements. It's [False, False, False] in your example. which_compare then is [0, 0, 0] (because the first element of each pair is what we will compare). Finally, winner tells us which of the two pairs to choose along the second axis. It is [0, 0, 1].
The last step is to extract the winners:
arr[winner, np.arange(arr.shape[1])]
That is, take the winner (0 or 1) at each point along the second axis.
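A consolidated, runnable version of the steps above (with the broadcast indexing written out, using the example array from the question):

import numpy as np

arr = np.array([[[1, 4], [2, 1], [5, 2]],
                [[3, 3], [6, 5], [1, 7]]])

first_eq = arr[0, :, 0] == arr[1, :, 0]
which_compare = np.where(first_eq, 1, 0)
winner = arr[:, np.arange(arr.shape[1]), which_compare].argmin(axis=0)
out = arr[winner, np.arange(arr.shape[1])]
print(out)  # [[1 4] [2 1] [1 7]]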
Here's one way -
# Fuse each pair into a single scalar: scale the first column by
# (max of the second column + 1) and add the second column, so that
# comparing the fused scalars is equivalent to comparing the pairs
# lexicographically. Then get the argmin indices along axis 0.
idx = (arr[...,1] + arr[...,0]*(arr[...,1].max()+1)).argmin(0)
# Finally use advanced-indexing to get those rows off array
out = arr[idx, np.arange(arr.shape[1])]
Sample run -
In [692]: arr
Out[692]:
array([[[3, 4],
        [2, 1],
        [5, 2]],

       [[3, 3],
        [6, 5],
        [5, 1]]])

In [693]: out
Out[693]:
array([[3, 3],
       [2, 1],
       [5, 1]])
Related
Given indexes for each row, how to return the corresponding elements in a 2-d matrix?
For instance, given a = np.array([[1,2,3,4],[4,5,6,7]]), I expect to see the output [[1,2],[4,5]] given indxs = np.array([[0,1],[0,1]]). Below is what I've tried:
a= np.array([[1,2,3,4],[4,5,6,7]])
indxs = np.array([[0,1],[0,1]]) #means return the elements located at 0 and 1 for each row
#I tried this, but it returns an array with shape (2, 2, 4)
a[indxs]
The reason you are getting your array twice is that a[indxs] selects rows 0 and 1 of a once for each row of indxs, and rows 0 and 1 are indeed your entire array.
In[]: a[[0,1]]
Out[]: array([[1, 2, 3, 4],
              [4, 5, 6, 7]])
You can get the desired output using slices. That would be the easiest way.
a = np.array([[1,2,3,4],[4,5,6,7]])
a[:,0:2]
Out[]: array([[1, 2],
              [4, 5]])
In case you are still interested on indexing, you could also get your output doing:
In[]: [list(a[[0],[0,1]]),list(a[[1],[0,1]])]
Out[]: [[1, 2], [4, 5]]
The NumPy documentation gives you a really nice overview of how indexing works.
In [121]: a = np.array([[1,2,3,4],[4,5,6,7]])
     ...: indxs = np.array([[0,1],[0,1]])
You need to provide an index for the first dimension, one that broadcasts with indxs.
In [122]: a[np.arange(2)[:,None], indxs]
Out[122]:
array([[1, 2],
       [4, 5]])
indxs is (2, n), so you need a (2, 1) array for the first axis to broadcast into a (2, n) result.
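For what it's worth, np.take_along_axis (available since NumPy 1.15) expresses the same pairing without constructing the broadcast index by hand:

import numpy as np

a = np.array([[1, 2, 3, 4], [4, 5, 6, 7]])
indxs = np.array([[0, 1], [0, 1]])

# Pairs each row of `a` with the matching row of `indxs`
out = np.take_along_axis(a, indxs, axis=1)
print(out)  # [[1 2]
            #  [4 5]]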
I want to write a function that takes a numpy array and I want to check if it meets the requirements. One thing that confuses me is that:
np.array([1,2,3]).shape == np.array([[1,2,3],[2,3],[2,43,32]]).shape == (3,)
[1,2,3] should be allowed, while [[1,2,3],[2,3],[2,43,32]] shouldn't.
Allowed shapes:
[0, 1, 2, 3, 4]
[0, 1, 2]
[[1],[2]]
[[1, 2], [2, 3], [3, 4]]
Not Allowed:
[] (empty array is not allowed)
[[0], [1, 2]] (inner dimensions must have same size 1!=2)
[[[4,5,6],[4,3,2]],[[2,3,2],[2,3,4]]] (more than 2 dimensions)
You should start with defining what you want in terms of shape. I tried to understand it from the question, please add more details if it is not correct.
So here we have (1) empty array is not allowed and (2) no more than two dimensions. It translates the following way:
def is_allowed(arr):
    return arr.shape != (0,) and len(arr.shape) <= 2
The first condition compares your array's shape with the shape of an empty array; the second condition checks that the array has no more than two dimensions.
With inner dimensions there is a problem. Some of the lists you provided as examples are not valid numpy arrays. If you cast np.array([[1,2,3],[2,3],[2,43,32]]), you get an array where each element is a Python list. It is not a "real" numpy array with direct access to all the elements (and note that recent NumPy versions refuse to create such ragged arrays at all unless you pass dtype=object explicitly). See example:
>>> np.array([[1,2,3],[2,3],[2,43,32]])
array([list([1, 2, 3]), list([2, 3]), list([2, 43, 32])], dtype=object)
>>> np.array([[1,2,3],[2,3, None],[2,43,32]])
array([[1, 2, 3],
       [2, 3, None],
       [2, 43, 32]], dtype=object)
So I would recommend (if you are operating on plain lists) checking that all inner lists have the same length without numpy, as sketched below.
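For example, a plain-Python sketch along those lines (a hypothetical helper, written to match the allowed/not-allowed cases above):

def is_allowed_list(lst):
    # Empty input is not allowed
    if len(lst) == 0:
        return False
    # Flat 1-D list of scalars
    if all(not isinstance(x, list) for x in lst):
        return True
    # 2-D case: every element must be a flat list of the same length
    if all(isinstance(x, list) for x in lst):
        n = len(lst[0])
        return all(
            len(x) == n and all(not isinstance(y, list) for y in x)
            for x in lst
        )
    # Mixed scalars and lists
    return False

print(is_allowed_list([0, 1, 2]))                # True
print(is_allowed_list([[1, 2], [2, 3], [3, 4]])) # True
print(is_allowed_list([[0], [1, 2]]))            # False
print(is_allowed_list([]))                       # False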
I have two numpy arrays acting as lower and upper boundaries of a range of vectors that I want to generate.
In a similar way to how arange() works, I would like to generate the intermediate members as in the example:
lower_boundary = np.array([1,1])
upper_boundary = np.array([3,3])
expected_result = [[1,1], [1,2], [1,3], [2,1], [2,2], [2,3], [3,1], [3,2], [3,3]]
The result can be a list or another numpy array. So far I have managed to workaround this scenario with nested loops, but the dimensions of 'lower_boundary' and 'upper_boundary' may vary, and my approach is not applicable.
In a typical scenario, both boundaries have at least 4 dimensions.
You can use np.indices to get a range of index values spanning your desired range (upper_boundary - lower_boundary + 1), reshape it to your needs (reshape(len(upper_boundary), -1)), and add lower_boundary to the values, resulting in:
>>> np.indices(upper_boundary - lower_boundary + 1).reshape(len(upper_boundary),-1).T + lower_boundary
array([[1, 1],
       [1, 2],
       [1, 3],
       [2, 1],
       [2, 2],
       [2, 3],
       [3, 1],
       [3, 2],
       [3, 3]])
Edit: I forgot to correct the code before posting; it should be as shown above. Thanks @Divakar for the fix.
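An equivalent formulation (not from the answer above, just a sketch) builds one 1-D range per dimension and takes their Cartesian product with np.meshgrid; it likewise works for any number of dimensions:

import numpy as np

lower_boundary = np.array([1, 1])
upper_boundary = np.array([3, 3])

# One inclusive range per dimension, then the Cartesian product of them all
ranges = [np.arange(lo, hi + 1) for lo, hi in zip(lower_boundary, upper_boundary)]
grid = np.meshgrid(*ranges, indexing='ij')
result = np.stack(grid, axis=-1).reshape(-1, len(lower_boundary))
print(result)  # [[1 1] [1 2] [1 3] [2 1] [2 2] [2 3] [3 1] [3 2] [3 3]]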
So, I have been browsing Stack Overflow for quite some time now, but I can't seem to find the solution to my problem.
Consider this
import numpy as np
coo = np.array([[1, 2], [2, 3], [3, 4], [3, 4], [1, 2], [5, 6], [1, 2]])
values = np.array([1, 2, 4, 2, 1, 6, 1])
The coo array contains the (x, y) coordinate positions
x = (1, 2, 3, 3, 1, 5, 1)
y = (2, 3, 4, 4, 2, 6, 2)
and the values array some sort of data for this grid point.
Now I want to get the average of all values for each unique grid point.
For example, the coordinate (1, 2) occurs at positions (0, 4, 6), so for this point I want the average of values[[0, 4, 6]].
How could I get this for all unique grid points?
You can sort coo with np.lexsort to bring the duplicate ones in succession. Then run np.diff along the rows to get a mask of starts of unique XY's in the sorted version. Using that mask, you can create an ID array that would have the same ID for the duplicates. The ID array can then be used with np.bincount to get the summation of all values with the same ID and also their counts and thus the average values, as the final output. Here's an implementation to go along those lines -
# Use lexsort to bring duplicate coo XY's in succession
sortidx = np.lexsort(coo.T)
sorted_coo = coo[sortidx]
# Get mask of start of each unique coo XY
unqID_mask = np.append(True,np.any(np.diff(sorted_coo,axis=0),axis=1))
# Tag/ID each coo XY based on their uniqueness among others
ID = unqID_mask.cumsum()-1
# Get unique coo XY's
unq_coo = sorted_coo[unqID_mask]
# Finally use bincount to get the summation of all coo within same IDs
# and their counts and thus the average values
average_values = np.bincount(ID,values[sortidx])/np.bincount(ID)
Sample run -
In [65]: coo
Out[65]:
array([[1, 2],
       [2, 3],
       [3, 4],
       [3, 4],
       [1, 2],
       [5, 6],
       [1, 2]])

In [66]: values
Out[66]: array([1, 2, 4, 2, 1, 6, 1])

In [67]: unq_coo
Out[67]:
array([[1, 2],
       [2, 3],
       [3, 4],
       [5, 6]])

In [68]: average_values
Out[68]: array([ 1.,  2.,  3.,  6.])
You can use where:
>>> values[np.where((coo == [1, 2]).all(1))].mean()
1.0
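To apply that to every unique grid point at once, you could combine it with np.unique(..., axis=0) (a straightforward, though not fully vectorized, sketch):

import numpy as np

coo = np.array([[1, 2], [2, 3], [3, 4], [3, 4], [1, 2], [5, 6], [1, 2]])
values = np.array([1, 2, 4, 2, 1, 6, 1])

# One mean per unique coordinate row
unique_coo = np.unique(coo, axis=0)
means = [values[(coo == c).all(1)].mean() for c in unique_coo]
print(means)  # [1.0, 2.0, 3.0, 6.0]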
It is very likely going to be faster to flatten your indices, i.e.:
flat_index = coo[:, 0] * (coo[:, 1].max() + 1) + coo[:, 1]
then use np.unique on it:
unq, unq_idx, unq_inv, unq_cnt = np.unique(flat_index,
                                           return_index=True,
                                           return_inverse=True,
                                           return_counts=True)
unique_coo = coo[unq_idx]
unique_mean = np.bincount(unq_inv, values) / unq_cnt
than the similar approach using lexsort.
But under the hood the method is virtually the same.
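For reference, here is the flattened-index approach run end to end on the question's data (a sketch; the + 1 in the flattening factor keeps the (x, y) mapping collision-free):

import numpy as np

coo = np.array([[1, 2], [2, 3], [3, 4], [3, 4], [1, 2], [5, 6], [1, 2]])
values = np.array([1, 2, 4, 2, 1, 6, 1])

# Collision-free flattening of (x, y) pairs into single integers
flat_index = coo[:, 0] * (coo[:, 1].max() + 1) + coo[:, 1]
unq, unq_idx, unq_inv, unq_cnt = np.unique(flat_index,
                                           return_index=True,
                                           return_inverse=True,
                                           return_counts=True)
unique_coo = coo[unq_idx]
unique_mean = np.bincount(unq_inv, values) / unq_cnt
print(unique_coo)   # [[1 2] [2 3] [3 4] [5 6]]
print(unique_mean)  # [1. 2. 3. 6.]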
This is a simple one-liner using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
unique, mean = npi.group_by(coo).mean(values)
Should be comparable to the currently accepted answer in performance, as it does similar things under the hood; but all in a well tested package with a nice interface.
Another way to do it is using JAX unique and grad. This approach might be particularly fast because it allows you to run on an accelerator (CPU, GPU, or TPU).
import functools
import jax
import jax.numpy as jnp
@jax.grad
def _unique_sum(unique_values: jnp.ndarray, unique_inverses: jnp.ndarray, values: jnp.ndarray):
    errors = unique_values[unique_inverses] - values
    return -0.5*jnp.dot(errors, errors)

@functools.partial(jax.jit, static_argnames=['size'])
def unique_mean(indices, values, size):
    unique_indices, unique_inverses, unique_counts = jnp.unique(indices, axis=0, return_inverse=True, return_counts=True, size=size)
    unique_values = jnp.zeros(unique_indices.shape[0], dtype=float)
    return unique_indices, _unique_sum(unique_values, unique_inverses, values) / unique_counts
coo = jnp.array([[1, 2], [2, 3], [3, 4], [3, 4], [1, 2], [5, 6], [1, 2]])
values = jnp.array([1, 2, 4, 2, 1, 6, 1])
unique_coo, unique_mean = unique_mean(coo, values, size=4)
print(unique_mean.block_until_ready())
The only weird thing is the size argument, since JAX requires all array sizes to be fixed/known beforehand. If you make size too small, it will silently drop valid results; too large, and it will return NaNs.
Can someone explain exactly what the axis parameter in NumPy does?
I am terribly confused.
I'm trying to use the function myArray.sum(axis=num)
At first I thought that if the array itself is 3-dimensional, axis=0 will return three elements, consisting of the sum of all nested items at that same position. If each dimension contained five elements, I expected axis=1 to return a result of five items, and so on.
However this is not the case, and the documentation does not do a good job helping me out (they use a 3x3x3 array so it's hard to tell what's happening)
Here's what I did:
>>> e
array([[[1, 0],
        [0, 0]],

       [[1, 1],
        [1, 0]],

       [[1, 0],
        [0, 1]]])
>>> e.sum(axis=0)
array([[3, 1],
       [1, 1]])
>>> e.sum(axis=1)
array([[1, 0],
       [2, 1],
       [1, 1]])
>>> e.sum(axis=2)
array([[1, 0],
       [2, 1],
       [1, 1]])
>>>
Clearly the result is not intuitive.
Clearly,
e.shape == (3, 2, 2)
Sum over an axis is a reduction operation so the specified axis disappears. Hence,
e.sum(axis=0).shape == (2, 2)
e.sum(axis=1).shape == (3, 2)
e.sum(axis=2).shape == (3, 2)
Intuitively, we are "squashing" the array along the chosen axis, and summing the numbers that get squashed together.
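A quick way to check this "squashing" intuition on the array from the question:

import numpy as np

e = np.array([[[1, 0], [0, 0]],
              [[1, 1], [1, 0]],
              [[1, 0], [0, 1]]])

# The reduced axis disappears from the shape
assert e.sum(axis=0).shape == (2, 2)
assert e.sum(axis=1).shape == (3, 2)
assert e.sum(axis=2).shape == (3, 2)

# Squashing along axis 0 adds the three 2x2 blocks element-wise
assert (e.sum(axis=0) == e[0] + e[1] + e[2]).all()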
To understand the axis intuitively, it helps to picture the array as a grid (the original answer referred to a figure from the Physics Dept., Cornell University, showing a boolean array of shape (8, 3)). ndarray.shape returns a tuple whose entries correspond to the length of each dimension: in that example, 8 is the length of axis 0 and 3 is the length of axis 1.
There are good answers for visualization, but it might also help to think from a purely analytical perspective.
You can create arrays of arbitrary dimension with numpy.
For example, here's a 5-dimension array:
>>> a = np.random.rand(2, 3, 4, 5, 6)
>>> a.shape
(2, 3, 4, 5, 6)
You can access any element of this array by specifying indices. For example, here's the first element of this array:
>>> a[0, 0, 0, 0, 0]
0.0038908603263844155
Now if you slice along one of the dimensions, you get all the elements along that dimension:
>>> a[0, 0, :, 0, 0]
array([0.00389086, 0.27394775, 0.26565889, 0.62125279])
When you apply a function like sum with the axis parameter, that dimension gets eliminated and an array with one dimension fewer than the original gets created. For each cell in the new array, the operator receives the list of elements along the eliminated axis and applies the reduction function to get a scalar.
>>> np.sum(a, axis=2).shape
(2, 3, 5, 6)
Now you can check that the first element of this array is sum of above elements:
>>> np.sum(a, axis=2)[0, 0, 0, 0]
1.1647502999560164
>>> a[0, 0, :, 0, 0].sum()
1.1647502999560164
axis=None has the special meaning of flattening the array and applying the function to all numbers.
Now you can think about more complex cases where axis is not just number but a tuple:
>>> np.sum(a, axis=(2,3)).shape
(2, 3, 6)
Note that we use same technique to figure out how this reduction was done:
>>> np.sum(a, axis=(2,3))[0,0,0]
7.889432081931909
>>> a[0, 0, :, :, 0].sum()
7.88943208193191
You can also use the same reasoning for adding a dimension to an array instead of reducing one:
>>> x = np.random.rand(3, 4)
>>> y = np.random.rand(3, 4)
# New dimension is created on specified axis
>>> np.stack([x, y], axis=2).shape
(3, 4, 2)
>>> np.stack([x, y], axis=0).shape
(2, 3, 4)
# To retrieve item i from the stack, index with i along that axis,
# e.g. stacked[:, :, i] for axis=2
Hope this gives you generic and full understanding of this important parameter.
Some answers are too specific or do not address the main source of confusion. This answer attempts to provide a more general but simple explanation of the concept, with a simple example.
The main source of confusion is wording such as "Axis along which the means are computed", from the documentation of the axis argument of the numpy.mean function. What does "along which" even mean here? "Along which" essentially means that you will sum the rows (and divide by the number of rows, given that we are computing the mean) if the axis is 0, and the columns if the axis is 1. The rows here can be scalars or vectors or even other multi-dimensional arrays.
In [1]: import numpy as np
In [2]: a=np.array([[1, 2], [3, 4]])
In [3]: a
Out[3]:
array([[1, 2],
       [3, 4]])
In [4]: np.mean(a, axis=0)
Out[4]: array([2., 3.])
In [5]: np.mean(a, axis=1)
Out[5]: array([1.5, 3.5])
So, in the example above, np.mean(a, axis=0) returns array([2., 3.]) because (1 + 3)/2 = 2 and (2 + 4)/2 = 3. It returns an array of two numbers because it returns the mean of the rows for each column (and there are two columns).
Both the 1st and 2nd replies are great for understanding the ndarray concept in numpy. I am giving a simple example.
And according to this image by @debaonline4u:
https://i.stack.imgur.com/O5hBF.jpg
Suppose you have a 2D array -
[1, 2, 3]
[4, 5, 6]
In numpy format it will be -
c = np.array([[1, 2, 3],
              [4, 5, 6]])
Now,
c.ndim = 2 (number of dimensions/axes)
c.shape = (2, 3) (length of axis 0, length of axis 1)
c.sum(axis=0) = [1+4, 2+5, 3+6] = [5, 7, 9] (summing corresponding elements of the two rows, i.e. down the columns, so along axis 0)
c.sum(axis=1) = [1+2+3, 4+5+6] = [6, 15] (summing the elements within each row, so along axis 1)
So for your 3D array of shape (3, 2, 2), the same reasoning applies, as sketched below: axis=0 sums corresponding elements of the three 2x2 blocks, axis=1 sums the rows within each block, and axis=2 sums the pair in each row.
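A sketch of that reasoning on the 3D array e from the question:

import numpy as np

e = np.array([[[1, 0], [0, 0]],
              [[1, 1], [1, 0]],
              [[1, 0], [0, 1]]])

# axis=0: [[1+1+1, 0+1+0], [0+1+0, 0+0+1]] -> [[3, 1], [1, 1]]
print(e.sum(axis=0))
# axis=1: [[1+0, 0+0], [1+1, 1+0], [1+0, 0+1]] -> [[1, 0], [2, 1], [1, 1]]
print(e.sum(axis=1))
# axis=2: [[1+0, 0+0], [1+1, 1+0], [1+0, 0+1]] -> [[1, 0], [2, 1], [1, 1]]
print(e.sum(axis=2))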