Check shape of numpy array - python

I want to write a function that takes a numpy array and checks whether it meets my requirements. One thing that confuses me is that:
np.array([1,2,3]).shape == np.array([[1,2,3],[2,3],[2,43,32]]).shape == (3,)
[1,2,3] should be allowed, while [[1,2,3],[2,3],[2,43,32]] shouldn't.
Allowed shapes:
[0, 1, 2, 3, 4]
[0, 1, 2]
[[1],[2]]
[[1, 2], [2, 3], [3, 4]]
Not Allowed:
[] (empty array is not allowed)
[[0], [1, 2]] (inner dimensions must have same size 1!=2)
[[[4,5,6],[4,3,2]],[[2,3,2],[2,3,4]]] (more than 2 dimensions)

You should start by defining what you want in terms of shape. I tried to infer it from the question; please add more details if this is not correct.
So here we have: (1) an empty array is not allowed, and (2) no more than two dimensions are allowed. That translates the following way:
def is_allowed(arr):
    return arr.shape != (0,) and len(arr.shape) <= 2
The first condition compares your array's shape with the shape of an empty array. The second condition checks that the array has no more than two dimensions.
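A few quick checks of that predicate (the examples are mine):

import numpy as np

print(is_allowed(np.array([0, 1, 2])))    # True: 1-D, non-empty
print(is_allowed(np.array([[1], [2]])))   # True: 2-D with matching rows
print(is_allowed(np.array([])))           # False: shape is (0,)
print(is_allowed(np.zeros((2, 2, 2))))    # False: three dimensions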
With inner dimensions there is a problem: some of the lists you provided as examples do not produce proper numpy arrays. If you cast np.array([[1,2,3],[2,3],[2,43,32]]), you get an array where each element is a list. It is not a "real" numpy array with direct access to all the elements (and recent NumPy versions refuse this ragged conversion outright unless you pass dtype=object). See example:
>>> np.array([[1,2,3],[2,3],[2,43,32]])
array([list([1, 2, 3]), list([2, 3]), list([2, 43, 32])], dtype=object)
>>> np.array([[1,2,3],[2,3, None],[2,43,32]])
array([[1, 2, 3],
       [2, 3, None],
       [2, 43, 32]], dtype=object)
So I would recommend (if you are operating with plain Python lists) checking, without NumPy, that all inner lists have the same length.
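A minimal sketch of such a check on plain lists (the helper name and the exact rules are my reading of the requirements above):

def is_allowed_list(data):
    if len(data) == 0:                              # rule: empty input not allowed
        return False
    if all(not isinstance(x, list) for x in data):  # flat 1-D list of scalars
        return True
    if not all(isinstance(x, list) for x in data):  # mixed scalars and lists
        return False
    # 2-D case: equal inner lengths and no third level of nesting
    n = len(data[0])
    return all(len(row) == n and not any(isinstance(v, list) for v in row)
               for row in data)

print(is_allowed_list([0, 1, 2]))                 # True
print(is_allowed_list([[1, 2], [2, 3], [3, 4]]))  # True
print(is_allowed_list([]))                        # False
print(is_allowed_list([[0], [1, 2]]))             # False: 1 != 2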

Related

Applying torch.combinations on multidimensional tensor or tuple of tensors in PyTorch?

Using PyTorch, torch.combinations will only take a 1D tensor as input but I would like to apply it to each 1D tensor in a multidimensional tensor.
inp = torch.tensor([[1, 2, 3],
                    [2, 3, 4]])
torch.combinations((inp), r=2)
The result is an error saying I can't apply it to that shape but I want to apply it to [1, 2, 3] and [2, 3, 4] individually. I can't do it one by one because the idea is to apply this to large sets of data.
inp = torch.tensor([[1,2,3],[2,3,4]])
inp_tuple = torch.unbind(inp)
print(inp_tuple)
(tensor([1, 2, 3]), tensor([2, 3, 4]))
torch.combinations((inp_tuple), r=2)
I also tried unbinding the tensor and applying it to the tuple of tensors but it gives an error saying it can't be applied to a tuple.
Is there any way that I can get torch.combinations to automatically apply to each individual 1D tensor in a multidimensional tensor or each tensor in a tuple of tensors? If not are there any alternatives to achieve all combinations of each individual part of a multidimensional tensor?
Function torch.combinations returns all possible combinations of size r of the elements contained in the 1D input vector. The reason why multi-dimensional inputs are not supported is probably that you have no guarantee that the different vectors in your input have the exact same number of unique elements. Obviously, if one of the vectors has a duplicate element, then you would end up with one set of combinations bigger than another, which is simply not possible to represent with a homogeneous PyTorch tensor.
So from there on, I will assume that the input tensor inp is a 2D tensor shaped (N, C) where each of its N vectors contains C unique elements. The example you gave would fit to this requirement since both vectors have three unique elements each: {1, 2, 3} and {2, 3, 4}.
>>> inp = torch.tensor([[1,2,3],[2,3,4]])
The idea is to apply torch.combinations to an arrangement tensor whose length equals that of our vectors. We can then use those combinations as indices to gather values from the different vectors of our input tensor.
We can retrieve all combinations of an arrangement with the following:
>>> c = torch.combinations(torch.arange(inp.size(1)), r=2)
>>> c
tensor([[0, 1],
        [0, 2],
        [1, 2]])
Then we need to reshape and expand both inp and c such that they match in number of dimensions:
>>> x = inp[:, None].expand(-1, len(c), -1)
>>> x
tensor([[[1, 2, 3],
         [1, 2, 3],
         [1, 2, 3]],

        [[2, 3, 4],
         [2, 3, 4],
         [2, 3, 4]]])
>>> idx = c[None].expand(len(x), -1, -1)
>>> idx
tensor([[[0, 1],
         [0, 2],
         [1, 2]],

        [[0, 1],
         [0, 2],
         [1, 2]]])
Finally we can apply torch.gather on x and idx on dim=2. This will return a 3D tensor out such that:
out[i][j][k] = x[i][j][index[i][j][k]]
Let's make our call on torch.gather:
>>> x.gather(dim=2, index=idx)
tensor([[[1, 2],
         [1, 3],
         [2, 3]],

        [[2, 3],
         [2, 4],
         [3, 4]]])
Which is the desired result.
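Putting it all together, here is a sketch of the steps above as one helper (the name batched_combinations is mine, not a PyTorch function):

import torch

def batched_combinations(inp, r=2):
    # combinations of column indices: shape (n_comb, r)
    c = torch.combinations(torch.arange(inp.size(1)), r=r)
    # expand input to (N, n_comb, C) and indices to (N, n_comb, r)
    x = inp[:, None].expand(-1, len(c), -1)
    idx = c[None].expand(len(inp), -1, -1)
    return x.gather(dim=2, index=idx)

inp = torch.tensor([[1, 2, 3], [2, 3, 4]])
print(batched_combinations(inp, r=2))
# tensor([[[1, 2], [1, 3], [2, 3]],
#         [[2, 3], [2, 4], [3, 4]]])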

Is there an `np.repeat` which acts on an existing array?

I have a large NumPy array which I want to fill with new data on each iteration of a loop. The array is filled with data repeated along axis 0, for example:
[[1, 5],
 [1, 5],
 [1, 5],
 [1, 5]]
I know how to create this array from scratch in each iteration:
x = np.repeat([[1, 5]], 4, axis=0)
However, I don't want to create a new array every time, because it's a very large array (much larger than 4x2). Instead, I want to create the array in advance using the above code, and then just fill the array with new data on each iteration.
But np.repeat() returns a new array, rather than acting on an existing array. Is there an equivalent of np.repeat() for filling an existing array?
As we noted in comments, you can use a broadcasting assignment to fill your 2d array with a 1d array-like of the appropriate size:
x[...] = [1, 5]
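For instance, a minimal sketch of the preallocate-then-refill pattern (the per-iteration values here are made up):

import numpy as np

x = np.repeat([[1, 5]], 4, axis=0)  # allocate once, up front

for new_row in ([2, 6], [3, 7]):    # hypothetical per-iteration data
    x[...] = new_row                # broadcast assignment reuses x's memory
    # ... work with x here ...

print(x)
# [[3 7]
#  [3 7]
#  [3 7]
#  [3 7]]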
If by any chance your large array always contains the same items in each row (i.e. you won't change these preset values later), you can almost certainly use broadcasting in later parts of your code and just work with an initial x such as
x = np.array([[1, 5]])
This array has shape (1, 2) which is broadcast-compatible with other arrays of shape (4, 2) you might have in the above example.
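For example:

>>> x + np.zeros((4, 2))  # (1, 2) broadcasts against (4, 2)
array([[1., 5.],
       [1., 5.],
       [1., 5.],
       [1., 5.]])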
If you always need the same values in each row and for some reason you can't use broadcasting (both cases are highly unlikely), you can use broadcast_to to create an array with an explicit 2d shape without copying memory:
x_bc = np.broadcast_to([1, 5], (4, 2)) # broadcast 1d [1, 5] to shape (4, 2)
This might work because it has the right shape with only 2 unique elements in memory:
>>> x_bc
array([[1, 5],
       [1, 5],
       [1, 5],
       [1, 5]])
>>> x_bc.strides
(0, 8)
However you can't mutate it, because it's a read-only view:
>>> x_bc[0, :] = [2, 4]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-ae12ecfe3c5e> in <module>
----> 1 x_bc[0, :] = [2, 4]

ValueError: assignment destination is read-only
So, if you only need the same values in each row and you can't use broadcasting and you want to mutate those same rows later, you can use stride tricks to map the same 1d data to a 2d array:
>>> x_in = np.array([1, 5])
>>> x_strided = np.lib.stride_tricks.as_strided(
...     x_in, shape=(4,) + x_in.shape, strides=(0,) + x_in.strides[-1:])
>>> x_strided
array([[1, 5],
       [1, 5],
       [1, 5],
       [1, 5]])
>>> x_strided[0, :] = [2, 4]
>>> x_strided
array([[2, 4],
       [2, 4],
       [2, 4],
       [2, 4]])
This gives you a 2d array of fixed shape that always contains one unique row; mutating any of the rows mutates the rest (since the underlying data is a single row). Handle with care: if you ever want to have two different rows, you'll have to do something else.

Converting values of Existing Numpy ndarray to tuples

Let's say I have a numpy.ndarray with shape (2,3,2) as below,
arr = np.array([[[1,3], [2,5], [1,2]],[[3,3], [6,5], [5,2]]])
I want to reshape it in such a way that:
arr.shape == (2,3)
arr == [[(1,3), (2,5), (1,2)],[(3,3), (6,5), (5,2)]]
and
each value of arr is a size 2 tuple
The reason I want to do this is that I want to take the minimum along axis 0 of the 3-dimensional array, but I want to preserve the value that the min of each row is paired with.
arr = np.array([[[1, 4],
                 [2, 1],
                 [5, 2]],
                [[3, 3],
                 [6, 5],
                 [1, 7]]])
print(np.min(arr, axis=0))
# [[1 3]
#  [2 1]
#  [1 2]]
# Should be:
# [[1 4]
#  [2 1]
#  [1 7]]
If the array contained tuples, it would be 2-dimensional, and the comparison operator used for the minimum would still function correctly,
so I would get the correct result. But I haven't found any way to do this besides iterating over the arrays, which is inefficient and obvious in implementation.
Is it possible to perform this conversion efficiently in numpy?
Don't use tuples at all - just view it as a structured array, which supports the lexicographic comparison you're after:
>>> a = np.array([[[1, 4], [2, 1], [5, 2]], [[3, 3], [6, 5], [1, 7]]])
>>> a_pairs = a.view([('f0', a.dtype), ('f1', a.dtype)]).squeeze(axis=-1)
>>> min_pair = np.partition(a_pairs, 0, axis=0)[0]  # np.min doesn't work on structured types :(
>>> min_pair
array([(1, 4), (2, 1), (1, 7)],
      dtype=[('f0', '<i4'), ('f1', '<i4')])
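If you then need the result back as a plain (3, 2) array, one option (a sketch using structured field access) is:

>>> np.stack([min_pair['f0'], min_pair['f1']], axis=-1)
array([[1, 4],
       [2, 1],
       [1, 7]])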
First, let's find out which pairs to take:
first_eq = arr[0, :, 0] == arr[1, :, 0]
which_compare = np.where(first_eq, 1, 0)
winner = arr[:, np.arange(arr.shape[1]), which_compare].argmin(axis=0)
Here, first_eq is True where the first elements match, so we would need to compare the second elements. It's [False, False, False] in your example. which_compare then is [0, 0, 0] (because the first element of each pair is what we will compare). Finally, winner tells us which of the two pairs to choose along the second axis. It is [0, 0, 1].
The last step is to extract the winners:
arr[winner, np.arange(arr.shape[1])]
That is, take the winner (0 or 1) at each point along the second axis.
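Putting those steps together on the question's example:

import numpy as np

arr = np.array([[[1, 4], [2, 1], [5, 2]],
                [[3, 3], [6, 5], [1, 7]]])

first_eq = arr[0, :, 0] == arr[1, :, 0]
which_compare = np.where(first_eq, 1, 0)
winner = arr[:, np.arange(arr.shape[1]), which_compare].argmin(axis=0)
print(arr[winner, np.arange(arr.shape[1])])
# [[1 4]
#  [2 1]
#  [1 7]]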
Here's one way -
# Fuse each pair into a single comparable key: scale the first element by
# (max of the second column + 1), then add the second element. Get argmin indices.
idx = (arr[...,1] + arr[...,0]*(arr[...,1].max()+1)).argmin(0)
# Finally use advanced indexing to get those rows off the array
out = arr[idx, np.arange(arr.shape[1])]
Sample run -
In [692]: arr
Out[692]:
array([[[3, 4],
        [2, 1],
        [5, 2]],

       [[3, 3],
        [6, 5],
        [5, 1]]])

In [693]: out
Out[693]:
array([[3, 3],
       [2, 1],
       [5, 1]])

Indexing with lists and arrays in numpy appears inconsistent

Inspired by this other question, I'm trying to wrap my mind around advanced indexing in NumPy and build up more intuitive understanding of how it works.
I've found an interesting case. Here's an array:
>>> y = np.arange(10)
>>> y
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
if I index it with a scalar, I get a scalar, of course:
>>> y[4]
4
with a 1D array of integers, I get another 1D array:
>>> idx = [4, 3, 2, 1]
>>> y[idx]
array([4, 3, 2, 1])
so if I index it with a 2D array of integers, I get... what do I get?
>>> idx = [[4, 3], [2, 1]]
>>> y[idx]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: too many indices for array
Oh no! The symmetry is broken. I have to index with a 3D array to get a 2D array!
>>> idx = [[[4, 3], [2, 1]]]
>>> y[idx]
array([[4, 3],
       [2, 1]])
What makes numpy behave this way?
To make this more interesting, I noticed that indexing with numpy arrays (instead of lists) behaves how I'd intuitively expect, and 2D gives me 2D:
>>> idx = np.array([[4, 3], [2, 1]])
>>> y[idx]
array([[4, 3],
       [2, 1]])
This looks inconsistent from where I'm at. What's the rule here?
The reason is the interpretation of lists as indices for numpy arrays: lists are interpreted like tuples, and indexing with a tuple is interpreted by NumPy as multidimensional indexing.
Just like arr[1, 2] returns the element arr[1][2], arr[[[4, 3], [2, 1]]] is identical to arr[[4, 3], [2, 1]] and will, according to the rules of multidimensional indexing, return the elements arr[4, 2] and arr[3, 1].
By adding one more list you tell NumPy that you want fancy indexing along the first dimension only, because the outermost list is effectively interpreted as if you had passed in just one "list of indices for the first dimension": arr[[[[4, 3], [2, 1]]]]. (Note that NumPy has since deprecated this list-as-tuple interpretation of index lists, so newer versions treat nested lists like arrays here.)
From the documentation:
Example
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> x[[0, 1, 2], [0, 1, 0]]
array([1, 4, 5])
and:
Warning
The definition of advanced indexing means that x[(1,2,3),] is fundamentally different than x[(1,2,3)]. The latter is equivalent to x[1,2,3] which will trigger basic selection while the former will trigger advanced indexing. Be sure to understand why this occurs.
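A quick sketch of that distinction (the array shape here is chosen arbitrarily):

>>> x = np.arange(4 * 5 * 6).reshape(4, 5, 6)
>>> x[(1, 2, 3),].shape  # advanced indexing: rows 1, 2, 3 along axis 0
(3, 5, 6)
>>> x[(1, 2, 3)]         # basic indexing: the same as x[1, 2, 3]
45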
In such cases it's probably better to use np.take:
>>> y.take([[4, 3], [2, 1]])  # 2D array
array([[4, 3],
       [2, 1]])
This function [np.take] does the same thing as “fancy” indexing (indexing arrays using arrays); however, it can be easier to use if you need elements along a given axis.
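For example, taking elements along a chosen axis:

>>> M = np.arange(12).reshape(3, 4)
>>> M.take([0, 2], axis=1)  # columns 0 and 2 of every row
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])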
Or convert the indices to an array. That way NumPy interprets it as fancy indexing (arrays are special-cased!) rather than as "multidimensional indexing":
>>> y[np.asarray([[4, 3], [2, 1]])]
array([[4, 3],
       [2, 1]])

How do I access the ith column of a NumPy multidimensional array?

Given:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[i] gives the ith row (e.g. [1, 2]). How do I access the ith column? (e.g. [1, 3, 5]). Also, would this be an expensive operation?
To access column 0:
>>> test[:, 0]
array([1, 3, 5])
To access row 0:
>>> test[0, :]
array([1, 2])
This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It's certainly much quicker than accessing each element in a loop.
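To make that concrete, a rough sketch of the comparison (exact timings are machine-dependent):

import numpy as np
from timeit import timeit

test = np.arange(200_000).reshape(-1, 2)

t_slice = timeit(lambda: test[:, 0], number=100)               # slicing: returns a view, no copy
t_loop = timeit(lambda: [row[0] for row in test], number=100)  # Python loop: visits every row

print(t_slice, t_loop)  # the loop is typically orders of magnitude slower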
>>> test[:,0]
array([1, 3, 5])
this command gives you a 1-D array (a "row vector"); if you just want to loop over it, that's fine, but if you want to hstack it with some other array of shape 3×N, you will get
ValueError: all the input arrays must have same number of dimensions
while
>>> test[:, [0]]
array([[1],
       [3],
       [5]])
gives you a column vector, so that you can do concatenate or hstack operation.
e.g.
>>> np.hstack((test, test[:, [0]]))
array([[1, 2, 1],
       [3, 4, 3],
       [5, 6, 5]])
And if you want to access more than one column at a time you could do:
>>> test = np.arange(9).reshape((3, 3))
>>> test
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> test[:, [0, 2]]
array([[0, 2],
       [3, 5],
       [6, 8]])
You could also transpose and return a row:
In [4]: test.T[0]
Out[4]: array([1, 3, 5])
Although the question has been answered, let me mention some nuances.
Let's say you are interested in the column at index 1 (the second column) of the array
arr = numpy.array([[1, 2],
                   [3, 4],
                   [5, 6]])
As you already know from other answers, to get it in the form of a "row vector" (an array of shape (3,)), you use slicing:
arr_col1_view = arr[:, 1]         # creates a view of column 1 of arr
arr_col1_copy = arr[:, 1].copy()  # creates a copy of column 1 of arr
To check if an array is a view or a copy of another array you can do the following:
arr_col1_view.base is arr # True
arr_col1_copy.base is arr # False
see ndarray.base.
Besides the obvious difference between the two (modifying arr_col1_view will affect the arr), the number of byte-steps for traversing each of them is different:
arr_col1_view.strides[0]  # 8 bytes (one full 2-column row, assuming 4-byte ints as in the int32 example below)
arr_col1_copy.strides[0]  # 4 bytes
see strides and this answer.
Why is this important? Imagine that you have a very big array A instead of the arr:
A = np.random.randint(2, size=(10000, 10000), dtype='int32')
A_col1_view = A[:, 1]
A_col1_copy = A[:, 1].copy()
and you want to compute the sum of all the elements of the first column, i.e. A_col1_view.sum() or A_col1_copy.sum(). Using the copied version is much faster:
%timeit A_col1_view.sum() # ~248 µs
%timeit A_col1_copy.sum() # ~12.8 µs
This is due to the different number of strides mentioned before:
A_col1_view.strides[0] # 40000 bytes
A_col1_copy.strides[0] # 4 bytes
Although it might seem that using column copies is better, it is not always true for the reason that making a copy takes time too and uses more memory (in this case it took me approx. 200 µs to create the A_col1_copy). However if we needed the copy in the first place, or we need to do many different operations on a specific column of the array and we are ok with sacrificing memory for speed, then making a copy is the way to go.
In the case we are interested in working mostly with columns, it could be a good idea to create our array in column-major ('F') order instead of the row-major ('C') order (which is the default), and then do the slicing as before to get a column without copying it:
A = np.asfortranarray(A) # or np.array(A, order='F')
A_col1_view = A[:, 1]
A_col1_view.strides[0] # 4 bytes
%timeit A_col1_view.sum() # ~12.6 µs vs ~248 µs
Now, performing the sum operation (or any other) on a column-view is as fast as performing it on a column copy.
Finally let me note that transposing an array and using row-slicing is the same as using the column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array.
A[:, 1].strides[0] # 40000 bytes
A.T[1, :].strides[0] # 40000 bytes
To get several independent columns, just:
>>> test[:, [0, 2]]
and you will get columns 0 and 2.
>>> test
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
>>> ncol = test.shape[1]
>>> ncol
5
Then you can select the 2nd - 4th columns this way:
>>> test[:, 1:(ncol - 1)]
array([[1, 2, 3],
       [6, 7, 8]])
Note that test here is just a 2-dimensional array, and you can slice it to access whichever range of columns you wish:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[:, a:b]  # provide indices in place of a and b
