Numpy array indexing behavior - python

I was playing with NumPy array indexing and found this odd behavior. When I index with an np.array or a list, it works as expected:
In[1]: arr = np.arange(10).reshape(5,2)
arr[ [1, 1] ]
Out[1]: array([[2, 3],
[2, 3]])
But when I pass a tuple, it gives me a single element:
In[1]: arr = np.arange(10).reshape(5,2)
arr[ (1, 1) ]
Out[1]: 3
A similar tuple-versus-list difference shows up with arr.flat:
In[1]: arr = np.arange(10).reshape(5,2)
In[2]: arr.flat[ [3, 4] ]
Out[2]: array([3, 4])
In[3]: arr.flat[ (3, 4) ]
Out[3]: IndexError: unsupported iterator index
I can't understand what is going on under the hood. What is the difference between a tuple and a list in this case?
Python 3.5.2
NumPy 1.11.1

What's happening is called fancy indexing, or advanced indexing. There's a difference between indexing with slices, or with a list/array. The trick is that multidimensional indexing actually works with tuples due to the implicit tuple syntax:
import numpy as np
arr = np.arange(10).reshape(5,2)
arr[2,1] == arr[(2,1)] # exact same thing: 2,1 matrix element
However, using a list (or array) inside an index expression will behave differently:
arr[[2,1]]
will index into arr with 2, then with 1, so first it fetches arr[2]==arr[2,:], then arr[1]==arr[1,:], and returns these two rows (row 2 and row 1) as the result.
It gets funkier:
print(arr[1:3,0:2])
print(arr[[1,2],[0,1]])
The first one is regular indexing, and it slices rows 1 to 2 and columns 0 to 1 inclusive; giving you a 2x2 subarray. The second one is fancy indexing, it gives you arr[1,0],arr[2,1] in an array, i.e. it indexes selectively into your array using, essentially, the zip() of the index lists.
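As a minimal sketch of the difference described above (the variable names block, pairs and zipped are only illustrative), the two expressions can be compared side by side, together with an explicit zip() version of the fancy-indexing case:

```python
import numpy as np

arr = np.arange(10).reshape(5, 2)

# Basic slicing: rows 1..2 and columns 0..1 form a 2x2 block.
block = arr[1:3, 0:2]

# Fancy indexing: the two index lists are paired element by element.
pairs = arr[[1, 2], [0, 1]]

# The same pairing written out with zip(), for comparison.
zipped = np.array([arr[r, c] for r, c in zip([1, 2], [0, 1])])

print(block)   # [[2 3]
               #  [4 5]]
print(pairs)   # [2 5]
print(zipped)  # [2 5]
```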
Now here's why flat works like that: it returns a flatiter of your array. From help(arr.flat):
class flatiter(builtins.object)
| Flat iterator object to iterate over arrays.
|
| A `flatiter` iterator is returned by ``x.flat`` for any array `x`.
| It allows iterating over the array as if it were a 1-D array,
| either in a for-loop or by calling its `next` method.
So the resulting iterator from arr.flat behaves as a 1d array. When you do
arr.flat[ [3, 4] ]
you're accessing two elements of that virtual 1d array using fancy indexing; it works. But when you're trying to do
arr.flat[ (3,4) ]
you're attempting to access the (3,4) element of a 1d (!) array, which is an error. The reason this raises "unsupported iterator index" rather than the usual "too many indices" IndexError is probably only that arr.flat handles this indexing case itself.

In [387]: arr=np.arange(10).reshape(5,2)
With this list, you are selecting 2 rows from arr
In [388]: arr[[1,1]]
Out[388]:
array([[2, 3],
[2, 3]])
It's the same as if you explicitly marked the column slice (with : or ...)
In [389]: arr[[1,1],:]
Out[389]:
array([[2, 3],
[2, 3]])
Using an array instead of a list works: arr[np.array([1,1]),:]. (It also eliminates some ambiguities.)
With the tuple, the result is the same as if you wrote the indexing without the tuple wrapper. So it selects an element with row index of 1, column index of 1.
In [390]: arr[(1,1)]
Out[390]: 3
In [391]: arr[1,1]
Out[391]: 3
The arr[1,1] is translated by the interpreter to arr.__getitem__((1,1)). As is common in Python 1,1 is shorthand for (1,1).
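One way to see what the interpreter passes to __getitem__ is a tiny class that simply returns its key. ShowKey is a hypothetical helper written for this illustration; it is not part of NumPy:

```python
class ShowKey:
    """Hypothetical helper: returns whatever key the [] operator receives."""

    def __getitem__(self, key):
        return key


s = ShowKey()
print(s[1, 1])     # (1, 1)
print(s[(1, 1)])   # (1, 1)
print(s[[1, 1]])   # [1, 1]
print(s[1:3, 0])   # (slice(1, 3, None), 0)
```

The first two calls hand over the same tuple, which is why NumPy cannot tell arr[1,1] and arr[(1,1)] apart, while the list arrives as a single, non-tuple object.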
In the arr.flat cases you are indexing the array as if it were 1d. np.arange(10)[[2,3]] selects 2 items, while np.arange(10)[(2,3)] is 2d indexing, hence the error.
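A short sketch of that 1d comparison, using a plain 1d array (named flat_arr here for illustration) rather than arr.flat itself, since the exact message raised by arr.flat may differ:

```python
import numpy as np

flat_arr = np.arange(10)

# A list is an advanced index: pick items 2 and 3.
picked = flat_arr[[2, 3]]
print(picked)  # [2 3]

# A tuple means "one index per axis", but a 1d array has only one axis.
raised = False
try:
    flat_arr[(2, 3)]
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```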
A couple of recent questions touch on a messier corner case. Sometimes the list is treated as a tuple. The discussion might be enlightening, but don't go there if it's confusing.
Advanced slicing when passed list instead of tuple in numpy
numpy indexing: shouldn't trailing Ellipsis be redundant?

Related

numpy sum of each array in a list of arrays of different size

Given a list of numpy arrays, each of different length, as that obtained by doing lst = np.array_split(arr, indices), how do I get the sum of every array in the list? (I know how to do it using list-comprehension but I was hoping there was a pure-numpy way to do it).
I thought that this would work:
np.apply_along_axis(lambda arr: arr.sum(), axis=0, arr=lst)
But it doesn't, instead it gives me this error which I don't understand:
ValueError: operands could not be broadcast together with shapes (0,) (12,)
NB: It's an array of sympy objects.
There's a faster way which avoids np.split, and utilizes np.reduceat. We create an ascending array of indices where you want to sum elements with np.append([0], np.cumsum(indices)[:-1]). For proper indexing we need to put a zero in front (and discard the last element, if it covers the full range of the original array.. otherwise just delete the [:-1] indexing). Then we use the np.add ufunc with np.reduceat:
import numpy as np
arr = np.arange(1, 11)
indices = np.array([2, 4, 4])
# this should split like this
# [1 2 | 3 4 5 6 | 7 8 9 10]
np.add.reduceat(arr, np.append([0], np.cumsum(indices)[:-1]))
# array([ 3, 18, 34])
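To check the reduceat result against the straightforward approach, here is a sketch that sums the same chunks with np.split and a list comprehension. It assumes, as in the example above, that the values are chunk lengths; the names sizes, starts, fast and slow are illustrative:

```python
import numpy as np

arr = np.arange(1, 11)
sizes = np.array([2, 4, 4])  # assumed: the length of each chunk

# Start offset of every chunk: [0, 2, 6]
starts = np.append([0], np.cumsum(sizes)[:-1])

fast = np.add.reduceat(arr, starts)

# Reference: split at the same boundaries and sum each piece in a loop.
pieces = np.split(arr, starts[1:])
slow = np.array([p.sum() for p in pieces])

print(fast)  # [ 3 18 34]
print(slow)  # [ 3 18 34]
```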

What is the difference between array[0] and array[0:1] in Python?

My question may look too simple, but I am curious to know why this is available in Python.
Assume we have defined an array of size of (4,3):
import numpy as np
a=np.random.randint(15,size=(4,3))
The result would be something like below:
array([[ 7, 6, 1],
[ 5, 3, 6],
[12, 10, 11],
[ 1, 3, 4]])
What is the difference between:
a[0]
Result:
array([7, 6, 1])
and
a[0:1]
Result:
array([[7, 6, 1]])
As both of them return the same part of the matrix:
7, 6, 1
I do know that the difference is the shape: the former is (3,) and the latter is (1,3).
But my question is why we need both kinds of shape. If you are familiar with Matlab, giving a range using a colon gives you two rows, but in Python it returns the same information with a different shape. What is the point? What is the advantage?
The reason is that you can be confident that array[x:y] always returns a subarray of the original array. So that you can use all the array methods on it. Say you have
map(lambda1, array[x:y])
Even if y-x == 1 or y-x == 0, you are guaranteed to have an array returned from array[x:y] and you can do map over it. Imagine if array[1:2] instead returned a single item, i.e. array[1]. Then the behavior of the above code would depend on what array[1] is, and it is probably not what you want.
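A small sketch of that guarantee with a NumPy array (the array contents are arbitrary): whatever the slice length, the result keeps its row dimension, so the same loop over rows works for zero, one or two rows.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)

for stop in (0, 1, 2):
    sub = a[0:stop]  # always 2d, even when empty
    print(sub.shape, [int(row.sum()) for row in sub])
# (0, 3) []
# (1, 3) [3]
# (2, 3) [3, 12]

# Plain integer indexing drops the row dimension instead.
print(a[0].shape)  # (3,)
```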
I will try to explain with a simplified example.
simple_matrix = [[0,1,2],[3,4,5],[6,7,8]]
The following code is printing a single element from this list of lists:
print (simple_matrix[0])
The element printed is a list, this is because the element at index 0 of simple_matrix is only a list:
>>> [0,1,2]
The use of slicing, like in the following example, returns not a single element but two.
In this case it is simpler to expect a list of elements as return and that is exactly what we see as result:
print (simple_matrix[0:2])
>>> [[0, 1, 2], [3, 4, 5]]
What seems to puzzle you is this output:
print (simple_matrix[0:1])
>>> [[0, 1, 2]]
You get this output because in this case you are not getting a single element from the list, as we did in the first example; you are slicing a list of lists.
This slice returns a list containing the sliced elements, in this case only the list [0, 1, 2].
Colon notation is a shorthand for slicing, so start with a brief definition of terms with some trivial examples. I'd refer you to this great answer to start with understanding how slices work in general. This is contrasted with what I'll term "access notation", or accessing an array element like a[0].
Therefore, in your case, the difference is that a[0] accesses your n-dimensional array at index 0 along the first dimension, which returns the values in every column of that row. In contrast, slicing the array from 0 to 1 selects rows 0 up to but not including 1, which gives a two-dimensional array whose first (and only) element is that same row.
With respect to shapes, it depends what you're doing with the data. For instance, if you needed to access multiple rows at once, it might make more sense to use a wider slice in one go, whereas access notation would require multiple statements or a loop.
A note about Numpy Arrays specifically
Slicing a plain Python list always yields a new list, which is a (shallow) copy of that part of the original. In contrast, slicing a NumPy array yields a view, which shares memory with the original array. Be careful, as modifying this slice will modify the original as well!
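The copy-versus-view difference can be sketched as follows (variable names are illustrative); np.shares_memory reports whether two arrays overlap in memory:

```python
import numpy as np

py_list = [1, 2, 3, 4]
list_slice = py_list[0:2]   # a new list (shallow copy)
list_slice[0] = 99
print(py_list)              # [1, 2, 3, 4]

np_arr = np.array([1, 2, 3, 4])
arr_slice = np_arr[0:2]     # a view into np_arr
arr_slice[0] = 99
print(np_arr)               # [99  2  3  4]
print(np.shares_memory(np_arr, arr_slice))  # True

# Take an explicit copy when the slice must be independent.
safe = np_arr[0:2].copy()
print(np.shares_memory(np_arr, safe))       # False
```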
I believe the point you are confused about is that in Python, when you take a slice of an array, it is inclusive of the start index but EXCLUSIVE of the end index. I believe that in Matlab both the start index and the end index are inclusive.
So for the example you gave:
a[0:1] will take index 0, and not 1.
However, if you were to use a[0:2], you would get what is at indices 0 and 1, which is the result you seemed to be expecting.
This also explains why the shape is different, a[0:1] is doing exactly what you expect. It is giving you a list of rows, but that list only contains 1 row, hence the 1 in the shape (1, 3).
Conversely, a[0] only gives you a single row, and not a list of rows. The row has 3 elements, and hence you get the shape (3,)
array[m:n] returns an array, array[0] returns an element of the array (this has bearing on NumPy stuff too, I promise, just read on):
> py -3
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> array = [1,2,3]
>>> array[0]
1
>>> array[0:1]
[1]
>>>
This is why you get these results:
a[0]
Result:
array([7, 6, 1])
and
a[0:1]
Result:
array([[7, 6, 1]])
If you look carefully, the second returns an array that wraps a list of lists of numbers, while the first returns an array that wraps a list of numbers.
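For reference, here is a sketch using the values from the question, showing the shapes and how to convert between the two forms:

```python
import numpy as np

a = np.array([[7, 6, 1], [5, 3, 6], [12, 10, 11], [1, 3, 4]])

print(a[0].shape)    # (3,)
print(a[0:1].shape)  # (1, 3)
print(a[0:2].shape)  # (2, 3)

# Indexing the one-row slice recovers the row itself.
print(np.array_equal(a[0:1][0], a[0]))              # True

# Adding an axis to the row recovers the one-row slice.
print(np.array_equal(a[0][np.newaxis, :], a[0:1]))  # True
```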

Seemingly inconsistent slicing behavior in numpy arrays

I ran across something that seemed to me like inconsistent behavior in Numpy slices. Specifically, please consider the following example:
import numpy as np
a = np.arange(9).reshape(3,3) # a 2d numpy array
y = np.array([1,2,2]) # vector that will be used to index the array
b = a[np.arange(len(a)),y] # a vector (what I want)
c = a[:,y] # a matrix ??
I wanted to obtain a vector such that the i-th element is a[i,y[i]]. I tried two things (b and c above) and was surprised that b and c are not the same... in fact one is a vector and the other is a matrix! I was under the impression that : was shorthand for "all elements" but apparently the meaning is somewhat more subtle.
After trial and error I somewhat understand the difference now (b == np.diag(c)), but would appreciate clarification on why they are different, what exactly using : implies, and how to understand when to use either case.
Thanks!
It's hard to understand advanced indexing (with lists or arrays) without understanding broadcasting.
In [487]: a=np.arange(9).reshape(3,3)
In [488]: idx = np.array([1,2,2])
Index with a (3,) and (3,) producing shape (3,) result:
In [489]: a[np.arange(3),idx]
Out[489]: array([1, 5, 8])
Index with (3,1) and (3,), result is (3,3)
In [490]: a[np.arange(3)[:,None],idx]
Out[490]:
array([[1, 2, 2],
[4, 5, 5],
[7, 8, 8]])
The slice : does basically the same thing. There are subtle differences, but here it's the same.
In [491]: a[:,idx]
Out[491]:
array([[1, 2, 2],
[4, 5, 5],
[7, 8, 8]])
ix_ does the same thing, converting the (3,) & (3,) to (3,1) and (1,3):
In [492]: np.ix_(np.arange(3),idx)
Out[492]:
(array([[0],
[1],
[2]]), array([[1, 2, 2]]))
A broadcasted sum might help visualize the two cases:
In [495]: np.arange(3)*10+idx
Out[495]: array([ 1, 12, 22])
In [496]: np.sum(np.ix_(np.arange(3)*10,idx),axis=0)
Out[496]:
array([[ 1, 2, 2],
[11, 12, 12],
[21, 22, 22]])
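The shape rule behind these results can be checked directly with np.broadcast, which reports the shape the index arrays broadcast to. This is only a sketch; rows, paired and outer are illustrative names:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
idx = np.array([1, 2, 2])
rows = np.arange(3)

paired = a[rows, idx]          # (3,) with (3,)  -> shape (3,)
outer = a[rows[:, None], idx]  # (3,1) with (3,) -> shape (3, 3)

print(np.broadcast(rows, idx).shape)           # (3,)
print(np.broadcast(rows[:, None], idx).shape)  # (3, 3)
print(np.array_equal(outer, a[:, idx]))        # True
print(np.array_equal(paired, np.diag(outer)))  # True
```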
When you pass
np.arange(len(a)), y
You can view the result as being all the indexed pairs for the zipped elements you passed. In this case, indexing by np.arange(len(a)) and y
np.arange(len(a))
# [0, 1, 2]
y
# [1, 2, 2]
effectively takes elements: (0, 1), (1, 2), and (2, 2).
print(a[0, 1], a[1, 2], a[2, 2]) # 0th, 1st, 2nd elements from each indexer
# 1 5 8
In the second case, you take the entire slice along the first dimension. (Nothing before the colon.) So this is all elements along the 0th axis. You then specify with y that you want the 1st, 2nd, and 2nd element along each row. (0-indexed.)
As you pointed out, it may seem a bit unintuitive that the results are different given that the individual elements of the slice are equivalent:
a[:] == a[np.arange(len(a))]
and
a[:, y] == a[:, y]
However, NumPy advanced indexing cares what type of data structure you pass when indexing (tuples, integers, etc). Things can become hairy very quickly.
The detail behind that is this: first consider all NumPy indexing to be of the general form x[obj], where obj is the evaluation of whatever you passed. How NumPy "behaves" depends on what type of object obj is:
Advanced indexing is triggered when the selection object, obj, is a
non-tuple sequence object, an ndarray (of data type integer or bool),
or a tuple with at least one sequence object or ndarray (of data type
integer or bool).
...
The definition of advanced indexing means that x[(1,2,3),] is
fundamentally different than x[(1,2,3)]. The latter is equivalent to
x[1,2,3] which will trigger basic selection while the former will
trigger advanced indexing. Be sure to understand why this occurs.
In your first case, obj = np.arange(len(a)),y, a tuple that fits the last clause of the definition quoted above (a tuple with at least one ndarray). This triggers advanced indexing and forces the behavior described above.
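The x[(1,2,3),] versus x[(1,2,3)] distinction from the quoted documentation can be sketched like this, assuming a 4x4x4 array so that all three indices are in range:

```python
import numpy as np

x = np.arange(64).reshape(4, 4, 4)

basic = x[(1, 2, 3)]      # same as x[1, 2, 3]: a single element
advanced = x[(1, 2, 3),]  # a 1-tuple holding a sequence: x[1], x[2], x[3]

print(basic)              # 27
print(advanced.shape)     # (3, 4, 4)
print(np.array_equal(advanced, x[[1, 2, 3]]))  # True
```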
As for the second case, [:,y]
When there is at least one slice (:), ellipsis (...) or np.newaxis in
the index (or the array has more dimensions than there are advanced
indexes), then the behaviour can be more complicated. It is like
concatenating the indexing result for each advanced index element.
Demonstrated:
# Concatenate the indexing result for each advanced index element.
np.vstack((a[0, y], a[1, y], a[2, y]))
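As a quick check of that description, with a and y as defined in the question, the stacked rows match a[:, y] and the paired result is its diagonal:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
y = np.array([1, 2, 2])

# One advanced-indexing result per row, stacked back together.
stacked = np.vstack((a[0, y], a[1, y], a[2, y]))

print(np.array_equal(stacked, a[:, y]))  # True
print(a[np.arange(len(a)), y])           # [1 5 8]
print(np.diag(a[:, y]))                  # [1 5 8]
```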

How to do 2D slicing?

I'm trying to do what I think should be simple:
I make a 2D list:
a = [[1,5],[2,6],[3,7]]
and I want to slice out the first column and tried:
1)
a[:,0]
...
TypeError: list indices must be integers or slices, not tuple
2)
a[:,0:1]
...
TypeError: list indices must be integers or slices, not tuple
3)
a[:][0]
[1, 5]
4)
a[0][:]
[1, 5]
5) got it but is this the way to do it?
[aa[0] for aa in a]
Using numpy it would be easy but what is the Python way?
2D slicing like a[:, 0] only works for NumPy arrays, not for lists.
However you can transpose (rows become columns and vice versa) nested lists using zip(*a). After transposing, simply slice out the first row:
a = [[1,5],[2,6],[3,7]]
print(list(zip(*a))) # [(1, 2, 3), (5, 6, 7)]
print(list(list(zip(*a))[0])) # [1, 2, 3]
What you are trying to do in attempts 1 and 2 works with NumPy arrays (or similarly with pandas DataFrames), but not with basic Python lists. If you want to do it with basic Python lists, see the answer from @cricket_007 in the comments to your question.
One of the reasons to use NumPy is exactly this: it makes it much easier to slice arrays with multiple dimensions.
Using [x[0] for x in a] is the clear and proper way.
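A short sketch of the pure-Python options, which also shows why attempts 3 and 4 return a row rather than a column (first_col and columns are illustrative names):

```python
a = [[1, 5], [2, 6], [3, 7]]

# Take element 0 of every row.
first_col = [row[0] for row in a]
print(first_col)  # [1, 2, 3]

# Transposing once gives every column at the same time.
columns = [list(col) for col in zip(*a)]
print(columns)    # [[1, 2, 3], [5, 6, 7]]

# a[:] only copies the outer list, so [0] still selects the first row.
print(a[:][0])    # [1, 5]
print(a[0][:])    # [1, 5]
```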

How to index a numpy array element with an array

I've got a numpy array, and would like to get the value at a specific element. For example, I might like to access the value at [1,1]
import numpy as np
A = np.arange(9).reshape(3,3)
print(A[1,1])
# 4
Now, say I've got the coordinates in an array:
i = np.array([1,1])
How can I index A with my i coordinate array? The following doesn't work:
print(A[i])
# [[3 4 5]
# [3 4 5]]
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
In Python, x[(exp1, exp2, ..., expN)] is equivalent to x[exp1, exp2, ..., expN]; the latter is just syntactic sugar for the former.
So to get the same result as with A[1,1], you have to index with a tuple.
If you use an ndarray as the indexing object, advanced indexing is triggered:
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing
Your best bet is A[tuple(i)]. The tuple(i) call just treats i as a sequence and puts the sequence items into a tuple. Note that if i has more than one dimension, this won't make a nested tuple; the tuple will hold the rows of i as arrays. It doesn't matter in this case, though.
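A minimal sketch of the tuple conversion follows. The coords array and its one-point-per-row layout are assumptions added for illustration; the question only has the single coordinate i:

```python
import numpy as np

A = np.arange(9).reshape(3, 3)
i = np.array([1, 1])

print(A[tuple(i)])    # 4
print(A[i[0], i[1]])  # 4

# Several coordinates at once: one row of coords per point (assumed layout).
coords = np.array([[1, 1], [0, 2], [2, 0]])
# Transposing gives one index array per axis, which advanced indexing pairs up.
print(A[tuple(coords.T)])  # [4 2 6]
```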
