I'm working on a machine learning project for university and I'm having trouble understanding some bits of code online. Here's the example:
digits = np.loadtxt(raw_data, delimiter=",")
x_train, y_train = digits[:,:-1], digits[:,-1:].squeeze()
What do the slices in the second line mean? I'm trying to make a slice selecting the first 2/3 of the array, which I've done before with something like [:2*array_elements // 3], but I don't understand how to do it when there's a delimiter (the comma) in the middle of the index.
numpy (or any other library, but this looks like numpy) can implement __getitem__ to accept tuples, unlike the stdlib containers, which (afaik) only accept single values such as integers, strings, or slice objects.
You want to look at the slice "parts" individually, as separated by the , delimiters. So [:,:-1] is actually : and :-1, which are completely independent.
First slice
: is "all", no slicing along that axis.
:x is all up until (and not including) x and -1 means the last element, so...
:-1 is all up until (and not including) the last.
Second slice
x: is all after (and including) x, and we already know about -1 so...
-1: is all after (and including) the last -- in this case just the last.
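To see the tuple mechanism for yourself, here is a minimal sketch using a toy class (ShowIndex is a hypothetical name, not part of numpy) whose __getitem__ simply returns whatever index object Python hands it:

```python
class ShowIndex:
    """Toy container that just returns whatever index object it receives."""

    def __getitem__(self, index):
        return index


probe = ShowIndex()

# A single slice arrives as one slice object.
single = probe[:-1]

# Comma-separated parts arrive as a tuple, one entry per axis.
pair = probe[:, :-1]

print(single)  # slice(None, -1, None)
print(pair)    # (slice(None, None, None), slice(None, -1, None))
```

Each entry of the tuple is then applied to its own axis, which is why the two parts can be read independently.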
There are two mechanisms involved here.
Python's notation for slicing arrays (see: Understanding Python's slice notation)
Basically the syntax is array[x:y], where the resulting slice starts at x (included) and ends at y (excluded).
If the start (resp. end) is omitted, it means "from the first item" (resp. "to the last item"). This is a shortcut.
Also, negative indices count from the end of the array:
array[-1:]
# The elements from the last index up to the end of the array,
# which means a list containing only the last element.
array[-1:] == [array[-1]]
# Note that array[-1:0] is empty, because the start already lies past the stop.
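A quick check of how negative indices behave in slices, using a plain list with made-up values:

```python
items = [10, 20, 30, 40]

last_only = items[-1:]      # from the last element to the end
all_but_last = items[:-1]   # everything except the last element
empty = items[-1:0]         # start (index 3) already lies past stop (index 0)

print(last_only)     # [40]
print(all_but_last)  # [10, 20, 30]
print(empty)         # []
```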
numpy's 2-dimensional arrays (assuming np stands for numpy):
Numpy frequently uses 2-dimensional arrays as matrices. To access the element in row x and column y, you can write matrix[x,y].
Python's notation for slicing arrays also applies here, to cut a matrix into a smaller sub-matrix.
So, back at your problem:
digits[:,:-1]
= digits[start:end , start:-1]
= digits[start:end , start:end-1]
= the sub-matrix where you take all the rows (start:end) and you take all the columns except the last one (start:end-1)
And
digits[:,-1:]
= digits[start:end , -1:end]
= digits[start:end , end-1:end]
= the sub-matrix with all the rows and only the last column
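Here is a small sketch of both slices on a made-up array. The 6x4 shape and its values are assumptions standing in for the loaded CSV, and the last lines show one way to take the first 2/3 of the rows:

```python
import numpy as np

# Hypothetical stand-in for the loaded CSV: 6 samples, 3 features + 1 label column.
digits = np.arange(24, dtype=float).reshape(6, 4)

x_train = digits[:, :-1]            # all rows, every column except the last
y_train = digits[:, -1:].squeeze()  # all rows, last column only, flattened to 1-D

# First two thirds of the rows, all columns.
n_rows = digits.shape[0]
first_two_thirds = digits[: 2 * n_rows // 3, :]

print(x_train.shape)           # (6, 3)
print(y_train.shape)           # (6,)
print(first_two_thirds.shape)  # (4, 4)
```

The row slice and the column slice are separated by the comma, so the 2/3 expression goes in the first position and a plain : keeps every column.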
I have a 5 dimension array like this
a=np.random.randint(10,size=[2,3,4,5,600])
a.shape #(2,3,4,5,600)
I want to get the first element of the 2nd dimension, and several elements of the last dimension
b=a[:,0,:,:,[1,3,5,30,17,24,30,100,120]]
b.shape #(9,2,4,5)
As you can see, the last dimension was automatically moved to the front.
Why? And how can I avoid that?
This behavior is described in the numpy documentation. In the expression
a[:,0,:,:,[1,3,5,30,17,24,30,100,120]]
both 0 and [1,3,5,30,17,24,30,100,120] are advanced indexes, separated by slices. As the documentation explains, in such a case the dimensions coming from the advanced indexes come first in the resulting array.
If we replace 0 by the slice 0:1 it will change this situation (since it will leave only one advanced index), and then the order of dimensions will be preserved. Thus one way to fix this issue is to use the 0:1 slice and then squeeze the appropriate axis:
a[:,0:1,:,:,[1,3,5,30,17,24,30,100,120]].squeeze(axis=1)
Alternatively, one can keep both advanced indexes, and then rearrange axes:
np.moveaxis(a[:,0,:,:,[1,3,5,30,17,24,30,100,120]], 0, -1)
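The shapes produced by the three expressions can be compared directly. This is only a sketch on random data; the variable names idx, mixed, sliced and moved are introduced here for illustration:

```python
import numpy as np

a = np.random.randint(10, size=[2, 3, 4, 5, 600])
idx = [1, 3, 5, 30, 17, 24, 30, 100, 120]

mixed = a[:, 0, :, :, idx]                      # two advanced indexes split by slices
sliced = a[:, 0:1, :, :, idx].squeeze(axis=1)   # only one advanced index remains
moved = np.moveaxis(a[:, 0, :, :, idx], 0, -1)  # reorder the axes afterwards

print(mixed.shape)   # (9, 2, 4, 5)
print(sliced.shape)  # (2, 4, 5, 9)
print(moved.shape)   # (2, 4, 5, 9)
print(np.array_equal(sliced, moved))  # True
```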
What is the difference between the two lines below?
I know [::-1] will reverse the matrix, but I want to know what [::] on the left-hand side of '=' does, since in the first case the matrix gets reversed in place without iterating over each element.
matrix[::] = matrix[::-1]
matrix = matrix[::-1]
The technique you are looking for is called slicing. It is an advanced way to reference elements in a container: instead of using a single index, you use a slice to reference a range of elements.
A slice consists of start, end and step, like this: matrix[start:end:step]. You can skip any of the parts, and the default values will be used: 0, len(matrix), 1.
Of course, the container must support this technique (protocol).
matrix[::] = # get all elements of the matrix and assign something to them
matrix = # link matrix name with something
matrix[::-1] # get all elements of the matrix in reversed order
So, the first one actually copies elements into different positions of the same object.
The second one just binds the name matrix to a new object constructed from a slice of matrix.
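The difference becomes visible once a second name points at the same object. A minimal sketch, assuming matrix is a plain list of lists (the question does not say which container it is):

```python
matrix = [[1, 2], [3, 4], [5, 6]]
alias = matrix                 # second name for the same list object

matrix[::] = matrix[::-1]      # overwrite the contents of the existing list
print(alias)                   # [[5, 6], [3, 4], [1, 2]]
print(alias is matrix)         # True

matrix = matrix[::-1]          # bind the name to a brand-new list
print(matrix)                  # [[1, 2], [3, 4], [5, 6]]
print(alias)                   # [[5, 6], [3, 4], [1, 2]]
print(alias is matrix)         # False
```

This matters, for example, inside a function that is expected to modify its argument in place: only the slice assignment changes the list the caller sees.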
I have a very quick question about the notation for accessing an array in Python.
It is this line:
trainPredictPlot[look_back:len(trainPredict) + look_back, :] = trainPredict
I've seen arrays are being accessed like this x[a:b] but never like this x[a:b,:]
Can someone explain in detail what this line of code is doing? What does it mean to put a colon before the closing bracket? What about the comma?
When you use x[a:b], you are taking the elements from position a (x[a]) up to, but not including, position b of a one-dimensional array, so the last element taken is x[b-1].
In the second case, x[a:b,:], the array is two-dimensional: you take positions a up to (but not including) b along the first dimension, and all the elements along the second dimension. In other words, from x[a][first element] to x[b-1][last element].
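A small sketch of what the assignment in the question does. The array sizes and values here are made up, and plot and train_predict are stand-ins for trainPredictPlot and trainPredict:

```python
import numpy as np

look_back = 2
# Hypothetical predictions, shape (3, 1).
train_predict = np.array([[0.1], [0.2], [0.3]])

# Hypothetical plot buffer: 8 rows, 1 column, filled with NaN.
plot = np.full((8, 1), np.nan)

# Rows look_back up to (not including) look_back + len(train_predict), all columns.
plot[look_back:len(train_predict) + look_back, :] = train_predict

print(plot[2:5, 0].tolist())             # [0.1, 0.2, 0.3]
print(np.isnan(plot[:, 0]).tolist())     # [True, True, False, False, False, True, True, True]
```

Only rows 2, 3 and 4 are overwritten; the rows outside the slice keep their NaN values.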
On Python2.4, the single colon slice operator : works as expected on Numeric matrices, in that it returns all values for the dimension it was used on. For example all X and/or Y values for a 2-D matrix.
On Python2.6, the single colon slice operator seems to have a different effect in some cases: for example, on a regular 2-D MxN matrix, m[:] can result in zeros(<some shape tuple>, 'l') being returned as the resulting slice. The full matrix is what one would expect - which is what one gets using Python2.4.
Using either a double colon :: or 3 dots ... in Python2.6, instead of a single colon, seems to fix this issue and return the proper matrix slice.
After some guessing, I discovered you can get the same zeros output when inputting 0 as the stop index. e.g. m[<any index>:0] returns the same "zeros" output as m[:]. Is there any way to debug what indexes are being picked when trying to do m[:]? Or did something change between the two Python versions (2.4 to 2.6) that would affect the behavior of slicing operators?
The version of Numeric being used (24.2) is the same between both versions of Python. Why does the single colon slicing NOT work on Python 2.6 the same way it works with version 2.4?
Python2.6:
>>> a = array([[1,2,3],[4,5,6]])
>>> a[:]
zeros((0, 3), 'l')
>>> a[::]
array([[1,2,3],[4,5,6]])
>>> a[...]
array([[1,2,3],[4,5,6]])
Python2.4:
>>> a = array([[1,2,3],[4,5,6]])
>>> a[:]
array([[1,2,3],[4,5,6]])
>>> a[::]
array([[1,2,3],[4,5,6]])
>>> a[...]
array([[1,2,3],[4,5,6]])
(I typed the "code" up from scratch, so it may not be fully accurate syntax or printout-wise, but shows what's happening)
It seems the problem is an integer overflow issue. In the Numeric source code, the matrix data structure being used is in a file called MA.py. The specific class is called MaskedArray. There is a line at the end of the class that sets the "array()" function to this class. I had much trouble finding this information but it turned out to be very critical.
There is also a getslice(self, i, j) method in the MaskedArray class that takes in the start/stop indices and returns the proper slice. After finding this and adding debug for those indices, I discovered that under the good case with Python2.4, when doing a slice for an entire array the start/stop indices automatically input are 0 and 2^31-1, respectively. But under Python2.6, the stop index automatically input changed to be 2^63-1.
Somewhere, probably in the Numeric source/library code, there are only 32 bits to store the stop index when slicing arrays. Hence, the 2^63-1 value was overflowing (as would any value of 2^31 or greater). The output slice in these bad cases ends up being equivalent to slicing from start 0 to stop 0, i.e. an empty matrix. When you slice from [0:-1] you do get a valid slice. I think (2^63 - 1) interpreted as a 32-bit number would come out to -1. I'm not quite sure why the output of slicing from 0 to 2^63-1 is the same as slicing from 0 to 0 (where you get an empty matrix), and not from 0 to -1 (where you get at least some output).
Although, if I input ending slice indexes that would overflow (i.e. greater than 2^31), but the lower 32 bits were a valid positive non-zero number, I would get a valid slice back. E.g. a stop index of 2^33+1 would return the same slice as a stop index of 1, because the lower 32 bits are 1 in both cases.
Python 2.4 Example code:
>>> a = array([[1,2,3],[4,5,6]])
>>> a[:] # (which actually becomes a[0:2^31-1])
[[1,2,3],[4,5,6]] # correct, expect the entire array
Python 2.6 Example code:
>>> a = array([[1,2,3],[4,5,6]])
>>> a[:] # (which actually becomes a[0:2^63-1])
zeros((0, 3), 'l') # incorrect b/c of overflow, should be full array
>>> a[0:0]
zeros((0, 3), 'l') # correct, b/c slicing range is null
>>> a[0:2**33+1]
[ [1,2,3]] # incorrect b/c of overflow, should be full array
# although it returned some data b/c the
# lower 32 bits of (2^33+1) = 1
>>> a[0:-1]
[ [1,2,3]] # correct, although I'm not sure why "a[:]" doesn't
# give this output as well, given that the lower 32
# bits of 2^63-1 equal -1
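The truncation arithmetic described above can be reproduced in modern Python. This is only a sketch of the bit manipulation, not Numeric's actual code, and as_int32 is a hypothetical helper name:

```python
def as_int32(value):
    """Keep the low 32 bits of value and reinterpret them as a signed integer."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low


print(as_int32(2**31 - 1))  # 2147483647
print(as_int32(2**63 - 1))  # -1
print(as_int32(2**33 + 1))  # 1
```

The last two results match the observations above: a stop index of 2^33+1 behaves like 1, and 2^63-1 truncates to -1 under this interpretation.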
I think I was using 2.4 10 years ago. I used numpy back then, but may have added Numeric for its NETCDF capabilities. But the details are fuzzy. And I don't have any of those versions now for testing.
Python documentation back then should be easy to explore. numpy/Numeric documentation was skimpier.
I think Python has always had the basic : slicing for lists: alist[:] to make a copy, alist[1:-1] to slice off the first and last elements, etc.
I don't know when the step was added, e.g. alist[::-1] to reverse a list.
Python started to recognize indexing tuples at the request of numeric developers, e.g. arr[2,4], arr[(2,4)], arr[:, [1,2]], arr[::-1, :]. But I don't know when that appeared.
Ellipsis is also mainly of value for multidimensional indexing. The Python interpreter recognizes ..., but lists don't handle it. At about the same time, the : notation was formally implemented as a slice object.
For example, in 3.5 we can reverse a list with a slice:
In [6]: list(range(10)).__getitem__(slice(None,None,-1))
Out[6]: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
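The same equivalences can be checked on a current system. A short sketch, assuming a recent Python 3 and numpy; the array arr is made up for illustration:

```python
import numpy as np

alist = list(range(10))
# The bracket syntax builds a slice object behind the scenes.
print(alist[::-1] == alist[slice(None, None, -1)])  # True

arr = np.arange(24).reshape(2, 3, 4)
# Comma-separated indexes are passed to __getitem__ as one tuple.
print(np.array_equal(arr[::-1, :], arr[(slice(None, None, -1), slice(None))]))  # True
# Ellipsis stands for as many full slices as are needed.
print(np.array_equal(arr[..., 0], arr[:, :, 0]))  # True
```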
I would suggest a couple of things:
make sure you understand numpy (and list) indexing/slicing in a current system
try the same things in the older versions; ask SO questions with concrete examples of the differences. Don't count on any of us to have memories of the old code.
study the documentation to find when suspected features were changed or added.
I have a 2D array (array1) with an arbitrary number of rows. The first column holds strictly monotonically increasing (but not linearly spaced) numbers, which represent a position in my system, while the second column holds a value that represents the state of my system at and around the position in the first column.
Now I have a second array (array2); its range should usually be the same as that of the first column of the first array, but that does not matter too much, as you will see below.
For every element in array2, I am interested in the following:
1. What is the argument (index) in array1[:,0] whose value is closest to the current element in array2?
2. What is the value (array1[:,1]) at that index?
Since array2 will usually be longer than the number of rows in array1, it is perfectly fine if I get the same argument from array1 more than once. In fact, this is what I expect.
The value from 2. is written to the second and third columns of array2, as you will see below.
My stripped-down code looks like this:
from numpy import arange, zeros, absolute, argmin, mod, newaxis, ones
ysize1 = 50
array1 = zeros((ysize1+1,2))
array1[:,0] = arange(ysize1+1)**2
# can be any strictly monotonic increasing array
array1[:,1] = mod(arange(ysize1+1),2)
# in my current case, but could also be something else
ysize2 = (ysize1)**2
array2 = zeros((ysize2+1,3))
array2[:,0] = arange(0,ysize2+1)
# is currently uniformly distributed over the whole range, but does not necessarily have to be
a = 0
for i, array2element in enumerate(array2[:,0]):
    a = argmin(absolute(array1[:,0]-array2element))
    array2[i,1] = array1[a,1]
It works, but takes quite a lot of time to process large arrays. I then tried to implement broadcasting, which seems to work with the following code:
indexarray = argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
array2[:,2]=array1[indexarray,1] # just to compare the results
Unfortunately, I now seem to run into a different problem: I get a memory error in the broadcasting line for the array sizes I am using.
It works for small sizes, but not for larger ones, where len(array2[:,0]) is something like 2**17 (and could be even larger) and len(array1[:,0]) is about 2**14. I understand that the size of the intermediate array is bigger than the available memory. Is there an elegant way around that, or a way to speed up the loop?
I do not need to store the intermediate array(s), I am just interested in the result.
Thanks!
First, let's simplify this line:
argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
it should be:
a = array1[:, 0]
b = array2[:, 0]
argmin(abs(a - b[:, newaxis]), 1)
But even when simplified, you're creating two large temporary arrays. If a and b have sizes M and N, a - b[:, newaxis] and abs(...) each create a temporary array of shape (N, M). Because you've said that a is monotonically increasing, you can avoid the issue altogether by using a binary search (sorted search), which is much faster anyway. Take a look at the answer I wrote to this question a while back. Using the function from this answer, try this:
closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]
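The find_closest function itself is not reproduced in this thread. The following is a minimal sketch of what such a binary-search helper might look like, built on np.searchsorted. The body is an assumption, not the code from the linked answer, and it is checked here against the brute-force argmin on the arrays from the question:

```python
import numpy as np


def find_closest(sorted_values, targets):
    """Return, for each target, the index of the nearest entry in sorted_values.

    Hypothetical stand-in for the helper named above; assumes sorted_values
    is 1-D and strictly increasing with at least two elements.
    """
    # Index of the first entry >= target, clipped so idx - 1 and idx are valid.
    idx = np.searchsorted(sorted_values, targets)
    idx = np.clip(idx, 1, len(sorted_values) - 1)
    left = sorted_values[idx - 1]
    right = sorted_values[idx]
    # Pick the left neighbour wherever it is at least as close
    # (ties resolve to the lower index, like argmin).
    return np.where(targets - left <= right - targets, idx - 1, idx)


# Same test data as in the question.
array1 = np.zeros((51, 2))
array1[:, 0] = np.arange(51) ** 2
array1[:, 1] = np.mod(np.arange(51), 2)
targets = np.arange(0, 50**2 + 1, dtype=float)

closest = find_closest(array1[:, 0], targets)
brute = np.argmin(np.abs(array1[:, 0] - targets[:, np.newaxis]), axis=1)
print(np.array_equal(closest, brute))  # True
```

This needs only a few temporary arrays of length N rather than an (N, M) matrix, so memory use stays proportional to the size of array2.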