Related
I came across a code snippet where I could not understand two of the statements, though I could see the end result of each.
I will create a variable before giving the statements:
train = np.random.random((10,100))
One of them read as :
train = train[:-1, 1:-1]
What does this slicing mean? How to read this? I know that that -1 in slicing denotes from the back. But I cannot understand this.
Another statement read as follows:
la = [0.2**(7-j) for j in range(1,t+1)]
np.array(la)[:,None]
What does slicing with None as in [:,None] mean?
For the above two statements, along with how each statement is read, it will be helpful to have an alternative method along, so that I understand it better.
One of Python's strengths is its uniform application of straightforward principles. Numpy indexing, like all indexing in Python, passes a single argument to the indexed object's (i.e., the array's) __getitem__ method, and numpy arrays were one of the primary justifications for the slicing mechanism (or at least one of its very early uses).
When I'm trying to understand new behaviours I like to start with a concrete and comprehensible example, so rather than 10x100 random values I'll start with a one-dimensional 4-element vector and work up to 3x4, which should be big enough to understand what's going on.
simple = np.array([1, 2, 3, 4])
train = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
The interpreter shows these as
array([1, 2, 3, 4])
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
The expression simple[x] is equivalent to (which is to say the interpreter ends up executing) simple.__getitem__(x) under the hood - note this call takes a single argument.
The numpy array's __getitem__ method implements indexing with an integer very simply: it selects a single element from the first dimension. So simple[1] is 2, and train[1] is array([5, 6, 7, 8]).
When __getitem__ receives a tuple as an argument (which is how Python's syntax interprets expressions like array[x, y, z]) it applies each element of the tuple as an index to successive dimensions of the indexed object. So result = train[1, 2] is equivalent (conceptually - the code is more complex in implementation) to
temp = train[1] # i.e. train.__getitem__(1)
result = temp[2] # i.e. temp.__getitem__(2)
and sure enough we find that result comes out at 7. You could think of array[x, y, z] as equivalent to array[x][y][z].
Now we can add slicing to the mix. Expressions containing a colon can be regarded as slice literals (I haven't seen a better name for them), and the interpreter creates slice objects for them. As the documentation notes, a slice object is mostly a container for three values, start, stop and slice, and it's up to each object's __getitem__ method how it interprets them. You might find this question helpful to understand slicing further.
With what you now know, you should be able to understand the answer to your first question.
result = train[:-1, 1:-1]
will call train.__getitem__ with a two-element tuple of slices. This is equivalent to
temp = train[:-1]
result = temp[..., 1:-1]
The first statement can be read as "set temp to all but the last row of train", and the second as "set result to all but the first and last columns of temp". train[:-1] is
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
and applying the [1:-1] subscripting to the second dimension of that array gives
array([[2, 3],
[6, 7]])
The ellipsis on the first dimension of the temp subscript says "pass everything," so the subscript expression[...]can be considered equivalent to[:]. As far as theNonevalues are concerned, a slice has a maximum of three data points: _start_, _stop_ and _step_. ANonevalue for any of these gives the default value, which is0for _start_, the length of the indexed object for _stop_, and1for _step. Sox[None:None:None]is equivalent tox[0:len(x):1]which is equivalent tox[::]`.
With this knowledge under your belt you should stand a bit more chance of understanding what's going on.
Let's say I have a numpy array of rgb-imagetype looking like this:
d = [ [ [0, 1, 2], [3, 4, 5], [6 ,7 ,8] ],
[ [9, 10, 11], [12, 13, 14], [15, 16 ,17] ],
[ [18,19, 20], [21, 22, 23], [24, 25 ,26] ] ]
I select a few random r/g or b pixels using random
import random
r = random.sample(range(1, len(d)*len(d[0])*3), 3)
# for example r = [25, 4, 15]
How can I then select the data I want?
Like I want 25th value in array d for the first r_value = 25 which corresponds to d[2][2][1], because it is the 25th value.
What you want to do is index it as a flat or 1d array. There are various ways of doing this. ravel and reshape(-1) create 1d views, flatten() creates a 1d copy.
The most efficient is the flat iterator (an attribute, not method):
In [347]: d.flat[25]
Out[347]: 25
(it can be used in an assignment as well, eg. d.flat[25]=0.
In [341]: idx = [25, 4, 15]
In [343]: d.flat[idx]
Out[343]: array([25, 4, 15])
To find out what the 3d index is, there's utility, unravel_index (and a corresponding ravel_multi_index)
In [344]: fidx=np.unravel_index(idx,d.shape)
In [345]: fidx
Out[345]:
(array([2, 0, 1], dtype=int32),
array([2, 1, 2], dtype=int32),
array([1, 1, 0], dtype=int32))
In [346]: d[fidx]
Out[346]: array([25, 4, 15])
This a tuple, the index for one element is read 'down', e.g. (2,2,1).
On a large array, flat indexing is actually a bit faster:
In [362]: dl=np.ones((100,100,100))
In [363]: idx=np.arange(0,1000000,40)
In [364]: fidx=np.unravel_index(idx,dl.shape)
In [365]: timeit x=dl[fidx]
1000 loops, best of 3: 447 µs per loop
In [366]: timeit x=dl.flat[idx]
1000 loops, best of 3: 312 µs per loop
If you are going to inspect/alter the array linearly on a frequent basis, you can construct a linear view:
d_lin = d.reshape(-1) # construct a 1d view
d_lin[25] # access the 25-th element
Or putting it all in a one-liner:
d.reshape(-1)[25] # construct a 1d view
You can from now on access (and modify) the elements in d_view as a 1d array. So you access the 25-th value with d_lin[25]. You don't have to construct a new view each time you want to access/modify an element: simply reuse the d_lin view.
Furthermore the order of flattening can be specified (order='C' (C-like), order='F' (Fortran-like) or order='A' (Fortran-wise if contiguous in meory, C-like otherwise)). order='F' means that we first iterate over the greatest dimension.
The advantage of a view (but this can also lead to unintended behavior), is that if you assign a new value through d_lin, like d_lin[25] = 3, it will alter the original matrix.
Alternatives are .flat, or np.ravel. So the following are somewhat equivalent:
d.reshape(-1)[25]
np.ravel(d)[25]
d.flat[25]
There are however some differences between the reshape(..) and ravel(..) approach against the flat approach. The most important one is the fact that d.flat does not create a full view. Indeed if we for instance want to pass the view to another function that expects a numpy array, then it will crash, for example:
>>> d.flat.sum()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.flatiter' object has no attribute 'sum'
>>> d.reshape(-1).sum()
351
>>> np.ravel(d).sum()
351
This is not per se a problem. If we want to limit the number of tools (for instance as a protection mechanism), then this will actually give us a bit more security (although we can still set elements elements in bulk, and call np.sum(..) on an flatiter object).
You can use numpy.flatten method, like this:
a = np.array(d)
d[25] # print 26
Assuming you do not want to flatten your array:
if you know the size of the sublists beforehand, you can easily calculate it out. In your example, every element of the main list is a list of exactly 3 elements, each element containing 3. So to access n you could do something like
i = n//9
j = (n%9)//3
k = (n%3)
element = d[i][j][k]
For n=25, you'd get i = 2, j = 2, k = 1, just as you wanted.
In python2, you can (and have to) use the normal / operator instead of //
If you only care on the value, you can flatten your array, and access it directly as
val = d.flatten()[r]
If you really want the index corresponding to the flattened index, you need to something like this:
ix_2 = r % d.shape[2]
helper_2 = (r - ix_2) / d.shape[2]
ix_1 = helper_2 % d.shape[1]
helper_1 = (helper_2 - ix_1) / d.shape[1]
ix_0 = helper_1 % d.shape[0]
val = d[ix_0, ix_1, ix_2]
I know that list aliasing is an issue in Python, but I can't figure out a way around it.
def zeros(A):
new_mat = A
for i in range(len(A)):
for j in range(len(A[i])):
if A[i][j]==0:
for b in range(len(A)):
new_mat[b][j] = 0
else:
new_mat[i][j] = A[i][j]
return A
Even though I don't change the values of A at all, when I return A, it is still modified:
>>> Matrix = [[1,2,3],[5,0,78],[7,3,45]]
>>> zeros(Matrix)
[[1, 0, 3], [5, 0, 78], [7, 0, 45]]
Is this list aliasing? If so, how do you modify elements of a 2D array without aliasing occurring? Thanks so muhc <3.
new_mat = A does not create a new matrix. You have merely given a new name to the object you also knew as A. If it's a list of lists of numbers, you might want to use copy.deepcopy to create a full copy, and if it's a numpy array you can use the copy method.
new_mat = A[:]
This creates a copy of the list instead of just referencing it. Give it a try.
[:] just specifies a slice from beginning to end. You could have [1:] and it would be from element 1 to the end, or [1:4] and it would be element 1 to 4, for instance. Check out list slicing regarding that.
This might help someone else. just do this.
import copy
def zeros(A):
new_mat = copy.deepcopy(A)
QUICK EXPLANATION
If you want A to be the original matrix your function should be like this. Just use np.copy() function
def zeros(A):
new_mat = np.copy(A) #JUST COPY IT WITH np.copy() function
for i in range(len(A)):
for j in range(len(A[i])):
if A[i][j]==0:
for b in range(len(A)):
new_mat[b][j] = 0
else:
new_mat[i][j] = A[i][j]
return A
THOROUGH EXPLANATION
Let's say we don't want numpy ndarray a to have aliasing. For example, we want to prevent a to change any of its values when we take (for example) a slice of it, we assign this slice to b and then we modify one element of b. We want to AVOID this:
import numpy as np
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = a[0]
b[2] = 50
a
Out[]:
array([[ 1, 2, 50],
[ 4, 5, 6],
[ 7, 8, 9]])
Now you might think that treating a numpy-ndarray object as if it was a list object might solve our problem. But it DOES NOT WORK either:
%reset #delete all previous variables
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = a[:][0] #this is how you would have copied a slice of "a" if it was a list
b[2] = 50
a
Out[]:
array([[ 1, 2, 50], #problem persists
[ 4, 5, 6],
[ 7, 8, 9]])
The problem gets solved, in this case, if you think of numpy arrays as different objects than python lists. In order to copy numpy arrays you can't do copy = name_array_to_be_copied[:], but copy = np.copy(name_array_to_be_copied). Therefore, this would solve our aliasing problem:
%reset #delete all previous variables
import numpy as np
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.copy(a)[0] #this is how you copy numpy ndarrays. Calling np.copy() function
b[2] = 50
a
Out[]:
array([[1, 2, 3], #problem solved
[4, 5, 6],
[7, 8, 9]])
P.S. Watch out with the zeros() function. Even after fixing the aliasing issue, your function does not convert to 0 the columns in the new_matrix who have at least one zero in the same column for the matrix A (this is what I think you wanted to acomplish by seeing the incorrectly reported output of your function [[1, 0, 3], [5, 0, 78], [7, 0, 45]], since it actually yields [[1,0,3],[5,0,78],[7,3,45]]). If you want that you can try this:
def zeros_2(A):
new_mat = np.copy(A)
for i in range(len(A[0])): #I assume each row has same length.
if 0 in new_mat[:,i]:
new_mat[:,i] = 0
print(new_mat)
return A
I am using Numpy to store data into matrices. Coming from R background, there has been an extremely simple way to apply a function over row/columns or both of a matrix.
Is there something similar for python/numpy combination? It's not a problem to write my own little implementation but it seems to me that most of the versions I come up with will be significantly less efficient/more memory intensive than any of the existing implementation.
I would like to avoid copying from the numpy matrix to a local variable etc., is that possible?
The functions I am trying to implement are mainly simple comparisons (e.g. how many elements of a certain column are smaller than number x or how many of them have absolute value larger than y).
Almost all numpy functions operate on whole arrays, and/or can be told to operate on a particular axis (row or column).
As long as you can define your function in terms of numpy functions acting on numpy arrays or array slices, your function will automatically operate on whole arrays, rows or columns.
It may be more helpful to ask about how to implement a particular function to get more concrete advice.
Numpy provides np.vectorize and np.frompyfunc to turn Python functions which operate on numbers into functions that operate on numpy arrays.
For example,
def myfunc(a,b):
if (a>b): return a
else: return b
vecfunc = np.vectorize(myfunc)
result=vecfunc([[1,2,3],[5,6,9]],[7,4,5])
print(result)
# [[7 4 5]
# [7 6 9]]
(The elements of the first array get replaced by the corresponding element of the second array when the second is bigger.)
But don't get too excited; np.vectorize and np.frompyfunc are just syntactic sugar. They don't actually make your code any faster. If your underlying Python function is operating on one value at a time, then np.vectorize will feed it one item at a time, and the whole
operation is going to be pretty slow (compared to using a numpy function which calls some underlying C or Fortran implementation).
To count how many elements of column x are smaller than a number y, you could use an expression such as:
(array['x']<y).sum()
For example:
import numpy as np
array=np.arange(6).view([('x',np.int),('y',np.int)])
print(array)
# [(0, 1) (2, 3) (4, 5)]
print(array['x'])
# [0 2 4]
print(array['x']<3)
# [ True True False]
print((array['x']<3).sum())
# 2
Selecting elements from a NumPy array based on one or more conditions is straightforward using NumPy's beautifully dense syntax:
>>> import numpy as NP
>>> # generate a matrix to demo the code
>>> A = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> A
array([[6, 7, 6, 4, 8],
[7, 3, 7, 9, 9],
[4, 2, 5, 9, 8],
[3, 8, 2, 6, 3],
[2, 1, 8, 0, 0],
[8, 3, 9, 4, 8],
[3, 3, 9, 8, 4],
[5, 4, 8, 3, 0]])
how many elements in column 2 are greater than 6?
>>> ndx = A[:,1] > 6
>>> ndx
array([False, True, False, False, True, True, True, True], dtype=bool)
>>> NP.sum(ndx)
5
how many elements in last column of A have absolute value larger than 3?
>>> A = NP.random.randint(-4, 4, 40).reshape(8, 5)
>>> A
array([[-4, -1, 2, 0, 3],
[-4, -1, -1, -1, 1],
[-1, -2, 2, -2, 3],
[ 1, -4, -1, 0, 0],
[-4, 3, -3, 3, -1],
[ 3, 0, -4, -1, -3],
[ 3, -4, 0, -3, -2],
[ 3, -4, -4, -4, 1]])
>>> ndx = NP.abs(A[:,-1]) > 3
>>> NP.sum(ndx)
0
how many elements in the first two rows of A are greater than or equal to 2?
>>> ndx = A[:2,:] >= 2
>>> NP.sum(ndx.ravel()) # 'ravel' just flattens ndx, which is originally 2D (2x5)
2
NumPy's indexing syntax is pretty close to R's; given your fluency in R, here are the key differences between R and NumPy in this context:
NumPy indices are zero-based, in R, indexing begins with 1
NumPy (like Python) allows you to index from right to left using negative indices--e.g.,
# to get the last column in A
A[:, -1],
# to get the penultimate column in A
A[:, -2]
# this is a big deal, because in R, the equivalent expresson is:
A[, dim(A)[0]-2]
NumPy uses colon ":" notation to denote "unsliced", e.g., in R, to
get the first three rows in A, you would use, A[1:3, ]. In NumPy, you
would use A[0:2, :] (in NumPy, the "0" is not necessary, in fact it
is preferable to use A[:2, :]
I also come from a more R background, and bumped into the lack of a more versatile apply which could take short customized functions. I've seen the forums suggesting using basic numpy functions because many of them handle arrays. However, I've been getting confused over the way "native" numpy functions handle array (sometimes 0 is row-wise and 1 column-wise, sometimes the opposite).
My personal solution to more flexible functions with apply_along_axis was to combine them with the implicit lambda functions available in python. Lambda functions should very easy to understand for the R minded who uses a more functional programming style, like in R functions apply, sapply, lapply, etc.
So for example I wanted to apply standardisation of variables in a matrix. Tipically in R there's a function for this (scale) but you can also build it easily with apply:
(R code)
apply(Mat,2,function(x) (x-mean(x))/sd(x) )
You see how the body of the function inside apply (x-mean(x))/sd(x) is the bit we can't type directly for the python apply_along_axis. With lambda this is easy to implement FOR ONE SET OF VALUES, so:
(Python)
import numpy as np
vec=np.random.randint(1,10,10) # some random data vector of integers
(lambda x: (x-np.mean(x))/np.std(x) )(vec)
Then, all we need is to plug this inside the python apply and pass the array of interest through apply_along_axis
Mat=np.random.randint(1,10,3*4).reshape((3,4)) # some random data vector
np.apply_along_axis(lambda x: (x-np.mean(x))/np.std(x),0,Mat )
Obviously, the lambda function could be implemented as a separate function, but I guess the whole point is to use rather small functions contained within the line where apply originated.
I hope you find it useful !
Pandas is very useful for this. For instance, DataFrame.apply() and groupby's apply() should help you.
I'm really confused by the index logic of numpy arrays with several dimensions. Here is an example:
import numpy as np
A = np.arange(18).reshape(3,2,3)
[[[ 0, 1, 2],
[ 3, 4, 5]],
[[ 6, 7, 8],
[ 9, 10, 11]],
[[12, 13, 14],
[15, 16, 17]]])
this gives me an array of shape (3,2,3), call them (x,y,z) for sake of argument. Now I want an array B with the elements from A corresponding to x = 0,2 y =0,1 and z = 1,2. Like
array([[[ 1, 2],
[4, 5]],
[[13, 14],
[16, 17]]])
Naively I thought that
B=A[[0,2],[0,1],[1,2]]
would do the job. But it gives
array([ 2, 104])
and does not work.
A[[0,2],:,:][:,:,[1,2]]
does the job. But I still wonder whats wrong with my first try. And what is the best way to do what I want to do?
There are two types of indexing in NumPy basic and advanced. Basic indexing uses tuples of slices for indexing, and does not copy the array, but rather creates a view with adjusted strides. Advanced indexing in contrast also uses lists or arrays of indices and copies the array.
Your first attempt
B = A[[0, 2], [0, 1], [1, 2]]
uses advanced indexing. In advanced indexing, all index lists are first broadcasted to the same shape, and this shape is used for the output array. In this case, they already have the same shape, so the broadcasting does not do anything. The output array will also have this shape of two entries. The first entry of the output array is obtained by using all first indices of the three lists, and the second by using all second indices:
B = numpy.array([A[0, 0, 1], A[2, 1, 2]])
Your second approach
B = A[[0,2],:,:][:,:,[1,2]]
does work, but it is inefficient. It uses advanced indexing twice, so your data will be copied twice.
To get what you actually want with advanced indexing, you can use
A[np.ix_([0,2],[0,1],[1,2])]
as pointed out by nikow. This will copy the data only once.
In your example, you can get away without copying the data at all, using only basic indexing:
B = A[::2, :, 1:2]
I recommend the following advanced tutorial, which explains the various indexing methods: NumPy MedKit
Once you understand the powerful ways to index arrays (and how they can be combined) it will make sense. If your first try was valid then this would collide with some of the other indexing techniques (reducing your options in other use cases).
In your example you can exploit that the third index covers a continuous range:
A[[0,2],:,1:]
You could also use
A[np.ix_([0,2],[0,1],[1,2])]
which is handy in more general cases, when the latter indices are not continuous. np.ix_ simply constructs three index arrays.
As Sven pointed out in his answer, there is a more efficient way in this specific case (using a view instead of a copied version).
Edit: As pointed out by Sven my answer contained some errors, which I have removed. I still think that his answer is better, but unfortunately I can't delete mine now.
A[(0,2),:,1:]
If you wanted
array([[[ 1, 2],
[ 4, 5]],
[[13, 14],
[16, 17]]])
A[indices you want,rows you want, col you want]