Let's say I have a numpy array of rgb-imagetype looking like this:
d = [ [ [0, 1, 2], [3, 4, 5], [6 ,7 ,8] ],
[ [9, 10, 11], [12, 13, 14], [15, 16 ,17] ],
[ [18,19, 20], [21, 22, 23], [24, 25 ,26] ] ]
I select a few random r/g or b pixels using random
import random
r = random.sample(range(1, len(d)*len(d[0])*3), 3)
# for example r = [25, 4, 15]
How can I then select the data I want?
For example, for the first value r = 25 I want the element at flat index 25 of d, which corresponds to d[2][2][1].
What you want to do is index it as a flat or 1d array. There are various ways of doing this. ravel and reshape(-1) create 1d views when the memory layout allows it (and copies otherwise); flatten() always creates a 1d copy.
The most efficient is the flat iterator (an attribute, not method):
In [347]: d.flat[25]
Out[347]: 25
(It can be used in an assignment as well, e.g. d.flat[25] = 0.)
In [341]: idx = [25, 4, 15]
In [343]: d.flat[idx]
Out[343]: array([25, 4, 15])
To find out what the 3d index is, there's a utility, unravel_index (and a corresponding ravel_multi_index):
In [344]: fidx=np.unravel_index(idx,d.shape)
In [345]: fidx
Out[345]:
(array([2, 0, 1], dtype=int32),
array([2, 1, 2], dtype=int32),
array([1, 1, 0], dtype=int32))
In [346]: d[fidx]
Out[346]: array([25, 4, 15])
This is a tuple; the index for one element is read 'down', e.g. (2,2,1).
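As a small illustration of reading the tuple 'down', the sketch below rebuilds the example array with np.arange (an assumption about how d was created), zips the per-axis index arrays into one triple per element, and converts them back with np.ravel_multi_index. The names triples and back are illustrative only.

```python
import numpy as np

# Assumed reconstruction of the 3x3x3 example array from the question.
d = np.arange(27).reshape(3, 3, 3)
idx = [25, 4, 15]

# One index array per axis; element k of each array belongs to idx[k].
fidx = np.unravel_index(idx, d.shape)

# Reading "down": zip the per-axis arrays into one (i, j, k) triple per element.
triples = [tuple(int(v) for v in t) for t in zip(*fidx)]
print(triples)  # [(2, 2, 1), (0, 1, 1), (1, 2, 0)]

# ravel_multi_index is the inverse: it maps the triples back to flat indices.
back = np.ravel_multi_index(fidx, d.shape)
print(back.tolist())  # [25, 4, 15]
```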
On a large array, flat indexing is actually a bit faster:
In [362]: dl=np.ones((100,100,100))
In [363]: idx=np.arange(0,1000000,40)
In [364]: fidx=np.unravel_index(idx,dl.shape)
In [365]: timeit x=dl[fidx]
1000 loops, best of 3: 447 µs per loop
In [366]: timeit x=dl.flat[idx]
1000 loops, best of 3: 312 µs per loop
If you are going to inspect/alter the array linearly on a frequent basis, you can construct a linear view:
d_lin = d.reshape(-1) # construct a 1d view
d_lin[25] # access the 25-th element
Or putting it all in a one-liner:
d.reshape(-1)[25] # construct a 1d view
From now on you can access (and modify) the elements through d_lin as a 1d array, so you access the 25-th value with d_lin[25]. You don't have to construct a new view each time you want to access/modify an element: simply reuse the d_lin view.
Furthermore, the order of flattening can be specified: order='C' (C-like, the default), order='F' (Fortran-like) or order='A' (Fortran-like if the array is Fortran-contiguous in memory, C-like otherwise). order='F' means that the first index varies fastest.
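To make the effect of the order argument concrete, here is a minimal sketch on a hypothetical 2x3 array (not the d from the question); the variable names are illustrative.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

c_order = a.reshape(-1, order='C')  # last index varies fastest (row by row)
f_order = a.reshape(-1, order='F')  # first index varies fastest (column by column)

print(c_order.tolist())  # [0, 1, 2, 3, 4, 5]
print(f_order.tolist())  # [0, 3, 1, 4, 2, 5]
```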
The advantage of a view (but this can also lead to unintended behavior), is that if you assign a new value through d_lin, like d_lin[25] = 3, it will alter the original matrix.
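The difference between a view and a copy can be checked directly. The following sketch assumes d is a contiguous array built with np.arange (a stand-in for the question's data) and uses np.shares_memory to confirm which result is tied to the original.

```python
import numpy as np

d = np.arange(27).reshape(3, 3, 3)  # assumed stand-in for the question's array

d_lin = d.reshape(-1)    # a view here, because d is contiguous
d_copy = d.flatten()     # always an independent copy

d_lin[25] = 3            # writes through to d[2, 2, 1]
d_copy[4] = -1           # does not touch d

print(d[2, 2, 1])                   # 3
print(d[0, 1, 1])                   # 4 (unchanged)
print(np.shares_memory(d, d_lin))   # True
print(np.shares_memory(d, d_copy))  # False
```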
Alternatives are .flat, or np.ravel. So the following are somewhat equivalent:
d.reshape(-1)[25]
np.ravel(d)[25]
d.flat[25]
There are however some differences between the reshape(..) and ravel(..) approach against the flat approach. The most important one is the fact that d.flat does not create a full view. Indeed if we for instance want to pass the view to another function that expects a numpy array, then it will crash, for example:
>>> d.flat.sum()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.flatiter' object has no attribute 'sum'
>>> d.reshape(-1).sum()
351
>>> np.ravel(d).sum()
351
This is not per se a problem. If we want to limit the number of tools (for instance as a protection mechanism), then this will actually give us a bit more security (although we can still set elements in bulk, and call np.sum(..) on a flatiter object).
You can use the ndarray.flatten method, like this:
a = np.array(d).flatten()
a[25] # 25
Assuming you do not want to flatten your array:
If you know the size of the sublists beforehand, you can easily calculate the indices. In your example, every element of the main list is a list of exactly 3 elements, each of which contains 3 values. So to access element n you could do something like
i = n//9
j = (n%9)//3
k = (n%3)
element = d[i][j][k]
For n=25, you'd get i = 2, j = 2, k = 1, just as you wanted.
In Python 2, the plain / operator already performs integer division on ints, so it can be used instead of // (which also works there).
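The same arithmetic generalizes to any shape. Below is a minimal sketch with a hypothetical helper, flat_to_index (not part of NumPy), that peels off one axis at a time with divmod and is cross-checked against np.unravel_index. It assumes C (row-major) ordering.

```python
import numpy as np

def flat_to_index(n, shape):
    """Hypothetical helper: convert a C-order flat index to a multi-index."""
    index = []
    # Peel off the fastest-varying (last) axis first.
    for size in reversed(shape):
        n, remainder = divmod(n, size)
        index.append(remainder)
    return tuple(reversed(index))

shape = (3, 3, 3)
print(flat_to_index(25, shape))  # (2, 2, 1)

# Cross-check against NumPy's own conversion for every flat index.
matches = all(
    flat_to_index(n, shape) == tuple(int(v) for v in np.unravel_index(n, shape))
    for n in range(27)
)
print(matches)  # True
```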
If you only care about the value, you can flatten your array, and access it directly as
val = d.flatten()[r]
If you really want the index corresponding to the flattened index, you need to do something like this:
r = np.asarray(r) # allow arithmetic on all sampled indices at once
ix_2 = r % d.shape[2]
helper_2 = (r - ix_2) // d.shape[2] # integer division keeps the indices integral
ix_1 = helper_2 % d.shape[1]
helper_1 = (helper_2 - ix_1) // d.shape[1]
ix_0 = helper_1 % d.shape[0]
val = d[ix_0, ix_1, ix_2]
Fancy Indexing vs Views in Numpy
In an answer to this question, it is explained that different idioms will produce different results.
Using the idiom where fancy indexing is used to choose the values and those values are set to a new value in the same line means that the values in the original object will be changed in place.
However the final example below:
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html
"A final exercise"
The example appears to use the same idiom:
a[x, :][:, y] = 100
but it still produces a different result depending on whether x is a slice or a fancy index (see below):
a = np.arange(12).reshape(3,4)
ifancy = [0,2]
islice = slice(0,3,2)
a[islice, :][:, ifancy] = 100
a
#array([[100, 1, 100, 3],
# [ 4, 5, 6, 7],
# [100, 9, 100, 11]])
a = np.arange(12).reshape(3,4)
ifancy = [0,2]
islice = slice(0,3,2)
a[ifancy, :][:, islice] = 100 # note that ifancy and islice are interchanged here
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
My intuition is that if the first index is a slice, the intermediate result is a view and therefore the values in the original object are changed.
Whereas in the second case the first index is itself a fancy index, so the intermediate result is a copy of the original object. This then means that the original object is not changed when the values of the copy are changed.
Is my intuition correct?
The example hints that one should think of the sequence of getitem and setitem calls. Can someone explain it to me properly in this way?
Python evaluates each set of [] separately. a[x, :][:, y] = 100 is 2 operations.
temp = a[x,:] # getitem step
temp[:,y] = 100 # setitem step
Whether the 2nd line ends up modifying a depends on whether temp is a view or copy.
Remember, numpy is an addon to Python. It does not modify basic Python syntax or interpretation.
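The two-step reading can be verified directly. This is a minimal sketch, reusing ifancy and islice from the question; temp_view and temp_copy are illustrative names, and np.ix_ is shown as one possible way to do the assignment in a single setitem step.

```python
import numpy as np

ifancy = [0, 2]
islice = slice(0, 3, 2)

a = np.arange(12).reshape(3, 4)
temp_view = a[islice, :]   # basic slicing: a view into a
temp_copy = a[ifancy, :]   # fancy indexing: a new array

print(np.shares_memory(a, temp_view))  # True
print(np.shares_memory(a, temp_copy))  # False

# Assigning through the copy leaves a untouched.
temp_copy[:, islice] = 100
unchanged = np.array_equal(a, np.arange(12).reshape(3, 4))
print(unchanged)  # True

# A single setitem on a itself always modifies a in place;
# np.ix_ builds the "rows x columns" index from two fancy indexes.
a[np.ix_(ifancy, ifancy)] = 100
print(a.tolist())  # [[100, 1, 100, 3], [4, 5, 6, 7], [100, 9, 100, 11]]
```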
I was surprised that numpy.split yields a list and not an array. I would have thought it would be better to return an array, since numpy has put a lot of work into making arrays more useful than lists. Can anyone justify numpy returning a list instead of an array? Why would that be a better programming decision for the numpy developers to have made?
A comment pointed out that if the split is uneven, the result can't be an array, at least not one that has the same dtype. At best it would be an object dtype.
But let's consider the case of equal length subarrays:
In [124]: x = np.arange(10)
In [125]: np.split(x,2)
Out[125]: [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])]
In [126]: np.array(_) # make an array from that
Out[126]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
But we can get the same array without split - just reshape:
In [127]: x.reshape(2,-1)
Out[127]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
Now look at the code for split. It just passes the task to array_split. Ignoring the details about alternative axes, it just does
sub_arys = []
for i in range(Nsections):
    # st and end from div_points
    sub_arys.append(sary[st:end])
return sub_arys
In other words, it just steps through array and returns successive slices. Those (often) are views of the original.
So split is not that sophisticated a function. You could generate such a list of subarrays yourself without a lot of numpy expertise.
Another point. The documentation notes that split can be reversed with an appropriate stack. concatenate (and family) takes a list of arrays. If given an array of arrays, or a higher dim array, it effectively iterates on the first dimension, e.g. concatenate(arr) => concatenate(list(arr)).
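A short sketch of these points: the pieces are views of the original, an uneven split produces pieces of different lengths (so they could not form a regular array), and concatenate undoes the split. The variable names are illustrative.

```python
import numpy as np

x = np.arange(10)
parts = np.split(x, 2)                 # a plain Python list of two arrays
print(type(parts).__name__)            # list
print(np.shares_memory(parts[0], x))   # True: the pieces are views of x

parts[0][0] = 99                       # writing to a piece changes x too
print(x[0])                            # 99

# array_split allows uneven pieces, which cannot be stacked into one array.
uneven = np.array_split(np.arange(10), 3)
print([len(p) for p in uneven])        # [4, 3, 3]

# concatenate takes the list and reverses the split.
restored = np.concatenate(parts)
print(np.array_equal(restored, x))     # True
```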
Actually, you are right: it returns a list.
import numpy as np
a=np.random.randint(1,30,(2,2))
b=np.hsplit(a,2)
type(b)
type(b) is list, so there is nothing wrong in the documentation. I also first thought that the documentation was wrong because it doesn't return an array, but when I checked
type(b[0])
type(b[1])
the type was ndarray.
That means it returns a list of ndarrays.
I have a list whose entries are numpy arrays (2D in this case).
Example data:
x=list([np.array([[1,2,3],[11,12,13],[111,112,113]]),np.array([[4,5,6],[14,15,16],[114,115,116],[1114,1115,1116]]),np.array([[11,12,13],[111,112,113]]),np.array([[7,8,9],[17,18,19],[117,118,119],[1117,1118,1119]])])
I want to execute functions on each column of each numpy array separately. Some functions have an axis argument built in, but some do not, e.g. MinMaxScaler.
So far I created this list comprehension:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
Data=list()
Data=[[(scaler.fit_transform(np.reshape(x[i][:,j],(-1,1)))) for j in range(x[i].shape[1])] for i in range(len(x))]
The problem here is that the list comprehension creates a new list with one 1D- numpy array per iteration.
I tried to use hstack and iterate over the list length.
Data=list()
L=list(range(len(x)))
for k in range(len(x)):
    L[k]=np.zeros([x[k].shape[0],x[k].shape[1]])
Data=[[np.hstack((L[i],scaler.fit_transform(np.reshape(x[i][:,j],(-1,1))))) for j in range(x[i].shape[1])] for i in range(len(x))]
But that does not work at all. Of course, it stacks on top of the existing zeros in L and it creates another list per iteration.
Other initializations of L did not work either, even if that is not the main problem:
L=list() #IndexError: list index out of range
L=list(None)*len(x) #TypeError: 'NoneType' object is not iterable
L=list(range(len(x))) #ValueError: all the input arrays must have same number of dimensions
#...and others tried
Does anyone have an idea how to solve this or do I have to do this with the classic for loops?
Thanks for your help
This should work (if i've understood correctly)
def f(column):
    ... # function you want to apply to each column
data = [f(column) for matrix in x for column in matrix.T]
It's a double for loop, equivalent to (but faster than)
data = []
for matrix in x: # iterate through every matrix in the list
    for column in matrix.transpose(): # iterate through every column in the matrix
        data.append(f(column))
With your x (thanks for making it cut-n-paste friendly):
In [291]: x=list([np.array([[1,2,3],[11,12,13],[111,112,113]]),np.array([[4,5,6
...: ],[14,15,16],[114,115,116],[1114,1115,1116]]),np.array([[11,12,13],[11
...: 1,112,113]]),np.array([[7,8,9],[17,18,19],[117,118,119],[1117,1118,111
...: 9]])])
In [292]: x
Out[292]:
[array([[ 1, 2, 3],
[ 11, 12, 13],
[111, 112, 113]]), array([[ 4, 5, 6],
[ 14, 15, 16],
[ 114, 115, 116],
[1114, 1115, 1116]]), array([[ 11, 12, 13],
[111, 112, 113]]), array([[ 7, 8, 9],
[ 17, 18, 19],
[ 117, 118, 119],
[1117, 1118, 1119]])]
In [293]: len(x)
Out[293]: 4
In [294]: [i.shape for i in x]
Out[294]: [(3, 3), (4, 3), (2, 3), (4, 3)]
I haven't tried to digest your intended processing, but since the arrays have different shapes, I don't see how you can avoid processing each separately. They can't be combined into any sort of higher dimensional array.
I'm not going to try to apply fit_transform, but it is apparent that Data is a list of lists. I don't know what those inner lists contain.
Maybe it would help if you described the problem, possibly in a simplified form, with just one element of the x list. I prefer to run a concrete example, and look at the resulting arrays and lists in my own Python session. Word descriptions just aren't clear enough.
I found the answer. It is probably not the sexiest one, but it works. If anyone can translate it into a more pythonic way with list comprehension it would be appreciated but not necessary.
with x:
x=list([np.array([[1,2,3],[11,12,13],[111,112,113]]),np.array([[4,5,6],[14,15,16],[114,115,116],[1114,1115,1116]]),np.array([[11,12,13],[111,112,113]]),np.array([[7,8,9],[17,18,19],[117,118,119],[1117,1118,1119]])])
Version with function, which is interchangeable:
def theFunction(values,f):
    values=f.fit_transform(np.reshape(values,(-1,1)))
    return values
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1)) #define function
data =[0]*len(x) # one slot per matrix in x
for matrix,i in zip(x,range(len(x))): # iterate through every matrix in the list
    for column in matrix.transpose(): # iterate through every column in the matrix
        col=theFunction(column,scaler)
        if 'Matrx' in locals(): # append to the columns collected so far
            Matrx=np.hstack((Matrx,col))
        else: # first column of this matrix
            Matrx=col
    data[i]=Matrx
    del Matrx # start fresh for the next matrix
Without a function, where you define what to do within the loop itself:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1)) #define function
data =[0]*len(x) # one slot per matrix in x
for matrix,i in zip(x,range(len(x))): # iterate through every matrix in the list
    for column in matrix.transpose(): # iterate through every column in the matrix
        col=scaler.fit_transform(np.reshape(column,(-1,1)))
        if 'Matrx' in locals(): # append to the columns collected so far
            Matrx=np.hstack((Matrx,col))
        else: # first column of this matrix
            Matrx=col
    data[i]=Matrx
    del Matrx # start fresh for the next matrix
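For the list-comprehension form that was asked for, one possible sketch is shown below. To keep it self-contained it uses a hypothetical plain-NumPy function, min_max_scale, as a stand-in for MinMaxScaler (it is not the scikit-learn implementation), and a shortened x. np.column_stack glues the processed columns back into one 2D array per input matrix.

```python
import numpy as np

def min_max_scale(column):
    """Hypothetical stand-in for MinMaxScaler on a single 1D column."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    # Assumption: a constant column is mapped to zeros to avoid division by zero.
    if span == 0:
        return np.zeros_like(column)
    return (column - column.min()) / span

# Shortened example data with differently shaped matrices.
x = [np.array([[1, 2, 3], [11, 12, 13], [111, 112, 113]]),
     np.array([[11, 12, 13], [111, 112, 113]])]

# One output matrix per input matrix; columns are processed one by one
# and placed side by side again.
data = [np.column_stack([min_max_scale(col) for col in matrix.T])
        for matrix in x]

print([m.shape for m in data])  # [(3, 3), (2, 3)]
print(data[1].tolist())         # [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
```

Any other per-column function can be swapped in for min_max_scale as long as it returns a 1D array of the same length as its input.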
Situation
I have objects that have attributes which are represented by numpy arrays:
>>> obj = numpy.array([1, 2, 3])
where 1, 2, 3 are the attributes' values.
I'm about to write a few methods that should work equally on both a single object and a group of objects. A group of objects is represented by a 2D numpy array:
>>> group = numpy.array([[11, 21, 31],
... [12, 22, 32],
... [13, 23, 33]])
where the first digit indicates the object and the second digit indicates the attribute. That is 12 is attribute 2 of object 1 and 21 is attribute 1 of object 2.
Why this way and not transposed? Because I want the array indices to correspond to the attributes. That is object_or_group[0] should yield the first attribute either as a single number or as a numpy array, so it can be used for further computations.
Alright, so when I want to compute the dot product for example this works out of the box:
>>> obj = numpy.array([1, 2, 3])
>>> obj.dot(object_or_group)
What doesn't work is element-wise addition.
Input:
>>> group
array([[1, 2, 3],
[4, 5, 6]])
>>> obj
array([10, 20])
The resulting array should be the sum of the first element of group and obj and similar for the second element:
>>> result = numpy.array([group[0] + obj[0],
... group[1] + obj[1]])
>>> result
array([[11, 12, 13],
[24, 25, 26]])
However:
>>> group + obj
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,3) (2,)
Which makes sense considering numpy's broadcasting rules.
It seems that there is no numpy function which performs an addition (or equivalently the broadcasting) along a specified axis. While I could use
>>> (group.T + obj).T
array([[11, 12, 13],
[24, 25, 26]])
this feels very cumbersome (and if, instead of a group, I consider a single object this feels wrong indeed). Especially because numpy covered each and every corner case for its usage, I have the feeling that I might have gotten something conceptually wrong here.
To sum it up
Similarly to
>>> obj1
array([1, 2])
>>> obj2
array([10, 20])
>>> obj1 + obj2
array([11, 22])
(which performs an element-wise - or attribute-wise - addition) I want to do the same for groups of objects:
>>> group
array([[1, 2, 3],
[4, 5, 6]])
while the layout of such a 2D group array must be such that the single objects are listed along the 2nd axis (axis=1) in order to be able to request a certain attribute (or many) via normal indexing: obj[0] and group[0] should both yield the first attribute(s).
What you want to do seems to work with this simple code:
>>> m
array([[1, 2, 3],
[4, 5, 6]])
>>> g = np.array([10,20])
>>> m + g[ : , None]
array([[11, 12, 13],
[24, 25, 26]])
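Why this works comes down to shapes: indexing with None adds a trailing axis, so a (2, 3) array is added to a (2, 1) array, which broadcasts along the columns. A minimal sketch using the question's group and obj (np.newaxis is simply an alias for None):

```python
import numpy as np

group = np.array([[1, 2, 3],
                  [4, 5, 6]])    # shape (2, 3)
obj = np.array([10, 20])         # shape (2,)

col = obj[:, np.newaxis]         # shape (2, 1)
result = group + col             # (2, 3) + (2, 1) broadcasts to (2, 3)

print(col.shape)        # (2, 1)
print(result.tolist())  # [[11, 12, 13], [24, 25, 26]]

# Same values as the transpose-based workaround from the question.
same = np.array_equal(result, (group.T + obj).T)
print(same)             # True
```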
You appear to be confused about which dimension of the matrix is an object and which is an attribute, as evidenced by the changing object size in your examples. In fact, it is the fact that you are swapping dimensions to match that changing size that is throwing you off. You are also using the unfortunate example of a 3x3 group for your dot product, which is further throwing off your explanation.
In the examples below, objects will be three-element vectors, i.e., they will have three attributes each. The example group will have consistently two rows, meaning two objects in it, and three columns, because objects have three attributes.
The first row of the group, group[0], a.k.a. group[0, :], will be the first object in the group. The first column, group[:, 0] will be the first attribute.
Here are a couple of sample objects and groups to illustrate the points that follow:
>>> obj1 = np.array([1, 2, 3])
>>> obj2 = np.array([4, 5, 6])
>>> group1 = np.array([[7, 8, 9],
...                    [0, 1, 2]])
>>> group2 = np.array([[3, 4, 5]])
Addition will work out of the box because of broadcasting now:
>>> obj1 + obj2
array([5, 7, 9])
>>> group1 + obj1
array([[ 8, 10, 12],
[ 1, 3, 5]])
As you can see, corresponding attributes are getting added just fine. You can even add together groups, but only if they are the same size or if one of them only contains a single object:
>>> group1 + group2
array([[10, 12, 14],
[ 3, 5, 7]])
>>> group1 + group1
array([[14, 16, 18],
[ 0, 2, 4]])
The same will be true for all the binary elementwise operators: *, -, /, np.bitwise_and, etc.
The only remaining question is how to make dot products not care if they are operating on a matrix or a vector. It just so happens that dot products don't care. Your common dimension is always the number of attributes, so the second operand (the multiplier) needs to be transposed so that the number of columns becomes the number of rows. np.dot(x1, x2.T), or equivalently x1.dot(x2.T) will work correctly whether x1 and x2 are groups or objects:
>>> obj1.dot(obj2.T)
32
>>> obj1.dot(group1.T)
array([50, 8])
>>> group1.dot(obj1.T)
array([50, 8])
You can use either np.atleast_1d or np.atleast_2d to always coerce the result into a particular shape so you don't end up with a scalar like the obj1.dot(obj2.T) case. I would recommend the latter, so you always have a consistent number of dimensions regardless of the inputs:
>>> np.atleast_2d(obj1.dot(obj2.T))
array([[32]])
>>> np.atleast_2d(obj1.dot(group1.T))
array([[50, 8]])
Just keep in mind that the dimensions of the dot product will be the number of objects in the first operand by the number of objects in the second operand (everything will be treated as a group). The attributes will get multiplied and summed together. Whether or not that has a valid interpretation for your purposes is entirely for you to decide.
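The "objects by objects" shape rule can be wrapped in a small function. The sketch below uses a hypothetical helper, pairwise_dot (the name and the choice to coerce both inputs with np.atleast_2d are assumptions, not something prescribed above), so that the result is always 2D.

```python
import numpy as np

def pairwise_dot(x1, x2):
    """Hypothetical helper: treat both operands as groups, return a 2D result."""
    x1 = np.atleast_2d(x1)   # (objects in x1, attributes)
    x2 = np.atleast_2d(x2)   # (objects in x2, attributes)
    return x1.dot(x2.T)      # (objects in x1, objects in x2)

obj1 = np.array([1, 2, 3])
obj2 = np.array([4, 5, 6])
group1 = np.array([[7, 8, 9],
                   [0, 1, 2]])

print(pairwise_dot(obj1, obj2).tolist())    # [[32]]
print(pairwise_dot(obj1, group1).tolist())  # [[50, 8]]
print(pairwise_dot(group1, group1).shape)   # (2, 2)
```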
UPDATE
The only remaining problem at this point is attribute access. As stated above obj1[0] and group1[0] mean very different things. There are three ways to reconcile this difference, listed in the order that I personally prefer them, with 1 being the most preferable:
1. Use the Ellipsis indexing object to get the last index instead of the first:
>>> obj1[..., 0]
array(1)
>>> group1[..., 0]
array([7, 0])
This is the most efficient way since it does not make any copies, just does a normal index on the original arrays. For a single object (1D array) the result is a 0-d array, which can be used in further computations like a scalar; for a group (2D array) it is a 1D array with one entry per object.
2. Make all your objects 2D. As you pointed out yourself, this can be done with a decorator, and/or using np.atleast_2d. Personally, I would prefer having the convenience of using 1D arrays as single objects without having to wrap them in 2D.
3. Always access attributes via a transpose:
>>> obj1.T[0]
1
>>> group1.T[0]
array([7, 0])
While this is functionally equivalent to #1, it is clunky and unsightly by comparison, in addition to doing something very different under-the-hood. This approach at the very least creates a new view of the underlying array, and may run the risk of making unnecessary copies in certain cases if the group arrays are not laid out just right. I would not recommend this approach even if it does solve the problem of uniform access.
Suppose I have an array a where a.shape is (m*n,). How do I create a new array that comprises the m sums of each group of n elements efficiently?
The best I came up with is:
a.reshape((m, n)).sum(axis=1)
but this creates an extra new array.
I think there is nothing wrong with using reshape and then taking the sum of the rows, I cannot think of anything faster. According to the manual, reshape should (if possible) return a view on the original array, so no large amount of data is copied. When a view is created, numpy only creates a new header with different strides and shape, with a pointer into the data of the original array. This should cost constant time and memory, independent of the array size.
In [23]: x = np.arange(12)
In [24]: y = x.reshape((3, 4))
In [25]: y
Out[25]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [26]: y.base is x # check if it is a view
Out[26]: True
There is another trick, a variant on cumsum, reduceat. In this case
np.add.reduceat(a, np.arange(0,m*n,n))
For m,n=100,10, it is 2x as fast as a.reshape((m,n)).sum(axis=1).
I haven't used it much, so it took a bit of digging to find in the documentation.
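A quick check that both approaches give the same sums, on small illustrative values of m and n (chosen here only for the example). The last line confirms that the reshape itself is a view rather than a copy of the data.

```python
import numpy as np

m, n = 4, 3
a = np.arange(m * n)

# Sum each row of the (m, n) view.
by_reshape = a.reshape(m, n).sum(axis=1)

# reduceat sums a[i:j] for consecutive start offsets 0, n, 2n, ...
by_reduceat = np.add.reduceat(a, np.arange(0, m * n, n))

print(by_reshape.tolist())   # [3, 12, 21, 30]
print(by_reduceat.tolist())  # [3, 12, 21, 30]
print(np.shares_memory(a, a.reshape(m, n)))  # True
```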