Understanding == applied to a NumPy array

I'm new to Python, and I am learning TensorFlow. In a tutorial using the notMNIST dataset, they give example code to transform the labels matrix to a one-of-n encoded array.
The goal is to take an array consisting of label integers 0...9, and return a matrix where each integer has been transformed into a one-of-n encoded array like this:
0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
...
The code they give to do this is:
# Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
However, I don't understand how this code does that at all. It looks like it just generates an array of integers in the range of 0 to 9, and then compares that with the labels matrix, and converts the result to a float. How does an == operator result in a one-of-n encoded matrix?

There are a few things going on here: numpy's vector ops, adding a singleton axis, and broadcasting.
First, you should be able to see how the == does the magic.
Let's say we start with a simple label array. == behaves in a vectorized fashion, which means that we can compare the entire array with a scalar and get an array consisting of the values of each elementwise comparison. For example:
>>> labels = np.array([1,2,0,0,2])
>>> labels == 0
array([False, False, True, True, False], dtype=bool)
>>> (labels == 0).astype(np.float32)
array([ 0., 0., 1., 1., 0.], dtype=float32)
First we get a boolean array, and then we coerce to floats: False==0 in Python, and True==1. So we wind up with an array which is 0 where labels isn't equal to 0 and 1 where it is.
But there's nothing special about comparing to 0, we could compare to 1 or 2 or 3 instead for similar results:
>>> (labels == 2).astype(np.float32)
array([ 0., 1., 0., 0., 1.], dtype=float32)
In fact, we could loop over every possible label and generate this array. We could use a listcomp:
>>> np.array([(labels == i).astype(np.float32) for i in np.arange(3)])
array([[ 0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.]], dtype=float32)
but this doesn't really take advantage of numpy. What we want to do is have each possible label compared with each element, in other words to compare
>>> np.arange(3)
array([0, 1, 2])
with
>>> labels
array([1, 2, 0, 0, 2])
And here's where the magic of numpy broadcasting comes in. Right now, labels is a 1-dimensional object of shape (5,). If we make it a 2-dimensional object of shape (5,1), then the operation will "broadcast" over the last axis and we'll get an output of shape (5,3) with the results of comparing each entry in the range with each element of labels.
First we can add an "extra" axis to labels using None (or np.newaxis), changing its shape:
>>> labels[:,None]
array([[1],
       [2],
       [0],
       [0],
       [2]])
>>> labels[:,None].shape
(5, 1)
And then we can make the comparison (this is the transpose of the arrangement we were looking at earlier, but that doesn't really matter).
>>> np.arange(3) == labels[:,None]
array([[False,  True, False],
       [False, False,  True],
       [ True, False, False],
       [ True, False, False],
       [False, False,  True]], dtype=bool)
>>> (np.arange(3) == labels[:,None]).astype(np.float32)
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]], dtype=float32)
Broadcasting in numpy is very powerful, and well worth reading up on.
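For completeness, here is a minimal end-to-end sketch of the tutorial's line in action. The labels values are made-up example data, and num_labels = 10 is assumed as in notMNIST; the np.eye variant at the end is an equivalent formulation you may also run into:

import numpy as np

num_labels = 10                  # assumed, as in the notMNIST tutorial
labels = np.array([0, 1, 2, 9])  # made-up example labels

one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
print(one_hot.shape)             # (4, 10)
print(one_hot[0])                # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

# Equivalent formulation: pick rows out of an identity matrix.
assert (one_hot == np.eye(num_labels, dtype=np.float32)[labels]).all()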

In short, == applied to a numpy array means applying element-wise == to the array. The result is an array of booleans. Here is an example:
>>> b = np.array([1,0,0,1,1,0])
>>> b == 1
array([ True, False, False, True, True, False], dtype=bool)
To count, say, how many 1s there are in b, you don't need to cast the array to float, i.e. the .astype(np.float32) can be omitted, because in Python bool is a subclass of int, with True == 1 and False == 0. So here is how you count how many ones are in b:
>>> np.sum((b == 1))
3
Or:
>>> np.count_nonzero(b == 1)
3
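And since a boolean array already behaves as 0s and 1s, here is a small sketch of the same idea (the b.sum() shortcut assumes b holds only 0s and 1s):

import numpy as np

b = np.array([1, 0, 0, 1, 1, 0])

# True sums as 1 and False as 0, so no cast is needed.
print((b == 1).sum())    # 3
print((b == 1).mean())   # 0.5 -- the fraction of ones
print(b.sum())           # 3, works directly if b holds only 0s and 1s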

Related

numpy where condition on index

I have a numpy 1D array of numbers representing columns e.g: [0,0,2,1]
And a matrix e.g:
[[1,1,1],
 [1,1,1],
 [1,1,1],
 [1,1,1]]
Now I want to change the values in the matrix to 0 where the column index is bigger than the value given in the 1D array:
[[1,0,0],
 [1,0,0],
 [1,1,1],
 [1,1,0]]
How can I achieve this? I think I need a condition based on the index, not on the value
Explanation:
The first row of the matrix has indices [0,0 ; 0,1 ; 0,2], where the second index is the column. For the first row the given value is 0; column indices 1 and 2 are bigger than 0, so only position 0,0 is left unchanged.
Assuming a is the 2D array and v is the 1D vector, you can create a mask of the same size and use numpy.where:
x,y = a.shape
np.where(np.tile(np.arange(y), (x,1)) <= v[:,None], a, 0)
Input:
a = np.array([[1,1,1],
              [1,1,1],
              [1,1,1],
              [1,1,1]])
v = np.array([0,0,2,1])
Output:
array([[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1],
       [1, 1, 0]])
Intermediates:
>>> np.tile(np.arange(y), (x,1))
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]
>>> np.tile(np.arange(y), (x,1)) <= v[:,None]
[[ True False False]
 [ True False False]
 [ True  True  True]
 [ True  True False]]
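As a side note, the np.tile step can arguably be skipped: np.arange(y) (shape (3,)) broadcasts against v[:,None] (shape (4,1)) on its own. A minimal sketch of that variant:

import numpy as np

a = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1],
              [1, 1, 1]])
v = np.array([0, 0, 2, 1])

# (3,) <= (4, 1) broadcasts to a (4, 3) mask without materializing the tile.
out = np.where(np.arange(a.shape[1]) <= v[:, None], a, 0)
print(out)
# [[1 0 0]
#  [1 0 0]
#  [1 1 1]
#  [1 1 0]]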
Construct a 2D array whose elements are the corresponding column index, and then mask the elements greater than the corresponding value of the 1D array.
Taking advantage of broadcasting, you can do:
>>> arr = np.ones((4,3))
>>> arr
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
>>> col_thr_idx = np.array([0,0,2,1])
>>> col_thr_idx
array([0, 0, 2, 1])
>>> mask = np.arange(arr.shape[1])[None,:] > col_thr_idx[:,None]
>>> mask
array([[False,  True,  True],
       [False,  True,  True],
       [False, False, False],
       [False, False,  True]])
>>> arr[mask] = 0
>>> arr
array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 1., 1.],
       [1., 1., 0.]])

Numpy replace specific rows and columns of one array with specific rows and columns of another array

I am trying to replace specific rows and columns of a Numpy array as given below.
The values of array a and b are as below initially:
a = [[1 1 1 1]
     [1 1 1 1]
     [1 1 1 1]]
b = [[2 3 4 5]
     [6 7 8 9]
     [0 2 3 4]]
Now, based on a certain probability, I need to perform elementwise replacing of a with the values of b (say, after generating a random number, r, between 0 and 1 for each element, I will replace the element of a with that of b if r > 0.8).
How can I use numpy/scipy to do this in Python with high performance?
With masking. We first generate a matrix of random numbers with the same dimensions as a, and check whether these are larger than 0.8:
mask = np.random.random(a.shape) > 0.8
Now we can assign the values of b where mask is True to the corresponding indices of a:
a[mask] = b[mask]
For example:
>>> a
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
>>> b
array([[2, 3, 4, 5],
       [6, 7, 8, 9],
       [0, 2, 3, 4]])
>>> mask = np.random.random(a.shape) > 0.8
>>> mask
array([[ True, False, False, False],
       [ True, False, False, False],
       [False, False, False, False]])
>>> a[mask] = b[mask]
>>> a
array([[2., 1., 1., 1.],
       [6., 1., 1., 1.],
       [1., 1., 1., 1.]])
So wherever the mask is True (since 0.8 is rather high, we expect on average 12 × 0.2 = 2.4 such values), we assign the corresponding value of b.
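If you want the replacement to be reproducible, here is a sketch of the same masking with a seeded generator (assuming NumPy >= 1.17, where np.random.default_rng is available; the seed 42 is arbitrary):

import numpy as np

a = np.ones((3, 4))
b = np.array([[2, 3, 4, 5],
              [6, 7, 8, 9],
              [0, 2, 3, 4]])

rng = np.random.default_rng(42)    # arbitrary seed for reproducibility
mask = rng.random(a.shape) > 0.8   # True with probability 0.2 per element
a[mask] = b[mask]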

Numpy indexing: set 1 to max value and zeros to all others

I think I've misunderstood something with indexing in numpy.
I have a 3D-numpy array of shape (dim_x, dim_y, dim_z) and I want to find the maximum along the third axis (dim_z), and set its value to 1 and all the others to zero.
The problem is that I end up with several 1 in the same row, even if values are different.
Here is the code :
>>> test = np.random.rand(2,3,2)
>>> test
array([[[ 0.13110146,  0.07138861],
        [ 0.84444158,  0.35296986],
        [ 0.97414498,  0.63728852]],

       [[ 0.61301975,  0.02313646],
        [ 0.14251848,  0.91090492],
        [ 0.14217992,  0.41549218]]])
>>> result = np.zeros_like(test)
>>> result[:test.shape[0], np.arange(test.shape[1]), np.argmax(test, axis=2)]=1
>>> result
array([[[ 1.,  0.],
        [ 1.,  1.],
        [ 1.,  1.]],

       [[ 1.,  0.],
        [ 1.,  1.],
        [ 1.,  1.]]])
I was expecting to end with :
array([[[ 1.,  0.],
        [ 1.,  0.],
        [ 1.,  0.]],

       [[ 1.,  0.],
        [ 0.,  1.],
        [ 0.,  1.]]])
Probably I'm missing something here. From what I've understood, 0:dim_x, np.arange(dim_y) yields dim_x × dim_y index pairs, and np.argmax(test, axis=2) has the shape (dim_x, dim_y), so if the indexing is of the form [x, y, z], a pair [x, y] is not supposed to appear twice.
Could someone explain to me where I'm wrong? Thanks in advance.
What we are looking for
We get the argmax indices along the last axis -
idx = np.argmax(test, axis=2)
For the given sample data, we have idx :
array([[0, 0, 0],
       [0, 1, 1]])
Now, idx covers the first and second axes, while getting those argmax indices.
To assign the corresponding ones in the output, we need to create range arrays for the first two axes covering the lengths along those and aligned according to the shape of idx. Now, idx is a 2D array of shape (m,n), where m = test.shape[0] and n = test.shape[1].
Thus, the range arrays for assignment into first two axes of output must be -
X = np.arange(test.shape[0])[:,None]
Y = np.arange(test.shape[1])
Notice, the extension of the first range array to 2D is needed to have it aligned against the rows of idx and Y would align against the cols of idx -
In [239]: X
Out[239]:
array([[0],
       [1]])
In [240]: Y
Out[240]: array([0, 1, 2])
Schematically put -
idx :
         Y array
        ---------->
        x  x  x  |
        x  x  x  |  X array
                 v
The fault in the original code
Your code was -
result[:test.shape[0], np.arange(test.shape[1]), ..
This is essentially :
result[:, np.arange(test.shape[1]), ...
So, you are selecting all elements along the first axis, instead of only the ones corresponding to the idx indices. In that process you were selecting many more elements than required for assignment, and hence seeing many more 1s than expected in the result array.
The correction
Thus, the only correction needed was indexing into the first axis with the range array and a working solution would be -
result[np.arange(test.shape[0])[:,None], np.arange(test.shape[1]), ...
The alternative(s)
Alternatively, using the range arrays created earlier with X and Y -
result[X,Y,idx] = 1
Another way to get X,Y would be with np.mgrid -
m,n = test.shape[:2]
X,Y = np.ogrid[:m,:n]
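Putting the pieces together, a minimal runnable sketch of the corrected assignment (random input, so the exact values will vary):

import numpy as np

test = np.random.rand(2, 3, 2)
idx = np.argmax(test, axis=2)   # shape (2, 3)

m, n = test.shape[:2]
X, Y = np.ogrid[:m, :n]         # shapes (m, 1) and (1, n)

result = np.zeros_like(test)
result[X, Y, idx] = 1

# Exactly one 1 per (row, column) pair along the last axis:
assert (result.sum(axis=2) == 1).all()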
I think there's a problem with mixing basic (slice) and advanced indexing. It's easier to see when selecting values from an array than with this assignment, but it can result in transposed axes. For a problem like this it is better to use advanced indexing all around, as provided by ix_:
In [24]: test = np.random.rand(2,3,2)
In [25]: idx=np.argmax(test,axis=2)
In [26]: idx
Out[26]:
array([[1, 0, 1],
       [0, 1, 1]], dtype=int32)
with basic and advanced:
In [31]: res1 = np.zeros_like(test)
In [32]: res1[:, np.arange(test.shape[1]), idx]=1
In [33]: res1
Out[33]:
array([[[ 1.,  1.],
        [ 1.,  1.],
        [ 0.,  1.]],

       [[ 1.,  1.],
        [ 1.,  1.],
        [ 0.,  1.]]])
with advanced:
In [35]: I,J = np.ix_(range(test.shape[0]), range(test.shape[1]))
In [36]: I
Out[36]:
array([[0],
       [1]])
In [37]: J
Out[37]: array([[0, 1, 2]])
In [38]: res2 = np.zeros_like(test)
In [40]: res2[I, J , idx]=1
In [41]: res2
Out[41]:
array([[[ 0.,  1.],
        [ 1.,  0.],
        [ 0.,  1.]],

       [[ 1.,  0.],
        [ 0.,  1.],
        [ 0.,  1.]]])
On further thought, the use of the slice for the 1st dimension is just wrong, if the goal is to set or find the 6 argmax values:
In [54]: test
Out[54]:
array([[[ 0.15288242,  0.36013289],
        [ 0.90794601,  0.15265616],
        [ 0.34014976,  0.53804266]],

       [[ 0.97979479,  0.15898605],
        [ 0.04933804,  0.89804999],
        [ 0.10199319,  0.76170911]]])
In [55]: test[I, J, idx]
Out[55]:
array([[ 0.36013289,  0.90794601,  0.53804266],
       [ 0.97979479,  0.89804999,  0.76170911]])
In [56]: test[:, J, idx]
Out[56]:
array([[[ 0.36013289,  0.90794601,  0.53804266],
        [ 0.15288242,  0.15265616,  0.53804266]],

       [[ 0.15898605,  0.04933804,  0.76170911],
        [ 0.97979479,  0.89804999,  0.76170911]]])
With the slice it selects a (2,2,3) set of values from test (or res), not the intended (2,3). There are 2 extra rows.
Here is an easier way to do it:
>>> test == test.max(axis=2, keepdims=1)
array([[[ True, False],
        [ True, False],
        [ True, False]],

       [[ True, False],
        [False,  True],
        [False,  True]]], dtype=bool)
...and if you really want that as floating-point 1.0 and 0.0, then convert it:
>>> (test==test.max(axis=2, keepdims=1)).astype(float)
array([[[ 1.,  0.],
        [ 1.,  0.],
        [ 1.,  0.]],

       [[ 1.,  0.],
        [ 0.,  1.],
        [ 0.,  1.]]])
Here is a way to do it with only one winner per row-column combo (i.e. no ties, as discussed in comments):
rowmesh, colmesh = np.meshgrid(range(test.shape[0]), range(test.shape[1]), indexing='ij')
maxloc = np.argmax(test, axis=2)
flatind = np.ravel_multi_index( [rowmesh, colmesh, maxloc ], test.shape )
result = np.zeros_like(test)
result.flat[flatind] = 1
UPDATE after reading hpaulj's answer:
rowmesh, colmesh = np.ix_(range(test.shape[0]), range(test.shape[1]))
is a more efficient, more numpythonic alternative to my meshgrid call (the rest of the code stays the same).
The issue of why your approach fails is hard to explain, but here's one place where intuition could start: your slicing approach says "all rows, times all columns, times a certain sequence of layers". How many elements is that slice in total? By contrast, how many elements do you actually want to set to 1? It can be instructive to look at the values you get when you view the corresponding test values of the slice you're trying to assign to:
>>> test[:, :, maxloc].shape
(2, 3, 2, 3) # oops! it's because maxloc itself is 2x3
>>> test[:, :, maxloc]
array([[[[ 0.13110146,  0.13110146,  0.13110146],
         [ 0.13110146,  0.07138861,  0.07138861]],

        [[ 0.84444158,  0.84444158,  0.84444158],
         [ 0.84444158,  0.35296986,  0.35296986]],

        [[ 0.97414498,  0.97414498,  0.97414498],
         [ 0.97414498,  0.63728852,  0.63728852]]],


       [[[ 0.61301975,  0.61301975,  0.61301975],
         [ 0.61301975,  0.02313646,  0.02313646]],

        [[ 0.14251848,  0.14251848,  0.14251848],
         [ 0.14251848,  0.91090492,  0.91090492]],

        [[ 0.14217992,  0.14217992,  0.14217992],
         [ 0.14217992,  0.41549218,  0.41549218]]]])  # note the repetition, because in maxloc you're repeatedly asking for layer 0 sometimes, and sometimes repeatedly for layer 1
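A short sketch of the counting argument above (only the sizes matter here, not the values):

import numpy as np

test = np.random.rand(2, 3, 2)
maxloc = np.argmax(test, axis=2)   # shape (2, 3)

print(test[:, :, maxloc].size)     # 36 = 2 * 3 * (2 * 3) -- the slice's total
print(maxloc.size)                 # 6  -- the number of elements we actually want to set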

Padding a numpy array with zeros, and using another array as an index for ones

I'm trying to pad a numpy array, and I cannot seem to find the right approach from the documentation for numpy. I have an array:
a = array([2, 1, 3, 5, 7])
This represents the index for an array I wish to create. So at index value 2 or 1 or 3 etc I would like to have a one in the array, and everywhere else in the target array, to be padded with zeros. Sort of like an array mask. I would also like to specify the overall length of the target array, l. So my ideal function would look something like:
>>> foo(a,l)
array([0, 1, 1, 1, 0, 1, 0, 1, 0, 0])
where l=10 for the above example.
EDIT:
So I wrote this function:
def padwithones(a, l):
    p = np.zeros(l)
    for i in a:
        p = np.insert(p, i, 1)
    return p
Which gives:
Out[19]:
array([ 0.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.])
Which isn't correct!
What you're looking for is basically a one-hot array:
def onehot(foo, l):
    a = np.zeros(l, dtype=np.int32)
    a[foo] = 1
    return a
Example:
In [126]: onehot([2, 1, 3, 5, 7], 10)
Out[126]: array([0, 1, 1, 1, 0, 1, 0, 1, 0, 0])
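As an aside, here is a sketch of an alternative built on np.bincount, in case you want to handle repeated indices explicitly (it assumes all entries of a are non-negative and less than l):

import numpy as np

def onehot_bincount(a, l):
    # Count occurrences of each index, then clip the counts to 0/1.
    return np.minimum(np.bincount(a, minlength=l), 1)

print(onehot_bincount([2, 1, 3, 5, 7], 10))
# [0 1 1 1 0 1 0 1 0 0]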

Numpy matrix binarization using only one expression

I am looking for a way to binarize numpy N-d array based on the threshold using only one expression. So I have something like this:
np.random.seed(0)
np.set_printoptions(precision=3)
a = np.random.rand(4, 4)
threshold, upper, lower = 0.5, 1, 0
a is now:
array([[ 0.02 ,  0.833,  0.778,  0.87 ],
       [ 0.979,  0.799,  0.461,  0.781],
       [ 0.118,  0.64 ,  0.143,  0.945],
       [ 0.522,  0.415,  0.265,  0.774]])
Now I can fire these 2 expressions:
a[a>threshold] = upper
a[a<=threshold] = lower
and achieve what I want:
array([[ 0.,  1.,  1.,  1.],
       [ 1.,  1.,  0.,  1.],
       [ 0.,  1.,  0.,  1.],
       [ 1.,  0.,  0.,  1.]])
But is there a way to do this with just one expression?
We may consider np.where:
np.where(a>threshold, upper, lower)
Out[6]:
array([[0, 1, 1, 1],
       [1, 1, 0, 1],
       [0, 1, 0, 1],
       [1, 0, 0, 1]])
Numpy treats every 1d array as a vector, a 2d array as a sequence of vectors (a matrix), and a 3d+ array as a generic tensor. This means that when we perform operations, we are performing vector math. So you can just do:
>>> a = (a > 0.5).astype(np.int_)
For example:
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)
>>> a = np.random.rand(4, 4)
>>> a
array([[ 0.549,  0.715,  0.603,  0.545],
       [ 0.424,  0.646,  0.438,  0.892],
       [ 0.964,  0.383,  0.792,  0.529],
       [ 0.568,  0.926,  0.071,  0.087]])
>>> a = (a > 0.5).astype(np.int_)  # Where the numpy magic happens.
>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])
What's going on here is that you are automatically iterating through every element of every row in the 4x4 matrix and applying a boolean comparison to each element:
if it is > 0.5, return True, else return False.
Then by calling the .astype method and passing np.int_ as the argument, you're telling numpy to replace all boolean values with their integer representation, in effect binarizing the matrix based on your comparison value.
A shorter method is to simply multiply the boolean matrix from the condition by 1 or 1.0, depending on the type you want.
>>> a = np.random.rand(4,4)
>>> a
array([[ 0.63227032,  0.18262573,  0.21241511,  0.95181594],
       [ 0.79215808,  0.63868395,  0.41706148,  0.9153959 ],
       [ 0.41812268,  0.70905987,  0.54946947,  0.51690887],
       [ 0.83693151,  0.10929998,  0.19219377,  0.82919761]])
>>> (a>0.5)*1
array([[1, 0, 0, 1],
       [1, 1, 0, 1],
       [0, 1, 1, 1],
       [1, 0, 0, 1]])
>>> (a>0.5)*1.0
array([[ 1.,  0.,  0.,  1.],
       [ 1.,  1.,  0.,  1.],
       [ 0.,  1.,  1.,  1.],
       [ 1.,  0.,  0.,  1.]])
You can write the expression directly; this will return a boolean array, which can be used as a 1-byte unsigned integer ("uint8") array for further calculations:
print(a > 0.5)
output
[[False  True  True  True]
 [ True  True False  True]
 [False  True False  True]
 [ True False False  True]]
In one line, with custom upper/lower values, you can write for example:
upper = 10
lower = 3
threshold = 0.5
print(lower + (a > threshold) * (upper - lower))
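For reference, a minimal runnable sketch showing that this arithmetic trick and np.where agree for scalar upper/lower values (seed as in the question):

import numpy as np

np.random.seed(0)
a = np.random.rand(4, 4)
threshold, upper, lower = 0.5, 10, 3

out_where = np.where(a > threshold, upper, lower)
out_arith = lower + (a > threshold) * (upper - lower)
assert (out_where == out_arith).all()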
