Assign values to numpy array using row indices - python

Suppose I have two arrays, a=np.array([0,0,1,1,1,2]), b=np.array([1,2,4,2,6,5]). Each element of a is the row index where the corresponding element of b should be assigned. If several elements go to the same row, the values should fill that row in order.
So the result is a 2D array c:
c = np.zeros((3, 4))
counts = {k: 0 for k in range(3)}
for i in range(a.shape[0]):
    c[a[i], counts[a[i]]] = b[i]
    counts[a[i]] += 1
print(c)
Is there a way to use some fancy indexing method in numpy to get this result faster (without a for loop) when these arrays are big?

I had to run your code to actually see what it produced. There are limits to what I can 'run' in my head.
In [230]: c
Out[230]:
array([[1., 2., 0., 0.],
[4., 2., 6., 0.],
[5., 0., 0., 0.]])
In [231]: counts
Out[231]: {0: 2, 1: 3, 2: 1}
Omitting this information may be delaying possible answers. 'vectorization' requires thinking in whole-array terms, which is easiest if I can visualize the result, and look for a pattern.
This looks like a padding problem.
In [260]: u, c = np.unique(a, return_counts=True)
In [261]: u
Out[261]: array([0, 1, 2])
In [262]: c
Out[262]: array([2, 3, 1]) # cf with counts
(See the related question "Load data with rows of different sizes into Numpy array".)
Working from previous padding questions, I can construct a mask:
In [263]: mask = np.arange(4)<c[:,None]
In [264]: mask
Out[264]:
array([[ True, True, False, False],
[ True, True, True, False],
[ True, False, False, False]])
and use that to assign the b values to c:
In [265]: c = np.zeros((3,4),int)
In [266]: c[mask] = b
In [267]: c
Out[267]:
array([[1, 2, 0, 0],
[4, 2, 6, 0],
[5, 0, 0, 0]])
Since a is already sorted, we might get the counts faster than with unique. Also, this approach will have problems if some rows have no entries in a, because unique only returns counts for the values that actually appear.
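One way to handle both caveats is sketched below. It is not from the answer above: it swaps np.unique for np.bincount with minlength, so rows that never appear in a get a count of 0, and it uses a stable argsort so that a does not have to be pre-sorted. The function name fill_rows and the n_rows/n_cols parameters are made up for illustration.

```python
import numpy as np

def fill_rows(a, b, n_rows, n_cols):
    """Place b[i] into row a[i], filling each row left to right (hypothetical helper)."""
    # A stable sort groups the values by row and keeps their original order within a row.
    order = np.argsort(a, kind="stable")
    # minlength gives a count of 0 for rows that never occur in a.
    counts = np.bincount(a, minlength=n_rows)
    # True where a row still has values to receive; assumes counts.max() <= n_cols.
    mask = np.arange(n_cols) < counts[:, None]
    c = np.zeros((n_rows, n_cols), dtype=b.dtype)
    c[mask] = b[order]  # boolean assignment fills in row-major order
    return c

a = np.array([0, 0, 1, 1, 1, 2])
b = np.array([1, 2, 4, 2, 6, 5])
c = fill_rows(a, b, 3, 4)
print(c)
```

If any row receives more than n_cols values, the mask has fewer True entries than there are values and the assignment raises an error, so n_cols should be at least counts.max().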

Related

How to assign values to given indices to an array and average on repeated indices?

Is there a neat way to assign values to given indices in an array, and average values in repeated indices?
For example:
a = np.array([0, 0, 0, 0, 0])
ind = np.array([1, 1, 2, 3])
b = np.array([2, 3, 4, 5])
and I want to assign values in array b to array a at corresponding indices indicated in 'ind', and a[1] should be average of 2 and 3.
I can try a for-loop:
hit = np.zeros_like(a)
for i in range(ind.size):
    hit[ind[i]] += 1
    a[ind[i]] += b[i]
a = a / hit
But this code looks dirty. Is there any better way to do the job?
You could do this using np.where.
import numpy as np
a = np.array([0, 0, 0, 0, 0]).astype('float64')
ind = np.array([1, 1, 2, 3])
b = np.array([2, 3, 4, 5])
for i in set(ind):
    a[i] = np.mean(b[np.where(ind == i)])
Would result in:
In [5]: a
Out[5]: array([0. , 2.5, 4. , 5. , 0. ])
You are essentially finding all indices of ind where the value of ind[index] is equal to i and then obtaining the mean of the values at those indices in b and assigning that mean to a[i]. Hope this helps!
Here is a vectorized method. The actual logic is close to your own solution.
n,d = (np.bincount(ind,x,a.size) for x in (b,None))
valid = d!=0
np.copyto(a,np.divide(n,d,where=valid),where=valid)
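For readers who find the one-liner dense, here is the same bincount idea spelled out as a self-contained sketch. The names sums, hits, avg and size are mine, not from the answer, and I write into a fresh float array rather than into a.

```python
import numpy as np

ind = np.array([1, 1, 2, 3])
b = np.array([2, 3, 4, 5])
size = 5  # length of the target array, as in the question

# Sum of the b values that land on each index (float64 because weights are given).
sums = np.bincount(ind, weights=b, minlength=size)
# Number of values that land on each index.
hits = np.bincount(ind, minlength=size)

# Divide only where something was assigned; untouched slots stay 0.
avg = np.zeros(size)
np.divide(sums, hits, out=avg, where=hits != 0)
print(avg)
```

Passing out= together with where= keeps the division from touching the slots where hits is 0, so no divide-by-zero warning is raised and those slots keep their initial value.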
In [56]: a = np.zeros(5)
    ...: hit = np.zeros_like(a)
    ...: for i in range(ind.size):
    ...:     hit[ind[i]] += 1
    ...:     a[ind[i]] += b[i]
In [57]: a
Out[57]: array([0., 5., 4., 5., 0.])
In [58]: hit
Out[58]: array([0., 2., 1., 1., 0.])
The mention of duplicate indices brings to mind the .at ufunc method:
In [59]: a = np.zeros(5)
In [60]: a = np.zeros(5)
    ...: hit = np.zeros_like(a)
    ...: np.add.at(a,ind,b)
    ...: np.add.at(hit,ind,1)
In [61]: a
Out[61]: array([0., 5., 4., 5., 0.])
In [62]: hit
Out[62]: array([0., 2., 1., 1., 0.])
This isn't quite as fast as a[ind]=b, but faster than your loop.
np.bincount might well be better for this task, but this add.at is worth knowing and testing.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.at.html
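To see why .at matters when indices repeat, here is a small comparison. It is only a sketch using the ind and b arrays from the question; the names buffered and unbuffered are mine.

```python
import numpy as np

ind = np.array([1, 1, 2, 3])
b = np.array([2, 3, 4, 5])

# Plain fancy-index += is buffered: index 1 appears twice,
# but the two contributions are not accumulated.
buffered = np.zeros(5)
buffered[ind] += b

# ufunc.at is unbuffered: every occurrence of index 1 is added.
unbuffered = np.zeros(5)
np.add.at(unbuffered, ind, b)

print(buffered)
print(unbuffered)
```

Only the unbuffered version ends up with 2 + 3 = 5 at index 1, which is what the averaging needs.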
This might not necessarily be cleaner or faster, but here's an alternative that I think is easy to read:
a = [[] for _ in range(5)]
for i, x in zip(ind, b):
    a[i].append(x)
[np.mean(x) if len(x) else 0 for x in a]

numpy broadcasting with 3d arrays

Is it possible to apply numpy broadcasting (with 1D arrays),
x=np.arange(3)[:,np.newaxis]
y=np.arange(3)
x+y=
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
to 3D arrays similar to the one below, such that each element a[i] is treated like an element of the 1D vectors in the example above?
a=np.zeros((2,2,2))
a[0]=1
b=a
result=a+b
resulting in
result[0,0]=array([[2, 2],
[2, 2]])
result[0,1]=array([[1, 1],
[1, 1]])
result[1,0]=array([[1, 1],
[1, 1]])
result[1,1]=array([[0, 0],
[0, 0]])
You can do this the same way as with 1D arrays, i.e., insert a new axis between axis 0 and axis 1 in either a or b:
a + b[:,None] # or a[:,None] + b
(a + b[:,None])[0,0]
#array([[ 2., 2.],
# [ 2., 2.]])
(a + b[:,None])[0,1]
#array([[ 1., 1.],
# [ 1., 1.]])
(a + b[:,None])[1,0]
#array([[ 1., 1.],
# [ 1., 1.]])
(a + b[:,None])[1,1]
#array([[ 0., 0.],
# [ 0., 0.]])
Since a and b have the same shape, (2,2,2), a+b works directly as an elementwise sum.
Broadcasting matches the dimensions of the operands in reverse order, starting from the last dimension and working backwards (e.g. columns before rows in the two-dimensional case). If the dimensions match, the next dimension is considered.
If the dimensions don't match and one of them is 1, that operand is repeated along that dimension to match the other operand (e.g. if a.shape = (2,1,2) and b.shape = (2,2,2), the values of a are repeated along axis 1 to give shape (2,2,2)). If neither of them is 1, the operation raises an error.
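The shape rules can be checked directly. This is just a sketch with the arrays from the question; the names pairwise and stretched are mine.

```python
import numpy as np

a = np.zeros((2, 2, 2))
a[0] = 1
b = a

# a[:, None] has shape (2, 1, 2, 2); b has shape (2, 2, 2) and is treated as (1, 2, 2, 2).
# Each size-1 axis is stretched, so pairwise[i, j] == a[i] + b[j].
pairwise = a[:, None] + b
print(pairwise.shape)

# The (2, 1, 2) against (2, 2, 2) case: the size-1 axis is repeated.
stretched = np.ones((2, 1, 2)) + np.zeros((2, 2, 2))
print(stretched.shape)
```

Here pairwise[0, 0] is all 2s, pairwise[0, 1] and pairwise[1, 0] are all 1s, and pairwise[1, 1] is all 0s, matching the result asked for in the question.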

Triangular indexing and choice of summation axis for multidimensional arrays / matrices

I'm trying to solve a problem using multidimensional arrays, rather than resorting to for loops, in order to gain a performance boost, but am having trouble with the indexing.
I've tried various permutations using np.newaxis, but can't seem to achieve the following functionality.
Problem:
Part 1) Take an M x N x N array called a, and for each of the M square matrices, set the upper triangular matrix elements as their negative values.
Part 2) Sum all elements in each of the M matrices (of shape N X N), returning a 1D array with M elements. Let's call this array b.
Attempted Solution
Here is my MWE / attempt using loops (which does work, but I'd rather find a fully array/matrix-based approach):
a = np.array(
[[[ 0, 1],
[ 5, 0]],
[[ 0, 3],
[ 2, 0]]])
Part 1):
triangular_upper_idx = np.triu_indices_from(a[0])
for i in range(len(a)):
    a[i][triangular_upper_idx] *= -1
a
result:
array([[[ 0, -1],
[ 5, 0]],
[[ 0, -3],
[ 2, 0]]])
Part 2):
b = np.zeros(len(a))
for i in range(len(a)):
    b[i] = np.sum(a[i])
b
result:
array([ 4., -1.])
Note:
I have seen a similar question on this topic (Triangular indices for multidimensional arrays in numpy) but the solution there was nested for loops... I feel like numpy may offer a more efficient, clever array-based solution?
Any guidance would be much appreciated.
Thanks
Yes, numpy has the tools:
r = 2
neg_uppr = np.triu(-np.ones((r,r)),1) + np.tril(np.ones((r,r)))
I can't tell from your numerical example whether you want the diagonal negated too. If so, use np.triu(-np.ones((r,r))) + np.tril(np.ones((r,r)),-1) instead.
neg_uppr
Out[23]:
array([[ 1., -1.],
[ 1., 1.]])
a = np.array(
[[[ 0, 1],
[ 5, 0]],
[[ 0, 3],
[ 2, 0]]])
It's fast to use the built-in element-wise arithmetic:
a = a * neg_uppr
a
Out[26]:
array([[[ 0., -1.],
[ 5., 0.]],
[[ 0., -3.],
[ 2., 0.]]])
You can specify the axes to sum over:
np.sum(a, (1,2))
Out[27]: array([ 4., -1.])
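An alternative that avoids the float multiplier matrix is to index the upper triangle of all M matrices at once. This is a sketch, not the method from the answer above; it uses k=1, so the diagonal is left untouched (use k=0 if the diagonal should be negated too).

```python
import numpy as np

a = np.array(
    [[[0, 1],
      [5, 0]],
     [[0, 3],
      [2, 0]]])

# Row and column indices of the strictly upper triangle of an N x N matrix.
rows, cols = np.triu_indices(a.shape[-1], k=1)

# Negate those entries in every one of the M matrices, in place (dtype stays int).
a[:, rows, cols] *= -1

# Sum over the last two axes to get one value per matrix.
b = a.sum(axis=(1, 2))
print(a)
print(b)
```

Because the (row, col) pairs from triu_indices are all distinct, the in-place *= through fancy indexing is safe here.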

Understanding == applied to a NumPy array

I'm new to Python, and I am learning TensorFlow. In a tutorial using the notMNIST dataset, they give example code to transform the labels matrix to a one-of-n encoded array.
The goal is to take an array consisting of label integers 0...9, and return a matrix where each integer has been transformed into a one-of-n encoded array like this:
0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
...
The code they give to do this is:
# Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
However, I don't understand how this code does that at all. It looks like it just generates an array of integers in the range of 0 to 9, and then compares that with the labels matrix, and converts the result to a float. How does an == operator result in a one-of-n encoded matrix?
There are a few things going on here: numpy's vector ops, adding a singleton axis, and broadcasting.
First, you should be able to see how the == does the magic.
Let's say we start with a simple label array. == behaves in a vectorized fashion, which means that we can compare the entire array with a scalar and get an array consisting of the values of each elementwise comparison. For example:
>>> labels = np.array([1,2,0,0,2])
>>> labels == 0
array([False, False, True, True, False], dtype=bool)
>>> (labels == 0).astype(np.float32)
array([ 0., 0., 1., 1., 0.], dtype=float32)
First we get a boolean array, and then we coerce to floats: False==0 in Python, and True==1. So we wind up with an array which is 0 where labels isn't equal to 0 and 1 where it is.
But there's nothing special about comparing to 0, we could compare to 1 or 2 or 3 instead for similar results:
>>> (labels == 2).astype(np.float32)
array([ 0., 1., 0., 0., 1.], dtype=float32)
In fact, we could loop over every possible label and generate this array. We could use a listcomp:
>>> np.array([(labels == i).astype(np.float32) for i in np.arange(3)])
array([[ 0., 0., 1., 1., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 1.]], dtype=float32)
but this doesn't really take advantage of numpy. What we want to do is have each possible label compared with each element, IOW to compare
>>> np.arange(3)
array([0, 1, 2])
with
>>> labels
array([1, 2, 0, 0, 2])
And here's where the magic of numpy broadcasting comes in. Right now, labels is a 1-dimensional object of shape (5,). If we make it a 2-dimensional object of shape (5,1), then the operation will "broadcast" over the last axis and we'll get an output of shape (5,3) with the results of comparing each entry in the range with each element of labels.
First we can add an "extra" axis to labels using None (or np.newaxis), changing its shape:
>>> labels[:,None]
array([[1],
[2],
[0],
[0],
[2]])
>>> labels[:,None].shape
(5, 1)
And then we can make the comparison (this is the transpose of the arrangement we were looking at earlier, but that doesn't really matter).
>>> np.arange(3) == labels[:,None]
array([[False, True, False],
[False, False, True],
[ True, False, False],
[ True, False, False],
[False, False, True]], dtype=bool)
>>> (np.arange(3) == labels[:,None]).astype(np.float32)
array([[ 0., 1., 0.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 0., 1.]], dtype=float32)
Broadcasting in numpy is very powerful, and well worth reading up on.
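Putting the pieces together, the tutorial's line can be wrapped in a small function. This is a sketch: the name one_hot is mine, and the np.eye lookup is a different technique shown only as a cross-check, not something from the tutorial.

```python
import numpy as np

def one_hot(labels, num_labels):
    """One-of-n encode a 1-D integer label array via a broadcast comparison."""
    # (num_labels,) compared with (len(labels), 1) broadcasts to (len(labels), num_labels).
    return (np.arange(num_labels) == labels[:, None]).astype(np.float32)

labels = np.array([1, 2, 0, 0, 2])
encoded = one_hot(labels, 3)
print(encoded)

# Cross-check: pick rows of the identity matrix by label.
via_eye = np.eye(3, dtype=np.float32)[labels]
print(np.array_equal(encoded, via_eye))
```

Each row has exactly one 1.0, in the column given by that row's label.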
In short, == applied to a numpy array means applying element-wise == to the array. The result is an array of booleans. Here is an example:
>>> b = np.array([1,0,0,1,1,0])
>>> b == 1
array([ True, False, False, True, True, False], dtype=bool)
To count, say, how many 1s there are in b, you don't need to cast the array to float, i.e. the .astype(np.float32) can be omitted, because in Python bool is a subclass of int, so True == 1 and False == 0. So here is how you count how many ones are in b:
>>> np.sum((b == 1))
3
Or:
>>> np.count_nonzero(b == 1)
3

How can I ignore zeros when I take the median on columns of an array?

I have a simple numpy array.
array([[10,  0, 10,  0],
       [ 1,  1,  0,  0],
       [ 9,  9,  9,  0],
       [ 0, 10,  1,  0]])
I would like to take the median of each column, individually, of this array.
However, there are a few 0 values in various places which I would like to ignore in the calculation of the medians.
To further complicate, I would like to keep the columns with only 0 entries as having the median of 0. In this manner, those columns would serve as a bit of a place holder, keeping the dimensions of the matrix the same.
The numpy documentation doesn't have any argument that would work for what I want (maybe I am spoiled by the many switches we get with R!)
numpy.median(a, axis=None, out=None, overwrite_input=False)
Can someone please shed some light on an effective way to do this, which is in line with the spirit of numpy? I could hack it out but in that case I feel like I've defeated the purpose of using numpy in the first place.
Thanks in advance.
Masked array is always handy, but slooooooow:
In [14]:
%timeit np.ma.median(y, axis=0).filled(0)
1000 loops, best of 3: 1.73 ms per loop
In [15]:
%%timeit
ans=np.apply_along_axis(lambda v: np.median(v[v!=0]), 0, x)
ans[np.isnan(ans)]=0.
1000 loops, best of 3: 402 µs per loop
In [16]:
ans=np.apply_along_axis(lambda v: np.median(v[v!=0]), 0, x)
ans[np.isnan(ans)]=0.; ans
Out[16]:
array([ 9., 9., 9., 0.])
np.nonzero is even faster:
In [25]:
%%timeit
ans=np.apply_along_axis(lambda v: np.median(v[np.nonzero(v)]), 0, x)
ans[np.isnan(ans)]=0.
1000 loops, best of 3: 384 µs per loop
Use masked arrays and np.ma.median(axis=0).filled(0) to get the medians of the columns.
In [1]: x = np.array([[10, 0, 10, 0], [1, 1, 0, 0], [9, 9, 9, 0], [0, 10, 1, 0]])
In [2]: y = np.ma.masked_where(x == 0, x)
In [3]: x
Out[3]:
array([[10, 0, 10, 0],
[ 1, 1, 0, 0],
[ 9, 9, 9, 0],
[ 0, 10, 1, 0]])
In [4]: y
Out[4]:
masked_array(data =
[[10 -- 10 --]
[1 1 -- --]
[9 9 9 --]
[-- 10 1 --]],
mask =
[[False True False True]
[False False True True]
[False False False True]
[ True False False True]],
fill_value = 999999)
In [6]: np.median(x, axis=0)
Out[6]: array([ 5., 5., 5., 0.])
In [7]: np.ma.median(y, axis=0).filled(0)
Out[7]:
array([ 9., 9., 9., 0.])
You can use masked arrays.
a = np.array([[10, 0, 10, 0], [1, 1, 0, 0],[9,9,9,0],[0,10,1,0]])
m = np.ma.masked_equal(a, 0)
In [44]: np.median(a)
Out[44]: 1.0
In [45]: np.ma.median(m)
Out[45]: 9.0
In [46]: m
Out[46]:
masked_array(data =
[[10 -- 10 --]
[1 1 -- --]
[9 9 9 --]
[-- 10 1 --]],
mask =
[[False True False True]
[False False True True]
[False False False True]
[ True False False True]],
fill_value = 0)
I prefer to use
# replace 0.0 with nan to exclude 0.0 from median
zero_to_nan = numpy.where(a == 0.0, numpy.nan, a)
n = numpy.nanmedian(zero_to_nan, ....)
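Applied to the array from the question, the NaN approach might look like the sketch below. The axis=0 argument and the final step that turns the all-NaN column back into 0 are my assumptions about what the question needs; they are not spelled out in the snippet above.

```python
import warnings
import numpy as np

a = np.array([[10, 0, 10, 0], [1, 1, 0, 0], [9, 9, 9, 0], [0, 10, 1, 0]])

# Replace zeros with NaN so nanmedian skips them (the result is a float array).
zero_to_nan = np.where(a == 0, np.nan, a)

# A column that is entirely NaN triggers a RuntimeWarning and yields NaN.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    med = np.nanmedian(zero_to_nan, axis=0)

# Keep all-zero columns as 0, as the question asks.
med = np.where(np.isnan(med), 0.0, med)
print(med)
```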
This may help. Once you get the nonzero array, you can obtain the median directly from a[nonzero(a)]
numpy.nonzero
numpy.nonzero(a)
Return the indices of the elements that are non-zero.
Returns a tuple of arrays, one for each dimension of a, containing the indices of the non-zero elements in that dimension. The corresponding non-zero values can be obtained with:
a[nonzero(a)]
To group the indices by element, rather than dimension, use:
transpose(nonzero(a))
The result of this is always a 2-D array, with a row for each non-zero element.
Parameters:
    a : array_like
        Input array.
Returns:
    tuple_of_arrays : tuple
        Indices of elements that are non-zero.
See also:
    flatnonzero
        Return indices that are non-zero in the flattened version of the input array.
    ndarray.nonzero
        Equivalent ndarray method.
    count_nonzero
        Counts the number of non-zero elements in the input array.
Examples
>>> x = np.eye(3)
>>> x
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
>>> np.nonzero(x)
(array([0, 1, 2]), array([0, 1, 2]))
>>> x[np.nonzero(x)]
array([ 1., 1., 1.])
>>> np.transpose(np.nonzero(x))
array([[0, 0],
[1, 1],
[2, 2]])
A common use for nonzero is to find the indices of an array, where a condition is True. Given an array a, the condition a > 3 is a boolean array and since False is interpreted as 0, np.nonzero(a > 3) yields the indices of the a where the condition is true.
>>> a = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a > 3
array([[False, False, False],
[ True, True, True],
[ True, True, True]], dtype=bool)
>>> np.nonzero(a > 3)
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
The nonzero method of the boolean array can also be called.
>>> (a > 3).nonzero()
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
