Sum of rows based on index with Numpy - python

I have a 2D array composed of 2D vectors and a 1D array of indices.
How can I add / sum the rows of the 2D array that share the same index, using numpy?
Example:
arr = np.array([[48, -51], [-15, -55], [26, -49], [-13, -17], [-67, -7], [23, -48], [-29, -64], [37, 68]])
idx = np.array([0, 1, 1, 2, 2, 3, 3, 4])
#desired output
array([[48, -51],
[11, -104],
[-80, -24],
[-6, -112],
[ 37, 68]])
Notice how the original array arr is of shape (8, 2), and the result of the operation is (5, 2).

If the indices are not always grouped, apply np.argsort first:
order = np.argsort(idx)
You can compute where each group starts using np.diff followed by np.flatnonzero. Prepending a nonzero value marks position 0 as the start of the first group and shifts the remaining positions by 1:
breaks = np.flatnonzero(np.concatenate(([1], np.diff(idx[order]))))
breaks can now be used as an argument to np.add.reduceat:
result = np.add.reduceat(arr[order, :], breaks, axis=0)
If the indices are already grouped, you don't need to use order at all:
breaks = np.flatnonzero(np.concatenate(([1], np.diff(idx))))
result = np.add.reduceat(arr, breaks, axis=0)
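The two cases above can be combined into one small helper. This is only a sketch using the arrays from the question: the function name sum_rows_by_index is made up for illustration, kind="stable" is an added assumption (it keeps rows within a group in their original order and does not change the sums), and idx is assumed to be non-empty.

```python
import numpy as np

def sum_rows_by_index(arr, idx):
    """Sum the rows of arr that share the same value in idx (hypothetical helper)."""
    # Sort so that equal indices become adjacent; harmless if already grouped.
    order = np.argsort(idx, kind="stable")
    sorted_idx = idx[order]
    # Positions where a new group starts; the leading 1 marks position 0.
    breaks = np.flatnonzero(np.concatenate(([1], np.diff(sorted_idx))))
    # Sum each run of rows between consecutive break positions.
    return np.add.reduceat(arr[order], breaks, axis=0)

arr = np.array([[48, -51], [-15, -55], [26, -49], [-13, -17],
                [-67, -7], [23, -48], [-29, -64], [37, 68]])
idx = np.array([0, 1, 1, 2, 2, 3, 3, 4])
result = sum_rows_by_index(arr, idx)

# The same rows in a scrambled order give the same sums.
perm = np.array([7, 2, 5, 0, 4, 1, 6, 3])
shuffled = sum_rows_by_index(arr[perm], idx[perm])
```

The output rows come back ordered by the sorted index values, so result has shape (5, 2) for the data above.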

You can use pandas for the purpose:
pd.DataFrame(arr).groupby(idx).sum().to_numpy()
Output:
array([[ 48, -51],
[ 11, -104],
[ -80, -24],
[ -6, -112],
[ 37, 68]])

Related

Is there a way to conditionally index 3D-numpy array?

Having an array A with the shape (2,6, 60), is it possible to index it based on a binary array B of shape (6,)?
The 6 and 60 is quite arbitrary, they are simply the 2D data I wish to access.
The underlying thing I am trying to do is to calculate two variants of the 2D data (in this case, (6,60)) and then efficiently select the ones with the lowest total sum - that is where the binary (6,) array comes from.
Example: For B = [1,0,1,0,1,0] what I wish to receive is equal to stacking
A[1,0,:]
A[0,1,:]
A[1,2,:]
A[0,3,:]
A[1,4,:]
A[0,5,:]
but I would like to do it by direct indexing and not a for-loop.
I have tried A[B], A[:,B,:], A[B,:,:] A[:,:,B] with none of them providing the desired (6,60) matrix.
import numpy as np
A = np.array([[4, 4, 4, 4, 4, 4], [1, 1, 1, 1, 1, 1]])
A = np.atleast_3d(A)
A = np.tile(A, (1,1,60))
B = np.array([1, 0, 1, 0, 1, 0])
A[B]
Expected results are a (6,60) array containing the elements from A as described above, the received is either (2,6,60) or (6,6,60).
Thank you in advance,
Linus
You can generate a range of the indices you want to iterate over, in your case from 0 to 5:
count = A.shape[1]
indices = np.arange(count) # np.arange(6) for your particular case
>>> print(indices)
array([0, 1, 2, 3, 4, 5])
And then you can use that to do your advanced indexing:
result_array = A[B[indices], indices, :]
If you always use the full range from 0 to length - 1 (i.e. 0 to 5 in your case) of the second axis of A in increasing order, you can simplify that to:
result_array = A[B, indices, :]
# or the ugly result_array = A[B, np.arange(A.shape[1]), :]
Or even this if it's always 6:
result_array = A[B, np.arange(6), :]
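To check this against the for-loop from the question, here is a small sketch that rebuilds the question's A and B and compares both versions. The name loop_version is introduced only for this comparison.

```python
import numpy as np

# Rebuild the arrays from the question.
A = np.array([[4, 4, 4, 4, 4, 4], [1, 1, 1, 1, 1, 1]])
A = np.atleast_3d(A)          # shape (2, 6, 1)
A = np.tile(A, (1, 1, 60))    # shape (2, 6, 60)
B = np.array([1, 0, 1, 0, 1, 0])

# Pair B[j] (first axis) with j (second axis) for every j.
result_array = A[B, np.arange(A.shape[1]), :]

# Reference: the explicit stacking described in the question.
loop_version = np.stack([A[b, j, :] for j, b in enumerate(B)])
```

Both index arrays have shape (6,), so they are matched element by element and the result has shape (6, 60).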
An alternative solution uses np.take_along_axis (available from NumPy 1.15):
import numpy as np
x = np.arange(2*6*6).reshape((2,6,6))
m = np.zeros(6, int)
m[0] = 1
#example: [1, 0, 0, 0, 0, 0]
np.take_along_axis(x, m[None, :, None], 0) #add dimensions to mask to match array dimensions
>>array([[[36, 37, 38, 39, 40, 41],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]]])
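Note that the result above keeps a leading axis of length 1. A short sketch, assuming an array with the (2, 6, 60) shape from the question (filled with np.arange here purely for illustration), shows how to drop that axis and that the result matches the advanced-indexing answer:

```python
import numpy as np

A = np.arange(2 * 6 * 60).reshape(2, 6, 60)   # illustrative data
B = np.array([1, 0, 1, 0, 1, 0])

# The mask needs one axis per dimension of A: shape (1, 6, 1).
picked = np.take_along_axis(A, B[None, :, None], axis=0)   # shape (1, 6, 60)

# Drop the length-1 leading axis to get the (6, 60) matrix.
result = picked[0]

# Same selection via advanced indexing.
reference = A[B, np.arange(A.shape[1]), :]
```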

Torch sum a tensor along an axis

How do I sum over the columns of a tensor?
torch.Size([10, 100]) ---> torch.Size([10])
The simplest and best solution is to use torch.sum().
To sum all elements of a tensor:
torch.sum(x) # gives back a scalar
To sum over all rows (i.e. for each column):
torch.sum(x, dim=0) # size = [ncol]
To sum over all columns (i.e. for each row):
torch.sum(x, dim=1) # size = [nrow]
It should be noted that the dimension summed over is eliminated from the resulting tensor.
Alternatively, you can use tensor.sum(axis) where axis indicates 0 and 1 for summing over rows and columns respectively, for a 2D tensor.
In [210]: X
Out[210]:
tensor([[ 1, -3, 0, 10],
[ 9, 3, 2, 10],
[ 0, 3, -12, 32]])
In [211]: X.sum(1)
Out[211]: tensor([ 8, 24, 23])
In [212]: X.sum(0)
Out[212]: tensor([ 10, 3, -10, 52])
As we can see from the above outputs, in both cases the output is a 1D tensor. If, on the other hand, you wish to retain the dimensions of the original tensor in the output, you have to set the boolean kwarg keepdim to True, as in:
In [217]: X.sum(0, keepdim=True)
Out[217]: tensor([[ 10, 3, -10, 52]])
In [218]: X.sum(1, keepdim=True)
Out[218]:
tensor([[ 8],
[24],
[23]])
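Applied to the shapes from the question, this looks roughly as follows. The tensor of ones is just placeholder data chosen so the sums are easy to check.

```python
import torch

x = torch.ones(10, 100)                        # placeholder data, shape [10, 100]

row_sums = torch.sum(x, dim=1)                 # reduces the 100 columns
col_sums = torch.sum(x, dim=0)                 # reduces the 10 rows
kept = torch.sum(x, dim=1, keepdim=True)       # reduced axis kept with length 1

print(row_sums.shape, col_sums.shape, kept.shape)
```

With keepdim=True the result can still be broadcast against x, for example x / kept to normalize each row.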
If you have a tensor my_tensor and you wish to sum across the second dimension (the one with index 1, which is the column dimension if the tensor is 2-dimensional, as yours is), use torch.sum(my_tensor, 1) or, equivalently, my_tensor.sum(1) (see the torch.sum documentation).
One thing that is not mentioned explicitly in the documentation is: you can sum across the last array-dimension by using -1 (or the second-to last dimension, with -2, etc.)
So, in your example, you could use outputs.sum(1) or torch.sum(outputs, 1), or, equivalently, outputs.sum(-1) or torch.sum(outputs, -1). All of these give the same result: an output tensor of size torch.Size([10]), with each entry being the sum over all columns in a given row of the tensor outputs.
To illustrate with a 3-dimensional tensor:
In [1]: my_tensor = torch.arange(24).view(2, 3, 4)
Out[1]:
tensor([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
In [2]: my_tensor.sum(2)
Out[2]:
tensor([[ 6, 22, 38],
[54, 70, 86]])
In [3]: my_tensor.sum(-1)
Out[3]:
tensor([[ 6, 22, 38],
[54, 70, 86]])
Based on doc https://pytorch.org/docs/stable/generated/torch.sum.html
it should be
dim (int or tuple of python:ints) – the dimension or dimensions to reduce.
dim=0 means reduce row dimensions: condense all rows = sum by col
dim=1 means reduce col dimensions: condense cols= sum by row
Torch sum along multiple axis or dimensions
Just for the sake of completeness (I could not find it easily) I include how to sum along multiple dimensions with torch.sum which is heavily used in computer vision tasks where you have to reduce along H and W dimensions.
If you have an image x with shape C x H x W and want to compute the average pixel intensity value per channel you could do:
avg = torch.sum(x, dim=(1,2)) / (H*W) # Sum along (H,W) and norm
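A runnable version of that line, as a sketch with small made-up sizes for C, H and W. The comparison against x.mean with the same dim tuple is added here as a cross-check and was not part of the original answer.

```python
import torch

C, H, W = 3, 4, 5                               # illustrative sizes
x = torch.arange(C * H * W, dtype=torch.float32).reshape(C, H, W)

# Sum over the H and W axes in one call, then normalize.
avg = torch.sum(x, dim=(1, 2)) / (H * W)        # shape [C]

# Built-in mean over the same axes for comparison (needs a float tensor).
avg_builtin = x.mean(dim=(1, 2))
```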

2D numpy argsort index returns 3D when used in the original matrix

I am trying to obtain the top 2 values from each row in a matrix using argsort. The indexing is working, as in argsort is returning the correct values. However, when I put the argsort result as an index, it returns a 3 dimensional result.
For example:
test_mat = np.matrix([[0 for i in range(5)] for j in range(5)])
for i in range(5):
    for j in range(5):
        test_mat[i, j] = i * j
test_mat[range(2,3)] = test_mat[range(2,3)] * -1
last_two = range(-1, -3, -1)
index = np.argsort(test_mat, axis=1)
index = index[:, last_k]
This gives:
index.shape
Out[402]: (5L, 5L)
test_mat[index].shape
Out[403]: (5L, 5L, 5L)
Python is new to me and I find indexing to be very confusing in general even after reading the various array manuals. I spend more time trying to get the right values out of objects than actually solving problems. I'd welcome any tips on where to properly learn what is going on. Thanks.
You can use linear indexing to solve your case, like so -
# Say A is your 2D input array
# Get sort indices for the top 2 values in each row
idx = A.argsort(1)[:,::-1][:,:2]
# Get row offset numbers
row_offset = A.shape[1]*np.arange(A.shape[0])[:,None]
# Add row offsets with top2 sort indices giving us linear indices of
# top 2 elements in each row. Index into input array with those for output.
out = np.take( A, idx + row_offset )
Here's a step-by-step sample run -
In [88]: A
Out[88]:
array([[34, 45, 16, 20, 24],
[37, 13, 49, 37, 21],
[42, 36, 35, 24, 18],
[26, 28, 21, 13, 44]])
In [89]: idx = A.argsort(1)[:,::-1][:,:2]
In [90]: idx
Out[90]:
array([[1, 0],
[2, 3],
[0, 1],
[4, 1]])
In [91]: row_offset = A.shape[1]*np.arange(A.shape[0])[:,None]
In [92]: row_offset
Out[92]:
array([[ 0],
[ 5],
[10],
[15]])
In [93]: np.take( A, idx + row_offset )
Out[93]:
array([[45, 34],
[49, 37],
[42, 36],
[44, 28]])
You can directly get the top 2 values from each row with just sorting along the second axis and some slicing, like so -
out = np.sort(A,1)[:,:-3:-1]
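As for why the question's test_mat[index] gained a dimension: a single integer array used as an index selects whole rows along the first axis, one row per entry of the index array. To pick column idx[i, j] from row i, each column index has to be paired with its row index. The sketch below does this for the sample array above, using a plain ndarray rather than np.matrix; np.take_along_axis (NumPy 1.15+) is shown as an alternative that was not part of the original answer.

```python
import numpy as np

A = np.array([[34, 45, 16, 20, 24],
              [37, 13, 49, 37, 21],
              [42, 36, 35, 24, 18],
              [26, 28, 21, 13, 44]])

# Column indices of the two largest values in each row, largest first.
idx = A.argsort(1)[:, ::-1][:, :2]

# Pair every column index with its row index; (4, 1) broadcasts against (4, 2).
rows = np.arange(A.shape[0])[:, None]
top2 = A[rows, idx]

# Alternative: let NumPy do the pairing along axis 1.
top2_alt = np.take_along_axis(A, idx, axis=1)
```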

python iterate over and select values from list of zipped arrays

I have several multidimensional arrays that have been zipped into a single list and am trying to remove values from the list according to a selection criteria applied to a single array. Specifically I have the 4 arrays, all of which have the same shape, that have all been zipped into one list of arrays:
in: array1.shape
out: (5,3)
...
in: array4.shape
out: (5,3)
in: array1
out: ([[0, 1, 1],
[0, 0, 1],
[0, 0, 1],
[0, 1, 1],
[0, 0, 0]])
in: array4
out: ([[20, 16, 20],
[15, 19, 17],
[21, 24, 23],
[22, 22, 26],
[27, 24, 23]])
in: fullarray = zip(array1,...,array4)
in: fullarray[0]
out: (array([0, 1, 1]), array([3, 4, 5]), array([33, 34, 35]), array([20, 16, 20]))
I am trying to iterate over the values from a single target array within each set of arrays and select the values from each array with the same index as the target array when the value equals 20. I doubt I explained that clearly so I'll give an example.
in: fullarray[0]
out: (array([0, 1, 1]), array([3, 4, 5]), array([33, 34, 35]), array([20, 16, 20]))
what I want is to iterate over the values of the fourth array in the list for
fullarray[x] and where the value = 20 to take the value of each array with
the same index and append them into a new list as an array.
so the output for fullarray[0] would be [[0, 3, 33, 20], [1, 5, 35, 20]]
My previous attempts have all generated a variety of error messages (example below). Any help would be appreciated.
in: for i in g:
        for n in i:
            if n == 3:
                for k in n:
                    if k == 0:
                        newlist.append(i[k])
out: for i in fullarray:
      2     for n in i:
----> 3         if n == 3:
      4             for k in n:
      5                 if k == 0:
ValueError: The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
Edit for modified question:
Here's a piece of code that does what you want. The time complexity could probably be improved though.
from numpy import array
fullarray = [(array([0, 1, 1]), array([3, 4, 5]), array([33, 34, 35]), array([20, 16, 20]))]
newlist = []
for arrays in fullarray:
    # arrays[3] is the target array; keep every position where it equals 20
    for idx, value in enumerate(arrays[3]):
        if value == 20:
            # loop variable named a so it does not shadow the imported array
            newlist.append([a[idx] for a in arrays])
print(newlist)
Old answer: Assuming all your arrays are the same size you could do the following:
full[idx] contains a tuple with the values of all your arrays at the index idx in the order you zipped them.
import numpy as np
ar1 = np.array([1] * 8)
ar2 = np.array([2] * 8)
full = list(zip(ar1, ar2))  # list() so that full can be indexed on Python 3
print(full)
newlist = []
for idx, v in enumerate(ar1):
    if v != 0:
        newlist.append(full[idx]) # Here you get a tuple such as (ar1[idx], ar2[idx])
But if len(ar1) > len(ar2), it is going to throw an exception, so keep that in mind and adjust your code accordingly.
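For the modified question, the loops can also be replaced by a boolean mask. This is a sketch of a different technique than the answer's loop, and it assumes all arrays have the same shape. Only the first rows of the arrays come from the question; the second rows and the values of array2 and array3 are invented to make the example self-contained.

```python
import numpy as np

# First rows follow the question's fullarray[0]; second rows are made up.
array1 = np.array([[0, 1, 1], [0, 0, 1]])
array2 = np.array([[3, 4, 5], [6, 7, 8]])
array3 = np.array([[33, 34, 35], [36, 37, 38]])
array4 = np.array([[20, 16, 20], [15, 20, 17]])

# Put the four values that belong to one position side by side: shape (2, 3, 4).
stacked = np.stack([array1, array2, array3, array4], axis=-1)

# Keep only the positions where the target array equals 20.
mask = array4 == 20
selected = stacked[mask]          # one row of 4 values per match
```

Each row of selected holds the values of all four arrays at one matching position, in the order the arrays were stacked.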

numpy.ndindex and array slices

I have a multidimensional array, in which I want to get 1D slices, something like mega_array[:, i, j, k, .....]
To do it, I try numpy.ndindex:
for index in np.ndindex(mega_array.shape[1:]):
    print mega_array[:, index]
But alas, this still gives me multidimensional slices, in which every dimension other than the first has length one.
I want to use the slices as l-value, so, simple ravel() is not suitable here.
What should I use to get normal, 1D slices?
UPD: Here's a small example:
in_array = np.asarray([[7, 40], [777, 440]])
for index in np.ndindex(in_array.shape[1:]):
    print "---"
    print index
    print in_array[:, index] # gives 2D array
UPD: Here's a 3D example:
in_array = np.asarray([[[7, 40, 5], [777, 440, 0]], [[8, 41, 6], [778, 441, 1]]])
print in_array
print in_array.shape
# print in_array[:, 0, 2]
for index in np.ndindex(in_array.shape[1:]):
    print index
    print in_array[:, index] # FAILS
# expected [7, 8], [40, 41], [5, 6], [777, 778] and so on.
You need to add a slice to the index.
In:
in_array = np.asarray([[7, 40], [777, 440]])
for index in np.ndindex(in_array.shape[1:]):
    print "---"
    print index
    print in_array[:, index] # gives 2D array
index has values like (0,),(1,), i.e. tuples.
in_array[:,(1,)] is not the same as in_array[:,1]. To get the latter you need to use in_array[(slice(None),1)]. The slice must be part of the index tuple. We can do that by concatenating tuples.
in_array = np.asarray([[7, 40], [777, 440]])
for index in np.ndindex(in_array.shape[1:]):
    print "---"
    index = (slice(None),)+index
    print index
    print in_array[index]
printing:
---
(slice(None, None, None), 0)
[ 7 777]
---
(slice(None, None, None), 1)
[ 40 440]
Same adjustment should work with the nD array case
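For instance, with the 3D array from the question, a sketch of the same tuple concatenation looks like this. The names full_index and slices are introduced here for illustration, and the final assignment shows the l-value use the question asked about.

```python
import numpy as np

in_array = np.asarray([[[7, 40, 5], [777, 440, 0]],
                       [[8, 41, 6], [778, 441, 1]]])

slices = []
for index in np.ndindex(in_array.shape[1:]):
    # Prepend a full slice so the tuple means in_array[:, i, j].
    full_index = (slice(None),) + index
    slices.append(in_array[full_index].copy())

# The same kind of index works on the left-hand side of an assignment.
in_array[(slice(None), 0, 2)] = -1
```

Each element of slices is a 1D array of length in_array.shape[0].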
You can use np.dstack, which stacks arrays in sequence depth-wise (along the third axis):
>>> a
array([[[ 7, 40, 5],
[777, 440, 0]],
[[ 8, 41, 6],
[778, 441, 1]]])
>>> np.dstack(a)
array([[[ 7, 8],
[ 40, 41],
[ 5, 6]],
[[777, 778],
[440, 441],
[ 0, 1]]])
Also based on your dimensions you can use other numpy joining functions :http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html#joining-arrays
Do you need to transpose the matrix to get the rows?
In [5]: in_array = numpy.asarray([[7, 40], [777, 440]])
In [6]: in_array.transpose()[0]
Out[6]: array([ 7, 777])
After you posted your 3D example, now I see what you wish to do. The following should work for arrays with more than 3 dimensions, too:
Save the number of items in the dimensions > 0,
nitems = np.prod(in_array.shape[1:])
Reshape the array (similar to np.dstack pointed out by Kasra),
new_array = np.reshape(in_array, [in_array.shape[0], nitems])
Loop over new array:
for i in range(new_array.shape[1]):
    print(new_array[:, i])
For me, that gives the expected output.
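Since the question wants to assign to the slices, it is worth checking that the reshaped array still shares memory with the original. A minimal sketch, assuming the input is a freshly created (C-contiguous) array; for non-contiguous inputs reshape may return a copy instead, in which case writes would not propagate. Passing -1 lets NumPy compute nitems itself.

```python
import numpy as np

in_array = np.asarray([[[7, 40, 5], [777, 440, 0]],
                       [[8, 41, 6], [778, 441, 1]]])

# Collapse all axes after the first into one: shape (2, 6).
new_array = in_array.reshape(in_array.shape[0], -1)

# True here because the reshape could be done without copying.
is_view = np.shares_memory(new_array, in_array)

# Column 2 of new_array corresponds to in_array[:, 0, 2].
new_array[:, 2] = 0
```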
