Summing a numpy array based on a multi-labeled mask - python

Say I have an array:
x = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
And a multi-labeled mask:
labels = np.array([[0, 0, 2],
                   [1, 1, 2],
                   [1, 1, 2]])
My goal is to sum the entries of x together, grouped by labels. For example:
n_labels = np.max(labels) + 1
out = np.empty(n_labels)
for label in range(n_labels):
    mask = labels == label
    out[label] = np.sum(x[mask])
>>> out
array([ 1., 20., 15.])
However, as x and n_labels become large, this gets inefficient: each iteration sums only a small fraction of the entries of x, yet still recomputes the mask over all of labels (in the expression labels == label) and subsequently indexes over all of x (in the expression x[mask]). Is there a more efficient way to do this as x and n_labels grow large?

You can use bincount with weights:
np.bincount(labels.ravel(), weights=x.ravel())
out:
array([ 1., 20., 15.])
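One small addition (assuming non-negative integer labels, which bincount requires): if the highest labels can be absent from labels, pass minlength so the output always has n_labels entries:
np.bincount(labels.ravel(), weights=x.ravel(), minlength=n_labels)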

You don't really have a reason to operate on 2D arrays, so ravel them first:
labels = labels.ravel()
x = x.ravel()
If your labels are already indices, you can use np.argsort along with np.diff and np.add.reduceat:
index = labels.argsort()
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]
result = np.add.reduceat(x[index], splits)
labels[index] is the sorted label array. Whenever it changes, you enter a new group, and the diff is nonzero; that's what np.flatnonzero(np.diff(labels[index])) finds for you. Each nonzero diff sits at the last element of a run, so adding one gives the start index of the next run, which is what reduceat expects. np.r_ lets you prepend a zero to a 1D array easily, which is necessary for reduceat to regard the first run (the last is always automatic).
Before you run reduceat, you need to order x into the runs defined by labels, which is what x[index] does.
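As a quick sanity check with the example arrays (a sketch, using the raveled labels and x from above):
labels[index]  # array([0, 0, 1, 1, 1, 1, 2, 2, 2]) -- runs of equal labels
splits         # array([0, 2, 6]) -- where each run starts
result         # array([ 1, 20, 15])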

If you want to keep the 2D arrays, another (slower and somewhat over-engineered) approach uses np.add.at:
sums = np.zeros(labels.max()+1, x.dtype)
np.add.at(sums, labels, x)
sums
Output
array([ 1, 20, 15])

Related

Torch - How to calculate average of tensors with the same indexes

Suppose we have two matrices: X(m, n) and an index matrix I(m, 1). Every item I_k of the index matrix is the index of the kth row X_k of X.
And suppose the indices lie in the range [0, 1, 2, ..., j-1].
I would like to calculate the average of tensors in X with the same index i and return a result matrix R(j, n).
For example,
X = [[1, 1, 1],
     [2, 2, 2],
     [3, 3, 3]]
I = [0, 0, 1]
The result matrix should be:
R = [[torch.mean(([1, 1, 1], [2, 2, 2]))],
     [torch.mean(([3, 3, 3]))]]
which equals:
R = [[1.5, 1.5, 1.5],
     [3, 3, 3]]
My current solution is to traverse through m, stack the tensors with the same index and perform torch.mean.
Is there a way to avoid traversing through m? The current approach seems inelegant and rather time-consuming.
ret = torch.empty_like(X)  # note: shape (m, n); only the first j rows end up meaningful
ret.scatter_reduce_(0, I.unsqueeze(-1).expand_as(X), X, "mean", include_self=False)
should do what you want.
Now, note that this is a fairly new method, so it may not be particularly performant. If you bump into an issue with it, you may be better off running scatter_add_ on the tensor X and on a tensor of ones, then dividing, as sketched below.
If you also want a smaller tensor as output, you may want to figure out how many indices there are and use that to infer the size of the output.
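A minimal sketch of that fallback (an assumption: I has shape (m,) and the number of groups is j = I.max() + 1):
import torch

X = torch.tensor([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])
I = torch.tensor([0, 0, 1])
j = int(I.max()) + 1

idx = I.unsqueeze(-1).expand_as(X)  # (m, n): one group id per entry
sums = torch.zeros(j, X.shape[1]).scatter_add_(0, idx, X)
counts = torch.zeros(j, X.shape[1]).scatter_add_(0, idx, torch.ones_like(X))
R = sums / counts  # tensor([[1.5, 1.5, 1.5], [3.0, 3.0, 3.0]])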

How to map corresponding values of a 2D NumPy array into an 1D array

I have written this piece of code:
data = np.array([[3, 6], [5, 9], [4, 8]])
orig_x, orig_y = np.split(data, 2, axis=1)
x = np.array([3, 4])
y = np.zeros(len(x))
for i in range(len(x)):
    y[i] = orig_y[np.where(orig_x == x[i])[0]]
So basically, I have a 2D NumPy array. I split it into two 1D arrays orig_x and orig_y, one storing values of the x-axis and the other values of the y-axis.
I also have another 1D NumPy array, which has some of the values that exist in the orig_x array. I want to find the y-axis values for each value in the x array. I created this method, using a simple loop, but it is extremely slow since I'm using it with thousands of values.
Do you have a better idea? Maybe by using a NumPy function?
Note: Also a better title for this question can be made. Sorry :(
You could create a mask over which values you want from the x column and then use this mask to select values from the y column.
data = np.array([[3,6], [5,9], [4, 8]])
# the values you want to lookup on the x-axis
x = np.array([3, 4])
mask = np.isin(data[:,0], x)
data[mask,1]
Output:
array([6, 8])
The key function here is np.isin. What it is basically doing is broadcasting x (or data) to the appropriate shape and doing an element-wise comparison:
mask = data[:,0,None] == x
y_mask = np.logical_or.reduce(mask, axis=1)
data[y_mask, 1]
Output:
array([6, 8])
I'm not 100% sure I understood the problem correctly, but I think the following should work:
>>> rows, cols = np.where(orig_x == x)
>>> y = orig_y[rows[np.argsort(cols)]].ravel()
>>> y
array([6, 8])
It assumes that all the values in orig_x are unique, but since your code example has the same restriction, I considered it a given.
What about a lookup table?
import numpy as np
data = np.array([[3,6], [5,9], [4, 8]])
orig_x, orig_y = np.split(data, 2, axis=1)
x = np.array([3, 4])
y = np.zeros(len(x))
You can pack a dict for lookup:
lookup = {i: j for i, j in zip(orig_x.ravel(), orig_y.ravel())}
And just map this into a new array:
np.fromiter(map(lambda i: lookup.get(i, np.nan), x), dtype=int, count=len(x))
array([6, 8])
If orig_x & orig_y are your smaller data structures this will probably be most efficient.
EDIT - It's occurred to me that if your values are integers the default np.nan won't work and you should figure out what value makes sense for your application if you're trying to find a value that isn't in your orig_x array.
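For instance (a sketch, assuming -1 never occurs in orig_y and can act as the sentinel):
np.fromiter((lookup.get(i, -1) for i in x), dtype=int, count=len(x))
array([6, 8])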

Efficiently create 2d numpy array given 1 dimension and a constant

Given an x-dataset,
x = np.array([1, 2, 3, 4, 5])
what is the most efficient way to create the NumPy array where each x coordinate is paired with a y-coordinate of value 0? I am wondering if there is a way specifically that doesn't require any hard coding, so that x could vary in length without causing failure.
As per your problem statement, the following is one way to do it.
# initialize an array of zeros
In [36]: res = np.zeros((2, *x.shape), dtype=x.dtype)
# fill `x` as first row
In [37]: res[0] = x
In [38]: res
Out[38]:
array([[1, 2, 3, 4, 5],
       [0, 0, 0, 0, 0]])
When we initialize the array of zeros, we use 2 for the axis-0 dimension since your requirement is a 2D array. For the column size we simply take the length of the x array, so x can vary in length without any hard coding. For reasonably large arrays, this approach would be the fastest.
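If you prefer a one-liner, a stack-based sketch of the same idea:
res = np.stack((x, np.zeros_like(x)))  # shape (2, len(x)): x on top, zeros below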

How can I find indexes with same elements in 2d numpy array?

I'm working on a machine vision project. By reflecting laser light onto the picture, I detect the pixels the laser falls on with the help of OpenCV. I keep these pixel values as a 2D numpy array. However, I want to make the x, y values unique by finding the pixel values whose x-axis values are the same and taking their average. The pixel values are kept sequentially in the numpy array.
For example:
[[659 253]
 [660 253]
 [660 256]
 [661 253]
 [662 253]
 [663 253]
 [664 253]
 [665 253]]
First of all, my goal is to identify all rows whose first element is the same. When using OpenCV, pixel values are kept in numpy arrays to be more useful. I'm trying to write an indexing method myself, and I created a simpler numpy array to work with:
x = np.array([[1, 2], [1, 78], [1, 3], [1, 6], [4, 3], [5, 6], [5, 3]], np.int32)
I followed a method like this to find the rows of x whose first element is the same:
for i in range(len(x)):
    if x[i] != x[-1] and x[i][0] == x[i + 1][0]:
        print(x[i], x[i + 1])
I want to check whether the first element of each row also appears in the following rows by walking through the x array. In order not to face an index out of range error, I used x[i] != x[-1]. I was expecting this loop to return the result below:
[1,2] [1,78]
[1,78] [1,3]
[1,3] [1,6]
[5,6] [5,3]
I would later remove duplicate elements from the list, but instead I got:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am not familiar with numpy arrays so I could not get the solution I wanted. Is it possible to get the result I want using numpy array methods? Thanks for your time.
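First, a note on the ValueError: x[i] != x[-1] compares the two rows element-wise, so it produces a boolean array rather than a single True/False, and Python cannot use that array as a condition. A minimal sketch of a scalar-safe check:
(x[i] != x[-1]).any()  # True if the rows differ in any position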
Approach 1
This is a numpy way to do this:
x_sorted = x[np.argsort(x[:,0])]
marker_idx = np.flatnonzero(np.diff(x_sorted[:,0]))+1
output = np.split(x_sorted, marker_idx)
Approach 2
You can also use the package numpy_indexed, which is designed to solve group-by problems with less code and without loss of performance:
import numpy_indexed as npi
npi.group_by(x[:, 0]).split(x)
Approach 3
You can also get groups of indices via pandas, but this might not be the best option because of the list comprehension:
import pandas as pd
[x[idx] for idx in pd.DataFrame(x).groupby([0]).indices.values()]
Output
[array([[ 1,  2],
        [ 1, 78],
        [ 1,  3],
        [ 1,  6]]),
 array([[4, 3]]),
 array([[5, 6],
        [5, 3]])]
Try the following, using itertools.groupby after sorting the rows by their first element so that equal keys are adjacent (note that x.sort(axis=0) would sort each column independently and scramble the pairs):
import itertools

x_sorted = x[np.argsort(x[:, 0], kind="stable")]
for _, group in itertools.groupby(x_sorted, key=lambda row: row[0]):
    print([tuple(int(v) for v in p) for p in group])
Output:
[(1, 2), (1, 78), (1, 3), (1, 6)]
[(4, 3)]
[(5, 6), (5, 3)]
You can use np.unique with its return_inverse argument, which maps every element of x[:, 0] to its position among the sorted unique values, and return_counts, which is going to help build the split points:
_, inv, cnt = np.unique(x[:, 0], return_inverse=True, return_counts=True)
The inverse index arranges the unique values back into x. To group x instead, you need to invert that mapping, and a stable np.argsort does exactly that:
ind = np.argsort(inv, kind="stable")
To get the split points of the data, you can use np.cumsum on the counts. You don't need the last element because it always marks the end of the array:
spp = np.cumsum(cnt[:-1])
Finally, you can use np.split to get the list of sub-arrays that you want:
result = np.split(x[ind, :], spp, axis=0)
TL;DR
_, inv, cnt = np.unique(x[:, 0], return_inverse=True, return_counts=True)
np.split(x[np.argsort(inv, kind="stable"), :], np.cumsum(cnt[:-1]), axis=0)
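As a quick check with the example x (a sketch of the intermediate values):
inv  # array([0, 0, 0, 0, 1, 2, 2]) -- group id of each row
cnt  # array([4, 1, 2])
spp  # array([4, 5]) -- start of the second and third groups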

How to replicate numpy.choose() in tensorflow?

I'm trying to efficiently replicate numpy's ndarray.choose() method.
Here's a numpy example of what I'm looking for:
b = np.arange(15).reshape(3, 5)
c = np.array([1,0,4])
c.choose(b.T) # trying to replicate in tensorflow
-> array([ 1, 5, 14])
The best I've been able to do with this is generate a batch_size square matrix (which is huge if batch size is huge) and take the diagonal of it:
tf_b = tf.constant(b)
tf_c = tf.constant(c)
sess.run(tf.diag_part(tf.gather(tf.transpose(tf_b), tf_c)))
-> array([ 1, 5, 14])
Is there a way to do this that is just linear in the first dimension (instead of squared)?
Yeah, there's an easier way to do this. Flatten your b array to 1-d, so it's [0, 1, 2, ..., 13, 14]. Take an array of indices that are in the range of the number of 'choices' you are taking (3 in your case). That will be [0, 1, 2]. Multiply this range by the second dimension of your original shape, which is the number of options for each choice (5 in your case). That gives you [0, 5, 10]. Then add your indices to this to obtain [1, 5, 14]. Now you're good to call tf.gather().
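Applied to the example above, a minimal sketch (TF1-style, matching the question's sess.run usage):
flat_b = tf.reshape(tf_b, [-1])                 # [0, 1, ..., 14]
offsets = tf.range(0, 3, dtype=tf_c.dtype) * 5  # [0, 5, 10]: start of each row of b
sess.run(tf.gather(flat_b, offsets + tf_c))
-> array([ 1,  5, 14])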
Here is some code that I've taken from here that does a similar thing for RNN outputs. Yours will be slightly different, but the idea is the same.
index = tf.range(0, batch_size) * max_length + (length - 1)
flat = tf.reshape(output, [-1, out_size])
relevant = tf.gather(flat, index)
return relevant
In the big picture, the operation is pretty straightforward. You use the range operation to get the index of the beginning of each row, then add the offset of where you are within each row. Doing it in 1D is easiest, which is why we flatten first.
