Replace all values in array with their enumerated counterparts Numpy - python

I have an array with the following structure
[[distance_1,intensity_1],[distance_2,intensity_2]...]
These distances have many decimal places, are unordered, and are not unique. I want to map them to integers from 0 to number_of_unique_values - 1.
An example:
array = [[-1.13243,3],[-0.71229,2],[-2.314532,9],[2.34235,4],[1.342545,4],[-1.13243,2]]
By enumerating all unique distance values I get the following mapping
enumerated_distances = np.array(list(enumerate(np.unique(array[:,0]))))
[[-2.314532,0],[-1.13243,1],[-0.71229,2],[1.342545,3],[2.34235,4]]
Now, what I want to do, is to replace all distance values with their enumerated counterparts, so the original array ends up like this:
[[1,3],[2,2],[0,9],[4,4],[3,4],[1,2]]
Is there a way of doing this efficiently in numpy, without manually finding each value and replacing it with its enumerated counterpart?
Performance is key, as this will be integrated into a system running in real time. In my example, there is only one distance (x), but in reality it will be three dimensional (x,y,z).

As @Epsi95 points out, this is just np.unique(*, return_inverse = True)
_, inv = np.unique(array[:,0], return_inverse = True)
enumerated_out = np.stack([inv, array[:, 1]], axis = -1).astype(int)
enumerated_out
Out[]:
array([[1, 3],
[2, 2],
[0, 9],
[4, 4],
[3, 4],
[1, 2]])
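For the three-dimensional case mentioned in the question, one possible extension is to pass axis=0 so that np.unique treats each (x, y, z) row as a single key. The sketch below assumes that is the intended meaning of "three dimensional"; the points array and the point_ids name are made up for illustration. The inverse is flattened explicitly because its shape with axis=0 has differed between NumPy releases.

```python
import numpy as np

# 1D case from the question: column 0 is the distance, column 1 the intensity.
array = np.array([[-1.13243, 3], [-0.71229, 2], [-2.314532, 9],
                  [2.34235, 4], [1.342545, 4], [-1.13243, 2]])
_, inv = np.unique(array[:, 0], return_inverse=True)
enumerated = np.column_stack([inv, array[:, 1]]).astype(int)

# Hypothetical 3D case: every row is one (x, y, z) point.
points = np.array([[0.5, 1.0, 2.0],
                   [0.1, 0.2, 0.3],
                   [0.5, 1.0, 2.0],
                   [0.1, 0.2, 0.4]])
_, point_ids = np.unique(points, axis=0, return_inverse=True)
point_ids = point_ids.reshape(-1)  # one integer id per input row

print(enumerated)
print(point_ids)
```

If each coordinate should instead be enumerated independently, the 1D call can be applied to each column in turn.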

Related

Torch - How to calculate average of tensors with the same indexes

Suppose I have two matrices: X of shape (m, n) and an index matrix I of shape (m, 1). Each entry I_k of the index matrix gives the index assigned to the k-th row X_k of X.
Suppose also that the indices lie in the range [0, 1, 2, ..., j-1].
I would like to calculate the average of tensors in X with the same index i and return a result matrix R(j, n).
For example,
X = [[1, 1, 1],
[2, 2, 2],
[3, 3, 3]]
I = [0, 0, 1]
The result matrix should be:
R = [[torch.mean(([1, 1, 1], [2, 2, 2]))],
[torch.mean(([3, 3, 3]))]]
which equals:
R = [[1.5, 1.5, 1.5],
[3, 3, 3]]
My current solution is to iterate over m, stack the tensors that share an index, and call torch.mean on each stack.
Is there a way to avoid iterating over m? It seems inelegant and rather time-consuming.
ret = torch.empty_like(X)
ret.scatter_reduce_(0, I.unsqueeze(-1).expand_as(X), X, "mean", include_self=False)
should do what you want.
Now, note that this is a fairly new method so it may not be particularly performant. If you bump into an issue with this method, you may be better off running scatter_add_ on the tensor X and a tensor of ones and then divide.
If you also want a smaller tensor as output, work out how many distinct indices there are (j in the question) and use that to infer the size of the output, for example by allocating ret with shape (j, n) instead of empty_like(X).
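The scatter_add_ fallback mentioned above could look roughly like this. It is a sketch, not a benchmarked implementation: num_groups, sums and counts are names chosen here, and the number of groups is assumed to be I.max() + 1.

```python
import torch

X = torch.tensor([[1., 1., 1.],
                  [2., 2., 2.],
                  [3., 3., 3.]])
I = torch.tensor([0, 0, 1])

num_groups = int(I.max()) + 1          # assumed: indices cover 0..j-1
idx = I.unsqueeze(-1).expand_as(X)     # one index per element of X

# Sum the rows of X per group, then count how many rows fell in each group.
sums = torch.zeros(num_groups, X.shape[1], dtype=X.dtype).scatter_add_(0, idx, X)
counts = torch.zeros(num_groups, dtype=X.dtype).scatter_add_(
    0, I, torch.ones_like(I, dtype=X.dtype))

# Guard against empty groups before dividing.
R = sums / counts.clamp(min=1).unsqueeze(-1)
print(R)
```

The output here already has the smaller (j, n) shape, so no trimming is needed afterwards.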

diagonalize multiple vectors using numpy

Say I have a matrix of shape (2,3). I need to turn each 3-element row vector into a diagonal matrix of shape (3,3), for both vectors at once, so the result has shape (2,3,3). How can I do that elegantly with NumPy?
given data = np.array([[1,2,3],[4,5,6]])
i want the result [[[1,0,0],
[0,2,0],
[0,0,3]],
[[4,0,0],
[0,5,0],
[0,0,6]]]
Thanks
tl;dr, my one-liner: mydiag=np.vectorize(np.diag, signature='(n)->(n,n)')
I suppose here that by "diagonalize" you mean "applying np.diag".
Which, as a teacher of linear algebra, tickles me a bit. Since "diagonalizing" has a specific meaning, which is not that (it is computing eigen vectors and values, and from there, writing M=P⁻¹ΛP. Which you cannot do from the inputs you have).
So, I suppose that if input matrix is
[[1, 2, 3],
[9, 8, 7]]
The output matrix you want is
[[[1, 0, 0],
[0, 2, 0],
[0, 0, 3]],
[[9, 0, 0],
[0, 8, 0],
[0, 0, 7]]]
If not, you can ignore this answer [Edit: in the meantime, you explained exactly that. So you may continue to read].
There are many ways to do that.
My one liner would be
mydiag=np.vectorize(np.diag, signature='(n)->(n,n)')
This builds a new function that does what you want: it interprets the input as a list of 1D arrays, calls np.diag on each of them to get a 2D array, and puts each 2D array into a numpy array, thus producing a 3D array.
Then, you just call mydiag(M)
One supposed advantage of vectorize is that it relies on numpy broadcasting, so you might expect the loops to run in C rather than in Python, and the result to be faster. In practice it is not: the numpy documentation states that vectorize is provided primarily for convenience, not for performance, and that its implementation is essentially a for loop. On a small matrix it is in fact slower than Michael's method (in the comments); on a large matrix it has exactly the same speed. Which is frustrating, since the einsum documentation itself specifies that it sacrifices broadcasting.
Plus, it is a one-liner, which has no other interest than bragging on forums. But well, here we are.
Here is one way with indexing:
out = np.zeros(data.shape+(data.shape[-1],), dtype=data.dtype)
x,y = np.indices(data.shape).reshape(2, -1)
out[x,y,y] = data.ravel()
output:
array([[[1, 0, 0],
[0, 2, 0],
[0, 0, 3]],
[[4, 0, 0],
[0, 5, 0],
[0, 0, 6]]])
We use array indexing to precisely grab those elements that are on the diagonal. Note that array indexing allows broadcasting between the indices, so we have index1 contain the index of the array, and index2 contain the index of the diagonal element.
index1 = np.arange(2)[:, None] # 2 is the number of arrays
index2 = np.arange(3)[None, :] # 3 is the square size of each matrix
result = np.zeros((2, 3, 3))
result[index1, index2, index2] = data
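To check that the approaches above agree, here is a small sketch that runs the vectorize one-liner and the broadcast-indexing version on the question's data. It also adds a third variant, multiplying by an identity matrix; that variant does not appear in the answers above and is included only for comparison.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
n = data.shape[-1]

# 1) np.vectorize with a gufunc-style signature (convenient, but loops in Python).
mydiag = np.vectorize(np.diag, signature='(n)->(n,n)')
via_vectorize = mydiag(data)

# 2) Broadcast indexing: write each value at position [i, j, j].
rows = np.arange(data.shape[0])[:, None]
cols = np.arange(n)[None, :]
via_indexing = np.zeros(data.shape + (n,), dtype=data.dtype)
via_indexing[rows, cols, cols] = data

# 3) Extra variant (not from the answers): scale the rows of an identity matrix.
via_identity = data[..., None] * np.eye(n, dtype=data.dtype)

print(via_indexing)
```

All three produce an array of shape (2, 3, 3); the indexing version avoids both the Python-level loop and the multiplication by zeros.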

Summing a numpy array based on a multi-labeled mask

Say I have an array:
x = np.array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
And a multi-labeled mask:
labels = np.array([[0, 0, 2],
[1, 1, 2],
[1, 1, 2]])
My goal is to sum the entries of x together, grouped by labels. For example:
n_labels = np.max(labels) + 1
out = np.empty(n_labels)
for label in range(n_labels):
    mask = labels == label
    out[label] = np.sum(x[mask])
>>> out
np.array([1, 20, 15])
However, as x and n_labels become large, I see this being inefficient. Each iteration, you are only summing together a small fraction of the number of entries of x, but still recompute the mask over all of labels (in the expression labels == label) and subsequently index over all of x (in the expression x[mask]). Is there a more efficient way to do this as x and n_labels grow large?
You can use bincount with weights:
np.bincount(labels.ravel(), weights=x.ravel())
out:
array([ 1., 20., 15.])
You don't really have a reason to operate on 2D arrays, so ravel them first:
labels = labels.ravel()
x = x.ravel()
If your labels are already indices, you can use np.argsort along with np.diff and np.add.reduceat:
index = labels.argsort()
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]
result = np.add.reduceat(x[index], splits)
labels[index] holds the labels in sorted order. Whenever the value changes, a new group starts and the diff is nonzero; that is what np.flatnonzero(np.diff(labels[index])) finds for you. Since diff reports the change at the position just before the new run begins, you need to add one to get the start index of each run. np.r_ lets you easily prepend zero to a 1D array, which is necessary for reduceat to include the first run (the last is always automatic).
Before you run reduceat, you need to order x into the runs defined by labels, which is what x[index] does.
You can use 2D arrays with another slow and over-engineered approach using np.add.at
sums = np.zeros(labels.max()+1, x.dtype)
np.add.at(sums, labels, x)
sums
Output
array([ 1, 20, 15])
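The three approaches can be compared side by side on the example data. This is only a consistency check, with variable names chosen here. Note that bincount with weights returns floats, and that the reduceat version yields one sum per label actually present, so it matches the others only when every label from 0 to the maximum occurs.

```python
import numpy as np

x = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
labels = np.array([[0, 0, 2],
                   [1, 1, 2],
                   [1, 1, 2]])

flat_labels = labels.ravel()
flat_x = x.ravel()

# 1) bincount with weights (result is float64).
by_bincount = np.bincount(flat_labels, weights=flat_x)

# 2) sort by label, find where each run starts, reduce each run.
order = flat_labels.argsort(kind="stable")
sorted_labels = flat_labels[order]
starts = np.r_[0, np.flatnonzero(np.diff(sorted_labels)) + 1]
by_reduceat = np.add.reduceat(flat_x[order], starts)

# 3) unbuffered in-place addition.
by_add_at = np.zeros(flat_labels.max() + 1, dtype=x.dtype)
np.add.at(by_add_at, flat_labels, flat_x)

print(by_bincount, by_reduceat, by_add_at)
```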

Find nearest neighbors for arrays of different dimensions

I have to compute a similarity measure on several thousands of uneven arrays.
The naive implementation is basically in O(n²) and it's taking too long for the number of arrays I have.
Fortunately, I'm only interested in the similarity between the most similar arrays.
So far I have used the scikit-learn implementation of NearestNeighbors, which does the job for arrays with the same number of dimensions. However, NearestNeighbors is based on a KD-tree, and I don't think this algorithm can be applied to uneven arrays.
Is there any alternative for arrays of different dimensions?
Here is a code snippet summarizing the problem:
import numpy as np
from sklearn.neighbors import NearestNeighbors
def partial_mse(a: np.ndarray, b: np.ndarray) -> float:
    def mse(a: np.ndarray, b: np.ndarray) -> float:
        mse = (np.square(a - b)).mean()
        return -np.sqrt(mse)
    if a.size == b.size:
        return mse(a, b)
    # a is always the bigger one
    if a.size < b.size:
        a, b = b, a
    # slide the smaller array over the bigger one and keep the best match
    partial_mse = [mse(a[i:i + b.size], b) for i in range(a.size - b.size + 1)]
    return np.max(partial_mse)
uneven_array = np.array([[1, 2, 3, 4], [3, 4], [3, 2, 6], [2, 1, 3], [3]])
even_array = np.array([[1, 2, 3, 4], [3,2, 4, 1], [3, 2, 6, 1], [2, 6, 1, 3], [3, 5, 2, 0]])
nnfit = NearestNeighbors(n_neighbors=2, algorithm='auto', n_jobs=-1,
                         metric=partial_mse, metric_params={}).fit(uneven_array)
ValueError: setting an array element with a sequence.
Nearest-neighbour algorithms treat each array as a point in n-dimensional space. Points of different dimensions will therefore throw the algorithm out of whack, and even if you managed to implement it, the result might not be what you were looking for.
If n is the maximum number of dimensions, then each lower-dimensional point (of dimension k) actually corresponds to (n-k+1) possible points in the higher-dimensional space (obtained by filling the missing dimensions with the elements of array a), and the metric you chose returns the maximum similarity over those (n-k+1) points.
After several tries I found that:
Filling the space with a default value is the only way to use NearestNeighbors and KD-tree.
However, the default value is contaminating the similarity function. The most similar part of the features will be the part with the same filling value.
I fixed it by adding the filling value as parameter of partial_mse and filtering out this value inside partial_mse. This filling value should be a value that doesn't exist on the arrays, otherwise, it will filter out true values !
def partial_mse(a: np.ndarray, b: np.ndarray, **kwargs) -> float:
    [...]
    fill_value = kwargs["fill_value"]
    a, b = a[a != fill_value], b[b != fill_value]
    [...]
nnfit = NearestNeighbors(n_neighbors=10, algorithm='auto', n_jobs=-1,
                         metric=partial_mse, metric_params={"fill_value": fill_value}).fit(matrix_features)
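The padding step itself is not shown above, so here is one way it might be done with NumPy alone. The helper pad_ragged and the sentinel FILL = -1.0 are hypothetical names and values chosen for this sketch; the sentinel is assumed never to occur in the real data. The metric mirrors the question's negated-RMSE similarity and strips the sentinel before comparing.

```python
import numpy as np

FILL = -1.0  # hypothetical sentinel; must not appear in the real data


def pad_ragged(arrays, fill_value):
    """Pad variable-length sequences into one rectangular float matrix."""
    width = max(len(a) for a in arrays)
    out = np.full((len(arrays), width), fill_value, dtype=float)
    for row, a in enumerate(arrays):
        out[row, :len(a)] = a
    return out


def partial_mse(a, b, fill_value=FILL):
    """Best negated RMSE of the shorter array slid over the longer one."""
    a, b = a[a != fill_value], b[b != fill_value]  # drop the padding
    if a.size < b.size:
        a, b = b, a
    scores = [-np.sqrt(np.mean((a[i:i + b.size] - b) ** 2))
              for i in range(a.size - b.size + 1)]
    return float(np.max(scores))


uneven = [[1, 2, 3, 4], [3, 4], [3, 2, 6], [2, 1, 3], [3]]
matrix_features = pad_ragged(uneven, FILL)
print(matrix_features.shape)
print(partial_mse(matrix_features[0], matrix_features[1]))
```

The resulting matrix_features is rectangular, so it can be handed to NearestNeighbors together with metric_params={"fill_value": FILL} as in the snippet above.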

Repeat a scipy csr sparse matrix along axis 0

I wanted to repeat the rows of a scipy csr sparse matrix, but when I tried to call numpy's repeat method, it simply treated the sparse matrix like an object, and would only repeat it as an object in an ndarray. I looked through the documentation, but I couldn't find any utility that repeats the rows of a scipy csr sparse matrix.
I wrote the following code that operates on the internal data, which seems to work
import numpy as np
from scipy import sparse
def csr_repeat(csr, repeats):
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)  # number of stored entries per row
    ndata = rnnz.dot(repeats)   # total entries in the repeated matrix
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    # np.int was removed from NumPy; the builtin int is equivalent here
    indmap = np.ones(ndata, dtype=int)
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))
and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?
Edit
As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix, and then apply the repetition afterwards. My assumption is that post repetition will always be the better way to approach this problem than any of the potential answers.
from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row
It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).
In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.
I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.
I worked out a similar function for the lil format (working from lil.copy):
def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape = list(S.shape)
    if isinstance(repeat, int):
        shape[0] = shape[0] * repeat
    else:
        shape[0] = sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat)  # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new
But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:
S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print(S.toarray().repeat([1,2,3], axis=0))
print(S.toarray()[(0,1,1,2,2,2),:])
print(lil_repeat(S,[1,2,3]).toarray())
print(S[(0,1,1,2,2,2),:].toarray())
give the same result
and best of all?
print(S[np.arange(3).repeat([1,2,3]),:].toarray())
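The same indexing trick applied to a csr matrix, checked against a dense np.repeat, might look like this. It is a sketch that assumes a scipy version with array-style row indexing on csr; row_ids and dense_reference are names introduced here.

```python
import numpy as np
from scipy import sparse

S = sparse.csr_matrix(np.array([[0, 1, 2],
                                [0, 0, 0],
                                [1, 0, 0]]))
repeats = [1, 2, 3]

# Build the row index [0, 1, 1, 2, 2, 2] and let sparse indexing do the copy.
row_ids = np.repeat(np.arange(S.shape[0]), repeats)
repeated = S[row_ids, :]

# Dense reference result for comparison.
dense_reference = np.repeat(S.toarray(), repeats, axis=0)
print(repeated.toarray())
```

The result stays sparse, so the dense array is only built here to verify the output.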
After someone posted a really clever response for how best to do this, I revisited my original question to see if there was an even better way. I came up with one more way that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instead instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.
repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))
This technique repeats the vector repeat_num times, while only modifying the indptr. The downside is that, due to the way csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal, as any operation will be quick given that each of those rows is 0, and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.
I think the marked answer is probably still the "best" way to do this.
One of the most efficient ways to repeat the sparse matrix would be the way OP suggested. I modified indptr so that it doesn't output rows of 0s.
## original sparse matrix
import numpy as np
import scipy.sparse
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
x.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
To repeat this, you need to repeat data and indices, and you need to fix-up the indptr. This is not the most elegant way, but it works.
## repeated sparse matrix
repeat = 5
new_indptr = indptr
for r in range(1,repeat):
    new_indptr = np.concatenate((new_indptr, new_indptr[-1]+indptr[1:]))
x = scipy.sparse.csr_matrix((np.tile(data,repeat), np.tile(indices,repeat), new_indptr))
x.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
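The indptr fix-up loop can also be written without a Python loop by adding a per-copy offset through broadcasting. The sketch below is a variation on the code above, not the answer's own method, and compares it with scipy's sparse.vstack, which gives the same tiled matrix with less bookkeeping. The names offsets, tiled and stacked are chosen here.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

repeat = 5
nnz = indptr[-1]  # stored entries in one copy

# Each copy's row pointers are the original ones shifted by k * nnz.
offsets = np.arange(repeat)[:, None] * nnz
new_indptr = np.concatenate(([0], (offsets + indptr[1:]).ravel()))

tiled = sparse.csr_matrix(
    (np.tile(data, repeat), np.tile(indices, repeat), new_indptr),
    shape=(repeat * x.shape[0], x.shape[1]))

# Alternative: let scipy stack the copies.
stacked = sparse.vstack([x] * repeat, format="csr")
print(tiled.shape, stacked.shape)
```

Passing shape explicitly avoids relying on scipy to infer the column count from the indices.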
