Ignoring duplicate entries in sparse matrix

I've tried to initialize csc_matrix and csr_matrix from a list of (data, (rows, cols)) values as the documentation suggests.
sparse = csc_matrix((data, (rows, cols)), shape=(n, n))
The problem is that the method I actually use to generate the data, rows and cols vectors introduces duplicates for some points. By default, scipy adds the values of the duplicate entries. However, in my case, those duplicates have exactly the same value in data for a given (row, col).
What I'm trying to achieve is to make scipy ignore the second entry if one already exists, instead of adding them.
Ignoring the fact that I could improve the generation algorithm to avoid generating duplicates, is there a parameter or another way of creating a sparse matrix that ignores duplicates?
Currently, two entries with data = [4, 4]; cols = [1, 1]; rows = [1, 1] generate a sparse matrix whose value at (1,1) is 8, while the desired value is 4.
>>> c = csc_matrix(([4, 4], ([1,1],[1,1])), shape=(3,3))
>>> c.todense()
matrix([[0, 0, 0],
[0, 8, 0],
[0, 0, 0]])
I'm also aware that I could filter them by using a 2-dimensional numpy unique function, but lists are quite large so this is not really a valid option.
Another acceptable answer to the question would be: is there any way of specifying what to do with duplicates, i.e. keeping the min or max instead of the default sum?

Creating an intermediary dok matrix works in your example:
In [410]: c=sparse.coo_matrix((data, (cols, rows)),shape=(3,3)).todok().tocsc()
In [411]: c.A
Out[411]:
array([[0, 0, 0],
[0, 4, 0],
[0, 0, 0]], dtype=int32)
A coo matrix puts your input arrays into its data,col,row attributes without change. The summing doesn't occur until it is converted to a csc.
todok loads the dictionary directly from the coo attributes. It creates the blank dok matrix, and fills it with:
dok.update(izip(izip(self.row,self.col),self.data))
So if there are duplicate (row,col) values, it's the last one that remains. This uses the standard Python dictionary hashing to find the unique keys.
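The "last one remains" behaviour comes from plain dictionary semantics. The sketch below reproduces only that mechanism with a built-in dict; it is an illustration and not SciPy's own code, and the variable names are made up for the example.

```python
# Duplicate (row, col) keys: a dict keeps only the most recently written value.
rows = [1, 1, 2]
cols = [1, 1, 0]
data = [4, 7, 5]   # the two entries at (1, 1) differ on purpose

entries = {}
entries.update(zip(zip(rows, cols), data))

print(entries)
```

With identical values at the repeated coordinate, as in the question, it does not matter which entry survives.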
Here's a way of using np.unique. I had to construct a special object array, because unique operates on 1d, and we have a 2d indexing.
In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1],[0,1,1,1,2],[0,1,2,1,1]]]
In [480]: x=np.zeros(cols.shape,dtype=object)
In [481]: x[:]=list(zip(rows,cols))
In [482]: x
Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)
In [483]: i=np.unique(x,return_index=True)[1]
In [484]: i
Out[484]: array([0, 1, 4, 2], dtype=int32)
In [485]: c1=sparse.csc_matrix((data[i],(cols[i],rows[i])),shape=(3,3))
In [486]: c1.A
Out[486]:
array([[1, 0, 0],
[0, 4, 2],
[0, 1, 0]], dtype=int32)
I have no idea which approach is faster.
An alternative way of getting the unique index, as per liuengo's link:
rc = np.vstack([rows,cols]).T.copy()
dt = rc.dtype.descr * 2
i = np.unique(rc.view(dt), return_index=True)[1]
rc has to own its own data in order to change the dtype with view, hence the .T.copy().
In [554]: rc.view(dt)
Out[554]:
array([[(0, 0)],
[(1, 1)],
[(2, 1)],
[(1, 1)],
[(1, 2)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
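As a further variation (a sketch, not taken from the answers above), each (row, col) pair can be encoded as a single integer so that np.unique works on a plain 1-D array. The same inverse index also makes it possible to keep the max per coordinate instead of the sum. The names key, keep_first and keep_max are hypothetical, and the encoding assumes non-negative indices smaller than n.

```python
import numpy as np

rows = np.array([0, 1, 2, 1, 1])
cols = np.array([0, 1, 1, 1, 2])
data = np.array([1, 4, 2, 9, 1])   # (1, 1) appears twice, with values 4 and 9
n = 3                              # the matrix is n x n

# Encode each (row, col) pair as one integer so np.unique can work in 1-D.
key = rows * n + cols
uniq, first, inverse = np.unique(key, return_index=True, return_inverse=True)
u_rows, u_cols = np.divmod(uniq, n)

# Keep the first value seen for every coordinate.
keep_first = data[first]

# Keep the maximum per coordinate with an unbuffered ufunc reduction.
keep_max = np.full(uniq.shape, data.min())
np.maximum.at(keep_max, inverse, data)

print(u_rows, u_cols, keep_first, keep_max)
```

The deduplicated u_rows, u_cols and one of the value arrays can then be passed to csc_matrix in the same way as in the question.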

Since the values in your data at repeating (row, col) are the same, you can get the unique rows, columns and values as follows:
rows, cols, data = zip(*set(zip(rows, cols, data)))
Example:
data = [4, 3, 4]
cols = [1, 2, 1]
rows = [1, 3, 1]
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 8, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
rows, cols, data = zip(*set(zip(rows, cols, data)))
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 4, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])

To update hpaulj's answer for more recent versions of SciPy: given a COO matrix c, the simplest solution to this problem is now:
dok=sparse.dok_matrix((c.shape),dtype=c.dtype)
dok._update(zip(zip(c.row,c.col),c.data))
new_c = dok.tocsc()
This is due to a new wrapper around the dok update() function that prevents direct changes to the array; the underscore-prefixed _update bypasses the wrapper.

Related

Repeat Array while Maintaining Order within group

I have the below array and would like to repeat each array n times.
x_array
[array([14.91488012, 1.2986064 , 4.98965322]),
array([2.39389187e+02, 1.04442059e-01, 3.06391338e-01]),
array([ 48.19437348, 201.09951372, 0.35223001]),
array([ 19.96978171, 367.52578786, 0.68676553]),
array([0.55120466, 0.27133609, 0.75646697]),
array([8.21287360e+02, 1.76495077e+02, 4.87263691e-01]),
array([184.03439377, 1.24823107, 5.33109884]),
array([575.59800297, 186.4650814 , 2.21028258]),
array([0.50308552, 3.09976082, 0.10537899]),
array([1.02259912e+00, 1.52282513e+02, 1.15085308e-01])]
I've tried np.repeat(x_array, 2), but this doesn't preserve the order of the matrix/array. I've also tried x_array*2, but this seems to just put the new array at the bottom. I was hoping to repeat x_array[0] n times and do the same for each subsequent array, so that I have n copies of each, in order.
Thanks in advance.

Building off of the last example from https://numpy.org/doc/stable/reference/generated/numpy.repeat.html,
x_array = np.array(x_array) # Or a similar operation to convert x_array to an ndarray vs. a list of arrays.
expanded_x_array = np.repeat(x_array, n, axis=0)
print(expanded_x_array)
should produce what you are looking for.
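A small self-contained run of the same idea, using made-up two-row data instead of the values from the question:

```python
import numpy as np

# A list of 1-D arrays, like x_array in the question (values are illustrative).
x_list = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
n = 2

x = np.array(x_list)                 # shape (2, 3)
expanded = np.repeat(x, n, axis=0)   # each row repeated n times, order preserved

print(expanded)
```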

You just need to specify the axis:
>>> np.repeat(x_array, 2, axis=0)
array([[1.49149e+01, 1.29861e+00, 4.98965e+00],
[1.49149e+01, 1.29861e+00, 4.98965e+00],
[2.39389e+02, 1.04442e-01, 3.06391e-01],
[2.39389e+02, 1.04442e-01, 3.06391e-01],
...,
[5.03086e-01, 3.09976e+00, 1.05379e-01],
[5.03086e-01, 3.09976e+00, 1.05379e-01],
[1.02260e+00, 1.52283e+02, 1.15085e-01],
[1.02260e+00, 1.52283e+02, 1.15085e-01]])
From the docs:
numpy.repeat(a, repeats, axis=None)
...
axis int, optional
The axis along which to repeat values. By default, use the flattened input array, and return a flat output array.
(added bold)

You could use a list comprehension:
n = 2
repeated_list = [row for row in x_array for _ in range(n)]
print(repeated_list)

Your terminology is confusing. You say it's an "array", but the display looks more like a list, and the fact that x_array*2 puts a "new array" at the bottom confirms that: that's how * works on a list.
np.repeat(x_array) first makes an array (a real one!)
np.array(x_array)
is a (n,3) float dtype array. Without axis np.repeat flattens - as documented!
Specifying the axis=0 works because it's repeating on that first n dimension. The result is a (2*n,3) float dtype array (not a list).
It is possible to make a 1d object dtype array containing those arrays. With that repeat will work without the axis parameter.
Knowing what you have, and describing it accurately, can make this kind of task much easier - and the questions clearer.
illustration
Make a list of arrays:
In [21]: alist = [np.ones(3,int),np.zeros(3,int),np.arange(3)]
In [22]: alist
Out[22]: [array([1, 1, 1]), array([0, 0, 0]), array([0, 1, 2])]
List repeat:
In [23]: alist*2
Out[23]:
[array([1, 1, 1]),
array([0, 0, 0]),
array([0, 1, 2]),
array([1, 1, 1]),
array([0, 0, 0]),
array([0, 1, 2])]
Make a 2d array from the list:
In [24]: np.array(alist)
Out[24]:
array([[1, 1, 1],
[0, 0, 0],
[0, 1, 2]])
repeat without axis repeats elements in a flattened way:
In [25]: np.repeat(alist,2)
Out[25]: array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
repeat this 2d array on 0 axis:
In [26]: np.repeat(alist,2,axis=0)
Out[26]:
array([[1, 1, 1],
[1, 1, 1],
[0, 0, 0],
[0, 0, 0],
[0, 1, 2],
[0, 1, 2]])
Object dtype array from list:
In [27]: arr = np.empty(3,object); arr[:]=alist
In [28]: arr
Out[28]: array([array([1, 1, 1]), array([0, 0, 0]), array([0, 1, 2])], dtype=object)
Since the arrays have the same size we have to use this special construct. Otherwise we get the 2d array [24].
This array has a repeat method, and with only one dimension we don't need to specify the axis. It's repeating the object elements (the arrays), not the numbers in the 2d [24] array.
In [29]: arr.repeat(2)
Out[29]:
array([array([1, 1, 1]), array([1, 1, 1]), array([0, 0, 0]),
array([0, 0, 0]), array([0, 1, 2]), array([0, 1, 2])], dtype=object)

vectorize upper level of a vectorized code - python - numpy [duplicate]

I have an array X:
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
And I wish to find the index of the row of several values in this array:
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
For this example I would like a result like:
[0,3,4]
I have a code doing this, but I think it is overly complicated:
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
result = []
for s in searched_values:
    idx = np.argwhere([np.all((X-s)==0, axis=1)])[0][1]
    result.append(idx)
print(result)
I found this answer for a similar question but it works only for 1d arrays.
Is there a way to do what I want in a simpler way?

Approach #1
One approach would be to use NumPy broadcasting, like so -
np.where((X==searched_values[:,None]).all(-1))[1]
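To make the broadcasting step easier to follow, here is the same one-liner split into named intermediates (the names elementwise, row_match, search_idx and x_idx are just for illustration):

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# (3, 1, 2) against (5, 2) broadcasts to (3, 5, 2):
# every searched row is compared with every row of X.
elementwise = X == searched_values[:, None]
row_match = elementwise.all(-1)           # (3, 5): True where a whole row matches
search_idx, x_idx = np.where(row_match)   # x_idx holds the row numbers in X

print(x_idx)
```

The intermediate boolean array has len(searched_values) * len(X) * X.shape[1] elements, which is why the next approaches are described as more memory efficient.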
Approach #2
A memory efficient approach would be to convert each row as linear index equivalents and then using np.in1d, like so -
dims = X.max(0)+1
out = np.where(np.in1d(np.ravel_multi_index(X.T,dims),\
np.ravel_multi_index(searched_values.T,dims)))[0]
Approach #3
Another memory efficient approach using np.searchsorted and with that same philosophy of converting to linear index equivalents would be like so -
dims = X.max(0)+1
X1D = np.ravel_multi_index(X.T,dims)
searched_valuesID = np.ravel_multi_index(searched_values.T,dims)
sidx = X1D.argsort()
out = sidx[np.searchsorted(X1D,searched_valuesID,sorter=sidx)]
Please note that this np.searchsorted method assumes there is a match for each row from searched_values in X.
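If some searched rows may be absent from X, one possible extension (a sketch under that assumption, not part of the original approach) is to check each candidate position after the search and mark misses with -1. Note that dims then has to cover both arrays, because np.ravel_multi_index raises an error for out-of-range indices by default.

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6], [0, 0]])  # last row is not in X

# The grid must be large enough for the values of both arrays.
dims = np.maximum(X.max(0), searched_values.max(0)) + 1
X1D = np.ravel_multi_index(X.T, dims)
searched_valuesID = np.ravel_multi_index(searched_values.T, dims)

sidx = X1D.argsort()
pos = np.searchsorted(X1D, searched_valuesID, sorter=sidx)
pos = np.clip(pos, 0, len(X1D) - 1)     # guard against an insertion point past the end
candidates = sidx[pos]
found = X1D[candidates] == searched_valuesID
out = np.where(found, candidates, -1)   # -1 marks rows without a match

print(out)
```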
How does np.ravel_multi_index work?
This function gives us the linear index equivalents. It accepts a 2D array of n-dimensional indices, arranged as columns, together with the shape of the n-dimensional grid onto which those indices are mapped, and it computes the equivalent linear indices.
Let's use the inputs we have for the problem at hand. Take the case of input X and note its first row. Since we are trying to convert each row of X into its linear index equivalent, and since np.ravel_multi_index treats each column as one indexing tuple, we need to transpose X before feeding it into the function. Since the number of elements per row in X is 2 in this case, the n-dimensional grid to be mapped onto is 2D. With 3 elements per row in X, it would have been a 3D grid, and so on.
To see how this function would compute linear indices, consider the first row of X -
In [77]: X
Out[77]:
array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
We have the shape of the n-dimensional grid as dims -
In [78]: dims
Out[78]: array([10, 7])
Let's create the 2-dimensional grid to see how that mapping works and linear indices get computed with np.ravel_multi_index -
In [79]: out = np.zeros(dims,dtype=int)
In [80]: out
Out[80]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
Let's set the first indexing tuple from X, i.e. the first row from X into the grid -
In [81]: out[4,2] = 1
In [82]: out
Out[82]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
Now, to see the linear index equivalent of the element just set, let's flatten and use np.where to detect that 1.
In [83]: np.where(out.ravel())[0]
Out[83]: array([30])
This could also be computed if row-major ordering is taken into account.
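For a 2D grid in row-major (C) order, the linear index is simply row * number_of_columns + column, so for the first row of X it is 4 * 7 + 2 = 30. A short check of that arithmetic for all rows (the names manual and library are illustrative):

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
dims = X.max(0) + 1   # array([10, 7])

# Row-major linear index on a 2D grid: row * number_of_columns + column
manual = X[:, 0] * dims[1] + X[:, 1]
library = np.ravel_multi_index(X.T, dims)

print(manual)
print(library)
```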
Let's use np.ravel_multi_index and verify those linear indices -
In [84]: np.ravel_multi_index(X.T,dims)
Out[84]: array([30, 66, 61, 24, 41])
Thus, we would have linear indices corresponding to each indexing tuple from X, i.e. each row from X.
Choosing dimensions for np.ravel_multi_index to form unique linear indices
Now, the idea behind considering each row of X as indexing tuple of a n-dimensional grid and converting each such tuple to a scalar is to have unique scalars corresponding to unique tuples, i.e. unique rows in X.
Let's take another look at X -
In [77]: X
Out[77]:
array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
Now, as discussed in the previous section, we are considering each row as an indexing tuple. Within each such tuple, the first element represents the first axis of the n-dim grid, the second element the second axis, and so on until the last element of each row in X. In essence, each column represents one dimension or axis of the grid. If we are to map all elements from X onto the same n-dim grid, we need to consider the maximum extent of each axis of that grid. Assuming we are dealing with non-negative numbers in X, that extent is the maximum of each column in X, plus 1. The + 1 is there because Python uses 0-based indexing. So, for example, X[1,0] == 9 maps to the 10th row of the proposed grid. Similarly, X[4,1] == 6 goes to the 7th column of that grid.
So, for our sample case, we had -
In [7]: dims = X.max(axis=0) + 1 # Or simply X.max(0) + 1
In [8]: dims
Out[8]: array([10, 7])
Thus, we need a grid of at least shape (10,7) for our sample case. Larger lengths along the dimensions won't hurt and would give us unique linear indices too.
Concluding remarks: One important thing to note here is that if we have negative numbers in X, we need to add proper offsets along each column in X to make those indexing tuples non-negative before using np.ravel_multi_index.
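A minimal sketch of that offsetting step, with made-up data containing negative values. The per-column minimum is taken over both arrays so that the same shift applies to X and to the searched rows; the names both, lo, keys_X and keys_s are hypothetical.

```python
import numpy as np

X = np.array([[-1, 2], [3, -4], [0, 0]])
searched_values = np.array([[3, -4], [-1, 2]])

# Shift every column so its smallest value (over both arrays) becomes 0.
both = np.vstack([X, searched_values])
lo = both.min(0)
dims = both.max(0) - lo + 1

keys_X = np.ravel_multi_index((X - lo).T, dims)
keys_s = np.ravel_multi_index((searched_values - lo).T, dims)

idx = np.flatnonzero(np.isin(keys_X, keys_s))
print(idx)
```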

Another alternative is to use asvoid (below) to view each row as a single value of void dtype. This reduces a 2D array to a 1D array, thus allowing you to use np.in1d as usual:
import numpy as np
def asvoid(arr):
    """
    Based on http://stackoverflow.com/a/16973510/190597 (Jaime, 2013-06)
    View the array as dtype np.void (bytes). The items along the last axis are
    viewed as one value. This allows comparisons to be performed which treat
    entire rows as one value.
    """
    arr = np.ascontiguousarray(arr)
    if np.issubdtype(arr.dtype, np.floating):
        """ Care needs to be taken here since
        np.array([-0.]).view(np.void) != np.array([0.]).view(np.void)
        Adding 0. converts -0. to 0.
        """
        arr += 0.
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
idx = np.flatnonzero(np.in1d(asvoid(X), asvoid(searched_values)))
print(idx)
# [0 3 4]

The numpy_indexed package (disclaimer: I am its author) contains functionality for performing such operations efficiently (also uses searchsorted under the hood). In terms of functionality, it acts as a vectorized equivalent of list.index:
import numpy_indexed as npi
result = npi.indices(X, searched_values)
Note that using the 'missing' kwarg, you have full control over the behavior for missing items, and it works for nd-arrays (e.g. stacks of images) as well.
Update: using the same shapes as @Rik, X=[520000,28,28] and searched_values=[20000,28,28], it runs in 0.8064 secs, using missing=-1 to detect and denote entries not present in X.

Here is a pretty fast solution that scales up well, using numpy and hashlib. It can handle large matrices or images in seconds. I used it on a 520000 x (28 x 28) array and a 20000 x (28 x 28) array in 2 seconds on my CPU.
Code:
import numpy as np
import hashlib
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
#hash using sha1 appears to be efficient
xhash=[hashlib.sha1(row).digest() for row in X]
yhash=[hashlib.sha1(row).digest() for row in searched_values]
z=np.in1d(xhash,yhash)
##Use unique to get unique indices to in1d results
_,unique=np.unique(np.array(xhash)[z],return_index=True)
##Compute unique indices by indexing an array of indices
idx=np.array(range(len(xhash)))
unique_idx=idx[z][unique]
print('unique_idx=',unique_idx)
print('X[unique_idx]=',X[unique_idx])
Output:
unique_idx= [4 3 0]
X[unique_idx]= [[5 6]
[3 3]
[4 2]]

X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
S = np.array([[4, 2],
[3, 3],
[5, 6]])
result = [[i for i,row in enumerate(X) if (s==row).all()] for s in S]
or
result = [i for s in S for i,row in enumerate(X) if (s==row).all()]
if you want a flat list (assuming there is exactly one match per searched value).

Another way is to use the cdist function from scipy.spatial.distance, like this:
from scipy.spatial.distance import cdist
np.nonzero(cdist(X, searched_values) == 0)[0]
Basically, we get the row numbers of X that have distance zero to a row in searched_values, meaning they are equal. This makes sense if you look at the rows as coordinates.

I had a similar requirement and the following worked for me:
np.argwhere(np.isin(X, searched_values).all(axis=1))
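One caveat worth checking before relying on this: np.isin tests each element against all values in searched_values, not against whole rows, so a row of X whose elements all occur somewhere in searched_values is reported even if that exact row is not searched for. It gives the expected result on the data in the question, but the small made-up example below shows a false positive next to a row-wise comparison:

```python
import numpy as np

X = np.array([[4, 2], [2, 4], [9, 3]])
searched_values = np.array([[4, 2]])

# Element-wise membership: [2, 4] passes because both 2 and 4 occur in searched_values.
elementwise = np.isin(X, searched_values).all(axis=1)

# Row-wise comparison: only the exact row [4, 2] matches.
rowwise = (X[:, None] == searched_values).all(-1).any(1)

print(elementwise)
print(rowwise)
```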

Here's what worked out for me:
def find_points(orig: np.ndarray, search: np.ndarray) -> np.ndarray:
    equals = [np.equal(orig, p).all(1) for p in search]
    exists = np.max(equals, axis=1)
    indices = np.argmax(equals, axis=1)
    indices[~exists] = -1
    return indices
test:
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6],
[0, 0]])
find_points(X, searched_values)
output:
[0,3,4,-1]

initialize a matrix in python of the row number

Is there a way to initialize a 3 row, 5 column matrix which contains these values without using a for loop?
[[0 0 0 0 0]
 [1 1 1 1 1]
 [2 2 2 2 2]]

It's possible.
i = 0
matrix = []
while i <= 2:
    matrix += [[i]*5]
    i += 1

Without any for loops or list comprehensions, you can use a combination of built-in functions (in Python 3, wrap the call in list() to materialize the result):
list(map(list, zip(*[range(3)] * 5)))

If you're dealing with large datasets and are worried about performance, you might want to consider putting your data into a two-dimensional NumPy array. Here are a couple of ways of generating the matrix you ask for in NumPy:
>>> import numpy as np
>>> np.indices((3, 5))[0]
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
>>> np.repeat(np.arange(3), 5).reshape((3, 5))
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
The first of these is simpler, but a little bit wasteful: the np.indices call actually generates the array you want (which could be regarded as an array of row indices) along with a companion array of column indices:
>>> np.indices((3, 5))[1]
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
with both arrays packed conveniently into a single array of shape (2, 3, 5). If you need that second array anyway for what you're doing then np.indices is the way to go (though in that case you may also want to look into NumPy's mgrid, ogrid and meshgrid functions). The second solution with np.repeat only generates the values you need, and not surprisingly, finishes about twice as fast on my machine when I bump the size of the matrix up to (3000, 5000):
In [19]: %timeit np.indices((3000, 5000))[0]
10 loops, best of 3: 156 ms per loop
In [20]: %timeit np.repeat(np.arange(3000), 5000).reshape((3000, 5000))
10 loops, best of 3: 88.4 ms per loop
Having said that, using np.repeat in this way is a little bit of an antipattern in NumPy: it's often better to avoid the repetition by creating a 2d array with 3 rows and a single column, and rely on NumPy's broadcasting to interpret this correctly when it's combined with other arrays. If you go that way, all you need is:
>>> np.arange(3)[:, np.newaxis]
array([[0],
[1],
[2]])
This is an array of shape (3, 1); a subsequent operation with an array of shape (5,) or (1, 5) (for example) would be subject to NumPy's broadcasting rules, producing an output of shape (3, 5). For example, here's what happens when we add a 1d array of zeros to the above:
>>> np.arange(3)[:, np.newaxis] + np.zeros(5, dtype=int)
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
And for completeness, here's one more variation, using np.tile:
>>> np.tile(np.arange(3)[:, np.newaxis], (1, 5))
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
All of these solutions should have reasonably similar performance for large values of 3 and 5; if this is a bottleneck, you'll want to do timings on your machine to decide which to use. On my machine, the +np.zeros broadcasting solution beats the others by some margin.
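A rough way to run such timings outside IPython is sketched below with the standard timeit module. The sizes and the number of repetitions are arbitrary choices for a quick run, and the dictionary of candidates is only a convenience for this example.

```python
import timeit
import numpy as np

rows, cols = 300, 500   # arbitrary sizes chosen for a quick run

# The four constructions discussed above, wrapped as zero-argument callables.
candidates = {
    "indices": lambda: np.indices((rows, cols))[0],
    "repeat": lambda: np.repeat(np.arange(rows), cols).reshape((rows, cols)),
    "broadcast": lambda: np.arange(rows)[:, np.newaxis] + np.zeros(cols, dtype=int),
    "tile": lambda: np.tile(np.arange(rows)[:, np.newaxis], (1, cols)),
}

# Total time in seconds for 20 calls of each construction.
results = {name: timeit.timeit(build, number=20) for name, build in candidates.items()}

for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {seconds:.4f} s for 20 runs")
```

The ranking depends on the machine, the NumPy version and the array size, so it is worth rerunning with the shapes you actually use.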

This is an easy way to understand for Python beginners:
matrix = []
for data in range(3):
    matrix.append([data] * 5)

This is possible using:
[[data] * 5 for data in range(3)]

How to group elements of a numpy array with the same value in separate numpy arrays

By way of introduction, I am a beginner in Python. However, I have quite a big project to code: a surface flow model based on cellular automata. I also want to include building roofs in my model. Imagine you have an ASCII file indicating buildings with 1s, while the rest is 0. There are just those two states. Now, I want to find all adjacent cells belonging to the same building and store them (or rather their y and x coordinates plus one more value, maybe elevation, so 3 columns) in individual building arrays. Keep in mind that buildings can have any shape, though diagonally connected cells don't belong to the same building. So only northern, southern, western and eastern neighbours can belong to the same building.
I did my homework and googled it but so far I couldn't find a satisfying answer.
example:
initial land-cover array:
([0,0,0,0,0,0,0]
[0,0,1,0,0,0,0]
[0,1,1,1,0,1,1]
[0,1,0,1,0,0,1]
[0,0,0,0,0,0,0])
output (I need to know the coordinates of the cells in my initial array):
building_1=([1,2],[2,1],[2,2],[2,3],[3,1],[3,3])
building_2=([2,5],[2,6],[3,6])
Any help is greatly appreciated!

You can use the label function from scipy.ndimage to identify the distinct buildings.
Here's your example array, containing two buildings:
In [57]: a
Out[57]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 1, 1, 1, 0, 1, 1],
[0, 1, 0, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0]])
Import label.
In [58]: from scipy.ndimage import label
Apply label to a. It returns two values: the array of labeled positions, and the number of distinct objects (buildings, in this case) found.
In [59]: lbl, nlbls = label(a)
In [60]: lbl
Out[60]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 1, 1, 1, 0, 2, 2],
[0, 1, 0, 1, 0, 0, 2],
[0, 0, 0, 0, 0, 0, 0]], dtype=int32)
In [61]: nlbls
Out[61]: 2
To get the coordinates of a building, np.where can be used. For example,
In [64]: np.where(lbl == 2)
Out[64]: (array([2, 2, 3]), array([5, 6, 6]))
It returns a tuple of arrays; the kth array holds the coordinates of the kth dimension. You can use, for example, np.column_stack to combine these into an array:
In [65]: np.column_stack(np.where(lbl == 2))
Out[65]:
array([[2, 5],
[2, 6],
[3, 6]])
You might want a list of all the coordinate arrays. Here's one way to create such a list.
For convenience, first create a list of labels:
In [66]: labels = list(range(1, nlbls+1))
In [67]: labels
Out[67]: [1, 2]
Use a list comprehension to create the list of coordinate arrays.
In [68]: coords = [np.column_stack(np.where(lbl == k)) for k in labels]
In [69]: coords
Out[69]:
[array([[1, 2],
[2, 1],
[2, 2],
[2, 3],
[3, 1],
[3, 3]]),
array([[2, 5],
[2, 6],
[3, 6]])]
Now your building data is in labels and coords. For example, the first building was labeled labels[0], and its coordinates are in coords[0]:
In [70]: labels[0]
Out[70]: 1
In [71]: coords[0]
Out[71]:
array([[1, 2],
[2, 1],
[2, 2],
[2, 3],
[3, 1],
[3, 3]])
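For reference, the steps above can be gathered into one self-contained script. This is only a sketch: the explicit structure argument and the names four_connected and coords (as a dict) are my additions. The structure spells out what I understand to be label's default connectivity in 2-D (north, south, west and east neighbours only), so diagonal cells stay in separate buildings, as the question requires.

```python
import numpy as np
from scipy.ndimage import label

a = np.array([[0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 1, 1, 1, 0, 1, 1],
              [0, 1, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 0]])

# Cross-shaped structuring element: only N, S, W, E neighbours are connected.
four_connected = np.array([[0, 1, 0],
                           [1, 1, 1],
                           [0, 1, 0]])

lbl, nlbls = label(a, structure=four_connected)

# One (n_cells, 2) array of (row, col) coordinates per building label.
coords = {k: np.column_stack(np.where(lbl == k)) for k in range(1, nlbls + 1)}

for k, rc in coords.items():
    print("building", k, rc.tolist())
```

Passing np.ones((3, 3)) as structure instead would also join diagonally touching cells, which is not what is wanted here.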

Thank you for the great answers! Here is a little correction. If you see the landcover array, I actually don't have 0 as background information but -9999 (0 is too precious for GIS people). I forgot to mention that. But thanks to machine yearning's hint, I made a work-around by assigning all -9999 with 0 through landcover = np.where(landcover > -9999, landcover, 0). After that I can use label. The actual aim was to find the lowest cell and to assign it as outlet. If somebody has a more efficient way, please let me know!
import numpy as np
from scipy.ndimage import label

# Original data set has -9999 as background information and 1 as building cells.
landcover = np.array([[-9999,-9999,-9999,-9999,-9999,-9999,1],
                      [-9999,-9999,1,-9999,-9999,-9999,-9999],
                      [-9999,1,1,1,-9999,1,1],
                      [-9999,1,-9999,1,-9999,-9999,1],
                      [-9999,-9999,-9999,-9999,-9999,-9999,-9999]], dtype=int)
# Here is a random digital elevation map.
DEM = np.array([[7,4,3,2,4,5,4],
                [4,5,5,3,5,6,7],
                [2,6,4,7,4,4,4],
                [3,7,8,8,10,9,7],
                [2,5,7,7,9,10,8]], dtype=float)
# I changed all -9999 entries to 0 in order to use label (thanks to machine yearning).
landcover = np.where(landcover > -9999, landcover, 0)
# Then I labeled the distinct buildings and counted them (Warren Weckesser, the rest is pretty much yours. Thanks!).
lbl, nlbls = label(landcover)
bldg_number = range(1, nlbls+1)
bldg_coord = [np.column_stack(np.where(lbl == k)) for k in bldg_number]
outlets = np.zeros([nlbls,3])
# I iterate over the bldg_coord list in order to determine the lowest cells, which will be assigned as outlets.
for i in range(0, nlbls):
    building = np.zeros([bldg_coord[i].shape[0],3])
    for j in range(0, bldg_coord[i].shape[0]):
        building[j][0] = bldg_coord[i][j][0]
        building[j][1] = bldg_coord[i][j][1]
        building[j][2] = DEM[bldg_coord[i][j][0],bldg_coord[i][j][1]]
    # I sort the building array in ascending order of DEM value to find the lowest-lying building cell.
    building = building[building[:,2].argsort()]
    # The lowest building cell will be used as the roof outlet for rainwater.
    outlets[i][0] = building[0][0]
    outlets[i][1] = building[0][1]
    outlets[i][2] = bldg_coord[i].shape[0]
Here is the output. The first two columns are indices in the landcover array and the last is the number of adjacent building cells.
>>> outlets
array([[ 0., 6., 1.],
[ 2., 2., 6.],
[ 2., 5., 3.]])
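On the efficiency question: since only the lowest cell of each building is needed, the inner loop and the sort can probably be replaced by fancy indexing plus argmin. The following is a sketch rather than a tested drop-in replacement. The names rows, cols and lowest are mine, and ties are resolved in favour of the first cell in row-major order, which is not guaranteed to match what argsort picks. On this data it should give the same three rows as above.

```python
import numpy as np
from scipy.ndimage import label

landcover = np.array([[-9999, -9999, -9999, -9999, -9999, -9999, 1],
                      [-9999, -9999, 1, -9999, -9999, -9999, -9999],
                      [-9999, 1, 1, 1, -9999, 1, 1],
                      [-9999, 1, -9999, 1, -9999, -9999, 1],
                      [-9999, -9999, -9999, -9999, -9999, -9999, -9999]], dtype=int)
DEM = np.array([[7, 4, 3, 2, 4, 5, 4],
                [4, 5, 5, 3, 5, 6, 7],
                [2, 6, 4, 7, 4, 4, 4],
                [3, 7, 8, 8, 10, 9, 7],
                [2, 5, 7, 7, 9, 10, 8]], dtype=float)

# Comparing against 1 yields a boolean mask, so the -9999 background
# does not have to be replaced by 0 first.
lbl, nlbls = label(landcover == 1)

outlets = np.zeros((nlbls, 3))
for k in range(1, nlbls + 1):
    rows, cols = np.where(lbl == k)          # all cells of building k
    lowest = np.argmin(DEM[rows, cols])      # position of the lowest cell
    outlets[k - 1] = rows[lowest], cols[lowest], rows.size

print(outlets)
```

The loop over buildings remains, but each iteration now works on whole arrays instead of copying cell by cell.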

It looks like this function does exactly what you're looking for (from the numpy documentation):
numpy.argwhere(a):
Find the indices of array elements that are non-zero, grouped by element.
>>> x = np.arange(6).reshape(2,3)
>>> x
array([[0, 1, 2],
[3, 4, 5]])
>>> np.argwhere(x>1)
array([[0, 2],
[1, 0],
[1, 1],
[1, 2]])
However, your use case seems to require using the returned coordinates to index arrays, and the same documentation notes:
The output of argwhere is not suitable for indexing arrays. For this purpose use where(a) instead.
So you might want to try numpy.where instead.
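To make the difference concrete, here is a small sketch using the same x as in the documentation example (the names pairs and index are mine): argwhere returns one (row, col) pair per match, while where returns a tuple of index arrays that can be used directly for indexing.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

pairs = np.argwhere(x > 1)   # shape (4, 2): one (row, col) pair per match
index = np.where(x > 1)      # tuple of two arrays: (rows, cols)

# The where tuple indexes the array directly.
print(x[index])              # [2 3 4 5]

# The argwhere output can still be used if it is turned back into a tuple.
print(x[tuple(pairs.T)])     # [2 3 4 5]

# column_stack turns the where tuple into argwhere-style rows.
print(np.array_equal(np.column_stack(index), pairs))  # True
```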
