indices of sparse_csc matrix are reversed after extracting some columns - python

I'm trying to extract columns of a scipy sparse column matrix, but the result is not stored as I'd expect. Here's what I mean:
In [77]: a = scipy.sparse.csc_matrix(np.ones([4, 5]))
In [78]: ind = np.array([True, True, False, False, False])
In [79]: b = a[:, ind]
In [80]: b.indices
Out[80]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
In [81]: a.indices
Out[81]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
How come b.indices is not [0, 1, 2, 3, 0, 1, 2, 3] ?
And since this behaviour is not the one I expect, is a[:, ind] not the correct way to extract columns from a csc matrix?

The indices are not sorted. You can either force the looping by reversing in a's rows, which is not that intuitive, or enforce sorted indices (you can also do it in-place, but I prefer casting). What I find funny is that the has_sorted_indices attribute does not always return a boolean, but mixes it with integer representation.
a = scipy.sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])
b = a[::-1, ind]
b2 = a[:, ind]
b3 = b2.sorted_indices()
b.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b.has_sorted_indices
>>1
b2.indices
>>array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
b2.has_sorted_indices
>>0
b3.indices
array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b3.has_sorted_indices
>>True

csc and csr indices are not guaranteed to be sorted. I can't off hand find documentation to the effect, but the has_sort_indices and the sort methods suggest that.
In your case the order is the result of how the indexing is done. I found in previous SO questions, that multicolumn indexing is performed with a matrix multiplication:
In [165]: a = sparse.csc_matrix(np.ones([4,5]))
In [166]: b = a[:,[0,1]]
In [167]: b.indices
Out[167]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
This indexing is the equivalent to constructing a 'selection' matrix:
In [169]: I = sparse.csr_matrix(np.array([[1,0,0,0,0],[0,1,0,0,0]]).T)
In [171]: I.A
Out[171]:
array([[1, 0],
[0, 1],
[0, 0],
[0, 0],
[0, 0]], dtype=int32)
and doing this matrix multiplication:
In [172]: b1 = a * I
In [173]: b1.indices
Out[173]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
The order is the result of how the matrix multiplication was done. In fact a * a.T does the same reversal. We'd have to examine the multiplication code to know exactly why. Evidently the csc and csr calculation code doesn't require sorted indices, and doesn't bother to ensure the results are sorted.
https://docs.scipy.org/doc/scipy-0.19.1/reference/sparse.html#further-details
Further Details¶
CSR column indices are not necessarily sorted. Likewise for CSC row indices. Use the .sorted_indices() and .sort_indices() methods when sorted indices are required (e.g. when passing data to other libraries).

Related

Append numpy arrays with different dimensions

I am trying to attach or concatenate two numpy arrays with different dimensions. It does not look good so far.
So, as an example,
a = np.arange(0,4).reshape(1,4)
b = np.arange(0,3).reshape(1,3)
And I am trying
G = np.concatenate(a,b,axis=0)
I get an error as a and b are not the same dimension. The reason I need to concatenate a and b is that I am trying to solve a model recursively and the state space is changing over time. So I need to call the last value function as an input to get a value function for the next time period, etc.:
for t in range(T-1,0,-1):
VG,CG = findv(VT[-1])
VT = np.append(VT,VG,axis=0)
CT = np.append(CT,CG,axis=0)
But, VT has a different dimension from the time period to the next.
Does anyone know how to deal with VT and CT numpy arrays that keep changing dimension?
OK - thanks for the input ... I need the output to be of the following form:
G = [[0, 1, 2, 3],
[0, 1, 2]]
So, if I write G[-1] I will get the last element,
[0,1,2].
I do not know if that is a numpy array?
Thanks, Jesper.
In [71]: a,b,c = np.arange(0,4), np.arange(0,3), np.arange(0,7)
It's easy to put those arrays in a list, either all at once, or incrementally:
In [72]: [a,b,c]
Out[72]: [array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1, 2, 3, 4, 5, 6])]
In [73]: G =[a,b]
In [74]: G.append(c)
In [75]: G
Out[75]: [array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1, 2, 3, 4, 5, 6])]
We can make an object dtype array from that list.
In [76]: np.array(G)
Out[76]:
array([array([0, 1, 2, 3]), array([0, 1, 2]),
array([0, 1, 2, 3, 4, 5, 6])], dtype=object)
Be aware that sometimes this could produce a 2d array (if all subarrays were the same size), or an error. Usually it's better to stick with the list.
Repeated append or concatenate to an array is usually not recommended. It's trickier to do right, and slower when it does work.
But let's demonstrate:
In [80]: G = np.array([a,b])
In [81]: G
Out[81]: array([array([0, 1, 2, 3]), array([0, 1, 2])], dtype=object)
c gets 'expanded' with a simple concatenate:
In [82]: np.concatenate((G,c))
Out[82]:
array([array([0, 1, 2, 3]), array([0, 1, 2]), 0, 1, 2, 3, 4, 5, 6],
dtype=object)
Instead we need to wrap c in an object dtype array of its own:
In [83]: cc = np.array([None])
In [84]: cc[0]= c
In [85]: cc
Out[85]: array([array([0, 1, 2, 3, 4, 5, 6])], dtype=object)
In [86]: np.concatenate((G,cc))
Out[86]:
array([array([0, 1, 2, 3]), array([0, 1, 2]),
array([0, 1, 2, 3, 4, 5, 6])], dtype=object)
In general when we concatenate, the dtypes have to match, or at least be compatible. Here, all inputs need to be object dtype. The same would apply when joining compound dtypes (structured arrays). It's only when joining simple numeric dtypes (and strings) that we can ignore dtypes (provided we don't care about integers becoming floats, etc).
You cant really stack arrays with different dimensions or size of dimensions.
This is list (kind of your desired ouput if I understand correctly):
G = [[0, 1, 2, 3],
[0, 1, 2]]
Transformed to numpy array:
G_np = np.array(G)
>>> G_np.shape
(2,)
>>> G_np
array([list([0, 1, 2, 3]), list([0, 1, 2])], dtype=object)
>>>
Solution in your case (based on your requirements):
a = np.arange(0,4)
b = np.arange(0,3)
G_npy = np.array([a,b])
>>> G_np.shape
(2,)
>>> G_np
array([array([0, 1, 2, 3]), array([0, 1, 2])], dtype=object)
>>> G_npy[-1]
array([0, 1, 2])
Edit: In relation to your Question in comment
I must admit I have no Idea how to do it in correct way.
But if a hacky way is ok(Maybe its the correct way), then:
G_npy = np.array([a,b])
G_npy = np.append(G_npy,None) # Allocate space for your new array
G_npy[-1] = np.arange(5) # populate the new space with new array
>>> G_npy
array([array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1, 2, 3, 4])],
dtype=object)
>>>
Or this way - but then, there is no point in using numpy
temp = [i for i in G_npy]
temp.append(np.arange(5))
G_npy = np.array(temp)
NOTE:
To be honest, i dont think numpy is good for collecting objects(list like this).
If I were you, I would just keep appending a real list. At the end, I would transform it to numpy. But after all, I dont know your application, so I dont know what is best attitude
Try this way:
import numpy as np
a = np.arange(4).reshape(2,2)
b = np.arange(6).reshape(2,3)
c = np.arange(8).reshape(2,4)
a
# array([[0, 1],
# [2, 3]])
b
# array([[0, 1, 2],
# [3, 4, 5]])
c
# array([[0, 1, 2, 3],
# [4, 5, 6, 7]])
np.hstack((a,b,c))
#array([[0, 1, 0, 1, 2, 0, 1, 2, 3],
# [2, 3, 3, 4, 5, 4, 5, 6, 7]])
Hope it helps.
Thanks
You are missing a parentheses there.
Please refer to the concatenate documentation below.
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.concatenate.html
import numpy as np
a = np.arange(0,4).reshape(1,4)
b = np.arange(0,3).reshape(1,3)
c = np.concatenate((a,b), axis=1) #axis 1 as you have reshaped the numpy array
The above will give you the concatenated output c as:
array([[0, 1, 2, 3, 0, 1, 2]])

Faster index computation from Scipy labelled array apart from np.where

I am working on a large array (3000 x 3000) over which I use scipy.ndimage.label. The return is 3403 labels and the labelled array. I would like to know the indices of these labels for e.g. for label 1 I should know the rows and columns in the labelled array.
So basically like this
a[0] = array([[1, 1, 0, 0],
[1, 1, 0, 2],
[0, 0, 0, 2],
[3, 3, 0, 0]])
indices = [np.where(a[0]==t+1) for t in range(a[1])] #where a[1] = 3 is number of labels.
print indices
[(array([0, 0, 1, 1]), array([0, 1, 0, 1])), (array([1, 2]), array([3, 3])), (array([3, 3]), array([0, 1]))]
And I would like to create a list of indices for all 3403 labels like above. The above method seems to be slow. I tried using generators, it doesn't look like there is improvement.
Are there any efficient ways?
Well the idea with gaining efficiency would be to minimize the work once inside the loop. A vectorized method isn't possible given that you would have variable number of elements per label. So, with those factors in mind, here's one solution -
a_flattened = a[0].ravel()
sidx = np.argsort(a_flattened)
afs = a_flattened[sidx]
cut_idx = np.r_[0,np.flatnonzero(afs[1:] != afs[:-1])+1,a_flattened.size]
row, col = np.unravel_index(sidx, a[0].shape)
row_indices = [row[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
col_indices = [col[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
Sample input, output -
In [59]: a[0]
Out[59]:
array([[1, 1, 0, 0],
[1, 1, 0, 2],
[0, 0, 0, 2],
[3, 3, 0, 0]])
In [60]: a[1]
Out[60]: 3
In [62]: row_indices # row indices
Out[62]:
[array([0, 0, 1, 2, 2, 2, 3, 3]), # for label-0
array([0, 0, 1, 1]), # for label-1
array([1, 2]), # for label-2
array([3, 3])] # for label-3
In [63]: col_indices # column indices
Out[63]:
[array([2, 3, 2, 0, 1, 2, 2, 3]), # for label-0
array([0, 1, 0, 1]), # for label-1
array([3, 3]), # for label-2
array([0, 1])] # for label-3
The first elements off row_indices and col_indices are the expected output. The first groups from each those represent the 0-th regions, so you might want to skip those.

How to get a value from every column in a Numpy matrix

I'd like to get the index of a value for every column in a matrix M. For example:
M = matrix([[0, 1, 0],
[4, 2, 4],
[3, 4, 1],
[1, 3, 2],
[2, 0, 3]])
In pseudocode, I'd like to do something like this:
for col in M:
idx = numpy.where(M[col]==0) # Only for columns!
and have idx be 0, 4, 0 for each column.
I have tried to use where, but I don't understand the return value, which is a tuple of matrices.
The tuple of matrices is a collection of items suited for indexing. The output will have the shape of the indexing matrices (or arrays), and each item in the output will be selected from the original array using the first array as the index of the first dimension, the second as the index of the second dimension, and so on. In other words, this:
>>> numpy.where(M == 0)
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
>>> row, col = numpy.where(M == 0)
>>> M[row, col]
matrix([[0, 0, 0]])
>>> M[numpy.where(M == 0)] = 1000
>>> M
matrix([[1000, 1, 1000],
[ 4, 2, 4],
[ 3, 4, 1],
[ 1, 3, 2],
[ 2, 1000, 3]])
The sequence may be what's confusing you. It proceeds in flattened order -- so M[0,2] appears second, not third. If you need to reorder them, you could do this:
>>> row[0,col.argsort()]
matrix([[0, 4, 0]])
You also might be better off using arrays instead of matrices. That way you can manipulate the shape of the arrays, which is often useful! Also note ajcr's transpose-based trick, which is probably preferable to using argsort.
Finally, there is also a nonzero method that does the same thing as where in this case. Using the transpose trick now:
>>> (M == 0).T.nonzero()
(matrix([[0, 1, 2]]), matrix([[0, 4, 0]]))
As an alternative to np.where, you could perhaps use np.argwhere to return an array of indexes where the array meets the condition:
>>> np.argwhere(M == 0)
array([[[0, 0]],
[[0, 2]],
[[4, 1]]])
This tells you each the indexes in the format [row, column] where the condition was met.
If you'd prefer the format of this output array to be grouped by column rather than row, (that is, [column, row]), just use the method on the transpose of the array:
>>> np.argwhere(M.T == 0).squeeze()
array([[0, 0],
[1, 4],
[2, 0]])
I also used np.squeeze here to get rid of axis 1, so that we are left with a 2D array. The sequence you want is the second column, i.e. np.argwhere(M.T == 0).squeeze()[:, 1].
The result of where(M == 0) would look something like this
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]])) First matrix tells you the rows where 0s are and second matrix tells you the columns where 0s are.
Out[4]:
matrix([[0, 1, 0],
[4, 2, 4],
[3, 4, 1],
[1, 3, 2],
[2, 0, 3]])
In [5]: np.where(M == 0)
Out[5]: (matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
In [6]: M[0,0]
Out[6]: 0
In [7]: M[0,2] #0th row 2nd column
Out[7]: 0
In [8]: M[4,1] #4th row 1st column
Out[8]: 0
This isn't anything new on what's been already suggested, but a one-line solution is:
>>> np.where(np.array(M.T)==0)[-1]
array([0, 4, 0])
(I agree that NumPy matrix objects are more trouble than they're worth).
>>> M = np.array([[0, 1, 0],
... [4, 2, 4],
... [3, 4, 1],
... [1, 3, 2],
... [2, 0, 3]])
>>> [np.where(M[:,i]==0)[0][0] for i in range(M.shape[1])]
[0, 4, 0]

Instantiate a matrix with x zeros and the rest ones

I would like to be able to quickly instantiate a matrix where the first few (variable number of) cells in a row are 0, and the rest are ones.
Imagine we want a 3x4 matrix.
I have instantiated the matrix first as all ones:
ones = np.ones([4,3])
Then imagine we have an array that announces how many leading zeros there are:
arr = np.array([2,1,3,0]) # first row has 2 zeroes, second row 1 zero, etc
Required result:
array([[0, 0, 1],
[0, 1, 1],
[0, 0, 0],
[1, 1, 1]])
Obviously this can be done in the opposite way as well, but I'd consider the approach where 1 is a default value, and zeros would be replaced.
What would be the best way to avoid some silly loop?
Here's one way. n is the number of columns in the result. The number of rows is determined by len(arr).
In [29]: n = 5
In [30]: arr = np.array([1, 2, 3, 0, 3])
In [31]: (np.arange(n) >= arr[:, np.newaxis]).astype(int)
Out[31]:
array([[0, 1, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 1, 1]])
There are two parts to the explanation of how this works. First, how to create a row with m zeros and n-m ones? For that, we use np.arange to create a row with values [0, 1, ..., n-1]`:
In [35]: n
Out[35]: 5
In [36]: np.arange(n)
Out[36]: array([0, 1, 2, 3, 4])
Next, compare that array to m:
In [37]: m = 2
In [38]: np.arange(n) >= m
Out[38]: array([False, False, True, True, True], dtype=bool)
That gives an array of boolean values; the first m values are False and the rest are True. By casting those values to integers, we get an array of 0s and 1s:
In [39]: (np.arange(n) >= m).astype(int)
Out[39]: array([0, 0, 1, 1, 1])
To perform this over an array of m values (your arr), we use broadcasting; this is the second key idea of the explanation.
Note what arr[:, np.newaxis] gives:
In [40]: arr
Out[40]: array([1, 2, 3, 0, 3])
In [41]: arr[:, np.newaxis]
Out[41]:
array([[1],
[2],
[3],
[0],
[3]])
That is, arr[:, np.newaxis] reshapes arr into a 2-d array with shape (5, 1). (arr.reshape(-1, 1) could have been used instead.) Now when we compare this to np.arange(n) (a 1-d array with length n), broadcasting kicks in:
In [42]: np.arange(n) >= arr[:, np.newaxis]
Out[42]:
array([[False, True, True, True, True],
[False, False, True, True, True],
[False, False, False, True, True],
[ True, True, True, True, True],
[False, False, False, True, True]], dtype=bool)
As #RogerFan points out in his comment, this is basically an outer product of the arguments, using the >= operation.
A final cast to type int gives the desired result:
In [43]: (np.arange(n) >= arr[:, np.newaxis]).astype(int)
Out[43]:
array([[0, 1, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 1, 1]])
Not as concise as I wanted (I was experimenting with mask_indices), but this will also do the work:
>>> n = 3
>>> zeros = [2, 1, 3, 0]
>>> numpy.array([[0] * zeros[i] + [1]*(n - zeros[i]) for i in range(len(zeros))])
array([[0, 0, 1],
[0, 1, 1],
[0, 0, 0],
[1, 1, 1]])
>>>
Works very simple: concatenates multiplied required number of times, one-element lists [0] and [1], creating the array row by row.

Increment given indices in a matrix

Briefly: there is a similar question and the best answer suggests using numpy.bincount. I need the same thing, but for a matrix.
I've got two arrays:
array([1, 2, 1, 1, 2])
array([2, 1, 1, 1, 1])
together they make indices that should be incremented:
>>> np.array([a, b]).T
array([[1, 2],
[2, 1],
[1, 1],
[1, 1],
[2, 1]])
I want to get this matrix:
array([[0, 0, 0],
[0, 2, 1], # (1,1) twice, (1,2) once
[0, 2, 0]]) # (2,1) twice
The matrix will be small (like, 5Ă—5), and the number of indices will be large (somewhere near 10^3 or 10^5).
So, is there anything better (faster) than a for-loop?
You can still use bincount(). The trick is to convert a and b into a single 1D array of flat indices.
If the matrix is nxm, you could apply bincount() to a * m + b, and construct the matrix from the result.
To take the example in your question:
In [15]: a = np.array([1, 2, 1, 1, 2])
In [16]: b = np.array([2, 1, 1, 1, 1])
In [17]: cnt = np.bincount(a * 3 + b)
In [18]: cnt.resize((3, 3))
In [19]: cnt
Out[19]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
If the shape of the array is more complicated, it might be easier to use np.ravel_multi_index() instead of computing flat indices by hand:
In [20]: cnt = np.bincount(np.ravel_multi_index(np.vstack((a, b)), (3, 3)))
In [21]: np.resize(cnt, (3, 3))
Out[21]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
(Hat tip #Jaime for pointing out ravel_multi_index.)
m1 = m.view(numpy.ndarray) # Create view
m1.shape = -1 # Make one-dimensional array
m1 += np.bincount(a+m.shape[1]*b, minlength=m1.size)

Categories