Implications of manually setting scipy sparse matrix shape - python
I need to perform online training of a TF-IDF model. I found that scikit-learn's
TfidfVectorizer does not support training in an online fashion, so I'm implementing my own CountVectorizer to support online training and then using scikit-learn's TfidfTransformer to update the tf-idf values after a pre-defined number of documents have entered the corpus.
I found here that you shouldn't add rows or columns to numpy arrays, since all the data would need to be copied so that it stays stored in contiguous blocks of memory.
But then I also found that, using a scipy sparse matrix, you can in fact manually change the matrix's shape.
The NumPy reshape docs say:
It is not always possible to change the shape of an array without copying the data. If you want an error to be raised when the data is copied, you should assign the new shape to the shape attribute of the array
Since the "reshaping" of the sparse matrix is done by assigning a new shape, is it safe to say the data is not being copied? What are the implications of doing so? Is it efficient?
Code example:
matrix = sparse.random(5, 5, .2, 'csr') # Create (5,5) sparse matrix
matrix._shape = (6, 6) # Change shape to (6, 6)
# Modify data on new empty row
I would also like to expand my question to ask about methods such as vstack that allow one to append arrays to one another (the same as adding a row). Does vstack copy the whole data so it gets stored in contiguous blocks of memory, as stated in my first link? What about hstack?
EDIT:
So, following this question, I've implemented a method to alter the values of a row in a sparse matrix.
Now, mixing the idea of adding new empty rows with the idea of modifying existing values, I've come up with the following:
matrix = sparse.random(5, 3, .2, 'csr')
matrix._shape = (6, 3)
# Update indptr to let it know we added a row with nothing in it.
matrix.indptr = np.hstack((matrix.indptr, matrix.indptr[-1]))
# New elements on data, indices format
new_elements = [1, 1]
elements_indices = [0, 2]
# Set elements for new empty row
set_row_csr_unbounded(matrix, 5, new_elements, elements_indices)
I ran the above code a few times during the same execution and got no error. But as soon as I try to add a new column (in which case there would be no need to change indptr), I get an error when I try to alter the values. Any lead on why this happens?
Well, since set_row_csr_unbounded uses numpy.r_ underneath, I assume I'd be better off using a lil_matrix, even if the elements, once added, cannot be modified. Am I right?
I think lil_matrix would be better because I assume numpy.r_ is copying the data.
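For reference, a minimal sketch of that lil_matrix-based alternative (the buffer capacity and the toy documents here are illustrative assumptions, not part of any library API): pre-allocate rows, fill them as documents arrive, and convert to csr only when math is needed.

```python
from scipy import sparse

# Pre-allocate a lil_matrix with room for more rows than we have yet.
capacity, n_features = 100, 3
buf = sparse.lil_matrix((capacity, n_features))

n_docs = 0
for counts in ([1, 0, 2], [0, 3, 0]):   # toy "documents" arriving online
    buf[n_docs, :] = counts             # lil supports cheap row assignment
    n_docs += 1

# Slice down to the rows actually filled; convert to csr for math.
matrix = buf[:n_docs, :].tocsr()
```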
In numpy, reshape means changing the shape in a way that keeps the same number of elements, so the product of the shape terms can't change.
The simplest example is something like
np.arange(12).reshape(3,4)
The assignment method is:
x = np.arange(12)
x.shape = (3,4)
The method (or np.reshape(...)) returns a new array. The shape assignment works in place.
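The difference is easy to check: reshape hands back a new array object (here a view), while shape assignment mutates the array itself.

```python
import numpy as np

x = np.arange(12)
y = x.reshape(3, 4)   # new array object; for a contiguous array it's a view on x
x.shape = (3, 4)      # in place: x itself now has the new shape, no new object
```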
The note that you quoted from the docs comes into play when doing something like
x = np.arange(12).reshape(3,4).T
x.reshape(3,4) # ok, but copy
x.shape = (3,4) # raises error
To better understand what's happening here, print the array at each stage and look at how the original 0,1,2,... contiguity changes. (That's left as an exercise for the reader, since it isn't central to the bigger question.)
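A runnable version of that transposed-array case; the try/except just captures the error the docs describe.

```python
import numpy as np

x = np.arange(12).reshape(3, 4).T   # transposed: no longer C-contiguous
y = x.reshape(3, 4)                 # ok, but the data is copied, not viewed
try:
    x.shape = (3, 4)                # in-place reshape would require a copy
    raised = False
except AttributeError:
    raised = True                   # so numpy raises instead
```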
There is a resize function and method, but it isn't used much, and its behavior with respect to views and copies is tricky.
np.concatenate (and variants like np.stack, np.vstack) make new arrays, and copy all the data from the inputs.
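np.shares_memory makes that copying visible:

```python
import numpy as np

a = np.arange(3)
b = np.arange(3, 6)
c = np.concatenate([a, b])   # brand-new array; a's and b's data are copied in
```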
A list (or an object-dtype array) contains pointers to its elements (which may themselves be arrays), and so doesn't require copying data.
Sparse matrices store their data (and row/col indices) in various attributes that differ among the formats. coo, csr and csc have 3 1d arrays. lil has 2 object arrays containing lists. dok is a dictionary subclass.
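To see those per-format attributes concretely (the random_state is just for reproducibility):

```python
from scipy import sparse

M = sparse.random(5, 5, 0.2, 'coo', random_state=0)
coo_parts = (M.data, M.row, M.col)        # three 1d arrays

Mc = M.tocsr()
csr_parts = (Mc.data, Mc.indices, Mc.indptr)  # also three 1d arrays

Ml = M.tolil()                            # Ml.data / Ml.rows: object arrays of lists
```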
lil_matrix implements a reshape method. The other formats do not. As with np.reshape the product of the dimensions can't change.
In theory a sparse matrix could be 'embedded' in a larger matrix with minimal copying of data, since all the new values will be the default 0, and not occupy any space. But the details for that operation have not been worked out for any of the formats.
sparse.hstack and sparse.vstack (don't use the numpy versions on sparse matrices) work by combining the coo attributes of the inputs (via sparse.bmat). So yes, they make new arrays (data, row, col).
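A quick check that the stacked result gets freshly built arrays rather than views into its inputs:

```python
import numpy as np
from scipy import sparse

A = sparse.random(2, 3, 0.5, 'coo', random_state=0)
B = sparse.random(2, 3, 0.5, 'coo', random_state=1)
S = sparse.vstack([A, B])     # (4, 3) result, built via sparse.bmat

# The combined data array is a new buffer, not a view into A or B.
fresh = not np.shares_memory(S.data, A.data)
```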
A minimal example of making a larger sparse matrix:
In [110]: M = sparse.random(5,5,.2,'coo')
In [111]: M
Out[111]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [112]: M.A
Out[112]:
array([[0. , 0.80957797, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0.23618044, 0. , 0.91625967, 0.8791744 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.7928235 , 0. ]])
In [113]: M1 = sparse.coo_matrix((M.data, (M.row, M.col)),shape=(7,5))
In [114]: M1
Out[114]:
<7x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [115]: M1.A
Out[115]:
array([[0. , 0.80957797, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0.23618044, 0. , 0.91625967, 0.8791744 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.7928235 , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ]])
In [116]: id(M1.data)
Out[116]: 139883362735488
In [117]: id(M.data)
Out[117]: 139883362735488
M and M1 have the same data attribute (same array id). But most operations on these matrices will require a conversion to another format (such as csr for math, or lil for changing values), and will involve copying and modifying the attributes. So this connection between the two matrices will be broken.
When you make a sparse matrix with a function like coo_matrix, and don't provide a shape parameter, it deduces the shape from the provided coordinates. If you provide a shape it uses that. That shape has to be at least as large as the implied shape. With lil (and dok) you can profitably create an 'empty' matrix with a large shape, and then set values iteratively. You don't want to do that with csr. And you can't directly set coo values.
The canonical way of creating sparse matrices is to build the data, row, and col arrays or lists iteratively from various pieces - with list append/extend or array concatenates, and make a coo (or csr) format array from that. So you do all the 'growing' before even creating the matrix.
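A minimal sketch of that canonical pattern, here with toy per-document term counts (the dict-of-counts input format is an illustrative assumption):

```python
from scipy import sparse

# Grow plain Python lists first; build the matrix once at the end.
data, rows, cols = [], [], []
for r, doc in enumerate([{0: 2, 3: 1}, {1: 5}]):   # toy term counts per document
    for c, count in doc.items():
        rows.append(r)
        cols.append(c)
        data.append(count)

M = sparse.coo_matrix((data, (rows, cols)), shape=(2, 4)).tocsr()
```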
changing _shape
Make a matrix:
In [140]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [141]: M
Out[141]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [142]: M.A
Out[142]:
array([[0, 6, 7],
[0, 0, 6],
[1, 0, 5],
[0, 0, 0],
[0, 6, 0]])
In [144]: M[1,0] = 10
... SparseEfficiencyWarning)
In [145]: M.A
Out[145]:
array([[ 0, 6, 7],
[10, 0, 6],
[ 1, 0, 5],
[ 0, 0, 0],
[ 0, 6, 0]])
Your new-shape method (make sure the dtype of indptr doesn't change):
In [146]: M._shape = (6,3)
In [147]: newptr = np.hstack((M.indptr,M.indptr[-1]))
In [148]: newptr
Out[148]: array([0, 2, 4, 6, 6, 7, 7], dtype=int32)
In [149]: M.indptr = newptr
In [150]: M
Out[150]:
<6x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [151]: M.A
Out[151]:
array([[ 0, 6, 7],
[10, 0, 6],
[ 1, 0, 5],
[ 0, 0, 0],
[ 0, 6, 0],
[ 0, 0, 0]])
In [152]: M[5,2]=10
... SparseEfficiencyWarning)
In [153]: M.A
Out[153]:
array([[ 0, 6, 7],
[10, 0, 6],
[ 1, 0, 5],
[ 0, 0, 0],
[ 0, 6, 0],
[ 0, 0, 10]])
Adding a column also seems to work:
In [154]: M._shape = (6,4)
In [155]: M
Out[155]:
<6x4 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
In [156]: M.A
Out[156]:
array([[ 0, 6, 7, 0],
[10, 0, 6, 0],
[ 1, 0, 5, 0],
[ 0, 0, 0, 0],
[ 0, 6, 0, 0],
[ 0, 0, 10, 0]])
In [157]: M[5,0]=10
.... SparseEfficiencyWarning)
In [158]: M[5,3]=10
.... SparseEfficiencyWarning)
In [159]: M
Out[159]:
<6x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
In [160]: M.A
Out[160]:
array([[ 0, 6, 7, 0],
[10, 0, 6, 0],
[ 1, 0, 5, 0],
[ 0, 0, 0, 0],
[ 0, 6, 0, 0],
[10, 0, 10, 10]])
attribute sharing
I can make a new matrix from an existing one:
In [108]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [109]: newptr = np.hstack((M.indptr,6))
In [110]: M1 = sparse.csr_matrix((M.data, M.indices, newptr), shape=(6,3))
The data attributes are shared, at least in a view sense:
In [113]: M[0,1]=14
In [114]: M1[0,1]
Out[114]: 14
But if I modify M1 by adding a nonzero value:
In [117]: M1[5,0]=10
...
SparseEfficiencyWarning)
The link between the matrices breaks:
In [120]: M[0,1]=3
In [121]: M1[0,1]
Out[121]: 14
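The same break can be demonstrated with np.shares_memory instead of ids. This sketch relies on the csr_matrix constructor's default copy=False reusing the input arrays; the warning filter just silences the expected SparseEfficiencyWarning.

```python
import warnings
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 1], [2, 0]]))
newptr = np.hstack((M.indptr, M.indptr[-1]))
M1 = sparse.csr_matrix((M.data, M.indices, newptr), shape=(3, 2))

shared_before = np.shares_memory(M.data, M1.data)   # same buffer at first

with warnings.catch_warnings():
    warnings.simplefilter('ignore')                 # SparseEfficiencyWarning
    M1[2, 0] = 10                                   # insertion rebuilds M1's arrays

shared_after = np.shares_memory(M.data, M1.data)    # the link is broken
```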