I am using the sparse matrix format implemented in SciPy as csr_matrix. I have a variable mat in csr_matrix format, and all of its elements are non-negative. However, when I compute mat + mat, the number of non-zero elements decreases, which seems strange to me. What I want is element-wise addition, so why would the number of non-zero elements decrease when every element is non-negative?
Best Regards
The nnz member of csr_matrix in SciPy counts explicit zeros, so depending on how you create your matrix, this may explain what you are observing. You can see this behavior by explicitly setting zeros in a matrix.
>>> from scipy.sparse import csr_matrix
>>> A = csr_matrix((5, 5))
>>> A.nnz
0
>>> A[0, 0] = 0
>>> A.nnz
1
>>> A[1,1] = 0
>>> A.nnz
2
Now when you do an operation that creates a new matrix (such as matrix addition), the explicit zeros are not retained.
>>> B = A + A
>>> B.nnz
0
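The same effect can be sketched without item assignment. The example below builds a matrix with one explicitly stored zero through the (data, (row, col)) constructor and compares nnz with count_nonzero(). The values are made up for illustration; eliminate_zeros() is the in-place SciPy method for dropping stored zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative data: one real entry and one explicitly stored zero.
data = np.array([3.0, 0.0])
rows = np.array([0, 1])
cols = np.array([0, 1])
A = csr_matrix((data, (rows, cols)), shape=(3, 3))

stored_before = A.nnz              # counts stored entries, including the zero
true_nonzeros = A.count_nonzero()  # counts only entries whose value is non-zero

B = A + A                          # addition builds a new matrix
stored_after_add = B.nnz

A.eliminate_zeros()                # drop stored zeros in place
stored_after_cleanup = A.nnz

print(stored_before, true_nonzeros, stored_after_add, stored_after_cleanup)
```

If the count of actual non-zero values is what matters, count_nonzero() is probably the safer number to compare before and after an operation.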
Although it might be overkill and is not directly related, it may be worth looking into these two libraries:
petsc4py
petsc
They will solve just about any sparse matrix problem you can think of.
I have two arrays, A and B, with dimensions (l,m,n) and (l,m,n,n), respectively. I would like to obtain an array C of dimensions (l,m,n), computed by treating A as a vector along its third index and B as a matrix along its third and fourth indices, and multiplying the two for each pair (i,j). An easy way to do this is:
import numpy as np
#Define dimensions
l = 1024
m = l
n = 6
#Create some random arrays
A = np.random.rand(l,m,n)
B = np.random.rand(l,m,n,n)
C = np.zeros((l,m,n))
#Desired multiplication
for i in range(0,l):
    for j in range(0,m):
        C[i,j,:] = np.matmul(A[i,j,:],B[i,j,:,:])
It is, however, slow (about 3 seconds on my MacBook). What would be the fastest, fully vectorized way to do this?
Try to use einsum.
It has many use cases, check the docs: https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
Or, for more info, a really good explanation can be also found at: https://ajcr.net/Basic-guide-to-einsum/
In your case, it seems like
np.einsum('dhi,dhij->dhj',A,B)
should work. Also, you can try the optimize=True flag to get more speed, if needed.
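As a quick sanity check, the einsum expression can be compared against the explicit loop on small arrays. This is only a sketch: the sizes and the seed are arbitrary, and the batched matmul line is an alternative formulation added here for comparison, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, n = 8, 8, 6          # small sizes so the reference loop is fast
A = rng.random((l, m, n))
B = rng.random((l, m, n, n))

# Reference result: explicit double loop, as in the question.
C_loop = np.zeros((l, m, n))
for i in range(l):
    for j in range(m):
        C_loop[i, j, :] = A[i, j, :] @ B[i, j, :, :]

# Vectorized with einsum: sum over the shared index i.
C_einsum = np.einsum('dhi,dhij->dhj', A, B)

# Alternative: batched matmul, viewing A as a stack of 1 x n row vectors.
C_matmul = (A[:, :, None, :] @ B)[:, :, 0, :]

print(np.allclose(C_loop, C_einsum), np.allclose(C_loop, C_matmul))
```

Which of the two vectorized forms is faster likely depends on the array sizes and the NumPy build, so it is worth timing both on the real data.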
I am facing a mystery right now. I get strange results in some program and I think it may be related to the computation since I got different results with my functions compared to manual computation.
This is from my program, I am printing the values pre-computation :
print("\nPrecomputation:\nmatrix\n:", matrix)
tmp = likelihood_left * likelihood_right
print("\nconditional_dep:", tmp)
print("\nfinal result:", matrix @ tmp)
I got the following output:
Precomputation:
matrix:
[array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294])
array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784])
array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768])
array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674])
array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
conditional_dep: [0.01391123 0.01388155 0.17221067 0.02675524 0.01033257]
final result: [0.07995043 0.03485223 0.02184015 0.04721548 0.05323298]
The thing is when I compute the following code:
matrix = [np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
tmp = np.asarray([0.01391123, 0.01388155, 0.17221067, 0.02675524, 0.01033257])
matrix @ tmp
The values in use are exactly the same as they should be in the computation before but I get the following result:
array([0.04171218, 0.04535276, 0.02546353, 0.04688848, 0.03106443])
This result is then obviously different than the previous one and is the true one (I computed the dot product by hand).
I have been facing this problem the whole day and I did not find anything useful online. If any of you have any even tiny idea where it can come from I'd be really happy :D
Thanks in advance
Yann
PS: I can show more of the code if needed.
PS2: I don't know if it is relevant but this is used in a dynamic programming algorithm.
To recap our discussion in the comments, in the first part ("pre-computation"), the following is true about the matrix object:
>>> matrix.shape
(5,)
>>> matrix.dtype
dtype('O') # aka object
And as you say, this is due to matrix being a slice of a larger, non-uniform array. Let's recreate this situation:
>>> matrix = np.array([[], np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]), np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]), np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]), np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]), np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])])[1:]
It is now not a matrix with scalars in rows and columns, but a column vector of column vectors. Technically, matrix @ tmp is an operation between two 1-D arrays, and hence NumPy should, according to the documentation, calculate the inner product of the two. This is true in this case, with the convention that the sum be over the first axis:
>>> np.array([matrix[i] * tmp[i] for i in range(5)]).sum(axis=0)
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
>>> matrix @ tmp
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
This is essentially the same as taking the transpose of the proper 2-D matrix before the multiplication:
>>> np.stack(matrix).T @ tmp
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
Equivalently, as noted by @jirasssimok:
>>> tmp @ np.stack(matrix)
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
Hence the erroneous or unexpected result.
As you have already resolved to do in the comments, this can be avoided in the future by ensuring all matrices are proper 2-D arrays.
It looks like you got the operands switched in one of your matrix multiplications.
Using the same values of matrix and tmp that you provided, matrix @ tmp and tmp @ matrix provide the two results you showed. [1]
matrix = [np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
tmp = np.asarray([0.01391123, 0.01388155, 0.17221067, 0.02675524, 0.01033257])
print(matrix @ tmp)  # [0.04171218 0.04535276 0.02546353 0.04688848 0.03106443]
print(tmp @ matrix)  # [0.07995043 0.03485222 0.02184015 0.04721548 0.05323298]
To make it a little more obvious what your code is doing, you might also consider using np.dot instead of @. If you pass matrix as the first argument and tmp as the second, it will give the result you want, and make it clearer that you're conceptually calculating dot products rather than multiplying matrices.
As an additional note, if you're performing matrix operations on matrix, it might be better if it were a single two-dimensional array instead of a list of one-dimensional arrays. This will prevent errors of the sort you'll see right now if you try to run matrix @ matrix. It would also let you write matrix.dot(tmp) instead of np.dot(matrix, tmp) if you wanted to.
(I'd guess that you can use np.stack or a similar function to create matrix, or you can call np.stack on matrix after creating it.)
[1] Because tmp has only one dimension and matrix has two, NumPy can and will treat tmp as whichever type of vector makes the multiplication work (matmul temporarily promotes a 1-D operand to a matrix by adding a dimension of size 1, then removes it from the result). So tmp is treated as a column vector in matrix @ tmp and a row vector in tmp @ matrix.
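The effect of the operand order can be reproduced on a tiny example. This is a sketch with made-up numbers (not the values from the question), using np.stack to turn a list of 1-D arrays into a proper 2-D array first.

```python
import numpy as np

# Hypothetical small inputs, chosen so the two results are easy to tell apart.
rows = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
M = np.stack(rows)            # proper 2-D array, shape (2, 2)
v = np.array([1.0, 10.0])

right = M @ v                 # each row of M dotted with v
left = v @ M                  # v dotted with each column of M, same as M.T @ v

print(right)
print(left)
```

With M as a real 2-D array, M.shape and M.dtype make it easy to check up front that the multiplication is the one intended.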
I am currently trying to build a really large matrix, and I am unsure how to do so in a memory-efficient way.
I was using NumPy, which worked fine for my smaller case (2750086 x 300).
However, I now have a larger one, 2750086 x 1000, which is too big for me to run.
I thought about making it out of ints, but I will add float values to it, so I am unsure how that would affect it.
I tried to find something about making a sparse zero-filled array, but couldn't find any useful topics, questions, or suggestions here or elsewhere.
Does anyone have any good advice? I am currently using Python, so I am looking for a Pythonic solution, but I am willing to try other languages.
Thanks
Edit:
Thanks for the advice. I tried scipy.sparse.csr_matrix, which managed to create the matrix but greatly increased the time needed to fill it.
Here is roughly what I am doing:
matrix = scipy.sparse.csr_matrix((df.shape[0], 300))
## matrix = np.zeros((df.shape[0],
for i, q in enumerate(df['column'].values):
    matrix[i, :] = function(q)
where function is essentially a vector operation on that row.
If I run the loop on the np.zeros array, it finishes fairly easily, in about 10 minutes.
If I try the same with the SciPy sparse matrix, it takes about 50 hours, which is not reasonable.
Any advice?
Edit 2:
scipy.sparse.lil_matrix did the trick.
The loop takes about 20 minutes and uses far less memory than np.zeros.
Thanks.
Edit 3:
It is still memory expensive. I decided not to store the data in a matrix: I process one row at a time, extract the relevant value/metric from it, store that value in the original df, and repeat.
Try scipy.sparse.csr_matrix:
from scipy.sparse import csr_matrix
a = csr_matrix((2750086, 1000), dtype="int8")
Then a is
<2750086x1000 sparse matrix of type '<class 'numpy.int8'>'
with 0 stored elements in Compressed Sparse Row format>
For example, if you do:
from scipy.sparse import csr_matrix
a = csr_matrix((5, 4), dtype="int8").todense()
print(a)
You get:
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
Another option is to use scipy.sparse.lil_matrix:
a = scipy.sparse.lil_matrix((2750086, 1000), dtype="int8")
This seems to be more efficient for setting elements (like a[1,1]=2).
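A common pattern along these lines is to fill a lil_matrix row by row and convert it to CSR once at the end. The sketch below assumes a hypothetical row_features function standing in for the asker's function, tiny dimensions, and a float dtype since float values will be stored.

```python
import numpy as np
from scipy.sparse import lil_matrix

def row_features(q, n_cols):
    """Hypothetical stand-in for the per-row function: a mostly-zero row."""
    row = np.zeros(n_cols)
    row[q % n_cols] = float(q)
    return row

n_rows, n_cols = 6, 4
m = lil_matrix((n_rows, n_cols), dtype=np.float64)

# LIL is designed for incremental construction, so row assignment is cheap here.
for i, q in enumerate(range(1, n_rows + 1)):
    m[i, :] = row_features(q, n_cols)

# Convert once at the end; CSR is better suited to arithmetic and row slicing.
csr = m.tocsr()
print(csr.shape, csr.count_nonzero())
```

This only saves memory if the rows really are mostly zeros; for dense rows a sparse format stores more than a plain array would.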
I assumed it could be computed as
np.linalg.inv(np.sqrt(matrix))
but after comparing the result with MATLAB I saw a big difference.
This was in MATLAB:
0.2622 -0.0828 -0.0708
-0.0828 0.2601 -0.0792
-0.0708 -0.0792 0.2664
And this was in Python:
0.8607 -0.4417 -0.3536
-0.4417 0.8967 -0.4158
-0.3536 -0.4158 0.8525
Input was
34.502193 27.039107 24.735074
27.039107 36.535737 26.069613
24.735074 26.069613 32.798584
There is no "matrix" class in Python. From your code it looks like you're talking about NumPy.
A possible gotcha for MATLAB users is that in NumPy, array operations are element-wise by default, and if you want matrix operations, you need to request them: np.dot for matrix multiplication, np.linalg.inv for inversion, etc.
np.linalg.inv(np.sqrt(a)) first takes the square root of each element of a, and then inverts the result in the linear algebra sense. I suspect this is not what you meant.
If you meant element-wise operations, i.e. you wanted to raise each element to the power -1/2, then, as @Benoit_11 suggests, use
1 / np.sqrt(a).
If what you want is actually a linear algebra operation, then use scipy.linalg.sqrtm:
In [14]: a
Out[14]:
array([[ 34.502193, 27.039107, 24.735074],
[ 27.039107, 36.535737, 26.069613],
[ 24.735074, 26.069613, 32.798584]])
In [15]: from scipy.linalg import sqrtm
In [16]: sq = sqrtm(a)
In [17]: np.dot(sq, sq) - a
Out[17]:
array([[ 4.97379915e-14, 4.97379915e-14, 2.84217094e-14],
[ 5.32907052e-14, 6.39488462e-14, 4.61852778e-14],
[ 3.55271368e-14, 3.19744231e-14, 3.55271368e-14]])
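To get from the matrix square root to the power -1/2, one option is to invert the result of sqrtm. The sketch below uses a different route that needs only NumPy: an eigendecomposition, which assumes the input is symmetric positive definite (the matrix from the question appears to be). The variable names are illustrative.

```python
import numpy as np

a = np.array([[34.502193, 27.039107, 24.735074],
              [27.039107, 36.535737, 26.069613],
              [24.735074, 26.069613, 32.798584]])

# Assumption: a is symmetric positive definite, so a = V diag(w) V^T with w > 0.
w, V = np.linalg.eigh(a)

# Matrix power -1/2: apply the power to the eigenvalues only.
inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

# Element-wise power -1/2, for contrast; this is a different quantity.
elementwise = 1 / np.sqrt(a)

# inv_sqrt squared should be the inverse of a, so this should be close to zero.
residual = inv_sqrt @ inv_sqrt @ a - np.eye(3)
print(inv_sqrt)
print(np.abs(residual).max())
```

For matrices that are not symmetric positive definite this shortcut does not apply, and a general routine such as scipy.linalg.sqrtm followed by an inverse is the more appropriate tool.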
It looks like in Python you calculated the inverse of the element-wise square root of the matrix instead of raising the matrix to the power -0.5.
For instance, running this command in MATLAB reproduces your Python output:
m = [34.502193 27.039107 24.735074
27.039107 36.535737 26.069613
24.735074 26.069613 32.798584]
A = inv(sqrt(m))
A =
0.8608 -0.4417 -0.3537
-0.4417 0.8967 -0.4159
-0.3537 -0.4159 0.8525
versus this:
B = m^(-.5)
B =
0.2622 -0.0828 -0.0708
-0.0828 0.2601 -0.0792
-0.0708 -0.0792 0.2664
For the correct Python code, please see @ev-br's answer.
Beware that there is such a thing as the matrix square root, which for a matrix M is defined as:
A*A = M
and does not correspond at all to the square root of each element in the matrix M taken individually. The matrix square root is obtained in Matlab using the sqrtm function and is equivalent to m^(.5).
I have to invert a large sparse matrix. I cannot avoid the matrix inversion; the only shortcut would be to estimate just the main diagonal elements of the inverse and ignore the off-diagonal elements (I'd rather not, but it would be acceptable as a solution).
The matrices I need to invert are typically large (40000 x 40000) and only have a handful of non-zero diagonals. My current approach is to build everything sparse, and then
posterior_covar = np.linalg.inv ( hessian.todense() )
This clearly takes a long time and plenty of memory.
Any hints, or is it just a matter of patience or making the problem smaller?
I don't think that the sparse module has an explicit inverse method, but it does have sparse solvers. Something like this toy example works:
>>> a = np.random.rand(3, 3)
>>> a
array([[ 0.31837307, 0.11282832, 0.70878689],
[ 0.32481098, 0.94713997, 0.5034967 ],
[ 0.391264 , 0.58149983, 0.34353628]])
>>> np.linalg.inv(a)
array([[-0.29964242, -3.43275347, 5.64936743],
[-0.78524966, 1.54400931, -0.64281108],
[ 1.67045482, 1.29614174, -2.43525829]])
>>> a_sps = scipy.sparse.csc_matrix(a)
>>> lu_obj = scipy.sparse.linalg.splu(a_sps)
>>> lu_obj.solve(np.eye(3))
array([[-0.29964242, -0.78524966, 1.67045482],
[-3.43275347, 1.54400931, 1.29614174],
[ 5.64936743, -0.64281108, -2.43525829]])
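Note that in this session the result is transposed relative to np.linalg.inv(a)! This appears to depend on the SciPy version, so compare against np.linalg.inv on a small matrix before relying on the orientation.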
If you expect your inverse to also be sparse, and the dense return from the last solve won't fit in memory, you can also generate it one row (column) at a time, extract the non-zero values, and build the sparse inverse matrix from those:
>>> for k in range(3):
...     b = np.zeros((3,))
...     b[k] = 1
...     print(lu_obj.solve(b))
...
[-0.29964242 -0.78524966 1.67045482]
[-3.43275347 1.54400931 1.29614174]
[ 5.64936743 -0.64281108 -2.43525829]
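If only the main diagonal of the inverse is needed, as the question allows, the one-column-at-a-time idea can be sketched as follows. The tridiagonal test matrix, its size, and the variable names are assumptions made for illustration; the real Hessian would be used in its place.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu

# Illustrative banded matrix standing in for the real sparse Hessian.
n = 50
main = np.full(n, 4.0)
off = np.full(n - 1, -1.0)
hessian = diags([off, main, off], offsets=[-1, 0, 1], format='csc')

lu = splu(hessian)            # factorize once, reuse for every right-hand side

inv_diag = np.empty(n)
b = np.zeros(n)
for k in range(n):
    b[k] = 1.0
    x = lu.solve(b)           # solves hessian @ x = e_k, i.e. column k of the inverse
    inv_diag[k] = x[k]        # keep only the diagonal entry
    b[k] = 0.0

print(inv_diag[:5])
```

This keeps memory at one dense vector per solve, but it still performs one triangular solve per column, so for 40000 columns the run time may remain significant.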