Efficient matrix update and matrix multiplication using Scipy sparse matrix - python

I have a large matrix (236680 x 236680), and my PC does not have enough memory to hold the complete matrix, so I am considering SciPy sparse matrices. My goal is to multiply a generated (non-sparse) matrix, np.eye(n) - np.ones(n)/n where n is the number of observations, by a sparse matrix.
I use the following SciPy code, but the computation is still very expensive. My questions are:
To generate the first matrix, is there a faster way?
For the matrix multiplication, is there any way to reduce the memory usage, given that the first matrix is not sparse?
from scipy.sparse import lil_matrix
fline=5
nn=1/fline
M=lil_matrix((fline,fline))
M.setdiag(values=1-nn,k=0)
for i in range(1, fline):
    M.setdiag(values=0-nn,k=i)
    M.setdiag(values=0-nn,k=-i)
#the first matrix is:
array([[ 0.8, -0.2, -0.2, -0.2, -0.2],
[-0.2, 0.8, -0.2, -0.2, -0.2],
[-0.2, -0.2, 0.8, -0.2, -0.2],
[-0.2, -0.2, -0.2, 0.8, -0.2],
[-0.2, -0.2, -0.2, -0.2, 0.8]])
#the second matrix is:
array([[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0.],
[1., 0., 1., 0., 0.]])
a2=M.dot(B)
#the final expected results
array([[-0.2, 0. , -0.2, 0.6, 0. ],
[-0.2, 0. , -0.2, -0.4, 0. ],
[-0.2, 0. , -0.2, 0.6, 0. ],
[-0.2, 0. , -0.2, -0.4, 0. ],
[ 0.8, 0. , 0.8, -0.4, 0. ]])
Updated: is there any way to improve the speed of the cross product? Numpy dot and Scipy sparse dot functions are tested.

For the first problem: Mathematically,
arr1 = array([[ 0.8, -0.2, -0.2, -0.2, -0.2],
[-0.2, 0.8, -0.2, -0.2, -0.2],
[-0.2, -0.2, 0.8, -0.2, -0.2],
[-0.2, -0.2, -0.2, 0.8, -0.2],
[-0.2, -0.2, -0.2, -0.2, 0.8]])
is equivalent to
arr1 = -0.2 * J + I
where J is the 5x5 all-ones matrix and I is the 5x5 identity matrix. J is itself the outer product of the all-ones column vector with the all-ones row vector:
J = [1, 1, 1, 1, 1]^T [1, 1, 1, 1, 1]
Thus, it can be generated using
-0.2 * np.outer([1,1,1,1,1], [1,1,1,1,1]) + scipy.sparse.identity(5)
For the second problem, let me abuse the notation and write @ for the matrix product:
-0.2 * np.outer([1, 1, 1, 1, 1], [1, 1, 1, 1, 1]) @ B + scipy.sparse.identity(5) @ B
can be reduced to
np.outer([1, 1, 1, 1, 1], B.sum(axis=0)) * -0.2 + scipy.sparse.identity(5) @ B
There is no need to actually compute np.outer([1, 1, 1, 1, 1], B.sum(axis=0)): it would be a dense square matrix that may not fit in memory. (Note that the outer product simply repeats B.sum(axis=0) in every row.)
To recover the results in a memory-efficient way, you only need to store B.sum(axis=0) and scipy.sparse.identity(5) @ B, which is just B.
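The reduction above can be sketched on the 5x5 example from the question. This is only an illustration: the helper name centered_rows and the idea of materialising a block of rows at a time are assumptions, not part of the original answer.

```python
import numpy as np
from scipy import sparse

# Second matrix from the question, stored sparsely
B = sparse.csr_matrix(np.array([
    [0., 0., 0., 1., 0.],
    [0., 0., 0., 0., 0.],
    [0., 0., 0., 1., 0.],
    [0., 0., 0., 0., 0.],
    [1., 0., 1., 0., 0.],
]))
n = B.shape[0]

# (I - ones/n) @ B  ==  B - (column sums of B) / n, repeated in every row.
# Only this length-n vector and the sparse B need to be kept in memory.
col_means = np.asarray(B.sum(axis=0)).ravel() / n

def centered_rows(B, col_means, start, stop):
    """Hypothetical helper: dense rows start..stop-1 of (I - ones/n) @ B."""
    return B[start:stop].toarray() - col_means

# For the toy example the whole result fits in memory
result = centered_rows(B, col_means, 0, n)
print(result)
```

For the full 236680 x 236680 case one would call the helper on small row ranges and process or store each block separately, since the complete result is dense.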

A SciPy sparse matrix is used because one of the matrices is sparse, and in my tests the sparse dot product was faster than NumPy's dot.
For the first question, @Tai's answer is the foundation, but I use the numpy.full function (a little bit faster).
For the second question, I split the whole matrix into column blocks and save the smaller computed matrices to a file.
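from scipy import sparse
from scipy.sparse import csr_matrix, vstack
import h5sparse
import numpy as num
# `weights` is the second (sparse) matrix; it is assumed to be defined elsewhere
fline=236680
nn=1/fline; dd=1-nn; off=0-nn
div=int(fline/(61*10))
for i in range(61*10):
    # dense block of `div` columns of the first matrix: `off` everywhere plus the identity columns
    divM= num.full((fline, div), off) + sparse.identity(fline,format='csc')[:,0+div*i:div+div*i]
    vs=[]
    for j in range(divM.shape[1]):
        # multiply one column of the block (as a sparse row vector) by the sparse matrix
        divMB=csr_matrix(divM.T[j]).dot(weights)
        vs.append(divMB)
    divapp=vstack(vs)
    if i ==0:
        # first block: create the on-disk dataset
        h5f = h5sparse.File("F:/dissertation/dallastest/temp/tt1.h5")
        h5f.create_dataset('sparse/matrix', data=divapp, chunks=(389,),maxshape=(None,))
    else:
        # later blocks: append to the existing dataset
        h5f['sparse/matrix'].append(divapp)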


Python numpy matrix multiplication mismatch in core dimension

I am trying to matrix multiply a 2x2 matrix with a 2x1 matrix. Both matrices have entries which are linspaces such that the resulting 2x1 matrix gives me a value for each value of the linspace.
I get this dimensionality error however.
matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 2)
For readability I am not posting the whole code but what's necessary.
I have also replaced linspace values with indicative text.
Matrix "L" is a result of other 2x2 multiplications which contain constants, thus no errors there.
The matrix B (2x2) gives the desired result, so the problem comes down to the multiplication between B and C.
import numpy as np
from sympy import *
# Defining range of values
z = np.linspace(initial, final, 10)
g = np.linspace(initial, final, 10)
y = np.linspace(initial, final, 10)
# Matrix operations
A = np.array([[1, z], [0, 1]], dtype=object)
B = np.matmul(L,A)
C = np.array([[y],[g]])
D = np.matmul(B, C)
print(total)
Put another way: given the matrix "B" multiplied by the 2x1 matrix "C", which contains unknowns, I want to calculate those unknowns "y" and "g".
Many thanks,
P.S; For an array "C" with single value entries, the multiplication runs as expected.
Edit; As per mozway's suggestion, I am providing the prints of array "A" and "M" which will make stuff clearer, but let M = B
In [66]: initial, final = 0,1
In [67]: z = np.linspace(initial,final,11)
In [68]: z
Out[68]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
A is (2,2), but contains a mix of array and scalars
In [69]: A = np.array([[1,z],[0,1]], object)
In [70]: A
Out[70]:
array([[1,
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])],
[0, 1]], dtype=object)
In [71]: A.shape
Out[71]: (2, 2)
Now make a (2,2) numeric array:
In [72]: L = np.eye(2)
In [75]: L[1,1] = 2
In [76]: np.matmul(L,A)
Out[76]:
array([[1.0,
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])],
[0.0, array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])]],
dtype=object)
matmul does work with object dtype arrays, provided the elements implement the necessary + and *. The result is still (2,2), but the (1,1) term is now an array as well (0*z + 2*1, i.e. all 2s).
Now for the C:
In [77]: C = np.array([[z],[z]])
In [78]: C
Out[78]:
array([[[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]],
[[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]]])
In [79]: C.shape
Out[79]: (2, 1, 11)
This is float dtype, 3d array.
In [81]: B=Out[76]
In [82]: np.matmul(B,C)
Traceback (most recent call last):
File "<ipython-input-82-5eababb7341e>", line 1, in <module>
np.matmul(B,C)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 2)
In [83]: B.shape
Out[83]: (2, 2)
In [84]: C.shape
Out[84]: (2, 1, 11)
There's a mismatch in shapes. But change C definition so it is a 2d array:
In [85]: C = np.array([z,z])
In [86]: C.shape
Out[86]: (2, 11)
In [87]: np.matmul(B,C)
Out[87]:
array([[array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
array([0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ]),
...
array([1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8]),
array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])]],
dtype=object)
In [88]: _.shape
Out[88]: (2, 11)
Here the (2,2) B matmuls with (2,11) just fine producing (2,11). But each element is itself a (11,) array - because of the z used in defining A.
But you say you want a (2,1) C. To get that we have to use:
In [91]: C = np.empty((2,1), object)
In [93]: C[:,0]=[z,z]
In [94]: C
Out[94]:
array([[array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])],
[array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])]],
dtype=object)
Be very careful when trying to create object dtype arrays. Things might not be what you expect.
Now matmul of (2,2) with (2,1) => (2,1), object dtype
In [95]: D = np.matmul(B,C)
In [96]: D.shape
Out[96]: (2, 1)
In [99]: D
Out[99]:
array([[array([0. , 0.11, 0.24, 0.39, 0.56, 0.75, 0.96, 1.19, 1.44, 1.71, 2. ])],
[array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])]],
dtype=object)
Keep in mind that matmul is very fast when working with numeric dtype arrays. It does work with object dtype arrays, but it is much slower, more like using list comprehensions.
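If what is wanted is one 2x2 times 2x1 product per linspace value, a purely numeric alternative to object arrays is to stack the matrices along a leading axis and let matmul broadcast over it. The sketch below assumes that interpretation of the question; the stacked layout and the variable n are illustrative choices, while L and z are the same as in the session above.

```python
import numpy as np

z = np.linspace(0, 1, 11)
n = z.size

# One 2x2 matrix A per value of z, stacked along the first axis: shape (n, 2, 2)
A = np.zeros((n, 2, 2))
A[:, 0, 0] = 1
A[:, 0, 1] = z
A[:, 1, 1] = 1

L = np.array([[1.0, 0.0],
              [0.0, 2.0]])  # same L as in the session above

B = L @ A                                  # L broadcasts over the stack -> (n, 2, 2)
C = np.stack([z, z], axis=1)[:, :, None]   # one 2x1 column per z -> (n, 2, 1)
D = B @ C                                  # batched matmul -> (n, 2, 1)

print(D.shape)
print(D[:, 0, 0])   # first component for every z: z + z**2
print(D[:, 1, 0])   # second component for every z: 2*z
```

The values agree with the object-dtype result D shown above, but everything stays float dtype, so the fast numeric matmul path is used.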
Not sure what you're trying to do (you should provide a reproducible example, as there are currently many missing variables, along with the expected output).
Nevertheless, the definition of A is fundamentally wrong. I imagine you expect a 2x2 array, but as z is a (10,) shaped array, you will end up with A being a weird object array whose element (0,1) is an array.
This prevents you from doing any further mathematical operations.

What does the rcond parameter of numpy.linalg.pinv do?

While looking up how to calculate pseudo-inverses in numpy (1.15.4) I noticed that numpy.linalg.pinv has a parameter rcond for which the description reads:
rcond : (…) array_like of float
Cutoff for small singular values. Singular values smaller (in
modulus) than rcond * largest_singular_value (again, in modulus)
are set to zero. Broadcasts against the stack of matrices
From my understanding if rcond is a scalar float, all entries
in the output of pinv which would have been smaller than rcond should be set to zero instead (which would be really useful) but this is not what happens, e.g.:
>>> A = np.array([[ 0., 0.3, 1., 0.],
[ 0., 0.4, -0.3, 0.],
[ 0., 1., -0.1, 0.]])
>>> np.linalg.pinv(A, rcond=1e-3)
array([[ 8.31963531e-17, -4.52584594e-17, -5.09901252e-17],
[ 1.82668420e-01, 3.39032588e-01, 8.09586439e-01],
[ 8.95805933e-01, -2.97384188e-01, -1.49788105e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
What does this parameter actually do? And can I only get the behaviour I actually want by iterating over the whole output matrix again?
Under the hood, a pseudoinverse is calculated using a singular value decomposition. An initial matrix A=UDV^T is inverted as A^+=VD^+U^T, where D is a diagonal matrix with positive real values (singular values). rcond is used to zero out small entries in D. For example:
import numpy as np
# Initial matrix
a = np.array([[1, 0],
[0, 0.1]])
# SVD with diagonal entries in D = [1. , 0.1]
print(np.linalg.svd(a))
# (array([[1., 0.],
# [0., 1.]]),
# array([1. , 0.1]),
# array([[1., 0.],
# [0., 1.]]))
# Pseudoinverse
c = np.linalg.pinv(a)
print(c)
# [[ 1. 0.]
# [ 0. 10.]]
# Reconstruction is perfect
print(np.dot(a, np.dot(c, a)))
# [[1. 0. ]
# [0. 0.1]]
# Zero out all entries in D below rcond * largest_singular_value = 0.2 * 1
# Not entries of the initial or inverse matrices!
d = np.linalg.pinv(a, rcond=0.2)
print(d)
# [[1. 0.]
# [0. 0.]]
# Reconstruction is imperfect
print(np.dot(a, np.dot(d, a)))
# [[1. 0.]
# [0. 0.]]
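To make the role of rcond in the SVD concrete, here is a minimal re-implementation for real matrices. It is a sketch for illustration only (the function name pinv_via_svd is made up, and np.linalg.pinv remains the function to use in practice): the cutoff is applied to the singular values, never to the entries of the result.

```python
import numpy as np

def pinv_via_svd(a, rcond):
    """Illustrative pseudoinverse of a real matrix via SVD (not NumPy's own code)."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = rcond * s.max()          # threshold relative to the largest singular value
    keep = s > cutoff
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]       # invert only the singular values that survive
    return vt.T @ (s_inv[:, None] * u.T)

a = np.array([[1, 0],
              [0, 0.1]])
print(pinv_via_svd(a, rcond=0.2))     # singular value 0.1 is dropped
print(pinv_via_svd(a, rcond=1e-15))   # both singular values are kept
```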
To just zero out small values of a matrix (small in absolute value, so that tiny negative entries are caught as well):
a = np.array([[1, 2],
              [3, 0.1]])
a[np.abs(a) < 0.5] = 0
print(a)
# [[1. 2.]
# [3. 0.]]

Convert NumPy array to 0 or 1 based on threshold

I have an array below:
a=np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
What I want is to convert this vector to a binary vector based on a threshold.
Take threshold=0.5 as an example: elements greater than 0.5 are converted to 1, all others to 0.
The output vector should look like this:
a_output = [0, 0, 0, 1, 1, 1]
How can I do this?
np.where
np.where(a > 0.5, 1, 0)
# array([0, 0, 0, 1, 1, 1])
Boolean masking with astype
(a > .5).astype(int)
# array([0, 0, 0, 1, 1, 1])
np.select
np.select([a <= .5, a>.5], [np.zeros_like(a), np.ones_like(a)])
# array([ 0., 0., 0., 1., 1., 1.])
Special case: np.round
This is the best solution if your array values are floating values between 0 and 1 and your threshold is 0.5.
a.round()
# array([0., 0., 0., 1., 1., 1.])
You could use binarize from the sklearn.preprocessing module.
However, this will work only if you want your final values to be binary, i.e. 0 or 1. The answers provided above work for non-binary results as well.
from sklearn.preprocessing import binarize
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9]).reshape(1,-1)
# threshold defaults to 0.0, so pass it explicitly as a keyword argument
x = binarize(a, threshold=0.5)
a_output = np.ravel(x)
print(a_output)
#everything together
a_output = np.ravel(binarize(a.reshape(1,-1), threshold=0.5))

apply numpy.histogram to multidimensional array

I want to apply numpy.histogram() to a multi-dimensional array along an axis.
Say, for example I have a 2D array and I want to apply histogram() along axis=1.
Code:
import numpy
array = numpy.array([[0.6, 0.7, -0.3, 1.0, -0.8], [0.2, -1.0, -0.5, 0.5, 0.8],
[0.25, 0.3, -0.1, -0.8, 1.0]])
bins = [-1.0, -0.5, 0, 0.5, 1.0, 1.0]
hist, bin_edges = numpy.histogram(array, bins)
print(hist)
Output:
[3 3 3 4 2]
Expected Output:
[[1 1 0 2 1],
[1 1 1 2 0],
[1 1 2 0 1]]
How can I get my expected output?
I tried to use the solution suggested in this post, but it doesn't get me to the expected output.
For n-d cases, you can do this with np.histogram2d just by making a dummy x-axis (i):
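import numpy as np
def vec_hist(a, bins):
    n_rows = np.prod(a.shape[:-1])  # number of 1-D slices along the last axis
    i = np.repeat(np.arange(n_rows), a.shape[-1])  # dummy x-coordinate: row index of each element
    # one x-bin per row; returns (hist, xedges, yedges)
    return np.histogram2d(i, a.flatten(), (n_rows, bins))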
Output
vec_hist(array, bins)
Out[453]:
(array([[ 1., 1., 0., 2., 1.],
[ 1., 1., 1., 2., 0.],
[ 1., 1., 2., 0., 1.]]),
array([ 0. , 0.66666667, 1.33333333, 2. ]),
array([-1. , -0.5 , 0. , 0.5 , 0.9999999, 1. ]))
For histograms over an arbitrary axis, you'll probably need to create i using np.meshgrid and np.ravel_multi_index and then use that to reshape the resulting histogram.
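A simpler, though slower, route to a per-row histogram is to call np.histogram once per 1-D slice with np.apply_along_axis. This is a different technique from the histogram2d approach above, and the wrapper name hist_along_axis is made up for the sketch. It uses the array and bins from the question.

```python
import numpy as np

array = np.array([[0.6, 0.7, -0.3, 1.0, -0.8],
                  [0.2, -1.0, -0.5, 0.5, 0.8],
                  [0.25, 0.3, -0.1, -0.8, 1.0]])
bins = [-1.0, -0.5, 0, 0.5, 1.0, 1.0]

def hist_along_axis(a, bins, axis=-1):
    """Apply np.histogram to every 1-D slice of `a` along `axis`, keeping only the counts."""
    return np.apply_along_axis(lambda v: np.histogram(v, bins)[0], axis, a)

hist = hist_along_axis(array, bins, axis=1)
print(hist)
```

This loops in Python over the slices, so it is best suited to arrays with a modest number of rows; for this input it gives the expected output listed in the question.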

Combining two numpy arrays to form an array with the largest value from each array

I want to combine two numpy arrays to produce an array with the largest values from each array.
import numpy as np
a = np.array([[ 0., 0., 0.5],
[ 0.1, 0.5, 0.5],
[ 0.1, 0., 0.]])
b = np.array([[ 0., 0., 0.0],
[ 0.5, 0.1, 0.5],
[ 0.5, 0.1, 0.]])
I would like to produce
array([[ 0., 0., 0.5],
[ 0.5, 0.5, 0.5],
[ 0.5, 0.1, 0.]])
I know you can do
a += b
which results in
array([[ 0. , 0. , 0.5],
[ 0.6, 0.6, 1. ],
[ 0.6, 0.1, 0. ]])
This is clearly not what I'm after. It seems like such an easy problem and I assume it most probably is.
You can use np.maximum to compute the element-wise maximum of the two arrays:
>>> np.maximum(a, b)
array([[ 0. , 0. , 0.5],
[ 0.5, 0.5, 0.5],
[ 0.5, 0.1, 0. ]])
This works with any two arrays, as long as they're the same shape or one can be broadcast to the shape of the other.
To modify the array a in-place, you can redirect the output of np.maximum back to a:
np.maximum(a, b, out=a)
There is also np.minimum for calculating the element-wise minimum of two arrays.
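The broadcasting remark can be illustrated with a small sketch: the same array a as in the question is compared against a single row of per-column lower bounds. The name floor_row and its values are invented for the example.

```python
import numpy as np

a = np.array([[0. , 0. , 0.5],
              [0.1, 0.5, 0.5],
              [0.1, 0. , 0. ]])

# One lower bound per column; shape (3,) broadcasts against the (3, 3) array
floor_row = np.array([0.3, 0.0, 0.1])

clipped = np.maximum(a, floor_row)   # returns a new array, `a` is left untouched
print(clipped)
```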
You are looking for the element-wise maximum.
Example:
>>> np.maximum([2, 3, 4], [1, 5, 2])
array([2, 5, 4])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.maximum.html
inds = b > a
a[inds] = b[inds]
This modifies the original array a, which is what += does in your example; that may or may not be what you want.
