Can sklearn.decomposition.TruncatedSVD be applied to a matrix in parts? - python

I am applying sklearn.decomposition.TruncatedSVD to very large matrices. If the matrix is above a certain size (say 350k by 25k), svd.fit(x) runs out of RAM.
I am applying svd to feature matrices, where each row represents a set of features extracted from a single image.
To work around the memory issues, is it safe to apply svd to parts of the matrix (and then concatenate)?
Will the result be the same? I.e.:
import numpy as np
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=128)
part_1 = svd.fit_transform(features[0:100000, :])
part_2 = svd.fit_transform(features[100000:, :])
svd_features = np.concatenate((part_1, part_2), axis=0)
.. equivalent to(?):
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=128)
svd_features = svd.fit_transform(features)
If not, is there a workaround for dim reduction of very large matrices?

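The results will not be the same. Each call to fit_transform refits the model on only the rows it receives, so the two parts are projected onto different components.
For example, consider the code below: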
import numpy as np
features=np.array([[3, 2, 1, 3, 1],
[2, 0, 1, 2, 2],
[1, 3, 2, 1, 3],
[1, 1, 3, 2, 3],
[1, 1, 2, 1, 3]])
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
part_1 = svd.fit_transform(features[0:2, :])
part_2 = svd.fit_transform(features[2:, :])
svd_features = np.concatenate((part_1, part_2), axis=0)
svd_b = TruncatedSVD(n_components=2)
svd_features_b = svd_b.fit_transform(features)
print(svd_features)
print(svd_features_b)
This prints
[[ 4.81379561 -0.90959982]
[ 3.36212985 1.30233746]
[ 4.70088886 1.37354278]
[ 4.76960857 -1.06524658]
[ 3.94551566 -0.34876626]]
[[ 4.17420185 2.47515867]
[ 3.23525763 0.9479915 ]
[ 4.53499272 -1.13912762]
[ 4.69967028 -0.89231578]
[ 3.81909069 -1.05765576]]
which are different from each other.
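As for a workaround, one option is to fit a single model incrementally and then transform every chunk with that same fitted model, so all rows share one projection. The sketch below swaps in sklearn.decomposition.IncrementalPCA, which supports partial_fit; note that it centers the data, unlike TruncatedSVD, so the numbers will not match TruncatedSVD output. The random features array, the chunk size, and n_components=8 are placeholders chosen for illustration, not values from the question.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Placeholder data standing in for the real feature matrix (assumption).
rng = np.random.RandomState(0)
features = rng.rand(1000, 50)

chunk_size = 250  # must be >= n_components for partial_fit
ipca = IncrementalPCA(n_components=8)

# Pass 1: update one shared model chunk by chunk.
for start in range(0, features.shape[0], chunk_size):
    ipca.partial_fit(features[start:start + chunk_size])

# Pass 2: project each chunk with the same fitted model, then concatenate.
reduced = np.vstack([
    ipca.transform(features[start:start + chunk_size])
    for start in range(0, features.shape[0], chunk_size)
])
```

Because the projection is fixed after fitting, transforming in chunks gives the same rows as transforming the whole matrix at once; only the fitting step has to see the data piece by piece.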

Related

Scipy KDTree get rectangular subset of grid defined by two points

I am using the following example from :
import numpy as np
from scipy import spatial
x, y = np.mgrid[0:5, 2:8]
tree = spatial.KDTree(list(zip(x.ravel(), y.ravel())))
pts = np.array([[0, 0], [2.1, 2.9]])
idx = tree.query(pts)[1]
data = tree.data[??????????]
If I input two arbitrary points (see variable pts), I am looking to return all pairs of coordinates that lie within the rectangle defined by the two points (KDTree finds the closest neighbour). So in this case:
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2],
[2, 0],
[2, 1],
[2, 2]])
How can I achieve that from the tree data?
Seems that I found a solution:
from scipy import spatial
import numpy as np
x, y = np.mgrid[0:5, 0:5]
tree = spatial.KDTree(list(zip(x.ravel(), y.ravel())))
pts = np.array([[0, 0], [2.1, 2.2]])
idx = tree.query(pts)[1]
data = tree.data[[idx[0], idx[1]]]
rectangle = tree.data[np.where((tree.data[:,0]>=min(data[:,0])) & (tree.data[:,0]<=max(data[:,0])) & (tree.data[:,1]>=min(data[:,1])) & (tree.data[:,1]<=max(data[:,1])))]
However, I would love to see a solution using the query option!
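One way to involve the tree's own query machinery is query_ball_point with p=np.inf: a Chebyshev ball is an axis-aligned square, so a square around the rectangle's centre yields candidates that a final mask trims to the exact rectangle. This is only a sketch under that assumption; the names lo, hi, center, and half are introduced here for illustration, and the small tolerance added to the radius is there to avoid losing points that sit exactly on the boundary.

```python
import numpy as np
from scipy import spatial

x, y = np.mgrid[0:5, 0:5]
tree = spatial.KDTree(np.column_stack((x.ravel(), y.ravel())))

pts = np.array([[0, 0], [2.1, 2.2]])
lo, hi = pts.min(axis=0), pts.max(axis=0)  # rectangle corners
center = (lo + hi) / 2
half = (hi - lo) / 2                        # half side lengths

# Chebyshev (p=inf) ball: the smallest square covering the rectangle.
candidates = tree.query_ball_point(center, r=half.max() + 1e-9, p=np.inf)
cand = tree.data[candidates]

# Trim the square down to the exact rectangle.
inside = np.all((cand >= lo) & (cand <= hi), axis=1)
rectangle = cand[inside]
```

Unlike the nearest-neighbour approach, this selects points by the rectangle itself rather than by the grid points closest to its corners. The order of the returned rows is not guaranteed, so sort them if the order matters.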

Memory-efficient storage of large distance matrices

I have to create a data structure to store distances from each point to every other point in a very large array of 2d-coordinates. It's easy to implement for small arrays, but beyond about 50,000 points I start running into memory issues -- not surprising, given that I'm creating an n x n matrix.
Here's a simple example which works fine:
import numpy as np
from scipy.spatial import distance
n = 2000
arr = np.random.rand(n,2)
d = distance.cdist(arr,arr)
cdist is fast, but is inefficient in storage since the matrix is mirrored diagonally (e.g. d[i][j] == d[j][i]). I can use np.triu(d) to convert to upper triangular, but the resulting square matrix still takes the same memory. I also don't need distances beyond a certain cutoff, so that can be helpful. The next step is to convert to a sparse matrix to save memory:
from scipy import sparse
max_dist = 5
dist = np.array([[0,1,3,6], [1,0,8,7], [3,8,0,4], [6,7,4,0]])
print(dist)
array([[0, 1, 3, 6],
[1, 0, 8, 7],
[3, 8, 0, 4],
[6, 7, 4, 0]])
dist[dist>=max_dist] = 0
dist = np.triu(dist)
print(dist)
array([[0, 1, 3, 0],
[0, 0, 0, 0],
[0, 0, 0, 4],
[0, 0, 0, 0]])
sdist = sparse.lil_matrix(dist)
print(sdist)
(0, 1) 1
(2, 3) 4
(0, 2) 3
The problem is getting to that sparse matrix quickly for a very large dataset. To reiterate, making a square matrix with cdist is the fastest way I know of to calculate distances between points, but the intermediate square matrix runs out of memory. I could break it down into more manageable chunks of rows, but then that slows things down a lot. I feel like I'm missing some obvious easy way to go directly to a sparse matrix from cdist.
Here is how to do it with a KDTree:
>>> import numpy as np
>>> from scipy import sparse
>>> from scipy.spatial import cKDTree as KDTree
>>>
# mock data
>>> a = np.random.random((50000, 2))
>>>
# make tree
>>> A = KDTree(a)
>>>
# list all pairs within 0.05 of each other in 2-norm
# format: (i, j, v) - i, j are indices, v is distance
>>> D = A.sparse_distance_matrix(A, 0.05, p=2.0, output_type='ndarray')
>>>
# only keep upper triangle
>>> DU = D[D['i'] < D['j']]
>>>
# make sparse matrix
>>> result = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), (50000, 50000))
>>> result
<50000x50000 sparse matrix of type '<class 'numpy.float64'>'
with 9412560 stored elements in COOrdinate format>
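To gain some confidence in this approach, the same steps can be checked against cdist on a dataset small enough for the dense matrix to fit in memory. The following is a minimal sketch; the sizes, the seed, and the names cutoff and dense are chosen here for illustration only.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree, distance

rng = np.random.RandomState(0)
a = rng.random_sample((200, 2))
cutoff = 0.05

# Sparse route: pairs within the cutoff, upper triangle only.
tree = cKDTree(a)
D = tree.sparse_distance_matrix(tree, cutoff, p=2.0, output_type='ndarray')
DU = D[D['i'] < D['j']]
result = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), shape=(200, 200))

# Dense route: full cdist, then apply the same cutoff and triangle.
dense = np.triu(distance.cdist(a, a), k=1)
dense[dense > cutoff] = 0

matches = np.allclose(result.toarray(), dense)
```

If matches is True, the sparse result agrees with the thresholded upper triangle of the dense matrix, without the n x n intermediate being needed at full scale.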

Standardizing X different in Python Lasso and R glmnet?

I was trying to get the same result fitting lasso using Python's scikit-learn and R's glmnet. A helpful link
If I specify "normalize =True" in Python and "standardize = T" in R, they gave me the same result.
Python:
import numpy as np
from sklearn.linear_model import Lasso
X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha =0.01, fit_intercept = True, normalize =True)
reg.fit(X, y)
np.hstack((reg.intercept_, reg.coef_))
Out[95]: array([-0.89607695, 0. , -0.24743375, 1.03286824])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = T)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.8960770
V1 .
V2 -0.2474338
V3 1.0328682
However, if I don't want to standardize variables and set normalize =False and standardize = F, they gave me quite different results.
Python:
from sklearn.linear_model import Lasso
Z = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha =0.01, fit_intercept = True, normalize =False)
reg.fit(Z, y)
np.hstack((reg.intercept_, reg.coef_))
Out[96]: array([-0.88 , 0.09384212, -0.36159299, 1.05958478])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = F)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.76000000
V1 0.04441697
V2 -0.29415542
V3 0.97623074
What's the difference between "normalize" in Python's Lasso and "standardize" in R's glmnet?
Currently, with regard to the normalize parameter the docs state "If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False."
So evidently normalize and standardize are not the same with sklearn.linear_model.Lasso. Having read the StandardScaler docs I fail to understand the difference, but the fact that there is one is implied by the provided description of the normalize parameter.
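For context, in the scikit-learn versions that had it, normalize=True was documented as subtracting the mean and dividing by the l2-norm, whereas StandardScaler divides by the standard deviation; the parameter has since been removed from Lasso. The route the docs recommend can be sketched as a pipeline, shown below. This is only an illustration using the data from the question; the back-transformation to the original feature scale and the names coef_original and intercept_original are additions made here, and no claim is made that the numbers match glmnet.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)

# Standardize explicitly, then fit Lasso on the scaled features.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.01, fit_intercept=True))
model.fit(X, y)

scaler = model.named_steps["standardscaler"]
lasso = model.named_steps["lasso"]

# Express the coefficients on the original (unscaled) feature scale.
coef_original = lasso.coef_ / scaler.scale_
intercept_original = lasso.intercept_ - np.dot(scaler.mean_, coef_original)
```

Reporting coefficients on the original scale makes them directly comparable with what glmnet prints, since glmnet also returns coefficients on the original scale after standardizing internally.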

How to perform stencil computations element-wise on a matrix in Theano?

I have the following blurring kernel I need to apply to every pixel in an RGB image
[ 0.0625 0.25 0.375 0.25 0.0625 ]
So, the pseudo-code looks something like this in Numpy
for i in range(rows):
    for j in range(cols):
        for k in range(3):
            final[i][j][k] = image[i-2][j][k]*0.0625 + \
                             image[i-1][j][k]*0.25 + \
                             image[i][j][k]*0.375 + \
                             image[i+1][j][k]*0.25 + \
                             image[i+2][j][k]*0.0625
I've tried searching for a question similar to this but never found these sort of data accesses in the computation.
How do I perform the above function for a Theano tensor matrix?
You can use the conv2d function for this task; see the reference here, and you may also want to read the example tutorial here. Notes for this solution:
Because your kernel is symmetrical, you can ignore the filter_flip parameter.
conv2d takes a 4D input and a 4D kernel, so you need to reshape both first.
conv2d sums over every channel (I assume the 'k' variable in your code indexes the RGB channels), so you should separate the channels first.
This is my example code; I use a simpler kernel here:
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
# original image
img = [[[1, 2, 3, 4], #R channel
[1, 1, 1, 1], #
[2, 2, 2, 2]], #
[[1, 1, 1, 1], #G channel
[2, 2, 2, 2], #
[1, 2, 3, 4]], #
[[1, 1, 1, 1], #B channel
[1, 2, 3, 4], #
[2, 2, 2, 2],]]#
# separate and reshape each channel to 4D
R = np.asarray([[img[0]]], dtype='float32')
G = np.asarray([[img[1]]], dtype='float32')
B = np.asarray([[img[2]]], dtype='float32')
# 4D kernel from the original : [1,0,1]
kernel = np.asarray([[[[1],[0],[1]]]], dtype='float32')
# theano convolution
t_img = T.ftensor4("t_img")
t_kernel = T.ftensor4("t_kernel")
result = conv2d(
    input=t_img,
    filters=t_kernel,
    filter_shape=(1, 1, 3, 1),  # must match kernel.shape
    border_mode='half')
f = theano.function([t_img,t_kernel],result)
# compute each channel
R = f(R,kernel)
G = f(G,kernel)
B = f(B,kernel)
# reshape again
img = np.asarray([R,G,B])
img = np.reshape(img,(3,3,4))
print(img)
If you have anything to discuss about the code, please comment. Hope it helps.
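For comparison, the stencil from the question can also be written in plain NumPy without the triple loop, which may help when checking the Theano output. This is a sketch under two assumptions not stated in the question: the image has shape (rows, cols, 3), and rows outside the image are treated as zeros. The function name blur_rows is hypothetical.

```python
import numpy as np

# 5-tap kernel applied along the row axis, as in the question's pseudo-code.
kernel = np.array([0.0625, 0.25, 0.375, 0.25, 0.0625])

def blur_rows(image):
    """Apply the 5-tap stencil along axis 0, zero-padding the borders."""
    rows = image.shape[0]
    # Two rows of zeros above and below so i-2 .. i+2 is always valid.
    padded = np.pad(image, ((2, 2), (0, 0), (0, 0)), mode="constant")
    final = np.zeros(image.shape, dtype=float)
    for offset, weight in enumerate(kernel):
        # padded[offset + i] corresponds to image[i + offset - 2]
        final += weight * padded[offset:offset + rows]
    return final

rng = np.random.RandomState(0)
image = rng.rand(6, 4, 3)
blurred = blur_rows(image)
```

Only the five kernel taps are looped over; each step is a whole-array slice, so the cost per tap is a single vectorized multiply-add over all pixels and channels.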

Multiple linear regression for a surface using NumPy - example

This question is close to: fitting a linear surface with numpy least squares, but there's no sample data. I must be terribly slow but it seems I can't get it to work.
I have the following code:
import numpy as np
XYZ = np.array([[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 1, 1, 1]])
A = np.row_stack((np.ones(len(XYZ[0])), XYZ[0, :], XYZ[1:]))
coeffs = np.linalg.lstsq(A.T, XYZ[2, :])[0]
print(coeffs)
The output is:
[ 5.00000000e-01 5.55111512e-17 9.71445147e-17 5.00000000e-01]
I want z = a + bx + cy, i.e. three coefficients, but the output gives me four. Where do I go wrong here? I expected coeffs to be something like:
[ 1.0 0.0 0.0]
Any help appreciated.
Peter Schneider (comment) is right: you'll want to feed XYZ[1, :] to row_stack:
>>> A = np.row_stack((np.ones(len(XYZ[0])), XYZ[0, :], XYZ[1, :]))
>>> np.linalg.lstsq(A.T, XYZ[2, :])[0]
array([ 1.00000000e+00, -7.85046229e-17, -7.85046229e-17])
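Since the question asked for sample data, here is a small self-contained sketch of the same fit with a less trivial plane. The points and the true coefficients (a_true, b_true, c_true) are made up for illustration; np.column_stack builds the design matrix directly so no transpose is needed, and rcond=None is passed to avoid the warning that newer NumPy versions emit.

```python
import numpy as np

# Made-up sample points (x, y) and a known plane z = a + b*x + c*y.
x = np.array([0.0, 1.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 3.0])
a_true, b_true, c_true = 1.0, 2.0, -0.5
z = a_true + b_true * x + c_true * y

# Design matrix: one column each for the intercept, x, and y.
A = np.column_stack((np.ones_like(x), x, y))

# Least-squares solution; coeffs holds (a, b, c).
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, z, rcond=None)
```

With exactly three columns in A, coeffs has exactly three entries, which is the shape the question expected.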
