Is t-SNE's computational bottleneck its memory complexity? - python

I've been exploring different dimensionality reduction algorithms, specifically PCA and t-SNE. I'm taking a small subset of the MNIST dataset (with ~780 dimensions) and attempting to reduce the raw data down to three dimensions to visualize as a scatter plot. t-SNE is described in great detail here.
I'm using PCA as an intermediate dimensionality reduction step prior to t-SNE, as recommended by the original creators of t-SNE in the source code on their website.
I'm finding that t-SNE takes forever to run (10-15 minutes to go from a 2000 x 25 to a 2000 x 3 feature space), while PCA runs relatively quickly (a few seconds for 2000 x 780 => 2000 x 20).
Why is this the case? My theory is that in the PCA implementation (taken directly from the primary author's source code in Python), he uses NumPy's vectorized dot product to work with X and X.T:
def pca(X = Math.array([]), no_dims = 50):
    """Runs PCA on the NxD array X in order to reduce its dimensionality to no_dims dimensions."""
    # Note: the source file imports numpy as Math, so Math.dot etc. are numpy calls.
    print "Preprocessing the data using PCA..."
    (n, d) = X.shape;
    X = X - Math.tile(Math.mean(X, 0), (n, 1));
    (l, M) = Math.linalg.eig(Math.dot(X.T, X));
    Y = Math.dot(X, M[:,0:no_dims]);
    return Y;
As far as I recall, this is significantly more efficient than scalar operations, and it also means that only 2N (where N is the number of rows) worth of data is loaded into memory at a time (you need to load one row of X and one column of X.T).
However, I don't think this is the root reason. t-SNE definitely also contains vectorized operations, for example when calculating the pairwise distances D:
D = Math.add(Math.add(-2 * Math.dot(X, X.T), sum_X).T, sum_X);
Or, when calculating P (in the higher-dimensional space) and Q (in the lower-dimensional space). In t-SNE, however, you have to create two N x N matrices to store the pairwise distances between data points: one for the original high-dimensional representation and one for the reduced-dimensional representation.
In computing your gradient, you also have to create another N x N matrix called PQ, which is P - Q.
It seems to me that the memory complexity here is the bottleneck. t-SNE requires on the order of 3N^2 values in memory. There is no way this can fit in cache, so the algorithm suffers significant cache misses and has to go out to main memory to retrieve the values.
Is this correct? How do I explain to a client or a reasonable non-technical person why t-SNE is slower than PCA?
The co-author's Python implementation is found here.

The main reason t-SNE is slower than PCA is that no analytical solution exists for the criterion being optimised. Instead, a solution must be approximated through gradient descent iterations.
In practice, this means lots of for loops, not least the main iteration for-loop at line 129, which runs up to max_iter=1000 times. Additionally, the x2p function iterates over all data points with a for loop.
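To make the contrast concrete, here is a minimal schematic I put together (not the reference code itself): PCA amounts to a single closed-form factorization, while t-SNE repeats a gradient step up to max_iter times, and every step touches O(N^2) pairwise quantities. The tsne_like_loop and grad_fn names below are placeholders for illustration.
import numpy as np

def pca_once(X, no_dims=3):
    # One closed-form factorization: centre, eigendecompose X^T X, project.
    Xc = X - X.mean(axis=0)
    _, M = np.linalg.eigh(Xc.T @ Xc)       # eigenvalues come out in ascending order
    return Xc @ M[:, -no_dims:]            # keep the top no_dims eigenvectors

def tsne_like_loop(Y, grad_fn, max_iter=1000, eta=200.0):
    # Schematic of the iterative part: up to max_iter gradient steps, each of
    # which recomputes quantities over all N x N pairs of points.
    for _ in range(max_iter):
        Y = Y - eta * grad_fn(Y)
    return Y

X = np.random.rand(2000, 25)
Y0 = pca_once(X, no_dims=3)                # a single factorization, fractions of a second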
The reference implementation is optimised for readability, not for computational speed. The authors link to an optimised Torch implementation as well, which should speed up the computation a lot. If you want to stay in pure Python, I recommend the implementation in Scikit-Learn, which should also be a lot faster.
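If you do switch to Scikit-Learn, a minimal call looks like the following sketch (the parameter values are just illustrative, and X is your, optionally PCA-reduced, data array):
from sklearn.manifold import TSNE

# Barnes-Hut t-SNE; n_components up to 3 is supported with the default method.
Y = TSNE(n_components=3, perplexity=30.0, init='pca').fit_transform(X)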

t-SNE tries to lower the dimensionality while preserving the distributions of distances between elements.
This requires computing the distances between all pairs of points. The pairwise distance matrix has N^2 entries, where N is the number of examples.
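A rough illustration with the 2000-point example from the question (my own numbers, assuming float64 storage):
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(2000, 25)
D = squareform(pdist(X, 'sqeuclidean'))    # full N x N matrix of squared distances
print(D.shape, D.nbytes / 1e6, "MB")       # (2000, 2000), 32.0 MB -- and t-SNE keeps several such matrices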

Related

How does numpy.linalg.eigh compare to numpy.linalg.svd?

Problem description
For a square matrix, one can obtain the SVD decomposition
X = U S V'
simply by using the numpy.linalg.svd routine,
u, s, vh = numpy.linalg.svd(X)
or by using numpy.linalg.eigh to compute the eigendecomposition of the Hermitian matrices X'X and XX'.
Are they using the same algorithm? Calling the same Lapack routine?
Is there any difference in terms of speed and stability?
Indeed, numpy.linalg.svd and numpy.linalg.eigh do not call the same LAPACK routine. numpy.linalg.eigh calls LAPACK's dsyevd(), while numpy.linalg.svd makes use of LAPACK's dgesdd().
The common point between these routines is the use of Cuppen's divide and conquer algorithm, first designed to solve tridiagonal eigenvalue problems. For instance, dsyevd() only handles Hermitian matrices and performs the following steps, only if eigenvectors are required:
Reduce matrix to tridiagonal form using DSYTRD()
Compute the eigenvectors of the tridiagonal matrix using the divide and conquer algorithm, through DSTEDC()
Apply the Householder reflections reported by DSYTRD() using DORMTR().
On the contrary, to compute the SVD, dgesdd() performs the following steps, in the case job==A (U and VT required):
Bidiagonalize A using dgebrd()
Compute the SVD of the bidiagonal matrix using the divide and conquer algorithm, through DBDSDC()
Revert the bidiagonalization using the matrices P and Q returned by dgebrd(), applying dormbr() twice: once for U and once for V
While the actual operations performed by LAPACK are very different, the strategies are globally similar. This may stem from the fact that computing the SVD of a general matrix A is similar to performing the eigendecomposition of the symmetric matrix A^T A.
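That connection is easy to check numerically: the singular values of A should match the square roots of the eigenvalues of A^T A (a small sanity check of my own, not part of the LAPACK discussion above).
import numpy as np

A = np.random.rand(50, 50)
s = np.linalg.svd(A, compute_uv=False)         # singular values, descending
w = np.linalg.eigvalsh(A.T @ A)[::-1]          # eigenvalues of A^T A, reordered to descending
print(np.allclose(s, np.sqrt(np.abs(w))))      # True up to round-off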
Regarding the accuracy and performance of LAPACK's divide and conquer SVD, see this survey of SVD methods:
They often achieve the accuracy of QR-based SVD, though it is not proven
The worst case is O(n^3) if no deflation occurs, but often proves better than that
The memory requirement is 8 times the size of the matrix, which can become prohibitive
Regarding the symmetric eigenvalue problem, the complexity is 4/3n^3 (but often proves better than that) and the memory footprint is about 2n^2 plus the size of the matrix. Hence, the best choice is likely numpy.linalg.eigh if your matrix is symmetric.
The actual complexity can be computed for your particular matrices using the following code:
import time

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

# see https://stackoverflow.com/questions/41109122/fitting-a-curve-to-a-power-law-distribution-with-curve-fit-does-not-work
def func_powerlaw(x, m, c):
    return np.log(np.abs(x**m * c))

timeev = []
timesvd = []
size = []
for n in range(10, 600):
    print(n)
    size.append(n)
    # populate A: 1D diffusion (tridiagonal) matrix
    A = np.zeros((n, n))
    for j in range(n):
        A[j, j] = 2.
        if j > 0:
            A[j-1, j] = -1.
        if j < n-1:
            A[j+1, j] = -1.
    # EIG
    Aev = A.copy()
    start = time.time()
    w, v = np.linalg.eigh(Aev, 'L')
    end = time.time()
    timeev.append(end - start)
    # SVD
    Asvd = A.copy()
    start = time.time()
    u, s, vh = np.linalg.svd(Asvd)
    end = time.time()
    timesvd.append(end - start)

# fit a power law to the second half of the measurements
poptev, pcov = curve_fit(func_powerlaw, size[len(size)//2:], np.log(timeev[len(size)//2:]), p0=[2.1, 1e-7], maxfev=8000)
print(poptev)
poptsvd, pcov = curve_fit(func_powerlaw, size[len(size)//2:], np.log(timesvd[len(size)//2:]), p0=[2.1, 1e-7], maxfev=8000)
print(poptsvd)

fig, ax = plt.subplots()
ax.plot(size, timeev, label="eigh")
ax.plot(size, [np.exp(func_powerlaw(x, poptev[0], poptev[1])) for x in size], label="eigh-adjusted complexity: " + str(poptev[0]))
ax.plot(size, timesvd, label="svd")
ax.plot(size, [np.exp(func_powerlaw(x, poptsvd[0], poptsvd[1])) for x in size], label="svd-adjusted complexity: " + str(poptsvd[0]))
ax.set_xlabel('n')
ax.set_ylabel('time, s')
ax.legend(loc="lower right")
ax.set_yscale("log")
fig.tight_layout()
fig.savefig('eigh.jpg')
plt.show()
For such 1D diffusion matrices, eigh outperforms svd, but the actual complexities are similar: slightly lower than n^3, something like n^2.5.
The accuracy could be checked as well.
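For instance, one could reconstruct the matrix from each factorization and look at the residual norm (a small sketch reusing A, w, v, u, s, vh from the last loop iteration above):
err_eig = np.linalg.norm(A - (v * w) @ v.T)    # A = V diag(w) V^T for a symmetric matrix
err_svd = np.linalg.norm(A - (u * s) @ vh)     # A = U diag(s) V^H
print(err_eig, err_svd)                        # both should be down at round-off level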
No, they do not use the same algorithm, as they do different things. They are somewhat related but also very different. Let's start with the fact that you can do SVD on m x n matrices, where m and n don't need to be the same.
Which routine is called also depends on your version of numpy. Here are the eigenvalue routines in LAPACK for double precision:
http://www.netlib.org/lapack/explore-html/d9/d8e/group__double_g_eeigen.html
And the according SVD routines:
http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing.html
There are differences between the routines. Big differences. If you care about the details, they are specified very well in the Fortran headers. In many cases it makes sense to find out what kind of matrix you have in front of you in order to make a good choice of routine. Is the matrix symmetric/Hermitian? Is it in upper diagonal form? Is it positive semidefinite? ...
There are ginormous differences in runtime. But as a rule of thumb, EIGs are cheaper than SVDs. That also depends on convergence speed, which in turn depends a lot on the condition number of the matrix, in other words how ill-conditioned it is, ...
SVDs are usually very robust but slow algorithms, oftentimes used for inversion, speed optimisation through truncation, and principal component analysis, really with the expectation that the matrix you are dealing with is just a pile of shitty rows ;)
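As a quick illustration of the shape difference (my own minimal example): SVD happily takes a rectangular matrix, while eigh expects a square symmetric/Hermitian one, e.g. B^T B.
import numpy as np

B = np.random.rand(6, 4)                          # rectangular: eigh would reject this
u, s, vh = np.linalg.svd(B, full_matrices=False)
print(u.shape, s.shape, vh.shape)                 # (6, 4) (4,) (4, 4)
w, q = np.linalg.eigh(B.T @ B)                    # eigh needs the square symmetric B^T B instead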

Best way to speed up calculation of sum of squared residuals in python

I am currently writing some python code for data analysis in which I unfortunately have to use brute-force search for the best fit. I am using a numpy array exp of dimension M x N. Each of the N columns represent a single trace of experimental data of length M. The actual size of the array is usually in the range of (2500 x 15000).
Before performing the fit I calculate a numpy array cal of possible solutions of dimension M x Z (usually in the range of (2500 x 20000) or larger). To find the best fit I iterate over the experimental data and calculate the sum of squared residuals for a subset of the calculated possible results. The subset commonly consists of 1000 traces. This currently looks somewhat like the following minimal example:
for i in range(1, exp.shape[1]):
    minIndex = max(bestIndeces[i-1] - maxIndexDelta, 0)
    maxIndex = min(bestIndeces[i-1] + maxIndexDelta, 2499)
    SR = np.power(cal[:, minIndex:maxIndex] - exp[:, [i]], 2)
    SSR = np.sum(SR, axis=0)
    bestSSRs[i] = np.min(SSR)
    bestIndeces[i] = int(np.argmin(SSR) + minIndex)
However, this is not a very fast calculation. What would be the best way to speed this calculation up? It would be good to keep on working with numpy arrays. Apart from that any changes or additional libraries would not be a problem to include.
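One common trick for this kind of loop (a sketch only, under the assumption that cal stays fixed across iterations; variable names follow the question) is to expand ||cal_j - exp_i||^2 = sum(cal_j^2) - 2*cal_j.exp_i + sum(exp_i^2), so the expensive squared terms are computed once up front and each iteration reduces to one matrix-vector product:
import numpy as np

cal_sq = np.sum(cal**2, axis=0)                            # shape (Z,), computed once
exp_sq = np.sum(exp**2, axis=0)                            # shape (N,), computed once

for i in range(1, exp.shape[1]):
    minIndex = max(bestIndeces[i-1] - maxIndexDelta, 0)
    maxIndex = min(bestIndeces[i-1] + maxIndexDelta, 2499)
    cross = cal[:, minIndex:maxIndex].T @ exp[:, i]        # one matrix-vector product per trace
    SSR = cal_sq[minIndex:maxIndex] - 2.0 * cross + exp_sq[i]
    bestSSRs[i] = np.min(SSR)
    bestIndeces[i] = int(np.argmin(SSR) + minIndex)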

Fast calculation of distances to each cluster center for an entire dataset

In a data clustering problem I have two numpy arrays, X and C, where X corresponds to observations and C corresponds to the centers of the clusters that can be formed with the data in X. Both of them have the same number of columns (features), but C usually has far fewer rows than X. I'm trying to find a fast way of calculating the minimum squared distance between each observation in X and all the centers in C. In simple python this can be written as
D2 = np.array([min([np.inner(c-x,c-x) for c in C]) for x in X])
which is rather slow, so I managed to improve the speed by doing
D2 = np.array([min(np.sum((C-x)**2, axis=1)) for x in X])
instead, but I'm not yet satisfied with the execution time, and since a for loop still remains, I believe there is hope. Does anyone have an idea on how to further reduce the execution time of this?
For the curious, I'm using this to generate seeds for K-Means through the K-Means++ algorithm.
The fastest you'll get with the numpy/scipy stack is a specialized function for exactly this purpose, scipy.spatial.distance.cdist.
scipy.spatial.distance.cdist(XA, XB, metric='euclidean', p=2, ...)
Computes distance between each pair of the two collections of inputs.
It's also worth noting that scipy provides kmeans clustering as well.
scipy.cluster.vq.kmeans
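For the K-Means++ use case in the question, that boils down to something like the following (using the 'sqeuclidean' metric so no square root is taken, matching the original D2):
from scipy.spatial.distance import cdist

# squared distance from every observation in X to every centre in C, then the row-wise minimum
D2 = cdist(X, C, metric='sqeuclidean').min(axis=1)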

Label Propagation - Array is too big

I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions. I am unable to use it on this data set; it throws a numpy "array is too big" error. However, it works fine on a relatively small data set, say 200 points. Can anyone suggest a fix?
label_prop_model.fit(np.array(data), labels)
  File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
    graph_matrix = self._build_graph()
  File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 108, in _build_graph
    affinity_matrix = self._get_kernel(self.X_) # get the affinity matrix from the data using rbf kernel
  File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 26, in _get_kernel
    return rbf_kernel(X, X, gamma=self.gamma)
  File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 350, in rbf_kernel
    K = euclidean_distances(X, Y, squared=True)
  File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 173, in euclidean_distances
    distances = safe_sparse_dot(X, Y.T, dense_output=True)
  File "/usr/lib/pymodules/python2.7/sklearn/utils/extmath.py", line 79, in safe_sparse_dot
    return np.dot(a, b)
ValueError: array is too big.
How much memory does your computer have?
What sklearn might be doing here (I haven't gone through the source, so I might be wrong) is calculating the squared euclidean distances between every pair of data points from the 17000 x K data matrix. This yields the squared euclidean distance for all data points, but unfortunately produces an N x N output matrix if you have N data points. As far as I know numpy uses double precision, which results in a 17000 x 17000 x 8 byte matrix, approximately 2.15 GB.
If your memory can't hold a matrix of that size that would cause trouble. Try creating a matrix of this size with numpy:
import numpy
mat = numpy.ones((17000, 17000))
If it succeeds I'm mistaken and the problem is something else (though certainly related to memory size and matrices sklearn is trying to allocate).
Off the top of my head, one way to resolve this might be to propagate labels in parts by subsampling the unlabeled data points (and possibly the labeled points, if you have many of them). If you are able to run the algorithm for 17000/2 data points and you have L labeled points, build your new data set by randomly drawing (17000-L)/2 of the unlabeled points from the original set and combining them with the L labeled points. Run the algorithm for each partition of the full set.
Note that this probably will reduce the performance of the label propagation algorithm, since it will have fewer data points to work with. Inconsistencies between labels in each of the sets might also cause trouble.
Use with extreme caution and only if you have some way to evaluate the performance :)
A safer approach would be to A: get more memory, or B: use a label propagation algorithm that is less memory intensive. It is certainly possible to trade memory complexity for time complexity by recalculating euclidean distances when needed rather than constructing a full all-pairs distance matrix, as scikit appears to be doing here.
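A rough sketch of that trade-off (my own illustration, not what scikit-learn does internally): compute the RBF affinities one block of rows at a time, so only a block x N slice is ever held in memory, at the cost of recomputing distances whenever a block is needed again. A label propagation implementation would still have to be written to consume the graph in such pieces.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_in_blocks(X, gamma, block=1000):
    # Yield consecutive row-blocks of the N x N RBF affinity matrix.
    n = X.shape[0]
    for start in range(0, n, block):
        yield rbf_kernel(X[start:start + block], X, gamma=gamma)   # shape (block, N)

X = np.random.rand(17000, 7)
for chunk in rbf_in_blocks(X, gamma=20.0, block=1000):
    pass   # each (1000, 17000) slice is ~136 MB instead of the full ~2.15 GB matrix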

Generalized least square on large dataset

I'd like to linearly fit data that were NOT sampled independently. I came across the generalized least squares method:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
The equation is Matlab format; X and Y are coordinates of the data points, and V is a "variance matrix".
The problem is that, due to its size (1000 rows and columns), the V matrix becomes singular and thus non-invertible. Any suggestions for how to get around this problem? Maybe a way of solving the generalized linear regression problem other than GLS? The tools that I have available and am (slightly) familiar with are Numpy/Scipy, R, and Matlab.
Instead of:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
Use
b= (X'/V *X)\X'/V*Y
That is, replace all instances of X*(Y^-1) with X/Y. Matlab will skip calculating the inverse (which is hard, and error prone) and compute the divide directly.
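Since Numpy/Scipy is also on the asker's list of tools, the equivalent idea there is to use a solver instead of an explicit inverse (a sketch, assuming V is invertible; the edit below discusses what happens when it is not):
import numpy as np

Vi_X = np.linalg.solve(V, X)                     # V^{-1} X without ever forming V^{-1}
Vi_Y = np.linalg.solve(V, Y)                     # V^{-1} Y
b = np.linalg.solve(X.T @ Vi_X, X.T @ Vi_Y)      # (X' V^{-1} X) b = X' V^{-1} Y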
Edit: Even with the best matrix manipulation, some operations are not possible (leading, for example, to errors like the one you describe).
An example that may be relevant to your problem is trying to solve a least squares problem under the constraint that multiple measurements are perfectly (100%) correlated. Except in rare, degenerate cases this cannot be accomplished, either mathematically or physically. You need some independence in the measurements to account for measurement noise or modeling errors. For example, if you have two measurements, each with a variance of 1, and perfectly correlated, then your V matrix would look like this:
V = [1 1; ...
1 1];
And you would never be able to fit to the data. (This generally means you need to reformulate your basis functions, but that's a longer essay.)
However, if you adjust your measurement variance to allow for some small amount of independence between the measurements, then it would work without a problem. For example, 95% correlated measurements would look like this
V = [1 0.95; ...
0.95 1 ];
You can use singular value decomposition as your solver. It'll do the best that can be done.
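In Numpy that could look like the following sketch: np.linalg.pinv computes an SVD-based pseudo-inverse, which discards the near-zero singular values that make V singular (how aggressively, via its rcond cutoff, is a modelling choice).
import numpy as np

Vp = np.linalg.pinv(V)                             # SVD-based pseudo-inverse of the (near-)singular V
b = np.linalg.pinv(X.T @ Vp @ X) @ (X.T @ Vp @ Y)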
I usually think about least squares another way. You can read my thoughts here:
http://www.scribd.com/doc/21983425/Least-Squares-Fit
See if that works better for you.
I don't understand how the size is an issue. If you have N (x, y) pairs, you still only have to solve for (M+1) coefficients in an Mth-order polynomial:
y = a0 + a1*x + a2*x^2 + ... + am*x^m
