Truncated SVD and PCA - python

Theoretically, the projections produced by PCA and SVD are the same if the features have mean 0. So I tried it in Python.
from sklearn import datasets
cancer = datasets.load_breast_cancer()
from sklearn.preprocessing import StandardScaler
# we can set our feature to have mean 0 by setting with_mean=False
scaler = StandardScaler(with_mean=False,with_std=False)
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
from sklearn.decomposition import PCA
pca=PCA(n_components=3,svd_solver='randomized')
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)
from sklearn.decomposition import TruncatedSVD
svdm=TruncatedSVD(n_components=3,algorithm='randomized')
svdm.fit(X_scaled)
X_svdm=svdm.transform(X_scaled)
But when I print the result, it is different. Why is this happening?
print(X_pca)
print(X_svdm)
>>>[[1160.1425737 -293.91754364 48.57839763]
[1269.12244319 15.63018184 -35.39453423]
[ 995.79388896 39.15674324 -1.70975298]
...
[ 314.50175618 47.55352518 -10.44240718]
[1124.85811531 34.12922497 -19.74208742]
[-771.52762188 -88.64310636 23.88903189]]
>>>[[2241.97427647 347.71556015 -27.53741942]
[2372.40840267 56.90166991 23.86316187]
[2101.8402797 11.94762737 30.41138602]
...
[1424.53280954 -55.0217124 -3.5794351 ]
[2231.65579282 19.99439854 3.31619182]
[ 331.69302638 -5.29733966 -39.12136435]]
What should I fix so I can get the same result from both algorithms?

From the documentation for StandardScaler:
with_mean bool, default=True If True, center the data before scaling.
This does not work (and will raise an exception) when attempted on
sparse matrices, because centering them entails building a dense
matrix which in common use cases is likely to be too large to fit in
memory.
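Setting with_mean=False therefore skips the centering instead of performing it. PCA always centers the data internally, while TruncatedSVD does not, so the two only agree when the input already has mean 0. For PCA and SVD to give the same output you need to center the data (scaling to unit variance as well is not required for the two to agree, but it is sensible here because the features are on very different scales); see also this post for details. So if you do: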
# which is also the default
scaler = StandardScaler(with_mean=True, with_std=True)
X_scaled = scaler.fit_transform(cancer.data)
pca=PCA(n_components=3,svd_solver='randomized')
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)
svdm=TruncatedSVD(n_components=3,algorithm='randomized')
svdm.fit(X_scaled)
X_svdm=svdm.transform(X_scaled)
X_pca
array([[ 9.19283683, 1.94858306, -1.12316567],
[ 2.3878018 , -3.76817175, -0.52929196],
[ 5.73389628, -1.0751738 , -0.55174751],
...,
[ 1.25617928, -1.90229671, 0.56273027],
[10.37479406, 1.67201011, -1.87702986],
[-5.4752433 , -0.6706368 , 1.49044385]])
X_svdm
array([[ 9.19283683, 1.94858307, -1.12316615],
[ 2.3878018 , -3.76817174, -0.52929266],
[ 5.73389628, -1.0751738 , -0.55174759],
...,
[ 1.25617928, -1.90229671, 0.56273052],
[10.37479406, 1.67201011, -1.87702935],
[-5.4752433 , -0.67063679, 1.49044309]])
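To see that centering is the part that matters, here is a minimal NumPy-only sketch (not the scikit-learn internals). It assumes a small synthetic dataset with a non-zero mean; the helper names pca_scores and svd_scores are made up for this illustration. Scores are compared in absolute value because each component is only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for this sketch): 4 features, non-zero mean,
# clearly different variances so the principal directions are well separated.
X = rng.normal(loc=5.0, scale=[1.0, 2.0, 3.0, 4.0], size=(200, 4))

def svd_scores(A, k):
    """Project A onto its top-k right singular vectors (no centering)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]

def pca_scores(A, k):
    """Textbook PCA: center, eigendecompose the covariance, project."""
    Ac = A - A.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Ac, rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]  # largest variance first
    return Ac @ eigvecs[:, order]

k = 2
pca = pca_scores(X, k)
svd_raw = svd_scores(X, k)
svd_centered = svd_scores(X - X.mean(axis=0), k)

# Components are only defined up to sign, so compare absolute values.
same_after_centering = np.allclose(np.abs(pca), np.abs(svd_centered))
same_without_centering = np.allclose(np.abs(pca), np.abs(svd_raw))
print(same_after_centering, same_without_centering)
```

In this sketch the centered SVD reproduces the PCA scores without any scaling to unit variance, while the SVD of the raw data does not.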

Related

Labels obtained from clustering seem visually incorrect

I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix to an affinity_matrix by using the following
delta = 0.1
np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I transform the distance_matrix into a heatmap to get a better visual of the data
import seaborn as sns
import pandas as pd
distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix in 3 clusters. Before running the actual clustering, I inspect the heatmap to forecast the clusters. Clearly #8 is an outlier and will be a cluster on its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3,
assign_labels='kmeans',
affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output is
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So #8 is part of cluster 2, which contains three other data points. I had assumed that it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10? Please advise.
When we move away from relatively simple clustering algorithms like k-means, whatever intuition we may carry along regarding an algorithm's results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes clear enough that here we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
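For readers who do want a rough picture of that phrase, here is a minimal sketch of the idea. It is not scikit-learn's actual implementation, and it uses a hypothetical 4-point affinity matrix rather than the data from the question. It builds the symmetric normalized Laplacian and uses its lowest eigenvectors as a new coordinate system for the points.

```python
import numpy as np

# Hypothetical toy affinity matrix: points 0-1 and points 2-3 form two tight groups.
A = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])

# Symmetric normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
degree = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degree)
laplacian = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# eigh returns eigenvalues in ascending order; the eigenvectors belonging to
# the smallest eigenvalues form the "projection" that is clustered afterwards.
eigvals, eigvecs = np.linalg.eigh(laplacian)
embedding = eigvecs[:, :2]

# With two groups, the sign of the second eigenvector already separates them.
toy_labels = (embedding[:, 1] > 0).astype(int)
print(toy_labels)
```

The clustering step (k-means in the question) then runs on the rows of the embedding, not on the original affinities, which is why the result can differ from what the heatmap suggests.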
Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (#3).
Still, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value lies precisely in uncovering regularities of different kinds in the data; they would not be that useful if they just replicated results from existing algorithms like k-means, would they?
The scikit-learn example Comparing different clustering algorithms on toy datasets might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets.

Unable to extract factor loadings from sklearn PCA

I want factor loadings to see which factor loads to which variables. I am referring to following link:
Factor Loadings using sklearn
Here is my code where input_data is the master_data.
X=master_data_predictors.values
#Scaling the values
X = scale(X)
#taking equal number of components as equal to number of variables
#initially we have 9 variables
pca = PCA(n_components=9)
pca.fit(X)
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
#Cumulative Variance explains
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)
[ 74.75 85.85 94.1 97.8 98.87 99.4 99.75 100. 100. ]
#Retaining 4 components as they explain 98% of variance
pca = PCA(n_components=4)
pca.fit(X)
X1=pca.fit_transform(X)
print(pca.components_)
array([[ 0.38454129, 0.37344315, 0.2640267 , 0.36079567, 0.38070046,
0.37690887, 0.32949014, 0.34213449, 0.01310333],
[ 0.00308052, 0.00762985, -0.00556496, -0.00185015, 0.00300425,
0.00169865, 0.01380971, 0.0142307 , -0.99974635],
[ 0.0136128 , 0.04651786, 0.76405944, 0.10212738, 0.04236969,
0.05690046, -0.47599931, -0.41419841, -0.01629199],
[-0.09045103, -0.27641087, 0.53709146, -0.55429524, 0.058524 ,
-0.19038107, 0.4397584 , 0.29430344, 0.00576399]])
import math
loadings = pca.components_.T * math.sqrt(pca.explained_variance_)
It gives me the following error: 'only length-1 arrays can be converted to Python scalars'
I understand the problem. I have to traverse the pca.components_ and pca.explained_variance_ arrays such as:
##just a thought
Loading=np.empty((8,4))
for i,j in (pca.components_, pca.explained_variance_):
    loading=i*math.sqrt(j)
    Loading=Loading.append(loading)
##unable to proceed further
##something wrong here
This is simply a problem of mixing modules. For numpy arrays, use np.sqrt instead of math.sqrt (which only works on single values, not arrays).
Your last line should thus read:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
This is a mistake in the original answers you linked to. I have edited them accordingly.
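A minimal sketch of the difference, using small made-up arrays as stand-ins for pca.components_ and pca.explained_variance_ (the values are assumptions chosen only so the arithmetic is easy to follow):

```python
import math
import numpy as np

# Hypothetical stand-ins: 2 components (rows) over 2 variables (columns).
components = np.array([[0.6, 0.8],
                       [-0.8, 0.6]])
explained_variance = np.array([4.0, 1.0])

# math.sqrt expects a single number, so passing a 2-element array fails.
try:
    math.sqrt(explained_variance)
    math_sqrt_failed = False
except TypeError:
    math_sqrt_failed = True

# np.sqrt works element-wise; broadcasting scales column j of components.T
# (component j) by the square root of its explained variance.
loadings = components.T * np.sqrt(explained_variance)
print(math_sqrt_failed)
print(loadings)
```

The result has one row per variable and one column per component, so no explicit loop over the components is needed.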

numpy's fast Fourier transform yields unexpected results

I am struggling with numpy's implementation of the fast Fourier transform. My signal is not periodic and therefore certainly not an ideal candidate, but the result of the FFT is far from what I was expecting: it is the same signal, simply stretched by some factor. I plotted a sine curve approximating my signal next to it, which should illustrate that I use the FFT function correctly:
import numpy as np
from matplotlib import pyplot as plt
signal = np.array([[ 0.], [ 0.1667557 ], [ 0.31103874], [ 0.44339886], [ 0.50747922],
[ 0.47848347], [ 0.64544846], [ 0.67861755], [ 0.69268326], [ 0.71581176],
[ 0.726552 ], [ 0.75032795], [ 0.77133769], [ 0.77379966], [ 0.80519187],
[ 0.78756476], [ 0.84179849], [ 0.85406538], [ 0.82852684], [ 0.87172407],
[ 0.9055542 ], [ 0.90563205], [ 0.92073452], [ 0.91178145], [ 0.8795554 ],
[ 0.89155587], [ 0.87965686], [ 0.91819571], [ 0.95774404], [ 0.95432073],
[ 0.96326252], [ 0.99480947], [ 0.94754962], [ 0.9818627 ], [ 0.9804966 ],
[ 1.], [ 0.99919711], [ 0.97202208], [ 0.99065786], [ 0.90567128],
[ 0.94300558], [ 0.89839004], [ 0.87312245], [ 0.86288378], [ 0.87301008],
[ 0.78184963], [ 0.73774451], [ 0.7450479 ], [ 0.67291666], [ 0.63518575],
[ 0.57036157], [ 0.5709147 ], [ 0.63079811], [ 0.61821523], [ 0.49526048],
[ 0.4434457 ], [ 0.29746173], [ 0.13024641], [ 0.17631683], [ 0.08590552]])
sinus = np.sin(np.linspace(0, np.pi, 60))
plt.plot(signal)
plt.plot(sinus)
The blue line is my signal, the green line is the sinus.
transformed_signal = abs(np.fft.fft(signal)[:30] / len(signal))
transformed_sinus = abs(np.fft.fft(sinus)[:30] / len(sinus))
plt.plot(transformed_signal)
plt.plot(transformed_sinus)
The blue line is transformed_signal, the green line is the transformed_sinus.
Plotting only transformed_signal illustrates the behavior described above:
Can someone explain to me what's going on here?
UPDATE
It was indeed a problem with how I called the FFT. This is the correct call and the correct result:
transformed_signal = abs(np.fft.fft(signal,axis=0)[:30] / len(signal))
NumPy's fft is by default applied along the last axis (axis=-1), i.e. along each row of a 2D array. Since your signal variable is a column vector, fft is applied over rows consisting of one element each and returns the one-point FFT of each element, which is the element itself.
Use the axis option of fft to specify that you want the FFT applied over the columns of signal, i.e.,
transformed_signal = abs(np.fft.fft(signal,axis=0)[:30] / len(signal))
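The effect is easy to reproduce on a small example. The following sketch assumes a hypothetical 8-sample cosine stored as a column vector, not the signal from the question:

```python
import numpy as np

t = np.arange(8)
# One full cosine period, stored as a column vector of shape (8, 1).
column = np.cos(2 * np.pi * t / 8).reshape(-1, 1)

# Default axis=-1: each row holds a single sample, and the one-point FFT of a
# single sample is that sample, so the "spectrum" is just the signal again.
default_fft = np.fft.fft(column)

# axis=0 transforms down the column, which is what was intended.
column_fft = np.fft.fft(column, axis=0)

print(np.allclose(default_fft.real, column))      # the signal comes back unchanged
print(np.abs(column_fft[:4, 0]).round(6))         # energy concentrated in bin 1
```

Dividing the unchanged samples by len(signal), as in the question, gives exactly the "same signal, stretched by some factor" picture.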
[EDIT] I overlooked the crucial thing stated by Stelios! Nevertheless I leave my answer here: while it does not spot the root cause of your trouble, it is still true and contains things you have to reckon with for a usable FFT.
As you say, you're transforming a non-periodic signal.
Your signal has some ripples (higher harmonics) which nicely show up in the FFT.
The sine has far fewer higher frequencies and consists largely of a DC component.
So far so good. What I don't understand is that your signal also has a DC component, which doesn't show up at all. It could be that this is a matter of scale.
The core of the matter is that while the sine and your signal look quite the same, they have totally different harmonic content.
Most notably, neither of them contains a frequency that corresponds to the half sine. This is because a 'half sine' isn't built by summing whole sines. In other words: the underlying full sine wave isn't in the spectral content of the sine taken over half its period.
By the way, having only 60 samples is a bit meager. Shannon states that your sample frequency should be at least twice the highest signal frequency, otherwise aliasing will happen (frequencies are mapped to the wrong place). In other words: your signal should visually appear smooth after sampling (unless of course it is discontinuous or has a discontinuous derivative, like a block or triangle wave). But in your case it looks like the sharp peaks are an artifact of undersampling.
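The remark about the DC component can be checked numerically. This sketch reuses the half-period sine from the question and the same normalization by the number of samples; the variable names are chosen for this illustration only.

```python
import numpy as np

# Half a sine period sampled at 60 points, as in the question.
half_sine = np.sin(np.linspace(0, np.pi, 60))

# Magnitude spectrum, normalized by the number of samples.
spectrum = np.abs(np.fft.fft(half_sine)) / len(half_sine)

dc_component = spectrum[0]               # equals the mean of the samples
strongest_bin = int(np.argmax(spectrum[:30]))
print(dc_component, strongest_bin)
```

Bin 0 is simply the mean of the samples, and since the half sine never goes negative, that mean is large compared with every other bin.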

Fast fuse of close points in a numpy-2d (vectorized)

I have a question similar to the question asked here:
simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates. The closeness in cells is specified by the user (I am talking about euclidean distance).
In my case I have a lot of points (about 1-million). This method is working, but is time consuming as it uses a double for loop.
Is there a faster way to detect and fuse close points in a numpy 2d array?
To be complete I added an example:
import numpy as np
points = np.array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 820.57417943, 84.27702407],
[ 806.71416007, 108.50307828]])
A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.
In the output of the fuse function the last two points are fused. This will look like:
#output
array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 813.64416975, 96.390051175]])
If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.KDTree, then query it for pairs of points that are closer than some threshold:
import numpy as np
from scipy.spatial import KDTree
tree = KDTree(points)
rows_to_fuse = tree.query_pairs(r=30)
print(repr(rows_to_fuse))
# {(8, 9)}
print(repr(points[list(rows_to_fuse)]))
# array([[ 820.57417943, 84.27702407],
# [ 806.71416007, 108.50307828]])
The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.
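The snippet above only finds the pairs. One possible way to go from pairs to fused points is sketched below. The function name fuse_close_points is made up for this example, and the sketch assumes that chains of close points (A near B, B near C) should all be merged into one group, which may or may not be what you want. It uses cKDTree.query_pairs with output_type="ndarray" and connected components from scipy.sparse.csgraph.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def fuse_close_points(points, radius):
    """Replace every group of points linked by pairs closer than radius with its mean."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # (m, 2) integer array of index pairs (i, j), i < j, closer than radius.
    pairs = cKDTree(points).query_pairs(r=radius, output_type="ndarray")
    # Treat the pairs as edges of a graph and label its connected components.
    graph = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
    n_groups, labels = connected_components(graph, directed=False)
    # Average the coordinates within each component.
    sums = np.zeros((n_groups, points.shape[1]))
    np.add.at(sums, labels, points)
    counts = np.bincount(labels, minlength=n_groups)
    return sums / counts[:, None]

points = np.array([[382.49056159, 640.1731949], [496.44669161, 655.8583119],
                   [1255.64762859, 672.99699399], [1070.16520917, 688.33538171],
                   [318.89390168, 718.05989421], [259.7106383, 822.2],
                   [141.52574427, 28.68594436], [1061.13573287, 28.7094536],
                   [820.57417943, 84.27702407], [806.71416007, 108.50307828]])
fused = fuse_close_points(points, radius=30)
print(fused.shape)
```

Isolated points form groups of size one and pass through unchanged, so no Python-level loop over the points is needed.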
You can use scipy's distance functions such as pdist in order to quickly find which points should be merged:
import numpy as np
from scipy.spatial.distance import pdist, squareform
a = points  # the (n, 2) array of coordinates from the question
d = squareform(pdist(a))
d = np.ma.array(d, mask=np.isclose(d, 0))
a[d.min(axis=1) < 30]
#array([[ 820.57417943, 84.27702407],
# [ 806.71416007, 108.50307828]])
NOTE
For large samples this method can cause memory errors since it is storing a full matrix containing the relative distances.
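As a rough back-of-the-envelope check for the 1-million-point case in the question, assuming 8-byte float64 distances (pdist returns the condensed form, squareform expands it to the full matrix):

```python
n_points = 1_000_000
bytes_per_float64 = 8

# Full square matrix as produced by squareform: n * n entries.
full_matrix_bytes = n_points * n_points * bytes_per_float64
# Condensed form as returned by pdist: n * (n - 1) / 2 entries.
condensed_bytes = n_points * (n_points - 1) // 2 * bytes_per_float64

full_matrix_terabytes = full_matrix_bytes / 1e12
condensed_terabytes = condensed_bytes / 1e12
print(full_matrix_terabytes, condensed_terabytes)
```

Both figures are in the terabyte range, so for inputs of that size the tree-based approach is the more realistic option.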

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the python scikit version of Random Forest. Does anyone know if there is an equivalent calculation for the python version?
We don't implement proximity matrix in Scikit-Learn (yet).
However, this could be done by relying on the apply function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest (through forest.estimators_) and count the number of times they fall in the same leaf, i.e., the number of times apply gives the same node id for both samples in the pair.
Hope this helps.
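The counting described above can be sketched on a tiny hand-made example. The leaf_ids array below is hypothetical and stands in for the output of apply on a forest of two trees and three samples (one row per sample, one column per tree).

```python
import numpy as np

# Hypothetical leaf indices: 3 samples (rows) x 2 trees (columns).
leaf_ids = np.array([[1, 4],
                     [1, 5],
                     [2, 4]])

# same_leaf[i, j, t] is True when samples i and j land in the same leaf of tree t.
same_leaf = leaf_ids[:, None, :] == leaf_ids[None, :, :]

# Fraction of trees in which each pair of samples shares a leaf.
proximity = same_leaf.mean(axis=2)
print(proximity)
```

This broadcasted comparison builds an array of n_samples x n_samples x n_trees booleans, so for larger data it is safer to accumulate tree by tree.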
Based on Gilles Louppe's answer I have written a function. I don't know if it is efficient, but it works. Best regards.
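import numpy as np

def proximityMatrix(model, X, normalize=True):
    # leaf index of every sample in every tree: shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]
    # pairs that share a leaf in the first tree
    a = terminals[:,0]
    proxMat = 1*np.equal.outer(a, a)
    # add the counts from the remaining trees
    for i in range(1, nTrees):
        a = terminals[:,i]
        proxMat += 1*np.equal.outer(a, a)
    # turn counts into the fraction of trees in which a pair shares a leaf
    if normalize:
        proxMat = proxMat / nTrees
    return proxMat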
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
train = load_breast_cancer()
model = RandomForestClassifier(n_estimators=500, max_features=2, min_samples_leaf=40)
model.fit(train.data, train.target)
proximityMatrix(model, train.data, normalize=True)
## array([[ 1. , 0.414, 0.77 , ..., 0.146, 0.79 , 0.002],
## [ 0.414, 1. , 0.362, ..., 0.334, 0.296, 0.008],
## [ 0.77 , 0.362, 1. , ..., 0.218, 0.856, 0. ],
## ...,
## [ 0.146, 0.334, 0.218, ..., 1. , 0.21 , 0.028],
## [ 0.79 , 0.296, 0.856, ..., 0.21 , 1. , 0. ],
## [ 0.002, 0.008, 0. , ..., 0.028, 0. , 1. ]])
There is nothing currently implemented for this in python. I took a first try at it here. It would be great if somebody would be interested in adding these methods to scikit.
