I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix to an affinity_matrix by using the following
delta = 0.1
np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I transform the distance_matrix into a heatmap to get a better visual of the data
import seaborn as sns
distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10))]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix in 3 clusters. Before running the actual clustering, I inspect the heatmap to forecast the clusters. Clearly #8 is an outlier and will be a cluster on its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3,
assign_labels='kmeans',
affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The outputs yields
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So, #8 is part of cluster 2 which consists of three other data points. Initially, I would assume that it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10. Please advice.
When we are moving away from relatively simple clustering algorithms, say like k-means, whatever intuition we may carry along regarding algorithms results and expected behaviors breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes clear enough that here we should not expect results similar with more intuitive, distance-based clustering algorithms like k-means.
Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spherical one; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (#3).
Nevertheless, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value is arguably exactly that they can uncover regularities of different kinds in the data - arguably they would not be that useful if they just replicated results from existing algorithms like k-means, would they?
The scikit-learn vignette Comparing different clustering algorithms on toy datasets might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets; here is the summary finding:
I am using python 3.5.2 and sklearn 0.19.1
I have a muticlass problem (3 classes) and I am using RandomForestClassifier.
For one of the cass I have
19 unique predict_proba values :
{0.0,
0.6666666666666666,
0.6736189855024448,
0.6773290780865037,
0.7150826826468751,
0.7175236925236925,
0.7775446850962057,
0.8245648135911781,
0.8631035080004867,
0.8720525244880196,
0.8739595855873906,
0.8787152225755167,
0.9289844333343654,
0.954439314892936,
0.9606503912532541,
0.9771342285323964,
0.9883370916703461,
0.9957401423931763,
1.0}
I am computing roc_curve and I am expecting the same number of point for the roc curve as I have unique value of probablitity. This is only true for 2 of the 3 classes!
When I looked at the thresholds returned that the roc_curve function:
fpr, tpr, proba = roc_curve(....):
I see the same exact value as the one in the list of probability + one new value 2.0 !
[2.,
1.,
0.99574014,
0.98833709,
0.97713423,
0.96065039,
0.95443931,
0.92898443,
0.87871522,
0.87395959,
0.87205252,
0.86310351,
0.82456481,
0.77754469,
0.71752369,
0.71508268,
0.67732908,
0.67361899,
0.66666667,
0. ]
Why is a new thresholds 2.0 is returned ? I didn't see anything related to that in the documentation.
Any idea ? I am missing something
roc_curve is written so that ROC point corresponding to the highest threshold (fpr[0], tpr[0]) is always (0, 0). If this is not the case, a new threshold is created with an arbitrary value of max(y_score)+1. The relevant code from the source:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. `thresholds[0]` represents no instances being predicted
and is arbitrarily set to `max(y_score) + 1`.
and
if tps.size == 0 or fps[0] != 0:
# Add an extra threshold position if necessary
tps = np.r_[0, tps]
fps = np.r_[0, fps]
thresholds = np.r_[thresholds[0] + 1, thresholds]
So it seems in the case you showed you have data given a score of 1.0 that is incorrectly classified.
I did find a way to calculate the center coordinate of a cluster of points. However, my method is quite slow when the number of initial coordinates is increased (I have about 100 000 coordinates).
The bottleneck is the for-loop in the code. I tried to remove it by using np.apply_along_axis, but discovered that this is nothing more than a hidden python-loop.
Is it possible to detect and average out various sized clusters of too close points in a vectorized way?
import numpy as np
from scipy.spatial import cKDTree
np.random.seed(7)
max_distance=1
#Create random points
points = np.array([[1,1],[1,2],[2,1],[3,3],[3,4],[5,5],[8,8],[10,10],[8,6],[6,5]])
#Create trees and detect the points and neighbours which needs to be fused
tree = cKDTree(points)
rows_to_fuse = np.array(list(tree.query_pairs(r=max_distance))).astype('uint64')
#Split the points and neighbours into two groups
points_to_fuse = points[rows_to_fuse[:,0], :2]
neighbours = points[rows_to_fuse[:,1], :2]
#get unique points_to_fuse
nonduplicate_points = np.ascontiguousarray(points_to_fuse)
unique_points = np.unique(nonduplicate_points.view([('', nonduplicate_points.dtype)]\
*nonduplicate_points.shape[1]))
unique_points = unique_points.view(nonduplicate_points.dtype).reshape(\
(unique_points.shape[0],\
nonduplicate_points.shape[1]))
#Empty array to store fused points
fused_points = np.empty((len(unique_points), 2))
####BOTTLENECK LOOP####
for i, point in enumerate(unique_points):
#Detect all locations where a unique point occurs
locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
#Select all neighbours on these locations take the average
fused_points[i,:] = (np.average(np.hstack((point[0],neighbours[locs,0][0]))),np.average(np.hstack((point[1],neighbours[locs,1][0]))))
#Get original points that didn't need to be fused
points_without_fuse = np.delete(points, np.unique(rows_to_fuse.reshape((1, -1))), axis=0)
#Stack result
points = np.row_stack((points_without_fuse, fused_points))
Expected output
>>> points
array([[ 8. , 8. ],
[ 10. , 10. ],
[ 8. , 6. ],
[ 1.33333333, 1.33333333],
[ 3. , 3.5 ],
[ 5.5 , 5. ]])
EDIT 1: Example of 1 loop with desired result
Step 1: Create variables for the loop
#outside loop
points_to_fuse = np.array([[100,100],[101,101],[100,100]])
neighbours = np.array([[103,105],[109,701],[99,100]])
unique_points = np.array([[100,100],[101,101]])
#inside loop
point = np.array([100,100])
i = 0
Step 2: Detect all locations where a unique point occurs in the points_to_fuse array
locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
>>> (array([0, 2], dtype=int64),)
Step 3: Create an array of the point and the neighbouring points at these locations and calculate the average
array_of_points = np.column_stack((np.hstack((point[0],neighbours[locs,0][0])),np.hstack((point[1],neighbours[locs,1][0]))))
>>> array([[100, 100],
[103, 105],
[ 99, 100]])
fused_points[i, :] = np.average(array_of_points, 0)
>>> array([ 100.66666667, 101.66666667])
Loop output after a complete run:
>>> print(fused_points)
>>> array([[ 100.66666667, 101.66666667],
[ 105. , 401. ]])
The bottleneck is not the loop which is necessary since all the neighborhoods have not the same size.
The pitfall is the points_to_fuse[:,0] == point[0] in the loop which trig a quadratic complexity. you can avoid that by sorting the points, by index.
An example to do that, even it doesn't solve the whole problem (after the generation of rows_to_fuse):
sorter=np.lexsort(rows_to_fuse.T)
sorted_points=rows_to_fuse[sorter]
uniques,counts=np.unique(sorted_points[:,1],return_counts=True)
indices=counts.cumsum()
neighbourhood=np.split(sorted_points,indices)[:-1]
means=[(points[ne[:,0]].sum(axis=0)+points[ne[0,1]])/(len(ne)+1) \
for ne in neighbourhood] # a simple python loop.
# + manage unfused points.
An other improvement is to compute means with numba if you want to speed the code, but the complexity is now ~ optimal I think.
I want factor loadings to see which factor loads to which variables. I am referring to following link:
Factor Loadings using sklearn
Here is my code where input_data is the master_data.
X=master_data_predictors.values
#Scaling the values
X = scale(X)
#taking equal number of components as equal to number of variables
#intially we have 9 variables
pca = PCA(n_components=9)
pca.fit(X)
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
#Cumulative Variance explains
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print var1
[ 74.75 85.85 94.1 97.8 98.87 99.4 99.75 100. 100. ]
#Retaining 4 components as they explain 98% of variance
pca = PCA(n_components=4)
pca.fit(X)
X1=pca.fit_transform(X)
print pca.components_
array([[ 0.38454129, 0.37344315, 0.2640267 , 0.36079567, 0.38070046,
0.37690887, 0.32949014, 0.34213449, 0.01310333],
[ 0.00308052, 0.00762985, -0.00556496, -0.00185015, 0.00300425,
0.00169865, 0.01380971, 0.0142307 , -0.99974635],
[ 0.0136128 , 0.04651786, 0.76405944, 0.10212738, 0.04236969,
0.05690046, -0.47599931, -0.41419841, -0.01629199],
[-0.09045103, -0.27641087, 0.53709146, -0.55429524, 0.058524 ,
-0.19038107, 0.4397584 , 0.29430344, 0.00576399]])
import math
loadings = pca.components_.T * math.sqrt(pca.explained_variance_)
It gives me following error 'only length-1 arrays can be converted to Python scalars
I understand the problem. I have to traverse the pca.components_ and pca.explained_variance_ arrays such as:
##just a thought
Loading=np.empty((8,4))
for i,j in (pca.components_, pca.explained_variance_):
loading=i*math.sqrt(j)
Loading=Loading.append(loading)
##unable to proceed further
##something wrong here
This is simply a problem of mixing modules. For numpy arrays, use np.sqrt instead of math.sqrt (which only works on single values, not arrays).
Your last line should thus read:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
This is a mistake in the original answers you linked to. I have edited them accordingly.