AgglomerativeClustering on a correlation matrix - python

I have a correlation matrix of typical structure that is of size 288x288 that is defined by:
from sklearn.cluster import AgglomerativeClustering
df = read_returns()
correl_matrix = df.corr()
where read_returns gives me a dataframe with a date index, and columns of the returns of assets.
Now - I want to cluster these correlations to reduce the population size.
By doing some reading and experimenting I discovered AgglomerativeClustering - and it appears at first pass to be an appropriate solution to my problem.
I define a distance metric as ((.5*(1-correl_matrix))**.5) and have:
cluster = AgglomerativeClustering(n_clusters=40, linkage='average')
cluster.fit(((.5*(1-correl_matrix))**.5).values)
label_groups = cluster.labels_
To observe some of the data and cross check my work I pick out cluster 1 and observe the pairwise correlations and find the min correlation between two items with that group in my dataset to find :
single_cluster = []
for i in range(0,correl_matrix.shape[0]):
if label_groups[i]==1:
single_cluster.append(correl_matrix.index[i])
min_correl = 1.0
for x in single_cluster:
for y in single_cluster:
if x<>y:
if correl_matrix[x][y]<min_correl:
min_correl = correl_matrix[x][y]
print min_correl
and get a min pairwise correlation of .20
To me this seems quite low - but "low based off what?" is a fair question to which I have no answer.
I would like to anticipate/enforce each pairwise correlation of a cluster to be >=.7 or something like this.
Is this possible in AgglomerativeClustering?
Am I accidentally going down the wrong path?

Hierarchical clustering supports different "linkage" strategies.
single-link: this connects points on the minimum distance to the others in the cluster
complete-link: this connects based on the maximum distance to the cluster
...
If you want a high minimum correlation = small maximum distance, this calls for complete linkage.
You may want to treat negative correlations as "good", too.
i.e. use dist = 1 - abs(corr).
Make sure to use ghe dendrogram. If you have outliers in your data, you want to cut into (n_clusters+n_outliers) partitions.

Related

Pandas- find max not counting outliers

I have a dataframe where each column represents a geographic point, and each row represents a minute in a day. The value of each cell is the flow of water at that point in CFS. Below is a graph of one of these time-flow series.
Basically, I need to calculate the absolute value of the max flow at each of these locations during the day, which in this case would be that hump of 187 cfs. However, there are instabilities, so DF.abs().max() returns 1197 cfs. I need to somehow remove the outliers in the calculation. As you can see, there is no pattern to the outliers, but if you look at the graph, no 2 consecutive points in time should have more than an x% change in flow. I should mention that there are 15K of these points, so the fastest solution is the best.
Anyone know how can I accomplish this in python, or at least know the statistical word for what I want to do? Thanks!
In my opinion, the statistical word your are looking for is smoothing or denoising data.
Here is my try:
# Importing packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
# Creating a curve with a local maximum to simulate "ideal data"
x = np.arange(start=-1, stop=1, step=0.001)
y_ideal = 10**-(x**2)
# Adding some randomly distributed outliers to simulate "real data"
y_real = y_ideal.copy()
np.random.seed(0)
for i in range(50):
x_index = np.random.choice(len(x))
y_real[x_index] = np.random.randint(-3, 5)
# Denoising with Savitzky-Golay (window size = 501, polynomial order = 3)
y_denoised = savgol_filter(y_real, window_length=501, polyorder=3)
# You should optimize these values to fit your needs
# Getting the index of the maximum value from the "denoised data"
max_index = np.where(y_denoised == np.amax(y_denoised))[0]
# Recovering the maximum value and reporting
max_value = y_real[max_index][0]
print(f'The maximum value is around {max_value:.5f}')
Please, keep in mind that:
This solution is approximate.
You should find the optimum parameters of the window_length and polyorder parameters plugged to the savgol_filter() function.
If the region where your maximum is located is noisy, you can use max_value = y_denoised [max_index][0] instead of max_value = y_real[max_index][0].
Note: This solution is based in this other Stack Overflow answer

Evaluating vector distance measures

I am working with vectors of word frequencies and trying out some of the different distance measures available in Scikit Learns Pairwise Distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose a distance metric that maximizes the pairwise distances by running pairwise differences over the same dataset with the distance metrics available in Scipy (e.g. Euclidean, Cityblock, etc.) and for each metric
convert distances computed for the dataset to zscores to normalize across metrics
get the range of these zscores, i.e. the spread of the distances
use the distance metric that gives me the widest range of distances as it apparently gives me the maximum spread over my dataset and the most variance to work with. (Cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal, but they don't apply 100 % here...)
Any help is much appreciated!
My code:
matrix=np.random.uniform(0, .1, size=(10,300)) #test data set
scipy_distances=['euclidean', 'minkowski', ...] #these are the distance metrics
for d in scipy_distances: #iterate over distances
distmatrix=sklearn.metrics.pairwise.pairwise_distances(matrix, metric=d)
distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
diststats=basicstatsmaker(distzscores)
range=np.ptp(distzscores, axis=0)
print "range of metric", d, np.ptp(range)
In general - this is just a heuristic, which might, or not - work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out
class Dummy_dist:
def __init__(self):
self.cheat = True
def __call__(self, x, y):
if self.cheat:
self.cheat = False
return 1e60
else:
return 0
dummy_dist = Dummy_dist()
This will give you huuuuge spread (even with z-score normalization). Of course this is a cheating example as this is non determinsitic, but I wanted to show the basic counterexample, and of course given your data one can construct a deterministic analogon.
So what you should do? Your metric should be treated as hyperparameter of your process. You should not divide process of generating your clustering/classification into two separate phases: choosing a distance and then learning something; but you should do this jointly, consider your clustering/classification + distance pairs as a single model, thus instead of working with k-means, you will work with k-means+euclidean, k-means+minkowsky and so on. This is the only statistically supported approach. You cannot construct a method of assessing "general goodness" of the metric, as there is no such object, metric quality can be only assessed in a particular task, which involves fixing every other element (such as a clustering/classification method, particular dataset etc.). Once you perform such wide, exhaustive evaluation, check many such pairs, on many datasets, you might claim that given metric performes best in such range of tasks.

Python: how to compare the similarity between clustering using k-means algorithm?

I have two observations of the same event. Let say X and Y.
I suppose to have nc clusters. I am using sklearn to make the clustering.
x = KMeans(n_clusters=nc).fit_predict(X)
y = KMeans(n_clusters=nc).fit_predict(Y)
is there a measure that allow me to compare x and y: i.e. this measure will be 1 if the clusters x and y are the same.
Just extract the cluster centers of your kmeans-objects (see the docs):
x_centers = x.cluster_centers_
y_centers = y.cluster_centers_
The you have to decide which metric you are using to compare these. Keep in mind that the centers are floating-points, the clustering-process is a heuristic and the clustering-process is a random-algorithm. This means, you will get something which interprets as not exactly the same with a high probability, even for cluster-objects trained on the same data.
This link discusses some approaches and the problems.
The Rand Index and its adjusted version do this exactly. Two cluster assignments that match (even if the labels themselves, which are treated as arbitrary, are different), get a score of 1. A value of 0 means they don't agree at all. The Adjusted Rand Index uses its baseline as random assignment of points to clusters.

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23 and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference.
I'm doing something simple, similar to the following:
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters = 2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
data elements looks something like [22, 418, 192] where the first element is the hour.
Any ideas?
Even though #elyase answer is accepted, I think it is not the correct approach.
Yes, to use such distance you have to refine your distance measure and so - use different library. But what is more important - concept of mean used in k-means won't suit the cyclic dimension. Lets consider following example:
#current cluster X,, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
computing simple arithmetic mean will place the centoids in Xc=12.5,Yc=12.5, which from the point of view of cyclic meausre is incorect, it should be Xc=0.5,Yc=12.5. As you can see, asignment based on the cyclic distance measure is not "compatible" with simple mean operation, and leads to bizzare results.
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k--means + distance measure result in degenerated super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2},{y1,y2}
Solving this problem requires checking one if (whether it is better to measure "simple average" or by representing one of the points as x'=x-24). Unfortunately given n points it makes 2^n possibilities.
This seems as a use case of the kernelized k-means, where you are actually clustering in the abstract feature space (in your case - a "tube" rolled around the time dimension) induced by kernel ("similarity measure", being the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyds algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by #elyase k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi) which may be okay for SSQ.
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
The easiest approach, to me, is to adapt the K-means algorithm wraparound dimension via computing the "circular mean" for the dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5

dbscan - setting limit on maximum cluster span

By my understanding of DBSCAN, it's possible for you to specify an epsilon of, say, 100 meters and — because DBSCAN takes into account density-reachability and not direct density-reachability when finding clusters — end up with a cluster in which the maximum distance between any two points is > 100 meters. In a more extreme possibility, it seems possible that you could set epsilon of 100 meters and end up with a cluster of 1 kilometer:
see [2][6] in this array of images from scikit learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and am misunderstanding DBSCAN if that's what's happening here.)
Is there an algorithm that is density-based like DBSCAN but takes into account some kind of thresholding for the maximum distance between any two points in a cluster?
DBSCAN indeed does not impose a total size constraint on the cluster.
The epsilon value is best interpreted as the size of the gap separating two clusters (that may at most contain minpts-1 objects).
I believe, you are in fact not even looking for clustering: clustering is the task of discovering structure in data. The structure can be simpler (such as k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering and k-means).
You might be looking for vector quantization - reducing a data set to a smaller set of representatives - or set cover - finding the optimal cover for a given set - instead.
However, I also have the impression that you aren't really sure on what you need and why.
A stength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.
Direct density reachability however, doesn't define a useful (partitioning) structure. It just does not partition the data into disjoint partitions.
If you don't need this kind of strong structure (i.e. you don't do clustering as in "structure discovery", but you just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way, but will highly depend on the ordering of your data. If you then preprocess it appropriately, it can still be useful. Unless you are in a distributed setting, canopy preclustering however is at least as expensive than a full DBSCAN run. Due to the loose requirements (in particular, "clusters" may overlap, and objects are expected to belong to multiple "clusters"), it is easier to parallelize.
Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters should all have the desired maximum distance inbetween of any two objects. The only problem is that hierarchical clustering usually is O(n^3), i.e. it doesn't scale to large data sets. DBSCAN runs in O(n log n) in good implementations (with index support).
I had the same problem and ended up solving it by using DBSCAN in combination with KMeans clustering: First I use DBSCAN to identify high density clusters and remove outliers, then I take any cluster larger than 250 Miles (in my case) and break it apart. Here's the code:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.3, min_samples=100).fit(load_geocodes[['lat', 'long']])
load_geocodes.loc[:,'cluster'] = clustering.labels_
import mpu
def calculate_cluster_size(lat, long):
left_top = (max(lat), min(long))
right_bottom = (min(lat), max(long))
distance = mpu.haversine_distance(left_top, right_bottom)*0.621371
return distance
for c, df in load_geocodes.groupby('cluster'):
if c == -1:
continue # don't do this for outliers
distance = calculate_cluster_size(df['lat'], df['long'])
print(distance)
if distance > 250:
# break clusters into more clusters until the maximum size of a cluster is less than 250 Miles
max_distance = distance
i = 2
while max_distance > 250:
kmeans = KMeans(n_clusters=i, random_state=0).fit(df[['lat', 'long']])
df.loc[:, 'cl_temp'] = kmeans.labels_
max_temp_cl_size = 0
for temp_cl, temp_cl_df in df.groupby('cl_temp'):
temp_cl_size = calculate_cluster_size(temp_cl_df['lat'], temp_cl_df['long'])
if temp_cl_size > max_temp_cl_size:
max_temp_cl_size = temp_cl_size
i += 1
max_distance = max_temp_cl_size
load_geocodes.loc[df.index,'subcluster'] = kmeans.labels_

Categories