Hierarchical clustering for categorical data in python - python

I have categorical attributes that contain string values. Three of them contain the day name (Mon to Sun), the month name and a time interval (morning, afternoon, evening); two others hold district and street names. These are followed by gender, role, comments (a predefined fixed field with values such as good, bad, strongly agree, etc.), surname and first name. My intention is to cluster them and visualize the result. I applied k-means clustering using WEKA but it did not work.
Now I wish to apply hierarchical clustering on it. I found this code:
import numpy as np
import scipy.cluster.hierarchy as sch
X = np.random.randn(100, 2) # 100 2-dimensional observations
d = sch.distance.pdist(X) # vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')
However, X in above code is numeric; I have categorical data.
Is there some way that I can use a numarray of categorical data to find the distance?
In other words can I use categorical data of string values to find the distance?
I would then use that distance in sch.linkage(d, method='complete')

I think we've identified the problem, then: you leave the X values as they are, string data. You can pass those to pdist, but you also have to supply a 2-arity function (2 inputs, numeric output) for the distance metric.
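The simplest one would be that equal classifications have 0 distance; everything else is 1. Because pdist passes two whole rows of X (a 2-D array) to the metric, the function has to return a single number, for example the count of attributes that differ:
d = sch.distance.pdist(X, lambda u, v: (u != v).sum())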
If you have other class discrimination in mind, just code logic to return the desired distance, wrap it in a function, and then pass the function name to pdist. We can't help with that, because you've told us nothing about your classes or the model semantics.
Does that get you moving?
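As a minimal sketch of the custom-metric approach, the snippet below passes a small array of strings to pdist together with a function that returns the fraction of attributes on which two rows disagree. The records and the name mismatch_fraction are made up for illustration, and the unweighted mismatch fraction is only one possible choice of distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical records: (day name, time interval, gender)
X = np.array([
    ["mon", "morning", "f"],
    ["mon", "evening", "f"],
    ["sun", "evening", "m"],
])

def mismatch_fraction(u, v):
    """Fraction of attributes on which two records disagree."""
    return np.mean(u != v)

# Condensed distance vector, one entry per pair of rows
d = pdist(X, metric=mismatch_fraction)

# Square form, only for easier inspection
print(squareform(d))
```

The condensed vector d can then be handed to sch.linkage(d, method='complete') as in the question.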

Another possibility is the use of the Hamming distance.
Y = pdist(X, 'hamming')
Computes the normalized Hamming distance, or the proportion of those
vector elements between two n-vectors u and v which disagree. To save
memory, the matrix X can be of type boolean.
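This could be what you are looking for, for example with a gender attribute coded as "m"/"f". Note that the built-in 'hamming' metric converts X to floating point, so string categories have to be mapped to numeric codes first.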
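A possible way to combine the Hamming distance with the linkage call from the question is sketched below. It assumes the strings are first mapped to integer codes per column (here with np.unique); the toy records and the cut height of 0.5 are illustrative assumptions, not values from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical records: (day name, time interval, gender)
records = np.array([
    ["mon", "morning", "f"],
    ["mon", "morning", "m"],
    ["sun", "evening", "m"],
    ["sun", "evening", "m"],
])

# Map the strings of each column to integer codes
codes = np.column_stack([
    np.unique(col, return_inverse=True)[1].ravel() for col in records.T
])

# Proportion of attributes that differ between each pair of records
d = pdist(codes, metric="hamming")

# Complete-linkage hierarchy, cut at an (assumed) distance of 0.5
Z = linkage(d, method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

The integer codes carry no ordering information here, because the Hamming metric only checks whether two codes are equal.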


Manually find the distance between centroid and labelled data points

I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
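for i in range(len(X)):
    # (reconstructed) distance from X[i] to the centroid of its assigned cluster
    distances.append(np.linalg.norm(X[i] - c[y[i, 0]]))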
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword; without it, numpy computes the norm of the entire flattened matrix, not just along the rows (axis 1). I've updated it now, and it should return an array with one distance per point. Also, einsum was suggested by @Kris for their specific application.
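To check that the vectorised form really matches a loop, here is a small self-contained sketch. It assumes y has shape (n, 1) as in the question; flattening it with ravel() lets plain fancy indexing pick one centroid per point, which is an alternative to the einsum call and gives the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3))                  # 10 points in 3D
clusters = 3
y = rng.integers(0, clusters, (10, 1))   # one label per point, shape (10, 1)
c = rng.random((clusters, 3))            # one centroid per cluster

# Reference: explicit loop over the points
loop = np.array([np.linalg.norm(X[i] - c[y[i, 0]]) for i in range(len(X))])

# Vectorised: c[y.ravel()] has shape (10, 3), one centroid row per point
vectorised = np.linalg.norm(X - c[y.ravel()], axis=1)

# The einsum variant from the answer collapses the extra axis of c[y]
via_einsum = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

print(np.allclose(loop, vectorised), np.allclose(loop, via_einsum))
```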

Large set of x,y coordinates. Efficient way to find any within certain distance of each other?

I have a large set of data points in a pandas dataframe, with columns containing x/y coordinates for these points. I would like to identify all points that are within a certain distance "d" of any other point in the dataframe.
I first tried to do this using 'for' loops, checking the distance between the first point and all other points, then the distance between the second point and all others, etc. Clearly this is not very efficient for a large data set.
Recent searching online suggests that the best way might be to use scipy.spatial.ckdtree, but I can't figure out how to implement this. Most examples I see check against a single x/y location, whereas I want to check all vs all. Is anyone able to provide suggestions or examples, starting from an array of x/y coordinates taken from my dataframe as follows:
points = df_sub.loc[:,['FRONT_X','FRONT_Y']].values
That looks something like this:
[[19091199.587 -544406.722]
[19091161.475 -544452.426]
[19091163.893 -544464.899]
[19089150.04 -544747.196]
[19089774.213 -544729.005]
[19089690.516 -545165.489]]
The ideal output would be the ID's of all pairs of points that are within a cutoff distance "d" of each other.
scipy.spatial has many good functions for handling distance computations.
Let's create an array pos of 1000 (x, y) points, similar to what you have in your dataframe.
import numpy as np
from scipy.spatial import distance_matrix
num = 1000
pos = np.random.uniform(size=(num, 2))
# Distance threshold
d = 0.25
From here we shall use the distance_matrix function to calculate pairwise distances. Then we use np.argwhere to find the indices of all the pairwise distances less than some threshold d.
pair_dist = distance_matrix(pos, pos)
ids = np.argwhere(pair_dist < d)
ids now contains the "ID's of all pairs of points that are within a cutoff distance "d" of each other", as you desired.
Of course, this method has the shortcoming that we always compute the distance between each point and itself (returning a distance of 0), which will always be less than our threshold d. However, we can exclude self-comparisons from our ids with the following fudge:
pair_dist[np.r_[:num], np.r_[:num]] = np.inf
ids = np.argwhere(pair_dist < d)
Another shortcoming is that we compute the full symmetric pairwise distance matrix when we only really need the upper or lower triangular pairwise distance matrix. However, unless this computation really is a bottleneck in your code, I wouldn't worry too much about this.
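If the full n x n matrix becomes too large, the k-d tree mentioned in the question is an alternative. The sketch below uses cKDTree.query_pairs, which returns each pair (i, j) with i < j whose distance is at most r, so self-pairs and mirrored duplicates do not appear. The random points and the threshold of 0.05 are stand-ins for the real data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pos = rng.uniform(size=(1000, 2))   # stand-in for the dataframe coordinates
d = 0.05                            # assumed distance threshold

tree = cKDTree(pos)

# Array of shape (n_pairs, 2); each row holds the indices of one close pair
pairs = tree.query_pairs(r=d, output_type="ndarray")
print(len(pairs))
```

The row indices refer to positions in pos, so they can be mapped back to dataframe IDs with the dataframe's index.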

How can I use any classifier to classify my data with each data point consisting of a set of floating values?

I have data in this format. This is the value in the first column of the first row:
[0.266465 0.9203907 1.007363 ... 0. 0.09623989 0.39632136]
And this is the value in the second column of the first row:
[0.9042176 1.135085 1.2988662 ... 0. 0.13614458 0.28000486]
I have 2200 such rows and I want to train a classifier to identify whether the two sets of values in a row are similar or not.
P.S.- These are extracted feature vector values.
If you assume the relation between the two extracted feature vectors to be linear, you could try using the Pearson correlation:
import numpy as np
from scipy.stats import pearsonr
list1 = np.random.random(100)
list2 = np.random.random(100)
pearsonr(list1, list2)
An example output is:
(0.0746901299996632, 0.4601843257734832)
The first value is the correlation coefficient (about 0.07) and the second is its p-value (with p > 0.05 you do not reject the null hypothesis of no correlation at significance level alpha = 5%). If the vectors are correlated, they are in a way similar. More about the method here.
Also, I came across Normalized Cross-Correlation, which is used for identifying similarity between pictures (I am not an expert, so rather check this).
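Returning to the Pearson approach, one way to apply it to every row is sketched below. The arrays first and second are hypothetical stand-ins for the two feature-vector columns, and the cutoff of 0.9 is an assumption that would have to be tuned on pairs whose similarity is known.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical stand-in: 5 rows, two 100-dimensional feature vectors per row
first = rng.random((5, 100))
second = first + 0.1 * rng.random((5, 100))   # deliberately similar to `first`

# One correlation coefficient per row
scores = np.array([pearsonr(a, b)[0] for a, b in zip(first, second)])

threshold = 0.9                # assumed cutoff, not a recommended value
similar = scores > threshold   # boolean "similar / not similar" per row
print(scores, similar)
```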

Evaluating vector distance measures

I am working with vectors of word frequencies and trying out some of the different distance measures available in scikit-learn's pairwise distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose a distance metric that maximizes the pairwise distances by computing pairwise distances over the same dataset with the distance metrics available in SciPy (e.g. Euclidean, Cityblock, etc.) and, for each metric:
1. convert the distances computed for the dataset to z-scores to normalize across metrics
2. get the range of these z-scores, i.e. the spread of the distances
Then I would use the distance metric that gives me the widest range of distances, as it apparently gives me the maximum spread over my dataset and the most variance to work with. (Cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal), but they don't apply 100% here...
Any help is much appreciated!
My code:
import numpy as np
import scipy.stats
from sklearn.metrics import pairwise_distances
matrix = np.random.uniform(0, .1, size=(10, 300))  # test data set
scipy_distances = ['euclidean', 'minkowski']  # ... and further distance metrics
for d in scipy_distances:  # iterate over distances
    distmatrix = pairwise_distances(matrix, metric=d)
    distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
    zrange = np.ptp(distzscores, axis=0)  # named zrange to avoid shadowing the built-in range
    print("range of metric", d, np.ptp(zrange))
In general, this is just a heuristic, which might or might not work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out
class Dummy_dist:
    def __init__(self):
        self.cheat = True
    def __call__(self, x, y):
        if self.cheat:
            self.cheat = False
            return 1e60
        return 0
dummy_dist = Dummy_dist()
This will give you a huge spread (even with z-score normalization). Of course this is a cheating example, as it is non-deterministic, but I wanted to show the basic counterexample, and given your data one can of course construct a deterministic analogue.
So what should you do? Your metric should be treated as a hyperparameter of your process. You should not divide the process of generating your clustering/classification into two separate phases, choosing a distance and then learning something; you should do this jointly. Consider your clustering/classification + distance pairs as a single model: instead of working with k-means, you will work with k-means+euclidean, k-means+minkowski and so on. This is the only statistically supported approach. You cannot construct a method of assessing the "general goodness" of a metric, as there is no such object; metric quality can only be assessed in a particular task, which involves fixing every other element (such as the clustering/classification method, the particular dataset etc.). Once you have performed such a wide, exhaustive evaluation, checking many such pairs on many datasets, you might claim that a given metric performs best in that range of tasks.
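A rough sketch of what "metric as hyperparameter" can look like for the classification case: each model + metric pair is scored on the task itself with cross-validation. The synthetic data set, the k-nearest-neighbours classifier and the three metric names are assumptions chosen only to keep the example short; the same pattern applies to other models and metrics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a labelled feature matrix (not the asker's data)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

scores = {}
for metric in ["euclidean", "manhattan", "chebyshev"]:
    # The classifier and its distance metric are evaluated together as one model
    model = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores[metric] = cross_val_score(model, X, y, cv=5).mean()

best_metric = max(scores, key=scores.get)
print(scores, best_metric)
```

Whichever metric comes out ahead here is only "best" for this model on this data set, which is the point made above.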

Python: how to compare the similarity between clustering using k-means algorithm?

I have two observations of the same event, say X and Y.
I assume there are nc clusters. I am using sklearn to do the clustering.
x = KMeans(n_clusters=nc).fit_predict(X)
y = KMeans(n_clusters=nc).fit_predict(Y)
Is there a measure that allows me to compare x and y, i.e. a measure that will be 1 if the clusters x and y are the same?
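Just extract the cluster centers of your fitted KMeans objects (see the docs). Note that fit_predict returns only the labels, so keep a reference to the estimators themselves:
km_x = KMeans(n_clusters=nc).fit(X)
km_y = KMeans(n_clusters=nc).fit(Y)
x_centers = km_x.cluster_centers_
y_centers = km_y.cluster_centers_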
Then you have to decide which metric to use to compare these. Keep in mind that the centers are floating-point values and that the clustering process is a randomized heuristic. This means that, with high probability, you will get results that are not exactly the same, even for KMeans objects trained on the same data.
This link discusses some approaches and the problems.
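One possible way to compare two sets of centers, given that the cluster order is arbitrary, is to match them one-to-one first and then look at the distances between matched centers. The sketch below does this with scipy.optimize.linear_sum_assignment on Euclidean distances; the center coordinates are invented, and both the matching strategy and the mean distance as a summary are assumptions rather than an established measure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical centers from two KMeans runs; the second set is a
# reordered, slightly shifted copy of the first
x_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
y_centers = np.array([[10.1, 0.0], [0.0, 0.1], [5.0, 4.9]])

# Pairwise Euclidean distances between the two sets of centers
cost = cdist(x_centers, y_centers)

# One-to-one matching that minimises the total distance
rows, cols = linear_sum_assignment(cost)

# Average distance between matched centers (0 would mean identical centers)
mean_shift = cost[rows, cols].mean()
print(cols, mean_shift)
```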
The Rand Index and its adjusted version do exactly this. Two cluster assignments that match (even if the labels themselves, which are treated as arbitrary, are different) get a score of 1. A Rand Index of 0 means they don't agree at all. The Adjusted Rand Index uses random assignment of points to clusters as its baseline, so it is close to 0 for random labelings.
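A small illustration with sklearn.metrics.adjusted_rand_score is given below. The label lists are made up, and the comparison assumes that position i in both lists refers to the same underlying observation, which has to hold for the question's x and y as well.

```python
from sklearn.metrics import adjusted_rand_score

x = [0, 0, 1, 1, 2, 2]
y = [2, 2, 0, 0, 1, 1]   # same grouping as x, only the label names differ
z = [0, 1, 0, 1, 0, 1]   # a grouping that cuts across x

same = adjusted_rand_score(x, y)
different = adjusted_rand_score(x, z)
print(same, different)
```

Here same is 1 because the two partitions are identical up to renaming, while different is much lower.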
