Python - multi-dimensional clustering with thresholds

Imagine I have a dataset as follows:
[{"x":20, "y":50, "attributeA":90, "attributeB":3849},
{"x":34, "y":20, "attributeA":86, "attributeB":5000},
etc.
There could be a bunch more attributes in addition to these - this is just an example. What I am wondering is: how can I cluster these points based on all of the factors, with control over the maximum separation between one point and the next for each variable before they are considered linked? (e.g. the Euclidean distance must be within 10 points, attributeA within 5 points, and attributeB within 1000 points.)
Any ideas on how to do this in Python? As I implied above, I would like to use Euclidean distance to compare the positions of two points if possible - not just comparing x and y as separate attributes. For the rest of the attributes it would all be single-dimensional comparison... if that makes sense.
Edit: Just to add some clarity in case this doesn't make sense: basically I am looking for an algorithm that compares all objects with each other (or some more efficient approach). If all of object A's attributes and its Euclidean distance are within the specified thresholds when compared to object B, then those two are considered similar and linked. This procedure continues until eventually all the linked clusters can be returned; some clusters will have no points that satisfy the conditions to be similar to any point in another cluster, leaving the clusters separated.

The simplest approach is to build a binary "connectivity" matrix.
Let a[i,j] be 0 exactly when your conditions are fulfilled, and 1 otherwise.
Then run hierarchical agglomerative clustering with complete linkage on this matrix. If you don't need every pair of objects in every cluster to satisfy your threshold, then you can also use other linkages.
This isn't the best solution - the distance matrix needs O(n²) memory and time, and the clustering even O(n³) - but it is the easiest to implement. Computing the distance matrix in pure Python code will be really slow unless you can avoid all loops and have e.g. numpy do most of the work. To improve scalability, you should consider DBSCAN and a data index.
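A minimal sketch of that connectivity-matrix approach with numpy and scipy (the sample points and the three thresholds come from the question; everything else is an illustrative assumption, not a drop-in solution):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

points = np.array([
    [20, 50, 90, 3849],   # x, y, attributeA, attributeB
    [34, 20, 86, 5000],
    [22, 55, 88, 4200],
])
xy, attr_a, attr_b = points[:, :2], points[:, 2], points[:, 3]

# Pairwise checks: Euclidean distance on (x, y), absolute difference for the rest.
euclid = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
linked = ((euclid <= 10)
          & (np.abs(attr_a[:, None] - attr_a[None, :]) <= 5)
          & (np.abs(attr_b[:, None] - attr_b[None, :]) <= 1000))

# Binary "connectivity" matrix: 0 where the conditions hold, 1 otherwise.
conn = np.where(linked, 0.0, 1.0)
np.fill_diagonal(conn, 0.0)

# Complete linkage: every pair inside a cluster must satisfy the thresholds.
# Use method="single" instead if transitive linking (as in the edit) is enough.
Z = linkage(squareform(conn, checks=False), method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)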
It's fairly straightforward to replace the three different thresholds with weights, so that you can get a continuous distance; likely even a metric. Then you could use data indexes, and try out OPTICS.

Related

How do I find the 100 most different points within a pool of 10,000 points?

I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm: begin by selecting one point at random, then select the next point as the one most distant from the first, then select the third point as the one with the longest average distance from the first two, and so on. This solution doesn't need to be perfect, just roughly correct. Preferably, the set of 100 points could be found within ~10 minutes, but finishing within 24 hours is also fine.
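A rough sketch of that greedy idea, assuming X is the 10,000 x 70 boolean array stored as a numpy matrix (the function name and the use of Euclidean distance are just placeholders):

import numpy as np

def greedy_diverse_subset(X, k=100, seed=0):
    # Repeatedly pick the point with the largest summed (equivalently,
    # average) distance to the points chosen so far.
    rng = np.random.default_rng(seed)
    X = X.astype(float)
    first = int(rng.integers(len(X)))
    chosen = [first]
    dist_sum = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        dist_sum[chosen] = -np.inf        # never pick the same point twice
        nxt = int(np.argmax(dist_sum))
        chosen.append(nxt)
        dist_sum = dist_sum + np.linalg.norm(X - X[nxt], axis=1)
    return chosen

For 10,000 points in 70 dimensions this is on the order of 10^7-10^8 operations, so it should finish well within the ~10 minute budget.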
I don't care about distance, in particular, that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.
Thanks
Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on the closest centre. Then recalculate the centres, etc. Eventually this should converge. Multiple runs might be required to find a good set of centres (it can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit-learn library implementation.
According to an IBM support page (from a comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
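A short sketch of that suggestion with scikit-learn (the random placeholder data, the choice of 10 PCA components, and the "closest to centre" representative rule are all illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder for the real 10,000 x 70 boolean data.
X = np.random.default_rng(0).integers(0, 2, size=(10000, 70)).astype(float)

X_reduced = PCA(n_components=10).fit_transform(X)   # optional PCA step
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_reduced)

# One representative per cluster: the member closest to its cluster centre.
representatives = []
for c in range(100):
    members = np.flatnonzero(km.labels_ == c)
    centre = km.cluster_centers_[c]
    closest = members[np.argmin(np.linalg.norm(X_reduced[members] - centre, axis=1))]
    representatives.append(int(closest))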
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices into 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.
In general, Hamming distances for 70-bit words have values between 0 and 70. In your case, the upper limit is 20, since there are 10 true coordinates and 60 false coordinates per point. The maximum distance occurs when none of the true coordinates of the two points coincide.
Creating the graph is a costly O(n²) operation, but it might be possible to get it done within your envisaged time frame.
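A sketch of building such a graph with scipy (the threshold of 8 is an arbitrary example, and the pymetis call at the end is only indicated, not part of the original answer):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder for the real 10,000 x 70 boolean data.
X = np.random.default_rng(0).integers(0, 2, size=(10000, 70)).astype(bool)

# Pairwise Hamming distances; pdist returns the fraction of differing bits,
# so multiply by the word length to get a count. This is the O(n^2) step.
ham = squareform(pdist(X, metric="hamming") * X.shape[1])

threshold = 8
adjacency = [np.flatnonzero((row <= threshold) & (np.arange(len(row)) != i))
             for i, row in enumerate(ham)]

# The adjacency lists could then be handed to a partitioner such as pymetis:
#   import pymetis
#   n_cuts, parts = pymetis.part_graph(256, adjacency=adjacency)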

Clustering on large, mixed type data

I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we look to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large amount of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes onto the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI seems to be the fastest (it takes some messing around to figure out) because it runs in Java. The output of ELKI is a little strange: it outputs a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
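A rough sketch of the encode / scale / reduce steps above using scikit-learn (the file name, imputation strategies, and component count are illustrative assumptions; the clustering step itself is left open):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA

df = pd.read_parquet("data.parquet")          # hypothetical 4M x 70 dataframe
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

X = preprocess.fit_transform(df)
X = X.toarray() if hasattr(X, "toarray") else X   # PCA needs a dense array
X_reduced = PCA(n_components=10).fit_transform(X)
# X_reduced can now be fed to the clustering step (HDBSCAN, OPTICS, ELKI, ...).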

DBSCAN plotting Non-geometrical-Data

I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: non-geometrical objects based on hexadecimal strings
I used a simple distance measure to create a distance matrix as input for DBSCAN, resulting in the expected clusters.
Question: Is it possible to create a plot of these cluster results like in the demo?
I didn't find a solution through searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using Python for everything (in that project) I would appreciate a solution in Python.
I don't use python, so I cannot give you example code.
If your data isn't 2 dimensional, you can try to find a good 2-dimensional approximation using Multidimensional Scaling.
Essentially, it takes an input matrix (which should satisfy the triangle inequality, and ideally be derived from Euclidean distance in some vector space; but you can often get good results even if this does not strictly hold). It then tries to find the 2-dimensional data set that best reproduces those distances.
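A minimal sketch of that idea with scikit-learn's MDS (the answer above did not include code; the random points and labels here are placeholders for the precomputed distance matrix and the DBSCAN cluster labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# Placeholders: in practice, "dist" is your precomputed distance matrix
# and "labels" the cluster labels returned by DBSCAN.
rng = np.random.default_rng(0)
points = rng.random((50, 5))
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
labels = rng.integers(0, 3, size=50)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
plt.title("2D MDS approximation of the precomputed distances")
plt.show()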

Classify into one of two sets (without learning)

I am dealing with a problem where I would like to automatically divide a set into two subsets, knowing that ALMOST ALL of the objects in the set A will have greater values in all of the dimensions than objects in the set B.
I know I could use machine learning but I need it to be fully automated, as in various instances of a problem objects of set A and set B will have different values (so values in set B of the problem instance 2 might be greater than values in set A of the problem instance 1!).
I imagine the solution could be something like finding objects which are the best representatives of those two sets (the density of the objects around them is the highest).
Finding N best representatives of both sets would be sufficient for me.
Does anyone know the name of the problem and/or could propose the implementation for that? (Python is preferable).
Cheers!
You could try some of the clustering methods, which belong to unsupervised machine learning. The result depends on your data and how it is distributed. According to your picture, I think the k-means algorithm could work. There is a Python library for machine learning, scikit-learn, which already contains a k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means
If your data is as easy as you explained, then there are some rather obvious approaches.
Center and count:
Center your data set, and count for each object how many values are positive. If more values are positive than negative, it will likely be in the red class.
Length histogram:
Compute the sum of each vector. Make a histogram of the values. Split at the largest gap: vectors with sums above the threshold are in one group, the others in the lower group.
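A minimal numpy sketch of those two heuristics (the two-group toy data is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(5.0, 1.0, size=(100, 4))   # "high" group
B = rng.normal(1.0, 1.0, size=(100, 4))   # "low" group
X = np.vstack([A, B])

# Center and count: objects with mostly positive centered values go to one group.
centered = X - X.mean(axis=0)
count_labels = (centered > 0).sum(axis=1) > X.shape[1] / 2

# Length histogram: split the per-vector sums at the largest gap.
sums = np.sort(X.sum(axis=1))
gap = int(np.argmax(np.diff(sums)))
threshold = (sums[gap] + sums[gap + 1]) / 2
hist_labels = X.sum(axis=1) > threshold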
I have made an IPython notebook demonstrating this approach available.

Quick/easy array comparison algorithm without sharing data

I have two arrays generated by two different systems that are independent of each other. I want to compare their similarities by comparing only a few numbers generated from the arrays.
Right now, I'm only comparing min, max, and sums of the arrays, but I was wondering if there is a better algorithm out there? Any type of hashing algorithm would need to be insensitive to small floating point differences between the arrays.
EDIT: What I'm trying to do is to verify that two algorithms generate the same data without having to compare the data directly. So the algorithm should be sensitive to shifts in the data and relatively insensitive to small differences between each element.
I wouldn't try to reduce this to one number; just pass around a tuple of values, and write a close_enough function that compares the tuples.
For example, you could use (mean, stdev) as your value, and then define close_enough as "each array's mean is within 0.25 stdev of the other array's mean".
from statistics import mean, stdev

def mean_stdev(a):
    # Summarize an array by its mean and standard deviation.
    return mean(a), stdev(a)

def close_enough(mean_stdev_a, mean_stdev_b):
    # "Close enough": the means differ by less than a quarter of either stdev.
    mean_a, stdev_a = mean_stdev_a
    mean_b, stdev_b = mean_stdev_b
    diff = abs(mean_a - mean_b)
    return diff < 0.25 * stdev_a and diff < 0.25 * stdev_b
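For example, with two hypothetical lists of measurements:

a = [1.0, 2.1, 2.9, 4.2]
b = [1.1, 2.0, 3.0, 4.1]
print(close_enough(mean_stdev(a), mean_stdev(b)))   # True: the means are nearly identical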
Obviously the right value is something you want to tune based on your use case. And maybe you actually want to base it on, e.g., variance (square of stdev), or variance and skew, or stdev and sqrt(skew), or some completely different normalization besides arithmetic mean. That all depends on what your numbers represent, and what "close enough" means.
Without knowing anything about your application area, it's hard to give anything more specific. For example, if you're comparing audio fingerprints (or DNA fingerprints, or fingerprint fingerprints), you'll want something very different from if you're comparing JPEG-compressed images of landscapes.
In your comment, you say you want to be sensitive to the order of the values. To deal with this, you can generate some measure of how "out-of-order" a sequence is. For example:
diffs = [x - y for x, y in zip(seq, sorted(seq))]
This gives you the difference between each element and the element that would be there in sorted position. You can build a stdev-like measure out of this (square each value, average, sqrt), or take the mean absolute diff, etc.
Or you could compare how far away the actual index is from the "right" index. Or how far the value is from the value expected at its index based on the mean and stdev. Or… there are countless possibilities. Again, which is appropriate depends heavily on your application area.
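One concrete way (just a sketch) to turn those diffs into a single stdev-like score, assuming seq is a plain list of numbers:

from math import sqrt

def disorder_score(seq):
    # Root-mean-square difference between each element and the element
    # that would sit at its position if the sequence were sorted.
    diffs = [x - y for x, y in zip(seq, sorted(seq))]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

print(disorder_score([1, 2, 3, 4]))   # 0.0 -- already in order
print(disorder_score([4, 3, 2, 1]))   # ~2.24 -- completely reversed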
Depends entirely on your definition of "compare their similarities".
What features do you want to compare? What features can you identify? Are there identifiable patterns? e.g. in this set there are 6 critical points, there are 2 discontinuities... etc.
You've already mentioned comparing the min/max/sum; and means and standard deviations have been talked about in comments too. These are all features of the set.
Ultimately, you should be able to take all these features and make an n-dimensional descriptor. For example [min, max, mean, std, etc...]
You can then compare these n-dimensional descriptors to define whether one is "less", "equal" or "more" than the other. If you want to classify other sets into whether they are more like "set A" or more like "set B", you could look into classifiers.
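A tiny sketch of such a descriptor (the particular features chosen here are only an example):

import numpy as np

def descriptor(a):
    # Summarize an array as an n-dimensional feature vector.
    a = np.asarray(a, dtype=float)
    return np.array([a.min(), a.max(), a.mean(), a.std(), a.sum()])

# Compare two arrays by the distance between their descriptors.
dist = np.linalg.norm(descriptor([1.0, 2.0, 3.0]) - descriptor([1.1, 2.0, 2.9]))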
See:
Classifying High-Dimensional Patterns Using a Fuzzy Logic
Support Vector Machines
