I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, some are categorical, and there are occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes onto the same scale.
Clustering is computationally expensive, so you might add a third step of representing the data by its top 10 principal components (or however many components have an eigenvalue > 1) to reduce the number of columns.
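A minimal sketch of those first steps with scikit-learn, assuming the data sits in a pandas DataFrame `df`; the column names are hypothetical, and the imputers are one way to handle the occasional missing values:

```python
# Hypothetical preprocessing sketch: impute, scale numerics, one-hot encode
# categoricals, then reduce to 10 principal components.
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["age", "income"]          # hypothetical numeric columns
cat_cols = ["region", "product"]      # hypothetical categorical columns

pre = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), num_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ],
    sparse_threshold=0.0,             # force a dense matrix so PCA can consume it
)

pipe = Pipeline([("pre", pre), ("pca", PCA(n_components=10))])
X_reduced = pipe.fit_transform(df)    # df is your 4 million x 70 DataFrame
```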
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around to figure out) because it runs in Java. ELKI's output is a little strange: it writes a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
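A hedged sketch of that Python glue; the ELKI command-line parameters and the per-cluster output format vary by version, so the jar path, flags, and parsing below are placeholders to adapt:

```python
# Placeholder glue code: run ELKI, then loop through its per-cluster output files.
import glob
import os
import subprocess

out_dir = "elki_out"                      # directory ELKI writes one file per cluster into
elki_cmd = ["java", "-jar", "elki.jar"]   # append the OPTICS algorithm, input, and output
                                          # parameters for your ELKI version here
subprocess.run(elki_cmd, check=True)

# Build a point-id -> cluster-id mapping from the per-cluster output files.
labels = {}
for cluster_id, path in enumerate(sorted(glob.glob(os.path.join(out_dir, "*.txt")))):
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):      # skip metadata/comment lines
                continue
            point_id = line.split()[0]    # assumes the first token identifies the point
            labels[point_id] = cluster_id
```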
My dataset contains columns describing abilities of certain characters, filled with True/False values. There are no empty values. My ultimate goal is to make groups of characters with similar abilities. And here's the question:
Should I change the True/False values to 1 and 0, or is there no need for that?
What clustering model should I use? Is KMeans okay for that?
How do I interpret the results (output)? Can I visualize them?
The thing is, I always see people perform clustering on numeric datasets that you can visualize, and it looks much easier to do. With True/False values I just don't know how to approach it.
Thanks.
In general, there is no need to change True/False to 0/1. This is only necessary if you want to apply a specific clustering algorithm that cannot deal with boolean inputs, like K-means.
K-means is not a preferred option. K-means requires continuous features as input, as it is based on computing distances, like many clustering algorithms, so boolean inputs are out. And although binary (0-1) input works, the distances it computes are not very meaningful (many points will have the same distance to each other). In the case of 0-1 data only, I would not use clustering, but would recommend tabulating the data and seeing which cells occur frequently. If you have a large data set, you might use the Apriori algorithm to find cells that occur frequently.
In general, a clustering algorithm typically returns a cluster number for each observation. In low dimensions, this number is frequently used to give a color to an observation in a scatter plot. However, in your case of boolean values, I would just list the most frequently occurring cells.
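A minimal sketch of that tabulation, assuming the abilities live in a pandas DataFrame of boolean columns (the toy `df` and its column names below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "can_fly":  [True, False, True, True, False],
    "can_swim": [False, False, True, True, False],
})

# Count how often each unique combination of abilities occurs; the top rows are the
# most frequent "cells". With many columns, a frequent-itemset miner (e.g. an Apriori
# implementation) can replace this exhaustive tabulation.
print(df.value_counts().head(10))
```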
In JMP software there is an option to use the "fast Ward" method when the number of rows is greater than 2000. From the documentation [fast ward]:
"Applies an algorithm that computes Ward's method more quickly for large numbers of rows. The computation time is shorter because this algorithm does not require the calculation of a distance matrix. It is used automatically whenever there are more than 2,000 rows."
MATLAB does the same thing:
"Find a maximum of four clusters in a hierarchical cluster tree created using the ward linkage method. Specify 'SaveMemory' as 'on' to construct clusters without computing the distance matrix. Otherwise, you can receive an out-of-memory error if your machine does not have enough memory to hold the distance matrix."
I'm looking for something similar in Python, but the implementations I have found all seem to require the distance matrix to be calculated ahead of time (which requires absurd amounts of memory for my problem of 275k rows and 10 columns). In JMP/MATLAB, though, it works just fine on a machine with half the memory of the machine I want to run the Python script on. Does anybody know of something?
From a now-rolled-back edit to the question by the OP:
I found that using the "linkage_vector" option seems to be what I was looking for. I was thrown off because "vector" to me meant 1D, but I guess it can be N-D.
Have you worked with fastcluster? It has the option for "hierarchical clusters from distance matrices or from vector data".
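A hedged sketch of that interface, assuming the fastcluster package is installed; `linkage_vector` builds the Ward linkage directly from the observation matrix without a precomputed distance matrix, and the toy data and cluster count below are placeholders:

```python
import fastcluster
import numpy as np
from scipy.cluster.hierarchy import fcluster

X = np.random.rand(10_000, 10)                      # stand-in for the 275k x 10 data
Z = fastcluster.linkage_vector(X, method="ward")    # memory-saving Ward linkage
labels = fcluster(Z, t=400, criterion="maxclust")   # cut the tree into 400 clusters (example)
```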
I have a set of data with 50 features (c1, c2, c3 ...), with over 80k rows.
Each row contains normalised numerical values (ranging 0-1). They are actually normalised dummy variables, and some rows have only a few nonzero features, 3-4 (i.e. 0 is assigned if there is no value), while most rows have about 10-20.
I used KMeans to cluster the data, always resulting in one cluster with a large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together, which is not what I want.
Is there any way to balance out the clusters?
It is not part of the k-means objective to produce balanced clusters. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the sum-of-squares, and putting these objects into one cluster seems to be beneficial.
What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data are just not well suited for k-means' use of means. Furthermore, you'd probably need to carefully weight the variables, too.
Now, a hotfix that will likely improve your results (at least the perceived quality, because I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This will emphasize the ones in rows with few nonzero entries. You'll probably like the results more, but they will be even harder to interpret.
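A minimal sketch of that hotfix, assuming `X` is the matrix of normalised dummy variables and using scikit-learn's row-wise normalizer (the cluster count is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = np.random.rand(80_000, 50)                    # stand-in for the 80k x 50 data
X_unit = normalize(X, norm="l2")                  # every row now has Euclidean norm 1
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X_unit)
```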
I have a task to find similar parts based on numeric dimensions (diameters, thickness) and categorical dimensions (material, heat treatment, etc.). I have a list of 1 million parts. My approach as a programmer is to put all parts on a list, pop off the first part, and use it as a new "cluster" against which to compare the rest of the parts on the list. When a part on the list matches the categorical dimensions and the numerical dimensions (within 5 percent), I add that part to the cluster and remove it from the initial list. Once all parts on the list have been compared with the initial cluster part's dimensions, I pop the next part off the list and start again, populating clusters until no parts remain on the original list. This is a programmatic approach, but I am not sure whether it is the most efficient way of grouping parts into "clusters" or whether k-means clustering would be a better approach.
Define "better".
What you are doing seems related to "leader" clustering. That is a very primitive form of clustering that will usually not yield competitive results. But with 1 million points your choices are limited, and k-means does not handle categorical data well.
Until you decide what is 'better', there probably is nothing 'wrong' with your greedy approach.
An obvious optimization would be to first split all the data based on the categorical attributes (as you expect them to match exactly). That requires just one pass over the data set and a hash table. If the resulting groups are small enough, you could then try k-means (but how would you choose k?) or DBSCAN (probably using the same threshold you already have) on each group.
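A hedged sketch of that optimization, with hypothetical column names and toy data; the pandas groupby plays the role of the hash table, and the log transform is one way to turn the "within 5 percent" rule into a roughly constant distance threshold (log 1.05 ≈ 0.049):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
parts = pd.DataFrame({                              # toy stand-in for the 1M-part table
    "material": rng.choice(["steel", "brass"], 1000),
    "heat_treatment": rng.choice(["annealed", "hardened"], 1000),
    "diameter": rng.uniform(1, 100, 1000),
    "thickness": rng.uniform(0.1, 10, 1000),
})
cat_cols = ["material", "heat_treatment"]
num_cols = ["diameter", "thickness"]

labels = pd.Series(-1, index=parts.index)
offset = 0
for _, group in parts.groupby(cat_cols):            # exact match on categorical attributes
    X = np.log(group[num_cols])                     # assumes strictly positive dimensions
    sub = DBSCAN(eps=np.log(1.05), min_samples=1).fit_predict(X)   # ~5 percent tolerance
    labels.loc[group.index] = sub + offset
    offset += sub.max() + 1
print(labels.value_counts().head())
```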
I am dealing with a problem where I would like to automatically divide a set into two subsets, knowing that ALMOST ALL of the objects in set A will have greater values in all of the dimensions than the objects in set B.
I know I could use machine learning, but I need it to be fully automated, since in different instances of the problem the objects of set A and set B will have different values (so values in set B of problem instance 2 might be greater than values in set A of problem instance 1!).
I imagine the solution could be something like finding objects which are the best representatives of those two sets (the density of the objects around them is the highest).
Finding N best representatives of both sets would be sufficient for me.
Does anyone know the name of the problem and/or could propose the implementation for that? (Python is preferable).
Cheers!
You could try some clustering methods, which belong to unsupervised machine learning. The result depends on your data and how they are distributed. Judging from your picture, I think the K-means algorithm could work. There is a Python machine-learning library, scikit-learn, which already contains a k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means
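A minimal sketch with scikit-learn's KMeans, assuming the objects are rows of a numeric array (the toy data below stands in for your two sets):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: the first 50 rows sit above the last 50 in every dimension.
X = np.vstack([rng.random((50, 3)) + 2, rng.random((50, 3))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```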
If your data is as easy as you explained, then there are some rather obvious approaches.
Center and count:
Center your data set, and count for each object how many values are positive. If more values are positive than negative, the object will likely be in the red class.
Length histogram:
Compute the sum of each vector and make a histogram of the sums. Split at the largest gap: vectors with sums above the threshold go in one group, the others in the lower group.
I have made an IPython notebook available that demonstrates this approach.
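A minimal sketch of both ideas, assuming the objects are rows of a numeric array `X` (the toy data below has group A shifted upward in every dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.random((100, 4)) + 1.5, rng.random((100, 4))])

# Center and count: after centering, objects with mostly positive values go to group A.
centered = X - X.mean(axis=0)
in_group_a = (centered > 0).sum(axis=1) > X.shape[1] / 2

# Length histogram: sum each row, sort the sums, and split at the largest gap.
sums = X.sum(axis=1)
order = np.sort(sums)
gap = np.argmax(np.diff(order))
threshold = (order[gap] + order[gap + 1]) / 2
in_group_a_alt = sums > threshold
```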