KMeans clustering unbalanced data - python

I have a set of data with 50 features (c1, c2, c3, ...) and over 80k rows.
Each row contains normalised numerical values (ranging 0-1). The features are actually normalised dummy variables, and some rows have only a few features set, 3-4 of them (i.e. 0 is assigned where there is no value). Most rows have about 10-20 features set.
I used KMeans to cluster the data, and it always results in one cluster with a large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together, which is not what I want.
Is there any way to balance out the clusters?

It is not part of the k-means objective to produce balanced clusters. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the sum-of-squares, and putting these objects into one cluster seems to be beneficial.
What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data are just not well suited for k-means' use of means. Furthermore, you'd probably need to weight the variables carefully, too.
Now, a hotfix that will likely improve your results (at least the perceived quality, because I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This gives more weight to the nonzero entries of rows with few features. You'll probably like the results more, but they will be even harder to interpret.
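A minimal sketch of this hotfix with scikit-learn, assuming the data sits in a NumPy array X (the random 0/1 matrix below is only a stand-in for the real 80k x 50 data, and the number of clusters is a placeholder):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Stand-in for the real 80k x 50 matrix of normalised dummy features.
rng = np.random.default_rng(0)
X = (rng.random((80000, 50)) < 0.2).astype(float)

# Scale every row to unit Euclidean length; rows with few nonzero
# entries end up with proportionally larger values per entry.
X_unit = normalize(X, norm="l2")

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_unit)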

Related

Clustering Subsets of a big dataset (2d and multi-dimensional)

How do you cluster a subset of a big dataset?
I have a big dataset of ~200000 high-dimensional points. There are around ~25000 different meaningful combinations of the points, each containing around 10-200 points, and I would like to assess the clustering properties of those combinations. I have used UMAP to reduce the high-dimensional data to 2D, so analyzing the UMAP embedding is acceptable, but analyzing the original data would be better.
Traditional clustering methods (k-means, hierarchical clustering and DBSCAN) could not account for what should be considered a cluster: the points of a combination occupy a small region of the space, as opposed to the entire space, even in 2D. They also generally cluster poorly because of the small number of points per combination, reporting multiple clusters where the extra points were actually outliers. I have made some progress with the level-set tree method in that regard, but the behavior of the algorithm is not always controllable (it only works for very typical cases). Are there any methods that you would suggest?

How to perform clustering on a dataset containing TRUE/FALSE values in Python?

My dataset contains columns describing abilities of certain characters, filled with True/False values. There are no empty values. My ultimate goal is to make groups of characters with similar abilities. Here are my questions:
Should I change the True/False values to 1 and 0? Or is there no need for that?
What clustering model should I use? Is KMeans okay for that?
How do I interpret the results (output)? Can I visualize it?
The thing is, I always see people perform clustering on numeric datasets that you can visualize, and it looks much easier to do. With True/False I just don't know how to approach it.
Thanks.
In general there is no need to change True/False to 0/1. This is only necessary if you want to apply a specific clustering algorithm that cannot deal directly with boolean inputs, such as K-means.
K-means is not a preferred option. K-means requires continuous features as input, as it is based on computing distances, like many clustering algorithms. So no boolean inputs. And although binary (0-1) input works, it does not compute distances in a very meaningful way (many points will have the same distance to each other). With 0-1 data only, I would not use clustering; instead I would recommend tabulating the data and seeing which cells occur frequently. If you have a large data set you might use the Apriori algorithm to find the frequently occurring cells.
In general, a clustering algorithm typically returns a cluster number for each observation. In low dimensions, this number is frequently used to colour the observations in a scatter plot. However, in your case of boolean values, I would just list the most frequently occurring cells.
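As a rough illustration of that tabulation idea, here is a small pandas sketch; the ability columns below are made-up placeholders, not the asker's data:

import pandas as pd

# Made-up True/False ability columns standing in for the real dataset.
df = pd.DataFrame({
    "can_fly":   [True, False, True, True, False],
    "can_swim":  [False, False, True, False, False],
    "has_magic": [True, True, True, True, True],
})

# Count how often each exact combination of abilities occurs; every row of
# the result is one "cell" of the cross-tabulation, most frequent first.
combo_counts = df.value_counts()
print(combo_counts.head(10))

# For a large dataset, frequent subsets of abilities could be mined instead,
# e.g. with apriori from the third-party mlxtend package.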

Clustering on large, mixed type data

I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric and some are categorical, and there are occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large amount of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now-all-numeric attributes onto the same scale.
Clustering is computationally expensive, so you might try a third step of representing the data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the number of columns.
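A possible sketch of these steps with scikit-learn; the column names and the tiny toy DataFrame are placeholders, and on the real 4-million-row frame you would use around 10 components (or IncrementalPCA) rather than the 3 used here:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny toy frame standing in for the real 4M x 70 dataframe.
df = pd.DataFrame({
    "num_a": [1.0, 2.5, None, 4.0],
    "num_b": [0.1, 0.2, 0.3, 0.4],
    "cat_a": ["x", "y", None, "x"],
    "cat_b": ["p", "p", "q", "q"],
})
numeric_cols = ["num_a", "num_b"]
categorical_cols = ["cat_a", "cat_b"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      # sparse_output=False needs scikit-learn >= 1.2
                      ("encode", OneHotEncoder(handle_unknown="ignore",
                                               sparse_output=False))]), categorical_cols),
])

# Steps 1-3: encode the categoricals, scale the numerics, reduce with PCA.
reducer = Pipeline([("prep", preprocess), ("pca", PCA(n_components=3))])
X_reduced = reducer.fit_transform(df)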
For the clustering step itself, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around to figure out) because it runs in Java. The output of ELKI is a little strange: it writes a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
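For reference, a hedged sketch of the OPTICS route using scikit-learn's own implementation, run here on a small random stand-in; for millions of rows you would likely need ELKI, a subsample, or the PCA-reduced data from the previous sketch:

import numpy as np
from sklearn.cluster import OPTICS

# Placeholder for the PCA-reduced matrix from the previous sketch.
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(2000, 10))

# min_samples controls how many neighbours a point needs to be a core point;
# smaller values tend to produce more, finer-grained clusters. Noise gets -1.
labels = OPTICS(min_samples=10, xi=0.05).fit_predict(X_reduced)
print(len(set(labels)) - (1 if -1 in labels else 0), "clusters found")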

Data Standardization vs Normalization vs Robust Scaler

I am working on data preprocessing and want to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are:
Advantages:
Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
Normalization: shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).
Robust Scaler: similar to normalization but it instead uses the interquartile range, so that it is robust to outliers.
Disadvantages:
Standardization: not good if the data is not normally distributed (i.e. not Gaussian).
Normalization: gets influenced heavily by outliers (i.e. extreme values).
Robust Scaler: doesn't take the extreme values into account and only focuses on where the bulk of the data is.
I created 20 random numerical inputs and tried the above-mentioned methods (numbers in red color represent the outliers):
I noticed that, indeed, Normalization got affected negatively by the outliers: the scale of the new values became tiny (all values almost identical, 6 digits after the decimal point, 0.000000x), even though there are noticeable differences between the original inputs!
My questions are:
Am I right to say that Standardization also gets affected negatively by the extreme values? If not, why, according to the results provided?
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Am I right to say that Standardization also gets affected negatively by the extreme values?
Indeed you are; the scikit-learn docs themselves clearly warn about such a case:
However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust against outliers.
More or less, the same holds true for the MinMaxScaler as well.
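A quick way to see this is to run a small made-up sample containing one extreme value through each of the three scalers (a sketch, not the asker's actual 20 inputs):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Mostly moderate values plus one extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    # The outlier inflates the standard deviation (StandardScaler) and the range
    # (MinMaxScaler), squeezing the bulk of the data together; RobustScaler, which
    # centres on the median and scales by the IQR, keeps the bulk spread out.
    print(type(scaler).__name__, np.round(scaled[:5], 3), "... outlier:", round(scaled[-1], 2))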
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Robust does not mean immune or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values - that is a separate task with its own methodologies; this is again clearly mentioned in the relevant scikit-learn docs:
RobustScaler
[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).
where the "see below" refers to the QuantileTransformer and quantile_transform.
None of them are robust in the sense that the scaling will take care of outliers and put them on a confined scale, that is, guarantee that no extreme values will appear.
You can consider options like:
Clipping the series/array (say, between the 5th and 95th percentiles) before scaling
Applying transformations like square root or logarithm, if clipping is not ideal
Obviously, adding another column such as 'is clipped', or the (log of the) clipped amount, will reduce the information loss, as in the sketch below.
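A minimal NumPy sketch of those two options (the array is a made-up example, and the 5th/95th percentile bounds are just one possible choice):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

# Option 1: clip to the 5th-95th percentile range before scaling,
# keeping a flag column so the information is not lost entirely.
lo, hi = np.percentile(x, [5, 95])
x_clipped = np.clip(x, lo, hi)
was_clipped = (x < lo) | (x > hi)

# Option 2: compress the long tail with a log transform instead
# (log1p handles zeros gracefully).
x_log = np.log1p(x)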

Efficient k-means evaluation with silhouette score in sklearn

I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 is the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set; a short sketch also follows this list). Obviously, if you have K == number of data points, you can explain 100% of the variance. The question is where the improvements in variance explained start to level off.
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. Though I haven't seen this approach used to estimate K, and it is likely inadvisable to rely on it like that. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. there are not as many clusters as you're expecting).
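A hedged sketch of the elbow method mentioned above, using the KMeans inertia_ attribute (the within-cluster sum of squares) on a small random placeholder matrix:

import numpy as np
from sklearn.cluster import KMeans

# Placeholder standing in for the real ~1M x 100 feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))

# Inertia is the within-cluster sum of squares; plot it against K and look
# for the point where the improvements start to level off.
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))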
How much to sample
It looks like you've answered this from your plot: no matter what your sampling, you get the same pattern in silhouette score. So that pattern seems very robust to sampling assumptions.
k-means converges to local minima, and the starting positions play a crucial role in the optimal number of clusters. It is often a good idea to reduce the noise and dimensionality using PCA or another dimension-reduction technique before proceeding with k-means.
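For illustration, a minimal scikit-learn sketch of that suggestion: PCA for noise and dimension reduction, followed by k-means with several random restarts (the matrix size and the parameter values are placeholders, not recommendations):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Placeholder standing in for the ~1M x 100 feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 100))

# Reduce dimensionality first, then run k-means with multiple random
# restarts so the result depends less on the starting positions.
model = make_pipeline(PCA(n_components=20),
                      KMeans(n_clusters=10, n_init=20, random_state=0))
labels = model.fit_predict(X)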
Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters by "partitioning around medoids". It is equivalent to using the silhouette method.
The reason for the weird observations could be the different starting points for the different-sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable way to do so is the worst-pair ratio, as discussed here: Clusterability.
Since there is no widely-accepted best approach to determine the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc. fundamentally rely on some form of heuristic/trial&error argument. So to me, the best approach is to try out multiple techniques and to NOT develop over-confidence in any.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.
One more thing: the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether that sample size is large enough to produce consistent results.
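A small sketch of that repeated-sampling check, assuming feature_matrix and cluster_labels already exist (the placeholders below are kept tiny so the example runs quickly; on the real data you would keep sample_size at something like 50000):

import numpy as np
from sklearn.metrics import silhouette_score

# Tiny placeholders standing in for the real feature matrix and k-means labels.
rng = np.random.default_rng(0)
feature_matrix = rng.normal(size=(20000, 50))
cluster_labels = rng.integers(0, 10, size=20000)

# Repeat the sampled score with different seeds; a small spread suggests the
# chosen sample_size is large enough to give a stable estimate.
scores = [silhouette_score(feature_matrix, cluster_labels, metric="euclidean",
                           sample_size=5000, random_state=seed)
          for seed in range(5)]
print("mean:", np.mean(scores), "std:", np.std(scores))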
