I have a trajectory dataset saved in a *.csv file, and I have split it into separate files by month. The number of records per file differs a lot; for example, January has about 10 thousand records while April has about five hundred thousand.
I am going to perform k-means clustering in Python on each file. Could you please let me know how I can determine the best number of clusters to use as the initial K?
Thank you
You can use the elbow method.
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.
Don't let the above description scare you; it's actually quite easy to do. Here's a quick tutorial.
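A minimal sketch of the elbow method with scikit-learn, assuming the features from one monthly file are already loaded into an array X (a hypothetical variable name):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: feature matrix for one monthly file (assumed to be loaded already)
k_values = range(1, 15)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()
```

Pick the k where the curve stops dropping sharply and flattens out; that bend is the elbow.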
I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm: select one point at random, then select the next point as the one most distant from the first, then select the third point as the one with the longest average distance from the first two, and so on. This solution doesn't need to be perfect, just roughly correct. Preferably, the 100 points can be found within ~10 minutes, but finishing within 24 hours is also fine.
I don't care about distance in particular; that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.
Thanks
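A rough sketch of the greedy selection described in the question, using the distance to the nearest already-selected point (farthest-first traversal) rather than the average distance; X is assumed to be the (10000, 70) boolean array:

```python
import numpy as np

def farthest_first(X, n_select=100, seed=0):
    # Greedy max-min selection: start from a random point, then repeatedly add
    # the point whose distance to its nearest already-selected point is largest.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    selected = [int(rng.integers(len(X)))]
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

On 10,000 points with 70 dimensions this runs in seconds, well within the stated time budget.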
Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group, then calculate the centre of each group, re-assign items to groups based on the closest centre, recalculate the centres, and so on. Eventually this should converge. Multiple runs might be required to find a good set of centres (the algorithm can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit-learn implementation.
According to an IBM support page (from the comment by sascha), k-means may not work well with binary data, and other clustering algorithms may work better. You could also try converting your records to a space where Euclidean distance is more useful and continue to use k-means clustering. One algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
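A minimal sketch of that approach with scikit-learn: KMeans with k=100, optionally with PCA first, then taking the actual data point closest to each centre. X is assumed to be the (10000, 70) boolean array:

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Optional: project the boolean data into a space where Euclidean
# distance behaves better before clustering.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X.astype(float))

km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_reduced)

# For each of the 100 cluster centres, take the closest actual data point.
representative_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_reduced)
representatives = X[representative_idx]
```

Alternatively, you could draw one random member from each cluster instead of the nearest point to the centre.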
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices in 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big, so you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.
In general, Hamming distances for 70-bit words take values between 0 and 70. In your case, the upper limit is 20, as there are 10 true coordinates and 60 false coordinates per point; the maximum distance occurs when the true coordinates of the two points are at entirely different positions.
Creation of the graph is a costly operation of O(n^2). But it might be possible to get it done within your envisaged time frame.
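A small sketch of building such a similarity edge list with SciPy, assuming X is the (10000, 70) boolean array and using a hypothetical threshold on the Hamming distance in bits:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_edges(X, max_hamming_bits=10):
    X = np.asarray(X, dtype=bool)
    # pdist with metric="hamming" returns the *fraction* of differing positions,
    # so multiply by the number of dimensions to get the distance in bits (0..20 here).
    D = squareform(pdist(X, metric="hamming")) * X.shape[1]
    # Upper triangle only, so each undirected edge appears once.
    i, j = np.nonzero(np.triu(D <= max_hamming_bits, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

Note that the full 10,000 x 10,000 distance matrix takes roughly 800 MB as float64, which reflects the O(n^2) cost mentioned above; chunking the distance computation would reduce the memory footprint.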
I am trying to cluster retail data in order to extract groupings of customers based on 6 input features. The data has a shape of (1712594, 6) in the following format:
I've split the 'Department' categorical variable into a binary n-dimensional array using Pandas get_dummies(). I'm aware this is not optimal, but I just wanted to test it out before trying Gower distances.
The Elbow method gives the following output:
USING:
I'm using Python and scikit-learn's KMeans because the dataset is so large and the more complex models are too computationally demanding for Google Colab.
OBSERVATIONS:
I'm aware that columns 1-5 are highly correlated, but the data is limited sales data and little to no information is captured about customers. KMeans is very sensitive to its inputs, and this may affect the WCSS in the elbow method and cause the near-straight line, but this is just a hunch and I don't have any quantitative backing to support the argument. I'm a junior data scientist, so my knowledge of the technical foundations of clustering models and algorithms is still developing; forgive me if I'm missing something.
WHAT I'VE DONE:
There were massive outliers skewing the data (this is a building-goods company, so most of their sale prices and quantities fall within a certain range), but ~5% of the data contained massive quantity entries (e.g. a company buying 300,000 bricks at R3/brick) or massive price entries (e.g. a company buying an expensive piece of equipment).
I've removed them and kept ~94% of the data. I've also removed the returns made by customers (i.e. negative quantities and prices), with the idea that I may create a binary variable 'Returned' to capture this feature. Here are some metrics:
These are some metrics before removing the outliers:
and these are the metrics after Outlier removal:
KMeans uses Euclidean distances. I've used both scikit-learn's StandardScaler and RobustScaler when scaling, without any significant change in either. Here are some distribution plots and scatter plots for the 3 numeric variables:
Does anybody have any practical/intuitive reasoning as to why this may be happening? I'm open to alternative methods as well, and any help would be much appreciated! Thanks
I am not an expert, but in my experience with scikit-learn cluster analysis, k-means usually does not do the job well when the features are very similar in magnitude. I would first try a StandardScaler to see whether normalizing the data makes the clustering more effective. Your elbow plot shows the WCSS decreasing smoothly as the number of clusters grows, without a clear bend, and by the looks of that plot and the other plots you provide, I would think the data is too similar, making it hard to separate into groups (clusters). Adding an additional feature made up of your existing data can do the trick.
I would try normalizing the data first, with StandardScaler.
If the groups are still not clear from a simple plot of the data, I would create another column made up of a combination of the other columns.
I would not suggest DBSCAN, since the eps (distance) parameter would have to be tuned very finely and, as you mention, it is more computationally expensive.
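A minimal sketch of that scaling step before KMeans, with a silhouette score computed on a random sample as a rough quantitative check (X is a hypothetical name for the prepared numeric feature matrix):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(X)  # normalize each feature

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Silhouette on a 10,000-row sample, since computing it on ~1.7M rows is expensive.
rng = np.random.default_rng(0)
sample = rng.choice(len(X_scaled), size=10_000, replace=False)
print(silhouette_score(X_scaled[sample], km.labels_[sample]))
```

A silhouette score near 0 would support the suspicion that the points are too similar to separate into distinct clusters.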
I know that DBSCAN has a parameter that specifies the minimum number of points (min points), but I would like to restrict the maximum number of points in a cluster. Do you know how I can do that? I have investigated but haven't found anything. For example, I only want DBSCAN to group a maximum of 4 points per cluster.
Thanks!
I think you will find everything you need in the link Aaron shared with you. Also, just so you know, clustering methodologies are unsupervised, so you don't train/test anything. You let the algo tell you the story, based on the data that is fed in. You don't know what will happen in advance. In short, with DBSCAN and also Hierarchical Clustering (but not K-Means), you do not pre-specify the number of clusters. The algo determines the optimal number of clusters for you. If you really want to control the number of clusters (min or max) you need to use a K-Means algo. Take a look at this link when you have a chance.
https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
Simplified example of what I'm trying to do:
Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters [(A,B),(C)]. Then I run MeanShift clustering on this data and get 2 clusters [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:
100% score for [(A,B),(C)] vs. [(A,B),(C)]
~50% score for [(A,B),(C)] vs. [(A),(B,C)]
~20% score for [(A,B),(C)] vs. [(A,B,C)]
These scores are a bit arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example, and in real applications you can have many data points and also more than 2 clusters per cluster grouping. Having such a metric is also useful when trying to compare a cluster grouping to a labeled grouping of data (when you have labeled data).
Edit: One idea that I have is to take every cluster in the first cluster grouping and get its percent overlap with every cluster in the second cluster grouping. This would give you a similarity matrix of clusters in the first cluster grouping against clusters in the second cluster grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and do something with that?
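One way to build the overlap matrix described in the edit, using the three-point example from the question and scikit-learn's contingency matrix:

```python
from sklearn.metrics.cluster import contingency_matrix

labels_kmeans    = [0, 0, 1]   # [(A, B), (C)]
labels_meanshift = [0, 1, 1]   # [(A), (B, C)]

# Rows correspond to clusters of the first grouping, columns to clusters of the
# second; each entry counts the points the two clusters have in common.
C = contingency_matrix(labels_kmeans, labels_meanshift)
overlap = C / C.sum(axis=1, keepdims=True)   # per-row fraction of shared points
print(overlap)
```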
Use evaluation metrics.
Many metrics are symmetric. For example, the adjusted Rand index.
A value close to 1 means the two clusterings are very similar, a value close to 0 means the agreement is no better than chance, and a negative value means the agreement is worse than chance, with each cluster of one grouping spread "evenly" over the clusters of the other.
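For the toy example above, a short sketch with scikit-learn (cluster IDs are arbitrary labels; only the groupings matter):

```python
from sklearn.metrics import adjusted_rand_score

labels_kmeans    = [0, 0, 1]   # [(A, B), (C)]
labels_meanshift = [0, 1, 1]   # [(A), (B, C)]

print(adjusted_rand_score(labels_kmeans, labels_kmeans))     # 1.0: identical groupings
print(adjusted_rand_score(labels_kmeans, [1, 1, 0]))         # 1.0: same grouping, relabelled
print(adjusted_rand_score(labels_kmeans, labels_meanshift))  # negative on this tiny example
```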
Well, determining the number of clusters is a problem in data analysis and a different issue from the clustering problem itself. There are quite a few criteria for this, such as AIC or the Cubic Clustering Criterion. I don't think scikit-learn offers an option to calculate these two out of the box, but I know there are packages in R that do.
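If you are willing to switch from plain k-means to a Gaussian mixture model, scikit-learn's GaussianMixture does expose AIC and BIC directly; a rough sketch, assuming a feature matrix X:

```python
from sklearn.mixture import GaussianMixture

# Fit mixtures with different numbers of components and compare the
# information criteria; lower AIC/BIC is better.
for k in range(1, 11):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gm.aic(X), gm.bic(X))
```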
Are there any types of clustering algorithms that focus on forming clusters of a specific size? This can be thought of as a grouping algorithm more than a clustering algorithm.
Basically, given n data points and fixed groups of a certain size k, find the optimal distribution of points to sets based upon certain classifiers, hopefully minimizing the distance between the classifiers of the points within a given group.
This problem seems to be pretty similar to a clustering problem, but the main difference is that we are concerned with a specific cluster size, but not concerned about the number of clusters.
There is a tutorial on how to implement such an algorithm in ELKI:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Also have a look at constraint clustering algorithms, although usually these algorithms only support "must-link" and "cannot-link" constraints, not size constraints.
You should be able to do a similar modification where you first specify the group sizes, then assign points randomly, and swap cluster members as long as your objective function improves; similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and only keep the best.
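A rough, unoptimized sketch of that swap heuristic, assuming the number of points divides evenly into groups of the desired size and recomputing the full objective after each candidate swap for simplicity:

```python
import numpy as np

def equal_size_groups(X, group_size, n_restarts=5, n_swaps=20000, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    k = n // group_size                      # assumes n is divisible by group_size
    best_labels, best_cost = None, np.inf

    def wcss(labels):
        # total within-group sum of squared distances to the group mean
        return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                   for c in range(k))

    for _ in range(n_restarts):
        labels = rng.permutation(np.repeat(np.arange(k), group_size))
        cost = wcss(labels)
        for _ in range(n_swaps):
            i, j = rng.integers(n, size=2)
            if labels[i] == labels[j]:
                continue
            labels[i], labels[j] = labels[j], labels[i]       # try swapping two points
            new_cost = wcss(labels)
            if new_cost < cost:
                cost = new_cost                               # keep the improving swap
            else:
                labels[i], labels[j] = labels[j], labels[i]   # undo
        if cost < best_cost:
            best_cost, best_labels = cost, labels.copy()

    return best_labels, best_cost
```

Because swaps never change group sizes, every group keeps exactly group_size members; the repeated full-objective recomputation is slow for large n, so a real implementation would update the cost incrementally.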
See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size
The problem that you are posing is a combinatorial optimization problem. It is very important to know whether you need an exact solution or whether you can settle for an approximate one.
If you need exact solutions, there is a body of work that focuses on clustering with different types of constraints, and the constraint that you mentioned can be encoded in this framework. However, you should know that this approach only scales up to datasets of a certain size.