I have a large dataset consisting of 6 input variables (temperatures, pressures, flow rates, etc.) that together determine outputs such as yield, purity and conversion.
There are approximately 47,600 instances in total, all stored in an Excel spreadsheet.
I have applied both artificial neural network and random forest algorithms to this data (in Python) and obtained plots of the predictions and accuracy metrics.
The random forest model has a feature that gives input variable importance.
I would now like to perform a PCA on this data, firstly to compare against the random forest results, and also to get more information on how my input variables interact with each other to produce my outputs.
I've watched a few YouTube videos and tutorials to get my head around PCA, but the data they use is quite different from mine.
Below is a snippet of my data. The first 6 columns are inputs and the last 3 are outputs.
How can I analyse this using PCA? I have managed to plot it in Python, but the plot is very busy and doesn't give much information.
Any help or tips are welcome! Perhaps a different analysis tool? I don't mind using Python or MATLAB.
Thank you :)
I suggest using the KarhunenLoeveSVDAlgorithm in OpenTURNS. It provides 4 implementations of a random SVD algorithm. The constraint is that the number of singular values to be computed has to be set beforehand.
In order to enable the algorithm, we must set the KarhunenLoeveSVDAlgorithm-UseRandomSVD key in the ResourceMap. The KarhunenLoeveSVDAlgorithm-RandomSVDMaximumRank key then sets the number of singular values to compute (by default, it is equal to 1000).
Two implementations are provided:
Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, Mark Tygert. An algorithm for the principal component analysis of large data sets.
These algorithms can be chosen with the KarhunenLoeveSVDAlgorithm-RandomSVDVariant key.
In the following example, I simulate a large process sample from a Gaussian process with an AbsoluteExponential covariance model.
import openturns as ot
# Regular 2-D mesh on [-1, 1] x [-1, 1] with 10 intervals per dimension
mesh = ot.IntervalMesher([10]*2).build(ot.Interval([-1.0]*2, [1.0]*2))
# Truncation threshold for the Karhunen-Loeve decomposition
s = 0.01
# 2-D AbsoluteExponential covariance model
model = ot.AbsoluteExponential([1.0]*2)
# Simulate a large sample of realizations of the Gaussian process
sampleSize = 100000
sample = ot.GaussianProcess(model, mesh).getSample(sampleSize)
Then the random SVD algorithm is used:
# Enable the randomized SVD implementation via the ResourceMap
ot.ResourceMap_SetAsBool('KarhunenLoeveSVDAlgorithm-UseRandomSVD', True)
# (The KarhunenLoeveSVDAlgorithm-RandomSVDMaximumRank key could also be set here;
#  it defaults to 1000 singular values.)
algorithm = ot.KarhunenLoeveSVDAlgorithm(sample, s)
algorithm.run()
result = algorithm.getResult()
The result object contains the Karhunen-Loève decomposition of the process. This corresponds to the PCA with a regular grid (and equal weights).
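For the tabular data in the question (roughly 47,600 rows and 6 inputs), a more conventional route is scikit-learn's PCA; the random SVD above is mainly useful for much larger problems. The sketch below is only an illustration: the file and column handling are assumptions, but the pattern (standardize, fit, then inspect explained variance and loadings) is what you would compare against the random forest importances:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assumed file name; the first 6 columns are taken to be the inputs
df = pd.read_excel('process_data.xlsx')
X = df.iloc[:, :6]

# PCA is scale-sensitive, so standardize each input to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)

# Loadings: how strongly each original input contributes to each component
loadings = pd.DataFrame(pca.components_.T, index=X.columns,
                        columns=['PC%d' % (i + 1) for i in range(pca.n_components_)])
print(loadings)

A scatter plot of the first two columns of scores, coloured by one of the outputs (e.g. yield), is usually far more readable than plotting all 47,600 points against every component at once.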
I am working on a recommendation algorithm, and that has right now boiled down to finding the right clustering algorithm for the job.
Data
The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres and tags, and concatenated them into single documents (one for each movie). This gives me about 10,000 documents. These have then been vectorized with TF-IDF and autoencoded down to 64-dimensional feature vectors (loss = 0.0014, down from 22.14, in 30 epochs). The autoencoder is able to reconstruct the data well.
Clustering
Currently I am working with HDBSCAN, as it should be able to handle datasets with varying density, non-globular clusters, arbitrary cluster shapes, and so on; it should be the correct algorithm to use here. The 2D representation of the original 64-dimensional data (obtained with t-SNE) shows what seems to be a decently clusterable space, but I cannot get the HDBSCAN algorithm to work properly. Setting min_cluster_size to 15-30 gives me this; any higher and it sees all points as noise, and lowering gives me this. Or it just clusters a large majority of points into 1 cluster, with some additional very small clusters, and the rest as noise, like this. It just seems like it can't handle the data, but it does seem clusterable to me.
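For reference, this is roughly the kind of call I'm making (a minimal sketch using the hdbscan package; X is the array of 64-dimensional autoencoded vectors, and min_samples is a second knob I haven't swept yet):

import hdbscan

# X: (n_movies, 64) array of autoencoded feature vectors
clusterer = hdbscan.HDBSCAN(min_cluster_size=20,  # the value I've been varying (15-30)
                            min_samples=5,        # lower values make the clustering less conservative
                            metric='euclidean')
labels = clusterer.fit_predict(X)  # label -1 means the point was treated as noise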
My Questions:
How can fiddling with parameters help HDBSCAN to cluster this space?
Is there a better algorithm for clustering such a space?
Or is the data simply non-clusterable, from what you can see in the plots?
Thanks so much in advance, I've been struggling with this for hours now.
I am trying to cluster retail data in order to extract groupings of customers based on 6 input features. The data has a shape of (1712594, 6) in the following format:
I've split the 'Department' categorical variable into binary indicator columns using Pandas get_dummies(). I'm aware this is not optimal, but I just wanted to test it out before trying Gower distances.
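For concreteness, a minimal sketch of that encoding step (assuming the DataFrame is called df and the column is literally named 'Department'):

import pandas as pd

# Each department becomes its own 0/1 indicator column
df_encoded = pd.get_dummies(df, columns=['Department'])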
The Elbow method gives the following output:
USING:
I'm using Python and scikit-learn's KMeans because the dataset is so large and the more complex models are too computationally demanding for Google Colab.
OBSERVATIONS:
I'm aware that columns 1-5 are extremely correlated, but the data is limited sales data and little to no data is captured about customers. KMeans is very sensitive to its inputs, and this may affect the WCSS in the elbow method and cause the straight line, but this is just a hunch and I don't have any quantitative backing to support the argument. I'm a junior data scientist, so my knowledge of the technical foundations of clustering models and algorithms is still developing; forgive me if I'm missing something.
WHAT I'VE DONE:
There were massive outliers skewing the data: this is a building goods company, so most of their sale prices and quantities fall within a certain range, but ~5% of the data contained massive quantity entries (e.g. a company buying 300,000 bricks at R3/brick) or massive price entries (e.g. a company buying an expensive piece of equipment).
I've removed them and retained ~94% of the data. I've also removed the returns made by customers (i.e. negative quantities and prices), with the idea that I may create a binary 'Returned' variable to capture this feature. Here are some metrics:
These are some metrics before removing the outliers:
and these are the metrics after Outlier removal:
KMeans uses Euclidean distances. I've used both scikit-learn's StandardScaler and RobustScaler when scaling, without any significant change in either case. Here are some distribution plots and scatter plots for the 3 numeric variables:
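For reference, the scaling and elbow computation described above looks roughly like this (a sketch only; X stands for the encoded feature matrix and the range of k is arbitrary):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# X: the (n_samples, n_features) matrix after one-hot encoding
X_scaled = StandardScaler().fit_transform(X)

# Within-cluster sum of squares (WCSS) for a range of k, used for the elbow plot
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(km.inertia_)
print(wcss)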
Anybody have any practical/intuitive reasoning as to why this may be happening? Open to any alternative methods to use as well and any help would be much appreciated! Thanks
I am not an expert, but in my experience with scikit-learn cluster analysis, when the features are really similar in magnitude, k-means clustering usually does not do the job well. I would first try a StandardScaler to see if normalizing the data makes the clustering more effective. The elbow plot shows the within-cluster variance dropping steadily as you add clusters, with no clear elbow, and by the looks of that plot and the plots you provide, I would think the data is too similar, making it hard to separate into groups (clusters). Adding an additional feature made up of your existing data can do the trick.
I would try normalizing the data first with StandardScaler.
If the groups are still not very clear in a simple plot of the data, I would create another column made up of a combination of the other columns.
I would not suggest using DBSCAN, since the eps (distance) parameter would have to be tuned very finely and, as you mention, it is more computationally expensive.
I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean',sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: The plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 appears to be the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously if you have K == number of data points, you can explain 100% of the variance. The question is where the improvements in variance explained start to level off; a short sketch of this calculation follows at the end of this section.
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. I haven't seen this approach used to estimate K, though, and it is probably inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. not as many clusters as you're expecting).
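As a rough sketch of the elbow calculation mentioned above (assuming the data sits in a NumPy array X; the % variance explained at a given K is one minus the ratio of within-cluster to total sum of squares):

import numpy as np
from sklearn.cluster import KMeans

# Total sum of squares around the global mean
tss = ((X - X.mean(axis=0)) ** 2).sum()

for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    explained = 1.0 - km.inertia_ / tss  # fraction of variance explained at this K
    print(k, round(explained, 3))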
How much to sample
It looks like you've answered this from your plot: no matter what your sampling, you get the same pattern in silhouette score. So that pattern seems very robust to sampling assumptions.
k-means converges to local minima. The starting positions play a crucial role in finding the optimal number of clusters. It is often a good idea to reduce the noise and the dimensionality using PCA or another dimension-reduction technique before proceeding with k-means.
Just to add for the sake of completeness: it might be a good idea to determine the optimal number of clusters by "partitioning around medoids". It is equivalent to using the silhouette method.
The reason for the weird observations could be the different starting points for the different sample sizes.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable means is the worst-pair ratio, as discussed here: Clusterability.
Since there is no widely accepted best approach to determine the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc., fundamentally rely on some form of heuristic / trial-and-error argument. So to me, the best approach is to try out multiple techniques and NOT to develop over-confidence in any single one.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.
One more thing: the sklearn implementation of the Silhouette Score uses random (non-stratified) sampling. You can repeat the calculation multiple times with the same sample size (say sample_size=50000) to get a sense of whether that sample size is large enough to produce consistent results.
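A minimal sketch of that repetition (using the same call as in the question; the number of repeats and the sample size are arbitrary):

import numpy as np
from sklearn import metrics

scores = [metrics.silhouette_score(feature_matrix, cluster_labels,
                                   metric='euclidean',
                                   sample_size=50000,
                                   random_state=seed)  # a different random sample each repeat
          for seed in range(10)]
print(np.mean(scores), np.std(scores))  # a small spread suggests the sample size is adequate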
I have multiple sets of data, and in each set of data there is a region that is somewhat banana shaped and two regions that are dense blobs. I have been able to differentiate these regions from the rest of the data using a DBSCAN algorithm, but I'd like to use a supervised algorithm to have the program then know which cluster is the banana, and which two clusters are the dense blobs, and I'm not sure where to start.
As there are 3 categories (banana, blob, neither), would doing two separate logistic regressions be the best approach (evaluate whether it is banana or not-banana, and whether it is blob or not-blob)? Or is there a good way to incorporate all 3 categories into one neural network?
Here are three data sets. In each, the banana is red. In the 1st, the two blobs are green and blue; in the 2nd, the blobs are cyan and green; and in the 3rd, the blobs are blue and green. Now that the program has differentiated the different regions, I'd like it to label the banana and blob regions so I don't have to hand-pick them every time I run the code.
As you are using Python, one of the best options would be to start with a big library offering many different approaches, so you can choose the one that suits you best. One such library is scikit-learn: http://scikit-learn.org/stable/.
Getting back to the problem itself. What are the models you should try?
Support Vector Machines - this model has been around for a while and has become a gold standard in many fields, mostly due to its elegant mathematical interpretation and ease of use (it has far fewer parameters to worry about than classical neural networks, for instance). It is a binary classification model, but the library will automatically create a multi-class version for you.
Decision tree - very easy to understand, yet creates quite "rough" decision boundaries
Random forest - a model often used in the more statistical community; an ensemble of decision trees that usually gives smoother boundaries than a single tree.
K-nearest neighbours - the simplest approach, but if the shapes of your data are as easy to define as you describe, it will give very good results while remaining very easy to understand.
Of course there are many others, but I would recommend starting with these. All of them support multi-class classification, so you do not need to worry about how to encode a problem with three classes; simply create data in the form of two matrices x and y, where x holds the input values and y is a vector of the corresponding classes (e.g. numbers from 1 to 3).
Visualization of different classifiers from the library:
So the remaining question is how to represent the shape of a cluster - we need a fixed-length, real-valued vector, so what could the features actually represent? (A short sketch follows the list below.)
center of mass (if position matters)
skewness/kurtosis
covariance matrix (or its eigenvalues) (if rotation matters)
some kind of local density estimation
histograms of some statistics (like a histogram of pairwise Euclidean distances between pairs of points on the shape)
many, many more!
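A minimal sketch of the idea, using a couple of the descriptors above (center of mass and covariance eigenvalues) with one of the classifiers listed earlier; the variables clusters, labels and new_cluster are placeholders for your own segmented data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shape_features(points):
    # Fixed-length descriptor: center of mass + sorted eigenvalues of the covariance matrix
    center = points.mean(axis=0)
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(points.T)))[::-1]
    return np.concatenate([center, eigvals])

# clusters: list of (n_points, n_dims) arrays; labels: 1 = banana, 2 = blob, 3 = neither
X = np.array([shape_features(c) for c in clusters])
y = np.array(labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# For a newly segmented cluster:
print(clf.predict(shape_features(new_cluster).reshape(1, -1)))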
There is a quite comprehensive list and detailed overview here (for three-dimensional objects):
http://web.ist.utl.pt/alfredo.ferreira/publications/DecorAR-Surveyon3DShapedescriptors.pdf
There is also a quite informative presentation:
http://www.global-edge.titech.ac.jp/faculty/hamid/courses/shapeAnalysis/files/3.A.ShapeRepresentation.pdf
It describes some descriptors and how to make them scale/position/rotation invariant (if that is relevant here).
Neural networks could help; the "pybrain" library might be the best choice for it.
You could set up the neural net as a feed-forward network. Set it up so that there is an output for each class of object you expect the data to contain.
Edit: sorry if I have completely misinterpreted the question. I'm assuming you have pre-existing data you can feed in to train the network to differentiate clusters.
If there are 3 categories, you could have 3 outputs in the NN, or perhaps a single NN for each one that simply outputs a true or false value.
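As a rough sketch of that 3-output idea, here is the equivalent with scikit-learn's MLPClassifier (used instead of pybrain purely because its API is compact; X and y are assumed to be descriptors and class labels of previously identified clusters):

from sklearn.neural_network import MLPClassifier

# X: (n_known_clusters, n_features) descriptors; y: class per cluster (0 = banana, 1 = blob, 2 = neither)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X, y)

# Internally this gives one softmax output per class, i.e. the "one output per category" setup
print(net.predict(X))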
I believe you are still unclear about what you want to achieve.
That of course makes it hard to give you a good answer.
Your data seems to be 3D. In 3D you could, for example, compute the alpha shape of a cluster and check whether it is convex, because your "banana" probably is not convex, while your blobs are.
You could also measure, e.g., whether the cluster center actually lies inside the cluster. If it doesn't, the cluster is not a blob. You can also measure whether the extents along the three axes are roughly the same or not.
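A small sketch of those two checks (approximating "inside the cluster" by "inside its convex hull", which is only a rough proxy; points is an (n, 3) array of one cluster's members):

import numpy as np
from scipy.spatial import Delaunay

center = points.mean(axis=0)

# Point-in-convex-hull test via a Delaunay triangulation of the cluster
hull = Delaunay(points)
center_inside = hull.find_simplex(center) >= 0  # False hints at a non-blob (e.g. banana) shape

# Compare the extents along the three axes
extents = points.max(axis=0) - points.min(axis=0)
print(center_inside, extents)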
But in the end, you need some notion of "banana".
I have written code in Python to implement the DBSCAN clustering algorithm.
My dataset consists of 14k users, with each user represented by 10 features.
I am unable to decide what values to use for min_samples and epsilon.
How should I decide that?
The similarity measure is Euclidean distance (which makes it even tougher to decide). Any pointers?
It is pretty often hard to estimate DBSCAN's parameters.
Did you think about the OPTICS algorithm? In that case you only need min_samples, which would correspond to the minimal cluster size.
Otherwise, for DBSCAN I've done it in the past by trial and error: try some values and see what happens. A general rule is that if your dataset is noisy you should use a larger min_samples, and it is also correlated with the number of dimensions (10 in this case).
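A minimal sketch of that trial-and-error loop with scikit-learn (X is the (14000, 10) feature matrix; the candidate values are arbitrary and should be adapted to the scale of your features):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # Euclidean distance is scale-sensitive

for eps in [0.3, 0.5, 1.0, 2.0]:
    for min_samples in [5, 10, 20]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_frac = np.mean(labels == -1)
        print(eps, min_samples, n_clusters, round(noise_frac, 2))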