Dimensionality reduction by minimizing the spread between data points


I have a three-dimensional point dataset (please refer to the images) which has an underlying structure, as seen in Image 3. My goal is to classify these independent "data pillars". My naive approach would be to rotate the dataset about all axes, compute a statistical measure of spread for each orientation, choose the projection in which this measure is minimized, and then perform a cluster analysis on that projection.
Now to my question: Is there a way to do this analytically? The dimensionality-reduction algorithms I ran into always want to preserve the structure or maximize the variance in the dataset, but is there an algorithm that would fit my application case?
If you have a hint and want to share it here, I would appreciate it very much.
Thanks,
Jakob Krieger
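
One possible analytical reading of the approach described above, offered as a sketch rather than a confirmed answer: if "spread" is taken to mean total variance, the projection with the least variance is spanned by the eigenvectors of the covariance matrix with the smallest eigenvalues (the last principal components). The function name `min_spread_projection` and the synthetic pillars are assumptions made for illustration, and the idea only isolates the pillars when they are long compared with the distances between them.

```python
import numpy as np

def min_spread_projection(points, n_keep=2):
    """Project onto the n_keep directions of smallest variance (hypothetical helper)."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the first columns
    # of eigvecs are the directions with the least variance.
    _, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, :n_keep]
    return centered @ basis, basis

# Synthetic stand-in for the data: four thin pillars along z, then rotated.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
n_per_pillar = 200
xy = np.repeat(centers, n_per_pillar, axis=0) + rng.normal(0, 0.02, (4 * n_per_pillar, 2))
z = rng.uniform(0, 10, 4 * n_per_pillar)
pillars = np.column_stack([xy, z])

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
rotated = pillars @ Q.T                       # pillars now point along Q[:, 2]

projected, basis = min_spread_projection(rotated)
```

In this toy setup each pillar collapses to a tight blob in `projected`, which could then be passed to any 2D clustering method.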

Related

Clustering Subsets of a big dataset (2d and multi-dimensional)

How do you cluster a subset of a big dataset?
I have a big dataset of ~200000 high-dimensional points. There are ~25000 different meaningful combinations of the points, each containing around 10-200 points, and I would like to assess the clustering properties of those combinations. I have used UMAP to reduce the high-dimensional data to 2D, so analyzing the UMAP output is acceptable, but analyzing the original data would be better.
Traditional clustering methods (k-means, hierarchical clustering, and DBSCAN) could not account for what is considered a cluster here: the points occupy a small region as opposed to the entire space, even in 2D. They also generally cluster poorly because of the small amount of data, reporting multiple clusters when the extra points were actually outliers. I have made some progress with the level-set tree method in that regard, but the behavior of the algorithm is not always controllable (it only works for very typical cases). Are there any methods that you would suggest?

HDBSCAN on Movielens Latent embeddings does not cluster well

I am working on a recommendation algorithm, and that has right now boiled down to finding the right clustering algorithm for the job.
Data
The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres and tags, and concatenated them into single documents (one for each movie). This gives me about 10000 documents. These have then been vectorized with TF-IDF, and the resulting vectors autoencoded to 64-dim feature vectors (loss=0.0014 down from 22.14 in 30 epochs). The autoencoder is able to reconstruct the data well.
Clustering
Currently, I am working with HDBSCAN, as it should be able to handle datasets with varying density, non-globular clusters, arbitrary cluster shapes, etc. It should be the correct algorithm to use here. The 2D representation of the original 64-dimensional data (gathered by TSNE) shows what seems to be a decently clusterable space, but I cannot get the HDBSCAN algorithm to work properly. Setting the min_cluster_size to 15-30 gives me this, any higher and it sees all points as noise, and lowering gives me this. Or, it just clusters a large majority of points into 1 cluster, with some additional very small clusters, and the rest as noise, like this. It just seems like it can't handle the data, but it does seem to be clusterable to me.
My Questions:
How can fiddling with parameters help HDBSCAN to cluster this space?
Is there a better algorithm for clustering such a space?
Or is the data simply non-clusterable, from what you can see in the plots?
Thanks so much in advance, I've been struggling with this for hours now.

curve fitting by parts - lmfit Python

I would like to know if in Python, and more precisely in the lmfit library, there is an option for fitting data by parts. I would like to fit data defined in different ranges and then obtain a single fit.
Thank you

Without a more concrete example, it is hard to give a concrete answer. But, if I understand your question correctly, you are looking to do a fit to one specific region of your data, then a fit (probably with a different functional form) to another region of your data, and then perhaps combine the multiple regions to get a final fit.
If that is correct, then yes, this can be done with lmfit (and probably with other libraries as well). Let's say you want to fit data that is sort of peak-like with an exponentially decaying background. First, isolate a region around that peak (it doesn't have to be perfect) and fit a peak (say, a Gaussian) to it. Then fit an exponential decay to all the data except the peak area. (Aside: numpy.where can be very useful in identifying the regions.) Finally, combine the two and fit the whole curve to peak + background.
If that is too vague and doesn't point you in the right direction, please make the question more specific.

How to plot classification regions in a lower dimensional space?

I'm working in a space which has 8 dimensions (i.e. 8 features). I have plotted the data points in 2D by applying PCA as well as TSNE. Now I would like also to draw the borderlines of the classifiers I use as shown here. By the way, I'm using different classifiers (SVM, GNB, Logistic Regression).
This means that I have the different 8-dimensional points which I plot in 2D using PCA or TSNE. On top of this plot I would like to plot the different classification regions as shown in the link above.
Of course the classification boundaries/regions are also 8-dimensional. How can I turn the classification boundaries/regions into 2D matching my 2D data points?

Interesting question, I once wondered about it myself.
It can be answered in several ways, with more or less detail depending on whether you want to fully understand the method or just apply it.
As you didn't give a lot of detail but included a sklearn link, I will first answer from a technical point of view: "How can you do it with sklearn?"
There is a method for this: transform(X), which applies the PCA projection (yes, PCA is a projection from a high-dimensional space to a lower one).
So you basically just need to call transform(your_boundaries), where your_boundaries holds 8-dimensional points.
In pseudocode this would give:
from sklearn.decomposition import PCA
pca = PCA(n_components=2).fit(data)
boundaries_2d = pca.transform(boundaries)
Et voilà!
Do not hesitate to give more details or ask questions. I could add some specific development if it is relevant.
Hope it helps
pltrdy
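
Note that transform only maps points, so the boundaries would have to be available as sampled 8-dimensional points. A commonly used alternative, which is not the method described in the answer above, goes the other way: build a grid in the 2D PCA plane, map it back to 8 dimensions with inverse_transform, and ask the classifier for a label at each grid point. The sketch below assumes synthetic data from make_classification and a LogisticRegression classifier. It shows the classifier restricted to the PCA plane, not the full 8-dimensional regions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Assumed stand-in for the real 8-feature dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)  # trained in 8 dimensions
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# Regular grid covering the 2D projection of the data.
pad = 1.0
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - pad, X_2d[:, 0].max() + pad, 200),
    np.linspace(X_2d[:, 1].min() - pad, X_2d[:, 1].max() + pad, 200),
)
grid_2d = np.c_[xx.ravel(), yy.ravel()]

# Map each grid point back into the 8-dimensional feature space
# and classify it there.
grid_8d = pca.inverse_transform(grid_2d)
zz = clf.predict(grid_8d).reshape(xx.shape)
```

With matplotlib, `plt.contourf(xx, yy, zz, alpha=0.3)` followed by a scatter plot of `X_2d` would draw the regions under the projected points. This does not carry over directly to TSNE, which has no inverse_transform in sklearn.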

Is there a good and easy way to visualize high dimensional data?

Can someone please tell me if there is a good (easy) way to visualize high dimensional data? My data currently has 21 dimensions, and I would like to see whether it is dense or sparse. Are there techniques to achieve this?

Parallel coordinates are a popular method for visualizing high-dimensional data.
What kind of visualization is best for your data in particular will depend on its characteristics: how correlated are the different dimensions?

Principal component analysis could be helpful if the dimensions are correlated.

The buzzword I would search for is multidimensional scaling. It is a technique to develop a projection from the high dimensional space to a lower space (2 or 3 dimensional) in such a way that points which are close in the full space will be close in the projection.
It is often used for visualising the output of clustering algorithms (i.e. if your clusters are compact in the MDS projection there is a good chance they are also in the full space).
Edit: This wouldn't necessarily help with determining if the data is dense or sparse, because you lose the scale in the projection, but it would show whether it is uniform or clumpy (perhaps that's what you mean).
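
As a rough illustration of the idea, the sketch below runs scikit-learn's MDS on made-up 21-dimensional data consisting of two clumps. The data and the variable names are assumptions. The point is only that clumps which are well separated in the full space stay separated in the 2D embedding.

```python
import numpy as np
from sklearn.manifold import MDS

# Assumed toy data: two clumps of 50 points each in 21 dimensions.
rng = np.random.default_rng(0)
clump_a = rng.normal(0.0, 1.0, (50, 21))
clump_b = rng.normal(6.0, 1.0, (50, 21))
data = np.vstack([clump_a, clump_b])

# Metric MDS looks for 2D coordinates whose pairwise distances
# approximate the pairwise distances in the original space.
embedding = MDS(n_components=2, random_state=0).fit_transform(data)
```

A scatter plot of `embedding[:, 0]` against `embedding[:, 1]` would then show whether the points look uniform or clumpy.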

Not sure what kind of patterns you would like to see from the data. t-SNE and its faster variant Barnes-Hut-SNE do a very good job in visualizing groups of related concepts for high-dimensional data. It is available through R.
There is a short tutorial on using it against high-dimensional data with about 300 dimensions.
http://www.codeproject.com/Tips/788739/Visualizing-High-Dimensional-Vector-using-T-SNE-wi

I was looking for ways to visualize high dimensional data and found this t-SNE technique that has been used effectively. Might help others as well.

Take a look at http://www.ggobi.org; its tours, parallel coordinates, and scatterplot matrices can be used for real-valued variables. Also see http://cranvas.org for a more recent tool, and the tourr package in R.

Try using http://hypertools.readthedocs.io/en/latest/.
HyperTools is a library for visualizing and manipulating high-dimensional data in Python.

Star Schema.
http://en.wikipedia.org/wiki/Star_schema
Works well for high-dimensional data.
If the cardinality of your fact table is close to the product of your dimension sizes, you have dense data.
If the cardinality of your fact table is smaller than the product of your dimension sizes, you have sparse data.
In the middle you have a judgement call.
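
The comparison above can be written as a simple ratio. The helper name `fact_table_density` and the example sizes below are made up for illustration.

```python
from math import prod

def fact_table_density(fact_rows, dimension_sizes):
    """Fraction of possible dimension combinations that occur in the fact table."""
    return fact_rows / prod(dimension_sizes)

# Three dimensions with 10, 20 and 50 members give 10 * 20 * 50 = 10000 cells.
dense_ratio = fact_table_density(9000, [10, 20, 50])   # close to 1: dense
sparse_ratio = fact_table_density(50, [10, 20, 50])    # close to 0: sparse
```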

The curios.IT data exploration software is designed for the visualization of high dimensional data: data is shown as a collection of 3D objects (one for each data group) which can show up to 13 variables at the same time. The relationships between data variables and visual features are much easier to remember than with other techniques (like parallel coordinates).
