Is there a good and easy way to visualize high dimensional data? - python

Can someone please tell me if there is a good (easy) way to visualize high dimensional data? My data currently has 21 dimensions, and I would like to see whether it is dense or sparse. Are there techniques to achieve this?

Parallel coordinates are a popular method for visualizing high-dimensional data.
Which visualization is best for your data in particular will depend on its characteristics: for example, how correlated are the different dimensions?

Principal component analysis could be helpful if the dimensions are correlated.
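
As a rough illustration of the PCA suggestion, here is a minimal NumPy sketch. The helper name `pca_project` and the synthetic 21-dimensional data are assumptions made for this example, not something from the question. It centers the data, takes an SVD, and reports how much variance the first two components capture, which gives a quick hint about how strongly the dimensions are correlated.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center every dimension
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)              # variance ratio per component
    return Xc @ Vt[:n_components].T, explained

# Hypothetical data: 21 observed dimensions driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 21))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 21))

coords, explained = pca_project(X)
print(coords.shape)            # 2D coordinates, ready for a scatter plot
print(explained[:2].sum())     # share of variance kept by the 2D view
```

If the first two ratios sum to a value close to 1, a 2D scatter plot of `coords` loses little; if they are small, the dimensions are only weakly correlated and the projection hides a lot.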

The buzzword I would search for is multidimensional scaling. It is a technique to develop a projection from the high dimensional space to a lower space (2 or 3 dimensional) in such a way that points which are close in the full space will be close in the projection.
It is often used for visualising the output of clustering algorithms (i.e. if your clusters are compact in the MDS projection there is a good chance they are also in the full space).
Edit: This wouldn't necessarily help with determining if the data is dense or sparse, because you lose the scale in the projection, but it would show whether it is uniform or clumpy (perhaps that's what you mean).
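
To make the idea concrete, here is a small sketch of one specific variant, classical (Torgerson) MDS, written with NumPy only. The function name `classical_mds` and the two synthetic clumps are assumptions for illustration; other MDS variants (metric or non-metric, fitted by iterative stress minimization) behave differently.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS computed from Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ D2 @ J                              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]        # largest eigenvalues first
    scale = np.sqrt(np.clip(vals[top], 0.0, None))     # guard against tiny negatives
    return vecs[:, top] * scale

# Hypothetical "clumpy" data: two groups of 50 points in 21 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 21)),
               rng.normal(6.0, 1.0, size=(50, 21))])

emb = classical_mds(X)
gap = np.linalg.norm(emb[:50].mean(axis=0) - emb[50:].mean(axis=0))
print(emb.shape, gap)   # the two clumps should stay well separated in 2D
```

Plotting `emb` as a scatter plot would then show whether the points look uniform or clumpy.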

I'm not sure what kind of patterns you would like to see in the data. t-SNE and its faster variant Barnes-Hut-SNE do a very good job of visualizing groups of related concepts in high-dimensional data. It is available through R.
There is a short tutorial on using it against high-dimensional data with about 300 dimensions.
http://www.codeproject.com/Tips/788739/Visualizing-High-Dimensional-Vector-using-T-SNE-wi
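
Since the question is tagged python, here is a hedged sketch using scikit-learn's `TSNE` rather than the R implementation mentioned above. It assumes scikit-learn is installed; the synthetic data and the parameter values (`perplexity=15`, `init="pca"`) are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: two groups of 40 points in 21 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 21)),
               rng.normal(8.0, 1.0, size=(40, 21))])

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)   # one 2D point per input row
```

Keep in mind that distances and cluster sizes in a t-SNE plot are not directly comparable to those in the original space, so it is better suited to spotting groups than to judging density.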

I was looking for ways to visualize high dimensional data and found this t-SNE technique that has been used effectively. Might help others as well.

Take a look at GGobi (http://www.ggobi.org): its tours, parallel coordinates, and scatterplot matrices can be used for real-valued variables. Also see http://cranvas.org for a more recent tool, and the tourr package in R.

Try using http://hypertools.readthedocs.io/en/latest/.
HyperTools is a library for visualizing and manipulating high-dimensional data in Python.

Star Schema.
http://en.wikipedia.org/wiki/Star_schema
Works well for high-dimensional data.
If the cardinality of your fact table is close to the product of your dimension sizes, you have dense data.
If the cardinality of your fact table is smaller than the product of your dimension sizes, you have sparse data.
In the middle you have a judgement call.
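
The rule of thumb above can be written as a simple ratio. This is only a sketch: the function name `fact_table_density` and the dimension sizes are hypothetical, and it assumes every fact row corresponds to a distinct combination of dimension keys.

```python
from math import prod

def fact_table_density(n_fact_rows, dimension_sizes):
    """Fraction of all possible dimension-key combinations present in the fact table."""
    possible = prod(dimension_sizes)   # size of the full cross product
    return n_fact_rows / possible

dims = [10, 12, 5]                     # hypothetical dimension table sizes (600 combinations)

dense = fact_table_density(540, dims)
sparse = fact_table_density(30, dims)
print(dense)    # 0.9  -> close to the product, so dense
print(sparse)   # 0.05 -> far below the product, so sparse
```

Where to draw the line between the two is the judgement call mentioned above; the ratio only makes the comparison explicit.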

The curios.IT data exploration software is designed for the visualization of high dimensional data: data is shown as a collection of 3D objects (one for each data group) which can show up to 13 variables at the same time. The relationships between data variables and visual features are much easier to remember than with other techniques (like parallel coordinates).

Related

HDBSCAN on Movielens Latent embeddings does not cluster well

I am working on a recommendation algorithm, and that has right now boiled down to finding the right clustering algorithm for the job.
Data
The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres and tags, and concatenated them into single documents (one for each movie). This gives me about 10000 documents. These have then been vectorized with TF-IDF, and the resulting vectors autoencoded to 64-dim feature vectors (loss=0.0014 down from 22.14 in 30 epochs). The AutoEncoder is able to reconstruct the data well.
Clustering
Currently, I am working with HDBSCAN, as it should be able to handle datasets with varying density, non-globular clusters, arbitrary cluster shapes, etc. It should be the correct algorithm to use here. The 2D representation of the original 64-dimensional data (obtained with t-SNE) shows what seems to be a decently clusterable space, but I cannot get the HDBSCAN algorithm to work properly. Setting min_cluster_size to 15-30 gives me this, any higher and it sees all points as noise, and lowering gives me this. Or, it just clusters a large majority of points into 1 cluster, with some additional very small clusters, and the rest as noise, like this. It just seems like it can't handle the data, but it does seem to be clusterable to me.
My Questions:
How can fiddling with parameters help HDBSCAN to cluster this space?
Is there a better algorithm for clustering such a space?
Or is the data simply non-clusterable, from what you can see in the plots?
Thanks so much in advance, I've been struggling with this for hours now.

Using np.piecewise to have segmented linear regression to large data set

For my data analysis I wish to get linear fits for different segments of the data. Since it's a large data set, I want Python to calculate both the linear fits and the corresponding segments. I think the best way of doing this is by using np.piecewise, but the solutions I have found assume that the number of linear fits or the segment boundaries are already known. Does anyone have a nice way of doing this?
Cheers

Dimensionality reduction by minimizing the spread between data points

I have a three-dimensional point dataset (please refer to the images) which has an underlying structure, as seen in Image 3. My goal is to classify these independent "data pillars". My naive approach would be to rotate the dataset about all axes, compute a statistical measure of spread for each projection, choose the projection on which this measure is minimized, and then perform a cluster analysis.
Now to my question: Is there a way to do this analytically? The dimensionality-reduction algorithms I ran into always want to preserve the structure or maximize the variance in the dataset, but is there an algorithm that would fit my application case?
If you have a hint and want to share it here, it would be very appreciated by me.
Thankfully,
Jakob Krieger

Difficulty in understanding linear regression with multiple features

Let's say the price of houses (target variable) can be easily plotted against the area of houses (predictor variable), and we can see the data plotted and draw a best fit line through the data.
However, consider the case where the predictor variables are size, no. of bedrooms, locality, no. of floors, etc. How am I going to plot all these against the target variable and visualize them on a 2-D figure?
The computation shouldn't be an issue (the math works regardless of dimensionality), but the plotting definitely gets tricky. PCA can be hard to interpret and forcing orthogonality might not be appropriate here. I'd check out some of the advice provided here: https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Fundamentally, it depends on what you are trying to communicate. Goodness of fit? Maybe throw together multiple plots of residuals.
If you truly want a 2D figure, that's certainly not easy. One possible approach would be to reduce the dimensionality of your data to 2 using something like Principal Component Analysis. Then you can plot it in two dimensions again. Reducing to 3 dimensions instead of 2 might also still work, since humans can understand 3D plots drawn on a 2D screen fairly well.
You don't normally need to do linear regression by hand though, so you don't need a 2D drawing of your data either. You can just let your computer compute the linear regression, and that works perfectly fine with way more than 2 or 3 dimensions.
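
As a small sketch of letting the computer do the fit, the following uses NumPy's least-squares solver on synthetic housing data. The predictors, coefficients, and noise level are all made up for the example. The fit works the same way with any number of features, and the fitted values and residuals it produces are one-dimensional, so they can always be drawn in 2D (for instance predicted versus actual price, or residuals versus predicted price).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: size (m^2), number of bedrooms, number of floors.
size = rng.uniform(50, 250, n)
bedrooms = rng.integers(1, 6, n)
floors = rng.integers(1, 4, n)
X = np.column_stack([size, bedrooms, floors])

# Synthetic target: a linear model with known coefficients plus noise.
true_coef = np.array([1.5, 10.0, 5.0])
price = 20.0 + X @ true_coef + rng.normal(0.0, 2.0, n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)

predicted = A @ coef
residuals = price - predicted
print(coef)               # intercept followed by one coefficient per feature
print(residuals.std())    # typical size of the prediction error
```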

Measuring "mixtureness" of labeled data (python)

I have some 2D data:
The data is labeled and shown in different colors. An unsupervised process will definitely not yield correct predictions, because the data is pretty mixed (although the colors seem to have regions of preference). I want to see if it is possible to measure how mixed the points from different sets are.
For this I need to define a measure of how mixed they are (I think that this should exist). Also it would be nice to have these algorithms implemented. I am also looking for a simple predictive model that can be trained using the data shown. Thanks for your help. If possible I'm looking for these implementations in Python.
