Plotting a numpy array with 256 columns used for k-means - python

I have a numpy array with this shape: (109, 256). Every row is a frame and every column is one bin of the frame's 8-bit histogram.
With k-means I cluster the histograms to get a summary of the frames, where every cluster should be a "scene" of frames with similar histograms.
But how can I plot a representative graphic of the k-means result when there are 256 columns?
I'm trying with this typical example:
plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')
But yeah, that shows only 2 of the columns, so it doesn't really represent the problem. Any help? I'm really new to Python and machine learning.
PS: my k-means code works well and clusters the way I want; I just don't know how to represent the result correctly.

You always represent k-means clustering results on two axes, and those axes can be picked arbitrarily. The only way to include more attributes is to map other variables to the size of your points (for example, the higher the income, the bigger the point) or to different color shades.
Otherwise, you seem to have done everything correctly; you have to stick to two variables for your axes and can't integrate more.
You can also create more plots with different axis pairs and arrange them in a grid, as sketched below (often this doesn't add much information, though).
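For instance, a minimal sketch of such a grid, assuming X is the asker's (109, 256) histogram array and kmeans is the already-fitted scikit-learn KMeans estimator (the column pairs below are picked arbitrarily):

import matplotlib.pyplot as plt

# Arbitrarily chosen histogram bins to pair up; any other pairs would do.
pairs = [(0, 1), (0, 128), (64, 192), (128, 255)]

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, (i, j) in zip(axes.ravel(), pairs):
    ax.scatter(X[:, i], X[:, j], c=kmeans.labels_, cmap='rainbow', s=15)
    ax.set_xlabel(f'histogram bin {i}')
    ax.set_ylabel(f'histogram bin {j}')
fig.suptitle('k-means labels on pairs of histogram bins')
fig.tight_layout()
plt.show()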

Related

I have a 3D dataset of coordinates x,y,z. How do I check if the dataset is normally distributed?

The dataset is large with over 15000 rows.
One row of x,y,z plots a point on a 3D plot.
I need to scale the data, and so far I'm using RobustScaler(), but I want to determine whether or not the dataset is normally distributed.
A Matplotlib histogram (plt.hist()) can be used to check the data distribution. If the histogram is roughly bell-shaped, with a single peak in the middle and symmetric tails, the data may be normally distributed; note that a central peak alone is not conclusive.
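A minimal sketch of that visual check, assuming data is the (15000+, 3) array of x, y, z coordinates; scipy.stats.normaltest is added here as a more formal test and is not part of the original answer:

import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, i, name in zip(axes, range(3), 'xyz'):
    ax.hist(data[:, i], bins=50)
    # Small p-values argue against normality for that coordinate.
    p = stats.normaltest(data[:, i]).pvalue
    ax.set_title(f'{name} (normaltest p = {p:.3g})')
fig.tight_layout()
plt.show()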

Is there a way to count the number of points within a certain area on a graph?

I've got output graphs that look like this:
My question is: is there an easy way for me to count the number of points within each of the obvious 'lines' or 'streaks' of particles? In other words, I need to find the density of each streak separately. Most of them overlap, which is where my issue comes in.
I've tried specifying x and y limits but again, the overlapping comes into play. The existing code is just importing and plotting values.
Ken, thanks for your comment. I went down that path and found that single linkage works best for the type of clusters I have. I also had to find a multiplying factor for my own data first, because the clustering was failing where the data overlapped. With this data the different colours represent different clusters. The dendrogram x-axis is labelled with the cluster densities, but they aren't in order! I've yet to find an efficient way around this. I manually adjusted the dendrogram to produce 2 clusters first, which told me the density of the first shell (it produced 2 clusters: 1 for the first shell and 1 with everything else), then repeated it for 3, 4, etc.
Sorry if none of this makes sense! It's quite late/early here.
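For reference, a minimal sketch of the single-linkage counting described in the comment; pts (the plotted x, y values as an (N, 2) array) and scale (the empirically found multiplying factor) are assumed names:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(pts * scale, method='single')

# Cut the tree into k flat clusters; repeating for k = 2, 3, 4, ...
# reproduces the manual shell-by-shell procedure described above.
k = 4
labels = fcluster(Z, t=k, criterion='maxclust')

# Point count ("density") of each streak; fcluster labels start at 1.
counts = np.bincount(labels)[1:]
for cluster, n in enumerate(counts, start=1):
    print(f'cluster {cluster}: {n} points')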

Matplotlib imshow()

I am stuck with Python and Matplotlib's imshow(). The aim is to show a two-dimensional color map that represents three dimensions.
My x-axis is represented by an array 'TG' (93 entries). My y-axis is a set of arrays dependent on 'TG': to be precise, 93 different arrays, each of length 340. My z-axis is also a set of arrays dependent on 'TG', the same size as y (93x340).
Basically, what I have is a set of two-dimensional measurements which I want to plot in a color dependent on a third array. Is there a clever way to do that? I tried to find out on my own first, but all I found addresses the more common problem of just a z-plane (a two-dimensional plot). So I have two matrices of order (93x340) and one array (93). Do you have any helpful advice?
Without more detail on your specific problem, it's hard to guess what is the best way to represent your data. I am going to give an example, hopefully it is relevant.
Suppose we are collecting height and weight of a group of people. Maybe the index of the person is your first dimension, and the height and weight depends on who it is. Then one way to represent this data is use height and weight as the x and y axes, and plot each person as a dot in that two dimensional space.
In this example, the person index doesn't really have much meaning, thus no color is needed.
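Translating that example back to the arrays in the question (a sketch, assuming Y and Z are the (93, 340) matrices and that TG, unlike the person index, does carry meaning and is worth encoding as color):

import numpy as np
import matplotlib.pyplot as plt

# Give every (y, z) measurement the TG value it belongs to.
tg_grid = np.repeat(TG[:, None], Y.shape[1], axis=1)  # shape (93, 340)

# Each measurement becomes a dot in the (y, z) plane, colored by TG.
sc = plt.scatter(Y.ravel(), Z.ravel(), c=tg_grid.ravel(), s=4, cmap='viridis')
plt.xlabel('y value')
plt.ylabel('z value')
plt.colorbar(sc, label='TG')
plt.show()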

Python adaptive histogram widths

I am currently working on a project where I have to bin up to 10-dimensional data. This works fine with numpy.histogramdd; however, I have run into a serious obstacle:
My parameter space is pretty large, but only a fraction of it is actually inhabited by data (say, maybe a few % or so...). The data is quite rich in those regions, so I would like to use relatively small bin widths there. The problem, however, is that the RAM usage totally explodes: I see usage of 20GB+ for only 5 dimensions, which is already completely impractical. I tried defining the grid myself, but the problem persists...
My idea would be to manually specify the bin edges, using very large bin widths for the empty regions of the data space. Only in regions where I actually have data would I need to go to a finer scale.
I was wondering if anyone here knows of such an implementation already which works in arbitrary numbers of dimensions.
thanks 😊
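For what it's worth, the manual-edges idea is directly supported: numpy.histogramdd accepts explicit, non-uniform edges per dimension. A tiny sketch with made-up edges, coarse over the empty region and fine where the data lives:

import numpy as np

# Hypothetical 2-D stand-in for the real data: everything lives near 0.2.
data = np.random.default_rng(0).normal(0.2, 0.01, size=(1000, 2))

# One wide bin below the populated region, fine bins across it, one wide bin above.
fine = np.linspace(0.1, 0.3, 41)
edges_1d = np.concatenate(([-10.0], fine, [10.0]))

hist, edges = np.histogramdd(data, bins=[edges_1d, edges_1d])
print(hist.shape)  # (42, 42): memory follows the bin count, not the data span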
I think you should first remap your data, then create the histogram, and then interpret the histogram knowing the values have been transformed. One possibility would be to tweak the histogram tick labels so that they display mapped values.
One possible way of doing it, for example, would be:
1. Sort one dimension of the data as a one-dimensional array;
2. Integrate this array, so you have a cumulative distribution;
3. Find the steepest part of this distribution, and choose a horizontal interval corresponding to a "good" bin size for the peak of your histogram - that is, a size that gives you good resolution;
4. Find the size of this same interval along the vertical axis. That will give you a bin size to apply along the vertical axis;
5. Create the bins using the vertical span of that interval - that is, "draw" horizontal, equidistant lines to create your bins, instead of the more common vertical ones.
That way, you'll have lots of bins where the data is dense, and fewer bins where it is sparse.
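In other words, the edges are equidistant cuts along the cumulative distribution mapped back to data values, i.e. quantiles. A minimal sketch of that idea, assuming data is an (n_samples, n_dims) array, with the resulting edges fed straight into numpy.histogramdd:

import numpy as np

def quantile_edges(values, n_bins):
    """Bin edges holding roughly equal numbers of points per bin:
    equidistant cuts along the cumulative distribution, mapped back
    to data values."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

n_bins = 20  # per dimension; an arbitrary example value
edges = [quantile_edges(data[:, d], n_bins) for d in range(data.shape[1])]

# histogramdd accepts explicit, non-uniform edges per dimension, so the
# memory cost is n_bins ** n_dims regardless of how the data is spread.
hist, edges = np.histogramdd(data, bins=edges)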
Two things to consider:
The mapping function is the cumulative distribution of the sorted values along that dimension. This can be quite arbitrary. If the distribution resembles some well known algebraic function, you could define it mathematically and use it to perform a two-way transform between actual value data and "adaptive" histogram data;
This applies to only one dimension. Care must be taken as to how this would work if the histograms from multiple dimensions are to be combined.

Automatically stretch a colormap in python for better visualization

I'm using matplotlib to plot data that clusters heavily around the values 0, 0.5, and 1, so it is difficult to see differences within each cluster using a sequential colormap (which is what I want to use).
I would like to "stretch" the colormap around the values where my data clusters, so that you can see the contrast within each cluster as well as between clusters.
I saw this similar question, but it doesn't quite get me to where I want:
Matlab, Python: Fixing colormap to specified values
Thanks!
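One possible approach, sketched below as an assumption rather than an answer from the linked question: derive the color boundaries from the data's own quantiles with matplotlib.colors.BoundaryNorm, so that colormap resolution is spent where the values cluster:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm

# Toy data clustered around 0, 0.5, and 1, as in the question.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.02, 300),
                       rng.normal(0.5, 0.02, 300),
                       rng.normal(1.0, 0.02, 300)]).reshape(30, 30)

# 257 quantile boundaries give 256 color bands, narrow where data is dense.
levels = np.quantile(data, np.linspace(0.0, 1.0, 257))
norm = BoundaryNorm(levels, ncolors=256)

plt.imshow(data, cmap='viridis', norm=norm)
plt.colorbar(label='data value (unevenly spaced color bands)')
plt.show()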
