I am currently working on a project where I have to bin up to 10-dimensional data. This works totally fine with numpy.histogramdd; however, I have one serious obstacle:
My parameter space is pretty large, but only a fraction of it is actually inhabited by data (say, a few % or so...). In those regions the data is quite rich, so I would like to use relatively small bin widths. The problem, however, is that the RAM usage explodes: I see 20 GB+ for only 5 dimensions, which is already completely impractical. I tried defining the grid myself, but the problem persists...
My idea would be to manually specify the bin edges, using very large bin widths for the empty regions of the data space. Only in regions where I actually have data would I need to go to a finer scale.
I was wondering if anyone here knows of such an implementation already which works in arbitrary numbers of dimensions.
thanks 😊
I think you should first remap your data, then create the histogram, and then interpret the histogram knowing the values have been transformed. One possibility would be to tweak the histogram tick labels so that they display mapped values.
One possible way of doing it, for example, would be:
Sort one dimension of the data as a one-dimensional array;
Integrate this array, so you have a cumulative distribution;
Find the steepest part of this distribution, and choose a horizontal interval corresponding to a "good" bin size for the peak of your histogram - that is, a size that gives you good resolution;
Find the size of this same interval along the vertical axis. That will give you a bin size to apply along the vertical axis;
Create the bins using the vertical span of that bin - that is, "draw" horizontal, equidistant lines to create your bins, instead of the most common way of drawing vertical ones;
That way, you'll have lots of bins where the data is more dense, and fewer bins where it is more sparse.
Two things to consider:
The mapping function is the cumulative distribution of the sorted values along that dimension. This can be quite arbitrary. If the distribution resembles some well-known algebraic function, you could define it mathematically and use it to perform a two-way transform between the actual values and the "adaptive" histogram values;
This applies to only one dimension. Care must be taken as to how this would work if the histograms from multiple dimensions are to be combined.
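A minimal sketch of the one-dimensional version of this idea (all names and numbers are illustrative): equidistant horizontal cuts through the empirical CDF are exactly evenly spaced quantiles, so np.quantile yields the adaptive edges directly:

import numpy as np

rng = np.random.default_rng(0)
# hypothetical sample: a dense cluster inside a mostly empty range
x = np.concatenate([rng.normal(5.0, 0.1, 10_000), rng.uniform(0.0, 100.0, 500)])

# "horizontal, equidistant lines" on the CDF = evenly spaced quantiles:
# bins come out narrow where data is dense and wide where it is sparse
edges = np.quantile(x, np.linspace(0.0, 1.0, 51))  # 50 adaptive bins
counts, _ = np.histogram(x, bins=edges)

For the original multi-dimensional problem, one such edge array per dimension can be passed to numpy.histogramdd via its bins argument, which keeps the total bin count (and hence the memory) down.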
Related
I have a numpy array with this shape: (109, 256). Every row is a frame and every column is a value of the frame's histogram (8 bits).
With k-means I cluster the histograms to get a summary of the frames. I want something like this:
Where every cluster should be a "scene" with similar histograms.
But how can I plot a representative graphic of the k-means result when there are 256 columns?
I'm trying with this typical example:
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='rainbow')
But yeah, it shows only 2 columns and it doesn't represent the problem. Any help? I'm really new to Python and machine learning.
PS: my k-means code works well and clusters the way I want, but I don't know how to represent it correctly.
You always represent k-means clustering results on two axes. Those axes can be picked randomly. The only way you can include more attributes is by adapting the size of your points to another variable (for example, the higher the income, the bigger the point) or by using different color shades.
Otherwise, you seem to have done everything correctly; you have to stick to two variables for your axes and can't integrate more.
You can decide to create more plots with different axes and arrange them in a grid (often this doesn't add much information, though).
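A minimal sketch of the point-size idea (the cluster count and the attribute encoded as size are purely illustrative, and the data is a random stand-in for the (109, 256) histogram matrix):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((109, 256))  # hypothetical stand-in for the histogram matrix

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# two of the 256 bins serve as axes; a third attribute (here the total
# mass of each histogram) is encoded as the point size
sizes = 200 * X.sum(axis=1) / X.sum(axis=1).max()
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=sizes, cmap='rainbow')
plt.xlabel('histogram bin 0')
plt.ylabel('histogram bin 1')
plt.show()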
I'm trying to get statistics on a distribution, but all the libraries I've seen require the input to be the raw samples - that is, a huge long array of numbers like what plt.hist wants as input.
I have the bar chart equivalent, i.e. two arrays: one with the x-axis centre points, and one with the y-axis values (frequencies) for each of those points. The plot looks like this:
My question is how I can compute statistics such as the mean, range, skewness and kurtosis of this dataset. The numbers are not always integers. It seems very inefficient to force Python to build a raw-sample array with, for example, 180 copies of 0.125, 570 copies of 0.25, etc., as in the figure above.
Taking the mean of the array I currently have would give me the average frequency over all sizes, i.e. a horizontal line on the figure above. I'd like a vertical line showing the average, as if it were a distribution.
Feels like there should be an easy solution! Thanks in advance.
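For what it's worth, a minimal sketch of computing such statistics as weighted moments directly from centres and counts (the first two value/count pairs are the ones from the question; the rest are made up):

import numpy as np

centres = np.array([0.125, 0.25, 0.5, 1.0, 2.0])
counts = np.array([180, 570, 320, 40, 5])

mean = np.average(centres, weights=counts)               # weighted mean
var = np.average((centres - mean) ** 2, weights=counts)  # weighted variance
std = np.sqrt(var)
skew = np.average((centres - mean) ** 3, weights=counts) / std**3
kurt = np.average((centres - mean) ** 4, weights=counts) / std**4  # non-excess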
I am stuck with Python and matplotlib's imshow(). The aim is to show a two-dimensional color map which represents three dimensions.
My x-axis is represented by an array 'TG' (93 entries). My y-axis is a set of arrays dependent on 'TG': to be precise, 93 different arrays, each of length 340. My z-axis is also a set of arrays dependent on 'TG', equally sized as y (93x340).
Basically, what I have is a set of two-dimensional measurements which I want to plot in color dependent on a third array. Is there a clever way to do that? I tried to find out on my own first, but everything I found addresses the more common case of a plain z-plane (a two-dimensional plot). So I have two matrices of order (93x340) and one array (93). Do you have any helpful advice?
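One possible sketch of this (not necessarily the intended solution; the arrays are random stand-ins with the stated shapes) is matplotlib's pcolormesh, which accepts a full (93, 340) coordinate grid rather than just an image:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
TG = np.linspace(0.0, 1.0, 93)              # (93,)   x values
Y = np.sort(rng.random((93, 340)), axis=1)  # (93, 340) y arrays, one per TG entry
Z = rng.random((93, 340))                   # (93, 340) z values to color by

# broadcast TG so that every (x, y) pair has a color value
X = np.broadcast_to(TG[:, None], Y.shape)

plt.pcolormesh(X, Y, Z, shading='nearest')
plt.colorbar(label='z')
plt.show()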
Without more detail on your specific problem, it's hard to guess the best way to represent your data. I'll give an example; hopefully it is relevant.
Suppose we are collecting the height and weight of a group of people. The index of the person would be your first dimension, and the height and weight depend on who it is. Then one way to represent this data is to use height and weight as the x and y axes, and plot each person as a dot in that two-dimensional space.
In this example, the person index doesn't really have much meaning, thus no color is needed.
When I try to make a scatter plot colored by density, it takes forever.
Probably because the data set is quite large.
This is basically how I do it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)  # density estimate at every point
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')
As additional info, I should add that:
>>> len(x_values)
809649
>>> len(y_values)
809649
Is there any other option to get the same result, but faster?
No, there are no good solutions.
Every point has to be prepared, and a circle drawn for it, which will probably end up hidden behind other points anyway.
My tricks (note: these may change the output slightly):
Get the minimum and maximum, and set the image to that size up front, so that the figure does not need to be redrawn.
Remove as much data as possible:
duplicate data;
data that becomes duplicate after rounding to a chosen precision (e.g. of the floats). You may derive the precision from half the dot size (or from the resolution of the graph, if you want the original look).
Less data means more speed: removing points is far quicker than drawing ones that will only be painted over.
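A minimal sketch of the round-and-deduplicate trick (the precision value is hypothetical; tune it to your marker size and axis range):

import numpy as np

precision = 0.05  # hypothetical: roughly half a marker width, in data units
xy = np.column_stack([x_values, y_values])
# snap the points to a grid of that precision, then drop the duplicates
xy_reduced = np.unique(np.round(xy / precision) * precision, axis=0)
x_r, y_r = xy_reduced[:, 0], xy_reduced[:, 1]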
Often heatmaps are more interesting for huge data sets: they give more information. But in your case, I think you still have too much data for a scatter plot.
Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). In any case, that page also uses my first point.
I would suggest plotting a sample of the data.
If the sample is large enough, you should get the same distribution.
Making sure the plot is representative of the entire data set is also quite easy: you can simply take multiple samples and compare them.
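A minimal sketch of that approach (the sample size here is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
idx = rng.choice(len(x_values), size=10_000, replace=False)  # ~10k of ~810k points
x_s = np.asarray(x_values)[idx]
y_s = np.asarray(y_values)[idx]
# compute the KDE and scatter on x_s, y_s exactly as before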
I am trying to plot some simulated data, using a colormap to indicate absolute magnitude, and quiver vector length to indicate relative magnitude of the quantities. What I would like to do is set a threshold value, above which a different colormap will become active, and which can scale based upon the range of data that falls above the threshold.
For example, in the included image I have managed to combine two built-in colormaps into one, but it is still a linear scale that can only be characterized by its maximum and minimum values. What I would like is to fix a constant value as the top of the blue-green segment, while the yellow-red segment ranges from that constant value up to the maximum value in the data set.
(Apparently my rep is not high enough to include images, which probably makes this difficult to understand.)
I would also be happy to use two separate color schemes, but I need a way to enable each one only within a certain range of the data.
Thank you for any help you can provide!
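A minimal sketch of one way this might be done (the threshold, the colormap choices and the stand-in data are all illustrative): concatenate two built-in colormaps into a single ListedColormap and use TwoSlopeNorm to pin the seam at a fixed threshold, so the lower half spans [min, threshold] and the upper half spans [threshold, max]:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, TwoSlopeNorm

threshold = 1.0                      # hypothetical fixed seam value
data = np.random.rand(20, 20) * 3.0  # hypothetical stand-in data

# stack two built-in colormaps into a single 256-entry colormap
lower = plt.cm.winter(np.linspace(0, 1, 128))
upper = plt.cm.autumn(np.linspace(0, 1, 128))
cmap = ListedColormap(np.vstack([lower, upper]))

# map [vmin, threshold] onto the lower half, [threshold, vmax] onto the upper
norm = TwoSlopeNorm(vmin=data.min(), vcenter=threshold, vmax=data.max())

plt.imshow(data, cmap=cmap, norm=norm)
plt.colorbar()
plt.show()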