I'm trying to get statistics on a distribution, but all the libraries I've seen require the input to be in histogram style, that is, one huge array of raw samples like what plt.hist wants as input.
I have the bar-chart equivalent, i.e. two arrays: one with the x-axis centre points and one with the y-axis frequency for each of those points. The plot looks like this:
My question is how I can apply statistics such as the mean, range, skewness and kurtosis to this dataset. The numbers are not always integers. It seems very inefficient to force Python to build a histogram-style array with, for example, 180 copies of 0.125, 570 copies of 0.25, etc., as in the figure above.
Taking the mean of the y-array I currently have would give me the average frequency across all sizes, i.e. a horizontal line on the figure above. What I want is a vertical line showing the average of the distribution itself.
Feels like there should be an easy solution! Thanks in advance.
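For what it's worth, a minimal sketch of one way to do this: treat the y-array as weights and compute weighted moments with numpy. The array names, and the counts beyond the 180 and 570 mentioned above, are made up for illustration.

import numpy as np

# x: bin centre points, w: frequency at each point (example values;
# 180 and 570 come from the question above, the rest are hypothetical)
x = np.array([0.125, 0.25, 0.375, 0.5])
w = np.array([180.0, 570.0, 310.0, 90.0])

mean = np.average(x, weights=w)               # the vertical-line average
var = np.average((x - mean) ** 2, weights=w)  # weighted (population) variance
std = np.sqrt(var)
skew = np.average((x - mean) ** 3, weights=w) / std ** 3
kurt = np.average((x - mean) ** 4, weights=w) / std ** 4 - 3  # excess kurtosis
rng = x[w > 0].max() - x[w > 0].min()         # range of values actually observed

print(mean, std, skew, kurt, rng)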
I have a dataframe and want to identify how the variables are correlated. I can get the correlation matrix easily using df.corr(). I know I can plot the correlation matrix with plt.matshow(df.corr()) or seaborn's heatmap, but I'm looking for something like this linked graph,
taken from here. In that image, the thicker the line connecting two variables, the higher their correlation.
I did a few searches on Stack Overflow and elsewhere, but all the results are correlation matrices where the values have been replaced by colours. Is there a way to achieve the linked plot?
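One possible approach (not necessarily what the linked plot used) is networkx: make each column a node and scale each edge's width by the absolute correlation. A sketch with a made-up dataframe:

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# hypothetical dataframe; substitute your own df
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list('abcd'))

corr = df.corr()
G = nx.Graph()
cols = list(corr.columns)
for i, u in enumerate(cols):
    for v in cols[i + 1:]:
        G.add_edge(u, v, weight=abs(corr.loc[u, v]))

pos = nx.circular_layout(G)
widths = [6 * G[u][v]['weight'] for u, v in G.edges()]  # thicker = stronger
nx.draw(G, pos, with_labels=True, width=widths, node_color='lightblue')
plt.show()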
I have the following dataframe, resulting from running a grid search over several regression models:
As can be seen, many values are grouped around 0.0009, but several are orders of magnitude larger in absolute value (-1.6, -2.3, etc.).
I would like to plot these results, but I can't seem to find a way to get a readable plot. I have tried a bar plot, but I get something like:
How can I make this bar plot more readable? Or what other kind of plot would be more suitable to visualize such data?
Edit: Here is the dataframe, exported as CSV:
,a,b,c,d
LinearRegression,0.000858399508896,-4.11609208874e+20,0.000952538859738,0.000952538859733
RandomForestRegressor,-1.62264355718,-2.30218457629,0.0008957696846039999,0.0008990722465239999
ElasticNet,0.000883257900658,0.0008525502791760002,0.000884706195921,0.000929498696126
Lasso,7.92193516085e-05,-1.84086765436e-05,7.92193516085e-05,-1.84086765436e-05
ExtraTreesRegressor,-6.320170496909999,-6.30420308033,,
Ridge,0.0008584791396339999,0.0008601028734780001,,
SGDRegressor,-4.62522968756,,,
You could give the graph a logarithmic scale, which is often used for plotting data with a very large range. This muddies the interpretation slightly: equal distances on the axis now represent equal ratios (orders of magnitude) rather than equal differences. You can read about log scales here:
https://en.wikipedia.org/wiki/Logarithmic_scale
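One caveat for this particular dataframe: several scores are negative, so a plain log scale would fail on them. Matplotlib's 'symlog' scale (logarithmic on both sides of zero, linear near it) handles that. A sketch, assuming the CSV above has been saved as scores.csv and a matplotlib recent enough to accept the linthresh keyword:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('scores.csv', index_col=0)  # the CSV shown above

ax = df.plot(kind='bar')
# symlog is logarithmic away from zero but linear inside (-linthresh, linthresh),
# so the tiny positive scores and the huge negative ones are both visible
ax.set_yscale('symlog', linthresh=1e-4)
plt.tight_layout()
plt.show()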
I would like to see more marks on the boxplot: the maximum, the 99th, 95th and 85th percentiles, and the same on the other side. Is that possible?
I realize it's possible to do this oneself, by calculating those percentiles etc., and then plotting them in one dimension. But the boxplot function is convenient, it handles the axes for you, it can already calculate arbitrary percentiles (using the optional whis argument), and it lets you plot a bunch of variables in parallel (as at the bottom of this example file) -- which is handy, for instance, when using time-series data.
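One way to keep those conveniences and still get the extra marks (a sketch, with made-up data): let whis place the whiskers at the 5th/95th percentiles, then overlay the remaining percentiles computed with numpy.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = [rng.lognormal(sigma=s, size=500) for s in (0.4, 0.7, 1.0)]  # made-up series

fig, ax = plt.subplots()
ax.boxplot(data, whis=(5, 95))  # whiskers at the 5th and 95th percentiles

# overlay the extra percentiles as horizontal tick markers
for pos, d in enumerate(data, start=1):
    for q in (1, 15, 85, 99):
        ax.plot(pos, np.percentile(d, q), marker='_', markersize=18, color='red')

plt.show()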
I am currently working on a project where I have to bin up to 10-dimensional data. This works fine with numpy.histogramdd; however, I face one serious obstacle:
My parameter space is pretty large, but only a fraction of it is actually inhabited by data (say, maybe a few % or so...). In those regions the data is quite rich, so I would like to use relatively small bin widths. The problem, however, is that the RAM usage explodes: I see 20 GB+ for only 5 dimensions, which is simply not practical. I tried defining the grid myself, but the problem persists...
My idea is to specify the bin edges manually, using very large bin widths for the empty regions of the data space. Only in regions where I actually have data would I need a finer scale.
I was wondering if anyone here knows of such an implementation already which works in arbitrary numbers of dimensions.
thanks 😊
I think you should first remap your data, then create the histogram, and then interpret the histogram knowing the values have been transformed. One possibility would be to tweak the histogram tick labels so that they display mapped values.
One possible way of doing it, for example, would be:
Sort one dimension of the data as a one-dimensional array;
Integrate this array, so you have a cumulative distribution;
Find the steepest part of this distribution, and choose a horizontal interval corresponding to a "good" bin size for the peak of your histogram - that is, a size that gives you good resolution;
Find the size of this same interval along the vertical axis. That will give you a bin size to apply along the vertical axis;
Create the bins using the vertical span of that bin - that is, "draw" horizontal, equidistant lines to create your bins, instead of the most common way of drawing vertical ones;
That way, you'll have lots of bins where the data is dense, and fewer bins where it is sparse.
Two things to consider:
The mapping function is the cumulative distribution of the sorted values along that dimension. This can be quite arbitrary. If the distribution resembles some well known algebraic function, you could define it mathematically and use it to perform a two-way transform between actual value data and "adaptive" histogram data;
This applies to only one dimension. Care must be taken as to how this would work if the histograms from multiple dimensions are to be combined.
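A minimal sketch of that idea, applied independently per dimension: use quantiles of each coordinate as bin edges (equal-occupancy bins) and pass them to numpy.histogramdd, which accepts an explicit edge array per dimension. The sample data and all names are made up.

import numpy as np

rng = np.random.default_rng(0)
# hypothetical 3-D data: a dense cluster inside a much larger empty space
data = rng.normal(loc=0.0, scale=1.0, size=(100_000, 3))

# quantile-based ("equal-occupancy") edges: each bin holds roughly the same
# number of points, so dense regions get narrow bins and empty regions wide ones
n_bins = 25
edges = [np.quantile(data[:, d], np.linspace(0.0, 1.0, n_bins + 1))
         for d in range(data.shape[1])]

hist, hist_edges = np.histogramdd(data, bins=edges)
print(hist.shape)  # (25, 25, 25) -- memory scales with n_bins**ndim, not with range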
I'm trying to draw the joint distribution of two variables using seaborn (a wrapper over matplotlib). Ultimately, I want to get something like this: http://web.stanford.edu/~mwaskom/software/seaborn/examples/hexbin_marginals.html
The problem is that seaborn raises an error when I pass arrays of different lengths. Suppose,
var1 = [1,1,1,1,1,2,2,2,2,3,3,5,7]
var2 = [1,1,1,1,2,2,2,3,3,3,4,4,5,5,6,6,7,9,10,13]
Then if I write this:
import seaborn as sns
sns.jointplot(var1, var2, kind='hex')
it throws
ValueError: operands could not be broadcast together with shapes (13) (20)
Does anyone know how to make seaborn accept this?
TL;DR: a joint plot is not a well-defined mathematical operation when the arrays have different lengths.
You can think of hexbin as a scatterplot, except that instead of plotting dots, it increments the count of the hexagonal cell each dot would otherwise fall into. Obviously, unless every x is paired with a y, you can't make a scatter plot.
mathy answer:
In that plot, the histograms at the top and the right are the one-dimensional frequency distributions. The point of plotting the 2D distribution in the main window is to see whether the variables might be dependent -- if they are independent, then the density at each (x, y) coordinate is simply the relative frequency of the x value times the relative frequency of the y value (i.e. the pdf satisfies f(x,y) = f(x)f(y) for independent x, y).
So if you want to see how these variables deviate from independence, you need joint information about them -- joint meaning that observations of both variables share a common index, here assumed to be (0...i). See also independence on Wikipedia and the independence tag on Cross Validated.
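In practice, that means each x needs a matching y before the call. A sketch with made-up paired data (recent seaborn versions want the keyword form x=..., y=...):

import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
x = rng.poisson(3, size=200)      # made-up paired observations:
y = x + rng.poisson(2, size=200)  # one y for every x, sharing its index

sns.jointplot(x=x, y=y, kind='hex')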