I would like to see more stuff on the boxplot: Maximum, 99th percentile, 95th, 85th, and the same on the other side. Is it possible?
I realize it's possible to do this oneself, by calculating those percentiles etc., and then plotting them in one dimension. But the boxplot function is convenient, it handles the axes for you, it can already calculate arbitrary percentiles (using the optional whis argument), and it lets you plot a bunch of variables in parallel (as at the bottom of this example file) -- which is handy, for instance, when using time-series data.
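One way to keep the convenience of boxplot and still show the extra levels is to draw the normal boxplot and overlay one marker per additional percentile. This is only a sketch: the helper name `boxplot_with_percentiles`, the percentile list, and the random test data are assumptions made for illustration, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# 0 and 100 are the minimum and maximum; the rest mirror the levels asked for.
EXTRA_PERCENTILES = [0, 1, 5, 15, 85, 95, 99, 100]

def boxplot_with_percentiles(ax, datasets, percentiles=EXTRA_PERCENTILES):
    """Draw a standard boxplot, then overlay a tick mark at each extra percentile."""
    ax.boxplot(datasets, showfliers=False)
    # Shape: (number of percentiles, number of datasets).
    values = np.array([np.percentile(d, percentiles) for d in datasets]).T
    positions = np.arange(1, len(datasets) + 1)  # boxplot's default x positions
    for row in values:
        ax.scatter(positions, row, marker="_", s=200, color="black", zorder=3)
    return values

# Made-up data: three variables plotted side by side.
rng = np.random.default_rng(0)
datasets = [rng.normal(loc=i, size=500) for i in range(3)]
fig, ax = plt.subplots()
values = boxplot_with_percentiles(ax, datasets)
```

The axes, box, and whiskers still come from boxplot itself; only the extra markers are computed by hand.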
I'm trying to get statistics on a distribution but all the libraries I've seen require the input to be in histogram style. That is, with a huge long array of numbers like what plt.hist wants as an input.
I have the bar chart equivalent, i.e. 2 arrays; one with the x-axis centre points, and one with y-axis values for the corresponding value of each point. The plot looks like this:
My question is how I can compute statistics such as mean, range, skewness and kurtosis on this dataset. The numbers are not always integers. It seems very inefficient to force Python to build a histogram-style array with, for example, 180 copies of 0.125, 570 copies of 0.25, etc., as in the figure above.
Doing mean on the current array I have will give me the average frequency of all sizes, i.e. plotting a horizontal line on the figure above. I'd like a vertical line to show the average, as if it were a distribution.
Feels like there should be an easy solution! Thanks in advance.
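One possible approach, sketched here, is to treat the y-values as weights and compute weighted moments directly, so the long expanded array is never built. The function name `weighted_stats` is hypothetical, the third data point is made up to complete the example, and the moments are the population (biased) versions.

```python
import numpy as np

def weighted_stats(centres, counts):
    """Moments of a distribution given as (value, frequency) pairs."""
    centres = np.asarray(centres, dtype=float)
    counts = np.asarray(counts, dtype=float)
    mean = np.average(centres, weights=counts)
    # Population variance: weighted mean of squared deviations.
    variance = np.average((centres - mean) ** 2, weights=counts)
    std = np.sqrt(variance)
    standardized = (centres - mean) / std
    skewness = np.average(standardized ** 3, weights=counts)
    excess_kurtosis = np.average(standardized ** 4, weights=counts) - 3.0
    occupied = centres[counts > 0]  # ignore points with zero frequency
    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
        "range": occupied.max() - occupied.min(),
    }

# 180 x 0.125 and 570 x 0.25 come from the question; 250 x 0.5 is invented.
centres = [0.125, 0.25, 0.5]
counts = [180, 570, 250]
stats = weighted_stats(centres, counts)
```

The mean returned here is the position of the vertical line described above; it matches what `np.repeat(centres, counts).mean()` would give, without allocating the repeated array.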
I am currently working on a project where I have to bin up to 10-dimensional data. This works totally fine with numpy.histogramdd; however, I have one serious obstacle:
My parameter space is pretty large, but only a fraction is actually inhabited by data (say, maybe a few % or so...). In these regions, the data is quite rich, so I would like to use relatively small bin widths. The problem here, however, is that the RAM usage totally explodes. I see usage of 20GB+ for only 5 dimensions which is already absolutely not practical. I tried defining the grid myself, but the problem persists...
My idea would be to manually specify the bin edges, where I just use very large bin widths for empty regions in the data space. Only in regions where I actually have data, I would need to go to a finer scale.
I was wondering if anyone here knows of such an implementation already which works in arbitrary numbers of dimensions.
thanks 😊
I think you should first remap your data, then create the histogram, and then interpret the histogram knowing the values have been transformed. One possibility would be to tweak the histogram tick labels so that they display mapped values.
One possible way of doing it, for example, would be:
Sort one dimension of data as a one-dimensional array;
Integrate this array, so you have a cumulative distribution;
Find the steepest part of this distribution, and choose a horizontal interval corresponding to a "good" bin size for the peak of your histogram - that is, a size that gives you good resolution;
Find the size of this same interval along the vertical axis. That will give you a bin size to apply along the vertical axis;
Create the bins using the vertical span of that bin - that is, "draw" horizontal, equidistant lines to create your bins, instead of the most common way of drawing vertical ones;
That way, you'll have lots of bins where data is more dense, and fewer bins where data is more sparse.
Two things to consider:
The mapping function is the cumulative distribution of the sorted values along that dimension. This can be quite arbitrary. If the distribution resembles some well known algebraic function, you could define it mathematically and use it to perform a two-way transform between actual value data and "adaptive" histogram data;
This applies to only one dimension. Care must be taken as to how this would work if the histograms from multiple dimensions are to be combined.
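The idea of drawing equidistant lines on the cumulative distribution can be approximated by placing bin edges at equally spaced quantiles of each dimension. The sketch below swaps in `np.quantile` for the manual sort-and-integrate steps; the helper name `quantile_bin_edges` and the synthetic two-dimensional data are assumptions made for illustration.

```python
import numpy as np

def quantile_bin_edges(values, n_bins):
    """Edges at equally spaced quantiles: narrow bins where data is dense."""
    probs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(values, probs)
    return np.unique(edges)  # drop duplicate edges caused by repeated values

rng = np.random.default_rng(0)
# Made-up data: a dense cluster plus a few points spread over a wide range.
x = np.concatenate([rng.normal(0.0, 0.1, 5000), rng.uniform(-50, 50, 100)])
y = np.concatenate([rng.normal(5.0, 0.2, 5000), rng.uniform(-50, 50, 100)])

# One edge array per dimension, passed straight to histogramdd.
edges = [quantile_bin_edges(x, 10), quantile_bin_edges(y, 10)]
counts, _ = np.histogramdd(np.column_stack([x, y]), bins=edges)
```

Because the edges are chosen per dimension, the total number of bins is still the product of the per-axis counts, so this reduces memory only to the extent that fewer bins per axis are needed.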
I would like to produce a tricontour plot similar to these with matplotlib. The difference between these examples and my situation is that I don't have the values of my function in the grid points: they are defined in my triangles (e.g. in the centroid of each triangle).
I would like to plot the result of a finite volume simulation, where the values are defined for each control volume, not for each grid point.
I suppose one simple solution would be to average the values at each grid point. I would like to know if there are any more direct solutions.
Maybe not exactly what you are looking for, but the tripcolor function is designed for this use case (values defined at triangle centroids).
See for instance:
http://matplotlib.org/examples/pylab_examples/tripcolor_demo.html
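A minimal sketch of that usage, assuming a tiny hand-made mesh (four nodes, two triangles) and one made-up value per triangle; the values are passed through the `facecolors` keyword so they are interpreted per triangle rather than per node. The `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
import numpy as np

# Hypothetical mesh: a unit square split into two triangles.
x = np.array([0.0, 1.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
triangles = np.array([[0, 1, 2], [0, 2, 3]])
cell_values = np.array([1.0, 2.0])  # one value per triangle (control volume)

triang = mtri.Triangulation(x, y, triangles)
fig, ax = plt.subplots()
# facecolors= gives each triangle a flat colour taken from its own value.
coll = ax.tripcolor(triang, facecolors=cell_values, edgecolors="k")
fig.colorbar(coll, ax=ax)
```

This produces flat-shaded cells rather than contour lines, which is why it may not be exactly what the question asks for.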
In a standard 3D python plot, each data point is, by default, represented as a sphere in 3D. For the data I'm plotting, the z-axis is very sensitive, while the x and y axes are very general, so is there a way to make each point on the scatter plot spread out over the x and y direction as it normally would with, for example, s=500, but not spread at all along the z-axis? Ideally this would look like a set of stacked discs, rather than overlapping spheres.
Any ideas? I'm relatively new to python and I don't know if there's a way to make custom data points like this with a scatter plot.
I actually was able to do this using the matplotlib.patches library, creating a patch for every data point, and then making it whatever shape I wanted with the help of mpl_toolkits.mplot3d.art3d.
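A rough sketch of that approach, assuming circular patches of a fixed, arbitrary radius and random made-up points: each `Circle` is added to the 3D axes and then placed at its z-value with `art3d.pathpatch_2d_to_3d`, so it has extent in x and y but none in z. The `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Circle
from mpl_toolkits.mplot3d import art3d

rng = np.random.default_rng(0)
xs, ys, zs = rng.uniform(0, 10, size=(3, 20))  # made-up data points

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for x, y, z in zip(xs, ys, zs):
    disc = Circle((x, y), radius=0.5, alpha=0.6)
    ax.add_patch(disc)
    # Turn the flat patch into a 3D one lying in the plane at height z.
    art3d.pathpatch_2d_to_3d(disc, z=z, zdir="z")

# Patches do not trigger autoscaling here, so set the limits explicitly.
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_zlim(0, 10)
```

Unlike the `s` argument of scatter, the radius is given in data units, so the discs scale with the x and y axes.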
You might look for something called "jittering". Take a look at
Matplotlib: avoiding overlapping datapoints in a "scatter/dot/beeswarm" plot
It works by adding random noise to your data.
Another way might be to reduce the variance of the data on your z-axis (e.g. applying a log-function) or adjusting the scale. You could do that with ax.set_zscale("log"). It is documented here http://matplotlib.org/mpl_toolkits/mplot3d/api.html#mpl_toolkits.mplot3d.axes3d.Axes3D.set_zscale
I'm trying to draw joint distribution of 2 variables in a package named seaborn (a wrapper over matplotlib). Ultimately, I want to get something like this: http://web.stanford.edu/~mwaskom/software/seaborn/examples/hexbin_marginals.html
The problem is that seaborn swears at me when I pass arrays of different lengths. Suppose,
var1 = [1,1,1,1,1,2,2,2,2,3,3,5,7]
var2 = [1,1,1,1,2,2,2,3,3,3,4,4,5,5,6,6,7,9,10,13]
Then if I write this:
import seaborn as sns
sns.jointplot(var1, var2, kind='hex')
it throws
ValueError: operands could not be broadcast together with shapes (13) (20)
Does anyone know how to make seaborn handle this?
TL;DR: Joint plots are not a well-defined mathematical operation when the arrays are of different lengths.
You can think of hexbin as a scatterplot, except instead of plotting dots, it slightly increases the value of the hexagonal area the dot would otherwise fall into. Obviously, unless all your x's are paired with y's, you can't make a scatter plot.
mathy answer:
In that plot, the histograms at the top and on the right are the one-dimensional frequency distributions. The point of plotting the 2D distribution in the main window is to see how the variables might depend on each other: if they are independent, then the value at each (x,y) coordinate is simply the relative frequency of the x variable times the relative frequency of the y variable (i.e. the pdf f(x,y) = f(x)f(y) for x,y independent).
So if you want to see how these variables deviate from being independent, you have to have joint information about them, where joint means that observations of both variables share a common index, here assumed to be (0...i). See also independence on Wikipedia and the independence tag on Cross Validated.
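To make the point about paired observations concrete, here is a small sketch that builds the joint relative-frequency table from paired samples and compares it with the product of the marginals. The function name `joint_and_product` and the example data are assumptions made for illustration; the function refuses arrays of different lengths for the same reason the joint plot does.

```python
import numpy as np

def joint_and_product(x, y):
    """Joint relative frequencies of paired samples, and the product of marginals."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.shape != y.shape:
        raise ValueError("x and y must be paired: same length, same index")
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)  # count each observed (x, y) pair
    joint /= joint.sum()
    # Marginals are the row and column sums of the joint table.
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return joint, product

# Made-up paired data: every x value occurs with every y value equally often,
# so the two variables are independent by construction.
x = np.repeat([1, 2, 3], 4)
y = np.tile([10, 20, 30, 40], 3)
joint, product = joint_and_product(x, y)
```

For independent data like this, `joint` and `product` agree; any difference between the two tables is a measure of dependence, and it can only be computed when each x has a matching y.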