seaborn jointplot with arrays of different length - python

I'm trying to draw the joint distribution of two variables with a package named seaborn (a wrapper over matplotlib). Ultimately, I want to get something like this: http://web.stanford.edu/~mwaskom/software/seaborn/examples/hexbin_marginals.html
The problem is that seaborn swears at me when I pass arrays of different lengths. Suppose,
var1 = [1,1,1,1,1,2,2,2,2,3,3,5,7]
var2 = [1,1,1,1,2,2,2,3,3,3,4,4,5,5,6,6,7,9,10,13]
Then if I write this:
import seaborn as sns
sns.jointplot(var1, var2, kind='hex')
it throws
ValueError: operands could not be broadcast together with shapes (13) (20)
Does anyone know how to make seaborn cope with this?

TL;DR: A joint plot is not a well-defined mathematical operation when the arrays have different lengths.
You can think of hexbin as a scatterplot, except that instead of plotting a dot, it increments the count of the hexagonal bin the dot would otherwise fall into. Obviously, unless every x is paired with a y, you can't make a scatter plot.
Mathy answer:
In that plot, the histograms at the top and on the right show the one-dimensional (marginal) frequency distributions. The point of plotting the 2D distribution in the main window is to see how the variables might depend on each other: if they are independent, then the value at each (x,y) coordinate is simply the relative frequency of the x variable times the relative frequency of the y variable (i.e. the pdf f(x,y) = f(x)f(y) for x,y independent).
So if you want to see how these variables deviate from being independent, you need joint information about them. Joint means that observations of both variables share a common index, here assumed to be (0...i). See also independence on Wikipedia and the independence tag on Cross Validated.
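To make the independence argument concrete, here is a minimal sketch using the two arrays from the question. It only needs numpy; the helper name relative_frequencies is made up for illustration. It shows that unpaired arrays give you the two marginals, and that the only joint table you can build from them is the one you would get by assuming independence, which is exactly the thing a joint plot is meant to test.

```python
import numpy as np

var1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 7]
var2 = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 9, 10, 13]


def relative_frequencies(values):
    """Return the distinct values and the fraction of observations at each one.

    Hypothetical helper, not part of seaborn or numpy.
    """
    levels, counts = np.unique(values, return_counts=True)
    return levels, counts / counts.sum()


# The marginals are well defined even though the arrays are unpaired.
x_levels, fx = relative_frequencies(var1)
y_levels, fy = relative_frequencies(var2)

# Under independence, f(x, y) = f(x) * f(y): the joint table is the outer
# product of the marginals. This is an assumption, not an observation.
joint_if_independent = np.outer(fx, fy)
```

The table has one row per distinct value in var1 and one column per distinct value in var2. Comparing it against an observed joint distribution would require paired (x, y) observations, which two arrays of different lengths cannot provide.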

Related

How can I create a plot that combines a plot of data, and a histogram of different data?

I need to create a plot that has two y-axes and a single x-axis. On one x/y-axis pair, I need to plot several sets of data (with lines). On the other x/y-axis pair, I need to plot a histogram of a different data set. The intention is to present several curves that represent the performance of several design variations, together with a histogram of x-axis data, to visualize how well optimized each variant is for the operating region.
Refer to this example plot.
There are several curves on the upper plot that represent the value of epsilon as a function of V for a set of variants A, B, C.
The lower plot is a histogram that represents the number of data points H collected for each V. This data is not directly related to the upper plot. The data on the lower plot visualizes the operating region for V, so that it is visually obvious which regions are more important for optimization.
I looked into the seaborn documentation for "Visualizing distributions of data" here.
It appears that the seaborn histograms can only be presented for the data being plotted.
I think that I need to do some combination of a separate line plot and histogram so that the correct data is represented in each plot.
I want this to be represented in a single figure, but I am unsure of the exact method to achieve this.
You'll need:
to share the x-axis: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/shared_axis_demo.html
to adjust the gap/space/padding between subplots:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots_adjust.html
to invert one of the y-axes (two options):
https://matplotlib.org/stable/gallery/subplots_axes_and_figures/invert_axes.html
https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.invert_yaxis.html
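Putting those three pieces together might look roughly like the sketch below. All the data is made up (the curves for variants A, B, C and the V samples are placeholders), and the Agg backend line is only there so the sketch runs without a display; remove it for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Made-up performance curves: epsilon as a function of V for three variants.
v = np.linspace(0.0, 10.0, 50)
curves = {"A": np.sin(v / 3.0), "B": np.cos(v / 4.0), "C": v / 10.0}

# Made-up, unrelated data set: the V values observed in operation.
v_samples = rng.normal(5.0, 1.5, size=500)

# Two stacked axes that share the x-axis; the upper one gets more height.
fig, (ax_top, ax_bottom) = plt.subplots(
    2, 1, sharex=True, gridspec_kw={"height_ratios": [3, 1]}
)

for label, epsilon in curves.items():
    ax_top.plot(v, epsilon, label=label)
ax_top.set_ylabel("epsilon")
ax_top.legend()

ax_bottom.hist(v_samples, bins=30)
ax_bottom.set_xlabel("V")
ax_bottom.set_ylabel("H")
ax_bottom.invert_yaxis()  # bars hang down from the shared axis

# Shrink the vertical gap between the two subplots.
fig.subplots_adjust(hspace=0.05)
```

Call fig.savefig(...) or plt.show() afterwards to see the result. The height ratio and hspace values are arbitrary choices to tune by eye.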

Python distribution statistics on scatter plot style data

I'm trying to get statistics on a distribution but all the libraries I've seen require the input to be in histogram style. That is, with a huge long array of numbers like what plt.hist wants as an input.
I have the bar chart equivalent, i.e. 2 arrays; one with the x-axis centre points, and one with y-axis values for the corresponding value of each point. The plot looks like this:
My question is how I can compute statistics such as mean, range, skewness and kurtosis for this dataset. The numbers are not always integers. It seems very inefficient to force python to make a histogram style array with, for example, 180x 0.125's, 570x 0.25's etc. as in the figure above.
Doing mean on the current array I have will give me the average frequency of all sizes, i.e. plotting a horizontal line on the figure above. I'd like a vertical line to show the average, as if it were a distribution.
Feels like there should be an easy solution! Thanks in advance.
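One possible approach, sketched here rather than taken from an answer on this page, is to treat the y values as weights on the x centre points and compute weighted moments directly. The function name weighted_stats is made up, the third bin in the example is invented, and the moments are the population (biased) versions, so they may differ slightly from library functions that apply a bias correction.

```python
import numpy as np


def weighted_stats(centres, counts):
    """Distribution statistics from bin centres and per-bin counts.

    Hypothetical helper: equivalent to expanding the data into one long
    array, but without building that array.
    """
    centres = np.asarray(centres, dtype=float)
    counts = np.asarray(counts, dtype=float)

    mean = np.average(centres, weights=counts)

    def central_moment(order):
        return np.average((centres - mean) ** order, weights=counts)

    variance = central_moment(2)
    occupied = centres[counts > 0]  # ignore empty bins for the range
    return {
        "mean": mean,
        "range": occupied.max() - occupied.min(),
        "skewness": central_moment(3) / variance ** 1.5,
        "kurtosis": central_moment(4) / variance ** 2 - 3.0,  # excess kurtosis
    }


# 180x 0.125 and 570x 0.25 come from the question; the last bin is made up.
centres = [0.125, 0.25, 0.5]
counts = [180, 570, 250]
stats = weighted_stats(centres, counts)
```

stats["mean"] is the x position where the vertical "average" line would go, as opposed to the mean of the counts, which gives the horizontal line described above.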

matplotlib.pyplot.boxplot: More whiskers, more percentiles

I would like to see more stuff on the boxplot: Maximum, 99th percentile, 95th, 85th, and the same on the other side. Is it possible?
I realize it's possible to do this oneself, by calculating those percentiles etc., and then plotting them in one dimension. But the boxplot function is convenient, it handles the axes for you, it can already calculate arbitrary percentiles (using the optional whis argument), and it lets you plot a bunch of variables in parallel (as at the bottom of this example file) -- which is handy, for instance, when using time-series data.
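As a rough sketch of the do-it-yourself route mentioned above: let boxplot draw the box with whiskers at the 5th and 95th percentiles, then overlay the other percentiles and the extremes by hand. The data, the chosen percentiles and the tick width are all made up for illustration, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
data = [rng.normal(0.0, 1.0, 1000), rng.normal(1.0, 2.0, 1000)]  # made-up series
extra_percentiles = [1, 15, 85, 99]  # levels drawn in addition to the box

fig, ax = plt.subplots()
# whis given as a pair is interpreted as whisker percentiles.
ax.boxplot(data, whis=[5, 95], showfliers=False)

half_width = 0.15  # half the length of each extra tick, in x data units
extra_levels = []
for position, series in enumerate(data, start=1):  # boxes sit at x = 1, 2, ...
    levels = np.percentile(series, extra_percentiles)
    extra_levels.append(levels)
    # Short horizontal ticks for the extra percentiles.
    ax.hlines(levels, position - half_width, position + half_width, colors="tab:red")
    # Small markers for the minimum and maximum.
    ax.plot([position, position], [series.min(), series.max()],
            marker="_", linestyle="none", color="k")
```

This keeps boxplot's handling of axes and parallel variables and only adds the extra marks on top.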

Tricontour with triangle values

I would like to produce a tricontour plot similar to these with matplotlib. The difference between these examples and my situation is that I don't have the values of my function in the grid points: they are defined in my triangles (e.g. in the centroid of each triangle).
I would like to plot the result of a finite volume simulation, where the values are defined for each control volume, not for each grid point.
I suppose one simple solution would be to average the values at each grid point. I would like to know if there are any more direct solutions.
Maybe not exactly what you are looking for, but the tripcolor function is designed for this use case (values defined at the triangle centroids).
See for instance:
http://matplotlib.org/examples/pylab_examples/tripcolor_demo.html
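A minimal sketch of that use of tripcolor, assuming a tiny made-up mesh with one value per triangle (standing in for the control-volume values of a finite volume simulation). The facecolors keyword is what tells tripcolor the values belong to triangles rather than to grid points. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
import numpy as np

# Made-up mesh: six grid points forming two unit squares, four triangles.
x = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
triangles = np.array([[0, 1, 4], [0, 4, 3], [1, 2, 5], [1, 5, 4]])

# Made-up cell values: one per triangle, not one per grid point.
cell_values = np.array([1.0, 2.0, 3.0, 4.0])

triang = mtri.Triangulation(x, y, triangles)

fig, ax = plt.subplots()
mesh = ax.tripcolor(triang, facecolors=cell_values, edgecolors="k")
fig.colorbar(mesh, ax=ax)
```

Each triangle is filled with a flat colour, so the result is a piecewise-constant map rather than smooth contour lines.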

How to make data points in a 3D python scatter plot look like "discs" instead of "spheres"

In a standard 3D python plot, each data point is, by default, represented as a sphere in 3D. For the data I'm plotting, the z-axis is very sensitive, while the x and y axes are very general, so is there a way to make each point on the scatter plot spread out over the x and y direction as it normally would with, for example, s=500, but not spread at all along the z-axis? Ideally this would look like a set of stacked discs, rather than overlapping spheres.
Any ideas? I'm relatively new to python and I don't know if there's a way to make custom data points like this with a scatter plot.
I actually was able to do this using the matplotlib.patches library, creating a patch for every data point, and then making it whatever shape I wanted with the help of mpl_toolkits.mplot3d.art3d.
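For reference, a minimal sketch of that patch-based approach, assuming made-up points and an arbitrary disc radius: each point becomes a flat Circle patch in the x-y plane, lifted to its z value with art3d.pathpatch_2d_to_3d. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Circle
from mpl_toolkits.mplot3d import art3d

rng = np.random.default_rng(2)
points = rng.uniform(0.0, 1.0, size=(20, 3))  # made-up (x, y, z) data
radius = 0.05  # disc radius in x/y data units, chosen arbitrarily

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

for px, py, pz in points:
    disc = Circle((px, py), radius, alpha=0.6)
    ax.add_patch(disc)
    # Place the flat patch in the plane z = pz, so it has no extent along z.
    art3d.pathpatch_2d_to_3d(disc, z=pz, zdir="z")

# Patches do not update the 3D axis limits automatically, so set them by hand.
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_zlim(0, 1)
```

Because the radius is in data units, the discs only look circular when the x and y axes cover similar ranges.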
You might look for something called "jittering". Take a look at
Matplotlib: avoiding overlapping datapoints in a "scatter/dot/beeswarm" plot
It works by adding random noise to your data.
Another way might be to reduce the variance of the data on your z-axis (e.g. applying a log-function) or adjusting the scale. You could do that with ax.set_zscale("log"). It is documented here http://matplotlib.org/mpl_toolkits/mplot3d/api.html#mpl_toolkits.mplot3d.axes3d.Axes3D.set_zscale
