How to interpret scipy.stats.probplot results? - python

I wanted to use scipy.stats.probplot() to perform some gaussianity test on mydata.
from scipy import stats
# fit is (slope, intercept, r); ax is an existing matplotlib Axes
_, fit = stats.probplot(mydata, dist=stats.norm, plot=ax)
goodness_fit = "%.2f" % fit[2]
The documentation says:
Generates a probability plot of sample data against the quantiles of a
specified theoretical distribution (the normal distribution by
default). probplot optionally calculates a best-fit line for the data
and plots the results using Matplotlib or a given plot function.
probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot.
But if I google "probability plot", it turns out to be a common name for a P-P plot, while the documentation says not to confuse the two.
Now I am confused: what is this function actually doing?

I looked for hours for an answer to this question, and it can be found in the SciPy/statsmodels code comments.
In SciPy, the comment at https://github.com/scipy/scipy/blob/abdab61d65dda1591f9d742230f0d1459fd7c0fa/scipy/stats/morestats.py#L523 says:
probplot generates a probability plot, which should not be confused with
a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this
type, see statsmodels.api.ProbPlot.
Now let's look at statsmodels, where the comment at https://github.com/statsmodels/statsmodels/blob/66fc298c51dc323ce8ab8564b07b1b3797108dad/statsmodels/graphics/gofplots.py#L58 says:
ppplot : Probability-Probability plot
Compares the sample and theoretical probabilities (percentiles).
qqplot : Quantile-Quantile plot
Compares the sample and theoretical quantiles
probplot : Probability plot
Same as a Q-Q plot, however probabilities are shown in the scale of
the theoretical distribution (x-axis) and the y-axis contains
unscaled quantiles of the sample data.
So, in these modules, the difference between a Q-Q plot and a probability plot comes down to how the axes are scaled and labelled.
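To see what scipy.stats.probplot actually computes, it can help to call it without a plot target and inspect the returned arrays. The sketch below uses a made-up normal sample (the data, seed, and variable names are assumptions for illustration). It checks that the y-values are simply the sorted sample and that the returned line is an ordinary least-squares fit of those values on the theoretical quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)  # hypothetical sample

# Without plot=..., probplot only returns the numbers it would draw.
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")

# osr is the sorted sample; osm holds the theoretical normal quantiles.
same_as_sorted = np.array_equal(osr, np.sort(data))

# The returned line is a least-squares fit of osr on osm.
ls_slope, ls_intercept = np.polyfit(osm, osr, 1)

print("y-values are the sorted data:", same_as_sorted)
print("slope, intercept, r:", round(slope, 2), round(intercept, 2), round(r, 3))
```

For normally distributed data the slope and intercept roughly estimate the scale and location of the sample, and r close to 1 indicates that the points lie near a straight line.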

The theoretical probability of an event occurring is an "expected" probability based upon knowledge of the situation. It is the ratio of the number of favorable outcomes to the number of possible outcomes.
When you gather data from observations during an experiment, you will be calculating an empirical (or experimental) probability.
Example: You tossed a coin and you got a head.
Experimental Probability(head)=1
Theoretical Probability(head)=0.5
For simplicity, see the diagram below, which shows the probability of getting a particular bill amount. The p and q plots are shown.
ppplot (Probability-Probability plot)
Compares the sample and theoretical probabilities (percentiles).
qqplot (Quantile-Quantile plot)
Compares the sample and theoretical quantiles
probplot (Probability plot)
Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.
The differences between ppplot, qqplot and probplot are related to the scales. All of them show sample values on one axis and theoretical values on the other.
Percentile plots
Percentile plots are the simplest plots. You simply plot the data against their plotting positions. The plotting positions are shown on a linear scale, but the data can be scaled as appropriate.
Quantile plots
Quantile plots are similar to probability plots. The main difference is that plotting positions are converted into quantiles or Z-scores based on a probability distribution.
The default distribution is the standard-normal distribution. You’ll notice that the shape of the data is straighter on the Q-Q plot than the P-P plot. This is due to the transformation that takes place when converting the plotting positions to a distribution’s quantiles.
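As a rough illustration of the three kinds of coordinates described above, the following sketch computes them by hand for a made-up normal sample. The plotting-position formula (i - 0.5) / n is an assumption chosen for simplicity; libraries use several different conventions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = np.sort(rng.normal(loc=10.0, scale=2.0, size=200))  # hypothetical data
n = sample.size

# Plotting positions: one probability per sorted observation (assumed formula).
positions = (np.arange(1, n + 1) - 0.5) / n

# Percentile plot: x = positions on a linear scale, y = sorted data.
percentile_xy = (positions, sample)

# Quantile (Q-Q) plot: positions converted to standard-normal quantiles (Z-scores).
theoretical_q = stats.norm.ppf(positions)
qq_xy = (theoretical_q, sample)

# P-P plot: fitted theoretical CDF at each observation against the positions.
fitted_cdf = stats.norm.cdf(sample, loc=sample.mean(), scale=sample.std(ddof=1))
pp_xy = (positions, fitted_cdf)

print("Q-Q correlation:", np.corrcoef(theoretical_q, sample)[0, 1])
```

Plotting percentile_xy gives an S-shaped curve for normal data, while qq_xy comes out close to a straight line, which is the straightening effect mentioned above.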
Best-fit lines
Adding a best-fit line to a probability plot can provide insight as to whether or not a dataset can be characterized by a distribution.
In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
Probability density of a normal distribution, with quartiles shown. The area below the red curve is the same in the intervals (−∞,Q1), (Q1,Q2), (Q2,Q3), and (Q3,+∞).
In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x.
A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
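The statement about linearly related distributions can be checked numerically. In this sketch (sample sizes, seed, and the shift and scale values are made up for illustration), the second sample is drawn from a distribution that is a shifted and scaled version of the first, so the quantile pairs should fall near a line whose slope and intercept reflect the scale and shift, not near y = x.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=2000)              # reference sample
b = 3.0 + 2.0 * rng.normal(size=2000)  # same shape, shifted by 3 and scaled by 2

# Compare the two samples at a common set of probabilities.
probs = np.linspace(0.01, 0.99, 99)
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

# A straight-line fit through the Q-Q points recovers scale and shift approximately.
slope, intercept = np.polyfit(qa, qb, 1)
print("slope ~ scale:", round(slope, 2), " intercept ~ shift:", round(intercept, 2))
```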
A P–P plot is a probability plot for assessing how closely two data sets agree; it plots their two cumulative distribution functions (cdfs) against each other. P–P plots are widely used to evaluate the skewness of a distribution.

Related

How can I create a plot that combines a plot of data, and a histogram of different data?

I need to create a plot that has two y-axes and a single x-axis. On one x/y-axis pair, I need to plot several sets of data (with lines). On the other x/y-axis pair, I need to plot a histogram of a different data set. The intention is to present several curves that represent the performance of several design variations, with a histogram of x-axis data, to visualize how optimized each variant is for the operating region.
See this example plot.
There are several curves on the upper plot that represent the value of epsilon as a function of V for a set of variants A,B,C
The lower plot is a histogram that represents the amount of data points collected H for each V. This data is not directly related to the upper plot. The data on the lower plot visualizes the operating region for V, so that it is visually obvious what regions are more important for optimization.
I looked into the seaborn documentation for "Visualizing distributions of data" here.
It appears that the seaborn histograms can only be presented for the data being plotted.
I think that I need to do some combination of a separate line plot and histogram so that the correct data is represented in each plot.
I want this to be represented in a single figure, but I am unsure of the exact method to achieve this.
You'll need:
to share the x axis: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/shared_axis_demo.html
to adjust the gap/space/padding between subplots:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots_adjust.html
to invert one of the y axes (two options):
https://matplotlib.org/stable/gallery/subplots_axes_and_figures/invert_axes.html
https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.invert_yaxis.html
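Put together, the three ingredients above might look like the following minimal sketch. The curves, the histogram data, and all variable names are invented placeholders, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
v = np.linspace(0.0, 10.0, 50)
# Hypothetical performance curves for variants A, B, C.
curves = {"A": np.exp(-0.2 * v), "B": np.exp(-0.3 * v), "C": np.exp(-0.4 * v)}
# Hypothetical operating points, unrelated to the curves.
v_samples = rng.normal(loc=5.0, scale=1.5, size=1000)

# Two stacked axes that share the x axis.
fig, (ax_top, ax_bottom) = plt.subplots(
    2, 1, sharex=True, gridspec_kw={"height_ratios": [3, 1]}
)
for name, eps in curves.items():
    ax_top.plot(v, eps, label=name)
ax_top.set_ylabel("epsilon")
ax_top.legend()

ax_bottom.hist(v_samples, bins=30)
ax_bottom.invert_yaxis()          # bars hang down from the shared axis
ax_bottom.set_xlabel("V")
ax_bottom.set_ylabel("H")

fig.subplots_adjust(hspace=0.05)  # shrink the gap between the two axes
fig.savefig("curves_with_histogram.png")
```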

Plotly: What kind of splines do we plot when using the option line_shape='spline'?

I am preparing some code to interpolate a series of points with splines.
There are many kinds of splines: quadratic, cubic, many boundary conditions...
So far I have tried the most popular ones: cubic splines, with boundary conditions:
Natural: second derivative is zero at first and last points.
Clamped: first derivative is zero at first and last points.
Not-a-knot: third derivative is continuous at second and second-to-last points.
I have also tried quadratic splines with "clamped" initial condition.
I have discovered that Plotly also has a built-in interpolation function when we define the trace like this:
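import plotly.graph_objects as go
# fig is an existing go.Figure and df a DataFrame with these columns
fig.add_trace(go.Scatter(
    x=df['timestamp'],
    y=df['values'],
    mode='lines',
    line_shape='spline',
))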
Plotly's spline looks very good to my taste. It is smooth and oscillates less than, for example, the natural cubic spline:
Red line is natural cubic spline.
Gray line is plotly's spline.
So my question is: What exact kind of spline is this?
I have tried to compare it with the curves I mentioned above. None of them matches Plotly's spline.
I have checked Plotly's documentation and it does not say what kind of curves they are using. It does say that you can add the parameter "smoothing" to control the curvature.
Does anyone know how the Plotly guys do it?
I haven't been able to find a complete description in the docs either. But by the looks of your figure, I would assume that it's some sort of monotone cubic interpolation.
If you compare your figure to a similar figure from the source above, you'll see that the illustrated splines have quite a bit in common:
Judging by the areas highlighted by the grey, red and green circles, the splines applied by Plotly seem to share the same traits, being smoother than other comparable options.
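To get a feel for how a monotone cubic interpolant differs from a natural cubic spline, one can compare the two in SciPy on step-like data. This is only an illustration of the two curve families under made-up data; it does not show which algorithm Plotly actually uses.

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

# Hypothetical step-like data, where oscillation is easy to see.
x = np.arange(8.0)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
xs = np.linspace(x[0], x[-1], 701)

natural = CubicSpline(x, y, bc_type="natural")(xs)  # natural cubic spline
monotone = PchipInterpolator(x, y)(xs)              # monotone cubic (PCHIP)

# How far each curve rises above the largest data value.
overshoot_natural = natural.max() - y.max()
overshoot_monotone = monotone.max() - y.max()
print("natural cubic overshoot:", overshoot_natural)
print("monotone cubic overshoot:", overshoot_monotone)
```

The natural spline rings around the step, while the monotone interpolant stays within the range of the data, which is the kind of reduced oscillation described in the question.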

Python distribution statistics on scatter plot style data

I'm trying to get statistics on a distribution but all the libraries I've seen require the input to be in histogram style. That is, with a huge long array of numbers like what plt.hist wants as an input.
I have the bar chart equivalent, i.e. 2 arrays; one with the x-axis centre points, and one with y-axis values for the corresponding value of each point. The plot looks like this:
My question is how I can compute statistics such as mean, range, skewness and kurtosis on this dataset. The numbers are not always integers. It seems very inefficient to force Python to build a histogram-style array with, for example, 180 copies of 0.125, 570 copies of 0.25, etc., as in the figure above.
Doing mean on the current array I have will give me the average frequency of all sizes, i.e. plotting a horizontal line on the figure above. I'd like a vertical line to show the average, as if it were a distribution.
Feels like there should be an easy solution! Thanks in advance.
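One possible approach, sketched here and not taken from the thread, is to treat the y-values as weights and compute weighted moments directly from the two arrays. Only the first two centre/count pairs below come from the question; the remaining values and the function name weighted_stats are made up for illustration.

```python
import numpy as np

def weighted_stats(centres, counts):
    """Moments of a distribution given as bin centres and frequencies."""
    centres = np.asarray(centres, dtype=float)
    counts = np.asarray(counts, dtype=float)
    w = counts / counts.sum()              # normalise frequencies to weights
    mean = np.sum(w * centres)
    dev = centres - mean
    var = np.sum(w * dev**2)               # population (biased) variance
    std = np.sqrt(var)
    skew = np.sum(w * dev**3) / std**3
    kurt = np.sum(w * dev**4) / std**4 - 3.0   # excess kurtosis
    occupied = centres[counts > 0]
    return {
        "mean": mean,
        "std": std,
        "range": occupied.max() - occupied.min(),
        "skewness": skew,
        "kurtosis": kurt,
    }

centres = [0.125, 0.25, 0.5, 1.0, 2.0]     # values after the second are hypothetical
counts = [180, 570, 400, 150, 20]
result = weighted_stats(centres, counts)
print(result)
```

These are the population (biased) forms of the moments, so they should agree with what you would get from expanding the data into one long array and using the default settings of scipy.stats.skew and scipy.stats.kurtosis.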

Python plotting Bayesian posterior offset from prior with x-y error bars

I am running a Bayesian analysis where I assume (based on the figure below) a set length as a prior. My model then outputs a posterior estimate for the length. I then input my posterior length +/- 1-sigma into a time model to recover the y-axis error bars.
From the figure below the scatter points are my initial chosen lengths and the posterior recovers length offsets differing by a large margin (but within 1-sigma of the PDF-- the x-axis error bars). The y-axis error bars represent min and max error of a time model which is dependent on input min and max error of the posterior length.
I can't touch the Bayesian model as the process should be considered a black box (i.e. I can't rerun the data locally).
Is there a better way to represent this result? Perhaps as a 2D density? Is there a way to implement this within the standard matplotlib library?

Draw a histogram of a distribution with a discrete component

I'm performing a simulation of a simple queue using SimPy. One of the questions about the system is what the distribution of waiting times for a visitor looks like. What I do is draw a normalized histogram of the sample I get during the simulation.
This distribution is not purely continuous, we have a non-zero probability of the waiting time being exactly zero, hence the peak near the left end. I want it to be somehow obvious from the picture, what is the actual probability of hitting 0 exactly. Right now the height of the peak does not visualize that properly, the height is even higher than one (the reason is that many points are hitting a small segment near zero).
So the question is the general visualization technique of such distributions that are mixtures of a continuous and a discrete one.
(based on the discussion in the comments to OP).
For a distribution of some variable, call it t, that is a mixture of discrete and continuous components, I'd write the pdf as a sum of a set of delta peaks and a continuous part,
p(t) = \sum_{a} p_a \delta(t-t_a) + f(t)
where a enumerates the discrete values t_a and p_a are probabilities of t_a, and f(t) is the pdf for the continuous part of the distribution, so that f(t)dt is the probability for t to belong to [t,t+dt).
Notice that the whole thing is normalized, \int p(t) dt = 1, where the integral is over the appropriate range of t.
Now, for visualizing this, I'd separate the discrete components, and plot them as discrete values (either as narrow bins or as points with droplines etc). Then for the rest, I'd use the histogram where you know the correct normalization from the equation above: the area under the histogram should sum up to 1-\sum_a p_a.
I'm not claiming this is the way, it's just what I'd do.
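The normalization described above can be sketched numerically as follows. The simulated waiting times are a stand-in for the SimPy output (a 30% chance of zero wait and exponential waits otherwise are assumptions), and there is a single discrete point at t = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
# Hypothetical stand-in for simulated waiting times: exactly 0 with probability 0.3.
is_zero = rng.random(n) < 0.3
waits = np.where(is_zero, 0.0, rng.exponential(scale=2.0, size=n))

# Discrete part: estimated probability of waiting exactly zero.
p_zero = np.mean(waits == 0.0)

# Continuous part: a density histogram rescaled so its area is 1 - p_zero.
continuous = waits[waits > 0.0]
density, edges = np.histogram(continuous, bins=40, density=True)
density = density * (1.0 - p_zero)

area = np.sum(density * np.diff(edges))
print("P(t = 0):", p_zero)
print("area of continuous part:", area)
print("total:", p_zero + area)
```

For the figure itself, p_zero could be drawn as a single marker or narrow bar at t = 0 on a probability axis, with the rescaled density drawn as a regular histogram, so that the height of the discrete part reads directly as a probability.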
