Draw a histogram of a distribution with a discrete component - python

I'm performing a simulation of a simple queue using SimPy. One of the questions about the system is what the distribution of a visitor's waiting time looks like. To answer it, I draw a normalized histogram of the sample I get during the simulation.
This distribution is not purely continuous: there is a non-zero probability of the waiting time being exactly zero, hence the peak near the left end. I want the picture to make the actual probability of hitting exactly 0 obvious. Right now the height of the peak does not show that properly; it is even greater than one, because a normalized histogram shows a density and many points fall into a small segment near zero.
So the question is about a general visualization technique for distributions that are mixtures of a continuous and a discrete one.

(based on the discussion in the comments to OP).
For a distribution of some variable, call it t, that is a mixture of discrete and continuous components, I'd write the pdf as a sum of a set of delta peaks and a continuous part,
p(t) = \sum_{a} p_a \delta(t-t_a) + f(t)
where a enumerates the discrete values t_a, p_a is the probability of t_a, and f(t) is the density of the continuous part of the distribution, so that f(t)dt is the probability for t to belong to [t,t+dt).
Notice that the whole thing is normalized, \int p(t) dt = 1, where the integral is over the appropriate range of t.
Now, to visualize this, I'd separate out the discrete components and plot them as discrete values (as narrow bins, as points with droplines, etc.). For the rest, I'd use a histogram whose correct normalization follows from the equation above: the area under the histogram should sum to 1-\sum_a p_a.
I'm not claiming this is the way to do it; it's just what I'd do.
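The normalization described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names mixture_histogram and plot_mixture are hypothetical, the waiting times are synthetic stand-ins for simulation output, and the single discrete value is assumed to sit exactly at 0.

```python
import numpy as np


def mixture_histogram(samples, atom=0.0, bins=30):
    """Split samples into a discrete value `atom` and a continuous remainder.

    Returns (p_atom, density, edges); the bars have total area 1 - p_atom.
    """
    samples = np.asarray(samples, dtype=float)
    is_atom = samples == atom            # exact comparison: the atom is exact
    p_atom = is_atom.mean()              # empirical probability of the atom
    rest = samples[~is_atom]
    counts, edges = np.histogram(rest, bins=bins)
    # Divide by the TOTAL sample size so the bar area sums to 1 - p_atom.
    density = counts / (samples.size * np.diff(edges))
    return p_atom, density, edges


def plot_mixture(samples, atom=0.0, bins=30):
    """Draw the continuous part as bars and the atom as a dropline on a second axis."""
    import matplotlib.pyplot as plt      # imported here so the numeric part runs without it

    p_atom, density, edges = mixture_histogram(samples, atom, bins)
    fig, ax = plt.subplots()
    ax.bar(edges[:-1], density, width=np.diff(edges), align="edge", alpha=0.6)
    ax.set_xlabel("waiting time")
    ax.set_ylabel("density of the continuous part")
    ax2 = ax.twinx()                     # separate axis: probabilities, not densities
    ax2.vlines(atom, 0, p_atom, color="red")
    ax2.plot([atom], [p_atom], "o", color="red")
    ax2.set_ylim(0, 1)
    ax2.set_ylabel("probability of the discrete value")
    return fig


# Synthetic stand-in for simulated waits: 30% exact zeros, the rest exponential.
rng = np.random.default_rng(0)
n = 10_000
waits = np.where(rng.random(n) < 0.3, 0.0, rng.exponential(2.0, n))
p0, density, edges = mixture_histogram(waits)
print(f"P(wait == 0) ~ {p0:.3f}, area under bars ~ {(density * np.diff(edges)).sum():.3f}")
```

Calling plot_mixture(waits) would draw the figure. Putting the discrete probability on its own axis is one design choice among several; a labelled narrow bin would also work.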

Related

How to interpret scipy.stats.probplot results?

I wanted to use scipy.stats.probplot() to perform some gaussianity test on mydata.
from scipy import stats
_, fit = stats.probplot(mydata, dist=stats.norm, plot=ax)
goodness_fit = "%.2f" % fit[2]
The documentation says:
Generates a probability plot of sample data against the quantiles of a
specified theoretical distribution (the normal distribution by
default). probplot optionally calculates a best-fit line for the data
and plots the results using Matplotlib or a given plot function.
probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot.
But if I google "probability plot", it turns out to be a common name for a P-P plot, while the documentation says not to confuse the two.
Now I am confused, what is this function doing?
I searched for hours for an answer to this question; it can be found in the SciPy and statsmodels code comments.
In SciPy, the comment at https://github.com/scipy/scipy/blob/abdab61d65dda1591f9d742230f0d1459fd7c0fa/scipy/stats/morestats.py#L523 says:
probplot generates a probability plot, which should not be confused with
a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this
type, see statsmodels.api.ProbPlot.
So now let's look at statsmodels, where the comment at https://github.com/statsmodels/statsmodels/blob/66fc298c51dc323ce8ab8564b07b1b3797108dad/statsmodels/graphics/gofplots.py#L58 says:
ppplot : Probability-Probability plot
Compares the sample and theoretical probabilities (percentiles).
qqplot : Quantile-Quantile plot
Compares the sample and theoretical quantiles
probplot : Probability plot
Same as a Q-Q plot, however probabilities are shown in the scale of
the theoretical distribution (x-axis) and the y-axis contains
unscaled quantiles of the sample data.
So the difference between a Q-Q plot and a probability plot, in these modules, comes down to the scales.
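To see what scipy.stats.probplot actually returns, here is a minimal sketch that calls it without a plot argument, so that it only computes values. The data are synthetic stand-ins for mydata, drawn from a normal distribution with mean 5 and standard deviation 2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mydata = rng.normal(loc=5.0, scale=2.0, size=500)   # synthetic stand-in

# With no `plot` argument, probplot only computes and returns the values.
(osm, osr), (slope, intercept, r) = stats.probplot(mydata, dist="norm")

# osm: theoretical quantiles of the standard normal (x-axis)
# osr: the sorted sample values (y-axis)
# slope, intercept: least-squares line through the points (osm, osr)
# r: correlation coefficient of that fit (the fit[2] used in the question)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```

For normally distributed data the slope and intercept of the fitted line roughly estimate the standard deviation and the mean, and r close to 1 indicates that the points lie near a straight line.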
The theoretical probability of an event is an "expected" probability based on knowledge of the situation. It is the ratio of the number of favorable outcomes to the number of possible outcomes.
When you gather data from observations during an experiment, you calculate an empirical (or experimental) probability.
Example: you tossed a coin once and got a head.
Experimental Probability(head)=1
Theoretical Probability(head)=0.5
For simplicity, see the diagram below, which shows the probability of getting a particular bill amount. The p and q plots are shown.
ppplot (Probability-Probability plot)
Compares the sample and theoretical probabilities (percentiles).
qqplot (Quantile-Quantile plot)
Compares the sample and theoretical quantiles
probplot (Probability plot)
Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.
The differences between ppplot, qqplot and probplot are related to the scales. All three show sample and theoretical values on the x and y axes.
Percentile plots
Percentile plots are the simplest plots. You simply plot the data against their plotting positions. The plotting positions are shown on a linear scale, but the data can be scaled as appropriate.
Quantile plots
Quantile plots are similar to probability plots. The main difference is that plotting positions are converted into quantiles or Z-scores based on a probability distribution.
The default distribution is the standard-normal distribution. You’ll notice that the shape of the data is straighter on the Q-Q plot than the P-P plot. This is due to the transformation that takes place when converting the plotting positions to a distribution’s quantiles.
Best-fit lines
Adding a best-fit line to a probability plot can provide insight as to whether or not a dataset can be characterized by a distribution.
In statistics and probability quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
Probability density of a normal distribution, with quartiles shown. The area below the red curve is the same in the intervals (−∞,Q1), (Q1,Q2), (Q2,Q3), and (Q3,+∞).
In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x.
A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
A P–P plot plots two cumulative distribution functions (cdfs) against each other. It is a probability plot for assessing how closely two data sets agree. P–P plots are widely used to evaluate the skewness of a distribution.
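The difference between the two kinds of plot can be made concrete by computing their coordinates by hand. The sketch below uses a synthetic sample and compares it with the standard normal distribution. The plotting-position formula (i - 0.5) / n is one common convention and is an assumption here, as is the choice of which quantity goes on which axis.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
sample = np.sort(rng.normal(size=200))       # synthetic, sorted sample
n = sample.size
std_normal = NormalDist()                    # theoretical distribution

# Plotting positions: one probability per ordered observation.
# (i - 0.5) / n is one common convention (an assumption; others exist).
positions = (np.arange(1, n + 1) - 0.5) / n

# Q-Q: theoretical quantiles vs. sample quantiles (both in data units).
qq_x = np.array([std_normal.inv_cdf(p) for p in positions])
qq_y = sample

# P-P: theoretical CDF at each observation vs. empirical probabilities
# (both in [0, 1]).
pp_x = np.array([std_normal.cdf(x) for x in sample])
pp_y = positions
```

If the sample really follows the theoretical distribution, both point sets lie close to the line y = x. The Q-Q points spread over the data range, while the P-P points are confined to the unit square.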

Python plotting Bayesian posterior offset from prior with x-y error bars

I am running a Bayesian analysis where I assume (based on the figure below) a set length as a prior. My model then outputs a posterior estimate for the length. I then input my posterior length +/- 1-sigma into a time model to recover the y-axis error bars.
In the figure below, the scatter points are my initially chosen lengths, and the posterior recovers length offsets that differ by a large margin (but lie within 1-sigma of the PDF, shown as the x-axis error bars). The y-axis error bars represent the min and max error of a time model, which depends on the min and max error of the posterior length used as input.
I can't touch the Bayesian model as the process should be considered a black box (i.e. I can't rerun the data locally).
Is there a better way to represent this result? Perhaps as a 2D density? Is there a way to implement this within the standard matplotlib library?

Plotting the Differential Energy Spectrum from the Energy Spectrum

I have the energy spectrum of a certain number of particles N(E) v/s E.
However, I want to plot the differential energy spectrum, i.e. dN/dE v/s E. I DO NOT intend to calculate the derivative here (as the traditional way of representing a differential energy spectrum might suggest). What I essentially need is the number of particles in each histogram bin divided by the bin width.
Is there any way to do this automatically in matplotlib or something similar? Or do I actually need to do this manually, wherein I need to write some code to first put the particles in different bins and then divide by the bin-width and then redraw the histogram.
matplotlib is a graphics library. It can plot data and edit figures.
What you need to do here is apply a numerical method to differentiate your data. It shouldn't be difficult.
You could just apply the definition of the derivative, taking as the step size the smallest spacing you have between measured values of E.
Once you have the data, you can just use matplotlib to plot it.
If you post the data here, I would gladly give you an example of how to do it.
Or you can just check https://en.wikipedia.org/wiki/Numerical_differentiation
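What the question describes, counts per bin divided by the bin width, needs no numerical differentiation. A minimal sketch with NumPy follows; the energies are synthetic stand-ins and the logarithmic bin edges are an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
energies = rng.exponential(scale=1.5, size=5000)   # synthetic stand-in

# Logarithmic bin edges give unequal widths, which is where dividing
# by the width matters most.
edges = np.logspace(-2, 1.5, 25)
counts, edges = np.histogram(energies, bins=edges)

widths = np.diff(edges)
dN_dE = counts / widths                    # particles per unit energy
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
```

The result can then be drawn with, for example, plt.stairs(dN_dE, edges) in recent matplotlib versions. Note that plt.hist(energies, bins=edges, density=True) also divides by the bin width, but additionally divides by the number of binned particles, so it gives a normalized density rather than dN/dE.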

How interpolate 3D coordinates

I have data points in x,y,z format. They form a point cloud of a closed manifold. How can I interpolate them using R-Project or Python? (Like polynomial splines)
It depends on what the points originally represented. Just having an array of points is generally not enough to derive the original manifold from. You need to know which points go together.
The most common low-level boundary representation ("brep") is a bunch of triangles. This is e.g. what OpenGL and DirectX get as input. I've written Python software that can convert triangular meshes in STL format to e.g. a PDF image. Maybe you can adapt that for your purpose. Interpolating a triangle is usually not necessary, but rather trivial to do. Create three new points, each halfway between two original points. These three points form an inner triangle, and the rest of the surface forms three triangles. So with this you have transformed one triangle into four triangles.
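The midpoint subdivision just described can be sketched as follows. The function name subdivide and the example triangle are hypothetical, chosen only for illustration.

```python
import numpy as np


def subdivide(triangle):
    """Split one triangle (3 vertices, each x,y,z) into four via edge midpoints."""
    a, b, c = np.asarray(triangle, dtype=float)
    ab, bc, ca = (a + b) / 2, (b + c) / 2, (c + a) / 2
    return np.array([
        [a, ab, ca],    # corner triangle at a
        [ab, b, bc],    # corner triangle at b
        [ca, bc, c],    # corner triangle at c
        [ab, bc, ca],   # inner triangle
    ])


tri = [[0, 0, 0], [2, 0, 0], [0, 2, 0]]
parts = subdivide(tri)   # shape (4, 3, 3): four triangles, three vertices each
```

Because the new points lie on the original triangle, this refines the mesh without adding any information about the underlying shape.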
If the points are control points for spline surface patches (like NURBS, or Bézier surfaces), you have to know which points together form a patch. Since these are parametric surfaces, once you know the control points, all the points on the surface can be determined. Below is the function for a Bézier surface. The parameters u and v are the parametric coordinates of the surface. They run from 0 to 1 along two adjacent edges of the patch. The control points are k_ij.
The B functions are weight functions for each control point;
Suppose you want to approximate a Bézier surface by a grid of 10x10 points. To do that you have to evaluate the function p for u and v running from 0 to 1 in 10 steps (generating the steps is easily done with numpy.linspace).
For each (u,v) pair, p returns a 3D point.
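As a sketch of that evaluation, the code below uses the standard Bézier surface definition, p(u,v) = \sum_i \sum_j B_i^n(u) B_j^m(v) k_ij, with Bernstein polynomials as the weight functions B. The 4x4 control net is a made-up example, and the function names are hypothetical.

```python
import numpy as np
from math import comb


def bernstein(n, i, t):
    """Bernstein basis polynomial B_i^n(t), the weight of control point i."""
    return comb(n, i) * t**i * (1 - t)**(n - i)


def bezier_surface(k, u, v):
    """Evaluate a Bezier patch with control net k, shape (n+1, m+1, 3), on the grid u x v."""
    k = np.asarray(k, dtype=float)
    n, m = k.shape[0] - 1, k.shape[1] - 1
    bu = np.array([bernstein(n, i, u) for i in range(n + 1)])   # (n+1, len(u))
    bv = np.array([bernstein(m, j, v) for j in range(m + 1)])   # (m+1, len(v))
    # p[a, b] = sum over i, j of B_i(u_a) * B_j(v_b) * k_ij
    return np.einsum("ia,jb,ijc->abc", bu, bv, k)


# Made-up bicubic control net: a flat 4x4 grid with the middle raised.
gx, gy = np.meshgrid(np.arange(4.0), np.arange(4.0), indexing="ij")
gz = np.zeros((4, 4))
gz[1:3, 1:3] = 1.0
control = np.stack([gx, gy, gz], axis=-1)

u = v = np.linspace(0, 1, 10)
points = bezier_surface(control, u, v)   # shape (10, 10, 3)
```

The three coordinate arrays points[..., 0], points[..., 1] and points[..., 2] can be passed to a 3D plotting routine. The patch passes through the four corner control points but generally not through the inner ones.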
If you want to visualise these points, you could use mplot3d from matplotlib.
By "compact manifold" do you mean a lower dimensional function like a trajectory or a surface that is embedded in 3d? You have several alternatives for the surface-problem in R depending on how "parametric" or "non-parametric" you want to be. Regression splines of various sorts could be applied within the framework of estimating mean f(x,y) and if these values were "tightly" spaced you may get a relatively accurate and simple summary estimate. There are several non-parametric methods such as found in packages 'locfit', 'akima' and 'mgcv'. (I'm not really sure how I would go about statistically estimating a 1-d manifold in 3-space.)
Edit: But if I did want to see a 3D distribution and get an idea of whether it was a parametric curve or trajectory, I would reach for package:rgl and just plot it in a rotatable 3D frame.
If you are instead trying to form the convex hull (for which the word interpolate is probably the wrong choice), then I know there are 2-d solutions and suspect that searching would find 3-d solutions as well. Constructing the right search strategy will depend on specifics whose absence the 2 comments so far reflect. I'm speculating that modelling lower and higher order statistics, like the 1st and 99th percentile, as a function of (x,y) could be attempted if you wanted to use a regression effort to create boundaries. There is a well-supported quantile regression package, 'quantreg' (with the fitting function rq), by Roger Koenker.

How to use interpolation to calculate a force based on angle

I am trying to make a python script that will output a force based on a measured angle. The inputs are time, the curve and the angle, but I am having trouble using interpolation to fit the force to the curve. I looked at scipy.interpolate, but I'm not sure it will help me because the points aren't evenly spaced.
numpy.interp does not require your points to be evenly spaced. I'm not certain whether by "The inputs are time, the curve and the angle" you mean that you have three independent variables; if so, you will have to adapt it quite a bit. But for one-variable problems, interp is the way to go.
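For the one-variable case, a minimal sketch looks like this. The calibration curve and the measured angles are invented values, used only to show that the sample points may be unevenly spaced as long as they are increasing.

```python
import numpy as np

# Hypothetical calibration curve: force (N) known at unevenly spaced angles (deg).
curve_angle = np.array([0.0, 5.0, 12.0, 30.0, 45.0, 90.0])   # must be increasing
curve_force = np.array([0.0, 1.2, 3.1, 6.0, 7.4, 8.0])

# Angles measured at successive time steps (also made up).
measured_angle = np.array([2.5, 21.0, 60.0])

# Piecewise-linear interpolation of the curve at each measured angle.
force = np.interp(measured_angle, curve_angle, curve_force)
print(force)
```

By default, angles outside the range of the curve are mapped to the force at the nearest end point rather than extrapolated.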
