rayleigh distribution curve on histogram - python

I have an array of velocity data in directions V_x and V_y. I've plotted a histogram for the velocity norm using the code below,
plt.hist(V_norm_hist, bins=60, density=True, rwidth=0.95)
which gives the following figure:
Now I also want to add a Rayleigh distribution curve on top of this, but I can't get it to work. I've been trying different combinations using scipy.stats.rayleigh, but the scipy documentation isn't really intuitive, so I can't get it to work properly...
What exactly do the lines
mean, var, skew, kurt = rayleigh.stats(moments='mvsk')
and
x = np.linspace(rayleigh.ppf(0.01),rayleigh.ppf(0.99), 100)
ax.plot(x, rayleigh.pdf(x),'r-', lw=5, alpha=0.6, label='rayleigh pdf')
do?

You might need to first follow the link to rv_continuous, from which rayleigh is subclassed, and from there to ppf to find out that ppf is the 'percent point function' (the inverse of the cdf). x0 = ppf(0.01) is the spot below which 1% of the total probability 'weight' has accumulated, and similarly x1 = ppf(0.99) is where 99% of the 'weight' is accumulated. np.linspace(x0, x1, 100) divides the span from x0 to x1 into 100 short intervals. As a continuous distribution can have infinite support, the x0 and x1 limits are needed to show only the interesting interval.
rayleigh.pdf(x) gives the pdf at x. So, an indication of how probable each x is.
rayleigh.stats(moments='mvsk') computes the moments selected by the letters in 'mvsk': 'm' = mean, 'v' = variance, 's' = (Fisher's) skew, 'k' = (Fisher's) kurtosis.
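For example (not part of the original answer, just a quick check of what those calls return for the standard Rayleigh with loc=0 and scale=1):
from scipy.stats import rayleigh

# The four moments of the standard Rayleigh distribution
mean, var, skew, kurt = rayleigh.stats(moments='mvsk')
print(mean, var)            # approx. 1.25 and 0.43

# 1% and 99% quantiles: the interval that holds 98% of the probability mass
x0, x1 = rayleigh.ppf(0.01), rayleigh.ppf(0.99)
print(x0, x1)               # approx. 0.14 and 3.03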
To plot both the histogram and the distribution on the same plot, we need to know the Rayleigh parameters that correspond to your sample (loc and scale). Furthermore, the pdf and the histogram need to share the same x range and the same y scale. For the x we can take the limits of the histogram bins. For the y, we can scale up the pdf, knowing that the total area under the pdf is supposed to be 1, while the histogram bars are proportional to the number of entries.
If you do know the loc is 0 but don't know the scale, the wikipedia article gives a formula that connects the scale to the mean of your samples:
estimated_rayleigh_scale = samples.mean() / np.sqrt(np.pi / 2)
Supposing a loc of 0 and a scale of 0.08, the code would look like:
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import rayleigh

N = 1000
# V = np.random.uniform(0, 0.1, 2*N).reshape((N,2))
# V_norm = (np.linalg.norm(V, axis=1))
scale = 0.08
# Draw N Rayleigh-distributed speeds via inverse transform sampling
V_norm_hist = scale * np.sqrt(-2 * np.log(np.random.uniform(0, 1, N)))

fig, ax = plt.subplots(1, 1)
num_bins = 60
_binvalues, bins, _patches = plt.hist(V_norm_hist, bins=num_bins, density=False, rwidth=1, ec='white', label='Histogram')

x = np.linspace(bins[0], bins[-1], 100)
binwidth = (bins[-1] - bins[0]) / num_bins
# Estimate the Rayleigh scale from the sample mean (loc assumed to be 0)
scale = V_norm_hist.mean() / np.sqrt(np.pi / 2)

# Scale the pdf up to counts: multiply by the number of samples times the bin width
plt.plot(x, rayleigh(loc=0, scale=scale).pdf(x) * len(V_norm_hist) * binwidth, lw=5, alpha=0.6, label=f'Rayleigh pdf (s={scale:.3f})')
plt.legend()
plt.show()
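As an alternative to the analytic estimate of the scale (a sketch, not part of the original answer; run after the code above), scipy can also fit the scale directly while keeping loc fixed at 0:
# Fit only the scale; floc=0 pins the location parameter at zero
loc_fit, scale_fit = rayleigh.fit(V_norm_hist, floc=0)
print(scale_fit)   # should come out close to 0.08 for the sample generated above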

Related

How to calculate FWHM for a gaussian curve fitted over a histogram?

I have a histogram with a fitted gaussian curve, and I'd like to find and calculate the full width at half maximum for this curve. The data used in this code is a single column from a dataframe. I've included a link to an image of my plot. I'm new to python and have no idea how to do this.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def gaussian(n, mean, amplitude, standard_deviation):
    return amplitude * np.exp(-(n - mean)**2 / (2 * standard_deviation**2))

n = df_OI_CMC['Area_1_Micrometers']

# Plot Histogram 1
bin_heights, bin_borders, _ = plt.hist(n, bins=(0, 1, 5, 10, 25, 50, 75, 100, 125, 150, 200, 250, 500, 750, 1000, 2500, 5000, 7500, 10000),
                                       label='histogram', edgecolor='white')
bin_widths = np.diff(bin_borders)
bin_centers = bin_borders[:-1] + np.diff(bin_borders) / 2

# Generate enough x values to make the curves look smooth
n_interval_for_fit = np.linspace(bin_borders[0], bin_borders[-1], 10000)
# n_interval_for_fit_2 = np.linspace(bin_borders[0], bin_borders_2[-1], 10000)  # bin_borders_2 is not defined in this snippet

# Curve fit to the histogram
popt, _ = curve_fit(gaussian, bin_centers, bin_heights, p0=[-44.0543433, 1480.64682738, 68.86641026])
plt.rcParams["figure.figsize"] = [12, 12]
plt.plot(n_interval_for_fit, gaussian(n_interval_for_fit, *popt), label='fit')
plt.ylim([0, 1500])
plt.xlim([-10, 1000])
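For reference (a minimal sketch, not part of the original question, assuming popt from the curve_fit call above): the FWHM of a Gaussian depends only on its standard deviation, so it can be read directly off the fitted parameters:
# FWHM = 2 * sqrt(2 * ln 2) * sigma  (about 2.355 * sigma)
fwhm = 2 * np.sqrt(2 * np.log(2)) * abs(popt[2])   # popt[2] is the fitted standard_deviation
print(fwhm)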

Scatter plot with varying Quantile/Percentile in python [duplicate]

This question already has an answer here: Plotting stochastic processes in Python
Basically, I want to plot a scatter plot between two variables with varying percentiles. I've plotted the scatter plot with the following toy code, but I'm unable to plot it for different percentiles (quantiles).
quantiles = [1, 10, 25, 50, 50, 75, 90, 99]
grays = ["#DCDCDC", "#A9A9A9", "#2F4F4F", "#A9A9A9", "#DCDCDC"]
alpha = 0.3

data = df[['area_log', 'mr_ecdf']]
y = data['mr_ecdf']
x = data['area_log']
idx = np.argsort(x)
x = np.array(x)[idx]
y = np.array(y)[idx]

for i in range(len(quantiles)//2):
    plt.fill_between(x, y, y, color='black', alpha=alpha, label=f"{quantiles[i]}")
    lower_lim = np.percentile(y, quantiles[i])
    upper_lim = np.percentile(y, 100 - quantiles[i])
    data = data[data['mr_ecdf'] >= lower_lim]
    data = data[data['mr_ecdf'] <= upper_lim]
    y = data['mr_ecdf']
    x = data['area_log']
    idx = np.argsort(x)
    x = np.array(x)[idx]
    y = np.array(y)[idx]

data = df[['area_log', 'mr_ecdf']]
y = data['mr_ecdf']
x = data['area_log']
plt.scatter(x, y, s=1, color='r', label='data')
plt.legend()
# axes.set_ylim([0,1])
data link : here
I want to plot something like this (the first panel, (1,1)):
As was mentioned by @Mr. T, one way to do that is to calculate the confidence intervals yourself and then plot them using plt.fill_between. The data you show pose a problem since there are not enough points and not enough variance, so you will never get what is in your pictures (and the separation in my figure is also not clear, so I have put another example below to show how it works). If you have data for that, post it and I will update. Anyway, you should check the post I mentioned in the comment; one way of doing it follows:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])
idx = np.argsort(x)
x = np.array(x)[idx]
y = np.array(y)[idx]

# Create a list of quantiles to calculate
quantiles = [0.05, 0.25, 0.75, 0.95]
grays = ["#DCDCDC", "#A9A9A9", "#2F4F4F", "#A9A9A9", "#DCDCDC"]
alpha = 0.3

plt.fill_between(x, y - np.percentile(y, 0.5), y + np.percentile(y, 0.5), color=grays[2], alpha=alpha, label="0.50")
# if the percentiles are symmetrical and we want labels on both sides
for i in range(len(quantiles)//2):
    plt.fill_between(x, y, y + np.percentile(y, quantiles[i]), color=grays[i], alpha=alpha, label=f"{quantiles[i]}")
    plt.fill_between(x, y - np.percentile(y, quantiles[-(i+1)]), y, color=grays[-(i+1)], alpha=alpha, label=f"{quantiles[-(i+1)]}")
plt.scatter(x, y, color='r', label='data')
plt.legend()
EDIT:
Some explanation. I am not sure what is incorrect in my code, but I would be happy if you could tell me; there is always room for improvement (thanks to @Mr. T again for the catch). Nevertheless, the fill_between function does the following:
Fill the area between two horizontal curves.
The curves are defined by the points (x, y1) and (x, y2)
So y1 and y2 specify where you want the graph to be filled with a colour. Let me bring up another example:
X = np.linspace(120, 50, 71)
Y = X + 20*np.random.randn(71)
plt.fill_between(X, Y-np.percentile(Y, 95),Y+np.percentile(Y, 95), color="k", alpha = alpha)
plt.fill_between(X, Y-np.percentile(Y, 80),Y+np.percentile(Y, 80), color="r", alpha = alpha)
plt.fill_between(X, Y-np.percentile(Y, 60),Y, color="b", alpha = alpha)
plt.scatter(X, Y, color = 'r', label = 'data')
I generated some random data to see what is happening. The line plt.fill_between(X, Y-np.percentile(Y, 60), Y, color="b", alpha=alpha) plots the fill only from the 60th percentile below Y up to Y. The other two lines cover the space on both sides of Y (hence the +-). You can see that the percentiles overlap; of course they do, they must, since a 90th-percentile band includes the 60th as well. So you only see the differences between them. You could plot the data in the opposite order (or change the zorder), but then everything would be covered by the highest percentile.
I hope this clarifies the answer. Also, your question is perfectly fine; sorry if my answer feels less than neutral. If you had also posted the data behind the graphs and not only the picture, my answer (or others') could be more tailored :).

How does distplot/kdeplot calculate the kde curve?

I'm using seaborn for plotting data. Everything was fine until my mentor asked me how the plot is made in, for example, the following code.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
x = np.random.normal(size=100)
sns.distplot(x)
plt.show()
The result of this code is:
My questions:
How does distplot manage to plot this?
Why does the plot start at -3 and end at 4?
Is there any parametric function or any specific mathematical function that distplot uses to plot the data like this?
I use distplot and kind='kde' to plot my data, but I would like to understand the maths behind those functions.
Here is some code trying to illustrate how the kde curve is drawn.
The code starts with a random sample of 100 xs.
These xs are shown in a histogram. With density=True the histogram is normalized so that its full area is 1. (By default, the bars of the histogram grow with the number of points; internally, the complete area is calculated and each bar's height is divided by that area.)
To draw the kde, a gaussian "bell" curve is drawn around each of the N samples. These curves are summed, and normalized by dividing by N.
The sigma of these curves is a free parameter. By default it is calculated by Scott's rule (N ** (-1/5), which is about 0.4 for 100 points; the green curve in the example plot).
The code below shows the result for different choices of sigma. Smaller sigmas follow the given data more closely; larger sigmas give a smoother curve. There is no perfect choice for sigma; it depends strongly on the data and on what is known (or guessed) about the underlying distribution.
import matplotlib.pyplot as plt
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

N = 100
xs = np.random.normal(0, 1, N)

plt.hist(xs, density=True, label='Histogram', alpha=.4, ec='w')
x = np.linspace(xs.min() - 1, xs.max() + 1, 100)
for sigma in np.arange(.2, 1.2, .2):
    plt.plot(x, sum(gauss(x, xi, sigma) for xi in xs) / N, label=f'$\\sigma = {sigma:.1f}$')
plt.xlim(x[0], x[-1])
plt.legend()
plt.show()
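As a cross-check (not part of the original answer), scipy.stats.gaussian_kde builds the same kind of curve and picks the bandwidth via Scott's rule by default; inserting these lines just before plt.legend() in the code above overlays scipy's estimate, which should roughly match the sigma = 0.4 curve when the sample standard deviation is close to 1:
from scipy.stats import gaussian_kde

kde = gaussian_kde(xs)   # bandwidth = Scott's factor N ** (-1/5) times the sample standard deviation
plt.plot(x, kde(x), 'k--', lw=2, label='scipy gaussian_kde')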
PS: Instead of a histogram or a kde, other ways to visualize 100 random numbers are a set of short lines:
plt.plot(np.repeat(xs, 3), np.tile((0, -0.05, np.nan), N), lw=1, c='k', alpha=0.5)
plt.ylim(ymin=-0.05)
or dots (jittered, so they don't overlap):
plt.scatter(xs, -np.random.rand(N)/10, s=1, color='crimson')
plt.ylim(ymin=-0.099)

Fitting an un-normalised distribution with scipy.stats

I'm trying to fit a histogram, but the fit only works with normalised data, i.e. with the option normed=True in the histogram. Is there a way of doing this with scipy.stats (or another method)? Here is a MWE using a uniform distribution:
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import uniform

data = []
for i in range(1000):
    data.append(random.uniform(-1, 1))

loc, scale = uniform.fit(data)
x = np.linspace(-1, 1, 1000)
y = uniform.pdf(x, loc, scale)

plt.hist(data, bins=100, normed=False)
plt.plot(x, y, 'r-')
plt.show()
I also tried defining my own function (below) but I'm getting a bad fit.
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy import optimize

data = []
for i in range(1000):
    data.append(random.uniform(-1, 1))

def unif(x, avg, sig):
    return avg * x + sig

y, base = np.histogram(data, bins=100)
x = [0.5 * (base[i] + base[i+1]) for i in range(len(base) - 1)]

popt, pcov = optimize.curve_fit(unif, x, y)
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = unif(x_fit, *popt)

plt.hist(data, bins=100, normed=False)
plt.plot(x_fit, y_fit, 'r-')
plt.show()
Note that it is generally a bad idea to fit a distribution to the histogram. Compared to the raw data, the histogram contains less information, so the fit will most likely be worse. Thus, the first MWE in the question actually contains the best approach. Simply normalize the histogram and it will match the distribution of the data: plt.hist(data, bins=100, normed=True).
However, it seems you actually want to work with the unnormalized histogram. In that case take the normalization that the histogram would normally use and apply it inverted to the fitted distribution. The documentation describes the normalization as
n / (len(x) * dbin)
which is a short way of saying: divide by the number of observations times the bin width.
Multiplying the distribution by this value results in the expected counts per bin:
loc, scale = uniform.fit(data)
x = np.linspace(-1,1, 1000)
y = uniform.pdf(x, loc, scale)
n_bins = 100
bin_width = np.ptp(data) / n_bins
plt.hist(data, bins=n_bins, normed=False)
plt.plot(x, y * len(data) * bin_width, 'r-')
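A side note, not part of the original answer: in current matplotlib versions the normed argument has been removed in favour of density, so the equivalent calls today would be:
plt.hist(data, bins=100, density=True)   # normalized: total area is 1
plt.hist(data, bins=100)                 # raw counts; rescale the pdf as shown above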
The second MWE is interesting because you describe the line as a bad fit, but actually it is a very good fit :). You simply overfit the histogram: although you expect a horizontal line (one degree of freedom), you fit an arbitrary line (two degrees of freedom).
So if you want a horizontal line, fit a horizontal line, and don't be surprised to get something else if you fit something else...
def unif(x, sig):
    return 0 * x + sig  # slope is zero -> horizontal line
However, there is a much simpler way of obtaining the height of the unnormalized uniform distribution. Just average the histogram over all bins:
y, base = np.histogram(data,bins=100)
y_hat = np.mean(y)
print(y_hat)
# 10.0
Or, even simpler, use the theoretical value len(data) / n_bins == 10.

Using matplotlib for Gaussian

So I am trying to plot a histogram of my data and I seem to be a little confused here. I am using matplotlib in Python. Here is the code from their website:
mu = 100 #mean
sigma = 15 #std deviation
x = mu + sigma * np.random.randn(10000)
# the histogram of the data
n, bins, patches = plt.hist(x, num_bins, normed=1, facecolor='green', alpha=0.5)  # num_bins (e.g. 50) must be defined beforehand
I am confused as to what the x -axis should be for my use. I have calculated the standard deviation and the mean but, I am uncertain if I should replace the np.random.randn(10000) with the actual data or not.
Just put your data into the x variable, that's all. You do not need to compute the mean or variance.
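For instance (a minimal sketch, not part of the original answer; my_data is a hypothetical stand-in for whatever array holds your measurements):
import numpy as np
import matplotlib.pyplot as plt

my_data = np.random.randn(1000)   # replace with your actual measurements
num_bins = 50

# The histogram is built directly from the data; no mean or variance needed
n, bins, patches = plt.hist(my_data, num_bins, density=True, facecolor='green', alpha=0.5)
plt.show()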
