I'm trying to fit the probability density function (pdf) of the generalized extreme value (GEV) distribution to the pdf of my data. The histogram depends on the bin width, so as I adjust the bins, the result of the fit changes too. curve_fit(func, x, y) handles this properly, but it uses least squares estimation. What I want is maximum likelihood estimation (MLE). stats.genextreme.fit(data) gives good MLE results, but it works on the original data only, so it does not reflect changes in the histogram shape when the bins change.
I have also tried to implement MLE myself, and I succeeded in estimating the parameters of the standard normal distribution that way. However, that estimate is based on the original data and does not change with the bins, and I could not estimate the GEV parameters even from the original data.
I checked the source code of genextreme_gen, rv_continuous, etc., but it is too complicated for me to follow with my Python skills.
I would like to estimate the parameters of the GEV distribution by MLE, and I want the estimate to change according to the bins.
What should I do?
I am sorry for my poor English, and thank you for your help.
+)
h = 0.5 # bin width
dat = h105[1] # data
b = np.arange(min(dat)-h/2, max(dat), h) # bin range
n, bins = np.histogram(dat, bins=b, density=True) # histogram
x = 0.5*(bins[1:]+bins[:-1]) # x-value of histogram
popt,_ = curve_fit(fg, x, n) # curve_fit(GEV's pdf, x-value of histogram, pdf value)
popt = -popt[0], popt[1], popt[2] # estimated parameters (Least squares estimation, LSE)
x1 = np.linspace((popt[1]-popt[2])/popt[0], dat.max(), 1000)
a1 = stats.genextreme.pdf(x1, *popt) # pdf
popt = stats.genextreme.fit(dat) # estimated parameter (Maximum likelihood estimation, MLE)
x2 = np.linspace((popt[1]-popt[2])/popt[0], dat.max(), 1000)
a2 = stats.genextreme.pdf(x2, *popt)
[Figures not preserved: fit results for bin width = 2 and for bin width = 0.5]
One way to do this is to convert the bins back into data: count the number of data points in each bin, then repeat the bin center that many times.
I also tried spreading values uniformly across each bin, but repeating the bin center seems to give parameters with a higher likelihood.
import scipy.stats as stats
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
ground_truth_params = (0.001, 0.5, 0.999)
count = 50
h = 0.2 # bin width
dat = stats.genextreme.rvs(*ground_truth_params, count) # data
b = np.arange(np.min(dat)-h/2, np.max(dat), h) # bin range
n, bins = np.histogram(dat, bins=b, density=True) # histogram
bin_counts, _ = np.histogram(dat, bins=b, density=False) # histogram
x = 0.5*(bins[1:]+bins[:-1]) # x-value of histogram
def flatten(l):
    return [item for sublist in l for item in sublist]
popt,_ = curve_fit(stats.genextreme.pdf, x, n, p0=[0,1,1]) # curve_fit(GEV's pdf, x-value of histogram, pdf value)
popt_lse = tuple(popt) # estimated parameters (Least squares estimation, LSE); no sign flip is needed because stats.genextreme.pdf itself was fitted
popt_mle = stats.genextreme.fit(dat) # estimated parameter (Maximum likelihood estimation, MLE)
uniform_dat_from_bins = flatten((np.linspace(x - h/2, x + h/2, n) for n, x in zip(bin_counts, x)))
popt_uniform_mle = stats.genextreme.fit(uniform_dat_from_bins) # estimated parameter (Maximum likelihood estimation, MLE)
centered_dat_from_bins = flatten(([x] * n for n, x in zip(bin_counts, x)))
popt_centered_mle = stats.genextreme.fit(centered_dat_from_bins) # estimated parameter (Maximum likelihood estimation, MLE)
plot_params = {
ground_truth_params: 'tab:green',
popt_lse: 'tab:red',
popt_mle: 'tab:orange',
popt_centered_mle: 'tab:blue',
popt_uniform_mle: 'tab:purple'
}
param_names = ['GT', 'LSE', 'MLE', 'bin centered MLE', 'bin uniform MLE']
plt.figure(figsize=(10,5))
plt.bar(x, n, width=h, color='lightgrey')
plt.ylim(0, 0.5)
plt.xlim(-2,10)
for params, color in plot_params.items():
    x_pdf = np.linspace(-2, 10, 1000)
    y_pdf = stats.genextreme.pdf(x_pdf, *params) # the GEV pdf for this parameter set
    plt.plot(x_pdf, y_pdf, label='pdf', color=color)
plt.legend(param_names)
plt.figure(figsize=(10,5))
for params, color in plot_params.items():
    plt.plot(np.sum(stats.genextreme.logpdf(dat, *params)), 'o', color=color)
The first plot shows the PDFs estimated by the different methods along with the ground-truth PDF.
The next plot shows the log-likelihoods of the estimated parameters given the original data.
As expected, the PDF estimated by MLE on the original data has the highest likelihood. Next come the PDFs estimated from the histogram bins (centered and uniform), followed by the ground-truth PDF. The PDF estimated by least squares has the lowest likelihood.
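A possible alternative to turning bins back into pseudo-data is to maximize the likelihood of the bin counts directly: each bin contributes its count times the log of the probability mass that the GEV assigns to it, F(right edge) - F(left edge). This is not the method used in the answer above, only a minimal sketch of that idea. The helper name binned_neg_log_likelihood, the synthetic data, the starting point and the choice of the Nelder-Mead optimizer are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize


def binned_neg_log_likelihood(params, counts, edges):
    """Negative log-likelihood of histogram counts under a GEV (hypothetical helper)."""
    c, loc, scale = params
    if scale <= 0:
        return np.inf  # reject invalid scale values
    cdf = stats.genextreme.cdf(edges, c, loc=loc, scale=scale)
    probs = np.clip(np.diff(cdf), 1e-300, None)  # probability mass per bin, kept > 0
    return -np.sum(counts * np.log(probs))


# Synthetic data; the parameter values are arbitrary and only for illustration.
dat = stats.genextreme.rvs(0.1, loc=0.5, scale=1.0, size=500, random_state=0)

h = 0.5  # bin width: changing it changes the counts and therefore the estimate
edges = np.arange(dat.min() - h / 2, dat.max() + h, h)
counts, edges = np.histogram(dat, bins=edges)

start = np.array([0.0, dat.mean(), dat.std()])  # crude starting point (assumption)
result = minimize(binned_neg_log_likelihood, start,
                  args=(counts, edges), method="Nelder-Mead")
c_hat, loc_hat, scale_hat = result.x
print("binned MLE (c, loc, scale):", c_hat, loc_hat, scale_hat)
```

Because the objective is built only from the bin edges and counts, the estimate changes when the bin width h changes, which is the behaviour asked for in the question.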
Related
I'm trying to perform this exercise 3.1 using python:
the code "works" and is the following:
#NUMERICAL ESTIMATE OF PI
import numpy as np #library for numerical calculations
import matplotlib.pyplot as plt #library for plotting purposes
from scipy.stats import norm #needed for gaussian fit
#*******************************************************************************
M = 10**2 #number of times we calculate pi
N = 10**4 #number of point generated
mean_pi=[] #empty list
for i in range(M): #loop over the M repetitions
    x=np.random.uniform(-1,1,N) #N random samples from a uniform distribution over [-1,1)
    y=np.random.uniform(-1,1,N) #N random samples from a uniform distribution over [-1,1)
    x_sel=x[(x**2+y**2)<=1] #x coordinates of the points inside the unit circle
    y_sel=y[(x**2+y**2)<=1] #y coordinates of the points inside the unit circle
    mean_pi+=[4*len(x_sel)/len(x)] #append this repetition's estimate of pi
#*******************************************************************************
plt.figure(figsize=(8,3)) #a unique identifier for the figure
_,bins,_=plt.hist(mean_pi,bins=int(np.sqrt(N)),density=True, color="skyblue") #syntax to create a histogram from a dataset x with n bins
#and store an array specifying the bin ranges in the variable bins.
mu, sigma = norm.fit(mean_pi) #get the mean and standard deviation of data
k = sigma*np.sqrt(N) #k parameters
best_fit_line = norm.pdf(bins, mu, sigma) #get a line of best fit for the data
print("\nTime of repetitions:", M, ". The mean of the distribution is: ", mu, ". The standard deviation is:", sigma, ". The k parameters is:", k ,". \n")
#*******************************************************************************
plt.plot(bins, best_fit_line, color="red") #plot y versus x as lines and/or markers
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the x-axis
plt.ylabel('Pi',fontweight='bold') #set the label for the y-axis
plt.title("Histogram for Pi vs. bins") #set a title for the scatter plot
plt.show() #display all open figures
print("\n")
#*******************************************************************************
M = 10**3 #number of times we calculate pi
N = 10**4 #number of point generated
mean_pi=[] #empty list
for i in range(M): #loop over the M repetitions
    x=np.random.uniform(-1,1,N) #N random samples from a uniform distribution over [-1,1)
    y=np.random.uniform(-1,1,N) #N random samples from a uniform distribution over [-1,1)
    x_sel=x[(x**2+y**2)<=1] #x coordinates of the points inside the unit circle
    y_sel=y[(x**2+y**2)<=1] #y coordinates of the points inside the unit circle
    mean_pi+=[4*len(x_sel)/len(x)] #append this repetition's estimate of pi
#*******************************************************************************
plt.figure(figsize=(8,3)) #a unique identifier for the figure
_,bins,_=plt.hist(mean_pi,bins=int(np.sqrt(N)),density=True, color="skyblue") #syntax to create a histogram from a dataset x with n bins
#and store an array specifying the bin ranges in the variable bins.
mu, sigma = norm.fit(mean_pi) #get the mean and standard deviation of data
k = sigma*np.sqrt(N) #k parameters
best_fit_line = norm.pdf(bins, mu, sigma) #get a line of best fit for the data
print("Time of repetitions:", M, ". The mean of the distribution is: ", mu, ". The standard deviation is:", sigma, ". The k parameters is:", k ,". \n")
#*******************************************************************************
plt.plot(bins, best_fit_line, color="red") #plot y versus x as lines and/or markers
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the x-axis
plt.ylabel('Pi',fontweight='bold') #set the label for the y-axis
plt.title("Histogram for Pi vs. bins") #set a title for the scatter plot
plt.show() #display all open figures
print("\n")
#*******************************************************************************
M = 5*10**3 #number of times we calculate pi
N = 10**4 #number of point generated
mean_pi=[] #empty list
for i in range(M): #loop over the M repetitions
    x=np.random.uniform(-1,1,N) #N random samples from a uniform distribution over [-1,1)
    y=np.random.uniform(-1,1,N) #N random samples from a uniform distribution over [-1,1)
    x_sel=x[(x**2+y**2)<=1] #x coordinates of the points inside the unit circle
    y_sel=y[(x**2+y**2)<=1] #y coordinates of the points inside the unit circle
    mean_pi+=[4*len(x_sel)/len(x)] #append this repetition's estimate of pi
#*******************************************************************************
plt.figure(figsize=(8,3)) #a unique identifier for the figure
_,bins,_=plt.hist(mean_pi,bins=int(np.sqrt(N)),density=True, color="skyblue") #syntax to create a histogram from a dataset x with n bins
#and store an array specifying the bin ranges in the variable bins.
mu, sigma = norm.fit(mean_pi) #get the mean and standard deviation of data
k = sigma*np.sqrt(N) #k parameters
best_fit_line = norm.pdf(bins, mu, sigma) #get a line of best fit for the data
print("Time of repetitions:", M, ". The mean of the distribution is: ", mu, ". The standard deviation is:", sigma, ". The k parameters is:", k ,". \n")
#*******************************************************************************
plt.plot(bins, best_fit_line, color="red") #plot y versus x as lines and/or markers
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the x-axis
plt.ylabel('Pi',fontweight='bold') #set the label for the y-axis
plt.title("Histogram for Pi vs. bins") #set a title for the scatter plot
plt.show() #display all open figures
#*******************************************************************************
print("\n How many couples N you need to estimate pi at better than 0.0001? The number of couples N is:", (k**2)*10**8 ,".")
#*******************************************************************************
With the output:
As you can see, sigma increases, whereas I expected it to decrease as the number of repetitions increases. I don't understand where the error is.
I also tried increasing N, but the results are no better.
Can someone help me, please?
I found the error. To fix the code, the number of bins has to follow the number of estimates M being histogrammed rather than N, so the line:
_,bins,_=plt.hist(mean_pi,bins=int(np.sqrt(N)),density=True, color="slateblue") #syntax to create a histogram from a dataset x with n bins
must be changed to:
_,bins,_=plt.hist(mean_pi,bins=int(np.sqrt(M)),density=True, color="slateblue") #syntax to create a histogram from a dataset x with n bins
and the output will be:
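As a cross-check on the expectation in the question, here is a minimal sketch showing that the spread of the individual pi estimates is set by N, not by the number of repetitions M. Each estimate is 4 times a binomial proportion with success probability pi/4, so its standard deviation should be close to sqrt(pi*(4-pi)/N), which gives k = sigma*sqrt(N) of about 1.64. The helper name estimate_pi and the fixed seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility


def estimate_pi(n_points, rng):
    """One Monte Carlo estimate of pi from n_points uniform points in [-1, 1)^2."""
    x = rng.uniform(-1, 1, n_points)
    y = rng.uniform(-1, 1, n_points)
    return 4 * np.mean(x**2 + y**2 <= 1)  # fraction inside the unit circle, times 4


N = 10**4
for M in (10**2, 10**3):
    estimates = np.array([estimate_pi(N, rng) for _ in range(M)])
    sigma = estimates.std()
    print("M =", M, "sigma =", sigma, "k =", sigma * np.sqrt(N))

# Standard deviation expected from the binomial variance of the hit fraction.
theory_sigma = np.sqrt(np.pi * (4 - np.pi) / N)
print("expected sigma:", theory_sigma)
```

Increasing M only makes the estimate of sigma more precise; to shrink sigma itself, N has to grow.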
I am working on a project looking at the Poisson filling of droplets by a contaminant, where the Poisson mean depends on the droplet's volume. There is a volume distribution, and each volume has a likelihood given by a Gaussian.
I have a loop generating a Poisson distribution (an array of 2000 numbers) for a different mean in each step. Each distribution has a weight that I generate from a gaussian. Currently, I am just adding all Poisson arrays and creating one large normalised histogram. I wish to weight the frequency of numbers in each array, such that the histogram can take into account the weight. I am unsure how to do this however as it is the frequency of the numbers in each array that has to be weighted and not the numbers themselves.
import numpy as np
from scipy.stats import poisson # unused here, and shadowed by the poisson() helper defined below
from matplotlib import pyplot as plt
def gaussian(mu,sig,x): # Gaussian Gives Weight
    P_r = 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)
    return P_r
def poisson(mean):
    P = np.random.poisson(mean, 2000)
    return P
R= np.linspace(45, 75, 2000) #min and max radius and steps taken between them to gen Poisson
Average_Droplet_Radius = 60
Variance = 15
Mean_Droplet_Average_Occupancy = float(input('Enter mean droplet occupation ')) #Poisson Mean
for mu, sig in [(Average_Droplet_Radius,Variance)]:
    prob = gaussian(mu,sig,R) # plain variable instead of an attribute set on the numpy module
C = Mean_Droplet_Average_Occupancy / (4/3 *np.pi * ( Average_Droplet_Radius**3)) #The constant parameter for all distributions
i = 0
a = np.array([])
for cell in R:
    Individual_Mean = C * (4/3 *np.pi * ( R[i]**3))
    Individual_Weight = prob[i] #want to weight frequency in given Poisson by this
    b = (poisson(Individual_Mean))
    a = np.append(a, b) # Unweighted Poissons combined
    i = i+1
bins_val = np.arange(0, a.max() + 1.5) - 0.5
count, bins, ignored = plt.hist( a, bins_val, density=True) # Creates unweighted, normalised histogram
plt.show()
I was unsure how to use the weights part of plt.hist, as it is a large array of numbers that has weight.
Currently, I get a histogram where each droplet size is equally likely, how can I get the weights in the final distribution?
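One way this could be approached, sketched here under assumptions rather than given as a definitive answer, is to pass one weight per sample: every number drawn for a given radius carries that radius's Gaussian weight, so the weight array is the per-radius weight repeated once per sample. The variable names, the fixed mean occupancy (used instead of input()) and the smaller number of radii are assumptions for illustration. np.histogram is used so the sketch runs without a display; plt.hist accepts the same weights and density arguments.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

radii = np.linspace(45, 75, 200)          # fewer steps than in the question, for speed
mean_radius, sigma_radius = 60.0, 15.0    # the question's "Variance" is used as sigma
mean_occupancy = 3.0                      # assumed value instead of input()
samples_per_radius = 2000

# Unnormalized Gaussian weight of each radius (density=True normalizes later).
radius_weights = np.exp(-0.5 * ((radii - mean_radius) / sigma_radius) ** 2)

# The Poisson mean scales with droplet volume, i.e. with radius cubed.
poisson_means = mean_occupancy * (radii / mean_radius) ** 3

# One row of Poisson samples per radius.
samples = rng.poisson(poisson_means[:, None], size=(radii.size, samples_per_radius))

# Repeat each radius weight once per sample so it lines up with samples.ravel().
sample_weights = np.repeat(radius_weights, samples_per_radius)

bin_edges = np.arange(0, samples.max() + 1.5) - 0.5  # unit-width bins centered on integers
density, _ = np.histogram(samples.ravel(), bins=bin_edges,
                          weights=sample_weights, density=True)
print(density[:10])
```

With these weights, samples from radii near the mean radius count more than samples from the tails, so the frequencies rather than the values are weighted.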
I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:
plt.plot(xvalues, yvalues)
plt.show()
I assume the data needs to be smoothed somehow in order to give a smooth pdf.
I would like to plot the pdf. How can I do that?
The raw data is at: http://dpaste.com/1HVK5DR .
There are two main problems: your data seems to be quite noisy, and it is not equally spaced. The points at the low end are sampled quite densely, while the points at the high end are sampled quite sparsely. This can cause numerical issues.
So first I suggest resampling the data using linear interpolation to get equally spaced samples. (Note that all the snippets, appended to each other, form the content of one Python file.)
import matplotlib.pyplot as plt
import numpy as np
from data import xvalues, yvalues #load data from file
print("#datapoints: {}".format(len(xvalues)))
#don't use every point if your computer is not very fast
xv = np.array(xvalues)[::5]
yv = np.array(yvalues)[::5]
#interpolate to have evenly space data
xi = np.linspace(xv.min(), xv.max(), 400)
yi = np.interp(xi, xv, yv)
Then, to smooth the data, I suggest performing an RBF regression (i.e. using an "RBF network"). The idea is to fit a curve of the form
c(t) = sum a(i) * phi(t - x(i)) #(not part of the program)
where phi is some radial basis function. (In theory we could use any functions.) To get a very smooth result I choose a very smooth function, namely a Gaussian: phi(x) = exp( - x^2/sigma^2), where sigma is yet to be determined. The x(i) are just some nodes that we can define. If we have a smooth function, we only need a few nodes. The number of nodes also determines how much computation needs to be done. The a(i) are the coefficients we can optimize to get the best fit. In this case I just use a least squares approach.
Note that if we can write a function in the form above, it is very easy to compute its derivative. It is just
c'(t) = sum a(i) * phi'(t - x(i)) #(not part of the program)
where phi' is the derivative of phi.
Regarding sigma: it is usually a good idea to choose it as a multiple of the step between the nodes we chose. The greater we choose sigma, the smoother the resulting function gets.
#set up rbf network
rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
print("#rbfs: {}".format(rbf_nodes.shape[1]))
#estimate width of kernels:
sigma = 20 #greater = smoother, this is the primary parameter to play with
sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))
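# kernel & derivative (an inverse quadratic, 1/(1+(r/sigma)^2), is used here; the Gaussian described in the text works the same way)
rbf = lambda r:1/(1+(r/sigma)**2)
Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2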
#compute coefficients of rbf network
r = np.abs(xi[:, None]-rbf_nodes)
A = rbf(r)
coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
print(coeffs)
#evaluate rbf network
N=1000
xe = np.linspace(xi.min(), xi.max(), N)
Ae = rbf(xe[:, None] - rbf_nodes)
ye = Ae @ coeffs
#evaluate derivative
N=1000
xd = np.linspace(xi.min(), xi.max(), N)
Bd = Drbf(xd[:, None] - rbf_nodes)
yd = Bd @ coeffs
fig,ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(xv, yv, '-')
ax.plot(xi, yi, '-')
ax.plot(xe, ye, ':')
ax2.plot(xd, yd, '-')
fig.savefig('graph.png')
print('done')
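For comparison, here is a minimal self-contained sketch of the same idea with the Gaussian kernel phi(r) = exp(-r^2/sigma^2) from the description, applied to synthetic data because the original data file is not available here. The noisy normal CDF, the node spacing and the factor 2 for sigma are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic noisy empirical CDF (stand-in for the real data).
x = np.linspace(-4, 4, 400)
y_noisy = norm.cdf(x) + 0.01 * rng.standard_normal(x.size)

# RBF nodes and kernel width (a multiple of the node spacing).
nodes = np.linspace(-4, 4, 17)
sigma = 2 * (nodes[1] - nodes[0])


def phi(r):
    """Gaussian radial basis function."""
    return np.exp(-(r / sigma) ** 2)


def dphi(r):
    """Derivative of phi with respect to r."""
    return -2 * r / sigma**2 * np.exp(-(r / sigma) ** 2)


diff = x[:, None] - nodes[None, :]   # signed distances, shape (400, 17)
coeffs = np.linalg.lstsq(phi(diff), y_noisy, rcond=None)[0]  # least squares fit

cdf_smooth = phi(diff) @ coeffs      # smoothed CDF
pdf_smooth = dphi(diff) @ coeffs     # its analytic derivative, i.e. the PDF estimate
print(pdf_smooth[x.size // 2])
```

The derivative is taken on the fitted smooth curve, not on the noisy samples, which is why the resulting PDF estimate stays smooth away from the edges of the node range.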
You need the derivative to go from CDF to PDF
PDF(x) = d CDF(x)/ dx
With NumPy, you could use gradient
pdf = np.gradient(yvalues, xvalues)
plt.plot(xvalues, pdf)
plt.show()
or manual differential
pdf = np.diff(yvalues)/np.diff(xvalues)
l = np.asarray(xvalues[:-1])
r = np.asarray(xvalues[1:])
plt.plot((l+r)/2.0, pdf) # points in the middle of interval
plt.show()
Both produce something like the following (updated picture; the original got botched somehow):
When I bin my data according to scipy.stats.binned_statistic (see here for example), how do I get the error (that is, the standard deviation) on the average binned values?
For example, if I bin my data as following:
windspeed = 8 * np.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * np.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
boatspeed, statistic='median', bins=[1,2,3,4,5,6,7])
plt.figure()
plt.plot(windspeed, boatspeed, 'b.', label='raw data')
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
label='binned statistic of data')
plt.legend()
how do I get the standard deviation on the bin_means?
The way to go about this is to construct a probability density estimate from the histogram (this is just a question of normalizing the histogram appropriately), and then computing the standard deviation or any other statistic for the estimated density.
The appropriate normalization is whatever is needed to get the area under the histogram to be 1. As for computing statistics for the density estimate, work from the definition of the statistic as integral(p(x)*f(x), x, -infinity, +infinity), substituting the density estimate for p(x) and whatever is needed for f(x), e.g. x and x^2 to get the first and second moments, from which you calculate the variance and then the standard deviation.
I'll post some formulas tomorrow, or maybe someone else wants to give it a try in the meantime. You might be able to look up some formulas, but my advice is to always try to work out the answer before resorting to looking it up.
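The recipe above can be sketched as follows, treating the normalized histogram as a piecewise-constant density. For a bin with center c and width w, the integral of x over the bin is p*w*c and the integral of x^2 is p*w*(c^2 + w^2/12). The synthetic data and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)  # synthetic sample

# Normalized histogram: the area under it is 1.
density, edges = np.histogram(data, bins=50, density=True)
widths = np.diff(edges)
centers = 0.5 * (edges[:-1] + edges[1:])
bin_prob = density * widths  # probability mass of each bin

# First and second moments of the piecewise-constant density.
mean_est = np.sum(bin_prob * centers)
second_moment = np.sum(bin_prob * (centers**2 + widths**2 / 12))
std_est = np.sqrt(second_moment - mean_est**2)

print("histogram-based mean and std:", mean_est, std_est)
print("sample mean and std:         ", data.mean(), data.std())
```

The two pairs of numbers should agree closely when the bins are narrow compared with the spread of the data.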
Maybe I'm a bit late to answer, but I was wondering how to do the same thing and came across this question. I think calculating it with stats.binned_statistic_2d should be possible, but I haven't figured it out yet. For now I calculated it manually, like so (note that in my code I use a fixed number of equally spaced bins):
import numpy
from scipy import stats
from matplotlib import pyplot
windspeed = 8 * numpy.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * numpy.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
                boatspeed, statistic='median', bins=10)
stds = []
# Match each value to the bin number it belongs to
# (a list, because a bare zip iterator would be exhausted after the first bin)
pairs = list(zip(boatspeed, binnumber))
# Calculate stdev for all elements inside each bin
for n in sorted(set(binnumber)): # Iterate over each bin in ascending order
    in_bin = [x for x, nbin in pairs if nbin == n] # Get all elements inside bin n
    stds.append(numpy.std(in_bin))
# Calculate the locations of the bins' centers, for plotting
bin_centers = []
for i in range(len(bin_edges) - 1):
    center = bin_edges[i] + (float(bin_edges[i + 1]) - float(bin_edges[i]))/2.
    bin_centers.append(center)
# Plot means
pyplot.figure()
pyplot.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
           label='binned statistic of data')
# Plot stdev as vertical lines, probably can also be done with errorbar
pyplot.vlines(bin_centers, bin_means - stds, bin_means + stds)
pyplot.legend()
pyplot.show()
Resulting plot (minus the data points):
You have to be careful with the bins. In the code I'm working on using this, one of the bins has no points and I have to adjust my calculations of the stdev accordingly.
Just change this line so that it uses statistic='std':
bin_std, bin_edges, binnumber = stats.binned_statistic(windspeed,
boatspeed, statistic='std', bins=[1,2,3,4,5,6,7])
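A minimal self-contained sketch of that suggestion, using the same synthetic windspeed/boatspeed data as the question: statistic='std' gives the spread inside each bin, and statistic='count' gives the number of points per bin. Dividing one by the square root of the other to get a standard error of the bin mean is an addition for illustration, not part of the original answer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
windspeed = 8 * rng.random(500)
boatspeed = 0.3 * windspeed**0.5 + 0.2 * rng.random(500)
edges = [1, 2, 3, 4, 5, 6, 7]

# Central value per bin, as in the question.
bin_medians, bin_edges, binnumber = stats.binned_statistic(
    windspeed, boatspeed, statistic='median', bins=edges)

# Standard deviation of the values that fall into each bin.
bin_std, _, _ = stats.binned_statistic(
    windspeed, boatspeed, statistic='std', bins=edges)

# Number of points per bin, used for a standard error of the bin mean.
bin_count, _, _ = stats.binned_statistic(
    windspeed, boatspeed, statistic='count', bins=edges)
bin_sem = bin_std / np.sqrt(bin_count)

print(bin_std)
print(bin_sem)
```

The resulting arrays have one entry per bin and can be passed to plt.errorbar as yerr, together with the bin centers.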
How can I calculate in python the Cumulative Distribution Function (CDF)?
I want to calculate it from an array of points I have (discrete distribution), not with the continuous distributions that, for example, scipy has.
(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF into a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
If you have a discrete array of samples and you would like to know the CDF of the sample, you can just sort the array. If you look at the sorted result, you'll see that the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the element in the middle of the sorted array.
Let us have a closer look at this with a simple example:
import matplotlib.pyplot as plt
import numpy as np
# create some randomly distributed data:
data = np.random.randn(10000)
# sort the data:
data_sorted = np.sort(data)
# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)
# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')
This gives the following plot, where the right-hand plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it is only an approximation as long as the number of points is finite.
This function is easy to invert, and it depends on your application which form you need.
Assuming you know how your data is distributed (i.e. you know the pdf of your data), then scipy does support discrete data when calculating cdf's
import numpy as np
import scipy.stats # import the stats submodule explicitly so scipy.stats.norm is available
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete
# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()
We can even print the first few values of the cdf to show they are discrete
print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])
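The same call also works on multi-dimensional arrays. Note that scipy.stats.norm.cdf is applied element-wise, so for the 2d data below it returns the standard normal CDF of each coordinate separately, not a joint CDF: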
mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm() - there are multiple distributions scipy supports. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use any distribution to calculate the cdf, you most likely will get incorrect results.
The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF for a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the sum of the masses must be 1, these constraints determine the location and height of each jump in the empirical CDF.
Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies, but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function, and divide by the total sum. The following function returns the values in sorted order and the corresponding cumulative distribution:
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]
To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that jumps occur at the right place. However, you need to force a jump at the smallest data value, so it's necessary to insert an additional element in front of x and y.
import matplotlib.pyplot as plt
def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
Example usages:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
plot_ecdf(xvec)
import pandas as pd # needed for the DataFrame example
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
plot_ecdf(df['x'])
with output:
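To read off the empirical CDF at arbitrary points rather than plot it, a minimal sketch is to sort the sample and count how many values are less than or equal to each query point. The helper name ecdf_at is an assumption made for illustration; the data is the xvec from the example above.

```python
import numpy as np


def ecdf_at(sample, query):
    """Fraction of sample values <= each query value (hypothetical helper)."""
    sample_sorted = np.sort(np.asarray(sample))
    # side='right' counts values equal to the query as "less than or equal".
    return np.searchsorted(sample_sorted, query, side='right') / sample_sorted.size


xvec = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
values = ecdf_at(xvec, [0, 1, 4, 6, 7])
print(values)  # fractions 0, 0.1, 0.6, 0.7 and 1
```

The value at 4 is 0.6 because six of the ten samples (1, 2, 2, 4, 4, 4) are less than or equal to 4.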
To calculate the CDF for an array of discrete numbers:
import numpy as np
pdf, bin_edges = np.histogram(
    data, # array of data
    bins=500, # specify the number of bins for distribution function
    density=True # True to return probability density function (pdf) instead of count
)
cdf = np.cumsum(pdf*np.diff(bin_edges)) # bin_edges is the name returned above
Note that the return array pdf has the length of bins (500 here) and bin_edges has the length of bins+1 (501 here).
So, to calculate the CDF, which is nothing but the area under the PDF curve, we can simply take the cumulative sum of the bin widths (np.diff(bin_edges)) times pdf using NumPy's cumsum function.
Here's an alternative pandas solution to calculating the empirical CDF, using pd.cut to sort the data into evenly spaced bins first, and then cumsum to compute the distribution.
import pandas as pd
def empirical_cdf(s: pd.Series, n_bins: int = 100):
    # Sort the data into `n_bins` evenly spaced bins:
    discretized = pd.cut(s, n_bins)
    # Count the number of datapoints in each bin:
    bin_counts = discretized.value_counts().sort_index().reset_index()
    # Calculate the locations of each bin as just the mean of the bin start and end:
    bin_counts["loc"] = (pd.IntervalIndex(bin_counts["index"]).left + pd.IntervalIndex(bin_counts["index"]).right) / 2
    # Compute the CDF with cumsum (cumulative counts; divide by len(s) for probabilities):
    return bin_counts.set_index("loc").iloc[:, -1].cumsum()
Below is an example use of the function to discretize the distribution of 10000 datapoints into 100 evenly spaced bins:
s = pd.Series(np.random.randn(10000))
cdf = empirical_cdf(s, n_bins=100)
fig, ax = plt.subplots()
ax.scatter(cdf.index, cdf.values)
import random
import numpy as np
import matplotlib.pyplot as plt
def get_discrete_cdf(values):
    values = (values - np.min(values)) / (np.max(values) - np.min(values))
    values_sort = np.sort(values)
    values_sum = np.sum(values)
    values_sums = []
    cur_sum = 0
    for it in values_sort:
        cur_sum += it
        values_sums.append(cur_sum)
    cdf = [values_sums[np.searchsorted(values_sort, it)]/values_sum for it in values]
    return cdf
rand_values = [np.random.normal(loc=0.0) for _ in range(1000)]
_ = plt.hist(rand_values, bins=20)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("nums")
cdf = get_discrete_cdf(rand_values)
x_p = list(zip(rand_values, cdf))
x_p.sort(key=lambda it: it[0])
x = [it[0] for it in x_p]
y = [it[1] for it in x_p]
_ = plt.plot(x, y)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("prob")