Pareto distribution and whether a chart conforms to it - python

I have the figure shown below and want to know whether it conforms to a Pareto distribution. It's a cumulative plot.
I also want to find the point on the x-axis that marks the 80-20 rule, i.e. the x value that splits the plot so that 20 percent hold 80 percent of the wealth.
Also, I'm really confused by the scipy.stats Pareto function. It would be great if someone could give an intuitive explanation, since the documentation is pretty confusing.

scipy.stats.pareto is a distribution object: it provides the pdf, the cdf, random draws (rvs) and parameter fitting (fit) for the Pareto distribution.
To check whether your data conform to a Pareto distribution, you can perform a Kolmogorov-Smirnov test.
Draw a random sample from the Pareto distribution using pareto.rvs(shape, size=1000), where shape is the estimated shape parameter of your Pareto distribution, and compare it with your data (values) using the two-sample test scipy.stats.ks_2samp:
from scipy import stats
pareto_smp = stats.pareto.rvs(shape, size=1000)
D, p_value = stats.ks_2samp(values, pareto_smp)
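As a hedged sketch of an alternative to the two-sample comparison above, one can fit the Pareto parameters and test the data directly against the fitted CDF. The helper name ks_test_pareto, the choice floc=0 and the synthetic data are assumptions made for illustration, not part of the original answer. Because the parameters are estimated from the same data, the p-value tends to be optimistic and is best read as a rough guide.

```python
import numpy as np
from scipy import stats


def ks_test_pareto(values):
    """Fit a Pareto distribution (location fixed at 0) and run a one-sample KS test."""
    values = np.asarray(values, dtype=float)
    # Assumption: loc is fixed at 0, so only shape and scale are estimated.
    shape, loc, scale = stats.pareto.fit(values, floc=0)
    result = stats.kstest(values, "pareto", args=(shape, loc, scale))
    return shape, scale, result.statistic, result.pvalue


# Synthetic data standing in for the real observations (illustrative only).
rng = np.random.default_rng(0)
values = stats.pareto.rvs(2.5, scale=3.0, size=2000, random_state=rng)
shape_hat, scale_hat, d_stat, p_val = ks_test_pareto(values)
print(shape_hat, scale_hat, d_stat, p_val)
```

A small KS statistic with a large p-value means the test found no evidence against the Pareto fit. It does not prove that the data are Pareto distributed.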

Nobody can determine with certainty whether an observed dataset follows a particular distribution. For your situation, what you need is:
Fit the empirical distribution using:
statsmodels.distributions.empirical_distribution.ECDF
Then compare it (with a nonparametric test) against the candidate distribution to see whether the null hypothesis can be rejected.
For the 20/80 rule:
Rescale your X to the range [0, 1] and simply pick 0.2 on the x-axis.
source: https://arxiv.org/pdf/1306.0100.pdf
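If the raw values behind the plot are available, the 80-20 question can also be checked directly. The sketch below is one possible reading of the question, not the method of the answer above: it sorts the values, takes the largest 20 percent, and reports both the cut-off value and the share of the total they hold. The function name top_share and the toy data are hypothetical. For reference, an exact Pareto distribution gives an 80-20 split when its shape parameter is log(5)/log(4), roughly 1.16.

```python
import numpy as np


def top_share(values, top_fraction=0.2):
    """Return (threshold, share): the smallest value inside the top group
    and the fraction of the total sum held by that group."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    k = max(1, int(round(top_fraction * v.size)))       # size of the top group
    threshold = v[k - 1]
    share = v[:k].sum() / v.sum()
    return threshold, share


# Toy data: 2 of 10 holders own 32 of 40 units.
toy = [1] * 8 + [16, 16]
threshold, share = top_share(toy, 0.2)
print(threshold, share)
```

A share close to 0.8 at top_fraction=0.2 is what the 80-20 rule predicts. The returned threshold is the x-axis value that separates the top 20 percent from the rest.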

Related

How does scipy.stats distribution fitting work exactly?

I'm interested in the tail distribution of some given data, so I tried using scipy.stats to fit my data to a Gaussian, Generalized extreme value distribution, and a Generalized Pareto distribution.
This is what the data looks like:
Data Histogram
This is what I tried
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
data=df.loc[:,'X'] #df is the DataFrame that holds the data
v=np.ceil(np.log2(len(data))) + 1 #This criterion is Sturges' rule, we use this formula to calculate the "adequate" number of bins to visualize our data's distribution
y,x=np.histogram(data,bins=int(v),density=True) #"clustering" our data for the plot
plt.hist(data, bins=11, density=True)
plt.title("Histogram")
plt.show()
x = (x + np.roll(x, -1))[:-1] / 2.0 #This takes the mid point of every "bin" interval as the reference x-axis point for its corresponding y probability
# =============================================================================
# Fitting our data and plotting the PDFs
# =============================================================================
fit1=stats.genextreme.fit(data,floc=0) #The fit method finds the optimal parameters (using MLE) for your data fitting a chosen probability distribution
fit2=stats.norm.fit(data)
fit3=stats.genpareto.fit(data,floc=0)
fit4=stats.weibull_min.fit(data,floc=0)
fit5=stats.exponweib.fit(data,floc=0)
fit6=stats.gumbel_r.fit(data,floc=0)
fit7=stats.gumbel_l.fit(data,floc=0)
....
At first I got some strange results because I hadn't fixed the location parameter at 0 (floc=0); I still don't fully understand why.
What surprised me most, though, is that genextreme and weibull_min gave me different results, since I thought Weibull was a special case of the generalized extreme value distribution with positive shape parameter.
This is especially puzzling because the Weibull fit seems to work better here.
Here is the Weibull Fit:
Weibull Fit
And this is the GEV Fit:
GEV Fit
Actually the GEV Fit was similar to the Gumbel_r one:
Gumbel_r Fit
I read that one can deduce whether weibull_min or weibull_max should be used just from the shape of the data's histogram. How can one do that?
Since I am interested in extreme positive values (the tail of the distribution), shouldn't I be using weibull_max, since that is the limiting distribution of the maximum?
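One possible source of the mismatch, offered as a sketch rather than a diagnosis of this particular dataset: scipy's genextreme uses a shape parameter c whose sign is flipped relative to the convention used in many other sources, and for c > 0 its support is bounded above (x <= loc + scale/c). That case coincides with a shifted weibull_max, not weibull_min, which the following numerical check illustrates. The value c = 0.5 and the grid are arbitrary choices.

```python
import numpy as np
from scipy import stats

c = 0.5                           # genextreme shape, c > 0 (bounded upper tail)
x = np.linspace(-3.0, 1.9, 50)    # grid below the upper bound 1/c = 2

# Standard genextreme CDF with shape c.
gev_cdf = stats.genextreme.cdf(x, c)

# weibull_max with shape 1/c, shifted and scaled so its upper bound is 1/c.
wmax_cdf = stats.weibull_max.cdf(x, 1.0 / c, loc=1.0 / c, scale=1.0 / c)

same = np.allclose(gev_cdf, wmax_cdf)
print(same)
```

Under this reading, weibull_min fitted with floc=0 describes data bounded below at 0, while genextreme with c > 0 describes data bounded above, so the two fits are not expected to agree.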

How to make a sample from the empirical distribution function

I'm trying to implement nonparametric bootstrapping in Python. It requires taking a sample, building an empirical distribution function from it, and then generating a number of samples from this edf. How can I do it?
In scipy I only found how to create your own distribution if you know the exact formula describing it, but I only have an edf.
The edf you get by sorting the samples:
import numpy as np
N = samples.size
ss = np.sort(samples) # these are the x-values of the edf
# the y-values are 1/(2N), 3/(2N), 5/(2N) etc.
edf = lambda x: np.searchsorted(ss, x, side='right') / N # fraction of samples <= x
However, if you only want to resample then you simply draw from your sample with equal probability and replacement.
If this is too "steppy" for your liking, you can probably use some kind of interpolation to get a smooth distribution.
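The resampling step described above can be sketched as follows. The function name bootstrap_resample, the seed and the example data are assumptions for illustration; the only technique used is drawing from the sample with equal probability and with replacement.

```python
import numpy as np


def bootstrap_resample(samples, n_boot=1000, seed=None):
    """Return an (n_boot, N) array; each row is one bootstrap resample."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    # Equal probability, with replacement: sampling from the empirical distribution.
    return rng.choice(samples, size=(n_boot, samples.size), replace=True)


# Illustrative data and a bootstrap estimate of the standard error of the mean.
data = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8])
resamples = bootstrap_resample(data, n_boot=500, seed=0)
boot_means = resamples.mean(axis=1)
print(boot_means.std(ddof=1))
```

Any statistic can be substituted for the mean by applying it along axis 1 of the resampled array.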

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
import matplotlib.pyplot as plt
param = stats.expon.fit(x)
grid = np.linspace(0, 100, 10000)
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))
It is very elegant to calculate the Kolmogorov-Smirnov test.
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal probability bins:
Estimate the parameters of the distribution.
Use the inverse cdf (ppf if it's a scipy.stats distribution) to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
Then use np.histogram to count the number of observations in each bin.
Then use the chisquare test on the frequencies.
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.
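The equal-probability-bin recipe above can be sketched as follows. This is a minimal illustration under assumptions, not a library-provided test: the helper name chisquare_equal_prob_bins is hypothetical, observations are counted with np.searchsorted on the interior bin edges (to avoid infinite outer edges), and ddof is set to the number of fitted parameters, which inherits the caveat about how the parameters were estimated.

```python
import numpy as np
from scipy import stats


def chisquare_equal_prob_bins(x, dist, n_bins=20):
    """Approximate chi-squared GoF test using bins of equal fitted probability."""
    x = np.asarray(x, dtype=float)
    params = dist.fit(x)                                   # MLE on the raw data
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]        # interior probabilities
    interior_edges = dist.ppf(probs, *params)
    # Bin index of each observation, in 0 .. n_bins - 1.
    idx = np.searchsorted(interior_edges, x)
    observed = np.bincount(idx, minlength=n_bins)
    expected = np.full(n_bins, x.size / n_bins)            # equal counts expected
    stat, pvalue = stats.chisquare(observed, f_exp=expected, ddof=len(params))
    return stat, pvalue, observed


rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=5000)   # synthetic data for illustration
stat, pvalue, observed = chisquare_equal_prob_bins(x, stats.expon, n_bins=20)
print(stat, pvalue)
```

With equal-probability bins every expected count is the same, which keeps the expected frequencies comfortably large in the tails.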
I haven't looked into this in a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential and that the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
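As a small sketch of the qq-plot suggestion, scipy.stats.probplot computes the theoretical and ordered sample quantiles together with a least-squares line through them. The synthetic exponential data and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=5000)   # synthetic sample

# osm: theoretical quantiles, osr: ordered sample values.
# slope, intercept and r describe the least-squares line through the QQ points.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="expon")
print(slope, intercept, r)
```

Passing plot=plt (with matplotlib.pyplot imported as plt) to probplot draws the plot; points that bend away from the line in the tails indicate where the fit breaks down.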
I tried your problem with OpenTURNS.
Beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the Factory needs an ot.Sample() as input, I had to reformat x and reshape it as 10,000 points of dimension 1.
Let's now assess this fitting using ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)

Generating random number for a distribution of a real data?

I have a set of real data and I want to use it to find a probability distribution, and then use that distribution to generate random points according to its pdf. A sample of my data set is as follows:
#Mag Weight
21.9786 3.6782
24.0305 6.1120
21.9544 4.2225
23.9383 5.1375
23.9352 4.6499
23.0261 5.1355
23.8682 5.9932
24.8052 4.1765
22.8976 5.1901
23.9679 4.3190
25.3362 4.1519
24.9079 4.2090
23.9851 5.1951
22.2094 5.1570
22.3452 5.6159
24.0953 6.2697
24.3901 6.9299
24.1789 4.0222
24.2648 4.4997
25.3931 3.3920
25.8406 3.9587
23.1427 6.9398
21.2985 7.7582
25.4807 3.1112
25.1935 5.0913
25.2136 4.0578
24.6990 3.9899
23.5299 4.6788
24.0880 7.0576
24.7931 5.7088
25.1860 3.4825
24.4757 5.8500
24.1398 4.9842
23.4947 4.4730
20.9806 5.2717
25.9470 3.4706
25.0324 3.3879
24.7186 3.8443
24.3350 4.9140
24.6395 5.0757
23.9181 4.9951
24.3599 4.1125
24.1766 5.4360
24.8378 4.9121
24.7362 4.4237
24.4119 6.1648
23.8215 5.9184
21.5394 5.1542
24.0081 4.2308
24.5665 4.6922
23.5827 5.4992
23.3876 6.3692
25.6872 4.5055
23.6629 5.4416
24.4821 4.7922
22.7522 5.9513
24.0640 5.8963
24.0361 5.6406
24.8687 4.5699
24.8795 4.3198
24.3486 4.5305
21.0720 9.5246
25.2960 3.0828
23.8204 5.8605
23.3732 5.1161
25.5097 2.9010
24.9206 4.0999
24.4140 4.9073
22.7495 4.5059
24.3394 3.5061
22.0560 5.5763
25.4404 5.4916
25.4795 4.4089
24.1772 3.8626
23.6042 4.7476
23.3537 6.4804
23.6842 4.3220
24.1895 3.6072
24.0328 4.3273
23.0243 5.6789
25.7042 4.4493
22.1983 6.1868
22.3661 5.9132
20.9426 4.8079
20.3806 10.1128
25.0105 4.4296
23.6648 6.6482
25.2780 4.4933
24.6870 4.4836
25.4565 4.0990
25.0415 3.9384
24.6098 4.6057
24.7796 4.2042
How could I do this? My first attempt was to fit a polynomial to the binned data and find the probability distribution of weights in each magnitude bin, but I reckon there might be a smarter way to do it, for instance using scipy.stats.rv_continuous to sample data from the given distribution. However, I don't know how that works and there are not enough examples.
Update:
As I got a lot of comments to use KDE, I used scipy.stats.gaussian_kde and I got the following results.
I am wondering whether this is a good probability distribution to represent my data. First, how could I test it? Second, is it possible to fit more than one Gaussian KDE with scipy.stats?
(1) If you have an idea about the distribution from which these data are sampled, then fit that distribution to the data (i.e., adjust parameters via maximum likelihood or whatever) and then sample that.
(2) For more nearly empirical approach, select one datum at random (with equal probability) and then pretend it is the center of a little Gaussian bump, and sample from that bump. This is equivalent to constructing a kernel density estimate and sampling from that. You will have to pick a standard deviation for the bumps.
(3) For an entirely empirical approach, select one datum at random (with equal probability). This is equivalent to assuming the empirical distribution is the same as the actual distribution.
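Approaches (2) and (3) can be sketched with scipy.stats.gaussian_kde and plain index sampling. The arrays below are the first ten rows of the data listed in the question; the number of draws, the seed and the default bandwidth are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# First ten (Mag, Weight) rows from the question's data.
mag = np.array([21.9786, 24.0305, 21.9544, 23.9383, 23.9352,
                23.0261, 23.8682, 24.8052, 22.8976, 23.9679])
weight = np.array([3.6782, 6.1120, 4.2225, 5.1375, 4.6499,
                   5.1355, 5.9932, 4.1765, 5.1901, 4.3190])
data = np.vstack([mag, weight])          # shape (2, n_points)

# (2) Kernel density estimate: each draw is a datum plus Gaussian noise.
kde = stats.gaussian_kde(data)           # default bandwidth (Scott's rule)
kde_draws = kde.resample(500, seed=0)    # shape (2, 500)

# (3) Purely empirical: pick existing rows with equal probability.
rng = np.random.default_rng(0)
idx = rng.integers(0, data.shape[1], size=500)
empirical_draws = data[:, idx]           # shape (2, 500)
```

The bandwidth plays the role of the "standard deviation for the bumps"; it can be changed through the bw_method argument of gaussian_kde.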
What is this data representing?
SciPy won't help you decide what type of distribution to use. That choice is motivated by where your data comes from. Once you decide on a distribution (or you could try several), you can use something like scipy.optimize.curve_fit on your binned data, or the distribution's own fit method, to find the parameters to feed into the corresponding scipy.stats distribution so that its pdf matches your data. Then use that scipy continuous random variable to generate new points from your distribution.
Also, a polynomial is not a probability density function since it is not normalized (integral over all x diverges). Polynomial fits are not going to help you, as far as I know.
Did you try creating a histogram of the data? That will give you a sense of the shape of the density function, at which point you can try fitting the data to a known distribution. Once you have a fitted distribution, you can generate pseudo-random variates and, as a 'sanity check', perform a nonparametric test such as the Kolmogorov-Smirnov test.
So, I would take the following steps:
Create a histogram
Determine characteristics of the data (summary stats, etc.).
Try to fit to parametric distribution.
Try to fit to nonparametric distribution.
Conduct hypothesis tests to rate the fit.
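The parametric part of these steps can be sketched as a loop over candidate distributions, each fitted by maximum likelihood and rated by its KS statistic. The helper name rank_fits, the candidate list and the synthetic data are assumptions; since the parameters are estimated from the same data, the p-values are only indicative.

```python
import numpy as np
from scipy import stats


def rank_fits(data, candidates):
    """Fit each candidate by MLE and sort by KS statistic (smaller is better)."""
    data = np.asarray(data, dtype=float)
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)
        ks = stats.kstest(data, name, args=params)
        results.append((name, ks.statistic, ks.pvalue))
    return sorted(results, key=lambda item: item[1])


rng = np.random.default_rng(0)
sample = rng.gamma(shape=5.0, scale=1.0, size=500)   # synthetic stand-in data
ranking = rank_fits(sample, ["norm", "expon", "gamma"])
for name, d_stat, p_val in ranking:
    print(name, round(d_stat, 4), round(p_val, 4))
```

New points can then be drawn from the chosen distribution with its rvs method, using the fitted parameters.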

Scipy - Inverse Sampling Method from custom probability density function

I am trying to perform inverse sampling from a custom probability density function (PDF). I am wondering whether this is even possible, i.e. integrating the PDF, inverting the result, and then solving it for a given uniform random number. The PDF has the form f(x; alpha, mean(x)) = (1/(Gamma(alpha+1)*x)) * ((alpha+1)*x/mean(x))^(alpha+1) * exp(-(alpha+1)*x/mean(x)), where x > 0. From the shape, only values below 150 are relevant, and for what I am trying to do values below 80 are good enough. Extending the range shouldn't be too hard, though.
I have tried the inversion method, but only found a numerical way to do the integral, which isn't necessarily helpful considering that I need to invert the function to solve:
u = integral(f(x, alpha, mean(x))dx) from 0 to y, where y is unknown and u is uniform random variable between 0 and 1.
The integral has a gamma function and an incomplete gamma function, so trying to invert it is kind of a mess. Any help is welcome.
Thanks a bunch in advance.
Cheers
Assuming you mean that you're trying to randomly choose values which will be distributed according to your PDF, then yes, it is possible. This is described on Wikipedia as inverse transform sampling. Basically, it's just what you said: integrate the PDF to produce the cumulative distribution (CDF), invert it (which can be done ahead of time), and then choose a random number and run it through the inverted CDF.
If your domain is 0 to positive infinity, your distribution appears to match the gamma distribution, which is built into NumPy and SciPy, with k = alpha+1 and theta = mean(x)/(alpha+1).
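A minimal sketch of inverse transform sampling for this case, assuming the PDF in the question is read as a gamma density with shape alpha+1 and scale mean(x)/(alpha+1) (so that the distribution's mean equals mean(x)). The function name and the parameter values are hypothetical. scipy.stats.gamma.ppf is the inverted CDF, so no manual inversion of the incomplete gamma function is needed.

```python
import numpy as np
from scipy import stats


def sample_by_inverse_cdf(alpha, mean_x, size, seed=None):
    """Draw samples by pushing uniform numbers through the inverse gamma CDF."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=size)            # u ~ Uniform(0, 1)
    # Assumption: shape k = alpha + 1, scale theta = mean_x / (alpha + 1).
    return stats.gamma.ppf(u, a=alpha + 1.0, scale=mean_x / (alpha + 1.0))


draws = sample_by_inverse_cdf(alpha=2.0, mean_x=30.0, size=20000, seed=0)
print(draws.mean())
```

Calling stats.gamma.rvs with the same shape and scale would produce draws from the same distribution directly; the explicit ppf call only makes the inverse-transform step visible.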
