I am trying to fit a Gamma CDF using scipy.stats.gamma, but I do not know what exactly the a parameter is, or how the location and scale parameters are calculated. Different references give different ways to calculate them, and it's very frustrating. I am using the code below, which is not giving the correct CDF. Thanks in advance.
import numpy as np
from scipy.stats import gamma
loc = (np.mean(jan))**2/np.var(jan)
scale = np.var(jan)/np.mean(jan)
Jancdf = gamma.cdf(jan, a, loc=loc, scale=scale)
a is the shape parameter. The formulas you have tried work only in the case where loc = 0. We start with two examples, both with shape (a) = 10 and scale = 5; the second, d1plus50, differs from the first by 50, and you can see the shift, which is dictated by loc:
from scipy.stats import gamma
import matplotlib.pyplot as plt
import numpy as np
d1 = gamma.rvs(a=10, scale=5, size=1000, random_state=99)
plt.hist(d1, bins=50, label='loc=0,shape=10,scale=5', density=True)
d1plus50 = gamma.rvs(a=10, loc=50, scale=5, size=1000, random_state=99)
plt.hist(d1plus50, bins=50, label='loc=50,shape=10,scale=5', density=True)
plt.legend(loc='upper right')
So you have 3 parameters to estimate from the data. One way is to use gamma.fit; here we apply it to the distribution simulated with loc = 0:
xlin = np.linspace(0,160,50)
fit_shape, fit_loc, fit_scale=gamma.fit(d1)
print([fit_shape, fit_loc, fit_scale])
[11.135335235456457, -1.9431969603988053, 4.693776771991816]
plt.hist(d1,bins=50,label='loc=0,shape=10,scale=5',density=True)
plt.plot(xlin, gamma.pdf(xlin, a=fit_shape, loc=fit_loc, scale=fit_scale))
And if we do the same for the distribution simulated with a nonzero loc, you can see that loc is estimated correctly, as well as shape and scale:
fit_shape, fit_loc, fit_scale=gamma.fit(d1plus50)
print([fit_shape, fit_loc, fit_scale])
[11.135287555530564, 48.05688649976989, 4.693789434095116]
plt.hist(d1plus50, bins=50, label='loc=50,shape=10,scale=5', density=True)
plt.plot(xlin,gamma.pdf(xlin,a=fit_shape,loc = fit_loc, scale = fit_scale))
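As a side note, if you know loc is 0 you can hold it fixed during fitting with the floc keyword, and the method-of-moments formulas from the question then give the shape and scale directly. A quick sketch using d1 from above; note that mean²/var estimates the shape a, not loc:

import numpy as np
# hold loc fixed at 0; only shape and scale are estimated
fix_shape, fix_loc, fix_scale = gamma.fit(d1, floc=0)
print([fix_shape, fix_loc, fix_scale])
# method of moments with loc = 0: a = mean^2/var, scale = var/mean
a_mm = np.mean(d1)**2 / np.var(d1)
scale_mm = np.var(d1) / np.mean(d1)
print([a_mm, scale_mm])  # should be close to the true a = 10, scale = 5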
Given the dataframe:
Brick_cp = pd.DataFrame({"CP":Brick_cp})
which corresponds to this distribution:
sns.distplot(Brick_cp, fit = stats.norm)
[plot: distribution of Brick_cp with fitted normal]
I then fit a normal distribution based on the values:
loc, scale = stats.norm.fit(Brick_cp.astype(float))
# Out[]: (911.1121589743589, 63.42365993765692)
#PROBABILITY DENSITY FUNCTION (PDF)
x = np.linspace(start=600, stop=1200, num=100)
pdf = stats.norm.pdf(x, loc=loc, scale=scale)
[plot: PDF]
which has the corresponding CDF:
cdf = stats.norm.cdf(x, loc=loc, scale=scale)
[plot: CDF]
Finally I create the PERCENT POINT FUNCTION (PPF), the inverse of the CDF:
cdf_ = np.linspace(start=0, stop=1, num=10000)
x_ = stats.norm.ppf(cdf_, loc=loc, scale=scale)
[plot: PPF]
The aim is to generate a predefined number of random values drawn from this distribution. To do this I thought of generating uniform random values between 0 and 1, feeding them into the PPF, and reading off the corresponding values on the abscissa. Currently I do it this way:
v = np.random.uniform(0, 1, 1000)
f = lambda x1: np.interp(x1, cdf_, x_)  # invert the CDF by interpolating the tabulated PPF
brick_cp_value = f(v)
I would like to ask whether there is an easier way to do random sampling in scipy, and whether the method I am using is correct. Unfortunately I am a beginner. Thanks.
Edit: I also tried this method:
random_samples = stats.norm.rvs(loc, scale, size=1000)
Sampling from a Gaussian is a very common task, so there is a simple way to do it given the mean (loc) and standard deviation (scale) of the pdf, e.g. with numpy.random.normal():
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
Brick_cp = pd.DataFrame({"CP":Brick_cp})
sns.distplot(Brick_cp, fit = stats.norm)
loc, scale = stats.norm.fit(Brick_cp.astype(float))
random_samples = np.random.normal(loc, scale, size=1000)
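For completeness, the same draw can be done with scipy directly (this is exactly what your edit does), and your PPF approach is standard inverse-transform sampling, which works without the interpolation table. A small sketch reusing loc and scale from the fit above:

# scipy equivalent of np.random.normal
random_samples = stats.norm.rvs(loc=loc, scale=scale, size=1000)

# inverse-transform sampling: push uniform draws through the exact PPF
u = np.random.uniform(0, 1, 1000)
samples_ppf = stats.norm.ppf(u, loc=loc, scale=scale)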
I have a data distribution that I want to fit a Poisson distribution to. My data looks like this:
I try to fit:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st

mu = herd_size["COW_NUM"].mean()
ax = sns.displot(data=herd_size["COW_NUM"], kde=True)
ax.set(xlabel='Size', title='Herd size distribution & poisson distribution')
plt.plot(np.arange(0, 2000, 80),
         [st.poisson.pmf(np.arange(i, i+80), mu).sum()*len(herd_size["COW_NUM"])
          for i in np.arange(0, 2000, 80)],
         color='red')  # each plotted point aggregates the pmf over a width-80 bin
plt.show()
but I get something that is not on the same scale:
UPDATE
I tried to apply a negative binomial distribution with this code:
n = len(herd_size["COW_NUM"])
p = herd_size["COW_NUM"].mean()/(herd_size["COW_NUM"].mean() + 2)
ax = sns.displot(data=herd_size["COW_NUM"], kde=True)
ax.set(xlabel='Size', title='Herd size distribution & negative binomial distribution')
plt.plot(np.arange(0, 2000, 80),
         [st.nbinom.pmf(np.arange(i, i+80), n, p).sum()*len(herd_size["COW_NUM"])
          for i in np.arange(0, 2000, 80)],
         color='red')  # each plotted point aggregates the pmf over a width-80 bin
plt.show()
but I got this:
[plot: negative binomial fit]
For what you need to plot, it might be easier to provide the bins when making your histogram:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import poisson
herd_size = pd.DataFrame({'COW_NUM':np.random.poisson(200,2000)})
binwidth = 10
xstart = 150
xend = 280
bins = np.arange(xstart,xend,binwidth)
o = sns.histplot(data=herd_size["COW_NUM"], kde=True,bins = bins)
Then calculate your mean and total number:
mu = herd_size["COW_NUM"].mean()
n = len(herd_size)
The expected frequency in each bin is n times the difference between the CDF evaluated at the bin's right edge and at its left edge:
plt.plot(bins + binwidth/2 , n*(poisson.cdf(bins+binwidth,mu) - poisson.cdf(bins,mu)), color='red')
Your data is overdispersed: for a Poisson the variance equals the mean, so you don't expect the data to be so spread out. What you need to do is use a gamma or a negative binomial to fit it, for example:
from scipy.stats import nbinom

# simulate overdispersed count data
herd_size = pd.DataFrame({'COW_NUM': nbinom.rvs(n=1, p=0.004, size=2000)})
binwidth = 50
xstart = 0
xend = 2000
bins = np.arange(xstart, xend, binwidth)
Var = herd_size["COW_NUM"].var()
mu = herd_size["COW_NUM"].mean()
# method of moments: for nbinom, mean = r*(1-p)/p and var = r*(1-p)/p**2
p = mu/Var
r = mu**2 / (Var - mu)
n = len(herd_size)
o = sns.histplot(data=herd_size["COW_NUM"], kde=True,bins=bins)
plt.plot(bins + binwidth/2 ,
n*(nbinom.cdf(bins+binwidth,r,p) - nbinom.cdf(bins,r,p)),
color='red')
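As a sanity check, the method-of-moments estimates above should roughly recover the parameters used in the simulation (values will vary with the random draw):

# nbinom has mean = r*(1-p)/p and var = r*(1-p)/p**2,
# so p = mean/var and r = mean**2/(var - mean)
print(p, r)  # should be roughly p = 0.004 and r = 1, as passed to nbinom.rvs above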
Your plot is (at least approximately) correct; the problem is with modeling your data as Poisson. As lambda grows large, the Poisson looks more and more like a normal distribution (see this plot from Wikipedia). A Poisson distribution has its variance equal to its mean, so with a mean of ~240 you have a standard deviation of ~15.5. The net result is that outcomes for a Poisson(240) should overwhelmingly fall between 210 and 270, which is what your red plot shows. Try fitting a different distribution to your data.
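A quick numeric check of that claim:

import numpy as np
lam = 240
sd = np.sqrt(lam)              # ~15.5, since a Poisson's variance equals its mean
print(lam - 3*sd, lam + 3*sd)  # ~[194, 286]: nearly all the mass of Poisson(240)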
I just spotted StupidWolf's answer. Other than using a mean of 200 rather than 240, his histogram shows the same behavior described above.
Problem
A paper I am reading defines a new metric, and the authors claim some advantages over previous metrics. They verify their claim with synthetic data, which looks like the following:
The implementation of their metric is pretty straightforward. However, I am not sure how they create this kind of synthetic data.
What I Have Done
This looks like a Gaussian where x lies only within certain intervals. I tried the following code but did not get anything similar to the plot presented in the paper.
import numpy as np
def generate_gaussian(size=1000, lb=-0.1, up=0.1):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 0.3
background_neg = base + 0.7
Now I am wondering whether the authors created these data using some special distribution (other than a Gaussian) that I do not know about.
NumPy has numpy.random.normal, which draws random samples from a normal (Gaussian) distribution.
import numpy as np
import matplotlib.pyplot as plt
sigma = 0.05
s0 = np.random.normal(0.2, sigma, 5000)
s1 = np.random.normal(0.6, sigma, 5000)
plt.hist(s0, 300, density=True, color="b")
plt.hist(s1, 300, density=True, color="r")
plt.xlim(0, 1)
plt.show()
You can change the values of mu (mean) and sigma (standard deviation) to alter the distributions:
mu = 0.55
sigma = 0.1
dist = np.random.normal(mu, sigma, 5000)
You have cut off the data at ±0.1. A normalised Gaussian distribution only 'looks Gaussian' if you look over a range of approximately ±3. Try this:
import numpy as np
def generate_gaussian(size=1000, lb=-3, up=3):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 5
background_neg = base + 15
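To see the two resulting bumps side by side, a minimal plotting sketch (assuming matplotlib):

import matplotlib.pyplot as plt
plt.hist(background_pos, bins=50, alpha=0.5, label='background_pos')
plt.hist(background_neg, bins=50, alpha=0.5, label='background_neg')
plt.legend()
plt.show()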
You can use scipy.stats.norm.
Import libraries:
>>> from scipy.stats import norm
>>> from matplotlib import pyplot
Plot:
>>> pyplot.hist(norm.rvs(loc=1, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_1')
>>> pyplot.hist(norm.rvs(loc=5, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_2')
>>> pyplot.legend()
>>> pyplot.show()
Clarification:
A normal distribution is defined by its mean (loc, the center of the distribution) and standard deviation (scale, a measure of its dispersion or width). rvs generates random samples from the desired normal distribution, of size size. For example, the following code generates 4 random samples from a normal distribution (mean = 1, SD = 1):
>>> norm.rvs(loc=1, scale=1, size=4)
array([ 0.52154255, 1.40873701, 1.55959291, -0.01730568])
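With a larger sample you can verify that the draws match the requested parameters (a quick check; exact values will vary):

>>> samples = norm.rvs(loc=1, scale=1, size=100000)
>>> samples.mean(), samples.std()  # both should be close to 1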
I am trying to plot a normal distribution curve using Python. First I did it manually using the normal probability density function, and then I found there's an existing function, pdf, in scipy's stats module. However, the results I get are quite different.
Below is the example that I tried:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
mean = 5
std_dev = 2
num_dist = 50
# Draw random samples from a normal (Gaussian) distribution
normalDist_dataset = np.random.normal(mean, std_dev, num_dist)
# Sort these values.
normalDist_dataset = sorted(normalDist_dataset)
# Create the bins and histogram
plt.figure(figsize=(15,7))
count, bins, ignored = plt.hist(normalDist_dataset, num_dist, density=True)
new_mean = np.mean(normalDist_dataset)
new_std = np.std(normalDist_dataset)
normal_curve1 = stats.norm.pdf(normalDist_dataset, new_mean, new_std)
normal_curve2 = (1/(new_std *np.sqrt(2*np.pi))) * (np.exp(-(bins - new_mean)**2 / (2 * new_std**2)))
plt.plot(normalDist_dataset, normal_curve1, linewidth=4, linestyle='dashed')
plt.plot(bins, normal_curve2, linewidth=4, color='y')
The result shows how the two curves I get are very different from each other.
My guess is that it has something to do with the bins, or that pdf behaves differently from the usual formula. I have used the same mean and standard deviation for both plots. So, how do I change my code to match what stats.norm.pdf is doing?
I don't know yet which curve is correct.
The plot function simply connects the dots with line segments, and your bins do not contain enough dots to show a smooth curve. A possible solution:
....
normal_curve1 = stats.norm.pdf(normalDist_dataset, new_mean, new_std)
bins = normalDist_dataset # Add this line
normal_curve2 = (1/(new_std *np.sqrt(2*np.pi))) * (np.exp(-(bins - new_mean)**2 / (2 * new_std**2)))
....
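Alternatively, evaluate both curves on a dense, evenly spaced grid so the smoothness does not depend on the 50 random sample points (a sketch reusing the names from the question):

xs = np.linspace(min(normalDist_dataset), max(normalDist_dataset), 500)
plt.plot(xs, stats.norm.pdf(xs, new_mean, new_std), linewidth=4, linestyle='dashed')
plt.plot(xs, (1/(new_std*np.sqrt(2*np.pi))) * np.exp(-(xs - new_mean)**2 / (2*new_std**2)),
         linewidth=4, color='y')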
I have a distribution
This one looks pretty Gaussian, and we also can't reject that idea, given the high p-value from the KS test.
BUT the test distribution is actually also a generated one with a finite sample size, not the CDF itself, as you'll notice in the code. So that's kind of cheating, compared to using the CDF for a smooth Gaussian function.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test
data = np.concatenate((d1,d2))
xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))
# lets try the normal distribution first
m, s = stats.norm.fit(data) # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color="lightgrey", density=True, bins=50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it
# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))
plt.show()
But I can't figure out how to use the default one-sample KS test, that is
KS_D, KS_p = stats.kstest(data, "norm")
as it always returns a p-value of 0, i.e. my Gaussian data must be in the wrong format.
How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid usage, or more incorrect than testing against the continuous CDF for the distribution?
"norm" uses a normal distribution that defaults to zero mean and standard deviation 1. Your data have values m and s for those, which are quite different. The test is telling you your data are very different from this standard reference distribution.
You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:
data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")
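Alternatively, kstest accepts the fitted parameters through its args argument, which is equivalent to normalizing by hand:

KS_D, KS_p = stats.kstest(data, "norm", args=(m, s))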