Given the dataframe:
Brick_cp = pd.DataFrame({"CP":Brick_cp})
which corresponds to this distribution:
sns.distplot(Brick_cp, fit = stats.norm)
[plot: distribution of Brick_cp with fitted normal curve]
I then create a normal function based on the values:
loc, scale = stats.norm.fit(Brick_cp.astype(float))
# Out: (911.1121589743589, 63.42365993765692)
#PROBABILITY DENSITY FUNCTION (PDF)
x = np.linspace(start=600, stop=1200, num=100)
pdf = stats.norm.pdf(x, loc=loc, scale=scale)
[plot: fitted normal PDF]
To which corresponds the CDF:
cdf = stats.norm.cdf(x, loc=loc, scale=scale)
[plot: fitted normal CDF]
Finally I create the PERCENT POINT FUNCTION (PPF), the inverse of the CDF:
cdf_ = np.linspace(start=0, stop=1, num=10000)
x_ = stats.norm.ppf(cdf_, loc=loc, scale=scale)
[plot: PPF (inverse CDF)]
The aim is to generate a predefined number of random values drawn from this distribution. To do this, I thought of generating uniform random values between 0 and 1, feeding them into the PPF, and reading off the corresponding values on the abscissa. Currently I do it this way:
v = np.random.uniform(0,1,1000)
f = lambda x1: np.interp(x1, cdf_, x_)
brick_cp_value = f(v)
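For what it's worth, the same inverse-transform idea can be written directly with the exact PPF instead of the interpolation grid; a minimal sketch, reusing the loc and scale from the fit above:
v = np.random.uniform(0, 1, 1000)
brick_cp_value = stats.norm.ppf(v, loc=loc, scale=scale)  # inverse-transform sampling via the exact PPF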
I would like to ask whether there is an easier way to do random sampling in scipy, and whether the method I am using is correct. Unfortunately I am a beginner. Thanks.
Edit: I also tried this method:
random_samples = stats.norm.rvs(loc, scale, size=1000)
Sampling from a Gaussian is a very common task, so there is a simple way to do it given the mean (loc) and standard deviation (scale) of the PDF, e.g. with numpy.random.normal():
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
Brick_cp = pd.DataFrame({"CP":Brick_cp})
sns.distplot(Brick_cp, fit = stats.norm)
loc, scale = stats.norm.fit(Brick_cp.astype(float))
random_samples = np.random.normal(loc, scale, size=1000)
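Equivalently, you can sample with scipy's rvs method or NumPy's newer Generator API; a minimal sketch, assuming the same fitted loc and scale:
# scipy equivalent of the line above
random_samples = stats.norm.rvs(loc=loc, scale=scale, size=1000)
# modern NumPy Generator API
rng = np.random.default_rng()
random_samples = rng.normal(loc, scale, size=1000)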
Problem Statement - A random variable X is N(25, 4). Find the indicated percentile for X:
a. The 10th percentile
b. The 90th percentile
c. The 80th percentile
d. The 50th percentile
Attempt 1
My code:
import numpy as np
import math
import scipy.stats
mu=25
sigma=4
a=mu-(1.282*4)
b=mu+(1.282*4)
... and so on. I got the z-values from the Z-score table given in
https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/bs704_probability10.html
Attempt 2
X = np.random.normal(25, 4, 10000)  # sample size not mentioned in the problem; I just assumed it
a_9 = np.percentile(X,10)
b_9 = np.percentile(X,90)
c_9 = np.percentile(X,80)
d_9 = np.percentile(X,50)
But the answers are incorrect as per the hidden test cases of the practice platform. Can anyone please tell me the right way to compute the answers? Is there any scipy.stats function for this?
You can use scipy.stats and its built-in ppf function (see the documentation):
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
mu = 25
sigma = 4
# define the normal distribution and PDF
dist = sps.norm(loc=mu, scale=sigma)
x = np.linspace(dist.ppf(.001), dist.ppf(.999))
y = dist.pdf(x)
# calculate PPFs
ppfs = {}
for ppf in [.1, .5, .8, .9]:
    p = dist.ppf(ppf)
    ppfs.update({ppf*100: p})
# plot results
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(x, y, color='k')
for i, ppf in enumerate(ppfs):
    ax.axvline(ppfs[ppf], color=f'C{i}', label=f'{ppf:.0f}th: {ppfs[ppf]:.1f}')
ax.legend()
plt.show()
which gives: [plot: normal PDF with vertical lines marking the 10th, 50th, 80th, and 90th percentiles]
Use the ppf method from scipy.stats.norm (normal distribution).
scipy.stats.norm.ppf(0.1, loc=25, scale=4)
This function is analogous to the qnorm function in R. The ppf method gives the value of the random variable at the given percentile.
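For example, a short loop (assuming N(25, 4) means mean 25 and standard deviation 4, as in the question's code) reproduces all four percentiles:
from scipy.stats import norm

for p in (0.10, 0.90, 0.80, 0.50):
    print(f"{p:.0%} percentile: {norm.ppf(p, loc=25, scale=4):.2f}")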
a_9 = 19.87
b_9 = 30.13
c_9 = 28.37
d_9 = 25.00
Your Attempt 2 also works once the sample is large enough for the empirical percentiles to converge:
X = np.random.normal(25, 4, 10000000)
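As a sketch, you can check the empirical percentile of this large sample against the exact ppf value:
from scipy.stats import norm
print(np.percentile(X, 10))             # empirical 10th percentile, ~19.87
print(norm.ppf(0.10, loc=25, scale=4))  # exact value: 19.8738...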
I am trying to fit a gamma CDF using scipy.stats.gamma, but I do not know what exactly the a parameter is or how the location and scale parameters are calculated. Different sources give different ways to calculate them, which is very frustrating. I am using the code below, which does not give the correct CDF. Thanks in advance.
from scipy.stats import gamma
loc = (np.mean(jan))**2/np.var(jan)
scale = np.var(jan)/np.mean(jan)
Jancdf = gamma.cdf(jan,a,loc = loc, scale = scale)
a is the shape parameter. What you have tried works only in the case where loc = 0. First we start with two examples with shape (a) = 10 and scale = 5; the second, d1plus50, differs from the first by 50, and you can see the shift, which is dictated by loc:
import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt
d1 = gamma.rvs(a=10, scale=5, size=1000, random_state=99)
plt.hist(d1, bins=50, label='loc=0, shape=10, scale=5', density=True)
d1plus50 = gamma.rvs(a=10, loc=50, scale=5, size=1000, random_state=99)
plt.hist(d1plus50, bins=50, label='loc=50, shape=10, scale=5', density=True)
plt.legend(loc='upper right')
So you have three parameters to estimate from the data. One way is to use gamma.fit; we apply this to the distribution simulated with loc = 0:
xlin = np.linspace(0,160,50)
fit_shape, fit_loc, fit_scale=gamma.fit(d1)
print([fit_shape, fit_loc, fit_scale])
[11.135335235456457, -1.9431969603988053, 4.693776771991816]
plt.hist(d1,bins=50,label='loc=0,shape=10,scale=5',density=True)
plt.plot(xlin, gamma.pdf(xlin, a=fit_shape, loc=fit_loc, scale=fit_scale))
And if we do it for the distribution we simulated with a nonzero loc, you can see that loc is estimated correctly, as are shape and scale:
fit_shape, fit_loc, fit_scale=gamma.fit(d1plus50)
print([fit_shape, fit_loc, fit_scale])
[11.135287555530564, 48.05688649976989, 4.693789434095116]
plt.hist(d1plus50, bins=50, label='loc=50, shape=10, scale=5', density=True)
plt.plot(xlin, gamma.pdf(xlin, a=fit_shape, loc=fit_loc, scale=fit_scale))
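As a side note, if you already know some of the parameters, scipy's fit lets you pin them with the floc and fscale keywords so that only the remaining ones are estimated; for the loc = 0 sample above:
# pin loc at 0; only shape and scale are then estimated
fit_shape, fit_loc, fit_scale = gamma.fit(d1, floc=0)
print([fit_shape, fit_loc, fit_scale])  # fit_loc is exactly 0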
Problem
In a paper I am reading now, the authors define a new metric and claim some advantages over previous metrics. They verify their claim on synthetic data, which looks like the following: [figure from the paper]
The implementation of their metric is pretty straightforward. However, I am not sure how they create this kind of synthetic data.
What I Have Done
This looks like a Gaussian where x is restricted to certain intervals. I tried the following code but did not get anything similar to the plot presented in the paper.
import numpy as np
def generate_gaussian(size=1000, lb=-0.1, up=0.1):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 0.3
background_neg = base + 0.7
Now I am wondering whether the authors created these data using some special distribution (other than Gaussian) that I do not know about?
NumPy has numpy.random.normal, which draws random samples from a normal (Gaussian) distribution.
import numpy as np
import matplotlib.pyplot as plt
sigma = 0.05
s0 = np.random.normal(0.2, sigma, 5000)
s1 = np.random.normal(0.6, sigma, 5000)
plt.hist(s0, 300, density=True, color="b")
plt.hist(s1, 300, density=True, color="r")
plt.xlim(0, 1)
plt.show()
You can change the values of mu (mean) and sigma to alter the distributions:
mu = 0.55
sigma = 0.1
dist = np.random.normal(mu, sigma, 5000)
You have cut off the data at ±0.1. A normalised Gaussian distribution only 'looks Gaussian' if you look over a range of approximately ±3. Try this:
import numpy as np
def generate_gaussian(size=1000, lb=-3, up=3):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 5
background_neg = base + 15
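To see the effect, a quick histogram of the two shifted copies (a minimal sketch):
import matplotlib.pyplot as plt
plt.hist(background_pos, bins=50, alpha=0.5, label='background_pos')
plt.hist(background_neg, bins=50, alpha=0.5, label='background_neg')
plt.legend()
plt.show()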
You can use scipy.stats.norm (see the documentation).
Import libraries:
>>> from scipy.stats import norm
>>> from matplotlib import pyplot
Plot:
>>> pyplot.hist(norm.rvs(loc=1, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_1')
>>> pyplot.hist(norm.rvs(loc=5, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_2')
>>> pyplot.legend()
>>> pyplot.show()
Clarification:
A normal distribution is defined by its mean (loc, the distribution center) and standard deviation (scale, a measure of the distribution's dispersion or width). rvs generates random samples from the desired normal distribution, of size size. For example, the following code generates 4 random elements from a normal distribution (mean = 1, SD = 1):
>>> norm.rvs(loc=1, scale=1, size=4)
array([ 0.52154255, 1.40873701, 1.55959291, -0.01730568])
I have a distribution: [plot: histogram of the data with a fitted Gaussian curve]
This one looks pretty Gaussian, and with such a high p-value from the KS test we also can't reject that it is.
BUT the test distribution is itself generated with a finite sample size rather than being the CDF, as you'll notice in the code. So that's kind of cheating compared to testing against the CDF of a smooth Gaussian function.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test
data = np.concatenate((d1,d2))
xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))
# lets try the normal distribution first
m, s = stats.norm.fit(data) # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color="lightgrey", density=True, bins=50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it
# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))
plt.show()
But I can't figure out how to use the default one-sample KS test, that is
KS_D, KS_p = stats.kstest(data, "norm")
as it always returns a p-value of 0, i.e. my Gaussian data must be in the wrong format.
How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid usage, or more incorrect than testing against the continuous CDF for the distribution?
"norm" uses a normal distribution that defaults to be zero-mean, with standard deviation 1 [ref]. Your data have values m and s for that, which are quite different. It is telling you they are very different from this standard reference distribution.
You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:
data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")
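Alternatively, kstest accepts the distribution's parameters through its args keyword, which gives the same result without the manual normalization:
KS_D, KS_p = stats.kstest(data, "norm", args=(m, s))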
I was trying to fit a beta prime distribution to my data using Python. Since there is scipy.stats.betaprime.fit, I tried this:
import numpy as np
import math
import scipy.stats as sts
import matplotlib.pyplot as plt
N = 5000
nb_bin = 100
a = 12; b = 106; scale = 36; loc = -a/(b-1)*scale
y = sts.betaprime.rvs(a,b,loc,scale,N)
a_hat,b_hat,loc_hat,scale_hat = sts.betaprime.fit(y)
print('Estimated parameters: \n a=%.2f, b=%.2f, loc=%.2f, scale=%.2f'%(a_hat,b_hat,loc_hat,scale_hat))
plt.figure()
count, bins, ignored = plt.hist(y, nb_bin, density=True)
pdf_ini = sts.betaprime.pdf(bins,a,b,loc,scale)
pdf_est = sts.betaprime.pdf(bins,a_hat,b_hat,loc_hat,scale_hat)
plt.plot(bins, pdf_ini, 'g', linewidth=2.0, label='ini')
plt.grid()
plt.plot(bins, pdf_est, 'y', linewidth=2.0, label='est')
plt.legend()
plt.show()
It shows me the result that:
Estimated parameters:
a=9935.34, b=10846.64, loc=-90.63, scale=98.93
which is quite different from the original parameters, as the comparison of the PDFs shows: [plot: initial vs. estimated beta prime PDFs]
If I give the real values of loc and scale as inputs to the fit function, the estimate is much better. Has anyone already worked on this, or got a better solution?
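One option worth sketching, since scipy's fit accepts the floc and fscale keywords to pin parameters you already know: fix loc and scale so that only a and b are estimated:
# pin the known loc and scale during fitting; only a and b are free
a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y, floc=loc, fscale=scale)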