For the data yielding the histogram below, I used the gamma.fit(data) function. It returns (0.2856629839547822, 0.001612367540316285, 1.3126526078419007), which should be the shape (alpha), loc, and scale parameters of the distribution. However, the mean and standard deviation of the dataset are m=0.04181341484525036 and s=0.02581912984507876, and the PDF seems to be zero below the mean (m). I couldn't find any questions about this problem. What am I missing?
Histogram of the data
The PDF is most definitely not zero for all values below your calculated mean. In fact, over 40% of the PDF's area lies at x <= m:
from scipy import stats
# arguments in the order gamma.fit returns them: shape (alpha), loc, scale
g = stats.gamma(0.2856629839547822, 0.001612367540316285, 1.3126526078419007)
m = 0.04181341484525036
print(g.cdf(m))
0.4078
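If you want to see this visually, here is a minimal sketch that plots the fitted PDF around the mean (the original histogram data isn't reproduced here; note that with a shape parameter below 1 the gamma PDF actually diverges as x approaches loc, which may be why the histogram is misleading):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

g = stats.gamma(0.2856629839547822, 0.001612367540316285, 1.3126526078419007)
m = 0.04181341484525036

x = np.linspace(0.005, 0.2, 500)  # start just above loc, where the PDF blows up
plt.plot(x, g.pdf(x), label='fitted gamma PDF')
plt.axvline(m, linestyle='--', label='sample mean m')
plt.legend()
plt.show()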
I'm new to coding in Python, and want to estimate parameters for a data set that I know from theory is most likely t-distributed. The first method I tried was t.fit(). To double-check the results I also used st.stats.describe(), and noticed I got different results. I also used t.stats() to get the moments "mvsk". I'm not sure what the different functions do, and which results to trust. The parameters will later be used in a Monte Carlo simulation. Can somebody explain the different methods, and what I'm doing wrong?
import numpy as np
import scipy.stats as st
from scipy.stats import t
import pandas as pd

SP = pd.read_excel('S&P+sectors.xlsx',
                   parse_dates=['date'],
                   index_col='date')['.SPX']
rets = np.log(SP).diff()
rets = rets.dropna()

params = t.fit(rets)  # fit once and reuse: (df, loc, scale)
print("Parameters from t.fit: ", params, "\n")

d = st.stats.describe(rets)
print(d, "\n")
print("Standard Deviation from st.stats.describe : ", np.sqrt(d[3]), "\n")

mean, var, skew, kurt = t.stats(params[0], moments='mvsk',
                                loc=params[1], scale=params[2])
print("mean, std.dev, skew, kurt: ", mean, np.sqrt(var), skew, kurt)
Output:
Parameters from t.fit: (2.563005821560674, 0.0005384408493821172, 0.006945103287629065)
DescribeResult(nobs=4767, minmax=(-0.09469514468085727, 0.10957195934756658), mean=0.00011244654312862343, variance=0.00014599380983290917, skewness=-0.21364378793604263, kurtosis=8.494830112279583)
Standard Deviation from st.stats.describe : 0.012082789819942626
mean, std.dev, skew, kurt: 0.0005384408493821172 0.014818254946408262 nan nan
You can see that I get different means from t.fit() and st.stats.describe(). The standard deviation is different for all three, and the skewness and kurtosis are also different. Why is this?
There is no difference:
SQRT(0.00014599380983290917) = 0.01208278982
One is the variance, the other is the standard deviation.
OK, let's make it more descriptive.
The parameters from t.fit are what the fitter thinks is the best t-distribution curve to put over the set of sampled data.
DescribeResult reports the variance, not the standard deviation, so we take the square root of the variance to get the standard deviation: SQRT(0.00014599380983290917) = 0.01208278982. That matches the standard deviation you computed yourself. Please remember, those values (standard deviation, variance, mean) are computed from the sampled data.
On the last line you compute the DISTRIBUTION mean and standard deviation, most likely by applying formulas or doing numerical integration. They are ALWAYS different from the sample mean or sample standard deviation. Fitting tries to fit everything (all moments) at once, minimizing some error measure. It works the other way around as well: if you start with distribution parameters, compute the distribution mean and standard deviation, then draw a sample and compute the sample mean and standard deviation, they will differ from the distribution ones. Only with infinite sample size will the sample moments agree with the distribution moments.
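Here is a minimal sketch of that contrast (with synthetic data, since the original Excel file is not available; the true parameters are arbitrary):
import numpy as np
from scipy.stats import t

# Synthetic stand-in for `rets`.
rets = t.rvs(3, loc=0.0005, scale=0.007, size=5000, random_state=42)

df_, loc_, scale_ = t.fit(rets)
dist_mean, dist_var = t.stats(df_, loc=loc_, scale=scale_, moments='mv')

print("sample mean/std:      ", np.mean(rets), np.std(rets, ddof=1))
print("distribution mean/std:", dist_mean, np.sqrt(dist_var))
# Close, but not identical: sample moments converge to the distribution
# moments only as the sample size goes to infinity.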
I have the above distribution with a mean of -0.02, a standard deviation of 0.09, and a sample size of 13905.
I am just not sure why the distribution is left-skewed given the large sample size. The bin [-2.0, -0.5] contains only 10 samples/outliers, which explains the shape.
I am wondering whether it is possible to transform the data to make it smoother and closer to a normal distribution. The purpose is to feed it into a model while reducing the standard error of the predictor.
You have two options here: a Box-Cox transform or a Yeo-Johnson transform. The issue with the Box-Cox transform is that it applies only to positive numbers. To use it here, you would have to exponentiate the data, perform the Box-Cox transform, and then take the log to get the data back on the original scale. The Box-Cox transform is available in scipy.stats.
You can avoid those steps and simply use the Yeo-Johnson transform, which handles negative values directly. sklearn provides an API for that:
from scipy.stats import normaltest
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429,
                 -0.00142857, 0., 0., 0., 0.00142857, 0.00285714,
                 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571,
                 0.01428571, 0.01428571, 0.01428571, 0.01428571,
                 0.02142857, 0.07142857])

pt = PowerTransformer(method='yeo-johnson')
data = data.reshape(-1, 1)  # sklearn expects a 2-D (n_samples, n_features) array
pt.fit(data)
transformed_data = pt.transform(data)
We have transformed our data, but we need a way to check whether we have moved in the right direction. Since our goal was to move toward a normal distribution, we will use a normality test.
k2, p = normaltest(data)
transformed_k2, transformed_p = normaltest(transformed_data)
The test returns two values k2 and p. The value of p is of our interest here.
if p is greater than some threshold (ex 0.001 or so), we can say reject the hypothesis that data comes from a normal distribution.
In the example above, you'll see that p is greater than 0.001 while transformed_p is less than this threshold indicating that we are moving in the right direction.
I agree with the top answer, except for the last two paragraphs, because the interpretation of normaltest's output is flipped. Those paragraphs should instead read:
"The test returns two values k2 and p. The value of p is of our interest here.
if p is greater less than some threshold (ex 0.001 or so), we can say reject the null hypothesis that data comes from a normal distribution.
In the example above, you'll see that p is greater less than 0.001 while transformed_p is less greater than this threshold indicating that we are moving in the right direction."
Source: normaltest documentation.
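To see the corrected reading in code, here is a small follow-up sketch (reusing the p and transformed_p values computed in the answer above; alpha is just a hypothetical name for the threshold):
alpha = 0.001
# Small p => reject the null hypothesis that the sample is normal.
print("raw data looks normal:        ", p >= alpha)              # expect False
print("transformed data looks normal:", transformed_p >= alpha)  # expect True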
I have a figure as shown below, and I want to know whether it conforms to a Pareto distribution or not. It's a cumulative plot.
I also want to find the point on the x-axis that marks the 80-20 rule, i.e. the x-axis point that splits the plot so that 20 percent of the population holds 80 percent of the wealth.
Also, I'm really confused by the scipy.stats Pareto function; it would be great if someone could give an intuitive explanation, since the documentation is pretty confusing.
scipy.stats.pareto is the Pareto distribution object; its rvs method provides random draws from the distribution.
To know whether your data conforms to a Pareto distribution, you can perform a Kolmogorov-Smirnov test.
Draw a random sample from the Pareto distribution using pareto.rvs(shape, size=1000), where shape is the estimated shape parameter of your Pareto distribution, and use scipy.stats.kstest to perform the test:
pareto_smp = pareto.rvs(shape, size=1000)             # sample from the fitted Pareto
D, p_value = scipy.stats.kstest(pareto_smp, values)   # values = the observed data
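A self-contained sketch of that procedure might look like this (values stands in for the observed data and is synthetic here; ks_2samp is used for the two-sample comparison):
import numpy as np
from scipy.stats import pareto, ks_2samp

# Pretend observations, drawn from a known Pareto distribution.
values = pareto.rvs(2.5, size=1000, random_state=0)

shape, loc, scale = pareto.fit(values)  # estimate the parameters
pareto_smp = pareto.rvs(shape, loc=loc, scale=scale, size=1000, random_state=1)

D, p_value = ks_2samp(pareto_smp, values)  # two-sample KS test
print(D, p_value)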
Nobody can simply determine whether an observed dataset follows a particular distribution. Based on your situation, what you need to do is:
fit an empirical distribution using:
statsmodels.distributions.empirical_distribution.ECDF
then compare this (nonparametrically) with your data to see whether the null hypothesis can be rejected (see the sketch below)
for the 20/80 rule:
rescale your X to the range [0, 1] and simply pick 0.2 on the x-axis
source: https://arxiv.org/pdf/1306.0100.pdf
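A minimal sketch of the ECDF route (all data here is synthetic; the Pareto shape 1.16 is just the textbook value that gives roughly an 80/20 split):
import numpy as np
from scipy.stats import pareto
from statsmodels.distributions.empirical_distribution import ECDF

wealth = pareto.rvs(1.16, size=10000, random_state=0)  # synthetic "wealth" data

ecdf = ECDF(wealth)  # step function: ecdf(x) = fraction of observations <= x
print(ecdf(np.median(wealth)))  # ~0.5, sanity check

# 20/80 reading: the x value below which 80% of the population sits...
cut = np.quantile(wealth, 0.8)
# ...and the share of total wealth held by the remaining top 20%.
print(wealth[wealth >= cut].sum() / wealth.sum())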
Let's say I have a column x with uniformly distributed values.
To these values, I applied a CDF function.
Now I want to calculate the Gaussian copula, but I can't find the function in Python. I have already read that the Gaussian copula is something like the "inverse of the CDF function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian Copula process to normalize
an observation by applying z = Φ⁻¹(F(x)). Calculating F(x) yields a value u ∈ [0, 1]
representing the proportion of shaded area at the left. Then Φ⁻¹(u) yields a value z
by matching the shaded area in a Gaussian distribution.
I need your help: does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> transform all the values with this formula to new values
2) norm.ppf(array, loc, scale)
--> give the ppf function the mean, the std, and the array, and it will calculate the inverse of the CDF... but I doubt #2
The thing is,
n.cdf(n.ppf(0.95))
is not what I want. The idea behind this is to transform a non-normal/Gaussian distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found 2 links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In these posts it is said that you have to:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply the norm.cdf(data, mean, std) function, I get a uniformly distributed CDF.
Compare:
import numpy as np
from scipy.stats import norm

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = norm.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the second step:
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function /inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use, e.g., the norm.ppf function, the values are not reasonable.
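For reference, here is a minimal sketch of what that quoted two-step transform could look like. It uses an empirical CDF via ranks (scipy.stats.rankdata) instead of norm.cdf, since F must be the variable's own CDF, not a normal CDF; dividing by n+1 keeps u strictly inside (0, 1) so ppf stays finite:
import numpy as np
from scipy.stats import rankdata, norm

# The example data from above.
data = np.array([1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8)

u = rankdata(data) / (len(data) + 1)  # empirical F(x), strictly inside (0, 1)
z = norm.ppf(u)                       # Phi^-1(u): standard normal quantiles
print(z)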
I'm pretty new to PyMC and I'm desperately trying to infer the parameters of an underlying Gaussian distribution that best fits my observed data, not with a pre-built normal distribution, but with a more general method that uses histograms of simulated data to build the PDF. So far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points distributed normally (mean=5, sigma=2). I want to retrieve these values (mean, sigma) with Bayesian inference (using MCMC).
I have a data simulator that generates, for each iteration of the MCMC process, a normal distribution of 5000 points with a random mean and sigma (uniform priors).
From the simulated distribution of points I compute a numpy histogram normalized to 1, representing the PDF of the distribution (Nbins = int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x) = Σᵢ ln f(xᵢ|θ) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I linearly interpolate the histogram values at the bin centers, which gives me a continuous PDF for the simulated distribution; here f is this interpolated function built from the histogram of the simulation.
I sum the log(f(xi)) contributions for every (real) data point and return the log-likelihood value at the end.
But some (real) data points are so far from the mean of the simulated distribution that f(xi)=0. For these points the code raises a math domain error (reminder: log(0) = -inf), so I artificially set the PDF to a small epsilon at the points where it would otherwise be 0.
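In code, that workaround amounts to something like this (a sketch reusing the interp_sim and value names from the listing below; eps is an arbitrary small constant, not a value from the original code):
eps = 1e-12
f_x = interp_sim(value)                          # interpolated pdf at the data points
logp = np.sum(np.log(np.clip(f_x, eps, None)))   # clip avoids log(0) = -inf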
But here's the thing: the log-likelihood is not computed at every iteration. In fact, in the present architecture of my code, it is not computed at all, which is why the MCMC process is not converging. But... I don't know why.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline
# Generate the data
np.random.seed(0)
N=5000
true_mean=5.
true_sigma = 2.
data = np.random.normal(true_mean,true_sigma,N)
#prior
m=pm.Uniform('m', lower=4, upper=6)
s=pm.Uniform('s', lower=1, upper=3)
@pm.deterministic
def data_simulator(mean_input=m, sig_input=s):
    # Simulate a dataset for the current (mean, sigma) and summarize it:
    # mean, std, bin centers, and a normalized histogram.
    out = np.empty(4, dtype=object)
    datasim = np.random.normal(mean_input, sig_input, N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))),
                                   density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    m_sim = np.mean(datasim)
    s_sim = np.std(datasim)
    out[0] = m_sim
    out[1] = s_sim
    out[2] = bin_centers
    out[3] = hist
    return out
@pm.stochastic(observed=True)
def logp(value=data, mean_output=data_simulator.value[0],
         sigma_output=data_simulator.value[1],
         bin_centers_sim=data_simulator.value[2],
         hist_sim=data_simulator.value[3]):
    # Linear interpolation of the simulated histogram (ext=0 extrapolates).
    interp_sim = InterpolatedUnivariateSpline(bin_centers_sim, hist_sim, k=1, ext=0)
    logp = np.sum(np.log(interp_sim(value)))
    print('logp =', logp)
    return logp
model = pm.Model({"mean": m, "sigma": s, "data_simulator": data_simulator, "loglikelihood": logp})
#Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)
#Plot the marginals
pm.Matplot.plot(mcmc)