Scipy has no support for fitting discrete distributions to data, and I know there are already a lot of questions about this.
For example, if I have an array like the one below:
x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]
I can't simply do the following for this array:
from scipy.stats import nbinom
param = nbinom.fit(x)
But I would like to ask: as of today, is there any way to fit these three discrete distributions (negative binomial, Poisson and geometric) and then choose the best fit for a discrete dataset?
You can use the method of moments to fit any particular distribution.
Basic idea: compute the empirical first, second, etc. moments, then derive the distribution parameters from them.
In all of these cases we need at most two moments. Let's get them:
import pandas as pd
# for other distributions, you'll need to implement PMF
from scipy.stats import nbinom, poisson, geom
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {} # we'll use it later
Note: I used pandas instead of numpy. That is because numpy's var() and std() don't apply Bessel's correction by default, while pandas' do. If you have 100+ samples there shouldn't be much difference, but on smaller samples it can matter.
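To see the difference (a small sketch; numpy's var() also accepts a ddof argument if you prefer to stay in numpy):
import numpy as np
print(np.var(x))            # numpy default: ddof=0, no Bessel's correction
print(np.var(x, ddof=1))    # numpy with Bessel's correction
print(x.var())              # pandas default: ddof=1, same value as the line above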
Now, let's get the parameters for these distributions. The negative binomial has two parameters, p and r. Let's estimate them and calculate the likelihood of the dataset:
# From the wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# without wiki, you could use MGF to get moments; too long to explain here
# Solving for p and r, we get:
p = 1 - mean / var # TODO: check for zero variance and limit p by [0, 1]
r = (1-p) * mean / p
UPD: Wikipedia and scipy use different definitions of p, one treating it as the probability of success and the other as the probability of failure. To be consistent with the scipy notation, use:
p = mean / var
r = p * mean / (1-p)
END OF UPD
UPD2:
I'd suggest using the log-likelihood from thilak's answer below instead. It avoids loss of precision, which is especially important on large samples.
END OF UPD2
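As the TODO above notes, the moment estimates can be degenerate (zero variance, or var <= mean). A minimal guard might look like this (a sketch, using the scipy parametrization):
eps = 1e-9
p = mean / var if var > 0 else 1.0   # nbinom needs overdispersion (var > mean)
p = min(max(p, eps), 1 - eps)        # clip p into the open interval (0, 1)
r = p * mean / (1 - p)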
Calculate likelihood:
likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()
Same for Poisson, there is only one parameter:
# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()
Same for Geometric distribution:
# mean = 1 / p # this form fits the scipy definition
p = 1 / mean
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
Finally, let's get the best fit:
best_fit = max(likelihoods, key=lambda x: likelihoods[x])
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])
Let me know if you have any questions
Great answer by Marat.
In addition to Marat's post, I would most certainly recommend taking the log of the probability mass function. Some information on why the log-likelihood is preferred over the likelihood: https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution
I would rewrite the code for the negative binomial as:
log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()
Note that I have used:
logpmf instead of pmf
sum instead of product
And to find the best distribution:
best_fit = max(log_likelihoods, key=lambda x: log_likelihoods[x])
print("Best fit:", best_fit)
print("log_Likelihood:", log_likelihoods[best_fit])
I am currently having an issue with my implementation of the Metropolis-Hastings algorithm.
I am trying to use the algorithm to calculate the integral of x^2 exp(-x^2) over the real line (see the docstring in the code below).
Using this algorithm, we can obtain a long chain of configurations (in this case, each configuration is just a single number) such that, in the tail end of the chain, the probability of having a particular configuration follows (or rather tends to) a Gaussian distribution.
My code does not seem to produce the expected Gaussian distribution. There is a strange dependence on the transition probability (the probability of picking a new candidate configuration given the previous configuration in the chain). However, if this transition probability is symmetric, there should be no dependence on it at all (it only affects the speed at which phase space [the space of potential configurations] is explored and how quickly the chain converges to the desired distribution)!
In my case I am using a normal distribution as the transition function (which satisfies the symmetry requirement), with width d.
For each d I use, I do indeed get a Gaussian distribution; however, the standard deviation sigma depends on my choice of d.
The resulting Gaussian should have a sigma of roughly 0.707 (i.e. 1/sqrt(2)), but the value I actually get depends on the parameter d, when it shouldn't.
I am not sure where the error in this code is, any help would be greatly appreciated!
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
'''
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-hasting alg will generate configuartions (in this case, single numbers) such that
the probablity of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the apporximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
'''
'''
Implementing metropolis-hasting algorithm
'''
#Setting up the initial config for our chain, generating first 2 to generate numpy array
x_0 = np.random.uniform(-20,-10,2)
#Defining function that generates the next N steps in the chain, given a starting config x
#Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
#Success and failures implemented to see roughly the success rate of each step
def next_steps(x,N):
    i = 0
    Success = 0
    Failures = 0
    Data = np.array(x)
    d = 1.5 #Spread of (normal) transition function
    while i < N:
        r = np.random.uniform(0,1)
        delta = np.random.normal(0,d)
        x_old = Data[-1]
        x_new = x_old + delta
        hasting_ratio = np.exp(-(x_new**2-x_old**2) )
        if hasting_ratio > r:
            i = i+1
            Data = np.append(Data,x_new)
            Success = Success +1
        else:
            Failures = Failures + 1
    print(Success)
    print(Failures)
    return Data
#Number of steps in the chain
N_iteration = 50000
#Generating the data
Data = next_steps(x_0,N_iteration)
#Plotting data to see convergence of chain to gaussian distribution
plt.plot(Data)
plt.show()
#Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
#Plotting a histogram to visually see if guassian
plt.hist(Data, bins = 300)
plt.show()
You need to save x even when it doesn't change. Otherwise the center values are under-counted, and more so as d increases, which increases the variance.
import numpy as np
from scipy.stats import norm
"""
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-hasting alg will generate configuartions (in this case, single numbers) such that
the probablity of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the apporximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
"""
"""
Implementing metropolis-hasting algorithm
"""
# Setting up the initial config for our chain
x_0 = np.random.uniform(-20, -10)
# Defining function that generates the next N steps in the chain, given a starting config x
# Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
# Success and failures implemented to see roughly the success rate of each step
def next_steps(x, N):
    Success = 0
    Failures = 0
    Data = np.empty((N,))
    d = 1.5  # Spread of (normal) transition function
    for i in range(N):
        r = np.random.uniform(0, 1)
        delta = np.random.normal(0, d)
        x_new = x + delta
        hasting_ratio = np.exp(-(x_new ** 2 - x ** 2))
        if hasting_ratio > r:
            x = x_new
            Success = Success + 1
        else:
            Failures = Failures + 1
        Data[i] = x
    print(Success)
    print(Failures)
    return Data
# Number of steps in the chain
N_iteration = 50000
# Generating the data
Data = next_steps(x_0, N_iteration)
# Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
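As a quick sanity check (a sketch, not part of the original answer): the target density p(x), proportional to exp(-x^2), is a normal distribution with variance 1/2, so sigma should now come out close to 1/sqrt(2) ≈ 0.707 regardless of d, and the Monte Carlo average from the docstring, Q_N = mean(x^2), should be close to 0.5.
print(sigma, 1 / np.sqrt(2))  # should agree to a couple of decimal places
print(np.mean(Data ** 2))     # Q_N, the estimate of E[x^2] under p(x)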
I have been trying to match the scipy outputs of the lognormal distribution to the formulas on Wikipedia.
And I am stuck on the partial expectation with a lower bound.
If I use this simple lognormal distribution:
import numpy as np
import scipy.stats as scist

k = .25
sigma = .5
mu = .1  # mean of the logged variable
lnorm = scist.lognorm(s=sigma, scale=np.exp(mu))
where k is the lower bound, the partial expectation, as I understand it, is given by (from the Wikipedia article):
g(k) = E[X; X > k] = exp(mu + sigma^2 / 2) * Phi((mu + sigma^2 - ln(k)) / sigma)
Fine. So we are simply talking about the mean of the lognormal distribution multiplied by a standard normal CDF evaluated at a z-score. scipy provides the partial expectation:
lnorm.expect(lambda x:x, lb=k)
>>> 1.25199...
Indeed, we can confirm this is the partial expectation by checking it against the conditional expectation: dividing the partial expectation by P(X > k), or computing the conditional expectation directly, gives the same result:
lnorm.expect(lambda x:x, lb=k) / (1 - lnorm.cdf(k))
>>> 1.25385...
lnorm.expect(lambda x:x, lb=k, conditional=True)
>>> 1.25385...
However, scipy's cdf function takes the x variable, not the z-score, and I am uncertain how to transform the z-score argument (mu + sigma^2 - ln(k)) / sigma into an x value. I have tried many different flavors.
I would have thought that accounting for the subtraction of mu, which must occur when scipy's cdf (presumably) computes the z-score internally, would do the trick. But any formulation I use ends up with a very small or zero value.
Any help would be greatly appreciated.
IIUC, you can simply compute the CDF of a standard normal distribution N(0,1) at x_phi = (mu + sigma^2 - ln(k)) / sigma, i.e.
import numpy as np
import scipy.stats as sps
def partial_expectation(mu, sigma, k):
    """
    Returns partial expectation given
    mean, standard deviation and k.
    https://en.wikipedia.org/wiki/Log-normal_distribution
    """
    # compute cumulative distribution function
    # of Normal distribution N(0,1) at x=x_phi
    x_phi = (mu + sigma**2 - np.log(k))/sigma
    phi = sps.norm.cdf(x_phi, loc=0, scale=1)
    # mean of lognormal
    lognorm_mu = np.exp(mu + .5*(sigma**2))
    # result
    return lognorm_mu * phi
k = .25
sigma = .5
mu = .1 # from the logged variable
lnorm = sps.lognorm(s=sigma, scale=np.exp(mu))
print('from def:', partial_expectation(mu, sigma, k))
print('from sps:', lnorm.expect(lb=k))
from def: 1.251999952174895
from sps: 1.2519999521748952
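To connect this back to the question about passing an x value rather than a z-score: the same normal-CDF factor can equivalently be written as the survival function of a lognormal whose log-mean is shifted by sigma^2. A small sketch (not part of the original answer):
shifted = sps.lognorm(s=sigma, scale=np.exp(mu + sigma**2))
print('shifted :', lnorm.mean() * shifted.sf(k))  # same value again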
I have a data set X from which I need to estimate the parameters by MLE. I have the log-likelihood function:
def llh(alpha, beta):
    a = [0]*999
    for i in range(1, 1000):
        a[i-1] = (-0.5)*(((1/beta)*(X[i]-np.sin((alpha)*X[i-1])))**2)
    return sum(a)
I need to maximise this, but I have no idea how. I can only think of plotting 3D graphs to find the maximum point, but that gives me weird answers that are not what I want.
This is the plot I got
Is there any other way to get the maximising parameters, or am I going about this the wrong way? My model is X_k = sin(alpha * X_{k-1}) + beta * W_k, where W_k is normally distributed with mean 0 and sigma 1. Any help would be appreciated. Thank you!
You have to find the maximum of your likelihood numerically. In practice this is done by computing the negative (log-)likelihood and using numerical minimization to find the parameters of your model that best describe your data. Use scipy.optimize.minimize to minimize the negative log-likelihood.
I implemented a short example for normally distributed data.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
def neg_llh(popt, X):
    return -np.log(norm.pdf(X, loc = popt[0], scale = popt[1])).sum()
# example data
X = np.random.normal(loc = 5, scale = 2, size = 1000)
# minimize log likelihood
res = minimize(neg_llh, x0 = [2, 2], args = (X,))
print(res.x)
array([5.10023503, 2.01174199])
Since you are using sum, I suppose the likelihood you defined above is already the log-likelihood (up to constants). Vectorized, and negated so that it can be minimized, it becomes:
def neg_llh(popt, X):
    alpha = popt[0]
    beta = popt[1]
    # note the lag: X[k] is compared against sin(alpha * X[k-1]),
    # and the sign is flipped so that minimizing maximizes the likelihood
    return -np.sum((-0.5) * (((1 / beta) * (X[1:] - np.sin(alpha * X[:-1]))) ** 2))
Try minimizing your negative likelihood. Using your plot you can make a good initial guess (x0) about the values of alpha and beta.
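A minimal usage sketch (the starting values here are placeholders; take yours from the plot):
res = minimize(neg_llh, x0 = [1.0, 1.0], args = (X,))  # x0 = [alpha_guess, beta_guess]
print(res.x)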
I am interested in a particular density, and I need to sample it "regularly" in a way that represents its shape (not randomly).
Formally, f is my density function and F is the corresponding cumulative distribution function (F' = f), whose inverse function rF = F^-1 does exist. I want to cast a regular sample of [0, 1] into my variable's domain through F^-1. Something like:
import numpy as np
uniform_sample = np.linspace(0., 1., 256 + 2)[1:-1] # source sample
shaped_sample = rF(uniform_sample) # this is what I want to get
Is there a dedicated way to do this with numpy, or should I do it by hand? Here is the 'by hand' way for the exponential distribution:
l = 5. # exponential parameter
# f = lambda x: l * np.exp(-l * x) # density function, not used
# F = lambda x: 1 - np.exp(-l * x) # cumulative density function, not used either
rF = lambda y: np.log(1. / (1. - y)) / l # reverse `F^-1` function
# What I need is:
shaped_sample = rF(uniform_sample)
I know that, in theory, rF is internally used for drawing random samples when np.random.exponential is called, for example (a uniform, random sample from [0, 1] is transformed by rF to get the actual result). So my guess is that numpy.random does know the rF function for each distribution it offers.
How do I access it? Does numpy provide functions like:
np.random.<any_numpy_distribution>.rF
or
np.random.get_reverse_F(<any_custom_density_function>)
.. or should I derive / approximate them myself?
scipy has probability distribution objects for all (I think) of the probability distributions in numpy.random.
http://docs.scipy.org/doc/scipy/reference/stats.html
They all have a ppf() method that does what you want.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.ppf.html
In your example:
import scipy.stats as st
l = 5. # exponential parameter
dist = st.expon(scale=1./l) # distribution object provided by scipy; note scipy's scale is 1/lambda
f = dist.pdf # probability density function
F = dist.cdf # cumulative density function
rF = dist.ppf # percent point function : reverse `F^-1` function
shaped_sample = rF(uniform_sample)
# and much more!
As far as I'm aware there isn't a way to do this directly in numpy. For cases where the cumulative distribution is analytic but its inverse isn't, I generally use a spline to do the inversion numerically.
from scipy.interpolate import UnivariateSpline
x = np.linspace(0.0, 1.0, 1000)
F = cumulative_distn(x) #This we know and is analytic
rF = UnivariateSpline(F, x) #This will then be the inverse
Note that if you can do the inversion of F to rF by hand then you should. This method is only for the case where the inverse cannot be found in a closed form.
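Applied to the exponential example from the question, a sketch of this approach (with s=0 so the spline interpolates the samples exactly) and a check against the analytic inverse:
import numpy as np
from scipy.interpolate import UnivariateSpline

l = 5.                                    # exponential rate parameter
x = np.linspace(0.0, 2.0, 1000)           # grid covering enough of the support
F = 1.0 - np.exp(-l * x)                  # analytic CDF
rF_spline = UnivariateSpline(F, x, s=0)   # numerical inverse, F -> x

uniform_sample = np.linspace(0., 1., 256 + 2)[1:-1]
shaped_sample = rF_spline(uniform_sample)
exact = np.log(1. / (1. - uniform_sample)) / l
print(np.max(np.abs(shaped_sample - exact)))  # should be small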
I'm trying to use polyfit to find the best fitting straight line to a set of data, but I also need to know the uncertainty on the parameters, so I want the covariance matrix too. The online documentation suggests I write:
polyfit(x, y, 2, cov=True)
but this gives the error:
TypeError: polyfit() got an unexpected keyword argument 'cov'
And sure enough help(polyfit) shows no keyword argument 'cov'.
So does the online documentation refer to a different release of numpy? (I have 1.6.1, the newest one.) I could write my own polyfit function, but does anyone have any suggestions for why I don't have a covariance option in my polyfit?
Thanks
For a solution that comes from a library, I find that using scikits.statsmodels is a convenient choice. In statsmodels, regression objects have callable attributes that return the parameters and standard errors. I put an example of how this would work for you below:
# Imports, I assume NumPy for forming your data.
import numpy as np
import scikits.statsmodels.api as sm
# Form the data here
(X, Y) = ....
# Note: X and Y are assumed here to be column vectors of shape (N, 1).
deg = 2; # Desired polynomial degree.
reg_x_data = np.ones(X.shape); # 0th degree term.
for ii in range(1,deg+1):
    reg_x_data = np.hstack(( reg_x_data, X**(ii) )); # Append the ii^th degree term.
# Store OLS regression results into `result`
result = sm.OLS(Y,reg_x_data).fit()
# Print the estimated coefficients
print result.params
# Print the basic OLS standard error in the coefficients
print result.bse
# Print the estimated basic OLS covariance matrix
print result.cov_params() # <-- Notice, this one is a function by convention.
# Print the heteroskedasticity-consistent standard error
print result.HC0_se
# Print the heteroskedasticity-consistent covariance matrix
print result.cov_HC0
There are additional robust covariance attributes in the result object as well; you can see them by printing out dir(result). Also, by convention, the covariance matrices for the heteroskedasticity-consistent estimators only become available after you access the corresponding standard error: you must call result.HC0_se before result.cov_HC0, because the first reference causes the second one to be computed and stored.
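For example, a small sketch of the access order described above:
print(dir(result))      # list everything available on the results object
se = result.HC0_se      # accessing the robust standard errors first ...
print(result.cov_HC0)   # ... makes the corresponding covariance matrix available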
Pandas is another library that probably provides more advanced support for these operations.
Non-library function
This might be useful when you don't want to / can't rely on an extra library function.
Below is a function that I wrote to return the OLS regression coefficients, as well as a bunch of stuff. It returns the residuals, the regression variance and standard error (standard error of the residuals-squared), the asymptotic formula for large-sample variance, the OLS covariance matrix, the heteroskedasticity-consistent "robust" covariance estimate (which is the OLS covariance but weighted according to the residuals), and the "White" or "bias-corrected" heteroskedasticity-consistent covariance.
import numpy as np
###
# Regression and standard error estimation functions
###
def ols_linreg(X, Y):
    """ ols_linreg(X,Y)
    Ordinary least squares regression estimator given explanatory variables
    matrix X and observations matrix Y. The length of the first dimension of
    X and Y must be the same (equal to the number of samples in the data set).
    Note: these methods should be adapted if you need to use this for large data.
    This is mostly for illustrating what to do for calculating the different
    classical standard errors. You would never really want to compute the inverse
    matrices for large problems.
    This was developed with NumPy 1.5.1.
    """
    (N, K) = X.shape
    t1 = np.linalg.inv( (np.transpose(X)).dot(X) )
    t2 = (np.transpose(X)).dot(Y)
    beta = t1.dot(t2)
    residuals = Y - X.dot(beta)
    sig_hat = (1.0/(N-K))*np.sum(residuals**2)
    sig_hat_asymptotic_variance = 2*sig_hat**2/N
    conv_st_err = np.sqrt(sig_hat)
    sum1 = 0.0
    for ii in range(N):
        sum1 = sum1 + np.outer(X[ii,:],X[ii,:])
    sum1 = (1.0/N)*sum1
    ols_cov = (sig_hat/N)*np.linalg.inv(sum1)
    PX = X.dot( np.linalg.inv(np.transpose(X).dot(X)).dot(np.transpose(X)) )
    robust_se_mat1 = np.linalg.inv(np.transpose(X).dot(X))
    robust_se_mat2 = np.transpose(X).dot(np.diag(residuals[:,0]**(2.0)).dot(X))
    robust_se_mat3 = np.transpose(X).dot(np.diag(residuals[:,0]**(2.0)/(1.0-np.diag(PX))).dot(X))
    v_robust = robust_se_mat1.dot(robust_se_mat2.dot(robust_se_mat1))
    v_modified_robust = robust_se_mat1.dot(robust_se_mat3.dot(robust_se_mat1))
    """ Returns:
    beta -- The vector of coefficient estimates, ordered on the columns of X.
    residuals -- The vector of residuals, Y - X.beta
    sig_hat -- The sample variance of the residuals.
    conv_st_err -- The 'standard error of the regression', sqrt(sig_hat).
    sig_hat_asymptotic_variance -- The analytic formula for the large-sample variance.
    ols_cov -- The covariance matrix under the basic OLS assumptions.
    v_robust -- The "robust" covariance matrix, weighted to account for the residuals and heteroskedasticity.
    v_modified_robust -- The bias-corrected and heteroskedasticity-consistent covariance matrix.
    """
    return beta, residuals, sig_hat, conv_st_err, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust
For your problem, you would use it like this:
import numpy as np
# Define or load your data:
(Y, X) = ....
# Desired polynomial degree
deg = 2;
reg_x_data = np.ones(X.shape); # 0th degree term.
for ii in range(1,deg+1):
    reg_x_data = np.hstack(( reg_x_data, X**(ii) )); # Append the ii^th degree term.
# Get all of the regression data.
beta, residuals, sig_hat, conv_st_error, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust = ols_linreg(reg_x_data,Y)
# Print the covariance matrix:
print ols_cov
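Since the question is about the uncertainty on the parameters, note that the standard errors are just the square roots of the diagonal of the covariance matrix (a one-line sketch):
print(np.sqrt(np.diag(ols_cov)))  # standard errors of the polynomial coefficients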
If you spot any bugs in my computations (especially the heteroskedasticity-consistent estimators) please let me know and I'll fix it asap.