I have a fitted Poisson model in statsmodels. For each of my observations I want to calculate the probability of observing a value that is at least that high. In other words I want to calculate:
P(y >= y_i | x_i)
(This should be possible, because the fitted model predicts some value lambda as a function of my independent variable x. This lambda_i value defines a Poisson distribution, from which I should be able to derive a probability.)
My question is really about the implementation in statsmodels, less about the statistics. Although if you believe it is relevant, please do elaborate.
For Poisson, we can just use the distribution from scipy.stats to compute tail probabilities for the predicted means, e.g.:
mu = my_results.predict(...)
stats.poisson.sf(counts, mu)
A similar usage with pmf is in:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L3922
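As a minimal sketch (my_results, exog and counts are assumed names for the fitted results object, the explanatory data and the observed counts, not from the original post), and noting that poisson.sf(k, mu) gives the strict-inequality probability P(Y > k), so P(Y >= y_i) needs a shift of one:
import numpy as np
from scipy import stats

# assumed: my_results is a fitted statsmodels Poisson results instance,
# exog holds the x_i and counts the observed y_i
mu = my_results.predict(exog)                               # predicted means lambda_i

p_greater = stats.poisson.sf(counts, mu)                    # P(Y >  y_i | x_i)
p_at_least = stats.poisson.sf(np.asarray(counts) - 1, mu)   # P(Y >= y_i | x_i)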
Related
I'm interested in the tail distribution of some given data, so I tried using scipy.stats to fit my data to a Gaussian, Generalized extreme value distribution, and a Generalized Pareto distribution.
This is what the data looks like:
Data Histogram
This is what I tried
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = df.loc[:, 'X']
v = np.ceil(np.log2(len(data))) + 1  # Sturges' rule for an "adequate" number of bins
y, x = np.histogram(data, bins=int(v), density=True)  # bin the data for the plot
plt.hist(data, bins=11, density=True)
plt.title("Histogram")
plt.show()
x = (x + np.roll(x, -1))[:-1] / 2.0  # midpoint of each bin as the x reference for its probability
# =============================================================================
# Fitting our data and plotting the PDFs
# =============================================================================
fit1 = stats.genextreme.fit(data, floc=0)  # .fit finds the MLE parameters for the chosen distribution
fit2 = stats.norm.fit(data)
fit3 = stats.genpareto.fit(data, floc=0)
fit4 = stats.weibull_min.fit(data, floc=0)
fit5 = stats.exponweib.fit(data, floc=0)
fit6 = stats.gumbel_r.fit(data, floc=0)
fit7 = stats.gumbel_l.fit(data, floc=0)
....
At first I got some strange results because I had not set the location parameter to 0 (floc=0), which I still don't fully understand.
What surprised me the most, though, is that genextreme and weibull_min gave me different results, when I thought Weibull was a special case of the generalized extreme value distribution with a positive shape parameter.
Especially since the Weibull fit seems to work better here.
Here is the Weibull Fit:
Weibull Fit
And this is the GEV Fit:
GEV Fit
Actually the GEV Fit was similar to the Gumbel_r one:
Gumbel_r Fit
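For reference, comparison plots like the ones above could be reproduced with something along these lines (a sketch reusing data, v, x and the fit tuples defined earlier; not the exact code behind the figures):
plt.hist(data, bins=int(v), density=True, alpha=0.5, label="data")
plt.plot(x, stats.weibull_min.pdf(x, *fit4), label="weibull_min")
plt.plot(x, stats.genextreme.pdf(x, *fit1), label="genextreme")
plt.plot(x, stats.gumbel_r.pdf(x, *fit6), label="gumbel_r")
plt.legend()
plt.title("Fitted PDFs vs. data")
plt.show()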
I read that one can deduce whether weibull_min or weibull_max should be used just from the shape of the data's histogram. How can one do that?
Since I am interested in extreme positive values (the tail distribution), shouldn't I be using weibull_max, since that is the limiting distribution of the maximum?
I have two lists. Both contain normalized percentages:
actual_population_distribution = [0.2,0.3,0.3,0.2]
sample_population_distribution = [0.1,0.4,0.2,0.3]
I wish to fit these two lists to gamma distributions and then use the two returned lists to compute the KL value.
I am already able to get a KL value.
This is the function I used to calculate gamma:
import random
import numpy as np

def gamma_random_sample(data_list):
    # method-of-moments estimates of the gamma shape (alpha) and rate (beta)
    mean = np.mean(data_list)
    var = np.var(data_list)
    g_alpha = mean * mean / var
    g_beta = mean / var
    for i in range(len(data_list)):
        # random.gammavariate takes (shape, scale), so pass 1/rate as the scale
        yield random.gammavariate(g_alpha, 1 / g_beta)
Fit the two lists to gamma distributions:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
This is the code I used to calculate KL:
kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))
The code above does not produce any errors.
But I suspect the way I derived the gamma parameters is wrong, because I used np.mean/np.var to get the mean and variance.
Indeed, the numbers are different from what I get with:
mean, var, skew, kurt = gamma.stats(fit_alpha, loc=fit_loc, scale=fit_beta, moments='mvsk')
If I use that approach instead, I get a KL value far larger than 1, so both ways seem invalid for getting a correct KL.
What am I missing?
See this Cross Validated post: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std
I don't understand what you are trying to do with:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
It doesn't look like you are fitting a gamma distribution. Rather, you are using the method-of-moments estimator to get the parameters of a gamma distribution, and then drawing a single random number for each element of your actual(sample)_population_distribution lists, given the distribution statistics of that list.
The gamma distribution is notoriously hard to fit. I hope your actual data is a longer list -- 4 data points are hardly sufficient for estimating a two-parameter distribution. The estimates are essentially garbage until you have hundreds of elements or more; take a look at this document on the MLE estimator and Fisher information for the gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf
I don't know what you are trying to do with the KL divergence either. Your actual population is already normalized to 1, and so is the sample distribution. You can plug those elements directly into the KL divergence for a discrete score (see the sketch below) -- what your code does instead is stretch the original list values and add gamma noise to them via your gamma function, so you are more likely to end up with a larger KL divergence after that corruption of the original population data.
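For example, a direct discrete KL score on the two normalized lists, with no gamma fitting at all, would look something like this sketch:
import numpy as np
import scipy.special

actual_population_distribution = np.array([0.2, 0.3, 0.3, 0.2])
sample_population_distribution = np.array([0.1, 0.4, 0.2, 0.3])

# kl_div(p, q) = p*log(p/q) - p + q; summed over two normalized distributions
# the -p + q terms cancel, leaving the discrete KL divergence
kl = np.sum(scipy.special.kl_div(actual_population_distribution,
                                 sample_population_distribution))
print(kl)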
I'm sorry, I just don't see what you are trying to accomplish here. If I were to guess your original intent, I'd say your problem is that you need hundreds of data points to guarantee convergence with any gamma fitting program.
EDIT: I just wanted to add something regarding the KL divergence. If you intend to score your fitted gamma distributions with the KL divergence, it's better to use an analytical solution that takes the shape and scale parameters of your two gamma distributions as inputs. Randomly sampling noisy data points won't be helpful unless you take on the order of 100,000 random samples, histogram them into roughly 1,000 bins, and then normalize the histogram -- I'm just throwing those numbers out, but you want to approximate a continuous distribution as well as you can, and that is hard because gamma distributions have long tails. This document has the analytical solution for a generalized distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set the third parameter to 1, simplify, and then code up a function.
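As a sketch of that idea, using the shape/rate parameterization (treat this as something to verify against the linked references rather than a drop-in solution):
import numpy as np
from scipy.special import gammaln, digamma

def kl_gamma(a1, b1, a2, b2):
    """KL( Gamma(shape=a1, rate=b1) || Gamma(shape=a2, rate=b2) )."""
    return ((a1 - a2) * digamma(a1)
            - gammaln(a1) + gammaln(a2)
            + a2 * (np.log(b1) - np.log(b2))
            + a1 * (b2 - b1) / b1)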
I am wondering how the p-value is calculated for the various variables in a multiple linear regression. From reading several resources I understand that a p-value below 5% indicates that the variable is significant for the model. But how is the p-value calculated for each and every variable in a multiple linear regression?
I looked at the statsmodels summary using the summary() function, but I can only see the values. I didn't find any resource explaining how the p-value for each variable in a multiple linear regression is calculated.
import numpy as np
import statsmodels.api as sm

nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
This code produces no errors; the question is about the intuition for how the p-value is calculated for each variable in a multiple linear regression.
Inferential statistics work by comparison to known distributions. In the case of regression coefficients, that distribution is typically the t-distribution.
You'll notice that each variable has an estimated coefficient, from which an associated t-statistic is calculated. x1, for example, has a t-value of -0.278. To get the p-value, we place that t-value on the t-distribution and calculate the probability of getting a value at least as extreme as the one observed. You can gain some intuition for this by noticing that the p-value column is called P>|t|.
An additional wrinkle is that the exact shape of the t-distribution depends on the degrees of freedom.
So to calculate a p-value, you need two pieces of information: the t-statistic and the residual degrees of freedom of your model (97 in your case).
Taking x1 as an example, you can calculate the p-value in Python like this:
import scipy.stats
scipy.stats.t.sf(abs(-0.278), df=97)*2
0.78160405761659357
The same is done for each of the other predictors using their respective t-values.
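To tie this back to statsmodels, the whole column can be reproduced from the results object in the question (a sketch; results is the fitted OLS results from the code above):
import numpy as np
from scipy import stats

t_values = results.params / results.bse                            # coefficient / standard error
p_values = 2 * stats.t.sf(np.abs(t_values), df=results.df_resid)   # two-sided p-values

print(p_values)          # should match the P>|t| column
print(results.pvalues)   # what statsmodels reports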
I have a model with a pm.NormalMixture(), and when I sample from the normal mixture, I also want to know which of the mixed distributions that point is being sampled from.
import numpy as np
import pymc3 as pm

obs = np.concatenate([np.random.normal(5, 1, 100),
                      np.random.normal(10, 2, 200)])

with pm.Model() as model:
    mu = pm.Normal('mu', 10, 10, shape=2)
    sd = pm.Normal('sd', 10, 10, shape=2)
    x = pm.NormalMixture('x', mu=mu, sd=sd, observed=obs)
I sample from that model, then use the trace to sample from the posterior predictive distribution. What I want to know is, for each x in the posterior predictive trace, which of the two normal distributions it was sampled from. Is that possible in PyMC3 without doing it manually?
This example demonstrates how posterior predictive checks (PPCs) work. The gist of a PPC is that you first draw random samples from the trace. The trace is essentially always multivariate, and in your model a single sample would be defined by the vector (mu[i,0], mu[i,1], sd[i,0], sd[i,1]). Then, for each trace sample, generate random numbers from the distribution specified for the likelihood with its parameter values equal to those from the trace samples. In your case, this would be NormalMixture(mu[i,:], sd[i,:]). In your model, x is the likelihood function, not an individual point of the trace.
Some practical notes:
You haven't specified a weighting variable, so I'm assuming by default it forces the normal distributions to be weighted equally (I haven't tested this).
The odds of a given point coming from one distribution or the other are just the ratio of the probability densities at that point (see the sketch after these notes).
Check out this for recommendations on how to choose priors. For example, your SD prior is placing a lot of weight on very large SDs, which would bias your results, especially for smaller datasets.
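Here is a rough sketch of that manual calculation (assuming equal weights and using posterior means from the trace; trace is assumed to be the result of pm.sample on the model above):
import numpy as np
import scipy.stats as st

mu_hat = trace['mu'].mean(axis=0)          # posterior mean of each component mean
sd_hat = np.abs(trace['sd'].mean(axis=0))  # abs() guards against negative draws from the Normal prior on sd

# density of each observation under each component
dens = np.array([st.norm.pdf(obs, mu_hat[k], sd_hat[k]) for k in range(2)])

resp = dens / dens.sum(axis=0)    # P(component k | x), assuming equal weights
assignment = resp.argmax(axis=0)  # hard assignment to the most likely component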
Good luck!
I am following the Orthogonal distance regression method to fit data with errors on both the dependent and independent variables.
I am fitting the data with a simple straight line, my model is y = ax + b.
Now, I am able to write the code and plot the line fitting the data, but I am NOT able to interpret the results:
Beta: [ 2.08346947 0.0024333 ]
Beta Std Error: [ 0.03654482 0.00279946]
Beta Covariance: [[ 2.06089823e-03 -9.99220260e-05]
[ -9.99220260e-05 1.20935366e-05]]
Residual Variance: 0.648029925546
Inverse Condition #: 0.011825289654
Reason(s) for Halting:
Sum of squares convergence
The Beta is just the array containing the values of the parameters of my model (a, b), and Beta Std Error, the associated errors.
Regarding the other values, I don't know their meaning.
Especially, I would like to know which one is indicative of a goodness-of-fit, something like the chi-square when one fits with the errors only on the dependent variable.
Beta Covariance is the covariance matrix of your fitted parameters. It can be thought of as a matrix describing how inter-connected your two parameters are, with respect to both themselves and each other.
Residual Variance is, I believe, a measure of goodness-of-fit: the smaller the value, the better the fit to your data.
Inverse Condition is the inverse (1/x) of the condition number. The condition number defines how sensitive your fitted function is to changes in the input.
scipy.odr is a wrapper around a much older FORTRAN-77 package known as ODRPACK. The documentation for ODRPACK can actually be found on the scipy website; it contains the mathematical descriptions of the parameters and may help you understand what you need to know.
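For reference, a minimal ODR run for the straight-line model y = a*x + b looks roughly like this sketch (x, y, x_err, y_err are assumed arrays of measurements and their uncertainties), with the output attributes that correspond to the printed report:
from scipy import odr

def linear(beta, x):
    a, b = beta
    return a * x + b

data = odr.RealData(x, y, sx=x_err, sy=y_err)      # errors on both variables
out = odr.ODR(data, odr.Model(linear), beta0=[1.0, 0.0]).run()

out.pprint()          # prints the report shown in the question
print(out.beta)       # fitted parameters (a, b)
print(out.sd_beta)    # their standard errors
print(out.cov_beta)   # parameter covariance matrix
print(out.res_var)    # residual variance (plays the role of a reduced chi-square here)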