The task
I have data that looks like this:
I want to fit a generalized linear model (GLM) with a gamma family to this data using statsmodels. Using this model, for each of my observations I want to calculate the probability of observing a value that is smaller than (or equal to) that value. In other words I want to calculate:
P(y <= y_i | x_i)
My questions
How do I get the shape and scale parameters from the fitted glm in statsmodels? According to this question the scale parameter in statsmodels is not parameterized in the normal way. Can I use it directly as input to a gamma distribution in scipy? Or do I need a transformation first?
How do I use these parameters (shape and scale) to get the probabilities? Currently I'm using scipy to generate a distribution for each x_i and get the probability from that. See implementation below.
My current implementation
import numpy as np
import scipy.stats as stat
import patsy
import statsmodels.api as sm
# Generate data in correct form
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
# Fit model with gamma family and log link
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
# Predict mean
myData['mu'] = mod.predict(exog=X)
# Predict probabilities (note that for a gamma distribution mean = shape * scale)
probabilities = np.array(
[stat.gamma(m_i/mod.scale, scale=mod.scale).cdf(y_i) for m_i, y_i in zip(myData['mu'], myData['y'])]
)
However, when I perform this procedure I get the following result:
Currently the predicted probabilities all seem really high. The red line in the graph is the predicted mean. But even for points below this line the predicted cumulative probability is around 80%. This makes me wonder whether the scale parameter I used is indeed the correct one.
In R, you can obtain an estimate of the shape as 1/dispersion (check this post). The dispersion estimate in statsmodels is unfortunately named scale, so you need to take its reciprocal to get the shape estimate. I show it with an example below:
values = gamma.rvs(2,scale=5,size=500)
fit = sm.GLM(values, np.repeat(1,500), family=sm.families.Gamma(sm.families.links.log())).fit()
This is an intercept only model, and we check the intercept and dispersion (named scale):
[fit.params,fit.scale]
[array([2.27875973]), 0.563667465203953]
So the mean is exp(2.27875973) = 9.76, close to the true mean of 2 * 5 = 10, and if we use shape = 1/dispersion, shape = 1/0.563667465203953 = 1.774096, close to the shape of 2 that we simulated.
If I use a simulated dataset, it works perfectly fine. This is what it looks like, with a shape of 10:
from scipy.stats import gamma
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import pandas as pd
_shape = 10
myData = pd.DataFrame({'x':np.random.uniform(0,10,size=500)})
myData['y'] = gamma.rvs(_shape,scale=np.exp(-myData['x']/3 + 0.5)/_shape,size=500)
myData.plot("x","y",kind="scatter")
Then we fit the model like you did:
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
mu = mod.predict(exog=X)
shape_from_model = 1/mod.scale
probabilities = [gamma(shape_from_model, scale=m_i/shape_from_model).cdf(y_i) for m_i, y_i in zip(mu,myData['y'])]
And plot:
fig, ax = plt.subplots()
im = ax.scatter(myData["x"],myData["y"],c=probabilities)
im = ax.scatter(myData['x'],mu,c="r",s=1)
fig.colorbar(im, ax=ax)
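As a sanity check (a minimal sketch reusing the probabilities computed above, nothing specific to your data): the CDF values P(y <= y_i | x_i) evaluated at the observed y_i are probability integral transforms, so if the model is well specified they should look roughly uniform on [0, 1].

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest

# If the gamma GLM is well specified, the predicted CDF evaluated at the
# observed responses should be approximately Uniform(0, 1).
probabilities = np.asarray(probabilities)

plt.hist(probabilities, bins=20)          # should look roughly flat
plt.xlabel('predicted CDF at observed y')
plt.show()

print(kstest(probabilities, 'uniform'))   # a large p-value means no evidence against uniformity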
Related
I was looking for a way to obtain the mean value (expected value) from a distribution that I fitted with a kernel density estimate from scipy.stats.gaussian_kde. I remember from my statistics class that the expected value is just the integral of x * pdf(x) from -infinity to infinity.
I used the scipy.integrate.quad function to do this task in my code, but I ran into this apparently strange behavior (that might have something to do with the bandwidth parameter of the KDE).
Problem
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import norm, gaussian_kde
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity
np.random.seed(42)
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
kde = gaussian_kde(test_array,bw_method=0.5)
X_range = np.arange(-16,20,0.1)
y_list = []
# Evaluate the KDE on a grid (pdf is reused below for the integrations)
pdf = lambda x: kde.evaluate([[x]])
for X in X_range:
    y_list.append(pdf(X))
y = np.array(y_list)
_ = plt.plot(X_range,y)
# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
# Calculate the cdf at point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]
print("The mean after integration: {}\n".format(round(mean_integration_low_bw,4)))
print("F({}): {}".format(round(mean_integration_low_bw,4),round(zero_int_low,4)))
plt.axvline(x=mean_integration_low_bw,color ="r")
plt.show()
If I execute this code I get a strange result for the integrated mean and for the cumulative distribution function evaluated at that mean:
First Question:
In my opinion it should always show F(mean) = 0.5, or am I wrong here? (Does this only apply to symmetric distributions?)
Second Question:
The even stranger thing is that the value of the integrated mean does not change with the bandwidth parameter. In my opinion the mean should change too if the shape of the underlying distribution differs. If I set the bandwidth to 5 I get the following graph:
Why is the mean value still the same if the curve now has a different shape (due to the wider bandwith)?
I hope these questions don't just arise from my flawed understanding of statistics ;)
Your initial data is generated here:
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10. The expected average is therefore (500*4 - 100*10)/(500 + 100) = 1.6666..., which is pretty close to the result given by your code and consistent with the first plot.
As for your first question: F(mean) = 0.5 only holds when the mean equals the median, which is guaranteed for symmetric distributions but not for a skewed mixture like this one. As for your second question: a Gaussian KDE places a symmetric kernel centred on each data point, so its mean is always the sample mean; changing the bandwidth changes the spread of the estimate, not its centre.
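To check this numerically, here is a small sketch reusing test_array from your code; it compares the integrated KDE mean for two bandwidths against the sample mean:

import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

print(test_array.mean())  # the sample mean, close to the 1.67 computed above

for bw in (0.5, 5):
    kde_bw = gaussian_kde(test_array, bw_method=bw)
    mean_bw = quad(lambda x: x * kde_bw.evaluate([x])[0], -np.inf, np.inf)[0]
    # The kernels are symmetric and centred on the data points, so the
    # integrated mean stays at the sample mean whatever the bandwidth.
    print(bw, mean_bw)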
Given some time series data:
np.random.seed(123)
r = pd.Series(np.random.beta(a=0.5, b=0.5, size=1000),
index=pd.date_range('2013', periods=1000))
and the distributions within scipy.stats._continuous_distns._distn_names:
import scipy.stats as scs
dists = scs._continuous_distns._distn_names
I would like to be able to fit a new distribution and then call its .ppf (percent point function), while incorporating exponential weights into the building of the distribution.
For example, with a normal distribution, this would just entail estimating an exponentially-weighted mean and standard deviation:
All continuous distributions take loc and scale as keyword parameters
to adjust the location and scale of the distribution, e.g. for the
standard normal distribution the location is the mean and the scale is
the standard deviation. [source]
ewm = r.ewm(span=60)
loc = ewm.mean().iloc[-1]
scale = ewm.std().iloc[-1]
print(scs.norm.ppf(q=0.05, loc=loc, scale=scale))
-0.196734019969
But I would like to be able to extend this to the broader family of continuous distributions where other parameters (shape) are often involved. For instance,
johnsonsu has parameters a, b, loc, scale;
bradford has parameters c, loc, scale;
burr has parameters c, d, loc, scale.
How could I extend this process to distributions that have parameters besides loc and scale?
Combined snippets from above:
import scipy.stats as scs
import numpy as np
import pandas as pd
np.random.seed(123)
r = pd.Series(np.random.beta(a=0.5, b=0.5, size=1000),
index=pd.date_range('2013', periods=1000))
ewm = r.ewm(span=60)
loc = ewm.mean().iloc[-1]
scale = ewm.std().iloc[-1]
print(scs.norm.ppf(q=0.05, loc=loc, scale=scale))
# -0.196734019969
Here is my implementation:
Given an empirical distribution x, assign exponential weights to each x.
Use these weights to draw a bootstrap sample from x; the weights are passed as the p parameter to np.random.choice.
The .fit method of any distribution can then be called on that bootstrapped data.
Code:
def ewm_weights(i, alpha):
    # Exponential weights, oldest observation first, normalized to sum to 1.
    w = (1 - alpha) ** np.arange(i)[::-1]
    w /= w.sum()
    return w

def bootstrap(a, alpha, size=None):
    # Resample `a` with replacement, weighting recent observations more heavily.
    p = ewm_weights(i=len(a), alpha=alpha)
    return np.random.choice(a=a, size=size, p=p)
The definition of ewm_weights follows:
http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows
with adjust=True.
Example:
# Create a nonstationary `x` variable with larger mean and stdev in period 2
x1 = np.random.normal(loc=4, scale=3, size=1000)
x2 = np.random.normal(loc=10, scale=5, size=1000)
x = np.hstack((x1,x2))
The histogram of x looks like this:
plt.hist(x, bins=25)
While a bootstrapped b with alpha=0.03 looks like:
b = bootstrap(x, alpha=0.03, size=int(1e6))
plt.hist(b, bins=25)
Any continuous distribution from scipy.stats._continuous_distns._distn_names can then be fit to b.
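For example, a sketch with johnsonsu (any other member of _distn_names with shape parameters works the same way): fit it to the bootstrapped sample and pass the fitted parameters, shapes included, to .ppf.

import scipy.stats as scs

# Fit shape, loc and scale to the exponentially weighted bootstrap sample,
# then query the 5th percentile with the fitted parameters.
params = scs.johnsonsu.fit(b)           # (a, b, loc, scale) for johnsonsu
print(params)
print(scs.johnsonsu.ppf(0.05, *params))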
Issues:
A softmax function might make ewm_weights safer.
This approach ignores autocorrelation in x.
I am having trouble getting reasonable prediction behavior from least squares fits in statsmodels version 0.6.1. It does not seem to provide a sensible value.
Consider the following data
import numpy as np
import statsmodels.api as sm

xx = np.array([1.1,2.2,3.3,4.4]) # Independent variable
XX = sm.add_constant(xx) # Include constant for matrix fitting in statsmodels
yy = np.array([2,1,5,6]) # Dependent variable
ww = np.array([0.1,1,3,0.5]) # Weights to try
wn = ww/ww.sum() # Normalized weights
zz = 1.9 # Independent variable value to predict for
We can use numpy to do a fit and prediction (here with trivial, all-equal weights, i.e. unweighted):
np_unw_value = np.polyval(np.polyfit(xx, yy, deg=1, w=1+0*ww), zz)
print("Unweighted fit prediction from numpy.polyval is {sp}".format(sp=np_unw_value))
and we find a prediction of 2.263636.
As a sanity check, we can also see what R has to say about the matter
import pandas as pd
import rpy2.robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri
rpy2.robjects.pandas2ri.activate()
pdf = pd.DataFrame({'x':xx, 'y':yy, 'w':wn})
pdz = pd.DataFrame({'x':[zz], 'y':[np.Inf]})
rfit = rpy2.robjects.r.lm('y~x', data=pdf, weights=1+0*pdf['w']**2)
rpred = rpy2.robjects.r.predict(rfit, pdz)[0]
print("Unweighted fit prediction from R is {sp}".format(sp=rpred))
and again we find a prediction of 2.263636. My problem is that we do not get that result from statsmodels OLS:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
owls = sm.OLS(yy, XX).fit()
sm_value_u, iv_lu, iv_uu = wls_prediction_std(owls, exog=np.array([[1,zz]]))
sm_unw_v = sm_value_u[0]
print("Unweighted OLS fit prediction from statsmodels.wls_prediction_std is {sp}".format(sp=sm_unw_v))
Instead I obtain a value of 1.695814 (similar things happen with WLS()). Either there is a bug, or using statsmodels for prediction has some trick too obscure for me to find. What is going on?
The results classes have a predict method that provides the prediction for new values of the explanatory variables:
>>> print(owls.predict(np.array([[1,zz]])))
[ 2.26363636]
The first return of wls_prediction_std is the standard error of the prediction, not the prediction itself.
>>> help(wls_prediction_std)
Help on function wls_prediction_std in module statsmodels.sandbox.regression.predstd:
wls_prediction_std(res, exog=None, weights=None, alpha=0.05)
calculate standard deviation and confidence interval for prediction
applies to WLS and OLS, not to general GLS,
that is independently but not identically distributed observations
Parameters
----------
res : regression result instance
results of WLS or OLS regression required attributes see notes
exog : array_like (optional)
exogenous variables for points to predict
weights : scalar or array_like (optional)
weights as defined for WLS (inverse of variance of observation)
alpha : float (default: alpha = 0.05)
confidence level for two-sided hypothesis
Returns
-------
predstd : array_like, 1d
standard error of prediction
same length as rows of exog
interval_l, interval_u : array_like
lower und upper confidence bounds
The sandbox function will be replaced by a new method get_prediction of the results classes that provides the prediction and the extra results like standard deviation and confidence and prediction intervals.
http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.get_prediction.html
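For example, a sketch using the objects defined above, assuming a statsmodels version that already includes get_prediction:

# get_prediction returns both the point prediction and its uncertainty.
pred = owls.get_prediction(np.array([[1, zz]]))
print(pred.predicted_mean)    # point prediction, same as owls.predict(...)
print(pred.se_mean)           # standard error of the mean prediction
print(pred.conf_int())        # confidence interval for the mean
print(pred.summary_frame())   # includes prediction interval bounds as well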
Firstly, this is an assignment I've been set, so I'm only after pointers, and I am restricted to using the following libraries: NumPy, SciPy and Matplotlib.
We have been given a txt file with x and y data from a resonance experiment, and we have to fit both a Gaussian and a Lorentzian to it. I'm working on the Gaussian fit at the minute and have tried following the code laid out in a previous question as a basis for my own code (Gaussian fit for Python).
from numpy import loadtxt, exp, size
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

energy, intensity = loadtxt('resonance_data.txt', unpack=True)

n = size(energy)
mean = 30.7
sigma = 10
intensity0 = 45

def gaus(energy, intensity0, energy0, sigma):
    return intensity0 * exp(-(energy - energy0)**2 / (sigma**2))
popt, pcov = curve_fit(gaus, energy, intensity, p0=[45, mean, sigma])
plt.plot(energy, intensity, 'o')
plt.xlabel('Energy/eV')
plt.ylabel('Intensity')
plt.title('Plot of Intensity against Energy')
plt.plot(energy, gaus(energy, *popt))
plt.show()
Which returns the following graph
If I keep the expressions for mean and sigma as in the linked question, the fitted curve is a horizontal line, so I'm guessing the problem lies in the curve fit not converging or something.
Your data looks heavily skewed to the left, so why a Gaussian? Why not a Boltzmann, a log-normal, or something else?
Much of this is already implemented in scipy.stats. See scipy.stats.cauchy for the Lorentzian and scipy.stats.norm for the Gaussian. An example:
import scipy.stats as ss

A = ss.norm.rvs(0, 5, size=100)  # generate 100 samples with expected mean=0, std=5
ss.norm.fit_loc_scale(A)         # fit both the mean (loc) and std (scale)
# (-0.13053732553697531, 5.163322485150271)  (your numbers will vary)
And I think you don't need the intensity0 parameter: for a normalized density it is just 1/(sigma*sqrt(2*pi)), because the density function has to integrate to 1.
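For the Lorentzian part of the assignment, the same idea works with scipy.stats.cauchy. A sketch on hypothetical samples (for your (energy, intensity) pairs you would still use curve_fit with a Lorentzian profile, since those are curve data rather than draws from the distribution):

import scipy.stats as ss

# Fit location and scale of a Cauchy (Lorentzian) distribution to samples.
B = ss.cauchy.rvs(loc=30.7, scale=10, size=1000)  # hypothetical sample data
print(ss.cauchy.fit(B))                           # returns (loc, scale) estimates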
How can I plot the following noisy data with a smooth, continuous line without considering each individual value? I would like to only show the behavior in a nicer way, without caring about noisy and extreme values. This is the code I am using:
import numpy
import sys
import matplotlib.pyplot as plt
from scipy.interpolate import spline
dataset = numpy.genfromtxt(fname='data', delimiter=",")
dic = {}
for d in dataset:
    dic[d[0]] = d[1]
plt.plot(range(len(dic)), dic.values(),linestyle='-', linewidth=2)
plt.savefig('plot.png')
plt.show()
In a previous answer, I was introduced to the Savitzky Golay filter, a particular type of low-pass filter, well adapted for data smoothing. How "smooth" you want your resulting curve to be is a matter of preference, and this can be adjusted by both the window-size and the order of the interpolating polynomial. Using the cookbook example for sg_filter:
import numpy as np
import sg_filter
import matplotlib.pyplot as plt
# Generate some sample data similar to your post
X = np.arange(1,1000,1)
Y = np.log(X**3) + 10*np.random.random(X.shape)
Y2 = sg_filter.savitzky_golay(Y, 101, 3)
plt.plot(X,Y,linestyle='-', linewidth=2,alpha=.5)
plt.plot(X,Y2,color='r')
plt.show()
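If you would rather not copy the cookbook module, newer versions of SciPy ship an equivalent function, scipy.signal.savgol_filter. A sketch under that assumption, with the same window length and polynomial order as above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

X = np.arange(1, 1000, 1)
Y = np.log(X**3) + 10 * np.random.random(X.shape)

# window length 101 (must be odd), polynomial order 3, as in the cookbook call
Y2 = savgol_filter(Y, 101, 3)

plt.plot(X, Y, linestyle='-', linewidth=2, alpha=.5)
plt.plot(X, Y2, color='r')
plt.show()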
There is more than one way to do it!
Here I show how to reduce noise using a variety of techniques:
Moving average
LOWESS regression
Low pass filter
Interpolation
Sticking with @Hooked's example data for consistency:
import numpy as np
import matplotlib.pyplot as plt
X = np.arange(1, 1000, 1)
Y = np.log(X ** 3) + 10 * np.random.random(X.shape)
plt.plot(X, Y, alpha = .5)
plt.show()
Moving average
Sometimes all you need is a moving average.
For example, using pandas with a window size of 100:
import pandas as pd
df = pd.DataFrame(Y, X)
df_mva = df.rolling(100).mean() # moving average with a window size of 100
df_mva.plot(legend = False);
You will probably have to try several window sizes with your data. Note that the first 99 values of df_mva (one fewer than the window size) will be NaN, but these can be removed with the dropna method.
Usage details for the pandas rolling function.
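For instance, a small sketch continuing from the code above that drops the leading NaNs and overlays the moving average on the raw data:

import matplotlib.pyplot as plt

# Drop the rows where the window is not yet full, then plot both series.
df_clean = df.rolling(100).mean().dropna()
plt.plot(X, Y, alpha=.5)
plt.plot(df_clean.index, df_clean[0])
plt.show()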
LOWESS regression
I've used LOWESS (Locally Weighted Scatterplot Smoothing) successfully to remove noise from repeated measures datasets. More information on local regression methods, including LOWESS and LOESS, here. It's a simple method with only one parameter to tune which in my experience gives good results.
Here is how to apply the LOWESS technique using the statsmodels implementation:
import statsmodels.api as sm
y_lowess = sm.nonparametric.lowess(Y, X, frac = 0.3) # 30 % lowess smoothing
plt.plot(y_lowess[:, 0], y_lowess[:, 1]) # some noise removed
plt.show()
It may be necessary to vary the frac parameter, which is the fraction of the data used when estimating each y value. Increase the frac value to increase the amount of smoothing. The frac value must be between 0 and 1.
Further details on statsmodels lowess usage.
Low pass filter
Scipy provides a set of low pass filters which may be appropriate.
After applying lfilter:
from scipy.signal import lfilter
n = 50 # larger n gives smoother curves
b = [1.0 / n] * n # numerator coefficients
a = 1 # denominator coefficient
y_lf = lfilter(b, a, Y)
plt.plot(X, y_lf)
plt.show()
Check scipy lfilter documentation for implementation details regarding how numerator and denominator coefficients are used in the difference equations.
There are other filters in the scipy.signal package.
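For example, a Butterworth low pass filter applied forwards and backwards with filtfilt avoids the phase shift introduced by lfilter. A sketch continuing from the data above; the order and cutoff are just starting points to tune:

from scipy.signal import butter, filtfilt

# 3rd-order Butterworth low pass filter; 0.01 is the cutoff frequency as a
# fraction of the Nyquist frequency.
b, a = butter(3, 0.01)
y_butter = filtfilt(b, a, Y)

plt.plot(X, y_butter)
plt.show()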
Interpolation
Finally, here is an example of radial basis function interpolation:
from scipy.interpolate import Rbf
rbf = Rbf(X, Y, function = 'multiquadric', smooth = 500)
y_rbf = rbf(X)
plt.plot(X, y_rbf)
plt.show()
A smoother approximation can be achieved by increasing the smooth parameter. Alternative values for the function parameter include 'cubic' and 'thin_plate'. I usually try 'thin_plate' first, followed by 'cubic'; however, both 'thin_plate' and 'cubic' seemed to struggle with the noise in this dataset.
Check other Rbf options in the scipy docs. Scipy provides other univariate and multivariate interpolation techniques (see this tutorial).