I do this linear regression with StatsModels:
import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
n = 100
x = np.linspace(0, 10, n)
e = np.random.normal(size=n)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)
re = sm.OLS(y, X).fit()
print(re.summary())
prstd, iv_l, iv_u = wls_prediction_std(re)
My question is: are iv_l and iv_u the upper and lower confidence intervals or prediction intervals?
How do I get the others?
I need the confidence and prediction intervals for all points, to do a plot.
For test data you can try to use the following.
predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)
I found the summary_frame() method buried here and you can find the get_prediction() method here. You can change the significance level of the confidence interval and prediction interval by modifying the "alpha" parameter.
I am posting this here because this was the first post that comes up when looking for a solution for confidence & prediction intervals – even though it concerns test data rather than the in-sample points asked about.
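For the original in-sample question, the same approach gives both intervals for every point in one call. A minimal sketch, assuming the fitted result re and the arrays X, x, y from the question:
import matplotlib.pyplot as plt

frame = re.get_prediction(X).summary_frame(alpha=0.05)
# columns: mean, mean_se, mean_ci_lower, mean_ci_upper, obs_ci_lower, obs_ci_upper
plt.plot(x, y, 'o')
plt.plot(x, frame['mean'], '-', lw=2)                          # fitted mean
plt.plot(x, frame[['mean_ci_lower', 'mean_ci_upper']], 'g--')  # confidence interval for the mean
plt.plot(x, frame[['obs_ci_lower', 'obs_ci_upper']], 'r--')    # prediction interval for new observations
plt.show()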
Here's a function to take a model, new data, and an arbitrary quantile, using this approach:
def ols_quantile(m, X, q):
    # m: OLS model.
    # X: X matrix.
    # q: Quantile.
    #
    # Set alpha based on q.
    a = q * 2
    if q > 0.5:
        a = 2 * (1 - q)
    predictions = m.get_prediction(X)
    frame = predictions.summary_frame(alpha=a)
    if q > 0.5:
        return frame.obs_ci_upper
    return frame.obs_ci_lower
update: see the second answer, which is more recent. Many of the models and results classes now have a get_prediction method that provides additional information, including prediction intervals and/or confidence intervals for the predicted mean.
old answer:
iv_l and iv_u give you the limits of the prediction interval for each point.
A prediction interval is the confidence interval for a new observation; it includes the estimate of the error, so it is wider than the confidence interval for the mean.
I think the confidence interval for the mean prediction is not yet available in statsmodels.
(Actually, the confidence interval for the fitted values is hiding inside the summary_table of outliers_influence, but I need to verify this.)
Proper prediction methods for statsmodels are on the TODO list.
Addition
Confidence intervals are there for OLS but the access is a bit clumsy.
To be included after running your script:
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import summary_table
st, data, ss2 = summary_table(re, alpha=0.05)
fittedvalues = data[:, 2]
predict_mean_se = data[:, 3]
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
predict_ci_low, predict_ci_upp = data[:, 6:8].T
# Check we got the right things
print(np.max(np.abs(re.fittedvalues - fittedvalues)))
print(np.max(np.abs(iv_l - predict_ci_low)))
print(np.max(np.abs(iv_u - predict_ci_upp)))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
plt.show()
This should give the same results as SAS, http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html
With time series results, you get a much smoother plot using the get_forecast() method. An example of time series is below:
# Seasonal ARIMA modeling, no exogenous variable
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

# train and test are pandas objects with an 'MI' column (the poster's own data)
model = SARIMAX(train['MI'], order=(1,1,1), seasonal_order=(1,1,0,12), enforce_invertibility=True)
results = model.fit()
results.summary()
The next step is to make the predictions; this generates the confidence intervals.
# make the predictions for 11 steps ahead
predictions_int = results.get_forecast(steps=11)
predictions_int.predicted_mean
These can be put in a data frame but need some cleaning up:
# get a better view
predictions_int.conf_int()
Concatenate into a data frame, then clean up the headers.
conf_df = pd.concat([test['MI'],predictions_int.predicted_mean, predictions_int.conf_int()], axis = 1)
conf_df.head()
Then we rename the columns.
conf_df = conf_df.rename(columns={0: 'Predictions', 'lower MI': 'Lower CI', 'upper MI': 'Upper CI'})
conf_df.head()
Make the plot.
# make a plot of model fit
# color = 'skyblue'
fig = plt.figure(figsize = (16,8))
ax1 = fig.add_subplot(111)
x = conf_df.index.values
upper = conf_df['Upper CI']
lower = conf_df['Lower CI']
conf_df['MI'].plot(color = 'blue', label = 'Actual')
conf_df['Predictions'].plot(color = 'orange',label = 'Predicted' )
upper.plot(color = 'grey', label = 'Upper CI')
lower.plot(color = 'grey', label = 'Lower CI')
# plot the legend for the first plot
plt.legend(loc = 'lower left', fontsize = 12)
# fill between the conf intervals
plt.fill_between(x, lower, upper, color='grey', alpha=0.2)
plt.ylim(1000,3500)
plt.show()
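Depending on your statsmodels version, the forecast object may also have a summary_frame() method that collects the predicted mean, its standard error, and the confidence bounds in one DataFrame, which saves some of the manual concatenation above:
predictions_int.summary_frame(alpha=0.05)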
You can get the prediction intervals by using LRPI() class from the Ipython notebook in my repo (https://github.com/shahejokarian/regression-prediction-interval).
You need to set the t value to get the desired confidence interval for the prediction values; otherwise the default is a 95% confidence interval.
The LRPI class uses sklearn.linear_model's LinearRegression, numpy and pandas libraries.
There is an example shown in the notebook too.
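The LRPI class itself lives in the linked notebook. Purely as a rough illustration of the idea (not the LRPI implementation; the function name and the simplification of ignoring the leverage term are mine), a t-based prediction interval around sklearn predictions could look like this:
import numpy as np
import pandas as pd
from scipy import stats

def rough_prediction_interval(model, X_train, y_train, X_new, alpha=0.05):
    # model: an already-fitted sklearn LinearRegression
    n, p = X_train.shape
    resid = y_train - model.predict(X_train)
    s = np.sqrt(np.sum(resid**2) / (n - p - 1))   # residual standard deviation (intercept counted)
    t = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
    y_new = model.predict(X_new)
    half = t * s                                  # ignores the leverage term for simplicity
    return pd.DataFrame({'pred': y_new, 'lower': y_new - half, 'upper': y_new + half})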
summary_frame and summary_table work well when you need exact results for a single quantile, but don't vectorize well. This will provide a normal approximation of the prediction interval (not confidence interval) and works for a vector of quantiles:
import numpy as np
from scipy.stats import norm

def ols_quantile(m, X, q):
    # m: statsmodels OLS results object.
    # X: X matrix of data to predict.
    # q: quantile (scalar, or an array with one value per row of X).
    mean_pred = m.predict(X)
    se = np.sqrt(m.scale)
    return mean_pred + norm.ppf(q) * se
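For example, with the re and X from the question, a call like the following gives the per-point upper bound (q can also be an array with one value per row of X):
upper = ols_quantile(re, X, 0.975)   # 97.5% normal-approximation upper bound for each point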
To add to Max Ghenis' response here - you can use .get_prediction() to generate confidence intervals, not just prediction intervals, by using .conf_int() after.
predictions = result.get_prediction(out_of_sample_df)
predictions.conf_int(alpha = 0.05)
You can calculate them based on results given by statsmodel and the normality assumptions.
Here is an example for OLS and CI for the mean value:
import statsmodels.api as sm
import numpy as np
from scipy import stats
#Significance level:
sl = 0.05
#Evaluate mean value at a required point x0. Here, at the point (0.0,2.0) for N_model=2:
x0 = np.asarray([1.0, 0.0, 2.0])# If you have no constant in your model, remove the first 1.0. For more dimensions, add the desired values.
#Get an OLS model based on output y and the prepared vector X (as in your notation):
model = sm.OLS(endog = y, exog = X )
results = model.fit()
#Get two-tailed t-values for the (1 - sl) confidence level:
(t_minus, t_plus) = stats.t.interval(1.0 - sl, df=len(results.resid) - len(x0))
y_value_at_x0 = np.dot(results.params, x0)
lower_bound = y_value_at_x0 + t_minus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
upper_bound = y_value_at_x0 + t_plus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
You can wrap a nice function around this with input results, point x0 and significance level sl.
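A minimal sketch of such a wrapper (the name mean_ci is mine); for a prediction interval at x0 you would use results.mse_resid*(1 + x0' C x0) inside the square root instead:
def mean_ci(results, x0, sl=0.05):
    # Confidence interval for the mean response at the point x0 (C = normalized_cov_params).
    t_minus, t_plus = stats.t.interval(1.0 - sl, df=len(results.resid) - len(x0))
    y0 = np.dot(results.params, x0)
    half = np.sqrt(results.mse_resid * np.dot(np.dot(x0.T, results.normalized_cov_params), x0))
    return y0 + t_minus * half, y0 + t_plus * half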
I am unsure now if you can use this for WLS() since there are extra things happening there.
Ref: Ch3 in [D.C. Montgomery and E.A. Peck. “Introduction to Linear Regression Analysis.” 4th. Ed., Wiley, 1992].
Related
I am trying to use lmfit to fit a power law model of the form y ~ a(x-x0)^b + d. I used the built-in models, which exclude the parameter x0:
Data:
x = [57924.223, 57925.339, 57928.226, 57929.22 , 57930.222, 57931.323, 57933.205,
57935.302, 57939.28 , 57951.282]
y = [14455.95775513, 13838.7702847 , 11857.5599917 , 11418.98888834, 11017.30612092,
10905.00155524, 10392.55775922, 10193.91608535,9887.8610764 , 8775.83459273]
err = [459.56414237, 465.27518505, 448.25224285, 476.64621165, 457.05994986,
458.37532126, 469.89966451, 473.68349925, 455.91446878, 507.48473313]
from lmfit.models import PowerLawModel, ConstantModel
power = PowerLawModel()
offset = ConstantModel()
model = power + offset
pars = offset.guess(y, x = x)
pars += power.guess(y, x= x)
result = model.fit(y, x = x, weights=1/err )
print(result.fit_report())
This brings up an error because my data starts at about x = 57000. I was initially offsetting my x-axis by x - 57923.24 for all x values, which gave me an okay fit. I would like to know how I can implement an x-axis offset.
I was looking into expression models...
from lmfit.models import ExpressionModel
mod = ExpressionModel('a*(x-x0)**(b)+d')
mod.guess(y, x=x)
But with this I get an error that guess() is not implemented. If I create my own parameters, I get the following error too.
ValueError: The model function generated NaN values and the fit aborted! Please check
your model function and/or set boundaries on parameters where applicable. In cases
like this, using "nan_policy='omit'" will probably not work.
any help would be appreciated :)
It turns out that (x-x0)**b is a particularly tricky case to fit. Exponential decays typically require very good initial values for parameters or data over a few decades of decay.
In addition, because x**b is complex when x<0 and b is non-integer, this can mess up the fitting algorithms, which deal strictly with Float64 real values. You also have to be careful in setting bounds on x0 so that x-x0 is always positive and/or allow for values to move along the imaginary axis and then force them back to real.
For your data, getting initial values is somewhat tricky, and with a limited data range (specifically not having a huge drop in intensity), it is hard to be confident that there is a single unique solution - note the high uncertainty in the parameters below. Still, I think this will be close to what you are trying to do:
import numpy as np
from lmfit.models import Model
from matplotlib import pyplot as plt
# Note: use numpy arrays, not lists!
x = np.array([57924.223, 57925.339, 57928.226, 57929.22 , 57930.222,
57931.323, 57933.205, 57935.302, 57939.28 , 57951.282])
y = np.array([14455.95775513, 13838.7702847 , 11857.5599917 ,
11418.98888834, 11017.30612092, 10905.00155524,
10392.55775922, 10193.91608535,9887.8610764 , 8775.83459273])
err = np.array([459.56414237, 465.27518505, 448.25224285, 476.64621165,
457.05994986, 458.37532126, 469.89966451, 473.68349925,
455.91446878, 507.48473313])
# define a function rather than use ExpressionModel (easier to debug)
def powerlaw(x, x0, amplitude, exponent, offset):
    return offset + amplitude * (x - x0)**exponent
pmodel = Model(powerlaw)
# make parameters with good initial values: challenging!
params = pmodel.make_params(x0=57800, amplitude=1.5e9, exponent=-2.5, offset=7500)
# set bounds on `x0` to prevent (x-x0)**b going complex
params['x0'].min = 57000
params['x0'].max = x.min()-2
# set bounds on other parameters too
params['amplitude'].min = 1.e7
params['offset'].min = 0
params['offset'].max = 50000
params['exponent'].max = 0
params['exponent'].min = -6
# run fit
result = pmodel.fit(y, params, x=x)
print(result.fit_report())
plt.errorbar(x, y, err, marker='o', linewidth=2, label='data')
plt.plot(x, result.init_fit, label='initial guess')
plt.plot(x, result.best_fit, label='best fit')
plt.legend()
plt.show()
Alternatively, you could force x to be complex (say, multiply by (1.0+0j)) and then have your model function return (offset + amplitude * (x-x0)**exponent).real. But I think you would still need carefully selected bounds on the parameters that depend strongly on the actual data you are fitting.
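A sketch of that complex-x variant, under the same caveats (this is just the idea from the paragraph above, not tested against your data):
def powerlaw_complex(x, x0, amplitude, exponent, offset):
    # promote x to complex so (x - x0)**exponent is defined even where x - x0 < 0,
    # then hand the real part back to the fit
    xc = (1.0 + 0j) * x
    return (offset + amplitude * (xc - x0)**exponent).real

pmodel_c = Model(powerlaw_complex)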
Anyway, this will print a report of
[[Model]]
Model(powerlaw)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 304
# data points = 10
# variables = 4
chi-square = 247807.309
reduced chi-square = 41301.2182
Akaike info crit = 109.178217
Bayesian info crit = 110.388557
[[Variables]]
x0: 57907.5658 +/- 7.81258329 (0.01%) (init = 57800)
amplitude: 10000061.9 +/- 45477987.8 (454.78%) (init = 1.5e+09)
exponent: -2.63343429 +/- 1.19163855 (45.25%) (init = -2.5)
offset: 8487.50872 +/- 375.212255 (4.42%) (init = 7500)
[[Correlations]] (unreported correlations are < 0.100)
C(amplitude, exponent) = -0.997
C(x0, amplitude) = -0.982
C(x0, exponent) = 0.965
C(exponent, offset) = -0.713
C(amplitude, offset) = 0.662
C(x0, offset) = -0.528
(Note the huge uncertainty on amplitude!) It also generates a plot of the data, the initial guess, and the best fit.
I'm trying to use lmfit to find the best fit parameters of a function for some random data using the Model and Parameters classes. However, it doesn't seem to be exploring the parameter space very much. It does ~10 function evaluations and then returns a terrible fit.
Here is the code:
import numpy as np
from lmfit.model import Model
from lmfit.parameter import Parameters
import matplotlib.pyplot as plt
def dip(x, loc, wid, dep):
    """Make a line with a dip in it"""
    # Array of ones
    y = np.ones_like(x)
    # Define start and end points of dip
    start = np.abs(x - (loc - (wid/2.))).argmin()
    end = np.abs(x - (loc + (wid/2.))).argmin()
    # Set depth of the dip
    y[start:end] *= dep
    return y
def fitter(x, loc, wid, dep, scatter=0.001, sigma=3):
    """Find the parameters of the dip function in random data"""
    # Make the lmfit model
    model = Model(dip)
    # Make random data and print input values
    rand_loc = abs(np.random.normal(loc, scale=0.02))
    rand_wid = abs(np.random.normal(wid, scale=0.03))
    rand_dep = abs(np.random.normal(dep, scale=0.005))
    print('rand_loc: {}\nrand_wid: {}\nrand_dep: {}\n'.format(rand_loc, rand_wid, rand_dep))
    data = dip(x, rand_loc, rand_wid, rand_dep) + np.random.normal(0, scatter, x.size)
    # Make parameter ranges
    params = Parameters()
    params.add('loc', value=loc, min=x.min(), max=x.max())
    params.add('wid', value=wid, min=0, max=x.max()-x.min())
    params.add('dep', value=dep, min=scatter*10, max=0.8)
    # Fit the data
    result = model.fit(data, params, x=x)
    print(result.fit_report())
    # Plot it
    plt.plot(x, data, 'bo')
    plt.plot(x, result.init_fit, 'k--', label='initial fit')
    plt.plot(x, result.best_fit, 'r-', label='best fit')
    plt.legend(loc='best')
    plt.show()
And then I run:
fitter(np.linspace(55707.97, 55708.1, 100), loc=55708.02, wid=0.04, dep=0.98)
Which returns (for example, since it's randomized data):
rand_loc: 55707.99659784677
rand_wid: 0.02015076619874132
rand_dep: 0.9849809461153651
[[Model]]
Model(dip)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 9
# data points = 100
# variables = 3
chi-square = 0.00336780
reduced chi-square = 3.4720e-05
Akaike info crit = -1023.86668
Bayesian info crit = -1016.05117
## Warning: uncertainties could not be estimated:
loc: at initial value
wid: at initial value
[[Variables]]
loc: 55708.0200 (init = 55708.02)
wid: 0.04000000 (init = 0.04)
dep: 0.99754082 (init = 0.98)
Any idea why it executes so few function evaluations returning a bad fit? Any assistance with this would be greatly appreciated!
This is a similar question to fitting step function with variation in the step location with scipy optimize curve_fit. See https://stackoverflow.com/a/59504874/5179748.
Basically, the solvers in scipy.optimize/lmfit assume that parameters are continuous -- not discrete -- variables. They make small changes to the parameters to see what change that makes in the result. A small change in your loc and wid parameters will have no effect on the result, as argmin() will always return an integer value.
You might find that using a Rectangle Model with a finite width (see https://lmfit.github.io/lmfit-py/builtin_models.html#rectanglemodel) will be helpful. I changed your example a bit, but it should be enough to get you started:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import RectangleModel, ConstantModel
def dip(x, loc, wid, dep):
    """Make a line with a dip in it"""
    # Array of ones
    y = np.ones_like(x)
    # Define start and end points of dip
    start = np.abs(x - (loc - (wid/2.))).argmin()
    end = np.abs(x - (loc + (wid/2.))).argmin()
    # Set depth of the dip
    y[start:end] *= dep
    return y
x = np.linspace(0, 1, 201)
data = dip(x, 0.3, 0.09, 0.98) + np.random.normal(0, 0.001, x.size)
model = RectangleModel() + ConstantModel()
params = model.make_params(c=1.0, amplitude=-0.01, center1=.100, center2=0.7, sigma1=0.15)
params['sigma2'].expr = 'sigma1' # force left and right widths to be the same size
params['c'].vary = False # force offset = 1.0 : value away from "dip"
result = model.fit(data, params, x=x)
print(result.fit_report())
plt.plot(x, data, 'bo')
plt.plot(x, result.init_fit, 'k--', label='initial fit')
plt.plot(x, result.best_fit, 'r-', label='best fit')
plt.legend(loc='best')
plt.show()
I want to iteratively fit a curve to data in python with the following approach:
1. Fit a polynomial curve (or any non-linear approach).
2. Discard values more than 2 standard deviations from the mean of the curve.
3. Repeat steps 1 and 2 till all values are within the confidence interval of the curve.
I can fit a polynomial curve as follows:
import numpy as np

vals = np.array([0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
0.61574739])
x_values = np.linspace(0, 1, len(vals))
poly_degree = 3
coeffs = np.polyfit(x_values, vals, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)
How do I do steps 2 and 3?
Since you want to eliminate points too far from an expected solution, you are probably looking for RANSAC (RANdom SAmple Consensus), which fits a curve (or any other function) to data within certain bounds, like your case with 2*STD.
You can use the scikit-learn RANSAC estimator, which is well aligned with included regressors such as LinearRegression. For your polynomial case you need to define your own regression class:
import numpy as np
from sklearn.metrics import mean_squared_error

class PolynomialRegression(object):
    def __init__(self, degree=3, coeffs=None):
        self.degree = degree
        self.coeffs = coeffs

    def fit(self, X, y):
        self.coeffs = np.polyfit(X.ravel(), y, self.degree)

    def get_params(self, deep=False):
        return {'coeffs': self.coeffs}

    def set_params(self, coeffs=None, random_state=None):
        self.coeffs = coeffs

    def predict(self, X):
        poly_eqn = np.poly1d(self.coeffs)
        y_hat = poly_eqn(X.ravel())
        return y_hat

    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))
and then you can use RANSAC:
from sklearn.linear_model import RANSACRegressor

# x_vals, y_vals are the x_values and vals arrays from the question
ransac = RANSACRegressor(PolynomialRegression(degree=poly_degree),
                         residual_threshold=2 * np.std(y_vals),
                         random_state=0)
ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
inlier_mask = ransac.inlier_mask_
Note that the X variable is transformed to a 2D array because the sklearn RANSAC implementation requires it, and it is flattened back inside our custom class because numpy's polyfit works with 1D arrays.
y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))
plt.plot(x_vals, y_vals, 'bx', label='input samples')
plt.plot(x_vals[inlier_mask], y_vals[inlier_mask], 'go', label='inliers (2*STD)')
plt.plot(x_vals, y_hat, 'r-', label='estimated curve')
Moreover, playing with the polynomial order and the residual distance, I got the following results with degree=4 and a 1*STD range.
Another option is to use a more flexible regressor such as a Gaussian process:
from sklearn.gaussian_process import GaussianProcessRegressor
ransac = RANSACRegressor(GaussianProcessRegressor(),
residual_threshold=np.std(y_vals))
As for generalizing to a DataFrame, you just need to set all columns except one as features and the one remaining as the output, like here:
import pandas as pd
df = pd.DataFrame(np.array([x_vals, y_vals]).T)
ransac.fit(df[df.columns[:-1]], df[df.columns[-1]])
y_hat = ransac.predict(df[df.columns[:-1]])
it doesn't look like you'll get anything worthwhile following that procedure; there are much better techniques for handling unexpected data. googling for "outlier detection" would be a good start.
with that said, here's how to answer your question:
start by pulling in libraries and getting some data:
import matplotlib.pyplot as plt
import numpy as np
Y = np.array([
0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
0.61574739])
X = np.linspace(0, 1, len(Y))
next do an initial plot of the data:
plt.plot(X, Y, '.')
as this lets you see what we're dealing with and whether a polynomial would ever be a good fit --- short answer is that this method isn't going to get very far with this sort of data
at this point we should stop, but to answer the question I'll go on, mostly following your polynomial fitting code:
poly_degree = 5
sd_cutoff = 1 # 2 keeps everything
coeffs = np.polyfit(X, Y, poly_degree)
poly_eqn = np.poly1d(coeffs)
Y_hat = poly_eqn(X)
delta = Y - Y_hat
sd_p = np.std(delta)
ok = abs(delta) < sd_p * sd_cutoff
hopefully this makes sense: I use a higher degree polynomial and only cut off at 1 SD because otherwise nothing will be thrown away. the ok array contains True values for those points that are within sd_cutoff standard deviations
to check this, I'd then do another plot. something like:
plt.scatter(X, Y, color=np.where(ok, 'k', 'r'))
plt.fill_between(
X,
Y_hat - sd_p * sd_cutoff,
Y_hat + sd_p * sd_cutoff,
color='#00000020')
plt.plot(X, Y_hat)
which gives me:
so the black dots are the points to keep (i.e. X[ok] gives me these back, and np.where(ok) gives you indices).
you can play around with the parameters, but you probably want a distribution with fatter tails (e.g. a Student's t-distribution); as I said above though, using Google for outlier detection would be my suggestion
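as one concrete option (my own sketch, not part of the suggestion above), you could also refit the polynomial with a robust soft-L1 loss via scipy.optimize.least_squares, so single outliers pull the curve less (f_scale here is just a guess to tune):
from scipy.optimize import least_squares

def poly_resid(coeffs, x, y):
    return np.polyval(coeffs, x) - y

# start from the ordinary polyfit coefficients and re-fit with a robust loss
robust = least_squares(poly_resid, x0=np.polyfit(X, Y, poly_degree),
                       args=(X, Y), loss='soft_l1', f_scale=0.1)
Y_hat_robust = np.polyval(robust.x, X)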
There are three functions needed to solve this. First, a line-fitting function is necessary to fit a line to a set of points:
def fit_line(x_values, vals, poly_degree):
    coeffs = np.polyfit(x_values, vals, poly_degree)
    poly_eqn = np.poly1d(coeffs)
    y_hat = poly_eqn(x_values)
    return poly_eqn, y_hat
We need to know the standard deviation from the points to the line. This function computes that standard deviation:
def compute_sd(x_values, vals, y_hat):
    distances = []
    for x, y, y1 in zip(x_values, vals, y_hat):
        distances.append(abs(y - y1))
    return np.std(distances)
Finally, we need to compare the distance from a point to the line. The point needs to be thrown out if the distance from the point to the line is greater than two times the standard deviation.
def compare_distances(x_values, vals):
    new_vals, new_x_vals = [], []
    for x, y in zip(x_values, vals):
        y1 = np.polyval(poly_eqn, x)
        distance = abs(y - y1)
        if distance < 2*sd:
            plt.plot((x, x), (y, y1), c='g')
            new_vals.append(y)
            new_x_vals.append(x)
        else:
            plt.plot((x, x), (y, y1), c='r')
            plt.scatter(x, y, c='r')
    return new_vals, new_x_vals
As you can see in the following graphs, this method does not work well for fitting a line to data that has a lot of outliers. All the points end up getting eliminated for being too far from the fitted line.
while len(vals) > 0:
    poly_eqn, y_hat = fit_line(x_values, vals, poly_degree)
    plt.scatter(x_values, vals)
    plt.plot(x_values, y_hat)
    sd = compute_sd(x_values, vals, y_hat)
    new_vals, new_x_vals = compare_distances(x_values, vals)
    plt.show()
    vals, x_values = np.array(new_vals), np.array(new_x_vals)
The following code gives a specific value of the slope, but I want to calculate it with some uncertainty, like (1.95 ± 0.03). How can I do that?
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x= np.arange(10)
y = np.array([2,4,25,8,10,30,14,16,28,20])
z = stats.linregress(x,y)
print (z)
slope = z[0]
intercept = z[1]
line = slope*x + intercept
plt.plot(x,y,'o', label='original data')
plt.plot(x,line,color='green', label='fitted line')
plt.xlabel("independent _variable")
plt.ylabel("dependentt_variable")
plt.savefig("./linear regression")
You can calculate the confidence interval for the slope using the formula described in detail here. From your code (assuming you are using the scipy.stats package), you can find the 100(1−α)% confidence interval as follows:
alpha = 0.05
CI = [z.slope + z.stderr*t for t in stats.t.interval(1 - alpha, len(x)-2)]
print(CI)
# roughly [-0.17, 4.09] for this data
To print the confidence interval in the form stated in your question:
halfwidth = z.stderr*stats.t.interval(1 - alpha, len(x)-2)[1]
print('({} +/- {})'.format(z.slope, halfwidth))
# roughly (1.96 +/- 2.13)
Alternatively, you could use the StatsModels package which has a built-in method to find the confidence interval. This is explained in the question found here.
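For completeness, a short sketch of the statsmodels route (using the same x and y as above); conf_int returns rows for the intercept and the slope:
import statsmodels.api as sm

X_sm = sm.add_constant(x)               # add the intercept column
ols_res = sm.OLS(y, X_sm).fit()
print(ols_res.conf_int(alpha=0.05))     # row 0: intercept bounds, row 1: slope bounds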
I am trying to recreate maximum likelihood distribution fitting; I can already do this in Matlab and R, but now I want to use scipy. In particular, I would like to estimate the Weibull distribution parameters for my data set.
I have tried this:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
def weib(x, n, a):
    return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)
data = np.loadtxt("stack_data.csv")
(loc, scale) = s.exponweib.fit_loc_scale(data, 1, 1)
print(loc, scale)
x = np.linspace(data.min(), data.max(), 1000)
plt.plot(x, weib(x, loc, scale))
plt.hist(data, bins=int(data.max()), density=True)
plt.show()
And get this:
(2.5827280639441961, 3.4955032285727947)
And a distribution that looks like this:
I have been using the exponweib after reading this http://www.johndcook.com/distributions_scipy.html. I have also tried the other Weibull functions in scipy (just in case!).
In Matlab (using the Distribution Fitting Tool - see screenshot) and in R (using both the MASS library function fitdistr and the GAMLSS package) I get a (loc) and b (scale) parameters more like 1.58463497 5.93030013. I believe all three methods use the maximum likelihood method for distribution fitting.
I have posted my data here if you would like to have a go! And for completeness I am using Python 2.7.5, Scipy 0.12.0, R 2.15.2 and Matlab 2012b.
Why am I getting a different result!?
My guess is that you want to estimate the shape parameter and the scale of the Weibull distribution while keeping the location fixed. Fixing loc assumes that the values of your data and of the distribution are positive with lower bound at zero.
floc=0 keeps the location fixed at zero, f0=1 keeps the first shape parameter of the exponential weibull fixed at one.
>>> stats.exponweib.fit(data, floc=0, f0=1)
[1, 1.8553346917584836, 0, 6.8820748596850905]
>>> stats.weibull_min.fit(data, floc=0)
[1.8553346917584836, 0, 6.8820748596850549]
The fit compared to the histogram looks ok, but not very good. The parameter estimates are a bit higher than the ones you mention from R and Matlab.
Update
The closest I can get to the plot that is now shown in the question is with an unrestricted fit, but using starting values. The plot is still less peaked. Note: values in fit that don't have an f in front are used as starting values.
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> plt.plot(data, stats.exponweib.pdf(data, *stats.exponweib.fit(data, 1, 1, scale=02, loc=0)))
>>> _ = plt.hist(data, bins=np.linspace(0, 16, 33), normed=True, alpha=0.5);
>>> plt.show()
It is easy to verify which result is the true MLE; you just need a simple function to calculate the log-likelihood:
>>> def wb2LL(p, x):  # log-likelihood of a two-parameter Weibull, p = [scale, shape]
...     return np.sum(np.log(stats.weibull_min.pdf(x, p[1], 0., p[0])))
>>> adata = np.loadtxt('/home/user/stack_data.csv')
>>> wb2LL(np.array([6.8820748596850905, 1.8553346917584836]), adata)
-8290.1227946678173
>>> wb2LL(np.array([5.93030013, 1.57463497]), adata)
-8410.3327470347667
The result from the fit method of exponweib and from R's fitdistr (@Warren) is better and has a higher log-likelihood. It is more likely to be the true MLE. It is not surprising that the result from GAMLSS is different; it is a completely different statistical model: a Generalized Additive Model.
Still not convinced? We can draw a 2D confidence-limit plot around the MLE (see Meeker and Escobar's book for details).
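A rough sketch of how such a contour plot could be produced (my own code, assuming the wb2LL function and adata array defined above):
import matplotlib.pyplot as plt

scales = np.linspace(6.0, 8.0, 60)
shapes = np.linspace(1.5, 2.2, 60)
LL = np.array([[wb2LL([s, k], adata) for s in scales] for k in shapes])
plt.contour(scales, shapes, LL, levels=20)
plt.plot(6.8821, 1.8553, 'r+')   # the MLE from the fit above
plt.xlabel('scale'); plt.ylabel('shape')
plt.show()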
Again, this verifies that array([6.8820748596850905, 1.8553346917584836]) is the right answer, as the log-likelihood there is higher than at any other point in the parameter space. Note:
>>> log(array([6.8820748596850905, 1.8553346917584836]))
array([ 1.92892018, 0.61806511])
BTW1, the MLE fit may not appear to fit the distribution histogram tightly. An easy way to think about MLE is that it is the parameter estimate that is most probable given the observed data. It doesn't need to visually fit the histogram well; that would be something like minimizing the mean squared error.
BTW2, your data appear to be leptokurtic and left-skewed, which means the Weibull distribution may not fit your data well. Try, e.g., Gompertz-Logistic, which improves the log-likelihood by about another 100.
Cheers!
I know it's an old post, but I just faced a similar problem and this thread helped me solve it. Thought my solution might be helpful for others like me:
# Fit Weibull function, some explanation below
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# data is the array loaded from the question's CSV
params = stats.exponweib.fit(data, floc=0, f0=1)
shape = params[1]
scale = params[3]
print('shape:', shape)
print('scale:', scale)
#### Plotting
# Histogram first
values, bins, hist = plt.hist(data, bins=51, range=(0,25), density=True)
center = (bins[:-1] + bins[1:]) / 2.
# Using all params and the stats function
plt.plot(center,stats.exponweib.pdf(center,*params),lw=4,label='scipy')
# Using my own Weibull function as a check
def weibull(u, shape, scale):
    '''Weibull distribution for wind speed u with shape parameter k and scale parameter A'''
    return (shape / scale) * (u / scale)**(shape-1) * np.exp(-(u/scale)**shape)
plt.plot(center,weibull(center,shape,scale),label='Wind analysis',lw=2)
plt.legend()
Some extra info that helped me understand:
The scipy Weibull function can take four input parameters: (a, c), loc and scale.
You want to fix loc and the first shape parameter (a); this is done with floc=0, f0=1. Fitting will then give you params c and scale, where c corresponds to the shape parameter of the two-parameter Weibull distribution (often used in wind data analysis) and scale corresponds to its scale factor.
From docs:
exponweib.pdf(x, a, c) =
a * c * (1-exp(-x**c))**(a-1) * exp(-x**c)*x**(c-1)
If a is 1, then
exponweib.pdf(x, a, c) =
c * (1-exp(-x**c))**(0) * exp(-x**c)*x**(c-1)
= c * (1) * exp(-x**c)*x**(c-1)
= c * x **(c-1) * exp(-x**c)
From this, the relation to the 'wind analysis' Weibull function should be more clear
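If you want to double-check the algebra above numerically, a small sketch (the values here are arbitrary):
import numpy as np
from scipy import stats

u = np.linspace(0.1, 20, 50)
k, A = 2.0, 8.0   # arbitrary shape and scale
manual = (k / A) * (u / A)**(k - 1) * np.exp(-(u / A)**k)
print(np.allclose(manual, stats.exponweib.pdf(u, 1, k, loc=0, scale=A)))   # True
print(np.allclose(manual, stats.weibull_min.pdf(u, k, loc=0, scale=A)))    # True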
I was curious about your question and, although this is not an answer, it compares the Matlab result with your result and with the result using leastsq, which showed the best correlation with the given data:
The code is as follows:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as mtrand
from scipy.integrate import quad
from scipy.optimize import leastsq
## Weibull pdf (same definition as in the question)
def weib(x, n, a):
    return (a / n) * (x / n)**(a-1) * np.exp(-(x/n)**a)
def residuals(p, x, y):
    integral = quad(weib, 0, 16, args=(p[0], p[1]))[0]
    penalization = abs(1. - integral) * 100000
    return y - weib(x, p[0], p[1]) + penalization
#
data = np.loadtxt("stack_data.csv")
x = np.linspace(data.min(), data.max(), 100)
n, bins, patches = plt.hist(data, bins=x, density=True)
binsm = (bins[1:]+bins[:-1])/2
popt, pcov = leastsq(func=residuals, x0=(1.,1.), args=(binsm,n))
loc, scale = 1.58463497, 5.93030013
plt.plot(binsm,n)
plt.plot(x, weib(x, loc, scale),
label='weib matlab, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
loc, scale = s.exponweib.fit_loc_scale(data, 1, 1)
plt.plot(x, weib(x, loc, scale),
label='weib stack, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
plt.plot(x, weib(x,*popt),
label='weib leastsq, loc=%1.3f, scale=%1.3f' % tuple(popt), lw=4.)
plt.legend(loc='upper right')
plt.show()
I had the same problem, but found that setting loc=0 in exponweib.fit primed the pump for the optimization. That was all that was needed from #user333700's answer. I couldn't load your data -- your data link points to an image, not data. So I ran a test on my data instead:
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
N = 30
# x here is my own data array; query_uri below is just a title string for it
counts, bins = np.histogram(x, bins=N)
bin_width = bins[1]-bins[0]
total_count = float(sum(counts))
f, ax = plt.subplots(1, 1)
f.suptitle(query_uri)
ax.bar(bins[:-1]+bin_width/2., counts, align='center', width=.85*bin_width)
ax.grid('on')
def fit_pdf(x, name='lognorm', color='r'):
    dist = getattr(ss, name)  # params = shape, loc, scale
    # dist = ss.gamma  # 3 params
    params = dist.fit(x, loc=0)  # 1-day lag minimum for shipping
    y = dist.pdf(bins, *params)*total_count*bin_width
    sqerror_sum = np.log(sum(ci*(yi - ci)**2. for (ci, yi) in zip(counts, y)))
    ax.plot(bins, y, color, lw=3, alpha=0.6, label='%s err=%3.2f' % (name, sqerror_sum))
    return y
colors = ['r-', 'g-', 'r:', 'g:']
for name, color in zip(['exponweib', 't', 'gamma'], colors):  # 'lognorm', 'erlang', 'chi2', 'weibull_min',
    y = fit_pdf(x, name=name, color=color)
ax.legend(loc='best', frameon=False)
plt.show()
There have been a few answers to this already, here and in other places, like in Weibull distribution and the data in the same figure (with numpy and scipy).
It still took me a while to come up with a clean toy example, so I thought it would be useful to post it.
from scipy import stats
import matplotlib.pyplot as plt
#input for pseudo data
N = 10000
Kappa_in = 1.8
Lambda_in = 10
a_in = 1
loc_in = 0
#Generate data from given input
data = stats.exponweib.rvs(a=a_in,c=Kappa_in, loc=loc_in, scale=Lambda_in, size = N)
#The a and loc are fixed in the fit since it is standard to assume they are known
a_out, Kappa_out, loc_out, Lambda_out = stats.exponweib.fit(data, f0=a_in,floc=loc_in)
#Plot
bins = range(51)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(bins, stats.exponweib.pdf(bins, a=a_out,c=Kappa_out,loc=loc_out,scale = Lambda_out))
ax.hist(data, bins = bins , density=True, alpha=0.5)
ax.annotate("Shape: $k = %.2f$ \n Scale: $\lambda = %.2f$"%(Kappa_out,Lambda_out), xy=(0.7, 0.85), xycoords=ax.transAxes)
plt.show()
In the meantime, there is a really good package out there: reliability. Here is the documentation: reliability # readthedocs.
Your code simply becomes:
from reliability.Fitters import Fit_Weibull_2P
...
wb = Fit_Weibull_2P(failures=data)
plt.show()
Saves a lot of headaches and makes beautiful plots, too.
The order of loc and scale is messed up in the code:
plt.plot(x, weib(x, scale, loc))
The scale parameter should come first.