How to determine which regression curve fits better? (Python)

Well, community:
I recently asked how to do exponential regression (Exponential regression function Python), thinking that the optimal regression for this data set was the hyperbolic one.
import numpy as np
x_data = np.arange(0, 51)
y_data = np.array([0.001, 0.199, 0.394, 0.556, 0.797, 0.891, 1.171, 1.128, 1.437,
                   1.525, 1.720, 1.703, 1.895, 2.003, 2.108, 2.408, 2.424, 2.537,
                   2.647, 2.740, 2.957, 2.58, 3.156, 3.051, 3.043, 3.353, 3.400,
                   3.606, 3.659, 3.671, 3.750, 3.827, 3.902, 3.976, 4.048, 4.018,
                   4.286, 4.353, 4.418, 4.382, 4.444, 4.485, 4.465, 4.600, 4.681,
                   4.737, 4.792, 4.845, 4.909, 4.919, 5.100])
Now I have doubts:
The first is an exponential fit and the second is hyperbolic. I don't know which is better. How can I determine that? Which criteria should I follow? Is there a Python function for it?
Thanks in advance!

One common fit statistic is R-squared (R2), which can be calculated as "R2 = 1.0 - (absolute_error_variance / dependent_data_variance)". It tells you what fraction of the dependent data variance is explained by your model. For example, if the R-squared value is 0.95 then your model explains 95% of the dependent data variance. Since you are using numpy, the R-squared value is trivially calculated as "R2 = 1.0 - (abs_err.var() / dep_data.var())", because numpy arrays have a var() method to calculate variance. When fitting your data to the Michaelis-Menten equation "y = ax / (b + x)" with parameter values of a = 1.0232217656373191E+01 and b = 5.2016057362771100E+01, I calculate an R-squared value of 0.9967, which means that 99.67 percent of the variance in the "y" data is explained by this model. However, there is no silver bullet, and it is always good to verify other fit statistics and visually inspect the model.
[Figure: the answerer's plot for this example]
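As a minimal sketch of the calculation described above, the following applies the variance-based R-squared formula to the question's data, using the Michaelis-Menten parameters quoted in the answer. The function name michaelis_menten is just an illustrative choice.

```python
import numpy as np

x_data = np.arange(0, 51)
y_data = np.array([0.001, 0.199, 0.394, 0.556, 0.797, 0.891, 1.171, 1.128, 1.437,
                   1.525, 1.720, 1.703, 1.895, 2.003, 2.108, 2.408, 2.424, 2.537,
                   2.647, 2.740, 2.957, 2.58, 3.156, 3.051, 3.043, 3.353, 3.400,
                   3.606, 3.659, 3.671, 3.750, 3.827, 3.902, 3.976, 4.048, 4.018,
                   4.286, 4.353, 4.418, 4.382, 4.444, 4.485, 4.465, 4.600, 4.681,
                   4.737, 4.792, 4.845, 4.909, 4.919, 5.100])


def michaelis_menten(x, a, b):
    """Hyperbolic (Michaelis-Menten) model y = a*x / (b + x)."""
    return a * x / (b + x)


# Parameter values quoted in the answer above
a, b = 1.0232217656373191E+01, 5.2016057362771100E+01

# Absolute error = model prediction minus observed value
abs_err = michaelis_menten(x_data, a, b) - y_data

# Fraction of the variance in y_data explained by the model
r_squared = 1.0 - (abs_err.var() / y_data.var())
print("R-squared:", r_squared)
```

The same three lines at the end work for any other candidate model, so the exponential fit can be scored the same way once its parameters are known.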

You can take the 2-norm of the difference between the data and the fitted curve. NumPy has the function np.linalg.norm for this. The R-squared value is meant for linear regression.
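A rough sketch of the 2-norm idea, on made-up data rather than the question's: the noisy hyperbolic data, the noise level, and the two candidate curves below are all assumptions chosen for illustration. The candidate with the smaller residual norm follows the data more closely.

```python
import numpy as np

# Hypothetical data: a hyperbolic curve plus a little Gaussian noise
rng = np.random.default_rng(0)
x = np.arange(0, 51, dtype=float)
y = 10.0 * x / (50.0 + x) + rng.normal(scale=0.05, size=x.size)

# Candidate 1: the hyperbolic curve itself
pred_hyperbolic = 10.0 * x / (50.0 + x)

# Candidate 2: a least-squares straight line
slope, intercept = np.polyfit(x, y, 1)
pred_line = slope * x + intercept

# 2-norm of the residuals for each candidate (smaller means closer to the data)
norm_hyperbolic = np.linalg.norm(y - pred_hyperbolic)
norm_line = np.linalg.norm(y - pred_line)
print("hyperbolic:", norm_hyperbolic, "line:", norm_line)
```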

Well, you should calculate an error function that measures how good your fit actually is. There are many different error functions you could use, but to start with the mean squared error should work (if you're interested in further metrics, have a look at http://scikit-learn.org/stable/modules/model_evaluation.html).
Once you have determined the coefficients for your regression problem, you can compute the mean squared error with scikit-learn:
import numpy as np
from sklearn.metrics import mean_squared_error
# a, b, c are the coefficients obtained from your fit
f = lambda x: a * np.exp(b * x) + c
mse = mean_squared_error(y_data, f(x_data))
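To compare the two candidate curves from the question on equal terms, one possible approach is to fit both with scipy.optimize.curve_fit and compute the mean squared error by hand. This is only a sketch: the starting guesses in p0 are my own assumptions, and the function names exponential and hyperbolic are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

x_data = np.arange(0, 51)
y_data = np.array([0.001, 0.199, 0.394, 0.556, 0.797, 0.891, 1.171, 1.128, 1.437,
                   1.525, 1.720, 1.703, 1.895, 2.003, 2.108, 2.408, 2.424, 2.537,
                   2.647, 2.740, 2.957, 2.58, 3.156, 3.051, 3.043, 3.353, 3.400,
                   3.606, 3.659, 3.671, 3.750, 3.827, 3.902, 3.976, 4.048, 4.018,
                   4.286, 4.353, 4.418, 4.382, 4.444, 4.485, 4.465, 4.600, 4.681,
                   4.737, 4.792, 4.845, 4.909, 4.919, 5.100])


def exponential(x, a, b, c):
    return a * np.exp(b * x) + c


def hyperbolic(x, a, b):
    return a * x / (b + x)


# Starting guesses are assumptions picked to be roughly in the right range
exp_params, _ = curve_fit(exponential, x_data, y_data, p0=(-5.0, -0.05, 5.0))
hyp_params, _ = curve_fit(hyperbolic, x_data, y_data, p0=(10.0, 50.0))

# Mean squared error computed manually for each fitted model
mse_exp = np.mean((y_data - exponential(x_data, *exp_params)) ** 2)
mse_hyp = np.mean((y_data - hyperbolic(x_data, *hyp_params)) ** 2)
print("MSE exponential:", mse_exp)
print("MSE hyperbolic: ", mse_hyp)
```

Keep in mind that the exponential model here has three parameters and the hyperbolic one has two, so a slightly lower training MSE for the exponential model does not by itself mean it is the better description of the data.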

Related

Getting a negative R-squared value with curve_fit()

I've read a related post on manually calculating R-squared values after using scipy.optimize.curve_fit(). However, they calculate an R-squared value when their function follows the power-law (f(x) = a*x^b). I'm trying to do the same but get negative R-squared values.
Here is my code:
import numpy as np
from scipy.optimize import curve_fit
def powerlaw(x, a, b):
    '''Generic power law function.'''
    return a * x**b
X = s_lt[4:] # independent variable (Pandas series)
Y = s_lm[4:] # dependent variable (Pandas series)
popt, pcov = curve_fit(powerlaw, X, Y)
residuals = Y - powerlaw(X, *popt)
ss_res = np.sum(residuals**2) # residual sum of squares
ss_tot = np.sum((Y-np.mean(Y))**2) # total sum of squares
r_squared = 1 - (ss_res / ss_tot) # r-squared value
print("R-squared of power-law fit = ", str(r_squared))
I got an R-squared value of -0.057....
From my understanding, it's not good to use R-squared values for non-linear functions, but I expected to get a much higher R-squared value than a linear model due to overfitting. Did something else go wrong?
See "The R-squared and nonlinear regression: a difficult marriage?" and also "When is R squared negative?".
Basically, we have two problems:
nonlinear models do not have an intercept term, at least, not in the usual sense;
the equality SStot=SSreg+SSres may not hold.
The first reference above calls your statistic a "pseudo-R-square" (in the case of non-linear models), and notes that it may be lower than 0.
To further understand what's going on you probably want to plot your data Y as a function of X, the predicted values from the power law as a function of X, and the residuals as a function of X.
For non-linear models I have sometimes calculated the sum of squared deviation from zero, to examine how much of that is explained by the model. Something like this:
pred = powerlaw(X, *popt)
ss_total = np.sum(Y**2) # Not deviation from mean.
ss_resid = np.sum((Y - pred)**2)
pseudo_r_squared = 1 - ss_resid/ss_total
Calculated this way, pseudo_r_squared can potentially be negative (if the model is really bad, worse than just guessing the data are all 0), but if pseudo_r_squared is positive I interpret it as the amount of "variation from 0" explained by the model.
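A small worked example may make the difference concrete. The numbers below are made up: the predictions are deliberately worse than simply predicting the mean of y, so the usual R-squared comes out negative, while the "deviation from zero" version stays positive because the data sit far from 0.

```python
import numpy as np


def r_squared(y, pred):
    """Usual R-squared: residuals compared with deviation from the mean."""
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def pseudo_r_squared(y, pred):
    """Variant from the answer above: residuals compared with deviation from zero."""
    ss_res = np.sum((y - pred) ** 2)
    ss_total = np.sum(y ** 2)
    return 1.0 - ss_res / ss_total


# Hypothetical observations and a hypothetical poor set of predictions
y = np.array([10.0, 11.0, 9.0, 10.5, 9.5])
bad_pred = np.array([8.0, 8.5, 9.0, 9.5, 10.0])

r2 = r_squared(y, bad_pred)            # ss_res = 11.5, ss_tot = 2.5
pseudo = pseudo_r_squared(y, bad_pred)  # ss_res = 11.5, ss_total = 502.5
print(f"R-squared:        {r2:.3f}")      # -3.600
print(f"pseudo R-squared: {pseudo:.3f}")  # 0.977
```

The residual sum of squares (11.5) exceeds the total sum of squares around the mean (2.5), which is exactly the situation in which 1 - ss_res/ss_tot drops below zero.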

Linear regression with outliers for Machine Learning

Python (jupyter notebook to be exact), using numpy and sklearn only
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(16)
x = np.arange(100)
yp = 3*x + 3 + 2*(np.random.poisson(3*x+3,100)-(3*x+3))
np.random.seed(12)
# Choose how many outliers
out = np.random.choice(100,15)
yp_wo = np.copy(yp)
np.random.seed(12) #set again
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out]
# With outliers
plt.scatter(x,yp_wo)
# Without outliers
plt.scatter(x,yp)
For the data above (wo means "with outliers"), I need to find:
The best coefficients for two more losses: the MAE and the MAPE (Median Absolute Percentage Error)
Plot the best fit line for the MSE loss, the MAE loss, and the MAPE loss.
Apply Ridge Regression to the same data, and use cross validation to choose the optimal parameter alpha (you can use values of alpha = 10^-5, 10^-4, 10^-3, ... 10^3). Which value gives you the lowest MSE?
What confuses me is having to plot the best line fit for two or more losses.
I can follow the code from class and try to get the values, but I don't know what's meant by coefficients.
Any help / guidance?
This is for a homework I am trying to figure out (no I am not asking for the solutions)
Please excuse any formatting errors, I am very new to Stack Overflow.
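As a rough illustration of what "coefficients" means here (not a solution to the assignment): for a straight line y = slope*x + intercept, the coefficients are the slope and the intercept, and each loss function picks its own best pair. The sketch below uses made-up data and, as my own choice of technique, iteratively reweighted least squares (IRLS) to approximate the MAE-optimal line with numpy only.

```python
import numpy as np

# Hypothetical data: a noisy line with 15 large positive outliers
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 3.0 * x + 3.0 + rng.normal(scale=5.0, size=x.size)
outliers = rng.choice(x.size, size=15, replace=False)
y[outliers] += 200.0

# Design matrix: one column per coefficient (slope, intercept)
A = np.column_stack([x, np.ones_like(x)])

# Coefficients that minimise the MSE: ordinary least squares
coef_mse, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficients that (approximately) minimise the MAE, via IRLS:
# repeatedly solve a weighted least-squares problem with weights 1/|residual|
coef_mae = coef_mse.copy()
for _ in range(100):
    resid = y - A @ coef_mae
    weights = 1.0 / np.maximum(np.abs(resid), 1e-8)  # clamp to avoid division by zero
    sw = np.sqrt(weights)
    coef_mae, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)


def mean_abs_error(coef):
    return np.mean(np.abs(y - A @ coef))


print("MSE line: slope=%.3f intercept=%.3f" % tuple(coef_mse))
print("MAE line: slope=%.3f intercept=%.3f" % tuple(coef_mae))
print("MAE of each line:", mean_abs_error(coef_mse), mean_abs_error(coef_mae))
```

Plotting "the best fit line for a loss" then just means drawing slope*x + intercept with the pair of coefficients found for that loss.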

How to get probability of observation using fitted statsmodel?

I have a fitted Poisson model in statsmodels. For each of my observations I want to calculate the probability of observing a value that is at least that high. In other words I want to calculate:
P(y >= y_i | x_i)
(This should be possible, because the fitted model predicts some value lambda as a function of my independent variable x. This lambda_i value defines a Poisson distribution, from which I should be able to derive a probability.)
My question is really about the implementation in statsmodels, less about the statistics. Although if you believe it is relevant, please do elaborate.
For Poisson we can just use the distribution from scipy.stats to compute results for the given predicted means, e.g.
from scipy import stats
mu = my_results.predict(...)
stats.poisson.sf(counts - 1, mu)  # P(Y >= counts); sf(counts, mu) alone gives P(Y > counts)
Similar usage with pmf is in
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L3922
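One detail worth checking when using the survival function for "at least that high": scipy's sf(k, mu) is P(Y > k), so P(Y >= k) is sf(k - 1, mu). The sketch below verifies this with made-up predicted means and counts standing in for the model output.

```python
import numpy as np
from scipy import stats

# Hypothetical predicted means (lambda_i) and observed counts (y_i)
mu = np.array([2.0, 5.5, 10.0])
counts = np.array([3, 5, 14])

p_greater = stats.poisson.sf(counts, mu)       # P(Y > y_i)
p_at_least = stats.poisson.sf(counts - 1, mu)  # P(Y >= y_i)

# Cross-check: P(Y >= y_i) = P(Y > y_i) + P(Y = y_i)
check = p_greater + stats.poisson.pmf(counts, mu)
print(p_at_least)
print(check)
```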

How is the p value calculated for multiple variables in linear regression?

I am wondering how the p-value is calculated for the individual variables in a multiple linear regression. From several resources I understand that a p-value below 5% indicates that the variable is significant for the model. But how is the p-value calculated for each variable in the multiple linear regression?
I looked at the statsmodels output from the summary() function, but it only shows the values. I didn't find any resource on how the p-value for the individual variables in a multiple linear regression is calculated.
import numpy as np
import statsmodels.api as sm
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
This question has no error but requires an intuition on how p value is calculated for various variables in a multiple linear regression.
Inferential statistics work by comparison to known distributions. In the case of regression, that distribution is typically the t-distribution.
You'll notice that each variable has an estimated coefficient from which an associated t-statistic is calculated. x1, for example, has a t-value of -0.278. To get the p-value, we take that t-value, place it on the t-distribution, and calculate the probability of getting a value at least as extreme as the t-value you calculated. You can gain some intuition for this by noticing that the p-value column is called P>|t|.
An additional wrinkle here is that the exact shape of the t-distribution depends on the degrees of freedom.
So to calculate a p-value, you need 2 pieces of information: the t-statistic and the residual degrees of freedom of your model (97 in your case).
Taking x1 as an example, you can calculate the p-value in Python like this:
import scipy.stats
scipy.stats.t.sf(abs(-0.278), df=97)*2
0.78160405761659357
The same is done for each of the other predictors using their respective t-values.
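To see where the t-values themselves come from, here is a sketch of the whole chain (coefficient, standard error, t-statistic, two-sided p-value) done by hand with numpy and scipy. It mirrors the setup in the question, but the random seed and noise are my own, so the numbers will not match the summary() output quoted above.

```python
import numpy as np
from scipy import stats

# Same design as in the question: constant, x and x**2 (the seed is an assumption)
rng = np.random.default_rng(0)
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((np.ones(nsample), x, x**2))
beta = np.array([1.0, 0.1, 10.0])
y = X @ beta + rng.normal(size=nsample)

# Ordinary least squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual degrees of freedom: observations minus estimated coefficients
resid = y - X @ beta_hat
df_resid = nsample - X.shape[1]

# Standard errors from the estimated coefficient covariance matrix
sigma2 = resid @ resid / df_resid
cov = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov))

# t-statistic per coefficient, then the two-sided tail probability P>|t|
t_values = beta_hat / std_err
p_values = 2.0 * stats.t.sf(np.abs(t_values), df=df_resid)

for name, t, p in zip(("const", "x1", "x2"), t_values, p_values):
    print(f"{name}: t={t:.3f}  P>|t|={p:.3g}")
```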

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit (TST,y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
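For the bootstrap option mentioned above, one possible sketch is a case-resampling (pairs) bootstrap: refit the regression on rows drawn with replacement and take the standard deviation of the refitted coefficients. The data, the number of resamples, and the helper name fit_ols are all assumptions made for illustration, and numpy's least-squares solver stands in for the sklearn model.

```python
import numpy as np

# Hypothetical data: intercept 1.0 and two features with coefficients 2.0 and -1.0
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
y = 1.0 + features @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Design matrix with an explicit column of ones for the intercept
X = np.column_stack([np.ones(n), features])


def fit_ols(X, y):
    """Least-squares coefficients (intercept first)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Refit on resampled rows and collect the coefficients
n_boot = 1000
boot_coefs = np.empty((n_boot, X.shape[1]))
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
    boot_coefs[i] = fit_ols(X[idx], y[idx])

# Bootstrap standard error: spread of each coefficient across resamples
boot_se = boot_coefs.std(axis=0, ddof=1)
print("coefficients:", fit_ols(X, y))
print("bootstrap SE:", boot_se)
```

This avoids the matrix formula entirely, at the cost of refitting the model many times.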
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think this answer is not entirely correct.
In particular, if I am not mistaken, sklearn adds the constant (intercept) term by default when it computes your coefficients.
You therefore need to include a column of ones in your matrix TST. With that change the code is correct and gives you an array with all the standard errors.
This code has been tested with data and gives correct results.
Find the X matrix for each data set; n is the length of the data set and m is the number of variables:
X, n, m=arrays(data)
y=***.reshape((n,1))
linear = linear_model.LinearRegression(n_jobs=-1) ## in current scikit-learn n_jobs is a constructor argument, not a fit() argument; drop it if there is only one variable
linear.fit(X, y)
Sum of squared residuals divided by the residual degrees of freedom:
s=np.sum((linear.predict(X) - y) ** 2)/(n-(m-1)-1)
Standard deviation, i.e. the square root of the diagonal of the variance-covariance matrix (pinv uses a singular value decomposition):
sd_alpha=np.sqrt(s*(np.diag(np.linalg.pinv(np.dot(X.T,X)))))
t-statistic (use linear.intercept_ instead of linear.intercept_[0] for one variable):
t_stat_alpha=linear.intercept_[0]/sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation: prepend a column of ones for the intercept
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape  # m observations, n coefficients (including the intercept)
# computation
MSE = np.sum((y_hat - y)**2)/(m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T,X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use
# assumes y was 2-D with shape (n, 1), so that intercept_ is a 1-D array and coef_ is 2-D
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0,:]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i+1, coef, coef_SE_est[i+1]))
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?
