I am trying to perform autoregressive multiple linear regressions using statsmodels (something like y ~ y_1 + X1 + X2, not ARMA-like).
More specifically, I'm looking for a way to get out-of-sample results.
When I use the predict method, I get in-sample results, which means it uses the previous historical value of the dependent variable instead of the value estimated in the previous step.
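For concreteness, this is roughly the manual recursive loop I'm hoping to avoid writing myself (all data and names below are made up, just to illustrate the idea):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, horizon = 200, 10                              # length of history, steps to forecast
exog = rng.normal(size=(n + horizon, 2))          # X1, X2 over history plus forecast period
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + exog[t] @ np.array([1.0, -0.5]) + rng.normal()

# fit y_t ~ const + y_{t-1} + X1_t + X2_t on the history
X_fit = sm.add_constant(np.column_stack([y[:-1], exog[1:n]]))
res = sm.OLS(y[1:], X_fit).fit()

# out of sample: feed each estimate back in as the lag for the next step
last_y, forecasts = y[-1], []
for t in range(n, n + horizon):
    row = np.concatenate(([1.0, last_y], exog[t]))
    last_y = row @ res.params
    forecasts.append(last_y)
print(forecasts)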
Thanks for your help.
Related
If I have fit a logistic regression to a simple data set, with one explanatory x variable and one dependent y variable which is binary (0 or 1), I can produce a graph like this:
https://en.wikipedia.org/wiki/File:Exam_pass_logistic_curve.svg
In sklearn, after I have called model.fit on the data, how would I determine the x value for a given threshold probability? So, for example, at a probability of 0.5, the x variable 'hours studied' should be about 2.75. I have tried the attributes coef_ and intercept_, which don't give me what I want. Is there a way to do this in sklearn or another similar Python package?
I know you can potentially calculate the values manually with the formula for logistic regression substituting beta 0 and beta 1, but I'm looking for a faster/built-in way. Thanks
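For reference, the manual calculation I mean looks roughly like this (a sketch on toy data, with made-up names; this is the route I'd rather not take):

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-in for the exam data: hours studied vs pass/fail
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(hours, passed)

def hours_for_probability(model, p):
    # invert p = 1 / (1 + exp(-(b0 + b1 * x)))  =>  x = (log(p / (1 - p)) - b0) / b1
    b0 = model.intercept_[0]
    b1 = model.coef_[0, 0]
    return (np.log(p / (1 - p)) - b0) / b1

print(hours_for_probability(model, 0.5))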
I want to build a linear regression using OLS from statsmodels. I have a question about manually declaring the coefficients for some explanatory variables.
Is it possible to parameterize the model so that I set the coefficients for 4 of the 10 variables manually, and the fit method estimates the values for the rest?
Or maybe you know a different way to do it?
Many thanks for all answers!
MP
Yes, in two steps. First you take those manual coefficients, multiply them with the corresponding ('manual') variables to get vectors, and subtract them from the target.
Then, you can take a normal OLS and get the coefficients of the remaining variables.
Say you have two variables, x1 and x2, and want to fix the weight w1 for x1 yourself. You can simply infer
w2 = (x2^T x2)^(-1) x2^T (y - w1 * x1)
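A rough sketch of those two steps with statsmodels, on synthetic data (X_manual, X_free and the fixed weights are made-up names for this example):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X_manual = rng.normal(size=(200, 2))      # the variables whose coefficients you fix by hand
X_free = rng.normal(size=(200, 3))        # the variables whose coefficients OLS should estimate
w_manual = np.array([0.5, -1.0])          # your manually chosen coefficients
y = X_manual @ w_manual + X_free @ np.array([2.0, 0.3, -0.7]) + rng.normal(size=200)

# step 1: subtract the fixed contribution from the target
y_adj = y - X_manual @ w_manual
# step 2: ordinary OLS on the remaining variables
res = sm.OLS(y_adj, X_free).fit()
print(res.params)                         # estimates for the free variables only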
I am wondering how the p value is calculated for the various variables in a multiple linear regression. From reading several resources I understand that a p value below 5% indicates the variable is significant for the model. But how is the p value actually calculated for each variable in a multiple linear regression?
I looked at the statsmodels summary using the summary() function. I can see the values, but I couldn't find any resource explaining how the p value for each variable is calculated.
import numpy as np
import statsmodels.api as sm
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
This question has no error but requires an intuition on how p value is calculated for various variables in a multiple linear regression.
Inferential statistics work by comparison to known distributions. In the case of regression, that distribution is typically the t-distribution.
You'll notice that each variable has an estimated coefficient from which an associated t-statistic is calculated. x1, for example, has a t-value of -0.278. To get the p-value, we take that t-value, place it on the t-distribution, and calculate the probability of getting a value as extreme as the one you calculated. You can gain some intuition for this by noticing that the p-value column is called P>|t|.
An additional wrinkle here is that the exact shape of the t-distribution depends on the degrees of freedom.
So to calculate a p-value, you need two pieces of information: the t-statistic and the residual degrees of freedom of your model (97 in your case).
Taking x1 as an example, you can calculate the p-value in Python like this:
import scipy.stats
scipy.stats.t.sf(abs(-0.278), df=97)*2
0.78160405761659357
The same is done for each of the other predictors, using their respective t-values.
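If the results object from your code above is still around, you can sanity-check this by redoing the calculation for all coefficients at once:

import numpy as np
import scipy.stats

# two-sided p-values from |t| and the residual degrees of freedom, for every coefficient
manual_p = scipy.stats.t.sf(np.abs(results.tvalues), df=results.df_resid) * 2
print(manual_p)        # matches results.pvalues reported in the summary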
I have a dataset with about 100+ features. I also have a small set of covariates.
I build an OLS linear model using statsmodels of the form y = x + C1 + C2 + C3 + C4 + ... + Cn, where x is a feature, the Ci are the covariates, and y is the dependent variable.
I'm trying to perform hypothesis testing on the regression coefficients to test if the coefficients are equal to 0. I figured a t-test would be the appropriate approach to this, but I'm not quite sure how to go about implementing this in Python, using statsmodels.
I know, particularly, that I'd want to use http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.t_test.html#statsmodels.regression.linear_model.RegressionResults.t_test
But I am not certain I understand the r_matrix parameter. What could I provide to this? I did look at the examples but it is unclear to me.
Furthermore, I am not interested in doing the t-tests on the covariates themselves, but just on the regression coefficient of x.
Any help appreciated!
Are you sure you don't want statsmodels.regression.linear_model.OLS? This will perform an OLS regression, making available the parameter estimates and the corresponding p-values (and many other things).
from statsmodels.regression import linear_model
from statsmodels.api import add_constant
Y = [1,2,3,5,6,7,9]
X = add_constant(range(len(Y)))
model = linear_model.OLS(Y, X)
results = model.fit()
print(results.params) # [ 0.75 1.32142857]
print(results.pvalues) # [ 2.00489220e-02 4.16826428e-06]
These p-values are from the t-tests of each fit parameter being equal to 0.
It seems like RegressionResults.t_test would be useful for less conventional hypotheses.
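For your specific case of testing only the coefficient of x, here is a sketch of how an r_matrix row singles out one parameter, reusing the results object from the fit above: the row has a 1 in the position of x's coefficient and zeros for the constant and every covariate (here that is simply the second of two parameters).

import numpy as np

# reusing `results` from the fit above: test the slope (second parameter) against 0
r_matrix = np.array([[0, 1]])
print(results.t_test(r_matrix))
# equivalent string form with the default exog names: results.t_test("x1 = 0")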
I have one regression function, g1(x) = 5x - 1 for one data point.
I have another regression function, g2(x) = 3x + 4.
I want to add these two models to create a final regression model, G(x).
That means:
G(x) = g1(x) + g2(x)
=> (5x - 1) + (3x + 4)
=> 8x + 3
My question is, how can this be done in Python? If my dataset is X, I'm using statsmodels like this:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import numpy as np
mod_wls = sm.WLS(y, X)
res_wls = mod_wls.fit()
print(res_wls.params)
And that gives me the coefficients for the regression function that fits the data X.
To add the functions, I can easily grab the coefficients for each, and sum them up to get the coefficients for a new regression function such as G(x). But now that I've got my own coefficients, how can I convert them into a regression function and use them to predict new data? Because as far as I know, models have to be "fitted" to data before they can be used for prediction.
Or is there any way to directly add regression functions? I'm going to be adding functions iteratively in my algorithm.
The prediction generated by this model should be exactly
np.dot(X_test, res_wls.params)
Thus, if you want to sum several models, e.g.
summed_params = np.array([res_wls.params for res_wls in all_my_res_wls]).sum(axis=0)
your prediction should be
np.dot(X_test, summed_params)
In this case there would be no need to use the built-in functions of the estimator.
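A minimal end-to-end sketch of that idea on synthetic data (res1 and res2 stand in for your two fitted models, and the design matrices are made up for the example):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(50, 1)))
X_test = sm.add_constant(rng.normal(size=(10, 1)))

# two fits standing in for g1 and g2 (different targets, same design matrix)
res1 = sm.WLS(X @ np.array([-1.0, 5.0]) + rng.normal(size=50), X).fit()
res2 = sm.WLS(X @ np.array([4.0, 3.0]) + rng.normal(size=50), X).fit()

summed_params = res1.params + res2.params           # coefficients of G(x) = g1(x) + g2(x)
pred_G = np.dot(X_test, summed_params)
# identical to adding the individual predictions
assert np.allclose(pred_G, res1.predict(X_test) + res2.predict(X_test))
print(pred_G)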