I have a dataset with 100+ features, plus a small set of covariates.
For each feature x, I build an OLS linear model using statsmodels of the form y = x + C1 + C2 + C3 + C4 + ... + Cn, where the Ci are the covariates and y is the dependent variable.
I'm trying to perform hypothesis tests on the regression coefficients, to test whether each coefficient is equal to 0. I figured a t-test would be the appropriate approach, but I'm not quite sure how to implement it in Python using statsmodels.
I know, particularly, that I'd want to use http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.t_test.html#statsmodels.regression.linear_model.RegressionResults.t_test
But I am not certain I understand the r_matrix parameter. What should I provide to it? I did look at the examples, but they are unclear to me.
Furthermore, I am not interested in t-tests on the covariates themselves, only on the regression coefficient of x.
Any help appreciated!
Are you sure you don't want statsmodels.regression.linear_model.OLS? This will perform an OLS regression, making available the parameter estimates and the corresponding p-values (and many other things).
from statsmodels.regression import linear_model
from statsmodels.api import add_constant

Y = [1, 2, 3, 5, 6, 7, 9]
X = add_constant(range(len(Y)))  # prepend a column of ones for the intercept

model = linear_model.OLS(Y, X)
results = model.fit()

print(results.params)   # [ 0.75        1.32142857]
print(results.pvalues)  # [ 2.00489220e-02  4.16826428e-06]
These p-values are from the t-tests of each fit parameter being equal to 0.
It seems like RegressionResults.t_test would be useful for less conventional hypotheses.
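For completeness, if you do want to use RegressionResults.t_test, the r_matrix argument specifies which linear combination(s) of the coefficients you want to test against zero. A minimal sketch, assuming the model above, where the constant is the first parameter and the coefficient of x is the second:

import numpy as np
from statsmodels.regression import linear_model
from statsmodels.api import add_constant

Y = [1, 2, 3, 5, 6, 7, 9]
X = add_constant(np.arange(len(Y)))
results = linear_model.OLS(Y, X).fit()

# r_matrix = [0, 1] encodes the hypothesis 0*const + 1*slope = 0,
# i.e. H0: the coefficient of x equals 0 (any covariates would get zeros too)
print(results.t_test([0, 1]))

# equivalent string form, using statsmodels' default name for the regressor
print(results.t_test("x1 = 0"))

In your setting, the r_matrix row would have a 1 in the position of x's coefficient and 0 everywhere else, so only x is tested and the covariates are left alone.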
Related
Good day! I'd appreciate it if someone could guide me through the doubts below.
I'm working on a predictive modelling where I have two independent variables/predictors and one dependent variable.
Most resources on multiple regression only refer to linear regression, yet the relationship between my predictors and the outcome is non-linear.
Is it possible to have:
y = z + a*x1 + b*x1^2 + c*x2 + d*x2^2
?
Let's say
X = [[2.64, 0.96],
     [3.75, 0.88],
     [3.74, 0.75],
     [6.51, 1.27]]
Y = [[0.77],
     [1.12],
     [1.12],
     [1.23]]
I know that for multiple linear regression the prediction is regr.predict([[new_x1, new_x2]]). What about multiple polynomial regression?
You can use PolynomialFeatures from sklearn.preprocessing in order to generate the higher order terms. Then you can fit your model on the transformed data.
from sklearn.preprocessing import PolynomialFeatures

X = PolynomialFeatures(degree=2).fit_transform(X)
... # use the new X to fit the model
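Putting it together with the toy data above, a minimal sketch could look like this (new_x1 and new_x2 are just placeholder values for a new observation):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[2.64, 0.96],
              [3.75, 0.88],
              [3.74, 0.75],
              [6.51, 1.27]])
Y = np.array([0.77, 1.12, 1.12, 1.23])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # columns: 1, x1, x2, x1^2, x1*x2, x2^2

regr = LinearRegression().fit(X_poly, Y)

# new observations must be transformed the same way before predicting
new_x1, new_x2 = 4.0, 1.0
print(regr.predict(poly.transform([[new_x1, new_x2]])))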
When I perform a logistic regression using the two APIs, they give different coefficients.
Even with this simple example they don't produce the same coefficients, and I followed older advice on the same topic, such as setting a large value for the parameter C in sklearn so that the penalization almost vanishes (or setting penalty="none").
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
n = 200
x = np.random.randint(0, 2, size=n)
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)
display(pd.crosstab( y, x ))
max_iter = 100
#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)
#### Scikit-Learn
res_sk = LogisticRegression( solver='newton-cg', multi_class='multinomial', max_iter=max_iter, fit_intercept=True, C=1e8 )
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.coef_)
For example, I just ran the above code and got 1.72276655 for statsmodels and 1.86324749 for sklearn. And when run multiple times it always gives different coefficients (sometimes closer than others, but different anyway).
Thus, even with this toy example the two APIs give different coefficients (and hence different odds ratios), and with real data (not shown here) it almost gets "out of control"...
Am I missing something? How can I produce similar coefficients, agreeing to at least one or two decimal places?
There are some issues with your code.
To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:
An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and my own answer there as well).
The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:
There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.
which makes the two models again non-comparable in principle, but you have successfully addressed it here by setting C=1e8. In fact, scikit-learn has since then (2016) indeed added a way to switch regularization off, by setting penalty='none'; according to the docs:
If 'none' (not supported by the liblinear solver), no regularization is applied.
which should now be considered the canonical way to switch off the regularization.
So, incorporating these changes in your code, we have:
np.random.seed(42) # for reproducibility
#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)
Which gives the result:
Optimization terminated successfully.
Current function value: 0.403297
Iterations: 5
Function evaluations: 6
Gradient evaluations: 10
Hessian evaluations: 5
[-1.65822763 3.65065752]
with the first element of the array being the intercept and the second the coefficient of x. While for scikit learn we have:
#### Scikit-Learn
res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)
with the result being:
[-1.65822806] [[3.65065707]]
These results are practically identical, within the machine's numeric precision.
Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
I am wondering how the p-value is calculated for the various variables in a multiple linear regression. I am sure, from reading several resources, that a p-value below 5% indicates the variable is significant for the model. But how is the p-value calculated for each variable in a multiple linear regression?
I tried looking at the statsmodels summary using the summary() function. I can see the values, but I didn't find any resource on how the p-value for each variable in a multiple linear regression is calculated.
import numpy as np
import statsmodels.api as sm
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
This code has no error; the question is about the intuition for how the p-value is calculated for each variable in a multiple linear regression.
Inferential statistics work by comparison to known distributions. In the case of regression, that distribution is typically the t-distribution.
You'll notice that each variable has an estimated coefficient from which an associated t-statistic is calculated. x1, for example, has a t-value of -0.278. To get the p-value, we take that t-value, place it on the t-distribution, and calculate the probability of getting a value at least as extreme as the one observed. You can gain some intuition for this by noticing that the p-value column is called P>|t|.
An additional wrinkle here is that the exact shape of the t-distribution depends on the degrees of freedom.
So to calculate a p-value, you need two pieces of information: the t-statistic and the residual degrees of freedom of your model (97 in your case).
Taking x1 as an example, you can calculate the p-value in Python like this:
import scipy.stats
scipy.stats.t.sf(abs(-0.278), df=97)*2
0.78160405761659357
The same is done for each of the other predictors, using their respective t-values.
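To check this against the whole summary table at once, you can compute the two-sided p-values for all coefficients directly from the fitted results object (a quick sanity check using the tvalues and df_resid attributes of the results variable from your code):

import numpy as np
import scipy.stats

# two-sided p-values from each t-statistic and the residual degrees of freedom
manual_pvalues = scipy.stats.t.sf(np.abs(results.tvalues), df=results.df_resid) * 2
print(manual_pvalues)
print(results.pvalues)  # matches the P>|t| column of results.summary()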
In the linear model y = b0 + b1*x_i + b2*x_j + b3*x_k + e,
what values of i, j, k in [1, 100] result in the model with the highest R-squared?
The data set consists of 100 independent variables and one dependent variable. Each variable has 50 observations.
My only guess is to loop through all possible combinations of three variables and compare R-squared for each combination. The way I have done it with Python is:
import itertools as itr
import pandas as pd
import time as t
from sklearn import linear_model as lm

start = t.time()

# linear regression model
LR = lm.LinearRegression()

# import data
data = pd.read_csv('csv_file')

# all possible combinations of three variables
combs = [comb for comb in itr.combinations(range(1, 101), 3)]

target = data.iloc[:, 0]
hi_R2 = 0

for comb in combs:
    variables = data.iloc[:, list(comb)]
    R2 = LR.fit(variables, target).score(variables, target)
    if R2 > hi_R2:
        hi_R2 = R2
        indices = comb

end = t.time()
time = float((end - start) / 60)
print('Variables: {}\nR2 = {:.2f}\nTime: {:.1f} mins'.format(indices, hi_R2, time))
It took 4.3 minutes to complete. I believe this method is not efficient for a data set with thousands of observations per variable. What method would you suggest instead?
Thank you.
Exhaustive search is going to be the slowest way of doing this.
The fastest way is the one mentioned in the comments: pre-specify your model based on theory/intuition/logic and come up with a set of variables that you hypothesize will be good predictors of your outcome.
The difference between the two extremes is that exhaustive search may leave you with a model that doesn't make sense, as it will use whatever variables it has access to, even if they are completely unrelated to your question of interest.
If, however, you don't want to specify a model and still want to use an automated technique to build the "best" model, a middle ground might be something like stepwise regression.
There are a few different ways of doing this (e.g. forward selection / backward elimination), but in the case of forward selection, for example, you start by adding one variable at a time and testing its coefficient for significance. If the variable improves model fit (judged either by the individual regression coefficient or by the R2 of the model), you keep it and add another. If it doesn't aid prediction, you throw it away. Repeat this process until you've found your best predictors; a rough sketch follows below.
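As an illustration, a greedy forward selection on R2, stopping at three variables, might look like the sketch below (assuming data and target are the DataFrame and Series from your question). It fits on the order of 300 models instead of the 161,700 combinations in the exhaustive search, at the cost of possibly missing the globally best triple:

from sklearn.linear_model import LinearRegression

LR = LinearRegression()
candidates = list(range(1, 101))  # column indices of the predictors
selected = []                     # indices chosen so far
best_R2 = 0

while len(selected) < 3:
    best_col = None
    for col in candidates:
        cols = selected + [col]
        variables = data.iloc[:, cols]
        R2 = LR.fit(variables, target).score(variables, target)
        if R2 > best_R2:
            best_R2, best_col = R2, col
    if best_col is None:  # no candidate improved the fit
        break
    selected.append(best_col)
    candidates.remove(best_col)

print('Variables: {}\nR2 = {:.2f}'.format(selected, best_R2))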
I have one regression function, g1(x) = 5x - 1 for one data point.
I have another regression function, g2(x) = 3x + 4.
I want to add these two models to create a final regression model, G(x).
That means:
G(x) = g1(x) + g2(x)
=> 5x - 1 + 3x + 4
=> 8x + 3
My question is, how can this be done in python? If my dataset is X, I'm using statsmodels like this:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import numpy as np
mod_wls = sm.WLS(y, X)
res_wls = mod_wls.fit()
print(res_wls.params)
And that gives me the coefficients for the regression function that fits the data X.
To add the functions, I can easily grab the coefficients of each and sum them to get the coefficients of a new regression function such as G(x). But now that I've got my own coefficients, how can I convert them into a regression function and use them to predict new data? Because as far as I know, models have to be "fitted" to data before they can be used for prediction.
Or is there any way to directly add regression functions? I'm going to be adding functions iteratively in my algorithm.
The prediction generated by this model should be exactly
np.dot(X_test, res_wls.params)
Thus, if you want to sum several models, e.g.
summed_params = np.array([res_wls.params for res_wls in all_my_res_wls]).sum(axis=0)
your prediction should be
np.dot(X_test, summed_params)
In this case there would be no need to use the built-in functions of the estimator.
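For example, a minimal sketch of the whole idea (assuming y and X are already defined as in your snippet, that X_test has the same columns as X, and that splitting the data into two halves is just a stand-in for however your algorithm produces the individual models):

import numpy as np
import statsmodels.api as sm

# fit two WLS models on two halves of the data, purely for illustration
n_half = len(y) // 2
res_a = sm.WLS(y[:n_half], X[:n_half]).fit()
res_b = sm.WLS(y[n_half:], X[n_half:]).fit()

# G(x) = g1(x) + g2(x): sum the coefficient vectors
summed_params = res_a.params + res_b.params

# predict directly with the summed coefficients; no further fitting is needed
predictions = np.dot(X_test, summed_params)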