OLS predict using only a subset of explanatory variables - python

Say I do an OLS regression using statsmodels of variable y on some explanatory variables x1 x2 x3 (contained in a dataframe df):
res = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
Is it possible to get a predicted value using only a subset of the explanatory variables? For example, I would like to get a predicted value for the observations in df using only x1 and x2 but not x3.
I have tried to do
res.predict(df[['x1','x2']])
but I get the error message: NameError: name 'x3' is not defined.
Edit: The reason I want to do this is the following. I'm running a regression of house values on house characteristics and dummies for metropolitan area, suburban status, and year. I would like to use the dummies for metropolitan area, suburban status and year to construct a price index for each location and time period.
Edit 2: This is how I ended up doing it, in case it can be helpful to anyone or someone can point me to a better way of doing it.
I'm interested in doing an OLS on the following specification:
model = 'price ~ C(MetroArea) + C(City) + C(Year) + x1 + ... + xK'
where 'x1 + ... + xK' is pseudo-code for a number of variables I'm using as controls but am not interested in, and the categorical variables have many levels (e.g. 90 metropolitan areas).
Next I fit the model with statsmodels and construct the design matrix that I'll use to predict prices using the variables of interest.
import numpy as np
import patsy
import statsmodels.formula.api as smf

res = smf.ols(model, data=mydata).fit()
data_prediction = mydata[['MetroArea', 'City', 'Year']]
model_predict = 'C(MetroArea) + C(City) + C(Year)'
X = patsy.dmatrix(model_predict, data=data_prediction, return_type='dataframe')
The tricky part is selecting the right parameters for the variables of interest, since there are many and their names no longer exactly match the original column names once patsy's categorical operator C() has been applied (e.g. the MetroArea terms look like C(MetroArea)[0], C(MetroArea)[8], ...).
vars_interest = ['Intercept', 'MetroArea', 'City', 'Year']
params_interest = res.params[[any(word in var for word in vars_interest)
                              for var in res.params.index]]
Get prediction by doing the dot product of the selected parameters and variables of interest:
prediction = np.dot(X,params_interest)

In case anyone stumbles on this old question, there seems to be a cleaner solution using the information contained in the design matrix.
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
mydata = None  # placeholder: replace with your actual DataFrame
vars_of_interest = ['C(MetroArea)', 'C(City)', 'C(Year)']
formula = 'price ~ ' + ' + '.join(vars_of_interest) + ' + x1 + ... + xK'  # 'x1 + ... + xK' is again pseudo-code for the controls
Y, X = dmatrices(formula, mydata)
# Get the slice of design-matrix columns belonging to each term from patsy
slices = X.design_info.term_name_slices
model = sm.OLS(Y, X)
res = model.fit()
prediction = np.zeros(X.shape[0])
for var in vars_of_interest:
    prediction += X[:, slices[var]].dot(res.params[slices[var]])
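As a quick sanity check (a minimal sketch, not part of the original answer): because the term slices partition the columns of the design matrix, summing the contribution of every term should reproduce the full fitted values.
full_fit = np.zeros(X.shape[0])
for name, sl in slices.items():
    full_fit += X[:, sl].dot(res.params[sl])
# full_fit should now match res.fittedvalues up to floating-point error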

What are you trying to do conceptually? When you predict using your regression you're just plugging values into an equation. So predicting "without x3" is the same as just plugging in x3=0.
In terms of implementing this, statsmodels is pretty strict about predicting with exactly the same variable names you used during the fit. So this is not elegant, but it works:
df2 = df.copy()
df2['x3'] = 0
res.predict(df2[['x1','x2','x3']])
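The same workaround generalizes to any subset of variables; a minimal sketch, assuming the fitted res and the df from the question (cols_to_exclude is just an illustrative name):
cols_to_exclude = ['x3']  # whichever fitted variables you want to leave out
df2 = df.copy()
df2[cols_to_exclude] = 0  # zero out the excluded variables
partial_prediction = res.predict(df2)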

Related

Is there a way to run GLM.from_formula without the intercept (PyMC3)?

This may be a dumb question but I've searched through pyMC3 docs and forums and can't seem to find the answer. I'm trying to create a linear regression model from a dataset that I know a priori should not have an intercept. Currently my implementation looks like this:
formula = 'Y ~ ' + ' + '.join(['X1', 'X2'])
# Define data to be used in the model
X = df[['X1', 'X2']]
Y = df['Y']
# Context for the model
with pm.Model() as model:
    # set distribution for priors
    priors = {'X1': pm.Wald.dist(mu=0.01),
              'X2': pm.Wald.dist(mu=0.01)}
    family = pm.glm.families.Normal()
    # Creating the model requires a formula and data
    pm.GLM.from_formula(formula, data=X, family=family, priors=priors)
    # Perform Markov Chain Monte Carlo sampling
    trace = pm.sample(draws=4000, cores=2, tune=1000)
As I said, I know I shouldn't have an intercept but I can't seem to find a way to tell GLM.from_formula() to not look for one. Do you all have a solution? Thanks in advance!
I'm actually puzzled that it does run with an intercept since the default in the code for GLM.from_formula is to pass intercept=False to the constructor. Maybe it's because the patsy parser defaults to adding an intercept?
Either way, one can explicitly include or exclude an intercept via the patsy formula, namely with 1 or 0, respectively. That is, you want:
formula = 'Y ~ 0 + ' + ' + '.join(['X1', 'X2'])
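To see the patsy behaviour in isolation (a small sketch, independent of PyMC3, using a made-up two-column frame):
import pandas as pd
from patsy import dmatrix

demo = pd.DataFrame({'X1': [1.0, 2.0, 3.0], 'X2': [0.5, 1.0, 1.5]})
print(dmatrix('X1 + X2', data=demo))      # patsy adds an Intercept column by default
print(dmatrix('0 + X1 + X2', data=demo))  # the leading 0 suppresses the intercept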

How to find a model for dataset

I have a text file that contains dates and numerical values like
1.1.2020, 45.67
2.1.2020, 49.65
4.1.2020, 47.58
31.1.2020, 55.88
...
Note that the values for some dates are missing.
I would like to fit a model of the form a*e^(bx) to estimate what the value would be on 1.1.2021. How can I do that? Is there a Sagemath function for this, or a Python library that can find such a model?
For your specified formula, this can be solved by fitting a log-linear model:
Y = a * exp(b*x)
log(Y) = log(a) + b*x
1. Transform your dates into a numeric type (e.g. as.numeric() in R).
2. Fit an ordinary linear model to log(Y).
3. Back-transform the results onto the original scale:
Y = exp(intercept + slope*date)
In R, using some made up data
dates=sort(sample(1:100,20))
values=exp(seq(0,5,length.out=20))+rnorm(20)
mod=lm(log(values)~dates)
new=1:100
plot(values~dates)
points(exp(coef(mod)[1]+coef(mod)[2]*new)~new,col="red")
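Since the question asks about Python, here is a minimal sketch of the same approach with numpy, assuming the dates have already been converted to day numbers (the values below are the ones from the question):
import numpy as np

x = np.array([0, 1, 3, 30], dtype=float)    # days since 1.1.2020
y = np.array([45.67, 49.65, 47.58, 55.88])  # observed values
b, log_a = np.polyfit(x, np.log(y), 1)      # fit log(y) = log(a) + b*x
a = np.exp(log_a)
estimate_2021 = a * np.exp(b * 366)         # 1.1.2021 is 366 days after 1.1.2020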

How to extract the regression coefficient from statsmodels.api?

result = sm.OLS(gold_lookback, silver_lookback ).fit()
After I get the result, how can I get the coefficient and the constant?
In other words, if
y = ax + c
how to get the values a and c?
You can use the params property of a fitted model to get the coefficients.
For example, the following code:
import statsmodels.api as sm
import numpy as np
np.random.seed(1)
X = sm.add_constant(np.arange(100))
y = np.dot(X, [1,2]) + np.random.normal(size=100)
result = sm.OLS(y, X).fit()
print(result.params)
will print a numpy array, [0.89516052 2.00334187], containing the estimates of the intercept and the slope respectively.
If you want more information, you can call result.summary(), which contains three detailed tables describing the model.
Cribbing from this answer, Converting statsmodels summary object to Pandas Dataframe: result.summary() is a set of tables which you can export as HTML and then convert to a DataFrame with pandas, which lets you index the values you want directly.
So, for your case (putting the answer from the above link into one line):
df = pd.read_html(result.summary().tables[1].as_html(),header=0,index_col=0)[0]
And then
a=df['coef'].values[1]
c=df['coef'].values[0]
Adding some detail to @IdiotTom's answer.
You may use:
head = pd.read_html(res.summary2().as_html())[0]
body = pd.read_html(res.summary2().as_html())[1]
Not as nice, but the info is there.
If the input to the API is pandas objects (i.e. a pd.DataFrame for the data, or pd.Series for x and for y), then when you access .params it will be a pd.Series, so each coefficient is easily accessible by its name.
For example:
import pandas as pd
import statsmodels.api as sm
# sm.__version__ is '0.13.1'
df = pd.DataFrame({'x': [0, 1, 2, 3, 4],
                   'y': [0.1, 0.2, 0.5, 0.5, 0.8]})
sm.OLS.from_formula(formula='y~x-1', data=df).fit().params
Outputs the following pd.Series:
x 0.196667
dtype: float64
Allowing for an intercept term (by changing the formula from y~x-1 to y~x) changes the output to include the intercept under the name Intercept:
Intercept 0.08
x 0.17
dtype: float64
The coefficients are stored in result.params, which is a pandas Series, so you can access them like a dictionary. The constant term is stored under Intercept, as others have pointed out, and the variable terms are stored under their variable names. So, if your model is y ~ x, the regression coefficients are available as result.params['Intercept'] (that's b) and result.params['x'] (that's a) for the equation y = a*x + b.
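For example, a small sketch reusing the toy data from the previous answer:
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'y': [0.1, 0.2, 0.5, 0.5, 0.8]})
result = sm.OLS.from_formula('y ~ x', data=df).fit()
a = result.params['x']          # slope
b = result.params['Intercept']  # constant term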

How to get the P Value in a Variable from OLSResults in Python?

The OLSResults of
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
print(fit.summary())
shows the P values of each attribute to only 3 decimal places.
I need to extract the p value for each attribute like Distance, CarrierNum etc. and print it in scientific notation.
I can extract the coefficients using fit.params[0] or fit.params[1] etc.
Need to get it for all their P values.
Also what does all P values being 0 mean?
You need to do fit.pvalues[i] to get the answer where i is the index of independent variables. i.e. fit.pvalues[0] for intercept, fit.pvalues[1] for Distance, etc.
You can also look for all the attributes of an object using dir(<object>).
Instead of using fit.summary() you could use fit.pvalues[attributeIndex] in a for loop to print the p-values of all your features/attributes as follows:
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
for attributeIndex in range(len(fit.pvalues)):
    print(fit.pvalues[attributeIndex])
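To get the scientific notation the question asks for, you can format each value explicitly; a small sketch (since X is a DataFrame, fit.pvalues is a pandas Series indexed by attribute name):
for name, p in fit.pvalues.items():
    print(f'{name}: {p:.3e}')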
==========================================================================
Also what does all P values being 0 mean?
It might be a good outcome. The p-value for each term tests the null hypothesis that the corresponding coefficient (b1, b2, ..., bn) is equal to zero, i.e. that the term has no effect in the fitted equation y = b0 + b1*x1 + b2*x2 + ... A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable (y).
On the other hand, a larger (insignificant) p-value suggests that changes in the predictor are not correlated to changes in the response.
I have used this solution
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
model = sm.OLS(Y, X).fit()
# The following snippet builds a DataFrame of feature names and their p-values,
# sorted in ascending order, so the most relevant features appear at the top.
d = {}
for i in X.columns.tolist():
    d[i] = model.pvalues[i]
df_pvalue = pd.DataFrame(d.items(), columns=['Var_name', 'p-Value']).sort_values(by='p-Value').reset_index(drop=True)

statsmodels linear regression - patsy formula to include all predictors in model

Say I have a dataframe (let's call it DF) where y is the dependent variable and x1, x2, x3 are my independent variables. In R I can fit a linear model using the following code, and the . will include all of my independent variables in the model:
# R code for fitting linear model
result = lm(y ~ ., data=DF)
I can't figure out how to do this with statsmodels using patsy formulas without explicitly adding all of my independent variables to the formula. Does patsy have an equivalent to R's .? I haven't had any luck finding it in the patsy documentation.
I haven't found an equivalent of . in the patsy documentation either. But what the formula interface lacks in conciseness, you can make up for with Python's string manipulation. You can build a formula involving all the remaining columns of DF with
all_columns = "+".join(DF.columns.drop("y"))
This gives x1+x2+x3 in your case. Finally, you can create a string formula using y and pass it to any fitting procedure
my_formula = "y~" + all_columns
result = smf.ols(formula=my_formula, data=DF).fit()
No, this doesn't exist in patsy yet, unfortunately. See this issue.
As this is still not included in patsy, I wrote a small function that I call when I need to run statsmodels models with all columns (optionally with exceptions)
def ols_formula(df, dependent_var, *excluded_cols):
    '''
    Generates the R-style formula for statsmodels (patsy) given
    the dataframe, the dependent variable and optional excluded
    columns as strings
    '''
    df_columns = list(df.columns.values)
    df_columns.remove(dependent_var)
    for col in excluded_cols:
        df_columns.remove(col)
    return dependent_var + ' ~ ' + ' + '.join(df_columns)
For example, for a dataframe called df with columns y, x1, x2, x3, running ols_formula(df, 'y', 'x3') returns 'y ~ x1 + x2'
