result = sm.OLS(gold_lookback, silver_lookback).fit()
After I get the result, how can I get the coefficient and the constant?
In other words, if
y = ax + c
how to get the values a and c?
You can use the params property of a fitted model to get the coefficients.
For example, the following code:
import statsmodels.api as sm
import numpy as np
np.random.seed(1)
X = sm.add_constant(np.arange(100))
y = np.dot(X, [1,2]) + np.random.normal(size=100)
result = sm.OLS(y, X).fit()
print(result.params)
will print a numpy array [ 0.89516052  2.00334187], the estimates of the intercept and the slope respectively.
If you want more information, you can use result.summary(), which contains three detailed tables describing the model.
Cribbing from this answer, Converting statsmodels summary object to Pandas Dataframe: result.summary() is a set of tables that you can export as HTML and then parse with pandas into a DataFrame, which lets you index the values you want directly.
So, for your case (putting the answer from the above link into one line):
import pandas as pd
df = pd.read_html(result.summary().tables[1].as_html(), header=0, index_col=0)[0]
And then
a=df['coef'].values[1]
c=df['coef'].values[0]
Adding some details to @IdiotTom's answer.
You may use:
tables = pd.read_html(res.summary2().as_html())
head = tables[0]
body = tables[1]
Not as nice, but the info is there.
If the input to the API is pandas objects (i.e. a pd.DataFrame for the data, or pd.Series for x and for y), then when you access .params it will be a pd.Series, so each coefficient is easily accessible by its name.
For example:
import pandas as pd
import statsmodels.api as sm
# sm.__version__ is '0.13.1'
df = pd.DataFrame({'x': [0, 1, 2, 3, 4],
                   'y': [0.1, 0.2, 0.5, 0.5, 0.8]})
sm.OLS.from_formula(formula='y~x-1', data=df).fit().params
Outputs the following pd.Series:
x 0.196667
dtype: float64
Allowing for an intercept term (by changing the formula from y~x-1 to y~x) changes the output to include the intercept under the name Intercept:
Intercept 0.08
x 0.17
dtype: float64
The coefficients live in result.params, which is a pandas Series indexed by term name, so it behaves much like a dictionary. The constant term is stored under Intercept, as others have pointed out, and each variable term is stored under its variable name. So, if your model is y ~ x, the regression coefficients are available as result.params['Intercept'] (that's c) and result.params['x'] (that's a) for the equation y = a*x + c.
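For instance, here is a minimal sketch of that name-based access (the synthetic data and the column name x are just illustrative assumptions):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# hypothetical data, only to illustrate accessing coefficients by name
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': np.arange(50, dtype=float)})
df['y'] = 2.0 * df['x'] + 5.0 + rng.normal(size=50)
result = smf.ols('y ~ x', data=df).fit()
a = result.params['x']          # slope
c = result.params['Intercept']  # constant
print(a, c)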
Related
I am trying to do a linear regression. With the results I want to multiply each x by its own estimated coefficient: xi·βi.
However, I am doing a lot of transformations on xi.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I need the transformed columns pow(Literacy, 2) and log_plus_1(Wealth). Since they are built when the model is created, I was hoping to get them back out of the fitted model rather than re-applying the transformations to the original dataset.
In R I would use res$model to get it.
The data is stored as attributes of the model, e.g. the design matrix is mod.exog, the dependent or response variable is mod.endog.
(I'm not sure I remember correctly the details of the following: The data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the data for the explanatory variables in prediction in the same way as the data was transformed in creating the model.
predict only returns the predictions, not the internally transformed exog used for prediction.
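A short sketch of poking at those attributes, continuing from the mod, res and df objects defined in the question (the data.orig_exog attribute is my best recollection, so treat that line as an assumption):
# design matrix as a numpy array, transformed columns included
print(mod.exog[:3])
# response variable
print(mod.endog[:3])
# for formula-built models the patsy design matrix should (assumption)
# also be available as a pandas DataFrame with the transformed column names
print(mod.data.orig_exog.head())
# predict re-applies pow(...) and log_plus_1(...) to new data via patsy
print(res.predict(df.iloc[:5]))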
I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
from sklearn import linear_model
# training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coefficients
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print each coefficient alongside the correct feature (tested with a pandas DataFrame).
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_, model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
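For example, a small self-contained check along those lines (the synthetic data and column names are made up purely for illustration):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
train_data = pd.DataFrame(rng.rand(100, 3),
                          columns=['sqft_living', 'bathrooms', 'bedrooms'])
train_data['price'] = (3 * train_data['sqft_living']
                       + 2 * train_data['bathrooms']
                       - 1 * train_data['bedrooms'])
features_a = ['sqft_living', 'bathrooms', 'bedrooms']
features_b = ['bedrooms', 'sqft_living', 'bathrooms']  # same features, shuffled
model_a = LinearRegression().fit(train_data[features_a], train_data['price'])
model_b = LinearRegression().fit(train_data[features_b], train_data['price'])
# both dictionaries map each feature to (approximately) the same coefficient
print(dict(zip(features_a, model_a.coef_)))
print(dict(zip(features_b, model_b.coef_)))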
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# column 0 holds the feature names; "Coefs" holds the matching coefficients
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns), "Coefs", regressor.coef_.transpose())
@Robin posted a great answer, but I had to make one tweak to get it to work the way I wanted: refer to the dimension of the coef_ array that I want, namely model_1.coef_[0, :], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0, :], model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip truncates to its shortest input, which makes it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, its extra positions are silently dropped. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, since the feature names I needed were the columns of my train_data DataFrame:
pd.DataFrame(data=[model_1.coef_], columns=train_data[model_1_features].columns)
Right after training the model, the coefficient values are stored in model.coef_ (for estimators whose coef_ is two-dimensional, the values for the first output are in model.coef_[0]). We can iterate over the column names and store each column name and its coefficient value in a dictionary.
model.fit(X_train, y)
# assuming all the columns except the last one are used in training
columns = data.iloc[:, :-1].columns
coef_dict = {}
for i in range(len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
    data=model_1.coef_,
    index=model_1.feature_names_in_
)
A minimal reproducible example:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
    data=np.random.random((10, 3)),
    columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coefficients and their respective feature names
coef_series = pd.Series(
    data=lr.coef_,
    index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64
Can anyone explain to me the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?
Using the Advertising data from the ISLR text, I ran an OLS regression using both and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.
What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge and no R background this could be very confusing.
Furthermore it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formula.api and .api will require a user-added column vector of 1s if you are not using R-style string formulas.
import numpy as np
import statsmodels.api as sm
# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence of intercept or not:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted;
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api:
x1 = sm.add_constant(x1)
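For completeness, a sketch of the corrected statsmodels.api workflow, reusing the x1 and y1 defined in the question (the comments are mine):
import statsmodels.api as sm
x1 = sm.add_constant(x1)   # prepends a 'const' column of ones
model2 = sm.OLS(y1, x1)
results = model2.fit()
print(results.params)      # now reports both const (intercept) and TV,
                           # in line with smf.ols and scikit-learn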
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.logit was not converging.
My sm.formula.logit was converging however.
Data going in was exactly the same.
I changed the solver method to 'newton' and the sm.logit converged also.
Is it possible that the two versions have different default solver methods?
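For reference, a minimal sketch of choosing the solver explicitly rather than relying on the default (y and X stand in for whatever endog and exog you already built; the method argument also accepts 'bfgs', 'lbfgs', and others):
import statsmodels.api as sm
# y and X are assumed to be an already-prepared response and design matrix
result = sm.Logit(y, X).fit(method='newton', maxiter=100)
print(result.summary())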
Say I do an OLS regression using statsmodels of variable y on some explanatory variables x1 x2 x3 (contained in a dataframe df):
res = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
Is it possible to get a predicted value using only a subset of the explanatory variables? For example, I would like to get a predicted value for the observations in df using only x1 and x2 but not x3.
I have tried to do
res.predict(df[['x1','x2']])
but I get the error message: NameError: name 'x3' is not defined.
Edit: The reason I want to do this is the following. I'm running a regression of house values on house characteristics and dummies for metropolitan area, suburban status, and year. I would like to use the dummies for metropolitan area, suburban status and year to construct a price index for each location and time period.
Edit 2: This is how I ended up doing it, in case it can be helpful to anyone or someone can point me to a better way of doing it.
I'm interested in doing an OLS on the following specification:
model = 'price ~ C(MetroArea) + C(City) + C(Year) + x1 + ... + xK'
where 'x1 + ... + xK' is pseudo-code for a bunch of variables I'm using as controls but I'm not interested in, and the categorical variables are very large (e.g. 90 Metropolitan areas).
Next I fit the model with statsmodels and construct the design matrix that I'll use to predict prices using the variables of interest.
res = smf.ols(model, data=mydata).fit()
data_prediction = mydata[['MetroArea','City','Year']]
model_predict = 'C(MetroArea) + C(City) + C(Year)'
X = patsy.dmatrix(model_predict, data=data_prediction, return_type='dataframe')
The tricky part now is to select the right parameters for the variables of interest, since there are many and their names do not exactly match the variable names, because I've used patsy's categorical operator C() (e.g. the variables for MetroArea look like C(MetroArea)[0], C(MetroArea)[8], ...).
vars_interest = ['Intercept', 'MetroArea', 'City', 'Year']
params_interest = res.params[[any([word in var for word in vars_interest])
                              for var in res.params.index]]
Get prediction by doing the dot product of the selected parameters and variables of interest:
prediction = np.dot(X,params_interest)
In case anyone stumbles on this old question, there seems to be a cleaner solution using the information contained in the design matrix.
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
mydata = None  # placeholder: substitute your own DataFrame here
vars_of_interest = ['C(MetroArea)', 'C(City)', 'C(Year)']
formula = 'price ~' + " + ".join(vars_of_interest) + ' + x1 + ... + xK'
Y, X = dmatrices(formula, mydata)
# Get the slice names from patsy
slices = X.design_info.term_name_slices
model = sm.OLS(Y, X)
res = model.fit()
prediction = np.zeros(X.shape[0])
for var in vars_of_interest:
    prediction += X[:, slices[var]].dot(res.params[slices[var]])
What are you trying to do conceptually? When you predict using your regression you're just plugging values into an equation. So predicting "without x3" is the same as just plugging in x3=0.
In terms of implementing this, it looks like statsmodels is pretty draconian about prediction using the same variable names as you used during a fit. So this is not elegant, but works:
df2 = df.copy()
df2['x3'] = 0
res.predict(df2[['x1','x2','x3']])
I am trying to fit a OneVsAll classification output on training data; each row of the output adds up to 1.
One possible way is to read every row, find which column has the highest value, and prepare the data for training accordingly.
E.g.: y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], treating a, b, c as the corresponding classes of columns 0, 1, 2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np
y = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])
def transform(y, labels):
    # map each row to the label of its highest-scoring column
    f = np.vectorize(lambda i: labels[i])
    return f(y.argmax(axis=1))
y = transform(y, 'abc')
EDIT: Following the comment by alko, I made it more general by letting the user supply the labels to the transform function.
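As for a ready-made scikit-learn helper: if I remember its behaviour correctly, LabelBinarizer.inverse_transform picks the column with the maximal score in the multiclass case, so a sketch like this should produce the same mapping:
import numpy as np
from sklearn.preprocessing import LabelBinarizer
y = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])
lb = LabelBinarizer()
lb.fit(['a', 'b', 'c'])          # ties each label to a column
print(lb.inverse_transform(y))   # expected: ['b' 'b' 'c']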