I want to use the statsmodels OLS class to create a multiple regression model. Consider the following dataset:
import statsmodels.api as sm
import pandas as pd
import numpy as np
dict = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
        'debt_ratio': np.random.randn(5), 'cash_flow': np.random.randn(5) + 90}
df = pd.DataFrame.from_dict(dict)

x = df[['debt_ratio', 'industry']]
y = df['cash_flow']

def reg_sm(x, y):
    x = np.array(x).T
    x = sm.add_constant(x)
    results = sm.OLS(endog=y, exog=x).fit()
    return results
When I run the following code:
reg_sm(x, y)
I get the following error:
TypeError: '>=' not supported between instances of 'float' and 'str'
I've tried converting the industry variable to categorical, but I still get an error. I'm out of options.
I had this problem as well, with lots of columns that needed to be treated as categorical, which makes dummifying them all quite annoying. Converting to string didn't work for me either.
For anyone looking for a solution without one-hot-encoding the data, the R-style formula interface provides a nice way of doing this:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
dict = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
'debt_ratio':np.random.randn(5), 'cash_flow':np.random.randn(5) + 90}
df = pd.DataFrame.from_dict(dict)
x = df[['debt_ratio', 'industry']]
y = df['cash_flow']
# NB: unlike sm.OLS, an intercept term is included here by default
smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()
Reference:
https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
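To inspect the fit, assign the result and look at its params or summary(); for example, continuing from the snippet above:
res = smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()
print(res.params)     # one coefficient per industry level, relative to the reference level
print(res.summary())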
You're on the right path with converting to a Categorical dtype. However, once you convert the DataFrame to a NumPy array, you get an object dtype (NumPy arrays are one uniform type as a whole). This means that the individual values are still str underneath, which a regression is definitely not going to like.
What you might want to do is to dummify this feature. Instead of factorizing it, which would effectively treat the variable as continuous, you want to maintain some semblance of categorization:
>>> import statsmodels.api as sm
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> data = {
... 'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
... 'debt_ratio':np.random.randn(5),
... 'cash_flow':np.random.randn(5) + 90
... }
>>> data = pd.DataFrame.from_dict(data)
>>> data = pd.concat((
... data,
... pd.get_dummies(data['industry'], drop_first=True)), axis=1)
>>> # You could also use data.drop('industry', axis=1)
>>> # in the call to pd.concat()
>>> data
industry debt_ratio cash_flow finance hospitality mining transportation
0 mining 0.357440 88.856850 0 0 1 0
1 transportation 0.377538 89.457560 0 0 0 1
2 hospitality 1.382338 89.451292 0 1 0 0
3 finance 1.175549 90.208520 1 0 0 0
4 entertainment -0.939276 90.212690 0 0 0 0
Now you have dtypes that statsmodels can better work with. The purpose of drop_first is to avoid the dummy variable trap:
>>> y = data['cash_flow']
>>> x = data.drop(['cash_flow', 'industry'], axis=1)
>>> sm.OLS(y, x).fit()
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x115b87cf8>
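The fitted object by itself just prints as a results wrapper; to actually see the estimates, continuing from the snippet above, inspect params or the full summary:
>>> res = sm.OLS(y, x).fit()
>>> res.params           # one coefficient per column of x (no constant was added here)
>>> print(res.summary())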
Lastly, a small pointer: it helps to avoid giving variables names that shadow built-in types, such as dict.
Here is another example with categorical variables from a similar case; it gives the same results as a statistics course taught in R (Hanken, Finland).
import wooldridge as woo
import statsmodels.formula.api as smf
import numpy as np
df = woo.dataWoo('beauty')
print(df.describe())
df['abvavg'] = (df['looks']>=4).astype(int) # good looking
df['belavg'] = (df['looks']<=2).astype(int) # bad looking
df_female = df[df['female']==1]
df_male = df[df['female']==0]
results_female = smf.ols(formula = 'np.log(wage) ~ belavg + abvavg',data=df_female).fit()
print(f"FEMALE results, summary \n {results_female.summary()}")
results_male = smf.ols(formula = 'np.log(wage) ~ belavg + abvavg',data=df_male).fit()
print(f"MALE results, summary \n {results_male.summary()}")
Regards, Markus
The screenshot above is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
x and y must have the same number of samples, so you need to reshape mean the same way you reshaped perc, to one row and one column. In this case that boils down to:
np.array(mean).reshape(-1, 1) or np.array(mean).reshape(1, 1)
Given that you are training 5 regressors, each with just one sample, it is not surprising that all 5 models "learn" a coefficient of 0 and an intercept equal to y (the mean, 3.34).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
"stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
"ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1, 1)
    y = np.array(mean).reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it doesn't make much practical sense, given that you're only providing one example to train each model.
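If what you're after is the usual finance beta, each stock needs a time series of returns regressed against a market series, rather than a single ChangePercent value. A rough sketch with made-up data (the return series here are hypothetical, not from the question):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
market = rng.normal(0, 1, 100)                  # hypothetical market returns
stock = 1.2 * market + rng.normal(0, 0.5, 100)  # hypothetical stock returns

model = LinearRegression().fit(market.reshape(-1, 1), stock)
print(model.coef_[0])  # beta estimate, close to 1.2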
I am using an Anaconda environment on Windows, with PyCaret installed, and PyCharm.
I want to run a basic toy example with PyCaret (not using the freely available datasets), as simple as y = mx + c, where x is 1-D.
Here is my working code with scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
x= np.arange(0,1000,dtype = 'float64')
Y = (x*2) + 1
X = x.reshape(-1,1)
reg = LinearRegression().fit(X, Y)
# if predicting on same model,perfect score
score = reg.score(X,Y)
print('1- RSS/TSS: 1 for perfect regression=' + str(score))
print('coef =' + str(reg.coef_[0])) # slope
print('intercept =' + str(reg.intercept_)) # intercept
This gives the expected results (coef = 2, intercept = 1).
Now I create a DataFrame that I can pass to the PyCaret package.
data1 = np.vstack((x,Y)).transpose()
# create dataframe as required by Pandas
N= data1.shape[0]
# add first row
dat2 = np.array(['','Col1','Col2'])
for i in range(N):
    dat_row = list(data1[i, :].flatten())
    nm = ['row' + str(i)]
    dat_row = nm + dat_row
    dat2 = np.vstack((dat2, dat_row))
df= pd.DataFrame(data=dat2[1:,1:],
index=dat2[1:,0],
columns=dat2[0,1:])
print(df)
print('***************************')
columns = df.applymap(np.isreal).all()
print(columns)
print('***************************')
# now, using Pycaret
from pycaret.regression import *
exp_reg = setup(df, html= False,target='Col2')
print('********************************')
compare_models()
When I do so, the numeric columns I created (x, Y) are shown as categorical, and PyCaret also recognizes them as Categorical (see the figure below).
Why are they categorical? Can I change them to be treated as numeric?
Once I press enter, PyCaret finally gives me the error below.
Any ideas about this error?
sedy
You can force the data type in PyCaret by using the numeric_features and categorical_features parameters of the setup function.
For example:
clf1 = setup(data, target = 'target', numeric_features = ['X1', 'X2'])
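In this particular example the columns also come out categorical because the DataFrame is built from a string NumPy array (dat2 starts from ['', 'Col1', 'Col2'], so both columns end up with object dtype). Casting them back to numeric before setup should work as well; a sketch, continuing from the df built in the question (untested against any specific PyCaret version):
import pandas as pd
from pycaret.regression import setup, compare_models

df = df.apply(pd.to_numeric)                  # cast the object columns back to floats
exp_reg = setup(df, target='Col2', html=False,
                numeric_features=['Col1'])    # and/or force the dtype explicitly
compare_models()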
I'd like to loop through several specifications of a linear regression and save the results for each model in a Python dictionary. The code below is somewhat successful, but additional text (e.g. datatype information) is included in the dictionary, making it unreadable. Moreover, regarding the confidence interval, I'd like to have two separate columns, one for the upper and another for the lower bound, but I'm unable to do that.
code:
import patsy
import statsmodels.api as sm
from collections import defaultdict
colleges = ['ARC_g',u'CCSF_g',u'DAC_g',u'DVC_g',u'LC_g',u'NVC_g',u'SAC_g', u'SRJC_g',u'SC_g',u'SCC_g']
results = defaultdict(lambda: defaultdict(int))
for exog in colleges:
    exog = exog.encode('ascii')
    f1 = 'GRADE_PT_103 ~ %s -1' % exog
    y, X = patsy.dmatrices(f1, data, return_type='dataframe')
    mod = sm.OLS(y, X)  # Describe model
    res = mod.fit()     # Fit model
    results[exog]['beta'] = res.params
    # I'd like the confidence interval to be separated into two columns ('upper' and 'lower')
    results[exog]['CI'] = res.conf_int()
    results[exog]['rsq'] = res.rsquared

pd.DataFrame(results)
Current output:
ARC_g | CCSF_g | ...
beta | ARC_g 0.79304 dtype: float64 | CCSF_g 0.833644 dtype: float64
CI | 0 1 ARC_g 0.557422 1.0... 0 1| CCSF_g 0.655746 1...
rsq | 0.122551 | 0.213053
This is how I'd summarize what you were showing. Hopefully it helps give you some ideas.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame(np.random.randn(30, 5), columns=list('YABCD'))
results = {}

for c in data.columns[1:]:
    f = 'Y ~ {}'.format(c)
    r = smf.ols(formula=f, data=data).fit()
    coef = pd.concat([r.params,
                      r.conf_int().iloc[:, 0],
                      r.conf_int().iloc[:, 1]], axis=1, keys=['coef', 'lower', 'upper'])
    coef.index = ['Intercept', 'Beta']
    results[c] = dict(coef=coef, rsq=r.rsquared)
keys = data.columns[1:]
summary = pd.concat([results[k]['coef'].stack() for k in keys], axis=1, keys=keys)
summary.index = summary.index.to_series().str.join(' - ')
summary.append(pd.Series([results[k]['rsq'] for k in keys], keys, name='R Squared'))
As the title suggests, where has the rolling option of the ols command in pandas migrated to in statsmodels? I can't seem to find it.
Pandas tells me doom is in the works:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
In fact, if you do something like:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
you get results (window does not impair the running of the code), but you only get the parameters of the regression run on the entire period, not the series of parameters for each rolling window it is supposed to work on.
I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.
It has three core classes:
OLS : static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
RollingOLS : rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.
PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.
Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.
The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also takes extensive advantage of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
An example:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
"TWEXBMTH" : "usd",
"T10Y2YM" : "term_spread",
"GOLDAMGBD228NLBM" : "gold",
}
params = {
"fq": "Monthly,Monthly,Monthly",
"id": ",".join(syms.keys()),
"cosd": "2000-01-01",
"coed": "2019-02-01",
}
data = pd.read_csv(
url + "?" + urllib.parse.urlencode(params, safe=","),
na_values={"."},
parse_dates=["DATE"],
index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
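Note that recent statsmodels releases (0.11 and later) also ship a built-in rolling estimator, statsmodels.regression.rolling.RollingOLS. A minimal sketch against the same data as above, assuming such a version is installed:
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# add a constant column so the rolling fit includes an intercept
exog = sm.add_constant(data[['term_spread', 'gold']])
res = RollingOLS(data['usd'], exog, window=12).fit()
print(res.params.head(15))  # one row of coefficients per window end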
Rolling beta with sklearn
import pandas as pd
from sklearn import linear_model
def rolling_beta(X, y, idx, window=255):
    assert len(X) == len(y)
    out_dates = []
    out_beta = []
    model_ols = linear_model.LinearRegression()

    for iStart in range(0, len(X) - window):
        iEnd = iStart + window
        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
        # store output
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])

    return pd.DataFrame({'beta': out_beta}, index=out_dates)
df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
Adding, for completeness, a speedier NumPy-only solution which limits calculations to the regression coefficients and the final estimate.
Numpy rolling regression function
import numpy as np
import pandas as pd

def rolling_regression(y, x, window=60):
    """
    y and x must be pandas.Series
    """
    # === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim acc to shortest ================================================
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
    # === Verify enough space =================================================
    if x.index.size < window:
        return None
    else:
        # === Add a constant if needed ========================================
        X = x.to_frame()
        X['c'] = 1
        # === Loop... this can be improved ====================================
        estimate_data = []
        for i in range(window, x.index.size + 1):
            X_slice = X.values[i-window:i, :]  # always index in np as opposed to pandas, much faster
            y_slice = y.values[i-window:i]
            coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
            # estimate at the last point of the current window
            estimate_data.append(coeff[0] * x.values[i-1] + coeff[1])
        # === Assemble ========================================================
        estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
        return estimate
Notes
In some specific use cases, which only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.
As a reminder, the coefficients for a regression can be calculated as a matrix product, as you can read on Wikipedia's least squares page. This approach via NumPy's matrix multiplication can speed up the process somewhat versus using OLS in statsmodels. This product is expressed in the line starting with coeff = ...
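If numerical stability is a concern, the explicit inverse in that line can be replaced by np.linalg.lstsq, which solves the same least-squares problem without forming the inverse of X'X. A small self-contained sketch (the X_slice / y_slice data here are synthetic, just to illustrate the call):
import numpy as np

rng = np.random.default_rng(0)
X_slice = np.column_stack([rng.normal(size=60), np.ones(60)])              # regressor + constant
y_slice = X_slice @ np.array([0.5, 1.0]) + rng.normal(scale=0.1, size=60)  # synthetic target

# equivalent to inv(X'X) @ X' @ y, but without forming the inverse explicitly
coeff, *_ = np.linalg.lstsq(X_slice, y_slice, rcond=None)
print(coeff)  # approximately [0.5, 1.0]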
For rolling trend in one column, one can just use:
import numpy as np

def calc_trend(window: int = 30):
    df['trend'] = df.rolling(window=window)['column_name'].apply(
        lambda x: np.polyfit(np.array(range(0, window)), x, 1)[0], raw=True)
However, in my case I wanted to find a trend with respect to date, where the date was in another column. I had to create the functionality manually, but it is easy. First, convert from datetime to int64 representing days from t_0:
xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
Then:
def calc_trend(window: int = 30):
    for t in range(len(df)):
        if t < window // 2:
            continue
        i0 = t - window // 2  # Start window
        i1 = i0 + window      # End window
        xvec = xdays[i0:i1]
        yvec = df['column_name'][i0:i1].values
        df.loc[t, 'trend'] = np.polyfit(xvec, yvec, 1)[0]
I have the data in a DataFrame that I will use for a linear regression calculation using a user-built function. Here is the code:
from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = df.to_array(x)
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
However, I am getting the error:
AttributeError Traceback (most recent call last)
<ipython-input-131-272f1b4d26ba> in <module>()
1 import copy
2
----> 3 xw = df.to_array(x)
AttributeError: 'int' object has no attribute 'to_array'
I am not sure where the problem is. I need to pass an array of values (x in this case) to the function to execute some matrix operations.
The insert function was working during step-by-step code development, but for some reason it is failing here.
I tried:
xw = copy.deepcopy(x)
with no success
Any thoughts?
It is x.as_matrix(), not df.to_array(x).
Please refer to the pandas documentation for more detail on as_matrix().
Here is the code that works:
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = x.as_matrix()
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
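Note that as_matrix() was deprecated and later removed in newer pandas versions, where .values or .to_numpy() is the equivalent (and load_boston has likewise been removed from recent scikit-learn, so the example data may need to be loaded differently). A small sketch of the same conversion, assuming a recent pandas:
import numpy as np
import pandas as pd

x = pd.DataFrame({'CRIM': [0.006, 0.027], 'ZN': [18.0, 0.0]})  # any feature DataFrame
xw = x.to_numpy()                 # replaces the removed x.as_matrix()
xw = np.insert(xw, 0, 1, axis=1)  # insert a column of "1" values for the intercept
print(xw)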