Pandas Dataframe AttributeError: 'DataFrame' object has no attribute 'design_info' - python

I am trying to use the predict() function of the statsmodels.formula.api OLS implementation. When I pass a new data frame to the function to get predicted values for an out-of-sample dataset, result.predict(newdf) raises the following error: 'DataFrame' object has no attribute 'design_info'. What does this mean and how do I fix it? The full traceback is:
p = result.predict(newdf)
File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 878, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2088, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
EDIT: Here is a reproducible example. The error appears to occur when I pickle and then unpickle the result object (which I need to do in my actual project):
import cPickle
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.DataFrame({"A": [10,20,30,324,2353], "B": [20, 30, 10, 1, 2332], "C": [0, -30, 120, 11, 2]})
result = sm.ols(formula="A ~ B + C", data=df).fit()
print result.summary()
test1 = result.predict(df) #works
f_myfile = open('resultobject', "wb")
cPickle.dump(result, f_myfile, 2)
f_myfile.close()
print("Result Object Saved")
f_myfile = open('resultobject', "rb")
model = cPickle.load(f_myfile)
test2 = model.predict(df) #produces error

Pickling and unpickling of a pandas DataFrame doesn't save and restore attributes that have been attached by a user, as far as I know.
Since the formula information is currently stored together with the DataFrame of the original design matrix, this information is lost after unpickling a Results and Model instance.
If you don't use categorical variables and transformations, then the correct design matrix can be built with patsy.dmatrix. I think the following should work:
import patsy
x = patsy.dmatrix("B + C", data=df) # df is data for prediction
test2 = model.predict(x, transform=False)
Alternatively, constructing the design matrix for the prediction directly should also work. Note that we need to explicitly add the constant that the formula adds by default:
from statsmodels.api import add_constant
test2 = model.predict(add_constant(df[["B", "C"]]), transform=False)
If the formula and design matrix contain (stateful) transformations and categorical variables, then it's not possible to conveniently construct the design matrix without the original formula information. Constructing it by hand and doing all the calculations explicitly is difficult in this case, and loses all the advantages of using formulas.
The only real solution is to pickle the formula information design_info independently of the dataframe orig_exog.
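For the simple case described above (no categorical variables or transformations), another way to sidestep the problem is to persist only the fitted coefficients instead of the whole results object. The sketch below is an illustration under stated assumptions, not part of the original answer: it uses numpy.linalg.lstsq as a stand-in for the statsmodels fit so that it runs without statsmodels (with a real fit, dict(result.params) would supply the same numbers), and the names coefs and predict_from_coefs are hypothetical.

```python
import pickle

import numpy as np

# Data from the question
A = np.array([10, 20, 30, 324, 2353], dtype=float)
B = np.array([20, 30, 10, 1, 2332], dtype=float)
C = np.array([0, -30, 120, 11, 2], dtype=float)

# Design matrix for "A ~ B + C": a column of ones for the intercept
# that the formula adds implicitly, followed by B and C.
X = np.column_stack([np.ones_like(B), B, C])

# Stand-in for the OLS fit (assumption: with statsmodels you would
# take dict(result.params) instead of solving the least squares here).
beta, *_ = np.linalg.lstsq(X, A, rcond=None)
coefs = {"Intercept": float(beta[0]), "B": float(beta[1]), "C": float(beta[2])}

# A plain dict of floats survives pickling without losing anything.
restored = pickle.loads(pickle.dumps(coefs, protocol=2))


def predict_from_coefs(coefs, B, C):
    """Hypothetical helper: evaluate Intercept + b_B * B + b_C * C."""
    return coefs["Intercept"] + coefs["B"] * np.asarray(B) + coefs["C"] * np.asarray(C)


pred = predict_from_coefs(restored, B, C)
```

This only covers formulas whose terms are plain columns; as soon as categorical variables or stateful transformations are involved, the formula information is needed again.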

Related

Map categorical data for logistic regression

I am trying to run a logistic regression, predicting income based on age, num, and hours-per-week. The income column consists of either <=50K or >50K. I tried to replace the categorical data with numeric values using the pandas map() function and received the error:
'DataFrame' object has no attribute 'map'. Then I tried adding the rdd attribute (as shown below) but got the error:
'DataFrame' object has no attribute 'rdd'
import pandas as pd
import statsmodels.api as sm
adult_train = pd.read_csv("C:/.../adult_training.csv")
adult_test = pd.read_csv("C:/.../adult_test.csv")
# Separate data into predictor variables, X, and target variables, y:
X = pd.DataFrame(adult_train[['age', 'hours-per-week', 'num']])
X = sm.add_constant(X)
y = pd.DataFrame(adult_train[['income']]).rdd.map({'<=50K': 0, '>50K': 1}).astype(int)
logreg01 = sm.Logit(y, X).fit()
If you could help me get the last line of code to run, I would really appreciate it.
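Mapping values through a dictionary is a Series operation in pandas, and .rdd is an attribute of Spark DataFrames, not pandas ones. A minimal sketch of the Series-based mapping, assuming the income column holds exactly the strings <=50K and >50K; the small frame below is made up to stand in for adult_train:

```python
import pandas as pd

# Made-up stand-in for adult_train (assumption: the real file has these columns)
adult_train = pd.DataFrame({
    "age": [39, 50, 38],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Single brackets select a Series, which supports dictionary-based map
y = adult_train["income"].map({"<=50K": 0, ">50K": 1}).astype(int)
print(y.tolist())
```

The resulting y can be passed to sm.Logit(y, X) as in the question. Values that do not match a dictionary key become NaN, in which case astype(int) fails, so stray whitespace in the raw strings may need to be stripped first.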

How to pass values from list to scikit learn linear regression model?

I have imported values into Python from a PostgreSQL DB.
data = cur.fetchall()
The list looks like this:
[('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))][0]
I need to use 85 as X and Decimal('50.49') as Y in a LinearRegression model.
Then I imported the packages and the class:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
I provide the data and perform the linear regression:
X = data.iloc[:, 1].values.reshape(-1, 1)
Y = data.iloc[:, 2].values.reshape(-1, 1)
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
I am getting the error:
AttributeError: 'list' object has no attribute 'iloc'
I am a beginner in Python; I started just 2 days ago but need to do linear regression in Python for a project at my job. I think iloc can't be used on a list object, but I can't figure out how to pass the X and Y values to linear_regressor. All the examples of linear regression I have found read their data from a .CSV file. Please help me out.
No, you can't use .iloc on a list; it is for DataFrames.
Convert the list into a DataFrame and then select the columns you need.
A solution is below; please accept it if it is correct,
because it's my first answer on Stack Overflow.
import pandas as pd
from decimal import Decimal
from sklearn.linear_model import LinearRegression
# I don't know what the "[0]" after your list is for, because I haven't used data fetched from PostgreSQL. It would select only the first tuple, so it is removed here and the list is stored in temp
temp = [('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))]
# Build the DataFrame directly from the list of tuples
data = pd.DataFrame(temp, columns=["no_use", "X", "Y"])
# Convert to float so that the Decimal objects become plain numbers
X = data['X'].astype(float).values.reshape(-1, 1)
Y = data['Y'].astype(float).values.reshape(-1, 1)
print(X, Y)
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
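If pandas is not otherwise needed, the list of tuples can also be turned into arrays directly. This is a small sketch under the assumption that every row has the same three fields as in the question; rows is a hypothetical name for the fetched list, and the scikit-learn call is left out so the snippet depends only on numpy:

```python
from decimal import Decimal

import numpy as np

# Rows shaped like the output of cur.fetchall() in the question
rows = [("Ending Crowds", 85, Decimal("50.49")),
        ("Salute Apollo", 73, Decimal("319.93"))]

# Pull out the second and third field of every tuple; float() turns
# Decimal into a plain number that numpy and scikit-learn understand.
X = np.array([[float(row[1])] for row in rows])  # shape (n_samples, 1)
Y = np.array([float(row[2]) for row in rows])    # shape (n_samples,)
```

X and Y can then be passed to linear_regressor.fit(X, Y) as before.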

Getting transformed X values from OLS model using statsmodels

I am trying to do a linear regression. With the results I want to multiply each x with its own estimated coefficient: xi·βi.
However, I am doing a lot of transformations on xi.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I need the values of pow(Literacy, 2) and log_plus_1(Wealth). Since they go into the model, I was hoping to get them back out of it instead of transforming the data from the original dataset again.
In R I would use res$model to get it.
The data is stored as attributes of the model, e.g. the design matrix is mod.exog, the dependent or response variable is mod.endog.
(I'm not sure I remember correctly the details of the following: The data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the data for the explanatory variables in prediction in the same way as the data was transformed in creating the model.
predict only returns the prediction and not the internally transformed predict exog.
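As a rough illustration of the attributes mentioned above, the transformed design matrix can be wrapped in a DataFrame and multiplied column by column with the estimated coefficients. The sketch assumes statsmodels with its formula interface is installed and uses synthetic data with made-up column names (y, x1, x2) rather than the Guerry dataset, so that it runs without a download:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption: not the dataset from the question)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": rng.normal(size=20),
    "x1": rng.uniform(1, 10, size=20),
    "x2": rng.uniform(1, 10, size=20),
})

# Transformations are written inside the formula, as in the question
mod = smf.ols("y ~ I(x1 ** 2) + np.log1p(x2)", data=df)
res = mod.fit()

# Transformed design matrix, labelled with the names statsmodels assigned
exog = pd.DataFrame(mod.exog, columns=mod.exog_names, index=df.index)

# Multiply every column by its coefficient (aligned on column name)
contrib = exog * res.params
```

Summing contrib across columns reproduces res.fittedvalues, which is a quick check that the columns and coefficients line up.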

Using predict() on statsmodels.formula data with different column names using Python and Pandas

I've got some regressions results from running statsmodels.formula.api.ols. Here's a toy example:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
example_df = pd.DataFrame(np.random.randn(10, 3))
example_df.columns = ["a", "b", "c"]
fit = smf.ols('a ~ b', example_df).fit()
I'd like to apply the model to column c, but a naive attempt to do so doesn't work:
fit.predict(example_df["c"])
Here's the exception I get:
PatsyError: Error evaluating factor: NameError: name 'b' is not defined
a ~ b
^
I can do something gross and create a new, temporary DataFrame in which I rename the column of interest:
example_df2 = pd.DataFrame(example_df["c"])
example_df2.columns = ["b"]
fit.predict(example_df2)
Is there a cleaner way to do this? (short of switching to statsmodels.api instead of statsmodels.formula.api)
You can use a dictionary:
>>> fit.predict({"b": example_df["c"]})
array([ 0.84770672, -0.35968269, 1.19592387, -0.77487812, -0.98805215,
0.90584753, -0.15258093, 1.53721494, -0.26973941, 1.23996892])
or create a numpy array for the prediction, although that is much more complicated if there are categorical explanatory variables:
>>> import statsmodels.api as sm
>>> fit.predict(sm.add_constant(example_df["c"].values), transform=False)
array([ 0.84770672, -0.35968269, 1.19592387, -0.77487812, -0.98805215,
0.90584753, -0.15258093, 1.53721494, -0.26973941, 1.23996892])
If you replace your fit definition with this line:
fit = smf.ols('example_df.a ~ example_df.b', example_df).fit()
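It should run without the error. Note, however, that patsy then looks up example_df.b in the calling environment, so the result is computed from column b and not from the data passed to predict: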
fit.predict(example_df["c"])
array([-0.52664491, -0.53174346, -0.52172484, -0.52819856, -0.5253607 ,
-0.52391618, -0.52800043, -0.53350634, -0.52362988, -0.52520823])
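The temporary-DataFrame approach from the question can also be written as a single expression with DataFrame.rename. A small sketch; new_data is a hypothetical name, and the fit.predict call is left out so the snippet needs only pandas and numpy:

```python
import numpy as np
import pandas as pd

example_df = pd.DataFrame(np.random.randn(10, 3), columns=["a", "b", "c"])

# Double brackets keep a DataFrame; rename returns a copy and leaves
# example_df itself untouched.
new_data = example_df[["c"]].rename(columns={"c": "b"})
```

Calling fit.predict(new_data) then evaluates the formula against the values of column c.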

Using pandas pd.cut to generate a categorical variable with statsmodels

I have tried to use pd.cut to create a categorical variable from a continuous variable. I'd like to use it in a subsequent statsmodels regression that includes this dummy variable. When I use a categorical variable created in this way, I get an error:
TypeError: data type not understood.
A test case is included below.
import numpy as np
import pandas as pd
import statsmodels as sm
import statsmodels.formula.api as smf
df = pd.DataFrame(np.random.randn(6,4))
df.columns = ['A', 'B', 'C', 'D']
df['ttt']=pd.cut(df['D'], bins=2)
test = smf.ols('A ~ B + ttt', data=df).fit()
I'm sure I've done something obviously wrong. Any help would be appreciated.
I'm not sure exactly where statsmodels is at in terms of including support for the new Categorical type in pandas. For the moment, you may have to convert the categorical back into an object type for it to work (please check that the resulting ols fit is sensible, I don't know the full details of what you're trying to do):
df['ttt_fixed'] = df.ttt.astype(object) # the np.object alias was removed in NumPy 1.24
test = smf.ols('A ~ B + ttt_fixed', data=df).fit()
test.summary()
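Another option, sketched here under the assumption that a plain string column is acceptable for the regression, is to give pd.cut explicit labels and convert the result to strings, which the formula interface treats as categorical levels. The label names low and high are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=["A", "B", "C", "D"])

# labels= replaces the default Interval categories with plain names;
# astype(str) then yields ordinary strings
df["ttt"] = pd.cut(df["D"], bins=2, labels=["low", "high"]).astype(str)
```

The column can then be used in the formula as before, for example 'A ~ B + ttt'.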
