How to calculate Cooks Distance, DFFITS using python statsmodel - python

I want to calculate Cooks_d and DFFITS in Python using statsmodel.
Here is my code in Python:
X = your_str_cleaned[param]
y = your_str_cleaned['Visitor']
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
I tried using this for getting Cooks Distance and DFFITS:
import statsmodels.stats.outliers_influence as st_inf
st_inf.OLSInfluence.summary_frame(results)
But I am getting this error:
'OLSResults' object has no attribute 'results'.
Can someone help me find where I am going wrong?

I experience the same problem, so I had to find a way around. I don't have much experience, and this doesn't fix the root issue with OLSInfluence. But it gives you summary_frame.
I will use pandas dataframes as the source of the data. Even if you have it in other objects (like arrays) you can transform them into a dataframe with relative ease. To show how it works, I will import the Boston housing prices data set from sklearn.datasets:
import pandas as pd
from sklearn.datasets import load_boston
#imports dataset
boston = load_boston()
#generates DataFrame bos
bos = pd.DataFrame(boston.data)
#adds columns names to bos
bos.columns = boston.feature_names
#adds column 'PRICE' to bos
bos['PRICE'] = boston.target
Now let us consider the relation between the column 'RM' and the column 'PRICE', with 'RM'as independent variable. For simplicity, let us consider simple OLS. Here comes the actual answer:
from statsmodels.formula.api import ols
m = ols('PRICE ~ RM',bos).fit()
infl = m.get_influence()
sm_fr = infl.summary_frame()
sm_fr has the columns cooks_d and dffits that you look for.

Related

SHAP plotting waterfall using an index value in dataframe

I am working on a binary classification using random forest algorithm
Currently, am trying to explain the model predictions using SHAP values.
So, I referred this useful post here and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:,:,1],
sv.base_values[:,1],
data=ord_test_t.values,
feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach as it allows to display the feature values along with waterfall plot. So, I wish to use this approach
However, this doesn't help me get the waterfall for a specific row in ord_test_t (test data).
For example, let's consider that ord_test_t.Index.tolist() returns 3,5,8,9 etc...
Now, I want to plot the waterfall plot for ord_test_t.iloc[[9]] but when I pass exp[9], it just gets the 9th row but not the index named as 9.
When I try exp.iloc[[9]] it throws error as explanation object doesnt have iloc.
Can help me with this please?
My suggestion is as following:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]]) # corrected, pass the row of interest as df
exp = Explanation(
sv.values[:, :, 1], # class to explain
sv.base_values[:, 1],
data=X.loc[[idx]].values, # corrected, pass the row of interest as df
feature_names=X.columns,
)
waterfall(exp[0]) # pretend you have only 1 data point which is 0th
0.40.0
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])

export SHAP waterfall plot to dataframe

I am working on a binary classification using random forest model, neural networks in which am using SHAP to explain the model predictions. I followed the tutorial and wrote the below code to get the waterfall plot shown below
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the plot as shown below
However, I want to export this to dataframe and how can I do it?
I expect my output to be like as shown below. I want to export this for the full dataframe. Can you help me please?
Let's do a small experiment:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find out it has some methods and attributes among which is:
explainer.expected_value
which is of interest to you because this is base on which SHAP values add up.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
will give a hint sv is a list consisting of 2 objects which are most probably SHAP values for 1 and 0, which must be symmetric (because what moves towards 1 moves exactly by the same amount, but with opposite sign, towards 0).
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
Q: How do I know?
A: Read docs and source code.
If I recall correctly, you can do something like this with pandas
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values)
to get the feature names, you should do something like this (if data_for_prediction is a dataframe):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
I'm a currenty using that :
def getShapReport(classifier,X_test):
shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values[1], X_test)
return pd.DataFrame(shap_values[1])
It first displays the shap values for the model, and for each prediction after that, and finally it returns the dataframe for the positive class(i'm on an imbalance context)
It is for a Tree explainer and not a waterfall, but it is basically the same.

Python: Random intercept model (have to replicate R code)

I'm trying to replicate code from R that estimates a random intercept model. The R code is:
fit=lmer(resid~-1+(1|groupid),data=df)
I'm using the lmer command of the lme4 package to estimate random intercepts for the variable resid for observations in different groups (defined by groupid). There is no 'fixed effects' part, therefore no variable before the (1|groupid). Moreover, I do not want a constant estimated so that I get an intercept for each group.
Not sure how to do similar estimation in Python. I tried something like:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 4), columns=list('ABCD'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5]
df['groupid'] = df['groupid'].astype('category')
###Random intercepts models
md = smf.mixedlm('A~B-1',data=df,groups=df['groupid'])
mdf = md.fit()
print(mdf.random_effects)
A is resid from the earlier example, while groupid is the same.
1) I am not sure whether the mdf.random_effects are the random intercepts I am looking for
2) I cannot remove the variable B, which I understand is the fixed effects part. If I try:
md = smf.mixedlm('A~-1',data=df,groups=df['groupid'])
I get an error that "Arrays cannot be empty".
Just trying to estimate the exact same model as in the R code. Any advice will be appreciated.

Statsmodels gives different ANOVA results to SPSS

I'm getting acquainted with Statsmodels so as to shift my more complicated stats completely over to python. However, I'm being cautious, so I'm cross-checking my results with SPSS, just to make sure I'm not making any obvious blunders. Most of time, there's no difference, but I have one example of a two-way ANOVA that's throwing up very different test statistics in Statsmodels and SPSS. (Relevant point: the sample sizes in the ANOVA are mismatched, so ANOVA may not be the appropriate model here.)
I'm selecting my model as follows:
import pandas as pd
import scipy as sp
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
Body = pd.read_csv(filepath)
Body = Body.dropna()
Body_lm = ols('Effect ~ C(Fiction) + C(Condition) + C(Fiction)*C(Condition)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=2)
The Statsmodels output is as below:
sum_sq df F PR(>F)
C(Fiction) 278.176684 1.0 307.624463 1.682042e-55
C(Condition) 4.294764 1.0 4.749408 2.971278e-02
C(Fiction):C(Condition) 10.776312 1.0 11.917092 5.970123e-04
Residual 520.861599 576.0 NaN NaN
The corresponding SPSS results are these:
Can anyone help explain the difference? Is is perhaps the unequal sample sizes being treated differently under the hood? Or am I choosing the wrong model?
Any help appreciated!
You should use sum coding when comparing the means of the variables.
BTW you don't need to specify each variable that are in the interaction term if * multiply operator is used:
“:” adds a new column to the design matrix with the product of the other two columns.
“*” will also include the individual columns that were multiplied together.
Your model should be:
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()

Ignoring missing values in multiple OLS regression with statsmodels

I'm trying to run a multiple OLS regression using statsmodels and a pandas dataframe. There are missing values in different columns for different rows, and I keep getting the error message:
ValueError: array must not contain infs or NaNs
I saw this SO question, which is similar but doesn't exactly answer my question: statsmodel.api.Logit: valueerror array must not contain infs or nans
What I would like to do is run the regression and ignore all rows where there are missing variables for the variables I am using in this regression. Right now I have:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.read_csv('cl_030314.csv')
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df).fit()
I want something like missing = "drop".
Any suggestions would be greatly appreciated. Thanks so much.
You answered your own question. Just pass
missing = 'drop'
to ols
import statsmodels.formula.api as smf
...
results = smf.ols(formula = "da ~ cfo + rm_proxy + cpi + year",
data=df, missing='drop').fit()
If this doesn't work then it's a bug and please report it with a MWE on github.
FYI, note the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use
import statsmodels.api as sm
sm.formula.ols(...)
The answer from jseabold works very well, but it may be not enough if you the want to do some computation on the predicted values and true values, e.g. if you want to use the function mean_squared_error. In that case, it may be better to get definitely rid of NaN
df = pd.read_csv('cl_030314.csv')
df_cleaned = df.dropna()
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()

Categories