Performing a Mediation Analysis for Fixed Effects Model in Python - python

I asked the same question here but have not received an answer for a week now, probably because its more of a coding than statistics question
I want to perform a mediation analysis with a fixed effects model as base model in python. I know, that you can perform mediation analysis using statsmodels' Mediation module. But fixed effects models (as far as I know) are only possible with linearmodels. As I haven't performed a mediation analysis so far, I'm not to sure how to combine those two modules.
I've tried the following (from step 3 onward is relevant for my question):
import pandas as pd
from linearmodels import PanelOLS
import statsmodels.api as sm2
from statsmodels.stats.mediation import Mediation
##1. direct effect: X --> Y
DV_LF = df.Y
IV_X = sm2.add_constant(df[['X', 'Control']])
fe_mod_X = PanelOLS(DV_LF, IV_X, entity_effects=True )
fe_res_X = fe_mod_X.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_X)
##2. X --> M
DV_A = df.M
IV_A = sm2.add_constant(df[['X', 'Control']])
fe_mod_A = PanelOLS(DV_A, IV_A, entity_effects=True )
fe_res_A = fe_mod_A.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_A)
##3. M --> Y
IV_M = sm2.add_constant(df[['M', 'Control']])
fe_mod_M = PanelOLS(DV_LF, IV_M, entity_effects=True )
fe_res_M = fe_mod_M.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_M)
##4. X, M --> Y
IV_T = sm2.add_constant(df[['X', 'M', 'Control']])
fe_mod_T = PanelOLS(DV_LF, IV_T, entity_effects=True )
fe_res_T = fe_mod_T.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_T)
med = Mediation(fe_res_T, fe_res_A, 'X', 'M').fit()
med.summary()
but I'm running into an error:
AttributeError: 'PanelEffectsResults' object has no attribute 'exog'
Does my approach make sense? If so how do I have to change the code to make PanelOLS and Mediation work together?
EDIT:
Would it be valid to calculate the indirect effect by multiplying the coefficient of X on M (step 2) with the coefficient of M on Y from step 4 and the direct effect by subtracting the indirect effect off the coefficient of X on Y (step 1)? I'm a little bit confused by that research poster, saying, that normal mediation analysis are not valid for within subject analysis such as Fixed Effects. If my approach is valid, how would I calculate the significance of the difference between the direct and the total effect? I read that bootstrapping would be a good choice, but I am unsure how to apply it in my case.

Related

Python, seaborn, statistic analysis using statannot doesn't look right

I used statannot to perform a statistical test on some basic data, but the results from the statistical test don't seem correct. I.e. a couple of my comparisons come up with "P_val=0.000e+00 U_stat=0.000e+00", which I think should not be possible. Is there something wrong with my data frame and/or code?
Here is the data frame I am using:
and here is my code:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from statannot import add_stat_annotation
import scipy.stats as sp
data = pd.read_excel('Z:/DMF/GROUPS/gr_Veening/Users/Vik/scRNA-seq/FACSAria/Adherence-invasion assays/adherence_invasion_assay_a549-RFP 4-6-21.xlsx',sheet_name="Sheet2", header = 0)
sns.set_theme(style="darkgrid")
ax1 = sns.boxplot(x="Strain", y="adherence_counts", data=data)
x = "Strain"
y = "adherence_counts"
order = ["D39", "D39 Δcps", "19F", "19F ΔcomCDE"]
ax1 = sns.boxplot(data=data, x=x, y=y, order=order)
plt.title("Adherence Assay")
plt.ylabel('CFU/ml')
plt.xlabel('')
ax1.set(xticklabels=["D39", "D39 Δ$\it{cps}$", "19F", "19F Δ$\it{comCDE}$"])
add_stat_annotation(ax1, data=data, x=x, y=y, order=order,
box_pairs=[("D39", "19F"), ("D39", "D39 Δcps"), ("D39 Δcps", "19F"), ("19F", "19F ΔcomCDE")],
test='Mann-Whitney', text_format='star', loc='inside', verbose=2)
Finally, here is the results from this statistical test:
D39 v.s. D39 Δcps: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=0.000e+00 U_stat=0.000e+00
D39 Δcps v.s. 19F: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=2.000e+00
19F v.s. 19F ΔcomCDE: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=7.617e-01 U_stat=8.000e+00
D39 v.s. 19F: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=0.000e+00 U_stat=0.000e+00
C:\Users\Vik\anaconda3\lib\site-packages\scipy\stats\stats.py:7171: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
Any help would be greatly appreciated, thanks!
Your problems come from two parts:
Statistically, in some of your cases (such as "D39" vs "19F"), all items are larger/smaller in one group vs the other, hence the 0 U statistic and extreme p-value. It is very much possible to have these results. It comes from examining only the ranks of the values provided (what this test does), it has advantages and limitations (+ Mann-Whitney's test is not adapted to such small sample sizes either, especially with scipy assuming equivariance).
Now that line z = (bigu - meanrank) / sd failing means that np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0) = 0, so in this case n1 and/or n2 are 0, (these are len(x) and len(y)). source in scipy So,
There is a bug in statannot, because this can happen, silently, if order and box_pair both refer to a series which does not exist in the dataframe, which I'll correct in statannotations. Thank you, then.
However, I cannot reproduce your Warning with a copy of your dataframe.
If this were the only bug, you should see a missing box in your plot at the point you showed us.
If not, is it possible you updated some of the code but did not copy the last output here? Otherwise, there may be something more to uncover, please let us know.
EDIT: As discovered in the discussion, the second problem can happen in statannot if there is a mismatch between a label in order, box_pairs and in the dataset. This has been patched in statannotations, a fork of statannot.

Summarise the posterior of a single parameter from an array with arviz

I am estimating a model using the pyMC3 library in python. In my "real" model, there are four parameter arrays, two of which have over 170,000 parameters in them. Summarising this array of parameters is too computationally intensive on my computer. I have been trying to figure out if the summary function in arviz will allow me to only summarise one (or a small number) of parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters in it b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1] or alternatively for just a single parameter, e.g., b[0].
import pandas as pd
import pymc3 as pm
import arviz as az
d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values
with pm.Model() as model:
b = pm.Normal("b", mu=0, sigma=10, shape=3)
sig = pm.HalfCauchy("sig", beta=2)
mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
fit = pm.fit(10000, method='advi')
samp = fit.sample(1500)
with model:
smry = az.summary(samp, var_names = ["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples, like the one here with plot_posterior() instead of summary(), I was unable to get something to work. In particular, I tried the following in the hopes that it would return the summary for b[0] and b[1].
with model:
smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": range(1)})
or this to return the summary of b[0]:
with model:
smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)
To use coords for this, you need to update to the development (which will still show 0.11.2 but has the code from github or any >0.11.2 release) version of ArviZ. Until 0.11.2, the coords argument in summary was not used to subset the data (like it did in all plotting functions) but instead it was only taken into account if the input was not already InferenceData in which case it was passed to the converter.
With older versions, you need to use xarray to subset the data before passing it to summary. Therefore you need to explicitly convert the trace to inferencedata beforehand. In the example above it would look like:
with model:
...
samp = fit.sample(1500)
idata = az.from_pymc3(samp)
az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to indicate summary to compute only a subset of the stats/diagnostics as shown in the docstring examples.

Python statsmodels logit wald test input

I have fitted a logisitic regression model to some data, everything is working great. I need to calculate the wald statistic which is a function of the model result.
My problem is that I do not understand, from the documentation, what the wald test requires as input? Specifically what is the R matrix and how is it generated?
I tried simply inputting the data I used to train and test the model as the R matrix, but I do not think this is correct. The documentation suggest examining the examples however none give an example of this test. I also asked the same question on crossvalidated but got shot down.
Kind regards
http://statsmodels.sourceforge.net/0.6.0/generated/statsmodels.discrete.discrete_model.LogitResults.wald_test.html#statsmodels.discrete.discrete_model.LogitResults.wald_test
The Wald test is used to test if a predictor is significant or not, of the form:
W = (beta_hat - beta_0) / SE(beta_hat) ~ N(0,1)
So somehow you'll want to input the predictors into the test. Judging from the example of the t.test and f.test, it may be simpler to input a string or tuple to indicate what you are testing.
Here is their example using a string for the f.test:
from statsmodels.datasets import longley
from statsmodels.formula.api import ols
dta = longley.load_pandas().data
formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR'
results = ols(formula, dta).fit()
hypotheses = '(GNPDEFL = GNP), (UNEMP = 2), (YEAR/1829 = 1)'
f_test = results.f_test(hypotheses)
print(f_test)
And here is their example using a tuple:
import numpy as np
import statsmodels.api as sm
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
results = sm.OLS(data.endog, data.exog).fit()
r = np.zeros_like(results.params)
r[5:] = [1,-1]
T_test = results.t_test(r)
If you're still struggling getting the wald test to work, include your code and I can try to help make it work.

How to define General deterministic function in PyMC

In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
Following is a pyMC3 code which shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm
#Predefine values on two parameter Grid (x,w) for a set of i values (1,2,3)
idata = np.array([1,2,3])
size= 20
gridlength = size*size
Grid = np.empty((gridlength,2+len(idata)))
for x in range(size):
for w in range(size):
# A silly version of my real model evaluated on grid.
Grid[x*size+w,:]= np.array([x,w]+[(x**i + w**i) for i in idata])
# A function to find the nearest value in Grid and return its product with third variable z
def FindFromGrid(x,w,z):
return Grid[int(x)*size+int(w),2:] * z
#Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12,2:]*3.6 + yerror # ie. True x= 16, w= 12 and z= 3.6
with pm.Model() as model:
#Priors
x = pm.Uniform('x',lower=0,upper= size)
w = pm.Uniform('w',lower=0,upper =size)
z = pm.Uniform('z',lower=-5,upper =10)
#Expected value
y_hat = pm.Deterministic('y_hat',FindFromGrid(x,w,z))
#Data likelihood
ysigmas = np.ones(len(idata))*9.0
y_like = pm.Normal('y_like',mu= y_hat, sd=ysigmas, observed=ydata)
# Inference...
start = pm.find_MAP() # Find starting value by optimization
step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
trace = pm.sample(1000, step, start=start, progressbar=False) # draw 1000 posterior samples using NUTS sampling
print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z':3.6})
fig.show()
When I run this code, I get error at the y_hat stage, because the int() function inside the FindFromGrid(x,w,z) function needs integer not FreeRV.
Finding y_hat from a pre calculated grid is important because my real model for y_hat does not have an analytical form to express.
I have earlier tried to use OpenBUGS, but I found out here it is not possible to do this in OpenBUGS. Is it possible in PyMC ?
Update
Based on an example in pyMC github page, I found I need to add the following decorator to my FindFromGrid(x,w,z) function.
#pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],otypes=[t.dvector])
This seems to solve the above mentioned issue. But I cannot use NUTS sampler anymore since it needs gradient.
Metropolis seems to be not converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: Are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this could not converge is that Metropolis() by default samples in the joint space while often Gibbs within Metropolis is more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice sampler to be non-blocked by default (so within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.

Why are LASSO in sklearn (python) and matlab statistical package different?

I am using LaasoCV from sklearn to select the best model is selected by cross-validation. I found that the cross validation gives different result if I use sklearn or matlab statistical toolbox.
I used matlab and replicate the example given in
http://www.mathworks.se/help/stats/lasso-and-elastic-net.html
to get a figure like this
Then I saved the matlab data, and tried to replicate the figure with laaso_path from sklearn, I got
Although there are some similarity between these two figures, there are also certain differences. As far as I understand parameter lambda in matlab and alpha in sklearn are same, however in this figure it seems that there are some differences. Can somebody point out which is the correct one or am I missing something? Further the coefficient obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictos')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, LassoCV
data=scipy.io.loadmat('randomData.mat')
X=data['X']
Y=data['Y'].flatten()
model = LassoCV(cv=10,max_iter=1000).fit(X, Y)
print 'alpha', model.alpha_
print 'coef', model.coef_
eps = 1e-2 # the smaller it is the longer is the path
models = lasso_path(X, Y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.semilogx(alphas_lasso,coefs_lasso)
pl.gca().invert_xaxis()
pl.xlabel('alpha')
pl.show()
I do not have matlab but be careful that the value obtained with the cross--validation can be unstable. This is because it influenced by the way you subdivide the samples.
Even if you run 2 times the cross-validation in python you can obtain 2 different results.
consider this example :
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
0.00645093258722
0.00691712356467
it's possible that alpha = lambda / n_samples
where n_samples = X.shape[0] in scikit-learn
another remark is that your path is not very piecewise linear as it could/should be. Consider reducing the tol and increasing max_iter.
hope this helps
I know this is an old thread, but:
I'm actually working on piping over to LassoCV from glmnet (in R), and I found that LassoCV doesn't do too well with normalizing the X matrix first (even if you specify the parameter normalize = True).
Try normalizing the X matrix first when using LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiple alpha by 2
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
Mathworks have selected an example and decided to include it in their documentation
Your matlab code produces exactly the result as the example.
The alternative does not match the result, and has provided inaccurate results in the past
This is my assumption:
The chance that mathworks have chosen to put an incorrect example in their documentation is neglectable compared to the chance that a reproduction of this example in an alternate way does not give the correct result.
The logical conclusion: Your matlab implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion would be that you should continue with Matlab to select your model.

Categories