Repeated measures correlation in Python using the pingouin package

I want to compute the correlation between two variables (e.g., A and B) across all my participants. Because my observations are not independent (I have multiple observations per participant), I computed a repeated measures correlation using the pingouin Python package.
I plotted the repeated measures correlation in my dataset using the plot_rm_corr function from pingouin. However, the lines across subjects all appear parallel. This also seems to be the case in the pingouin documentation. Is this normal? Could someone please explain the reasoning?
Thank you!
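Parallel lines are expected here: repeated measures correlation fits an ANCOVA-style model with a common slope and a separate intercept for each subject, so every subject's fitted line is parallel by construction (this is the model from Bakdash & Marusich's rmcorr paper, which pingouin follows, as I understand it). A minimal sketch using the example dataset bundled with pingouin:

import pingouin as pg

# Example long-format dataset shipped with pingouin (one row per observation).
df = pg.read_dataset('rm_corr')

# rm_corr estimates a common slope with per-subject intercepts, so the
# per-subject fits in the plot differ only in intercept, never in slope.
print(pg.rm_corr(data=df, x='pH', y='PaCO2', subject='Subject'))
pg.plot_rm_corr(data=df, x='pH', y='PaCO2', subject='Subject')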

Related

Automatic Linear/Multiple Regression in Python with 50+ columns

I have a dataset with more than 50 columns and I'm trying to find a way in Python to run a simple linear regression between each combination of variables. The goal is to find a starting point for further analysis (i.e., I will delve deeper into those pairs that have a somewhat significant R squared).
I've put all my columns in a list of NumPy arrays. How could I go about running a simple linear regression for each combination and printing its R squared? Would it also be possible to try a multiple linear regression, with up to 5-6 variables, again for each combination?
Each array has ~200 rows, so code efficiency in terms of speed is not a big issue for this personal project.
If you are looking for columns with high R squared values, just try a correlation matrix (for a simple two-variable regression, R squared is just the squared Pearson correlation). To ease visualization, I would recommend plotting a heat map using seaborn:
import seaborn as sns
import matplotlib.pyplot as plt

# df is your DataFrame of numeric columns
df_corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(df_corr, cmap="coolwarm", annot=True)
plt.show()
Another suggestion I have is to run a Principal Component Analysis (PCA) on your dataset to find the features with the highest variability. Usually, these variables are the most important and can be used to make the best predictions. Just let me know if you want more info on this technique.
This is more of an EDA problem than a Python problem. Look into some regression resources, specifically the correlation matrix. However, one possible solution could use itertools.combinations with a group size of 6. With 50 columns this gives you C(50, 6) = 15,890,700 different options for running a regression, so unless you want to run more than 15 million regressions you should do some EDA first to find the important features in your dataset.
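For the pairwise part of the question, a minimal sketch of the combinations approach, shown here on a hypothetical random DataFrame (scipy.stats.linregress exposes the r value directly, and for a two-variable regression R squared is just its square):

import numpy as np
import pandas as pd
from itertools import combinations
from scipy import stats

# Hypothetical stand-in for the real dataset: ~200 rows, numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 8)),
                  columns=[f"var{i}" for i in range(8)])

results = []
for x, y in combinations(df.columns, 2):
    res = stats.linregress(df[x], df[y])
    results.append((x, y, res.rvalue ** 2))  # R^2 = squared correlation

# Print the ten strongest pairs first.
for x, y, r2 in sorted(results, key=lambda t: t[2], reverse=True)[:10]:
    print(f"{x} ~ {y}: R^2 = {r2:.3f}")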

How to test equality of coefficients for 2SLS in Statsmodels or Linearmodels?

So if I ran an experiment with multiple treatment groups and a control, I would analyse the results using statsmodels OLS to see if any of the treatment groups were statistically different from the control group:
y ~ C(treatment_group, Treatment('Control'))
I would then run results.t_test_pairwise() to find out whether the coefficients of the treatment groups were equal, i.e. whether the results of each treatment group were statistically significantly different from one another.
In the current situation, rather than running a standard OLS, I am using statsmodels/linearmodels 2SLS to analyse an instrumental variable. I can run the analysis perfectly fine and get results. But now I need to see whether the coefficients of the different instruments (the three different treatment groups) are the same, so I know whether the treatment groups vary in their effect.
Code for statsmodels:
from statsmodels.sandbox.regression.gmm import IV2SLS as SM2SLS
model = SM2SLS(tdf[endog],tdf['elect_lpd'],tdf[inst]).fit()
Or for linearmodels:
from linearmodels.iv import IV2SLS as LM2SLS
model = LM2SLS(tdf.elect_lpd, tdf[controls], tdf[endog], tdf[inst]).fit(cov_type='clustered')
Josef's response here suggests that you can use a Wald test, but I need to use the restriction matrix rather than the formula. So if anyone has any ideas of how to do that, it would be much appreciated.
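For the statsmodels side, fitted results generally expose wald_test(), which accepts an array-valued restriction matrix instead of a formula. A minimal sketch, assuming `model` is the fitted result from the statsmodels snippet above and that the two treatment coefficients sit at hypothetical positions i and j in the parameter vector:

import numpy as np

i, j = 1, 2                    # hypothetical positions of the two treatment coefficients
k = len(model.params)
R = np.zeros((1, k))
R[0, i], R[0, j] = 1.0, -1.0   # encodes H0: beta_i - beta_j = 0
print(model.wald_test(R))      # Wald test of the single restriction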
If anyone else gets stuck with this... I figured out the solution when using linearmodels.
So after running the model:
model = LM2SLS(tdf.elect_lpd, tdf[controls], tdf[endog], tdf[inst]).fit(cov_type='clustered')
You can then run a Wald test to compare differences between each of your treatment groups. In my case, I had two treatment groups (sbs & wing):
model.wald_test(formula=['endog_sbs = endog_wing'])

Multivariate Kruskal-Wallis Package in Python

I would like to investigate whether there are significant differences between three different groups. There are about 20 numerical attributes for these groups, and for each attribute there are about a thousand observations.
My first thought was to calculate a MANOVA. Unfortunately, the data are not normally distributed (tested with the Anderson-Darling test). From just looking at the data, the distribution is too narrow around the mean and has no tail at all.
When I calculate the MANOVA anyway, highly significant results come out that are completely against my expectations.
Therefore, I would like to calculate a multivariate Kruskal-Wallis test next. So far I have found scipy.stats.kruskal. Unfortunately, it only compares individual data series with each other. Is there an implementation in Python similar to MANOVA, where you read in all attributes and all three groups and then get a single result?
If you need more information, please let me know.
Thanks a lot! :)
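Not a true multivariate test, but a common fallback is to run scipy.stats.kruskal once per attribute and correct the p-values for multiple comparisons. A minimal sketch, assuming a hypothetical DataFrame with a 'group' column plus numeric attribute columns:

import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def kruskal_per_attribute(df, group_col='group'):
    # One univariate Kruskal-Wallis H test per attribute, Holm-corrected.
    attrs = [c for c in df.columns if c != group_col]
    groups = df[group_col].unique()
    pvals = [stats.kruskal(*(df.loc[df[group_col] == g, a] for g in groups)).pvalue
             for a in attrs]
    reject, p_adj, _, _ = multipletests(pvals, method='holm')
    return pd.DataFrame({'attribute': attrs, 'p_raw': pvals,
                         'p_holm': p_adj, 'reject_H0': reject})

This ignores the correlation structure between attributes that a MANOVA-style test would use, so treat it as a screening step rather than a single multivariate answer.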

Total Variation Distance for continuous distributions in Python (or R)

I would like to calculate the total variation distance (TVD) between two continuous probability distributions. I would like to point out that, while there are two relevant questions (see here and here), they are both working with discrete distributions.
For those not familiar with TVD,

"Informally, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event."

as described on the respective Wikipedia page. In the case of continuous distributions, TVD equals half the integral of the absolute difference between the two densities (since I cannot add math notation, see this for a proof and for the notation).
So far, I haven't been able to find a tool for this job in Python; I would be interested in one if it exists. Also, while I have no experience in R, I understand that it is commonly used for such tasks, so I would be interested in an R tool as well (TVD calculation is the final step of my algorithm, so I guess it won't be hard to read some data from a file, do the calculation, and print a number even if I am completely new to R).
I would like to add that I am mainly interested in normal distributions, so a tool strictly for those is more than welcome.
If no such tools exist, then any help adapting answers from this question to use the built-in probability functions would be of great help as well.
Thank you in advance.
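For the normal case specifically, a minimal sketch that numerically integrates half the absolute density difference with SciPy (the function name and arguments below are my own, not from any library):

import numpy as np
from scipy import integrate
from scipy.stats import norm

def tvd_normal(mu1, sigma1, mu2, sigma2):
    # TVD(P, Q) = 1/2 * integral over R of |f(x) - g(x)| dx
    diff = lambda x: abs(norm.pdf(x, mu1, sigma1) - norm.pdf(x, mu2, sigma2))
    # The integrand has kinks where the densities cross, so allow quad
    # more subdivisions than its default.
    val, _ = integrate.quad(diff, -np.inf, np.inf, limit=200)
    return 0.5 * val

print(tvd_normal(0.0, 1.0, 1.0, 1.0))  # TVD between N(0,1) and N(1,1)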

Adjusted Boxplot in Python

For my thesis, I am trying to identify outliers in my data set. The data set consists of 160,000 measurements of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of relying only on "expert opinion".
Now, I've read about the IQR method for flagging possible outliers when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, the inverse gamma and lognormal were the best fits.
So, during my search for methods for non-symmetric distributions, I found this topic on Cross Validated, where user603's answer is particularly interesting: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset and that R and Matlab have functions for this:
(There is an R implementation of this (robustbase::adjbox()) as well as a Matlab one, in a library called libra.)
I was wondering if there is such a function in Python. Or is there a way to calculate the medcouple (see the paper in user603's answer) with Python?
I really would like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of skewness used in the adjusted boxplot.
With this value you can calculate the interval beyond which points are flagged as outliers.
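A minimal sketch of those fences, using the exponential whisker factors from Hubert & Vandervieren's adjusted-boxplot paper (the function name is mine; note that statsmodels' medcouple uses a naive O(n^2) algorithm, so with 160,000 points you may need to subsample):

import numpy as np
from statsmodels.stats.stattools import medcouple

def adjusted_boxplot_fences(x):
    x = np.asarray(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)  # robust skewness measure in [-1, 1]
    if mc >= 0:
        lo = q1 - 1.5 * np.exp(-4 * mc) * iqr
        hi = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lo = q1 - 1.5 * np.exp(-3 * mc) * iqr
        hi = q3 + 1.5 * np.exp(4 * mc) * iqr
    return lo, hi  # points outside [lo, hi] are flagged as outliers

For mc = 0 this reduces to the ordinary 1.5 * IQR boxplot rule, which is a useful sanity check.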
