How to implement multiple testing for scipy.stats tests - python

I have a dataframe of values from various samples from two groups. I performed a scipy.stats t-test on these, which works perfectly, but I am a bit concerned that so much testing may yield a multiple testing error.
I wonder how to implement MTC (multiple testing correction) here. Is there some function in scipy or statsmodels which performs the tests directly and applies MTC to the resulting series of p-values, or can I safely apply an MTC function to a list of p-values afterwards?
I know that statsmodels may include such functions, but what it has in power it unfortunately lacks in usability and documentation (which is not the fault of the developers; they are only three for such a huge project). Anyway, I am a little stuck here, so I'll gladly take any suggestion. I didn't ask this on Cross Validated because it is more related to the implementation than to the statistics.

Edit 9th Oct 2019:
this link works as of today
https://www.statsmodels.org/stable/generated/statsmodels.stats.multitest.multipletests.html
original answer (returns 404 now)
statsmodels.sandbox.stats.multicomp.multipletests takes an array of p-values and returns the adjusted p-values. The documentation is pretty clear.
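A minimal sketch, assuming made-up group arrays: run one scipy t-test per feature, collect the p-values, and correct them all at once with multipletests.

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(30, 10))   # 30 samples, 10 features
group_b = rng.normal(0.2, 1.0, size=(30, 10))

# One t-test per feature -> one p-value per feature
pvalues = [stats.ttest_ind(group_a[:, j], group_b[:, j]).pvalue
           for j in range(group_a.shape[1])]

# Apply a multiple-testing correction to the whole list of p-values;
# method can be e.g. 'bonferroni', 'holm' or 'fdr_bh' (Benjamini-Hochberg).
reject, pvals_corrected, _, _ = multipletests(pvalues, alpha=0.05,
                                              method='fdr_bh')
print(reject)
print(pvals_corrected)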

Related

Multivariate Kruskal-Wallis Package in Python

I would like to investigate whether there are significant differences between three different groups. There are about 20 numerical attributes for these groups, with about a thousand observations for each attribute.
My first thought was to calculate a MANOVA. Unfortunately, the data are not normally distributed (tested with the Anderson-Darling test). From just looking at the data, the distribution is too narrow around the mean and has no tail at all.
When I calculate the MANOVA anyway, I get highly significant results that are completely against my expectations.
Therefore, I would like to calculate a multivariate Kruskal-Wallis test next. So far I have found scipy.stats.kruskal; unfortunately, it only compares individual data series with each other. Is there an implementation in Python, similar to a MANOVA, where you pass in all attributes and all three groups and get a single result?
If you need more information, please let me know.
Thanks a lot! :)
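A common workaround (not a true multivariate test) is to run scipy.stats.kruskal separately for each attribute across the three groups and then correct the resulting p-values for multiple testing. A minimal sketch, assuming synthetic stand-in arrays for the three groups:

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
# three groups, ~1000 observations each, 20 attributes (synthetic stand-ins)
g1, g2, g3 = (rng.normal(size=(1000, 20)) for _ in range(3))

# One Kruskal-Wallis test per attribute
pvalues = [stats.kruskal(g1[:, j], g2[:, j], g3[:, j]).pvalue
           for j in range(g1.shape[1])]

# Correct for running 20 tests
reject, pvals_corrected, _, _ = multipletests(pvalues, method='holm')
print(pvals_corrected)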

In sklearn.linear_model.Ridge, what exactly is the solver parameter doing?

In the sklearn.linear_model.Ridge method, there is a parameter, solver : {‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’}.
According to the documentation, we should choose a different solver depending on the type of data (dense or sparse), or just use 'auto'.
So my understanding is that we simply pick the solver that makes the computation fastest for the data at hand.
Are my thoughts right or wrong?
If you don't mind, could anyone give me some advice? I searched but couldn't find anything confirming or refuting this.
Sincerely thanks.
You are almost right.
Some solvers work only with a specific type of data (dense vs. sparse) or a specific type of problem (e.g. non-negative weights).
However, for many cases you can use several solvers (e.g. for sparse problems you have at least sag, sparse_cg and lsqr). These solvers have different characteristics: some work better in some cases and others in other cases, and in some cases a solver may not even converge.
In many cases, the simple engineering answer is to use the solver that works best on your data: just test all of them and measure the time to a solution, as in the sketch below.
If you want a more precise answer, you should dig into the documentation of the referenced methods (e.g. scipy.sparse.linalg.lsqr).
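A rough sketch of that benchmarking advice, on synthetic dense data: fit the same problem with each listed solver and compare wall-clock time (sag and saga generally expect scaled features, so timings on unscaled data are only indicative).

import time
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 200))
y = X @ rng.normal(size=200) + rng.normal(size=20_000)

for solver in ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']:
    model = Ridge(alpha=1.0, solver=solver)
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{solver:10s} {time.perf_counter() - start:.3f}s")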

Why is numpy and scipy exp() faster than log()?

In general, log and exp functions should be roughly the same speed. I would expect the numpy and scipy implementations to be relatively straightforward wrappers. numpy.exp() and scipy.exp() have similar speed, as expected. However, I found that numpy.log() is ~60% slower than these exp() functions and scipy.log() is 100% slower. Does anyone know the reason for this?
Not sure why you think that both should be "roughly the same speed". It's true that both can be calculated using a Taylor series (which, even by itself means little without analyzing the error term), but then the numerical tricks kick in.
E.g., an algebraic identity can be used to transform the original exp Taylor series into a more efficient 2-jump power series. For the log series, however, there are by-case optimizations, some of which involve a lookup table.
Which arguments did you give the functions - the same ones? The worst case for each?
What was the accuracy of the results? And how do you measure the accuracy for each: absolutely or relatively?
Edit: It should be noted that these libraries can also have different backends.
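If you want to reproduce the comparison yourself, a small sketch timing numpy's exp and log on the same input with timeit (in older SciPy releases, scipy.exp and scipy.log were thin re-exports of the NumPy functions and have since been removed from the top-level namespace):

import timeit
import numpy as np

x = np.random.default_rng(0).uniform(0.1, 10.0, size=1_000_000)

t_exp = timeit.timeit(lambda: np.exp(x), number=100)
t_log = timeit.timeit(lambda: np.log(x), number=100)
print(f"exp: {t_exp:.3f}s  log: {t_log:.3f}s")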

PCA with missing values in Python

I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a PPCA algorithm, which gives the same result as PCA, but in some implementations can deal with missing data more robustly.
I have found two libraries. You have
Package PPCA on PyPI, which is called PCA-magic on github
Package PyPPCA, having the same name on PyPI and github
Since the packages see little maintenance, you might want to implement it yourself instead. Both build on the theory presented in the well-cited (and well-written!) paper by Tipping and Bishop (1999), which is available on Tipping's home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but it has not been implemented in a way that handles missing values.
EDIT: both of the libraries above had issues, so I could not use them directly myself. I forked PyPPCA and bug-fixed it; it is available on GitHub.
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this function you can automatically replace the missing values with the mean, median, or most frequent value. Which of these options is best is hard to tell; it depends on many factors, such as what the data looks like.
By the way, you can also use PCA using the same library with:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many other statistical functions and machine learning techniques.
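A minimal sketch of the impute-then-PCA workflow described above, assuming a NaN-coded array of made-up shape:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan   # ~10% missing values

# Replace missing entries with the column mean ('median' and 'most_frequent'
# are the other built-in strategies)
X_filled = SimpleImputer(strategy='mean').fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_filled)
print(pca.explained_variance_ratio_)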

deterministic annealing method

I have run into a shape-matching problem, and one term I read about is deterministic annealing. I want to use this method to convert discrete problems, e.g. the travelling salesman problem, into continuous problems, which could help avoid getting stuck in local minima. I don't know whether there is already an implementation of this statistical method, and implementing it myself seems a bit challenging because I couldn't completely understand what the method does and couldn't find enough documentation. Can somebody explain it further, or point me to a library, ideally in Python, where it is already implemented?
You can find an explanation under simulated annealing. Also, take a look at scipy.optimize.anneal.
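Note that scipy.optimize.anneal was removed in later SciPy releases; a hedged sketch using its closest current replacement, scipy.optimize.dual_annealing, on a toy continuous objective (the TSP itself would first need a continuous or permutation-based encoding):

import numpy as np
from scipy.optimize import dual_annealing

def objective(x):
    # Rastrigin function: many local minima, global minimum at x = 0
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

bounds = [(-5.12, 5.12)] * 2
result = dual_annealing(objective, bounds)
print(result.x, result.fun)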
