Looking for a package that implements the multivariate version of statsmodels.distributions.ECDF
If one doesn't exist I will implement it for inclusion in statsmodels (if accepted), but don't want to reinvent the wheel.
I see this: https://gitlab.com/stochastic-control/StOpt
It has pybind11 bindings, but not sure if there is a wheel in pip already making this available.
The upcoming release of statsmodels 0.13 includes basic support for copulas.
For empirical and nonparametric copulas and multivariate distributions, the code is mainly experimental, not yet public, and not sufficiently tested and verified.
For example, _ecdf_mv (*) is currently a multivariate rankdata function. It needs to be divided by the number of observations, or converted to plotting positions, to turn it into an ECDF. Because copulas require continuous uniform margins, ties are broken randomly or arbitrarily.
(*)
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/distributions/tools.py#L424
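For reference, a plain multivariate ECDF (rank counting divided by the number of observations) is only a few lines of NumPy. This is a sketch with my own function name, not the statsmodels internal:

import numpy as np

def ecdf_mv(data, points=None):
    """Empirical CDF of an (nobs, k) sample, evaluated at `points` (defaults to the sample itself)."""
    data = np.asarray(data)
    points = data if points is None else np.asarray(points)
    # For each evaluation point, count observations that are <= in every coordinate.
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
print(ecdf_mv(x, np.array([[0.0, 0.0]])))  # ~0.25 for two independent N(0, 1) margins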
I want to implement adjoint sensitivity analysis in Python, in order to determine the gradient of my objective function with respect to some parameters. Specifically, the objective function depends on the solution of a differential equation, which in turn depends on the parameters whose optimum I am trying to find.
There are numerous good packages for this in Julia (see here), as well as CVODES from SUNDIALS; however, while the latter apparently has a Python wrapper, that wrapper does not include sensitivity analysis capabilities according to this link. Furthermore, I have looked into SALib, but as far as I understand it implements a different kind of 'sensitivity analysis' (global sensitivity analysis), so adjoint or even forward sensitivity analysis is not included (correct me if I am wrong on this one).
Thus my question is: does a version of CVODES with sensitivity analysis capabilities exist in Python, or is there any other package one can use to perform adjoint sensitivity analysis?
You can easily call Julia code / packages from Python with pyjulia.
https://github.com/JuliaPy/pyjulia
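A minimal sketch of the workflow, assuming Julia is installed and pyjulia has been set up once with python -c "import julia; julia.install()":

from julia import Main

# Define and call a Julia function from Python.
Main.eval("f(x) = 2x + 1")
print(Main.f(20))  # -> 41

# Julia packages can be loaded the same way, e.g. the SciML stack for
# adjoint sensitivities (assuming they are installed on the Julia side):
# Main.eval("using DifferentialEquations, SciMLSensitivity")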
You can try Assimulo, which is a Python wrapper of the SUNDIALS suite. I've been using it for some years now and it works pretty robustly. So far, I have performed forward sensitivity analysis on ODE systems with a moderate number of states/parameters using CVODES (fewer than 20 states, fewer than 10 parameters). It performs well in terms of robustness (it can handle stiff problems and supports a variety of linear solvers for sparse problems) and speed, and it also supports DAEs through IDAS.
I installed Assimulo using conda, which takes care of the whole dependency tree (including a recent version of SUNDIALS). Finally, I'm not aware whether adjoint sensitivity analysis can be performed using Assimulo. If you find something, let us all know.
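For what it's worth, here is a forward-sensitivity sketch along the lines of the Assimulo examples; attribute names such as p0 and p_sol are from memory of those examples and may differ between versions, so treat this as a starting point rather than a reference:

import numpy as np
from assimulo.problem import Explicit_Problem
from assimulo.solvers import CVode

def rhs(t, y, p):
    # Simple decay dy/dt = -p[0] * y with one parameter.
    return np.array([-p[0] * y[0]])

prob = Explicit_Problem(rhs, y0=[1.0], p0=[0.5])
sim = CVode(prob)
sim.report_continuously = True  # store sensitivities along the trajectory

t, y = sim.simulate(5.0)
# Forward sensitivities dy/dp accumulate in sim.p_sol (one list per parameter).
print(sim.p_sol[0][-1])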
While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):
np.mean(), np.std() (with ddof=0 or 1)
statistics.mean(), statistics.pstdev() (and/or statistics.stdev)
the scipy.stats package (a quick equivalence check is sketched after this list)
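A quick check of how the three line up; the ddof keyword is what makes the standard deviations agree (scipy.stats.tmean/tstd are the untrimmed defaults of the trimmed functions):

import statistics
import numpy as np
from scipy import stats

x = [0.2, -0.4, 0.7, 0.1, -0.9, 1.0]

print(np.mean(x), statistics.mean(x), stats.tmean(x))         # same mean (up to floating point)
print(np.std(x, ddof=0), statistics.pstdev(x))                # population std (ddof=0)
print(np.std(x, ddof=1), statistics.stdev(x), stats.tstd(x))  # sample std (ddof=1)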
That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(). It also highlights differences in the sum operator. That post is here:
why-is-statistics-mean-so-slow
I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
It appears the statistics module is primarily for those that have data in lists (or other forms), or for widely varying ranges [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any numpy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
Numerical algorithms generally have positive and negative aspects: some are faster, or more accurate, or require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for his/her application. Generally this is a balancing act between competing priorities and resources.
My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.
Why does NumPy duplicate features of SciPy?
From the SciPy FAQ What is the difference between NumPy and SciPy?:
In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.
It recommends using SciPy over NumPy:
In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.
When should I use the statistics library?
From the statistics library documentation:
The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Thus I would not use it for serious (i.e. resource intensive) computation.
What is the difference between statsmodels and SciPy?
From the statsmodels about page:
The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.
Thus you may have a requirement that SciPy is not able to fulfill, or is better fulfilled by a dedicated library.
For example the SciPy documentation for scipy.stats.probplot notes that
Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
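As a small illustration of that difference, a sketch comparing the scipy function with the richer statsmodels class:

import numpy as np
from scipy import stats
import statsmodels.api as sm

x = np.random.default_rng(1).normal(size=100)

# scipy: a single function that returns the ordered points and a fitted line.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# statsmodels: an object with several plot methods (.qqplot(), .ppplot(), .probplot(), ...).
pp = sm.ProbPlot(x)
fig = pp.qqplot(line="45")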
Just a simple benchmark.
statistics vs NumPy
import statistics as stat
import numpy as np
from timeit import default_timer as timer
l = list(range(1_000_000))
start = timer()
m, std = stat.mean(l), stat.stdev(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
start = timer()
m, std = np.mean(l), np.std(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
(single run, no repeated timing iterations, for brevity.)
Result
NumPy is way faster. (The small difference in the two standard deviations is just the ddof convention: statistics.stdev uses the sample estimator, ddof=1, while np.std defaults to the population estimator, ddof=0.)
# statistics
3.603845 sec elapsed.
mean, std: 499999.5 288675.2789323441
# numpy
0.2315261999999998 sec elapsed.
mean, std: 499999.5 288675.1345946685
I'm not sure if stackoverflow is the best forum for this, but anyway...
Scipy implements ANOVA using stats.f_oneway, which assumes equal variances. It says in the docs that if the variances are unequal, one could consider the Kruskal-Wallis test instead.
However, what I want is Welch's ANOVA. Scipy has a Welch t-test, but of course this doesn't work if I have more than two groups.
What I find interesting is that scipy used to have stats.oneway which allowed for an equal variance setting. However, it has been deprecated.
Is there an easy way to implement Welch's ANOVA in Python?
I needed the same thing. I had to port the code from the R package, and I also requested that scipy.stats add this feature. Here are the ~10 lines of code of the implementation:
https://github.com/scipy/scipy/issues/11122
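For reference, here is a sketch of Welch's ANOVA written from the standard textbook formulas (not necessarily the exact code attached to the issue):

import numpy as np
from scipy import stats

def welch_anova(*groups):
    k = len(groups)
    ni = np.array([len(g) for g in groups], dtype=float)
    mi = np.array([np.mean(g) for g in groups])
    vi = np.array([np.var(g, ddof=1) for g in groups])

    wi = ni / vi                         # precision weights
    w = wi.sum()
    mbar = (wi * mi).sum() / w           # weighted grand mean

    num = (wi * (mi - mbar) ** 2).sum() / (k - 1)
    tmp = (((1.0 - wi / w) ** 2) / (ni - 1.0)).sum() / (k ** 2 - 1.0)
    f_stat = num / (1.0 + 2.0 * (k - 2.0) * tmp)
    df1 = k - 1.0
    df2 = 1.0 / (3.0 * tmp)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, p_value

a = [27.5, 21.0, 19.0, 23.6, 17.0, 17.9]
b = [18.2, 16.4, 15.1, 13.3, 19.1]
c = [12.3, 15.0, 14.1, 11.2, 13.7]
print(welch_anova(a, b, c))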
The pingouin package has a Welch's ANOVA implemented. You can find the documentation for it at https://pingouin-stats.org/generated/pingouin.welch_anova.html.
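Usage is a one-liner on a long-format DataFrame (a sketch based on the linked docs, with made-up data):

import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    "score": [27.5, 21.0, 19.0, 23.6, 17.0, 18.2, 16.4, 15.1, 13.3, 12.3, 15.0, 14.1],
    "group": ["a"] * 5 + ["b"] * 4 + ["c"] * 3,
})
print(pg.welch_anova(data=df, dv="score", between="group"))  # returns a one-row DataFrame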
I'm working in the Python stack (scipy/numpy/pandas) and I need to do a linear fit on a list of (x, y) points whose noise is drawn from some distribution conditioned on x and other global properties. Are any specific methods available to measure and visualize the level of heteroscedasticity in my data?
Some of the tests listed on the Wikipedia page for heteroscedasticity can be found in scipy.stats. Under the circumstances, the statsmodels package (which is built on top of scipy) may be a better bet: there is an entire module dedicated to heteroscedasticity tests (see statsmodels.stats.diagnostic).
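A sketch of what that looks like in practice (het_breuschpagan and het_white live in statsmodels.stats.diagnostic; the data here is simulated):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.5 * x + rng.normal(scale=0.2 * x)  # noise grows with x

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm, lm_pval, fval, f_pval = het_breuschpagan(res.resid, res.model.exog)
print("Breusch-Pagan p-value:", lm_pval)

w_lm, w_pval, w_f, w_f_pval = het_white(res.resid, res.model.exog)
print("White test p-value:", w_pval)

# Visual check: residuals vs fitted values; a funnel shape suggests heteroscedasticity.
# import matplotlib.pyplot as plt
# plt.scatter(res.fittedvalues, res.resid); plt.show()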
When I compute the Breusch-Pagan test step by step, it doesn't match the BP value reported by statsmodels:
Here is a link to the notebook (in Spanish).
I have checked it in Gretl and my 'manual' estimation is correct, but I want to know why there is a difference.
Check the Koenker version in Gretl.
I don't find a Gretl reference right now, but according to the unit tests, the version in statsmodels is equal to the Koenker version of the Breusch-Pagan test.
I don't see an option for the original Breusch-Pagan test, but that one is not robust to non-normality (assumption on 4th moment, IIRC).
In general, many of the Lagrange multiplier specification tests have several versions that are asymptotically equivalent but differ in small samples; for example, statsmodels reports both the LM test and the F-test version. Additionally, they differ in which extra assumptions are required for their validity.
For example, statsmodels does not yet have a heteroscedasticity test that is robust to autocorrelation. There are again several versions, and I have not seen them available as a standard option in any package yet.
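To make the Koenker vs. original distinction concrete, here is a sketch on simulated data that computes both by hand next to statsmodels (the default LM reported by het_breuschpagan should match the Koenker n*R^2 version):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0 + 0.5 * np.abs(x))

res = sm.OLS(y, X).fit()
u2 = res.resid ** 2

# Koenker / studentized version: LM = n * R^2 from regressing u^2 on the regressors.
aux = sm.OLS(u2, X).fit()
lm_koenker = n * aux.rsquared

# Original Breusch-Pagan: regress u^2 / sigma2_hat on the regressors, statistic = ESS / 2.
aux_orig = sm.OLS(u2 / u2.mean(), X).fit()
bp_original = aux_orig.ess / 2.0

print("Koenker LM (manual):", lm_koenker)
print("original BP (manual):", bp_original)
print("statsmodels het_breuschpagan LM:", het_breuschpagan(res.resid, X)[0])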