With the Python package SciPy one can compute the Cauchy principal value of an integral (provided the pole is of low order) using the "cauchy" weighting method; see scipy.integrate.quad (consider for instance this question, where its usage is demonstrated). Is something analogous possible within the Julia ecosystem? (Of course one can easily import SciPy, but Julia's native integration packages should, in principle, be superior.)
There doesn't seem to be any native library that does this. GSL has it (https://www.gnu.org/software/gsl/doc/html/integration.html#qawc-adaptive-integration-for-cauchy-principal-values), so you can call it through https://github.com/JuliaMath/GSL.jl
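To make concrete what such a routine computes, here is a minimal Python sketch of the principal value of the integral of f(x)/(x - c) over [a, b], using singularity subtraction. This is not the QAWC algorithm that GSL and scipy.integrate.quad rely on; the function name principal_value, the midpoint rule, and the default n are assumptions chosen purely for illustration.

```python
import math

def principal_value(f, a, b, c, n=20_000):
    """Approximate PV of the integral of f(x) / (x - c) over [a, b], with a < c < b.

    Singularity subtraction: (f(x) - f(c)) / (x - c) is regular at x = c,
    and the remaining term f(c) / (x - c) has a closed-form principal value.
    """
    if not a < c < b:
        raise ValueError("pole c must lie strictly inside (a, b)")
    fc = f(c)
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h  # midpoint of the i-th subinterval
        if x == c:
            # Removable singularity: the limit is f'(c), estimated by a central difference.
            total += (f(c + h) - f(c - h)) / (2.0 * h)
        else:
            total += (f(x) - fc) / (x - c)
    # PV of the integral of 1 / (x - c) over [a, b] is log((b - c) / (c - a)).
    return total * h + fc * math.log((b - c) / (c - a))

# PV of the integral of exp(x) / x over [-1, 1]; the exact value is 2 * Shi(1), about 2.1145.
print(principal_value(math.exp, -1.0, 1.0, 0.0))
```

The same idea carries over to Julia unchanged: subtract f(c), integrate the regular remainder with any quadrature package, and add the logarithmic term.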
While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):
np.mean(), np.std() (with ddof=0 or 1)
statistics.mean(), statistics.pstdev() (and/or statistics.stdev)
scipy.stats package
That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(). It also highlights differences in the sum operator. That post is here:
why-is-statistics-mean-so-slow
I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
It appears the statistics module is primarily for those that have data in lists (or other forms), or for widely varying ranges [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any numpy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
Numerical algorithms generally have positive and negative aspects: some are faster, or more accurate, or require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for his/her application. Generally this is a balancing act between competing priorities and resources.
My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.
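To illustrate the "differences in the sum operator" mentioned above, here is a small standard-library sketch. The input values are chosen by me to force cancellation and are not taken from the linked post. As far as I know, NumPy's documentation for np.sum describes a partial pairwise summation that reduces, but does not eliminate, this kind of rounding error; NumPy is not run here.

```python
import math
import statistics

def naive_sum(values):
    """Left-to-right floating-point accumulation with no error compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [1e16, 1.0, -1e16]  # the exact sum is 1.0

naive = naive_sum(values)             # the 1.0 is lost when it is added to 1e16
exact = math.fsum(values)             # correctly rounded sum
mean_exact = statistics.mean(values)  # exact rational arithmetic internally, hence slow
mean_fast = statistics.fmean(values)  # fsum-based float mean (Python 3.8+)

print(naive, exact, mean_exact, mean_fast)
```

For data confined to a narrow range, as in the question, the naive and compensated sums agree to many digits, which is why the speed of the NumPy functions usually wins there.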
Why does NumPy duplicate features of SciPy?
From the SciPy FAQ What is the difference between NumPy and SciPy?:
In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.
It recommends using SciPy over NumPy:
In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.
When should I use the statistics library?
From the statistics library documentation:
The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Thus I would not use it for serious (i.e. resource intensive) computation.
What is the difference between statsmodels and SciPy?
From the statsmodels about page:
The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.
Thus you may have a requirement that SciPy is not able to fulfill, or is better fulfilled by a dedicated library.
For example the SciPy documentation for scipy.stats.probplot notes that
Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
Just a simple benchmark.
statistics vs NumPy
import statistics as stat
import numpy as np
from timeit import default_timer as timer
l = list(range(1_000_000))
start = timer()
m, std = stat.mean(l), stat.stdev(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
start = timer()
m, std = np.mean(l), np.std(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
(Each timing is a single run; repeated runs are omitted for brevity.)
Result
NumPy is roughly 15 times faster here, even though its timing includes converting the Python list to an array.
# statistics
3.603845 sec elapsed.
mean, std: 499999.5 288675.2789323441
# numpy
0.2315261999999998 sec elapsed.
mean, std: 499999.5 288675.1345946685
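The two standard deviations printed above differ from the seventh significant digit onward. That is not a precision problem: statistics.stdev() is the sample estimate (it divides by n - 1), while np.std() defaults to ddof=0 (it divides by n). A minimal standard-library sketch of the difference, using a shorter range so that it runs quickly:

```python
import math
import statistics

data = list(range(1_000))
n = len(data)

pop_std = statistics.pstdev(data)    # population form, divides by n
sample_std = statistics.stdev(data)  # sample form, divides by n - 1

# Closed forms for the integers 0 .. n-1, used here as an independent check.
pop_expected = math.sqrt((n * n - 1) / 12)
sample_expected = math.sqrt(n * (n + 1) / 12)

print(pop_std, pop_expected)
print(sample_std, sample_expected)

# With NumPy, np.std(data, ddof=1) corresponds to statistics.stdev(data)
# and np.std(data) to statistics.pstdev(data); NumPy is not run in this sketch.
```

Plugging n = 1_000_000 into the two closed forms reproduces the 288675.13... and 288675.27... values from the benchmark output.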
I'm not sure if Stack Overflow is the best forum for this, but anyway...
Scipy implements ANOVA using stats.f_oneway, which assumes equal variances. It says in the docs that if the variances are unequal, one could consider the Kruskal-Wallis test instead.
However, what I want is Welch's ANOVA. Scipy has a Welch t-test, but of course this doesn't work if I have more than two groups.
What I find interesting is that scipy used to have stats.oneway which allowed for an equal variance setting. However, it has been deprecated.
Is there an easy way to implement Welch's ANOVA in Python?
I needed the same thing and ended up porting the code from an R package. I have also asked for this feature to be added to scipy.stats. The implementation, about 10 lines of code, is here:
https://github.com/scipy/scipy/issues/11122
The pingouin package has a Welch's ANOVA implemented. You can find the documentation for it at https://pingouin-stats.org/generated/pingouin.welch_anova.html.
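For reference, here is a minimal standard-library sketch of Welch's F statistic and its degrees of freedom, following the usual textbook formulas. It is not the code from the linked issue or from pingouin; the function name welch_anova and its return shape are my own assumptions. The p-value needs the survival function of the F distribution, which the standard library lacks, so that step is left as a comment.

```python
import statistics

def welch_anova(*groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA (unequal variances).

    Each group needs at least two observations and a non-zero variance.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("need at least two groups")
    sizes = [len(g) for g in groups]
    means = [statistics.fmean(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]  # sample variance (n - 1)

    weights = [n / v for n, v in zip(sizes, variances)]
    w_total = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / w_total

    # Weighted between-group mean square.
    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    # Correction term that accounts for the unequal variances.
    lam = sum((1 - w / w_total) ** 2 / (n - 1) for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2

f_stat, df1, df2 = welch_anova([1.0, 2.0, 3.0, 4.0],
                               [2.0, 4.0, 6.0, 8.0, 10.0],
                               [5.0, 5.5, 6.5, 7.0])
print(f_stat, df1, df2)
# If SciPy is available: p_value = scipy.stats.f.sf(f_stat, df1, df2)
```

With two groups the statistic reduces to the square of Welch's t statistic, and df2 reduces to the Welch-Satterthwaite degrees of freedom, which makes a convenient sanity check.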
How can I calculate the cumulative distribution function of a normal distribution in Python without using SciPy?
I'm specifically referring to this function:
from scipy.stats import norm
norm.cdf(1.96)
I have a Django app running on Heroku and getting scipy up and running on Heroku is quite a pain. Since I only need this one function from scipy, I'm hoping I can use an alternative. I'm already using numpy and pandas, but I can't find the function in there. Are there any alternative packages I can use or even implement it myself?
Just use math.erf:
import math
def normal_cdf(x):
    "cdf for standard normal"
    q = math.erf(x / math.sqrt(2.0))
    return (1.0 + q) / 2.0
Edit to show comparison with scipy:
scipy.stats.norm.cdf(1.96)
# 0.9750021048517795
normal_cdf(1.96)
# 0.9750021048517796
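One caveat with the erf-based formula: far in the lower tail, erf rounds to exactly -1.0, so the result underflows to 0.0. A variant built on math.erfc avoids that, and it is easy to generalise to an arbitrary mean and standard deviation. The mu and sigma parameters are my addition, not part of the answer above.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution N(mu, sigma**2) evaluated at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / (sigma * math.sqrt(2.0))
    # erfc(-z) = 1 + erf(z), but computed without cancellation when z is very negative.
    return 0.5 * math.erfc(-z)

print(normal_cdf(1.96))               # agrees with the erf-based version to rounding
print(normal_cdf(-10.0))              # tiny but non-zero
print(normal_cdf(110.0, 100.0, 15.0)) # non-standard mean and standard deviation
```

For values near the centre of the distribution the two formulas are interchangeable; the difference only matters when small tail probabilities are needed.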
This question seems to be a duplicate of How to calculate cumulative normal distribution in Python where there are many alternatives to scipy listed.
I wanted to highlight the answer by Xavier Guihot (https://stackoverflow.com/users/9297144/xavier-guihot), which shows that from Python 3.8 the normal distribution is available in the standard library:
from statistics import NormalDist
NormalDist(mu=0, sigma=1).cdf(1.96)
# 0.9750021048517796
MATLAB's place function is pretty handy for determining a matrix that gives you the desired eigenvalues of a system.
I'm trying to implement it in Python, but numpy.place doesn't seem to be an analogous function, and I can't for the life of me find anything better in the NumPy or SciPy documentation.
I found this, along with other control functions, in the Python Control Systems Library.