I am running into a roadblock and would appreciate some help on this.
Problem Statement:
I am trying to calculate XIRR for a cash flow over 30 years in Python.
What have I tried so far:
However, none of the established libraries (like NumPy and pandas) seems to have support for this. After doing some research, I learned from this source (https://vindeep.com/Corporate/XIRRCalculation.aspx) that, with some simple manipulation, XIRR can be calculated from IRR.
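If I read the linked source correctly, the manipulation amounts to placing the cash flows on a regular daily grid, computing a per-day IRR, and annualizing it. A minimal sketch of that idea (the numbers are illustrative):

import numpy_financial as npf

cash_flows = [-10000, 2000, 3000, 4000, 5000]  # one value per day, first an outflow
daily_rate = npf.irr(cash_flows)  # IRR per daily period
xirr = (1 + daily_rate) ** 365 - 1  # annualize to get XIRR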
So all I need is a well-implemented IRR function. The functionality used to exist in NumPy but has moved to a separate package (https://github.com/numpy/numpy-financial). While this package works, it is very slow. Here is a small test:
import pandas as pd
import numpy as np
import numpy_financial as npf
from time import time
# Generate some example data: 30 years of daily cash flows
t = pd.date_range('2022-01-01', '2052-01-01', freq='D')
cash_flows = np.random.randint(10000, size=len(t)-1)
cash_flows = np.insert(cash_flows, 0, -10000)
# Calculate IRR
start_timer = time()
npf.irr(cash_flows)
stop_timer = time()
print(f"Time taken to calculate IRR over 30 years of daily data: {round((stop_timer - start_timer) / 60, 2)} minutes")
One other alternative seems to be https://github.com/better/irr - however, this has an edge case bug that has not been addressed in over 4 years.
Can anyone kindly point me to a more stable implementation, or to any good resources? It feels like such simple and commonly used functionality that the lack of a good, stable implementation surprises me.
Thanks
Uday
Try the pyxirr package. Implemented in Rust, it is blazing fast: for a 30-year time period it took about 0.001 seconds.
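A minimal usage sketch, assuming pyxirr's top-level xirr function and its pandas interop (pip install pyxirr):

import pandas as pd
import numpy as np
from pyxirr import xirr

# Same shape of data as in the question: 30 years of daily cash flows
dates = pd.date_range('2022-01-01', '2052-01-01', freq='D')
amounts = np.random.randint(10000, size=len(dates)).astype(float)
amounts[0] = -10000  # initial outflow

print(xirr(dates, amounts))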
pyxirr creator here. The library has been used in a financial project for over a year, but I only recently found the time to publish it. We had the task of quickly calculating XIRR for various portfolios and existing implementations quickly became a bottleneck. pyxirr also mimics some numpy-financial functions and works much faster.
The XIRR implementation in Excel is not always correct. In edge cases the algorithm does not converge and shows an incorrect result instead of an error or NA. The result can be checked with the xnpv function, xnpv(xirr_rate, dates, values), which should be close to zero. Similarly, you can check irr using the npv function, npv(irr_rate, values), but note the difference in npv calculation between Excel and numpy-financial.
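A small sketch of that check, assuming pyxirr's xirr and xnpv functions (values made up):

from datetime import date
from pyxirr import xirr, xnpv

dates = [date(2022, 1, 1), date(2023, 1, 1), date(2024, 1, 1)]
values = [-1000.0, 500.0, 600.0]
rate = xirr(dates, values)
print(xnpv(rate, dates, values))  # should be close to zero if the rate converged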
Taking a look at the implementation on their GitHub, it is pretty evident to me that the npf.irr() function is implemented well. Your alternative seems to be to implement the function yourself using NumPy operations, but I am doubtful that it is (a) easy to accomplish or (b) even possible in pure Python.
NumPy Financial seems to be doing its implementation using eigenvalues, which means it is performing complex mathematical operations. Perhaps, if you are not bound to Python, consider Microsoft's C# implementation of IRR and see if that works faster. I suspect it uses an iterative root-finding method seeded by your guess, so it may indeed be quicker than NumPy Financial.
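For context, a sketch of the eigenvalue-based approach the answer describes (essentially what numpy's old irr did): the IRR is a root of the NPV polynomial in x = 1/(1+r), and np.roots finds polynomial roots as eigenvalues of the companion matrix, which is why the cost grows steeply with the number of cash flows.

import numpy as np

def irr_via_roots(cash_flows):
    # NPV(r) = sum_t cf[t] * x**t with x = 1/(1+r); np.roots wants the
    # highest-degree coefficient first, hence the reversal.
    roots = np.roots(cash_flows[::-1])
    x = roots[(roots.imag == 0) & (roots.real > 0)].real
    rates = 1.0 / x - 1.0
    # Like npf.irr, pick the real solution closest to zero
    return rates[np.argmin(np.abs(rates))] if rates.size else np.nan

print(irr_via_roots(np.array([-100.0, 39.0, 59.0, 55.0, 20.0])))  # ~0.281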
Your final alternative is to continue with what you have at the moment and just run it on a more powerful machine. On my machine, this operation took about 71 seconds, and it does not even have a GPU. I am sure more powerful computers, with parallelization, should be able to compute this much faster than that.
Take a look at the answer I provided here: https://stackoverflow.com/a/66069439/4045275. I didn't benchmark it against pyxirr.
My dataset is in hours and I need guidance on how to convert the standard deviation from hours to minutes.
In the code provided below, I only get the standard deviation in hours. How can I convert it to minutes?
np.sum(data['Time'] * data['kneel'] / np.sum(data['Time'])) * 60  # weighted mean, converted from hours to minutes
data['kneel'].std()  # standard deviation, still in hours
I had a similar problem: I wanted to calculate cosine distance fast, so I used the following solutions.
Try using Dask; it is pretty fast compared to pandas. Moreover, you can also try Numba, which JIT-compiles numerical Python code to make it run much faster.
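A minimal sketch of the Numba suggestion, applied to the cosine-distance case mentioned above (assumes numba is installed; the function name is illustrative):

import numpy as np
from numba import njit

@njit
def cosine_distance(a, b):
    # Explicit loop: numba compiles this to fast machine code
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for i in range(a.shape[0]):
        dot += a[i] * b[i]
        norm_a += a[i] * a[i]
        norm_b += b[i] * b[i]
    return 1.0 - dot / np.sqrt(norm_a * norm_b)

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(cosine_distance(a, b))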
While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):
np.mean(), np.std() (with ddof=0 or 1; see the sketch after this list)
statistics.mean(), statistics.pstdev() (and/or statistics.stdev)
the scipy.stats package
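A quick illustration of how the first two line up (the ddof difference is the main trap):

import statistics
import numpy as np

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(np.std(data))  # population form (ddof=0) -> 2.0
print(statistics.pstdev(data))  # matches np.std(data)
print(np.std(data, ddof=1))  # sample form -> ~2.14
print(statistics.stdev(data))  # matches np.std(data, ddof=1)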
That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(). It also highlights differences in the sum operator. That post is here:
why-is-statistics-mean-so-slow
I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
It appears the statistics module is primarily for those who have data in lists (or other forms), or for widely varying magnitudes such as [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any numpy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
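To make the sum-operator difference concrete, a small illustration (values chosen to trigger float64 rounding):

import math
import statistics
import numpy as np

data = [1e16, 1.0, -1e16]
print(np.sum(data))  # 0.0 -- the 1.0 is lost to float64 rounding
print(math.fsum(data))  # 1.0 -- compensated summation keeps it
print(statistics.mean(data))  # ~0.333 -- statistics sums with exact fractions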
Numerical algorithms generally have positive and negative aspects: some are faster, or more accurate, or require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for his/her application. Generally this is a balancing act between competing priorities and resources.
My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.
Why does NumPy duplicate features of SciPy?
From the SciPy FAQ What is the difference between NumPy and SciPy?:
In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.
It recommends using SciPy over NumPy:
In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.
When should I use the statistics library?
From the statistics library documentation:
The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Thus I would not use it for serious (i.e. resource intensive) computation.
What is the difference between statsmodels and SciPy?
From the statsmodels about page:
The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.
Thus you may have a requirement that SciPy is not able to fulfill, or is better fulfilled by a dedicated library.
For example the SciPy documentation for scipy.stats.probplot notes that
Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
Just a simple benchmark.
statistics vs NumPy
import statistics as stat
import numpy as np
from timeit import default_timer as timer

l = list(range(1_000_000))

# statistics (pure Python)
start = timer()
m, std = stat.mean(l), stat.stdev(l)
end = timer()
print(end - start, "sec elapsed.")
print("mean, std:", m, std)

# NumPy (note: np.std defaults to the population formula, ddof=0, while
# statistics.stdev uses the sample formula, ddof=1 -- hence the slightly
# different std values below)
start = timer()
m, std = np.mean(l), np.std(l)
end = timer()
print(end - start, "sec elapsed.")
print("mean, std:", m, std)
(No repeated runs, for brevity.)
Result
NumPy is way faster.
# statistics
3.603845 sec elapsed.
mean, std: 499999.5 288675.2789323441
# numpy
0.2315261999999998 sec elapsed.
mean, std: 499999.5 288675.1345946685
After taking a couple advanced statistics courses, I decided to code some functions/classes to just automate estimating parameters for different distributions via MLE. In Matlab, the below is something I easily coded once:
function [ params, max, confidence_interval ] = newroutine( fun, data, guesses )
    lh = @(x,data) -sum(log(fun(x,data))); % Gets log-likelihood from user-defined fun.
    options = optimset('Display', 'off', 'MaxIter', 1000000, 'TolX', 10^-20, 'TolFun', 10^-20);
    [theta, max1] = fminunc(@(x) lh(x,data), guesses, options);
    params = theta
    max = max1
end
Where I just have to correctly specify the underlying pdf equation as fun, and with more code I can calculate p-values, confidence-intervals, etc.
With Python, however, all the sources I've found on MLE automation (for ex., here and here) insist that the easiest way to do this is to delve into OOP using a subclass of statsmodels' GenericLikelihoodModel, which seems way too complicated for me. My reasoning is that, since the log-likelihood can be automatically created from the pdf (at least for the vast majority of distributions), and scipy.stats.<random_dist>.fit() already easily returns MLE estimates, it seems ridiculous to have to write ~30 lines of class code each time you have a new distribution to fit.
I realize that doing it the way the two links suggest allows you to automatically tap into statsmodels' functions, but it honestly does not seem simpler than tapping into scipy oneself and writing much simpler functions.
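For example, the scipy route is essentially a one-liner per distribution (gamma here is arbitrary):

from scipy import stats

data = stats.gamma.rvs(a=2.0, scale=3.0, size=1000, random_state=0)
print(stats.gamma.fit(data))  # MLE of (shape, loc, scale)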
Am I missing an easier way to perform basic MLE, or is there a real good reason for the way statsmodels does this?
I wrote the first post outlining the various methods, and I think it is fair to say that while I recommend the statsmodels approach, I did so to leverage the postestimation tools it provides and to get standard errors every time a model is estimated.
When using minimize, the Python equivalent of fminunc (as you outline in your example), oftentimes I am forced to use "Nelder-Mead" or some other gradient-free method to get convergence. Since I need standard errors for statistical inference, this entails an additional step using numdifftools to recover the Hessian. So in the end, the method you propose has its complications too (for my work). If all you care about is the maximum likelihood estimate and not inference, then the approach you outline is probably best, and you are correct that you don't need the machinery of statsmodels.
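For concreteness, a minimal sketch of that scipy-only route, using a normal pdf as the example (names are illustrative):

import numpy as np
from scipy import stats
from scipy.optimize import minimize

data = stats.norm.rvs(loc=5.0, scale=2.0, size=1000, random_state=0)

def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # keep the optimizer in the valid region
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Gradient-free method, as discussed above
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,), method="Nelder-Mead")
print(result.x)  # MLE of (mu, sigma)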
FYI: in a later post, I use your approach combined with autograd for significant speedups of big maximum likelihood models. I haven't successfully gotten this to work with statsmodels.
I have a dataframe of values from various samples from two groups. I performed a scipy.stats ttest on these, which works perfectly, but I am a bit concerned that running so many tests may introduce multiple-testing error.
I wonder how to implement multiple testing correction (MTC) here. Is there some function in scipy or statsmodels that performs the tests and applies MTC directly to the resulting series of p-values, or can I safely apply an MTC function to a list of p-values?
I know that statsmodels may include such functions, but what it has in power it unfortunately lacks in manageability and documentation (that's not the fault of the developers; there are only three of them for such a huge project). Anyway, I am a little stuck here, so I'll gladly take any suggestions. I didn't ask this on Cross Validated because it is more about the implementation than the statistics.
Edit 9th Oct 2019:
this link works as of today
https://www.statsmodels.org/stable/generated/statsmodels.stats.multitest.multipletests.html
original answer (returns 404 now)
statsmodels.sandbox.stats.multicomp.multipletests takes an array of p-values and returns the adjusted p-values. The documentation is pretty clear.
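A short usage sketch with the current import path from the link above (p-values are made up):

import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(p_adjusted)  # Benjamini-Hochberg adjusted p-values
print(reject)  # which nulls are rejected at alpha = 0.05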
So, I've been spending some time looking for a way to get adjusted p-values (aka corrected p-values, q-values, FDR) in Python, but I haven't really found anything. There's the R function p.adjust, but I would like to stick to Python coding, if possible. Is there anything similar for Python?
If this is somehow a bad question, sorry in advance! I did search for answers first, but found none (except a Matlab version)... Any help is appreciated!
It is available in statsmodels.
http://statsmodels.sourceforge.net/devel/stats.html#multiple-tests-and-multiple-comparison-procedures
http://statsmodels.sourceforge.net/devel/generated/statsmodels.sandbox.stats.multicomp.multipletests.html
and some explanations, examples and Monte Carlo
http://jpktd.blogspot.com/2013/04/multiple-testing-p-value-corrections-in.html
According to the biostathandbook, BH (the Benjamini-Hochberg correction) is easy to compute:
import numpy as np
from scipy.stats import rankdata

def fdr(p_vals):
    # Benjamini-Hochberg: adjusted p = p * n / rank(p)
    p_vals = np.asarray(p_vals)  # elementwise ops need an array, not a list
    ranked_p_values = rankdata(p_vals)
    fdr = p_vals * len(p_vals) / ranked_p_values
    fdr[fdr > 1] = 1  # adjusted p-values are capped at 1
    return fdr
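Example call (made-up p-values); note that this simple version does not enforce the monotonicity step that statsmodels' fdr_bh applies, so results can differ slightly:

print(fdr([0.001, 0.008, 0.039, 0.041, 0.042, 0.06]))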
You can try the rpy2 module, which allows you to call R functions from Python (by the way, a basic search turns up How to implement R's p.adjust in Python).
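A hedged sketch of that route (assumes R and rpy2 are installed; rpy2 exposes R's p.adjust as p_adjust, since dots become underscores):

from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import FloatVector

r_stats = importr('stats')
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06]
print(list(r_stats.p_adjust(FloatVector(p_values), method='BH')))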
Another possibility is to look at the maths and redo it yourself, because it is still relatively easy.
Apparently there is an ongoing implementation in statsmodels: http://statsmodels.sourceforge.net/ipdirective/_modules/scikits/statsmodels/sandbox/stats/multicomp.html. Maybe it is already usable.
You mentioned q-values in your question, and no answer provided a link that addresses this. I believe this package (at least it seems so from the documentation) calculates q-values in Python:
https://puolival.github.io/multipy/
and so does this one:
https://github.com/nfusi/qvalue