So, I've been spending some time looking for a way to get adjusted p-values (aka corrected p-values, q-values, FDR) in Python, but I haven't really found anything. There's the R function p.adjust, but I would like to stick to Python coding, if possible. Is there anything similar for Python?
If this is somehow a bad question, sorry in advance! I did search for answers first, but found none (except a Matlab version)... Any help is appreciated!
It is available in statsmodels.
http://statsmodels.sourceforge.net/devel/stats.html#multiple-tests-and-multiple-comparison-procedures
http://statsmodels.sourceforge.net/devel/generated/statsmodels.sandbox.stats.multicomp.multipletests.html
and some explanations, examples, and Monte Carlo results:
http://jpktd.blogspot.com/2013/04/multiple-testing-p-value-corrections-in.html
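For example, a minimal usage sketch (the current import path is statsmodels.stats.multitest; the sandbox links above point to an older development location):

import numpy as np
from statsmodels.stats.multitest import multipletests

p_vals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])

# Benjamini-Hochberg FDR; other methods include 'bonferroni', 'holm', 'fdr_by', ...
reject, p_adjusted, _, _ = multipletests(p_vals, alpha=0.05, method='fdr_bh')
print(p_adjusted)
print(reject)   # which hypotheses are rejected at an FDR of 0.05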
According to the biostathandbook, the Benjamini-Hochberg (BH) adjustment is easy to compute:
import numpy as np
from scipy.stats import rankdata

def fdr(p_vals):
    # Benjamini-Hochberg: adjusted p = p * n / rank, clipped at 1.
    # Note: this simple version does not enforce monotonicity of the adjusted values.
    p_vals = np.asarray(p_vals)
    ranked_p_values = rankdata(p_vals)
    fdr = p_vals * len(p_vals) / ranked_p_values
    fdr[fdr > 1] = 1
    return fdr
You can try the rpy2 module, which lets you call R functions from Python (by the way, a basic search turns up "How to implement R's p.adjust in Python").
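For example, a small sketch of calling p.adjust through rpy2 (this assumes R and rpy2 are installed):

import rpy2.robjects as robjects
from rpy2.robjects.vectors import FloatVector

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

# Grab R's p.adjust and call it from Python
p_adjust = robjects.r['p.adjust']
adjusted = list(p_adjust(FloatVector(p_vals), method='BH'))
print(adjusted)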
Another possibility is to look at the maths and redo it yourself, because it is still relatively easy.
Apparently there is an ongoing implementation in the statsmodels sandbox: http://statsmodels.sourceforge.net/ipdirective/_modules/scikits/statsmodels/sandbox/stats/multicomp.html . Maybe it is already usable.
You mentioned q-values in your question, and no answer has provided a link that addresses them. This package (at least judging from its documentation) calculates q-values in Python:
https://puolival.github.io/multipy/
and also this one
https://github.com/nfusi/qvalue
I am currently working on a project where I need to decompose my system into observable and unobservable subsystems in an efficient way, so I was looking for a function that could help me with that.
PS: I know about this function, and it is not what I am looking for:
import control as co             # python-control
s = co.ss(A, B, C, D)            # state-space model from the system matrices A, B, C, D
sys, T = co.observable_form(s)   # observable canonical form and the transformation T
In this case, the system needs to be fully observable.
Thank you!
In case someone is looking for an answer to this: I found a great library that has a function that does just this. It is called "harold" and you can find it here. The function is called "staircase" and is equivalent to MATLAB's obsvf or ctrbf functions; it implements the staircase algorithm.
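For reference (and not as harold's actual API), here is a rough NumPy-only sketch of the underlying idea, a Kalman-style observability decomposition built from the observability matrix:

import numpy as np

def observability_decomposition(A, C, tol=1e-10):
    # The null space of the observability matrix is the unobservable subspace;
    # an orthonormal basis of its complement spans the observable directions.
    n = A.shape[0]
    obs = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    _, s, vh = np.linalg.svd(obs)
    rank = int(np.sum(s > tol))
    T = np.vstack([vh[:rank], vh[rank:]])   # observable coordinates first
    A_new = T @ A @ T.T                     # T is orthogonal, so inv(T) = T.T
    C_new = C @ T.T
    return A_new, C_new, T, rank            # A_new is block lower triangular, C_new = [C_obs, 0]

# Example: a 3-state system whose third state never reaches the output
A = np.array([[0., 1., 0.],
              [-2., -3., 0.],
              [1., 0., -1.]])
C = np.array([[1., 0., 0.]])
A_new, C_new, T, n_obs = observability_decomposition(A, C)
print(n_obs)   # 2 observable states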
I am running into a roadblock and would appreciate some help on this.
Problem Statement:
I am trying to calculate XIRR for a cash flow over 30 years in Python.
What have I tried so far:
However, none of the established libraries (like numpy and pandas) seem to have support for this. After doing some research, I learned through this source (https://vindeep.com/Corporate/XIRRCalculation.aspx) that, with some simple manipulation, XIRR can be calculated from IRR (a sketch of that manipulation follows the timing example below).
So all I need is a well-implemented IRR function. The functionality used to exist in numpy but has moved to another package (https://github.com/numpy/numpy-financial). While this package works, it is very, very slow. Here is a small test:
import pandas as pd
import numpy as np
import numpy_financial as npf
from time import time

# Generate some example data: one cash flow per day
t = pd.date_range('2022-01-01', '2037-01-01', freq='D')
cash_flows = np.random.randint(10000, size=len(t) - 1)
cash_flows = np.insert(cash_flows, 0, -10000)

# Calculate IRR and time it
start_timer = time()
npf.irr(cash_flows)
stop_timer = time()
print(f"Time taken to calculate IRR over 30 years of daily data: {round((stop_timer - start_timer) / 60, 2)} minutes")
One other alternative seems to be https://github.com/better/irr - however, this has an edge case bug that has not been addressed in over 4 years.
Can anyone kindly point me to a more stable implementation? It feels like such simple and commonly used functionality that the lack of a good, stable implementation surprises me. Can someone point to any good resources?
Thanks
Uday
Try the pyxirr package. Implemented in Rust, it is blazing fast: for a 30-year period it took about 0.001 seconds.
pyxirr creator here. The library has been used in a financial project for over a year, but I only recently found the time to publish it. We had the task of quickly calculating XIRR for various portfolios and existing implementations quickly became a bottleneck. pyxirr also mimics some numpy-financial functions and works much faster.
The XIRR implementation in Excel is not always correct: in edge cases the algorithm does not converge and shows an incorrect result instead of an error or NA. The result can be checked with the xnpv function: xnpv(xirr_rate, dates, values) should be close to zero. Similarly, you can check irr using the npv function: npv(irr_rate, values), but note the difference in NPV calculation between Excel and numpy-financial.
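For example, a small sketch (assuming pyxirr's top-level xirr and xnpv functions, which take parallel sequences of dates and amounts):

from datetime import date
from pyxirr import xirr, xnpv

dates = [date(2022, 1, 1), date(2023, 1, 1), date(2024, 1, 1)]
amounts = [-10_000, 2_500, 9_000]

rate = xirr(dates, amounts)
print(rate)

# Sanity check: the net present value at the computed rate should be close to zero
print(xnpv(rate, dates, amounts))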
Taking a look at the implementation on their GitHub, it is pretty evident to me that the npf.irr() function is implemented well. Your alternative seems to be to implement the function yourself using NumPy operations, but I am doubtful that (a) that is easy to accomplish or (b) it is possible to do efficiently in pure Python.
NumPy Financial does its calculation via eigenvalues (polynomial root finding), which means it is performing heavy numerical operations. Perhaps, if you are not bound to Python, consider Microsoft's C# implementation of IRR and see if that works faster. I suspect it uses an iterative, guess-based method, so it may indeed be quicker than NumPy Financial.
Your final alternative is to continue with what you have at the moment and just run it on a more powerful machine. On my machine, this operation took about 71 seconds, and it does not even have a GPU. More powerful computers, with parallelization, should be able to compute this much faster.
Look at the answer I provided here: https://stackoverflow.com/a/66069439/4045275.
I didn't benchmark it against pyxirr
I'm looking for a Python library to replace the rake function from "survey", an R library (https://www.rdocumentation.org/packages/survey/versions/4.0/topics/rake).
I have found and tried Quantipy, but the quality of the weights is poor compared to the weights generated with R on the same dataset.
I have also found PandaSurvey, but it does not seem to work correctly (and the documentation is very poor).
I am surprised not to find much on Google on this subject, since raking is an essential function if you are working with polls, and Python is a data-science language. But maybe I missed it.
Thank you very much!
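(For context, raking is iterative proportional fitting of the weights to known margins. Below is a rough pandas sketch of the idea, with made-up column and target names; it is only an illustration, not a replacement for survey::rake.)

import pandas as pd

def rake(df, targets, max_iter=100, tol=1e-8):
    # Start from uniform weights and alternately rescale them so the weighted
    # totals of each variable match the target margins.
    # `targets` maps a column name to a {category: target_total} dict.
    w = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_change = 0.0
        for col, margin in targets.items():
            current = w.groupby(df[col]).sum()
            factors = df[col].map(lambda cat: margin[cat] / current[cat])
            w = w * factors
            max_change = max(max_change, (factors - 1.0).abs().max())
        if max_change < tol:
            break
    return w

# Hypothetical usage: match weighted counts to known population margins
# weights = rake(df, {"sex": {"m": 490, "f": 510},
#                     "age_group": {"18-34": 300, "35-54": 350, "55+": 350}})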
I am looking for a definitive guide on formulating a CVXOPT quadratic programming problem with quadratic constraints. There are good documents provided here:
The problem statement I am dealing with is identical to the problem here:
What is the matrix G supposed to look like? I've formulated the problem as a system of linear equations, but looking at examples this does not appear to be correct.
The best resource I've found is https://courses.csail.mit.edu/6.867/wiki/images/a/a7/Qp-cvxopt.pdf, but the links for further reading at the end are dead.
I have an ipython notebook trying to use this programming method but it continually fails: https://gist.github.com/jaredvacanti/62010beda0ccfc20d2eac3c900858e50
Edit: I have edited the data source file in the notebook to provide access to the real data used in this optimization problem.
The notebook you have posted seems to have it all figured out. The problem I had is that the source file for data is not available.
Now to your question:
What is the matrix G supposed to look like? I've formulated the problem as a system of linear equations, but looking at examples this does not appear to be correct.
Rewrite your "linear equations" into matrix form, i.e.
2x + 2y = 4
x - y = 1
is equivalent to the matrix equation A [x, y]^T = b, where A has rows (2, 2) and (1, -1) and b = (4, 1).
One caveat when building this with cvxopt: matrix() is column-major, i.e. each inner list is one column, so this coefficient matrix is written matrix([[2., 1.], [2., -1.]]) and the right-hand side as matrix([4., 1.]).
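For completeness, here is a toy sketch of a full cvxopt QP with linear inequality constraints, showing where G and h enter (the numbers are made up purely to illustrate the layout):

from cvxopt import matrix, solvers

# solvers.qp solves:  minimize (1/2) x'Px + q'x  subject to  Gx <= h  and  Ax = b
# Toy problem: minimize x^2 + y^2  subject to  x + y = 1,  x >= 0,  y >= 0
# Remember that matrix() is column-major: each inner list is one column.
P = matrix([[2.0, 0.0], [0.0, 2.0]])    # (1/2) x'Px = x^2 + y^2
q = matrix([0.0, 0.0])
G = matrix([[-1.0, 0.0], [0.0, -1.0]])  # rows (-1, 0) and (0, -1): -x <= 0, -y <= 0
h = matrix([0.0, 0.0])
A = matrix([[1.0], [1.0]])              # the 1x2 row (1, 1): x + y = 1
b = matrix([1.0])

sol = solvers.qp(P, q, G, h, A, b)
print(sol['x'])                         # approximately [0.5, 0.5]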
I started learning machine learning in Python. I plotted the following graph following the author's notes:
One can notice that up until 3.5 weeks, there is a linear regression, and beyond it, there is a 2nd-order polynomial regression. The textbook question is: when will there be 100 000 requests/day?
The author says to use fsolve
# myFunct - 100000 works if myFunct is a numpy poly1d (the result is again callable); /(7*24) converts hours to weeks
reached_max = fsolve(myFunct - 100000, x0=800) / (7 * 24)
The last point has x = 743 (hours). I am confused when the author says that we need to provide an initial starting position (e.g. 800). Why can it be any number after 743 and why does it have to be after 743? Thank you for your help!
The answer is that x0 serves as an educated initial guess for the root. fsolve does not really have a default starting point (x0 is a required argument), and if the guess is too far from the root you want, it may converge to a different root or fail to converge at all. Here, the fitted curve minus 100,000 can cross zero in more than one place, so starting past the last observed point (x = 743 hours) presumably steers fsolve toward the future crossing rather than an earlier, irrelevant one.
To better understand the maths behind finding the root, I encourage you to read about Newton's method, which serves as a good introduction. Of course, fsolve uses more complicated and efficient techniques, but this should be a good baseline.
Hope this was clear and has helped you!
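For instance, a tiny sketch with made-up polynomial coefficients showing how the starting point decides which root fsolve returns:

import numpy as np
from scipy.optimize import fsolve

# Hypothetical fitted model: requests per hour as a 2nd-order polynomial of time in hours.
# The coefficients are invented purely to illustrate the role of x0.
model = np.poly1d([0.2, -10.0, 500.0])

# We want model(x) = 100000, i.e. a root of model - 100000 (again a poly1d, hence callable)
target = model - 100000

print(fsolve(target, x0=800))    # guess past the last data point -> the future crossing (~731 h)
print(fsolve(target, x0=-800))   # a distant guess -> the other, irrelevant root (~-681 h)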