Scipy fisher_exact test taking a very long time - python

I'm using scipy version 1.0.0.
import scipy as sp
import scipy.stats  # scipy does not auto-import its submodules
x = [[5829225, 5692693], [5760959, 5760959]]
sp.stats.fisher_exact(x)
For the table above, scipy does not return anything; it just keeps running.
What can be the reason for that, and how can I fix it?
In R, by contrast, it returns a p-value almost immediately:
a = matrix(c(5829225,5692693,5760959,5760959), nrow=2)
fisher.test(a)

From the notes in the documentation:
The calculated odds ratio is different from the one R uses. This scipy implementation returns the (more common) “unconditional Maximum Likelihood Estimate”, while R uses the “conditional Maximum Likelihood Estimate”.
For tables with large numbers, the (inexact) chi-square test implemented in the function chi2_contingency can also be used.
As DSM's comment mentioned, it's probably just very slow for your large values. And since the notes call out large numbers, you might try the alternative they suggest:
>>> chi2, p, dof, expected = sp.stats.chi2_contingency(x)
>>> p
6.140729432506709e-178
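If you want to sanity-check the chi-square approximation against the exact test, here is a minimal sketch on a smaller table (the numbers are illustrative, not from the question) where fisher_exact finishes quickly:
import scipy.stats as stats

small = [[500, 460], [400, 440]]
p_exact = stats.fisher_exact(small)[1]       # exact p-value, fast at this scale
p_approx = stats.chi2_contingency(small)[1]  # chi-square approximation
# for counts of this size the two p-values should already be quite close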

Related

Scipy quad returns (near) zero, even with points argument

Sometimes scipy.integrate.quad wrongly returns near-0 values. This has been addressed in this question, and seems to happen when the integration routine doesn't evaluate the function in the narrow range where it is significantly different from 0. In this and similar questions, the accepted solution was always to use the points parameter to tell scipy where to look. However, for me, this seems to actually make things worse.
Integration of exponential distribution pdf (answer should be just under 1):
from scipy import integrate
import numpy as np
t2=.01
#initial problem: fails when f is large
f=5000000
integrate.quad(lambda t:f*np.exp(-f*(t2-t)),0,t2)
#>>>(3.8816838175855493e-22, 7.717972744727115e-22)
Now, the "fix" makes it fail, even on smaller values of f where the original worked:
f=2000000
integrate.quad(lambda t:f*np.exp(-f*(t2-t)),0,t2)
#>>>(1.00000000000143, 1.6485317987792634e-14)
integrate.quad(lambda t:f*np.exp(-f*(t2-t)),0,t2,points=[t2])
#>>>(1.6117047218907458e-17, 3.2045611390981406e-17)
integrate.quad(lambda t:f*np.exp(-f*(t2-t)),0,t2,points=[t2,t2])
#>>>(1.6117047218907458e-17, 3.2045611390981406e-17)
What's going on here? How can I tell scipy what to do so that this will evaluate for arbitrary values of f?
It's not a generic solution, but I've been able to fix this for the given function by passing points built with np.geomspace, which guides the numerical algorithm toward heavier sampling in the interesting region. I'll leave this open for a little bit to see if anyone finds a generic solution.
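For example, here is the idea applied to the original failing case (the 40-point geometric grid is an arbitrary but workable choice):
from scipy import integrate
import numpy as np

t2 = .01
f = 5000000
# cluster breakpoints geometrically near t=t2, where the integrand spikes
pts = t2 - np.geomspace(1/f, t2, 40)
integrate.quad(lambda t: f*np.exp(-f*(t2-t)), 0, t2, points=pts)
# the result should now be just under 1, as expected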
Generate random values for t2 and f, then check the min and max values for the subset that should be very close to 1:
>>> t2s=np.exp((np.random.rand(20)-.5)*10)
>>> fs=np.exp((np.random.rand(20)-.1)*20)
>>> min(integrate.quad(lambda t:f*np.exp(-f*(t2-t)),0,t2,points=t2-np.geomspace(1/f,t2,40))[0] for f in fs for t2 in t2s if f>(1/t2)*10)
0.9999621825009719
>>> max(integrate.quad(lambda t:f*np.exp(-f*(t2-t)),0,t2,points=t2-np.geomspace(1/f,t2,40))[0] for f in fs for t2 in t2s if f>(1/t2)*10)
1.000000288722783

Matching randomization seed between Python and R [duplicate]

Python, NumPy and R all use the same algorithm (Mersenne Twister) for generating random number sequences. Thus, theoretically speaking, setting the same seed should result in the same random number sequence in all three. This is not the case. I think the three implementations initialize the generator with different parameters, causing this behavior.
R
>set.seed(1)
>runif(5)
[1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Python
In [3]: random.seed(1)
In [4]: [random.random() for x in range(5)]
Out[4]:
[0.13436424411240122,
0.8474337369372327,
0.763774618976614,
0.2550690257394217,
0.49543508709194095]
NumPy
In [23]: import numpy as np
In [24]: np.random.seed(1)
In [25]: np.random.rand(5)
Out[25]:
array([ 4.17022005e-01, 7.20324493e-01, 1.14374817e-04,
3.02332573e-01, 1.46755891e-01])
Is there some way in which the NumPy and Python implementations could produce the same random number sequence? Of course, as some comments and answers point out, one could use rpy. What I am specifically looking for is how to fine-tune the seeding in the respective calls in Python and NumPy to get the same sequence.
Context: The concern comes from an EDX course offering in which R is used. In one of the forums, it was asked if Python could be used and the staff replied that some assignments would require setting specific seeds and submitting answers.
Related:
Comparing Matlab and Numpy code that uses random number generation: from this it seems that the underlying NumPy and Matlab implementations are similar.
python vs octave random generator: this question comes fairly close to the intended answer. Some sort of wrapper around the default state generator is required.
You can use rpy2 to call R from Python. Here is a demo; the NumPy array data shares memory with x in R:
import numpy as np
import rpy2.robjects as robjects

data = robjects.r("""
set.seed(1)
x <- runif(5)
""")
print(np.array(data))

data[1] = 1.0
print(robjects.r["x"])
I realize this is an old question, but I've stumbled upon the same problem recently, and created a solution which can be useful to others.
I've written a random number generator in C, and linked it to both R and Python. This way, the random numbers are guaranteed to be the same in both languages since they are generated using the same C code.
The program is called SyncRNG and can be found here: https://github.com/GjjvdBurg/SyncRNG.
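Going from memory of the project README, usage on the Python side looks roughly like this (treat the exact class and method names as an assumption and check the repository):
from SyncRNG import SyncRNG  # assumed API per the SyncRNG README

s = SyncRNG(seed=123456)
print([s.randi() for _ in range(5)])  # should match the R package seeded with 123456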

fsolve problems with the starting point

I'm using fsolve to solve a nonlinear equation. My problem is that, depending on the starting point, the solutions change, and I am not sure that the ones I found are the most reasonable.
This is the code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve, brentq,newton
A = np.arange(0.05,0.95,0.01)
PHI = np.deg2rad(np.arange(0,90,1))
def f(b):
    return np.angle((1+3*a**4-3*a**2)+(a**4-a**6)*(np.exp(2j*b)+2*np.exp(-1j*b))+(a**2-2*a**4+a**6)*(np.exp(-2j*b)+2*np.exp(1j*b)))-Phi

B = np.zeros((len(A),len(PHI)))
for i in range(len(A)):
    for j in range(len(PHI)):
        a = A[i]
        Phi = PHI[j]
        b = fsolve(f, 1)
        B[i,j] = b
I fixed x0 = 1 because it seems to give the most reasonable values. But sometimes the method doesn't seem to converge, and the resulting values are far too big.
What can I do to find the best solution?
Many thanks!
The eternal issue with turning non-linear solvers loose is that you need a really good understanding of your function, your initial guess, the solver itself, and the problem you are trying to address.
I note that there are many (a, Phi) combinations where your function does not have real roots. You should do some math, directed by the actual problem you are trying to solve, and determine where the function should have roots. Not knowing the actual problem, I can't do that for you.
Also, as noted in a (since deleted) answer, this is cyclical in b, so a bounded solver (such as scipy.optimize.minimize with method='L-BFGS-B') might help keep things under control. Note that to find roots with a minimizer, you minimize the square of your function. If the minimum found is not close to zero (with "close" defined by your problem), the actual roots may be a complex-conjugate pair.
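In code, the bounded-minimizer idea might look like the following sketch (the (a, Phi) values are illustrative, not taken from your grid):
import numpy as np
from scipy.optimize import minimize

a, Phi = 0.5, np.deg2rad(30)  # illustrative values

def f(b):
    z = ((1 + 3*a**4 - 3*a**2)
         + (a**4 - a**6)*(np.exp(2j*b) + 2*np.exp(-1j*b))
         + (a**2 - 2*a**4 + a**6)*(np.exp(-2j*b) + 2*np.exp(1j*b)))
    return np.angle(z) - Phi

# minimize the squared residual with b restricted to one period
res = minimize(lambda b: f(b[0])**2, x0=[1.0],
               method='L-BFGS-B', bounds=[(0, 2*np.pi)])
# res.fun near zero means res.x[0] is a real root; otherwise there is
# likely no real root for this (a, Phi) pair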
Good luck.

Fitting a distribution to data: how to penalize "bad" parameter estimates?

I'm using scipy's least-squares optimization to fit an exponentially modified Gaussian distribution to a set of reaction time measurements. In general, it works well, but sometimes the optimization goes off the rails and chooses a crazy value for a parameter, and the resulting plot clearly doesn't fit the data well. The problems generally look like they arise from floating-point precision errors: we head off into 0 or inf or nan-land.
I'm thinking of doing two things:
Using the parameters to simultaneously fit a CDF and PDF to the data; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
Somehow taking into account the distance from the initial parameter estimates (generated by the method-of-moments approach on the Wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
Combining the PDF and CDF fits sounds pretty straightforward; the scales of the errors will even be generally the same. But bringing the initial parameter estimates in, I'm not quite sure it's even a good idea -- but if it is:
What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?
Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?
One example dataset
As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.
array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
MLE Update
I had a bunch of problems getting my MLE estimate to converge to a "reasonable" value, until I discovered this: if X contains at least one nan, np.sum(X) returns nan when X is a numpy array, but not when X is a pandas Series (which skips NaNs). So the sum of the log-likelihood was doing stupid things when the parameters started to go out of bounds.
Added a np.asarray() call and everything is great!
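For what it's worth, the difference is easy to demonstrate (a quick sketch of the behavior described above):
import numpy as np
import pandas as pd

x = np.array([1.0, np.nan])
print(np.sum(x))            # nan
print(pd.Series(x).sum())   # 1.0 -- pandas skips NaN by default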
This should have been a comment, but I ran out of space.
I think a maximum-likelihood fit is probably the most appropriate approach here. The ML method is already implemented for many distributions in scipy.stats. For example, you can find the MLE of a normal distribution by calling scipy.stats.norm.fit, and the MLE of an exponential distribution in a similar way. Combining these two resulting MLE parameters should give you a pretty good starting parameter vector for the ex-Gaussian ML fit. In fact, I would imagine most of your data is quite nicely normally distributed; if that is the case, the ML parameter estimates for the normal distribution alone should give you a pretty good starting point.
Since the ex-Gaussian only has 3 parameters, I don't think an ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it would be easier to show a real example.
Alright, here you go:
>>> import scipy.special as sse
>>> import scipy.stats as sss
>>> import scipy.optimize as so
>>> from numpy import *
>>> def eg_pdf(p, x): #defines the PDF
...     m = p[0]
...     s = p[1]
...     l = p[2]
...     return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
>>> xo=array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
...            695., 757., 522., 535., 419., 478., 666., 637., 569.,
...            859., 883., 551., 652., 378., 801., 718., 479., 544.])
>>> sss.norm.fit(xo) #get the starting parameter vector from the normal MLE
(624.22222222222217, 132.23977474531389)
>>> def llh(p, f, x): #defines the negative log-likelihood function
...     return -sum(log(f(p,x)))
>>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo)) #yeah, the data is not good
Warning: Maximum number of function evaluations has been exceeded.
array([ 6.14003407e+02, 1.31843250e+02, 9.79425845e-02])
>>> przt=so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000) #so, we raise the function-evaluation limit
Optimization terminated successfully.
Current function value: 170.195924
Iterations: 376
Function evaluations: 681
>>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
400.02921290185645
>>> llh(przt, eg_pdf, xo) #quite an improvement over the initial guess
170.19592431051217
>>> przt
array([ 6.14007039e+02, 1.31844654e+02, 9.78934519e-02])
The optimizer used here (fmin, i.e. the Nelder-Mead simplex algorithm) does not use any gradient information and usually works much more slowly than optimizers that do. It appears that the derivative of the negative log-likelihood function of the exponential Gaussian can be written in closed form fairly easily. If so, optimizers that utilize gradient/derivative information (such as fmin_bfgs) will be a better and more efficient choice.
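For illustration, the call shape would be something like this (a sketch; without an analytic fprime the gradient is approximated numerically, which partly defeats the purpose):
>>> przt_bfgs = so.fmin_bfgs(llh, przt, args=(eg_pdf, xo)) #pass fprime=... once you derive the gradient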
The other thing to consider is parameter constraints. By definition, sigma and lambda have to be positive for the exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize over a transformed parameterization:
>>> def eg_pdf2(p, x): #same PDF, but with log-transformed scale parameters
...     m = p[0]
...     s = exp(p[1])
...     l = exp(p[2])
...     return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
Due to the functional invariance property of the MLE, the MLE of this parameterization gives the same fit as the original eg_pdf. There are other transformations you can use, besides exp(), to map (-inf, +inf) to (0, +inf).
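Continuing the session above, a minimal sketch of fitting the transformed parameterization and mapping back to the original scale (the starting values reuse the normal-fit numbers):
>>> pt = so.fmin(llh, array([624.222, log(132.240), log(1e-6)]), (eg_pdf2, xo), maxfun=1000)
>>> m_hat, s_hat, l_hat = pt[0], exp(pt[1]), exp(pt[2]) #back-transform sigma and lambda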
You can also consider Lagrange multipliers: http://en.wikipedia.org/wiki/Lagrange_multiplier.

FloatingPointError from PyMC in sampling from a Dirichlet distribution

After being unsuccessful in using decorators to define the stochastic object for the "logarithm of an exponential random variable", I decided to manually write the code for this new distribution using pymc.stochastic_from_dist. The model that I am trying to implement is available here (the first model).
Now, when I try to sample log(alpha) using MCMC Metropolis with a normal distribution as the proposal (the sampling method stated in the referenced figure), I get the following error:
File "/Library/Python/2.7/site-packages/pymc/distributions.py", line 980, in rdirichlet
return (gammas[0]/gammas[0].sum())[:-1]
FloatingPointError: invalid value encountered in divide
When the sampling does not run into this error, the resulting histograms match the ones in this paper. My hierarchical model is:
"""
A Hierarchical Bayesian Model for Bags of Marbles
logalpha ~ logarithm of an exponential distribution with parameter lambd
beta ~ Dirichlet([black and white ball proportions]:vector of 1's)
theta ~ Dirichlet(alpha*beta(vector))
"""
import numpy as np
import pymc
from scipy.stats import expon
lambd=1.
__all__=['alpha','beta','theta','logalpha']
#------------------------------------------------------------
# Set up pyMC model: logExponential
# 1 parameter: (alpha)
def logExp_like(x,explambda):
    """log-likelihood for logExponential"""
    return -lambd*np.exp(x)+x

def rlogexp(explambda, size=None):
    """random variable from logExponential"""
    sample=np.random.exponential(explambda,size)
    logSample=np.log(sample)
    return logSample

logExponential=pymc.stochastic_from_dist('logExponential',logp=logExp_like,
                                         random=rlogexp,
                                         dtype=np.float,
                                         mv=False)
#------------------------------------------------------------
#Defining model parameteres alpha and beta.
beta=pymc.Dirichlet('beta',theta=[1,1])
logalpha=logExponential('logalpha',lambd)
@pymc.deterministic(plot=False)
def multipar(a=logalpha,b=beta):
    out=np.empty(2)
    out[0]=(np.exp(a)*b)
    out[1]=(np.exp(a)*(1-b))
    return out
theta=pymc.Dirichlet('theta',theta=multipar)
And my test sampling code is:
from pymc import Metropolis
from pymc import MCMC
from matplotlib import pyplot as plt
import HBM
import numpy as np
import pymc
import scipy
M=MCMC(HBM)
M.use_step_method(Metropolis,HBM.logalpha, proposal_sd=1.,proposal_distribution='Normal')
M.sample(iter=1000,burn=200)
When I check the values of theta passed to the gamma distribution in line 978 of distributions.py, I see that they are not zero, just very small. So how can I prevent this floating-point error?
I found this nugget in their documentation:
The stochastic variable cutoff cannot be smaller than the largest element of D, otherwise D’s density would be zero. The standard Metropolis step method can handle this case without problems; it will propose illegal values occasionally, but these will be rejected.
This would lead me to believe that dtype=np.float (which essentially has the same range as Python's float) may not be the way you want to go. The documentation says it needs to be a NumPy dtype, but it really just needs to be something that converts to a NumPy dtype object, and in Python 2 (correct me if I'm wrong) the numeric types were fixed-size, so they are effectively the same. Maybe utilizing the Decimal module would be an option. That way you can set the level of precision to encapsulate the expected value ranges, and pass it to your extended stochastic method, where it would be converted.
from decimal import Decimal, getcontext
getcontext().prec = 15
dtype=Decimal
I don't know whether this would still be truncated once the numpy library got hold of it, or whether it would respect the inherited level of precision. I have no accurate way of testing this, but give it a try and let me know how it works for you.
Edit: I tested the notion of precision inheritance and it would seem to hold:
>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 10
>>> Decimal(1) / Decimal(7)
Decimal('0.1428571429')
>>> np.float(Decimal(1) / Decimal(7))
0.1428571429
>>> getcontext().prec = 15
>>> np.float(Decimal(1) / Decimal(7))
0.142857142857143
>>>
If you do get very small numbers, they might simply be too small for a float. This is typically what working in logarithms is meant to avoid. What if you use dtype=np.float64?
As you have suggested at the end of your question, the issue is numbers so small that they are effectively cast to 0 as floats. One solution could be to tweak the source code a little: replace the division with, for example, np.divide, and clamp too-small values to a given threshold before normalizing.
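A minimal sketch of that clamping idea (illustrative only; this mimics the rdirichlet normalization rather than patching the actual pymc source):
import numpy as np

def safe_rdirichlet(theta):
    # draw gamma variates as rdirichlet does, then guard the normalization
    gammas = np.random.standard_gamma(theta)
    # clamp the denominator so a vanishingly small sum cannot yield 0/0
    total = max(gammas.sum(), np.finfo(float).tiny)
    return (gammas / total)[:-1]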
