SciPy's Fisher's Exact P-Value Differs from Exact Formula

I recently ran Fisher's exact test on a 2x2 contingency table using SciPy's built-in fisher_exact() function. I'm using the example code from the SciPy docs:
>>> from scipy.stats import fisher_exact
>>> import numpy as np
>>> table = np.array([[6, 2], [1, 4]])
>>> res = fisher_exact(table, alternative='two-sided')
>>> res[1]
0.10256410256410257
Then I used the formula for Fisher's exact test: the product of the factorials of the four marginal sums (row and column totals), divided by the product of the factorials of the four cell counts and the factorial of the grand total. Here is a website with that formula: https://www.statology.org/fishers-exact-test/
Here is my code implementing the formula:
>>> import numpy as np
>>> from scipy.stats import fisher_exact
>>> from scipy.special import factorial
>>> table = np.array([[6, 2], [1, 4]])
>>> res = fisher_exact(table)
>>> fishers_plug_in = (factorial(table[0][0]+table[0][1])*factorial(table[0][0]+table[1][0])*factorial(table[1][0]+table[1][1])*
...     factorial(table[0][1]+table[1][1])/(factorial(table[0][0])*factorial(table[0][1])*
...     factorial(table[1][0])*factorial(table[1][1])*factorial(table[0][0]+table[0][1]+table[1][0]+table[1][1])))
>>> print(fishers_plug_in)
0.08158508158508158
Does anyone have any idea why the two p-values differ? My best guess is that SciPy's fisher_exact() uses some sort of approximation for the factorials, perhaps one intended for larger contingency tables, but I can't find any documentation about this.

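As commenter Severin Pappadeux has pointed out, the formula I used does not give the two-sided p-value. It yields the probability of the observed table alone, which is only one term of a tail sum: a p-value adds up the probabilities of all tables with the same margins that are at least as extreme (in one direction for a one-sided test, in both for the two-sided test).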
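To make the difference concrete, here is a minimal sketch using only the standard library. The helper names table_probability and fisher_p_values are illustrative, not part of SciPy. It assumes the common convention that the two-sided p-value sums the probabilities of all tables (with the same margins) that are no more likely than the observed one; the small relative tolerance is an assumption to guard against floating-point ties.

```python
from math import comb

def table_probability(a, row1, col1, total):
    """Probability of the 2x2 table whose top-left cell is `a`, with fixed
    margins. Equivalent to the factorial formula from the question."""
    return comb(row1, a) * comb(total - row1, col1 - a) / comb(total, col1)

def fisher_p_values(table):
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d
    # Range of top-left values that keep every cell non-negative
    lo = max(0, col1 - (total - row1))
    hi = min(row1, col1)
    probs = {k: table_probability(k, row1, col1, total) for k in range(lo, hi + 1)}
    p_obs = probs[a]
    one_sided_greater = sum(p for k, p in probs.items() if k >= a)
    # Two-sided: every table no more likely than the observed one
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    return p_obs, one_sided_greater, two_sided

p_obs, p_greater, p_two_sided = fisher_p_values([[6, 2], [1, 4]])
print(p_obs)        # about 0.0816, the value from the factorial formula
print(p_greater)    # about 0.0862
print(p_two_sided)  # about 0.1026, the value fisher_exact reported above
```

For this table, the factorial formula reproduces only the first number, while the two-sided sum reproduces the 0.1026 shown in the question.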

Related

Why is the percentile() method not calculating the appropriate percentile? The 25th percentile for this data should be 1.5, or 2 if rounded off

import numpy as np
value = [1, 2, 3, 4, 5, 6]
x = np.percentile(value, 25)
print(x)
I am calculating the percentile with this code to cross-verify:
import math
def my_percentile(data, percentile):
    # Rank-based lookup on the sorted data, with no interpolation
    n = len(data)
    p = n * percentile / 100
    if p.is_integer():
        return sorted(data)[int(p)]
    else:
        return sorted(data)[int(math.ceil(p)) - 1]
t = [1, 2, 3, 4, 5, 6]
per = my_percentile(t, 25)
print(per)

There's more than one way to calculate quartiles. Wikipedia has a good summary under quantiles.
The values returned by numpy's default calculation match those returned by, for example, R's summary() function.
You need to do one of these things.
Switch to numpy.percentile's default way of calculating quartiles,
provide a value to numpy.percentile's parameter interpolation, or
write your own custom function.
Valid values for interpolation (the parameter was renamed method in NumPy 1.22) are listed in the numpy.percentile documentation.
I didn't suggest a value for interpolation, because you didn't include your expected output in your question. You need to consider the effect of your decision on all quartiles, not just on one.
(I don't think scipy.stats.percentileofscore() will work for you.)
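As a rough illustration of how the choice of method changes the answer, the sketch below runs the question's data through three of the documented options. It assumes NumPy 1.22 or later, where the keyword is method; on older versions the same strings are passed as interpolation.

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6]

# Default ("linear"): interpolate at position (n - 1) * 0.25 = 1.25,
# i.e. a quarter of the way from 2 to 3
linear = np.percentile(values, 25)                   # 2.25

# "lower" / "higher": snap to the neighbouring data point instead
lower = np.percentile(values, 25, method="lower")    # 2
higher = np.percentile(values, 25, method="higher")  # 3

print(linear, lower, higher)
```

The hand-written my_percentile in the question returns 2 for this input, which agrees with method="lower" here but not with the default.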

How to calculate the product of the diagonal in Sympy?

I have a sympy matrix, that is, a matrix containing both real and symbolic entries, and I want to calculate the product of its diagonal entries. An example:
import numpy as np
import sympy as sp
from sympy import *
from sympy.matrices import Matrix
x, y = sp.symbols('x y')
A=sp.Matrix([[1,0,1],[2,3,1],[y,5,x]])
Now the desired result would be 3x. Of course I could do that with a for loop, but is there some other, cleaner solution?

You can use diagonal to get the diagonal elements and prod to multiply them:
In [46]: A.diagonal()
Out[46]: [1 3 x]
In [47]: prod(_)
Out[47]: 3⋅x

Oscar Benjamin provided a slick solution. This will work as well (using np.prod, since the np.product alias was removed in NumPy 2.0):
np.prod(np.diag(A))
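For reference, a self-contained version of the diagonal-product approach might look like the sketch below. The symbols x and y are declared explicitly, and the explicit loop is included only as a cross-check.

```python
import sympy as sp

x, y = sp.symbols("x y")
A = sp.Matrix([[1, 0, 1], [2, 3, 1], [y, 5, x]])

# diagonal() returns the main diagonal as a 1 x n matrix;
# sp.prod multiplies its entries together
diag_product = sp.prod(A.diagonal())

# Cross-check with an explicit loop over the diagonal indices
loop_product = sp.Integer(1)
for i in range(min(A.rows, A.cols)):
    loop_product *= A[i, i]

print(diag_product)  # 3*x
```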

Get mean of a distribution?

I generated a vector d of data that follows a normal distribution with some mean and variance.
I then want to calculate a vector s such that each component is a function of the corresponding component of d, i.e. s_i = f(d_i).
Finally, I want to take the mean of s. Is there a quick way to do that in Python without a loop?

You can use numpy to apply a function to an entire array. For example, given a function such as
def f(x):
    return x * 2
you can apply it with numpy as follows:
>>> import numpy
>>> d = numpy.array([1,2,6,7])
>>> f(d)
array([ 2, 4, 12, 14])
Then to calculate the mean
>>> s = f(d)
>>> numpy.mean(s)
8.0
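Applied to the setting in the question, a sketch could look like this. The mean 2, standard deviation 3, sample size and the choice f(x) = x**2 are all arbitrary values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
d = rng.normal(loc=2.0, scale=3.0, size=100_000)  # normally distributed data

def f(x):
    # Built from array operations, so it applies element-wise without a loop
    return x ** 2

s = f(d)           # s[i] == f(d[i]) for every i
mean_s = s.mean()  # same as np.mean(s)

# For a normal variable, E[x**2] = mu**2 + sigma**2 = 4 + 9 = 13,
# so the sample mean should land close to 13
print(mean_s)
```

If f contains plain-Python logic that cannot operate on arrays, np.vectorize(f) offers the same calling convention, although it is a convenience wrapper around a loop rather than a speed-up.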

sort eigenvalues and associated eigenvectors after using numpy.linalg.eig in python

I'm using numpy.linalg.eig to obtain a list of eigenvalues and eigenvectors:
A = someMatrixArray
from numpy.linalg import eig as eigenValuesAndVectors
solution = eigenValuesAndVectors(A)
eigenValues = solution[0]
eigenVectors = solution[1]
I would like to sort my eigenvalues (e.g. from lowest to highest) in such a way that I still know which eigenvector belongs to each eigenvalue after sorting.
I can't find a way to do that with built-in Python functions. Is there a simple way, or do I have to code my own sort?

Use numpy.argsort. It returns the indices one would use to sort the array.
import numpy as np
import numpy.linalg as linalg
A = np.random.random((3,3))
eigenValues, eigenVectors = linalg.eig(A)
idx = eigenValues.argsort()[::-1]
eigenValues = eigenValues[idx]
eigenVectors = eigenVectors[:,idx]
If the eigenvalues are complex, the sort order is lexicographic (that is, complex numbers are sorted according to their real part first, with ties broken by their imaginary part).

The answer above by unutbu is crisp and concise. Here is another way to do it that is more general and also works for lists.
evals, evecs = np.linalg.eig(A)
# Eigenvectors are the columns of evecs, so pair each eigenvalue with a row of evecs.T
ev_list = sorted(zip(evals, evecs.T), key=lambda tup: tup[0], reverse=False)
# After unzipping, evecs is a tuple of eigenvector arrays in sorted order
evals, evecs = zip(*ev_list)
Here tup[0] is the eigenvalue, which is the key the sort uses.
reverse=False gives increasing order.

unutbu's code doesn't work for me on Python 3.6.5; it leads to run-time errors. So I refactored it into the following, which works on my test cases:
import numpy as np
from numpy import linalg as npla

def eigen(A):
    eigenValues, eigenVectors = npla.eig(A)
    idx = np.argsort(eigenValues)  # ascending order
    eigenValues = eigenValues[idx]
    eigenVectors = eigenVectors[:,idx]  # reorder columns to match
    return (eigenValues, eigenVectors)
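A quick way to confirm that the pairing survives the sort is to check A v = λ v column by column afterwards. The sketch below does this on a small symmetric matrix chosen only for illustration (its eigenvalues are 1, 2 and 11).

```python
import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

vals, vecs = np.linalg.eig(A)
order = np.argsort(vals)  # ascending; append [::-1] for descending
vals, vecs = vals[order], vecs[:, order]

# Column i of vecs must still satisfy A @ v_i == vals[i] * v_i.
# vecs * vals scales column i by vals[i] through broadcasting.
pairing_ok = np.allclose(A @ vecs, vecs * vals)
print(pairing_ok)  # True
```

The key point is that the same index array reorders both the eigenvalues and the columns (not the rows) of the eigenvector matrix.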

Left inverse in numpy or scipy?

I am trying to obtain the left inverse of a non-square matrix in python using either numpy or scipy.
How can I translate the following Matlab code to Python?
>> A = [0,1; 0,1; 1,0]
A =
0 1
0 1
1 0
>> y = [2;2;1]
y =
2
2
1
>> A\y
ans =
1.0000
2.0000
Is there a numpy or scipy equivalent of the left inverse \ operator in Matlab?

Use numpy.linalg.lstsq(A, y), since A is not square; see the NumPy documentation for details. numpy.linalg.solve(A, y) works only when A is square, which is not your case.
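As a sketch, the Matlab example from the question translates to the following. Passing rcond=None is an assumption made here to select NumPy's current default cutoff and avoid a warning on older versions.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
y = np.array([2.0, 2.0, 1.0])

# lstsq returns the solution plus diagnostics; only x is needed here
x, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
print(x)  # approximately [1. 2.], matching Matlab's A\y
```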

Here is a method that works with sparse matrices (which, from your comments, is what you want). It uses the leastsq function from the optimize package:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.optimize import leastsq
A = csr_matrix([[0., 1.], [0., 1.], [1., 0.]])
b = np.array([[2.], [2.], [1.]])
def myfunc(x):
    # leastsq passes a flat vector; reshape it to a column for the sparse product
    x = x.reshape(2, 1)
    return (A @ x - b)[:, 0]
print(leastsq(myfunc, np.random.rand(2))[0])
generates
[ 1. 2.]
It is kind of ugly because of how I had to get the shapes to match up with what leastsq wants. Maybe someone else knows how to make this a little tidier.
I have also tried to get something to work with the functions in scipy.sparse.linalg by using LinearOperators, but to no avail. The problem is that all of those functions are made to handle square matrices only. If anyone finds a way to do it that way, I would like to know as well.

For those who wish to solve large sparse least squares problems:
I have added the LSQR algorithm to SciPy. With the next release, you'll be able to do:
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr
import numpy as np
A = csr_matrix([[0., 1], [0, 1], [1, 0]])
b = np.array([[2.], [2.], [1.]])
x = lsqr(A, b)[0]
lsqr returns a tuple whose first element is the solution, here approximately [1, 2].
If you'd like to use this new functionality without upgrading SciPy, you may download lsqr.py from the code repository at
http://projects.scipy.org/scipy/browser/trunk/scipy/sparse/linalg/isolve/lsqr.py

As an alternative to the other answers, you can also use the pseudo-inverse function pinv, available as numpy.linalg.pinv or scipy.linalg.pinv.
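A minimal sketch of the pseudo-inverse route, using the matrix and right-hand side from the question:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
y = np.array([2.0, 2.0, 1.0])

A_pinv = np.linalg.pinv(A)  # shape (2, 3)
x = A_pinv @ y              # least-squares solution of A x = y

# A has full column rank, so the pseudo-inverse acts as a left inverse
is_left_inverse = np.allclose(A_pinv @ A, np.eye(2))
print(x, is_left_inverse)
```

Forming the pseudo-inverse explicitly is convenient when the same A is reused with many right-hand sides; for a single solve, a least-squares routine avoids building the extra matrix.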

You can calculate the left inverse using matrix calculations:
import numpy as np
linv_A = np.linalg.solve(A.T.dot(A), A.T)
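(Why? Because $(A^T A)^{-1} A^T A = I$, so $(A^T A)^{-1} A^T$ is a left inverse of $A$, provided $A^T A$ is invertible.)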
Test:
np.set_printoptions(suppress=True, precision=3)
np.random.seed(123)
A = np.random.randn(3, 2)
print('A\n', A)
A_linv = np.linalg.solve(A.T.dot(A), A.T)
print('A_linv.dot(A)\n', A_linv.dot(A))
Result:
A
[[-1.086 0.997]
[ 0.283 -1.506]
[-0.579 1.651]]
A_linv.dot(A)
[[ 1. -0.]
[ 0. 1.]]

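I haven't tested it, but according to this web page it is:
linalg.solve(A,y)
(Note that linalg.solve requires a square A, so it does not apply to the 3x2 matrix in the question.)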

You can use lsqr from scipy.sparse.linalg to solve sparse least-squares problems.
