Scikit-learn χ² (chi-squared) statistic and corresponding contingency table - python

In the docs for the chi-squared univariate feature selection function of scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html, it states
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.
For example, consider the below dataset with boolean features and targets:
>>> import numpy as np
>>> X = np.random.randint(2, size=50).reshape(10, 5)
>>> X
array([[1, 0, 0, 0, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 1, 1, 1],
       [0, 1, 1, 0, 0],
       [1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0]])
>>> y = np.random.randint(2, size=10)
>>> y
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)
>>> import scipy as sp
>>> contingency_table = sp.sparse.coo_matrix(
...     (np.ones_like(y), (X[:, 0], y)),
...     shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).A
>>> contingency_table
array([[1, 2],
       [3, 4]])
So now I can calculate the chi-squared statistic and its p-values
>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
0.67260381744151676,
1,
array([[ 1.2, 1.8],
[ 2.8, 4.2]]))
And this ought to be consistent with scikit-learn's chi2
from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)
...Nope. Have I misinterpreted something?
Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like
contingency_table = sp.sparse.coo_matrix(
(np.ones_like(y), (X[:, 0], y)),
shape=(X[:, 0].max()+1, np.unique(y).shape[0])).A
But the corresponding table of expected frequencies will most likely have several zero elements.
Edit:
To clarify further, consider the first feature X[:, 0] that is, say, gender and the targets y, say, handedness.
From this we get the cross tabulation
                Right-handed    Left-handed (!right-handed)
Male                 1                      2
Female (!male)       3                      4
And we can assess the significance of the difference between the two proportions using the chi-squared test, taking the expected frequencies from the row and column totals.
sklearn.feature_selection.chi2 does this directly without resorting to explicitly computing the table and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare.
After explicitly enumerating the table shown above, I wanted to verify it is consistent with chi2 when applying scipy.stats.chi2_contingency and to my dismay, it isn't. I'd like to ask why it isn't.

Consider a column x of X. sklearn.feature_selection.chi2 tests whether
the frequencies of the y values where x is 1 agree with the frequencies of y in
the full population. (@larsman's answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table
analysis of x and y. In a 2x2 contingency table analysis, the frequencies of y
where x is 0 also contribute to the test.
Suppose we form the contingency table for x and y:
| y=0 y=1
----+---------
x=0 | a b
x=1 | c d
Let n = a + b + c + d. This is the number of samples (i.e. same as len(x) and len(y)).
Let nx = c + d. This is the number of occurrences of 1 in x.
Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.
sklearn.feature_selection.chi2 performs a chi2 test on [c, d] using the expected
values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table
analysis of a 2x2 table.
Here's an extreme example. Suppose the 2x2 contingency table for x and y is
| y=0 y=1
----+----------
x=0 | 8 8
x=1 | 20 188
The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.
The contingency table analysis of scipy.stats.chi2_contingency gives a chi2 score of 18.6, with a p-value of 1.60e-5.
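For what it's worth, both numbers can be reproduced with a few lines of scipy, using the table above and the expected values described earlier (a quick check, not a general recipe):
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Rows are x=0 and x=1, columns are y=0 and y=1.
table = np.array([[  8,   8],
                  [ 20, 188]], dtype=float)

n = table.sum()                     # total number of samples
nx = table[1].sum()                 # number of occurrences of x == 1
py1 = table[:, 1].sum() / n         # fraction of the population with y == 1

# sklearn-style test: only the x == 1 row, against expected [(1-py1)*nx, py1*nx]
score, p = chisquare(table[1], f_exp=[(1 - py1) * nx, py1 * nx])
print(score, p)                     # roughly 1.58 and 0.208

# Standard 2x2 contingency table analysis
chi2_stat, p2, dof, expected = chi2_contingency(table)
print(chi2_stat, p2)                # roughly 18.6 and 1.6e-5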

Given your data,
>>> X = np.array([[1, 0, 0, 0, 1],
...               [1, 1, 0, 1, 1],
...               [1, 0, 0, 0, 0],
...               [0, 0, 0, 0, 0],
...               [0, 0, 0, 0, 1],
...               [1, 0, 0, 0, 1],
...               [1, 0, 1, 1, 1],
...               [0, 1, 1, 0, 0],
...               [1, 0, 1, 1, 1],
...               [1, 1, 1, 1, 0]])
>>> y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
this is what feature_selection.chi2 computes:
>>> Y = np.vstack([1 - y, y])
>>> observed = np.dot(Y, X)
>>> observed
array([[3, 1, 1, 2, 2],
[4, 2, 3, 2, 4]])
These are the observed feature frequencies, per class, i.e. the contingency table. Then the expected values:
>>> feature_count = X.sum(axis=0)
>>> class_prob = Y.mean(axis=1)
>>> expected = np.dot(feature_count.reshape(-1, 1), class_prob.reshape(1, -1)).T
>>> expected
array([[ 2.8, 1.2, 1.6, 1.6, 2.4],
[ 4.2, 1.8, 2.4, 2.4, 3.6]])
Finally, it runs a χ² test:
>>> from scipy.stats import chisquare
>>> score, pval = chisquare(observed, expected)
>>> score
array([ 0.02380952, 0.05555556, 0.375 , 0.16666667, 0.11111111])
>>> pval
array([ 0.87737056, 0.81366372, 0.54029137, 0.6830914 , 0.73888268])
The scores are the relevant bit: they're used to sort the features by discriminative power. Note that you get one score and one p-value per feature.
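If the end goal is feature selection rather than the statistics themselves, these scores are usually consumed through SelectKBest; a minimal sketch using the X and y above (k=2 is an arbitrary choice here):
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 2 features with the highest chi2 scores.
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)   # the same per-feature scores as shown above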


Interpreting (and comparing) output from numpy.correlate

I have looked at this question but it hasn't really given me any answers.
Essentially, how can I determine if a strong correlation exists or not using np.correlate? I expect the same output as I get from matlab's xcorr with the coeff option which I can understand (1 is a strong correlation at lag l and 0 is no correlation at lag l), but np.correlate produces values greater than 1, even when the input vectors have been normalised between 0 and 1.
Example input
import numpy as np
x = np.random.rand(10)
y = np.random.rand(10)
np.correlate(x, y, 'full')
This gives the following output:
array([ 0.15711279, 0.24562736, 0.48078652, 0.69477838, 1.07376669,
1.28020871, 1.39717118, 1.78545567, 1.85084435, 1.89776181,
1.92940874, 2.05102884, 1.35671247, 1.54329503, 0.8892999 ,
0.67574802, 0.90464743, 0.20475408, 0.33001517])
How can I tell what is a strong correlation and what is weak if I don't know what the maximum possible correlation value is?
Another example:
In [10]: x = [0,1,2,1,0,0]
In [11]: y = [0,0,1,2,1,0]
In [12]: np.correlate(x, y, 'full')
Out[12]: array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Edit: This was a badly asked question, but the marked answer does answer what was asked. I think it is important to note what I have found whilst digging around in this area: you cannot compare outputs from cross-correlation. In other words, it would not be valid to use the outputs from cross-correlation to say that signal x is better correlated to signal y than to signal z. Cross-correlation does not provide that kind of information.
numpy.correlate is under-documented. I think that we can make sense of it, though. Let's start with your sample case:
>>> import numpy as np
>>> x = [0,1,2,1,0,0]
>>> y = [0,0,1,2,1,0]
>>> np.correlate(x, y, 'full')
array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Those numbers are the cross-correlations for each of the possible lags. To make that more clear, let's put the lag numbers above the correlations:
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
[ 0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0]])
Here, we can see that the cross-correlation reaches its peak at a lag of -1. If you look at x and y above, that makes sense: if one shifts y to the left by one place, it matches x exactly.
To verify this, let's try again, this time shifting y further:
>>> y = [0, 0, 0, 0, 1, 2]
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
[ 0, 2, 5, 4, 1, 0, 0, 0, 0, 0, 0]])
Now, the correlation peaks at a lag of -3, meaning that the best match between x and y occurs when y is shifted to the left by 3 places.
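If you want to read off the lag of the peak programmatically rather than by eye, something like this should work (a small sketch; for mode='full' the lag axis runs from -(len(y)-1) to len(x)-1):
import numpy as np

x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 0, 0, 1, 2]

corr = np.correlate(x, y, 'full')
lags = np.arange(-(len(y) - 1), len(x))   # lag values matching the 'full' output
print(lags[np.argmax(corr)])              # -3 for this example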

python: calculate center of mass

I have a data set with 4 columns: x,y,z, and value, let's say:
x y z value
0 0 0 0
0 1 0 0
0 2 0 0
1 0 0 0
1 1 0 1
1 2 0 1
2 0 0 0
2 1 0 0
2 2 0 0
I would like to calculate the center of mass CM = (x_m,y_m,z_m) of all values. In the present example, I would like to see (1,1.5,0) as output.
I thought this must be a trivial problem, but I can't find a solution to it on the internet. scipy.ndimage.measurements.center_of_mass seems to be the right thing, but unfortunately the function always returns two values (instead of 3). In addition, I can't find any documentation on how to set up an ndimage from an array: would I use a numpy array N of shape (9,4)? Would N[:,0] then be the x-coordinate?
Any help is highly appreciated.
The simplest way I can think of is this: just find an average of the coordinates of mass components weighted by each component's contribution.
import numpy
masses = numpy.array([[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]])
nonZeroMasses = masses[numpy.nonzero(masses[:,3])] # Not really necessary, can just use masses because 0 mass used as weight will work just fine.
CM = numpy.average(nonZeroMasses[:,:3], axis=0, weights=nonZeroMasses[:,3])
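For the sample data above, this should reproduce the (1, 1.5, 0) asked for in the question (a quick check, not part of the original answer):
print(CM)   # should print something like [1.  1.5 0. ]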
Another option is to use the scipy center of mass:
from scipy import ndimage
import numpy
masses = numpy.array([[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]])
ndimage.measurements.center_of_mass(masses)
How about:
# x y z value
table = np.array([[ 5. , 1.3, 8.3, 9. ],
[ 6. , 6.7, 1.6, 5.9],
[ 9.1, 0.2, 6.2, 3.7],
[ 2.2, 2. , 6.7, 4.6],
[ 3.4, 5.6, 8.4, 7.3],
[ 4.8, 5.9, 5.7, 5.8],
[ 3.7, 1.1, 8.2, 2.2],
[ 0.3, 0.7, 7.3, 4.6],
[ 8.1, 1.9, 7. , 5.3],
[ 9.1, 8.2, 3.3, 5.3]])
def com(xyz, mass):
    mass = mass.reshape((-1, 1))
    # center of mass = sum(m_i * r_i) / sum(m_i)
    return (xyz * mass).sum(0) / mass.sum()

print(com(table[:, :3], table[:, 3]))
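As a quick sanity check (my addition, reusing table and com from above), this weighted mean should agree with the numpy.average approach from the first answer:
assert np.allclose(com(table[:, :3], table[:, 3]),
                   np.average(table[:, :3], axis=0, weights=table[:, 3]))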
Why did ndimage.measurements.center_of_mass not give the expected result?
The key is in how the input data masses was represented by an array of 4-tuples (x, y, z, value)
# x y z value
[[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]]
The array masses here represents the 3-D position and weights of each mass.
Note however that this python array structure is only a 2-D array. Its shape is (9, 4).
The input you need to pass to ndimage to get the expected result is a 3-D array containing zeros everywhere and the weight of each mass at the appropriate coordinates within the array, like this:
from scipy import ndimage
import numpy
masses = numpy.zeros((3, 3, 1))
# x y z value
masses[1, 1, 0] = 1
masses[1, 2, 0] = 1
CM = ndimage.measurements.center_of_mass(masses)
# x y z
# (1.0, 1.5, 0.0)
Which is exactly the expected output.
Note that a limitation of this solution (and of the ndimage library) is that it requires non-negative integer coordinates. It also will not be efficient for large and/or sparse volumes, because each "pixel" of the ndimage needs to be instantiated in memory.
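If you do want to route an arbitrary (N, 4) table of non-negative integer coordinates through ndimage, a hypothetical helper along these lines (the function name and the dense-grid construction are my own sketch, and it assumes one row per coordinate) builds the grid automatically:
import numpy as np
from scipy import ndimage

def com_from_table(table):
    """Center of mass from an (N, 4) array of integer (x, y, z, value) rows."""
    coords = table[:, :3].astype(int)
    grid = np.zeros(coords.max(axis=0) + 1)   # dense grid, zeros everywhere
    grid[tuple(coords.T)] = table[:, 3]       # drop each weight at its (x, y, z)
    return ndimage.measurements.center_of_mass(grid)

# For the masses table above this should return (1.0, 1.5, 0.0).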

Scipy - find bases of column space of matrix

I'm trying to code up a simple Simplex algorithm, the first step of which is to find a basic feasible solution:
1. Choose a set B of linearly independent columns of A
2. Set all components of x corresponding to the columns not in B to zero.
3. Solve the m resulting equations to determine the components of x. These are the basic variables.
I know the solution will involve using scipy.linalg.svd (or scipy.linalg.lu) and some numpy.argwhere / numpy.where magic, but I'm not sure exactly how.
Does anyone have a pure-Numpy/Scipy implementation of finding a basis (step 1) or, even better, all of the above?
Example:
>>> A
array([[1, 1, 1, 1, 0, 0, 0],
[1, 0, 0, 0, 1, 0, 0],
[0, 0, 1, 0, 0, 1, 0],
[0, 3, 1, 0, 0, 0, 1]])
>>> u, s, v = scipy.linalg.svd(A)
>>> non_zero, = numpy.where(s > 1e-7)
>>> rank = len(non_zero)
>>> rank
4
>>> for basis in some_unknown_function(A):
... print(basis)
{3, 4, 5, 6}
{1, 4, 5, 6}
and so on.
A QR decomposition provides an orthogonal basis for the column space of A:
q,r = np.linalg.qr(A)
If the rank of A is n, then the first n columns of q form a basis for the column space of A.
Try using
scipy.linalg.orth(A)
This produces an orthonormal basis for the column space of A.
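Note that both QR and orth give you an orthonormal basis, not a subset of A's own columns. If you need actual column indices of A that form a basis (which is what the simplex step above asks for), one option, sketched here rather than taken from the answers, is a QR decomposition with column pivoting:
import numpy as np
from scipy.linalg import qr

A = np.array([[1, 1, 1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 1, 0],
              [0, 3, 1, 0, 0, 0, 1]], dtype=float)

# Column-pivoted QR: the first `rank` pivot columns of A are linearly independent.
q, r, p = qr(A, pivoting=True)
rank = int(np.sum(np.abs(np.diag(r)) > 1e-10))
print(sorted(p[:rank]))   # indices of one set of basis columns of A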

Numpy: increment elements of an array given the indices required to increment

I am trying to turn a second order tensor into a binary third order tensor. Given a second order tensor as an m x n numpy array A, I need to take each element value x in A and replace it with a vector v, with dimensions equal to the maximum value of A, but with a value of 1 at the index of v corresponding to the value x (i.e. v[x] = 1). I have been following this question: Increment given indices in a matrix, which addresses producing an array with increments at indices given by 2 dimensional coordinates. I have been reading the answers and trying to use np.ravel_multi_index() and np.bincount() to do the same but with 3 dimensional coordinates, however I keep getting a ValueError: "invalid entry in coordinates array". This is what I have been using:
def expand_to_tensor_3(array):
    (x, y) = array.shape
    (a, b) = np.indices((x, y))
    a = a.reshape(x*y)
    b = b.reshape(x*y)
    tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)), (x, y, np.amax(array))))
    return tensor_3
If you know what is wrong here or know an even better method to accomplish my goal, both would be really helpful, thanks.
You can use (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int).
Here's a demonstration:
In [52]: A
Out[52]:
array([[2, 0, 0, 2],
[3, 1, 2, 3],
[3, 2, 1, 0]])
In [53]: B = (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int)
In [54]: B
Out[54]:
array([[[0, 0, 1, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0]],
[[0, 0, 0, 1],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]],
[[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 0]]])
Check a few individual elements of A:
In [55]: A[0,0]
Out[55]: 2
In [56]: B[0,0,:]
Out[56]: array([0, 0, 1, 0])
In [57]: A[1,3]
Out[57]: 3
In [58]: B[1,3,:]
Out[58]: array([0, 0, 0, 1])
The expression A[:,:,np.newaxis] == np.arange(A.max()+1) uses broadcasting to compare each element of A to np.arange(A.max()+1). For a single value, this looks like:
In [63]: 3 == np.arange(A.max()+1)
Out[63]: array([False, False, False, True], dtype=bool)
In [64]: (3 == np.arange(A.max()+1)).astype(int)
Out[64]: array([0, 0, 0, 1])
A[:,:,np.newaxis] is a three-dimensional view of A with shape (3,4,1). The extra dimension is added so that the comparison to np.arange(A.max()+1) will broadcast to each element, giving a result with shape (3, 4, A.max()+1).
With a trivial change, this will work for an n-dimensional array. Indexing a numpy array with the ellipsis ... means "all the other dimensions". So
(A[..., np.newaxis] == np.arange(A.max()+1)).astype(int)
converts an n-dimensional array to an (n+1)-dimensional array, where the last dimension is the binary indicator of the integer in A. Here's an example with a one-dimensional array:
In [6]: a = np.array([3, 4, 0, 1])
In [7]: (a[...,np.newaxis] == np.arange(a.max()+1)).astype(int)
Out[7]:
array([[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]])
You can make it work this way:
tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)),
                                            (x, y, np.amax(array) + 1)))
The difference is that I add 1 to the amax() result, because ravel_multi_index() expects that the indexes are all strictly less than the dimensions, not less-or-equal.
I'm not 100% sure if this is what you wanted; another way to make the code run is to specify mode='clip' or mode='wrap' in ravel_multi_index(), which does something a bit different and I'm guessing is less correct. But you can try it.
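Putting that fix together with a reshape of the flat bincount output back into the desired (m, n, A.max()+1) tensor, a possible version of the function might look like this (a sketch; the minlength argument is added so trailing all-zero bins are not dropped):
import numpy as np

def expand_to_tensor_3(array):
    x, y = array.shape
    a, b = np.indices((x, y))
    depth = array.max() + 1
    flat = np.ravel_multi_index(
        (a.ravel(), b.ravel(), array.ravel()), (x, y, depth))
    counts = np.bincount(flat, minlength=x * y * depth)
    return counts.reshape(x, y, depth)

# expand_to_tensor_3(A) should match the broadcasting result from the first answer.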

Using Scipy minimize (scipy.optimize.minimize) with a large equality constraint matrix

I need to minimize a function of, say, five variables (x[0] to x[4]).
The scalar function to be minimized is given by X'*H*X. The objective function would look similar to this:
def objfun(x):
    H = 0.1*np.ones([5,5])
    f = np.dot(np.transpose(x),np.dot(H,x))[0][0]
    return f
Which would return a single scalar value.
The question is, how do I implement the constraint equations given by:
A*X - b = 0
Where A and b are subject to change in each run. A random example would be:
A =
array([[ 1, 2, 3, 4, 5],
[ 2, 1, 3, 4, 5],
[-1, 2, 3, 0, 0],
[ 0, -5, 6, 3, 2],
[-3, 5, 6, 2, 8]])
B =
array([[ 0],
[ 2],
[ 3],
[-2],
[-7]])
A and B cannot be hard-coded into a constraint function as they may be different in each run. There are no bounds on the variables and the optimization method need not be specified.
EDIT
I realized that having 5 constraint equations for an optimization problem with 5 variables gives a unique solution just by solving the equations.
So how about a case where A may be defined as:
A =
array([[ 1, 2, 3, 4, 5],
[ 2, 1, 3, 4, 5],
[-1, 2, 3, 0, 0],
[ 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0]])
B =
array([[ 0],
[ 2],
[ 3],
[ 0],
[ 0]])
So we have a 5 variable optimization problem with 3 linear constraints.
You could try using the scipy.optimize.fmin_cobyla function. I don't know the numerical details, so you should check it with values for which you know the expected answer and see if it works for your needs; play with the tolerance arguments rhoend and rhobeg and see if you get an expected answer. A sample program could be something like:
import numpy as np
import scipy.optimize
import numpy as np
import scipy.optimize

A = np.array([[ 1, 2, 3, 4, 5],
              [ 2, 1, 3, 4, 5],
              [-1, 2, 3, 0, 0],
              [ 0, 0, 0, 0, 0],
              [ 0, 0, 0, 0, 0]])

B = np.array([[0],
              [2],
              [3],
              [0],
              [0]])

def objfun(x):
    H = 0.1*np.ones([5,5])
    f = np.dot(np.transpose(x), np.dot(H, x))
    return f

def constr1(x):
    """ The constraint is satisfied when return value >= 0 """
    sol = np.dot(A, x)  # matrix product A.x, not elementwise A*x
    if np.allclose(sol, B.ravel()):
        return 0.01
    else:
        # Return the negative distance between the expected solution
        # and the actual solution for a somehow meaningful value
        return -np.linalg.norm(B.ravel() - sol)

scipy.optimize.fmin_cobyla(objfun, [0.0, 0.0, 0.0, 0.0, 0.0], [constr1])
#np.linalg.solve(A, b)
Please note that this given example doesn't have a solution; try it with something that does. I am not completely sure that the constraint function is properly defined; try to find something that works well for you. You should try to provide an initial guess that is an actual solution, instead of [0.0, 0.0, 0.0, 0.0, 0.0], for better results.
Check the official documentation for more details: http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_cobyla.html#scipy.optimize.fmin_cobyla
Edit: Also, depending on what kind of solution you are looking for, you could probably form a better constraint function, maybe allowing values that are within a certain tolerance of the expected solution even if not completely exact, and returning a value higher than 0 the closer they are, instead of always 0.01, etc.
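For what it's worth, since the title mentions scipy.optimize.minimize: its SLSQP method accepts equality constraints directly, so a sketch along these lines (using only the three non-trivial rows of A from the edit) might be worth trying as well:
import numpy as np
from scipy.optimize import minimize

A = np.array([[ 1, 2, 3, 4, 5],
              [ 2, 1, 3, 4, 5],
              [-1, 2, 3, 0, 0]], dtype=float)
b = np.array([0., 2., 3.])

H = 0.1 * np.ones((5, 5))
objfun = lambda x: x.dot(H).dot(x)

cons = {'type': 'eq', 'fun': lambda x: A.dot(x) - b}   # A x - b = 0
res = minimize(objfun, np.zeros(5), method='SLSQP', constraints=[cons])
print(res.x, A.dot(res.x))   # A.dot(res.x) should be close to b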
The NLopt doc mentions a neat general method:
all solutions of Ax = b have the form xany + nullspace(A) z,
where xany is one solution and dim(z) < dim(x).
So minimize f( xany + nullspace(A) z ) over unconstrained z.
For example, in 3d, the constraint x0 + x1 + x2 = 1 has nullspace matrix
[ 1 0 ] : [z0 z1] -> [z0, -z0 + z1, -z1] -- sum 0
[ -1 1 ]
[ 0 -1 ]
("Some care is required in numerically computing the nullspace ...")
