How can I interpret Scikit-learn confusion matrix

I am using a confusion matrix to check the performance of my classifier.
I am using Scikit-Learn and I am a little bit confused. How can I interpret the result from
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
How can I decide whether these predicted values are good or not?

The simplest way to decide whether the classifier is good or bad is to calculate an error using one of the standard error measures (for example the mean squared error). I imagine your example is copied from Scikit's documentation, so I assume you've read the definition.
We have three classes here: 0, 1 and 2. On the diagonal, the confusion matrix tells you how often a particular class has been predicted correctly. So from the diagonal 2 0 2 we can say that the class with index 0 was classified correctly 2 times, the class with index 1 was never predicted correctly, and the class with index 2 was predicted correctly 2 times.
Below and above the diagonal you have numbers which tell you how many times a class with index equal to the element's row number was classified as the class with index equal to the element's column number. For example, if you look at the first column, under the diagonal you have: 0 1 (in the lower left corner of the matrix). The lower 1 tells you that the class with index 2 (the last row) was once erroneously classified as 0 (the first column). This corresponds to the fact that in your y_true there was one sample with label 2 that was classified as 0. This happened for the first sample.
If you sum all the numbers in the confusion matrix you get the number of testing samples (2 + 2 + 1 + 1 = 6, equal to the length of y_true and y_pred). If you sum each row you get the number of samples for each true label: as you can verify, there are indeed two 0s, one 1 and three 2s in y_true.
If you divide the elements of a row by that row's sum, you can tell that, for example, the class with label 2 is recognized correctly in ~67% of cases, and in 1/3 of cases it's confused (hence the name) with the class with label 0.
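To make the row/column reading concrete, here is a minimal sketch that tallies the same matrix by hand in plain Python, without sklearn. The helper name confusion_counts is made up for this illustration; it only mirrors the convention described above (rows are true labels, columns are predicted labels).

```python
def confusion_counts(y_true, y_pred, labels):
    """Tally a confusion matrix as nested lists: rows = true label, columns = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_counts(y_true, y_pred, labels=[0, 1, 2])

# Row sums give the number of samples per true label.
row_sums = [sum(row) for row in cm]

# Dividing each row by its sum gives, per true class, the fraction sent to each predicted class.
row_fractions = [
    [count / total if total else 0.0 for count in row]
    for row, total in zip(cm, row_sums)
]

print(cm)
print(row_sums)
print(row_fractions[2])  # how true class 2 was distributed over predictions 0, 1, 2
```

The diagonal entries of row_fractions are the per-class hit rates, and the off-diagonal entries show which classes a given class gets confused with.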
TL;DR:
While single-number error measures measure overall performance, with confusion matrix you can determine if (some examples):
your classifier just sucks with everything
or it handles some classes well, and some not (this gives you a hint to look at this particular part of your data and observe classifier's behaviour for these cases)
it does well, but confuses label A with B quite often. For example, for linear classifiers you may want to check then, if these classes are linearly separable.
Etc.

Related

How does sklearn's jaccard_score get calculated?

I was trying to understand what is going on with sklearn's jaccard_score.
These are the results I got:
1. jaccard_score([0, 1, 1], [1, 1, 1])
0.6666666666666666
2. jaccard_score([1, 1, 0], [1, 0, 0])
0.5
3. jaccard_score([1, 1, 0], [1, 0, 1])
0.3333333333333333
I understand that the formula is
intersection / (size of A + size of B - intersection)
I thought the last one should give me 0.2, because the overlap is 1 and the total number of entries is 6, resulting in 1/5, but I got 0.33333...
Can anyone explain how sklearn calculates jaccard_score?

Per sklearn's doc, the jaccard_score function "is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true". If the attributes are binary, the score is computed from the confusion matrix as TP / (TP + FP + FN). Otherwise, the same computation is done using the confusion matrix for each attribute value / class label.
The above definition for binary attributes / classes can be reduced to the set definition, as explained in the following.
Assume that there are three records r1, r2, and r3. The vectors [0, 1, 1] and [1, 1, 1], which are the true and predicted classes of the records, can be mapped to the two sets {r2, r3} and {r1, r2, r3} respectively. Here, each element in a vector represents whether the corresponding record exists in the set. The Jaccard similarity of the two sets is the same as the similarity value defined for the two vectors. For the third example, [1, 1, 0] and [1, 0, 1] map to {r1, r2} and {r1, r3}: the sizes of A and B count only the records labelled 1 (2 each, not 3 each), so the score is 1 / (2 + 2 - 1) = 1/3 rather than 1/5.
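The set interpretation can be checked with a few lines of plain Python. This is only a sketch of the set definition for the binary case with positive label 1; the function name jaccard_binary is made up here, and returning 0.0 for two empty sets is an assumption of this sketch, not a statement about sklearn's behaviour.

```python
def jaccard_binary(y_true, y_pred):
    """Jaccard similarity of the sets of positions labelled 1 in each vector."""
    a = {i for i, value in enumerate(y_true) if value == 1}
    b = {i for i, value in enumerate(y_pred) if value == 1}
    union = a | b
    if not union:
        return 0.0  # assumption of this sketch: treat "no positives at all" as 0
    return len(a & b) / len(union)


examples = [
    ([0, 1, 1], [1, 1, 1]),
    ([1, 1, 0], [1, 0, 0]),
    ([1, 1, 0], [1, 0, 1]),
]
scores = [jaccard_binary(t, p) for t, p in examples]
print(scores)
```

The three values agree with the three results quoted in the question.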

How can I perform a cluster analysis while checking for anticorrelation as well as correlation?

I have data that is a matrix of z-scores. Each row has zero average. I am trying to perform a kmeans cluster analysis so as to assign each row to a cluster. To take a very simplified example, in the matrix:
[0, -1, 1, 0]
[0, -1, 1, 0]
[0, 1, -1, 0]
[1, 1, -1, -1]
[-1, -1, 1, 1]
(Except the actual z-score data would have a variance of 1 in each row.)
Python should recognize that the top two rows are in one cluster. I can do this with sklearn.cluster.KMeans. However, now I want it to detect "anticorrelation" and classify the third row together with the top two rows because it is simply the negative of them. If I tell it to find two clusters, it should find one with the top three rows and another with the bottom two, because the bottom two are also negatives of one another.
One possible approach (perhaps) is to use a user-defined goodness-of-fit function that defines the distance between two points r1 and r2 as the minimum of sqrt(sum((r1+r2)**2)) and sqrt(sum((r1-r2)**2)). I might then want to know whether a given row has been used as its positive or negative version in its cluster.
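A minimal sketch of that distance in plain Python, applied to the example rows above. The function name sign_invariant_distance and the returned sign flag are made up for illustration; how to plug such a distance into a clustering algorithm is left open.

```python
import math


def sign_invariant_distance(r1, r2):
    """Return (distance, sign): the smaller of |r1 - r2| and |r1 + r2|.

    sign is +1 if r2 matched as-is and -1 if its negative matched better.
    """
    d_same = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    d_flip = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return (d_same, 1) if d_same <= d_flip else (d_flip, -1)


rows = [
    [0, -1, 1, 0],
    [0, -1, 1, 0],
    [0, 1, -1, 0],
    [1, 1, -1, -1],
    [-1, -1, 1, 1],
]

print(sign_invariant_distance(rows[0], rows[1]))  # identical rows
print(sign_invariant_distance(rows[0], rows[2]))  # exact negatives
print(sign_invariant_distance(rows[0], rows[3]))  # rows from different groups
```

The sign flag records whether a row was matched in its positive or negative version.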
Thanks for any help you can give.

Scikit-learn principal component analysis (PCA) for dimension reduction

I want to perform principal component analysis for dimension reduction and data integration.
I have 3 features (variables) and 5 samples, as below. I want to integrate them into a 1-dimensional (1 feature) output by transforming them (computing the 1st PC). I want to use the transformed data for further statistical analysis, because I believe that it displays the 'main' characteristics of the 3 input features.
I first wrote test code in Python using scikit-learn, as below. It is the simple case in which the values of the 3 features are all equal. In other words, I applied PCA to three identical vectors, [0, 1, 2, 1, 0].
Code
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
samples = np.array([[0,0,0],[1,1,1],[2,2,2],[1,1,1],[0,0,0]])
pc1 = pca.fit_transform(samples)
print(pc1)
Output
[[-1.38564065]
 [ 0.34641016]
 [ 2.07846097]
 [ 0.34641016]
 [-1.38564065]]
1. Is taking the 1st PC after dimension reduction a proper approach for data integration?
1-2. For example, in a 2-feature case, suppose the features are [power rank, speed rank] and power has a roughly negative correlation with speed. I want to find the samples which have both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs [power 3, speed 3].
So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the 1st PC. Is this kind of approach still proper?
2. For the test code above, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there any problem with the code, or is it the right answer?

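Yes. It is also called data projection (to a lower dimension).
The resulting output is centered on the mean of the training data and projected onto the unit-length first component, so the result is correct. Here the mean of each feature is 0.8 and the first component is (1, 1, 1)/sqrt(3), so an input value x becomes sqrt(3) * (x - 0.8); for example, 2 maps to about 2.078.
With only 5 samples I don't think it is wise to run any statistical methods. And if you believe that your features are the same, just check that the correlation between dimensions is close to 1, and then you can simply disregard the other dimensions.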
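As a rough check of where those numbers come from, the sketch below centers the data and projects it onto the first right singular vector using NumPy only. It is not sklearn's implementation; in particular, the sign of a principal component is arbitrary, so the sign fix here is just a convention chosen for this illustration.

```python
import numpy as np

samples = np.array(
    [[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]], dtype=float
)

# Step 1: subtract the per-feature mean (0.8 for every feature here).
centered = samples - samples.mean(axis=0)

# Step 2: the first right singular vector is the unit-length first principal direction.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]
if direction.sum() < 0:  # sign is arbitrary; pick the one with positive entries
    direction = -direction

# Step 3: project the centered samples onto that direction.
scores = centered @ direction
print(direction)
print(scores)
```

Each score equals sqrt(3) times the centered input value, which explains why the output is not [0, 1, 2, 1, 0].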

There is no need to use PCA for such a small dataset. And for PCA your array should be scaled.
In any case, you have only 3 dimensions: you can plot the points and inspect them by eye, or you can calculate distances (with some kind of nearest-neighbors algorithm).

Scale (apply function?) sparse matrix logarithmically

I am using scikit-learn preprocessing scaling for sparse matrices.
My goal is to "scale" each feature column by taking the logarithm with the column's maximum value as the base. My wording may be inexact, so let me try to explain.
Say a feature column has the values 0, 8, 2:
Max value = 8
Log-8 of feature value 0 should be 0.0 = math.log(0+1, 8+1) (the +1 is to cope with zeros; so yes, we are actually taking log-base 9)
Log-8 of feature value 8 should be 1.0 = math.log(8+1, 8+1)
Log-8 of feature value 2 should be 0.5 = math.log(2+1, 8+1)
Yes, I can easily apply any arbitrary function-based transformer with FunctionTransformer, but I want the base of the log to change with each column (in particular, to be based on its maximum value). That is, I want to do something like the MaxAbsScaler, only taking logarithms.
I see that MaxAbsScaler first gets a vector (scale) of the maximum values of each column and then multiplies the original matrix by 1 / scale.
However, I don't know what to do if I want to take logarithms based on the scale vector. Is it even possible to turn the logarithm operation into a multiplication, or are there other efficient scipy sparse operations I could use?
I hope my intent is clear (and possible).

Logarithm of x in base b is the same as log(x)/log(b), where logs are natural. So, the process you describe amounts to first applying log(x+1) transformation to everything, and then scaling by max absolute value. Conveniently, log(x+1) is a built-in function, log1p. Example:
from sklearn.preprocessing import FunctionTransformer, maxabs_scale
from scipy.sparse import csc_matrix
import numpy as np
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
X = csc_matrix([[ 1., 0, 8], [ 2., 0, 0], [ 0, 1., 2]])
Y = maxabs_scale(logtran.transform(X))
Output (sparse matrix Y):
(0, 0) 0.630929753571
(1, 0) 1.0
(2, 1) 1.0
(0, 2) 1.0
(2, 2) 0.5
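The identity this relies on can be checked on the question's example column in plain Python. This is only a sketch for a single dense column of non-negative values; the helper name log_scale_column is made up for illustration.

```python
import math


def log_scale_column(values):
    """Compute log(v + 1) in base (max + 1) as log1p(v) / log1p(max)."""
    max_value = max(values)
    if max_value == 0:
        return [0.0 for _ in values]  # an all-zero column stays zero
    denominator = math.log1p(max_value)
    return [math.log1p(v) / denominator for v in values]


column = [0, 8, 2]
scaled = log_scale_column(column)

# The same values computed directly with an explicit base, as in the question.
direct = [math.log(v + 1, max(column) + 1) for v in column]

print(scaled)
print(direct)
```

Both lists agree up to floating-point rounding, so "log1p, then divide by the column maximum" reproduces the per-column log base.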

Count how many matrices have full rank for all submatrices

I would like to count how many m by n matrices whose elements are 1 or -1 have the property that all their floor(m/2)+1 by n submatrices have full rank. My current method is naive and slow, and is shown in the following python/numpy code. It simply iterates over all matrices and tests all the submatrices.
import itertools
import numpy as np
from scipy.special import comb  # scipy.misc.comb has been removed from SciPy
m = 8
n = 4
rowstochoose = m // 2 + 1
# number of row subsets that must all have full rank
maxnumber = comb(m, rowstochoose, exact=True)
required_rank = min(n, rowstochoose)
# lazily generate every m-by-n matrix with entries in {-1, 1}
matrix_g = (np.array(x).reshape(m, n) for x in itertools.product([-1, 1], repeat=m*n))
nofound = 0
for A in matrix_g:
    count = 0
    for rows in itertools.combinations(range(m), rowstochoose):
        if np.linalg.matrix_rank(A[list(rows)]) == required_rank:
            count += 1
        else:
            break  # one rank-deficient submatrix disqualifies A
    if count == maxnumber:
        nofound += 1
print(nofound, 2**(m*n))
Is there a better/faster way to do this? I would like to do this calculation for n and m up to 20 but any significant improvements would be great.
Context. I am interested in getting some exact solutions for https://math.stackexchange.com/questions/640780/probability-that-every-vector-is-not-orthogonal-to-half-of-the-others .
As a data point to compare implementations. n,m = 4,4 should output 26880 . n,m=5,5 is too slow for me to run. For n = 2 and m = 2,3,4,5,6 the outputs should be 8, 0, 96, 0, 1280.
Current status Feb 2, 2014:
The answer of leewangzhong is fast but is not correct for m > n . leewangzhong is considering how to fix it.
The answer of Hooked does not run for m > n .

(Now a partial solution for n = m//2+1, and the requested code.)
Let k := m//2+1
This is somewhat equivalent to asking, "How many collections of m n-dimensional vectors of {-1,1} have no linearly dependent sets of size min(k,n)?"
For those matrices, we know or can assume:
The first entry of every vector is 1 (if not, multiply the whole by -1). This reduces the count by a factor of 2**m.
All vectors in the list are distinct (if not, any submatrix with two identical vectors has non-full rank). This eliminates a lot. There are choose(2**(n-1),m) matrices of distinct vectors.
The list of vectors are sorted lexicographically (rank isn't affected by permutations). So we're really thinking about sets of vectors instead of lists. This reduces the count by a factor of m! (because we require distinctness).
With this, we have a solution for n=4, m=8. There are only eight different vectors with the property that the first entry is positive. There is only one combination (sorted list) of 8 distinct vectors from 8 distinct vectors.
array([[ 1,  1,  1,  1],
       [ 1,  1,  1, -1],
       [ 1,  1, -1,  1],
       [ 1,  1, -1, -1],
       [ 1, -1,  1,  1],
       [ 1, -1,  1, -1],
       [ 1, -1, -1,  1],
       [ 1, -1, -1, -1]], dtype=int8)
100 size-4 combinations from this list have rank 3. So there are 0 matrices with the property.
For a more general solution:
Note that there are 2**(n-1) vectors with first coordinate -1, and choose(2**(n-1),m) matrices to inspect. For n=8 and m=8, there are 128 vectors, and 1.4297027e+12 matrices. It might help to answer, "For i=1,...,k, how many combinations have rank i?"
Alternatively, "What kind of matrices (with the above assumptions) have less than full rank?" A sufficient condition is, "Two columns are multiples of each other". (I originally thought this condition was exact and believed I had tested it for all 4x4, 5x5, and 6x6 matrices, but I must have screwed up the tests.) Since the first column was chosen to be homogeneous, and since all homogeneous vectors are multiples of each other, any submatrix of size k with a homogeneous column other than the first column will have rank less than k.
This is not a necessary condition, though. The following matrix is singular (first plus fourth is equal to third plus second).
array([[ 1,  1,  1,  1,  1],
       [ 1,  1,  1,  1, -1],
       [ 1,  1, -1, -1,  1],
       [ 1,  1, -1, -1, -1],
       [ 1, -1,  1, -1,  1]], dtype=int8)
Since there are only two possible values (-1 and 1), all mxn matrices where m>2, k := m//2+1, n = k and with first column -1 have a majority member in each column (i.e. at least k members are the same). So for n=k, the answer is 0.
For n<=8, here's code to generate the vectors.
from numpy import unpackbits, arange, uint8, int8
#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1),dtype=uint8)) #unpackbits only takes uint
            .reshape((-1,8)) #unpackbits flattens, so we need to shape it to 8 bits
            [:,-n:] #only take the last n bits
            .view(int8) #need signed
    )
Matrix generator:
#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n,m):
    return (array(mat) for mat in combinations(nvectors(n),m))
The following is a function to check that all submatrices of length maxrank have full rank. It stops if any have less than maxrank, instead of checking all combinations.
rankof = np.linalg.matrix_rank
#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix,maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix,maxrank))
Generate all matrices that have all half-matrices with full rank:
def nicematrices(m,n):
    maxrank = min(m//2+1,n)
    return (matr for matr in matrix_g(n,m) if halfrank(matr,maxrank))
Putting it all together:
import numpy as np
from numpy import unpackbits, arange, uint8, int8, array
from itertools import combinations
#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    if n==0:
        return array([])
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1),dtype=uint8)) #unpackbits only takes uint
            .reshape((-1,8)) #unpackbits flattens, so we need to shape it to 8 bits
            [:,-n:] #only take the last n bits
            .view(int8) #need signed
    )
#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n,m):
    return (array(mat) for mat in combinations(nvectors(n),m))
rankof = np.linalg.matrix_rank
#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix,maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix,maxrank))
#generate all matrices that have all half-matrices with full rank
def nicematrices(m,n):
    maxrank = min(m//2+1,n)
    return (matr for matr in matrix_g(n,m) if halfrank(matr,maxrank))
#returns (number of nice matrices, number of all matrices)
def count_nicematrices(m,n):
    from math import factorial
    return (len(list(nicematrices(m,n)))*factorial(m)*2**m, 2**(m*n))
for i in range(0,6):
    print (i, count_nicematrices(i,i))
count_nicematrices(5,5) takes about 15 seconds for me, the vast majority of which is taken by the matrix_rank function.

Since no one's answered yet, here's an answer without code. The useful symmetries that I see are as follows.
Multiply a row by -1.
Multiply a column by -1.
Permute the rows.
Permute the columns.
I would attack this problem by exhaustively generating the non-isomorphs, filtering them, and summing the sizes of their orbits. nauty will be quite useful for the first and third steps. Assuming that most matrices have few symmetries (undoubtedly an excellent assumption for n large, but it's not obvious a priori how large), I would expect 8x8 to be doable, 9x9 to be borderline, and 10x10 to be out of reach.
Expanded pseudocode:
Generate one representative of each orbit of the (m - 1) by (n - 1) 0-1 matrices acted upon by the group generated by row and column permutations, together with the size of the orbit (= (m - 1)! (n - 1)! / the size of the automorphism group). Perhaps the author of the paper that Tim linked would be willing to share his code; otherwise, see below.
For each matrix, replace entries x by (-1)^x. Add one row and one column of 1s. Multiply the size of its orbit by 2^(m + n - 1). This takes care of the sign change symmetries.
Filter the matrices and sum the orbit sizes of the ones that remain. You might save a little computation time here by implementing Gram-Schmidt yourself so that when you try all combinations in lexicographic order there's an opportunity to reuse partial results for the shared prefixes.
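The sign-change part of this bookkeeping can be sanity-checked by brute force on a tiny case. The sketch below covers only the sign symmetries (not the row and column permutations, and not nauty); the function names are made up for illustration. It normalizes every -1/1 matrix so that its first row and first column are all 1 and counts how many matrices share each normalized form.

```python
from itertools import product


def normalize_signs(matrix):
    """Flip column signs, then row signs, so the first row and first column are all 1."""
    col_signs = matrix[0]
    flipped = [[x * s for x, s in zip(row, col_signs)] for row in matrix]
    return tuple(tuple(x * row[0] for x in row) for row in flipped)


def sign_orbit_sizes(m, n):
    """Map each normalized representative to the number of m-by-n matrices that reduce to it."""
    counts = {}
    for entries in product([-1, 1], repeat=m * n):
        matrix = [list(entries[i * n:(i + 1) * n]) for i in range(m)]
        representative = normalize_signs(matrix)
        counts[representative] = counts.get(representative, 0) + 1
    return counts


m, n = 3, 3
sizes = sign_orbit_sizes(m, n)
print(len(sizes), set(sizes.values()))
```

For 3 by 3 this gives 2^((m-1)(n-1)) representatives, each standing for 2^(m+n-1) matrices, which is the multiplier used in the second step above.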
Isomorph-free enumeration:
McKay's template can be used to generate the representatives for (m + 1) by n 0-1 matrices from the representatives for m by n 0-1 matrices, in a manner amenable to depth-first search. With each m by n 0-1 matrix, associate a bipartite graph with m black vertices, n white vertices, and the appropriate edge for each 1 entry. Do the following for each m by n representative.
For each length-n vector, construct the graph for the (m + 1) by n matrix consisting of the representative together with the new vector and run nauty to get a canonical labeling and the vertex orbits.
Filter out the possibilities where the vertex corresponding to the new vector is in a different orbit from the black vertex with the lowest number.
Filter out the possibilities with duplicate canonical labelings.
nauty also computes the orders of automorphism groups.

You will need to rethink this problem from a mathematical point of view. That said even with brute force, there are some programming tricks you can use to speed up the process (as SO is a programming site). Little tricks like not recalculating int(min(n,rowstochoose)) and itertools.combinations(range(m), int(rowstochoose)) can save a few percent - but the real gain comes from memoization. Others have mentioned it, but I thought it might be useful to have a complete, working, code example:
import hashlib
import itertools
import numpy as np
m, n = 4, 4
rowstochoose = m // 2 + 1
required_rank = min(n, rowstochoose)
combo_itr = itertools.product([-1, 1], repeat=m*n)
matrix_itr = (np.array(x, dtype=np.int8).reshape((m, n)) for x in combo_itr)
# a list, not a one-shot iterator: it is reused for every matrix
sub_shapes = [list(rows) for rows in itertools.combinations(range(m), rowstochoose)]
memo = {}  # hash of a submatrix -> its rank
no_found = 0
for A in matrix_itr:
    check = True
    for s in sub_shapes:
        view = A[s]
        h = hashlib.sha1(view.tobytes()).hexdigest()
        if h not in memo:
            memo[h] = np.linalg.matrix_rank(view)
        if memo[h] != required_rank:
            check = False
            break
    if check:
        no_found += 1
print(no_found, 2**(m*n))
This gives a speed gain of almost 10x for the 4x4 case, and you'll see substantial improvements for larger matrices if you care to wait long enough. For larger matrices, where computing the rank is proportionally more expensive, you can also sort the rows of each submatrix before hashing, so that submatrices that differ only by a row permutation (and therefore have the same rank) share one memo entry:
idx = np.lexsort(view.T)
h = hashlib.sha1(view[idx]).hexdigest()
For the 4x4 case this makes it slightly worse, but I expect that to reverse for the 5x5 case.
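When comparing implementations it helps to have a slow but exact reference for tiny sizes. The sketch below is such a cross-check in plain Python: it computes ranks with exact rational arithmetic instead of floating point and brute-forces the count. The function names are made up for illustration, and it is only practical for very small m and n.

```python
from fractions import Fraction
from itertools import combinations, product


def exact_rank(rows):
    """Rank by Gaussian elimination over the rationals (no floating-point tolerance)."""
    mat = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(mat[0]) if mat else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(mat)) if mat[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        mat[rank], mat[pivot] = mat[pivot], mat[rank]
        for r in range(rank + 1, len(mat)):
            factor = mat[r][col] / mat[rank][col]
            if factor:
                mat[r] = [a - factor * b for a, b in zip(mat[r], mat[rank])]
        rank += 1
    return rank


def count_matrices(m, n):
    """Count m-by-n {-1, 1} matrices whose (m//2 + 1)-row submatrices all have full rank."""
    k = m // 2 + 1
    required = min(n, k)
    subsets = list(combinations(range(m), k))
    total = 0
    for entries in product([-1, 1], repeat=m * n):
        rows = [entries[i * n:(i + 1) * n] for i in range(m)]
        if all(exact_rank([rows[i] for i in subset]) == required for subset in subsets):
            total += 1
    return total


for m in (2, 3, 4):
    print(m, count_matrices(m, 2))
```

For n = 2 this can be compared against the data points listed in the question.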

Algorithm 1 - memorizing small ones
I would memorize the smaller matrices that have already been checked.
You could simply write down all smaller matrices in binary format (0 for -1, 1 for 1). Note that the 0/1 coding is only a compact way to store a matrix: the rank must still be computed on the -1/1 matrix, because the two can differ (the image 1001 stands for [[1, -1], [-1, 1]], which is singular, while the same digits read as a 0/1 matrix give the identity). Let us call these codings IMAGES. Using long types you can store matrices of up to 64 cells, so up to 8x8. It is fast. Using strings you can make them as large as you need.
Really, 8x8 is more than enough: in 8GB of memory we can place 1G longs, which is about 2^30, so you can remember matrices of up to about 25-28 elements.
For every size you'll have a set of images:
for 2x2: 1000, 0100, 0010, 0001, 0111, 1011, 1101, 1110.
So, you'll have archive = an NxN array, each element of which will be an ordered list of the binary images of good matrices.
- (for matrix size MxN, where M>=N, the appropriate place in archive will have coordinates M,N. If M
When you are checking a new large matrix, divide it into small ones.
For every small matrix T:
If the appropriate place in the archive for the size of T has no list, create it, fill it with the images of all full-rank matrices of the size of T, and sort the images. If you are out of memory, stop filling the archive.
If T could be in the archive, according to its size:
Make the image of T.
Look for image(T) in the list: if it is there, it is OK; if not, the large matrix should be thrown away.
If T is too big for the archive, check it as you do now.
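A minimal sketch of the image coding in plain Python (0 for -1, 1 for 1, read row by row). The function names and the dictionary-of-sets archive layout are assumptions made for this illustration. Python integers are unbounded, so the 64-cell limit of a long does not apply here.

```python
def matrix_to_image(matrix):
    """Pack a -1/1 matrix into an int, row by row: -1 -> bit 0, 1 -> bit 1."""
    image = 0
    for row in matrix:
        for entry in row:
            image = (image << 1) | (1 if entry == 1 else 0)
    return image


def image_to_matrix(image, m, n):
    """Inverse of matrix_to_image for an m-by-n matrix."""
    bits = [(image >> shift) & 1 for shift in range(m * n - 1, -1, -1)]
    return [[1 if bits[i * n + j] else -1 for j in range(n)] for i in range(m)]


matrix = [[1, -1], [-1, -1]]
image = matrix_to_image(matrix)
text = format(image, "04b")  # the image written as a binary string

# Hypothetical archive layout: shape -> set of images known to be good.
archive = {(2, 2): {image}}
is_known_good = matrix_to_image(matrix) in archive.get((2, 2), set())

print(text, is_known_good, image_to_matrix(image, 2, 2))
```

A set gives constant-time membership tests, which plays the role of the ordered list with binary search described above.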
Algorithm 2 - increasing sizes
The other possibility is to create larger matrices by adding pieces to the lesser ones, already found.
You should decide, up to what size the matrices will grow.
When you find a "correct" matrix of size MxN, try adding a row on top of it. The new matrix needs to be checked only for submatrices that include the new row. The same goes for a new column.
You should set exact algorithm, which sizes are derived from which ones. Thus you can minimize the number of remembered matrices. I thought about that sequence:
Start from 2x2 matrices.
continue with 3x2
4x2, 3x3
5x2, 4x3
6x2, 5x3, 4x4
7x2, 6x3, 5x4
...
So you need to remember only (M+N)/2-1 matrices for searching among sizes MxN.
If, each time we can create a new size from two old ones, we derive it from the more square one, we could also greatly reduce the space needed for remembering matrices: for "long" matrices such as 7x2 we need to remember and check only the last line, 1x2. For 6x3 matrices we should remember their 2x3 stub, and so on.
Also, you don't need to remember the largest matrices, since you won't use them for further counting.
Again, use "images" for remembering the matrices.
