Calculating Confusion matrices - python

I am currently calculating multiple confusion matrices and normalizing them.
for i in range(0,215)
[...]
matrix_confusion[i] = np.asarray(confusion_matrix(Y_test, Y_pred))
matrix_confusion[i] = matrix_confusion[i].astype(float) /
matrix_confusion[i].sum(axis=1)[:,np.newaxis]
The goal is to calculate the mean out of all confusion matrices which are filled in the loop above. The problem is that a lot of matrices are not filled because I am skipping the iterations when a ValueError is raised. So I have some matrices which are empty (prefilled with zeros).
Now I thought about doing the following:
matrix_confusion = matrix_confusion[matrix_confusion!=0]
But this also kills the 0s out of the normalized calculated confusion matrice. How could I proceed if I just want a confusion matrice which represents the mean of all previously filled 2x2 confusion matrices in the loop and to not concider the prefilled ones?
#prefilling
matrix_confusion = np.zeros((200,2,2))
Thanks for your help!

First find the matrices that are not all zeros:
valids = np.logical_or.reduce(matrix_confusion != 0, axis=(1, 2))
Then compute the mean:
matrix_confusion_mean = np.mean(matrix_confusion[valids], axis=0)
You should still be careful that at least some matrix is valid, otherwise you would get a matrix of NaNs. You could do:
if np.any(valids):
matrix_confusion_mean = np.mean(matrix_confusion[valids], axis=0)
else:
matrix_confusion_mean = np.zeros((2, 2))

Related

Understanding the logic behind numpy code for Moore-Penrose inverse

I was going through the book called Hands-On Machine Learning with Scikit-Learn, Keras and Tensorflow and the author was explaining how the pseudo-inverse (Moore-Penrose inverse) of a matrix is calculated in the context of Linear Regression. I'm quoting verbatim here:
The pseudoinverse itself is computed using a standard matrix
factorization technique called Singular Value Decomposition (SVD) that
can decompose the training set matrix X into the matrix
multiplication of three matrices U Σ VT (see numpy.linalg.svd()). The
pseudoinverse is calculated as X+ = V * Σ+ * UT. To compute the matrix
Σ+, the algorithm takes Σ and sets to zero all values smaller than a
tiny threshold value, then it replaces all nonzero values with their
inverse, and finally it transposes the resulting matrix. This approach
is more efficient than computing the Normal equation.
I've got an understanding of how the pseudo-inverse and SVD are related from this post. But I'm not able to grasp the rationale behind setting all values less than the threshold to zero. The inverse of a diagonal matrix is obtained by taking the reciprocals of the diagonal elements. Then small values would be converted to large values in the inverse matrix, right? Then why are we removing the large values?
I went and looked into the numpy code, and it looks like follows, just for reference:
#array_function_dispatch(_pinv_dispatcher)
def pinv(a, rcond=1e-15, hermitian=False):
a, wrap = _makearray(a)
rcond = asarray(rcond)
if _is_empty_2d(a):
m, n = a.shape[-2:]
res = empty(a.shape[:-2] + (n, m), dtype=a.dtype)
return wrap(res)
a = a.conjugate()
u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)
# discard small singular values
cutoff = rcond[..., newaxis] * amax(s, axis=-1, keepdims=True)
large = s > cutoff
s = divide(1, s, where=large, out=s)
s[~large] = 0
res = matmul(transpose(vt), multiply(s[..., newaxis], transpose(u)))
return wrap(res)
It's almost certainly an adjustment for numerical error. To see why this might be necessary, look what happens when you take the svd of a rank-one 2x2 matrix. We can create a rank-one matrix by taking the outer product of a vector like so:
>>> a = numpy.arange(2) + 1
>>> A = a[:, None] * a[None, :]
>>> A
array([[1, 2],
[2, 4]])
Although this is a 2x2 matrix, it only has one linearly independent column, and so its rank is one instead of two. So we should expect that when we pass it to svd, one of the singular values will be zero. But look what happens:
>>> U, s, V = numpy.linalg.svd(A)
>>> s
array([5.00000000e+00, 1.98602732e-16])
What we actually get is a singular value that is not quite zero. This result is inevitable in many cases given that we are working with finite-precision floating point numbers. So although the problem you have identified is a real one, we will not be able to tell in practice the difference between a matrix that really has a very small singular value and a matrix that ought to have a zero singular value but doesn't. Setting small values to zero is the safest practical way to handle that problem.

Can covariance of A be used to calculate A'*A?

I am doing a benchmarking test in python on different ways to calculate A'*A, with A being a N x M matrix. One of the fastest ways was to use numpy.dot().
I was curious if I can obtain the same result using numpy.cov() (which gives the covariance matrix) by somehow varying the weights or by somehow pre-processing the A matrix ? But I had no success. Does anyone know if there is any relation between the product A'*A and covariance of A, where A is a matrix with N rows/observations and M columns/variables?
Have a look at the cov source. Near the end of the function it does this:
c = dot(X, X_T.conj())
Which is basically the dot product you want to perform. However, there are all kinds of other operations: checking inputs, subtracting the mean, normalization, ...
In short, np.cov will never ever be faster than np.dot(A.T, A) because internally it contains exactly that operation.
That said - the covariance matrix is computed as
Or in Python:
import numpy as np
a = np.random.rand(10, 3)
m = np.mean(a, axis=0, keepdims=True)
x = np.dot((a - m).T, a - m) / (a.shape[0] - 1)
y = np.cov(a.T)
assert np.allclose(x, y) # check they are equivalent
As you can see, the covariance matrix is equivalent to the raw dot product if you subtract the mean of each variable and divide the result by the number of samples (minus one).

Saving confusion matrix

Is there any possibility to save the confusion matrix which is generated by sklearn.metrics?
I would like to save multiple results of different classification algorithms in an array or maybe a pandas data frame so I can show which algorithm works best.
print('Neural net: \n',confusion_matrix(Y_test, Y_pred), sep=' ')
How could I save the generated confusion matrix within a loop? (I am training over a set of 200 different target variables)
array[i] = confusion_matrix(Y_test,Y_pred)
I run into some definition problems here [array is not defined whereas in the non [i] - version it runs smoothly]
Additionally, I am normalizing the confusion matrix. How could I print out the average result of the confusion matrix after the whole loop? (average of the 200 different confusion matrices)
I am not that fluent with python yet.
First getting to array not defined problem.
In python list is declared as :
array=[]
Since size of list is not given during declaration, no space is allocated. Hence we can't assign values the place which is not allocated.
array[i]=some value, but no space is allocated for array
So if you know the required size of array,fill zeroes during declaration and the use array this way or use array.append() method inside the loop.
Now for saving confusion matrix:
Since confusion matrix returns 2-D array and you need to save multiple such arrays, use 3-D array for saving the value.
import numpy as np
matrix_result=np.zeroes((200,len(y_pred),len(y_pred)))
for i in range(200):
matrix_result[i]=confusion_matrix(X_pred,y_pred)
For averaging
matrix_result_average=matrix_result.mean(axis=0)
I'm not sure what you mean by training over a set of target variables (please elaborate), but here is a start at averaging over confusion matrices, using numpy.
First an empty result matrix is created, which is three-dimensional and the size of 200 stacked confusion matrices. These are then filled in one-by-one in the for-loop. Finally the resulting matrix is averaged along the dimension of the targets, resulting in the average confusion matrix.
import numpy as np
N = len(Y_pred)
result = np.zeros((len(targets), N, N))
for i, target in enumerate(targets):
result[i] = confusion_matrix(Y_test, Y_pred) # do someting with target?
print(result.mean(axis=0))

Jaccard's distance matrix with tensorflow

I would like to compute a distance matrix using the Jaccard distance. And do so as fast as possible. I used to use scikit-learn's pairwise_distances function. But scikit-learn doesn't plan to support GPU, and there's even a known bug that makes the function slower when run in parallel.
My only constraint is that the resulting distance matrix can then be fed to scikit-learn's DBSCAN clustering algorithm. I was thinking about implementing the computation with tensorflow but couldn't find a nice and simple way to do it.
PS: I have reasons to precompute the distance matrix instead of letting DBSCAN do it as needed.
Hej I was facing the same problem.
Given the idea that the jaccard similarity is the ratio of true postives (tp) to the sum of true positives, false negatives (fn) and false positives (fp), I came up with this solution:
def jaccard_distance(self):
tp = tf.reduce_sum(tf.mul(self.target, self.prediction), 1)
fn = tf.reduce_sum(tf.mul(self.target, 1-self.prediction), 1)
fp = tf.reduce_sum(tf.mul(1-self.target, self.prediction), 1)
return 1 - (tp / (tp + fn + fp))
Hope this helps!
I am not a tensorflow expert, but here is the solution I got. As far as I know, the only ways in tensorflow to do a computation on all-pairs of a list is to do a matrix multiplication or use the broadcasting rules, this solution uses both at some point.
So let's assume we have an input boolean matrix of n_samples rows, one per set, and n_features columns, one per possible element. A value True in the i-th row, j-th column means the i-th set contains the element j. Just like scikit-learn's pairwise_distances expect. We can then proceed as follow.
Cast the matrix to numbers, getting 1 for True and 0 for False.
Multiply the matrix by its own transpose. This produce a matrix where each element M[i][j] contains size of the intersection between the i-th and j-th sets.
Compute a cardv vector that contains the cardinality of all the sets by summing the input matrix by rows.
Make a row and a column vector from cardv.
Compute 1 - M / (cardvrow + cardvcol - M). The broadcasting rules will do all the work when adding a row and a column vector.
This algorithm as a whole seems a bit hack-ish, but it works and produce results within a reasonable margin from the result computed by scikit-learn's pairwise_distances function. A better algorithm should probably make a single pass on every pair of input vectors and compute only half of the matrix as it is symmetric. Any improvement is welcome.
setsin = tf.placeholder(tf.bool, shape=(N, M))
sets = tf.cast(setsin, tf.float16)
mat = tf.matmul(sets, sets, transpose_b=True, name="Main_matmul")
#mat = tf.cast(mat, tf.float32, name="Upgrade_mat")
#sets = tf.cast(sets, tf.float32, name="Upgrade_sets")
cardinal = tf.reduce_sum(sets, 1, name="Richelieu")
cardinalrow = tf.expand_dims(cardinal, 0)
cardinalcol = tf.expand_dims(cardinal, 1)
mat = 1 - mat / (cardinalrow + cardinalcol - mat)
I used float16 type as it seems much faster than float32. Casting to float32 might only be useful if the cardinals are large enough to make them inaccurate or if more precision is needed when performing the division. But even when the casts are needed, it seems to be still relevant to do the matrix multiplication as float16.

What's wrong with my PCA?

My code:
from numpy import *
def pca(orig_data):
data = array(orig_data)
data = (data - data.mean(axis=0)) / data.std(axis=0)
u, s, v = linalg.svd(data)
print s #should be s**2 instead!
print v
def load_iris(path):
lines = []
with open(path) as input_file:
lines = input_file.readlines()
data = []
for line in lines:
cur_line = line.rstrip().split(',')
cur_line = cur_line[:-1]
cur_line = [float(elem) for elem in cur_line]
data.append(array(cur_line))
return array(data)
if __name__ == '__main__':
data = load_iris('iris.data')
pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on wikipedia, I'm supposed to this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues and not 150 like the eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add to be equal to the number of dimensions, in this case, 4. What I don't understand is why this difference is happening. If I simply divided the eigenvalues by len(data) I could get the result I want, but I don't understand why. Either way the proportion of the eigenvalues isn't altered, but they are important to me so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues
of the covariance matrix, not the data itself. The covariance matrix, created from an m x n data matrix, will be an m x m matrix with ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA
# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5)
# usually a good idea to mean center your data first:
data -= NP.mean(data, axis=0)
# calculate the covariance matrix
C = NP.corrcoef(data, rowvar=0)
# returns an m x m matrix, or here a 5 x 5 matrix)
# now get the eigenvalues/eigenvectors of C:
eval, evec = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD,
though, you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's)
LA module--it is a little easier to work with than svd, the return values are the eigenvectors
and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these these directly.
Granted the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however when doing PCA, you'll always have a square matrix to decompose,
regardless of the form that your data is in. This is obvious because the matrix you
are decomposing in PCA is a covariance matrix, which by definition is always square
(i.e., the columns are the individual data points of the original matrix, likewise
for the rows, and each cell is the covariance of those two points, as evidenced
by the ones down the main diagonal--a given data point has perfect covariance with itself).
The left singular values returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a dataset A is : 1/(N-1) * AA^T
Now, when you do PCA by using the SVD, you have to divide each entry in your A matrix by (N-1) so you get the eigenvalues of the covariance with the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy.
This is explained in detail here
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
Question already adressed: Principal component analysis in Python

Categories