Python one-liner for a confusion/contingency matrix needed - python

I want to write a one-liner to calculate a confusion/contingency matrix M (a square matrix whose dimensions both equal the number of classes) that counts the cases presented in two vectors of length n: Ytrue and Ypredicted. Obviously the following does not work using Python and NumPy:
error = N.array([error[x,y]+1 for x, y in zip(Ytrue,Ypredicted)]).reshape((n,n))
Any hints for creating a one-liner confusion matrix calculator?

error = N.array([zip(Ytrue,Ypred).count(x) for x in itertools.product(classes,repeat=2)]).reshape(n,n)
or
error = N.array([z.count(x) for z in [zip(Ytrue,Ypred)] for x in itertools.product(classes,repeat=2)]).reshape(n,n)
The latter being more efficient but possibly more confusing.
import numpy as N
import itertools
Ytrue = [1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3]
Ypred = [1,1,2,1,2,1,3,1,
2,2,2,2,2,2,2,2,
3,3,2,2,2,1,1,1]
classes = list(set(Ytrue))
n = len(classes)
error = N.array([zip(Ytrue,Ypred).count(x) for x in itertools.product(classes,repeat=2)]).reshape(n,n)
print error
error = N.array([z.count(x) for z in [zip(Ytrue,Ypred)] for x in itertools.product(classes,repeat=2)]).reshape(n,n)
print error
Which produces
[[5 2 1]
[0 8 0]
[3 3 2]]
[[5 2 1]
[0 8 0]
[3 3 2]]
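The snippets above rely on Python 2, where zip returns a list (so .count is available) and print is a statement. A minimal Python 3 sketch of the same idea follows; it swaps the repeated .count calls for a collections.Counter, which is my substitution and not part of the original answer:

```python
from collections import Counter
from itertools import product

import numpy as np

Ytrue = [1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3]
Ypred = [1, 1, 2, 1, 2, 1, 3, 1,
         2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 2, 2, 2, 1, 1, 1]

classes = sorted(set(Ytrue))
n = len(classes)

# Count every (true, predicted) pair once, then read the counts in row-major order.
pair_counts = Counter(zip(Ytrue, Ypred))
error = np.array([pair_counts[pair] for pair in product(classes, repeat=2)]).reshape(n, n)
print(error)
```

Counter returns 0 for pairs that never occur, so no special handling of empty cells is needed.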

If your NumPy version is 1.6 or newer, Ytrue and Ypred are NumPy arrays, and the class labels are the integers 1 to n, this code works:
np.bincount(n * (Ytrue - 1) + (Ypred -1), minlength=n*n).reshape(n, n)
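As a hedged illustration of the bincount approach, the sketch below runs it on the sample data from the earlier answer. The mapping of labels to 0..n-1 with np.unique and np.searchsorted is my addition, so that labels need not be exactly 1..n; it assumes every predicted label also appears in Ytrue.

```python
import numpy as np

Ytrue = np.array([1, 1, 1, 1, 1, 1, 1, 1,
                  2, 2, 2, 2, 2, 2, 2, 2,
                  3, 3, 3, 3, 3, 3, 3, 3])
Ypred = np.array([1, 1, 2, 1, 2, 1, 3, 1,
                  2, 2, 2, 2, 2, 2, 2, 2,
                  3, 3, 2, 2, 2, 1, 1, 1])

classes = np.unique(Ytrue)          # sorted unique labels
n = len(classes)

# Map labels to 0..n-1 (assumption: every label in Ypred also occurs in Ytrue).
t = np.searchsorted(classes, Ytrue)
p = np.searchsorted(classes, Ypred)

# Each (true, predicted) pair becomes one flat index into an n*n table.
confusion = np.bincount(n * t + p, minlength=n * n).reshape(n, n)
print(confusion)
```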

Related

Fail to overwrite a 2D numpy.ndarray in a loop

I found that my program fails to overwrite an np.ndarray (the X variable) inside a for loop with an assignment statement like "X[i] = another np.ndarray with matching shape". I have no idea how this could happen...
Code:
import numpy as np

def qr_tridiagonal(T: np.ndarray):
    m, n = T.shape
    X = T.copy()
    Qt = np.identity(m)
    for i in range(n-1):
        ai = X[i, i]
        ak = X[i+1, i]
        c = ai/(ai**2 + ak**2)**.5
        s = ak/(ai**2 + ak**2)**.5
        # Givens rotation
        tmp1 = c*X[i] + s*X[i+1]
        tmp2 = c*X[i+1] - s*X[i]
        print("tmp1 before:", tmp1)
        print("X[i] before:", X[i])
        X[i] = tmp1
        X[i+1] = tmp2
        print("tmp1 after:", tmp1)
        print("X[i] after:", X[i])
        print()
    print(X)
    return Qt.T, X

A = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]])
Q, R = qr_tridiagonal(A)
Output (the first 4 lines):
tmp1 before: [1.41421356 1.41421356 0.70710678 0. ]
X[i] before: [1 1 0 0]
tmp1 after: [1.41421356 1.41421356 0.70710678 0. ]
X[i] after: [1 1 0 0]
Although tmp1 is assigned to X[i], the values in X[i] (or X[i, :]) remain unchanged. I hope somebody can help me out.
Other info: the above is a function to compute QR factorization for tridiagonal matrices using Givens Rotation.
I did check that assigning a constant value to X[i] works, e.g. after X[i] = 10 the printed results match the statement. But X[i] = someArray fails in my code. I am not sure whether this is a particular issue triggered by the algorithm I am implementing above, because I have never run into this before.
I also tried installing fresh environments with conda to make sure that my Python installation is not the problem. The strange output above should be reproducible on other machines.
Many thanks to @hpaulj
It turns out to be a datatype problem. The program is fine, but the input dtype is int, so the intermediate float results are truncated when they are stored back into X.
A lesson learned: be aware of the dtype of np.ndarray!
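The truncation can be reproduced in isolation. The sketch below is only an illustration (the variable names are mine): it applies the first Givens rotation to the first two rows of the example matrix, once with an integer array and once with a float copy.

```python
import numpy as np

rows = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 0]])          # dtype is int, as in the question

c = s = 0.5 ** 0.5                       # rotation coefficients for ai = ak = 1
rotated = c * rows[0] + s * rows[1]      # float result: [1.414..., 1.414..., 0.707..., 0.]

X_int = rows.copy()
X_int[0] = rotated                       # floats are truncated to fit the int array

X_float = rows.astype(float)             # converting first keeps the fractional part
X_float[0] = rotated

print(X_int[0])     # looks unchanged because 1.414 -> 1 and 0.707 -> 0
print(X_float[0])
```

In the original function, one way to avoid the problem would be X = T.astype(float) instead of X = T.copy().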

Vectorizing matrix operations using numpy

I wrote the code below that uses a for loop. I would like to ask if there is a way to vectorize the operation within the second for loop since I intend to work with larger matrices.
import numpy as np
num = 5
A = np.array([[1,2,3,4,5], [4,5,6,4,5], [7,8,9,4,5], [10,11,12,4,5], [13,14,15,4,5]])
sm_factor = np.array([0.1 ,0.1, 0.1, 0.1, 0.1])
d2m = np.zeros((num, num))
d2m[0, 0] = 2
d2m[0, 1] = -5
d2m[0, 2] = 4
d2m[0, 3] = -1
for k in range(1, num-1):
    d2m[k, k-1] = 1
    d2m[k, k] = -2
    d2m[k, k+1] = 1
d2m[num-1, num-4] = -1
d2m[num-1, num-3] = 4
d2m[num-1, num-2] = -5
d2m[num-1, num-1] = 2
x_smf = 0
for i in range(len(sm_factor)):
    x_smf = x_smf + sm_factor[i] * (d2m @ (A[i, :]).T).T @ (d2m @ (A[i, :]).T)
x_smf
# 324.0
You can avoid loops for both the creation of the d2m matrix and the computation of x_smf. Use sps.diags to create a sparse tridiagonal matrix, then cast it to an array so that you can edit the first and last rows. Your code will look like this (note that the result of diags is cast to a dense ndarray using the scipy.sparse.dia_matrix.toarray method):
import numpy as np
import scipy.sparse as sps
# Dense tridiagonal matrix
d2m = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).toarray() # cast to array
# First line boundary conditions
d2m[0, 0] = 2
d2m[0, 1] = -5
d2m[0, 2] = 4
d2m[0, 3] = -1
# Last line boundary conditions
d2m[num-1, num-4] = -1
d2m[num-1, num-3] = 4
d2m[num-1, num-2] = -5
d2m[num-1, num-1] = 2
The solution proposed by Valdi_Bo enables you to remove the second FOR loop:
x_smf = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))
However, I want to draw your attention to the fact that the d2m matrix is sparse, and storing it as a dense ndarray is bad for both computation time and memory usage. Instead of casting to a dense ndarray, I advise you to cast to a sparse matrix format. For example lil_matrix, a list-of-lists sparse format, using the tolil() method instead of toarray():
# Sparse tridiagonal matrix
d2m_s = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).tolil() # cast to lil
Here is a script that compares the three implementations on a bigger case, num=4000 (for num=5 all three give 324). At this size I already see a benefit from using a sparse matrix. The first lines generalise the original code to values of num other than 5:
from time import time
import numpy as np
import scipy.sparse as sps
num = 4000
A = np.concatenate([np.arange(1, (num-2)*num+1).reshape(num, num-2), np.repeat([[4, 5]], num, axis=0)], axis=1)
sm_factor = 0.1*np.ones(num)
########## DENSE matrix + FOR loop ##########
d2m = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).toarray() # cast to array
# First line boundary conditions
d2m[0, 0] = 2
d2m[0, 1] = -5
d2m[0, 2] = 4
d2m[0, 3] = -1
# Last line boundary conditions
d2m[num-1, num-4] = -1
d2m[num-1, num-3] = 4
d2m[num-1, num-2] = -5
d2m[num-1, num-1] = 2
# FOR loop version
t_start = time()
x_smf = 0
for i in range(len(sm_factor)):
    x_smf = x_smf + sm_factor[i] * (d2m @ (A[i, :]).T).T @ (d2m @ (A[i, :]).T)
print(f'FOR loop version time: {time()-t_start}s')
print(f'FOR loop version value: {x_smf}\n')
########## DENSE matrix + VECTORIZED ##########
t_start = time()
x_smf_v = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))
print(f'VECTORIZED version time: {time()-t_start}s')
print(f'VECTORIZED version value: {x_smf_v}\n')
########## SPARSE matrix + VECTORIZED ##########
d2m_s = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).tolil() # cast to lil
# First line boundary conditions
d2m_s[0, 0] = 2
d2m_s[0, 1] = -5
d2m_s[0, 2] = 4
d2m_s[0, 3] = -1
# Last line boundary conditions
d2m_s[num-1, num-4] = -1
d2m_s[num-1, num-3] = 4
d2m_s[num-1, num-2] = -5
d2m_s[num-1, num-1] = 2
t_start = time()
x_smf_s = np.sum(sm_factor * np.square(d2m_s @ A.T).sum(axis=0))
print(f'SPARSE+VECTORIZED version time: {time()-t_start}s')
print(f'SPARSE+VECTORIZED version value: {x_smf_s}\n')
Here is what I get when running the code:
FOR loop version time: 25.878241777420044s
FOR loop version value: 3.752317536763356e+17
VECTORIZED version time: 1.0873610973358154s
VECTORIZED version value: 3.752317536763356e+17
SPARSE+VECTORIZED version time: 0.37279224395751953s
SPARSE+VECTORIZED version value: 3.752317536763356e+17
As you can see, using a sparse matrix gains roughly another factor of 3 in computation time and doesn't require you to adapt the code that comes afterwards. It is also a good idea to test the other scipy sparse matrix formats (tocsc(), tocsr(), todok() etc.); some may be better suited to your case.
After some research and printout of intermediate results of your loop,
I found the solution:
x_smf = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))
The result is:
324.0
By the way: creation of d2m can be shortened to:
d2m = np.zeros((num, num), dtype='int')
d2m[0, :4] = [ 2, -5, 4, -1]
for k in range(1, num-1):
    d2m[k, k-1:k+2] = [ 1, -2, 1]
d2m[-1, -4:] = [-1, 4, -5, 2]
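To see why the vectorized expression matches the loop: each loop iteration computes v = d2m @ A[i] and adds sm_factor[i] * (v @ v), i.e. the weighted sum of squares of v. The columns of d2m @ A.T are exactly those vectors v, so squaring and summing over axis 0 gives all the per-row terms at once. A small self-contained check on the data from the question (a sketch; the loop is rewritten slightly for readability):

```python
import numpy as np

num = 5
A = np.array([[1, 2, 3, 4, 5], [4, 5, 6, 4, 5], [7, 8, 9, 4, 5],
              [10, 11, 12, 4, 5], [13, 14, 15, 4, 5]])
sm_factor = np.full(num, 0.1)

# Second-difference matrix with one-sided stencils in the first and last rows.
d2m = np.zeros((num, num))
d2m[0, :4] = [2, -5, 4, -1]
for k in range(1, num - 1):
    d2m[k, k-1:k+2] = [1, -2, 1]
d2m[-1, -4:] = [-1, 4, -5, 2]

# Loop version: one matrix-vector product per row of A.
x_loop = 0.0
for i in range(num):
    v = d2m @ A[i]
    x_loop += sm_factor[i] * (v @ v)

# Vectorized version: column i of (d2m @ A.T) is d2m @ A[i].
x_vec = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))

print(x_loop, x_vec)
```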

Python integer and float multiplication error

This question may seem silly, but I cannot get it right. The output cm1 is expected to contain floats, but I only get zeros and ones.
import numpy as np
import scipy.spatial.distance
sim = scipy.spatial.distance.cosine
a = [2, 3, 1]
b = [3, 1, 2]
c = [1, 2, 6]
cm0 = np.array([a,b,c])
ca, cb, cc = 0.9, 0.7, 0.4
cr = np.array([ca, cb, cc])
cm1 = np.empty_like(cm0)
for i in range(3):
    for j in range(3):
        cm1[i,j] = cm0[i,j] * cr[i] * cr[j]
print(cm1)
And I get:
[[1 1 0]
[1 0 0]
[0 0 0]]
empty_like() matches the dtype of the given NumPy array by default, as hpaulj suggested in the comments. In your case cm0 has an integer dtype.
The empty_like function accepts several other arguments though, one of which is dtype. Setting dtype to float should solve the problem:
cm1 = np.empty_like(cm0, dtype=float)
In addition, NumPy truncates floating-point numbers toward zero when storing them in an integer array. In your case every product lies between 0.36 and 1.89, so the truncated results are 0s and 1s.
As @hpaulj said in the comments section, the problem is that empty_like keeps the dtype of cm0. To solve it, try:
cm1 = np.empty_like(cm0, dtype=float)
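As a side note, the double loop can be avoided altogether. The sketch below is an alternative that is not part of the original answers: np.outer(cr, cr) builds the matrix of products cr[i] * cr[j], and multiplying it elementwise with the integer cm0 yields a float result, so no preallocated array (and no dtype mismatch) is involved.

```python
import numpy as np

cm0 = np.array([[2, 3, 1],
                [3, 1, 2],
                [1, 2, 6]])
cr = np.array([0.9, 0.7, 0.4])

# outer[i, j] == cr[i] * cr[j]; int * float is promoted to float.
cm1 = cm0 * np.outer(cr, cr)
print(cm1.dtype)
print(cm1)
```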

Calculate mean, variance, covariance of different length matrices in a split list

I have an array whose rows hold 5 entries: 4 values and one index. I sort and split the array along the index. This gives me splits (matrices) of different lengths. From here on I want to calculate, for every split, the mean and variance of the fourth value and the covariance of the first 3 values. My current approach works with a for loop, which I would like to replace with matrix operations, but I am struggling with the different sizes of my matrices.
import numpy as np
A = np.random.rand(10,5)
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)
My current for loop looks like this:
result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if len(values) > 0:
        result[idx, 0] = np.mean(values[:,3])
        result[idx, 1] = np.var(values[:,3])
        result[idx, 2:5] = np.cov(values[:,0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:,3]
I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.
You can use np.add.reduceat as follows:
>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>>
>>> np.allclose(result, result2)
True
Note that the diagonal entries of the covariance matrix are just the variances, which simplifies this vectorization quite a bit.
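The same idea as a self-contained script, for anyone who wants to run it outside the interpreter session. This is a sketch under a few assumptions of mine: a seeded random generator for reproducibility, all four value columns handled in one reduceat call, and variances computed as E[x^2] - E[x]^2 (which can lose precision when the mean is large relative to the spread).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 5))
A[:, -1] = rng.integers(4, size=10)           # last column is the group index
sorted_A = A[np.argsort(A[:, 4])]

# Start offset of every group in the sorted array, and the group sizes.
starts = np.concatenate([[0], np.flatnonzero(np.diff(sorted_A[:, 4])) + 1])
counts = np.diff(np.append(starts, len(sorted_A)))

# Per-group sums and sums of squares for the four value columns.
sums = np.add.reduceat(sorted_A[:, :4], starts, axis=0)
sq_sums = np.add.reduceat(sorted_A[:, :4] ** 2, starts, axis=0)

means = sums / counts[:, None]
variances = sq_sums / counts[:, None] - means ** 2   # population variance (ddof=0)

# Reference values from an explicit loop over the groups.
groups = np.split(sorted_A, starts[1:])
loop_means = np.array([g[:, :4].mean(axis=0) for g in groups])
loop_vars = np.array([g[:, :4].var(axis=0) for g in groups])
print(np.allclose(means, loop_means), np.allclose(variances, loop_vars))
```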

Dynamically obtain a vector along 1 dimension in 'n' dimensional numpy array

I have a numpy variable that can be 'n' dimensions, for example:
game_board = np.zeros((4,3,3), dtype=np.int8)
I want to obtain a vector along the first dimension based on a vector choose_vector
choose_vector = np.array([x,y],dtype=np.int8)
I know how I can do this statically:
game_board[:, x, y]
# will return [0,0,0,0], the (x,y)th element from 1st dimension
but everything I have tried so far doing this using the choose_vector has not worked:
game_board[:, choose_vector]
# returns
[[[0 0 0]
[0 0 0]]
[[0 0 0]
[0 0 0]]
[[0 0 0]
[0 0 0]]
[[0 0 0]
[0 0 0]]]
print(game_board[choose_vector])
# returns
[[0,0,0]]
How do I construct the index for game_board from choose_vector in order to get the same result as game_board[:, x, y]?
I'd then like to extend this to a game board of any dimension, but I can probably work that out if I know how to do the above :)
This might not be the cleanest solution, but it does what you want:
import numpy as np
x,y = 0,0
game_board = np.zeros((4,3,3), dtype=np.int8)
choose_vector = np.array([x, y], dtype=np.uint8)
game_board[[np.newaxis] + choose_vector.tolist()]
The trick is, that you can "replace" the : in your static approach with a np.newaxis inside of a python list.
I have worked this out with help from FlashTek. Instead of np.newaxis, slice(None) seems to be a drop-in replacement for : (the index is wrapped in tuple() because newer NumPy versions no longer treat a plain list as a multidimensional index):
import numpy as np
x,y = 0,0
game_board = np.zeros((4,3,3), dtype=np.int8)
choose_vector = np.array([x, y], dtype=np.uint8)
game_board[tuple([slice(None)] + choose_vector.tolist())]
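For the n-dimensional case, a minimal sketch (my own, with a non-zero board so the result is visible): build the index as a tuple, with slice(None) standing in for : on the first axis and one integer per remaining axis. Note that np.newaxis (which is None) inserts a new axis of length 1 rather than selecting everything along an existing one, so it is not equivalent to :.

```python
import numpy as np

game_board = np.arange(36, dtype=np.int8).reshape(4, 3, 3)
choose_vector = np.array([1, 2], dtype=np.uint8)

# slice(None) is what ":" means inside an index; tolist() gives plain Python ints.
index = (slice(None),) + tuple(choose_vector.tolist())
column = game_board[index]
print(column)                      # same as game_board[:, 1, 2]

# The same construction works for any number of trailing dimensions.
board_4d = np.arange(48).reshape(2, 2, 3, 4)
choice_3 = np.array([1, 2, 3])
picked = board_4d[(slice(None),) + tuple(choice_3.tolist())]

# np.newaxis adds an axis instead of selecting along the first one.
with_newaxis = game_board[(np.newaxis,) + tuple(choose_vector.tolist())]
print(with_newaxis.shape)
```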
