Calculating a Confusion Matrix by Using an Array of Arrays - Python

I am using the transformers and datasets libraries to train a multi-class NLP model on a specific dataset, and I need to get an idea of how my model performs for each label. So, I'd like to calculate the confusion matrix. I have 4 labels. My result.prediction looks like
array([[ -6.906 ,  -8.11  , -10.29  ,   6.242 ],
       [ -4.51  ,   3.705 ,  -9.76  ,  -7.49  ],
       [ -6.734 ,   3.36  , -10.27  ,  -6.883 ],
       ...,
       [  8.41  ,  -9.43  ,  -9.45  ,  -8.6   ],
       [  1.3125,  -3.094 , -11.016 ,  -9.31  ]],
       [ -7.152 ,  -8.5   ,  -9.13  ,   6.766 ]], dtype=float16)
Here, when a predicted value is positive the model predicts 1 for that label, otherwise it predicts 0. Next, my result.label_ids looks like
array([[0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.]], dtype=float32)
As you can see, the model returns an array of 4 values per example, with 0 for the false labels and 1 for the true label.
In general, I've been using the following function to calculate a confusion matrix, but in this case it didn't work, since this function expects 1-dimensional arrays.
import numpy as np

def compute_confusion_matrix(labels, true, pred):
    K = len(labels)  # number of classes
    result = np.zeros((K, K))
    for i in range(len(true)):  # one increment per (true, predicted) pair
        result[true[i]][pred[i]] += 1
    return result
If possible, I'd like to modify this function to suit my case above. At the very least, I would like to understand how I can compute a confusion matrix for results that come in the form of multi-dimensional arrays.

A possibility could be to reverse the encoding back to the format required by compute_confusion_matrix; this way, it is still possible to use your function!
To convert them, it's possible to do (shown here for result.label_ids):
pred = list(np.where(result.label_ids == 1.)[1])
where np.where(result.label_ids == 1.)[1] is a 1-dimensional numpy array containing the index of the 1. in each row of result.label_ids.
So pred will look like this, according to your result.label_ids:
[3, 0, 3, ..., 0, 0, 3]
so it has the same format as the original true (if true is also one-hot encoded, the same strategy can be used to convert it) and can be used as input to your function for computing the confusion matrix.
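A minimal sketch of the full conversion, assuming the prediction output exposes label_ids and the per-class logits shown in the question (adjust the attribute names to whatever your result object actually uses):

import numpy as np

labels = [0, 1, 2, 3]
true = list(np.where(result.label_ids == 1.)[1])    # position of the 1. in each row
pred = list(np.argmax(result.predictions, axis=1))  # index of the largest logit per row
cm = compute_confusion_matrix(labels, true, pred)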

First of all, I would like to thank Nicola Fanelli for the idea.
The function I gave above, as well as sklearn.metrics.confusion_matrix(), both need to be provided with a list of predicted and true values. After my prediction step, I tried to retrieve my true and predicted values in order to calculate a confusion matrix. The results I was getting were in the following form
array([[0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.]], dtype=float32)
Here the idea is to retrieve the positional index of the value 1. When I tried the approach suggested by Nicola Fanelli, the resulting sizes were smaller than the initial ones and they didn't match, so the confusion matrix could not be calculated. To be honest, I couldn't find the reason behind it, but I'll investigate it more later.
So, I used a different technique to implement the same idea. I used np.argmax() and appended these positions to a new list. Here is the code sample for the true values:
true = []
for i in range(len(result.label_ids)):
    n = np.array(result.label_ids[i])
    true.append(np.argmax(n))
This way I got the results in the desired format without the sizes being changed.
Even though this is a working solution for my problem, I am still open to more elegant ways to approach it.
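For reference, a more compact variant of the same idea, sketched under the assumption that the prediction output exposes predictions and label_ids as above and that scikit-learn is available:

import numpy as np
from sklearn.metrics import confusion_matrix

true = np.argmax(result.label_ids, axis=1)    # class index of each one-hot row
pred = np.argmax(result.predictions, axis=1)  # class with the largest logit per row
cm = confusion_matrix(true, pred)             # rows are true classes, columns are predicted classes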

Related

How to delete multiple values from a numpy matrix at low computational cost

I've recently been trying my hand at numpy, and I'm looking for a way to delete the rows of the matrix whose value in column 2 is equal to one of the values stored in the variable element.
Since I have a large amount of data, I would like to know if there is a more efficient method that takes less time to execute than the classic for loop.
I enclose an example:
element = [85., 222., 166., 238.]
matrix = [[228., 1., 222.],
          [140., 0., 85.],
          [140., 0., 104.],
          [230., 0., 217.],
          [115., 1., 250.],
          [12., 1., 166.],
          [181., 1., 238.]]
The expected output:
matrix = [[140., 0., 104.],
          [230., 0., 217.],
          [115., 1., 250.]]
The method I used is the following:
for y in element:
    matrix = matrix[(matrix[:, 2] != y)]
When running it on a large amount of data it takes a long time. Is there a more efficient approach, so that I can save on execution time?
Since you tagged numpy, I'd assume matrix is a numpy array. With that, you can use np.isin for your purpose:
matrix = np.array(matrix)
matrix[~np.isin(matrix[:, 2], element)]
Output:
array([[140.,   0., 104.],
       [230.,   0., 217.],
       [115.,   1., 250.]])
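For completeness, a self-contained sketch of the same idea using the data from the question; the single vectorized mask avoids rebuilding the array once per value in element:

import numpy as np

element = [85., 222., 166., 238.]
matrix = np.array([[228., 1., 222.],
                   [140., 0., 85.],
                   [140., 0., 104.],
                   [230., 0., 217.],
                   [115., 1., 250.],
                   [12., 1., 166.],
                   [181., 1., 238.]])

# build one boolean mask over column 2 in a single pass, then keep the other rows
filtered = matrix[~np.isin(matrix[:, 2], element)]
print(filtered)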

I'd like to know how to calculate the similarity (numerical accuracy) of two numpy arrays in Python

I'm a student who has just started deep learning with Python.
First of all, my native language is not English, so my phrasing may suffer from using a translator.
I used time series data in deep learning to create a model that predicts the likelihood of certain situations in the future. We've even completed visualizations using graphs.
But rather than visualizing it through graphs, I wanted to understand the similarity between the train data and the test data as a number, i.e. a numerical accuracy.
The two arrays are in the following format:
In [51]: train_r
Out[51]: array([[0., 0., 0., ..., 0., 0., 0.],
                [0., 0., 0., ..., 0., 0., 0.],
                [0., 0., 0., ..., 0., 0., 0.],
Note: this array is composed only of 0s and 1s.
In [52]: test_r
Out[52]: array([[0.        , 0.        , 0.        , ..., 0.03657577, 0.06709877, 0.0569071 ],
                [0.        , 0.        , 0.        , ..., 0.04707848, 0.07826   , 0.0819832 ],
                [0.        , 0.        , 0.        , ..., 0.04467918, 0.07355513, 0.08117414],
I used the cosine similarity method to compare these two arrays, but an error occurred.
from numpy import dot
from numpy.linalg import norm
cos_sim = dot(train_r, test_r)/(norm(train_r)*norm(test_r))
ValueError: shapes (100,24) and (100,24) not aligned: 24 (dim 1) != 100 (dim 0)
So I searched the Internet for a different way, but it didn't help, because most of the results were about string analysis.
How can I calculate the similarity between the two arrays and describe it as a number?
I found the cause.
The reason for the error is that train_r and test_r each contain 24 columns.
I tried to calculate all 24 columns at once, and that caused the error.
The solution is simple: select a single column from train_r and test_r and compute the cosine similarity on that.
train_c = train_r[:,12]
test_c = test_r[:,12]
from numpy import dot
from numpy.linalg import norm
a = train_c
b = test_c
cos_sim = (dot(a, b)/(norm(a)*norm(b))) * 100
print(cos_sim)
95.18094658851624
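If you do want to cover all 24 columns in one call, a small sketch of a column-wise cosine similarity, assuming train_r and test_r are both of shape (100, 24); note that a column that is entirely zero would cause a division by zero and needs special handling:

import numpy as np

num = np.sum(train_r * test_r, axis=0)                                  # dot product per column
den = np.linalg.norm(train_r, axis=0) * np.linalg.norm(test_r, axis=0)  # product of column norms
cos_sim_per_column = num / den * 100                                    # one percentage per column
print(cos_sim_per_column)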

Finding the maximal non-zero submatrix in Python

Suppose we have a matrix:
a = array([[2., 3., 0., 0., 0.],
           [0., 4., 0., 0., 0.],
           [0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.]])
What is the best way to find the smallest submatrix spanning all the non-zero elements (i.e. the bounding box of the non-zero entries), like
[[2., 3.],
 [0., 4.]]
I've gone through numpy.nonzero, which gives the indices of the non-zero elements, but how can I use it efficiently to get the expected matrix?
The input matrix is always square. I've come up with this for now:
a[:np.nonzero(a)[0][-1]+1, :np.nonzero(a)[1][-1]+1]
It works, but it does not seem elegant. Also, it won't work if the non-zero block does not start at index 0. For example:
[[0, 0, 2, 3, 0],
 [0, 0, 0, 4, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]
Here the expected output is:
[[2, 3],
 [0, 4]]
The reason it is not working for the second case is that your starting point for the slice is always (0, 0), since you only specify the end of the range.
The following takes the minimum index from np.nonzero as the start of the range and the maximum index as the end of the range, on both axes, so it is guaranteed to include all non-zero elements:
a[np.min(np.nonzero(a)[0]):np.max(np.nonzero(a)[0])+1,
  np.min(np.nonzero(a)[1]):np.max(np.nonzero(a)[1])+1]
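Calling np.nonzero once and reusing the result reads a little cleaner and avoids scanning the array four times; a sketch of the same approach, assuming a is the array from the question:

rows, cols = np.nonzero(a)
sub = a[rows.min():rows.max() + 1, cols.min():cols.max() + 1]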

testing: compare numpy arrays while allowing a certain mismatch

I have two numpy arrays containing integers which I'm comparing with numpy.testing.assert_array_equal. The arrays are "equal enough", i.e. a few elements differ but given the size of my arrays, that's OK (in this specific case). But of course the test fails:
AssertionError:
Arrays are not equal
(mismatch 0.0010541406645359075%)
x: array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],...
y: array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],...
----------------------------------------------------------------------
Ran 1 test in 0.658s
FAILED (failures=1)
Of course one might argue that the (long-term) clean solution to this would be to adapt the reference solution or whatnot, but what I'd prefer is to simply allow for some mismatch without the test failing. I would have hoped for assert_array_equal to have an option for this, but this is not the case.
I've written a function which allows me to do exactly what I want, so the problem might be considered solved, but I'm just wondering whether there is a better, more elegant way to do this. Also, the approach of parsing the error string feels pretty hacky, but I haven't found a better way to get the mismatch percentage value.
import re
import numpy as np

def assert_array_equal_tolerant(arr1, arr2, threshold):
    """Compare equality of two arrays while allowing a certain mismatch.

    Arguments:
    - arr1, arr2: Arrays to compare.
    - threshold: Mismatch (in percent) above which the test fails.
    """
    try:
        np.testing.assert_array_equal(arr1, arr2)
    except AssertionError as e:
        # parse the mismatch percentage out of the error message
        for arg in e.args[0].split("\n"):
            match = re.search(r'mismatch ([0-9.]+)%', arg)
            if match:
                mismatch = float(match.group(1))
                break
        else:
            raise
        if mismatch > threshold:
            raise
Just to be clear: I'm not talking about assert_array_almost_equal, and using it is also not feasible, because the errors are not small, they might be huge for a single element, but are confined to a very small number of elements.
You could try (since they are integers) to check the number of elements that are not equal, without regular expressions:
unequal_pos = np.where(arr1 != arr2)
len(unequal_pos[0]) # gives you the number of elements that are not equal.
I don't know if you consider this more elegant.
Since the result of np.where can be used as an index, you can get the elements that do not match with
arr1[unequal_pos]
So you can do pretty much any test you like with that result. It depends on how you want to define the mismatch: by the number of differing elements, by the difference between the elements, or something even fancier.
Here's a crude comparison, but it seems to be in the spirit of what numpy.testing.assert_array_equal does:
In [71]: x=np.arange(100).reshape(10,10)
In [72]: y=np.arange(100).reshape(10,10)
In [73]: y[(5,7),(3,5)]=(3,5)
In [74]: np.sum(np.abs(x-y)>1)
Out[74]: 2
In [80]: np.sum(x!=y)
Out[80]: 2
count_nonzero is a faster counter (because it is used frequently in other numpy code to allocate space)
In [90]: np.count_nonzero(x!=y)
Out[90]: 2
The function that you are using does:
assert_array_compare(operator.__eq__, x, y, err_msg=err_msg)
np.testing.utils.assert_array_compare is a longish function, but most of it has to do with testing shape, and handling nan and inf. Otherwise it comes down to doing
x==y
and counting the number of mismatches, and generating the err_msg. Note that err_msg can be customized, so parsing it could be simplified.
If you know the shapes match, and you aren't worried about nan-like values, then just filtering the numeric difference should work just fine.
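Putting the counting idea together, a minimal sketch of the helper without any error-message parsing, assuming the shapes already match and there are no nan values to worry about:

import numpy as np

def assert_array_equal_tolerant(arr1, arr2, threshold):
    """Fail if more than `threshold` percent of the elements differ."""
    arr1, arr2 = np.asarray(arr1), np.asarray(arr2)
    mismatch = np.count_nonzero(arr1 != arr2) / arr1.size * 100
    assert mismatch <= threshold, (
        f"Arrays are not equal ({mismatch:.6f}% mismatch, allowed {threshold}%)")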

Recommended way to create a matrix containing strings in Python

I need to write a program that collects different datasets and unites them. For this I have to read in a comma-separated matrix, where each row represents an instance (in this case proteins) and each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise by a 0. The matrix looks like the example given below, but much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing it into a new file together with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried to use numpy.empty to create an array of the given size, but it seems that you have to preselect the length of the strings and cannot change it afterwards.
Is there a better way to deal with such data? I also thought of dictionaries of lists, but then I cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',', usecols=(1, 2, 3, 4),
               skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1.,  1.,  1.,  0.],
#       [ 0.,  1.,  0.,  1.],
#       [ 1.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  0.],
#       [ 0.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  1.]])
Or, using structured arrays (np.recarray):
a = np.loadtxt('stack.txt', delimiter=',', usecols=(1, 2, 3, 4),
               skiprows=1, dtype=[('Attribute 1', float),
                                  ('Attribute 2', float),
                                  ('Attribute 3', float),
                                  ('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
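For example, a minimal sketch with pandas, assuming the matrix above is saved in a file called proteins.csv (the filename is only an illustration):

import pandas as pd

df = pd.read_csv('proteins.csv', index_col='Proteins')
df['Attribute 1']    # select a single column
df.loc['Protein 2']  # select a single instance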
You could use genfromtxt instead:
data = np.genfromtxt('file.txt', dtype=None)
This will create a structured array (aka record array) of your table.
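For the comma-separated file with a header shown in the question, you would probably also need the delimiter and names arguments; a sketch, assuming the file layout above (spaces in header names are typically replaced with underscores in the field names):

import numpy as np

data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)
data['Attribute_1']  # access one column of the structured array by field name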
