testing: compare numpy arrays while allowing a certain mismatch - python

I have two numpy arrays containing integers which I'm comparing with numpy.testing.assert_array_equal. The arrays are "equal enough", i.e. a few elements differ but given the size of my arrays, that's OK (in this specific case). But of course the test fails:
AssertionError:
Arrays are not equal
(mismatch 0.0010541406645359075%)
 x: array([[ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],...
 y: array([[ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],...
----------------------------------------------------------------------
Ran 1 test in 0.658s
FAILED (failures=1)
Of course one might argue that the (long-term) clean solution to this would be to adapt the reference solution or whatnot, but what I'd prefer is to simply allow for some mismatch without the test failing. I would have hoped for assert_array_equal to have an option for this, but this is not the case.
I've written a function which allows me to do exactly what I want, so the problem might be considered solved, but I'm just wondering whether there is a better, more elegant way to do this. Also, the approach of parsing the error string feels pretty hacky, but I haven't found a better way to get the mismatch percentage value.
import re
import numpy as np

def assert_array_equal_tolerant(arr1, arr2, threshold):
    """Compare equality of two arrays while allowing a certain mismatch.

    Arguments:
    - arr1, arr2: Arrays to compare.
    - threshold: Mismatch (in percent) above which the test fails.
    """
    try:
        np.testing.assert_array_equal(arr1, arr2)
    except AssertionError as e:
        for arg in e.args[0].split("\n"):
            match = re.search(r'mismatch ([0-9.]+)%', arg)
            if match:
                mismatch = float(match.group(1))
                break
        else:
            raise
        if mismatch > threshold:
            raise
Just to be clear: I'm not talking about assert_array_almost_equal, and using it is also not feasible, because the errors are not small, they might be huge for a single element, but are confined to a very small number of elements.

You could try (if they are integers) to check the number of elements that are not equal, without regular expressions:
unequal_pos = np.where(arr1 != arr2)
len(unequal_pos[0]) # gives you the number of elements that are not equal.
I don't know if you consider this more elegant.
Since the result of np.where can be used as an index, you can get the elements that do not match with
arr1[unequal_pos]
So you can do pretty much any test you like with that result. It depends on how you want to define the mismatch: by the number of differing elements, by the difference between the elements, or something even fancier, as in the sketch below.
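For example, a minimal sketch of a count-based check (the function name and the max_mismatches parameter are illustrative, not from the original post):
import numpy as np

def assert_few_mismatches(arr1, arr2, max_mismatches):
    """Fail only if more than max_mismatches elements differ."""
    unequal_pos = np.where(arr1 != arr2)
    n_mismatch = len(unequal_pos[0])
    if n_mismatch > max_mismatches:
        raise AssertionError(
            "%d elements differ, e.g. %s vs %s"
            % (n_mismatch, arr1[unequal_pos][:5], arr2[unequal_pos][:5]))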

Here's a crude comparison, but it seems to be in the spirit of what numpy.testing.assert_array_equal does:
In [71]: x=np.arange(100).reshape(10,10)
In [72]: y=np.arange(100).reshape(10,10)
In [73]: y[(5,7),(3,5)]=(3,5)
In [74]: np.sum(np.abs(x-y)>1)
Out[74]: 2
In [80]: np.sum(x!=y)
Out[80]: 2
np.count_nonzero is a faster counter (it is used frequently in other numpy code, e.g. to allocate space):
In [90]: np.count_nonzero(x!=y)
Out[90]: 2
The function that you are using does:
assert_array_compare(operator.__eq__, x, y, err_msg=err_msg)
np.testing.utils.assert_array_compare is a longish function, but most of it has to do with testing shape, and handling nan and inf. Otherwise it comes down to doing
x==y
and counting the number of mismatches, then generating the err_msg. Note that the err_msg can be customized, so parsing it could be simplified.
If you know the shapes match, and you aren't worried about nan-like values, then just counting the numeric differences should work just fine; a sketch follows.
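A minimal sketch of that idea, expressed as a percentage threshold as in the question (the function name is illustrative):
import numpy as np

def assert_mismatch_below(arr1, arr2, threshold):
    """Fail if the percentage of differing elements exceeds threshold (in percent)."""
    assert arr1.shape == arr2.shape, "shapes must match"
    mismatch = 100.0 * np.count_nonzero(arr1 != arr2) / arr1.size
    if mismatch > threshold:
        raise AssertionError(
            "mismatch %.6f%% exceeds %.2f%%" % (mismatch, threshold))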


Calculating Confusion Matrix by Using the Array of Arrays

I am using the transformers and datasets libraries to train a multi-class NLP model on a specific dataset, and I need an idea of how my model performs for each label. So I'd like to calculate the confusion matrix. I have 4 labels. My result.prediction looks like
array([[ -6.906 , -8.11 , -10.29 , 6.242 ],
[ -4.51 , 3.705 , -9.76 , -7.49 ],
[ -6.734 , 3.36 , -10.27 , -6.883 ],
...,
[ 8.41 , -9.43 , -9.45 , -8.6 ],
[ 1.3125, -3.094 , -11.016 , -9.31 ],
[ -7.152 , -8.5 , -9.13 , 6.766 ]], dtype=float16)
Here, when the predicted value is positive the model predicts 1, otherwise it predicts 0. My result.label_ids looks like
array([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.],
...,
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]], dtype=float32)
As you can see, the model returns an array of 4 values per sample, giving 0 to false labels and 1 to the true label.
In general I've been using the following function to calculate a confusion matrix, but in this case it didn't work since this function is for 1-dimensional arrays.
import numpy as np

def compute_confusion_matrix(labels, true, pred):
    K = len(labels)  # Number of classes
    result = np.zeros((K, K))
    for i in range(len(true)):  # iterate over samples, not over the label list
        result[true[i]][pred[i]] += 1
    return result
If possible I'd like to modify this function to suit my case above. At the very least, I would like to understand how to compute a confusion matrix for results that come in the form of multi-dimensional arrays.
A possibility could be reversing the encoding to the format required by compute_confusion_matrix and, in this way, it is still possible to use your function!
To convert the predictions it's possible to do:
pred = list(np.where(result.label_ids == 1.)[1])
where np.where(result.label_ids == 1.)[1] is a 1-dimensional numpy array containing the indices of the 1.s in each row of result.label_ids.
So pred will look like this according to your result.label_ids:
[3, 0, 3, ..., 0, 0, 3]
so it should have the same format as the original true (if true is also one-hot encoded, the same strategy can be used to convert it) and can be used as input to your function for computing the confusion matrix.
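For instance, on a small one-hot array standing in for result.label_ids (the values below are just for illustration):
import numpy as np

label_ids = np.array([[0., 0., 0., 1.],
                      [1., 0., 0., 0.],
                      [0., 0., 0., 1.]])
true = list(np.where(label_ids == 1.)[1])
print(true)  # indices of the 1s per row: 3, 0, 3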
First of all I would like to thank Nicola Fanelli for the idea.
The function I gave above as well as the sklearn.metrics.confusion_matrix() both need to be provided a list of predicted and true values. After my prediction step, I try to retrieve my true and predicted values in order to calculate a confusion matrix. The results I was getting are in the following form
array([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.],
...,
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]], dtype=float32)
Here the idea is to retrieve the positional index of the value 1. When I tried the approach suggested by Nicola Fanelli, the resulting sizes were smaller than the initial ones and they didn't match, so the confusion matrix could not be calculated. To be honest I couldn't find the reason behind it, but I'll investigate further later.
So I used a different technique to implement the same idea: np.argmax(), appending the resulting positions to a new list. Here is the code sample for the true values:
true = []
for i in range(len(result.label_ids)):
    n = np.array(result.label_ids[i])
    true.append(np.argmax(n))
This way I got the results in the desired format without the sizes changing.
Even though this is a working solution for my problem, I am still open to more elegant ways to approach this problem.
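As a follow-up, the same conversion can be done without an explicit loop. This is only a sketch; it assumes the logits live in result.prediction as shown in the question, that the labels are one-hot encoded, and that scikit-learn is available:
import numpy as np
from sklearn.metrics import confusion_matrix

# argmax over axis=1 collapses each one-hot / logit row to a class index in one call
true = np.argmax(result.label_ids, axis=1)
pred = np.argmax(result.prediction, axis=1)
cm = confusion_matrix(true, pred)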

How to delete multiple values from a numpy matrix at low computational cost

I've recently been trying my hand at numpy, and I'm trying to find a way to delete the rows of a matrix whose value in column 2 equals one of the values stored in the variable element.
Since I have a large amount of data, I'd like to know whether there is a more efficient method that takes less time to execute than a classic for loop.
I enclose an example:
element = [85., 222., 166., 238.]
matrix = [[228., 1., 222.],
          [140., 0., 85.],
          [140., 0., 104.],
          [230., 0., 217.],
          [115., 1., 250.],
          [12., 1., 166.],
          [181., 1., 238.]]
the output:
matrix = [[140., 0., 104.],
          [230., 0., 217.],
          [115., 1., 250.]]
The method I used is the following:
for y in element:
    matrix = matrix[matrix[:, 2] != y]
When running it on a large amount of data it takes a long time. Is there anything more efficient, so I can save on execution time?
Since you tagged numpy, I'd assume matrix is a numpy array. With that, you can use np.isin for your purpose:
matrix = np.array(matrix)
matrix[~np.isin(matrix[:, 2], element)]
Output:
array([[140., 0., 104.],
       [230., 0., 217.],
       [115., 1., 250.]])
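As a side note, np.isin also accepts an invert parameter, so the mask can be built without the ~ operator. Either way, the point is that the whole array is filtered once with a single boolean mask instead of once per value as in the loop:
import numpy as np

element = np.array([85., 222., 166., 238.])
matrix = np.array([[228., 1., 222.],
                   [140., 0., 85.],
                   [140., 0., 104.],
                   [230., 0., 217.],
                   [115., 1., 250.],
                   [12., 1., 166.],
                   [181., 1., 238.]])

keep = np.isin(matrix[:, 2], element, invert=True)  # True for rows to keep
filtered = matrix[keep]
print(filtered)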

eofs.xarray raising TypeError (Using a DataArray to construct a variable is ambiguous)

I'm working on a multidimensional dataset using xarray and had some issues with eofs, the EOF analysis package, and particularly, with its xarray interface.
My xarray DataArray looks like this:
<xarray.DataArray 'timeMonthly_avg_flux' (time: 1800, y: 601, x: 601)>
array([[[0., 0., ..., 0., 0.],
[0., 0., ..., 0., 0.],
...,
[0., 0., ..., 0., 0.],
[0., 0., ..., 0., 0.]],
[[0., 0., ..., 0., 0.],
[0., 0., ..., 0., 0.],
...,
[0., 0., ..., 0., 0.],
[0., 0., ..., 0., 0.]]])
Coordinates:
lat (y, x) float64 ...
lon (y, x) float64 ...
time (time) datetime64[ns] 2001-01-31 2001-02-28 ... 2150-12-31
x (x) float64 -3e+06 -2.99e+06 -2.98e+06 ... 2.98e+06 2.99e+06 3e+06
y (y) float64 -3e+06 -2.99e+06 -2.98e+06 ... 2.98e+06 2.99e+06 3e+06
The problem arises when I run the following:
from eofs.xarray import Eof
solver = Eof(flux) # flux is the above DataArray
flux_eofs = solver.eofs()
for which I get the following TypeError:
TypeError: Using a DataArray object to construct a variable is ambiguous, please extract the data using the .data property.
Also noting that other methods of this solver work as intended: I am able to call the principal components as below:
flux_pcs = solver.pcs()
The dataset does have NaN values, but as far as I can tell, the eofs.xarray module has been designed to handle NaNs. For now, my workaround has been to convert the dataset into a Numpy array and use the eofs.standard interface instead, and convert the outputs back into xarray Datasets/DataArrays as required (a sketch of that re-wrapping step follows the snippet below). All methods work as intended when I do this:
from eofs.standard import Eof
flux_np = flux.to_numpy()
solver = Eof(flux_np)
flux_eofs = solver.eofs()
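A rough sketch of the re-wrapping step; the number of modes and the dimension/coordinate names are assumptions based on the DataArray shown above and may need adjusting:
import numpy as np
import xarray as xr

eofs_np = solver.eofs(neofs=10)  # array of shape (mode, y, x)
flux_eofs = xr.DataArray(
    eofs_np,
    dims=("mode", "y", "x"),
    coords={"mode": np.arange(eofs_np.shape[0]),
            "y": flux["y"],
            "x": flux["x"]},
)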
I could find two other instances of this error being raised: as part of the w2w package, where it seems to have been something to do with the python environment, and here, as part of the PyWake project, but it's not clear to me what the problem was.
For everyone encountering this issue: this is a bug which has also been raised and discussed on GitHub (also by the author of this question :)). The xarray compatibility is a bit broken currently.
For the time being this can be fixed manually by editing the file eofs/lib/eofs/xarray.py, changing lines 638 to 640 from
# Add non-dimension coordinates.
pcs.coords.update({coord.name: (coord.dims, coord)
                   for coord in time_ndcoords})
to
# Add non-dimension coordinates.
pcs.coords.update({coord.name: (coord.dims, coord.data)
                   for coord in time_ndcoords})
There is a pull request fixing this that has unfortunately not been merged yet.
Sorry for the shameless self-promotion ;) you may want to give xeofs a try. It provides EOF analysis (and more) in xarray.
I recently ran into the same error message in my project (it is different in nature to yours). I pip uninstalled the latest version of xarray on my PC (0.20.2) and installed an older version (0.16.0), and (at least) that error went away.

Finding the maximum non-zero matrix in python

Suppose we have a matrix:
a = array([[ 2., 3., 0., 0., 0.],
[ 0., 4., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
what is the best way to find the submatrix that spans all the non-zero elements (i.e. the bounding block of the non-zero part), like
[[2.,3.],
[0.,4.]]
I've gone through numpy.nonzero, which gives the indices of the non-zero elements, but how can I use it efficiently to get the expected matrix?
The matrix must be square. I've come up with this for now:
a[:np.nonzero(a)[0][-1]+1, :np.nonzero(a)[1][-1]+1]
It works, but it does not seem elegant. Also, it won't work if the non-zero block does not start at index 0, as in:
[[0,0,2,3,0],
[0,0,0,4,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0]]
here the expected output is,
[[2,3],
[0,4]]
The reason it is not working for the second case is that your starting point for forming the matrix is always (0, 0), since you only specify the end of each range.
This takes the minimum index of np.nonzero as the start of the range and the maximum index as the end of the range, on both axes, so it is guaranteed to include all non-zero elements:
a[np.min(np.nonzero(a)[0]):np.max(np.nonzero(a)[0])+1,
np.min(np.nonzero(a)[1]):np.max(np.nonzero(a)[1])+1]
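A variant that calls np.nonzero only once (same result, it just avoids recomputing the indices four times):
import numpy as np

a = np.array([[0, 0, 2, 3, 0],
              [0, 0, 0, 4, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])

# compute the row/column indices of non-zero entries once, then slice the bounding box
rows, cols = np.nonzero(a)
sub = a[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
print(sub)
# [[2 3]
#  [0 4]]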

Recommended way to create a matrix containing strings in Python

I need to write a program that collects different datasets and unites them. For this I have to read in a comma-separated matrix: each row represents an instance (in this case proteins), and each column represents an attribute of the instances. If an instance has an attribute, this is represented by a 1, otherwise 0. The matrix looks like the example given below, but much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing it into a new file with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried to use numpy.empty to create an array of the given size, but it seems that you have to preselect the length of the strings and cannot change it afterwards.
Is there a better way to deal with such data? I also thought of dictionaries of lists, but then I cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',', usecols=(1, 2, 3, 4),
               skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1., 1., 1., 0.],
# [ 0., 1., 0., 1.],
# [ 1., 0., 0., 0.],
# [ 1., 1., 1., 0.],
# [ 0., 0., 0., 0.],
# [ 1., 1., 1., 1.]])
Or, using structured arrays (np.recarray):
a = np.loadtxt('stack.txt', delimiter=',', usecols=(1, 2, 3, 4),
               skiprows=1, dtype=[('Attribute 1', float),
                                  ('Attribute 2', float),
                                  ('Attribute 3', float),
                                  ('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
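For example, a minimal sketch with pandas (the file name 'proteins.csv' is just a placeholder for the comma-separated matrix shown above):
import pandas as pd

df = pd.read_csv('proteins.csv', index_col='Proteins')

col = df['Attribute 1']    # select a single column
row = df.loc['Protein 3']  # select a single instance (row)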
You could use genfromtxt instead:
data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)
This will create a structured array (aka record array) of your table.
