I am generating a confusion matrix to get an idea of my text classifier's predictions vs ground truth. The purpose is to understand which intents are being predicted as other intents. But the problem is that I have too many classes (more than 160), so the matrix is sparse and most of the fields are zero. Obviously, the diagonal elements are likely to be non-zero, since they indicate correct predictions.
That being the case, I want to generate a simpler version of it, as we only care about non-zero elements that are off-diagonal. Hence, I want to remove the rows and columns where all elements are zero (ignoring the diagonal entries), so that the plot becomes much smaller and manageable to view. How can I do that?
Following is the code snippet I have so far; it produces a plot covering all intents, i.e. a (#intents, #intents)-dimensional plot.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(64,64)})
confusion_matrix = pd.crosstab(df['ground_truth_intent_name'], df['predicted_intent_name'])
variables = sorted(list(set(df['ground_truth_intent_name'])))
temp = DataFrame(confusion_matrix, index=variables, columns=variables)
sns.heatmap(temp, annot=True)
TL;DR
Here temp is a pandas dataframe. I need to remove all rows and columns where all elements are zeros (ignoring the diagonal elements, even if they are not zero).
You can use any on a boolean mask of the non-zero entries, but first you need to clear the diagonal of that mask:
# also consider using
# a = np.isclose(confusion_matrix.to_numpy(), 0)
a = confusion_matrix.to_numpy() != 0
# fill diagonal
np.fill_diagonal(a, False)
# columns with at least one non-zero
cols = a.any(axis=0)
# rows with at least one non-zero
rows = a.any(axis=1)
# boolean indexing
confusion_matrix.loc[rows, cols]
Let's take an example:
# random 0/1 data standing in for a confusion matrix
np.random.seed(1)
a = np.random.randint(0, 2, (5, 5))
a[2] = 0
a[:-1,-1] = 0
confusion_matrix = pd.DataFrame(a)
So the data would be:
0 1 2 3 4
0 1 1 0 0 0
1 1 1 1 1 0
2 0 0 0 0 0
3 0 0 1 0 0
4 0 1 0 0 1
and the code outputs the following (notice that row 2 and column 4 are gone):
0 1 2 3
0 1 1 0 0
1 1 1 1 1
3 0 0 1 0
4 0 1 0 0
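Putting it all together with the heatmap from the question, a minimal sketch (assuming temp is the square DataFrame built above, with the same intent labels on rows and columns) could look like this:
import numpy as np
import seaborn as sns

a = temp.to_numpy() != 0
np.fill_diagonal(a, False)                        # ignore correct predictions on the diagonal
reduced = temp.loc[a.any(axis=1), a.any(axis=0)]  # drop rows/columns with no confusions

sns.heatmap(reduced, annot=True)
The reduced heatmap only shows intents that are actually confused with some other intent, so a much smaller figure size should be enough.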
Related
I have a DataFrame like the one below, where I want to find the indices where an element goes from 0 to 1.
The code below gets all the instances I want; however, it also includes cases where the sign changes from -1 to 0, which I don't want.
import numpy as np
import pandas as pd
df = pd.DataFrame([0,1,1,1,0,1,0,-1,0])
df[np.sign(df[0]).diff(periods=1).eq(1)]
Output:
0
1 1
5 1
8 0
Just add another condition requiring the sign of the current value to be 1, so that the -1 to 0 transitions are excluded:
filtered = df[np.sign(df[0]).diff(1).eq(1) & np.sign(df[0]).eq(1)]
Output:
>>> filtered
0
1 1
5 1
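If you only need the positions of the 0 to 1 transitions rather than the filtered frame, a small follow-up (same df as above) would be:
transitions = df.index[np.sign(df[0]).diff(1).eq(1) & np.sign(df[0]).eq(1)]
print(list(transitions))   # [1, 5] for the example DataFrame above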
I am trying to validate whether any numbers are duplicated in a 9x9 array, however I need to exclude all 0s, as those are the cells I will solve later. I have a 9x9 array and would like to check whether there are any duplicates in the rows and columns, excluding all 0s from the check and considering only the numbers 1 to 9. An example input array would be:
[[1 0 0 7 0 0 0 0 0]
[0 3 2 0 0 0 0 0 0]
[0 0 0 6 0 0 0 0 0]
[0 8 0 0 0 2 0 7 0]
[5 0 7 0 0 1 0 0 0]
[0 0 0 0 0 3 6 1 0]
[7 0 0 0 0 0 2 0 9]
[0 0 0 0 5 0 0 0 0]
[3 0 0 0 0 4 0 0 5]]
Here is where I am currently with my code for this:
#Checking Columns
for c in range(9):
    line = test[:, c]
    print(np.unique(line).shape == line.shape)
#Checking Rows
for r in range(9):
    line = test[r, :]
    print(np.unique(line).shape == line.shape)
Then I would like to do exactly the same for the 3x3 sub-arrays within the 9x9 array, again somehow excluding the 0s from the check. Here is the code I currently have:
for r0 in range(3, 9, 3):
    for c0 in range(3, 9, 3):
        test1 = test[:r0, :c0]
        for r in range(3):
            line = test1[r, :]
            print(np.unique(line).shape == line.shape)
        for c in range(3):
            line = test1[:, c]
            print(np.unique(line).shape == line.shape)
I would truly appreciate assistance in this regard.
It sure sounds like you're trying to verify the input of a Sudoku board.
You can extract a box as:
for r0 in range(0, 9, 3):
    for c0 in range(0, 9, 3):
        box = test[r0:r0+3, c0:c0+3]
        # ... test that np.unique(box) has 9 elements ...
Note that this is only about how to extract the elements of the box. You still haven't done anything about removing the zeros, here or on the rows and columns.
Given a box/row/column, you then want something like:
nonzeros = [x for x in box.flatten() if x != 0]
assert len(nonzeros) == len(set(nonzeros))
There may be a more numpy-friendly way to do this, but this should be fast enough.
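Putting the pieces together, a minimal sketch of the full check along the lines described above (the helper names here are just for illustration) could be:
import numpy as np

def no_duplicates(cells):
    # drop the zeros, then compare the count with and without duplicates
    nonzeros = cells[cells != 0]
    return nonzeros.size == np.unique(nonzeros).size

def board_is_valid(test):
    for i in range(9):
        if not (no_duplicates(test[i, :]) and no_duplicates(test[:, i])):
            return False
    for r0 in range(0, 9, 3):
        for c0 in range(0, 9, 3):
            if not no_duplicates(test[r0:r0+3, c0:c0+3].ravel()):
                return False
    return True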
Excluding zeros is fairly straightforward by masking the array:
test = np.array(test)
non_zero_mask = (test != 0)
At this point you can either check the whole matrix for uniqueness
np.unique(test[non_zero_mask])
or you can do it for individual rows/columns
non_zero_row_0 = test[0, non_zero_mask[0]]
unique_0 = np.unique(non_zero_row_0)
You can put the logic above into a loop to get the behavior you want; a rough sketch follows below.
As for the 3x3 subarrays, you can loop through them as you did in your example.
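Continuing from the snippet above (test is the 9x9 numpy array and non_zero_mask its boolean mask), such a loop could look roughly like this:
for i in range(9):
    row = test[i, non_zero_mask[i]]        # non-zero entries of row i
    col = test[non_zero_mask[:, i], i]     # non-zero entries of column i
    row_ok = np.unique(row).size == row.size
    col_ok = np.unique(col).size == col.size
    print(i, row_ok, col_ok)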
When you have a small collection of things (small being <=64 or 128, depending on architecture), you can turn it into a set using bits. So for example:
bits = ((2**board) >> 1).astype(np.uint16)
Notice that you have to right-shift after exponentiating rather than pre-subtract 1 from board, so that zeros map cleanly to "no bits set".
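As a quick illustration of the mapping (the sample values here are arbitrary):
import numpy as np

vals = np.array([0, 1, 5, 9])
print(((2**vals) >> 1).astype(np.uint16))   # [  0   1  16 256]: value k maps to bit k-1, 0 maps to no bits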
You can now compute three types of sets. Each set is conceptually the bitwise OR of bits in a particular arrangement; for this purpose you can use sum just the same, since a duplicated bit only produces a carry, which the count check below catches anyway:
rows = bits.sum(axis=1)
cols = bits.sum(axis=0)
blocks = bits.reshape(3, 3, 3, 3).sum(axis=(1, 3))
Now all you have to do is compare the bit count of each set to the number of non-zero elements in the corresponding row, column or block. They will be equal if and only if there are no duplicates; duplicates cause the bit count to be smaller.
There are pretty efficient algorithms for counting bits, especially for something as small as a uint16. Here is an example: How to count the number of set bits in a 32-bit integer?. I've adapted it for the smaller size and numpy here:
def count_bits16(arr):
    # 16-bit SWAR popcount; the final mask keeps the result correct
    # regardless of the integer width that sum() promoted to
    count = arr - ((arr >> 1) & 0x5555)
    count = (count & 0x3333) + ((count >> 2) & 0x3333)
    count = (count + (count >> 4)) & 0x0F0F
    return ((count * 0x0101) >> 8) & 0xFF
This is the count of unique elements for each of the configurations. You need to compare it to the number of non-zero elements. The following boolean will tell you if the board is valid:
valid = (count_bits16(rows) == np.count_nonzero(board, axis=1)).all() and \
        (count_bits16(cols) == np.count_nonzero(board, axis=0)).all() and \
        (count_bits16(blocks) == np.count_nonzero(board.reshape(3, 3, 3, 3), axis=(1, 3))).all()
I have a numpy array of a fixed size holding irregularly spaced data. An example would be:
[1 0 0 0 3 0 0 0 2 0
0 1 0 0 0 0 0 0 2 0
0 1 0 0 1 0 6 0 9 0
0 0 0 0 6 0 3 0 0 1]
I want to keep the array the same shape, but have all the 0 values overwritten with data interpolated from the points that do have data. If the data points in the array are thought of as height values, this would essentially be creating a surface over the points.
I have been trying to use scipy.interpolate.griddata but keep getting errors. I start with an array of my known data points, as [x, y, value]. For the above (first row only, for brevity):
data = [0, 0, 1
0, 4, 3
0, 8, 2 ....................
I then define
points = (data[:,0], data[:,1])
values = (data[:,2])
Next, I define the points to sample at (in this case, the grid I desire)
grid = np.indices((4,10))
Finally, call griddata
t = interpolate.griddata(points, values, grid, method = 'linear')
This returns the following error
ValueError: number of dimensions in xi does not match x
Am I using the wrong function?
Thanks!
Solved: you need to pass the desired points as a tuple:
t = interpolate.griddata(points, values, (grid[0,:,:], grid[1,:,:]), method = 'linear')
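For reference, a self-contained sketch of the whole workflow on the example array (note that linear interpolation returns NaN for grid points outside the convex hull of the known points):
import numpy as np
from scipy import interpolate

a = np.array([[1, 0, 0, 0, 3, 0, 0, 0, 2, 0],
              [0, 1, 0, 0, 0, 0, 0, 0, 2, 0],
              [0, 1, 0, 0, 1, 0, 6, 0, 9, 0],
              [0, 0, 0, 0, 6, 0, 3, 0, 0, 1]])

rows, cols = np.nonzero(a)          # coordinates of the known (non-zero) points
values = a[rows, cols]              # their heights
grid = np.indices(a.shape)          # the full grid to sample at

t = interpolate.griddata((rows, cols), values, (grid[0], grid[1]), method='linear')
print(t)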
I have never used pandas or numpy for this purpose before and am wondering what's the idiomatic way to construct labeled adjacency matrices in pandas.
My data comes in a shape similar to this. Each "uL22"-type of thing is a protein, and the arrays are the neighbors of that protein. Hence (in the example below) an adjacency matrix would have 1s in the bL31 row, uL5 column, and the converse, etc.
My problem is twofold:
The actual dimension of the adjacency matrix is dictated by a set of protein names that is generally much larger than the one contained in the nbrtree, so I'm wondering what the best way is to map my nbrtree data to that set, say a 100 by 100 matrix corresponding to the neighborhood relationships of 100 proteins.
I'm not quite sure how to "bind" the names (i.e. uL32, etc.) of those 100 proteins to the rows and columns of this matrix such that when I start moving rows around, the names move accordingly. (I'm planning to rearrange the adjacency matrix to have a block-diagonal structure.)
"nbrtree": {
"bL31": ["uL5"],
"uL5": ["bL31"],
"bL32": ["uL22"],
"uL22": ["bL32","bL17"],
...
"bL33": ["bL35"],
"bL35": ["bL33","uL15"],
"uL13": ["bL20"],
"bL20": ["uL13","bL21"]
}
>>> len(nbrtree)
40
I'm sure this is a manipulation that people perform daily; I'm just not quite familiar with how DataFrames work, so I'm probably looking for something very obvious.
Thank you so much!
I don't fully understand your question, but from what I gather, try out this code:
from pprint import pprint as pp
import pandas as pd
dic = {"first": {
"a": ["b","d"],
"b": ["a","h"],
"c": ["d"],
"d": ["c","g"],
"e": ["f"],
"f": ["e","d"],
"g": ["h","a"],
"h": ["g","b"]
}}
col = list(dic['first'].keys())
data = pd.DataFrame(0, index=col, columns=col, dtype=int)
for x, y in dic['first'].items():
    data.loc[x, y] = 1
pp(data)
The output from this code being
a b c d e f g h
a 0 1 0 1 0 0 0 0
b 1 0 0 0 0 0 0 1
c 0 0 0 1 0 0 0 0
d 0 0 1 0 0 0 1 0
e 0 0 0 0 0 1 0 0
f 0 0 0 1 1 0 0 0
g 1 0 0 0 0 0 0 1
h 0 1 0 0 0 0 1 0
Note that the adjacency matrix here is not symmetric, since I used some made-up data.
To absorb your labels into the dataframe as data, change it to the following:
data = pd.DataFrame(0, index = ['index']+col, columns = ['column']+col, dtype = int)
data.loc['index'] = [0]+col
data.loc[:, 'column'] = ['*']+col
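To address the twofold question more directly, here is a rough sketch of mapping the nbrtree onto a larger, fixed protein-name set and then reordering rows and columns while the labels stay attached (the names beyond those in the question's nbrtree are made up for illustration):
import pandas as pd

nbrtree = {"bL31": ["uL5"], "uL5": ["bL31"], "bL32": ["uL22"], "uL22": ["bL32", "bL17"]}
all_proteins = ["uL5", "uL22", "uL13", "bL17", "bL31", "bL32"]   # the full, larger name set

adj = pd.DataFrame(0, index=all_proteins, columns=all_proteins, dtype=int)
for protein, nbrs in nbrtree.items():
    adj.loc[protein, nbrs] = 1

# labels travel with the rows/columns, so any reordering keeps names aligned,
# e.g. towards a block-diagonal arrangement:
order = ["bL31", "uL5", "bL32", "uL22", "bL17", "uL13"]
adj = adj.loc[order, order]
print(adj)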
It's past midnight, and maybe someone has an idea of how to tackle a problem of mine. I want to count the number of adjacent cells, i.e. the number of array fields with other values (e.g. zeros) in the vicinity of the array's values, summed over all valid values.
Example:
import numpy
from scipy import ndimage

s = ndimage.generate_binary_structure(2, 2)   # structure can vary
a = numpy.zeros((6, 6), dtype=int)            # example array
a[2:4, 2:4] = 1; a[2, 4] = 1                  # with example value structure
print(a)
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 1 1 1 0]
[0 0 1 1 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]]
# The value at position [2,4] is surrounded by 6 zeros, while the one at
# position [2,2] has 5 zeros in its vicinity if 's' is the assumed binary structure.
# The total sum of surrounding zeros is therefore 5+4+6+4+5 == 24.
How can I count the number of zeros in this way when the structure of my values varies?
I somehow believe I must make use of SciPy's binary_dilation function, which can enlarge the value structure, but simply counting the overlaps can't lead me to the correct sum, or can it?
print(ndimage.binary_dilation(a, s).astype(a.dtype))
[[0 0 0 0 0 0]
[0 1 1 1 1 1]
[0 1 1 1 1 1]
[0 1 1 1 1 1]
[0 1 1 1 1 0]
[0 0 0 0 0 0]]
Use a convolution to count neighbours:
import numpy
import scipy.signal
a = numpy.zeros((6, 6), dtype=int)   # example array
a[2:4, 2:4] = 1; a[2, 4] = 1         # with example value structure
b = 1 - a
c = scipy.signal.convolve2d(b, numpy.ones((3, 3)), mode='same')
print(numpy.sum(c * a))              # 24 for the example above
b = 1-a allows us to count each zero while ignoring the ones.
We convolve with a 3x3 all-ones kernel, which sets each element to the sum of it and its 8 neighbouring values (other kernels are possible, such as the + kernel for only orthogonally adjacent values). With these summed values, we mask off the zeros in the original input (since we don't care about their neighbours), and sum over the whole array.
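As a small variation on the kernel choice mentioned above, a '+' kernel counts only the orthogonally adjacent zeros (continuing from the snippet above):
plus_kernel = numpy.array([[0, 1, 0],
                           [1, 1, 1],
                           [0, 1, 0]])
c4 = scipy.signal.convolve2d(b, plus_kernel, mode='same')
print(numpy.sum(c4 * a))   # counts only the 4-connected zero neighbours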
I think you already got it: after dilation, the number of 1s is 19; subtracting the 5 cells of the starting shape leaves 14, which is the number of distinct zero cells surrounding your shape. Your total of 24 counts some of those zeros more than once, because they neighbor several 1s.