So say I have two clustering outcomes that look like this:
clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]
How would I go about comparing the outcome contained in clustering to the one contained in correct_clustering? I want some number between 0 and 1. I was thinking about calculating the fraction of pairs which are correctly clustered together in the same cluster, but I can't think of a programmatic way to solve this.
The best practice measures are indeed based on pair counting.
In particular, the adjusted Rand index (ARI) is the standard measure here.
You don't actually count pairs: the number of pairs from a set of n objects can be computed in closed form with the binomial coefficient, simply n*(n-1)/2.
You'll need this for each cluster and each cluster intersection.
The results of all intersections are aggregated, and it is easy to see that this is invariant to the permutation of clusters (and hence to the cluster labels). The Rand index is the accuracy of predicting whether two objects a, b are in the same cluster, or in different clusters. The ARI improves this by adjusting for chance: in a very unbalanced problem, a random result can score a high accuracy, but in ARI it is close to 0 on average.
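For concreteness, here is a minimal sketch of that pair-counting computation: build the contingency table of cluster intersections, then apply the chance adjustment. The function and variable names are illustrative, not from any library:
import numpy as np
from scipy.special import comb

def ari(labels_a, labels_b):
    # Contingency table: n[i, j] = objects in cluster i of A and cluster j of B
    _, a_inv = np.unique(labels_a, return_inverse=True)
    _, b_inv = np.unique(labels_b, return_inverse=True)
    n = np.zeros((a_inv.max() + 1, b_inv.max() + 1), dtype=int)
    np.add.at(n, (a_inv, b_inv), 1)

    sum_ij = comb(n, 2).sum()             # pairs together in both partitions
    sum_a = comb(n.sum(axis=1), 2).sum()  # pairs together in A
    sum_b = comb(n.sum(axis=0), 2).sum()  # pairs together in B
    expected = sum_a * sum_b / comb(len(labels_a), 2)  # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)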
Use the Rand Index:
import numpy as np
from scipy.special import comb

def rand_index_score(clusters, classes):
    # Pairs co-clustered in the prediction and in the ground truth
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    # True positives: pairs together in both partitions
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)
clusters = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
classes = [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 2, 2, 2, 0]
rand_index_score(clusters, classes)
0.6764705882352942
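If your clusterings are given as lists of clusters, as in the question, you first need to flatten them to one label per item. A small helper (hypothetical, and assuming the items are exactly 0..n-1) might look like:
def to_labels(clustering):
    # Assign each item the index of the cluster that contains it
    labels = [0] * sum(len(c) for c in clustering)
    for cluster_id, members in enumerate(clustering):
        for item in members:
            labels[item] = cluster_id
    return labels

clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]
rand_index_score(to_labels(clustering), to_labels(correct_clustering))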
You can use the function adjusted_rand_score in sklearn:
from sklearn.metrics import adjusted_rand_score
clustering = sorted((i, num) for num, lst in enumerate(clustering) for i in lst)
clustering = [i for _, i in clustering]
# [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
correct_clustering = sorted((i, num) for num, lst in enumerate(correct_clustering) for i in lst)
correct_clustering = [i for _, i in correct_clustering]
# [0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0]
ari = adjusted_rand_score(correct_clustering, clustering)
# -0.012738853503184737
The function returns values between -1 and 1, so to get a value between 0 and 1 you need to rescale:
ari_scaled = (ari + 1) / 2
# 0.49363057324840764
Related
How can I count the number of times an array is present in a larger array?
a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([1, 1, 1])
The count for the number of times b is present in a should be 3
b can be any combination of 1s and 0s
I'm working with huge arrays, so for loops are pretty slow
If the subarray being searched for contains all 1s, you can count the number of times the subarray appears in the larger array by convolving the two arrays with np.convolve and counting the number of entries in the result that equal the size of the subarray:
# 'valid' = convolve only over the complete overlap of the signals
>>> np.convolve(a, b, mode='valid')
array([1, 1, 2, 3, 2, 2, 2, 3, 3, 2, 1, 1])
# ^ ^ ^ <= Matches
>>> win_size = min(a.size, b.size)
>>> np.count_nonzero(np.convolve(a, b, mode='valid') == win_size)
3
For subarrays that may contain 0s, you can start by using convolution to transform a into an array containing the binary numbers encoded by each window of size b.size. Then just compare each element of the transformed array with the binary number encoded by b and count the matches:
>>> b = np.array([0, 1, 1]) # encodes '3'
>>> weights = 2 ** np.arange(b.size) # == [1, 2, 4, 8, ..., 2**(b.size-1)]
>>> np.convolve(a, weights, mode='valid')
array([4, 1, 3, 7, 6, 5, 3, 7, 7, 6, 4, 1])
# ^ ^ Matches
>>> target = (b * np.flip(weights)).sum() # target==3
>>> np.count_nonzero(np.convolve(a, weights, mode='valid') == target)
2
Not a super fast method, but you can view a as a windowed array using np.lib.stride_tricks.sliding_window_view:
window = np.lib.stride_tricks.sliding_window_view(a, b.shape)
You can now compare this directly to b and find where they match:
result = (window == b).all(-1).sum()
For older versions of numpy (pre-1.20.0), you can use np.lib.stride_tricks.as_strided to achieve a similar result:
window = np.lib.stride_tricks.as_strided(
    a, shape=(*(np.array(a.shape) - b.shape + 1), *b.shape),
    strides=a.strides + (a.strides[0],) * b.ndim)
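Putting it together on the arrays from the question (a quick sketch; sliding_window_view needs numpy >= 1.20):
import numpy as np

a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([1, 1, 1])

# One row per window of length b.size; count rows equal to b
window = np.lib.stride_tricks.sliding_window_view(a, b.shape)
print((window == b).all(-1).sum())  # prints 3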
Here is a solution using a list comprehension:
a = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
b = [1, 1, 1]
sum(a[i:i+len(b)]==b for i in range(len(a)-len(b)+1))
output: 3
Here are a few improvements on @Brian's answer:
Use np.correlate, not np.convolve; they are nearly identical, but convolve reverses b before sliding it across a.
To deal with templates that contain zeros, convert the zeros to -1 (i.e., correlate against 2*b - 1). For example:
a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([0,1,1])
np.correlate(a,2*b-1)
# array([-1, 1, 2, 1, 0, 0, 2, 1, 1, 0, -1, 1])
The template fits where the correlation equals the number of ones in the template. The indices can be extracted like so:
(np.correlate(a,2*b-1)==np.count_nonzero(b)).nonzero()[0]
# array([2, 6])
If you only need the count, use np.count_nonzero:
np.count_nonzero((np.correlate(a,2*b-1)==np.count_nonzero(b)))
# 2
I have the following code that converts a noisy square wave to a noiseless one:
import numpy as np
threshold = 0.5
low = 0
high = 1
time = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
amplitude = np.array([0.1, -0.2, 0.2, 1.1, 0.9, 0.8, 0.98, 0.2, 0.1, -0.1])
# using list comprehension
new_amplitude_1 = [low if a<threshold else high for a in amplitude]
print(new_amplitude_1)
# gives: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# using numpy's where
new_amplitude_2 = np.where(amplitude > threshold)
print(new_amplitude_2)
# gives: (array([3, 4, 5, 6]),)
Is it possible to use np.where() in order to obtain a result for new_amplitude_2 identical to the list comprehension (new_amplitude_1) in this case?
I read some tutorials online but I can't see the logic to have an if else inside np.where(). Maybe I should use another function?
Here's how you can do it using np.where:
np.where(amplitude < threshold, low, high)
# array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
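As an aside, if you ever need more than a single if/else, np.select follows the same pattern; a minimal sketch with made-up thresholds:
import numpy as np

amplitude = np.array([0.1, -0.2, 0.2, 1.1, 0.9, 0.8, 0.98, 0.2, 0.1, -0.1])

# Three-way split: below 0 -> -1, in [0, 0.5) -> 0, at or above 0.5 -> 1
conditions = [amplitude < 0, amplitude < 0.5]
choices = [-1, 0]
np.select(conditions, choices, default=1)
# array([ 0, -1,  0,  1,  1,  1,  1,  0,  0, -1])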
You can do it without where:
new_ampl2 = (amplitude > 0.5).astype(np.int32)
print(new_ampl2)
# [0 0 0 1 1 1 1 0 0 0]
Say I have two lists (always the same length):
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
I have the following rules for intersections and unions I need to apply when comparing these lists element-wise:
# union and intersect
uni = [0]*len(l0)
intersec = [0]*len(l0)
for i in range(len(l0)):
    if l0[i] == l1[i]:
        uni[i] = l0[i]
        intersec[i] = l0[i]
    else:
        intersec[i] = 0
        if l0[i] == 0:
            uni[i] = l1[i]
        elif l1[i] == 0:
            uni[i] = l0[i]
        else:
            uni[i] = [l0[i], l1[i]]
Thus, the desired output is:
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
intersec: [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
While this works, I need to do this with several hundred very large lists (each, with thousands of elements), so I am looking for a way to vectorize this. I tried using np.where and various masking strategies, but that went nowhere fast. Any suggestions would be most welcome.
* EDIT *
Regarding
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
versus
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
I'm still fighting the 8 versus [0, 8] in my mind. The lists are derived from BIO tags in system annotations (see IOB labeling of text chunks), where each list element is a character index in a document and the value is an assigned enumerated label. 0 represents no annotation (i.e., it is used for determining negatives in a confusion matrix), while non-zero elements represent assigned enumerated labels for that character. Since I am ignoring true negatives, I think I can say 8 is equivalent to [0, 8]. As to whether this simplifies things, I am not yet sure.
* EDIT 2 *
I'm using [0, 8] to keep things simple and to keep the definitions of intersection and union consistent with set theory.
I would stay away from calling them 'intersection' and 'union', since those operations have well-defined meanings on sets and the operation you're looking to perform is neither of them.
However, to do what you want:
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
values = [
    (x if x == y else 0,     # element-wise "intersection"
     0 if x == y == 0        # element-wise "union"
     else x if y == 0
     else y if x == 0
     else [x, y])
    for x, y in zip(l0, l1)
]
result_a, result_b = map(list, zip(*values))  # intersections, unions
print(result_a)
print(result_b)
This is more than enough for thousands, or even millions of elements since the operation is so basic. Of course, if we're talking billions, you may want to look at numpy anyway.
Semi-vectorized solution for the union and a fully vectorized one for the intersection:
import numpy as np

l0 = np.array(l0)
l1 = np.array(l1)

# intersection: keep the value where the lists agree, else 0
intersec = np.zeros(l0.shape[0])
intersec_idx = np.where(l0 == l1)
intersec[intersec_idx] = l0[intersec_idx]
intersec = intersec.astype(int).tolist()

# union: same where they agree; pair the two values where they differ
union = np.zeros(l0.shape[0])
union_idx = np.where(l0 == l1)
union[union_idx] = l0[union_idx]
no_union_idx = np.where(l0 != l1)
union = union.astype(int).tolist()
for idx in no_union_idx[0]:
    union[idx] = [l0[idx], l1[idx]]
and the output:
>>> intersec
[0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
>>> union
[0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
NB: I think your original union solution is incorrect. See the last output 8 vs [0,8]
So in a binary array I'm trying to find the points where a 0 and a 1 are next to each other, and redraw the array with these crossover points indicated by modifying the 0 value. Just wondering if there's a better way of comparing each of the values in a numpy array to the 8 surrounding values than using nested for loops.
Currently I have this, which compares to the 4 surrounding values just for readability here:
for x in range(1, rows - 1):
    for y in range(1, columns - 1):
        if f2[x, y] == 0:
            if f2[x-1, y] == 1 or f2[x+1, y] == 1 or f2[x, y-1] == 1 or f2[x, y+1] == 1:
                f2[x, y] = 2
EDIT
For example
[[1, 1, 1, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 1, 1],
[1, 1, 0, 0, 0, 1, 1],
[1, 1, 0, 0, 0, 1, 1],
[1, 1, 1, 1, 1, 1, 1]]
to
[[1, 1, 1, 1, 1, 1, 1],
[1, 1, 2, 2, 2, 1, 1],
[1, 1, 2, 0, 2, 1, 1],
[1, 1, 2, 2, 2, 1, 1],
[1, 1, 1, 1, 1, 1, 1]]
This problem can be solved quickly with binary morphology functions:
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure
# Example array
f2 = np.zeros((5,5), dtype=float)
f2[2,2] = 1.
# This line determines the connectivity (all 8 neighbors or just 4)
struct_8_neighbors = generate_binary_structure(2, 2)
# Replace cell with maximum of neighbors (True if any neighbor != 0)
has_neighbor = binary_dilation(f2 != 0, structure=struct_8_neighbors)
# Was cell zero to begin with
was_zero = f2 == 0
# Update step
f2[has_neighbor & was_zero] = 2.
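Applied to the example from the question, the same steps reproduce the expected output (a quick check; note the interior zero has no nonzero neighbor, so it stays 0):
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

f2 = np.array([[1, 1, 1, 1, 1, 1, 1],
               [1, 1, 0, 0, 0, 1, 1],
               [1, 1, 0, 0, 0, 1, 1],
               [1, 1, 0, 0, 0, 1, 1],
               [1, 1, 1, 1, 1, 1, 1]])

struct_8 = generate_binary_structure(2, 2)
f2[binary_dilation(f2 != 0, structure=struct_8) & (f2 == 0)] = 2
print(f2)
# [[1 1 1 1 1 1 1]
#  [1 1 2 2 2 1 1]
#  [1 1 2 0 2 1 1]
#  [1 1 2 2 2 1 1]
#  [1 1 1 1 1 1 1]]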
I have a 2D numpy array with about 12 columns and 1000+ rows, and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system, where 1 and 2 give -1 point and 4 and 5 give +1.
If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3] the point for this row should be +2, because 3*1 + 1*(-1) = 2. Next row may be [1, 2, 2, 3, 3, 3] and should be -3 points.
At first, I tried a straightforward loop solution, but I realized there are 924 possible combinations of 6 columns out of 12 to compare (665,280 if order mattered), and when I also need to search for the best quintuple, quadruple, etc., the loop takes forever.
Is there perhaps a smarter numpy-way of solving my problem?
import numpy as np
import itertools
N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows, 12))
x = np.array([0,-1,-1,0,1,1])
y = x[arr]
print(y)
score, best_sextuple = max((y[:, cols].sum(), cols)
                           for cols in itertools.combinations(range(12), 6))
print('''\
score: {s}
sextuple: {c}
'''.format(s = score, c = best_sextuple))
yields, for example,
score: 6
sextuple: (0, 1, 5, 8, 10, 11)
Explanation:
First, let's generate a random example, with 12 columns and 10 rows:
N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows, 12))
Now we can use numpy indexing to convert the numbers 1, 2, ..., 5 in arr to the values -1, 0, 1 (according to your scoring system):
x = np.array([0, -1, -1, 0, 1, 1])  # index 0 is a dummy; arr values are 1..5
y = x[arr]
Next, let's use itertools.combinations to generate all possible combinations of 6 columns:
for cols in itertools.combinations(range(12),6)
and
y[:,cols].sum()
then gives the score for cols, a choice of columns (sextuple).
Finally, use max to pick off the sextuple with the best score:
score, best_sextuple = max((y[:, cols].sum(), cols)
                           for cols in itertools.combinations(range(12), 6))
import numpy
A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, 0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = list(map(best6, points))  # list() so the map is materialized in Python 3
bestcolumns will now contain the best 6 columns in ascending order of column sum. This works because a sextuple's score is just the sum of its six column sums, so the optimal sextuple is simply the six columns with the largest sums and no combinatorial search is needed. By similar logic, allbestcolumns will contain the best six columns in each row.
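A quick way to convince yourself of that equivalence is to check the top-6-column-sums pick against the brute-force search (a sketch, with illustrative names; tie-broken column choices may differ, but the scores must agree):
import itertools
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(1, 6, size=(100, 12))
points = -1 * ((A == 1) | (A == 2)) + 1 * ((A == 4) | (A == 5))

# Brute force over all 924 sextuples
brute_score, brute_cols = max((points[:, cols].sum(), cols)
                              for cols in itertools.combinations(range(12), 6))
# Greedy: six columns with the largest column sums
top6 = np.argsort(points.sum(axis=0))[-6:]
assert points[:, top6].sum() == brute_score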
Extending unutbu's longer answer above, it's possible to generate the remapped array of scores automatically. Since the scoring of values is the same on every pass through the loop, the score for each value only needs to be calculated once. Here's a slightly inelegant way to do it on an example 6x10 array, before and after the scores are applied.
>>> import numpy
>>> values = numpy.random.randint(6, size=(6,10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
[2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
[3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
[2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
[0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
[2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[ b<3 ] = -1
>>> b[ b==3 ] = 0
>>> b[ b>3 ] = 1
>>> b
array([[ 1, 1, -1, -1, -1, 1, -1, -1, -1, 1],
[-1, 1, -1, -1, 0, -1, 0, 1, 0, -1],
[ 0, 0, 1, 1, -1, -1, 1, -1, -1, -1],
[-1, 1, -1, -1, 1, -1, 1, -1, -1, -1],
[-1, 1, -1, -1, -1, 0, 0, 1, -1, -1],
[-1, 0, 0, 1, -1, -1, -1, -1, 0, -1]])
Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.
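The numpy-only combination generators from that thread are more involved; as a middle ground, you can at least vectorize the scoring of all sextuples in one shot (a sketch, reusing the points array from above):
import itertools
import numpy as np

cols = np.array(list(itertools.combinations(range(12), 6)))  # shape (924, 6)
colsums = points.sum(axis=0)          # per-column score
scores = colsums[cols].sum(axis=1)    # score of every sextuple at once
best_sextuple = cols[scores.argmax()]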