Rand Index function (clustering performance evaluation) - python

As far as I know, there is no package available for the Rand Index in Python, whereas for the Adjusted Rand Index you can use sklearn.metrics.adjusted_rand_score(labels_true, labels_pred).
I wrote code for the Rand Index and am sharing it here as an answer to the post.

from itertools import combinations
import numpy as np
from scipy.special import comb  # scipy.misc.comb is gone from newer SciPy releases
def check_clusterings(labels_true, labels_pred):
    """Check that the two clusterings are 1D arrays of the same size."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # input checks
    if labels_true.ndim != 1:
        raise ValueError(
            "labels_true must be 1D: shape is %r" % (labels_true.shape,))
    if labels_pred.ndim != 1:
        raise ValueError(
            "labels_pred must be 1D: shape is %r" % (labels_pred.shape,))
    if labels_true.shape != labels_pred.shape:
        raise ValueError(
            "labels_true and labels_pred must have same size, got %d and %d"
            % (labels_true.shape[0], labels_pred.shape[0]))
    return labels_true, labels_pred
def rand_score(labels_true, labels_pred):
    """Given the true and predicted labels, return the Rand Index."""
    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    # All index pairs (i, j) with i < j.
    my_pair = list(combinations(range(len(labels_true)), 2))
    def is_equal(x):
        return x[0] == x[1]
    my_a = 0  # pairs placed together in both labelings
    my_b = 0  # pairs separated in both labelings
    for i, j in my_pair:
        same_true = is_equal((labels_true[i], labels_true[j]))
        same_pred = is_equal((labels_pred[i], labels_pred[j]))
        if same_true and same_pred:
            my_a += 1
        if not same_true and not same_pred:
            my_b += 1
    my_denom = comb(len(labels_true), 2)
    ri = (my_a + my_b) / my_denom
    return ri
As a simple example:
labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]
rand_score (labels_true, labels_pred)
#0.46666666666666667
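The value above can be checked by hand. Writing a for the number of pairs that share a label in both labelings and b for the number of pairs that differ in both (these symbols are my own shorthand, not from the post), the 6 points form 15 pairs. The pairs (0,1), (2,4) and (3,5) agree in both labelings, so a = 3, and the pairs (0,3), (0,5), (1,3) and (1,5) are split in both, so b = 4:

```latex
\mathrm{RI} = \frac{a + b}{\binom{n}{2}} = \frac{3 + 4}{\binom{6}{2}} = \frac{7}{15} \approx 0.4667
```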
There are probably ways to improve it and make it more Pythonic. If you have any suggestions, feel free to improve it.
I found this implementation which seems faster.
import numpy as np
from scipy.special import comb  # scipy.misc.comb is gone from newer SciPy releases
def rand_index_score(clusters, classes):
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)
As a simple example:
labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]
rand_index_score (labels_true, labels_pred)
#0.46666666666666667

Starting from scikit-learn 0.24.0, the sklearn.metrics.rand_score function has been added, implementing the (unadjusted) Rand index. Please check the changelog.
All you have to do is:
from sklearn.metrics import rand_score
rand_score(labels_true, labels_pred)
labels_true and labels_pred can have values in different domains. For example:
>>> rand_score(['a', 'b', 'c'], [5, 6, 7])
1.0

Here is my code:
import itertools as it
def rand_index_score(y_gold, y_predict):
    # All index pairs (i, j) with i < j.
    index1_index2_pairs = list(it.combinations(range(len(y_gold)), 2))
    numberOfPairs = len(index1_index2_pairs)
    fractalUpperPart = 0  # numerator: pairs on which both labelings agree
    for index1, index2 in index1_index2_pairs:
        theyReallyAreInSameGroup = y_gold[index1] == y_gold[index2]
        itIsPredictedThatTheyAreInSameGroup = y_predict[index1] == y_predict[index2]
        if theyReallyAreInSameGroup == itIsPredictedThatTheyAreInSameGroup:
            fractalUpperPart += 1
    return fractalUpperPart / numberOfPairs

Simple neural network gives wrong output after training

I've been working on a simple neural network.
It takes in a data set with 3 columns; if the first column's value is 1, the output should be 1.
I've provided comments so it is easier to follow.
Code is as follows:
import numpy as np
import random
def sigmoid_derivative(x):
    return x * (1 - x)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def think(weights, inputs):
    sum = (weights[0] * inputs[0]) + (weights[1] * inputs[1]) + (weights[2] * inputs[2])
    return sigmoid(sum)
if __name__ == "__main__":
    # Assign random weights
    weights = [-0.165, 0.440, -0.867]
    # Training data for the network.
    training_data = [
        [0, 0, 1],
        [1, 1, 1],
        [1, 0, 1],
        [0, 1, 1]
    ]
    # The answers correspond to the training_data by place,
    # so first element of training_answers is the answer to the first element of training_data
    # NOTE: The pattern is if there's a 1 in the first place, the result should be a one
    training_answers = [0, 1, 1, 0]
    # Train the neural network
    for iteration in range(50000):
        # Pick a random piece of training_data
        selected = random.randint(0, 3)
        training_output = think(weights, training_data[selected])
        # Calculate the error
        error = training_output - training_answers[selected]
        # Calculate the adjustments that need to be applied to the weights
        adjustments = np.dot(training_data[selected], error * sigmoid_derivative(training_output))
        # Apply adjustments, maybe something wrong is going here?
        weights += adjustments
    print("The Neural Network has been trained!")
    # Result of print below should be close to 1
    print(think(weights, [1, 0, 0]))
The result of the last print should be close to 1, but it is not.
I have a feeling that I'm not adjusting the weights correctly.

How to create a 2D array with N lots of random numbers?

I am trying to obtain a variance for a value I obtained by processing a 2x150 array into a discrete correlation function. In order to do this I need to randomly sample 80% of the original data N times, which will allow me to calculate a variance over these values.
I have so far been able to create one randomly sampled set of indices using this:
rand_indices = []
running_var = (len(find_length)*0.8)
x=0
while x<running_var:
    rand_inx = randint(0, (len(find_length)-1))
    rand_indices.append(rand_inx)
    x=x+1
which creates an array 80% of the length of my original with randomly selected indices to be picked out and processed.
My problem is that I am not sure how to iterate this in order to get N sets of these random numbers, I think ideally in a Nx120 sized array. My whole code so far is:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from random import randint
useless, just_to, find_length = np.loadtxt("w2_mjy_final.dat").T
w2_dat = np.loadtxt("w2_mjy_final.dat")
w2_rel = np.delete(w2_dat, 2, axis = 1)
w2_array = np.asarray(w2_rel)
w1_dat = np.loadtxt("w1_mjy_final.dat")
w1_rel = np.delete(w1_dat, 2, axis=1)
w1_array = np.asarray(w1_rel)
peaks = []
y=1
N = 0
x = 0
z = 0
rand_indices = []
rand_indices2d = []
running_var = (len(find_length)*0.8)
while z<N:
    while x<running_var:
        rand_inx = randint(0, (len(find_length)-1))
        rand_indices.append(rand_inx)
        x=x+1
    rand_indices2d.append(rand_indices)
    z=z+1
while y<N:
    w1_sampled = w1_array[rand_indices, :]
    w2_sampled = w2_array[rand_indices, :]
    w1s_t, w1s_dat = zip(*w1_sampled)
    w2s_t, w2s_dat = zip(*w2_sampled)
    w2s_mean = np.mean(w2s_dat)
    w2s_stdev = np.std(w2s_dat)
    w1s_mean = np.mean(w1s_dat)
    w1s_stdev = np.std(w1s_dat)
    taus = []
    dcfs = []
    bins = 40
    for i in w2s_t:
        for j in w1s_t:
            tau_datpoint = i-j
            taus.append(tau_datpoint)
    for k in w2s_dat:
        for l in w1s_dat:
            dcf_datpoint = ((k - w2s_mean)*(l - w1s_mean))/((w2s_stdev*w1s_stdev))
            dcfs.append(dcf_datpoint)
    plotdat = np.vstack((taus, dcfs)).T
    sort_plotdat = sorted(plotdat, key=lambda x:x[0])
    np.savetxt("w1sw2sarray.txt", sort_plotdat)
    taus_sort, dcfs_sort = np.loadtxt("w1w2array.txt").T
    dcfs_means, taubins_edges, taubins_number = stats.binned_statistic(taus_sort, dcfs_sort, statistic='mean', bins=bins)
    taubin_edge = np.delete(taubins_edges, 0)
    import operator
    indexs, values = max(enumerate(dcfs_means), key=operator.itemgetter(1))
    percents = values*0.8
    dcf_lists = dcfs_means.tolist()
    centarr_negs, centarr_poss = np.split(dcfs_means, [indexs])
    centind_negs = np.argmin(np.abs(centarr_negs - percents))
    centind_poss = np.argmin(np.abs(centarr_poss - percents))
    lagcent_negs = taubins_edges[centind_negs]
    lagcent_poss = taubins_edges[int((bins/2)+centind_poss)]
    sampled_peak = (np.abs(lagcent_poss - lagcent_negs)/2)+lagcent_negs
    peaks.append(sampled_peak)
    y=y+1
print peaks
Seeing as you're using numpy already, why not use np.random.randint?
In your case:
np.random.randint(len(find_length)-1, size=(N, running_var))
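This would give you an N x running_var matrix of random integers from 0 to len(find_length)-2 inclusive. The upper bound of np.random.randint is exclusive, so pass len(find_length) instead if every index should be reachable. Note also that running_var must be an integer here, e.g. int(len(find_length)*0.8).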
Example Usage:
>>> N=4
>>> running_var=6
>>> find_length = [1,2,3]
>>> np.random.randint(len(find_length)-1, size=(N, running_var))
array([[1, 0, 1, 0, 0, 1],
[1, 0, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 0],
[1, 1, 0, 1, 0, 1]])
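As a variation on the answer above, here is a minimal sketch using NumPy's newer Generator interface. The names n_points, n_sets and sample_size are stand-ins I chose for len(find_length), N and the 80% sample length, and the seed is only there to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_points = 150                      # stand-in for len(find_length)
n_sets = 5                          # stand-in for N
sample_size = int(n_points * 0.8)   # size must be an integer, so truncate the 80% length

# integers() excludes the upper bound, so this covers indices 0 .. n_points-1.
# Each row is one resampled index set (sampling with replacement, like the randint loop).
rand_indices2d = rng.integers(0, n_points, size=(n_sets, sample_size))

print(rand_indices2d.shape)  # (5, 120)
```

Each row can then be used directly as a fancy index, for example w1_array[rand_indices2d[0], :]. If indices should not repeat within a sample, rng.choice(n_points, size=sample_size, replace=False) per row would be one option instead.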

Extracting images of digits from a dataset in digits classification in python

I'm trying to implement digit classification with the k-NN algorithm.
#pseudocode
note:(the order of the digits follow the order of the labels)
i.e
train={digit0,digit0,digit1,digit2...}
label={0,0,1,2...}
labels.shape = (10000,)
train.shape=(784*1000)
I have a large dataset of 10000 digits (0 to 9) stored as 28 x 28 pixel images, along with their labels. The labels and digits are arranged in the same order.
I need to extract the digits 0 and 1 from the dataset and run k-NN on them for k = {1, 2, 3, 4, 5}. I need help with the digit extraction.
Any suggestions will be appreciated.
For numpy array you can use
selected = train[ (label == 0) | (label == 1) ]
import numpy as np
train = np.array(['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1'])
label = np.array([0, 0, 1, 2])
selected = train[ (label == 0) | (label == 1) ]
print(selected)
For pandas DataFrame something similar
selected = train['item'][ (label['val'] == 0) | (label['val'] == 1) ]
import pandas as pd
train = pd.DataFrame({'item': ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1']})
label = pd.DataFrame({'val': [0, 0, 1, 2]})
selected = train['item'][ (label['val'] == 0) | (label['val'] == 1) ]
print(selected)
or if you will keep all in one DataFrame
import pandas as pd
df = pd.DataFrame({
    'train': ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1'],
    'label': [0, 0, 1, 2]
})
selected = df['train'][ (df['label'] == 0) | (df['label'] == 1) ]
print(selected)
For normal list you could use zip() to create pairs
train = ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1']
label = [0, 0, 1, 2]
selected = []
for t, l in zip(train, label):
    if l in (0, 1):
        selected.append(t)
print(selected)
The same with list comprehension
train = ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1']
label = [0, 0, 1, 2]
selected = [t for t, l in zip(train, label) if l in (0, 1)]
print(selected)
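The same boolean-mask idea carries over to actual image data. The sketch below assumes one flattened 28 x 28 image per row, which is my assumption and not stated in the question; the random pixels and the six labels are hypothetical stand-ins for the real dataset.

```python
import numpy as np

# Hypothetical stand-in data: 6 flattened 28x28 images, one image per row.
rng = np.random.default_rng(0)
train = rng.random((6, 784))
label = np.array([0, 0, 1, 2, 1, 9])

# True where the label is 0 or 1; equivalent to (label == 0) | (label == 1).
mask = np.isin(label, [0, 1])

train_01 = train[mask]   # keeps only the rows whose label is 0 or 1
label_01 = label[mask]   # matching labels, in the same order

print(train_01.shape, label_01)
```

If the images are stored one per column instead (shape 784 x n), the selection would be train[:, mask].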

How to calculate mean average precision (mAP) using TensorFlow?

I want to use TensorFlow to calculate the mAP (mean average precision) of hash codes, but I don't know how to express it directly with tensor operations.
The NumPy code is as follows:
import numpy as np
import time
import os
# read train and test binarayCode
CURRENT_DIR = os.getcwd()
def getCode(train_codes,train_groudTruth,test_codes,test_groudTruth):
    line_number = 0
    with open(CURRENT_DIR+'/result.txt','r') as f:
        for line in f:
            temp = line.strip().split('\t')
            if line_number < 10000:
                test_codes.append([i if i==1 else -1 for i in map(int, list(temp[0]))])
                list2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                list2[int(temp[1])] = 1
                test_groudTruth.append(list2) # get test ground truth(0-9)
            else:
                train_codes.append([i if i==1 else -1 for i in map(int, list(temp[0]))]) # change to -1, 1
                list2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                list2[int(temp[1])] = 1
                train_groudTruth.append(list2) # get test ground truth(0-9)
            line_number += 1
    print 'read data finish'
def getHammingDist(code_a,code_b):
    dist = 0
    for i in range(len(code_a)):
        if code_a[i]!=code_b[i]:
            dist += 1
    return dist
if __name__ =='__main__':
    print getNowTime(),'start!'
    train_codes = []
    train_groudTruth =[]
    test_codes = []
    test_groudTruth = []
    # get g.t. and binary code
    getCode(train_codes,train_groudTruth,test_codes,test_groudTruth)
    train_codes = np.array(train_codes)
    train_groudTruth = np.array(train_groudTruth)
    test_codes = np.array(test_codes)
    test_groudTruth = np.array(test_groudTruth)
    numOfTest = 10000
    # generate hanmming martix, g.t. martix 10000*50000
    gt_martix = np.dot(test_groudTruth, np.transpose(train_groudTruth))
    print getNowTime(),'gt_martix finish!'
    ham_martix = np.dot(test_codes, np.transpose(train_codes)) # hanmming distance map to dot value
    print 'ham_martix finish!'
    # sort hanmming martix,Returns the indices that would sort an array.
    sorted_ham_martix_index = np.argsort(ham_martix,axis=1)
    # calculate mAP
    print 'sort ham_matrix finished,start calculate mAP'
    apall = np.zeros((numOfTest,1),np.float64)
    for i in range(numOfTest):
        x = 0.0
        p = 0
        test_oneLine = sorted_ham_martix_index[i,:]
        length = test_oneLine.shape[0]
        num_return_NN = 5000 # top 1000
        for j in range(num_return_NN):
            if gt_martix[i][test_oneLine[length-j-1]] == 1: # reverse
                x += 1
                p += x/(j+1)
        if p == 0:
            apall[i]=0
        else:
            apall[i]=p/x
    mAP = np.mean(apall)
    print 'mAP:',mAP
I want to rewrite the code above using tensor operations (such as tf.equal(), tf.reduce_sum(), and so on).
For example, this is how I calculate the validation accuracy of images:
logits = self._model(x_valid)
valid_preds = tf.argmax(logits, axis=1)
valid_preds = tf.to_int32(valid_preds)
self.valid_acc = tf.equal(valid_preds, y_valid)
self.valid_acc = tf.to_int32(self.valid_acc)
self.valid_acc = tf.to_float(tf.reduce_sum(self.valid_acc))/tf.to_float(self.batch_size)
I want to calculate the hash codes' mAP (mean average precision) with TensorFlow in the same way (using tf.XX operations).
How can I do this? Thanks!
You can just calculate the y_score (or predictions) and then use sklearn.metrics to calculate the average precision:
from sklearn.metrics import average_precision_score
predictions = model.predict(x_test)
average_precision_score(y_test, predictions)
If you just want to calculate average precision based on the validation set predictions, you can use the vector of predicted probabilities and the vector of true labels in this scikit-learn function.
If you really want to use a tensorflow function, there's a tensorflow function average_precision_at_k.
For more info about average precision you can see this article.
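To make the quantity concrete, here is a small NumPy sketch of the same average precision that the question's inner loop accumulates (the running hit count divided by the rank at each hit, averaged over the hits), written with array operations only. The function names and the toy relevance matrix are my own; each row is assumed to be already sorted by rank, with 1 marking a relevant result. The same steps should map onto cumulative-sum and reduction ops in TensorFlow, though I have not written that version here.

```python
import numpy as np


def average_precision(relevance):
    """AP of one ranked list; relevance[j] is 1 if the j-th result is a hit."""
    relevance = np.asarray(relevance, dtype=np.float64)
    hits = relevance.sum()
    if hits == 0:
        return 0.0
    # precision_at[j] = (hits among the first j+1 results) / (j+1)
    precision_at = np.cumsum(relevance) / np.arange(1, relevance.size + 1)
    # Average the precision values taken at the positions of the hits.
    return float((precision_at * relevance).sum() / hits)


def mean_average_precision(relevance_matrix):
    """Mean of the per-query AP values, one query per row."""
    return float(np.mean([average_precision(row) for row in relevance_matrix]))


# Toy example: 3 queries, 4 ranked results each.
ranked = np.array([[1, 0, 1, 0],
                   [0, 0, 0, 0],
                   [0, 1, 1, 1]])
print(mean_average_precision(ranked))
```

For the first row, the hits sit at ranks 1 and 3, so the AP is (1/1 + 2/3) / 2.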

Python - Remove a row from numpy array?

Hi all, what I want should be really simple for somebody here. I want to remove a row from a numpy array in a loop, like:
for i in range(len(self.Finalweight)):
    if self.Finalweight[i] >= self.cutoffOutliers:
        "remove row i from self.wData"
I'm trying to remove outliers from a dataset. The full code of the method is:
def calculate_Outliers(self):
    def calcWeight(Value):
        pFinal = abs(Value - self.pMed)/ self.pDev_abs_Med
        gradFinal = abs(gradient(Value) - self.gradMed) / self.gradDev_abs_Med
        return pFinal * gradFinal
    self.pMed = median(self.wData[:,self.yColum-1])
    self.pDev_abs_Med = median(abs(self.wData[:,self.yColum-1] - self.pMed))
    self.gradMed = median(gradient(self.wData[:,self.yColum-1]))
    self.gradDev_abs_Med = median(abs(gradient(self.wData[:,self.yColum-1]) - self.gradMed))
    self.workingData= self.wData[calcWeight(self.wData)<self.cutoffOutliers]
    self.xData = self.workingData[:,self.xColum-1]
    self.yData = self.workingData[:,self.yColum-1]
I'm getting the following error:
  File "bin/dmtools", line 201, in plot_gride
    self.calculate_Outliers()
  File "bin/dmtools", line 188, in calculate_Outliers
    self.workingData= self.wData[calcWeight(self.wData)>self.cutoffOutliers]
ValueError: too many indices for array
There is actually a tool in NumPy specifically made to mask out outliers and invalid data points: masked arrays. Example from the linked page:
import numpy
x = numpy.array([1, 2, 3, -1, 5])
mx = numpy.ma.masked_array(x, mask=[0, 0, 0, 1, 0])
print(mx.mean())
prints
2.75
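If the rows really should be dropped instead of masked, plain boolean indexing is another option. The sketch below uses made-up data and a made-up weights vector; the key assumption is that the mask is 1-D with one entry per row. My guess, which I have not verified against the asker's data, is that the "too many indices" error comes from computing the weight on the whole 2-D array, so that the mask has more dimensions than the array being indexed.

```python
import numpy as np

# Hypothetical data: column 0 is x, column 1 is y; the third row is the outlier.
data = np.array([[0.0, 1.0],
                 [1.0, 1.1],
                 [2.0, 9.5],
                 [3.0, 0.9]])

# Hypothetical per-row outlier weights, one value per row of data.
weights = np.array([0.2, 0.1, 5.0, 0.3])
cutoff = 1.0

keep = weights < cutoff   # 1-D boolean mask, True for rows to keep
cleaned = data[keep]      # new array without the outlier rows

print(cleaned)
```

In the asker's method this would correspond to computing the weight from the single column self.wData[:, self.yColum-1] before indexing self.wData with the resulting mask.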
