Extracting images of digits from a dataset for digit classification in Python

I'm trying to implement digit classification with the kNN algorithm.
# pseudocode
# note: the order of the digits follows the order of the labels, i.e.
train = {digit0, digit0, digit1, digit2, ...}
label = {0, 0, 1, 2, ...}
labels.shape = (10000,)
train.shape = (10000, 784)   # 10000 images of 28*28 = 784 pixels each
I have a dataset of 10000 digits from 0 to 9 as images of 28 * 28 pixels, along with their labels. The labels and digits are arranged in the same order.
I need to extract the digits 0 and 1 from the dataset and perform kNN for different values of k = {1, 2, 3, 4, 5} on those 28*28-pixel digits. I need help with the digit extraction.
Any suggestions will be appreciated.

For a NumPy array you can use
selected = train[ (label == 0) | (label == 1) ]
import numpy as np
train = np.array(['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1'])
label = np.array([0, 0, 1, 2])
selected = train[ (label == 0) | (label == 1) ]
print(selected)
For a pandas DataFrame it is similar
selected = train['item'][ (label['val'] == 0) | (label['val'] == 1) ]
import pandas as pd
train = pd.DataFrame({'item': ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1']})
label = pd.DataFrame({'val': [0, 0, 1, 2]})
selected = train['item'][ (label['val'] == 0) | (label['val'] == 1) ]
print(selected)
or, if you keep everything in one DataFrame
import pandas as pd
df = pd.DataFrame({
    'train': ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1'],
    'label': [0, 0, 1, 2]
})
selected = df['train'][ (df['label'] == 0) | (df['label'] == 1) ]
print(selected)
For plain lists you could use zip() to create pairs
train = ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1']
label = [0, 0, 1, 2]
selected = []
for t, l in zip(train, label):
    if l in (0, 1):
        selected.append(t)
print(selected)
The same with a list comprehension
train = ['digit0-1', 'digit0-2', 'digit1-1', 'digit2-1']
label = [0, 0, 1, 2]
selected = [t for t, l in zip(train, label) if l in (0, 1)]
print(selected)
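To tie this back to the original question: once the 0/1 rows are selected with such a mask, a kNN classifier can be fit for each k. A minimal sketch, assuming train is a NumPy array of shape (10000, 784) and labels the matching (10000,) array as described in the question, and that scikit-learn is available (the hold-out split and variable names here are illustrative, not from the question):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# keep only the rows whose label is 0 or 1
mask = (labels == 0) | (labels == 1)
X01, y01 = train[mask], labels[mask]
# hold out part of the data to measure accuracy for each k
X_tr, X_te, y_tr, y_te = train_test_split(X01, y01, test_size=0.2, random_state=0)
for k in [1, 2, 3, 4, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    print(k, knn.score(X_te, y_te))  # accuracy of kNN with this k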

Related

Audio Data Augmentation in Python

I am using the function below to augment audio data generated from WAV audio files.
def generate_augmented_data(file_path):
    augmented_data = []
    samples = load_wav(file_path, get_duration=False)
    for time_value in [0.7, 1, 1.3]:
        for pitch_value in [-1, 0, 1]:
            time_stretch_data = librosa.effects.time_stretch(samples, rate=time_value)
            final_data = librosa.effects.pitch_shift(time_stretch_data, sr=sample_rate, n_steps=pitch_value)
            augmented_data.append(final_data)
    return augmented_data
I also need to augment the class labels and am facing difficulties with it.
I tried the code below, but it is not giving me the expected result.
## generating augmented data.
def generate_augmented_data_label(file_path, label):
    augmented_data = []
    augmented_label = []
    samples = load_wav(file_path, get_duration=False)
    for time_value in [0.7, 1, 1.3]:
        for pitch_value in [-1, 0, 1]:
            time_stretch_data = librosa.effects.time_stretch(samples, rate=time_value)
            final_data = librosa.effects.pitch_shift(time_stretch_data, sr=sample_rate, n_steps=pitch_value)
            augmented_data.append(final_data)
            augmented_label.append(label)
    return augmented_data, augmented_label
Before augmentation, the shapes of the data and labels are as below:
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_train_augmented_data = []
y_train_augmented_data = []
for i in range(len(X_train)):
    #print(i)
    t1 = X_train.iloc[i]
    t2 = y_train[i]
    tmp1, tmp2 = generate_augmented_data_label(t1, t2)
    #print(tmp1, tmp2)
    X_train_augmented_data.append(tmp1)
    y_train_augmented_data.append(tmp2)
len(X_train)
1600
len(y_train)
1600
print(len(X_train_augmented_data))
print(len(y_train_augmented_data))
After data augmentation and an additional masking step, the shapes come out as:
augmented_train_data_mask = []
for i in range(0, len(augmented_train_data_pad)):
    augmented_train_data_mask.append(list(map(bool, augmented_train_data_pad[i])))
augmented_train_data_mask = np.array(augmented_train_data_mask)
print(augmented_train_data_pad.shape)
print(augmented_train_data_mask.shape)
(14400, 17640)
(14400, 17640)
However, the label length is still 1600. Later, when I pass these into an LSTM model, I get a shape mismatch error:
ValueError: Data cardinality is ambiguous:
x sizes: 14400, 14400
y sizes: 1600
Make sure all arrays contain the same number of samples.
Looking for some help to resolve this issue.
You can use NumPy's repeat function to replicate your array.
For example:
In: arr = np.arange(3)
out: array([0, 1, 2])
In : arr.repeat(3)
Out: array([0, 0, 0, 1, 1, 1, 2, 2, 2])
Hope this meets your requirement.
You may refer to this link for reference:
https://www.geeksforgeeks.org/python-add-similar-value-multiple-times-in-list/
type(y_train) is a pandas Series.
from itertools import repeat
new_label = []
for index, value in y_train.items():
    new_label.extend(repeat(value, 2))
len(new_label)
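Since generate_augmented_data_label returns 9 augmented clips per original file (3 time-stretch values x 3 pitch shifts), the labels have to be expanded by the same factor and the nested data lists flattened before training. A minimal sketch of that bookkeeping, assuming the variable names from the question (the factor of 9 follows from the two loops above):
import numpy as np
augmentations_per_clip = 3 * 3  # 3 time-stretch values x 3 pitch shifts
# flatten the per-clip lists: 1600 originals -> 14400 augmented samples
X_aug_flat = [clip for clips in X_train_augmented_data for clip in clips]
# repeat each label once per augmented clip so x and y stay in sync
y_aug_flat = np.repeat(np.asarray(y_train), augmentations_per_clip)
print(len(X_aug_flat), len(y_aug_flat))  # both 1600 * 9 = 14400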

Drawing equal samples from each class in stratified sampling

So I have 1000 samples of class 1 and 2500 samples of class 2. Naturally, when using
scikit-learn's train_test_split(test_size=200, stratify=y), I get an imbalanced test set, since it preserves the class distribution of the original data set. However, I would like the split to give 100 class-1 and 100 class-2 samples in the test set.
How would I do it? Any suggestions would be appreciated.
Split Manually
A manual solution isn't that scary. Main steps explained:
Isolate the index of class-1 and class-2 rows.
Use np.random.permutation() to select random n1 and n2 test samples for class 1 and 2 respectively.
Use df.index.difference() to perform inverse selection for the train samples.
The code can easily be generalized to an arbitrary number of classes and arbitrary numbers of test samples per class (just put n1/n2, idx1/idx2, etc. into lists and process them in loops), but that is beyond the scope of the question itself.
Code
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
# data
df = pd.DataFrame(
    data={
        "label": np.array([1]*1000 + [2]*2500),
        # label 1 has value > 0, label 2 has value < 0
        "value": np.hstack([np.random.uniform(0, 1, 1000),
                            np.random.uniform(-1, 0, 2500)])
    }
)
df = df.sample(frac=1).reset_index(drop=True)
# sampling number for each class
n1 = 100
n2 = 100
# 1. get indexes and lengths for the classes respectively
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1) # 1000
len2 = len(idx2) # 2500
# 2. draw index for test dataset
draw1 = np.random.permutation(len1)[:n1] # keep the first n1 entries to be selected
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]
# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])
# 3. derive index for train dataset
idx_train = df.index.difference(idx_test)
# split
df_train = df.loc[idx_train, :] # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]
# len(df_train) = 3300
# len(df_test) = 200
# verify that no row was missing
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
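For completeness, the same per-class test draw can also be written with pandas' grouped sampling (available in pandas >= 1.1). This is an alternative sketch using the same df with a "label" column, not part of the manual solution above:
# draw 100 test rows from each class, use the remaining rows for training
df_test = df.groupby("label", group_keys=False).sample(n=100, random_state=0)
df_train = df.drop(df_test.index)
print(df_test["label"].value_counts())  # 100 per class
print(len(df_train), len(df_test))      # 3300, 200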

How to create a 2D array with N lots of random numbers?

I am trying to obtain a variance for a value I obtained by processing a 2x150 array into a discrete correlation function. In order to do this I need to randomly sample 80% of the original data N times, which will allow me to calculate a variance over these values.
So far I have been able to create one randomly sampled set of data using this:
rand_indices = []
running_var = (len(find_length)*0.8)
x = 0
while x < running_var:
    rand_inx = randint(0, (len(find_length)-1))
    rand_indices.append(rand_inx)
    x = x + 1
which creates an array 80% of the length of my original with randomly selected indices to be picked out and processed.
My problem is that I am not sure how to iterate this in order to get N sets of these random numbers, ideally in an N x 120 array. My whole code so far is:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from random import randint
useless, just_to, find_length = np.loadtxt("w2_mjy_final.dat").T
w2_dat = np.loadtxt("w2_mjy_final.dat")
w2_rel = np.delete(w2_dat, 2, axis = 1)
w2_array = np.asarray(w2_rel)
w1_dat = np.loadtxt("w1_mjy_final.dat")
w1_rel = np.delete(w1_dat, 2, axis=1)
w1_array = np.asarray(w1_rel)
peaks = []
y=1
N = 0
x = 0
z = 0
rand_indices = []
rand_indices2d = []
running_var = (len(find_length)*0.8)
while z<N:
    while x<running_var:
        rand_inx = randint(0, (len(find_length)-1))
        rand_indices.append(rand_inx)
        x=x+1
    rand_indices2d.append(rand_indices)
    z=z+1
while y<N:
    w1_sampled = w1_array[rand_indices, :]
    w2_sampled = w2_array[rand_indices, :]
    w1s_t, w1s_dat = zip(*w1_sampled)
    w2s_t, w2s_dat = zip(*w2_sampled)
    w2s_mean = np.mean(w2s_dat)
    w2s_stdev = np.std(w2s_dat)
    w1s_mean = np.mean(w1s_dat)
    w1s_stdev = np.std(w1s_dat)
    taus = []
    dcfs = []
    bins = 40
    for i in w2s_t:
        for j in w1s_t:
            tau_datpoint = i-j
            taus.append(tau_datpoint)
    for k in w2s_dat:
        for l in w1s_dat:
            dcf_datpoint = ((k - w2s_mean)*(l - w1s_mean))/((w2s_stdev*w1s_stdev))
            dcfs.append(dcf_datpoint)
    plotdat = np.vstack((taus, dcfs)).T
    sort_plotdat = sorted(plotdat, key=lambda x:x[0])
    np.savetxt("w1sw2sarray.txt", sort_plotdat)
    taus_sort, dcfs_sort = np.loadtxt("w1w2array.txt").T
    dcfs_means, taubins_edges, taubins_number = stats.binned_statistic(taus_sort, dcfs_sort, statistic='mean', bins=bins)
    taubin_edge = np.delete(taubins_edges, 0)
    import operator
    indexs, values = max(enumerate(dcfs_means), key=operator.itemgetter(1))
    percents = values*0.8
    dcf_lists = dcfs_means.tolist()
    centarr_negs, centarr_poss = np.split(dcfs_means, [indexs])
    centind_negs = np.argmin(np.abs(centarr_negs - percents))
    centind_poss = np.argmin(np.abs(centarr_poss - percents))
    lagcent_negs = taubins_edges[centind_negs]
    lagcent_poss = taubins_edges[int((bins/2)+centind_poss)]
    sampled_peak = (np.abs(lagcent_poss - lagcent_negs)/2)+lagcent_negs
    peaks.append(sampled_peak)
    y=y+1
print peaks
Seeing as you're using numpy already, why not use np.random.randint
In your case:
np.random.randint(len(find_length)-1, size=(N, running_var))
This would give you an N x running_var matrix with random integer entries from 0 to len(find_length)-2 inclusive.
Example Usage:
>>> N=4
>>> running_var=6
>>> find_length = [1,2,3]
>>> np.random.randint(len(find_length)-1, size=(N, running_var))
array([[1, 0, 1, 0, 0, 1],
       [1, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 1, 0],
       [1, 1, 0, 1, 0, 1]])
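To use that index matrix for the N resamplings, each row can be applied as fancy indexing into the data arrays. A minimal sketch, assuming find_length, w1_array and w2_array from the question; note that np.random.randint samples with replacement (and its high bound is exclusive, so passing len(find_length) covers every row), which matches the randint approach in the question:
import numpy as np
N = 100
running_var = int(len(find_length) * 0.8)
# one row of indices per resample, drawn with replacement
rand_indices2d = np.random.randint(len(find_length), size=(N, running_var))
peaks = []
for row in rand_indices2d:
    w1_sampled = w1_array[row, :]  # rows picked for this resample
    w2_sampled = w2_array[row, :]
    # ... compute the DCF peak for this resample and append it to peaks ...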

How to calculate mean average precision (mAP) using TensorFlow?

I want to use TensorFlow to calculate the mAP (mean average precision) of hash codes, but I don't know how to do the calculation directly with tensors.
The code using NumPy is the following:
import numpy as np
import time
import os
# read train and test binary codes
CURRENT_DIR = os.getcwd()
def getCode(train_codes,train_groudTruth,test_codes,test_groudTruth):
    line_number = 0
    with open(CURRENT_DIR+'/result.txt','r') as f:
        for line in f:
            temp = line.strip().split('\t')
            if line_number < 10000:
                test_codes.append([i if i==1 else -1 for i in map(int, list(temp[0]))])
                list2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                list2[int(temp[1])] = 1
                test_groudTruth.append(list2) # get test ground truth (0-9)
            else:
                train_codes.append([i if i==1 else -1 for i in map(int, list(temp[0]))]) # change to -1, 1
                list2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                list2[int(temp[1])] = 1
                train_groudTruth.append(list2) # get train ground truth (0-9)
            line_number += 1
    print 'read data finish'
def getHammingDist(code_a,code_b):
    dist = 0
    for i in range(len(code_a)):
        if code_a[i]!=code_b[i]:
            dist += 1
    return dist
if __name__ =='__main__':
    print getNowTime(),'start!'
    train_codes = []
    train_groudTruth =[]
    test_codes = []
    test_groudTruth = []
    # get ground truth and binary codes
    getCode(train_codes,train_groudTruth,test_codes,test_groudTruth)
    train_codes = np.array(train_codes)
    train_groudTruth = np.array(train_groudTruth)
    test_codes = np.array(test_codes)
    test_groudTruth = np.array(test_groudTruth)
    numOfTest = 10000
    # generate the Hamming matrix and ground-truth matrix, 10000*50000
    gt_martix = np.dot(test_groudTruth, np.transpose(train_groudTruth))
    print getNowTime(),'gt_martix finish!'
    ham_martix = np.dot(test_codes, np.transpose(train_codes)) # Hamming distance maps to dot value
    print 'ham_martix finish!'
    # sort the Hamming matrix; argsort returns the indices that would sort an array
    sorted_ham_martix_index = np.argsort(ham_martix,axis=1)
    # calculate mAP
    print 'sort ham_matrix finished, start calculating mAP'
    apall = np.zeros((numOfTest,1),np.float64)
    for i in range(numOfTest):
        x = 0.0
        p = 0
        test_oneLine = sorted_ham_martix_index[i,:]
        length = test_oneLine.shape[0]
        num_return_NN = 5000 # top 5000
        for j in range(num_return_NN):
            if gt_martix[i][test_oneLine[length-j-1]] == 1: # reverse
                x += 1
                p += x/(j+1)
        if p == 0:
            apall[i]=0
        else:
            apall[i]=p/x
    mAP = np.mean(apall)
    print 'mAP:',mAP
I want to re-write the code above using tensor operations (like tf.equal(), tf.reduce_sum(), and so on).
For example, this is how I calculate the validation accuracy of images:
logits = self._model(x_valid)
valid_preds = tf.argmax(logits, axis=1)
valid_preds = tf.to_int32(valid_preds)
self.valid_acc = tf.equal(valid_preds, y_valid)
self.valid_acc = tf.to_int32(self.valid_acc)
self.valid_acc = tf.to_float(tf.reduce_sum(self.valid_acc))/tf.to_float(self.batch_size)
I want to use TensorFlow to calculate the hash codes' mAP (mean average precision) in the same way (i.e. with tf.XX operations).
How could I do that? Thanks!
You can just calculate the y_score (or predictions) and then use sklearn.metrics to calculate the average precision:
from sklearn.metrics import average_precision_score
predictions = model.predict(x_test)
average_precision_score(y_test, predictions)
If you just want to calculate average precision based on the validation set predictions, you can use the vector of predicted probabilities and the vector of true labels in this scikit-learn function.
If you really want to use a tensorflow function, there's a tensorflow function average_precision_at_k.
For more info about average precision you can see this article.
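If what is needed is a mAP averaged over the 10 classes (note this is a classification-style mAP, not the ranked-retrieval mAP that the NumPy loop above computes), one hedged sketch with scikit-learn, assuming y_test is one-hot ground truth of shape (n_samples, 10) and y_score holds the per-class scores (these names are illustrative):
import numpy as np
from sklearn.metrics import average_precision_score
# average precision per class, then the mean over classes
per_class_ap = [average_precision_score(y_test[:, c], y_score[:, c]) for c in range(10)]
mAP = np.mean(per_class_ap)
# equivalently in one call: average_precision_score(y_test, y_score, average='macro')
print(mAP)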

Rand Index function (clustering performance evaluation)

As far as I know, there is no package available for the Rand index in Python, while for the Adjusted Rand Index you have the option of using sklearn.metrics.adjusted_rand_score(labels_true, labels_pred).
I wrote the code for the Rand score and I am going to share it with others as the answer to this post.
from scipy.special import comb  # comb lives in scipy.special in current SciPy (formerly scipy.misc)
from itertools import combinations
import numpy as np
def check_clusterings(labels_true, labels_pred):
    """Check that the two clusterings are matching 1D integer arrays."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # input checks
    if labels_true.ndim != 1:
        raise ValueError(
            "labels_true must be 1D: shape is %r" % (labels_true.shape,))
    if labels_pred.ndim != 1:
        raise ValueError(
            "labels_pred must be 1D: shape is %r" % (labels_pred.shape,))
    if labels_true.shape != labels_pred.shape:
        raise ValueError(
            "labels_true and labels_pred must have same size, got %d and %d"
            % (labels_true.shape[0], labels_pred.shape[0]))
    return labels_true, labels_pred
def rand_score(labels_true, labels_pred):
    """Given the true and predicted labels, return the Rand index."""
    check_clusterings(labels_true, labels_pred)
    my_pair = list(combinations(range(len(labels_true)), 2))  # all index pairs over the labels
    def is_equal(x):
        return (x[0] == x[1])
    my_a = 0
    my_b = 0
    for i in range(len(my_pair)):
        if(is_equal((labels_true[my_pair[i][0]], labels_true[my_pair[i][1]])) == is_equal((labels_pred[my_pair[i][0]], labels_pred[my_pair[i][1]]))
           and is_equal((labels_pred[my_pair[i][0]], labels_pred[my_pair[i][1]])) == True):
            my_a += 1
        if(is_equal((labels_true[my_pair[i][0]], labels_true[my_pair[i][1]])) == is_equal((labels_pred[my_pair[i][0]], labels_pred[my_pair[i][1]]))
           and is_equal((labels_pred[my_pair[i][0]], labels_pred[my_pair[i][1]])) == False):
            my_b += 1
    my_denom = comb(len(labels_true), 2)
    ri = (my_a + my_b) / my_denom
    return ri
As a simple example:
labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]
rand_score (labels_true, labels_pred)
#0.46666666666666667
There are probably ways to improve it and make it more Pythonic. If you have any suggestions, feel free to improve it.
I found this implementation which seems faster.
import numpy as np
from scipy.special import comb  # formerly scipy.misc.comb
def rand_index_score(clusters, classes):
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)
As a simple example:
labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]
rand_index_score (labels_true, labels_pred)
#0.46666666666666667
Starting from scikit-learn 0.24.0, the sklearn.metrics.rand_score function has been added, implementing the (unadjusted) Rand index. Please check the changelog.
All you have to do is:
from sklearn.metrics import rand_score
rand_score(labels_true, labels_pred)
labels_true and labels_pred can have values in different domains. For example:
>>> rand_score(['a', 'b', 'c'], [5, 6, 7])
1.0
Here is my code:
import itertools as it
def rand_index_score(y_gold, y_predict):
    index1_index2_pairs = list(it.combinations(range(len(y_gold)), 2))  # all index pairs over the labels
    numberOfPairs = len(index1_index2_pairs)
    fractalUpperPart = 0
    for index1_index2 in index1_index2_pairs:
        theyRealyAreInSameGroup = y_gold[index1_index2[0]] == y_gold[index1_index2[1]]
        itIsPredictedThatTheyAreInSameGroup = y_predict[index1_index2[0]] == y_predict[index1_index2[1]]
        if theyRealyAreInSameGroup == itIsPredictedThatTheyAreInSameGroup:
            fractalUpperPart += 1
    return fractalUpperPart / numberOfPairs
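The same pair-counting idea can be vectorized with NumPy broadcasting, which avoids building the explicit list of pairs. A sketch (not from the answers above), assuming 1D label arrays:
import numpy as np
def rand_index_vectorized(labels_true, labels_pred):
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    # pairwise "same cluster" matrices for true and predicted labels
    same_true = labels_true[:, None] == labels_true[None, :]
    same_pred = labels_pred[:, None] == labels_pred[None, :]
    # fraction of agreeing pairs over the upper triangle (each unordered pair once)
    iu = np.triu_indices(n, k=1)
    return (same_true[iu] == same_pred[iu]).mean()
# rand_index_vectorized([1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1])  # 0.4666...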
