How do I implement knn on this clothing dataset? - python

The dataset can be found here: https://drive.google.com/file/d/1leLNUhD5icJPg3oMv5giw_YHduk40sa8/view?usp=sharing
I found an example on the mnist fashion dataset here: https://colab.research.google.com/github/FreeOfConfines/ExampleNNWithKerasAndTensorflow/blob/master/K_Nearest_Neighbor_Classification_with_Tensorflow_on_Fashion_MNIST_Dataset.ipynb#scrollTo=6UuV2szYMAP9
However, it couldn't work due to change in tensorflow versions. As such, I changed the code to
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
However, when running this part of the code:
paramk = 11 # parameter k of k-nearest neighbors
numTrainImages = np.shape(trLabels)[0] # so many train images
numTestImages = np.shape(tLabels)[0] # so many test images
arrayKNNLabels = np.array([])
numErrs = 0
for iTeI in range(0,numTestImages):
arrayL2Norm = np.array([]) # store distance of a test image from all train images
tmpTImage = np.copy(tImages[iTeI])
tmpTImage[tmpTImage > 0] = 1
for jTrI in range(numTrainImages):
tmpTrImage = np.copy(trImages[jTrI])
tmpTrImage[tmpTrImage>0] = 1
l2norm = np.sum(((tmpTrImage-tmpTImage)**2)**(0.5)) # distance between two images; 255 is max. pixel value ==> normalization
if jTrI == 0:
with tf.Session() as sess:
print(tf.count_nonzero(tmpTrImage-tmpTImage, axis=[0,1]).eval())
print(iTeI, jTrI, l2norm)
arrayL2Norm = np.append(arrayL2Norm, l2norm)
sIndex = np.argsort(arrayL2Norm) # sorting distance and returning indices that achieves sort
kLabels = trLabels[sIndex[0:paramk]] # choose first k labels
(values, counts) = np.unique(kLabels, return_counts=True) # find unique labels and their counts
arrayKNNLabels = np.append(arrayKNNLabels, values[np.argmax(counts)])
if arrayKNNLabels[-1] != tLabels[iTeI]:
numErrs += 1
print(numErrs,"/",iTeI)
print("# Classification Errors= ", numErrs, "% accuracy= ", 100.*(numTestImages-numErrs)/numTestImages)
Runtime took over 20 mins (I stopped it earlier). It was stuck around this part of the code:
l2norm = np.sum(((tmpTrImage-tmpTImage)**2)**(0.5)) # distance between two images; 255 is max. pixel value ==> normalization
if jTrI == 0:
with tf.Session() as sess:
print(tf.count_nonzero(tmpTrImage-tmpTImage, axis=[0,1]).eval())
print(iTeI, jTrI, l2norm)
arrayL2Norm = np.append(arrayL2Norm, l2norm)
I know this is for a mnist dataset while mine isn't. But I believe the process should be somewhat similar.
Any guidance on implementing knn on my dataset would be much appreciated.
Thank you.

Related

tf.data.datasets set each batch (prefetch)

I am looking for help thinking through this.
I have a function (that is not a generator) that will give me any number of samples.
Let's say that getting all the data I want to train (1000 samples) can't fit into memory.
So I want to call this function 10 times to get smaller number of samples that fit into memory.
This is a dummy example for simplicity.
def get_samples(num_samples: int, random_seed=0):
np.random.seed(random_seed)
x = np.random.randint(0,100, num_samples)
y = np.random.randint(0,2, num_samples)
return np.array(list(zip(x,y))
Again lets say get_samples(1000,0) won't fit into memory.
So in theory I am looking for something like this:
batch_size = 100
total_num_samples = 1000
batches = []
for i in range(total_num_samples//batch_size):
batches.append(get_samples(batch_size, i))
But this still loads everything into memory.
Again this function is a dummy representation and the real one is already defined and not a generator.
In tf land. I was hoping that:
tf.data.Dataset.batch[0] would equal to the output of get_data(100,0)
tf.data.Dataset.batch[1] would equal to the output of get_data(100,1)
tf.data.Dataset.batch[2] would equal to the output of get_data(100,2)
...
tf.data.Dataset.batch[9] would equal to the output of get_data(100,9)
I understand that I can use tf.data.Datasets with a generator (and I think you can set a generator per batch). But the function I have gives more than a single sample. The set up is too expensive to set it up for a every single sample.
I was wanting to use tf.data.Dataset.prefetch() to run the get_batch function on every batch. And of course, it would call the get_batch with the same parameters on every epoch.
Sorry if the explaination is convoluted. Trying my best to describe the problem.
Anyone have any ideas?
This what I came up with:
def simple_static_synthesizer(batch_size, seed=1, verbose=True):
if verbose:
print(f"Creating Synthetic Data with seed {seed}")
rng = np.random.default_rng(seed)
all_x = []
all_y = []
for i in range(batch_size):
x = np.array(np.concatenate((rng.integers(0,100, 1, dtype=int), rng.integers(0,100, 1, dtype=int), rng.integers(0,100, 1, dtype=int))))
y = np.array(rng.integers(0,2,1, dtype=int))
all_x.append(x)
all_y.append(y)
return all_x, all_y
def my_generator(total_size, batch_size, seed=0, verbose=True):
counter = 0
for i in range(total_size):
# Regenerate for each batch
if counter%batch_size == 0: # Regen data for every batch
x,y = simple_static_synthesizer(batch_size,seed,verbose)
seed += 1
yield x[i%batch_size],y[i%batch_size]
counter += 1
my_gen = my_generator(10,2,seed=1)
# See values
for x,y in my_gen:
print(x,y)
# Call again, this give same answer as above
my_gen = my_generator(10,2,seed=1)
for x,y in my_gen:
print(x,y)
# Dataset with small batches to see if it is doing it correctly
total_samples = 10
batch_size = 2
seed = 5
dataset = tf.data.Dataset.from_generator(
my_generator,
args=[total_samples,batch_size,seed],
output_signature=(
tf.TensorSpec(shape=(3,), dtype=tf.uint8),
tf.TensorSpec(shape=(1,), dtype=tf.uint8),
)
)
for i,(x,y) in enumerate(dataset):
print(x.numpy(),y.numpy())
if i == 4:
break # shows first 3 syn calls
Wish we could have notebook answers!

How To Speed Up KNN Algorithm

I have a simple KNN algorithm that is used to predict the "yield" from a piece of data. There are around 27k rows in a pandas dataframe with 37 different columns. I have been trying to optimize hyper-parameters (the number of nearest neighbours) but running it with one parameter has already taken so long. I was wondering what ways could I improve the code below to make it run faster?
I have tried looking at possibly getting rid of the number of for loops but have no clue where to start really:
#importing modules
from math import sqrt
train_data = df_KNN[:23498]
test_data = df_KNN[23498:]
true_test = pd.DataFrame(df_KNN)
true_test = true_test.iloc[23498:, -1]
true_test = true_test.to_numpy()
#calculating "distance" between rows
def euclidean_distance(row1, row2):
distance = 0.0
for i in range(len(row1)-1):
distance += ((row1[i] - row2[i])**2)
return sqrt(distance)
def get_neighbours(train, test_row, num_neighbours):
distances = list()
for train_row in train:
dist = euclidean_distance(test_row, train_row)
distances.append((train_row, dist))
distances.sort(key=lambda dis: dis[1])
neighbours = list()
for i in range(num_neighbours):
neighbours.append(distances[i][0])
return neighbours
def predict_classification(train, test_row, num_neighbours):
prediction_list = []
for row in test_row:
neighbours = get_neighbours(train, test_row, num_neighbours)
output_values = [row[-1] for row in neighbours]
prediction_list.append(output_values)
prediction = np.mean(prediction_list)
return prediction
def k_nearest_neighbours(train, test, num_neighbours):
predictions = list()
for row in test:
output = predict_classification(train, row, num_neighbours)
predictions.append(output)
return (predictions)
test_pred = k_nearest_neighbours(train_data, test_data, 3)
from sklearn.metrics import r2_score
print(r2_score(true_test, test_pred))
I know I could use other modules but for this purpose I want to implement it from scratch. Cheers!

Odd Results on Entropy Calculation

I am trying to write a function that properly calculates the entropy of a given dataset. However, I am getting very weird entropy values.
I am following the understanding that all entropy calculations must fall between 0 and 1, yet I am consistently getting values above 2.
Note: I must use log base 2 for this
Can someone explain why am I yielding incorrect entropy results?
The dataset I am testing is the ecoli dataset from the UCI Machine Learning Repository
import numpy
import math
#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
# Open the file, and load it in delimiting on the ',' for a comma separated value file
data = open(file, 'r')
data = numpy.loadtxt(data, delimiter=',')
# Loop through the data in the array
for index in range(len(data)):
# Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
try:
data[index] = [float(x) for x in data[index]]
except Exception:
data[index] = 0
except ValueError:
data[index] = 0
# Return the now type-formatted data
return data
# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
csv = numpy.random.shuffle(csv)
return csv
# Function to split the data into test, training set, and validation sets
def split_data(csv):
# Call the randomize data function
randomize_data(csv)
# Grab the number of rows and calculate where to split
num_rows = csv.shape[0]
validation_split = int(num_rows * 0.10)
training_split = int(num_rows * 0.72)
testing_split = int(num_rows * 0.18)
# Validation set as the first 10% of the data
validation_set = csv[:validation_split]
# Training set as the next 72
training_set = csv[validation_split:training_split + validation_split]
# Testing set as the last 18
testing_set = csv[training_split + validation_split:]
# Split the data into classes vs actual data
training_cols = training_set.shape[1]
testing_cols = testing_set.shape[1]
validation_cols = validation_set.shape[1]
training_classes = training_set[:, training_cols - 1]
testing_classes = testing_set[:, testing_cols - 1]
validation_classes = validation_set[:, validation_cols - 1]
# Take the sets and remove the last (classification) column
training_set = training_set[:-1]
testing_set = testing_set[:-1]
validation_set = validation_set[:-1]
# Return the datasets
return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################
# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
# Ensure the dataset is a numpy array
dataset = numpy.asarray(dataset)
# Collect # of total rows and columns, using numpy
num_total_rows = dataset.shape[0]
num_columns = dataset.shape[1]
# Create a numpy array of just the classes
classes = dataset[:, num_columns - 1]
# Use numpy.unique to remove duplicates
classes = numpy.unique(classes)
# Create an empty array for the class weights
class_weights = []
# Loop through the classes one by one
for aclass in classes:
# Create storage variables
total = 0
weight = 0
# Now loop through the dataset
for row in dataset:
# If the class of the dataset is equal to the current class you are evaluating, increase the total
if numpy.array_equal(aclass, row[-1]):
total = total + 1
# If not, continue
else:
continue
# Divide the # of occurences by total rows
weight = float((total / num_total_rows))
# Add that weight to the list of class weights
class_weights.append(weight)
# Turn the weights into a numpy array
class_weights = numpy.asarray(class_weights)
# Return the array
return classes, class_weights
# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
# Set initial entropy
entropy = 0.0
# Determine the classes and their frequencies (weights) of the dataset
classes, class_freq = class_distribution(dataset)
# Utilize numpy's quicksort to test the most occurring class first
numpy.sort(class_freq)
# Determine the max entropy for the dataset
max_entropy = math.log(len(classes), 2)
print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)
# Loop through the frequencies and use given formula to calculate entropy
# For...Each simulates the sequence operator
for freq in class_freq:
entropy += float(-freq * math.log(freq, 2))
# Return the entropy value
return entropy
def main():
ecol = csv_to_array('ecoli.csv')
testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)
entropy = get_entropy(ecol)
print(entropy)
main()
The following function was used to calculate Entropy:
# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
freq = {}
entropy = 0.0
index = 0
for item in attributes:
if (targetAttr == item):
break
else:
index = index + 1
index = index - 1
for item in dataset:
if ((item[index]) in freq):
# Increase the index
freq[item[index]] += 1.0
else:
# Initialize it by setting it to 0
freq[item[index]] = 1.0
for freq in freq.values():
entropy = entropy + (-freq / len(dataset)) * math.log(freq / len(dataset), 2)
return entropy
As #MattTimmermans had indicated, entropy's value is actually contingent on the number of classes. For strictly 2 classes, it is contained in the 0 to 1 (inclusive) range. However, for more than 2 classes (which is what was being tested), entropy is calculated with a different formula (converted to Pythonic code above). This post here explains those mathematics and calculations a bit more in detail.

Pixel-wise loss weight for image segmentation in Keras

I am currently using a modified version of the U-Net (https://arxiv.org/pdf/1505.04597.pdf) to segment cell organelles in microscopy images. Since I am using Keras, I took the code from https://github.com/zhixuhao/unet. However, in this version no weight map is implemented to force the network to learn the border pixels.
The results that I have obtained so far are quite good, but the network fails to separate objects that are close to each other. So I want to try and make use of the weight map mentioned in the paper. I have been able to generate the weight map (based on the given formula) for each label image, but I was unable to find out how to use this weight map to train my network and thus solve the above mentioned problem.
Do weight maps and label images have to be combined somehow or is there a Keras function that will allow me to make use of the weight maps? I am Biologist, who only recently started to work with neural networks, so my understanding is still limited. Any help or advice would be greatly appreciated.
In case it is still relevant: I needed to solve this recently. You can paste the code below into a Jupyter notebook to see how it works.
%matplotlib inline
import numpy as np
from skimage.io import imshow
from skimage.measure import label
from scipy.ndimage.morphology import distance_transform_edt
import numpy as np
def generate_random_circles(n = 100, d = 256):
circles = np.random.randint(0, d, (n, 3))
x = np.zeros((d, d), dtype=int)
f = lambda x, y: ((x - x0)**2 + (y - y0)**2) <= (r/d*10)**2
for x0, y0, r in circles:
x += np.fromfunction(f, x.shape)
x = np.clip(x, 0, 1)
return x
def unet_weight_map(y, wc=None, w0 = 10, sigma = 5):
"""
Generate weight maps as specified in the U-Net paper
for boolean mask.
"U-Net: Convolutional Networks for Biomedical Image Segmentation"
https://arxiv.org/pdf/1505.04597.pdf
Parameters
----------
mask: Numpy array
2D array of shape (image_height, image_width) representing binary mask
of objects.
wc: dict
Dictionary of weight classes.
w0: int
Border weight parameter.
sigma: int
Border width parameter.
Returns
-------
Numpy array
Training weights. A 2D array of shape (image_height, image_width).
"""
labels = label(y)
no_labels = labels == 0
label_ids = sorted(np.unique(labels))[1:]
if len(label_ids) > 1:
distances = np.zeros((y.shape[0], y.shape[1], len(label_ids)))
for i, label_id in enumerate(label_ids):
distances[:,:,i] = distance_transform_edt(labels != label_id)
distances = np.sort(distances, axis=2)
d1 = distances[:,:,0]
d2 = distances[:,:,1]
w = w0 * np.exp(-1/2*((d1 + d2) / sigma)**2) * no_labels
else:
w = np.zeros_like(y)
if wc:
class_weights = np.zeros_like(y)
for k, v in wc.items():
class_weights[y == k] = v
w = w + class_weights
return w
y = generate_random_circles()
wc = {
0: 1, # background
1: 5 # objects
}
w = unet_weight_map(y, wc)
imshow(w)
I think you want to use class_weight in Keras. This is actually simple to introduce in your model if you have already calculated the class weights.
Create a dictionary with your class labels and their associated weights. For example
class_weight = {0: 10.9,
1: 20.8,
2: 1.0,
3: 50.5}
Or create a 1D Numpy array of the same length as your number of classes. For example
class_weight = [10.9, 20.8, 1.0, 50.5]
Pass this parameter during training in your model.fit or model.fit_generator
model.fit(x, y, batch_size=batch_size, epochs=num_epochs, verbose=1, class_weight=class_weight)
You can look up the Keras documentation for more details here.

Python Information gain implementation

I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished, using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np
def information_gain(X, y):
def _entropy(labels):
counts = np.bincount(labels)
return entropy(counts, base=None)
def _ig(x, y):
# indices where x is set/not set
x_set = np.nonzero(x)[1]
x_not_set = np.delete(np.arange(x.shape[1]), x_set)
h_x_set = _entropy(y[x_set])
h_x_not_set = _entropy(y[x_not_set])
return entropy_full - (((len(x_set) / f_size) * h_x_set)
+ ((len(x_not_set) / f_size) * h_x_not_set))
entropy_full = _entropy(y)
f_size = float(X.shape[0])
scores = np.array([_ig(x, y) for x in X.T])
return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
max_features=100,
stop_words='english')
X_vec = cv.fit_transform(X)
t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time()-t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time()-t0))
for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering if my implementation is wrong, or it is correct, but a different variation of the mutual information algorithm scikit-learn uses.
A little late with my answer but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation first
The sections starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233
def _entropy(dist):
"""Entropy of class-distribution matrix"""
p = dist / np.sum(dist, axis=0)
pc = np.clip(p, 1e-15, 1)
return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
Then the second portion.
https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305
class GainRatio(ClassificationScorer):
"""
Information gain ratio is the ratio between information gain and
the entropy of the feature's
value distribution. The score was introduced in [Quinlan1986]_
to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
<http://en.wikipedia.org/wiki/Information_gain_ratio>`_.
.. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
"""
def from_contingency(self, cont, nan_adjustment):
h_class = _entropy(np.sum(cont, axis=1))
h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
h_attribute = _entropy(np.sum(cont, axis=0))
if h_attribute == 0:
h_attribute = 1
return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218

Categories