I have a simple KNN algorithm that is used to predict the "yield" from a piece of data. There are around 27k rows in a pandas DataFrame with 37 different columns. I have been trying to optimize the hyper-parameter (the number of nearest neighbours), but running it with even a single value already takes a very long time. I was wondering what ways I could improve the code below to make it run faster.
I have tried looking at getting rid of some of the for loops, but I have no clue where to start really:
#importing modules
from math import sqrt
import numpy as np
import pandas as pd

train_data = df_KNN[:23498]
test_data = df_KNN[23498:]

true_test = pd.DataFrame(df_KNN)
true_test = true_test.iloc[23498:, -1]
true_test = true_test.to_numpy()

#calculating "distance" between rows
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += ((row1[i] - row2[i])**2)
    return sqrt(distance)

def get_neighbours(train, test_row, num_neighbours):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda dis: dis[1])
    neighbours = list()
    for i in range(num_neighbours):
        neighbours.append(distances[i][0])
    return neighbours

def predict_classification(train, test_row, num_neighbours):
    prediction_list = []
    for row in test_row:
        neighbours = get_neighbours(train, test_row, num_neighbours)
        output_values = [row[-1] for row in neighbours]
        prediction_list.append(output_values)
    prediction = np.mean(prediction_list)
    return prediction

def k_nearest_neighbours(train, test, num_neighbours):
    predictions = list()
    for row in test:
        output = predict_classification(train, row, num_neighbours)
        predictions.append(output)
    return (predictions)

test_pred = k_nearest_neighbours(train_data, test_data, 3)

from sklearn.metrics import r2_score
print(r2_score(true_test, test_pred))
I know I could use other modules but for this purpose I want to implement it from scratch. Cheers!
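Not an answer from the original thread, but since the question is specifically about removing the for loops, below is a minimal NumPy-only sketch of a vectorized version. It assumes, as in the code above, that the last column of df_KNN holds the target and everything before it is a feature; the name knn_predict_vectorized is illustrative, not from the original code.

import numpy as np

def knn_predict_vectorized(train, test, num_neighbours):
    # a hedged sketch: all pairwise squared distances in one broadcast instead of nested loops
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    X_train, y_train = train[:, :-1], train[:, -1]   # assumes the target is the last column
    X_test = test[:, :-1]
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b gives an (n_test, n_train) matrix at once
    d2 = ((X_test ** 2).sum(axis=1)[:, None]
          + (X_train ** 2).sum(axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    np.maximum(d2, 0, out=d2)  # guard against tiny negative values from round-off
    # indices of the k smallest distances for every test row
    idx = np.argpartition(d2, num_neighbours, axis=1)[:, :num_neighbours]
    # regression-style prediction: average the neighbours' targets
    return y_train[idx].mean(axis=1)

# hypothetical usage with the frames defined above:
# test_pred = knn_predict_vectorized(train_data.to_numpy(), test_data.to_numpy(), 3)

A further saving (not shown) is that the distance matrix only needs to be computed once when tuning num_neighbours.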
The dataset can be found here: https://drive.google.com/file/d/1leLNUhD5icJPg3oMv5giw_YHduk40sa8/view?usp=sharing
I found an example on the Fashion-MNIST dataset here: https://colab.research.google.com/github/FreeOfConfines/ExampleNNWithKerasAndTensorflow/blob/master/K_Nearest_Neighbor_Classification_with_Tensorflow_on_Fashion_MNIST_Dataset.ipynb#scrollTo=6UuV2szYMAP9
However, it didn't work due to changes in TensorFlow versions, so I changed the code to:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
However, when running this part of the code:
paramk = 11                                # parameter k of k-nearest neighbors
numTrainImages = np.shape(trLabels)[0]     # so many train images
numTestImages = np.shape(tLabels)[0]       # so many test images
arrayKNNLabels = np.array([])

numErrs = 0
for iTeI in range(0, numTestImages):
    arrayL2Norm = np.array([])  # store distance of a test image from all train images

    tmpTImage = np.copy(tImages[iTeI])
    tmpTImage[tmpTImage > 0] = 1

    for jTrI in range(numTrainImages):
        tmpTrImage = np.copy(trImages[jTrI])
        tmpTrImage[tmpTrImage > 0] = 1
        l2norm = np.sum(((tmpTrImage - tmpTImage)**2)**(0.5))  # distance between two images; 255 is max. pixel value ==> normalization
        if jTrI == 0:
            with tf.Session() as sess:
                print(tf.count_nonzero(tmpTrImage - tmpTImage, axis=[0, 1]).eval())
            print(iTeI, jTrI, l2norm)
        arrayL2Norm = np.append(arrayL2Norm, l2norm)

    sIndex = np.argsort(arrayL2Norm)      # sorting distance and returning indices that achieves sort
    kLabels = trLabels[sIndex[0:paramk]]  # choose first k labels
    (values, counts) = np.unique(kLabels, return_counts=True)  # find unique labels and their counts
    arrayKNNLabels = np.append(arrayKNNLabels, values[np.argmax(counts)])

    if arrayKNNLabels[-1] != tLabels[iTeI]:
        numErrs += 1
        print(numErrs, "/", iTeI)

print("# Classification Errors= ", numErrs, "% accuracy= ", 100.*(numTestImages-numErrs)/numTestImages)
The run took over 20 minutes (I stopped it early). It seemed to be stuck around this part of the code:
        l2norm = np.sum(((tmpTrImage - tmpTImage)**2)**(0.5))  # distance between two images; 255 is max. pixel value ==> normalization
        if jTrI == 0:
            with tf.Session() as sess:
                print(tf.count_nonzero(tmpTrImage - tmpTImage, axis=[0, 1]).eval())
            print(iTeI, jTrI, l2norm)
        arrayL2Norm = np.append(arrayL2Norm, l2norm)
I know this is for the MNIST dataset while mine isn't, but I believe the process should be somewhat similar.
Any guidance on implementing knn on my dataset would be much appreciated.
Thank you.
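Not from the original post, but two things in the snippet above stand out as likely slowdowns: the inner Python loop over every training image, and the tf.Session that is created inside that loop purely to print a debug value. Below is a minimal sketch (assuming the notebook's trImages, tImages, trLabels and tLabels arrays) of collapsing the inner loop into one NumPy expression per test image.

import numpy as np

paramk = 11
# binarize once, outside the loops, and flatten each image to a vector
trBin = (trImages > 0).astype(np.float32).reshape(len(trImages), -1)
tBin = (tImages > 0).astype(np.float32).reshape(len(tImages), -1)

numErrs = 0
for iTeI in range(len(tBin)):
    # sum of absolute pixel differences against every training image at once
    # (this matches the original, since ((x)**2)**0.5 is just |x| element-wise)
    dists = np.abs(trBin - tBin[iTeI]).sum(axis=1)
    kIdx = np.argpartition(dists, paramk)[:paramk]
    values, counts = np.unique(trLabels[kIdx], return_counts=True)
    if values[np.argmax(counts)] != tLabels[iTeI]:
        numErrs += 1

print("# Classification Errors= ", numErrs)

For a tabular dataset like yours the idea is the same: skip the binarization step and compute the distances from one test row to the whole training matrix in a single vectorized expression.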
I am working on a dataset that is very high dimensional and have performed k-means clustering on it. I am trying to find the 20 closest points to each centroid. The dimensions of the dataset (X_emb) are 10 x 2816. Below is the code I used to find the single closest point to each centroid. The commented-out code is a potential solution that I found, but I was not able to make it work correctly.
import numpy as np
import pickle as pkl
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.neighbors import NearestNeighbors
from visualization.make_video_v2 import make_video_from_numpy
from scipy.spatial import cKDTree

n_s_train = 10000

df = pkl.load(open('cluster_data/mixed_finetuning_data.pkl', 'rb'))
N = len(df)
X = []
X_emb = []
for i in range(N):
    play = df.iloc[i]
    if df.iloc[i].label == 1:
        X_emb.append(play['embedding'])
        X.append(play['input'])
X_emb = np.array(X_emb)

kmeans = KMeans(n_clusters=10)
kmeans.fit(X_emb)
results = kmeans.cluster_centers_
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)

# def find_k_closest(centroids, data, k=1, distance_norm=2):
#     kdtree = cKDTree(data, leafsize=30)
#     distances, indices = kdtree.query(centroids, k, p=distance_norm)
#     if k > 1:
#         indices = indices[:,-1]
#     values = data[indices]
#     return indices, values
#
# indices, values = find_k_closest(results, X_emb)
You can use pairwise_distances to calculate the distance from every centroid to every point in X_emb, then use numpy to find the indices of the 20 smallest distances, and finally pull those points out of X_emb:
from sklearn.metrics import pairwise_distances

# `centroids` here is kmeans.cluster_centers_ (called `results` in the code above)
distances = pairwise_distances(centroids, X_emb, metric='euclidean')
ind = [np.argpartition(i, 20)[:20] for i in distances]
closest = [X_emb[indexes] for indexes in ind]
closest will then hold the 20 nearest points per centroid (shape (number of centroids, 20, embedding dimension) if converted to an array).
You can also use the NearestNeighbors class from sklearn this way:
from sklearn.neighbors import NearestNeighbors

def find_k_closest(centroids, data):
    nns = {}
    neighbors = NearestNeighbors(n_neighbors=20).fit(data)
    for i, center in enumerate(centroids):
        # kneighbors expects a 2D array, so reshape the single centroid
        nns[i] = neighbors.kneighbors(center.reshape(1, -1), return_distance=False)[0]
    return nns
The nns dictionary then maps each centroid (keyed by its index, since numpy arrays cannot be dictionary keys) to the indices of its 20 nearest neighbours in data.
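For completeness, a small hypothetical usage sketch with the kmeans and X_emb objects from the question:

nns = find_k_closest(kmeans.cluster_centers_, X_emb)
# turn the neighbour indices into the actual points, 20 per centroid
closest_points = {i: X_emb[idx] for i, idx in nns.items()}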
I am trying to write a function that properly calculates the entropy of a given dataset. However, I am getting very weird entropy values.
My understanding is that entropy values must fall between 0 and 1, yet I am consistently getting values above 2.
Note: I must use log base 2 for this
Can someone explain why I am getting incorrect entropy results?
The dataset I am testing is the ecoli dataset from the UCI Machine Learning Repository
import numpy
import math


#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file, and load it in delimiting on the ',' for a comma separated value file
    data = open(file, 'r')
    data = numpy.loadtxt(data, delimiter=',')

    # Loop through the data in the array
    for index in range(len(data)):
        # Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
        try:
            data[index] = [float(x) for x in data[index]]
        except Exception:
            data[index] = 0
        except ValueError:
            data[index] = 0

    # Return the now type-formatted data
    return data


# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
    csv = numpy.random.shuffle(csv)
    return csv


# Function to split the data into test, training set, and validation sets
def split_data(csv):
    # Call the randomize data function
    randomize_data(csv)

    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)
    testing_split = int(num_rows * 0.18)

    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]

    # Training set as the next 72
    training_set = csv[validation_split:training_split + validation_split]

    # Testing set as the last 18
    testing_set = csv[training_split + validation_split:]

    # Split the data into classes vs actual data
    training_cols = training_set.shape[1]
    testing_cols = testing_set.shape[1]
    validation_cols = validation_set.shape[1]
    training_classes = training_set[:, training_cols - 1]
    testing_classes = testing_set[:, testing_cols - 1]
    validation_classes = validation_set[:, validation_cols - 1]

    # Take the sets and remove the last (classification) column
    training_set = training_set[:-1]
    testing_set = testing_set[:-1]
    validation_set = validation_set[:-1]

    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################


# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)

    # Collect # of total rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]

    # Create a numpy array of just the classes
    classes = dataset[:, num_columns - 1]

    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)

    # Create an empty array for the class weights
    class_weights = []

    # Loop through the classes one by one
    for aclass in classes:
        # Create storage variables
        total = 0
        weight = 0

        # Now loop through the dataset
        for row in dataset:
            # If the class of the dataset is equal to the current class you are evaluating, increase the total
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
            # If not, continue
            else:
                continue

        # Divide the # of occurences by total rows
        weight = float((total / num_total_rows))

        # Add that weight to the list of class weights
        class_weights.append(weight)

    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)

    # Return the array
    return classes, class_weights


# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0

    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)

    # Utilize numpy's quicksort to test the most occurring class first
    numpy.sort(class_freq)

    # Determine the max entropy for the dataset
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)

    # Loop through the frequencies and use given formula to calculate entropy
    # For...Each simulates the sequence operator
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))

    # Return the entropy value
    return entropy


def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)

    entropy = get_entropy(ecol)
    print(entropy)


main()
The following function was used to calculate Entropy:
# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    index = 0
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    index = index - 1

    for item in dataset:
        if ((item[index]) in freq):
            # Increase the index
            freq[item[index]] += 1.0
        else:
            # Initialize it by setting it to 0
            freq[item[index]] = 1.0

    for freq in freq.values():
        entropy = entropy + (-freq / len(dataset)) * math.log(freq / len(dataset), 2)
    return entropy
As @MattTimmermans indicated, the value of entropy is contingent on the number of classes. For exactly 2 classes it falls in the 0 to 1 (inclusive) range. For more than 2 classes (which is what was being tested), entropy is bounded by log2 of the number of classes instead, and is calculated with the formula converted to Pythonic code above. This post here explains those mathematics and calculations a bit more in detail.
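A quick numeric check (not from the original answer) of why values above 1 are expected: with k equally likely classes the entropy is log2(k), so any dataset with more than 4 classes can legitimately exceed 2 bits.

import numpy as np

# entropy of a uniform distribution over k classes is log2(k)
for k in (2, 4, 8):
    p = np.full(k, 1.0 / k)
    print(k, -(p * np.log2(p)).sum())   # prints 1.0, 2.0, 3.0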
I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np

def information_gain(X, y):

    def _entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts, base=None)

    def _ig(x, y):
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])

        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))

    entropy_full = _entropy(y)
    f_size = float(X.shape[0])
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)

t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time() - t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time() - t0))

for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering whether my implementation is wrong, or whether it is correct but scikit-learn uses a different variation of the mutual information algorithm.
A little late with my answer but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first, in the section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233:
def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
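To make explicit what that function computes, here is a small hypothetical sanity check (not from Orange's code; it assumes the _entropy definition above and numpy imported as np): a 1-D class-count vector gives the plain Shannon entropy, while a classes-by-values contingency matrix gives the column-wise entropies weighted by the column totals.

# two equally frequent classes -> 1 bit
print(_entropy(np.array([5.0, 5.0])))            # 1.0
# 2x2 contingency matrix (rows: classes, columns: attribute values)
print(_entropy(np.array([[4.0, 1.0],
                         [1.0, 4.0]])))          # about 0.722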
Then the second portion, at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305:
class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's
    value distribution. The score was introduced in [Quinlan1986]_
    to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
I'm trying to implement my own kNN classifier. I've managed to implement something, but it's incredibly slow...
def euclidean_distance(X_train, X_test):
    """
    Create list of all euclidean distances between the given
    feature vector and all other feature vectors in the training set
    """
    return [np.linalg.norm(X - X_test) for X in X_train]

def k_nearest(X, Y, k):
    """
    Get the indices of the nearest feature vectors and return a
    list of their classes
    """
    idx = np.argpartition(X, k)
    return np.take(Y, idx[:k])

def predict(X_test):
    """
    For each feature vector get its predicted class
    """
    distance_list = [euclidean_distance(X_train, X) for X in X_test]
    return np.array([Counter(k_nearest(distances, Y_train, k)).most_common()[0][0] for distances in distance_list])
where (for example)
X = [[ 1.96701284 6.05526865]
[ 1.43021202 9.17058291]]
Y = [ 1. 0.]
Obviously it would be much faster if I didn't use any for loops, but I don't know how to make it work without them. Is there a way I can do this without using for loops / list comprehensions?
Here's a vectorized approach -
from scipy.spatial.distance import cdist
from scipy.stats import mode

dists = cdist(X_train, X)
idx = np.argpartition(dists, k, axis=0)[:k]
nearest_dists = np.take(Y_train, idx)
out = mode(nearest_dists, axis=0)[0]
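Not part of the original answer, but here is a small self-contained sketch of the same approach with made-up data, so the shapes are easy to follow (X plays the role of the test set, as in the question):

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import mode

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
Y_train = np.array([0, 0, 1, 1])
X = np.array([[0.05, 0.1], [5.1, 5.0]])        # two test points
k = 3

dists = cdist(X_train, X)                      # shape (n_train, n_test)
idx = np.argpartition(dists, k, axis=0)[:k]    # k nearest training indices per test column
nearest_labels = np.take(Y_train, idx)
pred = mode(nearest_labels, axis=0)[0].ravel() # majority vote per test point
print(pred)                                    # expected: [0 1]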