Note:
This is for a homework assignment in my data mining class.
I'm going to put relevant code snippets on this SO post, but you can find my entire program at http://pastebin.com/CzNFbLJ2
The dataset I'm using for this program can be found at http://archive.ics.uci.edu/ml/datasets/Iris
So I'm getting: RuntimeWarning: invalid value encountered in sqrt
return np.sqrt(m)
I am attempting to find the average Mahalanobis distance over the iris dataset, for both the raw and the normalized data. The warning only appears for the normalized version of the dataset, which makes me wonder whether I have misunderstood what normalization means (both in code and mathematically).
I thought that normalization means dividing each component of a vector by its vector length (so that the components add up to 1). I found this SO question, How to normalize a 2-dimensional numpy array in python less verbose?, and thought it matched my concept of normalization. But now my code reports that the Mahalanobis distance over the normalized dataset is NaN.
def mahalanobis(data):
    import numpy as np
    import scipy.spatial.distance

    avg = 0
    count = 0
    # covariance of the features (rowvar=0 treats each row as an observation)
    covar = np.cov(data, rowvar=0)
    invcovar = np.linalg.inv(covar)
    # average the pairwise Mahalanobis distance over all distinct pairs (i, j) with i < j
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if j == len(data):
                break
            avg += scipy.spatial.distance.mahalanobis(data[i], data[j], invcovar)
            count += 1
    return avg / count
def normalize(data):
    import numpy as np
    # divide each row by its sum, so every row of the result adds up to 1
    row_sums = data.sum(axis=1)
    norm_data = np.zeros((50, 4))
    for i, (row, row_sum) in enumerate(zip(data, row_sums)):
        norm_data[i, :] = row / row_sum
    return norm_data
Probably too late, but check out pages 64-65 in our textbook "Introduction to Data Mining". There's a section called "Normalization or Standardization" that explains the concept of normalized data that Hearne is looking for.
Basically, the standardized data set is x' = (x - mean(x)) / standardDeviation(x)
Since I see you're using python, here's how to do it using SciPy:
normalizedData = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Source: http://mail.scipy.org/pipermail/numpy-discussion/2011-April/056023.html
You can use pdist() to do the distance calculation without a for loop:
from sklearn import datasets
iris = datasets.load_iris()
from scipy.spatial.distance import pdist, squareform
print squareform(pdist(iris.data, 'mahalanobis'))
Normalization in this context probably means subtracting the mean and scaling so the data has a unit covariance matrix (i.e. standardization). Note that dividing each row by its row sum, as your normalize() does, makes the four features linearly dependent (every row adds up to 1), so the covariance matrix of that data is numerically singular; its computed inverse is then ill-conditioned, and the quadratic form under the square root can come out negative, which is the likely source of the NaN.
However, to scale every vector in your dataset to unit norm, use: norm_data = data / np.sqrt(np.sum(data*data, 1))[:, None].
You need to divide by the L2 norm of each vector: square each element, sum the squares, and take the square root of that sum. Broadcasting lets you avoid an explicit loop (see the answer to the question you cited: https://stackoverflow.com/a/8904762/1149913).
Related
I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation, and the code given is:
from numpy import array, tile   # numpy functions used below
import operator

def classify(inputVector, trainingData, labels, k):
    dataSetSize = trainingData.shape[0]
    # repeat the input vector dataSetSize times so it can be subtracted from every training row
    diffMat = tile(inputVector, (dataSetSize, 1)) - trainingData
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}   # vote count per label (needed before the loop)
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
My question is: how does sqDistances**0.5 amount to the Euclidean distance equation ((A[0]-B[0])^2 + (A[1]-B[1])^2)^(1/2)? I also don't follow how tile comes into it, specifically how tile(inputVector, (dataSetSize, 1)) - trainingData builds the difference matrix.
I hope the following explains how it works.
Numpy tile : https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
Using tile, you create a matrix from the input vector with the same shape as the training data. Subtracting the training data from this matrix gives the element-wise differences, e.g. test[0] - train[0] for each feature of each training row.
Then diffMat**2 squares each element, and the sum along axis=1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) adds them up per row, which gives expressions like (test[0] - train[0])^2 + (test[1] - train[1])^2.
Finally, sqDistances**0.5 takes the square root, which gives the Euclidean distance.
To calculate Euclidean distance, this might be helpful
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
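For instance, here is a small sketch (using the toy data from createDataSet and a made-up query point) showing that the tile-based computation agrees with scipy.spatial.distance.euclidean:

from numpy import array, tile
from scipy.spatial.distance import euclidean

group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
inputVector = array([0.9, 1.0])   # arbitrary query point, just for illustration

# tile-based computation, exactly as in classify()
diffMat = tile(inputVector, (group.shape[0], 1)) - group
distances = (diffMat**2).sum(axis=1)**0.5
print distances

# the same distances computed one row at a time with scipy
print [euclidean(inputVector, row) for row in group]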
I am new to Python. I would like to perform hierarchical clustering on an N by P dataset that contains some missing values. I am planning to use the scipy.cluster.hierarchy.linkage function, which takes a distance matrix in condensed form. Does Python have a method to compute a distance matrix for data with missing values? (In R the dist function automatically takes care of missing values, but scipy.spatial.distance.pdist does not seem to handle them.)
I could not find a method to compute a distance matrix for data with missing values, so here is my naive solution based on Euclidean distance.
import numpy as np

def getMissDist(x, y):
    # mean of the squared differences over the features where both values are present
    return np.nanmean((x - y)**2)

def getMissDistMat(dat):
    Npat = dat.shape[0]
    dist = np.ndarray(shape=(Npat, Npat))
    dist.fill(0)
    for ix in range(0, Npat):
        x = dat[ix, ]
        if ix > 0:
            for iy in range(0, ix):
                y = dat[iy, ]
                dist[ix, iy] = getMissDist(x, y)
                dist[iy, ix] = dist[ix, iy]
    return dist
Then, assuming that dat is an N (number of cases) by P (number of features) data matrix with missing values, one can perform hierarchical clustering on dat as:

import scipy.spatial.distance as dist    # alias implied by the dist.squareform call
import scipy.cluster.hierarchy as hier   # alias implied by the hier.linkage call

distMat = getMissDistMat(dat)
condensDist = dist.squareform(distMat)
link = hier.linkage(condensDist, method='average')
What function can I use in Python if I want to sample a truncated integer power law?
That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.
I've been searching around numpy.random, but I haven't found this distribution.
AFAIK, neither NumPy nor SciPy defines this distribution for you. However, with SciPy it is easy to define your own discrete distribution using scipy.stats.rv_discrete:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 10
d = truncated_power_law(a=a, m=m)

N = 10**4
sample = d.rvs(size=N)

plt.hist(sample, bins=np.arange(m)+0.5)
plt.show()
I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.
Setup:
Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.
Usage:
Generate your random values by generating a Uniform(0,1) and searching through cdf[] until you find an entry greater than your uniform. Return the index + 1 as your x-value.
Repeat for as many x-values as you want.
(A Python sketch of this procedure follows the worked example below.)
For instance, with a,m = 2,10, I calculate the probabilities directly as:
[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]
and the CDF is:
[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]
When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.
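One possible literal translation of the algorithm into Python (the function names are only illustrative):

import random

def build_cdf(a, m):
    # cdf[i] = cdf[i-1] + 1/(i+1)**a, then divide every entry by cdf[m-1]
    cdf = []
    total = 0.0
    for i in range(m):
        total += 1.0 / (i + 1)**a
        cdf.append(total)
    return [c / total for c in cdf]

def sample_x(cdf):
    u = random.random()
    # linear search: return index + 1 of the first CDF entry greater than the uniform
    for i, c in enumerate(cdf):
        if c > u:
            return i + 1

# e.g. cdf = build_cdf(2, 10); samples = [sample_x(cdf) for _ in range(10000)]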
If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.
Use numpy.random.zipf and just reject any samples greater than or equal to m
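For example, a minimal sketch of that rejection approach (following the question's [1, m) range, so anything >= m is rejected; the function name is just for illustration):

import numpy as np

def sample_zipf_truncated(a, m, size):
    # draw from the untruncated Zipf distribution and keep only samples below m
    out = []
    while len(out) < size:
        draws = np.random.zipf(a, size)
        out.extend(draws[draws < m].tolist())
    return np.array(out[:size])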
I have the following setup for a k-means clustering algorithm that I am implementing for a project:
import numpy as np
import scipy
import sys
import random
import matplotlib.pyplot as plt
import operator
class KMeansClass:
    #takes in an npArray-like object
    def __init__(self, dataset, k):
        self.dataset = np.array(dataset)
        #initialize mins to maximum possible value
        self.min_x = sys.maxint
        self.min_y = sys.maxint
        #initialize maxs to minimum possible value
        self.max_x = -(sys.maxint)-1
        self.max_y = -(sys.maxint)-1
        self.k = k
        #a is the coefficient matrix that is continually updated as the centroids of the clusters change.
        #It is an mxk matrix where each row corresponds to a training instance and each column to a cluster centroid.
        #Values are either 0 or 1: the value for a particular training instance (data point) is 1 only for the
        #centroid to which that instance has the least distance; otherwise it is 0.
        self.a = np.zeros(shape=[self.dataset.shape[0], self.k])
        self.distanceMatrix = np.empty(shape=[self.dataset.shape[0], self.k])
        #initialize mu to an empty array of the requisite shape for now; it is filled in initializeCentroids.
        self.mu = np.empty(shape=[k, 2])
        self.findMinMaxdataPoints()
        self.initializeCentroids()
        self.createDistanceMatrix()
        self.scatterPlotOfInitializedPoints()

    #pointa and pointb are npArray-like vectors.
    def euclideanDistance(self, pointa, pointb):
        return np.sqrt(np.sum((pointa - pointb)**2))
""" Problem Initialization And Visualization Helper methods"""
##############################################################################
##param: dataset : list of tuples [(x1,y1),(x2,y2),...(xm,ym)]
def findMinMaxdataPoints(self):
for item in self.dataset:
self.min_x = min(self.min_x,item[0])
self.min_y = min(self.min_y,item[1])
self.max_x = max(self.max_x,item[0])
self.max_y = max(self.max_y,item[1])
def initializeCentroids(self):
for i in range(self.k):
#each value of mu is a tuple with a random number between (min_x - max_x) and (min_y - max_y)
self.mu[i] = (random.randint(self.min_x,self.max_x),random.randint(self.min_y,self.max_y))
self.sortCentroids()
print self.mu
def sortCentroids(self):
#the following 3 lines of code are to ensure that the mu values are always sorted in ascending order first with respect to the
#x values and then with respect to the y values.
half_sorted = sorted(self.mu,key=operator.itemgetter(1)) #sort wrt y values
full_sorted = sorted(half_sorted,key=operator.itemgetter(0)) #sort the y-sorted array wrt x-values
self.mu = np.array(full_sorted)
def scatterPlotOfInitializedPoints(self):
plt.scatter([item[0] for item in self.dataset],[item[1] for item in self.dataset],color='b')
plt.scatter([item[0] for item in self.mu],[item[1] for item in self.mu],color='r')
plt.show()
###############################################################################
    #minimizing the euclidean distance is the same as minimizing the square of the euclidean distance.
    def calcSquareEuclideanDistanceBetweenTwoPoints(self, point_a, point_b):
        return np.sum((point_a - point_b)**2)

    def createDistanceMatrix(self):
        for i in range(self.dataset.shape[0]):
            for j in range(self.k):
                self.distanceMatrix[i, j] = self.calcSquareEuclideanDistanceBetweenTwoPoints(self.dataset[i], self.mu[j])

    def createCoefficientMatrix(self):
        for i in range(self.dataset.shape[0]):
            self.a[i, self.distanceMatrix[i].argmin()] = 1
    #update functions for the coefficient matrix and the centroid values:
    def updateCoefficientMatrix(self):
        for i in range(self.dataset.shape[0]):
            self.a[i, self.distanceMatrix[i].argmin()] = 1

    def updateCentroids(self):
        for j in range(self.k):
            non_zero_indices = np.nonzero(self.a[:, j])
            avg = 0
            for i in range(len(non_zero_indices[0])):
                avg += self.a[non_zero_indices[0][i], j]
            self.mu[j] = avg / len(non_zero_indices[0])
    ############################################################

    def lossFunction(self):
        loss = 0
        for j in range(self.k):
            #vectorized implementation of the per-cluster loss
            loss += np.sum(np.dot(self.a[:, j], self.distanceMatrix[:, j]))
        return loss
My question here pertains to lossFunction and how to use it with the scipy.optimize package. I would like to minimize the loss function iteratively by performing the following steps:
Repeat until convergence:
a> Optimize 'a' while keeping mu constant (I have an updateCoefficientMatrix method for updating the 'a' matrix, which is an mxk matrix where we have m training instances and k clusters.)
b> Optimize 'mu' while keeping 'a' constant (I have an updateCentroids method to do this, where mu holds the k cluster centroids.)
But I am very new to the scipy.optimize package, so I am writing to ask for help: how do I invoke scipy.optimize to achieve the optimization goal stated above?
Basically I have two mxk matrices, and I would like to minimize lossFunction() by first optimizing one matrix while keeping the other constant, and in the succeeding step optimizing the second matrix while keeping the first constant. This can be considered a special case of the expectation-maximization problem, but unfortunately I haven't quite understood what the documentation is trying to say, hence I thought I'd turn to SO for help.
Thanks in advance!
And this is part of a class assignment, so please do not post code! Any guidance or explanation would be highly appreciated.
Use scipy.optimize.minimize twice with different objective functions.
First run the optimization with an objective function that takes a as its parameter and returns the objective value.
As the second step, run scipy.optimize.minimize a second time on a second objective function that takes mu as its parameter.
When writing the objective functions, remember that Python has nested functions (closures), which avoids the need to pass mu (in the first case) or a (in the second case) as additional arguments; although that can also be done with minimize(..., args=[mu]) and minimize(..., args=[a]).
Repeat the two-step process in a loop until your convergence condition is satisfied.
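For illustration only (deliberately not the k-means code from the question, since no assignment code was requested), a generic sketch of that alternating pattern with a made-up objective f:

import numpy as np
from scipy.optimize import minimize

def f(a, mu):
    # placeholder objective; in practice this is whatever loss you want to minimize
    return np.sum((a - 1.0)**2) + np.sum((mu + 2.0)**2) + np.sum(a * mu)

a = np.zeros(3)
mu = np.zeros(3)
for _ in range(20):   # in a real implementation, loop until your convergence condition holds
    # step a: optimize a with mu held fixed (the lambda closes over the current mu)
    a = minimize(lambda a_var: f(a_var, mu), a).x
    # step b: optimize mu with a held fixed
    mu = minimize(lambda mu_var: f(a, mu_var), mu).x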
I am trying to calculate the PCA of a matrix.
Sometimes the resulting eigenvalues/eigenvectors are complex, so when I try to project a point onto a lower-dimensional plane by multiplying the eigenvector matrix with the point's coordinates, I get the following warning:
ComplexWarning: Casting complex values to real discards the imaginary part
It comes from this line of code: np.dot(self.u[0:components,:], vector)
The whole code I used to calculate the PCA:
import numpy as np
import numpy.linalg as la

class PCA:
    def __init__(self, inputData):
        data = inputData.copy()
        #m = no of points
        #n = no of features per point
        self.m = data.shape[0]
        self.n = data.shape[1]
        #mean center the data
        data -= np.mean(data, axis=0)
        # calculate the covariance matrix
        c = np.cov(data, rowvar=0)
        # get the eigenvalues/eigenvectors of c
        eval, evec = la.eig(c)
        # u = eigen vectors (transposed)
        self.u = evec.transpose()

    def getPCA(self, vector, components):
        if components > self.n:
            raise Exception("components must be > 0 and <= n")
        return np.dot(self.u[0:components, :], vector)
The covariance matrix is symmetric, and thus has real eigenvalues. You may see a small imaginary part in some eigenvalues due to numerical error. The imaginary parts can generally be ignored.
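For example (one way to apply this; the random data below is just a stand-in for the question's input):

import numpy as np
import numpy.linalg as la

data = np.random.rand(100, 4)   # stand-in for the question's data matrix
data -= np.mean(data, axis=0)
c = np.cov(data, rowvar=0)

# option 1: eigh is designed for symmetric/Hermitian matrices and returns real eigenvalues/eigenvectors
evals, evecs = la.eigh(c)

# option 2: keep la.eig but drop the negligible imaginary parts
evals2, evecs2 = la.eig(c)
evals2, evecs2 = np.real(evals2), np.real(evecs2)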
You can use the scikit-learn library for PCA; here is an example of how to use it.
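For instance, a minimal sketch using sklearn.decomposition.PCA (the data here is just a stand-in):

import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(100, 4)         # stand-in for the question's data matrix
pca = PCA(n_components=2)
projected = pca.fit_transform(data)   # mean-centers the data and projects onto the top 2 components
print projected.shape                 # (100, 2)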