scipy.optimize + kmeans clustering - python

I have the following setup for a k-means clustering algorithm that I am implementing for a project:
import numpy as np
import scipy
import random
import matplotlib.pyplot as plt
import operator

class KMeansClass:
    #takes in an npArray-like object
    def __init__(self, dataset, k):
        self.dataset = np.array(dataset)
        #initialize mins to the maximum possible value
        self.min_x = float('inf')
        self.min_y = float('inf')
        #initialize maxs to the minimum possible value
        self.max_x = float('-inf')
        self.max_y = float('-inf')
        self.k = k
        #a is the coefficient matrix that is continually updated as the centroids of the clusters change.
        #It is an mxk matrix where each row corresponds to a training instance and each column to a cluster centroid.
        #Values are either 0 or 1: the value for a particular training instance (data point) is 1 only for the
        #centroid to which the instance has the least distance; otherwise the value is 0.
        self.a = np.zeros(shape=[self.dataset.shape[0], self.k])
        self.distanceMatrix = np.empty(shape=[self.dataset.shape[0], self.k])
        #initialize mu to an empty array of the requisite shape for now; it is filled in initializeCentroids.
        self.mu = np.empty(shape=[k, 2])
        self.findMinMaxdataPoints()
        self.initializeCentroids()
        self.createDistanceMatrix()
        self.scatterPlotOfInitializedPoints()

    #pointa and pointb are npArray-like vectors.
    def euclideanDistance(self, pointa, pointb):
        return np.sqrt(np.sum((pointa - pointb)**2))

    """ Problem Initialization And Visualization Helper methods"""
    ##############################################################################
    ##param: dataset : list of tuples [(x1,y1),(x2,y2),...(xm,ym)]
    def findMinMaxdataPoints(self):
        for item in self.dataset:
            self.min_x = min(self.min_x, item[0])
            self.min_y = min(self.min_y, item[1])
            self.max_x = max(self.max_x, item[0])
            self.max_y = max(self.max_y, item[1])

    def initializeCentroids(self):
        for i in range(self.k):
            #each value of mu is a random point drawn from the bounding box (min_x, max_x) x (min_y, max_y)
            self.mu[i] = (random.uniform(self.min_x, self.max_x), random.uniform(self.min_y, self.max_y))
        self.sortCentroids()
        print(self.mu)

    def sortCentroids(self):
        #the following lines ensure that the mu values are always sorted in ascending order,
        #first with respect to the x values and then with respect to the y values.
        half_sorted = sorted(self.mu, key=operator.itemgetter(1))      #sort wrt y values
        full_sorted = sorted(half_sorted, key=operator.itemgetter(0))  #sort the y-sorted array wrt x values
        self.mu = np.array(full_sorted)

    def scatterPlotOfInitializedPoints(self):
        plt.scatter([item[0] for item in self.dataset], [item[1] for item in self.dataset], color='b')
        plt.scatter([item[0] for item in self.mu], [item[1] for item in self.mu], color='r')
        plt.show()
    ###############################################################################

    #minimizing the euclidean distance is the same as minimizing the square of the euclidean distance.
    def calcSquareEuclideanDistanceBetweenTwoPoints(self, point_a, point_b):
        return np.sum((point_a - point_b)**2)

    def createDistanceMatrix(self):
        for i in range(self.dataset.shape[0]):
            for j in range(self.k):
                self.distanceMatrix[i, j] = self.calcSquareEuclideanDistanceBetweenTwoPoints(self.dataset[i], self.mu[j])

    def createCoefficientMatrix(self):
        for i in range(self.dataset.shape[0]):
            self.a[i, self.distanceMatrix[i].argmin()] = 1

    #update functions for CoefficientMatrix and Centroid values:
    def updateCoefficientMatrix(self):
        self.a[:] = 0  #clear the previous assignments before re-assigning
        for i in range(self.dataset.shape[0]):
            self.a[i, self.distanceMatrix[i].argmin()] = 1

    def updateCentroids(self):
        for j in range(self.k):
            non_zero_indices = np.nonzero(self.a[:, j])[0]
            #the new centroid is the mean of the data points assigned to cluster j
            if len(non_zero_indices) > 0:
                self.mu[j] = self.dataset[non_zero_indices].mean(axis=0)
    ############################################################

    def lossFunction(self):
        loss = 0
        for j in range(self.k):
            #vectorized implementation
            loss += np.dot(self.a[:, j], self.distanceMatrix[:, j])
        return loss
My question pertains to the lossFunction and how to use it with the scipy.optimize package. I would like to minimize the loss function iteratively by performing the following steps:
Repeat until convergence:
a> Optimize 'a' by keeping mu constant (I have an updateCoefficientMatrix method for updating the 'a' matrix, which is an mxk matrix where m is the number of training instances and k the number of clusters.)
b> Optimize 'mu' by keeping 'a' constant (I have an updateCentroids method to do this; mu is a kx2 matrix holding the coordinates of the k cluster centroids.)
But I am very new to the scipy.optimize package, so I am writing to ask for help on how to invoke scipy.optimize to achieve the optimization goal stated above.
Basically I have two matrices and I would like to minimize lossFunction() by first optimizing one matrix while keeping the other constant, and in the succeeding step optimizing the second matrix while keeping the first constant. This can be considered a special case of the expectation-maximization problem, but unfortunately I haven't quite gotten what the documentation is trying to say so far, hence I thought I'd turn to SO for help.
Thanks in advance!
And this is part of a class assignment, so please do not post code! Any guidance or explanation would be highly appreciated.

Use scipy.optimize.minimize twice with different objective functions.
First, run the optimization with an objective function that takes a as its parameter and returns the objective value.
As the second step, run scipy.optimize.minimize a second time on a second objective function that takes mu as its parameter.
When writing the objective functions, remember that Python has nested functions (closures), which avoids the need for passing mu (in the first case) or a (in the second case) as additional arguments; alternatively, it can be done with minimize(..., args=[mu]) and minimize(..., args=[a]).
Repeat the two-step process in a for loop until the answer satisfies your convergence condition.
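Since the question asks for guidance rather than a solution, here is only a minimal, generic sketch of this alternating-minimization pattern, with all names and shapes illustrative rather than taken from the class above. Note that minimize works on flat 1-D arrays (hence the ravel/reshape round trips), and that in k-means the 'a' step actually has a closed-form argmin assignment, so a continuous optimizer is overkill for it:

import numpy as np
from scipy.optimize import minimize

def alternating_minimize(loss, a0, mu0, n_outer=20):
    # loss(a, mu) -> float is assumed to be supplied by the caller
    a, mu = a0.copy(), mu0.copy()
    for _ in range(n_outer):
        # step a: optimize the coefficients with mu held fixed
        res_a = minimize(lambda a_flat: loss(a_flat.reshape(a.shape), mu),
                         a.ravel())
        a = res_a.x.reshape(a.shape)
        # step b: optimize the centroids with a held fixed
        res_mu = minimize(lambda mu_flat: loss(a, mu_flat.reshape(mu.shape)),
                          mu.ravel())
        mu = res_mu.x.reshape(mu.shape)
        # a convergence test on the change in loss would go here
    return a, mu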

Related

Accelerate assigning of probability densities given two values in Python 3

For some of my research, I need to assign a probability density given a value, a mean, and a standard deviation, except I need to do this about 40 million times, so accelerating this code is becoming critical to working in a productive fashion.
I have only 10 values to test (values = 10x1 matrix), but I want to assign a probability for each of these values given a total of 4 million truncated normal distributions per value, each with varying means (all_means = 4 million x 10 matrix), and the same standard deviation (error = 1 value). The code I've been using to do this so far is below:
import numpy as np
import scipy.stats as ss

all_probabilities = []
for row in all_means:
    temp_row = []
    for i in range(len(row)):
        # Isolate key values
        mean = row[i]
        error = 0.05
        value = values[i]
        # Create truncated normal distribution and calculate PDF
        a, b = 0, np.inf
        mu, sigma = float(mean), float(error)
        alpha, beta = ((a - mu) / sigma), ((b - mu) / sigma)
        prob = ss.truncnorm.pdf(float(value), alpha, beta, loc=mu, scale=sigma)
        temp_row.append(prob)
    all_probabilities.append(temp_row)
A single loop iteration takes an average of 5 ms, so doing this 4 million times means this section of code would take about 5 hours to complete. I assume the limiting factors are the call to ss.truncnorm.pdf and the use of extend. The latter I can get around by pre-allocating the probability matrix, but for the former I see no workaround.
For more context, this bit of code is part of an algorithm that runs it an average of 5 times (albeit with a rapidly decreasing number of distributions to test), so any tips to speed up this code would be a huge help.
Apologies if this is trivial, I'm relatively new to optimizing code, and could not find anything on this sort of problem specifically.
You can avoid the inner loop, as scipy.stats.truncnorm accepts vectors of parameters, i.e. one frozen object can represent a whole row of random variables:
import numpy as np
from scipy.stats import truncnorm

all_probabilities = []
a, b = 0, np.inf
error = 0.05
for row in all_means:
    alpha, beta = ((a - row) / error), ((b - row) / error)
    # vectorized truncnorm: one frozen distribution per mean in the row
    rv_tn = truncnorm(alpha, beta, loc=row, scale=error)
    # evaluate the pdf for the whole row at once
    prob = rv_tn.pdf(values)
    all_probabilities.append(prob)  # keep one row of probabilities per set of means
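If memory permits (a 4-million-by-10 float64 array is roughly 320 MB), a hedged further step is to drop the outer loop as well, since truncnorm.pdf broadcasts its arguments. This is only a sketch, assuming all_means is already a NumPy array of shape (4000000, 10):

import numpy as np
from scipy.stats import truncnorm

a, b, error = 0, np.inf, 0.05
alpha = (a - all_means) / error    # shape (4000000, 10)
beta = (b - all_means) / error
# broadcasting: values has shape (10,); alpha, beta and loc have shape (4000000, 10)
all_probabilities = truncnorm.pdf(values, alpha, beta, loc=all_means, scale=error)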

Minimize the max value in Gurobi optimization

I am developing a model to solve a MIP problem using Gurobi and Python. The problem involves travel times over a set of predefined routes. One of the objective functions I am trying to realize is to minimize the maximum travel time for the selected routes. A mathematical representation of this is:
min f = max_ij ( D_ij * Z_ij )
where D_ij is the travel time for route ij and Z_ij is the assignment variable indicating whether route ij is part of the solution, so that if the route is not selected the expression evaluates to 0. What is the best way to model this in Gurobi for Python?
Here's how you can set up a min-max constraint in MIP/Gurobi.
Idea: First, create a new variable called max_distance. This is what the MIP will try to minimize.
Now add constraints, one for each (i,j) combination, such that:
dist[i][j] * Z[i][j] <= max_distance
The above takes care of pushing max_distance to be at least as large as the largest selected D_ij, and the objective function makes max_distance as small as possible.
To make the code that follows work, you have to do two things.
Add in your actual constraints that 'select' the preferred set of Zij
Replace my random values with your actual distances.
Gurobi (Python) Code for MinMax
Here's how you'd approach it in Gurobi (Python). I don't have Gurobi installed, so this hasn't been verified. It is only meant to illustrate the min-max idea.
import math
import random
from gurobipy import *

# Create 10 points (nodes i and j) with random values.
# Replace this with your actual distances.
N = 10
random.seed(1)
points = [(random.randint(0, 100), random.randint(0, 100)) for i in range(N)]
dist = {(i, j):
        math.sqrt(sum((points[i][k] - points[j][k])**2 for k in range(2)))
        for i in range(N) for j in range(i)}

m = Model()

# minimize 1 * maxDistvar
mdvar = m.addVar(lb=0.0, obj=1.0, vtype=GRB.CONTINUOUS, name="maxDistvar")

# Create the Zij variables
vars = tupledict()
for i, j in dist.keys():
    vars[i, j] = m.addVar(vtype=GRB.BINARY, name='z[%d,%d]' % (i, j))

# Set up the max-distance constraints:
# maxDistvar is greater than or equal to the largest selected dist[i, j]
for i in range(N):
    for j in range(i):
        m.addConstr(vars[i, j] * dist[i, j] <= mdvar, 'maxDist[%d,%d]' % (i, j))

# Also, add your constraints that 'select' certain Zij to be 0 or 1
# based on other criteria. These decide if Zij is part of your solution.

# Solve
m.optimize()
Then print out the selected Zij's, as sketched below. Hope that helps.
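A small sketch of that last step (after optimize(), a variable's solution value is in its X attribute):

# list the selected routes and their travel times
for (i, j), z in vars.items():
    if z.X > 0.5:    # binary variable is effectively 1
        print('route (%d,%d) selected, travel time %.2f' % (i, j, dist[i, j]))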

How to plot the pdf and cdf for an arbitrary function in python? [duplicate]

The random module (http://docs.python.org/2/library/random.html) has several fixed functions to randomly sample from. For example, random.gauss will sample a random point from a normal distribution with given mean and sigma values.
I'm looking for a way to extract a number N of random samples between a given interval using my own distribution as fast as possible in python. This is what I mean:
import numpy as np

def my_dist(x):
    # Some distribution; assume c1, c2, c3 and c4 are known.
    f = c1 * np.exp(-((x - c2)**c3) / c4)
    return f

# Draw N random samples from my distribution between given limits a, b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)
where ran_func_sample is what I'm after and a, b are the limits from which to draw the samples. Is there anything of that sort in python?
You need to use the inverse transform sampling method to get random values distributed according to the law you want. Using this method you just apply the inverted cumulative distribution function to random numbers having a standard uniform distribution on the interval [0,1].
After you find the inverted function, you get 1000 numbers distributed according to the needed distribution in this obvious way:
[inverted_function(random.random()) for x in range(1000)]
More on Inverse Transform Sampling:
http://en.wikipedia.org/wiki/Inverse_transform_sampling
Also, there is a good question on StackOverflow related to the topic:
Pythonic way to select list elements with different probability
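As a concrete illustration (my own example, not from the linked answer): for an exponential distribution the CDF F(x) = 1 - exp(-lam*x) inverts in closed form, so the recipe becomes:

import math
import random

def sample_exponential(lam, n):
    # F^-1(u) = -ln(1 - u)/lam maps Uniform(0,1) samples to Exponential(lam)
    return [-math.log(1.0 - random.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=1000)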
This code implements sampling of n-dimensional discrete probability distributions. By setting a flag on the object, it can also be used as a piecewise-constant probability distribution, which can then approximate arbitrary pdfs. Well, arbitrary pdfs with compact support; if you want to efficiently sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like Airy point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there to get accuracy; the many small values in the tails should contribute substantially, but without sorting they get drowned out in floating-point accuracy.
import numpy as np

class Distribution(object):
    """
    draws samples from a one dimensional probability distribution,
    by means of inversion of a discrete inversion of a cumulative density function

    the pdf can be sorted first to prevent numerical error in the cumulative sum
    this is set as default; for big density functions with high contrast,
    it is absolutely necessary, and for small density functions,
    the overhead is minimal

    a call to this distribution object returns indices into the density array
    """
    def __init__(self, pdf, sort=True, interpolation=True, transform=lambda x: x):
        self.shape = pdf.shape
        self.pdf = pdf.ravel()
        self.sort = sort
        self.interpolation = interpolation
        self.transform = transform
        #a pdf cannot be negative
        assert np.all(pdf >= 0)
        #sort the pdf by magnitude
        if self.sort:
            self.sortindex = np.argsort(self.pdf, axis=None)
            self.pdf = self.pdf[self.sortindex]
        #construct the cumulative distribution function
        self.cdf = np.cumsum(self.pdf)
    @property
    def ndim(self):
        return len(self.shape)
    @property
    def sum(self):
        """cached sum of all pdf values; the pdf need not sum to one, and is implicitly normalized"""
        return self.cdf[-1]
    def __call__(self, N):
        """draw N samples"""
        #pick numbers which are uniformly random over the cumulative distribution function
        choice = np.random.uniform(high=self.sum, size=N)
        #find the indices corresponding to this point on the CDF
        index = np.searchsorted(self.cdf, choice)
        #if necessary, map the indices back to their original ordering
        if self.sort:
            index = self.sortindex[index]
        #map back to multi-dimensional indexing
        index = np.unravel_index(index, self.shape)
        index = np.vstack(index)
        #is this a discrete or piecewise continuous distribution?
        if self.interpolation:
            index = index + np.random.uniform(size=index.shape)
        return self.transform(index)

if __name__ == '__main__':
    shape = 3, 3
    pdf = np.ones(shape)
    pdf[1] = 0
    dist = Distribution(pdf, transform=lambda i: i - 1.5)
    print(dist(10))
    import matplotlib.pyplot as pp
    pp.scatter(*dist(1000))
    pp.show()
And as a more real-world relevant example:
x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:, None] * p[None, :]    #2d gaussian
dist = Distribution(pdf, transform=lambda i: i - 256)
print(dist(1000000).mean(axis=1))    #should be in the 1/sqrt(1e6) range
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
Here is a rather nice way of performing inverse transform sampling with a decorator.
import numpy as np
from scipy.interpolate import interp1d

def inverse_sample_decorator(dist):
    def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
        x = np.linspace(x_min, x_max, int(n))
        cumulative = np.cumsum(dist(x, **kwargs))
        cumulative -= cumulative.min()
        f = interp1d(cumulative / cumulative.max(), x)
        return f(np.random.random(pnts))
    return wrapper
Using this decorator on a Gaussian distribution, for example:
@inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
    return amp * np.exp(-(x - mean)**2 / std**2 / 2.0)
You can then generate sample points from the distribution by calling the decorated function. The keyword arguments x_min and x_max are the limits of the original distribution and can be passed as arguments to gauss along with the other keyword arguments that parameterise the distribution.
samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)
Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question):

def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
    x = np.linspace(x_min, x_max, int(n))
    cumulative = np.cumsum(dist(x, **kwargs))
    cumulative -= cumulative.min()
    f = interp1d(cumulative / cumulative.max(), x)
    return f(np.random.random(pnts))
I was in a similar situation, but I wanted to sample from a multivariate distribution, so I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).
import numpy as np

def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        # propose a candidate by perturbing the current state with unit Gaussian noise
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = target_density(xt_candidate) / target_density(xt)
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    # discard the burn-in samples and flatten to an (n, 2) array
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
This function requires a function target_density which takes in a data point and computes its probability.
For details, check out this detailed answer of mine.
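Hypothetical usage, continuing from the code above and sampling from an unnormalized 2-D Gaussian density (the float() keeps the acceptance ratio a scalar for the (1, 2)-shaped states):

target = lambda x: float(np.exp(-0.5 * np.sum(x**2)))
samples = metropolis_hastings(target, size=5000)
print(samples.mean(axis=0))    # should be close to (0, 0)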
import numpy as np
import scipy.interpolate as interpolate

def inverse_transform_sampling(data, n_bins, n_samples):
    hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
    cum_values = np.zeros(bin_edges.shape)
    cum_values[1:] = np.cumsum(hist * np.diff(bin_edges))
    inv_cdf = interpolate.interp1d(cum_values, bin_edges)
    r = np.random.rand(n_samples)
    return inv_cdf(r)
So if we give this function a data sample with a specific distribution, it will return a dataset with approximately the same distribution. The advantage here is that we can choose our own sample size by specifying it in the n_samples variable.
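A quick usage sketch with synthetic data (names illustrative):

data = np.random.normal(size=10000)    # any sample with some distribution
new_samples = inverse_transform_sampling(data, n_bins=50, n_samples=1000)
# new_samples now follows approximately the same distribution as data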

Comparing model and data using scipy.optimize

I'm trying to compare a set of discrete data values with a model to estimate the x value where there is a good match between the discrete data points and the model. In other words, I'm trying to estimate the x value (or the range of x) where the differences between the data (discrete points) and the model are minimal. I have a model that provides Ya(x), Yb(x), Yc(x) (continuous lines), and I also have the data points A, B and C (filled circles). I would like to estimate the x value where the data points A, B and C (or most of them) match the corresponding continuous lines well. I also plot (model - data)^2 as a function of x; it appears from the second plot that a good match is obtained in the x range 5e3 to 1e4. I was wondering if I can use a scipy.optimize routine to estimate this quantitatively.
Thanks for your time and any help would be greatly appreciated.
I think this pseudo-code might get you started. See if it actually matches what you want.
import scipy.optimize
import numpy as np

def yA(x):
    # whatever calculations you do here for curve A
    return 1.0    # return whatever yA is at x

def yB(x):
    # whatever calculations you do here for curve B
    return 1.0    # return whatever yB is at x

def yC(x):
    # whatever calculations you do here for curve C
    return 1.0    # return whatever yC is at x

def func(x, data):
    A, B, C = data                      # unpack tuple
    devA = np.abs((yA(x) - A) / yA(x))  # normalize the deviations
    devB = np.abs((yB(x) - B) / yB(x))  # to account for the order
    devC = np.abs((yC(x) - C) / yC(x))  # of magnitude variations
    return devA + devB + devC           # minimize the sum of the deviations

A = 1.0E-10   # these are your data points (rough guess from plot)
B = 1.0E-11
C = 1.0E-8

x0 = 1000.0   # an initial guess
# note: args must be a 1-tuple wrapping the data tuple,
# so that func receives it as its single extra argument
result = scipy.optimize.minimize(func, x0, args=((A, B, C),))
print(result.x)
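Since x is one-dimensional and the question already suggests a plausible range, a hedged alternative (reusing func, A, B and C from above) is the bounded scalar minimizer; the 5e3 to 1e4 bounds come from the question's second plot:

result = scipy.optimize.minimize_scalar(func, bounds=(5e3, 1e4),
                                        args=((A, B, C),), method='bounded')
print(result.x)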

NumPy "invalid value" warning while calculating normalized Mahalanobis distance

Note:
This is for a homework assignment in my data mining class.
I'm going to put relevant code snippets on this SO post, but you can find my entire program at http://pastebin.com/CzNFbLJ2
The dataset I'm using for this program can be found at http://archive.ics.uci.edu/ml/datasets/Iris
So I'm getting: RuntimeWarning: invalid value encountered in sqrt
return np.sqrt(m)
I am attempting to find the average Mahalanobis distance of the given iris dataset (for both the raw and normalized datasets). The error only happens on the normalized version of the dataset, which makes me wonder whether I have incorrectly understood what normalization means (both in code and mathematically).
I thought that normalization means that each component of a vector is divided by its vector length (causing the vector to add up to 1). I found this SO question How to normalize a 2-dimensional numpy array in python less verbose? and thought it matched my concept of normalization. But now my code reports that the Mahalanobis distance over the normalized dataset is NaN.
def mahalanobis(data):
    import numpy as np
    import scipy.spatial.distance
    avg = 0
    count = 0
    covar = np.cov(data, rowvar=0)
    invcovar = np.linalg.inv(covar)
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            avg += scipy.spatial.distance.mahalanobis(data[i], data[j], invcovar)
            count += 1
    return avg / count

def normalize(data):
    import numpy as np
    row_sums = data.sum(axis=1)
    norm_data = np.zeros((50, 4))
    for i, (row, row_sum) in enumerate(zip(data, row_sums)):
        norm_data[i, :] = row / row_sum
    return norm_data
Probably too late, but check out pages 64-65 in our textbook "Introduction to Data Mining". There's a section called "Normalization or Standardization", which explains the concept of normalized data that Hearne is looking for.
Basically, the standardized data set is x' = (x - mean(x)) / standardDeviation(x).
Since I see you're using Python, here's how to do it using NumPy:
normalizedData = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Source: http://mail.scipy.org/pipermail/numpy-discussion/2011-April/056023.html
You can use pdist() to do the distance calculation without a for loop:

from sklearn import datasets
from scipy.spatial.distance import pdist, squareform

iris = datasets.load_iris()
print(squareform(pdist(iris.data, 'mahalanobis')))
Normalization in this context probably does mean subtracting the mean and scaling so the data has a unit covariance matrix.
However, to scale every vector in your dataset to unit norm use: norm_data = data / np.sqrt(np.sum(data*data, 1))[:, None].
You need to divide by the L2 norm of each vector, which means squaring the value of each element, then taking the square root of the sum. Broadcasting allows you to avoid explicitly coding the loop (see the answer to the question you cited: https://stackoverflow.com/a/8904762/1149913).
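For reference, the same row-wise L2 scaling is a one-liner in scikit-learn, assuming that library is available:

from sklearn.preprocessing import normalize
norm_data = normalize(data)    # L2-normalizes each row by default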
