Multi-knapsack problem with aggregate objective function/objective with a soft limit - python

I am trying to solve a variant of the multi-knapsack example in Google OR-tools. The one feature I cannot seem to encode is a soft limit on the value.
In the original example, an item has a weight that is used to calculate a constraint and a value that is used to calculate the optimum solution. In my variation I have multiple weights/capacities that form quotas and compatibilities for items of certain types. In addition, each bin has a funding target and each item has a value. I would like to minimise the funding shortfall for each bin:
# pseudocode!
minimise: sum(max(0, funding_capacity[j] - sum(item[i, j] * item_value[i] for i in num_items)) for j in num_bins)
The key differences between this approach and the example are that if item_1 has a value of 110 and bin_A has a funding requirement of 100, item_1 can fit into bin_A and makes the funding shortfall go to zero. item_2 with a value of 50 could also fit into bin_A (as long as the other constraints are met) but the optimiser will see no improvement in the objective function. I have attempted to use the objective.SetCoefficient method on a calculation of the funding shortfall but I keep getting errors that I think are to do with this method not liking aggregate functions like sum.
How do I implement the funding shortfall objective above, either in the objective function or alternatively in the constraints? How can I form an objective function using a summary calculation? The ideal answer would be a code snippet for OR-tools in Python but clear illustrative answers from OR-tools in other programming languages would also be helpful.

Working code follows, but here's how you would proceed with the formulation.
Formulation changes to the Multiple Knapsack problem given here
You will need two sets of new variables for each bin. Let's call them shortfall[j] (continuous) and filled[j] (boolean).
shortfall[j] is simply funding_requirement[j] - sum_i(item_funding[i] * x[i, j]), i.e. the funding requirement of bin j minus the total funding of the items placed in it.
filled[j] is a Boolean, which we want to be 1 if the total funding of the items in the bin meets or exceeds its funding requirement, and 0 otherwise.
We have to resort to a standard trick in Integer Programming that involves using a Big M. (A large number)
if total_item_funding >= requirement, filled = 1
if total_item_funding < requirement, filled = 0
This can be expressed as a linear constraint:
shortfall + BigM * filled >= 0
Note that if the shortfall goes negative, it forces the filled variable to become 1. If shortfall is positive, filled can stay 0. (We will enforce this using the objective function.)
In the objective function for a Maximization problem, you penalize the filled variable.
Obj: Max sum(i,j) Xij * value_i + sum(j) filled_j * -100
So, this multiple knapsack formulation is incentivized to go close to each bin's funding requirement, but if it crosses the requirement, it pays a penalty.
You can play around with the objective function variables and penalties.
Formulation using Google OR-Tools
Working Python Code. For simplicity, I only made 3 bins.
from ortools.linear_solver import pywraplp

def create_data_model():
    """Create the data for the example."""
    data = {}
    weights = [48, 30, 42, 36, 36, 48, 42, 42, 36, 24, 30, 30, 42, 36, 36]
    values = [10, 30, 25, 50, 35, 30, 15, 40, 30, 35, 45, 10, 20, 30, 25]
    item_funding = [50, 17, 38, 45, 65, 60, 15, 30, 10, 25, 75, 30, 40, 40, 35]
    data['weights'] = weights
    data['values'] = values
    data['i_fund'] = item_funding
    data['items'] = list(range(len(weights)))
    data['num_items'] = len(weights)
    num_bins = 3
    data['bins'] = list(range(num_bins))
    data['bin_capacities'] = [100, 100, 80]
    data['bin_funding'] = [100, 100, 50]
    return data

def main():
    data = create_data_model()
    # Create the mip solver with the SCIP backend.
    solver = pywraplp.Solver.CreateSolver('SCIP')

    # Variables
    # x[i, j] = 1 if item i is packed in bin j.
    x, short, filled = {}, {}, {}
    for i in data['items']:
        for j in data['bins']:
            x[(i, j)] = solver.IntVar(0, 1, 'x_%i_%i' % (i, j))

    BIG_M, MAX_SHORT = 1e4, 500
    for j in data['bins']:
        # short[j] = bin j's funding requirement minus the funding packed into it
        short[j] = solver.NumVar(-MAX_SHORT, MAX_SHORT,
                                 'bin_shortfall_%i' % (j))
        # filled[j] = 1 if bin j's funding requirement is met or exceeded
        filled[j] = solver.IntVar(0, 1, 'filled[%i]' % (j))

    # Constraints
    # Each item can be in at most one bin.
    for i in data['items']:
        solver.Add(sum(x[i, j] for j in data['bins']) <= 1)

    for j in data['bins']:
        # The amount packed in each bin cannot exceed its capacity.
        solver.Add(
            sum(x[(i, j)] * data['weights'][i]
                for i in data['items']) <= data['bin_capacities'][j])
        # Define bin shortfalls as equality constraints.
        solver.Add(
            data['bin_funding'][j] - sum(x[(i, j)] * data['i_fund'][i]
                                         for i in data['items']) == short[j])
        # If the shortfall goes negative, filled is forced to be 1.
        solver.Add(short[j] + BIG_M * filled[j] >= 0)

    # Objective
    objective = solver.Objective()
    for i in data['items']:
        for j in data['bins']:
            objective.SetCoefficient(x[(i, j)], data['values'][i])
    for j in data['bins']:
        # objective.SetCoefficient(short[j], 1)
        objective.SetCoefficient(filled[j], -100)
    objective.SetMaximization()

    print('Number of variables =', solver.NumVariables())
    status = solver.Solve()
    if status == pywraplp.Solver.OPTIMAL:
        print('OPTIMAL SOLUTION FOUND\n\n')
        total_weight = 0
        for j in data['bins']:
            bin_weight = 0
            bin_value = 0
            bin_fund = 0
            print('Bin ', j, '\n')
            print(f"Funding {data['bin_funding'][j]} Shortfall "
                  f"{short[j].solution_value()}")
            for i in data['items']:
                if x[i, j].solution_value() > 0:
                    print('Item', i, '- weight:', data['weights'][i], ' value:',
                          data['values'][i], ' funding:', data['i_fund'][i])
                    bin_weight += data['weights'][i]
                    bin_value += data['values'][i]
                    bin_fund += data['i_fund'][i]
            print('Packed bin weight:', bin_weight)
            print('Packed bin value:', bin_value)
            print('Packed bin funding:', bin_fund)
            print()
            total_weight += bin_weight
        print('Total packed weight:', total_weight)
    else:
        print('The problem does not have an optimal solution.')

if __name__ == '__main__':
    main()
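If you would rather minimise the total funding shortfall directly, as in the question's pseudocode, the max(0, ...) can be linearised with a non-negative slack variable instead of the Big M / filled pair. Here is a minimal sketch of just that change; it reuses data, x and solver from main() above, replaces the maximisation objective, and the variable name shortfall is illustrative:

# Non-negative slack: at the optimum, shortfall[j] == max(0, requirement - packed funding)
shortfall = {}
for j in data['bins']:
    shortfall[j] = solver.NumVar(0, solver.infinity(), 'shortfall_%i' % j)
    solver.Add(
        shortfall[j] >= data['bin_funding'][j]
        - sum(x[(i, j)] * data['i_fund'][i] for i in data['items']))
# Minimise the summed shortfall over all bins
solver.Minimize(sum(shortfall[j] for j in data['bins']))

Because the objective pushes each shortfall[j] down onto the larger of its two lower bounds (0 and the unmet funding), this reproduces the sum(max(0, ...)) objective without needing Big M.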
Hope that helps you move forward.

Related

How can I make sure my constraint can operate mathematically with the LpVariable?

I am trying to use the PuLP library's classes to solve an LP problem. I am having problems implementing the constraint in my code.
After importing the relevant classes and reading from my CSV file I wrote:
prob = pulp.LpProblem("Optimal Number of Bank Tellers", pulp.LpMinimize)
x = pulp.LpVariable("Number of Tellers", lowBound = 0, cat='Integer')
prob += x * (16*4 + 14*4)/8 , "Total Cost of Labor"
for i in [28, 35, 21, 46, 32, 14, 24, 32]:
    prob += i / x <= 1/8, "Service Level Constraint for Time Slot {}".format(i)
prob.solve()
Unfortunately I don't quite understand why I get the error message that 'int' and 'LpVariable' are unsupported operand types.
How would I correctly model my constraint otherwise? What exactly did I do wrong here?
i / x <= 1/8
is obviously nonlinear. PuLP is only for linear models. Of course, you could write:
i <= x * (1/8)
which makes this linear.
Actually, there is no need to generate all these constraints. We can do with just one:
x >= 8*max([28, 35, 21, 46, 32, 14, 24, 32])
Finally, it is slightly better to specify this as a lower bound on x directly.
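Putting that together, here is a minimal sketch of the corrected PuLP model. The demand list and cost expression are taken from the question; treat the names as illustrative rather than a drop-in:

import pulp

demands = [28, 35, 21, 46, 32, 14, 24, 32]

prob = pulp.LpProblem("Optimal_Number_of_Bank_Tellers", pulp.LpMinimize)
# The service-level requirement x >= 8 * demand for every time slot
# collapses to a single lower bound on x.
x = pulp.LpVariable("Number_of_Tellers", lowBound=8 * max(demands), cat="Integer")

# Total cost of labour, as in the question
prob += x * (16 * 4 + 14 * 4) / 8, "Total_Cost_of_Labor"

prob.solve()
print(pulp.LpStatus[prob.status], pulp.value(x))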

Summing until time-condition is reached in Python

I want to sum over a certain, but rolling, period within my dynamic model. The formal representation is as follows
A simple code snippet to run the equation is:
import numpy as np
import pandas as pd
import operator
year = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03
I tried subtracting list a from m_ with list(map(operator.sub, m_, a)), as found in another post.
My failed attempt looks something like this:
for t in year:
    for i in range(0, 3):
        while t < t+(list(map(operator.sub, m_, a))):
            L_[t] = sum(ARC_[i] / (1+r) ** t)
Not at all sure that I understood it right, I tried to base my answer on the equation. Even if it is still a bit off from the result you expect, it might help you to solve your issue.
I create a result list to store each value of L[t], i.e. 50 values. Then I compute the start/stop of the sum for every pair (t, i) and compute it.
import numpy as np

years = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03

result = []
for t in years:
    s = 0
    for i in range(3):
        t0 = t
        tf = t + m_[i] - a[i]
        for k in range(int(t0), int(tf + 1)):
            s += ARC_[i] / (1 + r) ** t
    result.append(s)
If what you wanted to do is to compute the difference element-wise between m and a, a simple solution is:
[m_[i] - a[i] for i in range(len(m_))]
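Or, with numpy (which is already imported above), the same element-wise difference as an array operation, as a minimal sketch:

import numpy as np

m_ = [50, 30, 15]
a = [25, 15, 7.5]
diff = (np.array(m_) - np.array(a)).tolist()   # [25.0, 15.0, 7.5]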
Hope it helps.

Expand numbers in a list

I have a list of numbers:
[10,20,30]
What I need is to expand it according to a predefined increment. Thus, let's call x the increment and x=2, my result should be:
[10,12,14,16,18,20,22,24,.....,38]
Right now I am using a for loop, but it is very slow and I am wondering if there is a faster way.
EDIT:
newA = []
for n in array:
    newA = newA + generateNewNumbers(n, p, t)
The function generateNewNumbers simply generates the new numbers to add to the list.
EDIT2:
To better define the problem the first array contains some timestamps:
[10,20,30]
I have two parameters one is the sampling rate and one is the sampling time, what I need is to expand the array adding between two timestamps the correct number of timestamps, according to the sampling rate.
For example, if I have a sampling rate 3 and a sampling time 3 the result should be:
[10,13,16,19,20,23,26,29,30,33,36,39]
You can add the same set of increments to each time stamp using np.add.outer and then flatten the result using ravel.
import numpy as np
a = [10,20,35]
inc = 3
ninc = 4
np.add.outer(a, inc * np.arange(ninc)).ravel()
# array([10, 13, 16, 19, 20, 23, 26, 29, 35, 38, 41, 44])
You can use list comprehensions, but I'm not sure I understand the stopping condition for including the last point.
a = [10, 20, 30, 40]
t = 3
sum([[x for x in range(y, z, t)] for y, z in zip(a[:-1], a[1:])], []) + [a[-1]]
will give
[10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39, 40]
Using range and itertools.chain
l = [10,20,30]
x = 3
from itertools import chain
list(chain(*[range(i,i+10,x) for i in l]))
#Output:
#[10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39]
There are a bunch of good answers here already, but I would advise numpy and linear interpolation.
# Now, this will give you the desired result with your first specifications
# And in pure Python too
t = [10, 20, 30]
increment = 2
last = int(round(t[-1]+((t[-1]-t[-2])/float(increment))-1)) # Value of last number in array
# Note if you insist on mathematically "incorrect" endpoint, do:
#last = ((t[-1]+(t[-1]-t[-2])) -((t[-1]-t[-2])/float(increment)))+1
newt = range(t[0], last+1, increment)
# And, of course, this may skip entered values (increment = 3
# But what you should do instead, when you use the samplerate is
# to use linear interpolation
# If you resample the original signal,
# Then you resample the time too
# And don't expand over the existing time
# Because the time doesn't change if you resampled the original properly
# You only get more or less samples at different time points
# But it lasts the same length of time.
# If you do what you originally meant, you actually shift your datapoints in time
# Which is wrong.
import numpy
t = [10, 20, 30, 40, 50, 60]
oldfs = 4000 # 4 KHz samplerate
newfs = 8000 # 8 KHz sample rate (2 times bigger signal and its time axis)
ratio = max(oldfs*1.0, newfs*1.0)/min(newfs, oldfs)
newlen = round(len(t)*ratio)
numpy.interp(
numpy.linspace(0.0, 1.0, newlen),
numpy.linspace(0.0, 1.0, len(t)),
t)
This code can resample your original signal too (if you have one). If you just want to cram in some more timepoints in between, you can also use interpolation. Again, don't go over the existing time. Although this code does it, to be compatible with the first one. And so that you can get ideas on what you can do.
t = [10, 20, 30]
increment = 2
last = t[-1]+((t[-1]-t[-2])/float(increment))-1 # Value of last number in array
t.append(last)
newlen = (t[-1]-t[0])/float(increment)+1 # How many samples we will get in the end
ratio = newlen / len(t)
numpy.interp(
numpy.linspace(0.0, 1.0, newlen),
numpy.linspace(0.0, 1.0, len(t)),
t)
This, though, results in an increment of 2.5 instead of 2, but it can be corrected. The point is that this approach works on floating-point time points as well as on integers, and it is fast. It will slow down if there are a lot of points, but it stays quick until you reach a very large number of them.

Which programming structure for clustering algorithm

I am trying to implement the following (divisive) clustering algorithm (a short form of the algorithm is presented below; the full description is available here):
Start with a sample x_i, i = 1, ..., n, regarded as a single cluster of n data points, and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.
First determine the distance between all pairs of data points and choose a pair with the largest distance (Dmax) between them.
Compare Dmax to T. If Dmax > T then divide the single cluster in two by using the selected pair as the first elements in two new clusters. The remaining n - 2 data points are put into one of the two new clusters: x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l), otherwise it is added to the new cluster containing x_j.
At the second stage, the values D(x_i, x_j) are found within one of two new clusters to find the pair in the cluster with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. Then the procedure repeats on the clusters generated from this iteration.
Output is a hierarchy of clustered data records. I kindly ask for an advice how to implement the clustering algorithm.
EDIT 1: I attach a Python function which defines the distance (correlation coefficient) and a function which finds the maximal distance in the data matrix.
# Read data from GitHub
import pandas as pd
from math import sqrt

df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]

# Define correlation coefficient as distance of choice
def pearson(v1, v2):
    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)
    # Sums of the squares
    sum1Sq = sum([pow(v, 2) for v in v1])
    sum2Sq = sum([pow(v, 2) for v in v2])
    # Sum of the products
    pSum = sum([v1[i] * v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / len(v1))
    den = sqrt((sum1Sq - pow(sum1, 2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
    if den == 0: return 0
    return num / den

# Find largest distance
dist = {}
max_dist = pearson(data[0], data[0])
# Loop over upper triangle of data matrix
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        # Compute distance for each pair
        dist_curr = pearson(data[i], data[j])
        # Store distance in dict
        dist[(i, j)] = dist_curr
        # Store max distance
        if dist_curr > max_dist:
            max_dist = dist_curr
EDIT 2: Pasted below are functions from Dschoni's answer.
import numpy

# Euclidean distance
def euclidean(x, y):
    x = numpy.array(x)
    y = numpy.array(y)
    return numpy.sqrt(numpy.sum((x - y)**2))

# Create matrix
def dist_mat(data):
    dist = {}
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            dist[(i, j)] = euclidean(data[i], data[j])
    return dist

# Returns i & k for max distance
def my_max(dict):
    return max(dict)

# Sort function
list1 = []
list2 = []
def sort(rcd, i, k):
    list1.append(i)
    list2.append(k)
    for j in range(len(rcd)):
        if euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k]):
            list1.append(j)
        else:
            list2.append(j)
EDIT 3:
When I run the code provided by Dschoni the algorithm works as expected. Then I modified the create_distance_list function so that we can compute the distance between multivariate data points. I use Euclidean distance. As a toy example I load the iris data. I cluster only the first 50 instances of the dataset.
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',')
df = df.drop(4, 1)
df = df[1:50]
data = df.values.tolist()
idl=range(len(data))
dist = create_distance_list(data)
print sort(dist, idl)
The result is as follows:
[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16,
39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41],
[5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44],
[31], [32], [19], [30], [35], [9, 47]]
Some data points are still clustered together. I solve this problem by adding a small amount of random noise to the actual dictionary in the sort function:
# Add small random noise
for key in actual:
    actual[key] += np.random.normal(0, 0.005)
Any idea how to solve this problem properly?
A proper working example for the euclidean distance:
import numpy as np
# For random number generation

def create_distance_list(l):
    '''Create a distance list for every
       unique tuple of pairs'''
    dist = {}
    for i in range(len(l)):
        for k in range(i + 1, len(l)):
            dist[(i, k)] = abs(l[i] - l[k])
    return dist

def maximum(distance_dict):
    '''Returns the key of the maximum value if unique
       or a random key with the maximum value.'''
    maximum = max(distance_dict.values())
    max_key = [key for key, value in distance_dict.items() if value == maximum]
    if len(max_key) > 1:
        random_key = np.random.random_integers(0, len(max_key) - 1)
        return (max_key[random_key],)
    else:
        return max_key

def construct_new_dict(distance_dict, index_list):
    '''Helper function to create a distance map for a subset
       of data points.'''
    new = {}
    for i in range(len(index_list)):
        for k in range(i + 1, len(index_list)):
            m = index_list[i]
            n = index_list[k]
            new[(m, n)] = distance_dict[(m, n)]
    return new

def sort(distance_dict, idl, threshold=4):
    result = [idl]
    i = 0
    try:
        while True:
            if len(result[i]) >= 2:
                actual = construct_new_dict(dist, result[i])
                act_max = maximum(actual)
                if distance_dict[act_max[0]] > threshold:
                    j = act_max[0][0]
                    k = act_max[0][1]
                    result[i].remove(j)
                    result[i].remove(k)
                    l1 = [j]
                    l2 = [k]
                    for iterr in range(len(result[i])):
                        s = result[i][iterr]
                        if s > j:
                            c1 = (j, s)
                        else:
                            c1 = (s, j)
                        if s > k:
                            c2 = (k, s)
                        else:
                            c2 = (s, k)
                        if actual[c1] < actual[c2]:
                            l1.append(s)
                        else:
                            l2.append(s)
                    result.remove(result[i])
                    # What to do if distance is equal?
                    l1.sort()
                    l2.sort()
                    result.append(l1)
                    result.append(l2)
                else:
                    i += 1
            else:
                i += 1
    except:
        return result

# This is the dataset
a = [1, 2, 2.5, 5]
# Giving each entry a unique ID
idl = range(len(a))
dist = create_distance_list(a)
print sort(dist, idl)
I wrote the code for readability; there is a lot of stuff that can be made faster, more reliable and prettier. This is just to give you an idea of how it can be done.
Some data points are still clustered together. I solve this problem by
adding small amount of data noise to actual dictionary in the sort
function.
If Dmax > T then divide single cluster in two
Your description doesn't necessarily create n clusters.
If a cluster has two records whose distance is less than T,
they will be clustered together (am I missing something?)

Capacitated k-means clustering?

I'm a newbie to algorithms and optimization.
I'm trying to implement capacitated k-means, but I am getting unresolved and poor results so far.
This is used as part of a CVRP simulation (capacitated vehicle routing problem).
I'm curious whether I have interpreted the referenced algorithm wrongly.
Ref: "Improved K-Means Algorithm for Capacitated Clustering Problem"
(Geetha, Poonthalir, Vanathi)
The simulated CVRP has 15 customers, with 1 depot.
Each customer has Euclidean coordinate (x,y) and demand.
There are 3 vehicles, each has capacity of 90.
So, the capacitated k-means is trying to cluster 15 customers into 3 vehicles, with the total demands in each cluster must not exceed vehicle capacity.
UPDATE:
In the referenced algorithm, I couldn't find any information about what the code must do when it runs out of "next nearest centroids".
That is, when all of the "nearest centroids" have been examined (in step 14.b below) while customers[1] is still unassigned.
This results in the customer with index 1 being unassigned.
Note: customer[1] is customer with largest demand (30).
Q: When this condition is met, what should the code do then?
Here is my interpretation of the referenced algorithm, please correct my code, thank you.
1. Given n requesters (customers), n = customerCount, and a depot
2. n demands,
3. n coordinates (x, y)
4. calculate number of clusters, k = (sum of all demands) / vehicleCapacity
5. select initial centroids,
   5.a. sort customers based on demand, in descending order = d_customers,
   5.b. select the k first customers from d_customers as initial centroids = centroids[0 .. k-1],
6. create binary matrix bin_matrix, dimension = (customerCount) x (k),
   6.a. fill bin_matrix with all zeros
7. start WHILE loop, condition = WHILE not converged.
   7.a. converged = False
8. start FOR loop, condition = FOR each customer,
   8.a. index of customer = i
9. calculate Euclidean distances from customers[i] to all centroids => edist
   9.a. sort edist in ascending order,
   9.b. select the first centroid with the closest distance = closest_centroid
10. start WHILE loop, condition = WHILE customers[i] is not assigned to any cluster.
11. group all the other unassigned customers = G,
    11.a. consider closest_centroid as the centroid for G.
12. calculate priority Pi for each customer of G,
    12.a. priority Pi = (distance from customers[i] to closest_cent) / demand[i]
    12.b. select the customer with the highest priority Pi.
    12.c. the customer with the highest priority has index = hpc
    12.d. Q: IF the highest-priority customer cannot be found, what must we do?
13. assign customers[hpc] to centroids[closest_centroid] if possible.
    13.a. demand of customers[hpc] = d1,
    13.b. sum of all demands of the centroid's members = dtot,
    13.c. IF (d1 + dtot) <= vehicleCapacity, THEN..
    13.d. assign customers[hpc] to centroids[closest_centroid]
    13.e. update bin_matrix, row index = hpc, column index = closest_centroid, set to 1.
14. IF customers[i] is (still) not assigned to any cluster, THEN..
    14.a. choose the next nearest centroid, with the next nearest distance from edist.
    14.b. Q: IF there is no next nearest centroid, THEN what must we do?
15. calculate converged by comparing the previous matrix and the updated matrix bin_matrix.
    15.a. IF there are no changes in bin_matrix, then set converged = True.
16. otherwise, calculate new centroids from the updated clusters.
    16.a. calculate the new centroids' coordinates based on the members of each cluster.
    16.b. sum_x = sum of all x-coordinates of a cluster's members,
    16.c. num_c = number of all customers (members) in the cluster,
    16.d. new centroid's x-coordinate of the cluster = sum_x / num_c.
    16.e. with the same formula, calculate the new centroid's y-coordinate = sum_y / num_c.
17. iterate the main WHILE loop.
My code always ends with an unassigned customer at step 14.b.
That is, there is a customers[i] still not assigned to any centroid, and it has run out of "next nearest centroids".
And the resulting clusters are poor. Output graph:
In the picture, the star is a centroid and the square is the depot.
In the picture, the customer labeled "1", with demand = 30, always ends up with no assigned cluster.
Output of the program,
k_cluster 3
idx [ 1 -1 1 0 2 0 1 1 2 2 2 0 0 2 0]
centroids [(22.6, 29.2), (34.25, 60.25), (39.4, 33.4)]
members [[3, 14, 12, 5, 11], [0, 2, 6, 7], [9, 8, 4, 13, 10]]
demands [86, 65, 77]
The first and third clusters are poorly calculated.
idx at index 1 is not assigned (-1).
Q: What's wrong with my interpretation and my implementation?
Any correction, suggestion, or help will be very much appreciated, thank you in advance.
Here is my full code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# pastebin.com/UwqUrHhh
# output graph: i.imgur.com/u3v2OFt.png
import math
import random
from operator import itemgetter
from copy import deepcopy
import numpy
import pylab

# depot and customers, [index, x, y, demand]
depot = [0, 30.0, 40.0, 0]
customers = [[1, 37.0, 52.0, 7],
             [2, 49.0, 49.0, 30], [3, 52.0, 64.0, 16],
             [4, 20.0, 26.0, 9], [5, 40.0, 30.0, 21],
             [6, 21.0, 47.0, 15], [7, 17.0, 63.0, 19],
             [8, 31.0, 62.0, 23], [9, 52.0, 33.0, 11],
             [10, 51.0, 21.0, 5], [11, 42.0, 41.0, 19],
             [12, 31.0, 32.0, 29], [13, 5.0, 25.0, 23],
             [14, 12.0, 42.0, 21], [15, 36.0, 16.0, 10]]
customerCount = 15
vehicleCount = 3
vehicleCapacity = 90
assigned = [-1] * customerCount
# number of clusters
k_cluster = 0
# binary matrix
bin_matrix = []
# coordinate of centroids
centroids = []
# total demand for each cluster, must be <= capacity
tot_demand = []
# members of each cluster
members = []
# coordinate of members of each cluster
xy_members = []

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

# capacitated k-means clustering
# http://www.dcc.ufla.br/infocomp/artigos/v8.4/art07.pdf
def cap_k_means():
    global k_cluster, bin_matrix, centroids, tot_demand
    global members, xy_members, prev_members

    # calculate number of clusters
    tot_demand = sum([c[3] for c in customers])
    k_cluster = int(math.ceil(float(tot_demand) / vehicleCapacity))
    print 'k_cluster', k_cluster

    # initial centroids = first sorted-customers based on demand
    d_customers = sorted(customers, key=itemgetter(3), reverse=True)
    centroids, tot_demand, members, xy_members = [], [], [], []
    for i in range(k_cluster):
        centroids.append(d_customers[i][1:3])  # [x,y]
        # initial total demand and members for each cluster
        tot_demand.append(0)
        members.append([])
        xy_members.append([])

    # binary matrix, dimension = customerCount x k_cluster
    bin_matrix = [[0] * k_cluster for i in range(len(customers))]

    converged = False
    while not converged:  # until no changes in formed-clusters
        prev_matrix = deepcopy(bin_matrix)

        for i in range(len(customers)):
            edist = []  # list of distance to clusters

            if assigned[i] == -1:  # if not assigned yet
                # Calculate the Euclidean distance to each of k-clusters
                for k in range(k_cluster):
                    p1 = (customers[i][1], customers[i][2])  # x,y
                    p2 = (centroids[k][0], centroids[k][1])
                    edist.append((distance(p1, p2), k))

                # sort, based on closest distance
                edist = sorted(edist, key=itemgetter(0))
                closest_centroid = 0  # first index of edist

                # loop while customer[i] is not assigned
                while assigned[i] == -1:
                    # calculate all unassigned customers' (G) priority
                    max_prior = (0, -1)  # value, index
                    for n in range(len(customers)):
                        pc = customers[n]
                        if assigned[n] == -1:  # if unassigned
                            # get index of current centroid
                            c = edist[closest_centroid][1]
                            cen = centroids[c]  # x,y
                            # distance_cost / demand
                            p = distance((pc[1], pc[2]), cen) / pc[3]
                            # find highest priority
                            if p > max_prior[0]:
                                max_prior = (p, n)  # priority, customer-index

                    # if highest-priority is not found, what should we do ???
                    if max_prior[1] == -1:
                        break

                    # try to assign current cluster to highest-priority customer
                    hpc = max_prior[1]  # index of highest-priority customer
                    c = edist[closest_centroid][1]  # index of current cluster

                    # constraint, total demand in a cluster <= capacity
                    if tot_demand[c] + customers[hpc][3] <= vehicleCapacity:
                        # assign new member of cluster
                        members[c].append(hpc)  # add index of customer
                        xy = (customers[hpc][1], customers[hpc][2])  # x,y
                        xy_members[c].append(xy)
                        tot_demand[c] += customers[hpc][3]
                        assigned[hpc] = c  # update cluster to assigned-customer
                        # update binary matrix
                        bin_matrix[hpc][c] = 1

                    # if customer is not assigned then,
                    if assigned[i] == -1:
                        if closest_centroid < len(edist) - 1:
                            # choose the next nearest centroid
                            closest_centroid += 1
                        # if run out of closest centroid, what must we do ???
                        else:
                            break  # exit without centroid ???
                # end while
        # end for

        # Calculate the new centroid from the formed clusters
        for j in range(k_cluster):
            xj = sum([cn[0] for cn in xy_members[j]])
            yj = sum([cn[1] for cn in xy_members[j]])
            xj = float(xj) / len(xy_members[j])
            yj = float(yj) / len(xy_members[j])
            centroids[j] = (xj, yj)

        # calculate converged
        converged = numpy.array_equal(numpy.array(prev_matrix), numpy.array(bin_matrix))
    # end while

def clustering():
    cap_k_means()

    # debug plot
    idx = numpy.array([c for c in assigned])
    xy = numpy.array([(c[1], c[2]) for c in customers])
    COLORS = ["Blue", "DarkSeaGreen", "DarkTurquoise",
              "IndianRed", "MediumVioletRed", "Orange", "Purple"]
    for i in range(min(idx), max(idx) + 1):
        clr = random.choice(COLORS)
        pylab.plot(xy[idx == i, 0], xy[idx == i, 1], color=clr,
                   linestyle='dashed',
                   marker='o', markerfacecolor=clr, markersize=8)
    pylab.plot(centroids[:][0], centroids[:][1], '*k', markersize=12)
    pylab.plot(depot[1], depot[2], 'sk', markersize=12)
    for i in range(len(idx)):
        pylab.annotate(str(i), xy[i])
    pylab.savefig('clust1.png')
    pylab.show()
    return idx

def main():
    idx = clustering()
    print 'idx', idx
    print 'centroids', centroids
    print 'members', members
    print 'demands', tot_demand

if __name__ == '__main__':
    main()
When the total demand is close to the total capacity, this problem begins to take on aspects of bin packing. As you've discovered, this particular algorithm's greedy approach is not always successful. I don't know whether the authors admitted that, but if they didn't, the reviewers should have caught it.
If you want to continue with something like this algorithm, I would try using integer programming to assign requesters to centroids.
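For illustration only, here is a minimal sketch of that assignment step with OR-Tools (pywraplp). The names coords, demands, centroids and capacity are placeholders for the data structures used in the question, and the distance measure is plain Euclidean:

from ortools.linear_solver import pywraplp
import math

def assign_to_centroids(coords, demands, centroids, capacity):
    """Assign every customer to exactly one centroid, minimising total distance,
    while keeping each cluster's summed demand within the vehicle capacity."""
    solver = pywraplp.Solver.CreateSolver('SCIP')
    n, k = len(coords), len(centroids)
    dist = {(i, c): math.hypot(coords[i][0] - centroids[c][0],
                               coords[i][1] - centroids[c][1])
            for i in range(n) for c in range(k)}
    y = {(i, c): solver.IntVar(0, 1, 'y_%i_%i' % (i, c))
         for i in range(n) for c in range(k)}
    for i in range(n):
        # every customer goes to exactly one cluster
        solver.Add(sum(y[i, c] for c in range(k)) == 1)
    for c in range(k):
        # total demand of a cluster must fit in one vehicle
        solver.Add(sum(y[i, c] * demands[i] for i in range(n)) <= capacity)
    solver.Minimize(sum(y[i, c] * dist[i, c] for i in range(n) for c in range(k)))
    if solver.Solve() == pywraplp.Solver.OPTIMAL:
        return [next(c for c in range(k) if y[i, c].solution_value() > 0.5)
                for i in range(n)]
    return None  # infeasible, e.g. k is too small for the total demand

The new centroids can then be recomputed from the returned assignment exactly as in step 16 of the question, and the two steps iterated until the assignment stops changing.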
Without going through all the details, the paper you cite says
if ri is not assigned then
    choose the next nearest centroid
end if
in the algorithm at the end of section 5.
There must be a next nearest centroid - if two are equidistant I presume it doesn't matter which you choose.
A common issue with fixed-size clustering is that you can often identify 'swaps' in the output, where swapping 2 points between clusters creates a better solution.
We can improve the constrained k-means algorithm from the referenced paper by formulating the 'find the cluster to assign the point to' as an assignment problem, instead of greedily picking the nearest one that isn't full.
A common way to solve this is using a min-cost flow algorithm. The result of this guarantees that there aren't any 'swaps' available that improve the result.
Luckily, someone has already implemented this and created a package for it: https://github.com/joshlk/k-means-constrained
Check out Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000). Constrained k-means clustering.
One side note is that customers with a very large demand may still not be able to be assigned to a single depot, so the value of k may need to be increased until there is a feasible solution, or 'splitting' the demand of a customer between multiple depots needs to be allowed.
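For illustration, here is a minimal sketch with that package. As far as I know it constrains the number of points per cluster (size_min / size_max) rather than a demand-weighted capacity, so for the CVRP case it is only an approximation; the size bounds below are illustrative:

import numpy as np
from k_means_constrained import KMeansConstrained

# Customer coordinates (x, y) from the question
X = np.array([[37, 52], [49, 49], [52, 64], [20, 26], [40, 30],
              [21, 47], [17, 63], [31, 62], [52, 33], [51, 21],
              [42, 41], [31, 32], [5, 25], [12, 42], [36, 16]])

clf = KMeansConstrained(n_clusters=3, size_min=4, size_max=6, random_state=0)
labels = clf.fit_predict(X)          # cluster index for each customer
print(labels)
print(clf.cluster_centers_)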
