I'm new to NumPy in general, so this is probably an easy question, but I'm clueless about how to solve it.
I'm trying to implement the k-nearest-neighbor algorithm for classification of a data set.
There are two arrays, named new_points and points, that have the shapes (30,4)
and (120,4) respectively (with 4 being the total number of properties of each element).
I'm trying to calculate the distance between each new point and all old points using numpy broadcasting:
def calc_no_loop(new_points, points):
    return np.sum((new_points - points)**2, axis=1)

# doesn't work; here is the log:
ValueError: operands could not be broadcast together with shapes (30,4) (120,4)
However, as per the rules of broadcasting, two arrays of shapes (30,4) and (120,4) are indeed incompatible,
so I would appreciate any insight on how to solve this (using .reshape, perhaps? I'm not sure).
Please note that I have already implemented the same function using one and two loops, but I can't work out how to do it without any:
def calc_two_loops(new_points, points):
    m, n = len(new_points), len(points)
    d = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            d[i, j] = np.sum((new_points[i] - points[j])**2)
    return d

def calc_one_loop(new_points, points):
    m, n = len(new_points), len(points)
    d = np.zeros((m, n))
    for i in range(m):
        d[i] = np.sum((new_points[i] - points)**2, axis=1)
    return d
Let's create an example smaller in size:
nNew = 3; nOld = 5 # Number of new / old points
# New points
new_points = np.arange(100, 100 + nNew * 4).reshape(nNew, 4)
# Old points
points = np.arange(10, 10 + nOld * 8, 2).reshape(nOld, 4)
To compute the differences alone, run:
dfr = new_points[:, np.newaxis, :] - points[np.newaxis, :, :]
So far we have differences in each property of each point (every new point with every old point).
The shape of dfr is (3, 5, 4):
first dimension: the index of the new point,
second dimension: the index of the old point,
third dimension: the difference in each property.
Then, to sum squares of differences by points, run:
d = np.power(dfr, 2).sum(axis=2)
and this is your result.
For my sample data, the result is:
array([[31334, 25926, 21030, 16646, 12774],
       [34230, 28566, 23414, 18774, 14646],
       [37254, 31334, 25926, 21030, 16646]], dtype=int32)
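Putting both steps together, a no-loop version of the function from the question could look like this (a minimal sketch based on the snippets above):

import numpy as np

def calc_no_loop(new_points, points):
    # Insert axes so (m, 1, 4) broadcasts against (1, n, 4), then sum the
    # squared differences over the property axis.
    dfr = new_points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.power(dfr, 2).sum(axis=2)  # shape (m, n)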
So you have 30 new points and 120 old points, and if I understand you correctly you want a (120, 30)-shaped array of distances as the result.
You could do:
import numpy as np

points = np.random.random(120*4).reshape(120, 4)
new_points = np.random.random(30*4).reshape(30, 4)

def calc_no_loop(new_points, points):
    res = np.zeros([len(points[:, 0]), len(new_points[:, 0])])
    for idx in range(len(points[:, 0])):
        res[idx, :] = np.sum((points[idx, :] - new_points)**2, axis=1)
    return np.sqrt(res)

test = calc_no_loop(new_points, points)
print(np.shape(test))
print(test)
Which gives
(120, 30)
[[0.67166838 0.78096694 0.94983683 ... 1.00960301 0.48076185 0.56419991]
[0.88156338 0.54951826 0.73919191 ... 0.87757896 0.76305462 0.52486626]
[0.85271938 0.56085692 0.73063341 ... 0.97884167 0.90509791 0.7505591 ]
...
[0.53968258 0.64514941 0.89225849 ... 0.99278462 0.31861253 0.44615026]
[0.51647526 0.58611128 0.83298535 ... 0.86669406 0.64931403 0.71517123]
[1.08515826 0.64626221 0.6898687 ... 0.96882542 1.08075076 0.80144746]]
But from your function name above I get the notion that you do not want a loop at all? Then you could do this instead:
def calc_no_loop(new_points, points):
    new_points1 = np.repeat(new_points[np.newaxis, ...], len(points), axis=0)
    points1 = np.repeat(points[:, np.newaxis, :], len(new_points), axis=1)
    return np.sqrt(np.sum((new_points1 - points1)**2, axis=2))

test = calc_no_loop(new_points, points)
print(np.shape(test))
print(test)
which has output:
(120, 30)
[[0.67166838 0.78096694 0.94983683 ... 1.00960301 0.48076185 0.56419991]
[0.88156338 0.54951826 0.73919191 ... 0.87757896 0.76305462 0.52486626]
[0.85271938 0.56085692 0.73063341 ... 0.97884167 0.90509791 0.7505591 ]
...
[0.53968258 0.64514941 0.89225849 ... 0.99278462 0.31861253 0.44615026]
[0.51647526 0.58611128 0.83298535 ... 0.86669406 0.64931403 0.71517123]
[1.08515826 0.64626221 0.6898687 ... 0.96882542 1.08075076 0.80144746]]
i.e. the same result. Note that I added the np.sqrt() to the result, which you may have forgotten in your example above.
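As an aside, if SciPy is available, scipy.spatial.distance.cdist computes the same (120, 30) Euclidean distance matrix in a single call (a sketch, reusing the arrays defined above):

from scipy.spatial.distance import cdist

# Euclidean by default; rows correspond to points, columns to new_points.
test = cdist(points, new_points)  # shape (120, 30)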
I'm trying to create a Voronoi diagram given a set of scatterplot points. However, several "extra unintended lines" get calculated in the process. Some of these "extra" lines appear to be the infinite edges being incorrectly calculated, but others appear seemingly at random in the middle of the plot. How can I create an extra edge only when it's needed/required to connect a polygon to the edge of the plot (e.g., the plot boundaries)?
My graph outer boundaries are:
boundaries = np.array([[0, -2], [0, 69], [105, 69], [105, -2], [0, -2]])
Here's the section dealing with the Voronoi diagram creation:
from collections import defaultdict

import numpy as np
from shapely.geometry import Polygon  # assumed source of Polygon

def voronoi_polygons(voronoi, diameter):
    centroid = voronoi.points.mean(axis=0)
    ridge_direction = defaultdict(list)
    for (p, q), rv in zip(voronoi.ridge_points, voronoi.ridge_vertices):
        u, v = sorted(rv)
        if u == -1:
            t = voronoi.points[q] - voronoi.points[p]  # tangent
            n = np.array([-t[1], t[0]]) / np.linalg.norm(t)  # normal
            midpoint = voronoi.points[[p, q]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - centroid, n)) * n
            ridge_direction[p, v].append(direction)
            ridge_direction[q, v].append(direction)
    for i, r in enumerate(voronoi.point_region):
        region = voronoi.regions[r]
        if -1 not in region:
            # Finite region.
            yield Polygon(voronoi.vertices[region])
            continue
        # Infinite region.
        inf = region.index(-1)               # Index of vertex at infinity.
        j = region[(inf - 1) % len(region)]  # Index of previous vertex.
        k = region[(inf + 1) % len(region)]  # Index of next vertex.
        if j == k:
            # Region has one Voronoi vertex with two ridges.
            dir_j, dir_k = ridge_direction[i, j]
        else:
            # Region has two Voronoi vertices, each with one ridge.
            dir_j, = ridge_direction[i, j]
            dir_k, = ridge_direction[i, k]
        # Length of ridges needed for the extra edge to lie at least
        # 'diameter' away from all Voronoi vertices.
        length = 2 * diameter / np.linalg.norm(dir_j + dir_k)
        # Polygon consists of finite part plus an extra edge.
        finite_part = voronoi.vertices[region[inf + 1:] + region[:inf]]
        extra_edge = [voronoi.vertices[j] + dir_j * length,
                      voronoi.vertices[k] + dir_k * length]
        combined_finite_edge = np.concatenate((finite_part, extra_edge))
        poly = Polygon(combined_finite_edge)
        yield poly
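For context, a minimal sketch of how this generator can be driven, clipping each yielded polygon to the plot boundary with shapely (variable names here are illustrative):

import numpy as np
from scipy.spatial import Voronoi
from shapely.geometry import Polygon

points_xy = np.array([[52.629, 24.281], [68.425, 46.078], [60.409, 36.714],
                      [72.442, 28.762], [52.993, 43.518]])  # subset of the data below
boundary_polygon = Polygon([(0, -2), (0, 69), (105, 69), (105, -2)])
diameter = np.linalg.norm(np.ptp(np.array(boundary_polygon.exterior.coords), axis=0))
for polygon in voronoi_polygons(Voronoi(points_xy), diameter):
    clipped = polygon.intersection(boundary_polygon)  # drop everything outside the plot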
Here are the points being used:
['52.629' '24.28099822998047']
['68.425' '46.077999114990234']
['60.409' '36.7140007019043']
['72.442' '28.762001037597656']
['52.993' '43.51799964904785']
['59.924' '16.972000122070312']
['61.101' '55.74899959564209']
['68.9' '13.248001098632812']
['61.323' '29.0260009765625']
['45.283' '36.97500038146973']
['52.425' '19.132999420166016']
['37.739' '28.042999267578125']
['48.972' '2.3539962768554688']
['33.865' '30.240001678466797']
['52.34' '64.94799995422363']
['52.394' '45.391000747680664']
['52.458' '34.79800033569336']
['31.353' '43.14500045776367']
['38.194' '39.24399948120117']
['98.745' '32.15999984741211']
['6.197' '32.606998443603516']
Most likely this is due to the errors associated with floating-point arithmetic while computing the Voronoi triangulation from your data (especially the second column).
Assuming that, there is no single solution for such kinds of problems. I urge you to go through this page* of the Qhull manual and try iterating through those parameters in qhull_options before generating the Voronoi object that you are passing to the function. An example would be qhull_options='Qbb Qc Qz QJ'.
Other than that, I doubt there is anything that could be modified in the function to avoid such a problem.
*This will take some time though. Just be patient.
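For instance, with scipy.spatial.Voronoi those options can be passed straight through (a sketch on a subset of your points; tune the option string to your data):

from scipy.spatial import Voronoi
import numpy as np

pts = np.array([[52.629, 24.281], [68.425, 46.078], [60.409, 36.714],
                [72.442, 28.762], [52.993, 43.518]])
# 'QJ' joggles the input to break the near-degeneracies caused by floating-point noise.
vor = Voronoi(pts, qhull_options='Qbb Qc Qz QJ')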
Figured out what was wrong: after each polygon I needed to add a null x and y value, or else it would attempt to 'stitch' one polygon to another, drawing an additional unintended line in order to do so. So the data should really look more like this:
GameTime,Half,ObjectType,JerseyNumber,X,Y,PlayerIDEvent,PlayerIDTracking,MatchIDEvent,Position,teamId,i_order,v_vor_x,v_vor_y
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,0,22.79645297,6.20866756
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,1,17.63464264,3.41230187
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,2,20.27639318,34.29191902
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,3,32.15600546,36.60432421
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,4,38.34639812,33.62806739
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,5,22.79645297,6.20866756
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,5,nan,nan
0.0,1,1,22,33.865,30.240001678466797,578478,794888,2257663,3,35179.0,,,
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,0,46.91696938,29.44801535
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,1,55.37574848,29.5855499
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,2,58.85876401,23.20381766
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,3,57.17455086,21.5228301
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,4,44.14237744,22.03925667
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,5,45.85962774,28.83613332
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,5,nan,nan
0.0,1,0,92,52.629,24.28099822998047,369351,561593,2257663,1,32446.0,,,
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,0,65.56965667,33.4292025
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,1,57.23303682,32.43809027
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,2,55.65704152,38.97814049
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,3,60.75304149,44.53251169
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,4,65.14170295,40.77562188
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,5,65.56965667,33.4292025
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,5,nan,nan
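To illustrate: when consecutive polygons are drawn with a single plot call, a NaN vertex after each polygon makes matplotlib lift the pen instead of stitching the shapes together. A minimal sketch of the idea, with made-up polygons:

import numpy as np
import matplotlib.pyplot as plt

polygons = [np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]),
            np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [2.0, 2.0]])]  # hypothetical data
sep = np.full((1, 2), np.nan)  # the "null x and y value"
stitched = np.concatenate([np.concatenate((p, sep)) for p in polygons])
plt.plot(stitched[:, 0], stitched[:, 1])  # one call, no unintended connecting lines
plt.show()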
Here's the code:
x = range(-6,7)
tmp1 = []
for i in range(len(x)):
    tmp1.append(math.exp(-(i*i)/(2*self.sigma*self.sigma)))
max_tmp1 = max(tmp1)
mod_tmp1 = []
for i in range(len(tmp1)):
    mod_tmp1.append(max_tmp1 - i)
ht1 = np.kron(np.ones((9,1)),tmp1)
sht1 = sum(ht1.flatten(1))
mean = sht1/(13*9)
ht1 = ht1 - mean
ht1 = ht1/sht1
print ht1.shape
h = np.zeros((16,16))
for i in range(0, 9):
    for j in range(0, 13):
        h[i+3, j+1] = ht1[i, j]
for i in range(0, 10):
    ag = 15*i
    np.append(h, scipy.misc.imrotate(h, ag, 'bicubic'))
R = []
print h.shape
print self.img.shape
for i in range(0, 11):
    print 'here'
    R[i] = scipy.signal.convolve2d(self.img, h[i], mode = 'same')
rt = np.zeros(self.img.shape)
x, y = self.img.shape
The error I get states:
ValueError: object of too small depth for desired array
It looks to me as if the problem is that you're setting h up wrongly. I assume you want h[i] to be a 16x16 array suitable for convolving with, but that's not what you've actually made it, for a couple of different reasons.
I suggest you change the loop with the imrotate calls to this:
h = [scipy.misc.imrotate(h, 15*i, 'bicubic') for i in range(10)]
(What your existing code does is: first set up h as a single 16x16 array; then, repeatedly: compute a rotated version, "flatten" both h and that to make 256-element vectors, compute the result of appending them to make a 512-element vector, and throw the result away. numpy.append doesn't operate in place, and defaults to flattening its arguments before it appends. Neither of those is what you want!)
The list comprehension above will give you a 10-element Python list containing rotated versions of your convolution kernel.
... Oh, I see that your loop computing R actually wants 11 kernels, not 10. Make it range(11), then. (Your original code generated rotations of 0, 0, 15, 30, ..., 135 degrees, but I'm guessing 0, 15, 30, ..., 150 degrees is more likely to be what you want.)
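Relatedly, assigning R[i] = ... on the empty list R will raise an IndexError; building R with a comprehension over the rotated kernels sidesteps that, e.g.:

R = [scipy.signal.convolve2d(self.img, kernel, mode='same') for kernel in h]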
I have a set of 2D points stored in a dictionary, and I need to find the most efficient path for sampling all points (red triangles) in terms of the shortest distance from a start/end point (yellow circle).
dict_points = OrderedDict([(0, (0.5129102892466411, 1.2791525891782567)), (1, (1.8571436538551014, 1.3979619805011203)), (2, (2.796472292985357, 1.3021773853592946)), (3, (2.2054745567697513, 0.5231652951626251)), (4, (1.1209493135130593, 0.8220950186969501)), (5, (0.16416153316980153, 0.7241249969879273))])
where the key is the ID of the point.
My strategy is very simple: I try every possible point sequence (720 for 6 points) and compute the Euclidean distance point by point, starting and ending at the start/end point (yellow point). The sequence with the shortest total distance is the most efficient.
The problem with this approach is that it gets very slow for a large number of points.
import math
import itertools

base = (2.596, 2.196)

def segments(poly):
    """A sequence of (x, y) numeric coordinate pairs."""
    return zip(poly, poly[1:] + [poly[0]])

def best_path(dict_points, base=None):
    sequence_min_distance = None
    min_dist = float('+inf')
    # Iterate over the permutations directly; additionally calling gen.next()
    # inside the loop would silently skip every other sequence.
    for seq in itertools.permutations(dict_points.keys()):
        seq_list = [dict_points[s] for s in seq]
        if base:
            seq_list.insert(0, base)
        tot_dist = 0
        for p1, p2 in segments(seq_list):
            tot_dist += math.hypot(p2[0] - p1[0], p2[1] - p1[1])
        if tot_dist < min_dist:
            sequence_min_distance = seq
            min_dist = tot_dist
    return sequence_min_distance

best_seq = best_path(dict_points)
(5, 4, 3, 2, 1, 0)
You can also take a look at the project tsp-solver
https://github.com/dmishin/tsp-solver
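For example, its greedy solver takes a pairwise distance matrix; a sketch against the points above (I'm assuming the tsp_solver.greedy module from that repository; check the exact API in the project's README):

import math
from tsp_solver.greedy import solve_tsp

pts = [base] + [dict_points[k] for k in sorted(dict_points)]
# Symmetric matrix of pairwise Euclidean distances.
D = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
order = solve_tsp(D)  # returns a visiting order (an open path, not a closed tour)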
I'm a newbie to algorithms and optimization.
I'm trying to implement capacitated k-means, but I am getting unresolved and poor results so far.
This is used as part of a CVRP (capacitated vehicle routing problem) simulation.
I'm curious whether I have interpreted the referenced algorithm wrongly.
Ref: "Improved K-Means Algorithm for Capacitated Clustering Problem" (Geetha, Poonthalir, Vanathi)
The simulated CVRP has 15 customers, with 1 depot.
Each customer has a Euclidean coordinate (x,y) and a demand.
There are 3 vehicles, each with a capacity of 90.
So the capacitated k-means is trying to cluster 15 customers into 3 vehicles, where the total demand in each cluster must not exceed the vehicle capacity.
UPDATE:
In the referenced algorithm, I couldn't find any information about what the code must do when it runs out of "next nearest centroids".
That is, when all of the "nearest centroids" have been examined in step 14.b below, while customers[1] is still unassigned.
This results in the customer with index 1 being left unassigned.
Note: customers[1] is the customer with the largest demand (30).
Q: When this condition is met, what should the code do?
Here is my interpretation of the referenced algorithm; please correct my code. Thank you.
1. Given n requesters (customers), n = customerCount, and a depot,
2. n demands,
3. n coordinates (x,y),
4. calculate the number of clusters, k = (sum of all demands) / vehicleCapacity,
5. select initial centroids,
5.a. sort customers based on demand, in descending order = d_customers,
5.b. select the first k customers from d_customers as initial centroids = centroids[0 .. k-1],
6. create binary matrix bin_matrix, dimension = (customerCount) x (k),
6.a. fill bin_matrix with all zeros,
7. start WHILE loop, condition = WHILE not converged,
7.a. converged = False,
8. start FOR loop, condition = FOR each customer,
8.a. index of customer = i,
9. calculate Euclidean distances from customers[i] to all centroids => edist,
9.a. sort edist in ascending order,
9.b. select the first centroid with the closest distance = closest_centroid,
10. start WHILE loop, condition = WHILE customers[i] is not assigned to any cluster,
11. group all the other unassigned customers = G,
11.a. consider closest_centroid as the centroid for G,
12. calculate priority Pi for each customer of G,
12.a. priority Pi = (distance from customers[i] to closest_cent) / demand[i],
12.b. select the customer with the highest priority Pi,
12.c. the customer with the highest priority has index = hpc,
12.d. Q: IF the highest-priority customer cannot be found, what must we do?
13. assign customers[hpc] to centroids[closest_centroid] if possible,
13.a. demand of customers[hpc] = d1,
13.b. sum of all demands of the centroid's members = dtot,
13.c. IF (d1 + dtot) <= vehicleCapacity, THEN..
13.d. assign customers[hpc] to centroids[closest_centroid],
13.e. update bin_matrix, row index = hpc, column index = closest_centroid, set to 1,
14. IF customers[i] is (still) not assigned to any cluster, THEN..
14.a. choose the next nearest centroid, with the next nearest distance from edist,
14.b. Q: IF there is no next nearest centroid, THEN what must we do?
15. calculate converged by comparing the previous matrix and the updated matrix bin_matrix,
15.a. IF there are no changes in bin_matrix, then set converged = True,
16. otherwise, calculate new centroids from the updated clusters,
16.a. calculate the new centroids' coordinates based on the members of each cluster,
16.b. sum_x = sum of all x-coordinates of a cluster's members,
16.c. num_c = number of all customers (members) in the cluster,
16.d. the new centroid's x-coordinate of the cluster = sum_x / num_c,
16.e. with the same formula, calculate the new centroid's y-coordinate of the cluster = sum_y / num_c,
17. iterate the main WHILE loop.
My code always ends with an unassigned customer at step 14.b.
That is, there is a customers[i] still not assigned to any centroid, and it has run out of "next nearest centroids".
And the resulting clusters are poor. Output graph:
(In the picture, a star is a centroid and the square is the depot.)
In the picture, the customer labeled "1", with demand = 30, always ends up with no assigned cluster.
Output of the program:
k_cluster 3
idx [ 1 -1 1 0 2 0 1 1 2 2 2 0 0 2 0]
centroids [(22.6, 29.2), (34.25, 60.25), (39.4, 33.4)]
members [[3, 14, 12, 5, 11], [0, 2, 6, 7], [9, 8, 4, 13, 10]]
demands [86, 65, 77]
The first and third clusters are poorly calculated.
idx with index '1' is not assigned (-1).
Q: What's wrong with my interpretation and my implementation?
Any correction, suggestion, or help will be very much appreciated. Thank you in advance.
Here is my full code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# pastebin.com/UwqUrHhh
# output graph: i.imgur.com/u3v2OFt.png
import math
import random
from operator import itemgetter
from copy import deepcopy
import numpy
import pylab

# depot and customers, [index, x, y, demand]
depot = [0, 30.0, 40.0, 0]
customers = [[1, 37.0, 52.0, 7],
             [2, 49.0, 49.0, 30], [3, 52.0, 64.0, 16],
             [4, 20.0, 26.0, 9], [5, 40.0, 30.0, 21],
             [6, 21.0, 47.0, 15], [7, 17.0, 63.0, 19],
             [8, 31.0, 62.0, 23], [9, 52.0, 33.0, 11],
             [10, 51.0, 21.0, 5], [11, 42.0, 41.0, 19],
             [12, 31.0, 32.0, 29], [13, 5.0, 25.0, 23],
             [14, 12.0, 42.0, 21], [15, 36.0, 16.0, 10]]
customerCount = 15
vehicleCount = 3
vehicleCapacity = 90
assigned = [-1] * customerCount
# number of clusters
k_cluster = 0
# binary matrix
bin_matrix = []
# coordinates of centroids
centroids = []
# total demand for each cluster, must be <= capacity
tot_demand = []
# members of each cluster
members = []
# coordinates of members of each cluster
xy_members = []

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

# capacitated k-means clustering
# http://www.dcc.ufla.br/infocomp/artigos/v8.4/art07.pdf
def cap_k_means():
    global k_cluster, bin_matrix, centroids, tot_demand
    global members, xy_members, prev_members

    # calculate number of clusters
    tot_demand = sum([c[3] for c in customers])
    k_cluster = int(math.ceil(float(tot_demand) / vehicleCapacity))
    print 'k_cluster', k_cluster

    # initial centroids = first sorted customers, based on demand
    d_customers = sorted(customers, key=itemgetter(3), reverse=True)
    centroids, tot_demand, members, xy_members = [], [], [], []
    for i in range(k_cluster):
        centroids.append(d_customers[i][1:3])  # [x, y]
        # initial total demand and members for each cluster
        tot_demand.append(0)
        members.append([])
        xy_members.append([])

    # binary matrix, dimension = customerCount x k_cluster
    bin_matrix = [[0] * k_cluster for i in range(len(customers))]

    converged = False
    while not converged:  # until no changes in formed clusters
        prev_matrix = deepcopy(bin_matrix)
        for i in range(len(customers)):
            edist = []  # list of distances to clusters
            if assigned[i] == -1:  # if not assigned yet
                # calculate the Euclidean distance to each of the k clusters
                for k in range(k_cluster):
                    p1 = (customers[i][1], customers[i][2])  # x, y
                    p2 = (centroids[k][0], centroids[k][1])
                    edist.append((distance(p1, p2), k))
                # sort, based on closest distance
                edist = sorted(edist, key=itemgetter(0))
                closest_centroid = 0  # first index of edist
                # loop while customers[i] is not assigned
                while assigned[i] == -1:
                    # calculate the priority of all unassigned customers (G)
                    max_prior = (0, -1)  # value, index
                    for n in range(len(customers)):
                        pc = customers[n]
                        if assigned[n] == -1:  # if unassigned
                            # get index of current centroid
                            c = edist[closest_centroid][1]
                            cen = centroids[c]  # x, y
                            # distance_cost / demand
                            p = distance((pc[1], pc[2]), cen) / pc[3]
                            # find highest priority
                            if p > max_prior[0]:
                                max_prior = (p, n)  # priority, customer index
                    # if no highest-priority customer is found, what should we do ???
                    if max_prior[1] == -1:
                        break
                    # try to assign the highest-priority customer to the current cluster
                    hpc = max_prior[1]  # index of highest-priority customer
                    c = edist[closest_centroid][1]  # index of current cluster
                    # constraint: total demand in a cluster <= capacity
                    if tot_demand[c] + customers[hpc][3] <= vehicleCapacity:
                        # assign new member to cluster
                        members[c].append(hpc)  # add index of customer
                        xy = (customers[hpc][1], customers[hpc][2])  # x, y
                        xy_members[c].append(xy)
                        tot_demand[c] += customers[hpc][3]
                        assigned[hpc] = c  # record the cluster of the assigned customer
                        # update binary matrix
                        bin_matrix[hpc][c] = 1
                    # if customer is not assigned then,
                    if assigned[i] == -1:
                        if closest_centroid < len(edist) - 1:
                            # choose the next nearest centroid
                            closest_centroid += 1
                        # if run out of closest centroids, what must we do ???
                        else:
                            break  # exit without centroid ???
                # end while
        # end for

        # calculate the new centroids from the formed clusters
        for j in range(k_cluster):
            xj = sum([cn[0] for cn in xy_members[j]])
            yj = sum([cn[1] for cn in xy_members[j]])
            xj = float(xj) / len(xy_members[j])
            yj = float(yj) / len(xy_members[j])
            centroids[j] = (xj, yj)

        # calculate converged
        converged = numpy.array_equal(numpy.array(prev_matrix), numpy.array(bin_matrix))
    # end while

def clustering():
    cap_k_means()
    # debug plot
    idx = numpy.array([c for c in assigned])
    xy = numpy.array([(c[1], c[2]) for c in customers])
    COLORS = ["Blue", "DarkSeaGreen", "DarkTurquoise",
              "IndianRed", "MediumVioletRed", "Orange", "Purple"]
    for i in range(min(idx), max(idx) + 1):
        clr = random.choice(COLORS)
        pylab.plot(xy[idx == i, 0], xy[idx == i, 1], color=clr,
                   linestyle='dashed',
                   marker='o', markerfacecolor=clr, markersize=8)
    pylab.plot(centroids[:][0], centroids[:][1], '*k', markersize=12)
    pylab.plot(depot[1], depot[2], 'sk', markersize=12)
    for i in range(len(idx)):
        pylab.annotate(str(i), xy[i])
    pylab.savefig('clust1.png')
    pylab.show()
    return idx

def main():
    idx = clustering()
    print 'idx', idx
    print 'centroids', centroids
    print 'members', members
    print 'demands', tot_demand

if __name__ == '__main__':
    main()
When the total demand is close to the total capacity, this problem begins to take on aspects of bin packing. As you've discovered, this particular algorithm's greedy approach is not always successful. I don't know whether the authors admitted that, but if they didn't, the reviewers should have caught it.
If you want to continue with something like this algorithm, I would try using integer programming to assign requesters to centroids.
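To make that concrete, here is a sketch of the assignment step as a small integer program using PuLP (PuLP itself, the helper name, and the dist matrix are my assumptions, not part of the original code; customers[i][3] is the demand, matching the data layout above):

import pulp

def assign_customers(customers, centroids, dist, capacity):
    # dist[i][j]: distance from customer i to centroid j (hypothetical input).
    n, k = len(customers), len(centroids)
    prob = pulp.LpProblem("assignment", pulp.LpMinimize)
    # x[i][j] == 1 means customer i is served from centroid j.
    x = [[pulp.LpVariable("x_%d_%d" % (i, j), cat="Binary") for j in range(k)]
         for i in range(n)]
    # Objective: minimize the total customer-to-centroid distance.
    prob += pulp.lpSum(dist[i][j] * x[i][j] for i in range(n) for j in range(k))
    # Each customer is assigned to exactly one centroid.
    for i in range(n):
        prob += pulp.lpSum(x[i]) == 1
    # The demand assigned to each centroid must fit in one vehicle.
    for j in range(k):
        prob += pulp.lpSum(customers[i][3] * x[i][j] for i in range(n)) <= capacity
    prob.solve()
    return [next(j for j in range(k) if pulp.value(x[i][j]) == 1)
            for i in range(n)]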
Without going through all the details, the paper you cite says
if ri is not assigned then
choose the next nearest centroid
end if
in the algorithm at the end of section 5.
There must be a next nearest centroid - if two are equidistant I presume it doesn't matter which you choose.
A common issue with fixed-size clustering is that you can often identify 'swaps' in the output, where swapping 2 points between clusters creates a better solution.
We can improve the constrained k-means algorithm from the referenced paper by formulating the 'find the cluster to assign the point to' step as an assignment problem, instead of greedily picking the nearest cluster that isn't full.
A common way to solve this is using a min-cost flow algorithm. The result of this guarantees that there aren't any 'swaps' available that improve the result.
Luckily, someone has already implemented this and created a package for it: https://github.com/joshlk/k-means-constrained
Check out Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000). Constrained k-means clustering.
One side note is that customers with a very large demand may still not be able to be assigned to a single depot, so the value of k may need to be increased until there is a feasible solution, or 'splitting' a customer's demand between multiple depots needs to be allowed.
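For reference, a sketch of that package's interface (note that it constrains the number of points per cluster rather than a demand-weighted sum, so it only approximates the capacity constraint in this CVRP setting):

import numpy as np
from k_means_constrained import KMeansConstrained

X = np.array([(c[1], c[2]) for c in customers])  # coordinates from the data above
# 15 customers into 3 clusters of between 3 and 7 points each (point counts, not demand).
clf = KMeansConstrained(n_clusters=3, size_min=3, size_max=7, random_state=0)
labels = clf.fit_predict(X)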