Which programming structure for a clustering algorithm - Python

I am trying to implement the following (divisive) clustering algorithm (below is a short form of the algorithm; the full description is available here):
Start with a sample x_i, i = 1, ..., n, regarded as a single cluster of n data points, and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.
First determine the distance between all pairs of data points and choose the pair with the largest distance (Dmax) between them.
Compare Dmax to T. If Dmax > T, then divide the single cluster in two by using the selected pair (x_i, x_j) as the first elements of two new clusters. The remaining n - 2 data points are put into one of the two new clusters: x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l); otherwise it is added to the new cluster containing x_j.
At the second stage, the values D(x_i, x_j) are found within one of the two new clusters to find the pair in that cluster with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. The procedure then repeats on the clusters generated from this iteration.
The output is a hierarchy of clustered data records. I kindly ask for advice on how to implement this clustering algorithm.
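For reference, a single split step of the algorithm above might look roughly like the sketch below (illustrative only; split_once and its arguments are names I made up, and D is assumed to be a dict mapping index pairs (i, j) with i < j to dissimilarities):
def split_once(indices, D, T):
    '''Split one cluster (a list of point indices) if its diameter exceeds T.'''
    pairs = [(i, j) for i in indices for j in indices if i < j]
    if not pairs:
        return None
    i, j = max(pairs, key=lambda p: D[p])           # pair with the largest distance
    if D[(i, j)] <= T:
        return None                                 # diameter below threshold: stop splitting
    left, right = [i], [j]
    for l in indices:
        if l in (i, j):
            continue
        d_il = D[(min(i, l), max(i, l))]
        d_jl = D[(min(j, l), max(j, l))]
        (left if d_il < d_jl else right).append(l)  # attach to the nearer seed point
    return left, right
Applying split_once recursively to the clusters it returns (and stopping wherever it returns None) yields the hierarchy.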
EDIT 1: I attach a Python function that defines the distance (correlation coefficient) and code that finds the maximal distance in the data matrix.
# Read data from GitHub
import pandas as pd
from math import sqrt

df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]

# Define correlation coefficient as distance of choice
def pearson(v1, v2):
    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)
    # Sums of the squares
    sum1Sq = sum([pow(v, 2) for v in v1])
    sum2Sq = sum([pow(v, 2) for v in v2])
    # Sum of the products
    pSum = sum([v1[i] * v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / len(v1))
    den = sqrt((sum1Sq - pow(sum1, 2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
    if den == 0: return 0
    return num / den

# Find largest distance
dist = {}
max_dist = pearson(data[0], data[0])
# Loop over upper triangle of data matrix
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        # Compute distance for each pair
        dist_curr = pearson(data[i], data[j])
        # Store distance in dict
        dist[(i, j)] = dist_curr
        # Store max distance
        if dist_curr > max_dist:
            max_dist = dist_curr
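Note that the Pearson r computed above is a similarity (1.0 for identical profiles), so using it directly as a "distance" makes the loop pick the most similar pair. A common convention, which I assume is what is wanted here, is to use 1 - r as the dissimilarity:
def pearson_distance(v1, v2):
    # 1 - r: 0 for perfectly correlated vectors, up to 2 for perfectly anti-correlated ones
    return 1.0 - pearson(v1, v2)
Also note that max_dist is initialised to pearson(data[0], data[0]), which is 1.0 (the largest possible r), so dist_curr > max_dist never fires; with 1 - r as the distance the same initialisation would be 0.0 and the comparison works as intended.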
EDIT 2: Pasted below are functions from Dschoni's answer.
import numpy

# Euclidean distance
def euclidean(x, y):
    x = numpy.array(x)
    y = numpy.array(y)
    return numpy.sqrt(numpy.sum((x - y)**2))

# Create matrix
def dist_mat(data):
    dist = {}
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            dist[(i, j)] = euclidean(data[i], data[j])
    return dist

# Returns (i, k) for max distance
def my_max(dist):
    # pick the key with the largest value, not the largest key
    return max(dist, key=dist.get)

# Sort function
list1 = []
list2 = []
def sort(rcd, i, k):
    list1.append(i)
    list2.append(k)
    for j in range(len(rcd)):
        if j in (i, k):  # skip the two seed points, already placed above
            continue
        if euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k]):
            list1.append(j)
        else:
            list2.append(j)
EDIT 3:
When I run the code provided by @Dschoni, the algorithm works as expected. I then modified the create_distance_list function so that we can compute the distance between multivariate data points; I use the Euclidean distance. For a toy example I load the iris data and cluster only the first 50 instances of the dataset.
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',')
df = df.drop(4, 1)
df = df[1:50]
data = df.values.tolist()
idl=range(len(data))
dist = create_distance_list(data)
print sort(dist, idl)
The result is as follows:
[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16,
39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41],
[5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44],
[31], [32], [19], [30], [35], [9, 47]]
Some data points are still clustered together. I work around this by adding a small amount of random noise to the actual dictionary in the sort function:
# Add small random noise
for key in actual:
actual[key] += np.random.normal(0, 0.005)
Any idea how to solve this problem properly?

A proper working example for the Euclidean distance:
import numpy as np
#For random number generation

def create_distance_list(l):
    '''Create a distance list for every
    unique tuple of pairs'''
    dist = {}
    for i in range(len(l)):
        for k in range(i + 1, len(l)):
            dist[(i, k)] = abs(l[i] - l[k])
    return dist

def maximum(distance_dict):
    '''Returns the key of the maximum value if unique
    or a random key with the maximum value.'''
    maximum = max(distance_dict.values())
    max_key = [key for key, value in distance_dict.items() if value == maximum]
    if len(max_key) > 1:
        random_key = np.random.random_integers(0, len(max_key) - 1)
        return (max_key[random_key],)
    else:
        return max_key

def construct_new_dict(distance_dict, index_list):
    '''Helper function to create a distance map for a subset
    of data points.'''
    new = {}
    for i in range(len(index_list)):
        for k in range(i + 1, len(index_list)):
            m = index_list[i]
            n = index_list[k]
            new[(m, n)] = distance_dict[(m, n)]
    return new

def sort(distance_dict, idl, threshold=4):
    result = [idl]
    i = 0
    try:
        while True:
            if len(result[i]) >= 2:
                actual = construct_new_dict(distance_dict, result[i])
                act_max = maximum(actual)
                if distance_dict[act_max[0]] > threshold:
                    j = act_max[0][0]
                    k = act_max[0][1]
                    result[i].remove(j)
                    result[i].remove(k)
                    l1 = [j]
                    l2 = [k]
                    for iterr in range(len(result[i])):
                        s = result[i][iterr]
                        if s > j:
                            c1 = (j, s)
                        else:
                            c1 = (s, j)
                        if s > k:
                            c2 = (k, s)
                        else:
                            c2 = (s, k)
                        if actual[c1] < actual[c2]:
                            l1.append(s)
                        else:
                            l2.append(s)
                    result.remove(result[i])
                    #What to do if distance is equal?
                    l1.sort()
                    l2.sort()
                    result.append(l1)
                    result.append(l2)
                else:
                    i += 1
            else:
                i += 1
    except IndexError:
        # i eventually runs past the last cluster once nothing can be split any more
        return result

#This is the dataset
a = [1, 2, 2.5, 5]
#Giving each entry a unique ID
idl = range(len(a))
dist = create_distance_list(a)
print sort(dist, idl)
I wrote the code for readability; there is a lot of stuff that could be made faster, more reliable, and prettier. This is just to give you an idea of how it can be done.

Some data points are still clustered together. I solve this problem by adding a small amount of noise to the actual dictionary in the sort function.
If Dmax > T then divide single cluster in two
Your description doesn't necessarily produce n (singleton) clusters. If a cluster has two records whose distance is less than T,
they will stay clustered together (am I missing something?)

Related

Multi-knapsack problem with aggregate objective function/objective with a soft limit

I am trying to solve a variant of the multi-knapsack example in Google OR-tools. The one feature I cannot seem to encode is a soft limit on the value.
In the original example, an item has a weight that is used to calculate a constraint and a value that is used to calculate the optimum solution. In my variation I have multiple weights/capacities that form quotas and compatibilities for items of certain types. In addition, each bin has a funding target and each item has a value. I would like to minimise the funding shortfall for each bin:
# pseudocode!
minimise: sum(max(0, funding_capacity[j] - sum(item[i, j] * item_value[i] for i in num_items)) for j in num_bins)
The key difference between this approach and the example is that if item_1 has a value of 110 and bin_A has a funding requirement of 100, item_1 can fit into bin_A and makes the funding shortfall go to zero. item_2, with a value of 50, could also fit into bin_A (as long as the other constraints are met), but the optimiser would see no improvement in the objective function. I have attempted to use the objective.SetCoefficient method on a calculation of the funding shortfall, but I keep getting errors that I think are due to this method not accepting aggregate expressions like sum.
How do I implement the funding shortfall objective above, either in the objective function or alternatively in the constraints? How can I form an objective function using a summary calculation? The ideal answer would be a code snippet for OR-tools in Python but clear illustrative answers from OR-tools in other programming languages would also be helpful.
Working code follows, but here's how you would proceed with the formulation.
Formulation changes to the Multiple Knapsack problem given here
You will need two sets of new variables for each bin. Let's call them shortfall[j] (continuous) and filled[j] (boolean).
shortfall[j] is simply funding_requirement[j] - sum_i(funding of the items i placed in bin j).
filled[j] is a Boolean, which we want to be 1 if the total funding of the items in the bin is greater than its funding requirement, and 0 otherwise.
We have to resort to a standard trick in Integer Programming that involves using a Big M. (A large number)
if total_item_funding >= requirement, filled = 1
if total_item_funding < requirement, filled = 0
This can be expressed in a linear constraint:
shortfall + BigM * filled >= 0
Note that if the shortfall goes negative, it forces the filled variable to become 1. If shortfall is positive, filled can stay 0. (We will enforce this using the objective function.)
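As a quick numerical check (using the BIG_M = 1e4 from the code below): if a bin's funding requirement is 100 and the items packed into it fund 110, then shortfall = -10 and the constraint -10 + 10000 * filled >= 0 can only hold with filled = 1; if the items fund only 80, then shortfall = 20 and filled may stay at 0.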
In the objective function for a Maximization problem, you penalize the filled variable.
Obj: Max sum(i,j) Xij * value_i + sum(j) filled_j * -100
So, this multiple knapsack formulation is incentivized to go close to each bin's funding requirement, but if it crosses the requirement, it pays a penalty.
You can play around with the objective function variables and penalties.
Formulation using Google-OR Tools
Working Python Code. For simplicity, I only made 3 bins.
from ortools.linear_solver import pywraplp

def create_data_model():
    """Create the data for the example."""
    data = {}
    weights = [48, 30, 42, 36, 36, 48, 42, 42, 36, 24, 30, 30, 42, 36, 36]
    values = [10, 30, 25, 50, 35, 30, 15, 40, 30, 35, 45, 10, 20, 30, 25]
    item_funding = [50, 17, 38, 45, 65, 60, 15, 30, 10, 25, 75, 30, 40, 40, 35]
    data['weights'] = weights
    data['values'] = values
    data['i_fund'] = item_funding
    data['items'] = list(range(len(weights)))
    data['num_items'] = len(weights)
    num_bins = 3
    data['bins'] = list(range(num_bins))
    data['bin_capacities'] = [100, 100, 80,]
    data['bin_funding'] = [100, 100, 50,]
    return data

def main():
    data = create_data_model()
    # Create the mip solver with the SCIP backend.
    solver = pywraplp.Solver.CreateSolver('SCIP')

    # Variables
    # x[i, j] = 1 if item i is packed in bin j.
    x, short, filled = {}, {}, {}
    for i in data['items']:
        for j in data['bins']:
            x[(i, j)] = solver.IntVar(0, 1, 'x_%i_%i' % (i, j))
    BIG_M, MAX_SHORT = 1e4, 500
    for j in data['bins']:
        short[j] = solver.NumVar(-MAX_SHORT, MAX_SHORT,
                                 'bin_shortfall_%i' % (j))
        filled[j] = solver.IntVar(0, 1, 'filled[%i]' % (j))

    # Constraints
    # Each item can be in at most one bin.
    for i in data['items']:
        solver.Add(sum(x[i, j] for j in data['bins']) <= 1)
    for j in data['bins']:
        # The amount packed in each bin cannot exceed its capacity.
        solver.Add(
            sum(x[(i, j)] * data['weights'][i]
                for i in data['items']) <= data['bin_capacities'][j])
        # Define bin shortfalls as equality constraints.
        solver.Add(
            data['bin_funding'][j] - sum(x[(i, j)] * data['i_fund'][i]
                                         for i in data['items']) == short[j])
        # If shortfall is negative, filled is forced to be true.
        solver.Add(
            short[j] + BIG_M * filled[j] >= 0)

    # Objective
    objective = solver.Objective()
    for i in data['items']:
        for j in data['bins']:
            objective.SetCoefficient(x[(i, j)], data['values'][i])
    for j in data['bins']:
        # objective.SetCoefficient(short[j], 1)
        objective.SetCoefficient(filled[j], -100)
    objective.SetMaximization()

    print('Number of variables =', solver.NumVariables())
    status = solver.Solve()
    if status == pywraplp.Solver.OPTIMAL:
        print('OPTIMAL SOLUTION FOUND\n\n')
        total_weight = 0
        for j in data['bins']:
            bin_weight = 0
            bin_value = 0
            bin_fund = 0
            print('Bin ', j, '\n')
            print(f"Funding {data['bin_funding'][j]} Shortfall \
{short[j].solution_value()}")
            for i in data['items']:
                if x[i, j].solution_value() > 0:
                    print('Item', i, '- weight:', data['weights'][i], ' value:',
                          data['values'][i], data['i_fund'][i])
                    bin_weight += data['weights'][i]
                    bin_value += data['values'][i]
                    bin_fund += data['i_fund'][i]
            print('Packed bin weight:', bin_weight)
            print('Packed bin value:', bin_value)
            print('Packed bin Funding:', bin_fund)
            print()
            total_weight += bin_weight
        print('Total packed weight:', total_weight)
    else:
        print('The problem does not have an optimal solution.')

if __name__ == '__main__':
    main()
Hope that helps you move forward.

Finding anomalous values from sinusoidal data

How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. I can plot the data and spot any anomalies or noise, but how can I do it without plotting the data? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only some values in the time vector are anomalous, the rest of the values follow a regular progression. That means that if we group all the data points in the vector into clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represent the real deal), we can use that average to do a triad detection, within a given threshold, over the vector and detect which of the elements are anomalous.
For this we need two functions: calculate_average_step, which will calculate that average for the biggest cluster of close values, and detect_anomalous_values, which will yield the indexes of the anomalous values in our vector, based on the average calculated earlier.
After we have detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and from the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def calculate_average_step(array, threshold=5):
    """
    Determine the average step by doing a weighted average based on clustering of averages.
    array: our array
    threshold: the +/- offset for grouping clusters. Applicable to all elements in the array.
    """
    # determine all the steps
    steps = []
    for i in range(0, len(array) - 1):
        steps.append(abs(array[i] - array[i+1]))
    # determine the steps clusters
    clusters = []
    skip_indexes = []
    cluster_index = 0
    for i in range(len(steps)):
        if i in skip_indexes:
            continue
        # determine the cluster band (based on threshold)
        cluster_lower = steps[i] - (steps[i]/100) * threshold
        cluster_upper = steps[i] + (steps[i]/100) * threshold
        # create the new cluster
        clusters.append([])
        clusters[cluster_index].append(steps[i])
        # try to match elements from the rest of the array
        for j in range(i + 1, len(steps)):
            if not (cluster_lower <= steps[j] <= cluster_upper):
                continue
            clusters[cluster_index].append(steps[j])
            skip_indexes.append(j)
        cluster_index += 1  # increment the cluster id
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    biggest_cluster = clusters[0] if len(clusters) > 0 else None
    if biggest_cluster is None:
        return None
    return sum(biggest_cluster) / len(biggest_cluster)  # return our most common average

def detect_anomalous_values(array, regular_step, threshold=5):
    """
    Will scan every triad (3 points) in the array to detect anomalies.
    array: the array to iterate over.
    regular_step: the step around which we form the upper/lower band for filtering
    threshold: +/- variation between the steps of the first and median element and median and third element.
    """
    assert(len(array) >= 3)  # must have at least 3 elements
    anomalous_indexes = []
    step_lower = regular_step - (regular_step / 100) * threshold
    step_upper = regular_step + (regular_step / 100) * threshold
    # detection will be forward from i (hence 3 elements must be available for the detection)
    for i in range(0, len(array) - 2):
        a = array[i]
        b = array[i+1]
        c = array[i+2]
        first_step = abs(a - b)
        second_step = abs(b - c)
        first_belonging = step_lower <= first_step <= step_upper
        second_belonging = step_lower <= second_step <= step_upper
        # detect that both steps are alright
        if first_belonging and second_belonging:
            continue  # all is good here, nothing to do
        # detect if the first point in the triad is bad
        if not first_belonging and second_belonging:
            anomalous_indexes.append(i)
        # detect if the last point in the triad is bad
        if first_belonging and not second_belonging:
            anomalous_indexes.append(i+2)
        # detect if the mid point in the triad is bad (or everything is bad)
        if not first_belonging and not second_belonging:
            anomalous_indexes.append(i+1)
            # we won't add the others here because they will be detected by
            # the rest of the triad scans
    return sorted(set(anomalous_indexes))  # return unique indexes

if __name__ == "__main__":
    N = 10  # Set signal sample length
    t1 = -np.pi  # Simulation begins at t1
    t2 = np.pi   # Simulation ends at t2
    in_array = np.linspace(t1, t2, N)
    # add some noise
    noise_input = random.uniform(-.5, .5)
    in_array[random.randint(0, len(in_array)-1)] = noise_input
    noisy_out_array = np.sin(in_array)
    # display noisy sin
    plt.figure()
    plt.plot(in_array, noisy_out_array, color='red', marker="o")
    plt.title("noisy numpy.sin()")
    # detect anomalous values
    average_step = calculate_average_step(in_array)
    anomalous_indexes = detect_anomalous_values(in_array, average_step)
    # replace anomalous points with an estimated value based on our calculated average
    for anomalous in anomalous_indexes:
        # try forward extrapolation
        try:
            in_array[anomalous] = in_array[anomalous-1] + average_step
        # else try backward extrapolation
        except IndexError:
            in_array[anomalous] = in_array[anomalous+1] - average_step
    # generate sine wave
    out_array = np.sin(in_array)
    plt.figure()
    plt.plot(in_array, out_array, color='green', marker="o")
    plt.title("cleaned numpy.sin()")
    plt.show()
Noisy sine:
Cleaned sine:
Your problem lies in the time vector (which is one-dimensional). You will need to apply some sort of filter on that vector.
First thing that came to mind was medfilt (median filter) from scipy and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
The problem with this filter, though, is that if we apply some noise values to the edges of the vector, like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], then the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.], and obviously this is not OK for the sine plot, since it will produce the same artifacts for the sine values array.
A better approach to this problem is to treat the time vector as a y output and its index values as the x input, and do a linear regression on the "time linear function" (note the quotes: it just means we're faking the two-dimensional model by applying a fake X vector). The code uses scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
the output will be:
the resulting signal will be an approximation of the original, as it is with any form of extrapolation/interpolation/regression filtering.

How to visualize the cluster result as a graph with different node color based on its cluster?

I have geodesic distances of graph data in .csv format.
I want to reduce it to 2D using multidimensional scaling (MDS) and cluster it using k-medoids.
This is my code:
# coding: utf-8
import numpy as np
import csv
from sklearn import manifold
from sklearn.metrics.pairwise import pairwise_distances
import kmedoidss
rawdata = csv.reader(open('data.csv', 'r').readlines()[1:])
# Process the data into a 2D array, omitting the header row
data, labels = [], []
for row in rawdata:
    labels.append(row[1])
    data.append([int(i) for i in row[1:]])
#print data
# Now run very basic MDS
# Documentation here: http://scikit-learn.org/dev/modules/generated/sklearn.manifold.MDS.html#sklearn.manifold.MDS
mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
pos = mds.fit_transform(data)
# distance matrix
D = pairwise_distances(pos, metric='euclidean')
# split into c clusters
M, C = kmedoidss.kMedoids(D, 3)
print('Initial data:')
for index, point_idx in enumerate(pos, 1):
    print(index, point_idx)
print('\nmedoids:')
for point_idx in M:
    print('{} at index {}'.format(pos[point_idx], point_idx + 1))
print('')
print('clustering result:')
for label in C:
    for point_idx in C[label]:
        print('cluster- {}:{} index- {}'.format(label, pos[point_idx], point_idx + 1))
kmedoidss.py
import numpy as np
import random

def kMedoids(D, k, tmax=100):
    # determine dimensions of distance matrix D
    m, n = D.shape
    # randomly initialize an array of k medoid indices
    M = np.sort(np.random.choice(n, k))
    # create a copy of the array of medoid indices
    Mnew = np.copy(M)
    # initialize a dictionary to represent clusters
    C = {}
    for t in xrange(tmax):
        # determine clusters, i. e. arrays of data indices
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]
        # update cluster medoids
        for kappa in range(k):
            J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
            j = np.argmin(J)
            Mnew[kappa] = C[kappa][j]
        np.sort(Mnew)
        # check for convergence
        if np.array_equal(M, Mnew):
            break
        M = np.copy(Mnew)
    else:
        # final update of cluster memberships
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]
    # return results
    return M, C
how to visualize the cluster result as a graph with different node color based on its cluster?
You don't need MDS to run kMedoids - just run it on the original distance matrix (kMedoids can also be made to work on a similarity matrix by switching min for max).
Use MDS only for plotting.
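A minimal sketch of that ordering, assuming the variable data from the question holds the precomputed geodesic distance matrix read from data.csv:
import numpy as np
from sklearn import manifold
import kmedoidss   # the kMedoids implementation from the question

D = np.array(data)                     # original (precomputed) distance matrix
M, C = kmedoidss.kMedoids(D, 3)        # cluster directly on the original distances
mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
pos = mds.fit_transform(D)             # 2D coordinates, used only for plotting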
The usual approach for visualization is to use a loop over clusters, and plot each cluster in a different color; or to use a color predicate. There are many examples in the scipy documentation.
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
y_pred = labels.astype(np.int)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)
where X is your pos variable (the 2D MDS result) and labels is an integer cluster number for every point. Since you don't have your data in this "labels" layout, consider using a loop instead:
for label, pts in C.items():
    plt.scatter(pos[pts, 0], pos[pts, 1], color=colors[label])
plt.show()

Organize 3D points and find them by distance from a position

I'm working in a 3D context. I have some objects in this space that are represented by their x, y, z positions.
# My objects names (in my real context it's pheromone "point")
A = 1
B = 2
C = 3
D = 4
# My actual way to stock their positions
pheromones_positions = {
(25, 25, 60): [A, D],
(10, 90, 30): [B],
(5, 85, 8): [C]
}
My objective is to find which points (pheromones) are near (within a given distance of) a given location. I currently do this simply with:
from math import sqrt

def calc_distance(a, b):
    return sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2)

def found_in_dict(search, points, distance):
    for point in points:
        if calc_distance(search, point) <= distance:
            return points[point]

founds = found_in_dict((20, 20, 55), pheromones_positions, 10)
# found [1, 4] (A and D)
But with a lot of pheromones it's very slow (they are tested one by one ...). How can I organize these 3D positions so that I can find the positions within a given distance of a given position more quickly?
Are there algorithms or libraries (NumPy?) that can help me here?
You should compute all (squared) distances at once. With NumPy you can simply subtract the target point of size 1x3 from the (nx3) array of all position coordinates and sum the squared coordinate differences to obtain a list with n elements:
import numpy as np
squaredDistances = np.sum((np.array(list(pheromones_positions.keys())) - (20, 20, 55))**2, axis=1)
idx = np.where(squaredDistances < 10**2)[0]
print([p for i in idx for p in list(pheromones_positions.values())[i]])
Output:
[1, 4]
By the way: since your return statement is inside the for-loop over all points, it will stop iterating after finding the first matching point, so you might miss a second or third match.
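As a sketch of that fix in your pure-Python version (reusing calc_distance from the question), you could collect every match instead of returning on the first one:
def found_in_dict_all(search, points, distance):
    # collect every pheromone whose position is within `distance` of `search`
    found = []
    for point, pheromones in points.items():
        if calc_distance(search, point) <= distance:
            found.extend(pheromones)
    return found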

Capacitated k-means clustering?

I'm a newbie to algorithms and optimization.
I'm trying to implement capacitated k-means, but I'm getting unresolved and poor results so far.
This is used as part of a CVRP simulation (capacitated vehicle routing problem).
I'm wondering whether I have interpreted the referenced algorithm wrongly.
Ref: "Improved K-Means Algorithm for Capacitated Clustering Problem"
(Geetha, Poonthalir, Vanathi)
The simulated CVRP has 15 customers, with 1 depot.
Each customer has a Euclidean coordinate (x, y) and a demand.
There are 3 vehicles, each with a capacity of 90.
So capacitated k-means tries to cluster the 15 customers into 3 vehicles, such that the total demand in each cluster does not exceed the vehicle capacity.
UPDATE:
In the referenced algorithm, I couldn't find any information about what the code must do when it runs out of "next nearest centroids".
That is, when all of the "nearest centroids" have been examined (step 14.b below) while customers[1] is still unassigned.
This results in the customer with index 1 being unassigned.
Note: customers[1] is the customer with the largest demand (30).
Q: When this condition is met, what should the code do?
Here is my interpretation of the referenced algorithm; please correct my code. Thank you.
1. Given n requesters (customers), n = customerCount, and a depot,
2. n demands,
3. n coordinates (x, y).
4. Calculate the number of clusters, k = (sum of all demands) / vehicleCapacity.
5. Select initial centroids:
  5.a. sort customers based on demand, in descending order = d_customers,
  5.b. select the first k customers from d_customers as initial centroids = centroids[0 .. k-1].
6. Create a binary matrix bin_matrix, dimension = (customerCount) x (k):
  6.a. fill bin_matrix with all zeros.
7. Start WHILE loop, condition = WHILE not converged:
  7.a. converged = False.
8. Start FOR loop, condition = FOR each customer:
  8.a. index of customer = i.
9. Calculate Euclidean distances from customers[i] to all centroids => edist:
  9.a. sort edist in ascending order,
  9.b. select the first centroid with the closest distance = closest_centroid.
10. Start WHILE loop, condition = WHILE customers[i] is not assigned to any cluster.
11. Group all the other unassigned customers = G:
  11.a. consider closest_centroid as the centroid for G.
12. Calculate the priority Pi for each customer of G:
  12.a. priority Pi = (distance from customers[i] to closest_cent) / demand[i],
  12.b. select the customer with the highest priority Pi,
  12.c. the customer with the highest priority has index = hpc,
  12.d. Q: IF a highest-priority customer cannot be found, what must we do?
13. Assign customers[hpc] to centroids[closest_centroid] if possible:
  13.a. demand of customers[hpc] = d1,
  13.b. sum of all demands of the centroid's members = dtot,
  13.c. IF (d1 + dtot) <= vehicleCapacity, THEN
  13.d. assign customers[hpc] to centroids[closest_centroid],
  13.e. update bin_matrix, row index = hpc, column index = closest_centroid, set to 1.
14. IF customers[i] is (still) not assigned to any cluster, THEN:
  14.a. choose the next nearest centroid, with the next nearest distance from edist,
  14.b. Q: IF there is no next nearest centroid, THEN what must we do?
15. Calculate converged by comparing the previous matrix and the updated bin_matrix:
  15.a. IF there are no changes in bin_matrix, THEN set converged = True.
16. Otherwise, calculate new centroids from the updated clusters:
  16.a. calculate the new centroids' coordinates based on the members of each cluster,
  16.b. sum_x = sum of all x-coordinates of a cluster's members,
  16.c. num_c = the number of customers (members) in the cluster,
  16.d. the new centroid's x-coordinate of the cluster = sum_x / num_c,
  16.e. with the same formula, calculate the new centroid's y-coordinate of the cluster = sum_y / num_c.
17. Iterate the main WHILE loop.
My code always ends with an unassigned customer at step 14.b.
That is, there is a customers[i] still not assigned to any centroid, and it has run out of "next nearest centroids".
The resulting clusters are poor. Output graph:
In the picture, a star is a centroid and the square is the depot.
In the picture, the customer labeled "1", with demand = 30, always ends up with no assigned cluster.
Output of the program,
k_cluster 3
idx [ 1 -1 1 0 2 0 1 1 2 2 2 0 0 2 0]
centroids [(22.6, 29.2), (34.25, 60.25), (39.4, 33.4)]
members [[3, 14, 12, 5, 11], [0, 2, 6, 7], [9, 8, 4, 13, 10]]
demands [86, 65, 77]
The first and third clusters are poorly calculated.
idx with index '1' is not assigned (-1).
Q: What's wrong with my interpretation and my implementation?
Any correction, suggestion, or help will be very much appreciated; thank you in advance.
Here is my full code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# pastebin.com/UwqUrHhh
# output graph: i.imgur.com/u3v2OFt.png
import math
import random
from operator import itemgetter
from copy import deepcopy
import numpy
import pylab
# depot and customers, [index, x, y, demand]
depot = [0, 30.0, 40.0, 0]
customers = [[1, 37.0, 52.0, 7], \
[2, 49.0, 49.0, 30], [3, 52.0, 64.0, 16], \
[4, 20.0, 26.0, 9], [5, 40.0, 30.0, 21], \
[6, 21.0, 47.0, 15], [7, 17.0, 63.0, 19], \
[8, 31.0, 62.0, 23], [9, 52.0, 33.0, 11], \
[10, 51.0, 21.0, 5], [11, 42.0, 41.0, 19], \
[12, 31.0, 32.0, 29], [13, 5.0, 25.0, 23], \
[14, 12.0, 42.0, 21], [15, 36.0, 16.0, 10]]
customerCount = 15
vehicleCount = 3
vehicleCapacity = 90
assigned = [-1] * customerCount
# number of clusters
k_cluster = 0
# binary matrix
bin_matrix = []
# coordinate of centroids
centroids = []
# total demand for each cluster, must be <= capacity
tot_demand = []
# members of each cluster
members = []
# coordinate of members of each cluster
xy_members = []
def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)
# capacitated k-means clustering
# http://www.dcc.ufla.br/infocomp/artigos/v8.4/art07.pdf
def cap_k_means():
    global k_cluster, bin_matrix, centroids, tot_demand
    global members, xy_members, prev_members

    # calculate number of clusters
    tot_demand = sum([c[3] for c in customers])
    k_cluster = int(math.ceil(float(tot_demand) / vehicleCapacity))
    print 'k_cluster', k_cluster

    # initial centroids = first sorted-customers based on demand
    d_customers = sorted(customers, key=itemgetter(3), reverse=True)
    centroids, tot_demand, members, xy_members = [], [], [], []
    for i in range(k_cluster):
        centroids.append(d_customers[i][1:3])  # [x,y]
        # initial total demand and members for each cluster
        tot_demand.append(0)
        members.append([])
        xy_members.append([])

    # binary matrix, dimension = customerCount-1 x k_cluster
    bin_matrix = [[0] * k_cluster for i in range(len(customers))]

    converged = False
    while not converged:  # until no changes in formed-clusters
        prev_matrix = deepcopy(bin_matrix)
        for i in range(len(customers)):
            edist = []  # list of distance to clusters
            if assigned[i] == -1:  # if not assigned yet
                # Calculate the Euclidean distance to each of k-clusters
                for k in range(k_cluster):
                    p1 = (customers[i][1], customers[i][2])  # x,y
                    p2 = (centroids[k][0], centroids[k][1])
                    edist.append((distance(p1, p2), k))
                # sort, based on closest distance
                edist = sorted(edist, key=itemgetter(0))
                closest_centroid = 0  # first index of edist
                # loop while customer[i] is not assigned
                while assigned[i] == -1:
                    # calculate all unassigned customers (G)'s priority
                    max_prior = (0, -1)  # value, index
                    for n in range(len(customers)):
                        pc = customers[n]
                        if assigned[n] == -1:  # if unassigned
                            # get index of current centroid
                            c = edist[closest_centroid][1]
                            cen = centroids[c]  # x,y
                            # distance_cost / demand
                            p = distance((pc[1], pc[2]), cen) / pc[3]
                            # find highest priority
                            if p > max_prior[0]:
                                max_prior = (p, n)  # priority, customer-index
                    # if highest-priority is not found, what should we do ???
                    if max_prior[1] == -1:
                        break
                    # try to assign current cluster to highest-priority customer
                    hpc = max_prior[1]  # index of highest-priority customer
                    c = edist[closest_centroid][1]  # index of current cluster
                    # constraint, total demand in a cluster <= capacity
                    if tot_demand[c] + customers[hpc][3] <= vehicleCapacity:
                        # assign new member of cluster
                        members[c].append(hpc)  # add index of customer
                        xy = (customers[hpc][1], customers[hpc][2])  # x,y
                        xy_members[c].append(xy)
                        tot_demand[c] += customers[hpc][3]
                        assigned[hpc] = c  # update cluster to assigned-customer
                        # update binary matrix
                        bin_matrix[hpc][c] = 1
                    # if customer is not assigned then,
                    if assigned[i] == -1:
                        if closest_centroid < len(edist) - 1:
                            # choose the next nearest centroid
                            closest_centroid += 1
                        # if run out of closest centroid, what must we do ???
                        else:
                            break  # exit without centroid ???
                # end while
        # end for

        # Calculate the new centroid from the formed clusters
        for j in range(k_cluster):
            xj = sum([cn[0] for cn in xy_members[j]])
            yj = sum([cn[1] for cn in xy_members[j]])
            xj = float(xj) / len(xy_members[j])
            yj = float(yj) / len(xy_members[j])
            centroids[j] = (xj, yj)

        # calculate converged
        converged = numpy.array_equal(numpy.array(prev_matrix), numpy.array(bin_matrix))
    # end while
def clustering():
    cap_k_means()

    # debug plot
    idx = numpy.array([c for c in assigned])
    xy = numpy.array([(c[1], c[2]) for c in customers])
    COLORS = ["Blue", "DarkSeaGreen", "DarkTurquoise",
              "IndianRed", "MediumVioletRed", "Orange", "Purple"]
    for i in range(min(idx), max(idx) + 1):
        clr = random.choice(COLORS)
        pylab.plot(xy[idx == i, 0], xy[idx == i, 1], color=clr,
                   linestyle='dashed',
                   marker='o', markerfacecolor=clr, markersize=8)
    pylab.plot(centroids[:][0], centroids[:][1], '*k', markersize=12)
    pylab.plot(depot[1], depot[2], 'sk', markersize=12)
    for i in range(len(idx)):
        pylab.annotate(str(i), xy[i])
    pylab.savefig('clust1.png')
    pylab.show()
    return idx

def main():
    idx = clustering()
    print 'idx', idx
    print 'centroids', centroids
    print 'members', members
    print 'demands', tot_demand

if __name__ == '__main__':
    main()
When the total demand is close to the total capacity, this problem begins to take on aspects of bin packing. As you've discovered, this particular algorithm's greedy approach is not always successful. I don't know whether the authors admitted that, but if they didn't, the reviewers should have caught it.
If you want to continue with something like this algorithm, I would try using integer programming to assign requesters to centroids.
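As a rough sketch of what that assignment step could look like (this is not the paper's method; the function name and data layout are mine, and it reuses the OR-tools pywraplp solver that appears elsewhere on this page; dist[i][k] would be the distance from customer i to the current centroid k, demand[i] the customer demands, and capacity the vehicle capacity):
from ortools.linear_solver import pywraplp

def assign_to_centroids(dist, demand, capacity, n_customers, k_cluster):
    solver = pywraplp.Solver.CreateSolver('SCIP')
    # y[i, k] = 1 if customer i is assigned to centroid k
    y = {}
    for i in range(n_customers):
        for k in range(k_cluster):
            y[i, k] = solver.IntVar(0, 1, 'y_%i_%i' % (i, k))
    # every customer goes to exactly one cluster
    for i in range(n_customers):
        solver.Add(sum(y[i, k] for k in range(k_cluster)) == 1)
    # the demand in each cluster must not exceed the vehicle capacity
    for k in range(k_cluster):
        solver.Add(sum(y[i, k] * demand[i] for i in range(n_customers)) <= capacity)
    # minimise the total distance to the (fixed) centroids
    solver.Minimize(sum(y[i, k] * dist[i][k]
                        for i in range(n_customers)
                        for k in range(k_cluster)))
    if solver.Solve() != pywraplp.Solver.OPTIMAL:
        return None  # no feasible assignment with this k
    return [next(k for k in range(k_cluster) if y[i, k].solution_value() > 0.5)
            for i in range(n_customers)]
You would call something like this in place of the greedy assignment inside each k-means iteration and then recompute the centroids as before.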
Without going through all the details, the paper you cite says
if ri is not assigned then
choose the next nearest centroid
end if
in the algorithm at the end of section 5.
There must be a next nearest centroid - if two are equidistant I presume it doesn't matter which you choose.
A common issue with fixed-size clustering is that you can often identify 'swaps' in the output, where swapping 2 points between clusters creates a better solution.
We can improve the constrained k-means algorithm from the referenced paper by formulating the 'find the cluster to assign the point to' as an assignment problem, instead of greedily picking the nearest one that isn't full.
A common way to solve this is using a min-cost flow algorithm. The result of this guarantees that there aren't any 'swaps' available that improve the result.
Luckily, someone has already implemented this and created a package for it: https://github.com/joshlk/k-means-constrained
Check out Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000). Constrained k-means clustering.
One side note is that customers with a very large demand may still not be able to be assigned to a single depot, so the value of k may need to be increased until there is a feasible solution, or "splitting" the demand of a customer between multiple depots needs to be allowed.
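For completeness, a rough usage sketch of the k-means-constrained package linked above (the parameters are illustrative; note that it bounds cluster sizes, i.e. the number of points per cluster, rather than the summed demand):
import numpy as np
from k_means_constrained import KMeansConstrained

X = np.array([[x, y] for _, x, y, _ in customers])   # coordinates of the 15 customers from the question
clf = KMeansConstrained(n_clusters=3, size_min=3, size_max=7, random_state=0)
labels = clf.fit_predict(X)
print(labels)
print(clf.cluster_centers_)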
