I'm a newbie to algorithms and optimization.
I'm trying to implement capacitated k-means, but so far I'm getting unresolved (unassigned) customers and poor results.
This is used as part of a CVRP (capacitated vehicle routing problem) simulation.
I'm wondering whether I have interpreted the referenced algorithm incorrectly.
Ref: "Improved K-Means Algorithm for Capacitated Clustering Problem"
(Geetha, Poonthalir, Vanathi)
The simulated CVRP has 15 customers and 1 depot.
Each customer has a Euclidean coordinate (x, y) and a demand.
There are 3 vehicles, each with a capacity of 90.
So capacitated k-means tries to cluster the 15 customers into 3 vehicles, such that the total demand in each cluster does not exceed the vehicle capacity.
UPDATE:
In the referenced algorithm, I couldn't find any information about what the code must do when it runs out of "next nearest centroids".
That is, when all of the nearest centroids have been examined (in step 14.b below) while customers[1] is still unassigned.
This results in the customer with index 1 being unassigned.
Note: customers[1] is the customer with the largest demand (30).
Q: When this condition is met, what should the code do?
Here is my interpretation of the referenced algorithm; please correct my code, thank you.
1. Given n requesters (customers), n = customerCount, and a depot,
2. n demands,
3. n coordinates (x, y).
4. Calculate the number of clusters, k = ceil((sum of all demands) / vehicleCapacity).
5. Select initial centroids:
  5.a. sort customers based on demand, in descending order = d_customers,
  5.b. select the first k customers from d_customers as initial centroids = centroids[0 .. k-1].
6. Create a binary matrix bin_matrix, dimension = (customerCount) x (k):
  6.a. fill bin_matrix with all zeros.
7. Start WHILE loop, condition = WHILE not converged:
  7.a. converged = False.
8. Start FOR loop, condition = FOR each customer:
  8.a. index of customer = i.
9. Calculate the Euclidean distances from customers[i] to all centroids => edist:
  9.a. sort edist in ascending order,
  9.b. select the first centroid, with the closest distance = closest_centroid.
10. Start WHILE loop, condition = WHILE customers[i] is not assigned to any cluster.
11. Group all the other unassigned customers = G:
  11.a. consider closest_centroid as the centroid for G.
12. Calculate the priority Pi for each customer of G:
  12.a. priority Pi = (distance from that customer to closest_centroid) / (that customer's demand),
  12.b. select the customer with the highest priority Pi,
  12.c. the customer with the highest priority has index = hpc,
  12.d. Q: IF the highest-priority customer cannot be found, what must we do?
13. Assign customers[hpc] to centroids[closest_centroid] if possible:
  13.a. demand of customers[hpc] = d1,
  13.b. sum of all demands of the centroid's members = dtot,
  13.c. IF (d1 + dtot) <= vehicleCapacity, THEN
  13.d. assign customers[hpc] to centroids[closest_centroid],
  13.e. update bin_matrix, row index = hpc, column index = closest_centroid, set to 1.
14. IF customers[i] is (still) not assigned to any cluster, THEN:
  14.a. choose the next nearest centroid, with the next nearest distance from edist,
  14.b. Q: IF there is no next nearest centroid, THEN what must we do?
15. Calculate convergence by comparing the previous matrix and the updated bin_matrix:
  15.a. IF there are no changes in bin_matrix, THEN set converged = True.
16. Otherwise, calculate new centroids from the updated clusters:
  16.a. calculate the new centroids' coordinates based on the members of each cluster,
  16.b. sum_x = sum of all x-coordinates of a cluster's members,
  16.c. num_c = number of all customers (members) in the cluster,
  16.d. new centroid's x-coordinate of the cluster = sum_x / num_c,
  16.e. with the same formula, calculate the new centroid's y-coordinate = sum_y / num_c.
17. Iterate the main WHILE loop.
My code always ends with an unassigned customer at step 14.b.
That is, there is a customers[i] still not assigned to any centroid, and the loop has run out of "next nearest centroids".
And the resulting clusters are poor. Output graph:
In the picture, a star is a centroid and the square is the depot.
In the picture, the customer labeled "1", with demand = 30, always ends up with no assigned cluster.
Output of the program,
k_cluster 3
idx [ 1 -1 1 0 2 0 1 1 2 2 2 0 0 2 0]
centroids [(22.6, 29.2), (34.25, 60.25), (39.4, 33.4)]
members [[3, 14, 12, 5, 11], [0, 2, 6, 7], [9, 8, 4, 13, 10]]
demands [86, 65, 77]
The first and third clusters are poorly calculated.
idx at index 1 is not assigned (-1).
Q: What's wrong with my interpretation and my implementation?
Any correction, suggestion, or help will be very much appreciated; thank you in advance.
Here is my full code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# pastebin.com/UwqUrHhh
# output graph: i.imgur.com/u3v2OFt.png
import math
import random
from operator import itemgetter
from copy import deepcopy
import numpy
import pylab
# depot and customers, [index, x, y, demand]
depot = [0, 30.0, 40.0, 0]
customers = [[1, 37.0, 52.0, 7], \
[2, 49.0, 49.0, 30], [3, 52.0, 64.0, 16], \
[4, 20.0, 26.0, 9], [5, 40.0, 30.0, 21], \
[6, 21.0, 47.0, 15], [7, 17.0, 63.0, 19], \
[8, 31.0, 62.0, 23], [9, 52.0, 33.0, 11], \
[10, 51.0, 21.0, 5], [11, 42.0, 41.0, 19], \
[12, 31.0, 32.0, 29], [13, 5.0, 25.0, 23], \
[14, 12.0, 42.0, 21], [15, 36.0, 16.0, 10]]
customerCount = 15
vehicleCount = 3
vehicleCapacity = 90
assigned = [-1] * customerCount
# number of clusters
k_cluster = 0
# binary matrix
bin_matrix = []
# coordinate of centroids
centroids = []
# total demand for each cluster, must be <= capacity
tot_demand = []
# members of each cluster
members = []
# coordinate of members of each cluster
xy_members = []
def distance(p1, p2):
return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)
# capacitated k-means clustering
# http://www.dcc.ufla.br/infocomp/artigos/v8.4/art07.pdf
def cap_k_means():
global k_cluster, bin_matrix, centroids, tot_demand
global members, xy_members, prev_members
# calculate number of clusters
tot_demand = sum([c[3] for c in customers])
k_cluster = int(math.ceil(float(tot_demand) / vehicleCapacity))
print 'k_cluster', k_cluster
# initial centroids = first sorted-customers based on demand
d_customers = sorted(customers, key=itemgetter(3), reverse=True)
centroids, tot_demand, members, xy_members = [], [], [], []
for i in range(k_cluster):
centroids.append(d_customers[i][1:3]) # [x,y]
# initial total demand and members for each cluster
tot_demand.append(0)
members.append([])
xy_members.append([])
# binary matrix, dimension = customerCount x k_cluster
bin_matrix = [[0] * k_cluster for i in range(len(customers))]
converged = False
while not converged: # until no changes in formed-clusters
prev_matrix = deepcopy(bin_matrix)
for i in range(len(customers)):
edist = [] # list of distance to clusters
if assigned[i] == -1: # if not assigned yet
# Calculate the Euclidean distance to each of k-clusters
for k in range(k_cluster):
p1 = (customers[i][1], customers[i][2]) # x,y
p2 = (centroids[k][0], centroids[k][1])
edist.append((distance(p1, p2), k))
# sort, based on closest distance
edist = sorted(edist, key=itemgetter(0))
closest_centroid = 0 # first index of edist
# loop while customer[i] is not assigned
while assigned[i] == -1:
# calculate the priority of all unassigned customers (G)
max_prior = (0, -1) # value, index
for n in range(len(customers)):
pc = customers[n]
if assigned[n] == -1: # if unassigned
# get index of current centroid
c = edist[closest_centroid][1]
cen = centroids[c] # x,y
# distance_cost / demand
p = distance((pc[1], pc[2]), cen) / pc[3]
# find highest priority
if p > max_prior[0]:
max_prior = (p, n) # priority,customer-index
# if highest-priority is not found, what should we do ???
if max_prior[1] == -1:
break
# try to assign current cluster to highest-priority customer
hpc = max_prior[1] # index of highest-priority customer
c = edist[closest_centroid][1] # index of current cluster
# constraint, total demand in a cluster <= capacity
if tot_demand[c] + customers[hpc][3] <= vehicleCapacity:
# assign new member of cluster
members[c].append(hpc) # add index of customer
xy = (customers[hpc][1], customers[hpc][2]) # x,y
xy_members[c].append(xy)
tot_demand[c] += customers[hpc][3]
assigned[hpc] = c # update cluster to assigned-customer
# update binary matrix
bin_matrix[hpc][c] = 1
# if customer is not assigned then,
if assigned[i] == -1:
if closest_centroid < len(edist)-1:
# choose the next nearest centroid
closest_centroid += 1
# if run out of closest centroid, what must we do ???
else:
break # exit without centroid ???
# end while
# end for
# Calculate the new centroid from the formed clusters
for j in range(k_cluster):
xj = sum([cn[0] for cn in xy_members[j]])
yj = sum([cn[1] for cn in xy_members[j]])
xj = float(xj) / len(xy_members[j])
yj = float(yj) / len(xy_members[j])
centroids[j] = (xj, yj)
# calculate converged
converged = numpy.array_equal(numpy.array(prev_matrix), numpy.array(bin_matrix))
# end while
def clustering():
cap_k_means()
# debug plot
idx = numpy.array([c for c in assigned])
xy = numpy.array([(c[1], c[2]) for c in customers])
COLORS = ["Blue", "DarkSeaGreen", "DarkTurquoise",
"IndianRed", "MediumVioletRed", "Orange", "Purple"]
for i in range(min(idx), max(idx)+1):
clr = random.choice(COLORS)
pylab.plot(xy[idx==i, 0], xy[idx==i, 1], color=clr, \
linestyle='dashed', \
marker='o', markerfacecolor=clr, markersize=8)
pylab.plot([c[0] for c in centroids], [c[1] for c in centroids], '*k', markersize=12)  # plot all centroids as stars
pylab.plot(depot[1], depot[2], 'sk', markersize=12)
for i in range(len(idx)):
pylab.annotate(str(i), xy[i])
pylab.savefig('clust1.png')
pylab.show()
return idx
def main():
idx = clustering()
print 'idx', idx
print 'centroids', centroids
print 'members', members
print 'demands', tot_demand
if __name__ == '__main__':
main()
When the total demand is close to the total capacity, this problem begins to take on aspects of bin packing. As you've discovered, this particular algorithm's greedy approach is not always successful. I don't know whether the authors admitted that, but if they didn't, the reviewers should have caught it.
If you want to continue with something like this algorithm, I would try using integer programming to assign requesters to centroids.
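For what it's worth, here is a minimal sketch of that idea using PuLP (my assumption; any MIP solver would do). It assigns every customer to exactly one centroid while minimizing total distance, subject to the capacity constraint; assign_customers and its argument layout are illustrative names, not from the paper.
import math
import pulp

def assign_customers(customers, centroids, capacity):
    # customers: list of (x, y, demand); centroids: list of (x, y)
    def dist(c, m):
        return math.hypot(c[0] - m[0], c[1] - m[1])
    prob = pulp.LpProblem("capacitated_assignment", pulp.LpMinimize)
    x = {(i, k): pulp.LpVariable("x_%d_%d" % (i, k), cat="Binary")
         for i in range(len(customers)) for k in range(len(centroids))}
    # objective: total distance from customers to their assigned centroids
    prob += pulp.lpSum(dist(customers[i], centroids[k]) * x[i, k] for (i, k) in x)
    # every customer is assigned to exactly one centroid
    for i in range(len(customers)):
        prob += pulp.lpSum(x[i, k] for k in range(len(centroids))) == 1
    # the demand assigned to each centroid must not exceed the vehicle capacity
    for k in range(len(centroids)):
        prob += pulp.lpSum(customers[i][2] * x[i, k] for i in range(len(customers))) <= capacity
    prob.solve()
    return [[i for i in range(len(customers)) if pulp.value(x[i, k]) > 0.5]
            for k in range(len(centroids))]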
Without going through all the details, the paper you cite says
if ri is not assigned then
choose the next nearest centroid
end if
in the algorithm at the end of section 5.
There must be a next nearest centroid - if two are equidistant I presume it doesn't matter which you choose.
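To make that concrete, here is a rough sketch (not the paper's exact pseudocode) of walking the centroids in order of increasing distance, so that a "next nearest centroid" is always available until all k have been examined; cust stands for one customer row from the question's code, and assign() is a hypothetical helper for the bookkeeping.
# cust = [index, x, y, demand]; distance(), centroids, tot_demand, k_cluster and
# vehicleCapacity are as in the question's code; assign() is a hypothetical helper.
edist = sorted((distance((cust[1], cust[2]), centroids[k]), k) for k in range(k_cluster))
for d, k in edist:
    if tot_demand[k] + cust[3] <= vehicleCapacity:
        assign(cust, k)   # update members, tot_demand, bin_matrix, assigned
        break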
A common issue with fixed-size clustering is that you can often identify 'swaps' in the output, where swapping 2 points between clusters creates a better solution.
We can improve the constrained k-means algorithm from the referenced paper by formulating the "find the cluster to assign the point to" step as an assignment problem, instead of greedily picking the nearest cluster that isn't full.
A common way to solve this is using a min-cost flow algorithm. The result of this guarantees that there aren't any 'swaps' available that improve the result.
Luckily, someone has already implemented this and created a package for it: https://github.com/joshlk/k-means-constrained
Check out Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000). Constrained k-means clustering.
One side note: a customer with a very large demand may still not fit into a single cluster, so the value of k may need to be increased until there is a feasible solution, or "splitting" the demand of a customer between multiple clusters needs to be allowed.
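A quick usage sketch of that package, based on its README (my assumption: it is installed with pip install k-means-constrained). Note that it constrains cluster sizes (number of points), not summed demand, so it only partially maps onto the CVRP capacity constraint; the size bounds below are illustrative.
import numpy as np
from k_means_constrained import KMeansConstrained

# customer coordinates from the question
X = np.array([[37, 52], [49, 49], [52, 64], [20, 26], [40, 30],
              [21, 47], [17, 63], [31, 62], [52, 33], [51, 21],
              [42, 41], [31, 32], [5, 25], [12, 42], [36, 16]], dtype=float)
clf = KMeansConstrained(n_clusters=3, size_min=4, size_max=6, random_state=0)
labels = clf.fit_predict(X)      # cluster index for each customer
print(labels)
print(clf.cluster_centers_)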
I am a beginner in Python. I have a dataset that contains people's travel records in each time period, and I would like to get a new dataframe that describes the choice set for each person when she travels.
I am trying to find all the stations that are within 5 km of a person, from lat/lon coordinates. I have a dataframe that contains person-id, the person's location coordinates at time t, and the station coordinates. I would like to get a new dataframe containing all the stations within 5 km of the person that have appeared in the dataset (using person-id and time t as two separate indices), with the respective distances to all of them as another column. For example, if station 1 appeared in period 1 but not in period 2, it is actually still there; it just was not in people's travel records at time 2. This would generate a dataframe that describes the choice set for each person at time t (for example, a person's consideration set of stations where she could get gas for her car): a person can move, but a station is always available in the choice set after it is built. (Also note that although B did not go to any station at time 2, she still has a choice set of 5 and 6, as long as B has appeared at previous times. In other words, once a person has appeared, she is always there, which is why B shows up again at time 2.)
import geopy.distance
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform
df = pd.DataFrame({
'time' : [1,1,2,2],
'personid' : ['A','B','A','C'],
'station' : [5,6,7,5],
'stationLoc' : [(122.286, 114.135),(122.284, 114.131),(122.286, 114.224),(122.286, 114.135)],
'personLoc' : [(122.283, 114.127),(122.283, 114.127),(122.286, 114.219),(122.286, 114.224)],
})
What I expect to get is like:
df1 = pd.DataFrame({
'personid' : ['A','A','A','A','A','B','B','B','B','C'],
'time' : [1,1,2,2,2,1,1,2,2,2],
'stations_within_5km' : [5, 6, 5, 6, 7, 5, 6, 5, 6, 7],
'distance' : [Ato5, Ato6, Ato5, Ato6, Ato7, Bto5, Bto6, Bto5, Bto6, Cto7],
})
I have tried to use a loop, but found it hard to implement this idea in a standardized data format that I can use to run a regression. Sonia's answer is great, but it was posted before I made my statement clear. Sorry about this, but I still appreciate it.
This is written in Python, but if R would work better, R code would also be welcome. Any thoughts would be appreciated.
Thank you very much!
Use the haversine distance to compute the distance between two coordinates. The haversine formula is given below:
import math

def haversine_distance(point1, point2):
'''
Takes point1 and point2 and calculates the haversine distance
between the two points.
Input Parameters
----------------
point1 and point2 as latitude and longitude coordinates in tuples
For e.g.,
point1_coords = (49.012798, 2.550000)
point2_coords = (-43.489399, 172.531998)
Output Parameters
-----------------
Distance in Km
Notes
-----
The Haversine (or great circle) distance is the angular distance
between two points on the surface of a sphere.
'''
lat1,lon1 = point1
lat2,lon2 = point2
# φ1, φ2 are the latitude of point 1 and latitude of point 2 (in radians)
phi1 = math.radians(lat1)
phi2 = math.radians(lat2)
# λ1, λ2 are the longitude of point 1 and longitude of point 2 (in radians).
lambda1 = math.radians(lon1)
lambda2 = math.radians(lon2)
delta_phi = phi2-phi1
delta_lambda = lambda2-lambda1
# Calculating a
a = math.sin(delta_phi/2.0)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(delta_lambda/2.0)**2
# Calculating c
c = 2*math.atan2(math.sqrt(a),math.sqrt(1-a))
R = 6371 # radius of Earth in kilometers
# Calculating Distance
d = R*c # d is the distance
return d
Then calculate the distance between the person loc and all the station locations as shown below:
df['uniq_id'] = df['time'].astype(str) + df['personid']  # create uniq_id
def stations_within_5km(uniqid_arr, personloc_arr, stationloc_arr, station_arr):
distance = {}
stations_5 = {}
for i in range(len(personloc_arr)):
dist = []
stations_ = []
for j in range(len(personloc_arr)):
if haversine_distance(personloc_arr[i],stationloc_arr[j]) <= 5:
if station_arr.iloc[j] not in stations_:
dist.append(haversine_distance(personloc_arr[i],stationloc_arr[j]))
stations_.append(station_arr.iloc[j])
distance[uniqid_arr.iloc[i]] = dist
stations_5[uniqid_arr.iloc[i]] = stations_
return distance, stations_5
distance, stations_5 = stations_within_5km(df['uniq_id'],df['personLoc'],df['stationLoc'],df['station'])
df['stations_within_5km'] = [stations_5[id_] for id_ in df['uniq_id']]
df['distance'] = [distance[id_] for id_ in df['uniq_id']]
Output:
time personid station stationLoc personLoc uniq_id stations_within_5km distance
0 1 A 5 (122.286, 114.135) (122.283, 114.127) 1A [5, 6] [0.580544416674241, 0.26229648732200417]
1 1 B 6 (122.284, 114.131) (122.283, 114.127) 1B [5, 6] [0.580544416674241, 0.26229648732200417]
2 2 A 7 (122.286, 114.224) (122.286, 114.219) 2A [5, 7] [4.98912110882879, 0.2969715135144607]
3 2 C 5 (122.286, 114.135) (122.286, 114.224) 2C [7] [0.0]
The calculated fields show the stations (duplicates removed) within 5 km of the person and their respective distances.
I'm trying to create a Voronoi diagram given a set of scatterplot points. However, several "extra unintended lines" appear to get calculated in the process. Some of these "extra" lines appear to be the infinite edges getting incorrectly calculated. But others are appearing randomly in the middle of the plot as well. How can I only create an extra edge when it's needed/required to connect a polygon to the edge of the plot (e.g. plot boundaries)?
My graph outer boundaries are:
boundaries = np.array([[0, -2], [0, 69], [105, 69], [105, -2], [0, -2]])
Here's the section dealing with the voronoi diagram creation:
def voronoi_polygons(voronoi, diameter):
centroid = voronoi.points.mean(axis=0)
ridge_direction = defaultdict(list)
for (p, q), rv in zip(voronoi.ridge_points, voronoi.ridge_vertices):
u, v = sorted(rv)
if u == -1:
t = voronoi.points[q] - voronoi.points[p] # tangent
n = np.array([-t[1], t[0]]) / np.linalg.norm(t) # normal
midpoint = voronoi.points[[p, q]].mean(axis=0)
direction = np.sign(np.dot(midpoint - centroid, n)) * n
ridge_direction[p, v].append(direction)
ridge_direction[q, v].append(direction)
for i, r in enumerate(voronoi.point_region):
region = voronoi.regions[r]
if -1 not in region:
# Finite region.
yield Polygon(voronoi.vertices[region])
continue
# Infinite region.
inf = region.index(-1) # Index of vertex at infinity.
j = region[(inf - 1) % len(region)] # Index of previous vertex.
k = region[(inf + 1) % len(region)] # Index of next vertex.
if j == k:
# Region has one Voronoi vertex with two ridges.
dir_j, dir_k = ridge_direction[i, j]
else:
# Region has two Voronoi vertices, each with one ridge.
dir_j, = ridge_direction[i, j]
dir_k, = ridge_direction[i, k]
# Length of ridges needed for the extra edge to lie at least
# 'diameter' away from all Voronoi vertices.
length = 2 * diameter / np.linalg.norm(dir_j + dir_k)
# Polygon consists of finite part plus an extra edge.
finite_part = voronoi.vertices[region[inf + 1:] + region[:inf]]
extra_edge = [voronoi.vertices[j] + dir_j * length,
voronoi.vertices[k] + dir_k * length]
combined_finite_edge = np.concatenate((finite_part, extra_edge))
poly = Polygon(combined_finite_edge)
yield poly
Here are the points being used:
['52.629' '24.28099822998047']
['68.425' '46.077999114990234']
['60.409' '36.7140007019043']
['72.442' '28.762001037597656']
['52.993' '43.51799964904785']
['59.924' '16.972000122070312']
['61.101' '55.74899959564209']
['68.9' '13.248001098632812']
['61.323' '29.0260009765625']
['45.283' '36.97500038146973']
['52.425' '19.132999420166016']
['37.739' '28.042999267578125']
['48.972' '2.3539962768554688']
['33.865' '30.240001678466797']
['52.34' '64.94799995422363']
['52.394' '45.391000747680664']
['52.458' '34.79800033569336']
['31.353' '43.14500045776367']
['38.194' '39.24399948120117']
['98.745' '32.15999984741211']
['6.197' '32.606998443603516']
Most likely this is due to the errors associated with floating-point arithmetic while computing the Voronoi triangulation from your data (esp. the second column).
Assuming that, there is no single solution for such kinds of problems. I urge you to go through this page* of the Qhull manual and try iterating through those parameters in qhull_options before generating the voronoi object that you are inputting in the function. An example would be qhull_options='Qbb Qc Qz QJ'.
Other than that I doubt there is anything that could be modified in the function to avoid such a problem.
*This will take some time though. Just be patient.
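For concreteness, here is a minimal sketch of passing those options when building the diagram (the coordinates are just a rounded subset of the points above, and the options string is the example from this answer):
import numpy as np
from scipy.spatial import Voronoi

points = np.array([[52.629, 24.281], [68.425, 46.078], [60.409, 36.714],
                   [72.442, 28.762], [52.993, 43.518]])
# 'QJ' joggles the input slightly, which can work around precision problems
vor = Voronoi(points, qhull_options='Qbb Qc Qz QJ')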
Figured out what was wrong: after each polygon I needed to add a null x and y value or else it would attempt to 'stitch' one polygon to another, drawing an additional unintended line in order to do so. So the data should really look more like this:
GameTime,Half,ObjectType,JerseyNumber,X,Y,PlayerIDEvent,PlayerIDTracking,MatchIDEvent,Position,teamId,i_order,v_vor_x,v_vor_y
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,0,22.79645297,6.20866756
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,1,17.63464264,3.41230187
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,2,20.27639318,34.29191902
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,3,32.15600546,36.60432421
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,4,38.34639812,33.62806739
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,5,22.79645297,6.20866756
0.0,1,1,22,None,None,578478,794888,2257663,3,35179.0,5,nan,nan
0.0,1,1,22,33.865,30.240001678466797,578478,794888,2257663,3,35179.0,,,
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,0,46.91696938,29.44801535
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,1,55.37574848,29.5855499
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,2,58.85876401,23.20381766
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,3,57.17455086,21.5228301
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,4,44.14237744,22.03925667
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,5,45.85962774,28.83613332
0.0,1,0,92,None,None,369351,561593,2257663,1,32446.0,5,nan,nan
0.0,1,0,92,52.629,24.28099822998047,369351,561593,2257663,1,32446.0,,,
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,0,65.56965667,33.4292025
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,1,57.23303682,32.43809027
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,2,55.65704152,38.97814049
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,3,60.75304149,44.53251169
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,4,65.14170295,40.77562188
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,5,65.56965667,33.4292025
0.0,1,0,27,None,None,704169,704169,2257663,2,32446.0,5,nan,nan
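To illustrate the fix in a self-contained way: matplotlib breaks a line at NaN values, so inserting a NaN (null) row between polygons prevents the "stitching" segment. The two small squares below are made-up example polygons, not the data above.
import numpy as np
import matplotlib.pyplot as plt

poly1 = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]])
poly2 = np.array([[2, 2], [3, 2], [3, 3], [2, 3], [2, 2]])
gap = np.array([[np.nan, np.nan]])        # breaks the line between the polygons
xy = np.concatenate([poly1, gap, poly2])
plt.plot(xy[:, 0], xy[:, 1])
plt.show()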
How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. While I can plot the data and spot any anomalies or noise in the data, how can I do it without plotting? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only some values in the time vector are anomalous, the rest of the values follow a regular progression; so if we gather the steps between consecutive points into clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represent the real deal), we can use that average to do a triad detection, within a given threshold, over the vector and detect which of the elements are anomalous.
For this we need two functions: calculate_average_step, which calculates that average for the biggest cluster of close values, and detect_anomalous_values, which yields the indexes of the anomalous values in our vector based on the average calculated earlier.
After we detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and by using the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def calculate_average_step(array, threshold=5):
"""
Determine the average step by doing a weighted average based on clustering of averages.
array: our array
threshold: the +/- offset for grouping clusters. Applicable to all elements in the array.
"""
# determine all the steps
steps = []
for i in range(0, len(array) - 1):
steps.append(abs(array[i] - array[i+1]))
# determine the steps clusters
clusters = []
skip_indexes = []
cluster_index = 0
for i in range(len(steps)):
if i in skip_indexes:
continue
# determine the cluster band (based on threshold)
cluster_lower = steps[i] - (steps[i]/100) * threshold
cluster_upper = steps[i] + (steps[i]/100) * threshold
# create the new cluster
clusters.append([])
clusters[cluster_index].append(steps[i])
# try to match elements from the rest of the array
for j in range(i + 1, len(steps)):
if not (cluster_lower <= steps[j] <= cluster_upper):
continue
clusters[cluster_index].append(steps[j])
skip_indexes.append(j)
cluster_index += 1 # increment the cluster id
clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
biggest_cluster = clusters[0] if len(clusters) > 0 else None
if biggest_cluster is None:
return None
return sum(biggest_cluster) / len(biggest_cluster) # return our most common average
def detect_anomalous_values(array, regular_step, threshold=5):
"""
Will scan every triad (3 points) in the array to detect anomalies.
array: the array to iterate over.
regular_step: the step around which we form the upper/lower band for filtering
treshold: +/- variation between the steps of the first and median element and median and third element.
"""
assert(len(array) >= 3) # must have at least 3 elements
anomalous_indexes = []
step_lower = regular_step - (regular_step / 100) * threshold
step_upper = regular_step + (regular_step / 100) * threshold
# detection will be forward from i (hence 3 elements must be available for the d)
for i in range(0, len(array) - 2):
a = array[i]
b = array[i+1]
c = array[i+2]
first_step = abs(a-b)
second_step = abs(b-c)
first_belonging = step_lower <= first_step <= step_upper
second_belonging = step_lower <= second_step <= step_upper
# detect that both steps are alright
if first_belonging and second_belonging:
continue # all is good here, nothing to do
# detect if the first point in the triad is bad
if not first_belonging and second_belonging:
anomalous_indexes.append(i)
# detect the last point in the triad is bad
if first_belonging and not second_belonging:
anomalous_indexes.append(i+2)
# detect the mid point in triad is bad (or everything is bad)
if not first_belonging and not second_belonging:
anomalous_indexes.append(i+1)
# we won't add here the others because they will be detected by
# the rest of the triad scans
return sorted(set(anomalous_indexes)) # return unique indexes
if __name__ == "__main__":
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
noisy_out_array = np.sin(in_array)
# display noisy sin
plt.figure()
plt.plot(in_array, noisy_out_array, color = 'red', marker = "o");
plt.title("noisy numpy.sin()")
# detect anomalous values
average_step = calculate_average_step(in_array)
anomalous_indexes = detect_anomalous_values(in_array, average_step)
# replace anomalous points with an estimated value based on our calculated average
for anomalous in anomalous_indexes:
# try forward extrapolation
try:
in_array[anomalous] = in_array[anomalous-1] + average_step
# otherwise try backward extrapolation
except IndexError:
in_array[anomalous] = in_array[anomalous+1] - average_step
# generate sine wave
out_array = np.sin(in_array)
plt.figure()
plt.plot(in_array, out_array, color = 'green', marker = "o");
plt.title("cleaned numpy.sin()")
plt.show()
Noisy sine:
Cleaned sine:
Your problem lies in the time vector (which is of 1 dimension). You will need to apply some sort of filter to that vector.
First thing that came to mind was medfilt (median filter) from scipy and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
the problem with this filter, though, is that if we apply some noise values to the edges of the vector, like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], then the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.], and obviously this is not OK for the sine plot, since it will produce the same artifacts for the sine values array.
A better approach to this problem is to treat the time vector as a y output and its index values as the x input, and do a linear regression on the "time linear function" (note the quotes: it just means we're faking the 2-dimensional model by applying a fake x vector). The code uses scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
the output will be:
the resulting signal will be an approximation of the original, as it is with any form of extrapolation/interpolation/regression filtering.
I am trying to implement the following (divisive) clustering algorithm (below is a short form of the algorithm; the full description is available here):
Start with a sample x_i, i = 1, ..., n, regarded as a single cluster of n data points, and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.
First determine the distance between all pairs of data points and choose the pair with the largest distance (Dmax) between them.
Compare Dmax to T. If Dmax > T, then divide the single cluster in two by using the selected pair as the first elements in two new clusters. The remaining n - 2 data points are put into one of the two new clusters: x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l); otherwise it is added to the new cluster containing x_j.
At the second stage, the values D(x_i, x_j) are found within one of the two new clusters to find the pair in the cluster with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. The procedure then repeats on the clusters generated from this iteration.
Output is a hierarchy of clustered data records. I kindly ask for an advice how to implement the clustering algorithm.
EDIT 1: I attach a Python function which defines the distance (correlation coefficient) and code which finds the maximal distance in the data matrix.
# Read data from GitHub
import pandas as pd
from math import sqrt  # needed by pearson() below
df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]
# Define correlation coefficient as distance of choice
def pearson(v1, v2):
# Simple sums
sum1 = sum(v1)
sum2 = sum(v2)
# Sums of the squares
sum1Sq = sum([pow(v, 2) for v in v1])
sum2Sq = sum([pow(v, 2) for v in v2])
# Sum of the products
pSum=sum([v1[i] * v2[i] for i in range(len(v1))])
# Calculate r (Pearson score)
num = pSum - (sum1 * sum2 / len(v1))
den = sqrt((sum1Sq - pow(sum1,2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
if den == 0: return 0
return num / den
# Find largest distance
dist={}
max_dist = pearson(data[0], data[0])
# Loop over upper triangle of data matrix
for i in range(len(data)):
for j in range(i + 1, len(data)):
# Compute distance for each pair
dist_curr = pearson(data[i], data[j])
# Store distance in dict
dist[(i, j)] = dist_curr
# Store max distance
if dist_curr > max_dist:
max_dist = dist_curr
EDIT 2: Pasted below are functions from Dschoni's answer.
# Euclidean distance
def euclidean(x,y):
x = numpy.array(x)
y = numpy.array(y)
return numpy.sqrt(numpy.sum((x-y)**2))
# Create matrix
def dist_mat(data):
dist = {}
for i in range(len(data)):
for j in range(i + 1, len(data)):
dist[(i, j)] = euclidean(data[i], data[j])
return dist
# Returns i & k for max distance
def my_max(dict):
return max(dict)
# Sort function
list1 = []
list2 = []
def sort (rcd, i, k):
list1.append(i)
list2.append(k)
for j in range(len(rcd)):
if (euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k])):
list1.append(j)
else:
list2.append(j)
EDIT 3:
When I run the code provided by Dschoni, the algorithm works as expected. Then I modified the create_distance_list function so we can compute the distance between multivariate data points. I use Euclidean distance. As a toy example I load the iris data, and I cluster only the first 50 instances of the dataset.
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',')
df = df.drop(4, 1)
df = df[1:50]
data = df.values.tolist()
idl=range(len(data))
dist = create_distance_list(data)
print sort(dist, idl)
The result is as follows:
[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16,
39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41],
[5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44],
[31], [32], [19], [30], [35], [9, 47]]
Some data points are still clustered together. I worked around this problem by adding a small amount of random noise to the actual dictionary in the sort function:
# Add small random noise
for key in actual:
actual[key] += np.random.normal(0, 0.005)
Any idea how to solve this problem properly?
A proper working example for the euclidean distance:
import numpy as np
#For random number generation
def create_distance_list(l):
'''Create a distance list for every
unique tuple of pairs'''
dist={}
for i in range(len(l)):
for k in range(i+1,len(l)):
dist[(i,k)]=abs(l[i]-l[k])
return dist
def maximum(distance_dict):
'''Returns the key of the maximum value if unique
or a random key with the maximum value.'''
maximum = max(distance_dict.values())
max_key = [key for key, value in distance_dict.items() if value == maximum]
if len(max_key)>1:
random_key = np.random.random_integers(0,len(max_key)-1)
return (max_key[random_key],)
else:
return max_key
def construct_new_dict(distance_dict,index_list):
'''Helper function to create a distance map for a subset
of data points.'''
new={}
for i in range(len(index_list)):
for k in range(i+1,len(index_list)):
m = index_list[i]
n = index_list[k]
new[(m,n)]=distance_dict[(m,n)]
return new
def sort(distance_dict,idl,threshold=4):
result=[idl]
i=0
try:
while True:
if len(result[i])>=2:
actual=construct_new_dict(dist,result[i])
act_max=maximum(actual)
if distance_dict[act_max[0]]>threshold:
j = act_max[0][0]
k = act_max[0][1]
result[i].remove(j)
result[i].remove(k)
l1=[j]
l2=[k]
for iterr in range(len(result[i])):
s = result[i][iterr]
if s>j:
c1=(j,s)
else:
c1=(s,j)
if s>k:
c2=(k,s)
else:
c2=(s,k)
if actual[c1]<actual[c2]:
l1.append(s)
else:
l2.append(s)
result.remove(result[i])
#What to do if distance is equal?
l1.sort()
l2.sort()
result.append(l1)
result.append(l2)
else:
i+=1
else:
i+=1
except:
return result
#This is the dataset
a = [1,2,2.5,5]
#Giving each entry a unique ID
idl=range(len(a))
dist = create_distance_list(a)
print sort(dist,idl)
I wrote the code for readability; there is a lot of stuff that can be made faster, more reliable and prettier. This is just to give you an idea of how it can be done.
Some data points are still clustered together. I solve this problem by
adding small amount of data noise to actual dictionary in the sort
function.
If Dmax > T then divide single cluster in two
Your description doesn't necessarily create n clusters.
If a cluster has two records whose distance is less than T,
they will stay clustered together (am I missing something?).
I'm working in a 3D context. I have some objects in this space that are represented by an x, y, z position.
# My objects names (in my real context it's pheromone "point")
A = 1
B = 2
C = 3
D = 4
# My actual way to stock their positions
pheromones_positions = {
(25, 25, 60): [A, D],
(10, 90, 30): [B],
(5, 85, 8): [C]
}
My objective is to find which points (pheromones) are near (within a given distance of) a given location. I do this simply with:
from math import sqrt

def calc_distance(a, b):
return sqrt((a[0]-b[0])**2+(a[1]-b[1])**2+(a[2]-b[2])**2)
def found_in_dict(search, points, distance):
for point in points:
if calc_distance(search, point) <= distance:
return points[point]
founds = found_in_dict((20, 20, 55), pheromones_positions, 10)
# found [1, 4] (A and D)
But with a lot of pheromones it's very slow (testing them one by one...). How can I organize these 3D positions to find "positions within a distance of a given position" more quickly?
Do algorithms or libraries (numpy?) exist that can help me with this?
You should compute all (squared) distances at once. With NumPy you can simply subtract the target point of size 1x3 from the (nx3) array of all position coordinates and sum the squared coordinate differences to obtain a list with n elements:
import numpy as np

keys = list(pheromones_positions.keys())
squaredDistances = np.sum((np.array(keys) - (20, 20, 55))**2, axis=1)
idx = np.where(squaredDistances < 10**2)[0]
for i in idx:
    print(pheromones_positions[keys[i]])
Output:
[1, 4]
By the way: Since your return statement is within the for-loop over all points, it will stop iterating after finding a first point. So you might miss a second or third match.
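Beyond the vectorized approach above, and since the question explicitly asks about libraries, here is a hedged sketch using SciPy's cKDTree spatial index, which avoids testing every pheromone one by one (it assumes the pheromones_positions dict from the question):
import numpy as np
from scipy.spatial import cKDTree

positions = list(pheromones_positions.keys())
tree = cKDTree(np.array(positions))
idx = tree.query_ball_point((20, 20, 55), r=10)   # indexes of positions within distance 10
print([pheromones_positions[positions[i]] for i in idx])   # [[1, 4]] for the sample data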