Clustering algorithm that keeps separated a given set of pairs - python

I have a clustering problem in which I have to split a set S of samples into C clusters where C is known. Normally, I am able to perform the clustering operation with a simple KMeans clustering, which works just fine.
To complicate things, I have a known set of pairs D of samples that cannot under any circumstances be assigned to the same cluster. Currently I am not using this information and the clustering still works fine, but I would like to introduce it to improve robustness, since it comes for free from the problem I am trying to solve.
Example: S consists of 20 samples with 5 features each, C is 3, and D forces the following pairs {(1, 3), (3, 5), (10, 19)} to be in different clusters.
I am looking for a solution in python3, preferably with numpy/sklearn/scipy.
Do you know if there is some out-of-the-box clustering algorithm that takes into account this kind of constraint? I have looked into sklearn but found no such thing.

This sounds exactly like semi-supervised clustering with pairwise constraints. There, the unsupervised k-means clustering is augmented by (imperfect) supervision in the form of pairwise constraints on a subset of the data. Your particular example is a cannot-link constraint; must-link constraints could be added as well.
Unfortunately, most implementations I encountered in Python are rather brittle. For example, the Python library active-semi-supervised-clustering lets you add ml (must-link) and cl (cannot-link) relations just as you describe. The code is:
import numpy as np
from matplotlib import pyplot as plt
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans
# data
S = [[-0.2, -1.0], [3.3, 3.9], [-2.0, 0.6], [2.3, -0.8], [1.1, 1.9], [2.8, -0.3], [4.2, 2.6], [1.8, 6.8], [1.4, -0.7], [2.6, 1.8], [2.6, 5.4], [0.8, -0.6], [3.0, 1.4], [-0.6, -0.4], [0.3, -0.2], [0.8, -0.4], [4.8, 5.1], [2.4, 5.2], [2.3, 5.3], [0.9, 0.3], [2.8, 4.1], [1.4, -0.7], [2.7, 5.6], [0.8, 0.8], [1.9, 5.3], [2.3, 5.3], [2.1, 0.5], [3.1, 5.3], [2.3, 0.8], [-0.2, -0.0], [2.4, 0.0], [3.6, -0.5], [1.3, -0.4], [3.0, 4.6], [0.4, -0.1], [-2.3, -1.4], [-1.9, -1.9], [4.2, 5.4], [-1.3, -0.9], [2.7, 0.2], [1.9, 6.5], [2.8, -0.8], [0.0, -0.3], [3.2, 5.9], [1.7, 4.6], [2.3, -0.3], [2.9, 1.2], [3.5, 2.0], [1.2, 2.3], [2.0, 1.5], [4.2, 5.8], [0.7, -2.0], [-0.8, -0.9], [4.7, 0.7], [-1.2, -1.8], [3.5, 5.1], [2.6, 0.7], [1.1, 3.0], [1.9, 6.5], [2.5, 6.5], [2.2, -0.2], [-0.9, -0.3], [3.1, 4.1], [-0.7, -0.3], [4.1, 5.2], [2.6, 0.8], [4.0, 3.5], [4.2, 4.3], [3.1, 1.1], [0.9, -0.1], [-0.3, 1.2], [0.2, -0.8], [0.1, -1.1], [0.4, -1.1], [-0.1, -0.7]]
S = np.array(S)
# no. of clusters
C = 3
# constraints (indices of points in S)
D = [(1, 3), (3, 5), (10, 19), (7, 11), (4, 6)]
# color plots
colDict = {0: '#fc6A03', 1: 'green', 2: '#006699'}
plt.title('Input Data ($S$)', fontsize=20)
plt.scatter(x=S[:, 0], y=S[:, 1], c='darkgrey')
plt.show()
# Naïve Clustering
clust = PCKMeans(n_clusters=C, max_iter=1000)
clust.fit(S, cl=[], ml=[])
plt.title('Naïve (unconstrained) k-Means', fontsize=18)
plt.scatter(x=S[:, 0], y=S[:, 1], c=[colDict[c] for c in clust.labels_])
plt.show()
# Constr. Clustering
const_clust = PCKMeans(n_clusters=C, max_iter=10000)
const_clust.fit(S, ml=[], cl=D)
plt.title('Constrained k-Means', fontsize=18)
plt.scatter(x=S[:, 0], y=S[:, 1], c=[colDict[c] for c in const_clust.labels_])
plt.show()
which yields the unconstrained and the constrained clustering plots.
Although the constrained plot looks different, checking whether the cannot-link constraints are actually met results in
[const_clust.labels_[d[0]] != const_clust.labels_[d[1]] for d in D]
>[True, False, True]
indicating that points with index 3 and 5 were assigned the same cluster label. Not good. However, the sample size and the distribution of the data points across the feature space seem to impact this greatly. Potentially, you will see no adverse effects when you apply it to your actual data.
Unfortunately, the repository does not let you set a seed (to make the iterative estimation procedure reproducible) and ignores the one set via np.random.seed(567). Keep reproducibility in mind and rerun the code several times.
Other libraries such as scikit-learn indicate that some clustering routines may allow constraints, but they do not document how this is done.
Note that there are other variants of constrained k-means clustering, e.g. ones where the pairwise constraints are not certain (see this reference) or where the number of data points per cluster is constrained (see this Python library).
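If the available libraries turn out to be too brittle, a basic cannot-link-aware assignment step is also straightforward to write yourself with numpy. The sketch below is my own simplified, COP-KMeans-style variant, not part of any library: the function name, the greedy assignment order, and the fallback to the nearest centroid when every cluster is forbidden are all my choices (the original COP-KMeans would fail in that case instead).
import numpy as np

def cannot_link_kmeans(S, C, cl, n_iter=100, seed=0):
    """Greedy k-means that skips clusters already taken by a cannot-link partner (sketch)."""
    rng = np.random.default_rng(seed)
    centroids = S[rng.choice(len(S), size=C, replace=False)].astype(float)
    # map each point to the points it cannot share a cluster with
    conflicts = {i: set() for i in range(len(S))}
    for a, b in cl:
        conflicts[a].add(b)
        conflicts[b].add(a)
    labels = np.full(len(S), -1)
    for _ in range(n_iter):
        labels[:] = -1
        for i in range(len(S)):
            # clusters already used by a conflicting point are forbidden for point i
            forbidden = {labels[j] for j in conflicts[i] if labels[j] != -1}
            order = np.argsort(np.linalg.norm(centroids - S[i], axis=1))
            # fall back to the nearest centroid if every cluster is forbidden
            labels[i] = next((c for c in order if c not in forbidden), order[0])
        # recompute centroids; keep the old one if a cluster went empty
        for c in range(C):
            members = S[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels

labels = cannot_link_kmeans(S, C, D)
print([labels[a] != labels[b] for a, b in D])
With S, C and D defined as in the snippet above, the final print should show all True whenever the greedy pass can satisfy the constraints.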

Related

Pandas Group By column to generate quantiles (.25, 0.5, .75)

Let's say we have CityName, Min-Temperature, Max-Temperature, Humidity of different cities.
We need an output dataframe grouped on CityName and want to generate the 0.25, 0.5 and 0.75 quantiles. The new column names should be OldColumnName + 'Q1'/'Q2'/'Q3'.
Example INPUT
import pandas as pd

df = pd.DataFrame({'cityName': pd.Categorical(['a','a','a','a','b','b','b','b','a','a','a','a','b','b','b','b']),
                   'MinTemp': [1.1, 2.1, 3.1, 1.1, 2, 2.1, 2.2, 2.4, 2.5, 1.11, 1.31, 2.1, 1, 2, 2.3, 2.1],
                   'MaxTemp': [2.1, 4.2, 5.1, 2.13, 4, 3.1, 5.2, 3.4, 3.5, 2.11, 2.31, 3.1, 2, 4.3, 4.3, 3.1],
                   'Humidity': [0.29, 0.19, .45, 0.1, 0.1, 0.1, 0.2, 0.5, 0.11, 0.31, 0.1, .1, .2, 0.3, 0.3, 0.1]
                   })
OUTPUT
First Approach
First you have to group your data on the column you want, which is 'cityName'. Then, because you want several different aggregations on each column, you can use the agg function. You cannot pass parameters to the functions inside agg, so you define them as follows:
def quantile_50(x):
    return x.quantile(0.5)
def quantile_25(x):
    return x.quantile(0.25)
def quantile_75(x):
    return x.quantile(0.75)
quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
quantile_df
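If you also need the exact column names from the question (MinTempQ1, MinTempQ2, ...), you can flatten the MultiIndex columns that agg produces. A small sketch; the Q1/Q2/Q3 suffix mapping is my assumption based on the question:
quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
# the columns are a MultiIndex like ('MinTemp', 'quantile_25'); flatten them
suffix = {'quantile_25': 'Q1', 'quantile_50': 'Q2', 'quantile_75': 'Q3'}
quantile_df.columns = [col + suffix[fn] for col, fn in quantile_df.columns]
quantile_df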
Second Approach
You can use the describe method and select the statistics you need. With pd.IndexSlice you can choose which sub-columns to keep.
idx = pd.IndexSlice
df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]

Determining if vertices lie within a set vertices

In my algorithm, I am finding graphs at different thresholds. Each graph G = (V,E). These are undirected graphs found using breadth first search. I would like to determine if the vertices of another graph G' = (V',E') lie within graph G. I am unfamiliar with graph algorithms so please let me know if you would like to see code or a more thorough explanation.
For example, if I have a graph G1 which is a square with 'corner' vertices (among others, but reduced for simplicity) of {(1,1), (1,6), (6,6), (6,1)}, then a smaller square G2 defined by corner vertices {(2,2), (2,5), (5,5), (5,2)} would lie within G1. A third graph G3, defined by corners {(3,3), (3,4), (4,4), (4,3)}, would in turn lie within G2. My algorithm produces the following figure for this configuration:
A square thresholded at 2, surrounded by t=1, surrounded by t=0. (I need to fix the edges but the vertices are correct)
My algorithm works on the following matrix:
import numpy as np
A = np.zeros((7,7))
#A[A<1] = -1
for i in np.arange(1,6):
    for j in np.arange(1,6):
        A[i,j] = 1
for i in np.arange(2,5):
    for j in np.arange(2,5):
        A[i,j] = 2
for i in np.arange(3,4):
    for j in np.arange(3,4):
        A[i,j] = 3
print(A)
This creates three graphs: the first at threshold 2, the second at threshold 1, and the third at threshold 0, with vertices:
v1 = [[(3.0, 2.25), (3.0, 3.75), (2.25, 3.0), (3.75, 3.0)]]
v2 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (1.333333, 4.0), (2.0, 4.666667), (3.0, 4.666667), (4.0, 4.666667), (4.666667, 4.0), (4.666667, 3.0), (4.666667, 2.0), (4.0, 1.333333), (3.0, 1.333333)]]
v3 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (1.0, 5.5), (2.0, 5.5), (3.0, 5.5), (4.0, 5.5), (5.0, 5.5), (5.5, 5.0), (5.5, 4.0), (5.5, 3.0), (5.5, 2.0), (5.5, 1.0), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[2.25, 3.0], [3.0, 2.25]], [[3.0, 3.75], [2.25, 3.0]], [[3.0, 2.25], [3.75, 3.0]], [[3.0, 3.75], [3.75, 3.0]]]
e2 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[1.333333, 4.0], [1.333333, 3.0]], [[2.0, 4.666667], [1.333333, 4.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 4.666667], [3.0, 4.666667]], [[3.0, 1.333333], [4.0, 1.333333]], [[3.0, 4.666667], [4.0, 4.666667]], [[4.0, 1.333333], [4.666667, 2.0]], [[4.666667, 3.0], [4.666667, 2.0]], [[4.666667, 4.0], [4.666667, 3.0]], [[4.0, 4.666667], [4.666667, 4.0]]]
e3 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[1.0, 5.5], [0.5, 5.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 5.5], [2.0, 5.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 5.5], [3.0, 5.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 5.5], [4.0, 5.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 5.5], [5.0, 5.5]], [[5.0, 0.5], [5.5, 1.0]], [[5.5, 2.0], [5.5, 1.0]], [[5.5, 3.0], [5.5, 2.0]], [[5.5, 4.0], [5.5, 3.0]], [[5.5, 5.0], [5.5, 4.0]], [[5.0, 5.5], [5.5, 5.0]]]
Again, this gives graphs that look like this
This is the real data I am working on; the shapes are more complicated.
Here, for example, I have a red shape inside of a green shape. Ideally, red shapes would lie within red shapes. They would be grouped together in one object (say an array of graphs).
The graphs are connected in a clockwise fashion. I really don't know how to describe it, but perhaps the graphs in the link show this. There's a bug on two of the lines (as you can see in the first plot, in the top right corner), but the vertices are correct.
Hope this helps! I can attach a full workable example, but it would include my whole algorithm and be pages long, with many functions! I basically want to feed v1, v2, and v3 (or e1, e2, and e3) into a function. The function would tell me that G3 is contained within G2, which is contained within G1.
Your problem really does not have much to do with networks. Fundamentally, you are trying to determine if a point is inside a region described by an ordered list of points. The simplest way to do this is to create a matplotlib Path, which has a contains_point method (there is also a contains_points method to test many points simultaneously).
#!/usr/bin/env python
"""
Determine if a point is within the area defined by a path.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
point = [0.5, 0.5]
vertices = np.array([
    [0, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 0]  # NOTE the repetition of the first vertex
])
path = Path(vertices, closed=True)
print(path.contains_point(point))
# True
# plot to check visually
fig, ax = plt.subplots(1,1)
ax.add_patch(PathPatch(path))
ax.plot(point[0], point[1], 'ro')
plt.show()
Note that if a point is directly on the path, it is not inside the path. However, contains_point supports a radius argument that lets you grow or shrink the extent of the area by an increment. Whether you need a positive or negative increment depends on the ordering of the points. IIRC, radius shifts the path to the left in the direction of the path, but don't quote me on that.
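For the actual containment question (is one graph inside another?), you can build the Path from the outer graph's vertices and apply contains_points to every vertex of the inner graph at once. A minimal sketch, assuming the vertex lists are ordered along the boundary (the lists in the question may need to be sorted into traversal order first); the function name and the radius default are my own choices:
import numpy as np
from matplotlib.path import Path

def vertices_inside(outer_vertices, inner_vertices, radius=0.0):
    """True if every inner vertex falls inside the polygon traced by outer_vertices."""
    outer_vertices = np.asarray(outer_vertices)
    # close the polygon by repeating the first vertex, as above
    boundary = Path(np.vstack([outer_vertices, outer_vertices[:1]]), closed=True)
    return boundary.contains_points(np.asarray(inner_vertices), radius=radius).all()

# e.g. vertices_inside(v3[0], v2[0]) would test whether the t=1 loop lies inside the t=0 loop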

How to determine if one graph lies within another (vertices on a Cartesian grid)

In my algorithm, I am finding graphs at different thresholds. Each graph G = (V,E). These are undirected graphs found using breadth first search. I would like to determine if the vertices of another graph G' = (V',E') lie within graph G. I am unfamiliar with graph algorithms so please let me know if you would like to see code or a more thorough explanation.
For example, if I have a graph G which is a square with 'corner' vertices (among others, but reduced for simplicity) of (0,0), (0,8), (8,8), and (8,0), then the smaller square defined by corner vertices (2,2), (2,4), (4,4), and (4,2) would lie within G. I am sorry if this is an obvious question; I am just unfamiliar with working with graphs and could use a pointer or two (keywords welcome).
Edit:
My algorithm works on the following matrix:
import numpy as np
A = np.zeros((9,9))
for i in np.arange(1,8):
    for j in np.arange(1,8):
        A[i,j] = 1
for i in np.arange(2,4):
    for j in np.arange(2,4):
        A[i,j] = 2
A[A == 0] = -1  # mark the background, so the print below matches
print(A)
yields the matrix:
[[-1. -1. -1. -1. -1. -1. -1. -1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 2. 2. 1. 1. 1. 1. -1.]
[-1. 1. 2. 2. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1.]]
To create two graphs:
With vertices:
V1 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (2.0, 3.666667), (3.0, 3.666667), (3.666667, 3.0), (3.666667, 2.0), (3.0, 1.333333)]]
V2 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (0.5, 6.0), (0.5, 7.0), (1.0, 7.5), (2.0, 7.5), (3.0, 7.5), (4.0, 7.5), (5.0, 7.5), (6.0, 7.5), (7.0, 7.5), (7.5, 7.0), (7.5, 6.0), (7.5, 5.0), (7.5, 4.0), (7.5, 3.0), (7.5, 2.0), (7.5, 1.0), (7.0, 0.5), (6.0, 0.5), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[2.0, 3.666667], [1.333333, 3.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 3.666667], [3.0, 3.666667]], [[3.0, 1.333333], [3.666667, 2.0]], [[3.666667, 3.0], [3.666667, 2.0]], [[3.0, 3.666667], [3.666667, 3.0]]]
e2 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[0.5, 6.0], [0.5, 5.0]], [[0.5, 7.0], [0.5, 6.0]], [[1.0, 7.5], [0.5, 7.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 7.5], [2.0, 7.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 7.5], [3.0, 7.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 7.5], [4.0, 7.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 7.5], [5.0, 7.5]], [[5.0, 0.5], [6.0, 0.5]], [[5.0, 7.5], [6.0, 7.5]], [[6.0, 0.5], [7.0, 0.5]], [[6.0, 7.5], [7.0, 7.5]], [[7.0, 0.5], [7.5, 1.0]], [[7.5, 2.0], [7.5, 1.0]], [[7.5, 3.0], [7.5, 2.0]], [[7.5, 4.0], [7.5, 3.0]], [[7.5, 5.0], [7.5, 4.0]], [[7.5, 6.0], [7.5, 5.0]], [[7.5, 7.0], [7.5, 6.0]], [[7.0, 7.5], [7.5, 7.0]]]
I hope to use it on finding more complicated shapes like the following:
In the second picture, I have a red shape inside of a green shape. Ideally, red shapes would lie within red shapes.
I can attach a full workable example, but it would include my whole algorithm and be pages long, with many functions! I basically want to input (V1, E1) and (V2, E2) into a function, which would tell me whether one lies within the other.
You can use ray-casting to figure this out. It's a common method for dealing with this problem, so you can find additional information on it elsewhere too. The general description of the algorithm is this:
G1 and G2 are graphs whose edges form a simple polygon/convex hull, where we are attempting to determine if G2 is inside G1.
Choose some arbitrary direction in your space.
For each vertex in G2, cast a ray (a line that starts from one point and extends infinitely in a single direction) in the direction you chose.
If the ray cast from the vertex (a) intersects the edges of G1 an odd number of times, or (b) the vertex lies on one of those edges, then the vertex is inside of G1. In all other cases, the vertex is not inside of G1.
G2 is inside of G1 if and only if each vertex of G2 is inside of G1.
This will involve the following subtasks:
- Getting a list of vertices for G2
- Casting the rays
- Detecting and counting intersections
If you loop through each vertex of G2 and draw its ray by adding the value you use to represent G2 to every matrix cell in the chosen direction, then an intersection cell simply holds the sum of the values you use to represent G1 and G2. In your current case, because you're making squares, this is a little problematic. There may be a better algorithm for drawing the objects or a better way to detect intersections.
Lastly, to detect whether a vertex lies on an edge of G1, you should run the check for intersections BEFORE you cast its ray. If a vertex already produces the intersection value before the ray casting, that tells you it lies on an edge of G1. Mark that vertex as inside G1, remove it from the list of vertices that still need to be checked, and make note of this value so it doesn't count as an extra intersection for the remaining rays.
You may have to tweak this algorithm depending on things like whether you want to count nodes on the edge as inside or outside, or whether you need all vertices inside the figure, but I hope this is a helpful start.
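A minimal coordinate-based sketch of the even-odd ray-casting test, working directly on vertex coordinates rather than on the matrix; the function names are mine, and the outer polygon's vertices are assumed to be given in boundary order. Points that lie exactly on an edge can come out either way here, so add an explicit on-edge check if you need the rule from step (b) above:
def point_in_polygon(x, y, polygon):
    """Even-odd rule: cast a horizontal ray to the right of (x, y) and count edge crossings."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                     # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:                          # crossing lies to the right of the point
                inside = not inside
    return inside

def graph_inside(outer_polygon, inner_vertices):
    """G2 is inside G1 iff every vertex of G2 passes the test against G1's boundary polygon."""
    return all(point_in_polygon(px, py, outer_polygon) for px, py in inner_vertices)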

is there a parameter to set the precision for numpy.linspace?

I am trying to check if a numpy array contains a specific value:
>>> x = np.linspace(-5,5,101)
>>> x
array([-5. , -4.9, -4.8, -4.7, -4.6, -4.5, -4.4, -4.3, -4.2, -4.1, -4. ,
-3.9, -3.8, -3.7, -3.6, -3.5, -3.4, -3.3, -3.2, -3.1, -3. , -2.9,
-2.8, -2.7, -2.6, -2.5, -2.4, -2.3, -2.2, -2.1, -2. , -1.9, -1.8,
-1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1. , -0.9, -0.8, -0.7,
-0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0. , 0.1, 0.2, 0.3, 0.4,
0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5,
1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6,
2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8,
4.9, 5. ])
>>> -5. in x
True
>>> a = 0.2
>>> a
0.2
>>> a in x
False
I assigned a constant to variable a. It seems that the precision of a is not compatible with the elements in the numpy array generated by np.linspace().
I've searched the docs, but didn't find anything about this.
This is not a question of the precision of np.linspace, but rather of the type of the elements in the generated array.
np.linspace generates elements which, conceptually, equally divide the input range between them. However, these elements are then stored as floating point numbers with limited precision, which makes the generation process itself appear to lack precision.
By passing the dtype argument to np.linspace, you can specify the precision of the floating point type used to store its result, which can increase the apparent precision of the generation process.
Nevertheless, you should not use the equality operator to compare floating point numbers. Instead, use np.isclose in conjunction with np.ndarray.any, or some equivalent:
>>> floats_64 = np.linspace(-5, 5, 101, dtype='float64')
>>> floats_128 = np.linspace(-5, 5, 101, dtype='float128')
>>> print(0.2 in floats_64)
False
>>> print(floats_64[52])
0.20000000000000018
>>> print(np.isclose(0.2, floats_64).any()) # check if any element in floats_64 is close to 0.2
True
>>> print(0.2 in floats_128)
False
>>> print(floats_128[52])
0.20000000000000017764
>>> print(np.isclose(0.2, floats_128).any()) # check if any element in floats_128 is close to 0.2
True

Clustering of sequential data

Given the following scenario, I have a really long street. Each house on the street has some number of children. If I were to sequentially append the number of children in each house along an array, I could get some array like:
x = [1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,2,1,1,1,1,2,2,2,2]
I want to locationally determine areas where the households cluster, i.e. I want to group the 2's together, the 3's together, and the 2's at the end together. Normally on 1D data I would sort, compute differences, and find clusters of 1, 2, and 3. But here I want to keep the index of these values as a factor, so I want to end up identifying clusters as:
index: value
0-4 : 1
5-8: 2
9-12: 1
13-16: 3
17-20: 1
21-24: 2
I have seen mean shift used for this kind of detection and would like to implement it in Python. I have also seen kernel density functions. Does anyone know how best to implement this in Python?
Edit: To make something clear, I have simplified the problem. In the actual problem I am trying to address, the values at each cluster have a Gaussian distribution around that integer value, so I would have a list more like:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
A simple approach:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
cluster = []
for i, v in enumerate(x):
    v = round(v)
    if not cluster or cluster[-1][2] != v:
        cluster.append([i, i, v])
    else:
        cluster[-1][1] = i
This results in a list of [start, end, value] lists:
[[ 0,  3,  1],
 [ 4,  7,  2],
 [ 8, 11,  1],
 [12, 15,  3],
 [16, 19,  1],
 [20, 22,  2],
 [23, 23, 12]]
Your desired output wasn't zero-based, therefore the indices look a bit different
Edit:
updated algorithm for updated version of problem
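The same grouping can also be written with itertools.groupby, which yields the [start, end, value] triples in one pass over the rounded values; this is just a sketch equivalent to the loop above:
from itertools import groupby

x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9,
     3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]

clusters = []
# group consecutive indices whose values round to the same integer
for value, run in groupby(enumerate(x), key=lambda iv: round(iv[1])):
    indices = [i for i, _ in run]
    clusters.append([indices[0], indices[-1], value])
print(clusters)  # same [start, end, value] triples as above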
