Identify patterns within an array in Python

I've got a question on identifying patterns within an array. I'm working with the following array:
A = [1.0, 1.1, 9.0, 9.2, 0.9, 9.1, 1.0, 1.0, 1.2, 9.2, 8.9, 1.1]
Now, this array is clearly made of elements clustering about ~1 and elements about ~9.
Is there a way to separate these clusters? I.e., to get to something like:
a_1 = [1.0, 1.1, 0.9, 1.0, 1.0, 1.2, 1.1] # elements around ~1
a_2 = [9.0, 9.2, 9.1, 9.2, 8.9] # elements around ~9
Thanks a lot. Best.

You can do that by comparing each element against the two centers to see which one it is closer to, 1 or 9:
a_1 = [i for i in A if abs(i-1)<=abs(i-9)]
a_2 = [i for i in A if abs(i-1)>abs(i-9)]
But of course this is not a general solution for clustering. It only works in this case, where you already know the cluster centers (1 and 9).
If you don't know the cluster centers, you should use a clustering algorithm like k-means.
Here is a simple k-means implementation (with k=2 and an iteration limit of 100). You don't need to know the cluster centers; it picks them randomly at first.
from random import sample

A = [1.0, 1.1, 9.0, 9.2, 0.9, 9.1, 1.0, 1.0, 1.2, 9.2, 8.9, 1.1]

# pick two distinct values as the initial centers (avoids an empty cluster)
x, y = sample(sorted(set(A)), 2)

for _ in range(100):
    # assignment step: each element goes to the nearer center
    a_1 = [i for i in A if abs(i - x) <= abs(i - y)]
    a_2 = [i for i in A if abs(i - x) > abs(i - y)]
    # print(x, y)  # uncomment to watch the centers converge
    # update step: move each center to the mean of its cluster
    x = sum(a_1) / len(a_1)
    y = sum(a_2) / len(a_2)

print(a_1)
print(a_2)
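For anything beyond a toy example, a library implementation is more robust. A minimal sketch with scikit-learn's KMeans (assuming scikit-learn and numpy are installed; the reshape is needed because sklearn expects 2-D input):
import numpy as np
from sklearn.cluster import KMeans

A = [1.0, 1.1, 9.0, 9.2, 0.9, 9.1, 1.0, 1.0, 1.2, 9.2, 8.9, 1.1]
# KMeans expects a 2-D (n_samples, n_features) array
labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.array(A).reshape(-1, 1))
a_1 = [v for v, lab in zip(A, labels) if lab == labels[0]]
a_2 = [v for v, lab in zip(A, labels) if lab != labels[0]]
print(a_1)
print(a_2)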


Clustering algorithm that keeps a given set of pairs separated

I have a clustering problem in which I have to split a set S of samples into C clusters where C is known. Normally, I am able to perform the clustering operation with a simple KMeans clustering, which works just fine.
To complicate things, I have a known set of pairs D of samples that cannot under any circumstances be assigned to the same cluster. Currently I am not using this information and the clustering still works fine, but I would like to introduce it to improve robustness, since it comes for free from the problem I am trying to solve.
Example: S consists of 20 samples with 5 features each, C is 3, and D forces the following pairs {(1, 3), (3, 5), (10, 19)} to be in different clusters.
I am looking for a solution in python3, preferably with numpy/sklearn/scipy.
Do you know if there is some out-of-the-box clustering algorithm that takes into account this kind of constraint? I have looked into sklearn but found no such thing.
This sounds exactly like semi-supervised clustering with pairwise constraints. In it, unsupervised k-means clustering is augmented by (imperfect) supervision through pairwise constraints for a subset of the data. Your particular example is a cannot-link constraint; in addition, must-link constraints could be added as well.
Unfortunately, most implementations I encountered in Python are rather brittle. For example, the Python library active-semi-supervised-clustering allows adding ml (must-link) and cl (cannot-link) relations just as you describe. The code is:
import numpy as np
from matplotlib import pyplot as plt
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans
# data
S = [[-0.2, -1.0], [3.3, 3.9], [-2.0, 0.6], [2.3, -0.8], [1.1, 1.9], [2.8, -0.3], [4.2, 2.6], [1.8, 6.8], [1.4, -0.7], [2.6, 1.8], [2.6, 5.4], [0.8, -0.6], [3.0, 1.4], [-0.6, -0.4], [0.3, -0.2], [0.8, -0.4], [4.8, 5.1], [2.4, 5.2], [2.3, 5.3], [0.9, 0.3], [2.8, 4.1], [1.4, -0.7], [2.7, 5.6], [0.8, 0.8], [1.9, 5.3], [2.3, 5.3], [2.1, 0.5], [3.1, 5.3], [2.3, 0.8], [-0.2, -0.0], [2.4, 0.0], [3.6, -0.5], [1.3, -0.4], [3.0, 4.6], [0.4, -0.1], [-2.3, -1.4], [-1.9, -1.9], [4.2, 5.4], [-1.3, -0.9], [2.7, 0.2], [1.9, 6.5], [2.8, -0.8], [0.0, -0.3], [3.2, 5.9], [1.7, 4.6], [2.3, -0.3], [2.9, 1.2], [3.5, 2.0], [1.2, 2.3], [2.0, 1.5], [4.2, 5.8], [0.7, -2.0], [-0.8, -0.9], [4.7, 0.7], [-1.2, -1.8], [3.5, 5.1], [2.6, 0.7], [1.1, 3.0], [1.9, 6.5], [2.5, 6.5], [2.2, -0.2], [-0.9, -0.3], [3.1, 4.1], [-0.7, -0.3], [4.1, 5.2], [2.6, 0.8], [4.0, 3.5], [4.2, 4.3], [3.1, 1.1], [0.9, -0.1], [-0.3, 1.2], [0.2, -0.8], [0.1, -1.1], [0.4, -1.1], [-0.1, -0.7]]
S = np.array([np.array(s) for s in S])
# no. of clusters
C = 3
# constraints (indices of points in S)
D = [(1, 3), (3, 5), (10, 19), (7, 11), (4, 6)]
# color plots
colDict = {0: '#fc6A03', 1 : 'green', 2 :'#006699'}
plt.title('Input Data ($S$)', fontsize=20)
plt.scatter(x=[s[0] for s in list(S)], y=[s[1] for s in list(S)], c='darkgrey')
plt.show()
# Naïve Clustering
clust = PCKMeans(n_clusters=C, max_iter=1000)
clust.fit(S, cl=[], ml=[])
plt.title('Naïve (unconstrained) k-Means', fontsize=18)
plt.scatter(x=[s[0] for s in list(S)], y=[s[1] for s in list(S)], c=[colDict[c] for c in clust.labels_])
plt.show()
# Constr. Clustering
const_clust = PCKMeans(n_clusters=C, max_iter=10000)
const_clust.fit(S, ml=[], cl=D)
plt.title('Constrained k-Means', fontsize=18)
plt.scatter(x=[s[0] for s in S.tolist()], y=[s[1] for s in S.tolist()], c=[colDict[c] for c in const_clust.labels_])
plt.show()
which yields the corresponding scatter plots (figures not reproduced here).
Although the plot looks different, checking whether the cannot-link constraints are indeed met results in
[const_clust.labels_[d[0]] != const_clust.labels_[d[1]] for d in D]
[True, False, True]
indicating that the points with indices 3 and 5 were assigned the same cluster label. Not good. However, the sample size and the distribution of the data points across the feature space seem to impact this greatly. Potentially, you will see no adverse effects when you apply it to your actual data.
Unfortunately, the repository does not allow setting a seed (to make the iterative estimation procedure reproducible) and ignores one set via np.random.seed(567). So be mindful of reproducibility and rerun the code several times.
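As a pragmatic workaround, one option (a sketch, reusing only the PCKMeans API shown above) is to simply rerun the stochastic fit until all cannot-link constraints actually hold, up to a retry limit:
# retry the fit until every cannot-link pair lands in different clusters
for attempt in range(20):
    const_clust = PCKMeans(n_clusters=C, max_iter=10000)
    const_clust.fit(S, ml=[], cl=D)
    if all(const_clust.labels_[a] != const_clust.labels_[b] for a, b in D):
        break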
Other repositories such as scikit-learn suggest that some clustering routines may allow constraints, but they don't document how this can be done.
Note that there are other variants of constrained k-means clustering, e.g. where the pairwise constraints are not certain (see this reference) or where the number of data points per cluster is constrained (see this python library).

Is there a way to add list elements so the total is closest to a given number, taking higher elements first

I am trying to add up list elements so the total is closest or equal to 15.
Starting from the top row, the loop should add the 2nd element (index 1) of each row to a running total.
If adding a value would push the total over 15, it should be skipped and the loop should continue with the next row.
I am trying the code below; could you suggest what I am doing wrong?
list1 = [
    [5.0, 1.3, 6.6, 5.076923076923077],
    [9.0, 1.5, 7.0, 4.666666666666667],
    [4.0, 1.0, 4.0, 4.0],
    [3.0, 2.0, 5.5, 2.75],
    [7.0, 1.6, 3.5, 2.1875],
    [2.0, 1.7, 3.5, 2.058823529411765],
    [1.0, 3.0, 6.0, 2.0],
    [6.0, 1.0, 2.0, 2.0],
    [8.0, 2.5, 5.0, 2.0],
    [10.0, 1.8, 1.0, 0.5555555555555556]
]
income = 15
total = 0
for i in list1:
    if not (total + i[1] > 15):
        total += i[1]
print(total)
the output should be 14.9
The problem is that you use a break. You have to check that adding the current number in your loop will not result in the total sum being more than 15.
income = 15
total = 0
for i in list1:
    if not (total + i[1] > income):
        total += i[1]
But this code will not always work: because the numbers can come in different orders, there might be an ordering that adds up to exactly 15, but finding it is a bit more complicated.
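For completeness, a sketch of that more complicated search: an exhaustive search over subsets of the second column, so it is only practical for short lists like this one.
from itertools import combinations

values = [row[1] for row in list1]
best = 0.0
for r in range(1, len(values) + 1):
    for combo in combinations(values, r):
        s = sum(combo)
        # keep the largest sum that does not exceed the target
        if best < s <= income:
            best = s
print(best)  # best achievable sum not exceeding 15 (14.9 here, up to float rounding)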

Add values when there is a gap between elements

I have a list defined as
A = [1.0, 3.0, 6.0, 7.0, 8.0]
I am trying to fill the gaps between the elements of the list with zero values. A gap is an increment between elements that is more than one. So, for instance, between 1.0 and 3.0 there is one gap: 2.0, and between 3.0 and 6.0 there are two gaps: 4.0 and 5.0.
I am working with the code below, but it is not complete: I am missing how to add multiple values when the gap is bigger than one increment.
B = []
cnt = 0
for i in range(len(A)-1):
    if A[i] == A[i+1] - 1:
        B.append(A[cnt])
        cnt += 1
    if A[i] != A[i+1] - 1:
        B.append(A[cnt])
        B.append(0.0)
        cnt += 1
The output of this code is:
B = [1.0, 0.0, 3.0, 0.0, 6.0, 7.0]
But since there are two gaps between 3.0 and 6.0 I need B to look like this:
B = [1.0, 0.0, 3.0, 0.0, 0.0, 6.0, 7.0]
I am a bit stuck on how to do this and I already have a feeling that my code is not very optimized. Any help is appreciated!
You can use a list comprehension. Assuming your list is ordered, you can take the first and last values of A as the bounds of the full range. We use a set for O(1) membership lookups within the comprehension.
A = [1.0, 3.0, 6.0, 7.0, 8.0]
A_set = set(A)
res = [i if i in A_set else 0 for i in range(int(A[0]), int(A[-1])+1)]
print(res)
[1, 0, 3, 0, 0, 6, 7, 8]
However, for larger arrays I'd recommend you use a specialist library such as NumPy:
import numpy as np
A = np.array([1.0, 3.0, 6.0, 7.0, 8.0]).astype(int)
B = np.zeros(A.max())
B[A-1] = A
print(B)
[1. 0. 3. 0. 0. 6. 7. 8.]
Based on comments to the question, I can suggest the following solution:
B = [float(x) if x in A else 0.0 for x in range(int(min(A)), int(max(A)) + 1)]

Calculate jaccard distance using scipy in python

I have two separate lists as follows.
list1 =[[0.0, 0.75, 0.2], [0.0, 0.5, 0.7]]
list2 =[[0.9, 0.0, 0.8], [0.0, 0.0, 0.8], [1.0, 0.0, 0.0]]
I want to get a list1 x list2 jaccard distance matrix (i.e. the matrix includes 6 values: 2 x 3)
For example;
[0.0, 0.75, 0.2] in list1 with all the three lists in list2
[0.0, 0.5, 0.7] in list1 with all the three lists in list2
I actually tried both pdist and cdist. However I get the following errors respectively; TypeError: pdist() got multiple values for argument 'metric' and ValueError: XA must be a 2-dimensional array..
Please help me to fix this issue.
pdist expects an m x n 2D array of observations, so you can't pass it your two lists directly. To build the 2 x 3 distance matrix, you could do something like this:
import scipy.spatial.distance as dist

list1 = [[0.0, 0.75, 0.2], [0.0, 0.5, 0.7]]
list2 = [[0.9, 0.0, 0.8], [0.0, 0.0, 0.8], [1.0, 0.0, 0.0]]
distance = []
for elem1 in list1:
    for elem2 in list2:
        # pdist returns a 1-element array for a pair; take the scalar
        distance.append(dist.pdist([elem1, elem2], 'jaccard')[0])
You get your results in the distance list (6 values, in list1 x list2 order).
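Alternatively (a sketch, assuming scipy and numpy are available), cdist computes the full 2 x 3 matrix directly. The "XA must be a 2-dimensional array" error suggests the inputs were not passed as 2-D arrays, which converting them to NumPy arrays fixes:
import numpy as np
from scipy.spatial.distance import cdist

list1 = [[0.0, 0.75, 0.2], [0.0, 0.5, 0.7]]
list2 = [[0.9, 0.0, 0.8], [0.0, 0.0, 0.8], [1.0, 0.0, 0.0]]
# cdist takes two 2-D arrays and returns the pairwise distance matrix
D = cdist(np.array(list1), np.array(list2), metric='jaccard')
print(D.shape)  # (2, 3)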

Clustering of sequential data

Given the following scenario: I have a really long street, and each house on the street has some number of children. If I were to sequentially append the number of children in each house to an array, I could get something like:
x = [1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,2,1,1,1,1,2,2,2,2]
I want to determine, by location, the areas where the households cluster, i.e. I want to group the 2's together, the 3's together, and the 2's at the end together. Normally on 1D data I would sort, take differences, and find the clusters of 1's, 2's, and 3's. But here I want to keep the index of these values as a factor, so I want to end up identifying clusters as:
index: value
0-4 : 1
5-8: 2
9-12: 1
13-16: 3
17-20: 1
21-24: 2
I have seen mean shift used for this kind of detection and would like to implement it in python. I have also seen kernel density functions. Does anyone know how best to implement this in python?
Edit: To make something clear, I have simplified the problem. At each cluster of integers, the actual problem I would try to address has a gaussian distribution of values around that integer value. So I would have a list more like:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
A simple approach:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
cluster = []
for i, v in enumerate(x):
    v = round(v)
    # start a new [start, end, value] run, or extend the current one
    if not cluster or cluster[-1][2] != v:
        cluster.append([i, i, v])
    else:
        cluster[-1][1] = i
This results in a list of [start, end, value] lists:
[[ 0,  3,  1],
 [ 4,  7,  2],
 [ 8, 11,  1],
 [12, 15,  3],
 [16, 19,  1],
 [20, 22,  2],
 [23, 23, 12]]
Your desired output wasn't zero-based, so the indices here look a bit different.
Edit: updated the algorithm for the updated version of the problem.
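If you do want the mean-shift approach mentioned in the question, here is a minimal sketch (assuming scikit-learn and numpy are available; the bandwidth of 0.5 is a hand-picked assumption): cluster the values with MeanShift, then collapse consecutive identical labels into [start, end, label] runs.
import numpy as np
from itertools import groupby
from sklearn.cluster import MeanShift

x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
# cluster the values only; positions are recovered afterwards
labels = MeanShift(bandwidth=0.5).fit_predict(np.array(x).reshape(-1, 1))

# collapse consecutive identical labels into [start, end, label] runs
runs, i = [], 0
for lab, grp in groupby(labels):
    n = len(list(grp))
    runs.append([i, i + n - 1, int(lab)])
    i += n
print(runs)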
