Clustering of sequential data - python

Given the following scenario, I have a really long street. Each house on the street has some number of children. If I were to sequentially append the number of children in each house along an array, I could get some array like:
x = [1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,2,1,1,1,1,2,2,2,2]
I want to locationally determine areas where the households cluster, i.e. I want to group the 2's together, the 3's together, and the 2's at the end together. Normally on 1D data I would sort, take differences, and find the clusters of 1's, 2's, and 3's. But here I want to keep the index of these values as a factor. So I want to end up identifying clusters as:
index: value
0-4 : 1
5-8: 2
9-12: 1
13-16: 3
17-20: 1
21-24: 2
I have seen mean shift used for this kind of detection and would like to implement it in python. I have also seen kernel density functions. Does anyone know how best to implement this in python?
Edit: To make something clear, I have simplified the problem. In the actual problem I am trying to address, each cluster has a Gaussian distribution of values around its integer value. So I would have a list more like:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]

A simple approach:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
cluster = []
for i, v in enumerate(x):
    v = round(v)  # snap each noisy value to its nearest integer level
    if not cluster or cluster[-1][2] != v:
        cluster.append([i, i, v])  # start a new [start, end, value] run
    else:
        cluster[-1][1] = i  # extend the current run
This results in a list of [start, end, value] lists:
[[ 0,  3,  1],
 [ 4,  7,  2],
 [ 8, 11,  1],
 [12, 15,  3],
 [16, 19,  1],
 [20, 22,  2],
 [23, 23, 12]]
Your desired output wasn't zero-based, so the indices look a bit different.
Edit: updated the algorithm for the updated version of the problem.
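Since the question explicitly mentions mean shift: a minimal sketch, assuming scikit-learn is available, that clusters the values with MeanShift and then groups consecutive indices by cluster label. The bandwidth of 0.5 is a guess for this toy data, not a tuned value:
import numpy as np
from sklearn.cluster import MeanShift

x = np.array([0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9,
              3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0])

# cluster the values; MeanShift expects a 2D (n_samples, n_features) array
labels = MeanShift(bandwidth=0.5).fit_predict(x.reshape(-1, 1))

# group consecutive indices that share a label into [start, end, label] runs
runs = []
for i, lab in enumerate(labels):
    if not runs or runs[-1][2] != lab:
        runs.append([i, i, lab])
    else:
        runs[-1][1] = i
print(runs)
Note the labels are arbitrary cluster ids, not the underlying integer levels; the level for each run can be recovered afterwards, e.g. as the mean of the values in that run.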

Related

Storing multiple arrays in a np.zeros or np.ones

I'm trying to initialize a dummy array of length n using np.zeros(n) with dtype=object. I want to use this dummy array to store n copies of another array of length m.
I'm trying to avoid a for loop to set the value at each index.
I tried using the code below but keep getting an error:
temp = np.zeros(10, dtype=object)
arr = np.array([1.1,1.2,1.3,1.4,1.5])
res = temp * arr
The desired result should be:
np.array([[1.1,1.2,1.3,1.4,1.5], [1.1,1.2,1.3,1.4,1.5], ... 10 copies])
I keep getting the error:
operands could not be broadcast together with shapes (10,) (5,)
I understand that this error arises because NumPy thinks I'm trying to multiply the two arrays element-wise, and their shapes don't broadcast.
So how do I achieve the task?
np.tile() is a built-in function that repeats a given array according to reps. Passing a tuple for reps stacks the copies along a new first axis, which looks like exactly what you need, i.e.:
res = np.tile(arr, (10, 1))
>>> arr = np.array([1.1,1.2,1.3,1.4,1.5])
>>> arr
array([1.1, 1.2, 1.3, 1.4, 1.5])
>>> np.array([arr]*10)
array([[1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5],
       [1.1, 1.2, 1.3, 1.4, 1.5]])
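For the 2D result, the key detail is that reps must be a tuple so the copies stack as rows rather than being concatenated; a quick sanity check using only standard NumPy:
import numpy as np

arr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
# reps=(10, 1): 10 repetitions along a new first axis, 1 along the second
res = np.tile(arr, (10, 1))
print(res.shape)  # (10, 5) -- ten row copies of arr
A plain np.tile(arr, 10) would instead return a flat array of length 50, which is not the desired shape.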

Clustering algorithm that keeps separated a given set of pairs

I have a clustering problem in which I have to split a set S of samples into C clusters where C is known. Normally, I am able to perform the clustering operation with a simple KMeans clustering, which works just fine.
To complicate things, I have a known set of pairs D of samples that cannot under any circumstances be assigned to the same cluster. Currently I am not using this information and the clustering still works fine, but I would like to introduce it to improve robustness, since it comes for free from the problem I am trying to solve.
Example: S consists of 20 samples with 5 features each, C is 3, and D forces the following pairs {(1, 3), (3, 5), (10, 19)} to be in different clusters.
I am looking for a solution in python3, preferably with numpy/sklearn/scipy.
Do you know if there is some out-of-the-box clustering algorithm that takes into account this kind of constraint? I have looked into sklearn but found no such thing.
This sounds exactly like semi-supervised clustering with pairwise constraints. In it, the unsupervised k-means clustering is augmented by (imperfect) supervision through pairwise constraints for a subset of the data. Your particular example is a cannot-link constraint; in addition, must-link constraints could be added as well.
Unfortunately, most implementations I encountered in Python are rather brittle. For example, the Python library active-semi-supervised-clustering allows you to add ml (must-link) and cl (cannot-link) relations just as you describe. The code is:
import numpy as np
from matplotlib import pyplot as plt
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans
# data
S = [[-0.2, -1.0], [3.3, 3.9], [-2.0, 0.6], [2.3, -0.8], [1.1, 1.9], [2.8, -0.3], [4.2, 2.6], [1.8, 6.8], [1.4, -0.7], [2.6, 1.8], [2.6, 5.4], [0.8, -0.6], [3.0, 1.4], [-0.6, -0.4], [0.3, -0.2], [0.8, -0.4], [4.8, 5.1], [2.4, 5.2], [2.3, 5.3], [0.9, 0.3], [2.8, 4.1], [1.4, -0.7], [2.7, 5.6], [0.8, 0.8], [1.9, 5.3], [2.3, 5.3], [2.1, 0.5], [3.1, 5.3], [2.3, 0.8], [-0.2, -0.0], [2.4, 0.0], [3.6, -0.5], [1.3, -0.4], [3.0, 4.6], [0.4, -0.1], [-2.3, -1.4], [-1.9, -1.9], [4.2, 5.4], [-1.3, -0.9], [2.7, 0.2], [1.9, 6.5], [2.8, -0.8], [0.0, -0.3], [3.2, 5.9], [1.7, 4.6], [2.3, -0.3], [2.9, 1.2], [3.5, 2.0], [1.2, 2.3], [2.0, 1.5], [4.2, 5.8], [0.7, -2.0], [-0.8, -0.9], [4.7, 0.7], [-1.2, -1.8], [3.5, 5.1], [2.6, 0.7], [1.1, 3.0], [1.9, 6.5], [2.5, 6.5], [2.2, -0.2], [-0.9, -0.3], [3.1, 4.1], [-0.7, -0.3], [4.1, 5.2], [2.6, 0.8], [4.0, 3.5], [4.2, 4.3], [3.1, 1.1], [0.9, -0.1], [-0.3, 1.2], [0.2, -0.8], [0.1, -1.1], [0.4, -1.1], [-0.1, -0.7]]
S = np.array([np.array(s) for s in S])
# no. of clusters
C = 3
# constraints (indices of points in S)
D = [(1, 3), (3, 5), (10, 19), (7, 11), (4, 6)]
# color plots
colDict = {0: '#fc6A03', 1 : 'green', 2 :'#006699'}
plt.title('Input Data ($S$)', fontsize=20)
plt.scatter(x=[s[0] for s in list(S)], y=[s[1] for s in list(S)], c='darkgrey')
plt.show()
# Naïve Clustering
clust = PCKMeans(n_clusters=C, max_iter=1000)
clust.fit(S, cl=[], ml=[])
plt.title('Naïve (unconstrained) k-Means', fontsize=18)
plt.scatter(x=[s[0] for s in list(S)], y=[s[1] for s in list(S)], c=[colDict[c] for c in clust.labels_])
plt.show()
# Constr. Clustering
const_clust = PCKMeans(n_clusters=C, max_iter=10000)
const_clust.fit(S, ml=[], cl=D)
plt.title('Constrained k-Means', fontsize=18)
plt.scatter(x=[s[0] for s in S.tolist()], y=[s[1] for s in S.tolist()], c=[colDict[c] for c in const_clust.labels_])
plt.show()
which yields the corresponding scatter plots (input, unconstrained, and constrained).
Although the constrained plot looks different, checking whether the cannot-link constraints are indeed met results in
[const_clust.labels_[d[0]] != const_clust.labels_[d[1]] for d in D]
>[True, False, True]
indicating that points with index 3 and 5 were assigned the same cluster label. Not good. However, the sample size and the distribution of the data points across the feature space seem to impact this greatly. Potentially, you will see no adverse effects when you apply it to your actual data.
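If a returned solution does violate a cannot-link pair, one naive fallback (a sketch of my own, not part of the library; repair_cannot_links is a hypothetical helper) is a post-hoc repair that reassigns one point of each violated pair to its nearest centroid with a different label:
import numpy as np

def repair_cannot_links(X, labels, n_clusters, cl_pairs):
    # reassign the second point of each violated cannot-link pair
    # to the nearest centroid other than the conflicting one
    labels = np.asarray(labels).copy()
    centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    for a, b in cl_pairs:
        if labels[a] == labels[b]:
            d = np.linalg.norm(centers - X[b], axis=1)
            d[labels[a]] = np.inf  # forbid the cluster that causes the conflict
            labels[b] = int(np.argmin(d))
    return labels

fixed_labels = repair_cannot_links(S, const_clust.labels_, C, D)
Such a greedy repair can itself introduce new violations for overlapping pairs, so it is a fallback, not a substitute for a solver that enforces the constraints.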
Unfortunately, the repository does not allow you to set a seed (to make the iterative estimation procedure reproducible) and ignores one set via np.random.seed(567). Be mindful of reproducibility and rerun the code several times.
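To get a feel for how often the constraints are actually satisfied, the fit can simply be rerun in a loop (same PCKMeans API as in the code above, nothing new assumed):
sat = []
for _ in range(20):
    m = PCKMeans(n_clusters=C, max_iter=10000)
    m.fit(S, ml=[], cl=D)
    # check every cannot-link pair in this run
    sat.append(all(m.labels_[a] != m.labels_[b] for a, b in D))
print(sum(sat), 'of', len(sat), 'runs satisfied all cannot-link constraints')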
Other repositories such as scikit-learn indicate that some clustering routines may allow constraints but don't indicate how this can be done.
Note that there are other variants of constrained k-means clustering, e.g. ones where the pairwise constraints are not certain (see this reference) or where the number of data points per cluster is constrained (see this python library).

Pandas Group By column to generate quantiles (.25, 0.5, .75)

Let's say we have CityName, Min-Temperature, Max-Temperature, Humidity of different cities.
We need an output dataframe grouped on CityName with the 0.25, 0.5 and 0.75 quantiles. New column names would be OldColumnName + 'Q1'/'Q2'/'Q3'.
Example INPUT
import pandas as pd

df = pd.DataFrame({'cityName': pd.Categorical(['a','a','a','a','b','b','b','b','a','a','a','a','b','b','b','b']),
                   'MinTemp': [1.1, 2.1, 3.1, 1.1, 2, 2.1, 2.2, 2.4, 2.5, 1.11, 1.31, 2.1, 1, 2, 2.3, 2.1],
                   'MaxTemp': [2.1, 4.2, 5.1, 2.13, 4, 3.1, 5.2, 3.4, 3.5, 2.11, 2.31, 3.1, 2, 4.3, 4.3, 3.1],
                   'Humidity': [0.29, 0.19, .45, 0.1, 0.1, 0.1, 0.2, 0.5, 0.11, 0.31, 0.1, .1, .2, 0.3, 0.3, 0.1]
                  })
Desired OUTPUT: one row per city, with columns like MinTempQ1, MinTempQ2, MinTempQ3, and likewise for the other columns.
First Approach
First you have to group your data on the column you want, which is 'cityName'. Then, because you want multiple different aggregations on each column, you can use the 'agg' function. You cannot pass parameters to the functions given to 'agg', so you define them as follows:
def quantile_50(x):
    return x.quantile(0.5)

def quantile_25(x):
    return x.quantile(0.25)

def quantile_75(x):
    return x.quantile(0.75)

quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
quantile_df
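The agg call above leaves a two-level column index of (column, function name), not the OldColumnName + 'Q1'/'Q2'/'Q3' names the question asks for. A small flattening step, keyed on the helper-function names defined above:
# flatten ('MinTemp', 'quantile_25') -> 'MinTempQ1', etc.
suffix = {'quantile_25': 'Q1', 'quantile_50': 'Q2', 'quantile_75': 'Q3'}
quantile_df.columns = [col + suffix[q] for col, q in quantile_df.columns]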
Second Approach
You can use the describe method and select the statistics you need. Using pd.IndexSlice you can choose which sub-columns to keep.
idx = pd.IndexSlice
df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]
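The describe-based variant likewise leaves '25%'/'50%'/'75%' as the second column level; an analogous flattening step (same idea as the sketch above) would be:
out = df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]
# flatten ('MinTemp', '25%') -> 'MinTempQ1', etc.
suffix = {'25%': 'Q1', '50%': 'Q2', '75%': 'Q3'}
out.columns = [col + suffix[q] for col, q in out.columns]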

How to print a value to a new array if it is within a bound of the previous value in that array in Python/Numpy

If I have an array:
StartArray=np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4 ,2.3])
I would like to loop through this array starting with StartArray[0] and only keep values that are within +/- .5 of the last kept value to yield:
EndArray=[1, 1.4, 1.2, 1.5, 1.9, 2.2, 2.3]
This is what I have tried so far, and the results don't make sense:
import numpy as np

StartArray = np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4, 2.3])
EndArray = np.empty_like(StartArray)
EndArray[0] = StartArray[0]
for i in range(len(StartArray)-1):
    if EndArray[i]+.5 > StartArray[i+1] > EndArray[i]-.5:
        EndArray[i+1] = StartArray[i+1]
Out:
array([1.        , 0.22559146, 0.13015365, 5.24910493, 0.63804761,
       0.6       , 1.73143364, 1.5       , 1.9       , 2.2       ,
       6.82525036, 0.61641556, 6.82325036])
A list is the right structure for this job:
StartArray = np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4, 2.3])
ref = StartArray[0]
End = []
for x in StartArray:
    if abs(x - ref) < .5:
        End.append(x)
        ref = x  # update the reference to the last kept value
print(np.array(End))
[ 1. 1.4 1.2 1.5 1.9 2.2 2.3]
There are multiple problems with your approach. First, you're initializing EndArray to be the same size as StartArray, which is not the size of your desired output; worse, np.empty_like returns uninitialized memory, which is where the nonsense values in your output come from. Instead, initialize EndArray as an empty list and append values as you loop through StartArray. Secondly, you want the output values to be within 0.5 of the last kept value, so you need to keep track of that value.
Adapting your code:
StartArray = np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4, 2.3])
EndArray = []
last_kept = StartArray[0]
EndArray.append(last_kept)
for i in range(len(StartArray)-1):
    if np.abs(StartArray[i+1] - last_kept) < 0.5:
        last_kept = StartArray[i+1]
        EndArray.append(last_kept)
# convert back to numpy array
EndArray = np.array(EndArray)

numpy arrays and list multiplication, finding the maximum and its index

I know that this question might seem like a repeat, but I have tried to debug my code in several ways and still don't know what is wrong. Below is my code.
def myfunc(LUT, LUT_Prob, test):
    x = []
    y = []
    z = []
    x.extend(hamming_distance(test, LUT[i]) for i in range(len(LUT)))
    y = [(len(LUT[0])) - j for j in x]
    z = [a*b for a, b in zip(y, LUT_Prob)]
    MAP = max(z)
    closest_index = z.index(max(z))
    return x, y, LUT_Prob, z, MAP, closest_index
In another script:
Winner = []
for j in range(0, 5):
    Winner.append(myfunc(LUT1, LUT_Prob1, test[j]))
print 'Winner = {}'.format(Winner)
The output is:
Winner = [([2, 4, 2, 4], [8, 6, 8, 6], [array([ 0.4, 0.2, 0.2, 0.2])], [[array([ 3.2, 1.6, 1.6, 1.6])]], [array([ 3.2, 1.6, 1.6, 1.6])], 0), ([1, 3, 1, 3], [9, 7, 9, 7], [array([ 0.4, 0.2, 0.2, 0.2])], [[array([ 3.6, 1.8, 1.8, 1.8])]], [array([ 3.6, 1.8, 1.8, 1.8])], 0), ([3, 5, 5, 3], [7, 5, 5, 7], [array([ 0.4, 0.2, 0.2, 0.2])], [[array([ 2.8, 1.4, 1.4, 1.4])]], [array([ 2.8, 1.4, 1.4, 1.4])], 0), ([3, 5, 3, 5], [7, 5, 7, 5], [array([ 0.4, 0.2, 0.2, 0.2])], [[array([ 2.8, 1.4, 1.4, 1.4])]], [array([ 2.8, 1.4, 1.4, 1.4])], 0), ([3, 3, 3, 1], [7, 7, 7, 9], [array([ 0.4, 0.2, 0.2, 0.2])], [[array([ 2.8, 1.4, 1.4, 1.4])]], [array([ 2.8, 1.4, 1.4, 1.4])], 0)]
Note: The output is the returned values x, y, LUT_Prob, z, MAP, closest_index, in that order, for each of the 5 iterations.
The errors that I am getting:
1. z is not as expected: the expectation is to multiply y and LUT_Prob element-wise, but what I am getting is the result of multiplying only the first element of y by LUT_Prob.
2. MAP should be a single value, in this case 3.2, but there is an array instead.
3. closest_index is correct in this case; however, if the 3.2 were anywhere else, closest_index would still remain 0.
So, can somebody help?
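A sketch of what the printed Winner output suggests is going on. The shapes below are assumptions read off that output, not taken from the real LUT_Prob1: it looks like LUT_Prob is a list wrapping a single numpy array, so zip stops after one pair and z is just y[0] times the whole array, which would explain all three symptoms:
import numpy as np

y = [8, 6, 8, 6]
LUT_Prob = [np.array([0.4, 0.2, 0.2, 0.2])]  # a list holding ONE array

# zip truncates to the shorter input, so only y[0] is ever used
z = [a * b for a, b in zip(y, LUT_Prob)]
print(z)       # [array([3.2, 1.6, 1.6, 1.6])] -- compare with the output above
print(max(z))  # the whole array, not the scalar 3.2
print(z.index(max(z)))  # always 0, since z has only one element

# unwrapping the inner array restores the intended element-wise product
z = [a * b for a, b in zip(y, LUT_Prob[0])]
print(z)                # approximately [3.2, 1.2, 1.6, 1.2]
print(max(z))           # 3.2
print(z.index(max(z)))  # now tracks wherever the maximum lands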
