clustering data based on multiple linear regression in Python - python

I'd like to separate the data and put it into 13 different sets of variables, one per red circle (see the image below). But I have no idea how to cluster the data based on multiple linear regression. Any idea how I can do this in Python?
Data set:
https://www.dropbox.com/s/ar5rzry0joe9ffu/dataset_v1.xlsx?dl=
Code I am using now for clustering:
print(__doc__)
import openpyxl
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
wb = openpyxl.load_workbook('dataset_v1.xlsx')
sheet = wb.worksheets[0]
ws = wb.active
row_count = sheet.max_row
data = np.zeros((row_count, 2))
index = 0
for r in ws.rows:
    data[index, 0] = r[0].value
    data[index, 1] = r[1].value
    index += 1
# Compute DBSCAN
db = DBSCAN(eps=5, min_samples=0.1).fit(data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
clusters = [data[labels == i] for i in range(n_clusters_)]
outliers = data[labels == -1]
# #############################################################################
# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 0.5]
    class_member_mask = (labels == k)
    xy = data[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = data[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Define a threshold, say 50.
Begin a new "cluster" whenever y increases by more than 50.
As long as values decrease, they are still in the previous "cluster".
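A minimal sketch of that idea, assuming data is the two-column array loaded above (x in column 0, y in column 1) and already ordered along x; the threshold of 50 is the value suggested here and should be tuned to your data:
import numpy as np
def split_on_jumps(y, threshold=50):
    # start a new segment each time y jumps up by more than threshold;
    # decreasing values stay in the current segment
    labels = np.zeros(len(y), dtype=int)
    segment = 0
    for i in range(1, len(y)):
        if y[i] - y[i - 1] > threshold:
            segment += 1
        labels[i] = segment
    return labels
labels = split_on_jumps(data[:, 1])
clusters = [data[labels == i] for i in range(labels.max() + 1)]
print('Found %d clusters' % len(clusters))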

Related

Python: Point Clustering/Averaging

I have a detector which returns the detected objects' bounding-box centers, and it works fine for the most part. What I want to do, however, is to consider 10 frames rather than 1 when conducting the detection, so that I can eliminate more false positives.
The way my detector normally works is as follows:
1. Get a frame.
2. Conduct the algorithm.
3. Record the centers into a dictionary for each frame.
The approach I thought would help reduce false positives is:
1. Set up a loop of 10:
   1. Get a frame.
   2. Conduct the algorithm.
   3. Record the centers into a dictionary for each frame.
2. Loop over the recorded points after every 10 frames.
3. Use a clustering algorithm or simple distance averaging.
4. Get the final centers.
So, I've already implemented some of this logic. I am on step 1.3; I need to find a way to group the coordinates and finalize the estimation.
After 10 frames, my dictionary holds values such as these (I can't paste all of them):
(4067.0, 527.0): ['torx8', 'screw8'],
(4053.0, 527.0): ['torx8', 'screw1'],
(2627.0, 707.0): ['torx8', 'screw12'],
(3453.0, 840.0): ['torx6', 'screw14'],
(3633.0, 1373.0): ['torx6', 'screw15'],
(3440.0, 840.0): ['torx6', 'screw14'],
(3447.0, 840.0): ['torx6', 'screw14'],
(1660.0, 1707.0): ['torx8', 'screw3'],
(2633.0, 700.0): ['torx8', 'screw7'],
(2627.0, 693.0): ['torx8', 'screw8'],
(4060.0, 533.0): ['torx8', 'screw6'],
(3627.0, 1367.0): ['torx6', 'screw13'],
(2600.0, 680.0): ['torx8', 'screw15'],
(2607.0, 680.0): ['torx8', 'screw7']
As you can see, many of these points are really the same point with a bit of pixel shift, which is why I am trying to find a way to get rid of these so-called duplicates.
Is there an intelligent and efficient way of dealing with this problem? The first thing that came to my mind was k-means clustering, but I am not sure whether it fits this problem.
Has anyone had a similar experience?
EDIT: Okay, so I made some progress and I am able to cluster the points using Hierarchical Clustering, because in my case I have no a priori knowledge of the number of clusters. Hence, an approximation is required.
# cluster now
points = StandardScaler().fit_transform(points)
db = self.dbscan.fit(points)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise_ = list(db.labels_).count(-1)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = points[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = points[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
This works great: I am able to eliminate the false positives (see the black dot). However, I still don't know how to get the average per cluster. That is, after I find the clusters, how can I loop over each cluster and average all the X, Y values? (Using the values from before StandardScaler().fit_transform(points), obviously, since after that I lose the pixel coordinates; they are scaled to values around minus one to one.)
Okay, finally, I got it. Since I would also need my points in their original scale (not between -1 and 1), I had to rescale them back as well. Anyway, here is the full magic:
def cluster_dbscan(self, points, visualize=False):
    # scale the points between -1 and 1
    scaler = StandardScaler()
    scaled_points = scaler.fit_transform(points)
    # cluster
    db = DBSCAN(eps=self.clustering_epsilon, min_samples=self.clustering_min_samples, metric='euclidean')
    db.fit(scaled_points)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    n_noise_ = list(db.labels_).count(-1)
    if visualize:
        # Black removed and is used for noise instead.
        unique_labels = set(db.labels_)
        colors = [plt.cm.Spectral(each)
                  for each in np.linspace(0, 1, len(unique_labels))]
        for k, col in zip(unique_labels, colors):
            if k == -1:
                # Black used for noise.
                col = [0, 0, 0, 1]
            class_member_mask = (db.labels_ == k)
            xy = scaled_points[class_member_mask & core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=14)
            xy = scaled_points[class_member_mask & ~core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=6)
        plt.title('Estimated number of clusters: %d' % n_clusters_)
        plt.show()
    # back to original scale
    points = scaler.inverse_transform(scaled_points)
    # loop over the clusters, get the centers
    centers = np.zeros((n_clusters_, 2))  # for x and y
    for i in range(0, n_clusters_):
        cluster_points = points[db.labels_ == i]
        cluster_mean = np.mean(cluster_points, axis=0)
        centers[i, :] = cluster_mean
    # we need the original points
    return centers
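One design note: db.labels_ lines up row-for-row with whatever was passed to fit, so the inverse_transform step is optional; the same centers could be computed straight from the original, unscaled points argument. A minimal sketch of that variant:
import numpy as np
def cluster_centers(points, labels, n_clusters):
    # average the original-coordinate points for each cluster label
    return np.array([points[labels == i].mean(axis=0) for i in range(n_clusters)])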

How to cluster data with user_id - k-means algorithm

I want to cluster user data by user_id, because I need to analyze each cluster after clustering.
My clustering algorithm is k-means with k=3, and I'm using Python.
My data:
V1,V2
100,10
150,20
200,10
120,15
300,10
400,10
300,10
400,10
I removed the user_id column from this data; as far as I know, user_id should be removed for k-means clustering.
My Python code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('C:/Users/S.M_Emamian/Desktop/xclara.csv')
print("Input Data and Shape")
print(data.shape)
data.head()
# Getting the values and plotting it
f1 = data['V1'].values
f2 = data['V2'].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=7)
# Euclidean Distance Calculator
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)
# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Initial Centroids")
print(C)
# Plotting along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')
# To store the value of centroids when it updates
C_old = np.zeros(C.shape)
# Cluster labels (0, 1, 2)
clusters = np.zeros(len(X))
# Error func. - Distance between new centroids and old centroids
error = dist(C, C_old, None)
# Loop will run till the error becomes zero
while error != 0:
    # Assigning each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Storing the old centroid values
    C_old = deepcopy(C)
    # Finding the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)
colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')
'''
==========================================================
scikit-learn
==========================================================
'''
from sklearn.cluster import KMeans
# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
# Comparing with scikit-learn centroids
print("Centroid values")
print("Scratch")
print(C) # From Scratch
print("sklearn")
print(centroids) # From sci-kit learn
My code works fine and visualizes my data as well, but I need to keep user_id.
For example, I would like to know which cluster user_id=5 belongs to.
Just add user_id back after clustering.
Actually, what you probably want to do is the opposite: add the cluster labels to your original data that still has the user ids.
As long as you don't change the data order, this is a trivial stacking operation.
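For example, a minimal sketch with pandas (assuming the original CSV still contains the user ids in a column named user_id; adjust the names to your actual file):
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv('xclara.csv')                       # still has user_id, V1, V2
kmeans = KMeans(n_clusters=3).fit(df[['V1', 'V2']])
df['cluster'] = kmeans.labels_                       # row order unchanged, so labels align
# Which cluster does user_id=5 belong to?
print(df.loc[df['user_id'] == 5, 'cluster'])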

sklearn DBSCAN to cluster GPS positions with big epsilon

I want to use DBSCAN from sklearn to find clusters in my GPS positions. I don't understand why the coordinate [18.28, 57.63] (lower right corner in the figure) is clustered together with the other coordinates to the left. Could it be a problem with the big epsilon? I'm on sklearn version 0.19.0.
To reproduce this:
I copied demo code from here: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html but I replaced the sample data with a few coordinates (see variable X in the code below). I got the inspiration from here: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# #############################################################################
# Generate sample data
X = np.array([[11.95, 57.70],
              [16.28, 57.63],
              [16.27, 57.63],
              [16.28, 57.66],
              [11.95, 57.63],
              [12.95, 57.63],
              [18.28, 57.63],
              [11.97, 57.70]])
# #############################################################################
# Compute DBSCAN
kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree', metric='haversine').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
I recently made the same mistake (using hdbscan), and it was the cause of some 'strange' results. For example, the same point would sometimes be included in a cluster, and sometimes be flagged as a noise point. "How can this be?", I kept wondering. It turned out to be because I was passing lat/lon directly and not converting to radians first.
The OP's self-supplied answer is correct, but short on details. One could, of course, just multiply the lat/lon values by π/180, but if you're already using numpy, the simplest fix is to change this line in the original code:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(X)
to:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(np.radians(X))
The haversine metric requires data in radians, not degrees.
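For completeness, a minimal corrected sketch of the code above; note as well that, as far as I know, sklearn's haversine metric expects coordinates ordered as [latitude, longitude], so the columns of X (which hold [lon, lat] above) may need swapping too:
import numpy as np
from sklearn.cluster import DBSCAN
kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian                 # 400 km neighbourhood, in radians
coords = np.radians(X[:, ::-1])                # degrees -> radians, [lon, lat] -> [lat, lon]
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree',
            metric='haversine').fit(coords)
print(db.labels_)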

Labeling K-means cluster data points with matplotlib

I have pulled the following data from a .csv file (dataBoth.csv) and performed k-means clustering, visualised with matplotlib. The data has 3 columns (Country, Birth Rate, Life Expectancy).
I need help to output:
The number of countries belonging to each cluster.
The list of countries belonging to each cluster.
The mean Life Expectancy and Birth Rate for each cluster.
Here is my code:
import csv
import matplotlib.pyplot as plt
import sys
import pylab as plt
import numpy as np
plt.ion()
#K-Means clustering implementation
# data = set of data points
# k = number of clusters
# maxIters = maximum number of iterations k-means executes
def kMeans(data, K, maxIters=10, plot_progress=None):
    centroids = data[np.random.choice(np.arange(len(data)), K), :]
    for i in range(maxIters):
        # Cluster Assignment step
        C = np.array([np.argmin([np.dot(x_i - y_k, x_i - y_k) for y_k in centroids])
                      for x_i in data])
        # Move centroids step
        centroids = [data[C == k].mean(axis=0) for k in range(K)]
        if plot_progress is not None:
            plot_progress(data, C, np.array(centroids))
    return np.array(centroids), C
# Calculates euclidean distance between
# a data point and all the available cluster
# centroids.
def euclidean_dist(data, centroids, clusters):
    for instance in data:
        mu_index = min([(i[0], np.linalg.norm(instance - centroids[i[0]]))
                        for i in enumerate(centroids)], key=lambda t: t[1])[0]
        try:
            clusters[mu_index].append(instance)
        except KeyError:
            clusters[mu_index] = [instance]
    # If any cluster is empty then assign one point
    # from the data set randomly so as to not have empty
    # clusters and 0 means.
    for cluster in clusters:
        if not cluster:
            cluster.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())
    return clusters
# this function reads the data from the specified file
def csvRead(file):
    return np.genfromtxt(file, delimiter=',')
# function to show the results on the screen in the form of 3 clusters
def show(X, C, centroids, keep=False):
    import time
    time.sleep(0.5)
    plt.cla()
    plt.plot(X[C == 0, 0], X[C == 0, 1], '*b',
             X[C == 1, 0], X[C == 1, 1], '*r',
             X[C == 2, 0], X[C == 2, 1], '*g')
    plt.plot(centroids[:, 0], centroids[:, 1], '*m', markersize=20)
    plt.draw()
    if keep:
        plt.ioff()
        plt.show()
# generate 3 cluster data
data = csvRead('dataBoth.csv')
m1, cov1 = [9, 8], [[1.5, 2], [1, 2]]
m2, cov2 = [5, 13], [[2.5, -1.5], [-1.5, 1.5]]
m3, cov3 = [3, 7], [[0.25, 0.5], [-0.1, 0.5]]
data1 = np.random.multivariate_normal(m1, cov1, 250)
data2 = np.random.multivariate_normal(m2, cov2, 180)
data3 = np.random.multivariate_normal(m3, cov3, 100)
X = np.vstack((data1,np.vstack((data2,data3))))
np.random.shuffle(X)
# calls to the functions
# first to find centroids using k-means
centroids, C = kMeans(X, K = 3, plot_progress = show)
#second to show the centroids on the graph
show(X, C, centroids, True)
Maybe you can use annotate:
http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.annotate
More examples:
http://matplotlib.org/users/annotations.html#plotting-guide-annotation
This will let you place a text label near each point.
Or you can use colours as in this post.
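A minimal sketch of the annotate approach, which also prints the counts, member lists, and means asked for above (assuming countries is a list of country names aligned row-for-row with X, that X holds [Birth Rate, Life Expectancy], and that C is the cluster assignment returned by kMeans):
# label every point with its country name
fig, ax = plt.subplots()
point_colours = ['b', 'r', 'g']
for i, country in enumerate(countries):
    ax.scatter(X[i, 0], X[i, 1], c=point_colours[C[i]], s=7)
    ax.annotate(country, (X[i, 0], X[i, 1]), fontsize=6)
plt.show()
# per-cluster summaries
for j in range(3):
    members = X[C == j]
    names = [c for i, c in enumerate(countries) if C[i] == j]
    print('Cluster %d: %d countries' % (j, len(names)))
    print(', '.join(names))
    print('Mean Birth Rate: %.2f, Mean Life Expectancy: %.2f'
          % (members[:, 0].mean(), members[:, 1].mean()))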

How to get the result of DBSCAN, referring to the example from http://scikit-learn.org/

Referring to this example of using DBSCAN: the real data input for the clustering process is 'X', but following the example, I used 'X1' to build the clustering model.
# -*- coding: utf-8 -*-
"""
===================================
Demo of DBSCAN clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
"""
#print(__doc__)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1 = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X1)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool) # make a False matrix with the shape of db.labels_
core_samples_mask[db.core_sample_indices_] = True # set True wherever the index appears in db.core_sample_indices_
labels = db.labels_
print "cluster: ", set(labels)
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
In this case I want to get the members of the noise cluster, so I print xy when k == -1. Unfortunately, xy refers to X1, not the real data X.
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels == k)
    if k == -1:
        # Black used for noise.
        xy = X1[class_member_mask]
        print "Noise :", xy
    else:
        xy = X1[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        xy = X1[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
When I try to replace X1 with 'X', I get an error:
xy = X[class_member_mask]
error:
xy=X[class_member_mask&~core_samples_mask]
TypeError: only integer arrays with one element can be converted to an index
Maybe it's because the formats of X1 and X are different. I think it would be solved if I knew how to convert X to the same format as X1:
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1=[[ 0.8406627 -1.30435512]
[ 0.25219881 0.56856505]
[ 0.54643076 0.10033501]
[-1.51319287 -0.83612508]
[-1.51319287 -0.60201006]
[ 0.25219881 0.10033501]
[ 1.13489465 1.97325518]]
Please help me, or give a suggestion...
Convert X1 to a numpy array:
X1=[[ 0.8406627, -1.30435512],
[ 0.25219881, 0.56856505],
[ 0.54643076, 0.10033501],
[-1.51319287, -0.83612508],
[-1.51319287, -0.60201006],
[ 0.25219881, 0.10033501],
[ 1.13489465, 1.97325518]]
X1 = np.asarray(X1)
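The same conversion is what the original X needs: a plain Python list of tuples cannot be indexed with a boolean mask, which is exactly what raises the TypeError above. Converting X once up front lets you print the noise members in the original coordinates, e.g.:
import numpy as np
X = np.asarray(X)           # lists don't support boolean-mask indexing; arrays do
noise = X[labels == -1]     # noise members in the original (unscaled) coordinates
print(noise)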
