sklearn DBSCAN to cluster GPS positions with big epsilon - python

I want to use DBSCAN from sklearn to find clusters in my GPS positions. I don't understand why the coordinate [18.28, 57.63] (the lower right corner in the figure) is clustered together with the other coordinates to the left. Could it be a problem with the large epsilon? sklearn version 0.19.0.
To reproduce this:
I copied demo code from here: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html but I replaced the sample data with a few coordinates (see variable X in the code below). I got the inspiration from here: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# #############################################################################
# Generate sample data
X = np.array([[11.95, 57.70],
              [16.28, 57.63],
              [16.27, 57.63],
              [16.28, 57.66],
              [11.95, 57.63],
              [12.95, 57.63],
              [18.28, 57.63],
              [11.97, 57.70]])
# #############################################################################
# Compute DBSCAN
kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree', metric='haversine').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

I recently made the same mistake (using hdbscan), and it was the cause of some 'strange' results. For example, the same point would sometimes be included in a cluster, and sometimes be flagged as a noise point. "How can this be?", I kept wondering. It turned out to be because I was passing lat/lon directly and not converting to radians first.
The OP's self-supplied answer is correct, but short on details. One could, of course, just multiply the lat/lon values by π/180, but if you're already using numpy anyway, the simplest fix is to change this line in the original code:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(X)
to:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(np.radians(X))

The haversine metric requires the data to be in radians.
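To see why this matters, here is a minimal sketch in plain numpy (the haversine formula written out by hand; the points are (lat, lon) pairs): passing degrees where radians are expected yields a meaningless distance.

import numpy as np

def haversine_km(p1, p2, earth_radius_km=6371.0088):
    """Great-circle distance between two (lat, lon) points given in radians."""
    dlat, dlon = p2[0] - p1[0], p2[1] - p1[1]
    a = np.sin(dlat / 2) ** 2 + np.cos(p1[0]) * np.cos(p2[0]) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

p1 = np.array([57.70, 11.95])  # (lat, lon) in degrees
p2 = np.array([57.63, 18.28])

print(haversine_km(p1, p2))                          # degrees fed in as radians: nonsense
print(haversine_km(np.radians(p1), np.radians(p2)))  # ~376 km, the actual distance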

selecting data points neighbourhood to support vectors

I have been thinking about this but am not sure how to do it. I have binary imbalanced data, and would like to use an SVM to select just a subset of the majority-class data points nearest to the support vector. Thereafter, I can fit a binary classifier on this "balanced" data.
To illustrate what I mean, a MWE:
# packages import
from collections import Counter
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
import seaborn as sns
# sample data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=1)
# class distribution summary
print(Counter(y))  # Counter({0: 91, 1: 9})
# fit svm model
svc_model = SVC(kernel='linear', random_state=32)
svc_model.fit(X, y)
plt.figure(figsize=(10, 8))
# Plotting our two-features-space
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
# Constructing a hyperplane using a formula.
w = svc_model.coef_[0] # w consists of 2 elements
b = svc_model.intercept_[0] # b consists of 1 element
x_points = np.linspace(-1, 1) # generating x-points from -1 to 1
y_points = -(w[0] / w[1]) * x_points - b / w[1] # getting corresponding y-points
# Plotting a red hyperplane
plt.plot(x_points, y_points, c='r')
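For reference, the plotted line is the SVC decision boundary w·x + b = 0: writing it out as w[0]*x + w[1]*y + b = 0 and solving for y gives y = -(w[0]/w[1])*x - b/w[1], which is exactly the y_points expression above.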
The two classes are well separated by the hyperplane. We can see the support vectors for both classes (even better for class 1).
Since the minority class 1 has 9 data points, I want to down-sample the majority class 0 by selecting its support vector and the 8 other data points nearest to it, so that the class distribution becomes {0: 9, 1: 9}, ignoring all the other data points of class 0. I will then use this to fit a binary classifier like LR (or even SVC).
My question is how to select the data points of class 0 nearest to its support vector, in a way that reaches a balance with the data points of minority class 1.
This can be achieved as follows: Get the support vector for class 0, (sv0), iterate over all data points in class 0 (X[y == 0]), compute the distances (d) to the point represented by the support vector, sort them, take the 9 with the smallest values, and concatenate them with the points of class 1 to create the downsampled data (X_ds, y_ds).
sv0 = svc_model.support_vectors_[0]
distances = []
for i, x in enumerate(X[y == 0]):
    d = np.linalg.norm(sv0 - x)
    distances.append((i, d))
distances.sort(key=lambda tup: tup[1])
index = [i for i, d in distances][:9]
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))
plt.plot(x_points[19:-29], y_points[19:-29], c='r')
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
plt.scatter(X_ds[y_ds == 0][:, 0], X_ds[y_ds == 0][:, 1], color='yellow', alpha=0.4)
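As a side note, the same selection can be written without the explicit loop. A vectorized sketch of the identical logic:

# Distances from every class-0 point to sv0, computed in one shot.
d = np.linalg.norm(X[y == 0] - sv0, axis=1)
index = np.argsort(d)[:9]  # indices of the 9 nearest class-0 points
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))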

Python: Point Clustering/Averaging

I have a detector which returns the detected objects' bounding box centers, and it works fine for the most part. What I want to do, however, is to consider 10 frames instead of 1 when conducting the detection, so that I can eliminate more false positives.
The way my detector normally works is as follows:
1. Get a frame.
2. Run the detection algorithm.
3. Record the centers into a dictionary for each frame.
The approach I thought would help reduce false positives is:
1. Set up a loop over 10 frames:
   1. Get a frame.
   2. Run the detection algorithm.
   3. Record the centers into a dictionary for each frame.
2. Loop over the recorded points after every 10 frames.
3. Use a clustering algorithm or simple distance averaging.
4. Get the final centers.
So, I've already implemented some of this logic. I am at step 1.3; I need to find a way to group the coordinates and finalize the estimation.
After 10 frames, my dictionary holds such values (can't paste all):
(4067.0, 527.0): ['torx8', 'screw8'],
(4053.0, 527.0): ['torx8', 'screw1'],
(2627.0, 707.0): ['torx8', 'screw12'],
(3453.0, 840.0): ['torx6', 'screw14'],
(3633.0, 1373.0): ['torx6', 'screw15'],
(3440.0, 840.0): ['torx6', 'screw14'],
(3447.0, 840.0): ['torx6', 'screw14'],
(1660.0, 1707.0): ['torx8', 'screw3'],
(2633.0, 700.0): ['torx8', 'screw7'],
(2627.0, 693.0): ['torx8', 'screw8'],
(4060.0, 533.0): ['torx8', 'screw6'],
(3627.0, 1367.0): ['torx6', 'screw13'],
(2600.0, 680.0): ['torx8', 'screw15'],
(2607.0, 680.0): ['torx8', 'screw7']
As you can notice, most of these points are the same points with a bit of pixel shift, which is why I am trying to find a way to get rid of these so-called duplicates.
Is there an intelligent and efficient way of dealing with this problem? The first thing that came to my mind was k-means clustering, but I am not sure whether it fits this problem.
Did anyone have similar experience?
EDIT: Okay, so I made some progress and am able to cluster the points using hierarchical clustering, since in my case I have no a priori knowledge of the number of clusters. Hence, an approximation is required.
# cluster now
points = StandardScaler().fit_transform(points)
db = self.dbscan.fit(points)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise_ = list(db.labels_).count(-1)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = points[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = points[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
which works great. I am able to eliminate the false positives (see the black dot); however, I still don't know how I could get the average per cluster. Like, after I find the clusters, how can I loop over each cluster and average all the X, Y values? (Using the values from before StandardScaler().fit_transform(points), obviously, since after that transform I lose the pixel coordinates.)
Okay, finally, I got it. Since I also need my points in their original scale (not the standardized one), I had to rescale back as well. Anyway, here is the full magic:
def cluster_dbscan(self, points, visualize=False):
    # standardize the points (zero mean, unit variance per axis)
    scaler = StandardScaler()
    scaled_points = scaler.fit_transform(points)
    # cluster
    db = DBSCAN(eps=self.clustering_epsilon, min_samples=self.clustering_min_samples,
                metric='euclidean')
    db.fit(scaled_points)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    n_noise_ = list(db.labels_).count(-1)
    if visualize:
        # Black removed and is used for noise instead.
        unique_labels = set(db.labels_)
        colors = [plt.cm.Spectral(each)
                  for each in np.linspace(0, 1, len(unique_labels))]
        for k, col in zip(unique_labels, colors):
            if k == -1:
                # Black used for noise.
                col = [0, 0, 0, 1]
            class_member_mask = (db.labels_ == k)
            xy = scaled_points[class_member_mask & core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=14)
            xy = scaled_points[class_member_mask & ~core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=6)
        plt.title('Estimated number of clusters: %d' % n_clusters_)
        plt.show()
    # back to the original scale -- we need the centers in pixel coordinates
    points = scaler.inverse_transform(scaled_points)
    # loop over the clusters and average each one's points to get its center
    centers = np.zeros((n_clusters_, 2))  # x and y
    for i in range(n_clusters_):
        cluster_points = points[db.labels_ == i]
        centers[i, :] = np.mean(cluster_points, axis=0)
    return centers
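A small design note: the inverse_transform round-trip is not strictly necessary. db.labels_ is aligned row-for-row with whatever array was passed to fit(), so if you keep the unscaled points around you can average them per label directly. A minimal sketch of the equivalent center computation (assuming points still holds the original pixel coordinates):

# db.labels_ lines up with the rows given to fit(), so the original
# (unscaled) points can be averaged per cluster without inverse_transform.
centers = np.array([points[db.labels_ == i].mean(axis=0)
                    for i in range(n_clusters_)])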

clustering data based on multiple linear regression in Python

I'd like to separate the data into 13 different sets of variables, one per red circle (see the image below), but I have no idea how to cluster the data based on multiple linear regression. Any idea how I can do this in Python?
Data set:
https://www.dropbox.com/s/ar5rzry0joe9ffu/dataset_v1.xlsx?dl=
Code I am using now for clustering:
print(__doc__)
import openpyxl
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
wb = openpyxl.load_workbook('dataset_v1.xlsx')
sheet = wb.worksheets[0]
ws = wb.active
row_count = sheet.max_row
data = np.zeros((row_count, 2))
index = 0
for r in ws.rows:
    data[index, 0] = r[0].value
    data[index, 1] = r[1].value
    index += 1
# Compute DBSCAN
db = DBSCAN(eps=5, min_samples=0.1).fit(data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
clusters = [data[labels == i] for i in range(n_clusters_)]
outliers = data[labels == -1]
# #############################################################################
# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 0.5]
    class_member_mask = (labels == k)
    xy = data[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = data[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
1. Define a threshold, say 50.
2. Begin a new "cluster" whenever y increases by more than 50.
3. As long as the values decrease, they remain in the previous "cluster".
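A minimal sketch of that rule (assuming the rows in data are already ordered along x, and using the suggested threshold of 50; threshold_clusters is a hypothetical helper name):

import numpy as np

def threshold_clusters(y, threshold=50):
    """Start a new cluster whenever y jumps up by more than `threshold`."""
    labels = np.zeros(len(y), dtype=int)
    for i in range(1, len(y)):
        labels[i] = labels[i - 1] + (1 if y[i] - y[i - 1] > threshold else 0)
    return labels

labels = threshold_clusters(data[:, 1])
clusters = [data[labels == k] for k in range(labels.max() + 1)]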

How to adjust this DBSCAN algorithm python

I am using this clustering algorithm to cluster lat and lon points. I am using pre-written code which is given at http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html.
The code is as follows and takes in my file with over 4000 lat and lon points. However, I want to adjust this code so that it only defines a cluster as points within, say, 0.000020 of each other, as I want my clusters to be almost at street level.
At the moment I am getting 11 clusters, whereas in theory I want at least 100. I have tried adjusting and changing different figures, but to no avail.
print(__doc__)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
##############################################################################
# Generate sample data
input = np.genfromtxt(open("dataset_import_noaddress.csv","rb"),delimiter=",", skip_header=1)
coordinates = np.delete(input, [0,1], 1)
X, labels_true = make_blobs(n_samples=4000, centers=coordinates, cluster_std=0.0000005,
                            random_state=0)
X = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels))
##############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
You appear to be changing the data generation only:
X, labels_true = make_blobs(n_samples=4000, centers=coordinates, cluster_std=0.0000005,
                            random_state=0)
instead of the clustering algorithm itself:
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
            ^^^^^^^ on standardized data, this radius covers almost your complete data set
For geographic data, make sure to use haversine distance instead of Euclidean distance. Earth is more like a sphere than a flat Euclidean world.
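If the coordinates are available as (lat, lon) pairs in degrees, a minimal sketch of street-level clustering on the raw data, without make_blobs or StandardScaler (the 25 m radius and the (lat, lon) column order of coordinates are assumptions; tune eps_km to your notion of "street level"):

import numpy as np
from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088
eps_km = 0.025  # ~25 m neighbourhood radius (an assumed value)

# `coordinates` as read from the CSV above; columns assumed to be (lat, lon)
db = DBSCAN(eps=eps_km / kms_per_radian, min_samples=10,
            algorithm='ball_tree', metric='haversine')
labels = db.fit_predict(np.radians(coordinates))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters)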

How to get result of DBSCAN refer to example from http://scikit-learn.org/

Referring to this example of using DBSCAN: the real data input for the clustering process is X, but following the example, I used X1 (the standardized data) to build the clustering model.
# -*- coding: utf-8 -*-
"""
===================================
Demo of DBSCAN clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
"""
#print(__doc__)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X = [(9, 0), (7, 8), (8, 6), (1, 2), (1, 3), (7, 6), (10, 14)]
X1 = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X1)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)  # all-False mask, same shape as db.labels_
core_samples_mask[db.core_sample_indices_] = True          # mark the core-sample indices as True
labels = db.labels_
print("cluster:", set(labels))
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
In this case I want to get the members of the noise set, so I print xy when k == -1. Unfortunately, xy refers to X1, not the real data X.
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels == k)
    if k == -1:
        # Black used for noise.
        xy = X1[class_member_mask]
        print("Noise:", xy)
    else:
        xy = X1[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        xy = X1[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
When I try to replace X1 with X, I get an error:
xy = X[class_member_mask]
fails with:
xy = X[class_member_mask & ~core_samples_mask]
TypeError: only integer arrays with one element can be converted to an index
Maybe it's because the formats of X1 and X are different. I think it would be solved if I knew how to convert X's format to X1's.
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1=[[ 0.8406627 -1.30435512]
[ 0.25219881 0.56856505]
[ 0.54643076 0.10033501]
[-1.51319287 -0.83612508]
[-1.51319287 -0.60201006]
[ 0.25219881 0.10033501]
[ 1.13489465 1.97325518]]
Please help me, or give a suggestion...
Convert X1 to a numpy array:
X1 = [[ 0.8406627, -1.30435512],
      [ 0.25219881, 0.56856505],
      [ 0.54643076, 0.10033501],
      [-1.51319287, -0.83612508],
      [-1.51319287, -0.60201006],
      [ 0.25219881, 0.10033501],
      [ 1.13489465, 1.97325518]]
X1 = np.asarray(X1)
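The same conversion is also what makes the boolean masks work on the real data: a plain Python list of tuples cannot be indexed with a boolean array, which is exactly what the TypeError above is complaining about. A one-line sketch:

X = np.asarray(X)
noise = X[labels == -1]  # boolean indexing now works on the original coordinates
print("Noise:", noise)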
