I want to use DBSCAN from sklearn to find clusters in my GPS positions. I don't understand why the coordinate [18.28, 57.63] (the lower-right corner in the figure) is clustered together with the other coordinates to the left. Could it be a problem with too big an epsilon? sklearn version 0.19.0.
To reproduce this:
I copied the demo code from here: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html but replaced the sample data with a few coordinates of my own (see the variable X in the code below). I took the inspiration from here: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# #############################################################################
# Generate sample data
X = np.array([[11.95, 57.70],
              [16.28, 57.63],
              [16.27, 57.63],
              [16.28, 57.66],
              [11.95, 57.63],
              [12.95, 57.63],
              [18.28, 57.63],
              [11.97, 57.70]])
# #############################################################################
# Compute DBSCAN
kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree', metric='haversine').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
I recently made the same mistake (using hdbscan), and it was the cause of some 'strange' results. For example, the same point would sometimes be included in a cluster, and sometimes be flagged as a noise point. "How can this be?", I kept wondering. It turned out to be because I was passing lat/lon directly and not converting to radians first.
The OP's self-supplied answer is correct, but short on details. One could, of course, just multiply the lat/lon values by π/180, but—if you're already using numpy anyway—the simplest fix is to change this line in the original code:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(X)
to:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(np.radians(X))
The haversine metric requires the data in radians, so the latitude/longitude values must be converted before fitting.
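For completeness, here is a minimal sketch of the corrected clustering step, reusing the question's data. Two hedged observations of my own: scikit-learn's haversine metric expects (latitude, longitude) column order, and the question's X appears to be (lon, lat), so the columns should be swapped too; and by my calculation a 400 km radius still chains all eight points into one cluster even after the radians fix, so a smaller epsilon (say 100 km) is needed before [18.28, 57.63] separates out (at min_samples=2 it then becomes noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[11.95, 57.70], [16.28, 57.63], [16.27, 57.63], [16.28, 57.66],
              [11.95, 57.63], [12.95, 57.63], [18.28, 57.63], [11.97, 57.70]])

kms_per_radian = 6371.0088
epsilon = 100 / kms_per_radian  # a 100 km radius expressed in radians

# Swap the columns to (lat, lon) and convert degrees to radians before fitting.
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree',
            metric='haversine').fit(np.radians(X[:, ::-1]))
print(db.labels_)  # [18.28, 57.63] is now labelled -1 (noise), separate from the rest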
I have been thinking about this but am not sure how to do it. I have binary imbalanced data and would like to use an SVM to select just a subset of the majority-class data points, namely those nearest the support vector. Afterwards, I can fit a binary classifier on this "balanced" data.
To illustrate what I mean, an MWE:
# packages import
from collections import Counter
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
import seaborn as sns
# sample data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=1)
# class distribution summary
print(Counter(y))
Counter({0: 91, 1: 9})
# fit svm model
svc_model = SVC(kernel='linear', random_state=32)
svc_model.fit(X, y)
plt.figure(figsize=(10, 8))
# Plotting our two-features-space
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
# Constructing a hyperplane using a formula.
w = svc_model.coef_[0] # w consists of 2 elements
b = svc_model.intercept_[0] # b consists of 1 element
x_points = np.linspace(-1, 1) # generating x-points from -1 to 1
y_points = -(w[0] / w[1]) * x_points - b / w[1] # getting corresponding y-points
# Plotting a red hyperplane
plt.plot(x_points, y_points, c='r')
The two classes are well separated by the hyperplane. We can see the support vectors for both classes (even better for class 1).
Since the minority class 1 has 9 data points, I want to down-sample the majority class 0 by selecting its support vector plus the 8 other data points nearest to it, so that the class distribution becomes {0: 9, 1: 9}, ignoring all other data points of class 0. I will then use this to fit a binary classifier like LR (or even SVC).
My question is how to select the data points of class 0 nearest to that class's support vector, in a way that achieves balance with the data points of the minority class 1.
This can be achieved as follows: Get the support vector for class 0, (sv0), iterate over all data points in class 0 (X[y == 0]), compute the distances (d) to the point represented by the support vector, sort them, take the 9 with the smallest values, and concatenate them with the points of class 1 to create the downsampled data (X_ds, y_ds).
sv0 = svc_model.support_vectors_[0]
distances = []
for i, x in enumerate(X[y == 0]):
    d = np.linalg.norm(sv0 - x)
    distances.append((i, d))
distances.sort(key=lambda tup: tup[1])
index = [i for i, d in distances][:9]
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))
plt.plot(x_points[19:-29], y_points[19:-29], c='r')
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
plt.scatter(X_ds[y_ds == 0][:,0], X_ds[y_ds == 0][:,1], color='yellow', alpha=0.4)
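As a side note, the same selection can be written without the explicit loop. A short vectorized sketch, reusing the variables from the snippet above (np.argsort does the work of the sort-and-slice):
# Distances from all class-0 points to sv0 in one call,
# then the indices of the 9 smallest.
d = np.linalg.norm(X[y == 0] - sv0, axis=1)
index = np.argsort(d)[:9]

X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))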
I have a detector which returns the detected objects' bounding-box centers, and it works fine for the most part. What I want to do, however, is consider 10 frames instead of 1 when conducting the detection, so that I can eliminate more false positives.
The way my detector normally works is as follows:
1. Get a frame.
2. Conduct the algorithm.
3. Record the centers into a dictionary for each frame.
The way I thought would help reduce false positives is:
1. Set up a loop of 10:
   1. Get a frame.
   2. Conduct the algorithm.
   3. Record the centers into a dictionary for each frame.
2. Loop over the recorded points after every 10 frames.
3. Use a clustering algorithm or simple distance averaging.
4. Get the final centers.
So, I've already implemented the logic up to step 1.3; now I need to find a way to group the coordinates and finalize the estimation.
After 10 frames, my dictionary holds values like these (I can't paste all of them):
(4067.0, 527.0): ['torx8', 'screw8'],
(4053.0, 527.0): ['torx8', 'screw1'],
(2627.0, 707.0): ['torx8', 'screw12'],
(3453.0, 840.0): ['torx6', 'screw14'],
(3633.0, 1373.0): ['torx6', 'screw15'],
(3440.0, 840.0): ['torx6', 'screw14'],
(3447.0, 840.0): ['torx6', 'screw14'],
(1660.0, 1707.0): ['torx8', 'screw3'],
(2633.0, 700.0): ['torx8', 'screw7'],
(2627.0, 693.0): ['torx8', 'screw8'],
(4060.0, 533.0): ['torx8', 'screw6'],
(3627.0, 1367.0): ['torx6', 'screw13'],
(2600.0, 680.0): ['torx8', 'screw15'],
(2607.0, 680.0): ['torx8', 'screw7']
As you can see, most of these are the same points with a small pixel shift, which is why I am trying to find a way to get rid of these so-called duplicates.
Is there an intelligent and efficient way of dealing with this problem? The first thing that came to my mind was k-means clustering, but I am not sure whether it fits this problem.
Did anyone have a similar experience?
EDIT: Okay, so I made some progress and am able to cluster the points with DBSCAN (see the code below), because in my case I have no a priori knowledge of the number of clusters, so it has to be estimated rather than fixed in advance.
# cluster now
points = StandardScaler().fit_transform(points)
db = self.dbscan.fit(points)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise_ = list(db.labels_).count(-1)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = points[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = points[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
which works great. I am able to eliminate the false positives (see the black dot); however, I still don't know how to get the average per cluster. That is, after I find the clusters, how can I loop over each cluster and average all the X, Y values? (On the original pixel coordinates, obviously, i.e. before StandardScaler().fit_transform(points), since after scaling the pixel coordinates are lost; the values are standardized to roughly the minus-one-to-one range.)
Okay, I finally got it. Since I also need my points in their original scale (not roughly between -1 and 1), I had to rescale them back at the end. Anyway, here is the full magic:
def cluster_dbscan(self, points, visualize=False):
    # standardize the points (zero mean, unit variance)
    scaler = StandardScaler()
    scaled_points = scaler.fit_transform(points)

    # cluster
    db = DBSCAN(eps=self.clustering_epsilon,
                min_samples=self.clustering_min_samples,
                metric='euclidean')
    db.fit(scaled_points)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    n_noise_ = list(db.labels_).count(-1)

    if visualize:
        # Black removed and is used for noise instead.
        unique_labels = set(db.labels_)
        colors = [plt.cm.Spectral(each)
                  for each in np.linspace(0, 1, len(unique_labels))]
        for k, col in zip(unique_labels, colors):
            if k == -1:
                # Black used for noise.
                col = [0, 0, 0, 1]

            class_member_mask = (db.labels_ == k)

            xy = scaled_points[class_member_mask & core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=14)

            xy = scaled_points[class_member_mask & ~core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=6)

        plt.title('Estimated number of clusters: %d' % n_clusters_)
        plt.show()

    # back to the original scale; we need the centers in original coordinates
    points = scaler.inverse_transform(scaled_points)

    # loop over the clusters, get the centers
    centers = np.zeros((n_clusters_, 2))  # one (x, y) row per cluster
    for i in range(0, n_clusters_):
        cluster_points = points[db.labels_ == i]
        cluster_mean = np.mean(cluster_points, axis=0)
        centers[i, :] = cluster_mean

    return centers
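For what it's worth, a hypothetical usage sketch, assuming the function above (and its imports: numpy, DBSCAN, StandardScaler) is defined at module level; the Detector class and both parameter values are illustrative, not from the original code:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

class Detector:
    clustering_epsilon = 0.3         # illustrative eps on the standardized points
    clustering_min_samples = 3       # e.g. require a center in 3 of 10 frames
    cluster_dbscan = cluster_dbscan  # adopt the function above as a method

pts = np.array([[4067.0, 527.0], [4053.0, 527.0], [4060.0, 533.0],
                [2627.0, 707.0], [2633.0, 700.0], [2627.0, 693.0]])
print(Detector().cluster_dbscan(pts))  # one averaged (x, y) center per cluster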
I'd like to separate the data into 13 different sets of variables, one per red circle (see the image below). But I have no idea how to cluster the data based on multiple linear regression. Any idea how I can do this in Python?
Data set:
https://www.dropbox.com/s/ar5rzry0joe9ffu/dataset_v1.xlsx?dl=
Code I am using now for clustering:
print(__doc__)
import openpyxl
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
wb = openpyxl.load_workbook('dataset_v1.xlsx')
sheet = wb.worksheets[0]
ws = wb.active
row_count = sheet.max_row
data = np.zeros((row_count, 2))
index = 0
for r in ws.rows:
    data[index, 0] = r[0].value
    data[index, 1] = r[1].value
    index += 1
# Compute DBSCAN
db = DBSCAN(eps=5, min_samples=0.1).fit(data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
clusters = [data[labels == i] for i in range(n_clusters_)]
outliers = data[labels == -1]
# #############################################################################
# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 0.5]

    class_member_mask = (labels == k)

    xy = data[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = data[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
1. Define a threshold, say 50.
2. Begin a new "cluster" whenever y increases by more than 50.
3. As long as values decrease, they are still in the previous "cluster".
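A minimal sketch of this idea, assuming the rows are ordered by x and using the illustrative threshold of 50 (tune it to the real data):
import numpy as np

def split_on_jumps(data, threshold=50):
    # Start a new segment whenever y jumps up by more than `threshold`
    # relative to the previous point; decreasing values stay in the segment.
    segments = [[data[0]]]
    for row in data[1:]:
        if row[1] - segments[-1][-1][1] > threshold:
            segments.append([])
        segments[-1].append(row)
    return [np.asarray(s) for s in segments]

# Illustrative data: three descending runs separated by upward jumps.
data = np.array([[0, 300], [1, 260], [2, 230],
                 [3, 310], [4, 280],
                 [5, 340], [6, 290]])
print(len(split_on_jumps(data)))  # 3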
I am using this clustering algorithm to cluster lat and lon points, with the pre-written code given at http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html.
The code is as follows and takes in my file with over 4000 lat and lon points. However, I want to adjust this code so that it only defines a cluster as points within, say, 0.000020 of each other, since I want my clusters to be almost at street level.
At the moment I am getting 11 clusters, whereas in theory I want at least 100. I have tried adjusting and changing different figures, but to no avail.
print(__doc__)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
##############################################################################
# Generate sample data
input = np.genfromtxt(open("dataset_import_noaddress.csv","rb"),delimiter=",", skip_header=1)
coordinates = np.delete(input, [0,1], 1)
X, labels_true = make_blobs(n_samples=4000, centers=coordinates, cluster_std=0.0000005,
                            random_state=0)
X = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels))
##############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
You appear to be changing the data generation only:
X, labels_true = make_blobs(n_samples=4000, centers=coordinates, cluster_std=0.0000005,
                            random_state=0)
instead of the clustering algorithm:
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
            ^^^^^^^ almost your complete data set?
For geographic data, make sure to use haversine distance instead of Euclidean distance. Earth is more like a sphere than a flat Euclidean world.
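For instance, a hedged sketch of what street-level clustering could look like on the raw coordinates (no make_blobs, no StandardScaler), with a toy three-point dataset; the 20 m radius, min_samples value, and (lat, lon) column order are all assumptions to adapt:
import numpy as np
from sklearn.cluster import DBSCAN

# Toy (lat, lon) pairs in degrees; in the question this would be the
# `coordinates` array read from the CSV.
coordinates = np.array([[51.5074, -0.1278],
                        [51.5075, -0.1279],
                        [51.5200, -0.1000]])

earth_radius_km = 6371.0088
eps_in_radians = 0.02 / earth_radius_km  # ~20 metres

# Haversine works on radians, so convert the degrees first.
db = DBSCAN(eps=eps_in_radians, min_samples=2, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coordinates))
print(db.labels_)  # [0, 0, -1]: the two adjacent points cluster, the third is noise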
Referring to this example of using DBSCAN: my real input data for the clustering process is X, but following the example I used X1 (the standardized data) to build the clustering model.
# -*- coding: utf-8 -*-
"""
===================================
Demo of DBSCAN clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
"""
#print(__doc__)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1 = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X1)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool) # make a False matrix the size of db.labels_
core_samples_mask[db.core_sample_indices_] = True # set True where the index is a core sample
labels = db.labels_
print "cluster: ", set(labels)
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
In this case I want to get the members of the noise cluster, so I print xy when k == -1. Unfortunately, xy refers to X1, not the real data X.
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels == k)
    if k == -1:
        # Black used for noise.
        xy = X1[class_member_mask]
        print "Noise :", xy
    else:
        xy = X1[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)

        xy = X1[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
When I try to replace X1 with X, I get an error.
xy = X[class_member_mask]
error:
xy=X[class_member_mask&~core_samples_mask]
TypeError: only integer arrays with one element can be converted to an index
Maybe it's because the formats of X1 and X are different. I think it would be solved if I knew how to convert X into the same format as X1:
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1=[[ 0.8406627 -1.30435512]
[ 0.25219881 0.56856505]
[ 0.54643076 0.10033501]
[-1.51319287 -0.83612508]
[-1.51319287 -0.60201006]
[ 0.25219881 0.10033501]
[ 1.13489465 1.97325518]]
Please help me, any suggestion is appreciated...
Convert the data to a NumPy array: a plain Python list (of tuples) cannot be indexed with a boolean mask, which is exactly what the TypeError is complaining about. For the X1 values as printed above, add the missing commas and wrap the list with np.asarray:
X1 = [[ 0.8406627 , -1.30435512],
      [ 0.25219881,  0.56856505],
      [ 0.54643076,  0.10033501],
      [-1.51319287, -0.83612508],
      [-1.51319287, -0.60201006],
      [ 0.25219881,  0.10033501],
      [ 1.13489465,  1.97325518]]
X1 = np.asarray(X1)
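The same conversion is what actually resolves the error in the question: X is a plain list of tuples, and lists cannot be indexed with a boolean mask. A self-contained sketch reproducing the question's setup with X as an array:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = np.asarray([(9, 0), (7, 8), (8, 6), (1, 2), (1, 3), (7, 6), (10, 14)])
X1 = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit(X1).labels_
print(X[labels == -1])  # the noise points in the original, unscaled coordinates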