I have a detector that returns the bounding box centers of the detected objects, and it works fine for the most part. What I want to do, however, is to run the detection over 10 frames instead of 1, so that I can eliminate more false positives.
My detector normally works as follows:
1. Get a frame.
2. Conduct the algorithm.
3. Record the centers into a dictionary per each frame.
My idea for reducing false positives is:
1. Set up a loop of 10 iterations:
   1. Get a frame.
   2. Conduct the algorithm.
   3. Record the centers into a dictionary per each frame.
2. Loop over the recorded points after every 10 frames.
3. Use a clustering algorithm or simple distance averaging.
4. Get the final centers.
I've already implemented some of this logic. I have completed step 1.3 and now need a way to group the coordinates and finalize the estimate.
After 10 frames, my dictionary holds values such as these (I can't paste them all):
(4067.0, 527.0): ['torx8', 'screw8'],
(4053.0, 527.0): ['torx8', 'screw1'],
(2627.0, 707.0): ['torx8', 'screw12'],
(3453.0, 840.0): ['torx6', 'screw14'],
(3633.0, 1373.0): ['torx6', 'screw15'],
(3440.0, 840.0): ['torx6', 'screw14'],
(3447.0, 840.0): ['torx6', 'screw14'],
(1660.0, 1707.0): ['torx8', 'screw3'],
(2633.0, 700.0): ['torx8', 'screw7'],
(2627.0, 693.0): ['torx8', 'screw8'],
(4060.0, 533.0): ['torx8', 'screw6'],
(3627.0, 1367.0): ['torx6', 'screw13'],
(2600.0, 680.0): ['torx8', 'screw15'],
(2607.0, 680.0): ['torx8', 'screw7']
As you can see, most of these points are the same point with a shift of a few pixels, which is why I am looking for a way to get rid of the so-called duplicates.
Is there an intelligent and efficient way of dealing with this problem? The first thing that came to mind was k-means clustering, but I am not sure whether it fits this problem.
Has anyone had a similar experience?
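The "simple distance averaging" option mentioned in the plan could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `merge_nearby` and `group_centers`, the 30-pixel radius, and the minimum of 2 hits per group are all hypothetical choices, not part of the original detector.

```python
import math


def merge_nearby(points, radius):
    """Greedily group points that lie within `radius` of a group's running mean."""
    groups = []  # each group is a list of (x, y) tuples
    for x, y in points:
        for group in groups:
            cx = sum(p[0] for p in group) / len(group)
            cy = sum(p[1] for p in group) / len(group)
            if math.hypot(x - cx, y - cy) <= radius:
                group.append((x, y))
                break
        else:
            # no existing group is close enough, so start a new one
            groups.append([(x, y)])
    return groups


def group_centers(groups, min_hits):
    """Average each group; groups with fewer than `min_hits` points are dropped."""
    centers = []
    for group in groups:
        if len(group) >= min_hits:
            cx = sum(p[0] for p in group) / len(group)
            cy = sum(p[1] for p in group) / len(group)
            centers.append((cx, cy))
    return centers


# A few of the recorded centers from the dictionary, plus one isolated detection.
sample = [(4067.0, 527.0), (4053.0, 527.0), (4060.0, 533.0), (1660.0, 1707.0)]
groups = merge_nearby(sample, radius=30)       # radius in pixels (assumed value)
centers = group_centers(groups, min_hits=2)    # isolated points count as false positives
print(centers)
```

The result of a greedy pass like this depends on the order of the points, which is one reason a proper clustering algorithm may be the more robust choice.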
EDIT: Okay, I made some progress and I am able to cluster the points using DBSCAN (density-based clustering), because in my case I have no a priori knowledge of the number of clusters, so it has to be estimated from the data.
# cluster now
points = StandardScaler().fit_transform(points)
db = self.dbscan.fit(points)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise_ = list(db.labels_).count(-1)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = points[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = points[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
This works great. I am able to eliminate the false positives (see the black dot); however, I still don't know how to get the average per cluster. After I find the clusters, how can I loop over each cluster and average all the X, Y values? (This has to use the values from before StandardScaler().fit_transform(points), obviously, since after that I lose the pixel coordinates: they are standardized to zero mean and unit variance.)
Okay, I finally got it. Since I also need my points in their original scale (not the standardized one), I also had to undo the scaling. Anyway, here is the full solution:
def cluster_dbscan(self, points, visualize=False):
    # standardize the points (zero mean, unit variance per axis)
    scaler = StandardScaler()
    scaled_points = scaler.fit_transform(points)
    # cluster
    db = DBSCAN(eps=self.clustering_epsilon, min_samples=self.clustering_min_samples, metric='euclidean')
    db.fit(scaled_points)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    n_noise_ = list(db.labels_).count(-1)
    if visualize:
        # Black removed and is used for noise instead.
        unique_labels = set(db.labels_)
        colors = [plt.cm.Spectral(each)
                  for each in np.linspace(0, 1, len(unique_labels))]
        for k, col in zip(unique_labels, colors):
            if k == -1:
                # Black used for noise.
                col = [0, 0, 0, 1]
            class_member_mask = (db.labels_ == k)
            xy = scaled_points[class_member_mask & core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=14)
            xy = scaled_points[class_member_mask & ~core_samples_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=6)
        plt.title('Estimated number of clusters: %d' % n_clusters_)
        plt.show()
    # back to original scale
    points = scaler.inverse_transform(scaled_points)
    # loop over the clusters, get the centers
    centers = np.zeros((n_clusters_, 2))  # for x and y
    for i in range(n_clusters_):
        cluster_points = points[db.labels_ == i]
        cluster_mean = np.mean(cluster_points, axis=0)
        centers[i, :] = cluster_mean
    # the centers are in the original pixel coordinates
    return centers
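If the clustering radius can be stated directly in pixels, the scaling and inverse-scaling steps may not be needed at all. The following is a minimal sketch of that variant, not the method used above: it runs DBSCAN on the raw sample points from the question, and the values `eps=30` (pixels) and `min_samples=2` are assumptions picked only for this data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sample centers recorded over 10 frames (taken from the dictionary keys).
points = np.array([
    (4067.0, 527.0), (4053.0, 527.0), (2627.0, 707.0), (3453.0, 840.0),
    (3633.0, 1373.0), (3440.0, 840.0), (3447.0, 840.0), (1660.0, 1707.0),
    (2633.0, 700.0), (2627.0, 693.0), (4060.0, 533.0), (3627.0, 1367.0),
    (2600.0, 680.0), (2607.0, 680.0),
])

# eps is expressed in pixels because the points are not rescaled (assumed value).
db = DBSCAN(eps=30, min_samples=2).fit(points)
labels = db.labels_

# Average the original coordinates of each cluster, skipping noise (label -1).
centers = np.array([points[labels == k].mean(axis=0)
                    for k in sorted(set(labels)) if k != -1])
print(centers)
```

With these assumed parameters the isolated detection at (1660, 1707) is labelled as noise and the remaining points collapse into four centers.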
Related
Please explain why this happens:
I have a function to print matrices with discrete values (in this case, -1, 0 and 1):
def PrintMatrixBW(array, title, s_folder):
    """
    'array' is the matrix to be represented;
    'title' is the title of the plot;
    's_folder' is the saving folder. For NOT saving, s_folder=0.
    """
    fig, (ax0) = plt.subplots(1)
    fig.tight_layout()
    ax0.set_title(title)
    colour_map = {
        0: '#000000',  # black
        1: '#FFFFFF',  # white
    }
    N = len(colour_map)
    values = list(colour_map.keys())
    colours = list(colour_map.values())
    cmap = LinearSegmentedColormap.from_list('', colours, N)
    plt.imshow(array, cmap=cmap, origin='lower')
    cbar = plt.colorbar()
    # Puts each label in the middle of respective colour interval:
    colour_width = (max(values) - min(values)) / N
    positions = np.linspace(min(values) + colour_width/2, max(values) - colour_width/2, N)
    cbar.set_ticks(positions)
    cbar.set_ticklabels(values)
    plt.show()
    filename = title + '_out' + '.png'
    # saving option:
    if s_folder != 0:
        outpath = s_folder
        fig.savefig(outpath + filename)
    return filename
Now I create a set of matrices like this:
n = 100
matrix = [[rd.choice([-1, 0, 1]) for j in range(n)] for i in range(n)]
PrintMatrixBWR(matrix, "cool_image" + str(n), out_folder)
These are the images produced by running the previous lines for n=100 up to n=1000:
Why does the matrix become blacker as I increase its size?
I also tried changing rd.choice([-1, 0, 1]) to rd.choice([0, -1, 1]) to see whether the problem was in the random number generator, but I obtained the same result.
I got a slightly different result when I tried rd.choice([1, 0, 1]). I expected not to see red at all, but all three colours appear (this time in different proportions). Here's the result, which still looks weird to me:
In the first set of pictures, I don't understand why the image gets darker, since the colours should maintain their proportions and black is not a privileged colour. In the second set, I don't understand how red appears when I specify only two values to populate the matrix, with rd.choice([1, 0, 1]). I also don't understand how it can converge to 50-50 black and white when the ratio of ones to zeros is 2/3 to 1/3.
What is going on?
I have been thinking about this but am not sure how to do it. I have imbalanced binary data and would like to use an SVM to select just the subset of majority-class data points nearest to the support vector. I can then fit a binary classifier on this "balanced" data.
To illustrate what I mean, here is a MWE:
# packages import
from collections import Counter
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
import seaborn as sns
# sample data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=1)
# class distribution summary
print(Counter(y))
# Output: Counter({0: 91, 1: 9})
# fit svm model
svc_model = SVC(kernel='linear', random_state=32)
svc_model.fit(X, y)
plt.figure(figsize=(10, 8))
# Plotting our two-features-space
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
# Constructing a hyperplane using a formula.
w = svc_model.coef_[0] # w consists of 2 elements
b = svc_model.intercept_[0] # b consists of 1 element
x_points = np.linspace(-1, 1) # generating x-points from -1 to 1
y_points = -(w[0] / w[1]) * x_points - b / w[1] # getting corresponding y-points
# Plotting a red hyperplane
plt.plot(x_points, y_points, c='r')
The two classes are well separated by the hyperplane. We can see the support vectors for both classes (even better for class 1).
Since the minority class 1 has 9 data points, I want to down-sample the majority class 0 by selecting its support vector and the 8 other data points nearest to it, so that the class distribution becomes {0: 9, 1: 9} and all other class 0 data points are ignored. I will then use this to fit a binary classifier such as LR (or even SVC).
My question is how to select the class 0 data points nearest to the class support vector, in a way that balances them with the data points of minority class 1.
This can be achieved as follows: Get the support vector for class 0, (sv0), iterate over all data points in class 0 (X[y == 0]), compute the distances (d) to the point represented by the support vector, sort them, take the 9 with the smallest values, and concatenate them with the points of class 1 to create the downsampled data (X_ds, y_ds).
sv0 = svc_model.support_vectors_[0]
distances = []
for i, x in enumerate(X[y == 0]):
    d = np.linalg.norm(sv0 - x)
    distances.append((i, d))
distances.sort(key=lambda tup: tup[1])
index = [i for i, d in distances][:9]
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))
plt.plot(x_points[19:-29], y_points[19:-29], c='r')
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
plt.scatter(X_ds[y_ds == 0][:,0], X_ds[y_ds == 0][:,1], color='yellow', alpha=0.4)
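The same selection can also be written without an explicit loop. The sketch below is one possible vectorised form, assuming NumPy arrays; the helper name `nearest_to_anchor` and the toy data are hypothetical, and in the real case `anchor` would be `svc_model.support_vectors_[0]`.

```python
import numpy as np


def nearest_to_anchor(X, y, anchor, label, k):
    """Return the indices (into X) of the k points of class `label` closest to `anchor`."""
    idx = np.flatnonzero(y == label)                 # positions of the chosen class
    dist = np.linalg.norm(X[idx] - anchor, axis=1)   # Euclidean distance to the anchor
    return idx[np.argsort(dist)[:k]]                 # k smallest distances


# Toy data standing in for the real X, y (hypothetical values).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [10.0, 0.0], [2.0, 2.0]])
y_toy = np.array([0, 0, 0, 0, 1])

keep = nearest_to_anchor(X_toy, y_toy, anchor=X_toy[0], label=0, k=2)
minority = np.flatnonzero(y_toy == 1)

# Down-sampled data: the selected majority points plus every minority point.
X_small = np.concatenate((X_toy[keep], X_toy[minority]))
y_small = np.concatenate((y_toy[keep], y_toy[minority]))
print(keep, y_small)
```

Returning indices into the full array, rather than into `X[y == 0]`, makes it easier to reuse them for plotting or for selecting other per-sample data.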
I'd like to separate the data into 13 different sets of variables, one per red circle (see the image below). But I have no idea how to cluster the data based on multiple linear regression. Any idea how I can do this in Python?
Data set:
https://www.dropbox.com/s/ar5rzry0joe9ffu/dataset_v1.xlsx?dl=
Code I am using now for clustering:
print(__doc__)
import openpyxl
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
wb = openpyxl.load_workbook('dataset_v1.xlsx')
sheet = wb.worksheets[0]
ws = wb.active
row_count = sheet.max_row
data = np.zeros((row_count, 2))
index = 0
for r in ws.rows:
    data[index,0] = r[0].value
    data[index,1] = r[1].value
    index += 1
# Compute DBSCAN
db = DBSCAN(eps=5, min_samples=0.1).fit(data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
clusters = [data[labels == i] for i in range(n_clusters_)]
outliers = data[labels == -1]
# #############################################################################
# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 0.5]
    class_member_mask = (labels == k)
    xy = data[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = data[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Define a threshold, say 50.
Begin a new "cluster" whenever y increases by more than 50.
As long as values decrease, they are still in the previous "cluster".
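The threshold rule above can be sketched in a few lines. This is only an illustration: it assumes the y values are already ordered along x, and the function name `split_on_jumps` and the sample values are hypothetical.

```python
def split_on_jumps(values, threshold=50):
    """Start a new cluster whenever a value rises by more than `threshold`."""
    clusters = []
    for i, v in enumerate(values):
        if i == 0 or v - values[i - 1] > threshold:
            clusters.append([v])      # big upward jump: begin a new cluster
        else:
            clusters[-1].append(v)    # decreasing or small change: same cluster
    return clusters


y_values = [100, 90, 80, 200, 190, 185, 300]
clusters = split_on_jumps(y_values, threshold=50)
print(clusters)
```

For the sample values this yields three groups, split at the jumps from 80 to 200 and from 185 to 300.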
I want to use DBSCAN from sklearn to find clusters in my GPS positions. I don't understand why the coordinate [ 18.28, 57.63] (lower right corner in the figure) is clustered together with the other coordinates to the left. Could the large epsilon be the problem? I am using sklearn version 0.19.0.
To reproduce this:
I copied demo code from here: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html but I replaced the sample data with a few coordinates (see variable X in the code below). I got the inspiration from here: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# #############################################################################
# Generate sample data
X = np.array([[ 11.95, 57.70],
[ 16.28, 57.63],
[ 16.27, 57.63],
[ 16.28, 57.66],
[ 11.95, 57.63],
[ 12.95, 57.63],
[ 18.28, 57.63],
[ 11.97, 57.70]])
# #############################################################################
# Compute DBSCAN
kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree', metric='haversine').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
I recently made the same mistake (using hdbscan), and it was the cause of some 'strange' results. For example, the same point would sometimes be included in a cluster, and sometimes be flagged as a noise point. "How can this be?", I kept wondering. It turned out to be because I was passing lat/lon directly and not converting to radians first.
The OP's self-supplied answer is correct, but short on details. One could, of course, just multiply the lat/lon values by π/180, but—if you're already using numpy anyway—the simplest fix is to change this line in the original code:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(X)
to:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(np.radians(X))
The haversine metric requires the data to be in radians.
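A related detail worth checking is the column order: scikit-learn's haversine metric treats the first column as latitude and the second as longitude, both in radians. The sketch below assumes the question's array is stored as [longitude, latitude] (which the values suggest, but is not stated), so it swaps the columns before converting. The 50 km radius is a hypothetical value chosen only to make the grouping visible.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Coordinates from the question, assumed to be [longitude, latitude] in degrees.
X = np.array([[11.95, 57.70],
              [16.28, 57.63],
              [16.27, 57.63],
              [16.28, 57.66],
              [11.95, 57.63],
              [12.95, 57.63],
              [18.28, 57.63],
              [11.97, 57.70]])

kms_per_radian = 6371.0088
eps_km = 50                      # hypothetical radius for this illustration
epsilon = eps_km / kms_per_radian

# Swap to [latitude, longitude] and convert degrees to radians before fitting.
lat_lon_rad = np.radians(X[:, [1, 0]])
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree',
            metric='haversine').fit(lat_lon_rad)
labels = db.labels_
print(labels)
```

With this radius the point [18.28, 57.63] is roughly 119 km from its nearest neighbour and ends up as noise instead of joining a cluster.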
Referring to this example of using DBSCAN, the real input data for the clustering process is 'X'. But following the example, I used 'X1' to build the clustering model.
# -*- coding: utf-8 -*-
"""
===================================
Demo of DBSCAN clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
"""
#print(__doc__)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1 = StandardScaler().fit_transform(X)
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X1)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool) # create an all-False array the same size as db.labels_
core_samples_mask[db.core_sample_indices_] = True # set True where the index is one of db's core sample indices
labels = db.labels_
print("cluster:", set(labels))
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
In this case I want to get the members of the noise, so I print xy if k == -1. Unfortunately, xy refers to X1, not the real data X.
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels == k)
    if k == -1:
        # Black used for noise.
        xy = X1[class_member_mask]
        print("Noise :", xy)
    else:
        xy = X1[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        xy = X1[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
When I try to replace X1 with 'X', I get an error.
xy = X[class_member_mask]
error:
xy=X[class_member_mask&~core_samples_mask]
TypeError: only integer arrays with one element can be converted to an index
Maybe it's because the formats of X1 and X are different. I think it will be solved if I know how to convert the format of X to that of X1.
X=[(9,0),(7,8),(8,6),(1,2),(1,3),(7,6),(10,14)]
X1=[[ 0.8406627 -1.30435512]
[ 0.25219881 0.56856505]
[ 0.54643076 0.10033501]
[-1.51319287 -0.83612508]
[-1.51319287 -0.60201006]
[ 0.25219881 0.10033501]
[ 1.13489465 1.97325518]]
Please help; any suggestions are welcome.
Convert X1 to a NumPy array:
X1=[[ 0.8406627, -1.30435512],
[ 0.25219881, 0.56856505],
[ 0.54643076, 0.10033501],
[-1.51319287, -0.83612508],
[-1.51319287, -0.60201006],
[ 0.25219881, 0.10033501],
[ 1.13489465, 1.97325518]]
X1 = np.asarray(X1)
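The TypeError in the question is what a plain Python list raises when it is indexed with a boolean mask, so the same conversion can be applied to the original data X. A minimal sketch follows; the `labels` array here is made up for illustration and is not the output of the DBSCAN call in the question.

```python
import numpy as np

# Original (unscaled) data from the question, as a list of tuples.
X = [(9, 0), (7, 8), (8, 6), (1, 2), (1, 3), (7, 6), (10, 14)]

# Hypothetical cluster labels, one per point; -1 marks noise.
labels = np.array([-1, 0, 0, 1, 1, 0, -1])

# A list cannot be indexed with a boolean mask, but an ndarray can.
X_arr = np.asarray(X)
noise = X_arr[labels == -1]
print("Noise :", noise)
```

Because the rows of `X_arr` are in the same order as the rows passed to `fit`, the mask built from `labels` selects the matching points in the original units.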