I have a large 2D-dataset (.csv) with values from a pressure sensor.
The first value is the pressure reading, while the second records the time the measurement was taken.
Looking at the plot, I can see a cloud of points (due to noise) in which you can detect some linear parts (the "good working zone") and some non-linear zones.
I thought of using a RANSAC algorithm to detect the linear zones, but I'm not sure it's the best way.
With OpenCV I can isolate the linear path and it seems to work well, but my problem is transforming a 2D dataset into a Mat: my sensor gives 32-bit values and tests take days at a sub-second data rate, so the resulting 2D matrix is an enormous grid of 0s and 1s!
So, in your opinion, what is the best way to detect linear patterns in a 2D dataset?
Edit:
Sharing a real dataset is quite problematic because of its size (approx. 100 MB) and the time needed to run a test (days).
I can send a plot to show my problem.
As you can see, RANSAC apparently works well, but my fear is that a dataset like this one:
can cause erroneous results (the first linear part is not detected).
One idea is to split my dataset into parts, but that doesn't seem very efficient...
Is there a method to detect multiple linear zones with RANSAC? (A rough sketch of one such approach follows the example code below.)
P.S.
Here is example Python code for RANSAC:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from skimage.measure import LineModelND, ransac
# x, y are vectors:
#   x -> time values
#   y -> pressure values
data = np.column_stack([x, y])
model = LineModelND()
model.estimate(data)
model_robust, inliers = ransac(data, LineModelND, min_samples=2, residual_threshold=0.01, max_trials=1000)
outliers = ~inliers
line_x = np.arange(x.min(), x.max() + 1)
line_y_robust = model_robust.predict_y(line_x)
fig, ax = plt.subplots()
ax.plot(data[inliers, 0], data[inliers, 1], '.b', alpha=0.6, label='Linear Data')
ax.plot(data[outliers, 0], data[outliers, 1], '.r', alpha=0.6, label='Non Linear Data')
ax.plot(line_x, line_y_robust, '-g', label='Robust line model')
ax.legend(loc='lower right')
plt.show()
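Regarding the multiple-linear-zones question above: one common workaround (sequential RANSAC) is to fit a line, remove its inliers, and repeat on the remaining points. Below is a minimal sketch, reusing data and LineModelND from the snippet above; the helper name find_linear_segments and the stopping parameters max_segments and min_points are made up for illustration only.

import numpy as np
from skimage.measure import LineModelND, ransac

def find_linear_segments(data, max_segments=5, min_points=100,
                         residual_threshold=0.01, max_trials=1000):
    # Return a list of (model, inlier_indices) pairs, one per detected linear zone.
    remaining = np.arange(len(data))          # indices of points not yet explained
    segments = []
    for _ in range(max_segments):
        if len(remaining) < min_points:
            break
        model, inliers = ransac(data[remaining], LineModelND, min_samples=2,
                                residual_threshold=residual_threshold,
                                max_trials=max_trials)
        if inliers.sum() < min_points:
            break
        segments.append((model, remaining[inliers]))
        remaining = remaining[~inliers]       # drop explained points and repeat
    return segments

segments = find_linear_segments(data)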
I'm working with data that has 3 columns: type, x, y. Let's say x and y are correlated and not normally distributed. I want to group by type and then filter out outliers or noisy data points in x and y. Could someone recommend statistical or machine learning methods for filtering outliers or noise? How can I do that in Python?
I'm considering using DBSCAN from scikit-learn; is it an appropriate method?
(Scatter plots of x vs. y for Type1, Type2, and Type3 not shown.)
df1 = df.loc[df['type'] == '3']
data= df1[["x", "y"]]
data.plot.scatter(x = "x", y = "y")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps=0.5,
    metric="euclidean",
    min_samples=3,
    n_jobs=-1)
clusters = outlier_detection.fit_predict(data)
from matplotlib import cm
cmap = cm.get_cmap('Accent')
data.plot.scatter(
    x="x",
    y="y",
    c=clusters,
    cmap=cmap,
    colorbar=False
)
For this type of data and outliers I would recommend a statistical approach: the SPE/DmodX (distance to model) or Hotelling T2 test may help you here. I do not see the data for the 3 types, so I generated some.
These methods are available in the pca library. With n_std you can adjust the "width" of the ellipse.
pip install pca
import pca
results = pca.spe_dmodx(X, n_std=3, showfig=True)
# If you want to test the Hotelling T2 test.
# results1 = pca.hotellingsT2(X, alpha=0.001)
results is a dictionary and contains the labels of the outliers.
Of course you won't get good results if you don't pay attention to the parameters. Just look at your plot: the scale is huge, and your epsilon is tiny! It seems your data may be integers, so no points except duplicates will ever have a distance of less than 0.5...
Hence all data is considered noise.
Before using a method, make sure you've understood how it works and what parameters you need to set.
I'd also log-transform the data first. Working with some simple thresholds may be enough. Don't overdo things with clustering when your data is unimodal.
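A minimal sketch of that advice (log-transform, standardize, then run DBSCAN with an eps chosen on the transformed scale), assuming the data frame from the question; the eps and min_samples values are arbitrary starting points, not recommendations:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# log1p handles zeros; standardizing puts both columns on a comparable scale
X = np.log1p(data[["x", "y"]].values)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
noise_mask = labels == -1   # DBSCAN labels noise points as -1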
After some feature engineering, you could consider using the OneClassSVM estimator from the scikit-learn library.
https://justanoderbit.com/outlier-detection/one-class-svm/ describes how to use it for outlier detection.
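A minimal sketch of that idea, assuming the same x/y data frame from the question; the nu and gamma values are illustrative and would need tuning:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X = StandardScaler().fit_transform(data[["x", "y"]])
# nu is (roughly) an upper bound on the fraction of points treated as outliers
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
pred = oc_svm.fit_predict(X)       # +1 for inliers, -1 for outliers
outliers = data[pred == -1]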
I am working on an anomaly detection project on call detail records for a telephone operator. I have prepared a sample of 10,000 observations with 80 dimensions, which represents all of the observations for one day of traffic. The data are represented as follows:
This is a small part of the whole dataset.
I decided to use the PyOD library, which offers many unsupervised learning algorithms, and to start with KNN:
from pyod.models.knn import KNN
knn = KNN(contamination=0.1)
result = knn.fit_predict(conso)
Then, to visualize the result, I decided to reduce the sample to 2 dimensions and display it as a scatter plot, with the observations that KNN predicted as non-outliers in blue and the outliers in red.
from sklearn.manifold import TSNE
result_f = TSNE(n_components = 2).fit_transform(df_final_2)
result_f = pd.DataFrame(result_f)
color = ['red' if row == 1 else 'blue' for row in result]  # result comes from knn.fit_predict above (1 = outlier)
'df_final_2' is the dataframe version of 'conso'.
Then I plot it all with the right colors:
import matplotlib.pyplot as plt
plt.scatter(result_f[0],result_f[1], s=1, c=color)
The thing that disturbs me about the graph is that the observations predicted as outliers are not really outliers: normally the outliers should be at the extremities of the graph, not grouped with the normal behaviour, and even when I analyze these supposedly aberrant observations in the original dataset, they show normal behaviour. I have tried other PyOD algorithms and modified the parameters of each one, but I get essentially the same result. I must have made a mistake somewhere, but I cannot pinpoint it.
Thanks.
There are several things to check:
when using kNN, LOF, and similar models that rely on distance measures, the data should first be standardized (using sklearn's StandardScaler); a short sketch of this follows the code below
t-SNE may not work well in this case, and the dimensionality reduction could be off
maybe do not use fit_predict, but do this (use y_train_pred):
# train kNN detector
clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(X)
# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores
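A minimal sketch of the standardization point from the first item above, assuming X is the raw feature matrix (conso / df_final_2 in the question):

from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# put every feature on the same scale before any distance-based detector
X_scaled = StandardScaler().fit_transform(X)

clf = KNN(contamination=0.1)
clf.fit(X_scaled)
y_train_pred = clf.labels_              # 0: inliers, 1: outliers
y_train_scores = clf.decision_scores_   # raw outlier scores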
If none of these work, feel free to open an issue report on GitHub and we will investigate further.
For the clustering algorithms in sklearn, is there a way to specify how many clusters you want the algorithm to find (instead of the algorithm finding its own number of clusters)? From my input data, I'm hoping for 2 clusters instead of the 3 it outputs for me.
If it helps, I'm using the MeanShift algorithm (but my question applies to all of them). Also, most tutorials seem to use make_blobs, but I'm using pandas's read_csv to load my data instead, if that changes anything.
This is the beginning part of my code:
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

df = pd.read_csv(filename, header=0)
original_headers = list(df.columns.values)
df = df._get_numeric_data()
data = df.values
ms = MeanShift()
ms.fit(data)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
As some users said above, it is not possible to set the number of clusters in the MeanShift algorithm.
When we talk about clustering, there are many models that can be employed, depending on your problem. Density-based models, like MeanShift and DBSCAN, try to find areas of higher density than the remainder of the dataset, so the number of clusters is determined by the data itself.
On the other hand, centroid-based methods like K-Means, for example, start their iterations from the number of centroids passed as a parameter, so you control the number of clusters directly (see the short sketch below).
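A minimal sketch of that difference, reusing the df loaded in the question; n_clusters=2 is the value the question asks for:

import numpy as np
from sklearn.cluster import KMeans

data = df._get_numeric_data().values
kmeans = KMeans(n_clusters=2, random_state=0)   # the number of clusters is fixed up front
labels = kmeans.fit_predict(data)
print("Number of clusters:", len(np.unique(labels)))   # always 2 here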
The following link shows many of the clustering algorithms available in sklearn. Try to figure out which one suits your problem best.
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
References:
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
https://en.wikipedia.org/wiki/Cluster_analysis
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for i in range(len(x)):
    plt.plot(x[i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], marker = "x", s = 150, linewidths = 5, zorder = 10)
plt.show()
The code above produces 4 clusters, but they are definitely not what I want to have.
I also get an error, which makes it even worse. The output I get is in the picture below.
The error I get is: TypeError: scatter() missing 1 required positional argument: 'y'. The error is not a big deal, because I don't like what I have anyway.
Below is an image of how I want my cluster output to look.
Your data is one-dimensional (a line). If you want a two-dimensional visualization like the picture in your post, you should use two-dimensional or multi-dimensional data, for example [[1,3], [2,3], [1,5]].
After k-means the points are divided into k clusters, and you can use scatter to visualize the output; by the way, scatter takes x and y, since it is a two-dimensional visualization.
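A minimal sketch of that idea with made-up two-dimensional points (the values are purely illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# toy 2D data, just to show clusters and centroids in a scatter plot
X = np.array([[1, 3], [2, 3], [1, 5], [8, 8], [9, 7], [8, 9]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", s=150)
plt.show()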
I suggest you take a look at Orange, a Python data mining tool. You can do k-means by drag and drop,
and visualize the output of k-means easily.
Good luck! Data mining is fun :-)
Your data is 1-dimensional.
Don't expect a pretty 2D plot without making up data.
To get rid of the error, you can set y=x. But it will not change much; the data will continue to be a 1-dimensional line.
You could of course add random noise, and set y to random values. But that means making up fake data.
For one-dimensional data, I recommend not using clustering at all. These algorithms are designed for complex multivariate data where you can no longer afford a good statistical model. One-dimensional data can be sorted, which allows for much more efficient algorithms. You can easily do KDE on such data, and fit thousands of statistical distributions. This will give you a much more meaningful model with higher statistical power.
From a quick look at your plot, I'd say there are no clusters. Instead your data looks like a skewed normal distribution with one clear outlier (to be expected at this data set size) to me. Please, try a more statistical approach.
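A minimal sketch of the KDE suggestion above, using the x values from the question; the bandwidth is left at scipy's default, and any cut points between groups would still have to be read off the estimated density:

import numpy as np
from scipy.stats import gaussian_kde

x = np.array([916, 684, 613, 612, 593, 552, 487, 484, 475, 474, 438, 431,
              421, 418, 409, 391, 389, 388, 380, 374, 371, 369, 357, 356,
              340, 338, 328, 317, 316, 315, 313, 303, 283, 257, 255, 254,
              245, 234, 232, 227, 227, 222, 221, 221, 219, 214, 201, 200,
              194, 169, 155, 140])

kde = gaussian_kde(x)                     # kernel density estimate of the 1D data
grid = np.linspace(x.min(), x.max(), 500)
density = kde(grid)
# peaks of density are modes; valleys are natural cut points, if any exist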
Since you are working with only one dimension, you should understand exactly what you are computing. With KMeans you extract four average values; the best thing you can do here is draw your data as below, with four horizontal lines showing those values. I get the following picture with the code below. This picture is the 1D equivalent of the 2D picture you are showing.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for c in centroids:
    plt.plot([0, len(x) - 1], [c, c], "k")   # horizontal line at each centroid value
for i in range(len(x)):
    plt.plot(i, x[i], colors[labels[i]], markersize = 10)
plt.show()
Computing k-means with 1D data is more interesting with curves like the following one (from the page http://lasp.colorado.edu/home/sorce/2013/01/28/the-sorce-mission-celebrates-ten-years/), because you can obviously see two distinct average values: