I am working on an anomaly detection project on call detail records for a telephone operator. I have prepared a sample of 10,000 observations and 80 dimensions, which represents all of the observations for one day of traffic. The data are represented as follows:
This is a small part of the whole dataset.
I decided to use the PyOD library, which offers many unsupervised outlier detection algorithms, and I started with KNN:
from pyod.models.knn import KNN
knn = KNN(contamination=0.1)
result = knn.fit_predict(conso)
Then, to visualize the result, I decided to reduce the sample to 2 dimensions and display it as a scatter plot, with the observations that KNN predicted as inliers in blue and the predicted outliers in red.
from sklearn.manifold import TSNE
result_f = TSNE(n_components = 2).fit_transform(df_final_2)
result_f = pd.DataFrame(result_f)
color= ['red' if row == 1 else 'blue' for row in result_list]
'df_final_2' is the dataframe version of 'conso'.
Then I plot everything with those colors:
import matplotlib.pyplot as plt
plt.scatter(result_f[0],result_f[1], s=1, c=color)
The thing that disturbs me in the graph is that the observations predicted as outliers are not really outliers: normally the outliers should be at the extremities of the graph, not grouped with the normal behaviors. Even when I analyze these supposedly aberrant observations in the original dataset, they show normal behavior. I have tried other PyOD algorithms and modified the parameters of each one, but I obtained roughly the same result. I must have made a mistake somewhere, but I cannot pinpoint it.
Thanks.
There are several things to check:
When using KNN, LOF, and similar models that rely on distance measures, the data should first be standardized (using sklearn's StandardScaler); see the sketch after the code block below.
t-SNE may not work well in this case, and the dimensionality reduction could be off.
Maybe do not use fit_predict; instead do this (use y_train_pred):
# train kNN detector
clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(X)
# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores
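For the first point, here is a minimal sketch of standardizing before fitting, assuming conso is the feature matrix from the question (variable names are just placeholders):
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# Standardize so every feature contributes comparably to the distance metric
scaler = StandardScaler()
conso_scaled = scaler.fit_transform(conso)

clf = KNN(contamination=0.1)
clf.fit(conso_scaled)
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)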
If none of these work, feel free to open an issue report on GitHub and we will investigate further.
I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, the cleanly annotated data would look like this:
either clustered in a small area or spread across a larger area. The outliers I'm trying to filter out look like this:
the outlier is away from the "correct" cluster.
I tried z-score filtering, but this approach falsely marked many annotations as outliers:
std_score = np.abs((points - points.mean(axis=0)) / (np.std(points, axis=0) + 0.01))
validity = np.all(std_score <= np.quantile(std_score, 0.95, axis=0), axis=1)
Is there a method designed to solve this problem?
This seems like a typical clustering problem, and if the data looks as you suggested, KMeans from scikit-learn should do the trick. Let's look at how we can do this.
First, I generate a data sample that might look somewhat like your data.
import numpy as np
import matplotlib.pylab as plt
np.random.seed(1) # For reproducibility
cluster_1 = np.random.normal(loc = [1,1], scale = [0.2,0.2], size = (20,2))
cluster_2 = np.random.normal(loc = [2,1], scale = [0.4,0.4], size = (5,2))
plt.scatter(cluster_1[:,0], cluster_1[:,1])
plt.scatter(cluster_2[:,0], cluster_2[:,1])
plt.show()
points = np.vstack([cluster_1, cluster_2])
This is what the data looks like.
Next, we perform KMeans clustering.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2).fit(points)
We choose n_clusters as 2, believing that there are 2 clusters in the dataset. After finding these clusters, let's look at them.
plt.scatter(points[kmeans.labels_==0][:,0], points[kmeans.labels_==0][:,1], label='cluster_1')
plt.scatter(points[kmeans.labels_==1][:,0], points[kmeans.labels_==1][:,1], label ='cluster_2')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], label = 'cluster_center')
plt.legend()
plt.show()
The result will look like the image shown below.
This should solve your problem, but there are some things to keep in mind:
It will not be perfect all the time.
It might be a problem if you don't have any outliers; this can be addressed with silhouette scores.
It is difficult to know which cluster to discard; this can be done by locating the cluster centers (the green points) or by keeping the cluster with more points, as in the sketch below.
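A minimal sketch of the last point, continuing the snippet above and assuming the smaller cluster is the one holding the outliers:
import numpy as np

# Count how many points fall into each cluster and keep the larger one
counts = np.bincount(kmeans.labels_)
keep_label = np.argmax(counts)
filtered_points = points[kmeans.labels_ == keep_label]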
Endnote: You might lose some points, but this would automate the entire process. It depends on how much you want to trade off in terms of data saved versus manual time saved.
I am using the make_moons dataset and I am trying to implement an outlier detection algorithm. That's why I want to generate 3 points that are far away from the normal data and test whether they are flagged as outliers or not. These 3 points should be randomly selected from my data and should be as far as possible from the normal data.
My algorithm will compare the distance of each such point with a threshold value to decide whether it is an outlier or not.
I am aware of other resources for doing this, but my specific problem is my dataset: I could not find a way to adapt those solutions to it.
Here is my code to define the dataset and fit it with K-Means (I have to use the K-Means fitted data):
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

data = make_moons(n_samples=100, noise=0, random_state=0)
X, y = data
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=10)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
In short, how can I find the 3 farthest points in my data to use in outlier detection?
As stated in the comments, you should define a criterion for classifying outliers. Either way, in the following code I randomly selected three entries from X and multiplied them by 1,000, which should surely make them outliers regardless of the definition you choose.
# Import libraries
import numpy as np
from sklearn.datasets import make_moons
# Create data
X, y = make_moons(100, random_state=123)
# Randomly select 3 row numbers from X
np.random.seed(5)
idx = np.random.randint(low=0, high=len(X), size=3)
# Overwrite the data from the randomly selected rows
scaler = 1000  # Change this number to whatever you need
for i in idx:
    X[i] = X[i] * scaler
Note: There is a small probability that idx will contain duplicates. If that happens with the seed you choose (or with no seed at all), simply try another seed or resample until the indices are distinct.
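As a follow-up, one way to flag these injected points with the K-Means fit from the question is to threshold the distance to the nearest centroid. This is only a sketch, not part of the original answer, and the percentile cut-off is arbitrary:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=10)
kmeans.fit(X)

# Distance from each point to its nearest centroid
dist_to_centroid = np.min(kmeans.transform(X), axis=1)

# Flag points whose distance exceeds a chosen threshold
threshold = np.percentile(dist_to_centroid, 97)  # arbitrary cut-off
outlier_mask = dist_to_centroid > threshold
print(np.where(outlier_mask)[0])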
I'm working with data that has 3 columns: type, x, y. Let's say x and y are correlated and not normally distributed. I want to group by type and filter out outliers or noisy data points in x and y. Could someone recommend statistical or machine learning methods to filter outliers or noise? How can I do that in Python?
I'm considering using DBSCAN from scikit-learn; is it an appropriate method?
(Scatter plots of x vs. y for Type1, Type2, and Type3 omitted.)
df1 = df.loc[df['type'] == '3']
data= df1[["x", "y"]]
data.plot.scatter(x = "x", y = "y")
from sklearn.cluster import DBSCAN

outlier_detection = DBSCAN(
    eps=0.5,
    metric="euclidean",
    min_samples=3,
    n_jobs=-1,
)
clusters = outlier_detection.fit_predict(data)
from matplotlib import cm

cmap = cm.get_cmap('Accent')
data.plot.scatter(
    x="x",
    y="y",
    c=clusters,
    cmap=cmap,
    colorbar=False
)
For this type of data and outliers, I would recommend a statistical approach. The SPE/DmodX (distance to model) or Hotelling's T2 test may help you here. I do not see the data for the 3 types, but I generated some.
These methods are available in the pca library. With n_std you can adjust the "width" of the ellipse.
pip install pca
import pca
results = pca.spe_dmodx(X, n_std=3, showfig=True)
# If you want to test the Hotelling T2 test.
# results1 = pca.hotellingsT2(X, alpha=0.001)
results is a dictionary and contains the labels of the outliers.
Of course you won't get good results if you don't pay attention to the parameters. Just look at your plot: the scale is huge and your epsilon is tiny! It seems your data may be integers, so no points except duplicates will ever be within a distance of 0.5 of each other...
Hence all data is considered noise.
Before using a method, make sure you've understood how it works and what parameters you need to set.
I'd also log-transform the data first. Working with some simple thresholds may be enough; don't overdo things with clustering when your data is unimodal.
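A minimal sketch of the log-transform plus simple-threshold idea, assuming data is the DataFrame with the "x" and "y" columns from the question and that the values are non-negative counts; the quantile cut-offs are arbitrary:
import numpy as np

# Log-transform to compress the huge scale
log_data = np.log1p(data)

# Flag points outside simple per-column quantile thresholds
lower = log_data.quantile(0.01)
upper = log_data.quantile(0.99)
outlier_mask = ((log_data < lower) | (log_data > upper)).any(axis=1)
print(outlier_mask.sum(), "points flagged as outliers")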
After some feature engineering, you could consider using the OneClassSVM estimator from the scikit-learn library.
https://justanoderbit.com/outlier-detection/one-class-svm/ describes how to use it for outlier detection.
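A minimal sketch of that approach, assuming data holds the "x" and "y" columns from the question; the nu and gamma values are placeholders to tune:
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Scale first, since the SVM kernel is distance-based
scaled = StandardScaler().fit_transform(data)

# nu roughly corresponds to the expected fraction of outliers
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
labels = ocsvm.fit_predict(scaled)  # +1 for inliers, -1 for outliers

outliers = data[labels == -1]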
For the clustering algorithms in sklearn, is there a way to specify how many clusters you want the algorithm to find (instead of the algorithm finding its own number of clusters)? From my inputted data, I'm hoping for 2 clusters instead of the 3 it outputs for me.
If it helps, I'm using the MeanShift algorithm (but my question applies to all of them). Also, most tutorials seem to use make_blobs, but I'm using pandas's read_csv to upload my data instead if that changes anything.
This is the beginning part of my code:
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

df = pd.read_csv(filename, header=0)
original_headers = list(df.columns.values)
df = df._get_numeric_data()
data = df.values
ms = MeanShift()
ms.fit(data)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
As some users said above, it is not possible to set the desired number of clusters in the MeanShift algorithm.
When we talk about clustering, there are many models that can be employed depending on your problem. Density-based models, like MeanShift and DBSCAN, try to find areas of higher density than the remainder of the data set, so the number of clusters is defined by the data itself.
On the other hand, centroid-based methods like K-Means start their iterations from the number of centroids passed as a parameter.
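For comparison, a minimal sketch of forcing exactly 2 clusters with K-Means on the data array from your snippet (KMeans is only an illustration of a centroid-based method here, not necessarily the best fit for your data):
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=0)
km.fit(data)

labels = km.labels_
print("Number of estimated clusters:", len(np.unique(labels)))  # always 2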
The following link shows many of the clustering algorithms available in sklearn. Try to figure out which one suits your problem best.
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
References:
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
https://en.wikipedia.org/wiki/Cluster_analysis
After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10000000 points, though only 8000 unique ones. Therefore, I initially thought that, to speed things up, I'd use unique points only. It seems this is a bad idea.
To keep computational time down, this post suggests adding weights to each point. How can this be implemented in Python?
Using the KMeans class from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the data that has been inserted as weights, whereas X holds the actual points that need to be clustered.
import pandas as pd
from sklearn.cluster import KMeans  # For applying KMeans

# Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Running k-means clustering with the 'X' array as the input coordinates
# and the 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)

# Storing results obtained together with respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label, "kmeans_cluster": predicted_kmeans + 1})

# Printing the count of points allotted to each cluster
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, where each point gets an extra attribute: its frequency (i.e. its weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points of that cluster (instead of calculating the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
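A minimal sketch of the weighting idea with scikit-learn, assuming all_points is a hypothetical name for the full array with duplicates; the frequency of each unique point becomes its sample_weight, as in the answer above:
import numpy as np
from sklearn.cluster import KMeans

# Collapse duplicates: unique points plus how often each occurs
unique_points, counts = np.unique(all_points, axis=0, return_counts=True)

# Cluster the (much smaller) unique set, weighting each point by its frequency
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0)
kmeans.fit(unique_points, sample_weight=counts)
labels = kmeans.predict(unique_points)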