K-means Clustering in Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for i in range(len(x)):
    plt.plot(x[i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], marker = "x", s = 150, linewidths = 5, zorder = 10)
plt.show()
The code above displays 4 clusters, but they are definitely not what I want to have.
I also get an error, which makes it even worse. The output I get is shown in the picture below.
The error is: TypeError: scatter() missing 1 required positional argument: 'y'. The error is not a big deal, because I don't like what I have anyway.
Below is an image of how I want my cluster output to look.

Your data is one-dimensional (a line). If you want to visualize it in two dimensions like the picture in your post, you should use two-dimensional or multi-dimensional data, for example [[1,3], [2,3], [1,5]].
After k-means the points are divided into k clusters, and you can use scatter to visualize the output. By the way, scatter takes x and y; it is a two-dimensional visualization.
I suggest you take a look at Orange, a Python data mining tool. You can do k-means by drag and drop,
and visualize the output of k-means easily.
Good luck! Data mining is fun :-)

Your data is one-dimensional.
Don't expect a pretty 2D plot without making up data.
To get rid of the error, you can set y = x. But it will not change much: the data will continue to be a one-dimensional line.
You could of course add random noise and set y to random values. But that means making up fake data.
For one-dimensional data, I recommend not using clustering at all. These algorithms are designed for complex multivariate data where you can no longer afford a good statistical model. One-dimensional data can be sorted, which allows for much more efficient algorithms. You can easily do KDE on such data, and fit thousands of statistical distributions. This will give you a much more meaningful model with higher statistical power.
From a quick look at your plot, I'd say there are no clusters. Instead, your data looks like a skewed normal distribution with one clear outlier (to be expected at this data set size). Please try a more statistical approach.
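As a rough illustration of the KDE route suggested above (a minimal sketch of mine, not from the original answer, assuming scipy is available; the data is the list from the question):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
x = np.array([916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,
              389,388,380,374,371,369,357,356,340,338,328,317,316,315,313,303,
              283,257,255,254,245,234,232,227,227,222,221,221,219,214,201,200,
              194,169,155,140])
kde = gaussian_kde(x)                            # Gaussian kernel density estimate
grid = np.linspace(x.min(), x.max(), 500)        # evaluation grid over the data range
plt.hist(x, bins=15, density=True, alpha=0.3)    # histogram for comparison
plt.plot(grid, kde(grid))                        # smooth density estimate
plt.show()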

Since you work with only one dimension, you should understand what exactly you are computing. With KMeans, you extract four average values; the best thing you can do here is draw your data as below, with four horizontal lines showing these values. I get the following picture with the code below. This picture is the 1D equivalent of the 2D picture you are showing.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for c in centroids:
    plt.plot([0, len(x)-1], [c, c], "k")  # horizontal line at each cluster mean
for i in range(len(x)):
    plt.plot(i, x[i], colors[labels[i]], markersize = 10)
plt.show()
Computing k-means on 1D data is more interesting with curves like the following one (from the page http://lasp.colorado.edu/home/sorce/2013/01/28/the-sorce-mission-celebrates-ten-years/), because you can obviously see two distinct average values:

Related

Detect cluster outliers

I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, the cleanly annotated data would look like this:
either clustered in a small area or spread across a larger area. The outliers I'm trying to filter out look like this:
the outlier is away from the "correct" cluster.
I tried z-score filtering, but this approach falsely marked many annotations as outliers:
std_score = np.abs((points - points.mean(axis=0)) / (np.std(points, axis=0) + 0.01))
validity = np.all(std_score <= np.quantile(std_score, 0.95, axis=0), axis=1)
Is there a method designed to solve this problem?
This seems like a typical clustering problem, and if the data looks as you suggest, KMeans from scikit-learn should do the trick. Let's look at how we can do this.
First I am generating a data sample which might look somewhat like your data.
import numpy as np
import matplotlib.pylab as plt
np.random.seed(1) # For reproducibility
cluster_1 = np.random.normal(loc = [1,1], scale = [0.2,0.2], size = (20,2))
cluster_2 = np.random.normal(loc = [2,1], scale = [0.4,0.4], size = (5,2))
plt.scatter(cluster_1[:,0], cluster_1[:,1])
plt.scatter(cluster_2[:,0], cluster_2[:,1])
plt.show()
points = np.vstack([cluster_1, cluster_2])
This is how the data will look.
Next, we will do KMeans clustering.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2).fit(points)
We choose n_clusters = 2, believing that there are 2 clusters in the dataset. After finding these clusters, let's look at them.
plt.scatter(points[kmeans.labels_==0][:,0], points[kmeans.labels_==0][:,1], label='cluster_1')
plt.scatter(points[kmeans.labels_==1][:,0], points[kmeans.labels_==1][:,1], label ='cluster_2')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], label = 'cluster_center')
plt.legend()
plt.show()
The result will look like the image shown below.
This should solve your problem, but there are some things to keep in mind:
It will not be perfect all the time.
It might be a problem if you don't have any outliers; this can be handled with silhouette scores (see the sketch after this list).
It is difficult to know which cluster to discard. This can be done by locating the centers of the clusters (the green colored points), or by finding the cluster with the smaller number of points.
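A minimal sketch of the silhouette-score idea (my addition, not part of the original answer): score a few candidate values of k and keep the best, reusing the points array generated above.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 6):
    labels = KMeans(n_clusters=k).fit_predict(points)
    print(k, silhouette_score(points, labels))  # higher score = better-separated clusters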
Endnote: You might lose some points, but you would automate the entire process. It depends on how much you want to trade off in terms of data saved versus manual time saved.

Detect linear zones from 2d dataset

I have a large 2D dataset (.csv) with values from a pressure sensor.
The first value is the pressure reading, while the second one records the time the measurement was taken.
Looking at the plot, I can see a cluster of points (due to noise) in which you can detect some linear parts (the "good working zone") and non-linear zones.
I thought of using a RANSAC algorithm to detect the linear zones, but I'm not sure it's the best way.
With OpenCV I can isolate the linear path and it seems to work well, but my problem is transforming a 2D dataset into a Mat: my sensor gives me 32-bit values and tests take days at a sub-second data rate, so the final 2D matrix is an enormous set of 0s and 1s!
So, according to you, what is the best way to detect linear patterns in a 2d-dataset?
edit:
Sending a real dataset is quite problematic because of its size (approx. 100 MB) and the time needed to run a test (days).
I can send a plot to show my problem.
As you can see, RANSAC apparently works well, but my fear is that a dataset like this:
can cause erroneous results (the first linear part is not detected).
One idea is to split my dataset into parts, but that doesn't seem very efficient...
Is there a method to detect multiple linear zones with RANSAC?
P.S. Here is example Python code for RANSAC:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from skimage.measure import LineModelND, ransac
# x, y are vectors:
# x -> time values
# y -> pressure values
data = np.column_stack([x, y])
model = LineModelND()
model.estimate(data)
model_robust, inliers = ransac(data, LineModelND, min_samples=2, residual_threshold=0.01, max_trials=1000)
outliers = ~inliers  # boolean mask of the points RANSAC rejected
line_x = np.arange(x.min(), x.max()+1)
fig, ax = plt.subplots()
ax.plot(data[inliers, 0], data[inliers, 1], '.b', alpha=0.6,label='Linear Data')
ax.plot(data[outliers, 0], data[outliers, 1], '.r', alpha=0.6,label='Non Linear Data')
ax.legend(loc='lower right')
plt.show()
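One common approach to the multiple-zones question, sketched below under my own assumptions (not from the original post), is sequential RANSAC: fit one line, remove its inliers, then search the remaining points for the next line.
import numpy as np
from skimage.measure import LineModelND, ransac
def sequential_ransac(data, max_lines=5, min_points=20,
                      residual_threshold=0.01, max_trials=1000):
    # Fit one line per pass, drop its inliers, repeat.
    remaining = data.copy()
    lines = []
    while len(remaining) > min_points and len(lines) < max_lines:
        model, inliers = ransac(remaining, LineModelND, min_samples=2,
                                residual_threshold=residual_threshold,
                                max_trials=max_trials)
        if inliers.sum() < min_points:   # no sizeable linear zone left
            break
        lines.append((model, remaining[inliers]))
        remaining = remaining[~inliers]
    return lines
The max_lines and min_points cutoffs are assumptions you would tune to your data rate and zone sizes.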

Detect outliers or noise data in each group in Python

I'm working with data that has 3 columns: type, x, and y. Let's say x and y are correlated and not normally distributed. I want to group by type and filter out outlier or noisy data points in x and y. Could someone recommend statistical or machine learning methods to filter outliers or noisy data? How can I do that in Python?
I'm considering using DBSCAN from scikit-learn; is it an appropriate method?
(Scatter plots for Type1, Type2, and Type3 omitted.)
df1 = df.loc[df['type'] == '3']
data = df1[["x", "y"]]
data.plot.scatter(x = "x", y = "y")

from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps = 0.5,
    metric = "euclidean",
    min_samples = 3,
    n_jobs = -1)
clusters = outlier_detection.fit_predict(data)

from matplotlib import cm
cmap = cm.get_cmap('Accent')
data.plot.scatter(
    x = "x",
    y = "y",
    c = clusters,
    cmap = cmap,
    colorbar = False
)
For this type of data and outliers I would recommend a statistical approach. The SPE/DmodX (distance to model) or Hotelling's T2 test may help you here. I do not see the data for the 3 types, but I generated some.
These methods are available in the pca library. With n_std you can adjust the ellipse width.
pip install pca
import pca
results = pca.spe_dmodx(X, n_std=3, showfig=True)
# If you want to use the Hotelling T2 test instead:
# results1 = pca.hotellingsT2(X, alpha=0.001)
results is a dictionary and contains the labels of the outliers.
Of course you won't get good results if you don't pay attention to the parameters. Just look at your plot: the scale is huge and your epsilon is tiny! It seems your data may be integers, so no points except duplicates will ever have a distance of less than 0.5...
Hence all data is considered noise.
Before using a method, make sure you've understood how it works and what parameters you need to set.
I'd also log-transform the data first (see the sketch below). Working with some simple thresholds may be enough; don't overdo things with clustering when your data is unimodal.
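A minimal sketch of that simple-threshold route (my addition; df1 and its x/y columns come from the snippet above, and the 3-sigma cutoff is an assumption):
import numpy as np
logx = np.log1p(df1["x"])      # log-transform to tame the skew
logy = np.log1p(df1["y"])
keep = (np.abs(logx - logx.mean()) < 3 * logx.std()) & \
       (np.abs(logy - logy.mean()) < 3 * logy.std())
df_clean = df1[keep]           # rows within 3 sigma on both axes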
After some feature engineering you could consider using the OneClassSVM estimator from the sklearn library.
https://justanoderbit.com/outlier-detection/one-class-svm/ describes how to use it for outlier detection.
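A minimal sketch of that idea (my addition; the generated data and the nu value are assumptions, not from the linked article):
import numpy as np
from sklearn.svm import OneClassSVM
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))            # stand-in for one type's x/y columns
ocsvm = OneClassSVM(nu=0.05, kernel="rbf")  # nu ~ expected fraction of outliers
pred = ocsvm.fit_predict(data)              # +1 = inlier, -1 = outlier
outliers = data[pred == -1]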

How to plot a multi-dimensional data point in python

Some background first:
I want to plot the Mel-Frequency Cepstral Coefficients of various songs and compare them.
I calculate MFCCs throughout a song and then average them to get one array of 13 coefficients. I want this to represent one point on a graph that I plot.
I'm new to Python and very new to any form of plotting (though I've seen some recommendations to use matplotlib).
I want to be able to visualize this data. Any thoughts on how I might go about doing this?
Firstly, if you want to represent an array of 13 coefficients as a single point in your graph, you need to reduce the 13 coefficients to the number of dimensions in your graph, as yan king yin pointed out in his comment.
To project your data into 2 dimensions, you can either create relevant indicators yourself (such as max/min/standard deviation/...) or apply dimensionality-reduction methods such as PCA.
Whether and how to do so is another topic.
Then plotting is easy and is done as shown here:
http://matplotlib.org/api/pyplot_api.html
I provide an example code for this solution:
import matplotlib.pyplot as plt
import numpy as np
#fake example data
song1 = np.asarray([1, 2, 3, 4, 5, 6, 2, 35, 4, 1])
song2 = song1*2
song3 = song1*1.5
#list of arrays containing all data
data = [song1, song2, song3]
#calculate 2d indicators
def indic(data):
    # alternatively you can calculate any other indicators
    max = np.max(data, axis=1)
    min = np.min(data, axis=1)
    return max, min
x,y = indic(data)
plt.scatter(x, y, marker='x')
plt.show()
The result looks like this:
Yet I want to suggest another solution to your underlying problem, namely plotting multidimensional data.
I recommend using something like a parallel coordinates plot, which can be constructed with the same fake data:
import pandas as pd
pd.DataFrame(data).T.plot()
plt.show()
The result then shows all coefficients for each song along the x axis and their values along the y axis. It would look as follows:
UPDATE:
In the meantime I have discovered the Python Image Gallery, which contains two nice examples of high-dimensional visualization with reference code:
Radar chart
Parallel plot
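As a quick illustration of the radar-chart option (a minimal sketch of mine using plain matplotlib, reusing the fake song data from above rather than the gallery's reference code):
import numpy as np
import matplotlib.pyplot as plt
song1 = np.asarray([1, 2, 3, 4, 5, 6, 2, 35, 4, 1])
data = [song1, song1 * 2, song1 * 1.5]
angles = np.linspace(0, 2 * np.pi, len(song1), endpoint=False)
ax = plt.subplot(polar=True)                 # radar charts use polar axes
for song in data:
    ax.plot(np.append(angles, angles[0]),    # repeat the first point
            np.append(song, song[0]))        # to close each polygon
plt.show()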

Plot multidimensional vectors in Python

I have a matrix that contains N users and K items. I want to plot that matrix in Python, considering each row as a vector with multiple coordinates. For example, a simple point plot requires X and Y. My vectors have K coordinates, and I want to plot each of those N vectors as a point to see their similarities. Can anyone help me with that?
UPDATE:
#Matrix M shape = (944, 1683)
plt.figure()
plt.imshow(M, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()
plt.show()
but this gave me this result:
What I want is something like this:
It is difficult from this question to be sure whether my answer is relevant, but here's my best guess. I believe deltascience is asking how multidimensional vectors are generally plotted into two-dimensional space, as would be the case with a scatter plot. I think the best answer is that some kind of dimensionality-reduction algorithm is generally performed. In other words, you don't do this by finding the right matplotlib code; you get your data into the right shape (one list for the X axis and another list for the Y axis) and then plot it using a typical matplotlib approach:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
M = np.random.rand(944, 1683)
pca = PCA(n_components=2)
reduced = pca.fit_transform(M)
# We need a 2 x 944 array, not 944 by 2 (all X coordinates in one list)
t = reduced.transpose()
plt.scatter(t[0], t[1])
plt.show()
Here are some relevant links:
https://stats.stackexchange.com/questions/63589/how-to-project-high-dimensional-space-into-a-two-dimensional-plane
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
https://www.evl.uic.edu/documents/etemadpour_choosingvisualization_springer2016.pdf
July 2019 Addendum: It didn't occur to me at the time, but another way people often visualize multi-dimensional data is with network visualization. Each multi-dimensional array in this context would be a node, and the edge weight would be something like the cosine similarity of two nodes, or the Euclidean distance. Networkx in Python has some really nice visualization options.
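A minimal sketch of that network idea (my addition; the matrix size and the 0.8 similarity cutoff are arbitrary assumptions):
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
M = np.random.rand(20, 50)              # 20 vectors with 50 coordinates each
sim = cosine_similarity(M)
G = nx.Graph()
G.add_nodes_from(range(len(M)))
for i in range(len(M)):
    for j in range(i + 1, len(M)):
        if sim[i, j] > 0.8:             # connect only strongly similar vectors
            G.add_edge(i, j, weight=sim[i, j])
nx.draw(G, with_labels=True, node_size=200)
plt.show()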
