I'm trying to cluster and visualise some data with xmeans from the pyclustering lib.
I copied the code directly from the example in the documentation:
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import SIMPLE_SAMPLES
sample = X # read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
# Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will
# start analysis.
amount_initial_centers = 2
initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize()
# Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum
# number of clusters that can be allocated is 20.
xmeans_instance = xmeans(sample, initial_centers, 20)
xmeans_instance.process()
# Extract clustering results: clusters and their centers
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()
# Print total sum of metric errors
print("Total WCE:", xmeans_instance.get_total_wce())
# Visualize clustering results
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, sample)
visualizer.append_cluster(centers, None, marker='*', markersize=10)
visualizer.show()
The only difference is that I assigned my matrix X to sample instead of loading a sample dataset.
When I try to visualise the clustering result I get this error:
Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.
My X matrix is generated in this way:
features = ["I", "Iu", other 7 column names]
data = df[features]
...
X = scaler.fit_transform(data)
Is there a way to visualise the clusters by plotting only two or three features at a time?
I can't find anything in the documentation.
I tried this:
visualizer.append_clusters(clusters, sample[:,[0,1]])
in order to visualise only the first two features and got this error
Only clusters with the same dimension of objects can be displayed on canvas.
EDIT:
I updated the code as suggested in the answer by annoviko but now I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-69-6fd7d2ce5fcd> in <module>
20 visualizer.append_clusters(clusters, X)
21 visualizer.append_cluster(centers, None, marker='*', markersize=10)
---> 22 visualizer.show(pair_filter=[[0, 1], [0, 2]])
/usr/local/lib/python3.8/site-packages/pyclustering/cluster/__init__.py in show(self, pair_filter, **kwargs)
224 raise ValueError("There is no non-empty clusters for visualization.")
225
--> 226 cluster_data = self.__clusters[0].data or self.__clusters[0].cluster
227 dimension = len(cluster_data[0])
228
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It is raised by visualizer.show(), and it happens even if I remove the pair_filter from within the function call.
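Judging by that traceback, the failure seems to come from a NumPy array being evaluated in a boolean context (data or cluster). A possible workaround, purely an assumption based on the traceback and not something stated in the pyclustering docs, is to hand the visualizer plain Python lists instead of the NumPy array:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, X.tolist())  # untested guess: plain lists avoid the ambiguous truth-value check
visualizer.show(pair_filter=[[0, 1], [0, 2]])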
In line with the error that you got:
Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.
You have to use cluster_visualizer_multidim, as the message says. There is documentation (pyclustering 0.10.1) with an example: https://pyclustering.github.io/docs/0.10.1/html/dc/d6b/classpyclustering_1_1cluster_1_1cluster__visualizer__multidim.html
For example, if you have data with more than three dimensions (D > 3) and you want to display (x0, x1) and (x0, x2), then you can do it in the following way:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, sample_4d)
visualizer.show(pair_filter=[[0, 1], [0, 2]])
Here pair_filter specifies which features should be shown. In the example above, it will show only (x0, x1) - [0, 1] and (x0, x2) - [0, 2].
So, in your particular case, where you have to display only the first two features, it should be:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, sample)
visualizer.show(pair_filter=[[0, 1]])
I think I should make the error message more readable and have it propose the other class right in the first sentence. Let me know if this helps (if it is still relevant for you).
I need to use Fisher's linear discriminant to reduce my examples A and B, which are high-dimensional matrices, to simply 2D, exactly like LDA. Each example has classes A and B, so a third example would also have classes A and B, and so would a fourth, fifth and n-th example; I would therefore like to separate them with a simple use of Fisher's linear discriminant. I'm pretty new to machine learning, so I don't know how to separate my classes; I've been following the formula by eye and coding as I go. From what I was reading, I need to apply a linear transformation to my data so I can find a good threshold for it, but first I'd need to find the maximization function. For that task I managed to find Sw and Sb, but I don't know how to go on from there...
This is where I also need to find the maximization function. That maximization function gives me an eigenvalue solution:
What I have for each class are 5x2 matrices, from 2 examples. For instance:
Example 1
Class_A = [
201, 103,
40, 43,
23, 50,
12, 123,
99, 78
]
Class_B = [
201, 129,
114, 195,
180, 90,
69, 62,
76, 90
]
Example 2
Class_A = [
68, 98,
201, 203,
78, 212,
49, 5,
204, 78
]
Class_B = [
52, 19,
220, 219,
159, 195,
99, 23,
46, 50
]
I tried finding Sw for the example above like this:
Example_1_Class_A = np.dot(Example_1_Class_A, np.transpose(Example_1_Class_A))
Example_1_Class_B = np.dot(Example_1_Class_B, np.transpose(Example_1_Class_B))
Example_2_Class_A = np.dot(Example_2_Class_A, np.transpose(Example_2_Class_A))
Example_2_Class_B = np.dot(Example_2_Class_B, np.transpose(Example_2_Class_B))
Sw = np.sum([Example_1_Class_A, Example_1_Class_B, Example_2_Class_A, Example_2_Class_B], axis=0)
As for Sb, I tried this:
Example_1_Class_A_mean = Example_1_Class_A.mean(axis=0)
Example_1_Class_B_mean = Example_1_Class_B.mean(axis=0)
Example_2_Class_A_mean = Example_2_Class_A.mean(axis=0)
Example_2_Class_B_mean = Example_2_Class_B.mean(axis=0)
Example_1_Class_A_Sb = np.dot(Example_1_Class_A_mean, np.transpose(Example_1_Class_A_mean))
Example_1_Class_B_Sb = np.dot(Example_1_Class_B_mean, np.transpose(Example_1_Class_B_mean))
Example_2_Class_A_Sb = np.dot(Example_2_Class_A_mean, np.transpose(Example_2_Class_A_mean))
Example_2_Class_B_Sb = np.dot(Example_2_Class_B_mean, np.transpose(Example_2_Class_B_mean))
Sb = np.sum([Example_1_Class_A_Sb, Example_1_Class_B_Sb, Example_2_Class_A_Sb, Example_2_Class_B_Sb], axis=0)
The problem is, I have no idea what else to do with my Sw and Sb; I am completely lost. Basically, what I need to do is get from here to this:
How, for a given Example A and Example B, do I separate a cluster only for class A points and only for class B points?
Before answering your question, I will first touch on the basic difference between PCA and (F)LDA. In PCA you don't know anything about the underlying classes, but you assume that the information about class separability lies in the variance of the data. So you rotate your original axes (sometimes this is called projecting all the data onto new axes) in such a way that your first new axis points in the direction of most variance, the second one is perpendicular to the first and points in the direction of most residual variance, and so on. This way a PCA transformation results in a (sub)space of the same dimensionality as the original one. Then you can take only the first 2 dimensions, rejecting the rest, hence getting a dimensionality reduction from k dimensions to only 2.
LDA works a bit differently. In this case you know in advance how many classes there are in your data, and you can find their mean and covariance matrices. What the Fisher criterion does is find a direction in which the separation between class means is maximized while, at the same time, total variability is minimized (total variability being the mean of the within-class covariance matrices). And for each pair of classes there is only one such line. This is why, when your data has C classes, LDA can provide you with at most C-1 dimensions, regardless of the original data dimensionality. In your case this means that, as you have only 2 classes A and B, you will get a one-dimensional projection, i.e. a line. And this is exactly what you have in your picture: the original 2D data is projected onto a line. The direction of the line is the solution of the eigenproblem.
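For reference, the quantity being maximized is the standard Fisher criterion (the textbook formulation, added here for context rather than taken from your post):

J(w) = \frac{w^T S_B w}{w^T S_W w}

and for two classes its maximizer is, up to scale,

w \propto S_W^{-1} (\mu_a - \mu_b),

which is exactly the "trick" line in the code below.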
Let's generate data that is similar to your picture:
import numpy as np
import matplotlib.pyplot as plt

a = np.random.multivariate_normal((1.5, 3), [[0.5, 0], [0, .05]], 30)
b = np.random.multivariate_normal((4, 1.5), [[0.5, 0], [0, .05]], 30)
plt.plot(a[:,0], a[:,1], 'b.', b[:,0], b[:,1], 'r.')
mu_a, mu_b = a.mean(axis=0).reshape(-1,1), b.mean(axis=0).reshape(-1,1)
Sw = np.cov(a.T) + np.cov(b.T)
inv_S = np.linalg.inv(Sw)
res = inv_S.dot(mu_a-mu_b) # the trick
####
# more general solution
#
# Sb = (mu_a-mu_b)*((mu_a-mu_b).T)
# eig_vals, eig_vecs = np.linalg.eig(inv_S.dot(Sb))
# res = sorted(zip(eig_vals, eig_vecs), reverse=True)[0][1] # take only eigenvec corresponding to largest (and the only one) eigenvalue
# res = res / np.linalg.norm(res)
plt.plot([-res[0], res[0]], [-res[1], res[1]]) # this is the solution
plt.plot(mu_a[0], mu_a[1], 'cx')
plt.plot(mu_b[0], mu_b[1], 'yx')
plt.gca().axis('square')
# let's project data point on it
r = res.reshape(2,)
n2 = np.linalg.norm(r)**2
for pt in a:
    prj = r * r.dot(pt) / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'b.:', alpha=0.2)
for pt in b:
    prj = r * r.dot(pt) / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'r.:', alpha=0.2)
The resulting projection is calculated using a neat trick for the two-class problem. You can read the details on it here, in section 1.6.
Regarding the "examples" you mention in your question: I believe you need to repeat the process for each example, as each is a different set of data points, probably with different distributions. Also, note that the estimated means (mu_a, mu_b) and class covariance matrices will be slightly different from the ones the data was generated with, especially for small sample sizes.
Mathematics
See https://sebastianraschka.com/Articles/2014_python_lda.html#lda-in-5-steps for more information.
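The linked article covers the derivation; as a quick reminder (standard definitions, stated here for convenience), with class means m_i, overall mean m, and N_i observations in class i:

S_W = \sum_i \sum_{x \in D_i} (x - m_i)(x - m_i)^T
S_B = \sum_i N_i \, (m_i - m)(m_i - m)^T

and the LDA directions are the leading eigenvectors of S_W^{-1} S_B.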
Implementation using Iris
Since you want to use LDA for dimensionality reduction but provided only 2D data, I am showing how to perform this procedure on the iris dataset.
Let's import libraries
import pandas as pd
import numpy as np
import sklearn as sk
from collections import Counter
from sklearn import datasets
# load dataset and transform to pandas df
X, y = datasets.load_iris(return_X_y=True)
X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(4)])
y = pd.DataFrame(y, columns=['labels'])
tot = pd.concat([X,y], axis=1)
# calculate class means
class_means = tot.groupby('labels').mean()
total_mean = X.mean()
The class_means are given by:
class_means
        feat_0  feat_1  feat_2  feat_3
labels
0        5.006   3.428   1.462   0.246
1        5.936   2.770   4.260   1.326
2        6.588   2.974   5.552   2.026
To compute the within-class scatter matrix, we first subtract the corresponding class mean from each observation (basically we calculate x - m_i from the equation above).
x_mi = tot.transform(lambda x: x - class_means.loc[x['labels']], axis=1).drop('labels', axis=1)
def kronecker_and_sum(df, weights):
    S = np.zeros((df.shape[1], df.shape[1]))
    for idx, row in df.iterrows():
        x_m = row.to_numpy().reshape(df.shape[1], 1)
        S += weights[idx] * np.dot(x_m, x_m.T)
    return S
# Each x_mi is weighted with 1. Now we use the kronecker_and_sum function to calculate the within-class scatter matrix S_w
S_w = kronecker_and_sum(x_mi, 150*[1])
mi_m = class_means.transform(lambda x: x - total_mean, axis=1)
# Each mi_m is weighted with the number of observations per class which is 50 for each class in this example. We use kronecker_and_sum to calculate the between-class scatter matrix.
S_b=kronecker_and_sum(mi_m, 3*[50])
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_w).dot(S_b))
We only need to consider the eigenvalues that are markedly different from zero (in this case only the first two):
eig_vals
array([ 3.21919292e+01, 2.85391043e-01, 6.53468167e-15, -2.24877550e-15])
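Note that np.linalg.eig gives no ordering guarantee, so strictly speaking the eigenpairs should be sorted by eigenvalue magnitude before selecting; in this run the two large ones happen to come first, but a small defensive step (my addition, not part of the original answer) is:
# sort the eigenpairs by absolute eigenvalue, largest first
order = np.argsort(np.abs(eig_vals))[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]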
Now transform X with the matrix of the two eigenvectors that correspond to the highest eigenvalues:
W = eig_vecs[:, :2]
X_trafo = np.dot(X, W)
tot_trafo = pd.concat([pd.DataFrame(X_trafo, index=range(len(X_trafo))), y], axis=1)
# plot the result
tot_trafo.plot.scatter(x=0, y=1, c='labels', colormap='viridis')
We have reduced the dimensions from 4 to 2 and chosen the space in such a way that the classes can be well separated.
Scikit-learn usage
Scikit-learn has LDA support as well. What we did in dozens of lines can be done with the following few lines of code:
from sklearn import discriminant_analysis
lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2)
X_trafo_sk = lda.fit_transform(X,y)
pd.DataFrame(np.hstack((X_trafo_sk, y))).plot.scatter(x=0, y=1, c=2, colormap='viridis')
I'm not giving a plot here, because it is the same as in our derived example (except for a 180 degree rotation).
I have a map of data:
import seaborn as sns
import matplotlib.pyplot as plt
X = 101_by_99_float32_array
ax = sns.heatmap(X, square = True)
plt.show()
Note these data are essentially a 3D surface, and I'm interested in the index positions in X after clustering. I can easily apply the kmeans algorithm to my data:
from sklearn.cluster import KMeans
# three clusters is arbitrary; just used for testing purposes
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(X)
But I am not sure how to navigate kmeans in a way that will identify to which cluster a pixel in the map above belongs. What I want to do is make a map that looks like the one above, but instead of plotting the z-value for each cell in the 101x99 array X, I'd like to plot the cluster number for each cell in X.
I don't know if this is possible with the output of the kmeans algorithm, but I did try an approach from the scikit-learn documentation here:
import numpy as np
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)
colors = ['#4EACC5', '#FF9C34', '#4E9A06']
plt.figure()
#plt.hold(True)
for k, col in zip(range(3), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], 'w',
             markerfacecolor=col, marker='.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)
plt.title('KMeans')
plt.show()
But it's clear this is not accessing the information I want...
It's obvious I do not fully understand what each component of the kmeans output represents, and I've tried to read the explanations in the answer to the question found here. However, there's nothing in that answer that explicitly addresses whether the indices of the original data were preserved after clustering, which is really the core of my question. If such information is implicitly present in kmeans through some matrix multiplication, I could really use some help extracting it.
Thank you for your time and assistance!
EDIT:
Thanks to @Nakor, for both the explanation about kmeans and the suggestion to reshape my data. How kmeans is interpreting my data is now much clearer. I should not expect it to capture the indices of each sample, but instead rely on reshape to do so. reshape will ravel the original (101,99) matrix into a (9999,1) array which, as @Nakor pointed out, is suitable for clustering every entry as an individual sample.
Simply reapplying reshape to kmeans.labels_ with the original shape of the data gives the result I was looking for:
Y = X.reshape(-1, 1) # shape data to cluster each individual entry
kmeans= KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(Y)
Z = kmeans.labels_
A = Z.reshape(101,99)
plt.figure()
ax = sns.heatmap(X, square=True)   # original data
plt.figure()
ay = sns.heatmap(A, square=True)   # cluster label for each cell
Your issue is that sklearn.cluster.KMeans expects a 2D matrix with [N_samples, N_features]. However, you provide the raw image, so sklearn understands you have 101 samples with 99 features each (each row of your image is a sample, and the columns are the features). As a result, what you get in k_means.labels_ is the cluster assignment of each of the rows.
If you want instead to cluster every single entry, you need to reshape your data, for instance like this:
model = KMeans(init='k-means++', n_clusters=3, n_init=10)
model.fit(X.reshape(-1,1))
If I check with randomly generated data, I get:
In [1]: len(model.labels_)
Out[1]: 9999
I have one label per entry.
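To get from there back to the asker's map of cluster IDs, the flat label vector only needs to be reshaped to the original image shape. A minimal sketch, assuming X is the 101x99 array and seaborn/matplotlib are imported as in the question:
# reshape the per-entry labels back into the image grid and plot them
label_map = model.labels_.reshape(X.shape)
sns.heatmap(label_map, square=True)
plt.show()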
I have the below scikit-learn script which outputs a nice chart (below) with each of the clusters.
I have a couple of questions:
- How can I export this to CSV - with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top-right segment 'high spenders'; how do I do that so it is always labelled correctly?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs  # was sklearn.datasets.samples_generator in older scikit-learn versions
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across each clusters
#centers = The number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster.
#cluster_std = the standard deviation of the clusters - a quantity expressing by how much the members of a group differ from the mean value for the group (how tight is the cluster going to be)
#random_state = controls the random number generator. If you don't set random_state, a new random value is used on every run and the generated data will differ each time. If you pass a fixed value (random_state=1 or any other value), the result will be the same on every run, i.e. the same generated values each time.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns which contain the (x, y) Gaussian coordinates of these points, whereas y contains the list of categories for each.
#X, y = simply means that the output of make_blobs() has two elements, that are assigned to X and y.
X, y = make_blobs(n_samples=300, centers=4,
cluster_std=0.50, random_state=0)
#X now looks like this - column zero becomes the X axis, column1 becomes the Y axis
array([[ 1.85219907, 1.10411295],
[-1.27582283, 7.76448722],
[ 1.0060939 , 4.43642592],
[-1.20998253, 7.83203579],
[ 1.92461484, 1.06347673],
[ 2.28565919, 0.79166208],
[-1.57379043, 2.69773813],
[ 1.04917913, 4.31668562],
[-1.07436851, 7.93489945],
[-1.15872975, 7.97295642]
#The below statement, will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the size of the points.
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], s=50);
#now, I am defining that I want to find 4 clusters within the data. The general rule I follow is to have roughly 7 times fewer clusters than data points.
kmeans = KMeans(n_clusters=4)
#build the model, based on X with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, which is based on the predict variable we defined above
#s = the size of the plots
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code, the following worked for me. You can certainly stay with numpy for storing the CSV, but I simply prefer pandas. The sorting line should give you the same results every time you run the code. However, since the initialization of the clusters can have an impact, I would also set a seed in your code, e.g. np.random.seed(42), and call the kmeans function with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42); a short sketch of this is given at the end of this answer.
# transform to dataframe
import pandas as pd
import seaborn as sns
df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster"], palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster_name"],
palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
The first plot looks like this:
The second plot looks like this:
The CSV will look like this:
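Finally, for the reproducibility point mentioned above, a minimal sketch of pinning down the randomness (assuming you regenerate the data and refit the model yourself):
# fix numpy's RNG and KMeans' own initialization so the cluster IDs
# (and therefore the names mapped onto them) come out the same on every run
np.random.seed(42)
kmeans = KMeans(n_clusters=4, random_state=42)
predict = kmeans.fit_predict(X)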
I am writing a piece of code to identify different 2D shapes using OpenCV. I get 4 sets of data from each image of a 2D shape, and these are stored in the multidimensional array featureVectors.
I am trying to write an SVM/SVC that takes into account all 4 features obtained from the image. I have been able to make it work with just 2 features, but when I try all 4 my graph comes out looking like this.
My Graph which is incorrect
My values for featureVectors are:
[[ 4.00000000e+00 1.74371349e-03 6.49705560e-01 9.07957236e+01]
[ 4.00000000e+00 4.60937436e-02 1.97642179e-01 9.02041472e+01]
[ 1.00000000e+00 1.18553450e-03 3.03491372e-01 6.03489082e+01]
[ 1.00000000e+00 1.54552898e-02 8.38091425e-01 1.09021207e+02]
[ 3.00000000e+00 1.69961646e-02 4.13691915e+01 1.36838300e+02]]
And my Labels are:
[[2]
[2]
[0]
[0]
[1]]
Here is my code for the SVM:
#import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm

#Saving featureVectors to a csv file
values1 = featureVectors
header1 = ["Number of Sides", "Standard Deviation of Number of Sides/Perimeter",
           "Standard Deviation of the Angles", "Largest Angle"]
my_df = pd.DataFrame(featureVectors)
my_df.to_csv('featureVectors.csv', index=True, header=header1)

#Saving labels to a csv file
values2 = labels
header2 = ["Label"]
my_df = pd.DataFrame(labels)
my_df.to_csv('labels.csv', index=True, header=header2)

#Writing the SVM
def Build_Data_Set(features = header1, features1 = header2):
    data_df = pd.read_csv("featureVectors.csv", index_col=0)
    #data_df = data_df[:250]
    X = np.array(data_df[features].values)
    data_df2 = pd.read_csv("labels.csv", index_col=0)
    y = np.array(data_df2[features1].values)
    #print(X)
    #print(y)
    return X,y

def Analysis():
    X,y = Build_Data_Set()
    clf = svm.SVC(kernel = 'linear', C = 1.0)
    clf.fit(X, y)
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(0,5)
    yy = np.linspace(0,185)
    h0 = plt.plot(xx,yy, "k-", label="non weighted")
    plt.scatter(X[:, 0],X[:, 1],c=y)
    plt.ylabel("Maximum Angle (Degrees)")
    plt.xlabel("Number Of Sides")
    plt.title('Shapes')
    plt.legend()
    plt.show()

Analysis()
I have only used 5 data sets (shapes) so far because I knew it wasn't working correctly.
The SVM part of your code is actually correct. The plotting part around it is not, and given the code I'll try to give you some pointers.
First of all:
another example I found(i cant find the link again) said to do that
Copying code without understanding it will probably cause more problems than it solves. Given your code, I'm assuming you used this example as a starter.
plt.scatter(X[:, 0],X[:, 1],c=y)
In the sk-learn example, this snippet is used to plot data points, coloring them according to their label. This works there because the example deals with 2-dimensional data. The data you're dealing with is 4-dimensional, so you're actually just plotting the first two dimensions.
plt.scatter(X[:, 0], y, c=y)
on the other hand makes no sense.
xx = np.linspace(0,5)
yy = np.linspace(0,185)
h0 = plt.plot(xx,yy, "k-", label="non weighted")
This line has actually nothing to do with your classifier's decision boundary. It's just a straight plot of y over x in your coordinate system.
(In addition to that, you're dealing with multi-class data, so you'll have as many decision boundaries as you have classes.)
Now your actual problem is data dimensionality. You're trying to plot 4-dimensional data in a 2d plot, which simply won't work.
A possible approach would be to perform dimensionality reduction to map your 4D data into a lower-dimensional space; if you want to go that way, I'd suggest reading e.g. the excellent sklearn documentation for an introduction to SVMs, plus something about dimensionality reduction.
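If you want to try the dimensionality-reduction route, here is a minimal sketch using PCA from scikit-learn (PCA is my choice for illustration; featureVectors and labels are assumed to be the arrays shown in the question, and this only visualizes the data in 2D, not the SVM's decision boundaries):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_4d = np.asarray(featureVectors)   # shape (n_samples, 4)
y = np.asarray(labels).ravel()      # shape (n_samples,)

# project the 4-dimensional features onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X_4d)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=50)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Shapes projected onto two principal components")
plt.show()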
I am considering using OpenCV's kmeans implementation, since it is said to be faster...
Now I am using the cv2 package and the kmeans function, but I cannot understand the parameters' description in the reference:
Python: cv2.kmeans(data, K, criteria, attempts, flags[, bestLabels[, centers]]) → retval, bestLabels, centers
samples – Floating-point matrix of input samples, one row per sample.
clusterCount – Number of clusters to split the set by.
labels – Input/output integer array that stores the cluster indices for every sample.
criteria – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster centers moves by less than criteria.epsilon on some iteration, the algorithm stops.
attempts – Flag to specify the number of times the algorithm is executed using different initial labelings. The algorithm returns the labels that yield the best compactness (see the last function parameter).
flags –
Flag that can take the following values:
KMEANS_RANDOM_CENTERS Select random initial centers in each attempt.
KMEANS_PP_CENTERS Use kmeans++ center initialization by Arthur and Vassilvitskii [Arthur2007].
KMEANS_USE_INITIAL_LABELS During the first (and possibly the only) attempt, use the user-supplied labels instead of computing them from the initial centers. For the second and further attempts, use the random or semi-random centers. Use one of KMEANS_*_CENTERS flag to specify the exact method.
centers – Output matrix of the cluster centers, one row per each cluster center.
What does the argument flags[, bestLabels[, centers]]) mean? And what about this one: → retval, bestLabels, centers?
Here's my code:
import cv, cv2
import scipy.io
import numpy
# read data from .mat file
mat = scipy.io.loadmat('...')
keys = mat.keys()
values = mat.viewvalues()
data_1 = mat[keys[0]]
nRows = data_1.shape[1]
nCols = data_1.shape[0]
samples = cv.CreateMat(nRows, nCols, cv.CV_32FC1)
labels = cv.CreateMat(nRows, 1, cv.CV_32SC1)
centers = cv.CreateMat(nRows, 100, cv.CV_32FC1)
#centers = numpy.
for i in range(0, nCols):
    for j in range(0, nRows):
        samples[j, i] = data_1[i, j]

cv2.kmeans(data_1.transpose,
           100,
           criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_MAX_ITER, 0.1, 10),
           attempts=cv2.KMEANS_PP_CENTERS,
           flags=cv2.KMEANS_PP_CENTERS,
           )
And I encounter this error:
flags=cv2.KMEANS_PP_CENTERS,
TypeError: <unknown> is not a numpy array
How should I understand the parameter list and the usage of cv2.kmeans? Thanks
The documentation on this function is almost impossible to find. I wrote the following Python code in a bit of a hurry, but it works on my machine. It generates two multivariate Gaussian distributions with different means and then classifies them using cv2.kmeans(). You may refer to this blog post to get some idea of the parameters.
Handle imports:
import cv
import cv2
import numpy as np
import numpy.random as r
Generate some random points and shape them appropriately:
samples = cv.CreateMat(50, 2, cv.CV_32FC1)
random_points = r.multivariate_normal((100,100), np.array([[150,400],[150,150]]), size=(25))
random_points_2 = r.multivariate_normal((300,300), np.array([[150,400],[150,150]]), size=(25))
samples_list = np.append(random_points, random_points_2).reshape(50,2)
random_points_list = np.array(samples_list, np.float32)
samples = cv.fromarray(random_points_list)
Plot the points before and after classification:
blank_image = np.zeros((400,400,3))
blank_image_classified = np.zeros((400,400,3))
for point in random_points_list:
    cv2.circle(blank_image, (int(point[0]), int(point[1])), 1, (0,255,0), -1)

temp, classified_points, means = cv2.kmeans(data=np.asarray(samples), K=2, bestLabels=None,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_MAX_ITER, 1, 10), attempts=1,
        flags=cv2.KMEANS_RANDOM_CENTERS) #Let OpenCV choose random centers for the clusters

for point, allocation in zip(random_points_list, classified_points):
    if allocation == 0:
        color = (255,0,0)
    elif allocation == 1:
        color = (0,0,255)
    cv2.circle(blank_image_classified, (int(point[0]), int(point[1])), 1, color, -1)
cv2.imshow("Points", blank_image)
cv2.imshow("Points Classified", blank_image_classified)
cv2.waitKey()
Here you can see the original points:
Here are the points after they have been classified:
I hope that this answer may help you, it is not a complete guide to k-means, but it will at least show you how to pass the parameters to OpenCV.
The problem here is that your data_1.transpose is not a numpy array: without parentheses it is just the bound method, not the transposed matrix. Call it, i.e. data_1.transpose() (or use data_1.T), to get an actual array.
The OpenCV 2.3.1 and higher Python bindings do not take anything except numpy arrays as image/array parameters, so the first argument to cv2.kmeans has to be a numpy array.
Generally, all the points in OpenCV are of type numpy.ndarray
eg.
array([[[100., 433.]],
       [[157., 377.]],
       ...,
       [[147., 247.]]], dtype=float32)
where each element of array is
array([[100., 433.]], dtype=float32)
and the element of that array is
array([100., 433.], dtype=float32)
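Putting the two answers together, a minimal sketch of a corrected call using the newer cv2 API (assuming data_1 is the float matrix loaded from the .mat file; note the parentheses on transpose and the float32 conversion):
import numpy as np
import cv2

# kmeans needs a float32 numpy array with one sample per row
data = np.float32(data_1.transpose())

# criteria is (type, max_iter, epsilon) - the question had max_iter and epsilon swapped
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

compactness, best_labels, centers = cv2.kmeans(data=data, K=100, bestLabels=None,
        criteria=criteria, attempts=10, flags=cv2.KMEANS_PP_CENTERS)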