Plot K-means clusters after TruncatedSVD in Python

I'm trying to plot the results of running clustering on my data set but I'm getting the error:
File "cluster.py", line 93, in <module>
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 957, in predict
X = self._check_test_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 867, in _check_test_data
n_features, expected_n_features))
ValueError: Incorrect number of features. Got 2 features, expected 73122
My call to fit() works fine, but the plotting is where it goes wrong.
Here's my code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

reduced_data = TruncatedSVD(n_components=2).fit_transform(X)
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=1, verbose=False)
kmeans.fit(X)

h = .02  # step size of the mesh [x_min, x_max] x [y_min, y_max]

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in the mesh. Use the last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)

# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Can anyone suggest how I can change up my code to get a diagram of the clusters?

The traceback is telling you what the issue is:
ValueError: Incorrect number of features. Got 2 features, expected 73122
The kmeans model was fit on 73122-dimensional training samples, so you cannot use it to make predictions on 2-dimensional test samples: you reduced the data to two dimensions with TruncatedSVD, but then fit KMeans on the original, unreduced X.
To fix your code, simply change kmeans.fit(X) to kmeans.fit(reduced_data).
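For reference, here is a minimal sketch of the corrected fitting step, reusing the variables from the question; everything after it (the meshgrid, predict, and imshow calls) can stay exactly as posted:
reduced_data = TruncatedSVD(n_components=2).fit_transform(X)
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=1, verbose=False)
# fit on the 2-D projection so that kmeans.predict accepts the 2-D mesh
# points, and so cluster_centers_ lives in the same 2-D space as the plot
kmeans.fit(reduced_data)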

Related

How to convert clustering simple plot to region coloring plots?

I have this plot for my clustering, and this is the code I use to plot it:
y_pred = KMeans(n_clusters=4,
                random_state=random_state).fit_predict(X_train_pca)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")
plt.show()
But I want to change it so that the background is colored too, i.e. so that it shows regions, something like this:
I really appreciate any help that you can provide.
Based on scikit-learn's demo for plotting kmeans decision boundaries:
Modify your current code to store the KMeans model so we can use the fitted model later to color the decision surface. Here it's stored as lowercase kmeans.
kmeans = KMeans(n_clusters=4, random_state=random_state)
y_pred = kmeans.fit_predict(X_train_pca)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_pred, cmap='Dark2')
Construct an underlying meshgrid from X_train_pca and use kmeans.predict to label the entire mesh. Then use imshow to plot this mesh in color.
# step size of mesh (decrease h to increase plot quality)
h = 0.02
# construct mesh
x_min, x_max = X_train_pca[:, 0].min() - 1, X_train_pca[:, 0].max() + 1
y_min, y_max = X_train_pca[:, 1].min() - 1, X_train_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# obtain labels per mesh point (reuse stored model)
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# put result into color plot
Z = Z.reshape(xx.shape)
plt.imshow(
    Z, interpolation='nearest', cmap='Set2', alpha=0.75,
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    aspect='auto', origin='lower',
)
I don't have your original data, but this is the output from some random training data:

Plotting classification area based on logistic regression

Let's consider the following data:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
I want to fit a logistic regression on that data set and then create a plot which shows the classification area. So I used:
model = LogisticRegression(solver='liblinear', random_state=0)
est = model.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=est.predict(X))
plt.show()
But how can I make it look like the one below?
Edit
I created the plot below, but I still don't know how to change specific classes to squares or x's, or how to create a legend. Do you know how it can be done? I know I have to do something with marker='s' and marker='x', but that changes the look of the whole image, and I only want to change specific classes.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
logreg = LogisticRegression(C=1e5)
# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02 # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
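One possible approach for the edit (a sketch, not from the original thread; it reuses X, Y, xx, yy, and Z from the code above, and the markers are arbitrary choices): draw one scatter call per class, each with its own marker and label, and then call plt.legend():
# Background regions, as before
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# One scatter call per class, so each class gets its own marker and
# legend entry; markers here are arbitrary illustrative choices.
markers = ['o', 's', 'x']
for cls, (marker, name) in enumerate(zip(markers, iris.target_names)):
    mask = (Y == cls)
    plt.scatter(X[mask, 0], X[mask, 1], marker=marker, label=name)
plt.legend()
plt.show()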

Graph k-NN decision boundaries in Matplotlib

How do I color the decision boundaries for a k-Nearest Neighbor classifier as seen here:
I've got the data for the 3 classes successfully plotted out using scatter (left picture).
Image source: http://cs231n.github.io/classification/
To plot decision boundaries you need to make a meshgrid. You can use np.meshgrid to do this. np.meshgrid requires the min and max values of X and Y and a mesh step size parameter. It is sometimes prudent to make the minimum values a bit lower than the minimum of x and y, and the max values a bit higher.
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
You then feed your classifier the meshgrid, like so: Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]). You need to reshape the output of this to the same shape as your original meshgrid: Z = Z.reshape(xx.shape). Finally, when you are making your plot, call plt.pcolormesh(xx, yy, Z, cmap=cmap_light); this will make the decision boundaries visible in your plot.
Below is a complete example to achieve this, found at http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
This results in the following two graphs being output:
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
If I take this X as a 3-dimensional dataset, what would need to change in the loop above?
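A sketch of one common answer (not from the original thread): with a 3-dimensional X the classifier expects 3 features, so the 2-D meshgrid can no longer be fed to clf.predict directly. One standard workaround is to project the data to two dimensions first, e.g. with PCA, and fit and plot in that space:
from sklearn.decomposition import PCA

X3 = iris.data[:, :3]                       # hypothetical 3-feature dataset
X2 = PCA(n_components=2).fit_transform(X3)  # project to 2-D for plotting
clf = neighbors.KNeighborsClassifier(n_neighbors, weights='uniform')
clf.fit(X2, y)                              # fit on the 2-D projection
# ... then build xx, yy from X2 exactly as above, and:
# Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])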

Trying to clean up boundaries on nearest neighbour plot

I'm doing a small plot with some training data before testing it in python. My plot looks like this right now. The training data comes from images of letters in Fourier space which I have masked to produce different values for the letters.
These boundaries look less than ideal to me and I'm not sure how to go about fixing them so that the red and blue points have their own distinct areas. Here is the code I am using:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier

X = np.matrix(X)
#knn = KNeighborsClassifier(n_neighbors=3)
y = [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2]
h = 0.2  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
n_neighbors = 10

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 30, X[:, 0].max() + 20
    y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.show()
Is it a case of finding different data points to classify my data with, or is there a way of changing how these decision boundaries are formed? Any help would be appreciated; please let me know if I'm being too vague about something. Thanks in advance!
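One thing worth checking (an assumption about the data, since the original features aren't shown): the margins in the code (-30/+20 on x versus -5/+5 on y) suggest the two features live on very different scales, and k-NN's Euclidean distances are then dominated by the larger-scale feature, which can smear the boundaries. A minimal sketch of rescaling before fitting, reusing the variables above:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(np.asarray(X))  # zero mean, unit variance per feature
clf = KNeighborsClassifier(n_neighbors, weights='uniform')
clf.fit(X_scaled, y)  # distances now weight both features equally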

Separate svm classes for matplotlib legend

I am trying to learn sklearn, and to that end I am doing a simple exercise with a linear SVM. The SVC tries to predict the number of bedrooms in a house, based on the value of the house and its area. I have managed to get something that looks OK, but the template I took from matplotlib's documentation uses a color map and I don't know exactly what corresponds to what.
How could I add a legend that specifies what the color of each scattered point corresponds to, and what the SVM's sections correspond to as well?
Also, in order to make this work, I had to preprocessing.scale my features, and the ticks now show the scaled values. How could I unscale them somehow, or retrieve the original values to use for the axis graduations?
Here is the plot:
http://i.imgur.com/ERFuEmJ.png (I don't have enough reputation to post directly)
And here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn import preprocessing, svm

style.use('ggplot')
dataset = pd.read_csv('/Path/Paros.csv')
dataset = dataset[dataset['size'] < 3000]
X = np.array(dataset[['size', 'value']])
y = np.array(dataset[['bedrooms']])
X = preprocessing.scale(X)

h = 0.01  # step size in the mesh
C = 0.01  # SVM regularization parameter
clf = svm.SVC(kernel='linear', C=C).fit(X, y[:, 0])

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
print("mesh")
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y[:, 0], cmap=plt.cm.Paired)  # use the 1-D column for colors
plt.xlabel('Size')
plt.ylabel('Price')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
plt.colorbar() did what I was looking for.
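For the second part of the question (the scaled ticks), a minimal sketch, not from the original answer: if you use a StandardScaler object instead of the one-shot preprocessing.scale, you keep the fitted parameters around and can map scaled coordinates back to the original units with inverse_transform:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(np.array(dataset[['size', 'value']]))
# ... fit and plot in scaled space as before, then relabel the ticks:
ax = plt.gca()
xticks = ax.get_xticks()
# map scaled x-tick positions back to original 'size' units
# (pair each tick with a dummy 0 for the second feature)
orig = scaler.inverse_transform(np.c_[xticks, np.zeros_like(xticks)])[:, 0]
ax.set_xticklabels(['%.0f' % v for v in orig])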
