Decision boundaries for nearest centroid - python

I am trying to draw decision boundaries for different classifiers, including NearestCentroid, but when I use this code:
if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
I get an error saying 'NearestCentroid' object has no attribute 'predict_proba'. How can I fix this?

You can attach your own predict_proba to the estimator. Note that the distances are negated before the softmax, so that the closest centroid gets the highest probability:
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import softmax
from sklearn.metrics.pairwise import pairwise_distances

def predict_proba(self, X):
    # smaller distance to a centroid should translate into a larger probability
    distances = pairwise_distances(X, self.centroids_, metric=self.metric)
    probs = softmax(-distances)
    return probs

clf = NearestCentroid()
clf.predict_proba = predict_proba.__get__(clf)  # bind the function as an instance method
clf.fit(X_train, y_train)
clf.predict_proba(X_test)
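A quick sanity check of the patched method (reusing X_train, y_train and X_test from above): each row of the output should sum to 1, and its argmax should agree with predict.
import numpy as np

probs = clf.predict_proba(X_test)
assert np.allclose(probs.sum(axis=1), 1.0)  # rows are valid probability distributions
assert (clf.classes_[probs.argmax(axis=1)] == clf.predict(X_test)).all()  # most probable class == predicted class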

Assuming your X has two features, you can generate a meshgrid where each axis corresponds to one of the features.
Assuming X is your feature array with two features - its shape would be (N, 2), where N is the number of samples - and y is your target array:
# first determine the min and max boundaries for generating the meshgrid
feat1_min, feat1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
feat2_min, feat2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
Now generate your meshgrid and make predictions along the grid:
xx, yy = np.meshgrid(np.arange(feat1_min, feat1_max, 0.02),
                     np.arange(feat2_min, feat2_max, 0.02))  # 0.02 is the step size
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Now make the plot:
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap="autumn")
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="autumn",
            edgecolor='k', s=10)
plt.show()
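The same steps can be wrapped into a small helper if you want to compare several classifiers side by side. This is a hypothetical convenience function, not part of the original answer, and it assumes numeric class labels:
def plot_decision_boundary(clf, X, y, step=0.02, cmap="autumn"):
    # X must have exactly two features; clf must already be fitted
    f1_min, f1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    f2_min, f2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(f1_min, f1_max, step),
                         np.arange(f2_min, f2_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolor='k', s=10)
    plt.show()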

As BearBrown pointed out, you only check whether "decision_function" is an attribute of clf; you never check whether "predict_proba" is one:
if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
elif hasattr(clf, "predict_proba"):  # this makes it explicit that predict_proba is not an attribute of clf
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
else:  # this will surface your error again
    raise AttributeError("Neither 'decision_function' nor 'predict_proba' found in clf")
After this, you should check why the attribute you expect is missing from clf.
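Alternatively, instead of raising, you can fall back to hard class predictions, which every fitted scikit-learn classifier provides (a hedged variant of the same check; the result is then a label grid rather than a score grid):
if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
elif hasattr(clf, "predict_proba"):
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
else:
    # NearestCentroid has neither method out of the box, but it does have predict
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])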

Related

sklearn svm svc classification for data that are equally spread in space

I'm trying to use the SVM classifier to separate the data into a fixed number of clusters.
However, the classifier generates a few clusters with no apparent relation between the points. Do I need another type of kernel, or is the data simply spread in a way that SVM cannot do better with?
The original problem is related to a Wireless Sensor Network (WSN), where a number of sensors are spread in space and communicate with a base station. My approach is to use SVM (sklearn svm.SVC), currently with a linear kernel, to group the data into K clusters.
I have N = 300 sensors to be grouped into K = 5 clusters. PS: I'm aware of other algorithms (k-means, c-means, fuzzy c-means, ...) that can be used for this problem and would probably perform better.
cmap_bold = ListedColormap(
    ['#FF0000', '#00FF00', '#0000FF', '#3F3FBF', '#0C0404'])
nb_clusters = 5
X = [[node.pos_x, node.pos_y] for node in network[0:-1]]
y = []
points_per_cluster = float(network.count_alive_nodes() / nb_clusters)
for i in range(nb_clusters):
    for _ in range(int(points_per_cluster)):
        y.append(i)
X = np.array(X)
y = np.array(y)
C = 1.0
svc = svm.SVC(kernel='linear', C=C)
svc.fit(X, y)
plot_predictions(svc, X, y)
plt.show()
The data for the nodes is their position in the network, and y simply holds the class labels (0..4):
[ 45.28952891 48.71941502]
[ 21.5114652 185.38108775]
[187.37250476 85.51585448]
[123.80470776 62.4834906 ]
[115.24942266 239.92792797]...
def plot_predictions(estimator, X, y):
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, shading="auto", alpha=0.6)
    plt.scatter(X[:, 0], X[:, 1], c=y.astype(float), cmap=cmap_bold)
    plt.axis('tight')
    plt.tight_layout()
When running this, the resulting plot does not look right.
Any help is appreciated, thank you.

How to draw decision boundary in SVM sklearn data in python?

I am reading email data from a training set and creating train_matrix, train_labels and test_labels. How do I now display the decision boundary using matplotlib in Python? I am using sklearn's SVM. There are online examples for pre-supplied datasets such as iris, but the plot fails on my custom data. Here is my code.
Error:
Traceback (most recent call last):
  File "classifier-plot.py", line 115, in <module>
    Z = Z.reshape(xx.shape)
ValueError: cannot reshape array of size 260 into shape (150,1750)
Code:
import os
import numpy as np
from collections import Counter
from sklearn import svm
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    list_to_remove = dictionary.keys()
    for item in list_to_remove:
        if item.isalpha() == False:
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
    return dictionary
def extract_features(mail_dir):
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    count = 0
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:
                    words = line.split()
                    for word in words:
                        wordID = 0
                        for i, d in enumerate(dictionary):
                            if d[0] == word:
                                wordID = i
                        features_matrix[docID, wordID] = words.count(word)
        train_labels[docID] = 0
        filepathTokens = fil.split('/')
        lastToken = filepathTokens[len(filepathTokens) - 1]
        if lastToken.startswith("spmsg"):
            train_labels[docID] = 1
            count = count + 1
        docID = docID + 1
    return features_matrix, train_labels
TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"
dictionary = make_Dictionary(TRAIN_DIR)
print "reading and processing emails from file."
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)
model = svm.SVC(kernel="rbf", C=10000)
print "Training model."
features_matrix = features_matrix[:len(features_matrix)/10]
labels = labels[:len(labels)/10]
#train model
model.fit(features_matrix, labels)
predicted_labels = model.predict(test_feature_matrix)
print "FINISHED classifying. accuracy score : "
print accuracy_score(test_labels, predicted_labels)
##----------------
h = .02 # step size in the mesh
# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
X = features_matrix
y = labels
svc = model.fit(X, y)
#svm.SVC(kernel='linear', C=C).fit(X, y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = y[:].min() - 1, y[:].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel']
Z = predicted_labels#svc.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[0])
plt.show()
In the tutorial you were following, Z is computed by applying the classifier to a set of feature vectors generated to form a regular NxM grid. This is what makes the plot smooth.
When you replaced
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
with
Z = predicted_labels
you replaced this regular grid with the predictions made on your dataset. The next line then failed because it could not reshape an array of size len(files) into an NxM matrix; there is no reason why len(files) should equal NxM.
There is a reason why you could not follow the tutorial directly. Your data dimension is 3000, so your decision boundary would be a 2999-dimensional hyperplane in a 3000-dimensional space. This is not easy to visualize.
In the tutorial the dimension is 4 and it is reduced to 2 for visualization.
The best way to reduce the dimension of your data depends on the data. In the tutorial we just pick the first two components of the 4-dimensional vector.
Another option that works well in many cases is to use Principal Component Analysis to reduce the dimension of data.
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(features_matrix, labels)
reduced_matrix = pca.fit_transform(features_matrix, labels)
model.fit(reduced_matrix, labels)
Such a model can be used for 2D visualization. You can just follow the tutorial directly and define
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
A complete but not a very impressive example
We do not have access to your email data, so for illustration we could just use random data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.decomposition import PCA

# initialize algorithms and data with random
model = svm.SVC(gamma=0.001, C=100.0)
pca = PCA(n_components=2)
rng = np.random.RandomState(0)
U = rng.rand(200, 2000)
v = (rng.rand(200) * 2).astype('int')
pca.fit(U, v)
U2 = pca.fit_transform(U, v)
model.fit(U2, v)

# generate grid for plotting
h = 0.2
x_min, x_max = U2[:, 0].min() - 1, U2[:, 0].max() + 1
y_min, y_max = U2[:, 1].min() - 1, U2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# create decision boundary plot
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(U2[:, 0], U2[:, 1], c=v)
plt.show()
This produces a decision boundary that does not look very impressive.
Indeed, the first two principal components capture only about 1% of the information contained in the data:
>>> print(pca.explained_variance_ratio_)
[ 0.00841935 0.00831764]
If you now introduce just a little bit of carefully disguised asymmetry, you will already see an effect.
Modify the data to introduce a shift at one randomly selected coordinate for each sample:
random_shifts = (rng.rand(2000) * 200).astype('int')
for i in range(200):  # 200 samples, as generated above
    if v[i] == 1:
        U[i, random_shifts[i]] += 5.0
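The re-fit itself is the same as before. A minimal sketch, reusing the objects defined above:
# project the shifted data again and retrain on the 2D projection
pca = PCA(n_components=2)
U2 = pca.fit_transform(U, v)
model.fit(U2, v)
print(pca.explained_variance_ratio_)  # now noticeably larger than before the shift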
Applying PCA, you now get a somewhat more informative picture.
Note that here the first two principal components already explain about 5% of the variance and the red part of the picture contains many more red points than blue ones.

Plot decision boundaries of classifier, ValueError: X has 2 features per sample; expecting 908430

Based on the scikit-learn example http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py,
I am trying to plot the decision boundaries of the classifier, but it raises the error "ValueError: X has 2 features per sample; expecting 908430" on the line Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).
clf = SGDClassifier().fit(step2, index)
X=step2
y=index
h = .02
colors = "bry"
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('off')
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
'index' is a label array of roughly 98579 x 1 entries for the comments, containing positive, neutral and negative labels:
array(['N', 'N', 'P', ..., 'NEU', 'P', 'N'], dtype=object)
'step2' is the 98579 x 908430 sparse matrix built from the comment data by CountVectorizer:
<98579x908430 sparse matrix of type '<type 'numpy.float64'>'
with 3168845 stored elements in Compressed Sparse Row format>
The thing is, you cannot plot a decision boundary for a classifier trained on data that is not 2-dimensional. Your data is clearly high-dimensional: it has 908430 dimensions (an NLP task, I assume). There is no way to plot the actual decision boundary for such a model. The example you are using is trained on 2D data (a reduced iris dataset), and that is the only reason they were able to plot it.
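If you just want something to look at, one workaround (not part of the original answer, analogous to the PCA suggestion earlier on this page) is to project the sparse matrix down to two components, e.g. with TruncatedSVD, which accepts sparse input, and train a separate model on that projection purely for visualization. A hedged sketch, reusing step2 and index from the question:
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import numpy as np

# project the 908430-dimensional sparse matrix to 2D for plotting only
svd = TruncatedSVD(n_components=2)
X2 = svd.fit_transform(step2)

clf2d = SGDClassifier().fit(X2, index)  # separate model, trained on the 2D projection
xx, yy = np.meshgrid(np.arange(X2[:, 0].min() - 1, X2[:, 0].max() + 1, .02),
                     np.arange(X2[:, 1].min() - 1, X2[:, 1].max() + 1, .02))
Z = clf2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Z now has the grid shape and can be plotted as in the question; note that this
# shows the boundary of clf2d, not of the original high-dimensional clf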

scikits learn SVM - 1-dimensional Separating Hyperplane

How to plot the separating "hyperplane" for 1-dimensional data using scikit svm ?
I followed this guide for 2-dimensional data: http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html, but I don't know how to make it work for 1-dimensional data:
pos = np.random.randn(20, 1) + 1
neg = np.random.randn(20, 1) - 1
X = np.r_[pos, neg]
Y = [0] * 20 + [1] * 20
clf = svm.SVC(kernel='linear', C=0.05)
clf.fit(X, Y)
# how to get "hyperplane" and margins values ??
thanks
The separating hyperplane for two-dimensional data is a line, whereas for one-dimensional data the hyperplane boils down to a point. The easiest way to plot the separating hyperplane for one-dimensional data is a bit of a hack: the data are made two-dimensional by adding a second feature which takes the value 0 for all the samples. By doing so, the second component of the weight vector is zero, i.e. w = [w0, 0] (see the appendix at the end of this post). As w1 = 0 and w1 is in the denominator of the expression that defines the slope and the y-intercept term of the separating line (see appendix), both coefficients are ∞. In this case it is convenient to solve the equation of the separating hyperplane for x, which results in x = x0 = -b/w0. The margin turns out to be 2/w0 (see appendix for details).
The following script implements this approach:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
np.random.seed(0)
pos = np.hstack((np.random.randn(20, 1) + 1, np.zeros((20, 1))))
neg = np.hstack((np.random.randn(20, 1) - 1, np.zeros((20, 1))))
X = np.r_[pos, neg]
Y = [0] * 20 + [1] * 20
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)
w = clf.coef_[0]
x_0 = -clf.intercept_[0]/w[0]
margin = 1 / w[0]  # half of the margin width 2/w0 (see the appendix)
plt.figure()
x_min, x_max = np.floor(X.min()), np.ceil(X.max())
y_min, y_max = -3, 3
yy = np.linspace(y_min, y_max)
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.predict(np.c_[XX.ravel(), np.zeros(XX.size)]).reshape(XX.shape)
plt.pcolormesh(XX, YY, Z, cmap=plt.cm.Paired)
plt.plot(x_0*np.ones(shape=yy.shape), yy, 'k-')
plt.plot(x_0*np.ones(shape=yy.shape) - margin, yy, 'k--')
plt.plot(x_0*np.ones(shape=yy.shape) + margin, yy, 'k--')
plt.scatter(pos[:, 0], np.zeros(pos.shape[0]), s=80, marker='o', facecolors='none')
plt.scatter(neg[:, 0], np.zeros(neg.shape[0]), s=80, marker='^', facecolors='none')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()
Although the code above is self-explanatory, here are some tips. X has 40 rows and 2 columns: the values in the first column are random numbers while all the elements of the second column are zeros. In the code, the weight vector w = [w0, 0] and the intercept b are clf.coef_[0] and clf.intercept_[0], respectively, where clf is the object returned by sklearn.svm.SVC.
This is the plot you get when the script is run.
For the sake of clarity, I'd suggest tweaking the code above by adding/subtracting a small constant to the second feature, for example:
plt.scatter(pos[:, 0], .3 + np.zeros(pos.shape[0]), ...)
plt.scatter(neg[:, 0], -.3 + np.zeros(neg.shape[0]), ...)
By doing so the visualization is significantly improved since the different classes are shown without overlap.
Appendix
The separating hyperplane is usually expressed as
w · x + b = 0
where x is an n-dimensional vector, w is the weight vector and b is the bias or intercept. For n = 2 we have w0.x + w1.y + b = 0. After some algebra we obtain y = -(w0/w1).x + (-b/w1). It clearly emerges from this expression that the discriminant hyperplane in a 2D feature space is a line of equation y = a.x + y0, where the slope is given by a = -w0/w1 and the y-intercept term is y0 = -b/w1. In SVM, the margin of a separating hyperplane is 2/‖w‖, which for 2D reduces to 2/√(w0² + w1²). In the one-dimensional setup above, w1 = 0, so the margin is simply 2/w0.
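A quick numerical check of that margin width, using only the clf fitted in the script above (a small sketch, nothing beyond the objects already defined):
w = clf.coef_[0]
print(2 / np.linalg.norm(w))  # total margin width 2/||w||; equals 2/w0 here because w[1] == 0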
The .coef_ member of clf will return the "hyperplane", which, in one dimension, is just a point. Check out this post for info on how to plot points on a number line.
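A minimal sketch of that number-line plot, assuming the clf, pos and neg from the question (fitted on the original one-feature arrays):
import numpy as np
import matplotlib.pyplot as plt

# the decision point on the number line: w0 * x + b = 0  =>  x = -b / w0
x0 = -clf.intercept_[0] / clf.coef_[0][0]

plt.scatter(pos, np.zeros_like(pos), marker='o', label='class 0')
plt.scatter(neg, np.zeros_like(neg), marker='^', label='class 1')
plt.axvline(x0, color='k', linestyle='--', label='separating point')
plt.yticks([])
plt.legend()
plt.show()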

How can I convert from scatter size to data coordinates in matplotlib?

I would like to programmatically test whether two scatterplot glyphs will overlap in matplotlib. So given a pair of (x, y) coordinates and a size (which, as I understand it, is the area of the circle in points squared), I would like to plot
plt.scatter(x, y, s=s)
and then have a function called points_overlap that takes these parameters and returns True if the points will overlap and False otherwise.
def points_overlap(x, y, s):
    if ...:
        return True
    else:
        return False
I know there are transformation matrices to take me between the different matplotlib coordinate systems, but I can't figure out the right steps for writing this function.
This needs some testing, but it might work. The positions and sizes below should all be in display space:
def overlap(x, y, sx, sy):
    # x and y are two points in display coordinates; sx and sy are their (square-rooted) sizes
    return np.linalg.norm(x - y) < np.linalg.norm(sx + sy)
test:
In [227]: X = np.array([[1, 1], [2, 1], [2.5, 1]])
In [228]: s = np.array([20, 10000, 10000])
In [229]: fig, ax = plt.subplots()
In [230]: ax.scatter(X[:, 0], X[:, 1], s=s)
Out[230]: <matplotlib.collections.PathCollection at 0x10c32f28>
In [231]: plt.draw()
Test every pair:
from itertools import product

Xt = ax.transData.transform(X)  # data coordinates -> display coordinates
st = np.sqrt(s)                 # square root of the marker areas
pairs = product(Xt, Xt)
sizes = product(st, st)
for i, ((x, y), (sx, sy)) in enumerate(zip(pairs, sizes)):
    h = i % 3
    j = i // 3
    if h != j and overlap(x, y, sx, sy):
        print((i, h, j))
There's lots of room for improvement. It's probably easier to transform all your data once and pass it into the points_overlap function, rather than doing the transform inside; that would be much better, actually.
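A minimal sketch of that refactor (a hypothetical helper that takes already-transformed positions; it ignores the points-vs-pixels DPI factor, so treat it as approximate):
import numpy as np
from itertools import combinations

def points_overlap(xy_display, s):
    """Return index pairs of markers that roughly overlap.

    xy_display: (N, 2) positions already in display coordinates,
                e.g. ax.transData.transform(X).
    s: (N,) scatter sizes (marker areas in points**2).
    """
    radii = np.sqrt(s / np.pi)  # marker radius in points
    hits = []
    for i, j in combinations(range(len(xy_display)), 2):
        if np.linalg.norm(xy_display[i] - xy_display[j]) < radii[i] + radii[j]:
            hits.append((i, j))
    return hits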
