I have set up a very simple SVC to classify the MNIST digits. For some reason the classifier keeps incorrectly predicting the digit 5, but whenever it predicts any other number it doesn't miss a single one. Does anyone have any idea whether I'm setting this up wrong, or if it's just really bad at predicting the number 5?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
data = datasets.load_digits()
images = data.images
targets = data.target
# Split into train and test sets
images_train, images_test, imlabels_train, imlabels_test = train_test_split(images, targets, test_size=.2, shuffle=False)
# Re-shape data so that it's 2D
images_train = np.reshape(images_train, (np.shape(images_train)[0], 64))
images_test = np.reshape(images_test, (np.shape(images_test)[0], 64))
svm_classifier = SVC(gamma='auto').fit(images_train, imlabels_train)
number_correct_svc = 0
preds = []
for label_index in range(len(imlabels_test)):
    pred = svm_classifier.predict(images_test[label_index].reshape(1, -1))
    if pred[0] == imlabels_test[label_index]:
        number_correct_svc += 1
    preds.append(pred[0])
print("Support Vector Classifier...")
print(f"\tPercent correct for all test data: {100*number_correct_svc/len(imlabels_test)}%")
confusion_matrix(preds,imlabels_test)
Here is the resulting confusion matrix:
array([[22, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 15, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 15, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 21, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 21, 0, 0, 0, 0, 0],
[13, 21, 20, 16, 16, 37, 23, 20, 31, 16],
[ 0, 0, 0, 0, 0, 0, 14, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 16, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 21]], dtype=int64)
I've been reading the sklearn page for SVC but can't tell what I'm doing wrong.
Update:
I tried using SVC(gamma='scale') and it seems much more reasonable. It would still be nice to know why 'auto' doesn't work.
with scale:
array([[34, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 36, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 35, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 27, 0, 0, 0, 0, 0, 1],
[ 1, 0, 0, 0, 34, 0, 0, 0, 0, 0],
[ 0, 0, 0, 2, 0, 37, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 37, 0, 0, 0],
[ 0, 0, 0, 2, 0, 0, 0, 35, 0, 1],
[ 0, 0, 0, 6, 1, 0, 0, 1, 31, 1],
[ 0, 0, 0, 0, 2, 0, 0, 0, 1, 33]], dtype=int64)
The second question is easier to deal with. In the RBF kernel, gamma controls how "wiggly" the decision boundary is: the higher the value of gamma, the more tightly the decision boundary follows the training points.
If gamma='scale' (the default) is passed, it uses 1 / (n_features * X.var()) as the value of gamma;
if 'auto', it uses 1 / n_features.
For the digits data the pixel values run from 0 to 16, so X.var() is well above 1 and the 'scale' gamma ends up much smaller than the 'auto' gamma. With gamma='auto' the RBF kernel is too narrow for this unscaled data: the model essentially memorizes the training set, and at test time it falls back to one class for a large share of the samples, which is why so many predictions end up as 5. 'scale' compensates for the data's variance and gives a sensible decision boundary.
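To see the difference concretely, here is a quick sketch (the exact 'scale' value depends on the data's variance, but for the digits set it comes out far smaller than 1 / n_features):
import numpy as np
from sklearn import datasets
X = datasets.load_digits().data              # pixel values range from 0 to 16
print(1 / X.shape[1])                        # gamma='auto'  -> 1/64 = 0.015625
print(1 / (X.shape[1] * X.var()))            # gamma='scale' -> much smaller, since X.var() >> 1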
Related
I am experimenting with scikit-learn classification using some NLTK-type tutorials. Can someone help me understand why the sklearn MLP neural network can handle these input shapes but the other classifiers cannot?
My input training data is a numpy.ndarray with shape (62, 2)
This is the only thing I know how to do for the train/test split (any tips appreciated if there is something better):
train_x = list(training[:,0])
train_y = list(training[:,1])
The data looks like this if I print(train_y):
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
The MLP classifier seems to work just fine.
model = MLPClassifier(learning_rate_init=0.0001,max_iter=9000,shuffle=True).fit(train_x, train_y)
But if I try to use the other sklearn classifiers, like:
model = GaussianNB().fit(train_x, train_y)
I get the error:
ValueError: y should be a 1d array, got an array of shape (62, 15) instead.
I think I need to incorporate .reshape(-1, 1) somewhere in my code but I'm not sure where. Any tips appreciated; not a lot of wisdom here.
As far as I can see, the labels y are in one-hot form: each label is a vector whose length equals the number of classes, and every element is zero except the one at the index of the true class, which is one. This is why the shape of y is (62, 15).
You need to transform the labels y into a form in which each label is represented as an integer class index.
Example:
In this example we have 6 classes: ranging from 0 to 5
[0, 0, 0, 1, 0, 0] -> 3
[1, 0, 0, 0, 0, 0] -> 0
[0, 1, 0, 0, 0, 0] -> 1
You can do this with numpy.argmax(y, axis=1), which returns the index of the element with the maximum value along the specified axis. Take a look at the documentation.
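A minimal sketch of that conversion, assuming train_x and train_y are the arrays from the question:
import numpy as np
from sklearn.naive_bayes import GaussianNB
train_y_int = np.argmax(np.asarray(train_y), axis=1)   # shape (62,), integer class labels
model = GaussianNB().fit(train_x, train_y_int)         # now accepts the 1-D label array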
I am trying to 'bin' an array into bins (similar to histogram). I have an input array input_array and a range bins = np.linspace(-200, 200, 200). The overall function looks something like this:
import numpy as np

def bin(arr):
    bins = np.linspace(-100, 100, 200)
    return np.histogram(arr, bins=bins)[0]
So,
bin([64, 19, 120, 55, 56, 108, 16, 84, 120, 44, 104, 79, 116, 31, 44, 12, 35, 68])
would return:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
However, I want my bins to be more 'detailed' as I get closer to 0, something like an ideal normal distribution: more bins (i.e., narrower ranges) near 0, with the bins getting wider as I move out towards the edges of the range. Is that possible?
More specifically, rather than having equally wide bins across the range, can I have an array of bin edges where the bins towards the centre are narrower than those towards the extremes?
I have already looked at answers like this and numpy.random.normal, but something is just not clicking right.
Use the inverse error function to generate the bin edges; you'll need to scale them to get the exact range you want.
This transform works because the inverse error function is flatter around zero than near ±1, so evenly spaced inputs produce edges that are packed more tightly near the centre.
import numpy as np
from scipy.special import erfinv
erfinv(np.linspace(-1, 1, 20))   # 20 evenly spaced inputs -> edges dense near 0
# returns:
array([ -inf, -1.14541135, -0.8853822 , -0.70933273, -0.56893556,
-0.44805114, -0.3390617 , -0.23761485, -0.14085661, -0.0466774 ,
0.0466774 , 0.14085661, 0.23761485, 0.3390617 , 0.44805114,
0.56893556, 0.70933273, 0.8853822 , 1.14541135, inf])
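As a rough sketch of how you might turn this into usable bin edges (the target range of -100 to 100 and the 0.99 clipping of the endpoints are assumptions, chosen just to keep the edges finite):
import numpy as np
from scipy.special import erfinv
edges = erfinv(np.linspace(-0.99, 0.99, 21))   # 21 finite edges, densest around 0
edges = edges / edges[-1] * 100                # rescale so the edges span -100 to 100
counts, _ = np.histogram([64, 19, 120, 55, 56, 108, 16, 84], bins=edges)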
I am feeding y_test and y_pred into a confusion matrix. My data is for multi-label classification, so the row values are one-hot encodings.
My data has 30 labels, but after feeding it into the confusion matrix the output only has 11 rows and columns, which confuses me. I thought I should get a 30x30 matrix.
Both are numpy arrays (y_test and y_pred are DataFrames, which I convert to numpy arrays using dataframe.values).
y_test.shape
(8680, 30)
y_test
array([[1, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
y_pred.shape
(8680, 30)
y_pred
array([[1, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
I transform them to confusion matrix usable format:
y_test2 = y_test.argmax(axis=1)
y_pred2 = y_pred.argmax(axis=1)
conf_mat = confusion_matrix(y_test2, y_pred2)
Here is what my confusion matrix looks like:
conf_mat.shape
(11, 11)
conf_mat
array([[4246, 77, 13, 72, 81, 4, 6, 3, 0, 0, 4],
[ 106, 2010, 20, 23, 21, 0, 5, 2, 0, 0, 0],
[ 143, 41, 95, 32, 10, 3, 14, 1, 1, 1, 2],
[ 101, 1, 0, 351, 36, 0, 0, 0, 0, 0, 0],
[ 346, 23, 7, 10, 746, 5, 6, 4, 3, 3, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Why does my confusion matrix have shape 11x11? Shouldn't it be 30x30?
I think you are not quite clear on the definition of confusion_matrix. For example:
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])
Which, shown as a DataFrame, is:
pd.DataFrame(confusion_matrix(y_true, y_pred),columns=[0,1,2],index=[0,1,2])
Out[245]:
0 1 2
0 2 0 0
1 0 0 1
2 1 0 2
The columns and index are the categories present in the input.
You got shape (11, 11), which means only 11 categories appear in your data.
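If you want the full 30x30 matrix even though some classes never occur, you can pass the labels argument to confusion_matrix explicitly; a minimal sketch, reusing y_test2 and y_pred2 from above:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test2, y_pred2, labels=list(range(30)))
conf_mat.shape   # (30, 30), with all-zero rows/columns for the unused classes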
All this means is that some labels are unused.
y_test.any(axis=0)
y_pred.any(axis=0)
Should show that only 11 of the columns have any 1s in them.
Here's what it would look like if that was not the case:
from sklearn.metrics import confusion_matrix
y_test = np.zeros((8680, 30))
y_pred = np.zeros((8680, 30))
y_test[np.arange(8680), np.random.randint(0, 30, 8680)] = 1
y_pred[np.arange(8680), np.random.randint(0, 30, 8680)] = 1
y_test2 = y_test.argmax(axis=1)
y_pred2 = y_pred.argmax(axis=1)
confusion_matrix(y_test2, y_pred2).shape # (30, 30)
I wrote a function that puts tensor products of Pauli matrices on the diagonal, using the block_diag function.
When I run the function, I obtain:
array([[ 1, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 0, 0, 0, 0, 0, -1]])
As you can see, the matrix
array([[0, -1j], [1j, 0]])
is missing; it has been replaced by a 2x2 zero matrix.
The warning it gives me is:
/usr/lib/python2.7/dist-packages/scipy/linalg/special_matrices.py:541: ComplexWarning: Casting complex values to real discards the imaginary part
out[r:r + rr, c:c + cc] = arrs[i]
Any ideas on how I can overcome this?
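A hedged sketch of one workaround, assuming the blocks are plain numpy arrays like the ones in the output above: if every block (in particular the first) is created with a complex dtype, block_diag's output stays complex and the imaginary parts are preserved, regardless of how the output dtype is chosen internally.
import numpy as np
from scipy.linalg import block_diag
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])                           # already complex
sz = np.array([[1, 0], [0, -1]], dtype=complex)
result = block_diag(np.eye(2, dtype=complex), sx, sy, sz)    # complex output, no ComplexWarning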
I'm working on novelty detection using machine learning. I have tried using a one-class SVM in scikit-learn.
from sklearn import svm
train_data = [[0, 0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1], [0, 3, 0, 0, 0, 1, 0, 0], [0, 11, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 4]]
test_data = [[0, 0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1, 0, 0]]
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(train_data)
pred_test = clf.predict(test_data)
I'm new to this area and I want to know: how can I tell whether there is novelty in my test data?
The inliers are labeled 1, and the outliers (i.e., the novelties in your case) are labeled -1 (as the result of the predict function).
Please note that the current documentation incorrectly states that the outliers are labeled 1 and the inliers 0; check the latest updates in the GitHub repo for the correct information.
check = clf.predict(test_data)
If an element of check is 1, that sample is not an anomaly;
if it is -1, the sample is an anomaly, i.e., the data point is an outlier (a novelty).
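As a rough sketch (continuing from the clf fitted above), you could then pull out the flagged samples like this:
import numpy as np
pred_test = clf.predict(test_data)                    # array of +1 (inlier) / -1 (novelty)
novelties = np.asarray(test_data)[pred_test == -1]    # rows of test_data flagged as novel
print(f"{len(novelties)} of {len(test_data)} test samples look novel")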