SVM Model Implicitly Dropping Samples From Training Dataset - python

I have a 2D training dataset with 23 samples: 6 positive and 17 negative (see the printed ground_truths below). The data passed into the SVM has shape (23, 2), but support_vectors_ comes back with shape (16, 2).
The consequence is that the trained SVM is biased toward the positive class, a bias it wouldn't have if the negative samples weren't dropped.
Circles mark the SVM support vectors; the coloured points are the training vectors passed to the SVM.
Code:
from sklearn import svm

print(features.shape, ground_truths.shape)
print(ground_truths)
svm_model = svm.SVC(kernel=kernel, degree=degree, C=regularization, probability=True)
svm_model.fit(features, ground_truths)
print(svm_model.support_vectors_.shape)
Output:
(23, 2) (23,)
[0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0]
(16, 2)
[Visualization: scatter plot of the coloured training points, with the support vectors circled]
Why is the SVM model dropping these samples?
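For reference, a minimal sketch (reusing the fitted svm_model above; numpy assumed) to inspect which of the 23 training samples ended up as support vectors; support_ holds their indices into the original training array:

import numpy as np

print(svm_model.support_)      # indices of the 16 samples that are support vectors
print(svm_model.n_support_)    # number of support vectors per class
not_sv = np.setdiff1d(np.arange(len(features)), svm_model.support_)
print(not_sv)                  # samples that are not support vectors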

Related

Warning Message in binary classification model Gaussian Naive Bayes?

I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10.
This is the data file:
https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing
My goal is to apply the scikit-learn Gaussian NB model to the data, but in a binary classification task where only class 2 is the positive label and the remainder of the classes are all negatives. For that, I did the following code:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, matthews_corrcoef
import numpy as np
import pandas as pd
dataset = pd.read_csv("PD_21_22_HA1_dataset.txt", index_col=False, sep="\t")
x_d = dataset.values[:, :-1]
y_d = dataset.values[:, -1]
### train_test_split to split the dataframe into train and test sets
## with a partition of 20% for the test https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.20, random_state=23)
yc_TRAIN=np.array([int(i==2) for i in y_TRAIN])
mdl = GaussianNB()
mdl.fit(X_TRAIN, yc_TRAIN)
preds = mdl.predict(X_IVS)
# binarization of "y_true" array
yc_IVS=np.array([int(i==2) for i in y_IVS])
print("The Precision is: %7.4f" % precision_score(yc_IVS, preds))
print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(yc_IVS, preds))
But I get the following warning message when calculating precision:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
The Matthews correlation coefficient function also outputs 0 and emits a "RuntimeWarning: invalid value encountered in double_scalars" message.
Furthermore, by inspecting preds, I got that the model predicts only negatives/zeros.
I've tried increasing the 20% test partition as some forums suggested but it didn't do anything.
Is this simply a case of the model not being able to fit the data, or am I doing something wrong, such as feeding the model the wrong data format/type?
Edit: yc_TRAIN is the result of turning all class-2 cases into positives ("1") and the remaining classes into negatives ("0"), so it's a 1-D array of length 9450 (which matches my total number of training cases) with 8697 0s and 753 1s, so it looks something like this:
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Your code looks fine; this is a classic problem with imbalanced datasets, and it actually means you do not have enough training data to correctly classify the rare positive class.
The only thing you could improve in the given code is to set stratify=y_d in train_test_split, in order to get a stratified training set; decreasing the size of the test set (i.e. leaving more samples for training) may also help:
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.10, random_state=23, stratify=y_d)
If this does not work, you should start thinking of applying class imbalance techniques (or different models); but this is not a programming question any more but a theory/methodology one, and it should be addressed at the appropriate SE sites and not here (see the intro and NOTE in the machine-learning tag info).
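As a quick sanity check, here is a minimal sketch (reusing x_d and y_d from the question) comparing how many class-2 samples actually land in the training split with and without stratification:

import numpy as np
from sklearn.model_selection import train_test_split

for strat in (None, y_d):
    X_tr, X_te, y_tr, y_te = train_test_split(x_d, y_d, test_size=0.10, random_state=23, stratify=strat)
    # count how many of the rare positive (class 2) samples are available for training
    print("stratified" if strat is not None else "not stratified", int(np.sum(y_tr == 2)), "positives out of", len(y_tr))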

How to handle the error "ValueError: Classification metrics can't handle a mix of multilabel-indicator and multiclass targets"

I get this error when I try to compute the prediction accuracy. I have tried every approach I could find, including related Stack Overflow issues, but I still cannot fix the bug.
The code snippet with the bug is:
import numpy as np
# `score` below is assumed to be precision_recall_fscore_support, given the 4-value unpacking further down
from sklearn.metrics import accuracy_score, precision_recall_fscore_support as score

author_pred1 = model1.predict([ThreeGramTest, ThreeGramTest, ThreeGramTest, ThreeGramTest])
print("class prediction without argmax:",author_pred1)
author_pred1=np.argmax(author_pred1, axis=1)
# Evaluate
print("test data one hot lable", TestAuthorHot)
print("class prediction with argmax:",author_pred1)
# author_pred1 = author_pred1.astype("int64")
print("type of prediction output",type(author_pred1))
print("type of test data", type(TestAuthorHot))
print(np.array(np.unique(author_pred1, return_counts=True)).T)
print(np.array(np.unique(TestAuthorHot, return_counts=True)).T)
# accuracy = accuracy_score(TestAuthorHot, author_pred1.round(), normalize=False)# the bug is here
precision, recall, f1, support = score(TestAuthorHot, author_pred1)
ave_precision = np.average(precision, weights=support / np.sum(support))
ave_recall = np.average(recall, weights=support / np.sum(support))
To show the shapes, the data values are:
class prediction without argmax: [[3.9413989e-02 8.4685171e-03 2.7781539e-03 ... 5.0324947e-03
6.2263450e-07 3.1461464e-10]
[1.1533947e-02 4.0361892e-02 1.4060171e-02 ... 4.7175577e-05
1.4333490e-01 2.0528505e-07]
[4.5363868e-06 3.1557463e-03 1.4047540e-02 ... 1.3272668e-03
4.6724287e-07 5.9454552e-10]
...
[1.9417159e-04 1.7364822e-02 2.9031632e-03 ... 5.0036388e-04
1.3315305e-04 9.0704253e-07]
[1.8054984e-09 2.9453583e-08 2.3744430e-08 ... 2.7137769e-03
7.7114571e-08 4.9026494e-10]
[7.8946296e-06 5.9516740e-05 8.2868773e-10 ... 3.1905161e-04
2.5262805e-06 2.0384558e-09]]
test data one hot lable [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 1 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 1 ... 0 0 0]]
class prediction with argmax: [ 7 37 37 ... 39 4 4]
How can I fix this error?
The error happens because you are passing a 2D matrix to accuracy_score (TestAuthorHot is a 2D one-hot matrix of labels). accuracy_score accepts only 1D label vectors, so you need to transform TestAuthorHot into 1D so it matches author_pred1 (which is 1D).
To do this you can simply do:
accuracy_score(np.argmax(TestAuthorHot, axis=1), author_pred1)
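The same fix applies to the precision_recall_fscore_support call from the question; a small sketch, assuming TestAuthorHot is the one-hot test-label matrix shown above:

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# collapse the one-hot labels to class indices once, then reuse them everywhere
y_true = np.argmax(TestAuthorHot, axis=1)
print("accuracy:", accuracy_score(y_true, author_pred1))
precision, recall, f1, support = precision_recall_fscore_support(y_true, author_pred1)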

SGDClassifier on Big Data (sparse)

Hello everyone, I am relatively new to data science. I am trying to train an SGDClassifier with over 4,000,000 samples of data, without any positive results.
Each X vector has 6 features and looks like: [2, 4, 56431555, 1, 0, 33]
Each Y value is a single feature, the category, which can be 1 or 0, e.g. [1]
These are some examples of my data records:
X :
[[ 2 4 56431555 1 0 33]
[ 2 1 71716268 1 0 623]
[ 0 1 302 0 1 33]
...
[ 0 4 3707 0 1 33]
[ 0 1 733126 1 0 33]
[ 0 4 30960953 1 0 33]]
Y:
[0 0 1 ... 1 1 0]
When I use .predict() on the test data, the only result I get is that every test vector belongs to class 0, so I get an array full of zeros.
These are the parameters with which I initialize the classifier:
from sklearn import linear_model
model = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)
model.fit(data_train, target_train)
Any suggestions on how to approach this problem? (I have already tried standard scaling on my data.)
Note: the average training loss is huge, and when I scale my data it is 0.97; I don't know whether this tells me anything about my dataset or model.
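Not a definitive answer, but a minimal sketch of one common approach for an imbalanced problem like this (reusing data_train and target_train from above): scale the features, since the third column spans several orders of magnitude, and let the classifier reweight the classes:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# class_weight="balanced" upweights the minority class so it is not ignored
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, class_weight="balanced"),
)
model.fit(data_train, target_train)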

How to model data for tensorflow?

I have data of the form :
A B C D E F G
1 0 0 1 0 0 1
1 0 0 1 0 0 1
1 0 0 1 0 1 0
1 0 1 0 1 0 0
...
1 0 1 0 1 0 0
0 1 1 0 0 0 1
0 1 1 0 0 0 1
0 1 0 1 1 0 0
0 1 0 1 1 0 0
A,B,C,D are my inputs and E,F,G are my outputs. I wrote the following code in Python using TensorFlow:
from __future__ import print_function
#from random import randint
import numpy as np
import tflearn
import pandas as pd
data,labels =tflearn.data_utils.load_csv('dummy_data.csv',target_column=-1,categorical_labels=False, n_classes=None)
print(data)
# Build neural network
net = tflearn.input_data(shape=[None, 4])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 3, activation='softmax')
net = tflearn.regression(net)
# Define model
model = tflearn.DNN(net)
#Start training (apply gradient descent algorithm)
data_to_array = np.asarray(data)
print(data_to_array.shape)
#data_to_array= data_to_array.reshape(6,9)
print(data_to_array.shape)
model.fit(data_to_array, labels, n_epoch=10, batch_size=3, show_metric=True)
I am getting an error which says:
ValueError: Cannot feed value of shape (3, 6) for Tensor 'InputData/X:0', which has shape '(?, 4)'
I am guessing this is because my input data has 7 columns (0...6), but I want the input layer to take only the first four columns as input and predict the last 3 columns in the data as output. How can I model this?
If the data's in a numpy format, then the first 4 columns are taken with a simple slice:
data[:,0:4]
The : means "all rows", and 0:4 is the range of column indices 0, 1, 2, 3, i.e. the first 4 columns.
If the data isn't in a numpy format, just convert it to a numpy format so you can slice easily.
Here's a related article on numpy slices: Numpy - slicing 2d row or column vector from array
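Putting this together with the question's setup, here is a small sketch; it assumes the full 7-column table (A..G) is loaded into a single numpy array rather than split by load_csv:

import pandas as pd

raw = pd.read_csv('dummy_data.csv').values  # all 7 columns A..G
X = raw[:, 0:4]   # columns A, B, C, D -> inputs, shape (n, 4)
Y = raw[:, 4:7]   # columns E, F, G   -> targets, shape (n, 3)

# X and Y now match the tflearn network above: input shape [None, 4], 3 softmax outputs
model.fit(X, Y, n_epoch=10, batch_size=3, show_metric=True)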

Scikit F-score metric error

I am trying to predict a set of labels using Logistic Regression from SciKit. My data is really imbalanced (there are many more '0' than '1' labels) so I have to use the F1 score metric during the cross-validation step to "balance" the result.
[Input]
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2',
    fit_intercept=True,
    scoring='f1'
)
logistic.fit(X_training, y_training)
print('Predicted: %s' % str(logistic.predict(X_test)))
print('F1-score: %f' % f1_score(y_test, logistic.predict(X_test)))
print('Accuracy score: %f' % logistic.score(X_test, y_test))
[Output]
>> Predicted: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
>> Actual: [0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1]
>> F1-score: 0.285714
>> Accuracy score: 0.782609
>> C:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:958:
UndefinedMetricWarning:
F-score is ill-defined and being set to 0.0 due to no predicted samples.
I certainly know that the problem is related to my dataset: it is too small (it is only a sample of the real one). However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the scenes?
It seems this is a known bug that has since been fixed, so you should try updating sklearn.
However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?
This is well-described at https://stackoverflow.com/a/34758800/1587329:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/classification.py
F1 = 2 * (precision * recall) / (precision + recall)
precision = TP / (TP + FP); as noted, if the predictor doesn't predict the positive class at all, precision is 0.
recall = TP / (TP + FN); if the predictor doesn't predict the positive class, TP is 0, so recall is 0.
So you end up dividing 0 by 0.
To fix the weighting problem (it's easy for the classifier to (almost) always predict the more prevalent class), you can use class_weight="balanced":
logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2',
    fit_intercept=True,
    scoring='f1',
    class_weight="balanced"
)
The LogisticRegressionCV documentation says:
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
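As a worked example of that formula, a quick sketch using the 23 test labels shown in the output above (17 zeros and 6 ones):

import numpy as np

y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# n_samples / (n_classes * np.bincount(y))
weights = len(y) / (2 * np.bincount(y))
print(weights)  # approx [0.676 1.917]: errors on class 1 count roughly 2.8x as much as errors on class 0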
