Warning Message in binary classification model Gaussian Naive Bayes?

Warning Message in binary classification model Gaussian Naive Bayes? - python

I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10.
This is the data file:
https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing
My goal is to apply the scikit-learn Gaussian NB model to the data, but in a binary classification task where only class 2 is the positive label and the remainder of the classes are all negatives. For that, I did the following code:
from sklearn.naive_bayes import GaussianNB, CategoricalNB
import pandas as pd
dataset = pd.read_csv("PD_21_22_HA1_dataset.txt", index_col=False, sep="\t")
x_d = dataset.values[:, :-1]
y_d = dataset.values[:, -1]
### train_test_split to split the dataframe into train and test sets
## with a partition of 20% for the test https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.20, random_state=23)
yc_TRAIN=np.array([int(i==2) for i in y_TRAIN])
mdl = GaussianNB()
mdl.fit(X_TRAIN, yc_TRAIN)
preds = mdl.predict(X_IVS)
# binarization of "y_true" array
yc_IVS=np.array([int(i==2) for i in y_IVS])
print("The Precision is: %7.4f" % precision_score(yc_IVS, preds))
print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(yc_IVS, preds))
But I get the following warning message when calculating precision:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
The matthew's correlation coeficient func also outputs 0 and gives a runtimewarning: invalid value encountered in double_scalars message.
Furthermore, by inspecting preds, I got that the model predicts only negatives/zeros.
I've tried increasing the 20% test partition as some forums suggested but it didn't do anything.
Is this simply a problem of the model not being able to fit against the data or am I doing something wrong that may be inputting the wrong data format/type into the model?
Edit: yc_TRAIN is the result of turning all cases from class 2 into my true positive cases "1" and the remaining classes into negatives/0, so it's a 1-d array of length 9450 (which matches my total number of prediction cases) with over 8697 0s and 753 1s, so its aspect would be something like this:
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]

Your code looks fine; this is a classic problem with imbalanced datasets, and it actually means you do not have enough training data to correctly classify the rare positive class.
The only thing you could improve in the given code is to set stratify=y_d in train_test_split, in order to get a stratified training set; decreasing the size of the test set (i.e. leaving more samples for training) may also help:
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.10, random_state=23, stratify=y_d)
If this does not work, you should start thinking of applying class imbalance techniques (or different models); but this is not a programming question any more but a theory/methodology one, and it should be addressed at the appropriate SE sites and not here (see the intro and NOTE in the machine-learning tag info).

Related

SVM Model Implicitly Dropping Samples From Training Dataset

I have a 2D Training dataset with 23 samples. There are 6 true positives in the dataset and 10 true negatives. The data passed into the SVM is of shape (23,2) but the support_vectors_ actually used for training is (16,2).
The consequence is the SVM training model biasing true positives which wouldnt have that bias if the true negatives werent dropped.
Circles are the SVM support vectors. Coloured are the input training vectors to the SVM.
Code:
print(features.shape, ground_truths.shape)
print(ground_truths)
svm_model = svm.SVC(kernel=kernel, degree=degree, C = regularization, probability=True)
svm_model.fit(features, ground_truths)
print(svm_model.support_vectors_.shape)
Output:
(23, 2) (23,)
[0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0]
(16, 2)
Visualization
Why is the SVM model dropping these samples?

How can I build an LSTM AutoEncoder with PyTorch?

I have my data as a DataFrame:
dOpen dHigh dLow dClose dVolume day_of_week_0 day_of_week_1 ... month_6 month_7 month_8 month_9 month_10 month_11 month_12
639 -0.002498 -0.000278 -0.005576 -0.002228 -0.002229 0 0 ... 0 0 1 0 0 0 0
640 -0.004174 -0.005275 -0.005607 -0.005583 -0.005584 0 0 ... 0 0 1 0 0 0 0
641 -0.002235 0.003070 0.004511 0.008984 0.008984 1 0 ... 0 0 1 0 0 0 0
642 0.006161 -0.000278 -0.000281 -0.001948 -0.001948 0 1 ... 0 0 1 0 0 0 0
643 -0.002505 0.001113 0.005053 0.002788 0.002788 0 0 ... 0 0 1 0 0 0 0
644 0.004185 0.000556 -0.000559 -0.001668 -0.001668 0 0 ... 0 0 1 0 0 0 0
645 0.002779 0.003056 0.003913 0.001114 0.001114 0 0 ... 0 0 1 0 0 0 0
646 0.000277 0.004155 -0.002227 -0.002782 -0.002782 1 0 ... 0 0 1 0 0 0 0
647 -0.005540 -0.007448 -0.003348 0.001953 0.001953 0 1 ... 0 0 1 0 0 0 0
648 0.001393 -0.000278 0.001960 -0.003619 -0.003619 0 0 ... 0 0 1 0 0 0 0
My input will be 10 rows (already one-hot encoded). I want to create an n-dimensional auto encoded representation. So as I understand it, my input and output should be the same.
I've seen some examples to construct this, but am still stuck on the first step. Is my training data just a lot of those samples as to make a matrix? What then?
I apologize for the general nature of the question. Any questions, just ask and I will clarify in the comments.
Thank you.

It isn't quite clear from the question what you are trying to achieve. Based on what you wrote you want to create an autoencoder with the same input and output and that doesn't quite make sense to me when I see your data set. In the common case, the encoder part of the autoencoder creates a model which, based on a large set of input features produces a small output vector and decoder is performing an inverse operation of reconstruction of the plausible input features based on the full set of output and input features. A result of using an autoencoder is enhanced (in some meaning, like with noise removed, etc) input.
You can find a few examples here with the 3rd use case providing code for the sequence data, learning random number generation model. Here is another example, which looks closer to your application. A sequential model is constructed to encode a large data set with information loss. If that is what you are trying to achieve, you'll find the code there.
If the goal is a sequence prediction (like future stock prices), this and that example seem to be more appropriate as you likely only want to predict a handful of values in your data sequence (say dHigh and dLow) and you don't need to predict day_of_week_n or the month_n (even though that part of autoencoder model probably will train much more reliable as the pattern is pretty clear). This approach will allow you to predict a single consequent output feature value (tomorrow's dHigh and dLow)
If you want to predict a sequence of future outputs you can use a sequence of outputs, rather than a single one in your model.
In general, the structure of inputs and outputs is totally up to you

ML on "Adult data set"(dataset) from archive.ics... whith KNeighborsClassifier wont run

I'm trying to use Machine learning to guess if a person has an income of over or under 50k using this data set. I think the code does not work because the data set contains strings. When I use a shorter data set containing 4 instead of 14 variables(and with numbers) the code works. What am I doing wrong?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset = pandas.read_csv(url, names=names)
# Split dataset
array = dataset.values
X = array[:,0:14]
Y = array[:,14]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

Let's take a really simple example from your dataset.
Looking at dataset['income'].nunique() (produces 2), we can see you have two classes you're trying to predict. You're on the right track with taking the classification route (although there are different methodological arguments to be made as to whether this problem is better suited for a continuous regression approach, but save that for another day).
Say you want to use age and education to predict whether someone's income is above $50k. Let's try it out:
X = dataset[['age', 'education']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
This Exception should be raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 891, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
estimator=estimator)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 567, in check_array
array = array.astype(np.float64)
ValueError: could not convert string to float: ' Bachelors'
What if we tried with just age?
X = dataset[['age']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey! That works! So there's something unique about the education column that we need to account for. You've noticed this above - scikit-learn (and many other ML packages - though not all) don't operate off of strings. So we need to do something like "one-hot" encoding - creating k columns, where k represents the number of unique values in your categorical, "string" column (again, there's a methodological question as to whether you include k-1 or k features, but read up on the dummy-variable trap for more info to that end), where each column is composed of 1s and 0s - a 1 if the case/observation in a particular row has that kth attribute, a 0 if not.
There are many ways of doing this in Python:
pandas.get_dummies:
dummies = pandas.get_dummies(dataset['education'], prefix='education')
Here's a sample of dummies:
>>> dummies
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th ... education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 1 0 0 0 0
3 0 1 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
5 0 0 0 0 0 ... 0 1 0 0 0
6 0 0 0 0 0 ... 0 0 0 0 0
7 0 0 0 0 0 ... 1 0 0 0 0
8 0 0 0 0 0 ... 0 1 0 0 0
9 0 0 0 0 0 ... 0 0 0 0 0
Now we can use this education feature like so:
dataset = dataset.join(dummies)
X = dataset[['age'] + list(dummies)]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey, that worked!
Hopefully that helps to answer your question. There are tons of ways to perform one-hot encoding (e.g. through a list comprehension or sklearn.preprocessing.OneHotEncoder). I'd suggest you read more on "feature engineering" before progressing with your model-building - feature engineering is one of the most important parts of the ML process.

For columns that contain categorical strings, you should transform them to one hot encoding using the function:
dataset = pd.get_dummies(dataset, column=[my_column1, my_column2, ...])
Where my_column1, my_colum2, ...are the column names containing the categorical strings. Be careful, it changes the number of columns you have in your dataframe. Thus, change your split of X accordingly.
Here is the link to the documentation.

Get Top 3 predicted classes from GaussianNB classifier python

I am trying to predict a class using GaussianNB, but I need to get top 3 predicted classes to create a custom score for the prediction.
My training data is x,y,class where given x and y it needs to predict the class
tests variable cointains (x,y) values and testclass contains class values.
Test is a list data set in following format
Index Type Size Value
0 tuple 2 (0.6424, 0.8325)
1 tuple 2 (0.8493, 0.7848)
2 tuple 2 (0.791, 0.4191)
Test class data
Index Type Size Value
0 str 1 1.274e+09
1 str 1 9.5047e+09
Code:
import csv
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
clf_pf = GaussianNB()
clf_pf.fit(train, trainclass)
print clf_pf.score(test,testclass)
ff = clf_pf.predict_proba(test)
How to get the top 3 predicted classes from above variable ff?
My ff data is like below
0 1 2 3 4 5 6 7 8
0 1.80791e-05 0 0.00126251 0 6.38504e-256 0 0 0 0
1 2.89477e-199 1.01093e-06 0 1.1056e-55 0 5.52213e-67 0 0
2 2.47755e-05 0 2.43499e-08 0 1.00392e-239 0 0 0 0
3 2.54941e-161 3.79815e-06 0 1.53516e-40 0 1.63465e-41 0 0

As said in the comment, ff has [n_samples, n_classes]. Using numpy.argsort you will obtain, for each row, the predicted classes ordered by their probability in ascending order, obtaining again a matrix of shape [n_samples, n_classes]. You then take the last three elements of all rows ([:, -3:]) and reverse their order ([:, ::-1]) to obtain the class with best probability first:
np.argsort(ff)[:, -3:][:, ::-1]
Note the [:, in the slicing just means "get all the rows".

Scikit F-score metric error

I am trying to predict a set of labels using Logistic Regression from SciKit. My data is really imbalanced (there are many more '0' than '1' labels) so I have to use the F1 score metric during the cross-validation step to "balance" the result.
[Input]
X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
logistic = LogisticRegressionCV(
Cs=50,
cv=4,
penalty='l2',
fit_intercept=True,
scoring='f1'
)
logistic.fit(X_training, y_training)
print('Predicted: %s' % str(logistic.predict(X_test)))
print('F1-score: %f'% f1_score(y_test, logistic.predict(X_test)))
print('Accuracy score: %f'% logistic.score(X_test, y_test))
[Output]
>> Predicted: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
>> Actual: [0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1]
>> F1-score: 0.285714
>> Accuracy score: 0.782609
>> C:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:958:
UndefinedMetricWarning:
F-score is ill-defined and being set to 0.0 due to no predicted samples.
I certainly know that the problem is related to my dataset: it is too small (it is only a sample of the real one). However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?

It seems it is a known bug here which has been fixed, I guess you should try update sklearn.

However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?
This is well-described at https://stackoverflow.com/a/34758800/1587329:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/classification.py
F1 = 2 * (precision * recall) / (precision + recall)
precision = TP/(TP+FP) as you've just said if predictor doesn't
predicts positive class at all - precision is 0.
recall = TP/(TP+FN), in case if predictor doesn't predict positive
class - TP is 0 - recall is 0.
So now you are dividing 0/0.
To fix the weighting problem (it's easy for the classifier to (almost) always predict the more prevalent class), you can use class_weight="balanced":
logistic = LogisticRegressionCV(
Cs=50,
cv=4,
penalty='l2',
fit_intercept=True,
scoring='f1',
class_weight="balanced"
)
LogisticRegressionCV says:
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.