Predicting outcome of multiple targets in SciKit-Learn - python

I'm working on a classification problem using Python scikit-learn. It's a medical diagnostics dataset with 6 features and 2 targets. I started with one target, trained a model using the KNN algorithm, and the prediction accuracy is 100% with this model.
Now I want to extend this to the second target, i.e. predict the outcomes of two y values for the same feature set (6 columns).
Below is my code, where I'm able to accurately predict the outcome of Target 1 ('Outcome1-Urinary-bladder'). How can I extend it to predict the outcome of the second target ('Outcome2-Nephritis-of-renal')?
X = Feature_set
y = Target1['Outcome1-Urinary-bladder'].values

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_predictor = knn.predict(X)
print(metrics.accuracy_score(y, y_predictor))
Click here to view the dataset
What modifications should be made to the code to predict the outcomes of both target values ('Outcome1-Urinary-bladder' and 'Outcome2-Nephritis-of-renal')?
Please help me out. Thanks in advance.

In general, you just wrap your classifier in a one-vs-rest classifier wrapper:
http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier
and feed it the matrix y, which has both columns at the same time.
Example of usage:
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

selClassifiers = {
    'linear': LinearSVC(),
    'linearWithSGD': SGDClassifier(),
    'rbf': SVC(kernel='rbf', probability=True),
    'poly': SVC(kernel='poly', probability=True),
    'sigmoid': SVC(kernel='sigmoid', probability=True),
    'bayes': MultinomialNB()
}

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(selClassifiers[classif]))  # 'classif' is a key chosen from selClassifiers
])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)  # 'lb' is a previously fitted label binarizer
As pointed out by @yangjie, for your specific classifier there is no need to wrap it; it already supports multi-output classification.
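For reference, a minimal sketch of that multi-output route with KNN, assuming both outcome columns live in the same DataFrame (called Target1 here, as in the question):

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X = Feature_set
# Stack both targets into an (n_samples, 2) matrix; column names are taken from the question
Y = Target1[['Outcome1-Urinary-bladder', 'Outcome2-Nephritis-of-renal']].values

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, Y)            # KNeighborsClassifier accepts a multi-output y directly
Y_pred = knn.predict(X)  # shape (n_samples, 2), one column per target

# accuracy_score expects 1-D labels, so evaluate each target separately
print(metrics.accuracy_score(Y[:, 0], Y_pred[:, 0]))
print(metrics.accuracy_score(Y[:, 1], Y_pred[:, 1]))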

Related

Cross validation and logistic regression

I am analyzing a dataset from Kaggle and want to apply a logistic regression model to predict something. This is the data: https://www.kaggle.com/code/mohamedadelhosny/stroke-prediction-data-analysis-challenge/data
I split the data into train and test, and want to use cross validation to ensure the highest accuracy possible. I did some pre-processing and used the dummy function over categorical features, got to a certain point in the code, and I don't know how to proceed. I can't figure out how to use the results of the cross validation; it's not so straightforward.
This is what I got so far:
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values  # features
Y = data_Enco.iloc[:, 6]  # labels

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
logisticModel = LogisticRegression(class_weight='balanced')

# evaluate model
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print('average score = ', np.mean(scores))
print('std of scores = ', np.std(scores))
average score = 0.7483538453549359
std of scores = 0.0190400919099899
So far so good: I got the results of the model for each of the 10 splits. But now what? How do I build a confusion matrix? How do I calculate the recall, precision, ...? I have the right code without performing cross validation, I just don't know how to adapt it. How do I use the scores of the cross_val_score function?
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(scaled_X_train, Y_train)  # Train the model

predictions_log = logisticModel.predict(scaled_X_test)

## Scoring the model
logisticModel.score(scaled_X_test, Y_test)

## Confusion Matrix
Y_pred = logisticModel.predict(scaled_X_test)
real_data = Y_test
print('Observe the difference between the real data and the data predicted by the classifier:\n')
print('Predictions: ', Y_pred, '\n\n')
print('Real Data: ', real_data, '\n')

cmtx = pd.DataFrame(
    confusion_matrix(real_data, Y_pred, labels=[0, 1]),
    index=['real 0:', 'real 1:'], columns=['pred 0:', 'pred 1:']
)
print(cmtx)
print('Accuracy score is: ', accuracy_score(real_data, Y_pred))
print('Precision score is: ', precision_score(real_data, Y_pred))
print('Recall Score is: ', recall_score(real_data, Y_pred))
print('F1 Score is: ', f1_score(real_data, Y_pred))
The performance of a model on the training dataset is not a good estimate of its performance on new data, because of overfitting.
Cross-validation is used to obtain an estimate of the performance of your model on new data, i.e. without overfitting. You correctly applied it to compute the mean and standard deviation of the accuracy of your model. This should be a much better approximation of the accuracy on your test dataset than the accuracy on your training dataset. And that is it.
However, cross-validation is usually used for model selection. Say you have two logistic regression models that use different sets of independent variables, e.g. one uses only age and gender while the other uses age, gender, and BMI. Or you want to compare logistic regression with an SVM model.
That is, you have several possible models and you want to decide which one is best. Of course, you cannot just compare the training dataset accuracies of all the models, because those are spoiled by overfitting. And if you use the performance on the test dataset to choose the best model, the test dataset becomes part of the training: you get leakage, and the performance on the test dataset can no longer serve as a final, untainted performance measure. That is why cross-validation is used: it creates splits that contain different versions of validation sets.
So the idea is to:
1. apply cross-validation to each of your candidate models,
2. use the scores of those cross-validations to choose the best model (a minimal sketch follows this list),
3. retrain that best model on the complete training dataset to get a final version of your best model, and
4. finally apply this final version to the test dataset to obtain an untainted evaluation.
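For instance, a minimal sketch of steps 1 and 2, with two hypothetical candidate models and the scaled_X_train / Y_train from your code:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two hypothetical candidate models to compare
candidates = {
    'logistic': LogisticRegression(class_weight='balanced', max_iter=1000),
    'svm_rbf': SVC(kernel='rbf', class_weight='balanced'),
}

cv = KFold(n_splits=10, shuffle=True, random_state=1)
for name, model in candidates.items():
    scores = cross_val_score(model, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
    print(name, np.mean(scores), np.std(scores))

# Keep the model with the best mean CV score, refit it on the full training
# set, and only then evaluate it once on the held-out test set.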
But note that those steps are for model selection. You, however, have only a single model, the logistic regression, so there is nothing to select from. If you fit your model, let's call it m(p) where p denotes the parameters, to e.g. five folds of CV, you get five different fitted versions m(p1), m(p2), ..., m(p5) of the same model.
So if you have only one model, you fit it to the complete training dataset, perhaps use CV to get an additional estimate of the performance on new data, and that's it. You have already done this. There is no "selection of the best model"; that is only relevant if you have several models to compare, e.g. logistic regression and SVM as described above.
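If you nevertheless want fold-level predictions on the training data, e.g. to inspect a confusion matrix without touching the test set, one option is cross_val_predict; a minimal sketch, reusing the names from your code:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Out-of-fold predictions: each training sample is predicted by a model
# fitted on the other folds, so no sample is predicted by a model trained on it.
Y_oof = cross_val_predict(logisticModel, scaled_X_train, Y_train, cv=cv)

print(confusion_matrix(Y_train, Y_oof, labels=[0, 1]))
print('Precision:', precision_score(Y_train, Y_oof))
print('Recall:', recall_score(Y_train, Y_oof))
print('F1:', f1_score(Y_train, Y_oof))

# The final, untainted evaluation is still: fit on the full training set,
# then score once on scaled_X_test / Y_test.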

Can we test the Naive Bayes algorithm with unlabeled data?

I have trained a model on labeled data using the Naive Bayes algorithm, then tested it on another set of labeled data, and calculated accuracy, precision, and recall scores using the code below.
My code:
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from io import open

def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename, encoding='utf-8') as file:
        file.readline()
        for line in file:
            line = line.strip().split(' ', 1)
            labels.append(line[0])
            reviews.append(line[1])
    return reviews, labels

X_train, y_train = load_data('./train_data.txt')
X_test, y_test = load_data('./test_data.txt')

vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_transformed, y_train)

score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :", score)

y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test, y_pred))
print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))
But now I have another test set which contains unlabeled data. Can I test the model on this unlabeled data using the above code?
This is what I could interpret from your question.
You have trained a Naive Bayes model using train data, tested it using test data, and used the confusion matrix and accuracy as metrics to measure the performance of the model.
Now your question may be:
Using this model, is it possible to predict labels of unseen data which don't have any labels?
If that is your question, then YES, it is possible. Moreover, that is the reason you trained the model in the first place, i.e. to predict the labels of unseen data.
Since the unseen data don't have labels, how do you know the predicted labels are correct? That is exactly why you tested the model on labeled test data and measured its performance: if the accuracy of the model is 70%, then 70% of the time your model predicts correctly.
I strongly suggest you think about why you are doing what you are doing before you start doing it!
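Concretely, predicting on the unlabeled set is just the transform/predict part of the code in the question; a minimal sketch (the file name and one-review-per-line format are assumptions):

# Assumed format: one unlabeled review per line
with open('./unlabeled_data.txt', encoding='utf-8') as file:
    unlabeled_reviews = [line.strip() for line in file if line.strip()]

# Reuse the vectorizer fitted on the training data; do NOT refit it here
X_unlabeled = vec.transform(unlabeled_reviews)

# Predicted labels for the unlabeled reviews; there is nothing to score them
# against, which is why the labeled test set was needed earlier.
predicted_labels = clf.predict(X_unlabeled)
print(predicted_labels)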
If you want to automatically calculate the accuracy of the model on unseen data, then the answer is NO.
To create a confusion matrix and compute these metrics you need to pass the Y label variable. Good practice is to split your training data into training and test sets.

How can I explain this drop in performance on test data?

I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get a good performance on training data (CV_score = 0.8) but the performance on the test data is much worse : AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC<0.5 if a naive classifier should score AUC = 0.5 ? Is this due to the fact that I have a small number of samples for training ?
I have noticed that the performance on test data improves if I change the code to:
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC = 0.87 for C=1 and 0.90 for C=0.01. Why is the AUC score so much better using cross-validation predictions? Is it because cross validation makes predictions on subsets of the test data which do not contain objects/samples that decrease the AUC?
Looks like you are encountering an overfitting problem: the classifier trained on the training data is overfitting to it and has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier on part of your testing data and then predicting on the rest, so the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets, so the classifier with the highest training accuracy doesn't work well on your testing set.
Another point, not directly related to your question: since the number of your training samples is much smaller than the feature dimension, it may be helpful to perform dimension reduction before feeding the data to the classifier.
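For example, a minimal sketch of such dimension reduction inside a cross-validated pipeline (the choice of 20 components is an arbitrary assumption):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Keep far fewer dimensions than the 60 training samples; 20 is arbitrary here
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=20),
                    LogisticRegression(C=1, class_weight='balanced'))

print(cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc'))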
It looks like your training and test processes are inconsistent. Although from your code you intend to standardize your data, you fail to do so during testing. What I mean is:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the logistic regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')), but during testing, instead of using the pipeline to predict, you use only LogReg, so the test data are not standardized.
The second point you raise is different. In
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data while ignoring the train data. Here you do standardize the data, since you use clf, and thus your score is high; this is evidence that the standardization step is important.
To summarize, standardizing the test data will, I believe, improve your test score.
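In other words, a minimal sketch of a consistent version, fitting and predicting through the same pipeline (variable names as in the question):

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

clf = make_pipeline(StandardScaler(), LogisticRegression(C=1, class_weight='balanced'))

# CV estimate on the training data (the scaler is refitted inside each fold)
print(cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc'))

# Fit the whole pipeline, then predict: the scaler fitted on X_train is now
# applied to X_test before the logistic regression sees it.
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, probas))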
Firstly, it makes no sense to have 258 features for 60 training items. Secondly, cv=10 for 60 items means you split the data into 10 train/test sets, each of which has only 6 items in its test set. So whatever results you obtain will be useless. You need more training data and fewer features.

All zeros when using OneVsRestClassifier

I am trying to use OneVsRestClassifier on my data set. I extracted the features on which the model will be trained and fitted a linear SVC on them. After fitting the model, when I try to predict on the same data on which the model was fitted, I get all zeros. Is it because of some implementation issue or because my feature extraction is not good enough? I think that since I am predicting on the same data on which my model was fitted, I should get 100% accuracy. But instead my model predicts all zeros. Here is my code:
# arrFinal contains all the features and the labels. Last 16 columns are labels and features are from 1 to 521. 17th column from the last is not taken
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array(arrFinal[:, 1:-17])
X = X.astype(float)
Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear'))
clf.fit(X, Y)
ans = clf.predict(Xtest)
print(ans)
print("\n\n\n")
Is there something wrong with my implementation of OneVsRestClassifier?
After looking at your data, it appears the feature values may be too small for the C value. Try using sklearn.preprocessing.StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array(arrFinal[:, 1:-17])
X = X.astype(float)

scaler = StandardScaler()
X = scaler.fit_transform(X)

Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear', C=100))
clf.fit(X, Y)
ans = clf.predict(Xtest)
print(ans)
print("\n\n\n")
From here, you should look at tuning the C parameter using cross-validation, either with a learning curve or with a grid search.
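For example, a minimal grid-search sketch, reusing X and Y from above (the C grid is an arbitrary assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

param_grid = {'estimator__C': [0.1, 1, 10, 100, 1000]}  # arbitrary grid of C values

grid = GridSearchCV(OneVsRestClassifier(SVC(kernel='linear')), param_grid, cv=5)
grid.fit(X, Y)

print(grid.best_params_)   # best C found by cross-validation
print(grid.best_score_)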

What is the theoretical foundation for the scikit-learn dummy classifier?

From the documentation I read that a dummy classifier can be used as a baseline to test a classification algorithm against:
This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.
What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:
generates predictions by respecting the training set's class distribution.
Could anybody give me a more theoretical explanation of why this is a proof of the performance of the classifier?
The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.
Suppose you wish to determine whether a given object possesses or does not possess a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent method in the documentation you cite.
Because many machine learning tasks attempt to increase the success rate of (e.g.) classification, evaluating the baseline success rate gives a floor value that one's classifier should out-perform. In the hypothetical discussed above, you would want your classifier to get more than 90% accuracy, because 90% is the success rate available to even "dummy" classifiers.
If one trains a dummy classifier with the stratified strategy on the data discussed above, that classifier will predict the target property for roughly 90% of the objects it encounters, drawing each prediction at random according to the training class distribution. This is different from training a dummy classifier with the most_frequent strategy, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:
from sklearn.dummy import DummyClassifier
import numpy as np

two_dimensional_values = []
class_labels = []

for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)

for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)

# now 90% of the training data contains the target property
X = np.array(two_dimensional_values)
y = np.array(class_labels)

# train a dummy classifier to make predictions based on the most_frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)

# this produces 100 predictions that say "1"
for i in two_dimensional_values:
    print(dummy_classifier.predict([i]))

# train a dummy classifier to make predictions based on the class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)

# this produces roughly 90 guesses that say "1" and roughly 10 guesses that say "0"
for i in two_dimensional_values:
    print(new_dummy_classifier.predict([i]))
A major motivation for the dummy classifier is the F-score, when the positive class is in the minority (i.e. imbalanced classes). This classifier is used as a sanity check for an actual classifier. In fact, the dummy classifier completely ignores the input data; with the 'most frequent' method, it simply predicts the most frequently occurring label.
Using the docs: to illustrate DummyClassifier, first let's create an imbalanced dataset:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Next, let’s compare the accuracy of SVC and most_frequent:
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...
We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
>>> clf = SVC(gamma='scale', kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.97...
We see that the accuracy was boosted to almost 100%. So this is better.
