Python KMeans cluster order for classification

I am experimenting with machine learning classification algorithms.
I am comparing different methods: logistic regression, KNN, and predictions based on KMeans centroids to assign a category.
Everything worked perfectly except that KMeans inverted the labels 0 and 1. The results are still correct, just that the categories no longer correspond.
The confusion matrix is therefore reversed between True and False, and my accuracy score, instead of 99%, is now at 1%.
Cluster 0 has to be the one related to False and cluster 1 to True. In addition, statistically the True values outnumber the False in this dataset, but maybe not in another one.
Is there any solution to fix the labels beforehand or to reassign the KMeans cluster labels?
I don't have this issue with KNN or logistic regression, whose categories correspond well to 0 and 1.
Here is my code, for a dataframe of 1500 rows and 6 columns, to predict the category between 0 and 1, i.e. between True and False:
from sklearn.cluster import KMeans
from sklearn import metrics
import pandas as pd
import matplotlib.pyplot as plt

# KMeans model initialization
km = KMeans(n_clusters=2)
km.fit(X_train_std)
# centroids definition
centroid = km.cluster_centers_
c_km = pd.DataFrame(centroid, columns=X_name)
# prediction for 2 clusters
y_pred_km = km.predict(X_test_std)
# store predictions and map cluster id to a boolean category
pred['pred_km'] = y_pred_km
pred['is_genuine_km'] = pred['pred_km'].apply(lambda x: True if x > 0 else False)
# plot the confusion matrix & accuracy score
fig, ax = plt.subplots(1, 1)
cm_km = metrics.confusion_matrix(y_test, y_pred_km)
cm_display_km = metrics.ConfusionMatrixDisplay(cm_km, display_labels=['False', 'True'])
cm_display_km.plot(ax=ax)
ax.set_title('K-Means Confusion Matrix \n Accuracy = %0.3f' % metrics.accuracy_score(y_test, y_pred_km))
plt.show()

I assume you use scikit-learn. In that case, you can create the model with km = KMeans(n_clusters=2, random_state=42) to seed the random number generator, so that it delivers the same clustering on each run.
See the KMeans documentation for the random_state parameter:
Use an int to make the randomness deterministic.
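As a minimal sketch (reusing X_train_std from the question), seeding makes the cluster assignment reproducible:
from sklearn.cluster import KMeans

# Fitting the same data twice with the same seed yields identical cluster labels,
# so cluster 0 and cluster 1 keep the same meaning from run to run.
km_a = KMeans(n_clusters=2, random_state=42).fit(X_train_std)
km_b = KMeans(n_clusters=2, random_state=42).fit(X_train_std)
assert (km_a.labels_ == km_b.labels_).all()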

Why the sum "value" isn't equal to the number of "samples" in scikit-learn RandomForestClassifier?

I built a random forest with RandomForestClassifier and plotted the decision trees. What does the parameter "value" (pointed to by the red arrows) mean? And why doesn't the sum of the two numbers in the [] equal the number of "samples"? In some other examples I saw, the sum of the two numbers in the [] equals the number of "samples". Why doesn't it in my case?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("Dataset.csv")
df.drop(['Flow ID', 'Inbound'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
df.Label[df.Label == 'BENIGN'] = 0
df.Label[df.Label == 'DrDoS_LDAP'] = 1
Y = df["Label"].values
Y = Y.astype('int')
X = df.drop(labels=["Label"], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestClassifier(n_estimators=20)
model.fit(X_train, Y_train)
Accuracy = model.score(X_test, Y_test)
for i in range(len(model.estimators_)):
    fig = plt.figure(figsize=(15, 15))
    tree.plot_tree(model.estimators_[i], feature_names=df.columns, class_names=['Benign', 'DDoS'])
    plt.savefig('.\\TheForest\\T' + str(i))
Nice catch.
Although undocumented, this is due to the bootstrap sampling taking place by default in a Random Forest model (see my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? for more on the RF algorithm details and its difference from a mere "bunch" of decision trees).
Let's see an example with the iris data:
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
rf = RandomForestClassifier(max_depth = 3)
rf.fit(iris.data, iris.target)
tree.plot_tree(rf.estimators_[0]) # take the first tree
The result here is similar to what you report: for every node except the lower right one, sum(value) does not equal samples, as it would in a "simple" decision tree.
A careful observer will have noticed something else odd here: while the iris dataset has 150 samples:
print(iris.DESCR)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
and the root node of the tree should include all of them, the samples shown for the first node are only 89.
Why is that, and what exactly is going on here? To see, let us fit a second RF model, this time without bootstrap sampling (i.e. with bootstrap=False):
rf2 = RandomForestClassifier(max_depth = 3, bootstrap=False) # no bootstrap sampling
rf2.fit(iris.data, iris.target)
tree.plot_tree(rf2.estimators_[0]) # take again the first tree
Well, now that we have disabled bootstrap sampling, everything looks "nice": the sum of value in every node equals samples, and the root node indeed contains the whole dataset (150 samples).
So, the behavior you describe is indeed due to bootstrap sampling: it creates samples with replacement (i.e. each individual decision tree of the ensemble ends up with duplicate samples), but these duplicates are not reflected in the samples shown at the tree nodes, which display the number of unique samples; they are, however, reflected in the node value.
The situation is completely analogous with that of a RF regression model, as well as with a Bagging Classifier - see respectively:
sklearn RandomForestRegressor discrepancy in the displayed tree values
Why does this decision tree's values at each step not sum to the number of samples?
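As a quick check (a minimal sketch, not part of the original answer), you can compare the root node's unique-sample count with its bootstrap-weighted count directly on the fitted tree:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
t = rf.estimators_[0].tree_
# n_node_samples counts the unique training rows reaching the root;
# weighted_n_node_samples also counts the bootstrap duplicates.
print(t.n_node_samples[0])           # roughly 90-100 unique samples (varies with the bootstrap draw)
print(t.weighted_n_node_samples[0])  # 150.0, the full training set size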

Conversion of IsolationForest decision score to probability algorithm

I am looking to create a generic function to convert the output decision_scores of sklearn's IsolationForest into true probabilities in [0.0, 1.0].
I am aware of, and have read, the original paper and I understand mathematically that the output of that function is not a probability, but is instead an average of the path length constructed by each base estimator to isolate an anomaly.
Problem
I want to convert that output to a probability in the form of a tuple (x,y) where x=P(anomaly) and y=1-x.
Current Approach
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def convert_probabilities(predictions, scores):
    new_scores = [(1, 1) for _ in range(len(scores))]
    anomalous_idxs = [i for i in range(len(predictions)) if predictions[i] == -1]
    regular_idxs = [i for i in range(len(predictions)) if predictions[i] == 1]
    anomalous_scores = np.asarray(np.abs([scores[i] for i in anomalous_idxs]))
    regular_scores = np.asarray(np.abs([scores[i] for i in regular_idxs]))
    scaler = MinMaxScaler()
    anomalous_scores_scaled = scaler.fit_transform(anomalous_scores.reshape(-1, 1))
    regular_scores_scaled = scaler.fit_transform(regular_scores.reshape(-1, 1))
    for i, j in zip(anomalous_idxs, range(len(anomalous_scores_scaled))):
        new_scores[i] = (anomalous_scores_scaled[j][0], 1 - anomalous_scores_scaled[j][0])
    for i, j in zip(regular_idxs, range(len(regular_scores_scaled))):
        new_scores[i] = (1 - regular_scores_scaled[j][0], regular_scores_scaled[j][0])
    return new_scores

modified_scores = convert_probabilities(model_predictions, model_decisions)
Minimal, Reproducible Example
import pandas as pd
from sklearn.datasets import make_classification, load_iris
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Get data
X, y = load_iris(return_X_y=True, as_frame=True)
anomalies, anomalies_classes = make_classification(n_samples=int(X.shape[0]*0.05), n_features=X.shape[1], hypercube=False, random_state=60, shuffle=True)
anomalies_df = pd.DataFrame(data=anomalies, columns=X.columns)
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=60)
# Combine testing data
X_test['anomaly'] = 1
anomalies_df['anomaly'] = -1
X_test = X_test.append(anomalies_df, ignore_index=True)
y_test = X_test['anomaly']
X_test.drop('anomaly', inplace=True, axis=1)
# Build a model
model = IsolationForest(n_jobs=1, bootstrap=False, random_state=60)
# Fit it
model.fit(X_train)
# Test it
model_predictions = model.predict(X_test)
model_decisions = model.decision_function(X_test)
# Print results
for a, b, c in zip(y_test, model_predictions, model_decisions):
    print_str = """
Class: {} | Model Prediction: {} | Model Decision Score: {}
""".format(a, b, c)
    print(print_str)
Problem
modified_scores = convert_probabilities(model_predictions, model_decisions)
# Print results
for a, b in zip(model_predictions, modified_scores):
    ans = False
    if a == -1:
        if b[0] > b[1]:
            ans = True
        else:
            ans = False
    elif a == 1:
        if b[1] > b[0]:
            ans = True
        else:
            ans = False
    print_str = """
Model Prediction: {} | Model Decision Score: {} | Correct: {}
""".format(a, b, str(ans))
    print(print_str)
This shows some odd results, such as:
Model Prediction: 1 | Model Decision Score: (0.17604259932311161, 0.8239574006768884) | Correct: True
Model Prediction: 1 | Model Decision Score: (0.7120367886017022, 0.28796321139829784) | Correct: False
Model Prediction: 1 | Model Decision Score: (0.7251531538304419, 0.27484684616955807) | Correct: False
Model Prediction: -1 | Model Decision Score: (0.16776449326185877, 0.8322355067381413) | Correct: False
Model Prediction: 1 | Model Decision Score: (0.8395087028516501, 0.1604912971483499) | Correct: False
Model Prediction: 1 | Model Decision Score: (0.0, 1.0) | Correct: True
How can the prediction be -1 (anomaly) but the probability only 37%? Or the prediction be 1 (normal) but the probability only 26%?
Note: the toy dataset is labeled, but an unsupervised anomaly detection algorithm obviously assumes no labels.
You have three different issues here. First, there is no guarantee that a lower score from the IsolationForest corresponds to a higher probability of the sample being an outlier. I mean that if for one batch of samples you get decision scores in the range (-0.3 : -0.2) and for another in (0.1 : 0.2), that does not necessarily mean that the probability of the first batch being outliers is higher (although it usually would be).
The second issue is the actual mapping function from scores to probabilities. Assuming that lower scores correspond to a lower probability of being a regular sample (and a higher probability of being an anomaly), the mapping from scores to probabilities will not necessarily be a linear function (such as MinMaxScaler). It may happen that for your data you will need to find your own function. It can be a piecewise linear function, as @Jon Nordby suggested. I personally prefer using a logistic function to map from scores to probabilities. It is especially suitable here because model_decisions is centered around zero and negative values indicate an anomaly. So you can use something like
import numpy as np

def logf(x, alfa=10):
    return 1 / (1 + np.exp(-alfa * x))
for mapping from scores to probabilities. The alfa parameter controls how tightly the values are packed around the decision boundary. Again, this is not necessarily the best mapping function; it is just one that I like to use.
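A minimal usage sketch (my own addition, reusing model and X_test from the question) could look like this:
# Map IsolationForest decision scores to (P(anomaly), P(normal)) pairs with the logistic function;
# scores below 0 end up with P(anomaly) > 0.5.
scores = model.decision_function(X_test)
p_normal = logf(scores)                 # high score -> close to 1 (normal)
pairs = [(1 - p, p) for p in p_normal]  # (P(anomaly), P(normal))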
The last issue is connected to the first one, and probably answers your question. Even if the scores generally correlate with the probability of not being an anomaly, this is not guaranteed to hold for every sample. So it might happen that a certain point with a score of 0.1 is an anomaly, while one with a score of -0.1 is a normal point that was mistakenly flagged as an anomaly. The decision whether a sample is an anomaly is made by checking whether model_decisions is smaller than zero. For samples with scores close to zero, the probability of a mistake is higher.
Though months later, there is now an answer to this question.
A paper published in 2011 addressed exactly this topic: unifying anomaly scores into probabilities.
In fact, the pyod library has a common predict_proba method, which gives an option to use this unifying method.
Here is a code implementation of that (influenced from their source):
def convert_probabilities(data, model):
decision_scores = model.decision_function(data)
probs = np.zeros([data.shape[0], int(model.classes)])
pre_erf_score = ( decision_scores - np.mean(decision_scores) ) / ( np.std(decision_scores) * np.sqrt(2) )
erf_score = erf(pre_erf_score)
probs[:, 1] = erf_score.clip(0, 1).ravel()
probs[:, 0] = 1 - probs[:, 1]
return probs
(For reference, pyod does have an Isolation Forest implementation)
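A hedged usage sketch (reusing model and X_test from the question above):
# Convert the fitted IsolationForest's decision scores into per-sample probabilities.
probs = convert_probabilities(X_test, model)
print(probs[:5])  # each row is [1 - p, p], so the two columns sum to 1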
Why this is happening
You are observing nonsensical probabilities because you are fitting a different scaler for the inliers and for the outliers. As a result, if the range of your decision scores is [0.5, 1.5] for inliers, you will map these scores to probabilities [0, 1]. Likewise, if the range of the decision scores is [-1.5, -0.5] for outliers, you will map those scores to probabilities [0, 1] as well. You end up with the probability of being an inlier set to 1 whether the decision score is 1.5 OR -0.5. This is obviously not what you want: an observation with decision score -0.5 should have a lower probability than an observation with decision score 1.5.
First option
The first solution is to fit one single scaler for all your scores. This also considerably simplifies your conversion function, as follows:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def convert_probabilities(predictions, scores):
    scaler = MinMaxScaler()
    scores_scaled = scaler.fit_transform(scores.reshape(-1, 1))
    new_scores = np.concatenate((1 - scores_scaled, scores_scaled), axis=1)
    return new_scores
Each row will be a pair (probability of being an outlier, probability of being an inlier) with the desired properties.
Limitation of this approach
One of the main limitations of this approach is that there is no guarantee that the probability cut-off between inliers and outliers will be 0.5, which is the most intuitive choice. You might end up with a scenario like "if the probability of being an inlier is less than 60%, then the model predicts it is an outlier".
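A minimal sketch with hypothetical score values illustrates this: with a single scaler, the probability assigned to decision score 0 (the model's own boundary) becomes the effective cut-off, and nothing forces it to be 0.5.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scores = np.array([-0.25, -0.1, 0.02, 0.15, 0.3])  # hypothetical decision_function outputs
scaler = MinMaxScaler().fit(scores.reshape(-1, 1))
cutoff = scaler.transform([[0.0]])[0, 0]  # P(inlier) assigned to the model's decision boundary
print(cutoff)  # ~0.45 here, not 0.5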
Second option
The second option is closer to what you wanted to do. You do indeed fit one scaler per category; however, unlike what you did, the two scalers do not return values in the same range. You can set outliers to be scaled to [0, 0.5] and inliers to be scaled to [0.5, 1]. This has the benefit of creating an intuitive decision boundary at 0.5, where all probabilities above it are inliers and vice versa. It would then look like this:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def convert_probabilities(predictions, scores):
    scaler_inliers = MinMaxScaler((0.5, 1))
    scaler_outliers = MinMaxScaler((0, 0.5))
    scores_inliers_scaled = scaler_inliers.fit_transform(scores[predictions == 1].reshape(-1, 1))
    scores_outliers_scaled = scaler_outliers.fit_transform(scores[predictions == -1].reshape(-1, 1))
    scores_scaled = np.zeros((len(scores), 1))
    scores_scaled[predictions == 1] = scores_inliers_scaled
    scores_scaled[predictions == -1] = scores_outliers_scaled
    new_scores = np.concatenate((1 - scores_scaled, scores_scaled), axis=1)
    return new_scores
Limitation of this approach
The main limitation is how you bring the two scalers back together. In the code example above, both are joined at 0.5, which means that the "best outlier" and the "worst inlier" have the same probability of 0.5, even though they do not have the same decision score. So one option is to change the scaling ranges to [0, 0.49] and [0.51, 1] or so, but as you can see, this is getting even more arbitrary.

How can I identify the records inside each cluster in a KNN model in SciKit-Learn Python?

I am making a KNN model. The target variable is divided into 2 categories, and the features are 3 categorical variables (country, language and company). The model says the optimal number is 5 clusters, so I did it with 5.
I need to know how I can see the records in each of the 5 clusters (I mean, the countries, languages and companies that the model is grouping in each of them). Is there a way to add the cluster labels to the dataframe?
I tried:
predictions = knn.predict(features)
But that is only returning the estimations for the 2 labels of the target variable
I did some research and found:
km.labels_
But that only applies to KMeans, and I am using KNN.
I hope somebody can tell me the equivalent for that, or how to solve the problem for a KNN model, please.
KNN is not clustering, but classification.
The parameter k is not the k of k-means; it is the number of neighbors, not the number of clusters...
Hence, setting k to 5 does not suddenly produce 5 labels. Your training data has 2 labels, hence you get 2 labels.
KNN = k-nearest-neighbors classification. For k=5 this means 5 nearest neighbors.
K-means clustering = approximate the data with k center vectors. An entirely different k.
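To make the distinction concrete, here is a minimal sketch (the names features, target and df are assumptions standing in for your encoded feature matrix, target labels and original dataframe): a classifier with n_neighbors=5 still predicts only the 2 target labels, while an actual clustering model with n_clusters=5 produces cluster ids that you can attach to the dataframe.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# KNN classification: n_neighbors=5 still predicts only the 2 target labels.
knn = KNeighborsClassifier(n_neighbors=5).fit(features, target)
print(set(knn.predict(features)))  # {0, 1}

# K-means clustering: n_clusters=5 assigns each record to one of 5 clusters.
km = KMeans(n_clusters=5, random_state=0).fit(features)
df['cluster'] = km.labels_  # now you can inspect the records per cluster
print(df.groupby('cluster').size())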
Yes it is always possible to match it back.
predictions = knn.predict(features)
y_test['preds'] = predictions
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
If your dataframe is called df, this should work.

Any difference between H2O and Scikit-Learn metrics scoring?

I tried to use H2O to create some machine learning models for a binary classification problem, and the test results are pretty good. But then I checked and found something weird. Out of curiosity, I tried to print the model's predictions for the test set, and I found out that my model actually predicts 0 (negative) all the time, yet the AUC is around 0.65 and precision is not 0.0. Then I tried to use Scikit-learn just to compare the metric scores, and (as expected) they're different. Scikit-learn yielded 0.0 precision and a 0.5 AUC score, which I think is correct. Here's the code that I used:
import h2o
import sklearn.metrics

model = h2o.load_model(model_path)
predictions = model.predict(Test_data).as_data_frame()
# H2O version to print the AUC score
auc = model.model_performance(Test_data).auc()
# scikit-learn version to print the AUC score
auc_sklearn = sklearn.metrics.roc_auc_score(y_true, predictions['predict'].tolist())
Any thoughts? Thanks in advance!
There is no difference between H2O and scikit-learn scoring; you just need to understand how to make sense of the output so you can compare them accurately.
If you'll look at the data in predictions['predict'] you'll see that it's a predicted class, not a raw predicted value. AUC uses the latter, so you'll need to use the correct column. See below:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# Generate predictions on a test set
pred = model.predict(test)
Examine the output:
In [4]: pred.head()
Out[4]:
predict p0 p1
--------- -------- --------
0 0.715077 0.284923
0 0.778536 0.221464
0 0.580118 0.419882
1 0.316875 0.683125
0 0.71118 0.28882
1 0.342766 0.657234
1 0.297636 0.702364
0 0.594192 0.405808
1 0.513834 0.486166
0 0.70859 0.29141
[10 rows x 3 columns]
Now compare to sklearn:
from sklearn.metrics import roc_auc_score
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
# 0.78170751032654806
Here you see that they are approximately the same. AUC is an approximate method, so you'll see differences after a few decimal places when you compare different implementations.

Find the most important features for an SVM classification

I'm training a binary classifier using Python and the popular scikit-learn module's SVM class. After training I use the predict method to make a classification, as laid out in scikit-learn's SVC documentation.
I would like to know more about the significance of my sample features to the resulting classification made by the trained decision_function (support vectors). Any strategies for evaluating feature significance when making predictions with such a model are welcome.
Thanks!
Andre
So, how do we interpret feature significance for a given sample's classification?
I think using a linear kernel is the most straightforward way to first approach this because of the significance/relative simplicity of the svc.coef_ attribute of a trained model. Check out Bitwise's answer.
Below I will train a linear-kernel SVM on a scikit-learn dataset. Then we will look at the coef_ attribute. I will include a simple plot showing how the dot product of the classifier's coefficients and the training feature data divides the resulting classes.
from sklearn import svm
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = data.data    # training features
y = data.target  # training labels
lin_clf = svm.SVC(kernel='linear')
lin_clf.fit(X, y)
scores = np.dot(X, lin_clf.coef_.T)
b0 = y == 0  # boolean or "mask" index arrays
b1 = y == 1
malignant_scores = scores[b0].ravel()  # class 0 is malignant in this dataset
benign_scores = scores[b1].ravel()
fig = plt.figure()
fig.suptitle("score breakdown by classification", fontsize=14, fontweight='bold')
score_box_plt = plt.boxplot(
    [malignant_scores, benign_scores],
    notch=True,
    labels=list(data.target_names),
    vert=False
)
plt.show()
As you can see we do seem to have accessed the appropriate intercept and coefficient values. There is obvious separation of class scores with our decision boundary hovering around 0.
Now that we have a scoring system based on our linear coefficients, we can easily investigate how each feature contributed to the final classification. Here we display each feature's effect on the final score of that sample.
## sample we're using X[2] --> classified benign, lin_clf score ~ (-20)
lin_clf.predict(X[2].reshape(1, 30))
contributions = np.multiply(X[2], lin_clf.coef_.reshape((30,)))
feature_number = np.arange(len(contributions)) + 1
plt.bar(feature_number, contributions, align='center')
plt.xlabel('feature index')
plt.ylabel('score contribution')
plt.title('contribution to classification outcome by feature index')
plt.show()
We can also sort this same data to get a contribution-ranked list of features for a given classification, to see which features contributed most to the score whose composition we are assessing.
abs_contributions = np.flip(np.sort(np.absolute(contributions)), axis=0)
feat_and_contrib = []
for contrib in abs_contributions:
    if contrib not in contributions:
        contrib = -contrib
        feat = np.where(contributions == contrib)
        feat_and_contrib.append((feat[0][0], contrib))
    else:
        feat = np.where(contributions == contrib)
        feat_and_contrib.append((feat[0][0], contrib))
# sorted by max abs value. each row is a tuple: (feature index, contrib)
feat_and_contrib
From that ranked list we can see that the top five feature indices that contributed to the final score (of around -20 along with a classification 'benign') were [0, 22, 13, 2, 21] which correspond to the feature names in our data set; ['mean radius', 'worst perimeter', 'area error', 'mean perimeter', 'worst texture'].
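As a small follow-up (a sketch not in the original answer), the index-to-name lookup can be done directly from the dataset object:
# Map the top-ranked feature indices back to their names in the breast cancer dataset.
top_idx = [i for i, _ in feat_and_contrib[:5]]
print([data.feature_names[i] for i in top_idx])
# e.g. ['mean radius', 'worst perimeter', 'area error', 'mean perimeter', 'worst texture']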
Suppose you have a bag-of-words featurization and you want to know which words are important for classification; then use this code for a linear SVM:
import numpy as np

weights = np.abs(lr_svm.coef_[0])
sorted_index = np.argsort(weights)[::-1]
top_10 = sorted_index[:10]
terms = text_vectorizer.get_feature_names()
for ind in top_10:
    print(terms[ind])
You can use SelectFromModel in sklearn to get the names of the most relevant features of your model. Here is an example of extracting the features for LassoCV.
You can also check out this example which makes use of coef_ attribute in SVM to visualize the top most features.
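A minimal sketch of the SelectFromModel approach (my own illustration, assuming a linear SVM on the breast cancer data used above, not code taken from the linked examples):
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

data = load_breast_cancer()
# Fit a linear SVM and keep only the features whose |coef_| exceeds the default ("mean") threshold.
selector = SelectFromModel(LinearSVC(C=0.01, dual=False, max_iter=10000)).fit(data.data, data.target)
mask = selector.get_support()
print([name for name, keep in zip(data.feature_names, mask) if keep])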
