I have implemented LinearSVC and SVC from the sklearn framework for text classification.
I am using TfidfVectorizer to get a sparse representation of the input data, which consists of two different classes (benign data and malicious data). That part works fine, but now I want to implement some kind of anomaly detection by using the OneClassSVM classifier and training a model with only one class (outlier detection). Unfortunately it does not work with sparse data. Some developers are working on a patch (https://github.com/scikit-learn/scikit-learn/pull/1586), but there are still some bugs, so there is currently no way to use the OneClassSVM implementation.
Are there any other methods in the sklearn framework for doing something like that? I have looked over the examples, but nothing seems to fit.
Thanks!
A bit late, but in case anyone else is looking for information on this... There's a third-party anomaly detection module for sklearn here: http://www.cit.mak.ac.ug/staff/jquinn/software/lsanomaly.html, based on least-squares methods. It should be a plug-in replacement for OneClassSVM.
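For completeness, usage as a drop-in replacement might look roughly like the sketch below. Note that the class name LSAnomaly and its sklearn-style fit/predict interface are assumptions based on the module's description, so check its documentation; sparse TF-IDF matrices may also need to be densified first.
# Hypothetical sketch: assumes lsanomaly exposes an sklearn-style
# estimator called LSAnomaly with fit/predict methods.
from lsanomaly import LSAnomaly
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(benign_texts).toarray()   # benign_texts: placeholder training corpus
X_test = vectorizer.transform(unknown_texts).toarray()       # unknown_texts: placeholder data to score

detector = LSAnomaly()
detector.fit(X_train)
print(detector.predict(X_test))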
Unfortunately, scikit-learn currently implements only the one-class SVM and the robust covariance estimator (EllipticEnvelope) for outlier detection.
You can try a comparison of these methods (as provided in the docs) by examining how they differ on 2D data:
import numpy as np
import pylab as pl
import matplotlib.font_manager
from scipy import stats
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# define two outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "robust covariance estimator": EllipticEnvelope(contamination=.1)}
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = 0
# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation
    X1 = 0.3 * np.random.randn(int(0.5 * n_inliers), 2) - offset
    X2 = 0.3 * np.random.randn(int(0.5 * n_inliers), 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

    # Fit the models
    pl.figure(figsize=(10, 5))
    for i, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        clf.fit(X)
        y_pred = clf.decision_function(X).ravel()
        threshold = stats.scoreatpercentile(y_pred,
                                            100 * outliers_fraction)
        y_pred = y_pred > threshold
        n_errors = (y_pred != ground_truth).sum()
        # plot the level lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = pl.subplot(1, 2, i + 1)
        subplot.set_title("Outlier detection")
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=pl.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
        subplot.axis('tight')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=11))
        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    pl.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
pl.show()
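As a follow-up, for plain anomaly detection (rather than the visual comparison above) both estimators follow the usual fit/predict pattern; in current scikit-learn versions predict returns +1 for inliers and -1 for outliers. A minimal sketch with EllipticEnvelope:
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(200, 2)                                  # "normal" training data
X_new = np.r_[0.3 * rng.randn(10, 2), [[4.0, 4.0], [-5.0, 3.0]]]   # mostly normal plus two obvious outliers

detector = EllipticEnvelope(contamination=0.1)
detector.fit(X_train)
print(detector.predict(X_new))   # +1 for inliers, -1 for outliers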
I have been thinking about this but am not sure how to do it. I have binary imbalanced data and would like to use SVM to select just a subset of the majority-class data points, namely those nearest to the support vector. Afterwards, I can fit a binary classifier on this "balanced" data.
To illustrate what I mean, here is a MWE:
# packages import
from collections import Counter
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
import seaborn as sns
# sample data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=1)
# class distribution summary
print(Counter(y))
Counter({0: 91, 1: 9})
# fit svm model
svc_model = SVC(kernel='linear', random_state=32)
svc_model.fit(X, y)
plt.figure(figsize=(10, 8))
# Plotting our two-features-space
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
# Constructing a hyperplane using a formula.
w = svc_model.coef_[0] # w consists of 2 elements
b = svc_model.intercept_[0] # b consists of 1 element
x_points = np.linspace(-1, 1) # generating x-points from -1 to 1
y_points = -(w[0] / w[1]) * x_points - b / w[1] # getting corresponding y-points
# Plotting a red hyperplane
plt.plot(x_points, y_points, c='r')
The two classes are well separated by the hyperplane. We can see the support vectors for both classes (even better for class 1).
Since the minority class 1 has 9 data points, I want to down-sample the majority class 0 by selecting its support vector and the 8 other data points nearest to it, so that the class distribution becomes {0: 9, 1: 9}, ignoring all other data points of class 0. I will then use this to fit a binary classifier like LR (or even SVC).
My question is how to select the data points of class 0 nearest to that class's support vector, in a way that balances them against the data points of the minority class 1.
This can be achieved as follows: Get the support vector for class 0, (sv0), iterate over all data points in class 0 (X[y == 0]), compute the distances (d) to the point represented by the support vector, sort them, take the 9 with the smallest values, and concatenate them with the points of class 1 to create the downsampled data (X_ds, y_ds).
sv0 = svc_model.support_vectors_[0]
distances = []
for i, x in enumerate(X[y == 0]):
d = np.linalg.norm(sv0 - x)
distances.append((i, d))
distances.sort(key=lambda tup: tup[1])
index = [i for i, d in distances][:9]
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))
plt.plot(x_points[19:-29], y_points[19:-29], c='r')
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
plt.scatter(X_ds[y_ds == 0][:,0], X_ds[y_ds == 0][:,1], color='yellow', alpha=0.4)
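To round this off, the binary classifier mentioned in the question can then be fit on the balanced subset; a minimal sketch using LogisticRegression (just one possible choice):
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_ds, y_ds)                # train on the 9 + 9 balanced subset
print(lr.score(X_ds, y_ds))       # accuracy on the balanced training data
print(lr.predict(X))              # predictions for the full original data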
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
data=pd.DataFrame(
{"input":
[0.001,0.015,0.066,0.151,0.266,0.402,0.45,0.499,0.598,0.646,0.738,0.782,0.86,0.894,0.924,0.95],
"output":[0.5263157894736842,0.5789473684210524,0.6315789473684206,0.6842105263157897,
0.6315789473684206, 0.7894736842105263, 0.8421052631578945, 0.7894736842105263, 0.736842105263158,
0.6842105263157897, 0.736842105263158, 0.736842105263158,0.6842105263157897, 0.6842105263157897,
0.6315789473684206,0.5789473684210524]})
I have the above data, which includes input and output values, and I want to make a curve that properly fits it. First, here is the plot of the input and output values:
I have written this code:
X=data.iloc[:,0].to_numpy()
X=X.reshape(-1,1)
y=data.iloc[:,1].to_numpy()
y=y.reshape(-1,1)
poly=PolynomialFeatures(degree=2)
poly.fit(X,y)
X_poly=poly.transform(X)
reg=LinearRegression().fit(X_poly,y)
plt.scatter(X,y,color="blue")
plt.plot(X,reg.predict(X_poly),color="orange",label="Polynomial Linear Regression")
plt.xlabel("Temperature")
plt.ylabel("Pressure")
plt.legend(loc="upper left")
The plot is:
But I cannot find the equation of the above curve (the orange curve). How can I find it?
Your plot actually corresponds to your code run with
poly=PolynomialFeatures(degree=7)
and not to degree=2. Indeed, running your code with the above change, we get:
Now, your polynomial features are:
poly.get_feature_names()
# ['1', 'x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7']
and the respective coefficients of your linear regression are:
reg.coef_
# array([[ 0. , 5.43894411, -68.14277256, 364.28508827,
# -941.70924401, 1254.89358662, -831.27091422, 216.43304954]])
plus the intercept:
reg.intercept_
# array([0.51228593])
Given the above, and setting
coef = reg.coef_[0]
since here we have a single feature in the initial data, your regression equation is:
y = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
For visual verification, we can plot the above function with some x data in [0, 1]
x = np.linspace(0, 1, 15)
Running the above expression for y and
plt.plot(x, y)
gives:
Using some randomly generated data x, we can verify that the results of the equation y_eq are indeed equal to the results produced by the regression model y_reg within the limits of numerical precision:
x = np.random.rand(1,10)
y_eq = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
y_reg = np.concatenate(reg.predict(poly.transform(x.reshape(-1,1))))
y_eq
# array([[0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ]])
y_reg
# array([0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ])
np.allclose(y_reg, y_eq)
# True
Irrelevant to the question, but I guess you already know that trying to fit such high-order polynomials to so few data points is not a good idea, and you should probably stick to a low degree of 2 or 3...
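For reference, a lower-degree fit can be written compactly with a Pipeline; a sketch with degree=3, reusing the X and y arrays from the question:
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.named_steps['linearregression'].coef_)        # degree-3 coefficients
print(model.named_steps['linearregression'].intercept_)
plt.scatter(X, y, color="blue")
plt.plot(X, model.predict(X), color="green", label="degree-3 fit")
plt.legend(loc="upper left")
plt.show()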
Not sure how you produced the plot shown in the question. When I ran your code I got the following (degree=2) polynomial fitted to the data, as expected:
Now that you have fitted the data you can see the coefficients of the model thus:
print(reg.coef_)
print(reg.intercept_)
# [[ 0. 0.85962436 -0.83796885]]
# [0.5523586]
Note that the data that was used to fit this model is equivalent to the following:
X_poly = np.concatenate([np.ones((16,1)), X, X**2], axis=1)
Therefore a single data point is a vector created as follows:
temp = 0.5
x = np.array([1, temp, temp**2]).reshape((1,3))
Your polynomial model is simply a linear model of the polynomial features:
y = A.x + B
or
y = reg.coef_.dot(x.T) + reg.intercept_
print(y) # [[0.77267856]]
Verification:
print(reg.predict(x)) # array([[0.77267856]])
I am doing linear regression with two-dimensional variables:
filtered[['p_tag_x', 'p_tag_y', 's_tag_x', 's_tag_y']].head()
        p_tag_x      p_tag_y  s_tag_x  s_tag_y
35   589.665646  1405.580171    517.5   1636.5
36   589.665646  1405.580171    679.5   1665.5
100  610.546851  2425.303250    569.5   2722.0
101  610.546851  2425.303250    728.0   2710.0
102  717.237730  1411.842428    820.0   1616.5
clt = linear_model.LinearRegression()
clt.fit(filtered[['p_tag_x', 'p_tag_y']], filtered[['s_tag_x', 's_tag_y']])
I am getting following coefficients of the regression:
clt.coef_
array([[ 0.4529769 , -0.22406594],
[-0.00859452, -0.00816968]])
And the residues (X_0 and Y_0):
clt.residues_
array([ 1452.97816371, 69.12754694])
How should I understand the above coefficient matrix in terms of the regression line?
As I already explained in the comments, you get an extra dimension in your coef_ as well as intercept_ because you have 2 targets (y.shape = (n_samples, n_targets)). In this case sklearn fits 2 independent regressors, one for each target.
You can then take those n regressors apart and handle each one on its own.
The formula of your regression line is still:
y(w, x) = intercept_ + coef_[0] * x[0] + coef_[1] * x[1] ...
Sadly your example is a bit harder to visualize because of the dimensionality.
Consider this a demo, with a lot of ugly hard-coding for this specific case (and bad example data!):
Code:
# Warning: ugly demo-like code using a lot of hard-coding!!!!!
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model
X = np.array([[589.665646, 1405.580171],
[589.665646, 1405.580171],
[610.546851, 2425.303250],
[610.546851, 2425.303250],
[717.237730, 1411.842428]])
y = np.array([[517.5, 1636.5],
[679.5, 1665.5],
[569.5, 2722.0],
[728.0, 2710.0],
[820.0, 1616.5]])
clt = linear_model.LinearRegression()
clt.fit(X, y)
print(clt.coef_)
print(clt.residues_)
def curve_0(x, y):  # target 0; single-point evaluation hardcoded for 2 features!
    return clt.intercept_[0] + x * clt.coef_[0, 0] + y * clt.coef_[0, 1]

def curve_1(x, y):  # target 1; single-point evaluation hardcoded for 2 features!
    return clt.intercept_[1] + x * clt.coef_[1, 0] + y * clt.coef_[1, 1]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
xs = [np.amin(X[:, 0]), np.amax(X[:, 0])]
ys = [np.amin(X[:, 1]), np.amax(X[:, 1])]
# regressor 0
ax.scatter(X[:, 0], X[:, 1], y[:, 0], c='blue')
ax.plot([xs[0], xs[1]], [ys[0], ys[1]], [curve_0(xs[0], ys[0]), curve_0(xs[1], ys[1])], c='cyan')
# regressor 1
ax.scatter(X[:, 0], X[:, 1], y[:, 1], c='red')
ax.plot([xs[0], xs[1]], [ys[0], ys[1]], [curve_1(xs[0], ys[0]), curve_1(xs[1], ys[1])], c='magenta')
ax.set_xlabel('X[:, 0] feature 0')
ax.set_ylabel('X[:, 1] feature 1')
ax.set_zlabel('Y')
plt.show()
Output:
Remarks:
You don't have to calculate the formula by yourself: clt.predict() will do that (see the sketch after these remarks)!
The code-lines involving ax.plot(...) use the assumption, that our line is defined by just 2 points (linear)!
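To make the first remark concrete, here is a sketch (using the X and clt fitted in the demo above) showing that the manual formula and clt.predict agree for both targets at once:
manual = X.dot(clt.coef_.T) + clt.intercept_   # y = X @ coef.T + intercept, one column per target
print(np.allclose(manual, clt.predict(X)))     # True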
I'm new to scikit-learn and GMMs in general... I have a problem with the fit quality of a Gaussian Mixture Model in Python (scikit-learn).
I have an array of data (which you may find at DATA HERE) that I want to fit with a GMM with n = 2 components.
As a benchmark I superimpose a Normal fit.
Errors/weirdness:
with n = 1 components, GMM(1) does not recover the Normal benchmark fit
with n = 2 components, the Normal fit is better than the GMM(2) fit
GMM(n) always seems to provide the same fit...
Here is what I get: what am I doing wrong here? (The picture displays the fits with GMM(2).) Thanks in advance for your help.
Code below (to run it, save data in the same folder)
from numpy import *
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from collections import OrderedDict
from scipy.stats import norm
from sklearn.mixture import GMM
# Upload the data: "epsi" (array of floats)
file_xlsx = './db_X.xlsx'
data = pd.read_excel(file_xlsx)
epsi = data["epsi"].values;
t_ = len(epsi);
# Normal fit (for benchmark)
epsi_grid = arange(min(epsi),max(epsi)+0.001,0.001);
mu = mean(epsi);
sigma2 = var(epsi);
normal = norm.pdf(epsi_grid, mu, sqrt(sigma2));
# TENTATIVE - Gaussian mixture fit
gmm = GMM(n_components = 2); # fit quality doesn't improve if I set: covariance_type = 'full'
gmm.fit(reshape(epsi,(t_,1)));
gauss_mixt = exp(gmm.score(reshape(epsi_grid,(len(epsi_grid),1))));
# same result if I apply the definition of pdf of a Gaussian mixture:
# pdf_mixture = w_1 * N(mu_1, sigma_1) + w_2 * N(mu_2, sigma_2)
# as suggested in:
# http://stackoverflow.com/questions/24878729/how-to-construct-and-plot-uni-variate-gaussian-mixture-using-its-parameters-in-p
#
#gauss_mixt = array([p * norm.pdf(epsi_grid, mu, sd) for mu, sd, p in zip(gmm.means_.flatten(), sqrt(gmm.covars_.flatten()), gmm.weights_)]);
#gauss_mixt = sum(gauss_mixt, axis = 0);
# Create a figure showing the comparison between the estimated distributions
# setting the figure object
fig = plt.figure(figsize = (10,8))
fig.set_facecolor('white')
ax = plt.subplot(111)
# colors
red = [0.9, 0.3, 0.0];
grey = [0.9, 0.9, 0.9];
green = [0.2, 0.6, 0.3];
# x-axis limits
q_inf = float(pd.DataFrame(epsi).quantile(0.0025));
q_sup = float(pd.DataFrame(epsi).quantile(0.9975));
ax.set_xlim([q_inf, q_sup])
# empirical pdf of data
nb = int(10*log(t_));
ax.hist(epsi, bins = nb, normed = True, color = grey, edgecolor = 'k', label = "Empirical");
# Normal fit
ax.plot(epsi_grid, normal, color = green, lw = 1.0, label = "Normal fit");
# Gaussian Mixture fit
ax.plot(epsi_grid, gauss_mixt, color = red, lw = 1.0, label = "GMM(2)");
# title
ax.set_title("Issue: Normal fit out-performs the GMM fit?", size = 14)
# legend
ax.legend(loc='upper left');
plt.tight_layout()
plt.show()
The problem was the lower bound on the individual components' variances, min_covar, which defaults to 1e-3 and is meant to prevent overfitting.
Lowering that limit solved the problem (see picture):
gmm = GMM(n_components = 2, min_covar = 1e-12)
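Note that in current scikit-learn versions GMM has been replaced by GaussianMixture, where the roughly corresponding knob is reg_covar (a value added to the diagonal of the covariance matrices); a sketch of the equivalent fit:
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, reg_covar=1e-12)
gmm.fit(epsi.reshape(-1, 1))
# score_samples returns the per-sample log-density, so exponentiate to get the pdf
gauss_mixt = np.exp(gmm.score_samples(epsi_grid.reshape(-1, 1)))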
I want to cluster data with missing values. Doing it manually, I would compute the distance between two points simply without the missing columns.
With scikit-learn, handling missing data is not possible, and there is also no way to specify a user-defined distance function.
Is there any way to cluster with missing data?
Example data:
import numpy as np
from sklearn.datasets import make_swiss_roll

n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)
rnd = np.random.rand(X.shape[0], X.shape[1])
X[rnd < 0.1] = np.nan
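A minimal sketch of the manual approach described above: compute pairwise distances using only the columns that are present in both points, then hand the precomputed matrix to a clusterer that accepts one. DBSCAN is used here only because it supports metric='precomputed'; the eps value is a guess and would need tuning:
import numpy as np
from sklearn.cluster import DBSCAN

def masked_distance(a, b):
    # Euclidean distance over the columns present in both points,
    # rescaled to the full dimensionality (nan-Euclidean convention)
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return 1e6  # no shared columns: treat the pair as very far apart
    return np.sqrt(a.size / mask.sum() * np.sum((a[mask] - b[mask]) ** 2))

# X is the swiss-roll array with NaNs from the example above
# (slow but simple: a pure-Python double loop over all pairs)
D = np.array([[masked_distance(p, q) for q in X] for p in X])
labels = DBSCAN(eps=1.5, metric='precomputed').fit_predict(D)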
I think you can use an iterative EM-type algorithm:
Initialize missing values to their column means
Repeat until convergence:
Perform K-means clustering on the filled-in data
Set the missing values to the centroid coordinates of the clusters to which they were assigned
Implementation
import numpy as np
from sklearn.cluster import KMeans
def kmeans_missing(X, n_clusters, max_iter=10):
    """Perform K-Means clustering on data with missing values.

    Args:
      X: An [n_samples, n_features] array of data to cluster.
      n_clusters: Number of clusters to form.
      max_iter: Maximum number of EM iterations to perform.

    Returns:
      labels: An [n_samples] vector of integer labels.
      centroids: An [n_clusters, n_features] array of cluster centroids.
      X_hat: Copy of X with the missing values filled in.
    """
    # Initialize missing values to their column means
    missing = ~np.isfinite(X)
    mu = np.nanmean(X, 0, keepdims=1)
    X_hat = np.where(missing, mu, X)

    for i in range(max_iter):
        if i > 0:
            # initialize KMeans with the previous set of centroids. this is much
            # faster and makes it easier to check convergence (since labels
            # won't be permuted on every iteration), but might be more prone to
            # getting stuck in local minima.
            cls = KMeans(n_clusters, init=prev_centroids)
        else:
            # do multiple random initializations in parallel
            cls = KMeans(n_clusters, n_jobs=-1)

        # perform clustering on the filled-in data
        labels = cls.fit_predict(X_hat)
        centroids = cls.cluster_centers_

        # fill in the missing values based on their cluster centroids
        X_hat[missing] = centroids[labels][missing]

        # when the labels have stopped changing then we have converged
        if i > 0 and np.all(labels == prev_labels):
            break

        prev_labels = labels
        prev_centroids = cls.cluster_centers_

    return labels, centroids, X_hat
Example with fake data
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=3, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples, n_features, n_clusters,
                                random_state=gen)
    # with missing values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm
X, true_labels, Xm = make_fake_data(fraction_missing=0.3, n_clusters=5, seed=0)
labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)
# plot the inferred points, color-coded according to the true cluster labels
fig, ax = plt.subplots(1, 2, subplot_kw={'projection':'3d', 'aspect':'equal'})
ax[0].scatter3D(X[:, 0], X[:, 1], X[:, 2], c=true_labels, cmap='gist_rainbow')
ax[1].scatter3D(X_hat[:, 0], X_hat[:, 1], X_hat[:, 2], c=true_labels,
cmap='gist_rainbow')
ax[0].set_title('Original data')
ax[1].set_title('Imputed (30% missing values)')
fig.tight_layout()
Benchmark
To assess the algorithm's performance, we can use the adjusted mutual information between the true and inferred cluster labels. A score of 1 is perfect performance and 0 represents chance:
from sklearn.metrics import adjusted_mutual_info_score
fraction = np.arange(0.0, 1.0, 0.05)
n_repeat = 10
scores = np.empty((2, fraction.shape[0], n_repeat))
for i, frac in enumerate(fraction):
    for j in range(n_repeat):
        X, true_labels, Xm = make_fake_data(fraction_missing=frac, n_clusters=5)
        labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)
        any_missing = np.any(~np.isfinite(Xm), 1)
        scores[0, i, j] = adjusted_mutual_info_score(labels, true_labels)
        scores[1, i, j] = adjusted_mutual_info_score(labels[any_missing],
                                                     true_labels[any_missing])
fig, ax = plt.subplots(1, 1)
scores_all, scores_missing = scores
ax.errorbar(fraction * 100, scores_all.mean(-1),
yerr=scores_all.std(-1), label='All labels')
ax.errorbar(fraction * 100, scores_missing.mean(-1),
yerr=scores_missing.std(-1),
label='Labels with missing values')
ax.set_xlabel('% missing values')
ax.set_ylabel('Adjusted mutual information')
ax.legend(loc='best', frameon=False)
ax.set_ylim(0, 1)
ax.set_xlim(-5, 100)
Update:
In fact, after a quick Google search it seems that what I've come up with above is pretty much the same as the k-POD algorithm for K-means clustering of missing data (Chi, Chi & Baraniuk, 2016).
Here is a different algorithm that I use. Instead of replacing the missing values, they are ignored, and in order to capture the differences between missing and non-missing, I implement missing-value dummies.
Compared to Ali's algorithm, it seems easier for observations with missing values to jump from class to class, since I do not fill in the missing values.
I unfortunately did not have the time to compare it against Ali's beautiful code, but feel free to do so (I might do it when I get the time) and contribute to the discussion about the best method.
import numpy as np
class kmeans_missing(object):
    def __init__(self, potential_centroids, n_clusters):
        # initialize with potential centroids
        self.n_clusters = n_clusters
        self.potential_centroids = potential_centroids

    def fit(self, data, max_iter=10, number_of_runs=1):
        n_clusters = self.n_clusters
        potential_centroids = self.potential_centroids

        dist_mat = np.zeros((data.shape[0], n_clusters))
        all_centroids = np.zeros((n_clusters, data.shape[1], number_of_runs))
        costs = np.zeros((number_of_runs,))
        for k in range(number_of_runs):
            idx = np.random.choice(range(potential_centroids.shape[0]), size=(n_clusters), replace=False)
            centroids = potential_centroids[idx]
            clusters = np.zeros(data.shape[0])
            old_clusters = np.zeros(data.shape[0])
            for i in range(max_iter):
                # calculate distance to centroids, ignoring missing values
                for j in range(n_clusters):
                    dist_mat[:, j] = np.nansum((data - centroids[j])**2, axis=1)
                # assign to clusters
                clusters = np.argmin(dist_mat, axis=1)
                # update centroids
                for j in range(n_clusters):
                    centroids[j] = np.nanmean(data[clusters == j], axis=0)
                if all(np.equal(clusters, old_clusters)):
                    break  # break when there is no change in clusters
                if i == max_iter - 1:
                    print('no convergence before maximal iterations are reached')
                else:
                    clusters, old_clusters = old_clusters, clusters

            all_centroids[:, :, k] = centroids
            costs[k] = np.mean(np.min(dist_mat, axis=1))

        self.costs = costs
        self.cost = np.min(costs)
        self.best_model = np.argmin(costs)
        self.centroids = all_centroids[:, :, self.best_model]
        self.all_centroids = all_centroids

    def predict(self, data):
        dist_mat = np.zeros((data.shape[0], self.n_clusters))
        for j in range(self.n_clusters):
            dist_mat[:, j] = np.nansum((data - self.centroids[j])**2, axis=1)
        prediction = np.argmin(dist_mat, axis=1)
        cost = np.min(dist_mat, axis=1)
        return prediction, cost
Here is an example of how it might be useful.
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from kmeans_missing import *
def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=2, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples, n_features, n_clusters,
                                random_state=gen)
    # with missing values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm
X, true_labels, X_hat = make_fake_data(fraction_missing=0.3, n_clusters=3, seed=0)
X_missing_dummies=np.isnan(X_hat)
n_clusters=3
X_hat = np.concatenate((X_hat,X_missing_dummies),axis=1)
kmeans_m=kmeans_missing(X_hat,n_clusters)
kmeans_m.fit(X_hat,max_iter=100,number_of_runs=10)
print(kmeans_m.costs)
prediction,cost=kmeans_m.predict(X_hat)
for i in range(n_clusters):
    print([np.mean((prediction == i) * (true_labels == j)) for j in range(3)], np.mean((prediction == i)))
--EDIT--
In this example the occurrences of missing values are completely random, and when that is the case, not adding the missing-value dummies performs better, since the dummies are just noise in that setting. Not including them would also be the correct thing to do in order to compare with Ali's algorithm, as sketched below.
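The comparison run simply drops the dummy columns; this assumes the first two columns of X_hat are the original features (n_features=2 in make_fake_data above):
# same data as above, but without the missing-value dummy columns
X_plain = X_hat[:, :2]
kmeans_plain = kmeans_missing(X_plain, n_clusters)
kmeans_plain.fit(X_plain, max_iter=100, number_of_runs=10)
prediction_plain, cost_plain = kmeans_plain.predict(X_plain)
for i in range(n_clusters):
    print([np.mean((prediction_plain == i) * (true_labels == j)) for j in range(3)],
          np.mean(prediction_plain == i))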