I have fit a KMeans model on document embeddings from a Doc2Vec model to cluster the embeddings, get a visualization, and find the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run kmeans.fit_predict on the embeddings it gives me a list of cluster labels, one per document embedding, for the number of clusters I specified. The issue is that running the model multiple times gives a similar spread of documents per cluster each time, but the cluster labels themselves change between runs. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and using the same model multiple times but encountered the same issue. This causes the most frequent terms per cluster and the position of each cluster in the visualization to change, which changes the way it is interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure if it's an issue with my code or if this is normal behavior; if anyone can help it would be much appreciated.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure()
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
for i in range(clusters):
    df_temp = df[df["clusters"] == i]
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    cluster_list.extend(x[0] for x in cluster_words)
    cluster_list.clear()
For KMeans, every time you run fit the centroids are initialized randomly. To make the result deterministic you can use the random_state parameter; see the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer
Stabilizing the initialization randomization by specifying a random_state (per @qaiser's answer) may help, perhaps by ensuring that similar-ish sets of doc-vectors, against the same starting KMeans state, tend to find the 'same' clusters in the same named slots.
But there could be situations where the doc-vectors have a different distribution, or where the initialized state is (by bad luck) highly sensitive to the doc-vector distribution, such that even this repeated initialization doesn't maintain coherent clusters.
You might want to also consider one or both of the following (a rough sketch of both appears after this list):
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, rename the clusters according to which (of all possible 3! arbitrary naming permutations of 3 clusters) leaves the smallest possible total distances between each 'new' cluster of the same name to the 'prior' cluster of the same name.
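A minimal sketch of both ideas, assuming prev_model is last week's fitted KMeans and X is the new array of doc-vectors (both names are placeholders):
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

# (1) seed the new run with the prior centroids (n_init=1 since init is fixed)
new_model = KMeans(n_clusters=3, init=prev_model.cluster_centers_, n_init=1)
new_labels = new_model.fit_predict(X)

# (2) or rename the new clusters to the closest-matching old ones by solving
# the assignment problem on the centroid-to-centroid distance matrix
dists = np.linalg.norm(prev_model.cluster_centers_[:, None, :]
                       - new_model.cluster_centers_[None, :, :], axis=2)
old_idx, new_idx = linear_sum_assignment(dists)  # minimizes total distance
mapping = dict(zip(new_idx, old_idx))            # new label -> old label
renamed = np.array([mapping[lab] for lab in new_labels])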
I think the issue might be the use of .fit_predict, which re-fits the model on every call. Try fitting once and then using just .predict; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.
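In other words: fit once, persist the fitted model, and call only predict afterwards. A rough sketch using the question's own variables:
import joblib

kmeans_model.fit(d2v_model.docvecs.vectors_docs)         # fit exactly once
joblib.dump(kmeans_model, "kmeans.joblib")               # save the fitted model

loaded = joblib.load("kmeans.joblib")
labels = loaded.predict(d2v_model.docvecs.vectors_docs)  # labels stay stable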
Related
I have 100 clusters, each with a mean and a standard deviation value. These clusters were predefined using the SPSS software package's two-step cluster method, so the optimisation of these cluster distributions to fit the data has already been done.
For new (unseen) data, we want to assign cluster membership by selecting the maximum log-likelihood cluster for any given set of coordinates X. To do this, I have written my own code for comparison with what was output by SPSS using the same method: https://www.norusis.com/pdf/SPC_v19.pdf
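For reference, with independent normal coordinates the quantity I am maximizing for each cluster k is log L_k(X) = Σ_j log N(x_j | μ_kj, σ_kj); the code below accumulates the negative of this sum and takes the minimum, which should be equivalent.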
Using data that has already been correctly labelled by SPSS, about 42% of the points are labelled correctly by my code when minimising the RMSE to the cluster mean (which is not what SPSS does), and fewer than 20% are labelled correctly when assigning the maximum log-likelihood cluster (which is what SPSS reports to do).
I know that the maximum log-likelihood cluster should be the correct cluster (https://www.norusis.com/pdf/SPC_v19.pdf), but there is only a 20% success rate from this code when compared to the correct cluster labels from SPSS. What am I doing wrong?
Here is the code below.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import math
from scipy import stats
# import raw files
clusters_df = pd.read_csv('ClusterCoordinates.csv') # clusters are in order of cluster numbers enabling us to use index for identification
clusters_df = clusters_df.drop(columns=['Cluster'])
print(clusters_df.shape)
clusters = clusters_df.to_numpy()
frames_df_raw = pd.read_csv('FrameCoordinates.csv')
frames_df = frames_df_raw.drop(columns=['frame','replica','voltage','system','ff','cluster'])
print(frames_df.shape)
frames = frames_df.to_numpy()
clusters_sd_df = pd.read_csv('ClusterCoordinates_SD.csv')
clusters_sd_df = clusters_sd_df.drop(columns=['Cluster'])
print(clusters_sd_df.shape)
clusters_sd = clusters_sd_df.to_numpy()
rmseCalc = []
llCalc = []
assignedCluster_RMSE = []
assignedCluster_LL = []
# create tables with RMSE and LL values
for frame in frames:
    for cluster, cluster_sd in zip(clusters, clusters_sd):
        # we compare cluster assignment using minimum-RMSE vs maximum log-likelihood methods
        rmseCalc.append(math.sqrt(mean_squared_error(np.array(cluster), np.array(frame))))
        llCalc.append(-np.sum(stats.norm.logpdf(frame, loc=cluster, scale=cluster_sd)))
    rmseCalc = np.array(rmseCalc)
    llCalc = np.array(llCalc)
    llCalc = np.nan_to_num(llCalc)
    minRMSE = np.where(rmseCalc == rmseCalc.min())
    # llCalc holds negative log-likelihoods, so its minimum is the maximum likelihood
    maxLL = np.where(llCalc == llCalc.min())
    print(maxLL[0][0] + 1)
    assignedCluster_RMSE.append(minRMSE[0][0] + 1)
    assignedCluster_LL.append(maxLL[0][0] + 1)
    rmseCalc = []
    llCalc = []
frames_df_raw['predCluster_RMSE'] = np.array(assignedCluster_RMSE)
frames_df_raw['predCluster_LL'] = np.array(assignedCluster_LL)
frames_df_raw.to_csv('frames_clustered.csv')
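A vectorized equivalent of the double loop above, for reference (same semantics, NaN handling omitted):
# shape (n_frames, n_clusters): negative log-likelihood of every frame under every cluster
nll = -stats.norm.logpdf(frames[:, None, :], loc=clusters, scale=clusters_sd).sum(axis=2)
pred_ll = nll.argmin(axis=1) + 1  # same as assignedCluster_LL above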
I was expecting the cluster labels assigned by the code to match those already assigned by SPSS, since the methods used are intended to be the same.
I have Python code that builds several multi-fidelity models (one for each of several variables) and uses Emukit's experimental-design functions to update them iteratively. I am using simple uncertainty acquisition (ModelVariance) and the multi-fidelity-wrapped gradient optimizer, as shown in the examples here and here.
I started by applying this technique to only one of my several variables. When doing that I noticed that (1) all update points (x_new) seemed to be selected from the LF model, and (2) the variance dropped precipitously everywhere after adding only a single update point. I shrugged this off initially and applied the technique to all my variables, using a loop over a dictionary to do each variable in turn.
When I did that, I discovered that the mean predictions (new model points) seemed perfectly reasonable, but the variances reported by .predict() for ALL the models of ALL the variables were exactly the same, and were in fact the values the program had given me when doing just the single variable. Something seems to be going very wrong in finding and updating the variances after adding a new training point and using .set_data to update the model, and I am not sure what or where the problem is. Is there an Emukit bug? Am I using an incorrect setting? Is the problem with my dictionaries or for-loops? I am at a loss. Can anyone offer some insight?
Here is the code I currently have, somewhat redacted. I am sorry that it's such a long read....
# SKIPPING GENERAL IMPORTS
def make_mf(x, y, kernel, fidels):
    # Generic multi-fidelity model builder.
    # Returns a multi-fidelity model built from the training points (x and y),
    # the kernel, and the number of fidelities.
    mf_lin_model = GPyLinearMultiFidelityModel(x, y, kernel, n_fidelities=fidels)
    # fix noise to 0 for all fidelities, indicating training points are exact
    for i in range(fidels):
        if i == 0:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise.fix(0)"
        else:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise_" + str(i) + ".fix(0)"
        eval(caller)
    ## Wrap the model using the given 'GPyMultiOutputWrapper'
    mf_model = GPyMultiOutputWrapper(mf_lin_model, 2, n_optimization_restarts=5, verbose_optimization=False)
    # Fit the model
    mf_model.optimize()
    # Return the final model to the calling procedure
    return mf_model
np.random.seed(20)
# list of y (result variables)
yvars=["VAR1","VAR2","VAR3"]
#list of x (input) variables
xvars=["XVAR"]
# list of fidelity levels. levels should be in order of ascending fidelity (0=lowest)
levels=["lf","hf"]
# list of what we'll need to store for each variable and level
# these are the model itself, the predicted values for plotting,
# and the predicted values at the training points
contents=['surrogate','y_plot','y_train']
# list of multi-fidelity variables
# these are the training coordinates, the model, predicted values for plotting,
# predicted variances, the maximum and mean variance, and predicted
# values at the training points
multifivars=['y_plot','variance','varmax','varmean','pl_train']
mainvars=['model','x_train','y_train']
# set up a dictionary to store the models and related results for each y-variable
# and each fidelity
MyModels={key:{lkey:{ckey:None for ckey in contents} for lkey in levels} for key in yvars}
# Set up a dictionary for the multi-fidelity models
MultiFidelity={key:{vkey: None for vkey in mainvars}for key in yvars}
for key in MultiFidelity.keys():
for level in levels:
MultiFidelity[key][level]={mkey:None for mkey in multifivars}
#set up a dictionary to easily access data
MyData={key:None for key in levels}
# set up dictionaries to easily access training and plotting points
x_train={key:None for key in levels}
Y_plot={key:None for key in levels}
T_plot={key:None for key in levels}
# Number of initial points evaluated at each fidelity level
npoints=[5,2]
MyPoints={levels[i]:npoints[i] for i in range(len(levels))}
## SKIPPED THE SECTION WHERE I READ IN THE RAW DATA
# High sampling of models for plotting of functions
x_plot = np.linspace(2, 16, 200)[:, None]
# set up points for plotting and retrieving MF model
X_plot = convert_x_list_to_array([x_plot, x_plot])
for i in range(len(levels)):
    Y_plot[levels[i]] = X_plot[i*len(x_plot):(i+1)*len(x_plot)]
Y_plot_h = X_plot[len(x_plot):]
# Sampling for training for multi-fidelity analysis
x_train[levels[0]] = np.atleast_2d(np.random.rand(MyPoints[levels[0]])*14+2).T
for i in range(1, len(levels)):
    x_train[levels[i]] = np.atleast_2d(np.random.permutation(x_train[levels[i-1]])[:MyPoints[levels[i]]])
#x_train_h = np.atleast_2d([3, 9.5, 11, 15]).T
# set up points for plotting mf result at training points
X_train=convert_x_list_to_array([x_train[levels[0]],x_train[levels[0]]])
for i in range(len(levels)):
    T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
#print(X_train)
# combine the training points of all fidelity levels into a list of arrays
xtemp=[]
for level in levels:
    xtemp.append(x_train[level])
kernels = [GPy.kern.RBF(1), GPy.kern.RBF(1)]
lin_mf_kernel = emukit.multi_fidelity.kernels.LinearMultiFidelityKernel(kernels)
for var in MyModels.keys():
    ytemp = []
    for level in levels:
        # use SciPy interpolate to build surrogate for given variable and fidelity level
        MyModels[var][level]['surrogate'] = interpolate.interp1d(MyData[level]['Coll'], MyData[level][var])
        # find y-values for training MF points and append to a list of arrays
        MyModels[var][level]['y_train'] = MyModels[var][level]['surrogate'](x_train[level])
        ytemp.append(MyModels[var][level]['y_train'])
        MyModels[var][level]['y_plot'] = MyModels[var][level]['surrogate'](x_plot)
    ## Convert lists of arrays to ndarrays augmented with fidelity indicators
    MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'] = convert_xy_lists_to_arrays(xtemp, ytemp)
    # Build the multi-fidelity model
    ## Construct a linear multi-fidelity model
    MultiFidelity[var]['model'] = make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel, len(levels))
    # Get multi-fidelity model values and variances at plotting points
    for level in levels:
        MultiFidelity[var][level]['y_plot'], MultiFidelity[var][level]['variance'] = MultiFidelity[var]['model'].predict(Y_plot[level])
        # find maximum and average variance to measure the accuracy of the MF model
        MultiFidelity[var][level]['varmax'] = np.amax(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['varmean'] = np.mean(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
for key in MyModels.keys():
    for level in levels:
        print(key, level, MultiFidelity[key][level]['varmax'], MultiFidelity[key][level]['varmean'])
# set up the parameter space. we are scanning in x between 2 and 16 to match the range of my input
parameter_space = ParameterSpace([ContinuousParameter('x', 2, 16), InformationSourceParameter(len(levels))])
# set up how we will look for the target of our search
optimizer = MultiSourceAcquisitionOptimizer(GradientAcquisitionOptimizer(parameter_space), parameter_space)
# Plot each variable vs X for BEFORE any new points are added
for var in yvars:
    plot_vars(var, 0)
# Note: right now I am basing the acquisition function on the first variable ONLY. I intend to
# build a more complex function later when I get these bugs worked out.
acquisition = ModelVariance(MultiFidelity[yvars[0]]['model'])
# perform optimization to find the target point
x_new, val = optimizer.optimize(acquisition)
# x_new=np.atleast_2d(0)
# x_new[0][0]=np.random.rand()*14+2
print('first update points is',x_new)
# I want to manually specify that I add one HF training point and 4 LF training points,
# hence the way the following code is built. This could be a source of problems?
# construct our own version of the new data point because we will want it from the HF surrogate model
# (hence the value 1 in the final column)
new_point_x_hi = [[x_new[0][0],1.]]
# also, since this is an HF point, we include it as a training point in the LF model
new_point_x_lo = [[x_new[0][0],0.]]
# we also append the new x-value to the training point x-arrays
x_train[levels[0]]=np.append(x_train[levels[0]],[[x_new[0][0]]],axis=0)
x_train[levels[1]]=np.append(x_train[levels[1]],[[x_new[0][0]]],axis=0)
# next, prepare points to allow the plotting of the training points on each model
X_train=convert_x_list_to_array([x_train[levels[0]],x_train[levels[0]]])
for i in range(len(levels)):
    T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
for var in yvars:
    # Now, for every variable in our list we add training points and update the models
    # find the corresponding y-values from the respective surrogates
    new_point_y_hi = np.atleast_2d(MyModels[var]['hf']['surrogate'](x_new[0][0]))
    new_point_y_lo = np.atleast_2d(MyModels[var]['lf']['surrogate'](x_new[0][0]))
    # Note that, as usual, we make these into 2D arrays to match Emukit's formatting
    # now append the new point to our model's training data arrays
    MultiFidelity[var]['x_train'] = np.append(MultiFidelity[var]['x_train'], new_point_x_hi, axis=0)
    MultiFidelity[var]['y_train'] = np.append(MultiFidelity[var]['y_train'], new_point_y_hi, axis=0)
    MultiFidelity[var]['x_train'] = np.append(MultiFidelity[var]['x_train'], new_point_x_lo, axis=0)
    MultiFidelity[var]['y_train'] = np.append(MultiFidelity[var]['y_train'], new_point_y_lo, axis=0)
    # now we use .set_data to update the model based on the extended training data
    # MultiFidelity[var]['model'] = make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel, len(levels))
    MultiFidelity[var]['model'].set_data(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'])
    # and finally, re-calculate the values and variances at our plotting points to create an updated plot
    # MultiFidelity[var]['lf']['y_plot'], MultiFidelity[var]['lf']['variance'] = MultiFidelity[var]['model'].predict(Y_plot['lf'])
    # MultiFidelity[var]['hf']['y_plot'], MultiFidelity[var]['hf']['variance'] = MultiFidelity[var]['model'].predict(Y_plot['hf'])
    # MultiFidelity[var]['hf']['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot['hf'])
    # not forgetting to update the maximum and average variances
    for level in levels:
        # get new plotting points
        MultiFidelity[var][level]['y_plot'], MultiFidelity[var][level]['variance'] = MultiFidelity[var]['model'].predict(Y_plot[level])
        MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
        # find maximum and average variance to measure the accuracy of the MF model
        MultiFidelity[var][level]['varmax'] = np.amax(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['varmean'] = np.mean(MultiFidelity[var][level]['variance'])
        # report maximum and average variance
        print(var, level, 'max = ', MultiFidelity[var][level]['varmax'], 'mean = ', MultiFidelity[var][level]['varmean'])
    # Plot each variable vs Coll for rcas, helios and the low and high-fidelity models after the HF point is added
    plot_vars(var, 1)
# NOW DID THE SAME THING FOR A SEQUENCE OF 4 LF POINTS
I have tried using different acquisition functions and got the same behavior. I have also tried rebuilding the model from scratch using model.optimize() and only got stranger behavior.
I have a data set that contains comments from bird watchers. I used the TF-IDF vectorizer to convert the text comments into vector features, and then ran K-means clustering to separate my data into clusters. I have a set of clear clusters. However, I have been trying to find out which words made it into which clusters. I am aware of how to get the feature labels/names, but I want to see the actual data points under each feature, and then convert them back to the original words. I am using Python and Scikit-Learn's K-means algorithm.
def final_k_model(X, finalk):
    final_k_mod = KMeans(n_clusters=finalk, init='random', n_init=10, max_iter=300, tol=1e-04, random_state=0)
    final_k_mod.fit(X)
    # plot the results:
    centroids = final_k_mod.cluster_centers_
    tsne_init = 'pca'
    tsne_perplexity = 20.0
    tsne_early_exaggeration = 4.0
    tsne_learning_rate = 1000
    random_state = 1
    tsnemodel = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
                     early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
    transformed_centroids = tsnemodel.fit_transform(centroids)
    plt.figure(1)
    plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x')
    plt.savefig('plots\\cluster.png')
    plt.show()
    return final_k_mod
I included some code, but I'm not sure if it helps since I don't have an error. I am just trying to figure out whether this is even possible; I've been googling and looking at tutorials but haven't found an answer.
Assuming you calculated the X in your code by the following method,
#corpus = list of all documents
#vocab = list of all words in corpus
tdf_idf = TfidfVectorizer(vocabulary=vocab)
X = tdf_idf.fit_transform(corpus)
is the following what you are looking for?
for centroid in centroids:
    score_this_centroid = {}
    for word in tdf_idf.vocabulary_.keys():
        score_this_centroid[word] = centroid[tdf_idf.vocabulary_[word]]
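If you want the top words per cluster rather than the full score dictionary, here is a small extension of the same idea (a sketch, assuming final_k_mod is your fitted KMeans from the question; get_feature_names_out needs scikit-learn >= 1.0, older versions use get_feature_names):
import numpy as np

terms = tdf_idf.get_feature_names_out()                  # column index -> word
order = np.argsort(final_k_mod.cluster_centers_, axis=1)[:, ::-1]
for k, idx in enumerate(order):
    print(k, [terms[i] for i in idx[:10]])               # 10 highest-weight words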
I apologize for a longer than usual intro, but it is important for the question:
I've recently been assigned to work on an existing project, which uses Keras+Tensorflow to create a Fully Connected Net.
Overall the model has 3 fully connected layers of 500 neurons each and 2 output classes; the first layer's 500 neurons are connected to the 82 input features. The model is used in production and is retrained weekly, using that week's information generated by an outer source.
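For reference, the architecture as I understand it looks roughly like this in Keras terms (a sketch; the activations and loss are my guesses, not taken from the actual project):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(500, activation='relu', input_shape=(82,)),  # 82 input features
    Dense(500, activation='relu'),
    Dense(500, activation='relu'),
    Dense(2, activation='softmax'),                    # 2 output classes
])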
The engineer who designed the model is no longer working here, and I'm trying to reverse engineer and understand its behavior.
A couple of objectives I have defined for myself are:
Understand the feature selection process and feature importance.
Understand and control the weekly re-training process.
In order to try and answer both of them, I've implemented an experiment where I feed my code with two models: one from the previous week and the other from the current week:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from keras.models import model_from_json
path1 = 'C:/Model/20190114/'
path2 = 'C:/Model/20190107/'
model_name1 = '0_10.1'
model_name2 = '0_10.2'
models = [path1 + model_name1, path2 + model_name2]
features_cum_weight = {}
I then take each feature and sum the absolute values of all the weights connecting it to the first hidden layer.
This way I create two vectors of 82 values:
for model_name in models:
    structure_filename = model_name + "_structure.json"
    weights_filename = model_name + "_weights.h5"
    with open(structure_filename, 'r') as model_json:
        model = model_from_json(model_json.read())
    model.load_weights(weights_filename)
    in_layer_weights = model.layers[0].get_weights()[0]
    in_layer_weights = abs(in_layer_weights)
    features_cum_weight[model_name] = in_layer_weights.sum(axis=1)
I then plot them, using MatplotLib:
# Plot the Evolvement of Input Neuron Weights:
keys = list(features_cum_weight.keys())
weights_1 = features_cum_weight[keys[0]]
weights_2 = features_cum_weight[keys[1]]
fig, ax = plt.subplots(nrows=2, ncols=2)
width = 0.35 # the width of the bars
n_plots = 4
batch = int(np.ceil(len(weights_1)/n_plots))
for i in range(n_plots):
    start = i*(batch+1)
    stop = min(len(weights_1), start + batch + 1)
    cur_w1 = weights_1[start:stop]
    cur_w2 = weights_2[start:stop]
    ind = np.arange(len(cur_w1))
    cur_ax = ax[i//2][i%2]
    cur_ax.bar(ind - width/2, cur_w1, width, color='SkyBlue', label='Current Model')
    cur_ax.bar(ind + width/2, cur_w2, width, color='IndianRed', label='Previous Model')
    cur_ax.set_ylabel('Sum of Weights')
    cur_ax.set_title('Sum of all weights connected by feature')
    cur_ax.set_xticks(ind)
    cur_ax.legend()
    cur_ax.set_ylim(0, 30)
plt.show()
Resulting in the following plot:
[Matplotlib figure: four panels of grouped bar charts comparing the per-feature summed input-layer weights of the current and previous models]
I then try to compare the vectors to deduce:
If the vectors have changed drastically, there might be some major change in the training data or some problem while retraining the model.
If some value is close to zero, the model might have recognized that feature as unimportant.
I want your opinion and insights on the following:
The overall approach to this experiment.
Advice on other ideas on reverse engineering on a given model.
Insights on the output I provide here.
Thank you all, I am open to any suggestions and criticism!
This type of deduction is not entirely valid: the combination between the features is not linear. It is true that a feature whose weights are strictly 0 does not matter at the first layer, but otherwise it may be recombined in another way in a deeper layer.
Your reasoning would hold if your model were linear. In fact, this is how PCA works, where it searches for linear relationships through the covariance matrix; the eigenvalues indicate the importance of each feature.
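For instance, with scikit-learn (a sketch; X stands for your 82-feature training matrix):
from sklearn.decomposition import PCA

pca = PCA().fit(X)
print(pca.explained_variance_)  # eigenvalues of the covariance matrix
print(pca.components_)          # the linear combinations of the features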
I think that there are several ways to confirm your suspicions:
Eliminate the features that you think are not important, train again, and see the result. If it is similar, your suspicions are correct.
Apply the current model: take an example to evaluate (we will call it a pivot), significantly change the features that you consider irrelevant, and so create many examples. Repeat this for several pivots. If the results are similar, those fields should not matter. An example follows (I consider the first feature to be irrelevant):
import numpy as np

data = np.array([[0.5, 1, 0.5], [1, 2, 5]])  # the pivots
range_values = 50
new_data = []
for i in range(data.shape[0]):
    sample = data[i]
    # we create new samples by perturbing the (supposedly irrelevant) first feature
    for j in range(1000):
        noise = np.random.rand() * range_values
        new_sample = sample.copy()
        new_sample[0] += noise
        new_data.append(new_sample)
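The perturbed samples can then be scored with the existing net; if the predictions for each pivot barely change across its 1000 perturbed copies, that feature most likely does not matter. For example (assuming model is the trained network):
preds = model.predict(np.array(new_data))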
I have trained a PassiveAggressiveClassifier with a set of 165 categories.
Now I can already use it to predict certain inputs, but it sometimes fails, and it would be very helpful to know how "confident" the classifier is about each prediction and which other categories it considered.
As far as I understand, I get the distances for each category using decision_function:
distances = np.array(ppl.decision_function(sample))
which gives me something like this for the distances:
[-1.4222 -1.5083 -2.6488 -2.3428 -1.3167 -3.9615 -2.7804 -1.9563 -0.5054
-1.9524 -3.0026 -3.422 -2.1301 -2.0119 -2.1381 -2.2186 -2.0848 -2.4514
-1.9478 -2.3101 -2.4044 -1.9155 -1.569 -1.31 -1.4865 -2.3251 -1.7773
-1.304 -1.5215 -2.0634 -1.6987 -1.9217 -2.2863 -1.8166 -2.0219 -1.9594
-1.747 -2.1503 -2.162 -1.9507 -1.5971 -3.4499 -1.8946 -2.4328 -2.2415
-1.9045 -2.065 -1.9671 -1.8592 -1.6283 -1.7626 -2.2175 -2.1725 -3.7855
-5.1397 -3.6485 -4.4072 -2.2109 -2.048 -2.4887 -2.2324 -2.7897 -1.2932
-1.975 -1.516 -1.6127 -1.7135 -1.8243 -1.4887 -2.8973 -1.9656 -2.2236
-2.2466 -2.1224 -1.2247 -1.9657 -1.6138 -2.7787 -1.5004 -2.0136 -1.1001
-1.7226 -1.5829 -2.0317 -1.0834 -1.7444 -1.356 -2.3453 -1.7161 -2.2683
-2.2725 -0.4512 -4.5038 -2.0386 -2.1849 -2.4256 -1.5678 -1.8114 -2.2138
-2.2654 -1.8823 -2.7489 -1.8477 -2.1383 -1.6019 -2.84 -2.2595 -2.0764
-1.6758 -2.4279 -2.3489 -2.1884 -2.1888 -1.6289 -1.7358 -1.2989 -1.5656
-1.3362 -1.888 -2.1061 -1.4517 -2.0572 -2.4971 -2.2966 -2.6121 -2.4728
-2.8977 -1.7571 -2.4363 -1.4775 -1.7144 -2.047 -3.9252 -1.9907 -2.1808
-2.066 -1.9862 -1.4898 -2.3335 -2.6088 -2.4554 -2.4139 -1.7187 -2.2909
-1.4846 -1.8696 -2.444 -2.6253 -1.7738 -1.7192 -1.8737 -1.9977 -1.9948
-1.7667 -2.0704 -3.0147 -1.9014 -1.7713 -2.2551]
Now I have two questions:
First, is it possible to map the distances back to the categories? The length of the array (159) does not match my categories array (165).
Second, how can I calculate a confidence for a single prediction using the distances?
Question 1
As per the comment, make sure all your classes are contained in the training set. You can achieve this, for example, by using the train_test_split function and passing your targets to the stratify parameter.
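For example (a sketch; X and y stand for your features and 165-category targets):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)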
Once you do this, the problem will disappear and there will be one classifier per class. As a result, if you pass a sample to the decision_function method there will be one distance to the hyperplane for each class.
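The columns of the decision_function output line up with the classifier's classes_ attribute, so you can map the distances back to category names like this (a sketch, using the question's variables):
scores = dict(zip(ppl.classes_, distances.ravel()))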
Question 2
You can turn the distances into probabilities through rescaling and normalization (i.e. softmax). This is already implemented internally in the _predict_proba_lr method; see the source code here.
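A minimal sketch of that rescaling (a plain softmax over the class distances; the internal implementation differs slightly, so treat this as an approximation):
import numpy as np

d = distances.ravel()
d = d - d.max()                        # shift for numerical stability
probs = np.exp(d) / np.exp(d).sum()    # softmax
best = probs.argmax()
print(ppl.classes_[best], probs[best]) # predicted category and its confidence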