Extracting confidence from scikit PassiveAggressiveClassifier() for single prediction - python

I have trained a PassiveAggressiveClassifier with a set of 165 categories.
Now I can already use it to predict certain inputs, but sometimes it fails, and it would be very helpful to know how "confident" the classifier is in each prediction and how the other categories compare.
As far as I understand I get the distances for each category using decision_function
distances = np.array(ppl.decision_function(sample))
which gives me something like this for the distances:
[-1.4222 -1.5083 -2.6488 -2.3428 -1.3167 -3.9615 -2.7804 -1.9563 -0.5054
-1.9524 -3.0026 -3.422 -2.1301 -2.0119 -2.1381 -2.2186 -2.0848 -2.4514
-1.9478 -2.3101 -2.4044 -1.9155 -1.569 -1.31 -1.4865 -2.3251 -1.7773
-1.304 -1.5215 -2.0634 -1.6987 -1.9217 -2.2863 -1.8166 -2.0219 -1.9594
-1.747 -2.1503 -2.162 -1.9507 -1.5971 -3.4499 -1.8946 -2.4328 -2.2415
-1.9045 -2.065 -1.9671 -1.8592 -1.6283 -1.7626 -2.2175 -2.1725 -3.7855
-5.1397 -3.6485 -4.4072 -2.2109 -2.048 -2.4887 -2.2324 -2.7897 -1.2932
-1.975 -1.516 -1.6127 -1.7135 -1.8243 -1.4887 -2.8973 -1.9656 -2.2236
-2.2466 -2.1224 -1.2247 -1.9657 -1.6138 -2.7787 -1.5004 -2.0136 -1.1001
-1.7226 -1.5829 -2.0317 -1.0834 -1.7444 -1.356 -2.3453 -1.7161 -2.2683
-2.2725 -0.4512 -4.5038 -2.0386 -2.1849 -2.4256 -1.5678 -1.8114 -2.2138
-2.2654 -1.8823 -2.7489 -1.8477 -2.1383 -1.6019 -2.84 -2.2595 -2.0764
-1.6758 -2.4279 -2.3489 -2.1884 -2.1888 -1.6289 -1.7358 -1.2989 -1.5656
-1.3362 -1.888 -2.1061 -1.4517 -2.0572 -2.4971 -2.2966 -2.6121 -2.4728
-2.8977 -1.7571 -2.4363 -1.4775 -1.7144 -2.047 -3.9252 -1.9907 -2.1808
-2.066 -1.9862 -1.4898 -2.3335 -2.6088 -2.4554 -2.4139 -1.7187 -2.2909
-1.4846 -1.8696 -2.444 -2.6253 -1.7738 -1.7192 -1.8737 -1.9977 -1.9948
-1.7667 -2.0704 -3.0147 -1.9014 -1.7713 -2.2551]
Now I have two questions:
First, is it possible to map the distances back to the categories, since the length of the array (159) does not match the number of categories?
Second, how can I calculate a confidence value for a single prediction using the distances?

Question 1
As per the comment, make sure all your classes are contained in the training set. You can achieve this, for example, by using the train_test_split function and passing your targets into the stratify parameter.
Once you do this the problem will disappear: there will be one classifier per class, so if you pass a sample to the decision_function method there will be one distance to the hyperplane for each class. For example:
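A minimal sketch of what that could look like (X, y and the estimator below are placeholders for your own data and pipeline):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier

# stratify=y keeps the class proportions, so every category appears in the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = PassiveAggressiveClassifier()
clf.fit(X_train, y_train)
print(len(clf.classes_))  # should now equal the number of categories (165)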
Question 2
You can turn the distances into probabilities through rescaling and normalizing (e.g. with a softmax). A rescaling of this kind is already implemented internally in the _predict_proba_lr method. See the source code here.
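As a rough sketch (applying a softmax yourself rather than the library's exact internals), with ppl and sample being the pipeline and input from the question:
import numpy as np

distances = ppl.decision_function(sample).ravel()
exp = np.exp(distances - distances.max())  # subtract the max for numerical stability
probs = exp / exp.sum()                    # pseudo-probabilities that sum to 1

best = int(np.argmax(probs))
print(ppl.classes_[best], probs[best])     # predicted category and its confidence score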

Related

High difference in predictions on different train test split sizes

I am unable to figure out the reason behind the contrasting difference in predictions on different train/test splits when training a linear model using LinearRegression.
This is my initial try on the data:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

x_train, x_test, y_train, true_p = train_test_split(train, y, random_state=121, test_size=0.2, shuffle=True)
lreg = LinearRegression()
lreg.fit(x_train, y_train)
train_pred = lreg.predict(x_train)
test_pred = lreg.predict(x_test)
This is the output of train_pred:
train_pred
array([12.37512481, 11.67234874, 11.82821202, ..., 12.61139596,
12.13886881, 12.42435563])
This is the output of test_pred:
test_pred
array([ 1.21885520e+01, 1.13462088e+01, 1.14144208e+01, 1.22832932e+01,
1.29980626e+01, 1.17641183e+01, 1.20982465e+01, 1.15846156e+01,
1.17403904e+01, 4.17353113e+07, 1.27941840e+01, 1.21739628e+01,
..., 1.22022858e+01, 1.15779229e+01, 1.24931376e+01, 1.26387188e+01,
1.18341585e+01, 1.18411881e+01, 1.21475986e+01, 1.25104774e+01])
The predictions from the two sets are hugely different, and the latter one looks like wrong predicted data.
I tried increasing the test size to 0.4, and then I received good predictions.
x_train, x_test, y_train, true_p = train_test_split(train, y, random_state=121, test_size=0.4, shuffle=True)
lreg = LinearRegression()
lreg.fit(x_train, y_train)
train_pred = lreg.predict(x_train)
test_pred = lreg.predict(x_test)
These are the outputs of train_pred and test_pred:
train_pred
array([11.95505983, 12.66847164, 11.81978843, 12.82992812, 12.44707462,
11.78809995, 11.92753084, 12.6082893 , 12.22644843, 11.93325658,
12.2449481 ,..., 11.69256008, 11.67984786, 12.54313682, 12.30652695])
test_pred
array([12.22133867, 11.18863973, 11.46923967, 12.26340761, 12.99240451,
11.77865948, 12.04321231, 11.44137667, 11.71213919, 11.44206212,
..., 12.15412777, 12.39184805, 10.96310233, 12.06243916, 12.11383494,
12.28327695, 11.19989021, 12.61439939, 12.22474378])
What is the reason behind this? How can I rectify this problem with the 0.2 train/test split?
Thank you
Check the units of your test_pred: they are all ×10 (as shown by the e+01 exponent). If you change numpy's print settings to remove the scientific notation with np.set_printoptions(suppress=True) and then print your test_pred, you should see that it looks very similar to train_pred. So in short, nothing is wrong. For example:
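A tiny illustration, using the test_pred array from the question:
import numpy as np

np.set_printoptions(suppress=True)  # disable scientific notation in printed arrays
print(test_pred)                    # now prints values like 12.1885..., matching train_pred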
When the data has very high variance, a very small test set can show significant differences in predictions; I would say it is underfitting.
Start by analyzing your dataset and you will see the main causes of this variance through basic descriptive statistics (graphs, measures of position and dispersion, etc.). After that, increase the size of your test set so that it is balanced, otherwise your study will be biased.
But from what I saw, everything is fine; the only "problem" is the notation, where e+01 means the number is multiplied by 10.

changing cluster labels for kmeans model

I have fit a KMeans model on document embeddings from a Doc2Vec model, to cluster the embeddings and get a visualization as well as the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run kmeans.fit_predict on the embeddings, it gives me a list of cluster labels, one per document embedding, for the number of clusters I have specified. The issue is that when I run the model multiple times it gives a similar spread per cluster each time, but the cluster labels change between runs. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and using the same model multiple times, but encountered the same issue. This causes the most frequent terms per cluster and the position of the clusters in the visualization to change, which changes the way they are interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure if it's an issue with my code or if this is normal behavior; if anyone can help it would be much appreciated.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure()
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
for i in range(clusters):
    df_temp = df[df["clusters"] == i]
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    [cluster_list.append(x[0]) for x in cluster_words]
    cluster_list.clear()
For KMeans, every time you run fit the centroids are initialized randomly. To make it deterministic you can use the random_state parameter; you can refer to the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer
Stabilizing the initialization randomization by specifying a random_state (per @qaiser's answer) may help – perhaps by ensuring that similar-ish sets of doc-vectors, against the same starting KMeans state, tend to find the 'same' clusters in the same named slots.
But there could be situations, where the doc-vectors have a different distribution, or where initialized state is (by bad luck) highly sensitive to doc-vector distribution, where even this repeated-initialization doesn't maintain coherent clusters.
You might want to also consider one or both of:
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, rename the clusters according to whichever of the 3! possible naming permutations of the 3 clusters leaves the smallest total distance between each 'new' cluster and the 'prior' cluster of the same name (see the sketch below).
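A hedged sketch of option (2), using scipy's linear_sum_assignment to pick the relabelling with the smallest total centroid-to-centroid distance (the function and variable names here are made up for illustration):
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def relabel_to_match(prev_centroids, new_centroids, new_labels):
    # cost[i, j] = distance between prior cluster i and new cluster j
    cost = cdist(prev_centroids, new_centroids)
    prev_idx, new_idx = linear_sum_assignment(cost)  # optimal one-to-one matching
    mapping = {new: prev for prev, new in zip(prev_idx, new_idx)}
    return np.array([mapping[label] for label in new_labels])

# hypothetical usage with two fitted KMeans models:
# stable_labels = relabel_to_match(run1.cluster_centers_, run2.cluster_centers_, run2.labels_)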
I think the issue might be the use of .fit_predict. Try just .predict; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.

Understanding the output of scipy.stats.multivariate_normal

I am trying to build a multidimensional Gaussian model using scipy.stats.multivariate_normal. I am trying to use the output of scipy.stats.multivariate_normal.pdf() to figure out whether a test value fits reasonably well in the observed distribution.
From what I understand, high values indicate a better fit to the given model and low values otherwise.
However, in my dataset I see extremely large pdf(x) results, which leads me to question whether I understand things correctly. The area under the PDF curve must be 1, so very large values are hard to comprehend.
For example, consider:
x = [-0.0007569417915494715, -0.01394295997613827, 0.000982078369890444, -0.03633664354397629, -0.03730583036106844, 0.013920453054506978, -0.08115836865224338, -0.07208494497398354, -0.06255237023298793, -0.0531888840386906, -0.006823760545565131]
mean = [0.01663645201261102, 0.07800335614699873, 0.016291452384234965, 0.012042931155488702, 0.0042637244100103885, 0.016531331606477996, -0.021702714746699842, -0.05738646649459681, 0.00921296058625439, 0.027940994009345254, 0.07548111758006244]
covariance = [[0.07921927017771506, 0.04780185747873293, 0.0788086850274493, 0.054129466248481264, 0.018799028456661045, 0.07523731808137141, 0.027682748950487425, -0.007296954729572955, 0.07935165417756569, 0.0569381100965656, 0.04185848489472492], [0.04780185747873293, 0.052300105044833595, 0.047749467098423544, 0.03254872837949123, 0.010582358713999951, 0.045792252383799206, 0.01969282984717051, -0.006089301208961258, 0.05067712814145293, 0.03146214776997301, 0.04452949330387575], [0.0788086850274493, 0.047749467098423544, 0.07841809405745602, 0.05374461924031552, 0.01871005609017673, 0.07487015790787396, 0.02756781074862818, -0.007327131572569985, 0.07895548129950304, 0.056417456686115544, 0.04181063355048408], [0.054129466248481264, 0.03254872837949123, 0.05374461924031552, 0.04538801863296238, 0.015795381235224913, 0.05055944754764062, 0.02017033995851422, -0.006505939129684573, 0.05497361331950649, 0.043858860182247515, 0.029356699144606032], [0.018799028456661045, 0.010582358713999951, 0.01871005609017673, 0.015795381235224913, 0.016260640022897347, 0.015459548918222347, 0.0064542528152879705, -0.0016656858963383602, 0.018761682220822192, 0.015361512546799405, 0.009832025009280924], [0.07523731808137141, 0.045792252383799206, 0.07487015790787396, 0.05055944754764062, 0.015459548918222347, 0.07207012779105286, 0.026330967917717253, -0.006907504360835279, 0.0753380831201204, 0.05335128471397023, 0.03998397595850863], [0.027682748950487425, 0.01969282984717051, 0.02756781074862818, 0.02017033995851422, 0.0064542528152879705, 0.026330967917717253, 0.020837940236441078, -0.003320408544812026, 0.027859582829638897, 0.01967636950969646, 0.017105000942890598], [-0.007296954729572955, -0.006089301208961258, -0.007327131572569985, -0.006505939129684573, -0.0016656858963383602, -0.006907504360835279, -0.003320408544812026, 0.024529061074105817, -0.007869287828047853, -0.006228903058681195, -0.0058974553248417995], [0.07935165417756569, 0.05067712814145293, 0.07895548129950304, 0.05497361331950649, 0.018761682220822192, 0.0753380831201204, 0.027859582829638897, -0.007869287828047853, 0.08169291677188911, 0.05731196406065222, 0.04450058445993234], [0.0569381100965656, 0.03146214776997301, 0.056417456686115544, 0.043858860182247515, 0.015361512546799405, 0.05335128471397023, 0.01967636950969646, -0.006228903058681195, 0.05731196406065222, 0.05064023101024737, 0.02830810316675855], [0.04185848489472492, 0.04452949330387575, 0.04181063355048408, 0.029356699144606032, 0.009832025009280924, 0.03998397595850863, 0.017105000942890598, -0.0058974553248417995, 0.04450058445993234, 0.02830810316675855, 0.040658283674780395]]
For this, if I compute y = multivariate_normal.pdf(x, mean, covariance),
the result is 342562705.3859754.
How could this be the case? Am I missing something?
Thanks.
This is fine. The probability density function can be larger than 1 at a specific point; it's the integral that must be equal to 1.
The idea that pdf < 1 is correct for discrete variables. For continuous ones, however, the pdf is not a probability; it's a density that must be integrated to give a probability. That is, the integral from minus infinity to infinity, over all dimensions, is equal to 1.
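A quick one-dimensional illustration (a made-up example, not the data from the question):
from scipy.stats import norm

tight = norm(loc=0, scale=0.001)     # a very narrow Gaussian
print(tight.pdf(0))                  # ~398.9: a density, comfortably above 1
print(tight.cdf(1) - tight.cdf(-1))  # ~1.0: the probability mass, which is what must not exceed 1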

How to make two out of one training example in tensorflow dataset using map

I have a number of training examples in my dataset and would like to rotate each one so that I get double the number. I am using datasets and tried it like this:
import tensorflow as tf
from math import pi

def addrotation(images, labels):
    images_rotated_left = tf.contrib.image.rotate(images, pi / 2.0)
    labels_rotated_left = tf.stack([labels[1], labels[2], labels[0]])
    return tf.stack([images, images_rotated_left]), tf.stack([labels, labels_rotated_left])
But when I now use dataset = dataset.map(addrotation), I get examples with double the data.
Is it possible to return the rotated tensors in a way so that they count as separate examples or "lines"?
Never mind, I found a solution:
I create a new dataset with all the rotated examples and then zip the two datasets together, as explained here:
https://stackoverflow.com/a/47344405/984336
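Roughly, the approach could look like the sketch below (written against the current tf.data API rather than tf.contrib, with tf.image.rot90 standing in for the rotation and square images assumed so the shapes stay compatible; names are illustrative, not the exact code from the linked answer):
import tensorflow as tf

def rotate_example(image, label):
    # stand-in for the rotation plus label permutation from the question
    return tf.image.rot90(image), tf.stack([label[1], label[2], label[0]])

rotated = dataset.map(rotate_example)

# zip the original and rotated datasets, then flatten each pair back out,
# so every source example yields two separate examples
doubled = tf.data.Dataset.zip((dataset, rotated)).flat_map(
    lambda a, b: tf.data.Dataset.from_tensors(a).concatenate(
        tf.data.Dataset.from_tensors(b)))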

How to represent Word2Vec model to graph? (or convert a 1x300 numpy array to just 1x2 array)

I have a 1x300 numpy array from my Word2Vec model, which returns something like this:
[ -2.55022556e-01 1.06162608e+00 -5.86191297e-01 -4.43067521e-01
4.46810514e-01 4.31743741e-01 2.16610283e-01 9.27684903e-01
-4.47879761e-01 -9.11142007e-02 3.27048987e-01 -8.05553675e-01
-8.54483843e-02 -2.85595834e-01 -2.70745698e-02 -3.08014955e-02
1.53204888e-01 3.16114485e-01 -2.82659411e-01 -2.98218042e-01
-1.03240972e-02 2.12806061e-01 1.63605273e-01 9.42423999e-01
1.20789325e+00 4.11570221e-01 -5.46323597e-01 1.95108235e-01
-4.53743488e-01 -1.28625661e-01 -7.43277609e-01 1.11551750e+00
-4.51873302e-01 -1.14495361e+00 -6.69551417e-02 6.88364863e-01
-6.01781428e-01 -2.36386538e-01 -3.64305973e-01 1.18274912e-01
2.03438237e-01 -1.01153564e+00 6.67958856e-01 1.80363625e-01
1.26524955e-01 -2.96024203e-01 -9.93479714e-02 -4.93405871e-02
1.02504417e-01 7.63318688e-02 -3.68398607e-01 3.03587675e-01
-2.90227026e-01 1.51891649e-01 -6.93689287e-03 -3.99766594e-01
-1.86124116e-01 -2.86920428e-01 2.04880714e-01 1.39914978e+00
1.84370011e-01 -4.58923727e-01 3.91094625e-01 -7.52937734e-01
3.05261135e-01 -4.55163687e-01 7.22679734e-01 -3.76093656e-01
6.05900526e-01 3.26470852e-01 4.72957864e-02 -1.18182398e-01
3.51043999e-01 -3.07209432e-01 -6.10330477e-02 4.14131492e-01
7.57511556e-02 -6.48704231e-01 1.42518353e+00 -9.20495167e-02
6.36665523e-01 5.48510313e-01 5.92754841e-01 -6.29535854e-01
-4.47180003e-01 -8.99413109e-01 -1.52441502e-01 -1.98326513e-01
4.74154204e-01 -2.07036674e-01 -6.70400202e-01 6.67807996e-01
-1.04234733e-01 7.16163218e-01 3.32825005e-01 8.20083246e-02
5.88186264e-01 4.06852067e-01 2.66174138e-01 -5.35981596e-01
3.26077454e-02 -4.04357493e-01 2.19569445e-01 -2.74264365e-01
-1.65187627e-01 -4.06753153e-01 6.12065434e-01 -1.89857081e-01
-5.56927800e-01 -6.78636551e-01 -7.52498448e-01 1.04564428e+00
5.32510102e-01 5.05628288e-01 1.95120305e-01 -6.40793025e-01
5.73082231e-02 -1.58281475e-02 -2.62718409e-01 1.74351722e-01
-6.95129633e-02 3.44214857e-01 -4.24746841e-01 -2.75907904e-01
-6.60992935e-02 -1.19041657e+00 -6.01056278e-01 5.67718685e-01
-6.47478551e-02 1.55902460e-01 -2.48480186e-01 5.56753576e-01
1.29889056e-01 3.91534269e-01 1.28707469e-01 1.29670590e-01
-6.98880851e-01 2.43386969e-01 7.70289376e-02 -1.14947490e-01
-4.31593180e-01 -6.16873622e-01 6.03831768e-01 -2.07050622e-01
1.23276520e+00 -1.67524610e-02 -4.67656374e-01 1.00281858e+00
5.17916441e-01 -7.99495637e-01 -4.22653735e-01 -1.45487636e-01
-8.71369673e-04 1.25453219e-01 -1.25869447e-02 4.66426492e-01
5.07026255e-01 -6.53024793e-01 7.53435045e-02 8.33864748e-01
3.37398499e-01 7.50920832e-01 -4.80326146e-01 -4.52838868e-01
5.92808545e-01 -3.57870340e-01 -1.07011057e-01 -1.13945460e+00
3.97635132e-01 1.23554178e-01 4.81683850e-01 5.47445454e-02
-2.18614921e-01 -2.00085923e-01 -3.73975009e-01 8.74632657e-01
6.71471596e-01 -4.01738763e-01 4.76147681e-01 -5.79257011e-01
-1.51511624e-01 1.43170074e-01 5.00052273e-01 1.46719962e-01
2.43085429e-01 5.89158475e-01 -5.25088668e-01 -2.65306592e-01
2.18211919e-01 3.83228660e-01 -2.51622144e-02 2.32621357e-01
8.06669474e-01 1.37254462e-01 4.59401071e-01 5.63044667e-01
-5.79878241e-02 2.68106610e-01 5.47239482e-01 -5.05441546e-01]
It's frustrating to work with because I just want to get a 1x2 array like [12, 19] so I can plot it on a graph and take a cosine distance measurement against another 1x2 array.
How do I do that? Or how can I represent the 1x300 Word2Vec vector on a 2D graph?
There are many ways to apply "dimensionality reduction" to high-dimensional data, for aid in interpretation or graphing.
One super-simple way to reduce your 300-dimensions to just 2-dimensions, for plotting on a flat screen/paper: just discard 298 of the dimensions! You'll have something to plot – such as the point (-0.255022556, 1.06162608) if taking just the 1st 2 dimensions of your example vector.
However, starting from word2vec vectors, those won't likely be very interesting points, individually or when you start plotting multiple words. The exact axes dimensions of such vectors are unlikely to be intuitively meaningful to humans, and you're throwing over 99% of each vector's meaning away – quite likely including the dimensions which (in concert with each other) capture semantically-meaningful relationships.
So you'd be more likely to do some more thoughtful dimensionality-reduction. A super-simple technique would be to pick two vector-directions that are thought to be meaningful as your new X and Y axes. In the word2vec world, these wouldn't necessarily be existing vectors in the set – though they could be – but might be the difference between two vectors. (The analogy-solving power of word2vec vectors essentially comes from discovering the difference-between two vectors A and B, then applying that difference to a third vector C to find a 4th vector D, at which point D often has the same human-intuitive analogical-relationship to C as B had to A.)
For example, you might difference the word-vectors for 'man' and 'woman', to get a vector which bootstraps your new X-axis. Then difference the word-vectors for 'parent' and 'worker', to get a vector which bootstraps your new Y-axis. Then, for every candidate 300-dimensional vector you want to plot, find that candidate vector's "new X" by calculating the magnitude of its projection onto your X-direction-vector, and its "new Y" by calculating the magnitude of its projection onto your Y-direction-vector. This might result in a set of relative values that, on a 2-D chart, vaguely match human intuitions about often-observed linguistic relationships between gender and familial/workplace roles.
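A hedged sketch of that projection idea (wv stands in for a trained gensim KeyedVectors object; the file name and chosen words are just examples):
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("my_word_vectors.kv")   # hypothetical path

x_dir = wv["man"] - wv["woman"]        # direction bootstrapping the new X axis
y_dir = wv["parent"] - wv["worker"]    # direction bootstrapping the new Y axis

def to_2d(vec):
    # scalar projection of vec onto each direction vector
    return (np.dot(vec, x_dir) / np.linalg.norm(x_dir),
            np.dot(vec, y_dir) / np.linalg.norm(y_dir))

print(to_2d(wv["king"]))   # a 2-D point you can plot directly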
As @poorna-prudhvi's comment mentions, PCA and t-SNE are other techniques – which may do better at preserving certain interesting qualities of the full-dimensional data. t-SNE, especially, was invented to support machine-learning and plotting, and tries to keep the distance relationships that existed in the higher number of dimensions similar in the lower number of dimensions.
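A minimal sketch of those two options with scikit-learn, assuming vectors is an (n_words, 300) array of your word vectors:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

coords_pca = PCA(n_components=2).fit_transform(vectors)
coords_tsne = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
# both results are (n_words, 2) arrays, ready for a 2-D scatter plot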
In addition to @gojomo's answer, if it's only for experimenting I'd recommend using TensorFlow's projector, which provides a nice GUI for out-of-the-box (approximate) PCA and t-SNE.
Just use numpy.savetxt to format your vectors properly.
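For example (a small sketch, where vectors is your (n_words, 300) array and words the matching list of tokens):
import numpy as np

np.savetxt("vectors.tsv", vectors, delimiter="\t")   # tab-separated vectors for the projector
with open("metadata.tsv", "w") as f:
    f.write("\n".join(words))                        # one label per line, same order as the vectors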
