High difference in predictions on different train test split sizes - python

I cannot figure out why the predictions differ so drastically between different train/test splits when training a linear model with LinearRegression.
This is my initial attempt on the data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, true_p = train_test_split(train, y, random_state=121, test_size=0.2, shuffle=True)
lreg = LinearRegression()
lreg.fit(x_train, y_train)
train_pred = lreg.predict(x_train)
test_pred = lreg.predict(x_test)
This is the output of train_pred:
train_pred
array([12.37512481, 11.67234874, 11.82821202, ..., 12.61139596,
12.13886881, 12.42435563])
This is the output of test_pred:
test_pred
array([ 1.21885520e+01, 1.13462088e+01, 1.14144208e+01, 1.22832932e+01,
1.29980626e+01, 1.17641183e+01, 1.20982465e+01, 1.15846156e+01,
1.17403904e+01, 4.17353113e+07, 1.27941840e+01, 1.21739628e+01,
..., 1.22022858e+01, 1.15779229e+01, 1.24931376e+01, 1.26387188e+01,
1.18341585e+01, 1.18411881e+01, 1.21475986e+01, 1.25104774e+01])
The two sets of predictions differ hugely, and the test predictions look wrong.
I tried increasing the test size to 0.4, and now I get good predictions:
x_train, x_test, y_train, true_p = train_test_split(train, y, random_state=121, test_size=0.4, shuffle=True)
lreg = LinearRegression()
lreg.fit(x_train, y_train)
train_pred = lreg.predict(x_train)
test_pred = lreg.predict(x_test)
These are the outputs of train_pred and test_pred:
train_pred
array([11.95505983, 12.66847164, 11.81978843, 12.82992812, 12.44707462,
11.78809995, 11.92753084, 12.6082893 , 12.22644843, 11.93325658,
12.2449481 ,..., 11.69256008, 11.67984786, 12.54313682, 12.30652695])
test_pred
array([12.22133867, 11.18863973, 11.46923967, 12.26340761, 12.99240451,
11.77865948, 12.04321231, 11.44137667, 11.71213919, 11.44206212,
..., 12.15412777, 12.39184805, 10.96310233, 12.06243916, 12.11383494,
12.28327695, 11.19989021, 12.61439939, 12.22474378])
What is the reason behind this, and how can I rectify the problem with the 0.2 train/test split?
Thank you

Check the units of your test_pred. Most of the values are simply ×10 (note the e+01 exponent). If you set numpy's print options to suppress scientific notation with np.set_printoptions(suppress=True) and then print test_pred, you should see that it looks very similar to train_pred. So, in short, nothing is wrong.
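For example, a quick sketch (assuming test_pred is the array printed above):
import numpy as np

np.set_printoptions(suppress=True)   # turn off scientific notation
print(test_pred)                     # most entries now print as ~11-13, like train_pred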

When the data has very high variance, a very small test set can produce significant differences in predictions; I would call this underfitting.
Start by analyzing your dataset and you will see the main causes of this variance through basic descriptive statistics (plots, measures of location and dispersion, etc.). After that, increase the size of your test set so that it is balanced, otherwise your study will be biased.
But from what I saw, everything is fine; the only "problem" is the notation: e+01 just means the number is multiplied by 10.
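As a starting point for that analysis, a rough sketch (assuming train and y are the objects passed to train_test_split in the question):
import pandas as pd
import matplotlib.pyplot as plt

features = pd.DataFrame(train)
print(features.describe())        # count, mean, std, min, quartiles, max per feature
print(pd.Series(y).describe())    # same summary for the target

features.plot(kind='box', figsize=(12, 4))   # boxplots make spread and outliers easy to spot
plt.show()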

Related

Odd linear model results

I'm unit acceptance testing some code I wrote. It's conceivable that at some point in the real world we will have input data where the dependent variable is constant. Not the norm, but possible. A linear model should yield coefficients of 0 in this case (right?), which is fine and what we would want -- but for some reason I'm getting some wild results when I try to fit the model on this use case.
I have tried 3 models and get different weird results every time -- or no results at all in some cases.
For this use case all of the dependent observations are set at 100, all the freq_weights are set at 1, and the independent variables are a binary coded dummy set of 20 features.
In total there are 150 observations.
Again, this data is unlikely in the real world, but I need my code to be able to work on this ugly data. I don't know why I'm getting such erroneous and inconsistent results.
As I understand with no variance in the dependent variable I should be getting 0 for all my coefficients.
import statsmodels.api as sm

freq = freq['Freq']               # frequency weights column
Indies = sm.add_constant(df)      # independent variables plus intercept
model = sm.OLS(df1, Indies)       # df1 holds the constant dependent variable
res = model.fit()
res.params
yields:
const 65.990203
x1 17.214836
reg = sm.GLM(df1, Indies, freq_weights=freq)
results = reg.fit(method='lbfgs', max_start_irls=0)
results.params
yields:
const 83.205034
x1 82.575228
reg = sm.GLM(df1, Indies, freq_weights=freq)
result2 = reg.fit()
result2.params
yields
PerfectSeparationError: Perfect separation detected, results not available
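For reference, a self-contained toy of the expectation described above (constant dependent variable, binary dummies), assuming an independently generated, non-collinear design; the names and data here are illustrative, not the questioner's:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 20))   # 150 observations, 20 binary dummy features
y = np.full(150, 100.0)                  # dependent variable fixed at 100

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.params.round(6))               # const ~ 100, every slope ~ 0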

changing cluster labels for kmeans model

I have fit a Kmeans model on document embeddings from a Doc2Vec model to cluster the embeddings and get a visualization as well as the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run kmeans.fit_predict on the embeddings, it gives me a list of cluster labels, one per document embedding, using the number of clusters I specified. The issue is that when I run the model multiple times I get a similar spread per cluster each time, but the cluster labels change between runs. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and reusing it multiple times but encountered the same issue. This causes the most frequent terms per cluster and the position of each cluster in the visualization to change, which changes the way it is interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure whether it's an issue with my code or normal behavior; if anyone can help it would be much appreciated.
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure()
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
for i in range(clusters):
    df_temp = df[df["clusters"] == i]
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    [cluster_list.append(x[0]) for x in cluster_words]
    cluster_list.clear()
For KMeans, every time you run fit the centroids are initialized randomly. To make it deterministic you can use the random_state parameter; see the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer
Stabilizing the initialization by specifying a random_state (per qaiser's answer) may help: with the same starting KMeans state, similar-ish sets of doc-vectors will tend to land in the 'same' clusters in the same named slots.
But there could be situations where the doc-vectors have a different distribution, or where the initialized state is (by bad luck) highly sensitive to the doc-vector distribution, so that even this repeated initialization doesn't maintain coherent clusters.
You might want to also consider one or both of:
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, renaming the clusters according to whichever of the 3! possible naming permutations of 3 clusters leaves the smallest total distance between each 'new' cluster and the 'prior' cluster of the same name; a sketch of this follows below.
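A minimal sketch of option (2), assuming prev_centroids holds the prior run's cluster centers and kmeans_model is the newly fitted model (placeholder names, not from the question):
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Cost matrix: distance from each new centroid to each prior centroid
cost = cdist(kmeans_model.cluster_centers_, prev_centroids)

# The Hungarian algorithm picks the permutation with the smallest total distance
new_idx, prior_idx = linear_sum_assignment(cost)
mapping = dict(zip(new_idx, prior_idx))

# Rename the new run's labels so cluster k lines up with the prior run's cluster mapping[k]
relabeled = np.array([mapping[label] for label in kmeans_model.labels_])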
I think the issue might be the use of .fit_predict. Try just .predict; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.

Understanding the output of scipy.stats.multivariate_normal

I am trying to build a multidimensional Gaussian model using scipy.stats.multivariate_normal. I am trying to use the output of scipy.stats.multivariate_normal.pdf() to figure out whether a test value fits reasonably well within the observed distribution.
From what I understand, high values indicate a better fit to the given model and low values the opposite.
However, in my dataset I see extremely large pdf(x) results, which leads me to question whether I understand things correctly. The area under the PDF curve must be 1, so very large values are hard to comprehend.
For example, consider:
x = [-0.0007569417915494715, -0.01394295997613827, 0.000982078369890444, -0.03633664354397629, -0.03730583036106844, 0.013920453054506978, -0.08115836865224338, -0.07208494497398354, -0.06255237023298793, -0.0531888840386906, -0.006823760545565131]
mean = [0.01663645201261102, 0.07800335614699873, 0.016291452384234965, 0.012042931155488702, 0.0042637244100103885, 0.016531331606477996, -0.021702714746699842, -0.05738646649459681, 0.00921296058625439, 0.027940994009345254, 0.07548111758006244]
covariance = [[0.07921927017771506, 0.04780185747873293, 0.0788086850274493, 0.054129466248481264, 0.018799028456661045, 0.07523731808137141, 0.027682748950487425, -0.007296954729572955, 0.07935165417756569, 0.0569381100965656, 0.04185848489472492], [0.04780185747873293, 0.052300105044833595, 0.047749467098423544, 0.03254872837949123, 0.010582358713999951, 0.045792252383799206, 0.01969282984717051, -0.006089301208961258, 0.05067712814145293, 0.03146214776997301, 0.04452949330387575], [0.0788086850274493, 0.047749467098423544, 0.07841809405745602, 0.05374461924031552, 0.01871005609017673, 0.07487015790787396, 0.02756781074862818, -0.007327131572569985, 0.07895548129950304, 0.056417456686115544, 0.04181063355048408], [0.054129466248481264, 0.03254872837949123, 0.05374461924031552, 0.04538801863296238, 0.015795381235224913, 0.05055944754764062, 0.02017033995851422, -0.006505939129684573, 0.05497361331950649, 0.043858860182247515, 0.029356699144606032], [0.018799028456661045, 0.010582358713999951, 0.01871005609017673, 0.015795381235224913, 0.016260640022897347, 0.015459548918222347, 0.0064542528152879705, -0.0016656858963383602, 0.018761682220822192, 0.015361512546799405, 0.009832025009280924], [0.07523731808137141, 0.045792252383799206, 0.07487015790787396, 0.05055944754764062, 0.015459548918222347, 0.07207012779105286, 0.026330967917717253, -0.006907504360835279, 0.0753380831201204, 0.05335128471397023, 0.03998397595850863], [0.027682748950487425, 0.01969282984717051, 0.02756781074862818, 0.02017033995851422, 0.0064542528152879705, 0.026330967917717253, 0.020837940236441078, -0.003320408544812026, 0.027859582829638897, 0.01967636950969646, 0.017105000942890598], [-0.007296954729572955, -0.006089301208961258, -0.007327131572569985, -0.006505939129684573, -0.0016656858963383602, -0.006907504360835279, -0.003320408544812026, 0.024529061074105817, -0.007869287828047853, -0.006228903058681195, -0.0058974553248417995], [0.07935165417756569, 0.05067712814145293, 0.07895548129950304, 0.05497361331950649, 0.018761682220822192, 0.0753380831201204, 0.027859582829638897, -0.007869287828047853, 0.08169291677188911, 0.05731196406065222, 0.04450058445993234], [0.0569381100965656, 0.03146214776997301, 0.056417456686115544, 0.043858860182247515, 0.015361512546799405, 0.05335128471397023, 0.01967636950969646, -0.006228903058681195, 0.05731196406065222, 0.05064023101024737, 0.02830810316675855], [0.04185848489472492, 0.04452949330387575, 0.04181063355048408, 0.029356699144606032, 0.009832025009280924, 0.03998397595850863, 0.017105000942890598, -0.0058974553248417995, 0.04450058445993234, 0.02830810316675855, 0.040658283674780395]]
For this, if I compute y = multivariate_normal.pdf(x, mean, covariance),
the result is 342562705.3859754.
How can this be the case? Am I missing something?
Thanks.
This is fine. The probability density function can be larger than 1 at a specific point; it is the integral that must equal 1.
The intuition that the values must be at most 1 is correct for discrete variables, where they are probabilities. For continuous variables, however, the pdf is not a probability: it is a density that is integrated to obtain a probability, and the integral over all dimensions, from minus infinity to infinity, equals 1.
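To see why, a one-dimensional sketch (numbers chosen arbitrarily): a narrow Gaussian has a density far above 1 at its mean, yet it still integrates to 1.
from scipy.stats import norm

# Density at the mean of N(0, 0.001^2): 1 / (0.001 * sqrt(2*pi)) ~= 398.9
print(norm.pdf(0.0, loc=0.0, scale=0.001))
In 11 dimensions with a small covariance determinant, the peak density is correspondingly larger, which is why a value like 3.4e8 is entirely plausible.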

Extracting confidence from scikit PassiveAggressiveClassifier() for single prediction

I have trained a PassiveAggressiveClassifier with a set of 165 categories.
I can already use it to predict certain inputs, but it sometimes fails, and it would be very helpful to know how "confident" the classifier is in each prediction and which other categories it considered.
As far as I understand, I get the distances for each category using decision_function:
distances = np.array(ppl.decision_function(sample))
which gives me something like this for the distances:
[-1.4222 -1.5083 -2.6488 -2.3428 -1.3167 -3.9615 -2.7804 -1.9563 -0.5054
-1.9524 -3.0026 -3.422 -2.1301 -2.0119 -2.1381 -2.2186 -2.0848 -2.4514
-1.9478 -2.3101 -2.4044 -1.9155 -1.569 -1.31 -1.4865 -2.3251 -1.7773
-1.304 -1.5215 -2.0634 -1.6987 -1.9217 -2.2863 -1.8166 -2.0219 -1.9594
-1.747 -2.1503 -2.162 -1.9507 -1.5971 -3.4499 -1.8946 -2.4328 -2.2415
-1.9045 -2.065 -1.9671 -1.8592 -1.6283 -1.7626 -2.2175 -2.1725 -3.7855
-5.1397 -3.6485 -4.4072 -2.2109 -2.048 -2.4887 -2.2324 -2.7897 -1.2932
-1.975 -1.516 -1.6127 -1.7135 -1.8243 -1.4887 -2.8973 -1.9656 -2.2236
-2.2466 -2.1224 -1.2247 -1.9657 -1.6138 -2.7787 -1.5004 -2.0136 -1.1001
-1.7226 -1.5829 -2.0317 -1.0834 -1.7444 -1.356 -2.3453 -1.7161 -2.2683
-2.2725 -0.4512 -4.5038 -2.0386 -2.1849 -2.4256 -1.5678 -1.8114 -2.2138
-2.2654 -1.8823 -2.7489 -1.8477 -2.1383 -1.6019 -2.84 -2.2595 -2.0764
-1.6758 -2.4279 -2.3489 -2.1884 -2.1888 -1.6289 -1.7358 -1.2989 -1.5656
-1.3362 -1.888 -2.1061 -1.4517 -2.0572 -2.4971 -2.2966 -2.6121 -2.4728
-2.8977 -1.7571 -2.4363 -1.4775 -1.7144 -2.047 -3.9252 -1.9907 -2.1808
-2.066 -1.9862 -1.4898 -2.3335 -2.6088 -2.4554 -2.4139 -1.7187 -2.2909
-1.4846 -1.8696 -2.444 -2.6253 -1.7738 -1.7192 -1.8737 -1.9977 -1.9948
-1.7667 -2.0704 -3.0147 -1.9014 -1.7713 -2.2551]
Now I have two questions:
1st: is it possible to map the distances back to the categories? The length of the array (159) does not match my number of categories (165).
2nd: how can I calculate a confidence for a single prediction using the distances?
Question 1
As per the comment, make sure all of your classes are contained in the training set. You can achieve this, for example, by using the train_test_split function and passing your targets to the stratify parameter.
Once you do this, the problem will disappear and there will be one classifier per class. As a result, if you pass a sample to the decision_function method, you will get one distance to the hyperplane for each class.
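A minimal sketch of the stratified split, assuming X and y are your full feature matrix and the 165-category target (placeholder names, not from the question):
from sklearn.model_selection import train_test_split

# stratify=y keeps every category represented in both splits
# (each class needs at least 2 samples for this to work)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)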
Question 2
You can turn the distances into probabilities by rescaling and normalizing them (i.e. a softmax). This is already implemented internally in the _predict_proba_lr method; see the scikit-learn source code.
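A minimal sketch of that softmax rescaling over the decision_function scores, assuming ppl is the fitted classifier/pipeline from the question:
import numpy as np

scores = ppl.decision_function(sample).ravel()

# Numerically stable softmax over the per-class distances
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

# Pair each pseudo-probability with its category label and show the top 5
best = sorted(zip(ppl.classes_, probs), key=lambda t: t[1], reverse=True)[:5]
print(best)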

Scikit-learn SVC always giving accuracy 0 on random data cross validation

In the following code I create a random sample set of size 50, with 20 features each. I then generate a random target vector composed of half True and half False values.
All of the values are stored in Pandas objects, since this simulates a real scenario in which the data will be given in that way.
I then perform a manual leave-one-out inside a loop, each time selecting an index, dropping its respective data, fitting the rest of the data using a default SVC, and finally running a prediction on the left-out data.
import random

import numpy as np
import pandas as pd
from sklearn.svm import SVC

n_samp = 50
m_features = 20

X_val = np.random.rand(n_samp, m_features)
X = pd.DataFrame(X_val, index=range(n_samp))
# print(X_val)

y_val = [True] * (n_samp // 2) + [False] * (n_samp // 2)
random.shuffle(y_val)
y = pd.Series(y_val, index=range(n_samp))
# print(y_val)

success_count = 0
for idx in y.index:
    clf = SVC()  # Can be inside or outside the loop. Result is the same.

    # Leave-one-out for the fitting phase
    loo_X = X.drop(idx)
    loo_y = y.drop(idx)
    clf.fit(loo_X.values, loo_y.values)

    # Make a prediction on the sample that was left out
    pred_X = X.loc[idx:idx]
    pred_result = clf.predict(pred_X.values)
    print(y.loc[idx], pred_result[0])  # Actual value vs. predicted value - always opposite!
    is_success = y.loc[idx] == pred_result[0]
    success_count += 1 if is_success else 0

print('\nSuccess Count:', success_count)  # Almost always 0!
Now here's the strange part: I expect an accuracy of about 50%, since this is random data, but instead I almost always get exactly 0! I say almost always because roughly one run in ten of this exact code yields a few correct hits.
What's really crazy to me is that if I choose the opposite of every prediction, I get 100% accuracy. On random data!
What am I missing here?
OK, I think I just figured it out! It all comes down to our old machine learning foe: the majority class.
In more detail: I chose a target comprising 25 True and 25 False values, perfectly balanced. Performing leave-one-out creates a class imbalance in the training set, say 24 True and 25 False. Since the SVC was left at its default parameters and run on random data, it probably couldn't find any way to predict the result other than choosing the majority class, which in that iteration would be False. So in every iteration the imbalance was turned against the currently left-out sample.
All in all, a good lesson in machine learning, and an excellent mathematical riddle to share with your friends :)
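A minimal sketch that isolates this effect, replacing the SVC with an explicit majority-class predictor (an illustrative substitution, not the original code):
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((50, 20))                      # random features
y = np.array([True] * 25 + [False] * 25)      # perfectly balanced target

# Leaving one sample out makes its class the training minority,
# so a majority-class predictor is wrong on every fold.
scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=LeaveOneOut())
print(scores.mean())                          # 0.0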
