Performance collapses when PCA is applied to new data - python

I am using PCA for dimensionality reduction; my training data has 1,200,000 records with 335 dimensions. Here is my code to train the model:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score, accuracy_score, roc_curve, auc

X, y = load_data(f_file1)
valid_X, valid_y = load_data(f_file2)

pca = PCA(n_components=n_compo, whiten=True)
X = pca.fit_transform(X)                # fit PCA on the training file
valid_input = pca.transform(valid_X)    # apply the same projection to the validation file

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy', max_depth=30,
                             min_samples_leaf=2, class_weight={0: 10, 1: 1})  # imbalanced classes
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train) * 100,
      clf.score(X_test, y_test) * 100,
      recall_score(y_train, clf.predict(X_train)) * 100,
      recall_score(y_test, clf.predict(X_test)) * 100,
      precision_score(y_train, clf.predict(X_train)) * 100,
      precision_score(y_test, clf.predict(X_test)) * 100,
      auc(*roc_curve(y_train, clf.predict_proba(X_train)[:, 1], pos_label=1)[:-1]) * 100,
      auc(*roc_curve(y_test, clf.predict_proba(X_test)[:, 1], pos_label=1)[:-1]) * 100)
print(precision_score(valid_y, clf.predict(valid_input)) * 100,
      recall_score(valid_y, clf.predict(valid_input)) * 100,
      accuracy_score(valid_y, clf.predict(valid_input)) * 100,
      auc(*roc_curve(valid_y, clf.predict_proba(valid_input)[:, 1], pos_label=1)[:-1]) * 100)
The output is
99.80, 99.32, 99.87, 99.88, 99.74, 98.78, 99.99, 99.46
0.00, 0.00, 97.13, 49.98, 700.69
So the recall and precision on the validation data are both 0. Why does PCA seem not to work on the validation data, and is the model overfitted?

It is probably overfitting because
max_depth=30
is too large.
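One way to check the depth, shown as a minimal sketch below, is to tune max_depth (and min_samples_leaf) with cross-validation instead of fixing max_depth=30; the grid values here are illustrative assumptions, not tuned settings:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid; adjust the ranges to your data.
param_grid = {'max_depth': [5, 10, 15, 20], 'min_samples_leaf': [2, 10, 50]}
search = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', class_weight={0: 10, 1: 1}),
    param_grid,
    scoring='roc_auc',   # AUC is more informative than accuracy for an imbalanced target
    cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

If the cross-validated depth comes out much smaller than 30, that is a strong hint the original tree was overfitting.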
How did you select the PCA dimension? You can find a good value via the eigenvalue/eigenvector approach:
import numpy as np
import matplotlib.pyplot as plt

data = data.values                      # pandas DataFrame -> numpy array
mean = np.mean(data.T, axis=1)
demeaned = data - mean                  # centre the data
evals, evecs = np.linalg.eig(np.cov(demeaned.T))
order = evals.argsort()[::-1]           # sort eigenvalues in decreasing order
evals = evals[order]
plt.plot(evals)
plt.grid(True)
plt.savefig('_!pca.png')
Choose the number of components at the point on the x-axis where the curve drops close to zero.
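Equivalently, you can let scikit-learn compute the same spectrum for you. A minimal sketch, assuming X is the original 335-dimensional training matrix and using an illustrative 95% variance threshold (the threshold is an assumption, not a value from the original post):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X)                                   # keep all components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)   # cumulative explained variance
n_compo = int(np.searchsorted(cum_var, 0.95) + 1)         # smallest n reaching ~95% of the variance
print(n_compo, cum_var[n_compo - 1])

The explained_variance_ratio_ curve is the normalised version of the eigenvalue plot above, so the elbow you read off the plot and the threshold here should point to a similar number of components.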

Related

Regression with Self Organizing Map (SOM) / Kohonen Map

I am evaluating an SOM/Kohonen Map as a regressor for a dataset. Unfortunately it performs extremely badly - so badly that I think I might have an error in my code. While the R2 score for the training dataset is usually only around 1-5%, the R2 score for the test dataset is ALWAYS extremely negative; for example:
Train: 1.09 %
Test: -5668908.61 %
Even though I have gone over my code again and again, I just want to make sure that I did not make a mistake with scaling the data or similar, which might cause the bad performance. Basically, I split the data into X and y and then use sklearn's train_test_split() to get the respective datasets.
I use sklearn's MinMaxScaler() to fit_transform() X_train and apply the same transformation to X_test so that there is no data leakage. For y_train I use a separate scaler (scalery).
After each model is trained, I use the y_train scaler (scalery) to invert the scaling on y_pred, y_pred_train and y_train.
Is there some mistake in my approach? I just want to make sure that this type of model simply performs inherently badly here, and that the results are not due to an error on my side.
Here is my code:
import math

import pandas as pd
import susi
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score

data = load_dataset(currency, 1440, predictor, data_range)
X = data.drop(predictor, axis=1)
y = data[[predictor]]

scaler = MinMaxScaler(feature_range=(0, 1))
scalery = MinMaxScaler(feature_range=(0, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    shuffle=False,
)

# Fit the scalers on the training data only, then apply them to the test data.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scalery.fit_transform(y_train)

map_size = int(5 * math.sqrt(X_test.shape[0]))  # Vesanto heuristic

info_dict = {
    'currency': currency,
    'data_range': data_range,
    'epochs': 0
}

for i in range(100, 2100, 100):
    info_dict['epochs'] = i
    print(f"GridSearch Configuration: {map_size}x{map_size}")
    print(currency, data_range, i)
    som = susi.SOMRegressor(
        n_rows=map_size,
        n_columns=map_size,
        n_iter_unsupervised=i,
        n_iter_supervised=i,
        neighborhood_mode_unsupervised="linear",
        neighborhood_mode_supervised="linear",
        learn_mode_unsupervised="min",
        learn_mode_supervised="min",
        learning_rate_start=0.5,
        learning_rate_end=0.05,
        # do_class_weighting=True,
        random_state=None,
        n_jobs=1)
    som.fit(X_train, y_train.ravel())

    y_pred = som.predict(X_test)
    y_pred_train = som.predict(X_train)

    # Undo the y scaling before computing the scores.
    y_pred = scalery.inverse_transform(pd.DataFrame(y_pred))
    y_train = scalery.inverse_transform(pd.DataFrame(y_train))
    y_pred_train = scalery.inverse_transform(pd.DataFrame(y_pred_train))

    print("Train: {0:.2f} %".format(r2_score(y_train, y_pred_train) * 100))
    print("Test: {0:.2f} %".format(r2_score(y_test, y_pred) * 100))

For Loop In Python using sklearn.model_selection.train_test_split

I need to create a FOR loop in Python that will repeat steps 1-2 below 1,000 times:
1. Split the sample randomly into training and test sets using a 632:368 ratio.
2. Build the model on the 63.2% training data and compute the R-squared on the holdout data.
I can't seem to grab the R-squared for the holdout dataset:
y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)
for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)             # fit on the training split only
    r_sq = modelall.score(X_test, y_test)      # R-squared on the holdout split
    print('coefficient of determination:', r_sq)
Your original code fits the linear model to the whole dataset (xall) and only changes the seed of the split; since the split is never used in the fit, linear regression gives you the same output irrespective of the seed value.
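If you also want to keep the holdout R-squared from every split instead of only printing it, here is a small sketch (r_sq_list is an illustrative name, not from the original post):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

r_sq_list = []
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)      # fit on the training split
    r_sq_list.append(model.score(X_test, y_test))         # R-squared on the holdout split

print('mean holdout R^2:', np.mean(r_sq_list))
print('std of holdout R^2:', np.std(r_sq_list))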

See the score of each fold when cross validating a model using a for loop

I want to see the individual score of each fitted model to visualize the strength of cross validation (I am doing this to show my coworkers why cross validation is important).
I have a .csv file with 500 rows, 200 independent variables and 1 binary target. I defined skf to fold the data 5 times using StratifiedKFold.
My code looks like this:
X = data.iloc[0:500, 2:202]
y = data["target"]
skf = StratifiedKFold(n_splits=5, random_state=0)
clf = svm.SVC(kernel="linear")
Scores = [0] * 5
for i, j in skf.split(X, y):
    X_train, y_train = X.iloc[i], y.iloc[i]
    X_test, y_test = X.iloc[j], y.iloc[j]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
As you can see, I assigned a list of 5 zeroes to Scores. I would like to assign the clf.score(X_test, y_test) of each of the 5 predictions to the list. However, the indices i and j are not {1, 2, 3, 4, 5}. Rather, they are row numbers used to fold the X and y data frames.
How can I assign the test score of each of the k fitted models to Scores within this loop? Do I need a separate index for this?
I know that cross_val_score essentially does all of this and returns the k scores. However, I want to show my coworkers what happens behind the cross-validation functions that come with the sklearn library.
Thanks in advance!
If I understood the question, and you don't need any particular indexing for Scores:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.normal(size=(500, 200))
y = np.random.randint(low=0, high=2, size=500)

# shuffle=True so that random_state has an effect (recent sklearn versions reject random_state with shuffle=False)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = SVC(kernel="linear")

Scores = []
for i, j in skf.split(X, y):
    X_train, y_train = X[i], y[i]
    X_test, y_test = X[j], y[j]
    clf.fit(X_train, y_train)
    Scores.append(clf.score(X_test, y_test))
The result is:
>>>Scores
[0.5247524752475248, 0.53, 0.5, 0.51, 0.4444444444444444]
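To answer the "separate index" part of the question: if you prefer to keep the preallocated Scores = [0] * 5 list, one sketch is to use enumerate() to get a fold counter alongside the row indices (the name fold is illustrative):

Scores = [0] * 5
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    clf.fit(X[train_idx], y[train_idx])
    Scores[fold] = clf.score(X[test_idx], y[test_idx])   # fold runs 0..4
print(Scores)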

Confusion Matrix return single matrix

I found an issue with scikit-learn's confusion matrix.
I use the confusion matrix inside KFold; when y_true and y_pred are 100% correct, the confusion matrix returns a single number. This breaks my accumulated confusion-matrix variable, because I add the result from confusion_matrix in each fold. Does anyone have a solution for this?
Here is my code
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

model = MultinomialNB()
kf = KFold(n_splits=10)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    cf += confusion_matrix(y_test, y_pred)
Thank You
The cleanest way is probably to pass a list of all possible classes in as the labels argument. Here is an example that shows the issue and it being resolved (based on spoofed data for the truth and predictions).
from sklearn.metrics import confusion_matrix
import numpy as np

y_test = np.array([1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0, 0])
labels = np.unique(y_test)

cf = np.array([[0, 0], [0, 0]])
for indices in [[0, 1, 2, 3], [1, 2, 3], [1, 2, 3, 4, 5, 6]]:
    cm1 = confusion_matrix(y_test[indices], y_pred[indices])
    cm2 = confusion_matrix(y_test[indices], y_pred[indices], labels=labels)
    print(cm1.shape == (2, 2), cm2.shape == (2, 2))
In the first subset, both classes appear; but in the second subset, only one class appears and so the cm1 matrix is not of size (2,2) (it comes out as (1,1)). But note that by indicating all potential classes in labels, cm2 is always ok.
If you already know that the labels can only be 0 or 1, you could just assign labels=[0,1], but using np.unique will be more robust.
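Applied to the accumulation loop from the question, this might look like the sketch below (assuming x and y are the full feature matrix and label vector, and kf and model are defined as in the question):

labels = np.unique(y)                       # every class that can occur in any fold
cf = np.zeros((len(labels), len(labels)), dtype=int)
for train_index, test_index in kf.split(x):
    model.fit(x[train_index], y[train_index])
    y_pred = model.predict(x[test_index])
    cf += confusion_matrix(y[test_index], y_pred, labels=labels)  # always (n_classes, n_classes)
print(cf)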
Alternatively, you can first check whether all the predicted values are equal to the true values. If that is the case, just increment the [0][0] and [1][1] cells of your confusion matrix by the number of 0s and 1s in y_pred (or y_test).
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

X = pd.DataFrame({'f1': [1]*10 + [0]*10,
                  'f2': [3]*10 + [10]*10}).values
y = np.array([1]*10 + [0]*10)

model = MultinomialNB()
kf = KFold(n_splits=5)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    if all(y_test == y_pred):                   # if perfect prediction
        cf[0][0] += sum(y_pred == 0)            # increment by the number of 0 values
        cf[1][1] += sum(y_pred == 1)            # increment by the number of 1 values
    else:
        cf += confusion_matrix(y_test, y_pred)  # otherwise add the confusion matrix values
The result of print(cf) is:
[[10  0]
 [ 0 10]]
Be careful about overfitting.

Why is sklearn's Perceptron predicting with accuracy, precision etc. of 1?

I am using sklearn.linear_model.Perceptron on a synthetic dataset I created. The data consists of 2 classes each of which is a multivariate Gaussian with a common non-diagonal covariance matrix. The centroids of the classes are close enough that there is significant overlap.
import numpy as np
from sklearn.model_selection import train_test_split

mean1 = np.ones((20,))
mean2 = 2 * np.ones((20,))
A = 0.1 * np.random.randn(20, 20)
cov = np.dot(A, A.T)                # shared, non-diagonal covariance matrix

class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)    # append label 1
class2 = np.concatenate((class2, 2*np.ones((len(class2), 1))), axis=1)  # append label 2

class1_train, class1_test = train_test_split(class1, test_size=0.3)
class2_train, class2_test = train_test_split(class2, test_size=0.3)
train = np.concatenate((class1_train, class2_train), axis=0)
test = np.concatenate((class1_test, class2_test), axis=0)
np.random.shuffle(train)
np.random.shuffle(test)

y_train = train[:, 20]
x_train = train[:, 0:20]
y_test = test[:, 20]
x_test = test[:, 0:20]
After saving this data, I just used:
import sklearn.linear_model
import sklearn.metrics

classifier = sklearn.linear_model.Perceptron()
classifier.fit(x_train, y_train)
predicted_test = classifier.predict(x_test)

accuracy = sklearn.metrics.accuracy_score(y_test, predicted_test)
precision = sklearn.metrics.precision_score(y_test, predicted_test)
recall = sklearn.metrics.recall_score(y_test, predicted_test)
f_measure = sklearn.metrics.f1_score(y_test, predicted_test)
print(accuracy, precision, recall, f_measure)
The data is supposed to overlap by design, yet the linear classifier is somehow able to predict perfectly, with accuracy, precision, etc. all equal to 1.
The usual way to use train_test_split is to give it the complete dataset and let it partition the data into x_train, x_test, y_train, y_test.
The following code works better:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions

class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2*np.ones((len(class2), 1))), axis=1)

dataset = np.concatenate((class1, class2), axis=0)
np.random.shuffle(dataset)

x_train, x_test, y_train, y_test = \
    train_test_split(dataset[:, :20], dataset[:, 20], test_size=0.3)
Notice that the Perceptron can actually achieve 100% accuracy on your data: with A scaled by 0.1 the covariance is small relative to the distance between the class means, so the two classes are in fact close to linearly separable. Try adding some noise to the features to get a feeling for what happens when they are not.
For instance:
noise = np.random.normal(0, 1, (4000, 20))
dataset[:, 0:20] = dataset[:, 0:20] + noise
x_train, x_test, y_train, y_test = \
    train_test_split(dataset[:, :20], dataset[:, 20], test_size=0.3)
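To see the effect, here is a sketch of refitting the Perceptron on the noisy data and checking the scores again (the expectation that the scores drop follows from the overlap the noise introduces):

from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

clf = Perceptron()
clf.fit(x_train, y_train)                    # x_train/y_train now come from the noisy dataset
print("train accuracy:", clf.score(x_train, y_train))
print("test accuracy:", accuracy_score(y_test, clf.predict(x_test)))
# With genuinely overlapping classes, both scores should now fall below 1.0.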
