I am using sklearn.linear_model.Perceptron on a synthetic dataset I created. The data consists of two classes, each a multivariate Gaussian with a common non-diagonal covariance matrix. The class centroids are close enough that there is significant overlap.
mean1 = np.ones((20,))
mean2 = 2 * np.ones((20,))
A = 0.1 * np.random.randn(20,20)
cov = np.dot(A, A.T)
class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2*np.ones((len(class2), 1))), axis=1)
class1_train, class1_test = train_test_split(class1, test_size=0.3)
class2_train, class2_test = train_test_split(class2, test_size=0.3)
train = np.concatenate((class1_train, class2_train), axis=0)
test = np.concatenate((class1_test, class2_test), axis=0)
np.random.shuffle(train)
np.random.shuffle(test)
y_train = train[:,20]
x_train = train[:,0:20]
y_test = test[:,20]
x_test = test[:,0:20]
After saving this data, I just used:
classifier = sklearn.linear_model.Perceptron()
classifier.fit(x_train, y_train)
predicted_test = classifier.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, predicted_test)
precision = sklearn.metrics.precision_score(y_test, predicted_test)
recall = sklearn.metrics.recall_score(y_test, predicted_test)
f_measure = sklearn.metrics.f1_score(y_test, predicted_test)
print(accuracy, precision, recall, f_measure)
The data is overlapping by design, yet the linear classifier somehow predicts perfectly, with accuracy, precision, etc. all equal to 1.
The correct way to use cross_validation.train_test_split is to give it the complete dataset and let it partition the data into x_train, x_test, y_train and y_test (in recent scikit-learn versions this function lives in sklearn.model_selection).
The following code works better:
class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2*np.ones((len(class2), 1))), axis=1)
dataset = np.concatenate((class1, class2), axis=0)
np.random.shuffle(dataset)
x_train, x_test, y_train, y_test = \
cross_validation.train_test_split(dataset[:,:20], dataset[:,20], test_size=0.3)
Notice that the Perceptron can actually achieve 100% accuracy on your data. Try adding some noise to it to get a feel for how the metrics respond.
For instance:
noise = np.random.normal(0,1,(4000, 20))
dataset[:, 0:20] = dataset[:, 0:20] + noise
x_train, x_test, y_train, y_test = \
cross_validation.train_test_split(dataset[:,:20], dataset[:,20], test_size=0.3)
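To see why perfect scores are plausible here: with A = 0.1 * np.random.randn(20, 20), the per-feature spread is tiny compared with the unit gap between the class means, so the two Gaussians barely overlap in 20 dimensions. A minimal sketch (assuming the mean1, mean2, class1 and class2 arrays from the question) that projects both classes onto the mean-difference direction:
import numpy as np
# project both classes (features only, label column dropped) onto the
# direction connecting the two class means
w = mean2 - mean1
proj1 = class1[:, :20] @ w
proj2 = class2[:, :20] @ w
# if the two projected ranges do not overlap, a linear separator exists,
# so 100% accuracy from the Perceptron is expected
print(proj1.min(), proj1.max())
print(proj2.min(), proj2.max())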
I am evaluating an SOM/Kohonen map as a regressor for a dataset. Unfortunately it performs extremely badly - so badly that I think I might have an error in my code. While the R2 score for the training dataset is usually only around 1-5%, the R2 score for the test dataset is ALWAYS extremely negative; example:
Train: 1.09 %
Test: -5668908.61 %
Even though I have gone over my code again and again, I want to make sure that I did not make a mistake with scaling the data or similar, which might cause the bad performance. Basically, I split the data into X and y and then use sklearn's train_test_split() to get the respective datasets.
I use sklearn's MinMaxScaler() to fit_transform() X_train and apply the same transformation to X_test so that there is no data leakage. For y_train I use a separate scaler (scalery).
After each model is trained, I use the y_train scaler (scalery) to invert the scaling on y_pred, y_pred_train and y_train.
Is there some mistake in my approach? I just want to make sure that this type of model simply performs badly here and that the results are not due to an error on my side.
Here is my code:
data = load_dataset(currency, 1440, predictor, data_range)
X = data.drop(predictor, axis =1)
y = data[[predictor]]
scaler = MinMaxScaler(feature_range=(0, 1))
scalery = MinMaxScaler(feature_range=(0, 1))
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    shuffle=False,
)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scalery.fit_transform(y_train)
map_size = int(5 * math.sqrt(X_test.shape[0]))  # vesanto heuristic
info_dict = {
    'currency': currency,
    'data_range': data_range,
    'epochs': 0
}
for i in range(100,2100,100):
    info_dict['epochs'] = i
    print(f"GridSearch Configuration: {map_size}x{map_size}")
    print(currency, data_range, i)
    som = susi.SOMRegressor(
        n_rows=map_size,
        n_columns=map_size,
        n_iter_unsupervised=i,
        n_iter_supervised=i,
        neighborhood_mode_unsupervised="linear",
        neighborhood_mode_supervised="linear",
        learn_mode_unsupervised="min",
        learn_mode_supervised="min",
        learning_rate_start=0.5,
        learning_rate_end=0.05,
        # do_class_weighting=True,
        random_state=None,
        n_jobs=1)
    som.fit(X_train, y_train.ravel())
    y_pred = som.predict(X_test)
    y_pred_train = som.predict(X_train)
    y_pred = scalery.inverse_transform(pd.DataFrame(y_pred))
    y_train = scalery.inverse_transform(pd.DataFrame(y_train))
    y_pred_train = scalery.inverse_transform(pd.DataFrame(y_pred_train))
    print("Train: {0:.2f} %".format(r2_score(y_train, y_pred_train)*100))
    print("Test: {0:.2f} %".format(r2_score(y_test, y_pred)*100))
I've done DecisionTreeRegression as well as RandomForestRegression on the same dataset.
For RandomForest I've used 5 random best combinations, and the results were all similar, as you'd expect. I've calculated the averages of R^2, RMSE and MAE and got:
R^2: 0.7, MAE: 145716, RMSE: 251828.
For DecisionTree I've used Repeated K-Fold, calculated the averages and got:
R^2: 0.29, MAE: 121791, RMSE: 198280.
No transformations or scaling have been done on the response variable which is Home Prices.
I'm new to statistics, but I'm pretty sure R^2 should be higher when MAE and RMSE are lower on the same dataset and no scaling has been done. That said, the dataset in question is of pretty low quality compared to the other datasets I'm using, which yield error scores in the expected proportions.
My question: since this dataset is of poor quality, I'm sure some of the R^2 values for the DecisionTree model will be negative (or otherwise outside the 0-1 interval). Is it possible that calculating the mean of the scores after cross-validation gives misleading results for R^2 in that case, or is it more likely that there's an issue with the logic of my code (or something else)?
def decisionTreeRegression(df, features):
    df = df.sample(frac=1, random_state=0)
    scaler = StandardScaler()
    X = df[features]
    y = df[['Price']]
    param_grid = {'max_depth': np.arange(1, 40, 3)}
    tree = GridSearchCV(DecisionTreeRegressor(), param_grid, return_train_score=False)
    tree.fit(X, y)
    tree_final = DecisionTreeRegressor(max_depth=tree.best_params_['max_depth'])
    cv = RepeatedKFold(n_splits=5, n_repeats=100)
    mae_scores = cross_val_score(tree_final, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    mse_scores = cross_val_score(tree_final, X, y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
    r2_scores = cross_val_score(tree_final, X, y, scoring='r2', cv=cv, n_jobs=-1)
    return makeScoresCV(mae_scores, mse_scores, r2_scores)

def makeScoresCV(mae_scores, mse_scores, r2_scores):
    # convert scores to positive
    mae_scores = absolute(mae_scores)
    mse_scores = absolute(mse_scores)
    # summarize the result
    s_mean = mean(mae_scores)
    s_mean2 = mean(mse_scores)
    s_mean3 = mean(r2_scores)
    return s_mean, np.sqrt(s_mean2), s_mean3
mae, rmse, r2 = decisionTreeRegression(df_de,fe_de)
print("mae : " + str(mae))
print("rmse : " + str(rmse))
print("r2 : " + str(r2))
Console:
mae : 153189.34673362423
rmse : 253284.5137707182
r2 : 0.30183525616923246
Random Forest (separate notebook):
scaler = StandardScaler()
X = df.drop('Price', axis = 1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2, random_state=123, shuffle=True)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    rmse = np.sqrt(mean_squared_error(test_labels, predictions))
    r2 = r2_score(test_labels, predictions)  # from sklearn.metrics
    mae = np.sum(np.absolute(test_labels - predictions)) / len(predictions)
    return mae, r2, rmse

maes = []
rmses = []
r2s = []
for i in range(10):
    rf_random.fit(X_train, y_train)
    best_random = rf_random.best_estimator_
    mae, r2, rmse = evaluate(best_random, X_test, y_test)
    maes.append(mae)
    rmses.append(rmse)
    r2s.append(r2)
print("MAE")
print(math.fsum(maes) / len(maes))
print("RMSE")
print(math.fsum(rmses) / len(rmses))
print("R2")
print(math.fsum(r2s) / len(r2s))
Console:
MAE
145716.7264983288
RMSE
251828.40328030512
R2
0.7082730127977784
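On the averaging concern: R^2 as computed by cross_val_score is bounded above by 1 but not below, so a single badly performing fold can dominate the mean. A quick check with purely illustrative values (not from this dataset):
import numpy as np
# four reasonable folds and one fold where the model is far worse than
# predicting the mean; the average is dragged well below the typical fold
fold_r2 = np.array([0.65, 0.60, 0.58, 0.62, -4.0])
print(fold_r2.mean())      # about -0.31, even though most folds are near 0.6
print(np.median(fold_r2))  # the median (0.60) is robust to the outlier fold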
I need to create a for loop in Python that will repeat steps 1-2 below 1,00 times.
1. Split the sample randomly into training and test sets using a 632:368 ratio.
2. Build the model using the 63.2% training data and compute R squared on the holdout data.
I can't seem to get the R squared for the dataset:
y=data['Amount']
xall = data
xall.drop(["No","Amount", "Class"], axis = 1, inplace = True)
for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data (and, to match the stated 632:368 ratio, the test size should be 0.368).
Use this:
y=data['Amount']
xall = data
xall.drop(["No","Amount", "Class"], axis = 1, inplace = True)
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.368, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)
    r_sq = modelall.score(X_test, y_test)
    print('coefficient of determination:', r_sq)
You are fitting a linear model to the whole dataset (xall) on every iteration; the seed only changes the train/test split, which your code never uses, so linear regression gives the same output irrespective of the seed value.
I have the following code that attempts to value stocks based on non-price features.
price = df.loc[:,'regularMarketPrice']
features = df.loc[:,feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size = 0.15, random_state = 1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
#
model = LinearRegression()
model.fit(X_train,y_train)
#
print('Train Score:', model.score(X_train,y_train))
print('Test Score:', model.score(X_test,y_test))
#
y_predicted = model.predict(X_test)
In my df (which is very large), there is never an instance where 'regularMarketPrice' is less than 0. However, I occasionally get values less than 0 for some points in y_predicted.
Is there a way in scikit-learn to say that anything less than 0 is an invalid prediction? I am hoping this would make my model more accurate.
Please comment if there is a need for further explanation.
To get predictions larger than 0, you should not use linear regression. You should consider a generalized linear model (GLM), such as Poisson regression.
from sklearn.linear_model import PoissonRegressor
price = df.loc[:,'regularMarketPrice']
features = df.loc[:,feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size = 0.15, random_state = 1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
#
model = PoissonRegressor()
model.fit(X_train,y_train)
#
print('Train Score:', model.score(X_train,y_train))
print('Test Score:', model.score(X_test,y_test))
#
y_predicted = model.predict(X_test)
All predictions are greater than or equal to 0.
Consider using something other than a Gaussian response. Plot your y-values as a histogram; if the data are right-skewed, consider modeling with a GLM using a gamma distribution and a log link.
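A minimal sketch of that suggestion with scikit-learn's GammaRegressor, which uses a log link so predictions stay strictly positive (this assumes the target prices are greater than 0 and reuses the variables from the question):
from sklearn.linear_model import GammaRegressor
# gamma GLM with a log link: predictions are exp(linear predictor), hence always > 0
gamma_model = GammaRegressor()
gamma_model.fit(X_train, y_train)
y_predicted = gamma_model.predict(X_test)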
Alternatively, you could post-process the linear model's output and set each predicted value to the maximum of the prediction and 0.
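A one-line sketch of that clipping idea, applied to the linear model from the question:
import numpy as np
# replace any negative predictions with 0; all other predictions are unchanged
y_predicted = np.clip(model.predict(X_test), 0, None)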
I am using PCA for dimensionality reduction; my training data has 1,200,000 records with 335 dimensions. Here is my code to train the model:
X, y = load_data(f_file1)
valid_X, valid_y = load_data(f_file2)
pca = PCA(n_components=n_compo, whiten=True)
X = pca.fit_transform(X)
valid_input = pca.transform(valid_X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=30,
min_samples_leaf=2, class_weight={0: 10, 1: 1}) # imbalanced class
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train)*100,
clf.score(X_test, y_test)*100,
recall_score(y_train, clf.predict(X_train))*100,
recall_score(y_test, clf.predict(X_test))*100,
precision_score(y_train, clf.predict(X_train))*100,
precision_score(y_test, clf.predict(X_test))*100,
auc(*roc_curve(y_train, clf.predict_proba(X_train)[:, 1], pos_label=1)[:-1])*100,
auc(*roc_curve(y_test, clf.predict_proba(X_test)[:, 1], pos_label=1)[:-1])*100)
print(precision_score(valid_y, clf.predict(valid_input))*100,
recall_score(valid_y, clf.predict(valid_input))*100,
accuracy_score(valid_y, clf.predict(valid_input))*100,
auc(*roc_curve(valid_y, clf.predict_proba(valid_input)[:, 1], pos_label=1)[:-1])*100)
The output is
99.80, 99.32, 99.87, 99.88, 99.74, 98.78, 99.99, 99.46
0.00, 0.00, 97.13, 49.98, 700.69
So the recall and precision are 0. Why does PCA seem not to work on the validation data, and is the model overfitted?
It's probably overfitted, because max_depth=30 is too much.
How did you select the PCA dimension? You can get an optimal value via the eigenvalue/eigenvector approach:
data = data.values
# centre the data, then eigendecompose its covariance matrix
mean = np.mean(data.T, axis=1)
demeaned = data - mean
evals, evecs = np.linalg.eig(np.cov(demeaned.T))
# sort the eigenvalues in descending order and plot them
order = evals.argsort()[::-1]
evals = evals[order]
plt.plot(evals)
plt.grid(True)
plt.savefig('_!pca.png')
Select the optimal number of components from the x value at which the curve drops to nearly zero.
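Equivalently, you can let scikit-learn's PCA do the eigendecomposition and inspect the explained variance ratios; a minimal sketch (assuming X is the training matrix from the question, and the 95% threshold is just an example):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# fit PCA without reducing, then look at how much variance each component carries
pca_full = PCA()
pca_full.fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(cumulative)
plt.grid(True)
plt.savefig('pca_explained_variance.png')
# smallest number of components that keeps at least 95% of the variance
n_compo = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_compo)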