Regression with Self Organizing Map (SOM) / Kohonen Map - python

I am evaluating an SOM/Kohonen map as a regressor on a dataset. Unfortunately it performs extremely badly, so badly that I suspect there might be an error in my code. While the R2 score on the training dataset is usually only around 1-5%, the R2 score on the test dataset is ALWAYS extremely negative; for example:
Train: 1.09 %
Test: -5668908.61 %
Even though I went over my code again and again, I just want to make sure that I did not make a mistake with scaling the data or something similar, which might cause the bad performance. Basically, I split the data into X and y and then use sklearn's train_test_split() to get the respective datasets.
I use sklearn's MinMaxScaler() to fit_transform() X_train and apply the same transformation to X_test so that there is no data leakage. For y_train I use a separate scaler (scalery).
After each model is trained, I use the y_train scaler (scalery) to invert the scaling of y_pred, y_pred_train and y_train.
Is there some mistake in my approach? I just want to make sure that this type of model simply performs badly on its own, and not because of an error on my side.
Here is my code:
import math

import pandas as pd
import susi
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

data = load_dataset(currency, 1440, predictor, data_range)
X = data.drop(predictor, axis=1)
y = data[[predictor]]

scaler = MinMaxScaler(feature_range=(0, 1))
scalery = MinMaxScaler(feature_range=(0, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    shuffle=False,
)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scalery.fit_transform(y_train)

map_size = int(5 * math.sqrt(X_test.shape[0]))  # Vesanto heuristic

info_dict = {
    'currency': currency,
    'data_range': data_range,
    'epochs': 0
}

for i in range(100, 2100, 100):
    info_dict['epochs'] = i
    print(f"GridSearch Configuration: {map_size}x{map_size}")
    print(currency, data_range, i)

    som = susi.SOMRegressor(
        n_rows=map_size,
        n_columns=map_size,
        n_iter_unsupervised=i,
        n_iter_supervised=i,
        neighborhood_mode_unsupervised="linear",
        neighborhood_mode_supervised="linear",
        learn_mode_unsupervised="min",
        learn_mode_supervised="min",
        learning_rate_start=0.5,
        learning_rate_end=0.05,
        # do_class_weighting=True,
        random_state=None,
        n_jobs=1)

    som.fit(X_train, y_train.ravel())
    y_pred = som.predict(X_test)
    y_pred_train = som.predict(X_train)

    y_pred = scalery.inverse_transform(pd.DataFrame(y_pred))
    y_train = scalery.inverse_transform(pd.DataFrame(y_train))
    y_pred_train = scalery.inverse_transform(pd.DataFrame(y_pred_train))

    print("Train: {0:.2f} %".format(r2_score(y_train, y_pred_train) * 100))
    print("Test: {0:.2f} %".format(r2_score(y_test, y_pred) * 100))
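For reference, the scaling pattern itself (fit the scalers on the training split only, transform the test split, and inverse-transform the predictions before scoring) can be sanity-checked end to end on synthetic data. This is only a sketch, with LinearRegression standing in for susi.SOMRegressor and made-up data in place of load_dataset():

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data (the real load_dataset() output is not available here)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))
y = pd.DataFrame({"target": 2 * X["a"] + rng.normal(scale=0.1, size=500)})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

scaler = MinMaxScaler(feature_range=(0, 1))
scalery = MinMaxScaler(feature_range=(0, 1))
X_train_s = scaler.fit_transform(X_train)   # fit only on the training features
X_test_s = scaler.transform(X_test)         # reuse the training fit -> no leakage
y_train_s = scalery.fit_transform(y_train)  # separate scaler for the target

model = LinearRegression().fit(X_train_s, y_train_s.ravel())

# predictions come back in the scaled target space, so invert them before scoring
y_pred_train = scalery.inverse_transform(model.predict(X_train_s).reshape(-1, 1))
y_pred_test = scalery.inverse_transform(model.predict(X_test_s).reshape(-1, 1))

print("Train R2:", r2_score(y_train, y_pred_train))
print("Test  R2:", r2_score(y_test, y_pred_test))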

Related

I am getting 100% accuracy in my decision tree model. Where did I go wrong?

# split the dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score']  # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(y_test)
print(y_pred)

confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as pointed out by @gtomer. That means an exact data point that is present in the training set is also available in your test set. You can run a K-fold test on your data and see how the accuracy holds up. Secondly, try different classifiers too (Random Forests usually perform better than single Decision Trees).
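A minimal sketch of that K-fold check, assuming the X and y already defined in the question (the 5 splits and StratifiedKFold are arbitrary choices here, and RandomForestClassifier is the suggested alternative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# compare fold-wise accuracy of both classifiers on the same splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for clf in (DecisionTreeClassifier(random_state=1), RandomForestClassifier(random_state=1)):
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(type(clf).__name__, round(scores.mean(), 3), "+/-", round(scores.std(), 3))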

Using TimeSeriesSplit within cross_val_score

I'm fitting a time series model and trying to cross-validate it with the TimeSeriesSplit function. I believe the easiest way to apply this function is through the cv argument of cross_val_score.
The question is simple: is the way I am passing the cv argument correct? Should I call split(scaled_train), split(X_train), or split(input_data)? Or should I cross-validate in another way?
This is the code I am writing:
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df[['next_count']]
        X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Scaling
        scaler = MinMaxScaler()
        scaled_train = scaler.fit_transform(X_train)
        scaled_test = scaler.transform(X_test)
        # Build model
        lr = LinearRegression()
        lr.fit(scaled_train, y_train.values.ravel())
        predictions = lr.predict(scaled_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metrics
        r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
TimeSeriesSplit is simply an iterator that yields a growing window of sequential folds. Therefore, you can pass it as is to cv, or you can pass time_split.split(scaled_train), which amounts to the same thing: making splits over an array of the same size as your train data (which cross_val_score takes as its second positional parameter). It doesn't matter whether TimeSeriesSplit gets the scaled or the original data, as long as cross_val_score gets the scaled data.
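For intuition, here is a tiny self-contained illustration (a toy 10-row array and 3 splits, both arbitrary) of the growing windows TimeSeriesSplit yields:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(10).reshape(-1, 1)  # stands in for any time-ordered feature matrix
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(toy)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
# fold 0: train=[0 1 2 3], test=[4 5]
# fold 1: train=[0 1 2 3 4 5], test=[6 7]
# fold 2: train=[0 1 2 3 4 5 6 7], test=[8 9]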
I also made some minor simplifications to your code: scaling before the train_test_split, and making the output data a Series (so you don't need values.ravel()):
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df['next_count']
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Build model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metrics
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split, scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1

Python - Predicting test data that is smaller than train data

I have preprocessed some data, ready to train a Multinomial Naive Bayes classifier. The train data is 80% of my data and the test data is 20%.
The train data is an array of size 8452 and the test data is an array of size 4231.
If I want to see the predictions on the train data, the following code executes just fine:
multiNB = MultinomialNB()
model = multiNB.fit(x_train, y_train)
y_preds = model.predict(x_train)
but if I want to predict on my test data, i.e.
y_preds = model.predict(x_test)
I get the following error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8452 is different from 4231)
If I need to provide more information about my code, please ask, but I am stuck here and I do not really understand what is causing that error; any help is welcome.
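The error message is a hint that the shapes going into the matrix product inside predict() do not line up, which usually means the number of feature columns at predict time differs from what the model saw at fit time. A quick first check (just a sketch, assuming x_train and x_test are array-like) is to compare the shapes:

import numpy as np

x_train_arr = np.asarray(x_train)
x_test_arr = np.asarray(x_test)
print("x_train shape:", x_train_arr.shape)  # expect (n_train_samples, n_features)
print("x_test shape: ", x_test_arr.shape)   # the second dimension must match x_train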
This is how I obtained my train-test sets:
total_count = len(tokenised_reviews)
split = int(total_count * 0.8)
shuffle = np.random.permutation(total_count)

x = []
y = []
for i in range(total_count):
    x.append(x_data[shuffle[i]])
    y.append(y_data[shuffle[i]])

x_train = x[:split]
x_test = x[split:]
y_train = y[:split]
y_test = y[split:]
Too long to type as a comment: I got a very weird structure when I tried your code. I have no idea what x_data is, so it is hard to explain what the exact error is.
I suspect something went wrong when putting the data back into a list, so if you index the arrays directly instead (with the first 80% of the permutation going to train):
total_count = len(x_data)
split = int(total_count * 0.8)
shuffle = np.random.permutation(total_count)

x_train = x_data[shuffle[:split]]
x_test = x_data[shuffle[split:]]
y_train = y_data[shuffle[:split]]
y_test = y_data[shuffle[split:]]
You should get x_train and x_test as subsets of the original data.
Or you can simply do:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)

For Loop In Python using sklearn.model_selection.train_test_split

I need to create a for loop in Python that will repeat steps 1-2 100 times.
Split the sample randomly into training and test sets using a 632:368 ratio.
Build the model using the 63.2% training data and compute R-squared on the holdout data.
I can't seem to grab the R-squared for the holdout data:
y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)

for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)

for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)
    r_sq = modelall.score(X_test, y_test)
    print('coefficient of determination:', r_sq)
You were fitting the linear model to the whole dataset (xall) on every iteration, just with a different seed number. Linear regression gives the same output irrespective of the seed value, so every iteration printed the same score.
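If the goal is also to keep every holdout R-squared rather than just print it, a small variation of the loop above (same assumed xall and y) collects the scores in a list:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

r_squares = []
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)  # fit on the training split only
    r_squares.append(model.score(X_test, y_test))     # R-squared on the holdout split

print('mean holdout R^2 over 100 splits:', sum(r_squares) / len(r_squares))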

How is train_test_split with test_size=0 affecting the data?

I was using train_test_split in my code and then wanted to change it to cross validation, but something strange is happening.
train, test = train_test_split(data, test_size=0)
x_train = train.drop('CRO', axis=1)
y_train = train['CRO']

scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

for k in range(1, 5):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
    scores = model_selection.cross_val_score(knn, x_train, y_train, cv=5)
    print(scores.mean(), 'score for k = ', k)
This code gives scores around 0.8, but when I delete that first line and use the 'data' set instead of the 'train' set in the 2nd and 3rd lines, the score changes to 0.2, which is strange because I even set test_size to 0, so train should be equal to the whole data.
What is happening?
One thing to be aware of is the implicit arguments passed to train_test_split.
By default, shuffle=True, so that first line shuffles your data even with test_size=0 (effectively adding some noise to your training set), whereas passing the data in directly without shuffling may be introducing some other pattern into the model.
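A quick way to see the effect is to run the same cross-validation once on the data in its original order and once on a shuffled copy. This is only a sketch: it reuses the names from the question's snippet (data, the 'CRO' column, MinMaxScaler, neighbors, model_selection) and picks n_neighbors=3 arbitrarily:

from sklearn.utils import shuffle as sk_shuffle

# compare CV scores on the original row order vs. a shuffled copy of the data
for label, frame in [('original order', data), ('shuffled', sk_shuffle(data, random_state=0))]:
    x = MinMaxScaler(feature_range=(0, 1)).fit_transform(frame.drop('CRO', axis=1))
    y = frame['CRO'].to_numpy()
    knn = neighbors.KNeighborsRegressor(n_neighbors=3, weights='uniform')
    scores = model_selection.cross_val_score(knn, x, y, cv=5)
    print(label, scores.mean())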
