I am training 2000 logistic regression classifiers using Keras.
The inputs for each classifier are:
for training: vectors 8250x50, labels 8250
for validation: vectors 2750x50, labels 2750
for testing: vectors 3000x50, labels 3000
For every classifier, I save the predictions and the scores (kappa score, accuracy, ...).
The code is very slow: it needs three hours to train the first 600 classifiers.
I used the following code:
def lg_keras2(input_dim, output_dim, ep, X, y, Xv, yv, XT, yT, class_weight1):
    model = Sequential()
    model.add(Dense(output_dim, input_dim=input_dim, activation='sigmoid'))
    # model.summary()
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=["accuracy", mcor, recall, f1])
    result = model.fit(X, y, epochs=ep, verbose=0, batch_size=128,
                       class_weight={0: class_weight1[0], 1: class_weight1[1]},
                       validation_data=(Xv, yv))
    test = model.evaluate(XT, yT, verbose=0)
    kappa_Score = cohen_kappa_score(yT, model.predict_classes(XT))
    return model, result, test, kappa_Score
After that, I trained the 2000 classifiers as follows:
from sklearn.utils import class_weight
from sklearn.metrics import cohen_kappa_score

directionsLGR = []
scores = []
predictions = []
kappa_Score_all = []
for i in range(0, 2000):
    Class_weight = class_weight.compute_class_weight('balanced',
                                                     np.unique(pmiweights_Train[:, i]),
                                                     pmiweights_Train[:, i])
    # start_time = time.time()
    model, results, test, kappa = lg_keras2(50, 1, 30,
                                            mdsTrain, pmiweights_Train[:, i],
                                            mdsVal, pmiweights_val[:, i],
                                            mdsTest, pmiweights_Test[:, i],
                                            Class_weight)
    # print("--- %s seconds ---" % (time.time() - start_time))
    weights = np.array(model.get_weights())[0].flatten()
    directionsLGR.append(weights)
    predictions.append(model.predict_classes(mds))
    kappa_Score_all.append(kappa)
    scores.append(test)
Is there anything that I can do to speed up this process?
I would appreciate any suggestions.
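One direction that may help, sketched here as an assumption rather than a tested fix: a large share of those three hours likely goes into building and compiling a fresh Keras graph 2000 times. For a plain logistic regression on 50-dimensional inputs, scikit-learn's LogisticRegression is typically much faster per task; a minimal sketch reusing the variable names from the question:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

directionsLGR, predictions, kappa_Score_all = [], [], []
for i in range(2000):
    # class_weight='balanced' replaces the manual compute_class_weight step
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)
    clf.fit(mdsTrain, pmiweights_Train[:, i])
    directionsLGR.append(clf.coef_.flatten())
    predictions.append(clf.predict(mds))
    kappa_Score_all.append(cohen_kappa_score(pmiweights_Test[:, i],
                                             clf.predict(mdsTest)))

The Keras validation metrics would have to be recomputed separately if they are still needed; this sketch only covers the weights, predictions, and kappa scores collected above.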
I'm creating a vanilla neural network with artificial datasets to learn PyTorch. I'm currently looking at how I can get the predictions for the test dataset and obtain statistical metrics, including MSE, MAE, and R2. I was wondering if my calculations are correct. Also, is there any built-in function that could potentially give me all these results in PyTorch, as happens in scikit-learn?
Let's first import the libraries and then generate artificial training and test data.
import random
import torch
import pandas as pd
import numpy as np
from torch import nn
from torch.utils.data import Dataset,DataLoader,TensorDataset
from torchvision import datasets, transforms
import math
n_input, n_hidden, n_out = 5, 64, 1

# Create training and test datasets
X_train = pd.DataFrame([[random.random() for i in range(n_input)] for j in range(1000)])
y_train = pd.DataFrame([[random.random() for i in range(n_out)] for j in range(1000)])
X_test = pd.DataFrame([[random.random() for i in range(n_input)] for j in range(50)])
y_test = pd.DataFrame([[random.random() for i in range(n_out)] for j in range(50)])

test_dataset = TensorDataset(torch.Tensor(X_test.to_numpy().astype(np.float32)),
                             torch.Tensor(y_test.to_numpy().astype(np.float32)))
testloader = DataLoader(test_dataset, batch_size=32)

# For training, use 32 as a batch size
training_dataset = TensorDataset(torch.Tensor(X_train.to_numpy().astype(np.float32)),
                                 torch.Tensor(y_train.to_numpy().astype(np.float32)))
dataloader = DataLoader(training_dataset, batch_size=32, shuffle=True)
Now, let us generate the model to train.
model = nn.Sequential(nn.Linear(n_input, n_hidden),
                      nn.ReLU(),
                      nn.Linear(n_hidden, n_out),
                      nn.ReLU())
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Next, I start the training process.
losses = []
epochs = 1000
for epoch in range(epochs):
    for times, (x_train, y_train) in enumerate(dataloader):
        y_pred = model(x_train)
        loss = loss_function(y_pred, y_train)
        model.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())  # track the training loss
Now, I'd like to get the predictions for the test dataset and compute the statistical results. This is the part where I need help and some guidance. Do they seem correct? Is there something I might be doing wrong?
from torchmetrics import R2Score

r2score = R2Score()
running_mae = running_mse = running_r2 = 0
with torch.no_grad():
    model.eval()
    for times, (x_test, y_test) in enumerate(testloader):
        y_pred = model(x_test)
        error = torch.abs(y_pred - y_test).sum().data
        squared_error = ((y_pred - y_test) * (y_pred - y_test)).sum().data
        running_mae += error
        running_mse += squared_error
        running_r2 += r2score(y_pred, y_test)
mse = math.sqrt(squared_error / len(testloader))
mae = error / len(testloader)
r2 = running_r2 / len(testloader)
print("MSE:", mse, "MAE:", mae, "R2:", r2)
I am using TensorFlow and Keras for a binary classification problem.
I have only a training set of 81 samples (test size 21), but ~1900 features. I know that is too few samples and too many features, but it's a biological problem (gene-expression data), so I have to deal with it.
My model looks like this (using different numbers of neurons per layer, different numbers of hidden layers, regularization, and dropout to deal with the high-dimensional data):
model = Sequential()
model.add(Input((input_shape,)))
for i in range(num_hidden):
    model.add(Dense(n_neurons, activation="relu",
                    kernel_regularizer=keras.regularizers.l1_l2(l1_reg, l2_reg)))
    model.add(Dropout(dropout_rate))
model.add(Dense(1, activation="sigmoid"))
ann_optimizer = keras.optimizers.Adam()
model.compile(loss="binary_crossentropy",
              optimizer=ann_optimizer, metrics=['accuracy'])
I am using a 10-fold nested cross-validation, with a grid search in the inner fold, like this:
# fit and evaluate the model
# configure the inner cross-validation procedure
# (5 folds: 80% inner-training dataset, 20% inner-test dataset)
cv_inner = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
# define the model
ann = KerasRegressor(build_fn=regressionModel_sequential, input_shape=X_train.shape[1],
                     batch_size=batch_size)
# use a pipeline to prevent leaky preprocessing (StandardScaler fit on the 80% inner-training dataset)
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('ann', ann)])
# define the grid search with inner cv to get good parameters
grid_search_result = GridSearchCV(
    pipe, param_grid, n_jobs=-1, cv=cv_inner, refit=True, verbose=0)
# refit=True fits a final model on the entire inner-training dataset
# execute search
grid_search_result.fit(X_train, y_train, ann__verbose=0)
logger.info('>>>>> est=%.3f, params=%s' % (grid_search_result.best_score_, grid_search_result.best_params_))
# to get the loss curve
ann_val = regressionModel_sequential(input_shape=X_train.shape[1],
                                     n_neurons=grid_search_result.best_params_['ann__n_neurons'],
                                     l1_reg=grid_search_result.best_params_['ann__l1_reg'],
                                     l2_reg=grid_search_result.best_params_['ann__l2_reg'],
                                     num_hidden=grid_search_result.best_params_['ann__num_hidden'],
                                     dropout_rate=grid_search_result.best_params_['ann__dropout_rate'])
# validation with the outer 20%
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
history = ann_val.fit(X_train, y_train, batch_size=batch_size, verbose=0,
                      validation_split=0.25, shuffle=True,
                      epochs=grid_search_result.best_params_['ann__epochs'])
plot_history(history, directory, i)
# use the best grid search result for predicting on the outer test dataset
y_predicted = ann_val.predict(X_test)
# print predictions
logger.info(y_predicted[:5])
logger.info(y_test[:5])
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_predicted))
mae = metrics.mean_absolute_error(y_test, y_predicted)
r_squared = metrics.r2_score(y_test, y_predicted)
My loss seems good (see the loss plot).
But accuracy is very bad (see the accuracy plot; example from one outer fold).
Does anyone have suggestions on what I could do to improve my results?
I also know that the underlying biological question is very hard, maybe impossible, to solve.
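One commonly suggested direction for data with far more features than samples, sketched here under assumptions rather than verified on your data: add a univariate feature-selection step inside the pipeline, so that it is fit only on each inner-training split and stays leak-free. SelectKBest and k=100 are illustrative choices, not recommendations:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Selection lives inside the pipeline, so GridSearchCV refits it on each
# inner-training split only (no leakage into the validation folds).
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=100)),  # k=100 is illustrative
    ('ann', ann),  # input_shape passed to the Keras build function must then equal k
])
# k can be tuned alongside the network parameters:
# param_grid['select__k'] = [50, 100, 200]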
Can someone explain why adding random numbers to the loss does not affect the predictions of this Keras model? Every time I run it I get a very similar AUC for both models, but I would expect the AUC from the second model to be close to 0.5. I use Colab.
Any suggestions why this might be happening?
import numpy as np
import pandas as pd
import tensorflow as tf
import keras as keras
from keras import layers
import random
from keras import backend as K
from sklearn import metrics
from sklearn.metrics import roc_auc_score

opt = tf.keras.optimizers.Adam(learning_rate=1e-04)

# resetting seeds to ensure reproducibility
def reset_random_seeds():
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)
def get_auc(y_test, y_pred):
    fpr, tpr, threshold = metrics.roc_curve(y_test, y_pred)
    auc = metrics.auc(fpr, tpr)
    return auc

# standard loss function with binary cross-entropy
def binary_crossentropy1(y_true, y_pred):
    bin_cross = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    bce1 = K.mean(bin_cross(y_true, y_pred))
    return bce1

# same loss function but with added random numbers
def binary_crossentropy2(y_true, y_pred):
    bin_cross = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    bce2 = K.mean(bin_cross(y_true, y_pred))
    penalty = tf.random.normal([], mean=50.0, stddev=100.0)
    bce2 = tf.math.add(bce2, penalty)
    return bce2
#model without randomness
reset_random_seeds()
input1 = keras.Input(shape=(9,))
x = layers.Dense(12, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(input1)
x = layers.Dense(8, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
output = layers.Dense(1, activation="sigmoid", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
model1 = keras.Model(inputs=input1, outputs=output)
model1.compile(optimizer=opt, loss=binary_crossentropy1, metrics=['accuracy'])
model1.fit(x=X_train, y=y_train, epochs=10, batch_size = 32)
model1_pred = model1.predict(X_test)
#model with randomness
reset_random_seeds()
input1 = keras.Input(shape=(9,))
x = layers.Dense(12, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(input1)
x = layers.Dense(8, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
output = layers.Dense(1, activation="sigmoid", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
model2 = keras.Model(inputs=input1, outputs=output)
model2.compile(optimizer=opt, loss=binary_crossentropy2, metrics=['accuracy'])
model2.fit(x=X_train, y=y_train, epochs=10, batch_size = 32)
model2_pred = model2.predict(X_test)
print(get_auc(y_test, model1_pred))
print(get_auc(y_test, model2_pred))
Result:
0.7228943446346893
0.7231896873302319
What the penalty looks like:
penalty = 112.050842
penalty = 139.664017
penalty = 152.505341
penalty = -37.1483
penalty = -74.08284
penalty = 155.872528
penalty = 42.7903175
The training is guided by the gradient of the loss with respect to the model's weights.
The random value that you add to the loss in the second model does not depend on the weights (or on the input), so it contributes nothing to the gradient of the loss during training. When you run prediction, you take the model output (before the loss function), so that is not affected either.
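A quick way to see this numerically (a standalone sketch, not code from the question): the noise term is a constant with respect to the weights, so the gradient of the noisy loss equals the gradient of the plain loss:

import tensorflow as tf

w = tf.Variable([0.5, -0.3])
x = tf.constant([1.0, 2.0])
y_true = tf.constant(1.0)

with tf.GradientTape(persistent=True) as tape:
    y_pred = tf.sigmoid(tf.reduce_sum(w * x))
    loss = tf.keras.losses.binary_crossentropy([y_true], [y_pred])
    noisy_loss = loss + tf.random.normal([], mean=50.0, stddev=100.0)

g1 = tape.gradient(loss, w)
g2 = tape.gradient(noisy_loss, w)
print(g1.numpy(), g2.numpy())  # identical: the noise has zero gradient w.r.t. w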
I have implemented the following metrics to look at precision and recall of the classes I deem relevant:
metrics=[tf.keras.metrics.Recall(class_id=1, name='Bkwd_R'), tf.keras.metrics.Recall(class_id=2, name='Fwd_R'), tf.keras.metrics.Precision(class_id=1, name='Bkwd_P'), tf.keras.metrics.Precision(class_id=2, name='Fwd_P')]
How can I implement the same in TensorFlow 2.5 for the F1 score (i.e., specifically for class 1 and class 2, and not class 0), without a custom function?
Update
Using this metric setup:
tfa.metrics.F1Score(num_classes=3, average=None, name=f1_name)
I get the following during training:
13367/13367 [==============================] 465s 34ms/step - loss: 0.1683 - f1_score: 0.5842 - val_loss: 0.0943 - val_f1_score: 0.3314
and when I do model.evaluate:
224/224 [==============================] - 11s 34ms/step - loss: 0.0665 - f1_score: 0.3325
and the scores:
Score: [0.06653735041618347, array([0.99740255, 0. , 0. ], dtype=float32)]
The problem is that this trains based on the average F1 score, whereas I would like to train on a sensible average of, or on each of, the last two values/classes in the array (which are 0 in this case).
Edit
I will accept a non-TensorFlow-specific function that gives the desired result (with the full function and the call during fit), but I was really hoping for something using the existing TensorFlow code, if it exists.
You can have a look at https://www.tensorflow.org/addons/api_docs/python/tfa/metrics/F1Score in the tensorflow-addons package.
Specifically, if you need per-class scores, set the average param to None; macro instead gives their unweighted mean.
As is mentioned in David Harris' comment, a neural network model is trained on loss functions, not on metric scores. Losses help drive the model towards a solution that provides accurate labels via backpropagation. Metrics help to provide comparable evaluations of that model's performance that are a lot more human-legible.
So, that being said, I feel like what you're saying in your question is that "there are three classes, and I want the model to care more about the last two of the three".
If that's the case, one approach you can take is to weight your samples by label. Let's say that you have labels in an array y_train.
# Which classes are you wanting to focus on
classes_i_care_about = [1, 2]
# Initialize all weights to 1.0
sample_weight = np.ones(shape=(len(y_train),))
# Give the classes you care about 50% more weight
sample_weight[np.isin(y_train, classes_i_care_about)] = 1.5
...
model.fit(
    x=X_train,
    y=y_train,
    sample_weight=sample_weight,
    epochs=5
)
This is the best advice I can offer without knowing more. If you're looking for other ways to make your model do better on certain classes, additional info could be useful, such as:
What's the proportions of labels in your dataset?
What is the last layer of your model architecture? Dense(3, activation="softmax")?
What loss are you using?
Here's a more complete, reproducible example that shows what I'm talking about with the sample weights:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import tensorflow_addons as tfa

iris_data = load_iris()  # load the iris dataset

x = iris_data.data
y_ = iris_data.target.reshape(-1, 1)  # convert data to a single column

# One-hot encode the class labels
encoder = OneHotEncoder(sparse=False)
y = encoder.fit_transform(y_)

# Split the data for training and testing
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.20)

# Build the model
def get_model():
    model = Sequential()
    model.add(Dense(10, input_shape=(4,), activation='relu', name='fc1'))
    model.add(Dense(10, activation='relu', name='fc2'))
    model.add(Dense(3, activation='softmax', name='output'))
    # Adam optimizer with learning rate of 0.001
    optimizer = Adam(lr=0.001)
    model.compile(
        optimizer,
        loss='categorical_crossentropy',
        metrics=[
            'accuracy',
            tfa.metrics.F1Score(
                num_classes=3,
                average=None,
            )
        ]
    )
    return model

model = get_model()
model.fit(
    train_x,
    train_y,
    verbose=2,
    batch_size=5,
    epochs=25,
)

results = model.evaluate(test_x, test_y)
print('Final test set loss: {:4f}'.format(results[0]))
print('Final test set accuracy: {:4f}'.format(results[1]))
print('Final test F1 scores: {}'.format(results[2]))
Final test set loss: 0.585964
Final test set accuracy: 0.633333
Final test F1 scores: [1. 0.15384616 0.6206897 ]
Now, we add weight to classes 1 and 2:
sample_weight = np.ones(shape=(len(train_y),))
sample_weight[
    (train_y[:, 1] == 1) | (train_y[:, 2] == 1)
] = 1.5

model = get_model()
model.fit(
    train_x,
    train_y,
    sample_weight=sample_weight,
    verbose=2,
    batch_size=5,
    epochs=25,
)

results = model.evaluate(test_x, test_y)
print('Final test set loss: {:4f}'.format(results[0]))
print('Final test set accuracy: {:4f}'.format(results[1]))
print('Final test F1 scores: {}'.format(results[2]))
Final test set loss: 0.437623
Final test set accuracy: 0.900000
Final test F1 scores: [1. 0.8571429 0.8571429]
Here, the model has emphasized learning these classes, and their respective performance is improved.
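A related knob, sketched as an untested alternative: class_weight in fit weights every sample of a given class, which expresses the same 50% emphasis without building the per-sample array (recent tf.keras versions map one-hot rows back to class indices for this purpose):

model = get_model()
model.fit(
    train_x,
    train_y,
    class_weight={0: 1.0, 1: 1.5, 2: 1.5},  # same 50% extra weight on classes 1 and 2
    verbose=2,
    batch_size=5,
    epochs=25,
)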
I am trying to change a CNN classification model into a CNN regression model. The classification model took some press statements as one input and, as the second variable, the change of an index (0 for a negative return on the release day, 1 for a positive change). Now I am trying to change the end of the model from classification to regression, so that I can work with the actual returns and not with the binary classification.
So my input in the neural network looks like this:
            document                                            VIX 1d
1999-05-18  Release Date: May 18, 1999\n\nFor immediate re...  -0.010526
1999-06-30  Release Date: June 30, 1999\n\nFor immediate r...  -0.082645
1999-08-24  Release Date: August 24, 1999\n\nFor immediate...  -0.043144
(the document column is tokenized before going into the NN; this is just so you have an example)
So far I have changed the following parameters:
- the loss function is now mean squared error (before: binary cross-entropy)
- the activation of the last layer is now linear (before: sigmoid)
- the metric is now MSE (before: accuracy)
Below you can see my code:
all_words = [word for tokens in X for word in tokens]
all_sentence_lengths = [len(tokens) for tokens in X]
ALL_VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(ALL_VOCAB)))
print("Max sentence length is %s" % max(all_sentence_lengths))

####################### CHANGE THE PARAMETERS HERE #####################################
EMBEDDING_DIM = 300  # how big is each word vector
MAX_VOCAB_SIZE = 1893  # how many unique words to use (i.e. num rows in embedding vector)
MAX_SEQUENCE_LENGTH = 1086  # max number of words in a comment to use

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, lower=True, char_level=False)
tokenizer.fit_on_texts(change_df["document"].tolist())
training_sequences = tokenizer.texts_to_sequences(X_train.tolist())

train_word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(train_word_index))

train_embedding_weights = np.zeros((len(train_word_index)+1, EMBEDDING_DIM))
for word, index in train_word_index.items():
    train_embedding_weights[index, :] = w2v_model[word] if word in w2v_model else np.random.rand(EMBEDDING_DIM)
print(train_embedding_weights.shape)
######################## TRAIN AND TEST SET #################################
train_cnn_data = pad_sequences(training_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_sequences = tokenizer.texts_to_sequences(X_test.tolist())
test_cnn_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, trainable=False, extra_conv=True):
    embedding_layer = Embedding(num_words,
                                embedding_dim,
                                weights=[embeddings],
                                input_length=max_sequence_length,
                                trainable=trainable)
    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)

    # Yoon Kim model (https://arxiv.org/abs/1408.5882)
    convs = []
    filter_sizes = [3, 4, 5]
    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=128, kernel_size=filter_size, activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(pool_size=3)(l_conv)
        convs.append(l_pool)
    l_merge = concatenate([convs[0], convs[1], convs[2]], axis=1)

    # add a 1D convnet with global maxpooling, instead of Yoon Kim model
    conv = Conv1D(filters=128, kernel_size=3, activation='relu')(embedded_sequences)
    pool = MaxPooling1D(pool_size=3)(conv)

    if extra_conv:
        x = Dropout(0.5)(l_merge)
    else:
        # Original Yoon Kim model
        x = Dropout(0.5)(pool)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    preds = Dense(1, activation='linear')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='mean_squared_error',
                  optimizer='adadelta',
                  metrics=['mse'])
    model.summary()
    return model

x_train = train_cnn_data
y_tr = y_train
x_test = test_cnn_data

model = ConvNet(train_embedding_weights, MAX_SEQUENCE_LENGTH, len(train_word_index)+1, EMBEDDING_DIM, False)

# define callbacks
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=4, verbose=1)
callbacks_list = [early_stopping]

hist = model.fit(x_train, y_tr, epochs=5, batch_size=33, validation_split=0.1, shuffle=True, callbacks=callbacks_list)
y_tes = model.predict(x_test, batch_size=33, verbose=1)
Does someone have an idea what else I should change? The code is working, but I think the results are very poor. Running the code gives me the following:
Epoch 5/5
33/118 [=======>......................] - ETA: 15s - loss: 0.0039 - mse: 0.0039
66/118 [===============>..............] - ETA: 9s - loss: 0.0031 - mse: 0.0031
99/118 [========================>.....] - ETA: 3s - loss: 0.0034 - mse: 0.0034
118/118 [==============================] - 22s 189ms/step - loss: 0.0035 - mse: 0.0035 - val_loss: 0.0060 - val_mse: 0.0060
Or can you point me to a source where I can read more? I only find classification CNNs on the web, but no example of an NLP CNN doing regression.
Thanks a lot,
Lukas
This is a great example. Copy/paste the code, load the datasets; it should answer all of your questions.
# Classification with Tensorflow 2.0
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
# %matplotlib inline
import seaborn as sns
sns.set(style="darkgrid")
cols = ['price', 'maint', 'doors', 'persons', 'lug_capacity', 'safety', 'output']
cars = pd.read_csv(r'C:\\your_path\\cars_dataset.csv', names=cols, header=None)
cars.head()
price = pd.get_dummies(cars.price, prefix='price')
maint = pd.get_dummies(cars.maint, prefix='maint')
doors = pd.get_dummies(cars.doors, prefix='doors')
persons = pd.get_dummies(cars.persons, prefix='persons')
lug_capacity = pd.get_dummies(cars.lug_capacity, prefix='lug_capacity')
safety = pd.get_dummies(cars.safety, prefix='safety')
labels = pd.get_dummies(cars.output, prefix='condition')
# To create our feature set, we can merge the first six columns horizontally:
X = pd.concat([price, maint, doors, persons, lug_capacity, safety] , axis=1)
# Let's see how our label column looks now:
labels.head()
y = labels.values
# The final step before we can train our TensorFlow 2.0 classification model is to divide the dataset into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Model Training
# To train the model, let's import the TensorFlow 2.0 classes. Execute the following script:
from tensorflow.keras.layers import Input, Dense, Activation,Dropout
from tensorflow.keras.models import Model
# The next step is to create our classification model:
input_layer = Input(shape=(X.shape[1],))
dense_layer_1 = Dense(15, activation='relu')(input_layer)
dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)
output = Dense(y.shape[1], activation='softmax')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
# The following script shows the model summary:
print(model.summary())
# Result:
# Model: "model"
# Layer (type) Output Shape Param #
# Finally, to train the model execute the following script:
history = model.fit(X_train, y_train, batch_size=8, epochs=50, verbose=1, validation_split=0.2)
# Result:
# Train on 7625 samples, validate on 1907 samples
# Epoch 1/50
# - 4s 492us/sample - loss: 3.0998 - acc: 0.2658 - val_loss: 12.4542 - val_acc: 0.0834
# Let's finally evaluate the performance of our classification model on the test set:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])
# Result:
# Regression with TensorFlow 2.0
petrol_cons = pd.read_csv(r'C:\\your_path\\gas_consumption.csv')
# Let's print the first five rows of the dataset via the head() function:
petrol_cons.head()
X = petrol_cons.iloc[:, 0:4].values
y = petrol_cons.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Model Training
# The next step is to train our model. This process is quite similar to training the classification model. The only change will be in the loss function and the number of nodes in the output dense layer. Since we are now predicting a single continuous value, the output layer will only have 1 node.
input_layer = Input(shape=(X.shape[1],))
dense_layer_1 = Dense(100, activation='relu')(input_layer)
dense_layer_2 = Dense(50, activation='relu')(dense_layer_1)
dense_layer_3 = Dense(25, activation='relu')(dense_layer_2)
output = Dense(1)(dense_layer_3)
model = Model(inputs=input_layer, outputs=output)
model.compile(loss="mean_squared_error" , optimizer="adam", metrics=["mean_squared_error"])
# Finally, we can train the model with the following script:
history = model.fit(X_train, y_train, batch_size=2, epochs=100, verbose=1, validation_split=0.2)
# Result:
# Train on 30 samples, validate on 8 samples
# Epoch 1/100
# To evaluate the performance of a regression model on test set, one of the most commonly used metrics is root mean squared error. We can find mean squared error between the predicted and actual values via the mean_squared_error class of the sklearn.metrics module. We can then take square root of the resultant mean squared error. Look at the following script:
from sklearn.metrics import mean_squared_error
from math import sqrt
pred_train = model.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train)))
# Result:
# 57.398156439652396
pred = model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred)))
# Result:
# 86.61012708343948
# https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
# datasets:
# https://www.kaggle.com/elikplim/car-evaluation-data-set
# for OLS analysis
import statsmodels.api as sm
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
# Results:
                            OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):              0.987
Model:                            OLS   Adj. R-squared (uncentered):         0.986
Method:                 Least Squares   F-statistic:                         867.8
Date:                Thu, 09 Apr 2020   Prob (F-statistic):               3.17e-41
Time:                        13:13:11   Log-Likelihood:                    -269.00
No. Observations:                  48   AIC:                                 546.0
Df Residuals:                      44   BIC:                                 553.5
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1           -14.2390      8.414     -1.692      0.098     -31.196       2.718
x2            -0.0594      0.017     -3.404      0.001      -0.095      -0.024
x3             0.0012      0.003      0.404      0.688      -0.005       0.007
x4          1630.8913    130.969     12.452      0.000    1366.941    1894.842
==============================================================================
Omnibus:                        9.750   Durbin-Watson:                   2.226
Prob(Omnibus):                  0.008   Jarque-Bera (JB):                9.310
Skew:                           0.880   Prob(JB):                      0.00952
Kurtosis:                       4.247   Cond. No.                     1.00e+05
==============================================================================
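One caveat on the OLS snippet, and the reason the summary reports an uncentered R-squared: sm.OLS does not add an intercept by default. If an intercept is wanted, the standard idiom is sm.add_constant:

import statsmodels.api as sm

X_const = sm.add_constant(X)   # prepend a column of ones as the intercept
results = sm.OLS(y, X_const).fit()
print(results.summary())       # then reports the usual centered R-squared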
data sources:
https://www.kaggle.com/elikplim/car-evaluation-data-set
https://drive.google.com/file/d/1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_/view
Maybe two more questions:
1. You are getting quite high numbers for the regression root mean squared error (57.39 and 86.61), and I get 0.0851 (train) and 0.1169 (test) for my dataset. It seems that my values are quite good, right? The lower the root mean squared error, the better, or? I had my statistics class quite a while ago... :D
2. Do you maybe even know (or have an example of) how I would implement another variable in the regression in a neural network? In my case, I have text data and returns I want to predict. I would like to include some macroeconomic (control) variables as well.
Thanks!
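On question 1: RMSE is in the units of the target, so 0.085 on daily returns and 57 on gas consumption are not comparable; lower is better only within the same target scale. On question 2, a minimal sketch of a two-input model with the Keras functional API; VOCAB_SIZE, N_MACRO, and train_macro are illustrative names, not from the code above:

from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     GlobalMaxPooling1D, Dense, concatenate)
from tensorflow.keras.models import Model

MAX_SEQUENCE_LENGTH = 1086   # as in the question's code
VOCAB_SIZE = 20000           # illustrative
EMBEDDING_DIM = 300
N_MACRO = 3                  # number of macroeconomic control variables (illustrative)

# text branch
text_in = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
t = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(text_in)
t = Conv1D(128, 3, activation='relu')(t)
t = GlobalMaxPooling1D()(t)

# numeric branch for the macro variables
macro_in = Input(shape=(N_MACRO,))

# merge the branches and regress on the combined representation
merged = concatenate([t, macro_in])
merged = Dense(64, activation='relu')(merged)
out = Dense(1, activation='linear')(merged)

model = Model(inputs=[text_in, macro_in], outputs=out)
model.compile(loss='mean_squared_error', optimizer='adam')
# model.fit([train_cnn_data, train_macro], y_train, ...)  # one array per input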