I'm trying to learn Keras and trying something very simple. I have created a dataframe of 200,000 random letters with two columns: letter and is_x. is_x is set to 1 (or True) if letter is a capital "X".
Here's what I did so far:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(32, activation='tanh', input_shape=(X_train.shape[1],)))
model.add(Dense(16, activation='tanh'))
model.add(Dense(y_train.shape[1], activation='sigmoid'))
#model.compile(optimizer=SGD(), loss='categorical_crossentropy', metrics=['accuracy'])
model.compile(optimizer=Adam(lr=0.05), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(X_test, y_test))
results = model.evaluate(X_test, y_test)
y_predict = model.predict(X_test)
print(results)
print("---")
for i in y_predict:
    print(i)
and here are the results:
[0.09158177]
[0.09158175]
[0.09158177]
[0.09158177]
[0.09158175]
[0.09158177]
[0.09158173]
What I'm trying to get is 1 or 0 depending on whether is_x is True. I simply feed letter as X_ and is_x as y_, however I only get some numbers, and they all look the same, like 0.996 etc. Also, accuracy is something like 0.99, but it's far from reality.
I'm very confused about activation, optimizer, and loss. I couldn't understand which to choose or how to solve this simple problem. I have studied lots of training videos on Udemy, but no one explains why and how they use these functions.
I can't really answer the optimizer and activation piece very effectively, but I can provide some assistance with the other parts. tanh and relu are both pretty popular activation functions, so you should be fine with either of those. Likewise, Adam is an effective optimizer, so you should be fine on that level.
The loss function should be binary_crossentropy in your problem. This is used when you have two classes to learn (0/1). categorical_crossentropy is used for multiclass problems, and mse is useful for regression analyses. The objective of the algorithm is to minimize the value of this function, so you need to choose the one appropriate for the problem at hand.
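For reference, here is a minimal sketch of how each of these losses is typically specified in compile (illustrative only; each variant assumes a model whose output layer matches the loss):
# two classes, single sigmoid output unit:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# multiclass, softmax output with one-hot labels:
# model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# regression, linear output:
# model.compile(optimizer='adam', loss='mse', metrics=['mae'])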
Your accuracy is very high. A major reason for this is that you have uneven sample sizes between 'is x' and 'is not x' (class imbalance). All an algorithm has to do in order to achieve a really high score is predict 'not x' for everything.
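You can sanity-check that claim in one line (a sketch, assuming y_test is a 0/1 array): compute the accuracy a constant 'not x' predictor would get, and compare it with your model's 0.99.
import numpy as np
# accuracy of always predicting 'not x' (the majority class)
baseline_accuracy = 1 - np.mean(y_test)
print(baseline_accuracy)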
In order to assess your model a little bit more, try this:
from sklearn.metrics import confusion_matrix
# this will remove the probabilities and give 1/0
y_predict = (y_predict > .5)
# this will create a confusion matrix
print(confusion_matrix(y_test, y_predict))
This makes it easier to view the model accuracy by showing you how many times it predicted x and the actual result was x, how many times it predicted not x and the actual result was not x, along with the mislabeled cases.
                 predicted
               not x    x
actual not x     #      #
actual x         #      #
Using this tool you can assess how accurate your model was in a better way.
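If you also want per-class precision and recall on top of the raw counts, scikit-learn's classification_report is one option (a sketch, reusing the thresholded y_predict from above):
from sklearn.metrics import classification_report
# precision, recall and f1-score for each class, plus support (sample counts)
print(classification_report(y_test, y_predict, target_names=['not x', 'x']))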
Related
I am trying to build a model to predict house prices.
I have some features X (no. of bathrooms, etc.) and target Y (ranging from around $300,000 to $800,000).
I have used sklearn's StandardScaler to standardize Y before fitting the model.
Here is my Keras model:
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    model = Sequential()
    model.add(Dense(36, input_dim=36, activation='relu'))
    model.add(Dense(18, input_dim=36, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='mse', optimizer='sgd', metrics=['mae','mse'])
    return model
I am having trouble trying to interpret the results -- what does an MSE of 0.617454319755 mean?
Do I have to inverse transform this number, and square root the results, getting an error rate of 741.55 in dollars?
math.sqrt(sc.inverse_transform([mse]))
I apologise for sounding silly as I am starting out!
I apologise for sounding silly as I am starting out!
Do not; this is a subtle issue of great importance, which is usually (and regrettably) omitted in tutorials and introductory expositions.
Unfortunately, it is not as simple as taking the square root of the inverse-transformed MSE, but it is not that complicated either; essentially what you have to do is:
1. Transform your predictions back to the initial scale of the original data
2. Get the MSE between these inverse-transformed predictions and the original data
3. Take the square root of the result
in order to get a performance indicator of your model that will be meaningful in the business context of your problem (e.g. US dollars here).
Let's see a quick example with toy data, omitting the model itself (which is irrelevant here, and in fact can be any regression model - not only a Keras one):
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
# toy data
X = np.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
Y = np.array([3, 4, 5, 6, 7])
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X)
# outcome scaling:
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y.reshape(-1, 1))
Y_train
# array([[-1.41421356],
# [-0.70710678],
# [ 0. ],
# [ 0.70710678],
# [ 1.41421356]])
Now, let's say that we fit our Keras model (not shown here) using the scaled sets X_train and Y_train, and get predictions on the training set:
prediction = model.predict(X_train) # scaled inputs here
print(prediction)
# [-1.4687586 -0.6596055 0.14954728 0.95870024 1.001172 ]
The MSE reported by Keras is actually the scaled MSE, i.e.:
MSE_scaled = mean_squared_error(Y_train, prediction)
MSE_scaled
# 0.052299712818541934
while the 3 steps I have described above are simply:
MSE = mean_squared_error(Y, sc_Y.inverse_transform(prediction)) # first 2 steps, combined
MSE
# 0.10459946572909758
np.sqrt(MSE) # 3rd step
# 0.323418406602187
So, in our case, if our initial Y were US dollars, the actual error in the same units (dollars) would be 0.32 (dollars).
Notice how the naive approach of inverse-transforming the scaled MSE would give a very different (and incorrect) result:
np.sqrt(sc_Y.inverse_transform([MSE_scaled]))
# array([2.25254588])
MSE is mean squared error; the formula is MSE = (1/n) * Σ (y_i − ŷ_i)².
Basically, it is the mean of the squared differences between the expected output and the prediction. Taking its square root does not directly give you the difference between output and prediction; it is mainly useful for training.
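As a quick illustration of the formula (a sketch with made-up numbers):
import numpy as np
y_true = np.array([3.0, 4.0, 5.0])
y_pred = np.array([2.5, 4.0, 5.5])
mse = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0 + 0.25) / 3
print(mse)  # 0.1666...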
Currently you have built a model.
If you want to train the model, use this function:
model.fit(x=input_x_array, y=input_y_array, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None)
If you want to predict the output, you should use the following code:
prediction = model.predict(np.array(input_x_array))
print(prediction)
You can find more details here.
https://keras.io/models/about-keras-models/
https://keras.io/models/sequential/
I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold

def load_data():
    # load your data using this function
    ...

def create_model():
    # create your model using this function
    ...

def train_evaluate(model, data_train, labels_train, data_test, labels_test):
    # fit and evaluate here.
    ...

if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies of neural networks, I learned that the knowledge representation of a neural network is in the synaptic weights, and that during the training process the weights are updated, thereby reducing the network's error rate and improving its performance. (In my case, I'm using supervised learning.)
For better training and assessment of neural network performance, a commonly used method is cross-validation, which returns partitions of the data set for training and evaluating the model.
My question is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Do we define, train, and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code like this?
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not like this?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly your purpose is for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after the CV which is often only implied (and frequently missed by beginners) is that, after you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train your model again, this time with the entire available data.
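In code, that final step could look roughly like this (a sketch reusing the create_model from your snippet; the fit settings are illustrative):
# after CV has given you confidence in the design, rebuild and train on ALL the data
final_model = create_model()
final_model.fit(X, Y, epochs=10, batch_size=32)  # illustrative settings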
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model
estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 CPUs
results.mean()  # mean MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine-tune" the hyperparameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking", or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If you do something like this, you must be sure to fit PCA on your training set only, and then apply the transformation matrix from the training set to the test set.
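For example (a sketch, assuming numeric arrays X_train and X_test):
from sklearn.decomposition import PCA
pca = PCA(n_components=10)                    # illustrative number of components
X_train_reduced = pca.fit_transform(X_train)  # fit the PCA on the training set only
X_test_reduced = pca.transform(X_test)        # reuse the training-set transformation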
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during the training process, in order to simulate the arrival of new data.
So basically, the data on which you finally test your model should mimic the first portion of data you'll get from your client/application to apply your model on.
That's why cross-validation is so powerful: it makes every data point in your whole dataset serve as a simulation of new data.
And now - to answer your question - every cross-validation should follow the following pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because, after all, you'll first train your model and then use it on new data. In your second approach, you cannot treat it as a mimicry of the training process, because e.g. in the second fold your model would keep information from the first fold, which is not equivalent to your training procedure.
Of course, you could apply a training procedure which uses 10 folds of consecutive training in order to fine-tune the network. But that is then not cross-validation; you'll need to evaluate this procedure using some kind of scheme like the one above.
The commented-out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and, at the end, provide either those lower-level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)
X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)

scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
    model = create_model()
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
    idx += 1
print(scores)
print(scores.mean())
So yes, you do want to create a new model for each fold, as the purpose of this exercise is to determine how your model, as designed, performs on all segments of the data, not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach, you train a model with varying hyperparameters using the cross-validation splits, and keep track of the performance on the splits and overall. In the end, you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in-depth explanation, see sklearn Model Selection, and pay particular attention to the sections on Cross Validation and Grid Search.
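As a rough illustration (a sketch, not from the question; it assumes a create_model builder compatible with KerasClassifier, and the grid values are arbitrary):
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

estimator = KerasClassifier(build_fn=create_model, verbose=0)
param_grid = {'batch_size': [16, 32], 'epochs': [10, 50]}
# each hyperparameter combination is evaluated with 5-fold CV
grid = GridSearchCV(estimator, param_grid, cv=5)
grid_result = grid.fit(X, Y)
print(grid_result.best_params_, grid_result.best_score_)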
So I have a simple Sequential model built in Keras, and I'm asked to train it multiple times (specifically, 5 times) with the same set of data (although I would have to change the train-test split). I would then like to average these trained models in the sense of:
Average the final accuracy on train and on validation.
Average the learning curve.
I know I can do this with a loop using plain Python, but since this seems like a common thing to do, I wonder if there is already a built-in function for exactly that. I know there is a way to train multiple times and save the best model, but I just want the average of the final results.
Maybe you are thinking about Bagging. https://en.wikipedia.org/wiki/Bootstrap_aggregating
You can train multiple models and average their outputs, but not the models themselves. There isn't, and there won't be, a built-in function for that, because it is as simple as, for example, calculating an average in a regression task.
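If what you want is an ensemble-style average of outputs, that part really is a couple of lines (a sketch; trained_models is a hypothetical list of already-fitted models):
import numpy as np
# average the predictions of several independently trained models
all_preds = np.stack([m.predict(x_test) for m in trained_models])
avg_pred = all_preds.mean(axis=0)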
For averaging accuracy metrics over n trials, see the pseudocode below
def loop_model(x_train, y_train, x_test, y_test, n_loops=5):
    arr_metric = list()
    for i in range(n_loops):
        print("Trial ", i + 1, " of ", n_loops)
        model = build_model(units, activation, ...)
        history = model.fit(x_train, y_train, ...)
        y_pred = model.predict(x_test)
        metric = compute_metric(y_test, y_pred)
        arr_metric.append(metric)
        del model
        keras.backend.clear_session()
    # Convert to a numpy array to compute the mean
    avg_metric = np.array(arr_metric).mean()
    return avg_metric
The function build_model() builds and compiles the model, while compute_metric() computes your accuracy metric. These are not built-in functions but part of the pseudocode. Using a numpy array to compute the mean is one approach; there are other things you can try.
See this answer for why I suggested the last two lines in the for loop.
I was using a Sequential model in Keras for categorical classification.
given data:
x_train = np.random.random((5000, 20))
y_train = keras.utils.to_categorical(np.random.randint(10, size=(5000, 1)), num_classes=10)
x_test = np.random.random((500, 20))
y_test = keras.utils.to_categorical(np.random.randint(10, size=(500, 1)), num_classes=10)
feature scaling is important:
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
model
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(x_train, y_train,
epochs=100,
batch_size=32)
data to be predicted
z = np.random.random((9999000, 20))
Should I scale this data?
How to scale this data?
predictions = model.predict_classes(z)
As you can see, the training and testing samples are few compared to the data to be predicted (z). Using the scaler fitted with x_train to rescale x_test seems OK.
However, using the same scaler, fitted with only 5,000 samples, to rescale z (9,999,000 samples) seems not OK. Are there any best practices in the field of deep learning to solve this problem?
Classifiers that are not sensitive to feature scaling, like Random Forests, do not have this issue; for deep learning, however, the issue exists.
The training data shown here is for example purposes only. In the real problem, the training data does not come from the same (uniform) probability distribution. It is difficult to label data, and the training data has a human bias towards what is easy to label: only the samples that were easier to label were labelled.
However, using the same scaler fitted with only 5000 samples to rescale z (9999000 samples), seems not OK.
Not clear why you think so. This is exactly the standard practice, i.e. using the scaler fitted with your training data, as you have correctly done with your test data:
z_scaled = scaler.transform(z)
predictions = model.predict_classes(z_scaled)
The number of samples (500 or 10^6) does not make any difference here; the important thing is for all this data (x and z) to come from the same probability distribution. In practice (and for data that may still be in the future) this is only assumed (and one of the things to watch for after model deployment is exactly if this assumption does not hold, or ceases to be correct after some time). But especially here, with your simulated data coming from the exact same (uniform) probability distribution, this is exactly the correct thing to do.
I am trying to build an MLP model that takes a dataset consisting of 9 columns.
This is a sample (patient number, time in milliseconds, normalization of X, Y and Z, kurtosis, skewness, pitch, roll and yaw, label), respectively:
1,15,-0.248010047716,0.00378335508419,-0.0152548459993,-86.3738760481,0.872322164158,-3.51314800063,0
1,31,-0.248010047716,0.00378335508419,-0.0152548459993,-86.3738760481,0.872322164158,-3.51314800063,0
1,46,-0.267422664673,0.0051143782875,-0.0191247001961,-85.7662354031,1.0928406847,-4.08015176908,0
1,62,-0.267422664673,0.0051143782875,-0.0191247001961,-85.7662354031,1.0928406847,-4.08015176908,0
And this is my code. There is no error in it, but the results with and without the features are the same, so I am asking whether I used the right way to feed those features into the model.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
import pandas as pd
import itertools
import math
np.random.seed(7)
train = np.loadtxt("featwithsignalsTRAIN.txt", delimiter=",")
test = np.loadtxt("featwithsignalsTEST.txt", delimiter=",")
x_train = train[:,[2,3,4,5,6,7]]
x_test = test[:,[2,3,4,5,6,7]]
y_train = train[:,8]
y_test = test[:,8]
model = Sequential()
model.add(Dense(500, input_dim=6, activation='relu'))
model.add(Dense(300, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy' , optimizer='adam', metrics=['accuracy'])
# Fit the model
batch_size = 128
epochs = 10
hist = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=2,
)
avg = np.mean(hist.history['acc'])
print('The Average Testing Accuracy is', avg)
##Evaluate the model
score=model.evaluate(x_test, y_test, verbose=2)
print(score)
There is nothing wrong with your model, but it's possible that it simply doesn't learn anything useful. It could be that the learning rate is too high or too low, that you need more epochs, or that your features are simply not useful.
Here is some advice:
You can directly pass a validation set to your fit method, which will compute the same metrics on this set at the end of each epoch, and will let you see whether your model is learning something useful or just overfitting on the training set, without having to wait for the model to finish its training. (Make sure you use verbose=1 or 2 to see the training process.)
model.fit(..., validation_data=(x_test, y_test), ...)
I see that you used the history callback. A good practice is to look at how the accuracy changes from one epoch to another, instead of taking the mean. This lets you see whether your network is effectively learning something; a network rarely converges in the first epochs.
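For example, plotting the per-epoch history instead of averaging it (a sketch; depending on your Keras version the key is 'acc' or 'accuracy', and 'val_acc' only exists if you passed validation_data to fit):
import matplotlib.pyplot as plt
plt.plot(hist.history['acc'], label='train accuracy')
plt.plot(hist.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()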
Do you have an idea of the 'usefulness' of your features? You can get one by performing an exploratory analysis before creating your model, or by fitting a more 'conventional' model (linear regression, decision trees, random forest, ...). This is highly recommended before fitting a neural network, and it also lets you compare different types of models and see whether you really need to use neural networks.
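For example, a quick random-forest baseline on the same arrays (a sketch, reusing x_train/y_train/x_test/y_test from your code):
from sklearn.ensemble import RandomForestClassifier
baseline = RandomForestClassifier(n_estimators=100, random_state=7)
baseline.fit(x_train, y_train)
# if the MLP can't beat this score, revisit the features first
print(baseline.score(x_test, y_test))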
If you are sure that your features would at least perform better than a random guess, try playing with the learning rate. A learning rate that is too high could cause the model to overshoot the minimum, and one that is too small could cause the model to learn very slowly or get stuck in a local minimum. You could also try to tune the number of epochs.
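Tuning the learning rate just means passing an optimizer instance instead of a string (a sketch; Adam's default is 0.001, and the values listed are arbitrary starting points):
from keras.optimizers import Adam
# try e.g. 0.1, 0.01, 0.001, 0.0001 and compare the validation curves
model.compile(loss='binary_crossentropy',
              optimizer=Adam(lr=0.001),
              metrics=['accuracy'])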