All 4 columns are float64. I'm not sure what to do about this error; I've looked at similar Stack Overflow issues and nothing regarding converting to a numpy array and/or float32 seems to solve it.
This code is based on:
Google Colab
I just replaced the housing data with mlb pitching data and applied dropna() to the train and test dataframes.
Thank you.
In the function train_model(), cast the arrays with .astype(float).
So the line you need to edit:
features = {name:np.array(value) for name, value in dataset.items()}
... change to:
features = {name:np.array(value).astype(float) for name, value in dataset.items()}
You'll also need to do it in the next block as well:
test_features = {name:np.array(value).astype(float) for name, value in cut_test_df_norm.items()}
Full function code:
def train_model(model, dataset, epochs, label_name,
                batch_size=None):
  """Train the model by feeding it data."""

  # Split the dataset into features and label.
  features = {name:np.array(value).astype(float) for name, value in dataset.items()}
  label = np.array(features.pop(label_name))
  history = model.fit(x=features, y=label, batch_size=batch_size,
                      epochs=epochs, shuffle=True)
  print(features)

  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch

  # To track the progression of training, gather a snapshot
  # of the model's mean squared error at each epoch.
  hist = pd.DataFrame(history.history)
  mse = hist["mean_squared_error"]

  return epochs, mse
And:
# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value).astype(float) for name, value in cut_test_df_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)
Related
I have a big dataset that can't be loaded in RAM due to lack of enough memory.
What I am trying to do is train the model in x portions of the dataset to get the final model trained in the whole dataset as following:
num_divisione_dataset = 4
div_tr = int(len(x_tr) / num_divisione_dataset)
div_val = int(2160 / num_divisione_dataset)
num_training = int(math.ceil(100 / num_divisione_dataset))

for i in range(0, num_divisione_dataset - 1):
    model.fit(
        x_tr[div_tr*i:div_tr*(i+1)], y_tr[div_tr*i:div_tr*(i+1)],
        batch_size=32,
        callbacks=[model_checkpoint_callback],
        validation_data=(x_val, y_val),
        epochs=25
    )
Is it a right way to train a model?
The batch_size = 32 already trains the model in batches of size 32, so it seems you have two levels of batching: one that you built yourself and another that's provided by TensorFlow.
The problem with your batching is epochs=25. The TensorFlow batches alternate within an epoch, and in the next epoch it loops over the TensorFlow batches again. But you first train 25 epochs with your first portion, then 25 epochs with your second portion, etcetera.
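If you keep the manual portions, one possible fix is to swap the loops so that every portion is seen in every epoch. This is only a sketch, reusing the variable names from the question, not the only way to do it:
epochs = 25
for epoch in range(epochs):
    for i in range(num_divisione_dataset):
        model.fit(
            x_tr[div_tr*i:div_tr*(i+1)], y_tr[div_tr*i:div_tr*(i+1)],
            batch_size=32,
            callbacks=[model_checkpoint_callback],
            validation_data=(x_val, y_val),
            epochs=1,
        )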
I'm not sure this is best solved in software. It might be easier to just ignore the lack of RAM and let the OS swap to disk. Buying more RAM could be another viable route. But a possible software route would be an input pipeline.
Put your data in a CSV file. Then use tf.data.experimental.make_csv_dataset to load it in batches and pass it to model.fit. Make sure to set num_epochs=1, otherwise the dataset will loop forever.
Here you can find an example on how to use it.
A minimal version of the code would be:
DATASET_PATH = ...   # path of the csv file
LABEL_COLUMN = ...   # name of the column in the csv file representing the output
COLUMNS = ["a", "b", "c", "d"]   # names of the columns in the csv file representing the input
BATCH_SIZE = int(len(x_tr) / num_divisione_dataset)

def get_dataset(batch_size=5):
    return tf.data.experimental.make_csv_dataset(DATASET_PATH, batch_size=batch_size,
                                                 label_name=LABEL_COLUMN, num_epochs=1)

dataset = get_dataset(batch_size=BATCH_SIZE)

train_size = ...   # put the train_dataset size here
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)

columns = []
for c in COLUMNS:
    cln = tf.feature_column.numeric_column(c, shape=())
    columns.append(cln)
feature_layer = tf.keras.layers.DenseFeatures(columns)

model = Sequential()
model.add(feature_layer)
model.add(...)       # add your NN layers
model.compile(...)   # parameters to compile

history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    callbacks=[model_checkpoint_callback],
    epochs=25,
)
I was testing the usage of an h5 file vs flow_from_directory and noticed different validation scores during training, but very similar training scores, and when I tested the model on the test data both of them gave almost the same scores. For the experiment I am using the same model and giving both 5 epochs, and they both have the same starting weights, via get_weights and save_weights. I would like to use the h5 alternative since it cuts my training time per epoch from 2 minutes down to half a minute.
Using flow from directory
Here are the scores during training:
And the prediction results on the test data being:
loss: 1.9690 - accuracy: 0.4802
Using flow from the h5 file
And the "trained" model applied to the test data:
loss: 1.9695 - accuracy: 0.4822
From the stats, it looks like it is training on the same data and predicting on the same test data (because of the similar loss and accuracy scores), but the validation data is different.
How the h5 file was created
Below I will show the code for inserting the validation data into the dataset. For the training and testing images it's the same code, modified accordingly.
...
hdf5_file.create_dataset('x_val', val_data_shape, np.uint8)
hdf5_file.create_dataset('y_val', val_label_shape, np.uint8)
...
for i in range(len(val_it)):
    if i % 200 == 0:
        print(f"Validation: Done {i} of {len(val_it)}")
    x, y = val_it[i]
    img = x[0]
    label = y[0]
    hdf5_file['x_val'][i, ...] = img
    hdf5_file['y_val'][i, ...] = label
...
How the data was loaded to feed it into the model
First I create a datagen that applies the preprocessing needed for the ResNet model. The training data and the test data are handled in the same fashion.
datagen_val = ImageDataGenerator(preprocessing_function=preprocess_resnet)
val_it = datagen_val.flow_from_directory(BASE_PATH + 'val', batch_size=1)
val_x = hdf5_file['x_val']
val_y = hdf5_file['y_val']
datagen_val_h5 = ImageDataGenerator(preprocessing_function=preprocess_resnet)
val_it_h5 = datagen_val_h5.flow((val_x, val_y), batch_size=1)
The question is: why do the loss and accuracy scores on the validation dataset differ between the two methods, when on training and test they score very similarly?
I need to train a model on a dataset that requires more memory than my GPU has. What is the best practice for feeding the dataset to the model?
Here are my steps:
First of all, I load the dataset using batch_size:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
In the second step I prepare the data:
for record in raw_train_ds.take(1):
    train_images, train_labels = record['image'], record['label']

print(train_images.shape)

train_images = train_images.numpy().astype(np.float32) / 255.0
train_labels = tf.keras.utils.to_categorical(train_labels)
And then I feed the data to the model:
history = model.fit(train_images,train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
But at step 2 I prepared data only for the first batch and missed the rest of the batches, because model.fit is outside of the loop scope (which, as I understand, works for the first batch only).
On the other hand, I can't remove take(1) and move the model.fit method inside the loop, because yes, in that case I will handle all batches, but at the same time model.fit will be called at the end of each iteration, and in that case it also will not work properly.
So, how should I change my code to be able to work appropriately with a big dataset using model.fit? Could you point to an article, any documents, or just advise how to deal with this? Thanks.
Update
In my post below (Approach 1) I describe one approach to solving the problem - are there any other, better approaches, or is this the only way to solve it?
You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that it does not support a validation_split parameter but, as shown in this guide, tfds already gives you the functionality to have the splits of the data. So you would just need to get the test split dataset, transform it as above and pass it as validation_data to fit.
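A minimal sketch of that last step (the 'test' split name and the raw_test_ds / test_dataset_fit variables are my own, not from the answer above):
raw_train_ds, raw_test_ds = builder.as_dataset(split=['train', 'test'], batch_size=BATCH_SIZE)

normalize = lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label'])
train_dataset_fit = raw_train_ds.map(normalize)
test_dataset_fit = raw_test_ds.map(normalize)

history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS,
                    validation_data=test_dataset_fit)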
Approach 1
Thanks @jdehesa, I changed my code:
Load the dataset - in reality, it doesn't load data into memory until the first call to 'next' from the dataset iterator, and even then, I think the iterator will load a portion of data (a batch) with a size equal to BATCH_SIZE:
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[90%:]"], batch_size=BATCH_SIZE)
Collected all required transformations into one method:
def prepare_data(x):
    train_images, train_labels = x['image'], x['label']
    # TODO: resize image
    train_images = tf.cast(train_images, tf.float32) / 255.0
    # train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=NUM_CLASSES)
    train_labels = tf.one_hot(train_labels, NUM_CLASSES)
    return (train_images, train_labels)
Applied these transformations to each element (batch) in the dataset using tf.data.Dataset.map:
train_dataset_fit = raw_train_ds.map(prepare_data)
And then fed this dataset into model.fit - as I understand it, model.fit will iterate through all batches in the dataset:
train_dataset_fit = raw_train_ds.map(prepare_data)
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
I am trying to use an LSTM network in Keras to make predictions of time-series data one step into the future. The data I have is 5-dimensional, and I am trying to use the previous 3 periods of readings to predict a future value in the next period. I have normalised the data and removed all NaNs etc., and this is the code I am trying to use to train the network:
def Network_ii(IN, OUT, TIME_PERIOD, EPOCHS, BATCH_SIZE, LTSM_SHAPE):
    length = len(OUT)
    train_x = IN[:int(0.9 * length)]
    validation_x = IN[int(0.9 * length):]
    train_y = OUT[:int(0.9 * length)]
    validation_y = OUT[int(0.9 * length):]

    # Define Network & callback:
    train_x = train_x.reshape(train_x.shape[0], 3, 5)
    validation_x = validation_x.reshape(validation_x.shape[0], 3, 5)

    model = Sequential()
    model.add(LSTM(units=128, return_sequences=True, input_shape=(train_x.shape[1], 3)))
    model.add(LSTM(units=128))
    model.add(Dense(units=1))
    model.compile(optimizer='adam', loss='mean_squared_error')

    train_y = np.asarray(train_y)
    validation_y = np.asarray(validation_y)

    history = model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(validation_x, validation_y))

    # Score model
    score = model.evaluate(validation_x, validation_y, verbose=0)
    print('Test loss:', score)

    # Save model
    model.save(f"models/new_model")
I am attempting to roughly follow the steps outlined here- https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
However, no matter what adjustments I have made in terms of changing the number of dimensions used to train the network or the length of the time period, I cannot get the model to give predictions that are not either 1 or 0. This is even though the target data, in the array 'OUT', is made up of continuous data on [0, 1].
I think there may be something wrong with how I am setting up the Sequential() model, but I cannot see what to adjust. I am relatively new to this, so any help would be greatly appreciated.
You are probably using a prediction function that is not the standard. Maybe you are using predict_classes?
The one that is well documented and the standard is model.predict.
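For example (reusing validation_x from the question; just a sketch to show the difference):
# model.predict returns the raw continuous outputs of the final Dense(1) layer,
# whereas predict_classes turns them into hard 0/1 class labels.
preds = model.predict(validation_x)   # shape (n_samples, 1), continuous values
print(preds[:5])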
I have a large dataset and it doesn't fit in memory, so while training the SSD is being used and epochs take too much time.
I saved my dataset as 9 parts in .npz files. I chose the first part (part 0) as the validation part and didn't use it in training.
I use the code below, and the acc & val_acc results were fine, but I feel like I'm making a big mistake somewhere. I haven't seen any example like this.
for part in range(1, 9):
    X_Train, Y_Train = loadPart(part)
    history = model.fit(X_Train, Y_Train, batch_size=128, epochs=1, verbose=1)
and I also load part 0 as the test data:
val_loss, val_acc = model.evaluate(X_Test, Y_Test)
I checked val_acc after training on each part of the dataset and observed that val_acc was increasing.
Could you please let me know whether this usage is legal or illegal, and why?
EDIT:
I tried fit_generator, but it still uses the disk during training and the ETA was about 2,500 hours (with model.fit on the whole dataset it was about 30 minutes per epoch). I use the code below:
model.fit_generator(generate_batches(), steps_per_epoch=196000, epochs=10)

def generate_batches():
    for part in range(1, 9):
        x, y = loadPart(part)
        yield (x, y)

def loadPart(part):
    data = np.load('C:/FOLDER_PATH/' + str(part) + '.npz')
    return [data['x'], data['y']]
and the X data shape is (196000, 1536, 1).
EDIT 2:
I found an answer on GitHub (https://github.com/keras-team/keras/issues/4446). It says it is OK to call model.fit() multiple times in a for loop, but I'm still not sure what happens behind the scenes. What is the difference between calling model.fit() multiple times and calling it once with the whole dataset?
If your dataset does not fit in RAM, the Keras documentation suggests the following (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory):
You can do batch training using model.train_on_batch(x, y) and model.test_on_batch(x, y). See the models documentation.
Alternatively, you can write a generator that yields batches of training data and use the method model.fit_generator(data_generator, steps_per_epoch, epochs).
This means you could try to further split your training data into batches of 128 on your SSD and then do something like:
import glob
import numpy as np
def generate_batches(data_folder):
    while True:
        batches_paths = glob.glob("%s/*.npz" % data_folder)
        for batch_path in batches_paths:
            with np.load(batch_path) as batch:
                x, y = preprocess_batch(batch)
                yield (x, y)

model.fit_generator(generate_batches("/your-data-folder"), steps_per_epoch=10000, epochs=10)
The preprocess_batch function would be responsible for extracting your x and y from each .npz file and the steps_per_epoch argument in the fit_generator function should be the rounded up value of your number of data samples divided by your batch size.
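For instance, a rough sketch of preprocess_batch and the steps_per_epoch calculation (the 'x'/'y' keys follow the loadPart function above; the rest is an assumption):
import math

def preprocess_batch(batch):
    # Pull the arrays out of the opened .npz archive and cast the inputs to float32.
    x = batch['x'].astype(np.float32)
    y = batch['y']
    return x, y

# steps_per_epoch = ceil(number of samples / batch size),
# e.g. 196000 samples with a batch size of 128:
steps_per_epoch = math.ceil(196000 / 128)   # 1532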
More info:
https://keras.io/models/sequential/#fit_generator
https://www.pyimagesearch.com/2018/12/24/how-to-use-keras-fit-and-fit_generator-a-hands-on-tutorial/
You can also use Dask; it chunks the data into smaller sections by default if you have a dataset that doesn't fit into RAM.
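A minimal sketch of that idea, assuming the data can be opened lazily as a chunked Dask array (the npy-stack layout and paths here are illustrative, not from the original setup):
import dask.array as da

# Lazily open the stacked arrays; only one chunk is materialised at a time.
x = da.from_npy_stack('C:/FOLDER_PATH/x_stack')
y = da.from_npy_stack('C:/FOLDER_PATH/y_stack')

for i in range(x.numblocks[0]):
    x_chunk = x.blocks[i].compute()   # load one chunk into RAM
    y_chunk = y.blocks[i].compute()
    model.fit(x_chunk, y_chunk, batch_size=128, epochs=1, verbose=1)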
If you are training as described in your question and the training happens in one session, then there is no difference. But if you are training in multiple sessions and continuing from previous training, then you should save your model either after every epoch (i.e., after training through the 9 sets in 1 epoch) or, in your case, after every part of the dataset (i.e., after every one of the 9 parts), and in every new session load the weights using model.load_weights("path to model") before you continue your training.
You can save the model after every epoch using model.save("path to directory").
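A short sketch of that workflow (the checkpoint paths and the build_model helper are placeholders, not from the original code):
# --- session 1: train and checkpoint after every part ---
for part in range(1, 9):
    X_Train, Y_Train = loadPart(part)
    model.fit(X_Train, Y_Train, batch_size=128, epochs=1, verbose=1)
    model.save_weights('checkpoints/after_part_%d.h5' % part)

# --- session 2 (a new process): rebuild the same architecture, reload, continue ---
model = build_model()                              # hypothetical helper that recreates the model
model.load_weights('checkpoints/after_part_8.h5')
# ... continue training on more parts or further epochs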