Dataset for tensorflow shuffle() messing up predictions

Hi, I wrote a small function, shown below, that prepares my dataset before training and also shuffles it. However, the shuffle seems to run again every time I get output from the dataset, even though I do not call the function again.
def create_ds(x, y, shuffle=True, batch_size=512):
    # Renaming columns in the input df
    x = MyPackage.rename_df(x)
    ds = tf.data.Dataset.from_tensor_slices((dict(x), y))
    if shuffle:
        ds = ds.shuffle(len(x))
    ds = ds.batch(batch_size)
    return ds
Then I use this to create a tensorflow dataset for train and test.
train_ds = create_ds(
    train_df,
    train_df['target'].values,
    shuffle=True,
    batch_size=batch_size,
)
test_ds = create_ds(
    test_df,
    test_df['target'].values,
    shuffle=False,
    batch_size=batch_size_test,
)
# --- The model is built and compiled here (not included) ---
model.fit(
    train_ds,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.TensorBoard(log_dir=os.path.join(output_path, "logs")),
        tf.keras.callbacks.EarlyStopping(monitor="val_auc", patience=10),
    ],
    validation_data=test_ds,
    verbose=2,
)
# predict
pred_1 = model.predict(train_ds)
pred_2 = model.predict(train_ds)
But pred_1 != pred_2. This seems to be caused by shuffle=True when building train_ds: the rows still get swapped around. I thought the data was shuffled once when I built the dataset, so why is it reshuffled every time I use it?
For example, I predicted on 3 rows and got for pred_1:
array([[0.51523584],
       [0.50634336],
       [0.51264596]], dtype=float32)
and pred_2:
array([[0.50634336],
       [0.51523584],
       [0.51264596]], dtype=float32)
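
For context, tf.data.Dataset.shuffle takes a reshuffle_each_iteration argument, and by default the dataset is reshuffled on every pass over it, which includes every model.predict(train_ds) call. The sketch below is a minimal illustration on a toy range dataset, not the asker's data, and it assumes TensorFlow 2.x with eager execution. For predictions, building a second dataset with shuffle=False is an alternative that keeps the row order of the original DataFrame.

```python
import tensorflow as tf

# Toy dataset standing in for the real features (an assumption for illustration).
ds = tf.data.Dataset.range(100)

# Default behaviour: a new random order on every pass over the dataset.
reshuffled = ds.shuffle(100)
# reshuffle_each_iteration=False: one random order, repeated on every pass.
fixed = ds.shuffle(100, reshuffle_each_iteration=False)

fixed_pass_1 = [int(v) for v in fixed.as_numpy_iterator()]
fixed_pass_2 = [int(v) for v in fixed.as_numpy_iterator()]
print(fixed_pass_1 == fixed_pass_2)

reshuffled_pass_1 = [int(v) for v in reshuffled.as_numpy_iterator()]
reshuffled_pass_2 = [int(v) for v in reshuffled.as_numpy_iterator()]
# The two passes below will almost always differ in order.
print(reshuffled_pass_1 == reshuffled_pass_2)
```

With the fixed variant, two consecutive predict calls would see the rows in the same (shuffled) order, although that order still differs from the order of the source DataFrame.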

Related

How to deal with DataCollator and DataLoaders in Huggingface?

I'm having trouble combining a DataLoader with a DataCollator. The following code, which uses DataCollatorWithPadding, raises a ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. The error appears when I iterate through the batches.
from torch.utils.data.dataloader import DataLoader
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16,
                              collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=data_collator)
for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
However, I found another approach where I changed the DataCollator to lambda x: x. That gives me a TypeError: DistilBertForSequenceClassification object argument after ** must be a mapping, not list
from torch.utils.data.dataloader import DataLoader
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, collate_fn=lambda x: x)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=lambda x: x)
for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
For reproducibility, and for the rest of the code, I have provided a Jupyter Notebook on Google Colab. The errors are at the bottom of the notebook.
Link to Colab Notebook

If you take a look at the train_dataset object from your notebook:
print(train_dataset)
Output:
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})
DataCollatorWithPadding doesn't know how to pad the text column because it's just a string.
Since you've already tokenized the dataset, you can simply remove the text column like so:
train_dataset = train_dataset.remove_columns("text")
The other three columns are numeric, so the data collator can pad them and convert them to tensors. Your first training loop will then run as expected.
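
To see why the string column gets in the way, here is a rough plain-Python sketch of what a padding collator has to do. It is not the transformers implementation: pad_batch, the sample rows, and the token ids are all hypothetical and made up for illustration. Integer sequences can be padded to a common length, while a raw string has no padding value and cannot become part of a rectangular numeric batch.

```python
def pad_batch(rows, pad_id=0):
    """Pad every list-valued column to the longest row; reject raw strings."""
    batch = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        if isinstance(values[0], str):
            raise ValueError(f"cannot pad string column {key!r}")
        if isinstance(values[0], list):
            width = max(len(v) for v in values)
            values = [v + [pad_id] * (width - len(v)) for v in values]
        batch[key] = values
    return batch

# Two made-up tokenized examples of different lengths.
rows = [
    {"text": "great film", "label": 1,
     "input_ids": [101, 2307, 2143, 102], "attention_mask": [1, 1, 1, 1]},
    {"text": "bad", "label": 0,
     "input_ids": [101, 2919, 102], "attention_mask": [1, 1, 1]},
]

try:
    pad_batch(rows)
    failed = False
except ValueError:
    failed = True  # the "text" column stops the batch from being built

# Dropping the string column mirrors train_dataset.remove_columns("text").
numeric_rows = [{k: v for k, v in row.items() if k != "text"} for row in rows]
batch = pad_batch(numeric_rows)
print(batch["input_ids"])
```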

Trying to extract y_val from dataset throws "all the input arrays must have same number of dimensions"

I am very new to machine learning and Python in general. I'm working on a project that requires building an image classification model. I've read the data from my local disk using tf.keras.preprocessing.image_dataset_from_directory, and now I'm trying to extract x_val and y_val to generate a sklearn.metrics.classification_report.
The issue is that whenever I call:
y_val = np.concatenate([y_val, np.argmax(y.numpy(), axis=-1)])
I get the following error, and I have no idea why or how to fix it:
y_val = np.concatenate([y_val, np.argmax(y.numpy(), axis=-1)])
File "<array_function internals>", line 5, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)
Here's my code
#data is split into train and validation folders with 6 folders in each representing a class, like this:
#data/train/hamburger/<hamburger train images in here>
#data/train/pizza/<pizza train images in here>
#data/validation/hamburger/<hamburger test images in here>
#data/validation/pizza/<pizza test images in here>
#training_dir = ......
validation_dir = pathlib.Path('path to data dir on local disk')
#hyperparams
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    training_dir,
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    validation_dir,
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
class_names = train_ds.class_names
print(class_names)
print(val_ds.class_names)
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)
resize_and_rescale = tf.keras.Sequential([
    layers.experimental.preprocessing.Resizing(img_height, img_width),
    layers.experimental.preprocessing.Rescaling(1./255)
])
#normalization, augmentation, model layers
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
model.summary()
start_time = time.monotonic()
epochs = 1
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)
#plot
#testing random image
test_path = pathlib.Path('C:/Users/adi/Desktop/New folder/downloads/hamburger/images(91).jpg')
img = keras.preprocessing.image.load_img(
    test_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0) # Create a batch
predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
x_val = np.array([])
y_val = np.array([])
for x, y in val_ds:
    x_val = np.concatenate([x_val, np.argmax(model.predict(x), axis=-1)])
    y_val = np.concatenate([y_val, np.argmax(y.numpy(), axis=-1)]) #<----- crashes here
print(classification_report(y_val, x_val, target_names = ['doughnuts (Class 0)','french_fries (Class 1)', 'hamburger (Class 2)','hot_dog (Class 3)', 'ice_cream (Class 4)','pizza (Class 5)']))
Any ideas why I'm getting this error and how I can fix it? Alternatively, how can I get what I need to make classification_report work? Thank you.

You don't need the argmax operation to get the true classes.
Since you did not specify label_mode in tf.keras.preprocessing.image_dataset_from_directory, it defaults to 'int', so the labels are sparse, which means they are not one-hot encoded.
If you had one-hot encoded label vectors, your code above would be correct.
It would also be clearer to rename your arrays as shown here. When predicting one batch at a time inside a loop, you can call model(x) directly, which is more efficient than model.predict(x). The corrected code is:
predicted_classes = np.array([])
labels = np.array([])
for x, y in val_ds:
    predicted_classes = np.concatenate([predicted_classes, np.argmax(model(x), axis=-1)])
    labels = np.concatenate([labels, y.numpy()])
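
The shapes explain the original error. Here is a minimal NumPy sketch using a made-up batch of sparse integer labels (an assumption for illustration). Taking argmax over a 1-D label batch collapses it to a single 0-dimensional value, which cannot be concatenated with a 1-D array, whereas the labels themselves concatenate directly.

```python
import numpy as np

# Hypothetical batch of sparse (integer) labels, one class index per image.
y_batch = np.array([2, 0, 5, 1])

# argmax over a 1-D array returns a single index, i.e. a 0-dimensional result.
collapsed = np.argmax(y_batch, axis=-1)
print(np.ndim(collapsed))

labels = np.array([])
try:
    np.concatenate([labels, collapsed])
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Sparse labels are already class indices, so they can be appended as-is.
labels = np.concatenate([labels, y_batch])
print(labels)

# With one-hot labels, argmax along the last axis would be the right step.
one_hot = np.eye(6)[y_batch]
recovered = np.argmax(one_hot, axis=-1)
print(recovered)
```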

Tensorflow TypeError: `generator` must be callable in tf.data.Dataset.from_generator(gen)

I am slightly losing my mind over a simple task. I want to implement a simple RandomForestClassifier on images using tf.estimator.BoostedTreesClassifier (a gradient boosted tree is good enough, although the difference is clear to me). I'm following the https://www.tensorflow.org/tutorials/estimator/boosted_trees_model_understanding guide. I swapped the
# Use entire batch since this is such a small dataset.
NUM_EXAMPLES = len(y_train)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
        if shuffle:
            dataset = dataset.shuffle(NUM_EXAMPLES)
        # For training, cycle thru dataset as many times as need (n_epochs=None).
        dataset = (dataset
                   .repeat(n_epochs)
                   .batch(NUM_EXAMPLES))
        return dataset
    return input_fn
with my own function looking like this
# LOADING IMAGES USING TENSORFLOW
labels = ['some','fancy','labels']
batch_size = 32
datagen = ImageDataGenerator(
    rescale=1. / 255,
    data_format='channels_last',
    validation_split=0.1,
    dtype=tf.float32
)
train_generator = datagen.flow_from_directory(
    './images',
    classes=labels,
    target_size=(128, 128),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=True,
    subset='training',
    seed=42
)
valid_generator = datagen.flow_from_directory(
    './images',
    classes=labels,
    target_size=(128, 128),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=False,
    subset='validation',
    seed=42
)
# THE SWAPPED FUNCTION:
NUM_FEATURES = 128 * 128
NUM_EXAMPLES = len(train_generator)
def make_input_fn(gen, n_epochs=None, shuffle=True):
    def input_fn():
        dataset = tf.data.Dataset.from_generator(gen, (tf.float32, tf.int32))
        if shuffle:
            dataset = dataset.shuffle(NUM_EXAMPLES)
        # For training, cycle thru dataset as many times as need (n_epochs=None).
        dataset = (dataset
                   .repeat(n_epochs)
                   .batch(NUM_EXAMPLES))
        return dataset
    return input_fn

def _generator_(tf_gen):
    print(len(tf_gen))
    def arg_free():
        for _ in range(len(tf_gen)):
            X, y = next(iter(tf_gen))
            X = X.reshape((len(X), -1))
            print(X.shape)
            yield X, y
    return arg_free()

_gen = _generator_(train_generator)
print(callable(_gen))  # returns False, WHY?!
I don't understand why this is not working, or why on earth nobody ever thought about making a simple enough tutorial (or why I am not able to find it :D). If you are asking yourself why I want to use a random forest and not regular deep learning approaches: the RF is set by the supervising authority, and it has to be TF (and not e.g. sklearn).
Anyway, any help will be appreciated.
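
One likely cause, judging only from the code shown: _generator_ ends with return arg_free(), which calls the inner generator function and hands back a generator object. tf.data.Dataset.from_generator expects a callable that returns an iterator, and a generator object is not callable. The plain-Python sketch below uses hypothetical names (make_gen_object, make_gen_function) and a list in place of the Keras iterator to show the difference.

```python
def make_gen_object(batches):
    def arg_free():
        for batch in batches:
            yield batch
    return arg_free()   # calls the function: returns a generator object

def make_gen_function(batches):
    def arg_free():
        for batch in batches:
            yield batch
    return arg_free     # returns the function itself, which is callable

batches = [[1, 2], [3, 4]]   # stand-in data for illustration

gen_object = make_gen_object(batches)
gen_function = make_gen_function(batches)

print(callable(gen_object))     # False
print(callable(gen_function))   # True
print(list(gen_function()))     # [[1, 2], [3, 4]]
```

Under that assumption, returning arg_free without the parentheses would give from_generator something it can call each time the dataset is iterated.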

Tensorflow Keras use tfrecords also for validation

Right now I'm using Keras with the TensorFlow backend.
The dataset is stored in the tfrecords format. Training without a validation set works, but how do I integrate my validation tfrecords as well?
Let's assume this code as a rough skeleton:
def _ds_parser(proto):
    features = {
        'X': tf.FixedLenFeature([], tf.string),
        'Y': tf.FixedLenFeature([], tf.string)
    }
    parsed_features = tf.parse_single_example(proto, features)
    # get the data back as float32
    parsed_features['X'] = tf.decode_raw(parsed_features['X'], tf.float32)
    parsed_features['Y'] = tf.decode_raw(parsed_features['Y'], tf.float32)
    return parsed_features['X'], parsed_features['Y']
def datasetLoader(dataSetPath, batchSize):
    dataset = tf.data.TFRecordDataset(dataSetPath)
    # Maps the parser on every record. You can set the number of parallel loaders here
    dataset = dataset.map(_ds_parser, num_parallel_calls=8)
    # This dataset will go on forever
    dataset = dataset.repeat()
    # Set the batch size
    dataset = dataset.batch(batchSize)
    # Create an iterator
    iterator = dataset.make_one_shot_iterator()
    # Create your tf representation of the iterator
    X, Y = iterator.get_next()
    # Bring the data back in shape
    X = tf.reshape(X, [-1, 66, 198, 3])
    Y = tf.reshape(Y, [-1, 1])
    return X, Y
X, Y = datasetLoader('PATH-TO-DATASET', 264)
model_X = keras.layers.Input(tensor=X)
model_output = keras.layers.Conv2D(filters=16, kernel_size=3, strides=1, padding='valid', activation='relu',
                                   input_shape=(-1, 66, 198, 3))(model_X)
model_output = keras.layers.Dense(units=1, activation='linear')(model_output)
model = keras.models.Model(inputs=model_X, outputs=model_output)
model.compile(
    optimizer=optimizer,
    loss='mean_squared_error',
    target_tensors=[Y]
)
parallel_model.fit(
    epochs=epochs,
    steps_per_epoch=stepPerEpoch,
    shuffle=False,
    validation_data=????
)
The question is, how to pass the validation set?
I have found something related here: gcloud-ml-engine-with-keras, but I'm not sure how to fit this into my problem.

First, you don't need to use an iterator. A Keras model accepts a dataset object in place of separate data and label arguments, and it handles the iteration itself. You only need to specify steps_per_epoch, so you need to know the dataset size. If you have separate tfrecords files for training and validation, you can create a dataset object from each and pass the validation one to validation_data. If you have a single file and would like to split it, you can do:
dataset = tf.data.TFRecordDataset('file.tfrecords')
dataset_train = dataset.take(size)
dataset_val = dataset.skip(size)
...
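
As a small illustration of the split, here is a sketch on a toy range dataset rather than a real TFRecord file (an assumption, so that it runs without data on disk; size is a hypothetical value). take keeps the first size elements and skip keeps the rest, so the two parts do not overlap as long as the dataset is not shuffled before the split.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)   # stand-in for tf.data.TFRecordDataset(...)
size = 8                              # hypothetical number of training examples

dataset_train = dataset.take(size)    # first `size` elements
dataset_val = dataset.skip(size)      # everything after them

train_items = [int(x) for x in dataset_train.as_numpy_iterator()]
val_items = [int(x) for x in dataset_val.as_numpy_iterator()]
print(train_items)   # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_items)     # [8, 9]
```

Each part can then be mapped and batched separately, with the second one passed as validation_data.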

OK, I found the answer myself: it is done by simply changing import keras to import tensorflow.keras as keras. tf.keras also allows you to pass the validation set as tensors:
X, Y = datasetLoader('PATH-TO-DATASET', 264)
X_val, Y_val = datasetLoader('PATH-TO-VALIDATION-DATASET', 264)
# ... define and compile the model like above
parallel_model.fit(
    epochs=epochs,
    steps_per_epoch=STEPS_PER_EPOCH,
    shuffle=False,
    validation_data=(X_val, Y_val),
    validation_steps=STEPS_PER_VALIDATION_EPOCH
)

Randomly augmenting images using Keras

I am experimenting with MNIST dataset to learn Keras library. In MNIST, there are 60k training images and 10k validation images.
In both sets, I'd like to introduce augmentation on 30% of the images.
datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
datagen.fit(training_images)
datagen.fit(validation_images)
This does not augment the images, and I am not sure how to use the model.fit_generator method. My current model.fit call is as follows:
model.fit(training_images, training_labels, validation_data=(validation_images, validation_labels), epochs=10, batch_size=200, verbose=2)
How do I apply augmentation on some of the images in this dataset?

I'd try to define my own generator in the following manner:
import numpy
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split

def partial_flow(array, flags, generator, aug_percentage, batch_size):
    # Splitting data into arrays which will be augmented and which won't.
    # train_test_split returns (train_x, test_x, train_y, test_y).
    not_aug_array, aug_array, not_aug_flags, aug_flags = train_test_split(
        array,
        flags,
        test_size=aug_percentage)
    # Preparation of generators which will be used for augmentation.
    # Number of augmented samples in each batch.
    aug_split_size = int(batch_size * aug_percentage)
    # We will use a generator without any augmentation to yield not augmented data
    not_augmented_gen = ImageDataGenerator()
    aug_gen = generator.flow(
        x=aug_array,
        y=aug_flags,
        batch_size=aug_split_size)
    not_aug_gen = not_augmented_gen.flow(
        x=not_aug_array,
        y=not_aug_flags,
        batch_size=batch_size - aug_split_size)
    # Yielding data
    while True:
        # Getting augmented data
        aug_x, aug_y = next(aug_gen)
        # Getting not augmented data
        not_aug_x, not_aug_y = next(not_aug_gen)
        # Concatenation
        current_x = numpy.concatenate([aug_x, not_aug_x], axis=0)
        current_y = numpy.concatenate([aug_y, not_aug_y], axis=0)
        yield current_x, current_y
Now you could run training by:
batch_size = 200
# datagen is the ImageDataGenerator with flips from the question; 0.3 augments 30% of each batch
model.fit_generator(partial_flow(training_images, training_labels, datagen, 0.3, batch_size),
                    steps_per_epoch=int(training_images.shape[0] / batch_size),
                    epochs=10,
                    validation_data=partial_flow(validation_images, validation_labels, datagen, 0.3, batch_size),
                    validation_steps=int(validation_images.shape[0] / batch_size))
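
To make the idea of augmenting only a fraction of the images concrete, here is a small NumPy-only sketch. It is a simpler stand-in, not the generator approach above: augment_fraction is a hypothetical helper, the images are fake 4x4 arrays, and the only augmentation is a horizontal flip.

```python
import numpy as np

def augment_fraction(images, fraction, rng):
    """Return a copy of images with a random `fraction` of them flipped left-right."""
    images = images.copy()
    n_aug = int(len(images) * fraction)
    # Pick which images get augmented, without repeats.
    chosen = rng.choice(len(images), size=n_aug, replace=False)
    images[chosen] = images[chosen][:, :, ::-1]   # flip along the width axis
    return images, chosen

rng = np.random.default_rng(0)
# 10 fake images of shape (height, width) = (4, 4), for illustration only.
batch = np.arange(10 * 4 * 4, dtype=np.float32).reshape(10, 4, 4)

augmented, chosen = augment_fraction(batch, 0.3, rng)
print(len(chosen))       # 3 of the 10 images were flipped
print(sorted(chosen))
```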
