How to get Data Generator more efficient? - python

To train a neural network, I modified a code I found on YouTube. It looks as follows:
def data_generator(samples, batch_size, shuffle_data = True, resize=224):
num_samples = len(samples)
while True:
random.shuffle(samples)
for offset in range(0, num_samples, batch_size):
batch_samples = samples[offset: offset + batch_size]
X_train = []
y_train = []
for batch_sample in batch_samples:
img_name = batch_sample[0]
label = batch_sample[1]
img = cv2.imread(os.path.join(root_dir, img_name))
#img, label = preprocessing(img, label, new_height=224, new_width=224, num_classes=37)
img = preprocessing(img, new_height=224, new_width=224)
label = my_onehot_encoded(label)
X_train.append(img)
y_train.append(label)
X_train = np.array(X_train)
y_train = np.array(y_train)
yield X_train, y_train
Now, I tried to train a neural network using this code, train sample size is 105.000 (image files which contain 8 characters out of 37 possibilities, A-Z, 0-9 and blank space).
I used a relatively small batch size (32, I think that is already too small) to get it more efficient but nevertheless it took like forever to train one quarter of the first epoch (I had 826 steps per epoch, and it took 90 minutes for 199 steps... steps_per_epoch = num_train_samples // batch_size).
The following functions are included in the data generator:
def shuffle_data(data):
data=random.shuffle(data)
return data
I don't think we can make this function anyhow more efficient or exclude it from the generator.
def preprocessing(img, new_height, new_width):
img = cv2.resize(img,(new_height, new_width))
img = img/255
return img
For preprocessing/resizing the data I use this code to get the images to a unique size of e.g. (224, 224, 3). I think, this part of the generator takes the most time, but I don't see a possibility to exclude it from the generator (since my memory would be full, if we resize the images outside the batches).
#One Hot Encoding of the Labels
from numpy import argmax
# define input string
def my_onehot_encoded(label):
# define universe of possible input values
characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(characters))
int_to_char = dict((i, c) for i, c in enumerate(characters))
# integer encode input data
integer_encoded = [char_to_int[char] for char in label]
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
character = [0 for _ in range(len(characters))]
character[value] = 1
onehot_encoded.append(character)
return onehot_encoded
I think, in this part there could be one approach to make it more efficient. I am thinking about to exclude this code from the generator and produce the array y_train outside of the generator, so that the generator does not have to one hot encode the labels every time.
What do you think? Or should I maybe go for a completely different approach?

I have found your question very intriguing because you give only clues. So here is my investigation.
Using your snippets, I have found GitHub repository and 3 part video tutorial on YouTube that mainly focuses on the benefits of using generator functions in Python.
The data is based on this kaggle (I would recommend to check out different kernels on that problem to compare the approach that you already tried with another CNN networks and review API in use).
You do not need to write a data generator from scratch, though it is not hard, but inventing the wheel is not productive.
Keras has the ImageDataGenerator class.
Plus here is a more generic example for DataGenerator.
Tensorflow offers very neat pipelines with their tf.data.Dataset.
Nevertheless, to solve the kaggle's task, the model needs to perceive single images only, hence the model is a simple deep CNN. But as I understand, you are combining 8 random characters (classes) into one image to recognize multiple classes at once. For that task, you need R-CNN or YOLO as your model. I just recently opened for myself YOLO v4, and it is possible to make it work for specific task really quick.
General advice about your design and code.
Make sure the library uses GPU. It saves a lot of time. (Even though I repeated flowers experiment from the repository very fast on CPU - about 10 minutes, but resulting predictions are no better than a random guess. So full training requires a lot of time on CPU.)
Compare different versions to find a bottleneck. Try a dataset with 48 images (1 per class), increase the number of images per class, and compare. Reduce image size, change the model structure, etc.
Test brand new models on small, artificial data to prove the idea or use iterative process, start from projects that can be converted to your task (handwriting recognition?).

Related

10 fold cross validation python

There is a deep learning based model using Transfer Learning and LSTM in this article, that author used 10 fold cross validation (as explained in table 3) and took the average of results.
I am familiar with 10 fold cross validation as we need to divide the data and pass to the model, however in this code(here) I can't figure out how to partition data and pass it.
There is two train/test/dev datasets (one for emotion analysis, and one for sentiment analysis we use both for transfer learning, but my focus is on emotion analysis). The raw data is in couple of files in txt format, and after running the model, it gives two new txt files, one for predicted labels, one for true labels.
There is a line of code in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode=='train':
model.train(data)
sess = model.restore_last_session()
model.predict(data, sess)
if args.mode=='test':
sess = model.restore_last_session()
model.predict(data, sess)
in which the 'data' is a class of Data(code) that includes test/train/dev datasets:
which I think I need to pass the divided data here. If I am right, how can I do partitioning and perform 10 fold cross validation?
data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
class Data(object):
def __init__(self,data_path,vocab_path,pretrained,batch_size):
self.batch_size = batch_size
data, vocab ,pretrained= self.load_vocab_data(data_path,vocab_path,pretrained)
self.train=data['train']
self.valid=data['valid']
self.test=data['test']
self.train2=data['train2']
self.valid2=data['valid2']
self.test2=data['test2']
self.word_size = len(vocab['word2id'])+1
self.max_sent_len = vocab['max_sent_len']
self.max_topic_len = vocab['max_topic_len']
self.word2id = vocab['word2id']
word2id = vocab['word2id']
#self.id2word = dict((v, k) for k, v in word2id.iteritems())
self.id2word = {}
for k, v in six.iteritems(word2id):
self.id2word[v]=k
self.pretrained=pretrained
by the look of it, seems the train method can get the session and continue to train from existing model def train(self, data, sess=None)
so with a very minimal changes to existing code and libraries you can do smth like
first load all the data and build the model
data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
then create the cross validation data set, smth like
def get_new_data_object():
return data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
cross_validation = []
for i in range(10):
tmp_data = get_new_data_object()
tmp_data.train= #get 90% of tmp_data['train']
tmp_data.valid= #get 90% of tmp_data['valid']
tmp_data.test= #get 90% of tmp_data['test']
tmp_data.train2= #get 90% of tmp_data['train2']
tmp_data.valid2= #get 90% of tmp_data['valid2']
tmp_data.test2= #get 90% of tmp_data['test2']
cross_validation.append(tmp_data)
than run the model n times (10 for 10-fold cross validation)
sess = null
for data in cross_validation:
model.train(data, sess)
sess = model.restore_last_session()
keep in mind to pay attention to some key ideas
I don't know how your data is structured exactly but that effect the way of splitting it to test, train and (in your case) valid
the splitting of data has to be the exact split for each triple of test, train and valid, it can be done randomly or taking different part every time, as long it consistent
you can train the model n times with cross validation or create n models and pick the best to avoid overfitting
this code is just a draft, you can implement it how you would like, there are some great library that already implemented such functionality, and of course can be optimize (not reading the whole data files each time)
one more consideration is to separate the model creation from the data, especially the data arg of the model constructor, from a quick look it seems it only use the dimension of the data, so its a good practice not to pass the whole object
more over, if the model integrate other properties of the data object in it's state (when creating), like the data itself, my code might not work and a more surgical approach
hope it helps, and point you in the right direction

How to write an efficient custom Keras data generator

I would like to train a convolutional recurrent neural network for video frame prediction. The individual frames are quite big so it is challenging to fit the entire training data in memory at once. As such, I followed some tutorials online to create a custom data generator. When testing it, it seems to work but it is slower by a factor of at least 100 than using the pre-loaded data directly. Since I can only fit about a batch size of 8 on the GPU I understand that the data needs to be generated really fast, however, this does not seem to be the case.
I train my model on a single P100 and have 32 GB of memory available to be used by up to 16 cores.
class DataGenerator(tf.keras.utils.Sequence):
def __init__(self, images, input_images=5, predict_images=5, batch_size=16, image_size=(200, 200),
channels=1):
self.images = images
self.input_images = input_images
self.predict_images = predict_images
self.batch_size = batch_size
self.image_size = image_size
self.channels = channels
self.nr_images = int(len(self.images)-input_images-predict_images)
def __len__(self):
return int(np.floor(self.nr_images) / self.batch_size)
def __getitem__(self, item):
# Randomly select the beginning image of each batch
batch_indices = random.sample(range(0, self.nr_images), self.batch_size)
# Allocate the output images
x = np.empty((self.batch_size, self.input_images,
*self.image_size, self.channels), dtype='uint8')
y = np.empty((self.batch_size, self.predict_images,
*self.image_size, self.channels), dtype='uint8')
# Get the list of input an prediction images
for i in range(self.batch_size):
list_images_input = range(batch_indices[i], batch_indices[i]+self.input_images)
list_images_predict = range(batch_indices[i]+self.input_images,
batch_indices[i]+self.input_images+self.predict_images)
for j, ID in enumerate(list_images_input):
x[i, ] = np.load(np.reshape(self.images[ID], (*self.imagesize, self.channels))
# Read in the prediction images
for j, ID in enumerate(list_images_predict):
y[i, ] = np.load(np.reshape(self.images[ID], (*self.imagesize, self.channels))
return x, y
# Training the model using fit_generator
params = {'batch_size': 8,
'input_images': 5,
'predict_images': 5,
'image_size': (100, 100),
'channels': 1
}
data_path = "input_frames/"
input_images = sorted(glob.glob(data_path + "*.png"))
training_generator = DataGenerator(input_images, **params)
model.fit_generator(generator=training_generator, epochs=10, workers=6)
I would have expected that Keras will prepare the next data batch while the current batch is being processed on the GPU but it does not seem to catch up. In other words, preparing the data before sending it to the GPU seems to be the bottleneck.
Any idea on how to improve the performance of a data generator like this? Is there something missing that guarantees that the data is being prepared in a timely manner?
Thanks a lot!
When you use fit_generator, there is a workers= setting that can be used to scale up the number of generator workers. However you should ensure that the 'item' parameter in getitem is taken into account in order to ensure that the different workers (which are not synchronised) return different values depending on item index. i.e. instead of random sample, perhaps just return a slice of the data based on the index. You can shuffle the entire dataset before starting in order to make sure the dataset order is randomised.
Can you please try with use_multiprocessing=True? These are the numbers I observe on my GTX 1080Ti based system with the data generator you provided.
model.fit_generator(generator=training_generator, epochs=10, workers=6)
148/148 [==============================] - 9s 60ms/step
model.fit_generator(generator=training_generator, epochs=10, workers=6, use_multiprocessing=True)
148/148 [==============================] - 2s 11ms/step
You can try the prefetching of tf.data.Dataset. The prefetching allows you to compute the next batch(es) using your CPU while your GPU computes the gradient descent in the same time. Be careful: you need to change the numpy array into tf.constant in the data generator. Then try:
import tensoflow as tf
generator = DataGenerator(images)
spec = [tf.TypeSpec(shape=(generator.batch_size, generator.input_images,
*generator.image_size, generator.channels), dtype='uint8'),
tf.TypeSpec(shape=(generator.batch_size, generator.predict_images,
*generator.image_size, generator.channels), dtype='uint8')
dataset = tf.data.Dataset.from_generator(DataGenerator, output_signature=spec)
dataset.batch(batch_size).prefetch(-1) # this order is important
# a custom training loop is better than model.fit() otherwise prefetching can fail
def train_loop():
...
You can change the "-1" in prefetch() to another value like 1, 2 or more to get the maximum speed depending on your machine and the batch size.
this blog helps in setting up input data pipeline with tf.data and it also is much more efficient than using ImageDataGenerators and the code is also explained by using a custom data directory.
It also enhances the performance with prefetch, cache.
Prefetch processes the next batch while the current batch is being used.

Reversing an ML model

I am currently teaching myself the basics of machine learning by creating a simple image classifier using Keras (with a Tensorflow backend). The model classifies a (greyscaled) image as either a cat or not a cat.
My model is relatively good at this task, so I now want to see if it can generate images that it would classify as a cat.
I have attempted to start this in a simple way, by creating a random array of the same shape as the images, with random numbers in each index:
from random import randint
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
model.load_weights("model_weights.h5")
confidence = 0.0
thresholdConfidence = 0.6
while confidence < thresholdConfidence:
img_array = np.array([[[randint(0, 255) for z in range(1)] for y in range(64)] for x in range(64)])
img_array = img_array.reshape((1,) + img_array.shape)
confidence = model.predict(img_array)
This method is obviously not good at all, since it just creates random things and could potentially run eternally. Could the model somehow run in reverse by telling it that an array is 100% cat, and having it predict what the array representation of the image is?
Thank you for reading.
[This is my first post on StackOverflow, so please let me know if I've done something wrong!]
If you wish to generate a special type of image, you can use Generative Adversary Networks. This are made into two parts which need to be trained separately. The two parts are
Generator : Creates noise that is random images.
Discriminator : Gives feedback to the generator regarding the images
You can refer here.

Big HDF5 dataset, how to efficienly shuffle after each epoch

I'm currently working with a big image dataset (~60GB) to train a CNN (Keras/Tensorflow) for a simple classification task.
The images are video frames, and thus highly correlated in time, so I shuffled the data already once when generating the huge .hdf5 file...
To feed the data into the CNN without having to load the whole set at once into memory I wrote a simple batch generator (see code below).
Now my question:
Usually it is recommended to shuffle the data after each training epoch right? (for SGD convergence reasons?) But to do so I'd have to load the whole dataset after each epoch and shuffle it, which is exactly what I wanted to avoid using the batch generator...
So: Is it really that important to shuffle the dataset after each epoch and if yes how could I do that as efficiently as possible?
Here is the current code of my batch generator:
def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
"""
Generator that returns batches of images ('xs') and labels ('ys') from a h5 file.
"""
filesize = len(hdf5_file['labels'])
while 1:
# count how many entries we have read
n_entries = 0
# as long as we haven't read all entries from the file: keep reading
while n_entries < (filesize - batch_size):
# start the next batch at index 0
# create numpy arrays of input data (features)
xs = hdf5_file['images'][n_entries: n_entries + batch_size]
xs = np.reshape(xs, dimensions).astype('float32')
# and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
y_values = hdf5_file['labels'][n_entries:n_entries + batch_size]
#ys = keras.utils.to_categorical(y_values, num_classes)
ys = to_categorical(y_values, num_classes)
# we have read one more batch from this file
n_entries += batch_size
yield (xs, ys)
Yeah, shuffling improves performance since running the data in the same order each time may get you stuck in suboptimal areas.
Don't shuffle the entire data. Create a list of indices into the data, and shuffle that instead. Then move sequentially over the index list and use its values to pick data from the data set.

Classifying images using adjusted CNN tutorial - system is jumbling the output

Sorry, this is a long one!
I am 80% sure that the problem is that I don't fully understand how tensorflow uses the tf.train.batch function to queue data.
I am trying to adapt one of the tensorflow tutorials to classify a large number of images.
Tutorial can be found here: https://www.tensorflow.org/tutorials/deep_cnn
I have built some modules which can encode my raw data in the same format that cifar10 uses. I am using this to construct training and evaluation data which the program is able to evaluate to a high degree of accuracy. Accuracy varies depending on the quality of the imagesets I put in. To keep things simple I have trained it using 32x32 monochrome tiles of either yellow or blue (category 0 and 1 respectively). Conveniently the network is able to identify whether it is being given a yellow or blue tile with 100% accuracy.
I have also been able to adapt cifar10_eval.py to output predictions rather than an accuracy percentage. This allows me to feed in un-classified data and output predictions as a list. To do this I have exchanged the statement:
top_k_op = tf.nn.in_top_k(logits, labels, 1)
for:
output_2 = tf.argmax(logits, 1)
I have added a variable and a boolean to the eval_once function call to allow it to access the definition for "output_2" and to let me switch between this and "top_k_op" depending on whether I am in evaluation mode or if I am predicting new data.
So far so good. This method works for small amounts of input data but fails as soon as I want to output more than 128 classifications. Not coincidentally 128 is the batch size.
In theory the first item (3073 bytes) in the binary should correspond to the first item in the list which is churned out when I am predicting new data. This happens for inputs of up to 128 images but the data gets jumbled up when I try to categorise more images. Actually, some of the predictions are lost completely!
There are a couple of reasons that this happens. The tutorial isn't designed to care about the order in which data is read or processed, just that individual images correspond with their labels. Originally the data loss was randomised(!) but I have managed to remove the random element by removing multi-threading (threads = 1 rather than 16) and stopped it from shuffling filenames.
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
string_input_producer has a hidden/optional argument which shuffles the file names. For model evaluation I have set this to false as above.
However.... I am still stuck with jumbled data loss when evaluating data larger than a single batch.
Does anyone know why this happens and have any ideas about how it could be fixed?
In theory I could redesign the code to rebuild the graph and evaluate it for 128 images at a time. However, I want to classify millions of images and feel that I'd be asking for trouble trying to open a new graph instance per batch.
PS, I've done my homework:
I have verified that my initial data to binary conversion works by running a program which can read cifar10-style files and interpret it as a big tile of images. I have run this code on both the original cifar10 binaries and my own binaries and am able to reconstruct both perfectly.
When I encode uncategorised data I add a category label of zero to make sure the tutorial can read the file. However, I make sure that this label is chucked away at the file reading stage and thus is not used when generating a list of predictions.
I have verified the output predictions by printing the list directly onto the screen as a python output and also by using it to assemble a PNG image which can be compared with the original inputs. This verification works perfectly for small batch sizes and starts to fall apart in larger batch sizes.
I've also made some modifications to the tutorial not discussed in this post. These are simple modifications such as changing the number of categories to 2 rather than 10. Am confident that this is not the issue.
PPS, here is a copy of some functions from the modified script. I haven't pasted everything because this question is already huge:
from cifar10_eval:
def eval_once(saver, summary_writer, top_k_op, output_2, summary_op, mapping=False):
"""Run Eval once.
Args:
saver: Saver.
summary_writer: Summary writer.
top_k_op: Top K op.
summary_op: Summary op.
"""
with tf.Session() as sess:
ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
if ckpt and ckpt.model_checkpoint_path:
# Restores from checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
# Assuming model_checkpoint_path looks something like:
# /my-favorite-path/cifar10_train/model.ckpt-0,
# extract global_step from it.
global_step = ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]
else:
print('No checkpoint file found')
return
# Start the queue runners.
coord = tf.train.Coordinator()
try:
threads = []
for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
threads.extend(qr.create_threads(sess, coord=coord, daemon=True,
start=True))
num_iter = int(math.ceil(FLAGS.num_examples / FLAGS.batch_size))
true_count = 0 # Counts the number of correct predictions.
total_sample_count = num_iter * FLAGS.batch_size
step = 0
output=[]
if mapping: # if in mapping mode generate a map, if in default mode (variable set to False by default) then tally predictions instead.
while step < num_iter and not coord.should_stop():
step += 1
hold = sess.run(output_2)
print(hold)
for i in range (len(hold)):
output.append(hold[i])
return(output)
from cifar10_input:
def inputs(mapping, data_dir, batch_size):
"""Construct input for CIFAR evaluation using the Reader ops.
Args:
mapping: bool, indicating if one should use the raw or pre-classified eval data set.
data_dir: Path to the CIFAR-10 data directory.
batch_size: Number of images per batch.
Returns:
images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size.
labels: Labels. 1D tensor of [batch_size] size.
"""
filelist = os.listdir(data_dir)
filenames = []
if mapping:
# from Raw_Image_Processor import file_name
for f in filelist:
if f.startswith("raw_batch"):
filenames.append(os.path.join(data_dir, f))
num_examples_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN
else:
for f in filelist:
if f.startswith("eval_batch"):
filenames.append(os.path.join(data_dir, f))
num_examples_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_EVAL
for f in filenames:
if not tf.gfile.Exists(f):
raise ValueError('Failed to find file: ' + f)
# Create a queue that produces the filenames to read.
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Read examples from files in the filename queue.
read_input = read_cifar10(filename_queue)
reshaped_image = tf.cast(read_input.uint8image, tf.float32)
height = IMAGE_SIZE
width = IMAGE_SIZE
# Image processing for evaluation.
# Crop the central [height, width] of the image.
resized_image = tf.image.resize_image_with_crop_or_pad(reshaped_image,
height, width)
# Subtract off the mean and divide by the variance of the pixels.
float_image = tf.image.per_image_standardization(resized_image)
# Set the shapes of tensors.
float_image.set_shape([height, width, 3])
read_input.label.set_shape([1])
# Ensure that the random shuffling has good mixing properties.
min_fraction_of_examples_in_queue = 0.4
min_queue_examples = int(num_examples_per_epoch *
min_fraction_of_examples_in_queue)
# Generate a batch of images and labels by building up a queue of examples.
return _generate_image_and_label_batch(float_image, read_input.label,
min_queue_examples, batch_size,
shuffle=False)
from cifar10_input:
def _generate_image_and_label_batch(image, label, min_queue_examples,
batch_size, shuffle):
"""Construct a queued batch of images and labels.
Args:
image: 3-D Tensor of [height, width, 3] of type.float32.
label: 1-D Tensor of type.int32
min_queue_examples: int32, minimum number of samples to retain
in the queue that provides of batches of examples.
batch_size: Number of images per batch.
shuffle: boolean indicating whether to use a shuffling queue.
Returns:
images: Images. 4D tensor of [batch_size, height, width, 3] size.
labels: Labels. 1D tensor of [batch_size] size.
"""
# Create a queue that shuffles the examples, and then
# read 'batch_size' images + labels from the example queue.
num_preprocess_threads = 16
if shuffle:
images, label_batch = tf.train.shuffle_batch(
[image, label],
batch_size=batch_size,
num_threads=num_preprocess_threads,
capacity=min_queue_examples + 3 * batch_size,
min_after_dequeue=min_queue_examples)
else:
images, label_batch = tf.train.batch(
[image, label],
batch_size=batch_size,
num_threads=1,
capacity=1,
enqueue_many = False)
# Display the training images in the visualizer.
tf.summary.image('images', images)
return images, tf.reshape(label_batch, [batch_size])
Edit:
Partial solution is given in the comments below. Information loss is dependent on batch size so it turns out that increasing the batch size (in mapping mode only) is an effective fix.
However, I'm still unsure why it loses and/or scrambles information when the batch size is exceeded. Presumably the batches are taken is some non-sequential order. I don't need it to take the project forwards but if someone could explain how or why this happens it would be greatly appreciated.
Edit 2:
It's back! I've set the batch size to be equivalent to one binary file (in my case roughly 10,000 images). Data is not lost or jumbled within this batch but when I try to process multiple files (about 30) it mixes up the batches a little rather than outputting them on a FIFO basis.
A picture is probably the easiest way for you to see what is going on:
classification map
This is a reconstructed image of a rock face from which the classifier has been trained to recognize three categories. As you can see the reconstruction is mostly smooth. However, there are two clean breaks near the top of the image where a batch (or 3) has been outputted in non-chronological order. These should have appeared at the bottom of the image rather than near the top.

Categories