how to save ocr model from keras author-A_K_Nain

how to save ocr model from keras author-A_K_Nain - python

Im studying tensorflow ocr model from keras example authored by A_K_Nain. This model use custom object (CTC Layer). It is in the site:https://keras.io/examples/vision/captcha_ocr/
I trained model using my dataset and then the result of prediction model is perfect.
I want to save and load this model and i tried it. But i got some errors so i appended this code in CTC Layer class.
def get_config(self):
config = super(CTCLayer, self).get_config()
config.update({"name":self.name})
return config
After that
I tried to save whole model and weight but nothing worked.
So i applied 2 save point.
First way.
history = model.fit(
train_dataset,
validation_data=validation_dataset,
epochs=70,
callbacks=[early_stopping],
)
model.save('./model/my_model')
---------------------------------------
new_model = load_model('./model/my_model', custom_objects={'CTCLayer':CTCLayer})
prediction_model = keras.models.Model(
new_model .get_layer(name='image').input, new_model .get_layer(name='dense2').output
)
and second way.
prediction_model = keras.models.Model(
model.get_layer(name='image').input, model.get_layer(name='dense2').output
)
prediction_model.save('./model/my_model')
These still never worked. it didn't make error but result of prediction is terrible.
Accurate results are obtained when training and saving and loading are performed together.
If I load same model without training together, the result is so bad.
How can i use this model without training everytime? please help me.

The problem does not come from tensorflow. In the captcha_ocr tutorial, characters is a set, sets are unordered. So the mapping from characters to integers using StringLookup is dependent of the current run of the notebook. That is why you get rubbish when using it in another notebook without retraining, the mapping is not the same!
A solution is to use an ordered list instead of the set for characters :
characters = sorted(list(set([char for label in labels for char in label])))
Note that the set operator here permits to get a unique version of each character and then it is converted back to a list and sorted. It will work then on any script/notebook without retraining (using the same formula).

The problem is not in the saved model but in the character list that you are using to map number back to string. Everytime you restart the notebook, it resets the character list and when you load your model, it can't accurately map the numbers back to string. To resolve this issue you need to save character list. Please follow the below code.
train_labels_cleaned = []
characters = set()
max_len = 0
for label in train_labels:
label = label.split(" ")[-1].strip()
for char in label:
characters.add(char)
max_len = max(max_len, len(label))
train_labels_cleaned.append(label)
print("Maximum length: ", max_len)
print("Vocab size: ", len(characters))
# Check some label samples
train_labels_cleaned[:10]
ff = list(characters)
# save list as pickle file
import pickle
with open("/content/drive/MyDrive/Colab Notebooks/OCR_course/characters", "wb") as fp: #Pickling
pickle.dump(ff, fp)
# Load character list again
import pickle
with open("/content/drive/MyDrive/Colab Notebooks/OCR_course/characters", "rb") as fp: # Unpickling
b = pickle.load(fp)
print(b)
AUTOTUNE = tf.data.AUTOTUNE
# Maping characaters to integers
char_to_num = StringLookup(vocabulary=b, mask_token=None)
#Maping integers back to original characters
num_to_chars = StringLookup(vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True)
Now when you map back the numbers to string after prediction, it will retain the original order and will predict accurately.
If you still didn't understand the logic, you can watch my video in which I explained this project from scratch and resolved all the issues that you are facing.
https://youtu.be/ZiUEdS_5Byc

Related

How to get Data Generator more efficient?

To train a neural network, I modified a code I found on YouTube. It looks as follows:
def data_generator(samples, batch_size, shuffle_data = True, resize=224):
num_samples = len(samples)
while True:
random.shuffle(samples)
for offset in range(0, num_samples, batch_size):
batch_samples = samples[offset: offset + batch_size]
X_train = []
y_train = []
for batch_sample in batch_samples:
img_name = batch_sample[0]
label = batch_sample[1]
img = cv2.imread(os.path.join(root_dir, img_name))
#img, label = preprocessing(img, label, new_height=224, new_width=224, num_classes=37)
img = preprocessing(img, new_height=224, new_width=224)
label = my_onehot_encoded(label)
X_train.append(img)
y_train.append(label)
X_train = np.array(X_train)
y_train = np.array(y_train)
yield X_train, y_train
Now, I tried to train a neural network using this code, train sample size is 105.000 (image files which contain 8 characters out of 37 possibilities, A-Z, 0-9 and blank space).
I used a relatively small batch size (32, I think that is already too small) to get it more efficient but nevertheless it took like forever to train one quarter of the first epoch (I had 826 steps per epoch, and it took 90 minutes for 199 steps... steps_per_epoch = num_train_samples // batch_size).
The following functions are included in the data generator:
def shuffle_data(data):
data=random.shuffle(data)
return data
I don't think we can make this function anyhow more efficient or exclude it from the generator.
def preprocessing(img, new_height, new_width):
img = cv2.resize(img,(new_height, new_width))
img = img/255
return img
For preprocessing/resizing the data I use this code to get the images to a unique size of e.g. (224, 224, 3). I think, this part of the generator takes the most time, but I don't see a possibility to exclude it from the generator (since my memory would be full, if we resize the images outside the batches).
#One Hot Encoding of the Labels
from numpy import argmax
# define input string
def my_onehot_encoded(label):
# define universe of possible input values
characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(characters))
int_to_char = dict((i, c) for i, c in enumerate(characters))
# integer encode input data
integer_encoded = [char_to_int[char] for char in label]
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
character = [0 for _ in range(len(characters))]
character[value] = 1
onehot_encoded.append(character)
return onehot_encoded
I think, in this part there could be one approach to make it more efficient. I am thinking about to exclude this code from the generator and produce the array y_train outside of the generator, so that the generator does not have to one hot encode the labels every time.
What do you think? Or should I maybe go for a completely different approach?

I have found your question very intriguing because you give only clues. So here is my investigation.
Using your snippets, I have found GitHub repository and 3 part video tutorial on YouTube that mainly focuses on the benefits of using generator functions in Python.
The data is based on this kaggle (I would recommend to check out different kernels on that problem to compare the approach that you already tried with another CNN networks and review API in use).
You do not need to write a data generator from scratch, though it is not hard, but inventing the wheel is not productive.
Keras has the ImageDataGenerator class.
Plus here is a more generic example for DataGenerator.
Tensorflow offers very neat pipelines with their tf.data.Dataset.
Nevertheless, to solve the kaggle's task, the model needs to perceive single images only, hence the model is a simple deep CNN. But as I understand, you are combining 8 random characters (classes) into one image to recognize multiple classes at once. For that task, you need R-CNN or YOLO as your model. I just recently opened for myself YOLO v4, and it is possible to make it work for specific task really quick.
General advice about your design and code.
Make sure the library uses GPU. It saves a lot of time. (Even though I repeated flowers experiment from the repository very fast on CPU - about 10 minutes, but resulting predictions are no better than a random guess. So full training requires a lot of time on CPU.)
Compare different versions to find a bottleneck. Try a dataset with 48 images (1 per class), increase the number of images per class, and compare. Reduce image size, change the model structure, etc.
Test brand new models on small, artificial data to prove the idea or use iterative process, start from projects that can be converted to your task (handwriting recognition?).

10 fold cross validation python

There is a deep learning based model using Transfer Learning and LSTM in this article, that author used 10 fold cross validation (as explained in table 3) and took the average of results.
I am familiar with 10 fold cross validation as we need to divide the data and pass to the model, however in this code(here) I can't figure out how to partition data and pass it.
There is two train/test/dev datasets (one for emotion analysis, and one for sentiment analysis we use both for transfer learning, but my focus is on emotion analysis). The raw data is in couple of files in txt format, and after running the model, it gives two new txt files, one for predicted labels, one for true labels.
There is a line of code in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode=='train':
model.train(data)
sess = model.restore_last_session()
model.predict(data, sess)
if args.mode=='test':
sess = model.restore_last_session()
model.predict(data, sess)
in which the 'data' is a class of Data(code) that includes test/train/dev datasets:
which I think I need to pass the divided data here. If I am right, how can I do partitioning and perform 10 fold cross validation?
data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
class Data(object):
def __init__(self,data_path,vocab_path,pretrained,batch_size):
self.batch_size = batch_size
data, vocab ,pretrained= self.load_vocab_data(data_path,vocab_path,pretrained)
self.train=data['train']
self.valid=data['valid']
self.test=data['test']
self.train2=data['train2']
self.valid2=data['valid2']
self.test2=data['test2']
self.word_size = len(vocab['word2id'])+1
self.max_sent_len = vocab['max_sent_len']
self.max_topic_len = vocab['max_topic_len']
self.word2id = vocab['word2id']
word2id = vocab['word2id']
#self.id2word = dict((v, k) for k, v in word2id.iteritems())
self.id2word = {}
for k, v in six.iteritems(word2id):
self.id2word[v]=k
self.pretrained=pretrained

by the look of it, seems the train method can get the session and continue to train from existing model def train(self, data, sess=None)
so with a very minimal changes to existing code and libraries you can do smth like
first load all the data and build the model
data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
then create the cross validation data set, smth like
def get_new_data_object():
return data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
cross_validation = []
for i in range(10):
tmp_data = get_new_data_object()
tmp_data.train= #get 90% of tmp_data['train']
tmp_data.valid= #get 90% of tmp_data['valid']
tmp_data.test= #get 90% of tmp_data['test']
tmp_data.train2= #get 90% of tmp_data['train2']
tmp_data.valid2= #get 90% of tmp_data['valid2']
tmp_data.test2= #get 90% of tmp_data['test2']
cross_validation.append(tmp_data)
than run the model n times (10 for 10-fold cross validation)
sess = null
for data in cross_validation:
model.train(data, sess)
sess = model.restore_last_session()
keep in mind to pay attention to some key ideas
I don't know how your data is structured exactly but that effect the way of splitting it to test, train and (in your case) valid
the splitting of data has to be the exact split for each triple of test, train and valid, it can be done randomly or taking different part every time, as long it consistent
you can train the model n times with cross validation or create n models and pick the best to avoid overfitting
this code is just a draft, you can implement it how you would like, there are some great library that already implemented such functionality, and of course can be optimize (not reading the whole data files each time)
one more consideration is to separate the model creation from the data, especially the data arg of the model constructor, from a quick look it seems it only use the dimension of the data, so its a good practice not to pass the whole object
more over, if the model integrate other properties of the data object in it's state (when creating), like the data itself, my code might not work and a more surgical approach
hope it helps, and point you in the right direction

Tensorflow predicting for specific ids

im using the Tensorflow dataset api and try to predict for a multilabel classification. Its working fine but the resulting predictions have no correspondending id, so I don't know to what the prediciton belongs to.
I'm using the following code to create the dataset and predict:
def test_input_fn():
filenames = tf.train.match_filenames_once("../input/test/*_green.png")
dataset = tf.data.Dataset.from_tensor_slices(filenames)
def _parse_image_data(filename):
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_png(image_string, channels=1)
image_reshape = tf.reshape(image_decoded, [512*512*1])
return image_reshape
return dataset.map(_parse_image_data).batch(8)
pred_result = estimator.predict(test_input)
Is there an easy way to somehow append the ids to the prediction.

You can return image_reshape,filename in your _parse_image_data(filename) function, and modify your model function accordingly, so that for predictions it will return not only the predicted labels but also the filenames.
More detailedly, suppose your model function is model_fn(features, labels, mode, params), now your features is no longer just image_reshape, but image_shape,filename, so basically you should change wherever you use features to features[0]; you may also need to append features[1] to your predictions (I'm assuming you name your variables like this, I can give more specific codes if you can show me your model function).
Or, if you would like a separate function that simply prints the filenames content :
def print_filenames():
filenames = tf.train.match_filenames_once("../input/test/*_green.png")
with tf.Session() as sess:
sess.run(tf.local_variables_initializer())
print(sess.run(filenames))
In the above codes, the output is in the as-is order, so it should be the same order of your predictions.

How to get two tf.dataset from tf.data.Dataset.zip((images, labels))

I am working on the Python/tensorflow/mnist tutorial.
Since a few weeks using the orignal code from tensorflow web site i get the warning that the image dataset would soon be deprecated abd that i should use the following one :
https://github.com/tensorflow/models/blob/master/official/mnist/dataset.py
I load it it my code using :
from tensorflow.models.official.mnist import dataset
trainfile = dataset.train(data_dir)
Which returns :
tf.data.Dataset.zip((images, labels))
The issue is that I cannot find a,way to separate them in the following way for example :
trainfile = dataset.train(data_dir)
train_data= trainfile.images
train_label= trainfile.label
But this clearly doesnot work because the attributrs images and label do not exist. trainfile is a tf.dataset.
Knowing that tf.dataset is made of int32 and float32 i tried :
train_data = trainfile.map(lambda x,y : x.dtype == tf.float32)
But it returns and empty dataset.
I insist (but will be open mimded) in doing it this way (two complete batches of image and label) because this is how the tutorial works :
https://www.tensorflow.org/tutorials/estimators/cnn
I saw a lot of solution to get elements from datasets but nothing to go back from the zip operations that is done in the following code
tf.data.Dataset.zip((images, labels))
Thanks you in advance for your help.

I hope this helps:
inputs = tf.placeholder(tf.float32, shape=(None, 784), name='inputs')
outputs = tf.placeholder(tf.float32, shape=(None,), name='outputs')
#Prepare a tensorflow dataset
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.shuffle(buffer_size=10, reshuffle_each_iteration=True).batch(batch_size=batch_size, drop_remainder=True).repeat()
iter = ds.make_one_shot_iterator()
next = iter.get_next()
inputs = next[0]
outputs = next[1]

TensorFlow's get_single_element() is finally around which can be used to unzip features and labels from the dataset.
This avoids the need of generating and using an iterator using .map(), iter() or one_shot_iterator() (which could be costly for big datasets).
get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset batched into a single element.
This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays) depending upon how the original dataset was created.
Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.

Instead of separating into two datasets, one for images and another for labels, it's best to make a single iterator which returns both the image and the label.
The reason why this is preferred is that it's a lot easier to ensure that you match each example with its label even after a complicated series of shuffles, reorderings, filterings, etc, as you might have in a nontrivial input pipeline.

You can visualize images and find its associated labels
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.shuffle(buffer_size=10).batch(batch_size=batch_size)
iter = ds.make_one_shot_iterator()
next = iter.get_next()
def display(image, label):
# display image
...
plt.imshow(image)
...
with tf.Session() as sess:
try:
while True:
image, label = sess.run(next)
# image = numpy array (batch, image_size)
# label = numpy array (batch, label)
display(image[0], label[0]) #display first image in batch
except:
pass

How to restore a saved model with LSTM layers in Keras

I was following a tutorial to generate English text using LSTMs and using Shakespeare's works as a training file. This is the model I am using with reference to that-
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
model.add(Dropout(0.2))
for i in range(LAYER_NUM - 1):
model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
After 30 epochs of training, I save the model using model.save('model.h5'). At this point, the model has learned the basic format and has learned a few words. However, when I try to load the model in a new program using load_model('model.h5') and try to generate some text, it ends up predicting completely random letters and symbols. This led me to think that the model weights are not being restored properly, since I encountered the same problem while storing only the model weights. So is there any alternative for storing and restoring trained models with LSTM layers?
For reference, in order to generate the text, the function randomly generates a character and feeds it into the model to predict the next character. This is the function-
def generate_text(model, length):
ix = [np.random.randint(VOCAB_SIZE)]
y_char = [ix_to_char[ix[-1]]]
X = np.zeros((1, length, VOCAB_SIZE))
for i in range(length):
X[0, i, :][ix[-1]] = 1
print(ix_to_char[ix[-1]], end="")
ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
y_char.append(ix_to_char[ix[-1]])
return ('').join(y_char)
EDIT
The snippet of code for training-
for nbepoch in range(1, 11):
print('Epoch ', nbepoch)
model.fit(X, y, batch_size=64, verbose=1, epochs=1)
if nbepoch % 10 == 0:
model.model.save('checkpoint_{}_epoch_{}.h5'.format(512, nbepoch))
generate_text(model, 50)
print('\n\n\n')
Where generate_text() is just a function to predict a new character, starting from a randomly generated character. After every 10 epochs of training, the entire model is saved as a .h5 file.
The code for loading the model-
print('Loading Model')
model = load_model('checkpoint_512_epoch_10.h5')
print('Model loaded')
generate_text(model, 400)
As far as predictions go, the text generation is normally structured while training and the model learns some words. However, when the saved model is loaded, the text generation is completely random, as if the weights were randomly reinitialized.

After doing a bit of digging, I finally found out that the way I was creating the dictionary mapping between characters and one-hot vectors was the issue. I was using the char = list(set(data)) function to get a list of all the characters in the file, and then assign the index of the character as that character's 'code number'. However, apparently the list(set(data)) function does not always output the same list, instead the order is randomized for each 'session' of python. So my dictionary mapping used to change between saving and loading the model, since that occurred in different scripts. Using char = sorted(list(set(data))) works to eliminate this problem.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.