TensorFlow model for OCR - Python

I am new to TensorFlow and I am trying to build a model that can perform OCR on my images. I have to read 9 characters (fixed in all images), a mix of numbers and letters. My model would be similar to this:
https://matthewearl.github.io/2016/05/06/cnn-anpr/
My question is: should I first train my model on each character separately and then combine the character predictions to get the full label, or should I train on the full label directly?
I know that I need to pass the model images plus labels for the corresponding images, but what is the format of those labels? Is it a text file? I am a bit confused about that part, so any explanation of the label format passed to the model would be helpful. I appreciate it, thanks.

There are a couple of ways to deal with this (the following list is not exhaustive).
1) The first one is word classification directly from your image. If your vocabulary of 9-character words is limited, you can train a word-specific classifier. You can then convolve this classifier with your image and select the word with the highest probability.
2) The second option is to train a character classifier, find all characters in your image, and find the most likely line that contains the 9 characters you are looking for (a minimal character-classifier sketch follows this list).
3) The third option is to train a text detector and find all possible text boxes. Then read all text boxes with a sequence-based model and select the most likely solution that follows your constraints. A simple sequence-based model is introduced in the following paper: http://ai.stanford.edu/~ang/papers/ICPR12-TextRecognitionConvNeuralNets.pdf. Other sequence-based models could be based on HMMs, Connectionist Temporal Classification, attention-based models, etc.
4) The fourth option is attention-based models that work end-to-end: they first find the text and then output the characters one by one.
Note that this list is not exhaustive; there are many other ways to solve this problem. You could even bring in third-party solutions like ABBYY or Tesseract to help.
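If you go with option 2, the character classifier itself can be a small CNN. A minimal sketch in Keras (assuming 32x32 grayscale character crops and a 36-class alphabet of digits plus uppercase letters; all sizes here are illustrative, not a definitive recipe):

```python
import tensorflow as tf

# Assumption: single-character crops, 32x32 grayscale, 36 classes (0-9, A-Z).
NUM_CLASSES = 36

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(char_images, char_labels)  # labels are integer class ids
```

As for the label-format question: with a setup like this, the labels are typically just integer class ids (or one-hot vectors) per character crop, stored alongside the image file names in a CSV or derived from the directory structure, not any special TensorFlow-specific text file.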

I'd recommend training an end-to-end OCR model with attention. You can try Attention OCR, which we used to transcribe street names: https://github.com/tensorflow/models/tree/master/research/attention_ocr
My guess is it should work pretty well for your case. Refer to the answer at https://stackoverflow.com/a/44461910 for instructions on how to prepare the data for it.

Related

Feature Extraction Using Representation Learning

I'm new to machine learning, and I've been given a task where I'm asked to extract features from a data set with continuous data using representation learning (for example a stacked autoencoder).
Then I'm to combine these extracted features with the original features of the dataset and then use a feature selection technique to determine my final set of features that goes into my prediction model.
Could anyone point me to some resources or demos or sample code of how I could get started on this? I'm very confused on where to begin on this and would love some advice!
Okay, say you have an input of shape (1000 instances, 30 features). Based on what you have told us, what I would do is:
Train an autoencoder: a neural network that compresses the input and then decompresses it, with your original input as the target. The compressed representation lies in the latent space and encapsulates information about the input that is not directly accessible to humans. You can build such networks in TensorFlow or PyTorch; TensorFlow is easier and more straightforward, so it may be a better fit for you. This link (https://keras.io/examples/generative/vae/) shows a variational autoencoder that may do the job. That example uses Conv2D layers, so it performs really well on image data, but you can play around with the architecture. I cannot tell you more because you did not provide more info about your dataset. However, the important thing is the following:
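A minimal dense autoencoder for tabular data like this, in Keras (assuming the (1000, 30) input above and a 16-dimensional latent space; all layer sizes are illustrative):

```python
import numpy as np
import tensorflow as tf

# Assumption: X_train has shape (1000, 30) and is scaled (e.g. standardized).
inputs = tf.keras.layers.Input(shape=(30,))
encoded = tf.keras.layers.Dense(24, activation="relu")(inputs)
latent = tf.keras.layers.Dense(16, activation="relu", name="latent")(encoded)
decoded = tf.keras.layers.Dense(24, activation="relu")(latent)
outputs = tf.keras.layers.Dense(30, activation="linear")(decoded)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# The target is the input itself: the network learns to reconstruct it.
# autoencoder.fit(X_train, X_train, epochs=50, batch_size=32)
```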
After your autoencoder is trained properly (and you need to make sure of this: it should adequately reconstruct the input), you extract the aforementioned latent features (you will find more in the link). Say that is 16 numbers, though you can play with the size. These 16 numbers were built to preserve information about your input. You said you wanted to combine them with your original features, so you might as well do that and end up with 46 input features. The feature selection part is then about selecting the input features that are most useful for your model; you can find more information here (https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e), and one way to select features is to train many models on different feature subsets. Remember, techniques such as PCA are for feature extraction, not selection. I cannot provide a demo that does the whole thing, but there are sources that can help. Finally, keep in mind that your autoencoder is supposed to return 16 numbers for each training example, and that it is trained only on your train data, with your train data as the targets.
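Continuing the sketch above, extracting the 16 latent features, stacking them next to the original 30, and running one possible selection step could look like this (SelectKBest with f_regression is just one illustrative technique, and y_train is assumed to be your prediction target):

```python
from sklearn.feature_selection import SelectKBest, f_regression

# Encoder-only model that stops at the bottleneck layer defined above.
encoder = tf.keras.Model(autoencoder.input,
                         autoencoder.get_layer("latent").output)
latent_features = encoder.predict(X_train)                     # (1000, 16)
combined = np.concatenate([X_train, latent_features], axis=1)  # (1000, 46)

# Keep the 20 features with the strongest univariate relation to the target.
selector = SelectKBest(score_func=f_regression, k=20)
X_selected = selector.fit_transform(combined, y_train)
```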

model for handwritten text recognition

I have been attempting to create a model that, given an image, can read the text from it. I am attempting to do this by implementing a CNN, an RNN, and CTC, using TensorFlow and Keras. There are a couple of things I am confused about. For reading single digits, I understand that the last layer in the model should have 10 nodes, since those are the options. However, for reading words, aren't there infinitely many options, so how many nodes should I have in my last layer? Also, I am confused as to how I should add CTC to my Keras model. Is it a loss function?
I see two options here:
You can construct your model to recognize the separate letters of those words; then there are as many nodes in the last layer as there are letters and symbols in the alphabet your model will read (with CTC, plus one extra node for the blank label; see the sketch after these options).
You can make the output of your model a vector and then "decode" this vector using some other tool that can encode/decode words as vectors. One such tool is word2vec. Or you could download a database of possible words and create such a tool yourself.
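To answer the CTC question directly: yes, in Keras it is wired in as a loss. A common pattern (the one used in the keras.io captcha OCR example) is a small layer that calls ctc_batch_cost and registers the result with add_loss. A minimal sketch, assuming your network ends in softmax outputs over alphabet_size + 1 classes, the extra class being the CTC blank:

```python
import tensorflow as tf

class CTCLayer(tf.keras.layers.Layer):
    # Computes CTC loss at train time and passes predictions through unchanged.
    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")
        input_length = input_length * tf.ones((batch_len, 1), dtype="int64")
        label_length = label_length * tf.ones((batch_len, 1), dtype="int64")
        loss = tf.keras.backend.ctc_batch_cost(
            y_true, y_pred, input_length, label_length)
        self.add_loss(loss)
        return y_pred

# Usage sketch (names are placeholders): `features` is the
# (batch, timesteps, units) output of your CNN+RNN stack and `labels`
# is a (batch, max_label_len) integer input to the model.
# dense = tf.keras.layers.Dense(alphabet_size + 1, activation="softmax")
# output = CTCLayer()(labels, dense(features))
```

At inference time you drop this layer and decode the softmax sequence instead, e.g. with tf.keras.backend.ctc_decode.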
The description of your model is very vague. If you want more specific help, you should provide more info, e.g. the model architecture.

How to extract relevant phrases from sentences regarding a particular topic using Neural networks?

I have training data with two columns:
1. 'Sentences'
2. 'Relevant_text' (the text in this column is a subset of the text in the 'Sentences' column)
I tried training an RNN with LSTM directly, treating 'Sentences' as input and 'Relevant_text' as output, but the results were disappointing.
I want to know how to approach this type of problem. Does this kind of problem have a name? Which models should I explore?
If the target text is a subset of the input text, then, I believe, this problem can be solved as a tagging problem: make your neural network predict, for each word, whether it is "relevant" or not.
On the one hand, the problem of taking a text and selecting the subset that best reflects its meaning is called extractive summarization, and it has lots of solutions, from the well-known unsupervised TextRank algorithm to complex BERT-based neural models.
On the other hand, technically your problem is just binary token-wise classification: you label each token (word or other symbol) of your input text as "relevant" or not, and train any neural network architecture that is good at tagging on this data. Specifically, I would look into architectures for POS tagging, because they are very well studied; typically this is a BiLSTM, maybe with a CRF head. More modern models are based on pretrained contextual word embeddings such as BERT (maybe you won't even need to fine-tune them - just use them as a feature extractor and add a BiLSTM on top). If you want a more lightweight model, you can consider a CNN over pretrained, fixed word embeddings; a tagger sketch follows below.
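A minimal BiLSTM tagger along these lines (vocabulary size and padded length are placeholders; each token gets its own relevant/not-relevant probability):

```python
import tensorflow as tf

VOCAB_SIZE = 20000  # placeholder: size of your token vocabulary
MAX_LEN = 100       # placeholder: padded sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)),
    # One sigmoid per token: the probability that the token is "relevant".
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# Targets have shape (batch, MAX_LEN, 1) with a 0/1 label per token.
```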
One final parameter you should spend time playing with is the threshold for classifying a word as relevant - maybe the default of 0.5 is not the best choice. For example, instead of keeping all tokens whose probability of being relevant exceeds 0.5, you might keep the top k tokens, where k is fixed or is some percentage of the whole text.
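The top-k variant is only a couple of lines of NumPy; here probs stands for the per-token probabilities a tagger like the one sketched above would produce:

```python
import numpy as np

probs = np.array([0.10, 0.80, 0.30, 0.95, 0.60])  # toy per-token scores
k = 2
keep = np.zeros_like(probs, dtype=bool)
keep[np.argsort(probs)[-k:]] = True  # mark the k highest-probability tokens
```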
Of course, more specific recommendations would be dataset-specific, so if you could share your dataset, it would be a great help.

Multiple Output Vectors for a single Input in Keras

I want to create a Neural Network in Keras for converting handwriting into computer letters.
My first step is to convert a sentence into an array. My array has the shape (1, number of letters, 27). Now I want to input it into my deep neural network and train.
But how do I input it properly if its dimensions don't fit those of my images? And how do I get my predict function to give me an output array of shape (1, number of letters, 27)?
It seems like you are attempting handwriting recognition, or more generally optical character recognition (OCR). This is quite a broad field and there are many ways to proceed. One approach I suggest is the following:
It is commonly known that neural networks have fixed-size inputs; that is, if you build one to take, say, inputs of shape (28, 28, 1), then the model will expect that shape as its input. Therefore, having a dimension in your samples that depends on the number of letters in a sentence (something variable) is not recommended, as you will not be able to train a model that way with standard NNs.
Training such a model becomes possible if you design it to predict one character at a time, instead of a whole sentence that can have different lengths, and then group the predicted characters. The steps you could try to achieve this are (a sliding-window sketch follows these steps):
Obtain training samples for the characters you wish to recognize (the MNIST database, for example), and design and train your model to predict one character at a time.
Take the image with the writing to classify and pass a sliding window over it that matches your expected input size (say, a 28x28 window). Then classify each of those windows as a character. Instead of a sliding window, you could try isolating your desired features somehow and just classify those 28x28 segments.
Group the predicted characters so you get words (probably by grouping those separated by empty spaces), or do whatever you want with the predictions.
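A minimal sliding-window sketch for the second step (char_model is assumed to be a trained 28x28 single-character classifier like the one from the first step; the stride is illustrative and should be tuned to your letter spacing):

```python
import numpy as np

WIN = 28      # window size matching the classifier's input
STRIDE = 14   # illustrative stride; tune for your data

def classify_windows(page, char_model):
    # Slide a WIN x WIN window over a 2D grayscale page, classify each crop.
    predictions = []
    for y in range(0, page.shape[0] - WIN + 1, STRIDE):
        for x in range(0, page.shape[1] - WIN + 1, STRIDE):
            crop = page[y:y + WIN, x:x + WIN]
            batch = crop[np.newaxis, :, :, np.newaxis]  # shape (1, 28, 28, 1)
            probs = char_model.predict(batch, verbose=0)
            predictions.append(((y, x), int(np.argmax(probs))))
    return predictions
```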
You can also try searching for tutorials or guides on handwriting recognition, like this one, which I have found quite useful. Hope this helps you get on track; good luck.

Experimenting with creating OCR in TensorFlow, what to do after training on letters?

Honestly, I'm just stuck and can't think. I have worked hard to create an amazing model that can read letters, but how do I move on to words, sentences, paragraphs, and full papers?
This is a general question, so forgive me for not providing code, but assume I have successfully trained a network to recognize letters of many kinds and many fonts, with all sorts of noise and distortion in the images.
(Just to be technical: the images the model is trained on are 36x36 grayscale images only, and the model is a simple classifier with some Conv2D layers.)
Now I want to use this well-trained model, with all its parameters, and give it something to read, to turn it into a full OCR program. This is where I'm stuck. I want to give the program a photo/scan of a paper and have it recognize all the letters. But how do I "predict" using my model when the image is obviously larger than the single-letter images it was trained on?
I have tried adding an additional Conv2D layer that would try to read features from parts of the image, but that was too complicated and I couldn't figure it out.
I have also looked at OpenCV programs that detect where there is text in an image and crop it out, but none that I could find separate out single letters that could then be fed to the trained model.
What is my next step from here?
If the font of the letters is the same throughout the whole image, you could use the so-called "sliding window technique":
You start from the upper left corner and slide your scan window to the right by the size of a letter until you reach the end of the paper.
The sliding window is the size of a scanned letter; when its contents are fed to your neural network, the network outputs the letter. Save those letters somewhere.
Other methods would include changing your neural network or being smarter about detecting blobs of text on the scanned paper, for example by segmenting individual letters with OpenCV (a sketch follows).
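For the letter-separation part specifically, one common OpenCV approach is thresholding plus contour detection. A minimal sketch, assuming dark text on a light background (the 36x36 size matches the model described in the question; the noise threshold is a guess to tune):

```python
import cv2

def extract_letter_crops(page_path, size=36):
    # Segment candidate letters from a scan and resize them to the
    # classifier's expected 36x36 input.
    gray = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
    # Invert + Otsu threshold: letters become white blobs on black.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    boxes = [b for b in boxes if b[2] * b[3] >= 20]  # drop tiny noise specks
    # Sort roughly into reading order: by row band, then left to right.
    boxes.sort(key=lambda b: (b[1] // size, b[0]))
    return [cv2.resize(gray[y:y + h, x:x + w], (size, size))
            for x, y, w, h in boxes]
```

Each returned crop can then be fed to the trained single-letter classifier.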
If you are looking for an off-the-shelf solution, take a look at Tesseract OCR.
Check out the following links for ideas:
STN-OCR: A single Neural Network for Text Detection and Text Recognition
STN-OCR on Medium
Attention-based Extraction of Structured Information from Street View Imagery
Another Attention-based OCR Repo
A model using both CNN and LSTM
