TensorFlow model for Arabic OCR - Python

I am a beginner in TensorFlow and I want to build an OCR model with TensorFlow that detects Arabic words in cursive Arabic fonts (i.e. joined Arabic script). Ideally, the model would be able to detect both Arabic and English. Please see the attached image of a page in a dictionary that I am currently trying to OCR. The other pages in the book have the same font and layout, with both English and Arabic.
I have two questions:
(1) Would I be training on individual characters in the joined/cursive Arabic text, or would I need bounding boxes for entire words or for individual characters?
(2) Are there any other OCR TensorFlow (or Keras) models available that deal with cursive writing, particularly Arabic?

Tesseract, an OCR engine from Google, has an Arabic trained model.
Learn more about it here: https://github.com/tesseract-ocr/tesseract
Languages it supports are here: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages
The Arabic traineddata file is here: https://github.com/tesseract-ocr/tessdata/blob/master/ara.traineddata
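If you go the Tesseract route, here is a minimal sketch using the pytesseract wrapper (this assumes Tesseract is installed along with the ara and eng traineddata files; the file name is a placeholder):

    # Hypothetical example: OCR a scanned dictionary page with both Arabic and English.
    from PIL import Image
    import pytesseract

    page = Image.open("dictionary_page.png")  # placeholder path
    # "ara+eng" tells Tesseract to use the Arabic and English models together
    text = pytesseract.image_to_string(page, lang="ara+eng")
    print(text)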
Hope this helps!

I don't think you can use the whole page as the input image; word by word is probably a better choice as a first solution. Have a look at these links:
https://hackernoon.com/latest-deep-learning-ocr-with-keras-and-supervisely-in-15-minutes-34aecd630ed8
http://ai.stanford.edu/~ang/papers/ICPR12-TextRecognitionConvNeuralNets.pdf
How to create a dataset in the same format as the FSNS dataset?

Emojis are regarded as unknown (UNK) in BERT

My research interest is the effect of emojis in text. I am trying to classify sarcastic tweets. A month ago I used a dataset where I added the emoji tokens using:
tokenizer.add_tokens('List of Emojis')
When I tested it, the BERT model had successfully added the tokens. But two days ago, when I did the same thing for another dataset, the BERT model categorized them as 'UNK' tokens. My question is: has there been a recent change in the BERT model? I have tried it with the following tokenizer,
BertTokenizer.from_pretrained('bert-base-uncased')
The same happens with DistilBERT. It does not recognize the emojis despite my explicitly adding them. At first I read somewhere that there is no need to add them to the tokenizer because BERT or DistilBERT already has those emojis among its 30,000 tokens, but I have tried both adding them and not adding them. In both cases it does not recognize the emojis.
What can I do to solve this issue? Your thoughts on this would be appreciated.
You might need to distinguish between a BERT model (the architecture) and a pre-trained BERT model. The former can definitely support emoji; the latter will only have reserved code points for them if they were in the data that was used to create the WordPiece tokenizer.
Here is an analysis of the 119,547-token WordPiece vocab used in the HuggingFace multilingual model; it does not mention emoji. Note that 119K is very large for a vocab; 8K, 16K or 32K is more typical. The size of the vocab has quite a big influence on the model size: the first and last layers of a Transformer (e.g. BERT) model contain far more weights than any of the layers in between.
I've just been skimming how the paper Time to Take Emoji Seriously: They Vastly Improve Casual Conversational Models deals with it. They append 3267 emoji to the end of the vocabulary, then train the model on data containing emoji so it can learn what to do with those new tokens.
BTW, a search of the HuggingFace GitHub repository found they are using from emoji import demojize. This sounds like they convert emoji into text. Depending on what you are doing, you might need to disable it, or conversely you might need to be using it in your pipeline.
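To make the two options above concrete, here is a hedged sketch using the HuggingFace transformers and emoji packages; the emoji list is just a placeholder:

    from transformers import BertTokenizer, BertModel
    from emoji import demojize

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Option 1: add the emoji as new tokens. add_tokens expects a list of tokens
    # (a single string is treated as one token). Resize the embedding matrix
    # afterwards; the new embeddings are random until you fine-tune.
    new_emojis = ["😂", "😭", "🙄"]  # placeholder list
    tokenizer.add_tokens(new_emojis)
    model.resize_token_embeddings(len(tokenizer))

    # Option 2: convert emoji to text before tokenizing,
    # e.g. 😂 -> ":face_with_tears_of_joy:"
    text = demojize("I totally believe you 😂")
    print(tokenizer.tokenize(text))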

How can I create and fit a vocab.bpe file (GPT and GPT-2 OpenAI models) with my own text corpus?

This question is for those who are familiar with the GPT or GPT-2 OpenAI models, in particular with the encoding task (Byte-Pair Encoding). This is my problem:
I would like to know how I could create my own vocab.bpe file.
I have a Spanish text corpus that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file.
I have reviewed the code in gpt-2/src/encoder.py but I have not been able to find any hints. Any help or ideas?
Thank you so much in advance.
Check it out here; you can easily create the same vocab.bpe file using the following command:
    python learn_bpe -o ./vocab.bpe -i dataset.txt --symbols 50000
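If that learn_bpe script is the one from the subword-nmt toolkit (an assumption on my part), the same thing can be done from Python, and the learned codes can then be applied to new text:

    # Hedged sketch, assuming pip install subword-nmt; file names are placeholders.
    import codecs
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Learn 50k merge operations from the Spanish corpus and write them to vocab.bpe
    with codecs.open("dataset.txt", encoding="utf-8") as infile, \
         codecs.open("vocab.bpe", "w", encoding="utf-8") as outfile:
        learn_bpe(infile, outfile, num_symbols=50000)

    # Segment new text with the learned codes
    with codecs.open("vocab.bpe", encoding="utf-8") as codes:
        bpe = BPE(codes)
    print(bpe.process_line("una frase de ejemplo en español"))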
I haven't worked with GPT-2, but bpemb is a very good place to start for subword embeddings. According to the README:
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.
I've used the pretrained embeddings for one of my projects along with sentencepiece and it turned out to be very useful.
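For example, a minimal BPEmb sketch for Spanish (pip install bpemb); the vocabulary size and dimensionality below are just example choices:

    from bpemb import BPEmb

    bpemb_es = BPEmb(lang="es", vs=50000, dim=100)  # Spanish, 50k subwords, 100-d vectors
    print(bpemb_es.encode("Me gustas tú"))          # subword pieces
    print(bpemb_es.embed("Me gustas tú").shape)     # (num_pieces, 100)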

Experimenting with creating OCR in TensorFlow: what to do after training on letters?

Honestly, I'm just stuck and can't think. I have worked hard to create an amazing model that can read letters, but how do I move on to words, sentences, paragraphs and full papers?
This is a general question so forgive me for not providing code, but assume I have successfully trained a network at recognizing letters of many kinds and many fonts, with all sorts of different noise and distortions in the image.
(Just to be technical: the images the model is trained on are 36x36 grayscale images only, and the model is a simple classifier with some Conv2D layers.)
Now I want to use this well-trained model, with all its parameters, and give it something to read, turning it into a full OCR program. This is where I'm stuck. I want to give the program a photo/scan of a paper and have it recognize all the letters. But how do I "predict" using my model when the image is obviously larger than the single-letter images it was trained on?
I have tried adding an additional Conv2D layer that would try to read features of parts of the image, but that was too complicated and I couldn't figure it out.
I have also looked at OpenCV programs that recognize where there is text in the image and crop it out, but none that I could find separates out single letters that could then be fed to the trained model to read.
What is my next step from here?
If the font of the letters is the same throughout the whole image, you could use the so-called "sliding window" technique:
You start from the upper left corner and slide your scan window to the right, one letter-width at a time, until you reach the end of the page.
The window will be the size of a single letter; when its contents are fed to your neural network, it will output the predicted letter. Save those letters somewhere.
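A rough sketch of that idea, assuming a Keras classifier called letter_model trained on 36x36 grayscale crops (the names, step size and confidence threshold are placeholders):

    import numpy as np

    def sliding_window_ocr(page, letter_model, win=36, step=18):
        """Slide a win x win window over a grayscale page array and classify each crop."""
        detections = []
        height, width = page.shape
        for y in range(0, height - win + 1, step):
            for x in range(0, width - win + 1, step):
                crop = page[y:y + win, x:x + win].astype("float32") / 255.0
                probs = letter_model.predict(crop[np.newaxis, ..., np.newaxis], verbose=0)[0]
                best = int(np.argmax(probs))
                if probs[best] > 0.9:  # keep only confident windows
                    detections.append((x, y, best, float(probs[best])))
        return detections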
Other methods would include changing your neural network architecture, or being smarter about detecting blobs of text on the scanned page.
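Since the question also mentions OpenCV, here is a hedged sketch of that "detect blobs and crop them" idea (assumes OpenCV 4; the file name and size filters are placeholders):

    import cv2

    # Threshold the page, find connected components, and cut out boxes that can be
    # resized to 36x36 and fed to the letter classifier.
    page = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    letter_crops = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w > 5 and h > 5:  # skip specks of noise
            crop = cv2.resize(page[y:y + h, x:x + w], (36, 36))
            letter_crops.append((x, y, crop))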
If you are looking for an off-the-shelf solution, take a look at Tesseract OCR.
Check out the following links for ideas:
STN-OCR: A single Neural Network for Text Detection and Text Recognition
STN-OCR on Medium
Attention-based Extraction of Structured Information from Street View Imagery
Another Attention-based OCR Repo
A model using both CNN and LSTM

How does spacy use word embeddings for Named Entity Recognition (NER)?

I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text and I've not been able to find an answer. From this issue on Github and this example, it appears that spaCy uses a number of features present in the text such as POS tags, prefixes, suffixes, and other character and word-based features in the text to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My questions are -
Are these used in the NER system now?
If I were to switch out the word vectors to a different set, should I expect performance to change in a meaningful way?
Where in the code can I find out how (if at all) spaCy is using the word vectors?
I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video that Matthew Honnibal, the creator of spaCy, made about how its NER works here. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by mapping similar words to the same vector (link).
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on Github.
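As a rough illustration of swapping vectors in (a hedged sketch against the spaCy 2.x API; the model name, word and vector are placeholders, and the NER component would still need retraining to really benefit from new vectors):

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_lg")  # placeholder model with pre-trained vectors

    # Overwrite (or add) the vector for a single word in the vocab.
    n_dims = nlp.vocab.vectors.shape[1]
    nlp.vocab.set_vector("gazpacho", np.random.rand(n_dims).astype("float32"))

    # The pipeline sees the updated vectors on the next run.
    doc = nlp("Alice ordered gazpacho in Barcelona.")
    print([(ent.text, ent.label_) for ent in doc.ents])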

TensorFlow model for OCR

I am new to TensorFlow and I am trying to build a model that will be able to perform OCR on my images. I have to read 9 characters (fixed in all images), numbers and letters. My model would be similar to this:
https://matthewearl.github.io/2016/05/06/cnn-anpr/
My question would be: should I first train my model on individual characters and then combine the characters to get the full label, or should I train on the full label directly?
I know that I need to pass images plus the label for each corresponding image to the model, but what is the format of those labels? Is it a text file? I am a bit confused about that part, so any explanation of the format of the labels passed to the model would be helpful. Thanks in advance.
There are a couple of ways to deal with this (the following list is not exhaustive).
1) The first one is word classification directly from your image. If your vocabulary of 9-character strings is limited, you can train a word-specific classifier. You can then convolve this classifier with your image and select the word with the highest probability.
2) The second option is to train a character classifier, find all characters in your image, and find the most likely line that has the 9 characters you are looking for.
3) The third option is to train a text detector, find all possible text boxes. Then read all text boxes with a sequence-based model, and select the most likely solution that follows your constraints. A simple sequence-based model is introduced in the following paper: http://ai.stanford.edu/~ang/papers/ICPR12-TextRecognitionConvNeuralNets.pdf. Other sequence-based models could be based on HMMs, Connectionist Temporal Classification, Attention based models, etc.
4) The fourth option is attention-based models that work end-to-end: they first find the text and then output the characters one by one.
Note that this list is not exhaustive, there can be many different ways to solve this problem. Other options can even use third party solutions like Abbyy or Tesseract to help solve your problem.
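As a concrete illustration of option 1) taken in the "train on the full label directly" direction (and of what the labels can look like), here is a hedged Keras sketch with one softmax head per character position; the image size, character set and layer sizes are placeholders, not values from the question:

    import string
    import tensorflow as tf

    CHARSET = string.digits + string.ascii_uppercase  # assumed set of possible characters
    NUM_CLASSES = len(CHARSET)
    LABEL_LEN = 9  # fixed number of characters per image

    inputs = tf.keras.Input(shape=(64, 256, 1))  # grayscale input image (placeholder size)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)

    # One softmax head per character position, trained jointly on the full label.
    outputs = [tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name=f"char_{i}")(x)
               for i in range(LABEL_LEN)]
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Labels can simply be the 9-character strings (e.g. one CSV line per image:
    # image_path,ABC123XYZ); each string is converted to 9 integer class ids.
    def encode_label(text):
        return [CHARSET.index(c) for c in text]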
I'd recommend training an end-to-end OCR model with attention. You can try Attention OCR, which we used to transcribe street names: https://github.com/tensorflow/models/tree/master/research/attention_ocr
My guess is it should work pretty well for your case. Refer to the answer https://stackoverflow.com/a/44461910 for instructions on how to prepare the data for it.
