I am using tesseract (through the Python wrapper) in order to extract text from documents. These documents do not include any images or tables, simply text.
Is there any option to distinguish the titles/headings from the body text? Ideally I want to be able to get something like an XML tree rather than the full string of text (I do not need a visual of the document layout).
I found some third-party tools that seem to be able to help, but I was wondering if I can do it directly from tesseract.
You can use the Nanonets OCR API to create your own model that separates headings from text, or you can add different labels.
I am quite late to answer, but this answer might help others who are looking for a solution.
Firstly, Tesseract alone won't be able to extract such "features" from the document. But all you need is a little bit of understanding of ML and vision libraries (like Luminoth or Detectron2).
Basically, you have to give it some sample documents with markups (like title, header1, header2, etc.) and train the model. After training, you can use the model on different unseen images to fetch such details.
You can use an ML-based solution, but in such use cases I prefer lightweight solutions based on OpenCV's features. You may use regular text detection and pair it with morphological transformations to detect header text.
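For instance, a minimal OpenCV sketch of that idea (an illustration only; the input filename, the kernel size and the 1.5x height threshold are assumptions you would tune for your documents):

import cv2

# Load the scanned page and binarize it (text becomes white on black)
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate horizontally so the characters of one line merge into a single blob
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
dilated = cv2.dilate(binary, kernel, iterations=1)

# Each contour is now roughly one line of text; headings are usually taller
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
heights = [cv2.boundingRect(c)[3] for c in contours]
avg_height = sum(heights) / len(heights)

for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if h > 1.5 * avg_height:  # noticeably taller than body text -> likely a heading
        print("possible heading at", (x, y, w, h))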
I have thousands of groups of paragraphs and I need to classify these paragraphs. The problem is that I need to classify each paragraph based on the other paragraphs in the group! For example, a paragraph on its own may belong to class A, but according to the other paragraphs in its group it belongs to class B.
I have tested lots of traditional and deep approaches (in fields like text classification, IR, text understanding, sentiment classification and so on), but those couldn't classify correctly.
I was wondering if anybody has worked in this area and could give me some suggestions. Any suggestions are appreciated. Thank you.
Update 1:
Actually, we are looking for manual sentences/paragraphs for some fields, so first we need to recognize whether a sentence/paragraph is a manual one or not; second, we need to classify it into its field, and we can recognize its field only based on the previous or next sentences/paragraphs.
For classifying the paragraphs as manual/non-manual we have developed some promising approaches, but the problem comes up when we have to recognize the field from the previous or next sentences/paragraphs. But which ones? We don't know which of the other sentences holds the answer!
Update 2:
We cannot use the whole text of a group as input, because the groups are too big (sometimes tens of thousands of words) and contain other classes as well, so the machine can't learn properly, which makes the accuracy drop sharply.
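For reference, here is a rough sketch of the kind of windowed setup we have in mind, pairing each paragraph with a small window of its neighbours instead of the whole group (purely hypothetical: the window size, vectorizer and classifier are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical group of paragraphs with their gold labels
paragraphs = ["first paragraph text", "second paragraph text",
              "third paragraph text", "fourth paragraph text"]
labels = ["A", "B", "B", "A"]

# Represent each paragraph together with its immediate neighbours,
# so the classifier sees local context without the whole group
def with_context(paras, i):
    prev = paras[i - 1] if i > 0 else ""
    nxt = paras[i + 1] if i < len(paras) - 1 else ""
    return " ".join([prev, paras[i], nxt])

X = [with_context(paragraphs, i) for i in range(len(paragraphs))]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X, labels)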
Let's say I have a set of images of passports. I am working on a project where I have to identify the name on each passport and eventually transform that object into text.
For the very first part of labeling (or classification, I think; beginner here) where the name is on each passport, how would I go about that?
What techniques/software can I use to accomplish this?
Answers in great detail or any links would be great. I'm trying to figure out how exactly this is done so I can begin coding.
I know training a model is possibly involved, but I'm just not sure.
I'm using Python if that matters.
thanks
There are two routes you can take: one where you have labeled data (or you want to label data yourself), and one where you don't.
Let's start with the latter. Say you have an image of a passport. You want to detect where the text in the image is, and what that text says. You can achieve this using a library called pytesseract, a Python wrapper around the Tesseract OCR engine that does exactly this for you. It works well because the engine has been trained on a lot of other images, so it's good at detecting text in any image.
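A minimal sketch of that (assuming the Tesseract binary is installed; "passport.png" is just a placeholder filename):

import pytesseract
from PIL import Image
from pytesseract import Output

img = Image.open("passport.png")

# Full plain-text extraction
print(pytesseract.image_to_string(img))

# Word-level bounding boxes, useful for locating where the name sits on the page
data = pytesseract.image_to_data(img, output_type=Output.DICT)
for text, left, top in zip(data["text"], data["left"], data["top"]):
    if text.strip():
        print(text, "at", (left, top))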
If you have labels, you might be able to improve the model you make with pytesseract, but this is a lot harder. If you want to learn it anyway, I would recommend learning TensorFlow and using "transfer learning" to improve your model.
I am trying to read a date from the OCR response of an image. The OCR output is something like this:
\nPatientsName:KantibhaiPatelAgeISex:71YearslMale\nRef.by:Dr.KetanShuklaMS.MCH.\nReg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\nTESTRESULTREFERENCEINTERVAL\n
I am interested in extracting the reporting date i.e. 29/06/2015. Also I am interested in storing the patient details in a database (MongoDB) chronologically. Hence I need to store the date in a standardized format for easy future queries.
All suggestions are welcomed.
Edit - Since the data comes from an OCR response, there tends to be a lot of noise and sometimes misinterpreted characters. Is there any method with better fault tolerance for string searching?
re.search(r'Date:([0-9]{2}\/[0-9]{2}\/[0-9]{4})', ocr_response).group(1)
The above statement explicitly looks for numbers, but what if some number is not read at all, or is misinterpreted as a letter?
Use the re module:
import re
print(re.search(r'(?:Date:)?([0-9]{0,2}[\/-]([0-9]{0,2}|[a-z]{3})[\/-][0-9]{0,4})', ocr_response).group(1))
Output:
29/06/2015
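If OCR noise is a worry, one lightweight option is to normalize characters that OCR commonly confuses with digits before matching, and to parse the hit into a real datetime so MongoDB can sort it chronologically. A sketch (the confusion map is an assumption based on typical OCR substitutions; extend it for your data):

import re
from datetime import datetime

def extract_report_date(ocr_response):
    # Characters OCR often mistakes for digits (assumed; tune per your data)
    confusions = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})
    normalized = ocr_response.translate(confusions)
    m = re.search(r"\d{2}[/-]\d{2}[/-]\d{4}", normalized)
    if m is None:
        return None
    # A real datetime is stored natively by MongoDB and sorts chronologically
    return datetime.strptime(m.group(0).replace("-", "/"), "%d/%m/%Y")

For the sample response above, this returns datetime(2015, 6, 29, 0, 0).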
You should go with a good NER (Named Entity Recognition) model. You can custom-train your own model if you have a good amount of annotated training data, or you can use pre-trained models, which do not require an annotated dataset.
spaCy is a good Python library for NER. Have a look at the link below:
https://spacy.io/
It uses deep neural networks under the hood to recognize the various entities present in the text (dates, in your case).
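For example, a minimal sketch (assuming the small English model en_core_web_sm has been downloaded; whether a slash-separated date gets tagged as DATE can vary by model version):

import spacy

# One-time model download: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Reg. Date: 29/06/2015 19:03, Reporting Date: 29/06/2015 19:10")
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
print(dates)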
Hope this gives you an alternative to regular expressions; thanks for the upvote in advance.
Objective: I am trying to do a project on Natural Language Processing (NLP), where I want to extract information and represent it in graphical form.
Description:
1. I am considering news articles as input to my project.
2. Removing unwanted data from the input and putting it into a clean format.
3. Performing NLP and extracting information/knowledge.
4. Representing the information/knowledge in graphical form.
Is this possible?
If you want to use nltk, you can start here. It has some explanation of tokenizing, part-of-speech tagging, parsing and more.
Check this page for an example of named entity detection using nltk.
The graphical representation can be done using igraph or matplotlib.
Also, scikit-learn has great text feature extraction methods, in case you want to run some more sophisticated models.
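To make the first steps concrete, here is a small sketch with nltk (assuming the standard data packages have been downloaded via nltk.download; the sentence is an invented example):

import nltk

# One-time downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Apple opened a new office in Berlin on Monday, Reuters reported."
tokens = nltk.word_tokenize(sentence)   # tokenizing
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named entity detection
print(tree)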
The first step is to try to do this job yourself by hand with a pencil. Try it on not just one but a collection of news stories. You really do have to do this, not just think about it. Draw the graphics just as you'd want the computer to.
What this does is force you to create rules about how information is transformed into graphics. This is NOT always possible, so doing it by hand is a good test: if you can't do it, then you can't program a computer to do it.
Assuming you have found a paper-and-pencil method, what I like to do is work BACKWARDS. Your method starts with the text. No. Start with the numbers you need to draw the graphic, then think about where those numbers are in the stories and what words you have to look at to get them. Your job is now more like a hunting trip: you know the data is there, but how do you find it?
Sorry for the lack of details, but I don't know your exact problem; this approach works in every case, though. First learn to do the job yourself on paper, then work backwards from the output to the input.
If you try to design this software in the forward direction, you get stuck soon, because you can't possibly know what to do with your text while you don't know what you need. It's like pushing a rope: it doesn't work. Go to the other end and pull the rope. Do the graphics work FIRST, then pull the needed data from the news stories.
I've been trying to read about NLP in general and nltk in particular to use with Python. I don't know for sure if what I'm looking for exists out there, or if I perhaps need to develop it myself.
I have a program that collects text from different files; the text is extremely random and talks about different things. Each file contains a paragraph, or three at most. My program opens the files and stores them in a table.
My question is: can I guess tags for what each paragraph is about? If anyone knows of an existing technology or approach, I would really appreciate it.
Thanks,
Your task is called "document classification", and the nltk book has a whole chapter on it. I'd start with that.
It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (selecting the N most important words or phrases in the text)?
You should train a classifier. The easiest one to develop (and you don't really need to develop it yourself, as NLTK provides one) is the naive Bayesian. The catch is that you'll need to manually classify a corpus of observations and then have the program guess which tag best fits a given paragraph (needless to say, the bigger the training corpus, the more precise your classifier will be; IMHO you can reach 80-85% accuracy). Take a look at the docs.
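A bare-bones sketch of that (the two-example training set is obviously hypothetical; in practice you would label a much larger corpus):

import nltk

def features(paragraph):
    # Simple bag-of-words presence features, as in the NLTK book
    return {word.lower(): True for word in paragraph.split()}

# Hypothetical hand-labelled corpus; replace with your real annotated paragraphs
train = [
    ("The court upheld the ruling on appeal.", "law"),
    ("The striker scored twice before halftime.", "sports"),
]
classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), tag) for text, tag in train]
)
print(classifier.classify(features("The referee stopped the match early.")))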