Parsing date from OCR response in Python - python

I am trying to read a date from the OCR response of an image. The OCR output is something like this:
\nPatientsName:KantibhaiPatelAgeISex:71YearslMale\nRef.by:Dr.KetanShuklaMS.MCH.\nReg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\nTESTRESULTREFERENCEINTERVAL\n
I am interested in extracting the reporting date, i.e. 29/06/2015. I also want to store the patient details in a database (MongoDB) chronologically, so I need to store the date in a standardized format for easy future queries.
All suggestions are welcome.
Edit - Since the data comes from an OCR response, there tends to be a lot of noise and the occasional misinterpreted character. Is there any method with better fault tolerance for string searching?
re.search(r'Date:([0-9]{2}\/[0-9]{2}\/[0-9]{4})', ocr_response).group(1)
The above statement explicitly looks for digits, but what if a digit is not read at all or is misinterpreted as a letter?
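One direction that may help with the fault tolerance (a rough sketch, not from the original post): the third-party regex package supports fuzzy matching, so the literal label can absorb a couple of misread characters, assuming the digits themselves come through intact.

# Sketch only: uses the third-party "regex" package (pip install regex),
# whose fuzzy matching lets the "ReportingDate" anchor tolerate OCR errors.
import regex

# Excerpt of the OCR output shown above
ocr_response = "Reg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10"

# Allow up to 2 character errors in the label and accept ':' or '.' after it.
m = regex.search(
    r"(?:ReportingDate){e<=2}[:.]?\s*([0-9]{2}/[0-9]{2}/[0-9]{4})",
    ocr_response,
)
if m:
    print(m.group(1))  # 29/06/2015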

Use the re module:
import re
print(re.search(r'[Date:]*([0-9]{0,2}[\/-]([0-9]{0,2}|[a-z]{3})[\/-][0-9]{0,4})', ocr_response).group(1))
Output:
29/06/2015
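As a follow-up on the storage side (not part of the original answer): parsing the matched string into a datetime object gives a standardized value, and PyMongo stores datetime objects as native BSON dates, so chronological sorting and range queries work directly. The collection and field names below are illustrative.

from datetime import datetime

date_str = "29/06/2015"  # value extracted by the regex above
report_date = datetime.strptime(date_str, "%d/%m/%Y")

# PyMongo stores datetime objects as BSON dates, so range queries work, e.g.:
# db.reports.insert_one({"patient": "...", "reporting_date": report_date})
# db.reports.find({"reporting_date": {"$gte": datetime(2015, 1, 1)}})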

You should go with a good NER (Named Entity Recognition) model. You can train your own custom model if you have a good amount of annotated training data, or you can use pre-trained models, which do not require an annotated dataset.
spaCy is a good Python library for NER. Have a look at the link below:
https://spacy.io/
It uses deep neural networks under the hood to recognize the various entities present in the text (dates, in your case).
Hope this gives you an alternative to regular expressions; thanks in advance for the upvote.
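For illustration, a minimal sketch of that spaCy route (assuming the small English model en_core_web_sm is installed and the OCR text has been lightly cleaned so the words are separated):

# Sketch: requires `pip install spacy` and `python -m spacy download en_core_web_sm`
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Reporting Date. 29/06/2015 19:10 HEMOGRAM REPORT")

# Keep only the entities the model tags as dates
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
print(dates)  # e.g. ['29/06/2015']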

Related

Huggingface NER with custom data

I have CSV data as below:
**token**    **label**
0.45"        length
1-12         size
2.6"         length
8-9-78       size
6mm          length
Whenever I get text such as the following:
6mm 8-9-78 silver head
I should be able to say length = 6mm and size = 8-9-78. I'm new to the NLP world and I'm trying to solve this using Huggingface NER. I have gone through various articles, but I don't understand how to train with my own data. Which model/tokeniser should I use? Or should I build my own? Any help would be appreciated.
I would maybe look at spaCy's pattern matching + NER to start. The pattern matching rules spaCy provides are really powerful, especially when combined with their statistical NER models. You can even use the patterns you develop to create your own custom NER model. This will give you a good idea of where you still have gaps or complexity that might require something else like Huggingface, etc.
If you are willing to pay, you can also leverage Prodigy, which provides a nice UI with human-in-the-loop interactions.
Adding REGEX entities to SpaCy's Matcher
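As an illustration of the rule-based side of this suggestion, a minimal EntityRuler sketch (the LENGTH/SIZE labels and patterns are made up for this example; hyphenated values need a multi-token pattern because spaCy's default tokenizer splits them on the hyphens):

# Sketch: rule-based entities via spaCy's EntityRuler (spaCy v3).
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    # A single token like "6mm" -> LENGTH
    {"label": "LENGTH", "pattern": [{"TEXT": {"REGEX": r"^\d+(\.\d+)?mm$"}}]},
    # "1-12" or "8-9-78" is tokenized as digits and hyphens -> SIZE
    {"label": "SIZE", "pattern": [
        {"IS_DIGIT": True},
        {"TEXT": "-"},
        {"IS_DIGIT": True},
        {"TEXT": "-", "OP": "?"},
        {"IS_DIGIT": True, "OP": "?"},
    ]},
]
ruler.add_patterns(patterns)

doc = nlp("6mm 8-9-78 silver head")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected: [('6mm', 'LENGTH'), ('8-9-78', 'SIZE')]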
I had two options: one was spaCy (as suggested by @scarpacci) and the other was SparkNLP. I opted for SparkNLP and found a solution. I formatted the data in CoNLL format and trained using Spark NLP's NerDLApproach and GloVe word embeddings.
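For readers taking the same route, a rough sketch of what that Spark NLP training setup can look like (the file path, column names and parameters are illustrative, not taken from the answer above):

# Sketch: train a NerDLApproach on CoNLL-formatted data with GloVe embeddings.
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import NerDLApproach, WordEmbeddingsModel

spark = sparknlp.start()

# The CoNLL reader yields document/sentence/token/pos/label columns
training_data = CoNLL().readDataset(spark, "train.conll")

embeddings = (
    WordEmbeddingsModel.pretrained("glove_100d")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)

ner_approach = (
    NerDLApproach()
    .setInputCols(["sentence", "token", "embeddings"])
    .setLabelColumn("label")
    .setOutputCol("ner")
    .setMaxEpochs(10)
)

ner_model = ner_approach.fit(embeddings.transform(training_data))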

Perform Named Entity Recognition - NLP

I am trying to learn how to perform Named Entity Recognition.
I have a set of discharge summaries containing medical information about patients. I converted my unstructured data into structured data. Now, I have a DataFrame that looks like this:
Text                        | Target
normal coronary arteries... | R060
The Text column contains information about the diagnosis of a patient, and the Target column contains the code that will need to be predicted in a further task.
I have also constructed a dictionary that looks like this:
Code (Key) | Term (Value)
A00        | Cholera
This dictionary maps each diagnosis to its corresponding code. The Term column will be used to identify the clinical entities in the corpus.
I will need to train a classifier and predict the code in order to automate the process of assigning codes to the discharge summaries (I am explaining this to give an idea of the task I'm performing).
So far I have converted my data into a structured form. I am trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure what the preceding steps are. Should I perform tokenizing, stemming and lemmatizing first? Or should I first find the medical terminology, since clinical named entities are often multi-token terms with nested structures that include other named entities? Also, which packages or tools would you recommend using in Python?
I am new to this field, so any help will be appreciated! Thanks!
If you are asking about building a classification model, then you should go for deep learning. Deep learning is highly effective for classification.
When dealing with this type of language processing task, I recommend first tokenizing your text and padding the sequences. Basic tokenization should be enough, but you can add more preprocessing such as basic string processing, because proper preprocessing can improve your model accuracy by up to 3% or 4%. For basic string processing, you can use regex (the built-in re package) in Python.
https://docs.python.org/3/library/re.html
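A small illustrative sketch of that cleanup-plus-tokenize-and-pad step (the example texts, vocabulary size and sequence length are placeholders; TextVectorization is one current TensorFlow utility that tokenizes and pads in a single step):

import re
import tensorflow as tf

texts = ["Normal coronary arteries...", "Chronic obstructive pulmonary disease."]

# Basic string preprocessing with re: lowercase and drop punctuation
cleaned = [re.sub(r"[^a-z0-9\s]", " ", t.lower()) for t in texts]

# Tokenize and pad every summary to the same length in one step
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000, output_sequence_length=50
)
vectorizer.adapt(cleaned)
padded = vectorizer(tf.constant(cleaned))
print(padded.shape)  # (2, 50)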
I think you are doing the mapping after preprocessing. A simple mapping should be enough for tasks like classification, but I recommend learning about word embeddings; they will improve your model.
For all these tasks, I recommend using TensorFlow. TensorFlow is a well-known tool for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation; all the learning material is provided in the TensorFlow tutorials section.
https://www.tensorflow.org/tutorials/
I think this will help you. All the best for your work!
Thank you.

A machine learning model for matching pattern between two sets of strings?

I am trying to learn, using machine learning, the HTML transformations performed by a certain service. I have broken my problem down into a pattern matching problem. For now I am trying to learn the pattern in which tags are transformed. For example, for the same data I have the pattern "html, body, div, h1" in the original HTML and the pattern "html, body, div, div, div" in the transformed page. I have 14000 such data points and I want to train a model that takes patterns from the original page as input and outputs the transformed patterns. I have looked into a few NLP models, but either I have failed to understand them completely or they were not very helpful.
If someone could give me any pointers or preferably suggest some python based model that would be great.
Your question is not clear enough to give a precise answer, but from what I was able to figure out, your input is a string pattern of HTML tags and your output is also a string pattern of HTML tags.
You can use a bidirectional LSTM or a CRF for this kind of task. Read about them and you'll have a clearer idea.
But if the same input pattern maps to multiple output patterns, it will be difficult for most ML algorithms to learn. You can remove those data points and you'll be good to go.
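For illustration, a minimal Keras sketch of a bidirectional LSTM tagger over padded tag-ID sequences (MAX_LEN and the vocabulary sizes are placeholders; when input and output sequences have different lengths, as in the html/body/div example, an encoder-decoder would be needed instead):

# Sketch: BiLSTM that predicts an output tag ID for each input position.
from tensorflow.keras import layers, models

MAX_LEN = 20        # padded sequence length (placeholder)
N_INPUT_TAGS = 50   # input tag vocabulary size, incl. padding id 0
N_OUTPUT_TAGS = 50  # output tag vocabulary size, incl. padding id 0

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(input_dim=N_INPUT_TAGS, output_dim=32, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(N_OUTPUT_TAGS, activation="softmax"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()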

How to separate title and headers from body text in image

I am using tesseract (through the python wrapper) in order to extract text from documents. These documents do not include any images or tables, simply text.
Is there any option to distinguish titles/headings from the body text? Ideally I want something like an XML tree rather than the full string chain (I do not need a visual of the document layout).
I found some third party tools that seem to be able to help but I was wondering if I can do it directly from tesseract.
You can use the Nanonets OCR API to create your own model that separates headings and text, or you can add different labels.
I am quite late to answer, but this answer might help others who are looking for a solution.
Firstly, tesseract alone won't be able to extract such "features" from the document. But all you need is a little bit of understanding of ML and vision libraries (like Luminoth or Detectron2).
Basically, you have to provide some sample documents with mark-ups (like title, header1, header2, etc.) and train the model. After training you can use the model on unseen images to fetch such details.
You can use an ML-based solution, but for such use cases I prefer lightweight solutions based on OpenCV's features. You can use regular text detection and pair it with morphological transformations to detect header text.
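One lightweight heuristic in that spirit (this sketch relies on tesseract's own word boxes via pytesseract rather than a separate OpenCV text detector; the file name and the 1.5x threshold are assumptions):

# Sketch: flag words whose box height is well above the median as likely headings.
import cv2
import numpy as np
import pytesseract
from pytesseract import Output

img = cv2.imread("document.png")  # illustrative path
data = pytesseract.image_to_data(img, output_type=Output.DICT)

heights = [h for h, txt in zip(data["height"], data["text"]) if txt.strip()]
median_h = np.median(heights)

headings = [
    txt
    for txt, h in zip(data["text"], data["height"])
    if txt.strip() and h > 1.5 * median_h  # rough heuristic threshold
]
print(headings)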

Setup data for dynamic topic modelling

I'm trying to learn dynamic topic modeling (to capture the semantic changes in words) from data scraped from PubMed. I was able to get the data in XML form, extract the "abstract" text and the date information from it, and save that in CSV format. (But this is just a part of the data.)
Format obtained
Year|month|day|abstractText
I'm planning on using gensim's LDA for my model.
I've never really done topic modeling before and need your help guiding me through this process one step at a time.
Questions:
Is CSV a preferred format to feed into gensim's LDA?
For dynamic modeling, how should the time aspect of the data be captured and used in the model?
Is there a better way to organize the data than CSV files?
Should I use the body text instead of the abstract for this?
Hope I learn a lot from this. Thanks in advance.
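For what it's worth, a minimal sketch of one way such a setup can look with gensim's dynamic topic model, LdaSeqModel (the file name, column names and parameters are assumptions based on the format described above):

# Sketch: build a chronological corpus from the CSV and fit gensim's LdaSeqModel.
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models.ldaseqmodel import LdaSeqModel
from gensim.utils import simple_preprocess

df = pd.read_csv("pubmed_abstracts.csv")       # columns: Year, month, day, abstractText
df = df.sort_values(["Year", "month", "day"])  # keep documents in time order

docs = [simple_preprocess(text) for text in df["abstractText"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# time_slice = number of documents in each period, in chronological order
time_slice = df.groupby("Year").size().sort_index().tolist()

ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5)
print(ldaseq.print_topics(time=0))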
