Perform Named Entity Recognition - NLP - python

I am trying to learn how to perform Named Entity Recognition.
I have a set of discharge summaries containing medical information about patients. I converted my unstructured data into structured data. Now, I have a DataFrame that looks like this:
Text                        | Target
normal coronary arteries... | R060
The Text column contains information about the diagnosis of a patient, and the Target column contains the code that will need to be predicted in a further task.
I have also constructed a dictionary that looks like this:
Code (Key) | Term (Value)
A00        | Cholera
This dictionary holds information about each diagnosis and its corresponding code. The Term column will be used to identify the clinical entities in the corpus.
I will need to train a classifier and predict the code in order to automate the process of assigning codes to discharge summaries (I am explaining this so you have an idea of the task I am performing).
So far I have converted my data into a structured format. I am now trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure what the preceding steps should be. Should I tokenize, stem, or lemmatize first? Or should I first find the medical terminology, since clinical named entities are often multi-token terms with nested structures that contain other named entities? Also, which packages or tools would you recommend in Python?
I am new to this field, so any help will be appreciated! Thanks!

If you are asking about building a classification model, deep learning is a good option; deep learning models tend to perform very well on classification tasks.
For language processing tasks like this, I recommend first tokenizing your text and padding the sequences to a fixed length. Basic tokenization should be enough, but additional preprocessing such as basic string cleaning can improve model accuracy by a few percentage points. For basic string processing you can use regular expressions (Python's built-in re module):
https://docs.python.org/3/library/re.html
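For example, a minimal cleaning function with re could look like this; the exact normalization steps (lowercasing, stripping punctuation, collapsing whitespace) are just an illustration and should be adapted to your summaries:

```python
import re

def clean_text(text):
    """Lowercase, drop non-alphanumeric characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Normal coronary arteries; no evidence of stenosis."))
# -> "normal coronary arteries no evidence of stenosis"
```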
I assume you are doing the dictionary mapping after preprocessing. Mapping may be enough for a classification task, but I recommend learning about word embeddings; word embeddings will improve your model.
For all of these tasks I recommend TensorFlow. It is a widely used framework for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation; all the learning material is in the tutorials section:
https://www.tensorflow.org/tutorials/
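As a rough sketch, tokenizing and padding with the Keras preprocessing utilities that ship with TensorFlow could look like this (the vocabulary size, sequence length, and the second example sentence are placeholders):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy snippets; in practice this would be the Text column of the DataFrame
texts = ["normal coronary arteries", "acute inferior myocardial infarction"]

# Build a word index, convert texts to integer sequences, and pad to a fixed length
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=50, padding="post")
print(padded.shape)  # (2, 50)
```

Newer TensorFlow versions also provide a TextVectorization layer that performs both steps inside the model.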
I think this will help you. All the best with your work!
Thank you.

Related

Huggingface NER with custom data

I have CSV data as below:
token  | label
0.45"  | length
1-12   | size
2.6"   | length
8-9-78 | size
6mm    | length
Whenever I get text like the following:
6mm 8-9-78 silver head
I should be able to say length = 6mm and size = 8-9-78. I'm new to the NLP world and I'm trying to solve this using Huggingface NER. I have gone through various articles, but I can't figure out how to train it with my own data. Which model/tokeniser should I make use of? Or should I build my own? Any help would be appreciated.
I would maybe look at spaCy's pattern matching + NER to start. The pattern-matching rules spaCy provides are really powerful, especially when combined with its statistical NER models. You can even use the patterns you develop to create your own custom NER model. This will give you a good idea of where you still have gaps or complexity that might require something else like Huggingface, etc.
If you are willing to pay, you can also leverage Prodigy, which provides a nice UI with human-in-the-loop interactions.
Adding REGEX entities to SpaCy's Matcher
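As a rough sketch, one way to attach regex-based entities to a spaCy Doc (the label names and patterns below are illustrative assumptions for the length/size example):

```python
import re
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline, enough for this sketch

# Hypothetical patterns for the two label types in the question
LENGTH_RE = re.compile(r'\d+(\.\d+)?(mm|")')
SIZE_RE = re.compile(r"\d+(-\d+)+")

def add_regex_entities(doc):
    """Run the regexes over the raw text and attach the matches as entity spans."""
    spans = []
    for label, pattern in (("LENGTH", LENGTH_RE), ("SIZE", SIZE_RE)):
        for m in pattern.finditer(doc.text):
            span = doc.char_span(m.start(), m.end(), label=label, alignment_mode="expand")
            if span is not None:
                spans.append(span)
    doc.ents = spacy.util.filter_spans(spans)  # drop overlapping matches
    return doc

doc = add_regex_entities(nlp("6mm 8-9-78 silver head"))
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('6mm', 'LENGTH'), ('8-9-78', 'SIZE')]
```

The same spans could also be fed to spaCy's EntityRuler or used to bootstrap training data for a statistical NER model.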
I had two options: one was spaCy (as suggested by @scarpacci) and the other was Spark NLP. I opted for Spark NLP and found a solution. I formatted the data in CoNLL format and trained a model using Spark NLP's NerDLApproach with GloVe word embeddings.

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words each (no stop-word removal yet); 75% are in German, 15% in English, and the rest are in other languages. The goal is to recommend similar documents based on an existing one. To begin with, I want to focus on the German and English documents.
To achieve this goal I looked into several feature-extraction methods for document similarity. The word embedding methods in particular impressed me, because they are context-aware, in contrast to simple TF-IDF feature extraction combined with cosine similarity.
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of them yet. I know for sure that my documents are too long for BERT, but there are FastText, Sent2Vec, Doc2Vec, and Google's Universal Sentence Encoder. Based on my research, my favorite method is Doc2Vec, even though there are no (or only outdated) pre-trained models, which means I would have to do the training myself.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Do I achieve good results if I train the model on English / German Wikipedia? 
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225K German documents and 45K English documents are each plausibly large enough for Doc2Vec, as they match or exceed the corpus sizes used in some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia), and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German, given compound words built from common-enough roots that are individually rare; I'm not sure. FastText-based approaches that use word fragments might be helpful, but I don't know of a Doc2Vec-like algorithm that uses the same character-n-gram trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known labels to bootstrap better text vectors, but that's highly speculative, and that mode isn't supported in Gensim.)
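If you do try Doc2Vec, a minimal Gensim sketch looks roughly like this; the toy corpus and hyperparameters are placeholders, and vector_size, min_count, and epochs would need to be tuned on the real 300k-document collection:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice each TaggedDocument is one tokenized document
corpus = [
    TaggedDocument(words=["recommend", "similar", "documents"], tags=["doc_0"]),
    TaggedDocument(words=["word", "embeddings", "are", "context", "aware"], tags=["doc_1"]),
]

model = Doc2Vec(vector_size=100, min_count=1, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen document and retrieve the most similar training docs
# (model.dv is the Gensim 4.x name for the document vectors; older versions use docvecs)
vec = model.infer_vector(["similar", "word", "embeddings"])
print(model.dv.most_similar([vec], topn=2))
```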

How to decide between NER and QA Model?

I am completing a task involving NLP and transformers. I would like to identify relevant features in a corpus of text. For example, if I were extracting relevant features from job descriptions, these would be the tools used on the job (PowerPoint, Excel, Java, etc.) and the level of proficiency required. Would this task be better suited to a Named Entity Recognition model or a Question Answering model?
If I approached it as an NER task, I would attach a label to all the relevant tools in the training data and hope it generalizes well. I could approach the problem similarly as a QA task and ask things like "what tools does this job require?", supplying a description as context.
I plan to use the transformers library unless I am missing a better tool for this task. There are many features I am looking to extract, so not all of them may be as simple as grabbing keywords from a list (programming languages, Microsoft Office, etc.).
Would one of these approaches be a better fit, or am I missing a better way to approach the problem?
Any help appreciated. Thank you!
From what you describe, it seems to be an entity recognition task (both options are sketched after the checklist below). However, the questions you should ask and answer yourself are:
How will your user interact with the model?
Structured information → Entity recognition.
Chatbot → QA.
Is there a predefined set of entities that you are going to extract from the text?
Yes → entity recognition.
No → QA.
What does the training data you have for fine-tuning look like?
Only a few examples → Entity recognition.
Plenty of question-answer pairs → QA.
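To make the two options concrete, here is a rough sketch of both using the transformers pipelines. The example text is invented, and the default pipeline models are generic, so a real solution would fine-tune on labeled job descriptions:

```python
from transformers import pipeline

text = "Requires strong Excel and Java skills; PowerPoint experience is a plus."

# Entity-recognition view: a token-classification pipeline. A generic pretrained
# NER model will mostly tag things like ORG/MISC; fine-tuning on job ads with a
# custom TOOL label would be needed in practice.
ner = pipeline("token-classification", aggregation_strategy="simple")
print(ner(text))

# QA view: extract an answer span given a question and the description as context.
qa = pipeline("question-answering")
print(qa(question="What tools does this job require?", context=text))
```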

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis at the corpus level, only at the document level.
Ideally my pipeline would look something like this (I think):
1. Import a list of known n-grams into the tokenizer
2. Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
3. Identify the most common bi- and tri-grams in the corpus that I missed
4. Re-tokenize using the found n-grams
5. Split by rating (>=4 and <=3)
6. Find the most common topics for each split of the data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo, state-of-the-art results for your problem!
It's called zero-shot learning.
State-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html
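For reference, a rough sketch of zero-shot classification with the transformers pipeline (the feedback sentence and candidate labels below are made-up examples for this domain):

```python
from transformers import pipeline

# Zero-shot classification: label feedback without any annotated training data
classifier = pipeline("zero-shot-classification")
result = classifier(
    "The detour added twenty minutes and took me out of my way.",
    candidate_labels=["routing", "traffic", "app usability", "pricing"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label
```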
Let me know if it works for you or if you need any other help.
The VADER tool works well for sentiment analysis and other NLP applications.
I think the proposed workflow is fine for this case study. Pay close attention to your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision: spaCy's rule-based match engines and components not only help you find the terms and phrases you are searching for, but, compared with regular expressions, also give you access to the tokens within a text and their relationships.
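For example, a minimal sketch of spaCy's PhraseMatcher handling the known multi-word terms from the question (the example sentence is invented):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Register the known n-grams as case-insensitive phrase patterns
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(t) for t in terms])

doc = nlp("The app kept routing me out of my way instead of using the HOV lane.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# -> "out of my way", "HOV lane"
```

Matched spans can then be merged into single tokens with doc.retokenize() before any downstream counting or topic modelling.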

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do this with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then label those clusters. But, as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on which techniques I should follow? I appreciate any help.
Alright, from what I can understand, there are multiple ways to approach this case.
There will be trade-offs, and the accuracy may vary, because of the well-known observation that
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please describe the source of your data and how you are extracting it; I am assuming you are just getting general tweets, which can be about anything.
What you can do is generate a dictionary for each class you have
(e.g. Music => pop, jazz, rap, instruments, ...)
which will contain words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start by extracting:
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Have a look at these NLP lexical semantics slides; they will clarify some of the concepts.
Once you have a dictionary for each class, cross-compare them with the tweets you have. Assign each tweet to the class with the highest similarity (you can rank the classes by the number of occurrences of words from these dictionaries). This will give you labeled tweets, just like any other labeled dataset.
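As a rough sketch of that idea using NLTK's WordNet interface (the "music" seed word and the overlap score below are just illustrations, and WordNet requires nltk.download("wordnet") first):

```python
from nltk.corpus import wordnet as wn  # needs: nltk.download("wordnet")

def expand_term(word):
    """Collect synonyms, hyponyms, hypernyms, meronyms and holonyms of a seed word."""
    related = set()
    for syn in wn.synsets(word):
        related.update(lemma.name().lower() for lemma in syn.lemmas())
        for rel in syn.hyponyms() + syn.hypernyms() + syn.part_meronyms() + syn.part_holonyms():
            related.update(lemma.name().lower() for lemma in rel.lemmas())
    return related

# Hypothetical class dictionary seeded from a single word; real dictionaries
# would combine several seed words per class.
music_dict = expand_term("music")

# Score a tweet by how many of its words appear in the class dictionary
tweet = "listening to some great jazz and pop tonight"
score = sum(1 for w in tweet.lower().split() if w in music_dict)
print(score)
```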
Now the question is accuracy. That depends on your data and on how distinct your classes are. This may be overkill, but it should come close to what you want.
Furthermore, you can label a subset of tweets this way and then use cosine similarity to identify the classes of the remaining tweets. This will help with the optimization part. But then again, it's up to you, as you know which trade-offs you can bear.
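A minimal sketch of that label-propagation step, here using TF-IDF vectors and scikit-learn's cosine similarity (the tweets and labels are made up, and TF-IDF is just one possible vectorizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tweets labeled by the dictionary step, plus one still-unlabeled tweet
labeled = ["great jazz concert tonight", "the team won the championship game"]
labels = ["music", "sports"]
unlabeled = ["amazing jazz solo at the concert"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(labeled + unlabeled)

# Compare each unlabeled tweet to the labeled ones and take the nearest label
sims = cosine_similarity(X[len(labeled):], X[:len(labeled)])
print(labels[sims[0].argmax()])
```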
The real struggle will be the machine learning part and how you manage that.
Actually, this looks like a typical use case for semi-supervised learning. There are plenty of methods to use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from the labeled samples onto the distribution of the unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic for dealing with clusters with mixed labels. Furthermore, solving an optimization problem that is not related to the task (labeling) will obviously not be as good as actually using that knowledge.
You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.
