Huggingface NER with custom data - python

I have a csv data as below.
| token | label |
| --- | --- |
| 0.45" | length |
| 1-12 | size |
| 2.6" | length |
| 8-9-78 | size |
| 6mm | length |
Whenever I get text like the following
6mm 8-9-78 silver head
I should be able to extract length = 6mm and size = 8-9-78. I'm new to the NLP world and I'm trying to solve this using Huggingface NER. I have gone through various articles, but I can't figure out how to train a model with my own data. Which model/tokenizer should I use? Or should I build my own? Any help would be appreciated.

I would maybe look at spaCy's pattern matching + NER to start. The pattern-matching rules spaCy provides are really powerful, especially when combined with their statistical NER models. You can even use the patterns you develop to create your own custom NER model. This will give you a good idea of where you still have gaps or complexity that might require something else like Huggingface, etc.
If you are willing to pay, you can also leverage Prodigy, which provides a nice UI with human-in-the-loop interactions.
Adding REGEX entities to SpaCy's Matcher
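To make the rule-based starting point concrete, here is a minimal sketch (my own illustration, assuming spaCy v3; the regexes are guesses based on the sample rows) that tags the length/size values from the question by running regexes over the raw text and converting the matches into entity spans:

```python
import re
import spacy

nlp = spacy.blank("en")  # a blank pipeline is enough here; we only need tokenization

# Assumed regexes based on the sample rows: lengths like 0.45", 2.6", 6mm;
# sizes like 1-12 or 8-9-78.
patterns = {
    "LENGTH": re.compile(r'\d+(?:\.\d+)?(?:mm|")'),
    "SIZE": re.compile(r"\d+(?:-\d+)+"),
}

doc = nlp("6mm 8-9-78 silver head")
spans = []
for label, pattern in patterns.items():
    for match in pattern.finditer(doc.text):
        span = doc.char_span(match.start(), match.end(), label=label, alignment_mode="expand")
        if span is not None:
            spans.append(span)

doc.ents = spans
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('6mm', 'LENGTH'), ('8-9-78', 'SIZE')]
```

Once patterns like these stop being enough, the same LENGTH/SIZE spans can serve as training data for a statistical model.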

I had two options: one was spaCy (as suggested by @scarpacci) and the other was Spark NLP. I opted for Spark NLP and found a solution. I formatted the data in CoNLL format and trained using Spark NLP's NerDLApproach with GloVe word embeddings.
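For reference, the CoNLL conversion could look roughly like this; a sketch assuming a train.csv with token and label columns as in the question (file names are hypothetical, and the POS/chunk columns are just placeholder values):

```python
import csv

# Sketch: convert a CSV with "token" and "label" columns (as in the question) into
# CoNLL 2003-style lines: "token POS chunk NER-tag", one token per line, with a
# blank line between sentences. The POS ("NN") and chunk ("O") columns are placeholders.
def csv_to_conll(csv_path, conll_path):
    with open(csv_path, newline="") as src, open(conll_path, "w") as out:
        out.write("-DOCSTART- -X- -X- O\n\n")
        for row in csv.DictReader(src):
            tokens = row["token"].split()
            label = row["label"].upper()           # e.g. "length" -> "LENGTH"
            for i, tok in enumerate(tokens):
                tag = ("B-" if i == 0 else "I-") + label
                out.write(f"{tok} NN O {tag}\n")
            out.write("\n")                        # each CSV row treated as its own sentence

csv_to_conll("train.csv", "train.conll")           # hypothetical file names
```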

Related

How to use NLP POS tagging in other languages like Sindhi / Urdu

I am working on a research paper on POS tagging in NLP, but my question is how to implement POS tagging in another local language. Please help me, thank you.
It depends on the POS-Tagger you are using. Usually a (probabilistic) tagger has two language-specific components: a language model and a dictionary.
The dictionary contains all words with their possible tags, annotated by frequency. This can be created and edited manually, or derived from training data. If your language has a rich morphology, you might want to use a morphological analyser to support this, or you could simply have all inflected forms as dictionary entries in their own right.
The language model contains sequences of tags and their frequencies, usually trigrams (sequences of three items). It is extracted from training data, and reflects grammatical constraints on word class distribution.
So, in order to adapt an existing tagger to a new language, there are two main steps:
1. Create a tag set for your language. While there is some overlap between tag sets for different languages (they usually all have nouns or verbs), you might want specific markers for cases or tenses, as they can help in disambiguation.
2. Annotate training data. You need some texts to generate the language model (and possibly also the dictionary). This data you feed into the training algorithm to produce the language-specific resource files.
Annotating by hand is fairly tedious, but you can use an iterative process: annotate a smallish text, run it through the training mechanism, and use the tagger to annotate a longer text. This will have many errors, but it's easier to correct the errors than it is to annotate a text from scratch. Then add this text to your training data and repeat. You will find that the tagger's performance gradually gets better as you build up more training data.
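As one concrete (and deliberately tiny) example of the train-then-iterate loop, here is a sketch using NLTK's backoff n-gram taggers; the sentences and tag set are placeholders, not real Sindhi/Urdu data:

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# Hypothetical hand-annotated sentences; replace with data in your target
# language and your own tag set.
train_sents = [
    [("this", "PRON"), ("is", "VERB"), ("a", "DET"), ("book", "NOUN")],
    [("read", "VERB"), ("the", "DET"), ("book", "NOUN")],
    # ... more annotated sentences ...
]

# Backoff chain: trigram -> bigram -> unigram -> most-frequent-tag default.
# The unigram tagger plays roughly the role of the dictionary described above,
# and the n-gram taggers the role of the language model.
t0 = DefaultTagger("NOUN")
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
t3 = TrigramTagger(train_sents, backoff=t2)

print(t3.tag("read a book".split()))
```

Retrain the chain each time you correct and add a newly annotated batch of text.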

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis at the corpus level, only at the document level.
Ideally my pipeline would look something like this (I think):
1. Import a list of known n-grams into the tokenizer
2. Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
3. Identify the most common bi- and tri-grams in the corpus that I missed
4. Re-tokenize using the found n-grams
5. Split by rating (>=4 and <=3)
6. Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo: state-of-the-art results for your problem!
It's called zero-shot learning: state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you or if you need any other help.
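For illustration, the approach in that post boils down to the Hugging Face transformers zero-shot-classification pipeline; the feedback string and candidate labels below are invented examples, not from the question:

```python
from transformers import pipeline

# Candidate topic labels and the feedback text are invented examples.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

feedback = "The detour added twenty minutes and took me way out of my way."
labels = ["routing quality", "detour time", "app usability", "pricing"]

result = classifier(feedback, candidate_labels=labels)
print(list(zip(result["labels"], result["scores"])))  # labels sorted by score
```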
The VADER tool works well for sentiment analysis and NLP applications built on it.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision, as spaCy's rule-based match engines and components not only help you find the terms and sentences you are searching for, but also give you access to the tokens inside a text and their relationships, compared with regular expressions.
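One common way to detect frequent bi- and tri-grams at the corpus level (not mentioned above, but it slots into steps 3-4 of the proposed pipeline) is gensim's Phrases model; the corpus below is a tiny stand-in for the tokenized feedback:

```python
from gensim.models.phrases import Phrases, Phraser

# Stand-in corpus: one list of lowercased tokens per feedback record.
corpus = [
    ["the", "hov", "lane", "exit", "was", "blocked"],
    ["the", "detour", "time", "was", "way", "too", "long"],
    # ... the ~138k tokenized feedback strings ...
]

# The first pass learns frequent bigrams (e.g. "hov_lane"); applying it and
# training again promotes frequent bigram+unigram pairs into trigrams.
bigram = Phraser(Phrases(corpus, min_count=5, threshold=10))
trigram = Phraser(Phrases(bigram[corpus], min_count=5, threshold=10))

retokenized = [trigram[bigram[sent]] for sent in corpus]
```

The retokenized corpus, split by rating, can then be fed into whatever topic analysis you choose.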

Perform Named Entity Recognition - NLP

I am trying to learn how to perform Named Entity Recognition.
I have a set of discharge summaries containing medical information about patients. I converted my unstructured data into structured data. Now, I have a DataFrame that looks like this:
| Text | Target |
| --- | --- |
| normal coronary arteries... | R060 |
The Text column contains information about the diagnosis of a patient, and the Target column contains the code that will need to be predicted in a further task.
I have also constructed a dictionary that looks like this:
| Code (Key) | Term (Value) |
| --- | --- |
| A00 | Cholera |
This dictionary maps each diagnosis to its corresponding code. The Term column will be used to identify the clinical entities in the corpus.
I will need to train a classifier and predict the code in order to automate the process of assigning codes to the discharge summaries (I am explaining this so you have an idea of the task I'm performing).
So far I have converted my data into a structured format. I am trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure what the preliminary steps are. Should I tokenize, stem, and lemmatize first? Or should I first identify the medical terminology, since clinical named entities are often multi-token terms with nested structures that include other named entities inside them? Also, what packages or tools would you recommend for this in Python?
I am new to this field, so any help will be appreciated! Thanks!
If you are asking about building a classification model, then you should go for deep learning; it is highly effective for classification.
When dealing with this type of language-processing task, I recommend that you first tokenize your text and pad the sequences. Basic tokenization should be enough, but you can do more preprocessing, such as basic string processing, because proper preprocessing can improve your model accuracy by up to 3% or 4%. For basic string processing, you can use regex (the built-in re package) in Python.
https://docs.python.org/3/library/re.html
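As a sketch of the tokenize-and-pad step (using the Keras utilities, consistent with the TensorFlow recommendation below; the texts are invented stand-ins for the discharge summaries):

```python
import tensorflow as tf

# Toy examples standing in for the discharge-summary Text column.
texts = ["normal coronary arteries", "patient admitted with suspected cholera"]

# Basic tokenization + padding, as suggested above.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=20000, oov_token="<unk>")
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=50, padding="post")
print(padded.shape)  # (2, 50)
```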
I think you are doing the mapping after preprocessing. Mapping should be enough for tasks like classification, but I recommend that you learn about word embeddings; they will improve your model.
For all these tasks, I recommend TensorFlow. TensorFlow is a popular tool for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation; all the learning material is in the TensorFlow tutorials section.
https://www.tensorflow.org/tutorials/
I think this will help you. All the best with your work!
Thank you.

What's the ideal way to include dictionaries (gazetteer) in spaCy to improve NER?

I'm currently working on replacing a system based on nltk entity extraction combined with regexp matching where I have several named entity dictionaries. The dictionary entities are both of common type (PERSON (employees) etc.) as well as custom types (e.g. SKILL). I want to use the pre-trained spaCy model and include my dictionaries somehow, to increase the NER accuracy. Here are my thoughts on possible methods:
Use spaCy's Matcher API, iterate through the dictionary and add each phrase with a callback to add the entity?
I've just found spacy-lookup, which seems like an easy way to provide long lists of words/phrases to match.
But what if I want to have fuzzy matching? Is there a way to add directly to the Vocab and thus have some fuzzy matching through Bloom filter / n-gram word vectors, or is there some extension out there that suits this need? Otherwise I guess I could copy spacy-lookup and replace the flashtext machinery with something else, e.g. Levenshtein distance.
While playing around with spaCy I did try just training the NER directly with a single word from the dictionary (without any sentence context), and this did "work". But I would, of course, have to take much care to keep the model from forgetting everything.
Any help appreciated, I feel like this must be a pretty common requirement and would love to hear what's working best for people out there.
I would recommend looking at spaCy's Entity Ruler. If you convert your existing dictionary into the schema for matching, you can add rules for each of your entities and new types.
This is quite powerful because you can combine it with the existing statistical NER available in a standard spacy model to achieve some of the "fuzzy matching" you mention. From the docs:
The entity ruler is designed to integrate with spaCy’s existing statistical models and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.
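A minimal sketch of that setup, assuming spaCy v3 (in v2 the ruler is constructed as EntityRuler(nlp) and added with nlp.add_pipe); the gazetteer entries below are invented examples of the pattern schema:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline with a statistical NER

# spaCy v3 style: add the ruler before "ner" so the statistical model respects its spans.
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Hypothetical gazetteer converted into the EntityRuler pattern schema.
patterns = [
    {"label": "SKILL", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]},
    {"label": "SKILL", "pattern": "Kubernetes"},
    {"label": "PERSON", "pattern": "Jane Doe"},
]
ruler.add_patterns(patterns)

doc = nlp("Jane Doe has five years of machine learning and Kubernetes experience.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Converting your existing dictionaries into this pattern list is mostly a mechanical transformation.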
I use the Matcher with dynamically generated callbacks. I think it works well.
I got curious why the Matcher doesn't support fuzzy matching, and found this comment by the author of spaCy on a closed issue.
You really want to precompute the search sets, rather than do them on-the-fly in the matcher. Once you've precomputed the similarity values, you can use extension attributes and a >= comparison in the Matcher to perform the search.
I think this is a case where the implementation details strongly matter, and an API that obscures them would actually be a disservice.
I think this is a good point, and it tells you how to build what you want.
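Here is a rough sketch of that advice with my own wiring: precompute a fuzzy comparison against a (hypothetical) gazetteer in a custom extension attribute, then let the Matcher do a plain equality check on it. difflib stands in for a proper Levenshtein library:

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher
from difflib import SequenceMatcher  # stand-in for a real Levenshtein implementation

gazetteer = {"kubernetes", "tensorflow"}  # hypothetical SKILL dictionary

def close_to_gazetteer(token, threshold=0.85):
    # Precomputed fuzzy check: is this token close to any gazetteer entry?
    return any(
        SequenceMatcher(None, token.lower_, entry).ratio() >= threshold
        for entry in gazetteer
    )

# Expose the comparison as a custom extension attribute...
Token.set_extension("fuzzy_skill", getter=close_to_gazetteer, force=True)

# ...and match on it with a plain equality check in the pattern.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("SKILL_FUZZY", [[{"_": {"fuzzy_skill": True}}]])

doc = nlp("Experience with Kubernetess and TensorFlow required.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```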

Improving parsing of unstructured text

I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.
I wrote a script using regular expressions to do this, but over time contingencies arise that I have to account for, which keeps the regexp method from being a long-term solution. I have been reading up on NLTK, and it seems there are two ways to go about using NLTK to solve my problem:
1. Chunk the announcements using RegexpParser expressions; this might be a weak solution if two different fields I want to capture have the same sentence structure.
2. Take n announcements, tokenize them, run them through the POS tagger, manually tag the parts of the announcements I want to capture using the IOB format, and then use those tagged announcements to train an NER model. A method discussed here.
Before I go about manually tagging announcements, I want to gauge:
1. whether option 2 is a reasonable solution;
2. whether there are existing tagged corpora that might be useful for training my model;
3. knowing that accuracy improves with training data size, how many manually tagged announcements I should start with.
Here's an example of how I am building the training set. If there are any apparent flaws please let me know.
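For what it's worth, here is roughly what one IOB-tagged training sentence (option 2) might look like; the announcement, amount, and entity labels are invented for illustration:

```python
from nltk.chunk import conlltags2tree

# One hand-tagged announcement-style sentence in IOB format:
# (token, POS tag, IOB chunk tag). Company, amount, and labels are invented.
tagged_announcement = [
    ("Acme",      "NNP", "B-ORG"),
    ("Corp.",     "NNP", "I-ORG"),
    ("has",       "VBZ", "O"),
    ("been",      "VBN", "O"),
    ("awarded",   "VBN", "O"),
    ("$",         "$",   "B-AMOUNT"),
    ("7,500,000", "CD",  "I-AMOUNT"),
    ("for",       "IN",  "O"),
    ("runway",    "NN",  "O"),
    ("repairs",   "NNS", "O"),
    (".",         ".",   "O"),
]

# NLTK can turn such triples into a chunk tree for inspection or for feeding
# into a chunker/NER trainer that expects Tree objects.
print(conlltags2tree(tagged_announcement))
```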
Trying to get company names and project descriptions using just POS tags will be a headache. Definitely go the NER route.
spaCy has a default English NER model that can recognize organizations; it may or may not work for you, but it's worth a shot.
What sort of output do you expect for "the description of the project awarded"? Typically NER would find items several tokens long, but I could imagine a description being several sentences.
For tagging, note that you don't have to work with text files. Brat is an open-source tool for visually tagging text.
How many examples you need depends on your input, but think of about a hundred as the absolute minimum and build up from there.
Hope that helps!
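As a quick sanity check on that second point, you could run spaCy's pretrained English pipeline over a few announcements and eyeball what it already finds; the sentence below is an invented announcement-style example, and en_core_web_sm needs to be downloaded first:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline with a default NER

# Invented announcement-style sentence, not from the linked data.
doc = nlp("Acme Corp., El Segundo, California, has been awarded $7,500,000 for runway repairs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Typically yields ORG, GPE, and MONEY spans you can compare against your regex output.
```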
Regarding the project descriptions, thanks to your example I now have a better idea. It looks like the language in the first sentence of the grants is pretty regular in how it introduces the project description: XYZ Corp has been awarded $XXX for [description here].
I have never seen typical NER methods used for arbitrary phrases like that. If you've already got labels there's no harm in trying and seeing how prediction goes, but if you have issues there is another way.
Given the regularity of the language, a parser might be effective here. You can try out the Stanford Parser online here. Using the output of that (a "parse tree"), you can pull out the VP where the verb is "award", then pull out the PP under that where the IN is "for", and that should be what you're looking for. (The capital letters are Penn Treebank tags; VP means "verb phrase", PP means "prepositional phrase", and IN means "preposition".)
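To make the parse-tree extraction concrete, here is a sketch that walks a Penn Treebank-style tree with NLTK; the bracketed parse is a hand-written approximation of what a parser might output, not actual Stanford Parser output:

```python
from nltk.tree import Tree

# Hand-written approximation of a constituency parse for an announcement-style sentence.
parse = Tree.fromstring("""
(S
  (NP (NNP Acme) (NNP Corp.))
  (VP (VBZ has)
    (VP (VBN been)
      (VP (VBN awarded)
        (NP ($ $) (CD 7,500,000))
        (PP (IN for)
          (NP (NN runway) (NNS repairs)))))))
""")

def awarded_for(tree):
    """Return the words of the 'for ...' PP inside the VP whose verb is 'awarded'."""
    for vp in tree.subtrees(lambda t: t.label() == "VP"):
        verbs = [child[0] for child in vp
                 if isinstance(child, Tree) and child.label().startswith("VB")]
        if "awarded" not in verbs:
            continue
        for child in vp:
            if isinstance(child, Tree) and child.label() == "PP" and child[0][0] == "for":
                return " ".join(child.leaves()[1:])  # drop the preposition itself
    return None

print(awarded_for(parse))  # -> "runway repairs"
```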
