I am an NLP novice trying to learn, and I would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular Python libraries such as spaCy.
I understand the basic concept behind it, but I suspect I am missing some details.
From the documentation, it is not clear to me, for example, how much preprocessing is done on the text and annotation data, or what statistical model is used.
Do you know:
Does the text have to go through chunking before the model is trained in order for NER to work? Would it be unable to do anything useful otherwise?
Are the text and annotations typically normalized prior to training, so that a named entity is still recognized whether it appears at the beginning or in the middle of a sentence?
Specifically in spaCy, how are things implemented concretely? Is it an HMM, a CRF, or something else that is used to build the model?
Apologies if this is all trivial; I am having some trouble finding easy-to-read documentation on NER implementations.
At https://spacy.io/models/en#en_core_web_md they say "English multi-task CNN trained on OntoNotes", so I imagine that is how they obtain the named entities. You can see that the pipeline is
tagger, parser, ner
and read more here: https://spacy.io/usage/processing-pipelines. I would try removing the different components and seeing what happens; this way you can see what depends on what. I'm pretty sure NER depends on the tagger, but I'm not sure whether it requires the parser. All of them, of course, require the tokenizer.
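A minimal sketch of that experiment, assuming en_core_web_sm is installed (python -m spacy download en_core_web_sm): load the model with some components disabled and check whether NER still produces entities.

import spacy

# Load the model without the tagger and parser; only the tokenizer
# and the NER component remain active.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])

If entities still come out, NER does not depend on the components you disabled.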
I don't understand your second point. If an entity is at the beginning or in the middle of a sentence, that is just fine; the NER system should be able to catch it either way. I don't see how the word "normalize" applies to an entity's position in the text.
Regarding the model, they mention a multi-task CNN, so I guess the CNN is the model for NER. Sometimes people use a CRF on top, but they don't mention it, so it is probably just the CNN. According to their performance figures, it is good enough.
My spaCy version is 2.3.7. I have an existing trained custom NER model with NER and Entity Ruler pipes.
I want to update and retrain this existing pipeline.
The code to create the entity ruler pipe was as follows:
from spacy.pipeline import EntityRuler

ruler = EntityRuler(nlp)
for i in patt_dict:
    ruler.add_patterns([i])  # add_patterns expects a list of pattern dicts
nlp.add_pipe(ruler, name="entity_ruler")
Here, patt_dict is the original collection of pattern dicts I had made.
Now that training is finished, I have more input data and want to train the model further with it.
How can I modify the above code to add more patterns to the entity ruler when I later load the spaCy model and want to retrain it with more input data?
It is generally better to retrain from scratch. If you train only on new data you are likely to run into "catastrophic forgetting", where the model forgets anything not in the new data.
This is covered in detail in this spaCy blog post. As of v3 the approach outlined there is available in spaCy, but it's still experimental and needs some work. In any case, it's still kind of a workaround, and the best thing is to train from scratch with all data.
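For the Entity Ruler part specifically, note that its patterns are rules rather than learned weights, so forgetting does not apply to them: you can load the saved pipeline, fetch the ruler, and add more patterns. A sketch for spaCy 2.x, assuming your pipeline is saved in my_model and new_patterns is a list of pattern dicts like {"label": "ORG", "pattern": "Acme Corp"} (both names are placeholders):

import spacy

nlp = spacy.load("my_model")
ruler = nlp.get_pipe("entity_ruler")  # the name used in add_pipe
ruler.add_patterns(new_patterns)
nlp.to_disk("my_model_updated")

The statistical NER component is the part that needs retraining on the full data.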
I'd also recommend polm23's suggestion to retrain fully in this situation.
Here is why: we are asking the model to produce inferences based on weights derived from matching input data to labels/classes over and over. These weights are adjusted via backpropagation to reduce the error gradient with respect to the labels/classes. When the weights, given the data, produce errors as close to 0 as possible, the loss eventually reaches an equilibrium, or you simply stop training via a hyperparameter (the number of epochs).
However, by training only on the new data, you will optimize for that specific data alone. The model will generalize poorly, but really only because it is learning exactly what you asked it to learn and nothing else. Given that retraining fully is usually not the end of the world, it simply makes sense as a best practice.
(This is my imperfect understanding of the catastrophic forgetting issue; happy to learn more if others have deeper knowledge.)
I've trained a Doc2Vec model to do a simple binary classification task, but I would also love to see which words or sentences weigh more in terms of contributing to the meaning of a given text. So far I have had no luck finding anything relevant or helpful. Any ideas on how I could implement this feature? Should I switch from Doc2Vec to more conventional methods like tf-idf?
You are asking about model interpretability. Some ways I have seen this explored:
Depending on your classifier, the parameters of the model may tell you what it is looking at. For example, in attention-based models, what the model attends to is telling.
Tools like LIME and Anchor are useful for any black-box model and will probably work in this case. The documentation for both shows how to use them with text data.
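As a concrete illustration, here is a minimal LIME sketch. It assumes d2v_model is your trained gensim Doc2Vec model and clf is a scikit-learn classifier already fitted on inferred vectors; both names are placeholders for your own objects.

import numpy as np
from lime.lime_text import LimeTextExplainer

# Wrap the Doc2Vec + classifier pipeline so LIME can query it:
# take raw texts, infer document vectors, return class probabilities.
def predict_proba(texts):
    vecs = np.array([d2v_model.infer_vector(t.split()) for t in texts])
    return clf.predict_proba(vecs)

explainer = LimeTextExplainer(class_names=["class 0", "class 1"])
exp = explainer.explain_instance("some document text", predict_proba, num_features=10)
print(exp.as_list())  # (word, weight) pairs: which words pushed the prediction

LIME perturbs the input text (dropping words) and fits a local linear model, so it works even though the Doc2Vec vectors themselves are not interpretable.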
I am trying to build a subject extractor: simply put, read all the sentences of a paragraph and make a calculated guess as to what the subject of the paragraph/article/document is. I might even upgrade it to a summarizer depending on the progress I make.
There is a great deal of information on the internet. It is difficult to understand all of it and select a correct path, as I am not well versed in NLP.
I was hoping someone with some experience could point me in the right direction.
I am NOT looking for a linguistic computation model, but rather an n-gram or neural network approach, something that has been done recently.
I am also looking into coreference resolution using n-grams; if anyone has any leads on that, it is much appreciated. I am slightly familiar with the Stanford coreference resolver, but I don't want to use it as-is.
Any information, ideas and opinions are welcome.
@Dagger,
For finding the 'topic' of the whole document, there are several approaches you can try and research. The unsupervised approaches will be faster and will get you started, but they may not differentiate between closely related documents that have similar topics. They also don't require a neural network. The supervised techniques will recognise differences between similar documents better, but they require training networks. You should be able to easily find blogs about implementing these in your desired programming language.
Unsupervised
K-Means clustering using TF-IDF on text words - see intro here, and the sketch after this list
Latent Dirichlet Allocation
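A minimal sketch of the K-Means route, using scikit-learn and assuming docs is a list of raw text strings (a placeholder for your own corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Vectorize documents with TF-IDF, cluster them, then read the
# top-weighted terms of each cluster centroid as a rough topic label.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=5, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:10]
    print(f"cluster {i}:", [terms[j] for j in top])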
Supervised
Text classification models using SVMs, logistic regression and neural nets (a sketch follows below)
LSTM/RNN models using neural nets
The neural net models will require training first on a set of known documents with associated topics. They are best suited to picking the ONE most likely topic from their model, but multi-label topic implementations are possible.
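To make the supervised route concrete, here is a minimal sketch with scikit-learn, assuming docs is a list of texts and topics a parallel list of known topic labels (both placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF features feeding a logistic regression topic classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(docs, topics)
print(model.predict(["some new article text"]))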
If you post example data and/or domain along with programming language, I can give some more specifics for you to explore.
I use the TextBlob library in Python, and the Naive Bayes classifier from TextBlob. I have learned that it uses NLTK's Naive Bayes classifier. Here is the question: my input sentences are non-English (Turkish). Will this work? I don't know how it works, but I tried 10 training examples and it seems to work. I wonder how this Naive Bayes classifier of NLTK behaves on non-English data. What are the disadvantages?
Although a classifier trained for English is unlikely to work on other languages, it sounds like you are using textblob to train a classifier for your text domain. Nothing rules out using data from another language, so the real question is whether you are getting acceptable performance. The first thing you should do is test your classifier on a few hundred new sentences (not the ones you trained it on!). If you're happy, that's the end of the story. If not, read on.
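A quick way to run that check with textblob, assuming cl is your trained classifier and test is a list of (sentence, label) pairs it has never seen (both placeholders):

# Fraction of held-out sentences labeled correctly.
print(cl.accuracy(test))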
What makes or breaks any classifier is the selection of features it is trained with. NLTK's classifiers require a "feature extraction" function that converts a sentence into a dictionary of features. According to its tutorial, textblob provides a "bag of words" feature function by default. Presumably that's the one you're using, but you can easily plug in your own feature function.
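A minimal sketch of plugging in your own feature function, with a hypothetical turkish_features extractor (here it only lowercases and splits; you could add Turkish-specific stemming or stopword removal instead):

from textblob.classifiers import NaiveBayesClassifier

def turkish_features(document):
    # One boolean feature per token; replace with stemmed/filtered tokens.
    return {"contains({})".format(w): True for w in document.lower().split()}

train = [
    ("bu film harika", "pos"),   # "this movie is great"
    ("berbat bir film", "neg"),  # "a terrible movie"
]
cl = NaiveBayesClassifier(train, feature_extractor=turkish_features)
print(cl.classify("harika bir film"))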
This is where language-specific resources come in: many classifiers use a "stopword list" to discard common words like "and" and "the". Obviously, this list must be language-specific. And as @JustinBarber wrote in a comment, languages with rich morphology (like Turkish) have many more word forms, which may limit the effectiveness of word-based classification. You may see improvement if you "stem" or lemmatize your words; both procedures transform different inflected word forms into a common form.
Going further afield: you didn't say what your classifier is for, but it's possible that you could write a custom recognizer for certain text properties and plug them in as features. E.g., if you're doing sentiment analysis, some languages (including English) have grammatical constructions that indicate high emotion.
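For instance, a hypothetical extractor that flags a few surface cues of high emotion, which could be merged into the feature dictionary above:

def emotion_features(document):
    # Simple language-agnostic cues; refine with Turkish-specific ones.
    return {
        "has_exclamation": "!" in document,
        "has_all_caps_word": any(w.isupper() and len(w) > 2 for w in document.split()),
    }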
For more, read a few chapters of the NLTK book, especially the chapter on classification.
I am trying to summarize text documents that belong to the legal domain.
I am referring to the site deeplearning.net for how to implement deep learning architectures. I have read quite a few research papers on document summarization (both single-document and multi-document), but I am unable to figure out how exactly the summary is generated for each document.
Once training is done, the network's weights are fixed for the testing phase. So even if I know the set of features (which I have figured out) that are learnt during the training phase, it would be difficult to find out the importance of each feature (because the weight vector of the network is fixed) during the testing phase, where I will be trying to generate a summary for each document.
I have tried to figure this out for a long time, but in vain.
If anybody has worked on this or has any ideas about it, please give me some pointers. I really appreciate your help. Thank you.
I think you need to be a little more specific. When you say "I am unable to figure out how exactly the summary is generated for each document", do you mean that you don't know how to interpret the learned features, or that you don't understand the algorithm? Also, "deep learning techniques" covers a very broad range of models - which one are you actually trying to use?
In the general case, deep learning models do not learn features that are humanly interpretable (although you can of course try to look for correlations between the given inputs and the corresponding activations in the model). So if that's what you're asking, there really is no good answer. If you're having difficulty understanding the model you're using, I can probably help you :-) Let me know.
This is a blog series that explains in detail, from the very beginning, how text summarization works. Recent research uses seq2seq deep-learning-based models; the series starts by explaining this architecture and works up to the newest research approaches.
Also, this repo collects multiple implementations for building a text summarization model. It runs these models on Google Colab and hosts the data on Google Drive, so no matter how powerful your computer is, you can use Google Colab (a free service) to train your deep models.
If you would like to see text summarization in action, you can use this free API.
I truly hope this helps.