I have a dataframe with a column called "titleFinal". Each value is the title of a table in a PDF. My current goal is to pull entities from the table titles and then use them later for analysis. I plan to train a model on my data (by creating a list of entities), but so far I've used the base NER model. Unfortunately, when I look at what doc.ents extracts, it doesn't seem consistent at all, and I'm not sure whether I did something wrong or the model is simply extracting entities poorly.
I started with the small model and saw a noticeable improvement when I switched to the large model: there aren't as many inconsistencies, but they are still there. For example:
Table 21-9 Cumulative Effects Initial Screening -> [(21)]
Table 21-18 Cumulative Effects Initial Screening – Human Occupancy and Resources -> [(21), (Cumulative, Effects, Initial, Screening, –, Human, Occupancy, and, Resources)]
These inconsistencies happen quite frequently throughout the list of entities so I'm wondering what I can do to resolve this. Is this expected?
Unfortunately, I can't share the dataset yet, but here's the code I'm currently using:
nlp = spacy.load("en_core_web_lg")  # large model for production

tokens = []
lemma = []
ents = []

for doc in nlp.pipe(df['titleFinal'].astype('unicode').values, batch_size=50,
                    n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        ents.append([e for e in doc.ents])
    else:
        # We want to make sure that the lists of parsed results have the same
        # number of entries as the original DataFrame, so add some blanks in
        # case the parse fails
        tokens.append(None)
        lemma.append(None)
        ents.append(None)

df['titleFinal_tokens'] = tokens
df['titleFinal_lemma'] = lemma
df['titleFinal_ents'] = ents
Is my approach wrong here?
Info
spaCy version: 2.2.4
Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
Models: en
Related
I'm a spaCy beginner working through samples for learning purposes, and I followed an article on how to create an address parser using spaCy.
My tutorial dataset is as follows
and it runs perfectly.
Then I created my own dataset, which contains addresses in Denmark,
but when I run the training command there is an error:
ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing, or outside.
According to questions asked on StackOverflow and other platforms, the reason for the error is duplicate words in a span:
[18, Mbl Denmark A/S, Glarmestervej, 8600, Silkeborg, Denmark]
Recipient contains the word "Denmark" and Country also contains the word "Denmark".
Can anyone suggest a solution to fix this?
Code for creating the DocBin object for building the training/test data:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumed: the blank English pipeline used in the tutorial
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)  # construct a Doc object
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
In general, entities can't be nested or overlapping, and if you have data like that you have to decide what kind of output you want.
If you actually want nested or overlapping annotations, you can use the spancat, which supports that.
In this case though, "Denmark" in "Mbl Denmark" is not really interesting and you probably don't want to annotate it. I would recommend you use filter_spans on your list of spans before assigning it to the Doc. filter_spans will take the longest (or first) span of any overlapping spans, resulting in a list of non-overlapping spans, which you can use for normal entity annotations.
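For example, in the DocBin loop from the question, the fix could look roughly like this (a sketch assuming annotations is the list of (start, end, label) tuples from that code; char_span can also return None for misaligned offsets, so those are dropped too):
from spacy.util import filter_spans

for text, annotations in training_data:
    doc = nlp(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in annotations]
    # keep only valid, non-overlapping spans; filter_spans prefers the longest span
    doc.ents = filter_spans([s for s in spans if s is not None])
    db.add(doc)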
INTRO
My question starts a bit far from the title, but in essence it sums up well what I'm stuck on at the moment.
I need to integrate a spaCy NER model into a complex, distributed NLP pipeline, and what I'm doing to achieve that is:
Train a new NER model based on the en_core_web_lg model so that it also recognizes my custom entities in the NER task
Save the model without the vocabulary, to reduce disk space and memory usage
Finally, load the model to run inference, using tokens and vectors that were pre-computed earlier in my pipeline, instead of computing them again with the model vocabulary (the standard way).
The reason I am saving the model without the vocab is that in my distributed pipeline one of the first steps is to tokenize/vectorize the text, so that the rest of the tasks already have this input.
→ Before continuing, I want to clarify that in the standard way (saving the vocab) I could train my custom NER, save/load it, and run inference without major problems and with very good accuracy.
After that, and after reading the spaCy documentation, I found that it is possible to save my model without the vocabulary, and that you can even build the Doc from a list of tokens and a custom vocabulary (in my case an empty vocab). I was also able to set the document vectors using those that someone previously calculated for me in my pipeline.
However, when I save the model, the ner/cfg file contains a reference to the vectors on which the NER model was trained (en_core_web_lg.vectors):
{
  "disable": [
    "tagger",
    "parser"
  ],
  "beam_width": 1,
  "beam_density": 0.0,
  "beam_update_prob": 1.0,
  "cnn_maxout_pieces": 3,
  "nr_feature_tokens": 6,
  "deprecation_fixes": {
    "vectors_name": "en_core_web_lg.vectors"
  },
  "nr_class": 86,
  "hidden_depth": 1,
  "token_vector_width": 96,
  "hidden_width": 64,
  "maxout_pieces": 2,
  "pretrained_vectors": "en_core_web_lg.vectors",
  "bilstm_depth": 0,
  "self_attn_depth": 0,
  "conv_depth": 4,
  "conv_window": 1,
  "embed_size": 2000
}
This reference is the cause of an error when I try to load the model without having those vectors in memory (in other words, without having that vocab loaded).
If I delete those references in the cfg file, the model loads correctly and I can run inference using my own vectors, but the predictions are very different from the ones I obtained with my first model (with the original vocab) and contain several errors.
QUESTION
This brings me to my original question: is it possible to save the NER model with an empty vocab and then run inference with a spaCy Doc built somehow from the tokens and vectors previously calculated in my pipeline?
Thanks a lot in advance!
BTW I'm using spaCy 2.3.7 and I put below some snippets of my code to clarify:
1. Training:
nlp = spacy.load("en_core_web_lg", disable=['tagger', 'parser'])

ner = nlp.get_pipe('ner')
ner.add_label("FOO_ENTITY")
ner.add_label("BAR_ENTITY")
ner.add_label("COOL_ENTITY")

# Start the training
optimizer = nlp.begin_training()

# Loop for EPOCHS iterations
losses_hist = []
for itn in range(30):
    # Shuffle the training data
    random.shuffle(Xy_train)
    losses = {}
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(Xy_train, size=32):
        texts = [text for text, entities, _ in batch]
        golds = [{'entities': entities} for text, entities, _ in batch]
        # Update the model
        nlp.update(docs=texts, golds=golds, losses=losses)
    print(losses)
    losses_hist.append(losses)
2.a Run inference (standard):
# I already have the text split in tokens
doc = Doc(nlp.vocab, words=tokens)  # Create doc from tokens
ner = nlp.get_pipe("ner")
doc = ner(doc)  # Call NER step for doc

for ent in doc.ents:
    print(f"value: {ent.text}, start: {ent.start_char}, end: {ent.end_char}, entity: {ent.label_}")
2.b Run inference with external vectors
# I already have the text split in tokens and their vectors
vectors = Vectors(data=embeddings, keys=tokens)
nlp.vocab.vectors = vectors

doc = Doc(nlp.vocab, words=tokens)
ner = nlp.get_pipe("ner")
doc = ner(doc)  # Call NER step for doc

for ent in doc.ents:
    print(f"value: {ent.text}, start: {ent.start_char}, end: {ent.end_char}, entity: {ent.label_}")
3. Save / Load model:
# Save model
nlp.to_disk(str(dir_))
# Load model
nlp = spacy.load(str(dir_), exclude=['vocab'])
In spacy v2 (not v3!) there are some hidden background steps that register the vectors globally under a particular name for use as features in the statistical models. (The idea behind this is that multiple models in the same process can potentially share the same vectors in RAM.)
To get a subset of vectors to work for a particular text, you need to register the new vectors under the right name:
use the same Vectors(name=) as in the model metadata when creating the vectors (will be something like en_core_web_lg.vectors)
run spacy._ml.link_vectors_to_models(vocab)
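A minimal sketch of those two steps, assuming embeddings and tokens are the vectors and token strings pre-computed upstream in your pipeline:
import spacy
from spacy.vectors import Vectors

# Register the replacement vectors under the same name the NER model was
# trained with (see "pretrained_vectors" in ner/cfg).
vectors = Vectors(data=embeddings, keys=tokens, name="en_core_web_lg.vectors")
nlp.vocab.vectors = vectors
spacy._ml.link_vectors_to_models(nlp.vocab)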
I'm pretty sure that this will start printing warnings and renaming the vectors internally based on the data shape if you do it repeatedly for different sets of vectors with the same name. I think you can ignore the warnings and it will work for that individual text, but it may break any other models loaded in the same script that are using that same vectors name/shape.
If you are doing this a lot in practice, you might want to write a custom version of link_vectors_to_models that iterates over the words in the vocab more efficiently for very small vector tables, or only modifies the words in the vocab that you know that you need. It really depends on the size of the vocab at the point where you're running link_vectors_to_models.
After several tests I discovered that the following way of reloading the vectors for my model's vocab in memory works correctly for me. This answer served as inspiration, but it didn't work for spaCy 2.1.4.
The interesting points are:
I had to make a deep copy of the vocab before updating it
The vectors must have exactly the same name as the vectors the NER model was trained on (en_core_web_lg.vectors in my case)
It is necessary to update the vectors in memory via thinc.extra.load_nlp.VECTORS, as follows:
import copy
import spacy
import thinc.extra.load_nlp
from spacy.tokens import Doc
from spacy.vectors import Vectors
from spacy.vocab import Vocab
from thinc.v2v import Model

vocab = copy.deepcopy(nlp.vocab)
vectors = Vectors(data=embeddings, keys=tokens, name='en_core_web_lg.vectors')
ops = Model.ops
thinc.extra.load_nlp.VECTORS[(ops.device, 'en_core_web_lg.vectors')] = vectors.data  # Must re-load vectors
vocab.vectors = vectors

doc = Doc(vocab, words=tokens)  # Create doc from tokens and a custom vocab
ner = nlp.get_pipe("ner")
doc = ner(doc)  # Call NER step for doc
I hope it can be of use to other users!
I have a df with a column that contains comments from which I want to extract the organisations. This article provides a great approach but it is too slow for my problem. The df I am using has over 1,000,000 rows and I am using a Google Colab notebook.
Currently my approach is (from the linked article):
def get_orgs(text):
    # process the text with our SpaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list

df['organizations'] = df['body'].apply(get_orgs)
Is there a faster way to process this? And would you advise applying it to a pandas df, or are there better/faster alternatives?
There are a couple of things you can do in general to speed up spaCy. There's a section in the docs on this.
The first thing to try is creating docs in a pipe. You'll need to be a little creative to get this working with a dataframe:
org_lists = []

for doc in nlp.pipe(iter(df['body'])):
    org_lists.append(...)  # do your processing here

# now you can add a column in your dataframe
The other thing is to disable components you aren't using. Since it looks like you're only using NER you can do this:
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
Those together should give you a significant speedup.
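Put together, a sketch of how that could look for your dataframe (column names taken from your snippet; batch_size=100 is just an assumed starting point to tune):
org_lists = []

for doc in nlp.pipe(df['body'].astype(str), batch_size=100,
                    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # collect deduplicated ORG entities per comment, as in get_orgs()
    org_lists.append(list({ent.text for ent in doc.ents if ent.label_ == 'ORG'}))

df['organizations'] = org_lists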
I understand that we should create Example objects and pass them to the nlp.update() method. According to the example in the docs, we have
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
example = Example.from_dict(doc, {"entities": entity_offsets})
nlp.update([example], sgd=optimizer)
And looking at the source code of the make_doc() method, it seems like we would be just tokenizing the input text and then annotating the tokens.
But the Example object should have the reference/"gold-standard" and the predicted values. How does the information end up in the document when we call nlp.make_doc()?
Additionally, when trying to get the predicted entity tags (using a trained nlp pipeline) back from the Example object, I get no entities (though I could if I had created the object with nlp(text)). And training crashes if I try using nlp(text) instead of nlp.make_doc(text) with
...
>>> spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs()
ValueError()
You can feel free to ask this sort of question on the Github Discussions board as well. Thanks also for taking time to think about this and read some of the code before asking. I wish every question were like this.
Anyway. I think the Example.from_dict() constructor might be getting in the way of understanding how the class works. Does this make things clearer for you?
from spacy.tokens import Doc, Span
from spacy.training import Example
import spacy
nlp = spacy.blank("en")
# Build a reference Doc object, representing the gold standard.
y = Doc(
nlp.vocab,
words=["I", "work", "at", "Berlin!", ".", "It", "'s", "a", "hipster", "bar", "."]
)
# There are other ways we could set up the Doc object, including just passing
# stuff into the constructor. I wanted to show modifying the Doc to set annotations.
ent_start = y.text.index("Berlin!")
assert ent_start != -1
ent_end = ent_start + len("Berlin!")
y.ents = [y.char_span(ent_start, ent_end, label="ORG")]
# Okay, so we have our gold-standard, aka reference aka y, Doc object.
# Now, at runtime we won't necessarily be tokenizing that input text that way.
# It's a weird entity. If we only learn from the gold tokens, we can never learn
# to tag this correctly, no matter how many examples we see, if the predicted tokens
# don't match this tokenization. Because we'll always be learning from "Berlin!" but
# seeing "Berlin", "!" at runtime. We'll have train/test skew. Since spaCy cares how
# it does on actual text, not just on the benchmark (which is usually run with
# gold tokens), we want to train from samples that have the runtime tokenization. So
# the Example object holds a pair (x, y), where the x is the input.
x = nlp.make_doc(y.text)
example = Example(x, y)
# Show the aligned gold-standard NER tags. These should have the entity as B-ORG L-ORG.
print(example.get_aligned_ner())
The other piece of information that might explain this is that the pipeline components try to deal with partial annotations, so that you can have rules which are presetting some entities. This is what's happening when you have a fully annotated Doc as the x --- it's taking those annotations as part of the input, and there's no valid action for the model when it tries to construct the best sequence of actions to learn from. The usability for this situation could be improved.
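One more way to see the (x, y) split, continuing from the code above (just an illustration reusing the same example and variables):
# The Example simply holds the two Docs: `predicted` (x) and `reference` (y).
# The predicted side has no entities yet because nothing has run over it,
# which is why reading entities back from a freshly created Example gives
# nothing; the gold annotations live on the reference side.
print([t.text for t in example.predicted])                    # runtime tokenization, no ents
print([(e.text, e.label_) for e in example.reference.ents])   # gold entities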
Is there any way to add entities to a spaCy Doc object using BERT's offsets? The problem is that my whole pipeline is spaCy-dependent and I am using the latest PubMedBERT, for which spaCy doesn't provide support.
So at times the entity offsets given by PubMedBERT don't result in a valid Span for spaCy, as the tokenization is completely different.
What have I done so far to solve my problem?
I made a custom tokenizer by asking spaCy to split on punctuation, similar to BERT, but there are certain cases where I just can't make a rule. For example:
text = '''assessment
Exdtve age-rel mclr degn, left eye, with actv chrdl neovas
Mar-10-2020
assessment'''
PubMedBERT predicted 13:17 to be an entity, i.e. dtve,
but when I try to add that span as an entity to the spaCy Doc object, char_span returns None because it is not a valid span.
span = doc.char_span(row['start'], row['end'], row['ent'])
doc.ents = list(doc.ents) + [span]
TypeError: object of type 'NoneType' has no len()
Consider row['start'] to be 13, row['end'] to be 17, and row['ent'] to be the label.
How can I solve this problem? Is there any way I can just add entities to a spaCy Doc object using the start and end offsets given by PubMedBERT?
I would really appreciate any help on this. Thank you.
Because spacy stores entities internally as IOB tags on tokens in the doc, you can only add entity spans that correspond to full tokens underneath.
If you're only using this doc to store these entities (not using any other components like a tagger or parser from another model that expect a different tokenizer), you can create a doc with the same tokenization as the BERT model:
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
# bert_tokens = [..., "Ex", "dtve", ...]
words, spaces = spacy.util.get_words_and_spaces(bert_tokens, text)
doc = Doc(nlp.vocab, words=words, spaces=spaces)
Then you should be able to add the entity spans to the document.
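For example, with the offsets from the question (the label name here is just a placeholder):
span = doc.char_span(13, 17, label="ENTITY")  # "dtve", offsets from PubMedBERT
doc.ents = list(doc.ents) + [span]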
If you need the original spacy tokenization + entities based on a different tokenization, then you'll have to adjust the entity character offsets to match the spacy token boundaries in order to add them. Since this can depend a lot on the data/task (if dtve is an entity, is Exdtve also necessarily an entity of the same type?), you probably need a custom solution based on your data. If you're trying to adjust the entity spans to line up with the current tokens, you can see the character start and length for each token with token.idx and len(token).
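If you go that route, here is a rough sketch of the offset adjustment (a hypothetical helper, not spaCy API; it widens the predicted offsets to the nearest spaCy token boundaries, which may or may not be appropriate for your data):
def snap_to_tokens(doc, start, end, label):
    # Widen (start, end) so they fall on token boundaries of the spaCy doc,
    # using token.idx and len(token) as described above.
    new_start, new_end = start, end
    for token in doc:
        tok_start, tok_end = token.idx, token.idx + len(token)
        if tok_start <= start < tok_end:
            new_start = tok_start
        if tok_start < end <= tok_end:
            new_end = tok_end
    return doc.char_span(new_start, new_end, label=label)
With the text from the question, this would turn the 13:17 prediction into a span over Exdtve, which is exactly the kind of decision that depends on your data and task.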