This is my first time asking a question so please let me know if there's more information that you might need.
I have a spacy doc and a list of tags that looks like ['O', 'O', 'PERSON','O','GPE',...] and would like to edit the entity labels in the doc object to match the tags.
I understand that the doc consists of tokens and also entities (Doc.ents), but there seem to be a lot of different components where, if I change one artificially, it might break the integrity of the Doc.
My question is, what would be the best way to go about this? Is there a certain constructor I can use or a method? Can I just change the label directly?
Thanks!
Normally the easiest way to set entities from IOB tags, besides using a file converter, is to use the Doc constructor. But it looks like your tags don't have the IOB part (they should look like B-PERSON, not just PERSON), so that won't work as-is.
You can set the entity label correctly, but you can't set the IOB tag directly, and without updates to the IOB tags the entities won't be recognized correctly.
Since it sounds like these tags are coming from outside spaCy, tokenization is also an issue. Are you sure the tokens will always align with spaCy tokens?
If alignment is not an issue, and you never have the same entity occur twice in a row with nothing in between, I guess you could automatically convert your labels to BIO just by making the first tag of each entity run B- and any following tags of the same run I-. If you do that then you can use the Doc constructor.
Example:
original: O PERSON PERSON O O
cleaned: O B-PERSON I-PERSON O O
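For example, here is a minimal sketch of that conversion, assuming spaCy v3 (where the Doc constructor accepts an ents list of per-token IOB strings) and using made-up words and tags in place of your data:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hypothetical tokens and tags; these must align one-to-one with spaCy tokens
words = ["Alice", "visited", "Berlin", "yesterday"]
tags = ["PERSON", "O", "GPE", "O"]

# First tag of each entity run becomes B-, any following tags of the same run become I-
bio = []
prev = "O"
for tag in tags:
    if tag == "O":
        bio.append("O")
    elif tag == prev:
        bio.append("I-" + tag)
    else:
        bio.append("B-" + tag)
    prev = tag

doc = Doc(nlp.vocab, words=words, ents=bio)
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Alice', 'PERSON'), ('Berlin', 'GPE')]

Note that this simple conversion merges two adjacent entities with the same label into one, which is why the assumption above matters.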
I'm a spaCy beginner working through samples for learning purposes, and I have referred to an article on how to create an address parser using spaCy.
My tutorial dataset is as follows,
which runs perfectly.
Then I created my own dataset, which contains addresses in Denmark,
but when I run the training command, there is an error:
ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing, or outside.
As per questions asked on StackOverflow and other platforms, the reason for the error is duplicate words in a span:
[18, Mbl Denmark A/S, Glarmestervej, 8600, Silkeborg, Denmark]
Recipient contains the word "Denmark" and Country contains the word "Denmark".
Can anyone suggest a solution to fix this?
Code to create the DocBin object for building the training/test data:
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)  # Construct a Doc object
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
In general, entities can't be nested or overlapping, and if you have data like that you have to decide what kind of output you want.
If you actually want nested or overlapping annotations, you can use the spancat, which supports that.
In this case though, "Denmark" in "Mbl Denmark" is not really interesting and you probably don't want to annotate it. I would recommend you use filter_spans on your list of spans before assigning it to the Doc. filter_spans will take the longest (or first) span of any overlapping spans, resulting in a list of non-overlapping spans, which you can use for normal entity annotations.
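For example, a sketch of the loop from the question with filter_spans added (training_data and nlp are assumed to be defined as in your code):

from spacy.tokens import DocBin
from spacy.util import filter_spans

db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in annotations]
    # char_span returns None when the offsets don't line up with token boundaries
    spans = [span for span in spans if span is not None]
    # keep only the longest (or first) of any overlapping spans
    doc.ents = filter_spans(spans)
    db.add(doc)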
I understand that we should create Example objects and pass it to the nlp.update() method. According to the example in the docs, we have
for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    nlp.update([example], sgd=optimizer)
And looking at the source code of the make_doc() method, it seems like we are just tokenizing the input text and then annotating the tokens.
But the Example object should have the reference/"gold-standard" and the predicted values. How does the information end up in the document when we call nlp.make_doc()?
Additionally, when trying to get the predicted entity tags (using a trained nlp pipeline) back from the Example object, I get no entities (though I could if I had created the object with nlp(text)). And training crashes if I try using nlp(text) instead of nlp.make_doc(text) with
...
>>> spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs()
ValueError()
You can feel free to ask this sort of question on the Github Discussions board as well. Thanks also for taking time to think about this and read some of the code before asking. I wish every question were like this.
Anyway. I think the Example.from_dict() constructor might be getting in the way of understanding how the class works. Does this make things clearer for you?
from spacy.tokens import Doc, Span
from spacy.training import Example
import spacy
nlp = spacy.blank("en")
# Build a reference Doc object, representing the gold standard.
y = Doc(
    nlp.vocab,
    words=["I", "work", "at", "Berlin!", ".", "It", "'s", "a", "hipster", "bar", "."]
)
# There are other ways we could set up the Doc object, including just passing
# stuff into the constructor. I wanted to show modifying the Doc to set annotations.
ent_start = y.text.index("Berlin!")
assert ent_start != -1
ent_end = ent_start + len("Berlin!")
y.ents = [y.char_span(ent_start, ent_end, label="ORG")]
# Okay, so we have our gold-standard, aka reference aka y, Doc object.
# Now, at runtime we won't necessarily be tokenizing that input text that way.
# It's a weird entity. If we only learn from the gold tokens, we can never learn
# to tag this correctly, no matter how many examples we see, if the predicted tokens
# don't match this tokenization. Because we'll always be learning from "Berlin!" but
# seeing "Berlin", "!" at runtime. We'll have train/test skew. Since spaCy cares how
# it does on actual text, not just on the benchmark (which is usually run with
# gold tokens), we want to train from samples that have the runtime tokenization. So
# the Example object holds a pair (x, y), where the x is the input.
x = nlp.make_doc(y.text)
example = Example(x, y)
# Show the aligned gold-standard NER tags. These should have the entity as B-ORG L-ORG.
print(example.get_aligned_ner())
The other piece of information that might explain this is that the pipeline components try to deal with partial annotations, so that you can have rules which are presetting some entities. This is what's happening when you have a fully annotated Doc as the x --- it's taking those annotations as part of the input, and there's no valid action for the model when it tries to construct the best sequence of actions to learn from. The usability for this situation could be improved.
Is there any way to add entities to a spaCy Doc object using BERT's offsets? The problem is that my whole pipeline is spaCy dependent, and I am using the latest PubMedBERT, for which spaCy doesn't provide support.
So at times the offsets of entities given by PubMedBERT don't result in a valid Span for spaCy, as the tokenization is completely different.
What have I done so far to solve my problem?
I made a custom tokenizer by asking spaCy to split on punctuation, similar to BERT, but there are certain cases where I just can't make a rule. For example:
text = '''assessment
Exdtve age-rel mclr degn, left eye, with actv chrdl neovas
Mar-10-2020
assessment'''
PubMedBERT predicted 13:17 to be an entity, i.e. dtve,
but when adding the span as an entity to the spaCy Doc object, char_span returns None, as it is not a valid span.
span = doc.char_span(row['start'], row['end'], row['ent'])
doc.ents = list(doc.ents) + [span]
TypeError: object of type 'NoneType' has no len()
Consider row['start'] to be 13, row['end'] to be 17, and row['ent'] to be the label.
How can I solve this problem? Is there any way I can just add entities to a spaCy Doc object using the start and end offsets given by PubMedBERT?
I would really appreciate any help on this. Thank you.
Because spaCy stores entities internally as IOB tags on the tokens in the doc, you can only add entity spans that line up with token boundaries.
If you're only using this doc to store these entities (not using any other components like a tagger or parser from another model that expect a different tokenizer), you can create a doc with the same tokenization as the BERT model:
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
# bert_tokens = [..., "Ex", "dtve", ...]
words, spaces = spacy.util.get_words_and_spaces(bert_tokens, text)
doc = Doc(nlp.vocab, words=words, spaces=spaces)
Then you should be able to add the entity spans to the document.
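For example, continuing the sketch above, the offsets from the question should now line up (the label name here is just a placeholder):

span = doc.char_span(13, 17, label="ENTITY")
doc.ents = list(doc.ents) + [span]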
If you need the original spacy tokenization + entities based on a different tokenization, then you'll have to adjust the entity character offsets to match the spacy token boundaries in order to add them. Since this can depend a lot on the data/task (if dtve is an entity, is Exdtve also necessarily an entity of the same type?), you probably need a custom solution based on your data. If you're trying to adjust the entity spans to line up with the current tokens, you can see the character start and length for each token with token.idx and len(token).
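As a sketch of that last idea, here is one possible "expand to token boundaries" policy built on token.idx and len(token); the function name and the expand-outward choice are illustrative only, and whether expanding is right depends on your data:

def expand_to_token_boundaries(doc, start, end):
    # Widen a character span so it starts and ends on spaCy token boundaries
    for token in doc:
        token_start = token.idx
        token_end = token.idx + len(token)
        if token_start <= start < token_end:
            start = token_start
        if token_start < end <= token_end:
            end = token_end
    return start, end

start, end = expand_to_token_boundaries(doc, 13, 17)
span = doc.char_span(start, end, label="ENTITY")  # with the original tokenization this becomes "Exdtve"

In spaCy v3 you can get similar behavior in one call with doc.char_span(start, end, label=..., alignment_mode="expand").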
Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc = docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname)  # dont process final doc if it exists
for f in docstoworkon:
    doc = docx.Document(f)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)  # generates a long text list
    # finaldoc.styles = doc.styles
    for l in fullText:
        # if l=='u\'\\n\'':
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l:  # check last two characters to see if this is the first question
                finaldoc.add_section()  # only add a page break between questions
        finaldoc.add_paragraph(l)
        # finaldoc.add_page_break
    # finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and they aren't preserved with this method since it just gets the raw text and dumps it. I can't find anything in the python-docx documentation about preserving text styles or importing anything other than raw text. Does anyone know how to go about this?
Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
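For example, a quick way to poke at what a run carries (the input file name here is hypothetical):

import docx

doc = docx.Document('input.docx')  # hypothetical file
run = doc.paragraphs[0].runs[0]
print(run.style.name)  # the named character style applied to the run, if any
print(run.font.bold, run.font.italic, run.font.size)  # direct formatting on the run (None means inherited)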
Now to the long solution:
You should first create a new blank paragraph, then add_run() one by one with the text from your original. For each of these you can define a style attribute, but it would have to be a named style as described in the first link. You cannot apply a style object directly, as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output and then ensure your new run applies the same attributes.
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        r = p.add_run(run.text)
        if run.bold:
            r.bold = True
        if run.italic:
            r.italic = True
        # etc
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
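If you do go that route, a minimal sketch of defining and applying a named character style looks like this (the style name and formatting are made up):

from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.shared import Pt

doc_out = Document()
# Define a named character style once, then reuse it by name on individual runs
style = doc_out.styles.add_style('QuestionText', WD_STYLE_TYPE.CHARACTER)
style.font.bold = True
style.font.size = Pt(12)

p = doc_out.add_paragraph()
p.add_run('Question #1', style='QuestionText')
doc_out.save('styled-output.docx')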
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.
I am using spaCy (a great Python NLP library) to process a number of very large documents, however, my corpus has a number of common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever objective you want to use:
from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)
When processing a Doc, you can now filter it to only get the tokens that are not excluded:
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']
You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
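For example, a sketch of a PhraseMatcher-based pipeline component (spaCy v3 API; the phrases are made up, and this version registers the extension with a writable default instead of a getter so the component can set it per token):

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

nlp = spacy.blank("en")

# Writable extension with a default, so the component below can overwrite it per token
Token.set_extension("is_excluded", default=False, force=True)

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("EXCLUDED", [nlp.make_doc(phrase) for phrase in ["for example", "as well as"]])

@Language.component("mark_excluded")
def mark_excluded(doc):
    for _, start, end in matcher(doc):
        for token in doc[start:end]:
            token._.is_excluded = True
    return doc

nlp.add_pipe("mark_excluded")
doc = nlp("This works as well as the getter approach.")
print([token.text for token in doc if not token._.is_excluded])
# ['This', 'works', 'the', 'getter', 'approach', '.']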
Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
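For instance, a sketch of rebuilding a Doc from only the kept tokens and carrying a couple of token attributes across with Doc.from_array (this only makes sense if those attributes were set by earlier components; the is_excluded criterion is the one from above):

from spacy.attrs import LEMMA, TAG
from spacy.tokens import Doc

keep = [i for i, token in enumerate(doc) if not token._.is_excluded]
words = [doc[i].text for i in keep]
spaces = [bool(doc[i].whitespace_) for i in keep]

new_doc = Doc(doc.vocab, words=words, spaces=spaces)
# Copy the selected per-token attributes (as integer IDs) from the original Doc
attr_ids = [LEMMA, TAG]
new_doc.from_array(attr_ids, doc.to_array(attr_ids)[keep])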