Spacy Coreference resolution with merged spans - python

How can I take merged spans into account in my coreferences using spaCy and coreferee?
Merging the span changes the token indices, but coreferee doesn't update its indices, so it returns the neighboring token rather than the merged span. Note that it does detect the correct span ("big bad wolf").
import spacy
import coreferee

nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')
doc = nlp("the big bad wolf is sweet, he knows how to dance")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:4])
for token in doc:
    print(token.text, token.pos_, token.tag_)

dict_of_coref = doc._.coref_chains
for element in dict_of_coref:
    print(element)
    for ele in element:
        print(doc[ele[0]])
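Since coreferee computed its chains against the pre-merge tokenization, one hedged workaround (an assumption, not part of coreferee's API) is to remap the recorded indices by hand after the merge. A minimal sketch for a single merged span:

```python
def remap_index(i, merge_start, merge_end):
    """Map a pre-merge token index to its post-merge index, assuming
    doc[merge_start:merge_end] was merged into a single token."""
    if i < merge_start:
        return i                 # tokens before the merge are unchanged
    if i < merge_end:
        return merge_start       # tokens inside the span collapse onto it
    return i - (merge_end - merge_start - 1)  # later tokens shift left

# In "the big bad wolf is sweet, he knows how to dance", "he" is token 7
# before merging doc[1:4]; after the merge it sits at index 5.
print(remap_index(7, 1, 4))  # 5
```

With this, `doc[remap_index(ele[0], 1, 4)]` would resolve the stale coreferee index against the merged Doc.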

Related

filter custom spans overlaps in spacy doc

I have a bunch of regexes like this:
(For simplicity the regex patterns here are very easy; in the real case they are very long and barely comprehensible, since they are created automatically by another tool.)
I want to create spans in a doc based on those regex.
This is the code:
import spacy
from spacy.tokens import Doc, Span, Token
import re
rx1 = ["blue","blue print"]
text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."
my_regexes = {'blue':["blue","blue print"],
'red': ["red", "infra red"] }
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)
for name, rxs in my_regexes.items():
    doc.spans[name] = []
    for rx in rxs:
        for i, match in enumerate(re.finditer(rx, doc.text)):
            start, end = match.span()
            # char_span returns a Span, or None if the match doesn't map to a valid token sequence
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is not None:
                span_to_add = Span(doc, span.start, span.end, label=name + str(i))
                doc.spans[name].append(span_to_add)
                print("Found match:", name, start, end, span.text)
It works.
Now I want to filter the spans so that when a series of tokens (for instance "infra red") contains another span ("red"), only the longest one is kept.
I saw this:
How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?
but that looks to be for a Matcher, and I cannot make it work in my case, since I would like to eliminate the shorter Span from the document.
Any idea?
spacy.util.filter_spans will do this. The answer is the same as the linked question, where matcher results are converted to spans in order to filter them with this function.
doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
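For reference, filter_spans keeps the longest span in each overlapping group (preferring the earlier span on ties). Its selection logic can be sketched in plain Python over (start, end) offset pairs, without loading a model:

```python
def filter_spans(spans):
    """Sketch of spaCy's filter_spans logic on (start, end) tuples:
    sort by length (descending), then by start (ascending), and greedily
    keep each span whose positions are not already claimed."""
    sorted_spans = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    seen, result = set(), []
    for start, end in sorted_spans:
        if not any(pos in seen for pos in range(start, end)):
            result.append((start, end))
            seen.update(range(start, end))
    return sorted(result)

# "infra red" overlaps "red": only the longer span survives.
print(filter_spans([(84, 87), (78, 87)]))  # [(78, 87)]
```

The real `spacy.util.filter_spans` works on Span objects and does exactly this kind of greedy longest-first selection.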

How to write code to merge punctuations and phrases using spaCy

What I would like to do
I would like to parse and run dependency analysis using spaCy, one of the open-source libraries for natural language processing.
In particular, I would like to know how to write code in Python for the option to merge punctuation and phrases.
Problem
There are buttons to merge punctuation and phrases on the displaCy Dependency Visualizer web app.
However, I cannot find the way to write these options when it comes to writing code in the local environment.
The current code returns the following not merged version.
The sample sentence is from your dictionary.
Current Code
It is from the sample code on the spaCy official website.
Please let me know how to change it to enable the punctuation and phrase merge options.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
sentence = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(sentence)
displacy.render(doc, style="dep")
What I tried to do
There was one example of a merge implementation.
However, it didn't work when I applied it to my sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Example Code
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
If you need to merge noun chunks, check out the built-in merge_noun_chunks pipeline component. When added to your pipeline using nlp.add_pipe, it will take care of merging the spans automatically.
You can also just use the code from the displaCy Dependency Visualizer:
import spacy

nlp = spacy.load("en_core_web_sm")

def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

def merge_punct(doc):
    spans = []
    for word in doc[:-1]:
        if word.is_punct or not word.nbor(1).is_punct:
            continue
        start = word.i
        end = word.i + 1
        while end < len(doc) and doc[end].is_punct:
            end += 1
        span = doc[start:end]
        spans.append((span, word.tag_, word.lemma_, word.ent_type_))
    with doc.retokenize() as retokenizer:
        for span, tag, lemma, ent_type in spans:
            attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
            retokenizer.merge(span, attrs=attrs)
    return doc

text = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(text)
# Merge noun phrases into one token.
doc = merge_phrases(doc)
# Attach punctuation to tokens.
doc = merge_punct(doc)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Accessing out of range word in spaCy doc : why does it work?

I'm learning spaCy and am playing with Matchers.
I have:
a very basic sentence ("white shepherd dog")
a matcher object, searching for a pattern ("white shepherd")
a print to show the match, and the word and POS before that match
I just wanted to check how to handle the index-out-of-range exception I was expecting to get, because there's nothing before the match. I didn't expect it to work, but it did, and it returns 'dog', which is after the match... and now I'm confused.
It looks like spaCy uses a circular list (or a deque, I think)?
This needs a language model to run, you can install it with the following command line, if you'd like to reproduce it:
python -m spacy download en_core_web_md
And this is the code
import spacy
from spacy.matcher import Matcher

# Load language model
nlp = spacy.load("en_core_web_md")
# Initialise with shared vocab
matcher = Matcher(nlp.vocab)
# Search for "white shepherd"
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}])
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_)  # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on ?
Thanks !
You are looking for a token at index 0 - 1, which evaluates to -1; in Python, index -1 means the last token.
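The same behaviour can be reproduced with any Python sequence, since Doc indexing follows standard Python semantics (this is plain Python, nothing spaCy-specific):

```python
tokens = ["white", "shepherd", "dog"]
start = 0  # the match begins at the first token
# start - 1 == -1, and Python treats -1 as the last element, not an error
print(tokens[start - 1])  # dog
```

An IndexError would only be raised for an out-of-range positive index, or a negative index beyond -len(tokens).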
I recommend using the Token.nbor method to look for the first token before the span, and if no previous token exists make it None or an empty string.
import spacy
from spacy.matcher import Matcher

# Load language model
nlp = spacy.load("en_core_web_md")
# Initialise with shared vocab
matcher = Matcher(nlp.vocab)
# Search for "white shepherd"
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}])
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''
        print("Previous token: None None")

How to filter tokens from spaCy document

I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the sequence of filtered tokens, but I am interested in having an actual Doc structure.
text = u"This document is only an example. " \
       u"I would like to create a custom pipeline that will remove specific tokens from the final document."
doc = nlp(text)

def keep_token(tok):
    # This is only an example rule
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}

final_tokens = list(filter(keep_token, doc))
# How to get a spacy.Doc from final_tokens?
I tried to reconstruct a new spaCy Doc from the tokens lists but the API is not clear how to do it.
I am pretty sure that you have found your solution by now, but because it is not posted here I thought it might be useful to add it.
You can remove tokens by converting doc to numpy array, removing from numpy array and then converting back to doc.
Code:
import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy

def remove_tokens_on_match(doc):
    indexes = []
    for index, token in enumerate(doc):
        if token.pos_ in ('PUNCT', 'NUM', 'SYM'):
            indexes.append(index)
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    np_array = numpy.delete(np_array, indexes, axis=0)
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    return doc2

# Load English model
nlp = spacy.load('en')
doc = nlp(u'This document is only an example. '
          u'I would like to create a custom pipeline that will remove specific tokens from '
          u'the final document.')
print(remove_tokens_on_match(doc))
You can look to a similar question that I answered here.
Depending on what you want to do there are several approaches.
1. Get the original Document
Tokens in SpaCy have references to their document, so you can do this:
original_doc = final_tokens[0].doc
This way you can still get PoS, parse data etc. from the original sentence.
2. Construct a new document without the removed tokens
You can append the strings of all the tokens with whitespace and create a new document. See the token docs for information on text_with_ws.
doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens)))
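The mechanism behind text_with_ws can be illustrated with plain tuples standing in for Token.text and Token.whitespace_ (hypothetical data, no model needed):

```python
# Each pair mimics (Token.text, Token.whitespace_): the token's text plus
# whatever whitespace followed it in the original string.
filtered = [("This", " "), ("document", " "), ("is", " "), ("an", " "), ("example", "")]

# Joining text + trailing whitespace reconstructs a well-spaced string,
# which is exactly what text_with_ws concatenation does.
rebuilt = "".join(text + ws for text, ws in filtered)
print(rebuilt)  # This document is an example
```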
This is probably not going to give you what you want though - PoS tags will not necessarily be the same, and the resulting sentence may not make sense.
If neither of those was what you had in mind, let me know and maybe I can help.

New named entity class in Spacy

I need to train spaCy NER to recognize 2 new classes for named entity recognition; all I have are files with lists of items that are supposed to be in the new classes.
For example: Rolling Stones, Muse, Arctic Monkeys - artists
Any idea how this can be done?
This seems like a perfect use case for Matcher or PhraseMatcher (if you care about performance).
import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches) - 1:
        return None
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='1', label='ARTIST', attrs={},
            specs=[[{spacy.attrs.ORTH: 'Rolling'}, {spacy.attrs.ORTH: 'Stones'}]],
            on_match=merge_phrases)
matcher.add(entity_key='2', label='ARTIST', attrs={},
            specs=[[{spacy.attrs.ORTH: 'Muse'}]],
            on_match=merge_phrases)
matcher.add(entity_key='3', label='ARTIST', attrs={},
            specs=[[{spacy.attrs.ORTH: 'Arctic'}, {spacy.attrs.ORTH: 'Monkeys'}]],
            on_match=merge_phrases)
doc = nlp(u'The Rolling Stones are an English rock band formed in London in 1962. The first settled line-up consisted of Brian Jones, Ian Stewart, Mick Jagger, Keith Richards, Bill Wyman and Charlie Watts')
matcher(doc)
for ent in doc.ents:
    print(ent)
See the documentation for more details. In my experience, with 400k entities in a Matcher it would take almost a second to match each document.
PhraseMatcher is much, much faster, but a bit trickier to use. Note that this is a "strict" matcher: it won't match any entities it hasn't seen before.
