Add known matches to Spacy document with character offsets - python

I would like to run some analysis on documents using different Spacy tools, though I am interested in the Dependency Matcher in particular.
It just so happens that for these documents, I already have the character offsets of some difficult-to-parse entities. A somewhat-contrived example:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
{"offsets":(0,5), "id": "apple"},
{"offsets":(41,54), "id": "san-francisco"}
]
# do something here so that `nlp` knows about those entities
doc = nlp(text)
I've thought about doing something like this:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [{"offsets":(0,5), "id": "apple"}, {"offsets":(41,54), "id": "san-francisco"}]
ruler = nlp.add_pipe("entity_ruler")
patterns = []
for e in already_known_entities:
    patterns.append({
        "label": "GPE",
        "pattern": text[e["offsets"][0]:e["offsets"][1]]
    })
ruler.add_patterns(patterns)
doc = nlp(text)
This technically works, and it's not the worst solution in the world, but I was still wondering if offsets can be added to the nlp object directly. As far as I can tell, the Matcher docs don't show anything like this. I also understand this might be a bit of a departure from typical Matcher behavior, where a pattern can be applied to all documents in a corpus--whereas here I want to tag entities at certain offsets only for particular documents. Offsets from one document do not apply to other documents.

You are looking for Doc.char_span.
doc = "Blah blah blah"
span = doc.char_span(0, 4, label="BLAH")
doc.ents = [span]
Note that doc.ents is a tuple, so you can't append to it, but you can convert it to a list and set the ents, for example.
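Applied to the offsets from the question, a minimal sketch could look like this (it reuses the question's text, offsets and GPE label; char_span returns None when the offsets don't line up with token boundaries, so it's worth checking for that):
from spacy.lang.en import English

nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets": (0, 5), "id": "apple"},
    {"offsets": (41, 54), "id": "san-francisco"},
]

doc = nlp(text)
spans = []
for e in already_known_entities:
    start, end = e["offsets"]
    span = doc.char_span(start, end, label="GPE")
    if span is not None:  # None means the offsets don't align with token boundaries
        spans.append(span)
doc.ents = spans

print([(ent.text, ent.label_) for ent in doc.ents])
# should print [('Apple', 'GPE'), ('San Francisco', 'GPE')]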

Related

How to match repeating patterns in spacy?

I have a similar question as the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case compared to the linked post is that my pattern is defined by POS and dependency tags. As a consequence I don't think I could easily use regex to solve my problem (as is suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.
The solution / issue isn't fundamentally different from the question linked to: there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns to capture what you want.
patterns = []
for ii in range(1, 5):
pattern = [{"POS": "ADJ"}, {"IS_PUNCT":True}] * ii
pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
patterns.append(pattern)
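Putting that together into a runnable sketch (reusing the v3-style matcher.add call already shown in the question; the example sentence is just for illustration):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("She told me that her dog was big, black, strong and playful.")

# Build patterns with 1-4 repeated "ADJ ," units before the final "ADJ CCONJ ADJ"
patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)

matcher = Matcher(nlp.vocab)
matcher.add("AdjList", patterns)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)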
Alternatively, you could do something with the DependencyMatcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
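A minimal sketch of that DependencyMatcher idea, assuming we anchor on the noun and collect adjectives attached to it via the amod relation (the rule name and example sentence are just illustrative):
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "adjective",
     "RIGHT_ATTRS": {"DEP": "amod"}},
]
matcher.add("NOUN_ADJ", [pattern])

doc = nlp("It was a big, brown, playful dog.")
for match_id, (noun, adjective) in matcher(doc):
    print(doc[adjective].text, "->", doc[noun].text)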
As a separate note, you are not handling sentences with the serial comma.

Spacy: How to get all words that describe a noun?

I am new to spacy and to nlp overall.
To understand how spacy works, I would like to create a function which takes a sentence and returns a dictionary, tuple or list with the noun and the words describing it.
I know that spacy creates a tree of the sentence and knows the use of each word (shown in displacy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun:"room",adj:"large"}
{noun:"dishwasher",adj:"yellow",adv:"two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
    # Anchor token: the noun being modified
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": "NOUN"}
    },
    # noun -> adjectival or numeric modifier
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}}
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])
text = "A large room with two yellow dishwashers in it"
doc = nlp(text)
for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
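For instance, to also accept proper nouns as the anchor, the target's attributes could be broadened like this (a small variation on the pattern above, not part of the original answer):
pattern = [
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": {"IN": ["NOUN", "PROPN"]}}
    },
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}}
    },
]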
You may also want to look at the noun chunks feature.
What you want to do is called "noun chunks":
import spacy
nlp = spacy.load('en_core_web_md')
txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice "noun chunk" doesn't guarantee the root will always be a noun. Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]

How to write code to merge punctuations and phrases using spaCy

What I would like to do
I would like to do parsing and dependency analysis using spaCy, one of the open-source libraries for natural language processing.
In particular, I would like to know how to write code in Python that uses the options to merge punctuation and phrases.
Problem
There are buttons to merge punctuation and phrases on the displaCy Dependency Visualizer web app.
However, I cannot find a way to set these options when writing code in my local environment.
The current code returns the unmerged version shown below.
The sample sentence is from your dictionary.
Current Code
It is from the sample code on the spaCy official website.
Please let me know how to change it so that the punctuation and phrase merge options are applied.
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
sentence = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(sentence)
displacy.render(doc, style="dep")
What I tried to do
There was one example of a merge implementation.
However, it didn't work when I applied it to my sentence.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Example Code
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
If you need to merge noun chunks, check out the built-in merge_noun_chunks pipeline component. When added to your pipeline using nlp.add_pipe, it will take care of merging the spans automatically.
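For example, a minimal sketch with the built-in component (this uses the spaCy v3 string-name style of nlp.add_pipe; in v2 you would pass nlp.create_pipe("merge_noun_chunks") instead):
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")  # built-in component that merges each noun chunk into one token

doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)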
You can just use the code from the displaCy Dependency Visualizer:
import spacy
nlp = spacy.load("en_core_web_sm")
def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

def merge_punct(doc):
    spans = []
    for word in doc[:-1]:
        if word.is_punct or not word.nbor(1).is_punct:
            continue
        start = word.i
        end = word.i + 1
        while end < len(doc) and doc[end].is_punct:
            end += 1
        span = doc[start:end]
        spans.append((span, word.tag_, word.lemma_, word.ent_type_))
    with doc.retokenize() as retokenizer:
        for span, tag, lemma, ent_type in spans:
            attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
            retokenizer.merge(span, attrs=attrs)
    return doc
text = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(text)
# Merge noun phrases into one token.
doc = merge_phrases(doc)
# Attach punctuation to tokens
doc = merge_punct(doc)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

python spacy looking for two (or more) words in a window

I am trying to identify concepts in texts. Oftentimes I consider that a concept appears in a text when two or more words appear relatively close to each other.
For instance a concept would be any of the words
forest, tree, nature
within a distance of fewer than 4 words from
fire, burn, overheat
I am learning spacy and so far I can use the matcher like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])
That would match "hello world" and "hello, world" (or "tree firing" for the above-mentioned example)
I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.
I had a look into:
https://spacy.io/usage/rule-based-matching
and the operators there described, but I am not able to put this word-window approach in "spacy" syntax.
Furthermore, I am not able to generalize that to more words as well.
Some ideas?
Thanks
For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. A wildcard means "any token", and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.
Thus, you can write your matcher as
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])
which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for
doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
it will print you
15578876784678163569 HelloWorld 0 4 Hello brave new world
And if you want to match the other order (world ? ? ? hello) as well, you need to add the second, symmetric pattern into your matcher.
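A sketch of that symmetric pattern, added to the same matcher and keeping the older matcher.add signature used above:
# Match "world ... hello" as well, with up to 3 arbitrary tokens in between
matcher.add("WorldHello", None,
            [{"LOWER": "world"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "hello"}])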
I'm relatively new to spaCy but I think the following pattern should work for any number of tokens between 'hello' and 'world' that are comprised of ASCII characters:
[{"LOWER": "hello"}, {'IS_ASCII': True, 'OP': '*'}, {"LOWER": "world"}]
I tested it using Explosion's rule-based match explorer and it works. Overlapping matches will return just one match (e.g., "hello and I do mean hello world").
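If you'd rather try that pattern locally than in the online explorer, a minimal sketch (using the older matcher.add signature from the question) might be:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Any number of ASCII-only tokens between "hello" and "world"
matcher.add("HelloWorld", None,
            [{"LOWER": "hello"}, {"IS_ASCII": True, "OP": "*"}, {"LOWER": "world"}])

doc = nlp("Hello brave new world")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)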

How to add additional currency characters in Spacy

I have documents where the character \u0080 is used as a Euro. I want to add these and other characters to the currency symbol list so that the money entity gets picked up by the Spacy NER. What is the best way to deal with this?
Additionally, I also have cases where money is represented as CAD 5,000, and these are not picked up by the NER as MONEY. What is the best way to deal with this situation: training the NER, or adding CAD as a currency symbol?
1. The '\u0080' problem
First things first: it seems that the interpretation of the '\u0080' character depends on the platform you are using; it doesn't print on a Windows 7 machine but it works on a Linux machine...
For completeness, I have assumed that you get your text from an HTML document containing the '&#x80;' escape sequence (which should print as € in a browser), the '\u0080' character and some other arbitrary symbols we identify as currencies.
Before passing the text content to spaCy, we can call html.unescape, which will take care of translating &#x80; to €, which in turn is going to be recognized by the default configuration as a currency.
text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
"Some people call it 641.3 \u0080. "
"Fantastic! But in the U.K. I'd rather pay 344🎅 or \U0001F33B56.")
text = html.unescape(text_html)
Second, if there are symbols that are not recognized as a currency, like 🎅 and 🌻 for example, then we can change the Defaults of the language we use to qualify them as currencies.
That consists of replacing the lex_attr_getters[IS_CURRENCY] function with a custom one that holds a list of symbols describing a currency.
def is_currency_custom(text):
    # Stripping punctuation
    table = str.maketrans({key: None for key in string.punctuation})
    text = text.translate(table)
    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if text in all_currencies:
        return True
    return is_currency_original(text)
# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom
2. The CAD 5,000 problem
For this one, a simple solution would be to define a special case. We tell the tokenizer that, whenever it meets CAD, this is a special case and it should do as we instruct. We can set the IS_CURRENCY flag amongst other things.
special_case = [{
    ORTH: u'CAD',
    TAG: u'$',
    IS_CURRENCY: True}]
nlp.tokenizer.add_special_case(u'CAD', special_case)
Note that this is not perfect as you may get false positives. Imagine a document from a Canadian company selling CAD drawing services... So this is good but not great.
If we want to be more precise, we can create a Matcher object that will look for patterns like CURRENCY[SPACE]NUMBER or NUMBER[SPACE]CURRENCY and associate the MONEY entity with it.
matcher = Matcher(nlp.vocab)
MONEY = nlp.vocab.strings['MONEY']
# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

matcher.add(
    'MoneyRedefined',
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)
You apply it to your doc object with matcher(doc). The {'OP': '?'} on the whitespace token makes that token optional, allowing it to match 0 or 1 times.
3. The full code
import spacy
from spacy.symbols import IS_CURRENCY
from spacy.lang.en import EnglishDefaults
from spacy.matcher import Matcher
from spacy import displacy
import html
import string
def is_currency_custom(text):
    # Stripping punctuation
    table = str.maketrans({key: None for key in string.punctuation})
    text = text.translate(table)
    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if text in all_currencies:
        return True
    return is_currency_original(text)
# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
MONEY = nlp.vocab.strings['MONEY']
# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)
matcher.add(
    'MoneyRedefined',
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)
text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
"Some people call it 641.3 \u0080. "
"Fantastic! But in the U.K. I'd rather pay 344🎅 or \U0001F33B56.")
text = html.unescape(text_html)
doc = nlp(text)
matcher(doc)
displacy.serve(doc, style='ent')
This gives the expected result, rendered with displaCy's entity visualizer.
