Unexpected behavior of SpaCy matcher with negation - python

Somehow I have trouble understanding the negation in SpaCy matchers. I tried this code:
import spacy
from spacy.matcher import Matcher
import json
nlp = spacy.load('en_core_web_sm')
#from spacy.tokenizer import Tokenizer
matcher = Matcher(nlp.vocab)
Sentence = "The cat is black"
negative_sentence = "The cat is not black"
test_pattern = '''
[
[
{
"TEXT": "cat"
},
{
"LEMMA": "be"
},
{
"LOWER": "not",
"OP": "!"
},
{
"LOWER": "black"
}
]
]
'''
db = json.loads(test_pattern)
matcher.add("TEST_PATTERNS", db)
# Validate matcher on positive sentence
doc = nlp(Sentence)
matches = matcher(doc)
if matches != []:
    print('Positive sentence identified')
else:
    print('Nothing found for positive sentence')
# Validate matcher on negative sentence
doc = nlp(negative_sentence)
matches = matcher(doc)
if matches != []:
    print('Negative sentence identified')
else:
    print('Nothing found for negative sentence')
The result is:
Nothing found for positive sentence
Nothing found for negative sentence
I would expect that the sentence "The cat is black" would be a match. Furthermore, when I replace the ! with any other sign ("*", "?", or "+") it works as expected:
import spacy
from spacy.matcher import Matcher
import json
nlp = spacy.load('en_core_web_sm')
#from spacy.tokenizer import Tokenizer
matcher = Matcher(nlp.vocab)
Sentence = "The cat is black"
negative_sentence = "The cat is not black"
test_pattern = '''
[
[
{
"TEXT": "cat"
},
{
"LEMMA": "be"
},
{
"LOWER": "not",
"OP": "?"
},
{
"LOWER": "black"
}
]
]
'''
db = json.loads(test_pattern)
matcher.add("TEST_PATTERNS", db)
# Validate matcher on positive sentence
doc = nlp(Sentence)
matches = matcher(doc)
if matches != []:
    print('Positive sentence identified')
else:
    print('Nothing found for positive sentence')
# Validate matcher on negative sentence
doc = nlp(negative_sentence)
matches = matcher(doc)
if matches != []:
    print('Negative sentence identified')
else:
    print('Nothing found for negative sentence')
Result:
Positive sentence identified
Negative sentence identified
How can I use the negation so that only "The cat is black" is identified, and not "The cat is not black"?
The reason I would like to use "OP" is that there might also be other words between "is" and "black" (e.g., "The cat is kind and black" and not "The cat is not kind and black").
Any help on understanding negation with SpaCy matchers is highly appreciated.

Each dictionary in your match pattern corresponds to a token by default. With the ! operator it still corresponds to one token, just in a negative sense. With the * operator it corresponds to zero or more tokens, with + it's one or more tokens.
Looking at your original pattern, these are your tokens:
text: cat
lemma: be
lower: not, op: !
lower: black
Given the sentence "The cat is black", the match process works like this:
"the" matches nothing so we skip it.
"cat" matches your first token.
"is" matches your second token.
"black" matches your third token because it is not "not".
The sentence ends, so there is nothing left for your fourth token ("black") to match, and the match fails.
When debugging patterns it's helpful to step through them like above.
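To make the walk-through concrete, here is a toy model of it in plain Python. It is not spaCy's matcher, just a sketch that assumes every pattern entry consumes exactly one token and that "!" inverts the test for that token. The toy_match helper is made up for illustration, and it uses LOWER "is" in place of LEMMA "be" because there is no lemmatizer here.

```python
def toy_match(pattern, tokens):
    """Return (start, end) spans where each pattern entry consumes one token."""
    spans = []
    for start in range(len(tokens)):
        end = start + len(pattern)
        if end > len(tokens):
            break  # not enough tokens left for every pattern entry
        ok = True
        for spec, token in zip(pattern, tokens[start:end]):
            hit = token.lower() == spec["LOWER"]
            if spec.get("OP") == "!":
                hit = not hit  # "!" still consumes a token; it just must NOT match
            if not hit:
                ok = False
                break
        if ok:
            spans.append((start, end))
    return spans


pattern = [
    {"LOWER": "cat"},
    {"LOWER": "is"},  # stand-in for {"LEMMA": "be"}
    {"LOWER": "not", "OP": "!"},
    {"LOWER": "black"},
]

print(toy_match(pattern, "The cat is black".split()))       # []
print(toy_match(pattern, "The cat is not black".split()))   # []
print(toy_match(pattern, "The cat is very black".split()))  # [(1, 5)]
```

Only the third sentence matches, because the "!" entry needs some token other than "not" sitting between "is" and "black".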
For the other ops... * and ? work because "not" matches zero times. I would not expect + to work in the positive case.
The way you are trying to avoid matching negated things is kind of tricky. I would recommend you match all sentences with the relevant words first, ignoring negation, and then check if there is negation using the dependency parse.

Related

Spacy: How to get all words that describe a noun?

I am new to spacy and to nlp overall.
To understand how spacy works I would like to create a function which takes a sentence and returns a dictionary,tuple or list with the noun and the words describing it.
I know that spacy creates a tree of the sentence and knows the use of each word (shown in displacy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun:"room",adj:"large"}
{noun:"dishwasher",adj:"yellow",adv:"two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
{
"RIGHT_ID": "target",
"RIGHT_ATTRS": {"POS": "NOUN"}
},
# target noun -> its adjectival or numeric modifier
{
"LEFT_ID": "target",
"REL_OP": ">",
"RIGHT_ID": "modifier",
"RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}}
},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])
text = "A large room with two yellow dishwashers in it"
doc = nlp(text)
for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
You may also want to look at the noun chunks feature.
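As a rough sketch of the "turn that into a dictionary" step, the (modifier, noun) pairs can be grouped by noun. The pairs below are copied by hand from the output above, so this runs without spaCy; in real use they would be built from doc[modifier].text and doc[target].text inside the matcher loop. The group_modifiers helper and the "modifiers" key are names made up for this example.

```python
from collections import defaultdict

# (modifier, noun) pairs copied from the DependencyMatcher output above.
pairs = [("large", "room"), ("two", "dishwashers"), ("yellow", "dishwashers")]


def group_modifiers(pairs):
    """Collect all modifiers of each noun into one dictionary per noun."""
    grouped = defaultdict(list)
    for modifier, noun in pairs:
        grouped[noun].append(modifier)
    # Dictionaries keep insertion order, so nouns appear in first-seen order.
    return [{"noun": noun, "modifiers": mods} for noun, mods in grouped.items()]


print(group_modifiers(pairs))
# [{'noun': 'room', 'modifiers': ['large']},
#  {'noun': 'dishwashers', 'modifiers': ['two', 'yellow']}]
```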
What you want to do is called "noun chunks":
import spacy
nlp = spacy.load('en_core_web_md')
txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice "noun chunk" doesn't guarantee the root will always be a noun. Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]

Python find offsets of a word token in a text

I wrote this function findTokenOffset that finds the offset of a given word in a pre-tokenized text (as a list of spaced words or according to a certain tokenizer).
import re, json
def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))

def findTokenOffset(text,tokens):
    seen = {} # maps a token to the end offset of its last match
    items=[] # word tokens
    my_regex = word_regex_ascii
    # for each token word
    for index_word,word in enumerate(tokens):
        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each matched token in sentence
        for m in r.finditer(text):
            token=m.group()
            characterOffsetBegin=m.start()
            characterOffsetEnd=characterOffsetBegin+len(m.group()) - 1 # LP: start from 0
            found=-1
            if word in seen:
                found=seen[word]
            if characterOffsetBegin > found:
                # store where this word was last seen
                seen[word] = characterOffsetEnd
                item['index']=index_word+1 # word index starts from 1
                item['word']=token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    return items
This code works ok when the tokens are single words like
text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2))
But suppose the tokens are multi-word, like here:
text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2))
the offset does not work properly, due to repeating words in different tokens:
[
{
"index": 1,
"word": "George Washington",
"characterOffsetBegin": 0,
"characterOffsetEnd": 16
},
{
"index": 2,
"word": "Washington",
"characterOffsetBegin": 7,
"characterOffsetEnd": 16
}
]
How to add support to multi-token and overlapped token regex matching (thanks to the suggestion in comments for this exact problem's name)?
If you do not need the search phrase/word index information in the resulting output, you can use the following approach:
import re,json
def findTokenOffset(text, pattern):
    items = []
    for m in pattern.finditer(text):
        item = {}
        item['word']=m.group()
        item['characterOffsetBegin'] = m.start()
        item['characterOffsetEnd'] = m.end()
        items.append(item)
    return items
text = "George Washington came to Washington Washington.com"
tokens = ["George Washington", "Washington"]
pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I )
offsets = findTokenOffset(text,pattern)
print(json.dumps(offsets, indent=2))
The output of the Python demo:
[
{
"word": "George Washington",
"characterOffsetBegin": 0,
"characterOffsetEnd": 17
},
{
"word": "Washington",
"characterOffsetBegin": 26,
"characterOffsetEnd": 36
}
]
The main part is pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I ) that does the following:
map(re.escape, tokens) - escapes special chars inside tokens strings
sorted(..., key=len, reverse=True) - sorts the escaped tokens by length in descending order (so that Washington Post could match earlier than Washington)
"|".join(...) - creates an alternation list of tokens, token1|token2|etc
(?<!\w)(?:...)(?!\w)(?!\.\b) - is the final pattern that matches all the alternatives in tokens as whole words. (?<!\w) and (?!\w) are used to enable word boundary detection even if the tokens start/end with a special character.
NOTE ON WORD BOUNDARIES
You should check your token boundary requirements. I added (?!\.\b) because you mention that Washington should not match in Washington.com, so I inferred that you want to fail any word match that is immediately followed by . and a word boundary. There are a lot of other possible solutions, the main one being whitespace boundaries, (?<!\S) and (?!\S).
Besides, see Match a whole word in a string using dynamic regex.
If you want to look up Washington but not George Washington, you can remove the phrases you have already found from the initial string. To do that, sort the tokens by word count, so that you scan for the multi-word phrases first and the single words after that.
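A minimal sketch of that idea, assuming you want end-exclusive offsets (like m.end() in the other answer) and whole-word matching via (?<!\w) and (?!\w). Instead of deleting a found phrase, it blanks it out with placeholder characters so the offsets of later matches do not shift. The function name find_offsets_longest_first is made up for this example.

```python
import re


def find_offsets_longest_first(text, tokens):
    """Return (match, begin, end) tuples, scanning multi-word tokens first."""
    # Mutable copy of the text: matched spans are blanked out in place,
    # which keeps every character offset identical to the original text.
    remaining = list(text)
    found = []
    # More words first, so "George Washington" is claimed before "Washington".
    for token in sorted(tokens, key=lambda t: len(t.split()), reverse=True):
        pattern = re.compile(r"(?<!\w){}(?!\w)".format(re.escape(token)), re.I)
        for m in pattern.finditer("".join(remaining)):
            found.append((m.group(), m.start(), m.end()))
            # Mask the span with NUL characters, which no token will match.
            remaining[m.start():m.end()] = "\0" * (m.end() - m.start())
    return sorted(found, key=lambda item: item[1])


text = "George Washington came to Washington"
print(find_offsets_longest_first(text, ["George Washington", "Washington"]))
# [('George Washington', 0, 17), ('Washington', 26, 36)]
```

Masking rather than deleting is a design choice here: deleting would move every later character and the reported offsets would no longer point into the original string.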

Accessing out of range word in spaCy doc : why does it work?

I'm learning spaCy and am playing with Matchers.
I have:
a very basic sentence ("white shepherd dog")
a matcher object, searching for a pattern ("white shepherd")
a print to show the match, and the word and POS before that match
I just wanted to check how to handle the index out of range exception I'm expecting to get because there's nothing before the match. I didn't expect it to work, but it did and is returning 'dog', which is after the match... and now I'm confused.
It looks like spaCy uses a circular list (or deque I think) ?
This needs a language model to run, you can install it with the following command line, if you'd like to reproduce it:
python -m spacy download en_core_web_md
And this is the code
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding statistical predictions
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}]) # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_) # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on ?
Thanks !
You are looking for a token at index 0-1, which evaluates to -1, and index -1 is the last token.
I recommend using the Token.nbor method to look for the first token before the span, and if no previous token exists make it None or an empty string.
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding statistical predictions
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}]) # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''
        print("Previous token: None None")
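The wrap-around itself is ordinary Python negative indexing, which the question shows doc[start - 1] following as well. Here is a small sketch with a plain list standing in for the Doc, plus an explicit bounds check as an alternative to catching the exception. The previous_item helper is a made-up name, not part of spaCy.

```python
def previous_item(items, start):
    """Return the item before position `start`, or None if there is none."""
    # A negative index would silently wrap around to the end of the sequence,
    # so check the bound explicitly instead of waiting for an IndexError.
    if start <= 0:
        return None
    return items[start - 1]


words = ["white", "shepherd", "dog"]

print(words[0 - 1])             # dog (index -1 is the last element)
print(previous_item(words, 0))  # None
print(previous_item(words, 2))  # shepherd
```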

Use spacy on pretokenized text

I want to use spacy for processing an already pre-tokenized text. Passing a list of tokens to spacy does not work.
import spacy
nlp = spacy.load("en_core_web_sm")
nlp(["This", "is", "a", "sentence"])
This gives a TypeError (which makes sense):
TypeError: Argument 'string' has incorrect type (expected str, got list)
I could replace the tokenizer with a custom one, but I feel like that would overcomplicate things and is not the preferred way.
Thank you for your help :D
You can use this method:
tokens = ["This", "is", "a", "sentence"]
sentence = nlp.tokenizer.tokens_from_list(tokens)
print(sentence)
This is a sentence
As of spaCy 3.0+, nlp.tokenizer.tokens_from_list() has been deprecated. Use the Doc object instead.
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")
sent = ["This", "is", "a", "sentence"]
doc = Doc(nlp.vocab, sent)
for token in nlp(doc):
    print(token.text, token.pos_)
If you use sentence = nlp.tokenizer.tokens_from_list(tokens) with spacy.matcher / Matcher, you'll get an error:
Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) instead of list(nlp.tokenizer.pipe()).
The way I solved it: I iterate over each item inside of a for loop:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{'LEMMA': 'sentence', 'POS': 'NOUN'}]
matcher.add('Searched Word', None, pattern)
X = ["Sentence one", "Sentence two", "Sentence three", "sentence last !"]
for text in X:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        print(matched_span.text)
A better way of doing this is using nlp.pipe:
for doc in nlp.pipe(X):
    print([token.text for token in doc])
This is also faster and more efficient, since nlp.pipe processes the texts in batches.
Hope this helps. Thank you.

Tokenizer expanding contractions

I am looking for a tokenizer that is expanding contractions.
Using nltk to split a phrase into tokens, the contraction is not expanded.
nltk.word_tokenize("she's")
-> ['she', "'s"]
However, when using a dictionary with contraction mappings only, and therefore not taking any information provided by surrounding words into account, it's not possible to decide whether "she's" should be mapped to "she is" or to "she has".
Is there a tokenizer that provides contraction expansion?
You can do rule based matching with Spacy to take information provided by surrounding words into account.
I wrote some demo code below which you can extend to cover more cases:
import spacy
from spacy.pipeline import EntityRuler
from spacy import displacy
from spacy.matcher import Matcher
sentences = ["now she's a software engineer" , "she's got a cat", "he's a tennis player", "He thinks that she's 30 years old"]
nlp = spacy.load('en_core_web_sm')
def normalize(sentence):
    ans = []
    doc = nlp(sentence)
    #print([(t.text, t.pos_ , t.dep_) for t in doc])
    matcher = Matcher(nlp.vocab)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}]
    matcher.add("case_has", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}]
    matcher.add("case_has", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}]
    matcher.add("case_is", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}]
    matcher.add("case_is", None, pattern)
    # .. add more cases
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        for idx, t in enumerate(doc):
            if string_id == 'case_has' and t.text == "'s" and idx >= start and idx < end:
                ans.append("has")
                continue
            if string_id == 'case_is' and t.text == "'s" and idx >= start and idx < end:
                ans.append("is")
                continue
            else:
                ans.append(t.text)
    return(' '.join(ans))
for s in sentences:
    print(s)
    print(normalize(s))
    print()
output:
now she's a software engineer
now she is a software engineer
she's got a cat
she has got a cat
he's a tennis player
he is a tennis player
He thinks that she's 30 years old
He thinks that she is 30 years old
