Tokenizer expanding contractions - python

I am looking for a tokenizer that expands contractions.
Using nltk to split a phrase into tokens, the contraction is not expanded:
nltk.word_tokenize("she's")
-> ['she', "'s"]
However, when using a dictionary with contraction mappings only, and therefore not taking any information provided by the surrounding words into account, it's not possible to decide whether "she's" should be mapped to "she is" or to "she has".
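For illustration, a dictionary-only expander might look like this (a minimal sketch; the mapping and function are made up for the example):

# Hypothetical contraction dictionary -- it has to commit to one expansion per key
contractions = {"she's": "she is"}   # but "she's got" would need "she has"

def expand(text):
    return ' '.join(contractions.get(tok, tok) for tok in text.split())

print(expand("she's a software engineer"))  # she is a software engineer  (correct)
print(expand("she's got a cat"))            # she is got a cat            (wrong)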
Is there a tokenizer that provides contraction expansion?

You can do rule-based matching with spaCy to take the information provided by surrounding words into account.
I wrote some demo code below, which you can extend to cover more cases:
import spacy
from spacy.matcher import Matcher

sentences = ["now she's a software engineer", "she's got a cat",
             "he's a tennis player", "He thinks that she's 30 years old"]

nlp = spacy.load('en_core_web_sm')

def normalize(sentence):
    doc = nlp(sentence)
    # print([(t.text, t.pos_, t.dep_) for t in doc])
    matcher = Matcher(nlp.vocab)
    # "'s" followed by "got" or "been" -> "has"
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}]
    matcher.add("case_has", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}]
    matcher.add("case_has", [pattern])
    # "'s" followed by a determiner or a digit -> "is"
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}]
    matcher.add("case_is", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}]
    matcher.add("case_is", [pattern])
    # .. add more cases

    # Record which "'s" tokens fall inside a match and what they expand to
    replacements = {}
    for match_id, start, end in matcher(doc):
        string_id = nlp.vocab.strings[match_id]
        for idx in range(start, end):
            if doc[idx].text == "'s":
                replacements[idx] = "has" if string_id == "case_has" else "is"
    return ' '.join(replacements.get(idx, t.text) for idx, t in enumerate(doc))

for s in sentences:
    print(s)
    print(normalize(s))
    print()
output:
now she's a software engineer
now she is a software engineer
she's got a cat
she has got a cat
he's a tennis player
he is a tennis player
He thinks that she's 30 years old
He thinks that she is 30 years old

Related

Unexpected behavior of SpaCy matcher with negation

Somehow I have trouble understanding the negation in SpaCy matchers. I tried this code:
import spacy
from spacy.matcher import Matcher
import json

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

Sentence = "The cat is black"
negative_sentence = "The cat is not black"

test_pattern = '''
[
    [
        {"TEXT": "cat"},
        {"LEMMA": "be"},
        {"LOWER": "not", "OP": "!"},
        {"LOWER": "black"}
    ]
]
'''
db = json.loads(test_pattern)
matcher.add("TEST_PATTERNS", db)

'''*********************Validate matcher on positive sentence******************'''
doc = nlp(Sentence)
matches = matcher(doc)
if matches != []:
    print('Positive sentence identified')
else:
    print('Nothing found for positive sentence')

'''*********************Validate matcher on negative sentence******************'''
doc = nlp(negative_sentence)
matches = matcher(doc)
if matches != []:
    print('Negative sentence identified')
else:
    print('Nothing found for negative sentence')
The result is:
Nothing found for positive sentence
Nothing found for negative sentence
I would expect the sentence "The cat is black" to be a match. Furthermore, when I replace the ! with any other operator ("*", "?", or "+"), it works as expected:
(The code is identical to the one above, except that the pattern token is {"LOWER": "not", "OP": "?"}.)
Result:
Positive sentence identified
Negative sentence identified
How can I use negation so that only "The cat is black" is identified, and not "The cat is not black"?
The reason I would like to use "OP" is that there might also be other words between "is" and "black" (e.g., "The cat is kind and black" as opposed to "The cat is not kind and black").
Any help on understanding negation with spaCy matchers is highly appreciated.
Each dictionary in your match pattern corresponds to a token by default. With the ! operator it still corresponds to one token, just in a negative sense. With the * operator it corresponds to zero or more tokens, with + it's one or more tokens.
Looking at your original pattern, these are your tokens:
text: cat
lemma: be
text: not, op: !
lower: black
Given the sentence "The cat is black", the match process works like this:
"the" matches nothing so we skip it.
"cat" matches your first token.
"is" matches your second token.
"black" matches your third token because it is not "not"
The sentence ends so there is no "cat" token, so the match fails.
When debugging patterns it's helpful to step through them like above.
For the other ops... * and ? work because "not" matches zero times. I would not expect + to work in the positive case.
The way you are trying to avoid matching negated things is kind of tricky. I would recommend you match all sentences with the relevant words first, ignoring negation, and then check if there is negation using the dependency parse.
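A minimal sketch of that two-step approach (the pattern and the negation check are my additions, not from the question; it assumes the English models' "neg" dependency label):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Match "cat" + a form of "be" + any tokens + "black", ignoring negation entirely
pattern = [{"TEXT": "cat"}, {"LEMMA": "be"}, {"OP": "*"}, {"LOWER": "black"}]
matcher.add("CAT_BLACK", [pattern])

for text in ["The cat is black", "The cat is not black", "The cat is kind and black"]:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        # Drop the match if the dependency parse marks a negation inside it
        if not any(tok.dep_ == "neg" for tok in span):
            print("Non-negated match:", span.text)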

Spacy: How to get all words that describe a noun?

I am new to spacy and to nlp overall.
To understand how spacy works I would like to create a function which takes a sentence and returns a dictionary, tuple, or list with the noun and the words describing it.
I know that spacy creates a tree of the sentence and knows the use of each word (shown in displacy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun:"room",adj:"large"}
{noun:"dishwasher",adj:"yellow",adv:"two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

pattern = [
    # anchor token: a noun
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": "NOUN"},
    },
    # noun -> adjectival or numeric modifier
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}},
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("MODIFIERS", [pattern])

text = "A large room with two yellow dishwashers in it"
doc = nlp(text)

for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
You may also want to look at the noun chunks feature.
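For instance, a minimal sketch of that conversion, reusing the matcher and doc from above (the grouping code is my addition, not from the question):

from collections import defaultdict

grouped = defaultdict(list)
for match_id, (target, modifier) in matcher(doc):
    grouped[doc[target].text].append(doc[modifier].text)

print(dict(grouped))
# {'room': ['large'], 'dishwashers': ['two', 'yellow']}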
What you want to do is called "noun chunks":
import spacy

nlp = spacy.load('en_core_web_md')

txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)

chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)

print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice "noun chunk" doesn't guarantee the root will always be a noun. Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)

print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]

How to ask the spaCy PhraseMatcher to match all the tokens in the list?

I have the following algorithm:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)

CAT = [nlp.make_doc(text) for text in ['pension', 'underwriter', 'health', 'client']]
phrase_matcher.add("CATEGORY 1", CAT)

text = 'The client works as a marine assistant underwriter. He has recently opted to stop paying into his pension. '
doc = nlp(text)

matches = phrase_matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
    span = doc[start:end]                  # get the matched slice of the doc
    print(rule_id, span.text)

# Output
CATEGORY 1 client
CATEGORY 1 underwriter
CATEGORY 1 pension
Can I ask it to return results only when all of the words are found in the sentence? I would expect to see nothing here, since 'health' is not part of the sentence.
Can I do this type of matching with PhraseMatcher, or do I need to change to another type of rule-based matching? Thank you
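As far as I know, PhraseMatcher itself has no "all phrases must be present" mode, but you can post-process the matches with an ordinary set check (a sketch, reusing the variables from the question):

required = {'pension', 'underwriter', 'health', 'client'}
found = {doc[start:end].text.lower() for match_id, start, end in matches}

# Only report matches when every required term occurred somewhere in the doc
if required.issubset(found):
    for match_id, start, end in matches:
        print(nlp.vocab.strings[match_id], doc[start:end].text)
else:
    print("Not all terms present; missing:", required - found)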

Accessing an out-of-range word in a spaCy doc: why does it work?

I'm learning spaCy and am playing with Matchers.
I have:
a very basic sentence ("white shepherd dog")
a matcher object, searching for a pattern ("white shepherd")
a print to show the match, and the word and POS before that match
I just wanted to check how to handle the index-out-of-range exception I was expecting to get, because there's nothing before the match. I didn't expect it to work, but it did and returned 'dog', which comes after the match... and now I'm confused.
It looks like spaCy uses a circular list (or a deque, I think)?
This needs a language model to run; you can install it with the following command if you'd like to reproduce it:
python -m spacy download en_core_web_md
And this is the code:
import spacy
from spacy.matcher import Matcher

# Loading language model
nlp = spacy.load("en_core_web_md")

# Initialising with shared vocab
matcher = Matcher(nlp.vocab)

# Adding the match pattern: searching for "white shepherd"
matcher.add("DOG", [[{"LOWER": "white"}, {"LOWER": "shepherd"}]])

doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)

    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_)  # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on?
Thanks!
You are looking for the token at index 0 - 1, which evaluates to -1, and as with ordinary Python sequences, doc[-1] is the last token.
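You can see the same wrap-around behaviour without the matcher (a minimal check):

doc = nlp("white shepherd dog")
print(doc[-1].text)  # 'dog' -- negative indices count from the end, as with a Python list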
I recommend using the Token.nbor method to look for the first token before the span, and if no previous token exists make it None or an empty string.
import spacy
from spacy.matcher import Matcher

# Loading language model
nlp = spacy.load("en_core_web_md")

# Initialising with shared vocab
matcher = Matcher(nlp.vocab)

# Adding the match pattern: searching for "white shepherd"
matcher.add("DOG", [[{"LOWER": "white"}, {"LOWER": "shepherd"}]])

doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        # nbor(-1) raises IndexError when there is no previous token
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''
        print("Previous token: None None")

Matcher is returning some duplicate entries

I want the output to be ["good customer service", "great ambience"], but I am getting ["good customer", "good customer service", "great ambience"], because the pattern also matches "good customer", and that phrase doesn't make any sense on its own. How can I remove these kinds of overlapping matches?
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")

matcher = Matcher(nlp.vocab)
# Create a pattern matching two or more tokens: an adjective followed by one or more nouns
pattern = [{"POS": 'ADJ'}, {"POS": 'NOUN', "OP": '+'}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])

matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
You may post-process the matches by grouping the tuples by their start index and keeping only the one with the largest end index:
from itertools import groupby
# ...
matches = matcher(doc)
results = [max(group, key=lambda x: x[2])
           for key, group in groupby(matches, key=lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']
The groupby(matches, key=lambda prop: prop[1]) call groups the matches by their start index, here resulting in [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and [(5488211386492616699, 4, 6)]. max(group, key=lambda x: x[2]) then grabs the item from each group whose end index (value #3) is the largest.
spaCy has a built-in function to do just that. Check filter_spans:
The documentation says:
When spans overlap, the (first) longest span is preferred over shorter spans.
Example:
from spacy.util import filter_spans

doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
print(filtered)
# [This is a sentence]
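Applied to the matches from the question, a sketch:

from spacy.util import filter_spans

# Convert the (match_id, start, end) tuples into spans, then drop overlaps
spans = [doc[start:end] for match_id, start, end in matches]
print([span.text for span in filter_spans(spans)])
# => ['good customer service', 'great ambience']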
