I am new to spaCy and to NLP overall.
To understand how spaCy works, I would like to create a function that takes a sentence and returns a dictionary, tuple, or list with each noun and the words describing it.
I know that spaCy creates a tree of the sentence and knows the role of each word (as shown in displaCy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun:"room",adj:"large"}
{noun:"dishwasher",adj:"yellow",adv:"two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
    # anchor token: any noun
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": "NOUN"}
    },
    # noun -> adjectival or numeric modifier
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}}
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("NOUN_MODIFIER", [pattern])
text = "A large room with two yellow dishwashers in it"
doc = nlp(text)
for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
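As a rough sketch of that dictionary step, assuming the matches have first been collected as plain (noun, dependency label, modifier) string triples, for example `(doc[target].text, doc[modifier].dep_, doc[modifier].text)`. The helper name `group_modifiers` is made up for illustration and is not part of spaCy:

```python
def group_modifiers(triples):
    """Group (noun, dep_label, modifier) triples into one dict per noun."""
    grouped = {}
    for noun, dep, modifier in triples:
        # One entry per noun; each dependency label maps to a list so that
        # several modifiers of the same kind on one noun are all kept.
        entry = grouped.setdefault(noun, {"noun": noun})
        entry.setdefault(dep, []).append(modifier)
    return list(grouped.values())


# Triples corresponding to the output shown above.
triples = [
    ("room", "amod", "large"),
    ("dishwashers", "nummod", "two"),
    ("dishwashers", "amod", "yellow"),
]
print(group_modifiers(triples))
```

Grouping by the noun's text merges repeated occurrences of the same noun; keying on the token index (`target`) instead would keep them apart.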
You may also want to look at the noun chunks feature.
What you want to do is called "noun chunks":
import spacy
nlp = spacy.load('en_core_web_md')
txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice "noun chunk" doesn't guarantee the root will always be a noun. Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]
Related
Somehow I have trouble understanding the negation in SpaCy matchers. I tried this code:
import spacy
from spacy.matcher import Matcher
import json
nlp = spacy.load('en_core_web_sm')
#from spacy.tokenizer import Tokenizer
matcher = Matcher(nlp.vocab)
Sentence = "The cat is black"
negative_sentence = "The cat is not black"
test_pattern = '''
[
[
{
"TEXT": "cat"
},
{
"LEMMA": "be"
},
{
"LOWER": "not",
"OP": "!"
},
{
"LOWER": "black"
}
]
]
'''
db = json.loads(test_pattern)
matcher.add("TEST_PATTERNS", db)
'''*********************Validate matcher on positive sentence******************'''
doc = nlp(Sentence, matcher)
matches = matcher(doc)
if matches != []:
    print('Positive sentence identified')
else:
    print('Nothing found for positive sentence')
'''*********************Validate matcher on negative sentence******************'''
doc = nlp(negative_sentence, matcher)
matches = matcher(doc)
if matches != []:
    print('Negative sentence identified')
else:
    print('Nothing found for negative sentence')
The result is:
Nothing found for positive sentence
Nothing found for negative sentence
I would expect that the sentence "The cat is black" would be a match. Furthermore, when I replace the ! with any other sign ("*", "?", or "+") it works as expected:
import spacy
from spacy.matcher import Matcher
import json
nlp = spacy.load('en_core_web_sm')
#from spacy.tokenizer import Tokenizer
matcher = Matcher(nlp.vocab)
Sentence = "The cat is black"
negative_sentence = "The cat is not black"
test_pattern = '''
[
[
{
"TEXT": "cat"
},
{
"LEMMA": "be"
},
{
"LOWER": "not",
"OP": "?"
},
{
"LOWER": "black"
}
]
]
'''
db = json.loads(test_pattern)
matcher.add("TEST_PATTERNS", db)
'''*********************Validate matcher on positive sentence******************'''
doc = nlp(Sentence, matcher)
matches = matcher(doc)
if matches != []:
    print('Positive sentence identified')
else:
    print('Nothing found for positive sentence')
'''*********************Validate matcher on negative sentence******************'''
doc = nlp(negative_sentence, matcher)
matches = matcher(doc)
if matches != []:
    print('Negative sentence identified')
else:
    print('Nothing found for negative sentence')
Result:
Positive sentence identified
Negative sentence identified
How can I use the negation and only identify "The cat is black" and not "The cat is not black".
The reason I would like to use "OP" is that there might also be other words between "is" and "black" (e.g., match "The cat is kind and black" but not "The cat is not kind and black").
Any help on understanding negation with SpaCy matchers is highly appreciated.
Each dictionary in your match pattern corresponds to a token by default. With the ! operator it still corresponds to one token, just in a negative sense. With the * operator it corresponds to zero or more tokens, with + it's one or more tokens.
Looking at your original pattern, these are your tokens:
text: cat
lemma: be
lower: not, op: !
lower: black
Given the sentence "The cat is black", the match process works like this:
"the" matches nothing so we skip it.
"cat" matches your first token.
"is" matches your second token.
"black" matches your third token because it is not "not"
The sentence ends, so there is nothing left to match your fourth token ("black"), and the match fails.
When debugging patterns it's helpful to step through them like above.
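To make that step-through concrete, here is a toy model of the behaviour described above. It is not spaCy's implementation: it only understands "LOWER" and the "!" operator, and it assumes every pattern dictionary consumes exactly one token. The function names are invented for this sketch.

```python
def token_matches(spec, word):
    """Check one pattern dict against one word (only LOWER and OP '!' are modelled)."""
    hit = word.lower() == spec["LOWER"]
    # '!' inverts the test but still consumes exactly one token.
    return not hit if spec.get("OP") == "!" else hit


def find_match(pattern, words):
    """Return the start index of the first window where every spec matches, else None."""
    width = len(pattern)  # every spec consumes one token in this toy model
    for start in range(len(words) - width + 1):
        window = words[start:start + width]
        if all(token_matches(spec, word) for spec, word in zip(pattern, window)):
            return start
    return None


pattern = [
    {"LOWER": "cat"},
    {"LOWER": "is"},              # stands in for {"LEMMA": "be"}
    {"LOWER": "not", "OP": "!"},
    {"LOWER": "black"},
]

print(find_match(pattern, "The cat is black".split()))       # None: no token left for "black"
print(find_match(pattern, "The cat is very black".split()))  # 1: "very" fills the '!' slot
print(find_match(pattern, "The cat is not black".split()))   # None: "not" fails the '!' slot
```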
For the other ops... * and ? work because "not" matches zero times. I would not expect + to work in the positive case.
The way you are trying to avoid matching negated things is kind of tricky. I would recommend you match all sentences with the relevant words first, ignoring negation, and then check if there is negation using the dependency parse.
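A minimal sketch of that second step, assuming the parser attaches "not" to the verb with the dependency label "neg" (which is what the English models generally do, but check your own parses). `is_negated` is a hypothetical helper, and the hand-built stand-in objects below only mimic the `dep_`, `head` and `children` attributes of spaCy tokens so the example runs without a model:

```python
from types import SimpleNamespace


def is_negated(token):
    """True if the token or its syntactic head has a child labelled 'neg'."""
    candidates = list(token.children) + list(token.head.children)
    return any(child.dep_ == "neg" for child in candidates)


def make_token(text, dep):
    """Build a stand-in for a spaCy token (illustration only)."""
    return SimpleNamespace(text=text, dep_=dep, children=[], head=None)


# Stand-ins mimicking the parse of "The cat is not black".
is_ = make_token("is", "ROOT")
not_ = make_token("not", "neg")
black = make_token("black", "acomp")
is_.head = is_                 # in spaCy the root token is its own head
is_.children = [not_, black]
not_.head = black.head = is_

print(is_negated(black))  # True
```

With real spaCy tokens you would first match "cat ... black" while ignoring negation, then call the helper on the matched "black" token and discard the match if it returns True.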
I have the following text and want to isolate a part of the sentence related to a keyword, in this case keywords = ['pizza', 'chips'].
text = "The pizza is great but the chips aren't the best"
Expected Output:
{'pizza': 'The pizza is great'}
{'chips': "the chips aren't the best"}
I have tried using the Spacy Dependency Matcher but admittedly I'm not quite sure how it works. I tried the following pattern for chips which yields no matches.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
    {
        "RIGHT_ID": "chips_id",
        "RIGHT_ATTRS": {"ORTH": "chips"}
    },
    {
        "LEFT_ID": "chips_id",
        "REL_OP": "<<",
        "RIGHT_ID": "other_words",
        "RIGHT_ATTRS": {"POS": '*'}
    }
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("chips", [pattern])
doc = nlp("The pizza is great but the chips aren't the best")
for id_, (_, other_words) in matcher(doc):
    print(doc[other_words])
Edit:
Additional example sentences:
example_sentences = [
    "The pizza's are just OK, the chips is stiff and the service mediocre",
    "Then the mains came and the pizza - these we're really average - chips had loads of oil and was poor",
    "Nice pizza freshly made to order food is priced well, but chips are not so keenly priced.",
    "The pizzas and chips taste really good and the Tango Ice Blast was refreshing"
]
Here is my attempt at a very limited solution to your problem, since I do not know how extensive you will want this to be.
I utilized code from this answer in order to address the problem.
import spacy
import re
en = spacy.load('en_core_web_sm')
text = "The pizza is great but the chips aren't the best"
doc = en(text)
seen = set() # keep track of covered words
chunks = []
for sent in doc.sents:
    # heads of clauses conjoined with the sentence root
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']
    for head in heads:
        words = [ww for ww in head.subtree]
        for word in words:
            seen.add(word)
        chunk = (' '.join([ww.text for ww in words]))
        chunks.append( (head.i, chunk) )
    # everything not covered by a conjoined clause belongs to the main clause
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join([ww.text for ww in unseen])
    chunks.append( (sent.root.i, chunk) )
chunks = sorted(chunks, key=lambda x: x[0])
output_dict = {}
for np in doc.noun_chunks:
    # drop a leading "the" so the key is just the noun phrase itself
    insensitive_the = re.compile(re.escape('the '), re.IGNORECASE)
    new_np = insensitive_the.sub('',np.text)
    output_dict[new_np]=''
for ii, chunk in chunks:
    #print(ii, chunk)
    for key in output_dict:
        if key in chunk:
            output_dict[key]=chunk
print(output_dict)
The output I get is:
I am aware there are a few problems:
The conjunction 'but' should not be in the value of the pizza key.
The word "are n't" should be aren't in the second value of the dictionary.
However, I believe we can fix this if we know more information about what sort of sentences you are dealing with. For instance, we might have a list of conjunctions that we can strip from all the values of the dict if the sentences are simple enough.
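A small sketch of that idea, assuming the sentences are simple enough that conjunctions only need trimming from the start and end of each value. The conjunction list, the helper name `strip_conjunctions` and the example dictionary are all made up for illustration:

```python
# Hypothetical list of conjunctions to trim; extend as needed.
CONJUNCTIONS = {"and", "but", "or", "so", "yet"}


def strip_conjunctions(chunk):
    """Drop conjunctions from the start and end of a whitespace-separated chunk."""
    words = chunk.split()
    while words and words[0].lower() in CONJUNCTIONS:
        words.pop(0)
    while words and words[-1].lower() in CONJUNCTIONS:
        words.pop()
    return " ".join(words)


# Illustrative input only; the values are invented to show the trimming step.
example = {"pizza": "The pizza is great but", "chips": "the chips are n't the best"}
cleaned = {key: strip_conjunctions(value) for key, value in example.items()}
print(cleaned)
```

The "are n't" spacing comes from joining token texts with a space; joining each token's `text_with_ws` with an empty string instead should preserve the original spacing.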
Update with example sentences:
As you can see, I think SpaCy struggles a bit with the punctuation, as well as knowing that you only want food items as nouns in the dictionary, presumably.
You could use the following function:
def spliter(text: str, keyword: list, number_of_words: int):
    L = text.split()
    sentences = dict()
    for k in L:
        if k in keyword:
            n = L.index(k)
            if len(L) - n - 1 > number_of_words:
                sentences.update({k: ' '.join(L[n : n + number_of_words])})
            else:
                sentences.update({k: ' '.join(L[n:])})
    return sentences
Note: number_of_words defines how many words to take, starting from and including the keyword. If no more than number_of_words words follow the keyword, the rest of the text is returned instead.
Output: for number_of_words = 3 you get:
{'pizza': 'pizza is great', 'chips': "chips aren't the best"}
I wrote this function findTokenOffset that finds the offset of a given word in a pre-tokenized text (as a list of spaced words or according to a certain tokenizer).
import re, json
def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))
def findTokenOffset(text,tokens):
    seen = {} # maps a token to the end offset of its last match, if it has been seen already
    items=[] # word tokens
    my_regex = word_regex_ascii
    # for each token word
    for index_word,word in enumerate(tokens):
        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each matched token in sentence
        for m in r.finditer(text):
            token=m.group()
            characterOffsetBegin=m.start()
            characterOffsetEnd=characterOffsetBegin+len(m.group()) - 1 # offsets start from 0
            found=-1
            if word in seen:
                found=seen[word]
            if characterOffsetBegin > found:
                # store where this word was last seen
                seen[word] = characterOffsetEnd
                item['index']=index_word+1 # word index starts from 1
                item['word']=token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    return items
This code works ok when the tokens are single words like
text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2))
But suppose the tokens can span several words, like here:
text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2))
the offset does not work properly, due to repeating words in different tokens:
[
{
"index": 1,
"word": "George Washington",
"characterOffsetBegin": 0,
"characterOffsetEnd": 16
},
{
"index": 2,
"word": "Washington",
"characterOffsetBegin": 7,
"characterOffsetEnd": 16
}
]
How to add support to multi-token and overlapped token regex matching (thanks to the suggestion in comments for this exact problem's name)?
If you do not need the search phrase/word index information in the resulting output, you can use the following approach:
import re,json
def findTokenOffset(text, pattern):
    items = []
    for m in pattern.finditer(text):
        item = {}
        item['word']=m.group()
        item['characterOffsetBegin'] = m.start()
        item['characterOffsetEnd'] = m.end()
        items.append(item)
    return items
text = "George Washington came to Washington Washington.com"
tokens = ["George Washington", "Washington"]
pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I )
offsets = findTokenOffset(text,pattern)
print(json.dumps(offsets, indent=2))
The output of the Python demo:
[
{
"word": "George Washington",
"characterOffsetBegin": 0,
"characterOffsetEnd": 17
},
{
"word": "Washington",
"characterOffsetBegin": 26,
"characterOffsetEnd": 36
}
]
The main part is pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I ), which does the following:
map(re.escape, tokens) - escapes special chars inside tokens strings
sorted(..., key=len, reverse=True) - sorts the escaped tokens by length in descending order (so that Washington Post is tried before Washington)
"|".join(...) - creates an alternation of the tokens, token1|token2|etc
(?<!\w)(?:...)(?!\w)(?!\.\b) - is the final pattern that matches all the alternatives in tokens as whole words. (?<!\w) and (?!\w) are used to enable word boundary detection even if the tokens start/end with a special character.
NOTE ON WORD BOUNDARIES
You should check your token boundary requirements. I added (?!\.\b) because you mention that Washington should not match in Washington.com, so I inferred that you want any word match to fail when it is immediately followed by . and a word boundary. There are many other possible solutions, the main one being whitespace boundaries, (?<!\S) and (?!\S).
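For comparison, a minimal sketch of the whitespace-boundary variant, under the assumption that a token only counts when it is delimited by whitespace or the start/end of the string. The helper name `compile_whitespace_bounded` is made up for this example:

```python
import re


def compile_whitespace_bounded(tokens):
    """Build a pattern matching any token only when delimited by whitespace or string edges."""
    # Longest alternatives first, as in the pattern above.
    alternatives = "|".join(sorted(map(re.escape, tokens), key=len, reverse=True))
    return re.compile(rf"(?<!\S)(?:{alternatives})(?!\S)", re.I)


text = "George Washington came to Washington Washington.com"
pattern = compile_whitespace_bounded(["George Washington", "Washington"])
print([(m.group(), m.start(), m.end()) for m in pattern.finditer(text)])
# [('George Washington', 0, 17), ('Washington', 26, 36)]
```

The trade-off is that a token directly followed by punctuation, such as "Washington," or "Washington.", is rejected too, so this only fits text where tokens are reliably separated by spaces.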
Besides, see Match a whole word in a string using dynamic regex.
If you want to look up Washington, but not the one inside George Washington, you can remove the phrases you have already found from the initial string. To do that, sort the tokens by word count, so that you scan for the multi-word phrases first and for the single words after that.
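One way this could look, as a sketch rather than a finished solution: each token is searched once, multi-word tokens first, and every span that was found is overwritten with spaces so that later searches skip it while character offsets stay valid. The function name is invented here, and end offsets are exclusive (`match.end()`):

```python
import re


def find_offsets_longest_first(text, tokens):
    """Locate each token, searching multi-word tokens first and masking what was found."""
    remaining = text
    items = []
    # More words first, so "George Washington" is claimed before "Washington".
    for token in sorted(tokens, key=lambda t: len(t.split()), reverse=True):
        match = re.search(rf"(?<!\w){re.escape(token)}(?!\w)", remaining, re.I)
        if match is None:
            continue
        items.append({
            "word": match.group(),
            "characterOffsetBegin": match.start(),
            "characterOffsetEnd": match.end(),
        })
        # Overwrite the span with spaces: later searches skip it, offsets stay valid.
        blank = " " * (match.end() - match.start())
        remaining = remaining[:match.start()] + blank + remaining[match.end():]
    return items


text = "George Washington came to Washington"
print(find_offsets_longest_first(text, ["George Washington", "Washington"]))
```

For the example text this reports "George Washington" at 0 to 17 and the standalone "Washington" at 26 to 36.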
I am looking for a tokenizer that is expanding contractions.
Using nltk to split a phrase into tokens, the contraction is not expanded.
nltk.word_tokenize("she's")
-> ['she', "'s"]
However, when using a dictionary with contraction mappings only, and therefore not taking any information provided by surrounding words into account, it's not possible to decide whether "she's" should be mapped to "she is" or to "she has".
Is there a tokenizer that provides contraction expansion?
You can do rule based matching with Spacy to take information provided by surrounding words into account.
I wrote some demo code below which you can extend to cover more cases:
import spacy
from spacy.matcher import Matcher
sentences = ["now she's a software engineer" , "she's got a cat", "he's a tennis player", "He thinks that she's 30 years old"]
nlp = spacy.load('en_core_web_sm')
def normalize(sentence):
    doc = nlp(sentence)
    #print([(t.text, t.pos_ , t.dep_) for t in doc])
    matcher = Matcher(nlp.vocab)
    # spaCy v3 signature; in v2 this was matcher.add("case_has", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}]
    matcher.add("case_has", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}]
    matcher.add("case_has", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}]
    matcher.add("case_is", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}]
    matcher.add("case_is", [pattern])
    # .. add more cases
    # map the index of each matched "'s" token to its expansion
    expansions = {}
    for match_id, start, end in matcher(doc):
        string_id = nlp.vocab.strings[match_id]
        for t in doc[start:end]:
            if t.text == "'s":
                expansions[t.i] = "has" if string_id == 'case_has' else "is"
    # tokens without an expansion are kept unchanged, also when nothing matched
    return ' '.join(expansions.get(t.i, t.text) for t in doc)
for s in sentences:
    print(s)
    print(normalize(s))
    print()
output:
now she's a software engineer
now she is a software engineer
she's got a cat
she has got a cat
he's a tennis player
he is a tennis player
He thinks that she's 30 years old
He thinks that she is 30 years old
I have a list of sentences and need to find the noun phrases for each sentence using spaCy. Currently, the output just appends the noun phrases from all of the sentences to one flat list. How can I get the noun phrases for each sentence and print them as a list of lists?
Say we have two sentences in a list:
A = ["I am a boy", "I am a girl"]
A_np = []
for x in A:
    doc = nlp(x)
    for np in doc.noun_chunks:
        A_np.append(np.text)
A_np
I am expecting to get something like this:
[['I','boy'],['I','girl']]
You need to make two improvements:
1/ noun_chunks are spans, not tokens, so it is better to iterate over the individual tokens of each noun chunk.
2/ You need an intermediate list to store the noun chunks of a single sentence.
Improved code, which you can adjust to your requirements:
>>> import spacy
>>> A = ["I am a boy", "I am a girl"]
>>> nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut only works in spaCy v2
>>> A_np = []
>>> for x in A:
...     doc = nlp(x)
...     sent_nps = []
...     for np in doc.noun_chunks:
...         sent_nps.extend([token.text for token in np])
...     A_np.append(sent_nps)
...
>>> A_np
[['I', 'a', 'boy'], ['I', 'a', 'girl']]
I figured it out by appending an empty list before the second loop and adding the noun chunks of the current doc to that last element. The two loops then keep parsing sentences and inserting the noun phrases they find.
A = ["I am a boy", "I am a girl"]
A_np = []
for x in A:
    doc = nlp(x)
    A_np.append([])
    for np in doc.noun_chunks:
        A_np[-1].append(np.text)
A_np
After creating the list of words from the sentences, removing noise and stop words, and bringing everything to the same case, you will be left with a set of words in the data.
Then you can load the model with
nlp = spacy.load('en', disable=['parser', 'ner'])
or, since the 'en' shortcut only works in spaCy v2, with
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
Then you can define a function that keeps only the nouns:
def filter_nouns(texts, tags=['NOUN']):
    output = []
    for x in texts:
        doc = nlp(" ".join(x))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output
Then you can apply this function to the cleaned data.
I hope it proves useful.