I need to train spaCy's NER to recognize 2 new classes for named entity recognition; all I have are files with lists of items that are supposed to belong to the new classes.
For example: Rolling Stones, Muse, Arctic Monkeys - artists
Any idea how this can be done?
This seems like a perfect use case for Matcher or PhraseMatcher (if you care about performance).
import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches) - 1:
        return None
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='1', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Rolling'}, {spacy.attrs.ORTH: 'Stones'}]], on_match=merge_phrases)
matcher.add(entity_key='2', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Muse'}]], on_match=merge_phrases)
matcher.add(entity_key='3', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Arctic'}, {spacy.attrs.ORTH: 'Monkeys'}]], on_match=merge_phrases)

doc = nlp(u'The Rolling Stones are an English rock band formed in London in 1962. The first settled line-up consisted of Brian Jones, Ian Stewart, Mick Jagger, Keith Richards, Bill Wyman and Charlie Watts')
matcher(doc)

for ent in doc.ents:
    print(ent)
See the documentation for more details. In my experience, with 400k entities in a Matcher it takes almost a second to match each document.
PhraseMatcher is much, much faster, but a bit trickier to use. Note that it is a "strict" matcher: it won't match any entities it hasn't seen before.
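The code above uses the old spaCy 1.x API. Under a current spaCy 3.x install, a minimal PhraseMatcher sketch of the same idea could look like the following (the model name en_core_web_sm and the artist list are placeholders, not part of the original answer):

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")  # assumed model; use whichever pipeline you have

# Build one pattern Doc per artist name; make_doc only tokenizes, so this stays fast.
artists = ["Rolling Stones", "Muse", "Arctic Monkeys"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ARTIST", [nlp.make_doc(name) for name in artists])

doc = nlp("The Rolling Stones are an English rock band formed in London in 1962.")

# Convert matches to labelled Spans and keep only the non-overlapping ones as entities.
spans = [Span(doc, start, end, label="ARTIST") for _, start, end in matcher(doc)]
doc.ents = spacy.util.filter_spans(spans)
for ent in doc.ents:
    print(ent.text, ent.label_)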
I have a bunch of regexes like these:
(For simplicity the regex patterns here are very easy; in the real case the regexes are very long and barely comprehensible, since they are created automatically by another tool.)
I want to create spans in a doc based on those regexes.
This is the code:
import spacy
from spacy.tokens import Doc, Span, Token
import re

rx1 = ["blue", "blue print"]
text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."

my_regexes = {'blue': ["blue", "blue print"],
              'red': ["red", "infra red"]}

nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)

for name, rxs in my_regexes.items():
    doc.spans[name] = []
    for rx in rxs:
        for i, match in enumerate(re.finditer(rx, doc.text)):
            start, end = match.span()
            # char_span returns a Span, or None if the match doesn't map to a valid token sequence
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is not None:
                span_to_add = Span(doc, span.start, span.end, label=name + str(i))
                doc.spans[name].append(span_to_add)
                print("Found match:", name, start, end, span.text)
It works.
Now I want to filter the spans in a way that when a sequence of tokens (for instance "infra red") contains another span ("red"), only the longest one is kept.
I saw this:
How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?
but that looks to be for a Matcher, and I cannot make it work in my case, since I would like to remove the shorter token Span from the document.
Any idea?
spacy.util.filter_spans will do this. The answer is the same as the linked question, where matcher results are converted to spans in order to filter them with this function.
doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
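For instance, keeping the question's setup as-is, the filtering step could be applied per span group like this (a minimal sketch; filter_spans keeps the longest span when spans overlap):

import spacy

# Assuming `doc` and `my_regexes` are built as in the question above.
for name in my_regexes:
    # Drop overlapping spans, keeping the longest one ("infra red" wins over "red").
    doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
    print(name, [s.text for s in doc.spans[name]])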
After using an Amazon review scraper to build this data frame, I called nlp on each review in order to tokenize it and create a new column, 'doc', containing the processed reviews.
However, now I am trying to create a pattern in order to analyze the reviews in the doc column, but I keep getting no matches, which makes me think I'm missing one more pre-processing step, or perhaps not pointing the matcher in the right direction.
While the following code executes without any errors, I receive a matches list with 0 entries, even though I know the word exists in the doc column. The docs for spaCy are still a tad slim, and I'm not too sure the matcher.add call is correct, as the one specified in the tutorial
matcher.add("Name_of_List", None, pattern)
returns an error saying that only 2 arguments are required for this class.
source -- https://course.spacy.io/en/chapter1
Question: What do I need to change to accurately analyze the df doc column for the pattern created?
Thanks!
Full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md')

df = pd.read_csv('paper_towel_US.csv')

# calling on NLP to return processed doc for each review
df['doc'] = [nlp(body) for body in df.body]

# Sum the number of tokens in each Doc
df['num_tokens'] = [len(token) for token in df.doc]

# calling matcher to create pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])

def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return ((span.start, span.end, span.text))

df['doc'].apply(find_matches)
df sample for reproduction via df.iloc[596:600, :].to_clipboard(sep=',')
,product,title,rating,body,doc,num_tokens
596,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Awesome!,5,Great towels!,Great towels!,3
597,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Good buy!,5,Love these,Love these,2
598,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Meh,3,"Does not clean countertop messes well. Towels leave a large residue. They are durable, though","Does not clean countertop messes well. Towels leave a large residue. They are durable, though",18
599,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Exactly as Described. Packaged Well and Mailed Promptly,4,Exactly as Described. Packaged Well and Mailed Promptly,Exactly as Described. Packaged Well and Mailed Promptly,9
You are trying to get the matches from the "df.doc" string with doc = nlp("df.doc"). You need to extract matches from the df['doc'] column instead.
An example solution is to remove doc = nlp("df.doc") and apply the matcher to each Doc in df['doc'] instead, here with nlp = spacy.load('en_core_web_sm'):
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return ((span.start, span.end, span.text))
>>> df['doc'].apply(find_matches)
0 None
1 (0, 2, Love these)
2 None
3 None
Name: doc, dtype: object
Full code snippet:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
df = pd.read_csv(r'C:\Users\admin\Desktop\s.txt')

# calling on NLP to return processed doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Sum the number of tokens in each Doc
df['num_tokens'] = [len(token) for token in df.doc]

# calling matcher to create pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])

#doc = nlp("df.doc")
#matches = matcher(doc)

def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return ((span.start, span.end, span.text))

print(df['doc'].apply(find_matches))
I want to categorize the following keywords:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)

cat_patterns = [nlp(text) for text in ('cat', 'cute', 'fat')]
dog_patterns = [nlp(text) for text in ('dog', 'fat')]

matcher = PhraseMatcher(nlp.vocab)
matcher.add('Category1', None, *cat_patterns)
matcher.add('Category2', None, *dog_patterns)

doc = nlp("I have a white cat. It is cute and fat; I have a black dog. It is fat,too")
matches = matcher(doc)

for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
    span = doc[start:end]                  # get the matched slice of the doc
    print(rule_id, span.text)

#Output
#Category1 cat
#Category1 cute
#Category1 fat
#Category2 fat
#Category2 dog
#Category1 fat
#Category2 fat
However, my expected output is that if the text contains cat and cute, or cat and fat, together, it falls into the first category; and if the text contains dog and fat together, it falls into the second category.
#Category1 cat cute
#Category1 cat fat
#Category2 dog fat
Is it possible to do this using a similar algorithm? Thank you.
From the spaCy documentation on Matchers (https://spacy.io/usage/rule-based-matching), there is no way to detect 2 different tokens separated by an arbitrary number of tokens. If you knew how many tokens were between "cat" and "fat", for example, then you could use wildcard patterns (https://spacy.io/usage/rule-based-matching#adding-patterns-wildcard), but from your example it looks like the distance between tokens can vary.
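For reference, a fixed-gap wildcard pattern of the kind described above might look like this sketch (it assumes exactly two tokens between the two words, and uses the newer spaCy 3 style of Matcher.add):

from spacy.matcher import Matcher

# A token-based Matcher (not PhraseMatcher) is needed for wildcard patterns.
wild_matcher = Matcher(nlp.vocab)

# "cat", exactly two arbitrary tokens, then "fat", e.g. "cat is very fat".
wild_matcher.add("CAT_THEN_FAT", [[{"LOWER": "cat"}, {}, {}, {"LOWER": "fat"}]])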
Two solutions that I can see to solve your problem:
Keep track of matches in your for loop using some sort of data structure. If all the tokens you are looking for end up being found, then add that match to your final results (see the sketch after this list).
Use regular expressions to detect what you are looking for. spaCy does have great tools for rule-based matching, but it looks like you aren't using any linguistic aspects of the words you are searching for. A simple regex like /cat.*?fat/ will find the matches you are looking for.
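As a rough sketch of the first approach (the category-to-keyword combinations are taken from the question; the model name is a placeholder): collect every keyword the PhraseMatcher finds in the text, then check which combinations are fully covered.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # assumed model

# Which keyword combinations define each category (from the question).
category_rules = {
    "Category1": [{"cat", "cute"}, {"cat", "fat"}],
    "Category2": [{"dog", "fat"}],
}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KEYWORD", [nlp.make_doc(w) for w in ("cat", "cute", "fat", "dog")])

doc = nlp("I have a white cat. It is cute and fat; I have a black dog. It is fat, too")

# Collect every keyword that occurred anywhere in the text.
found = {doc[start:end].text.lower() for _, start, end in matcher(doc)}

# A category applies if all keywords of at least one of its combinations were found.
for category, combos in category_rules.items():
    for combo in combos:
        if combo <= found:
            print(category, " ".join(sorted(combo)))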
I am trying to create a custom entity label called FRUIT using the rule-based Matcher (i.e. adding on_match rules), following the spaCy guide. I'm using spaCy 2.0.11, so I believe the steps to do so have changed compared to spaCy 1.X
Example: doc = nlp('Tom wants to eat some apples at the United Nations')
Expected text and entity outputs:
Tom PERSON
apples FRUIT
the United Nations ORG
However, I seem to get the following error: [E084] Error assigning label ID 7429577500961755728 to span: not in StringStore. I have included my code below. When I change nlp.vocab.strings['FRUIT'] to nlp.vocab.strings['EVENT'], strangely it works but apples would be assigned the entity label EVENT. Anyone else encountering this issue?
doc = nlp('Tom wants to eat some apples at the United Nations')

FRUIT = nlp.vocab.strings['FRUIT']

def add_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    doc.ents += ((FRUIT, start, end),)

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'apples'}]
matcher.add('AddApple', add_ent, pattern)
matches = matcher(doc)

for ent in doc.ents:
    print(ent.text, ent.label_)
Oh okay, I think I found a solution. The label has to be added to nlp.vocab.strings if it is not there:
nlp.vocab.strings.add('FRUIT')
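In context, that just means registering the label in the StringStore before looking up its hash; a minimal sketch of the corrected order (spaCy 2.x, as in the question):

# Register the custom label first, then look it up as before.
nlp.vocab.strings.add('FRUIT')
FRUIT = nlp.vocab.strings['FRUIT']

# The rest of the question's code stays the same: the on_match callback
# appends (FRUIT, start, end) to doc.ents.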
I have a bunch of unrelated paragraphs, and I need to traverse them to find similar occurrences, such that, given a search where I look for object falls, I get a boolean True for text containing:
Box fell from shelf
Bulb shattered on the ground
A piece of plaster fell from the ceiling
And False for:
The blame fell on Sarah
The temperature fell abruptly
I am able to use nltk to tokenise, tag and get Wordnet synsets, but I am finding it hard to figure out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking for synsets? Should I write a context-free grammar? Is there a best practice when translating from treebank tags to Wordnet grammar tags? None of this is explained in the nltk book, and I couldn't find it in the nltk cookbook either.
Bonus points for answers that include pandas in the answer.
[ EDIT ]:
Some code to get things started
In [1]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series

def tag(x):
    return pos_tag(word_tokenize(x))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly']

ser = Series(phrases)
ser.map(tag)
Out[1]:
0 [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
1 [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
2 [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
3 [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
4 [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
5 [(The, DT), (temperature, NN), (fell, VBD), (a...
dtype: object
The way I would do it is the following:
Use nltk to find nouns followed by one or two verbs. In order to match your exact specifications I would use Wordnet:
The only nouns (NN, NNP, PRP, NNS) that should be found are the ones that are in a semantic relation with "physical" or "material" and the only verbs (VB, VBZ, VBD, etc...) that should be found are the ones that are in a semantic relation with "fall".
I mentioned "one or two verbs" because a verb can be preceded by an auxiliary. What you could also do is create a dependency tree to spot subject-verb relations, but it does not seem to be necessary in this case.
You might also want to make sure you exclude location names and keep person names (because you would accept "John has fallen" but not "Berlin has fallen"). This can also be done with Wordnet: locations have the tag 'noun.location'.
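A minimal sketch of that location check, assuming NLTK's WordNet corpus is installed ('noun.location' is the lexicographer file name, exposed as lexname()):

from nltk.corpus import wordnet as wn

def is_location(word):
    # A word counts as a location if any of its noun synsets sits in the
    # 'noun.location' lexicographer file.
    return any(s.lexname() == 'noun.location' for s in wn.synsets(word, pos='n'))

print(is_location('Berlin'))  # True
print(is_location('John'))    # False for WordNet's senses of 'John'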
I am not sure in which context you would have to convert the tags, so I cannot provide a proper answer to that; it seems to me that you might not need it in this case: you use the POS tags to identify nouns and verbs, and then you check whether each noun and verb belongs to the relevant synsets.
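If you do end up needing that conversion, a common minimal mapping from Penn Treebank tag prefixes to Wordnet POS constants looks like this sketch (anything not covered returns None):

from nltk.corpus import wordnet as wn

def treebank_to_wordnet(treebank_tag):
    # Map a Penn Treebank tag to the corresponding Wordnet POS constant.
    if treebank_tag.startswith('NN'):
        return wn.NOUN   # 'n'
    if treebank_tag.startswith('VB'):
        return wn.VERB   # 'v'
    if treebank_tag.startswith('JJ'):
        return wn.ADJ    # 'a'
    if treebank_tag.startswith('RB'):
        return wn.ADV    # 'r'
    return None  # e.g. pronouns, determiners, prepositions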
Hope this helps.
Not perfect, but most of the work is there. Now on to hardcoding pronouns (such as 'it') and closed-class words and adding multiple targets to handle things like 'shattered'. Not a single-liner, but not an impossible task!
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series, DataFrame
import collections
from nltk import wordnet

wn = wordnet.wordnet

def tag(x):
    return pos_tag(word_tokenize(x))

def flatten(l):
    for el in l:
        if isinstance(el, collections.Iterable) and not isinstance(el, basestring):
            for sub in flatten(el):
                yield sub
        else:
            yield el

def noun_verb_match(phrase, nouns, verbs):
    res = []
    for i in range(len(phrase) - 1):
        if (phrase[i][1] in nouns) &\
           (phrase[i + 1][1] in verbs):
            res.append((phrase[i], phrase[i + 1]))
    return res

def hypernym_paths(word, pos):
    res = [x.hypernym_paths() for x in wn.synsets(word, pos)]
    return set(flatten(res))

def bool_syn(double, noun_syn, verb_syn):
    """
    Returns boolean if noun/verb double contains the target Wordnet Synsets.

    Arguments:
    double: ((noun, tag), (verb, tag))
    noun_syn, verb_syn: Wordnet Synset string (i.e., 'travel.v.01')
    """
    noun = double[0][0]
    verb = double[1][0]
    noun_bool = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
    verb_bool = wn.synset(verb_syn) in hypernym_paths(verb, 'v')
    return noun_bool & verb_bool

def bool_loop(l, f):
    """
    Tests all list elements for truthiness and
    returns True if any is True.

    Arguments:
    l: List of elements to test.
    f: Function returning boolean.
    """
    if len(l) == 0:
        return False
    else:
        return f(l[0]) | bool_loop(l[1:], f)

def bool_noun_verb(series, nouns, verbs, noun_synset_target, verb_synset_target):
    tagged = series.map(tag)
    nvm = lambda x: noun_verb_match(x, nouns, verbs)
    matches = tagged.apply(nvm)
    bs = lambda x: bool_syn(x, noun_synset_target, verb_synset_target)
    return matches.apply(lambda x: bool_loop(x, bs))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly',
           'It fell on the floor']

nouns = "NN NNP PRP NNS".split()
verbs = "VB VBD VBZ".split()
noun_synset_target = 'artifact.n.01'
verb_synset_target = 'travel.v.01'

df = DataFrame()
df['text'] = Series(phrases)
df['fall'] = bool_noun_verb(df.text, nouns, verbs, noun_synset_target, verb_synset_target)
df