I want the output to be ["good customer service", "great ambience"], but I am getting ["good customer", "good customer service", "great ambience"], because the pattern also matches "good customer", a phrase that doesn't make sense on its own. How can I remove these kinds of duplicates?
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)
# Create a pattern: an adjective followed by one or more nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
You can post-process the matches by grouping the tuples by their start index and keeping only the one with the largest end index:
from itertools import groupby
#...
matches = matcher(doc)
results = [max(list(group), key=lambda x: x[2]) for key, group in groupby(matches, lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']
The groupby(matches, lambda prop: prop[1]) groups the matches by their start index, here resulting in [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and [(5488211386492616699, 4, 6)]. max(list(group), key=lambda x: x[2]) then grabs the item whose end index (the third tuple element) is the largest.
spaCy has a built-in function to do just that. Check filter_spans:
The documentation says:
When spans overlap, the (first) longest span is preferred over shorter spans.
Example:
from spacy.util import filter_spans

doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
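Applied to the original question, a minimal sketch (assuming spaCy v3, where the helper lives in spacy.util) converts the match tuples to spans and filters them:

from spacy.util import filter_spans

matches = matcher(doc)
spans = [doc[start:end] for match_id, start, end in matches]
print("Matches:", [span.text for span in filter_spans(spans)])
# => Matches: ['good customer service', 'great ambience']

In spaCy v3 you can also pass greedy="LONGEST" to matcher.add (matcher.add("ADJ_NOUN_PATTERN", [pattern], greedy="LONGEST")) so the matcher itself keeps only the longest of the overlapping matches.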
Related
I have a bunch of regexes like this (for simplicity the regex patterns here are very easy; in the real case they are very long and barely comprehensible, since they are created automatically by another tool).
I want to create spans in a doc based on those regexes.
This is the code:
import spacy
from spacy.tokens import Doc, Span, Token
import re
rx1 = ["blue","blue print"]
text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."
my_regexes = {'blue': ["blue", "blue print"],
              'red': ["red", "infra red"]}
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)
for name, rxs in my_regexes.items():
    doc.spans[name] = []
    for rx in rxs:
        for i, match in enumerate(re.finditer(rx, doc.text)):
            start, end = match.span()
            # char_span returns a Span, or None if the match doesn't map to a valid token sequence
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is not None:
                span_to_add = Span(doc, span.start, span.end, label=name + str(i))
                doc.spans[name].append(span_to_add)
                print("Found match:", name, start, end, span.text)
It works.
Now I want to filter the spans so that when a series of tokens (for instance "infra red") contains another span ("red"), only the longest one is kept.
I saw this:
How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?
but that looks to be for a Matcher, and I cannot make it work in my case, since I want to eliminate the shorter Span from the document.
Any idea?
spacy.util.filter_spans will do this. The answer is the same as the linked question, where matcher results are converted to spans in order to filter them with this function.
doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
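For the dict of span groups built in the question, a minimal sketch of the full loop (reusing the my_regexes keys from the question's code):

for name in my_regexes:
    doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
# the "red" span inside "infra red" is dropped; the standalone "red" match is kept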
I am executing the code below to extract the list of names from text1 (text1 is the merged text of several PDFs).
But the code returns just one name out of the complete input.
I tried changing the patterns, but that didn't work.
Code:
import spacy
from spacy.matcher import Matcher
# load pre-trained model
nlp = spacy.load('en_core_web_sm')
# initialize matcher with a vocab
matcher = Matcher(nlp.vocab)
def extract_name(resume_text):
    nlp_text = nlp(resume_text)
    # First name and last name are always proper nouns
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    matcher.add('NAME', [pattern], on_match=None)
    matches = matcher(nlp_text)
    for match_id, start, end in matches:
        span = nlp_text[start:end]
    return span.text
Execution: extract_name(text1)
Output: 'VIKRAM RATHOD'
Expected output: a list of all names in text1
To answer your question, first add the matcher declaration:
self._nlp = spacy.load("en_core_web_lg")
self._matcher = Matcher(self._nlp.vocab)
As a general best practice, remove all punctuation first:
import string

table = str.maketrans(string.punctuation, ' ' * 32)  # replace punctuation with spaces
sentence = sentence.translate(table).strip()
To catch middle names, add:
pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN', 'OP': '*'}, {'POS': 'PROPN'}]
Now loop over all the matches and add them to a dict:
New_list_of_matches = {}
for match_id, start, end in matches:
    string_id = self.NlpObj._nlp.vocab.strings[match_id]  # get string representation
    span = str(self.NlpObj._doc[start:end]).split()
    if string_id in New_list_of_matches:
        if len(span) > New_list_of_matches[string_id]['lenofSpan']:
            New_list_of_matches[string_id] = {'span': span, 'lenofSpan': len(span)}
    else:
        New_list_of_matches[string_id] = {'span': span, 'lenofSpan': len(span)}
It is important to keep the length of the span; that way you can distinguish names with two words from names with three words (a middle name).
Now:
for keys, items in New_list_of_matches.items():
    if keys == 'NAME':
        if len(items['span']) == 2:
            Name = items['span'][items['lenofSpan'] - 2] + ' ' + items['span'][items['lenofSpan'] - 1]
        elif len(items['span']) == 3:
            Name = items['span'][items['lenofSpan'] - 3] + ' ' + items['span'][items['lenofSpan'] - 2] + ' ' + items['span'][items['lenofSpan'] - 1]
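Since the expected output was a list of every name rather than a single string, here is a minimal sketch of that variant (a hypothetical extract_names function; it reuses the spacy.util.filter_spans trick from the first question on this page to drop overlapping sub-matches):

from spacy.matcher import Matcher
from spacy.util import filter_spans

def extract_names(resume_text):
    nlp_text = nlp(resume_text)
    matcher = Matcher(nlp.vocab)  # fresh matcher so patterns aren't added twice
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN', 'OP': '+'}]
    matcher.add('NAME', [pattern])
    spans = [nlp_text[start:end] for match_id, start, end in matcher(nlp_text)]
    return [span.text for span in filter_spans(spans)]  # keeps the longest match per name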
I'm trying to set up a matcher that finds the phrase 'iPhone X'.
The sample code says I should do it as follows.
import spacy
# Import the Matcher
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
I tried another approach, shown below.
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
Why is the second approach not working? I assumed that if I put the two words 'iPhone' and 'X' together, it would work the same way, treating the words with a space in the middle as one long unique word. But it didn't.
The only possible reason I could think of is that
a matcher condition has to describe a single word, without whitespace.
Am I right, or is there another reason the second approach isn't working?
Thank you.
The answer lies in how spaCy tokenizes the string:
>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']
As you can see, iPhone and X are separate tokens. See the Matcher reference:
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
Thus, you cannot use them both in one token definition.
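If you would rather write the phrase as a single string, spaCy's PhraseMatcher (which also appears in the next question) accepts whole phrases and tokenizes them for you. A minimal sketch, assuming the same nlp and doc as above:

from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("IPHONE_X_PATTERN", [nlp.make_doc("iPhone X")])
matches = phrase_matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
# => Matches: ['iPhone X']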
I have the following code:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
CAT = [nlp.make_doc(text) for text in ['pension', 'underwriter', 'health', 'client']]
phrase_matcher.add("CATEGORY 1",None, *CAT)
text = 'The client works as a marine assistant underwriter. He has recently opted to stop paying into his pension. '
doc = nlp(text)
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the string name of the rule, i.e. 'CATEGORY 1'
    span = doc[start:end]  # get the matched slice of the doc
    print(rule_id, span.text)
# Output
CATEGORY 1 client
CATEGORY 1 underwriter
CATEGORY 1 pension
Can I ask it to return results only when all of the words are found in the sentence? I would expect to see nothing here, since 'health' is not part of the text.
Can I do this type of matching with PhraseMatcher, or do I need to switch to another type of rule-based matching? Thank you.
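PhraseMatcher reports each phrase independently, so it won't do this on its own. One option is post-processing: a sketch of my own (reusing CAT, doc and matches from the code above) that accepts the matches only when every phrase in the category was found:

required = {d.text.lower() for d in CAT}  # every phrase in the category
found = {doc[start:end].text.lower() for match_id, start, end in matches}
if required.issubset(found):
    for match_id, start, end in matches:
        print(nlp.vocab.strings[match_id], doc[start:end].text)
# prints nothing here, because 'health' never occurs in the text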
I am looking for a tokenizer that expands contractions.
Using nltk to split a phrase into tokens, the contraction is not expanded:
nltk.word_tokenize("she's")
-> ['she', "'s"]
However, when using a dictionary with contraction mappings only, and therefore not taking any information provided by surrounding words into account, it's not possible to decide whether "she's" should be mapped to "she is" or to "she has".
Is there a tokenizer that provides contraction expansion?
You can do rule-based matching with spaCy to take the information provided by surrounding words into account.
I wrote some demo code below, which you can extend to cover more cases:
import spacy
from spacy.matcher import Matcher

sentences = ["now she's a software engineer", "she's got a cat", "he's a tennis player", "He thinks that she's 30 years old"]
nlp = spacy.load('en_core_web_sm')

def normalize(sentence):
    ans = []
    doc = nlp(sentence)
    matcher = Matcher(nlp.vocab)
    # "'s" followed by "got" or "been" expands to "has"
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}]
    matcher.add("case_has", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}]
    matcher.add("case_has", [pattern])
    # "'s" followed by a determiner or a digit expands to "is"
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}]
    matcher.add("case_is", [pattern])
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}]
    matcher.add("case_is", [pattern])
    # .. add more cases
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        for idx, t in enumerate(doc):
            if string_id == 'case_has' and t.text == "'s" and start <= idx < end:
                ans.append("has")
            elif string_id == 'case_is' and t.text == "'s" and start <= idx < end:
                ans.append("is")
            else:
                ans.append(t.text)
    return ' '.join(ans)
for s in sentences:
    print(s)
    print(normalize(s))
    print()
output:
now she's a software engineer
now she is a software engineer
she's got a cat
she has got a cat
he's a tennis player
he is a tennis player
He thinks that she's 30 years old
He thinks that she is 30 years old