Tokenizing First Name and Last Name as one word - python

Input: Barack Obama is the President
Desired output: Who is the President?
The problem is that although spaCy recognizes Barack Obama as one person, the text was tokenized into single words at an earlier stage, so Barack Obama had already been separated into two tokens, i.e. "Barack" and "Obama".
Attached is my sample code:
import spacy
from nltk import word_tokenize
nlp = spacy.load('en_core_web_sm')
text = 'Barack Obama is the President'
BreakText = word_tokenize(text)
document = nlp(text)
person = []
for ent in document.ents:
    if ent.label_ == 'PERSON':
        person.append(ent)
k = person[0]
j = BreakText.index(str(k))
BreakText[j] = 'Who'
Final = " ".join(BreakText)
print(Final + "?")
Or is there another way around this to get my desired output?
UPDATE: this works!
k = person[0]
o = text.replace(str(k), 'Who')
print(o + "?")
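Note that str.replace substitutes every occurrence of the matched text. A slightly stricter variant (a sketch, using the entity's character offsets instead of string search) would be:
k = person[0]
o = text[:k.start_char] + 'Who' + text[k.end_char:]
print(o + "?")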

spaCy will give you the full text of the entity with ent.text.
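For example, a minimal sketch building on the question's code:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Barack Obama is the President')
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        # ent.text gives the full entity string, e.g. 'Barack Obama'
        print(doc.text.replace(ent.text, 'Who') + '?')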

You are describing named entity recognition (NER), rather than tokenization.
Chapter 7 of the NLTK book describes how NER builds on tokenization, moving through part-of-speech tagging to entity recognition.
http://www.nltk.org/book/ch07.html
nltk.ne_chunk() is most likely the functionality you are interested in.
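For example, a minimal sketch of that pipeline (my example, assuming the usual NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded via nltk.download):
import nltk

text = 'Barack Obama is the President'
tokens = nltk.word_tokenize(text)   # 1. tokenization
tagged = nltk.pos_tag(tokens)       # 2. part-of-speech tagging
tree = nltk.ne_chunk(tagged)        # 3. named entity chunking
for chunk in tree:
    if isinstance(chunk, nltk.tree.Tree) and chunk.label() == 'PERSON':
        print(' '.join(word for word, tag in chunk))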

Related

Extracting start and end indices of a token using spacy

I am looking at lots of sentences and want to extract the start and end indices of a word in a given sentence.
For example, the input is as follows:
"This is a sentence written in English by a native English speaker."
What I want is the span of the word 'English', which in this case is (30, 37) and (50, 57).
Note: I was pointed to this answer (Get position of word in sentence with spacy), but it doesn't solve my problem: it helps me get the start character of the token, but not the end index.
All help appreciated
You can do this with re in pure Python:
s="This is a sentence written in english by a native English speaker."
import re
[(i.start(), i.end()) for i in re.finditer('ENGLISH', s.upper())]
#output
[(30, 37), (50, 57)]
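Equivalently (a minor variant), you can pass re.IGNORECASE instead of upper-casing both sides:
[(m.start(), m.end()) for m in re.finditer('english', s, flags=re.IGNORECASE)]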
You can do it in spaCy as well:
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp("This is a sentence written in english by a native English speaker.")
for ent in doc.ents:
    if ent.text.upper() == 'ENGLISH':
        print(ent.start_char, ent.end_char)
Using the idea from the answer you link, you could do something like this:
from spacy.lang.en import English
nlp = English()
s = nlp("This is a sentence written in english by a native English speaker")
boundaries = []
for idx, i in enumerate(s[:-1]):
    if i.text.lower() == "english":
        boundaries.append((i.idx, s[idx+1].idx-1))
You can simply do it like this using spaCy, which does not need any check for the last token (unlike @giovanni's solution):
import spacy

nlp = spacy.load("en_core_web_sm")

def get_char_span(input_txt):
    doc = nlp(input_txt)
    for i, token in enumerate(doc):
        start_i = token.idx
        end_i = start_i + len(token.text)
        # token index and the token
        print(i, token)
        # character span
        print((start_i, end_i))
        # verifying it in the original input_txt
        print(input_txt[start_i:end_i])
inp = "My name is X, what's your name?"
get_char_span(inp)

How to ask spaCy PhraseMatcher to match all the tokens in the list?

I have the following algorithm:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
CAT = [nlp.make_doc(text) for text in ['pension', 'underwriter', 'health', 'client']]
phrase_matcher.add("CATEGORY 1", None, *CAT)
text = 'The client works as a marine assistant underwriter. He has recently opted to stop paying into his pension. '
doc = nlp(text)
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
    span = doc[start:end]  # get the matched slice of the doc
    print(rule_id, span.text)
# Output
CATEGORY 1 client
CATEGORY 1 underwriter
CATEGORY 1 pension
Can I ask it to return results only when all of the words are found in the sentence? In that case I would expect to see nothing here, as 'health' is not part of the sentence.
Can I do this type of matching with PhraseMatcher, or do I need to change to another type of rule-based matching? Thank you.
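PhraseMatcher matches each pattern independently, so it has no built-in "all patterns must match" mode. One way to approximate it (a sketch building on the code above, not a built-in option) is to run the matcher and only report results when every pattern produced at least one match:
words = {'pension', 'underwriter', 'health', 'client'}
matched = {doc[start:end].text.lower() for _, start, end in matches}
if words <= matched:  # every category word was found in the text
    for match_id, start, end in matches:
        print(nlp.vocab.strings[match_id], doc[start:end].text)
else:
    print('Not all category words present; reporting nothing.')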

CV Parser name matching

I am using NLP with Python to find names in a string. I am able to find a name when I have a full name (first name and last name), but when the string contains only a first name, my code is not able to recognize it as a PERSON. Below is my code.
import re
import nltk
from nltk.corpus import stopwords

stop = stopwords.words('english')

string = """
Sriram is working as a python developer
"""

def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    #print(sentences)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            #print("Out Side ",chunk)
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    print("In Side ",chunk)
                    names.append(' '.join([c[0] for c in chunk]))
    return names

if __name__ == '__main__':
    names = extract_names(string)
    print(names)
My advice is to use StanfordNLP or spaCy for NER; using NLTK's ne_chunk is a little janky. StanfordNLP is more commonly used by researchers, but spaCy is easier to work with. Here is an example using spaCy to print the name of each named entity and its type:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> text = 'Sriram is working as a python developer'
>>> doc = nlp(text)
>>> for ent in doc.ents:
...     print(ent.text, ent.label_)
Sriram ORG
>>>
Note that it classifies Sriram as an organization, which may be because it is not a common English name and spaCy is trained on English corpora. Good luck!
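If you need particular names recognized as PERSON regardless, one workaround (my addition, not part of the original answer; assuming spaCy 2.1+, where EntityRuler was introduced) is to add them as explicit rules before the statistical NER:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Sriram'}])
nlp.add_pipe(ruler, before='ner')  # the NER respects entities set earlier in the pipeline

doc = nlp('Sriram is working as a python developer')
for ent in doc.ents:
    print(ent.text, ent.label_)
# Sriram PERSON
(In spaCy 3 the equivalent is ruler = nlp.add_pipe('entity_ruler', before='ner').)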

Unable to create a custom entity type/label using Matcher in Spacy 2

I am trying to create a custom entity label called FRUIT using the rule-based Matcher (i.e. adding on_match rules), following the spaCy guide. I'm using spaCy 2.0.11, so I believe the steps to do so have changed compared to spaCy 1.X
Example: doc = nlp('Tom wants to eat some apples at the United Nations')
Expected text and entity outputs:
Tom PERSON
apples FRUIT
the United Nations ORG
However, I seem to get the following error: [E084] Error assigning label ID 7429577500961755728 to span: not in StringStore. I have included my code below. When I change nlp.vocab.strings['FRUIT'] to nlp.vocab.strings['EVENT'], strangely it works but apples would be assigned the entity label EVENT. Anyone else encountering this issue?
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

doc = nlp('Tom wants to eat some apples at the United Nations')
FRUIT = nlp.vocab.strings['FRUIT']

def add_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entities. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    doc.ents += ((FRUIT, start, end),)

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'apples'}]
matcher.add('AddApple', add_ent, pattern)
matches = matcher(doc)

for ent in doc.ents:
    print(ent.text, ent.label_)
Oh okay, I think I found a solution. The label has to be added to nlp.vocab.strings if it is not there:
nlp.vocab.strings.add('FRUIT')
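For reference, a minimal end-to-end sketch with that fix applied (assuming spaCy 2.x, the question's version, where matcher.add takes the callback as its second argument; the signature changed in spaCy 3):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
nlp.vocab.strings.add('FRUIT')     # register the label before looking it up
FRUIT = nlp.vocab.strings['FRUIT']

def add_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((FRUIT, start, end),)

matcher = Matcher(nlp.vocab)
matcher.add('AddApple', add_ent, [{'LOWER': 'apples'}])

doc = nlp('Tom wants to eat some apples at the United Nations')
matcher(doc)
for ent in doc.ents:
    print(ent.text, ent.label_)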

How to extract noun adjective pairs from a sentence

I wish to extract noun-adjective pairs from this sentence. So, basically, I want something like:
(Mark, sincere) (John, sincere).
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "Mark and John are sincere employees at Google."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
spaCy's POS tagging would be better than NLTK's. It's faster and more accurate. Here is an example of what you want to do:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Mark and John are sincere employees at Google.')
noun_adj_pairs = []
for i, token in enumerate(doc):
    if token.pos_ not in ('NOUN', 'PROPN'):
        continue
    for j in range(i+1, len(doc)):
        if doc[j].pos_ == 'ADJ':
            noun_adj_pairs.append((token, doc[j]))
            break
noun_adj_pairs
Output:
[(Mark, sincere), (John, sincere)]
