I wish to extract noun-adjective pairs from this sentence, so basically I want something like:
(Mark, sincere), (John, sincere).
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Mark and John are sincere employees at Google."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
spaCy's POS tagging would be a better choice than NLTK's: it is faster and more accurate. Here is an example of what you want to do:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Mark and John are sincere employees at Google.')
noun_adj_pairs = []
for i, token in enumerate(doc):
    if token.pos_ not in ('NOUN', 'PROPN'):
        continue
    # pair the noun with the first adjective that appears after it
    for j in range(i + 1, len(doc)):
        if doc[j].pos_ == 'ADJ':
            noun_adj_pairs.append((token, doc[j]))
            break
noun_adj_pairs
Output:
[(Mark, sincere), (John, sincere)]
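A variant worth knowing about (a sketch, not part of the original answer) pairs each adjective with the noun it syntactically modifies via the dependency parse. Note that for this sentence it links sincere to employees rather than to the names, so the positional loop above is closer to the asker's desired output:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Mark and John are sincere employees at Google.')
# 'amod' marks an adjective that directly modifies its noun head
dep_pairs = [(tok.head, tok) for tok in doc if tok.dep_ == 'amod' and tok.pos_ == 'ADJ']
print(dep_pairs)  # e.g. [(employees, sincere)]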
I am looking at lots of sentences and want to extract the start and end indices of a word in a given sentence.
For example, the input is as follows:
"This is a sentence written in English by a native English speaker."
And what I want is the span of the word 'English', which in this case is (30, 37) and (50, 57).
Note: I was pointed to this answer (Get position of word in sentence with spacy),
but it doesn't solve my problem: it can help me get the start character of the token, but not the end index.
All help appreciated.
You can do this with re in pure Python:
import re

s = "This is a sentence written in english by a native English speaker."
# upper-case both pattern and string for a case-insensitive match
[(i.start(), i.end()) for i in re.finditer('ENGLISH', s.upper())]
# output
[(30, 37), (50, 57)]
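If you want to restrict the matches to the whole word (so that, say, "Englishman" would not be counted), a case-insensitive pattern with word boundaries is a small extension; this is just a sketch along the same lines:
import re

s = "This is a sentence written in english by a native English speaker."
# \b anchors the match at word boundaries; re.IGNORECASE covers both casings
print([(m.start(), m.end()) for m in re.finditer(r'\benglish\b', s, re.IGNORECASE)])
# [(30, 37), (50, 57)]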
You can do it in spaCy as well:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence written in english by a native English speaker.")
for ent in doc.ents:
    # this relies on 'English' being recognised as an entity (e.g. LANGUAGE/NORP)
    if ent.text.upper() == 'ENGLISH':
        print(ent.start_char, ent.end_char)
Using the idea from the answer you linked, you could do something like this:
from spacy.lang.en import English

nlp = English()
s = nlp("This is a sentence written in english by a native English speaker")
boundaries = []
for idx, i in enumerate(s[:-1]):
    if i.text.lower() == "english":
        # the end is taken as one character before the next token's start offset
        boundaries.append((i.idx, s[idx + 1].idx - 1))
You can simply do it like this using spaCy, which does not need any check for the last token (unlike #giovanni's solution):
import spacy

nlp = spacy.load("en_core_web_sm")

def get_char_span(input_txt):
    doc = nlp(input_txt)
    for i, token in enumerate(doc):
        start_i = token.idx
        end_i = start_i + len(token.text)
        # token index and the token
        print(i, token)
        # character span
        print((start_i, end_i))
        # verifying it against the original input_txt
        print(input_txt[start_i:end_i])

inp = "My name is X, what's your name?"
get_char_span(inp)
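If you only care about one target word, as in the question, a small variation of the same idea (just a sketch) collects only the matching character spans:
import spacy

nlp = spacy.load("en_core_web_sm")

def get_word_spans(input_txt, target):
    doc = nlp(input_txt)
    # a token's end offset is its start offset plus its length
    return [(t.idx, t.idx + len(t.text)) for t in doc if t.text.lower() == target.lower()]

print(get_word_spans("This is a sentence written in english by a native English speaker.", "English"))
# [(30, 37), (50, 57)]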
I am using NLP with Python to find names in a string. I am able to find a name when I have a full name (first name and last name), but when the string contains only a first name my code does not recognize it as a PERSON. Below is my code.
import re
import nltk
from nltk.corpus import stopwords

stop = stopwords.words('english')

string = """
Sriram is working as a python developer
"""

def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    #print(sentences)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            #print("Out Side ", chunk)
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    print("In Side ", chunk)
                    names.append(' '.join([c[0] for c in chunk]))
    return names

if __name__ == '__main__':
    names = extract_names(string)
    print(names)
My advice is to use StanfordNLP or spaCy for NER; using nltk's ne_chunk is a little janky. StanfordNLP is more commonly used by researchers, but spaCy is easier to work with. Here is an example using spaCy to print the name of each named entity and its type:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> text = 'Sriram is working as a python developer'
>>> doc = nlp(text)
>>> for ent in doc.ents:
...     print(ent.text, ent.label_)
...
Sriram ORG
>>>
Note that it classifies Sriram as an organization, which may be because it is not a common English name and spaCy is trained on English corpora. Good luck!
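If you know in advance which uncommon names to expect, one workaround (a sketch, assuming spaCy v3 and its entity_ruler component) is to add a rule ahead of the statistical NER so that the name is always tagged as PERSON:
import spacy

nlp = spacy.load('en_core_web_sm')
# insert a rule-based matcher before the statistical NER component
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Sriram'}])

doc = nlp('Sriram is working as a python developer')
for ent in doc.ents:
    print(ent.text, ent.label_)  # expected: Sriram PERSON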
I am trying to get all the words in the WordNet dictionary that are of type noun and category food.
I have found a way to check whether a word is a noun.food, but I need the reverse method:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def if_food(word):
    syns = wn.synsets(word, pos=wn.NOUN)
    for syn in syns:
        print(syn.lexname())
        if 'food' in syn.lexname():
            return 1
    return 0
So I think I have found a solution:
# Using the NLTK WordNet dictionary, check if the word is a noun and a food.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def if_food(word):
    syns = wn.synsets(str(word), pos=wn.NOUN)
    for syn in syns:
        if 'food' in syn.lexname():
            return 1
    return 0
Then, using the qdapDictionaries::GradyAugmented R English-word dictionary, I checked each word for being a noun.food:
import pandas as pd

en_dict = pd.read_csv("GradyAugmentedENDict.csv")
en_dict['is_food'] = en_dict.word.apply(if_food)
en_dict[en_dict.is_food == 1].to_csv("en_dict_is_food.csv")
It actually did the job.
Hope it will help others.
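The "reverse method" the question asks for can also be sketched directly over WordNet, without an external word list, by enumerating every noun synset whose lexicographer file is noun.food (just a sketch of the idea):
from nltk.corpus import wordnet as wn

food_words = set()
for syn in wn.all_synsets(pos=wn.NOUN):   # every noun synset in WordNet
    if syn.lexname() == 'noun.food':      # lexicographer category 'food'
        for lemma in syn.lemmas():
            food_words.add(lemma.name().replace('_', ' '))

print(len(food_words), sorted(food_words)[:10])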
Input: Barack Obama is the President
(Desired) Output: Who is the President?
The problem is that although spaCy recognizes Barack Obama as one person, the text was tokenized in an earlier stage, so Barack Obama had already been separated into two words, i.e. "Barack" and "Obama".
Attached is my sample code:
import spacy
from nltk import word_tokenize

nlp = spacy.load('en_core_web_sm')
text = 'Barack Obama is the President'
BreakText = word_tokenize(text)
document = nlp(text)

person = []
for ent in document.ents:
    if ent.label_ == 'PERSON':
        person.append(ent)

k = person[0]
j = BreakText.index(str(k))
BreakText[j] = 'Who'
Final = " ".join(BreakText)
print(Final + "?")
Or is there another way to get my desired output?
UPDATE: this works!
k = person[0]
o = text.replace(str(k), 'Who')
print(o + "?")
spaCy will give you the full text of the entity with ent.text.
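For example, one way to build on that (a minimal sketch, not the asker's original code) is to replace the PERSON entity's text span directly in the sentence:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Barack Obama is the President')

question = doc.text
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        # ent.text is the full multi-token span, e.g. 'Barack Obama'
        question = question.replace(ent.text, 'Who')
print(question + '?')  # Who is the President?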
You are describing named entity recognition (NER), rather than tokenization.
Chapter 7 of the NLTK book describes how NER builds on tokenization, moving through part-of-speech tagging to entity recognition.
http://www.nltk.org/book/ch07.html
nltk.ne_chunk() is most likely the functionality you are interested in.
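A minimal sketch of that pipeline from the chapter (tokenize, POS-tag, then chunk named entities):
import nltk
# may require nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('maxent_ne_chunker') and nltk.download('words')

sentence = 'Barack Obama is the President'
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)  # groups tagged tokens into named-entity subtrees
print(tree)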
Is there a way to get the adjective corresponding to a given adverb in NLTK or another Python library?
For example, for the adverb "terribly", I need to get "terrible".
Thanks.
There is a relation in WordNet that connects adjectives to adverbs and vice versa.
>>> from itertools import chain
>>> from nltk.corpus import wordnet as wn
>>> from difflib import get_close_matches as gcm
>>> possible_adjectives = [k.name for k in chain(*[j.pertainyms() for j in chain(*[i.lemmas for i in wn.synsets('terribly')])])]
>>> possible_adjectives
['terrible', 'atrocious', 'awful', 'rotten']
>>> gcm('terribly',possible_adjectives)
['terrible']
A more human-readable way to compute possible_adjectives is as follows:
possible_adj = []
for ss in wn.synsets('terribly'):
    for lemmas in ss.lemmas:  # all possible lemmas (attribute access in older NLTK)
        for ps in lemmas.pertainyms():  # all possible pertainyms
            possible_adj.append(ps.name)  # the pertainym's lemma name
EDIT: In the newer version of NLTK:
possible_adj = []
for ss in wn.synsets('terribly'):
    for lemmas in ss.lemmas():  # all possible lemmas
        for ps in lemmas.pertainyms():  # all possible pertainyms
            possible_adj.append(ps.name())
As MKoosej mentioned, nltk's lemmas is no longer an attribute but a method. I also made a small simplification to pick the most likely word. Hope someone else can use it too:
wordtoinv = 'unduly'
s = []
winner = ""

for ss in wn.synsets(wordtoinv):
    for lemmas in ss.lemmas():  # all possible lemmas
        s.append(lemmas)

for pers in s:
    posword = pers.pertainyms()[0].name()
    if posword[0:3] == wordtoinv[0:3]:
        # keep the pertainym whose first letters match the adverb
        winner = posword
        break

print(winner)  # undue