Following several other posts [e.g. Detect English verb tenses using NLTK, Identifying verb tenses in python, Python NLTK figure out tense], I wrote the following code to determine the tense of a sentence in Python using POS tagging:
from nltk import word_tokenize, pos_tag

def determine_tense_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)
    tense = {}
    tense["future"] = len([word for word in tagged if word[1] == "MD"])
    tense["present"] = len([word for word in tagged if word[1] in ["VBP", "VBZ", "VBG"]])
    tense["past"] = len([word for word in tagged if word[1] in ["VBD", "VBN"]])
    return tense
This returns a count of past/present/future verb forms, and I typically take the tense with the highest count as the tense of the sentence. The accuracy is moderately decent, but I am wondering if there is a better way of doing this.
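For reference, a minimal usage sketch (exact counts depend on the tagger NLTK loads):

tense = determine_tense_input("I walked to the store.")
print(tense)                      # {'future': 0, 'present': 0, 'past': 1}
print(max(tense, key=tense.get))  # 'past'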
For example, is there now by chance a package more dedicated to extracting the tense of a sentence? [Note: two of the three Stack Overflow posts are four years old, so things may have changed.] Or alternatively, should I be using a different parser from within NLTK to increase accuracy? If not, I hope the above code may help someone else!
You can strengthen your approach in various ways. You could think more about the grammar of English and add some more rules based on whatever you observe; or you could push the statistical approach, extract some more (relevant) features and throw the whole lot at a classifier. The NLTK gives you plenty of classifiers to play with, and they're well documented in the NLTK book.
You can have the best of both worlds: Hand-written rules can be in the form of features that are fed to the classifier, which will decide when it can rely on them.
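To make that concrete, here is a toy sketch (not from any of the linked posts) in which hand-written tag cues become boolean features for an NLTK Naive Bayes classifier; the tiny training set is invented purely for illustration:

import nltk

def tense_features(tagged):
    # Hand-written cues expressed as features the classifier can weigh.
    tags = [tag for _, tag in tagged]
    return {
        "has_modal": "MD" in tags,
        "has_past": any(t in ("VBD", "VBN") for t in tags),
        "has_present": any(t in ("VBP", "VBZ", "VBG") for t in tags),
    }

train = [
    (tense_features(nltk.pos_tag(nltk.word_tokenize(s))), label)
    for s, label in [
        ("I will go home", "future"),
        ("They will have left", "future"),
        ("I went home", "past"),
        ("She had finished", "past"),
        ("I am going home", "present"),
        ("He runs every day", "present"),
    ]
]
classifier = nltk.NaiveBayesClassifier.train(train)
features = tense_features(nltk.pos_tag(nltk.word_tokenize("He walked away")))
print(classifier.classify(features))  # should print 'past'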
According to http://dev.lexalytics.com/wiki/pmwiki.php?n=Main.POSTags, the tags mean:
MD Modal verb (can, could, may, must)
VB Base verb (take)
VBC Future tense, conditional
VBD Past tense (took)
VBF Future tense
VBG Gerund, present participle (taking)
VBN Past participle (taken)
VBP Present tense (take)
VBZ Present 3rd person singular (takes)
so that your code would be
tense["future"] = len([word for word in tagged if word[1] in ["VBC", "VBF"])
You could use the Stanford Parser to get a dependency parse of the sentence. The root of the dependency parse will be the 'primary' verb that defines the sentence (I'm not too sure what the specific linguistic term is). You can then use the POS tag on this verb to find its tense, and use that.
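As a rough sketch of that idea (using spaCy's dependency parser here instead of the Stanford Parser, purely for brevity):

import spacy

nlp = spacy.load("en_core_web_sm")

def root_verb_tag(sentence):
    # The token whose dependency relation is ROOT is the main verb;
    # its fine-grained tag (tag_) carries the tense information.
    doc = nlp(sentence)
    root = [token for token in doc if token.dep_ == "ROOT"][0]
    return root.text, root.tag_

print(root_verb_tag("He took the bus yesterday."))  # ('took', 'VBD') -> past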
This worked for me:
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
`grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}
"""`
The only thing is that you have to deal with modal verbs, because "could" or "may", for example, are treated like "will" in this case and give you the future group.
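For completeness, a sketch of how the grammar above can be applied (reusing the grammar and tagged variables from the snippet):

import nltk

cp = nltk.RegexpParser(grammar)
tree = cp.parse(tagged)
for subtree in tree.subtrees():
    if subtree.label() != 'S':   # skip the root node
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
# For the example sentence this should print something like:
# Future_Perfect_Continuous will have been doing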
No, of course not. This is what I have got so far (you might want to read the NLTK book's section on grammar parsing, too):
I left only verb tags to simplify the task a little bit, then used nltk's RegexpParser.
import nltk

def tense_detect(tagged_sentence):
    verb_tags = ['MD', 'MDF',
                 'BE', 'BEG', 'BEN', 'BED', 'BEDZ', 'BEZ', 'BEM', 'BER',
                 'DO', 'DOD', 'DOZ',
                 'HV', 'HVG', 'HVN', 'HVD', 'HVZ',
                 'VB', 'VBG', 'VBN', 'VBD', 'VBZ',
                 'SH',
                 'TO',
                 'JJ'  # maybe?
                 ]
    verb_phrase = []
    for item in tagged_sentence:
        if item[1] in verb_tags:
            verb_phrase.append(item)
    grammar = r'''
future perfect continuous passive: {<MDF><HV><BEN><BEG><VBN|VBD>+}
conditional perfect continuous passive:{<MD><HV><BEN><BEG><VBN|VBD>+}
future continuous passive: {<MDF><BE><BEG><VBN|VBD>+}
conditional continuous passive: {<MD><BE><BEG><VBN|VBD>+}
future perfect continuous: {<MDF><HV><BEN><VBG|HVG|BEG>+}
conditional perfect continuous: {<MD><HV><BEN><VBG|HVG|BEG>+}
past perfect continuous passive: {<HVD><BEN><BEG><VBN|VBD>+}
present perfect continuous passive: {<HV|HVZ><BEN><BEG><VBN|VBD>+}
future perfect passive: {<MDF><HV><BEN><VBN|VBD>+}
conditional perfect passive: {<MD><HV><BEN><VBN|VBD>+}
future continuous: {<MDF><BE><VBG|HVG|BEG>+ }
conditional continuous: {<MD><BE><VBG|HVG|BEG>+ }
future indefinite passive: {<MDF><BE><VBN|VBD>+ }
conditional indefinite passive: {<MD><BE><VBN|VBD>+ }
future perfect: {<MDF><HV><HVN|BEN|VBN|VBD>+ }
conditional perfect: {<MD><HV><HVN|BEN|VBN|VBD>+ }
past continuous passive: {<BED|BEDZ><BEG><VBN|VBD>+}
past perfect continuous: {<HVD><BEN><HVG|BEG|VBG>+}
past perfect passive: {<HVD><BEN><VBN|VBD>+}
present continuous passive: {<BEM|BER|BEZ><BEG><VBN|VBD>+}
present perfect continuous: {<HV|HVZ><BEN><VBG|BEG|HVG>+}
present perfect passive: {<HV|HVZ><BEN><VBN|VBD>+}
future indefinite: {<MDF><BE|DO|VB|HV>+ }
conditional indefinite: {<MD><BE|DO|VB|HV>+ }
past continuous: {<BED|BEDZ><VBG|HVG|BEG>+}
past perfect: {<HVD><BEN|VBN|HVD|HVN>+}
past indefinite passive: {<BED|BEDZ><VBN|VBD>+}
present indefinite passive: {<BEM|BER|BEZ><VBN|VBD>+}
present continuous: {<BEM|BER|BEZ><BEG|VBG|HVG>+}
present perfect: {<HV|HVZ><BEN|HVD|VBN|VBD>+ }
past indefinite: {<DOD><VB|HV|DO>|<BEDZ|BED|HVD|VBN|VBD>+}
infinitive: {<TO><BE|HV|VB>+}
present indefinite: {<DO|DOZ><DO|HV|VB>+|<DO|HV|VB|BEZ|DOZ|BER|HVZ|BEM|VBZ>+}
    '''
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(verb_phrase)
    print(result)  # the original used IPython's display()
    tenses_set = set()
    for node in result:
        if isinstance(node, nltk.Tree):
            tenses_set.add(node.label())
    return result, tenses_set
This works just OK, even with odd complex sentences. The big problem is causatives, like "I have my car washed every day": removing everything but the verbs leaves "have washed", which gives present perfect.
You'll have to tweak it anyway.
I've just fixed my computer and don't have NLTK installed yet to show the output; I'll try to do it tomorrow.
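In the meantime, here is a hypothetical usage sketch. Since the grammar expects Brown-corpus tags (HVD, BEN and so on) rather than the Penn Treebank tags that nltk.pos_tag produces, it first trains a simple tagger on the Brown corpus (this setup is my assumption, not part of the answer above):

import nltk
from nltk import word_tokenize
from nltk.corpus import brown

# Train a unigram tagger so the output uses Brown-style tags.
brown_tagger = nltk.UnigramTagger(brown.tagged_sents(),
                                  backoff=nltk.DefaultTagger('NN'))

sentence = "He had been doing his homework."
tagged = brown_tagger.tag(word_tokenize(sentence))
tree, tenses = tense_detect(tagged)
print(tenses)  # hopefully {'past perfect continuous'}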
Related
Hello everybody! I'm working on an NLP project, and I want to detect whether there is a negation of a given verb in a sentence.
For example : the function "Is_there_negation" should return "True" with the following parameters :
text:"I don't want to eat right now"
verb:"eat"
How can I complete this function? (I'm a real beginner in NLP.)
import spacy

nlp = spacy.load("en_core_web_sm")

def Is_there_negation(doc, verb):
    for token in doc:
        if token.dep_ == "neg" and token.head.text == verb:
            return True
        elif token.text == verb:
            for tk in token.subtree:
                ...
    return False
Thanks in advance
This is quite tricky. From the syntactic point of view, eat is not negated in the sentence; it is the governing verb want that is negated, as you can see in a dependency parse of the sentence.
What might do the job is:
Find a negative particle (with dependency label neg)
Look at what its dependency parent is. If it is the verb you are looking for, return True.
If not, list the descendants of that parent and check whether the verb you are looking for is in its subtree (i.e., has the correct lemma and is tagged as a VERB). You might, however, want to limit the search to some subtree depth to ignore weird cases with overly complicated phrases.
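A minimal sketch of those steps with spaCy (the subtree-depth limit is left out for brevity):

import spacy

nlp = spacy.load("en_core_web_sm")

def is_there_negation(doc, verb):
    for token in doc:
        if token.dep_ == "neg":
            head = token.head
            # Step 2: the negated word is the verb we are looking for.
            if head.text == verb:
                return True
            # Step 3: otherwise look for the verb among the descendants
            # of the negated word (matching lemma and VERB tag).
            for descendant in head.subtree:
                if descendant.lemma_ == verb and descendant.pos_ == "VERB":
                    return True
    return False

doc = nlp("I don't want to eat right now")
print(is_there_negation(doc, "eat"))  # True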
I just started developing a very simple program that reads a txt file and tells you the misspelled words in it. I looked up what the best library to use would be and read that I should use NLTK and its 'words' corpus. I did, and noticed that it is not doing its job correctly, or maybe I'm not doing something correctly and it's actually my fault, but can someone please check it out?
from nltk.corpus import words

setwords = set(words.words())

def prompt():
    userinput = input("File to Evaluate: ").strip()
    with open(userinput, 'r') as file:
        words = file.read()
    return words

def main():
    error_list = []
    words = prompt()
    words_splitted = words.split()
    for i in words_splitted:
        if i in setwords:
            pass
        elif i not in setwords:
            error_list.append(i)
    print(f"We're not sure these words exist: {error_list}")

if __name__ == '__main__':
    main()
The program runs fine, but please give me some help figuring out whether NLTK is actually bad at detecting words or whether it's failing because of my program. I'm using this program with testing.txt, a file containing the famous John Quincy Adams letter from his mother.
The output on the terminal is this: Screenshot Output
As you can see in the picture, it just prints out a lot of words that shouldn't even be flagged, such as 'ages', 'heaven' and 'contest'.
NLTK is designed to help do natural language analysis. I'm really not sure that it's the optimal tool for trying to do spell correction. For one thing, the word list which you are using does not attempt to contain every possible correctly-spelled word, because it assumes that you will use one of the "stemmers" built into NLTK; a stemmer attempts to figure out what each word's "stem" (or base) is. Stemming lets the package analyse "ages" as the plural of "age" and the fact that that will work means that it is unnecessary to include "ages" in the word list.
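For example, NLTK's Porter stemmer reduces "ages" to "age", which is listed, even though "ages" itself is not:

from nltk.corpus import words
from nltk.stem import PorterStemmer

word_list = set(words.words())
stemmer = PorterStemmer()

print("ages" in word_list)                # False
print(stemmer.stem("ages"))               # 'age'
print(stemmer.stem("ages") in word_list)  # True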
It is worth noting that NLTK includes utilities which do a much better job of splitting input into words than just calling string.split(), which doesn't know anything about punctuation. If you're going to use NLTK, you'd be well-advised to let it do this work for you, for example with the nltk.word_tokenize function.
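For example (this is likely also why words such as 'heaven' and 'contest' were flagged: str.split leaves the punctuation stuck to them):

from nltk import word_tokenize

text = "heaven, and the contest."
print(text.split())         # ['heaven,', 'and', 'the', 'contest.']
print(word_tokenize(text))  # ['heaven', ',', 'and', 'the', 'contest', '.']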
Moreover, NLTK generally tries to guess what a word is if it doesn't know it, which means that it will often be able to identify the part of speech of a misspelled or even invented word.
As an example, I ran its default part of speech tagger on Lewis Carroll's famous Jabberwocky, to produce the following output. (I added the definition of each part of speech tag to make it easier to read.)
>>> import nltk
>>> tagdict = nltk.data.load("help/tagsets/upenn_tagset.pickle")  # tag -> (definition, examples); needs nltk.download('tagsets')
>>> poem = """'Twas brillig, and the slithy toves
... did gyre and gimble in the wabe:
... All mimsy were the borogoves,
... and the mome raths outgrabe.
... """
>>> print('\n'.join(f"{word+' :':<12}({tag}) {tagdict[tag][0]}"
...                 for word, tag in nltk.pos_tag(nltk.word_tokenize(poem))))
'T : (NN) noun, common, singular or mass
was : (VBD) verb, past tense
brillig : (NN) noun, common, singular or mass
, : (,) comma
and : (CC) conjunction, coordinating
the : (DT) determiner
slithy : (JJ) adjective or numeral, ordinal
toves : (NNS) noun, common, plural
did : (VBD) verb, past tense
gyre : (NN) noun, common, singular or mass
and : (CC) conjunction, coordinating
gimble : (JJ) adjective or numeral, ordinal
in : (IN) preposition or conjunction, subordinating
the : (DT) determiner
wabe : (NN) noun, common, singular or mass
: : (:) colon or ellipsis
All : (DT) determiner
mimsy : (NNS) noun, common, plural
were : (VBD) verb, past tense
the : (DT) determiner
borogoves : (NNS) noun, common, plural
, : (,) comma
and : (CC) conjunction, coordinating
the : (DT) determiner
mome : (JJ) adjective or numeral, ordinal
raths : (NNS) noun, common, plural
outgrabe : (RB) adverb
. : (.) sentence terminator
NLTK is an extraordinary body of work with a lot of practical applications. I'm just not sure that yours is one of them. But if you haven't already done so, do take a look at the Creative Commons licensed book which describes NLTK, Natural Language Processing with Python. Not only is the book a good guide to the NLTK library, it is also a gentle introduction to text processing in Python 3.
I'm trying to build a POS tagger for a voice assistant. However, nltk's POS tagger nltk.pos_tag doesn't work well for me. For example:
sent = 'open Youtube'
tokens = nltk.word_tokenize(sent)
nltk.pos_tag(tokens, tagset='universal')
>>[('open', 'ADJ'), ('Youtube', 'NOUN')]
In the above case I'd want the word open to be a verb and not an adjective. Similarly, it tags the word 'close' as an adverb and not a verb.
I have also tried using an n-gram tagger
import nltk
from nltk.corpus import brown

# brown_tagged_sents and size are not defined in the question; a typical setup:
brown_tagged_sents = brown.tagged_sents(categories='news', tagset='universal')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
trigram_tagger = nltk.TrigramTagger(train_sents, backoff=bigram_tagger)
I have used the brown corpus from nltk. But it still gives the same result.
So I'd like to know:
Is there a better tagged corpus to train a tagger for making a voice/virtual assistant?
Is there a higher-order n-gram tagger than the trigram tagger, i.e. one that looks at 4 or more words together, the way the trigram and bigram taggers look at 3 and 2 words respectively? Will it improve the performance?
How can I fix this?
Concerning question #3
I think this is not a general solution, but it works at least for the context you mention ("Do this/that"). If you put a "to" at the beginning, the tagger will tend to "understand" a verb instead of an adjective, noun or adverb!
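For instance (the second output is what I would expect; the exact tags depend on the tagger):

import nltk

print(nltk.pos_tag(nltk.word_tokenize("open Youtube"), tagset='universal'))
# [('open', 'ADJ'), ('Youtube', 'NOUN')]
print(nltk.pos_tag(nltk.word_tokenize("to open Youtube"), tagset='universal'))
# 'open' should now come out as a VERB, e.g.
# [('to', 'PRT'), ('open', 'VERB'), ('Youtube', 'NOUN')]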
I took this screenshot using Freeling_demo just to compare interpretations
Specifically, if you want to use Freeling there are Java/Python APIs available, or you can call it just using the command line.
Regarding question #2, I think including more context works better for complete sentences or large texts; that may not be the case for commands to a basic virtual assistant.
Good luck!
While "PM" can mean "pm(time)" it can also mean "Prime Minister".
I want to capture the latter. I want lemma of "PM" to return "Prime Minister". How can I do this using spacy?
Example returning unexpected lemma:
>>> import spacy
>>> #nlp = spacy.load('en')
>>> nlp = spacy.load('en_core_web_lg')
>>> doc = nlp(u'PM means prime minister')
>>> for word in doc:
... print(word.text, word.lemma_)
...
PM pm
means mean
prime prime
minister minister
As per the docs at https://spacy.io/api/annotation, spaCy uses WordNet for lemmas:
A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet..
When I tried inputting "pm" in Wordnet, it shows "Prime Minister" as one of the lemmas.
What am I missing here?
I think it would help answer your question by clarifying some common NLP tasks.
Lemmatization is the process of finding the canonical word given different inflections of the word. For example, run, runs, ran and running are forms of the same lexeme: run. If you were to lemmatize run, runs, and ran the output would be run. In your example sentence, note how it lemmatizes means to mean.
Given that, it doesn't sound like the task you want to perform is lemmatization. It may help to solidify this idea with a silly counterexample: what are the different inflections of a hypothetical lemma "pm": pming, pmed, pms? None of those are actual words.
It sounds like your task may be closer to Named Entity Recognition (NER), which you could also do in spaCy. To iterate through the detected entities in a parsed document, you can use the .ents attribute, as follows:
>>> for ent in doc.ents:
... print(ent, ent.label_)
With the sentence you've given, spacy (v. 2.0.5) doesn't detect any entities. If you replace "PM" with "P.M." it will detect that as an entity, but as a GPE.
The best thing to do depends on your task, but if you want your desired classification of the "PM" entity, I'd look at setting entity annotations. If you want to pull out every mention of "PM" from a big corpus of documents, use the matcher in a pipeline.
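As a sketch of the matcher route (the pattern name PRIME_MINISTER is made up, and this uses the spaCy v3 Matcher.add signature):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Match the exact, case-sensitive token "PM", so "pm" (the time) is ignored.
matcher.add("PRIME_MINISTER", [[{"ORTH": "PM"}]])

doc = nlp("The PM will speak to the press at 9 pm.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, nlp.vocab.strings[match_id])
# PM PRIME_MINISTER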
When I run the lemmas of prime minister through NLTK's WordNet interface (which spaCy draws on as well), I get:
>>> from nltk.corpus import wordnet as wn
>>> [str(lemma.name()) for lemma in wn.synset('prime_minister.n.01').lemmas()]
['Prime_Minister', 'PM', 'premier']
It keeps acronyms the same, so maybe you want to check word.lemma (without the underscore), which gives you an ID that may differ according to the context?
"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question because there isn't a gold set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to describe precisely and completely unambiguously exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence along with their POS tags, you answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe trying to find the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (using maybe the Stanford parser) and identifying patterns within these trees that tell you which parts of the tree the important things tend to be based in, etc.
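A rough sketch of the idf-plus-POS heuristic, using the Brown corpus as a stand-in background collection (any large corpus of your own would do):

import math
from collections import Counter
import nltk
from nltk.corpus import brown

# Document frequency of each word, treating each Brown sentence as a document.
documents = [set(w.lower() for w in sent) for sent in brown.sents()]
df = Counter(w for doc in documents for w in doc)
N = len(documents)

def idf(word):
    return math.log(N / (1 + df[word.lower()]))

sentence = "First thing we do, let's kill all the lawyers."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

nouns = [w for w, t in tagged if t.startswith("NN") and w.isalpha()]
verbs = [w for w, t in tagged if t.startswith("VB") and w.isalpha()]

print(max(nouns, key=idf))  # hopefully 'lawyers'
print(max(verbs, key=idf))  # hopefully 'kill'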
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
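For example, with the POS list from the question:

pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(filter_stop_words(pos_list))
# [['kill', 'VB'], ['lawyers', 'NNS']]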
In your case, you can simply use the Rake package (thanks to Fabian) for Python to get what you need:
>>> import RAKE
>>> path = #your path
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
the path can be for example this file.
But in general, you are better off using the NLTK package for NLP tasks.