How to get better lemmas from Spacy - python

While "PM" can mean "pm(time)" it can also mean "Prime Minister".
I want to capture the latter. I want lemma of "PM" to return "Prime Minister". How can I do this using spacy?
Example returning unexpected lemma:
>>> import spacy
>>> #nlp = spacy.load('en')
>>> nlp = spacy.load('en_core_web_lg')
>>> doc = nlp(u'PM means prime minister')
>>> for word in doc:
...     print(word.text, word.lemma_)
...
PM pm
means mean
prime prime
minister minister
As per the docs at https://spacy.io/api/annotation, spaCy uses WordNet for lemmas:
"A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet."
When I look up "pm" in WordNet, it shows "Prime Minister" as one of the lemmas.
What am I missing here?

I think it would help to answer your question by clarifying some common NLP tasks.
Lemmatization is the process of finding the canonical word given different inflections of the word. For example, run, runs, ran and running are forms of the same lexeme: run. If you were to lemmatize run, runs, and ran the output would be run. In your example sentence, note how it lemmatizes means to mean.
Given that, it doesn't sound like the task you want to perform is lemmatization. It may help to solidify this idea with a silly counterexample: what are the different inflections of a hypothetical lemma "pm": pming, pmed, pms? None of those are actual words.
It sounds like your task may be closer to Named Entity Recognition (NER), which you could also do in spaCy. To iterate through the detected entities in a parsed document, you can use the .ents attribute, as follows:
>>> for ent in doc.ents:
...     print(ent, ent.label_)
With the sentence you've given, spacy (v. 2.0.5) doesn't detect any entities. If you replace "PM" with "P.M." it will detect that as an entity, but as a GPE.
The best thing to do depends on your task, but if you want your desired classification of the "PM" entity, I'd look at setting entity annotations. If you want to pull out every mention of "PM" from a big corpus of documents, use the matcher in a pipeline.
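For instance, here is a minimal sketch of setting an entity annotation by hand (the "TITLE" label is just an illustrative name, not a built-in spaCy entity type, and the token indices are specific to this sentence):
from spacy.tokens import Span

doc = nlp(u'PM means prime minister')
# register an illustrative label and attach "PM" (token 0) as an entity;
# in spaCy v2 the Span label is passed as a hash from the StringStore
label_id = doc.vocab.strings.add(u'TITLE')
doc.ents = list(doc.ents) + [Span(doc, 0, 1, label=label_id)]

for ent in doc.ents:
    print(ent.text, ent.label_)  # PM TITLE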

When I look up the lemmas of prime minister through nltk.wordnet (which spaCy uses as well) I get:
>>> [str(lemma.name()) for lemma in wn.synset('prime_minister.n.01').lemmas()]
['Prime_Minister', 'PM', 'premier']
It keeps acronyms the same, so maybe you want to check word.lemma, which would give you a different ID according to the context?


Need help in constructing pattern for OR and AND using spacy python

Let's say I have a text:
Input sentence: Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it.
and I have to find whether any of the phrases matches the given text sentences.
or_phrases = ["efficient", "design"]
Expected output: yes, because the above input sentence has the word "efficient".
and_phrase = ["love", "live"]
Expected output: None, because the above input sentence doesn't have "love" or "live" anywhere in the entire sentence. The order of the words doesn't matter. Converted to a regular expression (spelled out more fully below):
re.match('(?=.*love)|(?=.*live)', text)
Looking to put this rule into spacy's phrase or token matcher
Is there a way to put this grammar pattern into a spacy pattern matcher?
or_phrases should give me sentences having either of the words.
and_phrases should give me sentences having both the words.
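Spelled out as plain regular expressions first (independent of spaCy; the word lists here just mirror the question and the answer below, and re.search is used so the position of the words doesn't matter):
import re

text = ("Computer programming is the process of writing instructions "
        "that get executed by computers.")

# OR: either word may appear anywhere
or_re = re.compile(r"(?=.*\bprocess\b)|(?=.*\brandom\b)")
# AND: both words must appear somewhere, in any order
and_re = re.compile(r"(?=.*\blove\b)(?=.*\blive\b)")

print(bool(or_re.search(text)))   # True  - "process" occurs
print(bool(and_re.search(text)))  # False - neither "love" nor "live" occurs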
With spaCy you can easily make an OR phrase by using {'TEXT': {'IN': ["word1", "word2"]}}. In your example this will look like this:
text = """Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it. """
doc = nlp(text)
matcher = Matcher(nlp.vocab)
or_pattern = [[
{"TEXT": {"IN": ["process", "random"]}} # the list of "OR" words
]]
matcher.add("or_phrases", or_pattern)
for sent in doc.sents:
matches = matcher(sent)
if len(matches) > 0:
print(sent)
I still can't come up with an easy solution for AND, but I'll try.
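One way to sketch the AND case (not from the answer above; it reuses nlp, doc and Matcher from the snippet, assumes the spaCy v3 Matcher API, and the word list is purely illustrative): register one rule per required word and keep only the sentences in which every rule fired.
and_words = ["instructions", "computer"]  # illustrative "AND" words
and_matcher = Matcher(nlp.vocab)
for w in and_words:
    # one rule per word; LOWER makes the match case-insensitive
    and_matcher.add("AND_" + w, [[{"LOWER": w}]])

for sent in doc.sents:
    # collect which rules matched in this sentence
    fired = {nlp.vocab.strings[match_id] for match_id, start, end in and_matcher(sent)}
    if len(fired) == len(and_words):  # every required word was found
        print(sent)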

How to exclude certain names and terms from stemming (Python NLTK SnowballStemmer (Porter2))

I am newly getting into NLP, Python, and posting on Stack Overflow at the same time, so please be patient with me if I seem ignorant :).
I am using SnowballStemmer in Python's NLTK in order to stem words for textual analysis. While lemmatization seems to understem my tokens, the Snowball (Porter2) stemmer, which I read is mostly preferred to the basic Porter stemmer, overstems my tokens. I am analyzing tweets including many names and probably also places and other words which should not be stemmed, like: hillary, hannity, president, which are now reduced to hillari, hanniti, and presid (you probably guessed already whose tweets I am analyzing).
Is there an easy way to exclude certain terms from stemming? Conversely, I could also merely lemmatize tokens and include a rule for common suffixes like -ed, -s, …. Another idea might be to merely stem verbs and adjectives as well as nouns ending in s. That might also be close enough…
I am using the below code as of now:
# LEMMATIZE AND STEM WORDS
import nltk
from nltk.stem.snowball import EnglishStemmer

lemmatizer = nltk.stem.WordNetLemmatizer()
snowball = EnglishStemmer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]

def snowball_stemmer(text):
    return [snowball.stem(w) for w in text]

# APPLY FUNCTIONS
tweets['text_snowball'] = tweets.text_processed.apply(snowball_stemmer)
tweets['text_lemma'] = tweets.text_processed.apply(lemmatize_text)
I hope someone can help… Contrary to my past experience with all kinds of issues, I have not been able to find adequate help for my issue online so far.
Thanks!
Do you know about NER? It stands for named entity recognition. You can preprocess your text and locate all named entities, which you then exclude from stemming. After stemming, you can merge the data again.
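A minimal sketch of that idea, assuming spaCy is used for the NER step (the model name, and the choice to keep entity tokens verbatim rather than merging separately afterwards, are assumptions rather than part of the answer):
import spacy
from nltk.stem.snowball import EnglishStemmer

nlp = spacy.load("en_core_web_sm")  # any pretrained English pipeline with NER
snowball = EnglishStemmer()

def stem_except_entities(text):
    doc = nlp(text)
    # indices of tokens that belong to a named entity
    entity_tokens = {tok.i for ent in doc.ents for tok in ent}
    # stem everything except the entity tokens
    return [tok.text if tok.i in entity_tokens else snowball.stem(tok.text)
            for tok in doc]

print(stem_except_entities("Hannity asked the president about Hillary."))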

How to get all noun phrases in Spacy (Python)

I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:
doc2 = nlp("what is the capital of Bangladesh?")
for chunk in doc2.noun_chunks:
print(chunk)
Output:
1. what
2. the capital
3. bangladesh
Expected:
the capital of Bangladesh
I have tried answers from spacy doc and StackOverflow. Nothing worked. It seems only cTakes and Stanford core NLP can give such complex NP.
Any help is appreciated.
Spacy clearly defines a noun chunk as:
"A base noun phrase, or 'NP chunk', is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses." (https://spacy.io/api/doc#noun_chunks)
If you process the dependency parse differently, allowing prepositional modifiers and nested phrases/chunks, then you can end up with what you're looking for.
I bet you could modify the existing spacy code fairly easily to do what you want:
https://github.com/explosion/spaCy/blob/06c6dc6fbcb8fbb78a61a2e42c1b782974bd43bd/spacy/lang/en/syntax_iterators.py
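As a rough sketch of that idea without touching spaCy's internals (the model name is an assumption and the result depends on the parse): expand each noun chunk's root to its full dependency subtree, which keeps prepositional phrases attached to it.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("what is the capital of Bangladesh?")

for chunk in doc.noun_chunks:
    # the root's subtree includes modifiers such as "of Bangladesh"
    subtree = list(chunk.root.subtree)
    print(doc[subtree[0].i:subtree[-1].i + 1].text)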
For those who are still looking for this answer:
noun_phrases = set()
for nc in doc.noun_chunks:
    # expand each chunk from its root's left edge to its right edge,
    # which pulls in attached prepositional phrases like "of Bangladesh"
    for np in [nc, doc[nc.root.left_edge.i:nc.root.right_edge.i + 1]]:
        noun_phrases.add(np)
This is how I get all the complex noun phrases.

words.words() from nltk corpus seemingly contains strange non-valid words

This code loops through every word in words.words() from the nltk library, then pushes the word into an array. Then it checks every word in the array to see if it is an actual word by using the same library, and somehow many of them are strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words

test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
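To see the effect directly, here is a small check of the case-sensitivity issue (a set is used only to speed up the lookups):
from nltk.corpus import words

wordlist = set(words.words())  # original casing preserved
print("Aani" in wordlist)      # True  - listed as a proper noun
print("aani" in wordlist)      # False - the lower-cased form is not listed
print("Adighe" in wordlist, "adighe" in wordlist)  # True False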

Performing Stemming outputs gibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My Problem: I'm trying to perform stemming, i.e. reduce words to their normalised form. But it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

# instantiate a stemmer; the original snippet omits this line (Porter assumed, which matches the output shown)
stemmer = stem.PorterStemmer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))
# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
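For example, a quick sketch with NLTK's WordNetLemmatizer (it treats words as nouns unless a part of speech is passed, and it needs the WordNet data downloaded; exact outputs depend on what WordNet knows):
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("forgotten", pos="v"))     # forget
print(lemmatizer.lemmatize("remuneration", pos="n"))  # remuneration
print(lemmatizer.lemmatize("there's"))                # unchanged: not an inflected form WordNet knows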
