I built a plaintext corpus, and the next step is to lemmatize all my texts. I'm using the WordNetLemmatizer and need the POS tag for each token so that I don't run into the problem that e.g. loving -> lemma = loving while love -> lemma = love...
The default WordNetLemmatizer POS tag is n (= noun), I think, but how can I use the output of pos_tag? I think the POS tags the WordNetLemmatizer expects are different from the tags pos_tag gives me. Is there a function or something that can help me?
I think the word_pos in this line is wrong and is the reason for the error:
lemma = wordnet_lemmatizer.lemmatize(word,word_pos)
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
corpus_root = 'C:\\Users\\myname\\Desktop\\TestCorpus'
lyrics = PlaintextCorpusReader(corpus_root, '.*')

for fileid in lyrics.fileids():
    tokens = word_tokenize(lyrics.raw(fileid))
    tagged_tokens = pos_tag(tokens)
    for tagged_token in tagged_tokens:
        word = tagged_token[0]
        word_pos = tagged_token[1]
        print(tagged_token[0])
        print(tagged_token[1])
        lemma = wordnet_lemmatizer.lemmatize(word, pos=word_pos)
        print(lemma)
An additional question: is pos_tag enough for my lemmatization, or do I need another tagger? My texts are lyrics...
You need to convert the tag from the POS tagger to one of the four "syntactic categories" that WordNet recognizes, then pass that to the lemmatizer as word_pos.
From the docs:
Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
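For example, a small helper along these lines (an illustrative sketch rather than code from the answer; the name get_wordnet_pos is mine) converts the Penn Treebank tags returned by pos_tag into WordNet categories, falling back to noun:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag (e.g. 'VBG', 'JJ', 'RB') to one of WordNet's
    # four categories; default to noun, which is also the lemmatizer's default.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# e.g. in the loop from the question:
# lemma = wordnet_lemmatizer.lemmatize(word, pos=get_wordnet_pos(word_pos))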
I am new to text analysis and am trying to create a bag-of-words model (using sklearn's CountVectorizer). I have a data frame with a column of text containing words like 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'.
I think that 'acid' and 'wood' should be the only words included in the final output; however, neither stemming nor lemmatizing seems to accomplish this.
Stemming produces 'acid', 'wood', 'woodi', 'woodsi',
and lemmatizing produces the worse output 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'. I assume this is due to the part of speech not being specified accurately, although I am not sure where this specification should go. I have included it in the line X = vectorizer.fit_transform(df['text'],'a') (I believe most of the words should be adjectives); however, it does not make a difference in the output.
What can I do to improve the output?
My full code is below:
!pip install nltk
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
Data Frame:
df = pd.DataFrame()
df['text']=['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']
CountVectorizer with Stemmer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=stemmed_words)
X = vectorizer.fit_transform(df['text'])
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
CountVectorizer with Lemmatizer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemed_words(doc):
    return (lemmatizer.lemmatize(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=lemed_words)
X = vectorizer.fit_transform(df['text'], 'a')
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
This might simply be an under-performance issue with the WordNetLemmatizer and the stemmer.
Try different ones, for example...
Stemmers:
Porter ( -> from nltk.stem import PorterStemmer)
Lancaster ( -> from nltk.stem import LancasterStemmer)
Lemmatizers:
spaCy ( -> import spacy)
IWNLP ( -> from spacy_iwnlp import spaCyIWNLP)
HanTa ( -> from HanTa import HanoverTagger; note: it is more or less trained for the German language)
I had the same issue, and switching to a different stemmer and lemmatizer solved it. For closer instructions on how to properly implement the stemmers and lemmatizers, a quick search on the web reveals good examples for all cases; a rough sketch of the analyzer swap is below.
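As that rough sketch (my own illustration, not code from this answer), the alternative NLTK stemmers named above drop into the same custom-analyzer pattern the question already uses; only the stemmer object changes:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer, LancasterStemmer

analyzer = CountVectorizer().build_analyzer()
porter = PorterStemmer()
lancaster = LancasterStemmer()

def porter_words(doc):
    # Same generator pattern as stemmed_words above, with Porter instead of Snowball.
    return (porter.stem(w) for w in analyzer(doc))

def lancaster_words(doc):
    # Lancaster is a more aggressive stemmer, so its stems can look quite different.
    return (lancaster.stem(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=porter_words)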
I have a text document I need to apply stemming and lemmatization to. I have already cleaned the data and tokenised it, as well as removing stop words.
What I need to do is take the list as an input and return a dict, where the dict has the keys 'original', 'stem' and 'lemma', and the values are the nth word transformed in that way.
The Snowball stemmer is defined as Stemmer()
and the WordNetLemmatizer is defined as lemmatizer().
Here's the code I've written, but it gives an error:
def find_roots(token_list, n):
    n = 2
    original = tokens
    stem = [ele for sub in original for idx, ele in
            enumerate(sub.split()) if idx == (n - 1)]
    stem = stemmer(stem)
    lemma = [ele for sub in original for idx, ele in
             enumerate(sub.split()) if idx == (n - 1)]
    lemma = lemmatizer()
    return
Any help would be appreciated
I really don't understand what you are trying to do in the list comprehensions, so I'll just write how I would do it:
from nltk import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def find_roots(token_list, n):
    token = token_list[n]
    stem = stemmer.stem(token)
    lemma = lemmatizer.lemmatize(token)
    return {"original": token, "stem": stem, "lemma": lemma}

roots_dict = find_roots(["said", "talked", "walked"], n=2)
print(roots_dict)
> {'original': 'walked', 'stem': 'walk', 'lemma': 'walked'}
You can do what you want with spaCy, as shown below (in many cases spaCy performs better than NLTK):
# $ pip install -U spacy
import spacy
from nltk import WordNetLemmatizer, SnowballStemmer

sp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

words = ['compute', 'computer', 'computed', 'computing', 'said', 'talked', 'walked']
for word in words:
    print(f'Original Word : {word}')
    print(f'Stemmer with nltk : {stemmer.stem(word)}')
    print(f'Lemmatization with nltk : {lemmatizer.lemmatize(word)}')
    sp_word = sp(word)
    print(f'Lemmatization with spacy : {sp_word[0].lemma_}')
Output:
Original Word : compute
Stemmer with nltk : comput
Lemmatization with nltk : compute
Lemmatization with spacy : compute
Original Word : computer
Stemmer with nltk : comput
Lemmatization with nltk : computer
Lemmatization with spacy : computer
Original Word : computed
Stemmer with nltk : comput
Lemmatization with nltk : computed
Lemmatization with spacy : compute
Original Word : computing
Stemmer with nltk : comput
Lemmatization with nltk : computing
Lemmatization with spacy : compute
Original Word : said
Stemmer with nltk : said
Lemmatization with nltk : said
Lemmatization with spacy : say
Original Word : talked
Stemmer with nltk : talk
Lemmatization with nltk : talked
Lemmatization with spacy : talk
Original Word : walked
Stemmer with nltk : walk
Lemmatization with nltk : walked
Lemmatization with spacy : walk
Here I am trying to read the contents of a file, let's say 'book1.txt', and I have to remove all the special characters and punctuation marks and word-tokenise the content using nltk's word tokeniser,
lemmatize those tokens using WordNetLemmatizer,
and write those tokens into a csv file one by one.
Here is the code I am using, which obviously is not working, but I just need some suggestions on this please.
import nltk
from nltk.stem import WordNetLemmatizer
import csv
from nltk.tokenize import word_tokenize

file_out = open('data.csv', 'w')
with open('book1.txt', 'r') as myfile:
    for s in myfile:
        words = nltk.word_tokenize(s)
        words = [word.lower() for word in words if word.isalpha()]
        for word in words:
            token = WordNetLemmatizer().lemmatize(words, 'v')
            filtered_sentence = [""]
            for n in words:
                if n not in token:
                    filtered_sentence.append("" + n)
            file_out.writelines(filtered_sentence + ["\n"])
There are some issues here, most notably with the last two for loops.
The way you are doing it makes it write the output as follows:
word1
word1word2
word1word2word3
word1word2word3word4
........etc
I'm guessing that is not the expected output. I'm assuming the expected output is:
word1
word2
word3
word4
........etc (without creating duplicates)
I applied the code below to a 3 paragraph Cat Ipsum file. Note that I changed some variable names due to my own naming conventions.
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint

# read the text into a single string.
with open("book1.txt") as infile:
    text = ' '.join(infile.readlines())

words = word_tokenize(text)
words = [word.lower() for word in words if word.isalpha()]

# create the lemmatized word list
results = []
for word in words:
    # you were using words instead of word below
    token = WordNetLemmatizer().lemmatize(word, "v")
    # check if token not already in results.
    if token not in results:
        results.append(token)

# sort results, just because :)
results.sort()

# print and save the results
pprint(results)
print(len(results))
with open("nltk_data.csv", "w") as outfile:
    # write one token per line
    outfile.writelines(word + "\n" for word in results)
I want to apply the NLTK WordNetLemmatizer to a whole sentence. The problem is that I get an error:
KeyError: 'NNP'
It's as if I'm passing an unknown 'pos' value, but I don't know why. I want to get the base form of the words, but without 'pos' it doesn't work.
Can you tell me what I am doing wrong?
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')

sentence = "I want to find the best way to lemmantize this sentence so that I can see better results of it"
taged_words = nltk.pos_tag(sentence)
print(taged_words)

lemmantised_sentence = []
lemmatizer = WordNetLemmatizer()
for word in taged_words:
    filtered_text_lemmantised = lemmatizer.lemmatize(word[0], pos=word[1])
    print(filtered_text_lemmantised)
    lemmantised_sentence.append(filtered_text_lemmantised)

lemmantised_sentence = ' '.join(lemmantised_sentence)
print(lemmantised_sentence)
The sentence should be split (tokenized) before being passed to the pos_tag function. Also, the pos argument differs in what kind of strings it accepts: it only takes WordNet tags such as 'n', 'v' and so on, not Penn Treebank tags. I have updated your code based on this answer: https://stackoverflow.com/a/15590384/7349991.
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def main():
    nltk.download('averaged_perceptron_tagger')
    nltk.download('wordnet')
    sentence = "I want to find the best way to lemmantize this sentence so that I can see better results of it"
    taged_words = nltk.pos_tag(sentence.split())
    print(taged_words)
    lemmantised_sentence = []
    lemmatizer = WordNetLemmatizer()
    for word in taged_words:
        if word[1] == '':
            continue
        filtered_text_lemmantised = lemmatizer.lemmatize(word[0], pos=get_wordnet_pos(word[1]))
        print(filtered_text_lemmantised)
        lemmantised_sentence.append(filtered_text_lemmantised)
    lemmantised_sentence = ' '.join(lemmantised_sentence)
    print(lemmantised_sentence)

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.ADV

if __name__ == '__main__':
    main()
I am trying to get all the words in the WordNet dictionary that are of type noun and category food.
I have found a way to check whether a word is a noun.food, but I need the reverse method:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def if_food(word):
    syns = wn.synsets(word, pos=wn.NOUN)
    for syn in syns:
        print(syn.lexname())
        if 'food' in syn.lexname():
            return 1
    return 0
So I think I have found a solution:
# Using the NLTK WordNet dictionary, check if the word is a noun and a food.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def if_food(word):
    syns = wn.synsets(str(word), pos=wn.NOUN)
    for syn in syns:
        if 'food' in syn.lexname():
            return 1
    return 0
Then using the qdapDictionaries::GradyAugmented R English words dictionary I have checked each word if it's a noun.food:
en_dict = pd.read_csv("GradyAugmentedENDict.csv")
en_dict['is_food'] = en_dict.word.apply(if_food)
en_dict[en_dict.is_food == 1].to_csv("en_dict_is_food.csv")
It it actually did the job.
Hope it will help others.
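As a footnote, the "reverse" direction the question asks about can also be sketched directly against WordNet (my own addition, assuming the same lexname check as above): enumerate all noun synsets and keep the lemmas whose lexicographer file is noun.food.

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Collect every lemma that appears in a synset from the noun.food lexicographer file.
food_words = set()
for syn in wn.all_synsets(pos=wn.NOUN):
    if syn.lexname() == 'noun.food':
        food_words.update(lemma.name() for lemma in syn.lemmas())

print(len(food_words))
print(sorted(food_words)[:20])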