I have a text document I need to apply stemming and lemmatization to. I have already cleaned the data, tokenised it, and removed the stop words.
What I need to do is take the token list as input and return a dict whose keys are 'original', 'stem' and 'lemma', with the values being the nth word transformed in each of those ways.
The Snowball stemmer is defined as stemmer() and the WordNetLemmatizer is defined as lemmatizer().
Here's the code I've written, but it gives an error:
def find_roots(token_list, n):
    n = 2
    original = tokens
    stem = [ele for sub in original for idx, ele in
            enumerate(sub.split()) if idx == (n - 1)]
    stem = stemmer(stem)
    lemma = [ele for sub in original for idx, ele in
             enumerate(sub.split()) if idx == (n - 1)]
    lemma = lemmatizer()
    return
Any help would be appreciated
I really don't understand what you are trying to do in the list comprehensions, so I'll just write how I would do it:
from nltk import WordNetLemmatizer, SnowballStemmer
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")
def find_roots(token_list, n):
    token = token_list[n]
    stem = stemmer.stem(token)
    lemma = lemmatizer.lemmatize(token)
    return {"original": token, "stem": stem, "lemma": lemma}
roots_dict = find_roots(["said", "talked", "walked"], n=2)
print(roots_dict)
> {'original': 'walked', 'stem': 'walk', 'lemma': 'walked'}
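Note that the lemma comes back as 'walked' because WordNetLemmatizer treats every word as a noun by default (pos='n'). If you know the part of speech, passing it explicitly should give you the base verb form; a minimal sketch:

from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Default pos is "n" (noun), so a past-tense verb is returned unchanged.
print(lemmatizer.lemmatize("walked"))           # -> 'walked'
# Telling the lemmatizer the word is a verb lets WordNet reduce it.
print(lemmatizer.lemmatize("walked", pos="v"))  # -> 'walk'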
You can do what you want with spaCy as shown below (in many cases spaCy performs better than NLTK):
# $ pip install -U spacy
import spacy
from nltk import WordNetLemmatizer, SnowballStemmer
sp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")
words = ['compute', 'computer', 'computed', 'computing', 'said', 'talked', 'walked']
for word in words:
    print(f'Original Word : {word}')
    print(f'Stemmer with nltk : {stemmer.stem(word)}')
    print(f'Lemmatization with nltk : {lemmatizer.lemmatize(word)}')
    sp_word = sp(word)
    print(f'Lemmatization with spacy : {sp_word[0].lemma_}')
Output:
Original Word : compute
Stemmer with nltk : comput
Lemmatization with nltk : compute
Lemmatization with spacy : compute
Original Word : computer
Stemmer with nltk : comput
Lemmatization with nltk : computer
Lemmatization with spacy : computer
Original Word : computed
Stemmer with nltk : comput
Lemmatization with nltk : computed
Lemmatization with spacy : compute
Original Word : computing
Stemmer with nltk : comput
Lemmatization with nltk : computing
Lemmatization with spacy : compute
Original Word : said
Stemmer with nltk : said
Lemmatization with nltk : said
Lemmatization with spacy : say
Original Word : talked
Stemmer with nltk : talk
Lemmatization with nltk : talked
Lemmatization with spacy : talk
Original Word : walked
Stemmer with nltk : walk
Lemmatization with nltk : walked
Lemmatization with spacy : walk
I am new to text analysis and am trying to create a bag-of-words model (using sklearn's CountVectorizer). I have a data frame with a column of text containing words like 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'.
I think that 'acid' and 'wood' should be the only words included in the final output, however neither stemming nor lemmatizing seems to accomplish this.
Stemming produces 'acid', 'wood', 'woodi', 'woodsi',
and lemmatizing produces a worse output of 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'. I assume this is because the part of speech is not being specified accurately, although I am not sure where this specification should go. I have included it in the line X = vectorizer.fit_transform(df['text'],'a') (I believe that most of the words should be adjectives); however, it does not make a difference in the output.
What can I do to improve the output?
My full code is below:
!pip install nltk
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
Data Frame:
df = pd.DataFrame()
df['text']=['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']
CountVectorizer with Stemmer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
vectorizer = CountVectorizer(stop_words='english',analyzer=stemmed_words)
X = vectorizer.fit_transform(df['text'])
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
CountVectorizer with Lemmatizer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def lemed_words(doc):
    return (lemmatizer.lemmatize(w) for w in analyzer(doc))
vectorizer = CountVectorizer(stop_words='english',analyzer=lemed_words)
X = vectorizer.fit_transform(df['text'],'a')
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
This might simply be an under-performance issue with the WordNetLemmatizer and the stemmer.
Try different ones like...
Stemmers:
Porter ( -> from nltk.stem import PorterStemmer)
Lancaster (-> from nltk.stem import LancasterStemmer)
Lemmatizers:
spacy ( -> import spacy)
IWNLP ( -> from spacy_iwnlp import spaCyIWNLP)
HanTa ( -> from HanTa import HanoverTagger / Note: it is more or less trained for the German language)
I had the same issue, and switching to a different stemmer and lemmatizer solved it. For closer instruction on how to properly implement the stemmers and lemmatizers, a quick search on the web turns up good examples for all of these cases.
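As a starting point, here is a minimal comparison sketch (not part of the original answer) for the stemmers suggested above, run on the words from the question; the outputs differ by algorithm, so it is worth checking which one collapses the 'wood'/'woodsy'/'woody' family the way you want:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']

# Compare how aggressively each stemmer collapses the two word families.
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer('english')):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])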
I have tried looking into this and couldn't find any possible way to do it the way I imagine. The term I am trying to group, as an example, is 'no complaints': the 'no' is picked up as a stop word, so I have manually removed it from the stop word list to ensure it is kept in the data. However, both words are then picked up as negative words during the sentiment analysis. I want to combine them so they can be categorised as either Neutral or Positive. Is it possible to manually group such words or terms together and decide how they are analysed in the sentiment analysis?
I have found a way to group words using POS tagging and chunking, but this combines tags into multi-word expressions and doesn't necessarily pick them up correctly in the sentiment analysis.
Current code (using POS Tagging):
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize, MWETokenizer
import re, gensim, nltk
from gensim.utils import simple_preprocess
import pandas as pd
d = {'text': ['no complaints', 'not bad']}
df = pd.DataFrame(data=d)
stop = stopwords.words('english')
stop.remove('no')
stop.remove('not')
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuation

data_words = list(sent_to_words(df))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop] for doc in texts]

data_words_nostops = remove_stopwords(data_words)
txt = df
txt = txt.apply(str)
# POS tagging
words = [word_tokenize(i) for i in sent_tokenize(txt['text'])]
pos_tags = [nltk.pos_tag(i) for i in words]

# Chunking
tokenized_text = word_tokenize(txt['text'])
tagged_token = nltk.pos_tag(tokenized_text)
grammar = "NP : {<DT>+<NNS>}"
phrases = nltk.RegexpParser(grammar)
result = phrases.parse(tagged_token)
print(result)
sia = SentimentIntensityAnalyzer()
def find_sentiment(post):
    if sia.polarity_scores(post)["compound"] > 0:
        return "Positive"
    elif sia.polarity_scores(post)["compound"] < 0:
        return "Negative"
    else:
        return "Neutral"
df['sentiment'] = df['text'].apply(lambda x: find_sentiment(x))
df['compound'] = [sia.polarity_scores(x)['compound'] for x in df['text']]
df
Output:
(S
0/CD
(NP no/DT complaints/NNS)
1/CD
not/RB
bad/JJ
Name/NN
:/:
text/NN
,/,
dtype/NN
:/:
object/NN)
   text           sentiment  compound
0  no complaints  Negative    -0.5994
1  not bad        Positive     0.4310
I understand that my current code does not incorporate the POS tagging and chunking into the sentiment analysis, but you can see that 'no complaints' is combined into one phrase; its current sentiment and sentiment score are still negative (-0.5994). The aim is to use POS tagging and assign the sentiment as positive... somehow, if possible!
Option 1
Use VADER sentiment analysis instead, which seems to handle such idioms better than NLTK does (NLTK actually incorporates VADER, but it seems to behave differently in such situations). There is no need to change anything in your code, except to install VADER, as described in its instructions, and then import the library as follows (while removing the import from nltk.sentiment):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Using VADER, you should get the following results. I've added one extra idiom (i.e., "no worries"), which would also be given a negative score if nltk's sentiment was used.
   text           sentiment  compound
0  no complaints  Positive     0.3089
1  not bad        Positive     0.4310
2  no worries     Positive     0.3252
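For reference, a minimal end-to-end sketch of Option 1 (assuming the vaderSentiment package is installed; the find_sentiment helper mirrors the one from the question):

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

df = pd.DataFrame(data={'text': ['no complaints', 'not bad', 'no worries']})
sia = SentimentIntensityAnalyzer()

def find_sentiment(post):
    # Same thresholds as in the question: positive above 0, negative below 0.
    compound = sia.polarity_scores(post)["compound"]
    if compound > 0:
        return "Positive"
    elif compound < 0:
        return "Negative"
    return "Neutral"

df['sentiment'] = df['text'].apply(find_sentiment)
df['compound'] = [sia.polarity_scores(x)['compound'] for x in df['text']]
print(df)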
Option 2
Modify NLTK's lexicon, as described here; however, it might not always work (as it probably accepts only single words, not idioms). Example below:
new_words = {
    'no complaints': 3.0
}
sia = SentimentIntensityAnalyzer()
sia.lexicon.update(new_words)
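A quick way to check whether the lexicon update actually changes the score (a sketch; as noted above, the multi-word key may simply never be matched during scoring):

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sia.lexicon.update({'no complaints': 3.0})
# If the multi-word entry is not matched, the compound score will be
# the same as before the update.
print(sia.polarity_scores('no complaints'))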
I am trying to clean some text data.
First I removed the stop words, then I tried to lemmatize the text. But words such as nouns are being removed.
Sample Data
https://drive.google.com/file/d/1p9SKWLSVYeNScOCU_pEu7A08jbP-50oZ/view?usp=sharing
Updated code:
# Libraries
import spacy
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['covid', 'COVID-19', 'coronavirus'])
article= pd.read_csv("testdata.csv")
data = article.title.values.tolist()
nlp = spacy.load('en_core_web_sm')
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuation

data_words = list(sent_to_words(data))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
print ("*** Text After removing Stop words: ")
print(data_words_nostops)
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PRON']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PRON'])
print ("*** Text After Lemmatization: ")
print(data_lemmatized)
The output after removing stop words is:
[['qaia', 'flags', 'amman', 'melbourne', 'jetstar', 'flights', 'recovery', 'plan'],
 ['western', 'amman', 'suburb', 'new', 'nsw', 'ground', 'zero', 'children'],
 ['flight', 'returned', 'amman', 'qaia', 'staff', 'contract', 'driving']]
The output after lemmatization is:
[['flight', 'recovery', 'plan'],
 ['suburb', 'ground'],
 ['return', 'contract', 'driving']]
For each record I do not understand the following:
- 1st record: why these words are removed: 'qaia', 'flags', 'amman', 'melbourne', 'jetstar'
- 2nd record: essential words are removed, same as in the first record; also, I was expecting 'children' to be converted to 'child'
- 3rd record: 'driving' is not converted to 'drive'
I was expecting that words such as 'Amman' would not be removed, that plural words would be converted to singular, and that verbs would be converted to the infinitive...
What am I missing here?
Thanks in advance
I'm guessing that most of your issues are because you're not feeding spaCy full sentences and it's not assigning the correct part-of-speech tags to your words. This can cause the lemmatizer to return the wrong results. However, since you've only provided snippets of code and none of the original text, it's difficult to answer this question. Next time consider boiling down your question to a few lines of code that someone else can run on their machine EXACTLY AS WRITTEN, and providing a sample input that fails. See Minimal Reproducible Example
Here's an example that works and is close to what you're doing.
import spacy
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
allow_postags = set(['NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'])
nlp = spacy.load('en')
text = 'The children in Amman and Melbourne are too young to be driving.'
words = []
for token in nlp(text):
    if token.text not in stop_words and token.pos_ in allow_postags:
        words.append(token.lemma_)
print(' '.join(words))
This returns child Amman Melbourne young drive
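Applied to the data from the question, the same idea means feeding each raw title to spaCy instead of a pre-tokenized, stop-word-stripped list. A sketch along those lines (assuming the testdata.csv with a title column from the question):

import spacy
import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
allow_postags = {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}
nlp = spacy.load('en_core_web_sm')

article = pd.read_csv("testdata.csv")

lemmatized_titles = []
for title in article.title.astype(str):
    # Lemmatize the full, untokenized title so spaCy can tag it correctly,
    # then drop stop words and unwanted parts of speech.
    doc = nlp(title)
    lemmatized_titles.append([t.lemma_ for t in doc
                              if t.text.lower() not in stop_words
                              and t.pos_ in allow_postags])
print(lemmatized_titles)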
I wish to extract noun-adjective pairs from this sentence. So, basically, I want something like:
(Mark, sincere) (John, sincere).
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "Mark and John are sincere employees at Google."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
spaCy's POS tagging would be a better choice than NLTK's; it's faster and more accurate. Here is an example of what you want to do:
import spacy
nlp = spacy.load('en')
doc = nlp(u'Mark and John are sincere employees at Google.')
noun_adj_pairs = []
for i, token in enumerate(doc):
    if token.pos_ not in ('NOUN', 'PROPN'):
        continue
    for j in range(i + 1, len(doc)):
        if doc[j].pos_ == 'ADJ':
            noun_adj_pairs.append((token, doc[j]))
            break
noun_adj_pairs
Output:
[(Mark, sincere), (John, sincere)]
I built a plaintext corpus, and the next step is to lemmatize all my texts. I'm using the WordNetLemmatizer and need the POS tag for each token in order to avoid the problem that, e.g., loving -> lemma = loving while love -> lemma = love...
The default WordNetLemmatizer POS tag is n (= noun), I think, but how can I use the output of pos_tag? I think the POS tags the WordNetLemmatizer expects are different from the tags pos_tag gives me. Is there a function or something that can help me?
In this line I think the word_pos is wrong, and that's the reason for the error:
lemma = wordnet_lemmatizer.lemmatize(word,word_pos)
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
corpus_root = 'C:\\Users\\myname\\Desktop\\TestCorpus'
lyrics = PlaintextCorpusReader(corpus_root,'.*')
for fileid in lyrics.fileids():
    tokens = word_tokenize(lyrics.raw(fileid))
    tagged_tokens = pos_tag(tokens)
    for tagged_token in tagged_tokens:
        word = tagged_token[0]
        word_pos = tagged_token[1]
        print(tagged_token[0])
        print(tagged_token[1])
        lemma = wordnet_lemmatizer.lemmatize(word, pos=word_pos)
        print(lemma)
Additional question: is pos_tag enough for my lemmatization, or do I need another tagger? My texts are lyrics...
You need to convert the tag from the pos_tagger to one of the four "syntactic categories" that wordnet recognizes, then pass that to the lemmatizer as the word_pos.
From the docs:
Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
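A common way to do that conversion (a sketch, not spelled out in the original answer) is to map the first letter of the Penn Treebank tag returned by pos_tag to the corresponding WordNet category, defaulting to noun:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(treebank_tag):
    # Map Penn Treebank tags (JJ*, VB*, NN*, RB*) to WordNet's four categories.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default, same as the lemmatizer's own default

for word, tag in pos_tag(word_tokenize("She was loving the songs he loved")):
    print(word, wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))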