Tokenize text - Very slow when doing it - python

Question
I have a data frame with more than 90,000 rows and a column ['text'] that contains the text of news articles.
The texts average about 3,000 words each, and running word_tokenize over them is very slow. What would be a more efficient way to do it?
from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()
Also, word_tokenize keeps some punctuation and other characters that I don't want, so I wrote a function to filter them out using spaCy.
from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if ~token.is_punct or ~token.is_stop or ~(token.text.lower() in spanish_stopwords) or ~(token.text.lower() in otherCharacters) or ~(token.text.lower() in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
Is there a better way to do this?
Thanks for reading my maybe noob👨🏽‍💻 question😀, have a nice day🌻.
Clarifications
nlp is defined beforehand:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()
I'm using spaCy to tokenize, but also NLTK's stop words for Spanish.

If you are only tokenizing, use a blank model (which only contains a tokenizer) instead of es_core_news_sm:
nlp = spacy.blank("es")

To make spaCy faster when you only want to tokenize, you can change:
nlp = es_core_news_sm.load()
To:
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
A small explanation:
spaCy loads a full language pipeline that not only tokenizes your sentence but also parses it and does POS and NER tagging. Most of the computation time goes to those other tasks (parse tree, POS, NER), not to tokenization, which is a much lighter task computationally.
But, as you can see, spaCy lets you use only what you actually need, which saves you some time.
Another thing: you can make your function more efficient by lowercasing each token only once and by adding the stop words to spaCy's vocabulary. (Even if you don't want to do that, keeping otherCharacters as a list rather than a set is inefficient for membership tests.)
I would also add this:
for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True
and then:
for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
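A side note on why the switch from `~ ... or ...` to `not ... and not ...` matters. The sketch below uses a hypothetical `Token` namedtuple as a stand-in for spaCy tokens (so it runs without spaCy), on the assumption that the real flags are plain Python booleans. Under that assumption, `~` is bitwise NOT and always yields a truthy integer, so the filter in the question would keep every token:

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the fields the filter uses.
Token = namedtuple("Token", ["text", "is_punct", "is_stop"])

tokens = [
    Token("El", False, True),     # stop word
    Token("gato", False, False),  # content word
    Token(",", True, False),      # punctuation
]

# ~ is bitwise NOT: ~True == -2 and ~False == -1, both truthy,
# so this condition is always true and nothing is filtered out.
kept_buggy = [t.text.lower() for t in tokens if ~t.is_punct or ~t.is_stop]

# Logical `not` combined with `and` keeps only tokens that pass every check.
kept_fixed = [t.text.lower() for t in tokens if not t.is_punct and not t.is_stop]

print(kept_buggy)  # ['el', 'gato', ',']
print(kept_fixed)  # ['gato']
```

Note also that the checks have to be joined with `and`: with `or`, a token would be kept as soon as it passed any single check.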

Related

Neither stemmer nor lemmatizer seem to work very well, what should I do?

I am new to text analysis and am trying to create a bag of words model(using sklearn's CountVectorizer method). I have a data frame with a column of text with words like 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'.
I think that 'acid' and 'wood' should be the only words included in the final output, however neither stemming nor lemmatizing seems to accomplish this.
Stemming produces 'acid', 'wood', 'woodi', 'woodsi'
and lemmatizing produces a worse output of 'acid' 'acidic' 'acidity' 'wood' 'woodsy' 'woody'. I assume this is due to the part of speech not being specified accurately although I am not sure where this specification should go. I have included it in the line X = vectorizer.fit_transform(df['text'],'a') (I believe that most of the words should be adjectives) however, it does not make a difference in the output.
What can I do to improve the output?
My full code is below;
!pip install nltk
import nltk
nltk.download('omw-1.4')
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
Data Frame:
df = pd.DataFrame()
df['text']=['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']
CountVectorizer with Stemmer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
vectorizer = CountVectorizer(stop_words='english',analyzer=stemmed_words)
X = vectorizer.fit_transform(df['text'])
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
CountVectorizer with Lemmatizer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def lemed_words(doc):
    return (lemmatizer.lemmatize(w) for w in analyzer(doc))
vectorizer = CountVectorizer(stop_words='english',analyzer=lemed_words)
X = vectorizer.fit_transform(df['text'],'a')
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
This might simply be the WordNetLemmatizer and the stemmer under-performing on these words.
Try different ones, such as:
Stemmers:
Porter (-> from nltk.stem import PorterStemmer)
Lancaster (-> from nltk.stem import LancasterStemmer)
Lemmatizers:
spaCy (-> import spacy)
IWNLP (-> from spacy_iwnlp import spaCyIWNLP)
HanTa (-> from HanTa import HanoverTagger; note: it is more or less trained for German)
I had the same issue, and switching to a different stemmer and lemmatizer solved it. For details on how to properly implement these stemmers and lemmatizers, a quick web search turns up good examples for each.

In Python, is there a way in any NLP library to combine words to state them as positive?

I have looked into this and couldn't find a way to do it the way I imagine. The term I am trying to group, as an example, is 'No complaints'. 'No' is normally picked up as a stopword, so I have manually removed it from the stopword list to ensure it stays in the data. However, both words are then picked up as negative words during the sentiment analysis. I want to combine them so they can be categorised as either Neutral or Positive. Is it possible to manually group such words or terms together and decide how they are analysed in the sentiment analysis?
I have found a way to group words using POS tagging and chunking, but this combines tags or multi-word expressions and doesn't necessarily lead to them being picked up correctly in the sentiment analysis.
Current code (using POS Tagging):
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize, MWETokenizer
import re, gensim, nltk
from gensim.utils import simple_preprocess
import pandas as pd
d = {'text': ['no complaints', 'not bad']}
df = pd.DataFrame(data=d)
stop = stopwords.words('english')
stop.remove('no')
stop.remove('not')
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations
data_words = list(sent_to_words(df))
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
txt = df
txt = txt.apply(str)
#pos tag
words = [word_tokenize(i) for i in sent_tokenize(txt['text'])]
pos_tag= [nltk.pos_tag(i) for i in words]
#chunking
tagged_token = nltk.pos_tag(tokenized_text)
grammar = "NP : {<DT>+<NNS>}"
phrases = nltk.RegexpParser(grammar)
result = phrases.parse(tagged_token)
print(result)
sia = SentimentIntensityAnalyzer()
def find_sentiment(post):
    if sia.polarity_scores(post)["compound"] > 0:
        return "Positive"
    elif sia.polarity_scores(post)["compound"] < 0:
        return "Negative"
    else:
        return "Neutral"
df['sentiment'] = df['text'].apply(lambda x: find_sentiment(x))
df['compound'] = [sia.polarity_scores(x)['compound'] for x in df['text']]
df
Output:
(S
0/CD
(NP no/DT complaints/NNS)
1/CD
not/RB
bad/JJ
Name/NN
:/:
text/NN
,/,
dtype/NN
:/:
object/NN)
|   | text          | sentiment | compound |
|:--|:--------------|:----------|:---------|
| 0 | no complaints | Negative  | -0.5994  |
| 1 | not bad       | Positive  | 0.4310   |
I understand that my current code does not incorporate the POS tagging and chunking into the sentiment analysis, but you can see that 'no complaints' is grouped together. However, its current sentiment is Negative (compound score -0.5994). The aim is to use POS tagging to assign a positive sentiment... somehow, if possible!
Option 1
Use the standalone VADER sentiment analysis package instead, which seems to handle such idioms better than NLTK does (NLTK actually incorporates VADER, but seems to behave differently in such situations). You don't need to change anything in your code except to install VADER, as described in its instructions, and then import the library as follows (removing the import from nltk.sentiment):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Using VADER, you should get the following results. I've added one extra idiom (i.e., "no worries"), which would also be given a negative score if nltk's sentiment was used.
   text           sentiment  compound
0  no complaints  Positive   0.3089
1  not bad        Positive   0.4310
2  no worries     Positive   0.3252
Option 2
Modify NLTK's lexicon; however, this might not always work (it probably accepts only single words, not idioms). Example below:
new_words = {
    'no complaints': 3.0
}
sia = SentimentIntensityAnalyzer()
sia.lexicon.update(new_words)
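To illustrate the idea of scoring a phrase as a unit rather than word by word, here is a minimal toy sketch. It is not VADER's algorithm and does not use NLTK; the lexicons `PHRASE_SCORES` and `WORD_SCORES`, their scores, and the helpers `score` and `label` are all made up for illustration. It checks two-word phrases first and only falls back to single words when no phrase matches:

```python
# Toy lexicons: all scores are invented for illustration, not taken from VADER.
PHRASE_SCORES = {("no", "complaints"): 3.0, ("not", "bad"): 1.5}
WORD_SCORES = {"no": -1.0, "complaints": -1.5, "bad": -2.5}

def score(text):
    """Sum lexicon scores, preferring two-word phrase matches over single words."""
    words = text.lower().split()
    total, i = 0.0, 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASE_SCORES:
            total += PHRASE_SCORES[pair]
            i += 2  # consume both words of the matched phrase
        else:
            total += WORD_SCORES.get(words[i], 0.0)
            i += 1
    return total

def label(text):
    """Map the summed score to a coarse sentiment label."""
    s = score(text)
    if s > 0:
        return "Positive"
    if s < 0:
        return "Negative"
    return "Neutral"

print(label("no complaints"))  # Positive
print(label("complaints"))     # Negative
```

Whether a real analyzer can be made to behave this way depends on how it tokenizes its input, so treat this only as a picture of the goal.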

Punctuation, stopwords and lemmatization with spacy

I'm trying to apply punctuation removal, stopword removal and lemmatization to a list of strings.
I tried to use lemma_, is_stop and is_punct:
data = ['We will pray and hope for the best',
'Though it may not make landfall all week if it follows that track',
'Heavy rains, capable of producing life-threatening flash floods, are possible']
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en")
doc = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num] for doc in data]
I have the following error:
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'
(same problem for is_stop and is_punct)
In the outer loop you iterate over data, the unprocessed list of strings, but you need to iterate over the processed Doc objects instead.
Also, your variable names are confusing (doc holds a list of Docs and is then reused as the loop variable). The following naming should be clearer:
docs = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if (not w.is_stop and not w.is_punct and not w.like_num)] for doc in docs]

How to filter tokens from spaCy document

I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the sequence of filtered tokens, but I am interested in having the actual Doc structure.
text = u"This document is only an example. " \
    "I would like to create a custom pipeline that will remove specific tokens from the final document."
doc = nlp(text)
def keep_token(tok):
    # This is only an example rule
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}
final_tokens = list(filter(keep_token, doc))
# How to get a spacy.Doc from final_tokens?
I tried to reconstruct a new spaCy Doc from the list of tokens, but the API does not make it clear how to do this.
You have probably found a solution by now, but since none is posted here I thought it might be useful to add one.
You can remove tokens by converting the doc to a numpy array, deleting the rows for those tokens, and then converting back to a doc.
Code:
import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy
def remove_tokens_on_match(doc):
    indexes = []
    for index, token in enumerate(doc):
        if (token.pos_ in ('PUNCT', 'NUM', 'SYM')):
            indexes.append(index)
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    np_array = numpy.delete(np_array, indexes, axis = 0)
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    return doc2
# load english model
nlp = spacy.load('en')
doc = nlp(u'This document is only an example. \
I would like to create a custom pipeline that will remove specific tokens from \
the final document.')
print(remove_tokens_on_match(doc))
You can look to a similar question that I answered here.
Depending on what you want to do there are several approaches.
1. Get the original Document
Tokens in SpaCy have references to their document, so you can do this:
original_doc = final_tokens[0].doc
This way you can still get PoS, parse data etc. from the original sentence.
2. Construct a new document without the removed tokens
You can append the strings of all the tokens with whitespace and create a new document. See the token docs for information on text_with_ws.
doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens)))
This is probably not going to give you what you want though - PoS tags will not necessarily be the same, and the resulting sentence may not make sense.
If neither of those was what you had in mind, let me know and maybe I can help.
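One caveat with joining on text_with_ws in approach 2 is worth a quick sketch. The triples below are hypothetical stand-ins for spaCy tokens (text, trailing whitespace, punctuation flag), assuming the trailing space of "example." belongs to the period rather than to the word. If that holds, dropping the period also drops the space:

```python
# Hypothetical (text, trailing_whitespace, is_punct) triples standing in for spaCy tokens.
tokens = [
    ("an", " ", False),
    ("example", "", False),  # no trailing space: the period follows directly
    (".", " ", True),        # the space after the sentence belongs to the period
    ("I", " ", False),
    ("would", "", False),
]

# Drop punctuation, as keep_token would in the question.
kept = [t for t in tokens if not t[2]]

# Joining text + trailing whitespace, as in the text_with_ws approach.
joined_with_ws = "".join(text + ws for text, ws, _ in kept)

# Joining the bare texts with a single space instead.
joined_with_space = " ".join(text for text, _, _ in kept)

print(joined_with_ws)     # an exampleI would
print(joined_with_space)  # an example I would
```

Joining the plain token texts with a single space avoids the fused words, at the cost of not preserving the original spacing.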

How to remove stop words from documents in gensim?

I'm building an NLP chat application using the Doc2Vec technique in Python with the gensim package. I have already done tokenizing and stemming. I want to remove the stop words (to test whether it works better) from both the training set and the question the user asks.
Here is my code.
import gensim
import nltk
from gensim import models
from gensim import utils
from gensim import corpora
from nltk.stem import PorterStemmer
ps = PorterStemmer()
sentence0 = models.doc2vec.LabeledSentence(words=[u'sampl',u'what',u'is'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'sampl',u'tell',u'me',u'about'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'elig',u'what',u'is',u'my'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'limit', u'what',u'is',u'my'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'claim',u'how',u'much',u'can',u'I'],tags=["SENT_4"])
sentence5 = models.doc2vec.LabeledSentence(words=[u'retir',u'i',u'am',u'how',u'much',u'can',u'elig',u'claim'],tags=["SENT_5"])
sentence6 = models.doc2vec.LabeledSentence(words=[u'resign',u'i',u'have',u'how',u'much',u'can',u'i',u'claim',u'elig'],tags=["SENT_6"])
sentence7 = models.doc2vec.LabeledSentence(words=[u'promot',u'what',u'is',u'my',u'elig',u'post',u'my'],tags=["SENT_7"])
sentence8 = models.doc2vec.LabeledSentence(words=[u'claim',u'can,',u'i',u'for'],tags=["SENT_8"])
sentence9 = models.doc2vec.LabeledSentence(words=[u'product',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_9"])
sentence10 = models.doc2vec.LabeledSentence(words=[u'hotel',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_10"])
sentence11 = models.doc2vec.LabeledSentence(words=[u'onlin',u'product',u'can',u'i',u'for',u'bought',u'through',u'claim',u'sampl'],tags=["SENT_11"])
sentence12 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'guidelin',u'where',u'do',u'i',u'apply',u'form',u'sampl'],tags=["SENT_12"])
sentence13 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'procedur',u'rule',u'and',u'regul',u'what',u'is',u'the',u'for'],tags=["SENT_13"])
sentence14 = models.doc2vec.LabeledSentence(words=[u'can',u'i',u'submit',u'expenditur',u'on',u'behalf',u'of',u'my',u'friend',u'and',u'famili',u'claim',u'and',u'reimburs'],tags=["SENT_14"])
sentence15 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'procedur',u'can',u'i',u'submit',u'from',u'shopper stop',u'claim'],tags=["SENT_15"])
sentence16 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'can',u'i',u'submit',u'from',u'pantaloon',u'claim'],tags=["SENT_16"])
sentence17 = models.doc2vec.LabeledSentence(words=[u'invoic',u'procedur',u'can',u'i',u'submit',u'invoic',u'from',u'spencer',u'claim'],tags=["SENT_17"])
# User asks a question.
document = input("Ask a question:")
tokenized_document = list(gensim.utils.tokenize(document, lowercase = True, deacc = True))
#print(type(tokenized_document))
stemmed_document = []
for w in tokenized_document:
    stemmed_document.append(ps.stem(w))
sentence19 = models.doc2vec.LabeledSentence(words= stemmed_document, tags=["SENT_19"])
# Building vocab.
sentences = [sentence0,sentence1,sentence2,sentence3, sentence4, sentence5,sentence6, sentence7, sentence8, sentence9, sentence10, sentence11, sentence12, sentence13, sentence14, sentence15, sentence16, sentence17, sentence19]
#I tried to remove the stop words but it didn't work out as LabeledSentence object has no attribute lower.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in sentences]
..
Is there a way I can remove stop words from sentences directly and get a new set of vocab without stop words ?
Your sentences object is already a list of LabeledSentence objects. You construct these above; each includes a list of strings in words and a list of strings in tags.
So each item in that list (document in your list comprehension) can't have a string method like .lower() applied to it. (Nor would it need .split(), as its words are already separate tokens.)
The cleanest approach would be to remove stop words from the lists of words before they're used to construct LabeledSentence objects. For example, you could define a function remove_stopwords() at the top. Then your lines creating LabeledSentence objects could instead look like:
sentence0 = LabeledSentence(words=remove_stopwords([u'sampl', u'what', u'is']),
                            tags=["SENT_0"])
Alternatively, you could mutate the existing LabeledSentence objects so that each of their words attributes now lacks stop words. This would replace your last line with something more like:
for doc in sentences:
    doc.words = [word for word in doc.words if word not in stoplist]
texts = sentences
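As a rough sketch of the first approach, the snippet below filters the word lists before building the document objects. `TaggedDoc` is a hypothetical namedtuple standing in for gensim's class (so the snippet runs without gensim), and `remove_stopwords` is one possible way to write the helper, not an existing gensim function:

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's document class: just `words` and `tags`.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Same stop list as in the question.
stoplist = set('for a of the and to in'.split())

def remove_stopwords(words):
    """Return the tokens that are not in the stop list, preserving order."""
    return [w for w in words if w not in stoplist]

# Two of the already stemmed word lists from the question.
raw = [
    ['claim', 'can,', 'i', 'for'],
    ['reimburs', 'procedur', 'rule', 'and', 'regul', 'what', 'is', 'the', 'for'],
]

# Filter each word list first, then build the tagged documents from the result.
sentences = [
    TaggedDoc(words=remove_stopwords(words), tags=["SENT_%d" % i])
    for i, words in enumerate(raw)
]

print(sentences[0].words)  # ['claim', 'can,', 'i']
print(sentences[1].words)  # ['reimburs', 'procedur', 'rule', 'regul', 'what', 'is']
```

The user's question would go through the same remove_stopwords() call after stemming, so that training texts and queries are preprocessed identically.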
Separately, things you didn't ask but should know:
TaggedDocument is now the preferred example-class name for Doc2Vec text objects – but in fact any object that has the two required properties words and tags will work fine.
Doc2Vec doesn't show many of the desired properties on tiny, toy-sized datasets – don't be surprised if a model built on dozens of sentences does not do anything useful, or misleads about what preprocessing/meta-parameter options are best. (Tens of thousands of texts, and texts at least tens-of-words long, are much better for meaningful results.)
Much Word2Vec/Doc2Vec work doesn't bother with stemming or stop-word removal, but it may sometimes be helpful.
