Tokenizing without changing the length - python

I have a text file with comments and I need to do classification with it. The text file is read into a list with a length of 1000, but after tokenization, stemming, and stopword removal the length gets bigger. How can I tokenize without changing the length?
Here is my code:
!pip install nltk
import nltk
import pandas as pd

data = open('X_train.txt', encoding='utf8').readlines()

nltk.download('punkt')
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize

ps = PorterStemmer()

for word in data:
    if word in stopwords.words('english'):
        data.remove(word)

words_after = []
for w in data:
    tokens = word_tokenize(w)
    for token in tokens:
        stemmed_word = ps.stem(token)
        words_after.append(stemmed_word)

print(len(words_after))
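A minimal sketch of one way to keep the list length at 1000, under the assumption that the goal is one processed entry per comment rather than one flat list of all tokens (that assumption is not stated in the original post):

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

with open('X_train.txt', encoding='utf8') as f:
    data = f.readlines()

# one entry per comment: a list of stemmed, non-stopword tokens
processed = []
for comment in data:
    tokens = word_tokenize(comment)
    stems = [ps.stem(t) for t in tokens if t.lower() not in stop_words]
    processed.append(stems)

print(len(processed))  # stays equal to len(data)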

Neither stemmer nor lemmatizer seem to work very well, what should I do?

I am new to text analysis and am trying to create a bag of words model (using sklearn's CountVectorizer method). I have a data frame with a column of text with words like 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'.
I think that 'acid' and 'wood' should be the only words included in the final output, but neither stemming nor lemmatizing seems to accomplish this.
Stemming produces 'acid', 'wood', 'woodi', 'woodsi',
and lemmatizing produces a worse output of 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'. I assume this is due to the part of speech not being specified accurately, although I am not sure where this specification should go. I have included it in the line X = vectorizer.fit_transform(df['text'],'a') (I believe that most of the words should be adjectives); however, it does not make a difference in the output.
What can I do to improve the output?
My full code is below:
!pip install nltk
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
Data Frame:
df = pd.DataFrame()
df['text']=['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']
CountVectorizer with Stemmer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=stemmed_words)
X = vectorizer.fit_transform(df['text'])
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
CountVectorizer with Lemmatizer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemed_words(doc):
    return (lemmatizer.lemmatize(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=lemed_words)
X = vectorizer.fit_transform(df['text'],'a')
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
This might simply be an under-performance issue with the WordNetLemmatizer and the stemmer.
Try different ones, for example:
Stemmers:
Porter (-> from nltk.stem import PorterStemmer)
Lancaster (-> from nltk.stem import LancasterStemmer)
Lemmatizers:
spaCy (-> import spacy)
IWNLP (-> from spacy_iwnlp import spaCyIWNLP)
HanTa (-> from HanTa import HanoverTagger / note: trained more or less for German)
I had the same issue, and switching to a different stemmer and lemmatizer solved it. For closer instructions on how to properly implement the stemmers and lemmatizers, a quick search on the web reveals good examples for all of these cases; a sketch with spaCy follows below.
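As one concrete example of the suggestions above, here is a minimal sketch of plugging spaCy's lemmatizer into the CountVectorizer analyzer. It assumes the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm); that model choice is not part of the original answer.

import spacy
from sklearn.feature_extraction.text import CountVectorizer

# load a small English pipeline; tagger is enough for lemmas
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def spacy_lemmas(doc):
    # lemma of every alphabetic token in the document
    return [t.lemma_ for t in nlp(doc) if t.is_alpha]

vectorizer = CountVectorizer(analyzer=spacy_lemmas)
X = vectorizer.fit_transform(df['text'])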

Unstructured data, NLP Lemmatize Book Review

Here I am trying to read the contents of, say, 'book1.txt'. I have to remove all the special characters and punctuation marks and word-tokenize the content using nltk's word tokenizer,
lemmatize those tokens using WordNetLemmatizer,
and write those tokens into a csv file one by one.
Here is the code I am using. It obviously is not working, but I just need some suggestions on it, please.
import nltk
from nltk.stem import WordNetLemmatizer
import csv
from nltk.tokenize import word_tokenize

file_out = open('data.csv', 'w')

with open('book1.txt', 'r') as myfile:
    for s in myfile:
        words = nltk.word_tokenize(s)
        words = [word.lower() for word in words if word.isalpha()]
        for word in words:
            token = WordNetLemmatizer().lemmatize(words, 'v')
            filtered_sentence = [""]
            for n in words:
                if n not in token:
                    filtered_sentence.append("" + n)
                    file_out.writelines(filtered_sentence + ["\n"])
There are some issues here, most notably with the last two for loops.
The way you are doing it, the file ends up being written as follows:
word1
word1word2
word1word2word3
word1word2word3word4
........etc
I'm guessing that is not the expected output. I'm assuming the expected output is:
word1
word2
word3
word4
........etc (without creating duplicates)
I applied the code below to a 3 paragraph Cat Ipsum file. Note that I changed some variable names due to my own naming conventions.
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint

# read the text into a single string.
with open("book1.txt") as infile:
    text = ' '.join(infile.readlines())

words = word_tokenize(text)
words = [word.lower() for word in words if word.isalpha()]

# create the lemmatized word list
results = []
for word in words:
    # you were using words instead of word below
    token = WordNetLemmatizer().lemmatize(word, "v")
    # check if token not already in results.
    if token not in results:
        results.append(token)

# sort results, just because :)
results.sort()

# print and save the results
pprint(results)
print(len(results))

# write one lemma per line (writelines does not add newlines by itself)
with open("nltk_data.csv", "w") as outfile:
    outfile.writelines(token + "\n" for token in results)
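If a proper one-value-per-row CSV is wanted, the csv module could be used for the final write instead; a small sketch reusing the results list and the nltk_data.csv filename from above:

import csv

with open("nltk_data.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for token in results:
        # one lemma per row
        writer.writerow([token])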

Applying WordNetLemmatizer on a whole sentence shows an error with unknown pos

I want to apply NLTK's WordNetLemmatizer on a whole sentence. The problem is that I get an error:
KeyError: 'NNP'
It's like I'm getting an unknown 'pos' value, but I do not know why. I want to get the base form of the words, but without 'pos' it does not work.
Can you tell me what I am doing wrong?
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')

sentence = "I want to find the best way to lemmantize this sentence so that I can see better results of it"

taged_words = nltk.pos_tag(sentence)
print(taged_words)

lemmantised_sentence = []
lemmatizer = WordNetLemmatizer()
for word in taged_words:
    filtered_text_lemmantised = lemmatizer.lemmatize(word[0], pos=word[1])
    print(filtered_text_lemmantised)
    lemmantised_sentence.append(filtered_text_lemmantised)

lemmantised_sentence = ' '.join(lemmantised_sentence)
print(lemmantised_sentence)
The sentence should be split before sending it to the pos_tag function. Also, the pos argument of lemmatize accepts a different kind of string: it only takes the WordNet tags such as 'n', 'v' and so on, not the Treebank tags that pos_tag returns. I have updated your code using the mapping from https://stackoverflow.com/a/15590384/7349991.
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def main():
    nltk.download('averaged_perceptron_tagger')
    nltk.download('wordnet')
    sentence = "I want to find the best way to lemmantize this sentence so that I can see better results of it"
    taged_words = nltk.pos_tag(sentence.split())
    print(taged_words)
    lemmantised_sentence = []
    lemmatizer = WordNetLemmatizer()
    for word in taged_words:
        if word[1] == '':
            continue
        filtered_text_lemmantised = lemmatizer.lemmatize(word[0], pos=get_wordnet_pos(word[1]))
        print(filtered_text_lemmantised)
        lemmantised_sentence.append(filtered_text_lemmantised)
    lemmantised_sentence = ' '.join(lemmantised_sentence)
    print(lemmantised_sentence)

def get_wordnet_pos(treebank_tag):
    # map Treebank tags to the WordNet POS constants the lemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.ADV

if __name__ == '__main__':
    main()

Importing stopword dictionary to python

How can I import a specific stopword dictionary (an excel sheet) into Python and use it in addition to the nltk stopword list? Currently my stopword section looks like this:
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
Thanks in advance!
You can import an excel sheet using the pandas library. This example assumes that your stopwords are located in the first column, one word per row. Afterwards, create the union of the nltk stopwords and your own stopwords:
import pandas as pd
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# check pandas docs for more info on usage of read_excel
custom_words = pd.read_excel('your_file.xlsx', header=None, names=['mywords'])
# union of two sets
stop_words = stop_words | set(custom_words['mywords'])
words = [w for w in words if w not in stop_words]
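If the excel column might contain mixed case or stray whitespace, normalising it before taking the union could help. A small sketch building on the answer above; the file name and the mywords column name are the same placeholders already used there:

import pandas as pd
from nltk.corpus import stopwords

custom_words = (
    pd.read_excel('your_file.xlsx', header=None, names=['mywords'])['mywords']
    .astype(str)      # guard against non-string cells
    .str.strip()      # drop stray whitespace
    .str.lower()      # match the lowercased nltk stopwords
)
stop_words = set(stopwords.words('english')) | set(custom_words)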

Remove stopwords and tokenize for BigramCollocationFinder NLTK

I keep getting this error
sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
when I try to run this script. I am not sure what is wrong. I am essentially reading from a text file, filtering out the stopwords, and tokenizing the text using NLTK.
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
text_file=open('sentiment_test.txt', 'r')
lines=text_file.readlines()
filtered_words = [w for w in lines if not w in stopwords.words('english')]
print filtered_words
tokens = word_tokenize(str(filtered_words))
print tokens
finder = BigramCollocationFinder.from_words(tokens)
Any help would be much appreciated.
I am presuming that sentiment_test.txt is just plain text, and not a specific format.
You are trying to filter lines and not words. You should first tokenize and then filter the stopwords.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

with open('sentiment_test.txt', 'r') as text_file:
    text = text_file.read()

tokens = word_tokenize(str(text))
tokens = [w for w in tokens if w not in stopset]
print tokens
Hope this helps.
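To continue toward the collocations the question was actually after, here is a minimal sketch feeding the filtered tokens from the answer above into BigramCollocationFinder. The frequency threshold of 3 and the choice of PMI as the scoring measure are my own assumptions, not part of the original answer.

import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# drop infrequent bigrams; the threshold of 3 is an arbitrary choice
finder.apply_freq_filter(3)
# ten highest-scoring bigrams by pointwise mutual information
top_bigrams = finder.nbest(bigram_measures.pmi, 10)
print(top_bigrams)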
