Stopwords coming up in most influential words - python

I am running some NLP code, trying to find the most influential (positively or negatively) words in a survey. My problem is that, while I successfully add some extra stopwords to the NLTK stopwords file, they keep coming up as influential words later on.
So, I have a dataframe, first column contains scores, second column contains comments.
I add extra stopwords:
stopwords = stopwords.words('english')
extra = ['Cat', 'Dog']
stopwords.extend(extra)
I checked that they were added by comparing len(stopwords) before and after.
I create this function to remove punctuation and stopwords from my comments:
def text_process(comment):
    nopunc = [char for char in comment if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords]
I run the model (not going to include the whole code since it doesn't make a difference):
corpus = df['Comment']
y = df['Label']
vectorizer = CountVectorizer(analyzer=text_process)
x = vectorizer.fit_transform(corpus)
...
And then to get the most influential words:
feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), nb.coef_[0])}
for best_positive in sorted(
        feature_to_coef.items(),
        key=lambda x: x[1],
        reverse=True)[:20]:
    print(best_positive)
But, Cat and Dog are in the results.
What am I doing wrong, any ideas?
Thank you very much!

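It looks like the problem is that the extra stopwords 'Cat' and 'Dog' are capitalized.
In your text_process function, the check if word.lower() not in stopwords only works when the stopwords themselves are lower case: 'Cat' is lowercased to 'cat', which does not match the list entry 'Cat'. Add the extra words in lower case ('cat', 'dog') instead.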
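A minimal sketch of that fix. The short base_stopwords list is a hypothetical stand-in for stopwords.words('english') so that the snippet runs without NLTK, and the sample comment is made up:

```python
import string

# Hypothetical stand-in for stopwords.words('english')
base_stopwords = ["the", "a", "is"]
extra = ["Cat", "Dog"]

# Lowercase the extra words so they match the word.lower() comparison below
stop_set = set(base_stopwords) | {w.lower() for w in extra}

def text_process(comment):
    # Drop punctuation characters, then drop stopwords case-insensitively
    nopunc = "".join(ch for ch in comment if ch not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_set]

result = text_process("The Cat is friendly, the dog barks!")
print(result)
```

Using a set rather than a list also makes each membership test faster, which can matter when the vectorizer calls the analyzer on every comment.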

Related

Text Clustering and Visualize Similarity in Python

I'm analyzing a book of 1167 pages (a txt file). So far I've preprocessed the data (cleaning, punctuation removal, stop word removal, tokenization).
Now, how can I cluster the text and visualize the similarity (plot the clusters)?
For example -
text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"
In my task, I divided the whole book into chapters: text1 is chapter 1, text2 is chapter 2, and so on. Now I want to compare chapters 1 and 2 like this.
Example code that I used for data preprocessing:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(pages[1])
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
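One possible starting point for comparing two chapters, sketched with the standard library only: treat each chapter's cleaned word list as a bag of words and compute the cosine similarity of the two count vectors. The function name and the two tiny word lists are hypothetical; in practice the inputs would be the words lists produced by the preprocessing above. This is only a sketch of the similarity step, not a full clustering pipeline.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of words (1.0 = same distribution)."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Only words present in both chapters contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty chapter is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical cleaned word lists for two chapters
chapter_1 = ["conviction", "people", "understanding", "people"]
chapter_2 = ["venture", "authorship", "people", "invocation"]
print(cosine_similarity(chapter_1, chapter_2))
```

Computing this for every pair of chapters gives a similarity matrix, which is the usual input for a heatmap or for a clustering step.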

why does if not in(x,y) not work at all in python

I want to keep a word from each row of my column only if it is neither a stopword nor a punctuation character.
This is my data after tokenizing and removing the stopwords. I also want to remove the punctuation at the same time as the stopwords; note the comma after usf in row 2. I tried if word not in (stopwords,string.punctuation), which I had seen suggested elsewhere, expecting it to mean "not in stopwords and not in string.punctuation", but it removes neither the stopwords nor the punctuation. How can I fix this?
data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                   sep='\t', header=None)
data.columns = ['label','text']
stopwords = set(stopwords.words('english'))
def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data
data['text'] = data['text'].apply(process)
If you still want to do it in a single if condition, you could convert string.punctuation to a set and combine it with stopwords using the set union operator (|). This is how it would look:
data = [word for word in data if word not in (stopwords|set(string.punctuation))]
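The tuple (stopwords,string.punctuation) contains just two elements, the set and the string themselves, so word not in (...) is true for every token. You need to change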
data = [word for word in data if word not in (stopwords,string.punctuation)]
to
data = [word for word in data if word not in stopwords and word not in string.punctuation]
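To see the difference concretely, here is a small sketch with a hypothetical three-word stopword set and a made-up token list (no NLTK needed). It compares the tuple test from the question with the two corrected forms:

```python
import string

# Hypothetical stand-ins for the NLTK stopword set and a tokenized message
stopwords = {"i", "the", "and"}
tokens = ["i", "think", ",", "the", "usf", "!"]

# The tuple holds two objects (a set and a string); no token is equal to
# either object, so nothing is filtered out.
broken = [w for w in tokens if w not in (stopwords, string.punctuation)]

# Test membership in each collection separately...
fixed = [w for w in tokens if w not in stopwords and w not in string.punctuation]

# ...or merge both into one set first and test once.
excluded = stopwords | set(string.punctuation)
also_fixed = [w for w in tokens if w not in excluded]

print(broken)
print(fixed)
print(also_fixed)
```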
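Alternatively, in the function process you can convert both collections to pandas.core.series.Series and combine them with concat. Because in on a Series tests the index rather than the values, turn the combined Series into a set before the membership test.
The function will be:
def process(df):
    # Series of stopwords + Series of single punctuation characters, collected into a set of values
    excluded = set(pd.concat([pd.Series(list(stopwords)), pd.Series(list(string.punctuation))]))
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in excluded]
    return data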

Code to create a reliable Language model from my own corpus

I have a corpus of sentences in a specific domain.
I am looking for an open-source package that I can feed the data to and that will generate a good, reliable language model (meaning: given a context, it knows the probability of each word).
Is there such a project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower() # Lower case all characters in word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        # Divide by the number of tokens actually counted (after the filtering and padding above)
        return unigram_fdist[word] / unigram_fdist.N()
    else:
        return -1 # You should figure out how you want to handle out-of-vocabulary words
def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1 # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
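For a feel of the counting involved, here is a rough standard-library sketch of the same bigram idea on a made-up two-sentence corpus. The names (corpus, bigram_probability) are hypothetical, and unlike the code above it simply returns 0.0 for unseen pairs instead of falling back to the unigram probability:

```python
from collections import Counter, defaultdict

START, END = "<<START>>", "<<END>>"

# Hypothetical toy corpus: each sentence is already a list of lowercase words
corpus = [
    ["i", "am", "the", "walrus"],
    ["i", "am", "the", "eggman"],
]

# bigram_counts[first][second] = how often `second` follows `first`
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    padded = [START] + sentence + [END]
    for first, second in zip(padded, padded[1:]):
        bigram_counts[first][second] += 1

def bigram_probability(word1, word2):
    """Estimate P(word2 | word1) by relative frequency; 0.0 if unseen."""
    followers = bigram_counts.get(word1)
    if not followers:
        return 0.0
    return followers[word2] / sum(followers.values())

print(bigram_probability("the", "walrus"))
print(bigram_probability("i", "am"))
```

A real model would also need some form of smoothing so that unseen word pairs do not get probability zero.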
You might try word_language_model from the PyTorch examples. One possible issue if you have a big corpus: it loads all the data into memory.

Remove Stop Words Python

So I am reading in a csv file and then getting the words in it. I am trying to remove stop words. Here is my code.
import pandas as pd
from nltk.corpus import stopwords as sw
def loadCsv(fileName):
    df = pd.read_csv(fileName, error_bad_lines=False)
    df.dropna(inplace = True)
    return df
def getWords(dataframe):
    words = []
    for tweet in dataframe['SentimentText'].tolist():
        for word in tweet.split():
            word = word.lower()
            words.append(word)
    return set(words) #Create a set from the words list
def removeStopWords(words):
    for word in words: # iterate over word_list
        if word in sw.words('english'):
            words.remove(word) # remove word from filtered_word_list if it is a stopword
    return set(words)
df = loadCsv("train.csv")
words = getWords(df)
words = removeStopWords(words)
On this line
if word in sw.words('english'):
I get the following error.
exception: no description
Further down the line I am going to try to remove punctuation, any pointers for that too would be great.
Any help is much appreciated.
EDIT
def removeStopWords(words):
    filtered_word_list = words #make a copy of the words
    for word in words: # iterate over words
        if word in sw.words('english'):
            filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword
    return set(filtered_word_list)
Here is a simplified version of the problem, without pandas. I believe the issue with the original code is that it modifies the set words while iterating over it. With a conditional comprehension we can instead test each word and build a new collection, ultimately returning a set, as in the original code.
from nltk.corpus import stopwords as sw
def removeStopWords(words):
    stop_words = set(sw.words('english'))  # build the lookup set once, not once per word
    return set([w for w in words if w not in stop_words])
sentence = 'this is a very common english sentence with a finite set of words from my imagination'
words = set(sentence.split())
print(removeStopWords(words))
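The failure mode can be reproduced without NLTK. In this sketch the stopword set is a hypothetical stand-in; removing from a set while looping over it raises RuntimeError. The EDIT version in the question has the same problem, because filtered_word_list = words binds a second name to the same set rather than copying it.

```python
# Hypothetical stand-in for the NLTK stopword list
stop_words = {"this", "is", "a"}

words = {"this", "is", "a", "sentence"}
raised = False
try:
    for word in words:
        if word in stop_words:
            words.remove(word)  # mutates the set that is being iterated
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Safe alternative: build a new set instead of mutating the original
words = {"this", "is", "a", "sentence"}
filtered = {w for w in words if w not in stop_words}
print(filtered)
```

If in-place removal is really needed, iterating over a copy (for word in set(words)) while removing from the original also avoids the error.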
Change the removeStopWords function to the following:
def getFilteredStopWords(words):
    list_stopWords = list(set(sw.words('english')))
    filtered_words = [w for w in words if w not in list_stopWords]  # keep only the words that are not stopwords
    return filtered_words
from nltk.corpus import stopwords
def remove_stopwords(sentence):
    list_stop_words = set(stopwords.words('english'))
    words = sentence.split(' ')
    filtered_words = [w for w in words if w not in list_stop_words]
    sentence_list = ' '.join(w for w in filtered_words)
    return sentence_list

filtering stopwords near punctuation

I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then remove the stopwords. A tokenizer usually recognizes punctuation.
import nltk
text = ('Son, if you really want something in this life, '
        'you have to work for it. Now quiet! They are about '
        'to announce the lottery numbers.')
stopwords = ['in', 'to', 'for', 'the']
sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))
print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('[A-Za-z]*', word).group().lower() not in (stopwords)])
This extracts the leading run of ASCII letters from each word and checks it against your stopwords set or list. Because of the * quantifier, a token that starts with a non-letter yields an empty match instead of None, so it is kept. It also assumes that all the words in stopwords are lowercase, which is why I convert each word to lowercase. Take that out if I made too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
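A slightly more defensive variant of the regex idea, sketched with a hypothetical stopword set and function name. When the letters of a token form a stopword, it keeps any trailing punctuation (so 'to.' becomes '.') rather than dropping the whole token, which should help a later sent_tokenize call:

```python
import re

# Hypothetical stopword set; swap in your own
STOPWORDS = {"in", "to", "for", "the"}

def filter_stopwords(text, stopwords=STOPWORDS):
    kept = []
    for token in text.split():
        # Leading run of ASCII letters; may be empty, but the match never fails
        match = re.match(r"[A-Za-z]*", token)
        core = match.group().lower()
        if core not in stopwords:
            kept.append(token)
        else:
            # Stopword: keep only what follows the letters, e.g. the "." in "to."
            rest = token[match.end():]
            if rest:
                kept.append(rest)
    return " ".join(kept)

cleaned = filter_stopwords("You have to work for the win. Go to.")
print(cleaned)
```

Note that this only looks at the letters at the start of each token, so a token such as '(the' would not be recognized as a stopword; a proper tokenizer handles such cases better.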
