I have a txt file containing stopwords, and I want to remove those stopwords from the sentences in a dataframe. I tried this:
f = open("stopwords.txt", "r")
stopword_list = []
for line in f:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    stopword_list.append(line_list[0])
f.close()
len(stopword_list)
tokens_without_sw = [word for word in tokenized_tweets if not word in stopword_list]
print("After stopwords removed")
print(tokens_without_sw)
but it doesn't change anything; the stopwords in the list are not removed.
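One possible cause, worth checking before the answer below (an assumption, since tokenized_tweets isn't shown): if tokenized_tweets is a list of token lists, one per tweet, the comprehension compares each whole list against stopword_list, so nothing ever matches. A sketch of the nested version, under that assumption:
# assumes tokenized_tweets looks like [['this', 'is', 'a', 'tweet'], ...]
tokens_without_sw = [[word for word in tweet if word not in stopword_list]
                     for tweet in tokenized_tweets]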
Similar to your other question, you can use re.sub or Series.str.replace with a regex that looks for any of the words in your stopword_list, surrounded by word boundaries, and replaces them with nothing.
I'm assuming stopword_list has already been read.
import re
stopword_list = ["tweet", "not"]
escaped_words = "|".join(re.escape(word) for word in stopword_list)
print(repr(escaped_words))
# 'tweet|not'
regex = fr"\b({escaped_words})\b"
print(repr(regex))
# '\\b(tweet|not)\\b'
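As an aside, the same pattern works outside pandas with plain re.sub on a single string. A minimal sketch, reusing the regex built above:
sample = "Not another tweet"
# case-insensitive removal of any stopword as a whole word
print(re.sub(regex, "", sample, flags=re.IGNORECASE).strip())
# 'another'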
Now, call Series.str.replace with case=False to do a case-insensitive match:
df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})
df['clean'] = df['tweets'].str.replace(regex, '', case=False, regex=True)
which gives:
                tweets      clean
0      this is a tweet  this is a
1  this is not a tweet  this is a
2                   no         no
3        Another tweet    Another
4    Not another tweet    another
5            Tweet not
Note that this leaves two spaces where a word was removed. That is easy to clean up the same way we removed the words: the regex is simply r"\s{2,}", which matches two or more consecutive whitespace characters; replace each match with a single space and strip the ends.
df['clean'] = df['tweets'].str.replace(regex, '', case=False, regex=True).str.replace(r"\s{2,}", " ", regex=True).str.strip()
                tweets      clean
0      this is a tweet  this is a
1  this is not a tweet  this is a
2                   no         no
3        Another tweet    Another
4    Not another tweet    another
5            Tweet not
I want to select words only if each word in the rows of my column is not in the stopwords and not in string.punctuation.
This is my data after tokenizing and removing the stopwords; I also want to remove the punctuation at the same time as the stopwords. See in row two: after usf there's a comma. I thought of if word not in (stopwords, string.punctuation), since that would mean not in stopwords and not in string.punctuation (an approach I'd seen suggested), but it fails to remove both the stopwords and the punctuation. How do I fix this?
data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
sep='\t', header=None)
data.columns = ['label','text']
stopwords = set(stopwords.words('english'))
def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords, string.punctuation)]
    return data
data['text'] = data['text'].apply(process)
If you still want to do it in one if statement, you could convert string.punctuation to a set and combine it with stopwords with an OR (set union). This is how it would look:
data = [word for word in data if word not in (stopwords|set(string.punctuation))]
You need to change
data = [word for word in data if word not in (stopwords, string.punctuation)]
to
data = [word for word in data if word not in stopwords and word not in string.punctuation]
because (stopwords, string.punctuation) is a two-element tuple holding the whole set and the whole string, so no individual word is ever found in it.
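Either way, the corrected function would look something like this (a sketch, assuming the imports and the stopwords set defined above):
def process(df):
    data = word_tokenize(df.lower())
    # a token survives only if it is neither a stopword nor punctuation
    data = [word for word in data if word not in stopwords and word not in string.punctuation]
    return data
data['text'] = data['text'].apply(process)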
In the function process, you could convert the stopwords and the punctuation characters to a pandas Series and combine them with concat. Note that pd.concat needs Series (not a set or a bare string), and membership must be tested against the values, so the function would be:
def process(df):
    data = word_tokenize(df.lower())
    unwanted = pd.concat([pd.Series(list(stopwords)), pd.Series(list(string.punctuation))])
    data = [word for word in data if word not in unwanted.values]
    return data
I am relatively new to NLP, so please be gentle. I have a complete list of the text from Trump's tweets since taking office, and I am tokenizing the text to analyze the content.
I am using the TweetTokenizer from the nltk library in Python, and I'm trying to get everything tokenized except for numbers and punctuation. The problem is that my code removes all the tokens except one.
I have tried using the .isalpha() method, but it did not work; I thought it would, since it should only be True for strings composed of alphabetic characters.
#Create a content from the tweets
text= non_re['text']
#Make all text in lowercase
low_txt= [l.lower() for l in text]
#Iteratively tokenize the tweets
TokTweet= TweetTokenizer()
tokens = [TokTweet.tokenize(t) for t in low_txt
          if t.isalpha()]
My output from this is just one token.
If I remove the if t.isalpha() statement, then I get all of the tokens, including numbers and punctuation, suggesting that isalpha() is to blame for the over-trimming.
What I would like, is a way to get the tokens from the tweet text without punctuation and numbers.
Thanks for your help!
Try something like below:
import string
import re
import nltk
from nltk.tokenize import TweetTokenizer
tweet = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play"
def clean_text(text):
    # remove numbers
    text_nonum = re.sub(r'\d+', '', text)
    # remove punctuation and convert characters to lower case
    text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
    # substitute multiple whitespace with a single whitespace
    # also removes leading and trailing whitespace
    text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
    return text_no_doublespace
cleaned_tweet = clean_text(tweet)
tt = TweetTokenizer()
print(tt.tokenize(cleaned_tweet))
output:
['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'its', 'kids', 'movie', 'watch', 'it', 'cant', 'help', 'enjoy', 'it', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it', 'danny', 'glover', 'superb', 'could', 'play']
# Function for removing punctuation from text; it also reports the total number of punctuation marks removed
# Input: takes the existing file name and the new file name as strings, i.e. 'existingFileName.txt' and 'newFileName.txt'
# Return: returns two things: the punctuation-free file opened in read mode, and a punctuation count
def removePunctuation(tokenizeSampleText, newFileName):
    from nltk.tokenize import word_tokenize
    existingFile = open(tokenizeSampleText, 'r')
    read_existingFile = existingFile.read()
    tokenize_existingFile = word_tokenize(read_existingFile)
    puncRemovedFile = open(newFileName, 'w+')
    import string
    stringPun = list(string.punctuation)
    count_pun = 0
    for word in tokenize_existingFile:
        if word in stringPun:
            count_pun += 1
        else:
            # keep the word, followed by a space
            puncRemovedFile.write(word + ' ')
    existingFile.close()
    puncRemovedFile.close()
    return open(newFileName, 'r'), count_pun
punRemoved, punCount = removePunctuation('Macbeth.txt', 'Macbeth-punctuationRemoved.txt')
print(f'Total Punctuation : {punCount}')
punRemoved.read()
I'm attempting to remove all the stop words from text input. The code below removes all the stop words, except ones that begin a sentence.
How do I remove those words?
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
stopwords_nltk_en = set(stopwords.words('english'))
from string import punctuation
exclude_punctuation = set(punctuation)
stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)
def normalized_text(text):
    lemma = WordNetLemmatizer()
    stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
    normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
    return normalized
sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]
for item in sentence:
    print (normalized_text(str(item)))
OUTPUT:
the bird always house
in hill bird nest
The culprit is this line of code:
print (normalized_text(str(item)))
If you try to print str(item) for the first element of your sentence list, you'll get:
['The birds are always in their house.']
which, then, lowered and split becomes:
["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]
As you can see, the first element is ['the, which does not match the stopword the.
Solution: use ''.join(item) to convert item to a str.
Edit after comment
Inside the text string there are still some apostrophes ('). To solve this, call normalized_text as:
for item in sentence:
    print (normalized_text(item))
Then, import the regex module with import re and change:
text.lower().split()
to:
re.split('\'| ', ''.join(text).lower())
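Combining the two changes, the revised function would look something like this (a sketch, assuming the same imports as in the question plus import re):
import re
def normalized_text(text):
    lemma = WordNetLemmatizer()
    # join the one-element list, lower-case it, and split on apostrophes as well as spaces
    tokens = re.split(r"'| ", ''.join(text).lower())
    filtered = ' '.join(i for i in tokens if i and i not in stoplist_combined)
    return ' '.join(lemma.lemmatize(word) for word in filtered.split())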
So I am reading in a csv file and getting the words in it. I am trying to remove stop words. Here is my code.
import pandas as pd
from nltk.corpus import stopwords as sw
def loadCsv(fileName):
    df = pd.read_csv(fileName, error_bad_lines=False)
    df.dropna(inplace=True)
    return df
def getWords(dataframe):
    words = []
    for tweet in dataframe['SentimentText'].tolist():
        for word in tweet.split():
            word = word.lower()
            words.append(word)
    return set(words)  # create a set from the words list
def removeStopWords(words):
    for word in words:  # iterate over word_list
        if word in sw.words('english'):
            words.remove(word)  # remove word from the set if it is a stopword
    return set(words)
df = loadCsv("train.csv")
words = getWords(df)
words = removeStopWords(words)
On this line
if word in sw.words('english'):
I get the following error.
exception: no description
Further down the line I am going to try to remove punctuation; any pointers for that would also be great.
Any help is much appreciated.
EDIT
def removeStopWords(words):
    filtered_word_list = words  # make a copy of the words
    for word in words:  # iterate over words
        if word in sw.words('english'):
            filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
    return set(filtered_word_list)
Here is a simplified version of the problem, without pandas. I believe the issue with the original code is modifying the set words while iterating over it; note that the edit's filtered_word_list = words binds a second name to the same set rather than copying it, so it has the same problem. By using a conditional list comprehension we test each word, build a new list, and finally convert it to a set, as in the original code.
from nltk.corpus import stopwords as sw
def removeStopWords(words):
    return set([w for w in words if w not in sw.words('english')])
sentence = 'this is a very common english sentence with a finite set of words from my imagination'
words = set(sentence.split())
print(removeStopWords(words))
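One refinement worth noting: the condition calls sw.words('english') once per word, and each call returns a list, so every membership test is a fresh linear scan. Building a set once up front avoids that. A minimal sketch:
english_stops = set(sw.words('english'))  # build once; set lookup is O(1)
def removeStopWords(words):
    return {w for w in words if w not in english_stops}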
Change the removeStopWords function to the following:
def getFilteredStopWords(words):
    list_stopWords = list(set(sw.words('english')))
    # keep only the words that are not stopwords
    filtered_words = [w for w in words if w not in list_stopWords]
    return filtered_words
from nltk.corpus import stopwords

def remove_stopwords(sentence):
    list_stop_words = set(stopwords.words('english'))
    words = sentence.split(' ')
    filtered_words = [w for w in words if w not in list_stop_words]
    sentence_list = ' '.join(filtered_words)
    return sentence_list
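For example (the expected output assumes NLTK's English stopword list, which includes 'this', 'is', 'a', and 'very'):
print(remove_stopwords('this is a very common english sentence'))
# common english sentence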
I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then clean it of stopwords. A tokenizer usually recognizes punctuation.
import nltk
text = ('Son, if you really want something in this life, '
        'you have to work for it. Now quiet! They are about '
        'to announce the lottery numbers.')
stopwords = ['in', 'to', 'for', 'the']
sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))
print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])
This pulls out the leading run of lowercase and uppercase ASCII letters from each token and matches it against the words in your stopwords set or list. It also assumes that all of the words in stopwords are lowercase, which is why I converted the word to all lowercase; take that out if that is too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
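One caveat with that one-liner: re.match returns None when a token starts with a non-letter (a number or a bare punctuation mark, say), and calling .group() on None raises AttributeError. A guarded sketch:
import re
def is_stopword(word, stops):
    m = re.match(r'[a-zA-Z]+', word)
    # tokens with no leading letters cannot be stopwords
    return m is not None and m.group().lower() in stops
clean = ' '.join(word for word in text.split() if not is_stopword(word, stopwords))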