I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then remove the stopwords. A tokenizer usually recognizes punctuation and returns it as separate tokens.
import nltk

text = ('Son, if you really want something in this life, '
        'you have to work for it. Now quiet! They are about '
        'to announce the lottery numbers.')
stopwords = ['in', 'to', 'for', 'the']
sents = []
for sent in nltk.sent_tokenize(text):
    # word_tokenize splits punctuation into its own tokens
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))
print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
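If NLTK is not available, the same tokenize-then-filter idea can be sketched with a regular expression. This is only a rough stand-in for a real tokenizer: the function name remove_stopwords and the pattern are illustrative assumptions, and contractions such as "don't" would be split into three tokens.

```python
import re

def remove_stopwords(text, stopwords):
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark on its own
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # compare in lowercase so 'To' and 'to' are treated alike
    return ' '.join(t for t in tokens if t.lower() not in stopwords)

result = remove_stopwords("You have to work for it. Now quiet!", {'to', 'for'})
print(result)
```

Because punctuation survives as separate tokens, the result can still be passed to a sentence tokenizer afterwards.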
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('[a-zA-Z]*', word).group().lower() not in stopwords])
This takes the leading run of ASCII letters from each word (so 'word.' is checked as 'word') and compares it against your stopwords set or list; the word that is kept still has its punctuation. The pattern uses * so that a token starting with a non-letter gives an empty match instead of None. It also assumes that every entry in stopwords is lowercase, which is why the extracted text is lowercased before the comparison. Take the .lower() call out if that is too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
I don't understand why I don't remove the stopword "a" in this loop. It seems so obvious that this should work...
Given a list of stop words, write a function that takes a string and returns a string stripped of the stop words. Output: stripped_paragraph = 'want figure out how can better data scientist'
Below I define 'stopwords'.
I split all the words on spaces and deduplicate them while retaining their order.
I then loop through the ordered, split, shortened set (the 'osss' variable) and remove each word that is in the 'stopwords' list.
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    for word in osss:
        if word.strip() in stopwords:
            osss.remove(word)
        else:
            next
    return ' '.join(osss)

print("stripped_paragraph = "+"'"+(rm_stopwards(paragraph))+"'")
My incorrect output is: 'want figure out how can a better data scientist'
Correct output: 'want figure out how can better data scientist'
edit: note that the .strip() in the word.strip() condition is unnecessary; I get the same output without it. It was only there to make sure there wasn't an extra space somehow.
edit2: this is an interview question, so I can't use any imports
What you're trying to do can be achieved with far fewer lines of code.
The main problem in your code is that you're changing the list while iterating over it.
The following works and is much simpler. It loops over the words of your paragraph, keeps only the ones that aren't in the stopwords list, and joins them back together with a space.
paragraph = 'I want to figure out how I can be a better data scientist'
stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
filtered = ' '.join([word for word in paragraph.split() if word not in stopwords])
print(filtered)
You may also consider using nltk, which has a predefined list of stopwords.
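To see why removing items during iteration skips 'a', here is a minimal sketch on a shortened word list (the variable names visited and filtered are illustrative, not from the question). The for loop walks the list by position, so each removal shifts the next word into a slot that has already been visited.

```python
words = ['can', 'be', 'a', 'better']
stopwords = ['be', 'a']

visited = []
for word in words:          # iterates over the live list by position
    visited.append(word)
    if word in stopwords:
        words.remove(word)  # everything after 'word' shifts left by one

print(visited)  # 'a' never shows up here: it slid into the slot 'be' occupied
print(words)    # so 'a' is still in the list

# Building a new list avoids the problem entirely
filtered = [w for w in ['can', 'be', 'a', 'better'] if w not in stopwords]
print(filtered)
```

The same effect explains the question's output: 'be' is removed, 'a' moves into its position, and the loop continues with 'better'.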
You should not modify (add to or delete from) a collection such as osss while iterating over it. Collect the words to delete first, then filter:
del_list = []
for word in osss:
    if word.strip() in stopwords:
        del_list.append(word)
osss = [e for e in osss if e not in del_list]
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    x = list(osss)  # separate copy, so removals do not disturb the iteration
    for word in osss:
        if word.strip() in stopwords:
            x.remove(word)
    ret = ' '.join(x)
    return ret

print("stripped_paragraph = "+"'"+(rm_stopwards(paragraph))+"'")
I have a list of sentences. Each sentence has to be converted to a json. There is a unique 'name' for each sentence that is also specified in that json. The problem is that the number of sentences is large so it's monotonous to manually give a name. The name should be similar to the meaning of the sentence e.g., if the sentence is "do you like cake?" then the name should be like "likeCake". I want to automate the process of creation of name for each sentence. I googled text summarization but the results were not for sentence summarization but paragraph summarization. How to go about this?
This sort of task belongs to natural language processing. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit for dealing with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# load data
with open('yourFileWithSentences.txt', 'rt') as file:
    lines = file.readlines()

stop_words = set(stopwords.words('english'))
# translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)

for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word and drop tokens that become empty
    stripped = [w.translate(table) for w in tokens]
    stripped = [w for w in stripped if w]
    # filter out stop words (the NLTK list is lowercase)
    words = [w for w in stripped if w.lower() not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
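To get names shaped like the "likeCake" example from the question, the remaining words can be joined in camelCase. The sketch below is self-contained and makes some assumptions: sentence_to_name is a hypothetical helper, STOP_WORDS is a tiny stand-in for the NLTK list, and plain str.split replaces word_tokenize.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {'do', 'you', 'the', 'a', 'is'}

def sentence_to_name(sentence, stop_words=STOP_WORDS):
    # delete punctuation, lowercase, then drop empty tokens and stop words
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table).lower() for w in sentence.split()]
    words = [w for w in words if w and w not in stop_words]
    if not words:
        return ''
    # first word stays lowercase, the rest are capitalized
    return words[0] + ''.join(w.capitalize() for w in words[1:])

print(sentence_to_name("do you like cake?"))
```

Two different sentences can reduce to the same name, so a uniqueness check (for example, appending a counter on collision) would still be needed.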
I am relatively new to NLP so please be gentle. I have a complete list of the text from Trump's tweets since taking office and I am tokenizing the text to analyze the content.
I am using the TweetTokenizer from the nltk library in python and I'm trying to get everything tokenized except for numbers and punctuation. Problem is my code removes all the tokens except one.
I have tried using the .isalpha() method, but this did not work. I expected it to, since it should only be True for strings composed entirely of alphabetic characters.
# Create a content from the tweets
text = non_re['text']
# Make all text in lowercase
low_txt = [l.lower() for l in text]
# Iteratively tokenize the tweets
TokTweet = TweetTokenizer()
tokens = [TokTweet.tokenize(t) for t in low_txt
          if t.isalpha()]
My output from this is just one token.
If I remove the if t.isalpha() condition, I get all of the tokens including numbers and punctuation, suggesting that isalpha() is to blame for the over-trimming.
What I would like, is a way to get the tokens from the tweet text without punctuation and numbers.
Thanks for your help!
Try something like below:
import string
import re
import nltk
from nltk.tokenize import TweetTokenizer

tweet = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play"

def clean_text(text):
    # remove numbers
    text_nonum = re.sub(r'\d+', '', text)
    # remove punctuation and convert characters to lower case
    text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
    # collapse runs of whitespace into a single space
    # and strip leading and trailing whitespace
    text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
    return text_no_doublespace

cleaned_tweet = clean_text(tweet)
tt = TweetTokenizer()
print(tt.tokenize(cleaned_tweet))
output:
['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'its', 'kids', 'movie', 'watch', 'it', 'cant', 'help', 'enjoy', 'it', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it', 'danny', 'glover', 'superb', 'could', 'play']
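The original attempt failed because t.isalpha() was applied to each whole tweet, and a tweet containing spaces, digits, or punctuation is never purely alphabetic. A minimal sketch of the alternative is to tokenize first and test each token instead. The helper alpha_tokens is hypothetical, and it defaults to str.split only so that it runs without NLTK; passing TweetTokenizer().tokenize as the tokenize argument is the intended use.

```python
def alpha_tokens(tweets, tokenize=str.split):
    # tokenize every tweet, then keep only the purely alphabetic tokens
    return [[tok for tok in tokenize(t.lower()) if tok.isalpha()]
            for t in tweets]

sample = ["I saw it 10 times !", "Great movie"]
print("I saw it 10 times !".isalpha())  # False: spaces and digits are not letters
print(alpha_tokens(sample))
```

With str.split, a token such as "movie," would be dropped because of the attached comma; a tokenizer that separates punctuation avoids that.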
# Function for removing punctuation from text; it also returns the number of punctuation tokens removed.
# Input: the existing file name and the new file name as strings, e.g. 'existingFileName.txt' and 'newFileName.txt'
# Return: the punctuation-free file opened in read mode, and the punctuation count.
import string
from nltk.tokenize import word_tokenize

def removePunctuation(tokenizeSampleText, newFileName):
    with open(tokenizeSampleText, 'r') as existingFile:
        tokens = word_tokenize(existingFile.read())
    punctuation = set(string.punctuation)
    count_pun = 0
    with open(newFileName, 'w') as puncRemovedFile:
        for word in tokens:
            if word in punctuation:
                count_pun += 1
            else:
                puncRemovedFile.write(word + ' ')
    # the caller is responsible for closing the returned file
    return open(newFileName, 'r'), count_pun

punRemoved, punCount = removePunctuation('Macbeth.txt', 'Macbeth-punctuationRemoved.txt')
print(f'Total Punctuation : {punCount}')
punRemoved.read()
I am running some NLP code, trying to find the most influential (positively or negatively) words in a survey. My problem is that, while I successfully add some extra stopwords to the NLTK stopwords file, they keep coming up as influential words later on.
So, I have a dataframe, first column contains scores, second column contains comments.
I add extra stopwords:
stopwords = stopwords.words('english')
extra = ['Cat', 'Dog']
stopwords.extend(extra)
I check that they are added, using the len method before and after.
I create this function to remove punctuation and stopwords from my comments:
def text_process(comment):
    nopunc = [char for char in comment if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords]
I run the model (not going to include the whole code since it doesn't make a difference):
corpus = df['Comment']
y = df['Label']
vectorizer = CountVectorizer(analyzer=text_process)
x = vectorizer.fit_transform(corpus)
...
And then to get the most influential words:
feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), nb.coef_[0])}
for best_positive in sorted(
        feature_to_coef.items(),
        key=lambda x: x[1],
        reverse=True)[:20]:
    print(best_positive)
But, Cat and Dog are in the results.
What am I doing wrong, any ideas?
Thank you very much!
It looks like the problem is that your extra stopwords 'Cat' and 'Dog' are capitalized.
In your text_process function, the check if word.lower() not in stopwords only works if the stopwords themselves are lowercase: 'cat' is compared against 'Cat' and never matches.
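One way to apply this is to lowercase the extra words when adding them. The sketch below is self-contained, so base_stopwords is a short hypothetical stand-in for stopwords.words('english'); the text_process function otherwise follows the one in the question.

```python
import string

# Hypothetical stand-in for stopwords.words('english')
base_stopwords = ['the', 'a', 'is']
extra = ['Cat', 'Dog']

# lowercase the additions so they match the word.lower() comparison below
stopwords = set(base_stopwords) | {w.lower() for w in extra}

def text_process(comment):
    nopunc = ''.join(ch for ch in comment if ch not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stopwords]

print(text_process("The Cat is lovely, the dog barks!"))
```

Using a set also makes each membership test faster than scanning a list.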
I have a corpus of sentences in a specific domain.
I am looking for an open-source code/package, that I can give the data and it will generate a good, reliable language model. (Meaning, given a context, know the probability for each word).
Is there such a code/project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower()  # Lower case all characters in word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        # normalize by the tokens actually counted; total_words was computed
        # before punctuation was removed and start/end words were added
        return unigram_fdist[word] / unigram_fdist.N()
    else:
        return -1  # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1  # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
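For readers who want to try the idea on their own sentences without NLTK, the same counting can be sketched with the standard library alone. This is a simplified illustration, not the code above: train_bigram_model and bigram_probability are hypothetical names, unseen bigrams get probability 0.0 rather than a back-off, and no smoothing is applied.

```python
from collections import Counter, defaultdict

START, END = "<<START>>", "<<END>>"

def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of words)."""
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [START] + [w.lower() for w in sentence] + [END]
        unigram_counts.update(tokens)
        # pair each token with its successor
        for first, second in zip(tokens, tokens[1:]):
            bigram_counts[first][second] += 1
    return unigram_counts, bigram_counts

def bigram_probability(first, second, unigram_counts, bigram_counts):
    """Estimate P(second | first) as count(first second) / count(first)."""
    if unigram_counts[first] == 0:
        return 0.0  # 'first' was never seen in training
    return bigram_counts.get(first, {}).get(second, 0) / unigram_counts[first]

corpus = [["I", "am", "the", "walrus"], ["I", "am", "here"]]
uni, bi = train_bigram_model(corpus)
print(bigram_probability("am", "the", uni, bi))
```

In this toy corpus "am" occurs twice and is followed by "the" once, so the estimate is one half.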
You might try word_language_model from the PyTorch examples. The only catch is that it loads all of the data into memory, which could be a problem if you have a big corpus.