Removing stopwords from a list of text files - python

I have a list of processed text files that looks somewhat like this:
text = "this is the first text document " this is the second text document " this is the third document "
I've been able to successfully tokenize the sentences:
sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    sentences[ii] = remove_punctuation(sentence)
sentence_tokens = [word_tokenize(sentence) for sentence in sentences]
And now I would like to remove stopwords from this list of tokens. However, because it's a list of sentences within a list of text documents, I can't seem to figure out how to do this.
This is what I've tried so far, but it returns no results:
sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]
I'm assuming achieving this will require some sort of for loop, but what I have now isn't working. Any help would be appreciated!

You can nest two list comprehensions like this:
sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
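For example, a minimal end-to-end sketch (assuming the punkt and stopwords corpora have already been downloaded with nltk.download; the stopword set is built here as stop_words rather than reusing the question's stopwords name):

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

text = "this is the first text document. this is the second text document."
sentence_tokens = [word_tokenize(s) for s in sent_tokenize(text)]

# one inner list per sentence, with the stopwords filtered out
sentence_tokens_no_stopwords = [[w for w in s if w not in stop_words] for s in sentence_tokens]
print(sentence_tokens_no_stopwords)
# [['first', 'text', 'document', '.'], ['second', 'text', 'document', '.']]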

Related

create variable name based on text in the sentence

I have a list of sentences. Each sentence has to be converted to a json. There is a unique 'name' for each sentence that is also specified in that json. The problem is that the number of sentences is large so it's monotonous to manually give a name. The name should be similar to the meaning of the sentence e.g., if the sentence is "do you like cake?" then the name should be like "likeCake". I want to automate the process of creation of name for each sentence. I googled text summarization but the results were not for sentence summarization but paragraph summarization. How to go about this?
This is the sort of task natural language processing is used for. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit (NLTK) for dealing with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()

stop_words = set(stopwords.words('english'))

for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words
    words = [w for w in stripped if w not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
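If you specifically want camelCase names like "likeCake", a small follow-up sketch along the same lines (the helper to_var_name is hypothetical, not part of NLTK; it lower-cases words before checking them against the stopword set):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

def to_var_name(sentence, stop_words):
    # tokenize, strip punctuation, and drop stop words (case-insensitively)
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in word_tokenize(sentence)]
    words = [w for w in words if w and w.lower() not in stop_words]
    if not words:
        return ''
    # "do you like cake?" -> "likeCake"
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

print(to_var_name("do you like cake?", set(stopwords.words('english'))))
# expected: likeCake (assuming 'like' is not in the stopword list)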

NLTK - Stopwords, hashing over list

I'll try to make this as easy to understand as possible as I can imagine how infuriating long and drawn-out problems can become.
I have a list of tweets, all stored within a variable called 'all_tweets'. (This is because some tweets fell into the 'text' category, while others fell into 'extended_tweet', so I had to merge them together.)
I tokenised the tweets, and it all worked perfectly. I got a list of each tweet, with each word within a tweet separated out.
I am trying to now implement stopwords into the code so I can filter out, you guessed it, any stopwords.
My code is as follows:
wordVec = [nltk.word_tokenize(tweet) for tweet in all_tweets]
stopWords = set(stopwords.words('english'))
wordsFiltered = []
for w in wordVec:
    if w not in stopWords:
        wordsFiltered.append(w)
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-29-ae7a97fb3811> in <module>
4
5 for w in wordVec:
----> 6 if w not in stopWords:
7 wordsFiltered.append(w)
TypeError: unhashable type: 'list'
I'm well aware I cannot hash a list. I looked at my tweets, and each set of words is within its own list. I'm very well aware of what's going on, but is there any workaround to this issue?
Any help would be appreciated thanks.
You said you're well aware of what's going on, but are you? wordVec is not a list of strings, it's a list of lists of strings.
So when you say:
for w in wordVec:
w is not a word, it's a list of words.
Which means if you say:
if w not in stopWords:
You are asking if the current list of words is in the set. You can't put lists in sets because they are mutable and cannot be hashed, hence the error.
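You can reproduce the same error in a plain Python session, with no NLTK involved at all (the traceback is abbreviated here):

>>> {["a", "b"]}
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'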
I'm guessing what you really wanted to do is to iterate over the lists of words, and then to iterate over the words in the current list.
import nltk
from nltk.corpus import stopwords
tweets = [
    "Who here likes cheese? I myself like cheese.",
    "Do you have cheese? Do they have cheese?"
]

tokenized_tweets = [nltk.word_tokenize(tweet) for tweet in tweets]
stop_words = set(stopwords.words("english"))

filtered_tweets = []
for tokenized_tweet in tokenized_tweets:
    filtered_tweets.append(" ".join(word for word in tokenized_tweet if word.casefold() not in stop_words))
print(filtered_tweets)
Output:
['likes cheese ? like cheese .', 'cheese ? cheese ?']
I just arbitrarily decided to join the list of filtered words before appending them to the filtered_tweets list - as you can see it results in the punctuation being separated by whitespace, which might be undesirable. In any case you don't need to join the words back into a string, you can just append the list itself.
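If you would rather keep each filtered tweet as a list of tokens instead of joining it back into a string, a minimal variation (reusing tokenized_tweets and stop_words from the snippet above) would be:

filtered_token_lists = [
    [word for word in tokenized_tweet if word.casefold() not in stop_words]
    for tokenized_tweet in tokenized_tweets
]
print(filtered_token_lists)
# [['likes', 'cheese', '?', 'like', 'cheese', '.'], ['cheese', '?', 'cheese', '?']]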
Your variable wordVec is a list of lists, so when you are doing:
for w in wordVec:
    if w not in stopWords:
you are checking whether a list is in a set; w is a list, so you get:
TypeError: unhashable type: 'list'
You can fix it with a loop:
wordsFiltered = []
for w in wordVec:
    wordsFiltered.append([e for e in w if e not in stopWords])
or you could use a list comprehension:
wordsFiltered = [[e for e in w if e not in stopWords] for w in wordVec]
try this:
text = 'hello my friend, how are you today, are you ok?'
tokenized_word = word_tokenize(text)
stop_words = set(stopwords.words('english'))
stops = []
for w in tokenized_word:
    if w not in stop_words:
        stops.append(w)
print(stops)
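For reference, with NLTK's default English stopword list this should print something like:
['hello', 'friend', ',', 'today', ',', 'ok', '?']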

Python: Create a new variable derived from extracting a sentence from a text

I have a data frame in which one of the variables is a fairly long paragraph containing many sentences. Sometimes the sentences are separated by a full stop, sometimes by a comma. I'm trying to create a new variable by extracting only selected parts of the text using selected words. Please see below a short sample of the data frame and the result I have at the moment, followed by the code I'm using. Note - the text in the first variable is pretty large.
PhysicalMentalDemands: [driving may be necessary occasionally. as well as telephones will also be occasional to frequent.]
Physical_driving:      [driving......]
Physical_telephones:   [telephones...]
Code used:
searched_words = ['driving', 'telephones']

for i in searched_words:
    Test['Physical' + '_' + str(i)] = Test['PhysicalMentalDemands'].apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if any(True for w in word_tokenize(sent)
                             if w.lower() in searched_words)])
Issue:
At the moment my code extracts the sentences, but it matches on both of the words. I've seen other similar posts, but none managed to solve my issue.
Fixed
searched_words = ['driving', 'physical']

for i in searched_words:
    df['Physical' + '_' + i] = result['PhysicalMentalDemands'].str.lower().apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if i in word_tokenize(sent)])
If you want separate lists for each searched word, you might consider reorganizing your code into something like this:
searched_words = ['driving', 'telephones']

for searched_word in searched_words:
    Test['Physical' + '_' + searched_word] = Test['PhysicalMentalDemands'].apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if any(w for w in word_tokenize(sent) if w.lower() == searched_word)])
Note that the meat of the fix is changing if w.lower() in searched_words to if w.lower() == searched_word.
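A self-contained sketch of that corrected loop on a tiny made-up DataFrame (the column text and names here are illustrative, not the asker's real data):

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

Test = pd.DataFrame({'PhysicalMentalDemands': [
    'driving may be necessary occasionally. telephones will also be occasional to frequent.'
]})

searched_words = ['driving', 'telephones']
for searched_word in searched_words:
    # one column per searched word, holding only the sentences that contain it
    Test['Physical_' + searched_word] = Test['PhysicalMentalDemands'].apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if any(w for w in word_tokenize(sent) if w.lower() == searched_word)])

print(Test['Physical_driving'][0])     # ['driving may be necessary occasionally.']
print(Test['Physical_telephones'][0])  # ['telephones will also be occasional to frequent.']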

Counting non stop words in an NLTK corpus

In python using NLTK how would I find a count of the number of non stop words in a document filtered by category?
I can figure out how to get the words in a corpus filtered by a category, e.g. all the words in the brown corpus for category 'news' are:
text = nltk.corpus.brown.words(categories=category)
And separately I can figure out how to get all the words for a particular document, e.g. all the words in the document 'cj47' in the brown corpus are:
text = nltk.corpus.brown.words(fileids='cj47')
And then I can loop through the results and count up the words that are not stopwords e.g.
stopwords = nltk.corpus.stopwords.words('english')
for w in text:
    if w.lower() not in stopwords:
        # found a non-stop word
But how do I put it together so that I am filtering by category for a particular document? If I try to specify a category and a fileid at the same time, e.g.
text = nltk.corpus.brown.words(categories=category, fileids='cj47')
I get an error saying:
ValueError: Specify fileids or categories, not both
Get fileids for a category:
fileids = nltk.corpus.brown.fileids(categories=category)
For each file, count the non-stopwords:
for f in fileids:
    words = nltk.corpus.brown.words(fileids=f)
    count = sum(1 for w in words if w.lower() not in stopwords)
    print("Document %s: %d non-stopwords." % (f, count))

Python list comprehension on two list

I am stuck with list comprehension in Python
I have the following data structure
dataset = [sentence1, sentence2,...]
sentence = [word1, word2,...]
in addition, I have a list of special words
special_words = [special_word1, special_word2, special_word3,...]
I want to run over all the special words in special_words and fetch all the words that occur in the same sentence as each special word.
As a result I expect
data = [special_word1_list, special_word2_list, ...],
where special_word1_list = [word1, word2, ...], meaning that word1, word2, ... occurred in sentences together with special_word1.
I have tried many different ways to construct the list comprehension, unfortunately without any success.
I would appreciate any help; in addition, if you know any good articles about list comprehensions, please post them here.
data = [
    {
        word
        for sentence in sentences
        if special_word in sentence
        for word in sentence
    }
    for special_word in special_words
]
I think you want:
data = [sentence for sentence in data
        if any(word in special_words
               for word in sentence)]
Alternatively, I'd suggest creating a dictionary mapping special words to sets of other words occurring in the same sentence as the respective special word:
{sw: {w for ds in dataset if sw in ds for w in ds if w != sw}
 for sw in special_words}
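For example, on a tiny made-up dataset (set ordering in the printed output may vary):

dataset = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
]
special_words = ["cat", "dog"]

cooccurrences = {sw: {w for ds in dataset if sw in ds for w in ds if w != sw}
                 for sw in special_words}
print(cooccurrences)
# e.g. {'cat': {'the', 'sat', 'a', 'ran'}, 'dog': {'the', 'ran'}}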
