NLTK - Stopwords, hashing over list - python

I'll try to make this as easy to understand as possible as I can imagine how infuriating long and drawn-out problems can become.
I have a list of tweets, all stored within a variable called 'all_tweets'. (This is because some tweets fell into the 'text' category, while others fell into 'extended_tweet', so I had to merge them together.)
I tokenised the tweets, and it all worked perfectly. I got a list of each tweet, with each word within a tweet separated out.
I am trying to now implement stopwords into the code so I can filter out, you guessed it, any stopwords.
My code is as follows:
wordVec = [nltk.word_tokenize(tweet) for tweet in all_tweets]
stopWords = set(stopwords.words('english'))
wordsFiltered = []
for w in wordVec:
    if w not in stopWords:
        wordsFiltered.append(w)
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-29-ae7a97fb3811> in <module>
      4
      5 for w in wordVec:
----> 6     if w not in stopWords:
      7         wordsFiltered.append(w)
TypeError: unhashable type: 'list'
I'm well aware I cannot hash a list. I looked at my tweets and each tweet's words are in their own list. I'm very well aware of what's going on, but is there any workaround to this issue?
Any help would be appreciated thanks.

You said you're well aware of what's going on, but are you? wordVec is not a list of strings, it's a list of lists of strings.
So when you say:
for w in wordVec:
w is not a word, it's a list of words.
Which means if you say:
if w not in stopWords:
You are asking if the current list of words is in the set. You can't put lists in sets because they are mutable and cannot be hashed, hence the error.
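For illustration, here is a minimal interactive sketch (with a toy stopword set) of why the membership test works for a string but not for a list:
>>> stop_words = {"the", "a", "is"}
>>> "the" in stop_words            # strings are hashable, so this works
True
>>> ["the", "cat"] in stop_words   # a list cannot be hashed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'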
I'm guessing what you really wanted to do is to iterate over the lists of words, and then to iterate over the words in the current list.
import nltk
from nltk.corpus import stopwords
tweets = [
"Who here likes cheese? I myself like cheese.",
"Do you have cheese? Do they have cheese?"
]
tokenized_tweets = [nltk.word_tokenize(tweet) for tweet in tweets]
stop_words = set(stopwords.words("english"))
filtered_tweets = []
for tokenized_tweet in tokenized_tweets:
    filtered_tweets.append(" ".join(word for word in tokenized_tweet if word.casefold() not in stop_words))
print(filtered_tweets)
Output:
['likes cheese ? like cheese .', 'cheese ? cheese ?']
I just arbitrarily decided to join the list of filtered words before appending them to the filtered_tweets list - as you can see it results in the punctuation being separated by whitespace, which might be undesirable. In any case you don't need to join the words back into a string, you can just append the list itself.
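If you'd rather keep each filtered tweet as a list of tokens, a sketch of that variant (same names as the snippet above):
filtered_tweets = []
for tokenized_tweet in tokenized_tweets:
    filtered_tweets.append([word for word in tokenized_tweet if word.casefold() not in stop_words])
# expected: [['likes', 'cheese', '?', 'like', 'cheese', '.'], ['cheese', '?', 'cheese', '?']]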

Your variable wordVec is a list of lists, so when you do:
for w in wordVec:
    if w not in stopWords:
you are checking whether a list is in a set. Since w is a list, you get:
TypeError: unhashable type: 'list'
You can fix it like this:
wordsFiltered = []
for w in wordVec:
    wordsFiltered.append([e for e in w if e not in stopWords])
Or you could use a list comprehension:
wordsFiltered = [[e for e in w if e not in stopWords] for w in wordVec]

Try this:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = 'hello my friend, how are you today, are you ok?'
tokenized_word = word_tokenize(text)
stop_words = set(stopwords.words('english'))
stops = []
for w in tokenized_word:
    if w not in stop_words:
        stops.append(w)
print(stops)
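One caveat with this snippet (not in the original answer): NLTK's English stopword list is all lowercase, so a capitalized token such as 'How' at the start of a sentence would slip through. A common tweak, assuming the same variables as above, is to compare the lowercased token:
stops = [w for w in tokenized_word if w.lower() not in stop_words]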

Related

create variable name based on text in the sentence

I have a list of sentences. Each sentence has to be converted to a JSON object. There is a unique 'name' for each sentence that is also specified in that JSON. The problem is that the number of sentences is large, so it's tedious to give each one a name manually. The name should reflect the meaning of the sentence, e.g. if the sentence is "do you like cake?" then the name should be something like "likeCake". I want to automate the creation of a name for each sentence. I googled text summarization, but the results were about paragraph summarization rather than sentence summarization. How should I go about this?
This sort of task falls under natural language processing. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit (NLTK) for dealing with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()
stop_words = set(stopwords.words('english'))
for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words
    words = [w for w in stripped if w not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
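Since the question asks for camelCase-style names like likeCake, you could also lowercase the first remaining word and capitalize the rest. A minimal sketch, reusing the imports and the stop_words set from the snippet above (the build_name helper is made up for illustration):
def build_name(sentence):
    tokens = word_tokenize(sentence)
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in tokens]
    words = [w for w in words if w and w.lower() not in stop_words]
    if not words:
        return ''
    # camelCase: first word lowercased, the rest capitalized
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

print(build_name("do you like cake?"))  # expected output: likeCake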

I'm getting "TypeError: expected string or bytes-like object"

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
paragraph = ''' State-run Bharat Sanchar Nigam Ltd (BSNL) is readying to pay November salary in another two days, which will be raised from internal accruals and bank loans.'''
sentence = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
for i in range(len(sentence)):
    words = nltk.word_tokenize(i)
    words = [stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
    sentence[i] = ' '.join(words)
I am getting an error on this part
words = nltk.word_tokenize(i)
range() produces an iterable of integers. So, when you feed i into nltk.word_tokenize(), you're feeding it an integer. Obviously, an integer is not string-like.
I don't personally know how nltk.word_tokenize() is supposed to work, but based on context clues it seems you want to pass the sentence at index i instead of the index itself:
words = nltk.word_tokenize(sentence[i])
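Putting that fix into the original loop, a sketch (building the stopword set once outside the loop, so it isn't recomputed on every iteration):
stop_words = set(stopwords.words('english'))
for i in range(len(sentence)):
    words = nltk.word_tokenize(sentence[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentence[i] = ' '.join(words)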

Removing stopwords from a list of text files

I have a list of processed text files, that looks somewhat like this:
text = "this is the first text document " this is the second text document " this is the third document "
I've been able to successfully tokenize the sentences:
sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    sentences[ii] = remove_punctuation(sentence)
sentence_tokens = [word_tokenize(sentence) for sentence in sentences]
And now I would like to remove stopwords from this list of tokens. However, because it's a list of sentences within a list of text documents, I can't seem to figure out how to do this.
This is what I've tried so far, but it returns no results:
sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]
I'm assuming achieving this will require some sort of for loop, but what I have now isn't working. Any help would be appreciated!
You can use a nested list comprehension like this:
sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
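Note that stopwords here has to be a collection of stopword strings (e.g. set(stopwords.words('english'))), not the nltk.corpus.stopwords module itself. A minimal runnable sketch with made-up sentences:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence_tokens = [word_tokenize(s) for s in ["This is the first document", "And this is the second one"]]
stop_words = set(stopwords.words('english'))
sentence_tokens_no_stopwords = [[w for w in s if w.lower() not in stop_words] for s in sentence_tokens]
print(sentence_tokens_no_stopwords)
# expected: [['first', 'document'], ['second', 'one']]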

Converting a Text File to a String in Python

I am new to Python and am trying to find the largest word in alice_in_wonderland.txt. I think I have a good system set up (see below), but my output is returning a "word" with dashes connecting multiple words. Is there some way to remove the dashes in the input of the file? For the text file, visit here.
sample from text file:
That's very important,' the King said, turning to the jury. They were
just beginning to write this down on their slates, when the White
Rabbit interrupted: UNimportant, your Majesty means, of course,' he
said in a very respectful tone, but frowning and making faces at him
as he spoke. " UNimportant, of course, I meant,' the King hastily
said, and went on to himself in an undertone, important--unimportant--
unimportant--important--' as if he were trying which word sounded
best."
code:
#String input
with open("alice_in_wonderland.txt", "r") as myfile:
    string=myfile.read().replace('\n','')
#initialize list
my_list = []
#Split words into list
for word in string.split(' '):
    my_list.append(word)
#initialize list
uniqueWords = []
#Fill in new list with unique words to shorten final printout
for i in my_list:
    if not i in uniqueWords:
        uniqueWords.append(i)
#Length of longest word
count = 0
#Longest word place holder
longest = []
for word in uniqueWords:
    if len(word)>count:
        longest = word
        count = len(longest)
print longest
>>> import nltk # pip install nltk
>>> nltk.download('gutenberg')
>>> words = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> max(words, key=len) # find the longest word
'disappointment'
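If you care about ties (several words sharing the maximum length), note that max() returns only the first one it encounters; a small follow-up in the same session could be:
>>> longest = max(len(w) for w in words)
>>> sorted({w for w in words if len(w) == longest})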
Here's one way using re and mmap:
import re
import mmap
with open('your alice in wonderland file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = re.finditer('\w+', mf)
    print max((word.group() for word in words), key=len)
# disappointment
Far more efficient than loading the file to physical memory.
Use str.replace to replace the dashes with spaces (or whatever you want). To do this, simply add another call to replace after the first call on line 3:
string=myfile.read().replace('\n','').replace('-', ' ')
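Putting the pieces together, a minimal sketch of the same approach with the dash fix, using max() with key=len instead of the manual loop (like the original, it still counts punctuation attached to a word as part of the word):
with open("alice_in_wonderland.txt", "r") as myfile:
    string = myfile.read().replace('\n', ' ').replace('-', ' ')
longest = max(string.split(), key=len)
print(longest)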

Python list comprehension on two lists

I am stuck with list comprehension in Python
I have the following data structure
dataset = [sentence1, sentence2,...]
sentence = [word1, word2,...]
in addition, I have a list of special words
special_words = [special_word1, special_word2, special_word3,...]
I want to iterate over all the special words in special_words and fetch all words that occur in the same sentence as a special word.
As a result I expect:
data = [special_word1_list, special_word2_list, ...]
where special_word1_list = [word1, word2, ...]
meaning word1, word2, ... appeared in sentences together with special_word1.
I tried many different ways to construct the list comprehension, unfortunately without any success.
I would appreciate any help; also, if you know any good articles about list comprehensions, please post them here.
data = [
    {
        word
        for sentence in dataset
        if special_word in sentence
        for word in sentence
    }
    for special_word in special_words
]
I think you want:
data = [sentence for sentence in dataset
        if any(word in special_words
               for word in sentence)]
Alternatively, I'd suggest creating a dictionary mapping special words to sets of other words occurring in the same sentence as the respective special word:
{sw: {w for ds in dataset if sw in ds for w in ds if w != sw}
 for sw in special_words}
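To make the behaviour concrete, a small runnable sketch with made-up data (assuming, as in the question, that each sentence is a list of word strings):
dataset = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
]
special_words = ["cat", "dog"]
result = {sw: {w for ds in dataset if sw in ds for w in ds if w != sw}
          for sw in special_words}
print(result)
# expected (set order may vary): {'cat': {'the', 'sat', 'a', 'ran'}, 'dog': {'the', 'ran'}}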
