Unstructured data, NLP Lemmatize Book Review - python

Here I am trying to read the contents of a file, say 'book1.txt'. I have to remove all the special characters and punctuation marks, word-tokenize the content using nltk's word tokenizer, lemmatize those tokens using WordNetLemmatizer, and write the tokens to a csv file one by one.
Here is the code I am using. It obviously is not working, but I just need some suggestions on this, please.
import nltk
from nltk.stem import WordNetLemmatizer
import csv
from nltk.tokenize import word_tokenize

file_out = open('data.csv', 'w')
with open('book1.txt', 'r') as myfile:
    for s in myfile:
        words = nltk.word_tokenize(s)
        words = [word.lower() for word in words if word.isalpha()]
        for word in words:
            token = WordNetLemmatizer().lemmatize(words, 'v')
            filtered_sentence = [""]
            for n in words:
                if n not in token:
                    filtered_sentence.append("" + n)
            file_out.writelines(filtered_sentence + ["\n"])

There are a few issues here, most notably with the last two for loops.
The way you are doing it, the output is written as follows:
word1
word1word2
word1word2word3
word1word2word3word4
........etc
I'm guessing that is not the expected output. I'm assuming the expected output is:
word1
word2
word3
word4
........etc (without creating duplicates)
I applied the code below to a 3 paragraph Cat Ipsum file. Note that I changed some variable names due to my own naming conventions.
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint

# read the text into a single string.
with open("book1.txt") as infile:
    text = ' '.join(infile.readlines())

words = word_tokenize(text)
words = [word.lower() for word in words if word.isalpha()]

# create the lemmatized word list
results = []
for word in words:
    # you were using words instead of word below
    token = WordNetLemmatizer().lemmatize(word, "v")
    # check if token not already in results.
    if token not in results:
        results.append(token)

# sort results, just because :)
results.sort()

# print and save the results
pprint(results)
print(len(results))
with open("nltk_data.csv", "w") as outfile:
outfile.writelines(results)
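Alternatively, since the original question imports the csv module, a minimal sketch that writes one token per row with csv.writer could look like this (using the same results list as above):

import csv

# write each lemmatized token as its own single-column row in the CSV file
with open("nltk_data.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for token in results:
        writer.writerow([token])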

Related

create variable name based on text in the sentence

I have a list of sentences. Each sentence has to be converted to a JSON. There is a unique 'name' for each sentence that is also specified in that JSON. The problem is that the number of sentences is large, so it's monotonous to manually give a name. The name should reflect the meaning of the sentence, e.g., if the sentence is "do you like cake?" then the name should be something like "likeCake". I want to automate the process of creating a name for each sentence. I googled text summarization, but the results were about paragraph summarization, not sentence summarization. How do I go about this?
This sort of task falls under natural language processing. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit to deal with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()

stop_words = set(stopwords.words('english'))

for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words
    words = [w for w in stripped if w not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
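For example, a minimal sketch (the added words are just placeholders):

# extend the stop word set with your own words (these entries are only examples)
stop_words.update(['please', 'thanks', 'also'])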

How to add more stopwords in nltk list?

I have the following code. I have to add more words to the nltk stopword list. After I run this, it does not add the words to the list.
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
new_words = open("stopwords_en.txt", "r")
new_stopwords = stop.union(new_words)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in emails_body_text]
Don't do things blindly. Read in your new list of stopwords, inspect it to see that it's right, then add it to the other stopword list. Start with the code suggested by @greg_data, but you'll need to strip newlines and maybe do other things -- who knows what your stopwords file looks like?
This might do it, for example:
new_words = open("stopwords_en.txt", "r").read().split()
new_stopwords = stop.union(new_words)
PS. Don't keep splitting and joining your document; tokenize once and work with the list of tokens.
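For instance, a sketch of that tokenize-once approach, reusing the names from your snippet (new_stopwords, exclude, lemma and emails_body_text are assumed to be defined as above):

from nltk.tokenize import word_tokenize

def clean(doc):
    # tokenize once, then filter stopwords/punctuation and lemmatize the token list
    tokens = word_tokenize(doc.lower())
    tokens = [t for t in tokens if t not in new_stopwords and t not in exclude]
    return [lemma.lemmatize(t) for t in tokens]

doc_clean = [clean(doc) for doc in emails_body_text]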

Stemming words with NLTK (python)

I am new to Python text processing. I am trying to stem words in a text document that has around 5000 rows.
I have written the script below:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    # 4. Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    # 5. Stem words
    words = ([stemmer.stem(w) for w in words])
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return(" ".join(meaningful_words))

clean_Description = Description_to_words(train["Description"][15])
But when I test it, the resulting words are not stemmed. Can anyone help me figure out the issue? Am I doing something wrong in the "Description_to_words" function?
And when I execute the stem command separately, like below, it works.
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>>
>>> for w in words:
...     print(stemmer.stem(w))
...
mobil
app
-
unabl
to
add
read
Here's each step of your function, fixed.
Remove HTML.
Description_text = BeautifulSoup(raw_Description).get_text()
Remove non-letters, but don't remove whitespaces just yet. You can also simplify your regex a bit.
letters_only = re.sub("[^\w\s]", " ", Description_text)
Convert to lower case, split into individual words: I recommend using word_tokenize again, here.
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
Remove stop words.
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if not w in stops]
Stem words. Here is another issue. Stem meaningful_words, not words.
return ' '.join(stemmer.stem(w) for w in meaningful_words)
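Putting those steps together, the whole function might look like the sketch below (keeping BeautifulSoup and re from your original script; "html.parser" is just one parser choice):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description, "html.parser").get_text()
    # 2. Remove punctuation but keep whitespace
    letters_only = re.sub(r"[^\w\s]", " ", Description_text)
    # 3. Convert to lower case and tokenize
    words = word_tokenize(letters_only.lower())
    # 4. Remove stop words
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    # 5. Stem the remaining words and join them back into one string
    return " ".join(stemmer.stem(w) for w in meaningful_words)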

To find frequency of every word in text file in python

I want to find the frequency of all words in my text file so that I can find the most frequently occurring words.
Can someone please help me with the command to use for that?
import nltk
text1 = "hello he heloo hello hi "  # example text
fdist1 = FreqDist(text1)

I have used the above code, but the problem is that it is not giving word frequency; rather, it displays the frequency of every character.
I also want to know how to input the text from a text file.
I tried the example and saw the same thing you were seeing. In order for it to work properly, you have to split the string on spaces. If you do not do this, it counts each character, which is what you were seeing. The following returns the proper count of each word, not of each character.
import nltk
text1 = 'hello he heloo hello hi '
text1 = text1.split(' ')
fdist1 = nltk.FreqDist(text1)
print (fdist1.most_common(50))
If you want to read from a file and get the word count, you can do it like so:
input.txt
hello he heloo hello hi
my username is heinst
your username is frooty
python code
import nltk
with open("input.txt", "r") as myfile:
    data = myfile.read().replace('\n', ' ')
data = data.split(' ')
fdist1 = nltk.FreqDist(data)
print(fdist1.most_common(50))
For what it's worth, NLTK seems like overkill for this task. The following will give you word frequencies, in order from highest to lowest.
from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
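For example, reading the text from a file (the file name here is only a placeholder) and printing the ten most frequent words:

from collections import Counter

# 'input.txt' is just an example path
with open('input.txt') as infile:
    word_freqs = Counter(infile.read().split())

# ten most frequent words, highest count first
print(word_freqs.most_common(10))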
text1 in the nltk book is a collection of tokens (words, punctuation) unlike in your code example where text1 is a string (collection of Unicode codepoints):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
If your input is indeed space-separated words then to find the frequency, use @Boa's answer:
freq = Counter(text_with_space_separated_words.split())
Note: FreqDist is a Counter but it also defines additional methods such as .plot().
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk
with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
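For example, two of the other tokenizers nltk provides, shown purely as an illustration:

from nltk.tokenize import wordpunct_tokenize, RegexpTokenizer

text = "Hello there! Isn't this nice?"
print(wordpunct_tokenize(text))                # also splits contractions and punctuation into tokens
print(RegexpTokenizer(r'\w+').tokenize(text))  # keeps only alphanumeric runs, dropping punctuation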
If you want the words and their frequencies as a dictionary, the following code may be helpful:
from nltk.tokenize import word_tokenize

freq = {}
for f in word_tokenize(inputSentence):
    # count each occurrence of the token
    freq[f] = freq.get(f, 0) + 1
print(freq)
I think the code below is useful for getting the frequency of each word in the file as a dictionary:
myfile = open('greet.txt')
temp = myfile.read()
x = temp.split("\n")
y = list()
for item in x:
    z = item.split(" ")
    y.append(z)
count = dict()
for name in y:
    for items in name:
        if items not in count:
            count[items] = 1
        else:
            count[items] = count[items] + 1
print(count)

remove stopwords and tokenize for collocationbigramfinder NLTK

I keep getting this error when I try to run the script below:

  return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

I am not sure what is wrong. I am essentially reading from a text file, filtering out the stopwords, and tokenizing the text using NLTK.
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

text_file = open('sentiment_test.txt', 'r')
lines = text_file.readlines()
filtered_words = [w for w in lines if not w in stopwords.words('english')]
print filtered_words
tokens = word_tokenize(str(filtered_words))
print tokens
finder = BigramCollocationFinder.from_words(tokens)
Any help would be much appreciated.
I am presuming that sentiment_test.txt is just plain text, and not a specific format.
You are trying to filter lines and not words. You should first tokenize and then filter the stopwords.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

with open('sentiment_test.txt', 'r') as text_file:
    text = text_file.read()

tokens = word_tokenize(str(text))
tokens = [w for w in tokens if not w in stopset]
print tokens
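From there you can build the collocation finder on the filtered tokens, roughly as in your original snippet (the PMI measure and the top-10 cutoff below are just example choices):

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# ten highest-scoring bigrams by pointwise mutual information
print(finder.nbest(bigram_measures.pmi, 10))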
Hope this helps.
