I'm analyzing a book of 1167 pages (txt file). So far I've done preprocessing of the data (cleaning, removing punctuation, stop word removing, tokenization).
Now How can I cluster the text and visualize the similarity (plot the cluster)
For example -
text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"
In my task, I divided whole book into chapters. tex1 is chapter one text2 is chapter 2 and so on. Now I wanna compare chapter 1 and 2 like this.
Example code that I used for data pre processing:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(pages[1])
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
Related
I have a list of sentences. Each sentence has to be converted to a json. There is a unique 'name' for each sentence that is also specified in that json. The problem is that the number of sentences is large so it's monotonous to manually give a name. The name should be similar to the meaning of the sentence e.g., if the sentence is "do you like cake?" then the name should be like "likeCake". I want to automate the process of creation of name for each sentence. I googled text summarization but the results were not for sentence summarization but paragraph summarization. How to go about this?
This sort of task is used for natural language processing. You can get a result similar to what you want by removing Stop Words. Bases on this article, you can use the Natural Language Toolkit for dealing with the stop words. After installing the libray (pip install nltk), you can do something around the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()
stop_words = set(stopwords.words('english'))
for line in Lines:
# split into words
tokens = word_tokenize(line)
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# filter out stop words
words = [w for w in words if not w in stop_words]
print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
I have a corpus of sentences in a specific domain.
I am looking for an open-source code/package, that I can give the data and it will generate a good, reliable language model. (Meaning, given a context, know the probability for each word).
Is there such a code/project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
new_sentence = [word for word in sentence if word not in punctuation]
sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
new_sentence = list()
for j, word in enumerate(sentence.copy()):
new_word = word.lower() # Lower case all characters in word
new_sentence.append(new_word)
sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
new_sentence = start + sentence + end
sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
for word in sentence:
new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<> i", "i am", "am the", "the walrus", and "walrus <>".
bigrams = list()
for sentence in sentences:
new_bigrams = nltk.bigrams(sentence)
bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
if word in unigram_fdist:
return unigram_fdist[word]/total_words
else:
return -1 # You should figure out how you want to handle out-of-vocabulary words
def getBigramProbability(word1, word2):
if word1 not in bigram_fdist:
return -1 # You should figure out how you want to handle out-of-vocabulary words
elif word2 not in bigram_fdist[word1]:
# i.e. "word1 word2" never occurs in the corpus
return getUnigramProbability(word2)
else:
bigram_frequency = bigram_fdist[word1][word2]
unigram_frequency = unigram_fdist[word1]
bigram_probability = bigram_frequency / unigram_frequency
return bigram_probability
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
You might try word_language_model from PyTorch examples. There just might be an issue if you have a big corpus. They load all data in memory.
I am new to Python text processing, I am trying to stem word in text document, has around 5000 rows.
I have written below script
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
def Description_to_words(raw_Description ):
# 1. Remove HTML
Description_text = BeautifulSoup(raw_Description).get_text()
# 2. Remove non-letters
letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
# 3. Convert to lower case, split into individual words
words = letters_only.lower().split()
stops = set(stopwords.words("english"))
# 5. Remove stop words
meaningful_words = [w for w in words if not w in stops]
# 5. stem words
words = ([stemmer.stem(w) for w in words])
# 6. Join the words back into one string separated by space,
# and return the result.
return( " ".join( meaningful_words ))
clean_Description = Description_to_words(train["Description"][15])
But when I test results words were not stemmed , can anyone help me to know what is issue , I am doing something wrong in "Description_to_words" function
And, when I execute stem command separately like below it works.
from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>>
>>> for w in words:
... print(stemmer.stem(w))
...
mobil
app
-
unabl
to
add
read
Here's each step of your function, fixed.
Remove HTML.
Description_text = BeautifulSoup(raw_Description).get_text()
Remove non-letters, but don't remove whitespaces just yet. You can also simplify your regex a bit.
letters_only = re.sub("[^\w\s]", " ", Description_text)
Convert to lower case, split into individual words: I recommend using word_tokenize again, here.
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
Remove stop words.
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if not w in stops]
Stem words. Here is another issue. Stem meaningful_words, not words.
return ' '.join(stemmer.stem(w) for w in meaningful_words])
Currently I'm working with WordNet based semantic similarity measurement project. As I know below are the steps for computing semantic similarity between two sentences:
Each sentence is partitioned into a list of tokens.
Stemming words.
Part-of-speech disambiguation (or tagging).
Find the most appropriate sense for every word in a sentence (Word Sense Disambiguation).
Compute the similarity of the sentences based on the similarity of the pairs of words.
Now I'm at step 3. But I couldn't get the correct output. I'm not very familiar with Python. So I would appreciate your help.
This is my code.
import nltk
from nltk.corpus import stopwords
def get_tokens():
test_sentence = open("D:/test/resources/AnswerEvaluation/Sample.txt", "r")
try:
for item in test_sentence:
stop_words = set(stopwords.words('english'))
token_words = nltk.word_tokenize(item)
sentence_tokenization = [word for word in token_words if word not in stop_words]
print (sentence_tokenization)
return sentence_tokenization
except Exception as e:
print (str(e))
def get_stems():
tokenized_sentence = get_tokens()
for tokens in tokenized_sentence:
sentence_stemming = nltk.PorterStemmer().stem(tokens)
print (sentence_stemming)
return sentence_stemming
def get_tags():
stemmed_sentence = get_stems()
tag_words = nltk.pos_tag(stemmed_sentence)
print (tag_words)
return tag_words
get_tags()
Sample.txt contains the sentences, I was taking a ride in the car. I was riding in the car.
In python using NLTK how would I find a count of the number of non stop words in a document filtered by category?
I can figure out how to get the words in a corpus filtered by a category e.g. all the words in the brown corpus for category ‘news’ is:
text = nltk.corpus.brown.words(categories=category)
And separately I can figure out how to get all the words for a particular document e.g. all the words in the document ‘cj47’ in the brown corpus is:
text = nltk.corpus.brown.words(fileids='cj47')
And then I can loop through the results and count up the words that are not stopwords e.g.
stopwords = nltk.corpus.stopwords.words('english')
for w in text:
if w.lower() not in stopwords:
#found a non stop words
But how do I put it together so that I am filtering by category for a particular document? If I try to specify a category and a filter at the same time e.g.
text = nltk.corpus.brown.words(categories=category, fields=’cj47’)
I get an error saying:
ValueError: Specify fields or categories, not both
Get fileids for a category:
fileids = nltk.corpus.brown.fileids(categories=category)
For each file, count the non-stopwords:
for f in fileids:
words = nltk.corpus.brown.words(fileids=f)
sum = sum([1 for w in words if w.lower() not in stopwords])
print "Document %s: %d non-stopwords." % (f, sum)