Currently I'm working on a WordNet-based semantic similarity measurement project. As far as I know, these are the steps for computing semantic similarity between two sentences:
1. Partition each sentence into a list of tokens.
2. Stem the words.
3. Apply part-of-speech tagging (disambiguation).
4. Find the most appropriate sense for every word in a sentence (word sense disambiguation).
5. Compute the similarity of the sentences based on the similarity of the pairs of words.
Now I'm at step 3, but I can't get the correct output. I'm not very familiar with Python, so I would appreciate your help.
This is my code.
import nltk
from nltk.corpus import stopwords

def get_tokens():
    test_sentence = open("D:/test/resources/AnswerEvaluation/Sample.txt", "r")
    try:
        for item in test_sentence:
            stop_words = set(stopwords.words('english'))
            token_words = nltk.word_tokenize(item)
            sentence_tokenization = [word for word in token_words if word not in stop_words]
            print(sentence_tokenization)
        return sentence_tokenization
    except Exception as e:
        print(str(e))

def get_stems():
    tokenized_sentence = get_tokens()
    for tokens in tokenized_sentence:
        sentence_stemming = nltk.PorterStemmer().stem(tokens)
        print(sentence_stemming)
    return sentence_stemming

def get_tags():
    stemmed_sentence = get_stems()
    tag_words = nltk.pos_tag(stemmed_sentence)
    print(tag_words)
    return tag_words

get_tags()
Sample.txt contains the sentences: "I was taking a ride in the car." and "I was riding in the car."
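For comparison, a minimal sketch (just an illustration, assuming NLTK's PorterStemmer and pos_tag; stem_tokens and tag_tokens are made-up helper names) of how steps 2 and 3 could pass a list of stems into the tagger rather than a single string:

import nltk

def stem_tokens(tokens):
    # Keep every stem in a list instead of only the last one
    stemmer = nltk.PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

def tag_tokens(tokens):
    # pos_tag expects a list of tokens; passing a single string would tag individual characters
    return nltk.pos_tag(tokens)

tokens = nltk.word_tokenize("I was taking a ride in the car.")
print(tag_tokens(stem_tokens(tokens)))

Note that taggers are trained on full word forms, so tagging is often done on the original tokens before stemming; tagging stems can degrade accuracy.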
I'm analyzing a book of 1167 pages (a txt file). So far I've done preprocessing of the data (cleaning, punctuation removal, stop-word removal, tokenization).
Now, how can I cluster the text and visualize the similarity (plot the clusters)?
For example -
text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"
In my task, I divided the whole book into chapters; text1 is chapter one, text2 is chapter two, and so on. Now I want to compare chapter 1 and chapter 2 like this.
Example code that I used for data preprocessing:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# split into words
tokens = word_tokenize(pages[1])
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
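For the clustering and visualization part, one possible sketch (assuming scikit-learn and matplotlib are installed, and that chapters is a list holding one preprocessed string per chapter, e.g. [text1, text2] for the first two chapters) is TF-IDF vectors with cosine similarity and k-means:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# chapters is assumed to hold one preprocessed string per chapter
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(chapters)

# similarity between chapter 1 and chapter 2
print(cosine_similarity(X[0], X[1]))

# cluster all chapters and plot them in 2D
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()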
I have tried looking into this and couldn't find any way to do it the way I imagine. The term I am trying to group, as an example, is 'no complaints': the 'no' is picked up as a stop word, so I have manually removed it from the stop-word list to ensure it is included in the data. However, both words are then picked up during the sentiment analysis as negative words. I want to combine them so that they can be categorised as either neutral or positive. Is it possible to manually group those words or terms together and decide how they are analysed in the sentiment analysis?
I have found a way to group words using POS tagging and chunking, but this combines tags together into multi-word expressions and doesn't necessarily pick them up correctly in the sentiment analysis.
Current code (using POS Tagging):
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize, MWETokenizer
import re, gensim, nltk
from gensim.utils import simple_preprocess
import pandas as pd
d = {'text': ['no complaints', 'not bad']}
df = pd.DataFrame(data=d)
stop = stopwords.words('english')
stop.remove('no')
stop.remove('not')
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuation
data_words = list(sent_to_words(df))
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
txt = df
txt = txt.apply(str)
#pos tag
words = [word_tokenize(i) for i in sent_tokenize(txt['text'])]
pos_tag= [nltk.pos_tag(i) for i in words]
#chunking
tokenized_text = word_tokenize(txt['text'])
tagged_token = nltk.pos_tag(tokenized_text)
grammar = "NP : {<DT>+<NNS>}"
phrases = nltk.RegexpParser(grammar)
result = phrases.parse(tagged_token)
print(result)
sia = SentimentIntensityAnalyzer()
def find_sentiment(post):
    if sia.polarity_scores(post)["compound"] > 0:
        return "Positive"
    elif sia.polarity_scores(post)["compound"] < 0:
        return "Negative"
    else:
        return "Neutral"
df['sentiment'] = df['text'].apply(lambda x: find_sentiment(x))
df['compound'] = [sia.polarity_scores(x)['compound'] for x in df['text']]
df
Output:
(S
  0/CD
  (NP no/DT complaints/NNS)
  1/CD
  not/RB
  bad/JJ
  Name/NN
  :/:
  text/NN
  ,/,
  dtype/NN
  :/:
  object/NN)
|   | text          | sentiment | compound |
|:--|:--------------|:----------|:---------|
| 0 | no complaints | Negative  | -0.5994  |
| 1 | not bad       | Positive  | 0.4310   |
I understand that my current code does not incorporate the POS tagging and chunking in the sentiment analysis, but you can see the combination of the words 'no complaints'; however, its current sentiment and sentiment score are negative (-0.5994). The aim is to use POS tagging and assign the sentiment as positive... somehow, if possible!
Option 1
Use VADER sentiment analysis instead, which seems to handle such idioms better than NLTK does (NLTK actually incorporates VADER, but it seems to behave differently in such situations). No need to change anything in your code except to install VADER, as described in its instructions, and then import the library as follows (while removing the import from nltk.sentiment):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Using VADER, you should get the following results. I've added one extra idiom (i.e., "no worries"), which would also be given a negative score if nltk's sentiment was used.
|   | text          | sentiment | compound |
|:--|:--------------|:----------|:---------|
| 0 | no complaints | Positive  | 0.3089   |
| 1 | not bad       | Positive  | 0.4310   |
| 2 | no worries    | Positive  | 0.3252   |
Option 2
Modify NLTK's lexicon, as described here; however, it might not always work, as it probably accepts only single words, not idioms. Example below:
new_words = {
    'no complaints': 3.0
}
sia = SentimentIntensityAnalyzer()
sia.lexicon.update(new_words)
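Since the lexicon keys are single tokens, one possible workaround (only a sketch; the merged token 'no_complaints' and the valence 1.5 are illustrative assumptions) is to merge the idiom into a single token with NLTK's MWETokenizer before scoring, and add that merged token to the lexicon:

from nltk.tokenize import MWETokenizer, word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

mwe = MWETokenizer([('no', 'complaints')], separator='_')
sia = SentimentIntensityAnalyzer()
sia.lexicon.update({'no_complaints': 1.5})  # illustrative positive valence

text = 'no complaints'
merged = ' '.join(mwe.tokenize(word_tokenize(text)))  # -> 'no_complaints'
print(sia.polarity_scores(merged))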
I have a list of sentences. Each sentence has to be converted to JSON, and there is a unique 'name' for each sentence that is also specified in that JSON. The problem is that the number of sentences is large, so it's monotonous to assign a name manually. The name should reflect the meaning of the sentence; e.g., if the sentence is "Do you like cake?" then the name should be something like "likeCake". I want to automate the process of creating the name for each sentence. I googled text summarization, but the results were about paragraph summarization, not sentence summarization. How should I go about this?
This sort of task falls under natural language processing. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit to deal with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()

stop_words = set(stopwords.words('english'))

for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words
    words = [w for w in stripped if w not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
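If you specifically want camelCase names such as "likeCake", a small helper (a sketch; to_camel_case is a made-up name, not part of the snippet above) could be applied to the filtered words instead of the plain join:

def to_camel_case(words):
    # Lower-case the first word and capitalize the rest, e.g. ['like', 'cake'] -> 'likeCake'
    if not words:
        return ''
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

print(to_camel_case(['like', 'cake']))  # likeCake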
I want to find the relevance of some words (like economy, technology) in a single document.
The document has around 30 pages; the idea is to extract all the text and determine the relevance of those words for this document.
I know that TF-IDF is used on a group of documents, but is it possible to use TF-IDF to solve this problem? If not, how can I do it in Python?
Using NLTK and one of its builtin corpora, you can make some estimates on how "relevant" a word is:
from collections import Counter
from math import log
from nltk import word_tokenize
from nltk.corpus import brown
toks = word_tokenize(open('document.txt').read().lower())
tf = Counter(toks)
freqs = Counter(w.lower() for w in brown.words())
n = len(brown.words())
for word in tf:
    tf[word] *= log(n / (freqs[word] + 1))**2

for word, score in tf.most_common(10):
    print('%8.2f %s' % (score, word))
Change document.txt to the name of your document and the script will output the ten most "relevant" words in it.
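If you only care about a few specific terms (e.g. "economy", "technology"), a small variation of the same idea (just a sketch reusing the tf counter from the script above, which already holds the weighted scores after the first loop runs) prints the scores for just those words:

# reuses tf from the script above, which already contains the weighted scores
for word in ['economy', 'technology']:
    print('%8.2f %s' % (tf[word], word))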
I have a corpus of sentences in a specific domain.
I am looking for open-source code or a package that I can feed the data to, and it will generate a good, reliable language model (meaning, given a context, it knows the probability of each word).
Is there such a code/project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower()  # Lower case all characters in word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        return unigram_fdist[word] / total_words
    else:
        return -1  # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1  # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
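For example, a quick check (assuming the snippets above have been run; the words chosen are just illustrative) might look like:

print(getUnigramProbability('the'))              # relative frequency of "the" in the corpus
print(getBigramProbability('<<START>>', 'the'))  # probability that a sentence starts with "the"
print(getBigramProbability('the', 'walrus'))     # probability of "walrus" following "the"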
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
You might try word_language_model from the PyTorch examples. There might just be an issue if you have a big corpus, as they load all the data into memory.