natural language corpus string to int - python

Take a sample of sentences from each of the corpus1, corpus2 and corpus3 corpora and display the average length (measured as the number of characters per sentence).
I have three corpora, and sample_raw_sents is a defined function that returns random sentences:
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()

sample_size = 50

for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))
Using this code all the lengths are printed, but how do I sum() these lengths?

Use zip; it lets you draw one sentence from each corpus at a time.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))

for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed / 3
    print(summed, average)
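Note that this prints one sum and average per drawn triple. If you instead want a single overall average over all the sampled sentences, a minimal sketch (assuming the same corpus objects and sample_raw_sents method as above) is:
lengths = [len(s)
           for corpus in (tcr, rcr, mcr)
           for s in corpus.sample_raw_sents(sample_size)]
print(sum(lengths) / len(lengths))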

You could store all the sentence lengths in a list and then sum them up.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()

sample_size = 50
lengths = []

for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))

print(sum(lengths) / len(lengths))

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()

sample_size = 50
s = 0

for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)

average = s / 150
print('average: {}'.format(average))
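For reference, a small variant of the same idea that counts the sampled sentences instead of hard-coding 150, in case sample_size or the number of corpora ever changes (same assumptions about the corpus objects as above):
total = 0
count = 0
for corpus in (tcr, rcr, mcr):
    for sentence in corpus.sample_raw_sents(sample_size):
        total += len(sentence)
        count += 1
print('average: {}'.format(total / count))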

Related

Python most frequent sentences in large text

My goal is to extract the x most frequent sentences that include the word y. My solution right now works this way:
Extract all sentences that contain the word y
Count the most frequent words in these sentences and store them in a list
Extract the sentences that include z words from that list
To end up with x sentences, I simply increase or decrease z
So I get the most frequent sentences by looking for the most frequent words. This method works on a small amount of data, but takes forever on larger data sets.
EDIT - code
# Extract all sentences from data containing the word
def getSentences(word):
    sentences = []
    for x in data_lemmatized:
        if word in x:
            sentences.append(x)
    return sentences

# Get the most frequent words from all the sentences
def getSentenceWords(sentences):
    cnt = Counter()
    for x in sentences:
        for y in x:
            cnt[y] += 1
    words = []
    for x, y in cnt.most_common(30):
        if x not in exclude and x != ' ':
            words.append(x)
    return words

# Get sentences that contain as many words as possible
def countWordshelp(allSentences, words, amountWords):
    tempList = []
    for sentence in allSentences:
        temp = len(words[:amountWords])
        count = 0
        for word in words[:amountWords]:
            if word in sentence:
                count += 1
        if count == temp:
            tempList.append(sentence)
    return tempList

def countWords(allSentences, words, nrSentences):
    tempList = []
    prevList = []
    amountWords = 1
    tempList = countWordshelp(allSentences, words, amountWords)
    while len(tempList) > nrSentences:
        amountWords += 1
        newAllSentences = tempList
        prevList = tempList
        tempList = countWordshelp(newAllSentences, words, amountWords)
        if len(tempList) < nrSentences:
            return prevList[:nrSentences]
    return tempList

if __name__ == '__main__':
    for x in terms:
        for y in x:
            allSentences = getSentences(y)
            words = getSentenceWords(allSentences)
            test = countWords(allSentences, words, nrSentences)
            allTest.append(test)
terms is a list of lists, each containing 10 words; data_lemmatized is the large data set in lemmatized form.
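One way to avoid the repeated countWordshelp passes, sketched here only as an illustration: score every sentence once by how many of the frequent words it contains and take the top nrSentences with heapq.nlargest. This assumes data_lemmatized and exclude exist as described, and it assumes that "number of frequent words contained" is an acceptable ranking, which is a simplification of the z-threshold loop in the question.
import heapq
from collections import Counter

def top_sentences(word, nr_sentences, n_frequent_words=30):
    # one pass to collect the sentences containing the word
    sentences = [s for s in data_lemmatized if word in s]
    # one pass to count word frequencies in those sentences
    cnt = Counter(w for s in sentences for w in s)
    frequent = {w for w, _ in cnt.most_common(n_frequent_words)
                if w not in exclude and w != ' '}
    # score each sentence once and keep the best nr_sentences
    return heapq.nlargest(nr_sentences, sentences,
                          key=lambda s: sum(1 for w in frequent if w in s))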

NLTK wordnet calculating path similarity of words in two lists

I'm trying to find the similarity of words in a text file. I have attached the code below, where I read from a text file and split the contents into two lists, but now I would like to compare the words in list 1 to those in list 2.
file = open('M:\ThirdYear\CE314\Assignment2\sim_data\Assignment_Additional.txt', 'r')
word1 = []
word2 = []
split = [line.strip() for line in file]
count = 0
for line in split:
    if count == (len(split) - 1):
        break
    else:
        word1.append(line.split('\t')[0])
        word2.append(line.split('\t')[1])
        count = count + 1
print(word1)
print(word2)
for x, y in zip(word1, word2):
    w1 = wordnet.synset(x + '.n.1')
    w2 = wordnet.synset(y + '.n.1')
    print(w1.path_similarity(w2))
I want to iterate through both lists and print their path_similarity, but only when the words satisfy wordnet.synset(x + '.n.1'); any word that does not have an '.n.1' synset I want to ignore and skip, but I'm not entirely sure how to make this check in Python.
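A minimal sketch of that check, assuming the same word1 and word2 lists as above: wrap the lookups in a try/except for WordNetError (importable from nltk.corpus.reader.wordnet) and skip any pair whose synset name is not valid.
from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

for x, y in zip(word1, word2):
    try:
        w1 = wordnet.synset(x + '.n.1')
        w2 = wordnet.synset(y + '.n.1')
    except WordNetError:
        # no such noun synset for x or y, so skip this pair
        continue
    print(x, y, w1.path_similarity(w2))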

Apply collocations from a list of bigrams with NLTK in Python

I have to find and "apply" collocations in several sentences. The sentences are stored in a list of strings. Let's focus on just one sentence for now.
Here's an example:
sentence = 'I like to eat the ice cream in new york'
Here's what I want in the end:
sentence_final = 'I like to eat the ice_cream in new_york'
I'm using Python NLTK to find the collocations and I'm able to create a set containing all the possible collocations over all the sentences I have.
Here's an example of the set:
set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])
It's obviously bigger in reality.
I created the following function, which should return the new sentence, modified as described above:
def apply_collocations(sentence, set_colloc):
    window_size = 2
    words = sentence.lower().split()
    list_bigrams = list(nltk.bigrams(words))
    set_bigrams = set(list_bigrams)
    intersect = set_bigrams.intersection(set_colloc)
    print(set_colloc)
    print(set_bigrams)
    # No collocation in this sentence
    if not intersect:
        return sentence
    # At least one collocation in this sentence
    else:
        set_words_iters = set()
        # Create set of words of the collocations
        for bigram in intersect:
            set_words_iters.add(bigram[0])
            set_words_iters.add(bigram[1])
        # Sentence beginning
        if list_bigrams[0][0] not in set_words_iters:
            new_sentence = list_bigrams[0][0]
            begin = 1
        else:
            new_sentence = list_bigrams[0][0] + '_' + list_bigrams[0][1]
            begin = 2
        for i in range(begin, len(list_bigrams)):
            print(new_sentence)
            if list_bigrams[i][1] in set_words_iters and list_bigrams[i] in intersect:
                new_sentence += ' ' + list_bigrams[i][0] + '_' + list_bigrams[i][1]
            elif list_bigrams[i][1] not in set_words_iters:
                new_sentence += ' ' + list_bigrams[i][1]
        return new_sentence
Two questions:
Is there a more optimized way to do this?
Since I'm a little inexperienced with NLTK, can someone tell me if there's a "direct way" to apply collocations to a certain text? I mean, once I have identified the bigrams that I consider collocations, is there some function (or fast method) to modify my sentences?
You can simply replace the string "x y" by "x_y" for each element in your collocations set:
def apply_collocations(sentence, set_colloc):
    res = sentence.lower()
    for b1, b2 in set_colloc:
        res = res.replace("%s %s" % (b1, b2), "%s_%s" % (b1, b2))
    return res
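For example, applied to the sentence from the question (the result is lowercased because the function lowercases its input):
set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])
sentence = 'I like to eat the ice cream in new york'
print(apply_collocations(sentence, set_collocations))
# i like to eat the ice_cream in new_york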

N-gram based language detection (William B. Cavnar and John M. Trenkle)

I am trying to implement the N-gram based language detection paper by William B. Cavnar and John M. Trenkle, using https://github.com/z0mbiehunt3r/ngrambased-textcategorizer/blob/master/ngramfreq.py
import operator
import string
import glob
import os.path
from nltk.util import ngrams

# file which contains the language to be detected
filename = raw_input("Enter the file name: ")
fp = open(filename)
text = str(fp.read())
fp.close()

# tokenize the text
rawtext = text.translate(None, string.punctuation)
words = [w.lower() for w in rawtext.split(" ")]

# generate ngrams for the text
gen_ngrams = []
for word in words:
    for i in range(1, 6):
        temp = ngrams(word, i, pad_left=True, pad_right=True, left_pad_symbol=' ', right_pad_symbol=' ')
        # join the characters of individual ngrams
        for t in temp:
            ngram = ' '.join(t)
            gen_ngrams.append(ngram)

# calculate ngram frequencies of the text
ngram_stats = {}
for n in gen_ngrams:
    if not ngram_stats.has_key(n):
        ngram_stats.update({n: 1})
    else:
        ng_count = ngram_stats[n]
        ngram_stats.update({n: ng_count + 1})

# now sort them and reverse sort based on the second column (count of ngrams)
ngrams_txt_sorted = sorted(ngram_stats.iteritems(), key=operator.itemgetter(1), reverse=True)[0:300]

# Load ngram language statistics
lang_stats = {}
for filepath in glob.glob('./langdata/*.dat'):
    filename = os.path.basename(filepath)
    lang = os.path.splitext(filename)[0]
    ngram_stats = open(filepath, "r").readlines()
    ngram_stats = [x.rstrip() for x in ngram_stats]
    lang_stats.update({lang: ngram_stats})

# compare ngram frequency statistics by doing a rank order lookup
lang_ratios = {}
txt_ng = [ng[0] for ng in ngrams_txt_sorted]
print txt_ng
max_out_of_place = len(txt_ng)
for lang, ngram_stat in lang_stats.iteritems():
    lang_ng = [ng[0] for ng in lang_stats]
    doc_dist = 0
    for n in txt_ng:
        try:
            txt_ng_index = txt_ng.index(n)
            lang_ng_index = lang_ng.index(n)
        except ValueError:
            lang_ng_index = max_out_of_place
        doc_dist += abs(lang_ng_index - txt_ng_index)
    lang_ratios.update({lang: doc_dist})

for i in lang_ratios.iteritems():
    print i

predicted_lang = min(lang_ratios, key=lang_ratios.get)
print "The language is", predicted_lang
It outputs 'English' every time I execute it. The computed distances are always the same for all the languages. I am not able to figure out the logical error in the above code. Kindly help me.
Comparing to the Cavnar & Trenkle code, it looks like
ngram = ' '.join(t)
should be
ngram = ''.join(t)
(without the space)
I bet this is what's throwing off your stats.
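As a quick sanity check of the difference (assuming, as in the paper, that the .dat profiles store character n-grams without internal spaces):
from nltk.util import ngrams

t = list(ngrams("dog", 3))[0]   # ('d', 'o', 'g')
print(' '.join(t))              # 'd o g' -- spaced n-grams will not match the profiles
print(''.join(t))               # 'dog'   -- the Cavnar & Trenkle style n-gram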

Object has no attribute 'update'

I am trying to use the code that is on this link... see example 6.
So this is the code:
import json
import nltk
import numpy

BLOG_DATA = "resources/ch05-webpages/feed.json"

N = 100  # Number of words to consider
CLUSTER_THRESHOLD = 5  # Distance between words to consider
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn
def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = -1
    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:
        sentence_idx += 1
        word_idx = []
        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.
                word_idx.append(s.index(w))
            except ValueError, e:  # w not in this particular sentence
                pass
        word_idx.sort()
        # It is possible that some sentences may not contain any important words at all.
        if len(word_idx) == 0: continue
        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.
        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)
        # Score each cluster. The max score for any given cluster is the score
        # for the sentence.
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster \
                * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        scores.append((sentence_idx, score))
    return scores

def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]
    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
    top_n_words = [w[0] for w in fdist.items()
                   if w[0] not in nltk.corpus.stopwords.words('english')][:N]
    scored_sentences = _score_sentences(normalized_sentences, top_n_words)
    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter
    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]
    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
    # Decorate the post object with summaries
    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:
    post.update(summarize(post['content']))

    print post['title']
    print '=' * len(post['title'])
    print
    print 'Top N Summary'
    print '-------------'
    print ' '.join(post['top_n_summary'])
    print
    print 'Mean Scored Summary'
    print '-------------------'
    print ' '.join(post['mean_scored_summary'])
    print
But when I run it, it says:
Traceback (most recent call last):
File "/home/jetonp/PycharmProjects/Summeriza/blogs_and_nlp__summarize.py", line 117, in <module>
post.update(summarize(post['content']))
AttributeError: 'unicode' object has no attribute 'update'
Process finished with exit code 1
What is causing this error and how do I fix it?
I figured it out. In the example that you are working from, the summarize method returns a dictionary. Your summarize method does not return anything, due to improper indentation: part of it is indented with just three spaces, and part of it with none. The standard indentation in Python is four spaces. summarize should look like this:
def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]
    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
    top_n_words = [w[0] for w in fdist.items()
                   if w[0] not in nltk.corpus.stopwords.words('english')][:N]
    scored_sentences = _score_sentences(normalized_sentences, top_n_words)
    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter
    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]
    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
    # Decorate the post object with summaries
    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])
