I've looked at quite a few posts but none seem to help.
I want to calculate Term Frequency & Inverse Document Frequency (TF-IDF), a Bag of Words technique used in Deep Learning. The purpose of this code is just to calculate the formula; I do not implement an ANN here.
Below is a minimal code example. The problem occurs after the for loop.
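For reference, with the numbers used below the formula works out to: for 'the', tf = 10/100 = 0.1 and idf = log10(1000/100) = 1.0, so tf-idf = 0.1; for 'python', tf = 10/100 = 0.1 and idf = log10(1000/900) ≈ 0.0458, so tf-idf ≈ 0.0046.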
import math
docs = 1000
words_per_doc = 100 # length of doc
#word_freq = 10
#doc_freq = 100
dp = 4
print('Term Frequency Inverse Document Frequency')
# term, word_freq, doc_freq
words = [['the', 10, 100], ['python', 10, 900]]
tfidf_ = []
for idx, val in enumerate(words):
    print(words[idx][0] + ':')
    word_freq = words[idx][1]
    doc_freq = words[idx][2]
    tf = round(word_freq/words_per_doc, dp)
    idf = round(math.log10(docs/doc_freq), dp)
    tfidf = round((tf*idf), dp)
    print(str(tf) + ' * ' + str(idf) + ' = ' + str(tfidf))
    tfidf_.append(tfidf)
    print()
max_val = max(tfidf)
max_idx = tfidf.index(max_val)
#max_idx = tfidf.index(max(tfidf))
lowest_idx = 1 - max_idx
print('Therefore, \'' + words[max_idx][0] + '\' semantically is more important than \'' + words[lowest_idx][0] + '\'.')
#print('log(N/|{d∈D:w∈W}|)')
Error:
line 25, in <module>
max_val = max(tfidf)
TypeError: 'float' object is not iterable
You are passing tfidf to max() instead of tfidf_.
tfidf is a single float and tfidf_ is your list.
So the code should be:
max_val = max(tfidf_)
max_idx = tfidf_.index(max_val)
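With that change the comparison works; for completeness, the corrected tail of the script (same values as in the question):
max_val = max(tfidf_)              # largest TF-IDF score in the list
max_idx = tfidf_.index(max_val)    # position of the highest-scoring word
lowest_idx = 1 - max_idx           # the other of the two words
print('Therefore, \'' + words[max_idx][0] + '\' semantically is more important than \'' + words[lowest_idx][0] + '\'.')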
My goal here is to print related lines from a text file together. Some lines, however, are not printed together as they should be. I resolved the first problem, where the denominator was on the following line. In the else branch, however, every match seems to get the same value/index.
import fitz # this is pymupdf
with fitz.open("math-problems.pdf") as doc: #below converts pdf to txt
text = ""
for page in doc:
text += page.getText()
file_w = open("output.txt", "w") #save as txt file
file_w.write(text)
file_w.close()
file_r = open("output.txt", "r") #read txt file
word = 'f(x) = '
#--------------------------
list1 = file_r.readlines() # read each line and put into list
list2 = [k for k in list1 if word in k] # look for all elements with "f(x)" and put all in new list
list1_N = list1
list2_N = list2
list1 = [e[3:] for e in list1] #remove first three characters (they are always "1) " or "A) ")
char = str('\n')
for char in list2:
    index = list1.index(char)
    def digitcheck(s):
        isdigit = str.isdigit
        return any(map(isdigit,s))
    xx = digitcheck(list1[index])
    if xx:
        print(list1[index] + " / " + list1_N[index+1])
    else:
        print(list1[index] + list1[index+1]) # PROBLEM IS HERE, HOW COME EACH VALUE IS SAME HERE?
Output from terminal:
f(x) = x3 + x2 - 20x
/ x2 - 3x - 18
f(x) =
2 + 5x
f(x) =
2 + 5x
f(x) =
2 + 5x
f(x) =
2 + 5x
f(x) = x2 + 3x - 10
/ x2 - 5x - 14
f(x) = x2 + 2x - 8
/ x2 - 3x - 10
f(x) = x - 1
/ x2 + 8
f(x) = 3x3 - 2x - 6
/ 8x3 - 7x + 4
f(x) =
2 + 5x
f(x) = x3 - 6x2 + 4x - 1
/ x2 + 8x
Process finished with exit code 0
SOLVED
@copperfield was correct: I had repeating values, so my index was repeating. I solved this using a solution by @Shonu93 here. Essentially it locates all indices of duplicate values, puts those indices into one list elem_pos, and then prints each index from list1:
if empty in list1:
    counter = 0
    elem_pos = []
    for i in list1:
        if i == empty:
            elem_pos.append(counter)
        counter = counter + 1
    xy = elem_pos
    for i in xy:
        print(list1[i] + list1_N[i+1])
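For what it's worth, the same duplicate-index collection can be written more compactly with enumerate; a small sketch, assuming empty still holds the placeholder line being matched:
elem_pos = [i for i, line in enumerate(list1) if line == empty]  # every index whose line equals the placeholder
for i in elem_pos:
    print(list1[i] + list1_N[i+1])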
I'm working on this Python problem:
Given a sequence of the DNA bases {A, C, G, T}, stored as a string, returns a conditional probability table in a data structure such that one base (b1) can be looked up, and then a second (b2), to get the probability p(b2 | b1) of the second base occurring immediately after the first. (Assumes the length of seq is >= 3, and that the probability of any b1 and b2 which have never been seen together is 0. Ignores the probability that b1 will be followed by the end of the string.)
You may use the collections module, but no other libraries.
However I'm running into a roadblock:
word = 'ATCGATTGAGCTCTAGCG'
def dna_prob2(seq):
    tbl = dict()
    levels = set(word)
    freq = dict.fromkeys(levels, 0)
    for i in seq:
        freq[i] += 1
    for i in levels:
        tbl[i] = {x:0 for x in levels}
    lastlevel = ''
    for i in tbl:
        if lastlevel != '':
            tbl[lastlevel][i] += 1
        lastlevel = i
    for i in tbl:
        print(i,tbl[i][i] / freq[i])
    return tbl
tbl['T']['T'] / freq[i]
Basically, the end result is supposed to be the final line you see above. However, when I try to do that with print(i, tbl[i][i] / freq[i]) and run dna_prob2(word), I get 0.0 for everything.
Wondering if anyone here can help out.
Thanks!
I am not sure what your code is doing, but this works:
def makeprobs(word):
    singles = {}
    probs = {}
    thedict = {}
    ll = len(word)
    for i in range(ll-1):
        x1 = word[i]
        x2 = word[i+1]
        singles[x1] = singles.get(x1, 0)+1.0
        thedict[(x1, x2)] = thedict.get((x1, x2), 0)+1.0
    for i in thedict:
        probs[i] = thedict[i]/singles[i[0]]
    return probs
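A quick usage sketch (the returned dict is keyed by (first, second) base tuples; word is the example sequence from the question):
word = 'ATCGATTGAGCTCTAGCG'
probs = makeprobs(word)
print(probs.get(('T', 'T'), 0))   # p(T | T); 0 if the pair never occurs
print(probs.get(('A', 'G'), 0))   # p(G | A)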
I finally heard back from my professor. This is what the code was trying to accomplish:
word = 'ATCGATTGAGCTCTAGCG'
def dna_prob2(seq):
    tbl = dict()
    levels = set(seq)
    freq = dict.fromkeys(levels, 0)
    for i in seq:
        freq[i] += 1
    for i in levels:
        tbl[i] = {x:0 for x in levels}
    lastlevel = ''
    for i in seq:
        if lastlevel != '':
            tbl[lastlevel][i] += 1
        lastlevel = i
    return tbl, freq
condfreq, freq = dna_prob2(word)
print(condfreq['T']['T']/freq['T'])
print(condfreq['G']['A']/freq['A'])
print(condfreq['C']['G']/freq['G'])
Hope this helps.
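The key change from my original attempt is counting transitions with for i in seq rather than for i in tbl. If you want the full conditional probability table p(b2 | b1) as a nested dict instead of dividing by hand, a minimal sketch built on the returned counts could look like this (conditional_table is an illustrative name, not part of the assignment):
def conditional_table(seq):
    counts, freq = dna_prob2(seq)
    # divide each transition count by how often the first base occurs
    return {b1: {b2: counts[b1][b2] / freq[b1] for b2 in counts[b1]}
            for b1 in counts}

table = conditional_table(word)
print(table['T']['T'])   # same value as condfreq['T']['T'] / freq['T']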
Take a sample of sentences from each of the corpus1, corpus2 and corpus3 corpora and display the average length (measured in number of characters per sentence).
So I have 3 corpora, and sample_raw_sents is a defined function that returns random sentences:
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size=50
for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))
Using this code all the lengths are printed, but how do I sum() these lengths?
Use zip; it will allow you to draw a sentence from each corpus all at once.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size=50
zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))
for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed/3
    print(summed, average)
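Note that this prints a sum and average per triple of sentences. If what you want is a single overall average across all sampled sentences, a small accumulation over a fresh zip of samples would do it (a sketch, assuming each corpus yields sample_size sentences):
total = 0
count = 0
for s1, s2, s3 in zip(tcr.sample_raw_sents(sample_size),
                      rcr.sample_raw_sents(sample_size),
                      mcr.sample_raw_sents(sample_size)):
    total += len(s1) + len(s2) + len(s3)
    count += 3
print('overall average:', total / count)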
You could store all the sentence lengths in a list and then sum them up.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size=50
lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
print(sum(lengths) / len(lengths))
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size=50
s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
average = s/150 # 150 = 3 corpora * sample_size (50) sentences
print('average: {}'.format(average))
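A more compact equivalent that answers the sum() question directly and avoids hard-coding the 150 (a sketch using the same corpus objects):
lengths = [len(sentence)
           for corpus in (tcr, rcr, mcr)
           for sentence in corpus.sample_raw_sents(sample_size)]
print('average: {}'.format(sum(lengths) / len(lengths)))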
I am trying to implement the N-gram-based language detection paper by William B. Cavnar and John M. Trenkle, using https://github.com/z0mbiehunt3r/ngrambased-textcategorizer/blob/master/ngramfreq.py
import operator
import string
import glob
import os.path
from nltk.util import ngrams
#file which contains the language to be detected
filename = raw_input("Enter the file name: ")
fp = open(filename)
text = str(fp.read())
fp.close()
#tokenize the text
rawtext = text.translate(None, string.punctuation)
words = [w.lower() for w in rawtext.split(" ")]
#generate ngrams for the text
gen_ngrams=[]
for word in words:
    for i in range(1,6):
        temp = ngrams(word, i, pad_left = True, pad_right = True, left_pad_symbol = ' ', right_pad_symbol =' ')
        #join the characters of individual ngrams
        for t in temp:
            ngram = ' '.join(t)
            gen_ngrams.append(ngram)
#calculate ngram frequencies of the text
ngram_stats = {}
for n in gen_ngrams:
    if not ngram_stats.has_key(n):
        ngram_stats.update({n:1})
    else:
        ng_count = ngram_stats[n]
        ngram_stats.update({n:ng_count+1})
#now sort them, add an iterator to dict and reverse sort based on second column(count of ngrams)
ngrams_txt_sorted = sorted(ngram_stats.iteritems(), key=operator.itemgetter(1), reverse = True)[0:300]
#Load ngram language statistics
lang_stats={}
for filepath in glob.glob('./langdata/*.dat'):
    filename = os.path.basename(filepath)
    lang = os.path.splitext(filename)[0]
    ngram_stats = open(filepath,"r").readlines()
    ngram_stats = [x.rstrip() for x in ngram_stats]
    lang_stats.update({lang:ngram_stats})
#compare ngram frequency statistics by doing a rank order lookup
lang_ratios = {}
txt_ng = [ng[0] for ng in ngrams_txt_sorted]
print txt_ng
max_out_of_place = len(txt_ng)
for lang, ngram_stat in lang_stats.iteritems():
    lang_ng = [ng[0] for ng in lang_stats]
    doc_dist = 0
    for n in txt_ng:
        try:
            txt_ng_index = txt_ng.index(n)
            lang_ng_index = lang_ng.index(n)
        except ValueError:
            lang_ng_index = max_out_of_place
        doc_dist += abs(lang_ng_index - txt_ng_index)
    lang_ratios.update({lang:doc_dist})
for i in lang_ratios.iteritems():
    print i
predicted_lang = min(lang_ratios, key=lang_ratios.get)
print "The language is",predicted_lang
It outputs 'English' every time I execute it. The computed distances are always the same for all the languages. I am not able to figure out the logical error in the above code. Kindly help me.
Comparing to the Cavnar & Trenkle code, it looks like
ngram = ' '.join(t)
should be
ngram = ''.join(t)
(without the space)
I bet this is what's throwing off your stats.
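To make the difference concrete, here is what the two joins do to one character trigram (a quick illustration, not part of the original script):
t = ('t', 'h', 'e')
print(' '.join(t))   # 't h e' -- characters separated by spaces
print(''.join(t))    # 'the'   -- the contiguous n-gram string the profile files presumably expect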
I am trying to use the code from this link... see example 6.
So this is the code:
import json
import nltk
import numpy
BLOG_DATA = "resources/ch05-webpages/feed.json"
N = 100 # Number of words to consider
CLUSTER_THRESHOLD = 5 # Distance between words to consider
TOP_SENTENCES = 5 # Number of sentences to return for a "top n" summary
# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn
def _score_sentences(sentences, important_words):
scores = []
sentence_idx = -1
for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:
sentence_idx += 1
word_idx = []
# For each word in the word list...
for w in important_words:
try:
# Compute an index for where any important words occur in the sentence.
word_idx.append(s.index(w))
except ValueError, e: # w not in this particular sentence
pass
word_idx.sort()
# It is possible that some sentences may not contain any important words at all.
if len(word_idx)== 0: continue
# Using the word index, compute clusters by using a max distance threshold
# for any two consecutive words.
clusters = []
cluster = [word_idx[0]]
i = 1
while i < len(word_idx):
if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
cluster.append(word_idx[i])
else:
clusters.append(cluster[:])
cluster = [word_idx[i]]
i += 1
clusters.append(cluster)
# Score each cluster. The max score for any given cluster is the score
# for the sentence.
max_cluster_score = 0
for c in clusters:
significant_words_in_cluster = len(c)
total_words_in_cluster = c[-1] - c[0] + 1
score = 1.0 * significant_words_in_cluster \
* significant_words_in_cluster / total_words_in_cluster
if score > max_cluster_score:
max_cluster_score = score
scores.append((sentence_idx, score))
return scores
def summarize(txt):
sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
normalized_sentences = [s.lower() for s in sentences]
words = [w.lower() for sentence in normalized_sentences for w in
nltk.tokenize.word_tokenize(sentence)]
fdist = nltk.FreqDist(words)
top_n_words = [w[0] for w in fdist.items()
if w[0] not in nltk.corpus.stopwords.words('english')][:N]
scored_sentences = _score_sentences(normalized_sentences, top_n_words)
# Summarization Approach 1:
# Filter out nonsignificant sentences by using the average score plus a
# fraction of the std dev as a filter
avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
if score > avg + 0.5 * std]
# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
# Decorate the post object with summaries
return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])
blog_data = json.loads(open(BLOG_DATA).read())
for post in blog_data:
post.update(summarize(post['content']))
print post['title']
print '=' * len(post['title'])
print
print 'Top N Summary'
print '-------------'
print ' '.join(post['top_n_summary'])
print
print 'Mean Scored Summary'
print '-------------------'
print ' '.join(post['mean_scored_summary'])
print
But when I run it, it says:
Traceback (most recent call last):
File "/home/jetonp/PycharmProjects/Summeriza/blogs_and_nlp__summarize.py", line 117, in <module>
post.update(summarize(post['content']))
AttributeError: 'unicode' object has no attribute 'update'
Process finished with exit code 1
What is causing this error and how do I fix it?
I figured it out. In the example you are working from, the summarize method returns a dictionary. Your summarize method does not return anything because of improper indentation: part of it uses just three spaces, and part of it has no indentation at all. The standard indentation in Python is four spaces. summarize should look like this:
def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]
    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
    top_n_words = [w[0] for w in fdist.items()
                   if w[0] not in nltk.corpus.stopwords.words('english')][:N]
    scored_sentences = _score_sentences(normalized_sentences, top_n_words)
    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter
    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]
    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
    # Decorate the post object with summaries
    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])