Exporting relevant words with TF-IDF and TextBlob in Python

I followed this tutorial to find the relevant words in my documents. My code:
>>> for i, blob in enumerate(bloblist):
        print i+1
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words[:10]:
            print("\t{}, score {}".format(word, round(score, 5)))
1
k555ld-xx1014h, score 0.19706
fuera, score 0.03111
dentro, score 0.01258
i5, score 0.0051
1tb, score 0.00438
sorprende, score 0.00358
8gb, score 0.0031
asus, score 0.00228
ordenador, score 0.00171
duro, score 0.00157
2
frentes, score 0.07007
write, score 0.05733
acceleration, score 0.05255
aprovechando, score 0.05255
. . .
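(For context, the tfidf helper called above comes from the tutorial; this is a rough sketch of what those helpers typically look like, not the asker's exact code. blob is a textblob.TextBlob document and bloblist is the list of all documents.)
import math

def tf(word, blob):
    # term frequency: how often `word` appears in this document
    return float(blob.words.count(word)) / len(blob.words)

def n_containing(word, bloblist):
    # number of documents containing `word`
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # inverse document frequency, smoothed with +1 to avoid division by zero
    return math.log(len(bloblist) / (1.0 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)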
Here's my problem: I would like to export a data frame with the following information: index and the top 10 words (separated by commas), something I can save as a pandas DataFrame.
Example:
TOPWORDS = pd.DataFrame(topwords.items(), columns=['ID', 'TAGS'])
Thank you all in advance.

Solved!
Here's my solution, perhaps not the best but it works.
tags = {}
for i, blob in enumerate(bloblist):
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    a = ""
    for word, score in sorted_words[:10]:
        a = a + ' ' + word
    tags[i+1] = a
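From that tags dict, the export asked for above is then a short step; a minimal sketch (assuming pandas is imported as pd, joining the words with commas as requested; the topwords.csv filename is just a placeholder):
import pandas as pd

# Turn the space-separated word strings into comma-separated tags,
# then build the ID/TAGS frame and save it.
tags_csv = {doc_id: ','.join(words.split()) for doc_id, words in tags.items()}
TOPWORDS = pd.DataFrame(list(tags_csv.items()), columns=['ID', 'TAGS'])
TOPWORDS.to_csv('topwords.csv', index=False)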

You probably ran into problems with the tuples.
Docs:
https://docs.python.org/2/tutorial/datastructures.html
http://www.tutorialspoint.com/python/python_tuples.htm
Here you go!

Related

Optimizing pyspark job

I am new to Spark and I am trying to make my pyspark job more efficient; it is very slow when applied to big data. The objective is to take phrases and their counts and turn them into counts of unique words, plus the n most frequent and least frequent words with their associated counts.
In my first two processing steps, I use a lambda function to multiply the phrases by their counts, and then I use flatMap to turn that into a list of all words.
To illustrate, the first map step with the lambda function, followed by flatMap, turns the following input into a flat list of words that can then be counted:
good morning \t 1
hello \t 3
goodbye \t 1
turns into [good, morning, hello, hello, hello, goodbye]. I am also unsure whether flatMap is the right approach here, or whether I should use reduceByKey instead.
Any feedback on how I can significantly optimize the performance of this job? Thank you! The pyspark function is:
def top_bottom_words(resilent_dd, n):
    """
    outputs total unique words and n top and bottom words with count for each of the words
    input:
        resilent_dd, n
    where:
        resilent_dd: resilient distributed dataset with form of (phrase) \t (count)
        n: number of top and bottom words to output
    output:
        total, n_most_freqent, n_least_freqent
    where:
        total: count of unique words in the vocabulary
        n_most_freqent: n top words of highest counts with count for each of the words
        n_least_freqent: n bottom words of lowest counts with count for each of the words
    """
    total, top_n, bottom_n = None, None, None
    resilent_dd.persist()
    phrases = resilent_dd.map(
        lambda line: " ".join(
            [line.split("\t")[0] for i in range(int(line.split("\t")[1]))]
        )
    )
    phrases = phrases.flatMap(lambda line: line.lower().split(" "))
    word_counts = phrases.countByValue()
    total = len(word_counts)
    n_most_freqent = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:n]
    n_least_freqent = sorted(word_counts.items(), key=lambda x: x[1])[:n]
    resilent_dd.unpersist()
    return total, n_most_freqent, n_least_freqent
In case anyone has a similar question, this approach works well and is fast
def top_bottom_words(resilent_dd, n):
    total, n_most_freqent, n_least_freqent = None, None, None
    phrases = resilent_dd.map(lambda line: (line.split("\t")[0].lower().split(" "), int(line.split("\t")[1]))) \
        .flatMap(lambda x: [(word, x[1]) for word in x[0]]) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda x: x[1], ascending=False)
    total = phrases.count()
    n_most_freqent = phrases.take(n)
    n_least_freqent = phrases.takeOrdered(n, key=lambda x: x[1])
    return total, n_most_freqent, n_least_freqent
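For reference, a quick way to sanity-check either version on the sample input above; a sketch that assumes an existing SparkContext named sc and uses n=2:
rdd = sc.parallelize(["good morning\t1", "hello\t3", "goodbye\t1"])
total, most, least = top_bottom_words(rdd, 2)
print(total)  # 4 unique words: good, morning, hello, goodbye
print(most)   # [('hello', 3), ...] -- 'hello' has the highest count
print(least)  # two of the words that appear only once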

Counting/Print Unique Words in Directory up to x instances

I am attempting to take all unique words in tale4653, count their instances, and then read off the top 100 most-mentioned unique words.
My struggle is sorting the dictionary so that I can print both each unique word and its respective count.
My code thus far:
import string

fhand = open('tale4653.txt')
counts = dict()
for line in fhand:
    line = line.translate(None, string.punctuation)
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
fhand.close()

rangedValue = sorted(counts.values(), reverse=True)
i = 0
while i < 100:
    print rangedValue[i]
    i = i + 1
Thank you community,
You lose the word (the key in your dictionary) when you do counts.values().
you can do this instead
rangedValue = sorted(counts.items(), reverse=True, key=lambda x: x[1])
for word, count in rangedValue[:100]:  # top 100, as in the original goal
    print word + ': ' + str(count)
When you do counts.items(), it returns a list of (key, value) tuples like this:
[('the', 1), ('end', 2)]
and when we sort it, we tell it to use the second value as the "key" to sort by.
DorElias is correct about the initial problem: you need to use counts.items() with key=lambda x: x[1] or key=operator.itemgetter(1), the latter of which would be faster.
However, I'd like to show how I'd do it, completely avoiding sorted in your code. collections.Counter is an optimal data structure for this task. I also prefer to wrap the logic of reading words from a file in a generator:
import string
from collections import Counter

def read_words(filename):
    with open(filename) as fhand:
        for line in fhand:
            line = line.translate(None, string.punctuation)
            line = line.lower()
            words = line.split()
            for word in words:  # in Python 3 one can use `yield from words`
                yield word

counts = Counter(read_words('tale4653.txt'))
for word, count in counts.most_common(100):
    print('{}: {}'.format(word, count))
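And for the operator.itemgetter(1) variant mentioned above, a minimal sketch against the counts dict from the first snippet:
import operator

# Same sort as before, but itemgetter(1) replaces the lambda for the sort key.
rangedValue = sorted(counts.items(), reverse=True, key=operator.itemgetter(1))
for word, count in rangedValue[:100]:
    print('{}: {}'.format(word, count))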

how can I divide the frequency of bigram pair by unigram word?

Below is my code.
from __future__ import division
import nltk
import re

f = open('C:/Python27/brown_A1_half.txt', 'rU')
w = open('C:/Python27/brown_A1_half_Out.txt', 'w')

# read the whole file using read()
filecontents = f.read()

from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(filecontents)

for sentence in sent_tokenize_list:
    sentence = "Start " + sentence + " End"
    tokens = sentence.split()
    bigrams = tuple(nltk.bigrams(tokens))
    bigrams_frequency = nltk.FreqDist(bigrams)
    for k, v in bigrams_frequency.items():
        print k, v
The printed result is "(bigram), its frequency". Here is what I want:
For each bigram pair, divide the bigram frequency by the frequency of the first word in the pair. (For example, if there is a bigram ('red', 'apple') and its frequency is 3, I want to divide it by the frequency of 'red'.)
This is for obtaining the MLE probability, that is, MLE prob = count(w1, w2) / count(w1). Please help.
You can add the following in the for loop (after print k, v):
number_unigrams = tokens.count(k[0])
prob = v / number_unigrams
That should give you the MLE prob for each bigram.
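A small illustration of the same division with an explicit unigram FreqDist instead of tokens.count() (the tokens here are made up for the example, not from the question's corpus):
from __future__ import division
import nltk

tokens = "Start the red apple is a red fruit End".split()
bigram_frequency = nltk.FreqDist(nltk.bigrams(tokens))
unigram_frequency = nltk.FreqDist(tokens)

for (w1, w2), v in bigram_frequency.items():
    mle = v / unigram_frequency[w1]  # count(w1, w2) / count(w1)
    print('{} {}: {}'.format(w1, w2, mle))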

Finding the average number

For the task I need the average score of each person, so if Dan scored 5 in one line and 7 in another he would be displayed as having an average of 6. The averages are what I need ordered and displayed.
So I have to sort the people from the highest average score to the lowest and display the sorted result in Python. One of the files I have to sort looks like this:
Bob:0
Bob:1
Jane:9
Drake:8
Dan:4
Josh:1
Dan:5
How can I do this in Python?
d = {}
with open('in.txt') as f:
    data = f.readlines()
for x in data:
    x = x.strip()
    if not x:
        continue
    name = x.split(':')[0].strip()
    score = int(x.split(':')[-1].split('/')[0].strip())
    if name not in d:
        d[name] = {}
        d[name]['score'] = 0
        d[name]['count'] = 0
    d[name]['count'] += 1
    # running average: (old_average * (n - 1) + new_score) / n
    n = d[name]['count']
    d[name]['score'] = (d[name]['score'] * (n - 1) + score) / float(n)
ds = sorted(d.keys(), key=lambda k: d[k]['score'], reverse=True)
for x in ds:
    print('{0}: {1}'.format(x, d[x]['score']))
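An alternative sketch (not from the original answer), assuming plain name:score lines like the sample: accumulate sums and counts first, then average once at the end, which avoids the running-average bookkeeping.
from collections import defaultdict

totals = defaultdict(int)
counts = defaultdict(int)
with open('in.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        name, score = line.split(':')
        totals[name] += int(score)
        counts[name] += 1

averages = {name: totals[name] / float(counts[name]) for name in totals}
for name in sorted(averages, key=averages.get, reverse=True):
    print('{0}: {1}'.format(name, averages[name]))

# With the sample file above this prints:
# Jane: 9.0, Drake: 8.0, Dan: 4.5, Josh: 1.0, Bob: 0.5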

Object has no attribute 'update'

I am trying to use the code that is at this link; see example 6.
So this is the code:
import json
import nltk
import numpy

BLOG_DATA = "resources/ch05-webpages/feed.json"

N = 100  # Number of words to consider
CLUSTER_THRESHOLD = 5  # Distance between words to consider
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn
def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = -1
    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:
        sentence_idx += 1
        word_idx = []
        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.
                word_idx.append(s.index(w))
            except ValueError, e:  # w not in this particular sentence
                pass
        word_idx.sort()
        # It is possible that some sentences may not contain any important words at all.
        if len(word_idx) == 0:
            continue
        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.
        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)
        # Score each cluster. The max score for any given cluster is the score
        # for the sentence.
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster \
                * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        scores.append((sentence_idx, score))
    return scores

def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]
    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
    top_n_words = [w[0] for w in fdist.items()
                   if w[0] not in nltk.corpus.stopwords.words('english')][:N]
    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter
    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries
    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:
    post.update(summarize(post['content']))
    print post['title']
    print '=' * len(post['title'])
    print
    print 'Top N Summary'
    print '-------------'
    print ' '.join(post['top_n_summary'])
    print
    print 'Mean Scored Summary'
    print '-------------------'
    print ' '.join(post['mean_scored_summary'])
    print
But when I run it, it says:
Traceback (most recent call last):
File "/home/jetonp/PycharmProjects/Summeriza/blogs_and_nlp__summarize.py", line 117, in <module>
post.update(summarize(post['content']))
AttributeError: 'unicode' object has no attribute 'update'
Process finished with exit code 1
What is causing this error and how do I fix it?
I figured it out. In the example that you are working from, the summarize method returns a dictionary. Your summarize method does not return anything, due to improper indentation: for part of it there were only three spaces, and for part of it there were none. The standard indentation in Python is four spaces. summarize should look like this:
def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]
    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
    top_n_words = [w[0] for w in fdist.items()
                   if w[0] not in nltk.corpus.stopwords.words('english')][:N]
    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter
    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries
    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])
