I am new to Spark and I am trying to make my PySpark job more efficient; it is very slow when applied to big data. The objective is to take each phrase and its count and turn them into a count of unique words, plus the n most frequent and n least frequent words with their associated counts.
In my first two processing steps, I use a lambda function to multiply the phrases by their counts, and then I use flatMap to turn that into a list of all words.
To illustrate, the first map step with the lambda function, followed by flatMap, turns the following input into a flat list of each word that can then be counted:
good morning \t 1
hello \t 3
goodbye \t 1
turns into [good, morning, hello, hello, hello, goodbye]. I am also unsure whether flatMap is even the right approach here, or whether I should use a reduceByKey approach instead.
Any feedback on how I can significantly optimize the performance of this job? Thank you! The PySpark function is:
def top_bottom_words(resilent_dd, n):
    """
    outputs total unique words and n top and bottom words with count for each of the words
    input:
        resilent_dd, n
    where:
        resilent_dd: resilient distributed dataset with form of (phrase) \t (count)
        n: number of top and bottom words to output
    output:
        total, n_most_freqent, n_least_freqent
    where:
        total: count of unique words in the vocabulary
        n_most_freqent: n top words of highest counts with count for each of the words
        n_least_freqent: n bottom words of lowest counts with count for each of the words
    """
    total, n_most_freqent, n_least_freqent = None, None, None
    resilent_dd.persist()
    # repeat each phrase as many times as its count, joined back into one string
    phrases = resilent_dd.map(
        lambda line: " ".join(
            [line.split("\t")[0] for i in range(int(line.split("\t")[1]))]
        )
    )
    # split every repeated phrase into individual lowercase words
    phrases = phrases.flatMap(lambda line: line.lower().split(" "))
    # countByValue() collects all the counts to the driver
    word_counts = phrases.countByValue()
    total = len(word_counts)
    n_most_freqent = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:n]
    n_least_freqent = sorted(word_counts.items(), key=lambda x: x[1])[:n]
    resilent_dd.unpersist()
    return total, n_most_freqent, n_least_freqent
In case anyone has a similar question, this approach works well and is fast:
total, n_most_freqent, n_least_freqent = None, None, None
phrases = resilent_dd.map(lambda line: (line.split("\t")[0].lower().split(" "), int(line.split("\t")[1]))) \
                     .flatMap(lambda x: [(word, x[1]) for word in x[0]]) \
                     .reduceByKey(lambda a, b: a + b) \
                     .sortBy(lambda x: x[1], ascending=False)
total = phrases.count()
n_most_freqent = phrases.take(n)
n_least_freqent = phrases.takeOrdered(n, key=lambda x: x[1])
return total, n_most_freqent, n_least_freqent
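One further tweak worth considering (just a sketch, not benchmarked here): count(), take(), and takeOrdered() are three separate actions, so each one re-runs the whole lineage unless the reduced RDD is cached, and the global sortBy is not strictly needed when takeOrdered can pull the top n directly:

phrases = resilent_dd.map(lambda line: (line.split("\t")[0].lower().split(" "), int(line.split("\t")[1]))) \
                     .flatMap(lambda x: [(word, x[1]) for word in x[0]]) \
                     .reduceByKey(lambda a, b: a + b)
phrases.persist()                                             # cache once, reuse for all three actions
total = phrases.count()
n_most_freqent = phrases.takeOrdered(n, key=lambda x: -x[1])  # top n without a full sort
n_least_freqent = phrases.takeOrdered(n, key=lambda x: x[1])  # bottom n
phrases.unpersist()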
Background
I'm working on the HackerRank problem Word Order. The task is to:
Read the following input from stdin:
4
bcdef
abcdefg
bcde
bcdef
Produce output that reflects:
The number of unique words on the first line
The count of occurrences for each unique word
Example:
3 # Number of unique words
2 1 1 # count of occurring words, 'bcdef' appears twice = 2
Problem
I've coded two solutions; the second one passes the initial tests but fails due to exceeding the time limit. The first one would also work, except that I was unnecessarily sorting the output (and the time limit issue would occur there too).
Notes
In the first solution I was unnecessarily sorting values; this is fixed in the second solution.
I'm keen to make better (proper) use of standard Python data structures and list/dictionary comprehensions. I would particularly like a solution that doesn't import any additional modules, with the exception of import os if needed.
Code
import os

def word_order(words):
    # Output no of distinct words
    distinct_words = set(words)
    n_distinct_words = len(distinct_words)
    print(str(n_distinct_words))
    # Count occurrences of each word
    occurrences = []
    for distinct_word in distinct_words:
        n_word_appearances = 0
        for word in words:
            if word == distinct_word:
                n_word_appearances += 1
        occurrences.append(n_word_appearances)
    occurrences.sort(reverse=True)
    print(*occurrences, sep=' ')
    # for o in occurrences:
    #     print(o, end=' ')

def word_order_two(words):
    '''
    Run through all words and only count multiple occurrences, do the maths
    to calculate unique words, etc. Attempt to construct a dictionary to make
    the operation more memory efficient.
    '''
    # Construct a count of word occurrences
    dictionary_words = {word: words.count(word) for word in words}
    # Unique words are equivalent to dictionary keys
    unique_words = len(dictionary_words)
    # Obtain sorted dictionary values
    # sorted_values = sorted(dictionary_words.values(), reverse=True)
    result_values = " ".join(str(value) for value in dictionary_words.values())
    # Output results
    print(str(unique_words))
    print(result_values)
    return 0

if __name__ == '__main__':
    q = int(input().strip())
    inputs = []
    for q_itr in range(q):
        s = input()
        inputs.append(s)
    # word_order(words=inputs)
    word_order_two(words=inputs)
Those nested loops are very bad performance-wise (they make your algorithm quadratic) and quite unnecessary. You can get all counts in a single iteration. You could use a plain dict or the dedicated collections.Counter:
from collections import Counter

def word_order(words):
    c = Counter(words)
    print(len(c))
    print(" ".join(str(v) for _, v in c.most_common()))
The "manual" implementation that shows the workings of the Counter and its methods:
def word_order(words):
    c = {}
    for word in words:
        c[word] = c.get(word, 0) + 1
    print(len(c))
    print(" ".join(str(v) for v in sorted(c.values(), reverse=True)))
    # print(" ".join(map(str, sorted(c.values(), reverse=True))))
Without any imports, you could count unique elements by
len(set(words))
and count their occurrences by
def counter(words):
    count = dict()
    for word in words:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
    return count.values()
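Putting those two pieces together for this problem might look like the following minimal sketch (standard library only; it relies on dicts preserving insertion order in Python 3.7+, so the counts print in order of first appearance, matching the "2 1 1" example):

words = ['bcdef', 'abcdefg', 'bcde', 'bcdef']   # the sample input
print(len(set(words)))             # number of distinct words -> 3
print(*counter(words), sep=' ')    # per-word counts in first-appearance order -> 2 1 1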
You can use Counter and then print the output like below:
>>> from collections import Counter
>>> def counter_words(words):
...     cnt = Counter(words)
...     print(len(cnt))
...     print(*[str(v) for k, v in cnt.items()], sep=' ')
...
>>> inputs = ['bcdef', 'abcdefg', 'bcde', 'bcdef']
>>> counter_words(inputs)
3
2 1 1
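Note that the order of the printed counts here depends on the Counter's insertion order; if you want them guaranteed highest-first, most_common() (as in the earlier answer) is the safer choice:

>>> print(*[v for _, v in Counter(inputs).most_common()], sep=' ')
2 1 1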
I have two dictionaries. Each dictionary contains words; some words are common to both and some are not. I want to output each common word together with its frequency in the first dictionary, its frequency in the second, and the sum of the two. How can I do that? I also have to find the top 20.
For example my output must be like:
    Common word    freq1    freq2    freqsum
1   print          10       5        15
2   number         2        1        3
3   program        19       20       39
Here is my code:
commonwordsbook1andbook2 = []
for element in finallist1:
    if element in finallist2:
        commonwordsbook1andbook2.append(element)

common1 = {}
for word in commonwordsbook1andbook2:
    if word not in common1:
        common1[word] = 1
    else:
        common1[word] += 1
common1 = sorted(common1.items(), key=lambda x: x[1], reverse=True)  # distinct2

for k, v in wordcount2[:a]:
    print(k, v)
Assuming that the dictionaries have individual frequencies of each word, we can do something simpler. Like...
print("Common Word | Freq-1 | Freq-2 | Freq-Sum")
for i in freq1:
    if i in freq2:
        print(i, freq1[i], freq2[i], freq1[i] + freq2[i])
Since you aren't allowed to use Counter, you can implement the same functionality using dictionaries. Let's define a function to return a dictionary that contains the counts of all words in the given list. Dictionaries have a get() function that gets the value of the given key, while also allowing you to specify a default if the key is not found.
def countwords(lst):
    dct = {}
    for word in lst:
        dct[word] = dct.get(word, 0) + 1
    return dct

count1 = countwords(finallist1)
count2 = countwords(finallist2)

words1 = set(count1.keys())
words2 = set(count2.keys())
count1.keys() will give us all the unique words in finallist1.
We convert both key views to sets and then find their intersection to get the common words.
common_words = words1.intersection(words2)
Now that you know the common words, printing them and their counts should be trivial:
for w in common_words:
print(f"{w}\t{count1[w]}\t{count2[w]}\t{count1[w] + count2[w]}")
Write a function longestWord() which receives a list of words, then returns the longest word ending with "ion".
This is what I have got so far:
def longest(listTest):
    word_len = []
    for n in listTest:
        word_len.append((len(n), n))
    word_len.sort()
    return word_len[-1][1]

print(longest(["ration", "hello", "exclamation"]))
I think you need:
longestWord = max([i for i in listTest if i.endswith("ion")], key=len)
print(longestWord)
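One caveat (not in the original answer): max() raises ValueError if nothing ends with "ion"; on Python 3.4+ you can supply a default to avoid the crash:

longestWord = max((w for w in listTest if w.endswith("ion")), key=len, default=None)
print(longestWord)   # None when no word has the suffix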
In [9]: listTest
Out[9]: ['ration', 'hello', 'exclamation']
In [10]: def longest(lst):
    ...:     return sorted([(len(i), i) for i in lst if i.endswith("ion")], key=lambda x: x[0])[-1][1]
    ...:
In [11]: longest(listTest)
Out[11]: 'exclamation'
The following function accomplishes this with a list comprehension and the max function. You can also pass any suffix you like here.
def longest(words, suffix="ion"):
    # Filter the passed words
    words_with_suffix = [w for w in words if w.endswith(suffix)]
    # Return the longest word in the filtered list
    return max(words_with_suffix, key=len)

list_test = ["ration", "hello", "exclamation"]
print(longest(list_test))
longest = ''
for word in listTest:
    if word.endswith('ion') and len(longest) < len(word):
        longest = word
print(longest)
Since none of the answers so far seem to be able to handle the edge case of two different words having the same length (and ending in "ion"), here is my simple approach:
filteredWords = [word for word in listTest if word.endswith("ion")]
wordLengths = [len(word) for word in filteredWords]
maxLength = max(wordLengths)
longestWordsIndices = [i for i, j in enumerate(wordLengths) if j == maxLength]
print([filteredWords[w] for w in longestWordsIndices])
You can create a new list containing all the strings that end with 'ion', then sort that list to get the longest word ending with 'ion'.
In the given code I have used bubble sort, considering the list to be small. If you have to deal with larger data, then I suggest you use a better sorting algorithm, since bubble sort's performance is O(n^2).
s = ["ration","hello", "test", 'why', 'call', 'ion', "exclamation", 'cation', 'anion']
word = []
for i in s:
if i.endswith('ion'):
word.append(i)
#using bubble sort considering the list is small
for i in range(len(word) - 1):
for j in range(len(word) - (i+1)):
if len(word[j]) < len(word[j+1]):
temp = word[j]
word[j] = word[j+1]
word[j+1] = temp
print(word[0]) #the longest word
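As noted above, bubble sort is O(n^2); for larger inputs the built-in max (or sorted) does the same filtering-plus-selection job in one line, for example:

# same filter as above, then pick the longest match directly
print(max((w for w in s if w.endswith('ion')), key=len))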
Given the following basis:
basis = "Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay."
and the following words:
words = "word, text, bank, tree"
How can I calculate the PMI values of each word in "words" compared to each word in "basis", using a context window of size 5 (that is, two positions before and two after the target word)?
I know how to calculate the PMI, but I don't know how to handle the context window.
I calculate the 'normal' PMI-values as follows:
from math import log

def PMI(ContingencyTable):
    (a, b, c, d, N) = ContingencyTable
    # avoid log(0)
    a += 1
    b += 1
    c += 1
    d += 1
    N += 4
    R_1 = a + b
    C_1 = a + c
    return log(float(a) / (float(R_1) * float(C_1)) * float(N), 2)
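For reference, writing a for the joint count, R_1 = a + b for the row total, C_1 = a + c for the column total, and N for the grand total, the function above computes (after add-one smoothing to avoid log(0)):

PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) = log2( a * N / (R_1 * C_1) )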
I did a little searching on PMI; it looks like heavy-duty packages are out there, "windowing" included.
In PMI, the "mutual" seems to refer to the joint probability of two different words, so you need to firm up that idea with respect to the problem statement.
I took on the smaller problem of just generating the short windowed lists from your problem statement, mostly as an exercise for myself.
def wndw(wrd_l, m_l, pre, post):
    """
    returns a list of all lists of sequential words in input wrd_l
    that are within range -pre and +post of any word in wrd_l that matches
    a word in m_l
    wrd_l = list of words
    m_l = list of words to match on
    pre, post = ints giving range of indices to include in window size
    """
    wndw_l = list()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            wndw_l.append([wrd_l[i + k] for k in range(-pre, post + 1)
                           if 0 <= (i + k) < len(wrd_l)])
    return wndw_l

basis = """Each word of the text is converted as follows: move any
        consonant (or consonant cluster) that appears at the start
        of the word to the end, then append ay."""
words = "word, text, bank, tree"

print(*wndw(basis.split(), [x.strip() for x in words.split(',')], 2, 2),
      sep="\n")
['Each', 'word', 'of', 'the']
['of', 'the', 'text', 'is', 'converted']
['of', 'the', 'word', 'to', 'the']
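To get from those windows to the joint counts that PMI needs, one option (a sketch; cooccurrence_counts is a made-up helper name, not from the original post) is to count (target, context) pairs inside each window and derive the contingency table from the pair totals:

from collections import Counter

def cooccurrence_counts(wrd_l, m_l, pre, post):
    """Count (target, context) pairs inside the same windows wndw() builds."""
    pairs = Counter()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            for k in range(-pre, post + 1):
                j = i + k
                if k != 0 and 0 <= j < len(wrd_l):
                    pairs[(w, wrd_l[j])] += 1
    return pairs

pairs = cooccurrence_counts(basis.split(), [x.strip() for x in words.split(',')], 2, 2)
N = sum(pairs.values())
# For a given pair (x, y): a = pairs[(x, y)],
# R_1 = sum of pairs[(x, c)] over all context words c,
# C_1 = sum of pairs[(t, y)] over all target words t,
# and the tuple for PMI() is (a, R_1 - a, C_1 - a, N - R_1 - C_1 + a, N).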
I followed this tutorial to find the relevant words in my documents. My code:
>>> for i, blob in enumerate(bloblist):
...     print(i + 1)
...     scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
...     sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
...     for word, score in sorted_words[:10]:
...         print("\t{}, score {}".format(word, round(score, 5)))
...
1
k555ld-xx1014h, score 0.19706
fuera, score 0.03111
dentro, score 0.01258
i5, score 0.0051
1tb, score 0.00438
sorprende, score 0.00358
8gb, score 0.0031
asus, score 0.00228
ordenador, score 0.00171
duro, score 0.00157
2
frentes, score 0.07007
write, score 0.05733
acceleration, score 0.05255
aprovechando, score 0.05255
. . .
Here's my problem: I would like to export a data frame with the following information: index, top 10 words (separated by commas), something that I can save as a pandas DataFrame.
Example:
TOPWORDS = pd.DataFrame(topwords.items(), columns=['ID', 'TAGS'])
Thank you all in advance.
Solved!
Here's my solution, perhaps not the best but it works.
tags = {}
for i, blob in enumerate(bloblist):
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    a = ""
    for word, score in sorted_words[:10]:
        a = a + ' ' + word
    tags[i + 1] = a
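If you still want to end up with the pandas data frame from the question (a sketch, assuming pandas is installed; the output file name is just an example):

import pandas as pd

# one row per document: its index and its ten top words joined by commas
TOPWORDS = pd.DataFrame(
    [(doc_id, ", ".join(top.split())) for doc_id, top in tags.items()],
    columns=['ID', 'TAGS'])
TOPWORDS.to_csv('topwords.csv', index=False)   # example output path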
You probably ran into problems with the tuples. Docs:
https://docs.python.org/2/tutorial/datastructures.html
http://www.tutorialspoint.com/python/python_tuples.htm
Here you go!