Amazon MapReduce with my own reducer for streaming - python

I wrote a simple map and reduce program in Python to count the number of words in each sentence and then group sentences with the same word count. For example, suppose sentence 1 has 10 words, sentence 2 has 17 words and sentence 3 has 10 words. The final result will be:
10 \t 2
17 \t 1
The mapper function is:
import sys
import re
pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
for line in sys.stdin:
    word = str(len(line.split())) # calculate how many words for each line
    count = str(1)
    print "%s\t%s" % (word, count)
The reducer function is:
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    try:
        count = int(count)
        word = int(word)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print "%s\t%s" % (current_word, current_count)
        current_count = count
        current_word = word
if current_word == word:
    print "%s\t%s" % (current_word, current_count)
I tested it on my local machine with the first 200 lines of the file:
head -n 200 sentences.txt | python mapper.py | sort | python reducer.py
The results are correct. Then I ran it on Amazon's MapReduce streaming service, and it failed at the reducer step. So I changed the print in the mapper to:
print "LongValueSum:" + word + "\t" + "1"
This fits the default aggregate reducer of the MapReduce streaming service, so I no longer need reducer.py. I got the final results for the big file sentences.txt, but I don't know why my own reducer.py failed. Thank you!

Got it! A "stupid" mistake. When I tested locally, I ran the scripts with something like python mapper.py. For MapReduce, however, the script has to be executable on its own. So just add
#!/usr/bin/env python
at the beginning of the script.
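For reference, here is a minimal sketch of the same reducer written for Python 3, with the shebang on the first line. It assumes that every input line has the form key, tab, count and that the lines arrive grouped by key (as they do after `sort` in the local test). The function name `reduce_counts` is made up for illustration and is not part of any Hadoop API.

```python
#!/usr/bin/env python3
"""Streaming reducer sketch: sums the counts for each key in sorted input."""
import sys
from itertools import groupby


def reduce_counts(lines):
    """Yield 'key<TAB>total' for each run of identical keys.

    Assumes well-formed 'key<TAB>count' lines that are already grouped by key.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (key, total)


# Hypothetical sample input; a real streaming job would pass sys.stdin instead.
sample = ["10\t1\n", "10\t1\n", "17\t1\n"]
for out in reduce_counts(sample):
    sys.stdout.write(out + "\n")
```

The shebang only takes effect when the file is run directly, so the script would also need the executable bit set (for example with `chmod +x reducer.py`).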

Related

Can't figure out how to run this Python script

Can someone explain to me (I'm not competent in programming) how to use this script correctly (link: https://github.com/dumbmatter/find-repeated-words)? Basically, it should take a text file as input and output an HTML file in which words that are used repeatedly close together are highlighted. But when I run it (I installed Pyzo) I get the message "SyntaxError: invalid syntax". I have no idea what Python is talking about; I can only assume the problem concerns the input file.
CODE:
#!/usr/bin/env python
import sys from string import punctuation from operator import itemgetter
# Check command line inputs if len(sys.argv) == 1:
print 'Pass the input text file as the first argument.'
sys.exit() elif len(sys.argv) == 2:
infile = sys.argv[1]
outfile = '%s.html' % (infile.split('.')[0],) else:
infile = sys.argv[1]
outfile = sys.argv[2]
print infile, outfile
N = 10 words = {} # Dict of word frequencies pos = {} # Dict of word positions scores = [] # List of word repeatedness scores articles = ['the', 'a', 'of', 'and', 'in', 'et', 'al'] # Common articles to ignore
# Build lists
words_gen = (word.strip(punctuation).lower() for line in open(infile)
for word in line.split())
i = 0 for word in words_gen:
words[word] = words.get(word, 0) + 1
# Build a list of word positions
if words[word] == 1:
pos[word] = [i]
else:
pos[word].append(i)
i += 1
# Calculate scores
words_gen = (word.strip(punctuation).lower() for line in open(infile)
for word in line.split())
i = 0 for word in words_gen:
scores.append(0)
# scores[i] = -1 + sum([pow(2, -abs(d-i)) for d in pos[word]]) # The -1 accounts for the 2^0 for self words
if word not in articles and len(word) > 2:
for d in pos[word]:
if d != i and abs(d-i) < 50:
scores[i] += 1.0/abs(d-i)
i += 1
scores = [score*1.0/max(scores) for score in scores] # Scale from 0 to 1
# Write colored output
f = open(outfile, 'w'); i = 0 for line in open(infile):
for word in line.split():
f.write('<span style="background: rgb(%i, 255, 255)">%s</span> ' % ((1-scores[i])*255, word))
i += 1
f.write('<br /><br />') f.close()
print 'Output saved to %s' % (outfile,)
Python is very sensitive to the formatting of the code: you cannot break or indent lines in places where Python does not expect it. Just looking at the first line:
import sys from string import punctuation from operator import itemgetter
should be split into 3 lines:
import sys
from string import punctuation
from operator import itemgetter
There are more errors like this in the code you pasted. I have downloaded the original code from the link, and it works fine.
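As a rough illustration of the error described above, the sketch below uses the built-in `compile()` to check whether a piece of source text parses. The helper name `has_valid_syntax` is hypothetical. The check runs under Python 3, so Python 2 print statements like those in the original script would be rejected as well.

```python
def has_valid_syntax(source):
    """Return True if `source` parses as Python code, False otherwise."""
    try:
        compile(source, "<pasted code>", "exec")
    except SyntaxError:
        return False
    return True


# The three imports run together on one line, as in the pasted code.
mangled = "import sys from string import punctuation from operator import itemgetter"
# The same imports, one statement per line.
fixed = "import sys\nfrom string import punctuation\nfrom operator import itemgetter\n"

print(has_valid_syntax(mangled))
print(has_valid_syntax(fixed))
```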

Counting several instances of the same word from a text file

I'm a complete beginner. I searched a lot of threads but couldn't find a solution that fits my case.
I have a text file, python_example.txt, which contains some words. On line four, the word hello appears twice in a row, like "hello hello".
My code is supposed to find the word the user inputs and count how many times it appears. It works, but, as I said, not if the same word appears multiple times on the same line. There are 2 hellos on line 4 and one on line 13, but it only finds a total of 2 hellos. Any fixes? Thanks.
user_input = input("Type in the word you are searching for: ")
word_count = 0
line_count = 0
with open ("python_example.txt", "r+") as f:
    for line in f:
        line_count += 1
        if user_input in line:
            word_count += 1
            print("found " + user_input + " on line " + str(line_count))
        else:
            print ("nothing on line " + str(line_count))
print ("\nfound a total of " + str(word_count) + " words containing " + "'" + user_input + "'")
You can use str.count:
word_count += line.count(user_input)
instead of:
word_count += 1
It counts every occurrence of user_input in the current line of the file.
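A small sketch of the difference, using a made-up sample line: the `in` operator only reports whether the word is present, while `str.count` returns the number of non-overlapping occurrences. Note that both match substrings, so "hello" would also be found inside a longer word such as "othello".

```python
line = "hello hello world\n"
word = "hello"

# `in` only answers yes/no, so it can add at most 1 per line.
found = word in line

# str.count returns how many non-overlapping times the substring occurs.
occurrences = line.count(word)

print(found, occurrences)
```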
The issue is with these two lines:
if user_input in line:
    word_count += 1
You increase the count by 1 if the input appears on the line, regardless of whether it appears more than once.
This should do the job:
user_input = input("Type in the word you are searching for: ")
word_count = 0
with open("python_example.txt") as f:
    for line_num, line in enumerate(f, start=1):
        line_inp_count = line.count(user_input)
        if line_inp_count:
            word_count += line_inp_count
            print(f"input {user_input} appears {line_inp_count} time(s) on line {line_num}")
        else:
            print(f"nothing on line {line_num}")
print(f"the input appeared a total of {word_count} times in {line_num} lines.")
Let me know if you have any questions :)
One option is to use a library to parse the words in your text file rather than iterating one line at a time. There are several classes in nltk.tokenize which are easy to use.
import nltk.tokenize.regexp
def count_word_in_file(filepath, word):
    """Give the number of times word appears in text at filepath."""
    tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()
    with open(filepath) as f:
        tokens = tokenizer.tokenize(f.read())
    return tokens.count(word)
This handles awkward cases like the substring 'hell' appearing in 'hello' as mentioned in a comment, and is also a route towards case-insensitive matching, stemming, and other refinements.
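If installing nltk is not an option, whole-word matching can also be sketched with the standard library's `re` module. This is a swapped-in alternative, not the tokenizer approach above: it relies on `\b` word boundaries, so it assumes the search word consists only of letters, digits or underscores. The function name and its `ignore_case` parameter are hypothetical.

```python
import re


def count_whole_word(text, word, ignore_case=False):
    """Count occurrences of `word` in `text` that are not part of a longer word."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape guards against regex metacharacters in the user's input.
    pattern = r"\b%s\b" % re.escape(word)
    return len(re.findall(pattern, text, flags))


text = "hello hello\nwhat the hell\nHello again"
print(count_whole_word(text, "hell"))
print(count_whole_word(text, "hello"))
print(count_whole_word(text, "hello", ignore_case=True))
```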

Only print specific amount of Counter items, with decent formatting

I'm trying to print the top N most frequently used words in a text file. So far, I have the file reading and the counter working; I just can't figure out how to print the number of items I want in a pretty way. Here is my code.
import re
from collections import Counter
def wordcount(user):
    """
    Docstring for word count.
    """
    file=input("Enter full file name w/ extension: ")
    num=int(input("Enter how many words you want displayed: "))
    with open(file) as f:
        text = f.read()
    words = re.findall(r'\w+', text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    char, n = word_counts.most_common(num)[0]
    print ("WORD: %s \nOCCURENCE: %d " % (char, n) + '\n')
Basically, I just want a loop of some sort that prints the following.
For instance, with num=3
it will print the 3 most frequently used words and their counts:
Word: blah Occurrence: 3
Word: bloo Occurrence: 2
Word: blee Occurrence: 1
I would iterate over the most common items as follows:
most_common = word_counts.most_common(num) # removed the [0] since we're not looking only at the first item!
for item in most_common:
    print("WORD: {} OCCURENCE: {}".format(item[0], item[1]))
Two comments:
1. Use format() to format strings instead of % - you'll thank me later for this advice!
2. This way you'll be able to iterate any number of "top N" results without hardcoding "3" into your code.
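Putting the pieces together, a self-contained sketch might look like the following. It works on a string instead of a file so it can be run directly; the function names `top_words` and `format_report` and the sample text are made up for illustration, and the column alignment is just one possible notion of "pretty".

```python
import re
from collections import Counter


def top_words(text, num):
    """Return the `num` most common upper-cased words as (word, count) pairs."""
    words = re.findall(r"\w+", text)
    return Counter(word.upper() for word in words).most_common(num)


def format_report(pairs):
    """Format (word, count) pairs as lines with the words padded to equal width."""
    width = max((len(word) for word, _ in pairs), default=0)
    return ["WORD: {:<{w}}  OCCURRENCE: {}".format(word, count, w=width)
            for word, count in pairs]


sample = "blah bloo blah blee bloo blah"
for line in format_report(top_words(sample, 3)):
    print(line)
```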
Save the most common elements and use a loop.
common = word_counts.most_common(num)
for word, count in common:
    print("WORD: %s \nOCCURENCE: %d \n" % (word, count))

Variable to control how many lines print Python

I'm trying to figure out how to use a variable to control the number of lines a script prints. I want to use the output variable and print only the number of lines the user requests. Any help would be greatly appreciated.
import sys, os
print ""
print "Running Script..."
print ""
print "This program analyzes word frequency in a file and"
print "prints a report on the n most frequent words."
print ""
filename = raw_input("File to analyze? ")
if os.path.isfile(filename):
    print "The file", filename, "exists!"
else:
    print "The file", filename, "doesn't exist!"
    sys.exit()
print ""
output = raw_input("Output analysis of how many words? ")
readfile = open(filename, 'r+')
words = readfile.read().split()
wordcount = {}
for word in words:
    if word in wordcount:
        wordcount[word] += 1
    else:
        wordcount[word] = 1
sortbyfreq = sorted(wordcount,key=wordcount.get,reverse=True)
for word in sortbyfreq:
    print "%-20s %10d" % (word, wordcount[word])
Simply keep a counter in your final loop that tracks the number of iterations, and break out of the loop once the limit has been reached.
limit = int(output) # raw_input returns a string, so convert it to a number
counter = 0
for word in sortbyfreq:
    print "%-20s %10d" % (word, wordcount[word])
    counter += 1
    if counter >= limit:
        break
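An alternative to a manual counter is to slice the sorted result, which also copes with a limit larger than the number of distinct words. The sketch below is written for Python 3; the function name `most_frequent` and the sample sentence are assumptions for illustration.

```python
def most_frequent(words, limit):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    wordcount = {}
    for word in words:
        wordcount[word] = wordcount.get(word, 0) + 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(wordcount.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


words = "the cat and the hat and the bat".split()
for word, count in most_frequent(words, 2):
    print("%-20s %10d" % (word, count))
```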
Plain dictionaries are unordered in Python 2, so you have to sort the keys yourself before printing them by frequency.
A collections.Counter handles both the counting and the ranking for you:
from collections import Counter
sortbyfreq = Counter(words) # Instead of the wordcount dictionary + for loop.
You can then print as many of the most common elements as the user asks for:
n = int(raw_input('How many?: '))
for item, count in sortbyfreq.most_common(n):
    print "%-20s %10d" % (item, count)

Defining words as 2 letters or more in python 2.6

I have a Python script that I am writing for a class assignment. It calculates the top 10 most frequent words in a text document and displays the words and their frequencies. I was able to get this part working just fine, but the assignment says a word is defined as 2 letters or more. For some reason I cannot seem to define a word as 2 letters or more; when I run the script, nothing happens.
# Most Frequent Words:
from string import punctuation
from collections import defaultdict
def sort_words(x, y):
    return cmp(x[1], y[1]) or cmp(y[0], x[0])
number = 10
words = {}
words_gen = (word.strip(punctuation).lower() for line in open("charactermask.txt")
             for word in line.split())
words = defaultdict(int)
for word in words_gen:
    words[word] +=1
letters = len(word)
while letters >= 2:
    top_words = sorted(words.iteritems(),
                       key=lambda(word, count): (-count, word))[:number]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
One problem with your script is the loop
while letters >= 2:
    top_words = sorted(words.iteritems(),
                       key=lambda(word, count): (-count, word))[:number]
You are not looping through the words here. Nothing inside the loop changes letters, so once the condition is true the loop runs forever and the script never reaches the print statements. You need to change the script so that this part actually iterates over all of the words. (Also, you will probably want to change while to if because you only need that code to execute once per word.)
I would refactor your code and use a collections.defaultdict (a collections.Counter would be neater, but it requires Python 2.7 or later):
import collections
import string
with open("charactermask.txt") as f:
    words = [x.strip(string.punctuation).lower() for x in f.read().split()]
counter = collections.defaultdict(int)
for word in words:
    # Only count words of 2 letters or more
    if len(word) >= 2:
        counter[word] += 1
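For comparison, here is a sketch of the whole task in Python 3, where `collections.Counter` is available. It takes the text as a string rather than reading charactermask.txt, and the function name `top_long_words`, its parameters and the sample sentence are assumptions made for illustration.

```python
import string
from collections import Counter


def top_long_words(text, number=10, min_length=2):
    """Return the `number` most frequent words with at least `min_length` letters."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    counts = Counter(w for w in words if len(w) >= min_length)
    # Sort by descending count, then alphabetically, as the question's key function does.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:number]


sample = "A cat, a hat; I saw the cat. The cat sat!"
for word, frequency in top_long_words(sample, number=3):
    print("%s: %d" % (word, frequency))
```

Filtering before counting keeps one-letter words such as "a" and "I" out of the ranking entirely, instead of counting them and discarding them later.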
