I'm trying to figure out how to use a variable to control the number of lines a script prints. I want to use the output variable and print only the number of lines the user requests. Any help would be greatly appreciated.
import sys, os
print ""
print "Running Script..."
print ""
print "This program analyzes word frequency in a file and"
print "prints a report on the n most frequent words."
print ""
filename = raw_input("File to analyze? ")
if os.path.isfile(filename):
    print "The file", filename, "exists!"
else:
    print "The file", filename, "doesn't exist!"
    sys.exit()
print ""
output = raw_input("Output analysis of how many words? ")
readfile = open(filename, 'r+')
words = readfile.read().split()
wordcount = {}
for word in words:
if word in wordcount:
wordcount[word] += 1
else:
wordcount[word] = 1
sortbyfreq = sorted(wordcount,key=wordcount.get,reverse=True)
for word in sortbyfreq:
    print "%-20s %10d" % (word, wordcount[word])
Simply create a counter in your final loop that tracks how many lines have been printed, and break once it reaches the requested number.
limit = int(output)   # the number of lines the user asked for
counter = 0
for word in sortbyfreq:
    print "%-20s %10d" % (word, wordcount[word])
    counter += 1
    if counter >= limit:
        break
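A small variant on the same idea (not part of the answer above, just an alternative): since sortbyfreq is already a plain sorted list, you could slice it instead of keeping a counter.

limit = int(output)   # the number of lines the user asked for
for word in sortbyfreq[:limit]:
    print "%-20s %10d" % (word, wordcount[word])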
Plain dictionaries don't keep any ordering of their own, so building one by hand and then sorting it by frequency is more work than you need.
Use a collections.Counter instead:
from collections import Counter
sortbyfreq = Counter(words) # Instead of the wordcount dictionary + for loop.
You could then access the user defined most common elements with:
n = int(raw_input('How many?: '))
for item, count in sortbyfreq.most_common(n):
print "%-20s %10d" % (item, count)
I am trying to write code that will accept a filename from the command line and
print out the following properties:
number of lines
number of characters
number of words
number of "the"
number of "a/an"
I keep getting the error message
"argument of type 'int' is not iterable"
for the line if 'the' in words:.
How do I fix this?
import sys
import string
file_name=sys.argv[0]
char= words = lines = theCount = aCount= 0
with open(file_name,'r') as in_file:
    for line in in_file:
        lines += 1
        words += len(line.split())
        char += len(line)
        if 'the' in words:
            theCount += 1
        if 'a' in words:
            a += 1
        if 'an' in words:
            a += 1
print("Filename:", file_name)
print("Number of lines:", lines)
print("Number of characters:", char)
print("Number of 'the'", theCount)
print("Number of a/an:", aCount)
If you are trying to collect the actual words, rather than just the count of them, then perhaps you need to initialize words to an empty list:
words = []
and change
words += len(line.split())
to
words += line.split()
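A rough sketch of how the loop could then look, reusing file_name from your code (counting 'the'/'a'/'an' with list.count is just one option; the next answer shows another approach):

words = []                # the actual words, not just a count
char = lines = theCount = aCount = 0
with open(file_name, 'r') as in_file:
    for line in in_file:
        lines += 1
        char += len(line)
        line_words = line.split()
        words += line_words                     # collect this line's words
        theCount += line_words.count('the')
        aCount += line_words.count('a') + line_words.count('an')

print("Number of words:", len(words))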
There are some errors in your code; read the comments in this snippet:
import sys
#import string #not sure if this is needed

file_name = sys.argv[1]   # sys.argv[0] is the script itself; the filename passed on the command line is argv[1]
char = words = lines = theCount = aCount = 0
with open(file_name, 'r') as in_file:
    for line in in_file:
        lines += 1
        x = line.split()    # use a variable to hold the split words
                            # so that you can search in it
        words += len(x)
        char += len(line)
        if 'the' in x:      # your original code searched in "words", which holds
                            # the *number* of words in the line, and ints are not iterable
            theCount += 1
        if 'a' in x:
            aCount += 1     # your original code incremented "a", which was never
                            # initialized; you had initialized "aCount" instead
        if 'an' in x:
            aCount += 1     # same as above

print("Filename:", file_name)
print("Number of lines:", lines)
print("Number of characters:", char)
print("Number of 'the':", theCount)
print("Number of a/an:", aCount)
https://repl.it/Mnwz/0
Trying to print out the top N most frequently used words in a text file. So far, I have the file handling, the counter and everything else working; I just can't figure out how to print the amount I want in a pretty way. Here is my code.
import re
from collections import Counter
def wordcount(user):
    """
    Docstring for word count.
    """
    file = input("Enter full file name w/ extension: ")
    num = int(input("Enter how many words you want displayed: "))
    with open(file) as f:
        text = f.read()
    words = re.findall(r'\w+', text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    char, n = word_counts.most_common(num)[0]
    print("WORD: %s \nOCCURENCE: %d " % (char, n) + '\n')
Basically, I just want to make a loop of some sort that will print out the following...
For instance, num=3.
So it will print out the 3 most frequently used words and their counts:
WORD: Blah Occurrence: 3
Word: bloo Occurrence: 2
Word: blee Occurrence: 1
I would iterate "most common" as follows:
most_common = word_counts.most_common(num) # removed the [0] since we're not looking only at the first item!
for item in most_common:
print("WORD: {} OCCURENCE: {}".format(item[0], item[1]))
Two comments:
1. Use format() to format strings instead of %; you'll thank me later for this advice (see the quick comparison below)!
2. This way you'll be able to iterate any number of "top N" results without hardcoding "3" into your code.
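For reference, both styles produce the same text here; format() just reads more cleanly once you have several placeholders:

word, count = "blah", 3
print("WORD: %s OCCURENCE: %d" % (word, count))      # old-style % formatting
print("WORD: {} OCCURENCE: {}".format(word, count))  # str.format() equivalent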
Save the most common elements and use a loop.
common = word_counts.most_common(num)   # no [0] here, we want the whole list
for i in range(len(common)):
    print("WORD: %s \nOCCURENCE: %d \n" % (common[i][0], common[i][1]))
I wrote simple map and reduce programs in Python to count the number of words in each sentence and then group sentences with the same word count together. For example, suppose sentence 1 has 10 words, sentence 2 has 17 words and sentence 3 has 10 words. The final result will be:
10 \t 2
17 \t 1
The mapper function is:
import sys
import re
pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
for line in sys.stdin:
    word = str(len(line.split())) # calculate how many words for each line
    count = str(1)
    print "%s\t%s" % (word, count)
The reducer function is:
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    try:
        count = int(count)
        word = int(word)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print "%s\t%s" % (current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print "%s\t%s" % (current_word, current_count)
I tested it on my local machine with the first 200 lines of the file:
head -n 200 sentences.txt | python mapper.py | sort | python reducer.py
The results are correct. Then I used the Amazon MapReduce streaming service, and it failed at the reducer step. So I changed the print in the mapper function to:
print "LongValueSum" + word + "\t" + "1"
This fits the default aggregate reducer of the MapReduce streaming service, so in this case I don't need the reducer.py function, and I get the correct final results for the big sentences.txt file. But I don't know why my reducer.py function failed. Thank you!
Got it! A "stupid" mistake. When I tested it, I use something like python mapper.py. But for mapreduce, I need make it executable. So just add
# !/usr/bin/env python
in the beginning.
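To spell it out, mapper.py (and reducer.py) should have the shebang as the very first line, and both files need the executable bit set, e.g. with chmod +x mapper.py reducer.py:

#!/usr/bin/env python
# mapper.py, same logic as above, with the shebang as the very first line
import sys

for line in sys.stdin:
    word = str(len(line.split()))   # how many words in this sentence
    count = str(1)
    print "%s\t%s" % (word, count)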
wordsFreq = {}
words = []
while True:
    inputWord = raw_input()
    if (inputWord != ""):
        words.append(inputWord)
    else:
        break

for word in words:
    wordsFreq[word] = wordsFreq.get(word, 0) + 1

for word, freq in wordsFreq.items():
    print word + " - " + str(freq)
Apparently my words[] list and the for loop are redundant, but I wasn't given any further explanation than that. Can anyone explain to me why it is redundant?
You can skip the step of building a list of words and instead directly create the frequency dict as the user is entering words. I've used defaultdict to avoid having to check if a word has already been added.
from collections import defaultdict
wordsFreq = defaultdict(int)
while True:
    word = raw_input()
    if not word:
        break
    wordsFreq[word] += 1
If you aren't allowed to use defaultdict, it could look like this:
wordsFreq = {}
while True:
    word = raw_input()
    if not word:
        break
    wordsFreq[word] = wordsFreq.get(word, 0) + 1
You can use collections.Counter to do this easily:
from collections import Counter
words = []
input_word = True
while input_word:
    input_word = raw_input()
    if input_word:                  # skip the final empty string so it isn't counted
        words.append(input_word)
counted = Counter(words)
for word, freq in counted.items():
    print word + " - " + str(freq)
Note that an empty string evaluates to false, so rather than breaking when it equals an empty string, we can just use the string as our loop condition.
Edit: If, as an academic exercise, you don't wish to use Counter, then the next best option is a collections.defaultdict:
from collections import defaultdict
words = defaultdict(int)
input_word = True
while input_word:
    input_word = raw_input()
    if input_word:
        words[input_word] += 1

for word, freq in words.items():
    print word + " - " + str(freq)
The defaultdict ensures all keys will point to a value of 0 if they haven't been used before. This makes it easy for us to count using one.
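A quick illustration of that behaviour:

from collections import defaultdict

counts = defaultdict(int)
counts['spam'] += 1     # no KeyError: a missing key starts at 0
counts['spam'] += 1
print counts['spam']    # 2
print counts['eggs']    # 0 (reading a missing key also creates it with value 0)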
If you still want to keep your list of words as well, then you would need to do that in addition, e.g.:
words = []
words_count = defaultdict(int)
input_word = True
while input_word:
    input_word = raw_input()
    if input_word:
        words.append(input_word)
        words_count[input_word] += 1
I think your teacher was trying to say you can write the loop like this
wordsFreq = {}
while True:
    inputWord = raw_input()
    if (inputWord != ""):
        wordsFreq[inputWord] = wordsFreq.get(inputWord, 0) + 1
    else:
        break

for word, freq in wordsFreq.items():
    print word + " - " + str(freq)
There is no need to store the words in a temporary list; you can count them as you read them in.
You can do this:
wordsFreq = {}
while True:
    inputWord = raw_input()
    if inputWord == "":   # stop on an empty line, as in your original code
        break
    try:
        wordsFreq[inputWord] = wordsFreq[inputWord] + 1
    except KeyError:
        wordsFreq[inputWord] = 1
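To match the output of the other answers, you would then print the counts with the same loop the question already uses:

for word, freq in wordsFreq.items():
    print word + " - " + str(freq)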
I have a txt file. I have written code that finds the unique words and the number of times each word appears in that file. I now need to figure out how to print the lines that those words appear in as well. How can I go about doing this?
Here is a sample output:
Analyze what file: itsy_bitsy_spider.txt
Concordance for file itsy_bitsy_spider.txt
itsy : Total Count: 2
Line:1: The ITSY Bitsy spider crawled up the water spout
Line:4: and the ITSY Bitsy spider went up the spout again
#this function will get just the unique words without the stop words.
def openFiles(openFile):
    for i in openFile:
        i = i.strip()
        linelist.append(i)
        b = i.lower()
        thislist = b.split()
        for a in thislist:
            if a in stopwords:
                continue
            else:
                wordlist.append(a)
    #print wordlist
#this dictionary is used to count the number of times each word appears
countdict = {}
def countWords(this_list):
    for word in this_list:
        depunct = word.strip(punctuation)
        if depunct in countdict:
            countdict[depunct] += 1
        else:
            countdict[depunct] = 1
from collections import defaultdict
target = 'itsy'
word_summary = defaultdict(list)
with open('itsy.txt', 'r') as f:
    lines = f.readlines()

for idx, line in enumerate(lines):
    words = [w.strip().lower() for w in line.split()]
    for word in words:
        word_summary[word].append(idx)
unique_words = len(word_summary.keys())
target_occurence = len(word_summary[target])
line_nums = set(word_summary[target])
print "There are %s unique words." % unique_words
print "There are %s occurences of '%s'" % (target_occurence, target)
print "'%s' is found on lines %s" % (target, ', '.join([str(i+1) for i in line_nums]))
If you parsed the input text file line by line, you could maintain another dictionary that is a word -> List<Line> mapping, i.e. for each word in a line you add an entry. It might look something like the following (bearing in mind I'm not very familiar with Python, so there may be syntactic shortcuts I've missed):
countdict = {}
linedict = {}
for line in text_file:
    for word in line.split():   # split the line into words (iterating the line directly would give characters)
        depunct = word.strip(punctuation)
        if depunct in countdict:
            countdict[depunct] += 1
        else:
            countdict[depunct] = 1

        # add entry for word in the line dict if not there already
        if depunct not in linedict:
            linedict[depunct] = []

        # now add the word -> line entry
        linedict[depunct].append(line)
One modification you will probably need to make is to prevent duplicates being added to the linedict if a word appears twice in the line.
The above code assumes that you only want to read the text file once.
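A sketch along the same lines that avoids the duplicate problem, assuming text_file and punctuation are defined as above; it stores line numbers instead of the raw line text, which also matches the "Line:1:" style output in the question:

countdict = {}
linedict = {}
for line_num, line in enumerate(text_file, start=1):
    for word in line.split():
        depunct = word.strip(punctuation)
        countdict[depunct] = countdict.get(depunct, 0) + 1
        # record each line only once per word, even if the word repeats in it
        if line_num not in linedict.setdefault(depunct, []):
            linedict[depunct].append(line_num)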
openFile = open("test.txt", "r")
words = {}
for line in openFile.readlines():
    for word in line.strip().lower().split():
        wordDict = words.setdefault(word, { 'count': 0, 'line': set() })
        wordDict['count'] += 1
        wordDict['line'].add(line)
openFile.close()
print words
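If you want output closer to the sample in the question instead of a raw dict dump, you can loop over the dictionary yourself; a minimal sketch building on the code above (it prints the matching line text, since this code stores lines rather than line numbers):

for word, wordDict in words.items():
    print "%s : Total Count: %d" % (word, wordDict['count'])
    for line in wordDict['line']:
        print "    %s" % line.strip()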