Counting/Print Unique Words in Directory up to x instances - python

I am attempting to take all unique words in tale4653, count their instances, and then print the 100 most frequent unique words.
My struggle is sorting the dictionary so that I can print both each unique word and its respective count.
My code thus far:
import string
fhand = open('tale4653.txt')
counts = dict()
for line in fhand:
    line = line.translate(None, string.punctuation)
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
fhand.close()
rangedValue = sorted(counts.values(), reverse=True)
i = 0
while i < 100:
    print rangedValue[i]
    i = i + 1
Thank you community,

You lose the word (the key in your dictionary) when you call counts.values().
You can do this instead:
rangedValue = sorted(counts.items(), reverse=True, key=lambda x: x[1])
for word, count in rangedValue[:100]:
    print word + ': ' + str(count)
When you call counts.items(), it returns a list of (key, value) tuples like this:
[('the', 1), ('end', 2)]
When we sort it, we tell sorted to use the second element of each tuple (the count) as the sort key.
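A small self-contained sketch of that idea, written for Python 3. The `counts` dictionary here is made-up sample data (not taken from tale4653.txt), and the cutoff of 3 stands in for the 100 in the question:

```python
from operator import itemgetter

# Hypothetical sample data standing in for the real word counts.
counts = {'the': 5, 'end': 2, 'whale': 3, 'sea': 1}

# Sort the (word, count) pairs by the count, largest first.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)

# Slice to keep only the top N entries (N = 3 here, 100 in the question).
top = ranked[:3]
for word, count in top:
    print('{}: {}'.format(word, count))
```

Slicing instead of indexing in a while loop also avoids an IndexError when the text has fewer distinct words than the cutoff.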

DorElias is correct about the initial problem: you need to use counts.items() with key=lambda x: x[1] or key=operator.itemgetter(1), the latter of which would be faster.
However, I'd like to show how I'd do it, avoiding sorted in your code entirely. collections.Counter is a well-suited data structure for this task. I also prefer to wrap the logic of reading words from a file in a generator:
import string
from collections import Counter
def read_words(filename):
    with open(filename) as fhand:
        for line in fhand:
            line = line.translate(None, string.punctuation)
            line = line.lower()
            words = line.split()
            for word in words:  # in Python 3 one can use `yield from words`
                yield word
counts = Counter(read_words('tale4653.txt'))
for word, count in counts.most_common(100):
    print('{}: {}'.format(word, count))
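The two-argument form line.translate(None, string.punctuation) only works on Python 2 byte strings. A possible Python 3 variant of the same generator idea is sketched below; the names `PUNCT_TABLE` and `iter_words` and the sample lines are illustrative assumptions, and the function takes any iterable of lines so that it can be tried without a file:

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def iter_words(lines):
    """Yield lower-cased words with punctuation removed from an iterable of lines."""
    for line in lines:
        yield from line.translate(PUNCT_TABLE).lower().split()

# Any iterable of strings works; an open file object would be used in practice.
sample = ["The end. The END!", "An end, at last."]
counts = Counter(iter_words(sample))
top_two = counts.most_common(2)
print(top_two)
```

With a real file, the call would look like `with open('tale4653.txt') as fhand: counts = Counter(iter_words(fhand))`.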

Related

How to pull the key and value of an item with the maximum value out of a python dictionary

The following code creates a dictionary of email addresses and how frequently each email appears. What is the best way to pull out the key and value for the email with the max frequency?
fname = input("Enter file:")
try:
    fhand = open(fname)
except:
    print('File cannot be opened:')
    exit()
counts = dict()
for line in fhand:
    if line.startswith('From:'):
        words = line.split(' ', 1)[1]
        words = words.rstrip('\n')
        counts[words] = counts.get(words,0)+1
print(counts)
Based on Getting key with maximum value in dictionary?
import operator
max(counts.items(), key=operator.itemgetter(1))[0]
Or you can define two variables
max_count = 0
frequent = None
and append the following to the if clause:
        if counts[words] > max_count:
            max_count = counts[words]
            frequent = words
At the end, frequent contains the most frequent email (or None if there is no email in the file).
You can get both the key and value using
sorted(counts.items(), key=lambda x: x[1])[-1]
or
sorted(counts.items(), key=lambda x: x[1], reverse=True)[0]
You can invert the key and value in a generator expression and use max() normally:
count, word = max((c, w) for w, c in counts.items())
Use collections.Counter instead, which has a most_common method.
from collections import Counter
# Ignoring file opening code since it's irrelevant...
counts = Counter()
for line in fhand:
    if line.startswith('From:'):
        email_addr = line.rstrip('\n').split(' ', 1)[1]
        counts[email_addr] += 1
print(counts.most_common(1))
Or as a comprehension:
counts = Counter(
    line.rstrip('\n').split(' ', 1)[1]
    for line in fhand
    if line.startswith('From:')
)
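Note that most_common(1) returns a list holding one (key, count) tuple, or an empty list when nothing was counted. A minimal sketch of unpacking it into the key and value the question asks for, using made-up `lines` in place of the mail file:

```python
from collections import Counter

# Hypothetical lines standing in for the mail file's contents.
lines = [
    'From: alice@example.com\n',
    'Subject: hello\n',
    'From: bob@example.com\n',
    'From: alice@example.com\n',
]

counts = Counter(
    line.rstrip('\n').split(' ', 1)[1]
    for line in lines
    if line.startswith('From:')
)

top = counts.most_common(1)   # [] when no 'From:' line was found
if top:
    email_addr, frequency = top[0]
    print(email_addr, frequency)
else:
    print('no addresses found')
```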

Find words that appear only once

I am retrieving only the unique words in a file. Here is what I have so far; is there a better way to achieve this in Python in terms of big O notation? Right now this is n squared.
def retHapax():
    file = open("myfile.txt")
    myMap = {}
    uniqueMap = {}
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                del uniqueMap[j]
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print uniqueMap
If you want to find all unique words and treat foo the same as foo., you need to strip punctuation.
from collections import Counter
from string import punctuation
with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())
print([word for word, count in word_counts.items() if count == 1])
If you want to ignore case, you also need to use line.lower(). If you want to identify unique words accurately, there is more involved than just splitting the lines on whitespace.
I'd go with the collections.Counter approach, but if you only wanted to use sets, then you could do so by:
with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set()
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)
unique = all_words - dupes
Given an input of:
one two three
two three four
four five six
Has an output of:
{'five', 'one', 'six'}
Try this to get the unique words in a file using Counter:
from collections import Counter
with open("myfile.txt") as input_file:
    word_counts = Counter(word for line in input_file for word in line.split())
>>> [word for (word, count) in word_counts.items() if count==1]
-> list of unique words (words that appear exactly once)
You could slightly modify your logic and remove a word from the unique set on its second occurrence (example using sets instead of dicts, with f being the open file):
words = set()
unique_words = set()
for w in (word.strip() for line in f for word in line.split(' ')):
    if w in words:
        continue
    if w in unique_words:
        unique_words.remove(w)
        words.add(w)
    else:
        unique_words.add(w)
print(unique_words)

Storing a string and a set in a dictionary

I am trying to build a dictionary that contains the unique words that appear in an input file as well as the line number of each unique word. This is what I have so far.
def unique_word_index():
    line_no = 0
    word_set=set()
    line_no_set=set()
    word_map = {}
    for line in input_file:
        word_lst=line.strip().split()
        word_lst=[w.lower().strip(string.punctuation) for w in word_lst]
        line_no += 1
        for word in word_lst:
            if word !="":
                line_no_set.add(line_no)
                if 'word' in word_map.keys():
                    word_map['word']=line_no_set
                else:
                    word_map['word']=''
Try the following code:
def unique_words(input_file):
    file = open(input_file)
    wordlist = {}
    dups = []
    copy = []
    for index, value in enumerate(file):
        words = value.split()
        for word in words:
            wordlist[word] = index
            dups.append(word)
    for word in dups:
        if dups.count(word) != 1 and word not in copy:
            del(wordlist[word])
            copy.append(word)
    for item in wordlist:
        print 'The unique word '+item+' occurs on line '+str(wordlist[item])
It adds all the values to a dict and to a list, and then runs through the list to make sure each value occurs only once. If a value occurs more than once, we delete it from the dict, leaving only the unique words.
This runs as:
>>> unique_words('test.txt')
The unique word them occurs on line 2
The unique word I occurs on line 1
The unique word there occurs on line 0
The unique word some occurs on line 2
The unique word times occurs on line 3
The unique word say occurs on line 2
The unique word too occurs on line 3
The unique word have occurs on line 1
The unique word of occurs on line 2
>>>
You could go like this:
import string
def unique_words(input_file):
    word_map = dict()
    for i, line in enumerate(input_file):
        words = line.strip().split()
        for word in words:
            word = word.lower().strip(string.punctuation)
            if word in word_map:
                word_map[word] = None
            else:
                word_map[word] = i
    return dict((w, i) for w, i in word_map.items() if i is not None)
It adds the words and their corresponding line numbers to the dictionary word_map. When a word is seen more than once, its line number is replaced by None. The last line removes the entries whose line number is None.
Now the compact version, which uses Counter:
import string
from collections import Counter
def unique_words(input_file):
    words = [(i, w.lower().strip(string.punctuation))
             for i, line in enumerate(input_file) for w in line.strip().split()]
    word_counts = Counter(w for _, w in words)
    return dict((w, i) for i, w in words if word_counts[w] == 1)
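The question's title asks about storing a string together with a set. If the goal is to keep every line number a word appears on, rather than only words that occur once, one possible sketch uses collections.defaultdict(set). The function name `build_word_index`, the 1-based line numbering, and the sample lines are assumptions made for illustration:

```python
import string
from collections import defaultdict

def build_word_index(lines):
    """Map each normalised word to the set of 1-based line numbers it appears on."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        for raw in line.split():
            word = raw.lower().strip(string.punctuation)
            if word:                      # skip tokens that were only punctuation
                index[word].add(line_no)
    return dict(index)

sample = ["I have some.", "Some of them, -- too."]
index = build_word_index(sample)
print(index['some'])
```

Each word gets its own set, which avoids the single shared line_no_set in the question's code.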

Python - Unable to split lines from a txt file into words

My goal is to open a file and split it into unique words and display that list (along with a number count). I think I have to split the file into lines and then split those lines into words and add it all into a list.
The problem is that my program will either run in an infinite loop and not display any results, or it will only read a single line and then stop. The file being read is The Gettysburg Address.
def uniquify( splitz, uniqueWords, lineNum ):
    for word in splitz:
        word = word.lower()
        if word not in uniqueWords:
            uniqueWords.append( word )
def conjunctionFunction():
    uniqueWords = []
    with open(r'C:\Users\Alex\Desktop\Address.txt') as f :
        getty = [line.rstrip('\n') for line in f]
    lineNum = 0
    lines = getty[lineNum]
    getty.append("\n")
    while lineNum < 20 :
        splitz = lines.split()
        lineNum += 1
        uniquify( splitz, uniqueWords, lineNum )
    print( uniqueWords )
conjunctionFunction()
Using your current code, the line:
lines = getty[lineNum]
should be moved within the while loop.
You figured out what's wrong with your code, but nonetheless, I would do this slightly differently. Since you need to keep track of the unique words and their counts, you should use a dictionary for this task:
wordHash = {}
with open(r'C:\Users\Alex\Desktop\Address.txt', 'r') as f:
    for line in f:
        line = line.rstrip().lower()
        # split the line into words; iterating over the string itself would yield characters
        for word in line.split():
            if word not in wordHash:
                wordHash[word] = 1
            else:
                wordHash[word] += 1
print(wordHash)
def splitData(filename):
    return [words for words in open(filename).read().split()]
Easiest way to split a file into words :)
Assume inp is retrieved from a file
inp = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense."""
data = inp.splitlines()
print data
_d = {}
for line in data:
    word_lst = line.split()
    for word in word_lst:
        if word in _d:
            _d[word] += 1
        else:
            _d[word] = 1
print _d.keys()
Output
['Beautiful', 'Flat', 'Simple', 'is', 'dense.', 'Explicit', 'better', 'nested.', 'Complex', 'ugly.', 'Sparse', 'implicit.', 'complex.', 'than', 'complicated.']
I recommend:
#!/usr/local/cpython-3.3/bin/python
import pprint
import collections
def genwords(file_):
    for line in file_:
        for word in line.split():
            yield word
def main():
    with open('gettysburg.txt', 'r') as file_:
        result = collections.Counter(genwords(file_))
    pprint.pprint(result)
main()
...but you could use re.findall to deal with punctuation better, instead of str.split.
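A rough sketch of the re.findall idea, assuming a word is a run of ASCII letters that may contain internal apostrophes. That definition, the names `WORD_RE` and `count_words`, and the sample text are illustrative choices, not part of the answer above:

```python
import re
from collections import Counter

# Assumed definition of a word: a run of letters, optionally with
# internal apostrophes (so "can't" stays one token).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def count_words(text):
    """Count lower-cased words in text, ignoring surrounding punctuation."""
    return Counter(WORD_RE.findall(text.lower()))

sample = "Four score and seven years ago... we can't forget; four SCORE!"
result = count_words(sample)
print(result['four'], result["can't"])
```

Because the pattern only matches letters, tokens such as "ago..." and "ago" are counted as the same word, which plain str.split would not do.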

Stop words nltk/python problem

I have some code that processes a dataset for later use. The code I'm using for the stop words seems to be OK; however, I think the problem lies in the rest of my code, as it seems to remove only some of the stop words.
import re
import nltk
# Quran subset
filename = 'subsetQuran.txt'
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
    print word, freq
f = open("wordfreq.txt", "w")
f.write( str(freq_list3) )
f.close()
The output is looking like this
[(71, 'allah'), (65, 'ye'), (46, 'day'), (21, 'lord'), (20, 'truth'), (20, 'say'), (20, 'and')
This is just a small sample, there are others that should have been removed.
Any help is appreciated.
Try stripping your words while building word_list2:
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
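One plausible reason a word such as 'and' survives in the question's output is the order of operations: the stop-word filter runs before punctuation is removed, so a token like 'and,' does not match 'and' and only becomes 'and' afterwards. The sketch below removes punctuation first and then filters against a set. `STOP_WORDS` is a tiny hypothetical stand-in for set(nltk.corpus.stopwords.words('english')) so that the example runs without NLTK, and `word_frequencies` is an illustrative name:

```python
import re
from collections import Counter

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
STOP_WORDS = {'and', 'the', 'of', 'is'}

# Same character class as the question's pattern.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')

def word_frequencies(text):
    """Strip punctuation first, then drop stop words and empty tokens."""
    cleaned = (PUNCTUATION.sub('', w) for w in text.lower().split())
    return Counter(w for w in cleaned if w and w not in STOP_WORDS)

freq = word_frequencies('Truth and, the day; and (truth) is the day of truth.')
print(freq.most_common())
```

Building the stop-word set once also avoids calling stopwords.words('english') again for every word in the list comprehension.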
