counting lengths of the words in a .txt - python

I have seen similar questions, but nothing that truly helped me. I need to read in a text file, split it into words, and count the lengths of the words. I am also trying to print them out in a table, with the length of the word on the left and the actual word on the right. My code is a mess right now because I got to the point where I decided to ask for help.
a = open('owlcreek.txt').read().split()
lengths = dict()
for word in a:
    length = len(word)
    if length not in lengths:
        for length, counter in lengths.items():
            print "Words of length %d: %d" % (length, counter)
#words=[line for line in a]
#print ("\n" .join(counts))
Also, I guess I will need to write a little parser to strip out characters like "!--. I tried to use Counter, but I guess I don't know how to use it properly.

It should be like this:
a = open('owlcreek.txt').read().split()
lengths = dict()
for word in a:
    length = len(word)
    # if the key is not present, add it
    if length not in lengths:
        # the value should be the list of words
        lengths[length] = []
    # append the word to the list for length key
    lengths[length].append(word)
# print them out as length, count(words of that length)
for length, wrds in lengths.items():
    print "Words of length %d: %d" % (length, len(wrds))
Hope this helps!

A simple regular expression will suffice to clear out the punctuation and spaces.
Edit: If I'm understanding your problem correctly, you want all the unique words in a text file, sorted by length. In which case:
import re
import itertools
with open('README.txt', 'r') as f:
    words = set(re.findall(r"\w+'\w+|\w+", f.read()))  # discard duplicates
sorted_words = sorted(words, key=len)
for length, group in itertools.groupby(sorted_words, len):
    group = list(group)
    print("Words of length {0}: {1}".format(length, len(group)))
    for word in group:
        print(word)
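Since the question mentions Counter, here is a minimal Python 3 sketch of how it could produce the length table. The function name length_table and the sample sentence are made up for illustration; the regular expression is the one from the answer above.

```python
import re
from collections import Counter

def length_table(text):
    """Return (length, count, words) rows for the distinct words in text."""
    # Same pattern as the answer above: keeps contractions such as "didn't" whole.
    words = set(re.findall(r"\w+'\w+|\w+", text))
    counts = Counter(len(w) for w in words)
    rows = []
    for length in sorted(counts):
        matching = sorted(w for w in words if len(w) == length)
        rows.append((length, counts[length], matching))
    return rows

# Hypothetical sample text standing in for the contents of the file.
sample = "The owl didn't hoot. The creek ran on!"
for length, count, matching in length_table(sample):
    print("{:>6}  {:>5}  {}".format(length, count, ", ".join(matching)))
```

Counter only does the tallying here; the words themselves still have to be collected separately if you want them printed next to each length.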

Related

How to take out punctuation from string and find a count of words of a certain length?

I am trying to create a function that opens a .txt file and counts the words that have the same length as the number specified by the user.
The .txt file is:
This is a random text document. How many words have a length of one?
How many words have the length three? We have the power to figure it out!
Is a function capable of doing this?
I'm able to open and read the file, but I am unable to exclude punctuation and find the length of each word.
def samplePractice(number):
    fin = open('sample.txt', 'r')
    lstLines = fin.readlines()
    fin.close
    count = 0
    for words in lstLines:
        words = words.split()
    for i in words:
        if len(i) == number:
            count += 1
    return count
You can try calling replace() on the string once per punctuation character you want removed, replacing each one with an empty string ("").
It would look something like this:
puncstr = "Hello!"
nopuncstr = puncstr.replace(".", "").replace("?", "").replace("!", "")
I have written some sample code to remove punctuation and count the number of words. Modify it according to your requirements.
import re
fin = """This is a random text document. How many words have a length of one? How many words have the length three? We have the power to figure it out! Is a function capable of doing this?"""
fin = re.sub(r'[^\w\s]','',fin)
print(len(fin.split()))
The above code prints the number of words. Hope this helps!!
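Instead of cascading replace() calls, use a single strip() call. Note that strip() only removes the listed characters from the start and end of each word, which is enough for this sample text.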
Edit: a cleaner version
pl = '?!."\''  # punctuation list
def samplePractice(number):
    with open('sample.txt', 'r') as fin:
        words = fin.read().split()
    # clean words
    words = [w.strip(pl) for w in words]
    count = 0
    for word in words:
        if len(word) == number:
            print(word, end=', ')
            count += 1
    return count
result = samplePractice(4)
print('\nResult:', result)
output:
This, text, many, have, many, have, have, this,
Result: 8
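The strip() and re.sub() approaches from these answers do not always give the same word lengths, because strip() only trims the ends of a word. A small sketch of the difference follows; the helper names clean_strip and clean_sub are made up here, and string.punctuation is used in place of the hand-written punctuation list.

```python
import re
import string

def clean_strip(word):
    # strip() trims punctuation only from both ends of the word.
    return word.strip(string.punctuation)

def clean_sub(word):
    # re.sub() removes every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", word)

for raw in ['"out!"', "don't", "well-known."]:
    print(raw, "->", clean_strip(raw), "|", clean_sub(raw))
```

Which one is right depends on whether an apostrophe or hyphen inside a word should count toward its length.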
Your code is almost OK; the second for block is just in the wrong position. It needs to be nested inside the first loop:
pl = '?!."\''  # punctuation list
def samplePractice(number):
    fin = open('sample.txt', 'r')
    lstLines = fin.readlines()
    fin.close()
    count = 0
    for words in lstLines:
        words = words.split()
        for i in words:
            i = i.strip(pl)  # clean the word by strip
            if len(i) == number:
                count += 1
    return count
result = samplePractice(4)
print(result)
output:
8

Only print specific amount of Counter items, with decent formatting

I am trying to print the top N most frequently used words in a text file. So far, I have the file handling and the counter working; I just can't figure out how to print the number of items I want in a pretty way. Here is my code.
import re
from collections import Counter
def wordcount(user):
    """
    Docstring for word count.
    """
    file = input("Enter full file name w/ extension: ")
    num = int(input("Enter how many words you want displayed: "))
    with open(file) as f:
        text = f.read()
    words = re.findall(r'\w+', text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    char, n = word_counts.most_common(num)[0]
    print ("WORD: %s \nOCCURENCE: %d " % (char, n) + '\n')
Basically, I just want to go and make a loop of some sort that will print out the following...
For instance num=3
So it will print out the 3 most frequent used words, and their count.
WORD: Blah Occurrence: 3
Word: bloo Occurrence: 2
Word: blee Occurrence: 1
I would iterate "most common" as follows:
most_common = word_counts.most_common(num)  # removed the [0] since we're not looking only at the first item!
for item in most_common:
    print("WORD: {} OCCURRENCE: {}".format(item[0], item[1]))
Two comments:
1. Use format() to format strings instead of % - you'll thank me later for this advice!
2. This way you'll be able to iterate any number of "top N" results without hardcoding "3" into your code.
Save the most common elements and use a loop.
common = word_counts.most_common(num)  # a list of (word, count) pairs
for word, n in common:
    print("WORD: %s \nOCCURRENCE: %d \n" % (word, n))
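For the "pretty" part of the question, one option is to pad each word to the width of the longest one so the counts line up. This is only a sketch in Python 3; top_words and format_rows are hypothetical helper names, and the sample text stands in for the file contents.

```python
import re
from collections import Counter

def top_words(text, num):
    """Return the num most common upper-cased words as (word, count) pairs."""
    words = re.findall(r"\w+", text)
    return Counter(word.upper() for word in words).most_common(num)

def format_rows(pairs):
    """Format (word, count) pairs as lines with the words padded to equal width."""
    width = max((len(word) for word, _ in pairs), default=0)
    return ["WORD: {:<{w}}  OCCURRENCE: {}".format(word, n, w=width)
            for word, n in pairs]

# Hypothetical sample text standing in for the contents of the file.
text = "blah bloo blah blee bloo blah"
for line in format_rows(top_words(text, 2)):
    print(line)
```

Because most_common(num) already returns at most num pairs, the loop needs no hardcoded count.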

Getting the third largest word in a string/list in python

For the code below, I can't seem to get the 3rd largest word. I am splitting the string I get from user input and putting it in the "words" variable, then I make 2 lists, one of which includes the words sorted in terms of length.
Then I get the length of the longest word (in maxlist) and the second longest word (in maxlist2) and remove them. All that's left should be the third longest word from the original list and any shorter words. But I find it doesn't quite work right.
The second and third "for" statements below don't seem to remove all instances of the word length represented by "maxlist".
For example, if I represent words by just the letter "e" and use different numbers of e's for different word lengths (i.e. ee, eeee, eeeee), some of these instances are removed by the "for" statement and some are not. For this input:
"e ee eee eeee eeee eeee eeee eeee eeee eeee eeee" I would expect all "eeee" words to be removed by the code:
if len(word) == maxlist:
    sort2.remove(word)
If I repeat the code again for the next longest word (which is done by the third "for" statement), I should also remove the "eee" instance. They are not removed though, and the final list remains "'e', 'ee', 'eee', 'eeee', 'eeee'".
The second "for" statement seems to remove 6 instances of "eeee" but not all 8 instances. What is wrong here? Please help!
My final output should be the third longest word of the original list plus any shorter words.
def ThirdGreatest(strArr):
    words = strArr.split()
    sort = []   # length of words
    sort2 = []  # actual words
    for word in words:
        sort2.append(word)
        sort.append(len(word))
    sort2.sort()
    maxlist = len(max(sort2, key=len))
    for word in sort2:
        if len(word) == maxlist:
            sort2.remove(word)
    maxlist2 = len(max(sort2, key=len))
    for word in sort2:
        if len(word) == maxlist2:
            sort2.remove(word)
    maxlist3 = (max(sort2, key=len))
    print
    print "biggest word is {} char long ".format(maxlist)
    print sort
    print "3rd biggest word is {}: ".format(maxlist3)
    print "3rd biggest word is {}: ".format(sort2)  # list of words remaining
    # after the first 2 longest have been removed
ThirdGreatest(raw_input("Enter String: "))
You should use heapq for finding the third largest:
import heapq
# key=len ranks the words by length; without it they are compared alphabetically
third_largest = heapq.nlargest(3, set(words), key=len)[-1]
After that, you can use all sorts of stuff, e.g. a list comprehension:
[word for word in words if word != third_largest]
Your problem is:
for word in sort2:
    if len(word) == maxlist:
        sort2.remove(word)
Don't change the list you're currently iterating, that's just gonna mess things up. It's like you're reading a book and someone rips out pages while you're reading.
Iterate over a copy instead:
for word in sort2[:]:
    if len(word) == maxlist:
        sort2.remove(word)
Note the added [:], which gives you a copy.
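To see the skipping in action, here is a small Python 3 sketch that compares removing while iterating, iterating over a copy, and building a new list. The three function names are made up for this illustration, and the input is a shortened version of the asker's example.

```python
def remove_in_place(words, length):
    """Buggy: mutates the list it is iterating over, so some items are skipped."""
    words = list(words)
    for word in words:
        if len(word) == length:
            words.remove(word)
    return words

def remove_with_copy(words, length):
    """Iterates over a copy, so every matching word is removed."""
    words = list(words)
    for word in words[:]:
        if len(word) == length:
            words.remove(word)
    return words

def remove_with_comprehension(words, length):
    """Builds a new list instead of removing anything."""
    return [word for word in words if len(word) != length]

sample = "e ee eee eeee eeee eeee eeee".split()
print(remove_in_place(sample, 4))            # some 'eeee' entries survive
print(remove_with_copy(sample, 4))           # all 'eeee' entries removed
print(remove_with_comprehension(sample, 4))  # same result, no mutation
```

Each remove() shifts the later items one position to the left while the loop index still moves forward, so the item that slides into the current slot is never examined.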
And an alternative solution:
[next(g) for _, g in groupby(sorted(words, key=len), len)][-3]
Demo:
>>> words = 'This is a test and I try hard to make it good'.split()
>>> from itertools import groupby
>>> [next(g) for _, g in groupby(sorted(words, key=len), len)][-3]
'is'
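The groupby one-liner returns only the first word of the third-largest length. If all words of that length are wanted, something like the following Python 3 sketch might do; third_longest_words is a hypothetical name, and returning an empty list when there are fewer than three distinct lengths is an assumption about the desired behavior.

```python
def third_longest_words(sentence):
    """Return the distinct words whose length is the third-largest distinct length."""
    words = sentence.split()
    lengths = sorted({len(word) for word in words}, reverse=True)
    if len(lengths) < 3:
        return []  # assumption: no answer when fewer than three distinct lengths
    target = lengths[2]
    result = []
    for word in words:
        # Keep input order and drop duplicates.
        if len(word) == target and word not in result:
            result.append(word)
    return result

print(third_longest_words("This is a test and I try hard to make it good"))
```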
This was my original, long and cumbersome solution but other users have posted much more concise and clear answers. Thank you guys.
def ThirdGreatest(strArr):
    words = strArr.split()
    sort = []   # length of words
    sort2 = []  # actual words
    for word in words:
        sort2.append(word)
        sort.append(len(word))
    sort3 = set(sort2)
    sorted_set3 = sorted(sort3, key=len)
    sort4 = []
    for n in sorted_set3:
        sort4.append(n)
    maxlist = len(max(sort4, key=len))
    for word in sort4[:]:  # iterate over a copy while removing
        if len(word) == maxlist:
            sort4.remove(word)
    maxlist2 = len(max(sort4, key=len))
    for word in sort4[:]:  # iterate over a copy while removing
        if len(word) == maxlist2:
            sort4.remove(word)
    print "_______________________________________________________________"
    print "The third largest word is: {} ".format(max(sort4, key=len))
ThirdGreatest(raw_input("Enter String: "))

Comparing lengths of words in strings

I need to find the longest word in a string and print that word.
1.) Ask the user to enter a sentence with words separated by spaces.
2.) Find and print the longest word. If two or more words are the same length, then print the first word.
This is what I have so far:
def maxword(splitlist):  # sorry, still trying to understand loops
    for word in splitlist:
        length = len(word)
        if ??????
wordlist = input("Enter a sentence: ")
splitlist = wordlist.split()
maxword(splitlist)
I'm hitting a wall when trying to compare the lengths of words in a sentence. I'm a student who's been using Python for 5 weeks.
def longestWord(sentence):
    longest = 0  # Keep track of the longest length
    word = ''    # And the word that corresponds to that length
    for i in sentence.split():
        if len(i) > longest:
            word = i
            longest = len(i)
    return word
>>> s = 'this is a test sentence with some words'
>>> longestWord(s)
'sentence'
You can use max with a key:
def max_word(splitlist):
    return max(splitlist.split(), key=len) if splitlist.strip() else ""  # python 2
def max_word(splitlist):
    return max(splitlist.split(), key=len, default=" ")  # python 3.4+
Or use a try/except as suggested by jon clements:
def max_word(splitlist):
    try:
        return max(splitlist.split(), key=len)
    except ValueError:
        return " "
You're going in the right direction. Most of your code looks good; you just need to finish the logic to determine which is the longest word. Since this seems like a homework question, I don't want to give you the direct answer (even though everyone else has, which I think is useless for a student like you), but there are multiple ways to solve this problem.
You're getting the length of each word correctly, but what do you need to compare each length against? Try to say the problem aloud, along with how you'd personally solve it. I think you'll find that your English description translates nicely to a Python version.
Another solution that doesn't use an if statement might use the built-in Python function max, which takes in a list of numbers and returns the max of them. How could you use that?
You can use nlargest from the heapq module:
import heapq
heapq.nlargest(1, sentence.split(), key=len)  # returns a one-element list
sentence = raw_input("Enter sentence: ")
words = sentence.split(" ")
maxlen = 0
longest_word = ''
for word in words:
    if len(word) > maxlen:  # strict comparison keeps the first word on ties
        maxlen = len(word)
        longest_word = word
print("%s %d" % (longest_word, maxlen))
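The assignment asks for the first word when several share the maximum length. The built-in max already behaves that way, since it returns the first maximal item it encounters. A minimal Python 3 sketch follows; the function name longest_word and the empty-string result for blank input are choices made for this example.

```python
def longest_word(sentence):
    """Return the first longest word in sentence, or "" if it contains no words."""
    # default= requires Python 3.4 or later.
    return max(sentence.split(), key=len, default="")

print(longest_word("find the best word here"))
```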

Defining words as 2 letters or more in python 2.6

I have a Python script that I am writing for a class assignment. It calculates the top 10 most frequent words in a text document and displays the words and their frequency. I was able to get this part of the script working just fine, but the assignment says a word is defined as 2 letters or more. For some reason I cannot seem to define a word as 2 letters or more; when I run the script, nothing happens.
# Most Frequent Words:
from string import punctuation
from collections import defaultdict
def sort_words(x, y):
    return cmp(x[1], y[1]) or cmp(y[0], x[0])
number = 10
words = {}
words_gen = (word.strip(punctuation).lower() for line in open("charactermask.txt")
             for word in line.split())
words = defaultdict(int)
for word in words_gen:
    words[word] +=1
    letters = len(word)
    while letters >= 2:
        top_words = sorted(words.iteritems(),
                           key=lambda(word, count): (-count, word))[:number]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
One problem with your script is the loop
while letters >= 2:
    top_words = sorted(words.iteritems(),
                       key=lambda(word, count): (-count, word))[:number]
You are not looping through the words here; this loop will just loop forever. You need to change the script so that this part of the script actually iterates over all of the words. (Also, you will probably want to change while to if because you only need that code to execute once per word.)
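I would refactor your code and use a collections.defaultdict (collections.Counter would be simpler, but it is only available from Python 2.7 onward):
import collections
import string
with open("charactermask.txt") as f:
    words = [x.strip(string.punctuation).lower() for x in f.read().split()]
counter = collections.defaultdict(int)
for word in words:
    if len(word) >= 2:
        counter[word] += 1
# sort by descending count, then alphabetically, and keep the top 10
top_words = sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:10]
for word, frequency in top_words:
    print("%s: %d" % (word, frequency))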
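If a newer interpreter is an option (collections.Counter needs Python 2.7 or later, so not the 2.6 from the question), the length filter and the counting fit in one expression. The sketch below is written for Python 3; top_words, its parameters, and the sample sentence are all made up for illustration.

```python
from collections import Counter
from string import punctuation

def top_words(text, number=10, min_length=2):
    """Count words of at least min_length characters and return the most frequent."""
    cleaned = (word.strip(punctuation).lower() for word in text.split())
    counts = Counter(word for word in cleaned if len(word) >= min_length)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:number]

# Hypothetical sample text standing in for charactermask.txt.
sample = "A cat saw a dog. The dog saw the cat, and I ran!"
for word, frequency in top_words(sample, number=3):
    print("%s: %d" % (word, frequency))
```

The filter runs once per word while counting, which is what the while loop in the question was trying to express.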
