I'm trying to complete a simple word-count program, which keeps track of the number of words, characters, and lines in a connected file.
# This program counts the number of lines, words, and characters in a file, entered by the user.
# The file is test text from a standard lorem ipsum generator.
import string
def wc():
    # Sets the count of normal lines, words, and characters to 0 for proper iterative operation.
    lines = 0
    words = 0
    chars = 0
    print("This program will count the number of lines, words, and characters in a file.")
    # Stores a variable as a string for more graceful coding and no errors experienced previously.
    filename =("test.txt")
    # Opens file and stores it as new variable, and loops through each line once the connection with file is made.
    with open(filename, 'r') as fileObject:
        for l in fileObject:
            # Splits text file into each individual word for word count.
            words = l.split()
            lines += 1
            words += len(words)
            chars += len(l)
    print("Lines:", lines)
    print("Words:", words)
    print("Characters:", chars)
wc()
while 1:
    pass
Now, if all goes well, it should be printing the total number of lines, letters, and words in the file, but all I get is this message:
"words += len(words)
TypeError: 'int' object is not iterable
"
What is wrong?
SOLVED! New code:
# This program counts the number of lines, words, and characters in a file, entered by the user.
# The file is test text from a standard lorem ipsum generator.
import string
def wc():
    # Sets the count of normal lines, words, and characters to 0 for proper iterative operation.
    lines = 0
    words = 0
    chars = 0
    print("This program will count the number of lines, words, and characters in a file.")
    # Stores a variable as a string for more graceful coding and no errors experienced previously.
    filename =("test.txt")
    # Opens file and stores it as new variable, and loops through each line once the connection with file is made.
    with open(filename, 'r') as fileObject:
        for l in fileObject:
            # Splits text file into each individual word for word count.
            wordsFind = l.split()
            lines += 1
            words += len(wordsFind)
            chars += len(l)
    print("Lines:", lines)
    print("Words:", words)
    print("Characters:", chars)
wc()
while 1:
    pass
It looks like you're using the variable name words for your count, and also for the result of l.split(). You need to differentiate these by using different variable names for them.
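A small sketch of why the original version fails, using a made-up line of text instead of the file: once words is rebound to the list returned by split(), words += ... becomes list extension, which needs an iterable, hence the TypeError. The names word_count and words_in_line are just illustrative replacements.

```python
line = "lorem ipsum dolor\n"

# Reusing one name for both the counter and the split result.
words = 0
words = line.split()          # words is now a list; the counter is gone
try:
    words += len(words)       # list += int tries to iterate over the int
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Separate names keep the counter an int.
word_count = 0
words_in_line = line.split()
word_count += len(words_in_line)
print(word_count)
```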
Write a function named file_stats that takes one string parameter (in_file) that is the name of an existing text file. The function file_stats should calculate three statistics about in_file: the number of lines it contains, the number of words and the number of characters, and print the three statistics on separate lines. For example, the following would be correct input and output. (Hint: the number of characters may vary depending on what platform you are working on.)
file_stats('created_equal.txt')
lines 2
words 13
characters 72
Below is what I have:
fileName = "C:\Users\Jeff Hardy\Desktop\index.txt"
chars = 0
words = 0
lines = 0
def file_stats(in_file):
global lines, words, chars
with open(in_file, 'r') as fd:
for line in fd:
lines += 1
wordsList = line.split()
words += len(wordsList)
for word in wordsList:
chars += len(word)
file_stats(fileName)
print("Number of lines: {0}".format(lines))
print("Number of words: {0}".format(words))
print("Number of chars: {0}".format(chars))
The code is giving me the following error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
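I believe your error comes from the backslashes in the path rather than from the contents of your file: in an ordinary string literal, \U starts a Unicode escape, so the line needs to be
fileName = "C:\\Users\\Jeff Hardy\\Desktop\\index.txt"
(or a raw string, r"C:\Users\Jeff Hardy\Desktop\index.txt"). Also, the instructions ask you to print within the function rather than modify global variables, and you need to update the values within the loop, not after it (indentation matters):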
def file_stats(in_file):
    lines = words = chars = 0
    with open(in_file, 'r', encoding="utf-8") as fd:
        for line in fd:
            lines += 1
            words += len(line.split())  # If you split "x , y" is the comma a word?
            chars += len(line)  # Are spaces considered a character?
    print("lines {0}".format(lines))
    print("words {0}".format(words))
    print("characters {0}".format(chars))
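As a quick check of the path-escaping point (a minimal sketch; the path is the one from the question and no file is actually opened), both spellings produce the same string:

```python
# Two equivalent ways to write a Windows path in a Python string literal.
escaped = "C:\\Users\\Jeff Hardy\\Desktop\\index.txt"   # backslashes doubled
raw = r"C:\Users\Jeff Hardy\Desktop\index.txt"          # raw string literal

print(escaped == raw)   # True: both spell the same path
print(raw)
```

Forward slashes ("C:/Users/...") are generally accepted by open() on Windows as well, which avoids the escaping question altogether.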
I have to compress a file into a list of words and list of positions to recreate the original file. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation, using the map function my program can't convert my list of positions into floats because of the '[' as it is a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
    line = text.readline()
    if not line:
        break
    TwoList = line.split()
    for word in TwoList:
        if word not in CharactersUnique:
            CharactersUnique.append(word)
        ListOfPositions.append(CharactersUnique.index(word))
    if not DownLine:
        CharactersUnique.append("\n")
        DownLine = True
    ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
    w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
    NewWordsUnique = f.readline()
    f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats, since they are list indices, and those must be integers? I suspect this might be an instance of what is called the XY Problem.
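The immediate error, for what it's worth, comes from calling map() on a string: iterating over a string yields single characters, so the first thing float() sees is '['. Here is a minimal sketch of reading the positions back as integers, assuming the line holds the output of str(ListOfPositions). The sample value is made up for illustration, and ast.literal_eval is my suggestion rather than something from the question.

```python
import ast

# Hypothetical line from the compressed file, as written by str(a_list).
positions_line = "[0, 1, 2, 1, 3]\n"

# map(float, positions_line) would iterate character by character,
# so the first conversion attempted is float('['), which raises ValueError.
first_char = next(iter(positions_line))

# ast.literal_eval safely parses the list literal back into a list of ints.
positions = ast.literal_eval(positions_line.strip())

print(repr(first_char))
print(positions)
```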
I also found your code difficult to understand because you haven't followed PEP 8, the Style Guide for Python Code. In particular, many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variables into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role: For example: CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they all would be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in (NewListOfPositions))
because it adds a space before and after every "\n" character encountered, and those spaces aren't there in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a generator expression, because doing so allows the newline "words" to be handled properly.
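To illustrate the spacing problem with made-up data (the word table and positions below are hypothetical, not taken from speech.txt):

```python
# Hypothetical word table and position list; "\n" is stored as a "word".
unique_words = ["\n", "Hello", "world", "again"]
positions = [1, 2, 0, 1, 3, 0]

# Joining everything with spaces puts a space on both sides of each newline.
naive = " ".join(unique_words[p] for p in positions)

# Rebuilding line by line joins only the real words of each line.
lines, current = [], []
for p in positions:
    word = unique_words[p]
    if word == "\n":
        lines.append(" ".join(current))
        current = []
    else:
        current.append(word)
rebuilt = "\n".join(lines)

print(repr(naive))    # 'Hello world \n Hello again \n'
print(repr(rebuilt))  # 'Hello world\nHello again'
```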
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"
# Two lists to represent contents of input file.
unique_words = ["\n"]  # preload with newline "word"
word_positions = []
with open(input_filename, "r") as input_file:
    for line in input_file:
        for word in line.split():
            if word not in unique_words:
                unique_words.append(word)
            word_positions.append(unique_words.index(word))
        word_positions.append(unique_words.index("\n"))  # add newline at end of each line
# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
    words_repr = " ".join(repr(word) for word in unique_words)
    compr_file.write(words_repr + "\n")
    positions_repr = " ".join(repr(posn) for posn in word_positions)
    compr_file.write(positions_repr + "\n")
def strip_quotes(word):
    """Strip the first and last characters from the string (assumed to be quotes)."""
    tmp = word[1:-1]
    return tmp if tmp != "\\n" else "\n"  # newline "words" are special case
# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
    line = compr_file.readline()
    new_unique_words = list(map(strip_quotes, line.split()))
    line = compr_file.readline()
    new_word_positions = map(int, line.split())  # using int, not float here
words = []
lines = []
for posn in new_word_positions:
    word = new_unique_words[posn]
    if word != "\n":
        words.append(word)
    else:
        lines.append(" ".join(words))
        words = []
print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}
with open("speech.txt") as f:
    data = f.read()
# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")
for i, line in enumerate(lines):
    words = line.split(" ")
    for j, word in enumerate(words):
        if word in all_words:
            all_words[word].append((i, j))  # line and pos
        else:
            all_words[word] = [(i, j)]
Note that this does not yield maximum compression as foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully now you can use a similar approach to do so if desired.
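For completeness, here is a sketch of going the other way, rebuilding the text from such a word-to-positions mapping. The function names build_index and rebuild are made up for this illustration, and it assumes the text is split on newlines and single spaces exactly as above.

```python
def build_index(data):
    """Map each word to a list of (line, position) pairs."""
    all_words = {}
    for i, line in enumerate(data.split("\n")):
        for j, word in enumerate(line.split(" ")):
            all_words.setdefault(word, []).append((i, j))
    return all_words

def rebuild(all_words):
    """Recreate the original text from the word -> positions mapping."""
    rows = {}
    for word, places in all_words.items():
        for i, j in places:
            rows.setdefault(i, {})[j] = word
    lines = []
    for i in sorted(rows):
        lines.append(" ".join(rows[i][j] for j in sorted(rows[i])))
    return "\n".join(lines)

sample = "foo bar foo.\nbar  baz"
index = build_index(sample)
print(rebuild(index) == sample)
```

Because the split is on single spaces, runs of spaces show up as empty-string "words" and survive the round trip.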
I'm writing a program that counts all lines, words and characters from a file given as input.
import string
def main():
    print "Program determines the number of lines, words and chars in a file."
    file_name = raw_input("What is the file name to analyze? ")
    in_file = open(file_name, 'r')
    data = in_file.read()
    words = string.split(data)
    chars = 0
    lines = 0
    for i in words:
        chars = chars + len(i)
    print chars, len(words)
main()
To some extent, the code is ok.
I don't know however how to count 'spaces' in the file. My character counter counts only letters, spaces are excluded.
Plus I'm drawing a blank when it comes to counting lines.
You can just use len(data) for the character length.
You can split data by lines using the .splitlines() method, and length of that result is the number of lines.
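As a minimal sketch of those two suggestions (Python 3 syntax; the sample text stands in for the contents read from the file):

```python
# Sample text standing in for data = in_file.read().
data = "first line here\nsecond line\n"

chars = len(data)               # every character, spaces and newlines included
lines = len(data.splitlines())  # one entry per line
words = len(data.split())       # split on any run of whitespace

print(lines, words, chars)
```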
But, a better approach would be to read the file line by line:
chars = words = lines = 0
with open(file_name, 'r') as in_file:
    for line in in_file:
        lines += 1
        words += len(line.split())
        chars += len(line)
Now the program will work even if the file is very large; it won't hold more than one line at a time in memory (plus a small buffer that python keeps to make the for line in in_file: loop a little faster).
Very simple: if you want to print the number of characters, words, and lines in the file, including the spaces, I feel the shortest answer is mine:
import string
data = open('diamond.txt', 'r').read()
print len(data.splitlines()), len(string.split(data)), len(data)
Keep coding buddies...
read file-
d=fp.readlines()
characters-
sum([len(i)-1 for i in d])
lines-
len(d)
words-
sum([len(i.split()) for i in d])
This is one crude way of counting words without using any keywords:
# count number of words in file
fp = open("hello1.txt", "r+")
data = fp.read()
word_count = 1
for i in data:
    if i == " ":
        word_count = word_count + 1
    # end if
# end for
print("number of words are:", word_count)
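One caveat, shown as a small sketch on made-up strings rather than a file: counting spaces overcounts when words are separated by several spaces and misses words separated only by newlines, whereas str.split() with no arguments treats any run of whitespace as one separator. The helper names are hypothetical.

```python
def count_by_spaces(text):
    """Count words as (number of spaces + 1), like the loop above."""
    count = 1
    for ch in text:
        if ch == " ":
            count += 1
    return count

def count_by_split(text):
    """Count words by splitting on any run of whitespace."""
    return len(text.split())

for sample in ["one  two  three", "one\ntwo"]:
    print(repr(sample), count_by_spaces(sample), count_by_split(sample))
```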
def myfunc(filename):
    filename=open('hello.txt','r')
    lines=filename.readlines()
    filename.close()
    lengths={}
    for line in lines:
        for punc in ".,;'!:&?":
            line=line.replace(punc," ")
        words=line.split()
        for word in words:
            length=len(word)
            if length not in lengths:
                lengths[length]=0
            lengths[length]+=1
    for length,counter in lengths.items():
        print(length,counter)
    filename.close()
Use collections.Counter (available in Python 2.7 and later).
You are counting the frequency of words in a single line.
for line in lines:
    for word in length.keys():
        print(wordct,length)
length is a dict of all distinct words plus their frequencies, not their lengths:
length.get(word,0)+1
so you probably want to replace the above with
for line in lines:
    ....
# keep this at this indentation - it will be a very large dict, but of all words
for word in sorted(length.keys(), key=lambda x:len(x)):
    # word, freq, length
    print(word, length[word], len(word), "\n")
I would also suggest:
Don't bring the whole file into memory like that; file objects are now iterators and are well optimised for reading from files.
Drop the wordct and so on in the main lines loop.
Rename length to something else - perhaps words or dict_words.
Errr, maybe I misunderstood - are you trying to count the number of distinct words in the file (in which case use len(length.keys())), or the length of each word in the file, presumably ordered by length?
The question has been more clearly defined now so replacing the above answer
The aim is to get a frequency of word lengths throughout the whole file.
I would not even bother with line by line but use something like:
fo = open(file)
data = fo.read()  # read the text; a file object has no find() method
fo.close()
d_freq = {}
for word in data.split():
    word_len = len(word)
    d_freq[word_len] = d_freq.get(word_len, 0) + 1  # get() cannot be assigned to with +=
print(d_freq)
I think that will work, not enough time to try it now. HTH
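A sketch of the same word-length frequency count using collections.Counter (Python 2.7 or later), written as a function over a string so it can be tried without a file. The name word_length_freq is made up for this illustration, and the punctuation set is borrowed from the question's code.

```python
from collections import Counter

def word_length_freq(text, punctuation=".,;'!:&?"):
    """Return a Counter mapping word length -> number of words of that length."""
    for punc in punctuation:
        text = text.replace(punc, " ")
    return Counter(len(word) for word in text.split())

freq = word_length_freq("Hello, world! It is a fine day.")
for length, count in sorted(freq.items()):
    print(length, count)
```

For a real file, the text argument could be the result of reading the file, or the Counter could be updated line by line to avoid holding the whole file in memory.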
I have this piece of code, the last bit of the code starting from d = {}.
I'm trying to print the words with its line number located in the text but it is not working, it's only printing the words - anyone know why?
import sys
import string
text = []
infile = open(sys.argv[1], 'r').read()
for punct in string.punctuation:
infile = infile.replace(punct, "")
text = infile.split("\n")
dict = open(sys.argv[2], 'r').read()
dictset = []
dictset = dict.split()
words = []
words = list(set(text) - set(dictset))
words = [text.lower() for text in words]
words.sort()
d = {}
counter = 0
for lines in text:
counter += 1
if word not in d:
d[words] = [counter]
else:
d[words.append[counter]
print(word, d)
This code outputs:
helo
goin
ist
I want it to output :
helo #tab# 3 4
goin #tab# 1 2
text is a list of WORDS, it's not a list of LINES. When you do:
text = infile.split()
you're irreversibly, forever throwing away all connections between a word and the line it was in. So when you later write
for lines in text:
it's a lie: text's items are words, not lines. If they weren't, then this other earlier line:
words = list(set(text) - set(dictset))
would be totally broken -- this depends on text's items being words, not lines.
And, by the way, when you do:
words = [text.lower() for text in words]
text is now left bound to the last item in words -- you've destroyed whatever other value it had previously.
Recommendation number one: stop reusing identifiers for many different, incompatible purposes. Make a commitment to yourself that no identifier shall ever be bound to two different things within any one of your programs. This will, at least, reduce the incredible amount of utter confusion that you manage to pile onto so few lines.
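As a sketch of keeping the word-to-line connection (not the asker's full program: the function name and the sample text are hypothetical, and the dictionary-word filtering and punctuation stripping are left out), each name below is bound to exactly one kind of thing:

```python
from collections import defaultdict

def line_numbers_by_word(text):
    """Map each lowercased word to the list of line numbers it appears on."""
    line_numbers = defaultdict(list)
    for number, line in enumerate(text.split("\n"), start=1):
        for word in line.lower().split():
            if number not in line_numbers[word]:
                line_numbers[word].append(number)
    return line_numbers

sample_text = "goin home\ngoin out\nhelo there\nhelo again"
index = line_numbers_by_word(sample_text)
for word in sorted(index):
    print(word + "\t" + " ".join(str(n) for n in index[word]))
```

The key point is that the text is split into lines first and into words second, so the line number is still known when each word is recorded.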