Python - I read a file but it shows me erroneous values? - python

def showCounts(fileName):
lineCount = 0
wordCount = 0
numCount = 0
comCount = 0
dotCount = 0
with open(fileName, 'r') as f:
for line in f:
for char in line:
if char.isdigit() == True:
numCount+=1
elif char == '.':
dotCount+=1
elif char == ',':
comCount+=1
#i know formatting below looks off but it's right
words = line.split()
lineCount += 1
wordCount += len(words)
for word in words:
# text = word.translate(string.punctuation)
exclude = set(string.punctuation)
text = ""
text = ''.join(ch for ch in text if ch not in exclude)
try:
if int(text) >= 0 or int(text) < 0:
numCount += 1
except ValueError:
pass
print("Line count: " + str(lineCount))
print("Word count: " + str(wordCount))
print("Number count: " + str(numCount))
print("Comma count: " + str(comCount))
print("Dot count: " + str(dotCount) + "\n")
I have it read a .txt file containing words, lines, dots, commas, and numbers. It will give me the correct number of dots commas and numbers, but the words and lines values will be each much much higher than they actually are. Any one know why? Thanks guys.

I don't know if this is actually the answer, but my reputation isn't high enough to comment, so I'm putting it here. You obviously don't need to accept it as the final answer if it doesn't solve the issue.
So, I think it might have something to do with the fact that all of your print statements are actually outside of the showCounts() function. Try indenting the print statements.
I hope this helps.

Related

Counting several instances of the same word from a text file

Complete beginner, searched a lot of threads but couldn't find a solution that fits me.
I have a text file, python_examples.txt which contains some words. On line four, the word hello appears twice in a row, like "hello hello".
My code is supposed to find the word the user inputs and count how many times it appears, it works but as I said, not if the same word appears multiple times on the same row. So there are 2 hellos on line 4 and one on line 13 but it only finds a total of 2 hellos. Fixes? Thanks,
user_input = input("Type in the word you are searching for: ")
word_count = 0
line_count = 0
with open ("python_example.txt", "r+") as f:
for line in f:
line_count += 1
if user_input in line:
word_count += 1
print("found " + user_input + " on line " + str(line_count))
else:
print ("nothing on line " + str(line_count))
print ("\nfound a total of " + str(word_count) + " words containing " + "'" + user_input + "'")
you can use str.count:
word_count += line.count(user_input)
instead of :
word_count += 1
it will count all appearance of user_input in the file line
The issue is with these two lines:
if user_input in line:
word_count += 1
You increase the count by 1 if the input appears on the line, regardless of whether it appears more than once.
This should do the job:
user_input = input("Type in the word you are searching for: ")
word_count = 0
with open("python_example.txt") as f:
for line_num, line in enumerate(f, start=1):
line_inp_count = line.count(user_input)
if line_inp_count:
word_count += line_inp_count
print(f"input {user_input} appears {line_inp_count} time(s) on line {line_num}")
else:
print(f"nothing on line {line_num}")
print(f"the input appeared a total of {word_count} times in {line_num} lines.")
Let me know if you have any questions :)
One option is use a library to parse the words in your text file rather than iterating one line at a time. There are several classes in nltk.tokenize which are easy to use.
import nltk.tokenize.regexp
def count_word_in_file(filepath, word):
"""Give the number for times word appears in text at filepath."""
tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()
with open(filepath) as f:
tokens = tokenizer.tokenize(f.read())
return tokens.count(word)
This handles awkward cases like the substring 'hell' appearing in 'hello' as mentioned in a comment, and is also a route towards case-insenstive matching, stemming, and other refinements.

Counting Paragraph and Most Frequent Words in Python Text File

I am trying to count the number of paragraphs and the most frequent words in a text file (any text file for that matter) but seem to have zero output when I run my code, no errors either. Any tips on where I'm going wrong?
filename = input("enter file name: ")
inf = open(filename, 'r')
#frequent words
wordcount={}
for word in inf.read().split():
if word not in wordcount:
wordcount[word] = 1
else:
wordcount[word] += 1
for key in wordcount.keys():
print ("%s %s " %(key , wordcount[key]))
#Count Paragraph(s)
linecount = 0
for i in inf:
paragraphcount = 0
if '\n' in i:
linecount += 1
if len(i) < 2: paragraphcount *= 0
elif len(i) > 2: paragraphcount = paragraphcount + 1
print('%-4d %4d %s' % (paragraphcount, linecount, i))
inf.close()
filename = raw_input("enter file name: ")
wordcount={}
paragraphcount = 0
linecount = 0
with open(filename, 'r') as ftext:
for line in ftext.readlines():
if line in ('\n', '\r\n'):
if linecount == 0:
paragraphcount = paragraphcount + 1
linecount = linecount + 1
else:
linecount = 0
#frequent words
for word in line.split():
wordcount[word] = wordcount.get(word,0) + 1
print wordcount
print paragraphcount
When you are reading a file, there is a cursor that indicates which byte you are reading at the moment. In your code, you are trying to read the file twice and encountered a strange behavior, which shoud have been a hint that you are doing something wrong. To the solution,
What is the correct way ?
You should read the file once, store every line, then find word count and paragraph count, using the same store. Rather than trying to reading it twice.
What is happening is the current code ?
When you first read the file, your byte cursor is set to the end of the file, when you try to read lines, if returns an empty list because it tries to read the end of the file. You can corrent this by resetting the file pointer(the cursor).
Call inf.seek(0) just before you try to read lines. But instead of this, you should be focusing on implementing a method I mentioned in the first section.

Error with expected an indented block, any one know?

a = "aabb"
words = list(a)
new_words = []
count = 1
for i in range(0,len(words)-1):
if words[i] == words[i+1]:
count = count + 1
if i+2 == len(words):
print words[i+1],count
else:
print words[i],count
count = 1
You have a file with mixed tabs and spaces and it's confusing you vs the interpreter. Replace all the tab characters in your file with spaces and then correctly indent the program.

find all spaces, newlines and tabs in a python file

def count_spaces(filename):
input_file = open(filename,'r')
file_contents = input_file.read()
space = 0
tabs = 0
newline = 0
for line in file_contents == " ":
space +=1
return space
for line in file_contents == '\t':
tabs += 1
return tabs
for line in file_contents == '\n':
newline += 1
return newline
input_file.close()
I'm trying to write a function which takes a filename as a parameter and returns the total number of all spaces, newlines and also tab characters in the file. I want to try use a basic for loop and if statement but I'm struggling at the moment :/ any help would be great thanks.
Your current code doesn't work because you're combining loop syntax (for x in y) with a conditional test (x == y) in a single muddled statement. You need to separate those.
You also need to use just a single return statement, as otherwise the first one you reach will stop the function from running and the other values will never be returned.
Try:
for character in file_contents:
if character == " ":
space +=1
elif character == '\t':
tabs += 1
elif character == '\n':
newline += 1
return space, tabs, newline
The code in Joran Beasley's answer is a more Pythonic approach to the problem. Rather than having separate conditions for each kind of character, you can use the collections.Counter class to count the occurrences of all characters in the file, and just extract the counts of the whitespace characters at the end. A Counter works much like a dictionary.
from collections import Counter
def count_spaces(filename):
with open(filename) as in_f:
text = in_f.read()
count = Counter(text)
return count[" "], count["\t"], count["\n"]
To support large files, you could read a fixed number of bytes at a time:
#!/usr/bin/env python
from collections import namedtuple
Count = namedtuple('Count', 'nspaces ntabs nnewlines')
def count_spaces(filename, chunk_size=1 << 13):
"""Count number of spaces, tabs, and newlines in the file."""
nspaces = ntabs = nnewlines = 0
# assume ascii-based encoding and b'\n' newline
with open(filename, 'rb') as file:
chunk = file.read(chunk_size)
while chunk:
nspaces += chunk.count(b' ')
ntabs += chunk.count(b'\t')
nnewlines += chunk.count(b'\n')
chunk = file.read(chunk_size)
return Count(nspaces, ntabs, nnewlines)
if __name__ == "__main__":
print(count_spaces(__file__))
Output
Count(nspaces=150, ntabs=0, nnewlines=20)
mmap allows you to treat a file as a bytestring without actually loading the whole file into memory e.g., you could search for a regex pattern in it:
#!/usr/bin/env python3
import mmap
import re
from collections import Counter, namedtuple
Count = namedtuple('Count', 'nspaces ntabs nnewlines')
def count_spaces(filename, chunk_size=1 << 13):
"""Count number of spaces, tabs, and newlines in the file."""
nspaces = ntabs = nnewlines = 0
# assume ascii-based encoding and b'\n' newline
with open(filename, 'rb', 0) as file, \
mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
c = Counter(m.group() for m in re.finditer(br'[ \t\n]', s))
return Count(c[b' '], c[b'\t'], c[b'\n'])
if __name__ == "__main__":
print(count_spaces(__file__))
Output
Count(nspaces=107, ntabs=0, nnewlines=18)
C=Counter(open(afile).read())
C[' ']
In my case tab(\t) is converted to " "(four spaces). So i have modified the
logic a bit to take care of that.
def count_spaces(filename):
with open(filename,"r") as f1:
contents=f1.readlines()
total_tab=0
total_space=0
for line in contents:
total_tab += line.count(" ")
total_tab += line.count("\t")
total_space += line.count(" ")
print("Space count = ",total_space)
print("Tab count = ",total_tab)
print("New line count = ",len(contents))
return total_space,total_tab,len(contents)

Where's the syntax error in this program?

This program I created in python is supposed to count the number of uppercase letters, lower case, digits, and number of white space characters in a text file. It keeps coming back with syntax error. I'm having trouble finding out why my error is.
infile = open("text.txt", "r")
uppercasecount, lowercasecount, digitcount = (0, 0, 0)
for character in infile.readlines():
if character.isupper() == True:
uppercasecount += 1
if character.islower() == True:
lowercasecount += 1
if character.isdigit() == True:
digitcount += 1
print(uppercasecount),(lowercasecount),(digitcount)
print "Total count is %d Upper case, %d Lower case and %d Digit(s)" %(uppercasecount, lowercasecount, digitcount)
change this:
print(uppercasecount),(lowercasecount),(digitcount)
to:
print uppercasecount,lowercasecount,digitcount
Instead of readlines, use read:
for character in infile.read():
readlines will read the whole file as a list of line
read will read the whole file as a string
Complete answer for python 2 and 3. Try this if you want to get count of letters not words
infile = open("text.txt", "r")
uppercasecount, lowercasecount, digitcount = (0, 0, 0)
for character in infile.read():
if character.isupper() == True:
uppercasecount += 1
if character.islower() == True:
lowercasecount += 1
if character.isdigit() == True:
digitcount += 1
print(uppercasecount,lowercasecount,digitcount)
print("Total count is %d Upper case, %d Lower case and %d Digit(s)" %(uppercasecount, lowercasecount, digitcount))

Categories