Counting each DIFFERENT character in text file - python

I'm relatively new to python and coding and I'm trying to write a code that counts the number of times each different character comes out in a text file while disregarding the case of the characters.
What I have so far is
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o',
'p','q','r','s','t','u','v','w','x','y','z']
prompt = "Enter filename: "
titles = "char count\n---- -----"
itemfmt = "{0:5s}{1:10d}"
totalfmt = "total{0:10d}"
whiteSpace = {' ':'space', '\t':'tab', '\n':'nline', '\r':'crtn'}
filename = input(prompt)
fname = filename
numberCharacters = 0
fname = open(filename, 'r')
for line in fname:
linecount +=1
word = line.split()
word += words
for word in words:
for char in word:
numberCharacters += 1
return numberCharacters
Somethings seems wrong about this. Is there a more efficient way to do my desired task?
Thanks!

from collections import Counter
frequency_per_character = Counter(open(filename).read().lower())
Then you can display them as you wish.

A better way would be to use the str methods such as isAlpha
chars = {}
for l in open('filename', 'rU'):
for c in l:
if not c.isalpha(): continue
chars[c] = chars.get(c, 0) + 1
And then use the chars dict to write the final histogram.

You over-complicating it, you can just convert your file content to a set in order to eliminate duplicated characters:
number_diff_value = len(set(open("file_path").read()))

Related

How to take out punctuation from string and find a count of words of a certain length?

I am opening trying to create a function that opens a .txt file and counts the words that have the same length as the number specified by the user.
The .txt file is:
This is a random text document. How many words have a length of one?
How many words have the length three? We have the power to figure it out!
Is a function capable of doing this?
I'm able to open and read the file, but I am unable to exclude punctuation and find the length of each word.
def samplePractice(number):
fin = open('sample.txt', 'r')
lstLines = fin.readlines()
fin.close
count = 0
for words in lstLines:
words = words.split()
for i in words:
if len(i) == number:
count += 1
return count
You can try using the replace() on the string and pass in the desired punctuation and replace it with an empty string("").
It would look something like this:
puncstr = "Hello!"
nopuncstr = puncstr.replace(".", "").replace("?", "").replace("!", "")
I have written a sample code to remove punctuations and to count the number of words. Modify according to your requirement.
import re
fin = """This is a random text document. How many words have a length of one? How many words have the length three? We have the power to figure it out! Is a function capable of doing this?"""
fin = re.sub(r'[^\w\s]','',fin)
print(len(fin.split()))
The above code prints the number of words. Hope this helps!!
instead of cascading replace() just use strip() a one time call
Edit: a cleaner version
pl = '?!."\'' # punctuation list
def samplePractice(number):
with open('sample.txt', 'r') as fin:
words = fin.read().split()
# clean words
words = [w.strip(pl) for w in words]
count = 0
for word in words:
if len(word) == number:
print(word, end=', ')
count += 1
return count
result = samplePractice(4)
print('\nResult:', result)
output:
This, text, many, have, many, have, have, this,
Result: 8
your code is almost ok, it just the second for block in wrong position
pl = '?!."\'' # punctuation list
def samplePractice(number):
fin = open('sample.txt', 'r')
lstLines = fin.readlines()
fin.close
count = 0
for words in lstLines:
words = words.split()
for i in words:
i = i.strip(pl) # clean the word by strip
if len(i) == number:
count += 1
return count
result = samplePractice(4)
print(result)
output:
8

reading a text file and counting how many times a word is repeated. Using .split function. Now wants it to ignore case sensitive

Getting the desired output so far.
The program prompts user to search for a word.
user enters it and the program reads the file and gives the output.
'ashwin: 2'
Now i want it to ignore case sensitive. For example, "Ashwin" and "ashwin" both shall return 2, as it contains two ashwin`s in the text file.
def word_count():
file = "test.txt"
word = input("Enter word to be searched:")
k = 0
with open(file, 'r') as f:
for line in f:
words = line.split()
for i in words:
if i == word:
k = k + 1
print(word + ": " + str(k))
word_count()
You could use lower() to compare the strings in this part if i.lower() == word.lower():
For example:
def word_count():
file = "test.txt"
word = input("Enter word to be searched:")
k = 0
with open(file, 'r') as f:
for line in f:
words = line.split()
for i in words:
if i.lower() == word.lower():
k = k + 1
print(word + ": " + str(k))
word_count()
You can either use .lower on the line and word to eliminate case.
Or you can use the built-in re module.
len(re.findall(word, text, flags=re.IGNORECASE))
Use the Counter class from collections that returns an dictionary with key value pairs that could be accessed using O(1) time.
from collections import Counter
def word_count():
file = "test.txt"
with open(file, 'r') as f:
words = f.read().replace('\n', '').lower().split()
count = Counter(words)
word = input("Enter word to be searched:")
print(word, ":", count.get(word.lower()))

find all spaces, newlines and tabs in a python file

def count_spaces(filename):
input_file = open(filename,'r')
file_contents = input_file.read()
space = 0
tabs = 0
newline = 0
for line in file_contents == " ":
space +=1
return space
for line in file_contents == '\t':
tabs += 1
return tabs
for line in file_contents == '\n':
newline += 1
return newline
input_file.close()
I'm trying to write a function which takes a filename as a parameter and returns the total number of all spaces, newlines and also tab characters in the file. I want to try use a basic for loop and if statement but I'm struggling at the moment :/ any help would be great thanks.
Your current code doesn't work because you're combining loop syntax (for x in y) with a conditional test (x == y) in a single muddled statement. You need to separate those.
You also need to use just a single return statement, as otherwise the first one you reach will stop the function from running and the other values will never be returned.
Try:
for character in file_contents:
if character == " ":
space +=1
elif character == '\t':
tabs += 1
elif character == '\n':
newline += 1
return space, tabs, newline
The code in Joran Beasley's answer is a more Pythonic approach to the problem. Rather than having separate conditions for each kind of character, you can use the collections.Counter class to count the occurrences of all characters in the file, and just extract the counts of the whitespace characters at the end. A Counter works much like a dictionary.
from collections import Counter
def count_spaces(filename):
with open(filename) as in_f:
text = in_f.read()
count = Counter(text)
return count[" "], count["\t"], count["\n"]
To support large files, you could read a fixed number of bytes at a time:
#!/usr/bin/env python
from collections import namedtuple
Count = namedtuple('Count', 'nspaces ntabs nnewlines')
def count_spaces(filename, chunk_size=1 << 13):
"""Count number of spaces, tabs, and newlines in the file."""
nspaces = ntabs = nnewlines = 0
# assume ascii-based encoding and b'\n' newline
with open(filename, 'rb') as file:
chunk = file.read(chunk_size)
while chunk:
nspaces += chunk.count(b' ')
ntabs += chunk.count(b'\t')
nnewlines += chunk.count(b'\n')
chunk = file.read(chunk_size)
return Count(nspaces, ntabs, nnewlines)
if __name__ == "__main__":
print(count_spaces(__file__))
Output
Count(nspaces=150, ntabs=0, nnewlines=20)
mmap allows you to treat a file as a bytestring without actually loading the whole file into memory e.g., you could search for a regex pattern in it:
#!/usr/bin/env python3
import mmap
import re
from collections import Counter, namedtuple
Count = namedtuple('Count', 'nspaces ntabs nnewlines')
def count_spaces(filename, chunk_size=1 << 13):
"""Count number of spaces, tabs, and newlines in the file."""
nspaces = ntabs = nnewlines = 0
# assume ascii-based encoding and b'\n' newline
with open(filename, 'rb', 0) as file, \
mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
c = Counter(m.group() for m in re.finditer(br'[ \t\n]', s))
return Count(c[b' '], c[b'\t'], c[b'\n'])
if __name__ == "__main__":
print(count_spaces(__file__))
Output
Count(nspaces=107, ntabs=0, nnewlines=18)
C=Counter(open(afile).read())
C[' ']
In my case tab(\t) is converted to " "(four spaces). So i have modified the
logic a bit to take care of that.
def count_spaces(filename):
with open(filename,"r") as f1:
contents=f1.readlines()
total_tab=0
total_space=0
for line in contents:
total_tab += line.count(" ")
total_tab += line.count("\t")
total_space += line.count(" ")
print("Space count = ",total_space)
print("Tab count = ",total_tab)
print("New line count = ",len(contents))
return total_space,total_tab,len(contents)

Counting prefixes from a csv file using python

Is there a way to make this python system count the prefixes in the array? I keep getting a prefixcount result of 0
Any help would be appreciated :)
The code I have is below
file = input("What is the csv file's name?")+".csv"
openfile = open(file)
text = sorted(openfile)
print("Sorted list")
print(text)
dictfile = open("DICT.txt")
prefix = ['de', 'dys', 'fore', 'wh']
prefixcount = 0
for word in text:
for i in range(0, len(prefix)):
if word>prefix[i]:
break
if word[0:len(prefix[i])] == prefix[i]:
prefixcount+=1
break
print(prefixcount)
Firstly, when you meet a condition where you want to skip an iteration, use continue - break ends the loop entirely.
Secondly, word>prefix[i] does not do what you want; it compares the strings lexicographically (see e.g. the Python docs), and you really want to know len(word) < len(prefix[i]).
I think what you want is:
prefixcount = 0
for word in text:
for pref in prefix:
if word.startswith(pref):
prefixcount += 1

Python Count not resetting?

I'm trying to insert an increment after the occurance of ~||~ in my .txt. I have this working, however I want to split it up, so after each semicolon, it starts back over at 1.
So Far I have the following, which does everything except split up at semicolons.
inputfile = "output2.txt"
outputfile = "/output3.txt"
f = open(inputfile, "r")
words = f.read().split('~||~')
f.close()
count = 1
for i in range(len(words)):
if ';' in words [i]:
count = 1
words[i] += "~||~" + str(count)
count = count + 1
f2 = open(outputfile, "w")
f2.write("".join(words))
Why not first split the file based on the semicolon, then in each segment count the occurences of '~||~'.
import re
count = 0
with open(inputfile) as f:
semicolon_separated_chunks = f.read().split(';')
count = len(re.findall('~||~', semicolon_separated_chunks))
# if file text is 'hello there ~||~ what is that; what ~||~ do you ~|| mean; nevermind ~||~'
# then count = 4
Instead of resetting the counter the way you are now, you could do the initial split on ;, and then split the substrings on ~||~. You'd have to store your words another way, since you're no longer doing words = f.read().split('~||~'), but it's safer to make an entirely new list anyway.
inputfile = "output2.txt"
outputfile = "/output3.txt"
all_words = []
f = open(inputfile, "r")
lines = f.read().split(';')
f.close()
for line in lines:
count = 1
words = line.split('~||~')
for word in words:
all_words.append(word + "~||~" + str(count))
count += 1
f2 = open(outputfile, "w")
f2.write("".join(all_words))
See if this works for you. You also may want to put some strategically-placed newlines in there, to make the output more readable.

Categories