Letter Count with Frequency, using Dictionaries - python

I was wondering if anyone could help me out.
How do I get this code to record ONLY the frequency of letters in a text file into a dictionary (does NOT count spaces, line, numbers, etc)?
Additionally how do I divide each letter by the total letters to report the percent frequency of each letter in the file?
This is what I have currently:
def linguisticCalc():
    """
    Asks user to input a VALID filename. File must be a text file. IF valid, returns the frequency of ONLY letters in file.
    """
    filename = input("Please type your VALID filename")
    if os.path.exists(filename) == True:
        with open(filename, 'r') as f:
            f_content = f.read()
            freq = {}
            for i in f_content:
                if i in freq:
                    freq[i] += 1
                else:
                    freq[i] = 1
        print(str(freq))
    else:
        print("This filename is NOT valid. Use the getValidFilename function to test inputs.")

Something that might help you determine whether the character in question is a letter, is this:
import string
# code here
if character in string.ascii_letters:
    # code here
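To show how that check fits both parts of the question, here is a minimal sketch. The function names letter_frequencies and percent_frequencies are hypothetical, and the text is passed in as a string rather than read from a file; in the original function it would be the f_content value.

```python
import string


def letter_frequencies(text):
    """Count only ASCII letters in text, ignoring spaces, digits and punctuation."""
    freq = {}
    for character in text:
        if character in string.ascii_letters:
            freq[character] = freq.get(character, 0) + 1
    return freq


def percent_frequencies(freq):
    """Convert raw counts to percentages of the total letter count."""
    total = sum(freq.values())
    if total == 0:
        return {}  # no letters at all, so avoid dividing by zero
    return {letter: 100 * count / total for letter, count in freq.items()}


counts = letter_frequencies("Hello, World 42!")
print(counts)
print(percent_frequencies(counts))
```

Upper-case and lower-case letters are counted separately here; calling text.lower() first would merge them.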

Check out collections.Counter()
You can use it to Count every letter in a string:
Counter('Articles containing potentially dated statements from 2011')
It gives this output, which is useful for counting characters in a string:
Counter({'A': 1,
'r': 2,
't': 8,
'i': 4,
'c': 2,
'l': 3,
'e': 5,
's': 3,
' ': 6,
'o': 3,
'n': 5,
'a': 4,
'g': 1,
'p': 1,
'y': 1,
'd': 2,
'm': 2,
'f': 1,
'2': 1,
'0': 1,
'1': 2})
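Note that the output above still includes the space and the digits. One possible way to restrict the count to letters, and to get the percentages asked about, is to filter the characters before handing them to Counter. This is only a sketch; str.isalpha() also accepts non-ASCII letters, which may or may not be what you want.

```python
from collections import Counter

text = 'Articles containing potentially dated statements from 2011'

# Filter out everything that is not a letter before counting.
letter_counts = Counter(ch for ch in text if ch.isalpha())
total = sum(letter_counts.values())

# Percent frequency of each letter, most common first.
for letter, count in letter_counts.most_common():
    print(f"{letter}: {count} ({100 * count / total:.1f}%)")
```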

Related

Ignore Whitespace while counting number of characters in a String

I am trying to write a function that counts the characters in an input string and stores the counts as key-value pairs in a dictionary. The code is partially working, i.e. it also counts the whitespace between words. How do I avoid counting the whitespace?
# Store characters of a string in a dictionary
def char_dict(string):
    char_dic = {}
    for i in string:
        if i in char_dic:
            char_dic[i] += 1
        else:
            char_dic[i] = 1
    return char_dic
print(char_dict('My name is Rajib'))
You could just continue if the character is a white space:
def char_dict(string):
    char_dic = {}
    for i in string:
        if ' ' == i:
            continue
        if i in char_dic:
            char_dic[i] += 1
        else:
            char_dic[i] = 1
    return char_dic
print(char_dict('My name is Rajib')) # {'j': 1, 'm': 1, 'M': 1, 'i': 2, 'b': 1, 'e': 1, 'a': 2, 'y': 1, 'R': 1, 'n': 1, 's': 1}
A cleaner solution would be:
from collections import defaultdict
def countNonSpaceChars(string):
    charDic = defaultdict(lambda: 0)
    for char in string:
        if char.isspace():
            continue
        charDic[char] += 1
    return dict(charDic)
print(countNonSpaceChars('My name is Rajib')) # {'i': 2, 'a': 2, 'R': 1, 'y': 1, 'M': 1, 'm': 1, 'e': 1, 'n': 1, 'j': 1, 's': 1, 'b': 1}
You can remove the spaces before counting with string = string.replace(" ", ""):
def char_dict(string):
    char_dic = {}
    string = string.replace(" ", "")
    for i in string:
        if i in char_dic:
            char_dic[i] += 1
        else:
            char_dic[i] = 1
    return char_dic
print(char_dict('My name is Rajib'))
To simplify things, the collections module has a Counter class that produces a dictionary-like object mapping each character to its number of occurrences in a string. Then I would simply remove the space key, if it is present, using the del keyword.
from collections import Counter
def char_dict(string):
    c = Counter(string)  # count the argument, not a hard-coded text
    if ' ' in c: del c[' ']
    return c
print(char_dict('My name is Rajib'))
This method is very readable and doesn't require too much reinventing.

count characters of a string using function

Write a function named count_letters that takes as a parameter a string and returns a dictionary that tabulates how many of each letter is in that string. The string can contain characters other than letters, but only the letters should be counted. The string could even be the empty string. Lower-case and upper-case versions of a letter should be part of the same count. The keys of the dictionary should be the upper-case letters. If a letter does not appear in the string, then it would not get added to the dictionary. For example, if the string is
"AaBb"
then the dictionary that is returned should contain these key-value pairs:
{'A': 2, 'B': 2}
def count_letters(string):
    """counts all the letters in a given string"""
    your_dict = dict()
    for x in string:
        x = x.upper() # makes lowercase upper
        if x not in your_dict:
            your_dict[x] = 1
        else:
            your_dict[x] += 1
    return your_dict
I am getting the following error when I go to upload:
Test Failed: {'Q': 1, 'U': 3, 'I': 3, 'S': 6, ' ': 3, 'C[48 chars]': 1} != {'S': 6, 'U': 3, 'I': 3, 'T': 3, 'O': 3, 'C[32 chars]': 1}
+ {'C': 2, 'D': 2, 'E': 2, 'I': 3, 'O': 3, 'P': 1, 'Q': 1, 'S': 6, 'T': 3, 'U': 3}
- {' ': 3,
- '?': 1,
- 'C': 2,
- 'D': 2,
- 'E': 2,
- 'I': 3,
- 'O': 3,
- 'P': 1,
- 'Q': 1,
- 'S': 6,
- 'T': 3,
- 'U': 3}
Try something like this. Feel free to adjust it to your requirements:
import collections
def count_letters(string):
    return collections.Counter(string.upper())
print(count_letters('Google'))
Output: Counter({'G': 2, 'O': 2, 'L': 1, 'E': 1})
For documentation of the Counter dict subclass in the collections module, see the Python standard library documentation.
Update without using collections module:
def count_letters(string):
    your_dict = {}
    for i in string.upper():
        if i in your_dict:
            your_dict[i] += 1
        else:
            your_dict[i] = 1
    return your_dict
Output: {'G': 2, 'O': 2, 'L': 1, 'E': 1}
This solution does use collections, but unlike with Counter we aren’t getting the entire solution from a single library function. I hope it’s permitted, and if it isn’t, that it will at least be informative in some way.
import collections as colls
def count_letters(str_in):
    str_folded = str_in.casefold()
    counts = colls.defaultdict(int)
    for curr_char in str_folded:
        counts[curr_char] += 1
    return counts
defaultdict is extremely practical. As the name indicates, when we try to index the dictionary with a key that doesn’t exist, it creates a default value for that key and then carries out our original operation. In this case, since we declare that our defaultdict will use int as the factory for its values, the default value is 0.
str.casefold() is a method designed specifically for the complex problem that is case-insensitive comparison. While it is unlikely to make a difference here, it’s a good function to know.
Let me know if you have any questions :)
Without using collections, here is a solution:
def count_letters(string):
    string = string.upper()
    counts = {}
    for a in set(string):
        counts[a] = string.count(a)
    return counts
This function iterates over set(string), which contains every distinct character used in your string (not only letters), without duplicates and in uppercase. Then it counts how many times each character appears in your string, and adds it to your counts dictionary.
I hope this answers your question. :)
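The failing test in the question shows ' ' and '?' being counted, so the missing piece is a filter that skips non-letters. A minimal sketch of one way to add it, assuming str.isalpha() is an acceptable definition of "letter" for the grader:

```python
def count_letters(string):
    """Count letters case-insensitively, keyed by upper-case letter."""
    counts = {}
    for char in string.upper():
        if char.isalpha():  # skip spaces, digits and punctuation
            counts[char] = counts.get(char, 0) + 1
    return counts


print(count_letters("AaBb"))       # {'A': 2, 'B': 2}
print(count_letters("Is it? 42"))  # {'I': 2, 'S': 1, 'T': 1}
```

An empty string simply returns an empty dictionary, since the loop body never runs.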

Replacing each letter with its number self

I want to take each letter of a word (a is 1, b is 2, etc.), then add them all together to find the sum of all the numbers. For example, “apple” would be 50. I have this code:
conversions = {
'a': 1,
'b': 2,
'c': 3,
'd': 4,
'e': 5,
'f': 6,
'g': 7,
'h': 8,
'i': 9,
'j': 10,
'k': 11,
'l': 12,
'm': 13,
'n': 14,
'o': 15,
'p': 16,
'q': 17,
'r': 18,
's': 19,
't': 20,
'u': 21,
'v': 22,
'w': 23,
'x': 24,
'y': 25,
'z': 26
}
def conversion(word):
    for letter in word:
        word.replace(letter, str(conversions[letter]))
    word = list(word)
    for number in word:
        number = int(number)
    return sum(word)
However, this returns the following error:
invalid literal for int() with base 10
I’ve probably done some dumb mistake, but I can’t seem to figure out what the problem is. Any help would be much appreciated.
Strings are immutable. word.replace() returns the modified string; it doesn't update word. And you can't turn a string into a list of numbers that way anyway.
You don't need to use replace at all, just add up the conversions.
def conversion(word):
    return sum(conversions[letter] for letter in word)
You are definitely making things unnecessarily complicated. All lowercase alphabetic characters are sequentially coded. Get their codes and add them up:
def conversion(word):
    return sum(ord(x) - ord('a') + 1 for x in word)
conversion('apple')
# 50
Beware that this code will not handle upper-case letters or punctuation correctly.
Try this (this will handle upper and lowercase):
def conversion(word):
    return sum([ord(x.lower()) - 96 for x in word])
>>> conversion("AppLe")
50
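The versions above still produce wrong sums if the word contains punctuation, digits or spaces. A small sketch that ignores anything outside a-z, assuming only ASCII letters should contribute to the sum:

```python
def conversion(word):
    """Sum the alphabet positions (a=1 ... z=26) of the ASCII letters in word."""
    total = 0
    for char in word.lower():
        if 'a' <= char <= 'z':  # ignore digits, spaces and punctuation
            total += ord(char) - ord('a') + 1
    return total


print(conversion("apple"))    # 50
print(conversion("App-Le!"))  # 50
```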

Creating a function in Python that counts number of letters in a dictionary [duplicate]

This question already has answers here:
Counting each letter's frequency in a string
(2 answers)
Closed 4 years ago.
How do I create a function that will let me input a word, and it will execute to create a dictionary that counts individual letters in the code. I would want it to display as a dictionary, for example, by inputting 'hello' it will display {'e': 1, 'h': 1, 'l': 2, 'o': 1}
I AM ALSO required to have 2 arguments in the function, one for the string and one for the dictionary. THIS IS DIFFERENT to the "Counting each letter's frequency in a string" question.
For example, I think I would have to start as,
d = {}
def count(text, d ={}):
    count = 0
    for l in text:
        if l in d:
            count +=1
        else:
            d.append(l)
    return count
But this is incorrect? Also, would I need to set a default value for text, by writing text="" in case the user does not actually enter any word?
Furthermore, if there were existing values already in the dictionary, I want it to add to those existing counts. How would this be achieved?
For example, if dct = {'e': 1, 'h': 1, 'l': 2, 'o': 1} and I now run >>> count_letters('hello', dct) in the terminal, the result would be {'e': 2, 'h': 2, 'l': 4, 'o': 2}
If you can use Pandas, you can use value_counts():
import pandas as pd
word = "hello"
letters = [letter for letter in word]
pd.Series(letters).value_counts().to_dict()
Output:
{'e': 1, 'h': 1, 'l': 2, 'o': 1}
Otherwise, use a dict comprehension to initialise the counts and a loop to fill them in:
letter_ct = {letter: 0 for letter in word}
for letter in word:
    letter_ct[letter] += 1
letter_ct
You can use Python's defaultdict:
from collections import defaultdict
def word_counter(word):
    word_dict = defaultdict(int)
    for letter in word:
        word_dict[letter] += 1
    return word_dict
print(word_counter('hello'))
Output:
defaultdict(<class 'int'>, {'h': 1, 'e': 1, 'l': 2, 'o': 1})
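def count_freqs(string, dictionary=None):
    # Create a fresh dict per call; a mutable default ({}) would be shared between calls.
    if dictionary is None:
        dictionary = {}
    for letter in string:
        if letter not in dictionary:
            dictionary[letter] = 1
        else:
            dictionary[letter] += 1
    return dictionary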
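To address the two-argument requirement and the "add to existing counts" part of the question, here is a hedged sketch. It uses the name count_letters from the question's terminal example, and defaults the dictionary to None so that separate calls never share one dictionary by accident.

```python
def count_letters(text="", counts=None):
    """Add the character counts of text to counts and return the dictionary."""
    if counts is None:
        counts = {}  # fresh dictionary per call, never a shared default
    for letter in text:
        counts[letter] = counts.get(letter, 0) + 1
    return counts


dct = count_letters('hello')
print(dct)                          # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
print(count_letters('hello', dct))  # {'h': 2, 'e': 2, 'l': 4, 'o': 2}
print(count_letters('hi'))          # {'h': 1, 'i': 1}
```

Passing an existing dictionary updates it in place, which is what produces the doubled counts in the second call.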

My NLTK code almost does what I need it to, but not quite

Code:
def add_lexical_features(fdist, feature_vector):
    for word, freq in fdist.items():
        fname = "unigram:{0}".format(word)
        if selected_features == None or fname in selected_features:
            feature_vector[fname] = 1
        if selected_features == None or fname in selected_features:
            feature_vector[fname] = float(freq) / fdist.N()
    print(feature_vector)
if __name__ == '__main__':
    file_name = "restaurant-training.data"
    p = process_reviews(file_name)
    for i in range(0, len(p)):
        print(p[i]+ "\n")
        uni_dist = nltk.FreqDist(p[0])
        feature_vector = {}
        x = add_lexical_features(uni_dist, feature_vector)
What this is trying to do is output the frequency of words in the list of reviews (p being the list of reviews, p[0] being the string). And this works... except it does it by letter, not by word.
I am still new to NLTK, so this might be obvious, but I really can't get it.
For example, this currently outputs a large list of things like:
{'unigram:n': 0.0783132530120482}
This is fine, and I think that is the right number (number of time n appears over total letters) but I want it to be by word, not by letter.
Now, I also want it to do this by bigrams. Once I can get it working for single words, extending it to pairs of words might be easy, but I am not quite seeing it, so some guidance there would be nice.
Thanks.
The input to nltk.FreqDist should be a list of strings, not just a string. See the difference:
>>> import nltk
>>> uni_dist = nltk.FreqDist(['the', 'dog', 'went', 'to', 'the', 'park'])
>>> uni_dist
FreqDist({'the': 2, 'went': 1, 'park': 1, 'dog': 1, 'to': 1})
>>> uni_dist2 = nltk.FreqDist('the dog went to the park')
>>> uni_dist2
FreqDist({' ': 5, 't': 4, 'e': 3, 'h': 2, 'o': 2, 'a': 1, 'd': 1, 'g': 1, 'k': 1, 'n': 1, ...})
You can convert your string into a list of individual words using split.
Side note: I think you might want to be calling nltk.FreqDist on p[i] rather than p[0].
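To illustrate the split step and the bigram follow-up, here is a rough sketch that uses collections.Counter as a stand-in so it runs without NLTK installed; the same list of words is what you would pass to nltk.FreqDist. The sample sentence and variable names are only for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

review = "the dog went to the park"

# Split the string into words first; counting the raw string counts characters.
words = review.split()

unigram_counts = Counter(words)
# Pair each word with its successor to form bigrams.
bigram_counts = Counter(zip(words, words[1:]))

# Relative frequency per word, mirroring the "unigram:..." feature names.
total_words = len(words)
unigram_features = {
    "unigram:{0}".format(word): count / total_words
    for word, count in unigram_counts.items()
}
print(unigram_features["unigram:the"])
print(bigram_counts[("the", "dog")])
```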
