Dictionary of punctuation counts for list of strings - python

How can I use dict comprehension to build a dictionary of punctuation counts for a list of strings? I was able to do it for a single string like this:
import string
test_string = "1990; and 1989', \ '1975/97', '618-907 CE"
counts = {p:test_string.count(p) for p in string.punctuation}
Edit: For anyone who may want this in the future, here's Patrick Artner's answer copied from below, with a very small modification to keep only punctuation counts:
# return punctuation Counter dict for string/list/pd.Series
import string
from collections import Counter
from itertools import chain
def count_punctuation(str_series_or_list):
    c = Counter(chain(*str_series_or_list))
    unwanted = set(c) - set(string.punctuation)
    for unwanted_key in unwanted:
        del c[unwanted_key]
    return c

Why count yourself?
import string
from collections import Counter
test_string = "1990; and 1989', \ '1975/97', '618-907 CE"
c = Counter(test_string)  # counts all occurrences
for p in string.punctuation:  # prints only the characters in string.punctuation
    print(p, c[p])  # access like a dictionary (it's a subclass of dict)
print(c)
Output:
! 0
" 0
# 0
$ 0
% 0
& 0
' 4
( 0
) 0
* 0
+ 0
, 2
- 1
. 0
/ 1
: 0
; 1
< 0
= 0
> 0
? 0
@ 0
[ 0
\ 1
] 0
^ 0
_ 0
` 0
{ 0
| 0
} 0
~ 0
Counter({'9': 7, ' ': 6, '1': 4, "'": 4, '7': 3, '0': 2, '8': 2, ',': 2, ';': 1, 'a': 1, 'n': 1, 'd': 1, '\\': 1, '5': 1, '/': 1, '6': 1, '-': 1, 'C': 1, 'E': 1})
Counter is dictionary-like: see https://docs.python.org/2/library/collections.html#collections.Counter
Edit: multiple strings in a list:
import string
from collections import Counter
from itertools import chain
test_strings = [ "1990; and 1989', \ '1975/97', '618-907 CE" , "someone... or no one? that's the question!", "No I am not!"]
c = Counter(chain(*test_strings))
for p in string.punctuation:
    print(p, c[p])
print(c)
Output: (removed 0-Entries)
! 2
' 5
, 2
- 1
. 3
/ 1
; 1
? 1
\ 1
Counter({' ': 15, 'o': 8, '9': 7, 'n': 6, "'": 5, 'e': 5, 't': 5, '1': 4, 'a': 3, '7': 3, 's': 3, '.': 3, '0': 2, '8': 2, ',': 2, 'm': 2, 'h': 2, '!': 2, ';': 1, 'd': 1, '\\': 1, '5': 1, '/': 1, '6': 1, '-': 1, 'C': 1, 'E': 1, 'r': 1, '?': 1, 'q': 1, 'u': 1, 'i': 1, 'N': 1, 'I': 1})
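Since the question asked for a dict comprehension, here is a minimal sketch that combines it with the Counter approach above. It assumes you only want punctuation that actually occurs; the name punctuation_counts is mine, not from the original answer.

```python
import string
from collections import Counter
from itertools import chain

# Same sample strings as in the answer above (backslash written as an explicit escape).
test_strings = [
    "1990; and 1989', \\ '1975/97', '618-907 CE",
    "someone... or no one? that's the question!",
    "No I am not!",
]

# Count every character once across all strings.
all_counts = Counter(chain.from_iterable(test_strings))

# Dict comprehension: keep only punctuation characters with a non-zero count.
punctuation_counts = {p: all_counts[p] for p in string.punctuation if all_counts[p]}

print(punctuation_counts)
```

Drop the `if all_counts[p]` filter if you also want the zero entries, as in the first output above.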

Related

Count the occurrences of each character in a alpha-numeric column in a DataFrame

I have an alphanumeric column in a DataFrame. I would like to get the total number of times each character (0-9, A-Z) occurs in the entire column.
e.g.
Serial
03000395
A000458B
667BC345
Desired Output
Character Counts
0 7
1 0
2 0
3 3
.
.
A 1
B 2
C 1
..
Z
You can use Counter to get this
from collections import Counter
Counter('03000395 A000458B 667BC345')
Output:
Counter({'0': 7,
'3': 3,
'9': 1,
'5': 3,
' ': 2,
'A': 1,
'4': 2,
'8': 1,
'B': 2,
'6': 2,
'7': 1,
'C': 1})
EDIT: After clarifying comments about this being in a dataframe you can do this:
pd.DataFrame(df.Serial.value_counts()).reset_index().rename({'index':'Character','Serial':'Count'})
EDIT2: After even further clarification, I think this is what you want
counts = dict(Counter(''.join(df.Serial)))
pd.DataFrame({"Character": list(counts.keys()), "Count": list(counts.values())})
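The desired output in the question also lists characters with a count of 0, which a plain Counter will not show. A minimal sketch of one way to get that, using a plain list as a stand-in for df.Serial (the names serials and character_counts are my own):

```python
import string
from collections import Counter

# Stand-in for df.Serial from the question (a plain list, so no pandas is needed here).
serials = ["03000395", "A000458B", "667BC345"]

# Count the characters that actually occur.
found = Counter("".join(serials))

# Report every digit and uppercase letter, including those that never occur.
alphabet = string.digits + string.ascii_uppercase
character_counts = {ch: found[ch] for ch in alphabet}

for ch, count in character_counts.items():
    print(ch, count)
```

This works because a Counter returns 0 for a missing key instead of raising KeyError.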

Creating a program that returns a score by using a key on a list

I'm basically trying to read a txt file, remove all symbols and punctuation that isn't in the alphabet (A-Z), and then produce an output that lists out all the words in the file with a score side by side. In order to get the score I'm trying to compare each letter of the word to a key. This key represents how much the letter is worth. By adding up all of the letter values for the given word, I'll get the total score for that word.
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
'y': 2, 'z': 9}
This is what I have so far, but I'm completely stuck.
with open("hunger_games.txt") as p:
    text = p.read()
    text = text.lower()
text = text.split()
new = []
for word in text:
    if word.isalpha() == False:
        new.append(word[:-1])
    else:
        new.append(word)
class TotalScore():
    def score():
        total = 0
        for word in new:
            for letter in word:
                total += alphakey[letter]
            return total
I'm trying to get something like:
you 5
by 4
cool 10
etc. for all the words in the list. Thanks in advance for any help.
As pointed out in the comments, you don't need a class for that, and your return is mis-indented. Otherwise I think your score function does what you need to compute the total score.
If you need to have a per-word score you can make use of a dictionary (again), to store these:
import re
def word_score(word):
    return sum(alphakey[l] for l in word)
def text_scores(filename):
    with open(filename) as p:
        text = p.read()
    # keep only lowercase letters and whitespace, so words on separate lines stay separated
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return {w: word_score(w) for w in text.split()}
print(text_scores("hunger_games.txt"))
If hunger_games.txt contains "you by cool", then this prints:
{'you': 8, 'by': 9, 'cool': 8}
Does the punctuation have to be removed? Or are you doing that so that you can match up the keys of the dictionary? If you are okay with the punctuation staying in then this can be solved in a few lines:
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
'y': 2, 'z': 9}
with open("hunger_games.txt") as p:
    text = p.read()
text = text.lower()
words = text.split()
uniqueWords = {}
for word in words:
    if word not in uniqueWords:
        uniqueWords[word] = sum([alphakey[letter] for letter in word if letter.isalpha()])
print(uniqueWords)
The line that computes the sum might need a bit of explanation. First off,
[alphakey[letter] for letter in word if letter.isalpha()]
is an example of a "list comprehension". List comprehensions are a very useful feature of Python that let us create an entire list in a single line. The one above goes through every letter in word and, if the letter is alphabetical, looks up its value in alphakey. For example, if the word was:
"hello"
it would return the list:
[5, 7, 2, 2, 1]
If the word was:
"w4h&t"
the list comprehension would ignore the "4" and "&" and return the list:
[2, 5, 7]
To turn those into a single value, we wrap the comprehension in the sum function. So the final value is 17 for the word "hello", and 14 for "w4h&t".
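Here is the same idea as a small self-contained sketch you can run to check the two examples. Note one deliberate difference, which is my own assumption about what you want: it tests `letter in alphakey` rather than `letter.isalpha()`, so accented letters such as "é" are skipped instead of raising a KeyError.

```python
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
            'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
            'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
            's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
            'y': 2, 'z': 9}


def word_score(word):
    """Sum the alphakey values of the letters in word, ignoring anything else."""
    return sum(alphakey[letter] for letter in word.lower() if letter in alphakey)


print(word_score("hello"))  # 5 + 7 + 2 + 2 + 1 = 17
print(word_score("w4h&t"))  # 2 + 5 + 7 = 14
```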
I suggest using nltk for text manipulation.
Here is my solution (you could shorten some chunks of code; I just kept it visually simple to understand).
Basically, we split the text into a list of words, remove all duplicates using the set() function, and then loop through the words calculating each score. I hope the code is quite clear.
import nltk
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
'y': 2, 'z': 9}
text = """
boy girl girl boy dog Dog car cAr dog girl you by cool 123asd .asd; 12asd
"""
words = []
results = {}
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    words += nltk.word_tokenize(sentence)
words = list(set([word.lower() for word in words]))
for word in words:
    if word.isalpha():
        total = 0
        for letter in word:
            total += alphakey[letter]
        results[word] = total
for val in results:
    print(f"{val} {results[val]}")
output:
dog 7
you 8
by 9
boy 10
cool 8
car 13
girl 11

counting individual characters in a large list of passwords without killing loop

Out of curiosity, I'm trying to count, for example, how many a's there are in a massive list of passwords. But I think that when I try to add to the count for a character, it kills the loop that goes through all the characters.
#Examine passwords.txt
file = open('passwords.txt','r')
a = 0
b = 0
c = 0
d = 0
e = 0
f = 0
g = 0
h = 0
i = 0
j = 0
k = 0
l = 0
m = 0
n = 0
o = 0
p = 0
q = 0
r = 0
s = 0
t = 0
u = 0
v = 0
w = 0
x = 0
y = 0
z = 0
with open('passwords.txt','r') as fileobj:
    for line in fileobj:
        for char in line:
            if char == a:
                a += 1
            elif char == b:
                b += 1
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
You should use a dictionary (or a list which has an order) to store how many counts of each letter there are. This is much better than using 26 one letter variables which is ridiculous!
To create a dictionary, you can use a dictionary-comprehension with str.count on the entire contents of the file.
with open('passwords.txt','r') as fileobj:
    text = fileobj.read()
letterCounts = {c: text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"}
which would give letterCounts as something like:
{'s': 0, 'a': 4, 'o': 0, 'i': 0, 'm': 0, 'k': 0, 'q': 0, 'y': 0, 'c': 1, 'j': 0, 'b': 4, 'g': 0, 'd': 1, 'h': 0, 'e': 0, 'f': 0, 'u': 0, 'n': 0, 'w': 0, 't': 0, 'x': 0, 'p': 0, 'l': 0, 'r': 0, 'z': 0, 'v': 0}
from collections import Counter
with open('passwords.txt','r') as fileobj:
    counts = Counter()
    for line in fileobj:
        counts.update(line)
counts is a Counter keeping track of the counts of all the characters that appear in the file. You would access the number of a's with counts['a'].
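To try the line-by-line Counter approach without a real passwords file, here is a minimal sketch. The io.StringIO object and its contents are made up and only stand in for open('passwords.txt'); letter_counts is my own name.

```python
import io
import string
from collections import Counter

# io.StringIO stands in for open('passwords.txt', 'r'); the contents are invented.
fileobj = io.StringIO("abba\nabcd\nabba\n")

counts = Counter()
for line in fileobj:
    # update() adds one count per character in the line, newline included.
    counts.update(line)

# Keep only the lowercase letters, with 0 for letters that never appear.
letter_counts = {c: counts[c] for c in string.ascii_lowercase}

print(letter_counts["a"], letter_counts["b"], letter_counts["z"])
```

Reading line by line keeps memory use low for a massive file, since only the counts are held in memory rather than the whole text.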

How to separate upper and lower case with counter?

I am thinking of something with collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
import collections
c = collections.Counter(s)
As a result I have
Counter({' ': 8,
',': 1,
'.': 1,
'?': 1,
'H': 1,
'M': 1,
'R': 1,
'T': 1,
'a': 2,
'd': 1,
'e': 5,
'f': 1,
'g': 1,
'h': 2,
'i': 2,
'l': 2,
'n': 1,
'o': 4,
'r': 3,
's': 3,
't': 1,
'u': 2,
'w': 1,
'y': 2})
If I try sum, I get a syntax error:
print sum(1 for i in c if i.isupper())
File "<ipython-input-21-66a7538534ee>", line 4
print sum(1 for i in c if i.isupper())
^
SyntaxError: invalid syntax
How should I count only upper or lower from the counter?
You can wrap the generator expression in its own (), although the extra pair is optional when the generator is the only argument:
sum((1 for x in c if x.isupper()))
4
EDIT: As @Błotosmętek suggests, the actual cause of the SyntaxError is the missing () on print. I guess you are using Python 3, where print is a function, so you should use print(...).
You can try something like this:
import collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = collections.Counter([ch for ch in s if ch.isupper()])
# Change to ch.islower() if you need lower case
# c = collections.Counter([ch for ch in s if ch.islower()])
print(c)
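One thing to watch: iterating over the Counter itself visits each distinct character once, so sum(1 for i in c if i.isupper()) gives the number of distinct uppercase characters, not the number of occurrences. A short sketch of both, reusing the Counter you already have (the variable names are mine):

```python
import collections

s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = collections.Counter(s)

# Total occurrences: add up the stored counts for matching characters.
upper_total = sum(n for ch, n in c.items() if ch.isupper())
lower_total = sum(n for ch, n in c.items() if ch.islower())

# Distinct characters: count the keys only.
distinct_upper = sum(1 for ch in c if ch.isupper())

print(upper_total, lower_total, distinct_upper)
```

For this string the two uppercase numbers happen to agree, because each uppercase letter appears only once.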

Convert every character in a String to a Dictionary Key

Suppose I have a string "abcdefghijklmnopqrstuvwxyz" and I want to initialize dictionary keys with those values.
alphabet = 'abcdefghijklmnopqrstuvwxyz'
alphabetDict = dict()
for char in alphabet:
    alphabetDict[char] = 0
Is there a better way of doing that?
You can use dict.fromkeys() method -
>>> s = 'abcdefghijklmnopqrstuvwxyz'
>>> alphaDict = dict.fromkeys(s,0)
>>> alphaDict
{'m': 0, 'p': 0, 'i': 0, 'n': 0, 'd': 0, 'w': 0, 'k': 0, 'y': 0, 's': 0, 'b': 0, 'h': 0, 't': 0, 'u': 0, 'q': 0, 'g': 0, 'l': 0, 'e': 0, 'a': 0, 'j': 0, 'c': 0, 'o': 0, 'f': 0, 'v': 0, 'x': 0, 'z': 0, 'r': 0}
From documentation -
fromkeys(seq[, value])
Create a new dictionary with keys from seq and values set to value.
fromkeys() is a class method that returns a new dictionary. value defaults to None.
Please note, you should not use this if value is something mutable like a list or another dict, because the value is evaluated only once when you call fromkeys(), and all keys point to the same object.
You can use this for immutable types as value like int, str , etc.
Also, you do not need to specify the s or alphabet string, you can instead use string.ascii_lowercase . Example -
>>> import string
>>> alphaDict = dict.fromkeys(string.ascii_lowercase,0)
>>> alphaDict
{'m': 0, 'p': 0, 'i': 0, 'n': 0, 'd': 0, 'w': 0, 'k': 0, 'y': 0, 's': 0, 'b': 0, 'h': 0, 't': 0, 'u': 0, 'q': 0, 'g': 0, 'l': 0, 'e': 0, 'a': 0, 'j': 0, 'c': 0, 'o': 0, 'f': 0, 'v': 0, 'x': 0, 'z': 0, 'r': 0}
You can use dictionary comprehensions in Python.
alphabetDict = {char: 0 for char in alphabet}
Dictionaries (Python Docs)
There is a minor difference between this answer and Anand's above. Dict comprehensions evaluate the value for every key, while fromkeys only does it once. If you're using things like ints, this poses no problem. However, if you do
d = {key: [] for key in <some set>}
d[3].append(5)
print(d[2])
gives you
[]
and you have distinct lists, while
d = dict.fromkeys(<some set>, [])
d[3].append(5)
print(d[2])
gives you
[5]
because fromkeys maps all the keys to the same list.
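A runnable version of that comparison, as a minimal sketch with a concrete list of keys in place of <some set> (the names keys, shared and separate are just for illustration):

```python
keys = [1, 2, 3]

# fromkeys evaluates [] once, so every key refers to the same list object.
shared = dict.fromkeys(keys, [])
shared[3].append(5)

# The comprehension evaluates [] once per key, so each key gets its own list.
separate = {key: [] for key in keys}
separate[3].append(5)

print(shared[2])               # [5]
print(separate[2])             # []
print(shared[1] is shared[2])  # True
```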
Yes, you can do that in one line using dictionary comprehensions.
In [1]: alphabet = 'abcdefghijklmnopqrstuvwxyz'
In [2]: {x:0 for x in alphabet} # dictionary comprehension
Out[2]:
{'a': 0,
'b': 0,
'c': 0,
'd': 0,
'e': 0,
'f': 0,
'g': 0,
'h': 0,
'i': 0,
'j': 0,
'k': 0,
'l': 0,
'm': 0,
'n': 0,
'o': 0,
'p': 0,
'q': 0,
'r': 0,
's': 0,
't': 0,
'u': 0,
'v': 0,
'w': 0,
'x': 0,
'y': 0,
'z': 0}
I tried another approach. Note that the values here are the string '0' rather than the integer 0:
dict(zip(alphabet, '0'*len(alphabet)))
If you need a dictionary with different values instead of a constant value, you may create one like below with the use of random module:
>>> import random
>>> alphabet = 'abcdefghijklmnopqrstuvwxyz'
>>> my_dict = dict([ (ch, random.randint(1,len(alphabet)) ) for ch in alphabet ] )
>>> my_dict
{'a': 17, 'b': 15, 'c': 3, 'd': 5, 'e': 5, 'f': 13, 'g': 7, 'h': 1, 'i': 3, 'j': 12, 'k': 11, 'l': 7, 'm': 8, 'n': 23, 'o': 15, 'p': 7, 'q': 9, 'r': 19, 's': 17, 't': 22, 'u': 20, 'v': 24, 'w': 26, 'x': 14, 'y': 7, 'z': 24}
>>>
I create dictionaries like the one above when I need a dictionary with random values for testing purposes.
Another way to create a dictionary with each char of a text with character count.
>>> char_count = lambda text, char: text.count(char)
>>> text = "Genesis 1 - 1 In the beginning God created the heavens and the earth. 2 Now the earth was formless and desolate, and there was darkness upon the surface of the watery deep, and God's active force was moving about over the surface of the waters."
>>> my_dict = dict( [ ( char, char_count(text, char) ) for char in text ] )
>>> my_dict
{'G': 3, 'e': 32, 'n': 13, 's': 15, 'i': 5, ' ': 45, '1': 2, '-': 1, 'I': 1, 't': 17, 'h': 12, 'b': 2, 'g': 3, 'o': 12, 'd': 10, 'c': 5, 'r': 12, 'a': 19, 'v': 4, '.': 2, '2': 1, 'N': 1, 'w': 6, 'f': 6, 'm': 2, 'l': 2, ',': 2, 'k': 1, 'u': 4, 'p': 2, 'y': 1, "'": 1}
Explanation:
1. The lambda function counts the number of occurrences of a character.
2. Call lambda function for each character in text to get the count of that particular character.
Note: You may improve this code to avoid duplicate calls for repeated characters.
Using dictionary comprehension may be easier than all above:
{ char:(text.count(char)) for char in text }
To avoid the duplication mentioned by @Robert Ranjan, we can do it this way:
>>> import string
>>> char_count = lambda text, char: text.count(char)
>>> allAscii = list(string.printable)
>>> # alphabet = 'abcdefghijklmnopqrstuvwxyz'
>>> text = "Genesis 1 - 1 In the beginning God created the heavens and the earth. 2 Now the earth was formless and desolate, and * @ there was darkness upon the surface of the watery deep, and God's active force was moving about over the surface of the waters."
>>> # my_dict = dict( [ ( char, char_count(text, char) ) for char in alphabet] )
>>> my_dict = dict( [ ( char, char_count(text, char) ) for char in allAscii] )
>>> for eachKey in my_dict:
...     print(repr(eachKey), ': ', my_dict[eachKey], ' ', end=' || ')
...
'0' : 0 || '1' : 2 || '2' : 1 || '3' : 0 || '4' : 0 || '5' : 0 || '6' : 0 || '7' : 0 || '8' : 0 || '9' : 0 || 'a' : 19 || 'b' : 2 || 'c' : 5 || 'd' : 10 || 'e' : 32 || 'f' : 6 || 'g' : 3 || 'h' : 12 || 'i' : 5 || 'j' : 0 || 'k' : 1 || 'l' : 2 || 'm' : 2 || 'n' : 13 || 'o' : 12 || 'p' : 2 || 'q' : 0 || 'r' : 12 || 's' : 15 || 't' : 17 || 'u' : 4 || 'v' : 4 || 'w' : 6 || 'x' : 0 || 'y' : 1 || 'z' : 0 || 'A' : 0 || 'B' : 0 || 'C' : 0 || 'D' : 0 || 'E' : 0 || 'F' : 0 || 'G' : 3 || 'H' : 0 || 'I' : 1 || 'J' : 0 || 'K' : 0 || 'L' : 0 || 'M' : 0 || 'N' : 1 || 'O' : 0 || 'P' : 0 || 'Q' : 0 || 'R' : 0 || 'S' : 0 || 'T' : 0 || 'U' : 0 || 'V' : 0 || 'W' : 0 || 'X' : 0 || 'Y' : 0 || 'Z' : 0 || '!' : 0 || '"' : 0 || '#' : 0 || '$' : 0 || '%' : 0 || '&' : 0 || "'" : 1 || '(' : 0 || ')' : 0 || '*' : 1 || '+' : 0 || ',' : 2 || '-' : 1 || '.' : 2 || '/' : 0 || ':' : 0 || ';' : 0 || '<' : 0 || '=' : 0 || '>' : 0 || '?' : 0 || '@' : 1 || '[' : 0 || '\\' : 0 || ']' : 0 || '^' : 0 || '_' : 0 || '`' : 0 || '{' : 0 || '|' : 0 || '}' : 0 || '~' : 0 || ' ' : 47 || '\t' : 0 || '\n' : 0 || '\r' : 0 || '\x0b' : 0 || '\x0c' : 0 ||
>>>
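If you only want the characters that actually occur, a possible alternative is to iterate over set(text) so that text.count runs once per distinct character, or to let collections.Counter do a single pass over the text. A minimal sketch with a shortened sample text (by_count and by_counter are my own names):

```python
from collections import Counter

text = "the earth was formless"

# One text.count call per distinct character instead of one per position.
by_count = {char: text.count(char) for char in set(text)}

# Counter builds the same mapping in a single pass over the text.
by_counter = Counter(text)

print(by_count == dict(by_counter))
print(by_counter["e"], by_counter["s"])
```

Both give the same counts; the Counter version is usually the simpler one to read.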
