How to separate upper and lower case with counter? - python

I am thinking of something with collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
import collections
c = collections.Counter(s)
As a result I have
Counter({' ': 8,
',': 1,
'.': 1,
'?': 1,
'H': 1,
'M': 1,
'R': 1,
'T': 1,
'a': 2,
'd': 1,
'e': 5,
'f': 1,
'g': 1,
'h': 2,
'i': 2,
'l': 2,
'n': 1,
'o': 4,
'r': 3,
's': 3,
't': 1,
'u': 2,
'w': 1,
'y': 2})
If I try sum, I get a syntax error:
print sum(1 for i in c if i.isupper())
File "<ipython-input-21-66a7538534ee>", line 4
print sum(1 for i in c if i.isupper())
^
SyntaxError: invalid syntax
How can I count only the uppercase or lowercase letters from the Counter?

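The generator expression itself is fine; the extra () around it are optional when it is the only argument:
sum((1 for x in c if x.isupper()))
4
EDIT: As @Błotosmętek suggests, what you actually lack is the () in your print. I guess you are using Python 3, where print is a function, so you should use print(...). Note that iterating over c yields each distinct character once, so 4 is the number of distinct uppercase letters, not the number of occurrences.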

You can try something like this:
import collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = collections.Counter([ch for ch in s if ch.isupper()])
# Change to ch.islower() if you need lower case
# c = collections.Counter([ch for ch in s if ch.islower()])
print(c)
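If the goal is the total number of uppercase and lowercase characters rather than a filtered Counter, one option is to sum the counts already stored in the Counter. This is only a minimal sketch using the string from the question; upper_total and lower_total are illustrative names, not part of any answer above.

```python
from collections import Counter

s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = Counter(s)

# Sum the stored counts (not the keys), so repeated letters are included.
upper_total = sum(n for ch, n in c.items() if ch.isupper())
lower_total = sum(n for ch, n in c.items() if ch.islower())

print(upper_total, lower_total)
```

Spaces and punctuation are skipped because isupper() and islower() are both False for them.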

Related

Creating a program that returns a score by using a key on a list

I'm basically trying to read a txt file, remove every symbol and punctuation mark that isn't a letter (A-Z), and then produce an output that lists all the words in the file with a score next to each one. To get the score, I'm trying to compare each letter of the word to a key. This key represents how much each letter is worth. By adding up the letter values for a given word, I'll get the total score for that word.
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
'y': 2, 'z': 9}
This is what I have so far, but I'm completely stuck.
with open("hunger_games.txt") as p:
    text = p.read()
text = text.lower()
text = text.split()
new = []
for word in text:
    if word.isalpha() == False:
        new.append(word[:-1])
    else:
        new.append(word)
class TotalScore():
    def score():
        total = 0
        for word in new:
            for letter in word:
                total += alphakey[letter]
            return total
I'm trying to get something like:
you 5
by 4
cool 10
etc. for all the words in the list. Thanks in advance for any help.
As pointed out in the comments, you don't need a class for that, and your return is mis-indented; otherwise I think your score function does what you need to compute the total score.
If you need a per-word score, you can use a dictionary (again) to store the scores:
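import re
def word_score(word):
    return sum(alphakey[l] for l in word)
def text_scores(filename):
    with open(filename) as p:
        text = p.read()
    # Lowercase first, then drop everything except a-z and whitespace
    # (keeping newlines so words on adjacent lines are not glued together).
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return {w: word_score(w) for w in text.split()}
print(text_scores("hunger_games.txt"))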
If hunger_games.txt contains "you by cool", then this prints:
{'you': 8, 'by': 9, 'cool': 8}
Does the punctuation have to be removed? Or are you doing that so that you can match up the keys of the dictionary? If you are okay with the punctuation staying in then this can be solved in a few lines:
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
'y': 2, 'z': 9}
with open("hunger_games.txt") as p:
    text = p.read()
text = text.lower()
words = text.split()
uniqueWords = {}
for word in words:
    if word not in uniqueWords:
        uniqueWords[word] = sum([alphakey[letter] for letter in word if letter.isalpha()])
print(uniqueWords)
The line that computes the score might need a bit of explanation. First off,
[alphakey[letter] for letter in word if letter.isalpha()]
is an example of a "list comprehension", a very useful Python feature that lets us build an entire list in a single expression. This one goes through every letter in word and, if the letter is alphabetic, looks up its value in alphakey. For example, if the word was:
"hello"
it would return the list:
[5, 7, 2, 2, 1]
If the word was:
"w4h&t"
the list comprehension would ignore the "4" and "&" and return the list:
[2, 5, 7]
To turn those into a single value we wrap the comprehension in the sum function. So the final value is 17 for the word "hello", and 14 for "w4h&t".
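To see the two worked examples above run end to end, here is a small sketch. The helper name letter_values is hypothetical, and it filters with "ch in alphakey" instead of isalpha(); that swap is my own assumption, made so that letters missing from the key (for example accented ones) are skipped instead of raising KeyError.

```python
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
            'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
            'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
            's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
            'y': 2, 'z': 9}

def letter_values(word):
    """Return the alphakey value of every character of word found in the key."""
    return [alphakey[ch] for ch in word if ch in alphakey]

for word in ("hello", "w4h&t"):
    values = letter_values(word)
    # Prints the word, its list of letter values, and their sum.
    print(word, values, sum(values))
```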
I suggest using nltk for text manipulation.
Here is my solution (some chunks of code could be shortened; I kept them verbose so they are easier to follow).
Basically, you split the text into a list of words, remove all duplicates with set(), and then loop through the words calculating each score. I hope the code is clear.
import nltk
alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
'y': 2, 'z': 9}
text = """
boy girl girl boy dog Dog car cAr dog girl you by cool 123asd .asd; 12asd
"""
words = []
results = {}
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    words += nltk.word_tokenize(sentence)
words = list(set([word.lower() for word in words]))
for word in words:
    if word.isalpha():
        total = 0
        for letter in word:
            total += alphakey[letter]
        results[word] = total
for val in results:
    print(f"{val} {results[val]}")
Output (the order may vary, because a set is unordered):
dog 7
you 8
by 9
boy 10
cool 8
car 13
girl 11
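If installing nltk is not an option, roughly the same result can be sketched with the standard library alone. This is an assumption-laden alternative, not the answer's method: score_words is a made-up name, and the regex treats any run of a-z as a word, so a mixed token such as 123asd would be scored as asd instead of being dropped as in the nltk version. The sample text below leaves those tokens out.

```python
import re

alphakey = {'a': 5, 'b': 7, 'c': 4, 'd': 3, 'e': 7, 'f': 3,
            'g': 3, 'h': 5, 'i': 2, 'j': 2, 'k': 1, 'l': 2,
            'm': 6, 'n': 3, 'o': 1, 'p': 2, 'q': 1, 'r': 4,
            's': 3, 't': 7, 'u': 5, 'v': 5, 'w': 2, 'x': 1,
            'y': 2, 'z': 9}

def score_words(text):
    """Map each distinct lowercased word in text to the sum of its letter values."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w: sum(alphakey[ch] for ch in w) for w in words}

sample = "boy girl girl boy dog Dog car cAr dog girl you by cool"
# Sorting gives a stable order and prints each word with its score side by side.
for word, score in sorted(score_words(sample).items()):
    print(word, score)
```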

Python convert utf-8 back to string

I have a string which looks like
a = 'Verm\xc3\xb6gensverzeichnis'
When I do print(a), it shows me the right result, which is Vermögensverzeichnis.
print(a)
Vermögensverzeichnis
What I want to do is count the occurrences of each letter using Counter() and save them in a dataframe. When I use Counter(a), it gives me a result like this:
Counter({'V': 1,
'c': 1,
'e': 4,
'g': 1,
'h': 1,
'i': 2,
'm': 1,
'n': 2,
'r': 2,
's': 2,
'v': 1,
'z': 1,
'\xb6': 1,
'\xc3': 1})
Could you please help me get rid of codes like \xc3\xb6? I have tried many existing answers; unfortunately, they do not work.
Thanks a lot in advance!
This must be Python 2. Work with Unicode if you want to count characters vs. encoded bytes. \xc3\xb6 are the encoded bytes of ö:
>>> a = 'Verm\xc3\xb6gensverzeichnis'
>>> print a # Note this only works if your terminal is configured for UTF-8 encoding.
Vermögensverzeichnis
Decode to Unicode. It should still print correctly as long as your terminal is configured correctly:
>>> u = a.decode('utf8')
>>> u
u'Verm\xf6gensverzeichnis'
>>> print u
Vermögensverzeichnis
Count the Unicode code points:
>>> from collections import Counter
>>> Counter(u)
Counter({u'e': 4, u'i': 2, u'n': 2, u's': 2, u'r': 2, u'c': 1, u'v': 1, u'g': 1, u'h': 1, u'V': 1, u'm': 1, u'\xf6': 1, u'z': 1})
u'\xf6' is the Unicode codepoint for ö. Print the keys and values to display them on the terminal properly:
>>> for k,v in Counter(u).iteritems():
...     print k,v
...
c 1
v 1
e 4
g 1
i 2
h 1
V 1
m 1
n 2
s 2
r 2
ö 1
z 1
For future study, to see where this approach breaks down: Unicode normalization and grapheme clusters.
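On Python 3 the same idea applies, except that the encoded data is a bytes object and str is already Unicode. The sketch below is only an illustration with made-up variable names. It also shows the normalization issue mentioned above: the same visible ö can be stored as one code point or as two, which changes what Counter sees.

```python
import unicodedata
from collections import Counter

raw = b'Verm\xc3\xb6gensverzeichnis'  # UTF-8 encoded bytes
text = raw.decode('utf-8')            # str made of Unicode code points

counts = Counter(text)
print(counts['\xf6'])                 # count for the single code point U+00F6

# NFD splits U+00F6 into 'o' followed by the combining diaeresis U+0308,
# so a naive Counter no longer has an entry for the composed character.
decomposed = unicodedata.normalize('NFD', text)
decomposed_counts = Counter(decomposed)
print(len(text), len(decomposed))

# Normalizing to NFC before counting restores the composed form.
recomposed = unicodedata.normalize('NFC', decomposed)
```

Normalizing to one form before counting keeps the results consistent; grapheme clusters that have no composed form would still need a dedicated library.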

Dictionary of punctuation counts for list of strings

How can I use dict comprehension to build a dictionary of punctuation counts for a list of strings? I was able to do it for a single string like this:
import string
test_string = "1990; and 1989', \\ '1975/97', '618-907 CE"
counts = {p:test_string.count(p) for p in string.punctuation}
Edit: For anyone who may want this in the future, here's Patrick Artner's answer copied from below, with a very small modification to keep only punctuation counts:
# return punctuation Counter dict for string/list/pd.Series
import string
from collections import Counter
from itertools import chain
def count_punctuation(str_series_or_list):
    c = Counter(chain(*str_series_or_list))
    unwanted = set(c) - set(string.punctuation)
    for unwanted_key in unwanted:
        del c[unwanted_key]
    return c
Why count yourself?
import string
from collections import Counter
test_string = "1990; and 1989', \\ '1975/97', '618-907 CE"
c = Counter(test_string) # counts all occurrences
for p in string.punctuation: # prints the ones in string.punctuation
    print(p , c[p]) # access like a dictionary (it's a subclass of dict)
print(c)
Output:
! 0
" 0
# 0
$ 0
% 0
& 0
' 4
( 0
) 0
* 0
+ 0
, 2
- 1
. 0
/ 1
: 0
; 1
< 0
= 0
> 0
? 0
@ 0
[ 0
\ 1
] 0
^ 0
_ 0
` 0
{ 0
| 0
} 0
~ 0
Counter({'9': 7, ' ': 6, '1': 4, "'": 4, '7': 3, '0': 2, '8': 2, ',': 2, ';': 1, 'a': 1, 'n': 1, 'd': 1, '\\': 1, '5': 1, '/': 1, '6': 1, '-': 1, 'C': 1, 'E': 1})
Counter is dictionary-like: see https://docs.python.org/2/library/collections.html#collections.Counter
Edit: multiple strings in a list:
import string
from collections import Counter
from itertools import chain
test_strings = [ "1990; and 1989', \\ '1975/97', '618-907 CE" , "someone... or no one? that's the question!", "No I am not!"]
c = Counter(chain(*test_strings))
for p in string.punctuation:
    print(p , c[p])
print(c)
Output: (removed 0-Entries)
! 2
' 5
, 2
- 1
. 3
/ 1
; 1
? 1
\ 1
Counter({' ': 15, 'o': 8, '9': 7, 'n': 6, "'": 5, 'e': 5, 't': 5, '1': 4, 'a': 3, '7': 3, 's': 3, '.': 3, '0': 2, '8': 2, ',': 2, 'm': 2, 'h': 2, '!': 2, ';': 1, 'd': 1, '\\': 1, '5': 1, '/': 1, '6': 1, '-': 1, 'C': 1, 'E': 1, 'r': 1, '?': 1, 'q': 1, 'u': 1, 'i': 1, 'N': 1, 'I': 1})
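Since the question asked specifically for a dict comprehension, the Counter approach can be combined with one. A minimal sketch, assuming the input is a list of strings; punctuation_counts is a hypothetical helper name, and dropping zero entries is my own choice to match the trimmed output above.

```python
import string
from collections import Counter
from itertools import chain

def punctuation_counts(strings):
    """Count punctuation characters across all strings, omitting zero counts."""
    counts = Counter(chain.from_iterable(strings))
    return {p: counts[p] for p in string.punctuation if counts[p]}

test_strings = ["1990; and 1989', \\ '1975/97', '618-907 CE",
                "someone... or no one? that's the question!",
                "No I am not!"]
print(punctuation_counts(test_strings))
```

Remove the trailing "if counts[p]" to keep an entry, possibly 0, for every character in string.punctuation.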

Counting total number of letters in a string

a = "All men are created equal under the power of the constitution, Thomas Jefferson"
I know a.count('A') will return how many "A"s there are. But I want to count how many A's, e's, c's and T's there are and add them together. Help much appreciated.
I'm using Python 3.
Look into collections.Counter:
>>> from collections import Counter
>>> import string
>>> c = Counter(l for l in a if l in string.ascii_letters)
>>> c
Counter({'e': 11, 't': 6, 'o': 6, 'r': 5, 'n': 5, 'a': 4, 'l': 3, 'f': 3,
's': 3, 'u': 3, 'h': 3, 'i': 2, 'd': 2, 'c': 2, 'm': 2, 'A': 1,
'p': 1, 'w': 1, 'T': 1, 'J': 1, 'q': 1})
>>> sum(c.values())
66
>>> c = Counter(l for l in a if l in 'AecT')
>>> c
Counter({'e': 11, 'c': 2, 'A': 1, 'T': 1})
>>> sum(c.values())
15
Python has a great module for this. Use Counter from collections
from collections import Counter
a = "All men are created equal under the power of the constitution, Thomas Jefferson"
counter = Counter(a)
print(counter)
It will output a dictionary-like Counter with every character (including spaces and punctuation) as a key and its number of occurrences as the value.
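You could use regular expressions to find the total number of letters easily:
import re
p = re.compile(r"\w")
a = "All men are created equal under the power of the constitution, Thomas Jefferson"
numberOfLetters = len(p.findall(a))
This returns 66. Note that \w also matches digits and underscores, so use [A-Za-z] if the text may contain them.
If you just want A, e, c, and T, use this pattern instead:
p = re.compile("[AecT]")
This returns 15. Inside a character class, | is a literal character, so it should not be used as a separator there.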
Here is another approach:
list(map(lambda x: [x, a.count(x)], 'AecT'))
Here a is the input string, and 'AecT' can be replaced with whichever letters are required. In Python 3, map returns an iterator, hence the list() call.
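Since the question asks for the combined total, the per-letter counts from either approach above can simply be summed. A short sketch; wanted, total and total_via_count are illustrative names only.

```python
from collections import Counter

a = "All men are created equal under the power of the constitution, Thomas Jefferson"
wanted = "AecT"

# One pass over the string, keeping only the letters of interest.
counts = Counter(ch for ch in a if ch in wanted)
total = sum(counts.values())

# Equivalent result using str.count once per wanted letter.
total_via_count = sum(a.count(ch) for ch in wanted)

print(counts, total, total_via_count)
```

The Counter version scans the string once, while the str.count version scans it once per letter; for four letters and a short string the difference is negligible.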

Count every word in a text file python

What I want is to be able to feed in a multiline text file, about a paragraph long, and get back something like:
{'Total words': 'NUMBER', 'Words ending with LY': 'NUMBER'}
I have never used Counter before, but I believe that is how I would do it. I want it to count every word and, if the word ends in LY, add it to the second count. Since I have never used Counter, I don't know where to start...
with open('SOMETHING.txt') as f:
    # something to do with counter here?
EDIT: I have to do it without using Counter! How would I achieve the same result without collections.Counter?
This should work for you...
def parse_file():
    with open('SOMETHING.txt', 'r') as f:
        c1 = 0
        c2 = 0
        for i in f:
            w = i.split()
            c1 += len(w)
            for j in w:
                if j.endswith('LY'):
                    c2 += 1
    return {'Total words': c1, 'Words ending with LY': c2}
However, I would recommend that you have a look at a few Python basics.
Is this hard to try?
from collections import defaultdict
result = defaultdict(int)
result_second = defaultdict(int)
for word in open('text.txt').read().split():
    result[word] += 1
    if word.endswith('LY'):
        result_second[word] += 1
# Python 2 print statement; on Python 3 use print(result, result_second)
print result,result_second
Output:
defaultdict(<type 'int'>, {'and': 1, 'Considering': 1, 'have': 2, "don't": 1, 'is': 1, 'it': 2, 'second': 1, 'want': 1, 'in': 1, 'before': 1, 'would': 1, 'to': 3, 'count.': 1, 'go...': 1, 'how': 1, 'add': 1, 'if': 1, 'LY': 1, 'it.': 1, 'do': 1, 'ends': 1, 'used': 2, 'that': 1, 'I': 1, 'Counter': 2, 'but': 1, 'So': 1, 'know': 1, 'never': 2, 'believe': 1, 'count': 1, 'word': 2, 'i': 5, 'every': 1, 'the': 2, 'where': 1})
Use collections.Counter()
import collections
with open('your_file.txt') as fp:
    text = fp.read()
counter = collections.Counter(['ends_in_ly' if token.endswith('LY') else 'doesnt_end_in_ly' for token in text.split()])
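Without Counter:
# Wrapped in a function (the name count_ly is illustrative) so that return is valid.
def count_ly(path='file.txt'):
    with open(path) as fp:
        tokens = fp.read().split()
    c = sum([1 if token.endswith('LY') else 0 for token in tokens])
    return {'ending_in_ly': c, 'not_ending_in_ly': len(tokens) - c}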
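For the exact dictionary shape requested in the question, here is a Counter-free sketch that works on a string so it can be tried without a file. Several things here are my own assumptions and not part of the answers above: the name summarize_words, the case-insensitive match, and stripping surrounding punctuation so that "quickly," still counts.

```python
import string

def summarize_words(text, suffix="ly"):
    """Return the total word count and the number of words ending with suffix."""
    words = text.split()
    # Assumption: ignore case and surrounding punctuation when testing the suffix.
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    ending = sum(1 for w in cleaned if w.endswith(suffix))
    return {'Total words': len(words), 'Words ending with LY': ending}

sample = "She spoke quietly.\nHe left QUICKLY, sadly\nand early!"
print(summarize_words(sample))
```

To use it on a file, read the contents first, for example with open('SOMETHING.txt') as f: summarize_words(f.read()).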
