Count every word in a text file python - python

What I want is to feed in a multiline text file, about a paragraph long, and get back something like:
{'Total words': 'NUMBER', 'Words ending with LY': 'NUMBER'}
I have never used Counter before but i believe that is how i would do it. So i want it to count every word and if the word ends in LY add it to the second count. Considering i have never used Counter i don't know where to go...
with open('SOMETHING.txt') as f:
    # something to do with counter here?
EDIT: I have to do it without using Counter! How would I achieve the same result without Counter?

This should work for you...
def parse_file():
    with open('SOMETHING.txt', 'r') as f:
        c1 = 0
        c2 = 0
        for i in f:
            w = i.split()
            c1 += len(w)
            for j in w:
                if j.endswith('LY'):
                    c2 += 1
    return {'Total words': c1, 'Words ending with LY': c2}
However, I would recommend that you have a look at a few Python basics.

Is this hard to try?
from collections import defaultdict
result = defaultdict(int)
result_second = defaultdict(int)
for word in open('text.txt').read().split():
    result[word] += 1
    if word.endswith('LY'):
        result_second[word] += 1
print result, result_second
Output:
defaultdict(<type 'int'>, {'and': 1, 'Considering': 1, 'have': 2, "don't": 1, 'is': 1, 'it': 2, 'second': 1, 'want': 1, 'in': 1, 'before': 1, 'would': 1, 'to': 3, 'count.': 1, 'go...': 1, 'how': 1, 'add': 1, 'if': 1, 'LY': 1, 'it.': 1, 'do': 1, 'ends': 1, 'used': 2, 'that': 1, 'I': 1, 'Counter': 2, 'but': 1, 'So': 1, 'know': 1, 'never': 2, 'believe': 1, 'count': 1, 'word': 2, 'i': 5, 'every': 1, 'the': 2, 'where': 1})

Use collections.Counter()
import collections
with open('your_file.txt') as fp:
    text = fp.read()
counter = collections.Counter(['ends_in_ly' if token.endswith('LY') else 'doesnt_end_in_ly' for token in text.split()])
Without Counter:
with open('file.txt') as fp:
    tokens = fp.read().split()
c = sum([1 if token.endswith('LY') else 0 for token in tokens])
# a bare return is only valid inside a function, so store the result instead
result = {'ending_in_ly': c, 'not_ending_in_ly': len(tokens) - c}
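The answers above only match an uppercase 'LY' suffix and count trailing punctuation as part of the word. Here is a minimal sketch without Counter, assuming you want case-insensitive matching and punctuation stripped. The function name `count_ly_words` and the use of `string.punctuation` are my own choices, not something from the question:

```python
import string


def count_ly_words(text):
    """Return the total word count and the count of words ending in 'ly' (any case)."""
    total = 0
    ly = 0
    for token in text.split():
        word = token.strip(string.punctuation)  # drop leading/trailing punctuation
        if not word:
            continue  # the token was punctuation only
        total += 1
        if word.lower().endswith("ly"):
            ly += 1
    return {"Total words": total, "Words ending with LY": ly}


sample = "She quickly and quietly left. FINALLY, silence!"
print(count_ly_words(sample))
```

To use it on a file, read the contents first, e.g. `count_ly_words(open('SOMETHING.txt').read())`.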

Related

Python print statement not printing anything at the end of my function

I want to print a phrase at the end of my function, but my desired output is not printing. There are no errors popping up in python, it just isn't printing and acting like it is ignoring it. wordlist is the list of words the user entered to find how many times each word appears in the website they entered. sitewordlist is the entire list of words in the website.
def count(wordlist, sitewordlist):
    x = 0
    while x < len(wordlist):
        numblist = []
        wordcount = sitewordlist.count(wordlist[x])
        numblist.append(wordcount)
        x = x + 1
    final(numblist, wordlist)
def final(numblist, wordlist):
    y = 0
    while y < len(numblist):
        print("The word" + wordlist[y] + "appears" + numblist[y] + "times.")
        y = y + 1
main()
Problem: in your first while loop you increase x until it is equal to len(wordlist). Your second while loop is only entered if x is smaller than len(wordlist), so the two conditions contradict each other.
You can use collections.Counter to count things easily and get a dict from it:
from collections import Counter
def count(wordlist, sitewordlist):
    data = Counter(sitewordlist)
    for w in wordlist:
        print(f"The word {w} appears {data.get(w,0)} times.")
text = """n 1066, William of Normandy introduced what, in later centuries, became referred
to as a feudal system, by which he sought the advice of a council of tenants-in-chief (a
person who held land) and ecclesiastics before making laws. In 1215, the tenants-in-chief
secured Magna Carta from King John, which established that the king may not levy or collect
any taxes (except the feudal taxes to which they were hitherto accustomed), save with the
consent of his royal council, which gradually developed into a parliament. Over the
centuries, the English Parliament progressively limited the power of the English monarchy
which arguably culminated in the English Civil War and the trial and execution of Charles
I in 1649. After the restoration of the monarchy under Charles II, and the subsequent
Glorious Revolution of 1688, the supremacy of Parliament was a settled principle and all
future English and later British sovereigns were restricted to the role of constitutional
monarchs with limited executive authority. The Act of Union 1707 merged the English
Parliament with the Parliament of Scotland to form the Parliament of Great Britain.
When the Parliament of Ireland was abolished in 1801, its former members were merged
into what was now called the Parliament of the United Kingdom.
(quote from: https://en.wikipedia.org/wiki/Parliament_of_England)""".split()
# some cleanup
text[:] = [t.strip(".,-!?1234567890)([]{}\n") for t in text]
words = ["is","and","not","are"]
count(words,text)
Output:
The word is appears 0 times.
The word and appears 6 times.
The word not appears 1 times.
The word are appears 0 times.
Full Counter:
Counter({'the': 22, 'of': 15, 'Parliament': 7, '': 6, 'and': 6, 'a': 5, 'which': 5,
'English': 5, 'in': 4, 'to': 4, 'were': 3, 'with': 3, 'was': 3, 'what': 2, 'later': 2,
'centuries': 2, 'feudal': 2, 'council': 2, 'tenants-in-chief': 2, 'taxes': 2, 'into': 2,
'limited': 2,'monarchy': 2, 'Charles': 2, 'merged': 2, 'n': 1, 'William': 1, 'Normandy': 1,
'introduced': 1, 'became': 1, 'referred': 1, 'as': 1, 'system': 1, 'by': 1, 'he': 1,
'sought': 1, 'advice': 1, 'person': 1, 'who': 1, 'held': 1, 'land': 1, 'ecclesiastics': 1,
'before': 1, 'making': 1, 'laws': 1, 'In': 1, 'secured': 1, 'Magna': 1, 'Carta': 1,
'from': 1, 'King': 1, 'John': 1, 'established': 1, 'that': 1, 'king': 1, 'may': 1,
'not': 1, 'levy': 1, 'or': 1, 'collect': 1, 'any': 1, 'except': 1, 'they': 1,
'hitherto': 1, 'accustomed': 1, 'save': 1, 'consent': 1, 'his': 1, 'royal': 1,
'gradually': 1, 'developed': 1, 'parliament': 1, 'Over': 1, 'progressively': 1, 'power': 1,
'arguably': 1, 'culminated': 1, 'Civil': 1, 'War': 1, 'trial': 1, 'execution': 1,
'I': 1, 'After': 1, 'restoration': 1, 'under': 1, 'II': 1, 'subsequent': 1, 'Glorious': 1,
'Revolution': 1, 'supremacy': 1, 'settled': 1, 'principle': 1, 'all': 1, 'future': 1,
'British': 1, 'sovereigns': 1, 'restricted': 1, 'role': 1, 'constitutional': 1,
'monarchs': 1, 'executive': 1, 'authority': 1, 'The': 1, 'Act': 1, 'Union': 1,
'Scotland': 1, 'form': 1, 'Great': 1, 'Britain': 1, 'When': 1, 'Ireland': 1,
'abolished': 1, 'its': 1, 'former': 1, 'members': 1, 'now': 1, 'called': 1, 'United': 1,
'Kingdom': 1, 'quote': 1, 'from:': 1,
'https://en.wikipedia.org/wiki/Parliament_of_England': 1})
A while loop is not really appropriate here. You can simulate Counter using a normal dict and a while loop like so:
def count_me_other(words,text):
    wordlist = words.split()
    splitted = (x.strip(".,!?") for x in text.split())
    d = {}
    it = iter(splitted)
    try:
        while it:
            c = next(it)
            if c not in d:
                d[c]=1
            else:
                d[c]+=1
    except StopIteration:
        for w in wordlist:
            print(f"The word {w} appears {d.get(w,0)} times.")
wordlist = "A C E G I K M"
text = "A B C D E F G A B C D E F A B C D E A B C D A B C A B A"
count_me_other(wordlist,text)
Output:
The word A appears 7 times.
The word C appears 5 times.
The word E appears 3 times.
The word G appears 1 times.
The word I appears 0 times.
The word K appears 0 times.
The word M appears 0 times.
Or use a for loop in conjunction with a normal dict or a defaultdict:
def count_me_other_2(words,text):
    wordlist = words.split()
    splitted = (x.strip(".,!?") for x in text.split())
    d = {}
    for w in splitted:
        if w not in d:
            d[w]=1
        else:
            d[w]+=1
    for w in wordlist:
        print(f"The word {w} appears {d.get(w,0)} times.")
def count_me_other_3(words,text):
    from collections import defaultdict
    wordlist = words.split()
    splitted = (x.strip(".,!?") for x in text.split())
    d = defaultdict(int)
    for w in splitted:
        d[w] += 1
    for w in wordlist:
        print(f"The word {w} appears {d.get(w,0)} times.")
count_me_other_2(wordlist,text)
count_me_other_3(wordlist,text)
with identical output.
You're using while-loops to act like for-loops, but you're using the same index variable x in both, and you're not resetting its value to 0 in between. So the second while-loop sees that x is already equal to len(wordlist), and so it doesn't execute the body of the loop.
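If you want to keep the while-loops, here is a minimal sketch of a version that keeps the indices separate. It assumes two further problems in the posted code: `numblist = []` appears after the `while` line, which suggests the list is reset on every iteration, and the `print` concatenates an int to a str. Returning lists instead of printing inside the functions is my own change, made so the result is easy to check:

```python
def count(wordlist, sitewordlist):
    numblist = []  # created once, before the loop, so earlier counts are kept
    x = 0
    while x < len(wordlist):
        numblist.append(sitewordlist.count(wordlist[x]))
        x += 1
    return numblist


def final(numblist, wordlist):
    lines = []
    y = 0  # separate index that starts at 0 again
    while y < len(numblist):
        # str() is needed: concatenating an int to a str raises TypeError
        lines.append("The word " + wordlist[y] + " appears " + str(numblist[y]) + " times.")
        y += 1
    return lines


words = ["a", "b", "z"]
site_words = "a b a c a b".split()
for line in final(count(words, site_words), words):
    print(line)
```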

How to separate upper and lower case with counter?

I am thinking of something with collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
import collections
c = collections.Counter(s)
As a result I have
Counter({' ': 8,
',': 1,
'.': 1,
'?': 1,
'H': 1,
'M': 1,
'R': 1,
'T': 1,
'a': 2,
'd': 1,
'e': 5,
'f': 1,
'g': 1,
'h': 2,
'i': 2,
'l': 2,
'n': 1,
'o': 4,
'r': 3,
's': 3,
't': 1,
'u': 2,
'w': 1,
'y': 2})
If I try sum, I get a syntax error:
print sum(1 for i in c if i.isupper())
File "<ipython-input-21-66a7538534ee>", line 4
print sum(1 for i in c if i.isupper())
^
SyntaxError: invalid syntax
How should I count only upper or lower from the counter?
You lack the () in your generator expression:
sum((1 for x in c if x.isupper()))
4
EDIT: As #Błotosmętek suggests, you lack the () in your print. I guess you are using Python 3, so you should use print().
You can try something like this:
import collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = collections.Counter([ch for ch in s if ch.isupper()])
# Change to ch.islower() if you need lower case
# c = collections.Counter([ch for ch in s if ch.islower()])
print(c)
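Note that iterating over a Counter yields its distinct keys, so `sum(1 for x in c if x.isupper())` counts how many different uppercase letters occur, not how many uppercase characters there are in total. A small sketch that sums the stored counts instead (the names `upper_total` and `lower_total` are just for illustration):

```python
import collections

s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = collections.Counter(s)

# Sum the stored counts per character, not the number of distinct keys
upper_total = sum(n for ch, n in c.items() if ch.isupper())
lower_total = sum(n for ch, n in c.items() if ch.islower())
print(upper_total, lower_total)
```

For this particular string every uppercase letter appears exactly once, so both approaches give the same number for uppercase; they differ as soon as a letter repeats.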

Counting multiple letter groups in a string

I've been trying to adapt my python function to count groups of letters instead of single letters and I'm having a bit of trouble. Here's the code I have to count individual letters:
my_seq = "CTAAAGTCAACCTTCGGTTGACCTTGAAAGGGCCTTGGGAACCTTCGGTTGACCTTGAGGGTTCCCTAAGGGTT"
def count_letters(str):
    counts = {}
    for c in str:
        if c in counts:
            counts[c]+=1
        else:
            counts[c]=1
    return counts
counts = count_letters(my_seq)
print(counts)
The function currently spits out counts for each individual letter. Right now it prints this:
{'C': 23, 'T': 30, 'G': 30, 'A': 20}
Ideally, I'd like it to print something like this:
{'CTA': 2, 'TAG': 3, 'CGC': 1, 'GAG': 2 ... }
I'm very new to python and this is proving to be difficult.
This can be done pretty quickly using collections.Counter.
from collections import Counter
s = "CTAACAAC"
def chunk_string(s, n):
    return [s[i:i+n] for i in range(len(s)-n+1)]
counter = Counter(chunk_string(s, 3))
# Counter({'AAC': 2, 'ACA': 1, 'CAA': 1, 'CTA': 1, 'TAA': 1})
Edit: To elaborate on chunk_string:
It takes a string s and a chunk size n as arguments. Each s[i:i+n] is a slice of the string that is n characters long. The loop iterates over the valid indices where the string can be sliced (0 to len(s)-n). All of these slices are then grouped in a list comprehension. An equivalent method is:
def chunk_string(s, n):
    chunks = []
    last_index = len(s) - n
    for i in range(0, last_index + 1):
        chunks.append(s[i:i+n])
    return chunks
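The slicing and the counting can also be folded into a single pass over the string with a plain dict, which is close to the original `count_letters` function. This is only a sketch; the name `count_kmers` is made up for illustration:

```python
def count_kmers(seq, k):
    """Count overlapping substrings of length k in one pass over seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1  # start at 0 for unseen groups
    return counts


print(count_kmers("CTAACAAC", 3))
```

If `k` is larger than the string, the range is empty and the function returns an empty dict.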
This is basically the same as the first posted answer by Jared Goguen, but in reply to OP's comment, here is a possible way without importing a module:
>>> m
'CTAAAGTCAACCTTCGGTTGACCTTGAGGGTTCCCTAAGGGTTGGGGATGACCCTTGGGTCTAAAGTCAACCTTCGGTTGACCTTGAGGGTTCCCTAAGGGTT'
>>> l = [m[i:i+3] for i in range(len(m)-2)]
>>>
>>> d = {}
>>>
>>> for k in set(l):
d[k] = l.count(k)
>>> d
{'AAG': 4, 'GGA': 1, 'AAA': 2, 'TAA': 4, 'AGG': 4, 'AGT': 2, 'GGG': 7, 'ACC': 5, 'CGG': 2, 'GGT': 7, 'TCC': 2, 'TGA': 5, 'CAA': 2, 'TGG': 2, 'GTC': 3, 'AAC': 2, 'ATG': 1, 'CTT': 5, 'TCA': 2, 'CCT': 7, 'CCC': 3, 'GTT': 6, 'TTG': 6, 'GAT': 1, 'GAC': 3, 'TCG': 2, 'GAG': 2, 'CTA': 4, 'TTC': 4, 'TCT': 1}
Or if you are a fan of one liners:
>>> d = {k:l.count(k) for k in set(l)}

What's the correct way to loop over a list and make a dictionary with dict comprehension in python?

testWords is a list of words. setTestWords is the same list as a set. I want to create a dictionary with a dict comprehension where the word is the key and the count is the value. I'm also using .count().
Example output would be like this:
>>> dictTestWordsCount[:2]
>>> {'hi': 22, 'hello': 99}
This is the line of code I'm using, but it seems to crash my notebook every time.
l = {x: testWords.count(x) for x in setTestwords}
Not sure what causes your notebook to crash...
In [62]: txt = "the quick red fox jumped over the lazy brown dog"
In [63]: testWords = txt.split()
In [64]: setTestWords = set(testWords)
In [65]: {x:testWords.count(x) for x in setTestWords}
Out[65]:
{'brown': 1,
'dog': 1,
'fox': 1,
'jumped': 1,
'lazy': 1,
'over': 1,
'quick': 1,
'red': 1,
'the': 2}
Or better, use collections.defaultdict:
from collections import defaultdict
d = defaultdict(int)
for word in txt.split():
    d[word]+=1
print(d)
defaultdict(int,
{'brown': 1,
'dog': 1,
'fox': 1,
'jumped': 1,
'lazy': 1,
'over': 1,
'quick': 1,
'red': 1,
'the': 2})
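One thing to keep in mind: `testWords.count(x)` rescans the whole list for every distinct word, so the comprehension's cost grows with (distinct words × total words). Whether that is what crashes the notebook is only a guess on my part, but for a large word list a single-pass count is much cheaper. A sketch with collections.Counter, reusing the sample text from above:

```python
from collections import Counter

txt = "the quick red fox jumped over the lazy brown dog"
testWords = txt.split()

# One pass over the list; dict() turns the Counter into a plain dictionary
word_counts = dict(Counter(testWords))
print(word_counts["the"])
```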

Word frequency using dictionary

My problem is that I can't figure out how to display the word counts using a dictionary
keyed by word length. For example, consider the following piece of text:
"This is the sample text to get an idea!. "
Then the required output would be
3 2
2 3
0 5
as there are 3 words of length 2, 2 words of length 3, and 0 words of length 5 in the
given sample text.
I got as far as displaying the word occurrence frequencies:
def word_frequency(filename):
    word_count_list = []
    word_freq = {}
    text = open(filename, "r").read().lower().split()
    word_freq = [text.count(p) for p in text]
    dictionary = dict(zip(text,word_freq))
    return dictionary
print word_frequency("text.txt")
which displays the dict in this format:
{'all': 3, 'show': 1, 'welcomed': 1, 'not': 2, 'availability': 1, 'television,': 1, '28': 1, 'to': 11, 'has': 2, 'ehealth,': 1, 'do': 1, 'get': 1, 'they': 1, 'milestone': 1, 'kroes,': 1, 'now': 3, 'bringing': 2, 'eu.': 1, 'like': 1, 'states.': 1, 'them.': 1, 'european': 2, 'essential': 1, 'available': 4, 'because': 2, 'people': 3, 'generation': 1, 'economic': 1, '99.4%': 1, 'are': 3, 'eu': 1, 'achievement,': 1, 'said': 3, 'for': 3, 'broadband': 7, 'networks': 2, 'access': 2, 'internet': 1, 'across': 2, 'europe': 1, 'subscriptions': 1, 'million': 1, 'target.': 1, '2020,': 1, 'news': 1, 'neelie': 1, 'by': 1, 'improve': 1, 'fixed': 2, 'of': 8, '100%': 1, '30': 1, 'affordable': 1, 'union,': 2, 'countries.': 1, 'products': 1, 'or': 3, 'speeds': 1, 'cars."': 1, 'via': 1, 'reached': 1, 'cloud': 1, 'from': 1, 'needed': 1, '50%': 1, 'been': 1, 'next': 2, 'households': 3, 'commission': 5, 'live': 1, 'basic': 1, 'was': 1, 'said:': 1, 'more': 1, 'higher.': 1, '30mbps': 2, 'that': 4, 'but': 2, 'aware': 1, '50mbps': 1, 'line': 1, 'statement,': 1, 'with': 2, 'population': 1, "europe's": 1, 'target': 1, 'these': 1, 'reliable': 1, 'work': 1, '96%': 1, 'can': 1, 'ms': 1, 'many': 1, 'further.': 1, 'and': 6, 'computing': 1, 'is': 4, 'it': 2, 'according': 1, 'have': 2, 'in': 5, 'claimed': 1, 'their': 1, 'respective': 1, 'kroes': 1, 'areas.': 1, 'responsible': 1, 'isolated': 1, 'member': 1, '100mbps': 1, 'digital': 2, 'figures': 1, 'out': 1, 'higher': 1, 'development': 1, 'satellite': 4, 'who': 1, 'connected': 2, 'coverage': 2, 'services': 2, 'president': 1, 'a': 1, 'vice': 1, 'mobile': 2, "commission's": 1, 'points': 1, '"access': 1, 'rural': 1, 'the': 16, 'agenda,': 1, 'having': 1}
def freqCounter(infilepath):
    answer = {}
    with open(infilepath) as infile:
        for line in infile:  # iterate over the file's lines, not the characters of the path
            for word in line.strip().split():
                l = len(word)
                if l not in answer:
                    answer[l] = 0
                answer[l] += 1
    return answer
Alternatively:
import collections
def freqCounter(infilepath):
    with open(infilepath) as infile:
        return collections.Counter(len(word) for line in infile for word in line.strip().split())
Use collections.Counter
import collections
sentence = "This is the sample text to get an idea"
Count = collections.Counter([len(a) for a in sentence.split()])
print(Count)
To count how many words in a text have given lengths: size -> frequency distribution, you could use a regular expression to extract words:
#!/usr/bin/env python3
import re
from collections import Counter
text = "This is the sample text to get an idea!. "
words = re.findall(r'\w+', text.casefold())
frequencies = Counter(map(len, words)).most_common()
print("\n".join(["%d word(s) of length %d" % (n, length)
                 for length, n in frequencies]))
Output
3 word(s) of length 2
3 word(s) of length 4
2 word(s) of length 3
1 word(s) of length 6
Note: unlike .split()-based solutions, it automatically ignores punctuation such as the !. after 'idea'.
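To make the difference concrete, here is a short comparison on the same sample text (the variable names `split_words` and `regex_words` are only for illustration):

```python
import re

text = "This is the sample text to get an idea!. "

split_words = text.split()               # punctuation stays attached to the word
regex_words = re.findall(r'\w+', text)   # only runs of word characters are kept

print(split_words[-1], len(split_words[-1]))
print(regex_words[-1], len(regex_words[-1]))
```

Keep in mind that `\w+` also breaks words at apostrophes and hyphens, so "don't" becomes "don" and "t". Whether that is acceptable depends on your text.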
To read words from a file, you could read lines and extract words from them in the same way as it is done for text in the first code example:
from itertools import chain
with open(filename) as file:
    words = chain.from_iterable(re.findall(r'\w+', line.casefold())
                                for line in file)
    # use words here.. (the same as above)
    frequencies = Counter(map(len, words)).most_common()
print("\n".join(["%d word(s) of length %d" % (n, length)
                 for length, n in frequencies]))
In practice, you could use a list to find the length frequency distribution if you ignore words that are longer than a threshold:
def count_lengths(words, maxlen=100):
    frequencies = [0] * (maxlen + 1)
    for length in map(len, words):
        if length <= maxlen:
            frequencies[length] += 1
    return frequencies
Example
import re
text = "This is the sample text to get an idea!. "
words = re.findall(r'\w+', text.casefold())
frequencies = count_lengths(words)
print("\n".join(["%d word(s) of length %d" % (n, length)
                 for length, n in enumerate(frequencies) if n > 0]))
Output
3 word(s) of length 2
2 word(s) of length 3
3 word(s) of length 4
1 word(s) of length 6
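The question's required output also lists a length with zero words ("0 5"), which none of the outputs above show. A sketch that reports the count for a given list of lengths, assuming the lengths of interest are known up front (the function name `length_counts` and the list `[2, 3, 5]` are taken from the example in the question, not from any answer):

```python
import re
from collections import Counter


def length_counts(text, lengths):
    """Return (count, length) pairs for each requested length, including zeros."""
    freq = Counter(len(w) for w in re.findall(r'\w+', text))
    return [(freq[n], n) for n in lengths]  # a Counter returns 0 for missing keys


text = "This is the sample text to get an idea!. "
for count, length in length_counts(text, [2, 3, 5]):
    print(count, length)
```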
