I have a file that contains a paragraph, and I just want to count the frequency of each word. I have tried it in the following way, but I am not getting the expected output. Can anyone please help me?
dic = {}
with open("C:\\Users\\vWX442280\\Desktop\\f1.txt", 'r') as f:
    for line in f:
        l1 = line.split(" ")
        for w in l1:
            dic[w] = dic.get(w, 0) + 1
print('\n'.join(['%s,%s' % (k, v) for k, v in dic.items()]))
I am getting output like this.
Python,2
is,3
good,1
helps,1
in,2
machine,2
learning,1
learning,1
goos,1
python,1
famous,1
kill,1
the,1
machine,1
it,1
a,1
good,1
day,1
A pure Python way without importing any libraries. More code, but I wanted to write some bad code today (:
file = open('path/to/file.txt', 'r')
content = ' '.join(line for line in file.read().splitlines())
content = content.split(' ')
freqs = {}
for word in content:
    if word not in freqs:
        freqs[word] = 1
    else:
        freqs[word] += 1
file.close()
This uses a python dictionary to store the words and the amount of times they appear.
I know it's better to use with open(blah) as b: but this is just to get the idea across. ¯\_(ツ)_/¯
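For reference, here is roughly the same counting logic written with a with block so the file is closed automatically (a quick sketch, still using only a plain dictionary and the same hypothetical path):
freqs = {}
with open('path/to/file.txt', 'r') as f:
    # split() without arguments splits on any whitespace and drops trailing newlines
    for word in f.read().split():
        freqs[word] = freqs.get(word, 0) + 1
print(freqs)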
From your code, I spotted the following issues:
In for s in l: the variable l is a line of text, so the for loop iterates over each character, not each word.
The f.split('\n') expression will raise an error because f is a file object, and it does not have a .split() method; strings do.
With that in mind, here is a rewrite of your code to make it work:
dic = {}
with open("f1.txt", 'r') as f:
    for l in f:
        for w in l.split():
            dic[w] = dic.get(w, 0) + 1
print('\n'.join(['%s,%s' % (k, v) for k, v in dic.items()]))
You can use the count method
mystring = "hello hello hello"
mystring.count("hello") # 3
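For example, applied to the whole file from the question (just a sketch, assuming the f1.txt file above; note that str.count matches substrings, so 'is' would also be counted inside 'this'):
with open('f1.txt') as f:
    text = f.read()
print(text.count('machine'))  # number of times the substring 'machine' appears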
I have a txt file that looks like this:
[Chapter.Title1]
Irrevelent=90 B
Volt=0.10 ienl
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=999
Irrevelent=999
[Chapter.Title3]
Irrevelent=92 B
Volt=0.20 ienl
Watt=5 W
Ampere=6 A
Irrevelent=93 C
What I want is for it to catch "Title1" and the values "0.1", "2" and "3", then add them up (which would be 5.1).
I don't care about the lines starting with "Irrevelent".
And then the same with the third block: catching "Title3" and adding "0.2", "5" and "6".
The second block with "Title2" does not contain "Volt", "Watt" and "Ampere" and is therefore not relevant.
Can anyone please help me out with this?
Thank you and cheers
You can use regular expressions to get the values and the titles in lists, then use them.
txt = """[Chapter.Title1]
Irrevelent=90 B
Volt=1 V
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=92 B
Volt=4 V
Watt=5 W
Ampere=6 A
Irrevelent=93 C"""
#that's just the text
import re
rx1=r'Chapter.(.*?)\]'
rxv1=r'Volt=(\d+)'
rxv2=r'Watt=(\d+)'
rxv3=r'Ampere=(\d+)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
print(res1)
print(resv1)
print(resv2)
print(resv3)
Here you get the titles and the interesting values you want:
['Title1', 'Title2']
['1', '4']
['2', '5']
['3', '6']
You can then use them as you want, for example:
for title_index in range(len(res1)):
    print(res1[title_index])
    value = int(resv1[title_index]) + int(resv2[title_index]) + int(resv3[title_index])
    # use float() instead of int() if you have non-integer values
    print("the value is:", value)
You get:
Title1
the value is: 6
Title2
the value is: 15
Or you can store them in a dictionary or another structure, for example:
#dict(zip(keys, values))
data= dict(zip(res1, [int(resv1[i])+int(resv2[i])+int(resv3[i]) for i in range(len(res1))] ))
print(data)
You get:
{'Title1': 6, 'Title2': 15}
Edit: added opening of the file
import re

with open('filename.txt', 'r') as file:
    txt = file.read()
rx1=r'Chapter.(.*?)\]'
rxv1=r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2=r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3=r'Ampere=([0-9]+(?:\.[0-9]+)?)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
data= dict(zip(res1, [float(resv1[i])+float(resv2[i])+float(resv3[i]) for i in range(len(res1))] ))
print(data)
Edit 2: ignoring missing values
import re

with open('filename.txt', 'r') as file:
    txt = file.read()

# divide the text into parts starting with "Chapter"
substr = "Chapter"
chunks_idex = [_.start() for _ in re.finditer(substr, txt)]
chunks = [txt[chunks_idex[i]:chunks_idex[i+1]-1] for i in range(len(chunks_idex)-1)]
chunks.append(txt[chunks_idex[-1]:])  # add the last chunk
#print(chunks)

keys = []
values = []
rx1 = r'Chapter.(.*?)\]'
rxv1 = r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2 = r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3 = r'Ampere=([0-9]+(?:\.[0-9]+)?)'
for chunk in chunks:
    res1 = re.findall(rx1, chunk)
    resv1 = re.findall(rxv1, chunk)
    resv2 = re.findall(rxv2, chunk)
    resv3 = re.findall(rxv3, chunk)
    # check if we found all of them by checking that the lists are not empty
    if res1 and resv1 and resv2 and resv3:
        keys.append(res1[0])
        values.append(float(resv1[0]) + float(resv2[0]) + float(resv3[0]))

data = dict(zip(keys, values))
print(data)
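With the sample file from the question (where the Title2 block has no Volt, Watt or Ampere lines), this should print something like:
{'Title1': 5.1, 'Title3': 11.2}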
Here's a quick and dirty way to do this, reading line by line, if the input file is predictable enough.
In the example I just print out the titles and the values; you can of course process them however you want.
f = open('file.dat', 'r')
for line in f.readlines():
    ## Catch the title of the line:
    if '[Chapter' in line:
        print(line[9:-2])
    ## catch the values of the Volt, Watt, Ampere parameters
    elif line[:4] in ['Volt', 'Watt', 'Ampe']:
        value = line[line.index('=')+1:line.index(' ')]
        print(value)
    ## if the line is "Irrelevant", or blank, do nothing
f.close()
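If you want the per-title sums rather than just printing the raw values, the same line-by-line approach can accumulate them in a dictionary; here is a rough sketch along those lines:
totals = {}
current = None
with open('file.dat', 'r') as f:
    for line in f:
        if '[Chapter' in line:
            current = line[9:-2]  # the title, extracted as above
            totals[current] = 0
        elif line[:4] in ['Volt', 'Watt', 'Ampe'] and current is not None:
            value = line[line.index('=')+1:line.index(' ')]
            totals[current] += float(value)
print(totals)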
There are many ways to achieve this. Here's one:
d = dict()
V = {'Volt', 'Watt', 'Ampere'}

with open('chapter.txt', encoding='utf-8') as f:
    key = None
    for line in f:
        if line.startswith('[Chapter'):
            d[key := line.strip()] = 0
        elif key and len(t := line.split('=')) > 1 and t[0] in V:
            d[key] += float(t[1].split()[0])

for k, v in d.items():
    if v > 0:
        print(f'Total for {k} = {v}')
Output:
Total for [Chapter.Title1] = 6
Total for [Chapter.Title2] = 15
I'm trying to make a simple program that can find the frequency of occurrences in a text file line by line. I have it outputting everything correctly except for when more than one word is on a line in the text file. (More information below)
The text file looks like this:
Hello
Hi
Hello
Good Day
Hi
Good Day
Good Night
I want the output to be: (Doesn't have to be in the same order)
Hello: 2
Hi: 2
Good Day: 2
Good Night: 2
What it's currently outputting:
Day: 2
Good: 3
Hello: 2
Hi: 2
Night: 1
My code:
file = open("test.txt", "r")
text = file.read() #reads file (I've tried .realine() & .readlines()
word_list = text.split(None)
word_freq = {} # Declares empty dictionary
for word in word_list:
word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
final=word.capitalize()
print(final + ': ' + str(word_freq[word])) # Line that prints the output
You want to preserve the lines. Don't split. Don't capitalize. Don't sort.
Use a Counter:
from collections import Counter

c = Counter()
with open('test.txt') as f:
    for line in f:
        c[line.rstrip()] += 1

for k, v in c.items():
    print('{}: {}'.format(k, v))
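If you also want the lines ordered by how often they appear, Counter.most_common() returns the (line, count) pairs sorted from most to least frequent:
for k, v in c.most_common():
    print('{}: {}'.format(k, v))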
Instead of splitting the text by None, split it on line breaks so that each line becomes an element of the list.
file = open("test.txt", "r")
text = file.read() #reads file (I've tried .realine() & .readlines()
word_list = text.split('\n')
word_freq = {} # Declares empty dictionary
for word in word_list:
word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
final=word.capitalize()
print(final + ': ' + str(word_freq[word])) # Line that prints the output
You can make this very easy for yourself by using a Counter object. If you want to count the occurrences of full lines, you can simply do:
from collections import Counter

with open('file.txt') as f:
    c = Counter(f)
print(c)
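Note that counting the file object directly keeps the trailing newlines in the keys; a small variation strips them first:
from collections import Counter

with open('file.txt') as f:
    c = Counter(line.rstrip('\n') for line in f)
print(c)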
Edit
Since you asked for a way without modules:
counter_dict = {}
with open('file.txt') as f:
    l = f.readlines()
    for line in l:
        if line not in counter_dict:
            counter_dict[line] = 0
        counter_dict[line] += 1
print(counter_dict)
Thank you all for the answers; most of the code produces the desired output, just in different ways. The code I ended up using with no modules was this:
file = open("test.txt", "r")
text = file.read() #reads file (I've tried .realine() & .readlines()
word_list = text.split('\n')
word_freq = {} # Declares empty dictionary
for word in word_list:
word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
final=word.capitalize()
print(final + ': ' + str(word_freq[word])) # Line that prints the output
The code I ended up using with modules was this:
from collections import Counter

c = Counter()
with open('live.txt') as f:
    for line in f:
        c[line.rstrip()] += 1

for k, v in c.items():
    print('{}: {}'.format(k, v))
My goal is to open a file and split it into unique words and display that list (along with a number count). I think I have to split the file into lines and then split those lines into words and add it all into a list.
The problem is that my program either runs in an infinite loop and does not display any results, or it only reads a single line and then stops. The file being read is the Gettysburg Address.
def uniquify(splitz, uniqueWords, lineNum):
    for word in splitz:
        word = word.lower()
        if word not in uniqueWords:
            uniqueWords.append(word)

def conjunctionFunction():
    uniqueWords = []
    with open(r'C:\Users\Alex\Desktop\Address.txt') as f:
        getty = [line.rstrip('\n') for line in f]
    lineNum = 0
    lines = getty[lineNum]
    getty.append("\n")
    while lineNum < 20:
        splitz = lines.split()
        lineNum += 1
        uniquify(splitz, uniqueWords, lineNum)
    print(uniqueWords)

conjunctionFunction()
Using your current code, the line:
lines = getty[lineNum]
should be moved within the while loop.
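In other words, something along these lines (a sketch of just the loop from your code):
while lineNum < 20:
    lines = getty[lineNum]  # fetch the current line inside the loop
    splitz = lines.split()
    lineNum += 1
    uniquify(splitz, uniqueWords, lineNum)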
You figured out what's wrong with your code, but nonetheless, I would do this slightly differently. Since you need to keep track of the number of unique words and their counts, you should use a dictionary for this task:
wordHash = {}
with open(r'C:\Users\Alex\Desktop\Address.txt', 'r') as f:
    for line in f:
        line = line.rstrip().lower()
        for word in line.split():
            if word not in wordHash:
                wordHash[word] = 1
            else:
                wordHash[word] += 1
print(wordHash)
def splitData(filename):
    return [words for words in open(filename).read().split()]
Easiest way to split a file into words :)
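A possible usage sketch, building a case-insensitive frequency dictionary from that word list (assuming an Address.txt file in the working directory, named after the question's file):
words = splitData('Address.txt')
counts = {}
for word in words:
    counts[word.lower()] = counts.get(word.lower(), 0) + 1
print(counts)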
Assume inp is retrieved from a file
inp = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense."""
data = inp.splitlines()
print data
_d = {}
for line in data:
word_lst = line.split()
for word in word_lst:
if word in _d:
_d[word] += 1
else:
_d[word] = 1
print _d.keys()
Output
['Beautiful', 'Flat', 'Simple', 'is', 'dense.', 'Explicit', 'better', 'nested.', 'Complex', 'ugly.', 'Sparse', 'implicit.', 'complex.', 'than', 'complicated.']
I recommend:
#!/usr/local/cpython-3.3/bin/python

import pprint
import collections

def genwords(file_):
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('gettysburg.txt', 'r') as file_:
        result = collections.Counter(genwords(file_))
        pprint.pprint(result)

main()
...but you could use re.findall to deal with punctuation better, instead of string.split.
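For example, a sketch of the same generator using re.findall so punctuation is stripped and case is ignored:
import re

def genwords(file_):
    for line in file_:
        # \w+ matches runs of letters, digits and underscores, skipping punctuation
        for word in re.findall(r'\w+', line.lower()):
            yield word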
How do I get the number of times a certain two-character string (e.g. 'hi') is used in a text file?
And how do I print the sum out as an int?
I tried doing this:
for line in open('test.txt'):
    ly = line.split()
    for i in ly:
        a = i.count('ly')
print(sum(a))
But it failed, thanks in advance!
Your program fails because your variable a is an integer and you cannot apply the sum function to an integer.
Several examples have already been presented. Here is mine:
with open("test.txt") as fp:
a = fp.read().count('ly')
print(a)
You can simply count 'ly' on each line:
sum(line.count('ly') for line in open('test.txt'))
Different approach:
from collections import Counter
text = open('text.txt').read()
word_count = Counter(text.split())
print word_count['hi']
for line in open('test.txt'):
    ly = line.split()
    alist = [i.count('hi') for i in ly]
    print sum(alist)
You can try something like this
a = 0
for line in open('test.txt'):
    ly = line.split()
    for i in ly:
        if 'word' in i:
            a = a + 1
print(a)
I have a Python script that, given a pattern, goes over a file and, for each line that matches the pattern, counts how many times that line shows up in the file.
The script is the following:
#!/usr/bin/env python
import time

fnamein = 'Log.txt'

def filter_and_count_matches(fnamein, fnameout, match):
    fin = open(fnamein, 'r')
    curr_matches = {}
    order_in_file = []  # need this because dict has no particular order
    for line in (l for l in fin if l.find(match) >= 0):
        line = line.strip()
        if line in curr_matches:
            curr_matches[line] += 1
        else:
            curr_matches[line] = 1
            order_in_file.append(line)
    #
    fout = open(fnameout, 'w')
    #for line in order_in_file:
    for line, _dummy in sorted(curr_matches.iteritems(),
                               key=lambda (k, v): (v, k), reverse=True):
        fout.write(line + '\n')
        fout.write(' = {}\n'.format(curr_matches[line]))
    fout.close()

def main():
    for idx, match in enumerate(open('staffs.txt', 'r').readlines()):
        curr_time = time.time()
        match = match.strip()
        fnameout = 'm{}.txt'.format(idx+1)
        filter_and_count_matches(fnamein, fnameout, match)
        print 'Processed {}. Time = {}'.format(match, time.time() - curr_time)

main()
So right now I am going over the file once for each pattern I want to check. Is it possible to do this by going over the file just once? The file is quite big, so it takes a while to process. It would be nice to be able to do this in an elegant, "easy" way. Thanks!
Looks like a Counter would do what you need:
from collections import Counter
lines = Counter([line for line in myfile if match_string in line])
For example, if myfile contains
123abc
abc456
789
123abc
abc456
and match_string is "abc", then the above code gives you
>>> lines
Counter({'123abc': 2, 'abc456': 2})
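When myfile is an actual file object, the trailing newlines end up in the keys; stripping them keeps the counts keyed on the bare lines (a small variation, reusing the fnamein and match_string names from above):
from collections import Counter

with open(fnamein) as myfile:
    lines = Counter(line.rstrip('\n') for line in myfile if match_string in line)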
For multiple patterns, how about this:
patterns = ["abc", "123"]
# initialize one Counter for each pattern
results = {pattern:Counter() for pattern in patterns}
for line in myfile:
for pattern in patterns:
if pattern in line:
results[pattern][line] += 1
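To mirror the original script's output files, each per-pattern Counter can then be written out sorted by count; a sketch using the same m{}.txt naming as the question:
for idx, pattern in enumerate(patterns):
    with open('m{}.txt'.format(idx + 1), 'w') as fout:
        # most_common() yields (line, count) pairs, most frequent first
        for line, count in results[pattern].most_common():
            fout.write(line.rstrip('\n') + '\n')
            fout.write(' = {}\n'.format(count))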