Counting the number of occurrences of a string in a text file - Python

I have a text file containing:
Rabbit:Grass
Eagle:Rabbit
Grasshopper:Grass
Rabbit:Grasshopper
Snake:Rabbit
Eagle:Snake
I want to count the number of occurrences of each string, that is, how many times each animal appears in the text file, and print the counts. Here's my code:
fileName = input("Enter the name of file:")
foodChain = open(fileName)
table = []
for line in foodChain:
    contents = line.strip().split(':')
    table.append(contents)
def countOccurence(l):
    count = 0
    for i in l:
        #I'm stuck here#
        count +=1
    return count
I'm unsure how Python can count the occurrences in a text file. The output I want is:
Rabbit: 4
Eagle: 2
Grasshopper: 2
Snake: 2
Grass: 2
I just need some help on the counting part and I will be able to manage the rest of it. Regards.

What you need is a dictionary.
dictionary = {}
for line in table:
    for animal in line:
        if animal in dictionary:
            dictionary[animal] += 1
        else:
            dictionary[animal] = 1
for animal, occurences in dictionary.items():
    print(animal, ':', occurences)
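For reference, here is a minimal self-contained sketch of the same dictionary approach, run on the sample data from the question. It assumes the lines are already in memory; the names `count_animals` and `sample_lines` are made up for this illustration, and `dict.get` stands in for the if/else.

```python
def count_animals(lines):
    """Count every name that appears on either side of a ':'."""
    counts = {}
    for line in lines:
        for animal in line.strip().split(':'):
            # get() returns 0 the first time a name is seen
            counts[animal] = counts.get(animal, 0) + 1
    return counts

# sample data copied from the question (hypothetical in-memory stand-in for the file)
sample_lines = [
    "Rabbit:Grass",
    "Eagle:Rabbit",
    "Grasshopper:Grass",
    "Rabbit:Grasshopper",
    "Snake:Rabbit",
    "Eagle:Snake",
]

counts = count_animals(sample_lines)
for animal, n in counts.items():
    print(animal + ': ' + str(n))
```

Keeping the counting in a function makes it easy to pass in an open file object instead of the list, since both yield one line at a time.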

A solution using str.split(), re.sub() and collections.Counter:
import re, collections
with open(filename, 'r') as fh:
    # setting space as a common delimiter
    contents = re.sub(r':|\n', ' ', fh.read()).split()
counts = collections.Counter(contents)
# iterating through `animal` counts
for a in counts:
    print(a, ':', counts[a])
The output:
Snake : 2
Rabbit : 4
Grass : 2
Eagle : 2
Grasshopper : 2
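If the regular expression feels heavy, a sketch along the same lines can feed the split names straight into Counter. The in-memory `text` below is an assumption standing in for `fh.read()`, and the variable names are illustrative only.

```python
from collections import Counter

# stand-in for fh.read(); same sample data as in the question
text = """Rabbit:Grass
Eagle:Rabbit
Grasshopper:Grass
Rabbit:Grasshopper
Snake:Rabbit
Eagle:Snake
"""

# one name per ':'-separated field on every line
counts = Counter(
    name
    for line in text.splitlines()
    for name in line.split(':')
)

# most_common() yields (name, count) pairs, highest count first
for name, n in counts.most_common():
    print(name, ':', n)
```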

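Use the in operator to test whether a value is an element of a list (it also works on strings, which behave like sequences of characters). Checking each row of table in turn counts how often a name appears:
def countOccurence(l):
    count = 0
    # count the rows of table that contain the name l
    for row in table:
        if l in row:
            count +=1
    return count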

from collections import defaultdict
dd = defaultdict(int)
with open(fpath) as f:
    for line in f:
        # strip the trailing newline so the last name on each line is counted correctly
        words = line.strip().split(':')
        for word in words:
            dd[word] += 1
for k,v in dd.items():
    print(k+': '+str(v))

Related

How to efficiently count word occurrences in Python without additional modules

Background
I'm working on a HackerRank problem Word Order. The task is to
Read the following input from stdin
4
bcdef
abcdefg
bcde
bcdef
Produce the output that reflects:
Number of unique words in first line
Count of occurrences for each unique words
Example:
3 # Number of unique words
2 1 1 # count of occurring words, 'bcdef' appears twice = 2
Problem
I've coded two solutions. The second one passes the initial tests but then fails because it exceeds the time limit. The first one would also produce correct output, but it sorts the results unnecessarily (and it would hit the time limit as well).
Notes
In the first solution I was unnecessarily sorting values; this is fixed in the second solution.
I'm keen to make better (proper) use of standard Python data structures and list/dictionary comprehensions. I would particularly like a solution that doesn't import any additional modules, with the exception of import os if needed.
Code
import os
def word_order(words):
    # Output no of distinct words
    distinct_words = set(words)
    n_distinct_words = len(distinct_words)
    print(str(n_distinct_words))
    # Count occurrences of each word
    occurrences = []
    for distinct_word in distinct_words:
        n_word_appearances = 0
        for word in words:
            if word == distinct_word:
                n_word_appearances += 1
        occurrences.append(n_word_appearances)
    occurrences.sort(reverse=True)
    print(*occurrences, sep=' ')
    # for o in occurrences:
    #     print(o, end=' ')
def word_order_two(words):
    '''
    Run through all words and only count multiple occurrences, do the maths
    to calculate unique words, etc. Attempt to construct a dictionary to make
    the operation more memory efficient.
    '''
    # Construct a count of word occurrences
    dictionary_words = {word:words.count(word) for word in words}
    # Unique words are equivalent to dictionary keys
    unique_words = len(dictionary_words)
    # Obtain sorted dictionary values
    # sorted_values = sorted(dictionary_words.values(), reverse=True)
    result_values = " ".join(str(value) for value in dictionary_words.values())
    # Output results
    print(str(unique_words))
    print(result_values)
    return 0
if __name__ == '__main__':
    q = int(input().strip())
    inputs = []
    for q_itr in range(q):
        s = input()
        inputs.append(s)
    # word_order(words=inputs)
    word_order_two(words=inputs)
Those nested loops are very bad performance wise (they make your algorithm quadratic) and quite unnecessary. You can get all counts in single iteration. You could use a plain dict or the dedicated collections.Counter:
from collections import Counter
def word_order(words):
    c = Counter(words)
    print(len(c))
    print(" ".join(str(v) for _, v in c.most_common()))
The "manual" implementation that shows the workings of the Counter and its methods:
def word_order(words):
    c = {}
    for word in words:
        c[word] = c.get(word, 0) + 1
    print(len(c))
    print(" ".join(str(v) for v in sorted(c.values(), reverse=True)))
    # print(" ".join(map(str, sorted(c.values(), reverse=True))))
Without any imports, you could count unique elements by
len(set(words))
and count their occurrences by
def counter(words):
    count = dict()
    for word in words:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
    return count.values()
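Putting the two import-free pieces together, a single pass over the input is enough for both output lines. The sketch below assumes the counts should be listed in order of first appearance, which is what the example output "2 1 1" suggests, and relies on dicts preserving insertion order (guaranteed from Python 3.7). The name `word_order_single_pass` is hypothetical.

```python
def word_order_single_pass(words):
    """Return (number of distinct words, counts in order of first appearance)."""
    count = {}
    for word in words:
        # one dictionary lookup per word, so the whole pass is linear
        count[word] = count.get(word, 0) + 1
    # dicts keep insertion order on Python 3.7+, so values() follows
    # the order in which each word was first seen
    return len(count), list(count.values())

n_distinct, occurrences = word_order_single_pass(['bcdef', 'abcdefg', 'bcde', 'bcdef'])
print(n_distinct)
print(*occurrences)
```

Compared with calling words.count(word) for every word, this avoids rescanning the whole list each time, which is where the quadratic cost comes from.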
You can use Counter, then print the output like this:
>>> from collections import Counter
>>> def counter_words(words):
...     cnt = Counter(words)
...     print(len(cnt))
...     print(*[str(v) for k,v in cnt.items()] , sep=' ')
...
>>> inputs = ['bcdef' , 'abcdefg' , 'bcde' , 'bcdef']
>>> counter_words(inputs)
3
2 1 1

Find frequency of words line by line in txt file Python (how to format properly)

I'm trying to make a simple program that can find the frequency of occurrences in a text file line by line. I have it outputting everything correctly except for when more than one word is on a line in the text file. (More information below)
The text file looks like this:
Hello
Hi
Hello
Good Day
Hi
Good Day
Good Night
I want the output to be: (Doesn't have to be in the same order)
Hello: 2
Hi: 2
Good Day: 2
Good Night: 1
What it's currently outputting:
Day: 2
Good: 3
Hello: 2
Hi: 2
Night: 1
My code:
file = open("test.txt", "r")
text = file.read() # reads file (I've also tried .readline() and .readlines())
word_list = text.split(None)
word_freq = {} # Declares empty dictionary
for word in word_list:
    word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
    final=word.capitalize()
    print(final + ': ' + str(word_freq[word])) # Line that prints the output
You want to preserve the lines. Don't split. Don't capitalize. Don't sort.
Use a Counter:
from collections import Counter
c = Counter()
with open('test.txt') as f:
    for line in f:
        c[line.rstrip()] += 1
for k, v in c.items():
    print('{}: {}'.format(k, v))
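To try the whole-line counting without creating a file, the sketch below uses io.StringIO as a stand-in for open('test.txt'); that substitution and the variable names are assumptions made for illustration only.

```python
import io
from collections import Counter

# io.StringIO stands in for open('test.txt') so the sketch runs without a file
fake_file = io.StringIO("Hello\nHi\nHello\nGood Day\nHi\nGood Day\nGood Night\n")

# each whole line (minus its newline) is one key, so "Good Day" stays together
line_counts = Counter(line.rstrip('\n') for line in fake_file)

for text, n in line_counts.items():
    print('{}: {}'.format(text, n))
```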
Instead of splitting the text on whitespace (None), split it on each line break so that every line becomes one item in the list.
file = open("test.txt", "r")
text = file.read() # reads file (I've also tried .readline() and .readlines())
word_list = text.split('\n')
word_freq = {} # Declares empty dictionary
for word in word_list:
    word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
    final=word.capitalize()
    print(final + ': ' + str(word_freq[word])) # Line that prints the output
You can make this very easy for yourself by using a Counter object. If you want to count the occurrences of full lines you can simply do:
from collections import Counter
with open('file.txt') as f:
    c = Counter(f)
print(c)
Edit
Since you asked for a way without modules:
counter_dict = {}
with open('file.txt') as f:
    l = f.readlines()
for line in l:
    if line not in counter_dict:
        counter_dict[line] = 0
    counter_dict[line] +=1
print(counter_dict)
Thank you all for the answers; most of the code produces the desired output, just in different ways. The code I ended up using with no modules was this:
file = open("test.txt", "r")
text = file.read() # reads file (I've also tried .readline() and .readlines())
word_list = text.split('\n')
word_freq = {} # Declares empty dictionary
for word in word_list:
    word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
    final=word.capitalize()
    print(final + ': ' + str(word_freq[word])) # Line that prints the output
The code I ended up using with modules was this:
from collections import Counter
c = Counter()
with open('live.txt') as f:
    for line in f:
        c[line.rstrip()] += 1
for k, v in c.items():
    print('{}: {}'.format(k, v))

Creating a ranking for lines of text file and keeping only top lines

Let's say I have a text file with thousands of lines of the following form:
Word Number1 Number2
In this text file, the "Word" is indeed some word that changes from one line to another, and the numbers are likewise changing numbers. However, some of these words are the same... Consider the following example:
Hello 5 7
Hey 3 2
Hi 7 3
Hi 5 2
Hello 1 4
Hey 5 2
Hello 8 1
What would be a python script that reads the text file and keeps only the lines that contain the highest Number1 for any given Word (deleting all lines that do not satisfy this condition)? The output for the above example with such a script would be:
Hi 7 3
Hey 5 2
Hello 8 1
Note: the order of the lines in the output is irrelevant; all that matters is that the above condition is satisfied. Also, if the highest Number1 for a given Word appears on two or more lines, the output should keep only one of them, so that each Word occurs only once in the output.
I've no clue how to approach the deletion aspect, but I can guess (perhaps incorrectly) that the first step would be to make a list from all the lines in the text file, i.e.
List1 = open("textfile.txt").readlines()
At any rate, many thanks in advance for the help!
You can try this:
f = [i.strip('\n').split() for i in open('the_file.txt')]
# list() makes the result indexable on Python 3, where map() returns an iterator
other_f = {i[0]: list(map(int, i[1:])) for i in f}
for i in f:
    if other_f[i[0]][0] < int(i[1]):
        other_f[i[0]] = list(map(int, i[1:]))
new_f = open('the_file.txt', 'w')
for a, b in other_f.items():
    new_f.write(a + " "+' '.join(map(str, b))+"\n")
new_f.close()
Output:
Hi 7 3
Hello 8 1
Hey 5 2
You can store the lines in a dict, with the words as keys. To make things easier, you can store a tuple with the value of the first numeric field (converted to integer, otherwise you would sort by lexicographic order) and the line.
We use dict.setdefault in case we encounter the word for the first time.
highest = {}
with open('text.txt') as f:
    for line in f:
        name, val, _ = line.split(' ', 2)
        val = int(val)
        if val > highest.setdefault(name, (val, line))[0]:
            highest[name] = (val, line)
out = [tup[1] for name, tup in highest.items()]
print(''.join(out))
# Hey 5 2
# Hello 8 1
# Hi 7 3
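The same idea can be sketched without a file, which makes it easier to check against the example in the question. The function name `keep_highest` and the in-memory `sample` list are assumptions for illustration; ties keep the first line seen, which satisfies the "only one line per word" requirement.

```python
def keep_highest(lines):
    """Keep, for each word, the first line with the largest Number1."""
    best = {}  # word -> (number1, original line)
    for line in lines:
        word, number1 = line.split()[:2]
        number1 = int(number1)  # compare as integers, not as strings
        # strict '>' means an equal value never replaces the stored line
        if word not in best or number1 > best[word][0]:
            best[word] = (number1, line)
    return [line for _, line in best.values()]

# sample data copied from the question
sample = ["Hello 5 7", "Hey 3 2", "Hi 7 3", "Hi 5 2",
          "Hello 1 4", "Hey 5 2", "Hello 8 1"]

kept = keep_highest(sample)
print('\n'.join(kept))
```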
First sort the list from high to low, using the 1st and 2nd columns as the key, then remove the duplicate items:
list1 = open(r'textfile.txt').read().splitlines()
sorted_dat = sorted(list1, key=lambda x:(x.split()[0], int(x.split()[1])), reverse=True)
# work on a copy so that items can be removed while iterating over sorted_dat
output = list(sorted_dat)
uniq_key = []
for i in sorted_dat:
    key = i.split()[0]
    if key in uniq_key:
        output.remove(i)
    else:
        uniq_key.append(key)
>>> output
['Hi 7 3', 'Hey 5 2', 'Hello 8 1']
Because file objects are iterable, it is not necessary to do the readlines up front. So let's open the file and then just iterate over it using a for loop.
fin = open('sometext.txt')
We create a dictionary to hold the results, as we go.
topwords = dict()
Iterating now, over the lines in the file:
for line in fin:
We strip off the newline characters and split each line into individual strings wherever there is whitespace (the default behavior of split()).
    word, val1, val2 = line.strip().split()
    val1 = int(val1)
We check whether we have already seen the word; if so, we check whether the first value is greater than the first value previously stored.
    if word in topwords:
        if val1 > topwords[word][0]:
            topwords[word] = [val1, val2]
    else:
        topwords[word] = [val1, val2]
Once we finish parsing all the lines, we go back and iterate over the top words and print the results to the screen.
for word in topwords:
    output = '{} {} {}'.format(word, *topwords[word])
    print(output)
The final script looks like this:
fin = open('sometext.txt')
topwords = dict()
for line in fin:
    word, val1, val2 = line.strip().split()
    val1 = int(val1)
    if word in topwords:
        if val1 > topwords[word][0]:
            topwords[word] = [val1, val2]
    else:
        topwords[word] = [val1, val2]
for word in topwords:
    output = '{} {} {}'.format(word, *topwords[word])
    print(output)

Dictionaries overwriting in Python

This program is meant to take the grammar rules found in Binary.txt and store them in a dictionary, where the rules are:
N = N D
N = D
D = 0
D = 1
But the current code returns D: D = 1, N: N = D, whereas I want N: N D, N: D, D: 0, D: 1.
import sys
import string
#default length of 3
stringLength = 3
#get last argument of command line(file)
filename1 = sys.argv[-1]
#get a length from user
try:
    stringLength = int(input('Length? '))
    filename = input('Filename: ')
except ValueError:
    print("Not a number")
#checks
print(stringLength)
print(filename)
def str2dict(filename="Binary.txt"):
    result = {}
    with open(filename, "r") as grammar:
        #read file
        lines = grammar.readlines()
        count = 0
        #loop through
        for line in lines:
            print(line)
            result[line[0]] = line
    print (result)
    return result
print (str2dict("Binary.txt"))
Firstly, your data structure of choice is wrong. Dictionary in python is a simple key-to-value mapping. What you'd like is a map from a key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list)
Next, where are you splitting on '='? You'll need to do that to get the key and value you are looking for:
key, value = line.split('=', 1) # returns a list, which gets unpacked into 2 variables
Putting the above two together, you'd go about in the following way:
result = defaultdict(list)
with open(filename, "r") as grammar:
    #read file
    lines = grammar.readlines()
    count = 0
    #loop through
    for line in lines:
        print(line)
        key, value = line.split('=', 1)
        result[key.strip()].append(value.strip())
return result
Dictionaries, by definition, cannot have duplicate keys. Therefore there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. For example:
from collections import defaultdict
# rest of your code...
result = defaultdict(list) # Use defaultdict so that an insert to an empty key creates a new list automatically
with open(filename, "r") as grammar:
    #read file
    lines = grammar.readlines()
    count = 0
    #loop through
    for line in lines:
        print(line)
        # strip() drops the trailing newline before storing the rule
        result[line[0]].append(line.strip())
print (result)
return result
This will result in something like:
{"N" : ["N = N D", "N = D"], "D" : ["D = 0", "D = 1"]}
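Combining both answers, a sketch that splits each rule on '=' and collects the right-hand sides per key could look like the following. The function name `parse_rules` and the in-memory list of lines are assumptions for illustration; in the real program the lines would come from Binary.txt.

```python
from collections import defaultdict

def parse_rules(lines):
    """Map each left-hand side to the list of its right-hand sides."""
    rules = defaultdict(list)
    for line in lines:
        if '=' not in line:
            continue  # skip blank or malformed lines
        key, value = line.split('=', 1)
        rules[key.strip()].append(value.strip())
    return rules

# the four grammar rules from the question, as they would be read from the file
rules = parse_rules(["N = N D\n", "N = D\n", "D = 0\n", "D = 1\n"])
print(dict(rules))
```

Storing only the right-hand side gives the N: N D, N: D, D: 0, D: 1 shape asked for in the question, with one list per key.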

generating a single outfile after analyzing multiple files in python

I have multiple files, each containing 8 or 9 columns.
For a single file, I have to read the last column, count the number of occurrences of each value in it, and then generate an output file.
I have done it like this:
inp = open(filename,'r').read().strip().split('\n')
out = open(filename,'w')
from collections import Counter
C = Counter()
for line in inp:
    k = line.split()[-1] #as to read last column
    C[k] += 1
for value,count in C.items():
    x = "%s %d" % (value,count)
    out.write(x)
    out.write('\n')
out.close()
Now the problem: it works fine if I have to generate one output for one input. But I need to scan a directory using the glob.iglob function to find all the files to use as input, run the above program on each file to gather its results, and then write the analyzed results for all files into a single OUTPUT file.
NOTE: While generating the single OUTPUT file, if a value is repeated, then instead of writing the same entry twice the counts should be summed. E.g. analysis of the 1st file generates:
123 6
111 5
0 6
45 5
and the 2nd file generates:
121 9
111 7
0 1
22 2
In this case the OUTPUT file must be written so that it contains:
123 6
111 12 #sum up count no. in case of similar value entry
0 7
45 5
22 2
I have written the program for single-file analysis, but I'm stuck on the mass-analysis part.
Please help.
from collections import Counter
import glob
out = open(filename,'w')
g_iter = glob.iglob('path_to_dir/*')
C = Counter()
for filename in g_iter:
    f = open(filename,'r')
    inp = f.read().strip().split('\n')
    f.close()
    for line in inp:
        k = line.split()[-1] #as to read last column
        C[k] += 1
for value,count in C.items():
    x = "%s %d" % (value,count)
    out.write(x)
    out.write('\n')
out.close()
After de-uglification:
from collections import Counter
import glob
def main():
    # create Counter
    cnt = Counter()
    # collect data
    for fname in glob.iglob('path_to_dir/*.dat'):
        with open(fname) as inf:
            cnt.update(line.split()[-1] for line in inf)
    # dump results (items() works on both Python 2 and 3)
    with open("summary.dat", "w") as outf:
        outf.writelines("{:5s} {:>5d}\n".format(val,num) for val,num in cnt.items())
if __name__=="__main__":
    main()
Initialise an empty dictionary at the top of the program, let's say dic=dict(). For each Counter, update dic so that the values of matching keys are summed and new keys are added.
To update dic, use this:
dic=dict( (n, dic.get(n, 0)+C.get(n, 0)) for n in set(dic)|set(C) )
where C is the current Counter. After all files are processed, write dic to the output file.
import glob
from collections import Counter
dic=dict()
g_iter = glob.iglob(r'c:\\python32\fol\*')
for x in g_iter:
    lis=[]
    with open(x) as f:
        inp = f.readlines()
    for line in inp:
        num=line.split()[-1]
        lis.append(num)
    C=Counter(lis)
    dic=dict( (n, dic.get(n, 0)+C.get(n, 0)) for n in set(dic)|set(C) )
for x in dic:
    print(x,'\t',dic[x])
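As a side note, Counter can do this merging itself: Counter.update adds counts for keys it already holds and inserts new ones. The sketch below uses the two per-file results from the question as hard-coded Counters (an assumption, so it runs without any files); the variable names are illustrative.

```python
from collections import Counter

# per-file results copied from the question, hard-coded for illustration
first = Counter({'123': 6, '111': 5, '0': 6, '45': 5})
second = Counter({'121': 9, '111': 7, '0': 1, '22': 2})

total = Counter()
for per_file in (first, second):
    # update() sums counts for existing keys and adds any new keys
    total.update(per_file)

for value, count in total.items():
    print(value, count)
```

With real files, each `per_file` Counter would be built from the last column of one file, or the lines could be fed into `total.update(...)` directly as in the de-uglified answer.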
I did it like this:
import glob
out = open("write.txt",'a')
from collections import Counter
C = Counter()
for file in glob.iglob('temp*.txt'):
    for line in open(file,'r').read().strip().split('\n'):
        k = line.split()[-1] #as to read last column
        C[k] += 1
for value,count in C.items():
    x = "%s %d" % (value,count)
    out.write(x)
    out.write('\n')
out.close()
