Counting the frequency of each word in a given text [closed] - python

I am looking for a Python program that counts the frequency of each word in a text and outputs each word with its count and the line numbers where it appears.
We define a word as a contiguous sequence of non-white-space characters. (Hint: split())
Note: different capitalizations of the same character sequence should be considered the same word, e.g. Python and python, I and i.
The input will be several lines, with an empty line terminating the text. Only alphabetic characters and white space will be present in the input.
The output is formatted as follows:
Each line begins with a number indicating the frequency of the word, a white space, then the word itself, and a list of line numbers containing this word.
Sample Input
Python is a cool language but OCaml
is even cooler since it is purely functional
Sample Output
3 is 1 2
1 a 1
1 but 1
1 cool 1
1 cooler 2
1 even 2
1 functional 2
1 it 2
1 language 1
1 ocaml 1
1 purely 2
1 python 1
1 since 2
PS.
I am not a student; I am learning Python on my own.

Using collections.defaultdict, collections.Counter and string formatting:
from collections import Counter, defaultdict

data = """Python is a cool language but OCaml
is even cooler since it is purely functional"""

result = defaultdict(lambda: [0, []])
for i, l in enumerate(data.splitlines()):
    for k, v in Counter(l.split()).items():
        result[k][0] += v
        result[k][1].append(i+1)

for k, v in result.items():
    print('{1} {0} {2}'.format(k, *v))
Output:
1 since [2]
3 is [1, 2]
1 a [1]
1 it [2]
1 but [1]
1 purely [2]
1 cooler [2]
1 functional [2]
1 Python [1]
1 cool [1]
1 language [1]
1 even [2]
1 OCaml [1]
If the order matters, you can sort the result this way:
items = sorted(result.items(), key=lambda t: (-t[1][0], t[0].lower()))
for k, v in items:
    print('{1} {0} {2}'.format(k, *v))
Output:
3 is [1, 2]
1 a [1]
1 but [1]
1 cool [1]
1 cooler [2]
1 even [2]
1 functional [2]
1 it [2]
1 language [1]
1 OCaml [1]
1 purely [2]
1 Python [1]
1 since [2]

Frequency tabulations are often best solved with a counter.
from collections import Counter

word_count = Counter()
with open('input', 'r') as f:
    for line in f:
        for word in line.split():
            word_count[word.strip().lower()] += 1

for word, count in word_count.items():
    print("word: {}, count: {}".format(word, count))

Ok, so you've already identified split to turn your string into a list of words. You want to list the lines on which each word occurs, however, so you should split the string first into lines, then into words. Then you can create a dictionary where the keys are the words (lowercased first) and the values are a structure containing the number of occurrences and the lines of occurrence.
You may also want to put in some code to check whether something is a valid word (e.g. whether it contains numbers) and to sanitise a word (remove punctuation). I'll leave these up to you; a short sketch of such helpers follows the sample output below.
def wsort(item):
    # sort descending by count, then ascending alphabetically
    word, freq = item
    return -freq['count'], word

def wfreq(text):
    words = {}
    # split by line, then by word
    lines = [line.split() for line in text.split('\n')]
    for i in range(len(lines)):
        for word in lines[i]:
            # if the word is not in the dictionary, create the entry
            word = word.lower()
            if word not in words:
                words[word] = {'count': 0, 'lines': set()}
            # update the count and add the line number to the set
            words[word]['count'] += 1
            words[word]['lines'].add(i+1)
    # convert from a dictionary to a sorted list using wsort to give the order
    return sorted(words.items(), key=wsort)

inp = "Python is a cool language but OCaml\nis even cooler since it is purely functional"
for word, freq in wfreq(inp):
    # generate the desired list format
    lines = " ".join(str(l) for l in sorted(freq['lines']))
    print("%i %s %s" % (freq['count'], word, lines))
This should provide the exact same output as in your sample:
3 is 1 2
1 a 1
1 but 1
1 cool 1
1 cooler 2
1 even 2
1 functional 2
1 it 2
1 language 1
1 ocaml 1
1 purely 2
1 python 1
1 since 2
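For the validity check and sanitising mentioned above, a minimal sketch might be (the helper names sanitise and is_valid_word are just illustrative, not part of the answer):
import string

def sanitise(word):
    # strip surrounding punctuation and normalise case
    return word.strip(string.punctuation).lower()

def is_valid_word(word):
    # accept a token only if it is purely alphabetic
    return word.isalpha()

tokens = [sanitise(w) for w in "Python, is cool; OCaml too".split()]
print([w for w in tokens if is_valid_word(w)])  # ['python', 'is', 'cool', 'ocaml', 'too']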

First of all, find all the words that are present in the text, using split().
If the text is in a file, first read it into a single string; splitting that string on whitespace also takes care of the \n characters:
filin = open('file', 'r')
di = filin.readlines()
text = ''
for i in di:
    text += i
# every whitespace-separated token is a word; lowercase so that capitalisation doesn't matter
words_list = set(text.lower().split())
Now count the number of times each word appears in the text; we will deal with the line numbers later.
dicts = {}
for i in words_list:
    dicts[i] = 0
for j in text.lower().split():
    dicts[j] += 1
Now we have a dictionary with the words as keys and the values being the number of times each word appears in the text.
Now for the line numbers:
dicts2 = {}
for i in words_list:
    dicts2[i] = []
for i in words_list:
    filin.seek(0)
    count = 1
    for j in filin:
        if i in j.lower().split():
            dicts2[i].append(count)
        count += 1
Now dicts2 has the words as the keys and, as the values, the list of line numbers each word appears on.
If the data is already in a string, you just need to split it on the \n characters instead of reading from a file:
di = string_containing_text.split('\n')
and everything else stays the same (iterate over di instead of over filin).
I am sure you can format the output; a short sketch follows.
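For the output format asked for in the question (frequency, word, then line numbers), a minimal sketch, assuming dicts and dicts2 are built as above, could be:
for word in sorted(dicts, key=lambda w: (-dicts[w], w)):  # highest count first, then alphabetical
    print(dicts[word], word, ' '.join(str(n) for n in dicts2[word]))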

Related

Iterating through zipped objects in Python

I've been having a hard time trying to solve this recently (although this looks like a trivial matter).
I have these 3 dictionaries:
letters_words = {'A': ['allow', 'arise'], 'B': ['bring', 'buy']}
words_cxns = {'allow': ['CXN1', 'CXN2'], 'arise': ['CXN1', 'CXN3'], 'bring': ['CXN2', 'CXN3'], 'buy': ['CXN3']}
cxns_ids = {'CXN1': 1, 'CXN2': 2, 'CXN3': 3}
Every letter has a few words, every word is associated with certain constructions, every construction has an id.
In the end I want to get this:
A
allow
CXN1, 1
CXN2, 2
arise
CXN1, 1
CXN3, 3
B
bring
CXN2, 2
CXN3, 3
buy
CXN3, 3
The spaces and punctuation don't matter... The main thing is that it gets listed right.
Here is what I'm currently doing:
for letter, words in zip(letters_words.keys(), letters_words.values()):
    print(letter)
    for word in words:
        print(word)
        for w, cnxs in zip(words_cxns.keys(), words_cxns.values()):
            if w == word:
                for c in cxns:
                    for cxn, ix in zip(cxns_ids.keys(), cxns_ids.values()):
                        if cxn == c:
                            print(c, ix)
However, my output looks like this at the moment:
A
allow
CXN1 1
CXN2 2
CXN3 3
arise
CXN1 1
CXN2 2
CXN3 3
B
bring
CXN1 1
CXN2 2
CXN3 3
buy
CXN1 1
CXN2 2
CXN3 3
What am I missing? :/
You do not need zip for this task, because the constructions depend only on the word itself, not on how you iterate over the dictionaries. Here is a possible solution that produces your desired output:
for letter, words in letters_words.items():
    print('\n' + letter)
    for word in words:
        print('\n' + word)
        cxns = words_cxns[word]
        for cxn in cxns:
            cxn_id = cxns_ids[cxn]
            print(cxn, ',', cxn_id)
No need to zip:
letters_words = {'A': ['allow', 'arise'], 'B': ['bring', 'buy']}
words_cxns = {'allow': ['CXN1', 'CXN2'], 'arise': ['CXN1', 'CXN3'], 'bring': ['CXN2', 'CXN3'], 'buy': ['CXN3']}
cxns_ids = {'CXN1': 1, 'CXN2': 2, 'CXN3': 3}
for k, v in letters_words.items():
    print("\n" + k + "\n")
    for w in v:
        print(w)
        for word in words_cxns[w]:
            print(word, cxns_ids[word])
Output:
A
allow
CXN1 1
CXN2 2
arise
CXN1 1
CXN3 3
B
bring
CXN2 2
CXN3 3
buy
CXN3 3
Try this; the idea is to get the cxns directly from the dictionary instead of using a second zip object. I commented on the relevant line.
for letter, words in zip(letters_words.keys(), letters_words.values()):
    print(letter)
    for word in words:
        print(word)
        # no need to create a new zip object, get the value from the dict instead
        for cxns in words_cxns[word]:
            print(cxns, cxns_ids[cxns])
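As a side note, zip(letters_words.keys(), letters_words.values()) pairs every key with its value, which is exactly what letters_words.items() already gives you (in modern Python both iterate in the same insertion order):
letters_words = {'A': ['allow', 'arise'], 'B': ['bring', 'buy']}
print(list(zip(letters_words.keys(), letters_words.values())) == list(letters_words.items()))  # True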
That's embarrassing, but I've made a typo which I couldn't find for 2 days! On line 6 of my code suggestion, I've written cnxs instead of cxns. Once I changed it, everything worked!

How to efficiently count word occurrences in Python without additional modules

Background
I'm working on a HackerRank problem Word Order. The task is to
Read the following input from stdin
4
bcdef
abcdefg
bcde
bcdef
Produce output that reflects:
Number of unique words on the first line
Count of occurrences for each unique word on the second line
Example:
3 # Number of unique words
2 1 1 # count of occurring words, 'bcdef' appears twice = 2
Problem
I've coded two solutions; the second one passes the initial tests but fails due to exceeding the time limit. The first one would also work, but I was unnecessarily sorting the outputs (and the time limit issue would occur there as well).
Notes
In the first solution I was unnecessarily sorting values; this is fixed in the second solution.
I'm keen to make better (proper) use of standard Python data structures and list/dictionary comprehensions. I would be particularly keen to receive a solution that doesn't import any additional modules, with the exception of import os if needed.
Code
import os

def word_order(words):
    # Output no of distinct words
    distinct_words = set(words)
    n_distinct_words = len(distinct_words)
    print(str(n_distinct_words))
    # Count occurrences of each word
    occurrences = []
    for distinct_word in distinct_words:
        n_word_appearances = 0
        for word in words:
            if word == distinct_word:
                n_word_appearances += 1
        occurrences.append(n_word_appearances)
    occurrences.sort(reverse=True)
    print(*occurrences, sep=' ')
    # for o in occurrences:
    #     print(o, end=' ')

def word_order_two(words):
    '''
    Run through all words and only count multiple occurrences, do the maths
    to calculate unique words, etc. Attempt to construct a dictionary to make
    the operation more memory efficient.
    '''
    # Construct a count of word occurrences
    dictionary_words = {word: words.count(word) for word in words}
    # Unique words are equivalent to dictionary keys
    unique_words = len(dictionary_words)
    # Obtain sorted dictionary values
    # sorted_values = sorted(dictionary_words.values(), reverse=True)
    result_values = " ".join(str(value) for value in dictionary_words.values())
    # Output results
    print(str(unique_words))
    print(result_values)
    return 0

if __name__ == '__main__':
    q = int(input().strip())
    inputs = []
    for q_itr in range(q):
        s = input()
        inputs.append(s)
    # word_order(words=inputs)
    word_order_two(words=inputs)
Those nested loops are very bad performance-wise (they make your algorithm quadratic) and quite unnecessary. You can get all the counts in a single iteration. You could use a plain dict or the dedicated collections.Counter:
from collections import Counter

def word_order(words):
    c = Counter(words)
    print(len(c))
    print(" ".join(str(v) for _, v in c.most_common()))
A "manual" implementation that shows what the Counter and its methods do under the hood:
def word_order(words):
    c = {}
    for word in words:
        c[word] = c.get(word, 0) + 1
    print(len(c))
    print(" ".join(str(v) for v in sorted(c.values(), reverse=True)))
    # print(" ".join(map(str, sorted(c.values(), reverse=True))))
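Either version can be sanity-checked against the sample from the question, passing the words as a list the way the main block does:
word_order(['bcdef', 'abcdefg', 'bcde', 'bcdef'])
# 3
# 2 1 1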
Without any imports, you could count unique elements by
len(set(words))
and count their occurrences by
def counter(words):
    count = dict()
    for word in words:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
    return count.values()
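To turn that helper into the problem's two output lines, a short sketch reusing the sample input from the question could look like this:
words = ['bcdef', 'abcdefg', 'bcde', 'bcdef']
counts = counter(words)
print(len(counts))                       # 3 unique words
print(' '.join(str(c) for c in counts))  # 2 1 1, in order of first appearance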
You can use Counter then print output like below:
>>> from collections import Counter
>>> def counter_words(words):
...     cnt = Counter(words)
...     print(len(cnt))
...     print(*[str(v) for k, v in cnt.items()], sep=' ')
...
>>> inputs = ['bcdef', 'abcdefg', 'bcde', 'bcdef']
>>> counter_words(inputs)
3
2 1 1

Why I'm having a string indexing problem in Python

I'm trying to understand why I get the same index again when I apply .index() or .find().
Why am I getting the same index, 2, again rather than 3 when a letter is repeated, and what is the alternative way to get index 3 for the second 'l'?
text = 'Hello'
for i in text:
    print(text.index(i))
the output is:
0
1
2
2
4
It's because .index() returns the lowest or first index of the substring within the string. Since the first occurrence of l in hello is at index 2, you'll always get 2 for "hello".index("l").
So when you're iterating through the characters of hello, you get 2 twice and never 3 (for the second l). Expanded into separate lines, it looks like this:
"hello".index("h") # = 0
"hello".index("e") # = 1
"hello".index("l") # = 2
"hello".index("l") # = 2
"hello".index("o") # = 4
Edit: Alternative way to get all indices:
One way to print all the indices (although not sure how useful this is since it just prints consecutive numbers) is to remove the character you just read from the string:
removed = 0
string = "hello world"  # original string
for char in string:
    print("{} at index {}".format(char, string.index(char) + removed))  # index is index() + how many chars we've removed
    string = string[1:]  # remove the char we just read
    removed += 1  # increment removed count
text = 'Hello'
for idx, ch in enumerate(text):
    print(f'char {ch} at index {idx}')
output
char H at index 0
char e at index 1
char l at index 2
char l at index 3
char o at index 4
If you want to find the second occurrence, you should search in the substring after the first occurrence:
text = 'Hello'
first_index = text.index('l')
print('First index:', first_index)
second_index = text.index('l', first_index+1)  # search in the substring after the first occurrence
print('Second index:', second_index)
The output is:
First index: 2
Second index: 3
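If you need every position of a repeated character at once, a common idiom is a list comprehension over enumerate:
text = 'Hello'
indices = [i for i, ch in enumerate(text) if ch == 'l']
print(indices)  # [2, 3]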

Trouble printing things out so that they are aligned

I'm trying to print out a list of results I have, but I want them to be aligned. They currently look like this:
table
word: frequency:
i 9
my 2
to 2
test 2
it 2
hate 1
stupid 1
accounting 1
class 1
because 1
its 1
from 1
six 1
seven 1
pm 1
how 1
is 1
this 1
helping 1
becuase 1
im 1
going 1
do 1
a 1
little 1
on 1
freind 1
ii 1
I want the frequencies to be aligned with each other so they aren't going in this weird zigzag form. I tried playing around with the format but it didn't work. This is what my code looks like:
import string
from collections import OrderedDict

f = open('mariah.txt', 'r')
a = f.read()  # read the text file as it would normally look, i.e. no \n or anything
# print(a)
c = a.lower()  # convert everything in the text file to lowercase
# print(c)
y = c.translate(str.maketrans('', '', string.punctuation))  # get rid of any punctuation
# print(y)
words_in_novel = y.split()  # splitting every word apart; the default for split() is on whitespace characters. Why, when I split on " ", does it then give me \n?
# print(words_in_novel)

count = {}
for word in words_in_novel:
    # print(word)
    if word in count:  # if the word from words_in_novel is already in count, add one to that counter
        count[word] += 1
    else:
        count[word] = 1  # if the word appears for the first time, set its count to 1

print(count)
print("\n\n\n\n\n\n\n")

# this orders the dictionary, sorting by the second term, where t[1] refers to the value after the colon
# reverse so we are sorting from greatest to least values
g = (sorted(count.items(), key=lambda t: t[1], reverse=True))
# g = OrderedDict(sorted(count.items(), key=lambda t: t[1]))
print(g)
print("\n\n\n\n\n\n\n")

print("{:^20}".format("table"))
print("{}{:>20}".format("word:", "frequency:"))
for i in g:
    # z = g[i]
    # print(i)
    # a = len(i[0])
    # print(a)
    # c = 50 + a
    # print(c)
    print("{}{:>20}".format(i[0], i[1]))
Does anyone know how to make them line up in a straight column?
You need to adjust the width/alignment of your 1st column, not the 2nd.
The right way:
...
print("{:<20}{}".format("word:","frequency:"))
for i in g:
    print("{:<20}{}".format(i[0],i[1]))
The output would look like this:
word: frequency:
i 9
my 2
...
accounting 2
class 1
because 1
...
OK, for this part of your code:
for i in g:
    r = " " * 25
    # print("{}{:>20}".format(i[0],i[1]))
    r = i[0] + r[len(i[0]):]  # strings are immutable, so rebuild r with the word padded to a fixed width
    r = r[:22] + str(i[1])    # put the count at column 22
    print(r)
it should work.
If you even find that the frequency is greater than a single digit you could try something like this:
max_len = max(len(i[0]) for i in g)
format_str = "{{:<{}}}{{:>{}}}".format(max_len, 20 - max_len)
for i in g:
    print(format_str.format(i[0], i[1]))
Align the words too:
print("{:<10}{:>10}".format(i[0],i[1]))
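If a hard-coded width of 10 or 20 ever turns out to be too narrow, you can compute the column width from the longest word first; a sketch, assuming g is the sorted (word, count) list from the question:
width = max(len(word) for word, _ in g) + 2  # longest word plus a little padding
print("{:<{w}}{}".format("word:", "frequency:", w=width))
for word, count in g:
    print("{:<{w}}{}".format(word, count, w=width))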

Creating a ranking for lines of text file and keeping only top lines

Let's say I have a text file with thousands of lines of the following form:
Word Number1 Number2
In this text file, the "Word" is indeed some word that changes from one line to another, and the numbers are likewise changing numbers. However, some of these words are the same... Consider the following example:
Hello 5 7
Hey 3 2
Hi 7 3
Hi 5 2
Hello 1 4
Hey 5 2
Hello 8 1
What would be a python script that reads the text file and keeps only the lines that contain the highest Number1 for any given Word (deleting all lines that do not satisfy this condition)? The output for the above example with such a script would be:
Hi 7 3
Hey 5 2
Hello 8 1
Note: the order of the lines in the output is irrelevant; all that matters is that the above condition is satisfied. Also, if for a given Word the highest Number1 is the same for two or more lines, the output should keep only one of them, such that there is only one occurrence of any Word in the output.
I've no clue how to approach the deletion aspect, but I can guess (perhaps incorrectly) that the first step would be to make a list from all the lines in the text file, i.e.
List1 = open("textfile.txt").readlines()
At any rate, many thanks in advance for the help!
You can try this:
f = [i.strip('\n').split() for i in open('the_file.txt')]
other_f = {i[0]: list(map(int, i[1:])) for i in f}
for i in f:
    if other_f[i[0]][0] < int(i[1]):
        other_f[i[0]] = list(map(int, i[1:]))
new_f = open('the_file.txt', 'w')
for a, b in other_f.items():
    new_f.write(a + " " + ' '.join(map(str, b)) + "\n")
new_f.close()
Output:
Hi 7 3
Hello 8 1
Hey 5 2
You can store the lines in a dict, with the words as keys. To make things easier, you can store a tuple with the value of the first numeric field (converted to integer, otherwise you would sort by lexicographic order) and the line.
We use dict.setdefault in case we encounter the word for the first time.
highest = {}
with open('text.txt') as f:
    for line in f:
        name, val, _ = line.split(' ', 2)
        val = int(val)
        if val > highest.setdefault(name, (val, line))[0]:
            highest[name] = (val, line)
out = [tup[1] for name, tup in highest.items()]
print(''.join(out))
# Hey 5 2
# Hello 8 1
# Hi 7 3
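The setdefault call is doing two jobs here: if the word is already a key it returns the stored (val, line) tuple, and otherwise it inserts the new tuple and returns it, so the comparison always has something to compare against. A tiny demonstration:
highest = {}
print(highest.setdefault('Hello', (5, 'Hello 5 7\n')))  # not present yet: inserts and returns (5, 'Hello 5 7\n')
print(highest.setdefault('Hello', (1, 'Hello 1 4\n')))  # already present: the stored (5, ...) comes back unchanged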
First sort the list using the 1st and 2nd columns as the key, from high to low, then remove the duplicate items:
list1 = open(r'textfile.txt').read().splitlines()
sorted_dat = sorted(list1, key=lambda x: (x.split()[0], int(x.split()[1])), reverse=True)
output = list(sorted_dat)
uniq_key = []
for i in sorted_dat:
    key = i.split()[0]
    if key in uniq_key:
        output.remove(i)
    else:
        uniq_key.append(key)
>>> output
['Hi 7 3', 'Hey 5 2', 'Hello 8 1']
Because file objects are iterable, it is not necessary to do the readlines up front. So let's open the file and then just iterate over it using a for loop.
fin = open('sometext.txt')
We create a dictionary to hold the results, as we go.
topwords = dict()
Iterating now, over the lines in the file:
for line in fin:
We strip off the new line characters and split the lines into individual strings, based on where the spaces are (the default behavior for split()).
    word, val1, val2 = line.strip().split()
    val1 = int(val1)
We check to see if we have already seen the word, if yes, we then check to see if the first value is greater than the first value previously stored.
    if word in topwords:
        if val1 > topwords[word][0]:
            topwords[word] = [val1, val2]
    else:
        topwords[word] = [val1, val2]
Once we finish parsing all the words, we go back and iterate over the top words and print the results to the screen.
for word in topwords:
    output = '{} {} {}'.format(word, *topwords[word])
    print(output)
The final script looks like this:
fin = open('sometext.txt')
topwords = dict()
for line in fin:
    word, val1, val2 = line.strip().split()
    val1 = int(val1)
    if word in topwords:
        if val1 > topwords[word][0]:
            topwords[word] = [val1, val2]
    else:
        topwords[word] = [val1, val2]

for word in topwords:
    output = '{} {} {}'.format(word, *topwords[word])
    print(output)
