Only print specific amount of Counter items, with decent formatting - python

I'm trying to print the top N most frequently used words in a text file. So far, the file handling and the counter work; I just can't figure out how to print the number of items I want in a pretty way. Here is my code:
import re
from collections import Counter
def wordcount(user):
    """
    Docstring for word count.
    """
    file=input("Enter full file name w/ extension: ")
    num=int(input("Enter how many words you want displayed: "))
    with open(file) as f:
        text = f.read()
    words = re.findall(r'\w+', text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    char, n = word_counts.most_common(num)[0]
    print ("WORD: %s \nOCCURENCE: %d " % (char, n) + '\n')
Basically, I just want a loop of some sort that prints the following.
For instance, with num=3
it would print the 3 most frequently used words and their counts:
WORD: Blah Occurrence: 3
Word: bloo Occurrence: 2
Word: blee Occurrence: 1

I would iterate over the result of most_common() as follows:
most_common = word_counts.most_common(num) # removed the [0] since we're not looking only at the first item!
for item in most_common:
    print("WORD: {} OCCURRENCE: {}".format(item[0], item[1]))
Two comments:
1. Use format() to format strings instead of % - you'll thank me later for this advice!
2. This way you can iterate over any number of "top N" results without hardcoding "3" into your code.
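Putting the loop together with the counting code from the question, a minimal self-contained sketch might look like the following. The function names `top_words` and `format_top_words` and the column width of 15 are my own choices for illustration, not something from the question; the text is passed in as a string so the sketch can be tried without a file.

```python
import re
from collections import Counter


def top_words(text, num):
    """Return the num most frequent words in text as (WORD, count) pairs."""
    words = re.findall(r'\w+', text)
    return Counter(word.upper() for word in words).most_common(num)


def format_top_words(pairs):
    """Format (word, count) pairs as one aligned line per word."""
    return ["WORD: {:<15} OCCURRENCE: {}".format(word, count)
            for word, count in pairs]


sample = "blah bloo blah blee bloo blah"
for line in format_top_words(top_words(sample, 3)):
    print(line)
```

Left-aligning the word in a fixed-width field keeps the counts in one column, which is probably what "pretty" means here; adjust the width to the longest word you expect.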

Save the most common elements and loop over them.
common = word_counts.most_common(num)  # keep the whole list, not just element [0]
for i in range(len(common)):
    print("WORD: %s \nOCCURENCE: %d \n" % (common[i][0], common[i][1]))

Related

Find the occurrence of a particular word from a file in python [duplicate]

I'm trying to find the number of occurrences of a word in a string.
word = "dog"
str1 = "the dogs barked"
I used the following to count the occurrences:
count = str1.count(word)
The issue is I want an exact match. So the count for this sentence would be 0.
Is that possible?
If you're going for efficiency:
import re
count = sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(word), input_string))
This doesn't need to create any intermediate lists (unlike split()) and thus will work efficiently for large input_string values.
It also has the benefit of working correctly with punctuation: it will properly return 1 as the count for the phrase "Mike saw a dog." (whereas an argumentless split() would not). It uses the \b regex anchor, which matches at word boundaries (transitions between \w, a.k.a. [a-zA-Z0-9_] for ASCII text, and anything else).
If you need to worry about languages beyond the ASCII character set, you may need to adjust the regex to properly match non-word characters in those languages, but for many applications this would be an overcomplication, and in many other cases setting the unicode and/or locale flags for the regex would suffice.
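As a rough sketch of the above wrapped in a function: the name `count_exact` and the `ignore_case` parameter are my own additions for illustration. In Python 3, str patterns are Unicode-aware by default, so \b and \w already handle accented letters without extra flags.

```python
import re


def count_exact(word, text, ignore_case=False):
    """Count whole-word occurrences of word in text using regex word boundaries."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r'\b' + re.escape(word) + r'\b'
    # finditer yields matches lazily, so no intermediate list is built
    return sum(1 for _ in re.finditer(pattern, text, flags))


print(count_exact("dog", "the dogs barked"))
print(count_exact("dog", "Mike saw a dog."))
print(count_exact("dog", "Dog eat dog", ignore_case=True))
```

Note that \b only behaves as expected when the search word itself starts and ends with a word character.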
You can use str.split() to convert the sentence to a list of words:
a = 'the dogs barked'.split()
This will create the list:
['the', 'dogs', 'barked']
You can then count the number of exact occurrences using list.count():
a.count('dog') # 0
a.count('dogs') # 1
If it needs to work with punctuation, you can use regular expressions. For example:
import re
a = re.split(r'\W', 'the dogs barked.')
a.count('dogs') # 1
Use a generator expression:
>>> word = "dog"
>>> str1 = "the dogs barked"
>>> sum(w == word for w in str1.split())
0
>>> word = 'dog'
>>> str1 = 'the dog barked'
>>> sum(w == word for w in str1.split())
1
split() returns a list of all the words in a sentence. The generator expression then yields True for each word equal to the one we are looking for, and sum() counts those matches.
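import re
word = "dog"
text = "the dogs barked"
# \b restricts the match to whole words; without it "dog" would also match inside "dogs"
print(len(re.findall(r'\b%s\b' % re.escape(word), text)))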
You need to split the sentence into words. For your example you can do that with just
words = str1.split()
But for real-world usage you need something more advanced that also handles punctuation. For most Western languages you can get away with replacing all punctuation with spaces before doing str1.split().
This will work for English as well in simple cases, but note that "I'm" will be split into two words, "I" and "m", when it should in fact be split into "I" and "am". Handling that may be overkill for this application.
For other cases, such as Asian languages or real-world usage of English, you might want to use a library that does the word splitting for you.
Then you have a list of words, and you can do
count = words.count(word)
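A minimal sketch of the "replace punctuation with spaces, then split" step, assuming ASCII punctuation only (string.punctuation). The helper name `split_words` is mine, not from the answer.

```python
import string


def split_words(text):
    """Replace ASCII punctuation with spaces, then split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()


words = split_words("The dog barked. Then the dog's owner called: dog!")
print(words)
print(words.count("dog"))
```

As mentioned above, contractions and possessives get split at the apostrophe ("dog's" becomes "dog" and "s"), which may or may not be what you want.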
# counting the number of words in the text
def count_word(text,word):
    """
    Function that takes the text, splits it into words,
    and counts the number of occurrences of that word
    input: text and word
    output: number of times the word appears
    """
    answer = text.split(" ")
    count = 0
    for occurence in answer:
        if word == occurence:
            count = count + 1
    return count
sentence = "To be a programmer you need to have a sharp thinking brain"
word_count = "a"
print(sentence.split(" "))
print(count_word(sentence,word_count))
# output
['To', 'be', 'a', 'programmer', 'you', 'need', 'to', 'have', 'a', 'sharp', 'thinking', 'brain']
2
Create a function that takes two inputs: the text and the word to count.
Split the text into a list of words,
then compare each of those words with the word to be counted and return the number of matches.
If you don't need regular expressions, you can use this neat trick:
word = " is " # Add a leading and a trailing space.
input_string = "This is some random text and this is str which is mutable"
print("Word count : ",input_string.count(word))
Output -- Word count : 3
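One caveat, as far as I can tell: the padded search misses a word at the very start or end of the string, and because str.count() does not count overlapping matches, it also misses the second of two back-to-back repeats. The sketch below contrasts it with a plain split-based count; both function names are my own.

```python
def count_padded(text, word):
    """Space-padding trick: misses matches at the ends and back-to-back repeats."""
    return text.count(" " + word + " ")


def count_split(text, word):
    """Whitespace-split comparison: counts every exact token."""
    return text.split().count(word)


text = "is this is what is is"
print(count_padded(text, "is"))
print(count_split(text, "is"))
```

Neither variant handles punctuation; for that, the regex approaches in the other answers are a better fit.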
Below is a simple example that replaces the desired word with a new string, for a desired number of occurrences:
def censor(text, word):
    newString = text.replace(word, "+" * len(word), text.count(word))
    return newString
print(censor("hey hey hey", "hey"))
The output will be: +++ +++ +++
The first parameter of replace() is the search string.
The second is the new string that replaces the search string.
The third and last is the maximum number of occurrences to replace.
Let us consider the example s = "suvotisuvojitsuvo".
If you want separate counts for "suvo" and "suvojit", use the count() method. To count only the standalone "suvo", you must not count the "suvo" that is part of "suvojit".
suvocount = s.count("suvo")  # output: 3
suvojitcount = s.count("suvojit")  # output: 1
To find the standalone "suvo" count, subtract the "suvojit" count:
lonelysuvo = suvocount - suvojitcount  # output: 3 - 1 -> 2
This would be my solution with help of the comments:
word = str(input("type the french word chiens in english:"))
str1 = "dogs"
times = int(str1.count(word))
if times >= 1:
    print ("dogs is correct")
else:
    print ("your wrong")
If you want to find the exact number of occurrences of a specific word in the string and you don't want to use any count function, you can use the following method.
text = input("Please enter the statement you want to check: ")
word = input("Please enter the word you want to check in the statement: ")
# n is the starting point to find the word, and it's 0 because you want to start from the very beginning of the string.
n = 0
# position_word is the starting index of the word in the string
position_word = 0
num_occurrence = 0
if word.upper() in text.upper():
    while position_word != -1:
        position_word = text.upper().find(word.upper(), n, len(text))
        # move the starting point of the search forward to find the next word
        n = (position_word + 1)
        # statement.find("word", start, end) returns -1 if the word is not present in the given statement.
        if position_word != -1:
            num_occurrence += 1
    print (f"{word.title()} is present {num_occurrence} times in the provided statement.")
else:
    print (f"{word.title()} is not present in the provided statement.")
This is a simple Python program using the split function:
str = 'apple mango apple orange orange apple guava orange'
print("\n My string ==> "+ str +"\n")
str = str.split()
str2=[]
for i in str:
    if i not in str2:
        str2.append(i)
        print( i,str.count(i))
I have just started out to learn coding in general and I do not know any libraries as such.
s = "the dogs barked"
value = 0
x = 0
y=3
for alphabet in s:
    if (s[x:y]) == "dog":
        value = value+1
    x+=1
    y+=1
print ("number of dog in the sentence is : ", value)
Another way to do this is to tokenize the string (break it into words)
and use Counter from the collections module of the Python standard library:
from collections import Counter
str1 = "the dogs barked"
stringTokenDict = { key : value for key, value in Counter(str1.split()).items() }
print(stringTokenDict['dogs'])
#This dictionary contains all words & their respective count

Python 3.0 - How do I output which character is counted the most?

So I was able to create a program that counts the number of vowels (specifically e, i, o) in a text file I have on my computer. However, I can't figure out for the life of me how to show which one occurs the most. I assumed I would write something like
for ch in 'i':
return numvowel?
I'm just not too sure what the step is.
I basically want it to output in the end saying "The letter, i, occurred the most in the text file"
def vowelCounter():
    inFile = open('file.txt', 'r')
    contents = inFile.read()
    # variable to store total number of vowels
    numVowel = 0
    # This counts the total number of occurrences of vowels o e i.
    for ch in contents:
        if ch in 'i':
            numVowel = numVowel + 1
        if ch in 'e':
            numVowel = numVowel + 1
        if ch in 'o':
            numVowel = numVowel + 1
    print('file.txt has', numVowel, 'vowel occurences total')
    inFile.close()
vowelCounter()
If you want to show which one occurs the most, you have to keep counts of each individual vowel instead of just 1 total count like what you have done.
Keep 3 separate counters (one for each of the 3 vowels you care about) and then you can get the total by summing them up OR if you want to find out which vowel occurs the most you can simply compare the 3 counters to find out.
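A minimal sketch of the three-counter idea, taking the text as a string instead of reading file.txt so it can be run as-is. The function names `count_eio` and `most_frequent` are made up for illustration, and ties are resolved in favour of the earlier vowel in e, i, o order, which is an arbitrary choice on my part.

```python
def count_eio(contents):
    """Count 'e', 'i' and 'o' separately and return the three counts."""
    e_count = i_count = o_count = 0
    for ch in contents:
        if ch == 'e':
            e_count += 1
        elif ch == 'i':
            i_count += 1
        elif ch == 'o':
            o_count += 1
    return e_count, i_count, o_count


def most_frequent(e_count, i_count, o_count):
    """Return the vowel with the highest count (ties go to e, then i, then o)."""
    best, best_count = 'e', e_count
    if i_count > best_count:
        best, best_count = 'i', i_count
    if o_count > best_count:
        best, best_count = 'o', o_count
    return best


e, i, o = count_eio("this is it, i insist")
print("text has", e + i + o, "vowel occurrences total")
print("The letter, {}, occurred the most in the text".format(most_frequent(e, i, o)))
```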
Try using regular expressions:
https://docs.python.org/3.5/library/re.html#regular-expression-objects
import re
def vowelCounter():
    with open('file.txt', 'r') as inFile:
        content = inFile.read()
    o_count = len(re.findall('o',content))
    e_count = len(re.findall('e',content))
    i_count = len(re.findall('i',content))
    # Note, if you want this to be case-insensitive,
    # then add the additional argument re.I to each findall call
    print("O's: {0}, E's:{1}, I's:{2}".format(o_count,e_count,i_count))
vowelCounter()
You can do this:
vowels = {} # dictionary of counters, indexed by vowels
for ch in contents:
    if ch in ['i', 'e', 'o']:
        # If 'ch' is a new vowel, create a new mapping for it with the value 1
        # otherwise increment its counter by 1
        vowels[ch] = vowels.get(ch, 0) + 1
print("'{}' occurred the most."
      .format(*[k for k, v in vowels.items() if v == max(vowels.values())]))
Python claims to have "batteries included", and this is a classical case. The class collections.Counter does pretty much this.
from collections import Counter
with open('file.txt') as file_:
    counter = Counter(file_.read())
print('Count of e: %s' % counter['e'])
print('Count of i: %s' % counter['i'])
print('Count of o: %s' % counter['o'])
Let vowels = 'eio', then
{ i: contents.count(i) for i in vowels }
For each item in vowels count the number of occurrences in contents and add it as part of the resulting dictionary (note the wrapping curly brackets over the comprehension).
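To get from that dictionary to the vowel that occurs most, max() with key=counts.get is one option. A small sketch with a made-up sample string standing in for the file contents:

```python
contents = "one more time, he noticed the echo"  # stand-in for the file contents
vowels = 'eio'

# One count per vowel, then pick the key with the largest value
counts = {v: contents.count(v) for v in vowels}
top = max(counts, key=counts.get)

print(counts)
print("The letter, {}, occurred the most in the text".format(top))
```

If two vowels tie, max() returns whichever comes first in the dictionary, so handle ties explicitly if that matters.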

Why is this not correct? (codeeval challenge)PYTHON

This is what I have to do https://www.codeeval.com/open_challenges/140/
I've been on this challenge for three days, please help. It is 85-90 partially solved, but not 100% solved. Why?
This is my code:
import sys
test_cases = open(sys.argv[1], 'r')
for test in test_cases:
    saver=[]
    text=""
    textList=[]
    positionList=[]
    num=0
    exists=int()
    counter=0
    for l in test.strip().split(";"):
        saver.append(l)
    for i in saver[0].split(" "):
        textList.append(i)
    for j in saver[1].split(" "):
        positionList.append(j)
    for i in range(0,len(positionList)):
        positionList[i]=int(positionList[i])
    accomodator=[None]*len(textList)
    for n in range(1,len(textList)):
        if n not in positionList:
            accomodator[n]=textList[len(textList)-1]
            exists=n
    for item in positionList:
        accomodator[item-1]=textList[counter]
        counter+=1
        if counter>item:
            accomodator[exists-1]=textList[counter]
    for word in accomodator:
        text+=str(word) + " "
    print text
test_cases.close()
This code works for me (Python 2):
import sys
def main(name_file):
    _file = open(name_file, 'r')
    text = ""
    while True:
        try:
            line = _file.next()
            disordered_line, numbers_string = line.split(';')
            numbers_list = map(int, numbers_string.strip().split(' '))
            missing_number = sum(xrange(sorted(numbers_list)[0],sorted(numbers_list)[-1]+1)) - sum(numbers_list)
            if missing_number == 0:
                missing_number = len(disordered_line)
            numbers_list.append(missing_number)
            disordered_list = disordered_line.split(' ')
            string_position = zip(disordered_list, numbers_list)
            ordered = sorted(string_position, key = lambda x: x[1])
            text += " ".join([x[0] for x in ordered])
            text += "\n"
        except StopIteration:
            break
    _file.close()
    print text.strip()
if __name__ == '__main__':
    main(sys.argv[1])
I'll try to explain my code step by step, so maybe you can see the difference between your code and mine:
while True
A loop that breaks when there are no more lines.
try:
I put the code inside a try and catch the StopIteration exception, because it is raised when there are no more items in a generator.
line = _file.next()
Use a generator, so that you do not load all the lines into memory at once.
disordered_line, numbers_string = line.split(';')
Get the unordered phrase and the numbers of every string's position.
numbers_list = map(int, numbers_string.strip().split(' '))
Convert every number from string to int
missing_number = sum(xrange(sorted(numbers_list)[0],sorted(numbers_list)[-1]+1)) - sum(numbers_list)
Get the missing number from the series of numbers; that missing number is the position of the last string in the phrase.
if missing_number == 0:
    missing_number = len(disordered_line)
Check whether the missing number is equal to 0; if so, the real missing number is the number of strings that make up the phrase.
numbers_list.append(missing_number)
Append the missing number to the list of numbers.
disordered_list = disordered_line.split(' ')
Convert the disordered phrase into a list.
string_position = zip(disordered_list, numbers_list)
Combine every string with its respective position.
ordered = sorted(string_position, key = lambda x: x[1])
Order the combined list by the position of the string.
text += " ".join([x[0] for x in ordered])
Concatenate the ordered phrase; the remaining code is easy to understand.
UPDATE
Looking at your code, here is my opinion on what might solve your problem.
split already returns a list, so you do not have to loop over the split content to add it to another list.
So these six lines:
for l in test.strip().split(";"):
    saver.append(l)
for i in saver[0].split(" "):
    textList.append(i)
for j in saver[1].split(" "):
    positionList.append(j)
can be converted into three:
splitted_test = test.strip().split(';')
textList = splitted_test[0].split(" ")
positionList = map(int, splitted_test[1].split(" "))
With the line positionList = map(int, splitted_test[1].split(" ")) you already convert the numbers to int, so you can drop these two lines:
for i in range(0,len(positionList)):
    positionList[i]=int(positionList[i])
The next lines:
accomodator=[None]*len(textList)
for n in range(1,len(textList)):
    if n not in positionList:
        accomodator[n]=textList[len(textList)-1]
        exists=n
can be converted into the next four:
missing_number = sum(xrange(sorted(positionList)[0],sorted(positionList)[-1]+1)) - sum(positionList)
if missing_number == 0:
missing_number = len(textList)
positionList.append(missing_number)
Basically, these lines calculate the missing number in the series of numbers, so that the length of the series is the same as that of textList.
The next lines:
for item in positionList:
    accomodator[item-1]=textList[counter]
    counter+=1
    if counter>item:
        accomodator[exists-1]=textList[counter]
for word in accomodator:
    text+=str(word) + " "
Can be replaced by these ones:
string_position = zip(textList, positionList)
ordered = sorted(string_position, key = lambda x: x[1])
text += " ".join([x[0] for x in ordered])
text += "\n"
This way you save lines and memory; also use xrange instead of range.
The factors that make your code pass only partially could be:
The number of lines in the script.
The time your script takes.
The amount of memory your script uses.
What you could do is:
Use generators, which saves memory.
Reduce the number of for loops; this saves lines of code and time.
If you think something could be made simpler, do it.
Do not reinvent the wheel; if something has already been written, use it.
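For comparison, here is a rough Python 3 sketch of the same idea. It assumes, based on the code in this thread, that each input line looks like "words;positions", that positions are 1-based, and that the position of the last word is the one left out. Instead of the sum trick it finds the missing position with a set difference, which also covers the case where the missing position is 1. The function name `recover` and the sample line are my own.

```python
def recover(line):
    """Reorder the words of one 'words;positions' line; one position is missing."""
    scrambled, numbers = line.strip().split(';')
    words = scrambled.split()
    positions = [int(n) for n in numbers.split()]
    # The position that never appears belongs to the last word of the input
    missing = (set(range(1, len(words) + 1)) - set(positions)).pop()
    positions.append(missing)
    # Pair each position with its word and sort by position
    ordered = sorted(zip(positions, words))
    return " ".join(word for _, word in ordered)


print(recover("2000 and was not However, implemented 1998 it until;9 8 3 4 1 5 7 2"))
```

To process a whole file, iterate over the open file object line by line and call recover() on each non-empty line.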

Variable to control how many lines print Python

I'm trying to figure out how to use a variable to control the number of lines a script prints. I want to use the output variable and print only the number of lines the user requests. Any help would be greatly appreciated.
import sys, os
print ""
print "Running Script..."
print ""
print "This program analyzes word frequency in a file and"
print "prints a report on the n most frequent words."
print ""
filename = raw_input("File to analyze? ")
if os.path.isfile(filename):
    print "The file", filename, "exists!"
else:
    print "The file", filename, "doesn't exist!"
    sys.exit()
print ""
output = raw_input("Output analysis of how many words? ")
readfile = open(filename, 'r+')
words = readfile.read().split()
wordcount = {}
for word in words:
    if word in wordcount:
        wordcount[word] += 1
    else:
        wordcount[word] = 1
sortbyfreq = sorted(wordcount,key=wordcount.get,reverse=True)
for word in sortbyfreq:
    print "%-20s %10d" % (word, wordcount[word])
Simply keep a counter in your final loop that tracks the number of iterations done, and break when the requested number has been reached.
limit = int(output)  # the number of words the user asked for
counter = 0
for word in sortbyfreq:
    print "%-20s %10d" % (word, wordcount[word])
    counter += 1
    if counter >= limit:
        break
A dictionary does not keep its items sorted by frequency, so you have to sort the keys yourself, as you did with sorted().
A collections.Counter does both the counting and the ordering for you:
from collections import Counter
sortbyfreq = Counter(words) # Instead of the wordcount dictionary + for loop.
You could then access the user-defined number of most common elements with:
n = int(raw_input('How many?: '))
for item, count in sortbyfreq.most_common(n):
    print "%-20s %10d" % (item, count)
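For anyone on Python 3, a sketch of the same Counter approach might look like this. The helper name `report` and the sample sentence are mine; the format string is the one from the question.

```python
from collections import Counter


def report(words, n):
    """Return formatted lines for the n most frequent words."""
    return ["%-20s %10d" % (word, count)
            for word, count in Counter(words).most_common(n)]


words = "the cat and the hat and the bat".split()
for line in report(words, 2):
    print(line)
```

most_common(n) simply returns fewer items when n is larger than the number of distinct words, so no extra bounds check is needed.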

counting lengths of the words in a .txt

I have seen similar questions but nothing that truly helped me. I need to read in a text file, split it, and count the lengths of the words. I am also trying to print them out in a table with the length of the word on the left and then the actual word on the right. My code is all screwed up right now cause I got to the point where I decided to ask for help.
a = open('owlcreek.txt').read().split()
lengths = dict()
for word in a:
    length = len(word)
    if length not in lengths:
for length, counter in lengths.items():
    print "Words of length %d: %d" % (length, counter)
#words=[line for line in a]
#print ("\n" .join(counts))
Also I guess I will need to write a little parser to get all the "!-- out. I tried to use The Counter, but I guess I don't know how to use it properly.
It should be like this:
a=open('owlcreek.txt').read().split()
lengths=dict()
for word in a:
    length = len(word)
    # if the key is not present, add it
    if length not in lengths:
        # the value should be the list of words
        lengths[length] = []
    # append the word to the list for length key
    lengths[length].append(word)
# print them out as length, count(words of that length)
for length, wrds in lengths.items():
    print("Words of length %d: %d" % (length, len(wrds)))
Hope this helps!
A simple regular expression will suffice to clear out the punctuation and spaces.
Edit: If I'm understanding your problem correctly, you want all the unique words in a text file, sorted by length. In which case:
import re
import itertools
with open('README.txt', 'r') as file:
    words = set(re.findall(r"\w+'\w+|\w+", file.read())) # discard duplicates
sorted_words = sorted(words, key=len)
for length, words in itertools.groupby(sorted_words, len):
    words = list(words)
    print("Words of length {0}: {1}".format(length, len(words)))
    for word in words:
        print(word)
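If you want the table from the question (length on the left, the words on the right), grouping into a dictionary is another option. This is only a sketch: the function name `words_by_length`, the lowercasing of words, and the sample sentence are my assumptions, and it takes a string rather than reading owlcreek.txt.

```python
import re
from collections import defaultdict


def words_by_length(text):
    """Map each word length to the sorted list of distinct words of that length."""
    groups = defaultdict(set)
    # Same pattern as above: keep apostrophes inside words, drop other punctuation
    for word in re.findall(r"\w+'\w+|\w+", text):
        groups[len(word)].add(word.lower())
    return {length: sorted(words) for length, words in sorted(groups.items())}


table = words_by_length("A man stood upon a railroad bridge. The man's hands were bound!")
for length, words in table.items():
    print("{:>6}  {}".format(length, ", ".join(words)))
```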
