I am trying to make a job that takes in a text file, then counts the number of syllables in each word, then ultimately returns the top 10 words with the most syllables. I'm able to get all of the word/syllable pairs sorted in descending order, however, I am struggling to figure out how to return only the top 10 words. Here's my code so far:
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
WORD_RE = re.compile(r"[\w']+")
class MRMostUsedWordSyllables(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.word_splitter_mapper,
                   reducer=self.sorting_word_syllables),
            MRStep(reducer=self.reducer_word_sorted),
            MRStep(reducer=self.get_top_10_reducer)
        ]
    def word_splitter_mapper(self, _, line):
        #for word in line.split():
        for word in WORD_RE.findall(line):
            yield(word.lower(), None)
    def sorting_word_syllables(self, word, count):
        count = 0
        vowels = 'aeiouy'
        word = word.lower().strip()
        if word in vowels:
            count +=1
        for index in range(1,len(word)):
            if word[index] in vowels and word[index-1] not in vowels:
                count +=1
        if word.endswith('e'):
            count -= 1
        if word.endswith('le'):
            count+=1
        if count == 0:
            count +=1
        yield None, (int(count), word)
    def reducer_word_sorted(self, _, syllables_counts):
        for count, word in sorted(syllables_counts, reverse=True):
            yield (int(count), word)
    def get_top_10_reducer(self, count, word):
        self.aList = []
        for value in list(range(count)):
            self.aList.append(value)
        self.bList = []
        for i in range(10):
            self.bList.append(max(self.aList))
            self.aList.remove(max(self.aList))
        for i in range(10):
            yield self.bList[i]
if __name__ == '__main__':
    import time
    start = time.time()
    MRMostUsedWordSyllables.run()
    end = time.time()
    print(end - start)
I know my issue is with the "get_top_10_reducer" function. I keep getting ValueError: max() arg is an empty sequence.
According to the error, one of your reducers has returned 0 for the count. Do you have an empty line in your input, for example? You should filter this data out as early as possible.
Overall, I think you need to remove reducer_word_sorted. There is no guarantee this returns sorted data. Instead, I think it regroups all data based on the numeric count key, then emits in a non-deterministic order to the next step.
That being said, your top 10 reducer never uses the value of its word parameter, which should itself be a list: the values grouped under each count key emitted by the previous reducer.
With reducer_word_sorted removed, sorting_word_syllables returns None for its key... This is fine, because you then have all of the split words in one giant list, so define a regular function (syllables here stands for your syllable-counting logic, moved into a plain function)
def get_syllable_count_pair(word):
    return (syllables(word), word, )
Use that within the reducer
def get_top_10_reducer(self, count, word):
    assert count is None  # added as a guard
    with_counts = [get_syllable_count_pair(w) for w in word]
    # Sort the words by the syllable count
    sorted_counts = sorted(with_counts, reverse=True, key=lambda x: x[0])
    # Slice off the first ten
    for t in sorted_counts[:10]:
        yield t
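To try the "count, sort, keep ten" logic without running a MapReduce job, here is a minimal sketch in plain Python. It is not mrjob code: syllable_count and top_words_by_syllables are hypothetical helper names, the heuristic only mirrors the one in the question, and the check on word[0] for a leading vowel is an assumption about what the original code intended. heapq.nlargest keeps only the n largest pairs instead of sorting everything.

```python
import heapq
import re

WORD_RE = re.compile(r"[\w']+")
VOWELS = "aeiouy"


def syllable_count(word):
    """Rough syllable estimate, mirroring the heuristic in the question."""
    word = word.lower().strip()
    count = 0
    # Assumption: a leading vowel starts a syllable.
    if word and word[0] in VOWELS:
        count += 1
    # Every vowel that follows a non-vowel starts a new vowel group.
    for index in range(1, len(word)):
        if word[index] in VOWELS and word[index - 1] not in VOWELS:
            count += 1
    # A trailing 'e' is usually silent, except in a trailing 'le'.
    if word.endswith("e"):
        count -= 1
    if word.endswith("le"):
        count += 1
    # Every word has at least one syllable.
    return max(count, 1)


def top_words_by_syllables(text, n=10):
    """Return up to n (syllables, word) pairs, largest first."""
    words = set(WORD_RE.findall(text.lower()))
    return heapq.nlargest(n, ((syllable_count(w), w) for w in words))


print(top_words_by_syllables("the banana apple", n=2))
```

The same function body could be called from a single reducer that receives all words under one key, which is the shape the answer above suggests.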
I have two dictionaries. Each dictionary contains words; some words are common to both and some are not. For each common word I want to output frequency 1, frequency 2, and the sum of the two frequencies. How can I do that? I also have to find the top 20.
For example, my output must look like:
   Common word  freq1  freq2  freqsum
1  print        10     5      15
2  number       2      1      3
3  program      19     20     39
Here is my code:
commonwordsbook1andbook2 = []
for element in finallist1:
    if element in finallist2:
        commonwordsbook1andbook2.append(element)
common1 = {}
for word in commonwordsbook1andbook2:
    if word not in common1:
        common1[word] = 1
    else:
        common1[word] += 1
common1 = sorted(common1.items(), key=lambda x: x[1], reverse=True) #distinct2
for k, v in wordcount2[:a]:
    print(k, v)
Assuming that the dictionaries have individual frequencies of each word, we can do something simpler. Like...
print("Common Word | Freq-1 | Freq-2 | Freq-Sum")
for i in freq1:
    if i in freq2:
        print(i,freq1[i],freq2[i],freq1[i]+freq2[i])
Since you aren't allowed to use Counter, you can implement the same functionality using dictionaries. Let's define a function to return a dictionary that contains the counts of all words in the given list. Dictionaries have a get() function that gets the value of the given key, while also allowing you to specify a default if the key is not found.
def countwords(lst):
    dct = {}
    for word in lst:
        dct[word] = dct.get(word, 0) + 1
    return dct
count1 = countwords(finallist1)
count2 = countwords(finallist2)
words1 = set(count1.keys())
words2 = set(count2.keys())
count1.keys() will give us all the unique words in finallist1.
Then we convert both of these to sets and then find their intersection to get the common words.
common_words = words1.intersection(words2)
Now that you know the common words, printing them and their counts should be trivial:
for w in common_words:
    print(f"{w}\t{count1[w]}\t{count2[w]}\t{count1[w] + count2[w]}")
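The question also asks for the top 20, which the loop above does not cover. A minimal sketch of one way to do it follows, assuming "top" means the largest summed frequency; count_words and top_common are hypothetical names, and the two word lists are made-up sample data.

```python
def count_words(words):
    """Count occurrences of each word without using collections.Counter."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_common(words1, words2, n=20):
    """Return up to n rows (word, freq1, freq2, freq_sum) for common words."""
    count1 = count_words(words1)
    count2 = count_words(words2)
    rows = [
        (w, count1[w], count2[w], count1[w] + count2[w])
        for w in count1
        if w in count2
    ]
    # Largest sum first; ties are broken alphabetically.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:n]


# Made-up sample data standing in for finallist1 and finallist2.
book1 = ["print", "print", "number", "loop"]
book2 = ["print", "number", "number", "data"]
for rank, (word, f1, f2, total) in enumerate(top_common(book1, book2), start=1):
    print(f"{rank}\t{word}\t{f1}\t{f2}\t{total}")
```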
I have a string that holds a very long sentence without whitespaces/spaces.
mystring = "abcdthisisatextwithsampletextforasampleabcd"
I would like to find all of the repeated substrings that contain at least 4 characters.
So I would like to achieve something like this:
'text' 2 times
'sample' 2 times
'abcd' 2 times
Because abcd, text, and sample each occur twice in mystring, they are recognized as matching substrings of at least 4 characters. It's important to note that I am looking for repeated substrings; finding only existing English words is not a requirement.
The answers I found are helpful for finding duplicates in texts with whitespaces, but I couldn't find a proper resource that covers the situation when there are no spaces and whitespaces in the string. How can this be done in the most efficient way?
Let's go through this step by step. There are several sub-tasks you should take care of:
Identify all substrings of length 4 or more.
Count the occurrence of these substrings.
Filter all substrings with 2 occurrences or more.
You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.
The following examples all use
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
1. Substrings of a given length
You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring from position 4 of length 6: 'thisis'. More generically, you want substrings of the form mystring[start:start+length].
So what values do you need for start and length?
start must...
cover all substrings, so it must include the first character: start in range(0, ...).
not map to short substrings, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
length must...
cover the shortest substring of length 4: length in range(min_length, ...).
not exceed the remaining string after start: length in range(..., len(mystring) - start + 1).
The +1 terms come from converting lengths (>=1) to indices (>=0).
You can put this all together into a single comprehension, with i as start and j as length:
substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]
2. Count substrings
Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this:
counts = {}
for substring in substrings:
    try:  # increase count for existing keys, set for new keys
        counts[substring] += 1
    except KeyError:
        counts[substring] = 1
You can simply feed your substrings to collections.Counter, and it produces something like the above.
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})
Notice how the duplicate 'abcd' maps to the count of 2.
3. Filtering duplicate substrings
So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.
Python offers several constructs for filtering, depending on the output you want. These work also if counts is a regular dict:
>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
Using Python primitives
Python ships with primitives that allow you to do this more efficiently.
Use a generator to build substrings. A generator builds its member on the fly, so you never actually have them all in-memory. For your use case, you can use a generator expression:
substrings = (
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
)
Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer versions, this is much more efficient.
counts = collections.Counter(substrings)
You can exploit Python's lazy filters to only ever inspect one substring at a time. The filter builtin or another generator expression can produce one result at a time without storing them all in memory.
for substring in filter(lambda key: counts[key] > 1, counts):
    print(substring, 'occurs', counts[substring], 'times')
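Put together, the three steps fit in one small function. This is only a sketch of the approach described above; repeated_substrings is a hypothetical name, and note that it counts overlapping occurrences as well.

```python
from collections import Counter


def repeated_substrings(text, min_length=4):
    """Map each substring of at least min_length chars to its count, if > 1."""
    counts = Counter(
        text[start:start + length]
        for start in range(len(text) - min_length + 1)
        for length in range(min_length, len(text) - start + 1)
    )
    # Keep only the substrings that occur more than once.
    return {sub: n for sub, n in counts.items() if n > 1}


mystring = "abcdthisisatextwithsampletextforasampleabcd"
for sub, n in repeated_substrings(mystring).items():
    print(repr(sub), n, "times")
```

Because every window is counted by position, a string such as 'aaaaa' reports 'aaaa' twice even though the two occurrences overlap.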
Nobody is using re! Time for an answer [ab]using the regular expression built-in module ;)
import re
Finding all the maximal substrings that are repeated
repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))
This matches the longest substrings that have at least one repetition later in the string (checked with a lookahead, so the repetition is not consumed). So it finds all disjoint substrings that are repeated while only yielding the longest strings.
Finding all substrings that are repeated, including overlaps
mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0
while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
        pos += match.pos + 1
    else:
        break
This ensures that all --not only disjoint-- substrings which have repetition are returned. It should be much slower, but gets the work done.
If you want in addition to the longest strings that are repeated, all the substrings, then:
base_repetitions = list(repeated_ones)
for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])
That will ensure that for long substrings that have repetition, you have also the smaller substring --e.g. "sample" and "ample" found by the re.search code; but also "samp", "sampl", "ampl" added by the above snippet.
Counting matches
Because (by design) the substrings that we count are non-overlapping, the count method is the way to go:
from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))
Results
Finding maximal substrings:
With the question's original mystring:
{'abcd', 'text', 'sample'}
with the mystring_overlap sample:
{'abcd'}
Finding all substrings:
With the question's original mystring:
{'abcd', 'ample', 'mple', 'sample', 'text'}
... and if we add the code to get all substrings then, of course, we get absolutely all the substrings:
{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}
with the mystring_overlap sample:
{'abcd', 'bcde'}
Future work
It's possible to filter the results of the finding all substrings with the following steps:
take a match "A"
check if this match is a substring of another match, call it "B"
if there is a "B" match, check the counter on that match "B_n"
if "A_n = B_n", then remove A
go to first step
It cannot happen that "A_n < B_n" because A is smaller than B (is a substring) so there must be at least the same number of repetitions.
If "A_n > B_n" it means that there is some extra match of the smaller substring, so it is a distinct substring because it is repeated in a place where B is not repeated.
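The filtering steps above can be sketched as follows. This is a rough illustration rather than part of the regex solution: prune_redundant is a hypothetical name, and it assumes you already have a dict mapping each repeated substring to its count.

```python
def prune_redundant(counts):
    """Drop every match A that sits inside a longer match B with the same count."""
    kept = {}
    for a, a_n in counts.items():
        redundant = any(
            a != b and a in b and a_n == b_n
            for b, b_n in counts.items()
        )
        if not redundant:
            kept[a] = a_n
    return kept


all_matches = {
    "abcd": 2, "ampl": 2, "ample": 2, "mple": 2,
    "samp": 2, "sampl": 2, "sample": 2, "text": 2,
}
print(prune_redundant(all_matches))
```

A shorter match with a strictly higher count than every longer match containing it is kept, which corresponds to the "A_n > B_n" case.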
Script (explanation where needed, in comments):
from collections import Counter
mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)
possible_matches = []
matches = []
# Range `start_index` from 0 up to the last position where a substring of
# at least 4 characters can still start
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this iteration, if len < 4
        possible_matches.append(mystring[start_index:end_index])
for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))
for match, count in matches:
    print(f'\'{match}\' {count} times')
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times
Here's a Python3 friendly solution:
from collections import Counter
min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"
all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)
Bonus: largest string
sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2 ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))
Everything as a function:
from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):
    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
    ### This is just a bit of bonus code for calculating the largest repeating string
    if calculate_largest_repeated_string == True:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                           not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))
        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates
Example:
mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))
Output:
({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])
CODE:
pattern = "abcdthisisatextwithsampletextforasampleabcd"
string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1
for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " + str(pattern.count(i)) + " times")
OUTPUT:
abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times
Hope this helps as my code length was short and it is easy to understand. Cheers!
This is in Python 2 because I'm not doing Python 3 at this time. So you'll have to adapt it to Python 3 yourself.
#!python2
# import module
from collections import Counter
# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []
    # start building the indices
    for i in range(0, length - 2):
        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])
            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break
    # return all of the calculated indices ranges
    return all_sets
# store search strings
tmplst = []; combos = []; found = []
# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"
# get length of string
length = len(mystring)
# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)
# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []
# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])
# make a dictionary containing the strings and their counts
d1 = Counter(found)
# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v
$ cat test.py
import collections
import sys
S = "abcdthisisatextwithsampletextforasampleabcd"
def find(s, min_length=4):
    """
    Find repeated character sequences in a provided string.
    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is
        # fewer than the length of the current sequence.
        for i, x in enumerate(s[:-(sequence_length - 1)]):
            # Window across the string, getting slices
            # of length == sequence_length.
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1
    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v))
    return
if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S
    find(s)
$ python test.py
sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times
This is my approach to this problem:
def get_repeated_words(string, minimum_len):
    # Storing count of repeated words in this dictionary
    repeated_words = {}
    # Traversing till last but 4th element
    # Actually leaving `minimum_len` elements at end (in this case its 4)
    for i in range(len(string)-minimum_len):
        # Starting with a length of 4(`minimum_len`) and going till end of string
        for j in range(i+minimum_len, len(string)):
            # getting the current word
            word = string[i:j]
            # counting the occurrences of the word
            word_count = string.count(word)
            if word_count > 1:
                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count
    return repeated_words
if __name__ == '__main__':
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)
This is how I would do it, but I don't know any other way:
string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurrences = {}
for i in range(4, l):
    # start runs up to and including l - i, so the final window is counted too
    for start in range(l - i + 1):
        substring = string[start:start + i]
        occurrences[substring] = occurrences.get(substring, 0) + 1
for key in occurrences.keys():
    if occurrences[key] > 1:
        print("'" + key + "'", str(occurrences[key]), "times")
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'ampl' 2 times
'mple' 2 times
'sampl' 2 times
'ample' 2 times
'sample' 2 times
Efficient, no, but easy to understand, yes.
Here is a simple solution using the more_itertools library.
Given
import collections as ct
import more_itertools as mit
s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)
Code
windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered
Output
{'abcd': 2,
'text': 2,
'samp': 2,
'ampl': 2,
'mple': 2,
'sampl': 2,
'ample': 2,
'sample': 2}
Details
The procedures are:
build sliding windows of varying sizes lbound <= n < ubound
count all occurrences and filter replicates
more_itertools is a third-party package installed by > pip install more_itertools.
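s = 'abcabcabcdabcd'
def get_repeats(s, l):
    # keep the counts local so repeated calls do not share state
    d = {}
    # slide a window of length l over s, including the final window
    for i in range(len(s) - l + 1):
        ss = s[i: i+l]
        if ss not in d:
            d[ss] = 1
        else:
            d[ss] = d[ss]+1
    return d
get_repeats(s, 3)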
import re
string = "is2 Thi1s T4est 3a"
def order(sentence):
    res = ''
    count = 1
    list = sentence.split()
    for i in list:
        for i in list:
            a = re.findall('\d+', i)
            if a == [str(count)]:
                res += " ".join(i)
                count += 1
    print(res)
order(string)
Above there is a code which I have problem with. Output which I should get is:
"Thi1s is2 3a T4est"
Instead I'm getting the correct order but with spaces in the wrong places:
"T h i 1 si s 23 aT 4 e s t"
Any idea how to make it work with this code concept?
You are joining the characters of each word:
>>> " ".join('Thi1s')
'T h i 1 s'
You want to collect your words into a list and join that instead:
def order(sentence):
    number_words = []
    count = 1
    words = sentence.split()
    for word in words:
        for word in words:
            matches = re.findall(r'\d+', word)
            if matches == [str(count)]:
                number_words.append(word)
                count += 1
    result = ' '.join(number_words)
    print(result)
I used more verbose and clear variable names. I also removed the list variable; don't use list as a variable name if you can avoid it, as that masks the built-in list name.
What you implemented comes down to an O(N^2) (quadratic time) sort. You could instead use the built-in sorted() function to bring this to O(NlogN); you'd extract the digits and sort on their integer value:
def order(sentence):
    digit = re.compile(r'\d+')
    return ' '.join(
        sorted(sentence.split(),
               key=lambda w: int(digit.search(w).group())))
This differs a little from your version in that it'll only look at the first (consecutive) digits, it doesn't care about the numbers being sequential, and will break for words without digits. It also uses a return to give the result to the caller rather than print. Just use print(order(string)) to print the return value.
If you assume the words are numbered consecutively starting at 1, then you can sort them in O(N) time even:
def order(sentence):
    digit = re.compile(r'\d+')
    words = sentence.split()
    result = [None] * len(words)
    for word in words:
        index = int(digit.search(word).group())
        result[index - 1] = word
    return ' '.join(result)
This works by creating a list of the same length, then using the digits from each word to put the word into the correct index (minus 1, as Python lists start at 0, not 1).
I think the bug is simply in the misuse of join(). You want to concatenate the current sorted string. i is simply a token, hence simply add it to the end of the string. Code untested.
import re
string = "is2 Thi1s T4est 3a"
def order(sentence):
    res = ''
    count = 1
    list = sentence.split()
    for i in list:
        for i in list:
            a = re.findall('\d+', i)
            if a == [str(count)]:
                res = res + " " + i # your bug here
                count += 1
    print(res)
order(string)
I found this programming problem while looking at a job posting on SO. I thought it was pretty interesting and as a beginner Python programmer I attempted to tackle it. However I feel my solution is quite...messy...can anyone make any suggestions to optimize it or make it cleaner? I know it's pretty trivial, but I had fun writing it. Note: Python 2.6
The problem:
Write pseudo-code (or actual code) for a function that takes in a string and returns the letter that appears the most in that string.
My attempt:
import string
def find_max_letter_count(word):
    alphabet = string.ascii_lowercase
    dictionary = {}
    for letters in alphabet:
        dictionary[letters] = 0
    for letters in word:
        dictionary[letters] += 1
    dictionary = sorted(dictionary.items(),
                        reverse=True,
                        key=lambda x: x[1])
    for position in range(0, 26):
        print dictionary[position]
        if position != len(dictionary) - 1:
            if dictionary[position + 1][1] < dictionary[position][1]:
                break
find_max_letter_count("helloworld")
Output:
>>>
('l', 3)
Updated example:
find_max_letter_count("balloon")
>>>
('l', 2)
('o', 2)
There are many ways to do this shorter. For example, you can use the Counter class (in Python 2.7 or later):
import collections
s = "helloworld"
print(collections.Counter(s).most_common(1)[0])
If you don't have that, you can do the tally manually (2.5 or later has defaultdict):
d = collections.defaultdict(int)
for c in s:
    d[c] += 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
Having said that, there's nothing too terribly wrong with your implementation.
If you are using Python 2.7 or later, you can do this quickly with the collections module.
collections is a high-performance data structures module. Read more at
http://docs.python.org/library/collections.html#counter-objects
>>> from collections import Counter
>>> x = Counter("balloon")
>>> x
Counter({'o': 2, 'a': 1, 'b': 1, 'l': 2, 'n': 1})
>>> x['o']
2
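Here is a way to find the most common character using a dictionary
message = "hello world"
d = {}
letters = set(message)
for l in letters:
    # maps each count to one letter with that count (ties overwrite each other)
    d[message.count(l)] = l
# pick the highest count explicitly instead of relying on dict key order
top = max(d)
print(d[top], top)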
Here's a way using a for loop and count()
w = input()
r = 0   # highest count seen so far
s = ''  # character with that count
for i in w:
    p = w.count(i)
    if p > r:
        r = p
        s = i
print(s)
The way I did uses no built-in functions from Python itself, only for-loops and if-statements.
def most_common_letter():
    string = str(input())
    letters = set(string)
    if " " in letters:  # If you want to count spaces too, ignore this if-statement
        letters.remove(" ")
    max_count = 0
    freq_letter = []
    for letter in letters:
        count = 0
        for char in string:
            if char == letter:
                count += 1
        if count == max_count:
            max_count = count
            freq_letter.append(letter)
        if count > max_count:
            max_count = count
            freq_letter.clear()
            freq_letter.append(letter)
    return freq_letter, max_count
This ensures you get every letter/character that gets used the most, and not just one. It also returns how often it occurs. Hope this helps :)
If you want to have all the characters with the maximum number of counts, then you can do a variation on one of the two ideas proposed so far:
import heapq # Helps finding the n largest counts
import collections
def find_max_counts(sequence):
    """
    Returns an iterator that produces the (element, count)s with the
    highest number of occurrences in the given sequence.
    In addition, the elements are sorted.
    """
    if len(sequence) == 0:
        return  # ends the generator; raising StopIteration here fails on Python 3.7+
    counter = collections.defaultdict(int)
    for elmt in sequence:
        counter[elmt] += 1
    counts_heap = [
        (-count, elmt)  # The largest elmt counts are the smallest elmts
        for (elmt, count) in counter.items()]
    heapq.heapify(counts_heap)
    highest_count = counts_heap[0][0]
    while True:
        try:
            (opp_count, elmt) = heapq.heappop(counts_heap)
        except IndexError:
            return
        if opp_count != highest_count:
            return
        yield (elmt, -opp_count)
for (letter, count) in find_max_counts('balloon'):
    print((letter, count))
for (word, count) in find_max_counts(['he', 'lkj', 'he', 'll', 'll']):
    print((word, count))
This yields, for instance:
lebigot#weinberg /tmp % python count.py
('l', 2)
('o', 2)
('he', 2)
('ll', 2)
This works with any sequence: words, but also ['hello', 'hello', 'bonjour'], for instance.
The heapq structure is very efficient at finding the smallest elements of a sequence without sorting it completely. On the other hand, since there are not so many letter in the alphabet, you can probably also run through the sorted list of counts until the maximum count is not found anymore, without this incurring any serious speed loss.
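The simpler alternative mentioned above, walking the sorted counts until the count drops below the maximum, could look roughly like this. It is a sketch using collections.Counter and itertools.takewhile rather than the heap, and max_counts is a hypothetical name.

```python
from collections import Counter
from itertools import takewhile


def max_counts(sequence):
    """Return every (element, count) pair that shares the highest count."""
    ranked = Counter(sequence).most_common()  # sorted by count, highest first
    if not ranked:
        return []
    top = ranked[0][1]
    # Stop at the first pair whose count falls below the maximum.
    return list(takewhile(lambda pair: pair[1] == top, ranked))


print(max_counts("balloon"))
print(max_counts(["he", "lkj", "he", "ll", "ll"]))
```

Unlike the heap version, tied elements come back in the order most_common() reports them rather than in sorted order.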
def most_frequent(text):
    frequencies = [(c, text.count(c)) for c in set(text)]
    return max(frequencies, key=lambda x: x[1])[0]
s = 'ABBCCCDDDD'
print(most_frequent(s))
frequencies is a list of tuples that count the characters as (character, count). We apply max to the tuples using count's and return that tuple's character. In the event of a tie, this solution will pick only one.
I noticed that most of the answers only come back with one item even if there is an equal amount of characters most commonly used. For example "iii 444 yyy 999". There are an equal amount of spaces, i's, 4's, y's, and 9's. The solution should come back with everything, not just the letter i:
from collections import Counter
sentence = "iii 444 yyy 999"
# The count of the first (i.e. most common) entry in the list of tuples
# from Counter().most_common()
largest_count: int = Counter(sentence).most_common()[0][1]
# If the tuple's count is equal to the largest count, append it to the list
most_common_list: list = [(x, y)
                          for x, y in Counter(sentence).items() if y == largest_count]
print(most_common_list)
# RETURNS
# [('i', 3), (' ', 3), ('4', 3), ('y', 3), ('9', 3)]
Question :
Most frequent character in a string
The maximum occurring character in an input string
Method 1 :
a = "GiniGinaProtijayi"
d = {}
for ch in a:
    d[ch] = d.get(ch, 0) + 1
# sort by count, highest first, and take the first (character, count) pair
chh, max_count = sorted(d.items(), reverse=True, key=lambda item: item[1])[0]
print(chh)
print(max_count)
Method 2 :
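a = "GiniGinaProtijayi"
max_count = 0  # avoid the name max, which would shadow the built-in
chh = ''
count = [0] * 256  # one slot per character code; assumes codes below 256
for ch in a:
    count[ord(ch)] += 1
for ch in a:
    if count[ord(ch)] > max_count:
        max_count = count[ord(ch)]
        chh = ch
print(chh)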
Method 3 :
import collections
line ='North Calcutta Shyambazaar Soudipta Tabu Roopa Roopi Gina Gini Protijayi Sovabazaar Paikpara Baghbazaar Roopa'
bb = collections.Counter(line).most_common(1)[0][0]
print(bb)
Method 4 :
line =' North Calcutta Shyambazaar Soudipta Tabu Roopa Roopi Gina Gini Protijayi Sovabazaar Paikpara Baghbazaar Roopa'
def mostcommonletter(sentence):
    letters = list(sentence)
    return (max(set(letters),key = letters.count))
print(mostcommonletter(line))
Here are a few things I'd do:
Use collections.defaultdict instead of the dict you initialise manually.
Use built-in functions such as sorted and max instead of working it out yourself - it's easier.
Here's my final result:
from collections import defaultdict
def find_max_letter_count(word):
    matches = defaultdict(int)  # makes the default value 0
    for char in word:
        matches[char] += 1
    return max(matches.iteritems(), key=lambda x: x[1])
find_max_letter_count('helloworld') == ('l', 3)
If you could not use collections for any reason, I would suggest the following implementation:
s = input()
d = {}
# We iterate through the string, and if we find an element that
# is already in the dict, then we just increment its counter.
for ch in s:
    if ch in d:
        d[ch] += 1
    else:
        d[ch] = 1
# If we are given an empty string, then we just
# print a message that says so.
print(max(d, key=d.get, default='Empty string was given.'))
sentence = "This is a great question made me wanna watch matrix again!"
char_frequency = {}
for char in sentence:
    if char == " ":  # to skip spaces
        continue
    elif char in char_frequency:
        char_frequency[char] += 1
    else:
        char_frequency[char] = 1
char_frequency_sorted = sorted(
    char_frequency.items(), key=lambda ky: ky[1], reverse=True
)
print(char_frequency_sorted[0])  # output -->('a', 9)
# return the letter with the max frequency.
def maxletter(word:str) -> tuple:
    ''' return the letter with the max occurrence '''
    v = 1
    dic = {}
    for letter in word:
        if letter in dic:
            dic[letter] += 1
        else:
            dic[letter] = v
    for k in dic:
        if dic[k] == max(dic.values()):
            return k, dic[k]
l, n = maxletter("Hello World")
print(l, n)
output: l 3
You may also try something like the below.
from pprint import pprint
sentence = "this is a common interview question"
char_frequency = {}
for char in sentence:
    if char in char_frequency:
        char_frequency[char] += 1
    else:
        char_frequency[char] = 1
pprint(char_frequency, width = 1)
out = sorted(char_frequency.items(),
             key = lambda kv : kv[1], reverse = True)
print(out)
print(out[0])
statistics.mode(data)
Return the single most common data point from discrete or nominal data. The mode (when it exists) is the most typical value and serves as a measure of central location.
If there are multiple modes with the same frequency, returns the first one encountered in the data. If the smallest or largest of those is desired instead, use min(multimode(data)) or max(multimode(data)). If the input data is empty, StatisticsError is raised.
import statistics as stat
test = 'This is a test of the fantastic mode super special function ssssssssssssss'
test2 = ['block', 'cheese', 'block']
val = stat.mode(test)
val2 = stat.mode(test2)
print(val, val2)
mode assumes discrete data and returns a single value. This is the standard treatment of the mode as commonly taught in schools:
mode([1, 1, 2, 3, 3, 3, 3, 4])
3
The mode is unique in that it is the only statistic in this package that also applies to nominal (non-numeric) data:
mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
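Since the documentation quoted above points to multimode for ties, here is a short sketch of how it behaves on a string. statistics.multimode is available from Python 3.8; the sample inputs are made up for illustration.

```python
from statistics import mode, multimode

text = "balloon"
# mode returns a single value: the first of the most common ones encountered.
single = mode(text)
# multimode returns every most common value, in order of first appearance.
tied = multimode(text)
print(single, tied)

# On empty input, multimode returns an empty list instead of raising.
print(multimode(""))
```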
Here is how I solved it, considering the possibility of multiple most frequent chars:
sentence = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
            "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut "
            "enim.")
joint_sentence = sentence.replace(" ", "")
frequencies = {}
for letter in joint_sentence:
    frequencies[letter] = frequencies.get(letter, 0) +1
biggest_frequency = frequencies[max(frequencies, key=frequencies.get)]
most_frequent_letters = {key: value for key, value in frequencies.items() if value == biggest_frequency}
print(most_frequent_letters)
Output:
{'e': 12, 'i': 12}
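from collections import Counter
# file: filename
# quant: number of most frequent characters you want
def frequent_letters(file, quant):
    # read the whole file, closing it automatically afterwards
    with open(file) as f:
        text = f.read()
    op = Counter(text).most_common(quant)
    return op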
# This code returns all characters in a string which have the highest frequency
def find(text):
    y = sorted([[text.count(i), i] for i in set(text)])
    # here, the count of each unique character and the character itself are
    # stored as a list [count, character] inside y (which is a list), and
    # they are sorted by count, then by character (ascending)
    # Eg : for "pradeep", y = [[1,'a'],[1,'d'],[1,'r'],[2,'e'],[2,'p']]
    most_freq = y[len(y)-1][0]
    # the count of the most frequent character is assigned to 'most_freq'
    # ie, most_freq = 2
    x = []
    for j in range(len(y)):
        if y[j][0] == most_freq:
            x.append(y[j])
    # if the 1st element of an inner list == the most frequent
    # character's count, then that [count, character] pair is
    # appended to list x.
    # eg : "pradeep"
    # x = [[2,'e'],[2,'p']]
    return x
find("pradeep")