Executing an anagram checker algorithm in less than a second - python

I have written this algorithm to check whether two strings are anagrams of each other.
In this exercise, I consider two strings to be anagrams of each other if they have the same characters or if they differ by just one. For example, math and amth are anagrams, and so are math and maths.
I need to run this algorithm in less than a second, but with some of the examples included in the test it sometimes takes more than 10 minutes. So clearly this can be done far better. The nested for loop is the problem, but I just can't come up with a solution without it.
# len(word1) <= len(word2)
def check(word1, word2):
    lword1 = len(word1)
    lword2 = len(word2)
    sword1 = sorted(word1)
    sword2 = sorted(word2)
    # lword1 == lword2
    if lword1 == lword2:
        return sword1 == sword2
    # lword1 < lword2, word2 has one more character
    sword2_copy = sword2.copy()
    for c in sword2:
        if c in sword1:
            sword1.remove(c)
            sword2_copy.remove(c)
    return len(sword1) == 0 and len(sword2_copy) == 1
def main(fin, fout, k):
    words = [line.rstrip('\n') for line in open(fin)]
    words = [x.strip(' ') for x in words]
    d = {}
    for w1 in words:
        for w2 in words:
            if len(w1) == len(w2) or len(w1) == len(w2) - 1:
                if check(w1, w2):
                    if w1 not in d.keys():
                        d[w1] = [w2]
                    else:
                        d[w1].append(w2)
    highV = list(d.values())[0]
    highK = list(d.keys())[0]
    for key, value in d.items():
        if len(value) > len(highV) or (len(value) == len(highV) and key < highK):
            highK = key
            highV = value
    highV.sort()
    with open(fout, 'w') as f:
        for i in range(len(highV)):
            f.write(highV[i]+' ')
            if (i + 1) % k == 0:
                f.write('\n')
    return int(len(highV))

You should check out Counter from collections:
from collections import Counter
text = 'Testanagram'  # avoid naming a variable str, which shadows the built-in
counter = Counter(text)
print(counter)
> Counter({'a': 3, 'T': 1, 'e': 1, 's': 1, 't': 1, 'n': 1, 'g': 1, 'r': 1, 'm': 1})
Using this should be much faster. You can also subtract one counter from another to get the difference.
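As a rough sketch of the subtraction idea applied to the question's rule (same letters, or one word has a single extra character): the helper name differs_by_at_most_one is made up for illustration and is not part of the original code.

```python
from collections import Counter

def differs_by_at_most_one(word1, word2):
    """Hypothetical helper: True if the words are anagrams, or if one of them
    is an anagram of the other plus exactly one extra character."""
    if abs(len(word1) - len(word2)) > 1:
        return False
    c1, c2 = Counter(word1), Counter(word2)
    # Counter subtraction keeps only positive counts, so each difference
    # holds the characters one word has in excess of the other.
    extra1 = sum((c1 - c2).values())
    extra2 = sum((c2 - c1).values())
    return extra1 + extra2 <= 1

print(differs_by_at_most_one("math", "amth"))   # True
print(differs_by_at_most_one("math", "maths"))  # True
print(differs_by_at_most_one("math", "mast"))   # False
```

This only speeds up a single comparison; the pairwise double loop over the word list is still quadratic.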

This seems to work pretty fast, although for my list of words (235,886 of them, the full list from /usr/share/dict/words) it takes around 2 seconds, so it might still be too slow for you. But honestly, how often do you plan to run it on the entire list?
with open('/usr/share/dict/words', 'r') as f:
    wordlist = f.readlines()
wordlist = [word.strip() for word in wordlist]
wordlist = [(word,''.join(sorted([kar for kar in word]))) for word in wordlist]
worddict = {}
for word in wordlist:
    if word[1] in worddict:
        worddict[word[1]].append(word[0])
    else:
        worddict[word[1]] = [word[0]]
for word in wordlist:
    if len(worddict[word[1]]) > 1:
        print (worddict[word[1]])
Result:
['aal', 'ala']
['aam', 'ama']
['aba', 'baa']
['abac', 'caba']
['abactor', 'acrobat']
['abaft', 'bafta']
['abalone', 'balonea']
...
(27,390 lines omitted for brevity)
['ozotype', 'zootype']
['gazy', 'zyga']
['glazy', 'zygal']
[Finished in 2.1s]
It creates a dictionary with the sorted characters of each word as the key and a list containing just that word itself as its initial value. If the key is already present in the dictionary, the word is appended to its list. Then it's only a matter of printing all lists longer than 1 item.
A side effect is that all anagram words appear multiple times. That's logic for you: incomputable, which is in this word list, is an anagram of uncompatible, and therefore uncompatible (per definition also in this list) is an anagram of incomputable. QED.¹
The largest set of anagram words it finds is this one:
['angor', 'argon', 'goran', 'grano', 'groan', 'nagor', 'orang', 'organ', 'rogan']
and an interesting pair of opposites is
['misrepresentation', 'representationism']
The list even contains the word 'pythonic':
['hypnotic', 'phytonic', 'pythonic', 'typhonic']
¹ After trying: printing each combination only once appeared to be trivial. It reduces the list to 12,189 'unique' sets, and the check took another 0.1 second.
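The sorted-letters dictionary above handles exact anagrams only. A possible way to extend it to the question's rule (a match may also have one extra character) is sketched below. The names sorted_key and group_near_anagrams are hypothetical, and this is an untimed sketch, not a measured solution.

```python
from collections import defaultdict

def sorted_key(word):
    """Canonical key: the word's letters in sorted order."""
    return "".join(sorted(word))

def group_near_anagrams(words):
    """Map each word to the words with the same letters, plus the words
    that have the same letters and exactly one more."""
    by_key = defaultdict(list)      # sorted letters -> words with exactly those letters
    one_longer = defaultdict(list)  # sorted letters -> words with those letters plus one
    for w in words:
        key = sorted_key(w)
        by_key[key].append(w)
        # Dropping one character from the key gives the key of every shorter
        # word that w extends; the set avoids duplicates for repeated letters.
        for shorter in {key[:i] + key[i + 1:] for i in range(len(key))}:
            one_longer[shorter].append(w)
    return {w: by_key[sorted_key(w)] + one_longer[sorted_key(w)] for w in words}

groups = group_near_anagrams(["math", "amth", "maths", "cat"])
print(sorted(groups["math"]))   # ['amth', 'math', 'maths']
```

Each word is processed once, with work proportional to the square of its length, so the quadratic comparison of every pair of words disappears.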

The main optimisation will be in check().
Using something like collections.Counter, the solution looks simpler, but is slower:
from collections import Counter
def check_with_counter(word1, word2):
    c1 = Counter(word1)
    c2 = Counter(word2)
    return sum(((c1 - c2) + (c2 - c1)).values()) < 2
A solution similar to yours, but considerably faster (by about an order of magnitude):
def check_faster(word1, word2):
    # checks faster than check() and works for words of any length (in any order)
    ld = len(word1) - len(word2)
    if ld in [0, 1]:
        sword_long = list(word1)
        sword_short = list(word2)
        if ld == 0:
            # same length: anagrams only if the sorted letters are equal
            return sorted(sword_long) == sorted(sword_short)
    elif ld == -1:
        sword_long = list(word2)
        sword_short = list(word1)
    else:
        return False
    for c in sword_short:
        try:
            sword_long.remove(c)
        except ValueError:
            pass
    return len(sword_long) < 2
And putting it to use in a somewhat faster version of main():
def run_faster(fin, fout, k):
    words = [line.rstrip('\n') for line in open(fin)]
    words = [x.strip(' ') for x in words]
    d = {}
    for w1 in words:
        for w2 in words:
            if check_faster(w1, w2):
                if w1 not in d.keys():
                    d[w1] = [w2]
                else:
                    d[w1].append(w2)
    most = 0
    most_anagrams = []
    for word, anagrams in d.items():
        if len(anagrams) > most:
            most = len(anagrams)
            most_anagrams = anagrams
    most_anagrams.sort()
    with open(fout, 'w') as f:
        for i in range(len(most_anagrams)):
            f.write(most_anagrams[i]+' ')
            if (i + 1) % k == 0:
                f.write('\n')
    return int(len(most_anagrams))

Related

How to find the longest common substring between two strings using Python?

I want to write a Python code that computes the longest common substring between two strings from the input.
Example:
word1 = input('Give 1. word: xlaqseabcitt')
word2 = input('Give 2. word: peoritabcpeor')
Wanted output:
abc
I have code like this so far:
word1 = input("Give 1. word: ")
word2 = input("Give 2. word: ")
longestSegment = ""
tempSegment = ""
for i in range(len(word1)):
    if word1[i] == word2[i]:
        tempSegment += word1[i]
    else:
        tempSegment = ""
    if len(tempSegment) > len(longestSegment):
        longestSegment = tempSegment
print(longestSegment)
I end up with IndexError when word2 is shorter than word1, and it does not give me the common substring.
EDIT: I found this solution:
string1 = input('Give 1. word: ')
string2 = input('Give 2. word: ')
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
    for j in range(len2):
        lcs_temp=0
        match=''
        while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
            match += string2[j+lcs_temp]
            lcs_temp+=1
        if (len(match) > len(answer)):
            answer = match
print(answer)
However, I would like to see a library function call that could be used to compute the longest common substring between two strings.
Alternatively, please suggest a more concise code to achieve the same.
You can build a dictionary from the first string that maps each character to the positions where it occurs. Then go through the second string and, for each character that also occurs in the first string, compare the rest of the second string from that position with the rest of the first string from each recorded position, keeping the longest common prefix:
# extract common prefix
def common(A,B) :
    firstDiff = (i for i,(a,b) in enumerate(zip(A,B)) if a!=b) # 1st difference
    commonLen = next(firstDiff,min(len(A),len(B))) # common length
    return A[:commonLen]
word1 = "xlaqseabcitt"
word2 = "peoritabcpeor"
# position(s) of each character in word1
sub1 = dict()
for i,c in enumerate(word1): sub1.setdefault(c,[]).append(i)
# maximum (by length) of common prefixes from matching first characters
maxSub = max((common(word2[i:],word1[j:])
              for i,c in enumerate(word2)
              for j in sub1.get(c,[])),key=len)
print(maxSub) # abc
For me, the solution that works is the suffix_trees package:
from suffix_trees import STree
a = ["xxx ABC xxx", "adsa abc"]
st = STree.STree(a)
print(st.lcs()) # "abc"
Here is an answer in case you later want to handle any number of strings. It should return the longest common substring, and it worked with the different tests I gave it (as long as you don't use the '§' character).
It is not a library, but you can still import the functions into your code just like one. You can use the same logic in your own code (for only two strings). Put both files in the same directory for the sake of simplicity; I am assuming you will call the file findmatch.py.
import findmatch
longest_substring = findmatch.prep(['list', 'of', 'strings'])
Here is the code that should be in 'findmatch.py'.
def main(words,first):
nextreference = first
reference = first
for word in words:
foundsub = False
print('reference : ',reference)
print('word : ', word)
num_of_substring = 0
length_longest_substring = 0
for i in range(len(word)):
print('nextreference : ', nextreference)
letter = word[i]
print('letter : ', letter)
if word[i] in reference:
foundsub = True
num_of_substring += 1
locals()['substring'+str(num_of_substring)] = word[i]
print('substring : ', locals()['substring'+str(num_of_substring)])
for j in range(len(reference)-i):
if word[i:i+j+1] in reference:
locals()['substring'+str(num_of_substring) ]= word[i:i+j+1]
print('long_sub : ',locals()['substring'+str(num_of_substring)])
print('new : ',len(locals()['substring'+str(num_of_substring)]))
print('next : ',len(nextreference))
print('ref : ', len(reference))
longer = (len(reference)<len(locals()['substring'+str(num_of_substring)]))
longer2 = (len(nextreference)<len(locals()['substring'+str(num_of_substring)]))
if (num_of_substring==1) or longer or longer2:
nextreference = locals()['substring'+str(num_of_substring)]
if not foundsub:
for i in range(len(words)):
words[i] = words[i].replace(reference, '§')
#§ should not be used in any of the strings, put a character you don't use here
print(words)
try:
nextreference = main(words, first)
except Exception as e:
return None
reference = nextreference
return reference
def prep(words):
first = words[0]
words.remove(first)
answer = main(words, first)
return answer
if __name__ == '__main__':
words = ['azerty','azertydqse','fghertqdfqf','ert','sazjjjjjjjjjjjert']
#just some absurd examples any word in here
substring = prep(words)
print('answer : ',substring)
It is basically creating your own library.
I hope this answer helps someone.
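Here is a recursive solution. Note that it returns the length of the longest common subsequence (characters in order, but not necessarily adjacent), not the longest common substring:
def lcs(X, Y, m, n):
    if m == 0 or n == 0:
        return 0
    elif X[m - 1] == Y[n - 1]:
        return 1 + lcs(X, Y, m - 1, n - 1)
    else:
        return max(lcs(X, Y, m, n - 1), lcs(X, Y, m - 1, n))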
Since someone asked for a multiple-word solution, here's one:
def multi_lcs(words):
    words.sort(key=lambda x:len(x))
    search = words.pop(0)
    s_len = len(search)
    for ln in range(s_len, 0, -1):
        for start in range(0, s_len-ln+1):
            cand = search[start:start+ln]
            for word in words:
                if cand not in word:
                    break
            else:
                return cand
    return False
>>> multi_lcs(['xlaqseabcitt', 'peoritabcpeor'])
'abc'
>>> multi_lcs(['xlaqseabcitt', 'peoritabcpeor', 'visontatlasrab'])
'ab'
For small strings, copy this into a file in your project, say string_utils.py:
def find_longest_common_substring(string1, string2):
    s1 = string1
    s2 = string2
    longest_substring = ""
    longest_substring_i1 = None
    longest_substring_i2 = None
    # iterate through every index (i1) of s1
    for i1, c1 in enumerate(s1):
        # for each index (i2) of s2 that matches s1[i1]
        for i2, c2 in enumerate(s2):
            # if start of substring
            if c1 == c2:
                delta = 1
                # make sure we aren't running past the end of either string
                while i1 + delta < len(s1) and i2 + delta < len(s2):
                    # if end of substring
                    if s2[i2 + delta] != s1[i1 + delta]:
                        break
                    # still matching characters move to the next character in both strings
                    delta += 1
                substring = s1[i1:(i1 + delta)]
                # print(f'substring candidate: {substring}')
                # replace longest_substring if newly found substring is longer
                if len(substring) > len(longest_substring):
                    longest_substring = substring
                    longest_substring_i1 = i1
                    longest_substring_i2 = i2
    return (longest_substring, longest_substring_i1, longest_substring_i2)
Then it can be used as follows:
import string_utils
print(f"""(longest substring, index of string1, index of string2):
{ string_utils.find_longest_common_substring("stackoverflow.com", "tackerflow")}""")
For anyone who is curious, the print statement, when uncommented, prints:
substring candidate: tack
substring candidate: ack
substring candidate: ck
substring candidate: o
substring candidate: erflow
substring candidate: rflow
substring candidate: flow
substring candidate: low
substring candidate: ow
substring candidate: w
substring candidate: c
substring candidate: o
(longest substring, index of string1, index of string2):
('erflow', 7, 4)
Here is a naive solution in terms of time complexity but simple enough to understand:
def longest_common_substring(a, b):
    """Find longest common substring between two strings A and B."""
    if len(a) > len(b):
        a, b = b, a
    for i in range(len(a), 0, -1):
        for j in range(len(a) - i + 1):
            if a[j:j + i] in b:
                return a[j:j + i]
    return ''
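For longer inputs, the usual alternative to the naive search is dynamic programming over common suffix lengths, which takes time proportional to len(a) * len(b). The following is a minimal sketch of that technique; the function name longest_common_substring_dp is made up here.

```python
def longest_common_substring_dp(a, b):
    """Return one longest common substring of a and b ('' if there is none)."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside a
    # prev[j] holds the length of the longest common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the match that ended one character earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]

print(longest_common_substring_dp("xlaqseabcitt", "peoritabcpeor"))  # abc
```

Only two rows of the table are kept, so the extra memory is proportional to the length of b.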
A super fast library is available for Python: pylcs
It can find the indices of the longest common substring (LCS) between 2 strings, and can do some other related tasks as well.
A function to return the LCS using this library consists of 2 lines:
import pylcs
def find_LCS(s1, s2):
    res = pylcs.lcs_string_idx(s1, s2)
    return ''.join([s2[i] for i in res if i != -1])
Example:
s1 = 'bbbaaabaa'
s2 = 'abaabaab'
print(find_LCS(s1, s2))
aabaa
Explanation:
In this example res is:
[-1, -1, -1, -1, 2, 3, 4, 5, 6]
It is a mapping of all characters in s1 - to the indices of characters in s2 of the LCS.
-1 indicates that the character of s1 is NOT part of the LCS.
The reasons behind the speed and efficiency of this library are that it's implemented in C++ and uses dynamic programming.
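Since the question asked for a library call, it may be worth noting that the standard library's difflib can do this without installing anything. A small sketch follows; the wrapper name lcs_difflib is hypothetical.

```python
from difflib import SequenceMatcher

def lcs_difflib(s1, s2):
    """Longest common substring via difflib (returns '' if there is none)."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # characters in long inputs, which could otherwise hide a match.
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    # match.a is the start index in s1, match.size the length of the match
    return s1[match.a:match.a + match.size]

print(lcs_difflib("xlaqseabcitt", "peoritabcpeor"))  # abc
```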

How can I map multiple characters in a string to single characters more efficiently?

I am looking for an efficient method to map groups of characters to single characters.
Currently, my code looks similar to the following:
example = 'Accomodation'
VOWELS = 'aeiou'
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'
output = ''
for char in example:
    if char in VOWELS:
        output += 'v'
    elif char in VOWELS.upper():
        output += 'V'
    elif char in CONSONANTS:
        ....
Eventually it will return, in the case of the example, Vccvcvcvcvvc.
I would like to make this part more efficient:
for char in example:
    if char in VOWELS:
        output += 'v'
    elif char in VOWELS.upper():
        output += 'V'
    elif char in CONSONANTS:
        ....
Ideally, the solution would allow for a dictionary of characters to map to as the key, with their values being a list of options. E.g.
replace_dict = {'v': VOWELS,
'V': VOWELS.upper(),
'c': CONSONANTS,
...
I am not too familiar with map, but I'd expect the solution would utilise it somehow.
Research
I found a similar problem here: python replace multiple characters in a string
The solution to that problem indicates I would need something like:
target = 'Accomodation'
charset = 'aeioubcdfghjklmnpqrstvwxyzAEIOUBCDFGHJKLMNPQRSTVWXYZ'
key = 'vvvvvcccccccccccccccccccccVVVVVCCCCCCCCCCCCCCCCCCCCC'
However, I don't think the assignments look particularly clear, even though they save a block of if/else statements. Additionally, if I wanted to add more character sets, e.g. for different foreign character sets, the assignments would be even less readable.
Can anyone, perhaps with better knowledge on built-in functions, produce an example that works more efficiently/cleanly than the above two examples?
I am also open to other ideas that do not require the use of a dictionary.
The solution should be in python3.
There is a more efficient way, which is to create a dict like this:
example = 'Accomodation'
VOWELS = 'aeiou'
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'
replace_dict = {
    **{v: 'v' for v in VOWELS},
    **{V: 'V' for V in VOWELS.upper()},
    **{c: 'c' for c in CONSONANTS}
}
print(''.join(replace_dict[s] for s in example))
# Vccvcvcvcvvc
Your replace_dict idea is close, but it's better to "flip" the dict "inside-out", i.e. turn it from {'v': 'aei', 'c': 'bc'} into {'a': 'v', 'e': 'v', 'b': 'c', ...}.
def get_replace_map_from_dict(replace_dict):
    replace_map = {}
    for cls, chars in replace_dict.items():
        replace_map.update(dict.fromkeys(chars, cls))
    return replace_map
def replace_with_map(s, replace_map):
    return "".join(replace_map.get(c, c) for c in s)
VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
replace_map = get_replace_map_from_dict(
    {"v": VOWELS, "V": VOWELS.upper(), "c": CONSONANTS}
)
print(replace_with_map("Accommodation, thanks!", replace_map))
The replace_with_map function above retains all unmapped characters (but you can change that with the second parameter to .get() there), so the output is
Vccvccvcvcvvc, ccvccc!
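The question also asked about built-ins. The same inverted mapping can be handed to str.maketrans and str.translate, which do the per-character lookup internally. A minimal sketch, where the names classes and table are chosen here for illustration:

```python
VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# Class label -> characters, following the question's replace_dict idea.
classes = {
    "v": VOWELS,
    "V": VOWELS.upper(),
    "c": CONSONANTS,
    "C": CONSONANTS.upper(),
}

# str.maketrans accepts a dict mapping single characters to replacement strings.
table = str.maketrans(
    {ch: label for label, chars in classes.items() for ch in chars}
)

# Characters missing from the table are left unchanged by translate().
print("Accommodation, thanks!".translate(table))  # Vccvccvcvcvvc, ccvccc!
```

Adding another character set then only means adding one more entry to classes.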
This is one approach using a dict.
Ex:
example = 'Accomodation'
VOWELS = 'aeiou'
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'
replace_dict = {'v': VOWELS,
"V": VOWELS.upper(),
"c": CONSONANTS
}
print("".join(k for i in example
for k, v in replace_dict.items() if i in v
)
)
Output:
Vccvcvcvcvvc
How about a reverse lookup to what you are doing - should be scalable
VOWELS = 'aeiou'
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'
example = "Accomodation"
lookup_dict = {k: "v" for k in VOWELS}
lookup_dict.update({k: "c" for k in CONSONANTS})
lookup_dict.update({k: "V" for k in VOWELS.upper()})
lookup_dict.update({k: "C" for k in CONSONANTS.upper()})
''.join([lookup_dict[i] for i in example])
Try this one. No need for CONSONANTS and works not only with English, but with Russian letters as well (I was surprised):
example = 'AccomodatioNеёэыуюяЕЁЭЫуюяРаботает'
VOWELS = 'aeiouуаоиеёэыуюя'
output = ''
for char in example:
    if char.isalpha():
        x = 'v' if char.lower() in VOWELS else 'c'
        output += x if char.islower() else x.upper()
print(output)
VccvcvcvcvvCvvvvvvvVVVVvvvCvcvcvvc
I am new to Python and having a lot of fun playing with it. Let's see how good these dictionaries are. The four algorithms that were suggested here:
Alex (myself) - C runtime library style
Adam - matching with four strings
Sanyash, Rakesh, Mortz - dictionary (lookup tables)
AKX - replace with map
I made small corrections to the proposed code so that all versions behave consistently. I also wanted to keep the combined code under 100 lines, but ended up with 127 because of the four functions to test and the extra blank lines needed to satisfy PyCharm. Here are the results of the first race:
Place Name Time Total
1. AKX 0.6777 16.5018 The winner of Gold medal!!!
2. Sanyash 0.8874 21.5725 Slower by 31%
3. Alex 0.9573 23.2569 Slower by 41%
4. Adam 0.9584 23.2210 Slower by 41%
Then I made small improvements to my code:
VOWELS_UP = VOWELS.upper()
def vowels_consonants0(example):
    output = ''
    for char in example:
        if char.isalpha():
            if char.islower():
                output += 'v' if char in VOWELS else 'c'
            else:
                output += 'V' if char in VOWELS_UP else 'C'
    return output
That got me the second place:
Place Name Time Total
1. AKX 0.6825 16.5331 The winner of Gold medal!!!
2. Alex 0.7026 17.1036 Slower by 3%
3. Sanyash 0.8557 20.8817 Slower by 25%
4. Adam 0.9631 23.3327 Slower by 41%
Now I need to shave off this 3% to get first place. I tested with the text of Leo Tolstoy's novel War and Peace.
Original source code:
import time
import itertools
VOWELS = 'eaiouу' # in order of letter frequency
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'
def vowels_consonants0(example):
    output = ''
    for char in example:
        if char.isalpha():
            x = 'v' if char.lower() in VOWELS else 'c'
            output += x if char.islower() else x.upper()
    return output
def vowels_consonants1(example):
    output = ''
    for char in example:
        if char in VOWELS:
            output += 'v'
        elif char in VOWELS.upper():
            output += 'V'
        elif char in CONSONANTS:
            output += 'c'
        elif char in CONSONANTS.upper():
            output += 'C'
    return output
def vowels_consonants2(example):
    replace_dict = {
        **{v: 'v' for v in VOWELS},
        **{V: 'V' for V in VOWELS.upper()},
        **{c: 'c' for c in CONSONANTS},
        # upper-case consonants map to 'C', consistent with the other functions
        **{C: 'C' for C in CONSONANTS.upper()}
    }
    return ''.join(replace_dict[s] if s in replace_dict else '' for s in example)
def get_replace_map_from_dict(replace_dict):
    replace_map = {}
    for cls, chars in replace_dict.items():
        replace_map.update(dict.fromkeys(chars, cls))
    return replace_map
def replace_with_map(s, replace_map):
    return "".join(replace_map.get(c, c) for c in s)
replace_map = get_replace_map_from_dict(
    {"v": VOWELS, "V": VOWELS.upper(), "c": CONSONANTS, "C": CONSONANTS.upper()}
)
def vowels_consonants3(example):
    output = ''
    for char in example:
        if char in replace_map:
            output += char
    output = replace_with_map(output, replace_map)
    return output
def test(function, name):
    text = open(name, encoding='utf-8')
    t0 = time.perf_counter()
    line_number = 0
    char_number = 0
    vc_number = 0 # vowels and consonants
    while True:
        line_number += 1
        line = text.readline()
        if not line:
            break
        char_number += len(line)
        vc_line = function(line)
        vc_number += len(vc_line)
    t0 = time.perf_counter() - t0
    text.close()
    return t0, line_number, char_number, vc_number
tests = [vowels_consonants0, vowels_consonants1, vowels_consonants2, vowels_consonants3]
names = ["Alex", "Adam", "Sanyash", "AKX"]
best_time = float('inf')
run_times = [best_time for _ in tests]
sum_times = [0.0 for _ in tests]
show_result = [True for _ in tests]
print("\n!!! Start the race by permutation with no repetitions now ...\n")
print(" * - best time in race so far")
print(" + - personal best time\n")
print("Note Name Time (Permutation)")
products = itertools.permutations([0, 1, 2, 3])
for p in list(products):
    print(p)
    for n in p:
        clock, lines, chars, vcs = test(tests[n], 'war_peace.txt')
        sum_times[n] += clock
        note = " "
        if clock < run_times[n]:
            run_times[n] = clock
            note = "+" # Improved personal best time
        if clock < best_time:
            best_time = clock
            note = "*" # Improved total best time
        print("%s %8s %6.4f" % (note, names[n], clock), end="")
        if show_result[n]:
            show_result[n] = False
            print(" Lines:", lines, "Characters:", chars, "Letters:", vcs)
        else:
            print()
print("\n!!! Finish !!! and the winner by the best run time is ...\n")
print("Place Name Time Total")
i = 0
for n in sorted(range(len(run_times)), key=run_times.__getitem__):
    i += 1
    t = run_times[n]
    print("%d. %8s %.4f %.4f " % (i, names[n], t, sum_times[n]), end="")
    if i == 1:
        print("The winner of Gold medal!!!")
    else:
        print("Slower by %2d%%" % (round(100.0 * (t - best_time)/best_time)))

Anagram check in Python

I am trying to write a program that compares two lists of words pairwise and checks whether the words are anagrams.
For example,
input : ['cinema','host','aab','train'], ['iceman', 'shot', 'bab', 'rain']
I am using the below code:
#!/usr/bin/env python
anagram_dict = {}
def anagram_solver(first_words,second_words):
    for word in first_words:
        first_word = list(word)
        second_word = list(second_words[first_words.index(word)])
        first_copy = first_word
        second_copy = second_word
        if len(first_word) != len(second_word):
            anagram_dict[first_words.index(word)] = 0
        else:
            for char in first_word:
                second_word = second_copy
                if char in second_word:
                    first_copy.remove(char)
                    second_copy.remove(char)
                else:
                    pass
            if len(first_copy) == len(second_copy):
                print first_copy
                print second_copy
                anagram_dict[first_words.index(word)] = 1
            else:
                anagram_dict[first_words.index(word)] = 0
    for k,v in anagram_dict.items():
        print "%d : %d" %(k,v)
if __name__ == "__main__":
    anagram_solver(['cinema','host','aab','train'],['iceman','shot','bab','rain'])
When I execute this script, the loop "for char in first_word:" skips every other list item. For example, if it is processing the list ['c','i','n','e','m','a'], it only processes 'c', 'n', 'm' and ignores the other items. If I remove the list.remove() calls, it doesn't skip any items.
You can run the script to see what I am trying to explain.
Why does this happen, and how can I overcome it?
You can simply sort the words and check if they are equal:
def anagram_solver(first_words, second_words):
    result = []
    for i in xrange(len(first_words)):
        a = list(first_words[i])
        b = list(second_words[i])
        a.sort()
        b.sort()
        result.append(a == b)
    return result
Example:
>>> a = ['cinema','host','aab','train']
>>> b = ['iceman', 'shot', 'bab', 'rain']
>>> anagram_solver(a, b)
[True, True, False, False]
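Python handles lists by reference, so when you set first_copy = first_word, you're actually just making first_copy and first_word point to the same list. Removing items from that list while the for loop is iterating over it is what makes the loop skip elements. You can overcome this behavior (actually copy the list) using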
first_copy = first_word[:]
second_copy = second_word[:]
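The effect can be reproduced in isolation. This is a small illustrative sketch rather than the asker's code; the variable names alias, copy_ and visited are chosen here.

```python
first_word = list("cinema")
alias = first_word      # same list object, not a copy
copy_ = first_word[:]   # shallow copy: a separate list

visited = []
for char in first_word:
    visited.append(char)
    # Removing through the alias also shrinks first_word, so the loop's
    # internal index moves past the element that slid into the freed slot.
    alias.remove(char)

print(visited)     # ['c', 'n', 'm']
print(first_word)  # ['i', 'e', 'a']
print(copy_)       # ['c', 'i', 'n', 'e', 'm', 'a']
```

Iterating over the copy while removing from the original, or the other way round, visits every character.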
To answer your question according to its title, "Anagram check in Python":
You can do that in three lines:
first_words = ['cinema','host','aab','train']
second_words = ['iceman', 'shot', 'bab', 'rain']
print [sorted(a) == sorted(b) for (a,b) in zip(first_words,second_words)]
Producing:
[True, True, False, False]
You can use enumerate with sorted:
[sorted(a[ind]) == sorted(ele) for ind, ele in enumerate(b)]
There are two ways to do this. One is pretty easy; the other is a bit more complicated but optimal.
First Method
def anagram1(s1,s2):
    # We need to get rid of the empty spaces and
    # lower case the string
    s1 = s1.replace(' ', '').lower()
    s2 = s2.replace(' ', '').lower()
    # Now we will return boolean for sorted match.
    return sorted(s1) == sorted(s2)
The next method is a bit longer:
def anagram2(s1, s2):
    # Remove spaces and lower-case both strings
    s1 = s1.replace(' ', '').lower()
    s2 = s2.replace(' ', '').lower()
    # Edge case: strings of different lengths cannot be anagrams
    if len(s1) != len(s2):
        return False
    # Create an empty dictionary of letter counts
    count = {}
    for letter in s1:
        if letter in count:
            # Seen before: add one to this letter's count
            count[letter] += 1
        else:
            # First occurrence: start the count at one
            count[letter] = 1
    # For s2 we do the opposite and count down
    for letter in s2:
        if letter in count:
            count[letter] -= 1
        else:
            # Letter never appeared in s1: leave a non-zero count so the check below fails
            count[letter] = 1
    for k in count:
        if count[k] != 0:
            return False
    # Otherwise every count cancelled out
    return True
def anagram(string_one, string_two):
    string_one = string_one.replace(' ', '').lower()
    string_two = string_two.replace(' ', '').lower()
    string_list_one = []
    string_list_two = []
    for letters in string_one:
        string_list_one.append(letters)
    for letters_t in string_two:
        string_list_two.append(letters_t)
    string_list_one.sort()
    string_list_two.sort()
    if(string_list_one == string_list_two):
        return True
    else:
        return False
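The manual counting shown in this thread can also be written with collections.Counter, which compares letter counts directly. A short sketch, assuming the same normalisation (spaces removed, lower-cased); the function name is_anagram is hypothetical.

```python
from collections import Counter

def is_anagram(s1, s2):
    """True if s1 and s2 use the same letters the same number of times,
    ignoring spaces and case."""
    def normalize(s):
        return s.replace(" ", "").lower()
    # Two Counters are equal when every character has the same count in both.
    return Counter(normalize(s1)) == Counter(normalize(s2))

print(is_anagram("Clint Eastwood", "Old West Action"))  # True
print(is_anagram("aab", "bab"))                         # False
```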

How to extract a sublist from a list where the last character of the previous element is the same as the first character of the last element? - python

The aim of this program is to take a list of words and print the longest possible 'chain' of the words, the conditions of which are that each word has the same first character as the last character of the word before it.
The example I used to test this program is the list of animals:
giraffe
elephant
ant
tiger
raccoon
cat
hedgehog
mouse
The longest chain should read:
hedgehog
giraffe
elephant
tiger
raccoon
However when I run the program below it returns:
giraffe
elephant
tiger
raccoon
Please could someone help to identify the tiny issue with my program that might be causing this. It's probably obvious but I'm fresh out of ideas.
Here is the program:
from random import *
def legal_chain(word, chain):
"""Tests if a word can be 'legally' added to the end
of a chain"""
if word[0] == chain[-1][-1]:
return True
else:
return False
def longest_chain(chain, V, longest):
""" Returns the longest possible chain of strings where the
starting character of each string is the same as the last
character from a given starting word and vocabulary V"""
extended = False
for word in V:
if legal_chain(word, chain) is True:
V.remove(word)
chain.append(word)
longest = longest_chain(chain, V, longest)
extended = True
if extended is False:
if len(chain) > len(longest):
longest = chain
return longest
def find_longest(chain, V, longest):
"""Finds the longest chain for all possible starting words
within a given vocabulary V"""
longs = []
i = 0
for word in V:
chain = [word]
longer = longest_chain(chain, V, longest)
longs.append(longer)
if len(longs) == len(V):
while len(longs) > 1:
if len(longs[i]) < len(longs[i + 1]):
del longs[i]
elif len(longs[i]) > len(longs[i + 1]):
del longs[i + 1]
else:
i += 1
return longs
def print_longest(chain, V, longest):
"""Displays the longest chain of words with each word on a new line"""
the_longest = find_longest(chain, V, longest)
for list in the_longest:
for word in list:
print(word, '\n')
v = open('animals.txt', 'r').readlines()
V = [word.strip() for word in v]
longest = []
chain = []
print_longest(chain, V, longest)
PLEASE IGNORE ANY INDENTATION ERRORS, THE PROGRAM WORKS WITHOUT AN ERROR, THERE IS AN ISSUE WITH COPY AND PASTE!
Edit: I believe the following fixes the indentation errors (in the sense that there are no compiler errors, and the output is the same as the OP stated):
from random import *
def legal_chain(word, chain):
"""Tests if a word can be 'legally' added to the end of a chain"""
if word[0] == chain[-1][-1]:
return True
else:
return False
def longest_chain(chain, V, longest):
""" Returns the longest possible chain of strings where the
starting character of each string is the same as the last
character from a given starting word and vocabulary V"""
extended = False
for word in V:
if legal_chain(word, chain) is True:
V.remove(word)
chain.append(word)
longest = longest_chain(chain, V, longest)
extended = True
if extended is False:
if len(chain) > len(longest):
longest = chain
return longest
def find_longest(chain, V, longest):
"""Finds the longest chain for all possible starting words
within a given vocabulary V"""
longs = []
i = 0
for word in V:
chain = [word]
longer = longest_chain(chain, V, longest)
longs.append(longer)
if len(longs) == len(V):
while len(longs) > 1:
if len(longs[i]) < len(longs[i + 1]):
del longs[i]
elif len(longs[i]) > len(longs[i + 1]):
del longs[i + 1]
else:
i += 1
return longs
def print_longest(chain, V, longest):
"""Displays the longest chain of words with each word on a new line"""
the_longest = find_longest(chain, V, longest)
for list in the_longest:
for word in list:
print(word, '\n')
v = open('animals.txt', 'r').readlines()
V = [word.strip() for word in v]
longest = []
chain = []
print_longest(chain, V, longest)
I think this will help you:
words_array = ['giraffe', 'elephant', 'ant', 'tiger', 'raccoon', 'cat', 'hedgehog', 'mouse']
def longest_chain(words_array, current_chain):
    res_chain = list(current_chain)
    test_chain = []
    for s in words_array:
        temp_words_array = list(words_array)
        temp_words_array.remove(s)
        if len(current_chain) == 0:
            test_chain = longest_chain(temp_words_array, current_chain + [s])
        else:
            if s[0] == current_chain[-1][-1]:
                test_chain = longest_chain(temp_words_array, current_chain + [s])
        if len(test_chain) > len(res_chain):
            res_chain = list(test_chain)
    return res_chain
print(longest_chain(words_array, []))
Try this if your list is small:
from itertools import permutations, chain
animals = """giraffe
elephant
ant
tiger
raccoon
cat
hedgehog
mouse"""
longest = sorted([[(j,k) for j, k in zip(i[:-1],i[1:]) if j[-1] == k[0]] \
for i in permutations([i for i in animals.split('\n')])], key=len)[-1]
print list(chain(*[longest[0], [i[1] for i in longest]]))
[out]:
['hedgehog', 'giraffe', 'giraffe', 'elephant', 'tiger', 'raccoon']
NOTE: The list comprehension to get the longest chain is the same as such a nested loop:
animals = animals.split('\n')
animal_chains = []
# Loop through all possible permutations of the list.
for i in permutations(animals):
    possible_chain = []
    # Reading two items at once in a list.
    for j, k in zip(i[:-1], i[1:]):
        # Check if last character of the this animal == first character of the next.
        if j[-1] == k[0]:
            possible_chain.append((j,k))
    animal_chains.append(possible_chain)
# sort the animal_chains by length and get the longest.
longest = sorted(animal_chains, key=len)[-1]

Finding the most frequent character in a string

I found this programming problem while looking at a job posting on SO. I thought it was pretty interesting and as a beginner Python programmer I attempted to tackle it. However I feel my solution is quite...messy...can anyone make any suggestions to optimize it or make it cleaner? I know it's pretty trivial, but I had fun writing it. Note: Python 2.6
The problem:
Write pseudo-code (or actual code) for a function that takes in a string and returns the letter that appears the most in that string.
My attempt:
import string
def find_max_letter_count(word):
    alphabet = string.ascii_lowercase
    dictionary = {}
    for letters in alphabet:
        dictionary[letters] = 0
    for letters in word:
        dictionary[letters] += 1
    dictionary = sorted(dictionary.items(),
                        reverse=True,
                        key=lambda x: x[1])
    for position in range(0, 26):
        print dictionary[position]
        if position != len(dictionary) - 1:
            if dictionary[position + 1][1] < dictionary[position][1]:
                break
find_max_letter_count("helloworld")
Output:
>>>
('l', 3)
Updated example:
find_max_letter_count("balloon")
>>>
('l', 2)
('o', 2)
There are many ways to do this more concisely. For example, you can use the Counter class (in Python 2.7 or later):
import collections
s = "helloworld"
print(collections.Counter(s).most_common(1)[0])
If you don't have that, you can do the tally manually (2.5 or later has defaultdict):
d = collections.defaultdict(int)
for c in s:
    d[c] += 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
Having said that, there's nothing too terribly wrong with your implementation.
If you are using Python 2.7, you can do this quickly with the collections module.
collections is a module of high-performance container data types. Read more at
http://docs.python.org/library/collections.html#counter-objects
>>> from collections import Counter
>>> x = Counter("balloon")
>>> x
Counter({'o': 2, 'a': 1, 'b': 1, 'l': 2, 'n': 1})
>>> x['o']
2
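Here is a way to find the most common character using a dictionary:
message = "hello world"
d = {}
letters = set(message)
for l in letters:
    d[message.count(l)] = l  # characters with equal counts overwrite each other
top = max(d)  # dict keys are not sorted, so pick the highest count explicitly
print(d[top], top)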
Here's a way using a for loop and count():
w = input()
r = 0  # highest count so far; starting at 0 ensures s is set for any non-empty input
s = ''
for i in w:
    p = w.count(i)
    if p > r:
        r = p
        s = i
print(s)
The way I did it uses no counting helpers from the standard library, only for-loops and if-statements.
def most_common_letter():
    string = str(input())
    letters = set(string)
    if " " in letters:  # If you want to count spaces too, remove this if-statement
        letters.remove(" ")
    max_count = 0
    freq_letter = []
    for letter in letters:
        count = 0
        for char in string:
            if char == letter:
                count += 1
        if count == max_count:
            # Tie with the current maximum: keep this letter as well.
            freq_letter.append(letter)
        elif count > max_count:
            # New maximum: discard the letters collected so far.
            max_count = count
            freq_letter = [letter]
    return freq_letter, max_count
This ensures you get every letter/character that gets used the most, and not just one. It also returns how often it occurs. Hope this helps :)
If you want to have all the characters with the maximum number of counts, then you can do a variation on one of the two ideas proposed so far:
import heapq # Helps finding the n largest counts
import collections
def find_max_counts(sequence):
    """
    Returns an iterator that produces the (element, count)s with the
    highest number of occurrences in the given sequence.
    In addition, the elements are sorted.
    """
    if len(sequence) == 0:
        return  # a bare return ends the generator (raise StopIteration fails on Python 3.7+)
    counter = collections.defaultdict(int)
    for elmt in sequence:
        counter[elmt] += 1
    counts_heap = [
        (-count, elmt)  # The largest elmt counts are the smallest elmts
        for (elmt, count) in counter.items()]
    heapq.heapify(counts_heap)
    highest_count = counts_heap[0][0]
    while True:
        try:
            (opp_count, elmt) = heapq.heappop(counts_heap)
        except IndexError:
            return
        if opp_count != highest_count:
            return
        yield (elmt, -opp_count)
for (letter, count) in find_max_counts('balloon'):
    print((letter, count))
for (word, count) in find_max_counts(['he', 'lkj', 'he', 'll', 'll']):
    print((word, count))
This yields, for instance:
lebigot#weinberg /tmp % python count.py
('l', 2)
('o', 2)
('he', 2)
('ll', 2)
This works with any sequence: words, but also ['hello', 'hello', 'bonjour'], for instance.
The heapq structure is very efficient at finding the smallest elements of a sequence without sorting it completely. On the other hand, since there are not that many letters in the alphabet, you can probably also walk through the sorted list of counts until the count drops below the maximum, without any serious loss of speed.
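The sorted-list alternative mentioned above could look roughly like this. It is only a sketch, assuming Python 3; the name max_counts_sorted is made up for illustration and is not part of the answer's code.

```python
from collections import Counter
from itertools import takewhile

def max_counts_sorted(sequence):
    """Return the (element, count) pairs that share the highest count."""
    # most_common() returns the pairs sorted by descending count.
    ranked = Counter(sequence).most_common()
    if not ranked:
        return []
    top = ranked[0][1]
    # Walk the sorted list until the count drops below the maximum.
    return sorted(takewhile(lambda pair: pair[1] == top, ranked))

print(max_counts_sorted('balloon'))
```

For short inputs such as single words, sorting every count this way should cost about the same as building the heap.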
def most_frequent(text):
    frequencies = [(c, text.count(c)) for c in set(text)]
    return max(frequencies, key=lambda x: x[1])[0]
s = 'ABBCCCDDDD'
print(most_frequent(s))
frequencies is a list of (character, count) tuples. We apply max to the tuples, comparing them by count, and return the winning tuple's character. In the event of a tie, this solution will pick only one of the tied characters, and because it iterates over a set, which one it picks is not predictable.
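If a predictable tie-break matters, one option is to count once and let max scan the original text, so that a tie goes to the character that appears first. This is a sketch of a variation, not the answer's own code; most_frequent_first is a hypothetical name, and it assumes Python 3 and non-empty input (max raises ValueError on an empty string).

```python
from collections import Counter

def most_frequent_first(text):
    """Return the most frequent character; ties go to the earliest one in text."""
    counts = Counter(text)  # a single pass over the text
    # max() returns the first maximal item it encounters, so iterating over
    # the text itself favours the character that appears first.
    return max(text, key=counts.__getitem__)

print(most_frequent_first('balloon'))
```

Counting once also avoids calling text.count for every distinct character.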
I noticed that most of the answers return only one item even when several characters are tied for most common. Take "iii 444 yyy 999": the spaces, i's, 4's, y's, and 9's each occur three times. The solution should return all of them, not just the letter i:
from collections import Counter
sentence = "iii 444 yyy 999"
# The count in the first tuple from Counter().most_common(),
# i.e. the largest count
largest_count: int = Counter(sentence).most_common()[0][1]
# Keep every (character, count) tuple whose count equals the largest count
most_common_list: list = [(x, y)
    for x, y in Counter(sentence).items() if y == largest_count]
print(most_common_list)
# RETURNS
[('i', 3), (' ', 3), ('4', 3), ('y', 3), ('9', 3)]
Question :
Most frequent character in a string
The maximum occurring character in an input string
Method 1 :
a = "GiniGinaProtijayi"
d = {}
for ch in a:
    d[ch] = d.get(ch, 0) + 1
# Sort by count, highest first, and take the first (character, count) pair.
chh, max_count = sorted(d.items(), reverse=True, key=lambda item: item[1])[0]
print(chh)
print(max_count)
Method 2 :
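a = "GiniGinaProtijayi"
max_count = 0  # named max_count so the built-in max() is not shadowed
chh = ''
count = [0] * 256  # assumes every character code is below 256
for ch in a:
    count[ord(ch)] += 1
for ch in a:
    if count[ord(ch)] > max_count:
        max_count = count[ord(ch)]
        chh = ch
print(chh)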
Method 3 :
import collections
line ='North Calcutta Shyambazaar Soudipta Tabu Roopa Roopi Gina Gini Protijayi Sovabazaar Paikpara Baghbazaar Roopa'
bb = collections.Counter(line).most_common(1)[0][0]
print(bb)
Method 4 :
line =' North Calcutta Shyambazaar Soudipta Tabu Roopa Roopi Gina Gini Protijayi Sovabazaar Paikpara Baghbazaar Roopa'
def mostcommonletter(sentence):
    letters = list(sentence)
    return max(set(letters), key=letters.count)
print(mostcommonletter(line))
Here are a few things I'd do:
Use collections.defaultdict instead of the dict you initialise manually.
Use built-in functions such as sorted and max instead of working it out yourself - it's easier.
Here's my final result:
from collections import defaultdict
def find_max_letter_count(word):
    matches = defaultdict(int)  # makes the default value 0
    for char in word:
        matches[char] += 1
    return max(matches.iteritems(), key=lambda x: x[1])
find_max_letter_count('helloworld') == ('l', 3)
If you could not use collections for any reason, I would suggest the following implementation:
s = input()
d = {}
# We iterate through the string, and if we find a character that
# is already in the dict, then we just increment its counter.
for ch in s:
    if ch in d:
        d[ch] += 1
    else:
        d[ch] = 1
# If we are given an empty string, we just
# print a message saying so.
print(max(d, key=d.get, default='Empty string was given.'))
sentence = "This is a great question made me wanna watch matrix again!"
char_frequency = {}
for char in sentence:
    if char == " ":  # to skip spaces
        continue
    elif char in char_frequency:
        char_frequency[char] += 1
    else:
        char_frequency[char] = 1
char_frequency_sorted = sorted(
char_frequency.items(), key=lambda ky: ky[1], reverse=True
)
print(char_frequency_sorted[0]) #output -->('a', 9)
# return the letter with the max frequency.
def maxletter(word: str) -> tuple:
    ''' return the letter with the max occurrence '''
    dic = {}
    for letter in word:
        if letter in dic:
            dic[letter] += 1
        else:
            dic[letter] = 1
    highest = max(dic.values())  # computed once instead of on every iteration
    for k in dic:
        if dic[k] == highest:
            return k, dic[k]
l, n = maxletter("Hello World")
print(l, n)
output: l 3
you may also try something below.
from pprint import pprint
sentence = "this is a common interview question"
char_frequency = {}
for char in sentence:
    if char in char_frequency:
        char_frequency[char] += 1
    else:
        char_frequency[char] = 1
pprint(char_frequency, width = 1)
out = sorted(char_frequency.items(),
             key = lambda kv : kv[1], reverse = True)
print(out)
print(out[0])
statistics.mode(data)
Return the single most common data point from discrete or nominal data. The mode (when it exists) is the most typical value and serves as a measure of central location.
If there are multiple modes with the same frequency, returns the first one encountered in the data. If the smallest or largest of those is desired instead, use min(multimode(data)) or max(multimode(data)). If the input data is empty, StatisticsError is raised.
import statistics as stat
test = 'This is a test of the fantastic mode super special function ssssssssssssss'
test2 = ['block', 'cheese', 'block']
val = stat.mode(test)
val2 = stat.mode(test2)
print(val, val2)
mode assumes discrete data and returns a single value. This is the standard treatment of the mode as commonly taught in schools:
mode([1, 1, 2, 3, 3, 3, 3, 4])
3
The mode is unique in that it is the only statistic in this package that also applies to nominal (non-numeric) data:
mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
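Applied to a string with tied characters, the two functions quoted above behave as in this small sketch. It assumes Python 3.8 or later, where multimode was added and mode returns the first mode encountered instead of raising an error on ties; the sample word is only an illustration.

```python
import statistics

text = "balloon"  # 'l' and 'o' both occur twice

# mode() returns the first of the tied values encountered in the data.
first_mode = statistics.mode(text)
# multimode() returns every tied value, in order of first appearance.
all_modes = statistics.multimode(text)

print(first_mode, all_modes)
```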
Here is how I solved it, considering the possibility of multiple most frequent chars:
sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, \
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut \
enim."
joint_sentence = sentence.replace(" ", "")
frequencies = {}
for letter in joint_sentence:
    frequencies[letter] = frequencies.get(letter, 0) + 1
biggest_frequency = frequencies[max(frequencies, key=frequencies.get)]
most_frequent_letters = {key: value for key, value in frequencies.items() if value == biggest_frequency}
print(most_frequent_letters)
Output:
{'e': 12, 'i': 12}
from collections import Counter
# file: filename
# quant: number of most frequent characters you want
def frequent_letters(file, quant):
    with open(file) as f:  # the file is closed automatically
        text = f.read()
    return Counter(text).most_common(quant)
# This code returns all characters in a string which have the highest frequency
def find(text):
    y = sorted([[text.count(i), i] for i in set(text)])
    # Here, each unique character is stored with its count as a list
    # [count, character] inside y (which is a list), and the pairs are
    # sorted by count (ascending).
    # Eg : for "pradeep", y = [[1,'a'],[1,'d'],[1,'r'],[2,'e'],[2,'p']]
    most_freq = y[len(y)-1][0]
    # the count of the most frequent character is assigned to most_freq
    # ie, most_freq = 2
    x = []
    for j in range(len(y)):
        if y[j][0] == most_freq:
            x.append(y[j])
    # If the count in a pair equals the highest count, the pair is
    # appended to list x, so x holds every character with the
    # highest frequency.
    # eg : "pradeep"
    # x = [[2,'e'],[2,'p']]
    return x
find("pradeep")
