Python check for Anagram in O(n) solution - python

I'm trying to check whether two strings are anagrams. This solution is simple but not the most efficient (O(n log n)). I know I could use collections.Counter and compare the occurrences of each character, but I'm trying to avoid any modules for an interview. What would be the fastest way to solve this problem? (Perhaps by counting the occurrences of each character?)
def check(word1, word2):
    return sorted(word1) == sorted(word2)

Your code doesn't even return a correct value. This one-liner is O(n log n):
return sorted(word1) == sorted(word2)
For an O(n) solution, you can count all characters:
from collections import Counter
# ...
def check(a, b):
    return Counter(a) == Counter(b)
Without collections it is much longer:
def check(a, b):
    chars = dict.fromkeys(a + b, 0)
    for c in a:
        chars[c] += 1
    for c in b:
        chars[c] -= 1
    return not any(chars.values())
This code does the following:
chars = dict.fromkeys(a + b, 0): creates a dict whose keys are all the characters occurring in either word, each set to 0.
for c in a: chars[c] += 1: iterates over a and counts the occurrences of each character. chars now holds the count of each distinct character (and zeroes for characters that occur in b but not in a).
for c in b: chars[c] -= 1: much the same as before, but this subtracts the character counts of b from chars.
return not any(chars.values()): chars['h'] == 0 if and only if a and b have the same number of 'h's. This line checks that chars has only zeroes as values, meaning every character has the same count in both inputs. (any returns True if there is any truthy value in the sequence; 0 is falsy and every other integer is truthy.)
Both strings get iterated over once. Assuming O(1) access time for dictionaries, the whole algorithm runs in O(n) time (where n is the total length of the inputs). Space complexity is O(n) too, since all characters can be distinct. Don't mix the two up when an interviewer asks about complexity: they do not necessarily mean time complexity.
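As a variation on the counting idea above, here is a minimal sketch that bails out early. The name check_early_exit is made up for illustration, and the length check up front is an added assumption, not part of the code above. Because the lengths are equal, the strings are anagrams exactly when no count ever has to drop below zero.

```python
def check_early_exit(a, b):
    # Strings of different lengths can never be anagrams.
    if len(a) != len(b):
        return False
    counts = {}
    for c in a:
        counts[c] = counts.get(c, 0) + 1
    for c in b:
        remaining = counts.get(c, 0)
        # b uses c more often than a does: stop right away.
        if remaining == 0:
            return False
        counts[c] = remaining - 1
    # Equal lengths and no count went negative, so all counts are zero.
    return True
```

This is still O(n) time and O(n) space in the worst case; it only saves work on inputs that differ early.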

Here's a nice option from http://interactivepython.org/runestone/static/pythonds/AlgorithmAnalysis/AnAnagramDetectionExample.html:
def anagramSolution(s1,s2):
    TABLE_SIZE = 128
    c1 = [0]*TABLE_SIZE
    c2 = [0]*TABLE_SIZE
    for ch in s1:
        pos = ord(ch)
        c1[pos] = c1[pos] + 1
    for ch in s2:
        pos = ord(ch)
        c2[pos] = c2[pos] + 1
    j = 0
    stillOK = True
    while j<TABLE_SIZE and stillOK:
        if c1[j]==c2[j]:
            j = j + 1
        else:
            stillOK = False
    return stillOK
This runs in O(n). Essentially, you loop over both strings, counting the occurrences of each letter. In the end, you can simply iterate over each letter, making sure the counts are equal.
As noted in the comments, this will have a harder time scaling for unicode. If you expect unicode, you would likely want to use a dictionary.

I'd write it like this without imports:
def count_occurences(mystring):
    occs = {}
    for char in mystring:
        if char in occs:
            occs[char] += 1
        else:
            occs[char] = 1
    return occs
def is_anagram(str1, str2):
    return count_occurences(str1) == count_occurences(str2)
Or, if you can use imports, just not a Counter, use a defaultdict:
from collections import defaultdict
def count_occurences(mystring):
    occs = defaultdict(int)
    for char in mystring:
        occs[char] += 1
    return occs
def is_anagram(str1, str2):
    return count_occurences(str1) == count_occurences(str2)

Related

Count max substring of the same character

I want to write a function that receives a string (s) and a single letter (c). The function needs to return the length of the longest substring consisting only of this letter. I don't know why the function I wrote doesn't work.
For example: print(count_longest_repetition('eabbaaaacccaaddd', 'a')) is supposed to return 4.
def count_longest_repetition(s, c):
    n= len(s)
    lst=[]
    length_charachter=0
    for i in range(n-1):
        if s[i]==c and s[i+1]==c:
            if s[i] in lst:
                lst.append(s[i])
                length_charachter= len(lst)
    return length_charachter
Due to the condition if s[i] in lst, nothing will ever be appended to lst: lst starts out empty, so the condition is never satisfied. Also, to traverse the entire string you need range(n), which generates the numbers from 0 to n-1. This should work:
def count_longest_repetition(s, c):
    n= len(s)
    length_charachter=0
    max_length = 0
    for i in range(n):
        if s[i] == c:
            length_charachter += 1
        else:
            length_charachter = 0
        max_length = max(max_length, length_charachter)
    return max_length
I might suggest using a regex approach here with re.findall:
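import re
def count_longest_repetition(s, c):
    # re.escape keeps characters such as '.' or '+' from being read as regex syntax
    matches = re.findall(re.escape(c) + '+', s)
    matches = sorted(matches, key=len, reverse=True)
    # No match at all means the letter never occurs in s
    return len(matches[0]) if matches else 0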
cnt = count_longest_repetition('eabbaaaacccaaddd', 'a')
print(cnt)
This prints: 4
To better explain the above, given the inputs shown, the regex used is a+, that is, find groups of one or more a characters. The sorted list result from the call to re.findall is:
['aaaa', 'aa', 'a']
By sorting descending by string length, we push the longest match to the front of the list. Then, we return this length from the function.
Your function doesn't work because if s[i] in lst: is false at the start, so nothing ever gets added to lst (and the condition stays false throughout the loop).
You should look into regular expressions for this kind of string processing/search:
import re
def count_longest_repetition(s, c):
    return max((0,*map(len,re.findall(f"{re.escape(c)}+",s))))
If you're not allowed to use libraries, you could compute repetitions without using a list by adding matches to a counter that you reset on every mismatch:
def count_longest_repetition(s, c):
    maxCount = count = 0
    for b in s:
        count = (count+1)*(b==c)
        maxCount = max(count,maxCount)
    return maxCount
This can also be done with itertools.groupby:
from itertools import groupby
def count_longest_repetition(text, let):
    # default=0 covers the case where let never occurs in text
    return max((len(list(group)) for key, group in groupby(text) if key == let), default=0)
count_longest_repetition("eabbaaaacccaaddd", 'a')
# returns 4
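To make it easier to see what groupby contributes here, the following sketch splits the work into two steps. The helper names run_lengths and longest_run are made up for illustration: the first lists every run of identical characters with its length, the second picks the longest run of one letter.

```python
from itertools import groupby

def run_lengths(text):
    # groupby yields one (key, group) pair per run of equal adjacent characters.
    return [(key, sum(1 for _ in group)) for key, group in groupby(text)]

def longest_run(text, letter):
    # default=0 handles a letter that does not occur at all.
    return max((n for key, n in run_lengths(text) if key == letter), default=0)

print(run_lengths("eabbaaaacccaaddd"))
# [('e', 1), ('a', 1), ('b', 2), ('a', 4), ('c', 3), ('a', 2), ('d', 3)]
print(longest_run("eabbaaaacccaaddd", "a"))
# 4
```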

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all subsequences that contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. It may not be as fast or as neat as chrisz's answer, but it may be a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            # This letter was already seen in the current window
            pass
        j+=1
    if len(letters)==0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue (a sliding window) seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, yield the window. Then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield "".join(cur_str)
            cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
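The same sliding-window idea can be sketched with per-letter counts instead of calling count on the whole window at every step. This is only an illustration: the names find_windows and missing are made up, and the default alphabet "ACGT" is an assumption taken from the question.

```python
def find_windows(seq, alphabet="ACGT"):
    counts = dict.fromkeys(alphabet, 0)
    missing = len(counts)  # letters of the alphabet not yet in the window
    start = 0
    for end, ch in enumerate(seq):
        if ch in counts:
            if counts[ch] == 0:
                missing -= 1
            counts[ch] += 1
        # Shrink from the left for as long as the window holds every letter.
        while missing == 0:
            yield seq[start:end + 1]
            left = seq[start]
            if left in counts:
                counts[left] -= 1
                if counts[left] == 0:
                    missing += 1
            start += 1

print(list(find_windows("AAGTCCTAG")))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```

The bookkeeping per character is constant time; slicing out each result still costs time proportional to the window length.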
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters that must all be present (d = ACGT).
The while loop tries each candidate substring s[:c] for as long as c does not exceed the length of s.
c is increased by 1 on each iteration of the while loop.
If every character in d (ACGT) exists in the substring, we print the result, reset c to its initial value and drop the first character of s.
The loop continues until s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Why is my 2nd method slower than my 1st method?

I was doing leetcode problem No. 387, First Unique Character in a String. Given a string, find the first non-repeating character in it and return its index. If it doesn't exist, return -1.
Examples:
s = "leetcode"
return 0.
s = "loveleetcode",
return 2.
I wrote 2 algorithms:
Method 1
def firstUniqChar(s):
    d = {}
    L = len(s)
    for i in range(L):
        if s[i] not in d:
            d[s[i]] = [i]
        else:
            d[s[i]].append(i)
    M = L
    for k in d:
        if len(d[k])==1:
            if d[k][0]<M:
                M = d[k][0]
    if M<L:
        return M
    else:
        return -1
This is very intuitive: first build a dictionary of positions by looping over all the characters in s (the counting could also be done in one line with collections.Counter), then do a second loop that only looks at the keys whose value is a list of length 1. Since I used 2 loops, I thought there must be some redundant computation, so I wrote the 2nd algorithm. I think it is better than the 1st one, but on the leetcode platform it runs much slower than the 1st one and I couldn't figure out why.
Method 2
def firstUniqChar(s):
    d = {}
    L = len(s)
    A = []
    for i in range(L):
        if s[i] not in d:
            d[s[i]] = i
            A.append(i)
        else:
            try:
                A.remove(d[s[i]])
            except:
                pass
    if len(A)==0:
        return -1
    else:
        return A[0]
The 2nd one loops only once over the characters in s.
Your first solution is O(n), but your second solution is O(n^2), because A.remove has to loop over the elements of A to find the value to remove.
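If a single pass over the string is the goal, one way to avoid list.remove is to keep the bookkeeping in the dict itself. The sketch below is only an illustration: the name first_uniq_char is made up, and it assumes Python 3.7+, where dicts preserve insertion order. Each character maps to its first index, or to -1 once it has been seen again.

```python
def first_uniq_char(s):
    first_index = {}
    for i, ch in enumerate(s):
        # Seen before: mark as repeated. Otherwise remember the index.
        first_index[ch] = -1 if ch in first_index else i
    # Values are in first-occurrence order, so the first non-negative
    # value is the smallest index of a non-repeating character.
    return next((i for i in first_index.values() if i >= 0), -1)

print(first_uniq_char("leetcode"))      # 0
print(first_uniq_char("loveleetcode"))  # 2
```

Every dict operation here is O(1) on average, so the whole function stays O(n).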
As others have said - using list.remove is quite expensive... Your use of collections.Counter is a good idea.
You need to scan the string once to find uniques. Then probably what's better is to sequentially scan it again and take the index of the first unique - that makes your potential code:
from collections import Counter
s = "loveleetcode"
# Build a set of unique values
unique = {ch for ch, freq in Counter(s).items() if freq == 1}
# re-iterate over the string until we first find a unique value or
# not - default to -1 if none found
first_index = next((ix for ix, ch in enumerate(s) if ch in unique), -1)
# 2

Finding Palindrome from a permutation in Python

I have a string, and I need to find the palindromic sub-strings of length 4 (formed from any 4 indexes, not necessarily adjacent), where the indexes must be in ascending order (index1<index2<index3<index4).
My code works fine for a small string like mystr, but for a large string it takes a long time.
from itertools import permutations
#Mystr
mystr = "kkkkkkz" #"ghhggh"
#Another Mystr
#mystr = "kkkkkkzsdfsfdkjdbdsjfjsadyusagdsadnkasdmkofhduyhfbdhfnsklfsjdhbshjvncjkmkslfhisduhfsdkadkaopiuqegyegrebkjenlendelufhdysgfdjlkajuadgfyadbldjudigducbdj"
l = len(mystr)
mylist = permutations(range(l), 4)
cnt = 0
for i in filter(lambda i: i[0] < i[1] < i[2] < i[3] and (mystr[i[0]] + mystr[i[1]] + mystr[i[2]] + mystr[i[3]] == mystr[i[3]] + mystr[i[2]] + mystr[i[1]] + mystr[i[0]]), mylist):
    #print(i)
    cnt += 1
print(cnt) # Number of palindromes found
If you want to stick with the basic structure of your current algorithm, you can speed it up in two ways. First, use combinations instead of permutations: combinations yields each tuple in the order the elements appear in the input, so you don't need to check that the indexes are ascending. Second, you can speed up the palindrome check by simply testing whether the first two characters are identical to the last two characters reversed (instead of comparing the whole thing against its reversed self).
from itertools import combinations
mystr = "kkkkkkzsdfsfdkjdbdsjfjsadyusagdsadnkasdmkofhduyhfbdhfnsklfsjdhbshjvncjkmkslfhisduhfsdkadkaopiuqegyegrebkjenlendelufhdysgfdjlkajuadgfyadbldjudigducbdj"
cnt = 0
for m in combinations(mystr, 4):
    if m[:2] == m[:1:-1]: cnt += 1
print(cnt)
Or if you want to simplify that last bit to a one-liner:
print(len([m for m in combinations(mystr, 4) if m[:2] == m[:1:-1]]))
I didn't do a real time test on this but on my system this method takes about 6.3 seconds to run (with your really long string) which is significantly faster than your method.
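For very long strings, enumerating every 4-tuple may still be too slow. A different technique, not taken from the answer above, is to fix the two inner positions and count the matching outer pairs from prefix and suffix character counts. The sketch below is only an illustration under that assumption; the name count_palindromic_quadruples is made up.

```python
from collections import Counter

def count_palindromic_quadruples(s):
    n = len(s)
    # prefix[j] holds the character counts of s[:j]
    prefix = [Counter()]
    for ch in s:
        nxt = prefix[-1].copy()
        nxt[ch] += 1
        prefix.append(nxt)
    # suffix[k] holds the character counts of s[k:]
    suffix = [Counter() for _ in range(n + 1)]
    for k in range(n - 1, -1, -1):
        suffix[k] = suffix[k + 1].copy()
        suffix[k][s[k]] += 1
    total = 0
    for j in range(n):
        for k in range(j + 1, n):
            if s[j] == s[k]:
                before, after = prefix[j], suffix[k + 1]
                # Pair every equal character left of j with one right of k.
                total += sum(cnt * after[ch] for ch, cnt in before.items())
    return total

print(count_palindromic_quadruples("kkkkkkz"))  # 15
```

This does O(n^2) pair checks, each costing at most the alphabet size, at the price of O(n * alphabet) memory for the count tables.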

Python find if strings are anagram of each other string

I am trying to solve the above interview question: check whether two strings are anagrams of each other. The implementation is as follows:
NO_OF_CHARS = 256
def areAnagram(str1, str2):
    count = [0] * NO_OF_CHARS
    i = 0
    while (i in str1) and (i in str2):
        count[ord(i)]+=1
        i += 1
    if len(str1) != len(str2):
        return 0
    for i in xrange(NO_OF_CHARS):
        if count[i]:
            return 0
    return 1
str1 = "geeksforgeeks"
str2 = "forgeeksgeeks"
if areAnagram(str1, str2):
    print "The two strings are anagram of each other"
else:
    print "The two strings are not anagram of each other"
I am getting the following error while running the code:
TypeError: 'in <string>' requires string as left operand
Am I doing something wrong in the while loop? Also, how can I avoid using the i = 0 statement? Thanks.
An easy way to see if strings are comprised of the same characters is to compare them as sorted lists:
def is_anagram(src, trgt):
    """
    Determine if trgt is an anagram of src
    :param src: (str)
    :param trgt: (str)
    :returns: (bool) True if trgt is an anagram of src; else False
    """
    return sorted(src) == sorted(trgt)
If you want to go for counting the characters, you need to count the characters of both strings and compare the counts:
NO_OF_CHARS = 256
def areAnagram(str1, str2):
    if len(str1) != len(str2):
        return False
    count = [0] * NO_OF_CHARS
    for c1,c2 in zip(str1,str2):
        count[ord(c1)] +=1
        count[ord(c2)] -=1
    return all(not c for c in count)
I moved checking the length of strings to the beginning of the method for efficiency and clarity
EDIT: Updated my answer according to Blckknght's comment
The canonical way to do this in Python is to use collections.Counter:
from collections import Counter
def areAnagram(str1, str2):
    return Counter(str1) == Counter(str2)
This should take O(N) space and time (where N is max(len(str1), len(str2))). But do be aware that even though this code's asymptotic performance is better, it may still be slower for short strings than a version using sorted. Python's sort code is very fast!
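Whether counting or sorting wins for short inputs depends on the machine and the Python version, so it is worth measuring. Here is a minimal sketch using the standard timeit module; the helper names by_sorting, by_counting and compare are made up for illustration, and no particular timings are claimed.

```python
import timeit
from collections import Counter

def by_sorting(a, b):
    return sorted(a) == sorted(b)

def by_counting(a, b):
    return Counter(a) == Counter(b)

def compare(a, b, repeats=1000):
    # Returns (seconds spent sorting, seconds spent counting) over `repeats` calls.
    t_sort = timeit.timeit(lambda: by_sorting(a, b), number=repeats)
    t_count = timeit.timeit(lambda: by_counting(a, b), number=repeats)
    return t_sort, t_count

print(compare("geeksforgeeks", "forgeeksgeeks"))
```

Running compare on strings of increasing length should show where, if anywhere, the counting version starts to pull ahead on a given setup.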
If you're likely to use the function to compare very unlike strings, you could perhaps speed it up a little with a special case that checks the string lengths before counting:
def areAnagram(str1, str2):
    return len(str1) == len(str2) and Counter(str1) == Counter(str2)
