Check if strings are anagrams

Check if strings are anagrams - python

I want to create a Python function that checks whether two given strings are anagrams. This code works if only one letter doesn't match: for 'bad' and 'dad' it returns 'b' and 'd'. But for 'zippo' and 'hipps' it returns only 'z' and 'h'. How should I modify the code to return all the letters that do not match?
def anagram(str_1, str_2):
    '''
    This function check if two string are anagram,
    if yes then prints yes otherwise it checks
    for the words that need to be deleted to make it an anagram.
    '''
    if sorted(str_1) == sorted(str_2):
        return "The given strings are anagrams"
    else:
        # words_needed_to_removed = []
        zipped_strings = zip(str_1,str_2)
        for (i,j) in zipped_strings:
            if i!=j:
                return i,j
                # words_needed_to_removed.append((i,j))
        # return f"The words needed to be removed to make the strings an anagram are:{words_needed_to_removed}"

You have two options:
1. Instead of return, use yield.
2. Instead of returning as soon as you find the unequal values, store the values in a list as tuples and then return the list.
Also, another suggestion is:
Instead of sorting the strings, find the count of each character in a particular string and then compare that.
Sorting takes O(nlogn) whereas counting just takes O(n). Though there's this added space complexity.
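The counting suggestion above can be sketched as follows. This is a minimal illustration rather than the asker's code: the name is_anagram is made up for the example, and collections.Counter stands in for a hand-written counting loop.

```python
from collections import Counter

def is_anagram(str_1, str_2):
    # Two strings are anagrams exactly when every character
    # occurs the same number of times in both.
    return Counter(str_1) == Counter(str_2)

print(is_anagram("listen", "silent"))
print(is_anagram("zippo", "hipps"))
```

Building each Counter is a single pass over the string, which is where the O(n) time and the extra space mentioned above come from.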

To return everything, add it to a list and then return the list at the end of the loop.
def anagram(str_1, str_2):
    if sorted(str_1) == sorted(str_2):
        return "The given strings are anagrams"
    else:
        zipped_strings = zip(str_1,str_2)
        all_letters = []
        for (i,j) in zipped_strings:
            if i!=j:
                all_letters.append((i, j))
        return all_letters
However, your code won't work when one extra letter shifts the positions of the letters after it, because zip compares the strings position by position. For example: anagram("ohipzs", "hipzas") gives [('o', 'h'), ('h', 'i'), ('i', 'p'), ('p', 'z'), ('z', 'a')], even though the only mismatch is 'o' and 'a'.
Instead, you could first make a counter containing the counts of all letters in the first word.
counter = dict()
for letter in word1:
    counter[letter] = counter.get(letter, 0) + 1
Next, subtract all letters of the second word from this counter
for letter in word2:
    counter[letter] = counter.get(letter, 0) - 1
Finally, return all keys of counter that do not have a value of zero
mismatch_letters_1 = []
mismatch_letters_2 = []
for letter, count in counter.items():
    if count == 0:
        # Counts match perfectly, skip this letter
        continue
    elif count > 0:
        append_to_list = mismatch_letters_1
    else:
        append_to_list = mismatch_letters_2
    for _ in range(abs(count)):
        append_to_list.append(letter)
answer = [(i, j) for i, j in zip(mismatch_letters_1, mismatch_letters_2)]
Putting this into a function and testing gives:
anagram2("bad", "dad") -> [('b', 'd')]
anagram2("zippo", "hipps") -> [('z', 'h'), ('o', 's')]
anagram2("ohipzs", "hipzas") -> [('o', 'a')]
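For reference, the three steps above might be assembled into one function like this. The name anagram2 comes from the test calls above, but the exact body is a reconstruction, not necessarily what the answerer ran. Note that zip stops at the shorter list, so surplus letters are dropped when the strings differ in length.

```python
def anagram2(word1, word2):
    # Step 1: count the letters of the first word.
    counter = {}
    for letter in word1:
        counter[letter] = counter.get(letter, 0) + 1
    # Step 2: subtract the letters of the second word.
    for letter in word2:
        counter[letter] = counter.get(letter, 0) - 1
    # Step 3: positive counts are surplus letters of word1,
    # negative counts are surplus letters of word2.
    mismatch_letters_1 = []
    mismatch_letters_2 = []
    for letter, count in counter.items():
        if count > 0:
            mismatch_letters_1.extend(letter * count)
        elif count < 0:
            mismatch_letters_2.extend(letter * -count)
    return list(zip(mismatch_letters_1, mismatch_letters_2))

print(anagram2("zippo", "hipps"))
```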

Related

How can I store my result in tuple in Python

Write a function called word_freq(text) which takes one string
argument. This string will not have any punctuation. Perform a count
of the number of 'n' character words in this string and return a list
of tuples of the form[(n, count), (n-1, count) ...] in descending
order of the counts. For example:
Example: word_freq('a aaa a aaaa')
Result: [(4, 1), (3, 1), (1, 2)]
Note that this does not show anything for the 2-character words.
str1 = 'a aaa a aaa'
str.split(str1)
str.count(str1)
def word_freq(str): # Python code to find frequency of each word
I tried this
text = 'a aaa a aaaa'
def word_freq(str):
    tuple = ()
    count = {}
    for x in str:
        if x in count.keys():
            count[x] += 1
        else:
            count[x] = 1
    print(count)
def count_letters(word):
    char = "a"
    count = 0
    for c in word:
        if char == c:
            count += 1
    return count
word_freq(text)

The code below does what you want; here is how it works. First we make a dictionary called wc, which will hold the count of each n-character word in the sentence. The function receives a string from the user and uses split() to turn it into a list of words. For each word it checks the length: if it is 2, the word is ignored; otherwise 1 is added to the count for that length in the dictionary.
After every word is checked, we use wc.items() to turn the dictionary into a list of tuples. Each tuple has 2 elements: the first is the number of characters in a word and the second is the number of words of that length in the sentence. All that remains is to sort this list by character count in reverse (from high to low). We do that with the sorted function, sorting on x[0], the first element of each tuple, which is the character count. Finally, we return this list of tuples, which you can print.
If anything is unclear, let me know. You can also put print() statements at every line to better understand what is happening.
Here's the code, I hope it helps:
inp = input("Enter your text: ")
def word_count(inp_str):
    wc = {}
    for item in inp_str.strip().split():
        if len(item) == 2:
            continue
        wc[len(item)] = wc.get(len(item), 0) + 1
    return sorted(wc.items(), key=lambda x: x[0], reverse = True)
print(word_count(inp))
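The note in the question may simply mean that the example contains no 2-character words, not that such words must be skipped. Under that assumption, a variant that counts every word length could look like the sketch below. It uses collections.Counter in place of the hand-built dictionary and keeps the function name from the exercise.

```python
from collections import Counter

def word_freq(text):
    # Count how many words there are of each length.
    length_counts = Counter(len(word) for word in text.split())
    # Sort by word length, longest first, as in the question's example.
    return sorted(length_counts.items(), reverse=True)

print(word_freq('a aaa a aaaa'))  # [(4, 1), (3, 1), (1, 2)]
```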

Reconstruct input string given ngrams of that string

Given a string, e.g. i am a string.
I can generate the n-grams of this string like so, using the nltk package, where n is variable as per a specified range.
from nltk import ngrams
s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)
Gives the output:
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of the below commenter, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2).
[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
The question is similar to this one, though with an additional layer of complexity.
Working solution - adapted from here.
I have a working solution, but it's really slow for longer strings.
import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    # note: range excludes max_, so the longest n-gram has max_ - 1 words
    for n in range(min_, max_):
        for idx, grams in enumerate(ngrams(s.split(), n)):
            token_lst.append(' '.join(grams))
    return token_lst
def to_sum_k(s):
    for len_ in range(1, len(s.split())+1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)
to_sum_k('a b c')

EDIT:
This answer was based on the assumption that the question was to reconstruct an unknown unique string based on its ngrams. I'll leave it active for anyone interested in this problem. The actual answer for the actual problem as clarified in the comments can be found here.
EDIT END
In general no. Consider e.g. the case n = 2 and s = "a b a b". Then your ngrams would be
[("a"), ("b"), ("a", "b"), ("b", "a")]
The set of strings that generate this set of ngrams in this case however would be all that may be generated by
(ab(a|(ab)*a?))|(ba(b|(ba)*b?))
Or n = 2, s = "a b c a b d a", where "c" and "d" may be arbitrarily ordered within the generating strings. E.g. "a b d a b c a" would also be a valid string. In addition the same issue as above arises and an arbitrary number of strings can generate the set of ngrams.
That being said there exists a way to test whether a set of ngrams uniquely identifies a string:
Consider your set of strings as a description of a non-deterministic state-machine. Each ngram can be defined as a chain of states where the single characters are transitions. As an example for the ngrams [("a", "b", "c"), ("c", "d"), ("a", "d", "b")] we would build the following state-machine:
0 ->(a) 1 ->(b) 2 ->(c) 3
0 ->(c) 3 ->(d) 4
0 ->(a) 1 ->(d) 5 ->(b) 6
Now perform a determinization of this state-machine. Iff there exists a unique string that can be reconstructed from the ngrams, the state-machine will have a longest transition-chain that doesn't contain any cycles and contains all ngrams we built the original state-machine from. In this case the original string is simply the individual state-transitions of this path joined back together. Otherwise there exist multiple strings that can be built from the provided ngrams.

While my previous answer assumed that the problem was to find an unknown string based on its ngrams, this answer will deal with the problem of finding all ways to construct a given string using its ngrams.
Assuming repetitions are allowed the solution is fairly simple: Generate all possible number sequences summing up to the length of the original string with no number larger than n and use these to create the ngram-combinations:
import numpy
def generate_sums(l, n, intermediate):
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])
def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
EDIT:
As pointed out by #norok2 (thanks for the work) in the comments, it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT
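One possible numpy-free variant uses itertools.accumulate for the running sums. This is only a sketch of the "alternative cumsum" idea, not the implementation benchmarked in the comments, and the name split_into_ngrams is made up for the example.

```python
from itertools import accumulate

def generate_sums(total, n, intermediate=()):
    # Yield every sequence of integers in 1..n that sums to `total`.
    if total == 0:
        yield list(intermediate)
        return
    for i in range(1, min(n, total) + 1):
        yield from generate_sums(total - i, n, intermediate + (i,))

def split_into_ngrams(s, n):
    words = s.split()
    for parts in generate_sums(len(words), n):
        # Running sums of the part lengths give the slice boundaries.
        bounds = [0] + list(accumulate(parts))
        yield [tuple(words[l:u]) for l, u in zip(bounds, bounds[1:])]

for split in split_into_ngrams("i am a string", 2):
    print(split)
```

For the question's example with a maximum length of 2, this yields the same five splits listed in the question.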
If repetitions are not allowed things become a little bit more tricky. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences based on traversals of the automaton:
from nltk import ngrams

def build_state_machine(s, n):
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions
def combinations(s, n):
    transitions = build_state_machine(s, n)
    states = [(0, set(), [], [])]
    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                # union() needs an iterable, so wrap the integer state in a set
                new_states.append((0, term_visited.union({state}), path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))
        states = new_states
    return [path + [tuple(cur_elem)] if state != 0 else path for (state, term_visited, path, cur_elem) in states if state not in term_visited]
As an example the following state machine would be generated for the string "a b a":
Red connections indicate a switch to the next ngram and need to be handled separately (second if in the loop), since they can only be traversed once.

find couple of words which if deleting one letter will print the couple of words

Given a string, I need to find pairs of similar words that differ by exactly one letter, without using libraries or short functions. For example, "no" and "noc" are similar words that differ by one letter.
for example:
if I have the string "car ucar nor or caar"
will print:
car---ucar
nor---or
car---caar
I have this code:
What do I need to change so that the code works?
Also, I don't know how to define j so that it starts from the next word after index 0.
Thank you for the help!
def Difference(s):
    list=s.split(" ")
    i=0
    countDigit=0
    for word1 in range(len(list)):
        for word2 in range(len(list)):
            if word1[i]==word1[j]:
                i+=1
                j+=1
                continue
            elif word1[i]!=word[j]:
                countDigit+=1
                if countDigit==1:
                    print(word1,"--- ",word2)
                else:
                    break
s="car ucar nor or caar"
Difference(s)

You can use this function to check if two strings are one edit away or not.
Call this function for every pair of strings and if this returns TRUE print that couple otherwise pass next pair of string to this function.
You will have to convert this algorithm in Python, would be an easy task!
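The function referred to above is not reproduced here, so the following is only a sketch of the idea. It assumes "one edit away" means that deleting exactly one letter from the longer word gives the shorter one, as in the question's examples. The names one_edit_away and similar_pairs are made up for the illustration, and no libraries are used.

```python
def one_edit_away(shorter, longer):
    """True if deleting exactly one letter from `longer` gives `shorter`."""
    if len(longer) - len(shorter) != 1:
        return False
    i = j = 0
    skipped = False
    while i < len(shorter) and j < len(longer):
        if shorter[i] == longer[j]:
            i += 1
            j += 1
        elif skipped:
            # A second mismatch means more than one deletion is needed.
            return False
        else:
            # Skip one letter of the longer word, once.
            skipped = True
            j += 1
    return True

def similar_pairs(text):
    words = text.split(" ")
    pairs = []
    for i in range(len(words)):
        # j starts at the word after i, so each pair is visited only once.
        for j in range(i + 1, len(words)):
            a, b = words[i], words[j]
            if one_edit_away(a, b):
                pairs.append((a, b))
            elif one_edit_away(b, a):
                pairs.append((b, a))
    return pairs

for shorter, longer in similar_pairs("car ucar nor or caar"):
    print(shorter + "---" + longer)
```

Starting the inner index at i + 1 also addresses the question of where j should begin, and it avoids the duplicate pairs that appear when every word is compared with every other word in both directions.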

If I get this right, the following is a good start:
def letter_remove(from_str, target_str):
    """
    For each letter of from_str - remove it and check if it matches target_str
    """
    for i in range(len(from_str)):
        new_word = from_str[:i] + from_str[i+1:]
        if new_word == target_str:
            print(target_str,"--- ",from_str)
def difference(s):
    list=s.split(" ")
    for word1 in list:
        for word2 in list:
            if word1==word2:
                continue
            letter_remove(word2, word1)
            letter_remove(word1, word2)
s="car ucar nor or caar"
difference(s)
This will give you:
$ python2 ~/tmp/test.py
('car', '--- ', 'ucar')
('car', '--- ', 'caar')
('car', '--- ', 'caar')
('car', '--- ', 'ucar')
('or', '--- ', 'nor')
('or', '--- ', 'nor')
('car', '--- ', 'caar')
('car', '--- ', 'caar')
Observations:
We need to compare word1 to word2 and the reverse since removing a letter from word1 might result in word2
The results need deduplication
A better (maybe) version
We can use sets to ensure that the elements in the set are unique
Instead of printing we add each combination in the set as a tuple
We return all sets and print them at the end
def letter_remove(from_str, target_str):
    """
    For each letter of from_str - remove it and check if it matches target_str
    Returns:
        A set of unique combinations found
    """
    results = set()
    for i in range(len(from_str)):
        new_word = from_str[:i] + from_str[i+1:]
        if new_word == target_str:
            # Sort words
            a, b = target_str, from_str
            results.add((target_str, from_str))
    return results
def difference_set(s):
    list=s.split(" ")
    all_results = set()
    for word1 in list:
        for word2 in list:
            if word1==word2:
                continue
            all_results.update(letter_remove(word2, word1))
            all_results.update(letter_remove(word1, word2))
    return all_results
# This returns a set (unique elements) of the found differences
s="car ucar nor or caar"
sets = difference_set(s)
for s in sets:
    print(s)
The output of the above is
$ python2 ~/tmp/test.py
('or', 'nor')
('car', 'caar')
('car', 'ucar')
Observations:
The above is a very inefficient algorithm since it will create too many strings for all possible letter removal and I would not recommend it for very long inputs. A smarter algorithm could compare each letter in the words and allow skipping one mismatching index
A definitely better approach
Comments inline
def letter_remove2(from_str, target_str):
    """
    Check whether removing one letter from from_str gives target_str
    Returns:
        True: if the two strings can be matched by removing a character from from_str
    """
    skipped_a_letter = False
    i = 0
    j = 0
    # from_str must be exactly one letter longer than target_str; otherwise
    # no single removal can match, and from_str[j] below could run past the end
    if len(from_str) - len(target_str) != 1:
        return False
    # Loop target's letters
    while i < len(target_str):
        if target_str[i] == from_str[j]:
            j += 1
            i += 1
            continue
        # If we have not already skipped a letter from from_str, skip this one
        # by increasing j but not i!
        if not skipped_a_letter:
            j += 1
            # Ensure we have not exceeded the length of from_str
            if len(from_str) <= j:
                return False
            skipped_a_letter = True
            continue
        # If we reach here, it means that character do not match and we have
        # already attempted to skip a letter - no match after all
        return False
    # If we successfully loop, it means that we can match by removing a letter
    return True
def difference_set(s):
    list=s.split(" ")
    all_results = set()
    for word1 in list:
        for word2 in list:
            if word1==word2:
                continue
            if letter_remove2(word2, word1):
                # Keep the target word first in the set since it will always
                # be the shorter one
                all_results.add((word1, word2))
            if letter_remove2(word1, word2):
                all_results.add((word2, word1))
    return all_results
Output:
('or', 'nor')
('car', 'caar')
('car', 'ucar')

The difflib library can help you.
The code below will print all the pairs of elements in the list that differ by one character.
difflib provides an efficient way to find the differences.
By doing a nested iteration over the list you can test each item against every other item.
The list comprehension collects all the differences into a list, which is then counted: if there is only one, the criterion is met and the pair is printed.
import difflib

def Differences(s):
    sl = s.split(" ")
    for t in sl:
        for u in sl:
            difflist = [diff for diff in difflib.ndiff(t,u) if diff[0] != ' ']
            if len(difflist) == 1:
                print ("{}---{}".format(t,u))
s = 'car ucar nor or caar'
Differences(s)
This will give output of:
car---ucar
car---caar
ucar---car
nor---or
or---nor
caar---car

Print elements of string multiple times based on position

I'm trying to print each element individually, which is fine, but also repeat each element based on its position, e.g. "abcd" = A-Bb-Ccc-Dddd etc.
So my problem is making print statements print x times based on their position in the string. I've tried a few combinations using len and range, but I often encounter errors because I'm using strings, not ints.
Should I be using len and range here? I'd prefer if you guys didn't post finished code, just basically how to go about that specific problem (if possible), so I can still figure it out myself.
user_string = input()
def accum(s):
    for letter in s:
        pos = s[0]
        print(letter.title())
        pos = s[0 + 1]
accum(user_string)

You can enumerate iterables (lists, strings, ranges, dictkeys, ...) - it provides the index and a value:
text = "abcdef"
for idx,c in enumerate(text):
    print(idx,c)
Output:
0 a
1 b
2 c
3 d
4 e
5 f
You can use that to print something multiple times. The print command takes 2 optional parameters :
print("Bla","blubb", sep=" --->", end=" Kawumm\n")
Output:
Bla --->blubb Kawumm
that specify what is printed between outputs and on the end of output - you can specify an end="" - so you can continue printing on the same line.
Docs:
print
enumerate
Edit:
user_string = input()
def accum(s):
    t = [] # list to store stuff into
    for count, letter in enumerate(s):
        total = letter.upper() + letter * (count) # 1st as Upper, rest as is
        t.append(total) # add to list
    print(*t, sep="-") # the * "unpacks" the list into its parts
accum(user_string)
Unpacking:
print( [1,2,3,4,5], sep=" +++ ") # its just 1 value to print, no sep needed
print(*[1,2,3,4,5], sep=" +++ ") # 5 values to print, sep needed
Output:
[1, 2, 3, 4, 5]
1 +++ 2 +++ 3 +++ 4 +++ 5

You could try having a counter that will increase by 1 as the loop traverses through the string. Then, within the loop that you currently have, have another for loop to loop the size of the counter. If you want it to print out the first letter capitalized then you will need to account for that along with the dashes.
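The counter-and-inner-loop approach described above might look like the sketch below. It is one possible reading of the suggestion: it builds the pieces in a list and joins them with dashes, and it lowercases the repeated letters, which is an assumption.

```python
def accum(s):
    pieces = []
    counter = 0  # how many times the current letter is repeated after the capital
    for letter in s:
        piece = letter.upper()        # first copy is capitalized
        for _ in range(counter):      # inner loop runs `counter` times
            piece += letter.lower()
        pieces.append(piece)
        counter += 1                  # the next letter gets one more repeat
    return "-".join(pieces)           # the dashes go between the pieces

print(accum("abcd"))  # A-Bb-Ccc-Dddd
```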

index of second repeated character in a string

I am writing a hangman game in Python. To match a character of a word, I am using the index function to get the location of the character.
Ex:
word = 'COMPUTER'
user_input = raw_input('Enter a character :') # say 'T' is given here
if user_input in word:
    print "\nThe Character %c is present in the word \n" %user_input
    word_dict[word.index(user_input)] = user_input
#so the output will look like
{0: '_', 1: '_', 2: '_', 3: '_', 4: '_', 5: 'T', 6: '_', 7: '_'}
Now, my problem comes with repeated characters.
# Another example
>>> 'CARTOON'.index('O')
4
For the second 'O', how do I get its index? Since I have used this 'index' logic, I am looking to continue this way.

As per the str.index docs, signature looks like this
str.index(sub[, start[, end]])
The second parameter is the starting index to search from. So you can pass the index which you got for the first item + 1, to get the next index.
i = 'CARTOON'.index('O')
print 'CARTOON'.index('O', i + 1)
Output
5
The above code can be written like this
data = 'CARTOON'
print data.index('O', data.index('O') + 1)
You can even have this as a utility function, like this
def get_second_index(input_string, sub_string):
    return input_string.index(sub_string, input_string.index(sub_string) + 1)
print get_second_index("CARTOON", "O")
Note: If the string is not found at least twice, this will throw ValueError.
The more generalized way,
def get_index(input_string, sub_string, ordinal):
    # Reject invalid ordinals up front; a for/else here would raise every time,
    # because the else branch runs whenever the loop finishes without a break
    if ordinal < 1:
        raise ValueError("ordinal {} - is invalid".format(ordinal))
    current = -1
    for i in range(ordinal):
        current = input_string.index(sub_string, current + 1)
    return current
print get_index("AAABBBCCCC", "C", 4)

A perhaps more pythonic method would be to use a generator, thus avoiding the intermediate array 'found':
def find_indices_of(char, in_string):
    index = -1
    while True:
        index = in_string.find(char, index + 1)
        if index == -1:
            break
        yield index
for i in find_indices_of('x', 'axccxx'):
    print i
1
4
5
An alternative would be the enumerate built-in
def find_indices_of_via_enumerate(char, in_string):
    return (index for index, c in enumerate(in_string) if char == c)
This also uses a generator.
I then got curious as to perf differences. I'm a year into using python, so I'm only beginning to feel truly knowledgeable. Here's a quick test, with various types of data:
import timeit

test_cases = [
    ('x', ''),
    ('x', 'axxxxxxxxxxxx'),
    ('x', 'abcdefghijklmnopqrstuvw_yz'),
    ('x', 'abcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvwxyz'),
]
for test_case in test_cases:
    print "('{}', '{}')".format(*test_case)
    print "string.find:", timeit.repeat(
        "[i for i in find_indices_of('{}', '{}')]".format(*test_case),
        "from __main__ import find_indices_of",
    )
    print "enumerate :", timeit.repeat(
        "[i for i in find_indices_of_via_enumerate('{}', '{}')]".format(*test_case),
        "from __main__ import find_indices_of_via_enumerate",
    )
    print
Which, on my machine results in these timings:
('x', '')
string.find: [0.6248660087585449, 0.6235580444335938, 0.6264920234680176]
enumerate : [0.9158611297607422, 0.9153609275817871, 0.9118690490722656]
('x', 'axxxxxxxxxxxx')
string.find: [6.01502799987793, 6.077538013458252, 5.997750997543335]
enumerate : [3.595151901245117, 3.5859270095825195, 3.597352981567383]
('x', 'abcdefghijklmnopqrstuvw_yz')
string.find: [0.6462750434875488, 0.6512351036071777, 0.6495819091796875]
enumerate : [2.6581480503082275, 2.6216518878936768, 2.6187551021575928]
('x', 'abcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvwxyz')
string.find: [1.2539417743682861, 1.2511990070343018, 1.2702908515930176]
enumerate : [7.837890863418579, 7.791800022125244, 7.9181809425354]
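The enumerate() approach is more expressive and Pythonic, but in these timings it is slower in every case except the string that is almost all matches. Whether or not the perf differences matter depends on the actual use case.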

You've asked how to find the second occurrence, and gotten an excellent answer for that, generalized for any specific occurrence. What you'll realize you actually want though is all occurrences at once. Here's a method for that:
def find_characters(word, character):
    found = []
    last_index = -1
    while True:
        try:
            last_index = word.index(character, last_index+1)
        except ValueError:
            break
        else:
            found.append(last_index)
    return found

You can use the count method of strings to find the number of occurrences of user_input in the word. Then call str.index(sub, start) once per occurrence, starting each search one position after the previous match so that you do not get the same index each time.
if user_input in word:
    count=word.count(user_input)
    a=word.index(user_input)
    word_dict[a]=user_input  # a is already the index, so use it directly as the key
    for i in range(count-1):
        a=word.index(user_input,a+1)
        word_dict[a]=user_input

This should really be a one-liner if you use filter because if you use index you will be forced to either iterate or use recursion. In this case, there is absolutely no need for either. You can just filter out the values that are relevant to you.
Using filter is easy. An example implementation is the following one-liner:
def f1(w, c):
    return zip(* filter(lambda (x,y): x == c, zip(w, range(len(w))) ))[1]
f1('cartoon', 'o') # --> (4, 5)
You can always add error checking as in:
def f1(w, c) :
    if c not in w: return ()
    else: return zip(* filter(lambda (x,y): x == c, zip(w, range(len(w))) ))[1]
If the character isn't found in the string, you just get an empty tuple. Otherwise, you get all elements that match. If you want something generic, counting on the fact that there will only be one or two instances of a character is not the right way to go about it. For example:
In [18]: f1('supercalifragilisticexpialidocious', 'i')
Out[18]: (8, 13, 15, 18, 23, 26, 30)
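The one-liner above relies on Python 2 features: tuple parameters in a lambda, which were removed in Python 3, and a zip result that can be indexed. A Python 3 sketch of the same filtering idea follows. It swaps in a generator expression over enumerate for filter, and the name find_all is made up for the example.

```python
def find_all(word, char):
    # enumerate pairs each character with its position;
    # keep the positions whose character matches.
    return tuple(i for i, c in enumerate(word) if c == char)

print(find_all('cartoon', 'o'))
print(find_all('supercalifragilisticexpialidocious', 'i'))
```

When the character is absent this returns an empty tuple, so no separate membership check is needed.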

Here is another example.
a="samesame"
po=-1 # start at -1 so that the first search (po+1) begins at index 0
for c in a:
    if c=='s': # for example, 's' is the character I want to find
        po = a.index(c,po+1) # search again from the position after the previous match
        print(po)

Apologies if this answer is not properly formatted or I've messed up somewhere, as I am new here and this is my first post. I have used the following code in my own Hangman games to get the indices of repeated letters in a word, and it has worked great. Hopefully a newcomer will understand this OK.
a = "hangman" #the chosen word
length = len(a) #determines length of chosen word
for i in range(length): #this will loop through the code length number of times
    if a[i] == "n": #n is the players guess. Checks if the letter is at index i
        po = a.index("n", i) # po gets the index of the letter if previous line is true
        print(po) #prints the position/s
Hope this helps someone!

def findcharpos(string, character, position=None):
    array = []
    index = -1
    while True:
        try:
            index = string.index(character, index+1)
        except ValueError:
            break
        else:
            array.append(index)
    if position is None:
        # no position requested: return every index found (may be empty)
        return array
    elif position > len(array):
        raise ValueError(f"The character {character} does not occur {position} times in the {string}")
    else:
        return array[position-1]

msg = 'samesame'
print(msg.index('s', 2)) # the second argument is the start position, not the occurrence number: searching from index 2 skips the first 's' and prints 4
