split a string into a minimal number of unique substrings

Given a string consisting of lowercase letters, I need to split it into a minimal number of substrings such that no letter occurs more than once in any substring.
For example, here are some correct splits of the string "abacdec":
('a', 'bac', 'dec'), ('a', 'bacd', 'ec') and ('ab', 'ac', 'dec').
Given 'dddd', function should return 4. The result can be achieved by splitting the string into four substrings ('d', 'd', 'd', 'd').
Given 'cycle', function should return 2. The result can be achieved by splitting the string into two substrings ('cy', 'cle') or ('c', 'ycle').
Given 'abba', function should return 2 (I believe it should be 1 - the mistake as originally stated). The result can be achieved by splitting the string into two substrings ('ab', 'ba')
Here is the code I've written. I feel that it is too complicated, and I'm not sure whether it is efficient in terms of time complexity.
I would be glad to have suggestions for a shorter and simpler one. Thanks!
#!/usr/bin/python
from collections import Counter

def min_distinct_substrings(string):
    # get the longest and unique substring
    def get_max_unique_substr(s):
        def is_unique(substr):
            return all(m == 1 for m in Counter(substr).values())
        max_sub = 0
        for i in range(len(s), 0, -1):
            for j in range(0, i):
                if len(s[j:i]):
                    if is_unique(s[j:i]):
                        substr_len = len(s[j:i])
                    else:
                        substr_len = 0
                    max_sub = max(max_sub, substr_len)
        return max_sub

    max_unique_sub_len = get_max_unique_substr(string)
    out = []
    str_prefx = []
    # get all valid prefix - 'a', 'ab' are valid - 'aba' not valid since 'a' is not unique
    for j in range(len(string)):
        if all(m==1 for m in Counter(string[:j + 1]).values()):
            str_prefx.append(string[:j + 1])
        else:
            break
    # consider only valid prefix
    for k in str_prefx:
        # get permutation substrings; loop starts from longest substring to substring of 2
        for w in range(max_unique_sub_len, 1, -1):
            word = ''
            words = [k] # first substring - the prefix - is added
            # go over the rest of the string - start from position after prefix
            for i in range(len(k), len(string)):
                # if letter already seen - the substring will be added to words
                if string[i] in word or len(word) >= w:
                    words.append(word)
                    word = ''
                # if not seen and not last letter - letter is added to word
                word += string[i]
            words.append(word)
            if words not in out: # to avoid duplicated words' list
                out.append(words)
    min_list = min(len(i) for i in out) # get the minimum lists
    # filter the minimum lists (and convert to tuple for printing purposes)
    out = tuple( (*i, ) for i in out if len(i) <= min_list )
    return out

The greedy algorithm described by Tarik works well to efficiently get the value of the minimal number of substrings. If you want to find all of the valid splits you have to check them all though:
import itertools

def min_unique_substrings(w):
    def all_substrings_are_unique(ss):
        return all(len(set(s)) == len(s) for s in ss)

    # check if input is already unique
    if all_substrings_are_unique([w]):
        return [[w]]
    # divide the input string into parts, starting with the fewest divisions
    # (the range must reach len(w)-1 parts, otherwise inputs such as 'aabb',
    # whose minimum is 3 parts, fall through to the worst case below)
    for divisions in range(2, len(w)):
        splits = []
        for delim in itertools.combinations(range(1, len(w)), divisions-1):
            delim = [0, *delim, len(w)]
            substrings = [w[delim[i]:delim[i+1]] for i in range(len(delim)-1)]
            splits.append(substrings)
        # check if there are any valid unique substring splits
        filtered = list(filter(all_substrings_are_unique, splits))
        if len(filtered):
            # if there are any results they must be divided into the
            # fewest number of substrings and we can stop looking
            return filtered
    # not found; worst case of one character per substring
    return [list(w)]
> print(min_unique_substrings('abacdec'))
[['a', 'bac', 'dec'], ['a', 'bacd', 'ec'], ['a', 'bacde', 'c'], ['ab', 'ac', 'dec'], ['ab', 'acd', 'ec'], ['ab', 'acde', 'c']]
> print(min_unique_substrings('cycle'))
[['c', 'ycle'], ['cy', 'cle']]
> print(min_unique_substrings('dddd'))
[['d', 'd', 'd', 'd']]
> print(min_unique_substrings('abba'))
[['ab', 'ba']]
> print(min_unique_substrings('xyz'))
[['xyz']]

Ok, thought about it again. Using a greedy algorithm should do. Loop through the letters in order. Accumulate a substring as long as no letter is repeated. Once a duplicate letter is found, spit out the substring and start with another substring until all letters are exhausted.
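The greedy idea described above could be sketched roughly as follows. This is only an illustration, not code from the answer; the function name greedy_split is an assumption made for this example. It returns one minimal split, so the length of the result is the minimal number of substrings.

```python
def greedy_split(s):
    """Split s greedily so that no substring repeats a letter (hypothetical helper)."""
    parts = []
    current = ""
    for ch in s:
        if ch in current:
            # duplicate letter: close the current substring and start a new one
            parts.append(current)
            current = ch
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


print(greedy_split("abacdec"))    # ['ab', 'acde', 'c']
print(len(greedy_split("dddd")))  # 4
```

Each letter is looked at once, and the membership test runs against a substring of at most 26 distinct lowercase letters, so the running time grows linearly with the length of the input.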

Related

Error "index out of range" when working with strings in a for loop in python

I'm very new to python and I'm practicing different exercises.
I need to write a program to decode a string. The original string has been modified by adding, after each vowel (letters ’a’, ’e’, ’i’, ’o’ and ’u’), the letter ’p’ and then that same vowel again.
For example, the word “kemija” becomes “kepemipijapa” and the word “paprika” becomes “papapripikapa”.
vowel = ['a', 'e', 'i', 'o', 'u']
input_word = list(input())
for i in range(len(input_word)):
    if input_word[i] in vowel:
        input_word.pop(i + 1)
        input_word.pop(i + 2)
print(input_word)
The algorithm I had in mind was to detect each index at which the item is a vowel and then remove the 2 items following it, so if input_word[0] == 'e' then the next 2 items (input_word[1], input_word[2]) must be removed from the list. For the sample input zepelepenapa, I get this error message: IndexError: pop index out of range. Even when I change the for loop to range(len(input_word) - 2), I get the same error.
thanks in advance

The loop will run a number of times equal to the original length of input_word, due to range(len(input_word)). An IndexError will occur if input_word is shortened inside the loop, because the code inside the loop tries to access every element in the original list input_word with the expression input_word[i] (and, for some values of input_word, the if block could even attempt to pop items off the list beyond its original length, due to the (i + 1) and (i + 2)).
Hardcoding the loop definition with a specific number like 2, e.g. with range(len(input_word) - 2), to make it run fewer times to account for removed letters isn't a general solution, because the number of letters to be removed is initially unknown (it could be 0, 2, 4, ...).
Here are a couple of possible solutions:
Instead of removing items from input_word, create a new list output_word and add letters to it if they meet the criteria. Use a helper list skip_these_indices to keep track of indices that should be "removed" from input_word so they can be skipped when building up the new list output_word:
vowel = ['a', 'e', 'i', 'o', 'u']
input_word = list("zepelepenapa")
output_word = []
skip_these_indices = []
for i in range(len(input_word)):
    # if letter 'i' shouldn't be skipped, add it to output_word
    if i not in skip_these_indices:
        output_word.append(input_word[i])
        # check whether to skip the next two letters after 'i'
        if input_word[i] in vowel:
            skip_these_indices.append(i + 1)
            skip_these_indices.append(i + 2)
print(skip_these_indices) # [2, 3, 6, 7, 10, 11]
print(output_word) # ['z', 'e', 'l', 'e', 'n', 'a']
print(''.join(output_word)) # zelena
Alternatively, use two loops. The first loop will keep track of which letters should be removed in a list called remove_these_indices. The second loop will remove them from input_word:
vowel = ['a', 'e', 'i', 'o', 'u']
input_word = list("zepelepenapa")
remove_these_indices = []
# loop 1 -- find letters to remove
for i in range(len(input_word)):
    # if letter 'i' isn't already marked for removal,
    # check whether we should remove the next two letters
    if i not in remove_these_indices:
        if input_word[i] in vowel:
            remove_these_indices.append(i + 1)
            remove_these_indices.append(i + 2)
# loop 2 -- remove the letters (pop in reverse to avoid IndexError)
for i in reversed(remove_these_indices):
    # if input_word has a vowel in the last two positions,
    # without a "p" and the same vowel after it,
    # which it shouldn't based on the algorithm you
    # described for generating the coded word,
    # this 'if' statement will avoid popping
    # elements that don't exist
    if i < len(input_word):
        input_word.pop(i)
print(remove_these_indices) # [2, 3, 6, 7, 10, 11]
print(input_word) # ['z', 'e', 'l', 'e', 'n', 'a']
print(''.join(input_word)) # zelena

pop() removes an item at the given position in the list and returns it. This alters the list in place.
For example if I have:
my_list = [1,2,3,4]
n = my_list.pop()
will return n = 4 in this instance. If I was to print my_list after this operation it would return [1,2,3]. So the length of the list will change every time pop() is used. That is why you are getting IndexError: pop index out of range.
So to solve this we should avoid using pop(), since it's really not needed in this situation: build a new list instead of changing the one you are looping over. The following example applies the encoding rule to 'kemija' in that way:
word = 'kemija'
vowels = ['a', 'e', 'i', 'o', 'u']
new_word = []
for w in word:
    if w in vowels:
        new_word.extend([w,'p',w])
        # alternatively you could use .append() over .extend() but would need more lines:
        # new_word.append(w)
        # new_word.append('p')
        # new_word.append(w)
    else:
        new_word.append(w)
encoded_word = ''.join(new_word)
print(encoded_word) # kepemipijapa
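For the decoding direction asked about in the question, the same "build a new list" idea could be sketched as below. This is an illustrative sketch only; the function name decode and the constant VOWELS are assumptions, and it presumes the input was produced by the rule "vowel, then 'p', then the same vowel".

```python
VOWELS = set("aeiou")


def decode(encoded):
    """Undo the vowel + 'p' + vowel encoding (hypothetical helper)."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        out.append(ch)
        # after a vowel, skip the inserted 'p' + vowel pair if it is really there
        if ch in VOWELS and encoded[i + 1:i + 3] == "p" + ch:
            i += 3
        else:
            i += 1
    return "".join(out)


print(decode("kepemipijapa"))  # kemija
print(decode("zepelepenapa"))  # zelena
```

Because the index is advanced manually and the input is never modified, no position beyond the end of the string is ever accessed.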

How do I check if a string only has letters from a list of letters in Python

word represents the string I am checking, and letters is a list of random letters. I need to make sure that the word only contains letters from the list. However, if there are repeated letters in the word, the list must contain at least that many copies of the letter. If the check returns True, the letters used in the word must be removed from the list. I am really struggling with this one.
example: w.wordcheck('feed') -> False
letters = ['n', 'e', 'f', 'g', 'e', 'a', 'z']
w.wordcheck('gag') -> False
w.wordcheck('gene') -> True
w.wordcheck('gene') -> True
print(letters) -> ['f', 'a', 'z']
letters = []
def wordcheck(self, word):
    for char in word:
        if char not in self.letters:
            return False
        else:
            return True

One way using collections.Counter:
from collections import Counter
letters = ['n', 'e', 'f', 'g', 'e', 'a', 'z']
cnt = Counter(letters)
def wordcheck(word):
    return all(cnt[k] - v >= 0 for k, v in Counter(word).items())
Output:
wordcheck("gag")
# False
wordcheck("gene")
# True
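The question also asks for the used letters to be removed from the list when the check succeeds. A minimal sketch of that, building on the Counter comparison, might look like this; the class name WordChecker and its constructor are assumptions made for the example, not part of the original code.

```python
from collections import Counter


class WordChecker:
    """Hypothetical wrapper that owns the list of available letters."""

    def __init__(self, letters):
        self.letters = list(letters)

    def wordcheck(self, word):
        needed = Counter(word)
        available = Counter(self.letters)
        # fail if any letter is needed more often than it is available
        if any(available[ch] < n for ch, n in needed.items()):
            return False
        # success: consume one copy of each letter used by the word
        for ch in word:
            self.letters.remove(ch)
        return True


w = WordChecker(['n', 'e', 'f', 'g', 'e', 'a', 'z'])
print(w.wordcheck('feed'))  # False
print(w.wordcheck('gag'))   # False
print(w.wordcheck('gene'))  # True
print(w.letters)            # ['f', 'a', 'z']
```

The list is only modified after the whole word has passed the check, so a failed call leaves the letters untouched.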

You can solve this problem by looking for the cases where the word should result in False.
These cases are: a character of the word is not in letters, or a character occurs more often in the word than it does in letters.
If either condition is met, return False; otherwise return True.
from collections import Counter
letters = ['n', 'e', 'f', 'g', 'e', 'a', 'z']
letters_count = Counter(letters)
def func(word):
    word_count = Counter(word)
    check = True
    # 'char' is used as the loop variable so it does not shadow the 'word' parameter
    for char in word_count:
        if char not in letters_count or word_count.get(char) > letters_count.get(char):
            check = False
            break
    return check
l = ['feed', 'gag', 'gene']
for i in l:
    print(func(i))
output
False
False
True

There are already better answers, but I felt like adding a novel solution just for the heck of it:
from itertools import groupby
def chunks(s):
    return set("".join(g) for _, g in groupby(sorted(s)))
def wordcheck(word, valid_letters):
    return chunks(word) <= chunks(valid_letters)
Steps:
Turn word into a set of chunks, e.g.: "gag" -> {"a", "gg"}
Turn valid_letters into a set of chunks
Check if word is a subset of valid_letters
Limitations:
This is a mostly silly implementation
This will only return True if the exact number of repeated letters is present in valid_letters, e.g.: if valid_letters = "ccc" and word = "cc" this will return False because there are too few c's in word
It's really inefficient

Radix Sort for Strings in Python

My radix sort function outputs sorted but wrong list when compared to Python's sort:
My radix sort: ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
Python's sort: ['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
* My radix sort does not do padding
* Its mechanism is least significant digit (LSD) first
* I need to utilise the length of each word
The following is my code.
def count_sort_letters(array, size, col, base):
    output = [0] * size
    count = [0] * base
    min_base = ord('a')
    for item in array:
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        count[letter] += 1
    for i in range(base - 1):
        count[i + 1] += count[i]
    for i in range(size - 1, -1, -1):
        item = array[i]
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        output[count[letter] - 1] = item
        count[letter] -= 1
    return output
def radix_sort_letters(array):
    size = len(array)
    max_col = len(max(array, key = len))
    for col in range(max_col):
        array = count_sort_letters(array, size, col, 26)
    return array
Can anyone find a way to solve this problem?

As I mentioned in my comments:
In your code the lines:
correct_index = min(len(item) - 1, col)
letter = ord(item[-(correct_index + 1)]) - min_base
Always uses the first letter of the word once col is greater than the word length. This
causes shorter words to be sorted based upon their first letter once
col is greater than the word length. For instance ['aa', 'a'] remains
unchanged since on the for col loop we compare the 'a' in both words,
which keeps the results unchanged.
Code Correction
Note: Attempted to minimize changes to your original code
def count_sort_letters(array, size, col, base, max_len):
    """ Helper routine for performing a count sort based upon column col """
    output = [0] * size
    count = [0] * (base + 1) # One additional cell to account for dummy letter
    min_base = ord('a') - 1 # subtract one to allow for dummy character
    for item in array: # generate Counts
        # get column letter if within string, else use dummy position of 0
        letter = ord(item[col]) - min_base if col < len(item) else 0
        count[letter] += 1
    for i in range(len(count)-1): # Accumulate counts
        count[i + 1] += count[i]
    for item in reversed(array):
        # Get index of current letter of item at index col in count array
        letter = ord(item[col]) - min_base if col < len(item) else 0
        output[count[letter] - 1] = item
        count[letter] -= 1
    return output
def radix_sort_letters(array, max_col = None):
    """ Main sorting routine """
    if not max_col:
        max_col = len(max(array, key = len)) # edit to max length
    for col in range(max_col-1, -1, -1): # max_len-1, max_len-2, ...0
        array = count_sort_letters(array, len(array), col, 26, max_col)
    return array
lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))
Test
lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))
# Compare to Python sort
print(radix_sort_letters(lst)==sorted(lst))
Output
['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
True
Explanation
Counting Sort is a stable sort, meaning that items with equal keys keep their relative order from the input.
Let's walk through an example of how the function works.
Let's sort: ['ac', 'xb', 'ab']
We walk through each character of each list in reverse order.
Iteration 0:
Key is last character in list (i.e. index -1):
keys are ['c', 'b', 'b'] (last characters of 'ac', 'xb', and 'ab')
Performing a counting sort on these keys we get ['b', 'b', 'c']
This causes the corresponding words for these keys to be placed in
the order: ['xb', 'ab', 'ac']
Entries 'xb' and 'ab' have equal keys (value 'b') so they maintain their
order of 'xb' followed by 'ab' of the original list
(since counting sort is a stable sort)
Iteration 1:
Key is next to last character (i.e. index -2):
Keys are ['x', 'a', 'a'] (corresponding to list ['xb', 'ab', 'ac'])
Counting Sort produces the order ['a', 'a', 'x']
which causes the corresponding words to be placed in the order
['ab', 'ac', 'xb'] and we are done.
Original Software Error--your code originally went left to right through the strings rather than right to left. We need to go right to left since we want our last sort to be based upon the first character, the next to last sort to be based upon the 2nd character, etc.
Different Length Strings--the example above was simplified by assuming equal-length strings. Now let's try unequal-length strings such as:
['ac', 'a', 'ab']
This immediately presents a problem: since the words don't have equal lengths, we can't pick a letter from every word at each position.
We can fix by padding each word with a dummy character such as '*' to get:
['ac', 'a*', 'ab']
Iteration 0: keys are last character in each word, so: ['c', '*', 'b']
The understanding is that the dummy character is less than all other
characters, so the sort order will be:
['*', 'b', 'c'] causing the related words to be sorted in the order
['a*', 'ab', 'ac']
Iteration 1: keys are next to last character in each word so: ['a', 'a', 'a']
Since the keys are all equal counting sort won't change the order so we keep
['a*', 'ab', 'ac']
Removing the dummy character from each string (if any) we end up with:
['a', 'ab', 'ac']
The idea behind the col < len(item) check in count_sort_letters is to mimic the behavior of padding strings without
actual padding (i.e. padding is extra work). Thus, based upon the column index,
it evaluates whether the index points to the padded or unpadded portion of the string
and returns an appropriate index into the counting array for counting.
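To make the "dummy character sorts first" idea concrete, here is a small sketch that builds the padded keys explicitly and checks that ordering by them agrees with Python's own string ordering. The helper name padded_key is an assumption for illustration; the radix sort above never materializes these keys, it only looks up one column at a time.

```python
def padded_key(word, width):
    """Map a lowercase word to per-column bucket numbers (hypothetical helper).

    0 stands for the dummy pad character, 1..26 for 'a'..'z',
    matching the bucket layout of the counting array described above.
    """
    codes = [ord(c) - ord('a') + 1 for c in word]
    return codes + [0] * (width - len(word))


words = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
width = max(len(w) for w in words)

print(padded_key('a', 2))  # [1, 0]
print(sorted(words, key=lambda w: padded_key(w, width)) == sorted(words))  # True
```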

return a list that contains the last vowel phoneme and subsequent consonant phoneme(s)

So I was working on this question:
A vowel phoneme is a phoneme whose last character is 0, 1, or 2. As examples, the word BEFORE (B IH0 F AO1 R) contains two vowel phonemes and the word GAP (G AE1 P) has one.
The parameter represents a list of phonemes. The function is to return a list that contains the last vowel phoneme and subsequent consonant phoneme(s) in the given list of phonemes. The empty list is to be returned if the list of phonemes does not contain a vowel phoneme.

def last_phonemes(phoneme_list):
    """ (list of str) -> list of str
    Return the last vowel phoneme and subsequent consonant phoneme(s) in
    phoneme_list.
    >>> last_phonemes(['AE1', 'B', 'S', 'IH0', 'N', 'TH'])
    ['IH0', 'N', 'TH']
    >>> last_phonemes(['IH0', 'N'])
    ['IH0', 'N']
    >>> last_phonemes(['B', 'S'])
    []
    """
    for i, phoneme in reversed(list(enumerate(phoneme_list))):
        if phoneme[-1] in '012':
            return phoneme_list[i:]
    return []
EDIT explanation
You want to iterate over the phoneme_list in reversed order. When you find the first item whose last character is '0', '1', or '2' (that is, the last vowel phoneme), you slice the original list from that position (you got the slice part right in your code). You need the index to do the slicing, so you enumerate the phoneme_list before reversing.

Finding consecutive consonants in a word

I need code that will show me the consecutive consonants in a word. For example, for "concertation" I need to obtain ["c","nc","rt","t","n"].
Here is my code:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes += x + ''
    return consonnes
I manage to find the consonants, but I don't see how to find them consecutively. Can anybody tell me what I need to do?

You can use regular expressions, implemented in the re module.
Better solution
>>> import re
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation", re.IGNORECASE)
['c', 'nc', 'rt', 't', 'n']
[bcdfghjklmnpqrstvwxyz]+ matches any sequence of one or more characters from the character class.
re.IGNORECASE enables a case-insensitive match on the characters. That is:
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "CONCERTATION", re.IGNORECASE)
['C', 'NC', 'RT', 'T', 'N']
Another Solution
>>> import re
>>> re.findall(r'[^aeiou]+', "concertation",)
['c', 'nc', 'rt', 't', 'n']
[^aeiou] Negated character class. Matches any character other than the ones in this character class. In short, it matches the consonants in the string.
+ quantifier: matches one or more occurrences of the pattern in the string.
Note: This will also find non-alphabetic adjacent characters, since the character class is anything other than vowels.
Example
>>> re.findall(r'[^aeiou]+', "123concertation",)
['123c', 'nc', 'rt', 't', 'n']
If you are sure that the input always contains only alphabetic characters, this solution is ok.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found.
If you are curious about how the result is obtained for
re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation")
concertation
|
c
concertation
 |
 # o is not present in the character class. Matching ends here. Adds match, 'c' to output list
concertation
  |
  n
concertation
   |
   c
concertation
    |
    # Match ends again. Adds match 'nc' to list
# And so on

You could do this with regular expressions and the re module's split function:
>>> import re
>>> re.split(r"[aeiou]+", "concertation", flags=re.I)
['c', 'nc', 'rt', 't', 'n']
This method splits the string whenever one or more consecutive vowels are matched.
To explain the regular expression "[aeiou]+": here the vowels have been collected into a class [aeiou] while the + indicates that one or more occurrence of any character in this class can be matched. Hence the string "concertation" is split at o, e, a and io.
The re.I flag means that the case of the letters will be ignored, effectively making the character class equal to [aAeEiIoOuU].
Edit: One thing to keep in mind is that this method implicitly assumes that the word contains only vowels and consonants. Numbers and punctuation will be treated as non-vowels/consonants. To match only consecutive consonants, instead use re.findall with the consonants listed in the character class (as noted in other answers).
One useful shortcut to typing out all the consonants is to use the third-party regex module instead of re.
This module supports set operations, so the character class containing the consonants can be neatly written as the entire alphabet minus the vowels:
[[a-z]--[aeiou]] # equal to [bcdfghjklmnpqrstvwxyz]
Where [a-z] is the entire alphabet, -- is set difference and [aeiou] are the vowels.
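If you would rather stay with the standard library, the same "alphabet minus vowels" class can be built with ordinary string operations and handed to re. This is a sketch under that assumption; the names VOWELS, CONSONANTS, CONSONANT_RUN and consonant_runs are made up for this example.

```python
import re
import string

VOWELS = "aeiou"
# the lowercase alphabet with the vowels taken out: 'bcdfghjklmnpqrstvwxyz'
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)
# one or more consecutive consonants, matched case-insensitively
CONSONANT_RUN = re.compile("[" + CONSONANTS + "]+", re.IGNORECASE)


def consonant_runs(word):
    """Return the runs of consecutive consonants in word (hypothetical helper)."""
    return CONSONANT_RUN.findall(word)


print(consonant_runs("concertation"))     # ['c', 'nc', 'rt', 't', 'n']
print(consonant_runs("123concertation"))  # ['c', 'nc', 'rt', 't', 'n']
```

Because only consonants are listed in the class, digits and punctuation are not picked up, unlike with the vowel-based split.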

If you are up for a non-regex solution, itertools.groupby would work perfectly fine here, like this
>>> from itertools import groupby
>>> is_vowel = lambda char: char in "aAeEiIoOuU"
>>> def suiteConsonnes(in_str):
...     return ["".join(g) for v, g in groupby(in_str, key=is_vowel) if not v]
...
>>> suiteConsonnes("concertation")
['c', 'nc', 'rt', 't', 'n']

A really, really simple solution without importing anything is to replace the vowels with a single thing, then split on that thing:
def SuiteConsonnes(mot):
    consonnes = ''.join([l if l not in "aeiou" else "0" for l in mot])
    # compare with != rather than 'is not': identity checks against a literal are unreliable
    return [c for c in consonnes.split("0") if c != '']
To keep it really similar to your code - and to turn it into a generator - we get this:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes.append(x)
        elif consonnes:
            yield ''.join(consonnes)
            consonnes = []
    if consonnes: yield ''.join(consonnes)

def SuiteConsonnes(mot):
    consonnes=[]
    consecutive = '' # initialize consecutive string of consonants
    for x in mot:
        if x in "aeiou": # checks if x is not a consonant
            if consecutive: # checks if consecutive string is not empty
                consonnes.append(consecutive) # append consecutive string to consonnes
                consecutive = '' # reinitialize consecutive for another consecutive string of consonants
        else:
            consecutive += x # add x to consecutive string if x is a consonant or not a vowel
    if consecutive: # checks if consecutive string is not empty
        consonnes.append(consecutive) # append last consecutive string of consonants
    return consonnes
SuiteConsonnes('concertation')
#['c', 'nc', 'rt', 't', 'n']

Not that I'd recommend it for readability, but a one-line solution is:
In [250]: q = "concertation"
In [251]: [s for s in ''.join([l if l not in 'aeiou' else ' ' for l in q]).split()]
Out[251]: ['c', 'nc', 'rt', 't', 'n']
That is: replace each vowel with a space, join the characters back into one string, and split it on whitespace.

Use regular expressions from the built-in re module:
import re
def find_consonants(string):
    # find all non-vowels occurring 1 or more times:
    return re.findall(r'[^aeiou]+', string)

Although I think you should go with @nu11p01n73R's answer, this will also work:
re.sub('[AaEeIiOoUu]+',' ','concertation').split()

We Keep Coding
