Radix Sort for Strings in Python - python

My radix sort function outputs sorted but wrong list when compared to Python's sort:
My radix sort: ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
Python's sort: ['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
* My radix sort does not do padding
* Its mechanism is least significant digit (LSD) first
* I need to utilise the length of each word
The following is my code.
def count_sort_letters(array, size, col, base):
    output = [0] * size
    count = [0] * base
    min_base = ord('a')
    for item in array:
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        count[letter] += 1
    for i in range(base - 1):
        count[i + 1] += count[i]
    for i in range(size - 1, -1, -1):
        item = array[i]
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        output[count[letter] - 1] = item
        count[letter] -= 1
    return output

def radix_sort_letters(array):
    size = len(array)
    max_col = len(max(array, key = len))
    for col in range(max_col):
        array = count_sort_letters(array, size, col, 26)
    return array
Can anyone find a way to solve this problem?

As I mentioned in my comments:
In your code the lines:
correct_index = min(len(item) - 1, col)
letter = ord(item[-(correct_index + 1)]) - min_base
always use the first letter of the word once col reaches the word's length. This
causes shorter words to be sorted by their first letter on those passes.
For instance, ['aa', 'a'] remains unchanged because every pass of the
for col loop compares an 'a' from both words, which keeps the order unchanged.
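A quick way to see the misalignment is to print the letter that the original indexing picks on each pass. This is only a sketch: original_key is a hypothetical helper that copies the two lines quoted above.

```python
def original_key(item, col):
    # Same indexing as the question's code: count from the right-hand end,
    # clamped so that short words fall back to their first letter.
    correct_index = min(len(item) - 1, col)
    return item[-(correct_index + 1)]

for word in ['abid', 'abs']:
    print(word, [original_key(word, col) for col in range(4)])
# abid ['d', 'i', 'b', 'a']
# abs ['s', 'b', 'a', 'a']
```

The columns are aligned on the right-hand end of each word, so on the pass where 'abid' is compared by its second letter, 'abs' is already being compared by its first.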
Code Correction
Note: Attempted to minimize changes to your original code
def count_sort_letters(array, size, col, base, max_len):
    """ Helper routine for performing a count sort based upon column col """
    output = [0] * size
    count = [0] * (base + 1)   # One additional cell to account for the dummy letter
    min_base = ord('a') - 1    # subtract one to allow for the dummy character
    for item in array:         # generate counts
        # get column letter if within string, else use dummy position of 0
        letter = ord(item[col]) - min_base if col < len(item) else 0
        count[letter] += 1
    for i in range(len(count)-1):   # Accumulate counts
        count[i + 1] += count[i]
    for item in reversed(array):
        # Get index of current letter of item at index col in count array
        letter = ord(item[col]) - min_base if col < len(item) else 0
        output[count[letter] - 1] = item
        count[letter] -= 1
    return output

def radix_sort_letters(array, max_col = None):
    """ Main sorting routine """
    if not max_col:
        max_col = len(max(array, key = len))   # edit to max length
    for col in range(max_col-1, -1, -1):   # max_len-1, max_len-2, ...0
        array = count_sort_letters(array, len(array), col, 26, max_col)
    return array
Test
lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))
# Compare to Python sort
print(radix_sort_letters(lst)==sorted(lst))
Output
['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
True
Explanation
Counting Sort is a stable sort, meaning that items with equal keys keep their original relative order.
Let's walk through an example of how the function works.
Let's sort: ['ac', 'xb', 'ab']
We walk through the characters of each string in reverse order.
Iteration 0:
Key is the last character of each string (i.e. index -1):
keys are ['c', 'b', 'b'] (last characters of 'ac', 'xb', and 'ab')
Performing a counting sort on these keys we get ['b', 'b', 'c']
This causes the corresponding words for these keys to be placed in
the order: ['xb', 'ab', 'ac']
Entries 'xb' and 'ab' have equal keys (value 'b') so they maintain their
order of 'xb' followed by 'ab' of the original list
(since counting sort is a stable sort)
Iteration 1:
Key is next to last character (i.e. index -2):
Keys are ['x', 'a', 'a'] (corresponding to list ['xb', 'ab', 'ac'])
Counting Sort produces the order ['a', 'a', 'x']
which causes the corresponding words to be placed in the order
['ab', 'ac', 'xb'] and we are done.
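The two iterations above can be reproduced with a short trace. This is a sketch under assumptions: lsd_passes is a hypothetical helper, it assumes equal-length words, and Python's stable sorted() stands in for the counting sort.

```python
def lsd_passes(words):
    """Return the list after each pass, last column first (equal-length words assumed)."""
    passes = []
    for col in range(len(words[0]) - 1, -1, -1):
        # sorted() is stable, so ties keep the order from the previous pass
        words = sorted(words, key=lambda w: w[col])
        passes.append(words)
    return passes

print(lsd_passes(['ac', 'xb', 'ab']))
# [['xb', 'ab', 'ac'], ['ab', 'ac', 'xb']]
```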
Original software error: your code originally went left to right through the strings rather than right to left. We need to go right to left because the last pass must be based on the first character, the next-to-last pass on the second character, and so on.
Different-length strings: the example above was simplified by assuming equal-length strings. Now let's try strings of unequal length, such as:
['ac', 'a', 'ab']
This immediately presents a problem: since the words don't have equal lengths, we can't pick a letter from every word on every pass.
We can fix by padding each word with a dummy character such as '*' to get:
['ac', 'a*', 'ab']
Iteration 0: keys are last character in each word, so: ['c', '*', 'b']
The understanding is that the dummy character is less than all other
characters, so the sort order will be:
['*', 'b', 'c'] causing the related words to be sorted in the order
['a*', 'ab', 'ac']
Iteration 1: keys are next to last character in each word so: ['a', 'a', 'a']
Since the keys are all equal counting sort won't change the order so we keep
['a*', 'ab', 'ac']
Removing the dummy character from each string (if any) we end up with:
['a', 'ab', 'ac']
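The padding walkthrough above can be sketched with real padding. The assumptions here: radix_sort_padded is a hypothetical name, words contain only lowercase letters, '*' sorts below every letter, and the stable built-in sorted() stands in for the counting sort.

```python
def radix_sort_padded(words, pad='*'):
    """LSD radix sort that physically pads each word to the maximum length."""
    if not words:
        return []
    width = max(len(w) for w in words)
    padded = [w.ljust(width, pad) for w in words]
    for col in range(width - 1, -1, -1):
        # stable sort on a single column, last column first
        padded = sorted(padded, key=lambda w: w[col])
    # remove the dummy characters again
    return [w.rstrip(pad) for w in padded]

print(radix_sort_padded(['ac', 'a', 'ab']))
# ['a', 'ab', 'ac']
```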
The idea behind the letter lookup in count_sort_letters (the expression ending in
"if col < len(item) else 0") is to mimic the behavior of padding strings without
actually padding them (padding is extra work). Based on col, it checks whether the index
points into the real or the padded portion of the string and returns the appropriate
index into the counting array.
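The "padding without padding" lookup can be isolated in a small helper. column_key is a hypothetical name used only for this sketch; it mirrors the conditional expression in count_sort_letters and assumes lowercase letters.

```python
def column_key(item, col):
    """Bucket index for item at column col; bucket 0 is the dummy (padding) bucket."""
    if col < len(item):
        return ord(item[col]) - ord('a') + 1   # 'a' -> 1, ..., 'z' -> 26
    return 0                                   # past the end: behaves like padding

print([column_key(w, 1) for w in ['ac', 'a', 'ab']])
# [3, 0, 2]
```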

Related

split a string into a minimal number of unique substrings

Given a string consisting of lowercase letters.
Need to split this string into a minimal number of substrings in such a way that no letter occurs more than once in each substring.
For example, here are some correct splits of the string "abacdec":
('a', 'bac', 'dec'), ('a', 'bacd', 'ec') and ('ab', 'ac', 'dec').
Given 'dddd', function should return 4. The result can be achieved by splitting the string into four substrings ('d', 'd', 'd', 'd').
Given 'cycle', function should return 2. The result can be achieved by splitting the string into two substrings ('cy', 'cle') or ('c', 'ycle').
Given 'abba', function should return 2 (I believe it should be 1 - the mistake as originally stated). The result can be achieved by splitting the string into two substrings ('ab', 'ba')
Here is the code I've written. I feel that it is too complicated, and I'm also not sure whether it is efficient in terms of time complexity.
I would be glad to have suggestions of a shorter and simpler one. Thanks!
#!/usr/bin/python
from collections import Counter

def min_distinct_substrings(string):
    # get the longest and unique substring
    def get_max_unique_substr(s):
        def is_unique(substr):
            return all(m == 1 for m in Counter(substr).values())
        max_sub = 0
        for i in range(len(s), 0, -1):
            for j in range(0, i):
                if len(s[j:i]):
                    if is_unique(s[j:i]):
                        substr_len = len(s[j:i])
                    else:
                        substr_len = 0
                    max_sub = max(max_sub, substr_len)
        return max_sub

    max_unique_sub_len = get_max_unique_substr(string)
    out = []
    str_prefx = []
    # get all valid prefix - 'a', 'ab' are valid - 'aba' not valid since 'a' is not unique
    for j in range(len(string)):
        if all(m==1 for m in Counter(string[:j + 1]).values()):
            str_prefx.append(string[:j + 1])
        else:
            break
    # consider only valid prefix
    for k in str_prefx:
        # get permutation substrings; loop starts from longest substring to substring of 2
        for w in range(max_unique_sub_len, 1, -1):
            word = ''
            words = [k]  # first substring - the prefix - is added
            # go over the rest of the string - start from position after prefix
            for i in range(len(k), len(string)):
                # if letter already seen - the substring will be added to words
                if string[i] in word or len(word) >= w:
                    words.append(word)
                    word = ''
                # if not seen and not last letter - letter is added to word
                word += string[i]
            words.append(word)
            if words not in out:  # to avoid duplicated words' list
                out.append(words)
    min_list = min(len(i) for i in out)  # get the minimum lists
    # filter the minimum lists (and convert to tuple for printing purposes)
    out = tuple( (*i, ) for i in out if len(i) <= min_list )
    return out
The greedy algorithm described by Tarik works well to efficiently get the value of the minimal number of substrings. If you want to find all of the valid splits you have to check them all though:
import itertools

def min_unique_substrings(w):
    def all_substrings_are_unique(ss):
        return all(len(set(s)) == len(s) for s in ss)
    # check if input is already unique
    if all_substrings_are_unique([w]):
        return [[w]]
    # divide the input string into parts, starting with the fewest divisions
    # (up to len(w) - 1 parts; len(w) parts is the fallback below)
    for divisions in range(2, len(w)):
        splits = []
        for delim in itertools.combinations(range(1, len(w)), divisions-1):
            delim = [0, *delim, len(w)]
            substrings = [w[delim[i]:delim[i+1]] for i in range(len(delim)-1)]
            splits.append(substrings)
        # check if there are any valid unique substring splits
        filtered = list(filter(all_substrings_are_unique, splits))
        if len(filtered):
            # if there are any results they must be divided into the
            # fewest number of substrings and we can stop looking
            return filtered
    # not found; worst case of one character per substring
    return [list(w)]
> print(min_unique_substrings('abacdec'))
[['a', 'bac', 'dec'], ['a', 'bacd', 'ec'], ['a', 'bacde', 'c'], ['ab', 'ac', 'dec'], ['ab', 'acd', 'ec'], ['ab', 'acde', 'c']]
> print(min_unique_substrings('cycle'))
[['c', 'ycle'], ['cy', 'cle']]
> print(min_unique_substrings('dddd'))
[['d', 'd', 'd', 'd']]
> print(min_unique_substrings('abba'))
[['ab', 'ba']]
> print(min_unique_substrings('xyz'))
[['xyz']]
Ok, thought about it again. Using a greedy algorithm should do. Loop through the letters in order. Accumulate a substring as long as no letter is repeated. Once a duplicate letter is found, spit out the substring and start with another substring until all letters are exhausted.
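The greedy procedure described above can be sketched as follows. greedy_split is a hypothetical name, and the sketch returns one minimal split rather than all of them; the minimal count is the length of the result.

```python
def greedy_split(s):
    """Split s into substrings with no repeated letter, cutting as late as possible."""
    parts = []
    current = ''
    for ch in s:
        if ch in current:
            # duplicate letter found: emit the substring and start a new one
            parts.append(current)
            current = ''
        current += ch
    if current:
        parts.append(current)
    return parts

print(greedy_split('abacdec'))  # ['ab', 'acde', 'c']
print(len(greedy_split('dddd')))  # 4
```

Each letter is checked against at most 26 distinct letters in the current substring, so for lowercase input the pass is linear in the length of the string.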

Splitting a list into sublists in python [duplicate]

I am trying to split a list into sublists wherever it contains a certain element, like '----'.
For example, if I have a list:
['a', 'b', 'c', '----', 'd', 'e'], then the resulting list should be
[['a', 'b', 'c'], ['d', 'e']]
I am new to Python and struggling with this. This is the code I wrote for the problem, but it's not working:
start_index = 0
end_index = 0
new_list = []
for character in range(0, len(characters_list)- 1):
    if characters_list[character] == '----':
        end_index = character - 1
        if character == characters_list.index('----'):
            start_index = 0
        else:
            start_index = character + 1
        for char in range(start_index, end_index):
            new_list.append(characters_list[char])
Use groupby from itertools. It groups the terms of the list into sublists according to the criterion given by the key function. Use match (a boolean value) to filter the sublists.
import itertools as it
characters_list = ['a', 'b', 'c', '----', 'd', 'e']  # example list from the question
new_lst = list(list(i) for match, i in it.groupby(characters_list, lambda p: p == '----') if not match)
print(new_lst)
To make clear how the key works, here is an example of grouping with the opposite condition:
list(list(i) for match, i in it.groupby(characters_list, lambda p: p != '----') if match)
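To see what groupby hands back before any filtering, it can help to print the key next to each group. This sketch reuses the example list from the question; the variable name groups is only illustrative.

```python
import itertools as it

characters_list = ['a', 'b', 'c', '----', 'd', 'e']

# Each run of consecutive items with the same key value becomes one group.
groups = [(match, list(group))
          for match, group in it.groupby(characters_list, lambda p: p == '----')]
print(groups)
# [(False, ['a', 'b', 'c']), (True, ['----']), (False, ['d', 'e'])]
```

Keeping only the groups whose key is False drops the separators and leaves the sublists.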
A more intuitive approach
lst = ['a', 'b', 'c', '----', 'd', 'e', '----', '1']
out = [[]]
for term in lst:
    if term != '----':
        out[-1].append(term)
    else:
        out.append([])
print(out)

How to find common elements from a list of lists such that order of occurrence is maintained?

I have a list of lists where the length of the lists are same. I need to find the common elements from them with the order of occurrence maintained.
For example:
Suppose the list of lists is [['a','e','d','c','f'], ['e','g','a','d','c'], ['c','a','h','e','j']]
The output list should contain ['a','e','c'] Priority should be given to elements which occur earlier in most of the lists. In this example 'a' occurs earlier, then 'e' and so on.
How to proceed with this?
You could find the common items first, then sort them by the sum of their positions:
from collections import defaultdict

data = [['a','e','d','c','f'],['e','g','a','d','c'],['c','a','h','e','j']]
common = set(data[0])
for line in data:
    common = common.intersection(set(line))
res = defaultdict(int)
for line in data:
    for idx, item in enumerate(line):
        if item in common:
            res[item] += idx
[item[0] for item in sorted(res.items(), key=lambda x: x[1])]
output:
['a', 'e', 'c']
Here's a quick solution that I managed to get working:
data = [['a', 'e', 'd', 'c', 'f'],
        ['e', 'g', 'a', 'd', 'c'], ['c', 'a', 'h', 'e', 'j']]
# count number of times each character appears
char_count = {}
for arr in data:
    for char in arr:
        if char not in char_count:
            char_count[char] = 1
        else:
            char_count[char] += 1
# select characters that appear multiple times
common_chars = [i[0] for i in char_count.items() if i[1] > 1]
# remove characters that are not present in all lists
# (loop over a copy: removing from a list while iterating over it skips items)
for char in list(common_chars):
    count = 0
    for arr in data:
        if char in arr:
            count += 1
    if count < len(data):
        common_chars.remove(char)
# final result with common characters
print(common_chars)
Resulting output:
['a', 'e', 'c']
Probably not the most efficient solution if you're working with lots of data though.

Basic Sorting / Order Algorithm

Trying to implement and form a very simple algorithm. This algorithm takes in a sequence of letters or numbers. It first creates an array (list) out of each character or digit. Then it checks each individual character compared with the following character in the sequence. If the two are equal, it removes the character from the array.
For example the input: 12223344112233 or AAAABBBCCCDDAAABB
And the output should be: 1234123 or ABCDAB
I believe the issue stems from the fact I created a counter and increment each loop. I use this counter for my comparison using the counter as an index marker in the array. Although, each time I remove an item from the array it changes the index while the counter increases.
Here is the code I have:
def sort(i):
    iter = list(i)
    counter = 0
    for item in iter:
        if item == iter[counter + 1]:
            del iter[counter]
        counter = counter + 1
    return iter
You're iterating over the same list that you are deleting from. That usually causes behaviour that you would not expect. Make a copy of the list & iterate over that.
However, there is a simpler solution: Use itertools.groupby
import itertools
def sort(i):
    return [x for x, _ in itertools.groupby(list(i))]

print(sort('12223344112233'))
Output:
['1', '2', '3', '4', '1', '2', '3']
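A variant of the advice above about not deleting from the list being iterated: build a new list and leave the input untouched. This is only a sketch, and drop_adjacent_duplicates is a hypothetical name.

```python
def drop_adjacent_duplicates(seq):
    """Keep an item only when it differs from the last item kept."""
    result = []
    for item in seq:
        if not result or result[-1] != item:
            result.append(item)
    return result

print(''.join(drop_adjacent_duplicates('AAAABBBCCCDDAAABB')))  # ABCDAB
```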
A few alternatives, all using s = 'AAAABBBCCCDDAAABB' as setup:
>>> import re
>>> re.sub(r'(.)\1+', r'\1', s)
'ABCDAB'
>>> p = None
>>> [c for c in s if p != (p := c)]
['A', 'B', 'C', 'D', 'A', 'B']
>>> [c for c, p in zip(s, [None] + list(s)) if c != p]
['A', 'B', 'C', 'D', 'A', 'B']
>>> [c for i, c in enumerate(s) if not s.endswith(c, None, i)]
['A', 'B', 'C', 'D', 'A', 'B']
The other answers are good. This one iterates over the list in reverse to prevent skipping items, and uses the look-ahead approach the OP described. Quick note to the OP: this really isn't a sorting algorithm.
def sort(input_str: str) -> str:
    as_list = list(input_str)
    # walk from the last index down to 1, comparing each item with the one before it
    for idx in range(len(as_list) - 1, 0, -1):
        if as_list[idx] == as_list[idx - 1]:
            del as_list[idx]
    return ''.join(as_list)

Problem with for-loop in python

This code is supposed to sort the items in self.array based upon the order of the characters in self.order. The method sort runs properly until the third iteration, when for some reason the for loop seems to repeat indefinitely. What is going on here?
Edit: I'm making my own sort function because it is a bonus part of a python assignment I have.
class sorting_class:
    def __init__(self):
        self.array = ['ca', 'bd', 'ac', 'ab'] #An array of strings
        self.arrayt = []
        self.globali = 0
        self.globalii = 0
        self.order = ['a', 'b', 'c', 'd'] #Order of characters
        self.orderi = 0
        self.carry = []
        self.leave = []
        self.sortedlist = []

    def sort(self):
        for arrayi in self.arrayt: #This should only loop for the number items in self.arrayt. However, the third time this is run it seems to loop indefinitely.
            print ('run', arrayi) #Shows the problem
            if self.order[self.orderi] == arrayi[self.globali]:
                self.carry.append(arrayi)
            else:
                if self.globali != 0:
                    self.leave.append(arrayi)

    def srt(self):
        self.arrayt = self.array
        my.sort() #First this runs the first time.
        while len(self.sortedlist) != len(self.array):
            if len(self.carry) == 1:
                self.sortedlist.append(self.carry)
                self.arrayt = self.leave
                self.leave = []
                self.carry = []
                self.globali = 1
                self.orderi = 0
                my.sort()
            elif len(self.carry) == 0:
                if len(self.leave) != 0: #Because nothing matches 'aa' during the second iteration, this code runs the third time
                    self.arrayt = self.leave
                    self.globali = 1
                    self.orderi += 1
                    my.sort()
                else:
                    self.arrayt = self.array
                    self.globalii += 1
                    self.orderi = self.globalii
                    self.globali = 0
                    my.sort()
                    self.orderi = 0
            else: #This is what runs the second time.
                self.arrayt = self.carry
                self.carry = []
                self.globali += 1
                my.sort()

my = sorting_class()
my.srt()
The key-extractor Alex mentions is trivial enough to put in a lambda function. (In Python 3, map returns a lazy iterator that cannot be compared, so it is wrapped in list here.)
>>> array = ['ca', 'bd', 'ac', 'ab']
>>> order = ['a', 'b', 'c', 'd']
>>> sorted(array, key=lambda v: list(map(order.index, v)))
['ab', 'ac', 'bd', 'ca']
>>> order = ['b', 'a', 'c', 'd']
>>> sorted(array, key=lambda v: list(map(order.index, v)))
['bd', 'ab', 'ac', 'ca']
>>> order = ['d', 'c', 'b', 'a']
>>> sorted(array, key=lambda v: list(map(order.index, v)))
['ca', 'bd', 'ac', 'ab']
Let's see how this works:
map calls the method order.index for each item in v, and list collects those return values into a list.
v will be one of the elements of array
>>> order = ['a', 'b', 'c', 'd']
>>> list(map(order.index, array[0]))
[2, 0]
>>> list(map(order.index, array[1]))
[1, 3]
>>> list(map(order.index, array[2]))
[0, 2]
>>> list(map(order.index, array[3]))
[0, 1]
The function is supplied as key= to sorted, so internally those lists are compared instead of the strings.
During the third pass of your loop you are appending new elements to the list you are iterating over therefore you can never leave the loop:
self.arrayt = self.leave - this assignment leads to the fact that self.leave.append(arrayi) will append elements to the list self.arrayt refers to.
In general you may think about creating copies of lists not just assigning different variables/members to the same list instances.
You have self.arrayt = self.leave which makes arrayt refer to exactly the same list as leave (it's not a copy of the contents!!!), then in the loop for arrayi in self.arrayt: you perpetrate a self.leave.append(arrayi) -- which lengthens self.leave, which is just another name for the very list self.arrayt you're looping on. Appending to the list you're looping on is a good recipe for infinite loops.
This is just one symptom of this code's inextricable messiness. I recommend you do your sorting with the built-in sort method and put your energy into defining the right key= key-extractor function to get things sorted the exact way you want -- a much more productive use of your time.
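The aliasing described in the answers above can be reproduced in a few lines. This is a minimal sketch with standalone variables named after the attributes in the question, not a fix for the class itself.

```python
leave = ['bd']
arrayt = leave            # same list object, not a copy
print(arrayt is leave)    # True
leave.append('ca')
print(arrayt)             # ['bd', 'ca']  (the append is visible through both names)

arrayt = list(leave)      # shallow copy: a separate list object
leave.append('ab')
print(arrayt)             # ['bd', 'ca']  (unaffected by the later append)
```

Looping over the copy while appending to leave would therefore terminate, because the list being iterated no longer grows.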
