Python recursion to split string by sliding window - python

Recently, I faced an interesting coding task that involves splitting a string into every possible combination of parts, with a given limit K on the number of parts.
For example:
s = "iamfoobar"
k = 4 # the max number of the items on a list after the split
The string s can be split into the following combinations:
[
["i", "a", "m", "foobar"],
["ia", "m", "f", "oobar"],
["iam", "f", "o", "obar"]
# etc
]
I tried to figure out how to do that with a quick recursive function, but I cannot get it to work.
I tried this, but it didn't work:
def sliding(s, k):
    if len(s) < k:
        return []
    else:
        for i in range(0, k):
            return [s[i:i+1]] + sliding(s[i+1:len(s) - i], k)

print(sliding("iamfoobar", 4))
And I only got this:
['i', 'a', 'm', 'f', 'o', 'o']

Your first main problem is that although you use a loop, you return a single list on its first iteration. So no matter how much you fix everything else, your output will never match what you expect, because it will always be a single list.
Second, in the recursive call you start with s[i:i+1], but according to your example you want all prefixes, so something like s[:i] is more suitable.
Additionally, in the recursive call you never reduce k, which is the natural recursive step.
Lastly, your stop condition also seems wrong. As above, if the natural step is reducing k, the natural stop is: if k == 1, return [[s]]. This is because the only way to split the string into 1 part is the string itself.
The important thing is to keep your final output format in mind and think about how it fits into each step. In this case you want to return a list of all possible splits, each one a list. So in the case of k == 1, you simply return a list containing a single list containing the string.
Now for the step: you take a different prefix each time, and add to it every split of the rest of the string returned by the call with k-1. All in all, the code can be something like this:
def splt(s, k):
    if k == 1:  # base case - stop condition
        return [[s]]
    res = []
    # loop over all prefixes
    for i in range(1, len(s)-k+2):
        for tmp in splt(s[i:], k-1):
            # add to prefix all splits into k-1 parts of the rest of s
            res.append([s[:i]] + tmp)
    return res
You can test it on some inputs and see how it works.
If you are not restricted to recursion, another approach is to use itertools.combinations. You can use that to create all combinations of indexes inside the string to split it into k parts, and then simply concatenate those parts and put them in a list. A raw version is something like:
from itertools import combinations

def splt(s, k):
    res = []
    for indexes in combinations(range(1, len(s)), k-1):
        # add the edges to the k-1 indexes to create k parts
        indexes = [0] + list(indexes) + [len(s)]
        # slice out the k parts
        res.append([s[start:end] for start, end in zip(indexes[:-1], indexes[1:])])
    return res
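As a sanity check on either version, the number of ways to cut a string of length n into k non-empty parts should be C(n-1, k-1), since each split corresponds to choosing k-1 cut points among the n-1 gaps between characters. A minimal sketch, assuming Python 3.8+ for math.comb; the name split_into_parts is illustrative and does not come from the answers above:

```python
from itertools import combinations
from math import comb

def split_into_parts(s, k):
    """Return every way to cut s into exactly k non-empty consecutive parts."""
    result = []
    for cuts in combinations(range(1, len(s)), k - 1):
        # Surround the k-1 cut points with the two string edges.
        bounds = (0, *cuts, len(s))
        result.append([s[a:b] for a, b in zip(bounds, bounds[1:])])
    return result

parts = split_into_parts("iamfoobar", 4)
print(len(parts), comb(len("iamfoobar") - 1, 4 - 1))  # both counts should agree
print(parts[0])  # the split with the earliest cut points
```

If the two printed counts ever differ, the split function is dropping or duplicating results.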

The main issue in your implementation is that your loop does not do what it is supposed to do: it returns the first result instead of appending the results.
Here's an example of an implementation:
def sliding(s, k):
    # If there are not enough characters, or k is below 1,
    # no combination is possible
    if len(s) < k or k < 1:
        return []
    # If k is one, we return a list containing all the combinations,
    # which is a single list containing the string
    if k == 1:
        return [[s]]
    results = []
    # Iterate through all the possible values for the first value
    for i in range(1, len(s) - k + 2):
        first_value = s[:i]
        # Append the result of the sub call to the first value
        for sub_result in sliding(s[i:], k - 1):
            results.append([first_value] + sub_result)
    return results

print(sliding("iamfoobar", 4))

Related

Find out all possible cartesian products using Recursion

The function takes in a string (str) s and an integer (int) n.
The function returns a list (list) of all strings of length n in the Cartesian product of s with itself.
The expression product(s,n) can be computed by adding each character in s to the
result of product(s,n-1).
>>> product('ab',3)
['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']
My attempt:
def product(s, n):
    if n == 0:
        return ""
    string = ''
    for i in range(len(s)):
        string += s[i] + product(s, n - 1)
    return string
Disclaimer: It doesn't work^
Your code is building a single string. That doesn't match what the function is supposed to do, which is return a list of strings. You need to add the list-building logic, and deal with the fact that your recursive calls are going to produce lists as well.
I'd do something like this:
def product(s, n):
    if n == 0:
        return ['']
    result = []
    for prefix in s:  # pick a first character
        for suffix in product(s, n-1):  # recurse to get the rest
            result.append(prefix + suffix)  # combine and add to our results
    return result
This produces the output in the desired order, but it recurses a lot more often than necessary. You could swap the order of the loops, though to avoid getting the results in a different order, you'd need to change the logic so that you pick the last character from s directly while letting the recursion produce the prefix.
def product(s, n):
    if n == 0:
        return ['']
    result = []
    for prefix in product(s, n-1):  # recurse to get the start of each string
        for suffix in s:  # pick a final character
            result.append(prefix + suffix)  # combine and add to our results
    return result
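To make the difference in recursion concrete, here is a small sketch that counts calls for both loop orders. The function names and the calls counter (a one-element list used as a mutable counter) are additions for illustration only:

```python
def product_char_first(s, n, calls):
    """Outer loop over characters: recurses once per character at every level."""
    calls[0] += 1
    if n == 0:
        return ['']
    result = []
    for prefix in s:
        for suffix in product_char_first(s, n - 1, calls):
            result.append(prefix + suffix)
    return result

def product_recurse_first(s, n, calls):
    """Outer loop over the recursive result: recurses once per level."""
    calls[0] += 1
    if n == 0:
        return ['']
    result = []
    for prefix in product_recurse_first(s, n - 1, calls):
        for suffix in s:
            result.append(prefix + suffix)
    return result

calls_a, calls_b = [0], [0]
out_a = product_char_first('ab', 3, calls_a)
out_b = product_recurse_first('ab', 3, calls_b)
print(out_a == out_b)          # True: same strings in the same order
print(calls_a[0], calls_b[0])  # 15 4
```

The first variant makes 1 + 2 + 4 + 8 calls for 'ab' and n = 3, while the second makes one call per level.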

How does this recursive permutation generator work? [duplicate]

def permute2(seq):
    if not seq:                               # Shuffle any sequence: generator
        yield seq                             # Empty sequence
    else:
        for i in range(len(seq)):
            rest = seq[:i] + seq[i+1:]        # Delete current node
            for x in permute2(rest):          # Permute the others
                # In some cases x = empty string
                yield seq[i:i+1] + x          # Add node at front

for x in permute2('abc'):
    print('result =',x)
When yielding the first result ('abc') the value of i == 0 and seq == 'abc'. The control flow then takes it to the top of the outer for loop where i == 1, which makes sense; however, seq == 'bc', which completely baffles me.
I'll try to explain it from an inductive standpoint.
Let the length of the sequence be n.
Our base case is n = 0, when we simply return the empty sequence (it is the only permutation).
If n > 0, then:
To define a permutation, we must first begin by choosing the first element. To define all permutations, obviously we must consider all elements of the sequence as possible first elements.
Once we have chosen the first element, we now need to choose the rest of the elements. But this is just the same as finding all the permutations of sequence without the element we chose. This new sequence is of length n-1, and so by induction we can solve it.
Slightly altering the original code to mimic my explanation and hopefully make it clearer:
def permute2(seq):
    length = len(seq)
    if length == 0:
        # This is our base case, the empty sequence, with only one possible permutation.
        yield seq
        return
    for i in range(length):
        # We take the ith element as the first element of our permutation.
        first_element_as_single_element_collection = seq[i:i+1]
        # Note other_elements has len(other_elements) == len(seq) - 1
        other_elements = seq[:i] + seq[i+1:]
        for permutation_of_smaller_collection in permute2(other_elements):
            yield first_element_as_single_element_collection + permutation_of_smaller_collection

for x in permute2('abc'):
    print('result =',x)
Hopefully it is clear that the original code does exactly the same thing as the above code.
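The seq == 'bc' observed in the question is most likely the seq of a nested permute2 call rather than the outer one: every recursive call is its own generator with its own local seq and i, and a debugger shows whichever frame is currently running. A small sketch that records the recursion depth makes this visible. The depth and log parameters are additions for illustration and are not part of the original function:

```python
def permute_traced(seq, depth=0, log=None):
    """Same logic as permute2, logging (depth, seq) when each generator starts."""
    if log is not None:
        log.append((depth, seq))
    if not seq:
        yield seq
        return
    for i in range(len(seq)):
        rest = seq[:i] + seq[i+1:]
        for x in permute_traced(rest, depth + 1, log):
            yield seq[i:i+1] + x

log = []
results = list(permute_traced('abc', log=log))
print(results)   # ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
print(log[:4])   # [(0, 'abc'), (1, 'bc'), (2, 'c'), (3, '')]
```

The entry (1, 'bc') is the generator created while the outer call has i == 0, so both seq values exist at the same time in different frames.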

most efficient way to iterate over a large array looking for a missing element in Python

I was trying an online test. The test asked me to write a function that, given a list of up to 100000 integers in the range 1 to 100000, finds the first missing integer.
For example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows:
def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i
The feedback I received is that the code is not efficient in handling big lists.
I am quite new and I could not find an answer. How can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python sorting is, you could also do something like:
def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of consecutive numbers, the sum is given by Gauss's formula (this assumes nums is sorted and exactly one number is missing from it):
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[-1]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This computation is O(n), which is as efficient as you can get. expected is an O(1) computation, while actual has to add up the elements.
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
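The sum-based idea described above can be packaged as a function. This is a sketch under stated assumptions, not a general solution: it assumes positive, distinct integers with at most one value missing strictly between the smallest and largest, and it uses min and max instead of nums[0] and nums[-1] so the input does not need to be sorted. The name missing_by_sum is illustrative:

```python
def missing_by_sum(nums):
    """Return the one value missing from a run of consecutive integers, or None.

    Assumes positive, distinct integers with at most one gap between
    min(nums) and max(nums); it cannot detect a missing first or last value.
    """
    lo, hi = min(nums), max(nums)
    # Sum of the full run lo..hi (arithmetic series).
    expected = (lo + hi) * (hi - lo + 1) // 2
    diff = expected - sum(nums)
    return diff if diff else None

print(missing_by_sum([1, 4, 5, 2]))  # 3
print(missing_by_sum([1, 2, 3]))     # None
```

For the original task, where several values may be missing, the set-based loop remains the safer choice.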
You can use Counter to count every occurrence in your list. The smallest number with zero occurrences will be your output. For example:
from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # the distinct elements of the list
    main_list = list(range(1, 100001))  # the list of values from 1 to 100k
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iteration on each letter).
I have no clue how to do that in Python 2.7. I was trying with regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'

def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]

print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:  # letter was already seen in this window
            pass
        j+=1
    if len(letters)==0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, yield the substring. Then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
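For long sequences, the repeated cur_str.count calls and the pop(0) on a list each cost time proportional to the window length. One possible variant, sketched here rather than taken from the answer, keeps running letter counts in a Counter and uses a deque so that removing from the front is cheap. The name find_windows and the alphabet parameter are illustrative:

```python
from collections import Counter, deque

def find_windows(seq, alphabet="ACGT"):
    """Yield each window that contains every letter of alphabet, sliding left to right."""
    window = deque()
    counts = Counter()
    for ch in seq:
        window.append(ch)
        counts[ch] += 1
        # Shrink from the left for as long as the window stays complete.
        while all(counts[a] > 0 for a in alphabet):
            yield "".join(window)
            counts[window.popleft()] -= 1

print(list(find_windows("AAGTCCTAG")))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```

The window logic is unchanged; only the bookkeeping differs.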
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s for as long as c is no greater than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exists in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (digit, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items

#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2)  # Returns ['aatcg', 'ctagg']

Algorithm for finding the possible palindromic strings in a list containing a list of possible subsequences

I have "n" strings as input, which I separate into possible subsequences in a list like below.
If the input is: aa, b, aa
I create a list like the one below (each inner list holding the subsequences of one string):
aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
I would like to find the combinations across the lists in aList that form palindromes.
For example, there would be 5 possible palindromes for this input: aba, aba, aba, aba, aabaa
This could be achieved with a brute-force algorithm using the code below:
import itertools

d = []
count = 0

def isPalindrome(x):
    if x == x[::-1]: return True
    else: return False

for I in itertools.product(*aList):
    a = (''.join(I))
    if isPalindrome(a):
        if a not in d:
            d.append(a)
            count += 1
But this approach results in a timeout when the number of strings and the length of the strings are bigger.
Is there a better approach to the problem?
Second version
This version uses a set called seen to avoid testing combinations more than once.
Note that your function isPalindrome() can be simplified to a single expression, so I removed it and just did the test in-line to avoid the overhead of an unnecessary function call.
import itertools

aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
d = []
seen = set()
for I in itertools.product(*aList):
    if I not in seen:
        seen.add(I)
        a = ''.join(I)
        if a == a[::-1]:
            d.append(a)
print('d: {}'.format(d))
The current approach has the disadvantage that most of the generated candidates are thrown away once they are checked and found not to be palindromes.
One idea is that once you pick a string from one side, you can immediately check whether there is a corresponding string in the last group.
For example, let's say that your space is this:
[["a","b","c"], ... , ["b","c","d"]]
We can see that if you pick "a" as the first pick, there is no "a" in the last group, and this excludes all the candidates that would otherwise be tried.
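A minimal sketch of that first-versus-last check, assuming at least two groups. It relies on the fact that a palindrome starts with its first piece and also with the reverse of its last piece, so one of the two must be a prefix of the other. The helper names can_pair and prune_first_group, and the middle group in the sample, are hypothetical:

```python
def can_pair(first, last):
    """Necessary condition for a palindrome that starts with first and ends with last."""
    mirrored = last[::-1]
    return first.startswith(mirrored) or mirrored.startswith(first)

def prune_first_group(groups):
    """Keep only the first-group strings that some last-group string can mirror."""
    first_group, last_group = groups[0], groups[-1]
    return [f for f in first_group if any(can_pair(f, l) for l in last_group)]

space = [["a", "b", "c"], ["x", "y"], ["b", "c", "d"]]
print(prune_first_group(space))  # ['b', 'c']
```

This only filters candidates; the surviving picks still need a full palindrome test once the middle groups are filled in.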
For larger input you could probably get some time gain by grabbing words from the first array, and compare them with the words of the last array to check that these pairs still allow for a palindrome to be formed, or that such a combination can never lead to one by inserting arrays from the remaining words in between.
This way you probably cancel out a lot of possibilities, and this method can be repeated recursively, once you have decided that a pair is still in the running. You would then save the common part of the two words (when the second word is reversed of course), and keep the remaining letters separate for use in the recursive part.
Depending on which of the two words was longer, you would compare the remaining letters with words from the array that is next from the left or from the right.
This should bring a lot of early pruning in the search tree. You would thus not perform the full Cartesian product of combinations.
I have also written the function to get all substrings from a given word, which you probably already had:
def allsubstr(str):
    return [str[i:j+1] for i in range(len(str)) for j in range(i, len(str))]

def getpalindromes_trincot(aList):

    def collectLeft(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aRevList[j]:
            if seq.startswith(needle):
                results += collectRight(common+needle, seq[len(needle):], i, j-1)
            elif needle.startswith(seq):
                results += collectLeft(common+seq, needle[len(seq):], i, j-1)
        return results

    def collectRight(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aList[i]:
            if seq.startswith(needle):
                results += collectLeft(common+needle, seq[len(needle):], i+1, j)
            elif needle.startswith(seq):
                results += collectRight(common+seq, needle[len(seq):], i+1, j)
        return results

    aRevList = [[seq[::-1] for seq in seqs] for seqs in aList]
    return collectRight('', '', 0, len(aList)-1)

# sample input and call:
input = ['already', 'days', 'every', 'year', 'later']
aList = [allsubstr(word) for word in input]
result = getpalindromes_trincot(aList)
I did a timing comparison with the solution that martineau posted. For the sample data I have used, this solution is about 100 times faster:
See it run on repl.it
Another Optimisation
Some gain could also be found in not repeating the search when the first array has several entries with the same string, like the 'a' in your example data. The results that include the second 'a' will obviously be the same as for the first. I did not code this optimisation, but it might be an idea to improve the performance even more.
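One way this de-duplication might look, shown here on the brute-force product rather than on the pruned search: collapse each group to its distinct strings with a Counter, test each distinct combination once, and multiply the multiplicities to recover how many original combinations it stands for. This is a sketch assuming Python 3.8+ for math.prod; the name count_palindromes is illustrative:

```python
from collections import Counter
from itertools import product
from math import prod

def count_palindromes(groups):
    """Count palindromic combinations, visiting each distinct combination once."""
    counters = [Counter(group) for group in groups]
    total = 0
    # Iterating a Counter yields its distinct keys only.
    for combo in product(*counters):
        word = "".join(combo)
        if word == word[::-1]:
            # Each distinct pick stands for this many original picks.
            total += prod(c[s] for c, s in zip(counters, combo))
    return total

aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
print(count_palindromes(aList))  # 5
```

For the sample data this visits 4 distinct combinations instead of 9 and still reports the 5 palindromes listed in the question.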
