Finding Subsequences of a large String

I am trying to get all the subsequences of a string. Example:
firstString = "ABCD"
The output should be:
'ABCD', 'BCD', 'ACD', 'ABD', 'ABC', 'CD', 'BD', 'BC', 'AD', 'AC', 'AB', 'D', 'C', 'B', 'A'
For that I am using the following code:
#!/usr/bin/python
from __future__ import print_function
from operator import itemgetter
from subprocess import call
import math
import itertools
import operator
call(["date"])
firstArray = []
firstString = "ABCD"
firstList = list(firstString)
for L in range(0, len(firstList)+1):
    for subset in itertools.combinations(firstList, L):
        firstArray.append(''.join(subset))
firstArray.reverse()
print (firstArray)
call(["date"])
But this code does not scale.
If I provide:
firstString = "ABCDABCDABCDABCDABCDABCDABCD"
the program takes almost 6 minutes to complete.
---------------- Capture while running the script --------------------
python sample-0012.py
Wed Feb 8 21:30:30 PST 2017
Wed Feb 8 21:30:30 PST 2017
Can someone please help?

What you are looking for is called a "power set" (or powerset).
The Wikipedia definition:
a power set (or powerset) of any set S is the set of all subsets of S,
including the empty set and S itself.
A good solution might be recursive; here you can find one:
link
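As a rough illustration of the recursive idea (a sketch only, not necessarily what the linked solution does; the name power_set is made up here), a generator can yield the subsequences one at a time instead of building the whole list in memory. Keep in mind that a string of length n has 2**n subsequences, so for the long input in the question no approach that stores every result will be fast.

```python
def power_set(s):
    """Lazily yield every subsequence of s, including the empty string."""
    if not s:
        yield ''
        return
    first, rest = s[0], s[1:]
    for sub in power_set(rest):
        yield sub           # subsequence that skips the first character
        yield first + sub   # subsequence that keeps the first character


# Order by length, then alphabetically, only for display purposes.
print(sorted(power_set("ABC"), key=lambda t: (len(t), t)))

# The number of results doubles with every extra character.
print(2 ** len("ABCDABCDABCDABCDABCDABCDABCD"))
```

Because power_set is a generator, the results can be processed in a loop without keeping all of them around at once. Repeated letters still produce repeated subsequences, just as with itertools.combinations.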

For a better understanding of the power set concept, see
How to get all possible combinations of a list’s elements?
Otherwise, you can do it like this:
from itertools import combinations
wordlist = []
for i in range(len(firstString)):
    comblist = combinations(firstString, i + 1)
    same_length_words = []
    for word in comblist:
        # skip combinations that were already seen for this length
        if word not in same_length_words:
            same_length_words.append(word)
    for each_word in same_length_words:
        wordlist.append(''.join(each_word))

Try this:
from itertools import chain, combinations
firstString = 'ABCD'
data = list(firstString)
lists = chain.from_iterable(combinations(data, r) for r in range(len(data)+1))
print([''.join(i) for i in lists if i])
# ['A', 'B', 'C', 'D', 'AB', 'AC', 'AD', 'BC', 'BD', 'CD', 'ABC', 'ABD', 'ACD', 'BCD', 'ABCD']

Related

Python code to solve classic P(n, r): Print all permutations of n objects taken r at a time without repetition

Python code to solve classic P(n, r)
Problem: Print all permutations of n objects taken r at a time without repetition.
I'm a Python learner looking for an elegant solution vs. trying to solve a coding problem at work.
Interested in seeing code to solve the classic P(n, r) permutation problem -- how to print all permutations of a string taken r characters at a time, without repeated characters.
Because learning is my focus, not interested in using the Python itertools "permutations" library function. Looked at it, but couldn't understand what it was doing. Looking for actual code to solve this problem, so I can learn the implementation.
Example: if input string s == 'abcdef', and r == 4, then n == 6.
Output would be something like: abcd abce abcf abde abdf abef ...
There are a lot of closely similar questions, but I didn't find a duplicate. Most specify "r". I want to leave r as an input parameter to keep the solution general.

This approach uses recursive generator functions, which I find very readable. It is easiest to start with combinations:
def combs(s, r):
    if not r:
        yield ''
    elif s:
        first, rest = s[0], s[1:]
        for comb in combs(rest, r-1):
            yield first + comb  # use first char ...
        yield from combs(rest, r)  # ... or don't
>>> list(combs('abcd', 2))
['ab', 'ac', 'ad', 'bc', 'bd', 'cd']
>>> list(combs('abcd', 3))
['abc', 'abd', 'acd', 'bcd']
And build permutations on top of them:
def perms(s, r):
    if not r:
        yield ''
    else:
        for comb in combs(s, r):
            for i, char in enumerate(comb):
                rest = comb[:i] + comb[i+1:]
                for perm in perms(rest, r-1):
                    yield char + perm
>>> list(perms('abc', 2))
['ab', 'ba', 'ac', 'ca', 'bc', 'cb']
>>> list(perms('abcd', 2))
['ab', 'ba', 'ac', 'ca', 'ad', 'da', 'bc', 'cb', 'bd', 'db', 'cd', 'dc']
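For comparison, permutations can also be generated directly, without going through combs first. The following is a minimal sketch (perms_direct is a hypothetical name, not part of the answer above) that assumes the characters of s are distinct, as in the question. It picks each character in turn as the first one and recurses on the remaining characters.

```python
def perms_direct(s, r):
    """Yield all arrangements of r distinct characters taken from s."""
    if r == 0:
        yield ''
        return
    for i, char in enumerate(s):
        rest = s[:i] + s[i + 1:]       # every character except the chosen one
        for tail in perms_direct(rest, r - 1):
            yield char + tail


print(list(perms_direct('abc', 2)))
# ['ab', 'ac', 'ba', 'bc', 'ca', 'cb']
```

The results are the same as those of perms, only in a different order: this version groups them by first character.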

How to get all possible strings for the alphabet letters in Python?

For example, given the alphabet = 'abcd', how can I get this output in Python:
a
aa
b
bb
ab
ba
(...)
iteration by iteration.
I already tried the powerset() function that is found here on stackoverflow,
but that doesn't repeat letters in the same string.
Also, how can I set a minimum and maximum length for the strings?
For example, with min=3 and max=4: abc, aaa, aba, ..., aaaa, abca, abcb, ...

You can use combinations_with_replacement from itertools (docs). The function combinations_with_replacement takes an iterable object as its first argument (e.g. your alphabet) and the desired length of the combinations to generate. Since you want strings of different lengths, you can loop over each desired length.
For example:
from itertools import combinations_with_replacement
def get_all_poss_strings(alphabet, min_length, max_length):
    poss_strings = []
    for r in range(min_length, max_length + 1):
        poss_strings += combinations_with_replacement(alphabet, r)
    # combinations_with_replacement returns tuples, so join them into individual strings
    return ["".join(s) for s in poss_strings]
Sample:
alphabet = "abcd"
min_length = 3
max_length = 4
get_all_poss_strings(alphabet, min_length, max_length)
Output:
['aaa', 'aab', 'aac', 'aad', 'abb', 'abc', 'abd', 'acc', 'acd', 'add', 'bbb', 'bbc', 'bbd', 'bcc', 'bcd', 'bdd', 'ccc', 'ccd', 'cdd', 'ddd', 'aaaa', 'aaab', 'aaac', 'aaad', 'aabb', 'aabc', 'aabd', 'aacc', 'aacd', 'aadd', 'abbb', 'abbc', 'abbd', 'abcc', 'abcd', 'abdd', 'accc', 'accd', 'acdd', 'addd', 'bbbb', 'bbbc', 'bbbd', 'bbcc', 'bbcd', 'bbdd', 'bccc', 'bccd', 'bcdd', 'bddd', 'cccc', 'cccd', 'ccdd', 'cddd', 'dddd']
Edit:
If order also matters for your strings (as indicated by having "ab" and "ba"), you can use the following function to get all permutations of all lengths in a given range:
from itertools import combinations_with_replacement, permutations
def get_all_poss_strings(alphabet, min_length, max_length):
    poss_strings = []
    for r in range(min_length, max_length + 1):
        combos = combinations_with_replacement(alphabet, r)
        perms_of_combos = []
        for combo in combos:
            perms_of_combos += permutations(combo)
        poss_strings += perms_of_combos
    # set() removes duplicates but has no defined order, so sort the result
    return sorted(set("".join(s) for s in poss_strings))
Sample:
alphabet = "abcd"
min_length = 1
max_length = 2
get_all_poss_strings(alphabet, min_length, max_length)
Output:
['a', 'aa', 'ab', 'ac', 'ad', 'b', 'ba', 'bb', 'bc', 'bd', 'c', 'ca', 'cb', 'cc', 'cd', 'd', 'da', 'db', 'dc', 'dd']

You can use the product function of itertools with varying lengths. The result differs in order from the example you give, but this may be what you want. This results in a generator that you can use to get all your desired strings. This code lets you set a minimum and a maximum length of the returned strings. If you do not specify a value for parameter maxlen then the generator is infinite. Be sure you have a way to stop it or you will get an infinite loop.
import itertools
def allcombinations(alphabet, minlen=1, maxlen=None):
    thislen = minlen
    while maxlen is None or thislen <= maxlen:
        for prod in itertools.product(alphabet, repeat=thislen):
            yield ''.join(prod)
        thislen += 1
for c in allcombinations('abcd', minlen=1, maxlen=2):
    print(c)
This example gives the printout which is similar to your first example, though in a different order.
a
b
c
d
aa
ab
ac
ad
ba
bb
bc
bd
ca
cb
cc
cd
da
db
dc
dd
If you really want a full list, just use
list(allcombinations('abcd', minlen=1, maxlen=2))
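One way to stop the infinite generator safely is itertools.islice, which takes only the first few items. The sketch below repeats the allcombinations function from above so that it runs on its own; the two-letter alphabet and the limit of 6 are arbitrary choices for illustration.

```python
import itertools


def allcombinations(alphabet, minlen=1, maxlen=None):
    thislen = minlen
    while maxlen is None or thislen <= maxlen:
        for prod in itertools.product(alphabet, repeat=thislen):
            yield ''.join(prod)
        thislen += 1


# No maxlen is given, so the generator never ends; islice stops after 6 items.
first_six = list(itertools.islice(allcombinations('ab'), 6))
print(first_six)
# ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```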

Using Recursion to make sequences of a word

I was given a homework assignment to find all possible sequences of a given word. eg. if word = 'abc', the below code would return ['a', 'ab', 'abc', 'ac', 'acb', 'b', 'ba', 'bac', 'bc', 'bca', 'c', 'ca', 'cab', 'cb', 'cba'].
However, this feels inefficient. I'm just starting to learn recursion, so I'm not sure if there is a better or more efficient way to produce these sequences?
Edit:
I think it's necessary to add a couple of things, as I kept working and reading the material:
Duplicates are fine; those are sorted out in a separate function.
Each value is unique, so the sequence 'aab' should produce two 'aa' sequences.
def gen_all_strings(word):
    if len(word) == 1:
        return list(word)
    else:
        main_list = list()
        for idx in range(len(word)):
            cur_val = word[idx]
            rest = gen_all_strings(word[:idx] + word[idx+1:])
            main_list.append(cur_val)
            for seq in rest:
                main_list.append(cur_val + seq)
        return main_list

Itertools and list comprehensions are good for breaking stuff down like this.
import itertools
["".join(x) for y in range(1, len(word) + 1) for x in itertools.permutations(word, y)]

Getting all combinations of a string and its substrings

I've seen many questions on getting all the possible substrings (i.e., adjacent sets of characters), but none on generating all possible strings including the combinations of its substrings.
For example, let:
x = 'abc'
I would like the output to be something like:
['abc', 'ab', 'ac', 'bc', 'a', 'b', 'c']
The main point is that we can remove multiple characters that are not adjacent in the original string (as well as the adjacent ones).
Here is what I have tried so far:
def return_substrings(input_string):
    length = len(input_string)
    return [input_string[i:j + 1] for i in range(length) for j in range(i, length)]
print(return_substrings('abc'))
However, this only removes sets of adjacent strings from the original string, and will not return the element 'ac' from the example above.
Another example is if we use the string 'abcde', the output list should contain the elements 'ace', 'bd' etc.

You can do this easily using itertools.combinations
>>> from itertools import combinations
>>> x = 'abc'
>>> [''.join(l) for i in range(len(x)) for l in combinations(x, i+1)]
['a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']
If you want it in the reversed order, you can make the range function return its sequence in reversed order
>>> [''.join(l) for i in range(len(x),0,-1) for l in combinations(x, i)]
['abc', 'ab', 'ac', 'bc', 'a', 'b', 'c']

This is a fun exercise. I think other answers may use itertools.product or itertools.combinations. But just for fun, you can also do this recursively with something like
def subs(string, ret=['']):
    if len(string) == 0:
        return ret
    head, tail = string[0], string[1:]
    ret = ret + list(map(lambda x: x+head, ret))
    return subs(tail, ret)
subs('abc')
# returns ['', 'a', 'b', 'ab', 'c', 'ac', 'bc', 'abc']

@Sunitha's answer provides the right tool to use. I will just suggest an improved version of your return_substrings function that also takes care of duplicates.
I will use "ABCA" to demonstrate my solution. Note that for this input the accepted answer would include a duplicate 'A' in the returned list.
Python 3.7+ solution:
from itertools import combinations
x = "ABCA"
def return_substrings(x):
    all_combinations = [''.join(l) for i in range(len(x)) for l in combinations(x, i+1)]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(reversed(list(dict.fromkeys(all_combinations))))
    # return list(dict.fromkeys(all_combinations)) for non-reversed ordering
print(return_substrings(x))
# ['ABCA', 'BCA', 'ACA', 'ABA', 'ABC', 'CA', 'BA', 'BC', 'AA', 'AC', 'AB', 'C', 'B', 'A']
Python 2.7 solution:
You'll have to use OrderedDict (from the collections module) instead of a normal dict. Therefore,
return list(reversed(list(dict.fromkeys(all_combinations))))
becomes
return list(reversed(list(OrderedDict.fromkeys(all_combinations))))
Is order irrelevant for you?
You can simplify the code if order does not matter:
x = "ABCA"
def return_substrings(x):
    all_combinations = [''.join(l) for i in range(len(x)) for l in combinations(x, i+1)]
    return list(set(all_combinations))

def return_substrings(s):
    all_sub = set()
    recent = {s}
    while recent:
        tmp = set()
        for word in recent:
            for i in range(len(word)):
                tmp.add(word[:i] + word[i + 1:])
        all_sub.update(recent)
        recent = tmp
    return all_sub

For an overkill / different version of the accepted answer (expressing combinations using https://docs.python.org/3/library/itertools.html#itertools.product ):
["".join(["abc"[y[0]] for y in x if y[1]]) for x in map(enumerate, itertools.product((False, True), repeat=3))]
For a more visual interpretation, think of each subsequence as one bitstring of length n, where bit i says whether character i of the string is kept.
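The bitstring view can be sketched with plain integers used as bitmasks. This is an illustrative variant, not the code from the answer above; the function name subsequences_by_bitmask is made up, and the empty subsequence (mask 0) is skipped on purpose to match the output asked for in the question.

```python
def subsequences_by_bitmask(s):
    """Yield each non-empty subsequence of s, one per bitmask from 1 to 2**n - 1."""
    n = len(s)
    for mask in range(1, 1 << n):
        # keep character i exactly when bit i of the mask is set
        yield ''.join(s[i] for i in range(n) if mask & (1 << i))


print(list(subsequences_by_bitmask('abc')))
# ['a', 'b', 'ab', 'c', 'ac', 'bc', 'abc']
```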

How to generate subpeptides (special combinations) from a string representing a cyclic peptide?

Here is my problem: I have a sequence representing a cyclic peptide and I'm trying to create a function that generates all possible subpeptides. A subpeptide is created when bonds between 2 amino acids are broken. For example, for the peptide 'ABCD', its subpeptides would be 'A', 'B', 'C', 'D', 'AB', 'BC', 'CD', 'DA', 'ABC', 'BCD', 'CDA', 'DAB'. Thus, the number of possible subpeptides from a peptide of length n will always be n*(n-1). Note that not all of them are substrings of the peptide ('DA', 'CDA'...).
I've written code that generates combinations. However, there are some excess elements, such as amino acids that are not linked ('AC', 'BD'...). Does anyone have a hint on how I could eliminate those, given that the peptide may have a different length each time the function is called? Here's what I have so far:
def Subpeptides(peptide):
    subpeptides = []
    from itertools import combinations
    for n in range(1, len(peptide)):
        subpeptides.extend(
            [''.join(comb) for comb in combinations(peptide, n)]
        )
    return subpeptides
Here are the results for peptide 'ABCD':
['A', 'B', 'C', 'D', 'AB', 'AC', 'AD', 'BC', 'BD', 'CD', 'ABC', 'ABD', 'ACD', 'BCD']
The order of aminoacids is not important, if they represent a real sequence of the peptide. For example, 'ABD' is a valid form of 'DAB', since D and A have a bond in the cyclic peptide.
I'm using Python.

It's probably easier to just generate them all:
def subpeptides(peptide):
    l = len(peptide)
    looped = peptide + peptide
    for start in range(0, l):
        for length in range(1, l):
            print(looped[start:start+length])
which gives:
>>> subpeptides("ABCD")
A
AB
ABC
B
BC
BCD
C
CD
CDA
D
DA
DAB
(If you want a list instead of printing, just change print(...) to yield ... and you have a generator, which you can pass to list().)
All the above does is enumerate the different places the first bond could be broken, and then the different products you would get if the next bond broke after one, two, or three (in this case) acids. looped is just an easy way to avoid having the logic of going "round the loop".
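The generator variant mentioned above could look like the following sketch. It is the same loop with print replaced by yield; the name iter_subpeptides is chosen here only to keep it apart from the printing version.

```python
def iter_subpeptides(peptide):
    """Yield every contiguous piece of a cyclic peptide, shorter than the peptide itself."""
    n = len(peptide)
    looped = peptide + peptide          # doubling the string handles the wrap-around
    for start in range(n):              # position of the first broken bond
        for length in range(1, n):      # pieces of length 1 to n-1
            yield looped[start:start + length]


print(list(iter_subpeptides("ABCD")))
# ['A', 'AB', 'ABC', 'B', 'BC', 'BCD', 'C', 'CD', 'CDA', 'D', 'DA', 'DAB']
```

For a peptide of length n this yields n*(n-1) pieces, which matches the count given in the question.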

The last term is missing.
You can use the code below:
def subpeptides(peptide):
    l = len(peptide)
    ls = []
    looped = peptide + peptide
    for start in range(0, l):
        for length in range(1, l):
            ls.append(looped[start:start+length])
    ls.append(peptide)  # add the full peptide as the last term
    return ls

You can use this one:
>>> aa = 'ABCD'
>>> F = []
>>> B = []
>>> for j in range(1, len(aa)+1, 1):
...     for i in range(0, len(aa), 1):
...         A = str.split(((aa*j)[i:i+j]))
...         B = B + A
...
>>> C = (B[0:len(aa)*len(aa)-len(aa)+1])
It gives you:
C = ['A', 'B', 'C', 'D', 'AB', 'BC', 'CD', 'DA', 'ABC', 'BCD', 'CDA', 'DAB', 'ABCD']
I hope this helps. By the way, I'm doing the Coursera course too; if you would be interested in joining forces, let me know.
