From any *.fasta DNA sequence (containing only the characters 'ACTG') I must find all subsequences that contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4 or more, and then select from those only the shortest possible ones that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. It may not be as fast or as neat as chrisz's answer, but it may be a little simpler for beginners to read and understand.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            # letter already seen in this window
            pass
        j+=1
    if len(letters)==0:
        # join the collected letters so the result is a string
        toSave.append(''.join(seq))
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, yield the substring. Then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
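For long sequences, calling count on the whole window for every letter gets expensive. Here is a sketch of the same sliding-window idea that keeps running letter counts instead. The name shortest_windows and the alphabet parameter are my own, not from the answers above, and it assumes the sequence has already been read from the FASTA file into a plain string:

```python
from collections import Counter

def shortest_windows(seq, alphabet="ACGT"):
    """Yield, for each start position, the shortest substring with every letter."""
    counts = Counter()        # how often each letter occurs in the current window
    missing = len(alphabet)   # letters of the alphabet not yet in the window
    start = 0
    for end, ch in enumerate(seq):
        if ch in alphabet:
            if counts[ch] == 0:
                missing -= 1
            counts[ch] += 1
        # While the window is complete, report it and shrink it from the left.
        while missing == 0:
            yield seq[start:end + 1]
            left = seq[start]
            if left in alphabet:
                counts[left] -= 1
                if counts[left] == 0:
                    missing += 1
            start += 1

print(list(shortest_windows('AAGTCCTAG')))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```

Each character enters and leaves the window at most once, so the bookkeeping is linear in the sequence length (the slicing for the output is extra).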
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we require in each substring (d = ACGT)
The while loop examines successively longer prefixes of s for as long as c does not exceed the length of s.
It does this by increasing c by 1 on each iteration of the while loop.
If every character of d (ACGT) exists in the prefix, we print the result, reset c to its default value and drop the first character of s.
The loop ends when c exceeds the length of s, that is, when no remaining prefix of s contains every character of d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']
I am trying to write a loop that generates strings. What I want is a small collection of strings ranging from 1 character up to 5 characters in length.
So, starting from the string 1, I want to go up to 55555. With digits alone this seems easy because I can just add numbers, but with alphanumeric characters it gets tricky.
Here is the explanation:
I have a collection of alphanumeric characters as a string, s = "123ABC". I want to create all possible 1-character strings from it, giving 1, 2, 3, A, B, C. After that I want to increase the string length by one to get 11, 12, 13 and so on, through every possible combination up to CA, CB, CC, and continue in this way up to CCCCCC. I am confused about the loop: I can generate a temporary string, but looping inside it to rotate the characters is tricky.
This is what I have done so far:
i = 0
strr = "123ABC"
while i < len(strr):
    t = strr[0] * (i+1)
    for q in range(0, len(t)):
        # Here I need help to rotate more
        pass
    i += 1
Can anyone explain this to me, or point me to a resource where I can find a solution?
You may want to use the itertools.permutations function:
import itertools
chars = '123ABC'
for i in xrange(1, len(chars)+1):
    print list(itertools.permutations(chars, i))
EDIT:
To get a list of strings, try this:
import itertools
chars = '123ABC'
strings = []
for i in xrange(1, len(chars)+1):
    strings.extend(''.join(x) for x in itertools.permutations(chars, i))
This is a nested loop. Different depths of recursion produce all possible combinations.
strr = "123ABC"
def prod(items, level):
    if level == 0:
        yield []
    else:
        for first in items:
            for rest in prod(items, level-1):
                yield [first] + rest
for ln in range(1, len(strr)+1):
    print("length:", ln)
    for s in prod(strr, ln):
        print(''.join(s))
This is also called the Cartesian product, and there is a corresponding function in itertools.
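For reference, a minimal sketch of the itertools.product version. The helper name all_strings and the max_len parameter are made up for illustration. Unlike permutations, product allows a character to repeat, which is what strings such as 11 or CC need:

```python
from itertools import product

def all_strings(chars, max_len):
    """Yield every string over chars with length 1..max_len, shortest first."""
    for length in range(1, max_len + 1):
        # product(chars, repeat=length) is the Cartesian product chars x ... x chars
        for combo in product(chars, repeat=length):
            yield "".join(combo)

strings = list(all_strings("123ABC", 2))
print(len(strings))    # 6 one-character strings + 36 two-character strings = 42
print(strings[:8])     # ['1', '2', '3', 'A', 'B', 'C', '11', '12']
```

The number of strings grows as 6 + 6**2 + ... + 6**max_len, so for larger lengths it is probably better to iterate over the generator than to build the full list.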
In a numerical sequence (e.g. a one-dimensional array) I want to find different patterns of numbers and count the occurrences of each pattern separately. The numbers can occur repeatedly, but only the basic pattern is important.
# Example signal (1d array)
a = np.array([1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1])
# Search for these exact following "patterns": [1,2,1], [1,2,3], [3,2,1]
# Count the number of pattern occurrences
# [1,2,1] = 2 (occurs 2 times)
# [1,2,3] = 1
# [3,2,1] = 1
I came across the Knuth-Morris-Pratt string matching recipe (http://code.activestate.com/recipes/117214/), which gives me the index of the searched pattern.
for s in KnuthMorrisPratt(list(a), [1,2,1]):
    print(s)
The problem is that I don't know how to handle the case where the pattern [1,2,1] "hides" in the sequence [1,2,2,2,1]. I need a way to collapse such runs of repeated numbers in order to get to [1,2,1]. Any ideas?
I don't use NumPy and I am quite new to Python, so there might be a better and more efficient solution.
I would write a function like this:
def dac(data, pattern):
    count = 0
    for i in range(len(data)-len(pattern)+1):
        tmp = data[i:(i+len(pattern))]
        if tmp == pattern:
            count +=1
    return count
If you want to ignore repeated numbers in the middle of your pattern:
def dac(data, pattern):
    count = 0
    for i in range(len(data)-len(pattern)+1):
        tmp = [data[i], data [i+1]]
        try:
            for j in range(len(data)-i):
                print(i, i+j)
                if tmp[-1] != data[i+j+1]:
                    tmp.append(data[i+j+1])
                if len(tmp) == len(pattern):
                    print(tmp)
                    break
        except IndexError:
            # ran past the end of data before tmp was complete
            pass
        if tmp == pattern:
            count +=1
    return count
Hope that might help.
Here's a one-liner that will do it
import numpy as np
a = np.array([1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1])
p = np.array([1,2,1])
num = sum(1 for k in
          [a[j:j+len(p)] for j in range(len(a) - len(p) + 1)]
          if np.array_equal(k, p))
The innermost part is a list comprehension that generates all pieces of the array that are the same length as the pattern. The outer part sums 1 for every element of this list which matches the pattern.
The only way I could think of to solve your problem with subpattern matching was to use a regex.
The following demonstrates finding, for example, the sequence [1,2,1] in list1:
import re
list1 = [1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1]
str_list = ''.join(str(i) for i in list1)
print re.findall(r'1+2+1', str_list)
This will give you as a result:
>>> print re.findall(r'1+2+1', str_list)
['1122221', '1121']
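An alternative to the regex is to collapse each run of repeated numbers to a single value first and then count plain matches. This is only a sketch: collapse_runs and count_pattern are names I made up, and it assumes the pattern itself contains no consecutive repeats. It works on lists, so a NumPy array would need a.tolist() first:

```python
from itertools import groupby

def collapse_runs(values):
    """Reduce every run of equal neighbours to one value: [1,2,2,2,1] -> [1,2,1]."""
    return [key for key, _ in groupby(values)]

def count_pattern(values, pattern):
    """Count (possibly overlapping) occurrences of pattern in the collapsed values."""
    collapsed = collapse_runs(values)
    n = len(pattern)
    return sum(1 for i in range(len(collapsed) - n + 1)
               if collapsed[i:i + n] == list(pattern))

a = [1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1]
print(collapse_runs(a))             # [1, 2, 1, 2, 1, 2, 3, 2, 1]
print(count_pattern(a, [1, 2, 1]))  # 2
print(count_pattern(a, [1, 2, 3]))  # 1
print(count_pattern(a, [3, 2, 1]))  # 1
```

Note that matches may overlap here: the two [1,2,1] hits share the middle run of 1s.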
Okay, I was working on a code that would give every possible combination of the scrambled letters you input. Here it is:
import random, math
words = []
original = raw_input("What do you need scrambled? ")
def word_scramble(scrambled):
    original_length = len(scrambled)
    loops = math.factorial(original_length)
    while loops > 0:
        new_word = []
        used_numbers = []
        while len(new_word) < original_length:
            number = random.randint(0, original_length - 1)
            while number in used_numbers:
                number = random.randint(0, original_length - 1)
            while number not in used_numbers:
                used_numbers.append(number)
                new_word.append(scrambled[number])
        if new_word not in words:
            words.append(("".join(str(x) for x in new_word)))
            loops -= 1
word_scramble(original)
print ("\n".join(str(x) for x in words))
The problem is, it still gives duplicates, even though it isn't supposed to. For instance, I can input "imlk" and will sometimes get "milk" twice, while still only giving me 24 permutations, meaning some permutations are being excluded. The:
if new_word not in words:
    words.append(("".join(str(x) for x in new_word)))
    loops -= 1
is supposed to prevent duplicates from being in the list. So I'm not really sure what the issue is. And sorry that the main title of the question was so vague/weird. I wasn't really sure how to phrase it better.
How about itertools.permutations?
import itertools
original = raw_input("What do you need scrambled? ")
result = [''.join(s) for s in itertools.permutations(original)]
if new_word not in words:
    words.append(("".join(str(x) for x in new_word)))
    loops -= 1
new_word is a list of letters, but words contains strings not lists of letters. They're in different formats, so the check will always succeed.
For instance, you might get the check:
if ['m', 'i', 'l', 'k'] not in ['imlk', 'milk', 'klim']
rather than
if 'milk' not in ['imlk', 'milk', 'klim']
By the way, your algorithm is going to scale very badly as the number of letters grows. It relies on randomly stumbling upon unused words, which is fast at first but slows down as more words are used up.
You'll be better off if you can figure out a way to enumerate the permutations in a predictable order without guessing.
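One way to enumerate without guessing, as a rough sketch (permutations_in_order is a name I made up, and this is not the only possible ordering): pick each letter in turn as the first one and recurse on the remaining letters.

```python
def permutations_in_order(letters):
    """Yield every ordering of the string letters, in a fixed, repeatable order."""
    if len(letters) <= 1:
        yield letters
        return
    for i, ch in enumerate(letters):
        # Every permutation starting with ch is ch plus a permutation of the rest.
        rest = letters[:i] + letters[i + 1:]
        for tail in permutations_in_order(rest):
            yield ch + tail

words = list(permutations_in_order("imlk"))
print(len(words))        # 24
print(len(set(words)))   # 24, so there are no duplicates for distinct letters
print("milk" in words)   # True
```

If the input has repeated letters, this yields repeated words; wrapping the result in set() would remove them.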
From the doc:
itertools.permutations(iterable[, r])
Return successive r length permutations of elements in the iterable.
If r is not specified or is None, then r defaults to the length of the
iterable and all possible full-length permutations are generated.
import itertools
word=raw_input()
scramble_all = [''.join(p) for p in itertools.permutations(word)]
print scramble_all
Output:
['milk', 'mikl', 'mlik', 'mlki', 'mkil', 'mkli', 'imlk', 'imkl',
'ilmk', 'ilkm', 'ikml', 'iklm', 'lmik', 'lmki', 'limk', 'likm',
'lkmi', 'lkim', 'kmil', 'kmli', 'kiml', 'kilm', 'klmi', 'klim']
Disregarding your original question, I can provide a better solution for generating all combinations. Just as a reminder, there is a difference between all possible combinations and permutations.
A combination is where the ordering of a subset of elements from some set does not matter. For example, "bac" and "abc" are equal combinations. The opposite is true for a permutation where order does matter, "abc" and "bac" are not equal permutations. Combinations are all possible subsets of a set, while permutations are all possible orderings of a set.
With that in mind, here is an algorithm to generate all possible combinations (subsets) of a set.
def all_subsets( L ):
    if L == []:
        return []
    result = []
    for i in range( len ( L ) ):
        result += [ [ L[ i ] ] ]
        result += [ e + [ L[ i ] ] for e in all_subsets( L[ i+1 : ] ) ]
    return result
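For comparison, the same non-empty subsets can be produced with itertools.combinations. This is a sketch, and all_subsets_iter is a made-up name; the subsets come out grouped by size and in input order, which differs from the ordering of the recursive version above:

```python
from itertools import combinations

def all_subsets_iter(items):
    """Yield every non-empty subset of items as a list, smallest subsets first."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield list(combo)

print(list(all_subsets_iter(['a', 'b', 'c'])))
# [['a'], ['b'], ['c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
```

A set of n elements has 2**n - 1 non-empty subsets, so 7 for three letters.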
I would like to find a way to generate the possible products of the given list.
Basically, an item can be followed by any item whose number is the next one up, such as A1A2, A2C3..., wrapping around circularly, as in A3D1, D3B1... I have an example below.
An example:
the_list=['A1','A2','A3','B1','B2','B3','C1','C2','C3','D1','D2','D3']
The results should be:
['A1A2','A1B2','A1C2','A1D2','A2A3','A2B3','A2C3','A2D3','A3A1','A3B1','A3C1','A3D1',
'B1A2','B2A3'...
'C1A2'...]
So far, I tried this :
the_list=['A1','A2','A3','B1','B2','B3','C1','C2','C3','D1','D2','D3']
result=[]
for i in range(len(the_list)):
    for k in range((i%3+1),len(the_list)+1,3):
        s=str(the_list[i])+str(the_list[k%len(the_list)])
        result.append(s)
Output:
['A1A2', 'A1B2', 'A1C2', 'A1D2', 'A2A3', 'A2B3', 'A2C3', 'A2D3', 'A3B1', 'A3C1',
'A3D1', 'A3A1', 'B1A2', 'B1B2', 'B1C2', 'B1D2', 'B2A3', 'B2B3', 'B2C3', 'B2D3', 'B3B1',
'B3C1', 'B3D1', 'B3A1', 'C1A2', 'C1B2', 'C1C2', 'C1D2', 'C2A3', 'C2B3', 'C2C3', 'C2D3',
'C3B1', 'C3C1', 'C3D1', 'C3A1', 'D1A2', 'D1B2', 'D1C2', 'D1D2', 'D2A3', 'D2B3', 'D2C3',
'D2D3', 'D3B1', 'D3C1', 'D3D1', 'D3A1']
This works fine, but I want to make it more scalable. So far it generates sequences of two items, like A1A2, A1D2... How can I change my code to make it scalable? If the scale is 3, it should generate A1A2A3,... in the same manner.
Update: I think there should be one more for loop that takes care of the size and accumulates the sequence based on that number, but so far I could not figure out how.
Use numbers = IT.cycle(numbers) to generate the sequence of valid
numbers. By making it a cycle, you do not have to treat 1 following 3 any different than 2 following 1.
The letters in each item can be generated by itertools.product. The
repeat parameter is especially useful here. It will allow you to,
as you say, "scale" the generator to longer sequences with no
additional effort.
You can use zip to combine the letters generated by
itertools.product (called lets below) with the numbers from
itertools.cycle.
''.join(IT.chain.from_iterable(...)) is just a way to join the list of
tuples returned by zip into a string.
import collections
import itertools as IT
def neighbor_product(letters, numbers, repeat = 2):
    N = len(numbers)
    numbers = collections.deque(numbers)
    for lets in IT.product(letters, repeat = repeat):
        for i in range(N):
            yield ''.join(IT.chain.from_iterable(zip(lets, IT.cycle(numbers))))
            numbers.rotate(-1)
letters = 'ABCD'
numbers = '123'
for item in neighbor_product(letters, numbers, repeat = 3):
    print(item)
yields
A1A2A3
A2A3A1
A3A1A2
A1A2B3
...
D3D1C2
D1D2D3
D2D3D1
D3D1D2
I think this is what you're after.
import itertools
def products(letters='ABCD', N=3, scale=2):
    for lets in itertools.product(letters, repeat=scale):
        for j in xrange(N):
            yield ''.join('%s%d' % (c, (i + j) % N + 1)
                          for i, c in enumerate(lets))
print list(products(scale=3))
result=[i+j for i in the_list for j in the_list]
I think what you are looking for is a subset of all possible permutations. itertools.permutations() returns a generator of all permutations as n-sized tuples for a given sequence and a given length n. I would iterate over all permutations and filter them based on your criteria. What about this:
In [1]: from itertools import permutations
In [2]: the_list = ['A1','A2','A3','B1','B2','B3','C1','C2','C3','D1','D2','D3']
In [3]: results = []
In [4]: digits = sorted(set(elem[1] for elem in the_list))
In [5]: digits
Out[5]: ['1', '2', '3']
In [6]: for perm in permutations(the_list, 2):
   ....:     if (int(perm[0][1])+1 == int(perm[1][1]) or
   ....:             (perm[0][1] == digits[-1] and perm[1][1] == digits[0])):
   ....:         results.append(''.join(perm))
In [7]: sorted(results)
Out[7]: ['A1A2', 'A1B2', 'A1C2', 'A1D2', 'A2A3', 'A2B3', 'A2C3',
'A2D3', 'A3A1', 'A3B1', 'A3C1', 'A3D1', 'B1A2', 'B1B2',
'B1C2', 'B1D2', 'B2A3', 'B2B3', 'B2C3', 'B2D3', 'B3A1',
'B3B1', 'B3C1', 'B3D1', 'C1A2', 'C1B2', 'C1C2', 'C1D2',
'C2A3', 'C2B3', 'C2C3', 'C2D3', 'C3A1', 'C3B1', 'C3C1',
'C3D1', 'D1A2', 'D1B2', 'D1C2', 'D1D2', 'D2A3', 'D2B3',
'D2C3', 'D2D3', 'D3A1', 'D3B1', 'D3C1', 'D3D1']
Since not every combination is valid according to your specifications, checking every possible combination and discarding invalid ones may not be the best approach here.
Since the number of each item is important for determining whether a combination is valid, let's create a lookup table for that first:
from collections import defaultdict
lookup = defaultdict(list)
for item in the_list:
    lookup[int(item[1])].append(item)
This makes it easy to get all items with a specific number (which will be useful when we want to get the items for consecutive numbers):
lookup[1] == ['A1', 'B1', 'C1', 'D1']
Creating all valid combinations can now be done as follows:
from itertools import product
def valid_combinations(lookup):
    min_number = min(lookup)
    max_number = max(lookup)
    for number in lookup:
        # Let's just assume here that we've only got consecutive numbers, no gaps:
        next_number = min_number if number == max_number else number + 1
        for combination in product(lookup[number], lookup[next_number]):
            yield ''.join(combination)
To allow any number of items to be chained, we'll need to modify that a bit:
def valid_combinations(lookup, scale = 2):
    min_number = min(lookup)
    max_number = max(lookup)
    def wrap_number(n):
        while n > max_number:
            n -= max_number + 1 - min_number
        return n
    for number in lookup:
        numbers = list(wrap_number(n) for n in range(number, number + scale))
        items = [lookup[n] for n in numbers]
        for combination in product(*items):
            yield ''.join(combination)
For scale 5, this would produce the following results (showing only the first few of a total of 3072):
['A1A2A3A1A2', 'A1A2A3A1B2', 'A1A2A3A1C2', 'A1A2A3A1D2', 'A1A2A3B1A2', ...]
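The total of 3072 is 3 starting numbers times 4**5 letter choices. As a cross-check, here is a compact sketch that uses modular arithmetic instead of the lookup table. The function name chained and its parameters are made up, and it assumes the numbers are exactly 1..n_numbers with no gaps:

```python
from itertools import product

def chained(letters="ABCD", n_numbers=3, scale=2):
    """Yield chains of `scale` items whose numbers increase by one, wrapping around."""
    for start in range(n_numbers):
        # Numbers for each position, e.g. scale 5 from start 0 gives [1, 2, 3, 1, 2].
        numbers = [(start + k) % n_numbers + 1 for k in range(scale)]
        for lets in product(letters, repeat=scale):
            yield "".join("%s%d" % (c, n) for c, n in zip(lets, numbers))

results = list(chained(scale=5))
print(len(results))    # 3 * 4**5 = 3072
print(results[:3])     # ['A1A2A3A1A2', 'A1A2A3A1B2', 'A1A2A3A1C2']
```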