Looking for elegant glob-like DNA string expansion - python

I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.
The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.
For example, say I have the string:
ATMM
I would like to take this string as input and output the four possible matching strings:
ATAA
ATAC
ATCA
ATCC
Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.
Thank you for any advice.
Edit, thanks cortex for the product operator. This is my solution:
Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.
import sys
from itertools import product
baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                Y=['C','T'],K=['G','T'],V=['A','C','G'],
                H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])
def glob(str):
    strings = [str]
    ## this loop visits every possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base
                                             else [x] for x in string]))
    return strings
for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x

Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):
from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))
Full solution with input handling:
import sys
from itertools import product
base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],
              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }
def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))
for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)
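For readers on Python 3, where print is a function and map returns a lazy iterator, the same idea might be sketched as below. The names IUPAC_CODES and expand are illustrative and not from the answer above; the table is the same one, written with strings instead of lists.

```python
from itertools import product

# Same ambiguity table as above (IUPAC_CODES is a hypothetical name).
IUPAC_CODES = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}

def expand(sequence):
    """Lazily yield every concrete sequence matching an ambiguous one."""
    # Each position contributes its allowed bases; plain bases map to themselves.
    choices = [IUPAC_CODES.get(base, base) for base in sequence]
    for combo in product(*choices):
        yield "".join(combo)

print(list(expand("ATMM")))  # ['ATAA', 'ATAC', 'ATCA', 'ATCC']
```

Because expand is a generator, long inputs can be consumed one result at a time instead of being held in memory all at once.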

You could probably do something like this in Python with a generator, using yield:
def glob(str):
    if str=='':
        yield ''
        return
    if str[0]!='M':
        for tail in glob(str[1:]):
            yield str[0] + tail
    else:
        # M stands for A or C
        for c in ['A','C']:
            for tail in glob(str[1:]):
                yield c + tail
    return
EDIT: As correctly pointed out, I was making a few mistakes. The version above is one I tried out, and it works.

This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.
I believe what you're looking for is "how to generate permutations".

You could for example do this recursively. Pseudo-code:
printSequences(sequence s)
    switch "first special character in sequence"
        case ...
        case M:
            s1 = s, but first M replaced with A
            printSequences(s1)
            s2 = s, but first M replaced with C
            printSequences(s2)
        case none:
            print s;
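The pseudo-code above might translate to Python roughly as follows. This is only a sketch: CODES and sequences are made-up names, and the table is deliberately partial (the "case ..." branch would cover the remaining codes).

```python
# Assumed, partial ambiguity table; extend it with the other codes as needed.
CODES = {"M": "AC", "R": "AG", "Y": "CT"}

def sequences(s):
    """Yield expansions by replacing the first special character and recursing."""
    for i, ch in enumerate(s):
        if ch in CODES:
            for base in CODES[ch]:
                # s, but with the first special character replaced by one concrete base
                yield from sequences(s[:i] + base + s[i + 1:])
            return
    # case none: no special characters left, so s is a finished sequence
    yield s

print(list(sequences("ATMM")))  # ['ATAA', 'ATAC', 'ATCA', 'ATCC']
```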

Regexps match strings, they're not intended to be turned into every string they might match.
Also, you're looking at a lot of strings being output from this - for instance:
MMMMMMMMMMMMMMMM (16 M's)
produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.
Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.
Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)
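To make both points concrete, here is a hedged sketch: it counts the expansions without generating them, and it turns the ambiguous sequence into a regular expression that can be matched against a target directly. expansion_count and to_regex are hypothetical helper names, and math.prod assumes Python 3.8 or later.

```python
import re
from math import prod

CODES = {"M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
         "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT"}

def expansion_count(pattern):
    """Number of concrete strings the pattern stands for, without generating them."""
    return prod(len(CODES.get(ch, ch)) for ch in pattern)

def to_regex(pattern):
    """Compile the pattern into a regex with one character class per ambiguous base."""
    return re.compile("".join(
        "[" + CODES[ch] + "]" if ch in CODES else re.escape(ch) for ch in pattern))

print(expansion_count("M" * 16))                  # 65536
print(to_regex("ATMM").findall("GGATACTTATCC"))   # ['ATAC', 'ATCC']
```

Note that findall reports non-overlapping matches only; overlapping hits would need a lookahead or a manual scan.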

Related

Generate wordlist with known characters

I'm looking to write a piece of code in Javascript or Python that generates a wordlist file out of a pre-defined combination of characters.
E.g.
input = abc
output =
ABC
abc
Abc
aBc
abC
AbC
ABc
aBC
I have very basic knowledge of either so all help is appreciated.
Thank you
I'll assume that you're able to import Python packages. Therefore, take a look at itertools.product:
This tool computes the cartesian product of input iterables.
For example, product(A, B) returns the same as ((x,y) for x in A for y in B).
It looks quite like what you're looking for, right? That's every possible combination from two different lists.
Since you're new to Python, I'll assume you don't know what a map is. Nothing too hard to understand:
Returns a list of the results after applying the given function to each item of a given iterable (list, tuple etc.)
That's easy! So the first parameter is the function you want to apply and the second one is your iterable.
The function I applied in the map is as follows:
''.join
This way you set '' as your separator (basically no separator at all) and put together every character with .join.
Why would you want to put together the characters? Well, you'll have a list (a lot of them in fact) and you want a string, so you better put those chars together in each list.
Now here comes the hard part, the iterable inside the map:
itertools.product(*((char.upper(), char.lower()) for char in string))
First of all notice that * is the so-called splat operator in this situation. It splits the sequence into separate arguments for the function call.
Now that you know that, let's dive into the code.
Your (A, B) for itertools.product(A, B) are now (char.upper(), char.lower()). That's both versions of char, upper and lowercase. And what's char? It's an auxiliary variable that takes the value of each character in the given string, one at a time.
Therefore for input 'abc' char will take values a, b and c while in the loop, but since you're asking for every possible combination of uppercase and lowercase char you'll get exactly what you asked for.
I hope I made everything clear enough. :)
Let me know if you need any further clarification in the comments. Here's a working function based on my previous explanation:
import itertools
def func():
    string = input("Introduce some characters: ")
    output = map(''.join, itertools.product(*((char.upper(), char.lower()) for char in string)))
    print(list(output))
As an additional note, if you printed output you wouldn't get your desired output, you have to turn the map type into a list for it to be printable.
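One wrinkle the explanation above does not cover: digits and punctuation have identical upper and lower forms, so they would produce duplicate words. A possible variant that takes the string as a parameter and drops those duplicates is sketched below; case_variants is an illustrative name.

```python
from itertools import product

def case_variants(text):
    """Return every distinct upper/lower-case combination of text."""
    # dict.fromkeys keeps order and collapses the pair when upper == lower (e.g. '1').
    options = [tuple(dict.fromkeys((ch.upper(), ch.lower()))) for ch in text]
    return ["".join(combo) for combo in product(*options)]

print(len(case_variants("abc")))  # 8
print(case_variants("a1"))        # ['A1', 'a1']
```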
A simple approach using generators, and no library code. It returns a generator (iterator-like object), but can be converted to a list easily.
def lU(s):
    if not s:
        yield ''
    else:
        for sfx in lU(s[1:]):
            yield s[0].upper() + sfx
            yield s[0].lower() + sfx
print list(lU("abc"))
Note that all the sub-lists of suffixes are not fully expanded, but the number of generator objects (each a constant size) that get generated is proportional to the length of the string.

Why doesn't str(a) == reversed(str(a)) work as a palindrome test in Python?

I have been trying to find the answer to problem #4 in Project Euler in Python but I really can't seem to find the problem in my code. Here is the question:
A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.
Find the largest palindrome made from the product of two 3-digit numbers.
And here is my code:
nums = list(range(10000, 998001))
pals = []
def palindrome(a):
    if str(a) == reversed(str(a)):
        pals.append(a)
for every in nums:
    palindrome(every)
For a start, try printing out the string and its supposed reversal - you'll find they're not as you expect. A more sensible way of getting the string reversal of s is with s[::-1].
Then you need to realise that you're checking every number in that range of yours, 10000..998000 (noticing I've left out 998001 there since Python ranges are exclusive at the end). Not all of those numbers will be a product of two 3-digit numbers. Of course, it may be that was going to be your next step once you'd got all the palindromes, in which case feel free to ignore this paragraph (other than fixing the range, of course).
As an aside, I probably wouldn't wrap that range in a list. If you're using Python 2, it's already a list and, for Python 3, it's probably better to leave it as a lazy iterator so as not to waste memory.
And, as a final aside, I probably wouldn't do it this way since I prefer readable code but those who demand "Pythonic" solutions may like to look at something like:
print(max((i*j
           for i in range(100,1000)
           for j in range(100,1000)
           if str(i*j) == str(i*j)[::-1]
           )))
though of course true Python aficionados will want that on a single line :-) I've just split it for readability.
Python's reversed returns an iterator object. So, no one string could be equal to the iterator.
Because reversed returns an iterator object, you should replace str(a) == reversed(str(a)) with str(a) == str(a)[::-1]. The [::-1] slice (a step of -1) returns a reversed copy of the string.
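A small demonstration of the difference, using the palindrome from the problem statement:

```python
s = str(9009)

# reversed() gives an iterator, which never compares equal to a string.
print(s == reversed(s))           # False
# Either join the iterator back into a string or slice with a step of -1.
print(s == "".join(reversed(s)))  # True
print(s == s[::-1])               # True
```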
Let me start with what everyone else pointed out: reversed(s) is not the same as s[::-1].
Everyone else wanted you to use s[::-1] instead of reversed(). Sure, that suggestion is scrutable and idiomatic. But why limit yourself! This is Python, after all.
from itertools import combinations, ifilter, imap, starmap
from operator import eq, mul
def is_palindrome(integer):
    """Returns a True if an input integer's digits are palindromic, False otherwise."""
    string = str(integer)
    return all(imap(eq, iter(string), reversed(string)))
def largest_palindrome_product(n_digits=3):
    """Solves Project Euler #4"""
    all_n_digit_numbers = xrange(10**(n_digits)-1, 10**(n_digits-1), -1)
    palindromes = ifilter(is_palindrome,
                          (starmap(mul, combinations(all_n_digit_numbers, 2)))
                          )
    return max(palindromes)
largest_palindrome_product()
This solution has the valuable feature of retaining the use of reversed()! For extra inscrutability I tried to use as many functions from itertools as I could.

Fastest way to create a list of overlapping substrings from a string in Python

I'm trying to generate a list of all overlapping n-length substrings in a given string.
For example, for an n of 6 and the string "hereismystring" I would generate the list ["hereis", "ereism", "reismy", ..., "string"]. The trivial code I'm using right now looks like this:
n = 6
l = len(string)
substrings = [string[i:(i + n)] for i in xrange(l - n + 1)]
Easy enough. Problem is, I'd like to speed this up (I have very many very long strings). Is there a faster technique in Python? Will dropping down to Cython help at all given that Python's string routines are in C anyhow?
For reference, this technique takes about 100us on my machine (a new Macbook Pro) for a 500-length string and an n of 30.
Thanks for the help in advance!
Taking a step back from the issue of which Python coding technique will be the fastest, I would approach the problem differently. Since all the strings are the same length, and all come from a single source string, why not simply work with the ranges of characters directly, rather than convert them into proper strings? You would avoid a lot of allocation and copying, but you would have to adjust your code to know that each "string" is n characters long.
In other words, just read ranges from the source string directly when you want to work with a substring. You'll be working with the characters you want as fast as they can be pulled from cache. You could express a "substring" as merely an offset into the source string.
Sometimes if you want ultra-fast performance you have to leave familiar data structures behind. Just a thought.
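One way to read that suggestion, sketched under the assumption that the downstream work is a comparison against each window: iterate over start offsets and use string methods that accept an offset, such as str.startswith(prefix, start), so no substring objects are created. window_starts and count_windows_starting_with are made-up names for illustration.

```python
def window_starts(text, n):
    """Return the start offsets of every n-length window in text."""
    return range(len(text) - n + 1)

def count_windows_starting_with(text, n, prefix):
    """Count windows that begin with prefix without slicing out each window."""
    # str.startswith(prefix, start) compares in place at the given offset.
    return sum(text.startswith(prefix, i) for i in window_starts(text, n))

text = "hereismystring"
print(list(window_starts(text, 6)))                # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(count_windows_starting_with(text, 6, "is"))  # 1
```

Whether this is actually faster depends on what is done with each window, so it would need measuring on real data.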
How about:
>>> from collections import deque
>>> d = deque("hereismystring")
>>> s = ''.join(d)[:6]
>>> while not len(s) % 6:
...     print s
...     _ = d.popleft()
...     s = ''.join(d)[:6]
...
hereis
ereism
reismy
eismys
ismyst
smystr
mystri
ystrin
string
>>>
I believe removing the first element of a deque with popleft() is O(1), while the equivalent list.pop(0) is O(n).

Python text encryption: rot13

I am currently doing an assignment that encrypts text using rot13, but some of my text won't register.
# cgi is to escape html
# import cgi
def rot13(s):
    #string encrypted
    scrypt=''
    alph='abcdefghijklmonpqrstuvwxyz'
    for c in s:
        # check if char is in alphabet
        if c.lower() in alph:
            #find c in alph and return its place
            i = alph.find(c.lower())
            #encrypt char = c incremented by 13
            ccrypt = alph[ i+13 : i+14 ]
            #add encrypted char to string
            if c==c.lower():
                scrypt+=ccrypt
            if c==c.upper():
                scrypt+=ccrypt.upper()
        #dont encrypt special chars or spaces
        else:
            scrypt+=c
    return scrypt
    # return cgi.escape(scrypt, quote = True)
given_string = 'Rot13 Test'
print rot13(given_string)
OUTPUT:
13 r
[Finished in 0.0s]
Hmmm, seems like a bunch of things are not working.
The main problem is in ccrypt = alph[ i+13 : i+14 ]: you're missing a % len(alph), so if, for example, i is 18, then i+13 falls past the end of the string.
In your output, in fact, only e is encoded to r, because it's the only letter in your test string which, moved by 13, doesn't run past the end.
The rest of this answer is just tips to clean up the code a little:
instead of alph='abc.. you can put import string at the beginning of the script and use string.lowercase (string.ascii_lowercase also works on Python 3)
instead of slicing, for a single character it's simpler to index with alph[i], which gets the work done
instead of c == c.upper(), you can use the built-in string method: if c.isupper(): ...
The trouble you're having is with your slice. It will be empty if your character is in the second half of the alphabet, because i+13 will be off the end. There are a few ways you could fix it.
The simplest might be to simply double your alphabet string (literally: alph = alph * 2). This means you can access values up to 52, rather than just up to 26. This is a pretty crude solution though, and it would be better to just fix the indexing.
A better option would be to subtract 13 from your index, rather than adding 13. Rot13 is symmetric, so both will have the same effect, and it will work because negative indexes are legal in Python (they refer to positions counted backwards from the end).
In either case, it's not actually necessary to do a slice at all. You can simply grab a single value (unlike C, there's no char type in Python, so single characters are strings too). If you were to make only this change, it would probably make it clear why your current code is failing, as trying to access a single value off the end of a string will raise an exception.
Edit: Actually, after thinking about what solution is really best, I'm inclined to suggest avoiding index-math based solutions entirely. A better approach is to use Python's fantastic dictionaries to do your mapping from original characters to encrypted ones. You can build and use a Rot13 dictionary like this:
alph="abcdefghijklmnopqrstuvwxyz"
rot13_table = dict(zip(alph, alph[13:]+alph[:13])) # lowercase character mappings
rot13_table.update((c.upper(),rot13_table[c].upper()) for c in alph) # uppercase
def rot13(s):
    return "".join(rot13_table.get(c, c) for c in s) # non-letters pass through unchanged
First thing that may have caused you some problems - your string list has the n and the o switched, so you'll want to adjust that :) As for the algorithm, when you run:
ccrypt = alph[ i+13 : i+14 ]
Think of what happens when you get 25 back from the first iteration (for z). You are now looking for the slice alph[38:39], which is far past the bounds of the 26-character string, so it returns '' (side note: for a single character you can index directly, e.g. alph[i], but an out-of-range index such as alph[38] raises an IndexError instead of returning an empty string):
In [1]: s = 'abcde'
In [2]: s[2]
Out[2]: 'c'
In [3]: s[2:3]
Out[3]: 'c'
In [4]: s[49:50]
Out[4]: ''
As for how to fix it, there are a number of interesting methods. Your code functions just fine with a few modifications. One thing you could do is create a mapping of characters that are already 'rotated' 13 positions:
alph = 'abcdefghijklmnopqrstuvwxyz'
coded = 'nopqrstuvwxyzabcdefghijklm'
All we did here is split the original list into halves of 13 and then swap them - we now know that if we take a letter like a and get its position (0), the same position in the coded list will be the rot13 value. As this is for an assignment I won't spell out how to do it, but see if that gets you on the right track (and @Makoto's suggestion is a perfect way to check your results).
This line
ccrypt = alph[ i+13 : i+14 ]
does not do what you think it does - it returns a string slice from i+13 to i+14, but if these indices are greater than the length of the string, the slice will be empty:
"abc"[5:6] #returns ''
This means your solution turns everything from n onward into an empty string, which produces your observed output.
The correct way of implementing this would be (1.) using a modulo operation to constrain the index to a valid number and (2.) using simple character access instead of string slices, which is easier to read, faster, and throws an IndexError for invalid indices, meaning your error would have been obvious.
ccrypt = alph[(i+13) % 26]
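Putting the two points together, a corrected version of the function from the question might look like the sketch below. It is one possible fix rather than the only one; it takes the alphabet from string.ascii_lowercase (which also sidesteps the swapped n and o) and builds the result from a list.

```python
import string

def rot13(s):
    """Rotate ASCII letters by 13 places; leave every other character untouched."""
    alph = string.ascii_lowercase
    out = []
    for c in s:
        i = alph.find(c.lower())
        if i == -1:
            out.append(c)                  # not a letter: copy as-is
        else:
            rotated = alph[(i + 13) % 26]  # wrap around the end of the alphabet
            out.append(rotated.upper() if c.isupper() else rotated)
    return "".join(out)

print(rot13("Rot13 Test"))  # Ebg13 Grfg
```

Since rot13 is its own inverse, applying the function twice should return the original text, which makes a convenient check.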
If you're doing this as an exercise for a course in Python, ignore this, but just saying...
>>> import codecs
>>> codecs.encode('Some text', 'rot13')
'Fbzr grkg'
>>>

python: finding a missing letter in the alphabet from a list - least lines of code

I'm trying to find the missing letter in the alphabet from the list with the least lines of code.
If the list is already sorted (using list.sort()), what is the fastest way, or the way with the fewest lines of code, to find the missing letter, given that I know exactly one letter is missing?
(This is not any type of interview questions. I actually need to do this in my script where I want to put least amount of work in this process since it will be repeated over and over indeterministically)
Some questions:
Are all the letters upper or lower case? (a/A)
Is this the only alphabet you'll want to check?
Why are you doing this so often?
Least lines of code:
# do this once, outside the loop
import string
alphabet=set(string.ascii_lowercase)
# inside the loop, just 1 line:
missingletter=(alphabet-set(yourlist)).pop()
The advantage of the above is that you can do it without having to sort the list first. If, however, the list is always sorted, you can use bisection to get there faster. On a simple 26-letter alphabet though, is there much point?
Bisection (done in ~4 lookups):
# here alphabet is the full alphabet as a string (not the set used above)
# and s is the sorted input with exactly one letter missing
frompos, topos = 0, len(s)
while frompos < topos:
    trypos = (frompos + topos) // 2
    if alphabet[trypos] != s[trypos]:
        topos = trypos          # first mismatch is at trypos or earlier
    else:
        frompos = trypos + 1    # everything up to trypos still matches
# frompos is now the index of the first mismatch, i.e. the missing letter
print(alphabet[frompos])
This code requires fewer lookups, so is by far the better scaling version O(log n), but may still be slower when executed via a python interpreter because it goes via python ifs instead of set operations written in C.
(Thanks to J.F.Sebastian and Kylotan for their comments)
Using a list comprehension:
>>> import string
>>> sourcelist = 'abcdefghijklmnopqrstuvwx'
>>> [letter for letter in string.ascii_lowercase if letter not in sourcelist]
['y', 'z']
>>>
The string module has some predefined constants that are useful.
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.octdigits
'01234567'
>>> string.digits
'0123456789'
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>>
In the too clever for its own good category, and assuming there is exactly one missing letter in a lowercase alphabet:
print chr(2847 - sum(map(ord, theString)))
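The constant is not arbitrary: 2847 is the sum of the character codes of a through z, so subtracting the codes that are present leaves the code of the absent letter. A short sketch in Python 3 syntax, with ALPHABET_SUM and missing_letter_by_sum as illustrative names:

```python
import string

# 2847 is simply the sum of the code points of 'a'..'z'.
ALPHABET_SUM = sum(map(ord, string.ascii_lowercase))
print(ALPHABET_SUM)  # 2847

def missing_letter_by_sum(letters):
    """Assumes exactly one lowercase ASCII letter is absent and none is repeated."""
    return chr(ALPHABET_SUM - sum(map(ord, letters)))

print(missing_letter_by_sum("abcdefghijklmnoprstuvwxyz"))  # q
```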
[Edit]
I've run some timings on the various solutions to see which is faster.
Mine turned out to be fairly slow in practice (slightly faster if I use itertools.imap instead).
Surprisingly, the listcomp solution by monkut turned out to be fastest - I'd have expected the set solutions to do better, as this must scan the list each time to find the missing letter.
I tried first converting the test list to a set in advance of membership checking, expecting this to speed it up but in fact it made it slower. It looks like the constant factor delay in creating the set dwarfs the cost of using an O(n**2) algorithm for such a short string.
That suggested that an even more basic approach, taking advantage of early exiting, could perform even better. The below is what I think currently performs best:
def missing_letter_basic(s):
    for letter in string.ascii_lowercase:
        if letter not in s: return letter
    raise Exception("No missing letter")
The bisection method is probably best when working with larger strings however. It is only just edged out by the listcomp here, and has much better asymptotic complexity, so for strings larger than an alphabet, it will clearly win.
[Edit2]
Actually, cheating a bit, I can get even better than that, abusing the fact that there are only 26 strings to check, behold the ultimate O(1) missing letter finder!
find_missing_letter = dict((string.ascii_lowercase[:i]+string.ascii_lowercase[i+1:],
string.ascii_lowercase[i]) for i in range(26)).get
>>> find_missing_letter('abcdefghijklmnoprstuvwxyz')
'q'
Here are my timings (500000 runs, tested with letters missing near the start, middle and end of the string: b, m and y):
                 "b"     "m"     "y"
bisect         : 2.762   2.872   2.922   (Phil H)
find_gap       : 3.388   4.533   5.642   (unwind)
listcomp       : 2.832   2.858   2.822   (monkut)
listcomp_set   : 4.770   4.746   4.700   As above, with sourcelist=set(sourcelist) first
set_difference : 2.924   2.945   2.880   (Phil H)
sum            : 3.815   3.806   3.868
sum_imap       : 3.284   3.280   3.260
basic          : 0.544   1.379   2.359
dict_lookup    : 0.135   0.133   0.134
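The benchmark code itself is not shown, so the harness below is only an assumed reconstruction of how such a comparison could be run with timeit on Python 3. It covers just two of the approaches, uses far fewer runs than the table, and its numbers will differ from machine to machine; missing_letter_set is a hypothetical name.

```python
import string
import timeit

def missing_letter_basic(s):
    for letter in string.ascii_lowercase:
        if letter not in s:
            return letter
    raise ValueError("No missing letter")

def missing_letter_set(s, alphabet=frozenset(string.ascii_lowercase)):
    # Set difference leaves only the absent letter.
    return next(iter(alphabet - set(s)))

for missing in "bmy":
    sample = string.ascii_lowercase.replace(missing, "")
    for func in (missing_letter_basic, missing_letter_set):
        assert func(sample) == missing
        seconds = timeit.timeit(lambda: func(sample), number=10000)
        print(missing, func.__name__, round(seconds, 4))
```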
Here's one way of doing it, assuming your "alphabet" consists of integers and that the list has at least two items:
for i in xrange(1, len(a)):
    if a[i] != a[i - 1] + 1:
        print a[i - 1] + 1, "is missing"
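The same gap scan can be adapted to letters by comparing character codes. A minimal sketch in Python 3, with find_gap as an illustrative name; note that, like the integer version, it cannot detect a letter missing from the very start or end of the run.

```python
def find_gap(letters):
    """Return the first letter skipped in a sorted run of consecutive letters, or None."""
    for prev, cur in zip(letters, letters[1:]):
        if ord(cur) != ord(prev) + 1:
            return chr(ord(prev) + 1)
    return None

print(find_gap("abdef"))  # c
```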
With sorted lists a binary search is usually the fastest algorithm. Could you please provide an example list and an example "missing alphabet"?
If you're talking about an alphabet of letters:
letterSet = set()
for word in wordList:
    letterSet.update(set(word.lower()))
import string
alphabet = set(string.lowercase)
missingLetters = alphabet.difference(letterSet)
class MissingFinder(object):
    "A simplified missing items locator"
    def __init__(self, alphabet):
        "Store a set from our alphabet"
        self.alphabet= set(alphabet)
    def missing(self, sequence):
        "Return set of missing letters; sequence not necessarily set"
        return self.alphabet.difference(sequence)
>>> import string
>>> finder= MissingFinder(string.ascii_lowercase)
>>> finder.missing(string.ascii_lowercase[:5] + string.ascii_lowercase[6:])
set(['f'])
>>> # rinse, repeat calling finder.missing
I'm sure the class and instance names could be improved :)
