Trying to convert an integer into an ACGT DNA sequence


I am trying to reverse my stringtobin function so that bintostring([3]) returns "AAAT", where A=0, C=1, G=2, T=3. For example, stringtobin("CCCC") returns 85 because (1 * 64) + (1 * 16) + (1 * 4) + (1 * 1) = 85. Right now my bintostring function just returns an empty string.
dna = {'A':0, 'C':1, 'G':2, 'T':3}
dna2 = {0:'A', 1:'C', 2:'G', 3:'T'}
def bintostring(num):
    seq = []
    nums = [64,16,4,1]
    #main while
    i = 0
    while i<len(num):
        #nums while (iterate through nums)
        k = 0
        while k<len(nums):
            #dna2 while (iterate through dna2)
            x = 0
            while x<len(dna2):
                check = 0
                if num[i]//nums[k] == dna2[x]:
                    seq.append(dna2[x])
                    check+=1
                elif check>0:
                    seq.append('A')
                x+=1
            k+=1
        i+=1
    return("".join(seq))
print(bintostring([3]))
def stringtobin(seq):
    power_of_4 = 1
    num = 0
    if len(seq)!=4: return None
    i = len(seq)-1
    while i>=0:
        power_of_4*=4
        Digitval = dna[seq[i]]
        num+=Digitval*power_of_4//4
        i-=1
    return num
print(stringtobin("AAAT"))

Your encoding is in base 4 which can't hold the length information of your sequence.
Without the length information the encoded value 3 could mean T or TA or TAAA or TAAAA... (there would be no way to know).
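If the sequences are always 4 letters long (or the length is stored/provided separately), you can implement the functions like this. Note that in this version the first letter is the least significant digit, so "AAAT" encodes to 192 rather than 3: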
def stringToBin(S):
    return sum( 4**i*"ACGT".index(p) for i,p in enumerate(S))
def binToString(N,size=4):
    result = ""
    for _ in range(size):
        N,p = divmod(N,4)
        result += "ACGT"[p]
    return result
print(stringToBin("AAAT")) # 192
print(binToString(192)) # AAAT
print(stringToBin("TA")) # 3
print(stringToBin("TAAA")) # 3
print(binToString(3)) # TAAA
print(binToString(3,2)) # TA (length has to be supplied separately)
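The question's weights [64,16,4,1] treat the first letter as the most significant digit, which is the opposite of the ordering used in the functions just shown. A minimal sketch of the same fixed-width idea with the question's ordering follows; the names ALPHABET, string_to_int and int_to_string are illustrative and not taken from either post.

```python
ALPHABET = "ACGT"  # A=0, C=1, G=2, T=3, as in the question

def string_to_int(seq):
    """Encode a DNA string, first letter most significant."""
    value = 0
    for letter in seq:
        value = value * 4 + ALPHABET.index(letter)
    return value

def int_to_string(value, size=4):
    """Decode back to a string of exactly `size` letters."""
    letters = []
    for _ in range(size):
        value, digit = divmod(value, 4)   # peel off the lowest base-4 digit
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))     # lowest digit was collected first

print(string_to_int("AAAT"))  # 3
print(string_to_int("CCCC"))  # 85
print(int_to_string(3))       # AAAT
print(int_to_string(85))      # CCCC
```

The length still has to be known separately, exactly as with the little-endian version.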
If you want your numeric encoding to also carry the length information, you should make it base 5 and use a non-zero value for each letter. This way, TA and TAAA would give different numbers.
def stringToBin(S):
    return sum( 5**i*" ACGT".index(p) for i,p in enumerate(S))
def binToString(N):
    result = ""
    while N:
        N,p = divmod(N,5)
        result += " ACGT"[p]
    return result
print(stringToBin("TA")) # 9
print(stringToBin("TAAA")) # 159
print(binToString(9)) # TA
print(binToString(159)) # TAAA
Obviously this produces larger numbers, so a 32-bit unsigned integer will only hold 13 letters as opposed to 16 in base 4. If you're doing this to reduce storage size, text compression (e.g. zip) will probably be more efficient than converting to a fixed-base binary representation.
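The 13-versus-16 figure can be checked with a few lines of arithmetic. In both schemes the largest code for k letters is base**k - 1, so k letters fit as long as base**k does not exceed 2**32. The helper name max_letters is made up for this sketch.

```python
def max_letters(base, bits=32):
    """Largest k such that every k-letter code fits in an unsigned `bits`-bit integer."""
    count = 0
    # k letters need values up to base**k - 1, which fits iff base**k <= 2**bits
    while base ** (count + 1) <= 2 ** bits:
        count += 1
    return count

print(max_letters(4))  # 16
print(max_letters(5))  # 13
```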

Your attempt seems inordinately complex. Just map the bottom two bits to a value, then shift them off.
def bintostring(num):
    seq = []
    for n in num:
        subseq = []
        for b in range(4):
            subseq.append(dna2[n & 3])
            n >>= 2
        seq.append("".join(reversed(subseq)))
    return seq
In case it's not obvious, & is bitwise AND; value & 3 obtains the bottom two bits of value.
The stringtobin function could be similarly simplified. Demo: https://ideone.com/RlzegN
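One way the simplification of stringtobin might look, as a sketch rather than the code behind the linked demo: shift the accumulated value left by two bits and OR in the next letter's value. The four-letter length check from the question is kept.

```python
dna = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def stringtobin(seq):
    if len(seq) != 4:
        return None                    # same restriction as the question's version
    num = 0
    for letter in seq:
        num = (num << 2) | dna[letter]  # make room for two bits, then fill them
    return num

print(stringtobin("AAAT"))  # 3
print(stringtobin("CCCC"))  # 85
```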

Related

Find a hash encryption input from an output

I have this function hash() that encrypts a given string to an integer.
letters = 'weiojpknasdjhsuert'
def hash(string_input):
    h = 3
    for i in range(0, len(string_input)):
        h = h * 43 + letters.index(string_input[i])
    return h
So if I do print(hash('post')), my output is: 11231443
How can I find what my input needs to be to get an output like 1509979332193868 if the input can only be a string from letters? There is a formula inside the loop body but I couldn't figure out how to reverse it.

It seems that since 43 is larger than your alphabet, you can just reverse the math. I don't know how to prove there are no hash collisions, so this may have edge cases. For example:
letters = 'weiojpknasdjhsuert'
def hash(string_input):
    h = 3
    for i in range(0, len(string_input)):
        h = h * 43 + letters.index(string_input[i])
    return h
n = hash('wepostjapansand')
print(n)
# 9533132150649729383107184
def rev(n):
    s = ''
    while n:
        l = n % 43 # going in reverse this is the index of the next letter
        n -= l # now reverse the math...subtract that index
        n //= 43 # and divide by 43 to get the next n
        if n:
            s = letters[l] + s
    return s
print(rev(n))
# wepostjapansand
With a more reasonable alphabet, like lowercase ASCII and a space, this still seems to be okay:
from string import ascii_lowercase
letters = ascii_lowercase + ' '
def hash(string_input):
    h = 3
    for i in range(0, len(string_input)):
        h = h * 43 + letters.index(string_input[i])
    return h
n = hash('this is some really long text to test how well this works')
print(n)
# 4415562436659212948343839746913950248717359454765313646276852138403823934037244869651587521298
def rev(n):
    s = ''
    # with more compact logic
    while n > 3:
        s = letters[n % 43] + s
        n = (n - (n % 43)) // 43
    return s
print(rev(n))
# this is some really long text to test how well this works
The basic idea is that after all the math, the last number is:
prev * 43 + letter_index
This means you can recover the final letter index by taking the number modulo 43. Then subtract that index and divide by 43 (which just reverses the math) to get prev, and repeat until only the initial value 3 is left.
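Because the question's letters string contains repeated characters and not every number is a valid hash, a more defensive variant can verify its own result. The sketch below is one possible way to do that; the names hash_string, unhash, BASE and SEED are invented here, and returning None for unreachable values is an assumption about the desired behaviour.

```python
LETTERS = 'weiojpknasdjhsuert'
BASE = 43   # multiplier used by the hash in the question
SEED = 3    # initial value of h

def hash_string(text):
    h = SEED
    for ch in text:
        h = h * BASE + LETTERS.index(ch)
    return h

def unhash(n):
    """Return a string that hashes to n, or None if this method finds none."""
    original = n
    chars = []
    while n > SEED:
        n, index = divmod(n, BASE)   # index of the last letter, then the previous h
        if index >= len(LETTERS):
            return None              # remainder does not correspond to any letter
        chars.append(LETTERS[index])
    if n != SEED:
        return None                  # did not unwind back to the seed
    candidate = "".join(reversed(chars))
    # index() only ever returns the first occurrence of a repeated letter,
    # so confirm that the candidate really hashes back to the input.
    return candidate if hash_string(candidate) == original else None

print(unhash(hash_string('wepostjapansand')))  # wepostjapansand
```

The final re-hash is what catches remainders that point at a second copy of a repeated letter, which the forward hash can never produce.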

Multiplication of binaries through repeated addition

Let's say we're trying to multiply 10011 and 1101 (or in decimal terms, 19 x 13). We all know that this is the same as adding 10011 to itself 13 times or vice versa. I found some code at https://www.w3resource.com/python-exercises/challenges/1/python-challenges-1-exercise-31.php that shows how to add two binary numbers. My question is: in general, if we multiply two binary numbers A and B, how do we iterate so that A is added to itself B times? Obviously, in order to do that we have to convert B to a decimal/integer first.
def add_binary_nums(x, y):
    max_len = max(len(x), len(y))
    x = x.zfill(max_len)
    y = y.zfill(max_len)
    result = ''
    carry = 0
    for i in range(max_len-1, -1, -1):
        r = carry
        r += 1 if x[i] == '1' else 0
        r += 1 if y[i] == '1' else 0
        result = ('1' if r % 2 == 1 else '0') + result
        carry = 0 if r < 2 else 1
    if carry !=0 : result = '1' + result
    return result.zfill(max_len)
print(add_binary_nums('11', '1'))

You can count up to a number by starting at 0 and adding 1 until you are done. Since you already have defined a binary add, you only need to add the loop:
def binary_range(stop: str):
    """Count `stop` times"""
    current = '0'
    while stop != current:
        yield current
        current = add_binary_nums(current, '1')
This is enough to do something "n times". You can now do "a * b" as "add a to itself b times":
def binary_mul(a: str, b: str):
    """Multiply the binary ``a`` by the binary ``b``"""
    result = '0'
    for _ in binary_range(b):
        result = add_binary_nums(result, a)
    return result
If you don't care about building a binary calculator, use Python to convert binary to integers or vice versa. int(bin_string, 2) converts a string such as "01101" to the appropriate integer, and bin(integer) converts it back to a string with a "0b" prefix and no leading zeros, such as "0b1101".
For example, a binary multiplication that takes and returns strings looks like this:
def binary_mul(a: str, b: str):
    return bin(int(a, 2) * int(b, 2))[2:]  # [2:] strips the "0b" prefix
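Repeated addition needs as many additions as the value of b. If the goal is to stay with string arithmetic, a common alternative is shift-and-add: for every 1 bit in b, add a copy of a shifted left by that bit's position. The sketch below swaps in that technique; binary_mul_shift_add is a made-up name, and the adder is a condensed rewrite of the one in the question so that the block is self-contained.

```python
def add_binary_nums(x, y):
    """Add two binary strings digit by digit, right to left."""
    max_len = max(len(x), len(y))
    x, y = x.zfill(max_len), y.zfill(max_len)
    result, carry = '', 0
    for i in range(max_len - 1, -1, -1):
        total = carry + (x[i] == '1') + (y[i] == '1')
        result = str(total % 2) + result
        carry = total // 2
    return '1' + result if carry else result

def binary_mul_shift_add(a, b):
    """Multiply binary strings using one addition per 1 bit of b."""
    result = '0'
    shifted = a
    for bit in reversed(b):          # walk b from least to most significant bit
        if bit == '1':
            result = add_binary_nums(result, shifted)
        shifted += '0'               # appending a zero doubles the value
    return result

print(binary_mul_shift_add('10011', '1101'))  # 11110111 (19 x 13 = 247)
```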

Codewars. Some tests are passed, but i need to get tests which outputs the following mistake: 3263 should equal -1

Can you explain what the problems are here? To my mind, this code is a mess, even if the approach to solving it is right. I beg your pardon for my English.
the task of this kata:
Some numbers have funny properties. For example:
89 --> 8¹ + 9² = 89 * 1
695 --> 6² + 9³ + 5⁴= 1390 = 695 * 2
46288 --> 4³ + 6⁴+ 2⁵ + 8⁶ + 8⁷ = 2360688 = 46288 * 51
Given a positive integer n written as abcd... (a, b, c, d... being digits) and a positive integer p we want to find a positive integer k, if it exists, such as the sum of the digits of n taken to the successive powers of p is equal to k * n. In other words:
Is there an integer k such as : (a ^ p + b ^ (p+1) + c ^(p+2) + d ^ (p+3) + ...) = n * k
If it is the case we will return k, if not return -1.
Note: n, p will always be given as strictly positive integers.
dig_pow(89, 1) should return 1 since 8¹ + 9² = 89 = 89 * 1
dig_pow(92, 1) should return -1 since there is no k such as 9¹ + 2² equals 92 * k
dig_pow(695, 2) should return 2 since 6² + 9³ + 5⁴= 1390 = 695 * 2
dig_pow(46288, 3) should return 51 since 4³ + 6⁴+ 2⁵ + 8⁶ + 8⁷ = 2360688 = 46288 * 51
def dig_pow(n, p):
    if n > 0 and p > 0:
        b = []
        a = str(n)
        result = []
        for i in a:
            b.append(int(i))
        for x in b:
            if p != 1:
                result.append(x ** p)
                p += 1
            else:
                result.append(x ** (p + 1))
        if int((sum(result)) / n) < 1:
            return -1
        elif int((sum(result)) / n) < 2:
            return 1
        else:
            return int((sum(result)) / n)
test results:
Test Passed
Test Passed
Test Passed
Test Passed
3263 should equal -1

I don't know exactly which version of Python you're using; the following code is Python 3. If I understand you correctly, the code can be as simple as:
def dig_pow(n, p):
    assert n > 0 and p > 0
    digits = (int(i) for i in str(n)) # replaces your a,b part with generator
    result = 0 # you don't use result as a list, so an int suffice
    for x in digits: # why do you need if in the loop? (am I missing something?)
        result += x ** p
        p += 1
    if result % n: # you just test for divisibility
        return -1
    else:
        return result // n
The major problem is that your objective has only two possible outcomes (return k or return -1), but you wrote if/elif/else around a truncated division, which is unnecessary and leads to bugs: int(sum(result) / n) discards the remainder, so a sum that is not a multiple of n is still reported as a valid k. The % is the modulus operator.
Also, having an if and not returning anything in the other branch is often not a good idea (see the assert part). Of course, if you don't like the assert, just fall back to if.
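As a quick check, the kata's own examples can be run against a compact variant of the same divisibility test. This is a sketch that uses enumerate and divmod instead of the explicit loop, not the answerer's exact code.

```python
def dig_pow(n, p):
    # digit i (counting from 0) is raised to the power p + i
    total = sum(int(d) ** (p + i) for i, d in enumerate(str(n)))
    k, remainder = divmod(total, n)
    return k if remainder == 0 else -1

print(dig_pow(89, 1))     # 1
print(dig_pow(92, 1))     # -1
print(dig_pow(695, 2))    # 2
print(dig_pow(46288, 3))  # 51
```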

I believe this could work as well and I find it a little easier to read, however it can definitely be improved:
def dig_pow(n, p):
    value = 0
    for digit in str(n):
        value += int(digit)**p
        p += 1
    for k in range(1, value + 1): # include k == value so that n == 1 is handled
        if k * n == value: # integer comparison avoids float division
            return k
    return -1

This is a simpler alternative to using:
digits = (int(i) for i in str(n))
Since I am still a beginner, I'm opting for this version, which does the same thing with a plain loop:
result = 0
for digits in str(n):
    #iterate through each digit from n
    # single of digits turn to int & power to p
    for number in digits:
        result += int(number) ** p
        p += 1
as for the full solution, it goes like this:
def dig_pow(n, p):
    # example: n = 123 becomes the string "123", i.e. the digits 1, 2, 3
    # each digit is raised to the power p, and p increases by 1 per digit
    # if result % n is not zero, return -1
    result = 0
    for digits in str(n):
        #iterate through each digit from n
        # single digit turn to int & power to p
        for number in digits:
            result += int(number) ** p
            p += 1
    if result % n:
        return -1
    else:
        return result // n

Returns the largest n such that R[n] = S

Write a function answer(str_S) which, given the base-10 string
representation of an integer S, returns the largest n such that R(n) =
S. Return the answer as a string in base-10 representation. If there
is no such n, return "None". S will be a positive integer no greater
than 10^25.
where R(n) is the number of zombits at time n:
R(0) = 1
R(1) = 1
R(2) = 2
R(2n) = R(n) + R(n + 1) + n (for n > 1)
R(2n + 1) = R(n - 1) + R(n) + 1 (for n >= 1)
Test cases
==========
Inputs:
(string) str_S = "7"
Output:
(string) "4"
Inputs:
(string) str_S = "100"
Output:
(string) "None"
My program below is correct but it is not scalable since here the range of S can be a very large number like 10^24. Could anyone help me with some suggestion to improve the code further so that it can cover any input case.
def answer(str_S):
    d = {0: 1, 1: 1, 2: 2}
    str_S = int(str_S)
    i = 1
    while True:
        if i > 1:
            d[i*2] = d[i] + d[i+1] + i
            if d[i*2] == str_S:
                return i*2
            elif d[i*2] > str_S:
                return None
        if i>=1:
            d[i*2+1] = d[i-1] + d[i] + 1
            if d[i*2+1] == str_S:
                return i*2 + 1
            elif d[i*2+1] > str_S:
                return None
        i += 1
print answer('7')

First of all, where are you having trouble with the scaling? I ran your code on a 30-digit number, and it seemed to complete okay. Do you have a memory limit? Python handles arbitrarily large integers, although very large ones fall back to slower multi-digit software arithmetic.
Given the density of R values, I suspect that you can save space as well as time if you switch to a straight array: use the value as an array index instead of a dict key.
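A different direction, offered only as a sketch: evaluate R(n) by memoized recursion (each call touches roughly log n distinct values) and search for S separately among even and odd indices. This relies on the assumption that R is strictly increasing along the even indices and along the odd indices, which holds for the small values checked by hand but is not proved here. The helper name largest_index is made up.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def R(n):
    if n < 2:
        return 1
    if n == 2:
        return 2
    half, odd = divmod(n, 2)
    if odd:
        return R(half - 1) + R(half) + 1   # R(2k + 1)
    return R(half) + R(half + 1) + half    # R(2k)

def largest_index(target, parity):
    """Index n = 2k + parity with R(n) == target, or None (assumes monotonicity)."""
    hi = 1
    while R(2 * hi + parity) < target:     # exponential search for an upper bound
        hi *= 2
    lo = 0
    while lo < hi:                         # binary search for first R(...) >= target
        mid = (lo + hi) // 2
        if R(2 * mid + parity) < target:
            lo = mid + 1
        else:
            hi = mid
    n = 2 * lo + parity
    return n if R(n) == target else None

def answer(str_S):
    target = int(str_S)
    found = [n for n in (largest_index(target, 0), largest_index(target, 1))
             if n is not None]
    return str(max(found)) if found else "None"

print(answer("7"))    # 4
print(answer("100"))  # None
```

Searching both parities matters because the same value can occur at an even and at an odd index, and the task asks for the largest one.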

Average of two strings in alphabetical/lexicographical order

Suppose you take the strings 'a' and 'z' and list all the strings that come between them in alphabetical order: ['a','b','c' ... 'x','y','z']. Take the midpoint of this list and you find 'm'. So this is kind of like taking an average of those two strings.
You could extend it to strings with more than one character, for example the midpoint between 'aa' and 'zz' would be found in the middle of the list ['aa', 'ab', 'ac' ... 'zx', 'zy', 'zz'].
Might there be a Python method somewhere that does this? If not, even knowing the name of the algorithm would help.
I began making my own routine that simply goes through both strings and finds midpoint of the first differing letter, which seemed to work great in that 'aa' and 'az' midpoint was 'am', but then it fails on 'cat', 'doggie' midpoint which it thinks is 'c'. I tried Googling for "binary search string midpoint" etc. but without knowing the name of what I am trying to do here I had little luck.
I added my own solution as an answer

If you define an alphabet of characters, you can just convert each string to an integer, take the average, and convert back to base-N, where N is the size of the alphabet.
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def enbase(x):
    n = len(alphabet)
    if x < n:
        return alphabet[x]
    return enbase(x/n) + alphabet[x%n]
def debase(x):
    n = len(alphabet)
    result = 0
    for i, c in enumerate(reversed(x)):
        result += alphabet.index(c) * (n**i)
    return result
def average(a, b):
    a = debase(a)
    b = debase(b)
    return enbase((a + b) / 2)
print average('a', 'z') #m
print average('aa', 'zz') #mz
print average('cat', 'doggie') #budeel
print average('google', 'microsoft') #gebmbqkil
print average('microsoft', 'google') #gebmbqkil
Edit: Based on comments and other answers, you might want to handle strings of different lengths by appending the first letter of the alphabet to the shorter word until they're the same length. This will result in the "average" falling between the two inputs in a lexicographical sort. Code changes and new outputs below.
def pad(x, n):
    p = alphabet[0] * (n - len(x))
    return '%s%s' % (x, p)
def average(a, b):
    n = max(len(a), len(b))
    a = debase(pad(a, n))
    b = debase(pad(b, n))
    return enbase((a + b) / 2)
print average('a', 'z') #m
print average('aa', 'zz') #mz
print average('aa', 'az') #m (equivalent to am, since a leading 'a' is a zero digit)
print average('cat', 'doggie') #cumqec
print average('google', 'microsoft') #jlilzyhcw
print average('microsoft', 'google') #jlilzyhcw

If you mean alphabetically, simply use FogleBird's algorithm but reverse the parameters and the result!
>>> print average('cat'[::-1], 'doggie'[::-1])[::-1]
cumdec
or rewriting average like so
>>> def average(a, b):
...     a = debase(a[::-1])
...     b = debase(b[::-1])
...     return enbase((a + b) / 2)[::-1]
...
>>> print average('cat', 'doggie')
cumdec
>>> print average('google', 'microsoft')
jlvymlupj
>>> print average('microsoft', 'google')
jlvymlupj

It sounds like what you want is to treat alphabetical characters as a base-26 value between 0 and 1. When you have strings of different lengths (an example in base 10), say 305 and 4202, you're coming out with a midpoint of 3, since you're looking at the characters one at a time. Instead, treat them as a floating point mantissa: 0.305 and 0.4202. From that, it's easy to come up with a midpoint of 0.3626 (you can round if you'd like).
Do the same with base 26 (a=0...z=25, ba=26, bb=27, etc.) to do the calculations for letters:
cat becomes 'a.cat' and doggie becomes 'a.doggie', doing the math gives cat a decimal value of 0.078004096, doggie a value of 0.136390697, with an average of 0.107197397 which in base 26 is roughly "cumcqo"
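The fractional view can also be sketched with exact integers instead of floats: pad both strings on the right with the lowest letter so they have the same number of base-26 digits, average the integers, and convert back to the same width. The function name midpoint and the choice to keep the full padded width are assumptions made for this illustration.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def midpoint(a, b, alphabet=ALPHABET):
    base = len(alphabet)
    width = max(len(a), len(b))

    def to_int(s):
        # right-padding with the lowest letter is like appending zeros after the point
        value = 0
        for ch in s.ljust(width, alphabet[0]):
            value = value * base + alphabet.index(ch)
        return value

    mid = (to_int(a) + to_int(b)) // 2
    digits = []
    for _ in range(width):               # always emit `width` digits
        mid, d = divmod(mid, base)
        digits.append(alphabet[d])
    return "".join(reversed(digits))

print(midpoint('a', 'z'))    # m
print(midpoint('aa', 'az'))  # am
print(midpoint('aa', 'zz'))  # mz
```

Because every result has the same width as the padded inputs, integer order and lexicographic order agree, so the result sorts between the two inputs.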

Based on your proposed usage, consistent hashing ( http://en.wikipedia.org/wiki/Consistent_hashing ) seems to make more sense.

Thanks to everyone who answered, but I ended up writing my own solution because the others weren't exactly what I needed. I am trying to average app engine key names, and after studying them a bit more I discovered they actually allow any 7-bit ASCII characters in the names. Additionally, I couldn't really rely on the solutions that first converted the key names to floating point, because I suspected floating point accuracy just isn't enough.
To take an average, first you add two numbers together and then divide by two. These are both such simple operations that I decided to just make functions to add and divide base 128 numbers represented as lists. This solution hasn't been used in my system yet so I might still find some bugs in it. Also it could probably be a lot shorter, but this is just something I needed to get done instead of trying to make it perfect.
# Given two lists representing a number with one digit left to decimal point and the
# rest after it, for example 1.555 = [1,5,5,5] and 0.235 = [0,2,3,5], returns a similar
# list representing those two numbers added together.
#
def ladd(a, b, base=128):
    i = max(len(a), len(b))
    lsum = [0] * i
    while i > 1:
        i -= 1
        av = bv = 0
        if i < len(a): av = a[i]
        if i < len(b): bv = b[i]
        lsum[i] += av + bv
        if lsum[i] >= base:
            lsum[i] -= base
            lsum[i-1] += 1
    return lsum
# Given a list of digits after the decimal point, returns a new list of digits
# representing that number divided by two.
#
def ldiv2(vals, base=128):
    vs = vals[:]
    vs.append(0)
    i = len(vs)
    while i > 0:
        i -= 1
        if (vs[i] % 2) == 1:
            vs[i] -= 1
            vs[i+1] += base / 2
        vs[i] = vs[i] / 2
    if vs[-1] == 0: vs = vs[0:-1]
    return vs
# Given two app engine key names, returns the key name that comes between them.
#
def average(a_kn, b_kn):
    m = lambda x:ord(x)
    a = [0] + map(m, a_kn)
    b = [0] + map(m, b_kn)
    avg = ldiv2(ladd(a, b))
    return "".join(map(lambda x:chr(x), avg[1:]))
print average('a', 'z') # m#
print average('aa', 'zz') # n-#
print average('aa', 'az') # am#
print average('cat', 'doggie') # d(mstr#
print average('google', 'microsoft') # jlim.,7s:
print average('microsoft', 'google') # jlim.,7s:

import math
def avg(str1,str2):
    y = ''
    s = 'abcdefghijklmnopqrstuvwxyz'
    for i in range(len(str1)):
        x = s.index(str2[i])+s.index(str1[i])
        x = math.floor(x/2)
        y += s[x]
    return y
print(avg('z','a')) # m
print(avg('aa','az')) # am
print(avg('cat','dog')) # chm
Still working on strings with different lengths... any ideas?

This version thinks 'abc' is a fraction like 0.abc. In this approach space is zero and a valid input/output.
MAX_ITER = 10
letters = " abcdefghijklmnopqrstuvwxyz"
def to_double(name):
    d = 0
    for i, ch in enumerate(name):
        idx = letters.index(ch)
        d += idx * len(letters) ** (-i - 1)
    return d
def from_double(d):
    name = ""
    for i in range(MAX_ITER):
        d *= len(letters)
        name += letters[int(d)]
        d -= int(d)
    return name
def avg(w1, w2):
    w1 = to_double(w1)
    w2 = to_double(w2)
    return from_double((w1 + w2) * 0.5)
print avg('a', 'a') # 'a'
print avg('a', 'aa') # 'a mmmmmmmm'
print avg('aa', 'aa') # 'a zzzzzzzz'
print avg('car', 'duck') # 'cxxemmmmmm'
Unfortunately, the naïve algorithm is not able to detect the periodic 'z's; this is like 0.99999... in decimal. Therefore 'a zzzzzzzz' is actually 'aa' (the space before the run of 'z's must be increased by one).
In order to normalise this, you can use the following function
def remove_z_period(name):
    if len(name) != MAX_ITER:
        return name
    if name[-1] != 'z':
        return name
    n = ""
    overflow = True
    for ch in reversed(name):
        if overflow:
            if ch == 'z':
                ch = ' '
            else:
                ch=letters[(letters.index(ch)+1)]
                overflow = False
        n = ch + n
    return n
print remove_z_period('a zzzzzzzz') # 'aa'

I haven't programmed in python in a while and this seemed interesting enough to try.
Bear with my recursive programming. Too many functional languages look like python.
def stravg_half(a, ln):
    # If you have a problem it will probably be in here.
    # The floor of the character's value is 0, but you may want something different
    f = 0
    #f = ord('a')
    L = ln - 1
    if 0 == L:
        return ''
    A = ord(a[0])
    return chr(A/2) + stravg_half( a[1:], L)
def stravg_helper(a, b, ln, x):
    L = ln - 1
    A = ord(a[0])
    B = ord(b[0])
    D = (A + B)/2
    if 0 == L:
        if 0 == x:
            return chr(D)
        # NOTE: The caller of helper makes sure that len(a)>=len(b)
        return chr(D) + stravg_half(a[1:], x)
    return chr(D) + stravg_helper(a[1:], b[1:], L, x)
def stravg(a, b):
    la = len(a)
    lb = len(b)
    if 0 == la:
        if 0 == lb:
            return a # which is empty
        return stravg_half(b, lb)
    if 0 == lb:
        return stravg_half(a, la)
    x = la - lb
    if x > 0:
        return stravg_helper(a, b, lb, x)
    return stravg_helper(b, a, la, -x) # Note the order of the args
