Rough string alignment in python

If I have two strings of equal length like the following:
'aaaaabbbbbccccc'
'bbbebcccccddddd'
Is there an efficient way to align the two such that the most letters as possible line up as shown below?
'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'
The only way I can think of doing this is brute force by editing the strings and then iterating through and comparing.

Return the offset that gives the maximum score, where the score of an offset is the number of matching characters at that offset.
def best_overlap(a, b):
    return max([(score(a[offset:], b), offset) for offset in range(len(a))], key=lambda x: x[0])[1]
def score(a, b):
    return sum([a[i] == b[i] for i in range(len(a))])
>>> best_overlap(a, b)
5
>>> a + '-' * best_overlap(a, b); '-' * best_overlap(a, b) + b
'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'
Or, equivalently:
def best_match(a, b):
    best_offset = 0  # avoids shadowing the built-in max
    max_score = 0
    for offset in range(len(a)):
        val = score(a[offset:], b)
        if val > max_score:
            max_score = val
            best_offset = offset
    return best_offset
There is room for optimizations such as:
Early exit for no matching characters
Early exit when maximum possible match found
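A minimal sketch of those two early exits, assuming Python 3 and the same scoring as above. The name best_overlap_early_exit is hypothetical and not part of the answer.

```python
def best_overlap_early_exit(a, b):
    """Return the offset of b relative to a that maximises matching characters."""
    best_score, best_offset = 0, 0
    # Early exit: with no character in common, no offset can score above zero.
    if not set(a) & set(b):
        return best_offset
    for offset in range(len(a)):
        overlap = min(len(a) - offset, len(b))
        # Early exit: the overlap only shrinks from here, so once it cannot
        # beat the best score found so far, later offsets cannot win either.
        if overlap <= best_score:
            break
        # zip stops at the shorter input, so no explicit length check is needed.
        current = sum(x == y for x, y in zip(a[offset:], b))
        if current > best_score:
            best_score, best_offset = current, offset
    return best_offset


print(best_overlap_early_exit('aaaaabbbbbccccc', 'bbbebcccccddddd'))  # 5
```

Ties are resolved in favour of the smallest offset, as in the loop version above.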

I'm not sure what you mean by efficient, but you can use the find method on str:
first = 'aaaaabbbbbccccc'
second = 'bbbebcccccddddd'
second_prime = '-'* first.find(second[0]) + second
first_prime = first + '-' * (len(second_prime) - len(first))
print first_prime + '\n' + second_prime
# Output:
# aaaaabbbbbccccc-----
# -----bbbebcccccddddd

I can't see any other way than brute forcing it. The complexity will be quadratic in the string length, which might be acceptable, depending on what string lengths you are working with.
Something like this maybe:
def align(a, b):
    best, best_x = 0, 0
    for x in range(len(a)):
        # zip stops at the shorter input, so b needs no slicing
        # (b[:-x] would be empty for x == 0)
        s = sum(i == j for (i, j) in zip(a[x:], b))
        if s > best:
            best, best_x = s, x
    return best_x
align('aaaaabbbbbccccc', 'bbbebcccccddddd')
5

I would do something like a binary & on the two strings: with the strings lined up, compare them position by position and count how many letters match. Then shift one string by one position and count again, and keep shifting until the strings no longer overlap. The shift with the most matching letters is the answer, and you can add the dashes when you print the result. You don't actually have to modify the strings for this; just keep track of the shift and offset the character comparison by that amount. This is not terribly efficient (O(n^2), since the work is roughly n+(n-2)+(n-4)...), but it is the best I could come up with.
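The idea could be sketched roughly as follows, assuming Python 3. The function name best_shift and the choice to also try negative shifts (b moved to the left of a) are my own assumptions, not something the answer specifies.

```python
def best_shift(a, b):
    """Try every shift of b against a and return (shift, matches) for the best one.

    A positive shift moves b to the right of a; a negative shift moves it left.
    """
    best = (0, -1)
    for shift in range(-(len(b) - 1), len(a)):
        # Only indices where both strings have a character are compared,
        # so neither string is sliced or modified.
        start = max(0, shift)
        stop = min(len(a), len(b) + shift)
        matches = sum(a[i] == b[i - shift] for i in range(start, stop))
        if matches > best[1]:
            best = (shift, matches)
    return best


print(best_shift('aaaaabbbbbccccc', 'bbbebcccccddddd'))  # (5, 9)
```

For a non-negative shift, the dashes can then be added as a + '-' * shift and '-' * shift + b when printing.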

Related

Python rearrange a list without changing the values and making every rearrangment different

I want to write a function that get two integers. The integers indicate how many strings of either of the two chars appears in a string.
For example:
my_func(x,y): x amount of 'u' and y amount of 'r'.
my_func(2,3) is a string 'uurrr'
And the goal of the function is to write all the possible combinations of that string without changing the amount of x,y and every rearrangement is different:
Example:
my_func(1,1) will return: 'ru', 'ur'
my_func(1,2) will return: 'urr', 'rur', 'rru'
my_func(2,2) will return: 'uurr', 'ruru', 'rruu','urur', 'ruur', 'urru'
What I tried without covering all cases:
RIGHT = 'r'
UP = 'u'
def factorial(m):
    if m>1:
        return factorial(m-1)*m
    else:
        return 1
def binom(n,k):
    return int(factorial(n)/(factorial(k)*factorial(n-k)))
def up_and_right(n, k, lst):
    if n-k == 1 or n-k==-1 or n-k == 0 or n==1 or k==1:
        num_of_ver = n+k
    else:
        num_of_ver = binom(n+k,2)
    first_way_to_target = RIGHT*n + UP*k
    lst.append(first_way_to_target)
    way_to_target = first_way_to_target
    for i in range(num_of_ver-1):
        for j in range(n+k-1,0,-1):
            if way_to_target[j]==UP and way_to_target[j-1]==RIGHT:
                way_to_target = list(way_to_target)
                way_to_target[j-1] = UP
                way_to_target[j] = RIGHT
                way_to_target = ''.join(way_to_target)
                lst.append(way_to_target)
    return lst
Thanks in advance!

Use itertools.permutations to get all the rearrangements, make a set of them to eliminate duplicates (because e.g. swapping two rs around counts as a separate permutation but doesn't change anything), and then join them back into strings because permutations returns character tuples instead.
This demonstration at the REPL should give you enough to write your function:
>>> import itertools
>>> [''.join(p) for p in set(itertools.permutations('u' * 2 + 'r' * 2))]
['uurr', 'ruur', 'ruru', 'rruu', 'urur', 'urru']
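Wrapped into the function from the question, this might look like the sketch below (Python 3 assumed). The sorted() call is my addition to make the output order predictable, since a set has no defined order.

```python
from itertools import permutations


def my_func(x, y):
    """Return every distinct arrangement of x 'u' characters and y 'r' characters."""
    letters = 'u' * x + 'r' * y
    # permutations() treats equal letters as distinct, so the set removes repeats;
    # sorted() only makes the output order predictable.
    return sorted(''.join(p) for p in set(permutations(letters)))


print(my_func(1, 2))  # ['rru', 'rur', 'urr']
```

Note that permutations() generates (x+y)! tuples before the duplicates are dropped, so this is only practical for short strings.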

Given 2 strings, return number of positions where the two strings contain the same length 2 substring

here is my code:
def string_match(a, b):
    count = 0
    if len(a) < 2 or len(b) < 2:
        return 0
    for i in range(len(a)):
        if a[i:i+2] == b[i:i+2]:
            count = count + 1
    return count
And here are the results:
Correct me if I am wrong but, I see that it didn't work probably because the two string lengths are the same. If I were to change the for loop statement to:
for i in range(len(a)-1):
then it would work for all cases provided. But can someone explain to me why adding the -1 makes it work? Perhaps I'm misunderstanding how the for loop works in this case. And can someone tell me a better way to write this, because this is probably really bad code? Thank you!

But can someone explain to me why adding the -1 makes it work?
Observe:
test = 'food'
i = len(test) - 1
test[i:i+2] # produces 'd'
Using len(a) as your bound means that len(a) - 1 will be used as an i value, and therefore a slice is taken at the end of a that would extend past the end. In Python, such slices succeed, but produce fewer characters.

String slicing can return strings that are shorter than requested. In your first failing example that checks "abc" against "abc", in the third iteration of the for loop, both a[i:i+2] and b[i:i+2] are equal to "c", and therefore count is incremented.
Using range(len(a)-1) ensures that your loop stops before it gets to a slice that would be just one letter long.

Since the strings may be of different lengths, you want to iterate only up to the end of the shortest one. In addition, you're accessing i+2, so you only want i to iterate up to the index before the last item (otherwise you might get a false positive at the end of the string by going off the end and getting a single-character string).
def string_match(a: str, b: str) -> int:
    return len([
        a[i:i+2]
        for i in range(min(len(a), len(b)) - 1)
        if a[i:i+2] == b[i:i+2]
    ])
(You could also do this counting with a sum, but this makes it easy to get the actual matches as well!)
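For reference, the sum-based variant mentioned above could look like this sketch. The name string_match_sum is hypothetical.

```python
def string_match_sum(a: str, b: str) -> int:
    """Count positions where a and b hold the same two-character substring."""
    limit = min(len(a), len(b)) - 1  # start indices run from 0 to limit - 1
    # Each comparison is a bool; sum() counts the True values.
    return sum(a[i:i + 2] == b[i:i + 2] for i in range(limit))


print(string_match_sum('xxcaazz', 'xxbaaz'))  # 3
```

When either string is shorter than two characters, the range is empty and the result is 0, so no separate length check is needed.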

You can use this:
def string_match(a, b):
    if len(a) < 2 or len(b) < 2:
        return 0
    subs = [a[i:i+2] for i in range(len(a)-1)]
    occurrence = list(map(lambda x: x in b, subs))
    return occurrence.count(True)

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).
The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.
For example:
s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'
Unfortunately, I have no idea on how to tackle this. Any help is very welcome.
UPDATE
A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.
g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len
Examples from @rici:
s = 'aaabcabc'
> f(s)
'abcabc'
s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
# since it is repeated 2 times in a row as 'ababcababc'.
# The proper algorithm would return both solutions.

With a combination of the re.findall() (using a specific regex pattern) and max() functions:
import re
# extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'
def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]
print(find_longest_rep(s))
The output:
UBAUBAUBAUBAUBA
The crucial pattern:
((\w+?)\2+):
(....) - the outermost capturing group, which is the 1st captured group
(\w+?) - a sequence of word characters (letters, digits and underscore) captured as the 2nd group; +? is a lazy quantifier that matches one or more times, as few times as possible, expanding as needed
\2+ - matches the text most recently captured by the 2nd group again, one or more times
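To see what the two groups capture, here is a small illustration (Python 3 assumed; the sample strings are made up for the demonstration):

```python
import re

pattern = re.compile(r'((\w+?)\2+)')

# findall returns one (whole_repeat, repeated_unit) tuple per match.
print(pattern.findall('zABABABq'))  # [('ABABAB', 'AB')]

# \w does not match a space, so a repeat interrupted by whitespace is not found.
print(pattern.findall('AB AB'))  # []
```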

Here is the solution based on ((\w+?)\2+) regex but with additional improvements:
import re
from itertools import chain
def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.
    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')
    all_rep_seach = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))
    searched = list(
        res.groups()
        for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
        if res)
    # no repeat found (this also covers an empty sequence)
    if not searched:
        return None
    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]
You can test it like so:
def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected
check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')
Key features:
uses both the greedy ((\w+)\2+) and the non-greedy ((\w+?)\2+) regex;
searches for a repetitive substring in every suffix of the string (e.g. 'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
selection is based on the number of repetitions, not on the length of the subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' the result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
the minimum length of the repeating unit matters (see the 'aaabcabc' checks).

What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String

Here is a brute force algorithm that I wrote. Maybe it will be useful:
def find_most_repetitive_substring(string):
    max_counter = 1
    position, substring_length, times = 0, 0, 0
    for i in range(len(string)):
        for j in range(len(string) - i):
            counter = 1
            if j == 0:
                continue
            while True:
                if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
                    if counter > max_counter:
                        max_counter = counter
                        position, substring_length, times = i, j, counter
                    break
                else:
                    counter += 1
    return string[position: position + substring_length * times]

how to make an imputed string to a list, change it to a palindrome(if it isn't already) and reverse it as a string back

A string is a palindrome if it reads the same forward and backward. Given a string that contains only lowercase English letters, you are required to create a new palindrome string from the given string following the rules given below:
1. You can reduce (but not increase) any character in a string by one; for example you can reduce the character h to g but not from g to h
2. In order to achieve your goal, if you have to then you can reduce a character of a string repeatedly until it becomes the letter a; but once it becomes a, you cannot reduce it any further.
Each reduction operation is counted as one. So you need to count as well how many reductions you make. Write a Python program that reads a string from a user input (using raw_input statement), creates a palindrome string from the given string with the minimum possible number of operations and then prints the palindrome string created and the number of operations needed to create the new palindrome string.
I tried to convert the string to a list first, then modify the list so that, should any string be given that is not a palindrome, it is automatically edited into a palindrome, and then print the result. After modifying the list, I convert it back to a string.
c=raw_input("enter a string ")
x=list(c)
y = ""
i = 0
j = len(x)-1
a = 0
while i < j:
    if x[i] < x[j]:
        a += ord(x[j]) - ord(x[i])
        x[j] = x[i]
        print x
    else:
        a += ord(x[i]) - ord(x[j])
        x [i] = x[j]
        print x
    i = i + 1
    j = (len(x)-1)-1
print "The number of operations is ",a
print "The palindrome created is",( ''.join(x) )
Am i approaching it the right way or is there something I'm not adding up?

Since only reduction is allowed, it is clear that the number of reductions for each pair will be the difference between them. For example, consider the string 'abcd'.
Here the pairs to check are (a,d) and (b,c).
Now difference between 'a' and 'd' is 3, which is obtained by (ord('d')-ord('a')).
I am using absolute value to avoid checking which alphabet has higher ASCII value.
I hope this approach will help.
s=input()
l=len(s)
count=0
m=0
n=l-1
while m<n:
    count+=abs(ord(s[m])-ord(s[n]))
    m+=1
    n-=1
print(count)
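The loop above only counts the operations. A sketch that also builds the palindrome, assuming Python 3 and lowercase input, might look like this (make_palindrome is a hypothetical name):

```python
def make_palindrome(s):
    """Return (palindrome, operations) using only letter reductions."""
    chars = list(s)
    operations = 0
    i, j = 0, len(chars) - 1
    while i < j:
        # Letters can only be reduced, so both ends become the smaller letter,
        # at a cost equal to the distance between the two letters.
        low = min(chars[i], chars[j])
        operations += abs(ord(chars[i]) - ord(chars[j]))
        chars[i] = chars[j] = low
        i += 1
        j -= 1
    return ''.join(chars), operations


print(make_palindrome('abcd'))  # ('abba', 4)
```

The middle character of an odd-length string is never touched, because the loop stops once the two indices meet.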

This is a common "homework" or competition question. The basic concept here is that you have to find a way to get to minimum values with as few reduction operations as possible. The trick here is to utilize string manipulation to keep that number low. For this particular problem, there are two very simple things to remember: 1) you have to split the string, and 2) you have to apply a bit of symmetry.
First, split the string in half. The following function should do it.
def split_string_to_halves(string):
    half, rem = divmod(len(string), 2)
    a, b, c = '', '', ''
    a, b = string[:half], string[half:]
    if rem > 0:
        # odd length: the middle character sits at index `half`
        b, c = string[half + 1:], string[half]
    return (a, b, c)
The above should recreate the string if you do a + c + b. Next, you have to convert a and b to lists and map the ord function on each half. Leave the remainder alone, if any.
def convert_to_ord_list(string):
    return list(map(ord, string))
Since you just have to do a one-way operation (only reduction, no need for addition), you can assume that for each pair of elements in the two converted lists, the higher value less the lower value is the number of operations needed. Easier shown than said:
def convert_to_palindrome(string):
    halfone, halftwo, rem = split_string_to_halves(string)
    if halfone == halftwo[::-1]:
        # already a palindrome: reassemble it with the middle character in place
        return halfone + rem + halftwo, 0
    halftwo = halftwo[::-1]
    # materialise the pairs, since they are iterated twice below
    zipped = list(zip(convert_to_ord_list(halfone), convert_to_ord_list(halftwo)))
    counter = sum([max(x) - min(x) for x in zipped])
    floors = [min(x) for x in zipped]
    res = "".join(map(chr, floors))
    res += rem + res[::-1]
    return res, counter
Finally, some tests:
target = 'ideal'
print(convert_to_palindrome(target)) # ('iaeai', 6)
target = 'euler'
print(convert_to_palindrome(target)) # ('eelee', 29)
target = 'ohmygodthisisinsane'
print(convert_to_palindrome(target)) # ('ehasgidihihidigsahe', 84)
I'm not sure if this is optimized nor if I covered all bases. But I think this pretty much covers the general concept of the approach needed. Compared to your code, this is clearer and actually works (yours does not). Good luck and let us know how this works for you.

How can I scramble a word with a factor?

I would like to scramble a word with a factor. The bigger the factor is, the more scrambled the word will become.
For example, the word "paragraphs" with factor of 1.00 would become "paaprahrgs", and it will become "paargarphs" with a factor of 0.50.
The distance from the original letter position and the number of scrambled letters should be taken into consideration.
This is my code so far, which only scrambles without a factor:
import random
def Scramble(s):
    return ''.join(random.sample(s, len(s)))
Any ideas?
P.S. This isn't an homework job - I'm trying to make something like this: http://d24w6bsrhbeh9d.cloudfront.net/photo/190546_700b.jpg

You could use the factor to decide how many times characters in the string are swapped around.
As the factor seems to be between 0 and 1, you can multiply it by the string's length.
from random import random
def shuffle(string, factor):
    string = list(string)
    length = len(string)
    if length < 2:
        return "".join(string)
    shuffles = int(length * factor)
    for _ in range(shuffles):
        # pick two random positions and swap their characters
        i, j = tuple(int(random() * length) for _ in range(2))
        string[i], string[j] = string[j], string[i]
    return "".join(string)
x = "computer"
print shuffle(x, .2)
print shuffle(x, .5)
print shuffle(x, .9)
coupmter
eocpumtr
rpmeutoc
If you want the first and the last characters to stay in place, simply split them and add them later on.
def CoolWordScramble(string, factor = .5):
    if len(string) < 2:
        return string
    first, string, last = string[0], string[1:-1], string[-1]
    return first + shuffle(string, factor) + last

You haven't defined what your "factor" should mean, so allow me to redefine it for you: A scrambling factor N (an integer) would be the result of swapping two random letters in a word, N times.
With this definition, 0 means the resulting word is the same as the input, 1 means only one pair of letters is swapped, and 10 means the swap is done 10 times.
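Under that definition, a sketch could look like the following (Python 3 assumed). The name swap_scramble and the optional rng parameter, which allows passing a seeded random.Random for reproducible results, are my own additions.

```python
import random


def swap_scramble(word, n, rng=random):
    """Swap two randomly chosen letters of word, n times."""
    letters = list(word)
    if len(letters) < 2:
        return word
    for _ in range(n):
        # sample() picks two distinct positions, so every step is a real swap
        # (although swapping two equal letters still leaves the word unchanged).
        i, j = rng.sample(range(len(letters)), 2)
        letters[i], letters[j] = letters[j], letters[i]
    return ''.join(letters)


print(swap_scramble('paragraphs', 0))  # paragraphs
print(swap_scramble('paragraphs', 3))  # some scrambled variant, differs per run
```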

You can make the "factor" roughly correspond to the number of times two adjacent letters of the word switch their positions (a transposition).
In each transposition, choose a random position (from 0 through the length-minus-two), then switch the positions of the letter at that position and the letter that follows it.

It could be implemented many ways, but here is my solution:
I wrote a function that just moves one letter to a new place:
import random
def scramble(s):
    s = list(s)  # a list is easier to work with, though it costs some performance
    p = s.pop(random.randint(0, len(s)-1))
    # after the pop, len(s) is a valid insert position (the end of the list)
    s.insert(random.randint(0, len(s)), p)
    return "".join(s)
And a function that applies it to a string n times:
def scramble_factor(s, n):
    for i in range(n):
        s = scramble(s)
    return s
Now we can use it:
>>> s = "paragraph"
>>> scramble_factor(s, 0)
'paragraph'
>>> scramble_factor(s, 1)
'pgararaph'
>>> scramble_factor(s, 2)
'prahagrap'
>>> scramble_factor(s, 5)
'pgpaarrah'
>>> scramble_factor(s, 10)
'arpahprag'
Of course functions can be combined or nested, but it is clear I think.
Edit:
It doesn't consider distance, but the scramble function can easily be replaced with one that only swaps adjacent letters. Here is one:
def scramble(s):
    if len(s)<=1:
        return s
    index = random.randint(0, len(s)-2)
    return s[:index] + s[index + 1] + s[index] + s[index+2:]

You could do a for-loop that counts down to 0.
Convert the String into a Char-Array and use a RNG to choose 2 letters to swap.
