How can we compute the code of the reverse complement of a DNA sequence from the sequence's own code?
A DNA sequence can contain 4 different characters A, C, G, T; where A is the complement of T and C is the complement of G.
The reverse complement of a DNA sequence is the complement of the sequence read in reverse (we compute the complement of each character from right to left).
Example: the reverse complement of (AA) is TT, the reverse complement of (AC) is GT, and so on...
In general, using python we code a sequence by mapping each character to a number going from 0 to 3,
{A:0, C:1, G:2, T:3}
then the coding of AA is: 0, the coding of AC is:
AC = 0*4^0+1*4^1 = 4
the coding of GT is:
GT = 2*4^0+3*4^1 = 14
How could I transform the code of each sequence to its reverse complement in python without creating a dictionary? For the above example: convert 4 to 14? and 0 to 15 ...
Your symbol set is too small for a hash map to actually be efficient. And mixing two's complement into your problem has just caused confusion.
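import string  # used further down for string.digits

symbols = 'ACGT'
complements = symbols[::-1]  # reverse order: A<->T, C<->G
table = str.maketrans(symbols, complements)  # string.maketrans on Python 2
sample = 'ACCGTT'
print(sample.translate(table)[::-1])
# output: AACGGT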
Converting to some bitpacked format would take less space but require a lot more special handling, as you'd need to track sizes separately, perform arbitrarily wide shifts and so on. Python can certainly do it, in particular with int() accepting many bases and creating arbitrary width results, but it's likely a counterproductive detour.
digits = string.digits[:len(symbols)]  # '0123'
length = len(sample)
digitmap = str.maketrans(symbols, digits)  # string.maketrans on Python 2
number = int(sample.translate(digitmap), len(digits))

def reversemapnumber(function=lambda digit: digit, number=0, radix=0b100, length=0):
    # peel digits off the low end and push them onto the result, which
    # reverses their order; function is applied to each digit on the way
    result = 0
    for i in range(length):
        number, digit = divmod(number, radix)
        result = result*radix + function(digit)
    return result

revcomplemented = reversemapnumber(function=lambda x: 3-x,
                                   number=number, length=length)
# binary form
print('{:0{}b}'.format(revcomplemented, length*2))
# back to text form
print(''.join(symbols[(revcomplemented >> i) & 3]
              for i in range(2*length-2, -2, -2)))
In that jumble of code I've used division rather than shifts to be somewhat more generic (supporting radix not a power of two), but the printing examples rely on the width exactly. In the end it's just tricky and unclear.
the reverse of a list in python
>>> xs = [1,2,3]
>>> reversed(xs)
<listreverseiterator object at 0x10089c9d0>
>>> list(reversed(xs))
[3, 2, 1]
def complement(x):
    return ~x & 15  # as 15 == int('1111', 2)
The 15 is a bitmask: it represents the binary 1111. Applying the bitwise AND operator with it keeps only the lowest four bits of ~x.
>>> "{0:b}".format(complement(int('1111',2)))
'0'
>>> "{0:b}".format(complement(int('0001',2)))
'1110'
>>> "{0:b}".format(complement(int('1001',2)))
'110'
>>> xs = [int('1111',2), int('1001',2), int('0110',2), int('1011',2)]
>>> map(complement, xs)
[0, 6, 9, 4]
>>> list(reversed(map(complement, xs)))
[4, 9, 6, 0]
Going by your example:
given a sequence of 6 characters, ACCGTT, the complement of A is T
and the complement of C is G, so the reverse complement of ACCGTT is AACGGT.
Assume that you have a complement function complement and a reverse function reverse.
We have reverse(ACCGTT) = TTGCCA and complement(ACCGTT) = TGGCAA.
Reversing a list after calling a function on each element is the same as calling that function on each element of the reversed list:
complement(reverse(ACCGTT)) = reverse(complement(ACCGTT))
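The identity is easy to check on the text form. This is only a sketch: the helper names and the translation table are mine, and they work on strings rather than on the packed codes.

```python
# Translation table for the text form: A<->T, C<->G
_TABLE = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Complement every base, keeping the order."""
    return seq.translate(_TABLE)

def reverse(seq):
    """Reverse the order, keeping the bases."""
    return seq[::-1]

seq = "ACCGTT"
print(complement(reverse(seq)))  # AACGGT
print(reverse(complement(seq)))  # AACGGT
```

Both orders give the same string, so you are free to pick whichever is cheaper in your representation.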
So the other part of the question is that you want to map
{A:0, C:1, G:2, T:3}
A -> T | 0 -> 3
T -> A | 3 -> 0
C -> G | 1 -> 2
G -> C | 2 -> 1
which in binary would be
a = int('00', 2) # 0
c = int('01', 2) # 1
g = int('10', 2) # 2
t = int('11', 2) # 3
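def complement(x):
    return ~x & 3  # this 3 is the same as int('11', 2)
def reverse_complement(list_of_ints):
    # a list comprehension works on Python 2 and 3; on Python 3, map()
    # returns an iterator that reversed() cannot consume
    return [complement(x) for x in reversed(list_of_ints)]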
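If the sequence is already packed into a single integer as in the question (first character in the lowest base-4 digit), the same two steps can be applied digit by digit with shifts, without a dictionary. This is a sketch under that assumption; reverse_complement_code is a name I made up, and the length has to be passed in because leading A's (zeros) are not visible in the number itself.

```python
def reverse_complement_code(code, length):
    """Reverse-complement a base-4 packed sequence of `length` characters."""
    result = 0
    for _ in range(length):
        digit = code & 3                      # lowest remaining base-4 digit
        code >>= 2                            # drop it from the input
        result = (result << 2) | (3 - digit)  # complement: 0<->3, 1<->2
    return result

print(reverse_complement_code(4, 2))  # AC -> GT, prints 14
print(reverse_complement_code(0, 2))  # AA -> TT, prints 15
```

Digits come off the low end of the input and are pushed onto the low end of the result, so the first character ends up in the highest digit, which is the reversal.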
I'm trying to get a quick implementation of the following problem, ideally such that it would work in a numba function. The problem is the following: I have two random integers a & b and consider their binary representation of length L, e.g.
L=4: a=10->1010, b=6->0110.
This is the information that is fed into the function. Then I cut both binary representations in two at the same random position and fuse the left part of one with the right part of the other, e.g.
L=4: a=1|010, b=0|110 ---> c=1110 or 0010.
One of the two outcomes is chosen with equal probability, and that is the outcome of the function. The cut occurs between the first 1/0 and the last 0/1 of the binary representation.
This is currently my code:
import random

def func(a,b,l):
    bin_a = [int(i) for i in str(bin(a))[2:].zfill(l)]
    bin_b = [int(i) for i in str(bin(b))[2:].zfill(l)]
    randint = random.randint(1, l - 1)
    print("randint", randint)
    if random.random() < 0.5:
        result = bin_a[0:randint]+bin_b[randint:l]
    else:
        result = bin_b[0:randint] + bin_a[randint:l]
    return result
I have the feeling that there are possibly many shortcuts to this problem that I have not come up with. Also, my code does not work in numba :/. Thanks for any help!
Edit: This is an update of my code, thanks to Prunes help! It also works as a numba function. If there is no further improvements to that, I would close the question.
def func2(a,b,l):
    randint = random.randint(1, l - 1)
    print("randint", randint)
    bitlist_l = [1]*randint+[0]*(l-randint)
    bitlist_r = [0]*randint+[1]*(l-randint)
    print("bitlist_l", bitlist_l)
    print("bitlist_r", bitlist_r)
    l_mask = 0
    r_mask = 0
    for i in range(l):
        l_mask = (l_mask << 1) | bitlist_l[i]
        r_mask = (r_mask << 1) | bitlist_r[i]
    print("l_mask", l_mask)
    print("r_mask", r_mask)
    if random.random() < 0.5:
        c = (a & l_mask) | (b & r_mask)
    else:
        c = (b & l_mask) | (a & r_mask)
    return c
You lose a lot of time converting between string and int. Try bit operations instead. Mask the items you want and construct the output without all the conversions. Try these steps:
size = [length of larger number in bits]. There are many ways to get this.
Make a mask template, mask, of size 1-bits.
Pick your random position, pos, counted in bits from the right. randint is a poor name, as it shadows the function you're using.
Make two masks: r_mask = (1 << pos) - 1; l_mask = mask ^ r_mask. This gives you two mutually exclusive and exhaustive bit-maps for your inputs. (Plain mask << pos and mask >> pos would overlap in the middle bits.)
Flip your random coin, the 50-50 chance. The < 0.5 result would be ...
(a & l_mask) | (b & r_mask)
For the >= 0.5 result, switch a and b in that expression.
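A sketch of these masking steps, assuming pos counts bits from the right and the two masks are built as (1 << pos) - 1 and its complement within size bits. The function name and the optional pos and take_a_left parameters are mine; they are only there so the result can be checked deterministically.

```python
import random

def crossover(a, b, size, pos=None, take_a_left=None):
    """Single-point crossover of two size-bit integers using masks only."""
    if pos is None:
        pos = random.randint(1, size - 1)    # number of low bits in the right part
    if take_a_left is None:
        take_a_left = random.random() < 0.5  # the 50-50 coin flip
    mask = (1 << size) - 1                   # size 1-bits
    r_mask = (1 << pos) - 1                  # low pos bits
    l_mask = mask ^ r_mask                   # the remaining high bits
    if take_a_left:
        return (a & l_mask) | (b & r_mask)
    return (b & l_mask) | (a & r_mask)

# a=1|010, b=0|110: cutting after the first bit leaves 3 low bits on the right
print(format(crossover(0b1010, 0b0110, 4, pos=3, take_a_left=True), "04b"))   # 1110
print(format(crossover(0b1010, 0b0110, 4, pos=3, take_a_left=False), "04b"))  # 0010
```

Only integer shifts, AND, OR and XOR are involved, which is the kind of code numba tends to handle well, though I have not tested this under numba.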
You can improve your code by realizing that you do not need a "human readable" binary representation to do binary operations.
For example, creating the mask:
m = (1<<randompos) - 1
The crossover can be done like so:
c = (a if coinflip else b) ^ ((a^b)&m)
And that's all.
Full example:
import numpy as np

# create random sample
a,b = np.random.randint(1<<32,size=2)
randompos = np.random.randint(1,32)
coinflip = np.random.randint(2)
print(randompos)
print(coinflip)
# 12
# 0
# do the crossover
m = (1<<randompos) - 1
c = (a if coinflip else b) ^ ((a^b)&m)
# check
for i in (a,b,m,c):
    print(format(int(i), "032b"))
# 11100011110111000001001111100011
# 11010110110000110010101001111011
# 00000000000000000000111111111111
# 11010110110000110010001111100011
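Why the XOR form works: a ^ b has a 1 wherever the two inputs differ, & m keeps only the differences in the low bits, and XOR-ing those into the chosen base flips exactly those low bits over to the other input's values. A plain-Python check against the values printed above (the function name is mine):

```python
def xor_crossover(a, b, pos, coinflip):
    """High bits from the chosen base, low pos bits from the other input."""
    m = (1 << pos) - 1
    return (a if coinflip else b) ^ ((a ^ b) & m)

a = 0b11100011110111000001001111100011
b = 0b11010110110000110010101001111011
m = (1 << 12) - 1

c = xor_crossover(a, b, 12, 0)
print(format(c, "032b"))  # 11010110110000110010001111100011

# same result as the explicit two-mask formulation
assert c == (b & ~m) | (a & m)
```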
The usual saying is that string comparison must be done in constant time when checking things like password or hashes, and thus, it is recommended to avoid a == b.
However, I ran the following script, and the results don't support the hypothesis that a == b short-circuits on the first non-identical character.
from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    # only do that for the first 50 char, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])
    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Here is the result of a run. Re-running the script gives slightly different results, but nothing satisfactory.
# ran with cpython 3.8.3
6 78.051700
1 78.203200
15 78.222700
14 78.384800
11 78.396300
12 78.441800
9 78.476900
13 78.519000
8 78.586200
3 78.631500
0 80.691100
1 78.203200
I would've expected that the fastest comparison would be where the first differing character is at the beginning of the string, but it's not what I get.
Any idea what's going on ???
There's a difference; you just don't see it on such small strings. Here's a small patch to apply to your code: it uses longer strings and does 10 checks, placing the A at evenly spaced positions in the original string, from the beginning to the end:
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):
 def check_cmp_time():
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
and you'll get:
0 122.621000
1 213.465700
2 380.214100
3 460.422000
5 694.278700
4 722.010000
7 894.630300
6 1020.722100
9 1149.473000
8 1341.754500
0 122.621000
1 213.465700
Note that with your example, with only 2**8 characters, it's already noticeable. Apply this patch:
@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
to only keep the two extreme cases (first letter change vs last letter change) and you'll get:
$ python3
0 124.131800
1 135.566000
Numbers may vary, but most of the time test 0 is a tad faster than test 1.
Isolating more precisely which character is modified is possible only as long as memcmp works character by character, i.e. as long as it does not use integer comparisons. That is typically the case for the last characters if they are misaligned, or for really short strings such as an 8-character string, as I demo here:
from time import perf_counter_ns
from statistics import median
import random

def check_cmp_time():
    # generate a random string of n characters
    n = 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    diffs = [s[:i] + "A" + s[i + 1 :] for i in range(n)]
    values = {x: [] for x in range(n)}
    for _ in range(10_000_000):
        for i, diff in enumerate(diffs):
            start = perf_counter_ns()
            s == diff
            values[i].append(perf_counter_ns() - start)
    timed = [[k, median(v)] for k, v in values.items()]
    sorted_timed = sorted(timed, key=lambda t: t[1])
    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Which gives me:
1 221.000000
2 222.000000
3 223.000000
4 223.000000
5 223.000000
6 223.000000
7 223.000000
0 241.000000
The differences are so small, Python and perf_counter_ns may no longer be the right tools here.
See, to know why it doesn't short circuit, you'll have to do some digging. The simple answer is, of course, it doesn't short circuit because the standard doesn't specify so. But you might think, "Why wouldn't the implementations choose to short circuit? Surely, It must be faster!". Not quite.
Let's take a look at cpython, for obvious reasons. Look at the code for unicode_compare_eq function defined in unicodeobject.c
static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;
    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;
    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}
(Note: This function is actually called after deducing that str1 and str2 are not the same object - if they are - well that's just a simple True immediately)
Focus on this line specifically-
cmp = memcmp(data1, data2, len * kind);
Ahh, we're back at another crossroads. Does memcmp short circuit? The C standard does not specify such a requirement. As seen in the opengroup docs and also in the C Standard Draft's section on the memcmp function:
#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);
The memcmp function compares the first n characters of the object pointed to by s1 to
the first n characters of the object pointed to by s2.
The memcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
Some C implementations (including glibc) choose to not short circuit. But why? Are we missing something? Why would you not short circuit?
Because the comparison they use might not be as naive as a byte-by-byte check. The standard does not require the objects to be compared byte by byte. Therein lies the chance for optimization.
What glibc does, is that it compares elements of type unsigned long int instead of just singular bytes represented by unsigned char. Check out the implementation
There's a lot more going under the hood - a discussion far outside the scope of this question, after all this isn't even tagged as a C question ;). Though I found that this answer may be worth a look. But just know, the optimization is there, just in a much different form than the approach that may come in mind at first glance.
Edit: Fixed wrong function link
Edit: As @Konrad Rudolph has stated, glibc memcmp does apparently short circuit. I've been misinformed.
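For the use case that motivated the question (checking passwords, tokens or hashes), the practical takeaway is not to rely on how == happens to behave. The standard library's hmac.compare_digest is documented as a comparison designed to avoid content-based short-circuiting. A minimal sketch; check_token and the sample strings are made up for illustration:

```python
import hmac

def check_token(supplied, expected):
    """Compare two secrets without relying on the behaviour of ==."""
    # compare_digest accepts bytes-like objects (or ASCII-only str);
    # encoding first keeps non-ASCII input from raising TypeError
    return hmac.compare_digest(supplied.encode("utf-8"),
                               expected.encode("utf-8"))

print(check_token("s3cr3t-token", "s3cr3t-token"))  # True
print(check_token("s3cr3t-tokem", "s3cr3t-token"))  # False
```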
I am splitting and sending my data as below
256 bit data :
split into 8 32-bit hex values:
[0x1c1d1e1fL, 0x18191a1bL, 0x14151617L, 0x10111213L, 0xc0d0e0fL, 0x8090a0bL, 0x4050607L, 0x10203L]
Now I receive the data in the same form:
[0x1c1d1e1f, 0x18191a1b, 0x14151617, 0x10111213, 0xc0d0e0f, 0x8090a0b, 0x4050607, 0x10203]
Now I want to stitch it back into a 256-bit number. I tried to achieve this using "".join, but it didn't give me the results I wanted.
my current code :
str_data = [str(i) for i in data]
print '[{}]'.format(', '.join(x for x in str_data))
int_data = int("".join(str_data))
print int_data
data = [int_data]
For the above code, I am getting
[471670303, 404298267, 336926231, 269554195, 202182159, 134810123, 67438087, 66051]
which is not what I want.
Each number in lst holds 32 bits of the result, with the first element being the least significant. So we can multiply each number by (2**32)**i, where i is the position of the number in the list, and then sum all the products:
>>> lst = [0x1c1d1e1f, 0x18191a1b, 0x14151617, 0x10111213, 0xc0d0e0f, 0x8090a0b, 0x4050607, 0x10203]
>>> sum(n* (2**32)**i for i, n in enumerate(lst))
Use reduce (functools.reduce in Python 3) and bitwise operators.
>>> lst = [0x1c1d1e1fL, 0x18191a1bL, 0x14151617L, 0x10111213L, 0xc0d0e0fL, 0x8090a0bL, 0x4050607L, 0x10203L]
>>> reduce(lambda acc, x: acc<<32 | x, reversed(lst))
>>> hex(_)
This is also about 4 times faster than repeated exponentiation.
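A Python 3 sketch that puts the shift-and-OR idea next to its inverse, assuming (as in the question) that the first list element is the least significant word. The names stitch_words, split_words and WORD_BITS are mine.

```python
from functools import reduce

WORD_BITS = 32

def stitch_words(words):
    """Join 32-bit words (least significant first) into one integer."""
    return reduce(lambda acc, w: (acc << WORD_BITS) | w, reversed(words), 0)

def split_words(value, count):
    """Inverse of stitch_words: cut value into count 32-bit words."""
    mask = (1 << WORD_BITS) - 1
    return [(value >> (WORD_BITS * i)) & mask for i in range(count)]

lst = [0x1c1d1e1f, 0x18191a1b, 0x14151617, 0x10111213,
       0xc0d0e0f, 0x8090a0b, 0x4050607, 0x10203]
value = stitch_words(lst)
print(hex(value))

# agrees with the multiply-and-sum version, and round-trips
assert value == sum(n << (WORD_BITS * i) for i, n in enumerate(lst))
assert split_words(value, len(lst)) == lst
```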
Here is the example which is bothering me:
>>> x = decimal.Decimal('0.0001')
>>> print x.normalize()
>>> print x.normalize().to_eng_string()
Is there a way to have engineering notation for representing milli (1e-3) and micro (1e-6)?
Here's a function that does things explicitly, and also has support for using SI suffixes for the exponent:
import math

def eng_string( x, format='%s', si=False):
    '''
    Returns float/int value <x> formatted in a simplified engineering format -
    using an exponent that is a multiple of 3.

    format: printf-style string used to format the value before the exponent.

    si: if true, use SI suffix for exponent, e.g. k instead of e3, n instead of
    e-9 etc.

    E.g. with format='%.2f':
        1.23e-08 => 12.30e-9
             123 => 123.00
          1230.0 => 1.23e3
      -1230000.0 => -1.23e6

    and with si=True:
          1230.0 => 1.23k
      -1230000.0 => -1.23M
    '''
    sign = ''
    if x < 0:
        x = -x
        sign = '-'
    exp = int( math.floor( math.log10( x)))
    exp3 = exp - ( exp % 3)
    x3 = x / ( 10 ** exp3)

    if si and exp3 >= -24 and exp3 <= 24 and exp3 != 0:
        # Python 2 integer division; on Python 3 this needs //
        exp3_text = 'yzafpnum kMGTPEZY'[ ( exp3 - (-24)) / 3]
    elif exp3 == 0:
        exp3_text = ''
    else:
        exp3_text = 'e%s' % exp3

    return ( '%s'+format+'%s') % ( sign, x3, exp3_text)
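Matplotlib implements an engineering formatter, so one option is to use Matplotlib's formatter directly, e.g.:
import matplotlib as mpl
import matplotlib.ticker  # the top-level import is not guaranteed to load the ticker submodule
formatter = mpl.ticker.EngFormatter()
formatter(10000)
result: '10 k'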
Original answer:
Based on Julian Smith's excellent answer (and this answer), I changed the function to improve on the following points:
Python3 compatible (integer division)
Compatible for 0 input
Rounding to significant number of digits, by default 3, no trailing zeros printed
so here's the updated function:
import math

def eng_string( x, sig_figs=3, si=True):
    """
    Returns float/int value <x> formatted in a simplified engineering format -
    using an exponent that is a multiple of 3.

    sig_figs: number of significant figures

    si: if true, use SI suffix for exponent, e.g. k instead of e3, n instead of
    e-9 etc.
    """
    x = float(x)
    sign = ''
    if x < 0:
        x = -x
        sign = '-'
    if x == 0:
        exp = 0
        exp3 = 0
        x3 = 0
    else:
        exp = int(math.floor(math.log10( x )))
        exp3 = exp - ( exp % 3)
        x3 = x / ( 10 ** exp3)
        x3 = round( x3, -int( math.floor(math.log10( x3 )) - (sig_figs-1)) )
        if x3 == int(x3): # prevent from displaying .0
            x3 = int(x3)

    if si and exp3 >= -24 and exp3 <= 24 and exp3 != 0:
        exp3_text = 'yzafpnum kMGTPEZY'[ exp3 // 3 + 8]
    elif exp3 == 0:
        exp3_text = ''
    else:
        exp3_text = 'e%s' % exp3

    return ( '%s%s%s') % ( sign, x3, exp3_text)
The decimal module is following the Decimal Arithmetic Specification, which states:
This is outdated - see below
to-scientific-string – conversion to numeric string
The coefficient is first converted to a string in base ten using the characters 0 through 9 with no leading zeros (except if its value is zero, in which case a single 0 character is used).
Next, the adjusted exponent is calculated; this is the exponent, plus the number of characters in the converted coefficient, less one. That is, exponent+(clength-1), where clength is the length of the coefficient in decimal digits.
If the exponent is less than or equal to zero and the adjusted exponent is greater than or equal to -6, the number will be converted
to a character form without using exponential notation.
to-engineering-string – conversion to numeric string
This operation converts a number to a string, using engineering
notation if an exponent is needed.
The conversion exactly follows the rules for conversion to scientific
numeric string except in the case of finite numbers where exponential
notation is used. In this case, the converted exponent is adjusted to be a multiple of three (engineering notation) by positioning the decimal point with one, two, or three characters preceding it (that is, the part before the decimal point will range from 1 through 999).
This may require the addition of either one or two trailing zeros.
If after the adjustment the decimal point would not be followed by a digit then it is not added. If the final exponent is zero then no indicator letter and exponent is suffixed.
For each abstract representation [sign, coefficient, exponent] on the left, the resulting string is shown on the right.
Or, in other words:
>>> for n in (10 ** e for e in range(-1, -8, -1)):
... d = Decimal(str(n))
... print d.to_eng_string()
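Because to_eng_string() only switches to an exponent when the scientific form would need one, values such as 0.0001 stay in plain notation. If you want a multiple-of-3 exponent for every magnitude while staying within Decimal, one possible sketch is below. The function to_eng is hypothetical (not part of the decimal module); it only uses Decimal.adjusted(), scaleb() and normalize().

```python
from decimal import Decimal

def to_eng(value):
    """Format value with an exponent that is always a multiple of 3."""
    d = Decimal(value)
    if d.is_zero():
        return "0"
    exp3 = (d.adjusted() // 3) * 3          # floor the exponent to a multiple of 3
    mantissa = d.scaleb(-exp3).normalize()  # 1 <= |mantissa| < 1000
    text = format(mantissa, "f")            # plain digits, no exponent
    return text if exp3 == 0 else "%sE%d" % (text, exp3)

for s in ("0.0001", "0.001", "12", "1230"):
    print(s, "->", to_eng(s))
```

Passing strings (or Decimal instances) rather than floats avoids picking up binary floating-point noise in the digits.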
I realize that this is an old thread, but it does come near the top of a search for python engineering notation and it seems prudent to have this information located here.
I am an engineer who likes the "engineering 101" engineering units. I don't even like designations such as 0.1uF, I want that to read 100nF. I played with the Decimal class and didn't really like its behavior over the range of possible values, so I rolled a package called engineering_notation that is pip-installable.
pip install engineering_notation
From within Python:
>>> from engineering_notation import EngNumber
>>> EngNumber('1000000')
>>> EngNumber(1000000)
>>> EngNumber(1000000.0)
>>> EngNumber('0.1u')
>>> EngNumber('1000m')
This package also supports comparisons and other simple numerical operations.
The «full» quote shows what is wrong!
The decimal module is indeed following the proprietary (IBM) Decimal Arithmetic Specification.
Quoting this IBM specification in its entirety clearly shows what is wrong with decimal.to_eng_string() (emphasis added):
to-engineering-string – conversion to numeric string
This operation converts a number to a string, using engineering
notation if an exponent is needed.
The conversion exactly follows the rules for conversion to scientific
numeric string except in the case of finite numbers where exponential
notation is used. In this case, the converted exponent is adjusted to be a multiple of three (engineering notation) by positioning the decimal point with one, two, or three characters preceding it (that is, the part before the decimal point will range from 1 through 999). This may require the addition of either one or two trailing zeros.
If after the adjustment the decimal point would not be followed by a digit then it is not added. If the final exponent is zero then no indicator letter and exponent is suffixed.
This proprietary IBM specification actually admits to not applying the engineering notation for numbers with an infinite decimal representation, for which ordinary scientific notation is used instead! This is obviously incorrect behaviour for which a Python bug report was opened.
from math import floor, log10

def powerise10(x):
    """ Returns x as a*10**b with 0 <= a < 10
    """
    if x == 0: return 0,0
    Neg = x < 0
    if Neg: x = -x
    a = 1.0 * x / 10**(floor(log10(x)))
    b = int(floor(log10(x)))
    if Neg: a = -a
    return a,b

def eng(x):
    """Return a string representing x in an engineer friendly notation"""
    a,b = powerise10(x)
    if -3 < b < 3: return "%.4g" % x
    a = a * 10**(b % 3)
    b = b - b % 3
    return "%.4gE%s" % (a,b)
Test result
>>> eng(0.0001)
Like the answers above, but a bit more compact:
from math import log10, floor
def eng_format(x,precision=3):
    """Returns string in engineering format, i.e. 100.1e-3"""
    x = float(x)  # inplace copy
    if x == 0:
        a,b = 0,0
    else:
        sgn = 1.0 if x > 0 else -1.0
        x = abs(x)
        a = sgn * x / 10**(floor(log10(x)))
        b = int(floor(log10(x)))
    if -3 < b < 3:
        return ("%." + str(precision) + "g") % x
    else:
        a = a * 10**(b % 3)
        b = b - b % 3
        return ("%." + str(precision) + "gE%s") % (a,b)
In [10]: eng_format(-1.2345e-4,precision=5)
Out[10]: '-123.45E-6'
Instead of a complete shuffle, I am looking for a partial shuffle function in python.
Example : "string" must give rise to "stnrig", but not "nrsgit"
It would be better if I can define a specific "percentage" of characters that have to be rearranged.
Purpose is to test string comparison algorithms. I want to determine the "percentage of shuffle" beyond which an(my) algorithm will mark two (shuffled) strings as completely different.
Update :
Here is my code. Improvements are welcome !
import random
percent_to_shuffle = int(raw_input("Give the percent value to shuffle : "))
to_shuffle = list(raw_input("Give the string to be shuffled : "))
num_of_chars_to_shuffle = int((len(to_shuffle)*percent_to_shuffle)/100)
for i in range(0,num_of_chars_to_shuffle):
print ''.join(to_shuffle)
This problem is simpler than it looks, and, as usual, the language has the right tools so that nothing stands between you and the idea:
import random

def pashuffle(string, perc=10):
    data = list(string)
    for index, letter in enumerate(data):
        if random.randrange(0, 100) < perc/2:
            new_index = random.randrange(0, len(data))
            data[index], data[new_index] = data[new_index], data[index]
    return "".join(data)
Your problem is tricky, because there are some edge cases to think about:
Strings with repeated characters (i.e. how would you shuffle "aaaab"?)
How do you measure chained character swaps or rearranged blocks?
In any case, the metric defined to shuffle strings up to a certain percentage is likely to be the same you are using in your algorithm to see how close they are.
My code to shuffle n characters:
import random
def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    idx = idx[:n]
    mapping = dict((idx[i], idx[i-1]) for i in range(n))
    return ''.join(s[mapping.get(x,x)] for x in range(len(s)))
Basically chooses n positions to swap at random, and then exchanges each of them with the next in the list... This way it ensures that no inverse swaps are generated and exactly n characters are swapped (if there are characters repeated, bad luck).
Explained run with 'string', 3 as input:
idx is [0, 1, 2, 3, 4, 5]
we shuffle it, now it is [5, 3, 1, 4, 0, 2]
we take just the first 3 elements, now it is [5, 3, 1]
those are the characters that we are going to swap
s t r i n g
  ^   ^   ^
t (1) will be i (3)
i (3) will be g (5)
g (5) will be t (1)
the rest will remain unchanged
so we get 'sirgnt'
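The run above can be reproduced with a small Python 3 sketch that separates the deterministic part (rotating a given list of positions) from the random choice of positions. The helper name rotate_positions and the rng parameter are mine.

```python
import random

def rotate_positions(s, positions):
    """Each listed position receives the character of the previous listed one."""
    mapping = {positions[i]: positions[i - 1] for i in range(len(positions))}
    return "".join(s[mapping.get(x, x)] for x in range(len(s)))

def shuffle_n(s, n, rng=random):
    """Displace the characters at n randomly chosen positions of s."""
    return rotate_positions(s, rng.sample(range(len(s)), n))

print(rotate_positions("string", [5, 3, 1]))  # sirgnt
print(shuffle_n("string", 3))                 # some other 3-position rotation
```

With all-distinct characters and n >= 2, exactly n positions end up holding a different character; with repeated characters fewer may visibly change.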
The bad thing about this method is that it does not generate all the possible variations, for example, it could not make 'gnrits' from 'string'. This could be fixed by making partitions of the indices to be shuffled, like this:
import random
def randparts(l):
    n = len(l)
    s = random.randint(0, n-1) + 1
    if s >= 2 and n - s >= 2: # the split makes two valid parts
        yield l[:s]
        for p in randparts(l[s:]):
            yield p
    else: # the split would make a single cycle
        yield l

def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    # x must be bound before it is used, so the partition loop comes first
    mapping = dict((x[i], x[i-1])
        for x in randparts(idx[:n])
        for i in range(len(x)))
    return ''.join(s[mapping.get(x,x)] for x in range(len(s)))
import random
def partial_shuffle(a, part=0.5):
    # which characters are to be shuffled:
    idx_todo = random.sample(xrange(len(a)), int(len(a) * part))

    # what are the new positions of these to-be-shuffled characters:
    idx_target = idx_todo[:]
    random.shuffle(idx_target)

    # map all "normal" character positions {0:0, 1:1, 2:2, ...}
    mapper = dict((i, i) for i in xrange(len(a)))

    # update with all shuffles in the string: {old_pos:new_pos, old_pos:new_pos, ...}
    mapper.update(zip(idx_todo, idx_target))

    # use mapper to modify the string:
    return ''.join(a[mapper[i]] for i in xrange(len(a)))

for i in xrange(5):
    print partial_shuffle('abcdefghijklmnopqrstuvwxyz', 0.2)
Evil and using a deprecated API:
import random
# adjust constant to taste
# 0 -> no effect, 0.5 -> completely shuffled, 1.0 -> reversed
# Of course this assumes your input is already sorted ;)
cmp = lambda a, b: cmp(a, b) * (-1 if random.random() < 0.2 else 1)
maybe like so:
>>> s = 'string'
>>> shufflethis = list(s[2:])
>>> random.shuffle(shufflethis)
>>> s[:2]+''.join(shufflethis)
Taking from fortran's idea, I'm adding this to the collection. It's pretty fast:
import random

def partial_shuffle(st, p=20):
    p = int(round(p/100.0*len(st)))
    idx = range(len(st))
    sample = random.sample(idx, p)
    res = ''
    samptrav = 1
    for i in range(len(st)):
        if i in sample:
            # sampled positions receive the sampled characters in reverse order
            res += st[sample[-samptrav]]
            samptrav += 1
        else:
            res += st[i]
    return res