How to optimize str.replace() in Python

I am working on a binary string (i.e. it contains only 1s and 0s) and I need to run a function N times. This function replaces every instance of '01' in the string with '10'. However, str.replace takes too long, especially since both the length of the string and N can be as large as 10^6.
I have tried using a regex, but it did not give me any speedup; it actually took more time to perform the task.
For example, if the string given to me is 01011 and N is equal to 1, then the output should be 10101. Similarly, if N becomes 2, the output becomes 11010 and so on.
Are there any optimizations of str.replace in Python, or is there any bit manipulation I could do to optimize my code?

Let's think of the input as bits forming an unsigned integer, possibly a very large one. For example:
1001 1011   # input number X
0100 1101   # Y = X>>1 = X//2 -- to get the previous bit to the same column
1001 0010   # Z1 = X & ~Y -- We are looking for 01, i.e. 1 after previous 0
0001 0010   # Z2 = Z1 with the highest bit cleared, because we don't want
            # to process the implicit 0 before the number
1010 1101   # output = X + Z2, this adds 1 where 01's are;
            # 1 + 01 = 10, this is what we want
Thus we can process the whole string in one step with just a few arithmetic operations.
Update: sample code, I tried to address the comment about leading zeroes.
xstr = input("Enter binary number: ")
x = int(xstr, base=2)
digits = len(xstr)
mask = 2**(digits-1) - 1
print("{:0{width}b}".format(x,width=digits))
while True:
    z2 = x & ~(x >> 1) & mask
    if z2 == 0:
        print("final state")
        break
    x += z2
    print("{:0{width}b}".format(x,width=digits))
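The sample above runs until the final state. To answer the original question (exactly N replacement passes), the same arithmetic can be wrapped in a function. This is a minimal sketch, not the answerer's code: the name shift_ones_left and its signature are made up for illustration, and it stops early once nothing changes any more.

```python
def shift_ones_left(s: str, n: int) -> str:
    """Apply s.replace('01', '10') n times using integer arithmetic.

    Hypothetical helper; the name and signature are illustrative only.
    """
    if not s:
        return s
    width = len(s)
    x = int(s, 2)
    # Clear the top bit so the implicit 0 in front of the number is ignored.
    mask = (1 << (width - 1)) - 1
    for _ in range(n):
        z = x & ~(x >> 1) & mask  # marks the '1' of every '01' pair
        if z == 0:
            break                 # final state reached, further passes change nothing
        x += z                    # 01 + 01 = 10 at every marked position
    return format(x, "0{}b".format(width))


print(shift_ones_left("01011", 1))  # 10101
print(shift_ones_left("01011", 2))  # 11010
```

Each pass costs a handful of big-integer operations instead of a scan-and-rebuild of the string, and the early exit means N is never iterated further than the number of passes that actually change something.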

While this is not an answer to the actual replacement question, my preliminary investigations show that the flipping rule will eventually arrange all the 1s at the beginning of the string and all the 0s at the end, so the following function will give the correct answer if N is close to len(s).
from collections import Counter
def asymptote(s, N):
    counts = Counter(s)
    return '1'*counts['1'] + '0'*counts['0']
I compared the results with
def brute(s, N):
    for i in range(N):
        s = s.replace('01', '10')
    return s
This graph shows where we have agreement between the brute force method and the asymptotic result for random strings
The yellow part is where the brute force and asymptotic result are the same. So you can see you need at least len(s)/2 flips to get to the asymptotic result most of the time and sometimes you need a bit more (the red line is 3*len(s)/4).
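To check this kind of claim on your own inputs, you can count how many passes the brute-force rule needs before the string stops changing. The helper below is a small sketch with a made-up name (steps_to_converge); it is not part of the answer's code.

```python
def steps_to_converge(s):
    """Return (number of passes, final string) for repeated '01' -> '10' replacement.

    Hypothetical helper used only to explore how quickly the sorted state is reached.
    """
    steps = 0
    while '01' in s:
        s = s.replace('01', '10')
        steps += 1
    return steps, s


print(steps_to_converge("01011"))  # (3, '11100')
```

Comparing the returned step count with len(s)/2 and 3*len(s)/4 for random strings reproduces the kind of comparison described above; whenever N is at least the step count, the asymptotic result can be returned directly.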

Here is the program I spoke of:
from typing import Dict
from itertools import product
table_1 = {
    "01": 1,
    "11": 0,
}
tables = {
    1: table_1
}
def _apply_table(s: str, n: int, table: Dict[str, int]) -> str:
    tl = n * 2
    out = ["0"] * len(s)
    for i in range(len(s)):
        if s[i] == '1':
            if i < tl:
                t = '1' * (tl - i - 1) + s[:i + 1]
            else:
                t = s[i - tl + 1:i + 1]
            o = table[t]
            out[i - o] = '1'
    return ''.join(out)
def _get_table(n: int) -> Dict[str, int]:
    if n not in tables:
        tables[n] = _generate_table(n)
    return tables[n]
def _generate_table(n: int) -> Dict[str, int]:
    def apply(t: str):
        return _apply_table(_apply_table(t, n - 1, _get_table(n - 1)), 1, table_1)
    tl = n * 2
    ts = (''.join(ps) + '1' for ps in product('01', repeat=tl - 1))
    return {t: len(apply(t).rpartition('1')[2]) for t in ts}
def transform(s: str, n: int):
    return _apply_table(s, n, _get_table(n))
This is not very fast, but transform has a time complexity of O(M), with M being the length of the string. However, the space complexity and the bad time complexity of the _generate_table function make it unusable :-/ (It may be possible to improve it, or to implement it in C for more speed. It also gets better if you store the hash tables instead of recomputing them every time.)

Related

Trying to convert a integer into a ACGT DNA sequence

I am trying to reverse my stringtobin function so that when I run bintostring([3]) it will return "AAAT" where A=0,C=1,G=2,T=3, for example CCCC will return 85 because (1 * 64) + (1 * 16) + (1 * 4) + (1 * 1) = 85. My bintostring function now just returns an empty string.
dna = {'A':0, 'C':1, 'G':2, 'T':3}
dna2 = {0:'A', 1:'C', 2:'G', 3:'T'}
def bintostring(num):
    seq = []
    nums = [64,16,4,1]
    #main while
    i = 0
    while i<len(num):
        #nums while (iterate through nums)
        k = 0
        while k<len(nums):
            #dna2 while (iterate through dna2)
            x = 0
            while x<len(dna2):
                check = 0
                if num[i]//nums[k] == dna2[x]:
                    seq.append(dna2[x])
                    check+=1
                elif check>0:
                    seq.append('A')
                x+=1
            k+=1
        i+=1
    return("".join(seq))
print(bintostring([3]))
def stringtobin(seq):
    power_of_4 = 1
    num = 0
    if len(seq)!=4: return None
    i = len(seq)-1
    while i>=0:
        power_of_4*=4
        Digitval = dna[seq[i]]
        num+=Digitval*power_of_4//4
        i-=1
    return num
print(stringtobin("AAAT"))

Your encoding is in base 4 which can't hold the length information of your sequence.
Without the length information the encoded value 3 could mean T or TA or TAAA or TAAAA... (there would be no way to know).
If the sequences are always 4 letters long (or the length is stored/provided separately), you can implement the functions like this
def stringToBin(S):
    return sum( 4**i*"ACGT".index(p) for i,p in enumerate(S))
def binToString(N,size=4):
    result = ""
    for _ in range(size):
        N,p = divmod(N,4)
        result += "ACGT"[p]
    return result
print(stringToBin("AAAT")) # 192
print(binToString(192)) # AAAT
print(stringToBin("TA")) # 3
print(stringToBin("TAAA")) # 3
print(binToString(3)) # TAAA
print(binToString(3,2)) # TA (length has to be supplied separately)
If you want your numeric encoding to also carry the length information, you should make it base 5 and use a non-zero value for each letter. This way, TA and TAAA would give different numbers.
def stringToBin(S):
    return sum( 5**i*" ACGT".index(p) for i,p in enumerate(S))
def binToString(N):
    result = ""
    while N:
        N,p = divmod(N,5)
        result += " ACGT"[p]
    return result
print(stringToBin("TA")) # 9
print(stringToBin("TAAA")) # 159
print(binToString(9)) # TA
print(binToString(159)) # TAAA
Obviously this produces larger numbers, so a 32-bit unsigned integer will only hold 13 letters as opposed to 16 in base 4. If you're doing this to reduce storage size, text compression (e.g. zip) will probably be more efficient than converting to a fixed-base binary representation.

Your attempt seems inordinately complex. Just map the bottom two bits to a value, then shift them off.
def bintostring(num):
    seq = []
    for n in num:
        subseq = []
        for b in range(4):
            subseq.append(dna2[n & 3])
            n >>= 2
        seq.append("".join(reversed(subseq)))
    return seq
In case it's not obvious, & is bitwise AND; value & 3 obtains the bottom two bits of value.
The stringtobin function could be similarly simplified. Demo: https://ideone.com/RlzegN
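For reference, here is a round-trip sketch that follows the convention in the question (most significant letter first, so "CCCC" is 85 and 3 is "AAAT"). The names dna_to_int and int_to_dna are invented for this example, and the fixed length of 4 is an assumption taken from the question.

```python
DIGITS = "ACGT"  # A=0, C=1, G=2, T=3, as in the question


def dna_to_int(seq):
    """Encode a DNA string as a base-4 integer, most significant letter first."""
    value = 0
    for letter in seq:
        value = (value << 2) | DIGITS.index(letter)
    return value


def int_to_dna(value, length=4):
    """Decode a base-4 integer into a DNA string of the given length."""
    letters = []
    for _ in range(length):
        letters.append(DIGITS[value & 3])  # bottom two bits select the letter
        value >>= 2                        # shift them off
    return "".join(reversed(letters))


print(dna_to_int("CCCC"))  # 85
print(int_to_dna(3))       # AAAT
```

The length has to be passed in because, as noted above, a plain base-4 value cannot tell how many leading A's the sequence had.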

Algorithm to give shortest expression for one number in terms of another number

Heads up: apologies for my poor style, inefficient code, and general stupidity. I'm doing this purely out of interest. I have no delusions that I will become a professional programmer.
How would you solve this problem (assuming it can be solved)? To be clear, you want to take in an int x and an int y and return some expression in terms of y that equals x. For example, if I passed in 9 and 2, one of the shortest solutions (I'm pretty sure) would be ((2+2)x2)+(2/2), with 4 total operators. You can assume positive integers because that's what I'm doing for the most part.
I have found a fastish partial solution that returns a solution for whatever numbers, but usually not the smallest one. Here it is coded in python:
def re_express(n, a, expression):
    if n==0:
        print(expression)
    else:
        if a**a < n:
            n -= a**a
            expression += "+" + str(a) + "^" +str(a)
            re_express(n, a, expression)
        elif a*a < n:
            n -= a*a
            expression += "+" + str(a) + "*" + str(a)
            re_express(n, a, expression)
        elif a < n:
            n -= a
            expression += "+" + str(a)
            re_express(n, a, expression)
        else:
            n -= 1
            expression += "+" + str(a) + "/" + str(a)
            re_express(n, a, expression)
I also think I have one that returns a pretty small solution, but it is not guaranteed to be the smallest solution. I hand-simulated it and it got the 2 and 9 example correct, whereas my first algorithm produced a 5-operator solution: 2^2+2^2+2/2. But it gets slow quickly for large n. I wrote it down in pseudocode.
function re_express(n, a, children, root, solutions):
input: n, number to build
a, number to build with
children, an empty array
root, a node
solutions, an empty dictionary
if not root:
root.value = a
if the number of layers in the tree is greater than the size of the shortest solution, return the shortest solution
run through each mathematical operator and create a child node of root as long as the child will have a unique value
each child.value = operator(root, a)
for each child: child.parent = root
for each child: child.operator = operator as a string
loop through all children and check to see what number you need to produce n with each operator, if that number is in the list of children
when you find something like that then you have paths back up the tree, use the following loop for both children (obviously a child with only i
i = 0
while child != root:
    expression += "(" + str(a) + child.operator
    child = child.parent
expression += str(a)
for _ in range(i):
    expression += ")"
then you combine the expressions and you have one possible solution to add up to n and you have one possible answer
store the solution in solutions with the key as the number of operators used and the value as the expression in string form
re_express(n, a, children, root, solutions)
As you can tell the big-O for that algorithm is garbage and it doesn't even give a guaranteed solution, so there must be a better way.

We can use a variation of Dijkstra's algorithm to find the result. However, we don't really have a graph. If we consider numbers as nodes of the graph, then edges would start at a pair of nodes and an operation and end at another number. But this is just a detail that does not prevent Dijkstra's idea.
The idea is the following: We start with y and explore more and more numbers that we can express using a set of pre-defined operations. For every number, we store how many operations we need to express it. If we find a number that we have already seen (but the previous expression takes more operations), we update the number's attributes (this is the path length in Dijkstra's terms).
Let's do the example x = 9, y = 2.
Initially, we can represent 2 with zero operations. So, we put that in our number list:
2: 0 operations
From this list, we now need to find the number with the least number of operations. This is essential because it guarantees that we never need to visit this number again. So we take 2. We fix it and explore new numbers that we can express using 2 and all other fixed numbers. Right now, we only have 2, so we can express (using + - * /):
2: 0 operations, fixed
0: 1 operation (2 - 2)
1: 1 operation (2 / 2)
4: 1 operation (2 + 2 or 2 * 2)
On we go. Take the next number with the least number of operations (does not matter which one). I'll take 0. We can't really express new numbers with zero, so:
2: 0 operations, fixed
0: 1 operation (2 - 2), fixed
1: 1 operation (2 / 2)
4: 1 operation (2 + 2 or 2 * 2)
Now let's take 1. The number of operations for new numbers will be the sum of the input numbers' number of operations plus 1. So, using 1, we could compute:
2 + 1 = 3 with 0 + 1 + 1 = 2 operations (2 + (2 / 2))
0 + 1 = 1 with 3 operations --> the existing number of operations for `1` is better, so we do not update
1 + 1 = 2 with 3 operations
2 - 1 = 1 with 2 operations
0 - 1 = -1 with 3 operations *new*
1 - 2 = -1 with 2 operations *update*
1 - 0 = 1 with 3 operations
1 - 1 = 0 with 3 operations
2 * 1 = 2 with 2 operations
0 * 1 = 0 with 3 operations
1 * 1 = 1 with 3 operations
2 / 1 = 2 with 2 operations
0 / 1 = 0 with 3 operations
1 / 2 = invalid if you only want to consider integers, use it otherwise
1 / 0 = invalid
1 / 1 = 1 with 3 operations
So, our list is now:
2: 0 operations, fixed
0: 1 operation (2 - 2), fixed
1: 1 operation (2 / 2), fixed
3: 2 operations (2 + 2/2)
4: 1 operation (2 + 2 or 2 * 2)
-1: 2 operations (2/2 - 2)
Go on and take 4 and so on. Once you've reached your target number x, you are done in the smallest possible way.
In pseudo code:
L = create a new list with pairs of numbers and their number of operations
put (y, 0) into the list
while there are unfixed entries in the list:
    c = take the unfixed entry with the least number of operations
    fix c
    if c = x:
        finish
    for each available operation o:
        for each fixed number f in L:
            newNumber = o(c, f)
            operations = c.numberOfOperations + f.numberOfOperations + 1
            if newNumber not in L:
                add (newNumber, operations) to L
            else if operations < L[newNumber].numberOfOperations:
                update L[newNumber].numberOfOperations = operations
            repeat for o(f, c)
If you store the operation list with the numbers (or their predecessors), you can reconstruct the expression at the end.
Instead of using a simple list, a priority queue will make the retrieval of unfixed entries with the minimum number of operations fast.
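The pseudocode can be turned into runnable Python roughly as follows. This is a sketch under a few assumptions that the answer leaves open: only exact integer division is allowed, negative intermediate values are permitted, and the function name shortest_expression and its return shape are made up for illustration. It uses heapq as the priority queue and stores the expression string next to each number so the result can be reconstructed.

```python
import heapq


def shortest_expression(x, y):
    """Return (operator_count, expression) for building x out of copies of y.

    Illustrative sketch of the Dijkstra-style search; the name is hypothetical.
    """
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        # Assumption: only exact integer division counts; None marks "invalid".
        "/": lambda a, b: a // b if b != 0 and a % b == 0 else None,
    }
    best = {y: (0, str(y))}  # number -> (operator count, expression string)
    fixed = {}               # numbers whose operator count is final
    heap = [(0, y)]
    while heap:
        cost, c = heapq.heappop(heap)
        if c in fixed:
            continue         # stale queue entry, a cheaper one was fixed already
        fixed[c] = best[c]
        if c == x:
            return fixed[c]
        c_expr = best[c][1]
        for f, (f_cost, f_expr) in list(fixed.items()):
            new_cost = cost + f_cost + 1
            for symbol, fn in ops.items():
                # Try both operand orders: o(c, f) and o(f, c).
                for a, a_expr, b, b_expr in ((c, c_expr, f, f_expr),
                                             (f, f_expr, c, c_expr)):
                    value = fn(a, b)
                    if value is None:
                        continue
                    if value not in best or new_cost < best[value][0]:
                        best[value] = (new_cost, "(%s%s%s)" % (a_expr, symbol, b_expr))
                        heapq.heappush(heap, (new_cost, value))
    return None


cost, expr = shortest_expression(9, 2)
print(cost, expr)  # 4 operators; the exact expression depends on tie-breaking
```

Because every new number costs strictly more than the number just fixed, a fixed entry is never improved later, which is what makes the early return on reaching x safe. The set of reachable numbers grows quickly with the operator count, so this is only practical for small targets.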

Here is my code for this problem which I don't think is the fastest and the most optimal algorithm, but it should always find the best answer.
First, let's assume we don't have parentheses and solve the subproblem for finding the minimum number of operations to convert a to n with these operations ['+', '*', '/', '-']. For solving this problem we use BFS. First we define an equation class:
class equation:
    def __init__(self, constant, level=0, equation_list=None):
        self.constant = constant
        if equation_list != None:
            self.equation_list = equation_list
        else:
            self.equation_list = [constant]
        self.level = level
    def possible_results(self):
        return recursive_possible_results(self.equation_list)
    def get_child(self, operation):
        child_equation_list = self.equation_list + [operation]
        child_equation_list += [self.constant]
        child_level = self.level + 1
        return equation(self.constant, child_level, child_equation_list)
The constructor takes a constant, which is the constant in our expressions (here a), a level, which indicates the number of operations used in this equation, and an equation_list, which is a list representation of the equation's expression. For the first node (the root), the level is 0 and equation_list contains only the constant.
Now let's calculate all possible ways to parenthesize an equation. We use a recursive function which returns a list of all possible results and their expressions:
# The operations listed above; the original snippet used this name without defining it.
operations = ['+', '*', '/', '-']
calculated_expr = {}
def is_operation(symbol):
    return (symbol in operations)
def recursive_possible_results(equation_list):
    results = []
    if len(equation_list) == 1:
        return [{'val': equation_list[0], 'expr': str(equation_list[0])}]
    key = ''
    for i in range(len(equation_list)):
        if is_operation(equation_list[i]):
            key += equation_list[i]
    if key in calculated_expr.keys():
        return calculated_expr[key]
    for i in range(len(equation_list)):
        current_symbol = equation_list[i]
        if is_operation(current_symbol):
            left_results = recursive_possible_results(equation_list[:i])
            right_results = recursive_possible_results(equation_list[i+1:])
            for left_res in left_results:
                for right_res in right_results:
                    try:
                        res_val = eval(str(left_res['val']) + current_symbol + str(right_res['val']))
                        res_expr = '(' + left_res['expr'] + ')' + current_symbol + '(' + right_res['expr'] + ')'
                        results.append({'val': res_val, 'expr': res_expr})
                    except ZeroDivisionError:
                        pass
    calculated_expr[key] = results
    return results
(I think this is the part of the code that needs to be more optimized since we are calculating a bunch of equations more than once. Maybe a good dynamic programming algorithm ...)
For Breadth-First-Search we don't need to explore the whole tree from the root. For example for converting 2 to 197, we need at least 7 operations since our biggest number with 6 operations is 128 and is still less than 197. So, we can search the tree from level 7. Here is a code to create our first level's nodes (equations):
import math
import itertools
def create_first_level(n, constant):
    level_exprs = []
    level_number = int(math.log(n, constant)) - 1
    level_operations = list(itertools.product(operations, repeat=level_number))
    for opt_list in level_operations:
        equation_list = [constant]
        for opt in opt_list:
            equation_list.append(opt)
            equation_list.append(constant)
        level_exprs.append(equation(constant, level_number, equation_list))
    return level_exprs
In the end, we call our BFS and get the result:
def re_express(n, a):
    visit = set()
    queue = []
    # root = equation(a)
    # queue.append(root)
    # Skip levels
    queue = create_first_level(n, a)
    while queue:
        curr_node = queue.pop(0)
        for operation in operations:
            queue.append(curr_node.get_child(operation))
        possible_results = curr_node.possible_results()
        for pr in possible_results:
            if pr['val'] == n:
                print(pr['expr'])
                print('Number of operations: %d' % curr_node.level)
                queue = []
                break
re_express(9, 2)
Here is the output:
(((2)+(2))*(2))+((2)/(2))
Number of operations: 4

Returns the largest n such that R[n] = S

Write a function answer(str_S) which, given the base-10 string
representation of an integer S, returns the largest n such that R(n) =
S. Return the answer as a string in base-10 representation. If there
is no such n, return "None". S will be a positive integer no greater
than 10^25.
where R(n) is the number of zombits at time n:
R(0) = 1
R(1) = 1
R(2) = 2
R(2n) = R(n) + R(n + 1) + n (for n > 1)
R(2n + 1) = R(n - 1) + R(n) + 1 (for n >= 1)
Test cases
==========
Inputs:
(string) str_S = "7"
Output:
(string) "4"
Inputs:
(string) str_S = "100"
Output:
(string) "None"
My program below is correct but it is not scalable since here the range of S can be a very large number like 10^24. Could anyone help me with some suggestion to improve the code further so that it can cover any input case.
def answer(str_S):
    d = {0: 1, 1: 1, 2: 2}
    str_S = int(str_S)
    i = 1
    while True:
        if i > 1:
            d[i*2] = d[i] + d[i+1] + i
            if d[i*2] == str_S:
                return i*2
            elif d[i*2] > str_S:
                return None
        if i>=1:
            d[i*2+1] = d[i-1] + d[i] + 1
            if d[i*2+1] == str_S:
                return i*2 + 1
            elif d[i*2+1] > str_S:
                return None
        i += 1
print answer('7')

First of all, where are you having trouble with the scaling? I ran your code on a 30-digit number, and it seemed to complete okay. Do you have a memory limit? Python handles arbitrarily large integers, although very large ones are processed with slower arbitrary-precision arithmetic instead of native machine integers.
Given the density of R values, I suspect that you can save space as well as time if you switch to a straight array: use the value as an array index instead of a dict key.
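A different way to scale, sketched here as an alternative rather than as what this answer proposes, is to avoid tabulating R altogether: compute R(n) on demand with memoized recursion and binary-search for n. This relies on an assumption that is not stated in the question: that the even-indexed values R(0), R(2), R(4), ... and the odd-indexed values R(1), R(3), R(5), ... are each increasing. That holds for the first few dozen terms (2, 7, 13, 15, 22, ... and 1, 3, 4, 6, 11, ...), but treat it as unproven here. The code is Python 3.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def R(n):
    """Number of zombits at time n, straight from the recurrence in the question."""
    if n < 2:
        return 1
    if n == 2:
        return 2
    half, odd = divmod(n, 2)
    if odd:
        return R(half - 1) + R(half) + 1
    return R(half) + R(half + 1) + half


def answer(str_S):
    S = int(str_S)
    best = None
    for parity in (0, 1):
        # ASSUMPTION: R(2k + parity) is increasing in k, so binary search applies.
        hi = 1
        while R(2 * hi + parity) <= S:
            hi *= 2                      # grow until R(2*hi + parity) > S
        lo = 0                           # R(0) = R(1) = 1 <= S for any positive S
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if R(2 * mid + parity) <= S:
                lo = mid
            else:
                hi = mid
        n = 2 * lo + parity
        if R(n) == S and (best is None or n > best):
            best = n
    return str(best)                     # str(None) gives "None"


print(answer("7"))    # 4
print(answer("100"))  # None
```

Each call to R touches only a logarithmic number of distinct arguments, so inputs around 10^25 need a few thousand evaluations instead of a table with 10^25 entries.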

Optimizing python code

Any tips on optimizing this python code for finding next palindrome:
Input number can be of 1000000 digits
COMMENTS ADDED
#! /usr/bin/python
def inc(lst,lng):#this function first extract the left half of the string then
                 #convert it to int then increment it then reconvert it to string
                 #then reverse it and finally append it to the left half.
                 #lst is input number and lng is its length
    if(lng%2==0):
        olst=lst[:lng/2]
        l=int(lng/2)
        olst=int(olst)
        olst+=1
        olst=str(olst)
        p=len(olst)
        if l<p:
            olst2=olst[p-2::-1]
        else:
            olst2=olst[::-1]
        lst=olst+olst2
        return lst
    else:
        olst=lst[:lng/2+1]
        l=int(lng/2+1)
        olst=int(olst)
        olst+=1
        olst=str(olst)
        p=len(olst)
        if l<p:
            olst2=olst[p-3::-1]
        else:
            olst2=olst[p-2::-1]
        lst=olst+olst2
        return lst
t=raw_input()
t=int(t)
while True:
    if t>0:
        t-=1
    else:
        break
    num=raw_input()#this is input number
    lng=len(num)
    lst=num[:]
    if(lng%2==0):#this if find next palindrome to num variable
                 #without incrementing the middle digit and store it in lst.
        olst=lst[:lng/2]
        olst2=olst[::-1]
        lst=olst+olst2
    else:
        olst=lst[:lng/2+1]
        olst2=olst[len(olst)-2::-1]
        lst=olst+olst2
    if int(num)>=int(lst):#chk if lst satisfies criteria for next palindrome
        num=inc(num,lng)#otherwise call inc function
        print num
    else:
        print lst

I think most of the time in this code is spent converting strings to integers and back. The rest is slicing strings and bouncing around in the Python interpreter. What can be done about these three things? There are a few unnecessary conversions in the code, which we can remove. I see no way to avoid the string slicing. To minimize your time in the interpreter you just have to write as little code as possible :-) and it also helps to put all your code inside functions.
The code at the bottom of your program, which takes a quick guess to try and avoid calling inc(), has a bug or two. Here's how I might write that part:
def nextPal(num):
    lng = len(num)
    guess = num[:lng//2] + num[(lng-1)//2::-1] # works whether lng is even or odd
    if guess > num: # don't bother converting to int
        return guess
    else:
        return inc(num, lng)
This simple change makes your code about 100x faster for numbers where inc doesn't need to be called, and about 3x faster for numbers where it does need to be called.
To do better than that, I think you need to avoid converting to int entirely. That means incrementing the left half of the number without using ordinary Python integer addition. You can use an array and carry out the addition algorithm "by hand":
import array
def nextPal(numstr):
    # If we don't need to increment, just reflect the left half and return.
    n = len(numstr)
    h = n//2
    guess = numstr[:n-h] + numstr[h-1::-1]
    if guess > numstr:
        return guess
    # Increment the left half of the number without converting to int.
    a = array.array('b', numstr)
    zero = ord('0')
    ten = ord('9') + 1
    for i in range(n - h - 1, -1, -1):
        d = a[i] + 1
        if d == ten:
            a[i] = zero
        else:
            a[i] = d
            break
    else:
        # The left half was all nines. Carry the 1.
        # Update n and h since the length changed.
        a.insert(0, ord('1'))
        n += 1
        h = n//2
    # Reflect the left half onto the right half.
    a[n-h:] = a[h-1::-1]
    return a.tostring()
This is another 9x faster or so for numbers that require incrementing.
You can make this a touch faster by using a while loop instead of for i in range(n - h - 1, -1, -1), and about twice as fast again by having the loop update both halves of the array rather than just updating the left-hand half and then reflecting it at the end.
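The code above is Python 2 (array.array('b', numstr) and tostring() behave differently in Python 3). As a rough Python 3 sketch of the same idea, reflecting the left half and incrementing it digit by digit without ever converting to int, something like the following should work. The name next_palindrome is made up here, and the all-nines case is handled as a special case rather than by inserting into an array.

```python
def next_palindrome(numstr: str) -> str:
    """Return the smallest palindrome strictly greater than numstr (a decimal string).

    Hypothetical Python 3 variant; no int() conversion of the big number is needed.
    """
    n = len(numstr)
    half = n // 2
    left = numstr[:n - half]            # includes the middle digit when n is odd
    guess = left + left[:half][::-1]    # reflect the left half onto the right
    if guess > numstr:                  # same length, so string comparison is enough
        return guess
    # Increment the left half by hand, carrying from right to left.
    digits = list(left)
    i = len(digits) - 1
    while i >= 0 and digits[i] == '9':
        digits[i] = '0'
        i -= 1
    if i < 0:
        # The input was all nines, e.g. 999 -> 1001.
        return '1' + '0' * (n - 1) + '1'
    digits[i] = chr(ord(digits[i]) + 1)
    return ''.join(digits) + ''.join(digits[:half][::-1])


print(next_palindrome("12345"))  # 12421
print(next_palindrome("999"))    # 1001
```

The left half can only be all nines when the guess is all nines too, so the carry can run off the front only if the whole input consists of nines; that is why the special case is enough.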

You don't have to find the palindrome, you can just generate it.
Split the input number, and reflect it. If the generated number is too small, then increment the left hand side and reflect it again:
def nextPal(n):
    ns = str(n)
    oddoffset = 0
    if len(ns) % 2 != 0:
        oddoffset = 1
    leftlen = len(ns) / 2 + oddoffset
    lefts = ns[0:leftlen]
    right = lefts[::-1][oddoffset:]
    p = int(lefts + right)
    if p < n:
        ## Need to increment middle digit
        left = int(lefts)
        left += 1
        lefts = str(left)
        right = lefts[::-1][oddoffset:]
        p = int(lefts + right)
    return p
def test(n):
    print n
    p = nextPal(n)
    assert p >= n
    print p
test(1234567890)
test(123456789)
test(999999)
test(999998)
test(888889)
test(8999999)

EDIT
NVM, just look at this page: http://thetaoishere.blogspot.com/2009/04/finding-next-palindrome-given-number.html

Using strings. n >= 0
from math import floor, ceil, log10
def next_pal(n):
    # returns next palindrome, param is an int
    n10 = str(n)
    m = len(n10) / 2.0
    s, e = int(floor(m - 0.5)), int(ceil(m + 0.5))
    start, middle, end = n10[:s], n10[s:e], n10[e:]
    assert (start, middle[0]) == (end[-1::-1], middle[-1]) #check that n is actually a palindrome
    r = int(start + middle[0]) + 1 #where the actual increment occurs (i.e. add 1)
    r10 = str(r)
    i = 3 - len(middle)
    if len(r10) > len(start) + 1:
        i += 1
    return int(r10 + r10[-i::-1])
Using log, more optimized. n > 9
def next_pal2(n):
    k = log10(n + 1)
    l = ceil(k)
    s, e = int(floor(l/2.0 - 0.5)), int(ceil(l/2.0 + 0.5))
    mmod, emod = 10**(e - s), int(10**(l - e))
    start, end = divmod(n, emod)
    start, middle = divmod(start, mmod)
    r1 = 10*start + middle%10 + 1
    i = middle > 9 and 1 or 2
    j = s - i + 2
    if k == l:
        i += 1
    r2 = int(str(r1)[-i::-1])
    return r1*10**j + r2

Average of two strings in alphabetical/lexicographical order

Suppose you take the strings 'a' and 'z' and list all the strings that come between them in alphabetical order: ['a','b','c' ... 'x','y','z']. Take the midpoint of this list and you find 'm'. So this is kind of like taking an average of those two strings.
You could extend it to strings with more than one character, for example the midpoint between 'aa' and 'zz' would be found in the middle of the list ['aa', 'ab', 'ac' ... 'zx', 'zy', 'zz'].
Might there be a Python method somewhere that does this? If not, even knowing the name of the algorithm would help.
I began making my own routine that simply goes through both strings and finds midpoint of the first differing letter, which seemed to work great in that 'aa' and 'az' midpoint was 'am', but then it fails on 'cat', 'doggie' midpoint which it thinks is 'c'. I tried Googling for "binary search string midpoint" etc. but without knowing the name of what I am trying to do here I had little luck.
I added my own solution as an answer

If you define an alphabet of characters, you can just convert to base 10, do an average, and convert back to base-N where N is the size of the alphabet.
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def enbase(x):
    n = len(alphabet)
    if x < n:
        return alphabet[x]
    return enbase(x/n) + alphabet[x%n]
def debase(x):
    n = len(alphabet)
    result = 0
    for i, c in enumerate(reversed(x)):
        result += alphabet.index(c) * (n**i)
    return result
def average(a, b):
    a = debase(a)
    b = debase(b)
    return enbase((a + b) / 2)
print average('a', 'z') #m
print average('aa', 'zz') #mz
print average('cat', 'doggie') #budeel
print average('google', 'microsoft') #gebmbqkil
print average('microsoft', 'google') #gebmbqkil
Edit: Based on comments and other answers, you might want to handle strings of different lengths by appending the first letter of the alphabet to the shorter word until they're the same length. This will result in the "average" falling between the two inputs in a lexicographical sort. Code changes and new outputs below.
def pad(x, n):
    p = alphabet[0] * (n - len(x))
    return '%s%s' % (x, p)
def average(a, b):
    n = max(len(a), len(b))
    a = debase(pad(a, n))
    b = debase(pad(b, n))
    return enbase((a + b) / 2)
print average('a', 'z') #m
print average('aa', 'zz') #mz
print average('aa', 'az') #m (equivalent to ma)
print average('cat', 'doggie') #cumqec
print average('google', 'microsoft') #jlilzyhcw
print average('microsoft', 'google') #jlilzyhcw

If you mean alphabetically, simply use FogleBird's algorithm but reverse the parameters and the result!
>>> print average('cat'[::-1], 'doggie'[::-1])[::-1]
cumdec
or rewriting average like so
>>> def average(a, b):
... a = debase(a[::-1])
... b = debase(b[::-1])
... return enbase((a + b) / 2)[::-1]
...
>>> print average('cat', 'doggie')
cumdec
>>> print average('google', 'microsoft')
jlvymlupj
>>> print average('microsoft', 'google')
jlvymlupj

It sounds like what you want is to treat alphabetical characters as a base-26 value between 0 and 1. When you have strings of different lengths (an example in base 10), say 305 and 4202, you're coming out with a midpoint of 3, since you're looking at the characters one at a time. Instead, treat them as a floating point mantissa: 0.305 and 0.4202. From that, it's easy to come up with a midpoint of .3626 (you can round if you'd like).
Do the same with base 26 (a=0...z=25, ba=26, bb=27, etc.) to do the calculations for letters:
cat becomes 'a.cat' and doggie becomes 'a.doggie', doing the math gives cat a decimal value of 0.078004096, doggie a value of 0.136390697, with an average of 0.107197397 which in base 26 is roughly "cumcqo"
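The fraction idea can be carried out exactly with integers, which sidesteps floating point precision. The following Python 3 sketch pads both strings with 'a' (the zero digit) to a common width plus one extra digit, averages the resulting base-26 integers, and converts back. The function name midpoint and the one-extra-digit choice are assumptions made for this example; the result is always one character longer than the longer input, so it will not match the outputs listed in the other answers.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def midpoint(a, b):
    """Return a string that sorts between a and b, treating both as base-26 fractions.

    Hypothetical helper; uses exact integer arithmetic instead of floats.
    """
    width = max(len(a), len(b)) + 1  # one extra digit leaves room for a ".5"

    def to_int(s):
        value = 0
        for ch in s.ljust(width, ALPHABET[0]):  # pad with the zero digit 'a'
            value = value * 26 + ALPHABET.index(ch)
        return value

    mid = (to_int(a) + to_int(b)) // 2
    digits = []
    for _ in range(width):
        mid, r = divmod(mid, 26)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


print(midpoint("a", "z"))         # mn
print(midpoint("cat", "doggie"))  # some string that sorts between 'cat' and 'doggie'
```

As long as the two inputs differ by more than trailing 'a' characters, the result sorts strictly between them, which is the property needed for the key-splitting use case mentioned later in the thread.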

Based on your proposed usage, consistent hashing ( http://en.wikipedia.org/wiki/Consistent_hashing ) seems to make more sense.

Thanks to everyone who answered, but I ended up writing my own solution because the others weren't exactly what I needed. I am trying to average app engine key names, and after studying them a bit more I discovered they actually allow any 7-bit ASCII characters in the names. Additionally, I couldn't really rely on the solutions that first converted the key names to floating point, because I suspected floating point accuracy just isn't enough.
To take an average, first you add two numbers together and then divide by two. These are both such simple operations that I decided to just make functions to add and divide base 128 numbers represented as lists. This solution hasn't been used in my system yet so I might still find some bugs in it. Also it could probably be a lot shorter, but this is just something I needed to get done instead of trying to make it perfect.
# Given two lists representing a number with one digit left to decimal point and the
# rest after it, for example 1.555 = [1,5,5,5] and 0.235 = [0,2,3,5], returns a similar
# list representing those two numbers added together.
#
def ladd(a, b, base=128):
    i = max(len(a), len(b))
    lsum = [0] * i
    while i > 1:
        i -= 1
        av = bv = 0
        if i < len(a): av = a[i]
        if i < len(b): bv = b[i]
        lsum[i] += av + bv
        if lsum[i] >= base:
            lsum[i] -= base
            lsum[i-1] += 1
    return lsum
# Given a list of digits after the decimal point, returns a new list of digits
# representing that number divided by two.
#
def ldiv2(vals, base=128):
    vs = vals[:]
    vs.append(0)
    i = len(vs)
    while i > 0:
        i -= 1
        if (vs[i] % 2) == 1:
            vs[i] -= 1
            vs[i+1] += base / 2
        vs[i] = vs[i] / 2
    if vs[-1] == 0: vs = vs[0:-1]
    return vs
# Given two app engine key names, returns the key name that comes between them.
#
def average(a_kn, b_kn):
    m = lambda x:ord(x)
    a = [0] + map(m, a_kn)
    b = [0] + map(m, b_kn)
    avg = ldiv2(ladd(a, b))
    return "".join(map(lambda x:chr(x), avg[1:]))
print average('a', 'z') # m#
print average('aa', 'zz') # n-#
print average('aa', 'az') # am#
print average('cat', 'doggie') # d(mstr#
print average('google', 'microsoft') # jlim.,7s:
print average('microsoft', 'google') # jlim.,7s:

import math
def avg(str1,str2):
    y = ''
    s = 'abcdefghijklmnopqrstuvwxyz'
    for i in range(len(str1)):
        x = s.index(str2[i])+s.index(str1[i])
        x = math.floor(x/2)
        y += s[x]
    return y
print(avg('z','a')) # m
print(avg('aa','az')) # am
print(avg('cat','dog')) # chm
Still working on strings with different lengths... any ideas?

This version thinks 'abc' is a fraction like 0.abc. In this approach space is zero and a valid input/output.
MAX_ITER = 10
letters = " abcdefghijklmnopqrstuvwxyz"
def to_double(name):
    d = 0
    for i, ch in enumerate(name):
        idx = letters.index(ch)
        d += idx * len(letters) ** (-i - 1)
    return d
def from_double(d):
    name = ""
    for i in range(MAX_ITER):
        d *= len(letters)
        name += letters[int(d)]
        d -= int(d)
    return name
def avg(w1, w2):
    w1 = to_double(w1)
    w2 = to_double(w2)
    return from_double((w1 + w2) * 0.5)
print avg('a', 'a') # 'a'
print avg('a', 'aa') # 'a mmmmmmmm'
print avg('aa', 'aa') # 'a zzzzzzzz'
print avg('car', 'duck') # 'cxxemmmmmm'
Unfortunately, the naïve algorithm is not able to detect the periodic 'z's; this is something like 0.99999 in decimal. Therefore 'a zzzzzzzz' is actually 'aa' (the space before the 'z' periodicity must be increased by one).
In order to normalise this, you can use the following function
def remove_z_period(name):
    if len(name) != MAX_ITER:
        return name
    if name[-1] != 'z':
        return name
    n = ""
    overflow = True
    for ch in reversed(name):
        if overflow:
            if ch == 'z':
                ch = ' '
            else:
                ch=letters[(letters.index(ch)+1)]
                overflow = False
        n = ch + n
    return n
print remove_z_period('a zzzzzzzz') # 'aa'

I haven't programmed in python in a while and this seemed interesting enough to try.
Bear with my recursive programming. Too many functional languages look like python.
def stravg_half(a, ln):
    # If you have a problem it will probably be in here.
    # The floor of the character's value is 0, but you may want something different
    f = 0
    #f = ord('a')
    L = ln - 1
    if 0 == L:
        return ''
    A = ord(a[0])
    return chr(A/2) + stravg_half( a[1:], L)
def stravg_helper(a, b, ln, x):
    L = ln - 1
    A = ord(a[0])
    B = ord(b[0])
    D = (A + B)/2
    if 0 == L:
        if 0 == x:
            return chr(D)
        # NOTE: The caller of helper makes sure that len(a)>=len(b)
        return chr(D) + stravg_half(a[1:], x)
    return chr(D) + stravg_helper(a[1:], b[1:], L, x)
def stravg(a, b):
    la = len(a)
    lb = len(b)
    if 0 == la:
        if 0 == lb:
            return a # which is empty
        return stravg_half(b, lb)
    if 0 == lb:
        return stravg_half(a, la)
    x = la - lb
    if x > 0:
        return stravg_helper(a, b, lb, x)
    return stravg_helper(b, a, la, -x) # Note the order of the args
