Get the string that is the midpoint between two other strings - python

Is there a library or code snippet available that can take two strings and return the exact or approximate mid-point string between the two strings?
Preferably the code would be in Python.
Background:
This seems like a simple problem on the surface, but I'm kind of struggling with it:
Clearly, the midpoint string between "A" and "C" would be "B".
With base64 encoding, the midpoint string between "A" and "B" would probably be "Ag"
With UTF-8 encoding, I'm not sure what the valid midpoint would be because the middle character seems to be a control character: U+0088 c2 88 <control>
Practical Application:
The reason I am asking is that I was hoping to write a map-reduce type algorithm to read all of the entries out of our database and process them. The primary keys in the database are UTF-8 encoded strings with random distributions of characters. The database we are using is Cassandra.
I was hoping to get the lowest key and the highest key out of the database, break that range in two by finding the midpoint, then break those two ranges into smaller sections by finding each of their midpoints, and so on until I had a few thousand sections that I could read asynchronously.
Example if the strings were base-16 encoded: (Some of the midpoints are approximate):
    Starting highest and lowest keys:     '000'                  'FFF'
                                         /     \                /     \
                                     '000'     '8'          '8'       'FFF'
                                     /  \      /  \         /  \       /  \
    Result:                      '000' '4'  '4' '8'     '8' 'B8'  'B8' 'FFF'
    (After 3 levels of recursion)

Unfortunately not all sequences of bytes are valid UTF-8, so it's not trivial to just take the midpoint of the UTF-8 values, like the following.
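    def midpoint(s, e):
        '''Midpoint of start and end strings'''
        (sb, eb) = (int.from_bytes(bytes(x, 'utf-8'), byteorder='big') for x in (s, e))
        # Integer division keeps full precision; float division loses it for long keys.
        midpoint = (eb - sb) // 2 + sb
        # Use the smallest number of bytes that holds the value (at least one).
        midpoint_bytes = midpoint.to_bytes(max(1, (midpoint.bit_length() + 7) // 8), byteorder='big')
        # Raises UnicodeDecodeError when the bytes are not valid UTF-8.
        return midpoint_bytes.decode('utf-8')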
Basically this code converts each string into an integer represented by the sequence of bytes in memory, finds the midpoint of those two integers, and attempts to interpret the "midpoint" bytes as UTF-8 again.
Depending on exactly what behavior you would like, the next step could be to replace the invalid bytes in midpoint_bytes with some kind of replacement character to form a valid UTF-8 string. For your problem it might not matter much exactly which character you use for the replacement so long as you're consistent.
However, since you're trying to partition the data and don't seem to care too much about the string representation of the midpoint, another option is to just leave the midpoint representation as an integer and convert the keys to integers while doing the partition. Depending on the scale of your problem this option may or may not be feasible.
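The integer-based option could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `key_to_int` and `split_points` are hypothetical, and every key is right-padded with zero bytes to a common width (which you would have to choose) so that integer order matches byte order.

```python
def key_to_int(key, width):
    """Big-endian integer for a UTF-8 key, right-padded with zero bytes to `width`."""
    raw = key.encode('utf-8')
    if len(raw) > width:
        raise ValueError('key is longer than the chosen width')
    return int.from_bytes(raw.ljust(width, b'\x00'), byteorder='big')


def split_points(lo, hi, parts):
    """Return parts + 1 integer boundaries spaced evenly from lo to hi."""
    return [lo + (hi - lo) * i // parts for i in range(parts + 1)]


WIDTH = 3  # assumed maximum key length in bytes
lo = key_to_int('000', WIDTH)
hi = key_to_int('FFF', WIDTH)
bounds = split_points(lo, hi, 4)
# Consecutive pairs of boundaries describe the sub-ranges to read.
ranges = list(zip(bounds, bounds[1:]))
```

Each key read from the database would then be converted with the same `key_to_int` call and compared against the boundaries to decide which section it belongs to.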

Here's a general solution that gives an approximate midpoint m between any two Unicode strings a and b, such that a < m < b if possible:
    from os.path import commonprefix

    # This should be set according to the range and frequency of
    # characters used.
    MIDCHAR = u'm'

    def midpoint(a, b):
        prefix = commonprefix((a, b))
        p = len(prefix)
        # Find the codepoints at the position where the strings differ.
        ca = ord(a[p]) if len(a) > p else None
        cb = ord(b[p])
        # Find the approximate middle code point.
        cm = (cb // 2 if ca is None else (ca + cb) // 2)
        # If a middle code point was found, add it and return.
        if ca < cm < cb:
            return prefix + unichr(cm)
        # If b still has more characters after this, then just use
        # b's code point and return.
        if len(b) > p + 1:
            return prefix + unichr(cb)
        # Otherwise, if cb == 0, then a and b are consecutive so there
        # is no midpoint. Return a.
        if cb == 0:
            return a
        # Otherwise, use part of a and an extra character so that
        # the result is greater than a.
        i = p + 1
        while i < len(a) and a[i] >= MIDCHAR:
            i += 1
        return a[:i] + MIDCHAR
The function assumes that a < b. Other than that, it should work with arbitrary Unicode strings, even ones containing u'\x00' characters. Note also that it may return strings containing u'\x00' or other nonstandard code points. If there is no midpoint due to b == a + u'\x00' then a is returned.
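The function above is written for Python 2 (`unichr`, and a comparison between `None` and an integer). A possible Python 3 port, together with the recursive range splitting described in the question, might look like the sketch below. The `split_range` helper and its `depth` parameter are hypothetical names added for illustration, and the port assumes `a < b` just as the original does.

```python
from os.path import commonprefix

# Should be set according to the range and frequency of characters used.
MIDCHAR = 'm'


def midpoint(a, b):
    """Approximate midpoint m with a < m < b where possible (assumes a < b)."""
    prefix = commonprefix((a, b))
    p = len(prefix)
    ca = ord(a[p]) if len(a) > p else None
    cb = ord(b[p])
    cm = cb // 2 if ca is None else (ca + cb) // 2
    # Python 3 cannot compare None with an int, so test for None explicitly.
    if (ca is None or ca < cm) and cm < cb:
        return prefix + chr(cm)
    if len(b) > p + 1:
        return prefix + chr(cb)
    if cb == 0:
        return a  # a and b are consecutive, so there is no midpoint
    i = p + 1
    while i < len(a) and a[i] >= MIDCHAR:
        i += 1
    return a[:i] + MIDCHAR


def split_range(lo, hi, depth):
    """Split (lo, hi) into up to 2**depth contiguous sub-ranges."""
    if depth == 0:
        return [(lo, hi)]
    mid = midpoint(lo, hi)
    if mid == lo:  # no midpoint exists, stop splitting here
        return [(lo, hi)]
    return split_range(lo, mid, depth - 1) + split_range(mid, hi, depth - 1)


ranges = split_range('000', 'FFF', 2)
```

Because the midpoints are taken over code points rather than hex digits, the boundaries will generally not be hex strings even when the keys are.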

Related

Can f-strings auto-pad to [the next] even number of digits on output?

Based on this answer (among others), it seems like f-strings are [one of] the preferred ways to convert to hexadecimal representation.
One can specify an explicit target length to pad to with leading zeroes, but my goal is output with an even number of digits for inputs with an arbitrary number of bits. I can imagine:
- pre-processing to determine the number of bits of the input, to feed an input-specific value into the f-string, or
- post-processing à la out = "0"+f"{val:x}" if len(f"{val:x}") % 2 else f"{val:02x}" (or even using .zfill())
The latter seems like it might be more efficient than the former. Is there a built-in way to do this with f-strings, or a better alternative?
Examples of input + expected output:
[0x]1 -> [0x]01
[0x]22 -> [0x]22
[0x]333 -> [0x]0333
[0x]4444 -> [0x]4444
and so on.
Here's a postprocessing alternative that uses assignment expressions (Python 3.8+):
print((len(hx:=f"{val:x}") % 2) * '0' + hx)
If you still want a one-liner without assignment expressions you have to evaluate your f-string twice:
print((len(f"{val:x}") % 2) * '0' + f"{val:x}")
As a two-liner:
hx = f"{val:x}"
print((len(hx) % 2) * '0' + hx)
And one more version:
print(f"{'0'[:len(hex(val))%2]}{val:x}")
I don't think there's anything built in to f-string formatting that will do this. You probably have to figure out what the "natural" width would be then round that up to the next even number.
Something like this:
    def hf(n):
        width = len(hex(n)) - 2 # account for leading 0x
        width += width % 2 # round up
        return f'{n:0{width}x}'

    print(hf(1))
    print(hf(15))
    print(hf(16))
    print(hf(255))
    print(hf(256))
Output:
01
0f
10
ff
0100
You can use a variable in the pad-length part of the f-string. For example:
n = 4
val = 257
print(f"{val:0{n}x}") # 0101
Now, to figure out how many hex characters are in an integer, you just need to find how many bits are in the integer:
hex_count, rem = divmod(max(1, val.bit_length()), 4)
hex_count += (rem > 0)
(max(1, val.bit_length()) handles the case where val == 0, which has a bit length of 0)
So let's get the next even number after hex_count:
pad_length = hex_count + (hex_count % 2)
print(f"{val:0{pad_length}x}") # 0101
I'm not sure if this is any better than simply converting it to a hex string and then figuring out how much padding is needed, but I can't think of a readable way to do this all in an f-string. An unreadable way would be by combining all of the above into a single line, but IMO readable code is better than unreadable one-liners. I don't think there's a way to specify what you want as a simple f-string.
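Note that for negative numbers the - sign counts toward the length, so the snippets above produce an even number of characters including the sign (for example, -1 stays -1), not an even number of digits.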
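The bit-length approach could be wrapped in a small helper along these lines. The name `even_hex` is hypothetical, and the sketch assumes non-negative integers only.

```python
def even_hex(val):
    """Lowercase hex of a non-negative int, zero-padded to an even digit count."""
    if val < 0:
        raise ValueError('this sketch only handles non-negative integers')
    # Number of hex digits needed; 0 has bit_length 0 but still needs one digit.
    digits = max(1, (val.bit_length() + 3) // 4)
    # Round up to the next even number.
    width = digits + digits % 2
    return f"{val:0{width}x}"


for sample in (0x1, 0x22, 0x333, 0x4444):
    print(even_hex(sample))
```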

How to call an index value from an itertools permutation without converting it to a list?

I need to create all combinations of these characters:
'0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
That are 100 letters long, such as:
'0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001'
I'm currently using this code:
    import itertools
    k_c = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '  # the characters listed above
    babel = itertools.product(k_c, repeat=100)
This code works, but I need to be able to return the combination at a certain index. However, itertools.product does not support indexing, turning the product into a list yields a MemoryError, and iterating through the product until it reaches a certain index takes too long for indices over a billion.
Thanks for any help
With 64 characters and 100 letters there will be 64^100 combinations. For each value of the first letter, there will be 64^99 combinations of the remaining letters, then 64^98, 64^97, and so on.
This means that your Nth combination can be expressed as N in base 64 where each "digit" represents the index of the letter in the string.
An easy solution would be to build the string recursively by progressively determining the index of each position and getting the rest of the string with the remainder of N:
chars = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
    def comboNumber(n,size=100):
        if size == 1: return chars[n]
        return comboNumber(n//len(chars),size-1)+chars[n%len(chars)]
output:
c = comboNumber(123456789000000000000000000000000000000000000123456789)
print(c)
# 000000000000000000000000000000000000000000000000000000000000000000000059.90jDxZuy6drpQdWATyZ8007dNJs
c = comboNumber(1083232247617211325080159061900470944719547986644358934)
print(c)
# 0000000000000000000000000000000000000000000000000000000000000000000000Python.Person says Hello World
Conversely, if you want to know at which combination index a particular string is located, you can compute the base-64 value by combining the character index (digit) at each position:
    s = "Python.Person says Hello World" # leading zeroes are implied
    i = 0
    for c in s:
        i = i*len(chars)+chars.index(c)
    print(i) # 1083232247617211325080159061900470944719547986644358934
You are now that much closer to understanding Base64 encoding, which is the same idea applied to 24-bit numbers coded over 4 characters (i.e. 3 binary bytes --> 4 alphanumeric characters), or any variant thereof.
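An iterative variant of the same idea, which avoids recursion and works for any alphabet and length, might look like this sketch. The name `nth_product` is hypothetical; the result is meant to match the element at index `n` of `itertools.product(alphabet, repeat=size)`, joined into a string.

```python
import itertools


def nth_product(n, alphabet, size):
    """Return the n-th element of product(alphabet, repeat=size) as a string."""
    base = len(alphabet)
    if not 0 <= n < base ** size:
        raise IndexError('index out of range for this alphabet and size')
    digits = []
    for _ in range(size):
        # Peel off the least significant "digit" first.
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    # Digits were collected right to left, so reverse them.
    return ''.join(reversed(digits))


# Small sanity check against itertools itself.
small = [''.join(t) for t in itertools.product('abc', repeat=2)]
same = all(nth_product(i, 'abc', 2) == s for i, s in enumerate(small))
```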

decode/revert characters with shift in python

I have a function. The input is a word, and for each character the running result is shifted left by 8 bits and the character's code is added to it.
    def magic2(b):
        res = 0
        for c in b:
            res = (res << 8) + ord(c)
            print(res)
        return res
Because it uses shifts, I will lose some data. I want to decode/reverse this and get back the exact letters of the input word.
For example, if the input is "saman", the result is 495555797358, and step by step the output is:
115
29537
7561581
1935764833
495555797358
How can I get back to the input word just with these outputs?
Consider what you're doing: for each character, you shift left by 8 bits, and then add on another 8 bits. [1]
So, how do you undo that? Well, for each character, you grab the rightmost 8 bits, then shift everything else right by 8 bits. How do you know when you're done? When shifting right by 8 bits leaves you with 0, you must have just gotten the leftmost character. So:
    def unmagic2(n):
        while n > 0:
            c = chr(n & 0xff) # 0xff is (1 << 8) - 1
            n = n >> 8
Now you just have to figure out what to do with each c to get your original string back. It's not quite as trivial as you might at first think, because we're getting the leftmost character last, not first. But you should be able to figure it out from here.
[1] If you're using the full gamut of Unicode, this is of course lossy, because you're shifting left by 8 bits and then adding on another 21 bits, so there's no way to reverse that. But I'll assume you're using Latin-1 strings here, or bytes, or Python 2 str.
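One way to finish the decoder is to collect the characters and reverse them at the end. The sketch below assumes every character of the original word fits in 8 bits (Latin-1); `unmagic2_bytes` is a hypothetical alternative that leans on `int.to_bytes` in Python 3.

```python
def unmagic2(n):
    """Recover the word from the integer produced by magic2 (8-bit characters only)."""
    chars = []
    while n > 0:
        chars.append(chr(n & 0xff))  # rightmost 8 bits are the last character
        n >>= 8
    # Characters were collected last to first, so reverse them.
    return ''.join(reversed(chars))


def unmagic2_bytes(n):
    """Same result via int.to_bytes, decoding the bytes as Latin-1."""
    length = (n.bit_length() + 7) // 8
    return n.to_bytes(length, 'big').decode('latin-1')


print(unmagic2(495555797358))
```

Both versions drop leading characters whose code is 0, since those leave no trace in the integer.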

Save list of numbers to (binary) file with defined bits per number

I have a list/array of numbers, which I want to save to a binary file.
The crucial part is, that each number should not be saved as a pre-defined data type.
The bits per value are constant for all values in the list but do not correspond to the typical data types (e.g. byte or int).
import numpy as np
# create 10 random numbers in range 0-63
values = np.int32(np.round(np.random.random(10)*63));
# each value requires exactly 6 bits
# how to save this to a file?
# just for debug/information: bit string representation
bitstring = "".join(map(lambda x: str(bin(x)[2:]).zfill(6), values));
print(bitstring)
In the real project, there are more than a million values I want to store with a given bit depth.
I already tried the bitstring module, but appending each value to the BitArray takes a lot of time...
There may be some numpy-specific way that makes things easier, but here's a pure Python (2.x) way to do it. It first converts the list of values into a single integer, since Python supports int values of any length. Next it converts that int value into a string of bytes and writes it to the file.
Note: If you're sure all the values will fit within the bit-width specified, the array_to_int() function could be sped up slightly by changing the (value & mask) it's using to just value.
import random
    def array_to_int(values, bitwidth):
        mask = 2**bitwidth - 1
        shift = bitwidth * (len(values)-1)
        integer = 0
        for value in values:
            integer |= (value & mask) << shift
            shift -= bitwidth
        return integer

    # In Python 2.7 int and long don't have the "to_bytes" method found in Python 3.x,
    # so here's one way to do the same thing.
    def to_bytes(n, length):
        return ('%%0%dx' % (length << 1) % n).decode('hex')[-length:]

    BITWIDTH = 6
    #values = [random.randint(0, 2**BITWIDTH - 1) for _ in range(10)]
    values = [0b000001 for _ in range(10)] # create fixed pattern for debugging
    values[9] = 0b011111 # make last one different so it can be spotted
    # just for debug/information: bit string representation
    bitstring = "".join(map(lambda x: bin(x)[2:].zfill(BITWIDTH), values))
    print(bitstring)
    bigint = array_to_int(values, BITWIDTH)
    width = BITWIDTH * len(values)
    print('{:0{width}b}'.format(bigint, width=width)) # show integer's value in binary
    # Round up to a whole number of 8-bit bytes (no extra byte when width is a multiple of 8).
    num_bytes = (width + 7) // 8
    with open('data.bin', 'wb') as file:
        file.write(to_bytes(bigint, num_bytes))
Since you give an example with a string, I'll assume that's how you get the results. This means performance is probably never going to be great. If you can, try creating bytes directly instead of via a string.
Side note: I'm using Python 3 which might require you to make some changes for Python 2. I think this code should work directly in Python 2, but there are some changes around bytearrays and strings between 2 and 3, so make sure to check.
    byt = bytearray(len(bitstring)//8 + 1)
    for i, b in enumerate(bitstring):
        byt[i//8] += (b=='1') << i%8
and for getting the bits back:
    bitret = ''
    for b in byt:
        for i in range(8):
            bitret += str((b >> i) & 1)
For millions of bits/bytes you'll want to convert this to a streaming method instead, as you'd need a lot of memory otherwise.
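A chunked variant that avoids both the intermediate bit string and one huge integer could be sketched like this in Python 3. It relies on the fact that 8 values of `bitwidth` bits always fill exactly `bitwidth` bytes. The names `pack_values` and `unpack_values` are hypothetical, the final group is zero-padded (so the value count has to be stored separately), and the inputs are assumed to be plain Python ints (for a numpy array, call `.tolist()` first).

```python
def pack_values(values, bitwidth):
    """Pack ints of `bitwidth` bits each, 8 values per `bitwidth`-byte group."""
    mask = (1 << bitwidth) - 1
    out = bytearray()
    for start in range(0, len(values), 8):
        group = values[start:start + 8]
        chunk = 0
        for v in group:
            chunk = (chunk << bitwidth) | (v & mask)
        # Zero-pad a short final group so it still fills `bitwidth` bytes.
        chunk <<= bitwidth * (8 - len(group))
        out += chunk.to_bytes(bitwidth, 'big')
    return bytes(out)


def unpack_values(data, count, bitwidth):
    """Inverse of pack_values; `count` is the number of values originally packed."""
    mask = (1 << bitwidth) - 1
    values = []
    for start in range(0, len(data), bitwidth):
        chunk = int.from_bytes(data[start:start + bitwidth], 'big')
        for i in range(8):
            values.append((chunk >> (bitwidth * (7 - i))) & mask)
    return values[:count]


values = [0b000001] * 9 + [0b011111]
packed = pack_values(values, 6)
restored = unpack_values(packed, len(values), 6)
```

Writing `packed` to a file opened in `'wb'` mode, group by group if needed, keeps memory use bounded for very long lists.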

Need Assistance in Calculating Checksum

I am working on an interface in Python to a home automation system (ElkM1). I have sample code in C# below which apparently correctly calculates the checksum needed when sending messages to this system. I put together the python code below but it doesn't appear to be returning the correct value.
According to the documentation, the checksum of the message needs to be the sum of the ASCII values of the message mod 256, then taken as two's complement. From their manual: "This is the hexadecimal two's complement of the modulo-256 sum of the ASCII values of all characters in the message excluding the checksum itself and the CR-LF terminator at the end of the message. Permissible characters are ASCII 0-9 and upper case A-F. When all the characters are added to the Checksum, the value should equal 0."
The vendor has a tool which will calculate the correct checksum. As test data I have been using '00300005000', which should return a checksum of 74.
My code returns 18.
Thanks in advance.
My Code (Python)
    def calc_checksum (string):
        '''
        Calculates checksum for sending commands to the ELKM1.
        Sums the ASCII character values mod256 and takes
        the Twos complement
        '''
        sum= 0
        for i in range(len(string)) :
            sum = sum + ord(string[i])
        temp = sum % 256 #mod256
        rem = temp ^ 256 #inverse
        cc1 = hex(rem)
        cc = cc1.upper()
        p=len(cc)
        return cc[p-2:p]
Their Code C#:
    private string checksum(string s)
    {
        int sum = 0;
        foreach (char c in s)
            sum += (int)c;
        sum = -(sum % 256);
        return ((byte)sum).ToString("X2");
    }
FWIW, here's a literal translation of the C# code into Python:
    def calc_checksum(s):
        sum = 0
        for c in s:
            sum += ord(c)
        sum = -(sum % 256)
        # %02X zero-pads to two digits, matching C#'s "X2" format.
        return '%02X' % (sum & 0xFF)

    print(calc_checksum('00300005000'))
Its output is E8 for the message shown, which is different from the output of both your code and the C# code. Given the description in the manual and doing the calculations by hand, I don't see how their answer could be 74. How do you know that's the correct answer?
After seeing Mark Ransom's comment that the C# code does indeed return E8, I spent some time debugging your Python code and found out why it doesn't produce the same result. One problem is that it doesn't calculate the two's complement correctly on the line with the comment #inverse in your code. There are at least a couple of ways to do that correctly.
A second problem is that the way the hex() function handles negative numbers is not what you might expect. With the -24 two's complement in this case it produces -0x18, not 0xffe8 or something similar. This means that just taking the last two characters of the uppercased result would be incorrect. A really easy way to do that correctly is to convert the low byte of the value to uppercase hexadecimal using the % string interpolation operator. Here's a working version of your function:
    def calc_checksum(string):
        '''
        Calculates checksum for sending commands to the ELKM1.
        Sums the ASCII character values mod256 and takes
        the Twos complement.
        '''
        sum = 0
        for i in range(len(string)):
            sum = sum + ord(string[i])
        temp = sum % 256 # mod256
        # rem = (temp ^ 0xFF) + 1 # two's complement, hard way (one's complement + 1)
        rem = -temp # two's complement, easier way
        # %02X zero-pads to two digits, so small values come out as e.g. 08, not " 8".
        return '%02X' % (rem & 0xFF)
A more Pythonic (and faster) implementation would be a one-liner like this which makes use of the built-in sum() function:
    def calc_checksum(s):
        """
        Calculates checksum for sending commands to the ELKM1.
        Sums the ASCII character values mod256 and returns
        the lower byte of the two's complement of that value.
        """
        # %02X zero-pads to two digits, so small values come out as e.g. 08, not " 8".
        return '%02X' % (-(sum(ord(c) for c in s) % 256) & 0xFF)
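The property quoted from the manual (message characters plus checksum sum to 0 modulo 256) gives a simple way to test any implementation. The sketch below is a Python 3 take on the same calculation; `verify_checksum` is a hypothetical helper name, and the message is assumed to be pure ASCII.

```python
def calc_checksum(message):
    """Two's complement of the modulo-256 sum of the ASCII values, as two hex digits."""
    total = sum(message.encode('ascii')) % 256
    return format(-total & 0xFF, '02X')


def verify_checksum(message, checksum):
    """True when the character sum plus the checksum value is 0 modulo 256."""
    return (sum(message.encode('ascii')) + int(checksum, 16)) % 256 == 0


msg = '00300005000'
cc = calc_checksum(msg)
print(cc, verify_checksum(msg, cc))
```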
