Python: Reversibly encode alphanumeric string to integer

I want to convert a string (composed of alphanumeric characters) into an integer and then convert this integer back into a string:
string --> int --> string
In other words, I want to represent an alphanumeric string by an integer.
I found a working solution, which I included in the answer, but I do not think it is the best solution, and I am interested in other ideas/methods.
Please don't mark this as a duplicate just because many similar questions already exist; I specifically want an easy way of transforming a string into an integer and vice versa.
This should work for strings that contain alphanumeric characters, i.e. strings containing numbers and letters.

Here's what I have so far:
First define a string
m = "test123"
string -> bytes
mBytes = m.encode("utf-8")
bytes -> int
mInt = int.from_bytes(mBytes, byteorder="big")
int -> bytes
mBytes = mInt.to_bytes(((mInt.bit_length() + 7) // 8), byteorder="big")
bytes -> string
m = mBytes.decode("utf-8")
All together
m = "test123"
mBytes = m.encode("utf-8")
mInt = int.from_bytes(mBytes, byteorder="big")
mBytes2 = mInt.to_bytes(((mInt.bit_length() + 7) // 8), byteorder="big")
m2 = mBytes2.decode("utf-8")
print(m == m2)
Here is an identical reusable version of the above:
class BytesIntEncoder:
    @staticmethod
    def encode(b: bytes) -> int:
        return int.from_bytes(b, byteorder='big')
    @staticmethod
    def decode(i: int) -> bytes:
        return i.to_bytes(((i.bit_length() + 7) // 8), byteorder='big')
If you're using Python <3.6, remove the optional type annotations.
Test:
>>> s = 'Test123'
>>> b = s.encode()
>>> b
b'Test123'
>>> BytesIntEncoder.encode(b)
23755444588720691
>>> BytesIntEncoder.decode(_)
b'Test123'
>>> _.decode()
'Test123'
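The same round trip can also be wrapped at the string level. The following is a minimal sketch, not part of the original answer; the function names `str_to_int` and `int_to_str` are made up for illustration, and UTF-8 is assumed as the encoding. One caveat of this scheme: leading NUL characters do not survive the round trip, because leading zero bytes vanish in the integer. Purely alphanumeric strings are not affected.

```python
def str_to_int(text: str) -> int:
    """Encode text as UTF-8 bytes, then read those bytes as a big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), byteorder="big")


def int_to_str(number: int) -> str:
    """Reverse str_to_int: rebuild the bytes from the integer, then decode them."""
    length = (number.bit_length() + 7) // 8  # round the bit count up to whole bytes
    return number.to_bytes(length, byteorder="big").decode("utf-8")


original = "Test123"
as_int = str_to_int(original)
print(as_int)
print(int_to_str(as_int) == original)  # True
```

The empty string maps to 0 and back, since `(0).to_bytes(0, "big")` is `b''`.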

Recall that a string can be encoded to bytes, which can then be encoded to an integer. The encodings can then be reversed to get the bytes followed by the original string.
This encoder uses binascii to produce an identical integer encoding to the one in the answer by charel-f. I believe it to be identical because I extensively tested it.
Credit: this answer.
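from binascii import hexlify, unhexlify
class BytesIntEncoder:
    @staticmethod
    def encode(b: bytes) -> int:
        return int(hexlify(b), 16) if b != b'' else 0
    @staticmethod
    def decode(i: int) -> bytes:
        if i == 0:
            return b''
        hex_digits = '%x' % i
        # unhexlify needs an even number of hex digits, so pad e.g. 'a' to '0a'.
        if len(hex_digits) % 2:
            hex_digits = '0' + hex_digits
        return unhexlify(hex_digits)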
If you're using Python <3.6, remove the optional type annotations.
Quick test:
>>> s = 'Test123'
>>> b = s.encode()
>>> b
b'Test123'
>>> BytesIntEncoder.encode(b)
23755444588720691
>>> BytesIntEncoder.decode(_)
b'Test123'
>>> _.decode()
'Test123'

Assuming the character set is merely alphanumeric, i.e. a-z A-Z 0-9, this requires 6 bits per character. As such, using an 8-bit byte-encoding is theoretically an inefficient use of memory.
This answer converts the input bytes into a sequence of 6-bit integers. It encodes these small integers into one large integer using bitwise operations. Whether this actually translates into real-world storage efficiency is measured by sys.getsizeof, and is more likely for larger strings.
This implementation customizes the encoding for the choice of character set. If for example you were working with just string.ascii_lowercase (5 bits) rather than string.ascii_uppercase + string.digits (6 bits), the encoding would be correspondingly efficient.
Unit tests are also included.
import string
class BytesIntEncoder:
    def __init__(self, chars: bytes = (string.ascii_letters + string.digits).encode()):
        num_chars = len(chars)
        translation = ''.join(chr(i) for i in range(1, num_chars + 1)).encode()
        self._translation_table = bytes.maketrans(chars, translation)
        self._reverse_translation_table = bytes.maketrans(translation, chars)
        self._num_bits_per_char = (num_chars + 1).bit_length()
    def encode(self, chars: bytes) -> int:
        num_bits_per_char = self._num_bits_per_char
        output, bit_idx = 0, 0
        for chr_idx in chars.translate(self._translation_table):
            output |= (chr_idx << bit_idx)
            bit_idx += num_bits_per_char
        return output
    def decode(self, i: int) -> bytes:
        maxint = (2 ** self._num_bits_per_char) - 1
        output = bytes(((i >> offset) & maxint) for offset in range(0, i.bit_length(), self._num_bits_per_char))
        return output.translate(self._reverse_translation_table)
# Test
import itertools
import random
import unittest
class TestBytesIntEncoder(unittest.TestCase):
    chars = string.ascii_letters + string.digits
    encoder = BytesIntEncoder(chars.encode())
    def _test_encoding(self, b_in: bytes):
        i = self.encoder.encode(b_in)
        self.assertIsInstance(i, int)
        b_out = self.encoder.decode(i)
        self.assertIsInstance(b_out, bytes)
        self.assertEqual(b_in, b_out)
        # print(b_in, i)
    def test_thoroughly_with_small_str(self):
        for s_len in range(4):
            for s in itertools.combinations_with_replacement(self.chars, s_len):
                s = ''.join(s)
                b_in = s.encode()
                self._test_encoding(b_in)
    def test_randomly_with_large_str(self):
        for s_len in range(256):
            num_samples = {s_len <= 16: 2 ** s_len,
                           16 < s_len <= 32: s_len ** 2,
                           s_len > 32: s_len * 2,
                           s_len > 64: s_len,
                           s_len > 128: 2}[True]
            # print(s_len, num_samples)
            for _ in range(num_samples):
                b_in = ''.join(random.choices(self.chars, k=s_len)).encode()
                self._test_encoding(b_in)
if __name__ == '__main__':
    unittest.main()
Usage example:
>>> encoder = BytesIntEncoder()
>>> s = 'Test123'
>>> b = s.encode()
>>> b
b'Test123'
>>> encoder.encode(b)
3908257788270
>>> encoder.decode(_)
b'Test123'
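The bit-width reasoning behind this answer can be checked with a few lines. This is only a sketch under an assumption: the helper name `bits_per_char` is hypothetical, and it uses the smallest width that gives every symbol a distinct nonzero code (1 through the alphabet size), which is not necessarily the exact expression used in the class above.

```python
import string


def bits_per_char(alphabet: str) -> int:
    """Bits needed to give every symbol a distinct nonzero code 1..len(alphabet)."""
    return len(alphabet).bit_length()


lowercase = string.ascii_lowercase                 # 26 symbols
alphanumeric = string.ascii_letters + string.digits  # 62 symbols

print(bits_per_char(lowercase))     # 5
print(bits_per_char(alphanumeric))  # 6

# Compare the packed size of a 7-character string with plain 8-bit bytes.
text = "Test123"
packed_bits = len(text) * bits_per_char(alphanumeric)
byte_bits = len(text.encode()) * 8
print(packed_bits, byte_bits)       # 42 56
```

Codes start at 1 rather than 0 so that a symbol in the most significant position never disappears into leading zero bits.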

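I needed to transfer a dictionary as a single number.
It may look kind of ugly, but it is compact in the sense that every ASCII character (such as an English letter) becomes exactly 2 digits, and it can still carry any Unicode character because json.dumps escapes non-ASCII characters into ASCII \uXXXX sequences by default. One caveat: '~' (code 126) maps to 100, which is three digits, so text containing it will not round-trip.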
import json
myDict = {
    "le key": "le Valueue",
    2 : {
        "heya": 1234569,
        "3": 4
    },
    'Α α, Β β, Γ γ' : 'שלום'
}
def convertDictToNum(toBeConverted):
    return int(''.join([(lambda c: c if len(c) ==2 else '0'+c )(str(ord(c) - 26)) for c in str(json.dumps(toBeConverted))]))
def loadDictFromNum(toBeDecoded):
    toBeDecoded = str(toBeDecoded)
    return json.loads(''.join([chr(int(toBeDecoded[cut:cut + 2]) + 26) for cut in range(0, len(toBeDecoded), 2)]))
numbersDict = convertDictToNum(myDict)
print(numbersDict)
# 9708827506817595083206088....
recoveredDict = loadDictFromNum(numbersDict)
print(recoveredDict)
# {'le key': 'le Valueue', '2': {'heya': 1234569, '3': 4}, 'Α α, Β β, Γ γ': 'שלום'}

Related

Python : Increase a string

I am not really familiar with Python yet.
I have a string like "11223300". Now I want to increase the last byte of that string ( from "00" to "FF"). I tried to convert the string into an integer ( integer=int(string,16)) and increase it, and convert it back later, but that does not work for me. Maybe one of you guys has a better idea.
string = "11223300"
counter = int(string, 16)
for i in range(255):
    counter = counter + 1
IV = hex(counter)
Now I want to convert the IV from hex into a string
Thanks!
You can use format to convert int to your hex string, which will not keep the 0x prefix:
string = "11223300"
counter = int(string, 16)
for i in range(255):
    counter = counter + 1
IV = format(counter, 'X')
print(IV)
Output:
112233FF
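One thing `format(counter, 'X')` does not do is preserve leading zeros, so an input such as "00223300" would come back shorter. A small sketch of a width-preserving variant follows; the helper name `increment_hex` is hypothetical and not part of the answer above.

```python
def increment_hex(hex_string: str, step: int = 1) -> str:
    """Add step to a hex string, keeping its original width and uppercase digits."""
    width = len(hex_string)
    value = int(hex_string, 16) + step
    # A format spec like "08X" means: zero-pad to 8 characters, uppercase hex.
    return format(value, "0{}X".format(width))


print(increment_hex("11223300", 255))  # 112233FF
print(increment_hex("00223300"))       # 00223301
```

If the value overflows the original width, the result simply gets longer; the padding only sets a minimum width.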
The function below takes a string and increases it by n from a given charset
For example, in the charset ['a','b','c'] :
"aaa" + 1 = "aab"
"aac" + 1 = "aba"
"acc" + 2 = "bab"
def str_increaser(mystr, charset, n_increase):
    # Replaces the char at the given index of the string
    def replace_chr_in_str(mystr, index, char):
        mystr = list(mystr)
        mystr[index] = char
        return ''.join(mystr)
    # Increases the last char (n) and possibly its left neighbor (n-1).
    def local_increase(mystr, charset):
        l_cs = len(charset)
        # Advance the last char by 1 if it's not the last char in charset (e.g. 'z' in lowercase ascii).
        if (charset.index(mystr[-1]) < l_cs - 1):
            mystr = replace_chr_in_str(mystr, -1, charset[charset.index(mystr[-1]) + 1])
        # Otherwise, reset the last char to the first one in charset and increase its left char, recursing on the shortened string
        else:
            mystr = replace_chr_in_str(mystr, -1, charset[0])
            mystr = local_increase(mystr[:-1], charset) + mystr[-1]
        return mystr
    # Case where input = "zz...zz" in an ascii lowercase charset, for instance
    if (mystr == charset[-1] * len(mystr)):
        print("str_increaser(): Input already max in charset")
    else:
        for i in range(n_increase):
            mystr = local_increase(mystr, charset)
    return mystr
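Here's an example:
# In bash : $ man ascii
# charset = map(chr, range(97, 123)) + map(chr, range(65, 91))
import string
# ascii_lowercase / ascii_uppercase exist in both Python 2 and 3 (string.lowercase is Python 2 only)
charset = string.ascii_lowercase + string.ascii_uppercase
print(str_increaser("RfZ", charset, 2)) # outputs "Rgb"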
This function might be used to get all permutations in some charsets.
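If the goal is only to enumerate every string over a charset in counting order, the standard library can do that directly. This is a sketch of an alternative technique (`itertools.product`), not the function above; the name `all_strings` is hypothetical.

```python
import itertools


def all_strings(charset: str, length: int):
    """Yield every string of the given length over charset, in counting order."""
    for combo in itertools.product(charset, repeat=length):
        yield "".join(combo)


words = list(all_strings("abc", 3))
print(words[:4])   # ['aaa', 'aab', 'aac', 'aba']
print(len(words))  # 27
```

The order matches the "aaa" + 1 = "aab", "aac" + 1 = "aba" examples, because the rightmost position advances fastest.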
So are you literally wanting to change the last two characters of the string to "FF"? If so, easy
string = "11223300"
modified_string = string[:-2] + "FF"
Edit from comments
hex_str = "0x11223300"
for i in range(256):
    hex_int = int(hex_str, 16)
    new_int = hex_int + 0x01
    print(hex(hex_int))
    hex_str = str(hex(new_int))

converting alphanumeric string to int and vice versa in python

I am trying to convert an alphanumeric string with a maximum length of 40 characters to an integer that is as small as possible, so that we can easily save it to and retrieve it from a database. I am not aware of any existing Python method for this, or of any simple algorithm we could use. To be specific, my string will contain only the characters 0-9 and a-g. Please help with any suggestions on how to uniquely convert from string to int and vice versa. I am using Python 2.7 on CentOS 6.5.
This is not that difficult:
def str2int(s, chars):
    i = 0
    for c in reversed(s):
        i *= len(chars)
        i += chars.index(c)
    return i
def int2str(i, chars):
    s = ""
    while i:
        s += chars[i % len(chars)]
        i //= len(chars)
    return s
Example:
>>> chars = "".join(str(n) for n in range(10)) + "abcdefg"
>>> str2int("0235abg02", chars)
14354195089
>>> int2str(_, chars)
'0235abg02'
Basically if you want to encode n characters into an integer you interpret it as base-n.
There are 17 symbols in your input, so you can treat it as a base-17 number:
>>> int('aga0',17)
53924
For the reverse conversion, there are lots of solutions over here.
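As one possible reverse conversion, repeated division by 17 recovers the digits. This is a minimal sketch; the names `DIGITS` and `to_base17` are hypothetical, and the digit order is chosen to match what `int(..., 17)` accepts.

```python
DIGITS = "0123456789abcdefg"  # 17 symbols, in the order int(..., 17) expects


def to_base17(number: int) -> str:
    """Convert a non-negative integer to its base-17 string representation."""
    if number == 0:
        return "0"
    out = []
    while number:
        number, remainder = divmod(number, 17)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))  # digits were produced least significant first


print(to_base17(53924))  # aga0
```

Note that leading '0' characters in the original string are not preserved by this scheme, since they do not change the integer's value.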
Improving on the above answers:
# The location of a character in the string matters.
chars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
charsLen = len(chars)
def numberToStr(num):
    s = ""
    while num:
        s = chars[num % charsLen] + s
        num //= charsLen
    return s # Or e.g. "s.zfill(10)"
Can handle strings with leading 0s:
def strToNumber(numStr):
    num = 0
    for i, c in enumerate(reversed(numStr)):
        num += chars.index(c) * (charsLen ** i)
    return num

How do I concatenate an escape to a string?

My goal is to convert a binary value into the "bytestring" format Python interprets it as. Example: 1111111111111111 would be 0xffff and, when interpreted, be represented as \xff\xff. If there is a more direct method of converting it to this format, please let me know, as that would be ideal. As of now I'm using brute force with this solution:
hexnum = hex(int("11110100111100001110110101111011",2))
hexstring = str(hexnum)[2:]
finalstr = ""
i = 0
while(i<=len(hexstring)):
    finalstr+= hexstring[i:i+2]
    finalstr+= "\x"
    i=i+2
My problem is when:
print repr(finalstr)
I receive the error
ValueError: invalid \x escape
How do I properly concatenate the escape or how do I convert a binary string into the hex bytearray format python uses?
You can use binascii.unhexlify like this:
>>> import binascii
>>> s = "11110100111100001110110101111011"
>>> binascii.unhexlify(format(int(s, 2), 'x'))
'\xf4\xf0\xed{'
In Python 3:
v = int("11110100111100001110110101111011",2)
v.to_bytes((v.bit_length() + 7) // 8, 'big')
produces a bytes value represented by the bits:
>>> v = int("11110100111100001110110101111011",2)
>>> v.to_bytes((v.bit_length() + 7) // 8, 'big')
b'\xf4\xf0\xed{'
You can't just prepend the \x syntax; that only works in string literals.
In Python 2, you could use a bytearray() instead, as it takes a list of integers in the range 0-255:
v = int("11110100111100001110110101111011",2)
bytes_list = []
while v:
    v, one_byte = divmod(v, 256)
    bytes_list.append(one_byte)
str(bytearray(bytes_list[::-1]))
Demo:
>>> v = int("11110100111100001110110101111011",2)
>>> bytes_list = []
>>> while v:
... v, one_byte = divmod(v, 256)
... bytes_list.append(one_byte)
...
>>> bytearray(bytes_list[::-1])
bytearray(b'\xf4\xf0\xed{')
>>> str(bytearray(bytes_list[::-1]))
'\xf4\xf0\xed{'
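The point that the \x syntax only works in literals can be seen by comparing lengths: an escape written in source code is resolved when the code is parsed, while concatenating a backslash and "x" at run time just produces those characters. The sketch below (Python 3) also shows `bytes.fromhex` as one more way to get from hex digits to bytes; the variable names are made up for illustration.

```python
literal = "\xff"      # escape resolved by the parser: a single character
built = "\\x" + "ff"  # runtime concatenation: four characters, \ x f f
print(len(literal), len(built))  # 1 4

bits = "11110100111100001110110101111011"
hex_digits = format(int(bits, 2), "x")
if len(hex_digits) % 2:
    hex_digits = "0" + hex_digits  # bytes.fromhex needs whole bytes (digit pairs)
data = bytes.fromhex(hex_digits)
print(data)  # b'\xf4\xf0\xed{'
```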

Python efficient obfuscation of string

I need to obfuscate lines of Unicode text to slow down those who may want to extract them. Ideally this would be done with a built in Python module or a small add-on library; the string length will be the same or less than the original; and the "unobfuscation" be as fast as possible.
I have tried various character swaps and XOR routines, but they are slow. Base64 and hex encoding increase the size considerably. To date the most efficient method I've found is compressing with zlib at the lowest setting (1). Is there a better way?
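For reference, the zlib approach mentioned above might look like the following sketch. The function names `zlib_obfuscate` and `zlib_restore` are hypothetical, and UTF-8 is assumed for turning the Unicode text into bytes.

```python
import zlib


def zlib_obfuscate(text: str) -> bytes:
    """Compress UTF-8 text at zlib's fastest level (1)."""
    return zlib.compress(text.encode("utf-8"), 1)


def zlib_restore(blob: bytes) -> str:
    """Reverse zlib_obfuscate."""
    return zlib.decompress(blob).decode("utf-8")


sample = "some line of Unicode text, 國碼 " * 40
blob = zlib_obfuscate(sample)
print(len(sample.encode("utf-8")), len(blob))
print(zlib_restore(blob) == sample)  # True
```

Very short lines can come out slightly longer than the input, because the zlib container adds a small header and checksum.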
How about the old ROT13 trick?
Python 3:
>>> import codecs
>>> x = 'some string'
>>> y = codecs.encode(x, 'rot13')
>>> y
'fbzr fgevat'
>>> codecs.decode(y, 'rot13')
'some string'
Python 2:
>>> x = 'some string'
>>> y = x.encode('rot13')
>>> y
'fbzr fgevat'
>>> y.decode('rot13')
u'some string'
For a unicode string:
>>> x = u'國碼'
>>> print x
國碼
>>> y = x.encode('unicode-escape').encode('rot13')
>>> print y
\h570o\h78op
>>> print y.decode('rot13').decode('unicode-escape')
國碼
This uses a simple, fast encryption scheme on bytes objects.
# For Python 3 - strings are Unicode, print is a function
def obfuscate(byt):
    # Use same function in both directions. Input and output are bytes
    # objects.
    mask = b'keyword'
    lmask = len(mask)
    return bytes(c ^ mask[i % lmask] for i, c in enumerate(byt))
def test(s):
    data = obfuscate(s.encode())
    print(len(s), len(data), data)
    newdata = obfuscate(data).decode()
    print(newdata == s)
simple_string = 'Just plain ASCII'
unicode_string = ('sensei = \N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER N}'
                  '\N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER I}')
test(simple_string)
test(unicode_string)
Python 2 version:
# For Python 2
mask = 'keyword'
nmask = [ord(c) for c in mask]
lmask = len(mask)
def obfuscate(s):
    # Use same function in both directions. Input and output are
    # Python 2 strings, ASCII only.
    return ''.join([chr(ord(c) ^ nmask[i % lmask])
                    for i, c in enumerate(s)])
def test(s):
    data = obfuscate(s.encode('utf-8'))
    print len(s), len(data), repr(data)
    newdata = obfuscate(data).decode('utf-8')
    print newdata == s
simple_string = u'Just plain ASCII'
# Both literals need the u prefix, otherwise \N{...} is not interpreted in the second one.
unicode_string = (u'sensei = \N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER N}'
                  u'\N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER I}')
test(simple_string)
test(unicode_string)
It depends on the size of your input, if it's over 1K then using numpy is about 60x faster (runs in less than 2% of the naïve Python code).
import time
import numpy as np
mask = b'We are the knights who say "Ni"!'
mask_length = len(mask)
def mask_python(val: bytes) -> bytes:
    return bytes(c ^ mask[i % mask_length] for i, c in enumerate(val))
def mask_numpy(val: bytes) -> bytes:
    arr = np.frombuffer(val, dtype=np.int8)
    # Use the argument's length, not the global test variable.
    length = len(val)
    np_mask = np.tile(np.frombuffer(mask, dtype=np.int8), round(length/mask_length+0.5))[:length]
    masked = arr ^ np_mask
    return masked.tobytes()
value = b'0123456789'
for i in range(9):
    start_py = time.perf_counter()
    masked_py = mask_python(value)
    end_py = time.perf_counter()
    start_np = time.perf_counter()
    masked_np = mask_numpy(value)
    end_np = time.perf_counter()
    assert masked_py == masked_np
    print(f"{i+1} {len(value)} {end_py-start_py} {end_np-start_np}")
    value = value * 10
Note: I'm a novice with numpy, if anyone has any comments on my code I would be very happy to hear about it in comments.
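If numpy is not available, a different pure-Python technique is to XOR the whole buffer at once as one big integer instead of byte by byte. This is a sketch of that alternative, not the numpy code above; the function name `mask_bigint` is hypothetical and no timing claims are made for it.

```python
def mask_bigint(val: bytes, mask: bytes) -> bytes:
    """XOR val with a repeating mask by treating both as big integers."""
    length = len(val)
    if length == 0:
        return b""
    repeats = -(-length // len(mask))      # ceiling division
    full_mask = (mask * repeats)[:length]  # mask repeated and cut to val's length
    xored = int.from_bytes(val, "big") ^ int.from_bytes(full_mask, "big")
    return xored.to_bytes(length, "big")   # fixed length keeps leading zero bytes


mask = b"keyword"
data = "Just plain ASCII".encode()
masked = mask_bigint(data, mask)
print(masked)
print(mask_bigint(masked, mask) == data)  # True
```

Applying the function twice with the same mask restores the input, just as with the byte-wise versions.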
Use codecs with the hex encoding, like this:
>>> import codecs
>>> codecs.encode(b'test/jimmy', 'hex')
b'746573742f6a696d6d79'
>>> codecs.decode(b'746573742f6a696d6d79', 'hex')
b'test/jimmy'

How can I increment a char?

I'm new to Python, coming from Java and C. How can I increment a char? In Java or C, chars and ints are practically interchangeable, and in certain loops, it's very useful to me to be able to do increment chars, and index arrays by chars.
How can I do this in Python? It's bad enough not having a traditional for(;;) looper - is there any way I can achieve what I want to achieve without having to rethink my entire strategy?
In Python 2.x, just use the ord and chr functions:
>>> ord('c')
99
>>> ord('c') + 1
100
>>> chr(ord('c') + 1)
'd'
>>>
Python 3.x makes this more organized and interesting, due to its clear distinction between bytes and unicode. By default, a "string" is unicode, so the above works (ord receives Unicode chars and chr produces them).
But if you're interested in bytes (such as for processing some binary data stream), things are even simpler:
>>> bstr = bytes('abc', 'utf-8')
>>> bstr
b'abc'
>>> bstr[0]
97
>>> bytes([97, 98, 99])
b'abc'
>>> bytes([bstr[0] + 1, 98, 99])
b'bbc'
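Put together for Python 3 strings, the ord/chr approach covers both use cases from the question: incrementing a character and indexing a list by character. This is only a sketch; the helper name `shift_char` is made up for illustration.

```python
def shift_char(ch: str, offset: int = 1) -> str:
    """Return the character offset code points after ch."""
    return chr(ord(ch) + offset)


print(shift_char("c"))                        # d
print("".join(shift_char(c) for c in "HAL"))  # IBM

# Index a list by character, as one would with a char-indexed array in C or Java.
counts = [0] * 26
for ch in "hello":
    counts[ord(ch) - ord("a")] += 1
print(counts[ord("l") - ord("a")])  # 2
```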
"bad enough not having a traditional for(;;) looper"?? What?
Are you trying to do
import string
for c in string.ascii_lowercase:
    ...do something with c...
Or perhaps you're using string.ascii_uppercase or string.ascii_letters?
Python doesn't have for(;;) because there are often better ways to do it. It also doesn't have character math because it's not necessary, either.
Check this, using a for loop:
for a in range(5):
    x='A'
    val=chr(ord(x) + a)
    print(val)
Loop output (one letter per line): A B C D E
I came from PHP, where you can increment char (A to B, Z to AA, AA to AB etc.) using ++ operator. I made a simple function which does the same in Python. You can also change list of chars to whatever (lowercase, uppercase, etc.) is your need.
import re
# Increment char (a -> b, az -> ba)
def inc_char(text, chlist = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    # Unique and sort
    chlist = ''.join(sorted(set(str(chlist))))
    chlen = len(chlist)
    if not chlen:
        return ''
    text = str(text)
    # Remove all chars that are not in chlist
    text = re.sub('[^' + chlist + ']', '', text)
    if not len(text):
        return chlist[0]
    # Increment
    inc = ''
    over = False
    for i in range(1, len(text)+1):
        lchar = text[-i]
        pos = chlist.find(lchar) + 1
        if pos < chlen:
            inc = chlist[pos] + inc
            over = False
            break
        else:
            inc = chlist[0] + inc
            over = True
    if over:
        inc += chlist[0]
    result = text[0:-len(inc)] + inc
    return result
You can also increment a character using ascii_letters from the string module, a string that contains all the letters of the English alphabet, lowercase followed by uppercase:
>>> from string import ascii_letters
>>> ascii_letters[ascii_letters.index('a') + 1]
'b'
>>> ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
Also it can be done manually;
>>> letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> letters[letters.index('c') + 1]
'd'
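With this index-based lookup, the last character of the alphabet has no successor: `letters.index('Z') + 1` is out of range and raises IndexError. One way to handle that, if wrapping around is acceptable, is a modulo; the sketch below uses a hypothetical helper named `next_letter`.

```python
from string import ascii_lowercase


def next_letter(ch: str, alphabet: str = ascii_lowercase) -> str:
    """Return the letter after ch in alphabet, wrapping from the last to the first."""
    return alphabet[(alphabet.index(ch) + 1) % len(alphabet)]


print(next_letter("c"))  # d
print(next_letter("z"))  # a
```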
def doubleChar(str):
    result = ''
    for char in str:
        result += char * 2
    return result
print(doubleChar("amar"))
output:
aammaarr
For me, I made the following as a test.
string_1="abcd"
def test(string_1):
    i = 0
    p = ""
    x = len(string_1)
    while i < x:
        y = (string_1)[i]
        i=i+1
        s = chr(ord(y) + 1)
        p=p+s
    print(p)
test(string_1)
