Numeric ID to very short unique strings - python

I have rather long IDs 1000000000109872 and would like to represent them as strings.
However all the libraries for Rust I've found such as hash_ids and block_id produce strings that are way bigger.
Ideally I'd like 4 to maybe 5 characters, numbers are okay but only uppercase letters. Doesn't need to be cryptographically secure as long as it's unique.
Is there anything that fits my needs?
I've tried this website: https://v2.cryptii.com/decimal/base64 and for 1000000000109872 I get 4rSw, this is very short which is great. But it's not uppercase.

This is the absolute best you can do if you want to guarantee no collisions without having any specific guarantees on the range of the inputs beyond "unsigned int" and you want it to be stateless:
def base_36(n: int) -> str:
    if not isinstance(n, int):
        raise TypeError("Check out https://mypy.readthedocs.io/")
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n < 10:
        return str(n)
    if n < 36:
        return chr(n - 10 + ord('A'))
    return base_36(n // 36) + base_36(n % 36)
print(base_36(1000000000109872))  # 9UGXNOTWDS
If you're willing to avoid collisions by keeping track of id allocations, you can of course do much better:
ids: dict[int, int] = {}
def stateful_id(n: int) -> str:
    return base_36(ids.setdefault(n, len(ids)))
print(stateful_id(1000000000109872)) # 0
print(stateful_id(1000000000109454)) # 1
print(stateful_id(1000000000109872)) # 0
or if some parts of the ID can be safely truncated:
MAGIC_NUMBER = 1000000000000000
def truncated_id(n: int) -> str:
    if n < MAGIC_NUMBER:
        raise ValueError(f"IDs must be >= {MAGIC_NUMBER}")
    return base_36(n - MAGIC_NUMBER)
print(truncated_id(1000000000109872)) # 2CS0
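A short ID is usually only useful if it can be mapped back to the original number. The following is a minimal sketch, not part of the answer above: it assumes the same alphabet (digits, then uppercase letters), uses an iterative encoder, and relies on Python's built-in int(s, 36) for decoding. The names ALPHABET, encode_base36 and decode_base36 are made up for this illustration.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed digit order: 0-9, then A-Z


def encode_base36(n: int) -> str:
    """Encode a non-negative integer using digits and uppercase letters."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base36(s: str) -> int:
    """Decode a base-36 string back to the integer (int accepts bases 2..36)."""
    return int(s, 36)


short = encode_base36(109872)
print(short, decode_base36(short))
```

The iterative loop avoids recursion, and the round trip through decode_base36 shows that no information is lost.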

Short Answer: Impossible.
Long Answer: You're asking to represent about 10^16 distinct values with only 36^5 possible strings (5 characters, each an uppercase letter or a digit).
Each such character can take one of 36 values (10 digits + 26 letters). But 36^5 = 60,466,176 is far less than 10^16, so five characters cannot work.
Since 36^10 < 10^16 < 36^11, you'll need at least 11 such characters to represent arbitrary IDs of this length (up to 10^16).
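The counting argument can be checked with exact integer arithmetic. This is a small sketch; min_chars is a hypothetical helper name, and the default alphabet size of 36 is the digits-plus-uppercase alphabet from the question.

```python
def min_chars(n_values: int, alphabet_size: int = 36) -> int:
    """Smallest string length whose number of possible strings is >= n_values."""
    if n_values < 1 or alphabet_size < 2:
        raise ValueError("need at least one value and an alphabet of size >= 2")
    length = 1
    capacity = alphabet_size  # number of distinct strings of the current length
    while capacity < n_values:
        capacity *= alphabet_size
        length += 1
    return length


print(36 ** 5)             # 60466176
print(min_chars(10 ** 16)) # 11
```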

As you already stated that there is even a checksum inside the original ID, I assume the new representation should contain all of its data.
In this case, your question is strongly related to lossless compression and information content.
Information content says that every data contains a certain amount of information. Information can be measured in bits.
The sad news is that no matter what, you cannot magically reduce your data to fewer bits. It will always keep the same number of bits. You can change the representation to store those bits as compactly as possible, but you cannot reduce their number.
You might think of jpg or compressed movies, that are stored very compact; the problem there is they are lossy. They discard information not perceived by the human eye/ear and just delete them.
In your case, there is no trickery possible. You will always have a smallest and a largest ID that you handed out. And all the IDs between your smallest and largest ID have to be distinguishable.
Now some math. If you know the amount of possible states of your data (e.g. the amount of distinguishable IDs), you can compute the required information content like this: log2(N), where N is the number of possible states.
So let's say you have 1000000 different IDs, that would mean you need log2(1000000) = 19.93 bits to represent those IDs. You will never be able to reduce this number to anything less.
Now to actually represent them: You say you want to store them in a string made of 26 different uppercase letters and 10 different digits. This is called a base36 encoding.
Each digit of this can carry log2(36) = 5.17 bits of information. Therefore, to store your 1000000 different IDs, you need at least 19.93/5.17 = 3.85 digits.
This is exactly what @Samwise's answer shows you. His answer is the mathematically optimal way to encode this. You will never get better than his answer. And the number of digits will always grow if the number of possible IDs you want to represent grows. There's just no mathematical way around that.
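The numbers in this answer can be reproduced with math.log2. This is only an illustrative sketch; bits_needed and chars_needed are hypothetical helper names, and floating-point logarithms are fine here because only a rough size estimate is needed.

```python
import math


def bits_needed(n_states: int) -> float:
    """Information content, in bits, of one choice among n_states states."""
    return math.log2(n_states)


def chars_needed(n_states: int, alphabet_size: int = 36) -> float:
    """Fractional number of characters needed; round up for a real string."""
    return bits_needed(n_states) / math.log2(alphabet_size)


n = 1_000_000
print(bits_needed(n))              # about 19.93 bits
print(math.log2(36))               # about 5.17 bits per base36 character
print(math.ceil(chars_needed(n)))  # whole characters actually required
```

Since a string cannot have a fractional length, the roughly 3.85 digits from the text round up to 4 characters in practice.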

Related

Is there a built-in python function to count bit flip in a binary string?

Is there a built-in python function to count bit flip in a binary string? The question I am trying to solve is given a binary string of arbitrary length, how can I return the number of bit flips of the string. For example, the bit flip number of '01101' is 3, and '1110011' is 2, etc.
The way I can come up with to solve this problem is to use a for loop and a counter. However, that seems too lengthy. Is there a way I can do that faster? Or is there a built-in function in python that allows me to do that directly? Thanks for the help!
There is a very fast way to do that without any explicit loops and using only Python builtins: you can convert the string to a binary number, detect all the bit flips using an XOR-based integer trick, and then convert the integer back to a string to count the number of bit flips. Here is the code:
# Convert the binary string `s` to an integer: "01101" -> 0b01101
n = int(s, 2)
# Build a binary mask to skip the most significant bit of n: 0b01101 -> 0b01111
mask = (1 << (len(s)-1)) - 1
# Check if the ith bit of n is different from the (i+1)th bit of n using a bit-wise XOR:
# 0b01101 & 0b01111 -> 0b1101 (discard the first bit)
# 0b01101 >> 1 -> 0b0110
# 0b1101 ^ 0b0110 -> 0b1011
bitFlips = (n & mask) ^ (n >> 1)
# Convert the integer back to a string and count the bit flips: 0b1011 -> "0b1011" -> 3
flipCount = bin(bitFlips).count('1')
This trick is much faster than the other methods because integer operations are heavily optimized compared to interpreted loop-based code or code that works on iterables. Here are performance results for a string of size 1000 on my machine:
ljdyer's solution: 96 us x1.0
Karl's solution: 39 us x2.5
This solution: 4 us x24.0
If you are working with short bounded strings, then there are even faster ways to count the number of bits set in an integer.
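For reuse, the XOR trick can be wrapped in a function and cross-checked against a plain pairwise comparison. This is a sketch under the assumption that the input contains only the characters 0 and 1; the function names are made up here, and the guard for strings shorter than two characters is an addition (int("", 2) would otherwise raise ValueError).

```python
def count_bit_flips(s: str) -> int:
    """Count positions where adjacent bits differ, using integer XOR."""
    if len(s) < 2:
        return 0  # no adjacent pair, hence no flips
    n = int(s, 2)
    mask = (1 << (len(s) - 1)) - 1  # drops the most significant bit
    return bin((n & mask) ^ (n >> 1)).count("1")


def count_bit_flips_loop(s: str) -> int:
    """Reference implementation: compare each character with its successor."""
    return sum(a != b for a, b in zip(s, s[1:]))


print(count_bit_flips("01101"))    # 3
print(count_bit_flips("1110011"))  # 2
```

On Python 3.10 and later, int.bit_count() can replace the bin(...).count("1") step.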
Don't know about a built in function, but here's a one-liner:
bit_flip_count = len([x for x in range(1, len(x0)) if x0[x] != x0[x-1]])
Given a sequence of values, you can find the number of times that the value changes by grouping contiguous values and then counting the groups. There will be one more group than the number of changes (since the elements before the first change are also in a group). (Of course, for an empty sequence, this gives you a result of -1; you may want to handle this case separately.)
Grouping in Python is built-in, via the standard library itertools.groupby. This tool only considers contiguous groups, which is often a drawback (if you want to make a histogram, for example, you have to sort the data first) but in our case is exactly what we want. The overall interface of this tool is a bit complex, but in our case we can use it simply:
from itertools import groupby
def changes_in(sequence):
    return len(list(groupby(sequence))) - 1
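A possible variant of the groupby approach that also covers the empty-sequence case mentioned above. This is a sketch, not the answerer's code: it counts the groups lazily instead of building a list, and clamps the result at zero.

```python
from itertools import groupby


def changes_in(sequence) -> int:
    """Number of value changes between neighbouring elements of a sequence."""
    group_count = sum(1 for _ in groupby(sequence))  # one group per run of equal values
    return max(group_count - 1, 0)                   # empty input gives 0, not -1


print(changes_in("01101"))    # 3
print(changes_in("1110011"))  # 2
print(changes_in(""))         # 0
```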

Creating a unique short ID from string value

I have some data that has unique IDs stored as a string in the form:
ddd.dddddaddd.dddddz
Where d is some digit and a/z is some alphabet character. The digits may be 0-9 and the characters are either E or W for the a and N or S for the z.
I'd like to turn this into a unique integer and what I've tried using the hashlib module returns:
>>> int(hashlib.sha256(str.encode(s)).hexdigest(), 16)
Output: a very long integer (on another system cannot copy it)
Is there a way to generate a unique integer ID from a string so that it does not exceed 12 digits? I know that I will never need a unique integer ID beyond 12 digits.
Just something simple:
>>> s = '123.45678W123.45678S'
>>> int(s.translate(str.maketrans('EWNS', '1234', '.')))
123456782123456784
Not the impossible 12 digits you're still asking for in the question, but under the 20 digits you allowed in the comments.
As you are dealing with coordinates, I would try my best to keep the information in the final 12-digit ID.
If your points are global, it might be necessary to keep the degrees but they may be widespread, so you can sacrifice some information when it comes to precision.
If your points are local (all within a range of less than 10 degrees) you might skip the first two digits of the degrees and focus on the decimals.
As it may be possible that two points are close to each other, it may be prudent to reserve one digit as a serial number.
Proposal for widespread points:
s = "123.45678N123.45678E"
ident = "".join([s[0:6],s[10:16]]).replace(".","")
q = 0
if s[9]=="N":
    q+=1
if s[-1]=="E":
    q+=2
ident+=str(q)+'0'
The example would translate to 123451234530.
After computing the initial ident numbers for each ID, you should loop through them and increment the last digit if an ident is already taken.
This way you can easily reconstruct the location from the ID: split the first 10 digits into two degree values of the format ddd.dd and use the [-2] digit as the quadrant indicator. With the code above (N adds 1, E adds 2) the mapping is 0:SW, 1:NW, 2:SE, 3:NE.
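The whole proposal, including the serial-number step and the reconstruction, could be sketched as follows. The function names are hypothetical, the input layout is assumed to match the example string in the proposal (hemisphere letters at index 9 and at the end), and the quadrant digit follows the code in the proposal: N adds 1, E adds 2.

```python
def make_ident(s: str, serial: int = 0) -> str:
    """Build the 12-digit ident: 5 + 5 coordinate digits, quadrant, serial."""
    first = s[0:6].replace(".", "")     # "123.45" -> "12345"
    second = s[10:16].replace(".", "")
    quadrant = (1 if s[9] == "N" else 0) + (2 if s[-1] == "E" else 0)
    return f"{first}{second}{quadrant}{serial}"


def parse_ident(ident: str):
    """Recover the truncated coordinates, hemispheres and serial from an ident."""
    quadrant = int(ident[10])
    first = f"{ident[0:3]}.{ident[3:5]}"
    second = f"{ident[5:8]}.{ident[8:10]}"
    north_south = "N" if quadrant & 1 else "S"
    east_west = "E" if quadrant & 2 else "W"
    return first, north_south, second, east_west, int(ident[11])


def assign_idents(strings):
    """Give every input string a unique ident, bumping the serial on clashes."""
    taken = set()
    result = {}
    for s in strings:
        for serial in range(10):
            ident = make_ident(s, serial)
            if ident not in taken:
                taken.add(ident)
                result[s] = ident
                break
        else:
            raise ValueError("more than 10 points share the same truncated position")
    return result


print(make_ident("123.45678N123.45678E"))  # 123451234530
print(parse_ident("123451234530"))
```

With a single serial digit, at most ten points can share the same truncated coordinates; the sketch raises an error beyond that rather than silently producing duplicates.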

Python: Setting up a binary-number string converter, then indexing the result

I have a bit of a challenge before me.
Currently I'm trying to accomplish this process:
Feed a decimal, or any number really, into a binary converter
Now that we possess a binary string, we must measure the length of the string. (as in, numstr="10001010" - I want the return to count the characters and return "8")
Finally, I need to extract a section of said string, if I want to cut out the first half of the string "10001010" and save both halves, I want the return to read "1000" and "1010"
Current Progress:
newint=input("Enter a number:")
newint2= int(newint)
binStr=""
while newint2>0:
    binStr= str(newint2%2) + binStr
    newint2= newint2//2
print (binStr)
newint = input("Enter a binary number:")
temp=newint
power = 0
number = 0
while len(temp) > 0:
    bit=int(temp[-1])
    number = number + bit * 2 ** power
    power+=1
    temp = temp[:-1]
print(number)
//This works for integer values, how do I get it to also work for decimal values, where the integer is either there or 0 (35.45 or 0.4595)?
This is where I'm lost, I'm not sure what the best way to attempt this next step would be.
Once I convert my decimal or integer into binary representation, how can I cut my string by varying lengths? Let's say my binary representation is 100 characters, and I want to cut out lengths that are 10% the total length, so I get 10 blocks of 10 characters, or blocks that are 20% total length so I have 5 blocks of 20 characters.
Any advice is appreciated, I'm a super novice and this has been a steep challenge for me.
Strings can be divided up through slice notation.
>>> a = '101010101010'
>>> a[0]
'1'
>>> a[0:5]
'10101'
>>> a[0:int(len(a)/2)]
'101010'
That's something you should read up on, if you're getting into Python.
Here is my suggestion, based on answer from What's the best way to split a string into fixed length chunks and work with them in Python? :
def chunkstring(string, percent):
    length = int(len(string) * percent / 100)
    return ([string[0+i:length+i] for i in range(0, len(string), length)])
# Define a string with 100 characters
a = '0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789'
# Split this string by percentage of total length
print(chunkstring(a, 10)) # 10 %
print(chunkstring(a, 20)) # 20 %
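For whole numbers, the three steps from the question (convert, measure, split) can also be done with builtins. This is a minimal sketch with made-up function names; it covers non-negative integers only, not the fractional values the question also asks about, and it clamps the chunk size to at least 1 so that very small percentages do not produce a zero step.

```python
def to_binary(n: int) -> str:
    """Binary digits of a non-negative integer, without the '0b' prefix."""
    return format(n, "b")


def split_halves(bits: str):
    """Split a string into its first and second half."""
    middle = len(bits) // 2
    return bits[:middle], bits[middle:]


def chunk_by_percent(bits: str, percent: int):
    """Split a string into chunks that are each `percent` % of its length."""
    size = max(1, len(bits) * percent // 100)
    return [bits[i:i + size] for i in range(0, len(bits), size)]


bits = to_binary(138)
print(bits, len(bits))     # 10001010 8
print(split_halves(bits))  # ('1000', '1010')
print(int(bits, 2))        # 138
```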

Why do I have to change integers to strings in order to iterate them in Python?

First of all, I have only recently started to learn Python on codeacademy.com and this is probably a very basic question, so thank you for the help and please forgive my lack of knowledge.
The function below takes a positive integer as input and returns the sum of that number's digits. What I don't understand is why I have to change the type of the input into str first, and then back into an integer, in order to add the number's digits to each other. Could someone help me out with an explanation please? The code works fine for the exercise, but I feel I am missing the big picture here.
def digit_sum(n):
    num = 0
    for i in str(n):
        num += int(i)
    return num
Integers are not sequences of digits. They are just (whole) numbers, so they can't be iterated over.
By turning the integer into a string, you created a sequence of digits (characters), and a string can be iterated over. It is no longer a number, it is now text.
See it as a representation; you could also have turned the same number into hexadecimal text, or octal text, or binary text. It would still be the same numerical value, just written down differently in text.
Iteration over a string works, and gives you single characters, which for a number means that each character is also a digit. The code takes that character and turns it back into a number with int(i).
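A small demonstration of the points above; the variable names are made up for this sketch. It shows that an int cannot be iterated, that its decimal text can, and that hexadecimal, octal and binary text are just other spellings of the same value.

```python
n = 1234

# An int is a single value, not a sequence: iter() refuses it.
try:
    iter(n)
    int_is_iterable = True
except TypeError:
    int_is_iterable = False

# The decimal text of the number is a sequence of characters.
digits = [int(ch) for ch in str(n)]

# Other textual representations of the same number.
as_hex = format(n, "x")
as_octal = format(n, "o")
as_binary = format(n, "b")

print(int_is_iterable, digits, as_hex, as_octal, as_binary)
print(int(as_hex, 16) == int(as_octal, 8) == int(as_binary, 2) == n)
```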
You don't have to use that trick. You could also use maths:
def digit_sum(n):
    total = 0
    while n:
        n, digit = divmod(n, 10)
        total += digit
    return total
This uses a while loop, and repeatedly divides the input number by ten (keeping the remainder) until 0 is reached. The remainders are summed, giving you the digit sum. So 1234 is turned into 123 and 4, then 12 and 3, etc.
Take the number 12345.
You need the individual digits 1, 2, 3, 4, 5 from it, and then you sum them up.
So how do you get the individual digits? One mathematical way is the one @Martijn Pieters showed.
Another is to convert the number into a string, which makes it iterable.
This is one of the many ways to do it:
>>> sum(map(int, list(str(12345))))
15
The list() function breaks a string into individual characters, so I needed a string. Once I have all the digits as individual characters, I can convert them to integers and add them up.

Reversible hash function?

I need a reversible hash function (obviously the input will be much smaller in size than the output) that maps the input to the output in a random-looking way. Basically, I want a way to transform a number like "123" to a larger number like "9874362483910978", but not in a way that will preserve comparisons, so it must not be always true that, if x1 > x2, f(x1) > f(x2) (but neither must it be always false).
The use case for this is that I need to find a way to transform small numbers into larger, random-looking ones. They don't actually need to be random (in fact, they need to be deterministic, so the same input always maps to the same output), but they do need to look random (at least when base64encoded into strings, so shifting by Z bits won't work as similar numbers will have similar MSBs).
Also, easy (fast) calculation and reversal is a plus, but not required.
I don't know if I'm being clear, or if such an algorithm exists, but I'd appreciate any and all help!
None of the answers provided seemed particularly useful, given the question. I had the same problem, needing a simple, reversible hash for non-security purposes, and decided to go with bit relocation. It's simple, it's fast, and it doesn't require knowing anything about boolean maths or crypto algorithms or anything else that requires actual thinking.
The simplest would probably be to just move half the bits left, and the other half right:
def hash(n):
    return ((0x0000FFFF & n)<<16) + ((0xFFFF0000 & n)>>16)
This is reversible, in that hash(hash(n)) = n, and has non-sequential pairs {n,m}, n < m, where hash(m) < hash(n).
And to get a much less sequential looking implementation, you might also want to consider an interlace reordering from [msb,z,...,a,lsb] to [msb,lsb,z,a,...] or [lsb,msb,a,z,...] or any other relocation you feel gives an appropriately non-sequential sequence for the numbers you deal with, or even add a XOR on top for peak desequential'ing.
(The above function is safe for numbers that fit in 32 bits, larger numbers are guaranteed to cause collisions and would need some more bit mask coverage to prevent problems. That said, 32 bits is usually enough for any non-security uid).
Also have a look at the multiplicative inverse answer given by Andy Hayden, below.
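As a sketch of the "add a XOR on top" idea, the half-swap can be combined with a fixed XOR constant. XOR_KEY is an arbitrary value chosen for this illustration, the function names are made up, and inputs are assumed to fit in 32 bits. This only obfuscates; it offers no security.

```python
MASK32 = 0xFFFFFFFF
XOR_KEY = 0x5A5A5A5A  # arbitrary constant, chosen only for this example


def swap_halves(n: int) -> int:
    """Exchange the upper and lower 16 bits of a 32-bit value."""
    return ((n & 0xFFFF) << 16) | ((n >> 16) & 0xFFFF)


def obfuscate(n: int) -> int:
    """Swap the halves, then XOR with the constant."""
    if not 0 <= n <= MASK32:
        raise ValueError("value must fit in 32 bits")
    return swap_halves(n) ^ XOR_KEY


def deobfuscate(h: int) -> int:
    """Undo obfuscate: XOR first, then swap the halves back."""
    return swap_halves(h ^ XOR_KEY)


for value in (1, 2, 3):
    print(value, obfuscate(value), deobfuscate(obfuscate(value)))
```

Both steps are their own inverses, so decoding simply applies them in the opposite order.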
Another simple solution is to use multiplicative inverses (see Eric Lippert's blog):
we showed how you can take any two coprime positive integers x and m and compute a third positive integer y with the property that (x * y) % m == 1, and therefore that (x * z * y) % m == z % m for any positive integer z. That is, there always exists a “multiplicative inverse”, that “undoes” the results of multiplying by x modulo m.
We take a large number e.g. 4000000000 and a large co-prime number e.g. 387420489:
def rhash(n):
    return n * 387420489 % 4000000000
>>> rhash(12)
649045868
We first calculate the multiplicative inverse with modinv which turns out to be 3513180409:
>>> 3513180409 * 387420489 % 4000000000
1
Now, we can define the inverse:
def un_rhash(h):
    return h * 3513180409 % 4000000000
>>> un_rhash(649045868) # un_rhash(rhash(12))
12
Note: This answer is fast to compute and works for numbers up to 4000000000, if you need to handle larger numbers choose a sufficiently large number (and another co-prime).
You may want to do this with hexadecimal (to pack the int):
def rhash(n):
    return "%08x" % (n * 387420489 % 4000000000)
>>> rhash(12)
'26afa76c'
def un_rhash(h):
    return int(h, 16) * 3513180409 % 4000000000
>>> un_rhash('26afa76c') # un_rhash(rhash(12))
12
If you choose a relatively large co-prime then this will seem random, be non-sequential and also be quick to calculate.
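The answer refers to a modinv helper without showing it. On Python 3.8 and later, the built-in pow accepts a negative exponent together with a modulus and returns the modular inverse, so a sketch with the same constants could look like this (the constant names are made up; the multiplier must be coprime with the modulus, which is checked explicitly):

```python
import math

MODULUS = 4_000_000_000
MULTIPLIER = 387_420_489

# The inverse only exists when multiplier and modulus share no common factor.
if math.gcd(MULTIPLIER, MODULUS) != 1:
    raise ValueError("multiplier and modulus must be coprime")

INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse, Python 3.8+


def rhash(n: int) -> int:
    """Scramble n (0 <= n < MODULUS) by multiplying modulo MODULUS."""
    return n * MULTIPLIER % MODULUS


def un_rhash(h: int) -> int:
    """Undo rhash by multiplying with the modular inverse."""
    return h * INVERSE % MODULUS


print(INVERSE)
print(rhash(12), un_rhash(rhash(12)))
```

Computing the inverse at import time avoids hard-coding it, which helps if the modulus or multiplier is changed later.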
What you are asking for is encryption. A block cipher in its basic mode of operation, ECB, reversibly maps an input block onto an output block of the same size. The input and output blocks can be interpreted as numbers.
For example, AES is a 128 bit block cipher, so it maps an input 128 bit number onto an output 128 bit number. If 128 bits is good enough for your purposes, then you can simply pad your input number out to 128 bits, transform that single block with AES, then format the output as a 128 bit number.
If 128 bits is too large, you could use a 64 bit block cipher, like 3DES, IDEA or Blowfish.
ECB mode is considered weak, but its weakness is the constraint that you have postulated as a requirement (namely, that the mapping be "deterministic"). This is a weakness, because once an attacker has observed that 123 maps to 9874362483910978, from then on whenever she sees the latter number, she knows the plaintext was 123. An attacker can perform frequency analysis and/or build up a dictionary of known plaintext/ciphertext pairs.
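To illustrate the idea of a keyed, reversible mapping between fixed-size blocks without any third-party library, here is a toy Feistel network over 64-bit numbers built from hashlib. This is not AES, 3DES, IDEA or Blowfish and has not been vetted for security; the key, the round count and the function names are assumptions made for this sketch. For real use, a reviewed cipher implementation is the safer choice.

```python
import hashlib

KEY = b"example-key"  # hypothetical key, for illustration only
ROUNDS = 4            # arbitrary round count for this toy example


def _round_value(half: int, round_index: int) -> int:
    """Keyed round function: 32 bits derived from SHA-256."""
    data = KEY + bytes([round_index]) + half.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def encrypt64(n: int) -> int:
    """Map a 64-bit number to another 64-bit number, reversibly."""
    if not 0 <= n < 2 ** 64:
        raise ValueError("value must fit in 64 bits")
    left, right = n >> 32, n & 0xFFFFFFFF
    for i in range(ROUNDS):
        left, right = right, left ^ _round_value(right, i)
    return (left << 32) | right


def decrypt64(c: int) -> int:
    """Invert encrypt64 by running the rounds backwards."""
    left, right = c >> 32, c & 0xFFFFFFFF
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round_value(left, i), left
    return (left << 32) | right


token = encrypt64(123)
print(token, decrypt64(token))
```

As with ECB, the mapping is deterministic: the same input always yields the same output under the same key, which is exactly the property the question asks for and also the weakness described above.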
Basically, you are looking for 2 way encryption, and one that probably uses a salt.
You have a number of choices:
TripleDES
AES
Here is an example: "Simple insecure two-way "obfuscation" for C#"
What language are you looking at? If .NET then look at the encryption namespace for some ideas.
Why not just XOR with a nice long number?
Easy. Fast. Reversible.
Or, if this doesn't need to be terribly secure, you could convert from base 10 to some smaller base (like base 8 or base 4, depending on how long you want the numbers to be).
