Creating a unique short ID from string value

I have some data that has unique IDs stored as a string in the form:
ddd.dddddaddd.dddddz
Where d is some digit and a/z is some alphabet character. The digits may be 0-9 and the characters are either E or W for the a and N or S for the z.
I'd like to turn this into a unique integer and what I've tried using the hashlib module returns:
>>> int(hashlib.sha256(str.encode(s)).hexdigest(), 16)
Output: a very long integer (on another system cannot copy it)
Is there a way to generate a unique integer ID from a string so that it does not exceed 12 digits? I know that I will never need a unique integer ID beyond 12 digits.

Just something simple:
>>> s = '123.45678W123.45678S'
>>> int(s.translate(str.maketrans('EWNS', '1234', '.')))
123456782123456784
Not the impossible 12 digits you're still asking for in the question, but under the 20 digits you allowed in the comments.

As you are dealing with coordinates, I would try my best to keep the information in the final 12-digit ID.
If your points are global, it might be necessary to keep the degrees but they may be widespread, so you can sacrifice some information when it comes to precision.
If your points are local (all within a range of less than 10 degrees) you might skip the first two digits of the degrees and focus on the decimals.
As it may be possible that two points are close to each other, it may be prudent to reserve one digit as a serial number.
Proposal for widespread points:
s = "123.45678N123.45678E"
ident = "".join([s[0:6], s[10:16]]).replace(".", "")
q = 0
if s[9] == "N":
    q += 1
if s[-1] == "E":
    q += 2
ident += str(q) + '0'
The example would translate to 123451234530.
After computing the initial ident numbers for each ID, you should loop through them and increment the last digit if an ident is already taken.
This way you can easily reconstruct the location from the ID: split the first 10 digits into two coordinates of the format ddd.dd and use the [-2] digit as an indicator of the quadrant (0: SW, 1: NW, 2: SE, 3: NE).
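A minimal sketch of the scheme described above, including the collision handling via the serial digit. The function names base_ident and assign_idents are hypothetical, and the slicing assumes the fixed 20-character layout of the example string (N/S at index 9, E/W at the end):

```python
def base_ident(s: str) -> str:
    """Build the 11-digit prefix: ddddd + ddddd + quadrant digit."""
    # Assumption: s always has the fixed layout "ddd.dddddXddd.dddddY".
    digits = (s[0:6] + s[10:16]).replace(".", "")
    quadrant = (1 if s[9] == "N" else 0) + (2 if s[-1] == "E" else 0)
    return digits + str(quadrant)


def assign_idents(points: list[str]) -> dict[str, int]:
    """Give every point an ID, bumping the serial digit when a prefix clashes."""
    taken: set[int] = set()
    result: dict[str, int] = {}
    for s in points:
        prefix = base_ident(s)
        for serial in range(10):
            candidate = int(prefix + str(serial))
            if candidate not in taken:
                taken.add(candidate)
                result[s] = candidate
                break
        else:
            # Only one digit is reserved, so at most 10 points per prefix.
            raise ValueError(f"more than 10 points share the prefix {prefix}")
    return result


points = ["123.45678N123.45678E", "123.45999N123.45111E", "123.45678S123.45678W"]
ids = assign_idents(points)
print(ids)
```

Note that converting to int drops leading zeros, so a point below 100 degrees yields a shorter number; keep the ID as a string if a fixed width of 12 matters.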

Related

Numeric ID to very short unique strings

I have rather long IDs 1000000000109872 and would like to represent them as strings.
However all the libraries for Rust I've found such as hash_ids and block_id produce strings that are way bigger.
Ideally I'd like 4 to maybe 5 characters, numbers are okay but only uppercase letters. Doesn't need to be cryptographically secure as long as it's unique.
Is there anything that fits my needs?
I've tried this website: https://v2.cryptii.com/decimal/base64 and for 1000000000109872 I get 4rSw, this is very short which is great. But it's not uppercase.

This is the absolute best you can do if you want to guarantee no collisions without having any specific guarantees on the range of the inputs beyond "unsigned int" and you want it to be stateless:
def base_36(n: int) -> str:
    if not isinstance(n, int):
        raise TypeError("Check out https://mypy.readthedocs.io/")
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n < 10:
        return str(n)
    if n < 36:
        return chr(n - 10 + ord('A'))
    return base_36(n // 36) + base_36(n % 36)
print(base_36(1000000000109872))  # 9UGXNOTWDS
If you're willing to avoid collisions by keeping track of id allocations, you can of course do much better:
ids: dict[int, int] = {}
def stateful_id(n: int) -> str:
    return base_36(ids.setdefault(n, len(ids)))
print(stateful_id(1000000000109872)) # 0
print(stateful_id(1000000000109454)) # 1
print(stateful_id(1000000000109872)) # 0
or if some parts of the ID can be safely truncated:
MAGIC_NUMBER = 1000000000000000
def truncated_id(n: int) -> str:
    if n < MAGIC_NUMBER:
        raise ValueError(f"IDs must be >= {MAGIC_NUMBER}")
    return base_36(n - MAGIC_NUMBER)
print(truncated_id(1000000000109872)) # 2CS0
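Because base 36 is only a change of representation, the original number can always be recovered from the short string. A small sketch, assuming an iterative encoder (encode_base36 and decode_base36 are illustrative names, not part of the answer above); the decoding relies on Python's built-in int(s, 36):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_base36(n: int) -> str:
    """Encode a non-negative integer using digits and uppercase letters."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n == 0:
        return "0"
    chars = []
    while n:
        n, remainder = divmod(n, 36)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base36(s: str) -> int:
    """Invert encode_base36; int() accepts bases up to 36."""
    return int(s, 36)


original = 1000000000109872
short = encode_base36(original)
print(short, decode_base36(short) == original)
```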

Short Answer: Impossible.
Long Answer: You're asking to represent up to 10^16 distinct values with only 36^5 combinations (5 uppercase/digit chars).
Each uppercase/digit char can take one of 36 values (10 digits + 26 letters). But 36^5 = 60,466,176 is less than 10^8, let alone 10^16, so it can't work.
Since 36^10 < 10^16 < 36^11, you'll need 11 such chars to cover every ID below 10^16.
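The counting argument above can be checked with plain integer arithmetic. This is only a sketch; min_length is a made-up helper that finds the smallest k with alphabet_size**k >= n_values:

```python
def min_length(n_values: int, alphabet_size: int = 36) -> int:
    """Smallest string length that can label n_values distinct items."""
    if n_values < 1 or alphabet_size < 2:
        raise ValueError("need at least one value and two symbols")
    length = 0
    capacity = 1  # number of distinct strings of the current length
    while capacity < n_values:
        capacity *= alphabet_size
        length += 1
    return length


print(36 ** 5)               # 60466176
print(min_length(10 ** 16))  # 11
```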

As you already stated that there is even a checksum inside the original ID, I assume the new representation should contain all of its data.
In this case, your question is strongly related to lossless compression and information content.
Information content says that every data contains a certain amount of information. Information can be measured in bits.
The sad news is that no matter what, you cannot magically reduce your data to fewer bits. It will always keep the same number of bits. You can change the representation to store those bits as compactly as possible, but you cannot reduce the number.
You might think of jpg or compressed movies, which are stored very compactly; the catch is that they are lossy. They discard information not perceived by the human eye/ear and simply delete it.
In your case, there is no trickery possible. You will always have a smallest and a largest ID that you handed out. And all the IDs between your smallest and largest ID have to be distinguishable.
Now some math. If you know the number of possible states of your data (e.g. the number of distinguishable IDs), you can compute the required information content like this: log2(N), where N is the number of possible states.
So let's say you have 1000000 different IDs, that would mean you need log2(1000000) = 19.93 bits to represent those IDs. You will never be able to reduce this number to anything less.
Now to actually represent them: You say you want to store them in a string of 26 different uppercase letters or 10 different digits. This is called a base36 encoding.
Each digit of this can carry log2(36) = 5.17 bits of information. Therefore, to store your 1000000 different IDs, you need at least 19.93/5.17 = 3.85 digits, i.e. 4 whole digits.
This is exactly what @Samwise's answer shows you. His answer is the mathematically optimal way to encode this. You will never get better than his answer. And the number of digits will always grow as the number of possible IDs you want to represent grows. There's just no mathematical way around that.
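The figures quoted above can be reproduced with math.log2. A short sketch, with bits_needed and chars_needed as hypothetical helper names:

```python
import math


def bits_needed(n_states: int) -> float:
    """Information content, in bits, of one choice among n_states."""
    return math.log2(n_states)


def chars_needed(n_states: int, alphabet_size: int = 36) -> float:
    """Fractional number of symbols needed; round up for a real string."""
    return bits_needed(n_states) / math.log2(alphabet_size)


print(round(bits_needed(1_000_000), 2))   # 19.93
print(round(math.log2(36), 2))            # 5.17
print(math.ceil(chars_needed(1_000_000))) # 4
```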

Converting binary to decimal using only Boolean and logic comparisons

I am taking a Python Certification class and have taken two practice exams to prepare for the timed exam I will be scheduling soon. However, there is limited interaction with professors and the discussion board is mostly students. I have a question that has been on both practice exams, so I imagine it will be on the real exam as well, and I cannot seem to wrap my head around how to solve it. There is no way in the class to see how to solve coding problems you have gotten incorrect, which is a major disappointment as that would help me in the future. I know there are built-in functions for binary/decimal conversions, but the professor wants this done using Boolean logic and numerical comparisons as we are still in the early stages of the course. If anyone could assist in "walking" through the why's of the answer I would greatly appreciate it. Thank you.
number = 1101
You may modify the lines of code above, but don't move them! When you
Submit your code, we'll change these lines to assign different values
to the variables.
The number above represents a binary number. It will always be up to
eight digits, and all eight digits will always be either 1 or 0.
The string gives the binary representation of a number. In binary,
each digit of that string corresponds to a power of 2. The far left
digit represents 128, then 64, then 32, then 16, then 8, then 4, then
2, and then finally 1 at the far right.
So, to convert the number to a decimal number, you want to (for
example) add 128 to the total if the first digit is 1, 64 if the
second digit is 1, 32 if the third digit is 1, etc.
For example, 00001101 is the number 13: there is a 0 in the 128s
place, 64s place, 32s place, 16s place, and 2s place. There are 1s in
the 8s, 4s, and 1s place. 8 + 4 + 1 = 13.
Note that although we use 'if' a lot to describe this problem, this
can be done entirely with boolean logic and numerical comparisons.
Print the number that results from this conversion.

number = "00001101"  # in Python, leading zeros are not permitted, so use a string
total = 0  # this var will keep track of the number in decimal form
index = len(number) - 1  # e.g. 1100 has 4 digits and the max power is 3, 2^3.
for str_digit in number:  # for each digit (as a string) in the number,
    # total += int(str_digit) * 2**index  # add the value (0 or 1) multiplied by 2 raised to the index power
    if int(str_digit):  # either 'if 0' or 'if 1'
        total += 2**index  # add 2 raised to the index power
    index -= 1  # decrease the index
print(total)
Note that the line if int(str_digit): is actually redundant if you use the commented line total += int(str_digit)* 2**index instead, but I included it because your question specified that you want to test the Boolean value.
This line is the same as if 0: or if 1: which is the same as if False: or if True:.
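For the exercise as stated (number given as an integer such as 1101, no if statements), one possible sketch relies on the fact that a comparison yields True or False and that Python treats these as 1 and 0 in arithmetic. The function name is made up for illustration, and a loop is used for brevity; the same expression could be written out eight times if loops have not been covered yet:

```python
def binary_digits_to_decimal(number: int) -> int:
    """Treat the decimal digits of `number` (all 0 or 1) as binary digits."""
    total = 0
    for position in range(8):  # the exercise guarantees at most 8 digits
        # Isolate one digit and compare it with 1: the result is a bool.
        digit_is_one = (number // 10**position) % 10 == 1
        # True behaves like 1 and False like 0 when multiplied.
        total += digit_is_one * 2**position
    return total


number = 1101
print(binary_digits_to_decimal(number))  # 13
```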

All you need is this:
int(number, base=2)

Why do I have to change integers to strings in order to iterate them in Python?

First of all, I have only recently started to learn Python on codeacademy.com and this is probably a very basic question, so thank you for the help and please forgive my lack of knowledge.
The function below takes a positive integer as input and returns the sum of all of that number's digits. What I don't understand is why I have to change the type of the input into str first, and then back into an integer, in order to add the number's digits to each other. Could someone help me out with an explanation please? The code works fine for the exercise, but I feel I am missing the big picture here.
def digit_sum(n):
    num = 0
    for i in str(n):
        num += int(i)
    return num

Integers are not sequences of digits. They are just (whole) numbers, so they can't be iterated over.
By turning the integer into a string, you created a sequence of digits (characters), and a string can be iterated over. It is no longer a number, it is now text.
See it as a representation; you could also have turned the same number into hexadecimal text, or octal text, or binary text. It would still be the same numerical value, just written down differently in text.
Iteration over a string works, and gives you single characters, which for a number means that each character is also a digit. The code takes that character and turns it back into a number with int(i).
You don't have to use that trick. You could also use maths:
def digit_sum(n):
    total = 0
    while n:
        n, digit = divmod(n, 10)
        total += digit
    return total
This uses a while loop, and repeatedly divides the input number by ten (keeping the remainder) until 0 is reached. The remainders are summed, giving you the digit sum. So 1234 is turned into 123 and 4, then 12 and 3, etc.

Let's take the number 12345.
I need 1, 2, 3, 4, 5 from the given number and then sum them up.
So how do you get the individual digits? One mathematical way is what @Martijn Pieters showed.
Another is to convert the number into a string, which makes it iterable.
This is one of the many ways to do it.
>>> sum(map(int, list(str(12345))))
15
The list() function breaks a string into individual characters, so I needed a string. Once I have all the digits as individual characters, I can convert them into integers and add them up.
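Since a string is already iterable, the list() call is optional. A brief sketch comparing the string-based and arithmetic approaches (the function names are illustrative only):

```python
def digit_sum_str(n: int) -> int:
    """Sum the digits by iterating over the decimal text of n."""
    return sum(int(ch) for ch in str(n))


def digit_sum_math(n: int) -> int:
    """Sum the digits by repeatedly peeling off the last one."""
    total = 0
    while n:
        n, digit = divmod(n, 10)
        total += digit
    return total


print(digit_sum_str(12345), digit_sum_math(12345))  # 15 15
```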

How to 'and' data without ignoring digits?

Say I have a number, 18573628, where each digit represents some kind of flag, and I want to check if the value of the fourth flag is set to 7 or not (which it is).
I do not want to use indexing. I want to in some way and with a flag mask, such as this:
00070000
I would normally use np.logical_and() or something like that, but that will consider any positive value to be True. How can I and while considering the value of a digit? For example, performing the operation with
flags = 18573628
and
mask = 00070000
would yield 00010000
though trying a different mask, such as
mask = 00040000
would yield 00000000

What you can do is
if (x // 10**n % 10) == y:
    ...
to check if the n-th digit of x (counting from the right, starting at 0) is equal to y.

You have to use divide and modulo for a decimal mask:
flags = 18573628
mask = 10000
if (flags // mask) % 10 == 7:  # use // so the division stays integral in Python 3
    do_something
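The divide-and-modulo check can be wrapped into small helpers. This is a sketch, not an established API: digit_at and match_mask are hypothetical names, and match_mask merely imitates the "00010000" style result asked for in the question (a 1 wherever a non-zero mask digit equals the corresponding flag digit):

```python
def digit_at(x: int, n: int) -> int:
    """Return the n-th decimal digit of x, counting from 0 at the right."""
    return x // 10**n % 10


def match_mask(flags: int, mask: int, width: int = 8) -> str:
    """Mark with '1' each position where a non-zero mask digit matches flags."""
    out = []
    for n in range(width - 1, -1, -1):  # leftmost digit first
        wanted = digit_at(mask, n)
        hit = wanted != 0 and digit_at(flags, n) == wanted
        out.append("1" if hit else "0")
    return "".join(out)


flags = 18573628
print(digit_at(flags, 4))        # 7
print(match_mask(flags, 70000))  # 00010000
print(match_mask(flags, 40000))  # 00000000
```

The masks are written as 70000 and 40000 because Python 3 does not allow integer literals with leading zeros.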

You can convert the input number into an array of digits and then simply index into that array with the specific index or indices to get those digit(s). For that conversion, we can use np.fromstring (note that its binary mode, used here, is deprecated in recent NumPy releases in favor of np.frombuffer), like so -
In [87]: nums = np.fromstring(str(18573628),dtype=np.uint8)-48
In [88]: nums
Out[88]: array([1, 8, 5, 7, 3, 6, 2, 8], dtype=uint8)
In [89]: nums[3] == 7
Out[89]: True

Say I have a number, 18573628, where each digit represents some kind of flag, and I want to check if the value of the fourth flag is set to 7
Firstly, bitwise operations like & are bit-wise, which is to say they operate on base-2 digits. They don't operate naturally on digits of any other base, although bases which are themselves powers of 2 work out ok.
To stick with bit-wise operations
You need to know how many values each flag can take, to figure out how many bits each flag needs to encode.
If you want to allow each flag the values zero to nine, you need four bits. However, in this scheme, your number won't behave like a normal integer (storing a base-10 digit in each 4-bit group is called Binary Coded Decimal).
The reason it won't behave like a normal integer is that flag values 1,2,3 will be stored as 1 * 16**2 + 2*16 + 3 instead of the 1 * 10**2 + 2*10 + 3 you'd normally expect. So you'd need to write some code to support this use. However, extracting flag n (counting from zero at the right) just becomes
def bcdFlagValue(bcd, flagnum):
    # flag 0 is the lowest 4 bits; flag n sits n*4 bits further left
    return 0x0F & (bcd >> (flagnum * 4))
If you actually need a different range of values for each flag, you need to choose the correct number of bits, and adjust the shift and mask values appropriately.
In either case, you'll need a helper function if you want to print your flags as the base-10 number you showed.
To use normal base 10 numbers
You need to use division and modulo (as 6502 showed), because base-10 numbers don't fit evenly into base-2 bits, so simple bit operations don't work
Note
The BCD approach saves space at the cost of complexity, effort and some speed - from subsequent comments, it's probably simpler to just use the string of digit characters directly unless you really need to save 4 bits per digit.
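To make the BCD idea concrete, here is a small sketch that packs a decimal number into 4-bit groups, reads one flag back, and converts back to a normal integer. The helper names to_bcd, bcd_flag and from_bcd are assumptions for illustration; flags are counted from zero at the right, so flag n is found n*4 bits up:

```python
def to_bcd(n: int) -> int:
    """Pack each decimal digit of a non-negative n into its own 4-bit group."""
    if n < 0:
        raise ValueError("n must be non-negative")
    bcd = 0
    shift = 0
    while True:
        n, digit = divmod(n, 10)
        bcd |= digit << shift
        shift += 4
        if n == 0:
            return bcd


def bcd_flag(bcd: int, flagnum: int) -> int:
    """Extract flag number flagnum (0 = rightmost) with a shift and a mask."""
    return (bcd >> (flagnum * 4)) & 0x0F


def from_bcd(bcd: int) -> int:
    """Turn the packed value back into an ordinary base-10 integer."""
    result = 0
    place = 1
    while bcd:
        result += (bcd & 0x0F) * place
        place *= 10
        bcd >>= 4
    return result


packed = to_bcd(18573628)
print(hex(packed))          # 0x18573628
print(bcd_flag(packed, 4))  # 7
print(from_bcd(packed))     # 18573628
```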

If flags and mask are hexadecimal values, you can do:
flags = int("18573628", 16)
mask = int("00070000", 16)
result = flags & mask
print(hex(result))
=> 0x70000

Without dealing with the particulars of your case (the SDSS data, which should be documented in the product specification), let's look at some options.
First, you need to know if it is to be read in big-endian or little-endian order (is the first bit to the right or to the left). Then you need to know the size of each flag. For a series of yes-no parameters, it could simply be 1 bit (0 or 1). For up to four options, it could be two bits (00, 01, 10, 11), etc. It is also possible that some combinations are reserved for future expansion, don't currently have meaning, and should not be expected to occur in the data. I've also seen instances where the flag size varies, so the first n bits refer to parameter x, the next m bits refer to parameter y, etc.
There is a good explanation of the concept as part of Landsat-8 satellite imagery:
http://landsat.usgs.gov/qualityband.php
To read the values, you convert the base 10 integer to binary, and traverse it in the specified chunks, converting back to int to obtain the parameter values according to your product specification.
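Reading such packed fields can be sketched with shifts and masks. The layout below is purely hypothetical (it is not the SDSS or Landsat specification); each entry gives an assumed bit offset from the right and a field width:

```python
def extract_field(value: int, offset: int, width: int) -> int:
    """Return `width` bits of `value`, starting `offset` bits from the right."""
    return (value >> offset) & ((1 << width) - 1)


# Hypothetical layout for illustration: name -> (offset, width in bits).
LAYOUT = {"fill": (0, 1), "cloud": (1, 2), "snow": (3, 2)}


def decode_flags(value: int, layout: dict = LAYOUT) -> dict:
    """Split an integer into named fields according to the given layout."""
    return {name: extract_field(value, offset, width)
            for name, (offset, width) in layout.items()}


print(decode_flags(0b10111))  # {'fill': 1, 'cloud': 3, 'snow': 2}
```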

Float formatting in Python

While doing this exercise:
>>> amount = 24.325
>>> print("%7f" % amount)
24.325000
I didn't understand why instead of printing ' 24.325' (with 1 space before the number) it just added three '0' and didn't move it towards the right at all.
So I thought that maybe, when you don't specify the precision, Python adds '0' until the number has at least 6 digits after the decimal point (if it doesn't already have them) and THEN takes into consideration the width I set (in this case, 7) and adds the needed spaces. In the exercise, with the extra '0's, it ends up having 9 characters, so it didn't add any. Is my hypothesis correct?
The question is, why 6 digits after the decimal point? Is it something that Python does by default?
Thank you.

The Python docs describe exactly how the formatting operator % works.
In particular, you can have a minimum field width and a precision.
The former defines the minimum length of your resulting string, the latter gives the number of digits after the decimal point.
As the default precision is 6 for %f, you get what you get.
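A quick way to see the default precision and the width interact is to compare a few format strings side by side. This is only an illustrative sketch; the variable names are made up, and repr() is used so that padding spaces are visible:

```python
amount = 24.325

default_precision = "%7f" % amount     # precision defaults to 6, width 7 is already exceeded
explicit_precision = "%7.3f" % amount  # 3 decimals, then padded to width 7
wider_field = "%10f" % amount          # 6 decimals, padded to width 10

for text in (default_precision, explicit_precision, wider_field):
    print(repr(text))
# '24.325000'
# ' 24.325'
# ' 24.325000'
```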
