I want to emphasize that this is not a request to complete my homework or job: I am studying the LZW algorithm for GIF file compression by reading someone's code on GitHub, and got confused by this code block:
class DataBlock(object):

    def __init__(self):
        self.bitstream = bytearray()
        self.pos = 0

    def encode_bits(self, num, size):
        """
        Given a number *num* and a length in bits *size*, encode *num*
        as a *size* length bitstring at the current position in the bitstream.
        """
        string = bin(num)[2:]
        string = '0' * (size - len(string)) + string
        for digit in reversed(string):
            if len(self.bitstream) * 8 <= self.pos:
                self.bitstream.append(0)
            if digit == '1':
                self.bitstream[-1] |= 1 << self.pos % 8
            self.pos += 1
What I cannot understand is the for loop in the function encode_bits():
for digit in reversed(string):
    if len(self.bitstream) * 8 <= self.pos:
        self.bitstream.append(0)
    if digit == '1':
        self.bitstream[-1] |= 1 << self.pos % 8
    self.pos += 1
Here is my guess (based on his comment):
The function encode_bits() turns the input integer num into a binary string of length size (padding with zeroes on the left if needed), reverses the string, and appends the digits to bitstream one by one. So,
suppose s = DataBlock(); then s.encode_bits(3, 3) would first turn 3 into 011 (padding a zero on the left to make it length 3), reverse it to 110, and then append 110 to self.bitstream, so the result should be bytearray('110'). But when I run the code, the result is bytearray(b'\x03'), not what I expected. Furthermore, \x03 is one byte, not 3 bits, which conflicts with his comment. I cannot understand why.
I forgot to add that his code runs and gives correct output, so there must be something wrong in my understanding.
Try looking at it this way:
You will be given a bytearray object (call it x for the moment).
How many bytes are in the object? (Obviously, it's just len(x).)
How many bits are in the object? (This is an exercise; please calculate the answer.)
Once you've done that, suppose we start with no (zero) bytes in x, i.e., x is a bytearray(b''). How many bytes do we need to add (x.append(...)) in order to store three bits? What if we want to store eight bits? What if we want to store ten bits?
Again, these are exercises. Calculate the answers and you should be enlightened.
(Incidentally, this technique, of compressing some number of sub-objects into some larger space, is called packing. In mathematics the problem is generalized, while in computers it is often more limited.)
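To make the exercises concrete, here is the class from the question again, plus a check of the bit/byte arithmetic (this just reproduces the questioner's code together with a `ceil` computation):

```python
import math

class DataBlock(object):
    """The class from the question, reproduced for experimentation."""
    def __init__(self):
        self.bitstream = bytearray()
        self.pos = 0

    def encode_bits(self, num, size):
        string = bin(num)[2:]
        string = '0' * (size - len(string)) + string
        for digit in reversed(string):
            if len(self.bitstream) * 8 <= self.pos:
                self.bitstream.append(0)
            if digit == '1':
                self.bitstream[-1] |= 1 << self.pos % 8
            self.pos += 1

s = DataBlock()
s.encode_bits(3, 3)    # packs the bits of 3 into the low end of one byte
print(s.bitstream)     # bytearray(b'\x03')

# The exercise answers: k bits need ceil(k / 8) bytes of storage
for k in (3, 8, 10):
    print(k, "bits ->", math.ceil(k / 8), "byte(s)")
```

The three bits are packed into the low end of a single byte, which is exactly why the questioner saw `b'\x03'` rather than a three-character string.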
I have rather long IDs, like 1000000000109872, and would like to represent them as strings.
However all the libraries for Rust I've found such as hash_ids and block_id produce strings that are way bigger.
Ideally I'd like 4 to maybe 5 characters, numbers are okay but only uppercase letters. Doesn't need to be cryptographically secure as long as it's unique.
Is there anything that fits my needs?
I've tried this website: https://v2.cryptii.com/decimal/base64 and for 1000000000109872 I get 4rSw, this is very short which is great. But it's not uppercase.
This is the absolute best you can do if you want to guarantee no collisions without having any specific guarantees on the range of the inputs beyond "unsigned int" and you want it to be stateless:
def base_36(n: int) -> str:
    if not isinstance(n, int):
        raise TypeError("Check out https://mypy.readthedocs.io/")
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n < 10:
        return str(n)
    if n < 36:
        return chr(n - 10 + ord('A'))
    return base_36(n // 36) + base_36(n % 36)

print(base_36(1000000000109872))  # 9UGXNOTWDS
If you're willing to avoid collisions by keeping track of id allocations, you can of course do much better:
ids: dict[int, int] = {}

def stateful_id(n: int) -> str:
    return base_36(ids.setdefault(n, len(ids)))

print(stateful_id(1000000000109872))  # 0
print(stateful_id(1000000000109454))  # 1
print(stateful_id(1000000000109872))  # 0
or if some parts of the ID can be safely truncated:
MAGIC_NUMBER = 1000000000000000

def truncated_id(n: int) -> str:
    if n < MAGIC_NUMBER:
        raise ValueError(f"IDs must be >= {MAGIC_NUMBER}")
    return base_36(n - MAGIC_NUMBER)

print(truncated_id(1000000000109872))  # 2CS0
Short Answer: Impossible.
Long Answer: You're asking to represent about 10^16 distinct IDs in 36^5 combinations (5 uppercase/digit chars).
An uppercase-letter-or-digit character is one of 36 cases (10 digits + 26 letters). But 36^5 = 60,466,176, which is less than even 10^8, so five characters cannot possibly cover 10^16 IDs.
Since 36^10 < 10^16 < 36^11, you'll need at least 11 such characters to represent your (up to 10^16) IDs.
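A quick way to verify that arithmetic (a minimal sketch, just checking the powers of 36):

```python
# Find the smallest k such that 36**k >= 10**16,
# i.e. the number of base-36 characters needed.
k = 1
while 36 ** k < 10 ** 16:
    k += 1

print(k)        # 11
print(36 ** 5)  # 60466176 -- nowhere near 10**16
```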
As you already stated that there is even a checksum inside the original ID, I assume the new representation should contain all of its data.
In this case, your question is strongly related to lossless compression and information content.
Information content says that every piece of data contains a certain amount of information. Information can be measured in bits.
The sad news is that, no matter what, you cannot magically reduce your data to fewer bits. It will always carry the same amount of information. You can change the representation to store those bits as compactly as possible, but you cannot reduce their number.
You might think of JPEGs or compressed movies, which are stored very compactly; the problem there is that they are lossy. They discard information not perceived by the human eye/ear.
In your case, there is no trickery possible. You will always have a smallest and a largest ID that you handed out. And all the IDs between your smallest and largest ID have to be distinguishable.
Now some math. If you know the number of possible states of your data (e.g. the number of distinguishable IDs), you can compute the required information content as log2(N), where N is the number of possible states.
So let's say you have 1,000,000 different IDs; that means you need log2(1000000) = 19.93 bits to represent them. You will never be able to reduce this number any further.
Now to actually represent them: You say you want to store them in a string of 26 different uppercase letters plus 10 different digits. This is called a base36 encoding.
Each digit of this can carry log2(36) = 5.17 bits of information. Therefore, to store your 1000000 different IDs, you need at least 19.93/5.17 = 3.85 digits.
This is exactly what @Samwise's answer shows you. His answer is the mathematically optimal way to encode this. You will never get better than his answer. And the number of digits will always grow as the number of possible IDs you want to represent grows. There's just no mathematical way around that.
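The figures in this answer are easy to reproduce; a minimal sketch of the information-content arithmetic:

```python
import math

N = 1_000_000                   # number of distinguishable IDs
bits_needed = math.log2(N)      # information content in bits
bits_per_char = math.log2(36)   # what one base-36 character can carry

print(round(bits_needed, 2))    # 19.93
print(round(bits_per_char, 2))  # 5.17
# Minimum number of characters: about 3.86, so 4 in practice
print(math.ceil(bits_needed / bits_per_char))  # 4
```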
I have a function. The input is a word, and for each character, the result so far is shifted and the character's value is added:
def magic2(b):
    res = 0
    for c in b:
        res = (res << 8) + ord(c)
        print(res)
    return res
Because it uses shifts, I will lose some data. I want to decode/reverse this to recover the exact letters of the input word.
For example, if the input is "saman", the result is 495555797358, and step by step the printed values would be:
115
29537
7561581
1935764833
495555797358
How can I get back to the input word just with these outputs?
Consider what you're doing: for each character, you shift left by 8 bits, and then add on another 8 bits.1
So, how do you undo that? Well, for each character, you grab the rightmost 8 bits, then shift everything else right by 8 bits. How do you know when you're done? When shifting right by 8 bits leaves you with 0, you must have just gotten the leftmost character. So:
def unmagic2(n):
    while n > 0:
        c = chr(n & 0xff)  # 0xff is (1 << 8) - 1
        n = n >> 8
Now you just have to figure out what to do with each c to get your original string back. It's not quite as trivial as you might at first think, because we're getting the leftmost character last, not first. But you should be able to figure it out from here.
1. If you're using the full gamut of Unicode, this is of course lossy, because you're shifting left by 8 bits and then adding on another 21 bits, so there's no way to reverse that. But I'll assume you're using Latin-1 strings here, or bytes—or Python 2 str.
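(Spoiler for the exercise above.) One way to finish the sketch, assuming the input was packed from 8-bit characters: prepend each recovered character, since they come out rightmost-first:

```python
def unmagic2(n):
    s = ''
    while n > 0:
        s = chr(n & 0xff) + s  # prepend: characters emerge right-to-left
        n >>= 8
    return s

print(unmagic2(495555797358))  # saman
```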
I have a bit of a challenge before me.
Currently I'm trying to accomplish this process:
Feed a decimal, or any number really, into a binary converter
Now that we possess a binary string, we must measure the length of the string. (as in, numstr="10001010" - I want the return to count the characters and return "8")
Finally, I need to extract a section of said string, if I want to cut out the first half of the string "10001010" and save both halves, I want the return to read "1000" and "1010"
Current Progress:
newint = input("Enter a number:")
newint2 = int(newint)
binStr = ""
while newint2 > 0:
    binStr = str(newint2 % 2) + binStr
    newint2 = newint2 // 2
print(binStr)

newint = input("Enter a binary number:")
temp = newint
power = 0
number = 0
while len(temp) > 0:
    bit = int(temp[-1])
    number = number + bit * 2 ** power
    power += 1
    temp = temp[:-1]
print(number)
This works for integer values; how do I get it to also work for decimal values, where the integer part is either there or 0 (35.45 or 0.4595)?
This is where I'm lost, I'm not sure what the best way to attempt this next step would be.
Once I convert my decimal or integer into binary representation, how can I cut my string by varying lengths? Let's say my binary representation is 100 characters, and I want to cut out lengths that are 10% the total length, so I get 10 blocks of 10 characters, or blocks that are 20% total length so I have 5 blocks of 20 characters.
Any advice is appreciated, I'm a super novice and this has been a steep challenge for me.
Strings can be divided up through slice notation.
>>> a = '101010101010'
>>> a[0]
'1'
>>> a[0:5]
'10101'
>>> a[0:len(a)//2]
'101010'
That's something you should read up on, if you're getting into Python.
Here is my suggestion, based on answer from What's the best way to split a string into fixed length chunks and work with them in Python? :
def chunkstring(string, percent):
    length = int(len(string) * percent / 100)
    return [string[i:i + length] for i in range(0, len(string), length)]

# Define a string with 100 characters
a = '0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789'

# Split this string by percentage of total length
print(chunkstring(a, 10))  # 10 %
print(chunkstring(a, 20))  # 20 %
I have a list/array of numbers, which I want to save to a binary file.
The crucial part is, that each number should not be saved as a pre-defined data type.
The bits per value are constant for all values in the list but do not correspond to the typical data types (e.g. byte or int).
import numpy as np

# create 10 random numbers in range 0-63
values = np.int32(np.round(np.random.random(10) * 63))
# each value requires exactly 6 bits
# how to save this to a file?

# just for debug/information: bit string representation
bitstring = "".join(map(lambda x: str(bin(x)[2:]).zfill(6), values))
print(bitstring)
In the real project, there are more than a million values I want to store with a given bit depth.
I already tried the module bitstring, but appending each value to the BitArray costs a lot of time...
There may be some numpy-specific way that makes things easier, but here's a pure Python (2.x) way to do it. It first converts the list of values into a single integer, since Python supports int values of any length. Next it converts that int value into a string of bytes and writes it to the file.
Note: If you're sure all the values will fit within the bit-width specified, the array_to_int() function could be sped up slightly by changing the (value & mask) it's using to just value.
import random

def array_to_int(values, bitwidth):
    mask = 2**bitwidth - 1
    shift = bitwidth * (len(values) - 1)
    integer = 0
    for value in values:
        integer |= (value & mask) << shift
        shift -= bitwidth
    return integer

# In Python 2.7 int and long don't have the "to_bytes" method found in Python 3.x,
# so here's one way to do the same thing.
def to_bytes(n, length):
    return ('%%0%dx' % (length << 1) % n).decode('hex')[-length:]

BITWIDTH = 6
#values = [random.randint(0, 2**BITWIDTH - 1) for _ in range(10)]
values = [0b000001 for _ in range(10)]  # create fixed pattern for debugging
values[9] = 0b011111  # make last one different so it can be spotted

# just for debug/information: bit string representation
bitstring = "".join(map(lambda x: bin(x)[2:].zfill(BITWIDTH), values))
print(bitstring)

bigint = array_to_int(values, BITWIDTH)
width = BITWIDTH * len(values)
print('{:0{width}b}'.format(bigint, width=width))  # show integer's value in binary

num_bytes = (width + 7) // 8  # round up to a whole number of 8-bit bytes
with open('data.bin', 'wb') as file:
    file.write(to_bytes(bigint, num_bytes))
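For comparison, in Python 3 the custom `to_bytes()` helper is unnecessary, because `int` already has a built-in `to_bytes` method. A sketch of the same packing in Python 3, using the same fixed test values as above:

```python
def array_to_int(values, bitwidth):
    # Same layout as the answer's function: the first value
    # ends up in the most significant bits.
    integer = 0
    for value in values:
        integer = (integer << bitwidth) | (value & (2**bitwidth - 1))
    return integer

values = [0b000001] * 9 + [0b011111]
bigint = array_to_int(values, 6)
num_bytes = (6 * len(values) + 7) // 8          # round up to whole bytes
with open('data.bin', 'wb') as f:
    f.write(bigint.to_bytes(num_bytes, 'big'))  # big-endian, like the original
```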
Since you give an example with a string, I'll assume that's how you get the results. This means performance is probably never going to be great. If you can, try creating bytes directly instead of via a string.
Side note: I'm using Python 3 which might require you to make some changes for Python 2. I think this code should work directly in Python 2, but there are some changes around bytearrays and strings between 2 and 3, so make sure to check.
byt = bytearray((len(bitstring) + 7) // 8)  # round up; "// 8 + 1" would add a spurious byte when the length is a multiple of 8
for i, b in enumerate(bitstring):
    byt[i//8] += (b == '1') << i % 8
and for getting the bits back:
bitret = ''
for b in byt:
    for i in range(8):
        bitret += str((b >> i) & 1)
For millions of bits/bytes you'll want to convert this to a streaming method instead, as you'd need a lot of memory otherwise.
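A sketch of such a streaming approach (assuming the same LSB-first bit order as the code above): a generator yields one packed byte at a time, so the full bitstring never has to be materialized:

```python
def pack_bits(bits):
    """Yield bytes packed LSB-first from an iterable of '0'/'1' characters."""
    byte = 0
    count = 0
    for b in bits:
        if b == '1':
            byte |= 1 << count
        count += 1
        if count == 8:
            yield byte
            byte = 0
            count = 0
    if count:  # flush a final partial byte
        yield byte

packed = bytes(pack_bits('110000011'))  # 9 bits -> 2 bytes
print(packed)                           # b'\x83\x01'
```

Because `pack_bits` is a generator, it can be fed from a file or another generator and written out incrementally with `file.write(bytes(...))` in chunks.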
I've been doing an RSA encryption/decryption assignment for a school assignment, and I've actually gotten the whole thing working. The one thing I want to make sure I understand is the padding. The book states that after we turn the string of characters into a string of digits (A = 00, and Z = 25) we then need to determine the size of the blocks and add dummy characters to the end.
The book states:
Next, we divide this string into equally sized blocks of 2N digits,
where 2N is the largest even number such that the number 2525 ... 25
with 2N digits does not exceed n.
It doesn't tell me where it gets 25 from, so I deduced that it was the index of the last character (Z in this case) of our actual alphabet of characters.
So here is my Python3 implementation (fair warning it is somewhat cringe-worthy):
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def __determineSize__(message, n):
    if n < len(alphabet) - 1:
        raise Exception("n is not sufficiently large")
    buffer = ""
    for i in range(0, n, 2):
        buffer += str(len(alphabet) - 1)  # += "25" in this case
        if int(buffer) > n:
            groupSize = len(buffer) - 2
            return groupSize
It starts with 25 (len(alphabet) = 26, and 26 - 1 = 25); if that is not larger than n, we extend it to 2525. If it is larger at this point we stop, because we know we've gone too far, and we return the length 2, since length 4 is too large.
This is how I understood it, and it works, but it doesn't seem right. Did I interpret this correctly, or am I completely off base? If I am, can someone set me straight? (I'm not asking for code; because it's for an assignment I don't want to plagiarise anyone, so if anyone could just tell me what I'm supposed to do in simple English or show me pseudocode, that would be great.)
Like always, thanks to everyone in advance!
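(Editorial aside, for future readers.) The interpretation above is correct: the block size 2N is the largest even length such that the digit string 2525...25 stays at or below n. The computation can be stated a bit more directly; `block_size` below is a hypothetical helper name, not from the book:

```python
def block_size(n):
    # assumes n >= 25; find the largest N with int("25" * N) <= n
    N = 1
    while int("25" * (N + 1)) <= n:
        N += 1
    return 2 * N  # each block is 2N digits

print(block_size(2537))  # 4  ("2525" <= 2537, but "252525" is too big)
```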