I am struggling with a number-format problem in Python 3.6. My goal is to convert binary data from a file into printable decimal numbers. As an example, I need to convert two little-endian bytes in the form of a byte string...
b'\x12\00'
into its big-endian binary form...
0000000000010010
and finally to its 16-bit fixed-point Q15 decimal number form...
(1 / 4096) + (1 / 16384) = 0.00030517578 (basically, we've made the 2 bytes above human-readable)
In my failed attempts, the struct.unpack function seemed promising, but my low-level / number representation experience just isn't very mature at the moment.
Failed Attempt:
struct.unpack('<h', b'\x12\x00') # Yields (18,)
The above code gets me "18", which would be fine if the bytes represented an integer, but they do not.
Any help / advice would be appreciated. Thank you!
Answered by #jasonharper in the question comments--
The bytes do represent an integer - which has been shifted by 15 bits. Divide by 32768 (2**15) to get the actual Q15 value. (This doesn't match the value you calculated, but that's because you did the math wrong - the two set bits actually have place values of 1/2048 and 1/16384.)
I achieved the proper value via the following code--
struct.unpack('<h', b'\x12\x00')[0] / (2**15)
Related
I would like to write a script in python that takes a float value for example -37.32 and output its IEEE value (11000010000101010100011110101110). for the first bit if the number is negative than it is 1 else it's 0. for the exponent its the matter of dividing the number by 2 and getting the remainder if I am correct. as for the mantissa, I have no idea how to calculate it. returning a string would be the most reasonable way since it would be better for constructing each element of the IEEE.
can someone help me with the approach I will be taking? or show a script tackling this problem and explaining it to me?
The struct module can be used to convert a float to a sequence of bytes. Then it's just a matter of converting each byte to a binary string and joining them together.
>>> import struct
>>> ''.join('{:08b}'.format(b) for b in struct.pack('f', -37.32))
'10101110010001110001010111000010'
I'd like to calculate (⌊2^(1918)*π⌋+124476) in python but I get this error when I do it using the following code:
b = math.floor((2**1918) * math.pi) + 124476
print(b)
OverflowError: int too large to convert to float
How can you get this to work? In the end I just like to have it all as hexadecimal (if that helps with answering my question) but I was actually only trying to get it as an integer first :)
The right solution really depends on how precise the results are required. Since 2^1918 already is too large for both standard integer and floating point containers, it is not possible to get away with direct calculations without loosing all the precision below ~ 10^300.
In order to compute the desired result, you should use arbitrary-precision calculation techniques. You can implement the algorithms yourself or use one of the available libraries.
Assuming you are looking for an integer part of your expression, it will take about 600 decimal places to store the results precisely. Here is how you can get it using mpmath:
from mpmath import mp
mp.dps = 600
print(mp.floor(mp.power(2, 1918)*mp.pi + 124476))
74590163000744215664571428206261183464882552592869067139382222056552715349763159120841569799756029042920968184704590129494078052978962320087944021101746026347535981717869532122259590055984951049094749636380324830154777203301864744802934173941573749720376124683717094961945258961821638084501989870923589746845121992752663157772293235786930128078740743810989039879507242078364008020576647135087519356182872146031915081433053440716531771499444683048837650335204793844725968402892045220358076481772902929784589843471786500160230209071224266538164123696273477863853813807997663357545.0
Next, all you have to do is to convert it to hex representation (or extract hex from its internal binary form), which is a matter for another subject :)
The basic problem is what the message says. Python integers can be arbitrarily large, larger even than the range of a float. 2**1918 in decimal contains 578 significant digits and is way bigger than the biggest float your IEEE754 hardware can represent. So the call just fails.
You could try looking at the mpmath module. It is designed for floating point arithmetic outside the bounds of what ordinary hardware can handle.
I think the problem can be solved without resorting to high-precision arithmetic. floor(n.something + m) where m and n are integers is equal to floor(n.something) + m. So in this case you are looking for floor(2**1918 * pi) plus an integer (namely 124476). floor(2**whatever * pi) is just the first whatever + 2 bits of pi. So just look up the first 1920 bits of pi, add the bits for 124476, and output as hex digits.
A spigot algorithm can generate digits of pi without using arbitrary precision. A quick web search seems to find some Python implementations for generating digits in base 10. I didn't see anything about base 2, but Plouffe's formula generates base 16 digits if I am not mistaken.
The problem is that (2**1918) * math.pi attempts to convert the integer to 64-bit floating point precision, which is insufficiently large. You can convert math.pi to a fraction to use arbitrary precision.
>>> math.floor((2**1918) * fractions.Fraction(math.pi) + 124476)
74590163000744212756918704280961225881025315246628098737524697383138220762542289800871336766911957454080350508173317171375032226685669280397906783245438534131599390699781017605377332298669863169044574050694427882869191541933796848577277592163846082732344724959222075452985644173036076895843129191378853006780204194590286508603564336292806628948212533561286572102730834409985441874735976583720122784469168008083076285020654725577288682595262788418426186598550864392013191287665258445673204426746083965447956681216069719524525240073122409298640817341016286940008045020172328756796
Note that arbitrary precision applies to the calculation; math.pi is defined only with 64-bit floating point precision. Use an external library, such as mpmath, if you need the exact value.
To convert this to a hexadecimal string, use hex or a string format:
>>> hex(math.floor((2**1918) * fractions.Fraction(math.pi) + 124476))
'0xc90fdaa22168c0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001e63c'
>>> '%x' % math.floor((2**1918) * fractions.Fraction(math.pi) + 124476)
'c90fdaa22168c0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001e63c'
>>> f'{math.floor((2**1918) * fractions.Fraction(math.pi) + 124476):X}'
'C90FDAA22168C0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001E63C'
For string formats, x provides lower-case hex whereas X provides upper-case case.
I want to convert a given hex into base64 (in python without using any libraries). As I learned from other stackoverflow answers, we can either group 3 hex (12 bits i.e. 4 bits each) to get 2 base64 values (12 bits i.e. 6 bits each). And also we can group 6 hex(24 bits) into 4 base64 values (24 bits).
The standard procedure is to append all the binary bits of hexs together and start grouping from left in packets of 6.
My question is regarding the situation we need padding for:
(Assuming we are converting 3 hex into 2 base64)
There will arise a situation when we are left with only 2 or 1 hex values to convert. Take the example below:
'a1' to base64
10100001 (binary of a1)
101000 01(0000) //making groups of 6 and adding additional 0's where required
This gives "oQ"the answer which is at some place(oQ==) and something different in other place(wqE=)
Q1. Which of the two sources are giving the correct answer? Why the other one is wrong being a good online decoder?
Q2. How do we realise the number of '=' here? (We could have just add sufficient 0's wherever needed as in example above, and thus ending the answer to be just oQ here and not oQ== , assuming oQ== is correct)
My concept is that: if the hex is of length 2 (rather than 3) we pad with a single = (hence complying with the answer wqE= in above case)
, else if the hex is of length 1 ( rather than 3), we pad with double ='s.
At the same time, I am confused that, if 3 hex is converted into 2 base64, we would never need two ='s.
'a' to base64
1010 (binary of a)
Q3. How to convert hex 'a' to base64.
Base64 is defined by RFC 4648 as being "designed to represent arbitrary sequences of
octets". Octet is a unit of 8 bits, in practice synonymous with byte. When your input is in the form of a hex string, your first step should be to decode it into a byte string. You need two hex characters for each byte. If the length of the input is odd, the reasonable course of action is to raise an error.
To address you numbered questions:
Q1: Even while going to implement you own encoder, you can make use of Python standard library to investigate. Decoding the two results back to bytes gives:
>>> import base64
>>> base64.b64decode(b'oQ==')
b'\xa1'
>>> base64.b64decode(b'wqE=')
b'\xc2\xa1'
So, oQ== is correct, while wqE= has a c2 byte added in front. I can guess that it is the result of applying UTF-8 encoding before Base64. To confirm:
>>> '\u00a1'.encode('utf-8')
b'\xc2\xa1'
Q2: The rules for padding are detailed in the RFC.
Q3: This is ambiguous and you are right to be confused.
I've been reading about base64 conversion, and what I understand is that the encoded version of the original data will be 133% of the original size.
Then, I'm reading about how YouTube is able to have unique identifiers to their videos like FJZQSHn7fc and the reason was: an 11 character base64 string can map to a huge number.
Wait, say a huge number contains 20 characters, then wouldn't a base64 encoded string be 133% of that size, not shorter?
I'm very confused. Are there different types of base64 conversion (string to base64 vs. decimal to base64), once resulting in a bigger, and the other in a smaller resulting string?
Each character in base 64 can encode 6 bits of data. Thus 11 characters can encode 6x11 = 66 bits of data.
2^66 = 73786976294838206464
73786976294838206464 (approximately 7.4 x 10^19 or 74 quintillion) possible identifiers is more than enough to distinguish unique YouTube videos for the foreseeable future.
It is unlikely that YouTube is using these strings of length 11 as encodings of smaller objects. You can use base64 (just a number in base 64 after all) without having to think of it as an encoding of something else, just like you can use bytes (binary numbers with 8 bits) without thinking of those bytes as being encodings of ascii characters. The only important question with an identifier scheme is if there are enough identifiers to go around. In this case there clearly are.
Think of it like this: you have a 64bit number (called long in Java, for example).
Now, you can print that number in different ways:
As a binary number (base 2), printing 64 '0' or '1'
As a decimal number (base 10), printing up to 20 decimal digits
As a hexadecimal number (base 16), printing 16 hexadeciaml digits
As a number in base 64, printing 11 "digits" in that base. You can use any graphical symbols as digits.
... you understand by now that there are many more possibilities ...
It seems like they use the same base-64 numbers as the ones that are used in base64 encoding, that is, uppercase and lowercase letters, ordinary digits and 2 extra chars. Each character represents a 6-bit value. So you get 66 bits, and depending on the algorithm used, either the leading or trailing 2 bits are cut off to get a nice long value back.
You are confusing what things are being compared.
There are 2 statements, both comparing different things:
"base64 encoding is 133% bigger than original size"
"An 11 character base64 string can encode a huge number"
In the case of 1, they are normally referring to a string encoded maybe with ASCII using 8bits a character, and comparing that with the same string encoded in base64. That is 133% bigger, because in base64 you can't use all 255 bit combinations in every byte.
In the case of 2, they are comparing using a numeric identifier, and then either encoding it as base64, or base10. In this case, base64 is a lot shorter than base10.
You can also think of the (1) case as comparing base256 against base64, and the (2) case as comparing base10 against base64.
When you say Base64, some would think of RFC 4648. If YouTube is using RFC 4648, then it's a 12-digit number where they're omitting the last digit because it is always '=', the padding character (the 65th element of the base64 alphabet). The 12 digits represent three blocks of four digits, and four digits yield 24 bits of information. YouTube video IDs would therefore be 64-bit, not 66-bit, if they're using the standard.
Those 64 bits might be representing an unsigned integer. YouTube used MySQL and then sharded MySQL through Vitess, so you could imagine them using an UNSIGNED BIGINT key internally that they encode via RFC 4648-compliant Base64 externally.
Clearly Tom Scott thinks YouTube is squeezing 66 bits out of their 11 characters; his video says so.
If he's wrong, then their frontend might allow you to specify four distinct video IDs for the same video. Those two extra bits' values do not affect the UNSIGNED BIGINT. Which two bits they are depend on endianness and other choices of encoding.
Regardless of whether YouTube is using standard or nonstandard encoding, they can represent 18446744073709551615 in 11 characters (since the padding character is always there and and thus omitted for a 64-bit quantity).
Perhaps they use something like the following to compute a pseudorandom 64-bit integer when a new video is created:
import base64
import random
def Base64RandomSlug():
array = bytearray(random.getrandbits(8) for x in range(64 // 8))
b = base64.urlsafe_b64encode(bytes(array))
return b.decode('utf-8').rstrip('=')
I have been working on a program and I have been trying to convert a big binary file (As a string) and pack it into a file. I have tried for days to make such thing possible. Here is the code I had written to pack the large binary string.
binaryRecieved="11001010101....(Shortened)"
f=open(fileName,'wb')
m=long(binaryRecieved,2)
struct.pack('i',m)
f.write(struct.pack('i',m))
f.close()
quit()
I am left with the error
struct.pack('i',x)
struct.error: integer out of range for 'i' format code
My integer is out of range, so I was wondering if there is a different way of going about with this.
Thanks
Convert your bit string to a byte string: see for example this question Converting bits to bytes in Python. Then pack the bytes with struct.pack('c', bytestring)
For encoding m in big-endian order (like "ten" being written as "10" in normal decimal use) use:
def as_big_endian_bytes(i):
out=bytearray()
while i:
out.append(i&0xff)
i=i>>8
out.reverse()
return out
For encoding m in little-endian order (like "ten" being written as "01" in normal decimal use) use:
def as_little_endian_bytes(i):
out=bytearray()
while i:
out.append(i&0xff)
i=i>>8
return out
both functions work on numbers - like you do in your question - so the returned bytearray may be shorter than expected (because for numbers leading zeroes do not matter).
For an exact representation of a binary-digit-string (which is only possible if its length is dividable by 8) you would have to do:
def as_bytes(s):
assert len(s)%8==0
out=bytearray()
for i in range(0,len(s)-8,8):
out.append(int(s[i:i+8],2))
return out
In struct.pack you have used 'i' which represents an integer number, which is limited. As your code states, you have a long output; thus, you may want to use 'd' in stead of 'i', to pack your data up as double. It should work.
See Python struct for more information.