How to convert an integer to variable length byte string?


I want to convert an integer (int or long) to a big-endian byte string. The byte string has to be of variable length, so that only the minimum number of bytes is used (the total length of the preceding data is known, so the variable length can be inferred).
My current solution is
import bitstring
bitstring.BitString(hex=hex(456)).tobytes()
This obviously depends on the endianness of the machine and gives wrong results, because 0 bits are appended rather than prepended.
Does anyone know a way to do this without making any assumptions about the length or endianness of an int?

Something like this. Untested (until next edit). For Python 2.x. Assumes n > 0.
tmp = []
while n:
    n, d = divmod(n, 256)
    tmp.append(chr(d))
result = ''.join(tmp[::-1])
Edit: tested.
If you don't read manuals but like bitbashing, instead of the divmod caper, try this:
d = n & 0xFF; n >>= 8
Edit 2: If your numbers are relatively small, the following may be faster:
result = ''
while n:
    result = chr(n & 0xFF) + result
    n >>= 8
Edit 3: The second method doesn't assume that the int is already big-endian. Here's what happens in a notoriously little-endian environment:
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> n = 65539
>>> result = ''
>>> while n:
...     result = chr(n & 0xFF) + result
...     n >>= 8
...
>>> result
'\x01\x00\x03'
>>> import sys; sys.byteorder
'little'
>>>

A solution using struct and itertools:
>>> import itertools, struct
>>> "".join(itertools.dropwhile(lambda c: not(ord(c)), struct.pack(">i", 456))) or chr(0)
'\x01\xc8'
We can drop itertools by using a simple string strip:
>>> struct.pack(">i", 456).lstrip(chr(0)) or chr(0)
'\x01\xc8'
Or even drop struct using a recursive function:
def to_bytes(n):
    return ([chr(n & 255)] + to_bytes(n >> 8) if n > 0 else [])
"".join(reversed(to_bytes(456))) or chr(0)

If you're using Python 2.7 or later then you can use the bit_length method to round the length up to the next byte:
>>> i = 456
>>> bitstring.BitString(uint=i, length=(i.bit_length()+7)/8*8).bytes
'\x01\xc8'
otherwise you can just test for whole-byteness and pad with a zero nibble at the start if needed:
>>> s = bitstring.BitString(hex=hex(i))
>>> ('0x0' + s if s.len%8 else s).bytes
'\x01\xc8'
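On Python 3.2 and later, where int.to_bytes is available, the same minimum-length idea can be sketched without bitstring. The helper name int_to_min_bytes is hypothetical, and the sketch assumes a non-negative input and that zero should encode as a single zero byte:

```python
def int_to_min_bytes(n):
    """Encode a non-negative int as big-endian bytes using as few bytes as possible."""
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    # Round the bit length up to whole bytes; use one byte for zero (assumption).
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, byteorder="big")


print(int_to_min_bytes(456))    # b'\x01\xc8'
print(int_to_min_bytes(65539))  # b'\x01\x00\x03'
```

Because the byte order is passed explicitly, the result does not depend on sys.byteorder.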

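I reformulated John Machin's second answer as a one-liner for use on my server:
def bytestring(n):
    # (bit_length - 1) // 8 is the index of the most significant byte;
    # plain bit_length / 8 emits a spurious leading zero byte for values such as 255.
    return ''.join([chr((n >> (i * 8)) & 0xFF) for i in range(max(n.bit_length() - 1, 0) // 8, -1, -1)])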
I have found that the second method, using bit-shifting, was faster for both large and small numbers, and not just small numbers.

Related

Simulating a C cast in Python [duplicate]

Let's say I have this number i = -6884376.
How do I refer to it as to an unsigned variable?
Something like (unsigned long)i in C.

Assuming:
You have 2's-complement representations in mind; and,
By (unsigned long) you mean unsigned 32-bit integer,
then you just need to add 2**32 (or 1 << 32) to the negative value.
For example, apply this to -1:
>>> -1
-1
>>> _ + 2**32
4294967295L
>>> bin(_)
'0b11111111111111111111111111111111'
Assumption #1 means you want -1 to be viewed as a solid string of 1 bits, and assumption #2 means you want 32 of them.
Nobody but you can say what your hidden assumptions are, though. If, for example, you have 1's-complement representations in mind, then you need to apply the ~ prefix operator instead. Python integers work hard to give the illusion of using an infinitely wide 2's complement representation (like regular 2's complement, but with an infinite number of "sign bits").
And to duplicate what the platform C compiler does, you can use the ctypes module:
>>> import ctypes
>>> ctypes.c_ulong(-1) # stuff Python's -1 into a C unsigned long
c_ulong(4294967295L)
>>> _.value
4294967295L
C's unsigned long happens to be 4 bytes on the box that ran this sample.

To get the value equivalent to your C cast, just bitwise and with the appropriate mask. e.g. if unsigned long is 32 bit:
>>> i = -6884376
>>> i & 0xffffffff
4288082920
or if it is 64 bit:
>>> i & 0xffffffffffffffff
18446744073702667240
Do be aware though that although that gives you the value you would have in C, it is still a signed value, so any subsequent calculations may give a negative result and you'll have to continue to apply the mask to simulate a 32 or 64 bit calculation.
This works because although Python looks like it stores all numbers as sign and magnitude, the bitwise operations are defined as working on two's complement values. C stores integers in twos complement but with a fixed number of bits. Python bitwise operators act on twos complement values but as though they had an infinite number of bits: for positive numbers they extend leftwards to infinity with zeros, but negative numbers extend left with ones. The & operator will change that leftward string of ones into zeros and leave you with just the bits that would have fit into the C value.
Displaying the values in hex may make this clearer (I rewrote the string of f's as an expression to show that we are interested in either 32 or 64 bits):
>>> hex(i)
'-0x690c18'
>>> hex(i & ((1 << 32) - 1))
'0xff96f3e8'
>>> hex(i & ((1 << 64) - 1))
'0xffffffffff96f3e8L'
For a 32 bit value in C, positive numbers go up to 2147483647 (0x7fffffff), and negative numbers have the top bit set going from -1 (0xffffffff) down to -2147483648 (0x80000000). For values that fit entirely in the mask, we can reverse the process in Python by using a smaller mask to remove the sign bit and then subtracting the sign bit:
>>> u = i & ((1 << 32) - 1)
>>> (u & ((1 << 31) - 1)) - (u & (1 << 31))
-6884376
Or for the 64 bit version:
>>> u = 18446744073702667240
>>> (u & ((1 << 63) - 1)) - (u & (1 << 63))
-6884376
This inverse process will leave the value unchanged if the sign bit is 0, but obviously it isn't a true inverse because if you started with a value that wouldn't fit within the mask size then those bits are gone.
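The masking and its inverse described above can be wrapped in two small helpers. This is only a sketch: the names to_unsigned and to_signed and the bits parameter are hypothetical, and two's complement with a fixed width is assumed.

```python
def to_unsigned(value, bits=32):
    """Reinterpret value as an unsigned integer of the given width."""
    return value & ((1 << bits) - 1)


def to_signed(value, bits=32):
    """Reinterpret the low `bits` bits of value as a two's complement signed integer."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1          # discard anything above the width
    # Keep the magnitude bits, then subtract the weight of the sign bit if it is set.
    return (value & (sign_bit - 1)) - (value & sign_bit)


i = -6884376
u = to_unsigned(i)
print(u)             # 4288082920
print(to_signed(u))  # -6884376
```

As noted above, values that do not fit in the chosen width lose their upper bits, so the round trip is exact only for values that fit.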

Python doesn't have builtin unsigned types. You can use mathematical operations to compute a new int representing the value you would get in C, but there is no "unsigned value" of a Python int. The Python int is an abstraction of an integer value, not a direct access to a fixed-byte-size integer.

Since Python 3.2:
def unsignedToSigned(n, byte_count):
    return int.from_bytes(n.to_bytes(byte_count, 'little', signed=False), 'little', signed=True)

def signedToUnsigned(n, byte_count):
    return int.from_bytes(n.to_bytes(byte_count, 'little', signed=True), 'little', signed=False)
output :
In [3]: unsignedToSigned(5, 1)
Out[3]: 5
In [4]: signedToUnsigned(5, 1)
Out[4]: 5
In [5]: unsignedToSigned(0xFF, 1)
Out[5]: -1
In [6]: signedToUnsigned(0xFF, 1)
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 signedToUnsigned(0xFF, 1)
Input In [1], in signedToUnsigned(n, byte_count)
4 def signedToUnsigned(n, byte_count):
----> 5 return int.from_bytes(n.to_bytes(byte_count, 'little', signed=True), 'little', signed=False)
OverflowError: int too big to convert
In [7]: signedToUnsigned(-1, 1)
Out[7]: 255
Explanations : to/from_bytes convert to/from bytes, in 2's complement considering the number as one of size byte_count * 8 bits. In C/C++, chances are you should pass 4 or 8 as byte_count for respectively a 32 or 64 bit number (the int type).
I first pack the input number in the format it is supposed to be from (using the signed argument to control signed/unsigned), then unpack to the format we would like it to have been from. And you get the result.
Note the exception raised when trying to use fewer bytes than required to represent the number (In [6]). 0xFF is 255, which cannot be represented in a one-byte signed type such as C's signed char (-128 ≤ n ≤ 127). Raising an error is preferable to any other behavior.

You could use the struct Python built-in library:
Encode:
import struct
i = -6884376
print('{0:b}'.format(i))
packed = struct.pack('>l', i) # Packing a long number.
unpacked = struct.unpack('>L', packed)[0] # Unpacking a packed long number to unsigned long
print(unpacked)
print('{0:b}'.format(unpacked))
Out:
-11010010000110000011000
4288082920
11111111100101101111001111101000
Decode:
dec_pack = struct.pack('>L', unpacked) # Packing an unsigned long number.
dec_unpack = struct.unpack('>l', dec_pack)[0] # Unpacking a packed unsigned long number to long (revert action).
print(dec_unpack)
Out:
-6884376
[NOTE]:
> selects big-endian byte order.
l is long.
L is unsigned long.
With an explicit byte-order prefix such as >, struct uses standard sizes, so l and L are always 4 bytes, the same as i and I. You could therefore use i and I instead of l and L respectively.
[UPDATE]
According to hl037_'s comment, this approach works for int32 but not for int64 or int128, because I used the long format in struct.pack(). For int64, the code only needs to switch to the long long format character (q) in struct, as follows:
Encode:
i = 9223372036854775807 # the largest int64 number
packed = struct.pack('>q', i) # Packing an int64 number
unpacked = struct.unpack('>Q', packed)[0] # Unpacking signed to unsigned
print(unpacked)
print('{0:b}'.format(unpacked))
Out:
9223372036854775807
111111111111111111111111111111111111111111111111111111111111111
Decoding follows the same pattern. Keep in mind that q is a signed long long (8 bytes) and Q is an unsigned long long.
For int128 the situation is slightly different, as struct.pack() has no 16-byte format character. Therefore, you should split your number into two 64-bit halves.
Here's how it should be:
i = 10000000000000000000000000000000000000 # an int128 number
print(len('{0:b}'.format(i)))
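max_int64 = 0xFFFFFFFFFFFFFFFF  # 64-bit mask
# The masked halves are non-negative and may exceed the signed range, so pack them as unsigned.
packed = struct.pack('>QQ', (i >> 64) & max_int64, i & max_int64)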
a, b = struct.unpack('>QQ', packed)
unpacked = (a << 64) | b
print(unpacked)
print('{0:b}'.format(unpacked))
Out:
123
10000000000000000000000000000000000000
111100001011110111000010000110101011101101001000110110110010000000011110100001101101010000000000000000000000000000000000000

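abs gives you the magnitude of a negative number, but note that this is not what a C unsigned cast produces (for -12 as a 32-bit value, the cast would give 4294967284):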
a=-12
b=abs(a)
print(b)
Output:
12

How to split bytes into bits [duplicate]

I am working with Python3.2. I need to take a hex stream as an input and parse it at bit-level. So I used
bytes.fromhex(input_str)
to convert the string to actual bytes. Now how do I convert these bytes to bits?

Another way to do this is by using the bitstring module:
>>> from bitstring import BitArray
>>> input_str = '0xff'
>>> c = BitArray(hex=input_str)
>>> c.bin
'0b11111111'
And if you need to strip the leading 0b:
>>> c.bin[2:]
'11111111'
The bitstring module isn't a requirement, as jcollado's answer shows, but it has lots of performant methods for turning input into bits and manipulating them. You might find this handy (or not), for example:
>>> c.uint
255
>>> c.invert()
>>> c.bin[2:]
'00000000'
etc.

What about something like this?
>>> bin(int('ff', base=16))
'0b11111111'
This will convert the hexadecimal string you have to an integer, and that integer to a string in which each character is 0 or 1 depending on the corresponding bit of the integer.
As pointed out by a comment, if you need to get rid of the 0b prefix, you can do it this way:
>>> bin(int('ff', base=16))[2:]
'11111111'
... or, if you are using Python 3.9 or newer:
>>> bin(int('ff', base=16)).removeprefix('0b')
'11111111'
Note: using lstrip("0b") here will lead to 0 integer being converted to an empty string. This is almost always not what you want to do.
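A short sketch of why lstrip misbehaves here: str.lstrip treats its argument as a set of characters to remove, not as a literal prefix. The variable names are only for illustration.

```python
zero = bin(0)                   # '0b0'

stripped = zero.lstrip("0b")    # removes every leading '0' and 'b' character
sliced = zero[2:]               # drops exactly the two prefix characters
formatted = format(0, "b")      # the 'b' format type never emits a prefix

print(repr(stripped))   # ''
print(repr(sliced))     # '0'
print(repr(formatted))  # '0'
```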

Operations are much faster when you work at the integer level. In particular, converting to a string as suggested here is really slow.
If you want bits 7 and 8 only (the two most significant bits of a byte), use e.g.
val = (byte >> 6) & 3
(that is: shift the byte 6 bits to the right, dropping them, then keep only the lowest two bits; 3 is the number with the lowest two bits set.)
These can easily be translated into simple CPU operations that are super fast.
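The shift-and-mask idea generalizes to any field inside an integer. The helper below is a hypothetical sketch (extract_bits, shift and width are names chosen here); shift counts from zero at the least significant bit.

```python
def extract_bits(value, shift, width):
    """Return `width` bits of value, starting `shift` bits above the least significant bit."""
    mask = (1 << width) - 1      # e.g. width=2 gives 0b11 == 3
    return (value >> shift) & mask


byte = 0b10110101
print(extract_bits(byte, 6, 2))  # 2, the top two bits '10'
print(extract_bits(byte, 0, 4))  # 5, the low nibble '0101'
```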

using python format string syntax
>>> mybyte = bytes.fromhex("0F") # create my byte using a hex string
>>> binary_string = "{:08b}".format(int(mybyte.hex(),16))
>>> print(binary_string)
00001111
The second line is where the magic happens. All byte objects have a .hex() function, which returns a hex string. Using this hex string, we convert it to an integer, telling the int() function that it's a base 16 string (because hex is base 16). Then we apply formatting to that integer so it displays as a binary string. The {:08b} is where the real magic happens. It is using the Format Specification Mini-Language format_spec. Specifically it's using the width and the type parts of the format_spec syntax. The 8 sets width to 8, which is how we get the nice 0000 padding, and the b sets the type to binary.
I prefer this method over the bin() method because using a format string gives a lot more flexibility.

I think simplest would be use numpy here. For example you can read a file as bytes and then expand it to bits easily like this:
import numpy
Bytes = numpy.fromfile(filename, dtype = "uint8")
Bits = numpy.unpackbits(Bytes)

input_str = "ABC"
[bin(byte) for byte in bytes(input_str, "utf-8")]
Will give:
['0b1000001', '0b1000010', '0b1000011']

Here is how to do it using format()
print("bin_signedDate : ", ''.join(format(x, '08b') for x in bytevector))
The 08b is important. It zero-pads each value to a width of 8 characters, so every byte is rendered as a full 8 bits. If you don't specify it, each converted byte will have a variable bit length.
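To see the effect of the padding, compare the 'b' and '08b' format specs on the same input. This is a minimal Python 3 sketch; bytes_to_bits is a hypothetical helper name.

```python
def bytes_to_bits(data):
    """Render each byte as exactly 8 binary digits, most significant bit first."""
    return "".join(format(b, "08b") for b in data)


data = bytes([1, 255])

padded = bytes_to_bits(data)
unpadded = "".join(format(b, "b") for b in data)   # no width: leading zeros are lost

print(padded)    # 0000000111111111
print(unpadded)  # 111111111
```

Without the fixed width, the byte boundaries can no longer be recovered from the output.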

To binary:
bin(byte)[2:].zfill(8)

Use ord when reading bytes:
byte_binary = bin(ord(f.read(1))) # Add [2:] to remove the "0b" prefix
Or
Using str.format():
'{:08b}'.format(ord(f.read(1)))

The other answers here provide the bits in big-endian order ('\x01' becomes '00000001')
In case you're interested in little-endian order of bits, which is useful in many cases, like common representations of bignums etc -
here's a snippet for that:
def bits_little_endian_from_bytes(s):
    return ''.join(bin(ord(x))[2:].rjust(8,'0')[::-1] for x in s)
And for the other direction:
def bytes_from_bits_little_endian(s):
    return ''.join(chr(int(s[i:i+8][::-1], 2)) for i in range(0, len(s), 8))
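The two functions above use ord and chr, which suits Python 2 byte strings. On Python 3, where iterating over bytes already yields integers, an equivalent pair might look like the following sketch (the function names are hypothetical):

```python
def bits_le_from_bytes(data):
    """Bit string with the least significant bit of each byte first."""
    return "".join(format(b, "08b")[::-1] for b in data)


def bytes_from_bits_le(bits):
    """Inverse of bits_le_from_bytes; len(bits) is assumed to be a multiple of 8."""
    return bytes(int(bits[i:i + 8][::-1], 2) for i in range(0, len(bits), 8))


print(bits_le_from_bytes(b"\x01"))     # 10000000
print(bytes_from_bits_le("10000000"))  # b'\x01'
```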

One-line function to convert bytes (not a string) to a list of bits. There is no endianness issue when the data goes from one byte reader/writer to another; it arises only when the source and target are bit readers and bit writers.
def byte2bin(b):
    return [int(X) for X in "".join(["{:0>8}".format(bin(X)[2:]) for X in b])]

I came across this answer when looking for a way to convert an integer into a list of bit positions where the bitstring is equal to one. This becomes very similar to this question if you first convert your hex string to an integer like int('0x453', 16).
Now, given an integer - a representation already well-encoded in the hardware, I was very surprised to find out that the string variants of the above solutions using things like bin turn out to be faster than numpy based solutions for a single number, and I thought I'd quickly write up the results.
I wrote three variants of the function. First using numpy:
import math
import numpy as np
def bit_positions_numpy(val):
    """
    Given an integer value, return the positions of the on bits.
    """
    bit_length = val.bit_length() + 1
    length = math.ceil(bit_length / 8.0)  # bytelength
    bytestr = val.to_bytes(length, byteorder='big', signed=True)
    arr = np.frombuffer(bytestr, dtype=np.uint8, count=length)
    bit_arr = np.unpackbits(arr, bitorder='big')
    bit_positions = np.where(bit_arr[::-1])[0].tolist()
    return bit_positions
Then using string logic:
def bit_positions_str(val):
    is_negative = val < 0
    if is_negative:
        bit_length = val.bit_length() + 1
        length = math.ceil(bit_length / 8.0)  # bytelength
        neg_position = (length * 8) - 1
        # special logic for negatives to get the two's complement repr
        max_val = 1 << neg_position
        val_ = max_val + val
    else:
        val_ = val
    binary_string = '{:b}'.format(val_)[::-1]
    bit_positions = [pos for pos, char in enumerate(binary_string)
                     if char == '1']
    if is_negative:
        bit_positions.append(neg_position)
    return bit_positions
And finally, I added a third method where I precomputed a lookup table of the positions for a single byte and expanded that for larger item sizes.
BYTE_TO_POSITIONS = []
pos_masks = [(s, (1 << s)) for s in range(0, 8)]
for i in range(0, 256):
    positions = [pos for pos, mask in pos_masks if (mask & i)]
    BYTE_TO_POSITIONS.append(positions)

def bit_positions_lut(val):
    bit_length = val.bit_length() + 1
    length = math.ceil(bit_length / 8.0)  # bytelength
    bytestr = val.to_bytes(length, byteorder='big', signed=True)
    bit_positions = []
    for offset, b in enumerate(bytestr[::-1]):
        pos = BYTE_TO_POSITIONS[b]
        if offset == 0:
            bit_positions.extend(pos)
        else:
            pos_offset = (8 * offset)
            bit_positions.extend([p + pos_offset for p in pos])
    return bit_positions
The benchmark code is as follows:
def benchmark_bit_conversions():
    # for val in [-0, -1, -3, -4, -9999]:

    test_values = [
        # -1, -2, -3, -4, -8, -32, -290, -9999,
        # 0, 1, 2, 3, 4, 8, 32, 290, 9999,
        4324, 1028, 1024, 3000, -100000,
        999999999999,
        -999999999999,
        2 ** 32,
        2 ** 64,
        2 ** 128,
        2 ** 128,
    ]

    for val in test_values:
        r1 = bit_positions_str(val)
        r2 = bit_positions_numpy(val)
        r3 = bit_positions_lut(val)
        print(f'val={val}')
        print(f'r1={r1}')
        print(f'r2={r2}')
        print(f'r3={r3}')
        print('---')
        assert r1 == r2

    import xdev
    xdev.profile_now(bit_positions_numpy)(val)
    xdev.profile_now(bit_positions_str)(val)
    xdev.profile_now(bit_positions_lut)(val)

    import timerit
    ti = timerit.Timerit(10000, bestof=10, verbose=2)
    for timer in ti.reset('str'):
        for val in test_values:
            bit_positions_str(val)

    for timer in ti.reset('numpy'):
        for val in test_values:
            bit_positions_numpy(val)

    for timer in ti.reset('lut'):
        for val in test_values:
            bit_positions_lut(val)

    for timer in ti.reset('raw_bin'):
        for val in test_values:
            bin(val)

    for timer in ti.reset('raw_bytes'):
        for val in test_values:
            val.to_bytes(val.bit_length(), 'big', signed=True)
And it clearly shows the str and lookup table implementations are ahead of numpy. I tested this on CPython 3.10 and 3.11.
Timed str for: 10000 loops, best of 10
time per loop: best=20.488 µs, mean=21.438 ± 0.4 µs
Timed numpy for: 10000 loops, best of 10
time per loop: best=25.754 µs, mean=28.509 ± 5.2 µs
Timed lut for: 10000 loops, best of 10
time per loop: best=19.420 µs, mean=21.305 ± 3.8 µs

Converting from Float to Int changed the output

I defined a simple sqrt function to calculate a square root of a given number.
def sqrt(n):
    low = 0
    up = n
    for i in range(50):
        mid = float(low+up)/2
        if pow(mid,2) < n:
            low = mid
        elif pow(mid,2) > n:
            up = mid
        else:
            break
    return mid
When I do:
print(sqrt(9))
I get 3.0 as the output. However when I do:
print(int(sqrt(9))) I get 2.
Could somebody please help me understand why this is happening?
Thank you.

Actually, sqrt(9) returns 2.9999999999999973 [1], at least in Python version 3.5 (and perhaps all versions 3.x?). And int returns only the integer component of the number. That's why you get 2 as the result, as the integer component of 2.9999999999999973 is indeed 2.
If you want to round the result to the nearest integer, you can do round(sqrt(9)) which produces 3.
[1] To understand why this happens and how to fix this issue, you can refer to this post.
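The difference between truncating and rounding can be seen directly on the value reported above. A minimal sketch (the variable names are only for illustration):

```python
import math

x = 2.9999999999999973   # the value the bisection returns for sqrt(9), per the answer above

truncated = int(x)       # int() discards the fractional part (truncates toward zero)
nearest = round(x)       # round() picks the nearest integer
floored = math.floor(x)  # floor() rounds toward negative infinity

print(truncated)  # 2
print(nearest)    # 3
print(floored)    # 2
```

For negative inputs int and math.floor differ: int(-2.5) is -2, while math.floor(-2.5) is -3.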

It is because of rounding. Your method definition actually computes the sqrt(9) to be 2.9999999999999973. However the rounding is not applied to casting, therefore the decimal is lost in the casting of int(2.9999999999999973).
When you print directly, python applies rounding which brings this value up to 3.
Python 2.7 output:
Python 2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> def sqrt(n):
...     low = 0
...     up = n
...     for i in range(50):
...         mid = float(low+up)/2
...         if pow(mid,2) < n:
...             low = mid
...         elif pow(mid,2) > n:
...             up = mid
...         else:
...             break
...     return mid
...
>>> sqrt(9)
2.9999999999999973
>>> print(sqrt(9))
3.0
>>> print(int(sqrt(9)))
2
>>>

If you want the result converted to an int, you can use the following:
print(int(round(sqrt(9))))

Extracting nth bit in hexadecimal value

Let's say I have a string representing a hexadecimal value such as "0x4", represented in binary as 0100. If I want to test whether the nth bit is set to 1, counting from the least significant bit (meaning in this example that only the 3rd bit is 1), how can I do this in the most elegant way?
I doubt the way I am doing it is very elegant or efficient.
bits = '{0:08b}'.format(int("0x4", 16))
and then check if str(bits[-3]) is "1"
bits = '{0:08b}'.format(int("0x4", 16))
if str(bits[-3]) == "1":
    print "Bit is set to 1"
I'd like a neat way of doing this, e.g. using bitwise operators or shifting.

To test if a bit is set, use the bitwise & on a bit mask of just that bit:
>>> bool(12 & 0b0100)
True
To get a bit mask set at the n-th position, bit-shift a 1 by n-1 positions:
>>> n = 3
>>> 1 << n-1
4
>>> bin(1 << n-1)
'0b100'
Combined, you can directly check whether a specific bit is set:
>>> def bitset(number, n):
...     """Test whether ``number`` has the ``n``'th bit set"""
...     return bool(number & 1 << n - 1)
...
>>> bitset(0x4, 3)
If your input is a string (instead of generating a string from an integer), use int to convert it:
>>> bitset(int('0x4', 16), 3)

Usually you'd use bitwise operators for that:
if 4 & 1 << n:
    print('Bit', n, 'is set')
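Note that in the snippet above n counts from zero (1 << 0 tests the least significant bit), whereas the question counts from one. A small sketch that makes the one-based convention explicit; is_bit_set is a hypothetical name:

```python
def is_bit_set(value, n):
    """Return True if the n-th bit of value is set, with n=1 the least significant bit."""
    if n < 1:
        raise ValueError("n is 1-based in this sketch")
    return bool((value >> (n - 1)) & 1)


number = int("0x4", 16)
print(is_bit_set(number, 3))  # True
print(is_bit_set(number, 1))  # False
```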

You could use the bin() function to convert an integer to a binary string:
>>> n = int('0x4', 16)
>>> n
4
>>> bin(n)
'0b100'
>>> bin(n)[-3] == '1'
True
A more efficient way would be to operate on the integer directly using bitwise operators:
>>> bool(n & 1<<2) # shift "1" 2 bits to the left and use bitwise and
True

you can use:
def check_nth_bit(str_hex, nth_bit):
    return (int(str_hex, 16) & 2 **(nth_bit - 1)) >> (nth_bit - 1) == 1
print(check_nth_bit('0x4', 3))
print(check_nth_bit('0x4', 1))
output:
True
False

Truncate integers when more than 64 bits

I'm trying to perform some 64 bit additions, ie:
a = 0x15151515
b = 0xFFFFFFFF
c = a + b
print hex(c)
My problem is that the above outputs:
0x115151514
I would like the addition to be 64 bit and disregard the overflow, ie expected output would be:
0x15151514
NB: I'm not looking to truncate the string output, I would like c = 0x15151514. I'm trying to simulate some 64 bit register operations.

Then just use the bitwise AND operator &
c = 0xFFFFFFFF & (a+b)
By the way, these are 32 bit values, not 64 bit values (count the F; every two F is one byte == 8 bit; it's eight F, so four byte, so 32 bit).
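For register-style arithmetic the mask can be derived from the width instead of being written out by hand. The following is a sketch only; wrapping_add and its bits parameter are hypothetical names.

```python
def wrapping_add(a, b, bits=64):
    """Add two integers and keep only the low `bits` bits, like a fixed-width register."""
    mask = (1 << bits) - 1
    return (a + b) & mask


a = 0x15151515
b = 0xFFFFFFFF

print(hex(wrapping_add(a, b, bits=32)))  # 0x15151514
print(hex(wrapping_add(a, b, bits=64)))  # 0x115151514, no overflow at 64 bits
```

With 64-bit registers the example values do not overflow at all, which is why the expected 0x15151514 only appears with a 32-bit mask.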

Another solution using numpy:
import numpy as np
a = np.array([0x15151515], dtype=np.uint32) # use np.uint64 for 64 bits operations
b = np.array([0xFFFFFFFF], dtype=np.uint32)
c = a + b
print(c, c.dtype)
[353703188] uint32
pros: more readable than a binary mask when there are many operations, especially if other operations such as division are involved, in which case you cannot just apply the mask to the final result but also need to apply it to intermediate results (e.g. (0xFFFFFFFF + 1) // 2)
cons: adds a dependency, and requires care with literals:
c = a + 2**32 # 2**32 does not fit in np.uint32 so numpy changes the type of c
print(c, c.dtype)
[4648670485] uint64
