Bit order in Python's struct.pack

When packing bytes with Python's struct.pack, I was surprised that although my byte order is little-endian, my bit order appears to be big-endian. My most significant bytes appear on the right side in the output below, but the most significant bits of each byte appear on the left. (I'm using BitArray from bitstring to display the bits.)
In[23]: BitArray(struct.pack('B', 1)).bin
Out[23]:'00000001'
In[24]: BitArray(struct.pack('H', 1)).bin
Out[24]:'0000000100000000'
In[25]: sys.byteorder
Out[25]:'little'
This surprises me because I read here that "Bit order usually follows the same endianness as the byte order for a given computer system. That is, in a big endian system the most significant bit is stored at the lowest bit address; in a little endian system, the least significant bit is stored at the lowest bit address."
Am I interpreting it correctly that my bit order is the reverse of my byte order here?
Also, I know you can change the byte order using the > and <, but I guess there is no way to change the bit order?
Edit: For context, right now I'm writing a Python implementation of TCP communication with an ATI NetFT sensor based on the protocol description starting on page B - 76 here. But this same question comes up frequently in my work implementing serial and network communications with all sorts of sensors. In this case, the protocol description says things like: set bit 2 of byte 16 to 1 to bias the sensor, and I've been finding that bit 0 in Python does not correspond to the bit 0 that controls the bias -- the bit order in the byte seems to be flipped.

No, Python supplies no way to reverse the bit order - but you don't need to. The article made you overly paranoid ;-)
The endianness of byte order is normally invisible to software. If, e.g., you read a 2-byte short in C, the hardware delivers the value regardless of the physical storage convention: store 258 (0x0102) and you read 258 back, whatever the storage's physical byte order. The only way you can tell the difference is to read (or write) part of an N-byte value in a chunk of fewer than N bytes. That's common enough in network protocols and portable storage formats, but rare outside those.
Similarly, the only way you could detect the endianness of physical bit order is if the machine were bit-addressable, so you could read one bit at a time directly. I don't know of any current machine that supports bit addressing, and even if there were such a beast, C supports no direct bit-level access anyway. If you read a byte at a time, the hardware will deliver each byte's value, again regardless of the physical bit storage order.
If, e.g., you're poking a bit at a time into a bit-level serial port, then you'll need to know the convention the specific hardware requires. But in that case struct.pack() is useless anyway - the smallest unit struct.pack() manipulates is a byte, and at that level hardware bit-level ordering is invisible. For example, your struct.pack('B', 1) will unpack as 1 regardless of the bit-level endianness of the machine you run it on.
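To the question about > and <: those prefixes change only the byte order of the packed result, and a value always round-trips unchanged, which is the sense in which the storage convention stays invisible. A quick demonstration:

```python
import struct

# The format prefix controls byte order; bit order within a byte
# is not something struct exposes (or needs to).
little = struct.pack('<H', 0x0102)   # least significant byte first
big = struct.pack('>H', 0x0102)      # most significant byte first
assert little == b'\x02\x01'
assert big == b'\x01\x02'

# Unpacking with the matching prefix recovers the same value either way:
assert struct.unpack('<H', little)[0] == 0x0102
assert struct.unpack('>H', big)[0] == 0x0102
```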
Bits of Code
Since "general principles" don't seem to be enough here, and there was no specific code presented to work with, here are bits of code that may be useful.
As mentioned in a comment, if you want to reverse a byte's bit order, the simplest and fastest way is to precompute a list with 256 items, mapping a byte to its bit-reversed value:
br = [int("{:08b}".format(i)[::-1], 2) for i in range(256)]
assert sorted(br) == list(range(256))
Then, e.g.,
>>> br[0], br[1], br[2], br[254], br[255]
(0, 128, 64, 127, 255)
If you're working with bytes objects, the .translate() method can use this table (after converting it to a bytes object) to convert the whole object with one call:
reverse_table = bytes(br)
and then, e.g.,
>>> original = bytes([0, 1, 2, 3, 254, 255])
>>> print([i for i in original.translate(reverse_table)])
[0, 128, 64, 192, 127, 255]
If instead you're building bytes a bit at a time (as in "set bit 2 of byte 16 to 1"), you can build them in "reversed order" (when appropriate) from the start. To build a byte in LSB 0 order, "setting bit i" means
byte |= 1 << i
To build a byte in MSB 0 order instead, it's
byte |= 1 << (7-i)
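For instance, here is a minimal sketch of the two conventions applied to an instruction like "set bit 2 of byte 16"; which convention a given datasheet uses (LSB 0 vs. MSB 0) is exactly what you have to check against its documentation:

```python
def set_bit_lsb0(byte, i):
    """Set bit i, counting from the least significant bit (LSB 0)."""
    return byte | (1 << i)

def set_bit_msb0(byte, i):
    """Set bit i, counting from the most significant bit (MSB 0)."""
    return byte | (1 << (7 - i))

# "Set bit 2" names two different bits under the two conventions:
assert set_bit_lsb0(0, 2) == 0b00000100   # 4
assert set_bit_msb0(0, 2) == 0b00100000   # 32
```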
But without knowing the precise details of the API(s) you're using, and how you like to work, it's really not possible to guess at the precise code you need.

Related

Converting 8-bit and 7-bit values in Python 3.x

I'm trying to convert unsigned integer values to ones represented as 4 7-bit bytes; the goal is to send data using Roland's address-mapped MIDI System Exclusive protocol, which represents address and size values like so:
MSB LSB
--------- --------- --------- ---------
0aaa aaaa 0bbb bbbb 0ccc cccc 0ddd dddd
I unfortunately don't really know where to begin doing this; the goal is to do this in Python 3.x, which I'm prototyping in at the moment. What I'm really having trouble with is the math and bit manipulations, but oddly I can't really even find any general algorithms or rules of thumb for doing this. The closest I found was this discussion on solutions in Perl from around a decade ago, but I'm having a bit of trouble deciphering the Perl too. Other than that, I've only seen a couple C++ questions with answers recommending using bitsets.
For a specific usage example, say I want to send 128 bytes of data. This requires me to send a 4-byte size value using only the lower 7 bits of each byte. However, this value would normally be 0x00000080, where the upper bit in the LSB is 1, requiring conversion.
Sorry if this is confusing, and I may be way off base here, but can anyone point me in the right direction? I'm sure someone has done this before, since it seems like it would come up regularly in MIDI programming.
Variable-length quantities (as in the linked question) use a different encoding.
Anyway, to split a value into 7-bit bytes, extract the lowest 7 bits in each step, and then shift the remaining value right by 7 bits so that the next portion is in the right position in the next step:
def encode_as_4x7(value):
    result = []
    for _ in range(4):
        result = [value & 0x7f] + result
        value >>= 7
    return result
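As a quick check against the question's example (a size of 128, i.e. 0x00000080, whose high bit must carry into the next 7-bit group), here is the function again together with a hypothetical inverse; decode_from_4x7 is my own name, not part of any MIDI library:

```python
def encode_as_4x7(value):
    # (same function as above, repeated so this snippet runs standalone)
    result = []
    for _ in range(4):
        result = [value & 0x7f] + result
        value >>= 7
    return result

def decode_from_4x7(groups):
    """Hypothetical inverse: reassemble a value from 7-bit groups."""
    value = 0
    for g in groups:
        value = (value << 7) | (g & 0x7f)
    return value

# 128 = 0b1_0000000: the set bit moves into the second-lowest group.
assert encode_as_4x7(128) == [0x00, 0x00, 0x01, 0x00]
assert decode_from_4x7(encode_as_4x7(128)) == 128
```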

Passing a sequence of bits to a file python

As a part of a bigger project, I want to save a sequence of bits in a file so that the file is as small as possible. I'm not talking about compression, I want to save the sequence as it is but using the least amount of characters. The initial idea was to turn mini-sequences of 8 bits into chars using ASCII encoding and saving those chars, but due to some unknown problem with strange characters, the characters retrieved when reading the file are not the same that were originally written. I've tried opening the file with utf-8 encoding, latin-1 but none seems to work. I'm wondering if there's any other way, maybe by turning the sequence into a hexadecimal number?
Technically you cannot write less than a byte, because the OS organizes files in bytes (write individual bits to a file in python). So this is binary file I/O; see https://docs.python.org/2/library/io.html, and there are modules like struct.
Open the file with the 'b' flag, which indicates binary read/write, then use e.g. the int.to_bytes() method (Writing bits to a binary file) or struct.pack() (How to write individual bits to a text file in python?):
with open('somefile.bin', 'wb') as f:
    import struct
    f.write(struct.pack("h", 824))  # writes b'8\x03'
>>> bits = "10111111111111111011110"
>>> int(bits[::-1], 2).to_bytes(4, 'little')
b'\xfd\xff=\x00'
If you want to get around the 8-bit (byte) structure of memory, you can use bit-manipulation techniques like bitmasks and BitArrays;
see https://wiki.python.org/moin/BitManipulation and https://wiki.python.org/moin/BitArrays
However, the problem is, as you said, reading the data back if you use bit fields of differing lengths: to store a decimal 7 you need 3 bits (0b111), while to store a decimal 2 you need 2 bits (0b10). Now the problem is reading this back.
How can your program know whether to read the next value back as a 3-bit value or a 2-bit value? Packed back to back, the sequence 7, 2 looks like 11110, which splits as 111|10 -- so how can your program know where the | is?
In normal byte-organized storage the sequence 7, 2 is 0000011100000010 -> 00000111|00000010, which has the advantage that it is clear where the | is.
This is why memory at its lowest level is organized in fixed clusters of 8 bits = 1 byte. If you want to access single bits inside a byte, you can use bitmasks in combination with logical operators (http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/); in Python the easiest tools for single-bit manipulation are the ordinary bitwise operators (&, |, ^, <<, >>).
If you know that your values are all, say, 6 bits, packing them tightly may be worth the effort, but it is still fiddly...
(How do you set, clear, and toggle a single bit?)
(Why can't you do bitwise operations on pointer in C, and is there a way around this?)
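Since the sticking point is recovering the exact bit sequence from whole bytes, one workable sketch stores the bit count up front; the 4-byte length prefix here is my own convention, not part of any standard:

```python
def bits_to_bytes(bits):
    """Pack a '0'/'1' string into bytes, MSB-first, zero-padded at the end,
    prefixed with a 4-byte bit count so the padding can be stripped later."""
    padded = bits + '0' * (-len(bits) % 8)
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return len(bits).to_bytes(4, 'big') + body

def bytes_to_bits(data):
    """Inverse: recover the original bit string using the stored count."""
    nbits = int.from_bytes(data[:4], 'big')
    bits = ''.join(format(b, '08b') for b in data[4:])
    return bits[:nbits]

seq = "10111111111111111011110"   # 23 bits -> 3 data bytes + 4 prefix bytes
assert bytes_to_bits(bits_to_bytes(seq)) == seq
```

The round-trip works for any bit count; only the fixed per-file prefix is overhead.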

LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So... how do I compress a file that contains these bytes, if I cannot compress them but also cannot replace them with a different byte (because decoders wouldn't be able to read it)? Is there a way? Or do decoders only decode when there is an exact <d, l> string? (If so, imagine that by coincidence we find these bytes in a file. What would happen?)
Thanks!
LZ77 is about referencing strings back in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references; many implementations of LZ77 do it in different ways.
You are right, though, that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from the already-decompressed portion).
One way to do it is to reserve some characters as "special" (so-called "escape sequences"). You can do it the way you described, that is, by using < to mark the start of a back-reference. But then you also need a way to output < when it is a literal. You can do that, for example, by establishing that a < followed by another < means a literal, and you just output one <. Or you can establish that <> with nothing in between is not a back-reference, and output a single <.
This wouldn't be the most efficient way to encode back-references, though, because it uses several bytes per back-reference, so it only pays off for referencing strings longer than those several bytes. For shorter matches it will inflate the data instead of compressing it, unless you establish that matches shorter than several bytes are left as is instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII texts, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, and then use the remaining 7 bits as length, and the very next byte (or two) as back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as length, and read up the next 2 bytes to use it as distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences of more than 3 characters long.
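As an illustration, here is a sketch of a decoder for that high-bit scheme; the choice of big-endian for the 2-byte distance is my own assumption, and a real format would have to pin it down:

```python
def decode(stream):
    """Decode a stream of tokens: a byte with high bit 0 is a literal
    ASCII character; high bit 1 means a 3-byte back-reference, with the
    low 7 bits as length and the next 2 bytes as distance (big-endian)."""
    out = bytearray()
    i = 0
    while i < len(stream):
        b = stream[i]
        if b & 0x80:                         # back-reference token
            length = b & 0x7f
            distance = int.from_bytes(stream[i + 1:i + 3], 'big')
            for _ in range(length):          # byte-at-a-time copy allows
                out.append(out[-distance])   # overlapping matches
            i += 3
        else:                                # literal ASCII character
            out.append(b)
            i += 1
    return bytes(out)

# "abcabcabc" as literals "abc" plus a back-reference: length 6, distance 3.
token = bytes([0x80 | 6]) + (3).to_bytes(2, 'big')
assert decode(b'abc' + token) == b'abcabcabc'
```

Note the overlapping copy: the reference of length 6 at distance 3 reads bytes it has itself just written, which is how LZ77 encodes runs.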
But there's an even better way to do this, which gives more compression still: you can replace your characters with bit codes of variable lengths, crafted so that characters appearing more often get the shortest codes and rare characters get longer codes. For this to be decodable, the codes have to be so-called "prefix codes": no code may be a prefix of any other code. With this property you can always distinguish them by reading bits in sequence until you decode one of them; you can then be sure that reading more bits won't yield some other valid item, and the next bit always starts a new code.
To produce such codes, you use Huffman trees. You can join all your byte values and the different back-reference lengths into one such tree and generate distinct bit codes for them depending on their frequency. When decoding, you read bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or of a back-reference length. In the latter case, you then read some additional bits for the back-reference's distance (also encoded with a prefix code).
This is what the DEFLATE compression scheme does. But that is a whole other story, and you will find the details in the RFC mentioned by @MarkAdler.
If I understand your question correctly, it makes no sense. There are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.

Is there any way to add two bytes with overflow in python?

I am using pySerial to read in data from an attached device. I want to calculate the checksum of each received packet. The packet is read in as a char array, with the actual checksum being the very last byte, at the end of the packet. To calculate the checksum, I would normally sum over the packet payload, and then compare it to the actual checksum.
Normally in a language like C, we would expect overflow, because the checksum itself is only one byte. I'm not sure about the internals of Python, but from my experience with the language it looks like it will default to a larger-size variable (maybe some internal bigint class or something). Is there any way to mimic the expected behavior of adding two chars without writing my own implementation? Thanks.
Sure, just take the modulus of your result to fit it back in the size you want. You can do the modulus at the end or at every step. For example:
>>> payload = [100, 101, 102, 103, 104] # arbitrary sequence of bytes
>>> sum(payload) % 256 # modulo 256 to make the answer fit in a single byte
254 # this would be your checksum
To improve upon the earlier example, just bitwise-AND it with 0xFF. I'm not sure whether Python performs this optimization by default or not.
sum(bytes) & 0xFF
Summing the bytes and then taking the modulus, as in sum(bytes) % 256 (or sum(bytes) & 0xFF), is (in many programming languages) vulnerable to integer overflow, since there is a finite maximum value that integer types can represent.
But, since we are talking about Python, this is not technically an issue: Python integers are arbitrary-precision, so an integer overflow can't occur.
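A quick demonstration of both points: the sum never wraps, and the % and & spellings agree for any Python int, since bitwise operators on negative ints act on an infinite two's-complement view:

```python
# A checksum over an arbitrarily large payload never wraps in Python,
# so reducing once at the end is safe:
payload = [100, 101, 102, 103, 104] * 1000   # a "large" packet
total = sum(payload)
checksum = total % 256

# The modulo and mask spellings agree for any Python int:
for x in (total, 0, 1, 255, 256, -1, -257):
    assert x % 256 == x & 0xFF
```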
If you want to perform the modulus operation on an element-by-element basis, you can use functools.reduce():
>>> payload = [100, 101, 102, 103, 104] # arbitrary sequence of bytes
# (Python 3 uses functools.reduce() instead of builtin reduce() function)
>>> import functools
>>> functools.reduce(lambda x,y: (x+y)%256, payload)
254

Binary data with pyserial(python serial port)

serial.write() method in pyserial seems to only send string data. I have arrays like [0xc0,0x04,0x00] and want to be able to send/receive them via the serial port? Are there any separate methods for raw I/O?
I think I might need to change the arrays to ['\xc0','\x04','\x00'], still, null character might pose a problem.
An alternative method, without using the array module:
def a2s(arr):
    """Array of integer byte values --> binary string"""
    return ''.join(chr(b) for b in arr)
You need to convert your data to a string
"\xc0\x04\x00"
Null characters are not a problem in Python -- strings are not null-terminated; the zero byte behaves just like any other byte "\x00".
One way to do this:
>>> import array
>>> array.array('B', [0xc0, 0x04, 0x00]).tostring()
'\xc0\x04\x00'
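On Python 3 (and pyserial 3.x, where write() takes bytes) the conversion is more direct; ser below stands for a previously opened serial.Serial object:

```python
data = [0xc0, 0x04, 0x00]
payload = bytes(data)             # b'\xc0\x04\x00'
assert payload == b'\xc0\x04\x00'
assert len(payload) == 3          # the embedded 0x00 byte is preserved
# ser.write(payload)              # pyserial accepts bytes directly
```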
I faced a similar (but arguably worse) issue, having to send control bits through a UART from a python script to test an embedded device. My data definition was "field1: 8 bits , field2: 3 bits, field3 7 bits", etc. It turns out you can build a robust and clean interface for this using the BitArray library. Here's a snippet (minus the serial set-up)
from bitstring import BitArray
cmdbuf = BitArray(length = 50) # 50-bit BitArray, initialized to zeros
cmdbuf.overwrite('0xAA', 0) # Init the marker byte at the head
Here's where it gets flexible. The command below replaces the 4 bits at
bit position 23 with the 4 bits passed. Note that it took a binary
bit value, given in string form. I can set/clear any bits at any location
in the buffer this way, without having to worry about stepping on
values in adjacent bytes or bits.
cmdbuf.overwrite('0b0110', 23)
# To send on the (previously opened) serial port
ser.write( cmdbuf )
