I'm trying to convert unsigned integer values to ones represented as 4 7-bit bytes; the goal is to send data using Roland's address-mapped MIDI System Exclusive protocol, which represents address and size values like so:
MSB                           LSB
--------- --------- --------- ---------
0aaa aaaa 0bbb bbbb 0ccc cccc 0ddd dddd
I unfortunately don't really know where to begin doing this; the goal is to do this in Python 3.x, which I'm prototyping in at the moment. What I'm really having trouble with is the math and bit manipulations, but oddly I can't really even find any general algorithms or rules of thumb for doing this. The closest I found was this discussion on solutions in Perl from around a decade ago, but I'm having a bit of trouble deciphering the Perl too. Other than that, I've only seen a couple C++ questions with answers recommending using bitsets.
For a specific usage example, say I want to send 128 bytes of data. This requires me to send a 4-byte size value using only the lower 7 bits of each byte. However, this value would normally be 0x00000080, where the upper bit in the LSB is 1, requiring conversion.
Sorry if this is confusing, and I may be way off base here, but can anyone point me in the right direction? I'm sure someone has done this before, since it seems like it would come up regularly in MIDI programming.
Variable-length quantities (as in the linked question) use a different encoding.
Anyway, to split a value into 7-bit bytes, extract the lowest 7 bits in each step, and then shift the remaining value right by 7 bits so that the next portion is in the right position in the next step:
def encode_as_4x7(value):
    result = []
    for i in range(4):
        result = [value & 0x7f] + result
        value >>= 7
    return result
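To tie this back to the 128-byte example in the question, here is a minimal sketch of the same split written with explicit shifts, plus a decoder for checking the round trip. The range check and the `decode_4x7` helper are additions for illustration and are not part of the answer above.

```python
def encode_as_4x7(value):
    """Split an integer into four 7-bit bytes, most significant first."""
    if not 0 <= value < 1 << 28:
        raise ValueError("value does not fit in 28 bits")
    return [(value >> shift) & 0x7F for shift in (21, 14, 7, 0)]


def decode_4x7(data):
    """Hypothetical inverse: rebuild the integer from four 7-bit bytes."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value


size = encode_as_4x7(128)
print([hex(b) for b in size])   # ['0x0', '0x0', '0x1', '0x0']
print(decode_4x7(size))         # 128
```

So a size of 0x00000080 would go out as the bytes 00 00 01 00, with the top bit of every byte clear.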
I want to convert a given hex string into base64 (in Python, without using any libraries). As I learned from other Stack Overflow answers, we can either group 3 hex digits (12 bits, 4 bits each) to get 2 base64 characters (12 bits, 6 bits each), or group 6 hex digits (24 bits) into 4 base64 characters (24 bits).
The standard procedure is to append the binary bits of all the hex digits together and start grouping from the left in packets of 6.
My question is regarding the situation we need padding for:
(Assuming we are converting 3 hex into 2 base64)
There will arise a situation when we are left with only 2 or 1 hex values to convert. Take the example below:
'a1' to base64
10100001 (binary of a1)
101000 01(0000) //making groups of 6 and adding additional 0's where required
This gives "oQ". One online converter gives this answer with padding (oQ==), while another gives something different (wqE=).
Q1. Which of the two sources is giving the correct answer? Why is the other one wrong, given that it is a good online converter?
Q2. How do we work out the number of '=' characters here? (We could have just added sufficient 0's wherever needed, as in the example above, ending up with just oQ rather than oQ==, assuming oQ== is correct.)
My understanding is that if the hex input has length 2 (rather than 3), we pad with a single = (which would agree with the answer wqE= in the case above), and if it has length 1 (rather than 3), we pad with two ='s.
At the same time, I am confused, because if 3 hex digits are converted into 2 base64 characters, we would never need two ='s.
'a' to base64
1010 (binary of a)
Q3. How do I convert hex 'a' to base64?
Base64 is defined by RFC 4648 as being "designed to represent arbitrary sequences of octets". An octet is a unit of 8 bits, in practice synonymous with a byte. When your input is in the form of a hex string, your first step should be to decode it into a byte string. You need two hex characters for each byte. If the length of the input is odd, the reasonable course of action is to raise an error.
To address your numbered questions:
Q1: Even if you are going to implement your own encoder, you can make use of the Python standard library to investigate. Decoding the two results back to bytes gives:
>>> import base64
>>> base64.b64decode(b'oQ==')
b'\xa1'
>>> base64.b64decode(b'wqE=')
b'\xc2\xa1'
So, oQ== is correct, while wqE= has a c2 byte added in front. I can guess that it is the result of applying UTF-8 encoding before Base64. To confirm:
>>> '\u00a1'.encode('utf-8')
b'\xc2\xa1'
Q2: The rules for padding are detailed in the RFC.
Q3: This is ambiguous and you are right to be confused.
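For Q2, the padding rule can be illustrated with a small library-free encoder. This is only a sketch under the byte-oriented reading described above: the name `hex_to_base64` and the choice to reject odd-length input are assumptions for illustration. Each group of 3 bytes becomes 4 characters; a final group of 1 byte yields 2 characters plus "==", and a final group of 2 bytes yields 3 characters plus "=".

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def hex_to_base64(hex_string):
    """Encode a hex string as Base64 without the base64 module."""
    if len(hex_string) % 2:
        raise ValueError("odd-length hex string does not describe whole bytes")
    data = [int(hex_string[i:i + 2], 16) for i in range(0, len(hex_string), 2)]

    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to 3 bytes into 24 bits, zero-filling the missing bytes.
        n = 0
        for byte in chunk:
            n = (n << 8) | byte
        n <<= 8 * (3 - len(chunk))
        # Split the 24 bits into four 6-bit alphabet indices.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 1 input byte -> 2 chars + "==", 2 bytes -> 3 chars + "=".
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)


print(hex_to_base64("a1"))      # oQ==
print(hex_to_base64("c2a1"))    # wqE=
```

The second call reproduces the other converter's output, which is consistent with the guess that it UTF-8 encoded the character before applying Base64.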
Or I guess binary in general. I'm obviously quite new to coding, so I'll appreciate any help here.
I just started learning about converting numbers into binary, specifically two's complement. The course presented the following code for converting:
num = 19
if num < 0:
    isNeg = True
    num = abs(num)
else:
    isNeg = False
result = ''
if num == 0:
    result = '0'
while num > 0:
    result = str(num % 2) + result
    num = num // 2
if isNeg:
    result = '-' + result
This raised a couple of questions with me and after doing some research (mostly here on Stack Overflow), I found myself more confused than I was before. Hoping somebody can break things down a bit more for me. Here are some of those questions:
I thought it was outright wrong that the code suggested just appending a - to the front of a binary number to show its negative counterpart. It looks like bin() does the same thing, but don't you have to flip the bits and add a 1 or something? Is there a reason for this other than making it easy to comprehend/read?
Was reading here and one of the answers in particular said that Python doesn't really work in two's complement, but something else that mimics it. The disconnect here for me is that Python shows me one thing but is storing the numbers a different way. Again, is this just for ease of use? Is bin() using two's complement or Python's method?
Follow-up to that one, how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
The Professor doesn't talk at all about 8-bit, 16-bit, 64-bit, etc., which I saw a lot of while reading up on this. Where does this distinction come from, and does Python use one? Or are those designations specific to the program that I might be coding?
A lot of the posts I've read only reference how Python stores integers. Is that suggesting that it stores floats a different way, or are they just speaking broadly?
As I wrote this all up, I sort of realized that maybe I'm diving into the deep end before learning how to swim, but I'm curious like that and like to have a deeper understanding of stuff before moving on.
I thought it was outright wrong that the code suggested just appending a - to the front of a binary number to show its negative counterpart. It looks like bin() does the same thing, but don't you have to flip the bits and add a 1 or something? Is there a reason for this other than making it easy to comprehend/read?
You have to somehow designate the number being negative. You can add another symbol (-), add a sign bit at the very beginning, use ones'-complement, use two's-complement, or some other completely made-up scheme that works. Both the ones'- and two's-complement representation of a number require a fixed number of bits, which doesn't exist for Python integers:
>>> 2**1000
1071508607186267320948425049060001810561404811705533607443750
3883703510511249361224931983788156958581275946729175531468251
8714528569231404359845775746985748039345677748242309854210746
0506237114187795418215304647498358194126739876755916554394607
7062914571196477686542167660429831652624386837205668069376
The natural solution is to just prepend a minus sign. You can similarly write your own version of bin() that requires you to specify the number of bits and return the two's-complement representation of the number.
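A minimal sketch of such a function, assuming the caller supplies the width; the name `twos_complement_bin` is made up for illustration and is not a built-in:

```python
def twos_complement_bin(value, bits):
    """Return value as a fixed-width two's-complement bit string."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} bits")
    # Masking keeps the low `bits` bits; for negatives this equals 2**bits + value.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


print(bin(-19))                      # -0b10011
print(twos_complement_bin(19, 8))    # 00010011
print(twos_complement_bin(-19, 8))   # 11101101
```

The last line is what "flip the bits and add 1" produces for 19 in 8 bits.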
Was reading here and one of the answers in particular said that Python doesn't really work in two's complement, but something else that mimics it. The disconnect here for me is that Python shows me one thing but is storing the numbers a different way. Again, is this just for ease of use? Is bin() using two's complement or Python's method?
Python is a high-level language, so you don't really know (or care) how your particular Python runtime internally stores integers. Whether you use CPython, Jython, PyPy, IronPython, or something else, the language specification only defines how they should behave, not how they should be represented in memory. bin() just takes a number and prints it out using binary digits, the same way you'd convert 123 into base-2.
Follow-up to that one, how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
Sign-magnitude usually encodes a number n as 0bXYYYYYY..., where X is the sign bit and YY... are the binary digits of the non-negative magnitude. Arithmetic with numbers encoded as two's-complement is more elegant due to the representation, while sign-magnitude encoding requires special handling for operations on numbers of opposite signs.
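To make the difference concrete, here is a small sketch that encodes the same value both ways in 8 bits. The helper name `sign_magnitude_bin` and the 8-bit width are choices made for this illustration.

```python
def sign_magnitude_bin(value, bits):
    """Sign bit followed by the magnitude in bits - 1 binary digits."""
    magnitude = abs(value)
    if magnitude >= 1 << (bits - 1):
        raise ValueError(f"{value} does not fit in {bits} bits")
    sign = "1" if value < 0 else "0"
    return sign + format(magnitude, f"0{bits - 1}b")


print(sign_magnitude_bin(19, 8))     # 00010011
print(sign_magnitude_bin(-19, 8))    # 10010011
print(format(-19 & 0xFF, "08b"))     # 11101101  (two's complement)

# In two's complement, plain unsigned addition modulo 2**8 gives 19 + (-19) == 0.
print((0b00010011 + 0b11101101) & 0xFF)   # 0
# Adding the sign-magnitude patterns the same way does not give zero.
print((0b00010011 + 0b10010011) & 0xFF)   # 166
```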
The Professor doesn't talk at all about 8-bit, 16-bit, 64-bit, etc., which I saw a lot of while reading up on this. Where does this distinction come from, and does Python use one? Or are those designations specific to the program that I might be coding?
No, Python doesn't define a maximum size for its integers because it's not that low-level. 2**1000000 computes fine, as will 2**10000000 if you have enough memory. n-bit numbers arise when your hardware makes it more beneficial to make your numbers a certain size. For example, processors have instructions that quickly work with 32-bit numbers but not with 87-bit numbers.
A lot of the posts I've read only reference how Python stores integers. Is that suggesting that it stores floats a different way, or are they just speaking broadly?
It depends on what your Python runtime uses. Usually floating point numbers are like C doubles, but that's not required.
don't you have to flip the bits and add a 1 or something?
Yes, in two's complement notation you invert all the bits and add one to get the negative counterpart.
Is bin() using two's complement or Python's method?
Two's complement is a practical way to represent negative numbers in electronics, which can only store 0s and 1s. Virtually all modern microprocessors use two's complement internally for negative integers. For more information, see a textbook on computer architecture.
how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
You should look at what this code does and why it is there:
while num > 0:
    result = str(num % 2) + result
    num = num // 2
When packing bytes with python's struct.pack, I was surprised that although my byte order is little-endian, my bit order appears to be big-endian. My most significant bytes appear on the right side in the output below, but the most significant bits of each byte appear on the left. (I'm using BitArray from bitstring to display the bits.)
In[23]: BitArray(struct.pack('B', 1)).bin
Out[23]:'00000001'
In[24]: BitArray(struct.pack('H', 1)).bin
Out[24]:'0000000100000000'
In[25]: sys.byteorder
Out[25]:'little'
This surprises me because I read here that "Bit order usually follows the same endianness as the byte order for a given computer system. That is, in a big endian system the most significant bit is stored at the lowest bit address; in a little endian system, the least significant bit is stored at the lowest bit address."
Am I interpreting it correctly that my bit order is the reverse of my byte order here?
Also, I know you can change the byte order using the > and <, but I guess there is no way to change the bit order?
Edit: For context, right now I'm writing a python implementation of TCP communication with an ATI NetFT sensor based on the protocol description starting on page B - 76 here. But, this same question comes up frequently in my work implementing serial and network communications with all sorts of sensors. In this case, the protocol description says things like: set bit 2 of byte 16 to 1 to bias the sensor, and I've been finding that bit 0 in python does not correspond to the bit 0 that controls the bias -- the bit order in the byte seems to be flipped.
No, Python supplies no way to reverse the bit order - but you don't need to. The article made you overly paranoid ;-)
The endianness of byte order is normally invisible to software. If, e.g., you read a 2-byte short in C, the underlying hardware delivers a big-endian result regardless of the physical storage convention. Store 258 (0x0102) and you read 258 back, regardless of the storage's physical byte order. The only way you can tell the difference is to read (or write) part of an N-byte value in a chunk of less than N bytes. That's common enough in network protocols and portable storage formats, but rare outside those.
Similarly, the only way you could detect the endianness of physical bit order is if the machine were bit-addressable, so you could read one bit at a time directly. I don't know of any current machine that supports bit addressing, and even if there were such a beast C supports no direct bit-level access anyway. If you read a byte at a time, the hardware will deliver the bytes in big-endian bit order again regardless of the physical bit storage order.
If, e.g., you're poking a bit at a time into a bit-level serial port, then you'll need to know the convention the specific hardware requires. But in that case struct.pack() is useless anyway - the smallest unit struct.pack() manipulates is a byte, and at that level hardware bit-level ordering is invisible. For example, your struct.pack('B', 1) will unpack as 1 regardless of the bit-level endianness of the machine you run it on.
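A short sketch of the points above, using the 258 (0x0102) example: byte order shows up only when you look at the individual bytes of a multi-byte value, and within a byte, shifting and masking always treat bit 0 as the least significant bit. The explicit `<` and `>` format prefixes are used here so the output does not depend on the machine running it.

```python
import struct

value = 258  # 0x0102

little = struct.pack("<H", value)
big = struct.pack(">H", value)
print(little.hex(), big.hex())      # 0201 0102

# Round-tripping with the matching format hides the byte order entirely.
assert struct.unpack("<H", little)[0] == value
assert struct.unpack(">H", big)[0] == value

# Byte order only becomes visible when reading part of the value...
print(little[0], big[0])            # 2 1

# ...and within each byte, bit 0 is always the least significant bit.
print(little[0] & 1, (little[0] >> 1) & 1)   # 0 1
```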
Bits of Code
Since "general principles" don't seem to be enough here, and there was no specific code presented to work with, here are bits of code that may be useful.
As mentioned in a comment, if you want to reverse a byte's bit order, the simplest and fastest way is to precompute a list with 256 items, mapping a byte to its bit-reversed value:
br = [int("{:08b}".format(i)[::-1], 2) for i in range(256)]
assert sorted(br) == list(range(256))
Then, e.g.,
>>> br[0], br[1], br[2], br[254], br[255]
(0, 128, 64, 127, 255)
If you're working with bytes objects, the .translate() method can use this table (after converting it to a bytes object) to convert the whole object with one call:
reverse_table = bytes(br)
and then, e.g.,
>>> original = bytes([0, 1, 2, 3, 254, 255])
>>> print([i for i in original.translate(reverse_table)])
[0, 128, 64, 192, 127, 255]
If instead you're building bytes a bit at a time (as in "set bit 2 of byte 16 to 1"), you can build them in "reversed order" (when appropriate) from the start. To build a byte in LSB 0 order, "setting bit i" means
byte |= 1 << i
To build a byte in MSB 0 order instead, it's
byte |= 1 << (7-i)
But without knowing the precise details of the API(s) you're using, and how you like to work, it's really not possible to guess at the precise code you need.
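As one possible shape for the "set bit 2 of byte 16" case from the question, here is a hedged sketch. The `set_bit` helper, its `msb0` flag, and the 20-byte buffer are all hypothetical; which numbering the sensor's documentation actually uses still has to be checked against the device.

```python
def set_bit(byte, i, msb0=False):
    """Return byte with bit i set; msb0=True numbers bits from the left."""
    if not 0 <= i <= 7:
        raise ValueError("bit index must be in 0..7")
    shift = 7 - i if msb0 else i
    return byte | (1 << shift)


packet = bytearray(20)                            # hypothetical command buffer
packet[16] = set_bit(packet[16], 2)               # LSB 0 numbering
print(format(packet[16], "08b"))                  # 00000100
print(format(set_bit(0, 2, msb0=True), "08b"))    # 00100000
```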
As a part of a bigger project, I want to save a sequence of bits in a file so that the file is as small as possible. I'm not talking about compression, I want to save the sequence as it is but using the least amount of characters. The initial idea was to turn mini-sequences of 8 bits into chars using ASCII encoding and saving those chars, but due to some unknown problem with strange characters, the characters retrieved when reading the file are not the same that were originally written. I've tried opening the file with utf-8 encoding, latin-1 but none seems to work. I'm wondering if there's any other way, maybe by turning the sequence into a hexadecimal number?
Technically, you cannot write less than a byte, because the OS organizes files and memory in bytes (see "write individual bits to a file in python"). So this is binary file I/O; see https://docs.python.org/2/library/io.html. There are also modules such as struct.
Open the file with the 'b' flag, which indicates binary read/write, then use e.g. the to_bytes() method (see "Writing bits to a binary file") or struct.pack() (see "How to write individual bits to a text file in python?"):
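>>> import struct
>>> struct.pack("h", 824)  # native byte order; little-endian machine shown
b'8\x03'
>>> bits = "10111111111111111011110"
>>> data = int(bits[::-1], 2).to_bytes(4, 'little')
>>> data
b'\xfd\xff=\x00'
>>> with open('somefile.bin', 'wb') as f:
...     f.write(data)  # returns the number of bytes written
...
4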
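For the original goal of saving a bit sequence and reading the same sequence back, a round trip could look like the sketch below. The 4-byte length header, the MSB-first packing, and the helper names `write_bits` and `read_bits` are assumptions made for this example (the snippet above uses a reversed, little-endian layout instead). Storing the bit count is what lets the reader drop the padding bits in the last byte.

```python
import os
import struct
import tempfile


def write_bits(path, bits):
    """Store a string of '0'/'1' characters as packed bytes."""
    padded = bits + "0" * (-len(bits) % 8)          # pad to a whole byte
    payload = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(bits)))       # 4-byte bit count
        f.write(payload)


def read_bits(path):
    """Inverse of write_bits: return the original bit string."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        payload = f.read()
    return "".join(format(byte, "08b") for byte in payload)[:count]


bits = "10111111111111111011110"
path = os.path.join(tempfile.mkdtemp(), "somefile.bin")
write_bits(path, bits)
assert read_bits(path) == bits
print(os.path.getsize(path))    # 7 (4-byte header + 3 payload bytes)
```

Opening the file in binary mode avoids the text-encoding problems described in the question, since no character decoding is applied to the bytes.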
If you want to get around the 8-bit (byte) structure of memory, you can use bit manipulation and techniques like bitmasks and BitArrays;
see https://wiki.python.org/moin/BitManipulation and https://wiki.python.org/moin/BitArrays
However, the problem, as you said, is reading the data back when the values have differing bit lengths: to store decimal 7 you need 3 bits (0b111), and to store decimal 2 you need 2 bits (0b10).
How can your program know whether to read a value back as a 3-bit value or a 2-bit value? Packed without boundaries, the sequence 7, 2 looks like 11110, which stands for 111|10, so how can your program know where the | is?
In normal byte-oriented memory, the sequence 7, 2 is 0000011100000010 -> 00000111|00000010, which has the advantage that it is clear where the | is.
This is why memory at its lowest level is organized in fixed groups of 8 bits = 1 byte. If you want to access single bits inside a byte, you can use bitmasks in combination with bitwise operators (http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/). In Python, the built-in integer operators (&, |, ^, ~, <<, >>) are enough for single-bit manipulation; the ctypes module is another option.
If you know that your values are all 6 bits wide, maybe it is worth the effort; however, this is also tough...
(How do you set, clear, and toggle a single bit?)
(Why can't you do bitwise operations on pointer in C, and is there a way around this?)
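If every value really does have the same known width, the boundary problem described above goes away, because the reader can cut the stream every `width` bits. The following is a rough sketch of that idea using one big integer as the accumulator; `pack_fixed` and `unpack_fixed` are hypothetical names, and the number of values has to be stored or known separately because of the padding bits at the end.

```python
def pack_fixed(values, width):
    """Pack non-negative ints of `width` bits each into bytes, MSB first."""
    acc = 0
    for v in values:
        if not 0 <= v < 1 << width:
            raise ValueError(f"{v} does not fit in {width} bits")
        acc = (acc << width) | v
    total_bits = width * len(values)
    pad = -total_bits % 8                       # zero bits to reach a whole byte
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack_fixed(data, width, count):
    """Read `count` values of `width` bits each back out of data."""
    total_bits = len(data) * 8
    acc = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    return [(acc >> (total_bits - width * (i + 1))) & mask for i in range(count)]


packed = pack_fixed([7, 2, 63, 0], 6)
print(len(packed))                      # 3 bytes for four 6-bit values
print(unpack_fixed(packed, 6, 4))       # [7, 2, 63, 0]
```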
I'm creating some fuzz tests in python and it would be invaluable for me to be able to, given a binary string, randomly flip some bits and ensure that exceptions are correctly raised, or results are correctly displayed for slight alterations on given valid binaries. Does anyone know how I might go about this in Python? I realize this is pretty trivial in lower level languages but for work reasons I've been told to do this in Python, but I'm not sure how to start this, or get the binary representation for something in python. Any ideas on how to execute these fuzz tests in Python?
Strings are immutable, so to make changes, the first thing to do is probably to convert it into a list. At the same time, you can convert the digits into ints for greater ease in manipulation.
hexstring = "1234567890deadbeef"
values = [int(digit, 16) for digit in hexstring]
Then you can flip an individual bit in any of the hex digits.
digitindex = 2
bitindex = 3
values[digitindex] ^= 1 << bitindex
If needed, you can then convert back to hex.
result = "".join("0123456789abcdef"[val] for val in values)
One thing you could try is to convert the string into a bytearray, then perform bit manipulations on each byte. You can access each byte by index and treat it as an integer.
For example (Python 3):
>>> a = "hello world"
>>> b = bytearray(a, "ascii")
>>> b[0] = b[0] ^ 5 # bitwise XOR
>>> print(b.decode("ascii")) # convert it back to a string
mello world
You may also find this article on the Python wiki about bit manipulation to be useful. It goes over bit manipulation in Python to far greater detail, along with loads of useful tips and tricks.
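For the fuzzing use case in the question, the two ideas above can be combined into a helper that flips a chosen number of random bits in a bytes object. This is a sketch only: `flip_random_bits` is a made-up name, and passing a seeded `random.Random` is one way to make a failing case reproducible.

```python
import random


def flip_random_bits(data, n_flips, rng=None):
    """Return a copy of data with n_flips distinct, randomly chosen bits inverted."""
    rng = rng or random.Random()
    mutated = bytearray(data)
    for pos in rng.sample(range(len(mutated) * 8), n_flips):
        byte_index, bit_index = divmod(pos, 8)
        mutated[byte_index] ^= 1 << bit_index
    return bytes(mutated)


original = b"hello world"
fuzzed = flip_random_bits(original, 3, random.Random(1234))  # seeded for repeatability
# Count how many bits differ between the two byte strings.
diff = sum(bin(a ^ b).count("1") for a, b in zip(original, fuzzed))
print(diff)   # 3
```

Because the bit positions are sampled without replacement, exactly `n_flips` bits differ from the input, so a test can assert on the mutation before feeding it to the parser under test.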