Proper way for converting to bigendian for network submission


I need to get an int through the network. Is this the proper way to convert to bytes in big-endian?
pack("I",socket.htonl(integer_value))
I unpack it as:
socket.ntohl(unpack("I",data)[0])
I noticed that pack and unpack also accept < and > for endian conversion, so I am not sure whether I could use those directly or whether htonl is safer.

You should use only the struct module for communicating with another system. By using the htonl first, you'll end up with an indeterminate order being transmitted.
Since you need to convert the integer into a string of bytes in order to send it to another system, you'll need to use struct.pack (because htonl just returns a different integer than the one passed as argument and you cannot directly send an integer). And in using struct.pack you must choose an endianness for that string of bytes (if you don't specify one, you'll get a default ordering which may not be the same on the receiving side so you really need to choose one).
Converting an integer to a sequence of bytes in a definite order is exactly what struct.pack("!I", integer_value) does and a sequence of bytes in a definite order is exactly what you need on the receiving end.
On the other hand, if you use struct.pack("!I", socket.htonl(integer_value)), what does that do? First, htonl puts the integer into big-endian order (network byte order); then struct.pack takes that already byte-swapped integer and converts it to bytes in "big-endian order". On a little-endian machine, that reverses the ordering again, so doing both operations transmits the integer in little-endian byte order.
But on a big-endian machine htonl is a no-op, and then you're converting the result into bytes in big-endian order.
So using htonl actually defeats the purpose: a receiving machine would have to know the byte order used on the sending machine in order to decode the value properly. Observe...
Little-endian box:
>>> print(socket.htonl(27))
452984832
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x1b\x00\x00\x00'
Big-endian box:
>>> print(socket.htonl(27))
27
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x00\x00\x00\x1b'
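
The round trip recommended in this answer can be sketched as follows. The helper names encode_u32 and decode_u32 are hypothetical and only for illustration; the format string "!I" is the one used above.

```python
import struct


def encode_u32(value: int) -> bytes:
    # "!I" = network (big-endian) byte order, unsigned 32-bit integer.
    return struct.pack("!I", value)


def decode_u32(data: bytes) -> int:
    # struct.unpack always returns a tuple, even for a single field.
    return struct.unpack("!I", data)[0]


wire = encode_u32(27)
print(wire)              # b'\x00\x00\x00\x1b' regardless of host byte order
print(decode_u32(wire))  # 27
```

No call to htonl or ntohl is needed on either side, because the byte order is fixed by the format string.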

struct.unpack() uses '!' in the format specifier for network byte order, but it's the same as '>'.
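
A quick check of that equivalence for a 32-bit unsigned value (the value itself is arbitrary):

```python
import struct

value = 0xDEADBEEF
network = struct.pack("!I", value)  # network byte order
big = struct.pack(">I", value)      # explicit big-endian

print(network == big)  # True
print(network.hex())   # deadbeef
```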

Related

Which are the advantages of byte objects over string objects in Python?

I understand the differences between bytes/bytearray and str in Python and how to handle, manipulate, and convert these objects, but I cannot find real-life scenarios or examples where you would prefer to work with bytes instead of strings in the code.
Which are the advantages of byte objects over string objects in Python?
and in which real life scenarios should you convert in your code strings into bytes and why?

For all modern computer architectures, a byte consists of 8 bits and thus can encode 256 distinct values.
In the ASCII character encoding, there are only 128 different values, with only a subset of those being printable. With UTF-8 it gets a little more complicated, but you end up with a similar problem: not all byte sequences are representable as a string. So anytime you have a sequence of bytes that is not representable as a string, you have to use bytes or bytearray.
One example of when you might need to use bytes is when working with crypto and pseudo-random sequence generation, where you will often end up with a sequence of bytes that cannot be represented 1-to-1 as a string. This is because you want to work with as large an output space as possible when generating pseudo-random numbers and sequences. See for example secrets.token_bytes from the stdlib.
If you want to represent such a sequence as a string, it's possible to encode it into a sequence of bytes that are all inside the ASCII encoding space, but of course, at the cost of using more bytes. For example, you can encode it as hex characters or in base64. Hex has the advantage that the size of the resulting string is always 2 * n_bytes, while base64 is more compact, using about 4 characters for every 3 bytes (it is a common choice, though not strictly the most compact ASCII encoding). Note that the secrets stdlib module also gives you convenience functions that do this conversion for you.
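
A small sketch of the two encodings mentioned here, using 16 random bytes (the length is an arbitrary choice for illustration):

```python
import base64
import secrets

raw = secrets.token_bytes(16)  # 16 random bytes, not necessarily valid text
as_hex = raw.hex()             # always 2 characters per byte
as_b64 = base64.b64encode(raw).decode("ascii")

print(len(as_hex))  # 32
print(len(as_b64))  # 24, including '=' padding

# Both encodings are reversible.
assert bytes.fromhex(as_hex) == raw
assert base64.b64decode(as_b64) == raw
```

secrets.token_hex and secrets.token_urlsafe are the convenience functions in the secrets module that return text directly.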

in which real life scenarios should you convert in your code strings into bytes and why?
One example is using a compression algorithm that works on bytes rather than str. Take a look at the examples for the built-in lzma module and note that it works with bytes rather than str. With a lot of text, this allows more efficient usage of available memory (i.e. storing the same text in less space).
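
A minimal sketch of that workflow, assuming UTF-8 text and a deliberately repetitive sample string:

```python
import lzma

text = "the quick brown fox jumps over the lazy dog\n" * 1000
raw = text.encode("utf-8")   # lzma.compress takes bytes, not str
packed = lzma.compress(raw)

# For repetitive text the compressed size is much smaller than the input.
print(len(raw), len(packed))

restored = lzma.decompress(packed).decode("utf-8")
assert restored == text
```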

How can I extract mixed binary and ascii values from a bytes string like I did in 2.x?

The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks

I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
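
Applied to the sample bytes from the question, indexing, slicing, decoding, and searching might look like this. Treating byte 2 as a length field follows the question's own hypothetical.

```python
data = bytes.fromhex("01 77 33 9F 41 42 43 44 00 11 11 11")

first = data[0]               # indexing gives an int: 1
length = data[2]              # 0x33 == 51
chunk = data[4:8]             # slicing gives bytes: b'ABCD'
text = chunk.decode("ascii")  # 'ABCD' as a str

print(b"ABCD" in data)        # True: subsequence search at the byte level
print(0x9F in data)           # True: search for a single byte value
print(data.find(b"ABCD"))     # 4: offset of the first match
```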
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
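
For the same sample bytes, a sketch of reading multi-byte words. The byte order of the real file is unknown, so both are shown, and the "<HH4s" layout is purely an assumption for illustration. struct.unpack_from is an alternative when several fields sit next to each other.

```python
import struct

data = bytes.fromhex("01 77 33 9F 41 42 43 44 00 11 11 11")

# 2-byte field at offset 0, interpreted both ways.
print(int.from_bytes(data[0:2], "big"))     # 375   (0x0177)
print(int.from_bytes(data[0:2], "little"))  # 30465 (0x7701)

# "<HH4s" = little-endian, two unsigned 16-bit words, then 4 raw bytes.
word0, word1, tag = struct.unpack_from("<HH4s", data, 0)
print(word0, word1, tag)                    # 30465 40755 b'ABCD'
```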

Python Version for Ruby's array.pack() and unpack()?

In Ruby, I could easily pack an array representing some sequence into a binary string:
# for int
# the "S!*" directive means 16-bit ints in native endianness
# each value is represented by two bytes: "\x01\x00" and "\x02\x00"
# here the native endianness is little-endian, so you should
# read each pair backwards: "\x01\x00" becomes 0001, and "\x02\x00" becomes 0002
"\x01\x00\x02\x00".unpack("S!*")
# [1, 2]
# for hex
# "H*" means every element in the array is a digit for the hexstream
["037fea0651b358c361de"].pack("H*")
# "\x03\x7F\xEA\x06Q\xB3X\xC3a\xDE"
API doc for pack and unpack.
I couldn't find a uniform and equivalent way of transforming a sequence to bytes (or vice versa) in Python.
While struct provides methods for packing into bytes objects, the available format strings have no option for a hexstream.
EDIT: What I really want is something as versatile as Ruby's arr.pack and str.unpack, which supports multiple formats and endianness control.

for a string in the utf-8 range it would be:
from binascii import unhexlify
strg = "464F4F"
unhexlify(strg).decode() # FOO (str)
if your content is just binary
strg = "037fea0651b358c361de"
unhexlify(strg) # b'\x03\x7f\xea\x06Q\xb3X\xc3a\xde' (bytes)
also bytes.fromhex (as in Davis Herring's answer) may be worth checking out.
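
A short sketch of the bytes.fromhex route with the same hex string, plus the reverse direction via bytes.hex:

```python
data = bytes.fromhex("037fea0651b358c361de")
print(data)        # b'\x03\x7f\xea\x06Q\xb3X\xc3a\xde'
print(data.hex())  # 037fea0651b358c361de
```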

struct does only fixed-width encodings that correspond to a memory dump of something like a C struct. You want bytes.fromhex or binascii.unhexlify, depending on the source type (which is never a list).
After any such conversion, you can use struct.unpack on a byte string whose length matches the format string exactly; each field is decoded into an element of the returned tuple. For a byte string containing any number of “records” of the same format, struct.iter_unpack yields one tuple per record. The format string supports the usual integer sizes and endianness choices; it is of course possible to construct a format dynamically to do things like read a matrix whose dimensions are chosen at runtime:
mat = list(struct.iter_unpack("%dd" % cols, buf))  # one tuple per row; rows determined from len(buf)
It’s also possible to construct a lower-memory array if the element type is primitive; then you can follow up with byteswap as Alec A mentioned. NumPy offers similar facilities.
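
As a rough Python counterpart to the Ruby "S!*" example from the question, here is a sketch using both struct and array. It assumes the data is little-endian and that the array type code "H" is 2 bytes wide, which is typical but platform-dependent.

```python
import struct
import sys
from array import array

raw = b"\x01\x00\x02\x00"

# struct: build the format dynamically; "<" = little-endian, "H" = unsigned 16-bit.
count = len(raw) // 2
print(struct.unpack("<%dH" % count, raw))  # (1, 2)

# array.array always uses native byte order, so swap only if the host differs.
values = array("H", raw)
if sys.byteorder != "little":
    values.byteswap()
print(values.tolist())                     # [1, 2]
```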

Try memoryview.cast, which lets you reinterpret a bytes-like object as an array of a different item type. Note that it uses native byte order and does not change endianness by itself.
Storing values in an array.array makes things easier, as you can use its byteswap method.
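
A minimal sketch of memoryview.cast on the four bytes from the question. As far as I know, the cast interprets items in the host's native byte order, so the printed values depend on the machine.

```python
import sys

raw = b"\x01\x00\x02\x00"

# Reinterpret the 4 bytes as unsigned shorts without copying.
view = memoryview(raw).cast("H")
print(view.tolist())  # [1, 2] on a little-endian host, [256, 512] on a big-endian one
```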

casting byte array to signed short in python

I want to convert a bytearray type or a list of binary strings in python to a signed short list. In fact, I am getting a byte stream from Ethernet and I want to convert them in signed short; however, the only way I found in Python is using struct.unpack which seems to be slow since it requires a format string to determine the type of each byte.
This format requirement slows things down in two ways:
1) Required to make a long string for a long array of bytes
2) Required to search one-by-one in the array.
In C++, the following simple code does the job on the entire memory block contained by InBuf:
short int OutBuf[len];
char InBuf[len * 2];
memcpy(OutBuf, InBuf, len * 2);
This skips doing the format search within the byte array as well as the format string construction. Does anyone know a better way to do so in Python?

If you're using Python 3.2 or later, you could use int.from_bytes, which converts one value at a time:
int.from_bytes(b, byteorder='little', signed=True)
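
To convert a whole buffer rather than a single value, here are three possible approaches, sketched under the assumption that the stream holds little-endian 16-bit signed values. The sample buffer is made up for illustration.

```python
import struct
import sys
from array import array

buf = bytearray(b"\xff\xff\x01\x00\x00\x80")  # three little-endian signed shorts

# 1) One value at a time with int.from_bytes.
one_by_one = [int.from_bytes(buf[i:i + 2], "little", signed=True)
              for i in range(0, len(buf), 2)]

# 2) Whole buffer in one call; the format is just a repeat count plus "h".
all_at_once = list(struct.unpack("<%dh" % (len(buf) // 2), buf))

# 3) Bulk copy, closest in spirit to memcpy; array uses native byte order.
shorts = array("h")
shorts.frombytes(bytes(buf))
if sys.byteorder != "little":
    shorts.byteswap()

print(one_by_one)  # [-1, 1, -32768]
```

The struct format here is a short string such as "<3h", so it does not grow with the length of the buffer.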

Python 2: Why is this bytestring order switched in struct.pack() and struct.unpack() methods?

In Python 2.7.5, I have a hex value 0xbba1, and I want to convert it to bytestring format.
>>> bytetoint = lambda bytestr: struct.unpack('H', bytestr)[0]
>>> hextobyte = lambda hexnum: struct.pack('H', hexnum)
>>> hextobyte(0xbba1)
'\xa1\xbb'
>>> hex(bytetoint('\xa1\xbb'))
'0xbba1'
Why are the first byte '\xa1' and the second byte '\xbb' switched in place?
How can I get the right bytestring from hex, or vice versa?
e.g. 0xbba1 -> '\xbb\xa1'
'\xbb\xa1' -> 0xbba1

It's a little-endian/big-endian thing. You can't really say the bytes are switched, because nothing in the int definition says what order the bytes representing it are laid out in.
The result you have is a perfectly usable little-endian representation. If you want to force big-endian, which may look better to a human reader, you can specify the byte order with >:
>>> import struct
>>> struct.pack('>H', 0xbba1)
'\xbb\xa1'
>>> hex(struct.unpack('>H', '\xbb\xa1')[0])
'0xbba1'
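
For comparison, a Python 3 sketch of the same conversions (the thread itself is about Python 2.7, where the results are str rather than bytes). It shows both byte orders explicitly, plus the int.to_bytes and int.from_bytes alternatives.

```python
import struct

value = 0xBBA1
print(struct.pack("<H", value))  # b'\xa1\xbb'  little-endian
print(struct.pack(">H", value))  # b'\xbb\xa1'  big-endian

# Equivalent conversions without struct.
print(value.to_bytes(2, "big"))                 # b'\xbb\xa1'
print(hex(int.from_bytes(b"\xbb\xa1", "big")))  # 0xbba1
```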

First read about endianness so that you understand where this problem is coming from. On a typical x86-based computer with a little-endian CPU, the correct in-memory representation of int(0xbba1) is the two bytes a1 bb, in that order.
If you really want to decode a byte string from the opposite big-endian order, see this section of the struct docs:
bytestring = '\xbb\xa1'
hex(struct.unpack('>H', bytestring)[0])
