casting byte array to signed short in python - python

I want to convert a bytearray type or a list of binary strings in python to a signed short list. In fact, I am getting a byte stream from Ethernet and I want to convert them in signed short; however, the only way I found in Python is using struct.unpack which seems to be slow since it requires a format string to determine the type of each byte.
This format requirement slows in two steps:
1) Required to make a long string for a long array of bytes
2) Required to search one-by-one in the array.
In C++, the following simple code does the job on the entire memory block contained by InBuf:
OutBuf = short int[len]
InBuf = char[len*2]
memcpy(&OutBuf, &InBuf, len*2)
This skips doing the format search within the byte array as well as the format string construction. Does anyone know a better way to do so in Python?

If you're using Python > 3.2 you could use int.from_bytes:
int.from_bytes(b, byteorder='little', signed=True)

Related

How do I hash integers and strings inputs using murmurhash3

I'm looking to get a hash value for string and integer inputs.
Using murmurhash3, I'm able to do it for strings but not integers:
pip install murmurhash3
import mmh3
mmh3.hash(34)
Returns the following error:
TypeError: a bytes-like object is required, not 'int'
I could convert it to bytes like this:
mmh3.hash(bytes(34))
But then I'll get an error message if the input is string
How do I overcome this without converting the integer to string?
How do I overcome this without converting the integer to string?
You can't. Or more precisely, you need to convert it to bytes or str in some way, but it needn't be a human-readable text form like b'34'/'34'. A common approach on Python 3 would be:
my_int = 34 # Or some other value
my_int_as_bytes = my_int.to_bytes((my_int.bit_length() + 7) // 8, 'little')
which makes a minimalist raw bytes representation of the original int (regardless of length); for 34, you'd get b'"' (because it only takes one byte to store it, so you're basically getting a bytes object with its ordinal value), but for larger ints it still works (unlike mucking about with chr), and it's always as small as possible (getting 8 bits of data per byte, rather than a titch over 3 bits per byte as you'd get converting to a text string).
If you're on Python 2 (WHY?!? It's been end-of-life for nearly a year), int.to_bytes doesn't exist, but you can fake it with moderate efficiency in various ways, e.g. (only handling non-negative values, unlike to_bytes which handles signed values with a simple flag):
from binascii import unhexlify
my_int_as_bytes = unhexlify('%x' % (my_int,))

How can I extract mixed binary and ascii values from a bytes string like I did in 2.x?

The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.

Switching endianness in the middle of a struct.unpack format string

I have a bunch of binary data (the contents of a video game save-file, as it happens) where a part of the data contains both little-endian and big-endian integer values. Naively, without reading much of the docs, I tried to unpack it this way...
struct.unpack(
'3sB<H<H<H<H4s<I<I32s>IbBbBbBbB12s20sBB4s',
string_data
)
...and of course I got this cryptic error message:
struct.error: bad char in struct format
The problem is that struct.unpack format strings do not expect individual fields to be marked with endianness. The actually correct format-string here would be something like
struct.unpack(
'<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s',
string_data
)
except that this will flip the endianness of the third I field (parsing it as little-endian, when I really want to parse it as big-endian).
Is there an easy and/or "Pythonic" solution to my problem? I have already thought of three possible solutions, but none of them is particularly elegant. In the absence of better ideas I'll probably go with number 3:
I could extract a substring and parse it separately:
(my.f1, my.f2, ...) = struct.unpack('<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s', string_data)
my.f11 = struct.unpack('>I', string_data[56:60])
I could flip the bits in the field after the fact:
(my.f1, my.f2, ...) = struct.unpack('<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s', string_data)
my.f11 = swap32(my.f11)
I could just change my downstream code to expect this field to be represented differently — it's actually a bitmask, not an arithmetic integer, so it wouldn't be too hard to flip around all the bitmasks I'm using with it; but the big-endian versions of these bitmasks are more mnemonically relevant than the little-endian versions.
A little late to the party, but I just had the same problem. I solved it with a custom numpy dtype, which allows to mix elements with different endianess (see https://numpy.org/doc/stable/reference/generated/numpy.dtype.html):
t=np.dtype('>u4,<u4') # Compound type with two 4-byte unsigned int with different byte order
a=np.zeros(shape=1, dtype=t) # Create an array of length one with above type
a[0][0]=1 # Assign first uint
a[0][1]=1 # Assign second uint
bytes=a.tobytes() # bytes should be b'\x01\x00\x00\x00\x00\x00\x00\x01'
b=np.frombuffer(buf, dtype=t) # should yield array[(1,1)]
c=np.frombuffer(buf, dtype=np.uint32) # yields array([ 1, 16777216]

Proper way for converting to bigendian for network submission

I need to get an int through the network. Is this the proper way to convert to bytes in big-endian?
pack("I",socket.htonl(integer_value))
I unpack it as:
socket.ntohl(unpack("I",data)[0])
I noticed that pack-unpack also have the <> to use for endian conversion so I am not sure if I could just directly use that instead or if htonl is safer.
You should use only the struct module for communicating with another system. By using the htonl first, you'll end up with an indeterminate order being transmitted.
Since you need to convert the integer into a string of bytes in order to send it to another system, you'll need to use struct.pack (because htonl just returns a different integer than the one passed as argument and you cannot directly send an integer). And in using struct.pack you must choose an endianness for that string of bytes (if you don't specify one, you'll get a default ordering which may not be the same on the receiving side so you really need to choose one).
Converting an integer to a sequence of bytes in a definite order is exactly what struct.pack("!I", integer_value) does and a sequence of bytes in a definite order is exactly what you need on the receiving end.
On the other hand, if you use struct.pack("!I", socket.htonl(integer_value)), what does that do? Well, first it puts the integer into big-endian order (network byte order), then it takes your already big-endian integer and converts it to bytes in "big-endian order". But, on a little endian machine, that will actually reverse the ordering again, and you will end up transmitting the integer in little-endian byte order if you do both those two operations.
But on a big-endian machine htonl is a no-op, and then you're converting the result into bytes in big-endian order.
So using ntohl actually defeats the purpose and a receiving machine would have to know the byte-order used on the sending machine in order to properly decode it. Observe...
Little-endian box:
>>> print(socket.htonl(27))
452984832
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x1b\x00\x00\x00'
Big-endian box:
>>> print(socket.htonl(27))
27
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x00\x00\x00\x1b'
struct.unpack() uses '!' in the format specifiers for network byte order. But its the same as '>'...

python proper way to decode raw udp packet

I'm making a script to get Valve's server information (players online, map, etc)
the packet I get when I request for information is this:
'\xff\xff\xff\xffI\x11Stargate Central CAP SBEP\x00sb_wuwgalaxy_fix\x00garrysmod\x00Spacebuild\x00\xa0\x0f\n\x0c\x00dw\x00\x0114.09.08\x00\xb1\x87i\x06\xb4g\x17.\x15#\x01gm:spacebuild3\x00\xa0\x0f\x00\x00\x00\x00\x00\x00'
This may help you to see what I'm trying to do https://developer.valvesoftware.com/wiki/Server_queries#A2S_INFO
The problem is, I don't know how to decode this properly, it's easy to get the string but I have no idea how to get other types like byte and short
for example '\xa0\x0f'
For now I'm doing multiple split but do you know if there is any better way of doing this?
Python has functions for encoding/decoding different data types into bytes. Take a look at the struct package, the functions struct.pack() and struct.unpack() are your friends there.
taken from https://docs.python.org/2/library/struct.html
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
The first argument of the unpack function defines the format of the data stored in the second argument. Now you need to translate the description given by valve to a format string. If you wanted to unpack 2 bytes and a short from a data string (that would have a length of 4 bytes, of course), you could do something like this:
(first_byte, second_byte, the_short) = unpack("cc!h", data)
You'll have to take care yourself, to get the correct part of the data string (and I don't know if those numbers are signed or not, be sure to take care of that, too).
The strings you'll have to do differently (they are null-terminated here, so start were you know a string starts and read to the first "\0" byte).
pack() work's the other way around and stores data in a byte string. Take a look at the examples on the python doc and play around with it a bit to get a feel for it (when a tuple is returned/needed, e.g.).
struct supports you in getting the right byte order, which most of the time is network byte order and different from your system. That is of course only necessary for multi byte integers (like short) - so a format string of `"!h" should unpack a short correctly.

Categories