How do I hash integer and string inputs using murmurhash3

I'm looking to get a hash value for string and integer inputs.
Using murmurhash3, I'm able to do it for strings but not integers:
pip install murmurhash3
import mmh3
mmh3.hash(34)
Returns the following error:
TypeError: a bytes-like object is required, not 'int'
I could convert it to bytes like this:
mmh3.hash(bytes(34))
But then I'll get an error message if the input is a string.
How do I overcome this without converting the integer to string?

How do I overcome this without converting the integer to string?
You can't. Or more precisely, you need to convert it to bytes or str in some way, but it needn't be a human-readable text form like b'34'/'34'. A common approach on Python 3 would be:
my_int = 34 # Or some other value
my_int_as_bytes = my_int.to_bytes((my_int.bit_length() + 7) // 8, 'little')
which makes a minimalist raw bytes representation of the original int (regardless of length); for 34, you'd get b'"' (because it only takes one byte to store it, so you're basically getting a bytes object with its ordinal value), but for larger ints it still works (unlike mucking about with chr), and it's always as small as possible (getting 8 bits of data per byte, rather than a titch over 3 bits per byte as you'd get converting to a text string).
If you're on Python 2 (WHY?!? It's been end-of-life for nearly a year), int.to_bytes doesn't exist, but you can fake it with moderate efficiency in various ways, e.g. (only handling non-negative values, unlike to_bytes which handles signed values with a simple flag):
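from binascii import unhexlify
hex_digits = '%x' % (my_int,)
# unhexlify needs an even number of hex digits, so pad if necessary; the result is big-endian
my_int_as_bytes = unhexlify('0' * (len(hex_digits) % 2) + hex_digits)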
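As a rough sketch of how the Python 3 approach above could be wrapped so that one function accepts both kinds of input: the helper name to_hash_input is hypothetical (it is not part of mmh3), and the choice of UTF-8 for strings and a signed little-endian layout for integers is an assumption, not something the answer prescribes.

```python
def to_hash_input(value):
    """Convert an int or str to bytes for a hash function that takes bytes.

    Hypothetical helper for illustration; not part of mmh3.
    """
    if isinstance(value, int):
        # One extra bit for the sign, rounded up to whole bytes (at least 1).
        length = (value.bit_length() + 8) // 8
        return value.to_bytes(length, 'little', signed=True)
    if isinstance(value, str):
        # Assumption: UTF-8 is an acceptable text encoding for your data.
        return value.encode('utf-8')
    return bytes(value)  # already bytes-like


print(to_hash_input(34))    # b'"'
print(to_hash_input('34'))  # b'34'
print(to_hash_input(-1))    # b'\xff'
```

The result can then be passed on, e.g. mmh3.hash(to_hash_input(x)). Note that the integer 34 and the one-character string '"' produce the same bytes and therefore the same hash; if that matters, prefix each input with a type tag byte.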

Related

How can I extract mixed binary and ascii values from a bytes string like I did in 2.x?

The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks

I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
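To make the points above concrete, here is a small sketch that uses the 12 sample bytes from the question. The interpretation of each field (which bytes form a word, and in which byte order) is assumed purely for illustration, since the real record layout is not given.

```python
# The sample bytes from the question; bytes.fromhex ignores the spaces.
data = bytes.fromhex('01 77 33 9F 41 42 43 44 00 11 11 11')

# Indexing gives an int directly, no ord() needed.
record_length = data[2]
print(record_length)                 # 51 (0x33)

# Searching works at the byte level.
print(b'ABCD' in data)               # True
print(data.find(b'ABCD'))            # 4

# Slice out the text field and decode it.
text = data[4:8].decode('ascii')
print(text)                          # ABCD

# Multi-byte words (byte order assumed here for illustration).
word16 = int.from_bytes(data[0:2], 'little')
word32 = int.from_bytes(data[8:12], 'big')
print(word16)                        # 30465 (0x7701)
print(word32)                        # 1118481 (0x00111111)
```

For fixed-layout records with several fields, struct.unpack_from(fmt, data, offset) can decode multiple values in one call.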

casting byte array to signed short in python

I want to convert a bytearray type or a list of binary strings in python to a signed short list. In fact, I am getting a byte stream from Ethernet and I want to convert them in signed short; however, the only way I found in Python is using struct.unpack which seems to be slow since it requires a format string to determine the type of each byte.
This format requirement slows things down in two ways:
1) Required to make a long string for a long array of bytes
2) Required to search one-by-one in the array.
In C++, the following simple code does the job on the entire memory block contained by InBuf:
short OutBuf[len];
char InBuf[len * 2];
memcpy(OutBuf, InBuf, len * 2);
This skips doing the format search within the byte array as well as the format string construction. Does anyone know a better way to do so in Python?

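If you're using Python 3.2 or later, you could use int.from_bytes, which converts one bytes-like value b (for example, a 2-byte slice of the buffer) to a single integer: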
int.from_bytes(b, byteorder='little', signed=True)
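For converting a whole buffer at once, a sketch along the following lines may be closer to the memcpy in the question. It assumes the incoming data is little-endian and an even number of bytes long; the buffer contents are made up for the example.

```python
import array
import struct
import sys

# Example buffer: three little-endian signed 16-bit values (1, -1, -32768).
buf = bytes([0x01, 0x00, 0xFF, 0xFF, 0x00, 0x80])
count = len(buf) // 2

# Option 1: a single struct call; the format string is just a count plus 'h'.
shorts_struct = list(struct.unpack('<%dh' % count, buf))

# Option 2: array.array copies the block much like memcpy, in native byte order.
shorts_array = array.array('h')
shorts_array.frombytes(buf)
if sys.byteorder != 'little':
    shorts_array.byteswap()  # the example data is little-endian

print(shorts_struct)          # [1, -1, -32768]
print(shorts_array.tolist())  # [1, -1, -32768]
```

The struct format needs only a repeat count, not one character per value, so building it is cheap even for long buffers.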

Proper way for converting to bigendian for network submission

I need to get an int through the network. Is this the proper way to convert to bytes in big-endian?
pack("I",socket.htonl(integer_value))
I unpack it as:
socket.ntohl(unpack("I",data)[0])
I noticed that pack-unpack also have the <> to use for endian conversion so I am not sure if I could just directly use that instead or if htonl is safer.

You should use only the struct module for communicating with another system. By using the htonl first, you'll end up with an indeterminate order being transmitted.
Since you need to convert the integer into a string of bytes in order to send it to another system, you'll need to use struct.pack (because htonl just returns a different integer than the one passed as argument and you cannot directly send an integer). And in using struct.pack you must choose an endianness for that string of bytes (if you don't specify one, you'll get a default ordering which may not be the same on the receiving side so you really need to choose one).
Converting an integer to a sequence of bytes in a definite order is exactly what struct.pack("!I", integer_value) does and a sequence of bytes in a definite order is exactly what you need on the receiving end.
On the other hand, if you use struct.pack("!I", socket.htonl(integer_value)), what does that do? Well, first it puts the integer into big-endian order (network byte order), then it takes your already big-endian integer and converts it to bytes in "big-endian order". But, on a little endian machine, that will actually reverse the ordering again, and you will end up transmitting the integer in little-endian byte order if you do both those two operations.
But on a big-endian machine htonl is a no-op, and then you're converting the result into bytes in big-endian order.
So using htonl actually defeats the purpose and a receiving machine would have to know the byte-order used on the sending machine in order to properly decode it. Observe...
Little-endian box:
>>> print(socket.htonl(27))
452984832
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x1b\x00\x00\x00'
Big-endian box:
>>> print(socket.htonl(27))
27
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x00\x00\x00\x1b'
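A minimal round-trip sketch of the struct-only approach follows. The function names encode_u32 and decode_u32 are hypothetical, chosen for this example only.

```python
import struct


def encode_u32(value):
    """Pack an unsigned 32-bit integer in network (big-endian) byte order."""
    return struct.pack('!I', value)


def decode_u32(data):
    """Unpack 4 bytes in network byte order into an unsigned integer."""
    return struct.unpack('!I', data)[0]


wire = encode_u32(27)
print(wire)              # b'\x00\x00\x00\x1b'
print(decode_u32(wire))  # 27
```

Both functions give the same result on little-endian and big-endian hosts, because the '!' prefix fixes the byte order instead of relying on the machine's native one.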

struct.unpack() uses '!' in the format specifiers for network byte order, but it's the same as '>'...

Python struct.unpack errors with TypeError: a bytes-like object is required, not 'str'

Can someone please help with the following line of code and Error? I am unfamiliar with python value conversions.
The specific line that generates the error is:
value = struct.unpack("<h",chr(b)+chr(a))[0]
TypeError: a bytes-like object is required, not 'str'
The code fragment is:
if packet_code == 0x80:  # raw value
    row_length = yield
    a = yield
    b = yield
    value = struct.unpack("<h",chr(b)+chr(a))[0]
The input data is:
b'\x04\x80\x02\x00\xb2\xcb\xaa\xaa\x04\x80\x02\x00p\r\xaa\xaa\x04\x80\x02\x00]
\xaa\xaa\x04\x80\x02\x00#=\xaa\xaa\x04\x80\x02\x007F\xaa\xaa\x04\x80\x02\x00\!\xaa\xaa\x04\x80\x02\x00=#\xaa\xaa\x04\x80\x02\x00=#\xaa\xaa\x04\x80\x02\x00i\x14\xaa\xaa\x04\x80\x02\x00]
\xaa\xaa\x04\x80\x02\x00p\r\xaa\xaa\x04\x80\x02\x00\x80\xfd\xaa\xaa
I am using python 3.5. This code seems to work in the older versions.
Here is the link to similar parser code where it may have worked with previous versions of Python:
Parser Code Link
Here is the link to the description of how the data is sent from the device
RAW Wave Value (16-bit)
This Data Value consists of two bytes, and represents a single raw wave sample. Its value is a signed 16-bit integer that ranges from -32768 to 32767. The first byte of the Value represents the high-order bits of the two's-complement value, while the second byte represents the low-order bits. To reconstruct the full raw wave value, simply shift the first byte left by 8 bits, and bitwise-or with the second byte:
short raw = (Value[0]<<8) | Value[1];
where Value[0] is the high-order byte, and Value[1] is the low-order byte.
In systems or languages where bit operations are inconvenient, the following arithmetic operations may be substituted instead:
raw = Value[0]*256 + Value[1];
if( raw >= 32768 ) raw = raw - 65536;
Really appreciate any help as I am currently stuck.

When you are using Python 2.x str is a byte array. For Python 3, you must use bytes like this:
struct.unpack("<h", bytes([b, a]))[0]
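As a sketch of why this works, the following compares the struct call with the arithmetic given in the device description. The function names are hypothetical, and the byte values are arbitrary examples; high is the first (high-order) byte and low the second.

```python
import struct


def raw_value_struct(high, low):
    # bytes([low, high]) read as little-endian puts 'high' in the top 8 bits,
    # which is the same as bytes([b, a]) with a = high and b = low.
    return struct.unpack('<h', bytes([low, high]))[0]


def raw_value_arith(high, low):
    # The arithmetic form from the device description.
    raw = high * 256 + low
    if raw >= 32768:
        raw -= 65536
    return raw


print(raw_value_struct(0x00, 0xB2))  # 178
print(raw_value_struct(0xFF, 0xFE))  # -2
print(raw_value_arith(0xFF, 0xFE))   # -2
```

int.from_bytes(bytes([high, low]), 'big', signed=True) gives the same result without importing struct.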

If you use Python 3, you can use lines like the following on the received data to convert it to a short. Note that '<h' consumes exactly two bytes, so the argument must be exactly two bytes long:
struct.unpack('<h', data)
struct.unpack('<h', data[0:2])
struct.unpack('<h', b''.join(…))
If the data arrives as a list of integers, convert it to bytes first:
struct.unpack('<h', bytes(data))
Remember that you must pass bytes, not str, to unpack in order to decode the data into the type you require.

Why is struct not accepting a string as an input

I have an issue with struct not packing a string.
I currently create a random 20-byte string, and I try to pack it into 20 octets using struct with the code below:
payload = struct.pack("H" * 20, *rendezvous_cookie)
rendezvous_cookie is calculated by os.urandom(20).
I get the error struct.error: cannot convert argument to integer
Is there a quick, easy way of encoding the string so it can be packed this way?
Thanks
Edit: I managed to fix it by doing:
payload = struct.pack('!20s', rendezvous_cookie)
This way it takes the input as a string fine, and it is still 20 octets.

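os.urandom(n) returns n random bytes: a str in Python 2, a bytes object in Python 3.
If you want to make a list of integers out of it in Python 2, use:
[ord(b) for b in os.urandom(n)]
In Python 3, iterating over bytes already yields integers, so list(os.urandom(n)) is enough.
You can feed that list as arguments to struct.pack.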
Note, however, that os.urandom(n) already returns a serialized list of bytes. You may be able to use that directly. Using struct.pack("H", ...) makes each number occupy two bytes (one of which will hold no data).
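A short Python 3 sketch of the size difference described above. The variable names are illustrative, and the cookie is random, so only lengths and equalities are shown.

```python
import os
import struct

cookie = os.urandom(20)  # 20 random bytes

# '20s' stores the bytes unchanged, so packing is effectively a no-op here.
payload = struct.pack('!20s', cookie)
print(payload == cookie)   # True

# 'B' packs one unsigned byte per integer: also 20 bytes in total.
as_bytes = struct.pack('20B', *cookie)
print(as_bytes == cookie)  # True

# 'H' packs each value into two bytes, doubling the size to 40 bytes.
as_shorts = struct.pack('!20H', *cookie)
print(len(as_shorts))      # 40
```

In Python 3, unpacking a bytes object with * yields integers, which is why the 'B' and 'H' formats accept *cookie without any ord() calls.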
